### ⚠ IMPORTANT ⚠

You will need at least 22GB of VRAM (GPU RAM) to run this notebook.

If you're running this locally - please ensure you have the correct hardware to support the fine-tuning.

Please make sure you're using the following instance:

![image](https://i.imgur.com/ji210Ug.png)

# Fine-tuning Embedding Models

In this notebook, we'll explore one of the most powerful techniques for taking your single-domain RAG pipelines to the next level:

Fine-tuning Embedding Models!

- 🤝 Breakout Room #2
  - Task 1: Dependencies and Boilerplate
  - Task 2: Loading Data
  - Task 3: Constructing a Fine-tuning Dataset
  - Task 4: Fine-tuning `snowflake-arctic-embed-l`
  - Task 5: Evaluating Retrieval with Embedding Model

But before any of that, we need to grab some dependencies, and set up some boilerplate!

## Task 1: Dependencies and Boilerplate

We'll set up our `nest_asyncio` so we can leverage async loops in our Notebook.

We'll also install the required libraries we'll be using today, and set up our OpenAI API key, and Hugging Face token!

### Nest Asyncio

In [1]:
import nest_asyncio

nest_asyncio.apply()

### Install Dependencies

In [12]:
pip install -qU llama-index-llms-openai llama-index-embeddings-openai llama-index-finetuning

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
google-auth 2.20.0 requires urllib3<2.0, but you have urllib3 2.2.1 which is incompatible.
azureml-core 1.51.0.post1 requires urllib3<2.0.0,>=1.23, but you have urllib3 2.2.1 which is incompatible.
Note: you may need to restart the kernel to use updated packages.


In [13]:
pip install -qU llama-index-readers-file llama-index-embeddings-huggingface

Note: you may need to restart the kernel to use updated packages.


In [14]:
pip install -qU "sentence_transformers==2.7.0"

Note: you may need to restart the kernel to use updated packages.


### API Key Section!

In classic fashion, we'll need to provide our OpenAI API key!

We'll also provide our Hugging Face token (with `Write` access) in order to save our model on the Hub!

In [5]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter Your OpenAI API Key: ")

In [6]:
from huggingface_hub import notebook_login

notebook_login()


## Task 2: Loading Data

The data can be found in [this GitHub repo](https://github.com/AI-Maker-Space/DataRepository/tree/main/high-performance-rag).

In this case, the data is related to research articles about Camelids (aka: Llamas, Alpacas, Camels!)

In [7]:
!git clone https://github.com/AI-Maker-Space/DataRepository.git

Cloning into 'DataRepository'...
remote: Enumerating objects: 71, done.
remote: Counting objects: 100% (63/63), done.
remote: Compressing objects: 100% (50/50), done.
remote: Total 71 (delta 19), reused 28 (delta 8), pack-reused 8
Receiving objects: 100% (71/71), 69.00 MiB | 43.89 MiB/s, done.
Resolving deltas: 100% (19/19), done.
Updating files: 100% (37/37), done.


In [8]:
%cd ./DataRepository/high-performance-rag/

/mnt/batch/tasks/shared/LS_root/mounts/clusters/expcompute/code/Users/Nithin.Kamavaram/exp1/DataRepository/high-performance-rag


In [9]:
!unzip "Camel Papers Test.zip"

Archive:  Camel Papers Test.zip
  inflating: Camel Papers Test/Acute respiratory distress syndrome in an alpaca cria.pdf  
  inflating: Camel Papers Test/Alpaca liveweight variations and fiber production in Mediterranean range of Chile.pdf  


In [10]:
!unzip "Camel Papers Train.zip"

Archive:  Camel Papers Train.zip
  inflating: Camel Papers Train/Antibody response to the epsilon toxin ofClostridium perfringensfollowing vaccination of Lama glamacrias.pdf  
  inflating: Camel Papers Train/Comparative pigmentation of sheep, goats, and llamas what colors are possible through selection.pdf  
  inflating: Camel Papers Train/Conservative management of a ruptured.pdf  
  inflating: Camel Papers Train/Evaluation of cholesterol and vitamin E concentrations in adult alpacas and nursing crias.pdf  
  inflating: Camel Papers Train/Influence of effects on quality traits and relationships between traits of the llama fleece..pdf  
  inflating: Camel Papers Train/Influence of Follicular Fluid on in Vitro.pdf  
  inflating: Camel Papers Train/Neurological Causes of Diaphragmatic Paralysis in 11 Alpacas.pdf  
  inflating: Camel Papers Train/On the morphology of the cerebellum of the alpaca (Lama pacos)..pdf  
  inflating: Camel Papers Train/Relationships between integumental charact

Now we can load the documents in the training and evaluation directories and parse each set into nodes.

We will use LlamaIndex's `SimpleNodeParser` to achieve this!

In [15]:
from llama_index.core.node_parser import SimpleNodeParser
from llama_index.core.schema import MetadataMode

TRAIN_FILES = "Camel Papers Train"
EVAL_FILES = "Camel Papers Test"

In [17]:
from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SimpleNodeParser
from llama_index.core.schema import MetadataMode

def load_corpus(directory, verbose=False):
    if verbose:
        print(f"Loading files in {directory}")

    reader = SimpleDirectoryReader(directory)
    docs = reader.load_data()
    if verbose:
        print(f"Loaded {len(docs)} docs")

    parser = SimpleNodeParser.from_defaults()
    nodes = parser.get_nodes_from_documents(docs, show_progress=verbose)

    if verbose:
        print(f"Parsed {len(nodes)} nodes")

    return nodes

In [18]:
train_nodes = load_corpus(TRAIN_FILES, verbose=True)
eval_nodes = load_corpus(EVAL_FILES, verbose=True)

Loading files in Camel Papers Train
Loaded 91 docs


Parsing nodes:   0%|          | 0/91 [00:00<?, ?it/s]

Parsed 160 nodes
Loading files in Camel Papers Test
Loaded 9 docs


Parsing nodes:   0%|          | 0/9 [00:00<?, ?it/s]

Parsed 18 nodes


In [23]:
print(train_nodes[0].text)

Brief Original Article  
 
Antibody response to the epsilon toxin of  Clostridium perfringens  following 
vaccination of Lama glama  crias  
 
Adriana B. Bentancor1, Pablo Halperi n2 Myriam Flores3, Fabián Iribarren4,5 
 
1Microbiología,  2Histología, 3Estadística, 4Enfermedades Infecciosas , Facultad de Ciencias Veterinarias  Universidad de Buenos 
Aires, Argentina,  Chorroarín 280.  Cdad. Autónoma de Buenos Aires,  Argentina   
5Instituto Rosenbusch SA  
 
Abstract  
Background : Enterotoxaemia produced by Clostridium perfringens  A, C and D is an important cause of mortality in young llamas. There is 
no data on antibody responses following vaccination with epsilon toxin.  
Methodology: Twenty -six L. glama crias w ere divided into four groups which were vaccinated with a commercial vaccine (Mancha 
Gangrena Enterotoxemia, Instituto Rosembusch Sociedad Anónima, Argentina) on days 0, 21 and 42 or left as unvaccinated contro ls. An 
indirect ELISA was compared with the mo use neutrali

Now that we've split our source documents into a number of nodes, we can move on to constructing a fine-tuning dataset.

## Task 3: Constructing a Fine-tuning Dataset

Using the nodes we created above, we can finally start constructing a fine-tuning dataset utilizing OpenAI's `gpt-3.5-turbo`.

We'll start by using LlamaIndex's `generate_qa_embedding_pairs` and storing the result in an `EmbeddingQAFinetuneDataset`.

The basic idea here is straightforward enough:

1. We look at a node
2. We generate questions that could be answered by that node (in this run, 6 evaluation nodes yield 12 queries)

This gives us a number of question/context pairs that we can use to fine-tune our Embeddings model.

> NOTE: Keep in mind that the example below uses the first 60 training nodes to generate the QA pairs. This results in 60 calls to `gpt-3.5-turbo`. Feel free to reduce the number of nodes.
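To make the shape of the resulting dataset concrete, here is a minimal sketch of the node-to-question idea described above. It is not the LlamaIndex implementation: `build_qa_dataset` and `fake_questions` are hypothetical names, and the canned question generator stands in for the LLM call. The three keys (`queries`, `corpus`, `relevant_docs`) mirror the ones read back from the saved JSON later in this notebook.

```python
import uuid


def build_qa_dataset(nodes, generate_questions):
    """Build queries/corpus/relevant_docs mappings from (node_id, text) pairs."""
    queries, corpus, relevant_docs = {}, {}, {}
    for node_id, text in nodes:
        corpus[node_id] = text
        for question in generate_questions(text):
            query_id = str(uuid.uuid4())
            queries[query_id] = question
            # Each generated question points back at the node it came from.
            relevant_docs[query_id] = [node_id]
    return {"queries": queries, "corpus": corpus, "relevant_docs": relevant_docs}


def fake_questions(text):
    # Hypothetical stand-in for the LLM call: two canned questions per node.
    return [
        f"What does the passage say about {text.split()[0]}?",
        f"Summarize the passage that begins with '{text[:20]}'.",
    ]


toy_nodes = [
    ("node-a", "Alpacas produce fine fiber."),
    ("node-b", "Llamas are used as pack animals."),
]
toy_dataset = build_qa_dataset(toy_nodes, fake_questions)
print(len(toy_dataset["queries"]), len(toy_dataset["corpus"]))
```

Two nodes with two questions each give four queries over a two-document corpus, the same 2:1 ratio seen in the evaluation dataset in this notebook.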

In [24]:
from llama_index.finetuning import generate_qa_embedding_pairs
from llama_index.core.evaluation import EmbeddingQAFinetuneDataset

In [25]:
from llama_index.llms.openai import OpenAI

llm = OpenAI(temperature=0.0, model="gpt-3.5-turbo")

In [26]:
train_dataset = generate_qa_embedding_pairs(train_nodes[:60], llm=llm)
train_dataset.save_json("train_dataset.json")

100%|██████████| 60/60 [01:47<00:00,  1.79s/it]


In [27]:
eval_dataset = generate_qa_embedding_pairs(eval_nodes[:6], llm=llm)
eval_dataset.save_json("eval_dataset.json")

100%|██████████| 6/6 [00:08<00:00,  1.48s/it]


In [28]:
train_dataset = EmbeddingQAFinetuneDataset.from_json("train_dataset.json")
eval_dataset = EmbeddingQAFinetuneDataset.from_json("eval_dataset.json")

In [46]:
len(eval_dataset.queries)

12

In [47]:
len(eval_dataset.corpus)

6

In [50]:
eval_dataset.queries

{'33594dec-3ef3-4f95-972a-6016c22ac6cc': 'What are the clinical signs and initial treatment provided to the alpaca cria presented in the case report?',
 '025bd677-c0ff-4783-9807-e6b4ad93de9c': 'How was acute respiratory distress syndrome diagnosed in the alpaca cria, and what treatment was successful in resolving the condition?',
 '3b29b54f-8a36-4b7f-a14c-daf1099dda3d': 'What are the major problems identified in the initial assessment of the patient, and what differentials were considered for these findings?',
 '8891598d-31d4-496e-b529-bdae9fbb00c8': 'Describe the abnormalities found in the chemistry panel of the patient, including values that were outside the reference intervals.',
 '071e47ab-fc03-4049-9af6-550d621c9bb0': 'What treatment protocol was followed for the cria presenting with neonatal maladjustment syndrome, septicemia, and congenital malformation? Include details such as medication administration, fluid therapy, and other supportive measures.',
 '6c692fea-cfb2-4edb-ac0f-f

## Task 4: Fine-tuning `snowflake-arctic-embed-l`

Now that we have a dataset, let's grab a `sentence-transformers` Embeddings model!

We'll be using Snowflake's [`snowflake-arctic-embed-l`](https://huggingface.co/Snowflake/snowflake-arctic-embed-l) as a base embeddings model.

It is a well-performing embeddings model by itself, but there are a lot of very specific domain terms and vocabulary in our corpus - so let's fine-tune it and see what that can do for us!

> NOTE: If you are limited by your compute - you can use the `snowflake-arctic-embed-m` model instead, which will run on the free T4 GPU instance in Colab.
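If you want to make that choice explicit in code, a small helper along these lines could work. This is only a sketch: `pick_base_model_id` is a hypothetical function, and the 22 GB threshold is simply the VRAM figure quoted at the top of this notebook, not a measured limit.

```python
def pick_base_model_id(vram_gb, large_min_vram_gb=22.0):
    """Return a Hugging Face model ID based on the available GPU memory (in GB)."""
    if vram_gb >= large_min_vram_gb:
        return "Snowflake/snowflake-arctic-embed-l"
    # Fall back to the smaller model on GPUs such as a 16 GB T4.
    return "Snowflake/snowflake-arctic-embed-m"


print(pick_base_model_id(24))
print(pick_base_model_id(16))
```

The returned string can be passed as `model_id` when constructing the fine-tuning engine.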

#### ❓ Question 1:

How many parameters does `snowflake-arctic-embed-l` have?

**ANSWER**: 
335 million parameters

In [51]:
from llama_index.finetuning import SentenceTransformersFinetuneEngine

finetune_engine = SentenceTransformersFinetuneEngine(
    train_dataset, # Dataset to be trained on
    val_dataset=eval_dataset, # Dataset to evaluate on
    model_id="Snowflake/snowflake-arctic-embed-l", # HuggingFace reference to base embeddings model
    model_output_path="snowflake_finetune_camelids", # Output directory for fine-tuned embeddings model
    epochs=4 # Number of Epochs to train for
)

modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/252 [00:00<?, ?B/s]

You try to use a model that was created with version 2.7.0.dev0, however, your version is 2.7.0. This might cause unexpected behavior or errors. In that case, try to update to the latest version.





README.md:   0%|          | 0.00/83.9k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/107 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/704 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.34G [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/1.38k [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/712k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/695 [00:00<?, ?B/s]

1_Pooling/config.json:   0%|          | 0.00/297 [00:00<?, ?B/s]

All that's left to do now is call `.finetune()`!

In [52]:
finetune_engine.finetune()

Epoch:   0%|          | 0/4 [00:00<?, ?it/s]

Iteration:   0%|          | 0/12 [00:00<?, ?it/s]

Iteration:   0%|          | 0/12 [00:00<?, ?it/s]

Iteration:   0%|          | 0/12 [00:00<?, ?it/s]

Iteration:   0%|          | 0/12 [00:00<?, ?it/s]
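For intuition about what training on question/context pairs optimizes, here is a pure-Python sketch of an in-batch-negatives ranking loss. It assumes the engine's default objective is of this kind (sentence-transformers calls it `MultipleNegativesRankingLoss`); that is an assumption worth checking against your installed version. The function name, the toy vectors, and the `scale` value are illustrative, and the real loss operates on model outputs rather than hand-written vectors.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def in_batch_negatives_loss(query_vecs, context_vecs, scale=20.0):
    """Mean cross-entropy where context i is the positive for query i
    and every other context in the batch acts as a negative."""
    total = 0.0
    for i, query in enumerate(query_vecs):
        logits = [scale * dot(query, context) for context in context_vecs]
        max_logit = max(logits)
        # Numerically stable log-sum-exp over all contexts in the batch.
        log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        total += log_sum_exp - logits[i]
    return total / len(query_vecs)


# Unit vectors, so the dot product equals cosine similarity.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = in_batch_negatives_loss(queries, [[1.0, 0.0], [0.0, 1.0]])
swapped = in_batch_negatives_loss(queries, [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

The loss is near zero when each question sits closest to its own context and grows when it sits closer to another context in the batch, which is the behavior fine-tuning pushes the embeddings toward.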

Now that we've fine-tuned our embeddings model, let's grab the model out of the engine so we can use it later!

> NOTE: You can safely ignore any warnings relating to weights here.

In [53]:
finetuned_embedding_model = finetune_engine.get_finetuned_model()
finetuned_embedding_model

You try to use a model that was created with version 2.7.0.dev0, however, your version is 2.7.0. This might cause unexpected behavior or errors. In that case, try to update to the latest version.



Some weights of BertModel were not initialized from the model checkpoint at snowflake_finetune_camelids and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


HuggingFaceEmbedding(model_name='snowflake_finetune_camelids', embed_batch_size=10, callback_manager=<llama_index.core.callbacks.base.CallbackManager object at 0x7fad369dd2e0>, num_workers=None, max_length=512, normalize=True, query_instruction=None, text_instruction=None, cache_folder=None)

In [56]:
finetuned_embedding_model

HuggingFaceEmbedding(model_name='snowflake_finetune_camelids', embed_batch_size=10, callback_manager=<llama_index.core.callbacks.base.CallbackManager object at 0x7fad369dd2e0>, num_workers=None, max_length=512, normalize=True, query_instruction=None, text_instruction=None, cache_folder=None)

In [57]:
from sentence_transformers import SentenceTransformer

fine_tuned_embedding = SentenceTransformer(
    "snowflake_finetune_camelids"
)

You try to use a model that was created with version 2.7.0.dev0, however, your version is 2.7.0. This might cause unexpected behavior or errors. In that case, try to update to the latest version.



Some weights of BertModel were not initialized from the model checkpoint at snowflake_finetune_camelids and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [58]:
fine_tuned_embedding

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)

In [59]:
fine_tuned_embedding.push_to_hub(repo_id="Nithin29/snowflake-ft-camelids-l")


model.safetensors:   0%|          | 0.00/1.34G [00:00<?, ?B/s]

'https://huggingface.co/Nithin29/snowflake-ft-camelids-l/commit/649241acb15c86ce1345adfa47a2f8a55447e16d'

## Task 5: Evaluating Retrieval with Embedding Model

Now that we've fine-tuned our model - let's see how it performs against OpenAI's `text-embedding-3-small` model, and the base non-fine-tuned version of the model.

In [63]:
from tqdm.notebook import tqdm
from llama_index.core.schema import TextNode
from llama_index.core import Settings, VectorStoreIndex


def evaluate(
    dataset,
    embed_model,
    top_k=2,
    verbose=False,
):
    corpus = dataset['corpus']
    queries = dataset['queries']
    relevant_docs = dataset['relevant_docs']

    nodes = [TextNode(id_=id_, text=text) for id_, text in corpus.items() if text != ""]
    index = VectorStoreIndex(
        nodes,
        show_progress=True,
        embed_model=embed_model
    )
    retriever = index.as_retriever(similarity_top_k=top_k)

    eval_results = []
    for query_id, query in tqdm(queries.items()):
        retrieved_nodes = retriever.retrieve(query)
        retrieved_ids = [node.node.node_id for node in retrieved_nodes]
        expected_id = relevant_docs[query_id][0]
        is_hit = expected_id in retrieved_ids  # assume 1 relevant doc

        eval_result = {
            'is_hit': is_hit,
            'retrieved': retrieved_ids,
            'expected': expected_id,
            'query': query_id,
        }
        eval_results.append(eval_result)
    return eval_results

#### ❓ Question 2:

Describe what the `evaluate` function is doing in the above cell in natural language.

**ANSWER**

- Initialization: It extracts the corpus, queries, and relevant documents from the provided dataset.
- TextNode Creation: For each entry in the corpus, it creates a TextNode object holding the entry's ID and text, skipping any entry whose text is empty.
- Index Creation: These TextNode objects are then used to build a VectorStoreIndex. This index uses an embedding model specified by embed_model to convert text into vector representations, facilitating efficient similarity searches.
- Retriever Setup: It configures a retriever from the index to fetch the top k most similar documents (specified by top_k) for a given query.
- Evaluation Loop: For each query, it retrieves the top k documents and checks if the expected relevant document (from relevant_docs) is among them. This is captured in a hit or miss boolean (is_hit).
- Result Compilation: It collects results for each query, including whether the relevant document was retrieved (is_hit), the ids of the retrieved documents (retrieved_ids), the expected relevant document id (expected_id), and the query id (query).
- Return: Finally, it returns a list of these results, providing a detailed evaluation of the retrieval performance for each query.
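The hit-or-miss logic at the core of `evaluate` can be sketched without any index or embedding model. In this minimal example the retrieved IDs are hard-coded toy values (hypothetical, standing in for what the retriever would return), and `hit_rate` is an illustrative helper rather than part of the notebook's code.

```python
def hit_rate(retrieved_by_query, relevant_docs):
    """Fraction of queries whose expected doc ID appears in the retrieved IDs."""
    hits = []
    for query_id, retrieved_ids in retrieved_by_query.items():
        expected_id = relevant_docs[query_id][0]  # assume 1 relevant doc, as in `evaluate`
        hits.append(expected_id in retrieved_ids)
    return sum(hits) / len(hits)


toy_relevant = {"q-1": ["doc-a"], "q-2": ["doc-b"], "q-3": ["doc-c"]}
# With top_k=2, every query gets two retrieved IDs.
toy_retrieved = {
    "q-1": ["doc-a", "doc-b"],  # hit at rank 1
    "q-2": ["doc-a", "doc-b"],  # hit at rank 2
    "q-3": ["doc-a", "doc-b"],  # miss
}
toy_hit_rate = hit_rate(toy_retrieved, toy_relevant)
print(round(toy_hit_rate, 3))
```

Note that a hit at rank 2 counts the same as a hit at rank 1, so this metric says nothing about ordering within the top k.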

In [83]:
from sentence_transformers.evaluation import InformationRetrievalEvaluator
from sentence_transformers import SentenceTransformer

def evaluate_sentence_transformers(
    dataset,
    model_id,
    name,
):
    corpus = dataset['corpus']
    queries = dataset['queries']
    relevant_docs = dataset['relevant_docs']

    evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs, name=name)
    model = SentenceTransformer(model_id)
    return evaluator(model, output_path="./")

#### ❓ Question 3:

Describe what the `evaluate_sentence_transformers` function is doing in the above cell in natural language.

**ANSWER**:

- Extract Data: It extracts the corpus, queries, and relevant documents from the input dataset.
- Set Up Evaluator: It initializes an InformationRetrievalEvaluator with the queries, corpus, and relevant documents. The evaluator will assess how well the model can retrieve relevant documents for each query.
- Load Model: It loads a sentence transformer model specified by model_id.
- Evaluate Model: It evaluates the model using the InformationRetrievalEvaluator, which will likely run queries through the model and compare the model's output against the expected relevant documents to assess accuracy.
- Return Evaluation Results: It returns the evaluator's summary score (a single float in the runs in this notebook) and writes the detailed metrics to a file under the given output path.
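The score this evaluator returns is lower than the hit rate for the same model in this notebook, which is what you would expect from a rank-aware metric. As a sketch, assuming the summary score is a mean-average-precision style metric (an assumption about the installed `sentence-transformers` version, worth confirming in its documentation), the computation looks roughly like this. The helper names and toy IDs are hypothetical.

```python
def average_precision_at_k(retrieved, relevant, k):
    """Average precision over the top-k retrieved IDs for a single query."""
    hits, precision_sum = 0, 0.0
    for rank, doc_id in enumerate(retrieved[:k], start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    denominator = min(k, len(relevant))
    return precision_sum / denominator if denominator else 0.0


def mean_average_precision_at_k(retrieved_by_query, relevant_docs, k):
    scores = [
        average_precision_at_k(retrieved, set(relevant_docs[query_id]), k)
        for query_id, retrieved in retrieved_by_query.items()
    ]
    return sum(scores) / len(scores)


toy_relevant = {"q-1": ["doc-a"], "q-2": ["doc-a"], "q-3": ["doc-a"]}
toy_retrieved = {
    "q-1": ["doc-a", "doc-b"],  # relevant doc at rank 1 -> 1.0
    "q-2": ["doc-b", "doc-a"],  # relevant doc at rank 2 -> 0.5
    "q-3": ["doc-b", "doc-c"],  # relevant doc missing  -> 0.0
}
toy_map = mean_average_precision_at_k(toy_retrieved, toy_relevant, k=2)
print(toy_map)
```

On this toy data two of three queries are hits at `k=2`, yet the rank-aware score is 0.5, because the hit at rank 2 only earns half credit.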

In [61]:
import json

with open("eval_dataset.json", "r") as f:
    eval_dataset_json = json.load(f)

### Text Embedding 3 Small Results

We'll compare our results against OpenAI's `text-embedding-3-small` model, so we'll need to load it up!

In [64]:
from llama_index.embeddings.openai import OpenAIEmbedding

text_embedding_3_small = OpenAIEmbedding(model="text-embedding-3-small")
te3_val_results = evaluate(eval_dataset_json, text_embedding_3_small)

Generating embeddings:   0%|          | 0/6 [00:00<?, ?it/s]

  0%|          | 0/12 [00:00<?, ?it/s]

In [66]:
len(te3_val_results)

12

Let's look at what an example of our results looks like.

In [67]:
import pandas as pd

df_te3 = pd.DataFrame(te3_val_results)

In [68]:
df_te3

| | is_hit | retrieved | expected | query |
|---|---|---|---|---|
| 0 | True | [3658ef6e-0944-403b-bc30-f2db4bf393e1, 63386f7... | 3658ef6e-0944-403b-bc30-f2db4bf393e1 | 33594dec-3ef3-4f95-972a-6016c22ac6cc |
| 1 | True | [3658ef6e-0944-403b-bc30-f2db4bf393e1, 885c1e6... | 3658ef6e-0944-403b-bc30-f2db4bf393e1 | 025bd677-c0ff-4783-9807-e6b4ad93de9c |
| 2 | True | [a54cfedf-6446-43d0-bac3-1c029f0a155b, 65c8902... | a54cfedf-6446-43d0-bac3-1c029f0a155b | 3b29b54f-8a36-4b7f-a14c-daf1099dda3d |
| 3 | True | [a54cfedf-6446-43d0-bac3-1c029f0a155b, 63386f7... | a54cfedf-6446-43d0-bac3-1c029f0a155b | 8891598d-31d4-496e-b529-bdae9fbb00c8 |
| 4 | True | [63386f7f-6313-4a2d-a2bf-cb4890d87ef0, 65c8902... | 63386f7f-6313-4a2d-a2bf-cb4890d87ef0 | 071e47ab-fc03-4049-9af6-550d621c9bb0 |
| 5 | True | [63386f7f-6313-4a2d-a2bf-cb4890d87ef0, 65c8902... | 63386f7f-6313-4a2d-a2bf-cb4890d87ef0 | 6c692fea-cfb2-4edb-ac0f-ffac11544450 |
| 6 | False | [885c1e65-50e3-40c4-9b72-da3cd65f3b1e, 3658ef6... | 65c8902d-9ab3-4271-9020-9208c3a716f2 | 0986bbe9-d55b-4264-8d01-8ff221c56435 |
| 7 | True | [65c8902d-9ab3-4271-9020-9208c3a716f2, 885c1e6... | 65c8902d-9ab3-4271-9020-9208c3a716f2 | 758042a9-e359-4140-ab5e-c155de09f372 |
| 8 | True | [664034a5-5165-43e1-af2f-81184e5d595d, 65c8902... | 664034a5-5165-43e1-af2f-81184e5d595d | 2504507e-5796-42b2-8175-e0800cc4f9ac |
| 9 | True | [65c8902d-9ab3-4271-9020-9208c3a716f2, 664034a... | 664034a5-5165-43e1-af2f-81184e5d595d | 78aa2915-fbb8-4972-8f29-df0f618bafee |


#### ❓ Question 4:

What do these `[313de41e-534b...]` IDs mean?

**ANSWER**:

- Retrieved IDs: These are the identifiers (IDs) of the documents or items that the model has selected as relevant or related to the query based on its understanding and retrieval process. These IDs correspond to the entries in the corpus that the model outputs as the result of a query.
- Expected IDs: These IDs represent the correct or relevant documents as per the dataset's ground truth. These are the IDs of documents that are actually relevant to the query and should ideally be retrieved by the model.
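Because these IDs are just keys into the evaluation dataset, a row of results can be turned back into readable text with a dictionary lookup. The sketch below uses a toy dataset with made-up IDs and passages; `resolve_result` is a hypothetical helper, and in the notebook you would pass the loaded evaluation JSON and one of the result dictionaries instead.

```python
def resolve_result(result, dataset, preview_chars=60):
    """Replace the IDs in one evaluation row with the text they point to."""
    corpus = dataset["corpus"]
    return {
        "query": dataset["queries"][result["query"]],
        "expected": corpus[result["expected"]][:preview_chars],
        "retrieved": [corpus[doc_id][:preview_chars] for doc_id in result["retrieved"]],
        "is_hit": result["is_hit"],
    }


# Toy stand-ins (hypothetical IDs and text) for the dataset and one result row.
toy_dataset = {
    "queries": {"q-1": "How was the cria treated?"},
    "corpus": {
        "doc-1": "The cria received oxygen and supportive care.",
        "doc-2": "Fleece quality traits were measured in llamas.",
    },
    "relevant_docs": {"q-1": ["doc-1"]},
}
toy_row = {"is_hit": True, "retrieved": ["doc-1", "doc-2"], "expected": "doc-1", "query": "q-1"}

resolved = resolve_result(toy_row, toy_dataset)
print(resolved["query"])
print(resolved["retrieved"])
```

Reading the text behind a miss is often the quickest way to see whether the retriever returned something plausible or something unrelated.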

Now let's look at the mean value of `is_hit`.

In [69]:
hit_rate_te3 = df_te3['is_hit'].mean()
hit_rate_te3

0.9166666666666666

Overall, we see `text-embedding-3-small` getting a "hit rate" of about `0.92` (11 of 12 queries).

### Base Embeddings Model Results

Let's get the evaluation for our base embedding model (pre-fine-tuning).

In [70]:
base_embed_model_id = "Snowflake/snowflake-arctic-embed-l"
base_embed_model = SentenceTransformer(base_embed_model_id)

arctic_base = "local:Snowflake/snowflake-arctic-embed-l"
arctic_base_val_results = evaluate(eval_dataset_json, arctic_base)

You try to use a model that was created with version 2.7.0.dev0, however, your version is 2.7.0. This might cause unexpected behavior or errors. In that case, try to update to the latest version.





modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/252 [00:00<?, ?B/s]

You try to use a model that was created with version 2.7.0.dev0, however, your version is 2.7.0. This might cause unexpected behavior or errors. In that case, try to update to the latest version.





README.md:   0%|          | 0.00/83.9k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/107 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/704 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.34G [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/1.38k [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/712k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/695 [00:00<?, ?B/s]

1_Pooling/config.json:   0%|          | 0.00/297 [00:00<?, ?B/s]

Generating embeddings:   0%|          | 0/6 [00:00<?, ?it/s]

  0%|          | 0/12 [00:00<?, ?it/s]

In [71]:
df_arctic_base = pd.DataFrame(arctic_base_val_results)

In [73]:
df_arctic_base

| | is_hit | retrieved | expected | query |
|---|---|---|---|---|
| 0 | False | [65c8902d-9ab3-4271-9020-9208c3a716f2, 664034a... | 3658ef6e-0944-403b-bc30-f2db4bf393e1 | 33594dec-3ef3-4f95-972a-6016c22ac6cc |
| 1 | False | [65c8902d-9ab3-4271-9020-9208c3a716f2, 664034a... | 3658ef6e-0944-403b-bc30-f2db4bf393e1 | 025bd677-c0ff-4783-9807-e6b4ad93de9c |
| 2 | True | [664034a5-5165-43e1-af2f-81184e5d595d, a54cfed... | a54cfedf-6446-43d0-bac3-1c029f0a155b | 3b29b54f-8a36-4b7f-a14c-daf1099dda3d |
| 3 | True | [664034a5-5165-43e1-af2f-81184e5d595d, a54cfed... | a54cfedf-6446-43d0-bac3-1c029f0a155b | 8891598d-31d4-496e-b529-bdae9fbb00c8 |
| 4 | False | [664034a5-5165-43e1-af2f-81184e5d595d, 65c8902... | 63386f7f-6313-4a2d-a2bf-cb4890d87ef0 | 071e47ab-fc03-4049-9af6-550d621c9bb0 |
| 5 | False | [65c8902d-9ab3-4271-9020-9208c3a716f2, 664034a... | 63386f7f-6313-4a2d-a2bf-cb4890d87ef0 | 6c692fea-cfb2-4edb-ac0f-ffac11544450 |
| 6 | True | [664034a5-5165-43e1-af2f-81184e5d595d, 65c8902... | 65c8902d-9ab3-4271-9020-9208c3a716f2 | 0986bbe9-d55b-4264-8d01-8ff221c56435 |
| 7 | True | [664034a5-5165-43e1-af2f-81184e5d595d, 65c8902... | 65c8902d-9ab3-4271-9020-9208c3a716f2 | 758042a9-e359-4140-ab5e-c155de09f372 |
| 8 | True | [664034a5-5165-43e1-af2f-81184e5d595d, 65c8902... | 664034a5-5165-43e1-af2f-81184e5d595d | 2504507e-5796-42b2-8175-e0800cc4f9ac |
| 9 | True | [664034a5-5165-43e1-af2f-81184e5d595d, 65c8902... | 664034a5-5165-43e1-af2f-81184e5d595d | 78aa2915-fbb8-4972-8f29-df0f618bafee |


In [72]:
hit_rate_arctic_base = df_arctic_base['is_hit'].mean()
hit_rate_arctic_base

0.5

With a `0.5` hit rate, the base embedding model falls well short of OpenAI's `text-embedding-3-small` (about `0.92`) on this evaluation set!

Because this is a local `SentenceTransformer`, we can evaluate it with the `SentenceTransformer` evaluation helper-function as well!

In [75]:
eval_dataset_json

{'queries': {'33594dec-3ef3-4f95-972a-6016c22ac6cc': 'What are the clinical signs and initial treatment provided to the alpaca cria presented in the case report?',
  '025bd677-c0ff-4783-9807-e6b4ad93de9c': 'How was acute respiratory distress syndrome diagnosed in the alpaca cria, and what treatment was successful in resolving the condition?',
  '3b29b54f-8a36-4b7f-a14c-daf1099dda3d': 'What are the major problems identified in the initial assessment of the patient, and what differentials were considered for these findings?',
  '8891598d-31d4-496e-b529-bdae9fbb00c8': 'Describe the abnormalities found in the chemistry panel of the patient, including values that were outside the reference intervals.',
  '071e47ab-fc03-4049-9af6-550d621c9bb0': 'What treatment protocol was followed for the cria presenting with neonatal maladjustment syndrome, septicemia, and congenital malformation? Include details such as medication administration, fluid therapy, and other supportive measures.',
  '6c692fea

In [84]:
evaluate_sentence_transformers(eval_dataset_json, "Snowflake/snowflake-arctic-embed-l", name='arctic-l')

You try to use a model that was created with version 2.7.0.dev0, however, your version is 2.7.0. This might cause unexpected behavior or errors. In that case, try to update to the latest version.





0.4680555555555555

Not great results - let's see what fine-tuning can do for us!

### Fine-tuned Results

In [77]:
finetuned = "local:snowflake_finetune_camelids"
eval_results_finetuned = evaluate(eval_dataset_json, finetuned)

You try to use a model that was created with version 2.7.0.dev0, however, your version is 2.7.0. This might cause unexpected behavior or errors. In that case, try to update to the latest version.



Some weights of BertModel were not initialized from the model checkpoint at snowflake_finetune_camelids and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Generating embeddings:   0%|          | 0/6 [00:00<?, ?it/s]

  0%|          | 0/12 [00:00<?, ?it/s]

In [78]:
df_finetuned = pd.DataFrame(eval_results_finetuned)

In [79]:
hit_rate_finetuned = df_finetuned['is_hit'].mean()
hit_rate_finetuned

0.8333333333333334

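This is a marked improvement over the base model: the hit rate rises from `0.5` to about `0.83`. Absolutely fantastic! Keep in mind that with only 12 evaluation queries, each query moves the hit rate by roughly `0.08`.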

In [85]:
evaluate_sentence_transformers(eval_dataset_json, "snowflake_finetune_camelids", name='finetuned')

You try to use a model that was created with version 2.7.0.dev0, however, your version is 2.7.0. This might cause unexpected behavior or errors. In that case, try to update to the latest version.



Some weights of BertModel were not initialized from the model checkpoint at snowflake_finetune_camelids and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


0.7569444444444443

It's also a marked improvement on the `SentenceTransformer` evaluation, from about `0.47` to about `0.76`!

### Conclusion

Now we can compare the three embedding models to see which performed the best!

In [86]:
df_te3['model'] = 'te3'
df_arctic_base['model'] = 'arctic-baseline'
df_finetuned['model'] = 'arctic-fine-tuned'

In [87]:
df_all = pd.concat([df_te3, df_arctic_base, df_finetuned])
df_all.groupby('model')[['is_hit']].mean()  # select the column explicitly before averaging

| model | is_hit |
|---|---|
| arctic-baseline | 0.5 |
| arctic-fine-tuned | 0.833333 |
| te3 | 0.916667 |
