# Finetune the Embedding Model on MLDE

## Dataset preparation
To make our Retrieval Augmented Generation (RAG) application more effective, we can fine-tune the embedding model on our own dataset so that it gets better at retrieving the right chunks when we ask a question. The training data consists of pairs: a question and the chunk it should retrieve. Our data is in JSON format, so the first thing we need to do is generate questions from it. We're going to use LlamaIndex and OpenAI to generate the questions. We've also included pre-generated datasets if you want to skip this part.
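
To make the target format concrete, here is a minimal sketch of how question and chunk pairs can be laid out. It mirrors the three mappings (`queries`, `corpus`, `relevant_docs`) that LlamaIndex's `EmbeddingQAFinetuneDataset` stores. The helper name `build_pairs_dataset` and the sample strings are hypothetical, and plain dictionaries stand in for the real class.

```python
import uuid


def build_pairs_dataset(pairs):
    """Arrange (question, chunk) pairs into three lookup tables.

    Hypothetical helper for illustration only; plain dicts stand in for the
    real dataset class.
    """
    queries = {}        # query ID -> question text
    corpus = {}         # chunk ID -> chunk text
    relevant_docs = {}  # query ID -> list of chunk IDs that answer it
    chunk_ids = {}      # chunk text -> chunk ID, so repeated chunks share one ID

    for question, chunk in pairs:
        if chunk not in chunk_ids:
            chunk_ids[chunk] = str(uuid.uuid4())
            corpus[chunk_ids[chunk]] = chunk
        query_id = str(uuid.uuid4())
        queries[query_id] = question
        relevant_docs[query_id] = [chunk_ids[chunk]]

    return {"queries": queries, "corpus": corpus, "relevant_docs": relevant_docs}


# Made-up sample data, not taken from the real documents.
sample_pairs = [
    ("How do I pause an experiment?", "Chunk text describing pausing experiments."),
    ("What happens to queued experiments?", "Chunk text describing pausing experiments."),
    ("Which GPUs does the agent use?", "Chunk text describing GPU visibility."),
]
dataset = build_pairs_dataset(sample_pairs)
print(len(dataset["queries"]), len(dataset["corpus"]))
```

Several questions can point at the same chunk, which is why the chunk text is stored once in `corpus` and referenced by ID.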

Let's start by getting the content from each of the objects in our data.

In [1]:
!pip install llama-index-finetuning


In [4]:
import os
import json

def extract_single_value_from_json_files(directory, key):
    """
    Reads every JSON file in a directory and extracts a single value from each object based on the specified key.

    Args:
    - directory (str): The directory path containing JSON files.
    - key (str): The key to extract from each object.

    Returns:
    - values_list (list): A list containing the extracted values from all JSON files.
    """
    values_list = []
    for filename in os.listdir(directory):
        if filename.endswith('.json'):
            file_path = os.path.join(directory, filename)
            with open(file_path, 'r', encoding='utf-8') as file:
                try:
                    data = json.load(file)
                    for obj in data:
                        if key in obj and obj[key] is not None:
                            values_list.append(obj[key])
                except json.JSONDecodeError:
                    print(f"Error decoding JSON file: {file_path}")
    return values_list

# Example usage:
directory_path = './documents'
key_to_extract = 'content'
result = extract_single_value_from_json_files(directory_path, key_to_extract)
print(len(result))
print(result[:5])

1825
['Release Date: September 25, 2023 Breaking Changes Kubernetes: Remove the agent_reattach_enabled config option. Agent reattach is now always enabled. Agent: Take the default value for the --visible-gpus option from the CUDA_VISIBLE_DEVICES or ROCR_VISIBLE_DEVICES environment variables, if defined. New Features SDK: Add the ability to keep track of what experiments use a particular checkpoint or model version for inference. SDK: Add Checkpoint.get_metrics and ModelVersion.get_metrics methods. Kubernetes: Support enabling and disabling agents to prevent Determined from scheduling jobs on specific nodes. Upgrading from a version before this feature to a version after this feature only on Kubernetes will cause queued allocations to be killed on upgrade. Users can pause queued experiments to avoid this. Improvements Enable reporting and display of metrics with floating-point epoch values. API: Allow the reporting of duplicate metrics across multiple report_metrics calls with the same 

OK, we've read the content into a list. Now we'll parse it with a sentence splitter to build nodes.

In [5]:
from llama_index.core import Document
from llama_index.core.node_parser import SentenceSplitter

node_parser = SentenceSplitter(chunk_size=1024, chunk_overlap=20)


text_list = result
documents = [Document(text=t) for t in text_list]

nodes = node_parser.get_nodes_from_documents(documents, show_progress=True)

Parsing nodes: 100%|██████████| 1825/1825 [00:01<00:00, 917.15it/s] 
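
To see what `chunk_size` and `chunk_overlap` control, here is a simplified sliding-window sketch. It is not how `SentenceSplitter` is implemented: the real splitter counts tokens and prefers sentence boundaries, while this hypothetical `sliding_window_chunks` helper simply counts whitespace-separated words.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into word windows where neighbours share `chunk_overlap` words."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the window already reached the end of the text
    return chunks


# Ten placeholder words: w0 w1 ... w9
sample_text = " ".join(f"w{i}" for i in range(10))
for chunk in sliding_window_chunks(sample_text, chunk_size=4, chunk_overlap=1):
    print(chunk)
```

Each chunk repeats the last word of the previous one, so a sentence that straddles a boundary is less likely to lose its context entirely.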


In [6]:
print(nodes[0].metadata)
print(nodes[0].text)
print("--------")
print(nodes[1].metadata)
print(nodes[1].text)

{}
Release Date: September 25, 2023 Breaking Changes Kubernetes: Remove the agent_reattach_enabled config option. Agent reattach is now always enabled. Agent: Take the default value for the --visible-gpus option from the CUDA_VISIBLE_DEVICES or ROCR_VISIBLE_DEVICES environment variables, if defined. New Features SDK: Add the ability to keep track of what experiments use a particular checkpoint or model version for inference. SDK: Add Checkpoint.get_metrics and ModelVersion.get_metrics methods. Kubernetes: Support enabling and disabling agents to prevent Determined from scheduling jobs on specific nodes. Upgrading from a version before this feature to a version after this feature only on Kubernetes will cause queued allocations to be killed on upgrade. Users can pause queued experiments to avoid this. Improvements Enable reporting and display of metrics with floating-point epoch values. API: Allow the reporting of duplicate metrics across multiple report_metrics calls with the same step

Cool, now we have our text chunked into a list of nodes. Next, we'll take a random sample of the data: 250 chunks each for training and validation.

In [9]:
import random
subset = random.sample(nodes, 500)
test, train = subset[:250], subset[250:]

print(len(test), len(train))

250 250
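
Because `random.sample` draws a different subset on every run, regenerated datasets will not match earlier ones. If reproducibility matters, one option is a dedicated seeded generator. This is a minimal sketch, where the `split_sample` helper, the seed value, and the integer stand-ins for nodes are assumptions for illustration:

```python
import random


def split_sample(items, n_test, n_train, seed=42):
    """Draw n_test + n_train items without replacement and split them."""
    rng = random.Random(seed)  # local generator, leaves the global RNG untouched
    subset = rng.sample(items, n_test + n_train)
    return subset[:n_test], subset[n_test:]


fake_nodes = list(range(2000))  # stand-ins for the parsed nodes
test_part, train_part = split_sample(fake_nodes, 250, 250)
print(len(test_part), len(train_part))
```

Sampling once and slicing the result keeps the two halves disjoint, since `sample` never returns the same item twice.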


Perfect, now we have 250 randomly sampled chunks for training and 250 for validation. Let's use OpenAI's gpt-3.5-turbo model to generate questions for these chunks. After that, we'll store them as JSON to use for training. You can skip this part and use the existing JSON files in the `experiment/` folder instead. If you do decide to run it, replace the existing files in the `experiment/` folder with the ones you generated.

In [None]:
import os

from llama_index.core.evaluation import EmbeddingQAFinetuneDataset
from llama_index.finetuning import generate_qa_embedding_pairs
from llama_index.llms.openai import OpenAI

# Placeholder: substitute your own key, and avoid committing a real key with the notebook.
os.environ["OPENAI_API_KEY"] = "your-api-key-here"

train_dataset = generate_qa_embedding_pairs(
    llm=OpenAI(model="gpt-3.5-turbo"), nodes=train
)
test_dataset = generate_qa_embedding_pairs(
    llm=OpenAI(model="gpt-3.5-turbo"), nodes=test
)

train_dataset.save_json("demo_dataset.json")
test_dataset.save_json("test_dataset.json")
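
Before sending the files to training, a quick consistency check can catch empty or dangling entries. The sketch below assumes the saved JSON has top-level `queries`, `corpus`, and `relevant_docs` mappings, which matches the fields of `EmbeddingQAFinetuneDataset` but is worth confirming against your own files. `find_dataset_problems` and `check_dataset_file` are hypothetical helpers.

```python
import json


def find_dataset_problems(data):
    """Return a list of human-readable problems found in a QA dataset dict."""
    problems = []
    for key in ("queries", "corpus", "relevant_docs"):
        if key not in data:
            problems.append(f"missing top-level key: {key}")
    if problems:
        return problems

    for query_id, text in data["queries"].items():
        if not text.strip():
            problems.append(f"empty question: {query_id}")
        doc_ids = data["relevant_docs"].get(query_id, [])
        if not doc_ids:
            problems.append(f"no relevant chunk for query: {query_id}")
        for doc_id in doc_ids:
            if doc_id not in data["corpus"]:
                problems.append(f"unknown chunk {doc_id} for query: {query_id}")
    return problems


def check_dataset_file(path):
    """Load a saved dataset file and report any problems."""
    with open(path, "r", encoding="utf-8") as handle:
        return find_dataset_problems(json.load(handle))
```

For a well-formed file, `check_dataset_file("demo_dataset.json")` should return an empty list.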

## Training on MLDE

Now that we have our data, let's fine-tune a model on MLDE. We're going to use `BAAI/bge-m3`, but any of the `BAAI` `bge` models should work well enough.
We're going to send our experiment to MLDE. Make sure you have the Determined client installed (`pip install determined`) and that you're logged in (`det -m <your master url> auth login`).


In [None]:
!det -m https://mlde.i006ua.tryezmeral.com:443 e create experiment/const.yaml ./experiment
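
Outside a notebook, the same submission can be scripted. The following is a sketch under a few assumptions: the `det` CLI is on the `PATH`, you are already logged in, and `build_create_command` and `submit` are hypothetical helper names. It spells out `experiment`, the long form of the `e` subcommand used above.

```python
import os
import subprocess


def build_create_command(master_url, config_path, context_dir):
    """Return the argv list for `det experiment create`."""
    return ["det", "-m", master_url, "experiment", "create", config_path, context_dir]


def submit(master_url, config_path="experiment/const.yaml", context_dir="./experiment"):
    """Check local paths first, then hand the command to the det CLI."""
    if not os.path.isfile(config_path):
        raise FileNotFoundError(f"config file not found: {config_path}")
    if not os.path.isdir(context_dir):
        raise NotADirectoryError(f"context directory not found: {context_dir}")
    # check=True raises CalledProcessError if det exits with a non-zero status.
    subprocess.run(build_create_command(master_url, config_path, context_dir), check=True)
```

Checking the paths before calling the CLI gives a clearer error message when the command is run from the wrong working directory.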