# Fine-Tuning Embedding Models for RAG Pipeline Optimization

Base embedding models, used to embed both the knowledge base and the user queries for context retrieval in RAG applications, generally work well. Their retrieval performance can still be improved, both for the kinds of queries users have historically asked and for domain-specific information.

**In short: fine-tune the embedding model on your own data to improve your RAG application!**

I've gone through various papers and implementations of embedding model fine-tuning techniques and determined that the most efficient way to get this improvement is a **query-only linear adapter**: a simple linear layer trained to transform query embeddings so that they land closer to the relevant documents in embedding space.

This allows us to very easily plug into existing RAG pipelines and optimize for our specific task without needing to completely re-embed our knowledge base or use a lot of resources training larger models, making this a simple, cost/compute-effective way to improve retrieval performance.

<img src="./media/adapters_explainer.png" width=800>

Additionally, while it's preferred to use existing labeled data gathered through something like RAG Question Answering Chatbot logs, it is possible to also improve embedding representations with synthetically generated labels.

In this notebook we will:
1. Define a RAG application to optimize
2. Generate a synthetic dataset with gpt-4o-mini
3. Test retrieval metrics to gather a baseline for all-MiniLM-L6-v2
4. Create and train a linear adapter
5. Plug the adapter onto all-MiniLM-L6-v2 and assess performance

Along the way, we'll be implementing many methodologies from [ChromaDB's Research](https://research.trychroma.com/embedding-adapters) on a small scale for task-specific performance increases, specifically their recommendations for triplet loss, random negative sampling, and linear query-only transformation.

<img src="./media/validation_chart.png" width=1200>

The model adapters trained in this notebook are published at [AdamLucek/all-MiniLM-L6-v2-query-only-linear-adapter-AppleQA](https://huggingface.co/AdamLucek/all-MiniLM-L6-v2-query-only-linear-adapter-AppleQA)

---
# Synthetic Dataset Creation

To optimize document retrieval effectively, a crucial component is having access to high-quality labeled data. This data typically consists of pairs matching user queries with their most relevant documents. While the ideal scenario would involve collecting and labeling this data from real-world user interactions and testing, for demonstration purposes we can simulate this data by generating potential queries for each chunk of our knowledgebase.

As a plus, [research suggests](https://arxiv.org/pdf/2401.00368) that using LLMs to generate synthetic data for improving text embeddings can lead to gains!

In [1]:
import os
import json
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import JsonOutputParser
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

### Loading [Apple's 2024 Environmental Report](https://www.apple.com/environment/pdf/Apple_Environmental_Progress_Report_2024.pdf)

This will be our main document and candidate for optimization. The idea is that the "RAG Application" we're optimizing is some sort of question-answering chat flow over this document. Our end goal is to retrieve the right chunks more accurately for the user's questions.

In [2]:
# Will be used as Training Data
apple_loader = PyPDFLoader("/Users/alucek/Documents/Jupyter_Notebooks/ft_emb/data/Apple_Environmental_Progress_Report_2024.pdf")
apple_pages = apple_loader.load()

# Concatenate the text of every page into a single string
apple_document = "".join(page.page_content for page in apple_pages)

### Chunking Documents

We use a token-based chunker with the same chunk size and overlap that [OpenAI's file search tool](https://platform.openai.com/docs/assistants/tools/file-search/how-it-works) uses. This splits the document into manageable chunks to be embedded and retrieved.

In [3]:
# Split PDFs
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    model_name="gpt-4",
    chunk_size=800,
    chunk_overlap=400,
)

apple_chunks = text_splitter.split_text(apple_document)
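
To make the `chunk_size` and `chunk_overlap` parameters concrete, here is a minimal sketch of fixed-size windows over a token list. It is a simplification: `RecursiveCharacterTextSplitter` also tries to split on separators such as paragraph breaks, so real chunk boundaries will differ. The function name `sliding_window_chunks` and the integer "tokens" are hypothetical and used only for illustration.

```python
def sliding_window_chunks(tokens, chunk_size=800, chunk_overlap=400):
    """Split a token list into windows of chunk_size that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the token list
        if start + chunk_size >= len(tokens):
            break
    return chunks

# Stand-in for real token ids: 2000 consecutive integers
tokens = list(range(2000))
chunks = sliding_window_chunks(tokens)

# Print the first and last token of every chunk to show the overlap
print([(chunk[0], chunk[-1]) for chunk in chunks])
```

With an overlap of half the chunk size, every token away from the document edges appears in two chunks, which is why several neighbouring chunks can plausibly answer the same question.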

### Defining Chunk Question Generation Chain

As mentioned, it is best to use real query/retrieval pairs logged and labeled from your RAG application, but for demonstration we will use synthetic question generation via an LLM.

The hope is that for each chunk of text, we can create possible user queries that should retrieve that chunk. Later on, this lets us run the same queries and assess retrieval based on whether, and how highly, the expected chunk is ranked.

In [4]:
label_template = """
You are an AI assistant tasked with generating a single, realistic question-answer pair based on a given document. The question should be something a user might naturally ask when seeking information contained in the document.

Given: {chunk}

Instructions:
1. Analyze the key topics, facts, and concepts in the given document, choose one to focus on.
2. Generate twenty similar questions that a user might ask to find the information in this document that does NOT contain any company name.
3. Use natural language and occasionally include typos or colloquialisms to mimic real user behavior in the question.
4. Ensure the question is semantically related to the document content WITHOUT directly copying phrases.
5. Make sure that all of the questions are similar to eachother. I.E. All asking about a similar topic/requesting the same information.

Output Format:
Return a JSON object with the following structure:
```json
{{
  "question_1": "Generated question text",
  "question_2": "Generated question text",
  ...
}}
```

Be creative, think like a curious user, and generate your 20 similar questions that would naturally lead to the given document in a semantic search. Ensure your response is a valid JSON object containing only the questions.

"""

label_prompt = ChatPromptTemplate.from_template(label_template)
llm = ChatOpenAI(temperature=1.0, model="gpt-4o-mini")

label_chain = label_prompt | llm | JsonOutputParser()

In [5]:
label_chain.invoke({"chunk": apple_chunks[20]})

{'question_1': 'How much less energy does the iMac use compared to ENERGY STAR requirements?',
 'question_2': 'What makes the iMac energy efficient?',
 'question_3': "Can you tell me about the iMac's energy consumption compared to standards?",
 'question_4': 'What percentage less energy does the iMac consume than required?',
 'question_5': 'How does the iMac compare in energy usage to the ENERGY STAR benchmark?',
 'question_6': "Is there any info on the iMac's energy efficiency ratings?",
 'question_7': 'What are the energy savings on the iMac compared to ENERGY STAR?',
 'question_8': 'How efficiently does the iMac use energy relative to environmental standards?',
 'question_9': 'What’s the energy reduction percentage of the new iMac models?',
 'question_10': 'How energy-efficient is the latest iMac according to ENERGY STAR?',
 'question_11': "What details can you provide about the iMac's energy savings?",
 'question_12': 'Does the iMac meet or exceed ENERGY STAR energy requirements?',

### Training & Validation Split

With 215 chunks and 20 questions each, we have 4300 Question+Chunk pairs. These were shuffled into an 80/20 train/validation set resulting in:  
- Training set size: 3440  
- Validation set size: 860

The dataset has been uploaded to [AdamLucek/apple-environmental-report-QA-retrieval](https://huggingface.co/datasets/AdamLucek/apple-environmental-report-QA-retrieval)
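
The split code itself is not shown in this notebook, so the following is only a sketch of how it could be done. It assumes the generated questions are kept as one dict per chunk (the shape returned by `label_chain`) and produces records with the `question` and `chunk` keys used later on. The helper names, the seed, and the placeholder data are all hypothetical.

```python
import random

def build_pairs(chunks, questions_per_chunk):
    """Flatten one question dict per chunk into one record per question."""
    pairs = []
    for chunk, questions in zip(chunks, questions_per_chunk):
        for question in questions.values():
            pairs.append({"question": question, "chunk": chunk})
    return pairs

def train_validation_split(pairs, train_fraction=0.8, seed=42):
    """Shuffle a copy of the pairs and cut it into train and validation lists."""
    shuffled = pairs[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]

# Placeholder data with the same shape as the real run: 215 chunks, 20 questions each
chunks = [f"chunk text {i}" for i in range(215)]
questions_per_chunk = [
    {f"question_{j + 1}": f"question {j + 1} about chunk {i}" for j in range(20)}
    for i in range(215)
]

pairs = build_pairs(chunks, questions_per_chunk)
train_pairs, validation_pairs = train_validation_split(pairs)
print(len(pairs), len(train_pairs), len(validation_pairs))
```

Because the shuffle happens at the pair level, the same chunk can appear in both sets with different questions, which matches the setup described above.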

In [6]:
with open('/Users/alucek/Documents/Jupyter_Notebooks/ft_emb/data/train.json', 'r') as f:
    train_data = json.load(f)

with open('/Users/alucek/Documents/Jupyter_Notebooks/ft_emb/data/validation.json', 'r') as f:
    validation_data = json.load(f)

---
# Setting Up Vector Database

We'll be using my go-to open source VDB [ChromaDB](https://www.trychroma.com/) as our application database. You would want to use a testing environment of whatever vector database your application is currently using and the same retrieval parameters to ensure it's optimized for your specific use case and data store. 

By default, ChromaDB uses [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) as their embedding model, which is the embedding model we will be using as our foundation for training!

In [7]:
import chromadb

# Create chroma client
path = "/Users/alucek/Documents/Jupyter_Notebooks/ft_emb/chromadb"
client = chromadb.PersistentClient(path=path)

# Create the collection for our simulated RAG pipeline
apple_collection = client.get_or_create_collection(name='apple_collection', metadata={"hnsw:space": "cosine"})

### Adding Chunks to VDB

Simply embedding all chunks into the database.

In [None]:
# Add apple chunks to vdb, one document per chunk with a stable id
for i, chunk in enumerate(apple_chunks):
    apple_collection.add(
        documents=[chunk],
        ids=[f"chunk_{i}"]
    )

### Function for Document Retrieval

Takes a query embedding and retrieves the top `k` (default 10) most similar chunks from the database.

In [8]:
def retrieve_documents_embeddings(query_embedding, k=10):
    query_embedding_list = query_embedding.tolist()
    
    results = apple_collection.query(
        query_embeddings=[query_embedding_list],
        n_results=k)
    return results['documents'][0]
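
For intuition about what the query does under the `cosine` space setting, here is a brute-force sketch that ranks documents by cosine distance (1 minus cosine similarity). This is an illustration under stated assumptions, not ChromaDB's implementation: the database uses an approximate HNSW index rather than an exhaustive scan, and the toy 2-dimensional embeddings and function names are hypothetical.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def top_k(query_embedding, document_embeddings, k=10):
    """Return the ids of the k documents closest to the query."""
    ranked = sorted(
        document_embeddings,
        key=lambda doc_id: cosine_distance(query_embedding, document_embeddings[doc_id]),
    )
    return ranked[:k]

# Toy embeddings standing in for the 384-dimensional chunk vectors
document_embeddings = {
    "chunk_0": [1.0, 0.0],
    "chunk_1": [0.7, 0.7],
    "chunk_2": [0.0, 1.0],
}

nearest = top_k([0.9, 0.1], document_embeddings, k=2)
print(nearest)
```

The adapter trained later changes only the query vector passed into this ranking step; the stored document vectors stay untouched.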

---
# Base Model Evaluation

As mentioned, we will be using [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) as our base model for optimization. This technique can be applied to any embedding model, so you'd want to assess your embedding model of choice here.

We'll be focusing on two specific metrics to optimize towards:
- **Mean Reciprocal Rank**
- **Recall@k**

### Metric: Mean Reciprocal Rank (MRR)

$MRR = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{rank_i}$

Where:

- $|Q|$ is the number of queries  
- $rank_i$ is the rank of the first correct answer for the $i$-th query

MRR (Mean Reciprocal Rank) measures how high the first correct answer appears in the list, on average. MRR is particularly useful when there's only one relevant item in the list or when we're primarily interested in the position of the first correct result.

In [9]:
def reciprocal_rank(retrieved_docs, ground_truth, k):
    try:
        rank = retrieved_docs.index(ground_truth) + 1
        return 1.0 / rank if rank <= k else 0.0
    except ValueError:
        return 0.0
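
As a small worked example of the formula, the sketch below averages reciprocal ranks over three hypothetical queries. The ranks are invented for illustration, and `None` marks a query whose correct chunk was not retrieved, which contributes 0 just as in `reciprocal_rank` above.

```python
def mean_reciprocal_rank(ranks):
    """ranks: 1-based rank of the correct chunk per query, or None if it was not retrieved."""
    reciprocal_ranks = [1.0 / rank if rank is not None else 0.0 for rank in ranks]
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

# Query 1 hits at rank 1, query 2 at rank 3, query 3 misses entirely
example_ranks = [1, 3, None]
example_mrr = mean_reciprocal_rank(example_ranks)

# (1 + 1/3 + 0) / 3
print(round(example_mrr, 4))
```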

### Metric: Recall@K

$\text{Recall@k} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \mathbb{1}(rank_i \leq k)$

Where:

- $|Q|$ is the number of queries
- $rank_i$ is the rank of the ground truth item for the $i$-th query
- $\mathbb{1}()$ is the indicator function, which equals 1 if the condition inside the parentheses is true, and 0 otherwise
- $k$ is the cut-off rank

Recall@k, also known as hit rate, measures the proportion of relevant items that are successfully retrieved within the top k results. In this context with one ground truth document, it's a binary measure that checks if the ground truth (correct item) is present in the top k retrieved documents.

In [52]:
def hit_rate(retrieved_docs, ground_truth, k):
    return 1.0 if ground_truth in retrieved_docs[:k] else 0.0
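
The same kind of worked example for Recall@k shows how the score depends on the cut-off. The ranks are again hypothetical, with `None` meaning the correct chunk was never retrieved.

```python
def recall_at_k(ranks, k):
    """Fraction of queries whose correct chunk appears at rank k or better."""
    hits = sum(1 for rank in ranks if rank is not None and rank <= k)
    return hits / len(ranks)

# Five hypothetical queries: ranks 1, 3, 7, a miss, and 12
example_ranks = [1, 3, 7, None, 12]

recall_by_k = {k: recall_at_k(example_ranks, k) for k in (1, 5, 10)}
print(recall_by_k)
```

A chunk ranked 12th counts as a miss at k=10, which is why the retrieval cut-off used in evaluation should match the one used in the real application.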

### Load the base model

We are using [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2), a 22.7M parameter embedding model which creates 384-dimensional dense vector representations of text content. Our goal then is to create an adapter that can better map the generated embedding of our query to the original vector space representation of our knowledge base.

In [11]:
from sentence_transformers import SentenceTransformer

# Load the base model
base_model = SentenceTransformer('all-MiniLM-L6-v2')

### Evaluation Function

This runs our metric calculations against our vector database setup and returns the average Recall@k and MRR.

In [14]:
import numpy as np

def validate_embedding_model(validation_data, base_model, k=10):
    hit_rates = []
    reciprocal_ranks = []
    
    for data_point in validation_data:
        question = data_point['question']
        ground_truth = data_point['chunk']
        
        # Generate embedding for the question
        question_embedding = base_model.encode(question)
        
        # Retrieve documents using the embedding
        retrieved_docs = retrieve_documents_embeddings(question_embedding, k)
        
        # Calculate metrics
        hr = hit_rate(retrieved_docs, ground_truth, k)
        rr = reciprocal_rank(retrieved_docs, ground_truth, k)
        
        hit_rates.append(hr)
        reciprocal_ranks.append(rr)
    
    # Calculate average metrics
    avg_hit_rate = np.mean(hit_rates)
    avg_reciprocal_rank = np.mean(reciprocal_ranks)
    
    return {
        'average_hit_rate': avg_hit_rate,
        'average_reciprocal_rank': avg_reciprocal_rank
    }

results = validate_embedding_model(validation_data, base_model)
print(f"Average Hit Rate @10: {results['average_hit_rate']}")
print(f"Mean Reciprocal Rank @10: {results['average_reciprocal_rank']}")

Average Hit Rate @10: 0.6116279069767442
Mean Reciprocal Rank @10: 0.31327104097452935


### Baseline Interpretation

**R@K: 0.6116**
- With k=10, for 61.16% of the queries, the correct answer was found within the top 10 results.

**MRR: 0.3133**
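- The reciprocal of 0.3133 is about 3.2, which loosely suggests that the correct result tends to appear around position 3 of 10. This is only a rough reading: queries whose chunk is missing from the top 10 contribute 0, and the reciprocal of MRR is not the arithmetic mean rank.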

Our goal then is to increase both of these so our correct document is placed higher in the retrieval ranking (calculated as more relevant), and shows up more often within our number of retrieved documents.
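
When reading MRR as a "typical position", it helps to remember that the reciprocal of MRR is the harmonic mean of the ranks, not their arithmetic mean, and that missed queries pull it further. The sketch below uses invented ranks purely to illustrate that gap.

```python
from statistics import harmonic_mean, mean

# Two hypothetical queries: one hit at rank 1, one at rank 10
ranks = [1, 10]
mrr = sum(1.0 / rank for rank in ranks) / len(ranks)

implied_position = 1.0 / mrr       # what "reciprocal of MRR" gives
harmonic = harmonic_mean(ranks)    # identical to the implied position
arithmetic = mean(ranks)           # the actual average rank

print(round(implied_position, 3), round(harmonic, 3), arithmetic)

# Adding a query that misses entirely (reciprocal rank 0) lowers MRR further
mrr_with_miss = (1.0 / 1 + 1.0 / 10 + 0.0) / 3
print(round(1.0 / mrr_with_miss, 3))
```

MRR is therefore best compared against another MRR measured on the same validation set, rather than converted into an exact rank.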

---
# Linear Adapter Training

<img src="./media/adapter_diagram.png" width=800>

As stated in the introduction, we will be training a query-only linear adapter.

This has the added benefits of:
- Super lightweight, only one single linear transformation layer from the embedding
- Minimal added compute at run time, can be trained quickly on minimal hardware 
- No need to re-embed your knowledgebase
- Reported to be almost as effective as full embedding model fine-tuning

We'll be using some of the findings from the [ChromaDB embedding adapters report](https://research.trychroma.com/embedding-adapters), such as triplet loss, random negative sampling, and some of their hyperparameters. These are outlined below:

In [15]:
import random
import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR
from torch.nn.utils import clip_grad_norm_

### Random Negative Sampling

<img src="./media/negative_sampling.png" width=800>

Negative sampling involves randomly selecting unrelated or irrelevant examples (called "negative samples") during the training process.

1. **Purpose**: It helps the model learn to distinguish between relevant and irrelevant information more effectively.

2. **Process**: Along with the correct (positive) matches for a query, the model is also shown randomly selected incorrect (negative) matches.

3. **Benefit**: This exposes the model to a wider range of examples, helping it develop a better understanding of what makes a good match versus a poor one.

4. **Efficiency**: It's a computationally cheap way to improve performance, as it doesn't require carefully curated negative examples.

By introducing these random negative samples, the model adapter better learns to create embeddings that not only bring relevant items closer together in the vector space, but also push irrelevant items further apart. This leads to more robust and discriminative embeddings, ultimately improving the model's ability to retrieve relevant information accurately.

<img src="./media/negative_sampling_2.png" width=300>

In [17]:
# Load NVIDIA 10K Document
# Will be used for random negative sampling
nvidia_loader = PyPDFLoader("/Users/alucek/Documents/Jupyter_Notebooks/ft_emb/data/nvidia_10k.pdf")
nvidia_pages = nvidia_loader.load()

# Concatenate the text of every page into a single string
nvidia_document = "".join(page.page_content for page in nvidia_pages)

nvidia_chunks = text_splitter.split_text(nvidia_document)

In [18]:
def random_negative():
    random_sample = random.choice(nvidia_chunks)
    return random_sample

In [20]:
random_negative()

'price basis. In certain cases, we can establish standalone selling price based on directly observable prices of products or services sold separately in comparable\ncircumstances to similar customers. If standalone selling price is not directly observable, such as when we do not sell a product or service separately , we\ndetermine standalone selling price based on market data and other observable inputs.\nChange in Accounting Estimate\nIn February 2023, we assessed the useful lives of our property , plant, and equipment. Based on advances in technology and usage rate, we increased the\nestimated useful life of a majority of the server , storage, and network equipment from three years to a range of four to five years, and assembly and test\nequipment from five years to seven years. The estimated effect of this change for fiscal year 2024 was a benefit of $33 million and $102 million for cost of\nrevenue and operating expenses, respectively , which resulted in an increase in operating in
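
The sampler above reads from a global list and is unseeded, so every run draws different negatives. If reproducible runs matter, one option is a small factory that takes the pool and a seed. This is a hypothetical variant for illustration, not what the notebook uses, and the pool below is placeholder text.

```python
import random

def make_negative_sampler(negative_pool, seed=None):
    """Return a zero-argument sampler that draws from negative_pool with its own RNG."""
    rng = random.Random(seed)

    def sample():
        return rng.choice(negative_pool)

    return sample

# Placeholder pool standing in for nvidia_chunks
pool = ["unrelated chunk A", "unrelated chunk B", "unrelated chunk C"]

sampler_a = make_negative_sampler(pool, seed=0)
sampler_b = make_negative_sampler(pool, seed=0)

draws_a = [sampler_a() for _ in range(5)]
draws_b = [sampler_b() for _ in range(5)]

# Two samplers with the same seed produce the same sequence of negatives
print(draws_a == draws_b)
```

The returned function has the same zero-argument signature as `random_negative`, so it could be passed wherever a `negative_sampler` is expected.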

### Torch Module and Linear Transformations

<img src="./media/linear_layer.png" width=500>

A linear transformation is a mathematical operation that takes an input vector and produces an output vector while preserving the operations of vector addition and scalar multiplication.

In the context of machine learning and neural networks, a linear layer also adds a bias term (which strictly makes it an affine transformation) and is typically represented as:

$f(x) = Wx + b$

Where:
- $x$ is the input vector
- $W$ is a matrix of weights
- $b$ is a bias vector

Internally, `nn.Linear` creates:
- A weight matrix $W$ of shape (output_features, input_features)
- A bias vector $b$ of shape (output_features)

PyTorch's `Module` class registers both of these as trainable parameters, and they are what we will be optimizing.

In [21]:
class LinearAdapter(nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        self.linear = nn.Linear(input_dim, input_dim)
    
    def forward(self, x):
        return self.linear(x)
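
To see the arithmetic behind the layer without any framework, here is a plain-Python sketch of $f(x) = Wx + b$ using the same (output_features, input_features) weight layout, followed by the parameter count for a 384-dimensional adapter. The 3-dimensional numbers are made up for illustration.

```python
def linear(x, W, b):
    """Compute Wx + b, where W has shape (output_features, input_features)."""
    return [
        sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
        for row, b_i in zip(W, b)
    ]

# Toy 3-dimensional example
x = [0.5, -1.0, 2.0]
W = [
    [2.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
]
b = [0.1, 0.0, -1.0]

y = linear(x, W, b)
print(y)

# Trainable parameters of a square adapter for 384-dimensional embeddings:
# a 384 x 384 weight matrix plus a 384-element bias vector
embedding_dim = 384
adapter_parameters = embedding_dim * embedding_dim + embedding_dim
print(adapter_parameters)
```

At roughly 148 thousand parameters, the adapter is a tiny fraction of the 22.7M-parameter base model, which is what keeps training cheap.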

### Triplet Dataset Preparation

<img src="./media/tripletdataset.png" width=600>

The `TripletDataset` class is a custom dataset that inherits from PyTorch's `Dataset`. It is designed to work with triplet loss, where each data point consists of three parts: a query, a positive example, and a negative example.

1. It retrieves the item at index `idx` from the training data.
2. Extracts the query and positive example from the item.
3. Uses the `negative_sampler` to generate a negative example.
4. Encodes the query, positive, and negative examples into embeddings using the `base_model`.
5. Returns the triplet of embeddings (query, positive, negative).

In [22]:
class TripletDataset(Dataset):
    def __init__(self, data, base_model, negative_sampler):
        self.data = data
        self.base_model = base_model
        self.negative_sampler = negative_sampler

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        item = self.data[idx]
        query = item['question']
        positive = item['chunk']
        negative = self.negative_sampler()
        
        query_emb = self.base_model.encode(query, convert_to_tensor=True)
        positive_emb = self.base_model.encode(positive, convert_to_tensor=True)
        negative_emb = self.base_model.encode(negative, convert_to_tensor=True)
        
        return query_emb, positive_emb, negative_emb

### Trainer Script

This script follows the classic machine learning optimization flow in this way:

1. Learning rate scheduler setup with warmup and decay phases
2. Initialization of LinearAdapter, loss function, optimizer, and dataloader
3. Calculation of total training steps and setup of learning rate scheduler
4. Batch generation from the TripletDataset dataloader
5. Forward pass through the LinearAdapter
6. Triplet Margin Loss calculation
7. Backpropagation and gradient clipping
8. Optimization parameter updates and learning rate adjustment
9. Epoch-wise loss reporting
10. Return of the trained adapter

Let's talk briefly about how random negative sampling and triplet loss are used:

<img src="./media/triplet_loss.png" width=500>

Triplet loss is a type of loss function used in various machine learning tasks, particularly in metric learning and embedding learning. Its primary goal is to learn embeddings such that similar examples are closer together in the embedding space while dissimilar examples are farther apart.

The triplet loss operates on triplets of data points:

1. Anchor (A): The reference sample
2. Positive (P): A sample similar to the anchor
3. Negative (N): A sample dissimilar to the anchor

defined as: $L = \max(d(A, P) - d(A, N) + \text{margin}, 0)$

Where:
- $d(x, y)$ is the distance function (Euclidean)
- margin is a hyperparameter that sets the minimum gap required between the anchor-negative distance and the anchor-positive distance

The loss encourages the model to learn embeddings where:

$d(A, P) < d(A, N) - \text{margin}$

This means the distance between the anchor and positive should be smaller than the distance between the anchor and negative, by at least the margin. In our setup the anchor is the adapted query embedding, the positive is its matching chunk, and the negative is a chunk randomly sampled at training time from our NVIDIA Form 10-K.
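
A worked example of the formula with 2-dimensional points makes the role of the margin visible. This is a plain-Python sketch for a single triplet. PyTorch's `nn.TripletMarginLoss` additionally averages over the batch and adds a small epsilon inside the distance, so its values can differ very slightly. All points below are invented.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """max(d(A, P) - d(A, N) + margin, 0) for one triplet."""
    return max(euclidean(anchor, positive) - euclidean(anchor, negative) + margin, 0.0)

anchor = [0.0, 0.0]
positive = [0.3, 0.4]        # distance 0.5 from the anchor
far_negative = [3.0, 4.0]    # distance 5.0: already separated by more than the margin
near_negative = [0.6, 0.8]   # distance 1.0: closer than positive distance + margin

loss_far = triplet_margin_loss(anchor, positive, far_negative)
loss_near = triplet_margin_loss(anchor, positive, near_negative)
print(loss_far, round(loss_near, 4))
```

Negatives that are already far from the query contribute zero loss, so only triplets that violate the margin produce gradients.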

In [23]:
def get_linear_schedule_with_warmup(optimizer, num_warmup_steps, num_training_steps):
    def lr_lambda(current_step):
        if current_step < num_warmup_steps:
            return float(current_step) / float(max(1, num_warmup_steps))
        return max(0.0, float(num_training_steps - current_step) / float(max(1, num_training_steps - num_warmup_steps)))
    return LambdaLR(optimizer, lr_lambda)

def train_linear_adapter(base_model, train_data, negative_sampler, num_epochs=10, batch_size=32, 
                         learning_rate=2e-5, warmup_steps=100, max_grad_norm=1.0, margin=1.0):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    
    # Initialize the LinearAdapter
    adapter = LinearAdapter(base_model.get_sentence_embedding_dimension()).to(device)
    
    # Define loss function and optimizer
    triplet_loss = nn.TripletMarginLoss(margin=margin, p=2)
    optimizer = AdamW(adapter.parameters(), lr=learning_rate)
    
    # Create dataset and dataloader
    dataset = TripletDataset(train_data, base_model, negative_sampler)
    dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    
    # Calculate total number of training steps
    total_steps = len(dataloader) * num_epochs
    
    # Create learning rate scheduler
    scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=warmup_steps, num_training_steps=total_steps)
    
    # Training loop
    for epoch in range(num_epochs):
        total_loss = 0
        for batch in dataloader:
            query_emb, positive_emb, negative_emb = [x.to(device) for x in batch]
            
            # Forward pass
            adapted_query_emb = adapter(query_emb)
            
            # Compute loss
            loss = triplet_loss(adapted_query_emb, positive_emb, negative_emb)
            
            # Backward pass and optimization
            optimizer.zero_grad()
            loss.backward()
            
            # Gradient clipping
            clip_grad_norm_(adapter.parameters(), max_grad_norm)
            
            optimizer.step()
            scheduler.step()
            
            total_loss += loss.item()
        
        print(f"Epoch {epoch+1}/{num_epochs}, Loss: {total_loss/len(dataloader):.4f}")
    
    return adapter
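
To see what the warmup and decay schedule does with this notebook's sizes, the sketch below evaluates the same piecewise-linear multiplier at a few steps. It assumes 3440 training pairs, a batch size of 32, and 100 warmup steps, and it uses the fact that a `DataLoader` keeps the final partial batch by default. The function name `lr_multiplier` is just a standalone copy of the inner `lr_lambda` for illustration.

```python
import math

def lr_multiplier(current_step, num_warmup_steps, num_training_steps):
    """Linear ramp from 0 to 1 during warmup, then linear decay back to 0."""
    if current_step < num_warmup_steps:
        return current_step / max(1, num_warmup_steps)
    remaining = num_training_steps - current_step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

# Assumed sizes from this notebook: 3440 training pairs, batch size 32
steps_per_epoch = math.ceil(3440 / 32)

for num_epochs in (1, 30):
    total_steps = steps_per_epoch * num_epochs
    checkpoints = [0, 50, 100, total_steps // 2, total_steps]
    multipliers = [round(lr_multiplier(step, 100, total_steps), 3) for step in checkpoints]
    print(num_epochs, total_steps, list(zip(checkpoints, multipliers)))
```

Under these assumptions a single epoch spends most of its steps in warmup and reaches the configured learning rate only near the end, whereas a 30-epoch run spends only a small fraction of its steps warming up.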

### Training Function Execution!

Trains the adapter, then saves the hyperparameters and the adapter weights in the same file.

In [24]:
# Define the kwargs dictionary
adapter_kwargs = {
    'num_epochs': 1,
    'batch_size': 32,
    'learning_rate': 0.003,
    'warmup_steps': 100,
    'max_grad_norm': 1.0,
    'margin': 1.0
}

# Train the adapter using the kwargs dictionary
trained_adapter = train_linear_adapter(base_model, train_data, random_negative, **adapter_kwargs)

# Create a dictionary to store both the adapter state_dict and the kwargs
save_dict = {
    'adapter_state_dict': trained_adapter.state_dict(),
    'adapter_kwargs': adapter_kwargs
}

# Save the combined dictionary
torch.save(save_dict, '/Users/alucek/Documents/Jupyter_Notebooks/ft_emb/adapters/linear_adapter_1epoch.pth')

Epoch 1/1, Loss: 0.4718


---
# Evaluate Adapter Performance

Now that we have our trained query-only linear adapter, let's assess its performance on our knowledgebase compared to our baseline model.

### Applying the Adapter

Below is the function to get the original embedding for the query and run it through our trained adapter.

This is essentially the new function for embedding your user query.

In [25]:
# Function to encode query using the adapter
def encode_query(query, base_model, adapter):
    device = next(adapter.parameters()).device
    query_emb = base_model.encode(query, convert_to_tensor=True).to(device)
    adapted_query_emb = adapter(query_emb)
    return adapted_query_emb.cpu().detach().numpy()

### Loading the Adapter

In [50]:
# Later, loading and using the saved information
loaded_dict = torch.load('/Users/alucek/Documents/Jupyter_Notebooks/ft_emb/adapters/linear_adapter_30epochs.pth')

# Recreate the adapter
loaded_adapter = LinearAdapter(base_model.get_sentence_embedding_dimension())  # Initialize with appropriate parameters
loaded_adapter.load_state_dict(loaded_dict['adapter_state_dict'])

# Access the training parameters
training_params = loaded_dict['adapter_kwargs']

print("Adapter loaded successfully.")
print("Training parameters used:")
for key, value in training_params.items():
    print(f"{key}: {value}")

Adapter loaded successfully.
Training parameters used:
num_epochs: 30
batch_size: 32
learning_rate: 0.003
warmup_steps: 100
max_grad_norm: 1.0
margin: 1.0


### Adapter Evaluation Function

A new evaluation function that replicates the baseline experiment, this time passing each query through the adapter.

In [53]:
def evaluate_adapter(validation_data, base_model, adapter, k=10):
    hit_rates = []
    reciprocal_ranks = []
    
    for data_point in validation_data:
        question = data_point['question']
        ground_truth = data_point['chunk']
        
        # Generate embedding for the question
        question_embedding = encode_query(question, base_model, adapter)
        # Retrieve documents using the embedding
        retrieved_docs = retrieve_documents_embeddings(question_embedding, k)
        
        # Calculate metrics
        hr = hit_rate(retrieved_docs, ground_truth, k)
        rr = reciprocal_rank(retrieved_docs, ground_truth, k)
        
        hit_rates.append(hr)
        reciprocal_ranks.append(rr)
    
    # Calculate average metrics
    avg_hit_rate = np.mean(hit_rates)
    avg_reciprocal_rank = np.mean(reciprocal_ranks)
    
    return {
        'average_hit_rate': avg_hit_rate,
        'average_reciprocal_rank': avg_reciprocal_rank
    }

results = evaluate_adapter(validation_data, base_model, loaded_adapter, k=10)
print(f"Average Hit Rate @10: {results['average_hit_rate']}")
print(f"Mean Reciprocal Rank @10: {results['average_reciprocal_rank']}")

Average Hit Rate @10: 0.6662790697674419
Mean Reciprocal Rank @10: 0.33240956072351424


After training, our adapter gives an average hit rate (Recall@10) of 66.6%, up 5.5 percentage points from the baseline of 61.2%, **an 8.9% relative improvement**. The mean reciprocal rank rises to 0.332 from the baseline of 0.313, **a 6.1% relative improvement**; taking reciprocals, that corresponds roughly to position 3.0 versus 3.2.
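
The absolute and relative gains can be recomputed directly from the values printed by the two evaluation cells in this notebook. The sketch below does only that arithmetic; the helper name `improvement` is hypothetical.

```python
# Metrics as printed by the baseline and adapter evaluation cells above
baseline = {"hit_rate": 0.6116279069767442, "mrr": 0.31327104097452935}
adapted = {"hit_rate": 0.6662790697674419, "mrr": 0.33240956072351424}

def improvement(before, after):
    """Return (absolute gain, relative gain) of after over before."""
    absolute = after - before
    relative = absolute / before
    return absolute, relative

gains = {name: improvement(baseline[name], adapted[name]) for name in baseline}

for name, (absolute, relative) in gains.items():
    print(f"{name}: +{absolute:.4f} absolute, +{relative * 100:.1f}% relative")
```

Reporting both numbers avoids confusing a percentage-point change with a relative one.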

### Validation Metrics Compared to Baseline
<img src="./media/validation_chart.png" width=1200>

With these hyperparameters, the adapter trained for 30 epochs gave the biggest improvement; at 40 epochs it began to overfit and generalize less well. To see how our model is fitting to the data, we can run the metrics on our training data, visualized below:

### Visualizing Model Fitting on Training Data
<img src="./media/training_fit.png" width=1200>

The fit is decent, without rampant overfitting. User queries that closely match frequent queries in the training data should see a noticeable boost in retrieval accuracy for the expected document.