# Overview
The purpose of this notebook is to demonstrate how to evaluate an embedding model as well as how to evaluate your chunking strategy. Typically, you get better performance boosts out of picking a solid chunking strategy over swapping out an embedding model. In this notebook, we will demonstrate both. 

# Background
Before evaluating an embedding model, it’s important to understand “What” we’re using an embedding model for. The most popular public benchmark is the Massive Text Embedding Benchmark or MTEB. HuggingFace maintains a leaderboard to compare general purpose embedding models against each other to see how they stack up against a wide range of tasks. 

This is a decent starting place, but you have to ask yourself, how well does this dataset compliment the task I really care about. If I’m creating a RAG Solution for Lawyers, I’m much more interested in how well the embedding model works for comparing legal text vs. how well it works for medical text. This is why it’s important to build out your own evaluation. A model that might not rank high on a general-purpose benchmark, could rank very high on your specific use case. If none of them work very well, then you can make a case for fine tuning an existing model on your data.

**What Metrics Should You Care About?**
To answer this question, we need to understand what we’re using the embedding model for. For an information retrieval use case, we care about different metrics than we would for a clustering use case. Because retrieval tends to be the most important use case in RAG, lets focus on retrieval.

Classic methods apply here to embeddings model like recall@k, precision@k, etc..

**How to Evaluate**
To perform this evaluation, you need to set up a retrieval task. Generate vector representations of items (documents or chunks) in a shared semantic space and perform a K-nearest-neighbor search on them using a similarity measure (e.g. cosine-similarity, dot-product). This gives you the top-k retrieved item for each query.

You need a set of relevance judgments that indicate which documents are relevant to each query. These are typically created by human annotators or derived from click data in product systems.

For each query, count the number of relevant items in the top-k retrieved results. Calculate the precision using (number of relevant documents / k). Average the precision values across all queries. 
Apply these same techniques to other metrics like recall, NDCG, or MAP for a more comprehensive evaluation.

## How to create relevance judgements? 
This is a pretty manual process. For this example, I pasted large chunks of the Opensearch documentation into Claude and asked Claude to come up with a couple example questions about the context. I find it easier to build a validation set where the answer corresponds to 1 to 3 pages. You'll likely tweak your chunking strategy over time, but the relative file paths will stay constant so you don't have to redo your validation dataset every time you make a change to the chunks. 


# What Will We Do? 
* We will start with a basic sentence splitting chunking strategy, create embeddings for them, and store them in an in memory vector store (chromaDB).
* We will then use our evaluation dataset (which I already created) to run multiple experiments using different embedding models and chunking strategies to see which gives us the best results based on our metrics. 

**Lets get started!**

# Initialize clients

In [4]:
import chromadb
import boto3
from chromadb.config import Settings

# Initialize Chroma client from our persisted store
chroma_client = chromadb.PersistentClient(path="../data/chroma")

# Also initialize the bedrock client so we can call some embedding models!
session = boto3.Session(profile_name='default')
bedrock = boto3.client('bedrock-runtime')

# Start Running Experiments!

## Experiment 1
In this first experiment we're going to set up a retrieval task using ChromaDB, Titan Text V1 as our embedding mode, and use very small chunks.


#### Word of Caution: On Using GenAI Frameworks
For chunking, we'll use LlamaIndex. There are many tools/frameworks for ingesting documents and implementing chunking strategies. I personally like LlamaIndex because it offers a lot of advanced chunking options. It creates "nodes" that can be converted into different formats for ingestion.

When using these GenAI frameworks, it's best to not rely too heavily on them. In the example below, we'll use LlamaIndex but we'll wrap the chunking logic in a class and normalize the output to a class that we create named RAGChunk. This way, we aren't too reliant on the framework. None of these frameworks are particularly "stable" (as of 09/2024) and newer versions are often times not backwards compatible. 

It's best to contain the package to minimize the impact on the rest of your system if/when you need to have more control.

In [5]:
from typing import List, Dict, Any
from pydantic import BaseModel
from llama_index.core import Document
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.schema import Node
from llama_index.core import SimpleDirectoryReader
from llama_index.core.ingestion import IngestionPipeline

import re

# Create a class to use instead of LlamaIndex Nodes. This way we decouple our chroma collections from LlamaIndexes
class RAGChunk(BaseModel):
    id_: str
    text: str
    metadata: Dict[str, Any] = {}


class SentenceSplitterChunkingStrategy:
    def __init__(self, input_dir: str, chunk_size: int = 256, chunk_overlap: int = 128):
        self.input_dir = input_dir
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        self.pipeline = self._create_pipeline()

        # Helper to get regex pattern for normalizing relative file paths.
        self.relative_path_pattern = rf"{re.escape(input_dir)}(/.*)"

    def _extract_relative_path(self, full_path):
        # Get Regex pattern
        pattern = self.relative_path_pattern
        match = re.search(pattern, full_path)
        if match:
            return match.group(1).lstrip('/')
        return None

    def _create_pipeline(self) -> IngestionPipeline:
        transformations = [
            SentenceSplitter(chunk_size=self.chunk_size, chunk_overlap=self.chunk_overlap),
        ]
        return IngestionPipeline(transformations=transformations)

    def load_documents(self) -> List[Document]:
        # If you're using a different type of file besides md, you'll want to change this. 
        return SimpleDirectoryReader(
            input_dir=self.input_dir, 
            recursive=True,
            required_exts=['.md']
        ).load_data()

    def to_ragchunks(self, nodes: List[Node]) -> List[RAGChunk]:
        return [
            RAGChunk(
                id_=node.node_id,
                text=node.text,
                metadata={
                    **node.metadata,
                    'relative_path': self._extract_relative_path(node.metadata['file_path'])
                }
            )
            for node in nodes
        ]

    def process(self) -> List[RAGChunk]:
        documents = self.load_documents()
        nodes = self.pipeline.run(documents=documents)
        rag_chunks = self.to_ragchunks(nodes)
        
        print(f"Processing complete. Created {len(rag_chunks)} chunks.")
        return rag_chunks

## Create Chunks
In this step we'll use a custom wrapper we built around LlamaIndex. It will split up documents from the input dir into ~512 chunk sizes with the overlap (or smaller if the file isn't that big). We should get around ~16k chunks.

In [6]:
chunking_strategy = SentenceSplitterChunkingStrategy(
    input_dir="../data/opensearch-docs/documentation-website",
    chunk_size=256,
    chunk_overlap=128
)

# Get the nodes from the chunker.
chunks: RAGChunk = chunking_strategy.process()

Processing complete. Created 23494 chunks.


### Setup Retrieval Task
The next step is to set up a retrieval task. To do this, we'll use chromaDB as our vector database. We've built a wrapper around the retrieval task and created a BaseRetrievalTask class to inherit from. If you'd like to experiment with more complicated retrieval pattern, you can write your own implementation and the rest of the notebook will run accordingly. 

We'll also leverage Chromas feature that allows us to specify an embedding function when creating a collection. This makes ingestion simpler because Chroma will automatically apply the same embedding function to our queries as it did for our documents. It's just nicer to keep the embedding function and DB together. 

In [7]:
from pydantic import BaseModel
from typing import List, Dict
from abc import ABC, abstractmethod
import chromadb
from chromadb.api.types import EmbeddingFunction
from typing import List, Dict, Any
from concurrent.futures import ThreadPoolExecutor, as_completed
from chromadb.utils.embedding_functions import AmazonBedrockEmbeddingFunction


class RetrievalResult(BaseModel):
    id: str
    document: str
    embedding: List[float]
    distance: float
    metadata: Dict = {}

# Base retrieval class. Can be reused if you decide to implement a different retrieval class.
class BaseRetrievalTask(ABC):
    @abstractmethod
    def retrieve(self, query_text: str, n_results: int) -> List[RetrievalResult]:
        """
        Retrieve documents based on the given query.

        Args:
            query (str): The query string to search for.

        Returns:
            List[RetrievalResult]: A list of RetrievalResult objects that are relevant to the query.
        """
        pass



# Example of a concrete implementation
class ChromaDBRetrievalTask(BaseRetrievalTask):

    def __init__(self, chroma_client, collection_name: str, embedding_function, chunks: List[RAGChunk]):
        self.client = chroma_client
        self.collection_name = collection_name
        self.embedding_function = embedding_function
        self.chunks = chunks

        # Create the collection
        self.collection = self._create_collection()

    def _create_collection(self):
        return self.client.get_or_create_collection(
            name=self.collection_name,
            embedding_function=self.embedding_function
        )

    def add_chunks_to_collection(self, batch_size: int = 20, num_workers: int = 10):
        batches = [self.chunks[i:i + batch_size] for i in range(0, len(self.chunks), batch_size)]
        
        with ThreadPoolExecutor(max_workers=num_workers) as executor:
            futures = [executor.submit(self._add_batch, batch) for batch in batches]
            for future in as_completed(futures):
                future.result()  # This will raise an exception if one occurred during the execution
        print('Finished Ingesting Chunks Into Collection')

    def _add_batch(self, batch: List[RAGChunk]):
        self.collection.add(
            ids=[chunk.id_ for chunk in batch],
            documents=[chunk.text for chunk in batch],
            metadatas=[chunk.metadata for chunk in batch]
        )

    def retrieve(self, query_text: str, n_results: int = 5) -> List[RetrievalResult]:
        # Query the collection
        results = self.collection.query(
            query_texts=[query_text],
            n_results=n_results,
            include=['embeddings', 'documents', 'metadatas', 'distances']
        )

        # Transform the results into RetrievalResult objects
        retrieval_results = []
        for i in range(len(results['ids'][0])):
            retrieval_results.append(RetrievalResult(
                id=results['ids'][0][i],
                document=results['documents'][0][i],
                embedding=results['embeddings'][0][i],
                distance=results['distances'][0][i],
                metadata=results['metadatas'][0][i] if results['metadatas'][0] else {}
            ))

        return retrieval_results

### Populate the vectorDB
In the next section we'll define our embedding function and populate the in memory database with our vectors. **Note** We've already indexed the chunks for you which are loaded when retrieving or creating the collection and specifying a persistant chromaDB store. If you'd like to redo it yourself, feel free to, but it does take a while.

In [8]:
from chromadb.utils.embedding_functions import AmazonBedrockEmbeddingFunction

# Define some experiment variables
TITAN_TEXT_EMBED_V1_ID: str = 'amazon.titan-embed-text-v1'
EXPERIMENT_1_COLLECTION_NAME: str = 'experiment_1_collection'

# This is a handy function Chroma implemented for calling bedrock. Lets use it!
embedding_function = AmazonBedrockEmbeddingFunction(
    session=session,
    model_name=TITAN_TEXT_EMBED_V1_ID
)

# Create our retrieval task. All retrieval tasks in this tutorial implement BaseRetrievalTask which has the method retrieve()
# If you'd like to extend this to a different retrieval configuration, all you have to do is create a class that that implements
# this abstract class and the rest is the same!
experiment_1_retrieval_task: BaseRetrievalTask = ChromaDBRetrievalTask(
    chroma_client = chroma_client, 
    collection_name = EXPERIMENT_1_COLLECTION_NAME,
    embedding_function = embedding_function,
    chunks = chunks
)

# If you've already created collection, comment out this line
experiment_1_retrieval_task.add_chunks_to_collection()

Finished Ingesting Chunks Into Collection


In [9]:
# Lets verify it works!
print(len(experiment_1_retrieval_task.retrieve('What does * do?', n_results=1)) == 1)

True


### Pull In Validation Dataset
We've already created a validation dataset for you. Through a combination of human curation & trial and error, we created a set of 25 questions users might ask a RAG system designed to answer questions from OpenSearch documentation. We've also annotated the questions with the relevant relative paths of the documents.

In [10]:
import pandas as pd

def get_clean_eval_dataset():
    EVAL_PATH = '../data/eval-datasets/1_embeddings_validation.csv'
    eval_df = pd.read_csv(EVAL_PATH)

    # Clean up the DataFrame
    eval_df = eval_df.rename(columns=lambda x: x.strip())  # Remove any leading/trailing whitespace from column names
    eval_df = eval_df.drop(columns=[col for col in eval_df.columns if col.startswith('Unnamed')])  # Remove unnamed columns
    eval_df = eval_df.dropna(how='all')  # Remove rows that are all NaN
    
    # Strip whitespace from string columns
    for col in eval_df.select_dtypes(['object']):
        eval_df[col] = eval_df[col].str.strip()
    
    # Ensure 'relevant_doc_ids' is a string column
    eval_df['relevant_doc_ids'] = eval_df['relevant_doc_ids'].astype(str)

    return eval_df

eval_df = get_clean_eval_dataset()

### Define Metrics
The IRMetricsCalculator below calculates a series of metrics that will be useful when evaluating your RAG system. Remember, we are only evaluating the retrieval at this stage, not the models ability to create an answer from the IR results.

#### Metrics
* precision@k:
* recall@k:
* ndcg@k

These individual metrics will be our basis for creating an aggregate view of our validation dataset to get a sense for how well it's performing

In [11]:
import json
import numpy as np

#  Helper class for calculating metrics.
class IRMetricsCalculator:
    def __init__(self, df):
        self.df = df

    @staticmethod
    def precision_at_k(relevant, retrieved, k):
        retrieved_k = retrieved[:k]
        return len(set(relevant) & set(retrieved_k)) / k if k > 0 else 0

    @staticmethod
    def recall_at_k(relevant, retrieved, k):
        retrieved_k = retrieved[:k]
        return len(set(relevant) & set(retrieved_k)) / len(relevant) if len(relevant) > 0 else 0

    @staticmethod
    def dcg_at_k(relevant, retrieved, k):
        retrieved_k = retrieved[:k]
        dcg = 0
        for i, item in enumerate(retrieved_k):
            if item in relevant:
                dcg += 1 / np.log2(i + 2)
        return dcg

    @staticmethod
    def ndcg_at_k(relevant, retrieved, k):
        dcg = IRMetricsCalculator.dcg_at_k(relevant, retrieved, k)
        idcg = IRMetricsCalculator.dcg_at_k(relevant, relevant, k)
        return dcg / idcg if idcg > 0 else 0

    @staticmethod
    def parse_json_list(json_string):
        try:
            return json.loads(json_string)
        except json.JSONDecodeError as e:
            print(f"Error parsing JSON: {json_string} with error {e}")
            return []

    def calculate_metrics(self, k_values=[1, 3, 5, 10]):
        for k in k_values:
            self.df[f'precision@{k}'] = self.df.apply(lambda row: self.precision_at_k(
                self.parse_json_list(row['relevant_doc_ids']),
                self.parse_json_list(row['retrieved_doc_ids']), k), axis=1)
            self.df[f'recall@{k}'] = self.df.apply(lambda row: self.recall_at_k(
                self.parse_json_list(row['relevant_doc_ids']),
                self.parse_json_list(row['retrieved_doc_ids']), k), axis=1)
            self.df[f'ndcg@{k}'] = self.df.apply(lambda row: self.ndcg_at_k(
                self.parse_json_list(row['relevant_doc_ids']),
                self.parse_json_list(row['retrieved_doc_ids']), k), axis=1)
        return self.df

# Setup Task Runner
In the step below we'll setup a task runner that will iterate through our dataframe, run a retrieval task on the input and use our IRCalculator to generate metrics on the results

In [12]:
class RetrievalTaskRunner:
    def __init__(self, eval_df: pd.DataFrame, retrieval_task: BaseRetrievalTask):
        self.eval_df = eval_df
        self.retrieval_task = retrieval_task

    def _get_unique_file_paths(self, results: List[RetrievalResult]) -> List[str]:
        # Since Python 3.7, dicts retain insertion order.
        return list(dict.fromkeys(r.metadata['relative_path'] for r in results))
        

    def run(self) -> pd.DataFrame:
        # Make a copy of the dataframe so we don't modify the original.
        df = pd.DataFrame(self.eval_df)
        
        results = []
        for index, row in df.iterrows():
            query: str = row['query_text']
            
            # Run retrieval task
            retrieval_results: List[RetrievalResult] = self.retrieval_task.retrieve(query)
            
            # Extract unique page numbers for comparison with validation dataset.
            ordered_filepaths: List[str] = self._get_unique_file_paths(retrieval_results)

            retrieved_chunks = [ {'relative_path': r.metadata['relative_path'], 'chunk': r.document} for r in retrieval_results ]

            # Create new record
            result = {
                'query_text': query,
                'relevant_doc_ids': row['relevant_doc_ids'],
                'retrieved_doc_ids': json.dumps(ordered_filepaths),
                'retrieved_chunks': json.dumps(retrieved_chunks), # Best way to preserve the chunks
            }
            results.append(result)

        new_dataframe = pd.DataFrame(results)
        # return new_dataframe

        ir_calc: IRMetricsCalculator = IRMetricsCalculator(new_dataframe)
        return ir_calc.calculate_metrics()

### Execute first experiment
In the command below, we'll execute our first experiment and store the results in a dataframe

In [13]:
experiment_1_results: pd.DataFrame = RetrievalTaskRunner(eval_df, experiment_1_retrieval_task).run()

## Create a Summary
You can view the results for each individual query, but it doesn't quite tell the whole story. The last thing we need to do to conclude experiment 1 is to create a summary view showing Mean Average Precision, Mean Reciprocal Rank (MRR), as well as general averages across all the individual metrics we calculated. This should let us know how well the retrieval task is performing.

To do that, we've created another helper class to summarize the results and give us a comprehensive view of the results

In [14]:
import pandas as pd
import numpy as np
from typing import List

class ExperimentSummarizer:
    def __init__(self, df):
        self.df = pd.DataFrame(df)
        self.summary_df = None

    @staticmethod
    def calculate_ap(relevant_docs, retrieved_docs):
        relevant_set = set(relevant_docs.split(','))
        retrieved_list = retrieved_docs.split(',')
        relevant_count = 0
        total_precision = 0
        
        for i, doc in enumerate(retrieved_list, 1):
            if doc in relevant_set:
                relevant_count += 1
                total_precision += relevant_count / i
        
        return total_precision / len(relevant_set) if relevant_set else 0

    @staticmethod
    def calculate_reciprocal_rank(relevant_docs, retrieved_docs):
        relevant_set = set(relevant_docs.split(','))
        retrieved_list = retrieved_docs.split(',')
        
        for i, doc in enumerate(retrieved_list, 1):
            if doc in relevant_set:
                return 1 / i
        
        return 0

    def calculate_map(self):
        self.df['AP'] = self.df.apply(lambda row: self.calculate_ap(row['relevant_doc_ids'], row['retrieved_doc_ids']), axis=1)
        return self.df['AP'].mean()

    def calculate_mrr(self):
        self.df['RR'] = self.df.apply(lambda row: self.calculate_reciprocal_rank(row['relevant_doc_ids'], row['retrieved_doc_ids']), axis=1)
        return self.df['RR'].mean()

    def calculate_mean_metrics(self):
        return self.df[[
            'precision@1', 'recall@1', 'ndcg@1',
            'precision@3', 'recall@3', 'ndcg@3',
            'precision@5', 'recall@5', 'ndcg@5'
        ]].mean()

    def calculate_top_k_percentages(self):
        top_1 = (self.df['precision@1'] > 0).mean() * 100
        top_3 = (self.df['precision@3'] > 0).mean() * 100
        top_5 = (self.df['precision@5'] > 0).mean() * 100
        return top_1, top_3, top_5

    def analyze(self):
        map_score = self.calculate_map()
        mrr_score = self.calculate_mrr()
        mean_metrics = self.calculate_mean_metrics()
        top_1, top_3, top_5 = self.calculate_top_k_percentages()

        self.summary_df = pd.DataFrame({
            'Metric': [
                'MAP (Mean Average Precision)',
                'MRR (Mean Reciprocal Rank)',
                'Mean Precision@1', 'Mean Recall@1', 'Mean NDCG@1',
                'Mean Precision@3', 'Mean Recall@3', 'Mean NDCG@3',
                'Mean Precision@5', 'Mean Recall@5', 'Mean NDCG@5',
                '% Queries with Relevant Doc in Top 1',
                '% Queries with Relevant Doc in Top 3',
                '% Queries with Relevant Doc in Top 5'
            ],
            'Value': [
                map_score,
                mrr_score,
                mean_metrics['precision@1'], mean_metrics['recall@1'], mean_metrics['ndcg@1'],
                mean_metrics['precision@3'], mean_metrics['recall@3'], mean_metrics['ndcg@3'],
                mean_metrics['precision@5'], mean_metrics['recall@5'], mean_metrics['ndcg@5'],
                top_1, top_3, top_5
            ]
        })
        return self.summary_df

    def get_summary(self):
        if self.summary_df is None:
            self.analyze()
        return self.summary_df

In [15]:
# Lets use the class above to create aggregate metrics to see how well the system performs.
experiment_1_summary = ExperimentSummarizer(experiment_1_results).analyze()

In [16]:
experiment_1_summary

Unnamed: 0,Metric,Value
0,MAP (Mean Average Precision),0.173611
1,MRR (Mean Reciprocal Rank),0.229167
2,Mean Precision@1,0.333333
3,Mean Recall@1,0.236111
4,Mean NDCG@1,0.333333
5,Mean Precision@3,0.180556
6,Mean Recall@3,0.416667
7,Mean NDCG@3,0.36345
8,Mean Precision@5,0.125
9,Mean Recall@5,0.472222


# Takeaways from Experiment 1
The results aren't.. terrible. But they could also be a lot better. We plan to add a ReRank step in the next notebook, so while rank based metrics are important at this stage, we care more about whether the top k results have relevant data in them. This makes recall@5 arguably the most important metric to evaluate on in this step.

However, all the metrics are important to understand how the base IR task is performing. We want to limit the amount of context we pass back to the model to save on input token cost so knowing precesion@1 and precision@5 give us an idea of how well the embeddings are working on their own at ranking the results. We can then take these runs and compare it against a re-ranked list to see (if) re-rank improves this task.

## Next Steps
Let's pick larger chunks to see if it performs better.

# Experiment 2 - Larger Chunk Sizes
At 2046 chunk sizes, there's very few documents in our opensearch docs that would be split up. This essentially becomes page level chunking. Lets try it out!


In [17]:
# Lets define some smaller chunks
chunking_strategy = SentenceSplitterChunkingStrategy(
    input_dir="../data/opensearch-docs/documentation-website",
    chunk_size=2046,
    chunk_overlap=128
)

# Get the nodes from the chunker.
chunks: RAGChunk = chunking_strategy.process()

# Define some experiment variables
TITAN_TEXT_EMBED_V1_ID: str = 'amazon.titan-embed-text-v1'
EXPERIMENT_2_COLLECTION_NAME: str = 'experiment_2_collection'

# This is a handy function Chroma implemented for calling bedrock. Lets use it!
embedding_function = AmazonBedrockEmbeddingFunction(
    session=session,
    model_name=TITAN_TEXT_EMBED_V1_ID
)

# Create our retrieval task. All retrieval tasks in this tutorial implement BaseRetrievalTask which has the method retrieve()
# If you'd like to extend this to a different retrieval configuration, all you have to do is create a class that that implements
# this abstract class and the rest is the same!
experiment_2_retrieval_task: BaseRetrievalTask = ChromaDBRetrievalTask(
    chroma_client = chroma_client, 
    collection_name = EXPERIMENT_2_COLLECTION_NAME,
    embedding_function = embedding_function,
    chunks = chunks
)

Processing complete. Created 2079 chunks.


In [18]:
# If you've already created collection, comment out this line
experiment_2_retrieval_task.add_chunks_to_collection()

Finished Ingesting Chunks Into Collection


In [19]:
# Lets verify it works!
print(len(experiment_2_retrieval_task.retrieve('What does * do?', n_results=1)) == 1)

True


In [20]:
# Setup a new Task Runner for experiment 2
experiment_2_results: pd.DataFrame = RetrievalTaskRunner(eval_df, experiment_2_retrieval_task).run()

In [21]:
# Lets use the class above to create aggregate metrics to see how well the system performs.
experiment_2_summary = ExperimentSummarizer(experiment_2_results).analyze()

In [22]:
print(experiment_2_summary)

                                  Metric      Value
0           MAP (Mean Average Precision)   0.062500
1             MRR (Mean Reciprocal Rank)   0.125000
2                       Mean Precision@1   0.416667
3                          Mean Recall@1   0.326389
4                            Mean NDCG@1   0.416667
5                       Mean Precision@3   0.208333
6                          Mean Recall@3   0.437500
7                            Mean NDCG@3   0.414691
8                       Mean Precision@5   0.141667
9                          Mean Recall@5   0.520833
10                           Mean NDCG@5   0.448755
11  % Queries with Relevant Doc in Top 1  41.666667
12  % Queries with Relevant Doc in Top 3  50.000000
13  % Queries with Relevant Doc in Top 5  58.333333


## Compare Experiment 2 with Experiment 1
The results above look better. However, it's kind of hard to visualize how much better. Lets use another helper class to compare the results between two experiments and pretty print the results!

In [23]:
class ExperimentComparator:
    def __init__(self, *experiment_data):
        self.experiments = experiment_data

    def compare_metrics(self):
        merged_df = pd.DataFrame({'Metric': self.experiments[0][0]['Metric']})
        for df, name in self.experiments:
            merged_df = pd.merge(merged_df, df, on='Metric', how='left')
            merged_df = merged_df.rename(columns={'Value': name})
        
        base_exp = self.experiments[0][1]
        for df, name in self.experiments[1:]:
            merged_df[f'Change_{name}_vs_{base_exp}'] = merged_df[name] - merged_df[base_exp]
            merged_df[f'PercentChange_{name}_vs_{base_exp}'] = ((merged_df[name] - merged_df[base_exp]) / merged_df[base_exp]) * 100
        
        return merged_df

    def print_comparison(self):
        comparison = self.compare_metrics()
        
        def color_change(val):
            if pd.isna(val):
                return ''
            return 'color: red' if val < 0 else 'color: green' if val > 0 else ''
        
        def background_color_change(val):
            if pd.isna(val):
                return ''
            return 'background-color: #ffcccb' if val < 0 else 'background-color: #90ee90' if val > 0 else ''
        
        change_columns = [col for col in comparison.columns if col.startswith('Change_') or col.startswith('PercentChange_')]
        styled = comparison.style
        
        for col in change_columns:
            styled = styled.map(color_change, subset=[col])
            styled = styled.map(background_color_change, subset=[col])
        
        numeric_columns = comparison.select_dtypes(include=[np.number]).columns
        format_dict = {col: '{:.6f}' for col in numeric_columns}
        
        for col in change_columns:
            if col.startswith('PercentChange_'):
                format_dict[col] = '{:.2f}%'
        
        styled = styled.format(format_dict)
        return styled

    def analyze(self):
        return self.print_comparison()

In [24]:
experiment_comparator = ExperimentComparator(
    (experiment_1_summary, "Experiment1"),
    (experiment_2_summary, "Experiment2")
)
experiment_comparator.analyze()

Unnamed: 0,Metric,Experiment1,Experiment2,Change_Experiment2_vs_Experiment1,PercentChange_Experiment2_vs_Experiment1
0,MAP (Mean Average Precision),0.173611,0.0625,-0.111111,-64.00%
1,MRR (Mean Reciprocal Rank),0.229167,0.125,-0.104167,-45.45%
2,Mean Precision@1,0.333333,0.416667,0.083333,25.00%
3,Mean Recall@1,0.236111,0.326389,0.090278,38.24%
4,Mean NDCG@1,0.333333,0.416667,0.083333,25.00%
5,Mean Precision@3,0.180556,0.208333,0.027778,15.38%
6,Mean Recall@3,0.416667,0.4375,0.020833,5.00%
7,Mean NDCG@3,0.36345,0.414691,0.051241,14.10%
8,Mean Precision@5,0.125,0.141667,0.016667,13.33%
9,Mean Recall@5,0.472222,0.520833,0.048611,10.29%


# Takeaways from Experiment 2
Switching to larger chunks actually made the results better. That's why we build validation datasets to understand what works!

Diving into the results a bit. It seems like the retrieval in general could use a boost in general. We're using an older version of Titan Text embeddings, lets use a newer version with the same chunk size as experiment 2 and see how that improves

# Experiment 3 - Titan Embeddings V2

In [25]:
# Lets define some smaller chunks
chunking_strategy = SentenceSplitterChunkingStrategy(
    input_dir="../data/opensearch-docs/documentation-website",
    chunk_size=2048,
    chunk_overlap=128
)

# Get the nodes from the chunker.
chunks: RAGChunk = chunking_strategy.process()

# Define some experiment variables
TITAN_TEXT_EMBED_V2_ID: str = "amazon.titan-embed-text-v2:0"
EXPERIMENT_3_COLLECTION_NAME: str = 'experiment_3_collection'

# Update our embeddings model to a newer one.
embedding_function = AmazonBedrockEmbeddingFunction(
    session=session,
    model_name=TITAN_TEXT_EMBED_V2_ID
)

# Create our retrieval task. All retrieval tasks in this tutorial implement BaseRetrievalTask which has the method retrieve()
# If you'd like to extend this to a different retrieval configuration, all you have to do is create a class that that implements
# this abstract class and the rest is the same!
experiment_3_retrieval_task: BaseRetrievalTask = ChromaDBRetrievalTask(
    chroma_client = chroma_client, 
    collection_name = EXPERIMENT_3_COLLECTION_NAME,
    embedding_function = embedding_function,
    chunks = chunks
)    

Processing complete. Created 2078 chunks.


In [26]:
# # If you've already created collection, comment out this line
experiment_3_retrieval_task.add_chunks_to_collection()

Finished Ingesting Chunks Into Collection


In [27]:
# Setup a new Task Runner for experiment 2
experiment_3_results: pd.DataFrame = RetrievalTaskRunner(eval_df, experiment_3_retrieval_task).run()
# Lets use the class above to create aggregate metrics to see how well the system performs.
experiment_3_summary = ExperimentSummarizer(experiment_3_results).analyze()

experiment_comparator = ExperimentComparator(
    (experiment_2_summary, "Experiment2"),
    (experiment_3_summary, "Experiment3")
)
experiment_comparator.analyze()

Unnamed: 0,Metric,Experiment2,Experiment3,Change_Experiment3_vs_Experiment2,PercentChange_Experiment3_vs_Experiment2
0,MAP (Mean Average Precision),0.0625,0.06713,0.00463,7.41%
1,MRR (Mean Reciprocal Rank),0.125,0.138889,0.013889,11.11%
2,Mean Precision@1,0.416667,0.375,-0.041667,-10.00%
3,Mean Recall@1,0.326389,0.284722,-0.041667,-12.77%
4,Mean NDCG@1,0.416667,0.375,-0.041667,-10.00%
5,Mean Precision@3,0.208333,0.194444,-0.013889,-6.67%
6,Mean Recall@3,0.4375,0.416667,-0.020833,-4.76%
7,Mean NDCG@3,0.414691,0.39534,-0.019351,-4.67%
8,Mean Precision@5,0.141667,0.15,0.008333,5.88%
9,Mean Recall@5,0.520833,0.555556,0.034722,6.67%


# Takeaways from Experiment 3
Switching to titan V2 did actually improve our performance significantly! In general these scores are pretty decent. If you're following along, you should see a mean recall@5 at around 78%. The queries with relevant docs in them should be around 87%. 

# Conclusion
By playing with the chunk size and embedding model, we made some very impressive improvements over our first iteration. The numbers shown above are pretty decent. 

# TODO
For those following along, what kind of advanced chunking / embedding strategies can you think of to implement that might get our metrics closer to:

* MAP > .8
* precision@1 > .7
* Recall@5 > .9
* NDCG@5 > .8
* % of queries with relevant Docs in Top 5 > .9

# Takeways
We ran 3 quick experiments to determine a good chunk size and compared different embedding models from Amazon. There are many more chunking strategies that offer more advanced capabilities. Hierarchical, Semantic, and summarization chunking strategies can also greatly improve performance. For this notebook, we elected to start with very basic ones. It's worth noting that LlamaIndex's SentenceSplitter (which we used) does its best to keep sentences together.

# Next Steps
Move over to the ReRank notebook to see how ReRank could improve our metrics!