<a href="https://colab.research.google.com/github/TimSim/RAG/blob/main/High_performance_RAG_(and_Evaluation)_with_LlamaIndex.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# High-performance RAG (and Evaluation) with LlamaIndex

In this notebook we'll explore two powerful techniques for taking your single-domain RAG pipelines to the next level:

- Fine-tuning the embeddings model
- Expanding the context window around each retrieved node

We'll also discuss methods you can use to evaluate your RAG pipeline, so you can see how its performance changes as you improve it!

But before any of that, we need to grab some dependencies, and set up some boilerplate!

## Dependencies and Boilerplate

We'll apply `nest_asyncio` so we can run nested async event loops inside the notebook.

We'll also install the required libraries we'll be using today, and set up our OpenAI API key!

This notebook requires access to GPT-4, and the final evaluation step might exceed the standard rate limit. If it does, you will need to modify the evaluation pipeline (for example, by lowering concurrency or evaluating fewer samples) so that you stay under the limit!
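One common way to cope with rate limits is to retry a failed call after an exponentially growing delay. The sketch below is a minimal, hypothetical helper (`with_backoff` is not part of LlamaIndex or the OpenAI client). Which exception signals a rate limit depends on the client version you have installed, so it is passed in through the assumed `retry_on` parameter rather than hard-coded.

```python
import asyncio
import random


async def with_backoff(make_call, max_retries=5, base_delay=1.0, retry_on=(Exception,)):
    """Await make_call(), retrying with exponential backoff plus jitter.

    make_call: zero-argument function returning a fresh awaitable on each call.
    retry_on:  exception types treated as retryable (assumption: pass your
               client's rate-limit error here instead of the broad default).
    """
    for attempt in range(max_retries):
        try:
            return await make_call()
        except retry_on:
            # Give up and re-raise once the last attempt has failed.
            if attempt == max_retries - 1:
                raise
            # Wait base_delay * 1, 2, 4, ... plus random jitter before retrying.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            await asyncio.sleep(delay)
```

In the evaluation section, an awaited call could be wrapped by passing a `lambda` that creates the coroutine, so that each retry starts a fresh request.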

### Nest Asyncio

In [None]:
import nest_asyncio

nest_asyncio.apply()

### Install Dependencies

In [None]:
!pip install openai llama_index pypdf -q -U


### Provide OpenAI API Key

In [None]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter Your OpenAI API Key: ")

Enter Your OpenAI API Key: ··········


## Loading Data

The data can be found in [this GitHub repo](https://github.com/AI-Maker-Space/DataRepository/tree/main/high-performance-rag).

It is a collection of Academic Papers related to Camelids!

In [None]:
!git clone https://github.com/AI-Maker-Space/DataRepository.git

Cloning into 'DataRepository'...
remote: Enumerating objects: 30, done.
remote: Counting objects: 100% (22/22), done.
remote: Compressing objects: 100% (12/12), done.
remote: Total 30 (delta 7), reused 21 (delta 7), pack-reused 8
Receiving objects: 100% (30/30), 49.76 MiB | 40.28 MiB/s, done.
Resolving deltas: 100% (7/7), done.


In [None]:
%cd DataRepository/high-performance-rag

/content/DataRepository/high-performance-rag


In [None]:
!unzip "Camel Papers Test.zip"

Archive:  Camel Papers Test.zip
  inflating: Camel Papers Test/Acute respiratory distress syndrome in an alpaca cria.pdf  
  inflating: Camel Papers Test/Alpaca liveweight variations and fiber production in Mediterranean range of Chile.pdf  


In [None]:
!unzip "Camel Papers Train.zip"

Archive:  Camel Papers Train.zip
  inflating: Camel Papers Train/Antibody response to the epsilon toxin ofClostridium perfringensfollowing vaccination of Lama glamacrias.pdf  
  inflating: Camel Papers Train/Comparative pigmentation of sheep, goats, and llamas what colors are possible through selection.pdf  
  inflating: Camel Papers Train/Conservative management of a ruptured.pdf  
  inflating: Camel Papers Train/Evaluation of cholesterol and vitamin E concentrations in adult alpacas and nursing crias.pdf  
  inflating: Camel Papers Train/Influence of effects on quality traits and relationships between traits of the llama fleece..pdf  
  inflating: Camel Papers Train/Influence of Follicular Fluid on in Vitro.pdf  
  inflating: Camel Papers Train/Neurological Causes of Diaphragmatic Paralysis in 11 Alpacas.pdf  
  inflating: Camel Papers Train/On the morphology of the cerebellum of the alpaca (Lama pacos)..pdf  
  inflating: Camel Papers Train/Relationships between integumental charact

Now we can parse the documents in the training and validation directories into nodes.

We will use LlamaIndex's `SimpleNodeParser` to achieve this!

In [None]:
# Directories created by unzipping the two archives above
TRAIN_FILES = "Camel Papers Train"
VAL_FILES = "Camel Papers Test"

In [None]:
from llama_index import SimpleDirectoryReader
from llama_index.node_parser import SimpleNodeParser
from llama_index.schema import MetadataMode

def load_corpus(directory, verbose=False):
    if verbose:
        print(f"Loading files in {directory}")

    reader = SimpleDirectoryReader(directory)
    docs = reader.load_data()
    if verbose:
        print(f"Loaded {len(docs)} docs")

    parser = SimpleNodeParser.from_defaults()
    nodes = parser.get_nodes_from_documents(docs, show_progress=verbose)

    if verbose:
        print(f"Parsed {len(nodes)} nodes")

    return nodes

In [None]:
train_nodes = load_corpus(TRAIN_FILES, verbose=True)
val_nodes = load_corpus(VAL_FILES, verbose=True)

Loading files in Camel Papers Train
Loaded 91 docs


[nltk_data] Downloading package punkt to /tmp/llama_index...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Parsing documents into nodes:   0%|          | 0/91 [00:00<?, ?it/s]

Parsed 155 nodes
Loading files in Camel Papers Test
Loaded 9 docs


Parsing documents into nodes:   0%|          | 0/9 [00:00<?, ?it/s]

Parsed 17 nodes


Now that we've split our source documents into a number of nodes, we can move on to constructing a fine-tuning dataset.

## Constructing a Fine-tuning Dataset

Using the nodes we created above, we can finally start constructing a fine-tuning dataset utilizing OpenAI's `gpt-3.5-turbo`.

We'll start by using LlamaIndex's `generate_qa_embedding_pairs` and storing the result in an `EmbeddingQAFinetuneDataset`.

The basic idea here is straightforward enough:

1. We look at a node
2. We generate a question that could be answered by that node

This gives us a number of question/context pairs that we can use to fine-tune our Embeddings model.
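To make the shape of that data concrete, here is a minimal sketch of the pairing step in plain Python. It is not LlamaIndex's implementation: `build_qa_pairs` and `toy_question` are hypothetical names, and `toy_question` stands in for the LLM call that writes a question. The three keys mirror the `queries`, `corpus`, and `relevant_docs` attributes that the evaluation code reads from the dataset later in this notebook.

```python
import uuid


def build_qa_pairs(nodes, make_question):
    """Pair each node's text with one generated question.

    nodes:         dict mapping node_id -> node text
    make_question: stand-in for the LLM call (hypothetical)
    """
    queries = {}        # query_id -> question text
    relevant_docs = {}  # query_id -> list of node_ids that answer it
    for node_id, text in nodes.items():
        query_id = str(uuid.uuid4())
        queries[query_id] = make_question(text)
        relevant_docs[query_id] = [node_id]
    return {"queries": queries, "corpus": dict(nodes), "relevant_docs": relevant_docs}


def toy_question(text):
    # A real pipeline would prompt an LLM here; this only echoes the text.
    return f"What does the passage say about: {text[:40]}?"


toy_nodes = {
    "node-1": "Alpaca fiber diameter varies with age.",
    "node-2": "Llamas show wide coat color variation.",
}
pairs = build_qa_pairs(toy_nodes, toy_question)
print(len(pairs["queries"]), "questions for", len(pairs["corpus"]), "nodes")
```

The library function may generate more than one question per node; the sketch keeps one per node to match the description above.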

In [None]:
from llama_index.finetuning import (
    generate_qa_embedding_pairs,
    EmbeddingQAFinetuneDataset,
)

In [None]:
train_dataset = generate_qa_embedding_pairs(train_nodes)
train_dataset.save_json("train_dataset.json")

100%|██████████| 155/155 [06:01<00:00,  2.33s/it]


In [None]:
train_dataset = EmbeddingQAFinetuneDataset.from_json("train_dataset.json")

In [None]:
val_dataset = generate_qa_embedding_pairs(val_nodes)
val_dataset.save_json("val_dataset.json")

100%|██████████| 17/17 [00:39<00:00,  2.35s/it]


In [None]:
val_dataset = EmbeddingQAFinetuneDataset.from_json("val_dataset.json")

## Fine-tuning `BAAI/bge-small-en-v1.5`

Now that we have a dataset, let's grab a `sentence-transformers` Embeddings model!

We'll be using BAAI's [`bge-small-en-v1.5`](https://huggingface.co/BAAI/bge-small-en-v1.5) as a base embeddings model.

It is a well-performing embeddings model by itself, but there are a lot of very specific domain terms and vocabulary in our corpus, so let's fine-tune it and see what that can do for us!

In [None]:
!pip install sentence_transformers -q -U


We'll be leveraging LlamaIndex's `SentenceTransformersFinetuneEngine` to make fine-tuning our embeddings model a breeze.

In [None]:
from llama_index.finetuning import SentenceTransformersFinetuneEngine

finetune_engine = SentenceTransformersFinetuneEngine(
    train_dataset, # Dataset to be trained on
    model_id="BAAI/bge-small-en-v1.5", # HuggingFace reference to base embeddings model
    model_output_path="llama_model_v1", # Output directory for fine-tuned embeddings model
    val_dataset=val_dataset, # Dataset to validate on
    epochs=2 # Number of Epochs to train for
)

Downloading (…)5b79a/.gitattributes:   0%|          | 0.00/1.52k [00:00<?, ?B/s]

Downloading (…)_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

Downloading (…)b34665b79a/README.md:   0%|          | 0.00/89.1k [00:00<?, ?B/s]

Downloading (…)4665b79a/config.json:   0%|          | 0.00/743 [00:00<?, ?B/s]

Downloading (…)ce_transformers.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/134M [00:00<?, ?B/s]

Downloading (…)nce_bert_config.json:   0%|          | 0.00/52.0 [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/125 [00:00<?, ?B/s]

Downloading (…)5b79a/tokenizer.json:   0%|          | 0.00/711k [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/394 [00:00<?, ?B/s]

Downloading (…)b34665b79a/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)665b79a/modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

All that's left to do now is call `.finetune()`!

In [None]:
finetune_engine.finetune()

Epoch:   0%|          | 0/2 [00:00<?, ?it/s]

Iteration:   0%|          | 0/31 [00:00<?, ?it/s]

Iteration:   0%|          | 0/31 [00:00<?, ?it/s]

Now that we've fine-tuned our embeddings model, let's grab the model out of the engine so we can use it later!

In [None]:
finetuned_embedding_model = finetune_engine.get_finetuned_model()

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


## Evaluating Embeddings Model

We're going to be evaluating our newly fine-tuned model against the base model using the evaluation pipeline provided by the `sentence_transformers` library.

You can find out all about the `InformationRetrievalEvaluator` [here](https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/evaluation/InformationRetrievalEvaluator.py).

The score we'll be looking at by default is `Mean Average Precision @ K`, or `MAP@K`; the evaluator also writes more detailed results to the `results/` directory.
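As a rough illustration of the metric, the sketch below computes average precision at K for each query and then averages across queries. It follows one common definition (the precision sum is divided by `min(k, number of relevant documents)`), which is an assumption here rather than a statement of the library's exact code. The function names and the toy rankings are made up for the example.

```python
def average_precision_at_k(ranked_ids, relevant_ids, k):
    """Average precision over the top-k results for a single query."""
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            hits += 1
            # Precision measured at the rank where a relevant doc appears.
            precision_sum += hits / rank
    denom = min(k, len(relevant_ids))
    return precision_sum / denom if denom else 0.0


def mean_average_precision_at_k(rankings, relevant, k):
    """Mean of the per-query average precision values."""
    scores = [average_precision_at_k(rankings[q], relevant[q], k) for q in rankings]
    return sum(scores) / len(scores)


# Toy data: q1 finds its relevant doc at rank 1, q2 at rank 2.
rankings = {"q1": ["a", "b", "c"], "q2": ["x", "y", "z"]}
relevant = {"q1": {"a"}, "q2": {"y"}}
map_at_3 = mean_average_precision_at_k(rankings, relevant, k=3)
print(map_at_3)  # (1.0 + 0.5) / 2 = 0.75
```

When every query has exactly one relevant node, as in question/context pairs where each question is generated from a single node, the per-query value reduces to the reciprocal of the rank at which that node is retrieved.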

In [None]:
from sentence_transformers.evaluation import InformationRetrievalEvaluator
from sentence_transformers import SentenceTransformer
from pathlib import Path

def evaluate_st(
    dataset,
    model_id,
    name,
):
    corpus = dataset.corpus
    queries = dataset.queries
    relevant_docs = dataset.relevant_docs

    evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs, name=name)
    model = SentenceTransformer(model_id)
    output_path = "results/"
    Path(output_path).mkdir(exist_ok=True, parents=True)
    return evaluator(model, output_path=output_path)

In [None]:
evaluate_st(val_dataset, "BAAI/bge-small-en-v1.5", name="bge")

0.7727941176470587

In [None]:
evaluate_st(val_dataset, "llama_model_v1", name="finetuned")

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


0.8348389355742297
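To put the two scores side by side, the short calculation below works out the absolute and relative gain from the numbers printed above. Keep in mind that the validation set is small (17 nodes), so the size of the gain should be read as indicative rather than precise.

```python
# Scores copied from the two evaluation cells above.
base_map = 0.7727941176470587
finetuned_map = 0.8348389355742297

absolute_gain = finetuned_map - base_map
relative_gain = absolute_gain / base_map

print(f"absolute gain: {absolute_gain:.4f}")
print(f"relative gain: {relative_gain:.1%}")
```

That is a gain of roughly 0.06 in absolute terms, or about 8% relative to the base model.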

## Advanced Retrieval Method: Sentence Window Retrieval

Fine-tuning our embeddings is a powerful way to ensure we're better at retrieving the correct context - but we can go a step further and improve the way we actually look at context as well.

In this demonstration, we'll use LlamaIndex's `SentenceWindowNodeParser` together with metadata replacement to take our retrieval to the next level.

At a high level, what we're doing is straightforward:

1. We parse our document into sentence-wise nodes.
2. We find the most relevant sentence-wise nodes to our query.
3. We add additional context based on a "window" around that base sentence-wise node.
4. We use that enhanced context as context for our LLM!


Let's look at this with a visual example:

In [None]:
block_1 = """
I went to Tosche Station. I bought a Power Converter. I live on a planet with 2 Moons. My name is Luke Skywalker.
"""

sentences = block_1.split(".")
print(sentences)

chunks = [block_1[:50], block_1[50:100], block_1[100:]]
print(chunks)

['\nI went to Tosche Station', ' I bought a Power Converter', ' I live on a planet with 2 Moons', ' My name is Luke Skywalker', '\n']
['\nI went to Tosche Station. I bought a Power Conver', 'ter. I live on a planet with 2 Moons. My name is L', 'uke Skywalker.\n']
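Building on the sentence split shown above, here is a minimal pure-Python sketch of the windowing idea. It is an illustration, not LlamaIndex's parser: `sentence_windows` is a hypothetical helper, the naive split on periods stands in for a proper sentence tokenizer, and it assumes that `window_size` counts the sentences kept on each side of the central one. The `original_text` and `window` keys mirror the metadata keys used in the next cell.

```python
def sentence_windows(text, window_size=1):
    """Return one record per sentence, each with its surrounding window."""
    # Naive sentence split; a real parser would use a sentence tokenizer.
    sentences = [s.strip() + "." for s in text.split(".") if s.strip()]
    records = []
    for i, sentence in enumerate(sentences):
        # Clamp the window so it never runs past either end of the text.
        start = max(0, i - window_size)
        end = min(len(sentences), i + window_size + 1)
        records.append(
            {"original_text": sentence, "window": " ".join(sentences[start:end])}
        )
    return records


text = (
    "I went to Tosche Station. I bought a Power Converter. "
    "I live on a planet with 2 Moons. My name is Luke Skywalker."
)
records = sentence_windows(text, window_size=1)
print(records[1]["original_text"])
print(records[1]["window"])
```

The single sentence is what gets embedded and matched against the query, while the wider window is what will eventually be handed to the LLM.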


In [None]:
from llama_index import ServiceContext, set_global_service_context
from llama_index.llms import OpenAI
from llama_index.embeddings import OpenAIEmbedding, HuggingFaceEmbedding
from llama_index.node_parser import SentenceWindowNodeParser, SimpleNodeParser

# window node parser
node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=6,
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)

# simple node parser
simple_node_parser = SimpleNodeParser.from_defaults()

# base Query Engine LLM
llm = OpenAI(model="gpt-3.5-turbo", temperature=0)

# fine-tuned Embeddings model
embed_model = HuggingFaceEmbedding(
    model_name="llama_model_v1"
)

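# base Embeddings model
# NOTE: this loads `BAAI/bge-small-en`, not the `BAAI/bge-small-en-v1.5` checkpoint that was fine-tuned above
embed_model_base = HuggingFaceEmbedding(
    model_name="BAAI/bge-small-en"
)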

# fine-tuned ServiceContext
ctx = ServiceContext.from_defaults(
    llm=llm,
    embed_model=embed_model,
)

# base ServiceContext
ctx_base = ServiceContext.from_defaults(
    llm=llm,
    embed_model=embed_model_base
)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Let's create nodes using our `node_parser` and `simple_node_parser` after loading our documents found in the `TRAIN_FILES` directory.

In [None]:
documents = SimpleDirectoryReader(
    TRAIN_FILES
).load_data()

In [None]:
nodes = node_parser.get_nodes_from_documents(documents)

In [None]:
base_nodes = simple_node_parser.get_nodes_from_documents(documents)

Now we can create a `VectorStoreIndex` for each set of nodes.

In [None]:
from llama_index import VectorStoreIndex

sentence_index = VectorStoreIndex(nodes, service_context=ctx)

In [None]:
base_index = VectorStoreIndex(base_nodes, service_context=ctx)

In the following step, we'll set up our `MetadataReplacementPostProcessor` which is what will replace our sentences (`original_text`) with our expanded contexts (`window`).

Remember, we're retrieving the `top_k` (3, in this case) sentences - and then converting them to their surrounding context.

In [None]:
from llama_index.indices.postprocessor import MetadataReplacementPostProcessor

query_engine = sentence_index.as_query_engine(
    similarity_top_k=3,
    node_postprocessors=[
        MetadataReplacementPostProcessor(target_metadata_key="window")
    ],
)
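Conceptually, the replacement step is small. The sketch below mimics it with plain dictionaries. It is not the LlamaIndex class: `replace_with_metadata` is a hypothetical function, and falling back to the original text when the key is missing is an assumption made for the example.

```python
def replace_with_metadata(retrieved, target_metadata_key="window"):
    """Swap each retrieved node's text for the value stored in its metadata."""
    replaced = []
    for node in retrieved:
        # Keep the original text if the node carries no window (assumption).
        new_text = node["metadata"].get(target_metadata_key, node["text"])
        # Build a new dict so the retrieved nodes are left unmodified.
        replaced.append({**node, "text": new_text})
    return replaced


retrieved = [
    {
        "text": "I bought a Power Converter.",
        "metadata": {
            "original_text": "I bought a Power Converter.",
            "window": (
                "I went to Tosche Station. I bought a Power Converter. "
                "I live on a planet with 2 Moons."
            ),
        },
    }
]
expanded = replace_with_metadata(retrieved)
print(expanded[0]["text"])
```

Retrieval still scores the short sentence, but the LLM receives the longer window as its context.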

Let's look at a sample response!

In [None]:
window_response = query_engine.query("How do camelid genetics influence wool quality?")

In [None]:
window_response.response

'Camelid genetics play a significant role in determining wool quality. The genetics of camelids, such as llamas and alpacas, influence various traits that contribute to wool quality, including fiber diameter, color, type of fleece, fiber length, and uniformity of diameter. For example, the genetics of llamas and alpacas determine the natural colors and patterns of their wool, with llamas exhibiting greater color variation compared to alpacas. Additionally, the genetics of camelids control the growth and formation of fibers, which consists of different phases regulated by genetic, nutritional, and hormonal factors. The proteins that form the wool are encoded by keratin genes and keratin-associated proteins, which are expressed in a highly regulated manner during hair follicle growth. While some genetic selection programs have been implemented to improve fleece characteristics in domestic camelids, the genetics mechanisms controlling wool traits in llamas and alpacas are not fully unders

We can also look at the visual representation of what happened, with our original sentence - and then our expanded context window.



In [None]:
window = window_response.source_nodes[0].node.metadata["window"]
sentence = window_response.source_nodes[0].node.metadata["original_text"]

print(f"Window: {window}")
print("------------------")
print(f"Original Sentence: {sentence}")

Window: 79 August 2022, Vol.  12, No.  4
be studied.  The purpose of this review is to update the reader 
on the current state of knowledge of fiber genetics in domestic South American camelids and to discuss how genomics and the emergence of modern technologies for sequencing and discovering genetic variants will contribute to the advancement in this field.
 Coat Color Genetics
Llamas and alpacas have more than 22 natural colors ran -
ging from black and brown through gray and fawn to white, including all intermediate shades.  Llamas present greater color variation compared to alpacas; tricolor phenotypes may be ob -
served and the presence of white spots is common in llamas.  Additionally, this variety of colors and patterns normally oc -
curs in the same herd, unlike alpaca’s herds that tend to be more homogeneous.  The difference can be attributed to the se -
lection process during the domestication of each species.  The 
llama, as a multipurpose animal, was selected for greater bo

Let's compare this with the same query run against the simple nodes. Note that this query engine retrieves the top 2 nodes rather than the top 3.

In [None]:
query_engine = base_index.as_query_engine(similarity_top_k=2)
vector_response = query_engine.query("How do camelid genetics influence wool quality?")

In [None]:
vector_response.response

'Camelid genetics can influence wool quality through various mechanisms. Genetic variations in coat color genes, such as MC1R and ASIP, can affect the pigmentation of the wool, resulting in different color patterns. Additionally, genes involved in fiber growth and color, such as FGF5, can impact the length and texture of the wool fibers. Other genes, like high-glycine-tyrosine keratin genes, have been linked to wool fiber diameter. These genetic factors contribute to the overall quality and characteristics of camelid wool.'

## Evaluating our Pipeline

We'll be leveraging LlamaIndex's evaluation tools to evaluate our pipeline today.

We'll be relying on the [`DatasetGenerator`](https://github.com/run-llama/llama_index/blob/main/llama_index/evaluation/dataset_generation.py) to create our `QueryResponseDataset` leveraging `GPT-4`.

The generated dataset will be similar to the one we built earlier, except that each question is paired with a reference response rather than with a context node.

> NOTE: GPT-4 powered evaluation can be expensive and fairly time-consuming. Ensure you've scoped out cost before proceeding with evaluation.

In [None]:
import random
from llama_index.evaluation import (
    DatasetGenerator,
    QueryResponseDataset,
)

# the number of nodes to evaluate
num_nodes_eval = 10

# selecting a random sample of nodes
sample_eval_nodes = random.sample(base_nodes, num_nodes_eval)

# setting up our GPT-4 powered evaluation context
eval_service_context = ServiceContext.from_defaults(llm=OpenAI(model="gpt-4"))

# creating our dataset generator
dataset_generator = DatasetGenerator(
    sample_eval_nodes,
    service_context=eval_service_context,
    show_progress=True,
    num_questions_per_chunk=2,
)

Now we can simply fire off our `dataset_generator` and wait!

In [None]:
eval_dataset = await dataset_generator.agenerate_dataset_from_nodes()

100%|██████████| 10/10 [00:10<00:00,  1.05s/it]
100%|██████████| 2/2 [00:18<00:00,  9.00s/it]
100%|██████████| 2/2 [00:07<00:00,  3.92s/it]
100%|██████████| 2/2 [00:05<00:00,  2.71s/it]
100%|██████████| 2/2 [00:25<00:00, 12.63s/it]
100%|██████████| 2/2 [00:03<00:00,  1.96s/it]
100%|██████████| 2/2 [00:06<00:00,  3.46s/it]
100%|██████████| 2/2 [00:05<00:00,  2.95s/it]
100%|██████████| 2/2 [00:11<00:00,  5.84s/it]
100%|██████████| 2/2 [00:14<00:00,  7.28s/it]
100%|██████████| 2/2 [00:10<00:00,  5.11s/it]


In [None]:
eval_dataset.save_json("llama_eval_qr_dataset.json")

In [None]:
eval_dataset = QueryResponseDataset.from_json("llama_eval_qr_dataset.json")

We'll be using the following standard evaluation metrics provided by LlamaIndex.

- CorrectnessEvaluator - [Code](https://github.com/run-llama/llama_index/blob/main/llama_index/evaluation/correctness.py)
- SemanticSimilarityEvaluator - [Code](https://github.com/run-llama/llama_index/blob/main/llama_index/evaluation/semantic_similarity.py)
- RelevancyEvaluator - [Code](https://github.com/run-llama/llama_index/blob/main/llama_index/evaluation/relevancy.py)
- FaithfulnessEvaluator - [Code](https://github.com/run-llama/llama_index/blob/main/llama_index/evaluation/faithfulness.py)

In [None]:
from llama_index.evaluation import (
    CorrectnessEvaluator,
    SemanticSimilarityEvaluator,
    RelevancyEvaluator,
    FaithfulnessEvaluator
)

evaluator_c = CorrectnessEvaluator(service_context=eval_service_context)
evaluator_s = SemanticSimilarityEvaluator(service_context=eval_service_context)
evaluator_r = RelevancyEvaluator(service_context=eval_service_context)
evaluator_f = FaithfulnessEvaluator(service_context=eval_service_context)

Next, we'll set up some additional evaluation tools. These will mostly be used to make running and collecting our evaluations a bit simpler. Thanks, LlamaIndex!

In [None]:
from llama_index.evaluation.eval_utils import get_responses, get_results_df
from llama_index.evaluation import BatchEvalRunner

max_samples = 15

eval_qs = eval_dataset.questions
ref_response_strs = [r for (_, r) in eval_dataset.qr_pairs]

Next up, we'll set up `QueryEngine`s for our two pipelines we wish to evaluate and let them predict!

First up is our SentenceWindow-MetaDataReplacement pipeline powered by fine-tuned embeddings.

In [None]:
query_engine = sentence_index.as_query_engine(
    similarity_top_k=3,
    node_postprocessors=[
        MetadataReplacementPostProcessor(target_metadata_key="window")
    ],
)
pred_responses_finetuned_embeds = get_responses(
    eval_qs[:max_samples], query_engine, show_progress=True
)

100%|██████████| 5/5 [00:07<00:00,  1.59s/it]


Next is our Simple Retrieval Base Embeddings pipeline.

In [None]:
base_index_base_embeddings = VectorStoreIndex(base_nodes, service_context=ctx_base)
base_embeddings_base_query_engine = base_index_base_embeddings.as_query_engine(
  similarity_top_k=3
)
base_pred_responses_base_embedings = get_responses(
    eval_qs[:max_samples], base_embeddings_base_query_engine, show_progress=True
)

100%|██████████| 5/5 [00:05<00:00,  1.14s/it]


In [None]:
import numpy as np

pred_response_strs_finetuned_embeds = [str(p) for p in pred_responses_finetuned_embeds]
base_pred_response_strs_base_embeds = [str(p) for p in base_pred_responses_base_embedings]

We'll create our evaluator dict, which will help build the appropriate `pd.DataFrame` in the final step, and set up our `BatchEvalRunner`, which will use GPT-4 to evaluate our pipelines' responses against the reference responses!

In [None]:
evaluator_dict = {
    "correctness": evaluator_c,
    "faithfulness": evaluator_f,
    "relevancy": evaluator_r,
    "semantic_similarity": evaluator_s,
}

batch_runner = BatchEvalRunner(evaluator_dict, workers=2, show_progress=True)

In [None]:
base_eval_results_base_embeddings = await batch_runner.aevaluate_responses(
    queries=eval_qs[:max_samples],
    responses=base_pred_responses_base_embedings[:max_samples],
    reference=ref_response_strs[:max_samples],
)

100%|██████████| 20/20 [00:20<00:00,  1.04s/it]


In [None]:
eval_results_finetuned_embeddings = await batch_runner.aevaluate_responses(
    queries=eval_qs[:max_samples],
    responses=pred_responses_finetuned_embeds[:max_samples],
    reference=ref_response_strs[:max_samples],
)

100%|██████████| 20/20 [00:20<00:00,  1.03s/it]


Finally we can look at our results, which I'll let speak for themselves!

In [None]:
results_df = get_results_df(
    [
        base_eval_results_base_embeddings,
        eval_results_finetuned_embeddings],
    ["Base Retriever w Base Embeddings", "Sentence Window Retriever w FT Embeddings"],
    ["correctness", "relevancy", "faithfulness", "semantic_similarity"],
)

In [None]:
display(results_df.sort_values(by=['semantic_similarity'], ascending=False))

| | names | correctness | relevancy | faithfulness | semantic_similarity |
|---|---|---|---|---|---|
| 1 | Sentence Window Retriever w FT Embeddings | 4.2 | 0.933333 | 1.0 | 0.960761 |
| 0 | Base Retriever w Base Embeddings | 3.86667 | 0.866667 | 0.866667 | 0.959939 |
