# 00- Baseline Strategy

## RAGE-1: RAG Experiment Harness

### Setting up the Experiment

Having run [00-Chunking Strategies](./00-Chunking%20Strategies.ipynb), we now have our base data and our evaluation dataset. Now it's time to run our first experiment!

Given a dataframe containing our documents, we must:

- Chunk the documents.
- Embed the chunks.
- Store them in a vector database.
- Query the database using the questions from our ground truth (GT) question-answer pairs.
- Create a generation prompt that includes the question and the retrieved "context".
- Run the generation prompt and store the Question, Answer, GT Answer, and Context for our evaluation framework.

Let's follow our [baseline experiment](experiments/01-chunking-strategies-baseline.md). 

In [1]:
import pandas as pd

experiment_name = "baseline_pubmed_articles"
evaluation_data = pd.read_csv('data/qa_pairs.csv')
input_data = pd.read_csv('data/ds_subset.csv')

#### Chunking the documents

Based on our earlier analysis, let's take a chunk size of 400 words with an overlap of 50 words. We can calculate the exact length of requests later, but our analysis suggests that 400 words is approximately 500 tokens per chunk, so we should be able to fit multiple results into our context.

In [2]:
from rag.chunking import chunk_string_with_overlap

# Create a new DataFrame with each chunk as a separate row
chunks = []
doc_ids = []
chunk_ids = []
for idx, row in input_data.iterrows():
    article_chunks = chunk_string_with_overlap(input_text=row['article'], chunk_length=400, overlap=50)
    chunks.extend(article_chunks)
    doc_ids.extend([row['doc_id']] * len(article_chunks))
    chunk_ids.extend([f"{row['doc_id']}-{i+1}" for i in range(len(article_chunks))])

ds_chunked = pd.DataFrame({'doc_id': doc_ids, 'chunk_id': chunk_ids, 'chunks': chunks})
ds_chunked.to_csv('data/ds_chunked.csv', index=False)
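
The `chunk_string_with_overlap` helper lives in this repository's `rag.chunking` module and is not shown here. As a rough illustration of the idea only, here is a minimal sketch of word-based chunking with overlap. The function name `chunk_words_with_overlap` and its behaviour at the end of the text are assumptions, not the repository's implementation.

```python
def chunk_words_with_overlap(input_text, chunk_length=400, overlap=50):
    """Hypothetical stand-in: split on whitespace and emit windows of
    `chunk_length` words, where consecutive windows share `overlap` words."""
    if not 0 <= overlap < chunk_length:
        raise ValueError("overlap must be non-negative and smaller than chunk_length")
    words = input_text.split()
    step = chunk_length - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_length]))
        if start + chunk_length >= len(words):
            break  # this window already reached the end of the text
    return chunks


# A synthetic 1000-word "article" made of the tokens w0 ... w999
example = " ".join(f"w{i}" for i in range(1000))
pieces = chunk_words_with_overlap(example, chunk_length=400, overlap=50)
print([len(piece.split()) for piece in pieces])  # [400, 400, 300]
```

With a step of 350 words, each chunk starts with the last 50 words of the previous one, which is why the final chunk is shorter than the others.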


Let's take a look at some of our chunks. A sense check helps here: if something looks wrong, it's a lot easier to fix now than after creating embeddings and search indexes.

In [3]:
# Display a random sample of five chunks, printing the full string of each
for chunk in ds_chunked['chunks'].sample(5):
    print(chunk)
    print('\n') 

cutoff values for stg and stg / tsh ratio in our study have high sensitivity and specificity in predicting the outcome . this study is one of its kinds , investigating the association of stg / tsh ratio with ablation outcome in patients with dtc . we found that patients with stg / tsh ratio < 0.35 before rra had 11.64 times greater chance of successful ablation compared to those with stg / tsh ratio > 0.35 with sensitivity of 80.0% and specificity of 81.4% ( p . however limitation of our study was its retrospective nature ; therefore , we could retrieve data of only 75 patients who fulfilled our inclusion criteria . many of the previous studies regarding role of stg in predicting rra outcome were also retrospective in nature [ 8 , 15 , 16 , 18 , 29 , 31 ] . another limitation of this study is use of different ablative doses in low , intermediate , and high risk groups which could have an impact on success of ablation as british thyroid association 2007 guidelines favor use of high dose

Already we can see that sentences are broken, and that perhaps this isn't a great way of splitting our information. That being said, we are generating a baseline, and we expect the subsequent experiments to offer significant uplift. For now, let's proceed.

> KEY TAKEAWAY: Even before we've run expensive and time-consuming API calls, we can see that the results are not ideal. It's often worth iterating on this before investing time and money in the more nuanced tuning approaches.
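
The eyeball check above can be complemented with a cheap numeric one. The sketch below estimates how many chunks stop mid-sentence by looking at the final character. The helper `ends_mid_sentence` and the sample strings are made up for illustration, and the punctuation rule is a crude assumption that will miss abbreviations and similar edge cases.

```python
def ends_mid_sentence(chunk):
    """True when the chunk's last non-space character is not '.', '!' or '?'."""
    return not chunk.rstrip().endswith((".", "!", "?"))


# Made-up chunks in the same lowercase, space-separated style as the corpus
sample_chunks = [
    "this study is retrospective in nature .",
    "guidelines favor use of high dose",
    "was the ablation successful ?",
]

cut_off = [chunk for chunk in sample_chunks if ends_mid_sentence(chunk)]
cut_off_share = len(cut_off) / len(sample_chunks)
print(f"{cut_off_share:.0%} of sampled chunks end mid-sentence")
```

On the real data, the same function could be applied to `ds_chunked['chunks']` to compare chunking strategies before any embeddings are created.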

#### Embed the chunks

We will use the `ada-v2` embedding model for this example, as it is fairly powerful and well understood. It's worth noting that this will not always be the best model, particularly when the data contains topics and content that fall outside the embedding model's training data. Fine-tuning an embedding model on a specific corpus (particularly for highly specialised data) is also a popular option.

Most vector databases implement a wrapper around common embedding functions. Here we will configure and use the wrapper for OpenAI embeddings in ChromaDB. This function is used to embed all documents in a collection, and also to embed queries as they come in.

In creating the index, we also need to specify the distance measure. For illustrative purposes we've used cosine similarity. In reality, index design for enterprise use cases is a rich topic in itself. Again, call out if you'd like more content on index design and choice of search engine!

> KEY TAKEAWAY: Your choice of embedding model matters! It should be consistent across your index and relevant to your data.
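
To make the cosine measure concrete, here is a small worked example in plain Python. To our understanding, a Chroma collection configured with `"hnsw:space": "cosine"` reports cosine *distance* (one minus cosine similarity), so smaller values mean closer matches. The two-dimensional vectors are toy values, not real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    """One minus cosine similarity: 0 for same direction, 1 for orthogonal."""
    return 1.0 - cosine_similarity(a, b)


query = [1.0, 0.0]
same_direction = [3.0, 0.0]  # longer vector, same direction as the query
orthogonal = [0.0, 2.0]      # unrelated direction

print(cosine_distance(query, same_direction))  # 0.0
print(cosine_distance(query, orthogonal))      # 1.0
```

Because the measure ignores vector length, only the direction of the embedding matters when chunks are ranked.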


In [4]:
import os

import chromadb.utils.embedding_functions as embedding_functions
from chromadb import PersistentClient
from dotenv import find_dotenv, load_dotenv

load_dotenv(find_dotenv())

# Specify the embedding model
embedding_model = os.getenv("AZURE_OPENAI_EMBEDDING_MODEL")

# Used in ChromaDB to embed both documents and queries
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    api_base=os.getenv("AZURE_OPENAI_ENDPOINT"),
    api_type="azure",
    api_version=os.getenv("OPENAI_API_VERSION"),
    model_name=embedding_model,
)

# Create (or reuse) a persistent collection that uses cosine distance
chroma_client = PersistentClient(path="./data/chroma_db")

index_name = f"experiment_{experiment_name}"
collection = chroma_client.get_or_create_collection(
    name=index_name,
    embedding_function=openai_ef,
    metadata={"hnsw:space": "cosine"},
)

# Add the chunks; ChromaDB embeds the documents with openai_ef
collection.add(
    # embeddings=ds_chunked['ada_v2'].tolist(),
    documents=ds_chunked['chunks'].tolist(),
    metadatas=[{"doc_id": doc_id} for doc_id in ds_chunked['doc_id']],
    ids=ds_chunked['chunk_id'].tolist(),
)
    

Now let's take a quick look at an example question and the response.

In [5]:
results = collection.query(
    query_texts=[evaluation_data['question'][3]],
    n_results=5)

print(evaluation_data['question'][3])
print(results)


We can see that the query returns the IDs, scores (distances), metadata and documents (chunks) for the top 5 chunks in the collection, ranked by cosine similarity. The chunks will form our context, and we will use the metadata for lineage. In our case, a number of chunks come from the same document. This can be seen as a positive indicator, given that our corpus is quite specific and documents can be distinct.
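
The printed result is a dictionary of nested lists, with one inner list per query text. The sketch below flattens such a result into one record per chunk and counts how many retrieved chunks come from each document. `mock_results` is a hand-written stand-in with made-up IDs, distances and text, and `flatten_hits` is a hypothetical helper, not part of ChromaDB.

```python
from collections import Counter

# Stand-in shaped like a Chroma query result for ONE query text (values invented)
mock_results = {
    "ids": [["doc7-2", "doc7-3", "doc12-1"]],
    "distances": [[0.11, 0.14, 0.21]],
    "metadatas": [[{"doc_id": "doc7"}, {"doc_id": "doc7"}, {"doc_id": "doc12"}]],
    "documents": [["chunk text a", "chunk text b", "chunk text c"]],
}


def flatten_hits(results, query_index=0):
    """Return one dict per retrieved chunk for the chosen query."""
    return [
        {"chunk_id": chunk_id, "distance": distance, "doc_id": meta["doc_id"], "text": text}
        for chunk_id, distance, meta, text in zip(
            results["ids"][query_index],
            results["distances"][query_index],
            results["metadatas"][query_index],
            results["documents"][query_index],
        )
    ]


hits = flatten_hits(mock_results)
chunks_per_doc = Counter(hit["doc_id"] for hit in hits)
print(chunks_per_doc)
```

Counting chunks per `doc_id` is a quick way to see whether retrieval concentrates on one document, as observed for the example question.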

> NOTE: It will depend on your use case and data whether there is a concept of "the right doc"

Now let's take a look at our augmentation and generation steps and apply this at scale!

#### Augmentation and Generation

Here we are enriching the question with the new (and hopefully relevant!) context we have unearthed from the vector database. To do that, we'll need another prompt template. Once we have this, we can submit the prompt to our generation model and receive the answer to our question.
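
The repository's own template sits in `rag.augmentation` and is not reproduced here. As a rough idea of what such a template can look like, here is a minimal sketch. The wording of `PROMPT_TEMPLATE`, the `---` separator and the function name `build_rag_prompt` are all assumptions made for illustration.

```python
# Hypothetical template: instruction, retrieved context, then the question
PROMPT_TEMPLATE = (
    "Answer the question using only the context below. "
    "If the context does not contain the answer, say so.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)


def build_rag_prompt(context_chunks, question):
    """Join the retrieved chunks and slot them into the template with the question."""
    context = "\n---\n".join(context_chunks)
    return PROMPT_TEMPLATE.format(context=context, question=question)


demo_prompt = build_rag_prompt(["chunk one", "chunk two"], "What was studied?")
print(demo_prompt)
```

Keeping the template in one place makes it easy to treat prompt wording as another experiment variable later on.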

Let's reuse the example question from the retrieval step.

In [None]:
from rag.augmentation import get_context, contruct_prompt

context = get_context(evaluation_data['question'][3], collection, 3)
prompt = contruct_prompt(context, evaluation_data['question'][3])

print(prompt)

We can see that we've created a new prompt that's ready to be submitted to our generation model. Let's take a look at a single call.

In [None]:
from helper.openai_utils import general_prompt, create_client

oai_client = create_client()

response = general_prompt(oai_client, prompt, model=os.getenv("GEN_STEP_MODEL"))

print(f"Question: {evaluation_data['question'][3]}")
print(f"Correct Answer: {evaluation_data['ground_truth'][3]}")

print(f"Generated Answer: {response}")

This is looking pretty good! What you've likely just experienced is the infamous "vibe check" for LLM-based applications. We'll get on to more formal measurement soon. But first, let's get answers to all 250 questions.

> NOTE: The execution time of this will depend heavily on your model selection. For GPT-35-turbo-16k it should complete in roughly 3 minutes.

In [None]:
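# Assumption: the notebook does not show where Pool comes from; a thread pool
# is used here because the work is I/O-bound API calls
from multiprocessing.pool import ThreadPool as Pool

model = os.getenv("GEN_STEP_MODEL")
multi_threading = True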

#Create a distinct copy of evaluation_data to store the results
results_df = evaluation_data.copy()

if os.path.exists(f'data/results-{experiment_name}-{model}.csv'):
    print("File exists, reading in...")

    results_df = pd.read_csv(f'data/results-{experiment_name}-{model}.csv')

else:
    def generation_step(question):
        context = get_context(question, collection,3)
        prompt = contruct_prompt(context, question)
        return general_prompt(oai_client, prompt, model=model)

    if multi_threading:
        with Pool() as pool:
            results_multiprocessing = pool.map(generation_step, results_df['question'])
        results_df['answer'] = results_multiprocessing

    else:
        results_df['answer'] = results_df['question'].apply(generation_step)

    #TODO: Refactor this so only one call for context

    # Store the retrieved contexts if they are not already present, using the
    # same number of results (3) as the generation step
    if 'contexts' not in results_df.columns:
        results_df['contexts'] = [get_context(q, collection, 3) for q in results_df['question']]

    #write out to CSV
    results_df.to_csv(f'data/results-{experiment_name}-{model}.csv', index=False)

display(results_df.head())
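
The TODO in the cell above notes that the context is retrieved twice: once for generation and once for storage. One possible refactor, sketched below under stated assumptions, returns the context and the answer together so that retrieval happens once per question. `fake_get_context` and `fake_generate` are placeholders for the real retrieval and LLM calls, and the use of a thread pool is an assumption.

```python
from multiprocessing.pool import ThreadPool


def fake_get_context(question, n_results=3):
    """Placeholder for the retrieval call; returns made-up chunks."""
    return [f"context {i} for {question}" for i in range(n_results)]


def fake_generate(prompt):
    """Placeholder for the LLM call; reports the prompt length."""
    return f"answer based on {len(prompt)} characters"


def answer_with_context(question):
    """Retrieve once, then return the question, its contexts and the answer."""
    context = fake_get_context(question)
    prompt = "\n".join(context) + "\n" + question
    return {"question": question, "contexts": context, "answer": fake_generate(prompt)}


questions = ["q1", "q2", "q3"]
with ThreadPool(processes=2) as pool:
    rows = pool.map(answer_with_context, questions)  # map preserves input order

print(rows[0]["contexts"])
```

The resulting list of dictionaries could be turned into a dataframe with `pd.DataFrame(rows)`, which also guarantees that the stored contexts are the ones the answer was generated from.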

Now that we have a dataframe with the questions, true answers, generated answers, and the context used to generate them, we can start to look at whether or not the answers are any good. To do that, we'll use a popular open-source LLM evaluation framework called [Ragas](https://docs.ragas.io/en/stable/).

## RAGE-2: RAG Evaluation Harness

Now we've generated our answers, let's measure them. We'll be using the following measures to evaluate our results:

1. *Faithfulness*: This measures the factual consistency of the generated answer against the given context. It is calculated from the answer and the retrieved context. The score is scaled to the (0, 1) range; higher is better.

2. *Answer Relevancy*: This assesses how pertinent the generated answer is to the given prompt. Answers that are incomplete or contain redundant information receive lower scores; higher scores indicate better relevancy.

3. *Answer Semantic Similarity*: This assesses the semantic resemblance between the generated answer and the ground truth. It is calculated from the ground truth and the answer, with values in the range 0 to 1. A higher score signifies better alignment between the generated answer and the ground truth.

For more information on how these are calculated you can visit the documentation [here](https://docs.ragas.io/en/stable/concepts/metrics/index.html#ragas-metrics).
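
As a simplified illustration of the first metric: to our understanding of the Ragas documentation, faithfulness is the share of claims in the answer that can be inferred from the retrieved context. In Ragas an LLM extracts and judges the claims; in the sketch below the judgements are hard-coded, and `faithfulness_ratio` is a hypothetical helper rather than a Ragas function.

```python
def faithfulness_ratio(claim_supported):
    """Share of answer claims judged to be supported by the retrieved context."""
    if not claim_supported:
        return 0.0  # assumption: an answer with no claims scores zero
    return sum(claim_supported) / len(claim_supported)


# Four claims extracted from one answer; three are supported by the context
judgements = [True, True, False, True]
score = faithfulness_ratio(judgements)
print(score)  # 0.75
```

A single unsupported claim is enough to pull the score below 1, which is why this metric is useful for spotting hallucinated details.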

In [None]:
from eval.evaluate import ragas_evaluate


#check if results_df exists, if not import it
if 'results_df' not in locals():
    import os
    import ast
    import pandas as pd
    from dotenv import find_dotenv, load_dotenv

    load_dotenv(find_dotenv())

    experiment_name = "baseline_pubmed_articles"
    model = os.getenv("GEN_STEP_MODEL")
    results_df = pd.read_csv(f'data/results-{experiment_name}-{model}.csv')

    # Convert the contexts to a list of strings using ast
    results_df['contexts'] = results_df['contexts'].apply(ast.literal_eval)

# Calculate metrics and store
results = ragas_evaluate(results_df)

# Print the aggregate scores, then show the per-question scores
from pprint import pprint
pprint(results)

pd_results = results.to_pandas()
display(pd_results)
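
The `ast.literal_eval` step in the cell above is needed because a list stored in a CSV cell comes back as its string representation, not as a list. The sketch below shows the same round trip with the standard library `csv` module, which also stringifies non-string values when writing. The row contents are made up.

```python
import ast
import csv
import io

rows = [{"question": "q1", "contexts": ["chunk a", "chunk b"]}]

# Write to an in-memory CSV; the list is written as the text "['chunk a', 'chunk b']"
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["question", "contexts"])
writer.writeheader()
writer.writerows(rows)

# Read it back: every cell is now a plain string
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
raw = loaded[0]["contexts"]

# literal_eval safely parses the string back into a Python list
restored = ast.literal_eval(raw)
print(type(raw).__name__, type(restored).__name__)  # str list
```

The same conversion applies to any results file read back with `pd.read_csv` before it is passed to the evaluation step.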