Let's start by getting the dataset. We will use the data we generated in the module.

In particular, we'll evaluate the quality of our RAG system with gpt-4o-mini.

In [4]:
import pandas as pd

github_url = 'https://github.com/DataTalksClub/llm-zoomcamp/blob/main/04-monitoring/data/results-gpt4o-mini.csv'
url = f'{github_url}?raw=1'
df = pd.read_csv(url)

In [5]:
df = df.iloc[:300]

In [6]:
df.head()

Unnamed: 0,answer_llm,answer_orig,document,question,course
0,You can sign up for the course by visiting the...,Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,Where can I sign up for the course?,machine-learning-zoomcamp
1,You can sign up using the link provided in the...,Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,Can you provide a link to sign up?,machine-learning-zoomcamp
2,"Yes, there is an FAQ for the Machine Learning ...",Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,Is there an FAQ for this Machine Learning course?,machine-learning-zoomcamp
3,The context does not provide any specific info...,Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,Does this course have a GitHub repository for ...,machine-learning-zoomcamp
4,To structure your questions and answers for th...,Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,How can I structure my questions and answers f...,machine-learning-zoomcamp


Using the embedding model multi-qa-mpnet-base-dot-v1 from the Sentence Transformers library, create the embeddings for the first LLM answer.
What's the first value of the resulting vector?

In [7]:
from sentence_transformers import SentenceTransformer
embedding_model = SentenceTransformer('multi-qa-mpnet-base-dot-v1')

You try to use a model that was created with version 3.0.0.dev0, however, your version is 2.7.0. This might cause unexpected behavior or errors. In that case, try to update to the latest version.





In [8]:
answer_llm = df.iloc[0].answer_llm

In [12]:
print(embedding_model.encode(answer_llm)[0])

-0.42244655


Computing the dot product
Now, for each answer pair, let's create embeddings and compute the dot product between them.

We will put the results (scores) into a list (evaluations2 in the code below).

What's the 75th percentile of the score?

In [13]:
from tqdm.notebook import tqdm

In [27]:
evaluations2 = []
for i, row in tqdm(df.iterrows()):
    emb_answ_llm = embedding_model.encode(row.answer_llm)
    emb_answ_orig = embedding_model.encode(row.answer_orig)
    evaluations2.append(emb_answ_llm.dot(emb_answ_orig))

In [31]:
import numpy as np

print(np.percentile(evaluations2, 75))


31.67430877685547
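
The loop above encodes one answer at a time. As a rough sketch of the same computation in batch form, the function below takes two already-computed embedding matrices (one row per answer) and returns the row-wise dot products. The name rowwise_dot and the toy arrays are made up for illustration; in the notebook the matrices would presumably come from calling embedding_model.encode on the full lists of answers, which is an assumption and not part of the original solution.

```python
import numpy as np


def rowwise_dot(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Dot product between matching rows of two (n, d) arrays."""
    # Multiply element-wise, then sum over the embedding dimension
    return (a * b).sum(axis=1)


# Toy stand-ins for embeddings (hypothetical values, 3 answers, 2 dimensions)
toy_llm = np.array([[1.0, 2.0], [0.0, 3.0], [2.0, 2.0]])
toy_orig = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

toy_scores = rowwise_dot(toy_llm, toy_orig)
print(toy_scores)

# Same percentile call as in the cell above, applied to the toy scores
toy_p75 = np.percentile(toy_scores, 75)
print(toy_p75)
```

Working on whole matrices avoids the Python-level loop over rows, but the scores should be the same as those from the per-row version.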


Computing the cosine
From Q2, we can see that the results are not within the [0, 1] range. It's because the vectors coming from this model are not normalized.

So we need to normalize them.

To do it, we:

Compute the norm of the vector
Divide each element by this norm

So, for vector v, it'll be v / ||v||.

In numpy, this is how you do it:

norm = np.sqrt((v * v).sum())
v_norm = v / norm

Let's put it into a function and then compute the dot product between normalized vectors. This will give us cosine similarity.

What's the 75th percentile of the cosine scores?

In [41]:
def normalize(v):
    norm = np.sqrt((v * v).sum())
    return v / norm

In [42]:
evaluations_norm = []
for i, row in tqdm(df.iterrows()):
    emb_answ_llm = normalize(embedding_model.encode(row.answer_llm))
    emb_answ_orig = normalize(embedding_model.encode(row.answer_orig))
    evaluations_norm.append(emb_answ_llm.dot(emb_answ_orig))

In [44]:
print(np.percentile(evaluations_norm, 75))

0.8362348973751068
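
As a quick sanity check on the normalization step, the sketch below compares the dot product of normalized vectors with the textbook cosine formula v·w / (||v|| ||w||), computed with np.linalg.norm. The vectors are made-up toy values, not embeddings from the model.

```python
import numpy as np


def normalize(v):
    # Same computation as the normalize function above: divide by the Euclidean norm
    norm = np.sqrt((v * v).sum())
    return v / norm


# Toy vectors chosen so that the norms are easy to check by hand (both are 5)
v = np.array([3.0, 4.0])
w = np.array([4.0, 3.0])

cos_from_normalized = normalize(v).dot(normalize(w))
cos_from_formula = v.dot(w) / (np.linalg.norm(v) * np.linalg.norm(w))

print(cos_from_normalized, cos_from_formula)  # both approximately 0.96
```

Both routes give the same number, so the choice between them is mostly a matter of whether the normalized vectors are needed again later.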


Rouge

Now we will explore an alternative metric - the ROUGE score.

This is a set of metrics that compares two answers based on the overlap of n-grams, word sequences, and word pairs.

It can give a more nuanced view of text similarity than just cosine similarity alone.

Install the Python package for it:

pip install rouge

Let's compute the ROUGE score between the answers at index 10 of our dataframe (doc_id=5170565b).

In [47]:
!pip install rouge

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Collecting rouge
  Downloading rouge-1.0.1-py3-none-any.whl.metadata (4.1 kB)
Downloading rouge-1.0.1-py3-none-any.whl (13 kB)
Installing collected packages: rouge
Successfully installed rouge-1.0.1


In [51]:
from rouge import Rouge
rouge_scorer = Rouge()

r = df.iloc[10]

scores = rouge_scorer.get_scores(r['answer_llm'], r['answer_orig'])[0]

In [52]:
print(scores)

{'rouge-1': {'r': 0.45454545454545453, 'p': 0.45454545454545453, 'f': 0.45454544954545456}, 'rouge-2': {'r': 0.21621621621621623, 'p': 0.21621621621621623, 'f': 0.21621621121621637}, 'rouge-l': {'r': 0.3939393939393939, 'p': 0.3939393939393939, 'f': 0.393939388939394}}


In [53]:
print(r)

answer_llm     Yes, all sessions are recorded, so if you miss...
answer_orig    Everything is recorded, so you won’t miss anyt...
document                                                5170565b
question                    Are sessions recorded if I miss one?
course                                 machine-learning-zoomcamp
Name: 10, dtype: object


There are three scores: rouge-1, rouge-2 and rouge-l, and precision, recall and F1 score for each.

rouge-1 - the overlap of unigrams,
rouge-2 - bigrams,
rouge-l - the longest common subsequence

What's the F score for rouge-1?

In [56]:
print(scores['rouge-1']['f'])

0.45454544954545456
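
To make the three metrics more concrete, here is a simplified from-scratch sketch of ROUGE-N (clipped n-gram overlap) and ROUGE-L (longest common subsequence) on whitespace tokens. It is not the implementation used by the rouge package, whose tokenization, n-gram counting and F-score weighting may differ, so the numbers are not expected to match the package exactly. All function names and the toy sentences are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def prf(overlap, hyp_total, ref_total):
    """Precision, recall and F1 from an overlap count and the two totals."""
    p = overlap / hyp_total if hyp_total else 0.0
    r = overlap / ref_total if ref_total else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return {"p": p, "r": r, "f": f}


def rouge_n(hyp, ref, n):
    hyp_counts = Counter(ngrams(hyp.split(), n))
    ref_counts = Counter(ngrams(ref.split(), n))
    # Counter intersection keeps the minimum count of each shared n-gram
    overlap = sum((hyp_counts & ref_counts).values())
    return prf(overlap, sum(hyp_counts.values()), sum(ref_counts.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(hyp, ref):
    h, r = hyp.split(), ref.split()
    return prf(lcs_length(h, r), len(h), len(r))


# Toy sentences (made up for illustration)
hyp = "the cat sat on the mat"
ref = "the cat lay on the mat"

print(rouge_n(hyp, ref, 1))  # 5 of 6 unigrams shared
print(rouge_n(hyp, ref, 2))  # 3 of 5 bigrams shared
print(rouge_l(hyp, ref))     # LCS is "the cat on the mat", length 5
```

Note how a single changed word removes two bigrams but only one unigram, which is why rouge-2 tends to be the lowest of the three scores.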


Average rouge score
Let's compute the average of the rouge-1, rouge-2 and rouge-l F scores for the same record from Q4.

In [64]:
avg_rouge = sum([rouge['f'] for rouge in scores.values()]) / len(scores)

In [65]:
print(avg_rouge)

0.35490034990035496


Average rouge score for all the data points
Now let's compute the score for all the records:

rouge_1 = scores['rouge-1']['f']
rouge_2 = scores['rouge-2']['f']
rouge_l = scores['rouge-l']['f']
rouge_avg = (rouge_1 + rouge_2 + rouge_l) / 3

And create a dataframe from them.

In [70]:
all_scores = []

for _, row in tqdm(df.iterrows()):
    scores = rouge_scorer.get_scores(row['answer_llm'], row['answer_orig'])[0]
    all_scores.append([rouge['f'] for rouge in scores.values()])

In [73]:
scores_df = pd.DataFrame(all_scores, columns=['rouge-1', 'rouge-2', 'rouge-l'])
scores_df.head()

Unnamed: 0,rouge-1,rouge-2,rouge-l
0,0.095238,0.028169,0.095238
1,0.125,0.055556,0.09375
2,0.415584,0.177778,0.38961
3,0.216216,0.047059,0.189189
4,0.142076,0.033898,0.120219


In [74]:
scores_df['rouge_average'] = scores_df.mean(axis=1)
scores_df.head()

Unnamed: 0,rouge-1,rouge-2,rouge-l,rouge_average
0,0.095238,0.028169,0.095238,0.072882
1,0.125,0.055556,0.09375,0.091435
2,0.415584,0.177778,0.38961,0.327658
3,0.216216,0.047059,0.189189,0.150821
4,0.142076,0.033898,0.120219,0.098731


What's the average rouge-2 across all the records?

In [75]:
print(scores_df['rouge-2'].mean())

0.20696501983423318
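
The cells above collect the F scores as plain lists, so the column names passed to the dataframe have to match the order in which the scorer returns its metrics. As a hedged alternative, the sketch below keeps each row as a dict keyed by metric name, so nothing depends on that order. The first score dict is the one printed earlier for the record at index 10; the second is made up only to have more than one row. The helper name f_scores is hypothetical.

```python
from statistics import fmean

score_dicts = [
    # Scores printed above for the record at index 10
    {
        "rouge-1": {"r": 0.45454545454545453, "p": 0.45454545454545453, "f": 0.45454544954545456},
        "rouge-2": {"r": 0.21621621621621623, "p": 0.21621621621621623, "f": 0.21621621121621637},
        "rouge-l": {"r": 0.3939393939393939, "p": 0.3939393939393939, "f": 0.393939388939394},
    },
    # Made-up values, for illustration only
    {
        "rouge-1": {"r": 0.5, "p": 0.5, "f": 0.5},
        "rouge-2": {"r": 0.25, "p": 0.25, "f": 0.25},
        "rouge-l": {"r": 0.75, "p": 0.75, "f": 0.75},
    },
]


def f_scores(scores):
    """Keep only the F score of each metric, keyed by metric name."""
    return {name: metric["f"] for name, metric in scores.items()}


rows = [f_scores(s) for s in score_dicts]

# Per-record average of the three F scores
for row in rows:
    row["rouge_average"] = fmean(row.values())

# Per-metric average across all records
column_means = {name: fmean(row[name] for row in rows) for name in rows[0]}

print(rows[0]["rouge_average"])
print(column_means["rouge-2"])
```

A list of dicts like rows can also be passed directly to pd.DataFrame, which takes the column names from the keys.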
