## Homework: Evaluation and Monitoring

In this homework, we'll evaluate the quality of our RAG system.

> It's possible that your answers won't match exactly. If that's the case, select the closest one.

Solution:

* Video: TBA
* Notebook: TBA

## Getting the data

Let's start by getting the dataset. We will use the data we generated in the module.

In particular, we'll evaluate the quality of our RAG system
with [gpt-4o-mini](https://github.com/DataTalksClub/llm-zoomcamp/blob/main/04-monitoring/data/results-gpt4o-mini.csv).

In [15]:
import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer
from tqdm.auto import tqdm
from rouge import Rouge

In [2]:
## Read it:
url = 'https://github.com/DataTalksClub/llm-zoomcamp/blob/main/04-monitoring/data/results-gpt4o-mini.csv?raw=1'
df = pd.read_csv(url)

In [3]:
## We will use only the first 300 documents:

df = df.iloc[:300]
df

| | answer_llm | answer_orig | document | question | course |
|---|---|---|---|---|---|
| 0 | You can sign up for the course by visiting the... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | Where can I sign up for the course? | machine-learning-zoomcamp |
| 1 | You can sign up using the link provided in the... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | Can you provide a link to sign up? | machine-learning-zoomcamp |
| 2 | Yes, there is an FAQ for the Machine Learning ... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | Is there an FAQ for this Machine Learning course? | machine-learning-zoomcamp |
| 3 | The context does not provide any specific info... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | Does this course have a GitHub repository for ... | machine-learning-zoomcamp |
| 4 | To structure your questions and answers for th... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | How can I structure my questions and answers f... | machine-learning-zoomcamp |
| ... | ... | ... | ... | ... | ... |
| 295 | An alternative way to load the data using the ... | Above users showed how to load the dataset dir... | 8d209d6d | What is an alternative way to load the data us... | machine-learning-zoomcamp |
| 296 | You can directly download the dataset from Git... | Above users showed how to load the dataset dir... | 8d209d6d | How can I directly download the dataset from G... | machine-learning-zoomcamp |
| 297 | You can fetch data for homework using the `req... | Above users showed how to load the dataset dir... | 8d209d6d | Could you share a method to fetch data for hom... | machine-learning-zoomcamp |
| 298 | If the status code is 200 when downloading dat... | Above users showed how to load the dataset dir... | 8d209d6d | What should I do if the status code is 200 whe... | machine-learning-zoomcamp |


## Q1. Getting the embeddings model

Now, get the embeddings model `multi-qa-mpnet-base-dot-v1` from
[the Sentence Transformers library](https://www.sbert.net/docs/sentence_transformer/pretrained_models.html#model-overview).

> Note: this is not the same model as in HW3

```python
from sentence_transformers import SentenceTransformer

model_name = "multi-qa-mpnet-base-dot-v1"
embedding_model = SentenceTransformer(model_name)
```

Create the embeddings for the first LLM answer:

```python
answer_llm = df.iloc[0].answer_llm
```

What's the first value of the resulting vector?

* -0.42 ANS
* -0.22
* -0.02
* 0.21

In [4]:
model_name = "multi-qa-mpnet-base-dot-v1"
embedding_model = SentenceTransformer(model_name)

In [5]:
answer_llm = df.iloc[0].answer_llm
embeddings = embedding_model.encode(answer_llm)
first_value = embeddings[0]

In [8]:
first_value

-0.42244658

## Q2. Computing the dot product


Now, for each answer pair, let's create embeddings and compute the dot product between them.

We will put the results (scores) into the `evaluations` list.
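As a rough illustration of the computation, here is a minimal sketch that uses small toy arrays in place of real embeddings. The names `emb_llm`, `emb_orig` and the toy values are assumptions made for this example. With the real data, the two arrays could come from calling `embedding_model.encode` on the lists of LLM answers and original answers, since `encode` also accepts a list of strings and returns one row per input.

```python
import numpy as np

# Toy stand-ins for the embeddings (hypothetical values):
# one row per answer, one column per embedding dimension.
emb_llm = np.array([[1.0, 2.0], [3.0, 4.0], [0.0, 1.0]])
emb_orig = np.array([[2.0, 0.0], [1.0, 1.0], [5.0, 5.0]])

# Row-wise dot product: multiply element by element, then sum each row.
evaluations = (emb_llm * emb_orig).sum(axis=1).tolist()

# 75th percentile of the scores, using numpy's default linear interpolation.
p75 = np.percentile(evaluations, 75)
print(evaluations, p75)
```

Computing all pairs at once avoids the Python-level loop, but the per-row loop gives the same scores.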

What's the 75th percentile of the score?

* 21.67
* 31.67 ANS
* 41.67
* 51.67

In [9]:
for i, row in tqdm(df.iterrows()):
    embeddings_answer_llm = embedding_model.encode(row.answer_llm)
    embeddings_answer_orig = embedding_model.encode(row.answer_orig)  
    df.at[i, "cosine"] = embeddings_answer_llm.dot(embeddings_answer_orig)

300it [01:16,  3.94it/s]


In [10]:
df["cosine"].describe()

count    300.000000
mean      27.495996
std        6.384743
min        4.547925
25%       24.307841
50%       28.336862
75%       31.674304
max       39.476013
Name: cosine, dtype: float64

In [11]:
np.percentile(df["cosine"], 75)

31.674304008483887

## Q3. Computing the cosine

From Q2, we can see that the results are not within the [0, 1] range. It's because the vectors coming from this model are not normalized.

So we need to normalize them.

To do it, we:

* Compute the norm of a vector
* Divide each element by this norm

So, for vector `v`, it'll be `v / ||v||`

In numpy, this is how you do it:

```python
norm = np.sqrt((v * v).sum())
v_norm = v / norm
```

Let's put it into a function and then compute the dot product
between normalized vectors. This will give us the cosine similarity.
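To see why this works, here is a small sketch on two toy vectors. The helper names `normalize` and `cosine_similarity` and the vectors `u` and `v` are made up for this illustration. Both vectors have length 5, so the raw dot product of 24 becomes 24 / (5 * 5) after normalization.

```python
import numpy as np

def normalize(v):
    """Scale a vector to unit length: v / ||v||."""
    norm = np.sqrt((v * v).sum())
    return v / norm

def cosine_similarity(u, v):
    """Dot product of the two unit-length versions of u and v."""
    return normalize(u).dot(normalize(v))

u = np.array([3.0, 4.0])
v = np.array([4.0, 3.0])

raw_dot = u.dot(v)             # depends on the vector lengths
cos = cosine_similarity(u, v)  # depends only on the angle between them
print(raw_dot, cos)
```

Cosine similarity always falls in [-1, 1], no matter how long the original vectors are, which makes scores comparable across answer pairs.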

What's the 75th percentile of the cosine similarity scores?

* 0.63
* 0.73
* 0.83 ANS
* 0.93

In [12]:
def vector_normalized(vector):
    norm = np.sqrt((vector * vector).sum())
    return vector / norm

In [13]:
for i, row in tqdm(df.iterrows()):
    embeddings_answer_llm = embedding_model.encode(row.answer_llm)
    embeddings_answer_orig = embedding_model.encode(row.answer_orig)
    embeddings_answer_llm_normalized = vector_normalized(embeddings_answer_llm)
    embeddings_answer_orig_normalized = vector_normalized(embeddings_answer_orig)

    df.at[i, "cosine_normalized"] = embeddings_answer_llm_normalized.dot(embeddings_answer_orig_normalized)

300it [00:17, 16.99it/s]


In [14]:
np.percentile(df["cosine_normalized"], 75)

0.8362347632646561

## Q4. Rouge

Now we will explore an alternative metric - the ROUGE score.  

This is a set of metrics that compares two answers based on the overlap of n-grams, word sequences, and word pairs.

It can give a more nuanced view of text similarity than just cosine similarity alone.

We don't need to implement it ourselves; there's a Python package for it:

```bash
pip install rouge
```

(The latest version at the moment of writing is `1.0.1`)

Let's compute the ROUGE score between the answers at index 10 of our dataframe (`doc_id=5170565b`).

```python
from rouge import Rouge
rouge_scorer = Rouge()

r = df.iloc[10]  # the record at index 10
scores = rouge_scorer.get_scores(r['answer_llm'], r['answer_orig'])[0]
```

There are three scores: `rouge-1`, `rouge-2` and `rouge-l`, and precision, recall and F1 score for each.

* `rouge-1` - the overlap of unigrams,
* `rouge-2` - bigrams,
* `rouge-l` - the longest common subsequence
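
To make precision, recall and F1 concrete, here is a simplified sketch of a unigram-overlap score on two short toy strings. It is only an illustration: the function name `rouge_1_sketch`, the whitespace tokenization and the clipped counting are assumptions of this example, and the `rouge` package may tokenize and count differently, so the numbers are not expected to match its output.

```python
from collections import Counter

def rouge_1_sketch(hypothesis, reference):
    """Simplified unigram overlap; not the `rouge` package's implementation."""
    hyp = Counter(hypothesis.lower().split())
    ref = Counter(reference.lower().split())
    # Shared unigrams, each counted at most as often as it appears in both texts.
    overlap = sum((hyp & ref).values())
    if overlap == 0:
        return {"r": 0.0, "p": 0.0, "f": 0.0}
    precision = overlap / sum(hyp.values())  # share of hypothesis words that match
    recall = overlap / sum(ref.values())     # share of reference words that are covered
    f1 = 2 * precision * recall / (precision + recall)
    return {"r": recall, "p": precision, "f": f1}

toy_scores = rouge_1_sketch("all sessions are recorded", "everything is recorded")
print(toy_scores)
```

Only "recorded" is shared here, so precision is 1/4, recall is 1/3, and F1 is their harmonic mean.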

What's the F score for `rouge-1`?

- 0.35
- 0.45 ANS
- 0.55
- 0.65

In [16]:
row_10 = df.iloc[10]
row_10

answer_llm           Yes, all sessions are recorded, so if you miss...
answer_orig          Everything is recorded, so you won’t miss anyt...
document                                                      5170565b
question                          Are sessions recorded if I miss one?
course                                       machine-learning-zoomcamp
cosine                                                       32.344711
cosine_normalized                                             0.777956
Name: 10, dtype: object

In [17]:
rouge_scorer = Rouge()

In [18]:
rouge_scores = rouge_scorer.get_scores(row_10.answer_llm, row_10.answer_orig)
rouge_scores

[{'rouge-1': {'r': 0.45454545454545453,
   'p': 0.45454545454545453,
   'f': 0.45454544954545456},
  'rouge-2': {'r': 0.21621621621621623,
   'p': 0.21621621621621623,
   'f': 0.21621621121621637},
  'rouge-l': {'r': 0.3939393939393939,
   'p': 0.3939393939393939,
   'f': 0.393939388939394}}]

In [19]:
rouge_scores[0]['rouge-1']['f']

0.45454544954545456

## Q5. Average rouge score

Let's compute the average of the F-scores for `rouge-1`, `rouge-2` and `rouge-l` for the same record from Q4.

- 0.35 ANS
- 0.45
- 0.55
- 0.65

In [21]:
rouge_1_f1 = rouge_scores[0]['rouge-1']['f']
rouge_2_f1 = rouge_scores[0]['rouge-2']['f']
rouge_l_f1 = rouge_scores[0]['rouge-l']['f']

In [22]:
np.mean([rouge_1_f1,rouge_2_f1,rouge_l_f1])

0.35490034990035496

## Q6. Average rouge score for all the data points

Now let's compute the score for all the records and create a dataframe from them.
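
One possible way to do this, sketched under a few assumptions: the helper `flatten_rouge` and the column names such as `rouge_2_f` are hypothetical, and the single example record reuses the values printed in Q4. Each nested result is flattened into one flat dictionary per record, which is the shape `pd.DataFrame(rows)` expects.

```python
from statistics import mean

def flatten_rouge(score):
    """Flatten one ROUGE result into columns like 'rouge_2_f' (hypothetical names)."""
    return {
        f"{metric.replace('-', '_')}_{part}": value
        for metric, parts in score.items()
        for part, value in parts.items()
    }

# One example record, copied from the Q4 output.
example = {
    "rouge-1": {"r": 0.45454545454545453, "p": 0.45454545454545453, "f": 0.45454544954545456},
    "rouge-2": {"r": 0.21621621621621623, "p": 0.21621621621621623, "f": 0.21621621121621637},
    "rouge-l": {"r": 0.3939393939393939, "p": 0.3939393939393939, "f": 0.393939388939394},
}

# With the real data, each row would come from
# rouge_scorer.get_scores(row.answer_llm, row.answer_orig)[0].
rows = [flatten_rouge(example)]

# Average of one column across all collected rows.
avg_rouge_2 = mean(row["rouge_2_f"] for row in rows)
print(sorted(rows[0]), avg_rouge_2)
```

With all 300 rows collected, `pd.DataFrame(rows)` would give one column per metric and part, and the mean of the `rouge_2_f` column is the value the question asks about.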

What's the average `rouge_2` across all the records?

- 0.10
- 0.20 ANS
- 0.30
- 0.40

In [30]:
all_rouge_scores = []

for i, row in tqdm(df.iterrows()):
    rouge_scores = rouge_scorer.get_scores(row.answer_llm, row.answer_orig)
    all_rouge_scores.append(rouge_scores)

300it [00:00, 413.02it/s]


In [31]:
df.head(20)

| | answer_llm | answer_orig | document | question | course | cosine | cosine_normalized | rouge_l_f1 |
|---|---|---|---|---|---|---|---|---|
| 0 | You can sign up for the course by visiting the... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | Where can I sign up for the course? | machine-learning-zoomcamp | 17.515997 | 0.506754 | 0.153846 |
| 1 | You can sign up using the link provided in the... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | Can you provide a link to sign up? | machine-learning-zoomcamp | 13.418406 | 0.388549 | 0.153846 |
| 2 | Yes, there is an FAQ for the Machine Learning ... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | Is there an FAQ for this Machine Learning course? | machine-learning-zoomcamp | 25.313255 | 0.718599 | 0.153846 |
| 3 | The context does not provide any specific info... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | Does this course have a GitHub repository for ... | machine-learning-zoomcamp | 12.147417 | 0.337266 | 0.153846 |
| 4 | To structure your questions and answers for th... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | How can I structure my questions and answers f... | machine-learning-zoomcamp | 18.747723 | 0.521792 | 0.153846 |
| 5 | The course videos are pre-recorded, and you ca... | The course videos are pre-recorded, you can st... | 39fda9f0 | Are the course videos live or pre-recorded? | machine-learning-zoomcamp | 33.970398 | 0.830532 | 0.153846 |
| 6 | You can start watching the course videos right... | The course videos are pre-recorded, you can st... | 39fda9f0 | When can I start watching the course videos? | machine-learning-zoomcamp | 30.251696 | 0.746283 | 0.153846 |
| 7 | Yes, the live office hours sessions are recorded. | The course videos are pre-recorded, you can st... | 39fda9f0 | Are the live office hours sessions recorded? | machine-learning-zoomcamp | 29.521582 | 0.694406 | 0.153846 |
| 8 | You can find the office hours sessions in the ... | The course videos are pre-recorded, you can st... | 39fda9f0 | Where can I find the office hours sessions? | machine-learning-zoomcamp | 35.272198 | 0.846886 | 0.153846 |
| 9 | You can access the pre-recorded course videos ... | The course videos are pre-recorded, you can st... | 39fda9f0 | Where can I access the pre-recorded course vid... | machine-learning-zoomcamp | 27.751755 | 0.655907 | 0.153846 |


In [32]:
round(df.rouge_l_f1.mean(),1)

0.2