In [29]:
import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer
from rouge import Rouge
from tqdm.auto import tqdm

## Homework: Evaluation and Monitoring

In this homework, we'll evaluate the quality of our RAG system.

> It's possible that your answers won't match exactly. If that's the case, select the closest one.

Solution:

* Video: TBA
* Notebook: TBA

## Getting the data

Let's start by getting the dataset. We will use the data we generated in the module.

In particular, we'll evaluate the quality of our RAG system
with [gpt-4o-mini](https://github.com/DataTalksClub/llm-zoomcamp/blob/main/04-monitoring/data/results-gpt4o-mini.csv)


Read it:

```python
url = f'{github_url}?raw=1'
df = pd.read_csv(url)
```

We will use only the first 300 documents:


```python
df = df.iloc[:300]
```

In [3]:
github_url = "https://github.com/DataTalksClub/llm-zoomcamp/blob/main/04-monitoring/data/results-gpt4o-mini.csv"
url = f"{github_url}?raw=1"
df = pd.read_csv(url)
df = df.iloc[:300]

df

Unnamed: 0,answer_llm,answer_orig,document,question,course
0,You can sign up for the course by visiting the...,Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,Where can I sign up for the course?,machine-learning-zoomcamp
1,You can sign up using the link provided in the...,Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,Can you provide a link to sign up?,machine-learning-zoomcamp
2,"Yes, there is an FAQ for the Machine Learning ...",Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,Is there an FAQ for this Machine Learning course?,machine-learning-zoomcamp
3,The context does not provide any specific info...,Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,Does this course have a GitHub repository for ...,machine-learning-zoomcamp
4,To structure your questions and answers for th...,Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,How can I structure my questions and answers f...,machine-learning-zoomcamp
...,...,...,...,...,...
295,An alternative way to load the data using the ...,Above users showed how to load the dataset dir...,8d209d6d,What is an alternative way to load the data us...,machine-learning-zoomcamp
296,You can directly download the dataset from Git...,Above users showed how to load the dataset dir...,8d209d6d,How can I directly download the dataset from G...,machine-learning-zoomcamp
297,You can fetch data for homework using the `req...,Above users showed how to load the dataset dir...,8d209d6d,Could you share a method to fetch data for hom...,machine-learning-zoomcamp
298,If the status code is 200 when downloading dat...,Above users showed how to load the dataset dir...,8d209d6d,What should I do if the status code is 200 whe...,machine-learning-zoomcamp


## Q1. Getting the embeddings model

Now, get the embeddings model `multi-qa-mpnet-base-dot-v1` from
[the Sentence Transformer library](https://www.sbert.net/docs/sentence_transformer/pretrained_models.html#model-overview)

> Note: this is not the same model as in HW3

```python
from sentence_transformers import SentenceTransformer
embedding_model = SentenceTransformer(model_name)
```

Create the embeddings for the first LLM answer:

```python
answer_llm = df.iloc[0].answer_llm
```

What's the first value of the resulting vector?

* -0.42
* -0.22
* -0.02
* 0.21

### Answer

The first value of the resulting vector is `-0.42`

In [10]:
model_name = "multi-qa-mpnet-base-dot-v1"
embedding_model = SentenceTransformer(model_name)

answer_llm = df.iloc[0].answer_llm
embedding = embedding_model.encode(answer_llm)
print("Shape of embedding: ", len(embedding))
print(embedding[:5])

Shape of embedding:  768
[-0.4224466  -0.22485617 -0.3240584  -0.2847585   0.00725639]


In [11]:
embedding[0]

-0.4224466

## Q2. Computing the dot product


Now, for each answer pair, let's create embeddings and compute the dot product between them.

We will put the results (scores) into the `evaluations` list.

What's the 75th percentile of the score?

* 21.67
* 31.67
* 41.67
* 51.67

### Answer

The 75th percentile of the score is `31.67`

In [23]:
df_eval = df.copy()

dict_v_llm = {}
dict_v_orig = {}

evaluations = []

for idx, row in tqdm(df_eval.iterrows(), total=df_eval.shape[0]):
    answer_llm = row["answer_llm"]
    answer_orig = row["answer_orig"]

    v_llm = embedding_model.encode(answer_llm)
    v_orig = embedding_model.encode(answer_orig)

    dict_v_llm[idx] = v_llm
    dict_v_orig[idx] = v_orig

    similarity = v_llm.dot(
        v_orig
    )  # not normalized vectors! this is not cosine similarity
    evaluations.append(similarity)

# Append to dataframe
df_eval["dot_product"] = evaluations
df_eval

100%|██████████| 300/300 [15:05<00:00,  3.02s/it]


Unnamed: 0,answer_llm,answer_orig,document,question,course,dot_product
0,You can sign up for the course by visiting the...,Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,Where can I sign up for the course?,machine-learning-zoomcamp,17.515991
1,You can sign up using the link provided in the...,Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,Can you provide a link to sign up?,machine-learning-zoomcamp,13.418400
2,"Yes, there is an FAQ for the Machine Learning ...",Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,Is there an FAQ for this Machine Learning course?,machine-learning-zoomcamp,25.313255
3,The context does not provide any specific info...,Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,Does this course have a GitHub repository for ...,machine-learning-zoomcamp,12.147413
4,To structure your questions and answers for th...,Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,How can I structure my questions and answers f...,machine-learning-zoomcamp,18.747734
...,...,...,...,...,...,...
295,An alternative way to load the data using the ...,Above users showed how to load the dataset dir...,8d209d6d,What is an alternative way to load the data us...,machine-learning-zoomcamp,34.001778
296,You can directly download the dataset from Git...,Above users showed how to load the dataset dir...,8d209d6d,How can I directly download the dataset from G...,machine-learning-zoomcamp,33.690857
297,You can fetch data for homework using the `req...,Above users showed how to load the dataset dir...,8d209d6d,Could you share a method to fetch data for hom...,machine-learning-zoomcamp,34.491524
298,If the status code is 200 when downloading dat...,Above users showed how to load the dataset dir...,8d209d6d,What should I do if the status code is 200 whe...,machine-learning-zoomcamp,27.538351
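
The loop above encodes one text at a time. A possible alternative, sketched under the assumption that the encoder's `encode` method accepts a list of strings and returns one row per text (as `SentenceTransformer.encode` does), is to encode each column in a single call and take row-wise dot products, which may be faster. `rowwise_dot` and `ToyEncoder` are hypothetical names introduced for illustration; the toy encoder exists only so the sketch runs without downloading the model.

```python
import numpy as np


def rowwise_dot(encoder, texts_a, texts_b):
    """Encode two equally long collections of texts and return one dot product per pair."""
    vectors_a = np.asarray(encoder.encode(list(texts_a)))  # shape (n, dim)
    vectors_b = np.asarray(encoder.encode(list(texts_b)))  # shape (n, dim)
    # Element-wise product summed over the embedding axis = dot product per row
    return (vectors_a * vectors_b).sum(axis=1)


class ToyEncoder:
    """Hypothetical stand-in for the real model (assumption, not a real embedding)."""

    def encode(self, texts):
        # Two made-up features per text: character count and word count
        return np.array([[len(t), len(t.split())] for t in texts], dtype=float)


toy_scores = rowwise_dot(ToyEncoder(), ["a b", "hello"], ["c d e", "hi there"])
print(toy_scores)
```

With the real model, the call would look like `rowwise_dot(embedding_model, df["answer_llm"], df["answer_orig"])`.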


In [25]:
df_eval["dot_product"].describe()

count    300.000000
mean      27.495996
std        6.384742
min        4.547925
25%       24.307848
50%       28.336872
75%       31.674307
max       39.476009
Name: dot_product, dtype: float64
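
The `75%` row reported by `describe()` is the 75th percentile. As a minimal sketch, the same statistic can be computed directly from a list of scores with `np.percentile`. The vectors below are made-up toy values standing in for the real embeddings.

```python
import numpy as np

# Toy stand-ins for (v_llm, v_orig) pairs: hypothetical values, not model output
pairs = [
    (np.array([1.0, 2.0]), np.array([3.0, 4.0])),  # dot = 11
    (np.array([0.0, 1.0]), np.array([5.0, 2.0])),  # dot = 2
    (np.array([2.0, 2.0]), np.array([1.0, 1.0])),  # dot = 4
    (np.array([1.0, 0.0]), np.array([7.0, 9.0])),  # dot = 7
    (np.array([3.0, 1.0]), np.array([1.0, 2.0])),  # dot = 5
]

# One dot product per pair, as in the evaluations list
toy_evaluations = [float(a.dot(b)) for a, b in pairs]

# 75th percentile with NumPy's default linear interpolation
p75 = float(np.percentile(toy_evaluations, 75))
print(p75)
```

On the real data, `np.percentile(evaluations, 75)` should agree with the `75%` row of `describe()`.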

## Q3. Computing the cosine

From Q2, we can see that the results are not within the [0, 1] range. It's because the vectors coming from this model are not normalized.

So we need to normalize them.

To do it, we 

* Compute the norm of a vector
* Divide each element by this norm

So, for vector `v`, it'll be `v / ||v||`

In numpy, this is how you do it:

```python
norm = np.sqrt((v * v).sum())
v_norm = v / norm
```

Let's put it into a function and then compute the dot product
between normalized vectors. This will give us the cosine similarity.
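
A small worked example, using hand-picked toy vectors rather than real embeddings, shows why this works: dividing each vector by its norm gives unit-length vectors, and their dot product equals `u·v / (||u|| ||v||)`. The helper name `cosine` is introduced here for illustration only.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity as the dot product of the two normalized vectors."""
    u_norm = u / np.linalg.norm(u)
    v_norm = v / np.linalg.norm(v)
    return float(u_norm.dot(v_norm))


u = np.array([3.0, 4.0])
v = np.array([4.0, 3.0])

# The manual formula from the text and np.linalg.norm give the same norm (5.0 here)
manual_norm = float(np.sqrt((u * u).sum()))

cos_uv = cosine(u, v)  # (3*4 + 4*3) / (5 * 5)
cos_uu = cosine(u, u)  # a vector compared with itself
print(manual_norm, cos_uv, cos_uu)
```

Unlike the raw dot product, the result no longer depends on the length of the vectors, only on the angle between them.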

What's the 75th percentile of the cosine scores?

* 0.63
* 0.73
* 0.83
* 0.93

### Answer

The 75th percentile of the cosine similarity is `0.83`

In [26]:
def normalize_vector(v):
    norm = np.sqrt((v * v).sum())
    v_norm = v / norm
    return v_norm

In [27]:
dict_v_norm_llm = {}
dict_v_norm_orig = {}

evaluations_cosine = []

for idx, row in tqdm(df_eval.iterrows(), total=df_eval.shape[0]):
    answer_llm = row["answer_llm"]
    answer_orig = row["answer_orig"]

    # Already obtained!
    # v_llm = embedding_model.encode(answer_llm)
    # v_orig = embedding_model.encode(answer_orig)

    v_llm = dict_v_llm[idx]
    v_orig = dict_v_orig[idx]

    # normalize vectors
    v_llm_norm = normalize_vector(v=v_llm)
    v_orig_norm = normalize_vector(v=v_orig)

    cosine_similarity = v_llm_norm.dot(
        v_orig_norm
    )  # both vectors are normalized, so this dot product is the cosine similarity
    evaluations_cosine.append(cosine_similarity)

# Append to dataframe
df_eval["cosine_similarity"] = evaluations_cosine
df_eval

100%|██████████| 300/300 [00:00<00:00, 13541.52it/s]


Unnamed: 0,answer_llm,answer_orig,document,question,course,dot_product,cosine_similarity
0,You can sign up for the course by visiting the...,Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,Where can I sign up for the course?,machine-learning-zoomcamp,17.515991,0.506754
1,You can sign up using the link provided in the...,Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,Can you provide a link to sign up?,machine-learning-zoomcamp,13.418400,0.388549
2,"Yes, there is an FAQ for the Machine Learning ...",Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,Is there an FAQ for this Machine Learning course?,machine-learning-zoomcamp,25.313255,0.718599
3,The context does not provide any specific info...,Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,Does this course have a GitHub repository for ...,machine-learning-zoomcamp,12.147413,0.337266
4,To structure your questions and answers for th...,Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,How can I structure my questions and answers f...,machine-learning-zoomcamp,18.747734,0.521792
...,...,...,...,...,...,...,...
295,An alternative way to load the data using the ...,Above users showed how to load the dataset dir...,8d209d6d,What is an alternative way to load the data us...,machine-learning-zoomcamp,34.001778,0.914175
296,You can directly download the dataset from Git...,Above users showed how to load the dataset dir...,8d209d6d,How can I directly download the dataset from G...,machine-learning-zoomcamp,33.690857,0.902190
297,You can fetch data for homework using the `req...,Above users showed how to load the dataset dir...,8d209d6d,Could you share a method to fetch data for hom...,machine-learning-zoomcamp,34.491524,0.904734
298,If the status code is 200 when downloading dat...,Above users showed how to load the dataset dir...,8d209d6d,What should I do if the status code is 200 whe...,machine-learning-zoomcamp,27.538351,0.726782


In [28]:
df_eval["cosine_similarity"].describe()

count    300.000000
mean       0.728393
std        0.157755
min        0.125357
25%        0.651274
50%        0.763761
75%        0.836235
max        0.958796
Name: cosine_similarity, dtype: float64

## Q4. Rouge

Now we will explore an alternative metric - the ROUGE score.  

This is a set of metrics that compares two answers based on the overlap of n-grams, word sequences, and word pairs.

It can give a more nuanced view of text similarity than cosine similarity alone.

We don't need to implement it ourselves because there's a Python package for it:

```bash
pip install rouge
```

(The latest version at the moment of writing is `1.0.1`)

Let's compute the ROUGE score between the answers at index 10 of our dataframe (`doc_id=5170565b`):

```python
from rouge import Rouge
rouge_scorer = Rouge()

# r is the record at index 10
r = df.iloc[10]
scores = rouge_scorer.get_scores(r['answer_llm'], r['answer_orig'])[0]
```

There are three scores: `rouge-1`, `rouge-2` and `rouge-l`, and precision, recall and F1 score for each.

* `rouge-1` - the overlap of unigrams,
* `rouge-2` - bigrams,
* `rouge-l` - the longest common subsequence
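
To make the n-gram overlap idea concrete, here is a simplified sketch of an F1 score over unique unigrams and bigrams. It is an illustration under stated assumptions (whitespace tokenization, lowercasing, set-based overlap), not the `rouge` package's implementation, so its numbers may differ from the package's. The function names and example sentences are hypothetical.

```python
def ngrams(text, n):
    """Return the set of word n-grams in a whitespace-tokenized, lowercased text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_f1(hypothesis, reference, n):
    """F1 of n-gram overlap: harmonic mean of precision and recall."""
    hyp = ngrams(hypothesis, n)
    ref = ngrams(reference, n)
    overlap = len(hyp & ref)
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp)  # share of hypothesis n-grams found in the reference
    recall = overlap / len(ref)     # share of reference n-grams found in the hypothesis
    return 2 * precision * recall / (precision + recall)


hypothesis = "the cat sat on the mat"
reference = "the cat lay on a mat"

f1_unigram = ngram_f1(hypothesis, reference, 1)  # 4 shared words out of 5 and 6 unique
f1_bigram = ngram_f1(hypothesis, reference, 2)   # only ("the", "cat") is shared
print(f1_unigram, f1_bigram)
```

The bigram score is much lower than the unigram score because matching word pairs requires the same words in the same order.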

What's the F score for `rouge-1`?

- 0.35
- 0.45
- 0.55
- 0.65

### Answer

The F1 score for `rouge-1` is `0.45`

In [36]:
row_frame = df.iloc[10, :]
row_frame

answer_llm     Yes, all sessions are recorded, so if you miss...
answer_orig    Everything is recorded, so you won’t miss anyt...
document                                                5170565b
question                    Are sessions recorded if I miss one?
course                                 machine-learning-zoomcamp
Name: 10, dtype: object

In [45]:
rouge_scorer = Rouge()

scores = rouge_scorer.get_scores(row_frame["answer_llm"], row_frame["answer_orig"])[0]
scores["rouge-1"]

{'r': 0.45454545454545453, 'p': 0.45454545454545453, 'f': 0.45454544954545456}

## Q5. Average rouge score

Let's compute the average F-score across `rouge-1`, `rouge-2` and `rouge-l` for the same record from Q4.

- 0.35
- 0.45
- 0.55
- 0.65


### Answer

The average F1 score between `rouge-1`, `rouge-2` and `rouge-l` is `0.35`

In [46]:
np.mean([scores[name]["f"] for name in ["rouge-1", "rouge-2", "rouge-l"]])

0.35490034990035496

## Q6. Average rouge score for all the data points

Now let's compute the score for all the records and create a dataframe from them.

What's the average `rouge-2` F-score across all the records?

- 0.10
- 0.20
- 0.30
- 0.40

### Answer

The average F1 score for `rouge-2` across all records is `0.20`

In [49]:
df_eval_rouge = df.copy()

fscore_1 = []
fscore_2 = []
fscore_l = []


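rouge_scorer = Rouge()  # create the scorer once, outside the loop

for idx, row in tqdm(df_eval_rouge.iterrows(), total=df_eval_rouge.shape[0]):
    answer_llm = row["answer_llm"]
    answer_orig = row["answer_orig"]

    # Same argument order as in Q4: LLM answer first, original answer second
    scores = rouge_scorer.get_scores(answer_llm, answer_orig)[0]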

    fscore_1.append(scores["rouge-1"]["f"])
    fscore_2.append(scores["rouge-2"]["f"])
    fscore_l.append(scores["rouge-l"]["f"])

# Append to dataframe
df_eval_rouge["fscore_1"] = fscore_1
df_eval_rouge["fscore_2"] = fscore_2
df_eval_rouge["fscore_l"] = fscore_l
df_eval_rouge

100%|██████████| 300/300 [00:00<00:00, 372.22it/s]


Unnamed: 0,answer_llm,answer_orig,document,question,course,fscore_1,fscore_2,fscore_l
0,You can sign up for the course by visiting the...,Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,Where can I sign up for the course?,machine-learning-zoomcamp,0.095238,0.028169,0.095238
1,You can sign up using the link provided in the...,Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,Can you provide a link to sign up?,machine-learning-zoomcamp,0.125000,0.055556,0.093750
2,"Yes, there is an FAQ for the Machine Learning ...",Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,Is there an FAQ for this Machine Learning course?,machine-learning-zoomcamp,0.415584,0.177778,0.363636
3,The context does not provide any specific info...,Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,Does this course have a GitHub repository for ...,machine-learning-zoomcamp,0.216216,0.047059,0.135135
4,To structure your questions and answers for th...,Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,How can I structure my questions and answers f...,machine-learning-zoomcamp,0.142076,0.033898,0.120219
...,...,...,...,...,...,...,...,...
295,An alternative way to load the data using the ...,Above users showed how to load the dataset dir...,8d209d6d,What is an alternative way to load the data us...,machine-learning-zoomcamp,0.654545,0.540984,0.618182
296,You can directly download the dataset from Git...,Above users showed how to load the dataset dir...,8d209d6d,How can I directly download the dataset from G...,machine-learning-zoomcamp,0.590164,0.460432,0.573770
297,You can fetch data for homework using the `req...,Above users showed how to load the dataset dir...,8d209d6d,Could you share a method to fetch data for hom...,machine-learning-zoomcamp,0.654867,0.564516,0.637168
298,If the status code is 200 when downloading dat...,Above users showed how to load the dataset dir...,8d209d6d,What should I do if the status code is 200 whe...,machine-learning-zoomcamp,0.304762,0.132231,0.304762


In [50]:
df_eval_rouge["fscore_2"].describe()

count    300.000000
mean       0.206965
std        0.153550
min        0.000000
25%        0.097809
50%        0.178671
75%        0.286181
max        0.739130
Name: fscore_2, dtype: float64

## Submit the results

* Submit your results here: https://courses.datatalks.club/llm-zoomcamp-2024/homework/hw4
* It's possible that your answers won't match exactly. If that's the case, select the closest one.