## Homework: Evaluation and Monitoring

## Getting the data

Let's start by getting the dataset. We will use the data we generated in the module.

In particular, we'll evaluate the quality of our RAG system
with [gpt-4o-mini](https://github.com/DataTalksClub/llm-zoomcamp/blob/main/04-monitoring/data/results-gpt4o-mini.csv).


Read it:

```python
url = f'{github_url}?raw=1'
df = pd.read_csv(url)
```

We will use only the first 300 documents:


```python
df = df.iloc[:300]
```

In [1]:
import pandas as pd

In [2]:
github_url = "https://github.com/DataTalksClub/llm-zoomcamp/blob/main/04-monitoring/data/results-gpt4o-mini.csv"

url = f"{github_url}?raw=1"
df = pd.read_csv(url)
df = df.iloc[:300]

In [3]:
df.shape

(300, 5)

In [4]:
df.head()

| | answer_llm | answer_orig | document | question | course |
|---|---|---|---|---|---|
| 0 | You can sign up for the course by visiting the... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | Where can I sign up for the course? | machine-learning-zoomcamp |
| 1 | You can sign up using the link provided in the... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | Can you provide a link to sign up? | machine-learning-zoomcamp |
| 2 | Yes, there is an FAQ for the Machine Learning ... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | Is there an FAQ for this Machine Learning course? | machine-learning-zoomcamp |
| 3 | The context does not provide any specific info... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | Does this course have a GitHub repository for ... | machine-learning-zoomcamp |
| 4 | To structure your questions and answers for th... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | How can I structure my questions and answers f... | machine-learning-zoomcamp |


## Q1. Getting the embeddings model

Now, get the embeddings model `multi-qa-mpnet-base-dot-v1` from
[the Sentence Transformer library](https://www.sbert.net/docs/sentence_transformer/pretrained_models.html#model-overview).

> Note: this is not the same model as in HW3.

```python
from sentence_transformers import SentenceTransformer

# Model name as given in the task description above
model_name = "multi-qa-mpnet-base-dot-v1"
embedding_model = SentenceTransformer(model_name)
```

Create the embeddings for the first LLM answer:

```python
answer_llm = df.iloc[0].answer_llm
```

What's the first value of the resulting vector?

* -0.42
* -0.22
* -0.02
* 0.21

In [5]:
from sentence_transformers import SentenceTransformer


In [6]:
model_name = "multi-qa-mpnet-base-dot-v1"
embedding_model = SentenceTransformer(model_name)

In [7]:
answer_llm = df.iloc[0].answer_llm
answer_llm

'You can sign up for the course by visiting the course page at http://mlzoomcamp.com/.'

In [8]:
v_llm = embedding_model.encode(answer_llm)

In [9]:
v_llm[0]

-0.42244682

In [10]:
len(v_llm)

768

In [11]:
v_orig = embedding_model.encode(df.iloc[0].answer_orig)
v_orig[0]

-0.030214256

In [12]:
len(v_orig)

768

## Q2. Computing the dot product


Now, for each answer pair, let's create embeddings and compute the dot product between them.

We will put the results (scores) into the `evaluations` list.

What's the 75th percentile of the score?

* 21.67
* 31.67
* 41.67
* 51.67
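
To make the percentile question concrete, here is a minimal, dependency-free sketch of how a percentile with linear interpolation can be computed (this is NumPy's default behaviour for `np.percentile`). The function name `percentile_linear` and the `toy_scores` values are illustrative assumptions, not part of the homework.

```python
import math


def percentile_linear(values, q):
    """Return the q-th percentile using linear interpolation between ranks.

    Hypothetical helper for illustration only.
    """
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    # Fractional 0-based position of the percentile in the sorted data
    position = (len(ordered) - 1) * q / 100
    lower = math.floor(position)
    upper = math.ceil(position)
    fraction = position - lower
    # Interpolate between the two neighbouring values
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


# Toy scores (made up), standing in for the dot-product scores
toy_scores = [10.0, 20.0, 30.0, 40.0]
print(percentile_linear(toy_scores, 75))  # 32.5
```

With four values, the 75th percentile sits at position 2.25 in the sorted list, so the result lies a quarter of the way between 30 and 40.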

In [19]:
from tqdm.auto import tqdm
import numpy as np

In [14]:
list(df.index[:2])

[0, 1]

In [17]:
evaluations_list = []
for _, row in tqdm(df.iterrows(), total=len(df)):  # total lets tqdm show real progress
    v_llm = embedding_model.encode(row["answer_llm"])
    v_orig = embedding_model.encode(row["answer_orig"])
    evaluations_list.append(v_llm.dot(v_orig))

In [18]:
evaluations_list[0]

17.515999

In [25]:
evaluations = np.array(evaluations_list)

In [31]:
evaluations[0]

17.515999

In [26]:
percentile_score_75th = np.percentile(evaluations, 75)
percentile_score_75th

31.67430877685547

## Q3. Computing the cosine

From Q2, we can see that the scores are unbounded: they fall well outside the [-1, 1] range of cosine similarity. That's because the vectors coming from this model are not normalized.

So we need to normalize them.

To do it, we:

* Compute the norm of a vector
* Divide each element by this norm

So, for vector `v`, it'll be `v / ||v||`.

In numpy, this is how you do it:

```python
norm = np.sqrt((v * v).sum())
v_norm = v / norm
```

Let's put it into a function and then compute the dot product
between normalized vectors. This will give us the cosine similarity.
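
As a small worked example of the normalization steps, the sketch below uses plain Python lists instead of model embeddings. The helper names (`dot`, `normalize`, `cosine`) and the two toy vectors are assumptions made for illustration.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Divide each element by the vector's Euclidean norm.

    Assumes v is not the zero vector (its norm would be 0).
    """
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(u, v):
    """Cosine similarity: dot product of the normalized vectors."""
    return dot(normalize(u), normalize(v))


# Toy vectors (made up); both have norm 5
u = [3.0, 4.0]
v = [4.0, 3.0]

print(dot(u, v))               # 24.0 (raw dot product, not bounded)
print(normalize(u))            # [0.6, 0.8]
print(round(cosine(u, v), 2))  # 0.96
```

Scaling either vector changes the raw dot product but leaves the cosine unchanged, which is why the normalized scores are comparable across answer pairs.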

What's the 75th percentile of the cosine scores?

* 0.63
* 0.73
* 0.83
* 0.93

In [27]:
def norm_vector(v):
    norm = np.sqrt((v * v).sum())
    v_norm = v / norm
    return v_norm

In [28]:
def cosine_similarity(v_llm, v_orig):
    v_llm_norm = norm_vector(v_llm)
    v_orig_norm = norm_vector(v_orig)
    return v_llm_norm.dot(v_orig_norm)

In [29]:
evaluations_norm = []
for _, row in tqdm(df.iterrows(), total=len(df)):  # total lets tqdm show real progress
    v_llm = embedding_model.encode(row["answer_llm"])
    v_orig = embedding_model.encode(row["answer_orig"])
    evaluations_norm.append(cosine_similarity(v_llm, v_orig))

In [30]:
evaluations_norm[0]

0.50675434

In [48]:
percentile_cosine_score_75th = np.percentile(evaluations_norm, 75)
percentile_cosine_score_75th

0.836234837770462

## Q4. Rouge

Now we will explore an alternative metric - the ROUGE score.  

This is a set of metrics that compares two answers based on the overlap of n-grams, word sequences, and word pairs.

It can give a more nuanced view of text similarity than just cosine similarity alone.

We don't need to implement it ourselves because there's a Python package for it:

```bash
pip install rouge
```

(The latest version at the moment of writing is `1.0.1`)

Let's compute the ROUGE score between the answers at index 10 of our dataframe (`doc_id=5170565b`):

```python
from rouge import Rouge

rouge_scorer = Rouge()

# Record at index 10 of the dataframe
r = df.iloc[10]
scores = rouge_scorer.get_scores(r['answer_llm'], r['answer_orig'])[0]
```

There are three scores: `rouge-1`, `rouge-2` and `rouge-l`, and precision, recall and F1 score for each.

* `rouge-1` - the overlap of unigrams,
* `rouge-2` - bigrams,
* `rouge-l` - the longest common subsequence
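
To illustrate what n-gram overlap means, here is a simplified sketch of ROUGE-N precision, recall and F1 based on clipped n-gram counts. It assumes lowercase whitespace tokenization; the function names and example sentences are made up for illustration, and the numbers it produces may differ from the `rouge` package, which can tokenize and count differently.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(hypothesis, reference, n=1):
    """Simplified ROUGE-N; an illustrative sketch, not the package's code."""
    hyp = ngrams(hypothesis.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Counter intersection keeps the minimum count of each shared n-gram
    overlap = sum((hyp & ref).values())
    precision = overlap / max(sum(hyp.values()), 1)  # relative to the hypothesis
    recall = overlap / max(sum(ref.values()), 1)     # relative to the reference
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"p": precision, "r": recall, "f": f1}


# Made-up sentences, not taken from the dataset
hyp_text = "the cat sat on the mat"
ref_text = "the cat lay on the rug"

print(round(rouge_n(hyp_text, ref_text, n=1)["f"], 2))  # 0.67
print(round(rouge_n(hyp_text, ref_text, n=2)["f"], 2))  # 0.4
```

Four of the six unigrams match, but only two of the five bigrams do, which shows why `rouge-2` is usually lower than `rouge-1` for the same pair of texts.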

What's the F score for `rouge-1`?

- 0.35
- 0.45
- 0.55
- 0.65

In [33]:
from rouge import Rouge

In [34]:
rouge_scorer = Rouge()

In [35]:
r = df.iloc[10]
r

answer_llm     Yes, all sessions are recorded, so if you miss...
answer_orig    Everything is recorded, so you won’t miss anyt...
document                                                5170565b
question                    Are sessions recorded if I miss one?
course                                 machine-learning-zoomcamp
Name: 10, dtype: object

In [36]:
scores = rouge_scorer.get_scores(r["answer_llm"], r["answer_orig"])[0]

In [37]:
scores

{'rouge-1': {'r': 0.45454545454545453,
  'p': 0.45454545454545453,
  'f': 0.45454544954545456},
 'rouge-2': {'r': 0.21621621621621623,
  'p': 0.21621621621621623,
  'f': 0.21621621121621637},
 'rouge-l': {'r': 0.3939393939393939,
  'p': 0.3939393939393939,
  'f': 0.393939388939394}}

In [38]:
f1_score_rouge_1 = scores["rouge-1"]["f"]
f1_score_rouge_1

0.45454544954545456

## Q5. Average rouge score

Let's compute the average F-score across `rouge-1`, `rouge-2` and `rouge-l` for the same record from Q4.

- 0.35
- 0.45
- 0.55
- 0.65

In [39]:
# Extracting the F-scores
f_scores = [values['f'] for values in scores.values()]

# Calculating the average of the F-scores
average_f_score = sum(f_scores) / len(f_scores)
average_f_score

0.35490034990035496

## Q6. Average rouge score for all the data points

Now let's compute the score for all the records and create a dataframe from them.

What's the average `rouge-2` F-score across all the records?

- 0.10
- 0.20
- 0.30
- 0.40

In [40]:
scores = rouge_scorer.get_scores(r["answer_llm"], r["answer_orig"])

In [41]:
scores

[{'rouge-1': {'r': 0.45454545454545453,
   'p': 0.45454545454545453,
   'f': 0.45454544954545456},
  'rouge-2': {'r': 0.21621621621621623,
   'p': 0.21621621621621623,
   'f': 0.21621621121621637},
  'rouge-l': {'r': 0.3939393939393939,
   'p': 0.3939393939393939,
   'f': 0.393939388939394}}]

In [42]:
def compute_rouge_scores(df):
    """
    Compute ROUGE scores for each record in the dataframe.

    This function takes a dataframe containing columns 'answer_llm' and 'answer_orig',
    computes the ROUGE-1, ROUGE-2, and ROUGE-L F-scores for each record, and returns
    a new dataframe with these F-scores.

    Parameters:
    df (pd.DataFrame): DataFrame containing the generated answers and the original answers.

    Returns:
    pd.DataFrame: DataFrame containing the ROUGE-1, ROUGE-2, and ROUGE-L F-scores for each record.
    """
    # Initialize the Rouge scorer
    rouge_scorer = Rouge()

    # List to store the F-scores
    records = []

    # Iterate over each record in the dataframe
    for idx, row in df.iterrows():
        scores = rouge_scorer.get_scores(row["answer_llm"], row["answer_orig"])[0]
        record = {
            'rouge-1_f': scores['rouge-1']['f'],
            'rouge-2_f': scores['rouge-2']['f'],
            'rouge-l_f': scores['rouge-l']['f']
        }
        records.append(record)

    # Create a new dataframe from the F-scores
    scores_df = pd.DataFrame(records)
    return scores_df

In [43]:
scores_df = compute_rouge_scores(df)

In [44]:
scores_df.head()

| | rouge-1_f | rouge-2_f | rouge-l_f |
|---|---|---|---|
| 0 | 0.095238 | 0.028169 | 0.095238 |
| 1 | 0.125 | 0.055556 | 0.09375 |
| 2 | 0.415584 | 0.177778 | 0.38961 |
| 3 | 0.216216 | 0.047059 | 0.189189 |
| 4 | 0.142076 | 0.033898 | 0.120219 |


In [45]:
scores_df.shape

(300, 3)

In [47]:
average_rouge_2_f_score = scores_df['rouge-2_f'].mean()

# print(scores_df)
print("Average ROUGE-2 F-score:", average_rouge_2_f_score)

Average ROUGE-2 F-score: 0.20696501983423318
