# Homework: Evaluation and Monitoring

## Getting the ground truth data

In [1]:
import pandas as pd

In [3]:
github_url = 'https://github.com/DataTalksClub/llm-zoomcamp/blob/main/04-monitoring/data/results-gpt4o-mini.csv'
url = f'{github_url}?raw=1'
df = pd.read_csv(url)
df = df.iloc[:300]

## Q1. Getting the embeddings model

In [4]:
from sentence_transformers import SentenceTransformer

model_name = 'multi-qa-mpnet-base-dot-v1'
embedding_model = SentenceTransformer(model_name)

You try to use a model that was created with version 3.0.0.dev0, however, your version is 2.7.0. This might cause unexpected behavior or errors. In that case, try to update to the latest version.





In [13]:
df.iloc[0]

answer_llm     You can sign up for the course by visiting the...
answer_orig    Machine Learning Zoomcamp FAQ\nThe purpose of ...
document                                                0227b872
question                     Where can I sign up for the course?
course                                 machine-learning-zoomcamp
Name: 0, dtype: object

In [5]:
answer_llm = df.iloc[0].answer_llm

Create the embedding for the first LLM answer. What's the first value of the resulting vector?

* **-0.42**
* -0.22
* -0.02
* 0.21

In [6]:
answer_llm_vector = embedding_model.encode(answer_llm)
answer_llm_vector[0]

-0.42244655

## Q2. Computing the dot product


Now, for each answer pair, let's create embeddings and compute the dot product between them.

We will put the results (scores) into the `evaluations` list.

What's the 75th percentile of the score?

* 21.67
* **31.67**
* 41.67
* 51.67

In [7]:
def compute_similarity(record):
    answer_orig = record['answer_orig']
    answer_llm = record['answer_llm']
    
    v_llm = embedding_model.encode(answer_llm)
    v_orig = embedding_model.encode(answer_orig)
    
    return v_llm.dot(v_orig)

In [8]:
results_gpt4o = df.to_dict(orient='records')

In [9]:
from tqdm.auto import tqdm

evaluations = []

for record in tqdm(results_gpt4o):
    sim = compute_similarity(record)
    evaluations.append(sim)

100%|██████████| 300/300 [02:22<00:00,  2.11it/s]


In [10]:
df['similarity'] = evaluations
df['similarity'].describe()

count    300.000000
mean      27.495996
std        6.384742
min        4.547924
25%       24.307844
50%       28.336870
75%       31.674309
max       39.476013
Name: similarity, dtype: float64
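
The `75%` row reported by `describe()` is the 75th percentile of the scores. As a minimal sketch of the same two steps (dot product per pair, then the percentile), here is a version on toy vectors. The vectors are made-up stand-ins rather than real embeddings, and the names `toy_pairs`, `toy_scores` and `p75` are only illustrative. It assumes `np.percentile` with its default linear interpolation, which is also the default pandas uses for quantiles.

```python
import numpy as np

# Hypothetical stand-ins for (v_llm, v_orig) pairs; not real model output.
# Each pair is built so that its dot product equals k.
toy_pairs = [(np.array([float(k), 0.0]), np.array([1.0, 7.0])) for k in range(1, 6)]

# One score per pair, as in compute_similarity
toy_scores = [float(a.dot(b)) for a, b in toy_pairs]

# 75th percentile with linear interpolation between neighbouring values
p75 = np.percentile(toy_scores, 75)

print(toy_scores)  # [1.0, 2.0, 3.0, 4.0, 5.0]
print(p75)         # 4.0
```

Note that the dot product is unbounded: scaling either vector scales the score, which is why the values above (and in Q2) are not confined to a fixed range.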

## Q3. Computing the cosine

From Q2, we can see that the results are not within the [0, 1] range. That's because the vectors produced by this model are not normalized.

So we need to normalize them.

To do it, we:

* Compute the norm of a vector
* Divide each element by this norm

So, for vector `v`, it'll be `v / ||v||`

In numpy, this is how you do it:

```python
import numpy as np

norm = np.sqrt((v * v).sum())  # Euclidean length of v (v is a numpy array)
v_norm = v / norm
```
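
To see why normalizing first turns the dot product into a cosine, here is a small sketch on two toy vectors. The vectors `a` and `b` and the helper name `normalize` are illustrative only. It also checks the result against the usual formula `a·b / (||a|| ||b||)` using `np.linalg.norm`.

```python
import numpy as np

def normalize(v):
    # Divide each element by the Euclidean norm of the vector
    return v / np.sqrt((v * v).sum())

# Toy vectors (not embeddings); both have norm 5
a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])

# Dot product of the normalized vectors
cosine = normalize(a).dot(normalize(b))

# Same quantity from the textbook cosine formula
same = a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(round(cosine, 2))  # 0.96, i.e. 24 / (5 * 5)
```

After normalization every vector has length 1, so the score no longer depends on vector magnitude, only on direction.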

In [11]:
import numpy as np

def norm_vector(v):
    norm = np.sqrt((v * v).sum())
    v_norm = v / norm
    return v_norm

def compute_cos_similarity(record):
    answer_orig = record['answer_orig']
    answer_llm = record['answer_llm']
    
    v_llm = embedding_model.encode(answer_llm)
    v_orig = embedding_model.encode(answer_orig)
    
    return norm_vector(v_llm).dot(norm_vector(v_orig))

In [12]:
evaluations = []

for record in tqdm(results_gpt4o):
    sim = compute_cos_similarity(record)
    evaluations.append(sim)

100%|██████████| 300/300 [02:21<00:00,  2.11it/s]


What's the 75th percentile of the cosine scores?

* 0.63
* 0.73
* **0.83**
* 0.93

In [13]:
df['cosine'] = evaluations
df['cosine'].describe()

count    300.000000
mean       0.728393
std        0.157755
min        0.125357
25%        0.651273
50%        0.763761
75%        0.836235
max        0.958796
Name: cosine, dtype: float64

## Q4. Rouge

The ROUGE score is a set of metrics that compares two answers based on the overlap of n-grams, word sequences, and word pairs.
It can give a more nuanced view of text similarity than cosine similarity alone.

Let's compute the ROUGE score between the answers at index 10 of our dataframe (`doc_id=5170565b`).

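```python
from rouge import Rouge

r = df.iloc[10]  # the record at index 10
rouge_scorer = Rouge()

# get_scores takes the hypothesis first, then the reference
scores = rouge_scorer.get_scores(r['answer_llm'], r['answer_orig'])[0]
```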

There are three scores: `rouge-1`, `rouge-2` and `rouge-l`, and precision, recall and F1 score for each.

* `rouge-1` - the overlap of unigrams,
* `rouge-2` - bigrams,
* `rouge-l` - the longest common subsequence
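
To make the three definitions concrete, here is a simplified from-scratch sketch. It is not the `rouge` package's implementation: it assumes whitespace tokenization and clipped n-gram counts, so its numbers may differ from what the library reports. The function names (`ngrams`, `prf`, `rouge_n`, `lcs_length`, `rouge_l`) and the example sentences are made up for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    # All contiguous n-token windows, as tuples
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def prf(matches, hyp_total, ref_total):
    # Precision, recall and F1 from a match count
    p = matches / hyp_total if hyp_total else 0.0
    r = matches / ref_total if ref_total else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return {'r': r, 'p': p, 'f': f}

def rouge_n(hyp, ref, n):
    hyp_counts = Counter(ngrams(hyp.split(), n))
    ref_counts = Counter(ngrams(ref.split(), n))
    # Counter intersection keeps the minimum count of each shared n-gram
    matches = sum((hyp_counts & ref_counts).values())
    return prf(matches, sum(hyp_counts.values()), sum(ref_counts.values()))

def lcs_length(a, b):
    # Dynamic programming over one rolling row
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def rouge_l(hyp, ref):
    h, r = hyp.split(), ref.split()
    return prf(lcs_length(h, r), len(h), len(r))

hyp = "the cat sat on the mat"
ref = "the cat lay on the mat"

r1 = rouge_n(hyp, ref, 1)  # 5 of 6 unigrams match
r2 = rouge_n(hyp, ref, 2)  # 3 of 5 bigrams match
rl = rouge_l(hyp, ref)     # longest common subsequence has 5 tokens
print(round(r1['f'], 3), round(r2['f'], 3), round(rl['f'], 3))  # 0.833 0.6 0.833
```

The bigram score is lower than the unigram score because changing a single word breaks two bigrams while removing only one unigram match.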

What's the F score for `rouge-1`?

- 0.35
- **0.45**
- 0.55
- 0.65

In [14]:
df.iloc[10]

answer_llm     Yes, all sessions are recorded, so if you miss...
answer_orig    Everything is recorded, so you won’t miss anyt...
document                                                5170565b
question                    Are sessions recorded if I miss one?
course                                 machine-learning-zoomcamp
similarity                                             32.344711
cosine                                                  0.777956
Name: 10, dtype: object

In [16]:
from rouge import Rouge

r = df.iloc[10]

rouge_scorer = Rouge()
scores = rouge_scorer.get_scores(r['answer_llm'], r['answer_orig'])
scores

[{'rouge-1': {'r': 0.45454545454545453,
   'p': 0.45454545454545453,
   'f': 0.45454544954545456},
  'rouge-2': {'r': 0.21621621621621623,
   'p': 0.21621621621621623,
   'f': 0.21621621121621637},
  'rouge-l': {'r': 0.3939393939393939,
   'p': 0.3939393939393939,
   'f': 0.393939388939394}}]

In [18]:
scores[0]['rouge-1']['f']

0.45454544954545456

## Q5. Average rouge score

Let's compute the average F-score across `rouge-1`, `rouge-2` and `rouge-l` for the same record from Q4.

- **0.35**
- 0.45
- 0.55
- 0.65



In [23]:
# Initialize lists to store the values
rouge_types = []
recall = []
precision = []
f_score = []

# Loop through the data to extract the values
for rouge_type, metrics in scores[0].items():
    rouge_types.append(rouge_type)
    recall.append(metrics['r'])
    precision.append(metrics['p'])
    f_score.append(metrics['f'])
    
# Create a DataFrame
df_scores = pd.DataFrame({
    'rouge_type': rouge_types,
    'r': recall,
    'p': precision,
    'f': f_score
})

df_scores.set_index('rouge_type', inplace=True)
df_scores

| rouge_type | r | p | f |
|---|---|---|---|
| rouge-1 | 0.454545 | 0.454545 | 0.454545 |
| rouge-2 | 0.216216 | 0.216216 | 0.216216 |
| rouge-l | 0.393939 | 0.393939 | 0.393939 |


In [24]:
df_scores.f.mean()

0.35490034990035496
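
The same average can be checked without building a DataFrame. This is a small sketch that copies the three F-scores from the Q4 output into a dictionary (the name `scores_record_10` is only illustrative) and averages them with the standard library.

```python
from statistics import mean

# F-scores copied from the `scores` output for the record at index 10
scores_record_10 = {
    'rouge-1': {'f': 0.45454544954545456},
    'rouge-2': {'f': 0.21621621121621637},
    'rouge-l': {'f': 0.393939388939394},
}

# Plain arithmetic mean of the three F-scores
avg_f = mean(m['f'] for m in scores_record_10.values())
print(round(avg_f, 4))  # 0.3549
```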

## Q6. Average rouge score for all the data points

Now let's compute the score for all the records and create a dataframe from them.

What's the average `rouge-2` across all the records?

- 0.10
- **0.20**
- 0.30
- 0.40

In [30]:
scores_list = []

for _, row in df.iterrows():
    scores = rouge_scorer.get_scores(row['answer_llm'], row['answer_orig'])
    scores_list.append(scores[0])

In [32]:
data = []

# Collect the F-scores for each record, plus their average
for s in scores_list:
    rouge_1 = s['rouge-1']['f']
    rouge_2 = s['rouge-2']['f']
    rouge_l = s['rouge-l']['f']
    rouge_avg = (rouge_1 + rouge_2 + rouge_l) / 3
    data.append({
        'rouge-1': rouge_1,
        'rouge-2': rouge_2,
        'rouge-l': rouge_l,
        'rouge-avg': rouge_avg
    })

# Build the DataFrame directly; concatenating onto an empty frame triggers a pandas FutureWarning
df_rouge = pd.DataFrame(data)
df_rouge



|  | rouge-1 | rouge-2 | rouge-l | rouge-avg |
|---|---|---|---|---|
| 0 | 0.095238 | 0.028169 | 0.095238 | 0.072882 |
| 1 | 0.125000 | 0.055556 | 0.093750 | 0.091435 |
| 2 | 0.415584 | 0.177778 | 0.389610 | 0.327658 |
| 3 | 0.216216 | 0.047059 | 0.189189 | 0.150821 |
| 4 | 0.142076 | 0.033898 | 0.120219 | 0.098731 |
| ... | ... | ... | ... | ... |
| 295 | 0.654545 | 0.540984 | 0.618182 | 0.604570 |
| 296 | 0.590164 | 0.460432 | 0.557377 | 0.535991 |
| 297 | 0.654867 | 0.564516 | 0.637168 | 0.618851 |
| 298 | 0.304762 | 0.132231 | 0.304762 | 0.247252 |


In [34]:
df_rouge['rouge-2'].mean()

0.20696501983423318