## Import the data

In [3]:
import pandas as pd

In [4]:
github_url = "https://github.com/DataTalksClub/llm-zoomcamp/blob/main/04-monitoring/data/results-gpt4o-mini.csv"

url = f'{github_url}?raw=1'
df = pd.read_csv(url)

In [6]:
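# keep the first 300 rows; .copy() makes an independent DataFrame,
# so adding columns later does not trigger SettingWithCopyWarning
df_new = df.iloc[:300].copy()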

In [8]:
# check the shape of the data
df_new.shape

(300, 5)

In [13]:
# preview the first rows and the columns of the dataset
df_new.head()

| | answer_llm | answer_orig | document | question | course |
|---|---|---|---|---|---|
| 0 | You can sign up for the course by visiting the... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | Where can I sign up for the course? | machine-learning-zoomcamp |
| 1 | You can sign up using the link provided in the... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | Can you provide a link to sign up? | machine-learning-zoomcamp |
| 2 | Yes, there is an FAQ for the Machine Learning ... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | Is there an FAQ for this Machine Learning course? | machine-learning-zoomcamp |
| 3 | The context does not provide any specific info... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | Does this course have a GitHub repository for ... | machine-learning-zoomcamp |
| 4 | To structure your questions and answers for th... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | How can I structure my questions and answers f... | machine-learning-zoomcamp |


## Q1. Getting the embeddings model

*Create the embeddings for the first LLM answer:*

```python
answer_llm = df.iloc[0].answer_llm
```
*What's the first value of the resulting vector?*

In [9]:
model_name = "multi-qa-mpnet-base-dot-v1"

from sentence_transformers import SentenceTransformer
embedding_model = SentenceTransformer(model_name)

You try to use a model that was created with version 3.0.0.dev0, however, your version is 2.7.0. This might cause unexpected behavior or errors. In that case, try to update to the latest version.

In [10]:
answer_llm = df_new.iloc[0].answer_llm

In [12]:
embedding_model.encode(answer_llm)[0]

-0.42244655

*Answer to question 1: -0.42*

## Q2. Computing the dot product

*Now for each answer pair, let's create embeddings and compute dot product between them*

*We will put the results (scores) into the evaluations list*

*What's the 75th percentile of the score?*

In [14]:
# tqdm displays a progress bar for loops and iterators
from tqdm.auto import tqdm

In [29]:
evaluation = []

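# pass total so tqdm can show progress (iterrows() has no length)
for index, record in tqdm(df_new.iterrows(), total=len(df_new)):
    answer_orig = record['answer_orig']
    answer_llm = record['answer_llm']

    # encode each answer into its embedding vector
    v_llm = embedding_model.encode(answer_llm)
    v_orig = embedding_model.encode(answer_orig)

    # raw (unnormalized) dot product between the two embeddings
    dot_product = v_llm.dot(v_orig)
    evaluation.append(dot_product)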


In [30]:
df_new["score"] = evaluation



In [31]:
df_new["score"].describe()

count    300.000000
mean      27.495996
std        6.384742
min        4.547923
25%       24.307844
50%       28.336870
75%       31.674309
max       39.476013
Name: score, dtype: float64
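
The `75%` row reported by `describe()` is the 75th percentile of the scores. As a rough illustration on made-up numbers (the vectors and the score list below are hypothetical, not taken from the dataset), this sketch computes one dot product by hand and then reads a 75th percentile with `np.percentile`, which interpolates linearly by default:

```python
import numpy as np

# Hypothetical 3-dimensional "embeddings"; the real model vectors are much longer
v_llm = np.array([1.0, 2.0, 3.0])
v_orig = np.array([4.0, 5.0, 6.0])

# Dot product: sum of element-wise products, 1*4 + 2*5 + 3*6
dot = float(v_llm.dot(v_orig))

# Hypothetical list of scores, standing in for the evaluation list
toy_scores = [10.0, 20.0, 30.0, 40.0, 50.0]

# 75th percentile with linear interpolation (the numpy default)
p75 = float(np.percentile(toy_scores, 75))

print(dot, p75)
```

On the real data, `df_new["score"].quantile(0.75)` should give the same figure as the `75%` row without printing the full summary.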

*Answer to question 2: 31.67*

## Q3. Computing the cosine

*From Q2, we can see that the results are not within the [0, 1] range. It's because the vectors coming from this model are not normalized.*

*So we need to normalize them. To do it, we:*

- *Compute the norm of a vector*
- *Divide each element by this norm*

*So, for vector v, it'll be* 

$$
\frac{v}{\|v\|}
$$ 

*In numpy, this is how you do it:*

```python
norm = np.sqrt((v * v).sum())
v_norm = v / norm
```
*Let's put it into a function and then compute dot product between normalized vectors. This will give us cosine similarity*
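
The normalization described above can be tried on small vectors first. This is a minimal sketch, assuming two made-up 2-dimensional vectors; the helper name `normalize` is only illustrative and is not part of the homework code:

```python
import numpy as np

def normalize(v):
    """Scale a vector to unit length by dividing it by its L2 norm."""
    norm = np.sqrt((v * v).sum())
    return v / norm

# Hypothetical vectors, chosen so the norms are easy to check (both have norm 5)
a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])

a_norm = normalize(a)  # [0.6, 0.8]
b_norm = normalize(b)  # [0.8, 0.6]

raw_dot = float(a.dot(b))           # 24.0, grows with vector length
cosine = float(a_norm.dot(b_norm))  # 0.96, always within [-1, 1]

print(raw_dot, cosine)
```

The raw dot product depends on the lengths of the vectors, while the dot product of the normalized vectors depends only on the angle between them.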

*What's the 75th percentile of the cosine scores?*

In [32]:
# import dependency
import numpy as np

In [83]:
def cosine_similarity(df, embedding_model):
    similarity = []

    for index, record in tqdm(df.iterrows(), total=df.shape[0]):
        answer_org = record['answer_orig']
        answer_llm = record['answer_llm']
        
        # Encode the answers to get the vectors
        llm = embedding_model.encode(answer_llm)
        orig = embedding_model.encode(answer_org)
        
        # Compute norms for each vector
        norm_llm = np.sqrt(np.sum(llm ** 2))
        norm_orig = np.sqrt(np.sum(orig ** 2))
        
        # Normalize the vectors
        llm_norm = llm / norm_llm
        orig_norm = orig / norm_orig
        
        # Compute the cosine similarity (dot product of normalized vectors)
        dot_product = np.dot(llm_norm, orig_norm)
        similarity.append(dot_product)
    
    return similarity

In [81]:
df_new["cosine"] = cosine_similarity(df_new, embedding_model)



In [82]:
df_new["cosine"].describe()

count    300.000000
mean       0.728393
std        0.157755
min        0.125357
25%        0.651273
50%        0.763761
75%        0.836235
max        0.958796
Name: cosine, dtype: float64

*Answer to question 3: 0.83*

## Q4. Rouge

*Now we will explore an alternative metric - the ROUGE score.*

*This is a set of metrics that compares two answers based on the overlap of n-grams, word sequences, and word pairs.*

*It can give a more nuanced view of text similarity than just cosine similarity alone.*

*We don't need to implement it ourselves; there's a Python package for it:*
```bash
pip install rouge
```
*(The latest version at the moment of writing is 1.0.1)*

*Let's compute the ROUGE score between the answers at the index 10 of our dataframe (doc_id=5170565b)*
```python
from rouge import Rouge
rouge_scorer = Rouge()

# record at index 10; get_scores takes the hypothesis first, then the reference
r = df.iloc[10]
scores = rouge_scorer.get_scores(r['answer_llm'], r['answer_orig'])[0]
```
*There are three scores: rouge-1, rouge-2 and rouge-l, and precision, recall and F1 score for each.*

*rouge-1 - the overlap of unigrams, rouge-2 - bigrams, rouge-l - the longest common subsequence*

*What's the F score for rouge-1?*
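
To make the unigram overlap behind rouge-1 concrete, here is a simplified sketch in plain Python. It assumes lowercase whitespace tokenization and clipped word counts; the `rouge` package may tokenize and count differently, so its numbers are not expected to match this function exactly. The name `rouge_1_sketch` and the two sentences are hypothetical.

```python
from collections import Counter

def rouge_1_sketch(hypothesis, reference):
    """Simplified ROUGE-1: precision, recall and F1 over shared unigrams."""
    hyp_counts = Counter(hypothesis.lower().split())
    ref_counts = Counter(reference.lower().split())

    # Clipped overlap: a word counts at most as often as it occurs in both texts
    overlap = sum((hyp_counts & ref_counts).values())

    precision = overlap / max(sum(hyp_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"p": precision, "r": recall, "f": f1}

# Five of the six words are shared, so precision = recall = F1 = 5/6
toy = rouge_1_sketch("the cat sat on the mat", "the cat lay on the mat")
print(toy)
```

Precision is the share of hypothesis words found in the reference, recall is the share of reference words found in the hypothesis, and the F score is their harmonic mean.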

In [62]:
from rouge import Rouge
rouge_scorer = Rouge()

In [66]:
r = df_new.iloc[10]

In [75]:
scores = rouge_scorer.get_scores(r['answer_llm'], r['answer_orig'])[0]

In [76]:
scores['rouge-1']['f']

0.45454544954545456

*Answer to question 4: 0.45*

## Q5. Average rouge score
*Let's compute the average F-score between rouge-1, rouge-2 and rouge-l for the same record from Q4*

In [77]:
rouge_1_f1 = scores['rouge-1']['f']
rouge_2_f1 = scores['rouge-2']['f']
rouge_l_f1 = scores['rouge-l']['f']

In [79]:
average_f1 = (rouge_1_f1 + rouge_2_f1 + rouge_l_f1) / 3
average_f1

0.35490034990035496

*Answer to question 5: 0.35*

## Q6. Average rouge score for all the data points
*Now let's compute the score for all the records and create a dataframe from them.*

*What's the average rouge_2 across all the records?*
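
One possible way to build that dataframe is to keep only the F-score of each metric per record and let pandas average the columns. The sketch below uses two made-up score dictionaries shaped like the output of `get_scores(...)[0]`; the values and the column names `rouge_1`, `rouge_2` and `rouge_l` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical per-record results, shaped like rouge_scorer.get_scores(...)[0]
toy_results = [
    {"rouge-1": {"r": 0.5, "p": 0.5, "f": 0.5},
     "rouge-2": {"r": 0.2, "p": 0.2, "f": 0.2},
     "rouge-l": {"r": 0.4, "p": 0.4, "f": 0.4}},
    {"rouge-1": {"r": 0.7, "p": 0.7, "f": 0.7},
     "rouge-2": {"r": 0.4, "p": 0.4, "f": 0.4},
     "rouge-l": {"r": 0.6, "p": 0.6, "f": 0.6}},
]

# Keep only the F-score of each metric, one row per record
rows = [
    {
        "rouge_1": s["rouge-1"]["f"],
        "rouge_2": s["rouge-2"]["f"],
        "rouge_l": s["rouge-l"]["f"],
    }
    for s in toy_results
]

df_rouge = pd.DataFrame(rows)
column_means = df_rouge.mean()  # average of each metric across records

print(column_means)
```

With the real scores in place of `toy_results`, `column_means["rouge_2"]` would be the average asked for here.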

In [86]:
rouge_2_f1_scores = []

# Compute ROUGE-2 F-scores for all records
for index, record in tqdm(df_new.iterrows(), total=df_new.shape[0]):
    scores = rouge_scorer.get_scores(record['answer_llm'], record['answer_orig'])[0]
    rouge_2_f1 = scores['rouge-2']['f']
    rouge_2_f1_scores.append(rouge_2_f1)

# Compute the average ROUGE-2 F-score
average_rouge_2_f1 = sum(rouge_2_f1_scores) / len(rouge_2_f1_scores)


In [87]:
print(f"Average ROUGE-2 F1 Score across all records: {average_rouge_2_f1}")

Average ROUGE-2 F1 Score across all records: 0.20696501983423318


*Answer to question 6: 0.20*