In [20]:
import pandas as pd

## Get the Data Generated from the Previous Module

This dataset, generated with gpt-4o-mini, includes the fields 'answer_llm', 'answer_orig', 'document', 'question', and 'course'.

- 'answer_llm': The response provided by the language model to the generated question.
- 'answer_orig': The original answer from the ground truth dataset, which was used to generate the corresponding question.
- 'document': The identifier for the document.
- 'question': The question generated from the original answer.
- 'course': The related course.

The dataset was compiled by processing a ground truth dataset that contains questions, associated courses, and document IDs. This ground truth dataset was produced from the original set of FAQ documents. For each answer in these documents, an LLM generated five related questions, recording the document ID linked to each question.

By embedding the fields 'answer_llm' and 'answer_orig' and computing the cosine similarity between the two vectors, we can estimate how close each LLM response is to the original answer.
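
As a quick illustration of the metric itself, here is a minimal sketch of cosine similarity on toy vectors. The vectors `a`, `b`, `c` and the helper `cosine_similarity` are made up for this example and are not part of the notebook; real embeddings would come from the model loaded further down.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: (u . v) / (|u| * |v|)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u.dot(v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy vectors standing in for two answer embeddings (not real model output)
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])     # same direction as a, twice the length
c = np.array([-1.0, -2.0, -3.0])  # opposite direction to a

print(a.dot(b))                 # raw dot product, grows with vector length
print(cosine_similarity(a, b))  # close to 1: same direction
print(cosine_similarity(a, c))  # close to -1: opposite direction
```

The raw dot product depends on the lengths of the vectors, while the cosine depends only on their direction, which is why the two scores can differ so much in scale.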

In [21]:
github_url = 'https://github.com/DataTalksClub/llm-zoomcamp/blob/main/04-monitoring/data/results-gpt4o-mini.csv'
url = f'{github_url}?raw=1'
df = pd.read_csv(url)

In [22]:
df = df.iloc[:300]  # only use the first 300 rows

## Get the embeddings model

We use the `multi-qa-mpnet-base-dot-v1` model from the Sentence Transformers library.

In [23]:
from sentence_transformers import SentenceTransformer
model_name = 'multi-qa-mpnet-base-dot-v1'
embedding_model = SentenceTransformer(model_name)



## Create the embeddings for the first LLM answer

In [24]:
answer_llm = df.iloc[0].answer_llm
print(answer_llm)

You can sign up for the course by visiting the course page at [http://mlzoomcamp.com/](http://mlzoomcamp.com/).


In [25]:
first_answer_embedding = embedding_model.encode(answer_llm)
print(first_answer_embedding[0])

-0.42244658


## Compute Dot Product

Now, for each answer pair, let's create the embeddings and compute the dot product between them.

We will put the results (scores) into the `evaluations` list.

What's the 75th percentile of the scores?

In [26]:
from tqdm.auto import tqdm

In [27]:
import numpy as np

In [31]:
evaluations = []

for index, row in tqdm(df.iterrows(), total=df.shape[0]):
    v_answer_llm = embedding_model.encode(row['answer_llm']) 
    v_answer_orig = embedding_model.encode(row['answer_orig'])
    score = v_answer_llm.dot(v_answer_orig)
    evaluations.append(score)
print(evaluations)


100%|██████████| 300/300 [01:04<00:00,  4.65it/s]

[np.float32(17.515991), np.float32(13.4184), np.float32(25.313255), np.float32(12.147417), np.float32(18.747738), np.float32(33.970406), np.float32(30.251701), np.float32(29.52158), np.float32(35.272198), np.float32(27.75177), np.float32(32.34471), np.float32(31.44184), np.float32(36.380714), np.float32(33.340504), np.float32(30.606163), np.float32(32.503044), np.float32(29.674448), np.float32(24.35346), np.float32(20.132465), np.float32(23.995481), np.float32(30.880281), np.float32(32.692432), np.float32(30.04917), np.float32(16.078163), np.float32(31.796417), np.float32(37.980003), np.float32(20.83905), np.float32(32.61287), np.float32(38.894203), np.float32(34.051826), np.float32(28.263874), np.float32(27.124828), np.float32(23.975262), np.float32(26.340145), np.float32(18.658112), np.float32(25.016403), np.float32(21.101128), np.float32(33.726795), np.float32(29.340345), np.float32(28.654509), np.float32(29.608585), np.float32(30.810738), np.float32(33.331203), np.float32(26.220482




In [32]:
percentile_75 = np.percentile(evaluations, 75)

print("The 75th percentile is:", percentile_75)

The 75th percentile is: 31.674313
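
For reference, `np.percentile` interpolates linearly between neighbouring values by default. The sketch below uses made-up numbers (`toy_scores` is not taken from the dataset) to show what that means in practice.

```python
import numpy as np

toy_scores = [1.0, 2.0, 3.0, 4.0, 5.0]  # made-up values, not from the dataset

# With 5 sorted values, the 75th percentile sits at position 0.75 * (5 - 1) = 3,
# which is exactly the value 4.0
p75 = np.percentile(toy_scores, 75)
print(p75)

# With 4 values the position 0.75 * (4 - 1) = 2.25 falls between 3.0 and 4.0,
# so the result is interpolated: 3.0 + 0.25 * (4.0 - 3.0)
p75_interp = np.percentile([1.0, 2.0, 3.0, 4.0], 75)
print(p75_interp)
```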


## The results are not within the range [-1, 1]. We need to normalize the vectors to get the cosine similarity

In [38]:
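def compute_norm(v):
    # Scale v to unit length (L2 norm of 1), so that the dot product
    # of two such vectors equals their cosine similarity
    norm = np.sqrt((v * v).sum())
    v_norm = v / norm
    return v_norm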

In [39]:
def evaluate(df):
    evaluations = []
    for index, row in tqdm(df.iterrows(), total=df.shape[0]):
        v_answer_llm = compute_norm(embedding_model.encode(row['answer_llm'])) 
        v_answer_orig = compute_norm(embedding_model.encode(row['answer_orig']))
        score = v_answer_llm.dot(v_answer_orig)
        evaluations.append(score)
    return evaluations

In [40]:
evaluations = evaluate(df)

100%|██████████| 300/300 [01:04<00:00,  4.67it/s]


In [41]:
percentile_75 = np.percentile(evaluations, 75)

print("The 75th percentile is:", percentile_75)

The 75th percentile is: 0.8362348
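
The loop above scores one row at a time. The same normalize-then-dot computation can also be written for whole matrices at once. The following is a sketch on toy arrays: `rowwise_cosine`, `A` and `B` are hypothetical names, and the arrays stand in for stacked 'answer_llm' and 'answer_orig' embeddings.

```python
import numpy as np

def rowwise_cosine(A, B):
    """Cosine similarity between matching rows of A and B, both of shape (n, d)."""
    # Divide every row by its own L2 norm so each row has unit length
    A_unit = A / np.linalg.norm(A, axis=1, keepdims=True)
    B_unit = B / np.linalg.norm(B, axis=1, keepdims=True)
    # Row-wise dot product of the unit vectors
    return (A_unit * B_unit).sum(axis=1)

# Toy stand-ins for two batches of embeddings (not real model output)
A = np.array([[3.0, 4.0], [1.0, 0.0], [2.0, 0.0]])
B = np.array([[6.0, 8.0], [0.0, 5.0], [-1.0, 0.0]])

cosines = rowwise_cosine(A, B)
print(cosines)  # parallel, orthogonal and opposite pairs
```

Since `encode` also accepts a list of strings and returns one row per text, matrices like `A` and `B` could in principle be built from `df['answer_llm']` and `df['answer_orig']` in two calls instead of 600; that variant is not run here.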


## Now we will use an alternative metric - the ROUGE score

ROUGE is a set of metrics that compare two texts based on the overlap of n-grams, word sequences, and word pairs.

Because it measures word overlap rather than closeness in embedding space, it complements cosine similarity and can give a more nuanced view of text similarity.

Let's compute the ROUGE score between the answers at index 10 of our dataframe (doc_id=5170565b).

An n-gram is a contiguous sequence of n items from a given sample of text or speech. The items can be phonemes, syllables, letters, words, or base pairs, depending on the level of textual analysis being conducted.
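
To make the definition concrete, here is a small sketch that extracts word-level n-grams. The helper `word_ngrams` and the lowercase, whitespace-based tokenization are assumptions made for this example, not what the `rouge` package does internally.

```python
def word_ngrams(text, n):
    """Return the contiguous word n-grams of text, using whitespace tokenization."""
    tokens = text.lower().split()
    # Slide a window of n tokens over the text; empty if the text is shorter than n
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "You can sign up for the course"
print(word_ngrams(sentence, 1))  # unigrams: single words
print(word_ngrams(sentence, 2))  # bigrams: pairs of adjacent words
```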

In [42]:
from rouge import Rouge
rouge_scorer = Rouge()

In [43]:
r = df.iloc[10]

In [45]:
scores = rouge_scorer.get_scores(r['answer_llm'], r['answer_orig'])[0]



There are three metrics, rouge-1, rouge-2 and rouge-l, each reported as recall (r), precision (p) and F1 score (f).

- rouge-1: the overlap of unigrams,
- rouge-2: the overlap of bigrams,
- rouge-l: the longest common subsequence.
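
The three metrics listed above can be sketched from scratch as follows. This is a simplified illustration under stated assumptions (whitespace tokenization, clipped n-gram counts, a plain sentence-level longest common subsequence); the helper names are hypothetical and the numbers it produces are not guaranteed to match those of the `rouge` package.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count each contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def prf(overlap, hyp_total, ref_total):
    """Precision, recall and F1 from an overlap count and the two totals."""
    p = overlap / hyp_total if hyp_total else 0.0
    r = overlap / ref_total if ref_total else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return {"r": r, "p": p, "f": f}

def rouge_n(hyp, ref, n):
    """N-gram overlap between hypothesis and reference (assumed clipped counts)."""
    h = ngram_counts(hyp.split(), n)
    r = ngram_counts(ref.split(), n)
    overlap = sum((h & r).values())  # Counter & keeps the minimum count per n-gram
    return prf(overlap, sum(h.values()), sum(r.values()))

def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def rouge_l(hyp, ref):
    """Longest-common-subsequence overlap between hypothesis and reference."""
    h, r = hyp.split(), ref.split()
    return prf(lcs_length(h, r), len(h), len(r))

# Toy sentences, not taken from the dataset
hyp = "the cat sat on the mat"
ref = "the cat lay on the mat"

print(rouge_n(hyp, ref, 1))  # 5 of 6 unigrams shared
print(rouge_n(hyp, ref, 2))  # 3 of 5 bigrams shared
print(rouge_l(hyp, ref))     # LCS is "the cat on the mat", 5 tokens
```

Precision divides the overlap by the size of the hypothesis (here the LLM answer), recall divides it by the size of the reference (the original answer), and F1 is their harmonic mean.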

In [46]:
scores

{'rouge-1': {'r': 0.45454545454545453,
  'p': 0.45454545454545453,
  'f': 0.45454544954545456},
 'rouge-2': {'r': 0.21621621621621623,
  'p': 0.21621621621621623,
  'f': 0.21621621121621637},
 'rouge-l': {'r': 0.3939393939393939,
  'p': 0.3939393939393939,
  'f': 0.393939388939394}}

## Average rouge score

Let's compute the average F-score across rouge-1, rouge-2 and rouge-l for the same record.

In [47]:
# Extract F-scores using list comprehension
f_scores = [scores[key]['f'] for key in scores]

# Calculate the average F-score using NumPy
average_f_score = np.mean(f_scores)

print("The average F-score is:", average_f_score)

The average F-score is: 0.35490034990035496
