## Monitoring answer quality of LLMs
- different types of quality metrics:
    - Vector similarity between expected and LLM answer
    - LLM-as-a-judge to compute toxicity of LLM answer
    - LLM-as-a-judge to assess quality of LLM answer

- store computed metrics in a relational database
- use Grafana to visualize metrics over time
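
As a rough illustration of the "store metrics, then chart them" idea, here is a minimal sketch that uses an in-memory SQLite database as a stand-in for the relational database. The table name `answer_metrics`, its columns, and the helper functions are hypothetical choices for this example, not something prescribed by the notes. A dashboard tool would run a query similar to the one in `average_metric`.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical schema: one row per evaluated answer and metric.
SCHEMA = """
CREATE TABLE IF NOT EXISTS answer_metrics (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    created_at TEXT NOT NULL,
    question TEXT NOT NULL,
    metric_name TEXT NOT NULL,
    metric_value REAL NOT NULL
)
"""

def save_metric(conn, question, metric_name, metric_value):
    """Insert a single metric value with a UTC timestamp."""
    conn.execute(
        "INSERT INTO answer_metrics (created_at, question, metric_name, metric_value) "
        "VALUES (?, ?, ?, ?)",
        (datetime.now(timezone.utc).isoformat(), question, metric_name, metric_value),
    )
    conn.commit()

def average_metric(conn, metric_name):
    """Average of one metric over all stored rows (None if there are no rows)."""
    row = conn.execute(
        "SELECT AVG(metric_value) FROM answer_metrics WHERE metric_name = ?",
        (metric_name,),
    ).fetchone()
    return row[0]

# In-memory database, used here only so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
save_metric(conn, "Where can I sign up?", "cosine_similarity", 0.8)
save_metric(conn, "Is there an FAQ?", "cosine_similarity", 0.6)
print(average_metric(conn, "cosine_similarity"))
```

Keeping one row per (answer, metric) pair makes it easy to add new metrics later without changing the schema.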

## Monitoring answer quality with user feedback
- store chat sessions and collect user feedback in a database
- use Grafana to visualize user feedback and the corresponding chat sessions

## What else is there?
- further quality metrics & user feedback
    - e.g. bias and fairness, topic clustering, textual user feedback
    - system metrics: latency, traffic, errors, saturation (also referred to as the "four golden signals")
    - *cost* of the infrastructure used, e.g. vector store and LLM API -> see more details [here](https://www.linkedin.com/in/magdalenakuhn)

## Offline / Online evaluation


recap:

1. build a RAG pipeline

2. evaluate retrieval
- hit rate
- MRR (mean reciprocal rank)

3. monitor / evaluate the prompt

- offline evaluation
    - cosine similarity (closeness of answer to the expectation)
    - LLM as a judge

- online evaluation
    - A/B tests, experiments
    - user feedback
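
The two retrieval metrics from step 2 of the recap can be sketched as follows. This is a minimal illustration that assumes each query has exactly one expected document and that the search results have already been reduced to a list of booleans (`True` where the expected document appears). The function names and toy data are made up for the example.

```python
def hit_rate(relevance):
    """Fraction of queries where the expected document appears anywhere in the results."""
    hits = sum(1 for row in relevance if any(row))
    return hits / len(relevance)

def mrr(relevance):
    """Mean reciprocal rank: average of 1/rank of the first relevant result (0 if none)."""
    total = 0.0
    for row in relevance:
        for rank, is_relevant in enumerate(row, start=1):
            if is_relevant:
                total += 1.0 / rank
                break
    return total / len(relevance)

# One row per query; True marks the position of the expected document in the top-3 results.
relevance = [
    [True, False, False],   # found at rank 1
    [False, True, False],   # found at rank 2
    [False, False, False],  # not found
    [False, False, True],   # found at rank 3
]
print(hit_rate(relevance))
print(mrr(relevance))
```

Hit rate only asks whether the document was retrieved at all, while MRR also rewards ranking it near the top.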

### Offline (RAG) evaluation


In [None]:
def rag(q):
    search_results = search(q)
    prompt = build_prompt(q, search_results)
    answer = llm(prompt)
    return answer

Notebook `offline-rag-evaluation.ipynb`

- compute cosine similarity between the LLM answer and the original (expected) answer
- loop over the questions and save the cosine similarity for each
- generating answers for all 1830 documents is not free:
    - ~10€ with GPT-4o
    - GPT-3.5 costs about 10 times less (~1€)
    - GPT-4o mini is even cheaper (~30 cents)

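A -> Q -> A': from the original answer A, a question Q is generated; the RAG pipeline answers Q with A'; then we compute cosine similarity(A, A')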

- use `.describe()` to see summary statistics (mean similarity, min, max ...)
- use seaborn to build a histogram of the similarities
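
The loop described above can be sketched without any external libraries. The toy 3-dimensional vectors below are an assumption standing in for real embeddings of A and A', and the `summary` dictionary only mimics a few of the statistics that pandas `.describe()` would report.

```python
import math
import statistics

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors standing in for real embeddings (assumption):
# each pair is (embedding of original answer A, embedding of LLM answer A').
pairs = [
    ([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]),  # identical direction
    ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]),  # orthogonal
    ([1.0, 1.0, 0.0], [1.0, 0.0, 0.0]),  # 45 degrees apart
]

# Loop over all pairs and save the similarity for each one.
similarities = [cosine_similarity(a, a_prime) for a, a_prime in pairs]

# A few of the statistics that `.describe()` would also show.
summary = {
    "count": len(similarities),
    "mean": statistics.mean(similarities),
    "min": min(similarities),
    "max": max(similarities),
}
print(summary)
```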

### LLM-as-a-judge
- prompt an LLM (e.g. ChatGPT) with both the generated answer and the ground truth, and ask it to judge whether the answer is relevant and to give the reason for its evaluation.
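
A judge prompt along these lines might look like the sketch below. The template wording, the three labels, and the JSON keys are illustrative assumptions, not a fixed format. The actual call to the LLM is left out, so `example_reply` is a made-up response used only to show the parsing step.

```python
import json

# Hypothetical prompt template; the labels and JSON keys are illustrative choices.
JUDGE_PROMPT = """
You are an expert evaluator for a question-answering system.
Compare the generated answer with the original (ground truth) answer
and classify the generated answer as "NON_RELEVANT", "PARTLY_RELEVANT", or "RELEVANT".

Original answer: {answer_orig}
Generated answer: {answer_llm}

Respond with JSON only, in the form:
{{"relevance": "<label>", "explanation": "<short reason>"}}
""".strip()

def build_judge_prompt(answer_orig, answer_llm):
    """Fill the template with the ground truth and the generated answer."""
    return JUDGE_PROMPT.format(answer_orig=answer_orig, answer_llm=answer_llm)

def parse_judgement(raw_response):
    """Parse the judge's JSON reply; returns None if it is not valid JSON."""
    try:
        return json.loads(raw_response)
    except json.JSONDecodeError:
        return None

prompt = build_judge_prompt(
    "Use the sign-up form linked in the FAQ.",
    "You can sign up via the form in the FAQ.",
)
# A made-up reply, standing in for what the judge model might return.
example_reply = '{"relevance": "RELEVANT", "explanation": "Same instruction as the original."}'
print(parse_judgement(example_reply)["relevance"])
```

Asking for the explanation alongside the label makes it easier to spot-check the judge's decisions later.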

## Homework

In [1]:
import pandas as pd
github_url = "https://github.com/DataTalksClub/llm-zoomcamp/blob/main/04-monitoring/data/results-gpt4o-mini.csv"
url = f'{github_url}?raw=1'
df = pd.read_csv(url)


In [19]:
df = df.iloc[:300]

df.head()

| Unnamed: 0 | answer_llm | answer_orig | document | question | course |
|---|---|---|---|---|---|
| 0 | You can sign up for the course by visiting the... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | Where can I sign up for the course? | machine-learning-zoomcamp |
| 1 | You can sign up using the link provided in the... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | Can you provide a link to sign up? | machine-learning-zoomcamp |
| 2 | Yes, there is an FAQ for the Machine Learning ... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | Is there an FAQ for this Machine Learning course? | machine-learning-zoomcamp |
| 3 | The context does not provide any specific info... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | Does this course have a GitHub repository for ... | machine-learning-zoomcamp |
| 4 | To structure your questions and answers for th... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | How can I structure my questions and answers f... | machine-learning-zoomcamp |


Now, get the embeddings model `multi-qa-mpnet-base-dot-v1` from the Sentence Transformers library.



In [20]:
from sentence_transformers import SentenceTransformer
model_name = "multi-qa-mpnet-base-dot-v1"
embedding_model = SentenceTransformer(model_name)

- What's the first value of the resulting vector?



In [21]:
answer_llm = df.iloc[0].answer_llm
embeddings = embedding_model.encode(answer_llm)
embeddings[0]

-0.42244643

## Q2. Computing the dot product
Now, for each answer pair, let's create embeddings and compute the dot product between them.

We will put the results (scores) into a list (named `scores` in the code below).

What's the 75th percentile of the score?




In [23]:
import numpy as np
scores = []

for i in range(len(df)):
    answer_llm = df.iloc[i].answer_llm
    answer_orig = df.iloc[i].answer_orig
    
    # Generate embeddings
    answer_llm_embeddings = embedding_model.encode(answer_llm)
    answer_orig_embeddings = embedding_model.encode(answer_orig)
    
    # Calculate the dot product (not cosine similarity: the embeddings are not normalized)
    similarity_score = np.dot(answer_llm_embeddings, answer_orig_embeddings)
    scores.append(similarity_score)

# Calculate the 75th percentile of the scores
percentile_75 = np.percentile(scores, 75)
print("75th Percentile Score:", percentile_75)

75th Percentile Score: 31.67430305480957


In [24]:
import numpy as np

def normalize_vector(v):
    norm = np.sqrt((v * v).sum())
    return v / norm

# Reuses `embedding_model` and `df` from the cells above.

cosine_similarities = []

for i in range(len(df)):
    answer_llm = df.iloc[i].answer_llm
    answer_orig = df.iloc[i].answer_orig
    
    # Generate embeddings
    answer_llm_embeddings = embedding_model.encode(answer_llm)
    answer_orig_embeddings = embedding_model.encode(answer_orig)
    
    # Normalize embeddings
    answer_llm_embeddings_norm = normalize_vector(answer_llm_embeddings)
    answer_orig_embeddings_norm = normalize_vector(answer_orig_embeddings)
    
    # Calculate cosine similarity (dot product of normalized vectors)
    cosine_similarity = np.dot(answer_llm_embeddings_norm, answer_orig_embeddings_norm)
    cosine_similarities.append(cosine_similarity)

# Calculate the 75th percentile of the cosine similarities
percentile_75 = np.percentile(cosine_similarities, 75)
print("75th Percentile Cosine Similarity:", percentile_75)


75th Percentile Cosine Similarity: 0.8362346738576889
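
The two cells above differ only in the normalization step. As a short reminder of why that matters, the relation between the dot product and cosine similarity for two embedding vectors $u$ and $v$ (symbols introduced here for illustration) can be written as:

```latex
\cos(u, v)
  = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}
  = \frac{u}{\lVert u \rVert} \cdot \frac{v}{\lVert v \rVert},
\qquad
\lVert u \rVert = \sqrt{\sum_i u_i^2}
```

The raw dot product grows with the lengths of the vectors, which is why the first percentile is around 31.67. After dividing each vector by its norm, as `normalize_vector` does, the result is bounded between -1 and 1.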


Now we will explore an alternative metric - the ROUGE score.

This is a set of metrics that compares two answers based on the overlap of n-grams, word sequences, and word pairs.

It can give a more nuanced view of text similarity than cosine similarity alone.

We don't need to implement it ourselves; there's a Python package for it:

`pip install rouge`
(The latest version at the moment of writing is 1.0.1)

Let's compute the ROUGE score between the answers at index 10 of our dataframe (doc_id=5170565b).

In [26]:

from rouge import Rouge
rouge_scorer = Rouge()

r = df.iloc[10]


scores = rouge_scorer.get_scores(r['answer_llm'], r['answer_orig'])[0]

In [29]:
scores["rouge-1"]["f"]

0.45454544954545456

In [34]:
pd.DataFrame.from_dict(scores).mean(axis = 1)

r    0.3549
p    0.3549
f    0.3549
dtype: float64

There are three scores: `rouge-1`, `rouge-2` and `rouge-l`, with precision, recall and F1 score for each:

- `rouge-1`: the overlap of unigrams
- `rouge-2`: the overlap of bigrams
- `rouge-l`: the longest common subsequence
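
To make the unigram case concrete, here is a simplified, hand-rolled sketch of ROUGE-1 precision, recall and F1. It assumes plain whitespace tokenization and clipped token counts, so its numbers will not necessarily match the `rouge` package, which may tokenize and treat repeated n-grams differently. The function name `rouge_1` is made up for this example.

```python
from collections import Counter

def rouge_1(hypothesis, reference):
    """Unigram-overlap precision, recall and F1 using lowercase whitespace tokens."""
    hyp_counts = Counter(hypothesis.lower().split())
    ref_counts = Counter(reference.lower().split())
    if not hyp_counts or not ref_counts:
        return {"p": 0.0, "r": 0.0, "f": 0.0}
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((hyp_counts & ref_counts).values())
    precision = overlap / sum(hyp_counts.values())
    recall = overlap / sum(ref_counts.values())
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"p": precision, "r": recall, "f": f1}

# Five of the six tokens are shared, so precision, recall and F1 are all 5/6.
result = rouge_1("the cat sat on the mat", "the cat is on the mat")
print(result)
```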

In [42]:
# Lists to store the scores
rouge_1_scores = []
rouge_2_scores = []
rouge_l_scores = []

# Compute ROUGE scores for all records
for i in range(len(df)):
    r = df.iloc[i]
    scores = rouge_scorer.get_scores(r['answer_llm'], r['answer_orig'])[0]
    rouge_1 = scores['rouge-1']['f']
    rouge_2 = scores['rouge-2']['f']
    rouge_l = scores['rouge-l']['f']
    
    # Append scores to lists
    rouge_1_scores.append(rouge_1)
    rouge_2_scores.append(rouge_2)
    rouge_l_scores.append(rouge_l)

# Calculate the average ROUGE-2 score
average_rouge_2 = sum(rouge_2_scores) / len(rouge_2_scores)
print("Average ROUGE-2 Score:", average_rouge_2)

Average ROUGE-2 Score: 0.2069650198342332
