## Evaluation data
For this homework, we will use the same dataset we generated in the videos.

Let's get them:

* `documents`: the FAQ documents with their IDs
* `ground_truth`: the questions with the course and document ID, 5 for each FAQ document
* search through the stored documents

In [263]:
import requests
import pandas as pd

url_prefix = 'https://raw.githubusercontent.com/DataTalksClub/llm-zoomcamp/main/03-evaluation/'
docs_url = url_prefix + 'search_evaluation/documents-with-ids.json'
documents = requests.get(docs_url).json()

ground_truth_url = url_prefix + 'search_evaluation/ground-truth-data.csv'
df_ground_truth = pd.read_csv(ground_truth_url)
ground_truth = df_ground_truth.to_dict(orient='records')

Here, documents contains the documents from the FAQ database with unique IDs, and ground_truth contains generated question-answer pairs.

Also, we will need the code for evaluating retrieval:

In [4]:
from tqdm.auto import tqdm

def hit_rate(relevance_total):
    cnt = 0

    for line in relevance_total:
        if True in line:
            cnt = cnt + 1

    return cnt / len(relevance_total)

def mrr(relevance_total):
    total_score = 0.0

    for line in relevance_total:
        for rank in range(len(line)):
            if line[rank] == True:
                total_score = total_score + 1 / (rank + 1)

    return total_score / len(relevance_total)

def evaluate(ground_truth, search_function):
    relevance_total = []

    for q in tqdm(ground_truth):
        doc_id = q['document']
        results = search_function(q)
        relevance = [d['id'] == doc_id for d in results]
        relevance_total.append(relevance)

    return {
        'hit_rate': hit_rate(relevance_total),
        'mrr': mrr(relevance_total),
    }
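
To make the two metrics concrete, here is a small worked example on hand-made relevance lists. The toy data and the `toy_` names are illustrative and not part of the homework dataset. Each inner list marks, for one query, whether the result at each rank is the expected document. The helpers are repeated in compact form so the snippet runs on its own; the `break` is an addition that does not change the result when document IDs are unique.

```python
def hit_rate(relevance_total):
    # Fraction of queries with the expected document anywhere in the results
    return sum(1 for line in relevance_total if True in line) / len(relevance_total)

def mrr(relevance_total):
    # Mean of 1 / rank of the expected document (0 when it is missing)
    total_score = 0.0
    for line in relevance_total:
        for rank, is_relevant in enumerate(line):
            if is_relevant:
                total_score += 1 / (rank + 1)
                break  # with unique document IDs there is at most one match
    return total_score / len(relevance_total)

toy_relevance = [
    [False, True, False],   # expected document found at rank 2 -> 1/2
    [False, False, False],  # expected document not retrieved -> 0
    [True, False, False],   # expected document found at rank 1 -> 1
]

toy_hit_rate = hit_rate(toy_relevance)  # 2 of 3 queries have a hit
toy_mrr = mrr(toy_relevance)            # (1/2 + 0 + 1) / 3 = 0.5
print(toy_hit_rate, toy_mrr)
```

Hit rate only asks whether the document was retrieved at all, while MRR also rewards retrieving it near the top.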



### Q1. Minsearch text
Now let's evaluate our usual minsearch approach, but tweak the parameters. Let's use the following boosting params:

``` 
boost = {'question': 1.5, 'section': 0.1}
``` 

What's the hit rate for this approach?

* 0.64
* 0.74
* 0.84
* 0.94

In [5]:
def minsearch_search(query, course):
    boost = {'question': 1.5, 'section': 0.1}

    results = index.search(
        query=query,
        filter_dict={'course': course},
        boost_dict=boost,
        num_results=5
    )

    return results

In [9]:
import minsearch

index = minsearch.Index(
    text_fields=["question", "text", "section"],
    keyword_fields=["course", "id"]
)

index.fit(documents)

<minsearch.minsearch.Index at 0x755fb1cbfe50>

In [10]:
relevance_total = []

for q in tqdm(ground_truth):
    doc_id = q['document']
    results = minsearch_search(query=q['question'], course=q['course'])
    relevance = [d['id'] == doc_id for d in results]
    relevance_total.append(relevance)

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4627/4627 [00:11<00:00, 397.90it/s]


### A1.

In [227]:
print(f"the value of the hit rate is {hit_rate(relevance_total)}")

the value of the hit rate is 0.848714069591528
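
The manual loop above does the same work as the `evaluate` helper defined earlier. As a sketch of how a search function can be adapted to that helper, the snippet below wraps a hypothetical keyword-overlap search (`toy_search`, standing in for `minsearch_search`) so that it accepts a whole ground-truth record. All `toy_` names and the data are made up for illustration, and the progress bar is left out to keep the example dependency-free.

```python
def evaluate(ground_truth, search_function):
    # Same idea as the helper above, without tqdm
    relevance_total = []
    for q in ground_truth:
        results = search_function(q)
        relevance_total.append([d['id'] == q['document'] for d in results])

    hits = sum(1 for line in relevance_total if True in line)
    reciprocal_ranks = [
        1 / (line.index(True) + 1) if True in line else 0.0
        for line in relevance_total
    ]
    return {
        'hit_rate': hits / len(relevance_total),
        'mrr': sum(reciprocal_ranks) / len(relevance_total),
    }

toy_docs = [
    {'id': 'a', 'course': 'c1', 'question': 'When does the course start'},
    {'id': 'b', 'course': 'c1', 'question': 'How do I install docker'},
    {'id': 'c', 'course': 'c2', 'question': 'Where are the lecture videos'},
]
toy_ground_truth = [
    {'question': 'course start date', 'course': 'c1', 'document': 'a'},
    {'question': 'does the docker course start', 'course': 'c1', 'document': 'b'},
    {'question': 'lecture videos location', 'course': 'c2', 'document': 'c'},
]

def toy_search(q, num_results=5):
    # Hypothetical search: filter by course, rank by number of shared words
    words = set(q['question'].lower().split())
    candidates = [d for d in toy_docs if d['course'] == q['course']]
    ranked = sorted(
        candidates,
        key=lambda d: len(words & set(d['question'].lower().split())),
        reverse=True,
    )
    return ranked[:num_results]

toy_metrics = evaluate(toy_ground_truth, toy_search)
print(toy_metrics)
```

With the real index, the equivalent call would be along the lines of `evaluate(ground_truth, lambda q: minsearch_search(q['question'], q['course']))`.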


### Embeddings
The latest version of minsearch also supports vector search. We will use it via `from minsearch import VectorSearch`.

We will also use TF-IDF and Singular Value Decomposition (SVD) to create embeddings from texts. You can refer to our "Create Your Own Search Engine" workshop if you want to know more about it.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
```

Let's create embeddings for the "question" field:
```python
texts = []

for doc in documents:
    t = doc['question']
    texts.append(t)

pipeline = make_pipeline(
    TfidfVectorizer(min_df=3),
    TruncatedSVD(n_components=128, random_state=1)
)
X = pipeline.fit_transform(texts)
```

In [239]:
from minsearch import VectorSearch

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline

texts = []

for doc in documents:
    t = doc['question']
    texts.append(t)

pipeline = make_pipeline(
    TfidfVectorizer(min_df=3),
    TruncatedSVD(n_components=128, random_state=1)
)
X = pipeline.fit_transform(texts)
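
In the pipeline above, `min_df=3` tells `TfidfVectorizer` to ignore terms that appear in fewer than three documents. The following is a minimal pure-Python sketch of that document-frequency filter. The helper name `vocabulary_with_min_df` and the toy texts are hypothetical, and plain whitespace splitting is a simplification of scikit-learn's tokenization.

```python
from collections import Counter

def vocabulary_with_min_df(texts, min_df):
    # Count in how many documents each term occurs (not how many times overall)
    doc_freq = Counter()
    for text in texts:
        doc_freq.update(set(text.lower().split()))
    # Keep only the terms that occur in at least min_df documents
    return sorted(term for term, df in doc_freq.items() if df >= min_df)

toy_texts = [
    'course start date',
    'course end date',
    'course videos',
    'docker install',
]

vocab_min_df_2 = vocabulary_with_min_df(toy_texts, min_df=2)
vocab_min_df_3 = vocabulary_with_min_df(toy_texts, min_df=3)
print(vocab_min_df_2, vocab_min_df_3)
```

Rare terms are dropped before SVD compresses the remaining TF-IDF columns into 128 components, so a query made up only of rare words can end up with an all-zero vector.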

##### Data sneak peek

In [240]:
texts[0]

'Course - When will the course start?'

### Q2. Vector search for question

Now let's index these embeddings with minsearch:

```python
vindex = VectorSearch(keyword_fields={'course'})
vindex.fit(X, documents)
```

In [241]:
vindex = VectorSearch(keyword_fields={'course'})
vindex.fit(X, documents)

<minsearch.vector.VectorSearch at 0x755fb9492a10>

In [242]:
total_relevance = []
for q in tqdm(ground_truth):
    doc_id = q['document']
    vector = pipeline.transform([q['question']])[0]
    results = vindex.search(vector, filter_dict={'course': q['course']}, num_results=5)
    relevance = [d['id'] == doc_id for d in results]
    total_relevance.append(relevance)

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4627/4627 [00:05<00:00, 865.91it/s]


### A2

In [243]:
print(f"the value of the MRR is {mrr(total_relevance)}")

the value of the MRR is 0.3572833369353793


### Q3. Vector search for question and answer

We only used question in Q2. We can use both question and answer:

In [248]:
texts = []

for doc in documents:
    t = doc['question'] + ' ' + doc['text']
    texts.append(t)

pipeline = make_pipeline(
    TfidfVectorizer(min_df=3),
    TruncatedSVD(n_components=128, random_state=1)
)
X = pipeline.fit_transform(texts)

##### Data sneak peek

In [249]:
texts[0]

"Course - When will the course start? The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  “Office Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon’t forget to register in DataTalks.Club's Slack and join the channel."

Using the same pipeline (`min_df=3` for the TF-IDF vectorizer and `n_components=128` for SVD), evaluate the performance of this approach.

What's the hit rate?

* 0.62
* 0.72
* 0.82
* 0.92


In [250]:
vindex = VectorSearch(keyword_fields={'course'})
vindex.fit(X, documents)

<minsearch.vector.VectorSearch at 0x755fb9495cd0>

In [251]:
total_relevance = []
for q in tqdm(ground_truth):
    doc_id = q['document']
    vector = pipeline.transform([q['question']])[0]
    results = vindex.search(vector, filter_dict={'course': q['course']}, num_results=5)
    relevance = [d['id'] == doc_id for d in results]
    total_relevance.append(relevance)

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4627/4627 [00:06<00:00, 756.93it/s]


### A3.

In [254]:
print(f"the value of the hit rate is {hit_rate(total_relevance)}")

the value of the hit rate is 0.8210503566025502


### Q4. Qdrant
Now let's evaluate the following settings in Qdrant:

```python
text = doc['question'] + ' ' + doc['text']
model_handle = "jinaai/jina-embeddings-v2-small-en"
limit = 5
```

What's the MRR?

* 0.65
* 0.75
* 0.85
* 0.95

In [267]:
from fastembed import TextEmbedding
from qdrant_client import QdrantClient, models

In [268]:
client = QdrantClient("http://qdrant:6333")

In [300]:
model_handle = "jinaai/jina-embeddings-v2-small-en"

# Define the collection name
collection_name = "hm5-rag"
EMBEDDING_DIMENSIONALITY = 512
# Create the collection with specified vector parameters
client.create_collection(
    collection_name=collection_name,
    vectors_config=models.VectorParams(
        size=EMBEDDING_DIMENSIONALITY,  # Dimensionality of the vectors
        distance=models.Distance.COSINE  # Distance metric for similarity search
    )
)
embedder = TextEmbedding(model_name=model_handle)

In [311]:
points = []
for idx, doc in enumerate(documents):

    text = (doc['question'] + '?' + ' ' + doc['text']).strip('\n')
    embedding = next(embedder.embed(text))
    
    point = models.PointStruct(
            id=idx,
            vector=embedding, #embed text locally with "jinaai/jina-embeddings-v2-small-en" from FastEmbed
            payload={
                "id": doc['id'],
                "course": doc['course']  # take the course from the current document
            }  # save all needed metadata fields
        )
    points.append(point)

In [312]:
client.upsert(
    collection_name=collection_name,
    points=points
)

UpdateResult(operation_id=0, status=<UpdateStatus.COMPLETED: 'completed'>)

In [313]:
def qdrant_search(query, limit=5):
    embedding = next(embedder.embed(query))
    results = client.query_points(
        collection_name=collection_name,
        query=embedding,
        limit=limit,
        with_payload=True
    )
    return results

In [316]:
total_relevance_qdrant = []
for q in tqdm(ground_truth):
    doc_id = q['document']
    results = qdrant_search(q['question']).points
    relevance = [d.payload['id'] == q['document'] for d in results]
    total_relevance_qdrant.append(relevance)

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4627/4627 [00:45<00:00, 100.83it/s]


### A4.

In [318]:
print(f"the value of the mrr rate is {mrr(total_relevance_qdrant)}")

the value of the mrr rate is 0.8238095238095243


### Q4. Qdrant version 2

In [328]:
from fastembed import TextEmbedding

from qdrant_client import QdrantClient, models
from fastembed.embedding import DefaultEmbedding

# Initialize Qdrant client
client = QdrantClient("http://qdrant:6333")

# Model and collection setup
collection_name = "hm5-rag-v2"
model_handle = "jinaai/jina-embeddings-v2-small-en"
EMBEDDING_DIMENSIONALITY = 512

# Load the embedding model
embedder = DefaultEmbedding(model_name=model_handle)

# Delete collection if it exists (optional)
if client.collection_exists(collection_name=collection_name):
    client.delete_collection(collection_name=collection_name)

# Create new collection
client.create_collection(
    collection_name=collection_name,
    vectors_config=models.VectorParams(
        size=EMBEDDING_DIMENSIONALITY,
        distance=models.Distance.COSINE
    )
)



True

In [330]:
points = []

for idx, doc in enumerate(documents):
    text = doc["question"] + " " + doc["text"]
    vector = next(embedder.embed([text]))  # Embed the text

    point = models.PointStruct(
        id=idx,
        vector=vector,
        payload={
            "id": doc["id"],
            "course": doc["course"],
            "question": doc["question"],
            "text": doc["text"]
        }
    )
    points.append(point)

# Upload all points
client.upsert(
    collection_name=collection_name,
    points=points
)

UpdateResult(operation_id=0, status=<UpdateStatus.COMPLETED: 'completed'>)

In [333]:
import numpy as np  # needed by mean_reciprocal_rank below

def qdrant_search(query, limit=5):
    vector = next(embedder.embed([query]))

    results = client.search(
        collection_name=collection_name,
        query_vector=vector,
        limit=limit,
        with_payload=True
    )

    return results

def mean_reciprocal_rank(all_relevance):
    reciprocal_ranks = []

    for relevance in all_relevance:
        try:
            rank = relevance.index(True)
            reciprocal_ranks.append(1.0 / (rank + 1))
        except ValueError:
            reciprocal_ranks.append(0.0)

    return np.mean(reciprocal_ranks)

In [334]:
total_relevance = []

for q in tqdm(ground_truth):
    correct_id = q['document']
    retrieved = qdrant_search(q['question'])
    relevance = [hit.payload['id'] == correct_id for hit in retrieved]
    total_relevance.append(relevance)

# Use a separate name so the mrr() function defined earlier is not overwritten
mrr_value = mean_reciprocal_rank(total_relevance)
print("MRR:", mrr_value)

100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4627/4627 [01:00<00:00, 76.72it/s]

MRR: 0.8238707585908795





### Q5. Cosine similarity
In the second part of the module, we looked at evaluating the entire RAG approach. In particular, we looked at comparing the answer generated by our system with the actual answer from the FAQ.

One of the ways of doing it is using the cosine similarity. Let's see how to calculate it.

Cosine similarity is the dot product of two normalized vectors. Geometrically, it is the cosine of the angle between the vectors. Look up "cosine similarity geometry" if you want to learn more about it.

For us, it means that we need two things:

* first, we normalize each of the vectors,
* then, we compute the dot product.

So, we get this:

```
def cosine(u, v):
    u = normalize(u)
    v = normalize(v)
    return u.dot(v)
```

For normalization, we first compute the vector norm (its length), and then divide the vector by it:

```
def normalize(u):
    norm = np.sqrt(u.dot(u))
    return u / norm
```

(where `np` comes from `import numpy as np`)

Or we can simplify it:

```
def cosine(u, v):
    u_norm = np.sqrt(u.dot(u))
    v_norm = np.sqrt(v.dot(v))
    return u.dot(v) / (u_norm * v_norm)
```
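
As a quick check that the two formulations agree, here is a small pure-Python version of both (lists and `math` are used instead of NumPy only to keep the sketch dependency-free; the function names are illustrative). The two vectors are 45 degrees apart, so both results should match `cos(pi / 4)`.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(u):
    # Divide the vector by its length so that it has norm 1
    norm = math.sqrt(dot(u, u))
    return [a / norm for a in u]

def cosine_normalized(u, v):
    # First formulation: normalize both vectors, then take the dot product
    return dot(normalize(u), normalize(v))

def cosine_simplified(u, v):
    # Second formulation: dot product divided by the product of the norms
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

u = [1.0, 0.0]
v = [1.0, 1.0]  # 45 degrees away from u

print(cosine_normalized(u, v), cosine_simplified(u, v), math.cos(math.pi / 4))
```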

Now let's use this function to compute the A->Q->A cosine similarity.

We will use the results from [our gpt-4o-mini evaluations](https://github.com/DataTalksClub/llm-zoomcamp/blob/main/03-evaluation/rag_evaluation/data/results-gpt4o-mini.csv):

```
results_url = url_prefix + 'rag_evaluation/data/results-gpt4o-mini.csv'
df_results = pd.read_csv(results_url)
```

When creating embeddings, we will use the same simple approach as in the Embeddings section:

```
pipeline = make_pipeline(
    TfidfVectorizer(min_df=3),
    TruncatedSVD(n_components=128, random_state=1)
)
```

Let's fit the vectorizer on all the text data we have:

```
pipeline.fit(df_results.answer_llm + ' ' + df_results.answer_orig + ' ' + df_results.question)
```

Now use the transform method of the pipeline to create the embeddings and calculate the cosine similarity between each pair.

What's the average cosine?

* 0.64
* 0.74
* 0.84
* 0.94
 
This is how you do it:

* For each answer pair, compute
     * `v_llm` for the answer from the LLM
     * `v_orig` for the original answer
     * then compute the cosine between them
* At the end, take the average
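
The steps above can be sketched on toy vectors as follows. The `toy_` data is made up, and returning 0 for a zero vector is an assumption of this sketch rather than part of the homework: an answer whose terms were all dropped by `min_df` could otherwise cause a division by zero.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_product = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    # Assumption: treat a zero vector as similarity 0 instead of dividing by zero
    return dot / norm_product if norm_product > 0 else 0.0

# One row per answer pair: LLM answer embedding vs. original answer embedding
toy_v_llm = [[1.0, 0.0], [1.0, 1.0], [0.0, 0.0]]
toy_v_orig = [[2.0, 0.0], [1.0, -1.0], [1.0, 0.0]]

sims = [cosine(u, v) for u, v in zip(toy_v_llm, toy_v_orig)]
average_cosine = sum(sims) / len(sims)
print(sims, average_cosine)
```

The first pair points in the same direction (similarity 1), the second is orthogonal (similarity 0), and the third contains a zero vector.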

In [189]:
import numpy as np

def cosine(u, v):
    u = normalize(u)
    v = normalize(v)
    return u.dot(v)

def normalize(u):
    norm = np.sqrt(u.dot(u))
    return u / norm

def cosine(u, v):
    u_norm = np.sqrt(u.dot(u.T))
    v_norm = np.sqrt(v.dot(v.T))
    return u.dot(v.T) / (u_norm * v_norm)


results_url = url_prefix + 'rag_evaluation/data/results-gpt4o-mini.csv'
df_results = pd.read_csv(results_url)

In [176]:
pipeline = make_pipeline(
    TfidfVectorizer(min_df=3),
    TruncatedSVD(n_components=128, random_state=1)
)

pipeline.fit(df_results.answer_llm + ' ' + df_results.answer_orig + ' ' + df_results.question)

Pipeline(steps=[('tfidfvectorizer', TfidfVectorizer(min_df=3)),
                ('truncatedsvd',
                 TruncatedSVD(n_components=128, random_state=1))])


### A5.

In [192]:
v_llm = pipeline.transform(df_results['answer_llm'])
v_orig = pipeline.transform(df_results['answer_orig'])

def cosine(u, v):
    u_norm = np.linalg.norm(u)
    v_norm = np.linalg.norm(v)
    return u.dot(v) / (u_norm * v_norm)

sims = [cosine(u,v) for u, v in zip(v_llm, v_orig)]
average_cosine = np.mean(sims)
print(f"Average cosine similarity: {average_cosine:.2f}")

Average cosine similarity: 0.84


### Q6. Rouge
An alternative way to see how similar two texts are is ROUGE.

This is a set of metrics that compares two answers based on the overlap of n-grams, word sequences, and word pairs.

It can give a more nuanced view of text similarity than cosine similarity alone.

We don't need to implement it ourselves; there's a Python package for it:



In [193]:
!pip install rouge

Collecting rouge
  Downloading rouge-1.0.1-py3-none-any.whl.metadata (4.1 kB)
Downloading rouge-1.0.1-py3-none-any.whl (13 kB)
Installing collected packages: rouge
Successfully installed rouge-1.0.1


(The latest version at the moment of writing is 1.0.1)

Let's compute the ROUGE score between the answers at the index 10 of our dataframe (doc_id=5170565b)

In [195]:
from rouge import Rouge
rouge_scorer = Rouge()

r = df_results.iloc[10]

In [201]:
scores = rouge_scorer.get_scores(r.answer_llm, r.answer_orig)

In [197]:
scores = rouge_scorer.get_scores(r.answer_llm, r.answer_orig)[0]
scores

{'rouge-1': {'r': 0.45454545454545453,
  'p': 0.45454545454545453,
  'f': 0.45454544954545456},
 'rouge-2': {'r': 0.21621621621621623,
  'p': 0.21621621621621623,
  'f': 0.21621621121621637},
 'rouge-l': {'r': 0.3939393939393939,
  'p': 0.3939393939393939,
  'f': 0.393939388939394}}

There are three scores: rouge-1, rouge-2 and rouge-l, and precision, recall and F1 score for each.

* `rouge-1`  - the overlap of unigrams,
* `rouge-2` - bigrams,
* `rouge-l` - the longest common subsequence

For the document at index 10, the Rouge-1 F1 score is 0.45.
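
To show what the unigram-overlap score measures, here is a minimal sketch of ROUGE-1 precision, recall and F1 in pure Python. The function `rouge_1` is hypothetical and uses whitespace tokenization with clipped counts; the `rouge` package may tokenize and count differently, so its numbers need not match this sketch exactly.

```python
from collections import Counter

def rouge_1(hypothesis, reference):
    hyp_counts = Counter(hypothesis.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Shared unigrams, each counted at most as often as it occurs in both texts
    overlap = sum((hyp_counts & ref_counts).values())
    if overlap == 0:
        return {'r': 0.0, 'p': 0.0, 'f': 0.0}
    precision = overlap / sum(hyp_counts.values())  # share of hypothesis words matched
    recall = overlap / sum(ref_counts.values())     # share of reference words matched
    f1 = 2 * precision * recall / (precision + recall)
    return {'r': recall, 'p': precision, 'f': f1}

# 5 of the 6 words overlap in both directions
toy_scores = rouge_1('the cat sat on the mat', 'the cat is on the mat')
print(toy_scores)
```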

Let's compute it for the pairs in the entire dataframe. What's the average Rouge-1 F1?

* 0.25
* 0.35
* 0.45
* 0.55

### A6.

In [212]:
rouge_score = [ score['rouge-1']['f'] for score in rouge_scorer.get_scores(df_results.answer_llm, df_results.answer_orig) ]

In [218]:
rouge_score_df = pd.DataFrame({
    'id': df_results['document'].to_list(),
    'rouge_score': rouge_score
})

In [225]:
print(f"the value of the average Rouge-1 F1 is {rouge_score_df['rouge_score'].mean()}")

the value of the average Rouge-1 F1 is 0.3516946452113943
