In [1]:
import pandas as pd

parquet_file = 'dane.parquet'
df = pd.read_parquet(parquet_file)
df = df[:10000]
df

Unnamed: 0,question,answer
0,To whom did the Virgin Mary allegedly appear i...,"Architecturally, the school has a Catholic cha..."
1,What is in front of the Notre Dame Main Building?,"Architecturally, the school has a Catholic cha..."
2,The Basilica of the Sacred heart at Notre Dame...,"Architecturally, the school has a Catholic cha..."
3,What is the Grotto at Notre Dame?,"Architecturally, the school has a Catholic cha..."
4,What sits on top of the Main Building at Notre...,"Architecturally, the school has a Catholic cha..."
...,...,...
9995,What is the name of Thammasat University's eng...,Institutes of technology with different origin...
9996,What is Thailand's only government-established...,Institutes of technology with different origin...
9997,What year was Suranaree University of Technolo...,Institutes of technology with different origin...
9998,What is the name of the best-known private ins...,Institutes of technology with different origin...


## 3. Analysis of the retriever model (bi-encoder) on a selected dataset [2.5 pts]

Example dataset: sentence-transformers/squad, available on the HuggingFace platform (https://huggingface.co/datasets/sentence-transformers/squad).

The suggested model is multi-qa-mpnet-base-dot-v1.

In [2]:
from datasets import Dataset

data = Dataset.from_pandas(df)

data = data.map(
    lambda x: {"answer_length": len(x["answer"].split()), "question_length": len(x["question"].split())}
)

data = data.filter(lambda x: x['answer_length'] > 15)

Map: 100%|██████████| 10000/10000 [00:00<00:00, 55483.73 examples/s]
Filter: 100%|██████████| 10000/10000 [00:00<00:00, 450574.08 examples/s]


In [3]:
def concatenate_text(examples):
    return {
        "text": examples["question"]
        + " \n "
        + examples["answer"]
    }


data = data.map(concatenate_text)
data

Map: 100%|██████████| 10000/10000 [00:00<00:00, 33335.19 examples/s]


Dataset({
    features: ['question', 'answer', 'answer_length', 'question_length', 'text'],
    num_rows: 10000
})

### b
Compute embeddings for the queries and documents (the question and
answer columns) and analyse sample searches, this time
following a different tutorial:
https://sbert.net/examples/applications/semantic-search/README.html

In [6]:
import torch
from sentence_transformers import SentenceTransformer


embedder = SentenceTransformer("multi-qa-mpnet-base-dot-v1")

# The data contained duplicates, so keep each text only once
corpus = list(set(data['text']))

# convert_to_tensor=False returns a NumPy array, which the FAISS index in section c accepts directly
corpus_embeddings = embedder.encode(corpus, convert_to_tensor=False)

# Query sentences:
queries = [
    "Where is the Notre Dame?",
    "Where did the Virgin Mary allegedly appear in 1858?",
    "How old is the oldest structure at Notre Dame?",
    "What is Beyonce's daughter name?",
    "When was Chopin's funeral?",
    "What was the cause of Chopin's death?",
    "What body of water separates the Hudson and East Rivers?",
    "Between New York City and which nearby urban center is the Hudson River classified as an estuary?",
    "Who was the founder of the National Review magazine?", 
]

In [7]:
# Find the 5 closest corpus entries for each query
top_k = min(5, len(corpus))
for query in queries:
    query_embedding = embedder.encode(query, convert_to_tensor=False)

    # embedder.similarity applies the model's configured similarity function
    # (apparently a dot product here, since the printed scores exceed 1);
    # torch.topk then picks the 5 highest scores
    similarity_scores = embedder.similarity(query_embedding, corpus_embeddings)[0]
    scores, indices = torch.topk(similarity_scores, k=top_k)

    print("\nQuery:", query)
    print("Top 5 most similar sentences in corpus:")

    for score, idx in zip(scores, indices):
        print(corpus[idx])
        print(f"(Score: {score:.4f})")



Query: Where is the Notre Dame?
Top 5 most similar sentences in corpus:
The school known as Notre Dame is known by a more lengthy name, what is it? 
 The University of Notre Dame du Lac (or simply Notre Dame /ˌnoʊtərˈdeɪm/ NOH-tər-DAYM) is a Catholic research university located adjacent to South Bend, Indiana, in the United States. In French, Notre Dame du Lac means "Our Lady of the Lake" and refers to the university's patron saint, the Virgin Mary. The main campus covers 1,250 acres in a suburban setting and it contains a number of recognizable landmarks, such as the Golden Dome, the "Word of Life" mural (commonly known as Touchdown Jesus), and the Basilica.
(Score: 25.4066)
What type of institution is the Notre Dame? 
 The University of Notre Dame du Lac (or simply Notre Dame /ˌnoʊtərˈdeɪm/ NOH-tər-DAYM) is a Catholic research university located adjacent to South Bend, Indiana, in the United States. In French, Notre Dame du Lac means "Our Lady of the Lake" and refers to the universi

#### Analysis of results

We used a SentenceTransformer model to generate text embeddings and ranked the corpus by embedding similarity. The printed scores are well above 1, which indicates that the model's similarity function is a dot product rather than cosine similarity.
In most cases the returned passages match the query context, which suggests that the embeddings capture the semantics well.


##### Examples:
**Query: Where did the Virgin Mary allegedly appear in 1858?** \
3 of the 5 results contain the place we asked about. The remaining ones also concern apparitions of the Virgin Mary, but not the one we asked about.

**Query: What is Beyonce's daughter name?** \
Only one of the 5 closest results contains the name of Beyonce's daughter; the rest are about Beyonce herself.

**Query: Where is the Notre Dame?** \
All results are relevant.

**Query: Between New York City and which nearby urban center is the Hudson River classified as an estuary?** \
4 of the 5 results contain the answer to the question.

### c. 
Add a FAISS search index to match documents to queries
using the embeddings, and run the search again
(with top_k = 5)

In [None]:
import faiss
import numpy as np

# Create a FAISS index. IndexFlatL2 does an exact search but ranks by squared
# L2 distance, which can order results differently than the scores in section b.
dimension = corpus_embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(corpus_embeddings)

top_k = min(5, len(corpus))
for query in queries:
    query_embedding = embedder.encode(query, convert_to_tensor=False)
    query_embedding_np = np.array(query_embedding).reshape(1, -1)

    # Search the FAISS index
    distances, indices = index.search(query_embedding_np, top_k)

    print("\nQuery:", query)
    print("Top 5 most similar sentences in corpus:")

    for distance, idx in zip(distances[0], indices[0]):
        print(corpus[idx]) 
        print(f"(Distance: {distance:.4f})")


Query: Where is the Notre Dame?
Top 5 most similar sentences in corpus:
How large is Notre Dame in acres? 
 The University of Notre Dame du Lac (or simply Notre Dame /ˌnoʊtərˈdeɪm/ NOH-tər-DAYM) is a Catholic research university located adjacent to South Bend, Indiana, in the United States. In French, Notre Dame du Lac means "Our Lady of the Lake" and refers to the university's patron saint, the Virgin Mary. The main campus covers 1,250 acres in a suburban setting and it contains a number of recognizable landmarks, such as the Golden Dome, the "Word of Life" mural (commonly known as Touchdown Jesus), and the Basilica.
(Distance: 24.3512)
What type of institution is the Notre Dame? 
 The University of Notre Dame du Lac (or simply Notre Dame /ˌnoʊtərˈdeɪm/ NOH-tər-DAYM) is a Catholic research university located adjacent to South Bend, Indiana, in the United States. In French, Notre Dame du Lac means "Our Lady of the Lake" and refers to the university's patron saint, the Virgin Mary. The
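
One possible reason why the FAISS results differ from those in section b: `IndexFlatL2` ranks by squared Euclidean distance, while the scores in section b look like dot products (they are well above 1). For unnormalized vectors the two orderings can disagree. The sketch below uses made-up toy vectors, not real model embeddings, to show the effect; in FAISS, `faiss.IndexFlatIP` is the flat index that ranks by inner product.

```python
import numpy as np

# Toy, made-up embeddings (not produced by the model): three documents, one query.
corpus_vecs = np.array([
    [4.0, 0.0],   # long vector pointing in the query's direction
    [1.0, 0.2],   # short vector lying very close to the query
    [0.0, 3.0],   # unrelated direction
], dtype=np.float32)
query_vec = np.array([1.0, 0.0], dtype=np.float32)

# Dot-product ranking: larger score is better.
dot_scores = corpus_vecs @ query_vec
rank_by_dot = np.argsort(-dot_scores)

# Squared L2 distance ranking (the quantity IndexFlatL2 reports): smaller is better.
l2_distances = ((corpus_vecs - query_vec) ** 2).sum(axis=1)
rank_by_l2 = np.argsort(l2_distances)

print("dot-product order:", rank_by_dot.tolist())
print("L2 order:         ", rank_by_l2.tolist())
```

Document 0 wins under the dot product because of its length, while document 1 wins under L2 because it sits closest to the query.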

### d
Implement a chosen metric for evaluating retrieval effectiveness

In [None]:
# Example from the website:
# relevant results for query #1, #2, and #3
# actual_relevant = [
#     [2, 4, 5, 7],
#     [1, 4, 5, 7],
#     [5, 8]
# ]


def calculate_mrr(actual_relevant):
    """Print the Mean Reciprocal Rank (MRR).

    actual_relevant holds, for each query, the 1-based ranks of the relevant
    results in ascending order; an empty list means nothing relevant was found.
    """
    # number of queries
    Q = len(actual_relevant)

    # sum the reciprocals of the first relevant rank of each query
    cumulative_reciprocal = 0
    for i in range(Q):
        if len(actual_relevant[i]) != 0:
            first_result = actual_relevant[i][0]
            reciprocal = 1 / first_result
            cumulative_reciprocal += reciprocal
            print(f"query #{i+1} = 1/{first_result} = {reciprocal}")
        else:
            # a query with no relevant result contributes 0
            print(f"query #{i+1} = 0")

    # calculate mrr
    mrr = 1/Q * cumulative_reciprocal

    print("MRR =", round(mrr,2))
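
`calculate_mrr` prints the result but does not return it. A minimal variant that returns the value might look like the sketch below. It assumes the same convention (each inner list holds the 1-based ranks of the relevant results, and an empty list contributes 0). The name `mean_reciprocal_rank` is an illustrative choice, not part of the original notebook.

```python
def mean_reciprocal_rank(relevant_ranks_per_query):
    """Return MRR = (1/Q) * sum over queries of 1 / (rank of first relevant result)."""
    if not relevant_ranks_per_query:
        return 0.0
    total = 0.0
    for ranks in relevant_ranks_per_query:
        if ranks:
            # min() makes the function independent of the order of the ranks
            total += 1.0 / min(ranks)
    return total / len(relevant_ranks_per_query)


# The example lists from the commented block: first relevant ranks are 2, 1 and 5.
example = [[2, 4, 5, 7], [1, 4, 5, 7], [5, 8]]
print(round(mean_reciprocal_rank(example), 2))
```

Returning the value makes it possible to store and compare the scores of different pipelines programmatically.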


### e

Compare the results obtained without an index and with
the FAISS index.

#### Without FAISS

In [None]:
actual_relevant = [
    [1,2,3,4,5],
    [1, 2, 4],
    [1],
    [2],
    [1,2,3],
    [1,2,3,4,5],
    [1,2,3,4,5],
    [1,3,4,5],
    [1,2,3,4]
]

calculate_mrr(actual_relevant)

query #1 = 1/1 = 1.0
query #2 = 1/1 = 1.0
query #3 = 1/1 = 1.0
query #4 = 1/2 = 0.5
query #5 = 1/1 = 1.0
query #6 = 1/1 = 1.0
query #7 = 1/1 = 1.0
query #8 = 1/1 = 1.0
query #9 = 1/1 = 1.0
MRR = 0.94


#### With FAISS

In [12]:
actual_relevant_faiss = [
    [1,2,3,4,5],
    [1, 3, 5],
    [1],
    [],
    [1, 4, 5],
    [1, 3, 5],
    [1,2,3,5],
    [1, 5],
    [1,2,3,4]
]

calculate_mrr(actual_relevant_faiss)

query #1 = 1/1 = 1.0
query #2 = 1/1 = 1.0
query #3 = 1/1 = 1.0
query #4 = 0
query #5 = 1/1 = 1.0
query #6 = 1/1 = 1.0
query #7 = 1/1 = 1.0
query #8 = 1/1 = 1.0
query #9 = 1/1 = 1.0
MRR = 0.89
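
MRR only looks at the first relevant hit, so it does not show how many of the five results are relevant. As a complementary sketch (precision@k is not one of the metrics used in this notebook), the same hand-labelled rank lists can be turned into a mean precision@5. The lists are copied from the two cells above; `mean_precision_at_k` is a hypothetical helper name.

```python
def mean_precision_at_k(relevant_ranks_per_query, k=5):
    """Average, over queries, of the fraction of the top-k results marked relevant."""
    per_query = [
        sum(1 for rank in ranks if rank <= k) / k
        for ranks in relevant_ranks_per_query
    ]
    return sum(per_query) / len(per_query)


without_faiss = [[1, 2, 3, 4, 5], [1, 2, 4], [1], [2], [1, 2, 3],
                 [1, 2, 3, 4, 5], [1, 2, 3, 4, 5], [1, 3, 4, 5], [1, 2, 3, 4]]
with_faiss = [[1, 2, 3, 4, 5], [1, 3, 5], [1], [], [1, 4, 5],
              [1, 3, 5], [1, 2, 3, 5], [1, 5], [1, 2, 3, 4]]

print("P@5 without FAISS:", round(mean_precision_at_k(without_faiss), 2))
print("P@5 with FAISS:   ", round(mean_precision_at_k(with_faiss), 2))
```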


## 4. Analysis of the re-ranker model (cross-encoder) [1.5 pts]

### a
Add a re-ranking module to the search pipeline
implemented in sub-item 2

In [None]:
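import pandas as pd
from sentence_transformers import CrossEncoder

# Load the cross-encoder once instead of on every call
cross_encoder = CrossEncoder("cross-encoder/ms-marco-TinyBERT-L-2-v2")


def rerank_with_cross_encoder(question, samples_df):
    # Score every (question, text) pair jointly with the cross-encoder
    pairs = [(question, text) for text in samples_df["text"]]
    scores = cross_encoder.predict(pairs)

    # Work on a copy so the caller's DataFrame is not modified
    reranked_df = samples_df.copy()
    reranked_df["rerank_scores"] = scores

    # Highest cross-encoder score first
    return reranked_df.sort_values("rerank_scores", ascending=False)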

### b. 
Run the searches again (top_k = 5) for selected queries from
the dataset and re-rank the results with a cross-encoder (please
use one of the pre-trained models:
https://sbert.net/docs/pretrained-models/ce-msmarco.html)

In [None]:
top_k = min(5, len(corpus))
for query in queries:
    query_embedding = embedder.encode(query, convert_to_tensor=False)

    similarity_scores = embedder.similarity(query_embedding, corpus_embeddings)[0]
    scores, indices = torch.topk(similarity_scores, k=top_k)

    samples = [{"text": corpus[i]} for i in indices]
    samples_df = pd.DataFrame(samples)
    samples_df["scores"] = scores.cpu().numpy()  # Convert tensor to numpy for the DataFrame

    reranked_samples_df = rerank_with_cross_encoder(query, samples_df)
    
    print("\nQuery:", query)
    print("Top 5 most similar sentences in corpus after re-ranking:")

    for _, row in reranked_samples_df.iterrows():
        print(f"TEXT: {row['text']}")
        print(f"RE-RANKED SCORE: {row['rerank_scores']}")
        print()

    print("=" * 50)



Query: Where is the Notre Dame?
Top 5 most similar sentences in corpus after re-ranking:
TEXT: Who is the patron saint of Notre Dame? 
 The University of Notre Dame du Lac (or simply Notre Dame /ˌnoʊtərˈdeɪm/ NOH-tər-DAYM) is a Catholic research university located adjacent to South Bend, Indiana, in the United States. In French, Notre Dame du Lac means "Our Lady of the Lake" and refers to the university's patron saint, the Virgin Mary. The main campus covers 1,250 acres in a suburban setting and it contains a number of recognizable landmarks, such as the Golden Dome, the "Word of Life" mural (commonly known as Touchdown Jesus), and the Basilica.
RE-RANKED SCORE: 8.273205757141113

TEXT: What type of institution is the Notre Dame? 
 The University of Notre Dame du Lac (or simply Notre Dame /ˌnoʊtərˈdeɪm/ NOH-tər-DAYM) is a Catholic research university located adjacent to South Bend, Indiana, in the United States. In French, Notre Dame du Lac means "Our Lady of the Lake" and refers to t

### c. 
Evaluate retrieval effectiveness again with the chosen
evaluation metrics and compare with the previously obtained results.

In [35]:
actual_relevant_reranked = [
    [1,2,3,4,5],
    [1, 2, 4, 5],
    [1],
    [1],
    [1,2,3],
    [1,2,3,4,5],
    [1,2,3,4,5],
    [1,2,3,4,5],
    [1,2,3,4]
]

calculate_mrr(actual_relevant_reranked)

query #1 = 1/1 = 1.0
query #2 = 1/1 = 1.0
query #3 = 1/1 = 1.0
query #4 = 1/1 = 1.0
query #5 = 1/1 = 1.0
query #6 = 1/1 = 1.0
query #7 = 1/1 = 1.0
query #8 = 1/1 = 1.0
query #9 = 1/1 = 1.0
MRR = 1.0


The added re-ranking module re-scores and re-sorts the retrieved results with a CrossEncoder model (here: ms-marco-TinyBERT-L-2-v2). The re-ranked results are more precise and answer the queries better. Our evaluation metric is MRR, and it reached its highest value with the re-ranking module added: 1.0, compared with 0.94 without FAISS and 0.89 with FAISS.
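
The re-ranking step can be sketched without downloading a model. In the sketch below, `toy_score` is a made-up stand-in for `cross_encoder.predict` (it merely counts shared words) and `rerank` is a hypothetical helper. The point is only that re-ranking reorders the candidates the retriever already returned and cannot add documents the retriever missed.

```python
def toy_score(query, text):
    """Made-up relevance score: number of distinct query words found in the text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words)


def rerank(query, candidates, score_fn):
    """Sort retriever candidates by score_fn(query, candidate), best first."""
    return sorted(candidates, key=lambda text: score_fn(query, text), reverse=True)


# Candidates in the order a retriever might have returned them (illustrative data).
candidates = [
    "Beyonce is an American singer",
    "Beyonce named her daughter Blue Ivy",
    "Chopin was a Polish composer",
]
reranked = rerank("Beyonce daughter name", candidates, toy_score)
print(reranked[0])
```

The second candidate moves to the top because it shares two words with the query, and the set of candidates itself stays unchanged.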