# Introduction to Natural Language Processing 2 Lab03

## Introduction

In this part, we will evaluate a simple semantic search with and without approximate nearest neighbours. We will first create a searchable index, then evaluate queries on it in terms of Mean Average Precision and speed.

### Create a searchable index

In [3]:
# Use the Beir library to extract the test set of the DBPedia entity dataset.

from beir import util
from beir.datasets.data_loader import GenericDataLoader

def extract_test(dataset):
    """
    Extract the test set with the Beir library.

    Parameters
    ----------
    dataset : str
        The name of the dataset.

    Returns
    -------
    corpus, queries, qrels : dict
        The test set.
    """
    url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{}.zip".format(dataset)
    data_path = util.download_and_unzip(url, "datasets")
    corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")
    return corpus, queries, qrels

dataset = "dbpedia-entity"
corpus, queries, qrels = extract_test(dataset)


datasets/dbpedia-entity.zip:   0%|          | 0.00/610M [00:00<?, ?iB/s]

  0%|          | 0/4635922 [00:00<?, ?it/s]

What are the three values returned by Beir, and how are they structured?

...

*Let's look at the first entries of each of them.*

In [4]:
list(queries.items())[:10]

[('INEX_LD-2009022', 'Szechwan dish food cuisine'),
 ('INEX_LD-2009039', 'roman architecture'),
 ('INEX_LD-2009053', 'finland car industry manufacturer saab sisu'),
 ('INEX_LD-2009061', 'france second world war normandy'),
 ('INEX_LD-2009062', 'social network group selection'),
 ('INEX_LD-2009063', 'D-Day normandy invasion'),
 ('INEX_LD-2009074', 'web ranking scoring algorithm'),
 ('INEX_LD-2009115', 'virtual museums'),
 ('INEX_LD-2010004', 'Indian food'),
 ('INEX_LD-2010014', 'composer museum')]

In [5]:
print("Number of queries:", len(queries))

Number of queries: 400


In [6]:
print("Number of corpus:", len(corpus))
list(corpus.items())[:5]

Number of corpus: 4635922


[('<dbpedia:Animalia_(book)>',
  {'text': "Animalia is an illustrated children's book by Graeme Base. It was originally published in 1986, followed by a tenth anniversary edition in 1996, and a 25th anniversary edition in 2012. Over three million copies have been sold.   A special numbered and signed anniversary edition was also published in 1996, with an embossed gold jacket.",
   'title': 'Animalia (book)'}),
 ('<dbpedia:Academy_Award_for_Best_Production_Design>',
  {'text': "The Academy Awards are the oldest awards ceremony for achievements in motion pictures. The Academy Award for Best Production Design recognizes achievement in art direction on a film. The category's original name was Best Art Direction, but was changed to its current name in 2012 for the 85th Academy Awards.  This change resulted from the Art Director's branch of the Academy being renamed the Designer's branch.",
   'title': 'Academy Award for Best Production Design'}),
 ('<dbpedia:An_American_in_Paris>',
  {'tex

In [7]:
qrels['INEX_LD-2009022']

{'<dbpedia:Afghan_cuisine>': 0,
 '<dbpedia:Akan_cuisine>': 0,
 '<dbpedia:Ambuyat>': 0,
 '<dbpedia:American_Chinese_cuisine>': 1,
 '<dbpedia:Ants_climbing_a_tree>': 2,
 '<dbpedia:Baingan_bharta>': 1,
 '<dbpedia:Bamischijf>': 0,
 '<dbpedia:Black_cardamom>': 0,
 '<dbpedia:Brazilian_cuisine>': 0,
 '<dbpedia:British_cuisine>': 0,
 '<dbpedia:Caribbean_cuisine>': 0,
 '<dbpedia:Cellophane_noodles>': 1,
 '<dbpedia:Ceviche>': 0,
 '<dbpedia:Chana_masala>': 0,
 '<dbpedia:Chen_Kenichi>': 1,
 '<dbpedia:Chen_Kenmin>': 1,
 '<dbpedia:Chicago-style_pizza>': 0,
 '<dbpedia:Chicken_(food)>': 0,
 '<dbpedia:Chifle>': 0,
 '<dbpedia:Chili_oil>': 2,
 '<dbpedia:Chinatown,_Los_Angeles>': 0,
 '<dbpedia:Chinatown>': 1,
 '<dbpedia:Chinese_cuisine>': 2,
 '<dbpedia:Churumuri_(food)>': 0,
 '<dbpedia:Cookbook>': 0,
 '<dbpedia:Cooking>': 0,
 '<dbpedia:Couscous>': 0,
 '<dbpedia:Cuban_cuisine>': 0,
 '<dbpedia:Cuisine>': 0,
 '<dbpedia:Cuisine_of_Jharkhand>': 0,
 '<dbpedia:Cuisine_of_the_Southern_United_States>': 0,
 '<dbped

To make the problem easier, extract all the documents from the corpus that are relevant to at least one query. Then, add 100K random documents that are not relevant to any query. Make sure the process is reproducible by setting the random seed of whatever random sampling method you use.

In [8]:
import random

corpus_reduced_dict = {}

# First, collect every document that is relevant to at least one query
for query_id in queries:
    # Relevance judgements for this query: {doc_id: relevance}
    qrels_for_query = qrels[query_id]

    # Keep the documents judged relevant (relevance 1 or 2)
    for doc_id, relevance in qrels_for_query.items():
        if relevance > 0:
            corpus_reduced_dict[doc_id] = corpus[doc_id]

# Documents that are not relevant to any query
corpus_irrelevant = {}
for doc_id, doc in corpus.items():
    if doc_id not in corpus_reduced_dict:
        corpus_irrelevant[doc_id] = doc

corpus_reduced = list(corpus_reduced_dict.values())

# Sample 100k irrelevant documents, with a fixed seed for reproducibility
random.seed(42)
corpus_random_keys = random.sample(list(corpus_irrelevant), 100000)

# Append in the same order to the list and the dict so that positions match keys
for doc_id in corpus_random_keys:
    corpus_reduced.append(corpus[doc_id])
    corpus_reduced_dict[doc_id] = corpus[doc_id]

print("Total length of the documents chosen:", len(corpus_reduced))


Total length of the documents chosen: 114877


Now, we should be ready to start experimenting with our smaller dataset. Use the
sentence-transformers library to index your dataset. As queries and documents
are different, use an asymmetric similarity model. Pick one model among those
proposed. Make sure to document your choice, and why you picked it (because
of accuracy, speed, ...).

**Response:**
The documentation for asymmetric similarity models lists several models: some
are tuned for cosine similarity, the others for dot product. The most accurate
model in each category is:
- msmarco-distilbert-base-v4 (for cosine-similarity).
- msmarco-distilbert-base-tas-b (for dot-product).

Our documents are short passages, with no very long paragraphs. Following the
recommendation in the documentation, we use the dot-product model.
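To make the difference between the two score functions concrete, here is a minimal NumPy sketch. The three vectors are made up for illustration and are not real embeddings; `dot`, `cosine` and `normalize` are hypothetical helper names. It shows that the dot product rewards vectors with a larger norm, while cosine similarity only looks at direction, and that both coincide once the vectors are normalised (the `encode` calls below pass `normalize_embeddings=True`, so in this notebook `dot_score` ranks like cosine similarity).

```python
import numpy as np

def dot(a, b):
    # Plain dot product: sensitive to the norm of both vectors
    return float(a @ b)

def cosine(a, b):
    # Cosine similarity: dot product of the unit-length versions
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def normalize(v):
    # Scale a vector to unit L2 norm
    return v / np.linalg.norm(v)

# Made-up vectors (assumption: stand-ins for embeddings)
query = np.array([1.0, 2.0, 0.5])
short_doc = np.array([1.0, 2.0, 0.4])  # almost the same direction, small norm
long_doc = np.array([3.0, 5.0, 2.5])   # slightly different direction, large norm

print("dot   :", dot(query, short_doc), dot(query, long_doc))
print("cosine:", cosine(query, short_doc), cosine(query, long_doc))

# On unit vectors the two scores are the same quantity
print("dot on unit vectors:", dot(normalize(query), normalize(long_doc)))
```

With these vectors the dot product ranks `long_doc` first while cosine similarity ranks `short_doc` first, which is why the choice of score function should match the way the model was trained.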

In [9]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('msmarco-distilbert-base-tas-b')

Embed the reduced corpus and the queries using the chosen model.

In [10]:
corpus_embeddings = model.encode(corpus_reduced, show_progress_bar=True, normalize_embeddings=True)

Batches:   0%|          | 0/3590 [00:00<?, ?it/s]

In [11]:
queries_embeddings = model.encode(list(queries.values()), show_progress_bar=True, normalize_embeddings=True)

Batches:   0%|          | 0/13 [00:00<?, ?it/s]

In [12]:
print("Corpus embeddings:\n", corpus_embeddings)
print("Queries embeddings:\n", queries_embeddings)

Corpus embeddings:
 [[-0.01574452  0.00755482 -0.00977126 ...  0.01982186  0.01842167
  -0.00372798]
 [-0.00526378  0.03504582 -0.01447121 ...  0.03222864  0.009765
   0.01768874]
 [-0.02864999 -0.0007168   0.0042687  ...  0.00381514  0.00689343
  -0.04419588]
 ...
 [-0.00157356  0.01447238  0.00181426 ...  0.00996677 -0.03956144
   0.02850431]
 [-0.04041719 -0.00454527  0.02482736 ... -0.00896119 -0.02426494
  -0.01452523]
 [-0.02210533  0.03162533 -0.00218728 ...  0.00680958 -0.00261383
   0.01642022]]
Queries embeddings:
 [[ 0.0127345   0.00926684  0.00078945 ...  0.00171826  0.00395792
  -0.01994021]
 [ 0.01767569  0.03175258  0.00826589 ... -0.0185557  -0.03254417
   0.00382851]
 [ 0.01210524  0.00428766  0.00128748 ... -0.02177538 -0.01737423
  -0.02737603]
 ...
 [-0.02529465 -0.00727855  0.02930839 ... -0.02321852  0.00115845
   0.00779354]
 [-0.00716715  0.01388096  0.00451163 ...  0.00384752  0.04274915
  -0.01549989]
 [ 0.02388809  0.01915977  0.00160654 ... -0.02157108  0.01

In [18]:
from sentence_transformers import util as st_util

hits = st_util.semantic_search(queries_embeddings, corpus_embeddings, top_k = 100, score_function=st_util.dot_score)
hits[2]

[{'corpus_id': 190, 'score': 0.8819634914398193},
 {'corpus_id': 187, 'score': 0.861335277557373},
 {'corpus_id': 195, 'score': 0.8579280972480774},
 {'corpus_id': 196, 'score': 0.8578128814697266},
 {'corpus_id': 199, 'score': 0.8567620515823364},
 {'corpus_id': 181, 'score': 0.8533239960670471},
 {'corpus_id': 194, 'score': 0.8527295589447021},
 {'corpus_id': 185, 'score': 0.8472866415977478},
 {'corpus_id': 189, 'score': 0.846343457698822},
 {'corpus_id': 175, 'score': 0.8427202105522156},
 {'corpus_id': 193, 'score': 0.8402367830276489},
 {'corpus_id': 184, 'score': 0.8382342457771301},
 {'corpus_id': 186, 'score': 0.8378230333328247},
 {'corpus_id': 182, 'score': 0.8364739418029785},
 {'corpus_id': 178, 'score': 0.8349751830101013},
 {'corpus_id': 179, 'score': 0.8327335119247437},
 {'corpus_id': 192, 'score': 0.8312983512878418},
 {'corpus_id': 12765, 'score': 0.8170641660690308},
 {'corpus_id': 83335, 'score': 0.8155922293663025},
 {'corpus_id': 5083, 'score': 0.8138207793235779

Using the annotated set of queries, compute the Mean Average Precision (MAP) @100 as well as the average time per query.
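The average time per query can be measured by timing the whole search call and dividing by the number of queries. The sketch below does this on random toy vectors with a hypothetical `brute_force_search` helper written in NumPy; it is not the `semantic_search` call of the notebook, only an illustration of the measurement.

```python
import time
import numpy as np

rng = np.random.default_rng(0)
toy_corpus = rng.normal(size=(1000, 32))   # assumption: 1000 toy "documents"
toy_queries = rng.normal(size=(50, 32))    # assumption: 50 toy "queries"

def brute_force_search(query_vectors, corpus_vectors, top_k=100):
    # Exact search: score every query against every document, keep the top_k
    scores = query_vectors @ corpus_vectors.T
    return np.argsort(-scores, axis=1)[:, :top_k]

# perf_counter is a monotonic, high-resolution clock suited to short timings
start = time.perf_counter()
toy_hits = brute_force_search(toy_queries, toy_corpus)
elapsed = time.perf_counter() - start

avg_time_per_query = elapsed / len(toy_queries)
print(f"Average time per query: {avg_time_per_query * 1000:.3f} ms")
```

In the notebook, the same pattern would wrap the `st_util.semantic_search(...)` call, with `len(queries)` as the divisor. Running the search several times and averaging gives a more stable figure.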

The search results only contain a `corpus_id`, i.e. the position of the document in the embedded corpus. We need to map each `corpus_id` back to the matching corpus key to be able to compare the hits with `qrels` and compute the average precision.

In [24]:
# One {corpus_key: score} dict per query (not per document), in ranking order
hits_corpus = [{} for _ in range(len(hits))]
# corpus_id is a position in corpus_reduced, which follows the insertion order of corpus_reduced_dict
list_corpus_reduced = list(corpus_reduced_dict.items())

for idx, result in enumerate(hits):
  for item in result:
    corpus_key = list_corpus_reduced[item['corpus_id']][0]
    hits_corpus[idx][corpus_key] = item['score']

print("Corpus keys with score values:\n")
hits_corpus[0]

Corpus keys with score values:



{'<dbpedia:Sichuan_cuisine>': 0.8736340403556824,
 '<dbpedia:Manchow_soup>': 0.8186438083648682,
 '<dbpedia:List_of_Chinese_dishes>': 0.8183452486991882,
 '<dbpedia:List_of_Indian_dishes>': 0.8120450973510742,
 '<dbpedia:Guizhou_cuisine>': 0.8087615966796875,
 '<dbpedia:Meze>': 0.8066407442092896,
 '<dbpedia:American_Chinese_cuisine>': 0.8043344020843506,
 '<dbpedia:Indian_Chinese_cuisine>': 0.8026471734046936,
 '<dbpedia:Shuizhu>': 0.8021990656852722,
 '<dbpedia:Chinese_cuisine>': 0.8020032048225403,
 '<dbpedia:Indian_Singaporean_cuisine>': 0.8009045720100403,
 '<dbpedia:Gujarati_cuisine>': 0.8008530735969543,
 '<dbpedia:List_of_Vietnamese_dishes>': 0.8006097674369812,
 '<dbpedia:History_of_Chinese_cuisine>': 0.799730122089386,
 '<dbpedia:Indian_cuisine>': 0.7989543676376343,
 '<dbpedia:List_of_foods_of_the_Southern_United_States>': 0.7984433770179749,
 '<dbpedia:Quyrdak>': 0.7977863550186157,
 '<dbpedia:Szászvár>': 0.7977515459060669,
 '<dbpedia:Cuisine_of_Gower>': 0.7970753312110901

In [42]:
# We turn the qrels into a list, aligned with the order of the queries,
# to keep only the relevance information
list_qrels = [qrels[query_id] for query_id in queries]

In [47]:
# Now we compute the average precision for each query

def average_precision_k(hit_score, qrel_query, k=100):
  # Keep only the documents judged relevant for this query (relevance > 0)
  relevant_docs = {key for key, val in qrel_query.items() if val > 0}
  if not relevant_docs:
    return 0.0

  # Hits are stored in ranking order (dicts preserve insertion order)
  ranked_hits = list(hit_score.keys())[:k]

  nb_of_relevant = 0
  sum_precision = 0.0
  # Ranks go from 1 to k inclusive, so the last hit is taken into account
  for rank, corpus_key in enumerate(ranked_hits, start=1):
    # Precision@rank only contributes at ranks holding a relevant document
    if corpus_key in relevant_docs:
      nb_of_relevant += 1
      sum_precision += nb_of_relevant / rank

  # Normalise by the total number of relevant documents for the query
  return sum_precision / len(relevant_docs)

a_precision_k = average_precision_k(hits_corpus[0], list_qrels[0])
print("The AP@K of the first query is:\n", a_precision_k)

The AP@K of the first query is:
 0.18899296195509976
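As a small worked example of the metric, the sketch below computes AP@K and MAP@K on two made-up rankings. The function names and the toy document ids are assumptions for illustration. Like the cell above, it divides by the total number of relevant documents of the query; other definitions divide by the smaller of K and that number, which gives different values.

```python
def average_precision_at_k(ranked_doc_ids, relevant_doc_ids, k=100):
    # Sum of precision@rank over the ranks that hold a relevant document,
    # divided by the total number of relevant documents for the query
    relevant = set(relevant_doc_ids)
    if not relevant:
        return 0.0
    found = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids[:k], start=1):
        if doc_id in relevant:
            found += 1
            precision_sum += found / rank
    return precision_sum / len(relevant)

def mean_average_precision_at_k(rankings, relevants, k=100):
    # MAP@K is the mean of the per-query AP@K values
    aps = [average_precision_at_k(r, rel, k) for r, rel in zip(rankings, relevants)]
    return sum(aps) / len(aps)

# Toy data (assumption): two queries, four retrieved documents each
toy_rankings = [["d1", "d2", "d3", "d4"], ["d5", "d6", "d7", "d8"]]
toy_relevant = [{"d1", "d3"}, {"d6", "d9"}]

# Query 1: relevant at ranks 1 and 3 -> (1/1 + 2/3) / 2 = 5/6
# Query 2: relevant at rank 2, "d9" never retrieved -> (1/2) / 2 = 1/4
ap_1 = average_precision_at_k(toy_rankings[0], toy_relevant[0])
ap_2 = average_precision_at_k(toy_rankings[1], toy_relevant[1])
toy_map = mean_average_precision_at_k(toy_rankings, toy_relevant)
print(ap_1, ap_2, toy_map)
```

The second query shows why a relevant document that is never retrieved still lowers the score: it counts in the denominator but contributes nothing to the sum.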


In [None]:
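def mean_average_precision_k(hits_corpus, list_qrels):
  # MAP@K is the mean of the per-query AP@K values
  average_precisions = [average_precision_k(h, q) for h, q in zip(hits_corpus, list_qrels)]
  return sum(average_precisions) / len(average_precisions)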

## Approximate nearest neighbours

Find a good set of parameters for the chosen ANN library and compute the Mean Average Precision @100.

We chose to use the Faiss library for the next part.
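To get a feeling for what tuning an approximate index means, here is a toy inverted-file (IVF-style) search written in plain NumPy. It is not Faiss and not the index used in this lab: the data is random, the centroids are random documents rather than k-means centroids, and `nlist` and `nprobe` are only borrowed from the Faiss IVF parameter names. It illustrates the trade-off to explore: probing more cells improves recall against the exact search, at the cost of scoring more candidates.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_docs, n_queries, k = 16, 2000, 20, 10

# Random unit vectors standing in for embeddings (assumption)
docs = rng.normal(size=(n_docs, d))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
toy_queries = rng.normal(size=(n_queries, d))
toy_queries /= np.linalg.norm(toy_queries, axis=1, keepdims=True)

# Ground truth: exact top-k by dot product
exact = np.argsort(-(toy_queries @ docs.T), axis=1)[:, :k]

# Coarse quantizer: nlist cells, each document goes to its closest centroid
nlist = 20
centroids = docs[rng.choice(n_docs, size=nlist, replace=False)]
assignment = np.argmax(docs @ centroids.T, axis=1)

def ivf_search(query, nprobe):
    # Only score the documents stored in the nprobe cells closest to the query
    cells = np.argsort(-(centroids @ query))[:nprobe]
    candidates = np.flatnonzero(np.isin(assignment, cells))
    scores = docs[candidates] @ query
    return candidates[np.argsort(-scores)[:k]]

def recall_at_k(nprobe):
    # Fraction of the exact top-k that the approximate search recovers
    found = 0
    for query, truth in zip(toy_queries, exact):
        found += len(set(ivf_search(query, nprobe)) & set(truth))
    return found / (n_queries * k)

for nprobe in (1, 5, nlist):
    print(f"nprobe={nprobe:2d}  recall@{k}={recall_at_k(nprobe):.2f}")
```

With `nprobe` equal to `nlist` every cell is visited, so the search is exact again. A parameter search on the real index can follow the same loop, recording MAP@100 and the average time per query for each setting.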

In [None]:
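import faiss
import numpy as np

# NOTE: IndexFlatL2 is an exact (brute-force) index, used here as a baseline.
# An approximate index (e.g. IndexIVFFlat or IndexHNSWFlat) still has to replace it.
# Faiss expects contiguous float32 arrays
xb = np.ascontiguousarray(corpus_embeddings, dtype="float32")   # corpus vectors
xq = np.ascontiguousarray(queries_embeddings, dtype="float32")  # query vectors
d = xb.shape[1]                # embedding dimension

# The embeddings are normalised, so ranking by L2 distance matches ranking by dot product
index = faiss.IndexFlatL2(d)   # build the index
index.add(xb)                  # add vectors to the index

k = 4                          # we want to see 4 nearest neighbors
D, I = index.search(xb[:5], k) # sanity check: each vector should be its own nearest neighbor
print(I)
print(D)
D, I = index.search(xq, k)     # actual search
print(I[:5])                   # neighbors of the 5 first queries
print(I[-5:])                  # neighbors of the 5 last queries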

Explain which parameters you picked, and why you chose them.