# Introduction to Natural Language Processing 2 Lab03

## Introduction

In this part, we will evaluate a simple semantic search with and without nearest neighbours approximation. We will first create a searchable index, then evaluate queries on it in terms of Mean Average Precision and speed.

### Create a searchable index

In [2]:
# Use the Beir library to extract the test set of the DBPedia entity dataset.

from beir import util
from beir.datasets.data_loader import GenericDataLoader

def extract_test(dataset):
    """
    Extract the test set with the Beir library.

    Parameters
    ----------
    dataset : str
        The name of the dataset.

    Returns
    -------
    corpus, queries, qrels : dict
        The test set.
    """
    url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{}.zip".format(dataset)
    data_path = util.download_and_unzip(url, "datasets")
    corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")
    return corpus, queries, qrels

dataset = "dbpedia-entity"
corpus, queries, qrels = extract_test(dataset)

What are the three values returned by Beir, and how are they structured?

...

*Let's look at the first entries of each of them.*

In [3]:
list(queries.items())[:10]

[('INEX_LD-2009022', 'Szechwan dish food cuisine'),
 ('INEX_LD-2009039', 'roman architecture'),
 ('INEX_LD-2009053', 'finland car industry manufacturer saab sisu'),
 ('INEX_LD-2009061', 'france second world war normandy'),
 ('INEX_LD-2009062', 'social network group selection'),
 ('INEX_LD-2009063', 'D-Day normandy invasion'),
 ('INEX_LD-2009074', 'web ranking scoring algorithm'),
 ('INEX_LD-2009115', 'virtual museums'),
 ('INEX_LD-2010004', 'Indian food'),
 ('INEX_LD-2010014', 'composer museum')]

In [4]:
print("Number of queries:", len(queries))

Number of queries: 400


In [5]:
print("Number of corpus:", len(corpus))
list(corpus.items())[:5]

Number of corpus: 4635922


[('<dbpedia:Animalia_(book)>',
  {'text': "Animalia is an illustrated children's book by Graeme Base. It was originally published in 1986, followed by a tenth anniversary edition in 1996, and a 25th anniversary edition in 2012. Over three million copies have been sold.   A special numbered and signed anniversary edition was also published in 1996, with an embossed gold jacket.",
   'title': 'Animalia (book)'}),
 ('<dbpedia:Academy_Award_for_Best_Production_Design>',
  {'text': "The Academy Awards are the oldest awards ceremony for achievements in motion pictures. The Academy Award for Best Production Design recognizes achievement in art direction on a film. The category's original name was Best Art Direction, but was changed to its current name in 2012 for the 85th Academy Awards.  This change resulted from the Art Director's branch of the Academy being renamed the Designer's branch.",
   'title': 'Academy Award for Best Production Design'}),
 ('<dbpedia:An_American_in_Paris>',
  {'tex

In [6]:
qrels['INEX_LD-2009022']

{'<dbpedia:Afghan_cuisine>': 0,
 '<dbpedia:Akan_cuisine>': 0,
 '<dbpedia:Ambuyat>': 0,
 '<dbpedia:American_Chinese_cuisine>': 1,
 '<dbpedia:Ants_climbing_a_tree>': 2,
 '<dbpedia:Baingan_bharta>': 1,
 '<dbpedia:Bamischijf>': 0,
 '<dbpedia:Black_cardamom>': 0,
 '<dbpedia:Brazilian_cuisine>': 0,
 '<dbpedia:British_cuisine>': 0,
 '<dbpedia:Caribbean_cuisine>': 0,
 '<dbpedia:Cellophane_noodles>': 1,
 '<dbpedia:Ceviche>': 0,
 '<dbpedia:Chana_masala>': 0,
 '<dbpedia:Chen_Kenichi>': 1,
 '<dbpedia:Chen_Kenmin>': 1,
 '<dbpedia:Chicago-style_pizza>': 0,
 '<dbpedia:Chicken_(food)>': 0,
 '<dbpedia:Chifle>': 0,
 '<dbpedia:Chili_oil>': 2,
 '<dbpedia:Chinatown,_Los_Angeles>': 0,
 '<dbpedia:Chinatown>': 1,
 '<dbpedia:Chinese_cuisine>': 2,
 '<dbpedia:Churumuri_(food)>': 0,
 '<dbpedia:Cookbook>': 0,
 '<dbpedia:Cooking>': 0,
 '<dbpedia:Couscous>': 0,
 '<dbpedia:Cuban_cuisine>': 0,
 '<dbpedia:Cuisine>': 0,
 '<dbpedia:Cuisine_of_Jharkhand>': 0,
 '<dbpedia:Cuisine_of_the_Southern_United_States>': 0,
 '<dbped

To make the problem more tractable, extract all the documents from the corpus that are relevant to at least one query. Then, add 100K random documents that are not relevant to any query. Make sure the process is reproducible by setting the random seed of whatever random sampling method you use.

In [41]:
import random

corpus_reduced_dict = {}

# First, collect every document that is relevant to at least one query
for query_id in queries:
    # Relevance judgments for this query: {document id: relevance score}
    qrels_for_query = qrels[query_id]

    # Keep the documents judged relevant (score 1 or 2) in the reduced corpus
    for doc_id, relevancy in qrels_for_query.items():
        if relevancy > 0:
            corpus_reduced_dict[doc_id] = corpus[doc_id]

# Every other document is a candidate for the random sample
corpus_irrelevant = [doc_id for doc_id in corpus if doc_id not in corpus_reduced_dict]

corpus_reduced = list(corpus_reduced_dict.values())

# Choose the 100k random documents, with a fixed seed for reproducibility
random.seed(42)
corpus_random_keys = random.sample(corpus_irrelevant, 100000)

# Append in the same order to the list and the dict so positions stay aligned
for doc_id in corpus_random_keys:
    corpus_reduced.append(corpus[doc_id])
    corpus_reduced_dict[doc_id] = corpus[doc_id]

print("Total length of the documents chosen:", len(corpus_reduced))


Total length of the documents chosen: 114877


Now, we should be ready to start experimenting with our smaller dataset. Use the
sentence-transformers library to index your dataset. As queries and documents
are different, use an asymmetric similarity model. Pick one model among the
ones proposed. Make sure to document your choice, and why you picked it (because
of accuracy, speed, ...).

**Response:**
The documentation on asymmetric similarity models lists several models. Some
are tuned for cosine similarity, the others for dot product. According to the
documentation, the most accurate model in each category is:
- msmarco-distilbert-base-v4 (for cosine-similarity).
- msmarco-distilbert-base-tas-b (for dot-product).

Our documents are short passages rather than very long paragraphs. As
recommended in the documentation for that case, we use the cosine-similarity
model.
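The embedding cells in this notebook pass `normalize_embeddings=True`. With unit-length vectors, the dot product and the cosine similarity give the same value, which is why a cosine-similarity model can be searched with plain dot products. The sketch below checks this on two made-up toy vectors; the helper names `dot`, `normalize` and `cosine` are illustrative and not part of sentence-transformers.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# Toy vectors (assumed values, not real embeddings)
query = [1.0, 2.0, 2.0]
doc = [2.0, 0.0, 1.0]

cos_sim = cosine(query, doc)
dot_normalized = dot(normalize(query), normalize(doc))
print(cos_sim, dot_normalized)  # both values agree up to rounding
```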

In [8]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('msmarco-distilbert-base-v4')

Embed the reduced corpus and the queries using the chosen model.

In [9]:
corpus_embeddings = model.encode(corpus_reduced, show_progress_bar=True, normalize_embeddings=True)

In [10]:
queries_embeddings = model.encode(list(queries.values()), show_progress_bar=True, normalize_embeddings=True)

In [11]:
print("Corpus embeddings:\n", corpus_embeddings)
print("Queries embeddings:\n", queries_embeddings)

Corpus embeddings:
 [[ 0.02892603  0.01089791 -0.04476007 ... -0.01733907  0.0476133
  -0.04828098]
 [-0.00495155  0.07971302 -0.03203864 ...  0.05053858 -0.0137853
   0.00460507]
 [ 0.02674319  0.0154624  -0.07005778 ...  0.01372932  0.01781973
  -0.02996341]
 ...
 [-0.03009509  0.00963994  0.05870777 ... -0.01945857 -0.06338064
   0.077155  ]
 [-0.07460181  0.03335125  0.06471888 ...  0.070548    0.03953546
   0.0464682 ]
 [ 0.01524896 -0.00534951 -0.0069381  ...  0.01098635 -0.03259463
  -0.00770727]]
Queries embeddings:
 [[ 0.00943593  0.03258258 -0.01114585 ...  0.00950551  0.01634416
  -0.009321  ]
 [ 0.09193251  0.02389679 -0.00662003 ...  0.00077045 -0.0023773
   0.04637513]
 [-0.0036161   0.01975921 -0.03008859 ... -0.04314195 -0.0122631
  -0.04154114]
 ...
 [-0.02767188 -0.02205209  0.03170809 ... -0.03545771  0.03756684
   0.04217032]
 [ 0.04552852  0.0297135  -0.0533476  ...  0.04462763  0.08285883
  -0.0132812 ]
 [-0.00222221 -0.00571648  0.02701664 ...  0.04191509  0.0155

In [12]:
from sentence_transformers import util as st_util

hits = st_util.semantic_search(queries_embeddings, corpus_embeddings)
hits

[[{'corpus_id': 26, 'score': 0.6137957572937012},
  {'corpus_id': 24, 'score': 0.4821910262107849},
  {'corpus_id': 12, 'score': 0.47387373447418213},
  {'corpus_id': 598, 'score': 0.4495074450969696},
  {'corpus_id': 538, 'score': 0.4484955072402954},
  {'corpus_id': 0, 'score': 0.4403638243675232},
  {'corpus_id': 23527, 'score': 0.4280218780040741},
  {'corpus_id': 18, 'score': 0.42014843225479126},
  {'corpus_id': 489, 'score': 0.4190826416015625},
  {'corpus_id': 8, 'score': 0.41114819049835205}],
 [{'corpus_id': 46, 'score': 0.7588297128677368},
  {'corpus_id': 39, 'score': 0.7356396913528442},
  {'corpus_id': 128, 'score': 0.6805390119552612},
  {'corpus_id': 132, 'score': 0.6600256562232971},
  {'corpus_id': 135, 'score': 0.6357126235961914},
  {'corpus_id': 138, 'score': 0.6102633476257324},
  {'corpus_id': 157, 'score': 0.606996476650238},
  {'corpus_id': 47, 'score': 0.595538318157196},
  {'corpus_id': 60, 'score': 0.5819493532180786},
  {'corpus_id': 56, 'score': 0.58181595

Using the annotated set of queries, compute the Mean Average Precision (MAP) @100 as well as the average time per query.
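For the timing part, one possible approach is to wrap the search in a wall-clock timer and divide by the number of queries. The sketch below is only an illustration: `average_time_per_query` and `toy_search` are hypothetical names, and the toy search function stands in for the real semantic search.

```python
import time

def average_time_per_query(search_fn, query_list):
    """Run search_fn on each query and return (results, mean seconds per query)."""
    results = []
    start = time.perf_counter()
    for query in query_list:
        results.append(search_fn(query))
    elapsed = time.perf_counter() - start
    return results, elapsed / len(query_list)

def toy_search(query):
    """Stand-in for a real search: just sorts the query values."""
    return sorted(query)

toy_queries = [[3, 1, 2], [9, 8, 7]]
results, avg_seconds = average_time_per_query(toy_search, toy_queries)
print(results, avg_seconds)
```

When the search is batched, as with `semantic_search`, the same idea applies by timing the single call and dividing the elapsed time by the number of queries.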

In [60]:
# Keep only the first 100 queries for the analysis
queries_100 = dict(list(queries.items())[:100])

# Keep the hits of those first 100 queries; each list holds the
# top 10 results, the default `top_k` of `semantic_search`
hits_100_ids = hits[:100]
print(hits_100_ids)

[[{'corpus_id': 26, 'score': 0.6137957572937012}, {'corpus_id': 24, 'score': 0.4821910262107849}, {'corpus_id': 12, 'score': 0.47387373447418213}, {'corpus_id': 598, 'score': 0.4495074450969696}, {'corpus_id': 538, 'score': 0.4484955072402954}, {'corpus_id': 0, 'score': 0.4403638243675232}, {'corpus_id': 23527, 'score': 0.4280218780040741}, {'corpus_id': 18, 'score': 0.42014843225479126}, {'corpus_id': 489, 'score': 0.4190826416015625}, {'corpus_id': 8, 'score': 0.41114819049835205}], [{'corpus_id': 46, 'score': 0.7588297128677368}, {'corpus_id': 39, 'score': 0.7356396913528442}, {'corpus_id': 128, 'score': 0.6805390119552612}, {'corpus_id': 132, 'score': 0.6600256562232971}, {'corpus_id': 135, 'score': 0.6357126235961914}, {'corpus_id': 138, 'score': 0.6102633476257324}, {'corpus_id': 157, 'score': 0.606996476650238}, {'corpus_id': 47, 'score': 0.595538318157196}, {'corpus_id': 60, 'score': 0.5819493532180786}, {'corpus_id': 56, 'score': 0.5818159580230713}], [{'corpus_id': 190, 'scor

The results only contain a `corpus_id`, which is the position of the document in the embedded corpus. We need to map each `corpus_id` back to the matching corpus key before we can compare the hits with the relevance judgments in `qrels`.

In [75]:
list_corpus_reduced = list(corpus_reduced_dict.items())

# For each query, map every positional corpus_id back to its corpus key
# and keep the score: one {corpus_key: score} dict per hit
hits_100 = []
for result in hits_100_ids:
    hits_100.append([
        {list_corpus_reduced[item['corpus_id']][0]: item['score']}
        for item in result
    ])

print("Corpus keys with score values:\n", hits_100)

Corpus keys with score values:
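Once each hit carries a corpus key, MAP can be computed against `qrels`. The following is a minimal sketch on toy data, not the notebook's final implementation. It assumes a document is relevant when its label is greater than 0, and it normalises by `min(number of relevant documents, k)`, which is one of several conventions in use. The function names and the toy ids are hypothetical.

```python
def average_precision_at_k(ranked_doc_ids, relevance, k=100):
    """AP@k for one query; relevance maps document id -> graded label."""
    relevant = {doc_id for doc_id, label in relevance.items() if label > 0}
    if not relevant:
        return 0.0
    found = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids[:k], start=1):
        if doc_id in relevant:
            found += 1
            precision_sum += found / rank  # precision at this rank
    return precision_sum / min(len(relevant), k)

def mean_average_precision(rankings, qrels, k=100):
    """Mean of AP@k over all queries in rankings (query id -> ranked ids)."""
    scores = [average_precision_at_k(ranked, qrels[query_id], k)
              for query_id, ranked in rankings.items()]
    return sum(scores) / len(scores)

# Toy judgments and rankings (made-up ids)
toy_qrels = {"q1": {"a": 2, "b": 0, "c": 1}, "q2": {"d": 1, "e": 0}}
toy_rankings = {"q1": ["a", "b", "c"], "q2": ["e", "d"]}

map_score = mean_average_precision(toy_rankings, toy_qrels, k=100)
print(map_score)
```

In the notebook, the ranked lists would come from the keys stored in `hits_100`, paired with the query ids in the same order as `queries`. Note that `semantic_search` returns only 10 hits per query unless `top_k` is raised, so `top_k=100` would be needed for a true MAP@100.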


### Approximate nearest neighbours

Find a good set of parameters for the chosen ANN library and compute the Mean Average Precision @100.

We chose to use the Faiss library for the next part.

In [None]:
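import faiss
import numpy as np

# Faiss expects contiguous float32 arrays, so we search the embeddings,
# not the raw documents or query strings
xb = np.ascontiguousarray(corpus_embeddings, dtype=np.float32)   # database vectors
xq = np.ascontiguousarray(queries_embeddings, dtype=np.float32)  # query vectors
d = xb.shape[1]                # dimension of the embeddings

# IndexFlatL2 performs an exact (brute-force) L2 search, so it serves as a
# baseline here. The embeddings are normalized, so the L2 ranking matches
# the cosine-similarity ranking.
index = faiss.IndexFlatL2(d)   # build the index
index.add(xb)                  # add vectors to the index

k = 4                          # we want to see 4 nearest neighbors
D, I = index.search(xb[:5], k) # sanity check: each vector should find itself first
print(I)
print(D)
D, I = index.search(xq, k)     # actual search
print(I[:5])                   # neighbors of the 5 first queries
print(I[-5:])                  # neighbors of the 5 last queries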

Explain which parameters you picked, and why you chose them.
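One common way to compare ANN parameter settings is to measure how many of the exact nearest neighbours each setting recovers, alongside MAP and query time. The sketch below shows such a recall measure on made-up neighbour ids. `recall_at_k` is a hypothetical helper, and the two id tables stand in for the `I` arrays returned by an exact and an approximate index.

```python
def recall_at_k(approx_ids, exact_ids, k):
    """Mean fraction of the exact top-k neighbours found in the approximate top-k."""
    total = 0.0
    for approx_row, exact_row in zip(approx_ids, exact_ids):
        exact_top = set(exact_row[:k])
        found = len(exact_top.intersection(approx_row[:k]))
        total += found / len(exact_top)
    return total / len(exact_ids)

# Toy neighbour ids for two queries (assumed values)
exact = [[0, 1, 2, 3], [4, 5, 6, 7]]
approx = [[0, 1, 2, 9], [4, 6, 8, 5]]

recall = recall_at_k(approx, exact, k=4)
print(recall)  # 0.75: each query recovers 3 of its 4 exact neighbours
```

A setting that keeps recall close to 1 while lowering the time per query would be a reasonable candidate to report.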