<a href="https://colab.research.google.com/github/Coperr/information-retrieval/blob/main/notebook3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# inforet 2024 3

### BM25 tests and query expansion by synonyms

In [33]:

# no time to lose:
!wget https://gerdes.fr/saclay/inforet/our_msmarco.zip
!unzip our_msmarco.zip
# this will be big: 1.2gb!
# you will get three files

--2025-03-25 16:06:01--  https://gerdes.fr/saclay/inforet/our_msmarco.zip
Resolving gerdes.fr (gerdes.fr)... 54.38.81.127
Connecting to gerdes.fr (gerdes.fr)|54.38.81.127|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 458004951 (437M) [application/zip]
Saving to: ‘our_msmarco.zip.1’


2025-03-25 16:06:41 (11.3 MB/s) - ‘our_msmarco.zip.1’ saved [458004951/458004951]

Archive:  our_msmarco.zip
replace our.msmarco.docs.tsv? [y]es, [n]o, [A]ll, [N]one, [r]ename: A
  inflating: our.msmarco.docs.tsv    
  inflating: our.msmarco.queries.tsv  
  inflating: our.msmarco.gold.tsv    


In [None]:
!pip install rank_bm25 spacy Sense2Vec
!wget https://github.com/explosion/sense2vec/releases/download/v1.0.0/s2v_reddit_2015_md.tar.gz
import re
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from collections import Counter
from tqdm.notebook import tqdm
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from rank_bm25 import BM25Okapi
import spacy
from sense2vec import Sense2Vec
tqdm.pandas()

In [None]:
# this turns on the autotimer, so that every cell has a timing information below
try:
    %load_ext autotime
except Exception:
    !pip install ipython-autotime
    %load_ext autotime
# stop using:
# %unload_ext autotime

## getting the best combinations from last time and writing them into files

In [None]:
origdocs = pd.read_csv('our.msmarco.docs.tsv',sep='\t',usecols=[1,2,3])
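# assign back instead of chained inplace fillna, which recent pandas versions warn about
origdocs['title'] = origdocs['title'].fillna('-')
origdocs['body'] = origdocs['body'].fillna('-')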
origdocs

In [None]:
docs = pd.DataFrame(columns = ['docid', 'text'])
docs['docid']=origdocs.docid
docs['text']=origdocs.title+' '+origdocs.body
docs

In [None]:
del origdocs # saving memory

In [None]:
docs.to_csv('our.text.msmarco.docs.tsv',sep='\t', columns=['docid','text'])

#### and now the pre-tokenization for bm25

In [None]:
vectorizer = TfidfVectorizer(sublinear_tf=True, strip_accents='unicode')
tokenized_corpus = docs.text.progress_apply(vectorizer.build_analyzer())
#docs['docid'].to_frame().join(tokenized_corpus)

In [None]:
type(tokenized_corpus)

## reading back in just for checking the files - or for restarting here

In [3]:
import re
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from collections import Counter
from tqdm.notebook import tqdm
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from rank_bm25 import BM25Okapi

tqdm.pandas()

In [4]:
# this is a different doc, no longer distinguishing title and body
docs = pd.read_csv('our.text.msmarco.docs.tsv',sep='\t',usecols=[1,2])
docs = docs.sample(n=5000, random_state=42) # reduce to 5000 documents
docs

Unnamed: 0,docid,text
80203,D3490776,Impact of Overseas Immersion Homestay Experien...
23087,D587475,"What is Asthma? Causes, Symptoms, & Treatment ..."
30175,D3442790,Diazepam (Valium) Diazepam (Valium)Brand Names...
3502,D1996176,Buy Coconut Oil: Your Guide to Buying Coconut ...
48307,D1877482,Coca-Cola Beverages and Products About Us Coca...
...,...,...
38880,D773459,Italy in February Italy in February Visiting I...
38935,D2596149,"Reaching Closure: Tips for Healing Chronic, âN..."
87519,D3524977,. (Nu Wave Mini) New York Strip with Grilled ...
31567,D1874808,"Climate, Average Weather of Austria Countries ..."


In [5]:
vectorizer = TfidfVectorizer(sublinear_tf=True, strip_accents='unicode')
tokenized_corpus = docs.text.progress_apply(vectorizer.build_analyzer())

  0%|          | 0/5000 [00:00<?, ?it/s]

In [6]:
# use only col 1 if you have memory problems and do BM25 only
queries = pd.read_csv('our.msmarco.queries.tsv',sep='\t',usecols=[1,2])
training_queries = queries.iloc[:100]  # only 100 for training
testing_queries = queries.iloc[100:150]  # small test set
training_queries

Unnamed: 0,qid,query
0,687888,what is a jpe
1,480210,price for asphalt driveway
2,591004,what causes pressure skin bruising
3,260536,how long drive from flagstaff to grand canyon
4,39422,average number of bowel movements per day for ...
...,...,...
95,301689,how might carbon dioxide affect results
96,467006,number of years served hadrian
97,1060836,why are orca whales at seaworld so aggressive
98,262843,how long is a school day


In [7]:
gold = pd.read_csv('our.msmarco.gold.tsv',sep='\t',usecols=[1,3,4,5])

# filter by queries and docids that are in the smaller sets
query_ids = training_queries.iloc[:, 0].unique()
doc_ids = docs.iloc[:, 0].unique()

gold = gold[gold['qid'].isin(query_ids) & gold['docid'].isin(doc_ids)]

gold

Unnamed: 0,qid,docid,rank,score
203,133639,D586662,4,-5.02977
228,133639,D92943,29,-5.54113
236,133639,D384171,37,-5.60725
237,133639,D2789560,38,-5.61268
239,133639,D2171112,40,-5.61718
...,...,...,...,...
99533,676923,D1325913,34,-5.07376
99543,676923,D1289281,44,-5.09804
99566,676923,D210690,67,-5.15034
99573,676923,D2615758,74,-5.17469


# redoing the vectorization for my two best results

### 🚧 todo:
- use TfidfVectorizer, BM25Okapi, and our own BM25 function to measure whether there are significant differences.
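For the "our own BM25 function" part, here is a minimal sketch of Okapi BM25 over a pre-tokenized corpus. The class name `SimpleBM25` and the defaults `k1=1.5`, `b=0.75` are assumptions made for illustration, and the IDF uses the smoothed `log(1 + ...)` variant, which is not necessarily the formula `rank_bm25` uses internally, so scores will not match `BM25Okapi` exactly.

```python
import math
from collections import Counter


class SimpleBM25:
    """Illustrative Okapi BM25 over a list of token lists (hypothetical helper)."""

    def __init__(self, corpus, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.term_counts = [Counter(doc) for doc in corpus]  # term frequencies per doc
        self.doc_lens = [len(doc) for doc in corpus]
        self.avgdl = sum(self.doc_lens) / len(corpus)
        n_docs = len(corpus)
        # document frequency: in how many documents does each term occur?
        df = Counter(term for counts in self.term_counts for term in counts)
        # smoothed IDF, always positive (an assumption, see the note above)
        self.idf = {t: math.log(1 + (n_docs - n + 0.5) / (n + 0.5)) for t, n in df.items()}

    def get_scores(self, query_tokens):
        """Return one BM25 score per document for the tokenized query."""
        scores = []
        for counts, dl in zip(self.term_counts, self.doc_lens):
            # length normalisation: longer-than-average documents are penalised
            norm = self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            score = 0.0
            for term in query_tokens:
                f = counts.get(term, 0)
                if f:
                    score += self.idf[term] * f * (self.k1 + 1) / (f + norm)
            scores.append(score)
        return scores


# toy corpus, for illustration only
toy_corpus = [
    ["grass", "nutrient", "nitrogen"],
    ["physics", "forensics"],
    ["grass", "green", "lawn", "grass"],
]
print(SimpleBM25(toy_corpus).get_scores(["grass", "nutrient"]))
```

Because `get_scores` has the same shape as the `rank_bm25` method of the same name, an instance built from `tokenized_corpus` could be dropped into a copy of `pAt10Bm25` for comparison.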


In [8]:
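def pAt10(qid):
    """Precision at 10 for one query, using TF-IDF dot-product scores."""
    query = queries[queries.qid==qid]['query']
    qv = vectorizer.transform(query)
    xqv = X*qv.T  # one score per document
    # indices of the 10 highest-scoring documents (not sorted among themselves)
    pred10i = np.argpartition(xqv.toarray().flatten(), -10)[-10:]
    intersection = np.intersect1d(docs.iloc[pred10i].docid,gold[gold.qid==qid].docid)
    return len(intersection)/10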

In [10]:
vectorizer = TfidfVectorizer(sublinear_tf=True, strip_accents='unicode')
X = vectorizer.fit_transform(docs.text)
print(len(vectorizer.get_feature_names_out()),'features, for example',vectorizer.get_feature_names_out()[44444:44449])
tfidfresults = training_queries.qid.progress_apply(pAt10)
tfidfresults.mean()

276650 features, for example ['8683' '86841' '868421' '8685map' '8686']


  0%|          | 0/100 [00:00<?, ?it/s]

np.float64(0.41999999999999993)

In [11]:
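def pAt10Bm25(qid):
    """Precision at 10 for one query, using BM25Okapi scores."""
    # tokenize the query with the same analyzer as the corpus
    tquery = queries[queries.qid==qid]['query'].apply(vectorizer.build_analyzer())
    doc_scores = bm25.get_scores(tquery.tolist()[0])
    # indices of the 10 highest-scoring documents (not sorted among themselves)
    pred10i = np.argpartition(doc_scores, -10)[-10:]
    intersection = np.intersect1d(docs.iloc[pred10i].docid,gold[gold.qid==qid].docid)
    return len(intersection)/10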

In [12]:
bm25 = BM25Okapi(tokenized_corpus)
bm25results = training_queries.qid.progress_apply(pAt10Bm25)
bm25results.mean()

  0%|          | 0/100 [00:00<?, ?it/s]

np.float64(0.4759999999999999)
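
The two means above (0.42 for TF-IDF, 0.476 for BM25) come from the same 100 queries, so one way to check whether the gap is more than noise is a paired test. The sketch below is a sign-flip permutation test on the per-query differences; the function name, the number of rounds, and the seed are assumptions chosen for illustration, and other paired tests would be equally reasonable.

```python
import random


def paired_permutation_test(a, b, n_rounds=10_000, seed=0):
    """Two-sided paired sign-flip permutation test on the mean difference.

    a, b: per-query scores of two systems, aligned by query.
    Returns an estimated p-value.
    """
    if len(a) != len(b):
        raise ValueError("both systems must be scored on the same queries")
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs) / len(diffs))
    hits = 0
    for _ in range(n_rounds):
        # under the null hypothesis the sign of each difference is arbitrary
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped / len(diffs)) >= observed - 1e-12:
            hits += 1
    # add-one smoothing keeps the estimate strictly above zero
    return (hits + 1) / (n_rounds + 1)
```

In this notebook it could be called as `paired_permutation_test(bm25results.tolist(), tfidfresults.tolist())`; a small p-value would suggest the difference is unlikely to be due to the choice of queries alone.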

# 🔎 manual error mining
- let's look at where things go wrong

### 🚧 todo:
- what's the lowest p@10 we got?
- which 10 queries got the worst scores, from worst to slightly better?


In [17]:
worst10bm25 = bm25results.nsmallest(10)
worst10bm25

Unnamed: 0,qid
79,0.0
10,0.1
29,0.1
37,0.1
57,0.1
11,0.2
14,0.2
50,0.2
75,0.2
89,0.2


In [19]:
idx = bm25results.idxmin()  # row label in training_queries, not the query ID
worst_query = training_queries.loc[idx]
score = bm25results[idx]

print("lowest p@10:", score)
print("row index:", idx)
print("query ID:", worst_query['qid'])
print("query text:", worst_query['query'])

lowest p@10: 0.0
row index: 79
query ID: 831688
query text: what is the meaning of the name kameren


In [20]:
worst10bm25i = bm25results.nsmallest(10).index
worst10bm25i

Index([79, 10, 29, 37, 57, 11, 14, 50, 75, 89], dtype='int64')

In [21]:
training_queries.loc[worst10bm25i]

Unnamed: 0,qid,query
79,831688,what is the meaning of the name kameren
10,882355,what nutrient makes grass greener
29,1142083,define whiteout
37,417380,is mark applier?
57,237191,how is physics used in forensics science
11,510587,tartrazine what is it
14,180005,emitter meaning
50,834834,what is the name of the anatomical space in wh...
75,554607,what are adrenergic antagonist
89,1610,"7. how are dna, rna, and proteins related in t..."


In [22]:
training_queries.loc[worst10bm25.index].assign(p_at_10=worst10bm25.values)


Unnamed: 0,qid,query,p_at_10
79,831688,what is the meaning of the name kameren,0.0
10,882355,what nutrient makes grass greener,0.1
29,1142083,define whiteout,0.1
37,417380,is mark applier?,0.1
57,237191,how is physics used in forensics science,0.1
11,510587,tartrazine what is it,0.2
14,180005,emitter meaning,0.2
50,834834,what is the name of the anatomical space in wh...,0.2
75,554607,what are adrenergic antagonist,0.2
89,1610,"7. how are dna, rna, and proteins related in t...",0.2


### 🚧 todo:
- write a function showDoc that takes qid, rank, and predicted as parameters
    - if predicted=True, shows the predicted doc of rank rank to the query qid
    - if predicted=False, shows the gold doc
    - prints the first 999 characters of the texts
- for the worst query
    - look at the 10 best gold vs 10 best predicted
    - hypothesize why the results are so bad for the worst query

In [24]:
def showDoc(qid,rank,predicted=False):
    if predicted:
        query_text = queries.loc[queries.qid == qid, 'query'].values[0]
        query_tokens = vectorizer.build_analyzer()(query_text)
        doc_scores = bm25.get_scores(query_tokens)
        top_pred_indices = np.argsort(doc_scores)[-10:][::-1]  # Top 10, sorted high→low
        if rank >= len(top_pred_indices):
            print(f"[!] Only {len(top_pred_indices)} predicted documents available")
            return
        docid = docs.iloc[top_pred_indices[rank]].docid
    else:
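        # gold is a DataFrame: select this query's rows, best gold rank first
        gold_docs = gold[gold.qid == qid].sort_values('rank').docid.tolist()
        if rank >= len(gold_docs):
            print(f"[!] Only {len(gold_docs)} gold documents available for query {qid}")
            return
        docid = gold_docs[rank]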

    text = docs.loc[docs.docid == docid, 'text'].values
    if len(text) == 0:
        print(f"[!] Document {docid} not found in docs")
    else:
        print(f"\n📄 DocID: {docid} — {'Predicted' if predicted else 'Gold'} Rank {rank}")
        print("-" * 60)
        print(text[0][:999])

showDoc(729561,7)
showDoc(729561,7, predicted=True)

[!] Only 0 gold documents available for query 729561

📄 DocID: D1681739 — Predicted Rank 7
------------------------------------------------------------
What is the meaning of the word disintegrate? Your browser does not support audio. What is the meaning of the word disintegrate? Looking for the meaning or definition of the word disintegrate? Here are some definitions. Verb ( transitive) To undo the integrity of, break into parts. ( intransitive) To fall apart, break up into parts. Find more words!disintegrate See Also What is another word for disintegrate? What is the opposite of disintegrate? Sentences with the word disintegrate How do you pronounce the word disintegrate? Words that rhyme with disintegrate What is the past tense of disintegrate? What is the adjective for disintegrate? What is the noun for disintegrate? More Words What is the meaning of the word disintegrants? What is the meaning of the word disintegrant? What is the meaning of the word disintegrable? What is the mean

### 🚧 todo: can we characterize these difficult cases?
- do they have specific problems?
- do we know when we are doing badly?
    - are the distances between query vector and the best documents bigger than average?
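
One possible way to approach the last question is to summarise, for each query, how strong its best retrieval scores are and then correlate that number with p@10. The sketch below assumes the mean of the top-k scores as a confidence proxy; the helper names and the toy numbers are hypothetical, and with BM25 a low score plays the role of a large distance.

```python
import math


def top_k_mean(scores, k=10):
    """Mean of the k highest retrieval scores for one query."""
    top = sorted(scores, reverse=True)[:k]
    return sum(top) / len(top)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# hypothetical per-query numbers, for illustration only
confidence = [12.1, 3.4, 8.9, 15.2, 5.0]  # top_k_mean of each query's scores
p_at_10 = [0.6, 0.1, 0.4, 0.7, 0.2]       # measured precision at 10
print(pearson(confidence, p_at_10))
```

A clearly positive correlation on the real data would suggest that low top scores can serve as a warning sign; a value near zero would suggest the scores alone do not tell us when retrieval is failing.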
    

# 🚀 spacy

- look at https://github.com/explosion/sense2vec/blob/master/README.md

In [None]:
nlp = spacy.load('en_core_web_lg') # or the smaller md model!!!

### 🚧 todo:
- explain what's going on here:

In [None]:
sent1 = nlp("I am happy")
sent2 = nlp("I am sad")
sent3 = nlp("I am joyful")
sent1.similarity(sent2), sent1.similarity(sent3)
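
As a hint for the explanation: by default spaCy represents a `Doc` as the average of its token vectors and compares two docs by cosine similarity. The sketch below reproduces that computation with hand-made 3-dimensional vectors; the vectors in `toy_vectors` are invented for illustration and are not taken from any spaCy model.

```python
import math

# invented toy vectors, for illustration only
toy_vectors = {
    "I": [1.0, 0.0, 0.0],
    "am": [0.0, 1.0, 0.0],
    "happy": [0.2, 0.1, 0.9],
    "sad": [0.1, 0.2, 0.8],
    "joyful": [0.3, 0.0, 0.95],
}


def mean_vector(vectors):
    """Component-wise average of a list of equally long vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def doc_similarity(text1, text2, vectors=toy_vectors):
    """Cosine similarity of the averaged token vectors of two texts."""
    v1 = mean_vector([vectors[tok] for tok in text1.split()])
    v2 = mean_vector([vectors[tok] for tok in text2.split()])
    return cosine(v1, v2)


print(doc_similarity("I am happy", "I am sad"))
print(doc_similarity("I am happy", "I am joyful"))
```

Two of the three tokens are shared, and antonyms such as "happy" and "sad" tend to occur in similar contexts, so both pairs come out as highly similar. This is worth keeping in mind when reading the spaCy numbers.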

### let's try sense2vec

- depending on your machine, download one of the two versions of sense2vec from https://github.com/explosion/sense2vec/blob/master/README.md
  - s2v_reddit_2019_lg 	4 GB 	Reddit comments 2019 (01-07) 	part 1, part 2, part 3
      - cat s2v_reddit_2019_lg.tar.gz.* > s2v_reddit_2019_lg.tar.gz
  - s2v_reddit_2015_md 	573 MB 	Reddit comments 2015 	part 1
- unpack the archive (it is a `.tar.gz`, so use `tar -xzf` rather than `unzip`)
- try it, and understand what's going on:


In [None]:
s2v = Sense2Vec().from_disk("./s2v_reddit_2019_lg")


In [None]:
seeds = "natural language processing, machine learning, artificial intelligence".split(',')
seed_keys = [s2v.get_best_sense(seed.strip()) for seed in seeds]
seed_keys

In [None]:
most_similar = s2v.most_similar(seed_keys, n=10)
most_similar

### 🚧 todo: what is it that you couldn't do in Word2Vec?
- just one line of answer.
- answer: ...

- `most_similar` is very slow. To speed things up (optional), see: https://towardsdatascience.com/how-to-build-a-fast-most-similar-words-method-in-spacy-32ed104fe498

### 🚧 todo:
- try also the following functions:
    - similarity, get_other_senses, get_freq, s2v[query]


In [None]:
...

### 🚧 todo:
- try whether expanding your query by adding similar terms to the 10 worst queries improves the results
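
A minimal sketch of the expansion step, kept independent of the vector files so it runs anywhere. It assumes the neighbours were computed beforehand (for example with `s2v.most_similar`) and stored per token as `(key, score)` pairs sorted by decreasing score, with keys in the `word|SENSE` format; the function name, the thresholds, and the example neighbours are hypothetical.

```python
def expand_query(tokens, neighbours, max_new=2, min_score=0.7):
    """Append up to max_new similar terms per query token.

    tokens: list of query tokens
    neighbours: dict mapping a token to [(key, score), ...], best first,
                with keys like "machine_learning|NOUN"
    """
    expanded = list(tokens)
    for token in tokens:
        added = 0
        for key, score in neighbours.get(token, []):
            if added >= max_new or score < min_score:
                break  # list is sorted, so nothing better follows
            # drop the sense tag and turn multi-word keys back into words
            phrase = key.rsplit("|", 1)[0].replace("_", " ").lower()
            for word in phrase.split():
                if word not in expanded:
                    expanded.append(word)
            added += 1
    return expanded


# invented neighbours, for illustration only
toy_neighbours = {
    "nutrient": [("nutrients|NOUN", 0.86), ("fertilizer|NOUN", 0.74), ("soil|NOUN", 0.62)],
}
query = ["what", "nutrient", "makes", "grass", "greener"]
print(expand_query(query, toy_neighbours))
```

The expanded token list can be passed to `bm25.get_scores` in place of the original query tokens. In practice it is probably wise to skip function words such as "what" when looking up neighbours, since expanding them mostly adds noise.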


In [None]:
...

### 🚧 todo:
- try misspelling a word and see whether you can fix that with sense2vec


### 🚧 todo:
- try embeddings for a few queries (all of them would take too long unless you have a GPU)
    - are the gold top 10 similar to the query itself?
    - check whether the gold top 10 answers for our most difficult query are really closer to the query than the currently predicted top 10
        - how to get every doc as a vector:
            - https://spacy.io/api/doc#vector "A real-valued meaning representation. Defaults to an average of the token vectors."
        - every doc has a similarity function taking another doc as argument:
            - https://spacy.io/api/doc#similarity

In [None]:
...

Ellipsis

In [None]:
# not necessary but if you want to include your big s2v file
# combining spacy and sense2vec:
nlp = spacy.load("en_core_web_sm") # or whichever you downloaded
s2v = nlp.add_pipe("sense2vec")
s2v.from_disk("./s2v_reddit_2015_md") # or whichever you downloaded