# **Second improvement and error analysis**

Here is implemented the second improvement, an attempt to introduce some semantic inside the model, trying to replace words in the queries which were not working well with synonyms which could perform better.
In the second part of the notebook is shown the error analysis that led to this approach

In [1]:
!pip install python-terrier

Collecting python-terrier
  Downloading python-terrier-0.8.0.tar.gz (97 kB)
[?25l[K     |███▍                            | 10 kB 18.7 MB/s eta 0:00:01[K     |██████▊                         | 20 kB 23.6 MB/s eta 0:00:01[K     |██████████▏                     | 30 kB 11.6 MB/s eta 0:00:01[K     |█████████████▌                  | 40 kB 8.9 MB/s eta 0:00:01[K     |████████████████▉               | 51 kB 4.6 MB/s eta 0:00:01[K     |████████████████████▎           | 61 kB 5.4 MB/s eta 0:00:01[K     |███████████████████████▋        | 71 kB 5.4 MB/s eta 0:00:01[K     |███████████████████████████     | 81 kB 5.5 MB/s eta 0:00:01[K     |██████████████████████████████▍ | 92 kB 6.1 MB/s eta 0:00:01[K     |████████████████████████████████| 97 kB 3.6 MB/s 
Collecting wget
  Downloading wget-3.2.zip (10 kB)
Collecting pyjnius~=1.3.0
  Downloading pyjnius-1.3.0-cp37-cp37m-manylinux2010_x86_64.whl (1.1 MB)
[K     |████████████████████████████████| 1.1 MB 36.8 MB/s 
[?25hCollecti

In [2]:
import pyterrier as pt
pt.init(mem=20000, boot_packages=["com.github.terrierteam:terrier-prf:-SNAPSHOT"])

terrier-assemblies 5.6 jar-with-dependencies not found, downloading to /root/.pyterrier...
Done
terrier-python-helper 0.0.6 jar not found, downloading to /root/.pyterrier...
Done
terrier-prf -SNAPSHOT jar not found, downloading to /root/.pyterrier...
Done


PyTerrier 0.8.0 has loaded Terrier 5.6 (built by craigmacdonald on 2021-09-17 13:27)



In [3]:
dataset = pt.datasets.get_dataset("trec-deep-learning-passages")

In [4]:
def msmarco_generate():
    dataset = pt.get_dataset("trec-deep-learning-passages")
    with pt.io.autoopen(dataset.get_corpus()[0], 'rt') as corpusfile:
        for l in corpusfile:
            docno, passage = l.split("\t")
            yield {'docno' : docno, 'text' : passage}

iter_indexer = pt.IterDictIndexer("./passage_index")
indexref = iter_indexer.index(msmarco_generate(), meta={'docno' : 20, 'text': 4096})


  if __name__ == '__main__':


Downloading msmarco_passage corpus to /root/.pyterrier/corpora/msmarco_passage/corpus
Downloading msmarco_passage tars to /root/.pyterrier/corpora/msmarco_passage/collection.tar.gz


collection.tar.gz:   0%|          | 0.00/987M [00:00<?, ?iB/s]

08:58:10.621 [ForkJoinPool-1-worker-3] WARN org.terrier.structures.indexing.Indexer - Indexed 5 empty documents


In [5]:
index = pt.IndexFactory.of(indexref)

In [6]:
# Aggregate of everything
print(index.getCollectionStatistics().toString())

Number of documents: 8841823
Number of terms: 1170682
Number of postings: 215238456
Number of fields: 1
Number of tokens: 288759529
Field names: [text]
Positions:   false



In [7]:
bm25 = pt.BatchRetrieve(index, wmodel="BM25")

In [56]:
from pandas._libs.tslibs.timedeltas import ints_to_pytimedelta
import requests
from bs4 import BeautifulSoup

#Retrieve the synonyms for a single word
def get_synonyms_for(word):
  response = requests.get("https://www.thesaurus.com/browse/{}".format(word))
  soup = BeautifulSoup(response.content, 'html.parser')

  synonyms = soup.find("ul",{"class","css-1i3oiir e1ccqdb60"})

  in_other_words = []

  #if there are no synonyms, return an empty list
  if synonyms is None:
    print('No synonyms have been found for ' + word)
    return in_other_words

  #otherwise, create the list of synonyms
  for synonym in synonyms:
    if "inline-block" in synonym.text:
      continue
    else:
      in_other_words.append(synonym.text.strip())

  #print(synonyms)
  print('\n synonyms for: {}'.format(word) + ': ')
  print(in_other_words)
  return in_other_words

In [57]:
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
import operator

#Create a new query, swapping words which are less present in the passages with the synonym which is more present
def replace_with_high_idf_synonym(q):
  ps = PorterStemmer()
  words = q["query"].split(" ")
  new_query = ""

  stopwords = ["does", "do", "in", "are", "is"]

  for w in words:
    if w in stopwords:
      new_query += ' ' + w
      continue
    try:
      i = index.getLexicon()[w].getDocumentFrequency()
      print('document freq of ' + w +": " + str(i))
    except:
      i = 0
      print(w + ' is not present in the index')

    synonyms = get_synonyms_for(w)

    if len(synonyms) == 0:
      new_query += ' ' + w
      continue

    d = {}
    for s in synonyms:
      try:
        i_s = index.getLexicon()[ps.stem(s)].getDocumentFrequency()
      except:
        continue
      if i_s > i:
          d[s] = i_s

    if len(d) > 0:
      max_idf_synonym = max(d.items(), key=operator.itemgetter(1))[0]
      print('The synonym chosen is ' + max_idf_synonym)
      new_query += ' ' + max_idf_synonym
    else:
      new_query += ' ' + w
      
  print('The original query was: ' + q["query"])
  print('The new query is: ' + new_query)
  return new_query

In [58]:
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
import operator

#Create a new query, swapping only the words which are NOT present in the passages with the synonym which is more present
def replace_with_high_idf_synonym_only_if_not_in_passages(q):
  ps = PorterStemmer()
  words = q["query"].split(" ")
  new_query = ""

  stopwords = ["does", "do", "in", "are", "is"]

  for w in words:
    if w in stopwords:
      new_query += ' ' + w
      continue
    #If the word is already present in the document, which means that the getDocumentFrequency() gives no error, it is not going to be replaced
    try:
      i = index.getLexicon()[w].getDocumentFrequency()
      print('document freq of ' + w +": " + str(i) + '. Since this word is present in the documents, it is not going to be replaced.')
      new_query += ' ' + w
      continue
    except:
      i = 0
      print(w + ' is not present in the index, so it is going to be replaced.')

    synonyms = get_synonyms_for(w)

    if len(synonyms) == 0:
      new_query += ' ' + w
      continue

    d = {}
    for s in synonyms:
      try:
        i_s = index.getLexicon()[ps.stem(s)].getDocumentFrequency()
      except:
        continue
      if i_s > i:
          d[s] = i_s

    if len(d) > 0:
      max_idf_synonym = max(d.items(), key=operator.itemgetter(1))[0]
      print('The synonym chosen is ' + max_idf_synonym)
      new_query += ' ' + max_idf_synonym
    else:
      new_query += ' ' + w
      
  print('The original query was: ' + q["query"])
  print('The new query is: ' + new_query)
  return new_query

In [59]:
#Since the previous methods did not work well, we manually implemented two dictionaries in order to replace words
#based on that we found on the error analysis
dictionary_abbreviations = {"ww1": "world war 1", "us" : "united states" , "lps" : "lanterman petris short", "rn" : "registered nurse", "bsn" : "bachelor of science in nursing" }
dictionary_synonyms = {"pelvic" : ["stomach"], "visceral" : ["internal"], "wifi" : ["Wi-Fi"], "thai" : ["thailand"], "margin" : ["boundary"] , "cerebral" : ["brainy"], "temperature" : ["condition"]}

In [60]:
#Replace a list of words which are considered abbreviations with the full sentence 
def replace_abbreviations(q):
  words = q["query"].split(" ")

  print('The original query was: ' + q["query"])

  for w in words:
    if w in dictionary_abbreviations.keys():
      q["query"] = q["query"].replace(w, dictionary_abbreviations[w])

  print('The new query is: ' + q["query"])
  return q["query"]

In [61]:
def get_manually_synonyms(word):
  if word in dictionary_synonyms:
    return dictionary_synonyms.get(word)

In [62]:
#Replace a list of words with their synonyms which have been chosen directly by us
def replace_with_high_idf_synonym_manual(q):
  ps = PorterStemmer()
  words = q["query"].split(" ")
  new_query = ""
  
  stopwords = ["does", "do", "in", "are", "is"]

  for w in words:
    if w in stopwords:
      new_query += ' ' + w
      continue
    try:
      i = index.getLexicon()[w].getDocumentFrequency()
      print('document freq of ' + w +": " + str(i))
    except:
      i = 0
      print(w + ' is not present in the index')

    synonyms = get_manually_synonyms(w)

    try:
      if len(synonyms) == 0:
        new_query += ' ' + w
        continue
    except:
        new_query += ' ' + w
        continue

    d = {}
    for s in synonyms:
      try:
        i_s = index.getLexicon()[ps.stem(s)].getDocumentFrequency()
      except:
        continue
      if i_s > i:
          d[s] = i_s

    if len(d) > 0:
      max_idf_synonym = max(d.items(), key=operator.itemgetter(1))[0]
      print('the synonym chosen is ' + max_idf_synonym)
      new_query += ' ' + max_idf_synonym
    else:
      new_query += ' ' + w
      
  print('The original query was: ' + q["query"])
  print('The new query is: ' + new_query)
  return new_query

Here are listed all the improvement proposed. All of them are pipelines where the transformation of the queries is then followed by the baseline

In [63]:
abbreviations = pt.apply.query(replace_abbreviations) >> bm25

In [64]:
replace_with_synonyms = pt.apply.query(replace_with_high_idf_synonym) >> bm25

In [65]:
replace_with_synonyms_only_if_null = pt.apply.query(replace_with_high_idf_synonym_only_if_not_in_passages) >> bm25

In [66]:
replace_with_synonyms_manual = pt.apply.query(replace_with_high_idf_synonym_manual) >> bm25

In [54]:
pt.Experiment(
  [bm25, abbreviations],
  dataset.get_topics("test-2019"), 
  dataset.get_qrels("test-2019"),
  eval_metrics=["ndcg", "map", "recip_rank"], 
  perquery = "true",
  filter_by_qrels = "true"
)

The original query was: do goldfish grow
The new query is: do goldfish grow
The original query was: what is wifi vs bluetooth
The new query is: what is wifi vs bluetooth
The original query was: why did the us volunterilay enter ww1
The new query is: why did the united states volunterilay enter world war 1
The original query was: definition declaratory judgment
The new query is: definition declaratory judgment
The original query was: right pelvic pain causes
The new query is: right pelvic pain causes
The original query was: what are the social determinants of health
The new query is: what are the social determinants of health
The original query was: does legionella pneumophila cause pneumonia
The new query is: does legionella pneumophila cause pneumonia
The original query was: how is the weather in jamaica
The new query is: how is the weather in jamaica
The original query was: types of dysarthria from cerebral palsy
The new query is: types of dysarthria from cerebral palsy
The original 

Unnamed: 0,name,qid,measure,value
27,BR(BM25),1037798,map,0.109533
28,BR(BM25),1037798,recip_rank,0.333333
29,BR(BM25),1037798,ndcg,0.433398
63,BR(BM25),104861,map,0.356779
64,BR(BM25),104861,recip_rank,1.000000
...,...,...,...,...
160,Compose(<pyterrier.transformer.ApplyQueryTrans...,915593,recip_rank,1.000000
161,Compose(<pyterrier.transformer.ApplyQueryTrans...,915593,ndcg,0.728993
168,Compose(<pyterrier.transformer.ApplyQueryTrans...,962179,map,0.058701
169,Compose(<pyterrier.transformer.ApplyQueryTrans...,962179,recip_rank,0.025000


In [68]:
pt.Experiment(
  [bm25, abbreviations, replace_with_synonyms, replace_with_synonyms_only_if_null, replace_with_synonyms_manual],
  dataset.get_topics("test-2019"), 
  dataset.get_qrels("test-2019"),
  eval_metrics=["ndcg", "map", "recip_rank"], 
  filter_by_qrels = "true",
  baseline=0,
  round=3,
  names = ["bm25","abbreviations", "replace with synonyms", "replace only if null", "replace manually"]
)

The original query was: do goldfish grow
The new query is: do goldfish grow
The original query was: what is wifi vs bluetooth
The new query is: what is wifi vs bluetooth
The original query was: why did the us volunterilay enter ww1
The new query is: why did the united states volunterilay enter world war 1
The original query was: definition declaratory judgment
The new query is: definition declaratory judgment
The original query was: right pelvic pain causes
The new query is: right pelvic pain causes
The original query was: what are the social determinants of health
The new query is: what are the social determinants of health
The original query was: does legionella pneumophila cause pneumonia
The new query is: does legionella pneumophila cause pneumonia
The original query was: how is the weather in jamaica
The new query is: how is the weather in jamaica
The original query was: types of dysarthria from cerebral palsy
The new query is: types of dysarthria from cerebral palsy
The original 

Unnamed: 0,name,map,recip_rank,ndcg,map +,map -,map p-value,recip_rank +,recip_rank -,recip_rank p-value,ndcg +,ndcg -,ndcg p-value
0,bm25,0.37,0.795,0.593,,,,,,,,,
1,abbreviations,0.37,0.838,0.597,2.0,1.0,0.959743,3.0,0.0,0.125346,2.0,1.0,0.686425
2,replace with synonyms,0.327,0.688,0.533,1.0,11.0,0.033763,1.0,9.0,0.036765,1.0,11.0,0.014546
3,replace only if null,0.364,0.764,0.568,2.0,4.0,0.353386,1.0,3.0,0.400526,1.0,5.0,0.165999
4,replace manually,0.362,0.701,0.574,1.0,5.0,0.336636,0.0,5.0,0.026931,1.0,5.0,0.206056


Here is performed the *error analysis* 

In [69]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [70]:
%cd drive/MyDrive

/content/drive/MyDrive


In [71]:
#Given a docid and a query, search if the doc id exists in top-100 ranked documents
def check_rel_doc_present(docid, qid, query, result,label):

    # Hypothesis that this if statement is always true
    if docid in list(result["docno"]):
        R = result[result["docno"]==docid]
        rank = R.iloc[0]["rank"]
        score = R.iloc[0]["score"]
        doc_id = R.iloc[0]["docid"]
        # print("rank : ", rank)
        # print("score: ", score)
        # print("docid : ", doc_id)
        return {"rank":rank, "docid":doc_id, "score":round(score,2), "qid":qid, "query":query, "label":label}
    else:
        # Then we have document that were not found in the top-1000 ranked documents despite 
        # the query they belong to had a qrel provided
        
        # print("not found")
        return {"rank":-1, "docid":docid, "score":-1, "qid":qid, "query":query, "label":label}

In [72]:
import pandas as pd
import gc
from tqdm import tqdm

#Given the test qrels, retrieve all the relevant documents and their score in the model in order to check if they have been retrieved
def check_qrel_queries():
    temp1 = dataset.get_topicsqrels("test-2019")[0]
    queries_df = temp1.sort_values(by=['qid'])

    temp2 = dataset.get_topicsqrels("test-2019")[1]
    qrels_df = temp2.sort_values(by=['qid'])
    
    docs_qrels = pd.DataFrame(columns =['rank','docid','score','qid','label'])

    for index, row in tqdm(qrels_df.iterrows()):
        qid = row['qid']

        #We don't care about non-relevant documents, so they are skipped
        if row['label'] == 0:
          continue

        docNo = row["docno"]        
        query = queries_df[queries_df["qid"] == qid].iloc[0]["query"]
        
        # Returns the top 1000 relevant documents for this query (that has as qrel)
        result = bm25.search(query)
        
        docs_qrels = docs_qrels.append(check_rel_doc_present(docNo, qid, query, result,row['label']), ignore_index=True)
        
        del result
        gc.collect()
        
    return docs_qrels

In [73]:
temp = check_qrel_queries()

9260it [18:59,  8.12it/s]


In [74]:
temp.to_csv("QrelAnalysisTest.csv", index=False)

In [88]:
DOCS = pd.read_csv("collection.tar.gz", sep='\t', header=None, names=["docid","passage"])

  exec(code_obj, self.user_global_ns, self.user_ns)


In [89]:
DOCS

Unnamed: 0,docid,passage
0,collection.tsv,The presence of communication amid scientific ...
1,1,The Manhattan Project and its atomic bomb help...
2,2,Essay on The Manhattan Project - The Manhattan...
3,3,The Manhattan Project was the name for a proje...
4,4,versions of each volume as well as complementa...
...,...,...
8841819,8841819.0,Thousands of people across the United States w...
8841820,8841820.0,"The recipe that creates blue, for example, inc..."
8841821,8841821.0,"On Independence Days of yore, old-timey crowds..."
8841822,8841822.0,View full size image. Behind the scenes of the...


In [90]:
queries_without_doc_retrieved = temp[temp['score']==(-1.00)]

In [96]:
# RETRIEVE THESE DOCUMENTS TO SEE WHAT IS WRONG
passages = [(DOCS.iloc[(int(x))]['passage']) for x in queries_without_doc_retrieved['docid']]

In [97]:
queries_without_doc_retrieved['passages'] = passages

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


In [98]:
queries_without_doc_retrieved

Unnamed: 0,rank,docid,score,qid,label,query,passages
22,-1,5688423,-1.0,104861,2,cost of interior concrete flooring,if you wanted an 800 grit polish which is stil...
23,-1,5688422,-1.0,104861,1,cost of interior concrete flooring,extensive surface preparation such as grinding...
31,-1,4819155,-1.0,104861,1,cost of interior concrete flooring,"According to Cost Helper, having a professiona..."
37,-1,7302612,-1.0,104861,2,cost of interior concrete flooring,"Unlike tiles, carpeting and other flooring mat..."
40,-1,5248343,-1.0,104861,2,cost of interior concrete flooring,"If you wanted an 800 grit polish, which is sti..."
...,...,...,...,...,...,...,...
3980,-1,4284131,-1.0,87452,1,causes of military suicide,Long-Term Effects of PTSD. If left untreated s...
3981,-1,4307936,-1.0,87452,1,causes of military suicide,Suicide is detrimental to the readiness of the...
3982,-1,4079788,-1.0,87452,2,causes of military suicide,Understanding Drug Abuse in the Military. The ...
3983,-1,4096633,-1.0,87452,1,causes of military suicide,Soldiers and Marines who had more combat stres...


In [99]:
queries_without_doc_retrieved.to_csv("queries_not_retrieved_docs.csv", index=False)