<a href="https://colab.research.google.com/github/PaoloBarba/ADM_HM1_Barba_1885324/blob/main/DMT2023_Lab1_SE.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# INTRODUCTION

#### Information Retrieval

Information Retrieval (IR) is finding material (usually **documents**) of an unstructured nature (usually **text**) that satisfies an information need from within large collections (usually stored on computers).

#### IR is an Empirical Science

* Information Retrieval has massively benefitted from a long history of excellent test collections;

* This has allowed many retrieval models to be developed and their effectiveness to be demonstrated;

* Hence, IR has been a dataset-driven empirical science for 50 years!

## PyTerrier

PyTerrier is a software framework for information retrieval experiments in Python.

*   It embeds all the previous frameworks;
*   It allows performing experiments in a **declarative way**.



#### Installing & Configuring
Installing PyTerrier is easy - it can be installed from the command-line in the normal way using Pip: `pip install python-terrier`

All usages of PyTerrier start by importing PyTerrier and starting it using the init() method:

```
import pyterrier as pt
pt.init()
```


#### Importing Datasets
The datasets module allows easy access to existing standard test collections. In particular, each defined dataset can download and provide easy access to:
* files containing the documents of the corpus
* topics (queries), as a dataframe, ready for retrieval
* relevance assessments (aka, labels or qrels), as a dataframe, ready for evaluation
* ready-made Terrier indices, where appropriate

#### Indexing
PyTerrier has a number of useful classes for creating Terrier indices, which can be used for retrieval, query expansion, etc. There are four indexer classes:

* You can create an index from TREC-formatted files, from a TREC test collection, using TRECCollectionIndexer.
* You can use FilesIndexer for indexing TXT, PDF, Microsoft Word files, etc.
* For indexing Pandas Dataframe you can use DFIndexer.
* For any **arbitrary iterable of dictionaries**, you can use IterDictIndexer.
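
To make the idea of an index concrete, here is a minimal pure-Python sketch of an inverted index, the kind of data structure these indexer classes build. It does not use PyTerrier: the `tokenize` and `build_inverted_index` helpers and the two sample documents are illustrative assumptions, and a real Terrier index is far more compact and feature-rich.

```python
import re
from collections import defaultdict

def tokenize(text):
    # Hypothetical tokenizer: lowercase, then split on non-alphanumerics.
    return re.findall(r"[a-z0-9]+", text.lower())

def build_inverted_index(docs):
    # Map each term to {docno: term frequency in that document}.
    index = defaultdict(dict)
    for docno, text in docs.items():
        for term in tokenize(text):
            index[term][docno] = index[term].get(docno, 0) + 1
    return dict(index)

docs = {"d1": "History of banking", "d2": "List of italian banks"}
index = build_inverted_index(docs)
print(index["of"])  # documents containing "of", with their frequencies
```

Looking up a query term in such a structure gives the candidate documents directly, without scanning every text.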


#### Terrier Retrieval
**BatchRetrieve** is one of the most commonly used PyTerrier objects. It represents a retrieval transformation, in which queries are mapped to retrieved documents. BatchRetrieve uses a pre-existing Terrier index data structure, typically saved on disk.

# CODE

## Libraries

> We need to install **PyTerrier**, as it is not part of the Python standard library

In [1]:
pip install python-terrier

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting python-terrier
  Downloading python-terrier-0.9.2.tar.gz (104 kB)
  Preparing metadata (setup.py) ... done
Collecting wget
  Downloading wget-3.2.zip (10 kB)
  Preparing metadata (setup.py) ... done
Collecting pyjnius>=1.4.2
  Downloading pyjnius-1.4.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.5 MB)
Collecting matchpy
  Downloading matchpy-0.5.5-py3-none-any.whl (69 kB)
Collecting deprecated
  Downloading Deprecated-1.2.13-py2.py3-none-any.whl (9.6 kB)
Collecting chest

> Now we can import **PyTerrier**

In [2]:
import pyterrier as pt

> We also need to import the **pandas** library to handle DataFrames

In [3]:
import pandas as pd

In [5]:
import logging, sys
logging.disable(sys.maxsize)

### Initialization
> It is necessary to call the PyTerrier `init` function before Terrier classes and methods can be used. This function imports classes and also finds the correct version of Terrier to download if no version is specified.

> This is a peculiarity of PyTerrier and is not necessary for the other libraries we need.

In [6]:
if not pt.started():
  pt.init()

> The `started` method of the PyTerrier package returns `True` if `init()` has already been called, `False` otherwise.

### Custom functions
Here we define the hidden imports and custom functions that simplify certain steps or improve the visual output.

In [None]:
from IPython.display import display, display_html

In [None]:
def format_query_result(results, df, field):
  merged_df = pd.merge(results, df, left_on='docno', right_on='docno') #merge the two dfs

  merged_df = merged_df.loc[:, ["rank", "score", *field]] #subset to columns we are interested in
  
  return merged_df

In [None]:
def display_query_result(results_dict, df):
  for field,results in results_dict.items():
    previous_max_col_width = pd.options.display.max_colwidth #save current max_col_width
    pd.options.display.max_colwidth = 1000 #change max_colwidth to display more text
    
    merged_df = format_query_result(results,df,field) #get formatted results
    
    display(merged_df) #display result

    pd.options.display.max_colwidth = previous_max_col_width #reset previous max_col_width

In [None]:
def display_matrix_results(results_dict):
  df_stylers = []
  for i,(k,v) in enumerate(results_dict.items()):
    if k[0]=="":
      app = "No preprocessing"
    else:
      app = k[0]
    df_stylers.append(v.style.format(formatter={("score"): "{:.1f}", ("text"): "{:.70}"}).set_table_attributes("style='display:inline'").set_caption(app+" with "+k[1]))

  app = df_stylers[0]._repr_html_()
  for i,df_styler in enumerate(df_stylers[1:]):
    if i%2==1:
      app += "<br>" + "-"*230 + "\n"
    app += df_styler._repr_html_()
  display_html(app, raw=True)

## Custom Dataset

### Data

> Here we define an illustrative dataframe on which to perform queries

In [None]:
df = pd.DataFrame({
  'docno':
    ['98', '81', '63', '59', '53', '65', '75', '45'],
  'title':
    ["Bank of Italy",
     "List of italian banks",
     "Zoology class",
     "History of banking",
     "Title",
     "History of banking",
     "Spam document",
     "History of banking"
    ],
  'text':
    ["The Bank of Italy is the central bank of Italy and part of the European System of Central Banks.",
     "List of italian banks",
     "History of animals that lived near banks of rivers",
     "Many of histories position the development of a banking system to medieval and Renaissance Italy",
     "History of banking",
     "History of banking: the development of the banking system...",
     "Of of of of of of of of of of of of of of of of apple",
     "spam apple of spam of spam bank spam spam history spam Sony spam spam soccer spam cinema spam spam sales spam buy spam spam TV spam offer spam job work spam Pokémon spam"
    ]
  })

In [None]:
df

Unnamed: 0,docno,title,text
0,98,Bank of Italy,The Bank of Italy is the central bank of Italy...
1,81,List of italian banks,List of italian banks
2,63,Zoology class,History of animals that lived near banks of ri...
3,59,History of banking,Many of histories position the development of ...
4,53,Title,History of banking
5,65,History of banking,History of banking: the development of the ban...
6,75,Spam document,Of of of of of of of of of of of of of of of o...
7,45,History of banking,spam apple of spam of spam bank spam spam hist...


### Indexing

> To create the index used to perform the queries, we must follow these steps:
1. Initialize a **DFIndexer** object, specifying in which folder to put the index
>> We also specify that we want to *overwrite* the index, if found.
2. Set a **preprocessing** configuration.
3. Pass the *text* and *docno* **fields** to the indexer

---

> We create a **function** that takes as argument a *preprocessing_configuration* string, as we want to **repeat** the index creation for various configurations, and returns the computed index. The *field* to be indexed is also an input variable (it will become clear later why).


In [None]:
def create_index(preprocessing_configuration, field):
  pd_indexer = pt.DFIndexer("./Inverted_Index", overwrite=True)
  
  pd_indexer.setProperty("termpipelines", preprocessing_configuration)
  
  indexref = pd_indexer.index(df[field], df["docno"])

  return indexref

### Querying

> After the index creation, we'll need to specify a **weighting model** (a.k.a. **scoring function**) to compute the score of a document given a query.

> As before, here we define a function that takes as input the previously initialized index and the name of the weighting model and returns the **retrieval model**.

In [None]:
def create_retrieval_model(indexref, scoring_function):
  return pt.BatchRetrieve(indexref, wmodel = scoring_function)

> **The retrieval model will take as input a query and return its result**.

### First Example

> First of all, we need to call the functions that we have just defined and create the index and the retrieval model

---

> To begin with, let's not use any **pre-processing** for the **index**

In [None]:
indexref = create_index(preprocessing_configuration = "", field = "text")


> For the scoring function, let's use **CoordinateMatch**.

> We must also remember to feed the *indexref* to the `create_retrieval_model` function

In [None]:
terms_presence_nopreproc = create_retrieval_model(indexref, scoring_function = "CoordinateMatch")

> Now we are ready to perform our **query**. Let's assume we want to find articles about the *History of banking*

In [None]:
query = "History of banking"
results = terms_presence_nopreproc.search(query)

> Let's visualize the results

In [None]:
results

Unnamed: 0,qid,docid,docno,rank,score,query
0,1,4,53,0,3.0,History of banking
1,1,5,65,1,3.0,History of banking
2,1,2,63,2,2.0,History of banking
3,1,3,59,3,2.0,History of banking
4,1,7,45,4,2.0,History of banking
5,1,0,98,5,1.0,History of banking
6,1,1,81,6,1.0,History of banking
7,1,6,75,7,1.0,History of banking


> Since this is not very informative, let's add to each row the text corresponding to the document number (*docno*). We'll use a custom printing function defined earlier.

In [None]:
display_query_result({("text",):results},df)
#YOUR TURN: Try to alter the Pandas DataFrame to see the effect in the output and try to predict it

Unnamed: 0,rank,score,text
0,0,3.0,History of banking
1,1,3.0,History of banking: the development of the banking system...
2,2,2.0,History of animals that lived near banks of rivers
3,3,2.0,Many of histories position the development of a banking system to medieval and Renaissance Italy
4,4,2.0,spam apple of spam of spam bank spam spam history spam Sony spam spam soccer spam cinema spam spam sales spam buy spam spam TV spam offer spam job work spam Pokémon spam
5,5,1.0,The Bank of Italy is the central bank of Italy and part of the European System of Central Banks.
6,6,1.0,List of italian banks
7,7,1.0,Of of of of of of of of of of of of of of of of apple
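
The scores in the table above can be reproduced by hand. The sketch below is a plain-Python approximation, not Terrier's implementation: it assumes a simple lowercase tokenizer and treats coordinate matching as the number of distinct query terms that a document contains. The names `tokenize`, `coordinate_match`, and `term_frequency` are illustrative; `term_frequency` sums the occurrences of the query terms, which is how the `Tf` model used later behaves on this data.

```python
import re

def tokenize(text):
    # Assumed tokenizer: lowercase, then split on non-alphanumerics.
    return re.findall(r"[a-z0-9]+", text.lower())

def coordinate_match(query, text):
    # Count how many distinct query terms occur in the document.
    doc_terms = set(tokenize(text))
    return sum(1 for term in set(tokenize(query)) if term in doc_terms)

def term_frequency(query, text):
    # Sum the occurrences of every distinct query term in the document.
    doc_terms = tokenize(text)
    return sum(doc_terms.count(term) for term in set(tokenize(query)))

query = "History of banking"
for text in ["History of banking",
             "History of banking: the development of the banking system...",
             "List of italian banks"]:
    print(coordinate_match(query, text), term_frequency(query, text), text)
```

With this tokenizer the three documents get coordinate-match scores of 3, 3, and 1, in line with the table.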


### More rigorous tests

> Now that we have understood how to initialise an index, create a retrieval model and make a query, let's try to see what changes depending on the preprocessing used and the scoring function chosen.

In [None]:
possible_preprocessing = ["", #no preprocessing
                          "Stopwords", #remove stopwords
                          "EnglishSnowballStemmer", #Probably the most famous stemmer in the world
                          "Stopwords, EnglishSnowballStemmer"] #Both previous ones

possible_wmodels = ["CoordinateMatch", #Term presence
                    "Tf"] #Term frequency

#YOUR TURN: Try to change preprocessing and/or scoring function; use PyTerrier documentation for reference.

> Let us now build a program that tries every possible **combination of preprocessing configuration and scoring function**, and then we can check the difference in the results.

> The query will always be `"History of banking"`.

> A python `dict` will be used to save the results for each combination of preprocessing and scoring function.


In [None]:
query = "History of banking"
results_dict = {}
for preprocessing_configuration in possible_preprocessing:
  indexref = create_index(preprocessing_configuration, "text")
  for wmodel in possible_wmodels:
    retrieval_model = create_retrieval_model(indexref, wmodel)
    results = retrieval_model.search(query)

    results_dict[(preprocessing_configuration, wmodel)] = format_query_result(results,df,("text",))


> Now, using a custom function, we will show all the results in a matrix fashion to highlight the differences between the different configurations.

In [None]:
display_matrix_results(results_dict)

Caption: No preprocessing with CoordinateMatch
Unnamed: 0,rank,score,text
0,0,3.0,History of banking
1,1,3.0,History of banking: the development of the banking system...
2,2,2.0,History of animals that lived near banks of rivers
3,3,2.0,Many of histories position the development of a banking system to medi
4,4,2.0,spam apple of spam of spam bank spam spam history spam Sony spam spam
5,5,1.0,The Bank of Italy is the central bank of Italy and part of the Europea
6,6,1.0,List of italian banks
7,7,1.0,Of of of of of of of of of of of of of of of of apple

Caption: No preprocessing with Tf
Unnamed: 0,rank,score,text
0,0,16.0,Of of of of of of of of of of of of of of of of apple
1,1,5.0,History of banking: the development of the banking system...
2,2,4.0,The Bank of Italy is the central bank of Italy and part of the Europea
3,3,3.0,History of animals that lived near banks of rivers
4,4,3.0,Many of histories position the development of a banking system to medi
5,5,3.0,History of banking
6,6,3.0,spam apple of spam of spam bank spam spam history spam Sony spam spam
7,7,1.0,List of italian banks

Caption: Stopwords with CoordinateMatch
Unnamed: 0,rank,score,text
0,0,2.0,History of banking
1,1,2.0,History of banking: the development of the banking system...
2,2,1.0,History of animals that lived near banks of rivers
3,3,1.0,Many of histories position the development of a banking system to medi
4,4,1.0,spam apple of spam of spam bank spam spam history spam Sony spam spam

Caption: Stopwords with Tf
Unnamed: 0,rank,score,text
0,0,3.0,History of banking: the development of the banking system...
1,1,2.0,History of banking
2,2,1.0,History of animals that lived near banks of rivers
3,3,1.0,Many of histories position the development of a banking system to medi
4,4,1.0,spam apple of spam of spam bank spam spam history spam Sony spam spam

Caption: EnglishSnowballStemmer with CoordinateMatch
Unnamed: 0,rank,score,text
0,0,3.0,History of animals that lived near banks of rivers
1,1,3.0,Many of histories position the development of a banking system to medi
2,2,3.0,History of banking
3,3,3.0,History of banking: the development of the banking system...
4,4,3.0,spam apple of spam of spam bank spam spam history spam Sony spam spam
5,5,2.0,The Bank of Italy is the central bank of Italy and part of the Europea
6,6,2.0,List of italian banks
7,7,1.0,Of of of of of of of of of of of of of of of of apple

Caption: EnglishSnowballStemmer with Tf
Unnamed: 0,rank,score,text
0,0,16.0,Of of of of of of of of of of of of of of of of apple
1,1,7.0,The Bank of Italy is the central bank of Italy and part of the Europea
2,2,5.0,History of banking: the development of the banking system...
3,3,4.0,History of animals that lived near banks of rivers
4,4,4.0,Many of histories position the development of a banking system to medi
5,5,4.0,spam apple of spam of spam bank spam spam history spam Sony spam spam
6,6,3.0,History of banking
7,7,2.0,List of italian banks

Caption: Stopwords, EnglishSnowballStemmer with CoordinateMatch
Unnamed: 0,rank,score,text
0,0,2.0,History of animals that lived near banks of rivers
1,1,2.0,Many of histories position the development of a banking system to medi
2,2,2.0,History of banking
3,3,2.0,History of banking: the development of the banking system...
4,4,2.0,spam apple of spam of spam bank spam spam history spam Sony spam spam
5,5,1.0,The Bank of Italy is the central bank of Italy and part of the Europea
6,6,1.0,List of italian banks

Caption: Stopwords, EnglishSnowballStemmer with Tf
Unnamed: 0,rank,score,text
0,0,3.0,The Bank of Italy is the central bank of Italy and part of the Europea
1,1,3.0,History of banking: the development of the banking system...
2,2,2.0,History of animals that lived near banks of rivers
3,3,2.0,Many of histories position the development of a banking system to medi
4,4,2.0,History of banking
5,5,2.0,spam apple of spam of spam bank spam spam history spam Sony spam spam
6,6,1.0,List of italian banks
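
The differences between these result tables come from the term pipeline, which is applied to both the documents and the query before matching. Below is a rough sketch of that idea. The tiny stopword set and the `crude_stem` suffix stripper are stand-ins invented for illustration; they are not Terrier's `Stopwords` list or the Snowball stemmer.

```python
import re

STOPWORDS = {"of", "the", "a", "and", "is", "to"}  # illustrative subset only

def tokenize(text):
    # Assumed tokenizer: lowercase, then split on non-alphanumerics.
    return re.findall(r"[a-z0-9]+", text.lower())

def crude_stem(term):
    # Hypothetical stand-in for a real stemmer: strip one common suffix.
    for suffix in ("ing", "s"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[: -len(suffix)]
    return term

def preprocess(text, remove_stopwords=False, stem=False):
    # Apply the optional pipeline stages in order: stopwords, then stemming.
    terms = tokenize(text)
    if remove_stopwords:
        terms = [t for t in terms if t not in STOPWORDS]
    if stem:
        terms = [crude_stem(t) for t in terms]
    return terms

print(preprocess("History of banking"))
print(preprocess("History of banking", remove_stopwords=True))
print(preprocess("List of italian banks", remove_stopwords=True, stem=True))
```

Once "of" is dropped, a document consisting almost entirely of "of" can no longer match the query; once "banking" and "banks" share a stem, documents that mention banks start matching it.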


### Indexing different fields

> Using the functions created earlier, we build the index and the resulting retrieval model for the *text* and *title* fields

In [None]:
text_indexref = create_index("Stopwords, EnglishSnowballStemmer","text")
text_retr_model = create_retrieval_model(text_indexref, "Tf")

title_indexref = create_index("Stopwords, EnglishSnowballStemmer","title")
title_retr_model = create_retrieval_model(title_indexref, "CoordinateMatch")


> Now, we can get the results for the same query (*History of banking*) on **different** fields

In [None]:
query = "History of Banking"

results_text = format_query_result(text_retr_model.search(query),df,("docid","title","text"))
results_title = format_query_result(title_retr_model.search(query),df,("docid","title","text"))

In [None]:
results_text

Unnamed: 0,rank,score,docid,title,text
0,0,3.0,0,Bank of Italy,The Bank of Italy is the central bank of Italy...
1,1,3.0,5,History of banking,History of banking: the development of the ban...
2,2,2.0,2,Zoology class,History of animals that lived near banks of ri...
3,3,2.0,3,History of banking,Many of histories position the development of ...
4,4,2.0,4,Title,History of banking
5,5,2.0,7,History of banking,spam apple of spam of spam bank spam spam hist...
6,6,1.0,1,List of italian banks,List of italian banks


In [None]:
results_title

Unnamed: 0,rank,score,docid,title,text
0,0,2.0,3,History of banking,Many of histories position the development of ...
1,1,2.0,5,History of banking,History of banking: the development of the ban...
2,2,2.0,7,History of banking,spam apple of spam of spam bank spam spam hist...
3,3,1.0,0,Bank of Italy,The Bank of Italy is the central bank of Italy...
4,4,1.0,1,List of italian banks,List of italian banks


## Indexing file

> We can also index our own data, such as personal txt files, rather than only the datasets shipped with PyTerrier.

> As an example, we will stream the collection of tweets "Democrat vs. Republican Tweets" obtained from Kaggle. It is formatted as "[docno] \t [text] \n". We have stored this data on the web for everyone to access.

> The data is in our shared Drive folder, in the data subfolder. To download the file to this machine, we use the `gdown` command. We will see this command in more detail in the next sessions.

In [None]:
!gdown 12z4xEGPcenFhQm5BA4dFJkE8EIVdAssn

Downloading...
From: https://drive.google.com/uc?id=12z4xEGPcenFhQm5BA4dFJkE8EIVdAssn
To: /content/tweets.txt
  0% 0.00/12.4M [00:00<?, ?B/s]100% 12.4M/12.4M [00:00<00:00, 245MB/s]


> The next cell reads the downloaded file line by line and collects the records into a DataFrame; the `tweet_doc_iter_from_file` function defined afterwards returns them as a Python iterable of dicts.

In [None]:
records = [] #list of {'docno', 'text'} dicts, one per valid line
with open("tweets.txt","r") as f:
  for i, line in enumerate(f):
    if i % 1000 == 0:
      print(f'processing document {i}')
    record = line.strip().split('\t')
    if len(record) < 2:
      print(record) #show and skip lines without a tab-separated text
      continue
    docno, text = record[0], record[1]
    records.append({'docno': docno, 'text': text})
df = pd.DataFrame(records).set_index("docno",drop=False) #drop=False to keep the "docno" column

processing document 0
processing document 1000
processing document 2000
processing document 3000
processing document 4000
processing document 5000
processing document 6000
processing document 7000
processing document 8000
processing document 9000
processing document 10000
processing document 11000
processing document 12000
processing document 13000
processing document 14000
processing document 15000
processing document 16000
processing document 17000
processing document 18000
processing document 19000
processing document 20000
processing document 21000
processing document 22000
processing document 23000
processing document 24000
processing document 25000
processing document 26000
processing document 27000
processing document 28000
processing document 29000
processing document 30000
processing document 31000
processing document 32000
processing document 33000
processing document 34000
processing document 35000
processing document 36000
processing document 37000
processing document 38000

In [None]:
def tweet_doc_iter_from_file(df):
  for row in df.to_dict("records"):
    yield row
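
Since `IterDictIndexer` consumes an iterable of dictionaries, the DataFrame step is optional. As a sketch of an alternative (not what this notebook does), the hypothetical generator below yields one dict per well-formed `docno<TAB>text` line from any iterable of lines, such as an open file.

```python
def tweet_doc_iter_from_lines(lines):
    # Yield {'docno', 'text'} dicts, skipping lines without a tab-separated text.
    for line in lines:
        record = line.strip().split("\t")
        if len(record) < 2:
            continue
        yield {"docno": record[0], "text": record[1]}

# Small in-memory sample standing in for the lines of tweets.txt.
sample = ["1\tfirst tweet\n", "malformed line\n", "2\tsecond tweet\n"]
print(list(tweet_doc_iter_from_lines(sample)))
```

Streaming this way avoids holding the whole collection in memory, which matters more as the corpus grows.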

In [None]:
folder_pos = "./iter_index"

!rm -r $folder_pos #remove index if already created

rm: cannot remove './iter_index': No such file or directory


In [None]:
indexer = pt.IterDictIndexer(folder_pos, blocks=True)

doc_iter = tweet_doc_iter_from_file(df)
index3 = indexer.index(doc_iter, meta=['docno', 'text'])


16:26:41.281 [ForkJoinPool-1-worker-3] WARN org.terrier.structures.indexing.Indexer - Indexed 2 empty documents


> As we can see, the `IterDictIndexer` object can index any iterable of dictionaries

In [None]:
index_factory = pt.IndexFactory.of(index3)
print(index_factory.getCollectionStatistics().toString())

Number of documents: 86460
Number of terms: 113652
Number of postings: 1043269
Number of fields: 1
Number of tokens: 1064228
Field names: [text]
Positions:   true



In [None]:
tf = pt.BatchRetrieve(index3, wmodel="Tf", metadata=["docno", "text"])

retrieval_model = tf % 10

In [None]:
retrieval_model.search("bitcoin")

Unnamed: 0,qid,docid,docno,text,rank,score,query
0,1,33335,33335,"Just like stock trades or financial assets, me...",0,1.0,bitcoin
1,1,33339,33339,The buying &amp; selling of #bitcoins &amp; ot...,1,1.0,bitcoin
2,1,50228,50228,We are discussing the challenges and opportuni...,2,1.0,bitcoin
3,1,70120,70120,RT @FortuneMagazine: Tax bill calls for bitcoi...,3,1.0,bitcoin
4,1,70121,70121,RT @SiliconANGLE: New law would introduce capi...,4,1.0,bitcoin
5,1,74245,74245,RT @RepLoudermilk: We are discussing the chall...,5,1.0,bitcoin
6,1,75811,75811,RT @listendestro: @realDonaldTrump Bitcoin rules.,6,1.0,bitcoin
7,1,84786,84786,RT @HouseScience: WATCH LIVE: Beyond #Bitcoin:...,7,1.0,bitcoin


In [None]:
retrieval_model.search("money laundering")

Unnamed: 0,qid,docid,docno,text,rank,score,query
0,1,29094,29094,RT @RepTedDeutch: If youre a lobbyist who neve...,0,2.0,money laundering
1,1,29098,29098,"If youre a lobbyist who never gave us money, I...",1,2.0,money laundering
2,1,29247,29247,#Backpage.com CEO Carl Ferrer has pled guilty ...,2,2.0,money laundering
3,1,36277,36277,RT @OTLonESPN: For the leagues to expect any ...,3,2.0,money laundering
4,1,39028,39028,8 yrs ago #CitizensUnited opened the floodgate...,4,2.0,money laundering
5,1,39720,39720,The Mueller files and other documents suggest ...,5,2.0,money laundering
6,1,42835,42835,".@PPAdvocatesINKY, you know full well that mon...",6,2.0,money laundering
7,1,46846,46846,"Tonight, the House will pass the End Banking f...",7,2.0,money laundering
8,1,56161,56161,#PrisonReform is a money and a moral issueIt's...,8,2.0,money laundering
9,1,59697,59697,RT @WaysandMeansGOP: Because we are not talki...,9,2.0,money laundering


In [None]:
retrieval_model.search("Clinton")

Unnamed: 0,qid,docid,docno,text,rank,score,query
0,1,17029,17029,Russian Spies met w/ Trump Campaign. To discus...,0,2.0,Clinton
1,1,86442,86442,We need answers now. We must investigate Clint...,1,2.0,Clinton
2,1,5416,5416,RT @DeanObeidallah: I agree 100% with Rudy Giu...,2,1.0,Clinton
3,1,9569,9569,"RT @JohnBrennan: I served 6 Presidents, 3 Rs &...",3,1.0,Clinton
4,1,19038,19038,RT @InvestigateRU: .@RepSarbanes: Trump campai...,4,1.0,Clinton
5,1,25364,25364,RT @RepAdamSchiff: Russian social media campai...,5,1.0,Clinton
6,1,27303,27303,RT @KellyannePolls: Astonished by the all-out ...,6,1.0,Clinton
7,1,28372,28372,"Background checks save lives. 24 years ago, Pr...",7,1.0,Clinton
8,1,28818,28818,Virtually every Clinton-related matter that Pr...,8,1.0,Clinton
9,1,29586,29586,I added a video to a @YouTube playlist http://...,9,1.0,Clinton


## PyTerrier Datasets

> The datasets module allows easy access to existing standard test collections. In particular, each defined dataset can download and provide easy access to:
* files containing the documents of the corpus,
* topics (**queries**), as a dataframe, ready for retrieval,
* relevance assessments (aka, labels or qrels), as a dataframe, ready for evaluation: the **ground truth**.

In [None]:
terrier_datasets = pt.datasets.list_datasets()

> Let's have a brief look at the list of datasets provided by the PyTerrier library

In [None]:
terrier_datasets

Unnamed: 0,dataset,topics,topics_lang,qrels,corpus,corpus_lang,index,info_url
0,50pct,"[training, validation]",en,"[training, validation]",,,"[ex2, ex3]",
1,antique,"[train, test]",en,"[train, test]",True,en,,https://ciir.cs.umass.edu/downloads/Antique/re...
2,vaswani,True,en,True,True,en,True,http://ir.dcs.gla.ac.uk/resources/test_collect...
3,msmarco_document,"[train, dev, test, test-2020, leaderboard-2020]",en,"[train, dev, test, test-2020]",True,en,True,https://microsoft.github.io/msmarco/
4,msmarcov2_document,"[train, dev1, dev2, valid1, valid2, trec_2021]",en,"[train, dev1, dev2, valid1, valid2]",,,True,https://microsoft.github.io/msmarco/TREC-Deep-...
...,...,...,...,...,...,...,...,...
641,irds:hc4,,,,,,,https://ir-datasets.com/hc4.html
654,irds:neuclir,,,,,,,https://ir-datasets.com/neuclir.html
655,irds:neuclir/1,,,,,,,https://ir-datasets.com/neuclir.html#neuclir/1
662,trec-deep-learning-docs,"[train, dev, test, test-2020, leaderboard-2020]",en,"[train, dev, test, test-2020]",True,en,True,https://microsoft.github.io/msmarco/


### Indexing

> The dataset we will use is TREC-COVID (`irds:cord19/trec-covid`), a test collection from TREC (Text REtrieval Conference). It is a collection of biomedical literature articles on COVID-related topics.
> Each document contains title, abstract, doi and date.

> After defining the dataset name, we download the dataset using the command `pt.get_dataset`

In [None]:
dataset_name = 'irds:cord19/trec-covid'

dataset = pt.get_dataset(dataset_name) #download dataset

> First we define the folder where to put the index.

> We also remove the folder in case it has already been created. This command may be commented out if we want to keep an already built index.

In [None]:
folder_pos = './irds:cord19/trec-covid/'

!rm -r $folder_pos #remove index if already created

> We create the indexer using the `IterDictIndexer` class. We need to specify the folder path. We also pass `meta_reverse = []`; this parameter, which defaults to *docno*, lists the metadata keys that can be resolved back to a *docid*. If two documents share the same docno (as happens in this collection), leaving the default in place leads to an error during index creation.

> The indexer is now ready to index the dataset, which is passed to it via the `get_corpus_iter` method. We also specify the `fields` to index, *abstract* and *title*, and the `meta` data to be returned with each retrieved document. We have added *title* and *abstract* only to make the query results more informative; normally only the `docno` is needed.

In [None]:
indexer = pt.IterDictIndexer(folder_pos,  meta_reverse = [], blocks = True)
index1 = indexer.index(dataset.get_corpus_iter(), fields=('abstract','title'), meta=('docno','title','abstract'))

[INFO] [starting] building docstore
[INFO] If you have a local copy of https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/2020-07-16/metadata.csv, you can symlink it here to avoid downloading it again: /root/.ir_datasets/downloads/80d664e496b8b7e50a39c6f6bb92e0ef
[INFO] [starting] https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/2020-07-16/metadata.csv

16:34:06.006 [ForkJoinPool-2-worker-3] WARN org.terrier.structures.indexing.Indexer - Indexed 60 empty documents


> We use PyTerrier's `IndexFactory` class to obtain some statistics on the newly created index, such as the number of documents and terms indexed.

In [None]:
index_factory = pt.IndexFactory.of(index1)
print(index_factory.getCollectionStatistics().toString())

Number of documents: 192509
Number of terms: 158515
Number of postings: 12290426
Number of fields: 2
Number of tokens: 19603234
Field names: [abstract, title]
Positions:   true



### Querying

> The query will be *protein variants*, the scoring function will be **tf-idf**

In [None]:
query = "protein variants"

In [None]:
tf_idf_results = pt.BatchRetrieve(index1, wmodel = "TF_IDF", metadata = ["docno", "title", "abstract"]).search(query)

In [None]:
tf_idf_results

Unnamed: 0,qid,docid,docno,title,abstract,rank,score,query
0,1,97334,4nfxdppt,Structural variations in human ACE2 may influe...,"The recent pandemic of COVID-19, caused by SAR...",0,9.645741,protein variants
1,1,153468,m0w0fl2u,Structural variations in human ACE2 may influe...,"The recent pandemic of COVID19, caused by SARS...",1,9.645741,protein variants
2,1,74549,m0xvqplq,"D614G Spike Variant Does Not Alter IgG, IgM, o...",Emergence of a new variant of spike protein (D...,2,9.616081,protein variants
3,1,167441,eb7g9p1x,Localization of extensive deletions in the str...,Abstract The intracellular RNA of two neurotro...,3,9.605184,protein variants
4,1,177319,5pmp33d0,New variants of porcine epidemic diarrhea viru...,Four types of porcine epidemic diarrhea virus ...,4,9.534034,protein variants
...,...,...,...,...,...,...,...,...
995,1,142159,ynown7ra,Outbreak-Related Porcine Epidemic Diarrhea Vir...,"In late 2013, outbreaks of porcine epidemic di...",995,4.551310,protein variants
996,1,167386,y9e5fz3j,Complete Genome Sequence of the Porcine Epidem...,Porcine epidemic diarrhea virus (PEDV) is a ca...,996,4.551310,protein variants
997,1,40784,d14vro73,Interferon lambda 4 genotypes and resistance-a...,UNLABELLED Single-nucleotide polymorphisms (SN...,997,4.550411,protein variants
998,1,602,cjzecgrb,Geometry and Adhesion of Extracellular Domains...,[Image: see text] Forcedistance measurements h...,998,4.538761,protein variants


> Let's look at one of the results

In [None]:
tf_idf_results.iloc[2]["title"]

'D614G Spike Variant Does Not Alter IgG, IgM, or IgA Spike Seroassay Performance'

In [None]:
tf_idf_results.iloc[2]["abstract"]

'Emergence of a new variant of spike protein (D614G) with increased infectivity and transmissibility has prompted many to analyze the potential role of this variant in the SARS-CoV-2 pandemic. When a new variant emerges, there is a concern regarding whether an individual exposed to one variant of a virus will have cross-reactive immune memory to the second variant. Accordingly, we analyzed the serologic reactivity of D614 (original) and G614 variant spike proteins. We found that antibodies from a high-incid'
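
To give some intuition for the scores above, here is a textbook tf-idf sketch in plain Python. It is a simplification under stated assumptions: Terrier's `TF_IDF` weighting model is not guaranteed to use this exact formula, so the numbers will not match the table. The sample documents and the `tf_idf_scores` helper are made up for illustration.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Assumed tokenizer: lowercase, then split on non-alphanumerics.
    return re.findall(r"[a-z0-9]+", text.lower())

def tf_idf_scores(query, docs):
    # Score each document as the sum over query terms of tf * log(N / df).
    n_docs = len(docs)
    doc_counts = {docno: Counter(tokenize(text)) for docno, text in docs.items()}
    doc_freq = Counter()
    for counts in doc_counts.values():
        doc_freq.update(counts.keys())
    scores = {}
    for docno, counts in doc_counts.items():
        score = 0.0
        for term in set(tokenize(query)):
            if counts[term] > 0:
                score += counts[term] * math.log(n_docs / doc_freq[term])
        scores[docno] = score
    return scores

docs = {
    "d1": "protein variants of the spike protein",
    "d2": "the spike protein",
    "d3": "the weather of today",
}
scores = tf_idf_scores("protein variants", docs)
print(sorted(scores.items(), key=lambda item: item[1], reverse=True))
```

The rarer term "variants" contributes more per occurrence than "protein", which appears in two of the three documents.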

### Running Experiments
> PyTerrier aims to make it easy to conduct an information retrieval **experiment**, namely, to run a transformer **pipeline** over a set of queries and evaluate the outcome using standard information retrieval evaluation metrics based on known relevant documents (obtained from a set of relevance assessments, also known as qrels).

> The main way to achieve this is using `pt.Experiment()`.

> Let's first define some retrievers, one per weighting model (scoring function).

> Then we keep only the top 10 results of each retriever, using the `%` rank cutoff operator.

In [None]:
CoordinateMatch = pt.BatchRetrieve(index1, wmodel="CoordinateMatch")
TF = pt.BatchRetrieve(index1, wmodel="Tf")
TF_IDF = pt.BatchRetrieve(index1, wmodel="TF_IDF")
LemurTF_IDF = pt.BatchRetrieve(index1, wmodel="LemurTF_IDF")
BM25 = pt.BatchRetrieve(index1, wmodel="BM25")
PL2 = pt.BatchRetrieve(index1, wmodel="PL2")
BM25F = pt.BatchRetrieve(index1, wmodel="BM25F")
PL2F = pt.BatchRetrieve(index1, wmodel="PL2F")

CoordinateMatch_at_10 = CoordinateMatch % 10
TF_at_10 = TF % 10
TF_IDF_at_10 = TF_IDF % 10
LemurTF_IDF_at_10 = LemurTF_IDF % 10
BM25_at_10 = BM25 % 10
PL2_at_10 = PL2 % 10
BM25F_at_10 = BM25F % 10
PL2F_at_10 = PL2F % 10
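
The rank cutoff applied above (`% 10`) keeps, for every query, only the highest-scoring documents. A minimal plain-Python sketch of that idea, assuming toy `(qid, docno, score)` tuples that are hypothetical and not taken from the index:

```python
from collections import defaultdict

def rank_cutoff(results, k):
    """Keep, for every query, the k highest-scoring documents."""
    by_query = defaultdict(list)
    for qid, docno, score in results:
        by_query[qid].append((docno, score))
    cut = {}
    for qid, docs in by_query.items():
        # Sort by descending score, then truncate to the first k documents.
        docs.sort(key=lambda d: d[1], reverse=True)
        cut[qid] = [docno for docno, _ in docs[:k]]
    return cut

# Hypothetical toy results for two queries.
toy_results = [
    ("1", "d3", 9.6), ("1", "d1", 9.5), ("1", "d7", 4.5),
    ("2", "d2", 7.1), ("2", "d9", 8.0),
]
top2 = rank_cutoff(toy_results, 2)
print(top2)
```

This only illustrates the effect of the cutoff; PyTerrier applies it to the results dataframe of the transformer.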

> Let's also try some **query expansion** techniques.
Query expansion is the process of reformulating a given query to improve retrieval performance. Common techniques include:
*   Finding synonyms of words, and searching for the synonyms as well,
*   Finding semantically related words (e.g. antonyms, meronyms, hyponyms, hypernyms),
*   Finding all the various morphological forms of words by stemming each word in the search query,
*   Fixing spelling errors and automatically searching for the corrected form or suggesting it in the results,
*   Re-weighting the terms in the original query,
*   etc...
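
As a rough illustration of the idea, the sketch below appends to a query the most frequent new terms found in a few feedback documents. It is a simplified, frequency-based stand-in and not the Bo1 or SDM models used in the next cell; the stopword list and the toy documents are hypothetical.

```python
from collections import Counter

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"the", "of", "a", "in", "and", "is", "to", "for"}

def expand_query(query, feedback_docs, n_terms=2):
    """Append the n_terms most frequent new terms found in feedback_docs."""
    query_terms = query.lower().split()
    counts = Counter(
        term
        for doc in feedback_docs
        for term in doc.lower().split()
        if term not in STOPWORDS and term not in query_terms
    )
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    new_terms = [term for term, _ in ranked[:n_terms]]
    return " ".join(query_terms + new_terms)

# Hypothetical feedback documents (e.g. the top results of a first retrieval pass).
feedback_docs = [
    "spike protein variants of the coronavirus",
    "coronavirus spike mutation and immunity",
    "immunity to spike variants",
]
expanded = expand_query("coronavirus immunity", feedback_docs)
print(expanded)
```

The expanded query would then be submitted to the retriever again, which is the shape of the `TF_IDF >> bo1 >> TF_IDF` pipeline.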

In [None]:
sdm = pt.rewrite.SDM()
bo1 = pt.rewrite.Bo1QueryExpansion(index1)
pipeline_1_at_10 = sdm >> TF_IDF % 10
pipeline_2_at_10 = TF_IDF >> bo1 >> TF_IDF % 10

> Let's extract the ground truth as a set of pairs <query, relevant_results>.


In [None]:
topics = dataset.get_topics("title")
qrels = dataset.get_qrels()

[INFO] [starting] https://ir.nist.gov/covidSubmit/data/topics-rnd5.xml
[INFO] [finished] https://ir.nist.gov/covidSubmit/data/topics-rnd5.xml: [00:00] [18.7kB] [7.33MB/s]
[INFO] [starting] https://ir.nist.gov/covidSubmit/data/qrels-covid_d5_j0.5-5.txt
[INFO] [finished] https://ir.nist.gov/covidSubmit/data/qrels-covid_d5_j0.5-5.txt: [00:00] [1.14MB] [3.64MB/s]


Let's see what `topics` and `qrels` look like:

In [None]:
topics

Unnamed: 0,qid,query
0,1,coronavirus origin
1,2,coronavirus response to weather changes
2,3,coronavirus immunity
3,4,how do people die from the coronavirus
4,5,animal models of covid 19
5,6,coronavirus test rapid testing
6,7,serological tests for coronavirus
7,8,coronavirus under reporting
8,9,coronavirus in canada
9,10,coronavirus social distancing impact


`label` represents the relevance score. It can be binary (0 = not relevant, 1 = relevant) or graded (the higher the label, the higher the relevance).

In [None]:
qrels

Unnamed: 0,qid,docno,label,iteration
0,1,005b2j4b,2,4.5
1,1,00fmeepz,1,4
2,1,010vptx3,2,0.5
3,1,0194oljo,1,2.5
4,1,021q9884,1,4
...,...,...,...,...
69313,50,zvop8bxh,2,5
69314,50,zwf26o63,1,5
69315,50,zwsvlnwe,0,5
69316,50,zxr01yln,1,5
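
To make the role of `label` concrete, here is a minimal sketch of how precision and nDCG at a cutoff can be computed for a single query. It assumes that a label of at least 1 counts as relevant for precision, and a linear gain with a log2 discount for nDCG; the docnos and labels are hypothetical, and the evaluation library used by `pt.Experiment()` may differ in details.

```python
import math

def precision_at_k(ranked_docnos, labels, k):
    """Fraction of the top-k documents whose label is at least 1."""
    top = ranked_docnos[:k]
    return sum(1 for d in top if labels.get(d, 0) >= 1) / k

def dcg(gains):
    """Discounted cumulative gain: gain at 0-based position i is divided by log2(i + 2)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg_at_k(ranked_docnos, labels, k):
    """DCG of the top-k ranking divided by the DCG of the ideal top-k ranking."""
    gains = [labels.get(d, 0) for d in ranked_docnos[:k]]
    ideal = sorted(labels.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical graded labels for one query; unjudged documents count as 0.
labels = {"a": 2, "b": 1, "c": 0, "d": 1}
ranking = ["a", "c", "b", "x"]

p3 = precision_at_k(ranking, labels, 3)
n3 = ndcg_at_k(ranking, labels, 3)
print(p3, n3)
```

Precision only looks at whether each retrieved document is relevant, whereas nDCG also rewards placing highly graded documents near the top.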


> Let's run the experiment.

In [None]:
res_exp_covid = pt.Experiment(
    [CoordinateMatch_at_10, TF_at_10, TF_IDF_at_10, LemurTF_IDF_at_10, BM25_at_10, PL2_at_10, BM25F_at_10, PL2F_at_10, pipeline_1_at_10, pipeline_2_at_10],
    topics,
    qrels,
    eval_metrics=["P_1", "P_3", "ndcg_cut_3", "P_5", "ndcg_cut_5", "P_10", "ndcg_cut_10", "num_q"],
    #round={"P_1":2, "P_3":2, "P_5":2, "P_10":2, "ndcg_cut_3":2, "ndcg_cut_5":2, "ndcg_cut_10":2, "num_q":0},
    names=["CoordinateMatch_at_10", "TF_at_10", "TF_IDF_at_10", "LemurTF_IDF_at_10", "BM25_at_10", "PL2_at_10", "BM25F_at_10", "PL2F_at_10", "pipeline_1_at_10", "pipeline_2_at_10"],
    highlight="bold"
)

#YOUR TURN: Try different evaluation metrics

In [None]:
res_exp_covid

Unnamed: 0,name,P_1,P_3,ndcg_cut_3,P_5,ndcg_cut_5,P_10,ndcg_cut_10,num_q
0,CoordinateMatch_at_10,0.28,0.26,0.213464,0.272,0.223892,0.284,0.231387,50.0
1,TF_at_10,0.08,0.2,0.158994,0.244,0.187297,0.212,0.171953,50.0
2,TF_IDF_at_10,0.72,0.706667,0.633464,0.684,0.624185,0.664,0.596951,50.0
3,LemurTF_IDF_at_10,0.66,0.686667,0.588268,0.66,0.577657,0.622,0.552401,50.0
4,BM25_at_10,0.72,0.7,0.62581,0.672,0.605385,0.672,0.598029,50.0
5,PL2_at_10,0.76,0.673333,0.597542,0.636,0.577651,0.62,0.56093,50.0
6,BM25F_at_10,0.68,0.64,0.574581,0.628,0.563384,0.6,0.539246,50.0
7,PL2F_at_10,0.76,0.7,0.601621,0.7,0.616347,0.662,0.591875,50.0
8,pipeline_1_at_10,0.72,0.72,0.63,0.68,0.616705,0.666,0.592973,50.0
9,pipeline_2_at_10,0.78,0.733333,0.669497,0.716,0.653297,0.648,0.601039,50.0


### Let's test another dataset.

> We use a subset of the **Wikipedia** dataset.

> We follow the same procedure as above to download the dataset and create the index.

In [None]:
wiki_dataset_name = 'irds:wikir/en1k/test'

wiki_dataset = pt.get_dataset(wiki_dataset_name) #download dataset

#YOUR TURN: Try to use a different PyTerrier dataset (!!!Not every dataset has queries' relevance scores!!!)

In [None]:
folder_pos = './indices/wikir_en1k'
!rm -r $folder_pos #remove index if already created

rm: cannot remove './indices/wikir_en1k': No such file or directory


In [None]:
indexer = pt.IterDictIndexer(folder_pos,  meta_reverse=[], blocks=True)
index2 = indexer.index(wiki_dataset.get_corpus_iter(), fields=['text'])

wikir/en1k/test documents:   0%|          | 0/369721 [00:00<?, ?it/s]

[INFO] If you have a local copy of https://zenodo.org/record/3565761/files/wikIR1k.zip, you can symlink it here to avoid downloading it again: /root/.ir_datasets/downloads/554299bca984640cb283d6ba55753608
[INFO] [starting] https://zenodo.org/record/3565761/files/wikIR1k.zip


> Let's see some statistics on this new index

In [None]:
index_factory = pt.IndexFactory.of(index2)
print(index_factory.getCollectionStatistics().toString())

Number of documents: 369721
Number of terms: 674552
Number of postings: 30552936
Number of fields: 1
Number of tokens: 41306796
Field names: [text]
Positions:   true



> We set up the same experiment as for the Covid dataset, this time on the Wikipedia index, to highlight the differences from the previous case.

In [None]:
CoordinateMatch = pt.BatchRetrieve(index2, wmodel="CoordinateMatch")
TF = pt.BatchRetrieve(index2, wmodel="Tf")
TF_IDF = pt.BatchRetrieve(index2, wmodel="TF_IDF")
LemurTF_IDF = pt.BatchRetrieve(index2, wmodel="LemurTF_IDF")
BM25 = pt.BatchRetrieve(index2, wmodel="BM25")
PL2 = pt.BatchRetrieve(index2, wmodel="PL2")

CoordinateMatch_at_10 = CoordinateMatch % 10
TF_at_10 = TF % 10
TF_IDF_at_10 = TF_IDF % 10
LemurTF_IDF_at_10 = LemurTF_IDF % 10
BM25_at_10 = BM25 % 10
PL2_at_10 = PL2 % 10

In [None]:
sdm = pt.rewrite.SDM()
bo1 = pt.rewrite.Bo1QueryExpansion(index2)
pipeline_1_at_10 = sdm >> TF_IDF % 10
pipeline_2_at_10 = TF_IDF >> bo1 >> TF_IDF % 10
pipeline_3_at_10 = sdm >> BM25 % 10
pipeline_4_at_10 = BM25 >> bo1 >> BM25 % 10

In [None]:
topics = wiki_dataset.get_topics()
qrels = wiki_dataset.get_qrels()

> Let's run the experiment

In [None]:
res_exp_wiki_dataset = pt.Experiment(
    [CoordinateMatch_at_10, TF_at_10, TF_IDF_at_10, LemurTF_IDF_at_10, BM25_at_10, PL2_at_10, pipeline_1_at_10, pipeline_2_at_10, pipeline_3_at_10, pipeline_4_at_10],
    topics,
    qrels,
    eval_metrics=["P_1", "P_3", "ndcg_cut_3", "P_5", "ndcg_cut_5", "P_10", "ndcg_cut_10", "num_q"],
    #round={"P_1":2, "P_3":2, "P_5":2, "P_10":2, "ndcg_cut_3":2, "ndcg_cut_5":2, "ndcg_cut_10":2, "num_q":0},
    names=["CoordinateMatch_at_10", "TF_at_10", "TF_IDF_at_10", "LemurTF_IDF_at_10", "BM25_at_10", "PL2_at_10", "pipeline_1_at_10", "pipeline_2_at_10","pipeline_3_at_10", "pipeline_4_at_10"],
    highlight="bold"
)

In [None]:
res_exp_wiki_dataset

Unnamed: 0,name,P_1,P_3,ndcg_cut_3,P_5,ndcg_cut_5,P_10,ndcg_cut_10,num_q
0,CoordinateMatch_at_10,0.14,0.12,0.117148,0.106,0.114467,0.083,0.116916,99.0
1,TF_at_10,0.26,0.206667,0.24057,0.164,0.220615,0.133,0.210467,99.0
2,TF_IDF_at_10,0.55,0.383333,0.450609,0.314,0.410183,0.211,0.358951,99.0
3,LemurTF_IDF_at_10,0.55,0.396667,0.457909,0.32,0.415638,0.217,0.364739,99.0
4,BM25_at_10,0.55,0.393333,0.458176,0.318,0.413779,0.212,0.360993,99.0
5,PL2_at_10,0.53,0.38,0.444639,0.308,0.405116,0.211,0.356806,99.0
6,pipeline_1_at_10,0.55,0.376667,0.444564,0.314,0.408772,0.207,0.354761,99.0
7,pipeline_2_at_10,0.48,0.396667,0.444106,0.332,0.410683,0.229,0.367024,99.0
8,pipeline_3_at_10,0.55,0.39,0.456161,0.316,0.412736,0.211,0.360072,99.0
9,pipeline_4_at_10,0.47,0.406667,0.449809,0.34,0.414761,0.229,0.367396,99.0


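> We can see that no single method prevails across the board: the best system often depends on the metric considered. On this dataset, for example, BM25 achieves the highest ndcg_cut_3, while pipeline_4 (BM25 with Bo1 query expansion) achieves the highest P_3 but a lower P_1 than plain BM25.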
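
One way to check this observation is to list, for each metric, the systems that reach the best score. The sketch below does so in plain Python on values copied from the Wikipedia results table above (only four of the metrics are included); the helper name `best_systems` is hypothetical.

```python
METRICS = ["P_1", "P_3", "ndcg_cut_3", "ndcg_cut_10"]

# (name, P_1, P_3, ndcg_cut_3, ndcg_cut_10), copied from the results table.
ROWS = [
    ("CoordinateMatch_at_10", 0.14, 0.12, 0.117148, 0.116916),
    ("TF_at_10", 0.26, 0.206667, 0.24057, 0.210467),
    ("TF_IDF_at_10", 0.55, 0.383333, 0.450609, 0.358951),
    ("LemurTF_IDF_at_10", 0.55, 0.396667, 0.457909, 0.364739),
    ("BM25_at_10", 0.55, 0.393333, 0.458176, 0.360993),
    ("PL2_at_10", 0.53, 0.38, 0.444639, 0.356806),
    ("pipeline_1_at_10", 0.55, 0.376667, 0.444564, 0.354761),
    ("pipeline_2_at_10", 0.48, 0.396667, 0.444106, 0.367024),
    ("pipeline_3_at_10", 0.55, 0.39, 0.456161, 0.360072),
    ("pipeline_4_at_10", 0.47, 0.406667, 0.449809, 0.367396),
]

# Turn each row into a {metric: score} dictionary keyed by system name.
results = {name: dict(zip(METRICS, scores)) for name, *scores in ROWS}

def best_systems(results, metric):
    """Return every system that reaches the maximum value of metric (ties included)."""
    best = max(scores[metric] for scores in results.values())
    return [name for name, scores in results.items() if scores[metric] == best]

winners = {metric: best_systems(results, metric) for metric in METRICS}
for metric, names in winners.items():
    print(metric, names)
```

Differences this small may not be statistically significant, so they should be read as indicative rather than conclusive.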