# PyTerrier ECIR Tutorial Notebook - Part 1

This is one of a series of Colab notebooks created for the [ECIR 2021](https://www.ecir2021.eu) Tutorial entitled '**IR From Bag-of-words to BERT and Beyond through Practical Experiment**'. It demonstrates the use of [PyTerrier](https://github.com/terrier-org/pyterrier) on the [CORD19 test collection](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge).

In particular, this notebook has the following learning outcomes:
  - PyTerrier installation & configuration
  - indexing a collection
  - accessing an index
  - using the `BatchRetrieve` transformer for searching an index
  - conducting an `Experiment` 

Pre-requisites:
 - We assume that you are comfortable programming in Python, including [lambda functions](https://www.w3schools.com/python/python_lambda.asp).
 - We will **only be supporting notebooks on the Google Colab platform**.
  > *Explanation*: PyTerrier uses [pytrec_eval](https://github.com/cvangysel/pytrec_eval) for evaluation, which does not yet easily install on Windows. It will work fine on Linux and macOS, but only if you have the appropriate compilers installed. Hence, we prefer Google Colab.

Related Reading:
 - [Pandas documentation](https://pandas.pydata.org/docs/)
 - [PyTerrier documentation](https://pyterrier.readthedocs.io/en/latest/)


PyTerrier is a Python framework that relies on the underlying [Terrier information retrieval toolkit](http://terrier.org) for many indexing and retrieval operations. While PyTerrier was new in 2020, Terrier is written in Java and has a long history dating back to 2001. PyTerrier makes it easy to perform IR experiments in Python while using the mature Terrier platform for the expensive indexing and retrieval operations.

In the following, we introduce everything you need to know about PyTerrier, and also provide appropriate links to relevant parts of the [PyTerrier documentation](https://pyterrier.readthedocs.io/en/latest/).


### Installation & Configuration

PyTerrier is installed as follows. This might take a few minutes, so you can read on.

In [None]:
%pip install python-terrier

The next step is to initialise PyTerrier. This is performed using PyTerrier's `init()` method. The `init()` method is needed as PyTerrier must download Terrier's jar file and start the Java virtual machine. We prevent `init()` from being called more than once by checking `started()`.

In [8]:
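# set the JAVA_HOME environment variable so that PyTerrier can locate Java
# (this path is specific to a Homebrew OpenJDK install on macOS; adjust it for your platform)
import os
os.environ['JAVA_HOME'] = '/opt/homebrew/opt/openjdk/libexec/openjdk.jdk/'
# note: a `!export JAVA_HOME=...` line would run in a separate subshell and not affect this kernel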

In [9]:
import pyterrier as pt
if not pt.started():
  pt.init()

PyTerrier 0.10.0 has loaded Terrier 5.8 (built by craigm on 2023-11-01 18:05) and terrier-helper 0.0.8

No etc/terrier.properties, using terrier.default.properties for bootstrap configuration.


### Documents, Indexing and Indexes

Much of PyTerrier's view of the world is wrapped up in Pandas dataframes. Let's consider some textual documents in a dataframe.


In [10]:
# we need to import pandas. We commonly rename it to pd, to make commands shorter
import pandas as pd

# let's not truncate output too much
pd.set_option('display.max_colwidth', 150)

docs_df = pd.DataFrame([
        ["d1", "this is the first document of many documents"],
        ["d2", "this is another document"],
        ["d3", "the topic of this document is unknown"]
    ], columns=["docno", "text"])

docs_df

Unnamed: 0,docno,text
0,d1,this is the first document of many documents
1,d2,this is another document
2,d3,the topic of this document is unknown


Before any search engine can estimate which documents are most likely to be relevant for a given query, it must index the documents. 

In the following cell, we index the dataframe's documents. The index, with all its data structures, is written into a directory called `index_3docs`. 

In [11]:
indexer = pt.DFIndexer("./index_3docs", overwrite=True)
index_ref = indexer.index(docs_df["text"], docs_df["docno"])
index_ref.toString()

'./index_3docs/data.properties'

An `IndexRef` is essentially a string saying where an index is stored. Indeed, we can look in the `index_3docs` directory and see that it has created various small files:

In [12]:
!ls -lh index_3docs/

total 80
-rw-r--r--  1 asriel  staff     3B Mar  9 20:10 data.direct.bf
-rw-r--r--  1 asriel  staff    51B Mar  9 20:10 data.document.fsarrayfile
-rw-r--r--  1 asriel  staff     4B Mar  9 20:10 data.inverted.bf
-rw-r--r--  1 asriel  staff   344B Mar  9 20:10 data.lexicon.fsomapfile
-rw-r--r--  1 asriel  staff   249B Mar  9 20:10 data.lexicon.fsomaphash
-rw-r--r--  1 asriel  staff    33B Mar  9 20:10 data.meta-0.fsomapfile
-rw-r--r--  1 asriel  staff    24B Mar  9 20:10 data.meta.idx
-rw-r--r--  1 asriel  staff    48B Mar  9 20:10 data.meta.zdata
-rw-r--r--  1 asriel  staff   4.1K Mar  9 20:10 data.properties


With an `IndexRef`, we can load the actual index. The method `pt.IndexFactory.of()` is the relevant factory.

In [13]:
index = pt.IndexFactory.of(index_ref)

# let's see what type index is
type(index)

jnius.reflect.org.terrier.structures.Index

Ok, so this object refers to Terrier's [`Index`](http://terrier.org/docs/current/javadoc/org/terrier/structures/Index.html) type. Check the linked Javadoc – you will see that this Java object has methods such as:
 - `getCollectionStatistics()`
 - `getInvertedIndex()`
 - `getLexicon()`

Let's see what is returned by the `getCollectionStatistics()` method:

In [14]:
print(index.getCollectionStatistics().toString())

Number of documents: 3
Number of terms: 4
Number of postings: 6
Number of fields: 0
Number of tokens: 7
Field names: []
Positions:   false



Ok, that seems fair – we have 3 documents. But why only 4 terms? 
Let's check the [`Lexicon`](http://terrier.org/docs/current/javadoc/org/terrier/structures/Lexicon.html), which is our vocabulary. Fortunately, the `Lexicon` can be iterated easily from Python:

In [15]:
for kv in index.getLexicon():
  print("%s (%s) -> %s (%s)" % (kv.getKey(), type(kv.getKey()), kv.getValue().toString(), type(kv.getValue()) ) )

document (<class 'str'>) -> term0 Nt=3 TF=4 maxTF=2 @{0 0 0} (<class 'jnius.reflect.org.terrier.structures.LexiconEntry'>)
first (<class 'str'>) -> term1 Nt=1 TF=1 maxTF=1 @{0 0 7} (<class 'jnius.reflect.org.terrier.structures.LexiconEntry'>)
topic (<class 'str'>) -> term2 Nt=1 TF=1 maxTF=1 @{0 1 1} (<class 'jnius.reflect.org.terrier.structures.LexiconEntry'>)
unknown (<class 'str'>) -> term3 Nt=1 TF=1 maxTF=1 @{0 1 5} (<class 'jnius.reflect.org.terrier.structures.LexiconEntry'>)


Here, iterating over the `Lexicon` returns pairs of a `String` term and a [`LexiconEntry`](http://terrier.org/docs/current/javadoc/org/terrier/structures/LexiconEntry.html) object. The `LexiconEntry` is itself an [`EntryStatistics`](http://terrier.org/docs/current/javadoc/org/terrier/structures/EntryStatistics.html), and contains information including the statistics of that term.


So what did we find? Here are some observations:
 - we only have 4 unique terms, as stopwords were removed;
 - we have one term for `"document"`, even though `"documents"` occurred in document "`d1`". 
 
Both these observations make sense, as indeed Terrier removes standard stopwords and applies Porter's stemmer by default.

Further:
 - `Nt` is the number of unique documents that each term occurs in – this is useful for calculating IDF.
 - `TF` is the total number of occurrences – some weighting models use this instead of Nt.
 - The numbers in the `@{}` are a pointer – they tell Terrier where the postings are for that term in the inverted index data structure.
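
The `Nt` and `TF` figures above can be reproduced with a few lines of plain Python. This is only a rough sketch: the `STOPWORDS` set and the `toy_stem` function are hypothetical stand-ins chosen to fit these three documents, and they are far cruder than Terrier's real stopword list and Porter stemmer.

```python
from collections import Counter

# Hypothetical stand-in for Terrier's stopword list (assumption: just enough
# words to cover the three example documents).
STOPWORDS = {"this", "is", "the", "of", "many", "another"}


def toy_stem(token):
    # Crude plural stripping only; this is NOT the Porter algorithm.
    return token[:-1] if token.endswith("s") and len(token) > 3 else token


docs = {
    "d1": "this is the first document of many documents",
    "d2": "this is another document",
    "d3": "the topic of this document is unknown",
}

doc_freq = Counter()   # Nt: number of documents containing the term
coll_freq = Counter()  # TF: total occurrences across the collection

for text in docs.values():
    # drop stopwords first, then stem what remains
    terms = [toy_stem(t) for t in text.split() if t not in STOPWORDS]
    coll_freq.update(terms)
    doc_freq.update(set(terms))

for term in sorted(coll_freq):
    print(f"{term}: Nt={doc_freq[term]} TF={coll_freq[term]}")
```

Under these assumptions the sketch yields four terms and seven tokens in total, matching the collection statistics reported by Terrier for this tiny index.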

Finally, we can also use the square bracket notation to lookup terms in Terrier's lexicon:


In [16]:
index.getLexicon()["document"].toString()

'term0 Nt=3 TF=4 maxTF=2 @{0 0 0}'

Let's now think about the inverted index. Remember that the inverted index tells us in which *documents* each term occurs. The `LexiconEntry` is the pointer that tells us where to find the postings for that term in the inverted index.

In [17]:
pointer = index.getLexicon()["document"]
for posting in index.getInvertedIndex().getPostings(pointer):
    print(posting.toString() + " doclen=%d" % posting.getDocumentLength())

ID(0) TF(2) doclen=3
ID(1) TF(1) doclen=1
ID(2) TF(1) doclen=3


Ok, so we can see that `"document"` occurs in all three documents: twice in the first document (docid 0) and once in each of the other two.

NB: Terrier numbers documents with integers starting from 0 (called *docids*). It records the mapping back to *docnos* (the string form, e.g. "`d1`", "`d2`") in a separate data structure called the *metaindex*.
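
To make the relationship between postings, docids and docnos concrete, here is a small pure-Python sketch. The `processed`, `metaindex` and `inverted` names are hypothetical; the term lists are simply what remains of each document after the stopword removal and stemming seen in the lexicon output. Terrier's on-disk structures are of course far more compact than these Python lists.

```python
from collections import Counter, defaultdict

# Terms remaining in each document after stopword removal and stemming
# (taken from the lexicon output above).
processed = {
    "d1": ["first", "document", "document"],
    "d2": ["document"],
    "d3": ["topic", "document", "unknown"],
}

# Stand-in for the metaindex: list position is the docid, value is the docno.
metaindex = list(processed)

# Stand-in for the inverted index: term -> list of (docid, tf) postings.
inverted = defaultdict(list)
for docid, docno in enumerate(metaindex):
    for term, tf in Counter(processed[docno]).items():
        inverted[term].append((docid, tf))

for docid, tf in inverted["document"]:
    doclen = len(processed[metaindex[docid]])
    print(f"ID({docid}) TF({tf}) doclen={doclen} docno={metaindex[docid]}")
```

The printed docids, term frequencies and document lengths agree with the postings that Terrier returned for `"document"`.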

### Searching an Index

Our way into search in PyTerrier is called `BatchRetrieve`. BatchRetrieve is configured by specifying an index and a weighting model (`Tf` in our example). We then search for a single-word query, `"document"`.

In [18]:
br = pt.BatchRetrieve(index, wmodel="Tf")
br.search("document")

Unnamed: 0,qid,docid,docno,rank,score,query
0,1,0,d1,0,2.0,document
1,1,1,d2,1,1.0,document
2,1,2,d3,2,1.0,document


So the `search()` method returns a dataframe with columns:
 - `qid`: this is by default "1", since it's our first and only query
 - `docid`: Terrier's internal integer identifier for each document
 - `docno`: the external (string) unique identifier for each document
 - `rank`: the position of each document when sorted by descending score, starting from 0
 - `score`: since we use the `Tf` weighting model, this score corresponds to the total frequency of the query terms in each document
 - `query`: the input query

As expected, the `Tf` weighting model used here only counts the frequencies of the query terms in each document, i.e.:
$$
score(d,q) = \sum_{t \in q} tf_{t,d}
$$

Hence, document `d1` is the highest-scored document, with two occurrences of the query term (`'document'` and `'documents'`, which stem to the same term).
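
The formula is simple enough to check by hand. The following sketch scores the three toy documents in plain Python; `processed`, `tf_score` and `rank_docs` are hypothetical names, and the term lists assume the stopword removal and stemming described earlier have already been applied to both documents and queries.

```python
from collections import Counter

# Terms remaining in each document after stopword removal and stemming.
processed = {
    "d1": ["first", "document", "document"],
    "d2": ["document"],
    "d3": ["topic", "document", "unknown"],
}


def tf_score(query_terms, doc_terms):
    # score(d, q) = sum over query terms of tf_{t,d}
    counts = Counter(doc_terms)
    return sum(counts[t] for t in query_terms)


def rank_docs(query_terms):
    scored = [(docno, tf_score(query_terms, terms))
              for docno, terms in processed.items()]
    # keep only matching documents, highest score first
    matching = [(docno, score) for docno, score in scored if score > 0]
    return sorted(matching, key=lambda pair: -pair[1])


print(rank_docs(["document"]))
print(rank_docs(["first", "document"]))
```

For the single-term query, `d1` scores 2 and the other two documents score 1, which is the same ordering that `BatchRetrieve` produced.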

We can also pass a dataframe of one or more queries to the `transform()` method (rather than the `search()` method) of a transformer, with queries numbered "q1", "q2", etc.

In [19]:
import pandas as pd
queries = pd.DataFrame([["q1", "document"], ["q2", "first document"]], columns=["qid", "query"])
br.transform(queries)

Unnamed: 0,qid,docid,docno,rank,score,query
0,q1,0,d1,0,2.0,document
1,q1,1,d2,1,1.0,document
2,q1,2,d3,2,1.0,document
3,q2,0,d1,0,3.0,first document
4,q2,1,d2,1,1.0,first document
5,q2,2,d3,2,1.0,first document


In fact, `transform()` is the method we call most often, so it is the default: `br.transform(queries)` can be written more succinctly as `br(queries)`.

In [20]:
br(queries)

Unnamed: 0,qid,docid,docno,rank,score,query
0,q1,0,d1,0,2.0,document
1,q1,1,d2,1,1.0,document
2,q1,2,d3,2,1.0,document
3,q2,0,d1,0,3.0,first document
4,q2,1,d2,1,1.0,first document
5,q2,2,d3,2,1.0,first document
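
One common way for an object to support this kind of shorthand in Python is to define `__call__` so that it delegates to another method. The sketch below illustrates the idea with a hypothetical `ToyTransformer` class; it is not PyTerrier's actual implementation, and its "retrieval" is a placeholder.

```python
class ToyTransformer:
    """Hypothetical stand-in; not PyTerrier's actual transformer class."""

    def transform(self, queries):
        # Placeholder retrieval: pair every query with the same docno.
        return [(qid, query, "d1") for qid, query in queries]

    def __call__(self, queries):
        # Calling the object is shorthand for calling transform().
        return self.transform(queries)


toy = ToyTransformer()
queries = [("q1", "document"), ("q2", "first document")]
print(toy(queries) == toy.transform(queries))
```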


### CORD19

OK, having 3 documents is quite trivial, so let's move up to a slightly larger corpus of documents. We'll be using the CORD19 dataset for the remainder of this tutorial. PyTerrier has a handy `get_dataset()` API, which allows us to download the corpus and index it.

In [24]:
import os

cord19 = pt.datasets.get_dataset('irds:cord19/trec-covid')
pt_index_path = './terrier_cord19'

if not os.path.exists(pt_index_path + "/data.properties"):
  # create the index, using the IterDictIndexer indexer 
  indexer = pt.index.IterDictIndexer(pt_index_path)

  # we give the dataset get_corpus_iter() directly to the indexer
  # while specifying the fields to index and the metadata to record
  index_ref = indexer.index(cord19.get_corpus_iter(), 
                            fields=('abstract',), 
                            meta=('docno',))

else:
  # if you already have the index, use it.
  index_ref = pt.IndexRef.of(pt_index_path + "/data.properties")
index = pt.IndexFactory.of(index_ref)


#### Task 1
- What are the statistics of our index?

In [25]:
#YOUR SOLUTION
print(index.getCollectionStatistics().toString())

Number of documents: 192509
Number of terms: 151235
Number of postings: 11554033
Number of fields: 1
Number of tokens: 17728468
Field names: [abstract]
Positions:   false



Next, CORD19 also has a corresponding set of queries and relevance assessments (aka qrels), thus forming a *test collection*.

We can easily access the topics and qrels from the dataset. Indeed these are expressed as dataframes as well (we use Pandas's [`head()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html) method to show only the first 5 topics):

In [26]:
cord19.get_topics(variant='title').head(5)

[INFO] [starting] https://ir.nist.gov/covidSubmit/data/topics-rnd5.xml
[INFO] [finished] https://ir.nist.gov/covidSubmit/data/topics-rnd5.xml: [00:00] [18.7kB] [18.0MB/s]
                                                                                 

Unnamed: 0,qid,query
0,1,coronavirus origin
1,2,coronavirus response to weather changes
2,3,coronavirus immunity
3,4,how do people die from the coronavirus
4,5,animal models of covid 19


In [None]:
cord19.get_qrels().head(5)

### Weighting Models

So far, we have been using the simple "`Tf`" as our ranking function for document retrieval in BatchRetrieve. However, we can use other models such as `"TF_IDF"` by simply changing the `wmodel="Tf"` keyword argument in the constructor of `BatchRetrieve`.


In [28]:
tfidf = pt.BatchRetrieve(index, wmodel="TF_IDF")
tfidf.search("chemical reactions")

Unnamed: 0,qid,docid,docno,rank,score,query
0,1,18717,iavwkdpr,0,11.035982,chemical reactions
1,1,171636,v3blnh02,1,10.329726,chemical reactions
2,1,147193,ei4rb8fr,2,10.317138,chemical reactions
3,1,121217,msdycum2,3,9.653734,chemical reactions
4,1,170863,sj8i9ss2,4,9.500211,chemical reactions
...,...,...,...,...,...,...
995,1,2428,38aabxh1,995,3.790183,chemical reactions
996,1,14752,u709r8ss,996,3.790183,chemical reactions
997,1,20074,wxi1xsbo,997,3.790183,chemical reactions
998,1,117156,ts3obwts,998,3.790183,chemical reactions


You will note that, as expected, the scores of documents ranked by `TF_IDF` are no longer integers. You can see the exact formula used by Terrier in [the GitHub repo](https://github.com/terrier-org/terrier-core/blob/5.x/modules/core/src/main/java/org/terrier/matching/models/TF_IDF.java#L79).
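
To see why the scores stop being integers, consider a generic textbook tf-idf weight, where a raw term frequency is multiplied by a logarithmic inverse document frequency. Note that this is an assumption made for illustration: `textbook_tf_idf` is a hypothetical helper, the `tf` and `df` values are made up, and Terrier's own `TF_IDF` model (see the linked source) uses a different, length-normalised formulation.

```python
import math


def textbook_tf_idf(tf, df, num_docs):
    # Generic tf * log(N / df); NOT Terrier's exact TF_IDF formula.
    return tf * math.log(num_docs / df)


# num_docs is the CORD19 index size reported above; tf and df are made up.
score = textbook_tf_idf(tf=2, df=1500, num_docs=192509)
print(score)

# Rarer terms (lower df) receive a higher weight for the same tf.
print(textbook_tf_idf(1, 10, 1000) > textbook_tf_idf(1, 100, 1000))
```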

Terrier supports many weighting models. The documentation contains [a list of supported models](http://terrier.org/docs/current/javadoc/org/terrier/matching/models/package-summary.html), some of which we will discover later in the tutorial.


### What is Success?

So far, we have been creating search engine models, but we haven't decided if any of them is actually any good. Let's investigate if we are getting a correct ("relevant") document at the first rank.

In [29]:
qrels = cord19.get_qrels()
def get_res_with_labels(ranker, df):
  # get the results for the query or queries
  results = ranker( df )
  # left outer join with the qrels
  with_labels = results.merge(qrels, on=["qid", "docno"], how="left").fillna(0)
  return with_labels

# let's get the TF_IDF results for the first query
get_res_with_labels(tfidf, cord19.get_topics(variant='title').head(1))

Unnamed: 0,qid,docid,docno,rank,score,query,label,iteration
0,1,175892,zy8qjaai,0,7.080599,coronavirus origin,1.0,1
1,1,82224,8ccl9aui,1,6.775667,coronavirus origin,2.0,1
2,1,135326,ne5r4d4b,2,6.683114,coronavirus origin,0.0,1.5
3,1,122804,75773gwg,3,6.590340,coronavirus origin,2.0,5
4,1,122805,kn2z7lho,4,6.590340,coronavirus origin,2.0,3
...,...,...,...,...,...,...,...,...
995,1,180809,0y0hau9l,995,4.214228,coronavirus origin,0.0,0
996,1,148967,f8vbflx6,996,4.212887,coronavirus origin,0.0,0
997,1,183189,uadfehr6,997,4.210201,coronavirus origin,2.0,1.5
998,1,67321,n5hnx2c3,998,4.202319,coronavirus origin,0.0,0
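
The effect of the left outer join followed by `fillna(0)` is easier to see on a tiny example. The sketch below uses made-up results and qrels (the docnos `a`, `b`, `c` and their labels are purely illustrative): every retrieved document is kept, and documents without a judgment end up with label 0.

```python
import pandas as pd

# Made-up results and qrels, purely for illustration.
results = pd.DataFrame({
    "qid": ["1", "1", "1"],
    "docno": ["a", "b", "c"],
    "rank": [0, 1, 2],
})
qrels = pd.DataFrame({
    "qid": ["1", "1"],
    "docno": ["a", "c"],
    "label": [2, 1],
})

# A left join keeps every retrieved document; unjudged ones get NaN, then 0.
with_labels = results.merge(qrels, on=["qid", "docno"], how="left").fillna(0)
print(with_labels)

# Is the top-ranked document relevant (label > 0)?
top = with_labels[with_labels["rank"] == 0].iloc[0]
print("relevant at rank 0:", bool(top["label"] > 0))
```

Note that the join only matches rows whose `qid` values have the same type in both dataframes, which is why both are strings here.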


In [30]:
pt.Experiment(
    [tfidf],
    cord19.get_topics(variant='title'),
    cord19.get_qrels(),
    eval_metrics=["map", "ndcg"])

Unnamed: 0,name,map,ndcg
0,BR(TF_IDF),0.180002,0.370767
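
The `map` column is mean average precision. As a rough sketch of what that measure computes, the following plain-Python code calculates average precision per query and then averages over queries. It assumes binary relevance, and the rankings and judgments are made up; the names `average_precision`, `runs` and `judged_relevant` are hypothetical, and this is not the code that `pt.Experiment` runs internally.

```python
def average_precision(ranked_docnos, relevant):
    # Average of precision@k taken at each rank k where a relevant document
    # appears, divided by the total number of relevant documents.
    hits = 0
    precision_sum = 0.0
    for rank, docno in enumerate(ranked_docnos, start=1):
        if docno in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0


# Made-up rankings and relevance judgments, purely for illustration.
runs = {"q1": ["a", "b", "c", "d"], "q2": ["x", "y", "z"]}
judged_relevant = {"q1": {"a", "c"}, "q2": {"y"}}

ap = {qid: average_precision(runs[qid], judged_relevant[qid]) for qid in runs}
mean_ap = sum(ap.values()) / len(ap)
print(ap, mean_ap)
```

In this example, `q1` finds its two relevant documents at ranks 1 and 3, while `q2` finds its single relevant document at rank 2, so the first query contributes the higher average precision.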


## That's all folks

The following parts of the PyTerrier documentation may be useful references for this notebook:
 * [PyTerrier datasets](https://pyterrier.readthedocs.io/en/latest/datasets.html)
 * [Using Terrier for indexing](https://pyterrier.readthedocs.io/en/latest/terrier-indexing.html)
 * [Transformers in PyTerrier](https://pyterrier.readthedocs.io/en/latest/transformer.html)
 * [Transformer Operators](https://pyterrier.readthedocs.io/en/latest/operators.html)