
# Information Retrieval and Text Mining Course - Neural Reranking Tutorial
Authors: Abderrahmane Issam and Jan Scholtes

Version 2024-2025

# Notebook 4

In this notebook we will learn how to use [Pyterrier](https://github.com/terrier-org/pyterrier), a Python framework that is built on top of the Java-based Terrier IR platform. Pyterrier can be used to index different formats of datasets and can be integrated with different models starting from classical approaches like BM25 up to neural models like ColBERT. We will learn how to use Pyterrier for indexing, search and evaluation.

## Setup

### If you have a local GPU (Do the following steps in your local env):
If you have a GPU and you want to run this notebook locally, we suggest you set up a conda environment as follows:



```
conda create --name ir python=3.11.1
conda activate ir
conda install -c conda-forge openjdk=11
pip install notebook
```
The second step is to start Jupyter on your local machine as follows:
```
jupyter notebook \
    --NotebookApp.allow_origin='https://colab.research.google.com' \
    --port=8888 \
    --NotebookApp.port_retries=0
```
Then go to `connect to local runtime` (which you will find in the menu where you can change the runtime) and paste the Jupyter backend URL that you got in the output of the previous command.

If you are running this locally, you only need to install the packages once. Otherwise, you will need to install them at the start of each session and restart the runtime when prompted.

In [1]:
!pip install python-terrier



In [2]:
!pip install --upgrade "git+https://github.com/terrierteam/pyterrier_colbert.git" --use-pep517

Collecting git+https://github.com/terrierteam/pyterrier_colbert.git
  Cloning https://github.com/terrierteam/pyterrier_colbert.git to /tmp/pip-req-build-cv_jw71i
  Running command git clone --filter=blob:none --quiet https://github.com/terrierteam/pyterrier_colbert.git /tmp/pip-req-build-cv_jw71i
  Resolved https://github.com/terrierteam/pyterrier_colbert.git to commit ba5c86c0bc8da450dee361140541f35b5349a492
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting ColBERT@ git+https://github.com/cmacdonald/ColBERT.git@v0.2#egg=ColBERT (from pyterrier-colbert==0.0.1)
  Cloning https://github.com/cmacdonald/ColBERT.git (to revision v0.2) to /tmp/pip-install-m897m0wf/colbert_6164123c10f4485bb154d356c22b189a
  Running command git clone --filter=blob:none --quiet https://github.com/cmacdonald/ColBERT.git /tmp/pip-install-m897m0wf/colbert_6164123c10f4485bb154d356c22b1

## Indexing and Search Using Pyterrier

In [3]:
!pip install faiss-gpu-cu12

import faiss
assert faiss.get_num_gpus() > 0



In [None]:
# !pip install numpy --upgrade --force-reinstall

In [4]:
import pyterrier as pt
if not pt.java.started():
    pt.init()

terrier-assemblies 5.11 jar-with-dependencies not found, downloading to /root/.pyterrier...
Done
terrier-python-helper 0.0.8 jar not found, downloading to /root/.pyterrier...
Done


Java started and loaded: pyterrier.java, pyterrier.terrier.java [version=5.11 (build: craig.macdonald 2025-01-13 21:29), helper_version=0.0.8]
java is now started automatically with default settings. To force initialisation early, run:
pt.java.init() # optional, forces java initialisation
  pt.init()


Pyterrier can create an index from different dataset formats, including a Pandas DataFrame, which we demonstrate below. We will create a synthetic DataFrame of documents, as well as queries and qrels (relevance judgments) to use for evaluation. After indexing the documents, we create a BM25 model that we will use for search later.

In [5]:
import pandas as pd

# 1. Define Documents
documents = pd.DataFrame([
    {"docno": "d1", "text": "PyTerrier is great for information retrieval."},
    {"docno": "d2", "text": "Terrier is a powerful information retrieval platform."},
    {"docno": "d3", "text": "Python is a popular programming language."},
    {"docno": "d4", "text": "This tutorial introduces PyTerrier basics."}
])

# 2. Define Queries
queries = pd.DataFrame([
    {"query": "information retrieval", "qid": "q1"},
    {"query": "programming tutorial", "qid": "q2"}
])

# 3. Define Relevance Judgments (qrels)
qrels = pd.DataFrame([
    {"qid": "q1", "docno": "d1", "label": 1},
    {"qid": "q1", "docno": "d2", "label": 0},
    {"qid": "q2", "docno": "d3", "label": 1},
    {"qid": "q2", "docno": "d4", "label": 1}
])

# 4. Indexing
index_path = "./index"
!rm -r "./index"    # Remove index if it exists

indexer = pt.index.DFIndexer(index_path)
index_ref = indexer.index(text=documents["text"], docno=documents["docno"])

# 5. Retrieval (BM25)
bm25 = pt.BatchRetrieve(index_ref, wmodel="BM25")

rm: cannot remove './index': No such file or directory




We can see the files were created:

In [6]:
!ls -ltrh ./index

total 44K
-rw-r--r-- 1 root root    7 Apr 15 21:44 data.direct.bf
-rw-r--r-- 1 root root   64 Apr 15 21:44 data.meta.zdata
-rw-r--r-- 1 root root   32 Apr 15 21:44 data.meta.idx
-rw-r--r-- 1 root root   68 Apr 15 21:44 data.document.fsarrayfile
-rw-r--r-- 1 root root   44 Apr 15 21:44 data.meta-0.fsomapfile
-rw-r--r-- 1 root root    9 Apr 15 21:44 data.inverted.bf
-rw-r--r-- 1 root root 1.1K Apr 15 21:44 data.lexicon.fsomapfile
-rw-r--r-- 1 root root   52 Apr 15 21:44 data.lexicon.fsomapid
-rw-r--r-- 1 root root  321 Apr 15 21:44 data.lexicon.fsomaphash
-rw-r--r-- 1 root root 4.1K Apr 15 21:44 data.properties


We can see statistics of our index as follows:

In [7]:
index = pt.IndexFactory.of(index_ref)
print(index.getCollectionStatistics().toString())

Number of documents: 4
Number of terms: 13
Number of postings: 15
Number of fields: 0
Number of tokens: 15
Field names: []
Positions:   false



In [8]:
terms = index.getLexicon()

for entry in terms:
    print(entry.getKey())

basic
great
introduc
languag
platform
popular
power
program
pyterri
python
retriev
terrier
tutori


### Exercise 1:
The number of terms (13) is less than the number of words in all the 4 documents. Explain the reason for this.

Answer here

These are the terms identified by Pyterrier.

In [9]:
for kv in index.getLexicon():
  print("%s -> %s" % (kv.getKey(), kv.getValue().toString() ) )

basic -> term11 Nt=1 TF=1 maxTF=1 @{0 0 0}
great -> term2 Nt=1 TF=1 maxTF=1 @{0 0 6}
introduc -> term12 Nt=1 TF=1 maxTF=1 @{0 1 0}
languag -> term6 Nt=1 TF=1 maxTF=1 @{0 1 6}
platform -> term4 Nt=1 TF=1 maxTF=1 @{0 2 2}
popular -> term8 Nt=1 TF=1 maxTF=1 @{0 2 6}
power -> term3 Nt=1 TF=1 maxTF=1 @{0 3 2}
program -> term9 Nt=1 TF=1 maxTF=1 @{0 3 6}
pyterri -> term1 Nt=2 TF=2 maxTF=1 @{0 4 2}
python -> term7 Nt=1 TF=1 maxTF=1 @{0 5 0}
retriev -> term0 Nt=2 TF=2 maxTF=1 @{0 5 4}
terrier -> term5 Nt=1 TF=1 maxTF=1 @{0 6 0}
tutori -> term10 Nt=1 TF=1 maxTF=1 @{0 6 4}


We can search a Pyterrier index using BM25 with the following function:

In [10]:
bm25.search("programming language")

Unnamed: 0,qid,docid,docno,rank,score,query
0,1,2,d3,0,2.379879,programming language
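
The score above can be checked by hand. The following is a minimal sketch, assuming a common BM25 form with `k1=1.2`, `b=0.75`, `k3=8` and a base-2 logarithm for the IDF part. These parameter values and the function name `bm25_term_score` are assumptions made for illustration, not something read from Pyterrier's source. The collection statistics come from the index statistics and lexicon printed earlier, and the document length of 4 indexed tokens for `d3` is also an assumption.

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq,
                    k1=1.2, b=0.75, k3=8.0, qtf=1.0):
    """Score one query term for one document (assumed BM25 variant)."""
    # Length normalisation: longer-than-average documents get a larger K.
    K = k1 * ((1 - b) + b * doc_len / avg_doc_len)
    tf_part = (k1 + 1) * tf / (K + tf)           # saturating term frequency
    qtf_part = (k3 + 1) * qtf / (k3 + qtf)       # query term frequency (1 here)
    idf = math.log2((n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    return tf_part * qtf_part * idf

# From the index statistics: 4 documents and 15 tokens in total.
n_docs, n_tokens = 4, 15
avg_doc_len = n_tokens / n_docs

# From the lexicon: "program" and "languag" each occur once, in one document.
# Assumption: d3 contains 4 indexed tokens.
score = sum(
    bm25_term_score(tf=1, doc_len=4, avg_doc_len=avg_doc_len,
                    n_docs=n_docs, doc_freq=1)
    for _term in ("program", "languag")
)
print(round(score, 6))
```

Under these assumptions the result agrees with the score reported for `d3` to six decimal places, which suggests that the scoring follows this form.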



### Exercise 2

Define each column in the output of search above.

Answer here.

Pyterrier offers a way to evaluate models using different metrics. The list of available metrics is here: https://pyterrier.readthedocs.io/en/latest/experiments.html#evaluation-measures-objects.

`Experiment` takes the list of systems to compare, a DataFrame of queries and a DataFrame of qrels, and computes the requested metrics for each system.

In [11]:
# 6. Evaluation Pipeline
pt.Experiment([bm25], queries, qrels, eval_metrics=["map", "P_10"], names=["BM25"])

Unnamed: 0,name,map,P_10
0,BM25,0.75,0.15
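
To make the two metrics concrete, here is a small worked sketch in plain Python. The relevant sets are taken from the qrels defined earlier (documents with label 1). The rankings in `assumed_rankings` are hypothetical: they are one possible ordering that is consistent with the reported values, not the actual output of the BM25 run. The helper names are also made up for this illustration.

```python
def average_precision(ranking, relevant):
    """Mean of precision values at the ranks where a relevant document appears."""
    hits, total = 0, 0.0
    for rank, docno in enumerate(ranking, start=1):
        if docno in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def precision_at_k(ranking, relevant, k=10):
    """Fraction of the top-k positions filled by relevant documents."""
    return sum(1 for docno in ranking[:k] if docno in relevant) / k

# Relevant documents per query, from the qrels (label == 1).
relevant = {"q1": {"d1"}, "q2": {"d3", "d4"}}

# Hypothetical rankings, chosen for illustration only.
assumed_rankings = {"q1": ["d2", "d1"], "q2": ["d3", "d4"]}

n_queries = len(relevant)
mean_ap = sum(average_precision(assumed_rankings[q], relevant[q]) for q in relevant) / n_queries
mean_p10 = sum(precision_at_k(assumed_rankings[q], relevant[q]) for q in relevant) / n_queries
print(mean_ap, mean_p10)
```

Note that `P_10` always divides by 10, even when fewer than 10 documents are retrieved, which is why it is so low on a collection of only four documents.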


### Exercise 3

Instead of BM25 use a TF_IDF model. Try using it to search for a query then pass it along with bm25 to `Experiment`.

In [None]:
# Answer here

## Neural Reranking with ColBERT

Below we use `pyterrier_colbert`, a plugin for Pyterrier that makes it possible to use a ColBERT model for indexing and retrieval. We will use the [Vaswani NPL corpus](http://ir.dcs.gla.ac.uk/resources/test_collections/npl/), a corpus of 11,429 scientific abstracts with corresponding queries and relevance assessments.

In [12]:
!rm -rf ./colbertindex

import pyterrier_colbert.indexing

checkpoint="http://www.dcs.gla.ac.uk/~craigm/colbert.dnn.zip"

indexer = pyterrier_colbert.indexing.ColBERTIndexer(checkpoint, "./", "colbertindex", chunksize=3)
indexer.index(pt.get_dataset("irds:vaswani").get_corpus_iter())

vaswani documents:   0%|          | 0/11429 [00:00<?, ?it/s]

[Apr 15, 21:47:22] [0] 		 #> Local args.bsize = 128
[Apr 15, 21:47:22] [0] 		 #> args.index_root = ./
[Apr 15, 21:47:22] [0] 		 #> self.possible_subset_sizes = [69905]


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of ColBERT were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['linear.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


[Apr 15, 21:47:26] #> Loading model checkpoint.
[Apr 15, 21:47:26] #> Loading checkpoint http://www.dcs.gla.ac.uk/~craigm/colbert.dnn.zip


Downloading: "http://www.dcs.gla.ac.uk/~craigm/colbert.dnn.zip" to /root/.cache/torch/hub/checkpoints/colbert.dnn.zip


[Apr 15, 21:48:46] #> checkpoint['epoch'] = 0
[Apr 15, 21:48:46] #> checkpoint['batch'] = 44500




tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]



[Apr 15, 21:48:49] #> Note: Output directory ./ already exists




[Apr 15, 21:48:49] #> Creating directory ./colbertindex 




[INFO] [starting] http://ir.dcs.gla.ac.uk/resources/test_collections/npl/npl.tar.gz

[INFO] [finished] http://ir.d

[Apr 15, 21:50:40] [0] 		 #> Completed batch #0 (starting at passage #0) 		Passages/min: 6.2k (overall),  6.3k (this encoding),  18089.3M (this saving)
[Apr 15, 21:50:40] [0] 		 [NOTE] Done with local share.
[Apr 15, 21:50:40] [0] 		 #> Joining saver thread.
[Apr 15, 21:50:40] [0] 		 #> Saved batch #0 to ./colbertindex/0.pt 		 Saving Throughput = 2.4M passages per minute.

#> num_embeddings = 581496
[Apr 15, 21:50:40] #> Starting..
[Apr 15, 21:50:40] #> Processing slice #1 of 1 (range 0..1).
[Apr 15, 21:50:40] #> Will write to ./colbertindex/ivfpq.100.faiss.
[Apr 15, 21:50:40] #> Loading ./colbertindex/0.sample ...
#> Sample has shape (29074, 128)
[Apr 15, 21:50:40] Preparing resources for 1 GPUs.
[Apr 15, 21:50:40] #> Training with the vectors...
[Apr 15, 21:50:40] #> Training now (using 1 GPUs)...
0.17531108856201172
11.137163639068604
0.008240699768066406
[Apr 15, 21:50:52] Done training!

[Apr 15, 21:50:52] #> Indexing the vectors...
[Apr 15, 21:50:52] #> Loading ('./colbertindex/0

In [13]:
!ls -ltrh ./colbertindex

total 168M
-rw-r--r-- 1 root root 142M Apr 15 21:50 0.pt
-rw-r--r-- 1 root root 4.5M Apr 15 21:50 0.tokenids
-rw-r--r-- 1 root root 7.1M Apr 15 21:50 0.sample
-rw-r--r-- 1 root root  35K Apr 15 21:50 doclens.0.json
-rw-r--r-- 1 root root  24K Apr 15 21:50 docnos.pkl.gz
-rw-r--r-- 1 root root  14M Apr 15 21:50 ivfpq.100.faiss


In [14]:
pyterrier_colbert_factory = indexer.ranking_factory()

colbert_e2e = pyterrier_colbert_factory.end_to_end()

[Apr 15, 21:51:02] #> Loading the FAISS index from ./colbertindex/ivfpq.100.faiss ..
[Apr 15, 21:51:02] #> Building the emb2pid mapping..
[Apr 15, 21:51:02] len(self.emb2pid) = 581496




Loading reranking index, memtype=mem


Loading index shards to memory:   0%|          | 0/1 [00:00<?, ?shard/s]

We search using ColBERT as follows. `% 5` is the rank cutoff operator: it keeps only the 5 highest-ranked results for each query.

In [15]:
out = (colbert_e2e % 5).search("chemical reactions")
out



Unnamed: 0,qid,query,docid,query_toks,query_embs,score,docno,rank
0,1,chemical reactions,4911,"[tensor(101), tensor(1), tensor(5072), tensor(...","[[tensor(0.0680), tensor(-0.0085), tensor(0.11...",19.821638,4912,0
3,1,chemical reactions,7048,"[tensor(101), tensor(1), tensor(5072), tensor(...","[[tensor(0.0680), tensor(-0.0085), tensor(0.11...",19.053555,7049,1
2,1,chemical reactions,6479,"[tensor(101), tensor(1), tensor(5072), tensor(...","[[tensor(0.0680), tensor(-0.0085), tensor(0.11...",18.036415,6480,2
4,1,chemical reactions,9373,"[tensor(101), tensor(1), tensor(5072), tensor(...","[[tensor(0.0680), tensor(-0.0085), tensor(0.11...",17.136055,9374,3
1,1,chemical reactions,6278,"[tensor(101), tensor(1), tensor(5072), tensor(...","[[tensor(0.0680), tensor(-0.0085), tensor(0.11...",16.793301,6279,4


In [16]:
out.loc[0, 'query_toks']

tensor([ 101,    1, 5072, 9597,  102,  103,  103,  103,  103,  103,  103,  103,
         103,  103,  103,  103,  103,  103,  103,  103,  103,  103,  103,  103,
         103,  103,  103,  103,  103,  103,  103,  103])

In [17]:
out.loc[0, 'query_embs'].shape

torch.Size([32, 128])

### Exercise 4

There are two new columns in the search results: `query_toks` and `query_embs`. Explain what they are and explain the shape of `query_embs`.

Answer here.

In [18]:
dataset = pt.datasets.get_dataset("vaswani")
index_path = "./index"

!rm -rf ./index
indexer = pt.TRECCollectionIndexer(index_path)

indexer = indexer.index(dataset.get_corpus())

bm25 = pt.BatchRetrieve(indexer, wmodel="BM25")

Downloading vaswani corpus to /root/.pyterrier/corpora/vaswani/corpus


doc-text.trec:   0%|          | 0.00/0.99M [00:00<?, ?iB/s]



In the following code we create a cross-encoder reranker using the `sentence_transformers` library.

In [19]:
import pandas as pd
from sentence_transformers import CrossEncoder

crossmodel = CrossEncoder('cross-encoder/ms-marco-TinyBERT-L-6', max_length=512)

def _crossencoder_apply(df : pd.DataFrame):
  return crossmodel.predict(list(zip(df['query'].values, df['text'].values)))

cross_encT = pt.apply.doc_score(_crossencoder_apply, batch_size=128)

config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/1.33k [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/711k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/132 [00:00<?, ?B/s]

We create a reranking pipeline that starts with BM25 as the retriever and ends with the cross-encoder (`cross_encT`), which reranks the retrieved documents. The reranking step needs the document text as input, which BM25 does not return by default, so we add `pt.text.get_text(dataset, 'text')` to look up each document's text and attach it to the BM25 output.

In [20]:
dataset = pt.get_dataset('irds:vaswani')
cross_enc_rerank = bm25 >> pt.text.get_text(dataset, 'text') >> cross_encT
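
The idea behind composing stages with `>>` and cutting results with `%` can be illustrated with a toy sketch in plain Python. This is not Pyterrier's implementation: the `Stage` class, the fixed first-stage scores, the three example texts and the word-overlap "reranker" are all hypothetical stand-ins, and the cutoff here handles a single query only.

```python
class Stage:
    """Toy pipeline stage wrapping a function from rows (list of dicts) to rows."""

    def __init__(self, fn):
        self.fn = fn

    def __call__(self, rows):
        return self.fn(rows)

    def __rshift__(self, other):
        # a >> b: run a, then feed its output to b.
        return Stage(lambda rows: other(self(rows)))

    def __mod__(self, k):
        # a % k: keep the k highest-scoring rows (single-query simplification).
        return Stage(lambda rows: sorted(self(rows), key=lambda r: r["score"], reverse=True)[:k])


# Hypothetical document store and first-stage scores.
TEXTS = {
    "d1": "ion neutral reactions",
    "d2": "chemical reactions in the upper atmosphere",
    "d3": "antenna design",
}
FIRST_STAGE_SCORES = {"d1": 3.0, "d2": 2.0, "d3": 1.0}

def retrieve_fn(rows):
    # Expand each query row into one row per retrieved document.
    return [{"query": r["query"], "docno": d, "score": s}
            for r in rows for d, s in FIRST_STAGE_SCORES.items()]

def add_text_fn(rows):
    # Attach the document text, which the retriever does not return.
    return [{**r, "text": TEXTS[r["docno"]]} for r in rows]

def rerank_fn(rows):
    # Stand-in reranker: score = number of words shared by query and text.
    return [{**r, "score": float(len(set(r["query"].split()) & set(r["text"].split())))}
            for r in rows]

pipeline = (Stage(retrieve_fn) >> Stage(add_text_fn) >> Stage(rerank_fn)) % 2
results = pipeline([{"query": "chemical reactions"}])
print([(r["docno"], r["score"]) for r in results])
```

The reranker can only run after the text has been attached, which mirrors why the text lookup stage sits between retrieval and reranking in the pipeline above.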

In [21]:
out = (bm25 % 5).search("chemical reactions")
out

Unnamed: 0,qid,docid,docno,rank,score,query
0,1,9373,9374,0,22.076426,chemical reactions
1,1,8765,8766,1,20.498801,chemical reactions
2,1,7048,7049,2,20.159044,chemical reactions
3,1,4686,4687,3,19.323491,chemical reactions
4,1,10702,10703,4,13.472012,chemical reactions


In [22]:
out = (cross_enc_rerank % 5).search("chemical reactions")
out

[INFO] [starting] building docstore
docs_iter: 100%|██████████████████████| 11429/11429 [00:00<00:00, 38612.82doc/s]
[INFO] [finished] docs_iter: [00:00] [11429doc] [38122.15doc/s]
[INFO] [finished] building docstore [307ms]


Unnamed: 0,qid,docid,docno,score,query,text,rank
1,1,7048,7049,0.004507,chemical reactions,some reactions occurring in the earths upper a...,0
0,1,9373,9374,0.000726,chemical reactions,ion neutral reactions a list is given of reac...,1
3,1,6479,6480,0.000398,chemical reactions,reaction concept in electromagnetic theory a ...,2
2,1,4686,4687,0.000275,chemical reactions,nitrogen oxides and the airglow possible chem...,3
4,1,8706,8707,0.000251,chemical reactions,ion charge exchange reactions in oxygen afterg...,4


The following demonstrates how we can construct IR pipelines using Pyterrier. The pipeline first uses BM25 to retrieve documents for the query. The top-ranked documents are assumed to be relevant and are passed to the `QueryExpansion` transformer, which expands the query with informative terms collected from them. Finally, BM25 is run again with the expanded query. This process is called Pseudo Relevance Feedback (PRF).

In [23]:
query_expansion = bm25 >> pt.rewrite.QueryExpansion(indexer) >> bm25
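
The three PRF steps can be sketched in plain Python as follows. This is only an illustration under simplifying assumptions: the four documents are invented, the retriever ranks by word overlap instead of BM25, and expansion terms are picked by raw frequency rather than by a proper term-weighting model such as the one Terrier uses.

```python
from collections import Counter

# Hypothetical mini-collection.
DOCS = {
    "d1": "chemical reactions in the upper atmosphere",
    "d2": "ion reactions in oxygen afterglows",
    "d3": "transistor amplifier design",
    "d4": "atmosphere composition measurements",
}
STOPWORDS = frozenset({"in", "the"})

def retrieve(query, k):
    """Stand-in retriever: rank documents by the number of shared words."""
    q = set(query.split())
    scored = [(len(q & set(text.split())), docno) for docno, text in DOCS.items()]
    scored = [(s, d) for s, d in scored if s > 0]
    return [d for s, d in sorted(scored, key=lambda x: (-x[0], x[1]))][:k]

def expand(query, feedback_docs, n_terms=2):
    """Append the most frequent new terms found in the feedback documents."""
    query_words = set(query.split())
    counts = Counter(
        w for d in feedback_docs for w in DOCS[d].split()
        if w not in STOPWORDS and w not in query_words
    )
    # Sort by descending count, then alphabetically to break ties.
    new_terms = [w for w, _ in sorted(counts.items(), key=lambda x: (-x[1], x[0]))[:n_terms]]
    return query + " " + " ".join(new_terms)

query = "chemical reactions"
first_pass = retrieve(query, k=2)        # step 1: initial retrieval
expanded = expand(query, first_pass)     # step 2: expand using top documents
second_pass = retrieve(expanded, k=3)    # step 3: retrieve with the new query
print(first_pass, expanded, second_pass)
```

In this toy run the second pass returns `d4`, which shares no word with the original query, showing how expansion can improve recall. It can also add unhelpful terms, which is why PRF does not always help.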

In [24]:
pt.Experiment(
    [bm25, query_expansion, colbert_e2e, cross_enc_rerank],
    dataset.get_topics(),
    dataset.get_qrels(),
    eval_metrics=["map", "P_10", "mrt"],
    names = ["BM25", "QE", "ColBERT", "BM25 >> CrossEnc"]
)

21:52:22.832 [main] WARN org.terrier.querying.QueryExpansion -- qemodel control not set for QueryExpansion post process. Using default model Bo1




Unnamed: 0,name,map,P_10,mrt
0,BM25,0.296517,0.352688,30.692147
1,QE,0.304647,0.369892,65.717505
2,ColBERT,0.278678,0.351613,611.180842
3,BM25 >> CrossEnc,0.275168,0.345161,4492.456365


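On this small dataset, BM25 (MAP 0.297) achieves better results than end-to-end ColBERT (MAP 0.279) and than reranking the BM25 results with the cross-encoder (MAP 0.275), while also having by far the lowest mean response time (`mrt`). Adding query expansion improves MAP slightly (0.305) at roughly twice the response time of plain BM25. ColBERT-PRF (see Exercise 7) is not shown in the table above; on this dataset, using PRF with ColBERT hurts performance while slowing down inference because of the extra PRF step.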

### Exercise 5

Apply the same process as above on a dataset of your choice. You can find the list of datasets in Pyterrier here: https://pyterrier.readthedocs.io/en/latest/datasets.html#available-datasets

### Exercise 6

Add at least 2 other metrics and explain what each of them is trying to capture.

In [None]:
# Answer both exercise 5 and 6 here

### Exercise 7

Follow the example in this [README](https://github.com/terrierteam/pyterrier_colbert/tree/ba5c86c0bc8da450dee361140541f35b5349a492) to implement Pseudo Relevance Feedback (PRF) with ColBERT. Describe how it works, and add ColBERT-PRF to `Experiment` to evaluate it against the other pipelines.

In [None]:
# Answer here

### Exercise 8

Implement a reranking pipeline with ColBERT and add it to `Experiment`.

In [None]:
# Answer here

## Arxiv Abstracts Retrieval

In [25]:
!pip install setuptools==68.0.0       # you can skip this if you are using a local instance

Collecting setuptools==68.0.0
  Downloading setuptools-68.0.0-py3-none-any.whl.metadata (6.4 kB)
Downloading setuptools-68.0.0-py3-none-any.whl (804 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 804.0/804.0 kB 37.3 MB/s eta 0:00:00
Installing collected packages: setuptools
  Attempting uninstall: setuptools
    Found existing installation: setuptools 75.2.0
    Uninstalling setuptools-75.2.0:
      Successfully uninstalled setuptools-75.2.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
ipython 7.34.0 requires jedi>=0.16, which is not installed.
ipython-sql 0.5.0 requires sqlalchemy>=2.0, but you have sqlalchemy 1.4.54 which is incompatible.
pandas-gbq 0.28.0 requires packaging>=22.0.0, but you ha

In [26]:
!pip install arxiv==2.1.3

Collecting arxiv==2.1.3
  Downloading arxiv-2.1.3-py3-none-any.whl.metadata (6.1 kB)
Collecting feedparser~=6.0.10 (from arxiv==2.1.3)
  Downloading feedparser-6.0.11-py3-none-any.whl.metadata (2.4 kB)
Collecting sgmllib3k (from feedparser~=6.0.10->arxiv==2.1.3)
  Downloading sgmllib3k-1.0.0.tar.gz (5.8 kB)
  Preparing metadata (setup.py) ... done
Downloading arxiv-2.1.3-py3-none-any.whl (11 kB)
Downloading feedparser-6.0.11-py3-none-any.whl (81 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 81.3/81.3 kB 8.5 MB/s eta 0:00:00
Building wheels for collected packages: sgmllib3k
  Building wheel for sgmllib3k (setup.py) ... done
  Created wheel for sgmllib3k: filename=sgmllib3k-1.0.0-py3-none-any.whl size=6047 sha256=ef2f15170c7255d83a47c169c6f72ab5c9f54984fa44eb110d6b35e58ddcdd2a
  Stored in directory: /root/.cache/pip/wheels/3b/25/2a/105d6a15df6914f4d15047691c6c28f9052cc1173e40285d03
Successfully built sgmllib3k
Instal

The following code defines a search for up to 1000 abstracts matching the query "nlp", sorted by submission date.

In [27]:
import arxiv

# Search for papers
search = arxiv.Search(
    query="nlp",
    max_results=1000,
    sort_by=arxiv.SortCriterion.SubmittedDate
)

We fetch the results and construct a list of dictionaries holding the title and abstract of each paper.

In [28]:
# Search.results() is deprecated in arxiv 2.x; fetch results through a Client instead.
client = arxiv.Client()

documents = []
for result in client.results(search):
    documents.append({
        'title': result.title,
        'abstract': result.summary,
    })


In [29]:
documents[0]

{'title': 'MorphTok: Morphologically Grounded Tokenization for Indian Languages',
 'abstract': 'Tokenization is a crucial step in NLP, especially with the rise of large\nlanguage models (LLMs), impacting downstream performance, computational cost,\nand efficiency. Existing LLMs rely on the classical Byte-pair Encoding (BPE)\nalgorithm for subword tokenization that greedily merges frequent character\nbigrams. This often leads to segmentation that does not align with\nlinguistically meaningful units. To address this, we propose morphology-aware\nsegmentation as a pre-tokenization step prior to applying BPE. To facilitate\nmorphology-aware segmentation, we create a novel dataset for Hindi and Marathi,\nincorporating sandhi splitting to enhance the subword tokenization. Experiments\non downstream tasks show that morphologically grounded tokenization improves\nperformance for machine translation and language modeling. Additionally, to\nhandle the ambiguity in the Unicode characters for diac

Rerun this in case you had to restart the session.

In [30]:
import pyterrier as pt
if not pt.java.started():
    pt.init()

To create an index, we need a `docno` field, which we create below, and a `text` field, which is the abstract in this case.
We use `DFIndexer`, which lets us create an index by passing a list of ids and a list of texts.

In [31]:
import pandas as pd

index_path = './arxivindex'
!rm -r './arxivindex'

df = pd.DataFrame({
    'docno': ['doc'+str(i) for i in range(len(documents))],
    'text': [document['abstract'] for document in documents]
})

indexer = pt.DFIndexer(index_path)
indexer.index(docno=df.docno, text=df.text)

rm: cannot remove './arxivindex': No such file or directory




<org.terrier.querying.IndexRef at 0x7e57f49a0110 jclass=org/terrier/querying/IndexRef jself=<LocalRef obj=0x58a13b0 at 0x7e581ad0f7f0>>

We create a BM25 retrieval model for the Arxiv index.

In [32]:
bm25 = pt.BatchRetrieve(indexer, wmodel="BM25")



In [33]:
(bm25 % 5).search("information retrieval")

Unnamed: 0,qid,docid,docno,rank,score,query
0,1,842,doc842,0,6.634921,information retrieval
1,1,524,doc524,1,6.398256,information retrieval
2,1,926,doc926,2,6.297607,information retrieval
3,1,40,doc40,3,6.266992,information retrieval
4,1,192,doc192,4,6.264405,information retrieval


In [34]:
df[df['docno']=='doc670'].text.item()

"Evaluating the quality of machine-generated natural language content is a\nchallenging task in Natural Language Processing (NLP). Recently, large language\nmodels (LLMs) like GPT-4 have been employed for this purpose, but they are\ncomputationally expensive due to the extensive token usage required by complex\nevaluation prompts. In this paper, we propose a prompt optimization approach\nthat uses a smaller, fine-tuned language model to compress input data for\nevaluation prompt, thus reducing token usage and computational cost when using\nlarger LLMs for downstream evaluation. Our method involves a two-stage\nfine-tuning process: supervised fine-tuning followed by preference optimization\nto refine the model's outputs based on human preferences. We focus on Machine\nTranslation (MT) evaluation and utilize the GEMBA-MQM metric as a starting\npoint. Our results show a $2.37\\times$ reduction in token usage without any\nloss in evaluation quality. This work makes state-of-the-art LLM-bas

### Exercise 9

Create a ColBERT index using the Arxiv documents retrieved above and use it to search for some queries. Try to highlight how it is different from BM25 through the query results. You can also change the query we used to get arxiv pages.