# Reproduce TCT-ColBERT FF Interpolation

_IN5000, Master Thesis, group: Web Information Systems, theme: Information Retrieval, TU Delft_

_Bo van den Berg, b.vandenberg-6@student.tudelft.nl_

- https://github.com/wis-delft/in4325-information-retrieval
- https://colab.research.google.com/github/wis-delft/in4325-information-retrieval/blob/main/intro-pyterrier/07-neural_models.ipynb#scrollTo=NdAb9cjkPlTP

## Setup

### Selecting GPU for Training

**Important**: In order to train a large neural network in reasonable time, you'll need a CUDA-capable GPU. 
If you have one, follow the [official tutorials](https://pytorch.org/get-started/locally/) and install PyTorch with CUDA acceleration. 

If you do not have one, Google Colab offers free GPUs and TPUs. 
Please do the following: 

`Edit -> Notebook settings -> Hardware accelerator -> select a GPU`

Once the GPU is enabled (or PyTorch with CUDA support is installed), restart your kernel. 
Then run the following cell to confirm that the GPU is detected. 
It should no longer print `No GPU available, using the CPU instead.`:

In [1]:
import torch

# If there's a GPU available, use it.
if torch.cuda.is_available():
    # Tell PyTorch to use the GPU.
    device = torch.device("cuda")
    print('There are %d GPU(s) available.' % torch.cuda.device_count())
    print('We will use the GPU:', torch.cuda.get_device_name(0))
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

There are 1 GPU(s) available.
We will use the GPU: Quadro P1000


In [2]:
import pyterrier as pt

if not pt.started():
    pt.init(tqdm="notebook")

PyTerrier 0.10.0 has loaded Terrier 5.8 (built by craigm on 2023-11-01 18:05) and terrier-helper 0.0.8

No etc/terrier.properties, using terrier.default.properties for bootstrap configuration.


_Side note_: In this notebook, we focus on the **retrieve-and-re-rank** setting. PyTerrier supports **dense retrieval** models through plugins (such as [pyterrier_ance](https://github.com/terrierteam/pyterrier_ance)). Since dense retrieval is often very resource-demanding, we do not cover it here. Another library that provides many pre-trained models and dense retrieval indexes is [pyserini](https://github.com/castorini/pyserini).


## Fast-Forward Indexes

Fast-forward indexes use _dual-encoder models_ (the same that are used in dense retrieval) for _interpolation-based re-ranking_. The benefit of this (compared to cross-encoders) is that document representations only need to be computed once (during the indexing step) and can be looked up during re-ranking.
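
To make the look-up idea concrete, here is a minimal sketch that does not use the Fast-Forward library. The document vectors, the `encode_query` stand-in, and the dot-product scoring are all assumptions for illustration; a real query encoder would run a Transformer model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical document vectors, computed once at indexing time.
doc_vectors = {f"d{i}": rng.normal(size=8) for i in range(5)}


def encode_query(text: str) -> np.ndarray:
    """Stand-in for a query encoder (deterministic pseudo-random vector)."""
    seed = sum(ord(c) for c in text)
    return np.random.default_rng(seed).normal(size=8)


def rerank(query: str, candidates: list[str]) -> list[tuple[str, float]]:
    """Score candidates by dot product with looked-up document vectors."""
    q = encode_query(query)  # the only encoding done at query time
    scored = [(d, float(q @ doc_vectors[d])) for d in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


result = rerank("what is a fast-forward index", ["d0", "d2", "d4"])
```

Only the query is encoded at re-ranking time; each candidate costs one dictionary look-up and one dot product.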

### The encoders

We'll start by instantiating the encoders. [TCT-ColBERT](https://github.com/castorini/tct_colbert) is a single-vector dual-encoder model based on BERT, where the query and document encoders are identical (Siamese architecture). A pre-trained model (trained on MS MARCO) is [available on the Hugging Face hub](https://huggingface.co/castorini/tct_colbert-msmarco). We'll use this model as-is (i.e., without further fine-tuning) on the MS MARCO passage collection with the TREC DL 2019 queries.

The encoders can be loaded as follows:

In [3]:
from fast_forward.encoder.transformer import TCTColBERTQueryEncoder

q_encoder = TCTColBERTQueryEncoder("castorini/tct_colbert-msmarco")

### The index

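For the dense vector representations, we need a separate index in addition to the sparse one.

Before loading it, we load the first-stage run (the file name suggests a sparse retriever) for the TREC DL 2019 queries and keep the top `k_s = 1000` documents per query: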

In [15]:
import ir_datasets
from fast_forward import Ranking
from pathlib import Path

k_s = 1000
dataset = ir_datasets.load("msmarco-passage/trec-dl-2019/judged")

# load a run (TREC format) and attach all required queries
ranking = Ranking.from_file(
    Path("msmarco-passage-test2019-sparse10000.txt"),
    {q.query_id: q.text for q in dataset.queries_iter()},
).cut(k_s)
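
For readers unfamiliar with the TREC run format, the following sketch parses a few lines by hand and keeps the top `k` documents per query. It assumes the common six-column layout (`query_id Q0 doc_id rank score tag`); the sample lines and the `parse_run` helper are made up for illustration and are not part of the Fast-Forward library.

```python
from collections import defaultdict

# Made-up run in the usual six-column TREC layout.
sample_run = """\
q1 Q0 d3 1 14.2 bm25
q1 Q0 d7 2 11.9 bm25
q1 Q0 d1 3 9.4 bm25
q2 Q0 d5 1 8.8 bm25
"""


def parse_run(text: str, k: int) -> dict[str, dict[str, float]]:
    """Return {query_id: {doc_id: score}} with the k best documents per query."""
    run = defaultdict(dict)
    for line in text.splitlines():
        if not line.strip():
            continue
        query_id, _q0, doc_id, _rank, score, _tag = line.split()
        run[query_id][doc_id] = float(score)
    return {
        query_id: dict(
            sorted(docs.items(), key=lambda item: item[1], reverse=True)[:k]
        )
        for query_id, docs in run.items()
    }


top2 = parse_run(sample_run, k=2)
```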


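Instead of indexing the collection ourselves, we load a pre-built index from disk.

Our encoders output `768`-dimensional representations. The `dim=768` argument is commented out in the cell below because an existing index file is loaded rather than a new index created. `Mode.MAXP` determines how documents that have multiple vectors are scored.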

In [5]:
from fast_forward import OnDiskIndex, Mode
from pathlib import Path

ff_index = OnDiskIndex.load(
    Path("../ff_msmarco-v1-passage.tct_colbert.h5"), 
    # dim=768,
    query_encoder=q_encoder, 
    mode=Mode.MAXP
)

100%|██████████| 8841823/8841823 [00:14<00:00, 603458.84it/s] 
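
As a rough illustration of what a MaxP-style mode does, the sketch below scores a document with several passage vectors by taking the maximum passage score. This follows the usual meaning of "MaxP" and is an assumption about the behaviour, not the library's actual code; the vectors are toy values.

```python
import numpy as np


def maxp_score(query_vec: np.ndarray, passage_vecs: np.ndarray) -> float:
    """Score a document as the maximum dot product over its passage vectors."""
    return float(np.max(passage_vecs @ query_vec))


query_vec = np.array([1.0, 0.0, 2.0])
# One document split into three passages, one vector per passage.
passage_vecs = np.array(
    [
        [0.5, 1.0, 0.0],  # passage score 0.5
        [1.0, 0.0, 1.0],  # passage score 3.0
        [0.0, 2.0, 0.5],  # passage score 1.0
    ]
)

doc_score = maxp_score(query_vec, passage_vecs)  # 3.0
```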


At this point, if you have enough RAM, you can load the entire index (i.e., all vector representations) into the main memory:

- I don't have enough RAM for msmarco_passage though: 
`MemoryError: Unable to allocate 25.3 GiB for an array with shape (8841823, 768) and data type float32`

In [6]:
# Uncomment the next line only if the whole index fits into RAM
# ff_index = ff_index.to_memory()
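
The figure in the `MemoryError` above can be checked with a quick calculation: each `float32` value takes 4 bytes, and the array shape is taken directly from the error message.

```python
num_vectors = 8_841_823  # rows, from the error message
dim = 768                # columns, from the error message
bytes_per_float32 = 4

total_bytes = num_vectors * dim * bytes_per_float32
total_gib = total_bytes / 2**30

print(f"{total_gib:.1f} GiB")  # prints "25.3 GiB"
```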

## Re-ranking

A Fast-Forward index can be called directly on a `Ranking` for standard re-ranking. In order to use it in a PyTerrier pipeline, we wrap it in an `FFScore` transformer:


In [7]:
# standard re-ranking
# ff_out = ff_index(ranking.cut(k_s))
ff_out = ff_index(ranking)

In [8]:
from fast_forward.util.pyterrier import FFScore

ff_score = FFScore(ff_index)

The `score` column has now been updated to reflect the re-ranking scores. Furthermore, there is a new column, `score_0`, which contains the original retrieval scores. As mentioned earlier, Fast-Forward indexes focus on _interpolation-based re-ranking_. In essence, the idea is to take both lexical retrieval scores $s_{\text{lex}}$ and semantic re-ranking scores $s_{\text{sem}}$ into account, such that the final score $s$ is computed as follows:

$$s = \alpha s_{\text{lex}} + (1-\alpha) s_{\text{sem}}$$

We can perform the interpolation using the `FFInterpolate` transformer. 
We'll set the hyperparameter $\alpha$ to an arbitrarily chosen 0.5 for now:

In [9]:
from fast_forward.util.pyterrier import FFInterpolate

ff_int = FFInterpolate(alpha=0.5)
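
The interpolation formula itself is simple enough to sketch in plain Python. The candidate scores below are made up, and the `interpolate` helper is hypothetical; it only mirrors the formula above, not the internals of `FFInterpolate`.

```python
def interpolate(s_lex: float, s_sem: float, alpha: float) -> float:
    """s = alpha * s_lex + (1 - alpha) * s_sem"""
    return alpha * s_lex + (1 - alpha) * s_sem


# Made-up (lexical, semantic) score pairs for three candidates of one query.
candidates = {"d1": (12.0, 68.0), "d2": (15.0, 60.0), "d3": (9.0, 72.0)}
alpha = 0.5

final = {
    doc_id: interpolate(s_lex, s_sem, alpha)
    for doc_id, (s_lex, s_sem) in candidates.items()
}
ranked = sorted(final, key=final.get, reverse=True)
```

With these values, `d2` has the best lexical score but ends up last, because its semantic score is the lowest and both scores are weighted equally.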

In [23]:
from pyterrier.measures import AP, R, nDCG
from fast_forward.util import to_ir_measures

dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-2019/judged')

pt.Experiment(
    [to_ir_measures(ranking), to_ir_measures(ff_out)],
    dataset.get_topics(),
    dataset.get_qrels(),
    eval_metrics=[AP @ 1000, R @ 1000, nDCG @ 20],
    names=["TCT-ColBERT", "TCT-ColBERT >> FF"],
)

|   | name | AP@1000 | R@1000 | nDCG@20 |
|---|------|---------|--------|---------|
| 0 | TCT-ColBERT | 0.377308 | 0.738937 | 0.491352 |
| 1 | TCT-ColBERT >> FF | 0.455415 | 0.738937 | 0.666728 |
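
In the table above, R@1000 is identical for both rows while AP and nDCG differ. A likely explanation is that re-ranking only reorders the same `k_s = 1000` candidates per query, so the set of retrieved documents does not change. The toy sketch below (made-up document IDs, hypothetical `recall_at_k` helper) shows this effect:

```python
def recall_at_k(ranked_ids: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant documents found in the top k."""
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)


relevant = {"d1", "d4", "d9"}
first_stage = ["d2", "d1", "d3", "d4", "d5"]
reranked = ["d4", "d1", "d5", "d2", "d3"]  # same documents, new order

# At the full cut-off, recall cannot change: the candidate set is the same.
full_before = recall_at_k(first_stage, relevant, k=5)
full_after = recall_at_k(reranked, relevant, k=5)

# At a smaller cut-off, the new order can make a difference.
top2_before = recall_at_k(first_stage, relevant, k=2)
top2_after = recall_at_k(reranked, relevant, k=2)
```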


### Validation

PyTerrier offers several functions to determine the best hyperparameters for a ranker. In the following, we'll use [`pyterrier.GridSearch`](https://pyterrier.readthedocs.io/en/latest/tuning.html#pyterrier.GridSearch) to find the best value for $\alpha$.

**Important**: When you tune hyperparameters of your model, **do not use the same data you use for testing (i.e., the testset)**. Otherwise, your results are invalid, because you optimized your method for the testing data. Instead, we'll use the development (validation) data:

In [18]:
# devset = pt.get_dataset('irds:msmarco-passage/dev/judged')

PyTerrier's `GridSearch` class can be used to automatically run an experiment multiple times in order to find the hyperparameters that result in the best performance.

Conveniently, it also sets the best value for us in the transformer.

The value of hyperparameters such as $\alpha$ can make a big difference.

We'll use a similar pipeline as before, but we limit the number of candidate documents to `100` in order to reduce the runtime. We provide a list of values for `alpha` and a metric (MAP), which is used to decide which value results in the best performance:


In [19]:
# pt.GridSearch(
#     to_ir_measures(ff_index(ranking)), # ~bm25 % 100 >> ff_score >> ff_int,
#     {ff_int: {"alpha": [0.05, 0.1, 0.5, 0.9]}},
#     devset.get_topics(),
#     devset.get_qrels(),
#     "map",
#     verbose=True,
# )

[INFO] Please confirm you agree to the MSMARCO data usage agreement found at <http://www.msmarco.org/dataset.aspx>
[INFO] [starting] https://msmarco.z22.web.core.windows.net/msmarcoranking/qrels.dev.tsv
[INFO] [finished] https://msmarco.z22.web.core.windows.net/msmarcoranking/qrels.dev.tsv: [00:02] [1.20MB] [550kB/s]
[INFO] If you have a local copy of https://msmarco.z22.web.core.windows.net/msmarcoranking/queries.tar.gz, you can symlink it here to avoid downloading it again: /home/bovdberg/.ir_datasets/downloads/c177b2795d5f2dcc524cf00fcd973be1
[INFO] [starting] https://msmarco.z22.web.core.windows.net/msmarcoranking/queries.tar.gz
[INFO] [finished] https://msmarco.z22.web.core.windows.net/msmarcoranking/queries.tar.gz: [00:41] [18.9MB] [458kB/s]
                                                                                                  

GridScan:   0%|          | 0/4 [00:00<?, ?it/s]

Best map is 0.000000
Best setting is ['<fast_forward.util.pyterrier.FFInterpolate object at 0x7ff1987eb970> alpha=0.05']


|   | query_id | doc_id | score |
|---|----------|--------|-------|
| 0 | 962179 | 8785371 | 70.537880 |
| 1 | 962179 | 5653659 | 70.318794 |
| 2 | 962179 | 2329699 | 70.294601 |
| 3 | 962179 | 2978866 | 70.273598 |
| 4 | 962179 | 6898289 | 70.240326 |
| ... | ... | ... | ... |
| 42995 | 1037798 | 7783409 | 63.979805 |
| 42996 | 1037798 | 4547385 | 63.891666 |
| 42997 | 1037798 | 3850121 | 63.888672 |
| 42998 | 1037798 | 5538665 | 63.815487 |


_Side note_: As of now, PyTerrier does not support caching for re-ranking transformers. Hence, `GridSearch` takes a long time, because the scores are re-computed every time, even though that wouldn't be necessary.
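
To illustrate why re-computing the scores is unnecessary, the following sketch tunes $\alpha$ by hand on cached scores: the lexical and semantic scores are stored once, and only the cheap interpolation is repeated per value. All scores, the relevance judgment, and the helper functions are made up for illustration; this is not how `GridSearch` is implemented.

```python
def average_precision(ranked_ids: list[str], relevant: set[str]) -> float:
    """Average precision of one ranking."""
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


# Cached scores for one query: computed once, reused for every alpha.
lex = {"d1": 10.0, "d2": 8.0, "d3": 6.0}
sem = {"d1": 1.0, "d2": 3.0, "d3": 9.0}
relevant = {"d3"}


def evaluate(alpha: float) -> float:
    """Interpolate the cached scores and evaluate the resulting ranking."""
    final = {d: alpha * lex[d] + (1 - alpha) * sem[d] for d in lex}
    ranked = sorted(final, key=final.get, reverse=True)
    return average_precision(ranked, relevant)


results = {alpha: evaluate(alpha) for alpha in [0.05, 0.1, 0.5, 0.9]}
best_alpha = max(results, key=results.get)  # first best value on ties
```

In a real setup, the metric would be averaged over all validation queries rather than computed for a single one.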

