# PyTerrier ColBERT Demo Notebook - Vaswani

This notebook demonstrates the use of the [PyTerrier plugin for ColBERT](https://github.com/terrierteam/pyterrier_colbert) for dense passage retrieval.

[ColBERT](https://github.com/stanford-futuredata/ColBERT/tree/v0.2) is a fast and accurate retrieval model, enabling scalable BERT-based search over large text collections in tens of milliseconds. ColBERT relies on fine-grained contextual late interaction: it encodes each passage into a matrix of token-level embeddings. Then at search time, it embeds every query into another matrix of token-level embeddings and efficiently finds passages that contextually match the query using scalable vector-similarity (MaxSim) operators. 
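
To make the late-interaction scoring concrete, here is a minimal sketch of the MaxSim operator in plain Python. It is illustrative only: the two-dimensional toy vectors and the function names `normalize` and `maxsim_score` are assumptions made for this example, not part of ColBERT or the PyTerrier plugin, which work on batched tensors rather than Python lists.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products act as cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def maxsim_score(query_embs, doc_embs):
    """Late-interaction score: for each query token embedding, take the
    maximum dot product with any document token embedding, then sum."""
    score = 0.0
    for q in query_embs:
        score += max(sum(a * b for a, b in zip(q, d)) for d in doc_embs)
    return score


# Toy "token embeddings" (hypothetical values, 2 dimensions for readability).
query = [normalize(v) for v in [[1.0, 0.0], [0.0, 1.0]]]
doc_a = [normalize(v) for v in [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]]
doc_b = [normalize(v) for v in [[1.0, 1.0]]]

score_a = maxsim_score(query, doc_a)  # each query token has an exact match
score_b = maxsim_score(query, doc_b)  # each query token only partially matches
print(score_a, score_b)
```

Because every query token is matched independently against the best document token, `doc_a` scores higher than `doc_b`, even though `doc_b` is similar to both query tokens on average.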


[ColBERT](https://arxiv.org/abs/2004.12832) is built on top of [BERT](https://arxiv.org/abs/1810.04805). ColBERT surpasses the quality of single-vector representation models, while scaling efficiently to large corpora. 

The corpus used in this demo is the [Vaswani NPL corpus](http://ir.dcs.gla.ac.uk/resources/test_collections/npl/), a corpus of 11,429 scientific abstracts, with corresponding queries and relevance assessments.

## Installation 

We need to install [PyTerrier](https://github.com/terrier-org/pyterrier).

In [1]:
!pip install python-terrier

Collecting python-terrier
  Downloading python-terrier-0.7.2.tar.gz (95 kB)
Collecting wget
  Downloading wget-3.2.zip (10 kB)
Collecting pyjnius~=1.3.0
  Downloading pyjnius-1.3.0-cp37-cp37m-manylinux2010_x86_64.whl (1.1 MB)
Collecting

This installs the [PyTerrier plugin for ColBERT](https://github.com/terrierteam/pyterrier_colbert). It supplies an indexer and a retrieval transformer.

In [2]:
!pip install --upgrade git+https://github.com/terrierteam/pyterrier_colbert.git

Collecting git+https://github.com/terrierteam/pyterrier_colbert.git
  Cloning https://github.com/terrierteam/pyterrier_colbert.git to /tmp/pip-req-build-t44al8mn
  Running command git clone -q https://github.com/terrierteam/pyterrier_colbert.git /tmp/pip-req-build-t44al8mn
Collecting ColBERT@ git+https://github.com/cmacdonald/ColBERT.git@v0.2#egg=ColBERT
  Cloning https://github.com/cmacdonald/ColBERT.git (to revision v0.2) to /tmp/pip-install-_abh_q44/colbert_c3019fdd573c46079ab39efde5702939
  Running command git clone -q https://github.com/cmacdonald/ColBERT.git /tmp/pip-install-_abh_q44/colbert_c3019fdd573c46079ab39efde5702939
  Running command git checkout -b v0.2 --track origin/v0.2
  Switched to a new branch 'v0.2'
  Branch 'v0.2' set up to track remote branch 'v0.2' from 'origin'.
Collecting transformers==3.0.2
  Downloading transformers-3.0.2-py3-none-any.whl (769 kB)
Collecting ujson
  Downloading ujson-5.1.0-cp

This installs [FAISS](https://github.com/facebookresearch/faiss), a library for efficient similarity search and clustering of dense vectors.

In [3]:
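import sys

# True when this notebook is running inside Google Colab.
COLAB = 'google.colab' in sys.modules

try:
  # Check whether a GPU-enabled FAISS build is already installed.
  import faiss
  faiss.get_num_gpus()
except Exception:
  if COLAB:
    print('Installing faiss-gpu from pip ')
    !pip install faiss-gpu==1.6.3
  else:
    print('Installing faiss-gpu via Conda')
    !conda install -c pytorch faiss-gpu

# This notebook expects at least one GPU to be visible to FAISS.
import faiss
assert faiss.get_num_gpus() > 0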

Installing faiss-gpu from pip 
Collecting faiss-gpu==1.6.3
  Downloading faiss_gpu-1.6.3-cp37-cp37m-manylinux2010_x86_64.whl (35.5 MB)
Installing collected packages: faiss-gpu
Successfully installed faiss-gpu-1.6.3


## Setup

Let's get [PyTerrier](https://github.com/terrier-org/pyterrier) started. This will download the latest version of the [Terrier](http://terrier.org) IR platform.

In [4]:
import pyterrier as pt
pt.init()

terrier-assemblies 5.6 jar-with-dependencies not found, downloading to /root/.pyterrier...
Done
terrier-python-helper 0.0.6 jar not found, downloading to /root/.pyterrier...
Done
PyTerrier 0.7.2 has loaded Terrier 5.6 (built by craigmacdonald on 2021-09-17 13:27)


This is the ColBERT checkpoint generated by Craig Macdonald and used in our TREC 2020 participation. It is downloaded the first time it is used; the download time varies.

In [5]:
checkpoint="http://www.dcs.gla.ac.uk/~craigm/colbert.dnn.zip"

## Indexing

This indexes the [Vaswani dataset](http://ir.dcs.gla.ac.uk/resources/test_collections/npl/). Indexing takes about 3 minutes using a Colab GPU.

In [6]:
!rm -rf /content/colbertindex

import pyterrier_colbert.indexing

indexer = pyterrier_colbert.indexing.ColBERTIndexer(checkpoint, "/content", "colbertindex", chunksize=3)
indexer.index(pt.get_dataset("irds:vaswani").get_corpus_iter())

vaswani documents:   0%|          | 0/11429 [00:00<?, ?it/s]

[Dec 23, 11:18:19] [0] 		 #> Local args.bsize = 128
[Dec 23, 11:18:19] [0] 		 #> args.index_root = /content
[Dec 23, 11:18:19] [0] 		 #> self.possible_subset_sizes = [69905]


Downloading:   0%|          | 0.00/433 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-base-uncased were not used when initializing ColBERT: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing ColBERT from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing ColBERT from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of ColBERT were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['linear.weight']
You should probably TRAI

[Dec 23, 11:18:55] #> Loading model checkpoint.
[Dec 23, 11:18:55] #> Loading checkpoint http://www.dcs.gla.ac.uk/~craigm/colbert.dnn.zip


Downloading: "http://www.dcs.gla.ac.uk/~craigm/colbert.dnn.zip" to /root/.cache/torch/hub/checkpoints/colbert.dnn.zip


  0%|          | 0.00/1.11G [00:00<?, ?B/s]



[Dec 23, 11:19:54] #> checkpoint['epoch'] = 0
[Dec 23, 11:19:54] #> checkpoint['batch'] = 44500




Downloading:   0%|          | 0.00/232k [00:00<?, ?B/s]



[Dec 23, 11:19:56] #> Note: Output directory /content already exists




[Dec 23, 11:19:56] #> Creating directory /content/colbertindex 




[INFO] [starting] http://ir.dcs.gla.ac.uk/resources/test_collections/npl/npl.tar.gz
[INFO] [finished] http://ir.dcs.gla.ac.uk/resources/test_collections/npl/npl.tar.gz: [00:00] [2.13MB] [2.92MB/s]

http://ir.dcs.gla.ac.uk/re

[Dec 23, 11:23:47] [0] 		 #> Completed batch #0 (starting at passage #0) 		Passages/min: 3.0k (overall),  3.0k (this encoding),  12614.9M (this saving)
[Dec 23, 11:23:48] [0] 		 [NOTE] Done with local share.
[Dec 23, 11:23:48] [0] 		 #> Joining saver thread.
[Dec 23, 11:23:48] [0] 		 #> Saved batch #0 to /content/colbertindex/0.pt 		 Saving Throughput = 1.3M passages per minute.

#> num_embeddings = 581496
[Dec 23, 11:23:48] #> Starting..
[Dec 23, 11:23:48] #> Processing slice #1 of 1 (range 0..1).
[Dec 23, 11:23:48] #> Will write to /content/colbertindex/ivfpq.100.faiss.
[Dec 23, 11:23:48] #> Loading /content/colbertindex/0.sample ...
#> Sample has shape (29074, 128)
[Dec 23, 11:23:48] Preparing resources for 1 GPUs.
[Dec 23, 11:23:48] #> Training with the vectors...
[Dec 23, 11:23:48] #> Training now (using 1 GPUs)...
42.07839870452881
32.22199583053589
0.0013880729675292969
[Dec 23, 11:25:02] Done training!

[Dec 23, 11:25:02] #> Indexing the vectors...
[Dec 23, 11:25:02] #> Loading

The indexing procedure generates the document embeddings index and a [FAISS](https://github.com/facebookresearch/faiss) index, together with some additional files.

In [7]:
!ls -ltrh /content/colbertindex

total 168M
-rw-r--r-- 1 root root 142M Dec 23 11:23 0.pt
-rw-r--r-- 1 root root 4.5M Dec 23 11:23 0.tokenids
-rw-r--r-- 1 root root 7.1M Dec 23 11:23 0.sample
-rw-r--r-- 1 root root  35K Dec 23 11:23 doclens.0.json
-rw-r--r-- 1 root root  36K Dec 23 11:23 docnos.pkl.gz
-rw-r--r-- 1 root root  14M Dec 23 11:25 ivfpq.100.faiss
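
The file sizes above can be sanity-checked against the indexing log, which reports `num_embeddings = 581496` and a sample of shape `(29074, 128)`. The sketch below assumes the embeddings are stored as 16-bit floats (2 bytes per value); that storage format is an assumption, not something the log states, although the resulting sizes agree with the listing.

```python
# Values taken from the indexing log and corpus description above.
num_embeddings = 581_496   # "#> num_embeddings = 581496"
sample_rows = 29_074       # "#> Sample has shape (29074, 128)"
dim = 128                  # embedding dimension from the sample shape
num_docs = 11_429          # size of the Vaswani corpus

# Assumption: each embedding value is stored as float16 (2 bytes).
bytes_per_value = 2


def mib(n_bytes):
    """Convert a byte count to mebibytes, the unit `ls -h` reports as 'M'."""
    return n_bytes / 2**20


embeddings_mib = mib(num_embeddings * dim * bytes_per_value)
sample_mib = mib(sample_rows * dim * bytes_per_value)
avg_embeddings_per_doc = num_embeddings / num_docs

print(f"0.pt     ~ {embeddings_mib:.1f} MiB")
print(f"0.sample ~ {sample_mib:.1f} MiB")
print(f"average embeddings per document ~ {avg_embeddings_per_doc:.1f}")
```

Under that assumption, almost all of the 168M total is the token-level document embeddings in `0.pt`, at roughly 51 embeddings per abstract.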


## Retrieval

Now that indexing has completed, we can load the index and the checkpoint model (which we need for encoding queries). Index loading can take some time, as both the [FAISS](https://github.com/facebookresearch/faiss) index and the document embeddings index need to be loaded into main memory.

Since we indexed a collection from scratch and the data structures are already loaded in main memory, we reuse the data structures for retrieval.

If the indexing was done offline, the following ColBERT factory can be used instead:

```python
import pyterrier_colbert.ranking
pyterrier_colbert_factory = pyterrier_colbert.ranking.ColBERTFactory(checkpoint, "/content", "colbertindex")
```


In [8]:
pyterrier_colbert_factory = indexer.ranking_factory()

colbert_e2e = pyterrier_colbert_factory.end_to_end()

[Dec 23, 11:25:10] #> Loading the FAISS index from /content/colbertindex/ivfpq.100.faiss ..
[Dec 23, 11:25:10] #> Building the emb2pid mapping..
[Dec 23, 11:25:10] len(self.emb2pid) = 581496
Loading reranking index, memtype=mem


Loading index shards to memory:   0%|          | 0/1 [00:00<?, ?shard/s]

Here we can ask [PyTerrier](https://github.com/terrier-org/pyterrier) to search the [ColBERT](https://github.com/stanford-futuredata/ColBERT/tree/v0.2) index for `'chemical reactions'`, returning the 10 highest-scoring documents.

In [9]:
(colbert_e2e % 10).search("chemical reactions")

| | qid | query | docid | query_toks | query_embs | score | docno | rank |
|---|---|---|---|---|---|---|---|---|
| 1964 | 1 | chemical reactions | 4911 | [tensor(101), tensor(1), tensor(5072), tensor(... | [[tensor(0.0680), tensor(-0.0084), tensor(0.11... | 19.824375 | 4912 | 0 |
| 2509 | 1 | chemical reactions | 7048 | [tensor(101), tensor(1), tensor(5072), tensor(... | [[tensor(0.0680), tensor(-0.0084), tensor(0.11... | 19.055037 | 7049 | 1 |
| 2377 | 1 | chemical reactions | 6479 | [tensor(101), tensor(1), tensor(5072), tensor(... | [[tensor(0.0680), tensor(-0.0084), tensor(0.11... | 18.036079 | 6480 | 2 |
| 557 | 1 | chemical reactions | 9373 | [tensor(101), tensor(1), tensor(5072), tensor(... | [[tensor(0.0680), tensor(-0.0084), tensor(0.11... | 17.1395 | 9374 | 3 |
| 2330 | 1 | chemical reactions | 6278 | [tensor(101), tensor(1), tensor(5072), tensor(... | [[tensor(0.0680), tensor(-0.0084), tensor(0.11... | 16.796631 | 6279 | 4 |
| 1151 | 1 | chemical reactions | 2420 | [tensor(101), tensor(1), tensor(5072), tensor(... | [[tensor(0.0680), tensor(-0.0084), tensor(0.11... | 16.427742 | 2421 | 5 |
| 1817 | 1 | chemical reactions | 4292 | [tensor(101), tensor(1), tensor(5072), tensor(... | [[tensor(0.0680), tensor(-0.0084), tensor(0.11... | 16.191317 | 4293 | 6 |
| 1199 | 1 | chemical reactions | 10702 | [tensor(101), tensor(1), tensor(5072), tensor(... | [[tensor(0.0680), tensor(-0.0084), tensor(0.11... | 16.153692 | 10703 | 7 |
| 2088 | 1 | chemical reactions | 5303 | [tensor(101), tensor(1), tensor(5072), tensor(... | [[tensor(0.0680), tensor(-0.0084), tensor(0.11... | 16.006708 | 5304 | 8 |
| 1496 | 1 | chemical reactions | 3100 | [tensor(101), tensor(1), tensor(5072), tensor(... | [[tensor(0.0680), tensor(-0.0084), tensor(0.11... | 15.884771 | 3101 | 9 |
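
The `%` operator applies a rank cutoff to a transformer's output. As a rough illustration of that behaviour (not PyTerrier's actual implementation), the hypothetical helper `rank_cutoff` below keeps only rows whose zero-based rank is below `k`, using the docno, score, and rank values of the first three rows of the output above.

```python
def rank_cutoff(results, k):
    """Keep only rows ranked in the top k; ranks are zero-based, as in the output above."""
    return [row for row in results if row["rank"] < k]


# First three rows of the search output, reduced to the columns that matter here.
results = [
    {"qid": "1", "docno": "4912", "score": 19.824375, "rank": 0},
    {"qid": "1", "docno": "7049", "score": 19.055037, "rank": 1},
    {"qid": "1", "docno": "6480", "score": 18.036079, "rank": 2},
]

top2 = rank_cutoff(results, 2)
print([row["docno"] for row in top2])
```

With `k = 2`, only the rows with rank 0 and rank 1 survive, which mirrors how `colbert_e2e % 10` yields ranks 0 to 9.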


## Run an experiment

Let's prepare an experiment. First, let's create a BM25 baseline transformer.

In [10]:
dataset = pt.get_dataset("vaswani")

bm25 = pt.BatchRetrieve(dataset.get_index(), wmodel="BM25")

Downloading vaswani index to /root/.pyterrier/corpora/vaswani/index

Finally, let's evaluate retrieval effectiveness. The BM25 baseline uses an index of the same corpus, so the two systems are directly comparable. We limit this experiment to the first 50 queries.

In [20]:
pt.Experiment(
    [bm25, colbert_e2e%1000],
    dataset.get_topics().head(50),
    dataset.get_qrels(),
    eval_metrics=["map", "recip_rank", "mrt"],
    names = ["BM25", "ColBERT"]
)

| | name | map | recip_rank | mrt |
|---|---|---|---|---|
| 0 | BM25 | 0.338941 | 0.808238 | 34.955486 |
| 1 | ColBERT | 0.331955 | 0.762643 | 545.993813 |
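
For readers unfamiliar with the measures, the following sketch shows how average precision and reciprocal rank are computed for a single query. `map` and `recip_rank` in the table are the means of these values over all evaluated queries, and `mrt` is, as we understand it, the mean response time per query in milliseconds. The document identifiers and helper names are made up for illustration; `pt.Experiment` relies on PyTerrier's own evaluation tooling rather than code like this.

```python
def average_precision(ranked_docnos, relevant):
    """Mean of precision@rank over the ranks where a relevant document appears,
    divided by the total number of relevant documents."""
    hits = 0
    precision_sum = 0.0
    for rank, docno in enumerate(ranked_docnos, start=1):
        if docno in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0


def reciprocal_rank(ranked_docnos, relevant):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, docno in enumerate(ranked_docnos, start=1):
        if docno in relevant:
            return 1.0 / rank
    return 0.0


# Hypothetical ranking and relevance judgements for one query.
ranking = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2", "d9"}

ap = average_precision(ranking, relevant)
rr = reciprocal_rank(ranking, relevant)
print(ap, rr)
```

In this toy case the relevant documents appear at ranks 2 and 4, and one relevant document (`d9`) is never retrieved, which is why the average precision is divided by three rather than two.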


In [21]:
from pyterrier.measures import *

pt.Experiment(
    [bm25, colbert_e2e],
    dataset.get_topics().head(10),
    dataset.get_qrels(),
    eval_metrics=["map", "recip_rank", "mrt"],
    names = ["BM25", "ColBERT"]
)

| | name | map | recip_rank | mrt |
|---|---|---|---|---|
| 0 | BM25 | 0.299508 | 0.683333 | 30.971393 |
| 1 | ColBERT | 0.288534 | 0.615679 | 544.313445 |


In [19]:
from pyterrier.measures import *

pt.Experiment(
    [bm25, colbert_e2e],
    dataset.get_topics(),
    dataset.get_qrels(),
    eval_metrics=["map", "recip_rank", nDCG@1000],
    names = ["BM25", "ColBERT"]
)

| | name | map | recip_rank | nDCG@1000 |
|---|---|---|---|---|
| 0 | BM25 | 0.296517 | 0.725665 | 0.621197 |
| 1 | ColBERT | 0.278781 | 0.70344 | 0.596948 |
