# Indexed EPIC

[EPIC](https://arxiv.org/pdf/2004.14245.pdf) is a neural re-ranking model that builds efficient representations for re-ranking. In this example, we show how to build an EPIC index to speed up the re-ranking process.

## Install and import required packages

In [None]:
!pip install --upgrade git+https://github.com/terrier-org/pyterrier
!pip install --upgrade git+https://github.com/Georgetown-IR-Lab/OpenNIR

In [1]:
import os
import pyterrier as pt
if not pt.started():
    pt.init(tqdm='notebook')
import onir_pt

PyTerrier 0.4.0 has loaded Terrier 5.4 (built by craigm on 2021-01-16 14:17)
Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex.


## Lazy re-ranking

We'll start by using a re-ranker that computes document representations as they are needed. Since this model uses BERT to build these representations, this process takes a long time.

In [2]:
# Load a version of EPIC trained on the MS-MARCO dataset
lazy_epic = onir_pt.reranker.from_checkpoint('https://macavaney.us/epic.msmarco.tar.gz', expected_md5="2f6a16be1a6a63aab1e8fed55521a4db")

[2021-03-12 10:12:32,206][onir_pt][INFO] using cached checkpoint: /home/sean/data/onir/model_checkpoints/66273681b3ce24117dfda4b8ff58bad3


In [3]:
# Use the TREC COVID dataset for this example
dataset = pt.datasets.get_dataset('irds:cord19/trec-covid')

In [6]:
# Build an inverted index for TREC COVID with PyTerrier
pt_index_path = './terrier_cord19'
if not os.path.exists(pt_index_path + '/data.properties'):
    indexer = pt.index.IterDictIndexer(pt_index_path)
    index_ref = indexer.index(dataset.get_corpus_iter(), fields=('abstract',), meta=('docno',))
else:
    index_ref = pt.IndexRef.of(pt_index_path + '/data.properties')
index = pt.IndexFactory.of(index_ref)

07:33:15.313 [ForkJoinPool-1-worker-3] WARN  o.t.structures.indexing.Indexer - Indexed 54937 empty documents


In [7]:
br = pt.BatchRetrieve(index) % 30
pipeline = br >> pt.text.get_text(dataset, 'abstract') >> pt.apply.generic(lambda x: x.rename(columns={'abstract': 'text'})) >> lazy_epic
pt.Experiment(
    [br, pipeline],
    dataset.get_topics('title'),
    dataset.get_qrels(),
    eval_metrics=["recip_rank", "P.5", "mrt"]
)

[2021-03-12 07:33:22,932][onir_pt][DEBUG] using GPU (deterministic)
[2021-03-12 07:33:22,939][onir_pt][DEBUG] [starting] batches


[2021-03-12 07:33:59,697][onir_pt][DEBUG] [finished] batches: [36.76s] [375it] [10.20it/s]


| | name | recip_rank | P_5 | mrt |
|---|---|---|---|---|
| 0 | RankCutoff(BR(DPH), 30) | 0.766833 | 0.684 | 31.369397 |
| 1 | Compose(Compose(Compose(RankCutoff(BR(DPH), 30... | 0.817889 | 0.724 | 768.649893 |


As the mean response time (mrt) above shows, the lazy EPIC pipeline is much slower than retrieval from the Terrier index alone: about 769 ms per query versus about 31 ms.
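To make the `mrt` column concrete, here is a minimal sketch of how a mean response time can be measured. It assumes the metric is total wall-clock time divided by the number of queries, reported in milliseconds, which matches how the numbers are read in this notebook. The function `mean_response_time_ms` and the `toy_ranker` are hypothetical stand-ins for illustration, not part of PyTerrier.

```python
import time

def mean_response_time_ms(transform, queries):
    """Average wall-clock time per query, in milliseconds (hypothetical helper)."""
    start = time.perf_counter()
    for query in queries:
        transform(query)
    elapsed_s = time.perf_counter() - start
    return 1000.0 * elapsed_s / len(queries)

def toy_ranker(query):
    # Hypothetical stand-in for a retrieval pipeline: sleeps briefly to mimic work.
    time.sleep(0.002)
    return [query]

queries = ["coronavirus origin", "covid-19 vaccine", "masks transmission"]
mrt = mean_response_time_ms(toy_ranker, queries)
print(f"mrt: {mrt:.2f} ms")
```

Because the time is averaged over queries, any per-document cost paid at query time (such as encoding 30 candidate abstracts with BERT) shows up directly in this number.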

## Pre-computing document vectors

We can speed up the process by first computing all the document vectors. To do this, we use the `onir_pt.indexed_epic` component.

In [8]:
indexed_epic = onir_pt.indexed_epic.from_checkpoint('https://macavaney.us/epic.msmarco.tar.gz',
                                            index_path='./epic_cord19')

[2021-03-12 07:34:33,488][onir_pt][INFO] using cached checkpoint: /home/sean/data/onir/model_checkpoints/66273681b3ce24117dfda4b8ff58bad3


In [9]:
# Index the documents. This takes some time, but it greatly reduces mean response time at query time.
indexed_epic.index(dataset.get_corpus_iter(), fields=('abstract',), replace=True)

[2021-03-12 07:34:44,500][onir_pt][DEBUG] using GPU (deterministic)


onir(epic,bert)
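The trade-off behind this indexing step can be illustrated with a toy sketch. Nothing here is OpenNIR code: `encode_document` is a hypothetical stand-in for an expensive encoder (such as a BERT forward pass), the documents and queries are made up, and the scoring function is a simple term-weight sum chosen only to keep the example runnable.

```python
from collections import Counter

stats = {"encoder_calls": 0}

def encode_document(text):
    """Hypothetical stand-in for an expensive document encoder."""
    stats["encoder_calls"] += 1
    return Counter(text.lower().split())

def score(query, doc_vector):
    # Toy relevance score: sum of the document weights of the query terms.
    return sum(doc_vector[term] for term in query.lower().split())

docs = {
    "d1": "coronavirus vaccine trial results",
    "d2": "vaccine side effects in adults",
    "d3": "mask use and coronavirus transmission",
}
queries = ["coronavirus vaccine", "mask transmission"]

def lazy_rerank(query, docnos):
    # Encodes every candidate document again for every query.
    scored = [(d, score(query, encode_document(docs[d]))) for d in docnos]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

lazy_results = [lazy_rerank(q, list(docs)) for q in queries]
lazy_calls = stats["encoder_calls"]

# Index time: encode each document exactly once and store the vector.
stats["encoder_calls"] = 0
vector_index = {docno: encode_document(text) for docno, text in docs.items()}
index_calls = stats["encoder_calls"]

def indexed_rerank(query, docnos):
    # Looks up stored vectors; no document encoding at query time.
    scored = [(d, score(query, vector_index[d])) for d in docnos]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

stats["encoder_calls"] = 0
indexed_results = [indexed_rerank(q, list(docs)) for q in queries]
query_time_calls = stats["encoder_calls"]

print(lazy_calls, index_calls, query_time_calls)
```

In this toy setup the lazy version calls the encoder once per query-document pair, while the indexed version pays the encoding cost once per document up front and none at query time.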

Now we can use the index to speed up the re-ranking:

In [12]:
pipeline = br >> indexed_epic.reranker()
pt.Experiment(
    [br, pipeline],
    dataset.get_topics('title'),
    dataset.get_qrels(),
    eval_metrics=["recip_rank", "P.5", "mrt"]
)

[2021-03-12 09:44:12,456][onir_pt][DEBUG] using GPU (deterministic)
[2021-03-12 09:44:12,494][onir_pt][DEBUG] [starting] records


[2021-03-12 09:44:13,209][onir_pt][DEBUG] [finished] records: [713ms] [1500it] [2102.42it/s]


| | name | recip_rank | P_5 | mrt |
|---|---|---|---|---|
| 0 | RankCutoff(BR(DPH), 30) | 0.766833 | 0.684 | 32.052374 |
| 1 | Compose(RankCutoff(BR(DPH), 30), onir(epic,bert)) | 0.8215 | 0.7 | 47.084249 |


That was much faster: mean response time dropped from about 769 ms to about 47 ms per query, roughly 722 ms less than the lazy version. And it's only about 15 ms slower than DPH alone (which it uses as a first-stage ranker).
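The comparison can be checked directly from the two result tables. The short calculation below uses only the `mrt` values reported above (read as milliseconds); the variable names are ours.

```python
# mrt values copied from the two experiment tables above (milliseconds).
lazy_mrt = 768.649893             # DPH top 30 >> get_text >> lazy EPIC
indexed_mrt = 47.084249           # DPH top 30 >> indexed EPIC
dph_mrt_lazy_run = 31.369397      # DPH alone, first experiment
dph_mrt_indexed_run = 32.052374   # DPH alone, second experiment

saved_ms = lazy_mrt - indexed_mrt
speedup = lazy_mrt / indexed_mrt
indexed_overhead_ms = indexed_mrt - dph_mrt_indexed_run
lazy_overhead_ms = lazy_mrt - dph_mrt_lazy_run

print(f"time saved per query:       {saved_ms:.1f} ms")
print(f"end-to-end speedup:         {speedup:.1f}x")
print(f"indexed re-ranking adds:    {indexed_overhead_ms:.1f} ms over DPH")
print(f"lazy re-ranking adds:       {lazy_overhead_ms:.1f} ms over DPH")
```

Subtracting the first-stage time isolates the cost of re-ranking itself, which is where the two approaches differ most.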

There is a slight change in effectiveness: recip_rank rises from 0.818 to 0.822, while P_5 drops from 0.724 to 0.700. This is because document vectors are pruned when indexed.
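To see why pruning can shift scores, consider a toy sketch of top-k pruning on a sparse term-weight vector scored by a dot product. This is an illustration under assumed values, not OpenNIR's implementation: the helper `prune_top_k`, the vectors, and the choice of `k=3` are all made up for the example.

```python
import heapq

def prune_top_k(doc_vector, k):
    """Keep only the k highest-weighted entries of a sparse vector (hypothetical helper)."""
    return dict(heapq.nlargest(k, doc_vector.items(), key=lambda item: item[1]))

def dot(query_vector, doc_vector):
    # Sparse dot product over the query's terms.
    return sum(w * doc_vector.get(term, 0.0) for term, w in query_vector.items())

# Made-up term weights for one document and one query.
doc_vector = {"coronavirus": 2.0, "vaccine": 1.5, "trial": 1.0, "adults": 0.4, "dose": 0.2}
query_vector = {"vaccine": 1.0, "dose": 0.5}

full_score = dot(query_vector, doc_vector)
pruned_vector = prune_top_k(doc_vector, k=3)
pruned_score = dot(query_vector, pruned_vector)

print(f"full: {full_score:.2f}, pruned: {pruned_score:.2f}")
```

The low-weight entry for "dose" is dropped by pruning, so the pruned score loses that term's small contribution. Small per-document differences like this can reorder closely scored documents, which is consistent with the slight metric changes seen above.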

Also notice that the indexed re-ranker no longer needs the document text, so the pipeline can skip the text lookup step; that also saves some time.