# Indexed EPIC

[EPIC](https://arxiv.org/pdf/2004.14245.pdf) is a neural re-ranking model that builds efficient representations for re-ranking. In this example, we show how to build an EPIC index to speed up the re-ranking process.

## Install and import required packages

In [1]:
!pip install --upgrade git+https://github.com/terrier-org/pyterrier
!pip install --upgrade git+https://github.com/Georgetown-IR-Lab/OpenNIR

Collecting git+https://github.com/terrier-org/pyterrier
  Cloning https://github.com/terrier-org/pyterrier to /tmp/pip-req-build-w07i2dd8
  Running command git clone -q https://github.com/terrier-org/pyterrier /tmp/pip-req-build-w07i2dd8
Building wheels for collected packages: python-terrier
  Building wheel for python-terrier (setup.py) ... done
  Created wheel for python-terrier: filename=python_terrier-0.4.0-cp36-none-any.whl size=77279 sha256=c122ac837cc645b0d544984d26182b3a79c133657a1415b3890b836bae8b5255
  Stored in directory: /tmp/pip-ephem-wheel-cache-7l1jhcps/wheels/91/7d/75/656f56b2b8ece83f93195066cbc720d379e70f2a2da6e7955e
Successfully built python-terrier
Installing collected packages: python-terrier
  Found existing installation: python-terrier 0.4.0
    Uninstalling python-terrier-0.4.0:
      Successfully uninstalled python-terrier-0.4.0
Successfully installed python-terrier-0.4.0
You should consider upgrading via the 'pip install --upgrade pip' command.
Collecting git+https://github.com/Georgetown-IR-Lab/OpenNIR
  Cloning https://github.com/Georgetown-IR-Lab/OpenNIR to /tmp/pip-req-build-goygvm9o
  Running command git clone -q https://github.com/Georgetown-IR-Lab/OpenNIR /tmp/pip-req-build-goygvm9o
Building wheels for collected packages: OpenNIR
  Building wheel for OpenNIR (setup.py) ... done
  Created wheel for OpenNIR: filename=OpenNIR-0.1.0-cp36-none-any.whl size=55844579 sha256=7e5766112740243bc44a3c3a4fbf2fd8520d550c17c8a1c136b6bee6954e6fb1
  Stored in directory: /tmp/pip-ephem-wheel-cache-gx4hdon5/wheels/3e/0c/99/d4d6998a276620c87fe9db8322e2fd769017eb77e1d3fcc67e
Successfully built OpenNIR
Installing collected packages: OpenNIR
  Found existing installation: OpenNIR 0.1.0
    Uninstalling OpenNIR-0.1.0:
      Successfully uninstalled OpenNIR-0.1.0
Successfully installed OpenNIR-0.1.0
You should consider upgrading via the 'pip install --upgrade pip' command.


In [2]:
import os
import pyterrier as pt
if not pt.started():
    pt.init(tqdm='notebook')
import onir_pt

PyTerrier 0.4.0 has loaded Terrier 5.4 (built by craigm on 2021-01-16 14:17)
Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex.


## Lazy re-ranking

We'll start with a re-ranker that computes document representations only when they are needed. Since this model uses BERT to build these representations, the process takes a long time.

In [3]:
# Load a version of EPIC trained on the MS-MARCO dataset
lazy_epic = onir_pt.reranker.from_checkpoint(
    'https://macavaney.us/epic.msmarco.tar.gz',
    expected_md5="2f6a16be1a6a63aab1e8fed55521a4db")

config file not found: config
[2021-03-19 07:35:51,950][onir_pt][INFO] using cached checkpoint: /home/sean/data/onir/model_checkpoints/66273681b3ce24117dfda4b8ff58bad3


In [4]:
# Use the TREC COVID dataset for this example
dataset = pt.datasets.get_dataset('irds:cord19/trec-covid')

In [6]:
# Build an inverted index for TREC COVID with PyTerrier
pt_index_path = './terrier_cord19'
if not os.path.exists(pt_index_path + '/data.properties'):
    indexer = pt.index.IterDictIndexer(pt_index_path)
    index_ref = indexer.index(dataset.get_corpus_iter(), fields=('abstract',), meta=('docno',))
else:
    index_ref = pt.IndexRef.of(pt_index_path + '/data.properties')
index = pt.IndexFactory.of(index_ref)

In [7]:
br = pt.BatchRetrieve(index) % 30
pipeline = (br >> pt.text.get_text(dataset, 'abstract')
               >> pt.apply.generic(lambda x: x.rename(columns={'abstract': 'text'}))
               >> lazy_epic)
pt.Experiment(
    [br, pipeline],
    dataset.get_topics('title'),
    dataset.get_qrels(),
    names=['DPH', 'DPH >> EPIC (lazy)'],
    eval_metrics=["recip_rank", "P.5", "mrt"]
)

[2021-03-19 07:36:48,287][onir_pt][DEBUG] using GPU (deterministic)
[2021-03-19 07:36:49,675][onir_pt][DEBUG] [starting] batches


batches:   0%|          | 0/375 [16ms<?, ?it/s]

[2021-03-19 07:37:26,691][onir_pt][DEBUG] [finished] batches: [37.01s] [375it] [10.13it/s]


| | name | recip_rank | P_5 | mrt |
|---|---|---|---|---|
| 0 | DPH | 0.766833 | 0.684 | 36.734074 |
| 1 | DPH >> EPIC (lazy) | 0.817889 | 0.724 | 801.524438 |


As the mean response time (mrt) column above shows, the lazy EPIC re-ranker is much slower than retrieving from the Terrier index alone: roughly 802 ms per query versus 37 ms.
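To make the `mrt` column concrete, here is a minimal sketch of how a mean response time can be measured. It assumes the figure is total wall-clock time divided by the number of queries, reported in milliseconds. The helper `mean_response_time_ms` and the toy `fake_pipeline` are hypothetical names for illustration and are not part of PyTerrier.

```python
import time


def mean_response_time_ms(run_query, queries):
    """Return average wall-clock time per query, in milliseconds.

    Assumption: response time is total elapsed time / number of queries.
    """
    if not queries:
        raise ValueError("need at least one query")
    start = time.perf_counter()
    for query in queries:
        run_query(query)
    elapsed_s = time.perf_counter() - start
    return 1000.0 * elapsed_s / len(queries)


# Toy stand-in for a retrieval pipeline (hypothetical; does no real retrieval).
seen = []

def fake_pipeline(query):
    seen.append(query)
    time.sleep(0.002)  # pretend the pipeline does some work


mrt = mean_response_time_ms(fake_pipeline, ["covid origin", "mask efficacy", "vaccine trials"])
print(f"mrt: {mrt:.2f} ms")
```

Because the measurement covers the whole pipeline, a re-ranking pipeline's figure includes the time spent in the first-stage retriever as well.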

## Pre-computing document vectors

We can speed up the process by first computing all the document vectors. To do this, we use the `onir_pt.indexed_epic` component.
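The trade-off between lazy and pre-computed representations can be illustrated with a toy sketch. Everything in it is hypothetical: `CountingEncoder` stands in for an expensive encoder such as BERT and simply counts how often it is called, and the dictionaries stand in for a corpus and an index. It is not how `onir_pt` is implemented.

```python
class CountingEncoder:
    """Hypothetical stand-in for an expensive document encoder."""

    def __init__(self):
        self.calls = 0

    def encode(self, text):
        self.calls += 1
        return len(text.split())  # toy "representation": the token count


def rerank_lazy(encoder, corpus, docnos):
    """Encode each candidate document at query time."""
    return {docno: encoder.encode(corpus[docno]) for docno in docnos}


def build_index(encoder, corpus):
    """Encode every document once, ahead of time."""
    return {docno: encoder.encode(text) for docno, text in corpus.items()}


def rerank_indexed(index, docnos):
    """Look up stored representations; no encoder calls at query time."""
    return {docno: index[docno] for docno in docnos}


corpus = {"d1": "covid vaccine trial", "d2": "mask study", "d3": "virus origin report"}

lazy_encoder = CountingEncoder()
rerank_lazy(lazy_encoder, corpus, ["d1", "d2"])   # query 1
rerank_lazy(lazy_encoder, corpus, ["d1", "d3"])   # query 2 re-encodes d1

indexed_encoder = CountingEncoder()
index = build_index(indexed_encoder, corpus)       # one-off cost
calls_after_build = indexed_encoder.calls
rerank_indexed(index, ["d1", "d2"])
rerank_indexed(index, ["d1", "d3"])

print("lazy encoder calls at query time:", lazy_encoder.calls)
print("indexed encoder calls at query time:", indexed_encoder.calls - calls_after_build)
```

The indexed variant pays the encoding cost once per document at build time, so query-time work shrinks to lookups.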

In [8]:
indexed_epic = onir_pt.indexed_epic.from_checkpoint('https://macavaney.us/epic.msmarco.tar.gz',
                                            index_path='./epic_cord19')

[2021-03-19 07:37:29,309][onir_pt][INFO] using cached checkpoint: /home/sean/data/onir/model_checkpoints/66273681b3ce24117dfda4b8ff58bad3


In [9]:
# Index the documents. This takes some time up front, but it greatly reduces mean response time at query time.
indexed_epic.index(dataset.get_corpus_iter(), fields=('abstract',), replace=False)

cord19/trec-covid documents:   0%|          | 0/192509 [19ms<?, ?it/s]

AssertionError: index already built (use replace=True to replace)
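The `AssertionError` above indicates that an index already exists at `./epic_cord19` from an earlier run, so with `replace=False` nothing is rebuilt. One way to make such a cell safe to re-run is to catch that specific case and reuse the existing index. The sketch below uses a hypothetical `FakeIndexer` class rather than the real `onir_pt.indexed_epic` object, and it assumes the real component signals an existing index with an `AssertionError` whose message contains "index already built", as in the output above.

```python
class FakeIndexer:
    """Hypothetical stand-in that mimics the 'already built' assertion above."""

    def __init__(self):
        self.built = False
        self.docs = []

    def index(self, doc_iter, replace=False):
        assert replace or not self.built, "index already built (use replace=True to replace)"
        self.docs = list(doc_iter)
        self.built = True


def index_if_missing(indexer, doc_iter):
    """Build the index, or reuse it if one exists. Returns True if built now."""
    try:
        indexer.index(doc_iter, replace=False)
        return True
    except AssertionError as err:
        # Only swallow the "already built" case; re-raise anything else.
        if "index already built" not in str(err):
            raise
        return False


indexer = FakeIndexer()
first = index_if_missing(indexer, [{"docno": "d1"}])                    # builds
second = index_if_missing(indexer, [{"docno": "d1"}, {"docno": "d2"}])  # reuses
print(first, second, len(indexer.docs))
```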

Now we can use the index to speed up the re-ranking:

In [10]:
pipeline = br >> indexed_epic.reranker()
pt.Experiment(
    [br, pipeline],
    dataset.get_topics('title'),
    dataset.get_qrels(),
    names=["DPH", "DPH >> EPIC (indexed)"],
    eval_metrics=["recip_rank", "P.5", "mrt"]
)

[2021-03-19 07:38:04,977][onir_pt][INFO] This EPIC transformer shouldn't be used to calculate query latency. It computes query vectors batches (rather than individually), and doesn't do this work in parallel with first-stage retrieval. For thise operations, use the epic pipeline in OpenNIR. (This message is only shown once.)
[2021-03-19 07:38:04,980][onir_pt][DEBUG] using GPU (deterministic)
[2021-03-19 07:38:05,096][onir_pt][DEBUG] [starting] records


records:   0%|          | 0/1500 [15ms<?, ?it/s]


[2021-03-19 07:38:05,968][onir_pt][DEBUG] [finished] records: [869ms] [1500it] [1726.81it/s]


| | name | recip_rank | P_5 | mrt |
|---|---|---|---|---|
| 0 | DPH | 0.766833 | 0.684 | 30.500175 |
| 1 | DPH >> EPIC (indexed) | 0.8215 | 0.7 | 53.264584 |


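That was much faster: in the runs shown above, the indexed pipeline averages about 53 ms per query, roughly 748 ms less than the lazy version (about 802 ms). It is also only about 23 ms slower than DPH alone, which it uses as a first-stage ranker.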
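Working the differences out from the `mrt` values in the two results tables gives the figures below. Timings vary from run to run, so numbers quoted from a different run may not match exactly.

```python
# Mean response times (ms) copied from the results tables above.
lazy_mrt = 801.524438      # DPH >> EPIC (lazy)
indexed_mrt = 53.264584    # DPH >> EPIC (indexed)
dph_mrt = 30.500175        # DPH alone, same run as the indexed re-ranker

saved_ms = lazy_mrt - indexed_mrt      # time saved per query by indexing
overhead_ms = indexed_mrt - dph_mrt    # cost of re-ranking on top of DPH
speedup = lazy_mrt / indexed_mrt       # ratio of lazy to indexed latency

print(f"saved per query:    {saved_ms:.1f} ms")
print(f"overhead over DPH:  {overhead_ms:.1f} ms")
print(f"speedup:            {speedup:.1f}x")
```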

There is a slight change in effectiveness: recip_rank rises from 0.818 to 0.822, while P@5 drops from 0.724 to 0.700. This is because document vectors are pruned when indexed.

Also notice that the indexed re-ranker no longer needs the document text, so the pipeline drops the `pt.text.get_text` step; that also saves some time.
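To see why pruning document vectors can shift scores slightly, consider a simplified sketch. It assumes a document is represented as a mapping from vocabulary terms to weights, that the score is a dot product with the query's term weights, and that pruning keeps only the `k` largest document weights. The terms, weights, and `k` are made-up toy values, and the functions are illustrative rather than OpenNIR's implementation.

```python
import heapq


def prune(doc_vec, k):
    """Keep only the k highest-weighted vocabulary entries."""
    return dict(heapq.nlargest(k, doc_vec.items(), key=lambda item: item[1]))


def score(query_vec, doc_vec):
    """Dot product over the query's terms; missing terms contribute 0."""
    return sum(weight * doc_vec.get(term, 0.0) for term, weight in query_vec.items())


# Toy vectors (assumed values, for illustration only).
doc = {"virus": 0.9, "vaccine": 0.7, "mask": 0.4, "origin": 0.05}
query = {"virus": 1.0, "origin": 0.5}

full_score = score(query, doc)
pruned_doc = prune(doc, k=3)          # drops the low-weight "origin" entry
pruned_score = score(query, pruned_doc)

print(f"full:   {full_score:.3f}")
print(f"pruned: {pruned_score:.3f}")
```

Dropping low-weight entries changes each score only a little, but small changes can still reorder documents with close scores, which is consistent with the small differences in the metrics.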