# Using BERT (and other transformer methods) for IR - Data preparation

This notebook covers the basics of implementing a pipeline for training and running inference over an IR dataset.
We will use Terrier, through PyTerrier, to index and retrieve documents from the MsMarco TREC 2019 DL dataset.

## Dependencies installation
First, let's install what we need. I highly recommend using something like Conda to manage your environment!

We are using Python 3.7 and CUDA 10.1. If you are using another version, check how to install PyTorch at https://pytorch.org/get-started/locally/#start-locally.


In [1]:
# PyTorch
! conda install -y pytorch torchvision cudatoolkit=10.1 -c pytorch
# 🤗 Tokenizers (this gives us a huge performance boost, since tokenizing is the slowest part of the process)
! pip install tokenizers
# 🤗 Transformers
! pip install transformers

Collecting package metadata (current_repodata.json): done
Solving environment: done


  current version: 4.8.2
  latest version: 4.8.3

Please update conda by running

    $ conda update -n base conda



## Package Plan ##

  environment location: /home/arthur/miniconda3/envs/bert4IR

  added / updated specs:
    - cudatoolkit=10.1
    - pytorch
    - torchvision


The following packages will be UPDATED:

  ca-certificates                     2019.11.28-hecc5488_0 --> 2020.4.5.1-hecc5488_0
  certifi                         2019.11.28-py37hc8dfbb8_1 --> 2020.4.5.1-py37hc8dfbb8_0
  openssl                                 1.1.1f-h516909a_0 --> 1.1.1g-h516909a_0


Preparing transaction: done
Verifying transaction: done
Executing transaction: done


### Terrier Installation
Terrier should be easier to install and use than Anserini (i.e. no need for Java 11).

- If you have Java installed on your machine, make sure that `JAVA_HOME` is set properly.
- If you want to be sure, check [SDKMAN!](https://sdkman.io/) to install a cleaner version of Java, in the version you want.
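
One way to verify the first point before installing PyTerrier is a small check like the one below. It is a minimal sketch: `check_java_home` is a hypothetical helper (not part of PyTerrier), and it assumes a standard JDK layout where the launcher lives at `$JAVA_HOME/bin/java`.

```python
import os


def check_java_home(env=None):
    """Return the path of the java launcher under JAVA_HOME, or None.

    Hypothetical helper: assumes a standard JDK layout (bin/java or bin/java.exe).
    """
    env = os.environ if env is None else env
    java_home = env.get("JAVA_HOME")
    if not java_home:
        return None
    for name in ("java", "java.exe"):
        candidate = os.path.join(java_home, "bin", name)
        if os.path.isfile(candidate):
            return candidate
    return None


java_bin = check_java_home()
if java_bin is None:
    print("JAVA_HOME is unset or does not contain bin/java")
else:
    print(f"Using Java launcher at {java_bin}")
```

If the check reports a problem, set `JAVA_HOME` (as in the next cell) before importing PyTerrier.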

In [2]:
import os
os.environ["JAVA_HOME"] = "/home/arthur/.sdkman/candidates/java/8.0.242-open" # Make sure this points to the right place for your Java Home!
!pip install --upgrade git+https://github.com/terrier-org/pyterrier.git#egg=python-terrier

Collecting python-terrier
  Cloning https://github.com/terrier-org/pyterrier.git to /tmp/pip-install-6o31_ypx/python-terrier
  Running command git clone -q https://github.com/terrier-org/pyterrier.git /tmp/pip-install-6o31_ypx/python-terrier
Building wheels for collected packages: python-terrier
  Building wheel for python-terrier (setup.py) ... done
  Created wheel for python-terrier: filename=python_terrier-0.1.3-py3-none-any.whl size=28344 sha256=a335fc9d1a5585f9a4572108fc7446fd0e801ee2a436b7f452c401b0e7a3fcbf
  Stored in directory: /tmp/pip-ephem-wheel-cache-4x7swyea/wheels/61/12/f7/d3c3d17f72ab9ad1c5d510a0d6bd1612023e01fa0e07f01059
Successfully built python-terrier
Installing collected packages: python-terrier
  Attempting uninstall: python-terrier
    Found existing installation: python-terrier 0.1.3
    Uninstalling python-terrier-0.1.3:
      Successfully uninstalled python-terrier-0.1.3
Successfully installed python-terrier-0.1.3


## Local variables
These variables are local to you and should be edited accordingly. Things like the path where the dataset is downloaded are all set here.

In [3]:
import os
data_home = "/ssd2/arthur/MsMarcoTREC/"  # Where you want to store the docs
n_threads = 32  # Number of threads to use. Make sure your machine has at least this many!


def get_path(x):
    return os.path.join(data_home, x)


# Create the data directory if it doesn't exist yet
os.makedirs(data_home, exist_ok=True)

## Download data
We are using the MsMarco TREC 2019 dataset. We should download everything here.

If you are running this from the DeepIR machine from WIS, we already have everything there. Ask Arthur where this is and `ln -s` to your path.
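
If the data already exists elsewhere on disk, the linking step can also be done from Python. The following is a sketch only: `link_existing_data` is a hypothetical helper, and `/path/to/shared/MsMarcoTREC` is a placeholder, since the actual location on the shared machine is not given here.

```python
import os


def link_existing_data(shared_dir, target_dir):
    """Symlink target_dir to shared_dir (like `ln -s shared_dir target_dir`).

    Does nothing if target_dir already exists.
    Returns True if a link was created, False otherwise.
    """
    target_dir = os.path.normpath(target_dir)  # drops a trailing slash
    if os.path.lexists(target_dir):
        return False
    parent = os.path.dirname(target_dir)
    if parent:
        os.makedirs(parent, exist_ok=True)
    os.symlink(shared_dir, target_dir, target_is_directory=True)
    return True


# Placeholder paths: replace both with your own locations.
# link_existing_data("/path/to/shared/MsMarcoTREC", "/ssd2/arthur/MsMarcoTREC")
```

Because the "Local variables" cell creates `data_home` when it is missing, the link has to be made before that cell runs, or the empty directory has to be removed first.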

In [4]:
from urllib import request
import gzip
import shutil

download_path = "https://msmarco.blob.core.windows.net/msmarcoranking/"  # default MsMarco path for downloading data
# It sucks to need documents in both .trec and .tsv, but it's easier this way, believe me.
files_to_get = [
    "docs/msmarco-docs.trec",  #docs in trec format
    "docs/msmarco-docs.tsv",  # docs in tsv format
    "queries/msmarco-doctrain-queries.tsv",  # train queries
    "qrels/msmarco-doctrain-qrels.tsv",  # train qrels
    "queries/msmarco-docdev-queries.tsv",  # dev queries
    "qrels/msmarco-docdev-qrels.tsv",  # dev qrels
    "queries/msmarco-test2019-queries.tsv",  # test queries
    "qrels/2019qrels-docs.txt"  # test qrels
]
for file in files_to_get:
    local_file_path = get_path(file)
    if not os.path.isfile(local_file_path):
        print(
            f"File {file.split('/')[-1]} not found. Downloading it from the Web"
        )
        # Create the target dir if it doesn't exist (before any download, including the NIST one)
        os.makedirs(os.path.dirname(local_file_path), exist_ok=True)
        url_to_fetch_from = download_path + file.split("/")[1] + ".gz"
        # qrels for test come from NIST, not from Microsoft. Also, no need to uncompress
        if file == "qrels/2019qrels-docs.txt":
            url_to_fetch_from = "https://trec.nist.gov/data/deep/2019qrels-docs.txt"
            request.urlretrieve(url_to_fetch_from, local_file_path)
            continue
        try:
            request.urlretrieve(url_to_fetch_from, local_file_path + ".gz")
        except OSError:  # urllib's URLError and HTTPError are subclasses of OSError
            print(
                f"Could not fetch {file} from {url_to_fetch_from}. Make sure that's the right URL!"
            )
            continue
        # Uncompress the file. Not needed, but easier (you could use the gzip lib to open the files...)
        print(f"Extracting file {file}")
        with gzip.open(local_file_path + ".gz", 'rb') as f_in, open(local_file_path, 'wb') as outf:
            shutil.copyfileobj(f_in, outf)
        # Remove the archive only after both files are closed
        os.remove(local_file_path + ".gz")

## Create Terrier Index
This may take a while... (~1h in our experiments)

- You will not receive any feedback in the output while the indexing is running. You can check the progress by running `ls -lah` on the index folder and seeing whether the files are growing in size.
- Alternatively, run the script manually (don't forget to set `JAVA_HOME`) to get some feedback in the terminal. As a sanity check, the index must contain 3,213,835 documents.
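
Both checks above can also be done from Python, for example from a second notebook or terminal while the indexer runs. This is a minimal sketch under stated assumptions: `dir_size_bytes` and `count_trec_docs` are hypothetical helpers, and `count_trec_docs` assumes that every document in the `.trec` file starts with a line containing only `<DOC>`.

```python
import os


def dir_size_bytes(path):
    """Sum the sizes of all regular files under path (0 if it does not exist)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                # The indexer may delete or rename temporary files while we scan.
                pass
    return total


def count_trec_docs(trec_path):
    """Count lines consisting only of an opening <DOC> tag in a TREC-format file."""
    count = 0
    with open(trec_path, "rb") as f:
        for line in f:
            if line.strip() == b"<DOC>":
                count += 1
    return count


# Example usage with the paths defined in this notebook (uncomment to run):
# print(dir_size_bytes(get_path("terrier-index")) / 1e9, "GB indexed so far")
# print(count_trec_docs(get_path("docs/msmarco-docs.trec")), "documents in the collection")
```

Calling `dir_size_bytes` a few times, some minutes apart, shows whether the index is still growing. Counting the documents in the collection is slow on a file of this size, but it gives a number to compare against the statistics reported by the indexer.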

In [7]:
import shutil

import pyterrier as pt

try:
    pt.init(mem=16384)
except Exception:
    # Ignore the error, e.g. if Terrier was already initialised by an earlier run of this cell
    pass
index_path = get_path("terrier-index")
# Start from a clean index directory
shutil.rmtree(index_path, ignore_errors=True)
indexer = pt.TRECCollectionIndexer(index_path)
# index_properties = {"block.indexing":"true", "invertedfile.lexiconscanner":"pointers"}
index = indexer.index(get_path("docs/msmarco-docs.trec"))

IndexingType.CLASSIC
IndexingType.CLASSIC


In [11]:
indexer.getIndexStats()

Collection statistics:
number of indexed documents: 3213835
size of vocabulary: 16168096
number of tokens: 2204592607
number of pointers: 905088837
number of fields: 0
field names: []
blocks: false
