# Using BERT (and other transformer methods) for IR - Data preparation

This notebook covers the basics of implementing a pipeline for training and running inference over an IR dataset.
We will use Anserini, through Pyserini, to index and retrieve documents from the MsMarco TREC 2019 DL dataset.

##  Dependencies installation
First, let's install what we need. I highly recommend using something like Conda to manage your environment!

We are using Python 3.7 and CUDA 10.1. (If you are using another version, check how to install PyTorch at https://pytorch.org/get-started/locally/#start-locally)


In [5]:
#Pytorch
! conda install -y pytorch torchvision cudatoolkit=10.1 -c pytorch
# 🤗 tokenizers (this gives us a HUGE performance boost. Tokenizing is the slowest part of the process)
! pip install tokenizers
# 🤗 Transformer
! pip install transformers

Cloning into 'anserini'...
remote: Enumerating objects: 44, done.
remote: Counting objects: 100% (44/44), done.
remote: Compressing objects: 100% (34/34), done.
remote: Total 12581 (delta 15), reused 25 (delta 9), pack-reused 12537
Receiving objects: 100% (12581/12581), 18.57 MiB | 15.04 MiB/s, done.
Resolving deltas: 100% (7132/7132), done.
Checking connectivity... done.
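
Once the installs finish, it can be worth confirming that the packages are actually visible from the notebook's kernel. Below is a minimal sketch using only the standard library. `check_packages` is a hypothetical helper written for this notebook, not part of any of these libraries, and it only checks that the packages can be found, not that CUDA works.

```python
import importlib.util


def check_packages(names):
    """Return {package_name: True/False} depending on whether it can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


# Import names of the packages installed in the cell above.
status = check_packages(["torch", "tokenizers", "transformers"])
for name, found in status.items():
    print(f"{name}: {'ok' if found else 'MISSING'}")
```

If something shows up as `MISSING`, the install most likely went to a different environment than the one the kernel is running in.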


### Anserini installation.
Java is a pain in the ass. That's why you should run these commands on your terminal, not here!

```
git clone https://github.com/castorini/anserini
curl -s "https://get.sdkman.io" | bash
source "$HOME/.sdkman/bin/sdkman-init.sh"
sdk install java
sdk use java 11.0.6.hs-adpt # This may change. Check which version of Java 11 was installed
cd anserini
mvn clean package -Dmaven.test.skip=true appassembler:assemble
```

This should be enough to install Anserini. If not, check their repository.

## Local variables
These variables are local to you and should be edited accordingly. Things like the path where the dataset is downloaded are all set here.

In [39]:
import os
data_home = "/ssd2/arthur/MsMarcoTREC/"  # Where you want to store the docs
anserini_path = "/ssd2/arthur/bert4IR/anserini"  # Should be where you downloaded and installed Anserini. Check above!
n_threads = 32  # Number of threads to use. Make sure your machine has more cores than this!


def get_path(x):
    return os.path.join(data_home, x)


if not os.path.isdir(data_home):
    os.makedirs(data_home)
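
Before moving on, a quick check of these settings can save a failed indexing run later. The sketch below is an assumption about what is worth checking: that the Anserini build produced the `IndexCollection` launcher used further down, and that the thread count does not exceed the CPUs on the machine. `check_setup` is a hypothetical helper, and the values in the example call are simply the ones from this notebook.

```python
import os


def check_setup(anserini_path, n_threads):
    """Return a list of human-readable problems; an empty list means all good."""
    problems = []
    indexer = os.path.join(anserini_path, "target", "appassembler", "bin",
                           "IndexCollection")
    if not os.path.isfile(indexer):
        problems.append(
            f"Anserini indexer not found at {indexer}. Did the Maven build finish?")
    available = os.cpu_count() or 1  # cpu_count() may return None
    if n_threads > available:
        problems.append(
            f"n_threads={n_threads} but only {available} CPUs are available")
    return problems


# Example call with the values set in this notebook.
for problem in check_setup("/ssd2/arthur/bert4IR/anserini", 32):
    print("WARNING:", problem)
```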

## Download data
We are using the MsMarco TREC 2019 dataset. We should download everything here.

If you are running this from the DeepIR machine from WIS, we already have everything there. Ask Arthur where this is and `ln -s` to your path.

In [37]:
from urllib import request
import gzip
import shutil

download_path = "https://msmarco.blob.core.windows.net/msmarcoranking/"  # default MsMarco path for downloading data
# It sucks to need documents in both .trec and .tsv, but it's easier this way, believe me.
files_to_get = [
    "docs/msmarco-docs.trec",  # docs in trec format
    "docs/msmarco-docs.tsv",  # docs in tsv format
    "queries/msmarco-doctrain-queries.tsv",  # train queries
    "qrels/msmarco-doctrain-qrels.tsv",  # train qrels
    "queries/msmarco-docdev-queries.tsv",  # dev queries
    "qrels/msmarco-docdev-qrels.tsv",  # dev qrels
    "queries/msmarco-test2019-queries.tsv",  # test queries
    "qrels/2019qrels-docs.txt"  # test qrels
]
for file in files_to_get:
    local_file_path = get_path(file)
    if not os.path.isfile(local_file_path):
        print(
            f"File {file.split('/')[-1]} not found. Downloading it from the Web"
        )
        # Create the target dir first, so every download below has somewhere to land
        os.makedirs(os.path.dirname(local_file_path), exist_ok=True)
        # qrels for test come from NIST, not from Microsoft. Also, no need to uncompress
        if file == "qrels/2019qrels-docs.txt":
            url_to_fetch_from = "https://trec.nist.gov/data/deep/2019qrels-docs.txt"
            request.urlretrieve(url_to_fetch_from, local_file_path)
            continue
        url_to_fetch_from = download_path + file.split("/")[1] + ".gz"
        try:
            request.urlretrieve(url_to_fetch_from, local_file_path + ".gz")
        except OSError:  # urllib's URLError and HTTPError are subclasses of OSError
            print(
                f"Could not fetch {file} from {url_to_fetch_from}. Make sure that's the right URL!"
            )
            continue
        # Uncompress file. Not needed, but easier. (you could use the gzip lib to open the files...)
        with gzip.open(local_file_path + ".gz", 'rb') as f_in, open(local_file_path, 'wb') as outf:
            print(f"Extracting file {file}")
            shutil.copyfileobj(f_in, outf)
        # Remove the archive only after both files are closed
        os.remove(local_file_path + ".gz")
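
As the comment in the download cell hints, the files can also be read straight from their `.gz` archives. Here is a minimal sketch of that idea. `open_text` and `load_queries` are hypothetical helpers, and the loader assumes the queries files hold one `query_id<TAB>query_text` pair per line.

```python
import gzip


def open_text(path):
    """Open a plain or gzip-compressed text file for reading, based on its extension."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "r", encoding="utf-8")


def load_queries(path):
    """Read a queries file into {query_id: query_text}.

    Assumes one query per line, formatted as query_id<TAB>query_text.
    """
    queries = {}
    with open_text(path) as f:
        for line in f:
            query_id, text = line.rstrip("\n").split("\t", 1)
            queries[query_id] = text
    return queries
```

With the files downloaded as above, something like `load_queries(get_path("queries/msmarco-docdev-queries.tsv"))` should give a quick look at the dev queries, and the same call works on a `.tsv.gz` path.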

## Create Anserini Index
This may take a while... We are copying the procedure from here: https://github.com/castorini/anserini/blob/master/docs/experiments-msmarco-doc.md.

- You will not receive any feedback while the indexing is running. You can check the progress by running `ls -lah` on the index folder and checking whether the files are growing in size.
- Alternatively, run the script manually (don't forget to set `JAVA_HOME`) to get some feedback on the terminal. As a sanity check, the index must contain 3,213,835 documents.
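
The `ls -lah` check can also be done from Python, for example from a second notebook or a terminal session while the indexing cell is busy. A minimal sketch, where `dir_size_bytes` and `human_readable` are hypothetical helpers and not part of Anserini:

```python
import os


def dir_size_bytes(path):
    """Total size, in bytes, of all files under `path` (0 if it does not exist yet)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                # A file may disappear between listing it and reading its size.
                pass
    return total


def human_readable(n_bytes):
    """Format a byte count using binary units."""
    size = float(n_bytes)
    for unit in ("B", "KiB", "MiB", "GiB"):
        if size < 1024:
            return f"{size:.1f} {unit}"
        size /= 1024
    return f"{size:.1f} TiB"
```

Calling `human_readable(dir_size_bytes(index_dir))` a few times, with `index_dir` pointing at the index folder, should show the number growing while the indexer is making progress.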

In [90]:
import subprocess, os
from os.path import expanduser
home = expanduser("~")
my_env = os.environ.copy()
my_env["JAVA_HOME"] = f"{home}/.sdkman/candidates/java/11.0.6.hs-adpt"  #Set right JAVA version

command = [
    "sh", f"{anserini_path}/target/appassembler/bin/IndexCollection",  # Invoke Anserini Indexer
    "-collection", "TrecCollection", # Define type of collection (TREC)
    "-generator", "LuceneDocumentGenerator",   # Define type of index to generate
    "-threads", str(n_threads),  # Number of threads to use to index
    "-input", get_path("docs/"),  # Folder with the documents
    "-index", get_path("lucene-index.msmarco-doc.pos+docvectors+rawdocs"),  # Where to store the index
    "-storePositions", "-storeDocvectors", "-storeRawDocs"  # Extra options
]

# Nothing will output to the shell. You may check progress by running "ls -lah" on the index folder above.
# Alternatively, you can run the script manually on a terminal, so you can have some feedback on the indexing process.
output = subprocess.run(command,
                        stdout=subprocess.PIPE,
                        stderr=subprocess.PIPE,
                        text=True,
                        env=my_env)

# Write log to disk (both streams, so errors from Java are not lost).
os.makedirs(get_path("logs"), exist_ok=True)
with open(get_path("logs/indexing.log"), 'w') as f:
    f.write(output.stdout)
    f.write(output.stderr)
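
If you would rather see the indexer's output while it runs, one option is to stream it line by line instead of capturing everything at the end. This is a minimal sketch of that alternative, not what the cell above does. `run_and_stream` is a hypothetical helper built on the standard `subprocess.Popen`.

```python
import subprocess
import sys


def run_and_stream(command, log_path, env=None):
    """Run `command`, echoing each output line as it arrives and saving it to `log_path`.

    Returns the exit code of the process.
    """
    with open(log_path, "w") as log, subprocess.Popen(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge stderr into stdout so nothing is lost
            text=True,
            env=env) as process:
        for line in process.stdout:
            sys.stdout.write(line)
            log.write(line)
    # Leaving the `with` block waits for the process, so returncode is set here.
    return process.returncode
```

With the variables from the cell above, this would be called as `run_and_stream(command, get_path("logs/indexing.log"), env=my_env)`. A non-zero return value means the indexer exited with an error.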
