## Retrieval using BM25 (Pyserini implementation)
Reranking the whole collection for every query is very costly. So, as per usual practice, a fast first-stage retriever such as BM25 first generates an initial list of 1000 candidates, and only those candidates are passed to the reranker.
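
To make the first stage concrete, here is a minimal pure-Python sketch of BM25 scoring followed by a top-k cut. It uses the textbook BM25 formula, which is not guaranteed to match Lucene's implementation term for term; the whitespace tokenization, the toy documents, and the function name `bm25_rank` are assumptions for illustration only.

```python
import math
from collections import Counter

def bm25_rank(query, docs, k1=0.9, b=0.4, top_k=1000):
    """Rank docs (dict: docid -> list of tokens) for a tokenized query."""
    n_docs = len(docs)
    avg_len = sum(len(toks) for toks in docs.values()) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in docs.values() for term in set(toks))
    scores = {}
    for docid, toks in docs.items():
        tf = Counter(toks)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: longer-than-average documents are penalized.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        if score > 0:
            scores[docid] = score
    # Keep only the top_k highest-scoring documents as reranking candidates.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

# Toy collection (hypothetical data, English tokens for readability).
toy_docs = {
    "d1": "cox bazar sea beach travel guide".split(),
    "d2": "dhaka city traffic news".split(),
    "d3": "sundarbans travel and tiger safari travel tips".split(),
}
candidates = bm25_rank("travel guide".split(), toy_docs, top_k=2)
```

Only documents that share at least one term with the query receive a score, which is why the candidate list can be built cheaply from an inverted index before the expensive reranker runs.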

In [None]:
data_folder_root = "/media/cse/HDD/SamiKhan/anwesha"
dataset_name = "anwesha-travel"
topics_path = f"{data_folder_root}/{dataset_name}-topics.tsv"
qrels_path = f"{data_folder_root}/{dataset_name}-qrels.txt"

!python -m pyserini.index.lucene \
  --collection JsonCollection \
  --input {data_folder_root}/{dataset_name}-collection/ \
  --language bn \
  --index indexes/{dataset_name}-bm25 \
  --generator DefaultLuceneDocumentGenerator \
  --threads 1 \
  --storePositions --storeDocvectors --storeRaw

!python -m pyserini.search.lucene \
  --index indexes/{dataset_name}-bm25 \
  --topics {topics_path} \
  --output run.{dataset_name}.bm25.txt \
  --language bn \
  --bm25

!python -m pyserini.eval.trec_eval \
  -c -M 100 -m ndcg_cut.10 -m recall.100 {qrels_path} \
  run.{dataset_name}.bm25.txt

2024-01-28 08:00:24,322 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:204) - Setting log level to INFO
2024-01-28 08:00:24,324 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:208) - AbstractIndexer settings:
2024-01-28 08:00:24,325 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:209) -  + DocumentCollection path: /content/gdrive/MyDrive/Research/predefence/anwesha/anwesha-travel-collection/
2024-01-28 08:00:24,325 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:210) -  + CollectionClass: JsonCollection
2024-01-28 08:00:24,326 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:211) -  + Index path: indexes/anwesha-travel-bm25
2024-01-28 08:00:24,326 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:212) -  + Threads: 1
2024-01-28 08:00:24,327 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:213) -  + Optimize (merge segments)? false
2024-01-28 08:00:24,390 INFO  [main] index.IndexCollection (IndexCollection.java:237) - Using lan

In [1]:
data_folder_root = "/media/cse/HDD/SamiKhan/anwesha"
dataset_name = "anwesha-travel"
qrels_path = f"{data_folder_root}/{dataset_name}-qrels.txt"
run_path = f"{data_folder_root}/runs/run.{dataset_name}.bm25.txt"

!python -m pyserini.eval.trec_eval \
  -c -M 100 -m ndcg_cut.10 -m recall.100 {qrels_path} \
  {run_path}

Downloading https://search.maven.org/remotecontent?filepath=uk/ac/gla/dcs/terrierteam/jtreceval/0.0.5/jtreceval-0.0.5-jar-with-dependencies.jar to /home/cse/.cache/pyserini/eval/jtreceval-0.0.5-jar-with-dependencies.jar...
/home/cse/.cache/pyserini/eval/jtreceval-0.0.5-jar-with-dependencies.jar already exists!
Skipping download.
Running command: ['java', '-jar', '/home/cse/.cache/pyserini/eval/jtreceval-0.0.5-jar-with-dependencies.jar', '-c', '-M', '100', '-m', 'ndcg_cut.10', '-m', 'recall.100', '/media/cse/HDD/SamiKhan/anwesha/anwesha-travel-qrels.txt', '/media/cse/HDD/SamiKhan/anwesha/runs/run.anwesha-travel.bm25.txt']
Results:
recall_100            	all	0.9853
ndcg_cut_10           	all	0.7322
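
For readers unfamiliar with the two reported metrics, the following sketch shows how Recall@100 and nDCG@10 can be computed from a run and a set of relevance judgments. It is an illustrative reimplementation, not the code `trec_eval` runs; it assumes non-negative relevance grades, and the helper names and toy data are hypothetical. Averaging over every query in the qrels mirrors the `-c` flag, and the truncation mirrors `-M 100`.

```python
import math

def recall_at_k(ranked, rels, k=100):
    """Fraction of relevant documents that appear in the top k."""
    relevant = {d for d, r in rels.items() if r > 0}
    if not relevant:
        return 0.0
    return len(relevant & set(ranked[:k])) / len(relevant)

def ndcg_at_k(ranked, rels, k=10):
    """DCG of the top k divided by the DCG of an ideal ranking."""
    dcg = sum(rels.get(d, 0) / math.log2(i + 2) for i, d in enumerate(ranked[:k]))
    ideal = sorted((r for r in rels.values() if r > 0), reverse=True)[:k]
    idcg = sum(r / math.log2(i + 2) for i, r in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

def evaluate(run, qrels, max_docs=100):
    """Average both metrics over all queries in qrels (missing queries score 0)."""
    recalls, ndcgs = [], []
    for qid, rels in qrels.items():
        ranked = run.get(qid, [])[:max_docs]
        recalls.append(recall_at_k(ranked, rels, 100))
        ndcgs.append(ndcg_at_k(ranked, rels, 10))
    return sum(recalls) / len(qrels), sum(ndcgs) / len(qrels)

# Toy judgments and run (hypothetical data).
qrels = {"q1": {"d1": 1, "d2": 1}, "q2": {"d5": 1}}
run = {"q1": ["d3", "d1", "d4"]}  # q2 returned nothing
mean_recall, mean_ndcg = evaluate(run, qrels)
```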


## Training Reranker on Mr. TyDi Bangla Dataset

In [18]:
import os
os.environ["WANDB_DISABLED"] = "true"
# Read the access token from the HF_TOKEN environment variable instead of hard-coding it in the notebook.
!huggingface-cli login --token "$HF_TOKEN"

!CUDA_VISIBLE_DEVICES=0 python train_reranker.py \
  --output_dir reranker_mrtydi-bn \
  --model_name_or_path bert-base-multilingual-uncased \
  --dataset_name castorini/mr-tydi:bengali \
  --fp16 \
  --save_strategy no \
  --per_device_train_batch_size 8 \
  --train_n_passages 8 \
  --learning_rate 5e-6 \
  --q_max_len 64 \
  --p_max_len 256 \
  --num_train_epochs 3 \
  --logging_steps 500 \
  --overwrite_output_dir \
  --eval_dataset_name anwesha-news \
  --collection_path /media/cse/HDD/SamiKhan/anwesha/anwesha-news-collection/anwesha-news-collection.jsonl \
  --topics_path /media/cse/HDD/SamiKhan/anwesha/anwesha-news-topics.tsv \
  --qrels_path /media/cse/HDD/SamiKhan/anwesha/anwesha-news-qrels.txt \
  --retrieval_results /media/cse/HDD/SamiKhan/anwesha/runs/run.anwesha-news.bm25.txt 

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /home/cse/.cache/huggingface/token
Login successful
Using the `WAND_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
01/30/2024 17:15:59 - INFO - __main__ -   Training/evaluation parameters TevatronTrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
bf16=False,
bf16_full_eval=False,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_encode=False,
do_eval=False,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_steps=None,
evaluation_strategy=IntervalStrategy.NO,
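
The flag `--train_n_passages 8` suggests that each training query is grouped with 8 passages. A common setup for this kind of reranker (and an assumption here, since `train_reranker.py` itself is not shown) is one positive passage plus seven negatives per query, trained with a softmax cross-entropy loss over the group. A minimal sketch of that loss for a single query, with a hypothetical function name:

```python
import math

def group_cross_entropy(scores, positive_index=0):
    """Softmax cross-entropy over one query's group of passage scores.

    Assumes exactly one positive passage, located at positive_index.
    """
    m = max(scores)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum_exp - scores[positive_index]

# With 8 equal scores the model is guessing uniformly: the loss is log(8).
uniform_loss = group_cross_entropy([0.0] * 8)
# Scoring the positive higher than the 7 negatives lowers the loss.
confident_loss = group_cross_entropy([5.0] + [0.0] * 7)
```

Under this assumption, increasing `--train_n_passages` gives each query more negatives to contrast against, at the cost of more memory per batch.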

## Evaluate reranker (BM25 + MonoBERT)

In [23]:
data_folder_root = "/media/cse/HDD/SamiKhan/anwesha"
model_name_or_path = "reranker_mrtydi-bn"
dataset_name = "anwesha-news"

!python reranker_eval.py \
    --collection_path {data_folder_root}/{dataset_name}-collection/{dataset_name}-collection.jsonl \
    --topics_path {data_folder_root}/{dataset_name}-topics.tsv \
    --qrels_path {data_folder_root}/{dataset_name}-qrels.txt \
    --retrieval_results {data_folder_root}/runs/run.{dataset_name}.bm25.txt \
    --model_name_or_path {model_name_or_path} \
    --output_save_path run.{dataset_name}.monobert.txt \
    --fp16 True \
    --per_device_eval_batch_size 64 \
    --dataloader_num_workers 12

100%|████████████████████████████████████| 134/134 [00:00<00:00, 1724039.07it/s]
100%|████████████████████████████████████████| 133/133 [00:01<00:00, 126.69it/s]
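
Conceptually, the evaluation step rescores each query's BM25 candidates and writes them back out in TREC run format (`qid Q0 docid rank score tag`). The sketch below illustrates that flow; it is not the contents of `reranker_eval.py`. The `overlap_score` function is a toy stand-in for the cross-encoder, and all names and data are hypothetical.

```python
def rerank(queries, candidates, score_fn, tag="rerank"):
    """Rescore first-stage candidates and emit TREC-format run lines.

    queries: dict qid -> query text
    candidates: dict qid -> list of (docid, passage text) from the first stage
    score_fn: callable (query, passage) -> float; a cross-encoder in practice
    """
    run_lines = []
    for qid, docs in candidates.items():
        scored = [(docid, score_fn(queries[qid], text)) for docid, text in docs]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        for rank, (docid, score) in enumerate(scored, start=1):
            run_lines.append(f"{qid} Q0 {docid} {rank} {score:.4f} {tag}")
    return run_lines

def overlap_score(query, passage):
    """Toy relevance score: fraction of query tokens found in the passage."""
    q_tokens, p_tokens = set(query.split()), set(passage.split())
    return len(q_tokens & p_tokens) / len(q_tokens)

# Toy input (hypothetical data).
queries = {"q1": "sea beach travel"}
candidates = {
    "q1": [
        ("d2", "city traffic news"),
        ("d1", "sea beach travel guide"),
        ("d3", "travel tips"),
    ]
}
run_lines = rerank(queries, candidates, overlap_score)
```

The resulting lines can be written to a file such as `run.{dataset_name}.monobert.txt` and scored with the same `trec_eval` command used for the BM25 run, which makes the two stages directly comparable.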
