# <ins> Milestone 2 </ins>

## Information

The following code is meant to run inside the "pyterrier" Docker image and will not work if you execute it directly in a local notebook.

Please follow these steps to create the needed output:

### Pre-condition

1. In your terminal, navigate to the "milestone2" folder and start "Docker Desktop".

2. Pull the "milestone1" image that contains the qrels:
    ```
    docker pull registry.webis.de/code-research/tira/tira-user-ir-lab-sose-2023-dnc-limited/ir-datasets:0.0.1
    ```
3. Run the tira-run command for "milestone1" to export the dataset that the following steps use as input:
    ```
    tira-run --output-directory ${PWD}/iranthology-dataset-tira --image registry.webis.de/code-research/tira/tira-user-ir-lab-sose-2023-dnc-limited/ir-datasets:0.0.1 --allow-network true --command '/irds_cli.sh --ir_datasets_id iranthology-dnc-limited --output_dataset_path $outputDir'
    ```
   
### Milestone 2

4. Build an image for the current milestone:
    ```
    docker build -t milestone2 .
    ```
5. Run this image:
    ```
    docker run -p 8888:8888 --rm -ti -v ${PWD}:/workspace --entrypoint jupyter milestone2  notebook --allow-root --ip 0.0.0.0
    ```
6. Execute the notebook the way TIRA would:
    ```
    tira-run --input-directory ${PWD}/iranthology-dataset-tira --output-directory ${PWD}/bm25-output --image milestone2 --command '/workspace/run-pyterrier-notebook.py --input $inputDataset --output $outputDir --notebook /workspace/dnc-limited-notebook-milestone2.ipynb'
    ```
7. Render results:
    ```
    tira-run --output-directory ${PWD}/bm25-output --image registry.webis.de/code-research/tira/tira-user-ir-lab-sose-2023-dnc-limited/ir-datasets:0.0.1 --allow-network true --command 'diffir --dataset iranthology-dnc-limited --web $outputDir/run.txt > $outputDir/run.html'
    ```
8. Evaluate the effectiveness of the baseline on your relevance judgments:
    ```
    tira-run --input-directory ${PWD}/bm25-output --image registry.webis.de/code-research/tira/tira-user-ir-lab-sose-2023-dnc-limited/ir-datasets:0.0.1 --allow-network true --command 'ir_measures iranthology-dnc-limited $inputDataset/run.txt nDCG@10 MRR P@10 Recall@100'
    ```
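
To make the measures requested in step 8 more concrete, the following is a minimal sketch of how P@k, Recall@k, and the reciprocal rank could be computed for a single query with binary judgments. The function names and the toy document ids are hypothetical and are not part of `ir_measures`, which also handles graded relevance (needed for nDCG) and averaging over all queries (MRR is the mean of the reciprocal ranks).

```python
def precision_at_k(ranking, relevant, k=10):
    """Fraction of the top-k ranked documents that are relevant."""
    return sum(1 for docno in ranking[:k] if docno in relevant) / k


def recall_at_k(ranking, relevant, k=100):
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranking[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranking, relevant):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for position, docno in enumerate(ranking, start=1):
        if docno in relevant:
            return 1.0 / position
    return 0.0


# Hypothetical toy data: these ids are made up for illustration only.
ranking = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d7", "d4", "d8"}

p_at_5 = precision_at_k(ranking, relevant, k=5)
recall_at_5 = recall_at_k(ranking, relevant, k=5)
rr = reciprocal_rank(ranking, relevant)
```

Here two of the five retrieved documents are relevant, two of the three relevant documents are found, and the first relevant document sits at rank 2.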

## Code

In [1]:
import pyterrier as pt
import pandas as pd
from tira.third_party_integrations import ensure_pyterrier_is_loaded, get_input_directory_and_output_directory, persist_and_normalize_run
import json
from tqdm import tqdm

ensure_pyterrier_is_loaded()
input_directory, output_directory = get_input_directory_and_output_directory('./iranthology-dataset-tira')

Start PyTerrier with version=5.7, helper_version=0.0.7, no_download=True


PyTerrier 0.9.2 has loaded Terrier 5.7 (built by craigm on 2022-11-10 18:30) and terrier-helper 0.0.7

No etc/terrier.properties, using terrier.default.properties for bootstrap configuration.


I will use a small hardcoded example located in ./iranthology-dataset-tira.
The output directory is /tmp/



In [2]:
!ls -lha {input_directory}

total 115M
drwxrwxrwx 1 root root 4.0K May 28 18:26 .
drwxrwxrwx 1 root root 4.0K May 28 20:48 ..
-rw-r--r-- 1 root root 115M May 28 18:26 documents.jsonl
-rw-r--r-- 1 root root   46 May 28 18:26 metadata.json
-rw-r--r-- 1 root root 3.0K May 28 18:26 queries.jsonl
-rw-r--r-- 1 root root 3.8K May 28 18:26 queries.xml


The input directory contains the files listed above: the documents, the queries in two formats, and a metadata file.

In [3]:
queries = pt.io.read_topics(input_directory + '/queries.xml', format='trecxml')

# Parse one JSON document per line; the with-block closes the file afterwards.
with open(input_directory + '/documents.jsonl', 'r') as f:
    documents = [json.loads(line) for line in f]

Load Data
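
Since `documents.jsonl` is over 100 MB, it could also be streamed instead of loaded into a list. The following is a minimal sketch of that idea; `iter_jsonl` is a hypothetical helper, and the two sample lines with `docno` and `text` fields are an assumption about the file layout, not a copy of the real data.

```python
import io
import json


def iter_jsonl(lines):
    """Yield one parsed JSON object per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Hypothetical two-line sample standing in for documents.jsonl.
sample = io.StringIO(
    '{"docno": "doc-1", "text": "learning to rank"}\n'
    '{"docno": "doc-2", "text": "query expansion"}\n'
)
documents = list(iter_jsonl(sample))
```

With a real file, the same generator can be fed an open file handle, so that only one document is held in memory at a time while iterating.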

In [4]:
!rm -Rf ./index
iter_indexer = pt.IterDictIndexer("./index", meta={'docno' : 100})
index_ref = iter_indexer.index(tqdm(documents))

100%|███████████████████████████████████████████████████████████████████████████| 53673/53673 [00:10<00:00, 5021.94it/s]


Create the Index.
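
As a rough illustration of what an index stores, here is a minimal sketch of an inverted index built with plain whitespace tokenization. `build_inverted_index` and the toy documents are hypothetical; the Terrier index created above is far more elaborate (it typically also applies steps such as stopword removal and stemming and stores the postings in a compressed on-disk format).

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each term to {docno: term frequency}, using whitespace tokenization."""
    index = defaultdict(dict)
    for doc in documents:
        for term in doc["text"].lower().split():
            postings = index[term]
            postings[doc["docno"]] = postings.get(doc["docno"], 0) + 1
    return index


# Hypothetical toy documents, made up for illustration only.
docs = [
    {"docno": "doc-1", "text": "machine learning for ranking"},
    {"docno": "doc-2", "text": "ranking with ranking functions"},
]
index = build_inverted_index(docs)
```

Looking up `index["ranking"]` then returns the documents that contain the term together with how often it occurs in each.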

In [5]:
bm25 = pt.BatchRetrieve(index_ref, wmodel="BM25", verbose=True)

Create Retrieval Pipeline
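
For readers unfamiliar with BM25, the following is a minimal sketch of a common textbook variant of the scoring function. It is not Terrier's implementation: `bm25_score`, the parameter defaults `k1=1.2` and `b=0.75`, the IDF smoothing, and the toy collection are assumptions for illustration, so the scores will not match the ones in the run produced by this notebook.

```python
import math


def bm25_score(query_terms, doc_terms, doc_freqs, num_docs, avg_doc_len,
               k1=1.2, b=0.75):
    """Score one document for a query with a textbook BM25 variant."""
    score = 0.0
    doc_len = len(doc_terms)
    for term in query_terms:
        tf = doc_terms.count(term)    # term frequency in this document
        df = doc_freqs.get(term, 0)   # number of documents containing the term
        if tf == 0 or df == 0:
            continue
        # Smoothed IDF that stays positive even for very frequent terms.
        idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
        # Length normalization: longer documents need more occurrences.
        norm = tf + k1 * (1 - b + b * doc_len / avg_doc_len)
        score += idf * tf * (k1 + 1) / norm
    return score


# Hypothetical toy collection with two documents of three terms each.
doc_a = ["bm25", "ranking", "bm25"]
doc_b = ["neural", "ranking", "models"]
doc_freqs = {"bm25": 1, "ranking": 2, "neural": 1, "models": 1}

query = ["bm25", "ranking"]
score_a = bm25_score(query, doc_a, doc_freqs, num_docs=2, avg_doc_len=3.0)
score_b = bm25_score(query, doc_b, doc_freqs, num_docs=2, avg_doc_len=3.0)
```

In this toy example `doc_a` scores higher than `doc_b` because it matches both query terms, and the rarer term `bm25` carries more weight than `ranking`, which occurs in every document.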

In [6]:
print('Step 5: Create Run.')

run = bm25(queries)

Step 5: Create Run.


BR(BM25): 100%|████████████████████████████████████████████████████████████████████████████| 6/6 [00:00<00:00,  8.52q/s]


Create run

In [7]:
print('We look at the first 10 results of the run:\n')
run.head(10)

We look at the first 10 results of the run:



| | qid | docid | docno | rank | score | query |
|---|---|---|---|---|---|---|
| 0 | 1 | 8644 | 2007.sigirconf_conference-2007.39 | 0 | 17.033467 | machine learning for more relevant results |
| 1 | 1 | 6941 | 2012.sigirconf_conference-2012.5 | 1 | 16.888207 | machine learning for more relevant results |
| 2 | 1 | 21378 | 2009.cikm_conference-2009.190 | 2 | 16.778766 | machine learning for more relevant results |
| 3 | 1 | 24062 | 2016.cikm_conference-2016.93 | 3 | 16.194192 | machine learning for more relevant results |
| 4 | 1 | 50848 | 2007.ipm_journal-ir0anthology0volumeA43A4.3 | 4 | 16.157097 | machine learning for more relevant results |
| 5 | 1 | 22430 | 2018.cikm_conference-2018.299 | 5 | 16.146165 | machine learning for more relevant results |
| 6 | 1 | 33649 | 2018.wwwconf_conference-2018c.298 | 6 | 16.113325 | machine learning for more relevant results |
| 7 | 1 | 22940 | 2010.cikm_conference-2010.163 | 7 | 16.001429 | machine learning for more relevant results |
| 8 | 1 | 8916 | 2010.sigirconf_conference-2010.89 | 8 | 15.730135 | machine learning for more relevant results |
| 9 | 1 | 7178 | 2014.sigirconf_conference-2014.17 | 9 | 15.720376 | machine learning for more relevant results |



In [8]:
print('Step 6: Persist Run.')

persist_and_normalize_run(run, output_file=output_directory, system_name='BM25', depth=1000)

print('Done :)')

Step 6: Persist Run.
Done :)
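
The later steps read the persisted run from `run.txt`. As an illustration of what such a file can look like, here is a minimal sketch that renders results in the standard six-column TREC run layout (`qid Q0 docno rank score system`). `format_trec_run` and the toy rows are hypothetical, and it is an assumption that `persist_and_normalize_run` writes exactly this layout; in particular, ranks start at 1 here, which the actual helper may handle differently.

```python
def format_trec_run(rows, system_name):
    """Render (qid, docno, score) rows as TREC run lines, ranked per query by score."""
    by_query = {}
    for qid, docno, score in rows:
        by_query.setdefault(qid, []).append((docno, score))

    lines = []
    for qid, hits in by_query.items():
        # Highest score first; the position in this order becomes the rank.
        hits.sort(key=lambda hit: hit[1], reverse=True)
        for rank, (docno, score) in enumerate(hits, start=1):
            lines.append(f"{qid} Q0 {docno} {rank} {score} {system_name}")
    return lines


# Hypothetical toy results, made up for illustration only.
rows = [("1", "doc-b", 16.9), ("1", "doc-a", 17.0), ("2", "doc-c", 3.5)]
run_lines = format_trec_run(rows, "BM25")
```

Each resulting line describes one retrieved document, which is the information that evaluation tools need to compare a run against the relevance judgments.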

