# <ins> Milestone 3 </ins>

## Information for BM25-SDM

The following code is meant to run inside the "pyterrier" Docker image and will not work if you execute it directly in your own notebook.

Please follow these steps to create the needed output:

### Pre-condition

1. In your terminal, navigate to the "milestone3" folder and start "Docker Desktop".

2. Pull the "milestone1" image (the one that contains the qrels) with the following command:
    ```
    docker pull registry.webis.de/code-research/tira/tira-user-ir-lab-sose-2023-dnc-limited/ir-datasets:0.0.1
    ```
3. Execute the tira-run command for "milestone1" to get the needed output:
    ```
    tira-run --output-directory ${PWD}/iranthology-dataset-tira --image registry.webis.de/code-research/tira/tira-user-ir-lab-sose-2023-dnc-limited/ir-datasets:0.0.1 --allow-network true --command '/irds_cli.sh --ir_datasets_id iranthology-dnc-limited --output_dataset_path $outputDir'
    ```
   
### Milestone 3

4. Build an image for the current milestone:
    ```
    docker build -t milestone3 .
    ```
5. Run this image:
    ```
    docker run -p 8888:8888 --rm -ti -v ${PWD}:/workspace --entrypoint jupyter milestone3  notebook --allow-root --ip 0.0.0.0
    ```
6. Execute the notebook the way TIRA would:
    ```
    tira-run --input-directory ${PWD}/iranthology-dataset-tira --output-directory ${PWD}/bm25-sdm-output --image milestone3 --command '/workspace/run-pyterrier-notebook.py --input $inputDataset --output $outputDir --notebook /workspace/dnc-limited-notebook-bm25-sdm.ipynb'
    ```
7. Render results:
    ```
    tira-run --output-directory ${PWD}/bm25-sdm-output --image registry.webis.de/code-research/tira/tira-user-ir-lab-sose-2023-dnc-limited/ir-datasets:0.0.1 --allow-network true --command 'diffir --dataset iranthology-dnc-limited --web $outputDir/run.txt > $outputDir/run.html'
    ```
8. Evaluate the effectiveness of the baseline on your relevance judgments:
    ```
    tira-run --input-directory ${PWD}/bm25-sdm-output --image registry.webis.de/code-research/tira/tira-user-ir-lab-sose-2023-dnc-limited/ir-datasets:0.0.1 --allow-network true --command 'ir_measures iranthology-dnc-limited $inputDataset/run.txt nDCG@10 MRR P@10 Recall@100'
    ```
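
To make the measures requested in step 8 more concrete, the following is a minimal sketch of how P@k, the reciprocal rank (the per-query component of MRR), and nDCG@k can be computed for a single query. It is not the `ir_measures` implementation; the function names, the toy ranking, and the toy relevance judgments are assumptions made purely for illustration, and nDCG is shown in one common formulation (linear gain, log2 discount).

```python
import math

def precision_at_k(ranking, qrels, k):
    """Fraction of the top-k documents that are judged relevant (label > 0)."""
    top_k = ranking[:k]
    return sum(1 for doc in top_k if qrels.get(doc, 0) > 0) / k

def reciprocal_rank(ranking, qrels):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for position, doc in enumerate(ranking, start=1):
        if qrels.get(doc, 0) > 0:
            return 1.0 / position
    return 0.0

def ndcg_at_k(ranking, qrels, k):
    """nDCG@k with linear gain and log2 discount (one common formulation)."""
    def dcg(labels):
        return sum(label / math.log2(i + 2) for i, label in enumerate(labels))
    gains = [qrels.get(doc, 0) for doc in ranking[:k]]
    ideal = sorted(qrels.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical toy data: document ids with graded relevance labels.
toy_qrels = {"d1": 2, "d2": 0, "d3": 1}
toy_ranking = ["d2", "d1", "d4", "d3"]  # "d4" is unjudged and counts as 0

p_at_2 = precision_at_k(toy_ranking, toy_qrels, 2)
rr = reciprocal_rank(toy_ranking, toy_qrels)
ndcg_at_4 = ndcg_at_k(toy_ranking, toy_qrels, 4)
```

MRR is then the mean of the reciprocal ranks over all queries, and Recall@100 would divide the number of relevant documents in the top 100 by the total number of relevant documents for the query.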

## Code

In [1]:
import pyterrier as pt
import pandas as pd
from tira.third_party_integrations import ensure_pyterrier_is_loaded, persist_and_normalize_run
import json
from tqdm import tqdm
import os

Import all necessary modules.

In [2]:
ensure_pyterrier_is_loaded()

#Using the code of the "get_input_directory_and_output_directory" function from "tira.third_party_integrations" to set the input and output directory
default_input = './iranthology-dataset-tira'
default_output = '/tmp/'

input_directory = os.environ.get('TIRA_INPUT_DIRECTORY', None)
if not input_directory:
    input_directory = default_input

output_directory = os.environ.get('TIRA_OUTPUT_DIRECTORY', default_output)

print(f'I will read the input data from {input_directory}.')
print(f'The output directory is {output_directory}')


Start PyTerrier with version=5.7, helper_version=0.0.7, no_download=True


PyTerrier 0.9.2 has loaded Terrier 5.7 (built by craigm on 2022-11-10 18:30) and terrier-helper 0.0.7

No etc/terrier.properties, using terrier.default.properties for bootstrap configuration.


I will read the input data from ./iranthology-dataset-tira.
The output directory is /tmp/


Ensure that PyTerrier is loaded and set the input and output directories.
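
The fallback logic of this cell can be isolated in a small helper, which makes it easy to check both cases (running locally and running inside TIRA). This is only a sketch: the helper name `resolve_directories`, its `environ` parameter, and the `/in` and `/out` paths are hypothetical and not part of the `tira` package.

```python
import os

def resolve_directories(default_input, default_output, environ=None):
    """Return (input_directory, output_directory), preferring environment variables."""
    environ = os.environ if environ is None else environ
    # `or` also falls back to the default when the variable is set but empty.
    input_directory = environ.get('TIRA_INPUT_DIRECTORY') or default_input
    output_directory = environ.get('TIRA_OUTPUT_DIRECTORY', default_output)
    return input_directory, output_directory

# Outside TIRA (no variables set) the defaults are used.
local_dirs = resolve_directories('./iranthology-dataset-tira', '/tmp/', environ={})

# Inside TIRA the environment variables take precedence (paths are hypothetical).
tira_dirs = resolve_directories(
    './iranthology-dataset-tira', '/tmp/',
    environ={'TIRA_INPUT_DIRECTORY': '/in', 'TIRA_OUTPUT_DIRECTORY': '/out'},
)
```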

In [3]:
print("The input directory contains the following files:")
!ls -lha {input_directory}

The input directory contains the following files:
total 115M
drwxrwxrwx 1 root root 4.0K Jun 21 13:59 .
drwxrwxrwx 1 root root 4.0K Jun 21 17:20 ..
-rw-r--r-- 1 root root 115M Jun 21 13:59 documents.jsonl
-rw-r--r-- 1 root root   46 Jun 21 13:59 metadata.json
-rw-r--r-- 1 root root 3.0K Jun 21 13:59 queries.jsonl
-rw-r--r-- 1 root root 3.8K Jun 21 13:59 queries.xml


Checking the input directory.

In [4]:
pd.set_option('display.max_colwidth', 150) #Setting the maximum width for better visibility

docs_df = pd.read_json(f'{input_directory}/documents.jsonl', lines=True) #Load the textual documents into a dataframe

print("Viewing the first 5 documents from the imported documents.jsonl:")
docs_df.head(5)

Viewing the first 5 documents from the imported documents.jsonl:


| | docno | text | original_document |
|---|---|---|---|
| 0 | 2019.sigirconf_workshop-2019birndl.0 | CEUR Workshop Proceedings 2414 CEUR-WS.org 2019 http://ceur-ws.org/Vol-2414 urn:nbn:de:0074-2414-3 https://dblp.org/rec/conf/sigir/2019birndl.bib ... | {'doc_id': '2019.sigirconf_workshop-2019birndl.0', 'text': 'CEUR Workshop Proceedings 2414 CEUR-WS.org 2019 http://ceur-ws.org/Vol-2414 urn:nbn:de... |
| 1 | 2019.sigirconf_workshop-2019birndl.1 | DBLP:conf/sigir/2019birndl Proceedings of the 4th Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing fo... | {'doc_id': '2019.sigirconf_workshop-2019birndl.1', 'text': 'DBLP:conf/sigir/2019birndl Proceedings of the 4th Joint Workshop on Bibliometric-enhan... |
| 2 | 2019.sigirconf_workshop-2019birndl.2 | DBLP:conf/sigir/2019birndl Proceedings of the 4th Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing fo... | {'doc_id': '2019.sigirconf_workshop-2019birndl.2', 'text': 'DBLP:conf/sigir/2019birndl Proceedings of the 4th Joint Workshop on Bibliometric-enhan... |
| 3 | 2019.sigirconf_workshop-2019birndl.3 | DBLP:conf/sigir/2019birndl Proceedings of the 4th Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing fo... | {'doc_id': '2019.sigirconf_workshop-2019birndl.3', 'text': 'DBLP:conf/sigir/2019birndl Proceedings of the 4th Joint Workshop on Bibliometric-enhan... |
| 4 | 2019.sigirconf_workshop-2019birndl.4 | DBLP:conf/sigir/2019birndl Proceedings of the 4th Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing fo... | {'doc_id': '2019.sigirconf_workshop-2019birndl.4', 'text': 'DBLP:conf/sigir/2019birndl Proceedings of the 4th Joint Workshop on Bibliometric-enhan... |


Viewing the "documents.jsonl" data.

In [5]:
queries = pt.io.read_topics(input_directory + '/queries.xml', format='trecxml') #Extracting the queries in trecxml format

#Viewing the queries
print("The loaded queries are:")
print(queries)

#Extracting the documents, one JSON object per line (please refer to the code cell above to get a glimpse of their content)
with open(input_directory + '/documents.jsonl', 'r') as documents_file:
    documents = [json.loads(line) for line in documents_file]

The loaded queries are:
  qid                                        query
0   1   machine learning for more relevant results
1   2     crawling websites using machine learning
2   3              recommenders influence on users
3   4                search engine caching effects
4   5                     consumer product reviews
5   6                 limitations machine learning


Loading all needed data (queries and documents)
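
For readers unfamiliar with the JSON Lines layout of `documents.jsonl`, the sketch below writes a tiny stand-in file and reads it back the same way the cell above does. The file content is hypothetical and only mimics the `docno` and `text` fields seen in the dataframe preview; the helper name `read_jsonl` is likewise an assumption.

```python
import json
import os
import tempfile

def read_jsonl(path):
    """Parse one JSON object per non-empty line."""
    with open(path, 'r', encoding='utf-8') as handle:
        return [json.loads(line) for line in handle if line.strip()]

# Hypothetical two-document file that only mimics the `docno`/`text` fields.
sample_lines = [
    {'docno': 'doc-1', 'text': 'machine learning for retrieval'},
    {'docno': 'doc-2', 'text': 'search engine caching'},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, 'documents.jsonl')
    with open(sample_path, 'w', encoding='utf-8') as handle:
        for entry in sample_lines:
            handle.write(json.dumps(entry) + '\n')
    sample_documents = read_jsonl(sample_path)
```

Each parsed line becomes one dictionary, which is the shape the indexing cell below passes to the indexer.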

In [6]:
!rm -Rf ./index #If the index folder exists, then delete it and its content

iter_indexer = pt.IterDictIndexer("./index", meta={'docno' : 100}, blocks=True) #Creating the index
index_ref = iter_indexer.index(tqdm(documents)) #Using a progressbar for visibility

100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 53673/53673 [00:13<00:00, 3848.61it/s]


Create the Index.
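
The `blocks=True` argument asks Terrier to record term positions (the index statistics further below report `Positions: true`), which proximity-based models such as SDM rely on. The toy sketch below illustrates the idea of a positional inverted index. It is not how Terrier stores its index: tokenization is plain whitespace splitting, and all function names and documents are hypothetical.

```python
from collections import defaultdict

def build_positional_index(documents):
    """Map term -> {docno: [positions]} using naive whitespace tokenization."""
    index = defaultdict(dict)
    for document in documents:
        for position, term in enumerate(document['text'].lower().split()):
            index[term].setdefault(document['docno'], []).append(position)
    return index

def contains_adjacent(index, first, second, docno):
    """True if `second` occurs directly after `first` in the given document."""
    first_positions = index.get(first, {}).get(docno, [])
    second_positions = set(index.get(second, {}).get(docno, []))
    return any(position + 1 in second_positions for position in first_positions)

toy_documents = [
    {'docno': 'doc-1', 'text': 'machine learning improves search'},
    {'docno': 'doc-2', 'text': 'learning about the machine'},
]
toy_index = build_positional_index(toy_documents)

# Both documents contain both terms, but only doc-1 contains them as a phrase.
phrase_in_doc1 = contains_adjacent(toy_index, 'machine', 'learning', 'doc-1')
phrase_in_doc2 = contains_adjacent(toy_index, 'machine', 'learning', 'doc-2')
```

Without stored positions, the two toy documents would be indistinguishable for the phrase "machine learning", which is why the index is built with positional information here.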

In [7]:
sdm = pt.rewrite.SDM()
bm25 = pt.BatchRetrieve(index_ref, wmodel="BM25", verbose=True)

retrieval_pipeline = sdm >> bm25

bm25.search("Limitations machine learning") #Sanity check: viewing the docid and score for the query "Limitations machine learning" using plain BM25 (without the SDM rewrite)

BR(BM25): 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.58q/s]


| | qid | docid | docno | rank | score | query |
|---|---|---|---|---|---|---|
| 0 | 1 | 15379 | 2021.wsdm_conference-2021.78 | 0 | 17.374985 | Limitations machine learning |
| 1 | 1 | 15457 | 2021.wsdm_conference-2021.156 | 1 | 17.300120 | Limitations machine learning |
| 2 | 1 | 15517 | 2017.wsdm_conference-2017.57 | 2 | 16.986565 | Limitations machine learning |
| 3 | 1 | 32063 | 2020.wwwconf_conference-2020c.108 | 3 | 16.938821 | Limitations machine learning |
| 4 | 1 | 14952 | 2018.wsdm_conference-2018.3 | 4 | 16.905845 | Limitations machine learning |
| ... | ... | ... | ... | ... | ... | ... |
| 995 | 1 | 38903 | 2018.tist_journal-ir0anthology0volumeA9A2.4 | 995 | 7.355572 | Limitations machine learning |
| 996 | 1 | 22895 | 2010.cikm_conference-2010.118 | 996 | 7.347620 | Limitations machine learning |
| 997 | 1 | 30508 | 2015.wwwconf_conference-2015c.256 | 997 | 7.347620 | Limitations machine learning |
| 998 | 1 | 416 | 2011.ntcir_workshop-2011.2 | 998 | 7.338885 | Limitations machine learning |


Create the retrieval pipeline: SDM rewrites each query, and BM25 then retrieves and scores the documents.
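
SDM (the sequential dependence model) scores a document not only on the individual query terms but also on neighbouring query terms occurring close together; the `#1(machine learning)` fragment visible in the run output further below is one such term pair. The sketch below only illustrates which term pairs such a rewrite considers. It is not the code of `pt.rewrite.SDM`, and the function names, the `#uw8` operator string, and the window size of 8 are assumptions made for illustration.

```python
def adjacent_pairs(terms):
    """All pairs of neighbouring query terms, in query order."""
    return list(zip(terms, terms[1:]))

def sketch_sdm_parts(query, window=8):
    """Split a query into the three SDM evidence types (illustrative only)."""
    terms = query.lower().split()
    pairs = adjacent_pairs(terms)
    return {
        # Individual terms, scored as in a plain bag-of-words query.
        'unigrams': terms,
        # Ordered: both terms must appear next to each other, in this order.
        'ordered': [f'#1({a} {b})' for a, b in pairs],
        # Unordered: both terms within `window` tokens, in any order.
        'unordered': [f'#uw{window}({a} {b})' for a, b in pairs],
    }

# The stopword-free form of query 1 as it appears in the run output.
parts = sketch_sdm_parts('machine learning relevant results')
```

A query with n terms yields n - 1 neighbouring pairs, so single-term queries gain no proximity evidence from this kind of rewrite.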

In [8]:
index = pt.IndexFactory.of(index_ref)
print("Index information:")
print(index.getCollectionStatistics().toString())

Index information:
Number of documents: 53673
Number of terms: 138650
Number of postings: 3931406
Number of fields: 1
Number of tokens: 6333284
Field names: [text]
Positions:   true



Get the collection statistics of the created index.

In [9]:
run = retrieval_pipeline(queries)

BR(BM25): 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:00<00:00,  6.20q/s]


Create the run by applying the retrieval pipeline to all queries.

In [10]:
print('We look at the first 10 results of the run:\n')
run.head(10)

We look at the first 10 results of the run:



| | qid | docid | docno | rank | score | query | query_0 |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 8644 | 2007.sigirconf_conference-2007.39 | 0 | 17.399095 | machine learning relevant results #combine:0=0.1:wmodel=org.terrier.matching.models.dependence.pBiL(#1(machine learning)) #combine:0=0.1:wmodel=or... | machine learning for more relevant results |
| 1 | 1 | 6941 | 2012.sigirconf_conference-2012.5 | 1 | 17.211525 | machine learning relevant results #combine:0=0.1:wmodel=org.terrier.matching.models.dependence.pBiL(#1(machine learning)) #combine:0=0.1:wmodel=or... | machine learning for more relevant results |
| 2 | 1 | 21378 | 2009.cikm_conference-2009.190 | 2 | 17.166249 | machine learning relevant results #combine:0=0.1:wmodel=org.terrier.matching.models.dependence.pBiL(#1(machine learning)) #combine:0=0.1:wmodel=or... | machine learning for more relevant results |
| 3 | 1 | 24062 | 2016.cikm_conference-2016.93 | 3 | 16.689386 | machine learning relevant results #combine:0=0.1:wmodel=org.terrier.matching.models.dependence.pBiL(#1(machine learning)) #combine:0=0.1:wmodel=or... | machine learning for more relevant results |
| 4 | 1 | 22430 | 2018.cikm_conference-2018.299 | 4 | 16.534006 | machine learning relevant results #combine:0=0.1:wmodel=org.terrier.matching.models.dependence.pBiL(#1(machine learning)) #combine:0=0.1:wmodel=or... | machine learning for more relevant results |
| 5 | 1 | 33649 | 2018.wwwconf_conference-2018c.298 | 5 | 16.510985 | machine learning relevant results #combine:0=0.1:wmodel=org.terrier.matching.models.dependence.pBiL(#1(machine learning)) #combine:0=0.1:wmodel=or... | machine learning for more relevant results |
| 6 | 1 | 50848 | 2007.ipm_journal-ir0anthology0volumeA43A4.3 | 6 | 16.300946 | machine learning relevant results #combine:0=0.1:wmodel=org.terrier.matching.models.dependence.pBiL(#1(machine learning)) #combine:0=0.1:wmodel=or... | machine learning for more relevant results |
| 7 | 1 | 22940 | 2010.cikm_conference-2010.163 | 7 | 16.217081 | machine learning relevant results #combine:0=0.1:wmodel=org.terrier.matching.models.dependence.pBiL(#1(machine learning)) #combine:0=0.1:wmodel=or... | machine learning for more relevant results |
| 8 | 1 | 7178 | 2014.sigirconf_conference-2014.17 | 8 | 16.137815 | machine learning relevant results #combine:0=0.1:wmodel=org.terrier.matching.models.dependence.pBiL(#1(machine learning)) #combine:0=0.1:wmodel=or... | machine learning for more relevant results |
| 9 | 1 | 8916 | 2010.sigirconf_conference-2010.89 | 9 | 15.973994 | machine learning relevant results #combine:0=0.1:wmodel=org.terrier.matching.models.dependence.pBiL(#1(machine learning)) #combine:0=0.1:wmodel=or... | machine learning for more relevant results |


In [11]:
persist_and_normalize_run(run, output_file=output_directory, system_name='BM25-SDM', depth=1000)

print("Persisted and normalized the run.")

Persisted and normalized the run.


Persist and normalize the run.
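
The exact normalization applied by `persist_and_normalize_run` is not shown in this document. As a rough illustration of what a persisted run usually looks like, the sketch below formats results as a standard six-column TREC run file (`qid Q0 docno rank score tag`). The function name, the tie-breaking by `docno`, the ranks starting at 1, and the toy results are all assumptions, not the behaviour of the `tira` package.

```python
def to_trec_run_lines(results, system_name, depth=1000):
    """Format (qid, docno, score) tuples as TREC run lines, best score first."""
    by_query = {}
    for qid, docno, score in results:
        by_query.setdefault(qid, []).append((docno, score))

    lines = []
    for qid in sorted(by_query):
        # Sort by descending score; ties are broken by docno for determinism.
        ranked = sorted(by_query[qid], key=lambda item: (-item[1], item[0]))
        # Keep at most `depth` documents per query and number them from 1.
        for rank, (docno, score) in enumerate(ranked[:depth], start=1):
            lines.append(f'{qid} Q0 {docno} {rank} {score} {system_name}')
    return lines

# Hypothetical results for two queries.
toy_results = [
    ('1', 'doc-b', 12.5),
    ('1', 'doc-a', 17.0),
    ('2', 'doc-c', 9.25),
]
run_lines = to_trec_run_lines(toy_results, system_name='BM25-SDM', depth=1)
```

With `depth=1` only the best document per query is kept; the notebook uses `depth=1000`.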