# <ins> Milestone 3 </ins>

## Information for BM25

The following code is intended to run inside the "pyterrier" Docker image and will not work if you execute it directly in your own notebook.

Please follow these steps to create the needed output:

### Pre-condition

1. In your terminal, navigate to the "milestone3" folder and start "Docker Desktop".

2. Pull the "milestone1" image, which contains the qrels:
    ```
    docker pull registry.webis.de/code-research/tira/tira-user-ir-lab-sose-2023-dnc-limited/ir-datasets:0.0.1
    ```
3. Execute the tira-run command for "milestone1" to get the required output:
    ```
    tira-run --output-directory ${PWD}/iranthology-dataset-tira --image registry.webis.de/code-research/tira/tira-user-ir-lab-sose-2023-dnc-limited/ir-datasets:0.0.1 --allow-network true --command '/irds_cli.sh --ir_datasets_id iranthology-dnc-limited --output_dataset_path $outputDir'
    ```
   
### Milestone 3

4. Build an image for the current milestone:
    ```
    docker build -t milestone3 .
    ```
5. Run this image:
    ```
    docker run -p 8888:8888 --rm -ti -v ${PWD}:/workspace --entrypoint jupyter milestone3 notebook --allow-root --ip 0.0.0.0
    ```
6. Execute this notebook as TIRA would:
    ```
    tira-run --input-directory ${PWD}/iranthology-dataset-tira --output-directory ${PWD}/bm25-multi-field-output --image milestone3 --command '/workspace/run-pyterrier-notebook.py --input $inputDataset --output $outputDir --notebook /workspace/dnc-limited-notebook-bm25-multi-field.ipynb'
    ```
7. Render results:
    ```
    tira-run --output-directory ${PWD}/bm25-multi-field-output --image registry.webis.de/code-research/tira/tira-user-ir-lab-sose-2023-dnc-limited/ir-datasets:0.0.1 --allow-network true --command 'diffir --dataset iranthology-dnc-limited --web $outputDir/run.txt > $outputDir/run.html'
    ```
8. Evaluate the effectiveness of the baseline on your relevance judgments:
    ```
    tira-run --input-directory ${PWD}/bm25-multi-field-output --image registry.webis.de/code-research/tira/tira-user-ir-lab-sose-2023-dnc-limited/ir-datasets:0.0.1 --allow-network true --command 'ir_measures iranthology-dnc-limited $inputDataset/run.txt nDCG@10 MRR P@10 Recall@100'
    ```
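
To make the measures requested in step 8 more concrete, here is a minimal sketch of how P@k, Recall@k, and the reciprocal rank could be computed for a single query. The function names and the toy ranking are assumptions for illustration only; `ir_measures` performs the actual evaluation and averages the values over all queries (MRR is the mean of the per-query reciprocal ranks).

```python
def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k ranked documents that are relevant."""
    hits = sum(1 for docno in ranking[:k] if docno in relevant)
    return hits / k


def recall_at_k(ranking, relevant, k):
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for docno in ranking[:k] if docno in relevant)
    return hits / len(relevant)


def reciprocal_rank(ranking, relevant):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, docno in enumerate(ranking, start=1):
        if docno in relevant:
            return 1.0 / rank
    return 0.0


# Toy data (hypothetical document ids and judgments)
ranking = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2", "d5"}

print("P@5:", precision_at_k(ranking, relevant, 5))
print("Recall@5:", recall_at_k(ranking, relevant, 5))
print("RR:", reciprocal_rank(ranking, relevant))
```

In this toy ranking, two of the five retrieved documents are relevant and the first relevant document sits at rank 2.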

## Code

In [1]:
import pyterrier as pt
import pandas as pd
from tira.third_party_integrations import ensure_pyterrier_is_loaded, persist_and_normalize_run
import json
from tqdm import tqdm
import os

Importing all necessary modules

In [2]:
ensure_pyterrier_is_loaded()

# Set the input and output directories, following the "get_input_directory_and_output_directory" function from "tira.third_party_integrations"
default_input = './iranthology-dataset-tira'
default_output = '/tmp/'

input_directory = os.environ.get('TIRA_INPUT_DIRECTORY', None)
if not input_directory:
    input_directory = default_input

output_directory = os.environ.get('TIRA_OUTPUT_DIRECTORY', default_output)

print(f'I will read the input data from {input_directory}.')
print(f'The output directory is {output_directory}')


Start PyTerrier with version=5.7, helper_version=0.0.7, no_download=True


PyTerrier 0.9.2 has loaded Terrier 5.7 (built by craigm on 2022-11-10 18:30) and terrier-helper 0.0.7

No etc/terrier.properties, using terrier.default.properties for bootstrap configuration.


I will read the input data from ./iranthology-dataset-tira.
The output directory is /tmp/


Ensure that PyTerrier is loaded and set the input and output directories.
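
The directory lookup can also be written as a small helper, which makes the fallback behaviour easy to check without touching the real environment. This is only a sketch: `resolve_directories` and its `environ` parameter are hypothetical names, not part of the TIRA client.

```python
import os


def resolve_directories(default_input, default_output, environ=None):
    """Return (input_directory, output_directory) using the TIRA environment variables.

    `environ` is a hypothetical parameter that allows passing a plain dict
    instead of os.environ, e.g. for testing.
    """
    environ = os.environ if environ is None else environ
    # An unset or empty TIRA_INPUT_DIRECTORY falls back to the default
    input_directory = environ.get("TIRA_INPUT_DIRECTORY") or default_input
    output_directory = environ.get("TIRA_OUTPUT_DIRECTORY", default_output)
    return input_directory, output_directory


# Outside of TIRA (no variables set) the defaults are used
print(resolve_directories("./iranthology-dataset-tira", "/tmp/", environ={}))

# Inside TIRA the variables take precedence (paths here are made up)
tira_env = {"TIRA_INPUT_DIRECTORY": "/data/in", "TIRA_OUTPUT_DIRECTORY": "/data/out"}
print(resolve_directories("./iranthology-dataset-tira", "/tmp/", environ=tira_env))
```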

In [3]:
print("The input directory contains following files:")
!ls -lha {input_directory}

The input directory contains following files:
total 146M
drwxrwxrwx 1 root root 4.0K Jun 24 17:11 .
drwxrwxrwx 1 root root 4.0K Jun 24 17:11 ..
-rw-r--r-- 1 root root 146M Jun 24 17:11 documents.jsonl
-rw-r--r-- 1 root root   46 Jun 24 17:11 metadata.json
-rw-r--r-- 1 root root 3.0K Jun 24 17:11 queries.jsonl
-rw-r--r-- 1 root root 3.8K Jun 24 17:11 queries.xml


Checking the input directory.

In [4]:
pd.set_option('display.max_colwidth', 150)  # Widen the columns for better readability

docs_df = pd.read_json(f'{input_directory}/documents.jsonl', lines=True)  # Load the textual documents into a DataFrame

print("Viewing the first 5 documents from the imported documents.jsonl:")
docs_df.head(5)

Viewing the first 5 documents from the imported documents.jsonl:


| | docno | text | original_document |
|---|---|---|---|
| 0 | 2019.sigirconf_workshop-2019birndl.0 | CEUR Workshop Proceedings 2414 CEUR-WS.org 2019 http://ceur-ws.org/Vol-2414 urn:nbn:de:0074-2414-3 https://dblp.org/rec/conf/sigir/2019birndl.bib ... | {'doc_id': '2019.sigirconf_workshop-2019birndl.0', 'text': 'CEUR Workshop Proceedings 2414 CEUR-WS.org 2019 http://ceur-ws.org/Vol-2414 urn:nbn:de... |
| 1 | 2019.sigirconf_workshop-2019birndl.1 | DBLP:conf/sigir/2019birndl Proceedings of the 4th Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing fo... | {'doc_id': '2019.sigirconf_workshop-2019birndl.1', 'text': 'DBLP:conf/sigir/2019birndl Proceedings of the 4th Joint Workshop on Bibliometric-enhan... |
| 2 | 2019.sigirconf_workshop-2019birndl.2 | DBLP:conf/sigir/2019birndl Proceedings of the 4th Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing fo... | {'doc_id': '2019.sigirconf_workshop-2019birndl.2', 'text': 'DBLP:conf/sigir/2019birndl Proceedings of the 4th Joint Workshop on Bibliometric-enhan... |
| 3 | 2019.sigirconf_workshop-2019birndl.3 | DBLP:conf/sigir/2019birndl Proceedings of the 4th Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing fo... | {'doc_id': '2019.sigirconf_workshop-2019birndl.3', 'text': 'DBLP:conf/sigir/2019birndl Proceedings of the 4th Joint Workshop on Bibliometric-enhan... |
| 4 | 2019.sigirconf_workshop-2019birndl.4 | DBLP:conf/sigir/2019birndl Proceedings of the 4th Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing fo... | {'doc_id': '2019.sigirconf_workshop-2019birndl.4', 'text': 'DBLP:conf/sigir/2019birndl Proceedings of the 4th Joint Workshop on Bibliometric-enhan... |


Load the documents into a DataFrame and preview the first five rows.

In [5]:
import ast

queries = pt.io.read_topics(input_directory + '/queries.xml', format='trecxml')  # Extract the queries in TREC XML format

# View the queries
print("The loaded queries are:")
print(queries)

def as_dict(value):
    # Assumption: 'original_document' may be stored as a Python-style dict string; indexing it directly raised "TypeError: string indices must be integers"
    return ast.literal_eval(value) if isinstance(value, str) else value

with open(input_directory + '/documents.jsonl', 'r') as f:
    documents = [json.loads(line) for line in f]
originals = [as_dict(d['original_document']) for d in documents]
documents = [{'docno': d['docno'], 'text': d['text'], 'title': o['title'], 'abstract': o['abstract']} for d, o in zip(documents, originals)]

The loaded queries are:
  qid                                        query
0   1   machine learning for more relevant results
1   2     crawling websites using machine learning
2   3              recommenders influence on users
3   4                search engine caching effects
4   5                     consumer product reviews
5   6                 limitations machine learning

Loading all needed data (queries and documents)
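
If `original_document` is stored as a string rather than a nested object, indexing it with `['title']` fails with `TypeError: string indices must be integers`. A defensive sketch for that situation is shown here. It assumes the string is either valid JSON or a Python dict literal; `parse_original_document` and the sample record are hypothetical and not taken from the dataset.

```python
import ast
import json


def parse_original_document(value):
    """Return 'original_document' as a dict, whatever form it was stored in."""
    if isinstance(value, dict):
        return value  # already a nested object
    try:
        return json.loads(value)  # JSON-encoded string (double-quoted keys)
    except json.JSONDecodeError:
        return ast.literal_eval(value)  # Python dict literal (single-quoted keys)


# Made-up record whose 'original_document' was stringified with str(dict)
line = json.dumps({
    "docno": "doc-1",
    "text": "some text",
    "original_document": str({"title": "A title", "abstract": "An abstract"}),
})

doc = json.loads(line)
original = parse_original_document(doc["original_document"])
print(original["title"], "|", original["abstract"])
```

Trying `json.loads` first keeps JSON-specific values such as `null` working, while the `ast.literal_eval` fallback covers strings produced by Python's `str()`.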

In [None]:
!rm -Rf ./index  # If the index folder exists, then delete it and its content

iter_indexer = pt.IterDictIndexer("./index", meta={'docno' : 100, 'title': 10240, 'abstract': 10240, 'text': 10240}, blocks=True)
index_ref = iter_indexer.index(tqdm(documents))

Create the Index.

In [None]:
bm25 = pt.BatchRetrieve(index_ref, wmodel="BM25", verbose=True, metadata=['docno', 'text', 'title', 'abstract'])

bm25_title = pt.text.scorer(body_attr="title", wmodel="BM25")
bm25_abstract = pt.text.scorer(body_attr="abstract", wmodel="BM25")
bm25_text = pt.text.scorer(body_attr="text", wmodel="BM25")


# A heuristic ranking formula that puts the highest weight on the title and
# reduces the weight of matches on the text field.
# There is big potential for improvement here :)
combined_bm25_score = ((2*bm25_title) + (1*bm25_abstract) + (0.5*bm25_text))


dph_title = pt.text.scorer(body_attr="title", wmodel="DPH")
dph_abstract = pt.text.scorer(body_attr="abstract", wmodel="DPH")
dph_text = pt.text.scorer(body_attr="text", wmodel="DPH")

# The same heuristic weighting as above, applied to DPH: highest weight on the title,
# reduced weight for matches on the text field.
# There is big potential for improvement here :)
combined_dph_score = ((2*dph_title) + (1*dph_abstract) + (0.5*dph_text))

# The overall pipeline: retrieve the top-1000 results with BM25, then re-rank them using the combined BM25 and DPH scores.
# The combined BM25 and DPH scores are simply added.
# There is big potential for improvement here :)
retrieval_pipeline = (bm25 % 1000) >> (combined_bm25_score + combined_dph_score)
bm25.search("Limitations machine learning")  # View the BM25 results (docid and score) for the query "Limitations machine learning"

Create the retrieval pipeline.
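
The weighted sum used for re-ranking can be illustrated without PyTerrier. The sketch below applies the same weights (2 for the title, 1 for the abstract, 0.5 for the text) to made-up per-field scores; `combine_field_scores` and the candidate documents are hypothetical, and in the notebook the per-field scores come from `pt.text.scorer`.

```python
# Field weights taken from the ranking formula in the notebook
FIELD_WEIGHTS = {"title": 2.0, "abstract": 1.0, "text": 0.5}


def combine_field_scores(field_scores, weights=FIELD_WEIGHTS):
    """Weighted sum of per-field scores; missing fields count as 0."""
    return sum(weight * field_scores.get(field, 0.0) for field, weight in weights.items())


# Made-up per-field scores for two candidate documents
candidates = {
    "doc-a": {"title": 1.0, "abstract": 2.0, "text": 4.0},
    "doc-b": {"title": 3.0, "abstract": 0.5, "text": 1.0},
}

combined = {docno: combine_field_scores(scores) for docno, scores in candidates.items()}
ranking = sorted(combined, key=combined.get, reverse=True)

print(combined)
print(ranking)
```

Although `doc-a` has the stronger match in the text field, `doc-b` ends up first because its title match is weighted more heavily.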

In [None]:
index = pt.IndexFactory.of(index_ref)
print("Index information:")
print(index.getCollectionStatistics().toString())

Getting information about your created index directory.

In [None]:
run = bm25(queries[:1])
run.head(5)

Create run

In [None]:
print('We look at the first 10 results of the run:\n')
run.head(10)

In [None]:
run = retrieval_pipeline(queries)

Run the retrieval pipeline on all queries.

In [None]:
persist_and_normalize_run(run, output_file=output_directory, system_name='BM25', depth=1000)

print("Persist Run and normalize run.")

Persist Run.
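
The later steps read the result from `run.txt`. As a rough illustration of what such a file contains, the sketch below writes results in the standard six-column TREC run format (`qid Q0 docno rank score system`). It is not TIRA's implementation: `to_trec_run_lines` is a hypothetical helper, and the exact normalization done by `persist_and_normalize_run` is an assumption here.

```python
def to_trec_run_lines(results, system_name, depth=1000):
    """Format (qid, docno, score) tuples as TREC run lines.

    Per query, results are sorted by descending score, cut off at `depth`,
    and ranked starting from 1.
    """
    by_query = {}
    for qid, docno, score in results:
        by_query.setdefault(qid, []).append((docno, score))

    lines = []
    for qid in sorted(by_query):
        ranked = sorted(by_query[qid], key=lambda pair: pair[1], reverse=True)[:depth]
        for rank, (docno, score) in enumerate(ranked, start=1):
            lines.append(f"{qid} Q0 {docno} {rank} {score} {system_name}")
    return lines


# Made-up results for two queries
results = [("1", "d2", 1.5), ("1", "d1", 3.0), ("2", "d9", 0.7)]
run_lines = to_trec_run_lines(results, system_name="BM25")
print("\n".join(run_lines))
```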