# <ins> Milestone 2 </ins>

## Information

The following code is intended to run inside the "pyterrier" Docker image and will not work if you execute it directly in your own notebook environment.

Please follow these steps to create the needed output:

### Pre-condition

1. Navigate in your terminal to the "milestone2" folder and open "Docker Desktop"

2. Pull the "milestone1" image, which includes the qrels:
    ```
    docker pull registry.webis.de/code-research/tira/tira-user-ir-lab-sose-2023-dnc-limited/ir-datasets:0.0.1
    ```
3. Execute the tira-run command for the "milestone1" to get the needed output:
    ```
    tira-run --output-directory ${PWD}/iranthology-dataset-tira --image registry.webis.de/code-research/tira/tira-user-ir-lab-sose-2023-dnc-limited/ir-datasets:0.0.1 --allow-network true --command '/irds_cli.sh --ir_datasets_id iranthology-dnc-limited --output_dataset_path $outputDir'
    ```
   
### Milestone 2

4. Build an image for the current milestone:
    ```
    docker build -t milestone2 .
    ```
5. Run this image:
    ```
    docker run -p 8888:8888 --rm -ti -v ${PWD}:/workspace --entrypoint jupyter milestone2  notebook --allow-root --ip 0.0.0.0
    ```
6. Execute this notebook as tira would do:
    ```
    tira-run --input-directory ${PWD}/iranthology-dataset-tira --output-directory ${PWD}/bm25-output --image milestone2 --command '/workspace/run-pyterrier-notebook.py --input $inputDataset --output $outputDir --notebook /workspace/dnc-limited-notebook-milestone2.ipynb'
    ```
7. Render results:
    ```
    tira-run --output-directory ${PWD}/bm25-output --image registry.webis.de/code-research/tira/tira-user-ir-lab-sose-2023-dnc-limited/ir-datasets:0.0.1 --allow-network true --command 'diffir --dataset iranthology-dnc-limited --web $outputDir/run.txt > $outputDir/run.html'
    ```
8. Evaluate the effectiveness of the baseline on your relevance judgments:
    ```
    tira-run --input-directory ${PWD}/bm25-output --image registry.webis.de/code-research/tira/tira-user-ir-lab-sose-2023-dnc-limited/ir-datasets:0.0.1 --allow-network true --command 'ir_measures iranthology-dnc-limited $inputDataset/run.txt nDCG@10 MRR P@10 Recall@100'
    ```
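
The measures requested in step 8 can be illustrated with a small, self-contained sketch. It is not how `ir_measures` computes them internally. The toy ranking and relevance grades are made up, and the nDCG variant used here (linear gain, log2 discount) is an assumption that may differ in detail from the tool's conventions.

```python
import math

def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k documents that are relevant."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k

def recall_at_k(ranking, relevant, k):
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for doc in ranking[:k] if doc in relevant) / len(relevant)

def reciprocal_rank(ranking, relevant):
    """1 / rank of the first relevant document, 0 if none is retrieved."""
    for position, doc in enumerate(ranking, start=1):
        if doc in relevant:
            return 1.0 / position
    return 0.0

def ndcg_at_k(ranking, grades, k):
    """nDCG with linear gain and log2 discount; grades maps docno -> grade."""
    def dcg(gains):
        return sum(gain / math.log2(i + 2) for i, gain in enumerate(gains))
    gains = [grades.get(doc, 0) for doc in ranking[:k]]
    ideal_dcg = dcg(sorted(grades.values(), reverse=True)[:k])
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical toy data: two judged-relevant documents and one ranking
grades = {"d1": 2, "d4": 1}
relevant = {doc for doc, grade in grades.items() if grade > 0}
ranking = ["d3", "d1", "d2", "d4", "d5"]

print("P@5     ", precision_at_k(ranking, relevant, 5))
print("Recall@5", recall_at_k(ranking, relevant, 5))
print("RR      ", reciprocal_rank(ranking, relevant))
print("nDCG@5  ", round(ndcg_at_k(ranking, grades, 5), 4))
```

For a single query, MRR is simply the reciprocal rank; over several queries, the per-query values are averaged.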

## Code

In [1]:
import pyterrier as pt
import pandas as pd
from tira.third_party_integrations import ensure_pyterrier_is_loaded, persist_and_normalize_run
import json
from tqdm import tqdm
import os

Importing all necessary modules

In [2]:
ensure_pyterrier_is_loaded()

#Using the code of the "get_input_directory_and_output_directory" function from "tira.third_party_integrations" to set the input and output directory
default_input = './iranthology-dataset-tira'
default_output = '/tmp/'

input_directory = os.environ.get('TIRA_INPUT_DIRECTORY', None)
if not input_directory:
    input_directory = default_input

output_directory = os.environ.get('TIRA_OUTPUT_DIRECTORY', default_output)

print(f'I will read the input data from {input_directory}.')
print(f'The output directory is {output_directory}')


Start PyTerrier with version=5.7, helper_version=0.0.7, no_download=True


PyTerrier 0.9.2 has loaded Terrier 5.7 (built by craigm on 2022-11-10 18:30) and terrier-helper 0.0.7

No etc/terrier.properties, using terrier.default.properties for bootstrap configuration.


I will read the input data from ./iranthology-dataset-tira.
The output directory is /tmp/


Ensuring that PyTerrier is loaded and setting the input and output directories.
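
The fallback logic of this cell can be isolated into a small helper, which makes it easy to check outside of TIRA. This is only a sketch: `resolve_directories` and the `/data/...` paths are hypothetical names, and unlike the cell above it treats an empty variable as unset for both directories.

```python
import os

def resolve_directories(default_input="./iranthology-dataset-tira",
                        default_output="/tmp/", env=None):
    """Return (input_directory, output_directory), preferring the TIRA variables."""
    env = os.environ if env is None else env
    # `or` also covers the case where a variable is set but empty
    input_directory = env.get("TIRA_INPUT_DIRECTORY") or default_input
    output_directory = env.get("TIRA_OUTPUT_DIRECTORY") or default_output
    return input_directory, output_directory

# Outside of TIRA: no variables set, so the defaults are used
print(resolve_directories(env={}))

# Inside of TIRA (hypothetical paths): the variables take precedence
tira_env = {"TIRA_INPUT_DIRECTORY": "/data/in", "TIRA_OUTPUT_DIRECTORY": "/data/out"}
print(resolve_directories(env=tira_env))
```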

In [3]:
print("The input directory contains the following files:")
!ls -lha {input_directory}

The input directory contains the following files:
total 115M
drwxrwxrwx 1 root root 4.0K Jun  5 10:45 .
drwxrwxrwx 1 root root 4.0K Jun  5 12:11 ..
-rw-r--r-- 1 root root 115M Jun  2 14:53 documents.jsonl
-rw-r--r-- 1 root root   46 Jun  2 14:53 metadata.json
-rw-r--r-- 1 root root 3.0K Jun  2 14:53 queries.jsonl
-rw-r--r-- 1 root root 3.8K Jun  2 14:53 queries.xml


Checking the input directory.

In [4]:
pd.set_option('display.max_colwidth', 150) #Setting the maximum width for better visibility

docs_df = pd.read_json(f'{input_directory}/documents.jsonl', lines=True) #Loading the textual documents into a dataframe

print("Viewing the first 5 documents from the imported documents.jsonl:")
docs_df.head(5)

Viewing the first 5 documents from the imported documents.jsonl:


| | docno | text | original_document |
|---|---|---|---|
| 0 | 2019.sigirconf_workshop-2019birndl.0 | CEUR Workshop Proceedings 2414 CEUR-WS.org 2019 http://ceur-ws.org/Vol-2414 urn:nbn:de:0074-2414-3 https://dblp.org/rec/conf/sigir/2019birndl.bib ... | {'doc_id': '2019.sigirconf_workshop-2019birndl.0', 'text': 'CEUR Workshop Proceedings 2414 CEUR-WS.org 2019 http://ceur-ws.org/Vol-2414 urn:nbn:de... |
| 1 | 2019.sigirconf_workshop-2019birndl.1 | DBLP:conf/sigir/2019birndl Proceedings of the 4th Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing fo... | {'doc_id': '2019.sigirconf_workshop-2019birndl.1', 'text': 'DBLP:conf/sigir/2019birndl Proceedings of the 4th Joint Workshop on Bibliometric-enhan... |
| 2 | 2019.sigirconf_workshop-2019birndl.2 | DBLP:conf/sigir/2019birndl Proceedings of the 4th Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing fo... | {'doc_id': '2019.sigirconf_workshop-2019birndl.2', 'text': 'DBLP:conf/sigir/2019birndl Proceedings of the 4th Joint Workshop on Bibliometric-enhan... |
| 3 | 2019.sigirconf_workshop-2019birndl.3 | DBLP:conf/sigir/2019birndl Proceedings of the 4th Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing fo... | {'doc_id': '2019.sigirconf_workshop-2019birndl.3', 'text': 'DBLP:conf/sigir/2019birndl Proceedings of the 4th Joint Workshop on Bibliometric-enhan... |
| 4 | 2019.sigirconf_workshop-2019birndl.4 | DBLP:conf/sigir/2019birndl Proceedings of the 4th Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing fo... | {'doc_id': '2019.sigirconf_workshop-2019birndl.4', 'text': 'DBLP:conf/sigir/2019birndl Proceedings of the 4th Joint Workshop on Bibliometric-enhan... |


Viewing the "documents.jsonl" data.
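
For readers without pandas, the same JSON Lines layout can be read with the standard library alone. The sketch below writes two made-up records (only the `docno` and `text` fields shown above are mimicked) to a temporary file and reads them back; `read_jsonl` is a hypothetical helper name.

```python
import json
import os
import tempfile

# Two made-up records that mimic the columns shown above
sample_records = [
    {"docno": "example.0", "text": "first example document"},
    {"docno": "example.1", "text": "second example document"},
]

def read_jsonl(path):
    """Yield one dict per non-empty line of a JSON Lines file."""
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "documents.jsonl")
    # One JSON object per line is all that the format requires
    with open(path, "w", encoding="utf-8") as handle:
        for record in sample_records:
            handle.write(json.dumps(record) + "\n")
    loaded = list(read_jsonl(path))

print(len(loaded), loaded[0]["docno"])
```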

In [5]:
queries = pt.io.read_topics(input_directory + '/queries.xml', format='trecxml') #Extracting the queries from the TREC XML file

#Viewing the queries
print("The loaded queries are:")
print(queries)

with open(input_directory + '/documents.jsonl', 'r') as documents_file: #Loading the documents (please refer to the code cell above to get a glimpse of their content)
    documents = [json.loads(line) for line in documents_file]

The loaded queries are:
  qid                                        query
0   1   machine learning for more relevant results
1   2     crawling websites using machine learning
2   3              recommenders influence on users
3   4                search engine caching effects
4   5                     consumer product reviews
5   6                 limitations machine learning


Loading all needed data (queries and documents)
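
To make the topic loading less of a black box, here is a minimal sketch that parses a TREC-style XML topic file with the standard library. The XML layout (a `topic` element with a `number` attribute and a `query` child) is an assumption about `queries.xml`, not something confirmed by the output above; only the two query strings are taken from it.

```python
import xml.etree.ElementTree as ET

# Assumed file layout; the real queries.xml may use other tags or attributes
sample_xml = """<topics>
  <topic number="1"><query>machine learning for more relevant results</query></topic>
  <topic number="6"><query>limitations machine learning</query></topic>
</topics>"""

def parse_topics(xml_text):
    """Return a list of {'qid', 'query'} dicts from the assumed topic layout."""
    root = ET.fromstring(xml_text)
    topics = []
    for topic in root.iter("topic"):
        query = topic.findtext("query", default="").strip()
        topics.append({"qid": topic.attrib["number"], "query": query})
    return topics

for topic in parse_topics(sample_xml):
    print(topic["qid"], topic["query"])
```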

In [6]:
!rm -Rf ./index #If the index folder exists, then delete it and its content

iter_indexer = pt.IterDictIndexer("./index", meta={'docno' : 100}) #Creating the index
index_ref = iter_indexer.index(tqdm(documents)) #Using a progressbar for visibility

100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 53673/53673 [00:11<00:00, 4629.68it/s]


Create the Index.
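
Conceptually, indexing builds an inverted index that maps every term to the documents containing it. The toy sketch below shows the idea in plain Python. It is not what Terrier does internally (no stemming, stopword removal, or compression), and the three documents are made up.

```python
from collections import defaultdict

# Made-up documents in the same shape as the loaded ones (docno, text)
toy_documents = [
    {"docno": "d1", "text": "machine learning for search"},
    {"docno": "d2", "text": "search engine caching"},
    {"docno": "d3", "text": "learning to rank for search engine results"},
]

def build_inverted_index(documents):
    """Map each term to {docno: term frequency}; tokens are a plain lowercase split."""
    index = defaultdict(dict)
    for document in documents:
        for term in document["text"].lower().split():
            postings = index[term]
            postings[document["docno"]] = postings.get(document["docno"], 0) + 1
    return dict(index)

index = build_inverted_index(toy_documents)

# Documents that contain the term "search"
print(sorted(index["search"]))
# Number of distinct terms and number of (term, document) postings
print(len(index), sum(len(postings) for postings in index.values()))
```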

In [7]:
rm3 = pt.rewrite.RM3(index_ref)
bm25 = pt.BatchRetriever(index_ref,wmodel="BM25",verbose=True)

bm25.search("Limitations machine learning") #viewing the docid and score of the query "Limitations machine learning"

BR(BM25): 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  3.36q/s]


| | qid | docid | docno | rank | score | query |
|---|---|---|---|---|---|---|
| 0 | 1 | 15379 | 2021.wsdm_conference-2021.78 | 0 | 17.374985 | limitations of machine learning |
| 1 | 1 | 15457 | 2021.wsdm_conference-2021.156 | 1 | 17.300120 | limitations of machine learning |
| 2 | 1 | 15517 | 2017.wsdm_conference-2017.57 | 2 | 16.986565 | limitations of machine learning |
| 3 | 1 | 32063 | 2020.wwwconf_conference-2020c.108 | 3 | 16.938821 | limitations of machine learning |
| 4 | 1 | 14952 | 2018.wsdm_conference-2018.3 | 4 | 16.905845 | limitations of machine learning |
| ... | ... | ... | ... | ... | ... | ... |
| 995 | 1 | 38903 | 2018.tist_journal-ir0anthology0volumeA9A2.4 | 995 | 7.355572 | limitations of machine learning |
| 996 | 1 | 22895 | 2010.cikm_conference-2010.118 | 996 | 7.347620 | limitations of machine learning |
| 997 | 1 | 30508 | 2015.wwwconf_conference-2015c.256 | 997 | 7.347620 | limitations of machine learning |
| 998 | 1 | 416 | 2011.ntcir_workshop-2011.2 | 998 | 7.338885 | limitations of machine learning |


Create Retrieval Pipeline
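
To give an intuition for the scores shown above, the following sketch ranks three made-up documents with a textbook BM25 variant. The idf form and the parameters `k1=1.2` and `b=0.75` are assumptions for illustration; Terrier's own BM25 implementation differs in its details, so the numbers are not comparable to the real output.

```python
import math
from collections import Counter

def bm25_scores(query, documents, k1=1.2, b=0.75):
    """Score every document for the query with a textbook BM25 variant."""
    tokenised = {docno: text.lower().split() for docno, text in documents.items()}
    n_docs = len(tokenised)
    avg_len = sum(len(tokens) for tokens in tokenised.values()) / n_docs
    # Document frequency: in how many documents each term occurs
    doc_freq = Counter(term for tokens in tokenised.values() for term in set(tokens))

    scores = {}
    for docno, tokens in tokenised.items():
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            tf = term_freq[term]
            if tf == 0:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Length normalisation: long documents need more occurrences to score equally
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores[docno] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Made-up documents, loosely inspired by the topics of this collection
toy_documents = {
    "d1": "limitations of machine learning models",
    "d2": "machine learning for web crawling",
    "d3": "consumer product reviews",
}
for docno, score in bm25_scores("limitations machine learning", toy_documents):
    print(docno, round(score, 3))
```

The document that matches all three query terms ranks first, the one that matches two ranks second, and the unrelated document receives a score of zero.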

In [8]:
index = pt.IndexFactory.of(index_ref)
print("Index information:")
print(index.getCollectionStatistics().toString())

Index information:
Number of documents: 53673
Number of terms: 138650
Number of postings: 3931406
Number of fields: 1
Number of tokens: 6333284
Field names: [text]
Positions:   false



Getting information about your created index directory.

In [9]:
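retrieval_pipeline = bm25 >> rm3 >> bm25 #Creating the retrieval pipeline: BM25 retrieval, RM3 query expansion, then BM25 retrieval with the expanded query

run = retrieval_pipeline(queries) #Running the pipeline on all queries to create the run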

BR(BM25): 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:00<00:00, 11.28q/s]
BR(BM25): 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 23.91q/s]


| | qid | docid | docno | rank | score | query |
|---|---|---|---|---|---|---|
| 0 | 1 | 15379 | 2021.wsdm_conference-2021.78 | 0 | 17.374985 | Limitations machine learning |
| 1 | 1 | 15457 | 2021.wsdm_conference-2021.156 | 1 | 17.300120 | Limitations machine learning |
| 2 | 1 | 15517 | 2017.wsdm_conference-2017.57 | 2 | 16.986565 | Limitations machine learning |
| 3 | 1 | 32063 | 2020.wwwconf_conference-2020c.108 | 3 | 16.938821 | Limitations machine learning |
| 4 | 1 | 14952 | 2018.wsdm_conference-2018.3 | 4 | 16.905845 | Limitations machine learning |
| ... | ... | ... | ... | ... | ... | ... |
| 995 | 1 | 38903 | 2018.tist_journal-ir0anthology0volumeA9A2.4 | 995 | 7.355572 | Limitations machine learning |
| 996 | 1 | 22895 | 2010.cikm_conference-2010.118 | 996 | 7.347620 | Limitations machine learning |
| 997 | 1 | 30508 | 2015.wwwconf_conference-2015c.256 | 997 | 7.347620 | Limitations machine learning |
| 998 | 1 | 416 | 2011.ntcir_workshop-2011.2 | 998 | 7.338885 | Limitations machine learning |


Create run
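
The retrieve, expand, retrieve again pattern of this pipeline can be sketched in a few lines of plain Python. Note that this is a deliberately simplified pseudo-relevance feedback, not RM3 itself: RM3 weights expansion terms with a relevance model and interpolates them with the original query, whereas the sketch just appends the most frequent new terms. The documents, the counting-based `retrieve` stand-in, and the parameters `fb_docs` and `fb_terms` are all made up for illustration.

```python
from collections import Counter

toy_documents = {
    "d1": "recommenders influence users through personalised rankings",
    "d2": "personalised rankings shape what users click",
    "d3": "search engine caching effects on latency",
}

def retrieve(query_terms, documents):
    """Stand-in retriever: rank documents by the number of query term occurrences."""
    scores = {}
    for docno, text in documents.items():
        counts = Counter(text.lower().split())
        scores[docno] = sum(counts[term] for term in query_terms)
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [docno for docno, score in ranked if score > 0]

def expand_query(query_terms, ranking, documents, fb_docs=1, fb_terms=2):
    """Append the most frequent new terms from the top-ranked feedback documents."""
    counts = Counter()
    for docno in ranking[:fb_docs]:
        counts.update(documents[docno].lower().split())
    candidates = [(term, freq) for term, freq in counts.items() if term not in query_terms]
    # Sort by frequency, then alphabetically to keep the result deterministic
    candidates.sort(key=lambda item: (-item[1], item[0]))
    return list(query_terms) + [term for term, _ in candidates[:fb_terms]]

query = ["recommenders", "influence"]
first_pass = retrieve(query, toy_documents)
expanded = expand_query(query, first_pass, toy_documents)
second_pass = retrieve(expanded, toy_documents)

print("first pass: ", first_pass)
print("expanded:   ", expanded)
print("second pass:", second_pass)
```

In this toy setup the second pass also finds `d2`, which shares no term with the original query but does share terms with the top-ranked document.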

In [10]:
print('We look at the first 10 results of the run:\n')
run.head(10)

We look at the first 10 results of the run:



| | qid | docid | docno | rank | score | query |
|---|---|---|---|---|---|---|
| 0 | 1 | 8644 | 2007.sigirconf_conference-2007.39 | 0 | 17.033467 | machine learning for more relevant results |
| 1 | 1 | 6941 | 2012.sigirconf_conference-2012.5 | 1 | 16.888207 | machine learning for more relevant results |
| 2 | 1 | 21378 | 2009.cikm_conference-2009.190 | 2 | 16.778766 | machine learning for more relevant results |
| 3 | 1 | 24062 | 2016.cikm_conference-2016.93 | 3 | 16.194192 | machine learning for more relevant results |
| 4 | 1 | 50848 | 2007.ipm_journal-ir0anthology0volumeA43A4.3 | 4 | 16.157097 | machine learning for more relevant results |
| 5 | 1 | 22430 | 2018.cikm_conference-2018.299 | 5 | 16.146165 | machine learning for more relevant results |
| 6 | 1 | 33649 | 2018.wwwconf_conference-2018c.298 | 6 | 16.113325 | machine learning for more relevant results |
| 7 | 1 | 22940 | 2010.cikm_conference-2010.163 | 7 | 16.001429 | machine learning for more relevant results |
| 8 | 1 | 8916 | 2010.sigirconf_conference-2010.89 | 8 | 15.730135 | machine learning for more relevant results |
| 9 | 1 | 7178 | 2014.sigirconf_conference-2014.17 | 9 | 15.720376 | machine learning for more relevant results |


In [11]:
persist_and_normalize_run(run, output_file=output_directory, system_name='BM25', depth=1000)

print("Persisted and normalized the run.")

Persisted and normalized the run.


Persist the run.
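
The persisted `run.txt` is consumed by the rendering and evaluation steps, so it helps to know what a run file looks like. The sketch below writes the common six-column TREC run layout (`qid Q0 docno rank score system`) to an in-memory buffer. That `persist_and_normalize_run` produces exactly this layout is an assumption, and its normalization (for example of ranks or scores) is not reproduced here; `write_trec_run` is a hypothetical helper.

```python
import io

def write_trec_run(rows, handle, system_name="BM25"):
    """Write rows as 'qid Q0 docno rank score system', one line per document."""
    for row in rows:
        handle.write(
            f"{row['qid']} Q0 {row['docno']} {row['rank']} {row['score']} {system_name}\n"
        )

# Two rows copied from the run output shown above
rows = [
    {"qid": "1", "docno": "2007.sigirconf_conference-2007.39", "rank": 0, "score": 17.033467},
    {"qid": "1", "docno": "2012.sigirconf_conference-2012.5", "rank": 1, "score": 16.888207},
]

buffer = io.StringIO()
write_trec_run(rows, buffer)
print(buffer.getvalue(), end="")
```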

## Reflections

By Timothy Kriewald: <br/> During the work on the second milestone, the biggest problem was the topic I had created myself in the first milestone. Unfortunately, my topic "Limitations of machine learning" has very few relevant documents in the qrels. This made me realize how significant even minor differences in formulation can be. If, for example, I had not focused solely on limitations but had also addressed improvements, the pool of relevant documents would have increased significantly. The same applies to machine learning! If I had chosen artificial intelligence instead of machine learning, I would have found many more relevant documents. So, the biggest problem here was myself. While it is true that a more precise topic can lead to more relevant results, the number of these relevant results is very low. When I randomly selected a document from the dataset of milestone 1 and aligned my topic with it, I assumed that there would be several documents in that direction. However, only two documents truly correspond to my topic, which was far fewer than expected. The remaining "relevant" documents touch on my topic to some extent, but only superficially.

If I could choose a new topic, my approach would be different. I would first get a rough overview of the included documents and think more from the perspective of a user, considering which information I need from the data and not keeping it as specific as I did.

Thanks to the repositories provided on Git, setting up the code was made very easy. After watching the given tutorials and being able to incorporate many things from them, further work proved to be not a big problem.

Although I would have liked to see a bit more initiative from my group members, all in all, the work, especially with the repositories, was much easier, more enjoyable, and less demanding compared to the first milestone.

By Willi: <br/> This assignment was a completely different experience from the first one. Thanks to the great tutorial, we had a much better understanding of the task and the current state of the project. The programming part of the task was at the same level of difficulty as the last one, but this time we were able to concentrate more and worry less, so in the end it took less time. The biggest challenge for me was creating our qrels. As my topic touches on two very common subjects, recommender systems and how users are affected by them, there were a lot of false positives, and finding the really important documents was like finding a needle in a haystack, as many documents discussed technical topics rather than social ones. In the end, actually seeing our search engine print out results after the tutorial gave us a great sense of achievement, as we saw what the preparation had led to. At the same time, we were motivated for the task ahead because our results were not yet very promising, probably because this was our first implemented version without any fine-tuning.

By Nils: <br/> The biggest challenge in this task was manually evaluating the documents for relevance, as there were many documents that needed to be analyzed. Additionally, it was quite difficult for me to determine whether a document was relevant to my topic or not, as most of them touched upon multiple subjects, some of which could have fallen into a different topic. However, the programming part of this assignment went well compared to the last submission because we had a well-structured tutorial. We simply had to follow the tutorial closely, and unlike last time, there were no significant uncertainties about what exactly needed to be submitted. This allowed us to work efficiently.

By Paul: <br/> The most time-consuming task in this assignment was the manual evaluation of the papers. Although the abstracts of most papers were very concise, for some of them it wasn't easy to decide whether they belonged to my topic or not. In such cases I had to take a closer look, which resulted in even more time-consuming work.
For this assignment the instructions were very clear, so we could easily follow them and had no major uncertainties about what had to be done to complete the task.

By Constantin: <br/> This assignment was quite a bit different from the first one. For one, our team was now established without any more last-minute additions, so we were able to divide the work among us much more easily and efficiently. The qrels also helped in this respect, as everyone could evaluate their own qrels, allowing us to work in parallel and move past that step relatively quickly. As for my qrels, while the work was quite monotonous, I was also intrigued by how my request led to some papers that, while relevant to my query, were not at all in the direction I had first envisioned.

As for the second part of this milestone, the tutorials provided were a great help and allowed us to understand not only what we had to achieve but also how to achieve it.

All in all, I would say this assignment was not only easier to understand but also made it easier to give everyone a way to participate, even in a group as big as ours.

By Dorjan: <br/> This milestone took a lot more effort for the actual task but a lot less effort for organizing ourselves as a team, and we had close to no technical issues.

This resulted in an overall easier and more engaging task, as the papers we had to sift through were interesting to say the least.

Evaluating the relevancy of the papers was quite strenuous and it just showed us the importance of having an Information Retrieval system more than anything else could have.

