# <ins> Milestone 3 </ins>

## Information for BM25 >> Multi Field

The following code is intended to run inside the "pyterrier" Docker image and will not work if you execute it directly in your own notebook.

Please follow these steps to create the needed output:

### Pre-condition

1. Navigate in your terminal to the "milestone3" folder and open "Docker Desktop"

2. Pull the image of the "milestone1" with the qrels command
    ```
    docker pull registry.webis.de/code-research/tira/tira-user-ir-lab-sose-2023-dnc-limited/ir-datasets:0.0.1
    ```
3. Execute the tira-run command for the "milestone1" to get the needed output:
    ```
    tira-run --output-directory ${PWD}/iranthology-dataset-tira --image registry.webis.de/code-research/tira/tira-user-ir-lab-sose-2023-dnc-limited/ir-datasets:0.0.1 --allow-network true --command '/irds_cli.sh --ir_datasets_id iranthology-dnc-limited --output_dataset_path $outputDir'
    ```
4. Replace our documents.jsonl with the one from the tutors' dataset:
    ```
    iranthology-dataset-tutors-tira.zip -> extract documents.jsonl
    insert it into 'iranthology-dataset-tira' and overwrite the existing file
    ```

### Milestone 3

5. Build an image for the current milestone:
    ```
    docker build -t milestone3 .
    ```
6. Run this image:
    ```
    docker run -d -p 8888:8888 --rm -ti -v ${PWD}:/workspace --entrypoint jupyter milestone3  notebook --allow-root --ip 0.0.0.0
    ```
7. Execute this notebook as tira would do:
    ```
    tira-run --input-directory ${PWD}/iranthology-dataset-tira --output-directory ${PWD}/bm25-multi-field-output --image milestone3 --command '/workspace/run-pyterrier-notebook.py --input $inputDataset --output $outputDir --notebook /workspace/dnc-limited-notebook-bm25-multi-field.ipynb'
    ```
8. Render results:
    ```
    tira-run --output-directory ${PWD}/bm25-multi-field-output --image registry.webis.de/code-research/tira/tira-user-ir-lab-sose-2023-dnc-limited/ir-datasets:0.0.1 --allow-network true --command 'diffir --dataset iranthology-dnc-limited --web $outputDir/run.txt > $outputDir/run.html'
    ```
9. Evaluate the effectiveness of the baseline on your relevance judgments:
    ```
    tira-run --input-directory ${PWD}/bm25-multi-field-output --image registry.webis.de/code-research/tira/tira-user-ir-lab-sose-2023-dnc-limited/ir-datasets:0.0.1 --allow-network true --command 'ir_measures iranthology-dnc-limited $inputDataset/run.txt nDCG@10 MRR P@10 Recall@100'
    ```
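To make the four measures in step 9 more concrete, here is a minimal sketch of how they can be computed for a single query. It is not the `ir_measures` implementation: the function names, the toy ranking, and the linear-gain, log2-discount form of nDCG are assumptions chosen for illustration, and `ir_measures` may handle details such as ties differently.

```python
import math

def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k

def reciprocal_rank(ranking, relevant):
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def recall_at_k(ranking, relevant, k):
    """Fraction of all relevant documents found in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(ranking[:k]) & set(relevant)) / len(relevant)

def ndcg_at_k(ranking, gains, k):
    """DCG of the ranking divided by the DCG of an ideal ranking (assumed gain/discount)."""
    def dcg(values):
        return sum(gain / math.log2(rank + 1) for rank, gain in enumerate(values, start=1))
    actual = dcg([gains.get(doc, 0) for doc in ranking[:k]])
    ideal = dcg(sorted(gains.values(), reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0

# Toy data (hypothetical document ids and relevance grades).
toy_gains = {"d1": 2, "d2": 1}
toy_relevant = {doc for doc, gain in toy_gains.items() if gain > 0}
toy_ranking = ["d3", "d1", "d7", "d2"]

print("P@2   ", precision_at_k(toy_ranking, toy_relevant, 2))
print("RR    ", reciprocal_rank(toy_ranking, toy_relevant))
print("R@4   ", recall_at_k(toy_ranking, toy_relevant, 4))
print("nDCG@4", ndcg_at_k(toy_ranking, toy_gains, 4))
```

The reported values are averages of such per-query scores over all topics.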

## Code

In [1]:
import pyterrier as pt
import pandas as pd
from tira.third_party_integrations import ensure_pyterrier_is_loaded, persist_and_normalize_run
import json
from tqdm import tqdm
import os

Importing all necessary modules

In [2]:
ensure_pyterrier_is_loaded()

#Set the input and output directories, following the code of the "get_input_directory_and_output_directory" function from "tira.third_party_integrations"
default_input = './iranthology-dataset-tira'
default_output = '/tmp/'

input_directory = os.environ.get('TIRA_INPUT_DIRECTORY', None)
if not input_directory:
    input_directory = default_input

output_directory = os.environ.get('TIRA_OUTPUT_DIRECTORY', default_output)

print(f'I will read the input data from {input_directory}.')
print(f'The output directory is {output_directory}')


Start PyTerrier with version=5.7, helper_version=0.0.7, no_download=True


PyTerrier 0.9.2 has loaded Terrier 5.7 (built by craigm on 2022-11-10 18:30) and terrier-helper 0.0.7

No etc/terrier.properties, using terrier.default.properties for bootstrap configuration.


I will read the input data from ./iranthology-dataset-tira.
The output directory is /tmp/


Ensure that PyTerrier is loaded and set the input and output directories.
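The directory handling in this cell can be wrapped in a small helper, which makes the fallback behaviour easier to test. This is only a sketch: `resolve_directories` is a hypothetical name, and unlike the cell above it treats an empty `TIRA_OUTPUT_DIRECTORY` the same as an unset one.

```python
import os

def resolve_directories(env=None,
                        default_input="./iranthology-dataset-tira",
                        default_output="/tmp/"):
    """Return (input_directory, output_directory), preferring the TIRA variables."""
    env = os.environ if env is None else env
    # An unset or empty variable falls back to the local default (assumption).
    input_directory = env.get("TIRA_INPUT_DIRECTORY") or default_input
    output_directory = env.get("TIRA_OUTPUT_DIRECTORY") or default_output
    return input_directory, output_directory

# Outside of TIRA nothing is set, so the defaults are used.
print(resolve_directories({}))
# Inside of TIRA both variables are expected to be provided.
print(resolve_directories({"TIRA_INPUT_DIRECTORY": "/data/in",
                           "TIRA_OUTPUT_DIRECTORY": "/data/out"}))
```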

In [3]:
print("The input directory contains following files:")
!ls -lha {input_directory}

The input directory contains following files:
total 77M
drwxrwxrwx 1 root root 4.0K Jun 27 09:19 .
drwxrwxrwx 1 root root 4.0K Jun 27 09:24 ..
-rwxrwxrwx 1 root root  77M Jun  6 02:59 documents.jsonl
-rwxrwxrwx 1 root root   41 Jun 26 15:54 metadata.json
-rwxrwxrwx 1 root root 1.6K Jun 26 15:54 queries.jsonl
-rwxrwxrwx 1 root root 2.1K Jun 26 15:54 queries.xml


Checking the input directory.

In [4]:
pd.set_option('display.max_colwidth', 150) #Setting the maximum width for better visibility

docs_df = pd.read_json(f'{input_directory}/documents.jsonl', lines=True) #Load the textual documents into a DataFrame

print("Viewing the first 5 documents from the imported documents.jsonl:")
docs_df.head(5)

Viewing the first 5 documents from the imported documents.jsonl:


Unnamed: 0,docno,text,original_document
0,2019.sigirconf_workshop-2019birndl.0,Proceedings of the 4th Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL...,"{'doc_id': '2019.sigirconf_workshop-2019birndl.0', 'abstract': '', 'title': 'Proceedings of the 4th Joint Workshop on Bibliometric-enhanced Inform..."
1,2019.sigirconf_workshop-2019birndl.1,Preface: 4th Joint Workshop on BIRNDL at SIGIR 2019,"{'doc_id': '2019.sigirconf_workshop-2019birndl.1', 'abstract': '', 'title': 'Preface: 4th Joint Workshop on BIRNDL at SIGIR 2019', 'authors': [], ..."
2,2019.sigirconf_workshop-2019birndl.2,"Personalized Feed/Query-formulation, Predictive Impact, and Ranking The Meta discovery system is designed to aid biomedical researchers in keeping...","{'doc_id': '2019.sigirconf_workshop-2019birndl.2', 'abstract': 'The Meta discovery system is designed to aid biomedical researchers in keeping up ..."
3,2019.sigirconf_workshop-2019birndl.3,"Discourse Processing for Text Analysis: Recent Successes, Current Challenges Computational discourse processing has come a long way in the 10 year...","{'doc_id': '2019.sigirconf_workshop-2019birndl.3', 'abstract': 'Computational discourse processing has come a long way in the 10 years since I spo..."
4,2019.sigirconf_workshop-2019birndl.4,Distant Supervision for Silver Label Generation of Software Mentions in Social Scientific Publications Many scientific investigations rely on soft...,"{'doc_id': '2019.sigirconf_workshop-2019birndl.4', 'abstract': 'Many scientific investigations rely on software for a range of different tasks inc..."


In [5]:
queries = pt.io.read_topics(input_directory + '/queries.xml', format='trecxml') #Read the queries in trecxml format

#Viewing the queries
print("The loaded queries are:")
print(queries)

#Read the documents and expose title and abstract as separate fields
with open(input_directory + '/documents.jsonl', 'r') as documents_file:
    documents = [json.loads(line) for line in documents_file]
documents = [{'docno': i['docno'], 'text': i['text'], 'title': i['original_document']['title'], 'abstract': i['original_document']['abstract']} for i in documents]

The loaded queries are:
  qid                                       query
0   1               detect health related queries
1   2   large language models for query expansion
2   3                     datasets for web search
3   4                known item search for movies


Loading all needed data (queries and documents)
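The flattening step assumes that every record carries an `original_document` with both `title` and `abstract`. A slightly more defensive variant could look like the sketch below; `flatten_record` and the sample records are hypothetical, and falling back to empty strings is an assumption about how missing fields should be treated.

```python
import json

def flatten_record(line):
    """Turn one documents.jsonl line into the flat dict that is passed to the indexer."""
    record = json.loads(line)
    # Fall back to an empty dict so that missing metadata does not raise a KeyError.
    original = record.get("original_document") or {}
    return {
        "docno": record["docno"],
        "text": record.get("text", ""),
        "title": original.get("title", ""),
        "abstract": original.get("abstract", ""),
    }

# Hypothetical sample records, shaped like the rows shown in cell 4.
complete = json.dumps({
    "docno": "doc-1",
    "text": "A title An abstract",
    "original_document": {"title": "A title", "abstract": "An abstract"},
})
incomplete = json.dumps({"docno": "doc-2", "text": "Only text"})

print(flatten_record(complete))
print(flatten_record(incomplete))
```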

In [6]:
!rm -Rf ./index #If the index folder exists, delete it and its contents

iter_indexer = pt.IterDictIndexer("./index", meta={'docno' : 100, 'title': 10240, 'abstract': 10240, 'text': 10240}, blocks=True)
index_ref = iter_indexer.index(tqdm(documents))

100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 53673/53673 [00:16<00:00, 3229.35it/s]


09:26:46.751 [ForkJoinPool-1-worker-3] WARN org.terrier.structures.indexing.Indexer - Indexed 3 empty documents


Create the Index.

In [7]:
bm25 = pt.BatchRetrieve(index_ref, wmodel="BM25", verbose=True, metadata=['docno', 'text', 'title', 'abstract'])

bm25_title = pt.text.scorer(body_attr="title", wmodel="BM25")
bm25_abstract = pt.text.scorer(body_attr="abstract", wmodel="BM25")
bm25_text = pt.text.scorer(body_attr="text", wmodel="BM25")


# A hand-picked ranking formula that puts the highest weight on the title and
# reduces the weight of matches on the text field.
# There is big potential for improvement here :)
combined_bm25_score = ((1.6*bm25_title) + (1.3*bm25_abstract) + (0.6*bm25_text))


dph_title = pt.text.scorer(body_attr="title", wmodel="DPH")
dph_abstract = pt.text.scorer(body_attr="abstract", wmodel="DPH")
dph_text = pt.text.scorer(body_attr="text", wmodel="DPH")

# A hand-picked ranking formula that puts the highest weight on the title and
# reduces the weight of matches on the text field.
# There is big potential for improvement here :)
combined_dph_score = ((1.6*dph_title) + (1.2*dph_abstract) + (0.6*dph_text))

TfIdf_title = pt.text.scorer(body_attr="title", wmodel="TF_IDF", background_index=index_ref)
TfIdf_abstract = pt.text.scorer(body_attr="abstract", wmodel="TF_IDF", background_index=index_ref)
TfIdf_text = pt.text.scorer(body_attr="text", wmodel="TF_IDF", background_index=index_ref)

combined_TfIdf_score = ((1.6*TfIdf_title) + (1.1*TfIdf_abstract) + (0.6*TfIdf_text))

# The overall pipeline: we retrieve the top-500 results from BM25 and re-rank them using the combined BM25, DPH and TF-IDF scores.
# The three combined scores are added with the weights 0.5 (BM25), 1 (DPH) and 1.1 (TF-IDF).
# There is big potential for improvement here :)
retrieval_pipeline = bm25 %500 >> (0.5*combined_bm25_score) + (combined_dph_score * 1) + (combined_TfIdf_score * 1.1)
bm25.search("Limitations machine learning") #viewing the docid and score of the query "Limitations machine learning"

BR(BM25): 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:02<00:00,  2.01s/q]


Unnamed: 0,qid,docid,docno,text,title,abstract,rank,score,query
0,1,14952,2018.wsdm_conference-2018.3,"Theoretical Impediments to Machine Learning With Seven Sparks from the Causal Revolution ABSTRACTCurrent machine learning systems operate, almost ...",Theoretical Impediments to Machine Learning With Seven Sparks from the Causal Revolution,"ABSTRACTCurrent machine learning systems operate, almost exclusively, in a statistical, or model-blind mode, which entails severe theoretical limi...",0,16.227153,Limitations machine learning
1,1,15457,2021.wsdm_conference-2021.156,The 1st International Workshop on Machine Reasoning: International Machine Reasoning Conference (MRC 2021) Recent years have witnessed the success...,The 1st International Workshop on Machine Reasoning: International Machine Reasoning Conference (MRC 2021),Recent years have witnessed the success of machine learning and especially deep learning in many research areas such as Vision and Language Proces...,1,15.578946,Limitations machine learning
2,1,32063,2020.wwwconf_conference-2020c.108,Using Deep Learning for Temporal Forecasting of User Activity on Social Media: Challenges and Limitations The recent advances in neural network-ba...,Using Deep Learning for Temporal Forecasting of User Activity on Social Media: Challenges and Limitations,The recent advances in neural network-based machine learning algorithms promise a revolution in prediction-based tasks in a variety of domains. Of...,2,15.118075,Limitations machine learning
3,1,15517,2017.wsdm_conference-2017.57,Machine Learning at Amazon ABSTRACTIn this talk I will give an introduction into the field of machine learning and discuss why it is a crucial tec...,Machine Learning at Amazon,ABSTRACTIn this talk I will give an introduction into the field of machine learning and discuss why it is a crucial technology for Amazon.Machine ...,3,14.964288,Limitations machine learning
4,1,15379,2021.wsdm_conference-2021.78,Say No to the Discrimination: Learning Fair Graph Neural Networks with Limited Sensitive Attribute Information Graph neural networks (GNNs) have s...,Say No to the Discrimination: Learning Fair Graph Neural Networks with Limited Sensitive Attribute Information,"Graph neural networks (GNNs) have shown great power in modeling graph structured data. However, similar to other machine learning models, GNNs may...",4,14.802679,Limitations machine learning
...,...,...,...,...,...,...,...,...,...
995,1,15635,2020.wsdm_conference-2020.70,"Fast Item Ranking under Neural Network based Measures Recently, plenty of neural network based recommendation models have demonstrated their stren...",Fast Item Ranking under Neural Network based Measures,"Recently, plenty of neural network based recommendation models have demonstrated their strength in modeling complicated relationships between hete...",995,6.757645,Limitations machine learning
996,1,9900,2016.sigirconf_conference-2016.1,Understanding Human Language: Can NLP and Deep Learning Help? ABSTRACTThere is a lot of overlap between the core problems of information retrieval...,Understanding Human Language: Can NLP and Deep Learning Help?,ABSTRACTThere is a lot of overlap between the core problems of information retrieval (IR) and natural language processing (NLP). An IR system gain...,996,6.756975,Limitations machine learning
997,1,33234,2018.wwwconf_conference-2018.75,Through a Gender Lens: Learning Usage Patterns of Emojis from Large-Scale Android Users ABSTRACTBased on a large data set of emoji using behavior ...,Through a Gender Lens: Learning Usage Patterns of Emojis from Large-Scale Android Users,"ABSTRACTBased on a large data set of emoji using behavior collected from smartphone users over the world, this paper investigates genderspeci c us...",997,6.743410,Limitations machine learning
998,1,28677,2007.wwwconf_conference-2007.75,Exhibit: lightweight structured data publishing ABSTRACTThe early Web was hailed for giving individuals the same publishing power as large content...,Exhibit: lightweight structured data publishing,"ABSTRACTThe early Web was hailed for giving individuals the same publishing power as large content providers. But over time, large content provide...",998,6.727883,Limitations machine learning


Create Retrieval Pipeline
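Because Python's `%` and `*` bind more tightly than `+`, which in turn binds more tightly than `>>`, the pipeline expression reads as `(bm25 % 500) >> (weighted sum of the three combined scorers)`. The arithmetic applied to a single candidate document can be sketched in plain Python as follows. The weights are copied from the cell above; the function names and the per-field scores are made up for illustration.

```python
# Field weights per weighting model, copied from the notebook cell.
FIELD_WEIGHTS = {
    "bm25": {"title": 1.6, "abstract": 1.3, "text": 0.6},
    "dph": {"title": 1.6, "abstract": 1.2, "text": 0.6},
    "tfidf": {"title": 1.6, "abstract": 1.1, "text": 0.6},
}
# Weights of the three combined scores in the final pipeline.
MODEL_WEIGHTS = {"bm25": 0.5, "dph": 1.0, "tfidf": 1.1}

def combined_field_score(field_scores, weights):
    """Weighted sum of the title, abstract and text scores of one model."""
    return sum(weights[field] * field_scores[field] for field in weights)

def final_score(scores_by_model):
    """Weighted sum of the three combined model scores for one document."""
    return sum(
        MODEL_WEIGHTS[model] * combined_field_score(scores_by_model[model], FIELD_WEIGHTS[model])
        for model in MODEL_WEIGHTS
    )

# Hypothetical per-field scores of a single document for one query.
example = {
    "bm25": {"title": 4.0, "abstract": 2.0, "text": 5.0},
    "dph": {"title": 3.0, "abstract": 1.5, "text": 4.0},
    "tfidf": {"title": 2.5, "abstract": 1.0, "text": 3.0},
}
print(final_score(example))
```

Since the three models produce scores on different scales, the outer weights also act as a rough form of normalisation, which is one reason why small changes to them can reorder the results.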

In [8]:
index = pt.IndexFactory.of(index_ref)
print("Index informations:")
print(index.getCollectionStatistics().toString())

Index informations:
Number of documents: 53673
Number of terms: 40295
Number of postings: 1789380
Number of fields: 1
Number of tokens: 2706154
Field names: [text]
Positions:   true



Getting information about your created index directory.

In [9]:
run = bm25(queries[:1])
print('We look at the first 5 results of the run:\n')
run.head(5)

BR(BM25): 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  2.20q/s]

We look at the first 5 results of the run:






Unnamed: 0,qid,docid,docno,text,title,abstract,rank,score,query
0,1,49659,2021.ipm_journal-ir0anthology0volumeA58A1.6,Detecting health misinformation in online health communities: Incorporating behavioral features into machine learning based approaches,Detecting health misinformation in online health communities: Incorporating behavioral features into machine learning based approaches,,0,16.708093,detect health related queries
1,1,27490,2011.spire_conference-2011.10,Detecting Health Events on the Social Web to Enable Epidemic Intelligence,Detecting Health Events on the Social Web to Enable Epidemic Intelligence,,1,15.699445,detect health related queries
2,1,19930,2019.cikm_conference-2019.346,Concept Drift Adaption for Online Anomaly Detection in Structural Health Monitoring,Concept Drift Adaption for Online Anomaly Detection in Structural Health Monitoring,,2,15.507586,detect health related queries
3,1,39429,2021.tist_journal-ir0anthology0volumeA12A2.4,Indirectly Supervised Anomaly Detection of Clinically Meaningful Health Events from Smart Home Data,Indirectly Supervised Anomaly Detection of Clinically Meaningful Health Events from Smart Home Data,,3,15.137599,detect health related queries
4,1,33009,2013.wwwconf_conference-2013c.302,"From health-persona to societal health ABSTRACTIn this position paper, we propose an approach for Web Observatories that builds on using social me...",From health-persona to societal health,"ABSTRACTIn this position paper, we propose an approach for Web Observatories that builds on using social media, personal data, and sensors to buil...",4,14.88186,detect health related queries


Run plain BM25 on the first query and inspect the top results.

In [10]:
run = retrieval_pipeline(queries)

BR(BM25): 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:01<00:00,  3.08q/s]




Create the run

In [11]:
persist_and_normalize_run(run, output_file=output_directory, system_name='BM25', depth=1000)

print("Persist Run and normalize run.")

Persist Run and normalize run.


Persist Run.
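The idea behind normalising and persisting a run can be sketched as follows: sort the results of each query by descending score, assign ranks, cut at the requested depth, and write one TREC-style line per result. This is not the code of `persist_and_normalize_run`. The helper name, the rank numbering starting at 1, and the exact line layout are assumptions.

```python
from collections import defaultdict

def to_trec_run_lines(results, system_name="BM25", depth=1000):
    """Format (qid, docno, score) triples as 'qid Q0 docno rank score system' lines."""
    by_query = defaultdict(list)
    for qid, docno, score in results:
        by_query[qid].append((docno, score))

    lines = []
    for qid in sorted(by_query):
        # Highest score first, then keep at most `depth` results per query.
        ranked = sorted(by_query[qid], key=lambda pair: pair[1], reverse=True)[:depth]
        for rank, (docno, score) in enumerate(ranked, start=1):
            lines.append(f"{qid} Q0 {docno} {rank} {score} {system_name}")
    return lines

# Hypothetical results for two queries.
toy_results = [("1", "doc-a", 1.0), ("1", "doc-b", 3.0), ("2", "doc-c", 0.5)]
for line in to_trec_run_lines(toy_results):
    print(line)
```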

## Previous attempts

nDCG@10 0.4443
RR      0.7128
P@10    0.3500
R@100   0.5998

higher title rating and higher last step
nDCG@10 0.4625
RR      0.7167
P@10    0.3667
R@100   0.5998

2.0 Title higher Text
nDCG@10 0.3217
RR      0.3874
P@10    0.3167
R@100   0.5471

no abstract rating
nDCG@10 0.4714
RR      0.6444
P@10    0.3833
R@100   0.6980

staggered decrease in abstract value.
1.3 -> 1.2 -> 1.1
nDCG@10 0.4702
RR      0.6389
P@10    0.3833
R@100   0.6980

staggered decrease in text value
0.6 -> 0.5 -> 0.4
nDCG@10 0.4736
RR      0.6444
P@10    0.3833
R@100   0.6980

staggered increase in title value
1.6-> 1.8 -> 2.0
nDCG@10 0.4720
RR      0.6444
P@10    0.3833
R@100   0.6915

title weight of 1.6 in general
nDCG@10 0.4720
RR      0.6444
P@10    0.3833
R@100   0.6748

increased DPH value to 1.2 and text value to 0.8 in DPH
nDCG@10 0.4720
RR      0.6444
P@10    0.3833
R@100   0.6748
nDCG@10 0.4775
RR      0.6444
P@10    0.4000
R@100   0.6637

% 1000 -> % 500
nDCG@10 0.4704
RR      0.6444
P@10    0.3833
R@100   0.6804

last two steps weighted higher
nDCG@10 0.4669
RR      0.6444
P@10    0.3833
R@100   0.6637

higher bm25 values
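The manual weight changes logged in this section could also be explored systematically with a small grid search. The sketch below is only an illustration: `grid_search` is a hypothetical helper, and `toy_evaluate` is a stand-in for a function that would rebuild the pipeline with the given weights and return a measure such as nDCG@10.

```python
from itertools import product

def grid_search(evaluate, title_weights, abstract_weights, text_weights):
    """Try every weight combination and return the best one with its value."""
    best_weights, best_value = None, float("-inf")
    for weights in product(title_weights, abstract_weights, text_weights):
        value = evaluate(*weights)
        if value > best_value:
            best_weights, best_value = weights, value
    return best_weights, best_value

def toy_evaluate(title, abstract, text):
    # Hypothetical stand-in: peaks at (1.8, 1.2, 0.4). A real version would
    # run the retrieval pipeline and score the run against the qrels.
    return -((title - 1.8) ** 2 + (abstract - 1.2) ** 2 + (text - 0.4) ** 2)

best_weights, best_value = grid_search(
    toy_evaluate,
    title_weights=[1.6, 1.8, 2.0],
    abstract_weights=[1.1, 1.2, 1.3],
    text_weights=[0.4, 0.5, 0.6],
)
print(best_weights)
```

With only four topics, weights tuned this way can easily overfit, so any gains should be confirmed on additional queries where possible.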

## Reflections

Timothy Kriewald: The task of this milestone was to improve on the previous milestones (specifically Milestone 2). Unlike in the previous milestones, there were no major issues. Our main task here was to achieve the best possible result for the dataset and the topics through extensive testing. After trying the RM3 and SDM approaches, we quickly realised that they would not improve our results as expected. We decided to extend BM25 using "Multi Field". This approach provides a way to easily adjust the scores as desired. The biggest challenge here was to determine which field or approach affects the outcome and how to choose the optimal weighting. All in all, however, this was a significantly smoother milestone that, aside from a few minor hurdles, was much more pleasant than the previous ones.

Paul Gresens: At the start of this task I encountered a lot of technical problems, which I could only partly resolve, so I couldn't help my team as much as I wanted to. Apart from these difficulties, the overall goal of this assignment seemed a lot clearer than that of the first two, which helped a lot in understanding the assignment itself.

Constantin:
At the beginning of this milestone, I was at a bit of a loss as to how we could improve our retrieval system. At first, we tried using RM3 and SDM, which in fact worsened our results. So we quickly pivoted to the Multi Field retrieval method, since we could easily and quickly modify the values, try them out, and repeat. In the end we were able to improve on our earlier result; however, our improvements were not as significant as I would have hoped. All in all, the more freeform style of this milestone certainly added a sense of uncertainty, but more than that, the sense of accomplishment for every improvement felt all the better because of it.

Nils:
At the beginning of the task, there was only the baseline retrieval system that needed improvement. We initially implemented different notebooks and observed how the values evolved. Then, we experimented with various weights and methods using the Multi field Notebook.
During this milestone, it was enjoyable to try out different approaches and gradually understand the impact of value changes. However, it required quite a bit of work to collect, analyze, and evaluate the different approaches.
We encountered some issues in this milestone when we swapped our dataset with the tutors' dataset, as well as with the long loading times when uploading images and using the submission platform.
Overall, this task has shown me how exciting it can be to experiment with different values and evaluate the results.

Willi:
Having implemented a standard BM25 approach in Milestone 2, we knew the general procedure for implementing different retrieval approaches, which, together with the example notebooks provided, served as a solid tutorial.
The RM3 and SDM methods didn't improve our results, so we quickly switched to the Multi field notebook and changed the parameters to try to improve our scores, which in the end we did, even though the improvements weren't as significant as we had hoped. As I didn't do much on this milestone, I can't go into detail about all the problems we encountered, other than the usual ones such as non-functioning installations or problems with Docker.
In the end, I liked the freer and more goal-oriented working style of this last submission.