# Preparation

The image is built using the following command:

```
docker build -t registry.webis.de/code-research/tira/tira-user-ir-lab-sose-2023-randomizers/iranthology:0.0.1 .
```

This Docker image also contains Jupyter, so we can run and test this notebook via the following command:

```
docker run -p 8888:8888 --rm -ti -w /workspace -v ${PWD}:/workspace --entrypoint jupyter registry.webis.de/code-research/tira/tira-user-ir-lab-sose-2023-randomizers/iranthology:0.0.1 notebook --allow-root --ip 0.0.0.0
```

For the built image, the following two test commands work:

First, exporting the data:

```
tira-run \
    --output-directory ${PWD}/iranthology-dataset-tira \
    --image registry.webis.de/code-research/tira/tira-user-ir-lab-sose-2023-randomizers/iranthology:0.0.1 \
    --allow-network true \
    --command '/irds_cli.sh --ir_datasets_id iranthology-randomizers --output_dataset_path $outputDir'
```

Second, producing a ranking with an existing retrieval model:

```
tira-run \
    --input-directory ${PWD}/iranthology-dataset-tira \
    --image webis/tira-ir-starter-pyterrier:0.0.1-base \
    --command '/workspace/run-pyterrier-notebook.py --input $inputDataset --output $outputDir --notebook /workspace/full-rank-pipeline.ipynb'
```


# Our data integration

The ir_datasets integration is in the file `iranthology.py`:

In [9]:
!cat iranthology.py

import ir_datasets
from ir_datasets.formats import JsonlDocs, TrecXmlQueries
from typing import NamedTuple
from ir_datasets.datasets.base import Dataset

class IrAnthologyDocument(NamedTuple):
    doc_id: str
    text: str
    
    def default_text(self):
        return self.text

ir_datasets.registry.register('iranthology-randomizers', Dataset(
    JsonlDocs(ir_datasets.util.PackageDataFile(path='datasets_in_progress/ir-anthology-processed.jsonl'), doc_cls=IrAnthologyDocument, lang='en'),
    TrecXmlQueries(ir_datasets.util.PackageDataFile(path='datasets_in_progress/topics.xml'), lang='en')
))


In this implementation, we ...

### Example for reflection cells

In [5]:
import ir_datasets
dataset = ir_datasets.load("iranthology-randomizers")

example_document = dataset.docs_store().get('2014.cikm_workshop-2014dtmbio.20')
example_document

IrAnthologyDocument(doc_id='2014.cikm_workshop-2014dtmbio.20', text="{'crossref': 'DBLP:conf/cikm/2014dtmbio', 'booktitle': 'Proceedings of the ACM 8th International Workshop on Data and Text Mining in Bioinformatics, DTMBIO@CIKM 2014, Shanghai, China, November 7, 2014', 'pages': '47', 'publisher': 'ACM', 'year': '2014', 'url': 'https://doi.org/10.1145/2665970.2665989', 'doi': '10.1145/2665970.2665989', 'biburl': 'https://dblp.org/rec/conf/cikm/JungYYKCKL14.bib', 'bibsource': 'dblp computer science bibliography, https://dblp.org', 'bibkey': 'DBLP:conf/cikm/JungYYKCKL14', 'bibtype': 'inproceedings', 'authors': ['Jinmyung Jung', 'Hasun Yu', 'Seyeol Yoon', 'Mijin Kwon', 'Sungji Choo', 'Sangwoo Kim', 'Doheon Lee'], 'editors': [], 'venue': 'CIKM', 'id': '2014.cikm_workshop-2014dtmbio.20', 'date': 1541519853.0, 'abstract': 'ABSTRACTInferring drug-induced phenotypes via computational approaches can give a substantial support to drug discovery procedure. However, existing computational models 

TODO: Describe the document above and maybe the intention behind design decisions?

In [7]:
import ir_datasets
dataset = ir_datasets.load("iranthology-randomizers")

for query in dataset.queries_iter():
    print(query.query_id +':' + query.title)


1:Information Retrieval with algorithms
2:misspellings in queries
3:information in different language
4:Abbreviations in queries
5:lemmatization algorithms


TODO: Describe something for queries above?
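
The `topics.xml` file itself is not shown in this notebook. As a rough sketch, assuming it follows the common TREC-style XML layout (a `<topic>` element with a `number` attribute and `<title>`, `<description>` and `<narrative>` children), the query IDs and titles printed above could be read with the standard library alone. The XML content, the placeholder descriptions and the helper name `read_topics` are hypothetical; only the two titles are taken from the output above.

```python
import xml.etree.ElementTree as ET

# Hypothetical excerpt of a topics file; the real topics.xml may differ.
TOPICS_XML = """<topics>
  <topic number="1">
    <title>Information Retrieval with algorithms</title>
    <description>(placeholder description)</description>
    <narrative>(placeholder narrative)</narrative>
  </topic>
  <topic number="2">
    <title>misspellings in queries</title>
    <description>(placeholder description)</description>
    <narrative>(placeholder narrative)</narrative>
  </topic>
</topics>"""


def read_topics(xml_text):
    """Hypothetical helper: return (query_id, title) pairs in file order."""
    root = ET.fromstring(xml_text)
    return [
        (topic.attrib["number"], topic.findtext("title").strip())
        for topic in root.iter("topic")
    ]


topics = read_topics(TOPICS_XML)
for query_id, title in topics:
    print(query_id + ':' + title)
```

This only illustrates the assumed file layout; in the notebook, the queries are read by `TrecXmlQueries` from `ir_datasets`.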

# Milestone 1 (Group: Randomizers)

In [1]:
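import json

# Read the raw IR Anthology dump (one JSON object per line).
with open('./ir-anthology-07-11-2021-ss23.jsonl', 'r') as json_file:
    json_list = list(json_file)

# Convert every record into the {"doc_id", "text"} shape used by our ir_datasets integration.
documents = []
for json_str in json_list:
    result = json.loads(json_str)
    doc_id = result["id"]  # renamed from `id` to avoid shadowing the built-in
    # "text" currently holds the string representation of the whole record;
    # the contents of "text" are subject to change
    documents.append({"doc_id": doc_id, "text": f"{result}"})

# Write the processed documents, again one JSON object per line.
with open("ir-anthology-processed.jsonl", 'w') as f:
    for item in documents:
        f.write(json.dumps(item) + "\n")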

# JSONL processing
The first cell reads the given .jsonl file and writes a new file, `ir-anthology-processed.jsonl`, in which each line is a JSON object with the two fields `doc_id` and `text`.
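
To make the resulting form concrete, here is a minimal sketch that applies the same per-line conversion to a single made-up record. The record contents and the helper name `to_ir_datasets_record` are hypothetical and not part of our pipeline. The sketch also shows one consequence of storing `f"{result}"`: the `text` field is a Python string representation (single quotes), so for records made of plain strings, numbers and lists it can be turned back into a dictionary with `ast.literal_eval`, while `json.loads` rejects it.

```python
import ast
import json


def to_ir_datasets_record(raw_line: str) -> dict:
    """Hypothetical helper mirroring the per-line conversion of the first cell."""
    result = json.loads(raw_line)
    return {"doc_id": result["id"], "text": f"{result}"}


# Made-up input line, only for illustration.
raw_line = json.dumps({"id": "example.1", "title": "An example title", "year": "2014"})

record = to_ir_datasets_record(raw_line)
print(record["doc_id"])
print(record["text"])

# The text is a Python repr of the dict, so literal_eval recovers it.
recovered = ast.literal_eval(record["text"])

# The same text is not valid JSON because of the single quotes.
try:
    json.loads(record["text"])
    text_is_json = True
except json.JSONDecodeError:
    text_is_json = False
```

If the `text` field is later changed to hold only selected fields, this round trip no longer applies.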

In [None]:
import ir_datasets
from ir_datasets.formats import JsonlDocs, TrecXmlQueries
from typing import NamedTuple
from ir_datasets.datasets.base import Dataset

class IrAnthologyDocument(NamedTuple):
    doc_id: str
    text: str
    
    def default_text(self):
        return self.text

ir_datasets.registry.register('iranthology-randomizers', Dataset(
    JsonlDocs(ir_datasets.util.PackageDataFile(path='datasets_in_progress/ir-anthology-processed.jsonl'), doc_cls=IrAnthologyDocument, lang='en'),
    TrecXmlQueries(ir_datasets.util.PackageDataFile(path='datasets_in_progress/topics.xml'), lang='en')
))


# Registry

This cell registers the dataset with `ir_datasets` under the ID `iranthology-randomizers`.

# Reflection

Working in a group of five can be messy, but we managed to organize our group quite well. Finding a day and time at which everyone was free was not easy, but we figured it out, got to know each other a little, and decided how to approach the task.

Writing the script to process the .jsonl file was simple, given some prior Python experience and the good documentation and examples available online.
Figuring out how to apply all the steps from the pangram example in the tutorial to the actual assignment was more difficult. A lot of things, like the registration script and the tira-run command, were only given as examples, and 'translating' these examples to fit our files was not easy.