# Template Notebook for Milestones

In this notebook you will write your code, producing the required output for each Milestone.

Your notebook must contain 3 types of cells:

- (1) Code cells: Cells that contain code snippets, capturing one cohesive fragment of your code.

- (2) Corresponding explanation cells: Each code cell must be followed by a text cell containing the **English** explanation of what the corresponding code cell does and what it's purpose is

- (3) One reflection cell: One cell at the bottom of the notebook that contains your individual reflection on your process working on this milestones in **English**. It could contain technical problems and how you overcame them, it could contain social problems and how you deal with them (group work is hard!), it could contain explanations of prior skills or knowledge that made certain parts of the task easier for you, etc... (those are just suggestions. Your individual reflections will of course contain different/additional aspects)

In [1]:
import json
import os

In [2]:
raw_dataset = []

with open('/home/mark/Projects/Python/Information Retrieval/ir-anthology/ir-anthology-07-11-2021-ss23.jsonl') as f:
    for line in f:
        doc = json.loads(line)
        raw_dataset.append(doc)

In [3]:
print(raw_dataset[21])

{'crossref': 'DBLP:conf/sigir/2019birndl', 'booktitle': 'Proceedings of the 4th Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL 2019) co-located with the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2019), Paris, France, July 25, 2019', 'series': 'CEUR Workshop Proceedings', 'volume': '2414', 'pages': '196–207', 'publisher': 'CEUR-WS.org', 'year': '2019', 'url': 'http://ceur-ws.org/Vol-2414/paper20.pdf', 'biburl': 'https://dblp.org/rec/conf/sigir/LiZXHLLL19.bib', 'bibsource': 'dblp computer science bibliography, https://dblp.org', 'bibkey': 'DBLP:conf/sigir/LiZXHLLL19', 'bibtype': 'inproceedings', 'pdf': 'http://ceur-ws.org/Vol-2414/paper20.pdf', 'authors': ['Lei Li', 'Yingqi Zhu', 'Yang Xie', 'Zuying Huang', 'Wei Liu', 'Xingyuan Li', 'Yinan Liu'], 'editors': [], 'venue': 'SIGIR', 'id': '2019.sigirconf_workshop-2019birndl.21', 'date': 1581522299.0, 'abstrac

In [4]:
new_dataset = []

for doc in raw_dataset:
    doc_id = doc['id']
    title = doc['title']
    abstract = doc['abstract']
    text = title + '. ' + abstract
    new_doc = {'doc_id': doc_id, 'text': text}
    new_dataset.append(new_doc)

with open('/home/mark/Projects/Python/Information Retrieval/Lucky Coincidence/processed_dataset.jsonl', 'w') as f:
    for doc in new_dataset:
        json.dump(doc, f)
        f.write('\n')

In [5]:
import ir_datasets

class MyDataset(ir_datasets.Dataset):
    def __init__(self):
        self._docs_path = '/home/mark/Projects/Python/Information Retrieval/Lucky Coincidence/processed_dataset.jsonl'
        self._trec_topics_path = '/home/mark/Projects/Python/Information Retrieval/Lucky Coincidence/topics.xml'
        super().__init__('iranthology-luckycoincidence',
            {
                'docs': ir_datasets.TsvDocs({
                    'doc_id': 0,
                    'text': 1,
                }),
                'trec_topics': ir_datasets.TrecXmlQueries(),
        },
        {
            'docs': self._docs_path,
            'trec_topics': self._trec_topics_path,
        }
    )

    def docs_iter(self):
        with open(self._docs_path) as f:
            for line in f:
                doc = json.loads(line)
                yield {'doc_id': doc['doc_id'], 'text': doc['text']}

    def trec_topics(self):
        with open(self._trec_topics_path) as f:
            return f.read()

ir_datasets.registry.register(MyDataset())

AttributeError: module 'ir_datasets' has no attribute 'IndexedText'

### Example Explanation cell

The above cell prints a sentence in order to give an impression of what might be done with such a cell. This cell here explains it.

In [None]:
# add more code cells

### Add More Explanation cells

### Example Reflection Cell

Working on this notebook was difficult, because it is not the real notebook but just an example. My experience in writing notebooks helped speed up the process, for example knowing that there are different types of cells. However, it was difficult to figure our what to write exactly in order to make sure the students understand what they are supposed to do as I could not test it before giving it to the students.