# Template Notebook for Milestones

In this notebook you will write your code, producing the required output for each Milestone.

Your notebook must contain 3 types of cells:

- (1) Code cells: Cells that contain code snippets, capturing one cohesive fragment of your code.

- (2) Corresponding explanation cells: Each code cell must be followed by a text cell containing an **English** explanation of what the corresponding code cell does and what its purpose is.

- (3) One reflection cell: One cell at the bottom of the notebook that contains your individual reflection, in **English**, on your process working on this milestone. It could cover technical problems and how you overcame them, social problems and how you dealt with them (group work is hard!), or prior skills and knowledge that made certain parts of the task easier for you. These are only suggestions; your individual reflection will of course contain different or additional aspects.

In [1]:
import ir_datasets
from ir_datasets.formats import JsonlDocs, TrecXmlQueries, TrecQrels
from typing import NamedTuple
from ir_datasets.datasets.base import Dataset

# Imports
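The code cell above imports the libraries used in this notebook. The main library is **'ir_datasets'**, which provides the format readers (**'JsonlDocs'**, **'TrecXmlQueries'**, **'TrecQrels'**) and the **'Dataset'** class used to assemble and register the dataset. The other library is **'typing'**, part of the Python standard library, which provides **'NamedTuple'** as the base for the document class.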

In [None]:
class IrAntologyData(NamedTuple):
    doc_id: str
    title: str
    abstract: str
    
    def default_text(self):
        return self.title

# Class Creation
This code defines a **'NamedTuple'** called **'IrAntologyData'** with three fields: **'doc_id'**, **'title'**, and **'abstract'**. A **'NamedTuple'** is a type of tuple whose fields can be accessed by name as well as by index.

**'IrAntologyData'** also includes a method called **'default_text()'** that returns the **'title'** field. Tools that need a single text per document can call **'default_text()'**; with this definition that text is the title only, so the abstract is not included.

This **'NamedTuple'** represents one document in the dataset: **'doc_id'** is the unique identifier of the document, **'title'** is its title, and **'abstract'** is its summary.
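The behaviour described above can be tried out without the dataset. The following is a minimal sketch that repeats the class definition and creates one document; the field values are made-up placeholders, not entries from the real collection.

```python
from typing import NamedTuple


class IrAntologyData(NamedTuple):
    doc_id: str
    title: str
    abstract: str

    def default_text(self):
        return self.title


# Hypothetical example values, not taken from the real dataset.
doc = IrAntologyData(
    doc_id="doc-1",
    title="An Example Title",
    abstract="An example abstract.",
)

by_name = doc.title                # access a field by name
by_index = doc[1]                  # the same field by position
default = doc.default_text()       # returns the title only
fields = IrAntologyData._fields    # field names in declaration order
```

Both access styles return the same value, and **'default_text()'** never touches the abstract.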

In [None]:
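# Register the dataset under the ID 'iranthology-lucky-coincidence'.
ir_datasets.registry.register('iranthology-lucky-coincidence', Dataset(
    # Documents: one JSON object per line, parsed into IrAntologyData.
    JsonlDocs(ir_datasets.util.PackageDataFile(path='datasets_in_progress/ir-anthology-07-11-2021-ss23.jsonl'), doc_cls=IrAntologyData, lang='en'),
    # Topics (queries) in TREC XML format.
    TrecXmlQueries(ir_datasets.util.PackageDataFile(path='datasets_in_progress/topics.xml'), lang='en'),
    # Relevance judgments in TREC qrels format, with a label for each relevance grade.
    TrecQrels(ir_datasets.util.PackageDataFile(path='datasets_in_progress/qrels.txt'), {0:'Not Relevant', 1:'Relevant'})
))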

# Registering IR Dataset
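This code registers a new dataset in the **'ir_datasets'** registry under the ID **'iranthology-lucky-coincidence'**. The dataset combines three components: the documents, read from a JSONL file with **'IrAntologyData'** as the document class; the topics, read from a TREC XML file; and the relevance judgments, read from a TREC qrels file in which 0 means 'Not Relevant' and 1 means 'Relevant'. The documents and queries are both marked as English.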
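To make the expected file formats more concrete, here is a rough sketch of how a JSONL document file and a TREC qrels file can be read with the standard library alone. It is not how **'ir_datasets'** implements its readers. The helpers **'parse_jsonl_docs'**, **'parse_qrels'**, and **'Qrel'** are hypothetical names introduced for illustration, and the sample lines are invented. The sketch assumes every JSON object has exactly the keys **'doc_id'**, **'title'**, and **'abstract'**, and that every qrels line has the four whitespace-separated columns query ID, iteration, document ID, and relevance.

```python
import json
from typing import List, NamedTuple


class IrAntologyData(NamedTuple):
    doc_id: str
    title: str
    abstract: str


class Qrel(NamedTuple):  # hypothetical helper type for this sketch
    query_id: str
    doc_id: str
    relevance: int


# Same relevance labels as in the registration above.
QREL_LABELS = {0: 'Not Relevant', 1: 'Relevant'}


def parse_jsonl_docs(text: str) -> List[IrAntologyData]:
    """Turn each non-empty JSONL line into one IrAntologyData tuple."""
    docs = []
    for line in text.splitlines():
        if line.strip():
            # Assumes the JSON keys match the NamedTuple fields exactly.
            docs.append(IrAntologyData(**json.loads(line)))
    return docs


def parse_qrels(text: str) -> List[Qrel]:
    """Parse lines of the form: query_id iteration doc_id relevance."""
    qrels = []
    for line in text.splitlines():
        if not line.strip():
            continue
        query_id, _iteration, doc_id, relevance = line.split()
        qrels.append(Qrel(query_id, doc_id, int(relevance)))
    return qrels


# Invented sample data, not taken from the real files.
sample_docs = (
    '{"doc_id": "doc-1", "title": "First title", "abstract": "First abstract"}\n'
    '{"doc_id": "doc-2", "title": "Second title", "abstract": "Second abstract"}\n'
)
sample_qrels = "1 0 doc-1 1\n1 0 doc-2 0\n"

docs = parse_jsonl_docs(sample_docs)
qrels = parse_qrels(sample_qrels)
labels = [QREL_LABELS[q.relevance] for q in qrels]
```

Here **'labels'** maps each judgment to its human-readable name, which is the role of the dictionary passed to **'TrecQrels'** in the registration.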

# Reflection

Working on this project was difficult and, in some parts, confusing. While it was relatively simple to go through the steps of creating a new .jsonl dataset and topics in .xml format, the parts of the assignment that required Docker and TIRA were challenging. We had never created a Dockerfile before and ran into many problems when trying to use the ir_datasets library and build a Docker image. Because we had less than one week and very limited contact with the experts, we had to rely heavily on independent trial and error.