# Template Notebook for Milestones

In this notebook you will write your code, producing the required output for each Milestone.

Your notebook must contain 3 types of cells:

- (1) Code cells: Cells that contain code snippets, capturing one cohesive fragment of your code.

- (2) Corresponding explanation cells: Each code cell must be followed by a text cell containing the **English** explanation of what the corresponding code cell does and what its purpose is.

- (3) One reflection cell: One cell at the bottom of the notebook that contains your individual reflection, in **English**, on your process working on this milestone. It could cover technical problems and how you overcame them, social problems and how you dealt with them (group work is hard!), prior skills or knowledge that made certain parts of the task easier for you, etc. (Those are just suggestions. Your individual reflection will of course contain different or additional aspects.)

In [None]:
import json

# Import
The code cell above imports the **'json'** library, which we need to pre-process the data. Since we are dealing with a **'.jsonl'** file, we use the **'json.loads()'** function to parse the data line by line.

In [None]:
raw_dataset = []

# Read the JSONL file line by line; each line holds one JSON object.
with open('/home/mark/Projects/Python/information_retrieval/luckycoincidence/ir-anthology-07-11-2021-ss23.jsonl', encoding='utf-8') as f:
    for line in f:
        doc = json.loads(line)
        raw_dataset.append(doc)

# Loading and Parsing JSONL Dataset

This code reads a JSONL dataset from the specified file path, where each line in the file contains one JSON object.

The **'open()'** function opens the file. A for loop then iterates over its lines, and **'json.loads()'** parses each line into a dictionary, which is appended to the **'raw_dataset'** list.

The **'raw_dataset'** list stores the raw data, which can then be further processed and transformed.
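
As a minimal, self-contained sketch of the same pattern, the snippet below parses a small in-memory JSONL string instead of the real file. The sample records and the `parse_jsonl` helper are hypothetical and only for illustration. Skipping blank lines is an added assumption that the cell above does not make.

```python
import io
import json


def parse_jsonl(lines):
    """Parse an iterable of JSONL lines into a list of dicts (hypothetical helper)."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:  # assumption: tolerate blank lines, e.g. a trailing newline
            continue
        records.append(json.loads(line))
    return records


# Made-up sample data standing in for the real dataset file.
sample = io.StringIO(
    '{"id": "a1", "title": "First", "abstract": "One."}\n'
    '\n'
    '{"id": "b2", "title": "Second", "abstract": "Two."}\n'
)
sample_records = parse_jsonl(sample)
print(len(sample_records))
```

Because a file object and a `StringIO` are both iterables of lines, the same helper would work on the result of `open()`.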

In [None]:
print(raw_dataset[0])
print(raw_dataset[1])
print(raw_dataset[2])

# Retrieving Initial Dataset

The cell above shows the first 3 records of the data and the variables they contain. Since the data is already parsed into a list, we can easily access each element by its index.
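
A slightly more compact way to inspect the first records is slicing, which also avoids an `IndexError` when the list holds fewer than three elements. This is only a sketch: `sample_dataset` and its contents are made up and stand in for `raw_dataset`.

```python
# Made-up records standing in for raw_dataset.
sample_dataset = [
    {"id": "a1", "title": "First", "abstract": "One."},
    {"id": "b2", "title": "Second", "abstract": "Two."},
    {"id": "c3", "title": "Third", "abstract": "Three."},
    {"id": "d4", "title": "Fourth", "abstract": "Four."},
]

# Slicing returns the first three elements (fewer if the list is shorter).
for doc in sample_dataset[:3]:
    print(sorted(doc.keys()))

first_three_ids = [doc["id"] for doc in sample_dataset[:3]]
```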

In [None]:
new_dataset = []

# Keep only the id and a combined title/abstract text for each document.
for doc in raw_dataset:
    doc_id = doc['id']
    title = doc['title']
    abstract = doc['abstract']
    text = title + '. ' + abstract
    new_doc = {'doc_id': doc_id, 'text': text}
    new_dataset.append(new_doc)

# Write the processed documents as JSONL: one JSON object per line.
with open('/home/mark/Projects/Python/information_retrieval/luckycoincidence/processed_dataset.jsonl', 'w', encoding='utf-8') as f:
    for doc in new_dataset:
        json.dump(doc, f)
        f.write('\n')

# Pre-Processing Data
The cell above extracts selected variables from **'raw_dataset'**. **'id'** is extracted and renamed to **'doc_id'**, while **'title'** and **'abstract'** are joined with a period and a space and stored as **'text'**. This creates our new dataset. Then, we write this dataset to a JSONL file named **'processed_dataset.jsonl'**.
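
The transformation can be sketched on a couple of made-up records as follows. The `to_processed` helper is hypothetical, and its handling of a missing or `None` abstract is an assumption about the data, not something the cell above does (there, such a record would raise an error).

```python
import io
import json


def to_processed(doc):
    """Map a raw record to the {'doc_id', 'text'} shape (hypothetical helper)."""
    title = doc.get("title") or ""
    abstract = doc.get("abstract") or ""  # assumption: abstract may be missing or None
    text = f"{title}. {abstract}" if abstract else title
    return {"doc_id": doc["id"], "text": text}


# Made-up records standing in for raw_dataset.
sample_raw = [
    {"id": "a1", "title": "First", "abstract": "One."},
    {"id": "b2", "title": "Second", "abstract": None},
]
processed = [to_processed(doc) for doc in sample_raw]

# Write one JSON object per line, then read it back to check the round trip.
buffer = io.StringIO()
for doc in processed:
    buffer.write(json.dumps(doc) + "\n")

round_trip = [json.loads(line) for line in buffer.getvalue().splitlines()]
print(round_trip)
```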

In [None]:
import ir_datasets
from ir_datasets.formats import JsonlDocs, TrecXmlQueries
from ir_datasets.util import PackageDataFile
from ir_datasets.datasets.base import Dataset

ir_datasets.registry.register(
    'iranthology-luckycoincidence',
    Dataset(
        JsonlDocs(PackageDataFile('luckycoincidence/processed_dataset.jsonl')),
        TrecXmlQueries(PackageDataFile('luckycoincidence/topics.xml')),
    )
)

# Registering Lucky Coincidence Dataset in **'ir_datasets'**
This code registers the processed dataset and the TREC XML-format topics of the **Lucky Coincidence** team in **'ir_datasets'**.

The processed dataset and the topics are loaded using the **'JsonlDocs'** and **'TrecXmlQueries'** classes respectively, and each file path is wrapped in **'PackageDataFile'**.

Then, the dataset is registered under the name **'iranthology-luckycoincidence'** using the **'register()'** function of **'ir_datasets.registry'**.

This makes the dataset available under that name in future experiments and tasks, so that the documents and topics can be accessed through **'ir_datasets'**.

In [None]:
# Output processed dataset path
print('Processed dataset saved to "/home/mark/Projects/Python/information_retrieval/luckycoincidence/processed_dataset.jsonl"')

# Output TREC XML-format topics
with open('/home/mark/Projects/Python/information_retrieval/luckycoincidence/topics.xml', 'r') as f:
    xml_string = f.read()
    
with open('/home/mark/Projects/Python/information_retrieval/luckycoincidence/custom_topics.xml', 'w') as f:
    f.write(xml_string)

print('TREC XML-format topics saved to "/home/mark/Projects/Python/information_retrieval/luckycoincidence/custom_topics.xml"')

# Outputting Processed Dataset and TREC XML-format Topics

The **first line outputs** the path where the processed dataset is saved in JSONL format.

The **second block reads** the TREC XML-format topics file at the given file path into a string.

The **third block writes** that string to a new file, **'custom_topics.xml'**, creating a copy of the topics.

Finally, **the last line outputs** the path where the TREC XML-format topics are saved.
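
Since the read and write blocks together amount to copying a file unchanged, the same result could be obtained with `shutil.copyfile` from the standard library. The sketch below uses a temporary directory and a made-up one-line XML file rather than the real paths.

```python
import shutil
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    source = Path(tmp_dir) / "topics.xml"
    target = Path(tmp_dir) / "custom_topics.xml"

    # Made-up stand-in for the real topics file.
    source.write_text("<topics></topics>\n", encoding="utf-8")

    # Copy the file contents without loading them into a Python string first.
    shutil.copyfile(source, target)
    copied_text = target.read_text(encoding="utf-8")

print(copied_text)
```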

# Reflection

Working on this project was difficult and confusing in some parts. While it was relatively simple to go through the steps of creating a new .jsonl dataset and topics in .xml format, we were challenged by the parts of the assignment where we had to use Docker and TIRA. We had never created a Dockerfile before and ran into a lot of problems trying to use the ir_datasets library and build a Docker image. Due to our time limit of less than one week and the very limited amount of contact with the experts, we had to rely heavily on independent trial and error.