# <ins> Milestone 1 </ins>

The following presents the code that converts the file "ir-anthology-07-11-2021-ss23.jsonl" into a new file named "dnc-limited-documents.jsonl" in the required format and stores it in the directory "data". The newly created file is then registered as a dataset together with the required XML file "dnc-limited-topics.xml". <br/><br/>A reflection follows afterwards.

## Code

### Formatting "ir-anthology-07-11-2021-ss23" with the fields "doc_id" and "text"

In [12]:
import json #Module to work on JSON data
import os #Module for interacting with the operating system
import platform #Module to check the current platform

Importing necessary modules.

In [13]:
def getDirectory():
    return os.path.join(os.getcwd(), "")

Function to get the current directory in the right format.<br/><br/>Return: <br/> &emsp; Current directory as String.

In [14]:
xmlFileName = "dnc-limited-topics.xml"
inputFileName = "ir-anthology-07-11-2021-ss23.jsonl"
outputFileName = "dnc-limited-documents.jsonl"


#Using the path separator of the current platform ("/" on Linux and macOS, "\" on Windows).
irdatasetName = "dnc-limited-dataset-tira" + os.sep
directoryData = "data" + os.sep

inputFilePath = getDirectory() + directoryData + inputFileName
outputFilePath = getDirectory() + directoryData + outputFileName
xmlFilePath = getDirectory() + directoryData + xmlFileName
irdatasetPath = getDirectory() + irdatasetName

File names/folders and their paths.
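
The same paths can also be built without any manual separator handling. The following is a minimal sketch using `pathlib`, not the code used in this notebook; the helper name `build_paths` is hypothetical, while the file and folder names are the ones defined above.

```python
from pathlib import Path


def build_paths(base_dir):
    """Return the paths used in this notebook, relative to base_dir.

    build_paths is a hypothetical helper for illustration only.
    """
    base = Path(base_dir)
    data_dir = base / "data"
    return {
        "input": data_dir / "ir-anthology-07-11-2021-ss23.jsonl",
        "output": data_dir / "dnc-limited-documents.jsonl",
        "xml": data_dir / "dnc-limited-topics.xml",
        "irdataset": base / "dnc-limited-dataset-tira",
    }


# Path objects insert the correct separator for the current platform.
paths = build_paths(Path.cwd())
print(paths["output"])
```

Because `Path` objects are accepted by `open()` and by the `os.path` functions, they could be passed to the functions below unchanged.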

In [15]:
def checkForFile(file_path, file_name):
    #If the file exists at the given path, print a message and return True
    if os.path.isfile(file_path):
        print(f"The File {file_name} already exists.")
        return True
    #If the file doesn't exist at the given path, print a message and return False
    else:
        print(f"The File {file_name} doesn't exist.")
        return False

Function to check whether a file already exists at the given path. <br/><br/>Return: <br/> &emsp; True: If the file exists <br/> &emsp; False: If the file doesn't exist <br/><br/>Output: <br/> &emsp; Short message stating whether the file exists or not.

In [16]:
def getEntriesToJSONL(inputFilePath, outputFilePath, outputFileName):
    #If the file doesn't exist yet, create it as an empty file
    if not checkForFile(outputFilePath, outputFileName):
        try:
            with open(outputFilePath, "w", encoding="utf-8"):
                pass
            print(f"The File {outputFileName} was created in the following path: {outputFilePath}")
        #If the file could not be created, report the error
        except OSError as e:
            print(f"An error occurred creating the File {outputFileName}: {e}")

    # Open the input JSONL file for reading and the output JSONL file for writing
    with open(inputFilePath, "r", encoding="utf-8") as input_file, open(outputFilePath, "w", encoding="utf-8") as output_file:
        # Iterate over each line (object) of the input JSONL file
        for line in input_file:
            lineJSON = json.loads(line) # Load the current line as JSON
            # Join the values of all fields into one string for the "text" field
            stringJSON = " ".join(str(value) for value in lineJSON.values())

            # Create an object with the "doc_id" (id of the object) and "text" fields
            finalObject = {"doc_id": lineJSON["id"], "text": stringJSON}

            # Write finalObject as one JSON line to the output JSONL file
            output_file.write(json.dumps(finalObject) + "\n")

Function to convert the "inputFile" into a JSONL "outputFile" at the given path in two steps: <br/> &emsp; 1. Check whether the "outputFile" already exists and create it if necessary. <br/> &emsp; 2. Write one object with the fields "doc_id" and "text" per line of the "inputFile"; "text" is the concatenation of all field values of the input object.
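
To make the second step concrete, here is a minimal sketch of the mapping for a single line. The helper `to_document` and the sample record are hypothetical; the real anthology records have other fields, and only the key "id" is known from the code above.

```python
import json


def to_document(record):
    """Map one input record to the {"doc_id", "text"} shape described above."""
    # All values, including the id itself, are joined with single spaces.
    text = " ".join(str(value) for value in record.values())
    return {"doc_id": record["id"], "text": text}


# Hypothetical record; the field names other than "id" are assumptions.
sample_line = '{"id": "doc-1", "title": "A title", "year": 2021}'
document = to_document(json.loads(sample_line))
print(json.dumps(document))
# {"doc_id": "doc-1", "text": "doc-1 A title 2021"}
```

Note that the id is part of "text" as well, because every value of the input object is included.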

In [17]:
getEntriesToJSONL(inputFilePath, outputFilePath, outputFileName)

The File dnc-limited-documents.jsonl already exists.


Execute the function to convert the "inputFile". <br/><br/> Output: <br/> &emsp; A message stating whether the file already exists or has been created.
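
After the conversion, the result could be checked with a few lines of code. The sketch below is an optional sanity check and not part of the milestone code; the helper `count_valid_documents` is hypothetical, and it is demonstrated on a small temporary file rather than on the real output file.

```python
import json
import os
import tempfile


def count_valid_documents(path):
    """Count lines that are JSON objects with exactly the keys "doc_id" and "text"."""
    valid = 0
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            entry = json.loads(line)
            if set(entry) == {"doc_id", "text"}:
                valid += 1
    return valid


# Demonstration on a temporary two-line file (made-up content).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.jsonl")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write(json.dumps({"doc_id": "a", "text": "a first"}) + "\n")
        f.write(json.dumps({"doc_id": "b", "text": "b second"}) + "\n")
    demo_count = count_valid_documents(demo_path)

print(demo_count)
```

Calling `count_valid_documents(outputFilePath)` on the real file should then return the number of lines in the input file.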

## Register Dataset

In [18]:
!docker build -t dnc-limited-ir-dataset -f Dockerfile.iranthology .

#1 [internal] load build definition from Dockerfile.iranthology
#1 transferring dockerfile: 290B 0.0s done
#1 DONE 0.0s

#2 [internal] load .dockerignore
#2 transferring context: 2B done
#2 DONE 0.0s

#3 [internal] load metadata for docker.io/webis/tira-ir-datasets-starter:0.0.54
#3 DONE 0.6s

#4 [1/2] FROM docker.io/webis/tira-ir-datasets-starter:0.0.54@sha256:2d59e9cd38cfdde34662f8fba5b426dd1f2a7b29e54e50ce8be676d0ad3af2ad
#4 DONE 0.0s

#5 [internal] load build context
#5 transferring context: 59.78MB 0.7s done
#5 DONE 0.7s

#6 [2/2] COPY iranthology-dnc-limited.py data/dnc-limited-topics.xml  data/dnc-limited-documents.jsonl /usr/lib/python3.8/site-packages/ir_datasets/datasets_in_progress/
#6 CACHED

#7 exporting to image
#7 exporting layers done
#7 writing image sha256:fb7900afcc037be917df46496e06d06cb83992a0660a6526982d221a885b5058 done
#7 naming to docker.io/library/dnc-limited-ir-dataset done
#7 DONE 0.0s


Build a Docker image named "dnc-limited-ir-dataset" for the dataset from the Docker file "Dockerfile.iranthology".

In [19]:
#If the directory for the "tira-run" output already exists, only a message is printed
if os.path.isdir(irdatasetPath):
    print('The folder already exists! Please delete the folder to recreate the data.')
#If the directory doesn't exist, the platform is checked and the matching "tira-run" command is used
else:
    if platform.system() == 'Windows':
        !tira-run --output-directory %cd%\dnc-limited-dataset-tira --image dnc-limited-ir-dataset --allow-network true --command "/irds_cli.sh --ir_datasets_id iranthology-dnc-limited --output_dataset_path $outputDir"
    else:
        !tira-run --output-directory ${PWD}/dnc-limited-dataset-tira --image dnc-limited-ir-dataset --allow-network true --command '/irds_cli.sh --ir_datasets_id iranthology-dnc-limited --output_dataset_path $outputDir'

Task: Full-Rank -> create files: 
 documents.jsonl 
 queries.jsonl 
 qrels.txt 
 at /tira-data/output/


Load Documents: 21569it [00:02

Check whether the TIRA data has already been created and, if not, create it.
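
The log above names three files that the run creates. The sketch below checks for them after the run; it assumes that the files end up directly inside the folder passed to `--output-directory`, which the log does not confirm, and the helper `missing_output_files` is hypothetical. It is demonstrated on a temporary folder.

```python
import os
import tempfile

# File names taken from the "Task: Full-Rank" log output above.
EXPECTED_FILES = ("documents.jsonl", "queries.jsonl", "qrels.txt")


def missing_output_files(output_dir):
    """Return the expected file names that are not present in output_dir."""
    # Assumption: the files are written directly into output_dir.
    return [
        name
        for name in EXPECTED_FILES
        if not os.path.isfile(os.path.join(output_dir, name))
    ]


# Demonstration: a temporary folder that contains only one of the three files.
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "documents.jsonl"), "w").close()
    demo_missing = missing_output_files(tmp)

print(demo_missing)
# ['queries.jsonl', 'qrels.txt']
```

An empty list for `missing_output_files(irdatasetPath)` would indicate that all three files were written, provided the assumption about the folder layout holds.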

### Topics

The file "dnc-limited-topics.xml", which contains the topics, can be found in the "data" directory. <br/>Below, each topic is shown together with the name of the person who created it:

Topic 1 by Constantin Urbainsky:

```xml
<topic number="1">
  <title>machine learning for more relevant results</title>
  <description>Which papers describe methods to find more relevant results using machine learning?</description>
  <narrative>
      Relevant papers describe one or more methods to find more relevant results using machine learning.
      Papers about just machine learning in IR in general or papers just about finding more relevant results are not
      relevant.
  </narrative>
</topic>
```

Topic 2 by Nils Harbach:

```xml
<topic number="2">
  <title>Crawling websites using machine learning</title>
  <description>Papers that describe how to use AI to crawl the context of websites more efficiently.</description>
  <narrative>Papers in this topic describe methods and algorithms to use machine learning for crawling. They also contain information on the latest research findings on the topic. Papers about crawling methods without AI are not relevant for this topic.</narrative>
</topic>
```

Topic 3 by Willi Bittorf:

```xml
<topic number="3">
    <title>Recommenders influence on users</title>
    <description>Papers that describe the change in user behaviour because of recommenders.</description>
    <narrative>Relevant papers describe how users are affected by recommenders. Papers about recommenders from a technological point of view are not relevant.</narrative>
</topic>
```

Topic 4 by Tom Paul Gresens:

```xml
<topic number="4">
    <title>Search engine caching effects</title>
    <description>Papers that describe the effects and/or efficient use of search engine caching in terms of result freshness, query latency and other potential advantages or disadvantages.</description>
    <narrative>Papers in this topic describe the design trade-off between low-latency querying and returning the most recently available results, as well as different architectures to create efficient caching systems. Results should not contain any other caching-related topics (e.g. hardware or web browsers).</narrative>
</topic>
```

Topic 5 by Dorjan Domi:

```xml
<topic number="5">
    <title>Consumer Product reviews</title>
    <description>Papers that describe the effects of product reviews on consumer decisions</description>
    <narrative>Relevant papers describe the influence that reviews have on individual consumers' decisions on whether to buy a product. Papers containing other studies about reviews that do not pertain to human psychology are not relevant.</narrative>
</topic>
```

Topic 6 by Timothy Kriewald:

```xml
<topic number="6">
    <title>Limitations machine learning</title>
    <description>Which papers describe the limitations of machine learning?</description>
    <narrative>Relevant papers describe the limitations of machine learning (e.g. dependence on data quality and quantity, limited ability to handle complex tasks, vulnerability to disturbances and attacks, need for resources and energy). Papers that contain machine learning but not its limitations are not relevant.</narrative>
</topic>
```
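
The topic blocks above can be read with Python's standard XML parser. The following is a minimal sketch on a shortened stand-in for the file; the `<topics>` root element is an assumption, since only the individual `<topic>` elements are shown here, and the helper `read_topics` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for data/dnc-limited-topics.xml.
# Assumption: the topics are wrapped in a <topics> root element.
sample_xml = """<topics>
  <topic number="3">
    <title>Recommenders influence on users</title>
  </topic>
  <topic number="4">
    <title>Search engine caching effects</title>
  </topic>
</topics>"""


def read_topics(xml_text):
    """Return a dict that maps each topic number to its title."""
    root = ET.fromstring(xml_text)
    return {
        int(topic.get("number")): topic.findtext("title").strip()
        for topic in root.iter("topic")
    }


topics = read_topics(sample_xml)
for number, title in topics.items():
    print(number, title)
```

Since `root.iter("topic")` searches the whole tree, the sketch also works if the real root element has a different name.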

## Reflection

By Constantin:
While working on the first milestone, I was constantly unsure how to proceed. Although many tutorials and all of the data were available to us, the setup was nonetheless confusing. The information was spread out: some of it was in the notes for the lab, some on the assignment sheet. Moreover, although I followed the tutorial for installing and using tira, the command that worked for my teammates did not work for me. We tried our hardest to find out why, but ultimately failed. I was still able to contribute by helping fix problems as they arose, but I wasn't able to actually run tira locally.

While on the topic of teamwork, I would say that I have been very fortunate with my group. Everyone was very fun to be around and determined to get this project done. If I had to critique our team, it would have to be in terms of organization. We had quite a bit of trouble finding times for meetings and group-work sessions because of conflicting schedules. On top of that, since all of us got into the project late and as such missed the first week of lab and lectures, we had to play catch-up, which wasn't great.

In terms of prior experience, I didn't have much with the technologies used. I had never used Python before, although it wasn't as big a shift from Java as, say, Haskell was. Docker was used in our project for softwaretech-lab; however, it is still quite foreign to me, and finding my way around it was challenging. The terminal is also something that I used a bit in our softwaretech-lab and am therefore somewhat familiar with, but, much like with Docker, I still have much to learn.

By Willi:
My primary source of concern during work on our first submission was the task itself, as there was no obvious path to completing it. All the subtasks were doable and we managed to complete most of them quickly, but we had a hard time understanding what the final result should actually look like.
The tutorials were helpful but scattered, which led to more confusion while we were getting to know all the different technologies we're going to use in this course.

By Nils:
Starting the project was very difficult for me personally, as I didn't have much experience in Python. In addition, the information and tutorials on the project were spread widely across all the different platforms, which made it very difficult to get up to speed. Also, in my eyes, the assignment on the sheet was vaguely worded, so we didn't know exactly what we had to do and, especially, what had to be handed in. This is very frustrating at the beginning, when you don't know exactly what to do and can't develop a clear plan. As a result, we spent a long time developing things that were not required for the first milestone.

By Timothy Kriewald: <br/>While working on the first milestone, it became clear that the biggest challenge was understanding the task at hand. As a group, we were often unsure about what was expected of us and how to approach it. Each member had a slightly different understanding, leading to shifts in direction and disagreement throughout the project. Therefore, much of the work that was completed had to be removed in the end.

Despite this setback, I can look back positively on our group work. Almost every team member was determined and motivated to complete the task to the best of their abilities. Although there were some difficulties in finding suitable meeting times, we worked continuously. Unfortunately, most of us, including myself, were added to the module a week after the lecture started, which made the start a bit challenging. However, with some time and effort, we were able to overcome it.

Potential problems, such as setting up the Jupyter notebook or creating and processing a Docker image, could be prevented through the tutorials. While they were somewhat scattered, they were helpful, and all of my installations went flawlessly. Other issues were due to inexperience with Python, but these were comparatively easy to solve through teamwork, StackOverflow, and ChatGPT. Technical problems were much easier to solve than those resulting from disagreement and misunderstandings. We had to make many changes, which caused problems within the program and cost us a lot of time and effort.

Overall, the work was interesting, and the group collaboration was better than expected. However, as a group, we need to develop a solid plan and establish a truly unified understanding of the task at the outset to prevent many problems in the long run.

By Dorjan:
The start of the project was quite rough. We started off with the assignment itself, which was quite obscure. We needed to transform this large dataset into a specific new structure so that the result could be used for milestone 2. However, not understanding what milestone 2 really entails made this part of the assignment very challenging. As a team, we decided to take a particular approach based on our intuition and the given structure in the example of milestone 1. After having completed the main assignment, however, we started struggling to put this together in a Docker image.

Unfortunately, since I didn't have any experience with Docker at all, I couldn't be of great help to my teammates in this regard. After several days of pondering the issue, some of our teammates who were a bit more well versed in using Docker managed to fix it.

What I am looking forward to is more organized teamwork that plays to everyone's strengths. The hope for the future of this project is that we can overcome the technical difficulties of making the project work and focus more on the contents of the assignment itself.

By Paul:
I joined the team later than intended due to a longer period of illness. Despite this setback, my team welcomed me warmly and gave me helpful tips on how to catch up quickly. Although I had never worked with Python or Jupyter before, I understood both quite quickly with the given tutorials, even though some information, including details about the actual assignment, was quite spread out. I'm looking forward to working with the team and contributing my own strengths to the project.