# Vector space models

This notebook was produced for the final project of the course "Information Retrieval" at the University of Trieste. The goal of the project is to implement a vector space model for information retrieval. The notebook is structured as follows:
- [1. Overview](#1.-Overview): A simple tour of the IR system using Python scripts
- [2. Implementation](#2.-Implementation): A detailed description of the implementation of the IR system
- [3. Evaluation](#3.-Evaluation): A description of the evaluation of the IR system

## Overview

First, we import the `IRSystem` class from the `src` folder:

In [1]:
from src.ir_system import IRSystem

%load_ext autoreload
%autoreload 2

Then we create an instance of the `IRSystem` class, choosing the dataset to use:
- `dataset`: the dataset to use. It can be either `time` or `reuters`
  
When the dataset is loaded, an inverted index is created. The inverted index is a dictionary whose keys are terms and whose values are posting lists: the documents in which the term appears, stored as (document ID, term count) pairs.
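
To make the data structure concrete, here is a minimal sketch of how such an index could be built from tokenized documents. It is not the project's `Index` class: the function name `build_inverted_index` and the toy corpus are hypothetical, and the posting order (highest count first, then lowest document ID) is an assumption chosen to mirror the posting list printed later in this section.

```python
from collections import Counter, defaultdict


def build_inverted_index(corpus):
    """Map each term to a posting list of (doc_id, term_count) pairs.

    `corpus` is a list of documents, each a list of tokens.
    Hypothetical helper for illustration only.
    """
    index = defaultdict(list)
    for doc_id, tokens in enumerate(corpus):
        # Count each term once per document, then record one posting per term
        for term, count in Counter(tokens).items():
            index[term].append((doc_id, count))
    # Assumed ordering: descending count, ties broken by ascending doc_id
    for postings in index.values():
        postings.sort(key=lambda posting: (-posting[1], posting[0]))
    return dict(index)


toy_corpus = [
    ["america", "votes", "america"],
    ["europe", "votes"],
    ["america", "trades", "with", "europe"],
]
toy_index = build_inverted_index(toy_corpus)
print(toy_index["america"])  # [(0, 2), (2, 1)]
```

Storing the count next to the document ID means term frequencies are available at query time without rereading the documents.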

If the index has already been built, it can be loaded from the `data` folder instead of being rebuilt. Here we pass `load=False`, so the index is built from scratch:

In [2]:
system = IRSystem(dataset="time", load=False)

Indexing:: 100%|██████████| 422/422 [00:02<00:00, 154.07it/s]


Now the IR system is ready to be used. Let's explore the `IRSystem` class:

In [3]:
corpus = system['corpus'] # corpus: list of documents, each a list of words
index = system['index'] # index: instance of the custom Index class
query = system['query'] # queries: list of queries, each a list of words

Let's explore the index:

In [7]:
# Get the posting list of a term
posting_list = system["index"]["america"]
# Get the idf of a term
idf = system["index"].idf("america")
# Get the tf-idf of a term in a document
tf_idf = system["index"].tf_idf("america", 46, len(corpus))

print("Posting list:", posting_list, "\nidf:", idf, "\ntfidf:", tf_idf)


Posting list: [(46, 3), (53, 3), (263, 2), (3, 1), (47, 1), (131, 1), (173, 1), (214, 1), (270, 1), (271, 1), (274, 1), (278, 1), (279, 1), (293, 1), (294, 1), (321, 1), (327, 1), (366, 1), (381, 1), (404, 1), (417, 1)] 
idf: 1.3030931562277546 
tfidf: 3.9092794686832635
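
The printed values are consistent with the common weighting idf = log10(N / df) and tf-idf = tf × idf, where N is the number of documents and df is the length of the posting list. This is inferred from the numbers above, not from the source of the `Index` class, so treat the sketch below as a sanity check under that assumption. The variable names are illustrative.

```python
import math

n_docs = 422  # documents indexed, as reported by the progress bar
df = 21       # number of postings for "america" in the output above
tf = 3        # count of "america" in document 46, from the posting (46, 3)

# Assumed formulas: base-10 log idf, raw term count as tf
idf = math.log10(n_docs / df)
tf_idf = tf * idf

print("idf:", idf)
print("tfidf:", tf_idf)
```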


With the index in place, we can run a query using the `search` method. The method returns a list of documents sorted by relevance, where relevance is the cosine similarity between the query vector and each document vector.
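
As a rough illustration of cosine-similarity ranking, the sketch below scores a few toy documents against a query. It is a simplification, not the `search` implementation: vectors use raw term counts rather than tf-idf weights, every document is scored without consulting an index, and the names `cosine_similarity` and `rank` are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * vec_b.get(term, 0.0) for term, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector has no direction
    return dot / (norm_a * norm_b)


def rank(query_tokens, docs, k=10):
    """Return the k best (doc_id, score) pairs, highest score first."""
    query_vec = Counter(query_tokens)
    scored = [
        (doc_id, cosine_similarity(query_vec, Counter(tokens)))
        for doc_id, tokens in enumerate(docs)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


toy_docs = [
    ["south", "viet", "nam"],
    ["iraq", "army"],
    ["viet", "nam", "regime", "policy"],
]
ranking = rank(["viet", "nam", "policy"], toy_docs, k=2)
print(ranking)  # document 2 ranks first, then document 0
```

Because the score is normalized by vector length, a long document is not favored merely for containing more words.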

In [6]:
# select a query
query = system["query"][3]
print("Query:", query)

# get the top 10 documents for the query
results = system.search(query, k=10)

# Explore the results and their scores
print("Results:", results)

# Get the documents from the results
documents = system.get_document_from_list(results)
print("Documents:", documents[0])

Query: ['us', 'policy', 'toward', 'the', 'new', 'regime', 'in', 'south', 'viet', 'nam', 'which', 'overthrew', 'president', 'diem']
Results: [docid:54 score:1.0, docid:254 score:1.0, docid:283 score:1.0]
Documents: ['iraq', 'friends', 'brothers', 'not', 'long', 'ago', 'abdul', 'karim', 'kassem', 'lean', 'and', 'psychotic', 'strongman', 'of', 'iraq', 'boasted', 'that', 'he', 'had', 'survived', 'attempts', 'to', 'kill', 'him', 'over', 'the', 'past', 'f', 'years', 'last', 'week', 'in', 'baghdad', 'death', 'kept', 'the', 'th', 'appointment', 'rebel', 'iraqi', 'army', 'officers', 'overthrew', 'the', 'government', 'and', 'issued', 'a', 'characteristic', 'middle', 'eastern', 'communique', 'with', 'the', 'help', 'of', 'god', 'we', 'have', 'been', 'able', 'to', 'destroy', 'the', 'enemy', 'of', 'god', 'and', 'of', 'the', 'people', 'abdul', 'karim', 'kassem', 'and', 'his', 'gang', 'who', 'have', 'used', 'the', 'country', 'for', 'their', 'own', 'interests', 'and', 'who', 'choked', 'liberty', 'and',