# Exploring Ranking Models in Information Retrieval

## Objective
Understand the practical implementation and differences between the Vector Space Model and the Binary Independence Model in ranking documents relative to a user query.

### Step 1: Data Preprocessing

Ensure that the documents from the previous task are still loaded and preprocessed, so that the data is clean and ready for querying.
If they are not, write a function to load and preprocess the text documents from a specified directory: read each file, convert the text to lowercase for uniform processing, and store the results in a dictionary.

# LIBRARIES

In [23]:
import os
import re
import collections
import pandas as pd
from numpy import dot
from numpy.linalg import norm
import numpy as np

In [24]:
# Define the path to the directory containing the text files
CORPUS_DIR = "../week01/data"
documents = {}

In [25]:
def clean_text(text):
    # Remove symbols and parentheses (any character that is not a word character or whitespace) using a regular expression
    cleaned_text = re.sub(r'[^\w\s]', '', text)
    return cleaned_text

for filename in os.listdir(CORPUS_DIR):
    if filename.endswith('.txt'):
        file_path = os.path.join(CORPUS_DIR, filename)
        with open(file_path, 'r', encoding='utf-8') as file:
            text = file.read().lower()
            cleaned_text = clean_text(text)
            documents[filename] = cleaned_text
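
Because the task asks for a function, the loading loop above could also be wrapped into one. The following is only a minimal sketch: `load_corpus` is a hypothetical name that does not appear in the original notebook, and sorting the file names is an added assumption to make the result order reproducible.

```python
import os
import re


def clean_text(text):
    # Drop every character that is neither a word character nor whitespace
    return re.sub(r"[^\w\s]", "", text)


def load_corpus(corpus_dir):
    """Return {filename: lowercased, cleaned text} for every .txt file in corpus_dir."""
    docs = {}
    for filename in sorted(os.listdir(corpus_dir)):
        if filename.endswith(".txt"):
            path = os.path.join(corpus_dir, filename)
            with open(path, "r", encoding="utf-8") as handle:
                docs[filename] = clean_text(handle.read().lower())
    return docs
```

With this in place, `documents = load_corpus(CORPUS_DIR)` would build the same kind of dictionary as the loop.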

In [26]:
# Dictionary to store the normalized term frequencies of each document
normalized_word_counts = {}

In [27]:
for doc in documents:
    word_count = collections.Counter(documents[doc].split())
    total_words = sum(word_count.values())
    normalized_word_count = {word: count/total_words for word, count in word_count.items()}
    normalized_word_counts[doc] = normalized_word_count

In [28]:
df = pd.DataFrame.from_dict(normalized_word_counts)

In [29]:
df_no_nan = df.fillna(0)
df_no_nan = df_no_nan.rename_axis('Files')
df_no_nan.head(10)

Files,20_hrs_40_min_.txt,Adventures_of_Huckleberry_Finn.txt,Alices_Adventures_in_Wonderland.txt,Ang_Filibusterismo_Karugtóng_ng_Noli_Me_Tangere.txt,Anne_of_Green_Gables.txt,A_Christmas_Carol_in_Prose_Being_a_Ghost_Story_of_Christmas.txt,A_Dolls_House__a_play.txt,A_Modest_Proposal.txt,A_Room_with_a_View.txt,A_Study_in_Scarlet.txt,...,Thus_Spake_Zarathustra_A_Book_for_All_and_None.txt,Tractatus Logico-Philosophicus.txt,Treasure_Island.txt,Twenty_years_after.txt,Ulysses.txt,Up_the_Orinoco_and_down_the_Magdalena.txt,Walden_and_On_The_Duty_Of_Civil_Disobedience.txt,War_and_Peace.txt,WinniethePooh.txt,Wuthering_Heights.txt
the,0.065397,0.044282,0.061879,0.001607,0.038739,0.055549,0.033005,0.055297,0.046823,0.058548,...,0.053015,0.045192,0.064093,0.059102,0.05606,0.079774,0.063312,0.061084,0.034676,0.039849
project,0.002223,0.00078,0.002984,0.000753,0.000853,0.002785,0.002982,0.013824,0.001264,0.001894,...,0.000772,0.001716,0.001234,0.0004,0.000351,0.000631,0.000741,0.000178,0.003402,0.000765
gutenberg,0.002103,0.000762,0.00295,0.000744,0.000824,0.002754,0.002948,0.013514,0.00125,0.001873,...,0.000764,0.000585,0.00122,0.000363,0.000329,0.000573,0.000732,0.000154,0.003363,0.000732
ebook,0.000338,0.000114,0.000441,0.00011,0.000123,0.000411,0.000441,0.002019,0.000187,0.00028,...,0.000114,0.000273,0.000182,7.3e-05,5.2e-05,8.4e-05,0.000109,2.3e-05,0.000503,0.000109
of,0.031804,0.015634,0.021395,0.001049,0.01928,0.024815,0.017926,0.04023,0.020841,0.028714,...,0.025398,0.028282,0.025306,0.025328,0.030836,0.048698,0.030444,0.02649,0.020141,0.019697
20,0.000242,9e-06,3.4e-05,1.7e-05,9e-06,3.2e-05,3.4e-05,0.000155,1.4e-05,2.2e-05,...,8.8e-05,2e-05,2.8e-05,4e-06,5.6e-05,3.2e-05,8e-06,7e-06,3.9e-05,8e-06
hrs,0.000218,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
40,0.000218,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,2e-05,1.4e-05,0.0,1.5e-05,1.9e-05,0.0,2e-06,0.0,0.0
min,0.000193,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
our,0.002586,0.000946,0.000407,3.4e-05,0.000654,0.000601,0.00122,0.004039,0.001005,0.001485,...,0.001009,0.000839,0.002006,0.001461,0.001094,0.005664,0.002627,0.001102,0.000271,0.000967


In [30]:
df_no_nan.to_csv("vect_normalized.csv",index=True)

### Step 2: Vector Space Model (VSM)

Task: Implement a simple Vector Space Model using term frequency.

Requirements:
* _Document and Query Representation:_ Convert each document and the query into a vector where each dimension corresponds to a term from the corpus. Use simple term frequency for weighting.
* _Cosine Similarity Calculation:_ Calculate the cosine similarity between the query vector and each document vector.
* _Ranking:_ Rank the documents based on their cosine similarity scores from highest to lowest.
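
The three requirements above can be sketched as follows. This is a minimal illustration under stated assumptions, not the notebook's own implementation: the helper names (`build_tf_vectors`, `query_to_vector`, `cosine_similarity`, `rank_by_cosine`) and the toy corpus are made up for the example, and raw term counts are used as the term-frequency weights.

```python
import collections

import numpy as np


def build_tf_vectors(docs):
    """Map each document to a raw term-frequency vector over the corpus vocabulary."""
    vocabulary = sorted({term for text in docs.values() for term in text.split()})
    index = {term: i for i, term in enumerate(vocabulary)}
    vectors = {}
    for name, text in docs.items():
        vec = np.zeros(len(vocabulary))
        for term, count in collections.Counter(text.split()).items():
            vec[index[term]] = count
        vectors[name] = vec
    return index, vectors


def query_to_vector(query, index):
    """Term-frequency vector of the query; terms outside the vocabulary are ignored."""
    vec = np.zeros(len(index))
    for term in query.lower().split():
        if term in index:
            vec[index[term]] += 1
    return vec


def cosine_similarity(a, b):
    # Defined as 0.0 when either vector is all zeros, to avoid dividing by zero
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0


def rank_by_cosine(query, docs):
    """Return (document, score) pairs sorted from highest to lowest cosine similarity."""
    index, vectors = build_tf_vectors(docs)
    q = query_to_vector(query, index)
    scores = [(name, cosine_similarity(q, vec)) for name, vec in vectors.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Toy corpus, made up for illustration (not the Gutenberg files)
toy_docs = {
    "d1.txt": "the cat sat on the mat",
    "d2.txt": "the dog chased the cat",
    "d3.txt": "birds sing at dawn",
}
ranking = rank_by_cosine("cat mat", toy_docs)
```

For the query `"cat mat"`, `d1.txt` ranks first because it contains both query terms, while `d3.txt` shares no terms with the query and scores 0.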

In [31]:
df = pd.read_csv("vect_normalized.csv")

In [32]:
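import math

def euclidean_distance(vector1, vector2):
    # Euclidean distance between two vectors; smaller values mean closer vectors.
    # zip() stops at the shorter vector, so extra components of the longer one are ignored.
    squared_diff = sum((x - y) ** 2 for x, y in zip(vector1, vector2))
    return math.sqrt(squared_diff)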

In [33]:
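def get_query_vector(query):
    # The 'Files' column holds the terms; select the row of the query term and
    # drop that first column, keeping its normalized frequency in every document.
    # Raises IndexError if the term does not occur in the corpus.
    query_vector = df.loc[df['Files'] == query].values[0][1:]
    return query_vector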

In [34]:
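def rank_documents(query):
    query_vector = get_query_vector(query)
    similarities = []

    # Column 0 is 'Files' (the terms); every other column is one document
    for i in range(df.shape[1] - 1):
        doc_vector = df.iloc[:, i + 1].values
        distance = euclidean_distance(query_vector, doc_vector)
        similarities.append((df.columns[i + 1], distance))

    # Scores are Euclidean distances (not cosine similarities), so sort ascending:
    # the smallest distance ranks first
    ranked_documents = sorted(similarities, key=lambda x: x[1])
    return ranked_documents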

In [35]:
df.head(10)

,Files,20_hrs_40_min_.txt,Adventures_of_Huckleberry_Finn.txt,Alices_Adventures_in_Wonderland.txt,Ang_Filibusterismo_Karugtóng_ng_Noli_Me_Tangere.txt,Anne_of_Green_Gables.txt,A_Christmas_Carol_in_Prose_Being_a_Ghost_Story_of_Christmas.txt,A_Dolls_House__a_play.txt,A_Modest_Proposal.txt,A_Room_with_a_View.txt,...,Thus_Spake_Zarathustra_A_Book_for_All_and_None.txt,Tractatus Logico-Philosophicus.txt,Treasure_Island.txt,Twenty_years_after.txt,Ulysses.txt,Up_the_Orinoco_and_down_the_Magdalena.txt,Walden_and_On_The_Duty_Of_Civil_Disobedience.txt,War_and_Peace.txt,WinniethePooh.txt,Wuthering_Heights.txt
0,the,0.065397,0.044282,0.061879,0.001607,0.038739,0.055549,0.033005,0.055297,0.046823,...,0.053015,0.045192,0.064093,0.059102,0.05606,0.079774,0.063312,0.061084,0.034676,0.039849
1,project,0.002223,0.00078,0.002984,0.000753,0.000853,0.002785,0.002982,0.013824,0.001264,...,0.000772,0.001716,0.001234,0.0004,0.000351,0.000631,0.000741,0.000178,0.003402,0.000765
2,gutenberg,0.002103,0.000762,0.00295,0.000744,0.000824,0.002754,0.002948,0.013514,0.00125,...,0.000764,0.000585,0.00122,0.000363,0.000329,0.000573,0.000732,0.000154,0.003363,0.000732
3,ebook,0.000338,0.000114,0.000441,0.00011,0.000123,0.000411,0.000441,0.002019,0.000187,...,0.000114,0.000273,0.000182,7.3e-05,5.2e-05,8.4e-05,0.000109,2.3e-05,0.000503,0.000109
4,of,0.031804,0.015634,0.021395,0.001049,0.01928,0.024815,0.017926,0.04023,0.020841,...,0.025398,0.028282,0.025306,0.025328,0.030836,0.048698,0.030444,0.02649,0.020141,0.019697
5,20,0.000242,9e-06,3.4e-05,1.7e-05,9e-06,3.2e-05,3.4e-05,0.000155,1.4e-05,...,8.8e-05,2e-05,2.8e-05,4e-06,5.6e-05,3.2e-05,8e-06,7e-06,3.9e-05,8e-06
6,hrs,0.000218,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,40,0.000218,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,2e-05,1.4e-05,0.0,1.5e-05,1.9e-05,0.0,2e-06,0.0,0.0
8,min,0.000193,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,our,0.002586,0.000946,0.000407,3.4e-05,0.000654,0.000601,0.00122,0.004039,0.001005,...,0.001009,0.000839,0.002006,0.001461,0.001094,0.005664,0.002627,0.001102,0.000271,0.000967


In [36]:
query = input("Enter the word to search for: ")
ranked_docs = rank_documents(query)
print(f"Documents ranked by similarity to the query '{query}':")
for doc, score in ranked_docs:
    print(f"{doc}: {score}")

Documents ranked by similarity to the query 'cifras':
Don_Quijote.txt: 0.016459826252429456
La_casa_e_la_famiglia_di_Masaniello.txt: 0.016529453326237533
Geschiedenis_der_Noordsche_Compagnie.txt: 0.020794327611582165
Noli_Me_Tangere.txt: 0.032256556091244434
Ang_Filibusterismo_Karugtóng_ng_Noli_Me_Tangere.txt: 0.032717029955072766
Romeo_and_Juliet.txt: 0.060035677580618535
The_Complete_Works_of_William_Shakespeare.txt: 0.061343348158907254
Tractatus Logico-Philosophicus.txt: 0.06377605590635833
The_Blue_Castle_a_novel.txt: 0.06579234852946171
The_Importance_of_Being_Earnest_A_Trivial_Comedy_for_Serious_People.txt: 0.0672868626914898
Anne_of_Green_Gables.txt: 0.07015290351777582
Crime_and_Punishment.txt: 0.07093944175726419
WinniethePooh.txt: 0.07119607076018954
Notes_from_the_Underground.txt: 0.07131024592472482
A_Room_with_a_View.txt: 0.07142051621050313
A_Dolls_House__a_play.txt: 0.0714820488433844
Don_Juan.txt: 0.0715287895237159
The_divine_comedy.txt: 0.07206208844733245
...

### Step 3: Binary Independence Model (BIM)

Task: Implement a basic Binary Independence Model to rank documents.

Requirements:
* _Binary Representation:_ Represent the corpus and the query in binary vectors (1 if the term is present, 0 otherwise).
* _Probability Estimation:_ Assume arbitrary probabilities for the presence of each term in relevant and non-relevant documents.
* _Relevance Scoring:_ Calculate the relevance score for each document based on the product of probabilities for terms present in the query.
* _Ranking:_ Rank the documents based on their relevance scores from highest to lowest.
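
The requirements above can be sketched as follows. This is a minimal illustration under stated assumptions: the names `binary_terms`, `bim_score` and `rank_by_bim` are hypothetical, the toy corpus is made up, and the probabilities `p` (term present in a relevant document) and `u` (term present in a non-relevant document) are arbitrary constants shared by every term. The score is taken here as the product of the odds ratios p(1 - u) / (u(1 - p)) over the query terms present in the document, which is one common way to read the scoring requirement.

```python
def binary_terms(text):
    """Binary representation: the set of terms that are present (value 1) in the text."""
    return set(text.lower().split())


def bim_score(query_terms, doc_terms, p=0.5, u=0.1):
    """Product of odds ratios for the query terms that occur in the document.

    p: assumed P(term present | relevant document)
    u: assumed P(term present | non-relevant document)
    Both are arbitrary constants, as the task allows.
    """
    odds_ratio = (p * (1 - u)) / (u * (1 - p))
    score = 1.0  # empty product: a document matching no query term scores 1
    for _ in query_terms & doc_terms:
        score *= odds_ratio
    return score


def rank_by_bim(query, docs, p=0.5, u=0.1):
    """Return (document, score) pairs sorted from highest to lowest relevance score."""
    query_terms = binary_terms(query)
    scores = [
        (name, bim_score(query_terms, binary_terms(text), p, u))
        for name, text in docs.items()
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Toy corpus, made up for illustration (not the Gutenberg files)
toy_docs = {
    "d1.txt": "the cat sat on the mat",
    "d2.txt": "the dog chased the cat",
    "d3.txt": "birds sing at dawn",
}
bim_ranking = rank_by_bim("cat mat", toy_docs)
```

With constant probabilities the score depends only on how many query terms a document contains, so `d1.txt` (two matches) ranks above `d2.txt` (one match) and `d3.txt` (none). Estimating `p` and `u` per term would let rare terms weigh more than common ones.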