# Exploring Ranking Models in Information Retrieval

## Objective
Understand the practical implementation and differences between the Vector Space Model and the Binary Independence Model in ranking documents relative to a user query.

### Step 1: Data Preprocessing

Make sure the documents from the previous task are still loaded and preprocessed; the data should be clean and ready for advanced querying.
If they are not, write a function that loads and preprocesses the text documents from a specified directory: read each file, convert the text to lowercase for uniform processing, and store the results in a dictionary.

In [1]:
import os
import re
import collections
import pandas as pd
from numpy import dot
from numpy.linalg import norm
import numpy as np

In [2]:

CORPUS_DIR = "Downloads/descargas"
documents = {}

In [3]:
# Define a function to clean the text
def clean_text(text):
    # Use a regular expression to remove every character that is neither a word character nor whitespace
    cleaned_text = re.sub(r'[^\w\s]', '', text)
    return cleaned_text

# Iterate over the files in the directory given by CORPUS_DIR
for filename in os.listdir(CORPUS_DIR):
    # Only process files with the '.txt' extension
    if filename.endswith('.txt'):
        # Build the full path to the file
        file_path = os.path.join(CORPUS_DIR, filename)
        # Open the file for reading with UTF-8 encoding
        with open(file_path, 'r', encoding='utf-8') as file:
            # Read the file contents and convert them to lowercase
            text = file.read().lower()
            # Clean the text with clean_text()
            cleaned_text = clean_text(text)
            # Store the cleaned text in the documents dictionary, keyed by filename
            documents[filename] = cleaned_text
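
Step 1 asks for a function, while the cell above runs the loading loop at module level. The same logic can be wrapped in a reusable function, roughly as sketched below. The name `load_documents` and its `corpus_dir` parameter are hypothetical (they do not appear elsewhere in this notebook), and `sorted()` is added only to make the iteration order predictable.

```python
import os
import re


def clean_text(text):
    """Remove every character that is neither a word character nor whitespace."""
    return re.sub(r'[^\w\s]', '', text)


def load_documents(corpus_dir):
    """Return {filename: lowercased, cleaned text} for each .txt file in corpus_dir.

    Hypothetical helper: it mirrors the loop in the cell above.
    """
    loaded = {}
    for filename in sorted(os.listdir(corpus_dir)):
        # Skip anything that is not a plain-text file
        if filename.endswith('.txt'):
            file_path = os.path.join(corpus_dir, filename)
            with open(file_path, 'r', encoding='utf-8') as file:
                loaded[filename] = clean_text(file.read().lower())
    return loaded
```

With this helper, the loading cell would reduce to `documents = load_documents(CORPUS_DIR)`.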


In [4]:
normalized_word_counts = {}

In [5]:
import collections
import pandas as pd

normalized_word_counts = {}

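# Compute normalized term frequencies: each term's count divided by the document's total word count
for doc in documents:
    word_count = collections.Counter(documents[doc].split())
    total_words = sum(word_count.values())
    normalized_word_count = {word: count/total_words for word, count in word_count.items()}
    normalized_word_counts[doc] = normalized_word_count

# Build a term-by-document table (rows are terms, columns are files); terms missing from a document get 0
df = pd.DataFrame.from_dict(normalized_word_counts)
df_no_nan = df.fillna(0)
# Label the index 'Archivo'; it holds the terms, and later cells look them up under this name
df_no_nan = df_no_nan.rename_axis('Archivo')

# Print the first 100 rows of the DataFrame
print(df_no_nan.head(100))

df_no_nan.to_csv("NormalizarVector.csv", index=True)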


           pg100.txt  pg10676.txt  pg1080.txt  pg10907.txt  pg11.txt  \
Archivo                                                                
the         0.031548     0.070610    0.055297     0.089691  0.061879   
project     0.000103     0.000965    0.013824     0.000416  0.002984   
gutenberg   0.000090     0.000745    0.013514     0.000398  0.002950   
ebook       0.000013     0.000110    0.002019     0.000059  0.000441   
of          0.019516     0.030938    0.040230     0.039828  0.021395   
...              ...          ...         ...          ...       ...   
fourth      0.000066     0.000042    0.000311     0.000104  0.000034   
second      0.000655     0.000144    0.000311     0.000258  0.000203   
life        0.000955     0.000542    0.000311     0.000127  0.000407   
fifth       0.000036     0.000034    0.000000     0.000054  0.000034   
sixth       0.000036     0.000000    0.000000     0.000023  0.000000   

           pg1184.txt  pg120.txt  pg1232.txt  pg12582.txt  pg12

### Step 2: Vector Space Model (VSM)

Task: Implement a simple Vector Space Model using term frequency.

Requirements:
* _Document and Query Representation:_ Convert each document and the query into a vector where each dimension corresponds to a term from the corpus. Use simple term frequency for weighting.
* _Cosine Similarity Calculation:_ Calculate the cosine similarity between the query vector and each document vector.
* _Ranking:_ Rank the documents based on their cosine similarity scores from highest to lowest.
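
The three requirements can be illustrated on a toy corpus before working with the real data. The sketch below is a minimal example under stated assumptions: the documents in `toy_docs`, the query string, and the helper names `tf_vector`, `cosine`, and `rank_by_cosine` are all made up for illustration and are not part of the notebook.

```python
import collections
import math


def tf_vector(text, vocabulary):
    """Raw term-frequency vector of `text` over a fixed vocabulary."""
    counts = collections.Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot_product = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot_product / (norm_u * norm_v)


def rank_by_cosine(query, docs):
    """Return (name, score) pairs sorted from highest to lowest similarity."""
    # One dimension per distinct term in the corpus
    vocabulary = sorted({w for text in docs.values() for w in text.lower().split()})
    query_vec = tf_vector(query, vocabulary)
    scores = {name: cosine(query_vec, tf_vector(text, vocabulary))
              for name, text in docs.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy corpus, invented for this example
toy_docs = {
    "d1": "the cat sat on the mat",
    "d2": "the dog chased the cat",
    "d3": "stock markets fell sharply",
}
ranking = rank_by_cosine("cat mat", toy_docs)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Here `d1` contains both query terms, `d2` only one, and `d3` none, so they rank in that order. Because cosine similarity divides by the vector lengths, a long document is not favored merely for containing more words.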

In [6]:
# Import pandas for data handling, NumPy for vector math, and collections for counting query terms
import collections
import numpy as np
import pandas as pd

# Read "NormalizarVector.csv" into df: one row per term ('Archivo' column), one column per document
df = pd.read_csv("NormalizarVector.csv")

# Cosine similarity between two equal-length vectors
def cosine_similarity(vector1, vector2):
    # The product of the norms is zero when either vector has no non-zero entries
    norm_product = np.linalg.norm(vector1) * np.linalg.norm(vector2)
    if norm_product == 0:
        return 0.0
    # Dot product divided by the product of the vector lengths
    return float(np.dot(vector1, vector2) / norm_product)

# Build the query vector: one dimension per corpus term, weighted by term frequency in the query
def get_query_vector(query):
    query_counts = collections.Counter(query.lower().split())
    query_vector = np.array([query_counts.get(term, 0) for term in df['Archivo']], dtype=float)
    return query_vector

# Compute the similarity between the query vector and every document in the DataFrame
def rank_documents(query):
    # Vector for the query, aligned with the rows of df
    query_vector = get_query_vector(query)
    similarities = []

    # Iterate over the document columns (column 0 holds the terms)
    for i in range(df.shape[1] - 1):
        # Term-frequency vector of one document
        doc_vector = df.iloc[:, i + 1].values
        # Cosine similarity between the query vector and the document vector
        similarity = cosine_similarity(query_vector, doc_vector)
        # Store the document name with its score
        similarities.append((df.columns[i + 1], similarity))

    # Sort ascending by score; search() walks this list in reverse, so the best match is shown first
    ranked_documents = sorted(similarities, key=lambda x: x[1])
    return ranked_documents

# Preview the first 100 rows of df as loaded from the CSV file
df.head(100)


Unnamed: 0,Archivo,pg100.txt,pg10676.txt,pg1080.txt,pg10907.txt,pg11.txt,pg1184.txt,pg120.txt,pg1232.txt,pg12582.txt,...,pg73448.txt,pg7370.txt,pg74.txt,pg76.txt,pg768.txt,pg84.txt,pg844.txt,pg8800.txt,pg98.txt,pg996.txt
0,the,0.031548,0.070610,0.055297,0.089691,0.061879,0.061475,0.064093,0.058870,0.085680,...,0.065397,0.061040,0.053341,0.044282,0.039849,0.056090,0.033986,0.051372,0.059054,0.052207
1,project,0.000103,0.000965,0.013824,0.000416,0.002984,0.000233,0.001234,0.001662,0.000389,...,0.002223,0.001479,0.001192,0.000780,0.000765,0.001127,0.003720,0.000788,0.000634,0.000221
2,gutenberg,0.000090,0.000745,0.013514,0.000398,0.002950,0.000190,0.001220,0.001643,0.000368,...,0.002103,0.001462,0.001178,0.000762,0.000732,0.001114,0.003678,0.000779,0.000626,0.000205
3,ebook,0.000013,0.000110,0.002019,0.000059,0.000441,0.000028,0.000182,0.000246,0.000055,...,0.000338,0.000218,0.000176,0.000114,0.000109,0.000166,0.000550,0.000116,0.000094,0.000033
4,of,0.019516,0.030938,0.040230,0.039828,0.021395,0.027768,0.025306,0.034317,0.040506,...,0.031804,0.044183,0.021496,0.015634,0.019697,0.035357,0.021939,0.023071,0.029765,0.031301
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,fourth,0.000066,0.000042,0.000311,0.000104,0.000034,0.000047,0.000014,0.000000,0.000076,...,0.000000,0.000000,0.000027,0.000035,0.000008,0.000000,0.000000,0.000081,0.000022,0.000028
96,second,0.000655,0.000144,0.000311,0.000258,0.000203,0.000353,0.000407,0.000397,0.000241,...,0.000218,0.000202,0.000108,0.000272,0.000185,0.000077,0.000211,0.000278,0.000230,0.000237
97,life,0.000955,0.000542,0.000311,0.000127,0.000407,0.000701,0.000379,0.000567,0.000123,...,0.000411,0.001513,0.000447,0.000237,0.000463,0.001473,0.001395,0.000797,0.001073,0.000916
98,fifth,0.000036,0.000034,0.000000,0.000054,0.000034,0.000050,0.000014,0.000000,0.000076,...,0.000000,0.000000,0.000000,0.000018,0.000008,0.000013,0.000000,0.000116,0.000022,0.000016


In [7]:
# Import tkinter to build the graphical interface
import tkinter as tk

# Callback that runs when the search button is clicked
def search():
    # Read the query typed by the user in the entry field
    query = entry.get()
    # Check whether the user asked to leave the program
    if query.lower() == 'exit':
        # Show a farewell message, then close the window after one second
        output_text.set("Exiting the program...")
        root.after(1000, root.destroy)
    # Check whether the query is missing from the 'Archivo' column of df
    elif query not in df['Archivo'].values:
        # Show an error message when the word does not occur in the documents
        output_text.set("The word was not found in the documents. Try another word.")
    else:
        # Get the documents scored against the query
        ranked_docs = rank_documents(query)
        # Build the text with the search results
        result_text = f"Documents ranked by similarity to the query '{query}':\n\n"
        for doc, score in reversed(ranked_docs):
            result_text += f"{doc}: {score}\n"
        # Display the results in the interface
        output_text.set(result_text)

# Create the main application window
root = tk.Tk()
root.title("Document Search")
root.geometry("400x300")  # Set the window size

# Font style for the title label
title_style = ("Helvetica", 14, "bold")

# Create the title label and add it to the window
title_label = tk.Label(root, text="Document Search", font=title_style)
title_label.pack(pady=10)

# Create the label and the text entry field for the query
query_label = tk.Label(root, text="Enter the word to search for (or 'exit' to quit):")
query_label.pack()
entry = tk.Entry(root, width=40, borderwidth=2)
entry.pack()

# Create the search button and add it to the window
search_button = tk.Button(root, text="Search", width=10, command=search)
search_button.pack(pady=10)

# Variable that holds the search results shown to the user
output_text = tk.StringVar()
output_text.set("")  # No results at start-up
output_label = tk.Label(root, textvariable=output_text, justify="left", anchor="w", wraplength=380)
output_label.pack(pady=10, padx=10)

# Start the main loop of the graphical interface
root.mainloop()


### Step 3: Binary Independence Model (BIM)

Task: Implement a basic Binary Independence Model to rank documents.

Requirements:
* _Binary Representation:_ Represent the corpus and the query in binary vectors (1 if the term is present, 0 otherwise).
* _Probability Estimation:_ Assume arbitrary probabilities for the presence of each term in relevant and non-relevant documents.
* _Relevance Scoring:_ Calculate the relevance score for each document based on the product of probabilities for terms present in the query.
* _Ranking:_ Rank the documents based on their relevance scores from highest to lowest.
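
One possible reading of these requirements is sketched below on a toy corpus. Everything in it is an assumption made for illustration: the documents in `bim_docs`, the helper names `binary_vector` and `bim_rank`, the fixed value `p_relevant = 0.5` for the probability that a term occurs in a relevant document, and the smoothed document-frequency estimate used for the probability `u` that a term occurs in a non-relevant document. The task only asks for arbitrary probabilities, so other choices are equally valid. The score multiplies the odds ratio `p(1 - u) / (u(1 - p))` over the query terms that a document contains.

```python
def binary_vector(text, vocabulary):
    """1 if the term occurs in `text`, 0 otherwise, for each vocabulary term."""
    present = set(text.lower().split())
    return [1 if term in present else 0 for term in vocabulary]


def bim_rank(query, docs, p_relevant=0.5):
    """Rank docs by the product of per-term odds ratios for matched query terms.

    p_relevant is an assumed P(term present | relevant), shared by all terms.
    """
    vocabulary = sorted({w for text in docs.values() for w in text.lower().split()})
    doc_vectors = {name: binary_vector(text, vocabulary) for name, text in docs.items()}
    query_vector = binary_vector(query, vocabulary)
    n_docs = len(docs)

    scores = {name: 1.0 for name in docs}
    for i, in_query in enumerate(query_vector):
        if not in_query:
            continue  # terms outside the query do not affect the score
        # Assumed estimate of P(term present | non-relevant), smoothed with 0.5
        doc_freq = sum(vec[i] for vec in doc_vectors.values())
        u = (doc_freq + 0.5) / (n_docs + 1.0)
        odds_ratio = (p_relevant * (1.0 - u)) / (u * (1.0 - p_relevant))
        for name, vec in doc_vectors.items():
            if vec[i]:
                scores[name] *= odds_ratio
    # Highest relevance score first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy corpus, invented for this example
bim_docs = {
    "d1": "the cat sat on the mat",
    "d2": "the dog chased the cat",
    "d3": "stock markets fell sharply",
    "d4": "rain is expected tomorrow",
    "d5": "the team won the match",
}
bim_ranking = bim_rank("cat mat", bim_docs)
for name, score in bim_ranking:
    print(f"{name}: {score:.2f}")
```

Under these assumptions, a term that occurs in more than about half of the documents gets an odds ratio below 1 and therefore lowers the score of documents that contain it. That is a property of the chosen estimates, not of the model itself. Unlike the Vector Space Model, the score ignores how often a term occurs in a document; only presence or absence matters.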