# Exploring Ranking Models in Information Retrieval

## Objective
Understand the practical implementation and differences between the Vector Space Model and the Binary Independence Model in ranking documents relative to a user query.

### Step 1: Data Preprocessing

Make sure the documents from the previous task are loaded and preprocessed, so that the data is clean and ready for querying.
If they are not, write a function that loads and preprocesses the text documents from a specified directory: read each file, convert the text to lowercase for uniform processing, and store the results in a dictionary keyed by filename.
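
The loading step described above could be wrapped in a function along these lines. This is a minimal sketch, not the notebook's own code: the name `load_documents` is hypothetical, and the choice of the `'utf-8-sig'` codec (which drops a leading byte-order mark, if one is present) is an assumption about the input files.

```python
import os

def load_documents(directory):
    """Read every .txt file in `directory`; return {filename: lowercased text}."""
    docs = {}
    for filename in sorted(os.listdir(directory)):
        if not filename.endswith('.txt'):
            continue  # skip anything that is not a plain-text document
        path = os.path.join(directory, filename)
        # 'utf-8-sig' decodes UTF-8 and strips a leading BOM when there is one
        with open(path, 'r', encoding='utf-8-sig') as fh:
            docs[filename] = fh.read().lower()
    return docs
```

Calling `load_documents(CORPUS_DIR)` with the directory constant defined in the first cell would return a dictionary of the same shape as `documents`.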

In [1]:
import os
import pandas as pd
from collections import defaultdict

# Define the path to the directory containing the text files
CORPUS_DIR = '../../week01/data'
documents = {}

# Dictionary to store word counts for each document
word_counts = defaultdict(lambda: defaultdict(int))

### Step 2: Vector Space Model (VSM)

Task: Implement a simple Vector Space Model using term frequency.

Requirements:
* _Document and Query Representation:_ Convert each document and the query into a vector where each dimension corresponds to a term from the corpus. Use simple term frequency for weighting.
* _Cosine Similarity Calculation:_ Calculate the cosine similarity between the query vector and each document vector.
* _Ranking:_ Rank the documents based on their cosine similarity scores from highest to lowest.
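
The three requirements above can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not a reference solution: the helper names (`tf_vector`, `cosine_similarity`, `rank_vsm`) and the toy documents are hypothetical, tokenization is a simple whitespace split, and vectors are stored sparsely as `Counter` objects, which is equivalent to a full corpus-sized vector because absent terms count as 0.

```python
import math
from collections import Counter

def tf_vector(text):
    """Sparse term-frequency vector: term -> raw count."""
    return Counter(text.lower().split())

def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(count * vec_b[term] for term, count in vec_a.items() if term in vec_b)
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as dissimilar to everything
    return dot / (norm_a * norm_b)

def rank_vsm(query, docs):
    """Return (filename, score) pairs sorted from most to least similar."""
    query_vec = tf_vector(query)
    scores = {name: cosine_similarity(query_vec, tf_vector(text))
              for name, text in docs.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Toy corpus (made up for illustration; not part of the assignment data)
toy_docs = {
    'd1.txt': 'the whale hunts the ship',
    'd2.txt': 'a room with a view',
    'd3.txt': 'the ship sails',
}
vsm_ranking = rank_vsm('whale ship', toy_docs)
for name, score in vsm_ranking:
    print(f'{name}: {score:.3f}')
```

On the real corpus, `rank_vsm(query, documents)` would take the `documents` dictionary built during preprocessing. Raw term frequency favors documents in which the query terms make up a large share of all tokens, since cosine similarity normalizes for document length.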

In [2]:
# Iterate through each file in the directory
for filename in os.listdir(CORPUS_DIR):
    if filename.endswith('.txt'):
        file_path = os.path.join(CORPUS_DIR, filename)
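        # Note: plain 'utf-8' keeps a leading byte-order mark (BOM) in the text; 'utf-8-sig' would strip it
        with open(file_path, 'r', encoding='utf-8') as file: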
            document_text = file.read().lower()  # Read and convert to lowercase
            documents[filename] = document_text

            # Count word occurrences in the current document
            words = document_text.split()
            for word in words:
                word_counts[word][filename] += 1

# Create a DataFrame from the word counts
df = pd.DataFrame(word_counts)

# Fill NaN values with 0 (for words not present in some documents)
df.fillna(0, inplace=True)

# Save the DataFrame to a CSV file
# df.to_csv('word_counts_per_document.csv')

# Display the DataFrame
print(df)

                                   ﻿the  project  gutenberg  ebook     of  \
A Room with a View.txt                1       83         25      8   1435   
Chronicles of London Bridge.txt       1       84         25      8   9664   
Winnie-the-Pooh.txt                   1       83         25      8    507   
The Enchanted April.txt               1       83         25      8   1842   
Moby Dick.txt                         1       87         25      8   6682   
...                                 ...      ...        ...    ...    ...   
The Hound of the Baskervilles.txt     1       83         25      8   1719   
Romeo and Juliet.txt                  1       84         26      8    516   
The Blue Castle- a novel.txt          1       83         25      8   1591   
The Metamorphoses of Ovid.txt         1       83         25      8   5601   
The History of Woman Suffrage.txt     1       86         25      8  23106   


### Step 3: Binary Independence Model (BIM)

Task: Implement a basic Binary Independence Model to rank documents.

Requirements:
* _Binary Representation:_ Represent the corpus and the query in binary vectors (1 if the term is present, 0 otherwise).
* _Probability Estimation:_ Assume arbitrary probabilities for the presence of each term in relevant and non-relevant documents.
* _Relevance Scoring:_ Calculate the relevance score for each document based on the product of probabilities for terms present in the query.
* _Ranking:_ Rank the documents based on their relevance scores from highest to lowest.
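
One way to read the requirements above is sketched below. It is an assumption-laden illustration rather than the expected solution: the names `binary_vector`, `bim_score`, and `rank_bim` are hypothetical, the probabilities `p` (term present in a relevant document) and `u` (term present in a non-relevant document) are arbitrary made-up values, and the score is taken to be the product of the odds ratios `p(1-u) / (u(1-p))` over query terms that occur in the document, which is one common formulation of BIM ranking.

```python
def binary_vector(text):
    """Binary representation: the set of terms present (1); all others are 0."""
    return set(text.lower().split())

def bim_score(query_terms, doc_terms, p, u, default_p=0.5, default_u=0.1):
    """Product of odds ratios for query terms that appear in the document."""
    score = 1.0  # empty product: a document matching no query term scores 1.0
    for term in query_terms & doc_terms:
        p_t = p.get(term, default_p)  # assumed P(term present | relevant)
        u_t = u.get(term, default_u)  # assumed P(term present | non-relevant)
        score *= (p_t * (1 - u_t)) / (u_t * (1 - p_t))
    return score

def rank_bim(query, docs, p, u):
    """Return (filename, score) pairs sorted from highest to lowest score."""
    query_terms = binary_vector(query)
    scores = {name: bim_score(query_terms, binary_vector(text), p, u)
              for name, text in docs.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Toy corpus and arbitrary probabilities (made up for illustration)
toy_docs = {
    'd1.txt': 'the whale hunts the ship',
    'd2.txt': 'a room with a view',
    'd3.txt': 'the ship sails',
}
p = {'whale': 0.8, 'ship': 0.6}
u = {'whale': 0.1, 'ship': 0.3}
bim_ranking = rank_bim('whale ship', toy_docs, p, u)
for name, score in bim_ranking:
    print(f'{name}: {score:.2f}')
```

Because the representation is binary, a term that occurs once contributes exactly as much as a term that occurs a thousand times; only the chosen `p` and `u` values distinguish the terms. Probabilities of exactly 0 or 1 would make the ratio undefined, so the sketch assumes values strictly between the two.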

In [3]:
# Transpose so that rows are terms and columns are documents
df = df.transpose()

In [4]:
df

Unnamed: 0,A Room with a View.txt,Chronicles of London Bridge.txt,Winnie-the-Pooh.txt,The Enchanted April.txt,Moby Dick.txt,A Doll's House.txt,A Christmas Carol in Prose Being a Ghost Story of Christmas.txt,Ulysses.txt,The Brothers Karamazov.txt,Jane Eyre- An Autobiography.txt,...,A Smaller History of Rome.txt,The Adventures of Ferdinand Count Fathom — Complete.txt,The Count of Monte Cristo.txt,John Dewey's logical theory.txt,Christopher Columbus and How He Received and Imparted the Spirit of Discovery.txt,The Hound of the Baskervilles.txt,Romeo and Juliet.txt,The Blue Castle- a novel.txt,The Metamorphoses of Ovid.txt,The History of Woman Suffrage.txt
﻿the,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
project,83.0,84.0,83.0,83.0,87.0,83.0,83.0,88.0,83.0,84.0,...,93.0,88.0,92.0,83.0,92.0,83.0,84.0,83.0,83.0,86.0
gutenberg,25.0,25.0,25.0,25.0,25.0,25.0,25.0,25.0,25.0,25.0,...,25.0,25.0,26.0,25.0,25.0,25.0,26.0,25.0,25.0,25.0
ebook,8.0,8.0,8.0,8.0,8.0,8.0,8.0,9.0,8.0,8.0,...,8.0,8.0,8.0,8.0,8.0,8.0,8.0,8.0,8.0,8.0
of,1435.0,9664.0,507.0,1842.0,6682.0,514.0,781.0,8221.0,7335.0,4442.0,...,7792.0,7342.0,12779.0,2648.0,10842.0,1719.0,516.0,1591.0,5601.0,23106.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1091,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
encouragment,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
-->encouragement,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1094,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
