# <center><b>Assignment 2</b></center>
### Name: Anshul Rustogi
#### Roll No: 2010110111
### Name: Kolluru jeshwanth
#### Roll No: 2010110351


#### Importing Libraries

In [1]:
import pandas as pd
import nltk # Natural Language Toolkit
import re # Regular Expression
import string
import math
import numpy as np

from nltk.stem import PorterStemmer, WordNetLemmatizer

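# Download each NLTK resource that is missing (checking only 'stopwords' would skip the others)
for resource, path in [('stopwords', 'corpora/stopwords'), ('wordnet', 'corpora/wordnet'),
                       ('punkt', 'tokenizers/punkt'), ('omw-1.4', 'corpora/omw-1.4')]:
    try:
        nltk.data.find(path)
    except LookupError:
        nltk.download(resource)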

In [2]:
# # connecting to google drive
# from google.colab import drive
# drive.mount('/content/drive')

### <b>Question:</b> Vector Space Model
In this assignment, you will implement ranked retrieval using the vector space model (VSM). To implement the VSM, you may store your dictionary and postings lists in the following format. The only difference between this format and the one in the textbook is that term frequencies are encoded in the postings so that tf×idf can be computed. Each posting is a tuple (doc ID, term freq).
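
The postings format described above can be sketched as follows. This is a minimal illustration, not the assignment's required implementation: the helper name `build_index` and the toy token lists are hypothetical, and the documents are assumed to be tokenized already.

```python
from collections import Counter, defaultdict

def build_index(docs):
    """Map each term to a postings list of (doc_id, term_freq) tuples."""
    index = defaultdict(list)
    for doc_id, tokens in enumerate(docs):
        # Counter gives the raw term frequency of each term in this document
        for term, tf in sorted(Counter(tokens).items()):
            index[term].append((doc_id, tf))
    return dict(index)

# Toy, already-tokenized documents (hypothetical data, not from the corpus)
toy_docs = [
    ["nike", "brand", "nike"],
    ["dell", "brand"],
]
toy_index = build_index(toy_docs)
print(toy_index["brand"])  # [(0, 1), (1, 1)]
print(toy_index["nike"])   # [(0, 2)]
```

With this layout, the document frequency of a term is simply the length of its postings list.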

#### Step 1: Reading the data
##### Note: Data taken from last IR assignment

In [3]:
import os
# Collect one row per file, then build the DataFrame once (avoids calling pd.concat in a loop)
rows = []
# for dirname, _, filenames in os.walk('/content/drive/MyDrive/Corpus'):
for dirname, _, filenames in os.walk('Corpus'):
    for filename in filenames:
        #print(os.path.join(dirname, filename))
        with open(os.path.join(dirname, filename), 'r') as f:
            rows.append({'Title': filename, 'Content': f.read()})
df = pd.DataFrame(rows, columns=['Title', 'Content'])

def retrieve_titles(indexes):
    titles = []
    for index in indexes:
        titles.append(df['Title'][index])
    return titles
    
df.head()

| | Title | Content |
|---|---|---|
| 0 | levis.txt | what is levis?\n\nWalter A. Haas, (born May 11... |
| 1 | nike.txt | What is nike?\n\nNike is a champion brand buil... |
| 2 | Dell.txt | What is dell?\n\nThe company, first named PC’s... |
| 3 | huawei.txt | what is huawei?\n\nHuawei Technologies Co., Lt... |
| 4 | Ola.txt | What is Ola?\n\nOla needs no introduction. The... |


#### Step 2: Preprocessing the data

In [4]:
#Creating a function to do the preprocessing: case folding, tokenization, stopword removal, lemmatization and stemming
nltk_stopwords = nltk.corpus.stopwords.words('english')

def preprocess(text):
  
    #Case Folding: convert all characters to lowercase
    text = text.lower()
    #Remove punctuation
    text = "".join([char for char in text if char not in string.punctuation])
    text = re.sub('[0-9]+', '', text)

    #Tokenization: using own tokenizer
    text = text.strip()
    tokens = re.split(r'\W+', text)

    #Remove empty tokens and single character tokens
    tokens = [token for token in tokens if token != '' and len(token) > 1]

    #Stopword Removal: remove stopwords like 'a', 'the', 'is', etc.
    tokens = [token for token in tokens if token not in nltk_stopwords]
    #Lemmatization: reduce words to their dictionary form (default noun POS, e.g. 'companies' to 'company')
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(token) for token in tokens]
    #Stemming: convert words to their root form (e.g. 'caring' to 'care', 'launched' to 'launch', etc.)
    stemmer = PorterStemmer()
    tokens = [stemmer.stem(token) for token in tokens]
    return tokens

#Applying the function to the dataframe
df['Content_preprocessed'] = df['Content'].apply(preprocess)
df.head()

| | Title | Content | Content_preprocessed |
|---|---|---|---|
| 0 | levis.txt | what is levis?\n\nWalter A. Haas, (born May 11... | [levi, walter, haa, born, may, san, francisco,... |
| 1 | nike.txt | What is nike?\n\nNike is a champion brand buil... | [nike, nike, champion, brand, builder, adverti... |
| 2 | Dell.txt | What is dell?\n\nThe company, first named PC’s... | [dell, compani, first, name, pc, limit, found,... |
| 3 | huawei.txt | what is huawei?\n\nHuawei Technologies Co., Lt... | [huawei, huawei, technolog, co, ltd, chines, m... |
| 4 | Ola.txt | What is Ola?\n\nOla needs no introduction. The... | [ola, ola, need, introduct, first, indian, cab... |


#### Step 3: Creating the tf x idf
In the searching step, you will need to rank documents by cosine similarity based on tf×idf. In
terms of the SMART notation ddd.qqq, you will need to implement the lnc.ltc ranking scheme: log tf, no idf, and cosine normalization for documents (lnc), and log tf, idf, and cosine normalization for queries (ltc). Compute the cosine similarity between the query and each document using these weights, where the term-frequency weight is 1 + log(tf) and the inverse document frequency is idf = log(N/df), applied to queries only. That is, for a query term,
tf-idf = (1 + log(tf)) × log(N/df).
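
As a worked illustration of the lnc.ltc scheme, the sketch below scores one toy document against one toy query. All numbers (`N`, `doc_freq`, `query_tf`, `doc_tf`) and the helper names `log_tf` and `cosine_normalize` are hypothetical, and base-10 logarithms are assumed.

```python
import math

def log_tf(tf):
    """Log-scaled term frequency: 1 + log10(tf), or 0 for an absent term."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

def cosine_normalize(weights):
    """Scale a sparse vector (dict) to unit length; all-zero vectors are returned unchanged."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else dict(weights)

# Hypothetical collection statistics and raw term counts
N = 1000
doc_freq = {"car": 10, "insurance": 100}
query_tf = {"car": 1, "insurance": 1}
doc_tf = {"car": 1, "insurance": 10, "auto": 1}

# ltc for the query: log tf times idf, then cosine normalization
query_w = cosine_normalize(
    {t: log_tf(tf) * math.log10(N / doc_freq[t]) for t, tf in query_tf.items()}
)
# lnc for the document: log tf only (no idf), then cosine normalization
doc_w = cosine_normalize({t: log_tf(tf) for t, tf in doc_tf.items()})

# Both vectors have unit length, so the dot product is the cosine similarity
score = sum(w * doc_w.get(t, 0.0) for t, w in query_w.items())
print(round(score, 4))  # 0.7303
```

Here the query weights before normalization are 2 for "car" and 1 for "insurance", and the document weights are 1, 2, and 1, so the score is 4/√30.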


In [5]:
'''
The given question asks us to apply the vector space model to the corpus with different weighting schemes:
1. For documents (lnc):
    a. Term frequency: logarithmic
    b. Document frequency: none
    c. Normalization: cosine

2. For queries (ltc):
    a. Term frequency: logarithmic
    b. Document frequency: t, i.e. inverse document frequency
    c. Normalization: cosine
'''

#Defining some basic functions for the vector space model
def calculate_idf(N: int, df: int) -> float:
    '''
    calculate the inverse document frequency (idf) of a term
    :param N: number of documents
    :param df: document frequency of a term (number of documents containing the term)
    '''
    if df == 0:
        return 0.0 #Term absent from the corpus: give it no weight instead of dividing by zero
    return math.log(N/df, 10)

def get_document_frequency(term: str, documents: list) -> int:
    '''
    calculate the document frequency of a term in a list of documents
    :param term: the term
    :param documents: the list of documents
    '''
    count = 0
    for document in documents:
        if term in document:
            count += 1
    return count

def log_term_frequency(freq: int) -> float:
    '''
    calculate the log term frequency of a term in a document
    :param freq: the term frequency
    '''
    if freq == 0:
        return 0
    return 1 + math.log(freq, 10)


In [6]:
def preprocess_documents(df):
    document_vectors = []
    for document in df['Content_preprocessed']:
        #Calculating the term frequency
        term_frequency = {}
        for term in document:
            if term in term_frequency:
                term_frequency[term] += 1
            else:
                term_frequency[term] = 1
        #Calculating the log term frequency
        for term in term_frequency:
            term_frequency[term] = log_term_frequency(term_frequency[term])
        #Taking cosine normalization (the norm is computed once per document)
        norm = math.sqrt(sum([weight**2 for weight in term_frequency.values()]))
        for term in term_frequency:
            term_frequency[term] /= norm
            
        document_vectors.append(term_frequency)

    return document_vectors

dc_vectors = preprocess_documents(df)

In [10]:
def preprocess_query(query: str) -> dict:
    '''
    preprocess the query
    :param query: the query
    '''

    #Preprocessing the query
    query = preprocess(query)
    #Calculating the term frequency of the query
    query_vector = {}
    for term in query:
        if term in query_vector:
            query_vector[term] += 1
        else:
            query_vector[term] = 1
    #Calculating the inverse document frequency of the query
    idf = {}
    for term in query_vector:
        idf[term] = calculate_idf(len(df), get_document_frequency(term, df['Content_preprocessed']))
    #Calculating the log term frequency of the query
    for term in query_vector:
        query_vector[term] = log_term_frequency(query_vector[term])
    #Multiplying the term frequency with the inverse document frequency
    for term in query_vector:
        query_vector[term] = query_vector[term] * idf[term]

    #Normalizing the query vector with cosine normalization
    norm = math.sqrt(sum([weight**2 for weight in query_vector.values()]))
    if norm > 0: #All weights are zero when every query term has idf 0; skip to avoid dividing by zero
        for term in query_vector:
            query_vector[term] /= norm

    return query_vector

In [11]:
def calculate_cosine_similarity(query_vector: dict, document_vector: dict) -> float:
    '''
    calculate the cosine similarity between a query and a document
    :param query_vector: the query vector
    :param document_vector: the document vector
    '''
    #Both vectors are already cosine-normalized, so their dot product is the cosine similarity
    numerator = 0
    for term in query_vector:
        if term in document_vector:
            numerator += query_vector[term] * document_vector[term]
    return numerator

def get_top_k_documents(query: str, k: int) -> None:
    '''
    print the titles and scores of the top k documents for a query
    :param query: the query
    :param k: number of documents to print
    '''
    query_vector = preprocess_query(query)
    scores = []
    for document_vector in dc_vectors:
        scores.append(calculate_cosine_similarity(query_vector, document_vector))
    #Getting the indices of the top k documents (fewer if the corpus has fewer than k documents)
    top_k_indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    #Printing the top k documents with their titles and scores
    for rank, index in enumerate(top_k_indices, start=1):
        print('\n' + str(rank) + ')\nTitle: ', df.iloc[index]['Title'])
        print('Score: ', scores[index], end="\n---------------XXX---------------")
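
As a sketch of an alternative to sorting every score, `heapq.nlargest` from the standard library can select the k best documents directly, and it returns fewer than k items when fewer exist. The helper name `top_k` and the toy titles and scores are hypothetical and not part of the assignment code.

```python
import heapq

def top_k(scores, titles, k):
    """Return up to k (title, score) pairs, highest score first."""
    # nlargest picks the k indices with the largest scores without a full sort
    best = heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)
    return [(titles[i], scores[i]) for i in best]

# Toy data (hypothetical): three documents, but five results requested
toy_titles = ["a.txt", "b.txt", "c.txt"]
toy_scores = [0.10, 0.75, 0.30]
ranked = top_k(toy_scores, toy_titles, 5)
print(ranked)  # [('b.txt', 0.75), ('c.txt', 0.3), ('a.txt', 0.1)]
```

Returning the ranked pairs instead of printing them also makes the result easy to check or reuse in later cells.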


In [13]:
get_top_k_documents('Warwickshire, came from an ancient family and was the heiress to some land', 5)


1)
Title:  shakespeare.txt
Score:  0.12296257967131181
---------------XXX---------------
2)
Title:  levis.txt
Score:  0.025503933795216413
---------------XXX---------------
3)
Title:  Adobe.txt
Score:  0.023453516952090917
---------------XXX---------------
4)
Title:  google.txt
Score:  0.021295839437710868
---------------XXX---------------
5)
Title:  nike.txt
Score:  0.019556033706669237
---------------XXX---------------