### Information Retrieval System

Task: Develop an information retrieval system based on ranked retrieval. The system should use tf-idf scores and cosine similarity to return a ranked list of the indices of the documents most relevant to the query.

Each query should be compared against the words of every document using this scheme, and the ranked indices of the most relevant documents should be returned. The 5 most relevant documents should be returned; if fewer than 5 documents are relevant, only the relevant ones should be returned.

- Import necessary modules

In [68]:
import numpy as np
import pandas as pd
import re
import nltk
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords

- Obtain stopwords list and define stemmer

In [69]:
nltk_stopwords = stopwords.words('english')
ps = PorterStemmer()

print(nltk_stopwords[0:10])

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]


- Load queries and documents

In [70]:
with open('Queries.txt','r') as f:
    queries = f.read()
queries_list = re.split('\n',queries)
queries_list = [query for query in queries_list if query]
print(queries_list)

documents=pd.read_csv("WordsDataset.csv", index_col=0)
documents.head()

['25 batman alone man', 'lack of intelligence', 'game of soccer', 'undertaker wwe record', 'movie for kids', 'harry kane height']


docID,words
0,"Hiker, demon, creepy, scary, tunnel, stalk"
1,"Batman, batman beyond, who are you, narrows it down, animated, show, officer"
2,"Up, carl, russell, honor, award, scout badge, old man, kids, movie, record"
3,"Tom, jerry, sword, stab, dont care, cartoon, show"
4,"Wholesome, comic, dialogue bubble, dog, sleeping with owner"


- Define the cosine similarity function used to measure the similarity between the query and each document

In [71]:
def cosine_similarity(vector1, vector2):
    numerator = np.dot(vector1,vector2)
    denominator = np.linalg.norm(vector1)*np.linalg.norm(vector2)
    if denominator == 0:
        similarity = 0
    else:
        similarity = numerator/denominator
    return similarity
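
As a quick sanity check of the function above, the sketch below restates it in compact form and evaluates three hand-picked cases: parallel vectors (similarity close to 1), orthogonal vectors (similarity 0), and a zero vector, which exercises the zero-denominator guard. The input vectors are illustrative assumptions and are not taken from the dataset.

```python
import numpy as np

def cosine_similarity(vector1, vector2):
    # Same logic as the notebook's function, restated so this block runs on its own
    numerator = np.dot(vector1, vector2)
    denominator = np.linalg.norm(vector1) * np.linalg.norm(vector2)
    return 0 if denominator == 0 else numerator / denominator

# Illustrative vectors (assumed for this example)
parallel = cosine_similarity([1, 2, 3], [2, 4, 6])   # same direction
orthogonal = cosine_similarity([1, 0], [0, 1])       # no shared component
zero_case = cosine_similarity([0, 0], [1, 1])        # guard against division by zero

print(parallel, orthogonal, zero_case)
```

The zero-vector case matters in this notebook because any document that shares no term with the query ends up with an all-zero vector.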

- Document retrieval function to return the 5 most relevant documents from 'WordsDataset.csv' file
- Document vectors are obtained for each document in the data set. 
- Each value in the document vector is the Term Frequency-Inverse Document Frequency (TF-IDF) weight for a given word in the shared vector space.
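
Written out, the weighting applied by the retrieval function appears to be the following. The symbols are notation introduced here for illustration: $N$ is the number of documents, $\mathrm{df}_t$ the number of documents containing term $t$, and $\mathrm{tf}_{t,d}$ the number of times $t$ occurs in document $d$. The query vector $\mathbf{q}$ is a vector of ones, one entry per query term, and $\mathbf{w}_d$ collects the weights $w_{t,d}$ over the query terms.

```latex
\mathrm{idf}_t = \log_{10}\frac{N}{1 + \mathrm{df}_t}, \qquad
w_{t,d} = \mathrm{tf}_{t,d}\cdot\mathrm{idf}_t, \qquad
\mathrm{sim}(d, q) = \frac{\mathbf{w}_d \cdot \mathbf{q}}{\lVert \mathbf{w}_d \rVert\,\lVert \mathbf{q} \rVert}
```

The $1 +$ in the denominator keeps the idf defined for query terms that occur in no document.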

In [72]:
def retrieve_documents(query,documents):

    print("Query: " + query)

    # Load the stopword list once; stopwords.words() with no argument
    # returns the stopwords of every language in the NLTK corpus
    stop_words = set(stopwords.words())

    # populate dictionary with each indexed document
    # Tokenise, remove stopwords and stem each document
    documents_dict = {}
    for i in range(len(documents)):
        tokens = word_tokenize(documents.loc[i]['words'])
        words = [word for word in tokens if word.isalnum()]
        words = [word for word in words if word not in stop_words]
        words = [ps.stem(word) for word in words]
        documents_dict[i] = words

    # tokenise the query, remove stopwords and stem it
    query_tokens = word_tokenize(query)
    query_words = [word for word in query_tokens if word.isalnum()]
    query_words = [word for word in query_words if word not in stop_words]
    query_words = [ps.stem(word) for word in query_words]

    # Obtain inverse document frequency for each term in shared vector space
    # shared vector space is the vector indexed by relevant words in the query
    idf_term = []
    for word in query_words:
        doc_count = 0
        for i in range(len(documents)):
            if word in documents_dict[i]:
                doc_count += 1
        
        idf_term.append(np.log10(len(documents)/(1 + doc_count)))

    # creating empty term frequency and document vectors for each document
    termfrequency_dict = {}
    docvector_dict = {}
    for i in range(len(documents)):
        termfrequency_dict[i] = []
        docvector_dict[i] = []
        for k in range(len(query_words)):
            termfrequency_dict[i].append(0)
            docvector_dict[i].append(0)

    # populating document vectors for each document
    # value is given by tf-idf weight for each term in shared vector space
    for i in range(len(documents)):
        k=0
        for word in query_words:
            for term in documents_dict[i]:
                if term == word:
                    termfrequency_dict[i][k] += 1
            docvector_dict[i][k] = termfrequency_dict[i][k]*idf_term[k]
            k += 1

    # populating query vector: every query term gets weight 1 (no tf-idf weighting)
    query_docvector = [1 for word in query_words]

    # Find the cosine similarities between each document and the query
    doc_similarities = {}
    for i in range(len(documents)):
        doc_similarities[i] = cosine_similarity(docvector_dict[i], query_docvector)

    # Sort the documents by descending similarity and keep the top 5
    sorted_doc_sim = sorted(doc_similarities.items(), key=lambda x:x[1], reverse=True)
    sorted_doc_sim = sorted_doc_sim[0:5]

    relevant_queries_idx = []
    for i in range(len(sorted_doc_sim)):
        if sorted_doc_sim[i][1] > 0: # Do not want to return documents that have 0 similarity
            relevant_queries_idx.append(sorted_doc_sim[i][0])

    print("Search results:")
    print(documents.loc[relevant_queries_idx])
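
To make the scoring easier to follow, here is a minimal sketch of the same steps on a toy corpus. It skips NLTK entirely: the documents and query terms are hypothetical, already tokenised and stemmed, and are not taken from 'WordsDataset.csv'. The names `docs`, `query_terms`, and `cosine` exist only in this example.

```python
import numpy as np

# Hypothetical, already-preprocessed documents (assumed for illustration)
docs = {
    0: ["batman", "anim", "show"],
    1: ["movi", "kid", "record"],
    2: ["movi", "joker", "laugh"],
    3: ["dog", "comic"],
}
query_terms = ["movi", "kid"]
n_docs = len(docs)

# idf per query term, as in the notebook: log10(N / (1 + document frequency))
idf = np.array([
    np.log10(n_docs / (1 + sum(term in words for words in docs.values())))
    for term in query_terms
])

# tf-idf vector per document, restricted to the query terms
doc_vectors = {
    doc_id: np.array([words.count(term) for term in query_terms]) * idf
    for doc_id, words in docs.items()
}
query_vector = np.ones(len(query_terms))  # weight 1 for every query term

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return 0.0 if denom == 0 else float(np.dot(a, b) / denom)

scores = {doc_id: cosine(vec, query_vector) for doc_id, vec in doc_vectors.items()}

# Keep at most 5 documents and drop those with zero similarity
top = sorted(scores.items(), key=lambda item: item[1], reverse=True)[:5]
ranked = [doc_id for doc_id, score in top if score > 0]
print(ranked)  # → [1, 2]
```

Document 1 matches both query terms and ranks first. Document 2 matches only one of the two terms, so its vector has a single non-zero entry and its cosine with the all-ones query vector is 1/sqrt(2), whatever the idf of that term. Because `sorted` is stable, documents with equal scores keep their original (docID) order.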

#### Results

In [73]:
pd.set_option('display.max_colwidth', None)

retrieve_documents(queries_list[0],documents)

Query: 25 batman alone man
Search results:
                                                                              words
docID                                                                              
1      Batman, batman beyond, who are you, narrows it down, animated, show, officer
35                                                    Uno, draw 25, option, classic


In [74]:
retrieve_documents(queries_list[1],documents)

Query: lack of intelligence
Search results:
                                                 words
docID                                                 
28     Stupid intelligence, brain, what is this answer


In [79]:
retrieve_documents(queries_list[2],documents)

Query: game of soccer
Search results:
                                                                  words
docID                                                                  
8      Geralt, yennefer, pointing, blame, slapstick, video game, player
53                      video game, football, harry kane, umpire, sport


In [80]:
retrieve_documents(queries_list[3],documents)

Query: undertaker wwe record
Search results:
                                                                            words
docID                                                                            
34                              Undertaker, randy Orton, surprised, reaction, wwe
2      Up, carl, russell, honor, award, scout badge, old man, kids, movie, record


In [81]:
retrieve_documents(queries_list[4],documents)

Query: movie for kids
Search results:
                                                                            words
docID                                                                            
2      Up, carl, russell, honor, award, scout badge, old man, kids, movie, record
7                 Lord of the rings, lotr, gandalf, pipip, sending, movie, height
11                          Groot, gunpoint, force, surreal, movie, despicable me
15                  Jedi, master, lightsaber, block, unexpected, movie, star wars
16                         Joker, you think this is funny, laugh, reaction, movie


In [78]:
retrieve_documents(queries_list[5],documents)

Query: harry kane height
Search results:
                                                                 words
docID                                                                 
53                     video game, football, harry kane, umpire, sport
7      Lord of the rings, lotr, gandalf, pipip, sending, movie, height
50       Harry Osborn, peter parker, spider man, movie, hidden, voeyer
