# Task - 1:
Import the libraries needed to preprocess the text in this assignment. We use spaCy for its NLP functions for processing and modelling the text, the regex library (`re`) for preprocessing tasks such as removing all characters other than alphanumerics and whitespace, and pandas to load the dataset and perform the required preprocessing.

In [1]:
import spacy
import re
import pandas as pd

In this step, we load and display the data from the CSV files provided to us. The data from docs.csv is stored in a DataFrame named df, the data from queries.csv in queries, and the data from qdrel.csv in qdrel.

In [2]:
df = pd.read_csv("./Query_Doc/docs.csv")
queries = pd.read_csv("./Query_Doc/queries.csv")
qdrel = pd.read_csv("./Query_Doc/qdrel.csv")

print("\033[1m" + "Docs Data Frame: \n" + '\033[0m\n', df.head())
print("\033[1m" + "Queries Data Frame: \n" + '\033[0m\n', queries.head())
print("\033[1m" + "Relational Data Frame: \n" + '\033[0m\n', qdrel.head())

Docs Data Frame:
    Unnamed: 0  doc_id                                           doc_text
0           0       1  What is the step by step guide to invest in sh...
1           1       2  What is the step by step guide to invest in sh...
2           2       3  What is the story of Kohinoor (Koh-i-Noor) Dia...
3           3       4  What would happen if the Indian government sto...
4           4       5  How can I increase the speed of my internet co...
Queries Data Frame:
    Unnamed: 0  query_id                                         query_text
0           0      4584                How can ask questions using photos?
1           1      6588  What is Atal Pension Yojana? What are its bene...
2           2     10113      Where is starch digested? How is it digested?
3           3      7957      What is a conjecture? What are some examples?
4           4      5498  What can India do to support the people suffer...
Relational Data Frame:
    Unnamed: 0  quer

Each DataFrame contains an unnecessary column named `Unnamed: 0`. We remove it to reduce noise in the dataset, because it is **NOT** needed in the modelling process.

In [3]:
# Removal of the unnamed column from the dataframe
df = df.drop(df.columns[df.columns.str.contains('Unnamed', case=False)], axis=1)
queries = queries.drop(queries.columns[queries.columns.str.contains('Unnamed', case=False)], axis=1)
qdrel = qdrel.drop(qdrel.columns[qdrel.columns.str.contains('Unnamed', case=False)], axis=1)

print("\033[1m" + "Docs Data Frame: \n" + '\033[0m\n', df.head())
print("\033[1m" + "Queries Data Frame: \n" + '\033[0m\n', queries.head())
print("\033[1m" + "Relational Data Frame: \n" + '\033[0m\n', qdrel.head())

Docs Data Frame:
    doc_id                                           doc_text
0       1  What is the step by step guide to invest in sh...
1       2  What is the step by step guide to invest in sh...
2       3  What is the story of Kohinoor (Koh-i-Noor) Dia...
3       4  What would happen if the Indian government sto...
4       5  How can I increase the speed of my internet co...
Queries Data Frame:
    query_id                                         query_text
0      4584                How can ask questions using photos?
1      6588  What is Atal Pension Yojana? What are its bene...
2     10113      Where is starch digested? How is it digested?
3      7957      What is a conjecture? What are some examples?
4      5498  What can India do to support the people suffer...
Relational Data Frame:
    query_id  doc_id
0       318     317
1       378     377
2       379     380
3       399    2606
4       399    2607


## a) Preprocessing of the docs and queries - removing the characters other than alphanumerics or whitespaces

The first subtask under Task 1 asks us to remove all characters other than alphanumerics and whitespace. We use the `re` library, which we have already imported. We write a `purify_docs` function that takes the text in each cell of the docs and queries DataFrames and replaces every character other than lower-case a-z, upper-case A-Z, the digits 0-9 and whitespace with a space.
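
As a small worked example of this substitution, the sketch below applies the same pattern to a single string. The names `purify_text` and `collapse_spaces` are illustrative only, and the second helper (squeezing repeated whitespace) is an optional assumption that is not part of the notebook cell.

```python
import re

def purify_text(text):
    # Replace every character that is not a letter, digit, or whitespace with a space
    return re.sub(r'[^a-zA-Z0-9\s]', ' ', text)

def collapse_spaces(text):
    # Optional follow-up (an assumption, not in the notebook): squeeze runs of whitespace
    return re.sub(r'\s+', ' ', text).strip()

sample = "Kohinoor (Koh-i-Noor) Diamond?"
purified = purify_text(sample)
print(repr(purified))                   # 'Kohinoor  Koh i Noor  Diamond '
print(repr(collapse_spaces(purified)))  # 'Kohinoor Koh i Noor Diamond'
```

Because each removed character becomes a space, punctuation next to an existing space leaves a double space, which is why the `pure` column shows `Kohinoor  Koh i Noor`.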

In [4]:
# Replace characters other than letters (both uppercase and lowercase), digits and whitespace with a space
def purify_docs(data):
    purified_doc = re.sub(r'[^a-zA-Z0-9\s]', ' ', data)
    return purified_doc

df['pure'] = (df['doc_text']).apply(purify_docs)
queries['pure'] = (queries['query_text']).apply(purify_docs)

print("\033[1m" + "Docs Data Frame: \n" + '\033[0m\n', df.head(), "\n")
print("\033[1m" + "Queries Data Frame: \n" + '\033[0m\n', queries.head(), "\n")

Docs Data Frame:
    doc_id                                           doc_text  \
0       1  What is the step by step guide to invest in sh...   
1       2  What is the step by step guide to invest in sh...   
2       3  What is the story of Kohinoor (Koh-i-Noor) Dia...   
3       4  What would happen if the Indian government sto...   
4       5  How can I increase the speed of my internet co...   

                                                pure  
0  What is the step by step guide to invest in sh...  
1  What is the step by step guide to invest in sh...  
2  What is the story of Kohinoor  Koh i Noor  Dia...  
3  What would happen if the Indian government sto...  
4  How can I increase the speed of my internet co...   

Queries Data Frame:
    query_id                                         query_text  \
0      4584                How can ask questions using photos?   
1      6588  What is Atal Pension Yojana? What are its bene...   
2     10113      Where is 

## b) Now we need to correct the spellings in both queries and documents. 
For each query that was corrected, we need to display the original and the corrected query on two separate lines.

In [None]:
# Correct Spellings function ->
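
The cell above is left unimplemented in the notebook. One possible approach, shown here only as a rough sketch, is to replace each unknown word with its closest match from a known vocabulary using the standard library's `difflib`. The function name `correct_spelling`, the `cutoff` value and the toy vocabulary are all assumptions made for illustration; a dedicated spell-checking library would likely do better on real data.

```python
import difflib

def correct_spelling(text, vocabulary, cutoff=0.8):
    """Replace each out-of-vocabulary word with its closest vocabulary entry, if any."""
    known = sorted(vocabulary)
    known_set = set(known)
    corrected = []
    for word in text.split():
        lowered = word.lower()
        if lowered in known_set:
            # Known words are kept exactly as written
            corrected.append(word)
            continue
        # Closest vocabulary entry whose similarity ratio is at least `cutoff`
        matches = difflib.get_close_matches(lowered, known, n=1, cutoff=cutoff)
        corrected.append(matches[0] if matches else word)
    return ' '.join(corrected)

# Hypothetical vocabulary and misspelt query, for illustration only
toy_vocabulary = {"how", "can", "ask", "questions", "using", "photos"}
original = "How can ask qestions usng photos"
fixed = correct_spelling(original, toy_vocabulary)

# Display the original and corrected query on two separate lines when they differ
if fixed != original:
    print(original)
    print(fixed)
```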

## c) Now we shall tokenize the words in the documents using spaCy library. 
As mentioned in the problem statement, we also need to remove all words that occur in fewer than 5 documents or in more than 85% of the documents; the remaining words form the vocabulary for the task. Lastly, for each document and query, we create TF-IDF vectors for obtaining the cosine-similarity scores and further processing.

In [30]:
# First of all we need to load the NLP spaCy pipeline
nlp = spacy.load("en_core_web_sm")

# Next in order to tokenize the words into a list, we use the mentioned below function for dividing the words into tokens, for each cell of the dataframe a list is returned
def derive_tokens(sentence):
    doc = nlp(sentence)
    tokensList = []
    
    for token in doc:
        word = token.text
        tokensList.append(word)
        
    return tokensList

# Now we apply the above mentioned function to the dataframes
df['tokensList'] = df['pure'].apply(derive_tokens)
queries['tokensList'] = queries['pure'].apply(derive_tokens)

In [32]:
# Next we generate the vocabulary; for that we first count how often each token occurs across all the tokenised docs
from collections import Counter

def performFiltrationOfWords(df):
    # Total number of occurrences of every token over the whole docs dataframe
    mapCollections = Counter(eachToken for tokens in df['tokensList'] for eachToken in tokens)

    # Keep the tokens whose count lies between 5 and 85% of the number of documents
    purifiedWords = [token for token, count in mapCollections.items()
                     if count >= 5 and count <= len(df) * 0.85]

    return purifiedWords

filteredWords = performFiltrationOfWords(df)

# We overwrite the tokensList column so that it keeps only the words included in the vocabulary, and then we join those words to form sentences
df['tokensList'] = df['tokensList'].apply(lambda words: [word for word in words if word in filteredWords])
df['sentences'] = df['tokensList'].apply(lambda word: ' '.join(word))
queries['sentences'] = queries['tokensList'].apply(lambda word: ' '.join(word)) 
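
The problem statement phrases the thresholds in terms of the number of documents a word occurs in. A minimal sketch of counting document frequency (each word counted at most once per document) is given below. The name `build_vocabulary` and its parameters are hypothetical, and the toy thresholds are lowered so that the example is easy to follow.

```python
from collections import Counter

def build_vocabulary(token_lists, min_docs=5, max_doc_ratio=0.85):
    """Keep tokens that occur in at least `min_docs` documents
    and in at most `max_doc_ratio` of all documents."""
    doc_freq = Counter()
    for tokens in token_lists:
        # set() ensures a token is counted once per document
        doc_freq.update(set(tokens))
    max_docs = max_doc_ratio * len(token_lists)
    return {tok for tok, n in doc_freq.items() if min_docs <= n <= max_docs}

# Toy corpus of four tokenised documents (illustrative only)
toy_docs = [
    ["what", "is", "spacy"],
    ["what", "is", "is", "nltk"],
    ["what", "about", "spacy"],
    ["what", "is", "regex"],
]
vocab = build_vocabulary(toy_docs, min_docs=2, max_doc_ratio=0.85)
print(sorted(vocab))  # ['is', 'spacy']
```

Here "what" appears in all four documents (above 85%) and is dropped, while "is" counts as three documents even though it occurs four times in total. Returning a set also makes the later `word in vocabulary` membership tests fast.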

Next we apply TF-IDF vectorisation to the preprocessed Docs and Queries DataFrames in order to create the TF-IDF vectors.

In [33]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
tfidfVectorsForDocs = tfidf.fit_transform(df['sentences'])
tfidfVectorsForQueries = tfidf.transform(queries['sentences'])
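
To make the weighting concrete, the following plain-Python sketch computes TF-IDF weights by hand, assuming scikit-learn's default `TfidfVectorizer` settings: raw term counts, the smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation. It takes pre-tokenised lists and does not reproduce the vectorizer's own tokenisation; the name `tfidf_vectors` is illustrative.

```python
import math
from collections import Counter

def tfidf_vectors(token_lists):
    """Return one {token: weight} dict per document, plus the idf table."""
    n_docs = len(token_lists)
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))
    # Smoothed idf: rarer tokens receive larger weights
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        weights = {t: c * idf[t] for t, c in counts.items()}
        # L2-normalise so that every non-empty vector has unit length
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors, idf

# Toy corpus (illustrative only)
toy_docs = [["baby", "shower", "games"], ["shower", "hair"], ["games", "games", "fun"]]
vectors, idf = tfidf_vectors(toy_docs)
print(round(idf["baby"], 3), round(idf["shower"], 3))
```

"baby" occurs in one document and "shower" in two, so "baby" gets the higher idf.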

## d) Now for each of the queries, we shall perform the Cosine Pairwise Similarity
We use the sklearn library to calculate the pairwise cosine similarity, and we display the top 1, top 5 and top 10 most similar documents, sorted by cosine-similarity score, for each query in the Queries DataFrame.
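
The ranking idea can be sketched in plain Python as follows. This is not the sklearn implementation: vectors are represented as `{token: weight}` dictionaries, and the names `cosine` and `top_k` are assumptions made for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def top_k(query_vec, doc_vecs, k):
    """Positions of the k documents most similar to the query, best first."""
    scores = [cosine(query_vec, d) for d in doc_vecs]
    ranked = sorted(range(len(doc_vecs)), key=lambda j: scores[j], reverse=True)
    return ranked[:k]

# Toy vectors with unit weights (illustrative only)
query_vec = {"photos": 1.0, "questions": 1.0}
doc_vecs = [
    {"photos": 1.0, "best": 1.0},
    {"questions": 1.0, "photos": 1.0, "ask": 1.0},
    {"games": 1.0},
]
print(top_k(query_vec, doc_vecs, 2))  # [1, 0]
```

The second document shares two terms with the query and therefore ranks first; the third shares none and scores zero.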

In [34]:
from sklearn.metrics.pairwise import cosine_similarity

generateCosineSimilarityMatrix = cosine_similarity(tfidfVectorsForQueries, tfidfVectorsForDocs)

def obtainSimilarDocs(n):
    return generateCosineSimilarityMatrix.argsort(axis=1)[:, -n:][:, ::-1]

def printSimilarDocs(n, documents):
    print(f"\nTop {n} Similar Docs: ")
    for i, indices in enumerate(documents, start=1):
        docText = df.iloc[indices]['doc_text']
        docIndex = df.iloc[indices]['doc_id']
        print(f"Doc ID: {docIndex} : {docText}")
        
# Obtaining the similarity indices and printing them subsequently
topOneSimilarInds = obtainSimilarDocs(1)
topFiveSimilarInds = obtainSimilarDocs(5)
topTenSimilarInds = obtainSimilarDocs(10)

# Running a for loop in order to print all the values:
for i, rows in queries.iterrows():
    query = rows['query_text']
    print(f"\nGiven Query: {query}")
    
    printSimilarDocs(1, topOneSimilarInds[i])
    printSimilarDocs(5, topFiveSimilarInds[i])
    printSimilarDocs(10, topTenSimilarInds[i])


Given Query: How can ask questions using photos?

Top 1 Similar Docs: 
Doc ID: 1377 : What are some of the best photos?

Top 5 Similar Docs: 
Doc ID: 1377 : What are some of the best photos?
Doc ID: 1782 : What are the best interview questions to ask?
Doc ID: 9951 : Is there any way to automatically like Instagram photos with hashtags, using software?
Doc ID: 45 : What are the questions should not ask on Quora?
Doc ID: 4412 : Why do people have to ask Quora for questions?

Top 10 Similar Docs: 
Doc ID: 1377 : What are some of the best photos?
Doc ID: 1782 : What are the best interview questions to ask?
Doc ID: 9951 : Is there any way to automatically like Instagram photos with hashtags, using software?
Doc ID: 45 : What are the questions should not ask on Quora?
Doc ID: 4412 : Why do people have to ask Quora for questions?
Doc ID: 4583 : How do I ask questions with pictures on "Quora"?
Doc ID: 2603 : What are the best questions to ask a girl while chatting?
Doc ID: 9179 : How can I ma

Doc ID: 9213 : What are some fun games to play at a co-ed baby shower?
Doc ID: 10116 : Is dating actually fun?
Doc ID: 5774 : What are some nice messages to write on a baby shower card?
Doc ID: 8643 : What are some of the Nostradamus predictions which actually occurred in history?
Doc ID: 261 : What are some yakshini mantras?

Top 10 Similar Docs: 
Doc ID: 9213 : What are some fun games to play at a co-ed baby shower?
Doc ID: 10116 : Is dating actually fun?
Doc ID: 5774 : What are some nice messages to write on a baby shower card?
Doc ID: 8643 : What are some of the Nostradamus predictions which actually occurred in history?
Doc ID: 261 : What are some yakshini mantras?
Doc ID: 6461 : What are the best Final Fantasy games?
Doc ID: 3625 : My hair falls only during shower but not after shower. Why?
Doc ID: 4578 : What are some good games for a couple to play?
Doc ID: 5646 : What are some substitutes for butter?
Doc ID: 3712 : What are some references for DHL Courier?

Given Query: Is it 

## e) Calculations of Precision@K Scores

In [35]:
def getPScores(n):
    # Pick the most recently computed ranking for the requested cutoff (1, 5 or 10)
    topInds = {1: topOneSimilarInds, 5: topFiveSimilarInds, 10: topTenSimilarInds}[n]

    pAtNSum = 0.0
    for i, rows in queries.iterrows():
        # Doc ids marked as relevant for this query in qdrel
        generateSetOfRelations = set(qdrel[qdrel['query_id'] == rows['query_id']]['doc_id'])
        # Doc ids of the top-n documents retrieved for this query
        dataAtN = set(df.iloc[topInds[i]]['doc_id'])

        # Precision@n for this query: relevant documents among the top n, divided by n
        pAtNSum += len(generateSetOfRelations.intersection(dataAtN)) / n

    # Average over all queries
    return pAtNSum / len(queries)
        
pAOne = getPScores(1)
pAFive = getPScores(5)
pATen = getPScores(10)

print(f"The Weighted Avg Precision@1 Score is as follows: {pAOne:.5f}")
print(f"The Weighted Avg Precision@5 Score is as follows: {pAFive:.5f}")
print(f"The Weighted Avg Precision@10 Score is as follows: {pATen:.5f}")

The Weighted Avg Precision@1 Score is as follows: 0.59000
The Weighted Avg Precision@5 Score is as follows: 0.19000
The Weighted Avg Precision@10 Score is as follows: 0.10100
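
For reference, Precision@K for a single query is the number of relevant documents among the top K retrieved, divided by K, and the reported score is the mean over all queries. The sketch below works through that arithmetic. The relevance pairs are taken from the first rows of qdrel shown earlier, while the rankings and the function names are made up for illustration.

```python
def precision_at_k(retrieved_doc_ids, relevant_doc_ids, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    hits = sum(1 for doc_id in retrieved_doc_ids[:k] if doc_id in relevant_doc_ids)
    return hits / k

def mean_precision_at_k(rankings, relevance, k):
    """Average Precision@k over all queries in `rankings`."""
    total = sum(precision_at_k(rankings[q], relevance.get(q, set()), k) for q in rankings)
    return total / len(rankings)

# Relevant doc ids per query id (from the qdrel rows shown above)
relevance = {399: {2606, 2607}, 318: {317}}
# Hypothetical ranked results per query id, best match first
rankings = {399: [2606, 11, 2607, 12, 13], 318: [21, 317, 22, 23, 24]}

print(f"{mean_precision_at_k(rankings, relevance, 1):.2f}")  # 0.50
print(f"{mean_precision_at_k(rankings, relevance, 5):.2f}")  # 0.30
```

A query with only two relevant documents cannot score above 2/5 = 0.4 at K = 5 or 2/10 = 0.2 at K = 10, however good the ranking is.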


## Task - 2a: Only Stemming :
The spaCy library has no built-in function for stemming, so we use the NLTK library instead. NLTK has three widely used stemmers:

- Porter Stemmer
- Snowball Stemmer
- Lancaster Stemmer

For this model we use PorterStemmer: we create an instance and apply it to the words in the preprocessing phase. All the functions are implemented here; let us check the resulting precision scores at the end.

In [25]:
from nltk.stem import PorterStemmer
st = PorterStemmer() 

def performStemmingAndTokenize(sentence):
    doc = nlp(sentence)
    tokensList = []
    
    for token in doc:
        word = st.stem(token.text)
        tokensList.append(word.lower())
        
    return tokensList

df['stemmedTokens'] = df['pure'].apply(performStemmingAndTokenize)
queries['stemmedTokens'] = queries['pure'].apply(performStemmingAndTokenize)

from collections import Counter

def performFiltrationOfStemmedWords(df):
    # Total number of occurrences of every stemmed token over the whole docs dataframe
    mapCollections = Counter(eachToken for tokens in df['stemmedTokens'] for eachToken in tokens)

    # Keep the tokens whose count lies between 5 and 85% of the number of documents
    purifiedWords = [token for token, count in mapCollections.items()
                     if count >= 5 and count <= len(df) * 0.85]

    return purifiedWords

filteredWords = performFiltrationOfStemmedWords(df)

df['stemmedTokens'] = df['stemmedTokens'].apply(lambda words: [word for word in words if word in filteredWords])
df['stemmedSentences'] = df['stemmedTokens'].apply(lambda word: ' '.join(word))
queries['stemmedSentences'] = queries['stemmedTokens'].apply(lambda word: ' '.join(word)) 

tfidf = TfidfVectorizer()
tfidfVectorsForDocs = tfidf.fit_transform(df['stemmedSentences'])
tfidfVectorsForQueries = tfidf.transform(queries['stemmedSentences'])

generateCosineSimilarityMatrix = cosine_similarity(tfidfVectorsForQueries, tfidfVectorsForDocs)
        
topOneSimilarInds = obtainSimilarDocs(1)
topFiveSimilarInds = obtainSimilarDocs(5)
topTenSimilarInds = obtainSimilarDocs(10)
        
pAOne = getPScores(1)
pAFive = getPScores(5)
pATen = getPScores(10)

print(f"The Weighted Avg Precision@1 Score is as follows: {pAOne:.5f}")
print(f"The Weighted Avg Precision@5 Score is as follows: {pAFive:.5f}")
print(f"The Weighted Avg Precision@10 Score is as follows: {pATen:.5f}")

The Weighted Avg Precision@1 Score is as follows: 0.69000
The Weighted Avg Precision@5 Score is as follows: 0.20000
The Weighted Avg Precision@10 Score is as follows: 0.10900


## Task - 2b: Performing Lemmatization on the Data
spaCy's standard NLP pipeline has a built-in lemmatizer, so in the preprocessing step we use it to reduce each word in the dataset to its base form.

In [27]:
nlp = spacy.load("en_core_web_sm")
 

def performLemmatizationAndTokenize(sentence):
    doc = nlp(sentence)
    tokensList = []
    
    for token in doc:
        tokensList.append(token.lemma_.lower())
        
    return tokensList

df['lemmatizedTokens'] = df['pure'].apply(performLemmatizationAndTokenize)
queries['lemmatizedTokens'] = queries['pure'].apply(performLemmatizationAndTokenize)

from collections import Counter

def performFiltrationOfLemmatizedWords(df):
    # Total number of occurrences of every lemmatized token over the whole docs dataframe
    mapCollections = Counter(eachToken for tokens in df['lemmatizedTokens'] for eachToken in tokens)

    # Keep the tokens whose count lies between 5 and 85% of the number of documents
    purifiedWords = [token for token, count in mapCollections.items()
                     if count >= 5 and count <= len(df) * 0.85]

    return purifiedWords

filteredWords = performFiltrationOfLemmatizedWords(df)

df['lemmatizedTokens'] = df['lemmatizedTokens'].apply(lambda words: [word for word in words if word in filteredWords])
df['lemmatizedSentences'] = df['lemmatizedTokens'].apply(lambda word: ' '.join(word))
queries['lemmatizedSentences'] = queries['lemmatizedTokens'].apply(lambda word: ' '.join(word)) 

tfidf = TfidfVectorizer()
tfidfVectorsForDocs = tfidf.fit_transform(df['lemmatizedSentences'])
tfidfVectorsForQueries = tfidf.transform(queries['lemmatizedSentences'])

generateCosineSimilarityMatrix = cosine_similarity(tfidfVectorsForQueries, tfidfVectorsForDocs)
        
topOneSimilarInds = obtainSimilarDocs(1)
topFiveSimilarInds = obtainSimilarDocs(5)
topTenSimilarInds = obtainSimilarDocs(10)
        
pAOne = getPScores(1)
pAFive = getPScores(5)
pATen = getPScores(10)

print(f"The Weighted Avg Precision@1 Score is as follows: {pAOne:.5f}")
print(f"The Weighted Avg Precision@5 Score is as follows: {pAFive:.5f}")
print(f"The Weighted Avg Precision@10 Score is as follows: {pATen:.5f}")

The Weighted Avg Precision@1 Score is as follows: 0.70000
The Weighted Avg Precision@5 Score is as follows: 0.19400
The Weighted Avg Precision@10 Score is as follows: 0.10700


## Task - 3: Apply Parts-Of-Speech Tagging and Named Entity Recognition and obtaining the results
In this case, we increase the frequency of important words, namely nouns and named entities such as an organisation or a person's name. We repeat each noun 2 times and each named-entity token 4 times.

In [36]:
def performPOSAndNER(sentence):
    doc = nlp(sentence)
    
    newTokensList = []
    for token in doc:
        if token.pos_ == 'NOUN':
            newTokensList.extend([token.text] * 2)
        elif token.ent_type_ != '':
            newTokensList.extend([token.text] * 4)
        else:
            newTokensList.append(token.lemma_)
    
    return ' '.join(newTokensList)

queries['POSTagged'] = queries['pure'].apply(performPOSAndNER)
df['POSTagged'] = df['pure'].apply(performPOSAndNER)

tfidf = TfidfVectorizer()
tfidfVectorsForDocs = tfidf.fit_transform(df['POSTagged'])
tfidfVectorsForQueries = tfidf.transform(queries['POSTagged'])

generateCosineSimilarityMatrix = cosine_similarity(tfidfVectorsForQueries, tfidfVectorsForDocs)
        
topOneSimilarInds = obtainSimilarDocs(1)
topFiveSimilarInds = obtainSimilarDocs(5)
topTenSimilarInds = obtainSimilarDocs(10)
        
pAOne = getPScores(1)
pAFive = getPScores(5)
pATen = getPScores(10)

print(f"The Weighted Avg Precision@1 Score is as follows: {pAOne:.5f}")
print(f"The Weighted Avg Precision@5 Score is as follows: {pAFive:.5f}")
print(f"The Weighted Avg Precision@10 Score is as follows: {pATen:.5f}")

The Weighted Avg Precision@1 Score is as follows: 0.82000
The Weighted Avg Precision@5 Score is as follows: 0.20800
The Weighted Avg Precision@10 Score is as follows: 0.10800
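
The effect of repeating tokens on the term counts can be seen on a toy example. The sketch below mirrors the branch order of `performPOSAndNER` but works on hand-written `(text, pos, ent_type)` tuples instead of spaCy tokens and omits the lemmatization of the remaining words. The tags are assumptions chosen for illustration and may differ from what spaCy assigns.

```python
from collections import Counter

def boost_tokens(tagged_tokens, noun_factor=2, entity_factor=4):
    """Repeat nouns and named-entity tokens so that they get a higher term frequency."""
    boosted = []
    for text, pos, ent_type in tagged_tokens:
        if pos == 'NOUN':
            boosted.extend([text] * noun_factor)
        elif ent_type:
            boosted.extend([text] * entity_factor)
        else:
            boosted.append(text)
    return boosted

# Hand-written tags (hypothetical; not produced by spaCy here)
tagged = [
    ("What", "PRON", ""), ("is", "AUX", ""), ("the", "DET", ""),
    ("story", "NOUN", ""), ("of", "ADP", ""), ("Kohinoor", "PROPN", "PERSON"),
]
counts = Counter(boost_tokens(tagged))
print(counts["story"], counts["Kohinoor"], counts["the"])  # 2 4 1
```

Because the noun check comes first, a token tagged both as `NOUN` and as part of an entity is repeated 2 times rather than 4.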


## Task - 4: Changing the Parameters in order to get better results 

**Changes**
- **Rectifying the Vocabulary:** instead of blindly filtering out the words whose frequency is less than 5 or more than 85% of the number of documents, we remove all words that are neither nouns nor names of recognised entities. We then apply lemmatization and generate the words. We also use the POS tagging described above, with the words converted to lower-case.

> **Justification**: If we print the frequency maps (dictionaries) containing the useful words, we observe that certain words essential to the context of a sentence are lost because they have a frequency of less than 5 in the whole dataset. This is why we use the rectification described above.
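
A minimal sketch of this vocabulary rule is shown below, under the assumption that each token already carries its lemma, part-of-speech tag and entity type. `TaggedToken` is a hypothetical stand-in for the corresponding spaCy token attributes, and the tags in the example are assigned by hand.

```python
from typing import NamedTuple

class TaggedToken(NamedTuple):
    # Stand-in for a spaCy token's text, lemma_, pos_ and ent_type_ attributes
    text: str
    lemma: str
    pos: str
    ent_type: str

def keep_content_words(tokens):
    """Keep the lower-cased lemmas of nouns and of named-entity tokens only."""
    kept = []
    for tok in tokens:
        if tok.pos == 'NOUN' or tok.ent_type:
            kept.append(tok.lemma.lower())
    return kept

# Hand-tagged version of "What are the best Final Fantasy games" (illustrative)
sentence = [
    TaggedToken("What", "what", "PRON", ""),
    TaggedToken("are", "be", "AUX", ""),
    TaggedToken("the", "the", "DET", ""),
    TaggedToken("best", "good", "ADJ", ""),
    TaggedToken("Final", "Final", "PROPN", "WORK_OF_ART"),
    TaggedToken("Fantasy", "Fantasy", "PROPN", "WORK_OF_ART"),
    TaggedToken("games", "game", "NOUN", ""),
]
print(keep_content_words(sentence))  # ['final', 'fantasy', 'game']
```

Rare but meaningful words survive this rule regardless of how often they occur in the dataset, which is the point of the justification above.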