## Web Mining : Information Retrival and Natural Language Processing

### Ranking document based on Cosine Similarity on real time CISI dataset
Dataset link : https://www.kaggle.com/dmaso01dsta/cisi-a-dataset-for-information-retrieval


### Problem Statment :
Represent collection of documents in VSM (Vector Space Model) using tf and tf*idf (Preprocess using stopword
removal/stemming). Create an Inverted Index for accessing the documents with respect to terms, use tf based and tf idf based weighting. For a query find similarity of query and
documents using cosine similarity measures by using VSM representation(tf,tfidf) and on
basis of Okapi Model. Display the ranked order of documents. Use standard benchmark
data set CISI/CACM. Thus you have three models – Tf, Tf-idf, okapi.

###  Step 1: Importing all the necessary libraries

In [4]:
import numpy as np
import pandas as pd

#from nltk.stem import PorterStemmer
#from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
import re
import operator## Create Vocabulary
import nltk
from collections import Counter
import scipy

### Step 2 : Creating Preprocessing functions
* Converting all to lower case
* Removing stop words
* Removing Punctuations
* Stemming


In [5]:
# Converting to lower case
def lower_case(data):
    return np.char.lower(data)

def remove_stopwords(data):
    stop_words=nltk.corpus.stopwords.words('english')
    #print(stop_words)
    new_data=""
    word_tokens=nltk.tokenize.word_tokenize(str(data))
    for w in word_tokens:
        if w not in stop_words:
            new_data=new_data+" "+w
    return new_data

def remove_punctuations(data):
    symbols = "!\"#$%&()*+-./:;<=>?@[\]^_`{|}~\n"
    for i in range(len(symbols)):
        data = np.char.replace(data, symbols[i], ' ')
        data = np.char.replace(data, "  ", " ")
    data = np.char.replace(data, ',', '')
    return data

def remove_apostrophe(data):
    return np.char.replace(data, "'", "")

def stemming(data):
    stemmer= nltk.stem.PorterStemmer()
    
    tokens = nltk.tokenize.word_tokenize(str(data))
    new_text = ""
    for w in tokens:
        new_text = new_text + " " + stemmer.stem(w)
    return new_text


In [6]:
data=['My name is Bharat Dadwaria, I live in delhi and study at Jawaharlal Nehru University, New Delhi']

def preprocessing_data(data):
    data=lower_case(data)
    data=remove_punctuations(data)
    data=remove_apostrophe(data)
    data=stemming(data)
    #Once more we need to remove the punctuations
    data=remove_punctuations(data)
    data=remove_stopwords(data)
    return data

### Step 3: Importing the dataset and preprocessing it

In [22]:
with open('cisi-a/CISI.ALL') as f:
    lines = ""
    for l in f.readlines():
        lines += "\n" + l.strip() if l.startswith(".") else " " + l.strip()
    lines = lines.lstrip("\n").split("\n")

processed_text = []
processed_title = []
doc_set = {}
doc_list=[]
doc_id = ""
doc_text = ""
doc_auth=""
doc_title={}
for l in lines:
    if l.startswith(".I"):
        doc_id = l.split(" ")[1].strip()
    elif l.startswith(".T"):
        processed_title.append((str((l.lstrip(" ")))))
        #doc_title[doc_id] = doc_text.lstrip(" ")
        #doc_id=""
        #doc_text=""
        
    elif l.startswith(".A"):
        doc_auth = l.lstrip(" ")
    elif l.startswith(".X"):
        processed_text.append((str(preprocessing_data(doc_text.lstrip(" ")))))
        #processed_text.append(word(str(preprocessing_data(doc_text.lstrip(" ")))))
        doc_set[doc_id] = doc_text.lstrip(" ")
        doc_id = ""
        doc_text = ""
    else:
        doc_text += l.strip()[3:] + " " # The first 3 characters of a line can be ignored.

# Print something to see the dictionary structure, etc.
print(f"Total Number of documents = {len(doc_set)}" + ".\n")
print(doc_set["1"]) # note that the dictionary indexes are strings, not numbers. 
print(len(doc_set))  
#print(processed_title)
#print(processed_text)
doc_list=doc_set.values()

Total Number of documents = 1460.

The present study is a history of the DEWEY Decimal Classification.  The first edition of the DDC was published in 1876, the eighteenth edition in 1971, and future editions will continue to appear as needed.  In spite of the DDC's long and healthy life, however, its full story has never been told.  There have been biographies of Dewey that briefly describe his system, but this is the first attempt to provide a detailed history of the work that more than any other has spurred the growth of librarianship in this country and abroad. 
1460


### Step 4 : Converting the text into Vector Space Model using Tf-Idf And Performing Cosine Similarity

In [24]:
vocabulary = set(processed_text)
for doc in processed_text:
    vocabulary.update((doc))
    #vocabulary = list(vocabulary)# Intializating the tfIdf model
print("The Lenght of the Vocabulary is : ",len(vocabulary))
tf_idf_vector=TfidfVectorizer(vocabulary=vocabulary,stop_words='english')
tf_idf_tran=tf_idf_vector.fit_transform(processed_text)

The Lenght of the Vocabulary is :  1494


In [25]:
# For Converting the query to the Vector Space model using tfidf.
def gen_vector_T(tokens):
    Q = np.zeros((len(vocabulary)))    
    x= tf_idf_vector.fit_transform(tokens)
    #print(tokens[0].split(','))
    for token in tokens[0].split(','):
        #print(token)
        try:
            ind = vocabulary.index(token)
            Q[ind]  = x[0, tfidf.vocabulary_[token]]
        except:
            pass
    return Q

In [26]:
#def cosine_sim(a, b):
 #   cos_sim = (np.dot(a, b)/(np.linalg.norm(a,ord=None)*np.linalg.norm(b,ord=None)))
  #  return cos_sim

def cosine_sim(x, y):
    return np.dot(x, y) / (np.sqrt(np.dot(x, x)) * np.sqrt(np.dot(y, y)))


In [27]:
def cosine_similarity_T(k, query):
    preprocessed_query = preprocessed_query = re.sub("\W+", " ", query).strip()
    tokens = nltk.tokenize.word_tokenize(str(preprocessed_query))
    d_cosines = []
    #print(tokens)
    query_vector = gen_vector_T(tokens)
    #print((query_vector))
    
    for d in tf_idf_tran.A:
        d_cosines.append(cosine_sim(query_vector, d))
    #print((d))                
    out = np.array(d_cosines).argsort()[-k:][::-1]
    #print(d_cosines)
    return(out)
    '''
    #print("")
    d_cosines.sort()
    a = pd.DataFrame()
    print(d_cosines)
    for i,index in enumerate(out):
        a.loc[i,'index'] = str(index)
        #a.loc[i,'Subject'] = doc_list|['Subject'][index]
    for j,simScore in enumerate(d_cosines[-k:][::-1]):
        a.loc[j,'Score'] = simScore
    return a
    '''
    

In [28]:
query=" history of the DEWEY Decimal Classification.  The first edition of the DDC was published in"
query1="What is the need for information consolidation, evaluation, and retrieval in scientific research?"

In [29]:

top_search=cosine_similarity_T(10,query)
print("Top 10 Search documents are(Documents ID) : \n\n",top_search)

Top 10 Search documents are(Documents ID) : 

 [1459  478  480  481  482  483  484  485  486  487]


  


### Step 6:  Fetching the top Result query documents data

In [30]:
count=1;
for i in top_search:
    k=str(i)
    j=i+1
    print("\nsearch result :",count,"\nDoc ",j," : ",processed_title[i],"\n\n")
    print(doc_set[k])
    count=count+1


search result : 1 
Doc  1460  :  .T Modern Integral Information Systems for Chemistry and Chemical Technology 


This book considers the basic aspects of this complex problem - the historical and social essence of language and thought, their interaction in historical evolution, the essence of linguistic meaning in relation to the content side of thought, and the physiological mechanism of the processes of abstraction, generalization, etc. 

search result : 2 
Doc  479  :  .T Automatic Term Classifications and Retrieval 


All analysis of information for storage and of questions for effecting retrieval must be in terms of concepts and the relations between them. The concepts may be just words (descriptors), as in simple post-co-ordinate keyword indexing systems, or they may be class-terms or other idea-groupings, as in classifications.  The relations between concepts often appear to be absent, but if more than one word is used in indexing or in a search there is clearly an implicit rel