# Semester Project : Information Retrieval System
 ### NUST
- #### Submitted by : Hassan Ashiq BESE 23 C
- ###### Link to my GitHub Repository <a href="https://github.com/hassanashiqasse/PCA">Click Here</a>

### Importing Libraries

In [34]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from collections import Counter
from num2words import num2words

import nltk
import os
import string
import numpy as np
import copy
import pandas as pd
import pickle
import re
import math

### Reading CISI.ALL File

In [35]:
with open('CISI.ALL') as CISI_file:
    lines = ""
    for l in CISI_file.readlines():
        lines += "\n" + l.strip() if l.startswith(".") else " " + l.strip()
    lines = lines.lstrip("\n").split("\n")
    
    
print("Done")

Done
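
To make the joining rule in the cell above concrete, here is a minimal sketch run on a made-up record in the same dot-tag layout. The sample text and the helper name `join_tagged_lines` are illustrative assumptions, not part of the dataset or the notebook. Every line that starts with `.` opens a new logical line, and every other line is appended to the current one.

```python
def join_tagged_lines(raw_lines):
    """Merge continuation lines into the preceding dot-tagged line."""
    merged = ""
    for l in raw_lines:
        # Same rule as the notebook cell: a leading "." starts a new logical line.
        merged += "\n" + l.strip() if l.startswith(".") else " " + l.strip()
    return merged.lstrip("\n").split("\n")


# Hypothetical sample record, not taken from CISI.ALL.
sample = [
    ".I 1\n",
    ".T\n",
    "A made-up title\n",
    ".W\n",
    "First line of text\n",
    "second line of text\n",
    ".X\n",
]
joined = join_tagged_lines(sample)
for line in joined:
    print(repr(line))
```

Each tag ends up on one line together with its content, which is why the later cells can drop the first three characters of a line to get the text.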


### Placing each document in CISI File in a dictionary

In [36]:
doc_set = {}
doc_id = ""
doc_text = ""
for l in lines:
    if l.startswith(".I"):
        doc_id = l.split(" ")[1].strip()
    elif l.startswith(".X"):
        doc_set[doc_id] = doc_text.lstrip(" ")
        doc_id = ""
        doc_text = ""
    else:
        doc_text += l.strip()[3:] + " " # The first 3 characters of a line can be ignored.

# Print something to see the dictionary structure, etc.
print(f"Number of documents = {len(doc_set)}" + ".\n")

doc_set['1']


Number of documents = 1460.



"18 Editions of the Dewey Decimal Classifications Comaromi, J.P. The present study is a history of the DEWEY Decimal Classification.  The first edition of the DDC was published in 1876, the eighteenth edition in 1971, and future editions will continue to appear as needed.  In spite of the DDC's long and healthy life, however, its full story has never been told.  There have been biographies of Dewey that briefly describe his system, but this is the first attempt to provide a detailed history of the work that more than any other has spurred the growth of librarianship in this country and abroad. "

In [37]:
doc_set["3"]

"Two Kinds of Power An Essay on Bibliographic Control Wilson, P. The relationships between the organization and control of writings and the organization and control of knowledge and information will inevitably enter our story, for writings contain, along with much else, a great deal of mankind's stock of knowledge and information.  Bibliographical control is a form of power, and if knowledge itself is a form of power, as the familiar slogan claims, bibliographical control is in a certain sense power over power, power to obtain the knowledge recorded in written form.  As writings are not simply, and not in any simple way, storehouses of knowledge, we cannot satisfactorily discuss bibliographical control as simply control over the knowledge and information contained in writings. "

### Reading CISI Query File and placing each query in a dictionary

In [38]:
with open('CISI.QRY') as f:
    lines = ""
    for l in f.readlines():
        lines += "\n" + l.strip() if l.startswith(".") else " " + l.strip()
    lines = lines.lstrip("\n").split("\n")
    
qry_set = {}
qry_id = ""
for l in lines:
    if l.startswith(".I"):
        qry_id = l.split(" ")[1].strip()
    elif l.startswith(".W"):
        qry_set[qry_id] = l.strip()[3:]
        qry_id = ""
    
# Print something to see the dictionary structure, etc.
print(f"Number of queries = {len(qry_set)}" + ".\n")
print("Query # 2 : ", qry_set["2"]) # note that the dictionary indexes are strings, not numbers. 

Number of queries = 112.

Query # 2 :  How can actually pertinent data, as opposed to references or entire articles themselves, be retrieved automatically in response to information requests?


In [39]:
qry_set["1"]

'What problems and concerns are there in making up descriptive titles? What difficulties are involved in automatically retrieving articles from approximate titles? What is the usual relevance of the content of articles to their titles?'

### Preprocessing the Data

In [40]:
def convert_lower_case(data):
    return np.char.lower(data)

In [41]:
def remove_stop_words(data):
    stop_words = stopwords.words('english')
    words = word_tokenize(str(data))
    new_text = ""
    for w in words:
        if w not in stop_words and len(w) > 1:
            new_text = new_text + " " + w
    return new_text

In [42]:
def remove_punctuation(data):
    symbols = "!\"#$%&()*+-./:;<=>?@[\\]^_`{|}~\n"  # "\\]" avoids the invalid "\]" escape sequence
    for i in range(len(symbols)):
        data = np.char.replace(data, symbols[i], ' ')
        data = np.char.replace(data, "  ", " ")
    data = np.char.replace(data, ',', '')
    return data

In [43]:
def remove_apostrophe(data):
    return np.char.replace(data, "'", "")

In [44]:
def stemming(data):
    stemmer= PorterStemmer()
    
    tokens = word_tokenize(str(data))
    new_text = ""
    for w in tokens:
        new_text = new_text + " " + stemmer.stem(w)
    return new_text

In [45]:
def convert_numbers(data):
    tokens = word_tokenize(str(data))
    new_text = ""
    for w in tokens:
        try:
            w = num2words(int(w))
        except (ValueError, OverflowError):
            # Not an integer token (or too large for num2words); keep it unchanged.
            pass
        new_text = new_text + " " + w
    new_text = np.char.replace(new_text, "-", " ")
    return new_text

In [46]:
def preprocess(data):
    data = convert_lower_case(data)
    data = remove_punctuation(data) # commas are removed separately inside remove_punctuation
    data = remove_apostrophe(data)
    data = remove_stop_words(data)
    data = convert_numbers(data)
    data = stemming(data)
    data = remove_punctuation(data)
    data = convert_numbers(data)
    data = stemming(data) # needed again because the words produced by num2words must be stemmed too
    data = remove_punctuation(data) # needed again because num2words emits hyphens and commas, e.g. "forty-one"
    data = remove_stop_words(data) # needed again because num2words emits stop words, e.g. 101 -> "one hundred and one"
    return data
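
As a rough illustration of what the first stages of this pipeline do, the sketch below lowercases a sentence, strips punctuation, and drops stop words using only the standard library. It is a simplified stand-in, not the notebook's method: the stop-word list is a tiny hypothetical one (the notebook uses NLTK's English list), tokenization is a plain whitespace split, and stemming and number conversion are left out.

```python
import re
import string

# Hypothetical, deliberately tiny stop-word list; the notebook uses NLTK's English list.
STOP_WORDS = {"the", "is", "a", "of", "in", "an", "and", "to"}


def simple_preprocess(text):
    """Lowercase, strip punctuation, and drop stop words and one-character tokens."""
    text = text.lower()
    # Replace every ASCII punctuation character with a space.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    text = text.translate(table)
    tokens = re.split(r"\s+", text.strip())
    return [t for t in tokens if t not in STOP_WORDS and len(t) > 1]


tokens = simple_preprocess(
    "The present study is a history of the DEWEY Decimal Classification."
)
print(tokens)
```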

In [47]:
processed_set={}
proc_token_id=""
proc_token_text=""

for i in doc_set:
    doc_token_id=i
    processed_set[doc_token_id]=preprocess(doc_set[str(i)])
print("done")
    

done


In [48]:
doc_set["2"]

'Use Made of Technical Libraries Slater, M. This report is an analysis of 6300 acts of use in 104 technical libraries in the United Kingdom. Library use is only one aspect of the wider pattern of information use.  Information transfer in libraries is restricted to the use of documents.  It takes no account of documents used outside the library, still less of information transferred orally from person to person.  The library acts as a channel in only a proportion of the situations in which information is transferred. Taking technical information transfer as a whole, there is no doubt that this proportion is not the major one.  There are users of technical information - particularly in technology rather than science - who visit libraries rarely if at all, relying on desk collections of handbooks, current periodicals and personal contact with their colleagues and with people in other organizations.  Even regular library users also receive information in other ways. '

In [49]:
processed_set["2"]

' use made technic librari slater report analysi six thousand three hundr act use one hundr four technic librari unit kingdom librari use one aspect wider pattern inform use inform transfer librari restrict use document take account document use outsid librari still less inform transfer oral person person librari act channel proport situat inform transfer take technic inform transfer whole doubt proport major one user technic inform particularli technolog rather scienc visit librari rare reli desk collect handbook current period person contact colleagu peopl organ even regular librari user also receiv inform way'

#### Converting processed text to tokens and placing them in a dictionary keyed by document ID

In [50]:
tokens_set={}
doc_token_id=""
doct_token_text=""

for i in processed_set:
    doc_token_id=i
    tokens_set[doc_token_id]=word_tokenize(processed_set[str(i)])
print("done")
    

done


In [51]:
np.array(tokens_set["2"]).T

array(['use', 'made', 'technic', 'librari', 'slater', 'report', 'analysi',
       'six', 'thousand', 'three', 'hundr', 'act', 'use', 'one', 'hundr',
       'four', 'technic', 'librari', 'unit', 'kingdom', 'librari', 'use',
       'one', 'aspect', 'wider', 'pattern', 'inform', 'use', 'inform',
       'transfer', 'librari', 'restrict', 'use', 'document', 'take',
       'account', 'document', 'use', 'outsid', 'librari', 'still', 'less',
       'inform', 'transfer', 'oral', 'person', 'person', 'librari', 'act',
       'channel', 'proport', 'situat', 'inform', 'transfer', 'take',
       'technic', 'inform', 'transfer', 'whole', 'doubt', 'proport',
       'major', 'one', 'user', 'technic', 'inform', 'particularli',
       'technolog', 'rather', 'scienc', 'visit', 'librari', 'rare',
       'reli', 'desk', 'collect', 'handbook', 'current', 'period',
       'person', 'contact', 'colleagu', 'peopl', 'organ', 'even',
       'regular', 'librari', 'user', 'also', 'receiv', 'inform', 'way'],
      d

#### Calculating DF

In [52]:
DF = {}

for i in range(len(tokens_set)):
    tokens = tokens_set[str(i+1)]
    for w in tokens:
        # Record that the document at index i contains token w.
        DF.setdefault(w, set()).add(i)
for i in DF:
    DF[i] = len(DF[i])
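
The same idea on a toy corpus: collect, for each term, the set of documents it occurs in, then replace each set by its size. The three token lists below are made up for illustration and the names `toy_tokens` and `toy_df` are not part of the notebook.

```python
# Hypothetical tokenized documents, keyed by document ID.
toy_tokens = {
    "1": ["librari", "use", "use"],
    "2": ["inform", "use"],
    "3": ["inform", "librari", "system"],
}

doc_ids_per_term = {}
for doc_id, tokens in toy_tokens.items():
    for w in tokens:
        # A set ignores repeats, so "use" in document 1 is counted once.
        doc_ids_per_term.setdefault(w, set()).add(doc_id)

# Document frequency = number of distinct documents containing the term.
toy_df = {term: len(ids) for term, ids in doc_ids_per_term.items()}
print(toy_df)
```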

In [53]:
DF

{'eighteen': 15,
 'edit': 44,
 'dewey': 13,
 'decim': 16,
 'classif': 105,
 'comaromi': 1,
 'present': 318,
 'studi': 362,
 'histori': 52,
 'first': 175,
 'ddc': 5,
 'publish': 122,
 'one': 579,
 'thousand': 350,
 'eight': 127,
 'hundr': 377,
 'seventi': 134,
 'six': 112,
 'eighteenth': 1,
 'nine': 299,
 'futur': 95,
 'continu': 68,
 'appear': 86,
 'need': 251,
 'spite': 7,
 'long': 48,
 'healthi': 1,
 'life': 39,
 'howev': 111,
 'full': 41,
 'stori': 4,
 'never': 15,
 'told': 4,
 'biographi': 3,
 'briefli': 34,
 'describ': 271,
 'system': 515,
 'attempt': 125,
 'provid': 251,
 'detail': 101,
 'work': 252,
 'spur': 3,
 'growth': 67,
 'librarianship': 49,
 'countri': 46,
 'abroad': 5,
 'use': 659,
 'made': 211,
 'technic': 130,
 'librari': 555,
 'slater': 5,
 'report': 185,
 'analysi': 226,
 'three': 212,
 'act': 27,
 'four': 143,
 'unit': 94,
 'kingdom': 10,
 'aspect': 103,
 'wider': 17,
 'pattern': 85,
 'inform': 660,
 'transfer': 30,
 'restrict': 25,
 'document': 251,
 'take': 66,
 '

In [54]:

total_vocab_size = len(DF)
total_vocab_size


6750

In [55]:
total_vocab = [x for x in DF]
N=len(total_vocab)
N

6750

In [56]:
def doc_freq(word):
    # Number of documents containing `word`; 0 if the word is not in the vocabulary.
    return DF.get(word, 0)

In [57]:
doc = 0
N=len(tokens_set)
tf_idf = {}

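for i in range(len(tokens_set)):
    # NOTE: document IDs start at 1, so nothing is looked up when i == 0. In that
    # iteration `tokens` still holds the value left over from the DF cell (the last
    # document), so row 0 repeats that document and row i (i >= 1) holds document ID i.
    if(i>0):
        tokens = tokens_set[str(i)]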
    
    counter = Counter(tokens)
    words_count = len(tokens)
    
    for token in np.unique(tokens):
        
        tf = counter[token]/words_count
        df = doc_freq(token)
        idf = np.log((N+1)/(df+1))
        
        tf_idf[doc,token] = tf*idf
    doc += 1

print("tf-idf done")

tf-idf done
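
The weight computed in the cell above is tf × idf, with tf = (count of the term in the document) / (document length) and idf = ln((N + 1) / (df + 1)). A small worked sketch of that formula follows; the document-frequency table, the corpus size of 3, and the function name `tf_idf_weights` are made up for illustration.

```python
import math
from collections import Counter


def tf_idf_weights(tokens, df, n_docs):
    """Return {term: tf * idf} for one tokenized document."""
    counts = Counter(tokens)
    return {
        # tf = count / document length, idf = ln((N + 1) / (df + 1))
        term: (count / len(tokens)) * math.log((n_docs + 1) / (df.get(term, 0) + 1))
        for term, count in counts.items()
    }


# Hypothetical document frequencies for a three-document corpus.
toy_df = {"librari": 2, "use": 2, "inform": 2, "system": 1}
weights = tf_idf_weights(["librari", "use", "use"], toy_df, n_docs=3)
for term, weight in weights.items():
    print(term, round(weight, 4))
```

Here "use" occurs twice in a three-token document, so its tf is 2/3, and it appears in 2 of 3 documents, so its idf is ln(4/3).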


In [58]:
tf_idf

{(0, 'author'): 0.03667246564756004,
 (0, 'averag'): 0.05982443036105432,
 (0, 'certif'): 0.1014469528374195,
 (0, 'chemic'): 0.1370688561354312,
 (0, 'chemistri'): 0.12234519181184544,
 (0, 'chernyi'): 0.08755682397861289,
 (0, 'compound'): 0.06117259590592272,
 (0, 'document'): 0.08643183561832506,
 (0, 'etc'): 0.06117259590592272,
 (0, 'fifteen'): 0.06264164768772087,
 (0, 'fifti'): 0.08479987035928797,
 (0, 'hundr'): 0.06649092866218582,
 (0, 'increa'): 0.043962397143649315,
 (0, 'inform'): 0.026004018750787318,
 (0, 'integr'): 0.05532546567741891,
 (0, 'journal'): 0.03798464118319179,
 (0, 'last'): 0.05289235083941443,
 (0, 'literatur'): 0.030888508686531476,
 (0, 'modern'): 0.05898355668256989,
 (0, 'monograph'): 0.06425541937318403,
 (0, 'new'): 0.02967889934129818,
 (0, 'number'): 0.032035370794841485,
 (0, 'one'): 0.030290108465909328,
 (0, 'paper'): 0.027740410363096074,
 (0, 'patent'): 0.07506272476472933,
 (0, 'present'): 0.024945660802718934,
 (0, 'public'): 0.033701111391

## Cosine similarity

In [59]:
def cosine_sim(a, b):
    cos_sim = np.dot(a, b)/(np.linalg.norm(a)*np.linalg.norm(b))
    return cos_sim
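
The formula above divides by the product of the two vector norms, which is zero whenever either vector is all zeros (for example, a query with no in-vocabulary terms), and the result is then not a usable score. One possible guard, shown here in plain Python rather than NumPy, is to return 0.0 in that case. The name `safe_cosine_sim` and the choice of 0.0 as the fallback are assumptions, not part of the notebook.

```python
import math


def safe_cosine_sim(a, b):
    """Cosine similarity that returns 0.0 when either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumed convention: an empty vector is similar to nothing.
        return 0.0
    return dot / (norm_a * norm_b)


print(safe_cosine_sim([1.0, 0.0], [1.0, 0.0]))  # identical direction
print(safe_cosine_sim([1.0, 0.0], [0.0, 1.0]))  # orthogonal
print(safe_cosine_sim([0.0, 0.0], [1.0, 2.0]))  # zero vector
```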

In [60]:
D = np.zeros((N, total_vocab_size))   #total_vocab_size is the length of DF
for i in tf_idf:
    try:
        ind = total_vocab.index(i[1])
        D[i[0]][ind] = tf_idf[i]
    except:
        pass

In [61]:
def gen_vector(tokens):

    Q = np.zeros((len(total_vocab)))
    
    counter = Counter(tokens)
    words_count = len(tokens)

    query_weights = {}
    
    for token in np.unique(tokens):
        
        tf = counter[token]/words_count
        df = doc_freq(token)
        idf = math.log((N+1)/(df+1))

        try:
            ind = total_vocab.index(token)
            Q[ind] = tf*idf
        except:
            pass
    return Q

In [62]:
def cosine_similarity(k, query):
    
    preprocessed_query = preprocess(query)
    tokens = word_tokenize(str(preprocessed_query))
    
    #print("\nQuery:", query)
    
    d_cosines = []
    
    query_vector = gen_vector(tokens)
    
    for d in D:
        d_cosines.append(cosine_sim(query_vector, d))
        
    out = np.array(d_cosines).argsort()[-k:][::-1]
    
    
    #print("Most similar document IDs : ")
    
    #print(out)
    
    return out
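
The expression `argsort()[-k:][::-1]` used above selects the indices of the k largest scores, best first. A plain-Python sketch of the same selection on made-up scores is shown below; `top_k_indices` is a hypothetical helper, and its ordering of tied scores may differ from NumPy's.

```python
def top_k_indices(scores, k):
    """Return the indices of the k highest scores, highest first."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Hypothetical similarity scores for five documents (row indices 0..4).
scores = [0.10, 0.75, 0.00, 0.42, 0.31]
best = top_k_indices(scores, 3)
print(best)
```

Note that the values returned are row indices into the score list, so they only match document IDs if the rows were built in ID order.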


In [63]:
Q = cosine_similarity(5,qry_set["3"])

print('Related documents to given query :\n \"', qry_set["3"] , '" \n' )
Q

Related documents to given query :
 " What is information science?  Give definitions where possible. " 



array([ 469,  445, 1181, 1179,  540], dtype=int64)

### Reading ground truth


In [64]:
rel_set = {}
with open('CISI.REL') as f:
    for l in f.readlines():
        qry_id = l.lstrip(" ").strip("\n").split("\t")[0].split(" ")[0]
        doc_id = int(l.lstrip(" ").strip("\n").split("\t")[0].split(" ")[-1])
        if qry_id in rel_set:
            rel_set[qry_id].append(doc_id)
        else:
            rel_set[qry_id] = []
            rel_set[qry_id].append(doc_id) 
    
    
print(rel_set["3"]) # note that the dictionary indexes are strings, not numbers. 
    

FileNotFoundError: [Errno 2] No such file or directory: 'IR Project/Datasets.REL'
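
For reference, the parsing step can be exercised without the file. The sketch below groups relevant document IDs by query ID, taking the first two whitespace-separated fields of each line. The sample lines are made up and only assume that each line starts with a query ID followed by a document ID; `parse_relevance` is a hypothetical helper, not the notebook's code.

```python
from collections import defaultdict


def parse_relevance(lines):
    """Map each query ID (string) to the list of relevant document IDs (ints)."""
    rel = defaultdict(list)
    for l in lines:
        fields = l.split()  # splits on any run of spaces or tabs
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        rel[fields[0]].append(int(fields[1]))
    return dict(rel)


# Hypothetical lines in an assumed "query_id doc_id ..." layout.
sample_lines = [
    "     1     28\t0\t0.000000\n",
    "     1     35\t0\t0.000000\n",
    "     2     29\t0\t0.000000\n",
    "\n",
]
toy_rel = parse_relevance(sample_lines)
print(toy_rel)
```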

### Precision, Recall, Accuracy, and F-Measure

In [None]:
precision_list=[]
recall_list=[]

In [None]:
precision_list=[]
recall_list=[]
accuracy_list=[]

for i in qry_set:  # iterate over the query IDs rather than over the number of documents
    try:
        result_from_cosine=cosine_similarity(6 , qry_set[str(i)]).tolist()
        result_from_ground_truth=rel_set[str(i)]
        
        true_Positive=len(set(result_from_cosine) & set(result_from_ground_truth)) #set(a) & set(b) gives us intersection between a and b
        false_Positive=len(np.setdiff1d(result_from_cosine , result_from_ground_truth))
        false_Negative=len(np.setdiff1d(result_from_ground_truth , result_from_cosine))
        true_negative= ( len(doc_set) -  (true_Positive + false_Negative + false_Positive) )
        #print("true psotive",true_Positive)
        #print("false negative",false_Negative)
        
        try:
            precission= (true_Positive) / ( true_Positive + false_Positive )
            recall= (true_Positive) / (true_Positive + false_Negative)
            
            accuracy= ( true_negative + true_Positive ) / (  true_negative + true_Positive + false_Negative +false_Positive)
           
        except ZeroDivisionError:
            # Skip this query instead of appending stale values from an earlier iteration.
            continue

        precision_list.append(precission)
        recall_list.append(recall)
        accuracy_list.append(accuracy)
        
        
        
    except KeyError:
        pass
    

In [None]:
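# NOTE: despite the name, this is the sum of the per-query precision values;
# divide by len(precision_list) to obtain the mean.
average_precision=sum(precision_list)

In [None]:
# NOTE: sum of the per-query recall values, not the mean.
average_recall=sum(recall_list)

In [None]:
# NOTE: sum of the per-query accuracy values, not the mean.
Accuracy= sum(accuracy_list)

In [None]:
# Computed from the summed precision and recall above, so the result is scaled
# by the number of evaluated queries.
F_Measure = (2 * average_precision * average_recall) / (average_precision + average_recall)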

In [None]:
print("Average Precision is : ", average_precision)
print("Average Recall is : ", average_recall)
print("F-score is : " ,F_Measure)
print("Accuracy : " ,Accuracy)

Average Precision is :  35.99999999999999
Average Recall is :  11.822298142574425
F-score is :  17.799342552037704
Accuracy :  73.85068493150679


## Enter a Query, Get your Result 
### Simple User Interface

In [None]:
query=input("Enter your query here : ")

Q=cosine_similarity(10,query)

print("\n\nEntered Query is : " , query)
print("\n\nRelated Documents IDs are : ", Q)
print("\nDo you want to retrieve the document ? \n press Y to see all related docs \n Press S to see a single document with given id \n Press N to exit ")

entered_option=input()
    
if entered_option == "Y":

    print("\n\n*** You are in All Document Retrieval Mode ***\n\n")

    for i in range(len(Q)):
            print("\n\nDoc-Id :", Q[i] , "\n\t" ,doc_set[str(Q[i])])
           
elif entered_option == "S":
    print("Enter your desired document ID : ")
    doc_id=input()
    print("Doc-Id : ", doc_id, "\n\t" ,doc_set[doc_id])
        

else:
    print("Thank you for using our Information System")
    print("Hassan Ashiq & Usman Ali Abbasi")

Enter your query here : 469


Entered Query is :  469


Related Documents IDs are :  [1459  478  480  481  482  483  484  485  486  487]

Do you want to retrieve the document ? 
 press Y to see all related docs 
 Press S to see a single document with given id 
 Press N to exit 
N
Thank you for using our Information System
Hassan Ashiq & Usman Ali Abbasi


In [None]:
qry_set["3"]

'What is information science?  Give definitions where possible.'
