# Information Retrieval

## Table of Contents
<ul>
    <li><a href="#read_files">Read Files</a></li>
    <li><a href="#regularisation">Regularisation</a></li>
    <li>
    <a href="#tokenization_normalization">Tokenization and normalization</a>
        <ul>
            <li> <a href="#tokenization">Tokenization</a></li>
            <li> <a href="#stop_word_removal">Stop word removal</a></li>
        </ul>
    </li>
    <li>
    <a href="#positioal_index_model">Positional index model</a>
        <ul>
            <li> <a href="#constructing_auxiliary">Constructing Auxiliary structure(s) </a></li>
            <li> <a href="#Phrase_query">Phrase query</a></li>
        </ul>
    </li>
    <li>
    <a href="#vector_space_model">Vector space model</a>
        <ul>
            <li> <a href="#term_frequency">Term frequency </a></li>
            <li> <a href="#idf">Inverse document frequency</a></li>
            <li> <a href="#tf_idf_matrix">TF.idf matrix </a></li>
            <li> <a href="#Similarity">Similarity between query and each document </a></li>
        </ul>
    </li>
</ul>

In [1]:
# from nltk.corpus import stopwords
# from nltk.tokenize import sent_tokenize, word_tokenize
# from nltk.cluster.util import cosine_distance
# import nltk
import numpy as np
import pandas as pd
import os
import re

<a id='read_files'></a>
## Read Files

In [2]:
encoding = "utf8"
docPaths = os.listdir("documents")
documentList = []
for docName in docPaths:
    # Read every file in the "documents" folder as one string
    with open(os.path.join("documents", docName), "r", encoding=encoding) as file:
        documentList.append(file.read())

<a id='regularisation'></a>
## Regularisation

In [3]:
for i in range(len(documentList)):
    documentList[i] = re.sub(r'’', '\'', documentList[i])  # Replace ’ with '
    documentList[i] = re.sub(r'“', '"', documentList[i])   # Replace “ with "
    documentList[i] = re.sub(r'”', '"', documentList[i])   # Replace ” with "
    documentList[i] = re.sub(r',', ',', documentList[i])   # Replace , with , (a no-op as written)

In [4]:
documentList[9]

'Ms. Groves had originally sent the video, in which she looked into the camera and said, "I can drive," followed by the slur, to a friend on Snapchat in 2016, when she was a freshman and had just gotten her learner\'s permit. It later circulated among some students at Heritage High School, which she and Mr. Galligan attended, but did not cause much of a stir.\n\nMr. Galligan had not seen the video before receiving it last school year, when he and Ms. Groves were seniors. By then, she was a varsity cheer captain who dreamed of attending the University of Tennessee, Knoxville, whose cheer team was the reigning national champion. When she made the team in May, her parents celebrated with a cake and orange balloons, the university\'s official color.\n\nThe next month, as protests were sweeping the nation after the police killing of George Floyd, Ms. Groves, in a public Instagram post, urged people to "protest, donate, sign a petition, rally, do something" in support of the Black Lives Matt
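
The quote replacements above can also be written as a single translation table. This is a minimal sketch, assuming only the three typographic quote characters handled above need mapping; `QUOTE_TABLE` and `normalize_quotes` are hypothetical names that do not appear elsewhere in this notebook.

```python
# Map typographic quotes to their ASCII equivalents in one pass.
QUOTE_TABLE = str.maketrans({
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})

def normalize_quotes(text):
    """Return text with curly quotes replaced by straight ones."""
    return text.translate(QUOTE_TABLE)

sample = "\u201cI can drive,\u201d she said of her learner\u2019s permit."
print(normalize_quotes(sample))
```

`str.translate` scans each string once, whereas chained `re.sub` calls scan it once per pattern.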

<a id='tokenization_normalization'></a>
## Tokenization and normalization

<a id='tokenization'></a>
> ### Tokenization

In [5]:
from nltk.tokenize import regexp_tokenize 

In [6]:
tokenized_documentList = []
for document in documentList:
    # A token is a run of word characters and apostrophes, so "CNN's" stays whole
    tokenized_documentList.append(regexp_tokenize(document, r"[\w']+"))

In [7]:
tokenized_documentList

[['The',
  'Perdue',
  'campaign',
  'did',
  'not',
  'respond',
  'to',
  "CNN's",
  'inquiries',
  'about',
  'the',
  'status',
  'of',
  'their',
  'ad',
  'on',
  'Saturday',
  'and',
  'they',
  'once',
  'again',
  'ignored',
  'questions',
  'about',
  'where',
  'the',
  'senator',
  'stands',
  'on',
  "Trump's",
  'calls',
  'for',
  'changes',
  'to',
  'the',
  'bill',
  'At',
  'a',
  'news',
  'conference',
  'on',
  'Wednesday',
  'Loeffler',
  'said',
  'she',
  'would',
  'be',
  'open',
  'to',
  'the',
  'idea',
  'of',
  'bigger',
  'checks',
  'but',
  'argued',
  'that',
  'other',
  'parts',
  'of',
  'the',
  'package',
  'would',
  'have',
  'to',
  'be',
  'cut',
  'in',
  'order',
  'to',
  'accommodate',
  'the',
  'expense',
  "Trump's",
  'mixed',
  'signals',
  'on',
  'the',
  'relief',
  'bill',
  'are',
  'just',
  'one',
  'example',
  'of',
  'how',
  'his',
  'erratic',
  'behavior',
  'during',
  'the',
  'waning',
  'days',
  'of',
  'his',
  'p
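
To see what the pattern `[\w']+` keeps and drops, the sketch below applies it with the standard library's `re.findall`. For a pattern without capturing groups this should give the same tokens as `regexp_tokenize`. The helper name `tokenize` is an assumption made for this illustration.

```python
import re

TOKEN_PATTERN = r"[\w']+"  # same pattern as in the tokenization cell

def tokenize(text):
    """Return the runs of word characters and apostrophes found in text."""
    return re.findall(TOKEN_PATTERN, text)

tokens = tokenize("The Perdue campaign did not respond to CNN's inquiries.")
print(tokens)
```

Punctuation such as the final full stop is discarded, while the apostrophe keeps `CNN's` as a single token.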

<a id='stop_word_removal'></a>
> ### Stop Word Removal

In [8]:
from nltk.corpus import stopwords 
stopwords_english = stopwords.words('english')

In [9]:
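stopword_set = set(stopwords_english)  # set membership is faster than scanning a list
Cleaned_tokenized_documentList = []
for document in tokenized_documentList:
    # Compare in lower case, but keep each token's original casing
    cleandoc = [word for word in document if word.lower() not in stopword_set]
    Cleaned_tokenized_documentList.append(cleandoc)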
    

In [10]:
Cleaned_tokenized_documentList

[['Perdue',
  'campaign',
  'respond',
  "CNN's",
  'inquiries',
  'status',
  'ad',
  'Saturday',
  'ignored',
  'questions',
  'senator',
  'stands',
  "Trump's",
  'calls',
  'changes',
  'bill',
  'news',
  'conference',
  'Wednesday',
  'Loeffler',
  'said',
  'would',
  'open',
  'idea',
  'bigger',
  'checks',
  'argued',
  'parts',
  'package',
  'would',
  'cut',
  'order',
  'accommodate',
  'expense',
  "Trump's",
  'mixed',
  'signals',
  'relief',
  'bill',
  'one',
  'example',
  'erratic',
  'behavior',
  'waning',
  'days',
  'presidency',
  'interrupting',
  'careful',
  'messaging',
  'Republicans',
  'trying',
  'craft',
  'fight',
  'hold',
  'onto',
  'two',
  'seats',
  'retain',
  'control',
  'majority',
  'US',
  'Senate',
  'Trump',
  'last',
  'week',
  'vetoed',
  'National',
  'Defense',
  'Authorization',
  'Act',
  'legislation',
  'Loeffler',
  'Perdue',
  'supported',
  'setting',
  'first',
  'possible',
  'veto',
  'override',
  'presidency',
  'Trump
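
The filter compares tokens in lower case but keeps their original casing, so `The` is removed while `Perdue` survives unchanged. A minimal sketch of that behaviour follows; `STOP_WORDS` is a tiny hypothetical set used only for this illustration, not NLTK's English list.

```python
# Hypothetical, tiny stop-word set used only for this illustration;
# the notebook itself uses NLTK's English stop-word list.
STOP_WORDS = {"the", "did", "not", "to", "about", "of", "their", "on"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lower-cased form is a stop word; keep original casing."""
    return [tok for tok in tokens if tok.lower() not in stop_words]

tokens = ["The", "Perdue", "campaign", "did", "not", "respond", "to", "CNN's", "inquiries"]
print(remove_stop_words(tokens))
```

Because casing is preserved, `Education` and `education` would later be indexed as two different terms.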

<a id='positioal_index_model'></a>
## Positional Index Model

<a id='constructing_auxiliary'></a>
> ### Constructing Auxiliary Structure(s)

In [11]:
# Create the unique words per document and the number of words per document
uniqueWordsPerDoc = []
numWordPerDoc = []
totalWords = 0
for document in Cleaned_tokenized_documentList:
    num = len(document)
    totalWords += num
    numWordPerDoc.append(num)
    list_set  = set(document)
    uniqueWordsPerDoc += (list(list_set))

In [12]:
# Create the unique words across all documents
list_set  = set(uniqueWordsPerDoc)
uniqueWords = (list(list_set))

print("Total number of words:", totalWords)
print("Number of unique words:", len(uniqueWords))


Total number of words: 1253
Number of unique words: 749


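Build a dictionary that maps each term to its posting list (the documents that contain it) and, for each of those documents, the positions at which the term occurs.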

In [13]:
words = {}
for word in uniqueWords:
    words[word] = {}
    for docNum, document in enumerate(Cleaned_tokenized_documentList):
        if word in document:
            # Positions (token offsets) of the word inside this document
            words[word][docNum] = [i for i, token in enumerate(document) if token == word]
        

In [14]:
words

{'remorse': {4: [34]},
 'elect': {0: [97]},
 'generation': {1: [13]},
 'transforming': {1: [108], 9: [152]},
 'use': {5: [82], 6: [12], 7: [3, 54]},
 'convicted': {5: [19]},
 'Education': {1: [100], 9: [144]},
 'industry': {1: [110, 169], 9: [154]},
 'dreamed': {9: [45]},
 'fight': {0: [52]},
 'return': {3: [40], 8: [71]},
 'first': {0: [75], 5: [81]},
 'standard': {7: [15]},
 "team's": {4: [32]},
 'browser': {2: [30]},
 'highly': {7: [45]},
 'technology': {1: [106], 8: [3], 9: [150]},
 'guilty': {5: [11, 68]},
 'Galligan': {9: [26, 32]},
 'commercial': {7: [68]},
 'depth': {1: [160]},
 'Country': {1: [95], 9: [139]},
 'ball': {6: [19]},
 'provides': {7: [33]},
 'ramp': {1: [141]},
 'courses': {1: [198]},
 'become': {1: [126]},
 'across': {1: [168]},
 'list': {5: [73, 94]},
 'school': {1: [211], 4: [50], 9: [37]},
 'certainly': {5: [80]},
 'Prakash': {1: [94], 9: [138]},
 'made': {9: [56]},
 'pleaded': {5: [10]},
 'year': {1: [20, 172], 3: [26], 4: [38], 9: [38]},
 'desired': {7: [86],

In [15]:
words["results"]

{1: [140], 2: [42], 3: [61, 113], 6: [40], 7: [35, 38, 65], 8: [55, 82, 127]}
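
The loop above rescans every document once per vocabulary term. The same `term -> {document: [positions]}` structure can be built in a single pass over the tokens. The following is a sketch of that alternative, not the construction used in this notebook; `build_positional_index` and the toy documents are hypothetical.

```python
from collections import defaultdict

def build_positional_index(documents):
    """Map term -> {doc_id: [positions]} in a single pass over the tokens."""
    index = defaultdict(dict)
    for doc_id, tokens in enumerate(documents):
        for position, term in enumerate(tokens):
            index[term].setdefault(doc_id, []).append(position)
    return dict(index)

docs = [["relief", "bill", "veto", "bill"], ["bill", "signed"]]
index = build_positional_index(docs)
print(index["bill"])
```

Each token is visited once, so the cost grows with the total number of tokens rather than with vocabulary size times corpus size.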

<a id='Phrase_query'></a>
> ### Phrase query

In [16]:
def phrase_query(query):
    queryList = regexp_tokenize(query, r"[\w']+")
    # Get the posting list (document numbers) for every query word
    postingLists = []
    for word in queryList:
        if word not in words:
            return -1  # a query word does not occur in any document
        postingLists.append(list(words[word].keys()))

    # Intersect the posting lists: documents that contain every query word
    intersectDocs = postingLists[0]
    for i in range(1, len(postingLists)):
        intersectDocs = list(set(intersectDocs).intersection(postingLists[i]))

    # A single-word query needs no position check
    if len(postingLists) == 1:
        return intersectDocs

    # Check that the words appear next to each other and in query order
    query_intersection = {}
    query_Notintersection = {}
    for Docnum in intersectDocs:
        indexsList = [words[word][Docnum] for word in queryList]
        # Shift the positions of the i-th word back by i, so that a phrase
        # match has the same shifted position in every list
        for i in range(1, len(indexsList)):
            indexsList[i] = list(np.asarray(indexsList[i]) - i)

        # Intersect the shifted positions: start positions of the phrase
        intersectIndexs = indexsList[0]
        for i in range(1, len(indexsList)):
            intersectIndexs = list(set(intersectIndexs).intersection(indexsList[i]))

        if len(intersectIndexs) > 0:
            # Phrase found: store the positions of each query word
            query_intersection[Docnum] = []
            for i in range(len(queryList)):
                query_intersection[Docnum].append(list(np.asarray(intersectIndexs) + i))
        else:
            # All words present but not as a phrase: store the shifted positions
            query_Notintersection[Docnum] = []
            for i in range(len(indexsList)):
                query_Notintersection[Docnum].append(indexsList[i])
    return query_intersection, query_Notintersection
         

In [17]:
phrase_query("package left    unsigned   President")

({4: [[67], [68], [69], [70]]}, {})
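
The core of the phrase check is the position shift: if the k-th query word occurs at position p + k for every k, then p is a start position of the phrase. Below is a reduced sketch of that idea that returns only the start positions. `phrase_positions` and `toy_index` are hypothetical, and the toy index merely mimics the layout of `words`.

```python
def phrase_positions(index, phrase_terms):
    """Return {doc_id: [start positions]} where the terms occur consecutively."""
    if not phrase_terms or any(term not in index for term in phrase_terms):
        return {}
    # Documents that contain every term of the phrase
    common_docs = set(index[phrase_terms[0]])
    for term in phrase_terms[1:]:
        common_docs &= set(index[term])
    matches = {}
    for doc_id in sorted(common_docs):
        # A start position p matches when term k occurs at p + k for every k
        starts = set(index[phrase_terms[0]][doc_id])
        for offset, term in enumerate(phrase_terms[1:], start=1):
            starts &= {pos - offset for pos in index[term][doc_id]}
        if starts:
            matches[doc_id] = sorted(starts)
    return matches

# Hypothetical toy index in the same term -> {doc: [positions]} layout as `words`
toy_index = {
    "package": {0: [3], 4: [67]},
    "left": {4: [68, 90]},
    "unsigned": {4: [69]},
}
print(phrase_positions(toy_index, ["package", "left", "unsigned"]))
```

Reversing the word order (`["left", "package"]`) yields no match, because the shifted positions no longer coincide.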

In [18]:
#search
term = "search"
print("<{} {}".format(term,len(words[term])))
for key, value in words[term].items():
    print('  ',key, ' : ', value)

<search 3
   2  :  [40]
   7  :  [27, 34, 64, 74]
   8  :  [12, 17, 21, 28, 57, 89, 95, 126]


<a id='vector_space_model'></a>
## Vector Space Model 

![title](TFIDFEqutions.jpg)
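
Read off the code cells in this section, the weighting scheme can be written out as follows. The symbols are notational choices made here: $N$ is the number of documents, $\mathrm{tf}_{t,d}$ the raw count of term $t$ in document $d$, and $\mathrm{df}_t$ the number of documents containing $t$.

$$
\begin{aligned}
w^{\mathrm{tf}}_{t,d} &= \begin{cases} 1 + \log_{10} \mathrm{tf}_{t,d} & \text{if } \mathrm{tf}_{t,d} > 0 \\ 0 & \text{otherwise} \end{cases} \\
\mathrm{idf}_t &= \log_{10} \frac{N}{\mathrm{df}_t} \\
w_{t,d} &= w^{\mathrm{tf}}_{t,d} \cdot \mathrm{idf}_t \\
\hat{w}_{t,d} &= \frac{w_{t,d}}{\sqrt{\sum_{t'} w_{t',d}^{2}}} \\
\mathrm{sim}(q, d) &= \sum_{t} \hat{w}_{t,q} \, \hat{w}_{t,d}
\end{aligned}
$$

The query $q$ is weighted and normalized in the same way as a document, so the final score is the cosine of the angle between the two vectors.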

In [68]:
import math

In [69]:
documentListTill3 = Cleaned_tokenized_documentList[:3]
len(documentListTill3)

3

<a id='term_frequency'></a>
> ### Term Frequency

In [71]:
def tf(term,document):
    i=0
    for word in document:
        if word == term:
            i+=1
    return i

In [72]:
def creat_tfMatrix(uniqueWords, docNum):
    # Build all rows first; DataFrame.append was removed in pandas 2.0
    rows = [{'term': term, 'tf': tf(term, documentListTill3[docNum])} for term in uniqueWords]
    return pd.DataFrame(rows, columns=['term', 'tf'])

In [73]:
# create Unique Words 
uniqueWords = []
for doc in documentListTill3:
    uniqueWords += list(set(doc))
uniqueWords = list(set(uniqueWords))
len(uniqueWords)

314

In [74]:
# First Document 
Tf_Idf_Matrix1 = creat_tfMatrix(uniqueWords, 0)

In [75]:
# Second Document 
Tf_Idf_Matrix2 = creat_tfMatrix(uniqueWords, 1)

In [76]:
# third Document 
Tf_Idf_Matrix3 = creat_tfMatrix(uniqueWords, 2)

In [82]:
Tf_Idf_Matrix2

| | term | tf |
|---|---|---|
| 0 | support | 2 |
| 1 | job | 1 |
| 2 | elect | 0 |
| 3 | Wednesday | 0 |
| 4 | generation | 1 |
| ... | ... | ... |
| 309 | program | 7 |
| 310 | Joe | 0 |
| 311 | Republicans | 0 |
| 312 | quality | 0 |
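
`tf` walks the whole document once per term. Counting all tokens once with `collections.Counter` gives the same raw frequencies. This is a sketch of that alternative; `term_frequencies` and the toy data are hypothetical and not part of the notebook.

```python
from collections import Counter

def term_frequencies(document, vocabulary):
    """Return {term: raw count in document} for every term in vocabulary."""
    counts = Counter(document)  # one pass over the document
    # A Counter returns 0 for terms it has never seen
    return {term: counts[term] for term in vocabulary}

doc = ["program", "support", "job", "program", "support", "program"]
vocab = ["support", "job", "elect", "program"]
print(term_frequencies(doc, vocab))
```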


<a id='idf'></a>
> ### Inverse Document Frequency

![title](IDFEqution.jpg)

In [83]:
# Create DF 
def create_df(uniqueWords,documentList):
    df={}
    for term in uniqueWords:
        df[term]=0
        for document in documentList:
            if term in document:
                df[term]+=1
    return df

In [84]:
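# Create IDF
def create_Idf(df, N):
    Idf = {}
    for term in df:  # iterate over the terms of df rather than the global uniqueWords
        # idf = log10(N / df), stored as a string rounded to three decimals
        Idf[term] = format(math.log10(N / df[term]), '.3f')
    return Idf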

In [85]:
df  = create_df(uniqueWords, documentListTill3)
idf = create_Idf(df,3)
idf

{'support': '0.477',
 'job': '0.477',
 'elect': '0.477',
 'Wednesday': '0.477',
 'generation': '0.477',
 'hypermedia': '0.477',
 'difficult': '0.477',
 'empower': '0.477',
 'GOP': '0.477',
 'tomorrow': '0.477',
 'track': '0.477',
 'changes': '0.477',
 'Saturday': '0.477',
 'transforming': '0.477',
 'set': '0.176',
 'Education': '0.477',
 'continues': '0.477',
 'industry': '0.477',
 'results': '0.176',
 'travel': '0.477',
 'containing': '0.477',
 'fight': '0.477',
 'first': '0.477',
 'content': '0.477',
 'April': '0.477',
 'questions': '0.477',
 'Azure': '0.477',
 'browser': '0.477',
 'package': '0.477',
 'technology': '0.477',
 'depth': '0.477',
 'would': '0.477',
 'Country': '0.477',
 'ever': '0.477',
 "CNN's": '0.477',
 'ramp': '0.477',
 'courses': '0.477',
 'calculated': '0.477',
 'administered': '0.477',
 'become': '0.477',
 'across': '0.477',
 'checks': '0.477',
 'businesses': '0.176',
 'working': '0.477',
 'critical': '0.477',
 'fraud': '0.176',
 'voting': '0.477',
 'school': '0.
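
With N = 3 documents only three idf values are possible, which is why the output above contains only `0.477` and `0.176` (a term present in all three documents would get `0.000`). A short sketch that reproduces these values, using a hypothetical helper `idf_value`:

```python
import math

def idf_value(n_docs, doc_freq):
    """Inverse document frequency: log10(N / df)."""
    return math.log10(n_docs / doc_freq)

N = 3
for df_t in (1, 2, 3):
    # Same three-decimal formatting as create_Idf
    print(df_t, format(idf_value(N, df_t), ".3f"))
```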

<a id='tf_idf_matrix'></a>
> ### TF-IDF Matrix

#### Create TF Weight

![title](tf_weight.jpg)

In [86]:
# Create TF Weight
def creat_TF_Weight(Tf_Idf_Matrix):
    TF_Weightlist = []
    for tf in Tf_Idf_Matrix.tf.values:
        if tf != 0:
            TF_Weightlist.append(  1+math.log10(tf))
        else:
            TF_Weightlist.append(0)
    return TF_Weightlist

In [87]:
# First Document 
Tf_Idf_Matrix1['tf_weight'] = creat_TF_Weight(Tf_Idf_Matrix1);

In [88]:
# Second Document 
Tf_Idf_Matrix2['tf_weight'] = creat_TF_Weight(Tf_Idf_Matrix2);

In [89]:
# third Document 
Tf_Idf_Matrix3['tf_weight'] = creat_TF_Weight(Tf_Idf_Matrix3);

In [90]:
Tf_Idf_Matrix2

| | term | tf | tf_weight |
|---|---|---|---|
| 0 | support | 2 | 1.301030 |
| 1 | job | 1 | 1.000000 |
| 2 | elect | 0 | 0.000000 |
| 3 | Wednesday | 0 | 0.000000 |
| 4 | generation | 1 | 1.000000 |
| ... | ... | ... | ... |
| 309 | program | 7 | 1.845098 |
| 310 | Joe | 0 | 0.000000 |
| 311 | Republicans | 0 | 0.000000 |
| 312 | quality | 0 | 0.000000 |


#### Create TF-IDF

![title](TF_IDF.jpg)

In [91]:
# Create TF-IDF
def creat_TF_IDF(Tf_Idf_Matrix, idf):
    TF_IDFlist = []
    # Look up each term's idf by name instead of relying on matching row order
    for term, tf_weight in zip(Tf_Idf_Matrix.term.values, Tf_Idf_Matrix.tf_weight.values):
        TF_IDFlist.append(tf_weight * float(idf[term]))
    return TF_IDFlist

In [92]:
# First Document 
Tf_Idf_Matrix1['tf_idf'] = creat_TF_IDF(Tf_Idf_Matrix1,idf);

In [93]:
# Second Document
Tf_Idf_Matrix2['tf_idf'] = creat_TF_IDF(Tf_Idf_Matrix2,idf);

In [94]:
# third Document 
Tf_Idf_Matrix3['tf_idf'] = creat_TF_IDF(Tf_Idf_Matrix3,idf);

In [95]:
Tf_Idf_Matrix2

| | term | tf | tf_weight | tf_idf |
|---|---|---|---|---|
| 0 | support | 2 | 1.301030 | 0.620591 |
| 1 | job | 1 | 1.000000 | 0.477000 |
| 2 | elect | 0 | 0.000000 | 0.000000 |
| 3 | Wednesday | 0 | 0.000000 | 0.000000 |
| 4 | generation | 1 | 1.000000 | 0.477000 |
| ... | ... | ... | ... | ... |
| 309 | program | 7 | 1.845098 | 0.324737 |
| 310 | Joe | 0 | 0.000000 | 0.000000 |
| 311 | Republicans | 0 | 0.000000 | 0.000000 |
| 312 | quality | 0 | 0.000000 | 0.000000 |
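
Two rows of the table above can be checked by hand: `support` has tf = 2 and idf = 0.477, and the `program` row is consistent with tf = 7 and idf = 0.176. A minimal sketch of the calculation, with hypothetical helper names `tf_weight` and `tf_idf`:

```python
import math

def tf_weight(tf):
    """Log-scaled term frequency: 1 + log10(tf), or 0 when the term is absent."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

def tf_idf(tf, idf):
    """Product of the log-scaled term frequency and the idf."""
    return tf_weight(tf) * idf

print(round(tf_idf(2, 0.477), 6))  # 'support' row
print(round(tf_idf(7, 0.176), 6))  # 'program' row
print(tf_idf(0, 0.477))            # absent term
```

The log scaling is why a term that occurs seven times weighs less than twice as much as a term that occurs once.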


##### Create Normalized Weights

In [97]:
# Create Length
def creat_length(Tf_Idf_Matrix):
    tf_idfSquare = 0
    for tf_idf in Tf_Idf_Matrix.tf_idf.values:
        tf_idfSquare += (tf_idf*tf_idf)
    return math.sqrt(tf_idfSquare)

In [98]:
# First Document 
lengthD1 = creat_length(Tf_Idf_Matrix1)

In [99]:
# Second Document
lengthD2 = creat_length(Tf_Idf_Matrix2)

In [100]:
# third Document 
lengthD3 = creat_length(Tf_Idf_Matrix3)

In [101]:
# Create normalize
def creat_normalize(Tf_Idf_Matrix,length):
    normalizelist = []
    for tf_idf in Tf_Idf_Matrix.tf_idf.values:
        normalizelist.append(tf_idf/length)
    return normalizelist
    

In [102]:
# First Document 
Tf_Idf_Matrix1['normalize'] = creat_normalize(Tf_Idf_Matrix1,lengthD1);

In [103]:
# Second Document
Tf_Idf_Matrix2['normalize'] = creat_normalize(Tf_Idf_Matrix2,lengthD2);

In [104]:
# third Document 
Tf_Idf_Matrix3['normalize'] = creat_normalize(Tf_Idf_Matrix3,lengthD3);

In [108]:
Tf_Idf_Matrix2

| | term | tf | tf_weight | tf_idf | normalize |
|---|---|---|---|---|---|
| 0 | support | 2 | 1.301030 | 0.620591 | 0.110016 |
| 1 | job | 1 | 1.000000 | 0.477000 | 0.084561 |
| 2 | elect | 0 | 0.000000 | 0.000000 | 0.000000 |
| 3 | Wednesday | 0 | 0.000000 | 0.000000 | 0.000000 |
| 4 | generation | 1 | 1.000000 | 0.477000 | 0.084561 |
| ... | ... | ... | ... | ... | ... |
| 309 | program | 7 | 1.845098 | 0.324737 | 0.057568 |
| 310 | Joe | 0 | 0.000000 | 0.000000 | 0.000000 |
| 311 | Republicans | 0 | 0.000000 | 0.000000 | 0.000000 |
| 312 | quality | 0 | 0.000000 | 0.000000 | 0.000000 |
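
Normalizing divides every tf-idf weight by the Euclidean length of the vector, so each document ends up with length 1. The sketch below shows this on a small vector and adds a guard for an all-zero vector, a case `creat_normalize` does not handle (a query with no in-vocabulary terms would cause a division by zero). `l2_normalize` is a hypothetical helper.

```python
import math

def l2_normalize(weights):
    """Scale a weight vector to unit Euclidean length."""
    length = math.sqrt(sum(w * w for w in weights))
    if length == 0:
        return [0.0 for _ in weights]  # all-zero vector: nothing to scale
    return [w / length for w in weights]

vec = l2_normalize([3.0, 0.0, 4.0])
print(vec)
print(math.sqrt(sum(v * v for v in vec)))  # length after normalizing
```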


<a id='Similarity'></a>
> ### Similarity between query and each document

In [109]:
# Create the tf-idf matrix for the query
def create_TFIDFMatrixQuery(query, documentListTill3):
    queryList = regexp_tokenize(query, r"[\w']+")
    # Temporarily add the query as an extra "document" so creat_tfMatrix can index it
    documentListTill3.append(queryList)
    try:
        # Create TF matrix
        queryTf_Idf_Matrix = creat_tfMatrix(uniqueWords, len(documentListTill3) - 1)
    finally:
        # Remove the query again; slicing would only rebind the local name
        documentListTill3.pop()
    # Create TF weight
    queryTf_Idf_Matrix['tf_weight'] = creat_TF_Weight(queryTf_Idf_Matrix)
    # Create TF-IDF
    queryTf_Idf_Matrix['tf_idf'] = creat_TF_IDF(queryTf_Idf_Matrix, idf)
    # Create length
    querylengthD = creat_length(queryTf_Idf_Matrix)
    # Create normalized weights
    queryTf_Idf_Matrix['normalize'] = creat_normalize(queryTf_Idf_Matrix, querylengthD)
    return queryTf_Idf_Matrix

In [110]:
queryTf_Idf_Matrix = create_TFIDFMatrixQuery("job Wednesday hypermedia access",documentListTill3)

            term tf
0        support  0
1            job  1
2          elect  0
3      Wednesday  1
4     generation  0
..           ... ..
309      program  0
310          Joe  0
311  Republicans  0
312      quality  0
313   developing  0

[314 rows x 2 columns]


In [112]:
queryTf_Idf_Matrix

| | term | tf | tf_weight | tf_idf | normalize |
|---|---|---|---|---|---|
| 0 | support | 0 | 0.0 | 0.000 | 0.00000 |
| 1 | job | 1 | 1.0 | 0.477 | 0.56468 |
| 2 | elect | 0 | 0.0 | 0.000 | 0.00000 |
| 3 | Wednesday | 1 | 1.0 | 0.477 | 0.56468 |
| 4 | generation | 0 | 0.0 | 0.000 | 0.00000 |
| ... | ... | ... | ... | ... | ... |
| 309 | program | 0 | 0.0 | 0.000 | 0.00000 |
| 310 | Joe | 0 | 0.0 | 0.000 | 0.00000 |
| 311 | Republicans | 0 | 0.0 | 0.000 | 0.00000 |
| 312 | quality | 0 | 0.0 | 0.000 | 0.00000 |


In [116]:
# Calculate Similarity
def similarity(queryTf_Idf_Matrix,documentTf_Idf_Matrix):
    querynormalize    = list(queryTf_Idf_Matrix['normalize'].values)
    documentnormalize = list(documentTf_Idf_Matrix['normalize'].values)
    result = 0
    for i in range(len(querynormalize)):
        result += querynormalize[i]*documentnormalize[i]
    return result

In [118]:
print( "query similarity With Document1 is: ",similarity(queryTf_Idf_Matrix,Tf_Idf_Matrix1) )
print( "query similarity With Document2 is: ",similarity(queryTf_Idf_Matrix,Tf_Idf_Matrix2) )
print( "query similarity With Document3 is: ",similarity(queryTf_Idf_Matrix,Tf_Idf_Matrix3) )

query similarity With Document1 is:  0.05407387310571205
query similarity With Document2 is:  0.05425034834813515
query similarity With Document3 is:  0.0988644865895887
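
Since both vectors are already normalized, the sum of products above is the cosine similarity, and ranking the documents means sorting by that score. The following sketch works on unnormalized toy vectors and does the normalization inside the score. `cosine_similarity`, `rank_documents` and the vectors are hypothetical and unrelated to the three documents scored above.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector is similar to nothing
    return dot / (norm_u * norm_v)

def rank_documents(query_vec, doc_vecs):
    """Return (doc_id, score) pairs sorted by descending similarity."""
    scores = {doc_id: cosine_similarity(query_vec, vec) for doc_id, vec in doc_vecs.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

query = [1.0, 1.0, 0.0]
docs = {"d1": [1.0, 0.0, 0.0], "d2": [0.0, 0.0, 2.0], "d3": [2.0, 2.0, 0.0]}
ranking = rank_documents(query, docs)
print(ranking)
```

Here `d3` ranks first even though its weights are twice as large as the query's, because cosine similarity depends on direction only, not on vector length.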
