# Describing the inverted index

The first thing we're going to do is set up an inverted index. The index maps each term to the documents that contain it, along with the term's frequency in each of those documents, in a nested dictionary structure.

```
index = {
    term : {
        document_id: term frequency
    }
}
```

We're also going to create a dictionary that will hold the metadata of our documents:

```
documents = {
    document_id: {
        name: name of the document,
        magnitude: length of the vector # This will be important later. 
    }
}
```
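To make the two structures concrete, here is a small sketch of what they could look like after indexing two tiny documents. The document names and contents are made up for illustration; the magnitude is the Euclidean length of each document's term-frequency vector.

```python
from math import sqrt

# Hypothetical toy corpus: doc 0 = "the cat sat", doc 1 = "the cat saw the dog"
index = {
    "the": {0: 1, 1: 2},
    "cat": {0: 1, 1: 1},
    "sat": {0: 1},
    "saw": {1: 1},
    "dog": {1: 1},
}

documents = {
    0: {"name": "toy_a.txt", "magnitude": sqrt(1**2 + 1**2 + 1**2)},
    1: {"name": "toy_b.txt", "magnitude": sqrt(2**2 + 1**2 + 1**2 + 1**2)},
}

# Which documents contain "the", and how often?
print(index["the"])       # {0: 1, 1: 2}
# How many documents contain "dog"? (its document frequency)
print(len(index["dog"]))  # 1
```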

In [2]:
index = {}
documents = {}

# Ingestion of documents

Now we're going to establish the procedure for document ingestion. This will take in a document and will add it to our index. 

First we're going to read in one of the books from our `books` folder:

In [3]:
path = "books/Tanglewood Tales by Nathaniel Hawthorne.txt"
# Using a context manager so the file handle is closed after reading
with open(path, 'r', encoding='utf-8') as f:
    text = f.read()
print(text[1000:1500])

evere application to study had

made upon his health; and I was happy to conclude, from the excellent

physical condition in which I saw him, that the remedy had already been

attended with very desirable success. He had now run up from Boston by

the noon train, partly impelled by the friendly regard with which he

is pleased to honor me, and partly, as I soon found, on a matter of

literary business.



It delighted me to receive Mr. Bright, for the first time, under a roof,

though a very hum


<hr>

To ingest this text properly we need to go through a few steps:

1. Preprocess the text
2. Calculate the frequency of the tokens
3. Insert the terms and their frequencies for each document into the inverted index
4. Calculate the magnitude of the document's term vector
5. Insert the document's metadata into the documents dictionary

We calculate the magnitude of the term vector now, while the document's term frequencies are already in hand, because that is easier than recomputing it at ranking time.
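For reference, the magnitude used here is the Euclidean length of the document's raw term-frequency vector, which matches the `sqrt(sum(x**2 ...))` calculation performed later in this section. The notation is introduced here for illustration: $\mathrm{tf}_{t,d}$ is the number of times term $t$ occurs in document $d$.

```latex
\lVert d \rVert = \sqrt{\sum_{t \in d} \mathrm{tf}_{t,d}^{2}}
```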

In [4]:
from typing import List
import re

# Using a simple regular expression in order to tokenize.
def tokenize(text: str) -> List[str]:
    return re.findall(r"\b\w+\b", text.lower())

In [5]:
tokens = tokenize(text)
print(tokens[100:110])
print("Number of tokens:", len(tokens))

['i', 'was', 'favored', 'with', 'a', 'flying', 'visit', 'from', 'my', 'young']
Number of tokens: 69379
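It is worth seeing how this regular expression treats punctuation. The sketch below repeats the tokenizer on a made-up sentence: the text is lowercased, punctuation is dropped, and an apostrophe splits a word in two because `'` is not a `\w` character.

```python
import re
from typing import List

def tokenize(text: str) -> List[str]:
    # Same tokenizer as above: lowercase, then grab runs of word characters
    return re.findall(r"\b\w+\b", text.lower())

sample = "Mr. Bright's roof, a very humble one!"  # made-up example sentence
print(tokenize(sample))
# ['mr', 'bright', 's', 'roof', 'a', 'very', 'humble', 'one']
```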


To count the frequencies we're going to use the `Counter` class from the `collections` module, which does exactly that and returns a dictionary-like object that we can use to look up the term frequencies.

In [6]:
from collections import Counter

term_frequencies = Counter(tokens)

for k in list(term_frequencies.keys())[:10]:
    print(k, term_frequencies[k])

the 3948
project 87
gutenberg 93
ebook 11
of 2150
tanglewood 9
tales 9
by 295
nathaniel 4
hawthorne 4
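Two `Counter` behaviours are handy here, sketched below on a made-up token list: looking up a term that never occurred returns 0 instead of raising `KeyError`, and `most_common(n)` returns the `n` highest counts.

```python
from collections import Counter

toy_tokens = ["the", "cat", "saw", "the", "dog", "the"]  # made-up tokens
toy_frequencies = Counter(toy_tokens)

print(toy_frequencies["the"])          # 3
print(toy_frequencies["unicorn"])      # 0, missing keys default to zero
print(toy_frequencies.most_common(2))  # [('the', 3), ('cat', 1)]
```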


Now each unique term from the text needs to be added to the index, keyed by a document id, like so:

In [7]:
doc_id = len(documents) # By taking the len(documents) we get an incrementing identifier

for term, term_freq in term_frequencies.items():
    if term not in index:
        index[term] = {}
    index[term][doc_id] = term_freq

Now let's go ahead and calculate the magnitude of the term vector:

In [8]:
from math import sqrt
mag = sqrt(sum(x**2 for x in term_frequencies.values()))
mag

6665.505757255034

Now that we have the magnitude, let's add it to the metadata in our documents dictionary:

In [9]:
documents[doc_id] = {
    "name": path,
    "magnitude": mag
}

Now that we have the procedure established, let's turn it into a function that takes in some text and automatically adds it to the index:

In [11]:
def index_document(text:str, name:str):
    tokens = tokenize(text)
    term_frequencies = Counter(tokens)
    doc_id = len(documents)

    for term in term_frequencies:
        if term not in index:
            index[term] = {}
        index[term][doc_id] = term_frequencies[term]
    
    mag = sqrt(sum([x**2 for x in term_frequencies.values()]))

    documents[doc_id] = {
        "name": name,
        "magnitude": mag
    }
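As a quick sanity check, the sketch below runs the same procedure on two made-up one-line documents and inspects the resulting structures. It is self-contained, so it redefines `tokenize` and starts from empty `index` and `documents` dicts; the document names are hypothetical, and `dict.setdefault` is used as a compact equivalent of the `if term not in index` check.

```python
import re
from collections import Counter
from math import sqrt
from typing import List

index = {}
documents = {}

def tokenize(text: str) -> List[str]:
    return re.findall(r"\b\w+\b", text.lower())

def index_document(text: str, name: str):
    term_frequencies = Counter(tokenize(text))
    doc_id = len(documents)  # incrementing identifier
    for term, term_freq in term_frequencies.items():
        # Create the inner dict on first sight of the term, then record the count
        index.setdefault(term, {})[doc_id] = term_freq
    mag = sqrt(sum(x**2 for x in term_frequencies.values()))
    documents[doc_id] = {"name": name, "magnitude": mag}

# Made-up documents, for illustration only
index_document("The cat sat.", name="toy_a.txt")
index_document("The cat saw the dog.", name="toy_b.txt")

print(index["the"])  # {0: 1, 1: 2}
print(documents[0]["name"], documents[1]["name"])  # toy_a.txt toy_b.txt
```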

The next cell uses `tqdm` to show a progress bar, which can be installed with `pip install tqdm`.

In [12]:
# Getting files from the books folder. This is created by the data_pull.py file

from glob import glob
from tqdm import tqdm
paths = glob("books/*.txt")

# Resetting the index and documents dicts so as not to index the same data twice
index = {}
documents = {}

for path in tqdm(paths):
    # Using a context manager so each file handle is closed after reading
    with open(path, 'r', encoding='utf-8') as f:
        text = f.read()
    index_document(text, name=path)

100%|██████████| 808/808 [00:50<00:00, 15.85it/s]


# Searching the index

Now that we have a variety of documents ingested into the index we can start to query it. To do this we're going to calculate the cosine similarity between the query and each document that contains at least one of the query terms. This gives us a score for each document's similarity to the query, and ranking by that score gives us our search results.

To do this we're going to:

* Iterate over the terms of the query
* Find the documents associated with the current term
* Calculate the sum of the tf-idf scores for each of the individual documents
* Normalize the sum of the tf-idf scores using the pre-calculated magnitude of the document's term vector
* Get the top 10 document ids from the calculated scores
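Written as a formula, the steps above amount to the following score for a query $q$ and a document $d$, summing over the query's tokens. The notation is introduced here for illustration: $N$ is the number of documents, $\mathrm{df}_t$ is the number of documents containing term $t$, $\mathrm{tf}_{t,d}$ is the count of $t$ in $d$, and $\lVert d \rVert$ is the magnitude stored at ingestion time.

```latex
\mathrm{score}(q, d) = \frac{1}{\lVert d \rVert} \sum_{t \in q} \mathrm{tf}_{t,d} \cdot \log\frac{N}{\mathrm{df}_t}
```

The query's own magnitude is left out. It is the same for every document, so dropping it does not change the ranking.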

In [13]:
from collections import Counter
from math import log

query = "Flambeau priest"
query_terms = tokenize(query)

N = len(documents) # The number of documents in the corpus
scores = Counter() # Counter to hold the scores and default to zero if it doesn't exist

for term in query_terms:
    if term not in index:
        continue # Skip query terms that don't appear in any document (avoids a KeyError)
    df = len(index[term]) # The document frequency for the term
    idf = log(N/df) # The inverse document frequency for the term

    for doc_id, tf in index[term].items():
        scores[doc_id] += (tf * idf) # adding the tf-idf to the document's score for this query
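To get a feel for the `idf` weight, here is a small sketch using the corpus size of 808 from the progress bar above and some hypothetical document frequencies. A term found in every document gets a weight of zero, while rarer terms get progressively larger weights.

```python
from math import log

N = 808  # number of documents, taken from the ingestion run above

# Hypothetical document frequencies, for illustration only
for df in (808, 404, 8, 1):
    idf = log(N / df)
    print(f"df={df:>3}  idf={idf:.3f}")

# df=808  idf=0.000
# df=404  idf=0.693
# df=  8  idf=4.615
# df=  1  idf=6.695
```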

In [14]:
# Normalize the scores based on the magnitude we calculated during document ingestion
for doc_id, score in scores.items():
    scores[doc_id] = score / documents[doc_id]['magnitude']

In [15]:
# get the top results from the query
for doc_id, score in scores.most_common(10):
    print(documents[doc_id]['name'], ":", score)

books\The Innocence of Father Brown by G. K. Chesterton.txt : 0.11844505593618641
books\The Wisdom of Father Brown by G. K. Chesterton.txt : 0.07829545072133995
books\The King's Jackal by Richard Harding Davis.txt : 0.010696040353392553
books\Fables by Robert Louis Stevenson.txt : 0.009301339283403058
books\A Theological-Political Treatise [Part IV] by Benedictus de Spinoza.txt : 0.007420887002484888
books\The Damnation of Theron Ware by Harold Frederic.txt : 0.006779504784506538
books\The Confutatio Pontificia by Johann Michael Reu.txt : 0.006464333956174691
books\Ballads by Robert Louis Stevenson.txt : 0.006181985437514823
books\The Story of the Amulet by E. Nesbit.txt : 0.00588393808574806
books\The Mayflower Compact.txt : 0.005525188901532451


In [16]:
def search(query:str):
    query_terms = tokenize(query)

    N = len(documents) # The number of documents in the corpus
    scores = Counter() # Counter to hold the scores and default to zero if it doesn't exist

    for term in query_terms:
        if term not in index:
            continue # Skip query terms that don't appear in any document (avoids a KeyError)
        df = len(index[term]) # The document frequency for the term
        idf = log(N/df) # The inverse document frequency for the term

        for doc_id, tf in index[term].items():
            scores[doc_id] += (tf * idf) # adding the tf-idf to the document's score for this query
    # Normalize the scores based on the magnitude we calculated during document ingestion
    for doc_id, score in scores.items():
        scores[doc_id] = score / documents[doc_id]['magnitude']

    # get the top results from the query
    for doc_id, score in scores.most_common(10):
        print(documents[doc_id]['name'], ":", score)

In [25]:
search("Roland")

books\Orlando Furioso by Lodovico Ariosto.txt : 0.011454267647381083
books\The Song of Roland by C. K. Scott-Moncrieff.txt : 0.005918761796358551
books\Travels with a Donkey in the Cevennes by Robert Louis Stevenson.txt : 0.004376585534473633
books\Life of Robert Browning by William Sharp.txt : 0.0030481482585879337
books\The Breitmann Ballads by Charles Godfrey Leland.txt : 0.002346248211612463
books\Four Arthurian Romances by active 12th century de Troyes Chrétien.txt : 0.0014779907490621055
books\Don Quixote by Miguel de Cervantes Saavedra.txt : 0.001346239924237458
books\Reprinted Pieces by Charles Dickens.txt : 0.0012344111357089579
books\Child Christopher and Goldilind the Fair by William Morris.txt : 0.00106557836375704
books\The Goodness of St. Rocque, and Other Stories by Alice Moore Dunbar-Nelson.txt : 0.0010290289032903579
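
Printing inside `search` makes the results hard to reuse. As a sketch of one alternative, the hypothetical `search_ranked` below returns a list of `(name, score)` pairs and takes the index and documents as arguments. The helper `build` is also hypothetical, and the example runs on a made-up three-document corpus rather than the books above.

```python
import re
from collections import Counter
from math import log, sqrt

def tokenize(text):
    return re.findall(r"\b\w+\b", text.lower())

def build(corpus):
    """Build (index, documents) from a {name: text} mapping."""
    index, documents = {}, {}
    for doc_id, (name, text) in enumerate(corpus.items()):
        tfs = Counter(tokenize(text))
        for term, tf in tfs.items():
            index.setdefault(term, {})[doc_id] = tf
        mag = sqrt(sum(x**2 for x in tfs.values()))
        documents[doc_id] = {"name": name, "magnitude": mag}
    return index, documents

def search_ranked(query, index, documents, top_k=10):
    """Hypothetical variant of search() that returns (name, score) pairs."""
    N = len(documents)
    scores = Counter()
    for term in tokenize(query):
        postings = index.get(term, {})  # unknown terms contribute nothing
        if not postings:
            continue
        idf = log(N / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    # Normalize by the stored magnitude, then sort from best to worst
    ranked = [(doc_id, s / documents[doc_id]["magnitude"]) for doc_id, s in scores.items()]
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return [(documents[doc_id]["name"], s) for doc_id, s in ranked[:top_k]]

corpus = {  # made-up documents
    "a.txt": "the priest spoke to the thief",
    "b.txt": "the thief ran",
    "c.txt": "the garden was quiet",
}
index, documents = build(corpus)
results = search_ranked("priest thief unicorn", index, documents)
print([name for name, _ in results])  # ['a.txt', 'b.txt']
```

Returning the pairs leaves formatting to the caller, and a query made up only of unknown terms yields an empty list instead of an error.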
