## Topic modelling 

Expanding on document vectors: as noted earlier, word counts (raw or normalized by the length of the lexicon/document) don't tell us much about the importance of a word in a document *relative* to the rest of the documents in the corpus.
<br>
Solving this problem would let us start describing each document in terms of what sets it apart from the rest of the corpus.
<br>
Take a corpus of every kite book ever written: the word 'kite' will appear very frequently in every book (document) we count, which gives us no useful information because it cannot distinguish between those documents.
<br>
Related words like 'aerodynamics' or 'wind' may not be common across the entire corpus, but in the documents where they do occur frequently they tell us more about each document's nature. To capture this we need another tool.

**Inverse Document Frequency (IDF)** - Allows us to perform topic analysis in line with *Zipf's law*
<br>
A quick overview of the law, from this [wiki](https://en.wikipedia.org/wiki/Zipf%27s_law)
> Zipf's law states that given some corpus of natural language utterances, the frequency of any word is ***inversely proportional*** to its rank in the frequency table.
* In summary, if we rank the words of a corpus by number of occurrences in descending order, then for a reasonably large sample of documents the first word in the ranked list is roughly twice as likely to occur as the second word and three times as likely as the third
* Given a large corpus, this heuristic gives a rough statistical estimate of how likely a certain word is to appear in any given document of that corpus
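
To make the rank/frequency relationship concrete, here is a minimal sketch that assumes an idealised Zipf distribution, where the expected count at rank r is the top count divided by r. The function name and the top count of 1200 are made up for illustration; real corpora only follow this approximately.

```python
def zipf_expected_counts(top_count, num_ranks):
    """Expected count at each rank if count is proportional to 1 / rank."""
    return [top_count / rank for rank in range(1, num_ranks + 1)]

# Hypothetical corpus whose most frequent word occurs 1200 times
expected = zipf_expected_counts(top_count=1200, num_ranks=4)
for rank, count in enumerate(expected, start=1):
    print(f"rank {rank}: ~{count:.0f} occurrences")
```

Under this assumption the second word is expected half as often as the first, and the third a third as often.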

Given a term frequency counter, one can count tokens and bin them up in two ways
<br>
1) Per document
<br>
2) Across the entire corpus
<br>
<br>
For now, we'll just focus on 1).


Sticking with the kite corpus example, we'll retrieve the total token count for each document in our corpus (the intro and history documents)

In [54]:
from nltk.tokenize import TreebankWordTokenizer
from nlpia.data.loaders import kite_text, kite_history # kite_intro and kite_hist respectively
tokenizer = TreebankWordTokenizer()

kite_intro = kite_text.lower()
intro_tokens = tokenizer.tokenize(kite_intro)
intro_total = len(intro_tokens)
intro_total 

363

In [55]:
kite_hist = str(kite_history).lower()
history_tokens = tokenizer.tokenize(kite_hist)
history_total = len(history_tokens)
history_total

297

Now that we have a couple of tokenized kite documents at our disposal, let's look at the term frequency (TF) of 'kite' in each document.
<br>
We'll store the TFs we find in two dictionaries - one for each document.

In [56]:
from collections import Counter
intro_tf = {}
history_tf = {} 
intro_counts = Counter(intro_tokens)
intro_tf['kite'] = intro_counts['kite']/intro_total
history_counts = Counter(history_tokens)
history_tf['kite'] = history_counts['kite']/history_total

In [57]:
print(f"the Term Frequency of 'kite' in intro document is: {intro_tf['kite']:.4f}")
print(f"the Term Frequency of 'kite' in history document is: {history_tf['kite']:.4f}")

the Term Frequency of 'kite' in intro document is: 0.0441
the Term Frequency of 'kite' in history document is: 0.0202


From the printed values, the TF of 'kite' in the intro document is roughly twice that in the history document. But we cannot conclude that the intro document is twice as much about kites.
<br>
As a further experiment, let's look at the TF of another term, such as 'and'.

In [58]:
intro_tf['and'] = intro_counts['and']/intro_total
history_tf['and'] = history_counts['and']/history_total

In [59]:
print(f"the Term Frequency of 'and' in intro document is: {intro_tf['and']:.4f}")
print(f"the Term Frequency of 'and' in history document is: {history_tf['and']:.4f}")

the Term Frequency of 'and' in intro document is: 0.0275
the Term Frequency of 'and' in history document is: 0.0303


Again, both documents have about as much to say about 'and' as about 'kite', which is not helpful. Judged by TF alone, 'and' would count as an important word in each document, which is not the case: it is exactly the kind of stopword that our earlier heuristic says should be filtered out.

A good way to conceptualize a term's inverse document frequency (IDF) is this: if a term appears relatively frequently in one document but rarely in the rest of the corpus, it's safe to assume that it's important to that document specifically. This is the basic foundation of topic analysis.

**Term IDF** - Ratio of the total number of documents to the number of documents the term appears in. It can be seen as a 'rarity' measure used to weight the TFs

In [60]:
num_docs_containing_and = 0 
for doc in [intro_tokens, history_tokens]:
    if 'and' in doc:
        num_docs_containing_and += 1

In [61]:
num_docs_containing_kite = 0 
for doc in [intro_tokens, history_tokens]:
    if 'kite' in doc:
        num_docs_containing_kite += 1

In [62]:
num_docs_containing_china = 0 
for doc in [intro_tokens, history_tokens]:
    if 'china' in doc:
        num_docs_containing_china += 1

In [63]:
num_docs_containing_and 

2

In [64]:
# IDF ratio 
# Denom is 2 given that 'and' appears in 2 different documents in this corpus (intro_tokens and hist_tokens)
len([intro_tokens, history_tokens]) / num_docs_containing_and

1.0

In [65]:
# TF for China in the two documents 
intro_tf['china'] = intro_counts['china']/intro_total
history_tf['china'] = history_counts['china']/history_total

Now it's just a matter of acquiring the IDF for all three TFs 

In [66]:
num_docs = 2 
intro_idf = {} 
history_idf = {} 
intro_idf['and'] = num_docs/num_docs_containing_and
history_idf['and'] = num_docs/num_docs_containing_and
intro_idf['kite'] = num_docs/num_docs_containing_kite
history_idf['kite'] = num_docs/num_docs_containing_kite
intro_idf['china'] = num_docs/num_docs_containing_china
history_idf['china'] = num_docs/num_docs_containing_china

In [67]:
# tfidf for intro document 
intro_tfidf = {}
intro_tfidf['and'] = intro_tf['and'] * intro_idf['and']
intro_tfidf['kite'] = intro_tf['kite'] * intro_idf['kite']
intro_tfidf['china'] = intro_tf['china'] * intro_idf['china']

In [68]:
intro_tfidf

{'and': 0.027548209366391185, 'kite': 0.0440771349862259, 'china': 0.0}

In [69]:
history_tfidf = {}
history_tfidf['and'] = history_tf['and'] * history_idf['and']
history_tfidf['kite'] = history_tf['kite'] * history_idf['kite']
history_tfidf['china'] = history_tf['china'] * history_idf['china']

In [70]:
history_tfidf

{'and': 0.030303030303030304,
 'kite': 0.020202020202020204,
 'china': 0.020202020202020204}
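
The repeated cells above can be folded into one helper. The sketch below is one possible refactoring, not the notebook's own code: `tfidf_by_document` and the two toy documents are hypothetical stand-ins for the kite texts, and it keeps the same raw-ratio IDF (no log scaling) used so far.

```python
from collections import Counter

def tfidf_by_document(tokenized_docs, terms):
    """Raw-ratio TF-IDF (no log scaling) for selected terms in each document."""
    num_docs = len(tokenized_docs)
    results = []
    for tokens in tokenized_docs:
        counts = Counter(tokens)
        scores = {}
        for term in terms:
            # Number of documents whose token list contains the term
            docs_with_term = sum(1 for doc in tokenized_docs if term in doc)
            tf = counts[term] / len(tokens)
            # A term that occurs nowhere gets an IDF of 0 instead of dividing by zero
            idf = num_docs / docs_with_term if docs_with_term else 0.0
            scores[term] = tf * idf
        results.append(scores)
    return results

# Hypothetical toy corpus standing in for the two kite documents
toy_docs = [
    "a kite is a tethered craft and a toy".split(),
    "kites were invented in china and a kite can fly".split(),
]
toy_scores = tfidf_by_document(toy_docs, ["and", "kite", "china"])
print(toy_scores)
```

As in the kite example, a term that appears in only one document ('china') has its TF doubled there, while terms present in both documents keep their plain TF.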

### Zipf application 

Suppose we have a corpus of 1 million documents in which:
* 'cat' - the term 'cat' appears in 1 document 
* 'dog' - the term 'dog' appears in 10 documents

In [71]:
# cat IDF 
cat_idf = int(1_000_000/1)
print(f'The IDF of cat across 1 million documents is: {cat_idf:,}')

The IDF of cat across 1 million documents is: 1,000,000


In [72]:
# dog IDF 
dog_idf = int(1000000/10)
print(f'The IDF of dog across 1 million documents is: {dog_idf:,}')

The IDF of dog across 1 million documents is: 100,000


The difference in scale between these two raw IDFs is large: a full factor of 10. Zipf's law implies that word frequencies span many orders of magnitude, so even words of similar rank can have very different raw frequencies.
<br>
To control for this, we scale all word and document frequencies with the `log()` function, which is the inverse of the `exp()` function.

In [73]:
import math # built-in library
import numpy as np # scientific computing

Using this `log` scaling ensures that words such as 'cat' and 'dog', which have similar raw counts, no longer differ by orders of magnitude in their IDF.
<br>
It also spreads our TF-IDF scores more evenly.
<br>
The base of the log is not important: changing it only rescales every score by a constant factor, and we care about the relative spread of the scores rather than a particular numerical range.

In [74]:
# Using base 10 log 
print(np.log10(cat_idf))
print(np.log10(dog_idf))

6.0
5.0


In [75]:
from pathlib import Path
import os 

In [76]:
path = Path().home()/'Desktop'/'nlp-map-project'/'chp3-nlpia-notes'/'img-vects'
os.chdir(path)

<img src="img-vects/tfidf.png" alt="tfidf formulation" width="400" height='250'/>

Figure 1 
NLPIA Lane, Howard and Hapke (2019) chp 3.4.1 pp. 254 Apple iBooks.

Figure 1 summarises the TF-IDF formulae, where:
* t - a given term 
* d - a given document 
* D - a given corpus (the full collection of documents)

***Heuristic:***
* TF - The more times a word appears in a document, the higher its TF (and therefore its TF-IDF) for that document
* IDF - As the number of documents containing the word goes up, its IDF (and therefore its TF-IDF) goes down
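
As a rough sketch of these heuristics in code, assume the common definitions: tf is the count of t in d divided by the number of tokens in d, and idf is log10 of the number of documents in D divided by the number of documents containing t. The function names and the toy corpus are made up for illustration, and returning 0 for unseen terms is an assumption of this sketch.

```python
import math

def tf(term, doc_tokens):
    """Share of the document's tokens that are `term`."""
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus):
    """Log-scaled ratio of corpus size to the number of documents containing `term`."""
    docs_with_term = sum(1 for doc_tokens in corpus if term in doc_tokens)
    if docs_with_term == 0:
        return 0.0  # assumption: unseen terms score 0 rather than raising an error
    return math.log10(len(corpus) / docs_with_term)

def tfidf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)

# Hypothetical three-document corpus, already tokenized
toy_corpus = [
    ["kite", "wind", "kite", "string"],
    ["kite", "history", "china"],
    ["kite", "aerodynamics", "wind"],
]
print(tfidf("kite", toy_corpus[0], toy_corpus))    # 'kite' is in every document
print(tfidf("string", toy_corpus[0], toy_corpus))  # 'string' is in one document only
```

With the log in place, a term found in every document gets an IDF of log10(1) = 0, so its TF-IDF vanishes however often it occurs.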

### Relevance ranking

In [77]:
# Load in again the 'Harry' corpus example 
from nlpia.data.loaders import harry_docs as docs
from collections import Counter, OrderedDict
from nltk.tokenize import TreebankWordTokenizer

In [78]:
doc_tokens = [] 
for doc in docs:
    doc_tokens.append(sorted(tokenizer.tokenize(doc.lower())))

all_doc_tokens = sum(doc_tokens, [])
print(len(all_doc_tokens))

lexicon = sorted(set(all_doc_tokens))

print(len(lexicon))
print(lexicon)

33
18
[',', '.', 'and', 'as', 'faster', 'get', 'got', 'hairy', 'harry', 'home', 'is', 'jill', 'not', 'store', 'than', 'the', 'to', 'would']


In [79]:
zero_vector = OrderedDict((token, 0) for token in lexicon)
zero_vector

OrderedDict([(',', 0),
             ('.', 0),
             ('and', 0),
             ('as', 0),
             ('faster', 0),
             ('get', 0),
             ('got', 0),
             ('hairy', 0),
             ('harry', 0),
             ('home', 0),
             ('is', 0),
             ('jill', 0),
             ('not', 0),
             ('store', 0),
             ('than', 0),
             ('the', 0),
             ('to', 0),
             ('would', 0)])

In [80]:
import copy 
document_tfidf_vectors = []
for doc in docs:
    # copy gives each document its own vector, so we don't overwrite a shared object as we iterate
    vec = copy.copy(zero_vector)
    tokens = tokenizer.tokenize(doc.lower())
    token_counts = Counter(tokens)

    for key, value in token_counts.items():
        docs_containing_key = 0 
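        for doc_ in docs:
            # NOTE: doc_ is the raw, original-case string, so this is a substring test;
            # a lowercased key such as 'harry' never matches 'Harry'
            if key in doc_:
                docs_containing_key += 1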
        tf = value / len(lexicon)
        if docs_containing_key:
            idf = len(docs)/docs_containing_key
        else:
            idf = 0 
        vec[key] = tf * idf
    document_tfidf_vectors.append(vec)

In [81]:
import pandas as pd 
df = pd.DataFrame(document_tfidf_vectors)

In [82]:
# Store the document tfidf vectors in a dataframe for better visibility
# rows denote the different sentences/documents with tfidf word vectors (lexicon) as features
df

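| | , | . | and | as | faster | get | got | hairy | harry | home | is | jill | not | store | than | the | to | would |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.166667 | 0.055556 | 0.083333 | 0.0 | 0.25 | 0.166667 | 0.166667 | 0.0 | 0.0 | 0.166667 | 0.0 | 0.0 | 0.0 | 0.166667 | 0.0 | 0.5 | 0.166667 | 0.166667 |
| 1 | 0.0 | 0.055556 | 0.083333 | 0.0 | 0.083333 | 0.0 | 0.0 | 0.083333 | 0.0 | 0.0 | 0.083333 | 0.0 | 0.0 | 0.0 | 0.166667 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.055556 | 0.0 | 0.111111 | 0.0 | 0.0 | 0.0 | 0.083333 | 0.0 | 0.0 | 0.083333 | 0.0 | 0.166667 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |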


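Note that the main topic term, 'harry', receives a weight of 0 in every document. This is not the IDF recognising a term that appears in every document: with the plain ratio used here, such a term would get an IDF of 3/3 = 1 and keep its TF as its weight. The 0 comes from the document-frequency check, which looks for the lowercased token inside the original-case document string, so 'harry' never matches 'Harry' (and 'jill' never matches 'Jill'), `docs_containing_key` stays at 0, and the `idf = 0` branch is taken. Down-weighting a term that appears everywhere is what the log-scaled IDF does, since log(3/3) = 0.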
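
In the loop above, document frequency is checked with `key in doc_` against the raw document string, so a lowercased token such as 'harry' never matches 'Harry', and a short token such as 'as' also matches inside 'faster'. One way to avoid both effects is to count over lowercased token sets. The sketch below is an illustration rather than the notebook's code: it uses a simple regex in place of `TreebankWordTokenizer`, and the three sentences are written out inline on the assumption that they match `harry_docs`.

```python
import math
import re

# Assumed to match the three sentences in nlpia's harry_docs
harry_corpus = [
    "The faster Harry got to the store, the faster and faster Harry would get home.",
    "Harry is hairy and faster than Jill.",
    "Jill is not as hairy as Harry.",
]

def simple_tokenize(text):
    """Lowercase and split into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

# One set of lowercased tokens per document, so membership is an exact token match
token_sets = [set(simple_tokenize(doc)) for doc in harry_corpus]

def doc_frequency(term):
    return sum(1 for tokens in token_sets if term in tokens)

for term in ["harry", "jill", "as", "the"]:
    df = doc_frequency(term)
    ratio_idf = len(harry_corpus) / df
    log_idf = math.log(len(harry_corpus) / df)
    print(f"{term!r}: df={df}, ratio IDF={ratio_idf:.2f}, log IDF={log_idf:.2f}")
```

Counted this way, 'harry' is found in all three documents, so its ratio IDF is 1 and its log IDF is 0.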

From here, we have a K-dimensional vector representation of each document in the corpus. 

In [83]:
print(f'the DataFrame dimensions (row x columns) for the tfidf word vectors: {df.shape}')

the DataFrame dimensions (row x columns) for the tfidf word vectors: (3, 18)


We can build on this representation for search, where two vectors in a given vector space are considered similar if the angle between them is small.
<br>
As a reminder, we measure this with cosine similarity: the higher the score, the more similar the two documents.
<br>
The formula to compute is:
<br>
$\cos(\Theta) = \frac{A \cdot B}{|A| \, |B|}$

To perform a simple TF-IDF search, treat the search query itself as a document and therefore get the TF-IDF based vector representation of it.
<br>
Then find the documents whose vectors have the highest cosine similarities to the query and return such computations as search results. 

Again take the example of the three documents about Harry and use the example search query:
> “How long does it take to get to the store?”
<br>

In [115]:
query = 'How long does it take to get to the store?'
query_vec = copy.copy(zero_vector)

tokens = tokenizer.tokenize(query.lower())
token_counts = Counter(tokens)

for key, value in token_counts.items():
    docs_containing_key = 0 
    for doc_ in docs:
        if key in doc_.lower():
            docs_containing_key += 1
    if docs_containing_key == 0:
        # Skip tokens that appear in no document (they are not in the lexicon);
        # this also avoids a division-by-zero error in the IDF below
        continue
    tf = value/len(tokens)
    idf = len(docs)/docs_containing_key
    query_vec[key] = tf * idf


In [116]:
def cos_sim(vect_a, vect_b):
    """ Convert from dictionary to lists for easier computation and matching """
    vect_a = [val for val in vect_a.values()]
    vect_b = [val for val in vect_b.values()]

    dot_prod = 0
    for i, v in enumerate(vect_a):
        dot_prod += v * vect_b[i]
    
    mag_1 = math.sqrt(sum([x**2 for x in vect_a]))
    mag_2 = math.sqrt(sum([x**2 for x in vect_b]))

    return dot_prod / (mag_1 * mag_2)

In [117]:
print(f'cosine similarity between search query and first document in Harry corpus: {cos_sim(query_vec, document_tfidf_vectors[0])}')

cosine similarity between search query and first document in Harry corpus: 0.6132857433407973


In [118]:
print(f'cosine similarity between search query and second document in Harry corpus: {cos_sim(query_vec, document_tfidf_vectors[1])}')

cosine similarity between search query and second document in Harry corpus: 0.0


In [119]:
print(f'cosine similarity between search query and third document in Harry corpus: {cos_sim(query_vec, document_tfidf_vectors[2])}')

cosine similarity between search query and third document in Harry corpus: 0.0


We can see that the first document (doc 0) is the most relevant to our search query.
* This simple technique of building TF-IDF vectors and comparing cosine similarities lets us find relevant documents in any corpus using keywords

Major search engines like Google use an *inverted index*, which maps each term directly to the documents that contain it, so a lookup takes roughly constant time, O(1). Our simple TF-IDF vector search instead performs an 'index scan' for each query, comparing the query against every document vector in O(N) (linear) time, which becomes far slower than constant time as the corpus grows.
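
To illustrate the idea (not how any particular search engine implements it), here is a minimal inverted index built from a dictionary of sets. The function names and the pre-tokenized toy corpus are hypothetical.

```python
from collections import defaultdict

def build_inverted_index(tokenized_docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, tokens in enumerate(tokenized_docs):
        for token in tokens:
            index[token].add(doc_id)
    return index

def candidate_docs(index, query_tokens):
    """Union of the posting sets for the query tokens; no scan over all documents."""
    matches = set()
    for token in query_tokens:
        matches |= index.get(token, set())
    return matches

# Hypothetical pre-tokenized corpus
toy_docs = [
    ["harry", "got", "to", "the", "store"],
    ["harry", "is", "hairy"],
    ["jill", "is", "not", "hairy"],
]
index = build_inverted_index(toy_docs)
print(candidate_docs(index, ["to", "the", "store"]))
print(candidate_docs(index, ["hairy"]))
```

Each query token costs one dictionary lookup, so only the documents returned here would need a cosine similarity score.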

N.B - To make the query comparison above more robust, rather than dropping the keys that weren't found in the lexicon to avoid a zero denominator (i.e. a division-by-zero error), a better approach is to add 1 to the denominator of every IDF calculation.
<br>
This approach is known as ***additive smoothing (Laplace smoothing)***. It usually improves the search results for TF-IDF keyword-based searches.
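
A minimal sketch of the smoothing described above. The first function adds 1 to the denominator exactly as stated; the second is a common log-scaled variant that adds 1 to both counts, included as an assumption for comparison rather than something taken from the text. Both function names are made up.

```python
import math

def smoothed_idf(num_docs, docs_containing_term):
    """Ratio IDF with 1 added to the denominator, so it is defined even for unseen terms."""
    return num_docs / (1 + docs_containing_term)

def smoothed_log_idf(num_docs, docs_containing_term):
    """Log variant that adds 1 to both counts, keeping the value non-negative."""
    return math.log((1 + num_docs) / (1 + docs_containing_term))

# A corpus of 3 documents; the term appears in 0, 1 or all 3 of them
for df in [0, 1, 3]:
    print(df, smoothed_idf(3, df), round(smoothed_log_idf(3, df), 3))
```

A term that appears in no document now receives the largest IDF instead of raising an error or being skipped.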

### Tools (automating the search similarity pipeline)

Tools/packages have been designed in this domain to automate most of the plain-Python code above, reducing the number of lines needed for the same computations.

In [89]:
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = docs 
vectorizer = TfidfVectorizer(min_df=1) # default min_df is 1 anyway
model = vectorizer.fit_transform(corpus)

In [110]:
# as above, we have 3 documents, but here with 16 features (the vectorizer's default tokenizer already drops the two punctuation tokens for us)
print(model.get_shape) # sparse matrix 
model.shape

<bound method spmatrix.get_shape of <3x16 sparse matrix of type '<class 'numpy.float64'>'
	with 23 stored elements in Compressed Sparse Row format>>


(3, 16)

In [130]:
model.todense()

matrix([[0.1614879 , 0.        , 0.48446369, 0.21233718, 0.21233718,
         0.        , 0.25081952, 0.21233718, 0.        , 0.        ,
         0.        , 0.21233718, 0.        , 0.63701154, 0.21233718,
         0.21233718],
        [0.36930805, 0.        , 0.36930805, 0.        , 0.        ,
         0.36930805, 0.28680065, 0.        , 0.36930805, 0.36930805,
         0.        , 0.        , 0.48559571, 0.        , 0.        ,
         0.        ],
        [0.        , 0.75143242, 0.        , 0.        , 0.        ,
         0.28574186, 0.22190405, 0.        , 0.28574186, 0.28574186,
         0.37571621, 0.        , 0.        , 0.        , 0.        ,
         0.        ]])

From above, we get the matrix of TF-IDF values, with one row per document and one column per term, much like a list of lists.
<br>
We can make this clearer by attaching the corresponding vocabulary with a **DataFrame**, where each column is a term and each row is a document.

In [131]:
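# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is the current equivalent
df = pd.DataFrame(model.todense().round(2), columns=vectorizer.get_feature_names_out())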
df

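| | and | as | faster | get | got | hairy | harry | home | is | jill | not | store | than | the | to | would |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.16 | 0.0 | 0.48 | 0.21 | 0.21 | 0.0 | 0.25 | 0.21 | 0.0 | 0.0 | 0.0 | 0.21 | 0.0 | 0.64 | 0.21 | 0.21 |
| 1 | 0.37 | 0.0 | 0.37 | 0.0 | 0.0 | 0.37 | 0.29 | 0.0 | 0.37 | 0.37 | 0.0 | 0.0 | 0.49 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.75 | 0.0 | 0.0 | 0.0 | 0.29 | 0.22 | 0.0 | 0.29 | 0.29 | 0.38 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |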


Lastly, we can also use the `scikit-learn` function `cosine_similarity` to score each document against any query we want to search for, as described above.

In [127]:
from sklearn.metrics.pairwise import cosine_similarity

In [128]:
def get_tf_idf_query_similarity(vectorizer, docs_tfidf, query):
    query_tfidf = vectorizer.transform([query])
    cosineSimilarities = cosine_similarity(query_tfidf, docs_tfidf).flatten()
    return cosineSimilarities

In [129]:
get_tf_idf_query_similarity(vectorizer, model, 'How long does it take to get to the store?')

array([0.56179137, 0.        , 0.        ])

As before, the first document in the Harry corpus is the most relevant to our search query, and the scores are close to the ones we computed above in plain Python (0.56 versus 0.61 for the first document). The small difference is expected: by default `TfidfVectorizer` drops punctuation and single-character tokens, uses a smoothed, log-scaled IDF, and L2-normalises each row.
<br>
Overall, when we work with large texts, these pre-optimised TF-IDF models save a lot of work and time.
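
To see where the `TfidfVectorizer` numbers come from, the sketch below reproduces the third row of the matrix in plain Python, assuming the vectorizer's default settings: raw counts as TF, a smoothed IDF of ln((1 + n) / (1 + df)) + 1, and L2 normalisation of each row. The regex approximates the default token pattern (lowercased words of two or more characters), the helper names are made up, and the three sentences are assumed to match `harry_docs`.

```python
import math
import re
from collections import Counter

# Assumed to match the three sentences in nlpia's harry_docs
harry_corpus = [
    "The faster Harry got to the store, the faster and faster Harry would get home.",
    "Harry is hairy and faster than Jill.",
    "Jill is not as hairy as Harry.",
]

def sk_tokenize(text):
    """Lowercased words of 2+ characters, approximating the default token pattern."""
    return re.findall(r"\b\w\w+\b", text.lower())

tokenized = [sk_tokenize(doc) for doc in harry_corpus]
n_docs = len(tokenized)

def smooth_idf(term):
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1."""
    df = sum(1 for tokens in tokenized if term in tokens)
    return math.log((1 + n_docs) / (1 + df)) + 1

def tfidf_row(tokens):
    """Raw count times smoothed IDF, then scaled to unit L2 length."""
    counts = Counter(tokens)
    raw = {term: count * smooth_idf(term) for term, count in counts.items()}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {term: v / norm for term, v in raw.items()}

third_row = tfidf_row(tokenized[2])
print({term: round(v, 2) for term, v in sorted(third_row.items())})
```

The rounded values agree with the last row of the DataFrame above (for example 0.75 for 'as' and 0.22 for 'harry').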