## Topic modelling 

Expanding on document vectors: as mentioned earlier, word counts (raw or normalized by the length of the lexicon or document) don't tell us much about the importance of a word in a document *relative* to the rest of the documents in the corpus.
<br>
Solving this problem would let us start to describe the documents within the corpus.
<br>
Take a corpus of every kite book ever written: the word 'kite' will appear very frequently in every book (document) we count, so it gives us no useful information because it cannot distinguish between those documents.
<br>
Related words like 'aerodynamics' or 'wind' may not be common across the entire corpus, but in the documents where they do occur frequently, they tell us more about each document's nature. To capture this we need another tool.

**Inverse Document Frequency (IDF)** - Allows us to perform topic analysis corresponding to *Zipf's law*
<br>
A quick overview of the law, from this [wiki](https://en.wikipedia.org/wiki/Zipf%27s_law):
> Zipf's law states that given some corpus of natural language utterances, the frequency of any word is ***inversely proportional*** to its rank in the frequency table.
* In summary, if we rank the words of a corpus by number of occurrences in descending order, then for a sufficiently large sample of documents the first word in the ranked list is roughly twice as likely to occur as the second word and three times as likely as the third
* Given a large corpus, this heuristic gives a rough statistical estimate of how likely a certain word is to appear in any given document of that corpus
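
The rank relationship above can be sketched in a few lines. This is only an idealized illustration: the function name `zipf_expected_counts` and the count of 6,000 for the top word are made up for the example, and real corpora follow the pattern only approximately.

```python
def zipf_expected_counts(top_count, n_ranks):
    """Idealized Zipf counts: the word at rank r occurs about top_count / r times."""
    return [top_count / rank for rank in range(1, n_ranks + 1)]

# Hypothetical corpus whose most frequent word occurs 6,000 times
expected = zipf_expected_counts(6000, 4)
print(expected)  # [6000.0, 3000.0, 2000.0, 1500.0]
```

The first entry is twice the second and three times the third, which is the pattern described in the bullet above.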

Given a term frequency counter, one can count tokens and bin them up in two ways
<br>
1) Per document
<br>
2) Across the entire corpus
<br>
<br>
For now, we'll just focus on 1).
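
As a minimal sketch of the two binning options, the toy documents below (made up for illustration, not the kite texts) are counted once per document and once across the whole corpus using `collections.Counter`.

```python
from collections import Counter

# Toy, already-tokenized documents (hypothetical, for illustration only)
toy_docs = {
    "doc_a": ["kite", "wind", "kite", "and", "string"],
    "doc_b": ["kite", "and", "china", "and"],
}

# 1) Per document: one Counter per document
per_doc_counts = {name: Counter(tokens) for name, tokens in toy_docs.items()}

# 2) Across the entire corpus: add the per-document Counters together
corpus_counts = sum(per_doc_counts.values(), Counter())

print(per_doc_counts["doc_a"]["kite"])  # 2
print(corpus_counts["kite"])            # 3
print(corpus_counts["and"])             # 3
```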


Sticking with the kite corpus example, we'll retrieve the total token count for each document in our corpus (intro_doc and history_doc)

In [22]:
from nltk.tokenize import TreebankWordTokenizer
from nlpia.data.loaders import kite_text, kite_history # kite_intro and kite_hist respectively
tokenizer = TreebankWordTokenizer()

kite_intro = kite_text.lower()
intro_tokens = tokenizer.tokenize(kite_intro)
intro_total = len(intro_tokens)
intro_total 

363

In [23]:
kite_hist = str(kite_history).lower()
history_tokens = tokenizer.tokenize(kite_hist)
history_total = len(history_tokens)
history_total

297

Now that we have a couple of tokenized kite documents at our disposal, let's look at the term frequency (TF) of 'kite' in each document
<br>
We'll store the TFs we find in two dictionaries - one for each document.

In [24]:
from collections import Counter
intro_tf = {}
history_tf = {} 
intro_counts = Counter(intro_tokens)
intro_tf['kite'] = intro_counts['kite']/intro_total
history_counts = Counter(history_tokens)
history_tf['kite'] = history_counts['kite']/history_total

In [25]:
print(f"the Term Frequency of 'kite' in intro document is: {intro_tf['kite']:.4f}")
print(f"the Term Frequency of 'kite' in history document is: {history_tf['kite']:.4f}")

the Term Frequency of 'kite' in intro document is: 0.0441
the Term Frequency of 'kite' in history document is: 0.0202


From the printed output, the TF of 'kite' in the intro document is roughly twice its TF in the history document. But we cannot conclude that the intro document is twice as much about kites.
<br>
As a further experiment, we can compute the corresponding TF for other terms, such as 'and'.

In [26]:
intro_tf['and'] = intro_counts['and']/intro_total
history_tf['and'] = history_counts['and']/history_total

In [27]:
print(f"the Term Frequency of 'and' in intro document is: {intro_tf['and']:.4f}")
print(f"the Term Frequency of 'and' in history document is: {history_tf['and']:.4f}")

the Term Frequency of 'and' in intro document is: 0.0275
the Term Frequency of 'and' in history document is: 0.0303


By raw TF, both documents have about as much to say about 'and' as about 'kite'. That is not helpful: judged by TF within the document alone, 'and' looks like an important word, which is not the case given our heuristic that stopwords/prepositions should be filtered out.

A good way to conceptualize a term's inverse document frequency (IDF) is this: if a term appears relatively frequently in a document but rarely in the rest of the corpus, it's safe to assume that it's important to that document specifically. This is the basic foundation of topic analysis.

**Term IDF** - Ratio of the total number of documents to the number of documents the term appears in. It can be seen as a 'rarity' measure used to weight the TFs

In [28]:
num_docs_containing_and = 0 
for doc in [intro_tokens, history_tokens]:
    if 'and' in doc:
        num_docs_containing_and += 1

In [29]:
num_docs_containing_kite = 0 
for doc in [intro_tokens, history_tokens]:
    if 'kite' in doc:
        num_docs_containing_kite += 1

In [30]:
num_docs_containing_china = 0 
for doc in [intro_tokens, history_tokens]:
    if 'china' in doc:
        num_docs_containing_china += 1

In [31]:
num_docs_containing_and 

2

In [32]:
# IDF ratio 
# Denominator is 2 because 'and' appears in both documents in this corpus (intro_tokens and history_tokens)
len([intro_tokens, history_tokens]) / num_docs_containing_and

1.0

In [33]:
# TF for China in the two documents 
intro_tf['china'] = intro_counts['china']/intro_total
history_tf['china'] = history_counts['china']/history_total

Now it's just a matter of computing the IDF for all three terms

In [34]:
num_docs = 2 
intro_idf = {} 
history_idf = {} 
intro_idf['and'] = num_docs/num_docs_containing_and
history_idf['and'] = num_docs/num_docs_containing_and
intro_idf['kite'] = num_docs/num_docs_containing_kite
history_idf['kite'] = num_docs/num_docs_containing_kite
intro_idf['china'] = num_docs/num_docs_containing_china
history_idf['china'] = num_docs/num_docs_containing_china

In [35]:
# tfidf for intro document 
intro_tfidf = {}
intro_tfidf['and'] = intro_tf['and'] * intro_idf['and']
intro_tfidf['kite'] = intro_tf['kite'] * intro_idf['kite']
intro_tfidf['china'] = intro_tf['china'] * intro_idf['china']

In [36]:
intro_tfidf

{'and': 0.027548209366391185, 'kite': 0.0440771349862259, 'china': 0.0}

In [37]:
history_tfidf = {}
history_tfidf['and'] = history_tf['and'] * history_idf['and']
history_tfidf['kite'] = history_tf['kite'] * history_idf['kite']
history_tfidf['china'] = history_tf['china'] * history_idf['china']

In [38]:
history_tfidf

{'and': 0.030303030303030304,
 'kite': 0.020202020202020204,
 'china': 0.020202020202020204}
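
The cells above repeat the same counting loop for each term. A minimal sketch of how that could be generalized follows. The helper names (`term_frequency`, `raw_idf`, `raw_tfidf`) and the toy token lists are hypothetical, and, like the cells above, this uses the raw ratio for IDF with no log scaling.

```python
from collections import Counter

def term_frequency(term, tokens):
    """Share of the document's tokens that are `term`."""
    return Counter(tokens)[term] / len(tokens)

def raw_idf(term, docs):
    """Number of documents divided by the number of documents containing `term`."""
    containing = sum(1 for tokens in docs if term in tokens)
    if containing == 0:
        raise ValueError(f"{term!r} does not appear in any document")
    return len(docs) / containing

def raw_tfidf(term, tokens, docs):
    """Unscaled TF-IDF: term frequency weighted by the raw IDF ratio."""
    return term_frequency(term, tokens) * raw_idf(term, docs)

# Toy stand-ins for the tokenized intro and history documents
toy_intro = ["kite", "and", "wind", "kite"]
toy_history = ["kite", "and", "china", "and", "silk"]
toy_docs = [toy_intro, toy_history]

print(raw_tfidf("kite", toy_intro, toy_docs))     # 0.5
print(raw_tfidf("china", toy_intro, toy_docs))    # 0.0
print(raw_tfidf("china", toy_history, toy_docs))  # 0.4
```

As in the kite results, a term found in only one document ('china') gets doubled by the IDF there, while terms present in both documents keep their plain TF.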

### Zipf application 

Take a corpus of 1 million documents as an example, and suppose the following
* 'cat' - the term 'cat' appears in 1 document 
* 'dog' - the term 'dog' appears in 10 documents

In [57]:
# cat IDF 
cat_idf = int(1_000_000/1)
print(f'The IDF of cat across 1 million documents is: {cat_idf:,}')

The IDF of cat across 1 million documents is: 1,000,000


In [60]:
# dog IDF 
dog_idf = int(1000000/10)
print(f'The IDF of dog across 1 million documents is: {dog_idf:,}')

The IDF of dog across 1 million documents is: 100,000


As such, the difference in scale appears large. According to Zipf's law, when we compare the frequencies of two words, even ones that occur a relatively similar number of times, the more frequent word will have an exponentially higher frequency than the less frequent one.
<br>
To control for this, Zipf's law suggests scaling all word and document frequencies with the `log()` function, which is the inverse of the `exp()` function.

In [61]:
import math # built-in library
import numpy as np # scientific computing

Applying `log` ensures that words such as 'cat' and 'dog', which have relatively similar counts, aren't exponentially different in frequency.
<br>
Also, log-scaling the word frequencies makes our TF-IDF scores more uniformly distributed
<br>
The base of the log function is not important, as we only care about making the frequency distribution more uniform, not about scaling it into a particular numerical range.
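
A small sketch of why the base does not matter: changing the base multiplies every log value by the same constant, so the relative sizes stay the same. The variable names below are hypothetical and reuse the 'cat' and 'dog' document counts from above.

```python
import math

cat_raw_idf = 1_000_000 / 1    # 'cat' appears in 1 of 1 million documents
dog_raw_idf = 1_000_000 / 10   # 'dog' appears in 10 of 1 million documents

ratios = {}
for name, log_fn in [("ln", math.log), ("log2", math.log2), ("log10", math.log10)]:
    # Ratio of the two log-scaled IDFs in this base
    ratios[name] = log_fn(cat_raw_idf) / log_fn(dog_raw_idf)
    print(f"{name}: cat/dog ratio = {ratios[name]:.3f}")
```

In every base the ratio comes out at about 1.2, so only the absolute scale of the values changes.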

In [65]:
# Using base 10 log 
print(np.log10(cat_idf))
print(np.log10(dog_idf))

6.0
5.0


In [66]:
from pathlib import Path
import os 

In [67]:
path = Path().home()/'Desktop'/'nlp-map-project'/'chp3-nlpia-notes'/'img-vects'
os.chdir(path)

<img src="img-vects/tfidf.png" alt="tfidf formulation" width="400" height='250'/>

Figure 1 
NLPIA Lane, Howard and Hapke (2019) chp 3.4.1 pp. 254 Apple iBooks.

Figure 1 summarises the TF-IDF formulae, where
* t - a given term 
* d - a given document 
* D - A given corpus (collection of documents in total)
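
Assuming Figure 1 follows the standard log-scaled definition (an assumption, since the image is not reproduced in text), the formulae can be written with the symbols above as:

$$
\begin{aligned}
\mathrm{tf}(t, d) &= \frac{\mathrm{count}(t \text{ in } d)}{\text{number of tokens in } d} \\
\mathrm{idf}(t, D) &= \log\frac{|D|}{|\{d \in D : t \in d\}|} \\
\mathrm{tfidf}(t, d, D) &= \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D)
\end{aligned}
$$

Here $|D|$ is the number of documents in the corpus and $|\{d \in D : t \in d\}|$ is the number of documents that contain the term $t$.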

***Heuristic:***
* TF - The more times a word appears in the document, the higher its TF (and hence its TF-IDF)
* IDF - The more documents contain the word, the lower its IDF (and hence its TF-IDF)
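
Both heuristics can be checked with a small log-scaled sketch. The function names (`tf`, `log_idf`, `log_tfidf`) and the toy corpus are hypothetical, base 10 is an arbitrary choice, and the code assumes every queried term occurs in at least one document.

```python
import math
from collections import Counter

def tf(term, doc_tokens):
    """Term count divided by the number of tokens in the document."""
    return Counter(doc_tokens)[term] / len(doc_tokens)

def log_idf(term, corpus):
    """log10 of (number of documents / number of documents containing term)."""
    n_containing = sum(1 for doc_tokens in corpus if term in doc_tokens)
    return math.log10(len(corpus) / n_containing)  # assumes n_containing > 0

def log_tfidf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * log_idf(term, corpus)

toy_corpus = [
    ["kite", "wind", "wind", "string"],
    ["kite", "china", "silk", "paper"],
    ["kite", "festival", "wind", "sky"],
]

# IDF heuristic: the more documents contain a term, the lower its IDF
for term in ["china", "wind", "kite"]:
    print(term, round(log_idf(term, toy_corpus), 3))

# TF heuristic: 'wind' occurs more often in the first document than in the third
print(log_tfidf("wind", toy_corpus[0], toy_corpus) > log_tfidf("wind", toy_corpus[2], toy_corpus))
```

Because 'kite' appears in every toy document, its log-scaled IDF is zero, so it contributes nothing to any document's TF-IDF, which matches the kite-book motivation at the start of this section.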