## Topic modelling 

Expanding on document vectors: as mentioned earlier, word counts (raw, or normalized by the length of the lexicon/document) don't tell us much about the importance of a word in a document *relative* to the rest of the documents in the corpus.
<br>
Solving this problem would let us start to describe the documents within the corpus.
<br>
Take an example corpus of every kite book ever written: the word 'kite' will appear very frequently in every book (document) we count, which gives us no useful information because it cannot distinguish between those documents.
<br>
Related words like 'aerodynamics' or 'wind' may not be common across the entire corpus, but in the documents where they do occur frequently, they tell us more about each document's nature. To capture this we need another tool.

**Inverse Document Frequency (IDF)** - Lets us perform topic analysis in line with *Zipf's law*
<br>
A quick overview of the law, from [Wikipedia](https://en.wikipedia.org/wiki/Zipf%27s_law):
> Zipf's law states that given some corpus of natural language utterances, the frequency of any word is ***inversely proportional*** to its rank in the frequency table.
* In summary, if we rank the words of a corpus by number of occurrences and list them in descending order, then for a sufficiently large sample of documents the first word in the ranked list is about twice as likely to occur as the second word, and about three times as likely as the third
* Given a large corpus, we can use this heuristic to estimate how likely a certain word is to appear in any given document of that corpus
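
To make the "twice as likely, three times as likely" pattern concrete, here's a minimal sketch of the idealized relationship `count(rank) = count(1) / rank`. The top-word count and the helper name `zipf_expected_count` are made up for illustration; real corpora only follow this roughly.

```python
# Idealized Zipf's law: the word at rank r occurs about 1/r as often as the top word.
# NOTE: top_word_count is a hypothetical number, not measured from any corpus.
top_word_count = 6000

def zipf_expected_count(rank, top_count=top_word_count):
    """Idealized count for the word at a given rank (1 = most frequent)."""
    return top_count / rank

for rank in range(1, 5):
    print(rank, zipf_expected_count(rank))
```

Under this idealization the rank-2 word gets half the top count and the rank-3 word gets a third of it.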

Given a term frequency counter, we can count tokens and bin them up in two ways:
<br>
1) Per document
<br>
2) Across the entire corpus
<br>
<br>
For now, we'll just focus on 1).
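
As a rough illustration of the two binning options, here's a small sketch using two made-up token lists. The names `docs`, `per_doc_counts` and `corpus_counts` are assumptions for this example only, not part of the kite data.

```python
from collections import Counter

# Two tiny, made-up token lists standing in for tokenized documents.
docs = {
    "doc_a": ["kite", "wind", "kite", "and", "string"],
    "doc_b": ["kite", "history", "and", "and"],
}

# 1) Per document: one Counter for each document.
per_doc_counts = {name: Counter(tokens) for name, tokens in docs.items()}

# 2) Across the entire corpus: a single Counter over all tokens.
corpus_counts = Counter()
for tokens in docs.values():
    corpus_counts.update(tokens)

print(per_doc_counts["doc_a"]["kite"])  # count of 'kite' in doc_a only
print(corpus_counts["kite"])            # count of 'kite' over both documents
```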


Sticking with the kite corpus example, we'll retrieve the total token count for each document in our corpus (the kite intro and kite history documents)

In [13]:
from nltk.tokenize import TreebankWordTokenizer
from nlpia.data.loaders import kite_text, kite_history # kite_intro and kite_hist respectively
tokenizer = TreebankWordTokenizer()

kite_intro = kite_text.lower()
intro_tokens = tokenizer.tokenize(kite_intro)
intro_total = len(intro_tokens)
intro_total 

363

In [14]:
kite_hist = str(kite_history).lower()
hist_tokens = tokenizer.tokenize(kite_hist)
hist_total = len(hist_tokens)
hist_total

297

Now that we have a couple of tokenized kite documents at our disposal, let's look at the term frequency (TF) of 'kite' in each document.
<br>
We'll store the TFs we find in two dictionaries - one for each document.

In [15]:
from collections import Counter
intro_tf = {}
history_tf = {} 
intro_counts = Counter(intro_tokens)
intro_tf['kite'] = intro_counts['kite']/intro_total
hist_counts = Counter(hist_tokens)
history_tf['kite'] = hist_counts['kite']/hist_total

In [17]:
print(f"the Term Frequency of 'kite' in intro document is: {intro_tf['kite']:.4f}")
print(f"the Term Frequency of 'kite' in history document is: {history_tf['kite']:.4f}")

the Term Frequency of 'kite' in intro document is: 0.0441
the Term Frequency of 'kite' in history document is: 0.0202


Judging by the printed values, the TF of 'kite' in the intro document is roughly twice its TF in the history document. But we cannot conclude that the intro is twice as much about kites.
<br>
As a further experiment, let's go a bit deeper and compute the corresponding TF for another term, such as 'and'.

In [18]:
intro_tf['and'] = intro_counts['and']/intro_total
history_tf['and'] = hist_counts['and']/hist_total

In [19]:
print(f"the Term Frequency of 'and' in intro document is: {intro_tf['and']:.4f}")
print(f"the Term Frequency of 'and' in history document is: {history_tf['and']:.4f}")

the Term Frequency of 'and' in intro document is: 0.0275
the Term Frequency of 'and' in history document is: 0.0303


By raw TF, both documents have about as much to say about 'and' as they do about 'kite'. That is not helpful: a quick look at these TFs reveals nothing about what the documents are about. Judged by its TF within the document alone, 'and' would count as an important word, which is not the case; it is exactly the kind of stopword we would normally filter out.

A good way to conceptualize a term's inverse document frequency (IDF) is this: if a term appears in a document relatively frequently but occurs rarely in the rest of the corpus, it's safe to assume that it's important to that document specifically. This is the basic foundation of topic analysis.

**Term IDF** - Ratio of the total number of documents to the number of documents the term appears in. It can be seen as a 'rarity' measure used to weight the TFs
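
The definition above can be sketched as a small helper. This is only an illustration on made-up token lists: the function name `inverse_document_frequency` and the choice to return `0.0` for a term that appears in no document are assumptions, not something the definition prescribes.

```python
def inverse_document_frequency(term, tokenized_docs):
    """Total number of documents divided by the number of documents containing `term`."""
    num_docs_containing = sum(1 for tokens in tokenized_docs if term in tokens)
    if num_docs_containing == 0:
        return 0.0  # assumption: avoid dividing by zero for unseen terms
    return len(tokenized_docs) / num_docs_containing

# Made-up corpus of four tiny "documents".
toy_docs = [
    ["kite", "and", "wind"],
    ["kite", "and", "history"],
    ["kite", "aerodynamics"],
    ["kite", "and", "string"],
]

for term in ["kite", "and", "aerodynamics"]:
    print(term, inverse_document_frequency(term, toy_docs))
```

In this toy corpus 'kite' appears in every document, so its ratio is the lowest possible (1.0), while 'aerodynamics' appears in only one of the four documents and gets the highest ratio.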

In [20]:
num_docs_containing_and = 0 
for doc in [intro_tokens, hist_tokens]:
    if 'and' in doc:
        num_docs_containing_and += 1

In [21]:
num_docs_containing_and 

2

In [28]:
# IDF ratio: total number of documents / number of documents containing the term
# The numerator is 2 because there are two docs in this corpus (intro_tokens and hist_tokens)
len([intro_tokens, hist_tokens]) / num_docs_containing_and

1.0
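
To see how this rarity ratio could be used to weight TFs, here's a hedged sketch on made-up token lists (not the kite documents loaded earlier). The helper names `tf` and `idf` and the plain `tf * idf` product are assumptions for illustration.

```python
from collections import Counter

# Made-up token lists; not the kite documents used above.
toy_docs = {
    "doc_a": ["kite", "and", "wind", "kite"],
    "doc_b": ["kite", "and", "history", "and"],
}

def tf(term, tokens):
    """Term frequency: occurrences of `term` divided by the document length."""
    return Counter(tokens)[term] / len(tokens)

def idf(term, all_docs):
    """Total number of docs divided by the number of docs containing `term`."""
    containing = sum(1 for tokens in all_docs if term in tokens)
    return len(all_docs) / containing if containing else 0.0

all_token_lists = list(toy_docs.values())

# Weight each term's TF in doc_a by its rarity across the toy corpus.
weighted = {
    term: tf(term, toy_docs["doc_a"]) * idf(term, all_token_lists)
    for term in ["kite", "and", "wind"]
}
print(weighted)
```

Here 'wind' appears in only one of the two toy documents, so its TF is doubled, while 'kite' and 'and' appear in both and keep their raw TF.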