## Topic modelling 

Expanding on document vectors: as mentioned earlier, word counts (raw or normalized by the length of the lexicon or document) don't tell us much about the importance of a word in a document *relative* to the rest of the documents in the corpus.
<br>
Solving this problem would let us start to describe the documents within the corpus.
<br>
Take a corpus made up of every kite book ever written: the word 'kite' will appear very frequently in every book (document) we count, so it gives us no useful information because it cannot distinguish between those documents.
<br>
Related words like 'aerodynamics' or 'wind' may not be common across the entire corpus, but in the documents where they do occur frequently, they tell us more about each document's nature. To accomplish this we need another tool.
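
To make the kite example concrete, here is a minimal sketch that counts in how many documents each word appears. The three one-line "books" and the names `docs` and `doc_freq` are made up for illustration, and plain whitespace splitting stands in for a real tokenizer.

```python
from collections import Counter

# Hypothetical three-document corpus; the texts are invented for illustration.
docs = {
    "book_a": "the kite flew high because the wind was strong",
    "book_b": "kite aerodynamics explains how a kite generates lift",
    "book_c": "children love to fly a kite at the beach",
}

# Document frequency: the number of documents each word appears in.
# Using set() means a word is counted at most once per document.
doc_freq = Counter()
for text in docs.values():
    doc_freq.update(set(text.split()))

print(doc_freq["kite"])          # present in every document
print(doc_freq["aerodynamics"])  # present in only one document
```

A word found in every document, like 'kite' here, cannot separate one book from another, whereas a word found in only one of them can.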

**Inverse Document Frequency (IDF)** - Allows us to perform topic analysis in line with *Zipf's law*
<br>
A quick overview of this law, taken from this [wiki](https://en.wikipedia.org/wiki/Zipf%27s_law):
> Zipf's law states that given some corpus of natural language utterances, the frequency of any word is ***inversely proportional*** to its rank in the frequency table.
* In summary, if we rank the words of a corpus by number of occurrences and list them in descending order, then for a sufficiently large sample of documents the first word in the ranked list is roughly twice as likely to occur as the second word, and three times as likely to occur as the third
* Given a large corpus, we can use this heuristic to estimate statistically how likely a certain word is to appear in any given document of that corpus
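
The inverse relationship between rank and frequency can be sketched in a few lines. This assumes an idealized Zipf distribution in which the count at rank `r` is exactly the top count divided by `r`. The helper `zipf_expected_counts` and the top count of 6000 are hypothetical, and real corpora only follow this pattern approximately.

```python
def zipf_expected_counts(top_count, n_ranks):
    """Expected counts for ranks 1..n_ranks under an idealized Zipf's law."""
    return [top_count / rank for rank in range(1, n_ranks + 1)]

# Assume the most frequent word occurs 6000 times (an invented figure).
expected = zipf_expected_counts(top_count=6000, n_ranks=4)
print(expected)  # [6000.0, 3000.0, 2000.0, 1500.0]

# The first word is twice as frequent as the second, three times the third.
print(expected[0] / expected[1])  # 2.0
print(expected[0] / expected[2])  # 3.0
```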

With a term frequency counter, we can count tokens and bin them in two ways:
<br>
1) Per document
<br>
2) Across the entire corpus
<br>
<br>
For now, we'll just focus on 1).
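
As a rough sketch of the two binning options, the snippet below uses `collections.Counter` on two invented one-line documents. The texts and the names `docs`, `per_doc`, and `corpus_counts` are assumptions for illustration, and whitespace splitting stands in for a proper tokenizer.

```python
from collections import Counter

# Hypothetical two-document corpus; the texts are invented for illustration.
docs = {
    "intro": "a kite is a tethered craft",
    "history": "the kite was invented long ago",
}

# 1) Per document: one Counter for each document.
per_doc = {name: Counter(text.split()) for name, text in docs.items()}

# 2) Across the entire corpus: add up the per-document counters.
corpus_counts = Counter()
for counts in per_doc.values():
    corpus_counts.update(counts)

print(per_doc["intro"]["a"])   # 2
print(corpus_counts["kite"])   # 2
```

Keeping the per-document counters around means the corpus-wide totals can always be derived from them later, which is one reason to start with option 1).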


Sticking with the kite corpus example, we'll retrieve the total word count for each document in our corpus (intro_doc and history_doc)

In [1]:
from nltk.tokenize import TreebankWordTokenizer
from nlpia.data.loaders import kite_text, kite_history  # kite intro and kite history texts, respectively
tokenizer = TreebankWordTokenizer()

# Lowercase the intro text, then split it into tokens
kite_intro = kite_text.lower()
intro_tokens = tokenizer.tokenize(kite_intro)

In [2]:
len(intro_tokens)

363