## Overview

Building on simple keyword retrieval from tokenization, it is often more useful to understand which words are more 'important' to a particular document and to the corpus as a whole.
<br>
From this, we can use the 'importance' value to find relevant documents in a corpus based on keyword importance within each document.
* Measure positivity relative to tokens across a corpus - Knowing how frequently words appear in a document *in relation* to the rest of the documents lets you further refine the 'positivity' of the document. This is a less binary measure of words and their usage within a document; common use cases for generating features from natural language this way include search engines and spam filters
* Convert tokens into continuous numbers (as opposed to integers representing word counts or binary vectors representing the presence/absence of words) - Representing (transforming) words in a continuous form makes them easier to work with mathematically, with the goal of finding a numerical representation that captures the importance/information content of the words


There are three powerful ways to represent words and their importance in a document:
<br>
<br>
1) Bag of words (BOW) - Vectors of word counts/frequencies
<br>
2) Bag of n-grams - Counts of word pairs (bigrams), triplets (trigrams) etc.
<br>
3) Term Frequency Inverse Document Frequency (TF-IDF) vectors - Word scores that better represent their importance; the technique here is to aggregate the word counts and *divide* each by the number of documents in which the word occurs
<br>
<br>
These techniques are all statistical models in that they are frequency based.
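
As a rough sketch of how the three representations differ, the snippet below builds each one for a tiny corpus. The corpus is made up for illustration, `str.split()` stands in for a real tokenizer, and the third score follows the simplified description above (count divided by document frequency); TF-IDF as commonly defined usually applies a logarithm to the inverse document frequency as well.

```python
from collections import Counter

# Hypothetical toy corpus; str.split() stands in for a real tokenizer.
docs = [
    "the kite flew over the hill",
    "the wing gives the kite lift",
    "the store was closed",
]
tokenized = [doc.split() for doc in docs]

# 1) Bag of words: token counts per document.
bows = [Counter(tokens) for tokens in tokenized]

# 2) Bag of n-grams (here bigrams): counts of adjacent token pairs.
bigram_bags = [Counter(zip(tokens, tokens[1:])) for tokens in tokenized]

# 3) Simplified TF-IDF-style score (assumption: no log scaling):
#    each word count divided by the number of documents containing the word.
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
scores = [
    {word: count / doc_freq[word] for word, count in bow.items()}
    for bow in bows
]

print(bows[0]["the"])                       # count of 'the' in the first document
print(bigram_bags[0][("the", "kite")])      # count of the bigram ('the', 'kite')
print(scores[0]["the"], scores[0]["hill"])  # common word vs. word unique to one document
```

Even though 'the' is the most frequent token in the first document, its score drops below that of 'hill' because it appears in every document.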

### Bag of Words (BOW)

A common use case for a bag of words is counting the occurrences of words in a document.

In [4]:
from nltk.tokenize import TreebankWordTokenizer
sentence = "The faster Shuaib got to the store, the faster Shuaib would be able to return home"

In [5]:
tokenizer = TreebankWordTokenizer()
tokens = tokenizer.tokenize(sentence.lower())

In [6]:
tokens

['the',
 'faster',
 'shuaib',
 'got',
 'to',
 'the',
 'store',
 ',',
 'the',
 'faster',
 'shuaib',
 'would',
 'be',
 'able',
 'to',
 'return',
 'home']

In [7]:
from collections import Counter
bow = Counter(tokens)
bow # Distinct word counts - dictionary like format with no inherent order 

Counter({'the': 3,
         'faster': 2,
         'shuaib': 2,
         'got': 1,
         'to': 2,
         'store': 1,
         ',': 1,
         'would': 1,
         'be': 1,
         'able': 1,
         'return': 1,
         'home': 1})

In [8]:
type(bow)

collections.Counter

A `collections.Counter` object represents a bag, also called a multiset, which is conceptually an unordered collection. A Counter can be displayed in a seemingly reasonable order, like lexical order or the order in which tokens appeared in your statement. Since Python 3.7, a Counter (like a standard Python `dict`) preserves insertion order, but it is safer not to rely on the order of your tokens (keys) when treating it as a bag.
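
A small sketch of the bag (multiset) behaviour, using made-up token lists: two Counters built from the same tokens in a different order compare equal, and an explicit sort gives a stable order when one is needed.

```python
from collections import Counter

a = Counter(["the", "faster", "the"])
b = Counter(["faster", "the", "the"])

# Equality compares counts only, not the order in which tokens were seen.
print(a == b)  # True

# For a stable, explicit order, sort the items yourself.
print(sorted(a.items()))  # [('faster', 1), ('the', 2)]
```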

The BOW contains relevant information about the original intent of the sentence.
<br>
The information in a BOW can support meaningful computations such as detecting spam and computing sentiment or subtle intent (sarcasm).
<br>
Using the `most_common` method, we can list the most frequent words, ranked in descending order of count.

In [10]:
# 4 most frequent words 
bow.most_common(4)

[('the', 3), ('faster', 2), ('shuaib', 2), ('to', 2)]

*Term Frequency (TF)* - The number of times a word occurs in a given document.
* Normalized TF is where the count of word occurrences is divided by the number of terms in the document

Since this is a rough BOW process, ignore the stop words/prepositions that carry little meaning and consider 'faster' and 'shuaib' for now.

In [14]:
shuaib_appears = bow['shuaib']
lexicon = len(bow) # number of unique tokens from original document/sentence

In [19]:
tf = shuaib_appears/lexicon
round(tf, 3) # round to 3 dp

0.167
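
The choice of normalizer matters. The cell above divides by the number of unique tokens; another common choice is the total number of tokens in the document. The sketch below compares the two on the same sentence. The helper `term_frequency` is a hypothetical name introduced for illustration, and the token list is copied from the tokenizer output shown earlier.

```python
from collections import Counter

# Tokens copied from the tokenizer output above.
tokens = ['the', 'faster', 'shuaib', 'got', 'to', 'the', 'store', ',',
          'the', 'faster', 'shuaib', 'would', 'be', 'able', 'to',
          'return', 'home']
bow = Counter(tokens)

def term_frequency(term, bow, denominator):
    """Count of `term` divided by a chosen normalizer (hypothetical helper)."""
    return bow[term] / denominator

# Normalizer used above: number of unique tokens (12).
tf_unique = term_frequency('shuaib', bow, len(bow))
# Alternative normalizer: total number of tokens in the document (17).
tf_total = term_frequency('shuaib', bow, sum(bow.values()))

print(round(tf_unique, 3))  # 0.167
print(round(tf_total, 3))   # 0.118
```

Whichever normalizer is chosen, it should be applied consistently across documents so that the resulting frequencies stay comparable.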

Raw word counts can be less useful when we want to understand the relevance of a keyword relative to the document/corpus.
<br>
Normalized term frequencies help us understand the relationship between a specific term and a certain document.

In [21]:
from nlpia.data.loaders import kite_text 
tokenizer = TreebankWordTokenizer()
tokens = tokenizer.tokenize(kite_text.lower())

`kite_text` comprises paragraphs from the Wikipedia article on 'kites'.

In [23]:
token_counts = Counter(tokens)
token_counts.most_common(5)

[('the', 26), ('a', 20), ('kite', 16), (',', 15), ('and', 10)]

It's highly unlikely the article is about stopwords like 'the' and 'and', so we can remove them.

In [31]:
import nltk
nltk.download('stopwords', quiet=True)
stopwords = nltk.corpus.stopwords.words('english')

In [32]:
tokens = [x for x in tokens if x not in stopwords]

In [33]:
kite_counts = Counter(tokens)

In [36]:
# 8 most common tokens in the corpus
kite_counts.most_common(8)

[('kite', 16),
 (',', 15),
 ('kites', 8),
 ('wing', 5),
 ('lift', 4),
 ('may', 4),
 ('also', 3),
 ('kiting', 3)]
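
Note that the comma still ranks second, because the NLTK stopword list covers words rather than punctuation. One possible way to drop punctuation-only tokens is sketched below with the standard library's `string.punctuation`; the `sample_tokens` list is a made-up stand-in for the real token list.

```python
import string
from collections import Counter

# Hypothetical token sample; the real list comes from the tokenizer above.
sample_tokens = ['kite', ',', 'kites', ',', 'wing', '.', 'kite', 'lift']

# Keep tokens that contain at least one non-punctuation character.
punctuation = set(string.punctuation)
words_only = [t for t in sample_tokens
              if not all(ch in punctuation for ch in t)]

print(Counter(words_only).most_common(1))  # [('kite', 2)]
```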

We can quickly see that the terms 'kite(s)', 'wing' and 'lift' are of some importance to this document, which also lets us make a decent inference about its general topic.
<br>
These term frequencies could also be applied across multiple documents in a (kite) corpus, where the terms 'wing' and 'lift' would rank highly in most documents because they are referenced so often.
<br>
To show this neatly in mathematical terms, we'll have to apply **vectorization**.