# Representing Text As Numeric Matrices

## Vectorizing 

Machine learning models operate on numbers, so unstructured text must first be converted into a numeric matrix. Each document in a corpus becomes one numeric vector (a row) in that matrix.

## Bag Of Words

One of the simplest and most common ways to represent text numerically is a document-term matrix (DTM), also known as a Bag Of Words (BoW) model. A DTM simply records how often each term occurs in each document, regardless of word order.

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
 
corpus = ['You love me', 
          'You do not love me',
          'You really really love food']

In [2]:
vectorizer = CountVectorizer()

X = vectorizer.fit_transform(corpus)
X.toarray()

array([[0, 0, 1, 1, 0, 0, 1],
       [1, 0, 1, 1, 1, 0, 1],
       [0, 1, 1, 0, 0, 2, 1]], dtype=int64)

In [3]:
df = pd.DataFrame()
df['vocabulary --->'] = vectorizer.get_feature_names()
df['doc 1 vector'] = X.toarray()[0]
df['doc 2 vector'] = X.toarray()[1]
df['doc 3 vector'] = X.toarray()[2]
df.set_index('vocabulary --->', inplace=True)
df.T

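| vocabulary ---> | do | food | love | me | not | really | you |
|---|---|---|---|---|---|---|---|
| doc 1 vector | 0 | 0 | 1 | 1 | 0 | 0 | 1 |
| doc 2 vector | 1 | 0 | 1 | 1 | 1 | 0 | 1 |
| doc 3 vector | 0 | 1 | 1 | 0 | 0 | 2 | 1 |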
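
To make the "regardless of order" point concrete, here is a minimal hand-rolled sketch of the counting step. The helper `bag_of_words` is a hypothetical name, not part of scikit-learn, and it assumes plain lowercasing and whitespace tokenization (unlike `CountVectorizer`, whose default token pattern also drops single-character tokens).

```python
from collections import Counter

def bag_of_words(docs):
    """Return (sorted vocabulary, count vectors) for lowercased, whitespace-split docs."""
    tokenized = [doc.lower().split() for doc in docs]
    # the vocabulary is every distinct token, sorted alphabetically
    vocab = sorted({tok for toks in tokenized for tok in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        # one count per vocabulary term; absent terms count as 0
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors

# same words, different order
vocab, vectors = bag_of_words(['You love me', 'me love You'])
print(vocab)    # ['love', 'me', 'you']
print(vectors)  # [[1, 1, 1], [1, 1, 1]]
```

Both sentences map to the same vector, which is exactly the information a unigram BoW model throws away.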


## TF-IDF

Perhaps the most famous and useful text vectorization method is Term Frequency - Inverse Document Frequency (TF-IDF). It has been studied extensively in the information retrieval (IR) literature. At a high level, TF-IDF balances the frequency of a term in a document against its frequency in the entire corpus, producing a score rather than a simple count for each token in a document.

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)
vector = vectorizer.fit_transform(corpus)

df = pd.DataFrame()
df['vocabulary --->'] = vectorizer.get_feature_names()
df['doc 1 scores'] = vector.toarray()[0]
df['doc 2 scores'] = vector.toarray()[1]
df['doc 3 scores'] = vector.toarray()[2]
df.set_index('vocabulary --->', inplace=True)
df.T

| vocabulary ---> | do | food | love | me | not | really | you |
|---|---|---|---|---|---|---|---|
| doc 1 scores | 0.0 | 0.0 | 0.522842 | 0.673255 | 0.0 | 0.0 | 0.522842 |
| doc 2 scores | 0.55249 | 0.0 | 0.32631 | 0.420183 | 0.55249 | 0.0 | 0.32631 |
| doc 3 scores | 0.0 | 0.41894 | 0.247433 | 0.0 | 0.0 | 0.83788 | 0.247433 |


In the second document, **do** and **not** score higher than **love**: each occurs only in that document, whereas **love** occurs in all three. In the third document, **really** gets the highest score since it appears twice and in no other document. To TF-IDF, **really** looks more informative because it is rare across the corpus yet frequent within one document.
**Food** gets half the score of **really**: it also appears only in the third document, but just once. This gives us a bit of intuition for how TF-IDF works.

**DETAILS**

When the `fit()` method is called, it creates a dictionary that stores each term in the corpus and its assigned feature index. This dictionary is the vectorizer's `.vocabulary_`.

In [5]:
vectorizer.vocabulary_

{'you': 6, 'love': 2, 'me': 3, 'do': 0, 'not': 4, 'really': 5, 'food': 1}

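The method `get_feature_names()` returns the feature names ordered by index, without the indices themselves (from scikit-learn 1.0 onward it is replaced by `get_feature_names_out()`, which returns a NumPy array rather than a list):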

In [6]:
vectorizer.get_feature_names()

['do', 'food', 'love', 'me', 'not', 'really', 'you']
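
The relationship between the two outputs can be sketched in plain Python: ordering the dictionary's keys by their assigned index recovers the feature list. The dictionary literal below is copied from the output above rather than read from a fitted vectorizer.

```python
# vocabulary_ as printed above: term -> column index
vocabulary = {'you': 6, 'love': 2, 'me': 3, 'do': 0, 'not': 4, 'really': 5, 'food': 1}

# order the terms by their column index
feature_names = sorted(vocabulary, key=vocabulary.get)
print(feature_names)  # ['do', 'food', 'love', 'me', 'not', 'really', 'you']

# each term's index is its position in the ordered list
assert all(feature_names[ix] == term for term, ix in vocabulary.items())
```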

Unlike `CountVectorizer`, `TfidfVectorizer` doesn't simply list token counts as features in a sparse matrix; rather, it assigns **weights** (or scores) using the following formula, given a term *t* and a document *d*:
<center><br>$\text{tf-idf}(t,d)=\text{tf}(t,d) \times \text{idf}(t)$</center>

This formula balances the importance of a term in a document vs its importance in the entire corpus.

**Term Frequency (TF)**

The number of times a term appears in a document.

A common word (like "the") appears with high frequency in almost any document. Such very frequent terms tend to be uninformative, especially in larger documents, so ideally we'd like to decrease the weight assigned to them. It is also common practice to filter out extremely common terms altogether. Lists of such terms, called stop words, are widely available, but they should be inspected during any text analytics project for relevancy to that particular project.
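
As a rough illustration of why a stop-word list deserves inspection, the sketch below filters the toy corpus with a small, hypothetical stop list (`STOP_WORDS` and `remove_stop_words` are made up for this example and do not come from any library).

```python
corpus = ['You love me',
          'You do not love me',
          'You really really love food']

# hypothetical stop list, for illustration only
STOP_WORDS = {'you', 'me', 'do', 'not'}

def remove_stop_words(doc, stop_words):
    """Lowercase, split on whitespace, and drop any token found in stop_words."""
    return [tok for tok in doc.lower().split() if tok not in stop_words]

filtered = [remove_stop_words(doc, STOP_WORDS) for doc in corpus]
print(filtered)  # [['love'], ['love'], ['really', 'really', 'love', 'food']]
```

With **not** treated as a stop word, the first two documents become identical even though they say opposite things, which is the kind of side effect worth checking for.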

One problem with using TF alone is that a term that is frequent in a document may also be frequent throughout the corpus and therefore uninformative, so we want to balance the high score it receives within a document with another weight based on its frequency across the entire set of documents.


**Inverse Document Frequency (IDF)**

In `sklearn` the IDF term differs from the "textbook" definition and is calculated in the following manner (see the [User Guide](https://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting) for details):
<center><br>$\text{idf}(t) = \log{\frac{1 + n}{1+\text{df}(t)}} + 1$</center>

where $n$ is the total number of documents in the document set, and $\text{df}(t)$ is the number of documents in the document set that contain term $t$. With the default `smooth_idf=True`, adding one to both the numerator and the denominator acts as if one extra document containing every term had been seen, which avoids division by 0. The final "plus one" outside the logarithm ensures that terms appearing in every document, whose logarithm is 0, are not discarded completely. Taking the $\log$ dampens the ratio, since the raw ratio alone would weight the IDF term too heavily compared to the TF term.
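
A small sketch of the IDF formula applied to the toy corpus may help here. The function names are hypothetical, the document frequencies are counted by hand from the three example documents, and `textbook_idf` is just one common textbook form, $\log(n / \text{df}(t))$, included only for comparison.

```python
import math

n = 3  # documents in the toy corpus
# number of documents containing each term, counted by hand
doc_freqs = {'do': 1, 'food': 1, 'love': 3, 'me': 2, 'not': 1, 'really': 1, 'you': 3}

def smoothed_idf(df, n):
    """IDF as in the formula above: log((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n) / (1 + df)) + 1

def textbook_idf(df, n):
    """One common textbook form, log(n / df), shown for comparison only."""
    return math.log(n / df)

for term, df in doc_freqs.items():
    print(f'{term:>6}  smoothed={smoothed_idf(df, n):.6f}  textbook={textbook_idf(df, n):.6f}')
```

Terms that occur in every document (**love**, **you**) get a weight of 1 under the smoothed formula but 0 under the comparison form, so only the latter would erase them from the TF-IDF vector.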

The resulting TF-IDF vectors are then normalized by the Euclidean norm. Perhaps the first departure from default parameters would be to try out `sublinear_tf=True` to replace $\text{tf}$ with $1 + \log(\text{tf})$.
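
To get a feel for what `sublinear_tf=True` changes, the following sketch applies the $1 + \log(\text{tf})$ replacement to a few raw counts. The helper name is hypothetical, and leaving a count of 0 at 0 is an assumption made here so that absent terms stay absent.

```python
import math

def sublinear_tf(tf):
    """Dampened term frequency: 1 + log(tf) for tf > 0, and 0 for absent terms."""
    return 1 + math.log(tf) if tf > 0 else 0.0

for tf in [0, 1, 2, 10, 100]:
    print(tf, round(sublinear_tf(tf), 6))
```

A single occurrence keeps a weight of 1, while a term repeated 100 times is weighted at roughly 5.6 instead of 100, so heavy repetition no longer dominates the vector.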
    
---
    
As an example, let's "hand-roll" the TF-IDF vector for the 3rd document and compare it to the output of `TfidfVectorizer`:

In [7]:
# our corpus
for doc in corpus:
    print(doc)

You love me
You do not love me
You really really love food


In [8]:
# our sorted vocabulary
['do', 'food', 'love', 'me', 'not', 'really', 'you']

['do', 'food', 'love', 'me', 'not', 'really', 'you']

In [9]:
import numpy as np

n = 3 # num docs
doc3_tfs = [0, 1, 1, 0, 0, 2, 1] # doc3 term freqs
term_dfs = [1, 1, 3, 2, 1, 1, 3] # term document freqs

tfidf_vec = []
for tf, doc_freq in zip(doc3_tfs, term_dfs):
    # smoothed IDF: log((1 + n) / (1 + df)) + 1
    idf = np.log((n + 1) / (doc_freq + 1)) + 1
    tfidf_vec.append(tf * idf)

# raw tf-idfs
[round(i, 6) for i in tfidf_vec] 

[0.0, 1.693147, 1.0, 0.0, 0.0, 3.386294, 1.0]

Applying L2 normalization, $v_{norm} = \frac{v}{\|v\|_2} = \frac{v}{\sqrt{v_1^2 + v_2^2 + \dots + v_n^2}}$:

In [10]:
def return_L2norm(vec):
    squares = [x**2 for x in vec]
    den = np.sqrt(np.sum(squares))
    L2norm = [x/den for x in vec]
    return L2norm

In [11]:
# L2-normalized tf-idfs
tfidf_vec_norm = return_L2norm(tfidf_vec)
[round(i, 6) for i in tfidf_vec_norm]

[0.0, 0.41894, 0.247433, 0.0, 0.0, 0.83788, 0.247433]

In [12]:
# comparing to TfidfVectorizer 
vectorizer = TfidfVectorizer(norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False) # explicit defaults
vector = vectorizer.fit_transform(corpus)
df = pd.DataFrame()
df['vocabulary --->'] = vectorizer.get_feature_names()
df['doc 3'] = vector.toarray()[2]
df.set_index('vocabulary --->', inplace=True)
df.T

| vocabulary ---> | do | food | love | me | not | really | you |
|---|---|---|---|---|---|---|---|
| doc 3 | 0.0 | 0.41894 | 0.247433 | 0.0 | 0.0 | 0.83788 | 0.247433 |
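
The same hand-rolled steps can be generalized to the whole corpus. The sketch below is a plain-Python approximation, not scikit-learn's implementation: `tfidf_matrix` is a hypothetical helper that assumes lowercasing, whitespace tokenization, smoothed IDF, L2 normalization, and non-empty documents.

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Smoothed, L2-normalized TF-IDF rows for lowercased, whitespace-split docs."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n = len(docs)
    # document frequency: how many docs contain each term at least once
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {t: math.log((1 + n) / (1 + doc_freq[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]   # tf * idf
        norm = math.sqrt(sum(x * x for x in raw))   # Euclidean length (assumes non-empty doc)
        rows.append([x / norm for x in raw])
    return vocab, rows

corpus = ['You love me',
          'You do not love me',
          'You really really love food']
vocab, rows = tfidf_matrix(corpus)
for row in rows:
    print([round(x, 6) for x in row])
```

On this toy corpus the rows should agree with the `TfidfVectorizer` scores shown earlier, up to rounding.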


## N-Grams

Unigram BoW and TF-IDF models do not take into account the order of terms. N-Gram models capture the order of N consecutive terms, such as bigrams (2 terms), trigrams (3 terms), and so forth. We can add N-Gram features to our BoW or TF-IDF models to capture local term order, which may improve accuracy.

- NB: the gain in accuracy might not be worth the trade-off in performance, since adding N-Gram features will quickly explode our feature space. Dimension reduction techniques can help keep that growth in check.

In [13]:
from nltk import ngrams

bigram_corpus = []
for doc in corpus:
    # fuse each pair of consecutive tokens into one token, e.g. ('You', 'love') -> 'Youlove'
    bigrams = [''.join(pair) for pair in ngrams(doc.split(' '), n=2)]
    bigram_corpus.append(' '.join(bigrams))

bigram_corpus

['Youlove loveme',
 'Youdo donot notlove loveme',
 'Youreally reallyreally reallylove lovefood']
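
For readers without NLTK installed, the sliding-window idea behind n-grams can be sketched in a few lines of plain Python. `make_ngrams` is a hypothetical helper written for this example, not an NLTK function.

```python
def make_ngrams(tokens, n=2):
    """Return the n-token tuples found by sliding a window of size n over tokens."""
    # zip the token list against copies of itself shifted by 1, 2, ..., n-1 positions
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = 'You really really love food'.split(' ')
print(make_ngrams(tokens, n=2))
# [('You', 'really'), ('really', 'really'), ('really', 'love'), ('love', 'food')]
print(make_ngrams(tokens, n=3))
# [('You', 'really', 'really'), ('really', 'really', 'love'), ('really', 'love', 'food')]
```

A document with k tokens yields k - n + 1 n-grams, so inputs shorter than n produce an empty list.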

**BoW Bigram features**

In [14]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(bigram_corpus)

df = pd.DataFrame()
df['vocabulary --->'] = vectorizer.get_feature_names()
df['doc 1 vector'] = X.toarray()[0]
df['doc 2 vector'] = X.toarray()[1]
df['doc 3 vector'] = X.toarray()[2]
df.set_index('vocabulary --->', inplace=True)
df.T

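| vocabulary ---> | donot | lovefood | loveme | notlove | reallylove | reallyreally | youdo | youlove | youreally |
|---|---|---|---|---|---|---|---|---|---|
| doc 1 vector | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| doc 2 vector | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 |
| doc 3 vector | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 |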
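
As an alternative to fusing tokens by hand, scikit-learn's vectorizers can build n-grams themselves through the `ngram_range` parameter. The sketch below assumes the same three-document corpus and keeps unigrams alongside bigrams; note that the resulting bigram features are space-joined (`'you love'`) rather than fused (`'youlove'`).

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['You love me',
          'You do not love me',
          'You really really love food']

# ngram_range=(1, 2) keeps the unigrams and adds bigrams as extra features
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(corpus)

print(X.shape)                         # (3, 16): 7 unigrams + 9 bigrams
print(sorted(vectorizer.vocabulary_))  # feature names in column order
```

Even on three short sentences the feature count more than doubles, which gives a sense of how quickly the feature space grows on a real corpus.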


**TF-IDF Bigram features**

In [15]:
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(bigram_corpus)

df = pd.DataFrame()
df['vocabulary --->'] = vectorizer.get_feature_names()
df['doc 1 vector'] = X.toarray()[0]
df['doc 2 vector'] = X.toarray()[1]
df['doc 3 vector'] = X.toarray()[2]
df.set_index('vocabulary --->', inplace=True)
df.T

| vocabulary ---> | donot | lovefood | loveme | notlove | reallylove | reallyreally | youdo | youlove | youreally |
|---|---|---|---|---|---|---|---|---|---|
| doc 1 vector | 0.0 | 0.0 | 0.605349 | 0.0 | 0.0 | 0.0 | 0.0 | 0.795961 | 0.0 |
| doc 2 vector | 0.528635 | 0.0 | 0.40204 | 0.528635 | 0.0 | 0.0 | 0.528635 | 0.0 | 0.0 |
| doc 3 vector | 0.0 | 0.5 | 0.0 | 0.0 | 0.5 | 0.5 | 0.0 | 0.0 | 0.5 |


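Notice how **reallyreally** and **reallylove** (0.5 each) score lower in the TF-IDF bigram feature space than **youlove** and **loveme** in the first document, and even **donot** and **youdo** in the second. In the unigram model **really** dominated the third document; with bigrams, its weight is spread evenly across four features that each occur once and only in that document. In this sense, the TF-IDF bigrams capture more of the semantic essence of the documents than the unigrams did.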

```
TODO: Word Embedding

```

---