## TF-IDF Tutorial

### Vectorizing 

We need a technique for converting text into numerical vectors that machine learning models can consume.


**Bag Of Words**

One of the simplest and most common ways to perform this conversion is Bag of Words (BoW): it records how often each term appears in each document of a corpus, regardless of word order. Example:

In [117]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
 
corpus = ['All crows are black', 
          'The bird in my cage is black',
          'The bird is a crow']

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

df = pd.DataFrame()
df['vocabulary --->'] = vectorizer.get_feature_names_out()  # requires scikit-learn >= 1.0
df['doc 1 vector'] = X.toarray()[0]
df['doc 2 vector'] = X.toarray()[1]
df['doc 3 vector'] = X.toarray()[2]
df.set_index('vocabulary --->', inplace=True)
df.T

vocabulary --->,all,are,bird,black,cage,crow,crows,in,is,my,the
doc 1 vector,1,1,0,1,0,0,1,0,0,0,0
doc 2 vector,0,0,1,1,1,0,0,1,1,1,1
doc 3 vector,0,0,1,0,0,1,0,0,1,0,1


**N-Grams**

A slight refinement of BoW, N-grams take N consecutive terms into account and build the same kind of feature vector from bi-grams (N=2), tri-grams (N=3), and so forth. Each N-gram constitutes a feature, just like the unigrams (single terms) in BoW. Example:

In [123]:
from nltk import ngrams

# build one string of fused bigrams per document,
# e.g. 'Allcrows crowsare areblack' for the first document
bigram_corpus = []
for doc in corpus:
    # fuse each pair of consecutive words into a single token so that
    # CountVectorizer treats every bigram as one feature
    bigrams = [''.join(pair) for pair in ngrams(doc.split(' '), n=2)]
    bigram_corpus.append(' '.join(bigrams))

In [124]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(bigram_corpus)

df = pd.DataFrame()
df['vocabulary --->'] = vectorizer.get_feature_names_out()  # requires scikit-learn >= 1.0
df['doc 1 vector'] = X.toarray()[0]
df['doc 2 vector'] = X.toarray()[1]
df['doc 3 vector'] = X.toarray()[2]
df.set_index('vocabulary --->', inplace=True)
df.T

vocabulary --->,acrow,allcrows,areblack,birdin,birdis,cageis,crowsare,inmy,isa,isblack,mycage,thebird
doc 1 vector,0,1,1,0,0,0,1,0,0,0,0,0
doc 2 vector,0,0,0,1,0,1,0,1,0,1,1,1
doc 3 vector,1,0,0,0,1,0,0,0,1,0,0,1
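
For comparison, `CountVectorizer` can also build n-grams itself through its `ngram_range` parameter. The sketch below is an alternative to the manual NLTK approach, not a reproduction of it: with the default tokenizer, single-character tokens such as "a" are dropped and each bigram is kept as a space-separated pair, so the vocabulary differs slightly from the table above. The names `bigram_vectorizer`, `X_bigram` and `bigram_features` are introduced here for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['All crows are black',
          'The bird in my cage is black',
          'The bird is a crow']

# ngram_range=(2, 2) keeps bigrams only; (1, 2) would keep unigrams and bigrams
bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))
X_bigram = bigram_vectorizer.fit_transform(corpus)

# vocabulary_ maps each bigram to its column index;
# the columns are in alphabetical order, so sorting the keys recovers that order
bigram_features = sorted(bigram_vectorizer.vocabulary_)
print(bigram_features)
print(X_bigram.toarray())
```

Because "a" is filtered out before the bigrams are formed, the third document yields "is crow" rather than "is a" and "a crow".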


**TF-IDF** 

OVERVIEW

Perhaps the best-known and most useful text vectorization method is Term Frequency - Inverse Document Frequency (TF-IDF). At a high level, TF-IDF weighs how often a term occurs in a document against how common that term is across the entire corpus.

Here is the same example with unigrams (TF-IDF can also take n-grams as input), compared with the BoW vectorizer:

In [126]:
from sklearn.feature_extraction.text import TfidfVectorizer

bow_vectorizer = CountVectorizer()
tfidf_vectorizer = TfidfVectorizer()

bow_vec = bow_vectorizer.fit_transform(corpus)
tfidf_vec = tfidf_vectorizer.fit_transform(corpus)

df = pd.DataFrame()
df['vocabulary --->'] = bow_vectorizer.get_feature_names_out()  # requires scikit-learn >= 1.0
df['doc 1 bow'] = bow_vec.toarray()[0]
df['doc 2 bow'] = bow_vec.toarray()[1]
df['doc 3 bow'] = bow_vec.toarray()[2]
df['doc 1 tfidf'] = tfidf_vec.toarray()[0]
df['doc 2 tfidf'] = tfidf_vec.toarray()[1]
df['doc 3 tfidf'] = tfidf_vec.toarray()[2]
df.set_index('vocabulary --->', inplace=True)
df.T

vocabulary --->,all,are,bird,black,cage,crow,crows,in,is,my,the
doc 1 bow,1.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
doc 2 bow,0.0,0.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0
doc 3 bow,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0
doc 1 tfidf,0.528635,0.528635,0.0,0.40204,0.0,0.0,0.528635,0.0,0.0,0.0,0.0
doc 2 tfidf,0.0,0.0,0.329928,0.329928,0.433816,0.0,0.0,0.433816,0.329928,0.433816,0.329928
doc 3 tfidf,0.0,0.0,0.459854,0.0,0.0,0.604652,0.0,0.0,0.459854,0.0,0.459854


DETAILS

When `fit()` is called, the vectorizer builds a dictionary that maps each term in the corpus to its assigned feature index. This dictionary is stored as the vectorizer's `.vocabulary_` attribute.



In [127]:
tfidf_vectorizer.vocabulary_

{'all': 0,
 'crows': 6,
 'are': 1,
 'black': 3,
 'the': 10,
 'bird': 2,
 'in': 7,
 'my': 9,
 'cage': 4,
 'is': 8,
 'crow': 5}
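
To make the mapping concrete, here is a rough pure-Python sketch of how such a vocabulary can be built. It is a simplification under stated assumptions (lowercasing, tokens of two or more word characters, alphabetical ordering), not scikit-learn's actual implementation; the names `terms` and `vocabulary` are illustrative.

```python
import re

corpus = ['All crows are black',
          'The bird in my cage is black',
          'The bird is a crow']

# collect the distinct lowercase tokens (two or more word characters,
# which is why the single-letter word "a" never becomes a feature)
terms = set()
for doc in corpus:
    terms.update(re.findall(r'\b\w\w+\b', doc.lower()))

# sort alphabetically and assign consecutive feature indices
vocabulary = {term: index for index, term in enumerate(sorted(terms))}
print(vocabulary)
```

Each index is the column that the term occupies in the document-term matrix.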

Unlike `CountVectorizer`, `TfidfVectorizer` doesn't simply store the raw count of each term in a sparse matrix; rather, it assigns **weights** based on a simple multiplication, $TF \times IDF$, which balances a term's importance within a document against its prevalence across the corpus.

**Term Frequency (TF)**

The count of a term in a document.

If a word is common (like "the"), it appears with high frequency. [Zipf's law](https://en.wikipedia.org/wiki/Zipf's_law) tells us that a small number of very frequent terms account for a large share of all word occurrences; such terms carry little information, and the so-called "stop words" among them are often filtered out. Ideally, we'd like to decrease the weight assigned to these frequent terms.

One problem with using TF alone is that it only looks inside a single document: a term that is frequent in one document may be just as frequent in every other document, and therefore uninformative in the context of the entire corpus. So we balance the weight a term receives within a document against a second weight based on how many documents in the corpus contain it.


**Inverse Document Frequency (IDF)**

The IDF of a term $t$ is calculated thus:

$$IDF(t) = \log\left(\frac{N}{DF(t)}\right)$$

where $N$ is the number of documents in the corpus and $DF(t)$ is the number of documents in which the term $t$ appears.
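
The TF and IDF definitions above can be combined by hand. The sketch below is a pure-Python illustration rather than scikit-learn's code. It assumes the defaults of `TfidfVectorizer`, which uses a smoothed IDF, $\ln\left(\frac{1 + N}{1 + DF(t)}\right) + 1$, and scales each document vector to unit L2 norm. The helper names (`doc_freq`, `idf_plain`, `idf_smooth`, `tfidf_row`) are made up for this example.

```python
import math
import re
from collections import Counter

corpus = ['All crows are black',
          'The bird in my cage is black',
          'The bird is a crow']

# lowercase and keep tokens of two or more word characters
docs = [re.findall(r'\b\w\w+\b', doc.lower()) for doc in corpus]
n_docs = len(docs)

# document frequency: number of documents that contain each term
doc_freq = Counter(term for doc in docs for term in set(doc))

def idf_plain(term):
    """Textbook IDF: log(N / DF(t))."""
    return math.log(n_docs / doc_freq[term])

def idf_smooth(term):
    """Smoothed IDF: ln((1 + N) / (1 + DF(t))) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf_row(doc):
    """TF * smoothed IDF for one document, scaled to unit L2 norm."""
    tf = Counter(doc)
    raw = {term: count * idf_smooth(term) for term, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}

doc1 = tfidf_row(docs[0])
print(round(doc1['all'], 6), round(doc1['black'], 6))
```

The two printed weights agree with the "doc 1 tfidf" row in the comparison table: "black" appears in two documents, so it receives a lower weight than "all", which appears in only one.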


#### Random Sample

For the tutorial we'll use a random sample of the 1.6M-row dataset.

In [81]:
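from sklearn.utils import random  # assumed source of sample_without_replacement

def random_sample(df):
    """
    Sample 1% of the rows without replacement.
    """
    # draw row positions; the fixed random_state makes the sample reproducible
    ix = random.sample_without_replacement(n_population=len(df),
                                           n_samples=round(len(df) / 100),
                                           random_state=42)
    # positional indexing, so the result does not depend on the index labels
    out = df.iloc[ix]
    return out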

In [126]:
# ensure equal amounts

# divide into negatives and positives
df0 = dfm[dfm['target'] == 0].copy()
df1 = dfm[dfm['target'] == 1].copy()
df1.index = range(0, len(df1))

# sample 1% from each and concatenate
df0_sample = random_sample(df0)
df1_sample = random_sample(df1)
df_sample = pd.concat([df0_sample, df1_sample])
df_sample.index = range(0, len(df_sample))

# counts grouped by target
df_sample[['target', 'text']].groupby('target').count()

target,text
0,7983
1,7979
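
As an aside, the same per-class sampling can be sketched with pandas alone. This is an alternative under stated assumptions, not the approach used above: it relies on `DataFrameGroupBy.sample` (available from pandas 1.1), and `dfm_demo` is a small made-up stand-in for the tutorial's `dfm`.

```python
import pandas as pd

# hypothetical stand-in for the tutorial's `dfm`: 1,000 rows per class
dfm_demo = pd.DataFrame({
    'target': [0] * 1000 + [1] * 1000,
    'text': [f'tweet {i}' for i in range(2000)],
})

# draw 1% of each class without replacement, then renumber the rows
df_sample_demo = (dfm_demo
                  .groupby('target')
                  .sample(frac=0.01, random_state=42)
                  .reset_index(drop=True))

print(df_sample_demo['target'].value_counts().sort_index())
```

Sampling within each group keeps the class proportions of the full dataset, so no separate split and concatenation step is needed.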


#### TF-IDF vector

In [140]:
# instantiate TF-IDF vectorizer
vectorizer = TfidfVectorizer(sublinear_tf=True)
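
With `sublinear_tf=True`, the raw term count is replaced by `1 + log(tf)`, so repeated occurrences of a term add progressively less weight. The small sketch below only illustrates that scaling; the function `sublinear_tf` is a hypothetical helper, not part of scikit-learn.

```python
import math

def sublinear_tf(count):
    """Dampened term frequency: 1 + ln(count) for count > 0, else 0."""
    return 1.0 + math.log(count) if count > 0 else 0.0

# a term that occurs 100 times weighs about 5.6 times as much as one
# that occurs once, rather than 100 times as much
for count in (1, 2, 10, 100):
    print(count, round(sublinear_tf(count), 3))
```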

`y` is just an array holding the target variable.

In [127]:
import numpy as np

y = np.array(df_sample.iloc[:, 0]).ravel()
y

array([0, 0, 0, ..., 1, 1, 1], dtype=int64)

In [89]:
# using tokenized feature 
X = vectorizer.fit_transform(np.array(df_sample.iloc[:, 2]).ravel())

In [128]:
# converting to dense format so we can visualize data
col = vectorizer.get_feature_names_out()  # requires scikit-learn >= 1.0
temp = pd.DataFrame(X.toarray(), columns=col)
temp.shape

(15962, 19580)
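
A dense matrix of this shape is large. The back-of-the-envelope sketch below estimates its memory footprint, assuming 8-byte `float64` values (the default dtype of `TfidfVectorizer`).

```python
n_rows, n_cols = 15962, 19580   # shape reported above
bytes_per_value = 8             # assumes float64

# every cell is stored in a dense array, including the zeros
dense_bytes = n_rows * n_cols * bytes_per_value
print(f'{dense_bytes / 1e9:.2f} GB')
```

When only a handful of rows need to be inspected, slicing the sparse matrix before densifying it (for example `X[:5].toarray()`) avoids materializing the full array.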