## Import

In [1]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

By default, punctuation is stripped out and only tokens of at least 2 alphanumeric characters are counted (via the default regex set by the token_pattern option), so a single-letter word such as "I" is dropped while "am" is kept. All words are first lower-cased (so "Yes" and "yes" are the same word) via the lowercase option.
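The tokenization rule can be checked without scikit-learn. The sketch below assumes the default pattern is `(?u)\b\w\w+\b`, as given in the scikit-learn documentation; the helper name `tokenize` is made up for illustration and is not part of the library.

```python
import re

# Assumed default token_pattern, as documented for CountVectorizer and
# TfidfVectorizer: runs of two or more word characters.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"

def tokenize(text):
    """Hypothetical helper: lower-case the text, then extract tokens."""
    return re.findall(DEFAULT_TOKEN_PATTERN, text.lower())

tokens = tokenize("Yes, I am happy!")
print(tokens)  # ['yes', 'am', 'happy']
```

The comma and exclamation mark disappear because they are not word characters, and "I" is dropped because it is a single character.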

## CountVectorizer

We first initialize a CountVectorizer object, optionally changing the default options.

In [2]:
vectorizer = CountVectorizer()

We then use fit_transform to learn the vocabulary and obtain bag-of-words (word count) vectors for the sentences in our corpus.

In [3]:
bow_vectors = vectorizer.fit_transform(["I am happy", "Yes I am"])

The result is a sparse matrix with one row per sentence. In a more realistic corpus the vocabulary would contain far more than 3 words, and most of them would be absent from any given sentence, so storing only the nonzero counts saves memory. We can see the counts explicitly by converting this matrix to a dense matrix:

In [4]:
bow_vectors.todense()

matrix([[1, 1, 0],
        [1, 0, 1]])

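We can see which words these features represent using the get_feature_names method (deprecated in scikit-learn 1.0 and removed in 1.2 in favor of get_feature_names_out, which returns a NumPy array instead of a list).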

In [5]:
vectorizer.get_feature_names()

['am', 'happy', 'yes']

Finally, we can use the fitted vectorizer to transform the sentences of a new corpus into word count vectors.

In [6]:
vectorizer.transform(["Yes, I am the very model of a happy major general", 
                      "I am the very model of a modern major general",
                      "Happy happy happy",
                      "Yes, happy"]).todense()

matrix([[1, 1, 1],
        [1, 0, 0],
        [0, 3, 0],
        [0, 1, 1]])

We see that the transform method counts the occurrences of "am", "happy", and "yes" in each sentence, ignoring capitalization and punctuation. Words that were not in the corpus passed to fit_transform are ignored.
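The fit and transform steps above can be imitated in plain Python, which makes the fixed vocabulary explicit. This is a minimal sketch, not scikit-learn's implementation: the function names `fit_vocabulary` and `count_vectors` are hypothetical, the token pattern is the documented default, and the output is a dense list of lists rather than a sparse matrix.

```python
import re
from collections import Counter

TOKEN_PATTERN = r"(?u)\b\w\w+\b"  # assumed scikit-learn default

def tokenize(text):
    return re.findall(TOKEN_PATTERN, text.lower())

def fit_vocabulary(corpus):
    """Sorted list of the distinct tokens seen in the training corpus."""
    return sorted({tok for doc in corpus for tok in tokenize(doc)})

def count_vectors(corpus, vocabulary):
    """One row of counts per document; tokens outside the vocabulary are skipped."""
    rows = []
    for doc in corpus:
        counts = Counter(tokenize(doc))
        rows.append([counts[word] for word in vocabulary])
    return rows

vocabulary = fit_vocabulary(["I am happy", "Yes I am"])
new_counts = count_vectors(
    ["Yes, I am the very model of a happy major general",
     "I am the very model of a modern major general",
     "Happy happy happy",
     "Yes, happy"],
    vocabulary,
)
print(vocabulary)  # ['am', 'happy', 'yes']
print(new_counts)  # [[1, 1, 1], [1, 0, 0], [0, 3, 0], [0, 1, 1]]
```

Words such as "model" or "general" never get a column because the vocabulary is frozen after fitting.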

## TfidfVectorizer

We now perform the analogous procedure with the TF-IDF vectorizer.

In [8]:
tf_vectorizer = TfidfVectorizer()
tf_bow_vectors = tf_vectorizer.fit_transform(["I am happy", "Yes I am"])
tf_bow_vectors.todense()

matrix([[0.57973867, 0.81480247, 0.        ],
        [0.57973867, 0.        , 0.81480247]])

Following the documentation (https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction), each TF-IDF value in a row is computed as  
(1) the number of times the word appears in the given string,  
(2) multiplied by the IDF weight, log(1 + the number of rows) - log(1 + the number of rows with the given word) + 1, using the natural logarithm,  
(3) with the values in each row then normalized to a Euclidean length of 1.
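The three steps can be reproduced by hand from the count matrix. The sketch below uses only the standard library and assumes the default TfidfVectorizer settings (smoothed IDF, L2 normalization); the variable names are illustrative.

```python
import math

# Word counts for the training corpus; columns are 'am', 'happy', 'yes'.
counts = [[1, 1, 0],   # "I am happy"
          [1, 0, 1]]   # "Yes I am"

n_docs = len(counts)
n_terms = len(counts[0])

# Number of rows in which each word appears at least once.
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Step (2): smoothed IDF weight for each word.
idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]

tfidf = []
for row in counts:
    weighted = [tf * w for tf, w in zip(row, idf)]   # steps (1) and (2)
    norm = math.sqrt(sum(x * x for x in weighted))   # step (3)
    tfidf.append([x / norm for x in weighted])

for row in tfidf:
    print([round(x, 8) for x in row])
```

The printed rows should agree with the matrix returned by fit_transform up to floating-point rounding.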

In [20]:
for row in tf_bow_vectors.todense().tolist():
    total = 0.0
    for x in row:
        total += x**2
    print(total)

0.9999999999999999
0.9999999999999999


In [21]:
bow_vectors.todense()

matrix([[1, 1, 0],
        [1, 0, 1]])

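The first word ("am") appears in both sentences, so its IDF value is 1 + log(3) - log(3) = 1.  
Each of the other words appears in only one sentence, so its IDF value is 1 + log(3) - log(2) $\approx$ 1.41. The ratio of the two entries in each row should therefore also be about 1.41.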

In [23]:
import numpy as np
1 - np.log(2) + np.log(3), 0.81480247 / 0.57973867

(1.4054651081081646, 1.4054651037854693)

In [25]:
tf_vectorizer.transform(["Yes, I am the very model of a happy major general", 
                      "I am the very model of a modern major general",
                      "Happy happy happy",
                      "Yes, happy"]).todense()

matrix([[0.44943642, 0.6316672 , 0.6316672 ],
        [1.        , 0.        , 0.        ],
        [0.        , 1.        , 0.        ],
        [0.        , 0.70710678, 0.70710678]])
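These rows can be checked against the IDF weights learned from the two training sentences. The sketch below assumes that transform reuses those weights unchanged and only re-normalizes each new row; the helper name `l2_normalize` is made up for illustration.

```python
import math

# IDF weights from the training corpus, in column order 'am', 'happy', 'yes'.
idf = [1.0, 1 + math.log(3 / 2), 1 + math.log(3 / 2)]

# Word counts of the four new sentences over the same vocabulary.
new_counts = [[1, 1, 1],
              [1, 0, 0],
              [0, 3, 0],
              [0, 1, 1]]

def l2_normalize(vec):
    """Scale a vector to Euclidean length 1 (all-zero vectors are left as-is)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

new_tfidf = [l2_normalize([tf * w for tf, w in zip(row, idf)])
             for row in new_counts]

for row in new_tfidf:
    print([round(x, 8) for x in row])
```

Note that "Happy happy happy" maps to a weight of exactly 1 for "happy": when a row has a single nonzero entry, normalization removes any trace of how many times the word occurred.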