## Import

In [3]:
from sklearn.feature_extraction.text import CountVectorizer

By default, punctuation is stripped out and only tokens of at least 2 alphanumeric characters are counted (via the default regex set by the token_pattern option), so single-letter words such as "I" and "a" are dropped. All words are first lower-cased (so "Yes" and "yes" are the same word) via the lowercase option.
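
As a quick check of the defaults described above, the sketch below uses build_analyzer, which returns the callable that CountVectorizer applies to each document (lower-casing followed by tokenization). The sample sentence is made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# build_analyzer() returns the preprocessing + tokenization callable
# that the vectorizer applies to every document.
analyzer = CountVectorizer().build_analyzer()

# Illustrative sentence (not from the corpus used later).
tokens = analyzer("Yes, I am happy!")
print(tokens)  # ['yes', 'am', 'happy']
```

The comma and exclamation mark are gone, "Yes" is lower-cased, and the one-letter word "I" is dropped.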

## Demo

We first initialize a CountVectorizer object, optionally changing the default options:

In [7]:
vectorizer = CountVectorizer()
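
To illustrate what changing the default options can look like, here is a minimal sketch that keeps the original case and also accepts single-character tokens. The variable name custom and these particular option values are illustrative choices, not something used in the rest of this demo.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative, non-default options:
#   lowercase=False       keeps "Yes" and "yes" as distinct words
#   token_pattern=...     accepts tokens of one or more characters
custom = CountVectorizer(lowercase=False, token_pattern=r"(?u)\b\w+\b")
custom.fit(["I am happy", "Yes I am"])

# vocabulary_ maps each learned word to its column index.
vocab = sorted(custom.vocabulary_)
print(vocab)  # ['I', 'Yes', 'am', 'happy']
```

With these settings "I" becomes a feature and "Yes" keeps its capital letter, unlike with the default vectorizer used below.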

We then use "fit_transform" to obtain bag-of-words (word count) vectors for the sentences in our corpus:

In [11]:
bow_vectors = vectorizer.fit_transform(["I am happy", "Yes I am"])

The result is a sparse matrix with one row per sentence and one column per vocabulary word. A sparse format is used because, for a more realistic corpus, the vocabulary would be far larger than 3 words and most of the counts in each row would be zero. We can see the counts explicitly by converting this matrix to a dense matrix:

In [12]:
bow_vectors.todense()

matrix([[1, 1, 0],
        [1, 0, 1]])
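
The sparse nature of the result can be inspected directly. The sketch below, which rebuilds the same two-sentence corpus under the assumed name bow, checks that the result is a SciPy sparse matrix and reports its shape and number of stored non-zero entries. It uses toarray, which returns a plain NumPy array rather than the matrix type returned by todense.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer

bow = CountVectorizer().fit_transform(["I am happy", "Yes I am"])

print(issparse(bow))  # True: only the non-zero counts are stored
print(bow.shape)      # (2, 3): 2 sentences, 3 vocabulary words
print(bow.nnz)        # 4: number of stored non-zero counts

# toarray() gives a regular NumPy ndarray with the same values.
dense = bow.toarray()
print(dense)
```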

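And we can see which words are represented by these features using the get_feature_names method (in scikit-learn 1.0 and later, use get_feature_names_out instead; get_feature_names was removed in version 1.2):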

In [17]:
vectorizer.get_feature_names()

['am', 'happy', 'yes']
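
The same word-to-column mapping is also available through the fitted vocabulary_ attribute. The sketch below refits on the same corpus under the assumed name vec and lists the words in column order.

```python
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer()
vec.fit(["I am happy", "Yes I am"])

# vocabulary_ is a dict mapping each word to its column index;
# sorting by index recovers the column order of the count matrix.
columns = sorted(vec.vocabulary_.items(), key=lambda kv: kv[1])
words = [term for term, _ in columns]
print(words)  # ['am', 'happy', 'yes']
```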

Finally, we can use the fitted vectorizer to transform sentences from a new corpus into word count vectors:

In [19]:
vectorizer.transform(["Yes, I am the very model of a happy major general", 
                      "I am the very model of a modern major general",
                      "Happy happy happy",
                      "Yes, happy"]).todense()

matrix([[1, 1, 1],
        [1, 0, 0],
        [0, 3, 0],
        [0, 1, 1]])

We see that the transform function counts the occurrences of "am", "happy", and "yes" in each sentence, ignoring capitalization and punctuation. Words that were not in the corpus used for fitting are ignored.
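
To make the last point concrete, the sketch below transforms a sentence that shares no words with the fitted vocabulary. The names vec and unseen and the sentence itself are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer().fit(["I am happy", "Yes I am"])

# None of these words are in the fitted vocabulary ("a" is also too short).
unseen = vec.transform(["A modern major general"])

print(unseen.shape)      # (1, 3): the columns are still am, happy, yes
print(unseen.nnz)        # 0: no counts were recorded
print(unseen.toarray())  # [[0 0 0]]
```

The vocabulary is fixed at fit time, so transform never adds columns for new words.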