You have text data and want to create a set of features indicating the number of times an observation’s text contains a particular word.

In [5]:
# Use scikit-learn's CountVectorizer

# Load libraries
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Create text
text_data = np.array(['I love Brazil. Brazil!',
                      'Sweden is best',
                      'Germany beats both'])


In [6]:
# Create the bag of words feature matrix
count = CountVectorizer()
bag_of_words = count.fit_transform(text_data)
# Show feature matrix
bag_of_words

<3x8 sparse matrix of type '<class 'numpy.int64'>'
	with 8 stored elements in Compressed Sparse Row format>

This output is a sparse matrix, which is often necessary when we have a large
amount of text. However, in our toy example we can use toarray to convert it
to a dense matrix of word counts for each observation:

In [7]:
bag_of_words.toarray()


array([[0, 0, 0, 2, 0, 0, 1, 0],
       [0, 1, 0, 0, 0, 1, 0, 1],
       [1, 0, 1, 0, 1, 0, 0, 0]], dtype=int64)

In [8]:
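# Show feature names
# (scikit-learn 1.0+; older releases used count.get_feature_names(),
# which was removed in 1.2 and returned a plain list)
count.get_feature_names_out()

array(['beats', 'best', 'both', 'brazil', 'germany', 'is', 'love',
       'sweden'], dtype=object)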
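
To make the counting step concrete, here is a minimal sketch that reproduces the matrix above with only the standard library. It assumes CountVectorizer's default settings (lowercasing, and a token pattern that keeps only tokens of two or more word characters, which is why "I" never becomes a feature). The names docs, tokenized, vocabulary, and counts are illustrative only.

```python
import re
from collections import Counter

docs = ['I love Brazil. Brazil!',
        'Sweden is best',
        'Germany beats both']

# Assumed defaults: lowercase the text, then keep tokens of 2+ word characters
token_pattern = re.compile(r"(?u)\b\w\w+\b")
tokenized = [token_pattern.findall(doc.lower()) for doc in docs]

# The vocabulary is the alphabetically sorted set of unique tokens
vocabulary = sorted({token for tokens in tokenized for token in tokens})

# One row per document, one column per vocabulary word
counts = [[Counter(tokens)[word] for word in vocabulary]
          for tokens in tokenized]

print(vocabulary)
for row in counts:
    print(row)
```

Each row lines up with the corresponding row of bag_of_words.toarray(), and each column with the feature name at the same position.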

The text data in our solution was purposely small. In the real world, a single
observation of text data could be the contents of an entire book! Since our
bag-of-words model creates a feature for every unique word in the data, the
resulting matrix can contain thousands of features. This means that the matrix
can sometimes become very large in memory. Luckily, we can exploit a common
characteristic of bag-of-words feature matrices to reduce the amount of data
we need to store.

Most words do not occur in most observations, and therefore bag-of-words
feature matrices contain mostly 0s as values. We call these types of matrices
“sparse.” Instead of storing all values of the matrix, we can store only the
nonzero values and assume all other values are 0. This saves memory when we
have large feature matrices. One of the nice features of CountVectorizer is
that the output is a sparse matrix by default.
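
As a rough illustration of the savings, the sketch below compares the bytes needed for a dense matrix with the bytes needed for the same values in Compressed Sparse Row (CSR) format. The matrix shape and the number of nonzero entries are made-up figures chosen for illustration, not measurements from real text.

```python
import numpy as np
from scipy import sparse

# Hypothetical feature matrix: 1,000 documents x 5,000 words, almost all zeros
rng = np.random.default_rng(0)
dense = np.zeros((1000, 5000), dtype=np.int64)
rows = rng.integers(0, 1000, size=20000)
cols = rng.integers(0, 5000, size=20000)
dense[rows, cols] = 1  # at most 20,000 nonzero cells (duplicates collapse)

# CSR stores only the nonzero values, their column indices, and row pointers
csr = sparse.csr_matrix(dense)

dense_bytes = dense.nbytes
sparse_bytes = csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes

print("dense bytes: ", dense_bytes)
print("sparse bytes:", sparse_bytes)
```

The sparser the matrix, the larger the gap; for a matrix that is mostly nonzero, the index overhead can make the sparse form the bigger of the two.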

CountVectorizer comes with a number of useful parameters to make creating
bag-of-words feature matrices easy. First, while by default every feature is a
word, that does not have to be the case. Instead, we can set every feature to be
the combination of two words (called a 2-gram) or even three words (3-gram).
ngram_range sets the minimum and maximum size of our n-grams. For
example, (2,3) will return all 2-grams and 3-grams. Second, we can easily
remove low-information filler words using stop_words, either with a built-in
list or a custom list. Finally, we can restrict the words or phrases we want to
consider to a certain list using vocabulary. For example, we could create a
bag-of-words feature matrix that counts only occurrences of a country name:


In [9]:
# Create feature matrix with arguments
count_2gram = CountVectorizer(ngram_range=(1,2),
                              stop_words="english",
                              vocabulary=['brazil'])
bag = count_2gram.fit_transform(text_data)
# View feature matrix
bag.toarray()



array([[2],
       [0],
       [0]], dtype=int64)

In [10]:
# View the vocabulary (only the 1-grams and 2-grams we supplied are kept)
count_2gram.vocabulary_


{'brazil': 0}
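
Because the example above combines all three parameters, the effect of each one is hard to see. The following sketch applies them one at a time to the same three sentences. The custom stop-word list ['is', 'both'] and the variable names are chosen purely for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

text_data = ['I love Brazil. Brazil!',
             'Sweden is best',
             'Germany beats both']

# ngram_range=(2, 2): every feature is a pair of adjacent words
bigrams = CountVectorizer(ngram_range=(2, 2))
bigrams.fit(text_data)
print(sorted(bigrams.vocabulary_))

# stop_words with a custom list: the listed words are dropped before counting
no_stop = CountVectorizer(stop_words=['is', 'both'])
no_stop.fit(text_data)
print(sorted(no_stop.vocabulary_))

# vocabulary: count only the listed terms, with columns in the order given
countries = CountVectorizer(vocabulary=['brazil', 'sweden', 'germany'])
country_counts = countries.fit_transform(text_data).toarray()
print(country_counts)
```

With a fixed vocabulary, the columns follow the order of the supplied list rather than alphabetical order, so the first column here counts "brazil".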