In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

# Text Feature Extraction with Bag-of-Words
In many tasks, such as classic spam detection, your input data is text.
Free text of variable length is very far from the fixed-length numeric representation that we need for machine learning with scikit-learn.
However, there is an easy and effective way to go from text data to a numeric representation that we can use with our models, called bag-of-words.

![bag of word features](figures/bag_of_words.svg)

Let's assume that each sample in your dataset is represented as one string, which could be just a sentence, an email, or a whole news article or book. To represent the sample, we first split the string into a list of tokens, which correspond to (somewhat normalized) words. A simple way to do this is to split by whitespace and then lowercase each word.
Then, we build a vocabulary of all tokens (lowercased words) that appear in our whole dataset. This is usually a very large vocabulary.
Finally, looking at our single sample, we count how often each word in the vocabulary appears.
We represent our string by a vector, where each entry is how often a given word in the vocabulary appears in the string.

Since each sample contains only a small fraction of the words in the vocabulary, most entries will be zero, leading to a very high-dimensional but sparse representation.
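
The steps described above can be sketched by hand in plain Python. This is only an illustration of the idea, not how scikit-learn implements it: the names `docs`, `tokenize` and `count_vector` are made up for this example, and stripping trailing punctuation is an extra assumption added so that `"fire,"` and `"fire"` count as the same token.

```python
from collections import Counter

# Two tiny example documents (hypothetical variable name)
docs = ["Some say the world will end in fire,",
        "Some say in ice."]

def tokenize(text):
    # Naive tokenization: lowercase, split on whitespace,
    # then strip common punctuation from both ends of each token (an assumption)
    return [tok.strip(".,!?;:") for tok in text.lower().split()]

tokenized = [tokenize(d) for d in docs]

# Vocabulary: every distinct token in the whole dataset, sorted alphabetically
vocabulary = sorted(set(tok for toks in tokenized for tok in toks))

def count_vector(tokens):
    # One entry per vocabulary word: how often it occurs in this sample
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

bow = [count_vector(toks) for toks in tokenized]

print(vocabulary)
for row in bow:
    print(row)
```

Each row has one entry per vocabulary word, so the vector length is fixed by the vocabulary regardless of how long the original string is.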

In [8]:
X = ["Some say the world will end in fire,",
     "Some say in ice."]

In [9]:
len(X)

2

In [10]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
vectorizer.fit(X)


CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [11]:
vectorizer.vocabulary_

{u'end': 0,
 u'fire': 1,
 u'ice': 2,
 u'in': 3,
 u'say': 4,
 u'some': 5,
 u'the': 6,
 u'will': 7,
 u'world': 8}

In [14]:
X_bag_of_words = vectorizer.transform(X)

In [15]:
X_bag_of_words.shape

(2, 9)

In [17]:
X_bag_of_words

<2x9 sparse matrix of type '<type 'numpy.int64'>'
	with 12 stored elements in Compressed Sparse Row format>

In [18]:
X_bag_of_words.toarray()

array([[1, 1, 0, 1, 1, 1, 1, 1, 1],
       [0, 0, 1, 1, 1, 1, 0, 0, 0]])
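
To get a feel for what "12 stored elements in Compressed Sparse Row format" means, here is a rough sketch that builds the three arrays of a CSR layout by hand from the dense counts shown above. It is a simplified illustration in plain Python, not SciPy's actual implementation; the variable names `data`, `indices` and `indptr` follow the usual CSR terminology.

```python
# Dense counts copied from the toarray() output
dense = [[1, 1, 0, 1, 1, 1, 1, 1, 1],
         [0, 0, 1, 1, 1, 1, 0, 0, 0]]

data = []      # the non-zero values, row by row
indices = []   # the column index of each stored value
indptr = [0]   # indptr[i]:indptr[i + 1] is the slice of data belonging to row i

for row in dense:
    for col, value in enumerate(row):
        if value != 0:
            data.append(value)
            indices.append(col)
    indptr.append(len(data))

n_cells = len(dense) * len(dense[0])
density = len(data) / float(n_cells)  # fraction of entries that are non-zero

print(len(data), indptr, density)
```

With only two short documents the matrix is still fairly dense, but with a realistic vocabulary of many thousands of words almost all entries are zero, and storing only the non-zero ones saves a great deal of memory.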

In [20]:
# Note: scikit-learn 1.0 added get_feature_names_out(); get_feature_names() was removed in 1.2
vectorizer.get_feature_names()

[u'end', u'fire', u'ice', u'in', u'say', u'some', u'the', u'will', u'world']

In [19]:
vectorizer.inverse_transform(X_bag_of_words)

[array([u'end', u'fire', u'in', u'say', u'some', u'the', u'will', u'world'], 
       dtype='<U5'), array([u'ice', u'in', u'say', u'some'], 
       dtype='<U5')]

# Tfidf Encoding
A useful transformation that is often applied to the bag-of-words encoding is the so-called term-frequency inverse-document-frequency (Tfidf) scaling, which is a non-linear transformation of the word counts.

The Tfidf encoding rescales the counts so that words that appear in many documents receive less weight:

In [22]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer()
tfidf_vectorizer.fit(X)

TfidfVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm=u'l2', preprocessor=None, smooth_idf=True,
        stop_words=None, strip_accents=None, sublinear_tf=False,
        token_pattern=u'(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

In [27]:
import numpy as np
np.set_printoptions(precision=2)

print(tfidf_vectorizer.transform(X).toarray())

[[ 0.39  0.39  0.    0.28  0.28  0.28  0.39  0.39  0.39]
 [ 0.    0.    0.63  0.45  0.45  0.45  0.    0.    0.  ]]
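
The numbers above can be reproduced by hand, which makes the rescaling more concrete. The sketch below assumes the settings shown in the `TfidfVectorizer` repr (`use_idf=True`, `smooth_idf=True`, `sublinear_tf=False`, `norm='l2'`): raw counts are multiplied by a smoothed inverse document frequency, `idf = ln((1 + n_docs) / (1 + df)) + 1`, and each row is then scaled to unit Euclidean length. The variable names are made up for this example.

```python
import numpy as np

# Bag-of-words counts for the two example sentences
counts = np.array([[1, 1, 0, 1, 1, 1, 1, 1, 1],
                   [0, 0, 1, 1, 1, 1, 0, 0, 0]], dtype=float)

n_docs = counts.shape[0]

# Document frequency: in how many documents does each word occur?
df = (counts > 0).sum(axis=0)

# Smoothed inverse document frequency (assumes smooth_idf=True)
idf = np.log((1.0 + n_docs) / (1.0 + df)) + 1.0

# Weight the counts, then normalize each row to unit L2 norm
tfidf = counts * idf
tfidf = tfidf / np.linalg.norm(tfidf, axis=1, keepdims=True)

print(np.round(tfidf, 2))
```

Words that occur in both sentences ("in", "say", "some") get an idf of 1, while words that occur in only one sentence get a larger idf, which is why they end up with the larger weights in each row.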
