# Text Feature Extraction

scikit-learn provides utilities for the most common ways to extract numerical features from text content, namely:
<ul>
<li> <b>tokenizing</b> strings and giving an integer id for each possible token, for instance by using white-spaces and punctuation as token separators. </li>
<li> <b>counting</b> the occurrences of tokens in each document. </li>
<li> <b>normalizing</b> and weighting with diminishing importance tokens that occur in the majority of samples / documents. </li>
</ul>

In this scheme, features and samples are defined as follows:
<ul>
<li> each <b> individual token occurrence frequency </b> (normalized or not) is treated as a <b> feature </b>. </li>
<li> the vector of all the token frequencies for a given <b>document</b> is considered a multivariate <b>sample</b>. </li>
</ul>

A corpus of documents can thus be represented by a matrix with one row per document and one column per token (e.g. word) occurring in the corpus.

We call <b> vectorization </b> the general process of turning a collection of text documents into numerical feature vectors.

This specific strategy <b> (tokenization, counting and normalization)</b> is called the <b> Bag of Words or “Bag of n-grams”</b> representation.

Documents are described by word occurrences while completely ignoring the relative position information of the words in the document. 

Because each document typically uses only a small subset of the words in the corpus, most entries of this matrix are zero. To store such a matrix in memory and to speed up matrix / vector algebraic operations, implementations typically use a sparse representation such as those available in the scipy.sparse package.

<h3> Vectorizer Usage </h3>

CountVectorizer implements both tokenization and occurrence counting in a single class:

In [2]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()

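The default configuration tokenizes the string by extracting words of at least 2 letters (the function that performs this step can be requested explicitly with build_analyzer, as shown further below). Fitting the vectorizer on a small corpus returns a sparse matrix of token counts: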

In [8]:
corpus = ['This is the first document.','This is the second second document.','And the third one.','Is this the first document?']

X = vectorizer.fit_transform(corpus)

X                           

<4x9 sparse matrix of type '<type 'numpy.int64'>'
	with 19 stored elements in Compressed Sparse Row format>

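<p style="color:Tomato;"><b>What do they mean by sparse?</b> Only the non-zero counts are stored; printing the matrix lists each one as a (row, column) position followed by its value.</p>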

In [12]:
print(X)

  (0, 1)	1
  (0, 2)	1
  (0, 6)	1
  (0, 3)	1
  (0, 8)	1
  (1, 5)	2
  (1, 1)	1
  (1, 6)	1
  (1, 3)	1
  (1, 8)	1
  (2, 4)	1
  (2, 7)	1
  (2, 0)	1
  (2, 6)	1
  (3, 1)	1
  (3, 2)	1
  (3, 6)	1
  (3, 3)	1
  (3, 8)	1
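
To make the sparse idea concrete, here is a minimal pure-Python sketch (not how scipy.sparse is implemented internally) that expands the stored triples printed above into a full table. The names `triples` and `dense` are illustrative only.

```python
# (row, column, count) triples copied from the print(X) output above.
triples = [
    (0, 1, 1), (0, 2, 1), (0, 6, 1), (0, 3, 1), (0, 8, 1),
    (1, 5, 2), (1, 1, 1), (1, 6, 1), (1, 3, 1), (1, 8, 1),
    (2, 4, 1), (2, 7, 1), (2, 0, 1), (2, 6, 1),
    (3, 1, 1), (3, 2, 1), (3, 6, 1), (3, 3, 1), (3, 8, 1),
]
n_rows, n_cols = 4, 9  # shape reported for X: 4 documents, 9 tokens

# Expand to a dense list of lists: every cell not listed is an implicit zero.
dense = [[0] * n_cols for _ in range(n_rows)]
for row, col, count in triples:
    dense[row][col] = count

stored = len(triples)
total = n_rows * n_cols
print(f"{stored} stored values out of {total} cells")
for row in dense:
    print(row)
```

On this tiny corpus almost half of the cells are zero; with a realistic vocabulary of tens of thousands of tokens the share of zeros is far larger, which is why only the non-zero entries are kept.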


In [13]:
analyze = vectorizer.build_analyzer()

analyze("This is a text document to analyze.")

[u'this', u'is', u'text', u'document', u'to', u'analyze']
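
The behaviour of the default analyzer can be approximated with a single regular expression. This is a rough sketch, assuming the default settings (lowercasing plus a pattern that keeps runs of two or more word characters); `simple_analyzer` is a hypothetical helper, not part of scikit-learn.

```python
import re

# Assumption: mirrors the default token pattern, i.e. runs of two or more
# word characters delimited by word boundaries.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

def simple_analyzer(text):
    """Lowercase the text and return the tokens of at least two characters."""
    return TOKEN_PATTERN.findall(text.lower())

tokens = simple_analyzer("This is a text document to analyze.")
print(tokens)
```

The single-letter word "a" does not match the pattern, which is why it is missing from the analyzer output above.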

Each term found by the analyzer during the fit is assigned a unique integer index corresponding to a column in the resulting matrix. This interpretation of the columns can be retrieved as follows:

In [14]:
vectorizer.get_feature_names_out()  # get_feature_names() before scikit-learn 1.0

array(['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third',
       'this'], dtype=object)

Let's look at the dense matrix representation:

In [16]:
X.toarray()

array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
       [0, 1, 0, 1, 0, 2, 1, 0, 1],
       [1, 0, 0, 0, 1, 0, 1, 1, 0],
       [0, 1, 1, 1, 0, 0, 1, 0, 1]], dtype=int64)

The converse mapping from feature name to column index is stored in the vocabulary_ attribute of the vectorizer:

In [17]:
vectorizer.vocabulary_.get('document')

1

Because the vocabulary is fixed during fit, words that were not seen in the training corpus will be completely ignored in future calls to the transform method:

In [18]:
vectorizer.transform(['Something completely new.']).toarray()

array([[0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int64)
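
The fit and transform steps used so far can be sketched by hand in plain Python. This is a simplified illustration under the default tokenization assumptions, not the library implementation; `tokenize`, `fit_vocabulary` and `transform` are hypothetical helper names.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase and keep tokens of at least two word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())

def fit_vocabulary(docs):
    """Map each distinct token to a column index, in sorted order."""
    tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}

def transform(docs, vocabulary):
    """Count known tokens per document; tokens outside the vocabulary are skipped."""
    rows = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for tok, count in Counter(tokenize(doc)).items():
            if tok in vocabulary:
                row[vocabulary[tok]] = count
        rows.append(row)
    return rows

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]
vocabulary = fit_vocabulary(corpus)
counts = transform(corpus, vocabulary)
unseen = transform(['Something completely new.'], vocabulary)

print(vocabulary['document'])
print(counts)
print(unseen)
```

Since the vocabulary is built once from the training corpus, the three words of the new sentence have no column to land in and the resulting row is all zeros.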

Note that in the previous corpus, the first and the last documents contain exactly the same words and are therefore encoded as equal vectors, which loses the fact that the last document is a question. To preserve some of the local ordering information we can extract 2-grams of words in addition to the 1-grams (individual words):

In [19]:
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2), token_pattern=r'\b\w+\b', min_df=1)
analyze = bigram_vectorizer.build_analyzer()
analyze('Bi-grams are cool!')

[u'bi', u'grams', u'are', u'cool', u'bi grams', u'grams are', u'are cool']

The vocabulary extracted by this vectorizer is hence much bigger (21 features instead of 9) and can now resolve ambiguities encoded in local positioning patterns:

In [20]:
X_2 = bigram_vectorizer.fit_transform(corpus).toarray()
X_2

array([[0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0],
       [0, 0, 1, 0, 0, 1, 1, 0, 0, 2, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0],
       [1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0],
       [0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1]], dtype=int64)
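
The n-gram extraction above can be sketched in a few lines of plain Python. This is an illustrative approximation that assumes the same token pattern as the bigram vectorizer; `word_ngrams` is a hypothetical helper, not a scikit-learn function.

```python
import re

def word_ngrams(text, ngram_range=(1, 2)):
    """Return word n-grams for every n in the inclusive ngram_range."""
    tokens = re.findall(r"\b\w+\b", text.lower())
    min_n, max_n = ngram_range
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n consecutive tokens over the document.
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams

print(word_ngrams("Bi-grams are cool!"))

# The first and last documents share all unigrams but differ in their bigrams.
first = set(word_ngrams("This is the first document."))
last = set(word_ngrams("Is this the first document?"))
only_in_first = sorted(first - last)
only_in_last = sorted(last - first)
print(only_in_first)
print(only_in_last)
```

The bigrams "is this" and "this the" occur only in the interrogative document, which is what lets the bigram matrix tell the first and last rows apart.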

<h3> Tf–idf term weighting</h3>

In a large text corpus, some words are very frequent (e.g. “the”, “a”, “is” in English) and hence carry very little meaningful information about the actual contents of a document. If we were to feed the raw count data directly to a classifier, those very frequent terms would shadow the frequencies of rarer yet more interesting terms.

In order to re-weight the count features into floating point values suitable for usage by a classifier it is very common to use the tf–idf transform.

In [24]:
from sklearn.feature_extraction.text import TfidfTransformer

transformer = TfidfTransformer(smooth_idf=False)

Let’s take an example with the following counts. The first term is present 100% of the time and hence not very interesting. The two other features occur in less than 50% of the documents and are hence probably more representative of their content. Note that each output row is normalized to unit Euclidean (L2) length:

In [25]:
counts = [[3, 0, 1],
          [2, 0, 0],
          [3, 0, 0],
          [4, 0, 0],
          [3, 2, 0],
          [3, 0, 2]]

tfidf = transformer.fit_transform(counts)

tfidf

tfidf.toarray()


array([[ 0.81940995,  0.        ,  0.57320793],
       [ 1.        ,  0.        ,  0.        ],
       [ 1.        ,  0.        ,  0.        ],
       [ 1.        ,  0.        ,  0.        ],
       [ 0.47330339,  0.88089948,  0.        ],
       [ 0.58149261,  0.        ,  0.81355169]])

The weights of each feature computed by the fit method call are stored in a model attribute:

In [26]:
transformer.idf_ 

array([ 1.        ,  2.79175947,  2.09861229])
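
The numbers above can be reproduced by hand. The sketch below assumes the unsmoothed formula used when smooth_idf=False, namely idf(t) = ln(n / df(t)) + 1, followed by L2 normalization of each row; the helper name `l2_normalize` is illustrative.

```python
import math

counts = [
    [3, 0, 1],
    [2, 0, 0],
    [3, 0, 0],
    [4, 0, 0],
    [3, 2, 0],
    [3, 0, 2],
]
n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: number of documents in which each term occurs.
df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Unsmoothed idf (assumed to match smooth_idf=False): ln(n / df) + 1.
idf = [math.log(n_docs / d) + 1 for d in df]

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (left unchanged if all zeros)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec

tfidf = [l2_normalize([tf * w for tf, w in zip(row, idf)]) for row in counts]

print([round(w, 8) for w in idf])
print([round(v, 8) for v in tfidf[0]])
```

The first term occurs in all six documents, so its idf is ln(6/6) + 1 = 1, the smallest possible weight, while the rarer second and third terms are weighted up.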

As tf–idf is very often used for text features, there is also a class called TfidfVectorizer that combines all the options of CountVectorizer and TfidfTransformer in a single model:

In [27]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
vectorizer.fit_transform(corpus)

<4x9 sparse matrix of type '<type 'numpy.float64'>'
	with 19 stored elements in Compressed Sparse Row format>

In [28]:
vectorizer.fit_transform(corpus).toarray()

array([[ 0.        ,  0.43877674,  0.54197657,  0.43877674,  0.        ,
         0.        ,  0.35872874,  0.        ,  0.43877674],
       [ 0.        ,  0.27230147,  0.        ,  0.27230147,  0.        ,
         0.85322574,  0.22262429,  0.        ,  0.27230147],
       [ 0.55280532,  0.        ,  0.        ,  0.        ,  0.55280532,
         0.        ,  0.28847675,  0.55280532,  0.        ],
       [ 0.        ,  0.43877674,  0.54197657,  0.43877674,  0.        ,
         0.        ,  0.35872874,  0.        ,  0.43877674]])
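
Unlike the earlier TfidfTransformer(smooth_idf=False) example, the values above come from the default settings, which smooth the idf by adding one to the document count and to each document frequency. The following sketch recomputes the first row by hand under that assumption; `tfidf_row` is a hypothetical helper.

```python
import math

# Dense counts for the four-document corpus (from X.toarray() earlier).
counts = [
    [0, 1, 1, 1, 0, 0, 1, 0, 1],
    [0, 1, 0, 1, 0, 2, 1, 0, 1],
    [1, 0, 0, 0, 1, 0, 1, 1, 0],
    [0, 1, 1, 1, 0, 0, 1, 0, 1],
]
n_docs = len(counts)
df = [sum(1 for row in counts if row[j] > 0) for j in range(len(counts[0]))]

# Smoothed idf (assumed smooth_idf=True default): ln((1 + n) / (1 + df)) + 1.
smooth_idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

def tfidf_row(row, idf):
    """Weight raw counts by idf, then scale the row to unit L2 norm."""
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    return [v / norm for v in weighted]

first_row = tfidf_row(counts[0], smooth_idf)
second_row = tfidf_row(counts[1], smooth_idf)
print([round(v, 8) for v in first_row])
```

Smoothing behaves as if one extra document containing every term had been seen, so "the", which appears in all four documents, still gets an idf of exactly 1.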