### Representing text as numerical data

In [17]:
import pandas as pd
# example text for model training (SMS messages)
simple_train = ['call you tonight', 'call me a cab', 'please call me..PLEASE']

From the scikit-learn documentation:
> Text Analysis is a major application field for machine learning algorithms. However the raw data, a sequence of symbols cannot be fed directly to the algorithms themselves as most of them expect numerical feature vectors with a fixed size rather than the raw text documents with variable length.

Use CountVectorizer to "convert text into a matrix of token counts":

In [18]:
# import and instantiate CountVectorizer (with the default parameters)
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()

In [19]:
# learn the vocabulary of the training data
vect.fit(simple_train)

CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [20]:
# examine the fitted vocabulary
vect.get_feature_names()

[u'cab', u'call', u'me', u'please', u'tonight', u'you']

In [21]:
# transform training data into a "document-term matrix"
simple_train_dtm = vect.transform(simple_train)
print(type(simple_train_dtm))
print(simple_train_dtm)

<class 'scipy.sparse.csr.csr_matrix'>
  (0, 1)	1
  (0, 4)	1
  (0, 5)	1
  (1, 0)	1
  (1, 1)	1
  (1, 2)	1
  (2, 1)	1
  (2, 2)	1
  (2, 3)	2


In [22]:
# convert sparse matrix into a dense matrix
simple_train_dtm.toarray()

array([[0, 1, 0, 0, 1, 1],
       [1, 1, 1, 0, 0, 0],
       [0, 1, 1, 2, 0, 0]], dtype=int64)
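
To make the link between the sparse printout and the dense array concrete, here is a minimal pure-Python sketch. It assumes a simplified coordinate-style view in which each stored entry is a `(row, column, count)` triplet; this is not how the CSR format lays out its data internally, and the names `entries` and `dense` are illustrative only.

```python
# (row, column, count) triplets copied from the sparse printout above.
entries = [
    (0, 1, 1), (0, 4, 1), (0, 5, 1),
    (1, 0, 1), (1, 1, 1), (1, 2, 1),
    (2, 1, 1), (2, 2, 1), (2, 3, 2),
]
n_rows, n_cols = 3, 6  # 3 documents, 6 vocabulary tokens

# Start from all zeros and fill in only the stored (non-zero) counts.
dense = [[0] * n_cols for _ in range(n_rows)]
for row, col, count in entries:
    dense[row][col] = count

for line in dense:
    print(line)

# Only the non-zero cells are stored in the sparse form.
print(len(entries), "of", n_rows * n_cols, "cells are non-zero")
```

With a realistic vocabulary of thousands of tokens, most cells in each row are zero, which is why storing only the non-zero entries saves memory.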

In [24]:
# examine the vocabulary and document-term matrix together
pd.DataFrame(simple_train_dtm.toarray(), columns=vect.get_feature_names())

| | cab | call | me | please | tonight | you |
|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 0 | 1 | 1 |
| 1 | 1 | 1 | 1 | 0 | 0 | 0 |
| 2 | 0 | 1 | 1 | 2 | 0 | 0 |


In this scheme, features and samples are defined as follows:
* Each individual token occurrence frequency (normalized or not) is treated as a <b>feature</b>.
* The vector of all the token frequencies for a given document is considered a <b>multivariate sample</b>.

A <b>corpus of documents</b> can thus be represented by a matrix with one row per document and one column per token (e.g. word) occurring in the corpus.

We call <b>vectorization</b> the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the <b>Bag of Words</b> or "Bag-of-n-grams" representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document.
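
The tokenize-and-count steps can be sketched in plain Python. This is a rough approximation of the default behaviour shown earlier (lowercasing, tokens of two or more word characters, alphabetically sorted vocabulary), not scikit-learn's implementation; the helper names `tokenize`, `build_vocabulary` and `count_vector` are hypothetical.

```python
import re
from collections import Counter

# Tokens are runs of two or more word characters, mirroring the default
# token_pattern printed above (single-character words such as "a" are dropped).
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

def tokenize(text):
    """Lowercase the text and split it into tokens."""
    return TOKEN_PATTERN.findall(text.lower())

def build_vocabulary(documents):
    """Collect the unique tokens of all documents in sorted order."""
    return sorted({token for doc in documents for token in tokenize(doc)})

def count_vector(text, vocabulary):
    """Count occurrences of each vocabulary token; word order is discarded."""
    counts = Counter(tokenize(text))
    return [counts[token] for token in vocabulary]

simple_train = ['call you tonight', 'call me a cab', 'please call me..PLEASE']
vocabulary = build_vocabulary(simple_train)
dtm = [count_vector(doc, vocabulary) for doc in simple_train]

print(vocabulary)
for row in dtm:
    print(row)

# Position is ignored: reordering the words gives the same vector.
print(count_vector('call me', vocabulary) == count_vector('me call', vocabulary))
```

The last line prints `True`, which is the "bag" in Bag of Words: only the counts survive, not the order in which the words appeared.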

In [25]:
# example text for model testing
simple_test = ["please don't call me"]

In order to make a prediction, the new observation must have the same features as the training observations, both in number and meaning.

In [26]:
# transform testing data into a document-term matrix (using the fitted vocabulary)
simple_test_dtm = vect.transform(simple_test)
simple_test_dtm.toarray()

array([[0, 1, 1, 1, 0, 0]], dtype=int64)
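
Note that "don't" leaves no trace in the result above. A minimal pure-Python sketch of why, assuming the same tokenization rule as the default `token_pattern` and a vocabulary list copied from the fitted output (the helper `transform_with_vocabulary` is hypothetical, not part of scikit-learn):

```python
import re
from collections import Counter

# Vocabulary learned from the training messages (copied from the output above).
vocabulary = ['cab', 'call', 'me', 'please', 'tonight', 'you']

def transform_with_vocabulary(text, vocabulary):
    """Count known tokens and report the tokens that have no column."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    counts = Counter(tokens)
    vector = [counts[token] for token in vocabulary]
    unseen = sorted(set(tokens) - set(vocabulary))
    return vector, unseen

vector, unseen = transform_with_vocabulary("please don't call me", vocabulary)
print(vector)  # [0, 1, 1, 1, 0, 0]
print(unseen)  # ['don']
```

The apostrophe splits "don't" into "don" and "t". The single character "t" is too short to count as a token, and "don" was never seen during fitting, so it has no column and is silently dropped. The vector keeps exactly six features, matching the training data.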

In [27]:
# examine the vocabulary and document-term matrix together
pd.DataFrame(simple_test_dtm.toarray(), columns=vect.get_feature_names())

| | cab | call | me | please | tonight | you |
|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 1 | 1 | 0 | 0 |


#### Summary: 
* vect.fit(train) <b>learns the vocabulary</b> of the training data