**Representing text as numerical data**

Raw text, which is a sequence of symbols, cannot be fed directly to machine learning algorithms, because most of them expect numerical feature vectors of a fixed size rather than text documents of variable length.
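To make the fixed-size requirement concrete, here is a minimal pure-Python sketch of the idea. It is not how CountVectorizer is implemented. The helper names `tokenize` and `count_vector`, the whitespace tokenizer, and the two sample sentences are assumptions made for illustration.

```python
from collections import Counter

docs = ["call you tonight", "please call me please"]

def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    return text.lower().split()

# Vocabulary: every distinct token in the corpus, in sorted order.
vocabulary = sorted({token for doc in docs for token in tokenize(doc)})

def count_vector(text, vocabulary):
    # One count per vocabulary entry, so every vector has len(vocabulary) items.
    counts = Counter(tokenize(text))
    return [counts[token] for token in vocabulary]

vectors = [count_vector(doc, vocabulary) for doc in docs]
print(vocabulary)
for doc, vec in zip(docs, vectors):
    print(len(tokenize(doc)), "tokens ->", vec)
```

The documents have 3 and 4 tokens, but both end up as vectors with one entry per vocabulary term.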



We will use CountVectorizer to "convert text into a matrix of token counts":

4 Steps for Vectorization

1. Import
2. Instantiate
3. Fit
4. Transform

In [1]:
simple_train = ['call you tonight', 'Call me a cab', 'please call me.. please']

In [8]:
# Step 1: import
from sklearn.feature_extraction.text import CountVectorizer

# Step 2: instantiate
cv = CountVectorizer()

# Step 3: fit, i.e. learn the vocabulary (lowercased, no duplicates, alphabetical order)
cv.fit(simple_train)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print(cv.get_feature_names_out().tolist())

# Step 4: transform the training data into a 'document-term matrix'
simple_train_dtm = cv.transform(simple_train)
simple_train_dtm

# 3 rows x 6 columns
# rows = documents: there are 3 documents
# columns = terms: the 6 terms learned during the fitting step

# Convert the sparse matrix to a dense matrix
simple_train_dtm.toarray()



['cab', 'call', 'me', 'please', 'tonight', 'you']


<3x6 sparse matrix of type '<class 'numpy.int64'>'
	with 9 stored elements in Compressed Sparse Row format>

In [10]:
# Examine the vocabulary and document-term matrix together
import pandas as pd
x = pd.DataFrame(simple_train_dtm.toarray(), columns=cv.get_feature_names_out())
# x is the feature matrix (X) that the model will be trained on
x

   cab  call  me  please  tonight  you
0    0     1   0       0        1    1
1    1     1   1       0        0    0
2    0     1   1       2        0    0


vect.fit(train) learns the vocabulary of the training data

vect.transform(train) uses the fitted vocabulary to build a document-term matrix from the training data

vect.transform(test) uses the fitted vocabulary to build a document-term matrix from the testing data (and ignores tokens it hasn't seen before)
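As a hedged illustration of what "learns the vocabulary" means, the sketch below inspects the fitted `vocabulary_` attribute, which maps each learned term to its column index. The names `train`, `test` and `vect` mirror the placeholders used in the three statements above, and the sample sentences are made up for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer

train = ["call you tonight", "Call me a cab"]
test = ["call me a taxi tonight"]

vect = CountVectorizer()
vect.fit(train)  # learn the vocabulary from the training data only

# vocabulary_ maps each learned term to its column index.
print(sorted(vect.vocabulary_.items(), key=lambda item: item[1]))

train_dtm = vect.transform(train)
test_dtm = vect.transform(test)  # "taxi" was never seen, so it is ignored

# Both matrices have one column per learned term.
print(train_dtm.shape, test_dtm.shape)
print(test_dtm.toarray())
```

Because the vectorizer is fitted on the training data only, the test matrix always has the same columns as the training matrix, whatever words the test documents contain.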

In [13]:
# Example text for model testing
simple_test = ["Please don't call me"]

# To make a prediction, the new observation must have the same features as the training observations, both in number and meaning.
simple_test_dtm = cv.transform(simple_test)
print(simple_test_dtm.toarray())
pd.DataFrame(simple_test_dtm.toarray(), columns=cv.get_feature_names_out())

# The word "don't" was dropped. Why is that acceptable?
# With the default token pattern, "don't" yields the token "don", which is not in the fitted vocabulary.
# The model was never trained on that feature, so it has learned nothing about how it relates to the response
# and could not use it to make a prediction anyway.

[[0 1 1 1 0 0]]


   cab  call  me  please  tonight  you
0    0     1   1       1        0    0


A corpus of documents can thus be represented by a matrix with one row per document and one column per token (e.g. word) occurring in the corpus.
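A small sketch of this row-per-document, column-per-token layout, reusing the three training messages from earlier. It uses `fit_transform`, which performs the fit and transform steps in one call, and reads two standard attributes of the resulting SciPy sparse matrix (`shape` and `nnz`). The variable names `corpus` and `dtm` are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['call you tonight', 'Call me a cab', 'please call me.. please']

# fit_transform learns the vocabulary and builds the matrix in one call.
dtm = CountVectorizer().fit_transform(corpus)

n_docs, n_terms = dtm.shape
print(n_docs, "documents x", n_terms, "terms")

# Only non-zero counts are stored, which is why a sparse format is used.
print(dtm.nnz, "stored counts out of", n_docs * n_terms, "cells")
```

On a realistic corpus the vocabulary runs to thousands of terms while each document uses only a few of them, so the share of non-zero cells is typically far smaller than in this toy example.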

We call vectorization the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the Bag of Words or "Bag of n-grams" representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document.
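The loss of word order can be seen directly. In the sketch below, which uses two made-up sentences, plain word counts cannot tell the documents apart, while adding bigrams through CountVectorizer's `ngram_range` parameter recovers some local ordering. This is an illustration under those assumptions, not a recommendation for a particular setting.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["call me please", "please call me"]

# Unigram counts: both documents contain the same words once each.
unigrams = CountVectorizer().fit_transform(docs).toarray()
print((unigrams[0] == unigrams[1]).all())

# Adding bigrams (pairs of adjacent words) keeps a little local word order.
bigram_vect = CountVectorizer(ngram_range=(1, 2))
with_bigrams = bigram_vect.fit_transform(docs).toarray()
print(sorted(bigram_vect.vocabulary_))
print((with_bigrams[0] == with_bigrams[1]).all())
```

The first comparison shows identical vectors for the two documents; once bigrams such as "me please" and "please call" become features, the vectors differ.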