# Programming Exercise: Working with Text Data

## Working with Text Data

In this exercise, you will work with text data. Text data is usually represented as strings of characters with variable length. This kind of feature is clearly very different from numeric features, so we need to process the data before we can apply our machine learning algorithms to it.

## Applying Bag-of-Words to a Toy Dataset

To construct a bag-of-words model based on the word counts in the respective documents, we can use the <samp>CountVectorizer</samp> class implemented in scikit-learn. As we will see in the following code section, the <samp>CountVectorizer</samp> class takes an array of text data, which can be documents or just sentences, and constructs the bag-of-words model for us: 

In [15]:
#Text data and Building Vocabulary
import numpy as np
docs = np.array([
    'The sun is shining',
    'The weather is sweet',
    'The sun is shining the weather is sweet and one and one is two'])

from sklearn.feature_extraction.text import CountVectorizer
count = CountVectorizer()
count.fit(docs)
print("Vocabulary size: {}". format(len(count.vocabulary_)))
print("Vocabulary content:\n {}".format(count.vocabulary_))

Vocabulary size: 9
Vocabulary content:
 {'the': 6, 'sun': 4, 'is': 1, 'shining': 3, 'weather': 8, 'sweet': 5, 'and': 0, 'one': 2, 'two': 7}


Fitting the <samp>CountVectorizer</samp> consists of tokenizing the training data and building the vocabulary, which we can access as the <samp>vocabulary\_</samp> attribute. In this case the vocabulary consists of 9 words, each mapped to an integer index assigned in alphabetical order.
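
To make the fitting step concrete, here is a minimal sketch in plain Python that rebuilds the same vocabulary by hand. It assumes the documented defaults of <samp>CountVectorizer</samp> (lowercasing, and tokens made of two or more word characters); the helper name `tokenize` is ours and is not part of scikit-learn.

```python
import re

docs = [
    'The sun is shining',
    'The weather is sweet',
    'The sun is shining the weather is sweet and one and one is two',
]

def tokenize(text):
    # Lowercase, then keep runs of two or more word characters
    # (assumed to mirror lowercase=True and token_pattern=r"(?u)\b\w\w+\b").
    return re.findall(r"\b\w\w+\b", text.lower())

# Collect the distinct tokens and number them in alphabetical order.
tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
vocabulary = {tok: idx for idx, tok in enumerate(tokens)}

print("Vocabulary size:", len(vocabulary))
print(vocabulary)
```

The resulting mapping has the same nine entries and the same indices as <samp>count.vocabulary\_</samp> above; only the printing order of the dictionary differs.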

To create the bag-of-words representation for the training dataset, we call the <samp>transform</samp> method:

In [16]:
#To create the bag-of-words representation
bag = count.transform(docs)
#Repr returns a string containing a printable representation of an object. 
print("Bag of words: {}".format(repr(bag)))
print("Dense representation of Bag of word:\n {}". format(bag.toarray()))

Bag of words: <3x9 sparse matrix of type '<class 'numpy.int64'>'
	with 17 stored elements in Compressed Sparse Row format>
Dense representation of Bag of word:
 [[0 1 0 1 1 0 1 0 0]
 [0 1 0 0 0 1 1 0 1]
 [2 3 2 1 1 1 2 1 1]]


The bag-of-words representation is stored in a sparse matrix that only stores the nonzero entries. A sparse matrix is used because most documents contain only a small subset of the words in the vocabulary. In the dense representation we can see that the word counts in the first two documents are either 0 or 1, while the third document contains repeated words with counts of 2 and 3. For example, the first feature at index position 0 represents the count of the word "and", which only occurs in the last document (twice), and the word "is" at index position 1 (the 2nd feature in the document vectors) occurs in all three sentences.
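
The counting step can be sketched by hand as well. The snippet below is only an illustration in plain Python: the dictionary used for the "sparse" form is our own simplification and is not how SciPy lays out a Compressed Sparse Row matrix.

```python
import re
from collections import Counter

docs = [
    'The sun is shining',
    'The weather is sweet',
    'The sun is shining the weather is sweet and one and one is two',
]

def tokenize(text):
    # Same assumed tokenization as CountVectorizer's defaults.
    return re.findall(r"\b\w\w+\b", text.lower())

vocab = sorted({tok for doc in docs for tok in tokenize(doc)})

# Dense representation: one row per document, one column per vocabulary word.
dense = []
for doc in docs:
    counts = Counter(tokenize(doc))
    dense.append([counts[word] for word in vocab])

# Sparse representation: keep only (row, column) -> count for nonzero counts.
sparse = {
    (row, col): value
    for row, values in enumerate(dense)
    for col, value in enumerate(values)
    if value != 0
}

for row in dense:
    print(row)
print(len(sparse), "stored elements")
```

The three rows match the dense array printed above, and the number of nonzero entries matches the 17 stored elements reported for the sparse matrix.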

## Example Application: Sentiment Analysis of Movie Reviews

In this part of this exercise, we will use a dataset of movie reviews from the IMDb (Internet Movie Database). This dataset contains the text of the reviews, together with a label that indicates whether a review is "positive" or "negative".

In [17]:
#Movie reviews Dataset
import pandas as pd
df = pd.read_csv('./datasets/movie_data.csv')
print("First Elements:\n {}".format(df.head(3)))
text_train = df.loc[:24999, 'review'].values
y_train = df.loc[:24999, 'sentiment'].values
text_test = df.loc[25000:, 'review'].values
y_test = df.loc[25000:, 'sentiment'].values

First Elements:
                                               review  sentiment
0  In 1974, the teenager Martha Moxley (Maggie Gr...          1
1  OK... so... I really like Kris Kristofferson a...          0
2  ***SPOILER*** Do not read this, if you think a...          0


Now, you will build the vocabulary and the bag-of-words representation of <samp>text\_train</samp>.

#### Implementation Notes:
<ul>
    <li> Use the <samp>CountVectorizer</samp> class </li>
    <li> Build the vocabulary using the training set </li>
    <li> Compute the bag-of-words representation of <samp>text\_train</samp> and store it in <samp>X\_train</samp> </li>
    <li> Print the shape of <samp>X\_train</samp> using: <samp>print("X_train:\n{}".format(repr(X_train)))</samp> </li>
</ul>

In [18]:
#Building the vocabulary and the bag of words
# HERE YOUR CODE
count = CountVectorizer()
count.fit(text_train)
X_train = count.transform(text_train)
print("X_train:\n{}".format(repr(X_train)))

X_train:
<25000x76852 sparse matrix of type '<class 'numpy.int64'>'
	with 3408554 stored elements in Compressed Sparse Row format>


The shape of <samp>X_train</samp> is $25000\times76852$, indicating that the vocabulary contains 76,852 entries.
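
A quick back-of-the-envelope calculation shows why a sparse matrix is the right container here. The numbers are taken from the output above; the only assumption is 8 bytes per entry for a dense <samp>int64</samp> array.

```python
n_docs, n_features = 25000, 76852
stored = 3408554  # nonzero entries reported in the output above

density = stored / (n_docs * n_features)      # fraction of nonzero cells
avg_distinct_tokens = stored / n_docs         # distinct vocabulary words per review
dense_gib = n_docs * n_features * 8 / 2**30   # 8 bytes per int64 entry

print("density: {:.4%}".format(density))
print("distinct tokens per review: {:.1f}".format(avg_distinct_tokens))
print("dense int64 size: {:.1f} GiB".format(dense_gib))
```

Fewer than 0.2% of the cells are nonzero, and each review uses on average about 136 distinct vocabulary words, whereas a dense array of the same shape would need more than 14 GiB.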

Let's look at the vocabulary in a bit more detail. 

In [19]:
#Let's look at the vocabulary in a bit more detail.
# Note: get_feature_names() was removed in scikit-learn 1.2;
# on recent versions use count.get_feature_names_out() instead.
feature_names = count.get_feature_names()
print("Number of features: {}".format(len(feature_names)))
print("First 20 features:\n{}".format(feature_names[:20]))
print("Features 20400 to 20430:\n{}".format(feature_names[20400:20430]))
print("Every 2000th feature:\n{}".format(feature_names[::2000]))

Number of features: 76852
First 20 features:
['00', '000', '0000000000001', '00000001', '00015', '001', '007', '0079', '007s', '0080', '0083', '009', '0093638', '00am', '00o', '00pm', '00s', '01', '0148', '02']
Features 20400 to 20430:
['dozor', 'doña', 'doğan', 'dp', 'dpm', 'dpp', 'dq', 'dr', 'draaaaaaaawl', 'draaaaaags', 'drab', 'drablow', 'drably', 'drabness', 'drac', 'dracht', 'dracula', 'draculas', 'draft', 'drafted', 'draftee', 'draftees', 'drafthouse', 'drafting', 'drafts', 'drag', 'dragan', 'dragged', 'dragging', 'draggy']
Every 2000th feature:
['00', 'affiliation', 'approxiately', 'barbara', 'blobs', 'buoyancy', 'charitable', 'commentors', 'crippling', 'demotic', 'dolous', 'elysee', 'eyelid', 'follows', 'ghettos', 'gwyenths', 'hogue', 'indefinitely', 'jessie', 'kramp', 'liz', 'marketability', 'ministrations', 'naked', 'offsets', 'patently', 'poissons', 'punisher', 'reimburse', 'rosy', 'seamed', 'singing', 'splaying', 'sumatra', 'testers', 'trifunovic', 'unrolling', 'wandering'

As you can see, possibly a bit surprisingly, the first entries in the vocabulary are all numbers or tokens that start with digits (such as "007s" or "00am"). All of these appear somewhere in the reviews, and are therefore extracted as words.

Once we have our features, let's obtain a quantitative measure of performance by actually building a classifier. We have the training labels stored in <samp>y_train</samp> and the bag-of-words representation of the training data in <samp>X_train</samp>, so we can train a classifier on this data. For high-dimensional, sparse data like this, linear models like <samp>LogisticRegression</samp> often work best.

Let's start by evaluating <samp>LogisticRegression</samp>:

In [20]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
# HERE YOUR CODE
# Use the fit(X, y) function of LogisticRegression() to train your model,
# where X is the array of your training data and y holds the labels
logreg.fit(X_train,y_train)
# Compute the bag-of-words representation of the testing set
X_test = count.transform(text_test)
# Use the score(X, y) function of LogisticRegression() to compute
# the performance on both the training and the testing set
print("Train score: {:.2f}".format(logreg.score(X_train, y_train)))
print("Test score: {:.2f}".format(logreg.score(X_test, y_test)))

Train score: 1.00
Test score: 0.89


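Based on these results we can see that our model overfits the training data: it scores 1.00 on the training set but only 0.89 on the test set. <samp>LogisticRegression</samp> has a regularization parameter, <samp>C</samp>, which we can tune (via a grid search strategy) to reduce the overfitting effect.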
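
As a reminder of what <samp>C</samp> controls: with labels recoded as $y_i \in \{-1, +1\}$ (a notational assumption; the dataset stores them as 0/1) and an intercept $b$, the L2-penalized objective given in the scikit-learn documentation for <samp>LogisticRegression</samp> can be written as
\begin{equation}
 \min_{w, b} \; \frac{1}{2} w^T w + C \sum_{i=1}^{n} \ln\left(1 + e^{-y_i \left(x_i^T w + b\right)}\right)
\end{equation}
Because $C$ multiplies the data-fit term, a smaller $C$ gives the penalty $\frac{1}{2} w^T w$ relatively more weight, that is, stronger regularization.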

In [21]:
from sklearn.model_selection import GridSearchCV
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10]}

def grid_search(X_train,y_train,X_test,y_test,param_grid):
    grid = GridSearchCV(LogisticRegression(), param_grid, cv=5)
    # HERE YOUR CODE
    # Use the fit(X, y) function of GridSearchCV to select parameters and train your model
    grid.fit(X_train, y_train)
    print("Best cross-validation score: {:.2f}".format(grid.best_score_))
    print("Best parameters: ", grid.best_params_)
    # Test on the testing set
    # HERE YOUR CODE
    # Use the score(X, y) function of GridSearchCV to compute
    # the performance on the testing set
    print("Test score: {:.2f}".format(grid.score(X_test, y_test)))
    return grid

grid_search(X_train, y_train, X_test, y_test,param_grid)

Best cross-validation score: 0.89
Best parameters:  {'C': 0.1}
Test score: 0.89


GridSearchCV(cv=5, error_score='raise',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'C': [0.001, 0.01, 0.1, 1, 10]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

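We obtain an accuracy of 89%, which indicates reasonable performance for a balanced binary classification task. Note that the accuracy on the test set is the same as in the previous test, but the cross-validation score (0.89) now agrees with the test score, which suggests that the stronger regularization (C = 0.1) has reduced the overfitting.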

To reduce the computation time, in this exercise you can test just two values of C (I've already run the grid search procedure for all the cases). In a real setting you need to test multiple values of C.

In [22]:
param_grid = {'C': [0.1, 10]}

### Words with Multiple Appearances

To clean the vocabulary of non-meaningful "words" we can use a simple mechanism that works quite well in practice: only use tokens that appear in at least two documents (or at least five documents, and so on). A token that appears in only a single document is unlikely to appear in the test set and is therefore not helpful. We can set the minimum number of documents a token needs to appear in with the <samp>min\_df</samp> parameter (see below).

By requiring each token to appear in at least five documents, we can bring down the number of features to 27,040, only about a third of the original features. There are clearly many fewer numbers, and some of the more obscure words seem to have vanished.

The validation accuracy is unchanged from before. We did not improve our model, but having fewer features to deal with speeds up processing and throwing away useless features might make the model more interpretable. 
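
The document-frequency filter behind <samp>min\_df</samp> can be sketched in plain Python on the three toy sentences from the beginning of this exercise. This is an illustration under the assumed default tokenization, with a toy threshold of 2 instead of 5; the helper names are ours.

```python
import re
from collections import Counter

docs = [
    'The sun is shining',
    'The weather is sweet',
    'The sun is shining the weather is sweet and one and one is two',
]

def tokenize(text):
    return re.findall(r"\b\w\w+\b", text.lower())

# Document frequency: in how many documents each token occurs
# (repeats inside one document count once, hence the set).
doc_freq = Counter(tok for doc in docs for tok in set(tokenize(doc)))

min_df = 2  # toy threshold; the exercise uses min_df=5
kept = sorted(tok for tok, df in doc_freq.items() if df >= min_df)
dropped = sorted(tok for tok, df in doc_freq.items() if df < min_df)

print("kept:", kept)
print("dropped:", dropped)
```

Note that "and" and "one" are dropped even though each occurs twice: both occurrences are in the same document, and only the number of documents matters.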

In [23]:
#Set minimum number of documents a token needs to appear
#Building the vocabulary and the bag of words
# HERE YOUR CODE
# min_df=5 keeps only the tokens that appear in at least five documents
count = CountVectorizer(min_df=5)
count.fit(text_train)
X_train = count.transform(text_train)

print("X_train with min_df: {}".format(repr(X_train)))
feature_names = count.get_feature_names()
print("Number of features: {}".format(len(feature_names)))
print("First 50 features:\n{}".format(feature_names[:50]))

# HERE YOUR CODE
# Compute the Bag of word representation of the testing set
X_test = count.transform(text_test)
grid_search(X_train, y_train, X_test, y_test, param_grid) # this runs in a verbose mode

X_train with min_df: <25000x27040 sparse matrix of type '<class 'numpy.int64'>'
	with 3328371 stored elements in Compressed Sparse Row format>
Number of features: 27040
First 50 features:
['00', '000', '007', '01', '02', '03', '05', '06', '07', '08', '09', '10', '100', '1000', '100s', '100th', '101', '102', '103', '104', '105', '107', '108', '109', '10s', '10th', '11', '110', '111', '115', '116', '117', '11th', '12', '120', '1200', '123', '12th', '13', '130', '13th', '14', '140', '14th', '15', '150', '15th', '16', '160', '16mm']
Best cross-validation score: 0.89
Best parameters:  {'C': 0.1}
Test score: 0.89


GridSearchCV(cv=5, error_score='raise',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
       fit_params=None, iid=True, n_jobs=1, param_grid={'C': [0.1, 10]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

### Advanced Tokenization

The <samp>CountVectorizer</samp> is relatively simple, but it can be improved using external methods.
One particular step that is often improved in more sophisticated text-processing applications is the first step in the bag-of-words model: tokenization. This step defines what constitutes a word for the purpose of feature extraction.
We saw earlier that the vocabulary often contains singular and plural versions of some words: "draft" and "drafts" or "dracula" and "draculas". For the purposes of a bag-of-words model, the semantics of "draft" and "drafts" are so close that distinguishing them will only increase overfitting, and not allow the model to fully exploit the training data.

This problem can be overcome by representing each word using its <samp>word stem</samp>, which involves identifying all the words that have the same word stem. If this is done by using a rule-based heuristic, like dropping common suffixes, it is usually referred to as <samp>stemming</samp>. If instead a dictionary of known words is used, and the role of the word in the sentence is taken into account, the process is referred to as <samp>lemmatization</samp> and the standardized form of the word is referred to as the <samp>lemma</samp>.
However, <samp>lemmatization</samp> is computationally more expensive than <samp>stemming</samp>, and it may have little impact on the performance.
The Natural Language Toolkit for Python (NLTK, http://www.nltk.org) implements the <samp>Snowball</samp> stemming algorithm, which we will use in the following code section. 

In [24]:
#Advanced Tokenization
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")
def tokenizer_snowballStemmer(text):
    return [stemmer.stem(word) for word in text.split()]

tokenizer_snowballStemmer("runners like running and thus they run")

['runner', 'like', 'run', 'and', 'thus', 'they', 'run']
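
For intuition about what "dropping common suffixes" means, here is a deliberately naive suffix stripper in plain Python. It is a toy of our own, not the Snowball algorithm: the suffix list and the minimum stem length are arbitrary choices made for this sketch.

```python
def toy_stem(word, suffixes=("ing", "ed", "es", "s"), min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    word = word.lower()
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

for w in ["drafts", "drafted", "drafting", "draculas", "is"]:
    print(w, "->", toy_stem(w))
```

Such a crude rule already merges "drafts", "drafted" and "drafting", but it would turn "running" into "runn", whereas the Snowball stemmer above correctly returns "run". Handling cases like this is what the more careful rules of a real stemmer are for.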

Using the <samp>Snowball</samp> stemmer from the <samp>nltk</samp> package, we can classify the movie reviews.

In [25]:
#Classification with Tokenizer NLTK
nltk_count = CountVectorizer(tokenizer=tokenizer_snowballStemmer, min_df=5).fit(text_train)

# HERE YOUR CODE
# Compute the bag-of-words representation of the training set
X_train_nltk = nltk_count.transform(text_train)
print("X_train_nltk: {}".format(X_train_nltk.shape))

# HERE YOUR CODE
# Compute the Bag of word representation of the testing set
X_test_nltk = nltk_count.transform(text_test)
grid_search(X_train_nltk,y_train,X_test_nltk,y_test,param_grid)

X_train_nltk: (25000, 34756)
Best cross-validation score: 0.88
Best parameters:  {'C': 0.1}
Test score: 0.88


GridSearchCV(cv=5, error_score='raise',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
       fit_params=None, iid=True, n_jobs=1, param_grid={'C': [0.1, 10]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

Let's briefly talk about another useful topic called <samp>stop-word removal</samp>. Stop-words are simply those words that are extremely common in all sorts of texts and likely bear no (or only little) useful information that can be used to distinguish between different classes of documents. Examples of stop-words are <i>is</i>, <i>and</i>, <i>has</i> etc.

To remove stop-words from the movie reviews, we will use the list of English stop-words that is available from the NLTK library (127 words in the version used here; the size of the list depends on the NLTK version):

In [26]:
#importing stop.words
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stop = stopwords.words('english')
[w for w in tokenizer_snowballStemmer("a runner like running and run a lot")[-10:] if w not in stop]

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Gabriele\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


['runner', 'like', 'run', 'run', 'lot']

Now we can use this list in the classifier:

In [27]:
#Classification with Tokenizer NLTK + Stop-words
nltk_count = CountVectorizer(tokenizer=tokenizer_snowballStemmer, stop_words=stop, min_df=5).fit(text_train)

# HERE YOUR CODE
# Compute the bag-of-words representation of the training set
X_train_nltk = nltk_count.transform(text_train)
print("X_train_nltk: {}".format(X_train_nltk.shape))

# HERE YOUR CODE
# Compute the Bag of word representation of the testing set
X_test_nltk = nltk_count.transform(text_test)
grid_search(X_train_nltk,y_train,X_test_nltk,y_test,param_grid)

X_train_nltk: (25000, 34620)
Best cross-validation score: 0.87
Best parameters:  {'C': 0.1}
Test score: 0.88


GridSearchCV(cv=5, error_score='raise',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
       fit_params=None, iid=True, n_jobs=1, param_grid={'C': [0.1, 10]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)
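
One subtlety is worth a small sketch. To our understanding, when a custom tokenizer is passed, the stop list is compared against the tokens that the tokenizer returns, so here it is compared against stems. The NLTK list contains unstemmed words, which is presumably why stems such as "veri", "onli" and "becaus" still show up among the features later in this exercise. The snippet below illustrates the effect in plain Python; the lookup table is a stand-in for a real stemmer and the stop list is a toy subset, both assumptions of this sketch.

```python
# Stand-in for a real stemmer: a tiny lookup table (assumption for this sketch).
toy_stems = {"very": "veri", "only": "onli", "because": "becaus", "movies": "movi"}

def stem(word):
    return toy_stems.get(word, word)

stop = ["very", "only", "because", "the", "is"]  # unstemmed stop list (toy subset)
tokens = [stem(w) for w in "the movies is very good only because".split()]

# Filtering with the unstemmed list misses the stemmed forms.
print([t for t in tokens if t not in stop])

# Stemming the stop list first makes both sides comparable.
stemmed_stop = {stem(w) for w in stop}
print([t for t in tokens if t not in stemmed_stop])
```

If this matters for an application, one option is to pass the stemmed stop list to the vectorizer instead of the raw one.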

### Rescaling the Data with tf-idf

One of the most common approaches to representing text is the <i>term frequency-inverse document frequency</i> (tf-idf) method. The intuition of this method is to give high weight to any term that appears often in a particular document, but not in many documents in the corpus. If a word appears often in a particular document, but not in very many documents, it is likely to be very descriptive of the content of that document. <i>scikit-learn</i> implements the tf-idf method in the <samp>TfidfVectorizer</samp> class, which takes in the text data and does both the bag-of-words feature extraction and the tf-idf transformation. There are several variants of the tf-idf rescaling scheme (see Wikipedia). The tf-idf score for word $w$ in document $d$ as implemented in the <samp>TfidfVectorizer</samp> class is given by:
\begin{equation}
 tfidf(w,d) = tf * \left(\ln\left( \frac{N+1}{N_w+1}\right)+1\right)
\end{equation}
where $N$ is the number of documents in the training set, $N_w$ is the number of documents in the training set in which the word $w$ appears, and $tf$ (the term frequency) is the number of times that the word $w$ appears in the query document $d$ (the document you want to transform or encode). The class also applies L2 normalization after computing the tf-idf representation; in other words, it rescales the representation of each document to have Euclidean length 1 (each row is divided by the square root of its sum of squared entries). Rescaling in this way means that the length of a document (the number of words) does not change the scale of the vectorized representation. You can try this on the toy dataset using the following code.

In [28]:
docs = np.array([
        'The sun is shining',
        'The weather is sweet',
        'The sun is shining the weather is sweet and one and one is two'])

from sklearn.feature_extraction.text import TfidfVectorizer
count = TfidfVectorizer()
count.fit(docs)
print("Vocabulary size: {}". format(len(count.vocabulary_)))
print("Vocabulary content:\n {}".format(count.vocabulary_))

#To create the tf-idf weighted bag-of-words representation
bag = count.transform(docs)
print("Bag of words: {}".format(repr(bag)))
print("Dense representation of Bag of word:\n {}". format(bag.toarray()))

Vocabulary size: 9
Vocabulary content:
 {'the': 6, 'sun': 4, 'is': 1, 'shining': 3, 'weather': 8, 'sweet': 5, 'and': 0, 'one': 2, 'two': 7}
Bag of words: <3x9 sparse matrix of type '<class 'numpy.float64'>'
	with 17 stored elements in Compressed Sparse Row format>
Dense representation of Bag of word:
 [[0.         0.43370786 0.         0.55847784 0.55847784 0.
  0.43370786 0.         0.        ]
 [0.         0.43370786 0.         0.         0.         0.55847784
  0.43370786 0.         0.55847784]
 [0.50238645 0.44507629 0.50238645 0.19103892 0.19103892 0.19103892
  0.29671753 0.25119322 0.19103892]]
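
The numbers above can be reproduced by hand from the formula. The following plain-Python sketch applies the tf-idf expression and then the L2 normalization; it assumes the same default tokenization as before, and the variable names are ours.

```python
import math
import re
from collections import Counter

docs = [
    'The sun is shining',
    'The weather is sweet',
    'The sun is shining the weather is sweet and one and one is two',
]

def tokenize(text):
    return re.findall(r"\b\w\w+\b", text.lower())

vocab = sorted({tok for doc in docs for tok in tokenize(doc)})
n_docs = len(docs)  # N in the formula

# N_w: number of documents containing each word.
doc_freq = Counter(tok for doc in docs for tok in set(tokenize(doc)))
idf = {w: math.log((n_docs + 1) / (doc_freq[w] + 1)) + 1 for w in vocab}

tfidf = []
for doc in docs:
    tf = Counter(tokenize(doc))
    row = [tf[w] * idf[w] for w in vocab]
    norm = math.sqrt(sum(v * v for v in row))  # Euclidean (L2) length of the row
    tfidf.append([v / norm for v in row])

for row in tfidf:
    print([round(v, 4) for v in row])
```

Words that occur in every document, such as "is" and "the", get the smallest possible idf of 1, while "and" and "one", which occur in a single document, get the largest.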


Now, you will adapt this code for the movie reviews dataset. 
Keep in mind that the tf-idf scaling is meant to find words that distinguish documents, but it is a purely unsupervised technique. So, "important" here does not necessarily relate to the "positive review" and "negative review" label we are interested in.

In [29]:
#Building the vocabulary and the TF-IDF bag of words
countTFIDF = TfidfVectorizer(min_df=5, tokenizer=tokenizer_snowballStemmer, stop_words=stop).fit(text_train)
#HERE YOUR CODE
XTFIDF_train = countTFIDF.transform(text_train)
XTFIDF_test = countTFIDF.transform(text_test)
print("X_train:\n{}".format(repr(XTFIDF_train)))

grid=grid_search(XTFIDF_train, y_train, XTFIDF_test, y_test,param_grid)

X_train:
<25000x34620 sparse matrix of type '<class 'numpy.float64'>'
	with 2439240 stored elements in Compressed Sparse Row format>
Best cross-validation score: 0.88
Best parameters:  {'C': 10}
Test score: 0.89


### Investigating Model Coefficients

Finally, let's look in a bit more detail into what our logistic regression model actually learned from the data. Because there are so many features we clearly cannot look at all of the coefficients at the same time. 
However, we can look at the largest coefficients, and see which words these correspond to. 

The following code first prints the features with the lowest idf, then draws a bar chart showing the largest and smallest coefficients of the logistic regression model.

In [30]:
# Show coefficients
# Note: get_feature_names() was removed in scikit-learn 1.2;
# on recent versions use countTFIDF.get_feature_names_out() instead.
feature_names = np.array(countTFIDF.get_feature_names())
sorted_by_idf = np.argsort(countTFIDF.idf_)
print("Features with lowest idf:\n{}".format(feature_names[sorted_by_idf[:100]]))

import mglearn
import matplotlib.pyplot as plt
mglearn.tools.visualize_coefficients(
        grid.best_estimator_.coef_,
        feature_names, n_top_features=40)

plt.show()

Features with lowest idf:
['/><br' 'movi' 'one' 'film' 'like' 'make' 'see' 'get' 'veri' 'watch'
 'even' 'good' 'onli' 'would' 'time' 'realli' 'charact' 'stori' 'much'
 'look' 'go' 'becaus' 'think' 'first' 'could' 'also' 'great' 'ani' 'peopl'
 'scene' 'made' 'love' 'play' '/>the' 'thing' 'act' 'seem' 'bad' 'know'
 'end' 'want' 'mani' 'come' 'way' 'take' 'never' 'show' 'say' 'it.' 'well'
 'give' 'two' 'tri' 'littl' 'movie.' 'seen' 'doe' 'ever' 'best' 'find'
 'plot' 'work' 'still' 'actor' '-' 'better' 'use' 'year' 'film.' 'feel'
 'actual' 'someth' 'lot' '<br' 'part' 'back' 'whi' 'movie,' 'real' 'film,'
 "i'm" 'everi' 'perform' 'anoth' 'enjoy' 'interest' '/>i' 'man' 'noth'
 'director' 'turn' 'quit' 'life' "can't" 'befor' 'start' 'new' 'got'
 'live' 'thought']


<Figure size 1500x500 with 1 Axes>

The negative coefficients on the left belong to words that, according to the model, are indicative of negative reviews, while the positive coefficients on the right belong to words that, according to the model, indicate positive reviews. Most of the terms are quite intuitive, like "worst" and "bad" indicating bad movie reviews, while "great" and "enjoy" indicate positive movie reviews.
<i>mglearn</i> is a helper library of plotting utilities, used here to visualize the coefficients.
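
If <i>mglearn</i> is not available, the words behind the bars can be recovered by sorting the coefficients. The sketch below shows the idea in plain Python; the feature names and coefficient values are made up for illustration and are not the fitted model's values.

```python
# Hypothetical stems and coefficient values, made up for illustration only.
feature_names = ["great", "plot", "worst", "enjoy", "bad"]
coef = [1.9, 0.05, -2.1, 1.2, -1.4]

n_top = 2
# Indices sorted from the most negative to the most positive coefficient.
order = sorted(range(len(coef)), key=lambda i: coef[i])

most_negative = [feature_names[i] for i in order[:n_top]]
most_positive = [feature_names[i] for i in order[-n_top:]][::-1]

print("negative:", most_negative)
print("positive:", most_positive)
```

With the real model, the same idea would be applied to the first row of <samp>grid.best_estimator_.coef_</samp> together with the array <samp>feature_names</samp>.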