# BLU07 - Part 3 of 3 - Sentiment analysis

In [1]:
# Standard imports
import numpy as np
import pandas as pd
from collections import Counter, OrderedDict
import re
import string

# NLTK imports
from nltk.tokenize import WordPunctTokenizer
from nltk.stem.snowball import SnowballStemmer
from nltk.corpus import stopwords

# SKLearn related imports
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.base import TransformerMixin, BaseEstimator
from sklearn import preprocessing

from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

We learned about _preprocessing_ of text data in notebook 1 and then about creating features from text data or _vectorization_ in notebook 2. Here we're going to look at vectorization implementations in sklearn and finally add a machine learning model and so create a complete workflow for sentiment analysis.

Let's load our movie reviews data again and split it into a train and test set:

In [2]:
df = pd.read_csv('./data/imdb_sentiment.csv')
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

## 1. Data cleaning and preprocessing
We want to look very professional, so we're going to use pipelines. We'll be using sklearn vectorizers and models later on which are designed to fit into a pipeline, but we have to prepare our preprocessing steps to fit into a pipeline too. We're going to achieve this by building a preprocessing transformer. You already saw transformers a few times before, so hopefully they are less daunting by now. It's good to get used to pipelines because they make the workflow much more manageable.

In [3]:
class TextCleanerTransformer(TransformerMixin, BaseEstimator):
    """
    Custom transformer to clean movie reviews data.
    Removes unwanted strings like html tags, lowercases,
    removes punctuation, tokenizes, and stems the data.
    """
    def __init__(self, tokenizer, stemmer, regex_list=[('','')],
                 lower=True, remove_punct=True):
        self.tokenizer = tokenizer
        self.stemmer = stemmer
        self.regex_list = regex_list # list of tuples [(pattern, sub)] for regex substitution
        self.lower = lower
        self.remove_punct = remove_punct
     
    def fit(self, X, y=None):
        return self   
        
    def transform(self, X):
        X = X.map(self._clean_document)
        return X

    def _clean_document(self, doc):
        """ 
        Helper function to perform the cleaning.
        """
        for regex in self.regex_list:
            doc = re.sub(regex[0], regex[1], doc)
        if self.lower:
            doc = doc.lower()
        tokens = self.tokenizer.tokenize(doc)
        if self.remove_punct:
            pattern = re.compile("[" + re.escape(string.punctuation) + "]")
            sentence = " ".join(tokens)
            tokens = re.sub(pattern, '', sentence).split()
        if self.stemmer:
            tokens = map(self.stemmer.stem, tokens)
        return " ".join(tokens)

We just created a custom transformer class with a `transform()` method that will apply the private method `_clean_document` to every item of its input `X`. You need to choose a tokenizer and a stemmer when initializing the transformer. You can also give the transformer a list of tuples of type (pattern, sub) which will be used in a regex substitution.
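To make the individual steps concrete, here is a minimal sketch of the same cleaning logic applied to a single string. It assumes a plain regex (`\w+|[^\w\s]+`) as a stand-in for the tokenizer and skips stemming, so it runs without NLTK. `clean_document` is a hypothetical helper for illustration, not part of the transformer above.

```python
import re
import string

def clean_document(doc, regex_list=(("<[^>]*>", ""),)):
    # 1. Regex substitutions (here: strip HTML tags)
    for pattern, sub in regex_list:
        doc = re.sub(pattern, sub, doc)
    # 2. Lowercase
    doc = doc.lower()
    # 3. Tokenize: runs of word characters, or runs of punctuation
    tokens = re.findall(r"\w+|[^\w\s]+", doc)
    # 4. Remove punctuation characters and re-split on whitespace
    punct = re.compile("[" + re.escape(string.punctuation) + "]")
    tokens = re.sub(punct, "", " ".join(tokens)).split()
    return " ".join(tokens)

cleaned = clean_document("I didn't like it.<br />Not at all!")
print(cleaned)  # i didn t like it not at all
```

Note how the apostrophe in "didn't" splits the word into `didn` and `t`, which is the same effect you will see in the real output below.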

Let's use the transformer with a `WordPunctTokenizer`, a `SnowballStemmer`, and the regex for HTML tag substitution that we used before.

In [4]:
tokenizer = WordPunctTokenizer()
stemmer = SnowballStemmer("english", ignore_stopwords=True)
regex_list = [("<[^>]*>", "")]

cleaner = TextCleanerTransformer(tokenizer, stemmer, regex_list)
train_preprocessed = cleaner.transform(train_df['text'])

Let's look at an output example:

In [5]:
train_preprocessed.iloc[100]

'this is a film about life the triumph over advers and the wonder of the human spirit i defi anyon not to shed a tear by the end of the movi this is more than just a tear jerker its an engag thought provok drama with excel perform from all the cast but especi derek luke and denzel washington 7 year on i m amaz that luke is still a virtual unknown and washington only direct one other film nevertheless apart from a slow build up the stori of this foster child s trial and tribul and how it still affect him in adulthood is the sort of movi that stay with you long after you have seen it like mani fox searchlight pictur this was more of a sleeper hit and didn t get the mass critic acclaim it deserv the scene where antwon final meet his mother sum up the movi for me there were so mani way that could have been done and it could have been all schmaltzi or it could have been unrealist but washington struck exact the right tone his mother never said a word and could only shed a tear while antown 

Cool! Now onto what we learned in notebook 2, but this time with real vectorizer classes.

## 2. Sklearn vectorizers

As you can imagine, there are many implementations of Bag of Words and TF-IDF. We're hardcore scikit-learn fans here, so we're going to use scikit-learn implementations. We will look at three different ways to create the features.

### 2.1 Bag of words - CountVectorizer
The BoW from sklearn is the [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html). We still haven't removed stopwords as you may have noticed. We're going to do it with the `CountVectorizer()` - we just pass the parameter `stop_words='english'`. You can also give this parameter a custom list of stopwords if your application requires something special (there are many stopwords lists out there).

In [6]:
vectorizer = CountVectorizer(stop_words='english')
vectorizer.fit(train_preprocessed)
train_bow = vectorizer.transform(train_preprocessed)

The output of the vectorizer is a sparse matrix where the rows are samples and the columns are feature values - word counts. 
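A toy example may help to picture this layout. The sketch below uses two made-up, already stemmed documents (an assumption for illustration, not rows from the reviews data) and converts the result to a dense array, which is only sensible for tiny inputs.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up documents; "the" and "was" are English stopwords
toy_docs = ["the plot was excel excel", "the plot was bore"]

toy_vectorizer = CountVectorizer(stop_words="english")
toy_bow = toy_vectorizer.fit_transform(toy_docs)

# Order the terms by their column index
toy_columns = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
print(toy_columns)        # ['bore', 'excel', 'plot']
print(toy_bow.toarray())  # one row per document, one column per term
```

The first row is `[0, 2, 1]`: no "bore", two times "excel", one time "plot".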

In [7]:
train_bow

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 350769 stored elements and shape (4000, 23715)>

We're going to look into sparse matrices later on, for now just know that it's a matrix with mostly zeros and just a few non-zero elements. Because it is so huge and mostly empty, it doesn't make sense to waste memory to store it in the usual way, so it's storing just the non-zero elements in a very smart way.
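As a small preview, the following sketch builds a sparse matrix by hand with SciPy's `csr_matrix` (the same format that the vectorizer returns) and shows what is actually stored. The numbers are made up for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix

dense = np.array([[0, 2, 0, 0, 1],
                  [0, 0, 0, 0, 0],
                  [3, 0, 0, 0, 0]])
sparse = csr_matrix(dense)

print(sparse.nnz)      # number of stored (non-zero) elements: 3
print(sparse.data)     # the stored values, row by row: [2 1 3]
print(sparse.indices)  # the column index of each stored value: [1 4 0]

# Fraction of cells that are non-zero
density = sparse.nnz / (sparse.shape[0] * sparse.shape[1])
print(density)         # 0.2
```

Only 3 of the 15 cells are kept in memory; for our reviews the saving is far bigger.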

We can feed this matrix directly into a ML model, but let's examine it first. Seeing the nonzero elements requires a bit of work. We can get the indices of the nonzero elements like this (for a random document number 12):

In [8]:
train_bow[12].nonzero()

(array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32),
 array([   44,   511,   599,   635,   744,  1015,  1519,  1551,  1573,
         1589,  1641,  1708,  2234,  3842,  4115,  4443,  4465,  4513,
         4517,  4559,  4745,  4863,  4873,  5234,  5703,  5927,  6488,
         7027,  7695,  7968,  7985,  8198,  8569,  8748,  8932,  9376,
        10121, 10210, 10568, 11614, 11628, 11924, 12297, 12576, 12634,
        12859, 12883, 13647, 14021, 14117, 14485, 14598, 15080, 15524,
        15601, 15774, 16050, 16334, 17064, 17425, 18278, 18506, 18616,
        19012, 19013, 19059, 19551, 19937, 20100, 20382, 21079, 21252,
        22079, 22628, 23122, 23334, 23476], dtype=int32))

It's a 2D matrix, so we get two index arrays: the row indices, which are all 0 because we selected a single row, and the column indices of the non-zero elements. The values of the non-zero elements are:

In [9]:
train_bow[12][train_bow[12].nonzero()]

matrix([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])

But for which words are these word counts? The words are given by the indices and the vectorizer provides a vocabulary to translate the indices to words. Let's preview a sample of the vocabulary.

In [10]:
vocabulary = vectorizer.vocabulary_
print("\nNumber of distinct words:", len(vocabulary))
print("Small sample of the vocabulary:", list(vocabulary.items())[0:20])


Number of distinct words: 23715
Small sample of the vocabulary: [('watch', 22872), ('dvd', 6558), ('movi', 14021), ('come', 4443), ('excel', 7228), ('commentari', 4463), ('track', 21381), ('english', 6939), ('cambodia', 3353), ('subtitl', 20247), ('say', 18224), ('charact', 3820), ('speak', 19613), ('thai', 20953), ('violent', 22628), ('evil', 7203), ('man', 12920), ('rais', 16896), ('boy', 2800), ('killer', 11614)]


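There are 23715 distinct words in our training set reviews, and each review contains only a tiny fraction of them: the matrix stores 350769 non-zero elements for 4000 reviews, which is about 88 per review on average. That explains why we need a sparse matrix to store the data.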

Ok, let's proceed in our quest to take a peek at the features of document 12. This vocabulary maps the words to the indices, so we're going to make an inverse vocabulary mapping the indices to the words.

In [11]:
inv_vocabulary = {v: k for k, v in vocabulary.items()}
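The idea can be checked on a hand-made vocabulary. In the sketch below, `toy_vocabulary`, `col_indices` and `counts` are made-up values standing in for the vectorizer vocabulary and one row of the count matrix.

```python
# A tiny word -> column index mapping, like vectorizer.vocabulary_
toy_vocabulary = {"plot": 2, "excel": 1, "bore": 0}

# Invert it to get column index -> word
toy_inverse = {index: word for word, index in toy_vocabulary.items()}

# One row of a count matrix, given as (column indices, counts)
col_indices = [1, 2]
counts = [2, 1]

toy_wordcounts = [(toy_inverse[i], c) for i, c in zip(col_indices, counts)]
print(toy_wordcounts)  # [('excel', 2), ('plot', 1)]
```

Inverting works because every word has exactly one index and every index exactly one word.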

Now we extract feature names - the words - and the corresponding word counts.

In [12]:
# Indices of non-zero features
feat_indices = train_bow[12].nonzero()[1]

# Extract the corresponding word and count
wordcounts = [(inv_vocabulary[i], train_bow[12,i]) for i in feat_indices]

Let's see what we got:

In [13]:
print('Document 12:', train_preprocessed.iloc[12], '\n')
for word, count in wordcounts:
    print(word, ": ", count)

Document 12: this is the kind of movi which show the pauciti of french cinema when it come to make thriller the director s desir to sound american is so glare that you will not be fool a minut unless you have not seen a serial killer movi sinc peep tom two male cop or one and a half more like as you will see horribl murder a plot more complic than complex charl berl is not lucki with the genr see the astoundl dumb l inconnu de strasbourg a coupl of year ago the scene with his pregnant wife which are suppos to be a counterpart for the otherwis noir atmospher of the rest of the plot are among the worst ever film add a steami love scene between them and a gori autopsi to get a pg 12 and thus to attract the huge adolesc audienc a violent and absurd conclus follow by a silent epilogu who could make a nice commerci for the côte d azur it s realli the silenc of the lame 

12 :  1
absurd :  1
add :  1
adolesc :  1
ago :  1
american :  1
astoundl :  1
atmospher :  1
attract :  1
audienc :  1
au

### 2.2 Tf-idf - TfidfTransformer
We can further process the word counts to a tf-idf vectorization with the [TfidfTransformer()](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer).

In [14]:
tfidf = TfidfTransformer()
tfidf.fit(train_bow)

train_tfidf = tfidf.transform(train_bow)

The output is again a sparse matrix of the same size as the input. The word counts were just transformed to term frequencies scaled by inverse document frequency. If we want to know the feature names, we can use the vocabulary created by the CountVectorizer.
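To see what the scaling does, here is a small sketch that recomputes tf-idf by hand on a made-up count matrix and compares it with `TfidfTransformer`. It assumes the default settings, where scikit-learn documents a smoothed idf of `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization of each row.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import TfidfTransformer

# Toy count matrix: 2 documents, 3 terms (made-up numbers)
counts = np.array([[0, 2, 1],
                   [1, 0, 1]])
n_docs = counts.shape[0]

# Document frequency: in how many documents each term appears
doc_freq = (counts > 0).sum(axis=0)

# Smoothed idf, as used with smooth_idf=True (the default)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Scale the counts by idf, then L2-normalize each row (norm="l2", the default)
manual = counts * idf
manual = manual / np.linalg.norm(manual, axis=1, keepdims=True)

sk_tfidf = TfidfTransformer().fit_transform(csr_matrix(counts)).toarray()
print(np.allclose(manual, sk_tfidf))  # True
```

The third term appears in both documents, so it gets the lowest idf and contributes less than the rarer terms.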

In [15]:
train_tfidf

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 350769 stored elements and shape (4000, 23715)>

We can also use this matrix as input for a model, but before doing that, we're going to create another vectorization with N-grams.

### 2.3 Bag of words with N-grams

As we learned in the first notebook, we can use not only single words, unigrams, as features but also a combination of words, known as N-grams. By using `CountVectorizer()` with the parameter `ngram_range`, we can easily get all the possibilities in a chosen range of N. This parameter takes a tuple `(min_n, max_n)`, where `min_n` is the smallest N-gram size (e.g., 1 for unigrams), and `max_n` is the largest N-gram size (e.g., 3 for trigrams). For example, if we wanted to extract all unigrams and bigrams in a corpus we would pass `ngram_range=(1,2)` to the vectorizer.
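Here is a quick sketch of what `ngram_range=(1, 2)` produces for a single made-up sentence:

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_ngrams = CountVectorizer(ngram_range=(1, 2))
toy_ngrams.fit(["not fun at all"])

ngram_terms = sorted(toy_ngrams.vocabulary_)
print(ngram_terms)
# ['all', 'at', 'at all', 'fun', 'fun at', 'not', 'not fun']
```

Four unigrams plus three bigrams: even a four-word sentence already yields seven features.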

Let's get all unigrams, bigrams, and trigrams from our movie reviews.

In [16]:
vectorizer_123_grams = CountVectorizer(stop_words='english', ngram_range=(1,3))
vectorizer_123_grams.fit(train_preprocessed)
train_bow_123_grams = vectorizer_123_grams.transform(train_preprocessed)
train_bow_123_grams.shape

(4000, 773254)

As you can see, we end up with a much larger feature space - which can be quite computationally expensive for our model! Fortunately, there are some parameters in `CountVectorizer()` that we can use to reduce our feature space while keeping the representation as informative as possible:

- `max_features` - receives an `int` that will be the size of the feature space of the model. The chosen N features will be the ones with the highest term frequency across the corpus. This goes a bit against the idea that rarer words are more interesting, but on the other hand words that are too rare and appear only once in the whole corpus won't bring us too far either.
- `min_df` - the minimum document frequency an N-gram must have to be considered. Often called the *cut-off*. An `int` is read as an absolute number of documents, a `float` as a proportion of documents. Here we are selecting N-grams that appear in a reasonable minimal number of documents.
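The effect of both parameters can be seen on a tiny made-up corpus. The values `min_df=2` and `max_features=1` below are chosen only to make the toy example work, not as recommendations for the reviews data.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ["good good movi", "good plot", "bad movi"]

# No restriction: every distinct term becomes a feature
all_terms = sorted(CountVectorizer().fit(toy_docs).vocabulary_)

# min_df=2: keep terms that appear in at least 2 documents
frequent_terms = sorted(CountVectorizer(min_df=2).fit(toy_docs).vocabulary_)

# max_features=1: keep only the term with the highest total count
top_terms = sorted(CountVectorizer(max_features=1).fit(toy_docs).vocabulary_)

print(all_terms)       # ['bad', 'good', 'movi', 'plot']
print(frequent_terms)  # ['good', 'movi']
print(top_terms)       # ['good']
```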

## 3. Modeling pipeline

Now that we learned how to create features with sklearn, let's add a model and prepare a workflow to predict if a review is positive or negative. In NLP, we call this task _sentiment analysis_.

<img src="./media/dwight.jpg" width="400">

We start by putting all our workflow into a pipeline.

We will use Scikit's implementation of Naive Bayes as the classifier. You can read about this classifier in the optional learning notebook.

In [17]:
text_clf = Pipeline([('prep', TextCleanerTransformer(tokenizer, stemmer, regex_list)),
                   ('vect', CountVectorizer(stop_words='english')),
                   ('tfidf', TfidfTransformer()),
                   ('clf', MultinomialNB())])
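To see how such a pipeline behaves end to end, here is a minimal sketch on four made-up, already cleaned documents. It leaves out the `prep` step so that it runs without NLTK. The texts and labels are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

toy_texts = ["great fun movi", "love great act", "bore bad movi", "bad plot bore"]
toy_labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

toy_clf = Pipeline([("vect", CountVectorizer()),
                    ("tfidf", TfidfTransformer()),
                    ("clf", MultinomialNB())])

# fit() runs fit_transform on every step and fits the final classifier
toy_clf.fit(toy_texts, toy_labels)

# predict() pushes new text through the same fitted steps
toy_pred = toy_clf.predict(["great act", "bad bore"])
print(toy_pred)

# Individual fitted steps stay accessible by name
n_features = len(toy_clf.named_steps["vect"].vocabulary_)
print(n_features)  # 8
```

A single `fit` and `predict` call replaces the separate `fit`/`transform` calls we made for each vectorizer in section 2.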

The final missing piece is converting the sentiment labels into numeric labels through the Scikit [Label Encoder](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html) tool.

In [18]:
le = preprocessing.LabelEncoder()
le.fit(train_df['sentiment'])

train_df['sentiment'] = le.transform(train_df['sentiment'])
test_df['sentiment'] = le.transform(test_df['sentiment'])
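A short sketch of what the encoder does, using the hypothetical label strings `"positive"` and `"negative"` (the actual strings in the dataset may differ):

```python
from sklearn.preprocessing import LabelEncoder

toy_le = LabelEncoder()
toy_le.fit(["positive", "negative", "positive"])

# Classes are sorted, and each class gets its position as the numeric label
print(toy_le.classes_)  # ['negative' 'positive']

encoded = toy_le.transform(["positive", "negative"])
print(encoded)          # [1 0]

# inverse_transform maps numeric predictions back to the original labels
decoded = toy_le.inverse_transform([0, 1])
print(decoded)          # ['negative' 'positive']
```

Fitting the encoder on the training labels only and reusing it for the test labels keeps the mapping identical in both sets.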

Let's look at our final input dataset:

In [19]:
train_df

| Unnamed: 0 | sentiment | text |
|---|---|---|
| 4227 | 1 | I watched the DVD of this movie which also com... |
| 4676 | 0 | You, know, I can take the blood and the sex, b... |
| 800 | 0 | I must say that I am fairly disappointed by th... |
| 3671 | 0 | I saw this not too long ago, and I must say: T... |
| 4193 | 0 | *** WARNING! SPOILERS CONTAINED HEREIN! ***<br... |
| ... | ... | ... |
| 4426 | 0 | Seeing as I hate reading long essays hoping to... |
| 466 | 0 | If an auteur gives himself 2 credits before th... |
| 3092 | 1 | Perspective is a good thing. Since the release... |
| 3772 | 0 | The shame of it! There I was, comfortable in t... |


Finally, we feed the training data into the pipeline and fit the model.

In [20]:
text_clf.fit(train_df['text'], train_df['sentiment'])

Then we predict on the test set and calculate the accuracy.

In [21]:
predicted = text_clf.predict(test_df['text'])
np.mean(predicted == test_df['sentiment'])

0.844

Not bad!
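The expression `np.mean(predicted == labels)` is simply the fraction of correct predictions. The sketch below checks this against scikit-learn's `accuracy_score` on made-up arrays:

```python
import numpy as np
from sklearn.metrics import accuracy_score

y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0])

# Comparing gives booleans; their mean is the share of True values
manual_accuracy = np.mean(y_pred == y_true)
sk_accuracy = accuracy_score(y_true, y_pred)

print(manual_accuracy, sk_accuracy)  # 0.8 0.8
```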

Let's also try using bigrams:

In [22]:
text_clf_bigrams = Pipeline([('prep', TextCleanerTransformer(tokenizer, stemmer, regex_list)),
                   ('vect', CountVectorizer(stop_words='english', ngram_range=(1,2))),
                   ('tfidf', TfidfTransformer()),
                   ('clf', MultinomialNB())])

text_clf_bigrams.fit(train_df['text'], train_df['sentiment'])

predicted_bigrams = text_clf_bigrams.predict(test_df['text'])
np.mean(predicted_bigrams == test_df['sentiment'])

0.827

Interestingly, the performance on the test set got worse! This is an example of how removing stopwords can hurt your model. If you look at the stopwords list, words like "no" are part of it. This matters for the bigram representation: once those words are removed, relevant bigrams (e.g., "no fun") never appear in the feature space.
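You can verify this on a one-sentence example. The sketch below checks scikit-learn's built-in English list (`ENGLISH_STOP_WORDS`) and compares the bigram vocabulary with and without stopword removal. The sentence is made up.

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Negation words are part of the built-in English stopword list
print("no" in ENGLISH_STOP_WORDS, "not" in ENGLISH_STOP_WORDS)  # True True

doc = ["this movi is no fun"]
with_stop = CountVectorizer(stop_words="english", ngram_range=(1, 2)).fit(doc)
without_stop = CountVectorizer(ngram_range=(1, 2)).fit(doc)

# Stopwords are dropped before the bigrams are built
print(sorted(with_stop.vocabulary_))         # ['fun', 'movi', 'movi fun']
print("no fun" in without_stop.vocabulary_)  # True
```

With stopword removal the review turns into "movi fun", which reads as the opposite of what was written.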

Let's remove the `stop_words` parameter from the CountVectorizer.

In [23]:
text_clf_bigrams_stopwords = Pipeline([('prep', TextCleanerTransformer(tokenizer, stemmer, regex_list)),
                   ('vect', CountVectorizer(ngram_range=(1,2))),
                   ('tfidf', TfidfTransformer()),
                   ('clf', MultinomialNB())])

text_clf_bigrams_stopwords.fit(train_df['text'], train_df['sentiment'])

predicted_bigrams_stopwords = text_clf_bigrams_stopwords.predict(test_df['text'])
np.mean(predicted_bigrams_stopwords == test_df['sentiment'])

0.837

A bit better, but still not as good as the unigram model. This can be due to the feature space having too many dimensions, which can hurt model performance. Fortunately, we learned that `CountVectorizer()` has parameters that help to reduce feature dimensionality.

You should try and see if you can get higher accuracy by reducing dimensionality with `max_features` and `min_df` (Spoiler alert: you will!). You can then also try to use trigrams in addition to unigrams and bigrams.

You can also try leaving out the tfidf transformation and use just word counts as features.

Try out different classifiers and play around with `CountVectorizer()` and `TfidfTransformer()` parameters to see if you can get better scores!
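One convenient way to run such experiments is `set_params`, which changes a step's parameter (`step__param`) or replaces a whole step by name. The sketch below shows the mechanics on a made-up toy corpus. The chosen values (`min_df=2`, a linear `SVC`) are assumptions for illustration, not tuned settings for the reviews data.

```python
from sklearn.base import clone
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

toy_texts = ["great fun movi", "great act fun", "bad bore movi", "bad plot bore"]
toy_labels = [1, 1, 0, 0]

base = Pipeline([("vect", CountVectorizer()),
                 ("tfidf", TfidfTransformer()),
                 ("clf", MultinomialNB())])

# Baseline: an unfitted copy of the pipeline, fitted as is
baseline = clone(base).fit(toy_texts, toy_labels)

# Variant 1: keep only terms that appear in at least 2 documents
pruned = clone(base).set_params(vect__min_df=2).fit(toy_texts, toy_labels)

# Variant 2: skip the tf-idf step and feed raw counts to the classifier
counts_only = clone(base).set_params(tfidf="passthrough").fit(toy_texts, toy_labels)

# Variant 3: swap the classifier for a linear SVM
svm = clone(base).set_params(clf=SVC(kernel="linear")).fit(toy_texts, toy_labels)

full_size = len(baseline.named_steps["vect"].vocabulary_)
pruned_size = len(pruned.named_steps["vect"].vocabulary_)
print(full_size, pruned_size)  # 7 5
```

Each variant can then be scored on the test set exactly like the pipelines above.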

And when you're ready, move to the exercise notebook.