In [1]:
# Standard imports
import numpy as np
import pandas as pd
from collections import Counter, OrderedDict
import re
import string
import warnings; warnings.simplefilter('ignore')

# NLTK imports
from nltk.tokenize import WordPunctTokenizer
from nltk.stem.snowball import SnowballStemmer
from nltk.corpus import stopwords

# SKLearn related imports
import sklearn
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.base import TransformerMixin
from sklearn import preprocessing

from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics.pairwise import cosine_similarity

# Load the dataset
df = pd.read_csv('./data/imdb_sentiment.csv')

# Get the text
docs = df['text']

# Split in train and validation
train_df, validation_df = train_test_split(df, test_size=0.2, random_state=42)

# Putting all of this in practice

In Learning Notebook - Part 1 of this BLU, you learned how to _preprocess_ your text data into something easier for the machine to process and vectorize later. In practice, we can wrap those steps in a class that transforms our data into those _easier to process_ strings.

In [2]:
# Custom transformer to implement sentence cleaning
class TextCleanerTransformer(TransformerMixin):
    def __init__(self, tokenizer, stemmer, regex_list,
                 lower=True, remove_punct=True):
        self.tokenizer = tokenizer
        self.stemmer = stemmer
        self.regex_list = regex_list
        self.lower = lower
        self.remove_punct = remove_punct
        
    def transform(self, X, *_):
        X = list(map(self._clean_sentence, X))
        return X
    
    def _clean_sentence(self, sentence):
        
        # Replace given regexes
        for regex in self.regex_list:
            sentence = re.sub(regex[0], regex[1], sentence)
            
        # lowercase
        if self.lower:
            sentence = sentence.lower()

        # Split sentence into list of words
        words = self.tokenizer.tokenize(sentence)
            
        # Remove punctuation
        if self.remove_punct:
            words = list(filter(lambda x: x not in string.punctuation, words))

        # Stem words
        if self.stemmer:
            words = map(self.stemmer.stem, words)

        # Join list elements into string
        sentence = " ".join(words)
        
        return sentence
    
    def fit(self, *_):
        return self

We just created a class whose `transform()` method applies `_clean_sentence()` to every sentence of its input `X`. The tokenizer and the stemmer are constructor arguments, so you can plug in whichever ones you prefer. You can also give the class a list of `(pattern, replacement)` tuples: every match of each regex pattern is replaced by the corresponding string before tokenization.
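One detail of `_clean_sentence()` is worth knowing: the punctuation filter uses `x not in string.punctuation`, which is a substring check. A token made of several punctuation characters, such as `".` or `),`, is not a substring of `string.punctuation` and therefore survives the filter. The sketch below illustrates the regex substitution and this behaviour with plain Python. The sample sentence, the token list, and the `is_punctuation` helper are made up for illustration and are not part of the class above.

```python
import re
import string

# (pattern, replacement) tuples, in the same shape as regex_list
regex_list = [(r"<[^>]*>", "")]

sentence = "Loved it!<br />A must-see."
for pattern, replacement in regex_list:
    sentence = re.sub(pattern, replacement, sentence)
print(sentence)  # the HTML tag is gone

# Hypothetical tokens, including a multi-character punctuation token
tokens = ["peep", "tom", '".', "two", ")", ","]

# Substring check, as in _clean_sentence(): '".' slips through
kept_substring = [t for t in tokens if t not in string.punctuation]
print(kept_substring)

# Stricter alternative (an assumption, not the notebook's method):
# drop a token only if every character is punctuation
def is_punctuation(token):
    return all(ch in string.punctuation for ch in token)

kept_strict = [t for t in tokens if not is_punctuation(t)]
print(kept_strict)
```

If leftover tokens like these matter for your task, swapping the stricter check into the filter is one option.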

Let's use the same tokenizer, stemmer, and HTML regex that we used before.

In [3]:
# Initialize a tokenizer and a stemmer
tokenizer = WordPunctTokenizer()
stemmer = SnowballStemmer("english", ignore_stopwords=True)
regex_list = [("<[^>]*>", "")
             ]

cleaner = TextCleanerTransformer(tokenizer, stemmer, regex_list)
docs = cleaner.transform(train_df.text.values)

Let's look at an output example:

In [4]:
docs[100]

'this is a film about life the triumph over advers and the wonder of the human spirit i defi anyon not to shed a tear by the end of the movi this is more than just a tear jerker its an engag thought provok drama with excel perform from all the cast but especi derek luke and denzel washington 7 year on i m amaz that luke is still a virtual unknown and washington only direct one other film nevertheless apart from a slow build up the stori of this foster child s trial and tribul and how it still affect him in adulthood is the sort of movi that stay with you long after you have seen it like mani fox searchlight pictur this was more of a sleeper hit and didn t get the mass critic acclaim it deserv the scene where antwon final meet his mother sum up the movi for me there were so mani way that could have been done and it could have been all schmaltzi or it could have been unrealist but washington struck exact the right tone his mother never said a word and could only shed a tear while antown 

Cool! Now onto what we learned in Learning Notebook - Part 2.

As you can imagine, there are many implementations on the internet of Bag of Words and TF-IDF. We're going to use scikit-learn from now on to compute the feature representations learned in this BLU.

Our BoW representations, for instance, can be computed with scikit-learn's [CountVectorizer()](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html). We still haven't removed stopwords, as you may have noticed. However, removing stopwords is easy with `CountVectorizer()` - just pass `stop_words='english'`! This removes the stopwords in scikit-learn's built-in English list from your corpus. You can also give this parameter a list of strings, if you prefer.

In [5]:
vectorizer = CountVectorizer(stop_words='english')

In [6]:
vectorizer.fit(docs)

# Looking at a small sample of the vocabulary:
vocabulary = list(vectorizer.vocabulary_.keys())
print("Small sample of the vocabulary:", vocabulary[0:20])

# Number of words in the vocabulary
print("\nNumber of distinct words:", len(vocabulary))

Small sample of the vocabulary: ['watch', 'dvd', 'movi', 'come', 'excel', 'commentari', 'track', 'english', 'cambodia', 'subtitl', 'say', 'charact', 'speak', 'thai', 'violent', 'evil', 'man', 'rais', 'boy', 'killer']

Number of distinct words: 23740


Looking at a random sample document, for example document 12, we can visualize our bag of words representation:

In [7]:
doc = [docs[12]]
print(doc[0], '\n')

# Transform sentence into bag of words representation
word_count_doc = vectorizer.transform(doc)

# Find the indexes of the words which appear in the sentence
_, columns = word_count_doc.nonzero()

# Get the inverse map to map vector indexes to words
vocabulary = vectorizer.vocabulary_
inv_map = {v: k for k, v in vocabulary.items()}

# Extract the corresponding word and count
counts = [(inv_map[i], word_count_doc[0, i]) for i in columns]

for word, count in counts:
    print(word, ": ", count)

this is the kind of movi which show the pauciti of french cinema when it come to make thriller the director s desir to sound american is so glare that you will not be fool a minut unless you have not seen a serial killer movi sinc peep tom ". two male cop or one and a half more like as you will see ), horribl murder a plot more complic than complex charl berl is not lucki with the genr see the astoundl dumb l inconnu de strasbourg a coupl of year ago ). the scene with his pregnant wife which are suppos to be a counterpart for the otherwis noir atmospher of the rest of the plot are among the worst ever film add a steami love scene between them and a gori autopsi to get a pg 12 and thus to attract the huge adolesc audienc a violent and absurd conclus follow by a silent epilogu who could make a nice commerci for the côte d azur it s realli the silenc of the lame 

12 :  1
absurd :  1
add :  1
adolesc :  1
ago :  1
american :  1
astoundl :  1
atmospher :  1
attract :  1
audienc :  1
autops

We can now get the word counts (Bag of Words representation) for every document by calling the `transform()` method. This returns a sparse matrix where each row is a document and each column holds the count of one vocabulary word.

In [8]:
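# Clean the raw text first so it matches the stemmed vocabulary the vectorizer was fitted on
word_count_matrix = vectorizer.transform(cleaner.transform(df['text'].values))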
word_count_matrix.shape

(5000, 23740)

If we want to do TF-IDF, that can be done on top of our matrix of word counts with [TfidfTransformer()](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer).

In [9]:
tfidf = TfidfTransformer()
tfidf.fit(word_count_matrix)

word_term_frequency_matrix = tfidf.transform(word_count_matrix)
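To make the transformation less of a black box, here is a minimal pure-Python sketch of what `TfidfTransformer()` computes with its default settings (`smooth_idf=True`, `norm='l2'`): each count is multiplied by `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, and each row is then scaled to unit Euclidean length. The toy count matrix and variable names are made up for illustration.

```python
import math

# Toy word-count matrix: 3 documents (rows) x 3 terms (columns)
counts = [
    [2, 1, 0],
    [0, 1, 1],
    [0, 1, 0],
]
n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: in how many documents each term appears
df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed inverse document frequency
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

# Weight the counts by idf, then L2-normalize each row
tfidf_rows = []
for row in counts:
    weighted = [c * w for c, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    tfidf_rows.append([v / norm for v in weighted])

print(idf)
print(tfidf_rows)
```

The middle term appears in every document, so its idf is exactly 1 and it gets the lowest weight of the three terms.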

As we learned in Learning Notebook - Part 1, we can get features not only for each word in our vocabulary (unigrams) but also for N-grams of any size. We can easily get all the possibilities in a chosen range of N by using the `ngram_range` parameter of `CountVectorizer()`. This parameter receives a tuple `(min_n, max_n)` - if we wanted to extract all unigrams and bigrams in a corpus we would pass to the model `ngram_range=(1,2)`.
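As a rough illustration of what a `(min_n, max_n)` range means, the sketch below builds word n-grams from an already tokenized sentence in plain Python. `word_ngrams` is a hypothetical helper written for this example. It is a simplification and skips the tokenization, lowercasing, and stopword handling that `CountVectorizer()` performs.

```python
def word_ngrams(tokens, ngram_range=(1, 2)):
    """Return all word n-grams with min_n <= n <= max_n, joined by spaces."""
    min_n, max_n = ngram_range
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of size n over the token list
        for i in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[i:i + n]))
    return ngrams

features = word_ngrams(["this", "movi", "was", "fun"], ngram_range=(1, 2))
print(features)
```

With four tokens this gives four unigrams and three bigrams, which hints at how quickly the feature space grows as `max_n` increases.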

Let's get all unigrams, bigrams, and trigrams.

In [10]:
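vectorizer_123_grams = CountVectorizer(stop_words='english', ngram_range=(1,3))
vectorizer_123_grams.fit(docs)
# Clean the raw text first so it matches the stemmed vocabulary the vectorizer was fitted on
word_count_matrix = vectorizer_123_grams.transform(cleaner.transform(df['text'].values))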
word_count_matrix.shape

(5000, 773317)

As you can see, we end up with a much bigger feature space - which can be quite computationally expensive for our model! Fortunately, there are some parameters in `CountVectorizer()` that we can use to reduce our feature space while keeping as many informative features as possible:

- `max_features` - receives an `int` that sets the size of the feature space. The features kept are the ones with the highest term frequency across the corpus.
- `min_df` - the minimum document frequency an n-gram must have to be considered; n-grams that appear in fewer documents are ignored. It accepts an `int` (number of documents) or a `float` (proportion of documents). Often called the *cut-off*.
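The effect of both parameters is easier to see on a tiny corpus. The sketch below uses made-up, already stemmed documents (`toy_docs` is hypothetical, not the IMDB data) and compares the vocabulary learned with and without each restriction.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up toy corpus, only for illustration
toy_docs = [
    "good movi good plot",
    "good movi bad end",
    "bad movi plot",
    "odd film movi",
]

# No restriction: every distinct word becomes a feature
full = CountVectorizer().fit(toy_docs)

# Keep only words that appear in at least 2 documents
frequent = CountVectorizer(min_df=2).fit(toy_docs)

# Keep only the 2 words with the highest total count in the corpus
top2 = CountVectorizer(max_features=2).fit(toy_docs)

print(sorted(full.vocabulary_))
print(sorted(frequent.vocabulary_))
print(sorted(top2.vocabulary_))
```

Words that occur in a single document ("end", "odd", "film") are dropped by `min_df=2`, while `max_features=2` keeps only "movi" and "good".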

# Predict the sentiment of the movie reviews

Let's use all we've learned to create a model that predicts if a review is positive or negative. In NLP, we call this task _sentiment analysis_.

<img src="./media/dwight.jpg" width="500">

Since there are several things that we need to do sequentially to our data, we can create a [Pipeline](http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html). Pipelines allow us to easily compose transformations and classifiers.

The main advantage of a pipeline is that it exposes `fit` and `predict` methods that automatically apply the transformations and then call the classifier, keeping the transformations consistent between train and test data.

We will use Scikit's implementation of Naive Bayes as our classifier.

In [11]:
# Build the pipeline
text_clf = Pipeline([('prep', TextCleanerTransformer(tokenizer, stemmer, regex_list)),
                   ('vect', CountVectorizer(stop_words='english')),
                   ('tfidf', TfidfTransformer()),
                   ('clf', MultinomialNB())])

The final piece missing is converting the sentiment labels into numeric labels through the Scikit [Label Encoder](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html) tool.

In [12]:
# Encode the labels
le = preprocessing.LabelEncoder()
le.fit(train_df['sentiment'].values)

train_df['sentiment'] = le.transform(train_df['sentiment'].values)
validation_df['sentiment'] = le.transform(validation_df['sentiment'].values)
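For reference, here is a small sketch of what `LabelEncoder` does on its own. The label strings below are an assumption made for illustration, since the actual values in the `sentiment` column are not shown here.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical label values, for illustration only
labels = ["Positive", "Negative", "Negative", "Positive"]

toy_le = LabelEncoder()
encoded = toy_le.fit_transform(labels)

# Classes are stored in sorted order; each label maps to its index
print(list(toy_le.classes_))
print(list(encoded))

# inverse_transform maps numeric predictions back to the original labels
decoded = toy_le.inverse_transform([0, 1])
print(list(decoded))
```

Keeping the fitted encoder around is handy later, when you want to turn the model's numeric predictions back into readable labels.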

In [13]:
# Train the classifier
text_clf.fit(map(str, train_df['text'].values), train_df['sentiment'].values)

Pipeline(steps=[('prep',
                 <__main__.TextCleanerTransformer object at 0x135c25350>),
                ('vect', CountVectorizer(stop_words='english')),
                ('tfidf', TfidfTransformer()), ('clf', MultinomialNB())])

In [14]:
predicted = text_clf.predict(map(str, validation_df['text'].values))
np.mean(predicted == validation_df['sentiment'])

0.844

Not bad!

Let's try to use bigrams as well in our model, along with unigrams:

In [15]:
# Build the pipeline
text_clf = Pipeline([('prep', TextCleanerTransformer(tokenizer, stemmer, regex_list)),
                   ('vect', CountVectorizer(stop_words='english', ngram_range=(1,2))),
                   ('tfidf', TfidfTransformer()),
                   ('clf', MultinomialNB())])
# Train the classifier
text_clf.fit(map(str, train_df['text'].values), train_df['sentiment'].values)

predicted = text_clf.predict(map(str, validation_df['text'].values))
np.mean(predicted == validation_df['sentiment'])

0.826

Interestingly, our performance on the validation set got worse! This is an example of how removing stopwords can hurt your model, as we warned you in Learning Notebook - Part 2. If you look at the stopwords list, words like "no" are part of it. This can be crucial to our bigram representation: if we remove them, relevant bigrams (e.g., "no fun") will not appear in our feature space.
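You can check this on a single made-up review. The sketch below (the sentence is hypothetical) fits two vectorizers, one with and one without stopword removal, and looks for the "no fun" bigram in each vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Negation words are part of scikit-learn's built-in English stopword list
print("no" in ENGLISH_STOP_WORDS, "not" in ENGLISH_STOP_WORDS)

# Made-up review, for illustration only
review = ["this movi was no fun"]

with_stop = CountVectorizer(stop_words="english", ngram_range=(1, 2)).fit(review)
without_stop = CountVectorizer(ngram_range=(1, 2)).fit(review)

# Stopwords are dropped before n-grams are built, so "no fun" disappears
# and the remaining words are glued into a misleading "movi fun" bigram
print(sorted(with_stop.vocabulary_))
print(sorted(without_stop.vocabulary_))
```

With stopwords removed, the negative review ends up represented by "movi", "fun", and "movi fun", which looks a lot like a positive one.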

Let's remove the `stop_words` parameter from the `CountVectorizer()`.

In [16]:
# Build the pipeline
text_clf = Pipeline([('prep', TextCleanerTransformer(tokenizer, stemmer, regex_list)),
                   ('vect', CountVectorizer(ngram_range=(1,2))),
                   ('tfidf', TfidfTransformer()),
                   ('clf', MultinomialNB())])
# Train the classifier
text_clf.fit(map(str, train_df['text'].values), train_df['sentiment'].values)

predicted = text_clf.predict(map(str, validation_df['text'].values))
np.mean(predicted == validation_df['sentiment'])

0.837

We get better performance, but still not as good as our unigram model. This may be because our feature space now has too many dimensions, which can hurt model performance. Fortunately, we learned that `CountVectorizer()` has parameters that help to reduce our feature dimensionality.

You should try and see if you can get higher accuracy by reducing dimensionality with `max_features` and `min_df` (Spoiler alert: you will!). You can then also try to use trigrams along with the unigrams and bigrams you already have.
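As a starting point, here is a minimal sketch of how vectorizer parameters can be changed on an existing pipeline with `set_params()`, using the `<step name>__<parameter>` naming convention. The toy texts, labels, and parameter values are made up. The cleaning step is left out so the example stays self-contained, and the values that work well on the real data are for you to find.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up, already cleaned toy data (1 = positive, 0 = negative)
toy_texts = [
    "great fun movi",
    "great act great plot",
    "bore plot no fun",
    "bore movi bad act",
]
toy_labels = [1, 1, 0, 0]

pipe = Pipeline([
    ("vect", CountVectorizer(ngram_range=(1, 2))),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])

pipe.fit(toy_texts, toy_labels)
n_features_full = len(pipe.named_steps["vect"].vocabulary_)

# Shrink the feature space: "vect__min_df" targets min_df of the "vect" step
pipe.set_params(vect__min_df=2, vect__max_features=5)
pipe.fit(toy_texts, toy_labels)
n_features_small = len(pipe.named_steps["vect"].vocabulary_)

print(n_features_full, n_features_small)
```

The same naming convention works with scikit-learn's `GridSearchCV`, if you would rather search over several values at once.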

Try out different classifiers and play around with `CountVectorizer()` and `TfidfTransformer()` parameters to see if you can get better scores!