### Imports

In [1]:
# Standard imports
import numpy as np
import pandas as pd

# SKLearn related imports
import sklearn
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics.pairwise import cosine_similarity

# Let's look at some Movie Reviews

After learning all about tokenization and regexes, let's start doing some cool stuff and apply them to a real dataset!

In Part II of this BLU, we're going to look into how to transform text into something that is meaningful to a machine. As you may have noticed, text is a bit different from other datasets you might have seen - it's just a bunch of words strung together! Where are the features in a tightly organized table of examples? Unlike other data you might have worked with in previous BLUs, text is unstructured and thus needs some additional work on our end to make it structured and ready to be handled by a machine learning algorithm.

One thing is clear - we need features. To get features from a string of text, or a **document**, we **vectorize** it. Normally, this means that our feature space is the **vocabulary** of the examples present in our dataset, that is, the set of unique words we can find in all of the training examples.

But enough talk - let's get our hands dirty!

In this BLU, we're going to work with some movie reviews from IMDB. Let's load the dataset into pandas...

In [2]:
df = pd.read_csv('../data/imdb_sentiment_train.csv')
df.head()

Unnamed: 0,sentiment,text
0,Negative,"Aldolpho (Steve Buscemi), an aspiring film mak..."
1,Negative,"An unfunny, unworthy picture which is an undes..."
2,Negative,A failure. The movie was just not good. It has...
3,Positive,I saw this movie Sunday afternoon. I absolutel...
4,Negative,Disney goes to the well one too many times as ...


As you can see, there are two columns in this dataset - one for the labels and another for the text of the movie review. Each example is labeled as a positive or negative review. Our goal is to extract meaningful features from the text so a machine can predict whether a given unlabeled review is positive or negative.

Let's see a positive and a negative example.

In [3]:
pos_example = df.text[21923]
print(df.sentiment[21923])
print(pos_example)

Positive
If The Lion King is a serious story about a young lion growing up to avenge his father's death, The Lion King 1 and a half is the total opposite, full of whimsy and cheer. The Lion King told the story from the side of Simba the young lion, 1 and a half is from the view of Timone and Pumbaa, a less than perfect duo made up of a meercat who left home because he could not dig tunnels without burying his friends and neighbors and a warthog who has an odor issue. The movie is a little short on substance, but Disney does a good job of filling time with various sketches starring Timone and Pumbaa as they "watch" the movie with us. My favorite is the sing-along that happens halfway through the movie, make sure you watch the bouncing bug! Disney has advertised 1 and a half as "the rest of the story," though it really isn't. It is just a different perspective of The Lion King, without all of the serious stuff that pervaded most of the second half of the original Disney classic. Credit N

Nice! So here's a review of *The Lion King 1½* (a.k.a. *The Lion King 3* in some countries). It seems the reviewer liked it.

In [4]:
neg_example = df.text[4]
print(df.sentiment[4])
print(neg_example)

Negative
Disney goes to the well one too many times as anybody who has seen the original LITTLE MERMAID will feel blatantly ripped off. Celebrating the birth of their daughter Melody, Ariel and Eric plan on introducing her to King Triton. The celebration is quickly crashed by Ursula 's sister, Morgana who plans to use Melody as a defense tool to get the King 's trident. Stopping the attack, Ariel and Eric build a wall around the ocean while Melody grows up wondering why she cannot go in there.<br /><br />Awful and terrible is what describes this direct to video sequel. LITTLE MERMAID 2 gives you that feeling everything you watch seemed to have come straight other Disney movies. I guess Disney can only plagiarize itself! Do not tell me that the penguin and walrus does not remind you of another duo from the LION KING!<br /><br />Other disappointing moments include the rematch between Sebastien and Louie, the royal chef. They terribly under played it! The climax between Morgana and EVERYO

Yikes. I guess that's a pass for this movie, right?

Let's get the first 200 documents of this dataset.

In [5]:
# Keep only the first 200 reviews
docs = df.text[:200]

**USE CONCEPTS OF PART I TO CLEAN THESE EXAMPLES**

As we said, our feature space in text will be the vocabulary of our data. In our example, this is the set of unique words and symbols present in our documents.

In [6]:
def build_vocabulary():
    vocabulary = set()

    for doc in docs:
        words = doc.split()
        vocabulary.update(words)
    
    return vocabulary

build_vocabulary()

{'1940.',
 'world,',
 'hubris',
 'producer',
 "mother's",
 'refreshing,',
 'me....NOTHING.<br',
 'Castleville.',
 'describes',
 'low-budget,',
 'smitten.',
 'training',
 'paid',
 'website)',
 'standup,',
 'screenplay,',
 '"shot',
 'whole..Riff',
 'fictions',
 'know:',
 '"redemption',
 'fear.',
 'hint',
 'McLean,',
 'hands.',
 'malaise.',
 'Nikolai',
 'logical',
 '15',
 'First',
 'brilliant.<br',
 'pity',
 'U.S.',
 'nickname',
 'club.',
 'Perros',
 'cramped',
 'passing',
 'whose',
 'imprisons',
 'there,',
 'begins.<br',
 'right,',
 'Ethan',
 'certain!',
 'darthvader',
 '(which',
 '3.5/10',
 'values',
 'loneliness',
 'mainstream',
 'resembles',
 'Clipped',
 'confess',
 'restroom!',
 'one!',
 'promotion',
 'beneath',
 'lesbian.',
 'government',
 'seedy',
 'later',
 'Zone".',
 'School',
 'experiencing',
 'gore',
 "Captains'",
 'boot!',
 'suited',
 'Bobby.',
 'rent',
 'sees',
 'exist',
 'paradoxes',
 'Harry',
 'you;',
 'shopworn.',
 'sad',
 'knocked',
 'massive',
 'NOT',
 'Crystal.<br',
 'w

# Representing Text through a Bag of Words

Now that we have our vocabulary, we can vectorize our documents. The value of our features will be the simplest vectorization of a document there is - the word counts.

By doing this, each column value is the number of times that word of the vocabulary appeared in the document. This is what is called a **Bag of Words** (BoW) representation.

Note that this type of vectorization loses all the syntactic information of the document. That is, you could shuffle the words in a document and get the same vector (that's why it's called a bag of words). Of course, since we are trying to understand if a movie review is positive or negative, one could argue that what really matters as features is what kind of words appear in the documents and not their order.
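
To make the "shuffle the words" point concrete, here is a minimal sketch using two made-up sentences. `bag_of_words` is a hypothetical helper built on `collections.Counter`, not the function used in the rest of this notebook.

```python
from collections import Counter

def bag_of_words(text):
    """Count how many times each whitespace-separated token appears."""
    return Counter(text.lower().split())

original = "the movie was good not bad"
shuffled = "the movie was bad not good"

# Same words, different order (and a different meaning!), same bag of words
print(bag_of_words(original) == bag_of_words(shuffled))  # True
```

The two sentences say opposite things, yet their count vectors are identical, which is exactly the information a Bag of Words throws away.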

In [7]:
def vectorize():
    vocabulary = build_vocabulary()
    vectors = []
    for doc in docs:
        # Note: str.count counts substring matches, so a short word is also
        # counted inside any longer word that contains it
        vector = np.array([doc.count(word) for word in vocabulary])
        vectors.append(vector)
    
    return vectors
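
One subtlety in `vectorize()`: `doc.count(word)` is `str.count`, which counts substring occurrences rather than whole words. The sketch below, on a made-up sentence, contrasts it with counting exact tokens through `collections.Counter`. The variable names are illustrative only.

```python
from collections import Counter

doc = "you said your movie was youthful"

# str.count matches substrings: "you" is also found inside "your" and "youthful"
substring_count = doc.count("you")

# Counting whole tokens instead: split first, then count exact matches
token_counts = Counter(doc.split())
token_count = token_counts["you"]

print(substring_count, token_count)  # 3 1
```

If exact word counts are what you want, counting over the split tokens is the safer choice. Short vocabulary entries are the ones most affected by substring matching.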

We can visualize this better if we use a pandas DataFrame.

In [9]:
def build_df():
    # Pass the vocabulary as a list: some pandas versions do not accept a set for columns
    return pd.DataFrame(vectorize(), columns=list(build_vocabulary()))

build_df().head()

Unnamed: 0,1940.,"world,",hubris,producer,mother's,"refreshing,",me....NOTHING.<br,Castleville.,describes,"low-budget,",...,Caine,nothing.),Crosscoe,anybody,Snake,film-making,close,that!),boozy,see.
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,1,0,...,0,0,0,1,0,0,0,0,0,0


Let's clean this data a bit.

In [10]:
import string
import re

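# Lowercase each document and strip digits and punctuation.
# re.escape keeps characters such as ']' and '\' from breaking the character class.
docs = [re.sub(r'[\d' + re.escape(string.punctuation) + ']', '', doc.lower()) for doc in docs]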

build_df().head()

Unnamed: 0,hubris,producer,furry,setdesign,yo,describes,training,poland,unmissablebr,paid,...,elsebr,allude,groups,nominations,sebastien,anybody,mccarthy,close,boozy,pullitzer
0,0,0,0,0,3,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,2,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,3,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,6,1,0,0,0,0,...,0,0,0,0,1,1,0,0,0,0
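Column names such as `unmissablebr` and `elsebr` are what is left of the `<br />` tags in the reviews once punctuation is removed. One possible way to avoid them, sketched here under the assumption that line-break tags are the only HTML in the text, is to replace those tags with a space before stripping punctuation. `clean` is a hypothetical helper, not part of the original notebook.

```python
import re
import string

# Matches <br>, <br/> and <br /> in any letter case
BR_TAG = re.compile(r'<br\s*/?>', flags=re.IGNORECASE)
# Matches any digit or ASCII punctuation character
DIGITS_AND_PUNCT = re.compile('[' + r'\d' + re.escape(string.punctuation) + ']')

def clean(doc):
    """Lowercase, turn <br /> tags into spaces, then drop digits and punctuation."""
    doc = BR_TAG.sub(' ', doc.lower())
    return DIGITS_AND_PUNCT.sub('', doc)

print(clean("Simply unmissable.<br /><br />Go watch this in 2005!").split())
# ['simply', 'unmissable', 'go', 'watch', 'this', 'in']
```

Replacing the tag with a space, rather than an empty string, keeps the words on either side of it from being glued together.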


# Stopwords

We're looking for the most meaningful features in our vocabulary to tell us which category our document falls into. Text is filled with words that contribute little to the meaning of a particular sentence, like "the" or "and". This is in contrast with words like "love" or "hate", which have a very clear semantic meaning. Words of the former kind are called **stopwords** - words that don't add any meaning to a piece of text and are often in a document just for syntactic reasons.

You can easily find lists of stopwords for several languages on the internet. Here is a list for English.

In [12]:
# The list came from here: http://snowball.tartarus.org/algorithms/english/stop.txt
stopwords = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', 'couldn', 'didn', 'doesn', 'hadn', 'hasn', 'haven', 'isn', 'ma', 'mightn', 'mustn', 'needn', 'shan', 'shouldn', 'wasn', 'weren', 'won', 'wouldn']

Let's update our build_vocabulary() and vectorize() functions and remove these words from the text. This way we will reduce our vocabulary - and thus our feature space - making our representations more lightweight.

In [14]:
def build_vocabulary():
    vocabulary = set()

    for doc in docs:
        words = [word for word in doc.split() if word not in stopwords]
        vocabulary.update(words)
    
    return vocabulary

def vectorize():
    # The vocabulary already excludes stopwords, so no extra filtering is needed here
    vocabulary = build_vocabulary()
    vectors = []
    for doc in docs:
        # str.count counts substring matches (see the note on the first version)
        vector = np.array([doc.count(word) for word in vocabulary])
        vectors.append(vector)
    
    return vectors

BoW = build_df()
BoW.head()

Unnamed: 0,hubris,producer,furry,setdesign,yo,describes,training,poland,unmissablebr,paid,...,elsebr,allude,groups,nominations,sebastien,anybody,mccarthy,close,boozy,lizitis
0,0,0,0,0,3,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,2,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,3,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,6,1,0,0,0,0,...,0,0,0,0,1,1,0,0,0,0


Another thing that we could do is to normalize our counts. As you can see, different documents have different numbers of words:

In [16]:
BoW.sum(axis=1).head()

0    771
1     91
2    471
3    457
4    886
dtype: int64

This can introduce bias in our features, so we should normalize each document by its number of words. Instead of having word counts as features of our model, we will then have **term frequencies**, and the features of any document in the dataset sum to 1:

In [17]:
tf = BoW.div(BoW.sum(axis=1), axis=0)
tf.sum(axis=1).head()

0    1.0
1    1.0
2    1.0
3    1.0
4    1.0
dtype: float64

# Kicking it up a notch with TF-IDF

It should be clear by now that not all words are equally important for finding out which category a document falls into.

In our dataset, if we want to classify a review as positive, for instance, the word "_good_" is much more informative than "_house_", and we should give it more weight as a feature.

In general, words that are very common in the corpus are less informative than rare words.

That is the rationale behind **Term Frequency - Inverse Document Frequency (TF-IDF)**:
$$ tfidf_{t,d} = \left(1 + \log_2 tf_{t,d}\right) \cdot \left(1 + \log_2 \frac{N}{df_{t}}\right) $$

where $t$ and $d$ are the term and document for which we are computing a feature, $tf_{t,d}$ is the term frequency of term $t$ in document $d$, $N$ is the total number of documents we have, while $df_{t}$ is the number of documents that contain $t$.

We are using the term frequencies from before, but now each one is weighted by how rare the term is across documents: the fewer documents a term appears in, the larger its weight. The more a word appears in a document and the less it appears in other documents, the higher the TF-IDF of that word in that document.

In short, we measure *the term frequency, weighted by its rarity in the entire corpus*.

In practice, most terms do not appear in most documents, so $tf_{t,d}$ is 0 for a large share of the matrix. Since the equation generally yields a nonzero value only when a term appears in a document, we only compute $tfidf_{t,d}$ if $t$ exists in document $d$, and define $tfidf_{t,d} = 0$ otherwise. This is usually done with a sparse implementation of vectors, which we will see later, where we end up only computing values for the terms that appear in a document. Since we don't have that at the moment, we will just replace the -inf we get from taking $\log_2(0)$ with 0's.

In [25]:
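def idf(column):
    # 1 + log2(N / df): N documents in total, df of them contain the term
    return 1 + np.log2(len(column) / (column > 0).sum())

# tf holds frequencies below 1, so 1 + log2(tf) is negative for most terms
# that are present, and -inf for terms that are absent (tf == 0)
tf_idf = (1 + np.log2(tf)).multiply(tf.apply(idf))
# Absent terms get a score of 0 by definition
tf_idf[tf_idf == -np.inf] = 0
tf_idf.head()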

  


Unnamed: 0,hubris,producer,furry,setdesign,yo,describes,training,poland,unmissablebr,paid,...,elsebr,allude,groups,nominations,sebastien,anybody,mccarthy,close,boozy,lizitis
0,0.0,0.0,0.0,0.0,-10.829463,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,-10.634626,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,-9.663082,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,-9.593694,-67.198385,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,-75.989548,-55.5771,0.0,0.0,0.0,0.0
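
To see the formula at work on numbers small enough to check by hand, here is a sketch on a made-up three-document corpus. It assumes raw counts for $tf_{t,d}$ (rather than the normalized frequencies used above, which is why the scores here are positive) and sets the score to 0 when a term is absent from a document. `tfidf_score` is a hypothetical helper.

```python
import math
from collections import Counter

toy_docs = [
    "good movie good plot",
    "bad movie",
    "good acting",
]
tokenized = [doc.split() for doc in toy_docs]
N = len(tokenized)

# df[t]: number of documents that contain term t at least once
df = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf_score(term, tokens):
    """(1 + log2(tf)) * (1 + log2(N / df)), or 0 if the term is absent."""
    tf = tokens.count(term)
    if tf == 0:
        return 0.0
    return (1 + math.log2(tf)) * (1 + math.log2(N / df[term]))

# Scores for the first document
for term in ["good", "movie", "plot"]:
    print(term, round(tfidf_score(term, tokenized[0]), 3))
# good 3.17
# movie 1.585
# plot 2.585
```

"movie" and "plot" both occur once in the first document, but "plot" appears in no other document and so gets the higher score. "good" scores highest because it occurs twice.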


An interesting thing we can do with our feature representations is to check similarities between words. To do this, instead of seeing the vocabulary as the features of a given document, we see each document as a feature of a given term in the vocabulary. By doing this, we get a vectorized representation of a word!

A very popular way of computing similarities between vectors is to compute the cosine similarity.
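
For reference, the cosine similarity of two vectors $a$ and $b$ is $\frac{a \cdot b}{\|a\| \, \|b\|}$: it is 1 when the vectors point in the same direction and 0 when they are orthogonal. Below is a minimal pure-Python sketch with made-up count vectors. `cosine` is a hypothetical helper for illustration, not scikit-learn's implementation, and it does not handle all-zero vectors.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Each position is the count of a word in one (made-up) document
word_a = [2, 0, 1, 0]
word_b = [4, 0, 2, 0]   # same direction as word_a, with doubled counts
word_c = [0, 3, 0, 1]   # never appears in the same document as word_a

print(cosine(word_a, word_b))  # very close to 1
print(cosine(word_a, word_c))  # 0.0
```

Because the measure depends only on direction, a word that is simply more frequent overall is not penalized: what matters is whether two words tend to show up in the same documents.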

Let's check the similarity between the word _good_ and the word _great_ in our Bag of Words representation.

In [39]:
cosine_similarity(BoW['good'].to_numpy().reshape(1, -1), BoW['great'].to_numpy().reshape(1, -1))

array([[0.31382296]])

Now, let's compute it for _good_ and _awful_. We should get a lower similarity score.

In [40]:
cosine_similarity(BoW['good'].to_numpy().reshape(1, -1), BoW['awful'].to_numpy().reshape(1, -1))

array([[0.20225996]])

Let's check the same similarity scores but in the tf-idf representation:

In [41]:
cosine_similarity(tf_idf['good'].to_numpy().reshape(1, -1), tf_idf['great'].to_numpy().reshape(1, -1))

array([[0.42873222]])

In [42]:
cosine_similarity(tf_idf['good'].to_numpy().reshape(1, -1), tf_idf['awful'].to_numpy().reshape(1, -1))

array([[0.18866055]])

Nice! The gap between the similarities of these two pairs of words increased with our tf-idf representation. This suggests that our tf-idf model is computing more meaningful features than our BoW model, which should help when we feed these feature matrices to a prediction model.

# Using all of this in practice

As you can imagine, there are a lot of implementations of what we learned above on the internet. From now on, we're going to use scikit-learn to compute these feature representations.

Our BoW representations, for instance, can be done with scikit-learn's CountVectorizer().

In [43]:
vectorizer = CountVectorizer()

In [44]:
vectorizer.fit(df['text'].values)

# Looking at a small sample of the vocabulary:
vocabulary = list(vectorizer.vocabulary_.keys())
print("Small sample of the vocabulary:", vocabulary[0:20])

# Number of words in the vocabulary
print("\nNumber of distinct words:", len(vocabulary))

Small sample of the vocabulary: ['skyler', 'dementia', 'inciteful', 'teletype', 'laundrette', 'landru', 'comity', 'israle', 'scope', 'sten', 'tomasso', 'jerseys', 'halo', 'investments', 'degrade', 'unhappily', 'taut', 'zavattini', 'kumar', 'marverick']

Number of distinct words: 74849


In [46]:
sentence = df['text'].values[12:13]
print(sentence[0], '\n')

# Transform sentence into bag of words representation
word_count_sentence = vectorizer.transform(sentence)

# Find the indexes of the words which appear in the sentence
_, columns = word_count_sentence.nonzero()

# Get the inverse map to map vector indexes to words
vocabulary = vectorizer.vocabulary_
inv_map = {v: k for k, v in vocabulary.items()}

# Extract the corresponding word and count
counts = [(inv_map[i], word_count_sentence[0, i]) for i in columns]

for word, count in counts:
    print(word, ": ", count)

Kudos to Cesar Montano for reviving the Cebuano movie! Panaghoy sa Suba is very good -- it has the drama, the action, the romance, and scene that will make you laugh.<br /><br />While the story is not that original (a love triangle -- or make a four-cornered-love, Japanese occupation, rebellion, American as lord), its presentation is something cool, especially it uses it original language -- bisaya for the Filipino, nipongo for the Japanese and English for the American.<br /><br />This movie will go as one of this year's best Pinoy movies.<br /><br />Go watch this! 

action :  1
american :  2
and :  2
as :  2
best :  1
bisaya :  1
br :  6
cebuano :  1
cesar :  1
cool :  1
cornered :  1
drama :  1
english :  1
especially :  1
filipino :  1
for :  4
four :  1
go :  2
good :  1
has :  1
is :  3
it :  3
its :  1
japanese :  2
kudos :  1
language :  1
laugh :  1
lord :  1
love :  2
make :  2
montano :  1
movie :  2
movies :  1
nipongo :  1
not :  1
occupation :  1
of :  1
one :  1
or :  1
o

In [48]:
word_count_matrix = vectorizer.transform(df['text'].values)
word_count_matrix.shape

(25000, 74849)

And tf-idf can be done with TfidfTransformer().

In [49]:
tfidf = TfidfTransformer()
tfidf.fit(word_count_matrix)

word_term_frequency_matrix = tfidf.transform(word_count_matrix)
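
Note that `TfidfTransformer` does not use the exact formula from the previous section. With its documented defaults (`smooth_idf=True`, `sublinear_tf=False`, `norm='l2'`), it weights raw counts by $\ln\frac{1+N}{1+df_t} + 1$ and then scales each document vector to unit Euclidean length. The sketch below re-implements that weighting in plain Python on a made-up count matrix. It illustrates the documented formula and is not scikit-learn's actual code; `sklearn_style_tfidf` is a hypothetical name.

```python
import math

def sklearn_style_tfidf(counts):
    """Apply smoothed idf and L2 normalization to a list of count rows."""
    n_docs = len(counts)
    n_terms = len(counts[0])
    # df[j]: number of documents in which term j appears
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    # Smoothed idf: acts as if one extra document contained every term
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

    result = []
    for row in counts:
        weighted = [c * w for c, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted))
        # Leave all-zero rows untouched to avoid dividing by zero
        result.append([v / norm for v in weighted] if norm > 0 else weighted)
    return result

# Rows are documents, columns are terms (made-up counts)
counts = [
    [2, 1, 0],
    [0, 1, 1],
    [1, 0, 0],
]
for row in sklearn_style_tfidf(counts):
    print([round(v, 3) for v in row])
```

Because of the final normalization, every non-empty row has length 1, so long and short reviews end up on the same scale without a separate term-frequency step.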