### Imports

In [1]:
# Standard imports
import numpy as np
import pandas as pd
from collections import Counter, OrderedDict
import re
import string
from numpy import inf

# NLTK imports
from nltk.tokenize import WordPunctTokenizer
from nltk.stem.snowball import SnowballStemmer
from nltk.corpus import stopwords

# SKLearn related imports
import sklearn
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.base import TransformerMixin
from sklearn import preprocessing

from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics.pairwise import cosine_similarity

# Let's look at some Movie Reviews

After learning all about tokenization and regexes, let's start doing some cool stuff and apply it to a real dataset!

In Part II of this BLU, we're going to look into how to transform text into something that is meaningful to a machine. As you may have noticed, text data is a bit different from other datasets you might have seen -- it's just a bunch of words strung together! Where are the features in a tightly organized table of examples? Unlike other data you might have worked with in previous BLUs, text is unstructured and thus needs some additional work on our end to make it structured and ready to be handled by a machine learning algorithm.

<img src="./media/xkcd_language_nerd.png" width="300">

Language can be messy, but one thing is clear: we need features. To get features from a string (text or a **document**), one way is to **vectorize** it. Normally, this means that our feature space is the **vocabulary** of the examples present in our dataset. That is, the set of unique words we can find in all of the training examples.

But enough talk - let's get our hands dirty!

In this BLU, we're going to work with some movie reviews from IMDB. Let's load the dataset into pandas...

In [2]:
df = pd.read_csv('./data/imdb_sentiment.csv')
df.head()

,sentiment,text
0,Negative,"Aldolpho (Steve Buscemi), an aspiring film mak..."
1,Negative,"An unfunny, unworthy picture which is an undes..."
2,Negative,A failure. The movie was just not good. It has...
3,Positive,I saw this movie Sunday afternoon. I absolutel...
4,Negative,Disney goes to the well one too many times as ...


As you can see, there are two columns in this dataset - one for the labels and another for the text of the movie review. Each example is labeled as a positive or negative review. Our goal is to retrieve meaningful features from the text so a machine can predict if a given unlabeled review is positive or negative.

Let's see a positive and a negative example.

In [3]:
pos_example = df.text[4835]
print(df.sentiment[4835])
print(pos_example)

Positive
"The Lion King" is without a doubt my favorite Disney movie of all time, so I figured maybe I should give the sequels a chance and I did. Lion King 1 1/2 was pretty good and had it's good laughs and fun with Timon and Pumba. Only problem, I feel sometimes no explanations are needed because they can create plot holes and just the feeling of wanting your own explanation. Well, I would highly recommend this movie for lion King fans or just a night with the family. It's a fun flick with the same laughs and lovable characters as the first. So, hopefully, I'll get the same with the third installment to the Lion King series. Sit back and just think Hakuna Matata! It means no worries! <br /><br />8/10


Nice! So that is a review about *The Lion King 1 1/2* (a.k.a. *The Lion King 3* in some countries). It seems the reviewer liked it.

In [4]:
neg_example = df.text[4]
print(df.sentiment[4])
print(neg_example)

Negative
Disney goes to the well one too many times as anybody who has seen the original LITTLE MERMAID will feel blatantly ripped off. Celebrating the birth of their daughter Melody, Ariel and Eric plan on introducing her to King Triton. The celebration is quickly crashed by Ursula 's sister, Morgana who plans to use Melody as a defense tool to get the King 's trident. Stopping the attack, Ariel and Eric build a wall around the ocean while Melody grows up wondering why she cannot go in there.<br /><br />Awful and terrible is what describes this direct to video sequel. LITTLE MERMAID 2 gives you that feeling everything you watch seemed to have come straight other Disney movies. I guess Disney can only plagiarize itself! Do not tell me that the penguin and walrus does not remind you of another duo from the LION KING!<br /><br />Other disappointing moments include the rematch between Sebastien and Louie, the royal chef. They terribly under played it! The climax between Morgana and EVERYO

Yikes. I guess that's a pass for this one, right?

Let's get the first 200 documents of this dataset to run experiments faster.

In [5]:
docs = df.text[:200]

As we've learned in Learning Notebook - Part 1, we can tokenize and stem our text to extract better features. Let's initialize our favorite tokenizer and stemmer. For now, we choose to keep stopwords.

In [6]:
tokenizer = WordPunctTokenizer()
stemmer = SnowballStemmer("english", ignore_stopwords=True)

We can also use a regex to clean our sentences. We can see from the examples above that our corpus has some HTML substrings `<br />` that are only polluting the sentences. We can remove them with `re.sub()` by substituting every substring that matches the regex `<[^>]*>` with an empty string.
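To see what this regex does in isolation, here is a minimal sketch on a made-up snippet (the string `raw` is an assumption for illustration, not a row of the dataset):

```python
import re

# Hypothetical review snippet with the kind of HTML line breaks seen above.
raw = "Sit back and relax!<br /><br />8/10"

# "<[^>]*>" matches a "<", then any run of characters that are not ">", then ">".
cleaned = re.sub(r"<[^>]*>", "", raw)
print(cleaned)  # Sit back and relax!8/10
```

Note that substituting an empty string glues the neighbouring text together (`relax!8/10`). That is harmless when punctuation separates the words, as the tokenizer splits on it anyway, but substituting a single space instead is a reasonable variant if tags can sit directly between two words.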

We will define a `preprocess()` function that removes these unnecessary HTML tags, lowercases the text, tokenizes it, drops punctuation tokens, and stems the remaining words.

In [7]:
def preprocess(doc):
    # remove html tags
    doc = re.sub("<[^>]*>", "", doc)
    # lowercase
    doc = doc.lower()
    # tokenize
    words = tokenizer.tokenize(doc)
    # remove punctuation
    words = [word for word in words if word not in string.punctuation]
    # stem
    stems = [stemmer.stem(word) for word in words]
    new_doc = " ".join(stems)
    return new_doc

In [8]:
docs = docs.apply(preprocess)

Let's see one of the above examples again, after we cleaned the corpus.

In [9]:
docs[4]

'disney goe to the well one too mani time as anybodi who has seen the origin littl mermaid will feel blatant rip off celebr the birth of their daughter melodi ariel and eric plan on introduc her to king triton the celebr is quick crash by ursula s sister morgana who plan to use melodi as a defens tool to get the king s trident stop the attack ariel and eric build a wall around the ocean while melodi grow up wonder why she cannot go in there aw and terribl is what describ this direct to video sequel littl mermaid 2 give you that feel everyth you watch seem to have come straight other disney movi i guess disney can only plagiar itself do not tell me that the penguin and walrus does not remind you of anoth duo from the lion king other disappoint moment includ the rematch between sebastien and louie the royal chef they terribl under play it the climax between morgana and everyon seem to be anoth disappoint i will not give anyth away but in 75 minut everyth seem incred cramp and too much to

Well, we may not understand it as well now, but we actually just made the text much easier for a machine to read.

As we said, our feature space in text will be the vocabulary of our data. In our example, this is the set of unique words and symbols present in our documents.

To create our vocabulary, we will use a `Counter()`. `Counter()` is a dictionary that counts the number of occurrences of different tokens in a list and can be updated with each sentence of our corpus.

After getting all counts for each unique token, we sort our dictionary by counts using `Counter()`'s built-in method `.most_common()`, and store everything in an `OrderedDict()`. This makes sure our vectorized representations of the documents will be ordered according to the most common words in the whole corpus (this is not required, but makes data visualization much nicer!).
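As a small illustration of how `Counter.update()` and `.most_common()` behave, here is a sketch on a toy corpus (the three sentences in `toy_docs` are made up and assumed to be already preprocessed):

```python
from collections import Counter, OrderedDict

# Hypothetical, already preprocessed toy corpus.
toy_docs = ["the movi was good", "the plot was bad", "good movi"]

toy_counts = Counter()
for doc in toy_docs:
    # update() adds one count for every token in the list
    toy_counts.update(doc.split())

# most_common() returns (token, count) pairs sorted by descending count
toy_vocab = OrderedDict(toy_counts.most_common())
print(list(toy_vocab.items()))
```

The four tokens that appear twice come first, followed by the two that appear once.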

In [10]:
def build_vocabulary():
    vocabulary = Counter()

    for doc in docs:
        words = doc.split()
        vocabulary.update(words)
    
    return OrderedDict(vocabulary.most_common())

In [11]:
vocab = build_vocabulary()

Check the number of words in the vocabulary:

In [12]:
len(vocab)

5740

In [13]:
# turn into a list of tuples and get the first 20 items
list(vocab.items())[:20]

[('the', 2706),
 ('a', 1361),
 ('and', 1349),
 ('of', 1205),
 ('to', 1115),
 ('is', 815),
 ('it', 786),
 ('in', 719),
 ('i', 690),
 ('this', 594),
 ('that', 581),
 ('s', 541),
 ('movi', 459),
 ('film', 400),
 ('as', 377),
 ('but', 358),
 ('with', 357),
 ('for', 315),
 ('was', 305),
 ('t', 295)]

You will notice that many of the most common words in the reviews are what we would consider stopwords: determiners like "the" and "a", prepositions like "of" and "to", etc. We will probably want to filter these out.

# Representing Text through a Bag of Words

Now that we have our vocabulary, we can vectorize our documents. The value of our features will be the simplest vectorization of a document there is - the word counts.

By doing this, each column value is the number of times that word of the vocabulary appeared in the document. This is what is called a **Bag of Words** (BoW) representation.

Note that this type of vectorization of the document loses all of its syntactic information. That is, you could shuffle the words in the document and get the same vector (that's why it's called a bag of words). Of course, since we are trying to understand if a movie review is positive or negative, one could argue that what really matters as features is what kind of words appear in the documents and not their order.
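The order-invariance is easy to check on a toy example. The sketch below uses a hypothetical helper `bag_of_words()` and a made-up vocabulary and sentence; it counts whole tokens and compares the vector of a document with the vector of a shuffled copy:

```python
import random
from collections import Counter

def bag_of_words(doc, vocabulary):
    """Count how often each vocabulary entry appears as a whole token in doc."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

# Hypothetical mini-vocabulary and document, for illustration only.
toy_vocabulary = ["movi", "good", "bad", "not"]
doc = "not a bad movi not a good movi either"

# Shuffle the tokens to destroy the word order.
tokens = doc.split()
random.seed(0)
random.shuffle(tokens)
shuffled_doc = " ".join(tokens)

print(bag_of_words(doc, toy_vocabulary))  # [2, 1, 1, 2]
print(bag_of_words(shuffled_doc, toy_vocabulary) == bag_of_words(doc, toy_vocabulary))  # True
```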

In [14]:
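def vectorize():
    # build the vocabulary once instead of once per document
    vocabulary = build_vocabulary()
    vectors = []
    for doc in docs:
        words = doc.split()
        # note: doc.count() is str.count, which also counts substring matches
        # (e.g. "a" inside "and"); words.count(word) would count whole tokens
        vector = np.array([doc.count(word) for word in vocabulary])
        vectors.append(vector)

    return vectors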

We can visualize this better if we use a pandas DataFrame.

In [15]:
def build_df(vocabulary):
    return pd.DataFrame(vectorize(), columns=vocabulary)

In [16]:
build_df(vocab).head()

,the,a,and,of,to,is,it,in,i,this,...,championship,"...""",endear,cortney,incid,erupt,semblanc,miser,shoe,mail
0,11,94,7,4,7,16,5,11,76,3,...,0,0,0,0,0,0,0,0,0,0
1,0,6,0,0,1,3,2,0,11,1,...,0,0,0,0,0,0,0,0,0,0
2,8,48,6,3,1,6,3,5,35,1,...,0,0,0,0,0,0,0,0,0,0
3,8,36,3,3,3,5,6,4,45,2,...,0,0,0,0,0,0,0,0,0,0
4,21,70,9,4,16,13,6,15,73,1,...,0,0,0,0,0,0,0,0,0,0
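
One detail worth noticing in the table above: document 0 has a count of 94 for `a` but only 11 for `the`. The expression `doc.count(word)` in `vectorize()` is `str.count`, which counts substring matches, so `a` is also counted inside words such as `and` or `that`. A minimal sketch of the difference, on a made-up sentence:

```python
doc = "a bad movi and a sad cast"   # hypothetical preprocessed document
words = doc.split()

substring_count = doc.count("a")    # str.count: every occurrence of the substring "a"
token_count = words.count("a")      # list.count: only tokens equal to "a"

print(substring_count, token_count)  # 6 2
```

Swapping `doc.count(word)` for `words.count(word)` would give whole-token counts, at the cost of changing every number computed from the count matrix.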


# Stopwords

We mentioned stopwords briefly in the last learning notebook, but now we will go a bit more in-depth. 

We're looking for the most meaningful features in our vocabulary to tell us which category our document falls into. Text is filled with words that are unimportant to the meaning of a particular sentence, like "the" or "and". This is in contrast with words like "love" or "hate" that have a very clear semantic meaning. Words of the former kind are called **stopwords** - words that _usually_ don't introduce any meaning to a piece of text and are often just in the document for syntactic reasons.

It's important to emphasize that we used "usually" in our previous statement. You should be aware that sometimes stopwords can be useful features, especially when we use more than just unigrams as features (e.g. bigrams, trigrams, ...), and word order and word combinations start to be relevant.
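
To make this concrete, here is a small sketch with a hypothetical `bigrams()` helper and a made-up stopword set (`toy_stopwords` is an assumption, not the list loaded in this notebook). Removing stopwords before building bigrams discards the negation:

```python
def bigrams(tokens):
    """Return the list of adjacent token pairs."""
    return list(zip(tokens, tokens[1:]))

# Hypothetical stopword set and review, for illustration only.
toy_stopwords = {"this", "is", "not"}
tokens = "this movi is not good".split()

filtered = [token for token in tokens if token not in toy_stopwords]

print(bigrams(tokens))    # includes ('not', 'good')
print(bigrams(filtered))  # [('movi', 'good')]
```

With unigrams only, dropping `not` loses little; with bigrams, the pair `('not', 'good')` carries exactly the information a sentiment classifier would want.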
 
You can easily find lists of stopwords for several languages on the internet. You can find one for English in the `data` folder.

In [17]:
# note: this list reuses the name of the NLTK `stopwords` import above
with open("./data/english_stopwords.txt", "r") as f:
    stopwords = [line.strip("\n") for line in f]

stopwords[:20]

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 'her',
 'hers']

Let's update our `build_vocabulary()` and `vectorize()` functions and remove these words from the text. This way we will reduce our vocabulary - and thus our feature space - making our representations more lightweight.

In [18]:
def build_vocabulary_without_stopwords():
    vocabulary = Counter()

    for doc in docs:
        words = [word for word in doc.split() if word not in stopwords]
        vocabulary.update(words)
    
    return OrderedDict(vocabulary.most_common())

vocab_without_stopwords = build_vocabulary_without_stopwords()

Check the size of the new vocabulary (should be smaller than before):

In [19]:
len(vocab_without_stopwords)

5612

In [20]:
# turn into a list of tuples and get the first 20 items
list(vocab_without_stopwords.items())[:20]

[('movi', 459),
 ('film', 400),
 ('one', 238),
 ('like', 204),
 ('time', 124),
 ('get', 118),
 ('watch', 110),
 ('make', 106),
 ('even', 105),
 ('see', 105),
 ('good', 99),
 ('stori', 99),
 ('charact', 98),
 ('end', 97),
 ('scene', 96),
 ('would', 94),
 ('well', 92),
 ('much', 92),
 ('peopl', 92),
 ('love', 86)]

In [21]:
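def vectorize():
    vectors = []
    for doc in docs:
        words = doc.split()
        # only vocabulary words are counted, so stopwords are dropped here;
        # doc.count() is still str.count and also matches substrings
        vector = np.array([doc.count(word) for word in vocab_without_stopwords])
        vectors.append(vector)
    
    return vectors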

BoW = build_df(vocab_without_stopwords)
BoW.head()

,movi,film,one,like,time,get,watch,make,even,see,...,championship,"...""",endear,cortney,incid,erupt,semblanc,miser,shoe,mail
0,3,5,0,0,2,1,1,3,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,6,0,2,1,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
3,4,0,1,1,1,0,0,0,1,2,...,0,0,0,0,0,0,0,0,0,0
4,1,0,1,0,1,1,1,0,0,4,...,0,0,0,0,0,0,0,0,0,0


Another thing that we could do is to normalize our counts. As you can see, different documents have different numbers of words:

In [22]:
BoW.sum(axis=1).head()

0    859
1    106
2    513
3    502
4    935
dtype: int64

This can introduce bias in our features, since longer documents get larger counts for every word. To avoid this, we normalize each document by its total count. Instead of having word counts as features of our model, we will have **term frequencies**, and the features of any document in the dataset sum to 1:
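
As a quick worked example of the arithmetic, assuming a made-up document with only three vocabulary words (the counts in `counts` are hypothetical):

```python
# Hypothetical word counts for one document over a three-word vocabulary.
counts = {"movi": 3, "good": 1, "bad": 0}

# Divide each count by the document's total count.
total = sum(counts.values())
term_frequencies = {word: count / total for word, count in counts.items()}

print(term_frequencies)                 # {'movi': 0.75, 'good': 0.25, 'bad': 0.0}
print(sum(term_frequencies.values()))   # 1.0
```

This is the same row-wise division that is applied to the whole count matrix, just written out for a single row. A document containing no vocabulary words at all would need special handling, since its total would be zero.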

In [23]:
tf = BoW.div(BoW.sum(axis=1), axis=0)

In [24]:
tf.sample(3)

,movi,film,one,like,time,get,watch,make,even,see,...,championship,"...""",endear,cortney,incid,erupt,semblanc,miser,shoe,mail
179,0.0,0.004303,0.0,0.000861,0.0,0.002582,0.0,0.0,0.002582,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
56,0.002437,0.004062,0.002437,0.0,0.000812,0.0,0.0,0.000812,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
199,0.0,0.008949,0.008949,0.002237,0.0,0.0,0.0,0.004474,0.0,0.0,...,0.0,0.0,0.0,0.0,0.002237,0.002237,0.002237,0.002237,0.002237,0.002237


In [25]:
tf.sum(axis=1).head()

0    1.0
1    1.0
2    1.0
3    1.0
4    1.0
dtype: float64

# Kicking it up a notch with TF-IDF

It should be clear by now that not all words are equally important for finding out which category a document falls into.

In our dataset, if we want to classify a review as positive, for instance, the word "*good*" is much more informative than "*house*", and we should give it more weight as a feature.

In general, words that are very common in the corpus are less informative than rare words.

That is the rationale behind **Term Frequency - Inverse Document Frequency (TF-IDF)**:

$$ \mathrm{tfidf}_{t,d} = \log_2\left(1 + tf_{t,d}\right) \cdot \log_2\left(1 + \frac{N}{df_{t}}\right) $$

where $t$ and $d$ are the term and document for which we are computing a feature, $tf_{t,d}$ is the term frequency of term $t$ in document $d$, $N$ is the total number of documents we have, while $df_{t}$ is the number of documents that contain $t$.

We are using the term frequencies from before, but now we weight each one by a log-scaled inverse of the fraction of documents that contain the term. The more often a word appears in a document and the fewer documents it appears in overall, the higher the TF-IDF of that word in that document.

In short, we measure **the term frequency, weighted by its rarity in the entire corpus**.

**Note**: TF-IDF can vary in formulation - the idea is always the same, but computation might change slightly. In our case, we are choosing to log-normalize our frequencies.
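
To get a feel for the formula, here is a minimal pure-Python sketch on a toy corpus. The sentences in `toy_docs` and the helper names `tf()`, `idf()` and `tf_idf()` are assumptions for illustration; they follow the log-normalized formulation above and count whole tokens:

```python
import math

# Hypothetical, already preprocessed toy corpus.
toy_docs = ["good movi", "bad movi", "good stori"]
tokenized = [doc.split() for doc in toy_docs]
n_docs = len(tokenized)

def tf(term, tokens):
    """Term frequency: share of the document's tokens equal to term."""
    return tokens.count(term) / len(tokens)

def idf(term):
    """log2(1 + N / df), where df is the number of documents containing term."""
    df = sum(term in tokens for tokens in tokenized)
    return math.log2(1 + n_docs / df)

def tf_idf(term, tokens):
    """Log-scaled term frequency weighted by the term's IDF."""
    return math.log2(1 + tf(term, tokens)) * idf(term)

second_doc = tokenized[1]  # ["bad", "movi"]
print(tf_idf("bad", second_doc), tf_idf("movi", second_doc))
```

Both words make up half of the second document, but *bad* occurs in one document while *movi* occurs in two, so *bad* ends up with the larger weight. Note that `idf()` in this sketch is only defined for terms that occur in at least one document.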

In [26]:
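def idf(column):
    # N / df: number of documents over the number of documents containing the term
    n_docs = len(column)
    doc_freq = (column > 0).sum()
    return np.log2(1 + n_docs / doc_freq)

# log-scaled term frequency times the per-term (per-column) IDF weight
tf_idf = np.log2(1 + tf).multiply(BoW.apply(idf))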

tf_idf.head()

,movi,film,one,like,time,get,watch,make,even,see,...,championship,"...""",endear,cortney,incid,erupt,semblanc,miser,shoe,mail
0,0.00666,0.01204,0.0,0.0,0.005777,0.002815,0.00317,0.009091,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.017937,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.022213,0.0,0.006956,0.004374,0.0,0.0,0.0,0.0,0.004972,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.015161,0.0,0.003558,0.004469,0.004944,0.0,0.0,0.0,0.005081,0.008385,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.002042,0.0,0.001911,0.0,0.002656,0.002586,0.002912,0.0,0.0,0.009002,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


An interesting thing we can do with our feature representations is to check similarities between words. To do this, instead of seeing the vocabulary as the features of a given document, we see each document as a feature of a given term in the vocabulary. By doing this, we get a vectorized representation of a word!

A very popular way of computing similarities between vectors is to compute the cosine similarity (the cosine of the angle between the vectors).
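
For reference, the cosine similarity of two vectors is their dot product divided by the product of their lengths. A minimal sketch of that definition, using a hypothetical `cosine()` helper and made-up term vectors with one entry per document:

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical term vectors: counts of each term in four toy documents.
movi = [3, 0, 2, 1]
film = [2, 0, 1, 0]
shoe = [0, 1, 0, 0]

print(round(cosine(movi, film), 3))  # 0.956
print(cosine(movi, shoe))            # 0.0
```

Terms that tend to occur in the same documents point in similar directions and score close to 1, while terms that never share a document score 0.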

Let's check the similarity between the word _movi_ (which is the stem of *movie*) and the word *film* in our Bag of Words representation.

In [27]:
cosine_similarity(BoW['movi'].values.reshape(1,-1), BoW['film'].values.reshape(1,-1))

array([[0.28991834]])

Now, let's compute it for _movi_ and _shoe_. We should get a lower similarity score.

In [28]:
cosine_similarity(BoW['movi'].values.reshape(1,-1), BoW['shoe'].values.reshape(1,-1))

array([[0.14554543]])

Let's check the same similarity scores but in the tf-idf representation:

In [29]:
cosine_similarity(tf_idf['movi'].values.reshape(1,-1), tf_idf['film'].values.reshape(1,-1))

array([[0.24648296]])

In [30]:
cosine_similarity(tf_idf['movi'].values.reshape(1,-1), tf_idf['shoe'].values.reshape(1,-1))

array([[0.03625309]])

Nice! The gap between the similarities of these pairs of words increased with our TF-IDF representation: the *movi*/*shoe* similarity dropped much more than the *movi*/*film* one. This suggests that TF-IDF gives us more meaningful features than raw counts, which should help when we feed these feature matrices to a classifier.

`scikit-learn` comes with implementations of both Bag of Words and TF-IDF vectorizers, which you will see in the next notebook.