### Imports

In [2]:
# Standard imports
import numpy as np
import pandas as pd
from collections import Counter, OrderedDict
import re
import string
from numpy import inf

# NLTK imports
from nltk.tokenize import WordPunctTokenizer
from nltk.stem.snowball import SnowballStemmer
from nltk.corpus import stopwords

# SKLearn related imports
import sklearn
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.base import TransformerMixin
from sklearn import preprocessing

from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics.pairwise import cosine_similarity

# Let's look at some Movie Reviews

After learning all about tokenization and regexes, let's start doing some cool stuff and apply it to a real dataset!

In Part II of this BLU, we're going to look into how to transform text into something that is meaningful to a machine. As you may have noticed, text data is a bit different from other datasets you might have seen -- it's just a bunch of words strung together! Where are the features in a tightly organized table of examples? Unlike other data you might have worked with in previous BLUs, text is unstructured and thus needs some additional work on our end to make it structured and ready to be handled by a machine learning algorithm.

Language can be messy, but one thing is clear: we need features. To get features from a string (text or a **document**), one way is to **vectorize** it. Normally, this means that our feature space is the **vocabulary** of the examples present in our dataset. That is, the set of unique words we can find in all of the training examples.

<img src="./media/vectors.jpg" width="400">


But enough talk - let's get our hands dirty!

In this BLU, we're going to work with some movie reviews from IMDB. Let's load the dataset into pandas...

In [3]:
df = pd.read_csv('./data/imdb_sentiment.csv')
df.head()

Unnamed: 0,sentiment,text
0,Negative,"Aldolpho (Steve Buscemi), an aspiring film mak..."
1,Negative,"An unfunny, unworthy picture which is an undes..."
2,Negative,A failure. The movie was just not good. It has...
3,Positive,I saw this movie Sunday afternoon. I absolutel...
4,Negative,Disney goes to the well one too many times as ...


As you can see, there are two columns in this dataset - one for the labels and another for the text of the movie review. Each example is labeled as a positive or negative review. Our goal is to retrieve meaningful features from the text, so a machine learning model can predict if a given unlabeled review is positive or negative.

Let's see a positive and a negative example.

In [4]:
pos_example = df.text[4835]
print(df.sentiment[4835])
print(pos_example)

Positive
"The Lion King" is without a doubt my favorite Disney movie of all time, so I figured maybe I should give the sequels a chance and I did. Lion King 1 1/2 was pretty good and had it's good laughs and fun with Timon and Pumba. Only problem, I feel sometimes no explanations are needed because they can create plot holes and just the feeling of wanting your own explanation. Well, I would highly recommend this movie for lion King fans or just a night with the family. It's a fun flick with the same laughs and lovable characters as the first. So, hopefully, I'll get the same with the third installment to the Lion King series. Sit back and just think Hakuna Matata! It means no worries! <br /><br />8/10


Nice! So that is a review about *The Lion King 1 1/2* (a.k.a. *The Lion King 3* in some countries). It seems the reviewer liked it.

In [5]:
neg_example = df.text[4]
print(df.sentiment[4])
print(neg_example)

Negative
Disney goes to the well one too many times as anybody who has seen the original LITTLE MERMAID will feel blatantly ripped off. Celebrating the birth of their daughter Melody, Ariel and Eric plan on introducing her to King Triton. The celebration is quickly crashed by Ursula 's sister, Morgana who plans to use Melody as a defense tool to get the King 's trident. Stopping the attack, Ariel and Eric build a wall around the ocean while Melody grows up wondering why she cannot go in there.<br /><br />Awful and terrible is what describes this direct to video sequel. LITTLE MERMAID 2 gives you that feeling everything you watch seemed to have come straight other Disney movies. I guess Disney can only plagiarize itself! Do not tell me that the penguin and walrus does not remind you of another duo from the LION KING!<br /><br />Other disappointing moments include the rematch between Sebastien and Louie, the royal chef. They terribly under played it! The climax between Morgana and EVERYO

Yikes. I guess that's a pass for this one, right?

Let's get the first 200 documents of this dataset to run experiments faster.

In [6]:
docs = df.text[:200]

# Preprocessing

As we've learned in Learning Notebook - Part 1, we can tokenize and stem our text to extract features. Let's initialize our favorite tokenizer and stemmer. For now, we choose to keep stopwords.

In [7]:
tokenizer = WordPunctTokenizer()
stemmer = SnowballStemmer("english", ignore_stopwords=True)

Note: using `ignore_stopwords=True` prevents the stemming of stopwords, if they are present in the sentence.

Before tokenization and stemming, it is important to first clean our sentences. We can see from the examples above that our corpus has some HTML substrings `<br />` that are only adding meaningless noise to the sentences. As we've seen in Part 1, we can remove these HTML tags with `re.sub()` by substituting every substring that matches the regex `<[^>]*>` with an empty string.
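As a quick illustration of that substitution, here is a minimal sketch on a made-up snippet (the `snippet` string is invented for this example and is not taken from the dataset):

```python
import re

# Hypothetical review fragment with the same kind of HTML residue as the corpus
snippet = "Sit back and relax!<br /><br />8/10"

# Replace every "<...>" substring with an empty string
cleaned = re.sub(r"<[^>]*>", "", snippet)

print(cleaned)
```

Note that substituting an empty string joins the text on both sides of the tag (`relax!8/10`). Here the punctuation still lets the tokenizer separate the pieces, but when a tag sits directly between two words, replacing it with a space may be the safer variation.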

We will define a `preprocess()` function that removes these unnecessary HTML tags, then tokenizes and stems each document in our corpus.

In [8]:
def preprocess(doc):
    # remove html tags
    doc = re.sub("<[^>]*>", "", doc)
    # lowercase
    doc = doc.lower()
    # tokenize
    words = tokenizer.tokenize(doc)
    # remove punctuation
    # string.punctuation is a utility that allows us to not have to define all punctuation characters by ourselves
    words = [word for word in words if word not in string.punctuation]
    # stem
    stems = [stemmer.stem(word) for word in words]
    new_doc = " ".join(stems)
    return new_doc

--- 

#### A small note on punctuation removal

Note that above, we've used `string.punctuation` instead of defining a list of characters. This is a handy constant provided by the `string` module that we can use instead of defining our own set of punctuation characters.

You can see below the characters included:



In [9]:
print(string.punctuation)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


It doesn't cover everything (remember the quotation marks from the previous unit), but it is still a nice utility to use as the base for punctuation removal.

In our preprocess function, we're using it in a list comprehension to remove single punctuation characters from our list of words returned by the tokenizer:

In [10]:
text = "Is this a test? Yes, it is a test."

# Before removing punctuation
print([word for word in tokenizer.tokenize(text)])

# After removing punctuation
print([word for word in tokenizer.tokenize(text) if word not in string.punctuation])

['Is', 'this', 'a', 'test', '?', 'Yes', ',', 'it', 'is', 'a', 'test', '.']
['Is', 'this', 'a', 'test', 'Yes', 'it', 'is', 'a', 'test']


However, it's easy to find examples where this is suboptimal:


In [11]:
text = "Is this a test? No, it isn't ..."

print([word for word in tokenizer.tokenize(text) if word not in string.punctuation])

['Is', 'this', 'a', 'test', 'No', 'it', 'isn', 't', '...']


Notice that `...` wasn't considered as punctuation under this rule. What you could do instead is to use regex combined with this utility to make sure sentences are completely cleaned:

In [12]:
text = "Is this a test? No, it isn't ..."

pattern = re.compile("[" + re.escape(string.punctuation) + "]")

sentence = " ".join(tokenizer.tokenize(text))

print(re.sub(pattern, '', sentence).split())


['Is', 'this', 'a', 'test', 'No', 'it', 'isn', 't']


Let's break it down:

- First, punctuation is transformed into a regex pattern.
- Then, the text is tokenized and the tokens are saved in a string, separated by spaces.
- `re.sub()` is applied to the string and removes all characters that are in the regex pattern. Since these characters include the `.` the last three dots are removed.

---

But back to our dataset:

In [13]:
docs = docs.apply(preprocess)

Let's see one of the above examples again, after we cleaned the corpus.

In [14]:
docs[4]

'disney goe to the well one too mani time as anybodi who has seen the origin littl mermaid will feel blatant rip off celebr the birth of their daughter melodi ariel and eric plan on introduc her to king triton the celebr is quick crash by ursula s sister morgana who plan to use melodi as a defens tool to get the king s trident stop the attack ariel and eric build a wall around the ocean while melodi grow up wonder why she cannot go in there aw and terribl is what describ this direct to video sequel littl mermaid 2 give you that feel everyth you watch seem to have come straight other disney movi i guess disney can only plagiar itself do not tell me that the penguin and walrus does not remind you of anoth duo from the lion king other disappoint moment includ the rematch between sebastien and louie the royal chef they terribl under play it the climax between morgana and everyon seem to be anoth disappoint i will not give anyth away but in 75 minut everyth seem incred cramp and too much to

Well, we may not understand it very well now, but we actually just made the text much easier for a machine to read.

Once we count the words in the corpus (we will build the vocabulary in the next section), you will notice that many of the most common words in the reviews are what we would consider stopwords: determiners like "the" and "a", prepositions like "of" and "to", and so on. We will probably want to filter these out.

# Representing Text as Structured Data

Now that we have cleaned and tokenized our dataset, we need to find a way of making this information useful for a machine learning model. As you know, machine learning models can only deal with numerical data. Therefore, we need to find a way of summarizing information about text in numbers.

There are a lot of ways of doing this, but the simplest one is called **Bag of Words (BoW)**!

<img src="./media/bag_of_words.png" width="600">

Bag of words is a type of document vectorization that consists of **word counting**. Each document is represented by a vector with the size of our vocabulary, and each feature is the number of times the corresponding word in the vocabulary appears in the document.

By doing this, our data becomes structured and tabular and each column represents the number of times a word of the vocabulary appeared in the document, whereas each row corresponds to a document.

Note that this type of vectorization of the document loses all of its syntactic information. That is, you could shuffle the words in the document and get the same vector (that's why it's called a bag of words). Of course, since we are trying to understand if a movie review is positive or negative, one could argue that simply looking at what kind of words appear in the document is enough to understand its sentiment (positive or negative), regardless of the order of these words.
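The order-invariance claim can be checked with a small sketch. The sentence, the `bag_of_words` helper and the alphabetical vocabulary below are all made up for illustration; they are not the functions used later in this notebook.

```python
from collections import Counter
import random

def bag_of_words(tokens, vocabulary):
    """Return the count of each vocabulary word in a list of tokens."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

# Toy document and vocabulary, invented for this example
tokens = "the movie was good the plot was not bad".split()
vocabulary = sorted(set(tokens))

# Shuffle a copy of the tokens with a fixed seed
shuffled = tokens[:]
random.Random(0).shuffle(shuffled)

original_vector = bag_of_words(tokens, vocabulary)
shuffled_vector = bag_of_words(shuffled, vocabulary)

print(vocabulary)
print(original_vector)
print(original_vector == shuffled_vector)  # word order does not affect the counts
```

Both vectors are identical, which is exactly the information a Bag of Words representation gives up: "not bad" and "bad not" look the same.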

Nevertheless, for other more difficult tasks this might not be enough, and a different representation of text that conserves the order of words might be needed. But let's focus on BoW for now.

## Vocabulary

To transform our textual data into a vectorized Bag of Words, we first need to define our feature space.

As mentioned before, the feature space of our text will be the vocabulary of our data. In our example, this is the set of unique words and symbols present in our documents.

To create our vocabulary, we will use a `Counter()`. `Counter()` is a dictionary that counts the number of occurrences of different tokens in a list and can be updated with each sentence of our corpus.

After getting all counts for each unique token, we sort our dictionary by counts using `Counter()`'s built-in method `.most_common()`, and store everything in an `OrderedDict()`. This makes sure our vectorized representations of the documents will be ordered according to the most common words in the whole corpus (this is not required, but makes data visualization much nicer!).

In [15]:
def build_vocabulary():
    vocabulary = Counter()

    for doc in docs:
        words = doc.split(' ')
        vocabulary.update(words)
    
    return OrderedDict(vocabulary.most_common())

In [16]:
vocab = build_vocabulary()

In [17]:
len(vocab)

5740

In [18]:
# turn into a list of tuples and get the first 20 items
list(vocab.items())[:20]

[('the', 2706),
 ('a', 1361),
 ('and', 1349),
 ('of', 1205),
 ('to', 1115),
 ('is', 815),
 ('it', 786),
 ('in', 719),
 ('i', 690),
 ('this', 594),
 ('that', 581),
 ('s', 541),
 ('movi', 459),
 ('film', 400),
 ('as', 377),
 ('but', 358),
 ('with', 357),
 ('for', 315),
 ('was', 305),
 ('t', 295)]

## Bag of Words

Now that we have our vocabulary, we can vectorize our documents.

In [19]:
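def vectorize():
    vectors = []
    for doc in docs:
        # Reuse the vocabulary built above instead of rebuilding it for every document.
        # Note: str.count matches substrings, not whole tokens.
        vector = np.array([doc.count(word) for word in vocab])
        vectors.append(vector)

    return vectors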

We can visualize this better if we use a pandas DataFrame.

In [20]:
pd.DataFrame(vectorize(), columns=vocab).head()

Unnamed: 0,the,a,and,of,to,is,it,in,i,this,...,championship,"...""",endear,cortney,incid,erupt,semblanc,miser,shoe,mail
0,11,94,7,4,7,16,5,11,76,3,...,0,0,0,0,0,0,0,0,0,0
1,0,6,0,0,1,3,2,0,11,1,...,0,0,0,0,0,0,0,0,0,0
2,8,48,6,3,1,6,3,5,35,1,...,0,0,0,0,0,0,0,0,0,0
3,8,36,3,3,3,5,6,4,45,2,...,0,0,0,0,0,0,0,0,0,0
4,21,70,9,4,16,13,6,15,73,1,...,0,0,0,0,0,0,0,0,0,0


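As you can see, each document is a vector of the size of the vocabulary (5740 columns), and each feature is meant to be the number of times a word appears in the document. Keep in mind that `doc.count(word)` counts substring matches in the document string, so very short tokens are overcounted: the first review gets 94 for "a" and 76 for "i" because those letters also occur inside longer words.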
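To make the counting behaviour concrete, the sketch below compares `str.count` on the whole string with counting whole tokens through a `Counter`. The sentence and the three-word vocabulary are made up for illustration, and the token-based variant is one possible alternative, not the approach used in the cells of this notebook.

```python
from collections import Counter

# Toy document and vocabulary, invented for this example
doc = "a great film with a daring cast"
vocabulary = ["a", "film", "cast"]

# str.count matches substrings: "a" is also found inside "great", "daring" and "cast"
substring_counts = [doc.count(word) for word in vocabulary]

# Counting whole tokens instead
token_counts = Counter(doc.split(" "))
exact_counts = [token_counts[word] for word in vocabulary]

print(substring_counts)  # [5, 1, 1]
print(exact_counts)      # [2, 1, 1]
```

If exact token counts matter for your features, splitting each document once and looking words up in a `Counter` avoids the overcounting and is also cheaper than scanning the string once per vocabulary word.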

### Stopwords

We mentioned stopwords briefly in the last learning notebook, but now we will go a bit more in-depth. 

We're looking for the most meaningful features in our vocabulary to tell us which category our document falls into. Text is filled with words that are unimportant to the meaning of a particular sentence, like "the" or "and". This contrasts with words like "love" or "hate" that have a very clear semantic meaning. Words of the former kind are called **stopwords**: words that _usually_ don't add any meaning to a piece of text and are often in the document only for syntactic reasons.

It is important to emphasize that we used "usually" in our previous statement. You should be aware that sometimes stopwords can be useful features, especially when we use more than just unigrams as features (e.g. bigrams, trigrams, ...), where word order and word combinations start to be relevant.
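Here is a small sketch of why that matters for bigrams. The `toy_stopwords` set and the `bigrams` helper are assumptions made up for this example; the real stopword list is loaded from NLTK in the next cell.

```python
# Tiny stand-in stopword set, assumed for this example only
toy_stopwords = {"the", "was", "not"}

def bigrams(tokens):
    """Return the consecutive token pairs of a token list."""
    return list(zip(tokens, tokens[1:]))

tokens = "the movie was not good".split()
filtered = [token for token in tokens if token not in toy_stopwords]

print(bigrams(tokens))    # includes ('not', 'good')
print(bigrams(filtered))  # only ('movie', 'good') is left
```

With the stopwords kept, the bigram `('not', 'good')` carries the negation. After filtering, the only remaining bigram pairs "movie" with "good", which points in the opposite direction.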
 
You can find a list of English stopwords in the NLTK library.

In [21]:
from nltk.corpus import stopwords

stop_eng = set(stopwords.words('english'))

list(stop_eng)[:20]

['yours',
 'how',
 'over',
 'where',
 'they',
 'ourselves',
 't',
 'can',
 'he',
 'down',
 'hadn',
 'do',
 'these',
 'hers',
 'because',
 'into',
 'then',
 'shan',
 "aren't",
 'his']

Let's update our `build_vocabulary()` and `vectorize()` functions and remove these words from the text. This way we will reduce our vocabulary - and thus our feature space - making our representations more lightweight.

In [22]:
def build_vocabulary_without_stopwords():
    vocabulary = Counter()

    for doc in docs:
        words = [word for word in doc.split(' ') if word not in stop_eng]
        vocabulary.update(words)
    
    return OrderedDict(vocabulary.most_common())

vocab_without_stopwords = build_vocabulary_without_stopwords()

Check the size of the new vocabulary (should be smaller than before):

In [23]:
len(vocab_without_stopwords)

5595

In [24]:
# turn into a list of tuples and get the first 20 items
list(vocab_without_stopwords.items())[:20]

[('movi', 459),
 ('film', 400),
 ('one', 238),
 ('like', 204),
 ('time', 124),
 ('get', 118),
 ('watch', 110),
 ('make', 106),
 ('even', 105),
 ('see', 105),
 ('good', 99),
 ('stori', 99),
 ('charact', 98),
 ('end', 97),
 ('scene', 96),
 ('would', 94),
 ('well', 92),
 ('much', 92),
 ('peopl', 92),
 ('love', 86)]

In [25]:
def vectorize():
    vectors = []
    for doc in docs:
        vector = np.array([doc.count(word) for word in vocab_without_stopwords])
        vectors.append(vector)
    
    return vectors

bow = pd.DataFrame(vectorize(), columns=vocab_without_stopwords)
bow.head()

Unnamed: 0,movi,film,one,like,time,get,watch,make,even,see,...,championship,"...""",endear,cortney,incid,erupt,semblanc,miser,shoe,mail
0,3,5,0,0,2,1,1,3,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,6,0,2,1,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
3,4,0,1,1,1,0,0,0,1,2,...,0,0,0,0,0,0,0,0,0,0
4,1,0,1,0,1,1,1,0,0,4,...,0,0,0,0,0,0,0,0,0,0


Another thing that we could do is to normalize our counts. As you can see, different documents have different numbers of words:

In [26]:
bow.sum(axis=1).head()

0    724
1     97
2    434
3    417
4    801
dtype: int64

This can introduce bias in our features, because longer documents get larger counts simply for being longer. To avoid it, we normalize each document by its number of words. Instead of having word counts as features of our model, we will have **term frequencies**, and the features of any document in the dataset sum to 1:

In [27]:
tf = bow.div(bow.sum(axis=1), axis=0)

In [28]:
tf.sample(3)

Unnamed: 0,movi,film,one,like,time,get,watch,make,even,see,...,championship,"...""",endear,cortney,incid,erupt,semblanc,miser,shoe,mail
156,0.003195,0.001597,0.001597,0.000799,0.000799,0.0,0.0,0.002396,0.001597,0.002396,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
77,0.0,0.0,0.002278,0.0,0.000759,0.0,0.0,0.000759,0.000759,0.000759,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
139,0.0,0.0,0.0,0.001572,0.001572,0.0,0.0,0.003145,0.0,0.001572,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [29]:
tf.sum(axis=1).head()

0    1.0
1    1.0
2    1.0
3    1.0
4    1.0
dtype: float64

# Kicking it up a notch with TF-IDF

It should be intuitive that not all words are equally important for finding out which category a document falls into.

If we want to classify a review as positive, the word "*good*", for instance, is much more informative than "*house*", and we should give it more weight as a feature.

In general, words that are very common among all classes are less informative, while words that appear particularly connected to a specific class are more informative. This is usually also the case for rare words, which show up seldom and may show up in only one particular class.

That is the rationale behind **Term Frequency - Inverse Document Frequency (TF-IDF)**:

$$ \text{TF-IDF}_{t,d} = \log{(1 + TF_{t,d})} \cdot \log{\left(1 + \frac{N}{DF_{t}}\right)} $$

where
- $t$ and $d$ are the term and document for which we are computing a feature
- $TF_{t,d}$ is the term frequency of term $t$ in document $d$
- $DF_{t}$ is the number of documents that contain $t$
- $N$ is the total number of documents in the dataset

We are using the term frequencies from before, but now we are weighting each one by a factor that grows as the number of documents containing the term shrinks. The more a word appears in a document and the fewer documents it appears in, the higher the TF-IDF of that word in that document.

In short, we measure **the term frequency, weighted by its rarity in the entire corpus**.

**Note**: You may find some variations on the expression for TF-IDF online. The idea is always the same, although the computation might change slightly. In our case, we are choosing to log-normalize our frequencies.
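To get a feel for the expression above, here is a minimal sketch that evaluates it by hand on a toy corpus using only the standard library. The three tokenized documents and the `toy_tf_idf` helper are made up for illustration; the notebook's own implementation works on the `bow` DataFrame.

```python
import math

# Toy corpus of already-tokenized documents, invented for this example
corpus = [
    ["good", "movie", "good", "cast"],
    ["bad", "movie"],
    ["good", "plot"],
]

def toy_tf_idf(term, doc, corpus):
    """Compute log(1 + TF) * log(1 + N / DF) for one term in one document."""
    tf = doc.count(term) / len(doc)            # list.count matches whole tokens
    df = sum(1 for d in corpus if term in d)   # documents containing the term
    n = len(corpus)                            # total number of documents
    return math.log(1 + tf) * math.log(1 + n / df)

for term in ["good", "movie", "cast"]:
    print(term, round(toy_tf_idf(term, corpus[0], corpus), 4))
```

In the first document, "movie" and "cast" have the same term frequency, but "cast" appears in a single document while "movie" appears in two, so "cast" receives the higher score. The helper assumes the term occurs somewhere in the corpus; otherwise `df` would be zero.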

Let's first look at the document frequency of each term, and then define the whole computation in a function:

In [30]:
(bow > 0).sum(axis=0)

movi        133
film        117
one         147
like        103
time         87
           ... 
erupt         1
semblanc      1
miser         1
shoe          2
mail          2
Length: 5595, dtype: int64

In [31]:
def tf_idf(bow):
    # Term frequency: divide word count by the number of words in the document
    tf = bow.div(bow.sum(axis=1), axis=0)

    # Document frequency: number of documents containing the word
    df = (bow > 0).sum(axis=0)

    # N: number of documents
    n = bow.shape[0]

    return np.log(1 + tf) * np.log(1 + n / df)

tf_idf_df = tf_idf(bow)

tf_idf_df.head()

Unnamed: 0,movi,film,one,like,time,get,watch,make,even,see,...,championship,"...""",endear,cortney,incid,erupt,semblanc,miser,shoe,mail
0,0.003795,0.00686,0.0,0.0,0.003293,0.001605,0.001807,0.00518,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.009413,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.012601,0.0,0.003949,0.002483,0.0,0.0,0.0,0.0,0.002823,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.008762,0.0,0.002057,0.002584,0.002859,0.0,0.0,0.0,0.002938,0.004848,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.001145,0.0,0.001072,0.0,0.001489,0.00145,0.001633,0.0,0.0,0.005047,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Now that you've learned how to represent documents using vectors, you can use this new dataset representation as input to a classification model!

# N-Grams

In Part 1, you were introduced to N-Grams. Now, you can see how N-Grams could serve as the features of the vectors that represent documents. Essentially, the Bag of Words (BoW) technique involves creating a vocabulary and features based on individual words, or uni-grams.

Everything that has been done using uni-grams can be extended to N-Grams for any value of N. This means each feature can represent an N-Gram instead of a uni-gram, while the overall functionality remains the same.

However, it is important to mention that as *N* increases, so does the size of the N-Grams vocabulary due to the exponential growth in possible combinations of words. This makes our document representation sparser (vectors with many zeros), introducing more complexity and noise and potentially harming the machine learning models' ability to find patterns in data effectively. Additionally, with larger datasets, the vocabulary size already becomes bigger by itself, so combining this with an *N* greater than 1 has an even bigger consequence on data sparsity.

On the other hand, using N-grams as features partially preserves word order information, which may be useful for some tasks!

Therefore, the choice of *N* to use in document vectorization should take into account both these advantages and disadvantages. You'll see this decision-making process in action in Part 3!
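The vocabulary growth can be observed even on a tiny example. The sketch below is an illustration under stated assumptions: the two tokenized sentences and the `ngrams` helper are made up here, and a real corpus will show a much steeper increase.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) found in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Toy tokenized documents, invented for this example
docs_tokens = [
    "the movie was good and the plot was good".split(),
    "the plot was bad and the movie was bad".split(),
]

# Vocabulary size for uni-grams, bi-grams and tri-grams
sizes = {}
for n in (1, 2, 3):
    vocabulary = Counter()
    for tokens in docs_tokens:
        vocabulary.update(ngrams(tokens, n))
    sizes[n] = len(vocabulary)

print(sizes)  # {1: 7, 2: 9, 3: 12}
```

Each document contributes roughly the same number of N-grams regardless of *N*, but they are spread over a larger vocabulary, so each individual feature is non-zero in fewer documents.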


# Representing Words as Structured Data

Before we finish this notebook, let's just quickly look at word representation. Until now, we've only found ways of representing documents as vectors, where the features of these vectors are related to word counting.

However, we can also think the other way around. What if instead of using word counts to vectorize documents, we used document counts to vectorize words? This may sound confusing, and not very helpful for the task of classifying movie reviews... But representing words as vectors might be, in fact, a very interesting exercise!

To do this, instead of seeing the vocabulary as the features of a given document, we see each document as a feature of a given term in the vocabulary. By doing this, we get a vectorized representation of a word!

When we have numerical representations of words, we can check similarities between them! A very popular way of computing similarities between vectors is by calculating the cosine similarity (the cosine of the angle between the vectors).
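As a minimal sketch of what the cosine similarity computes, here is a plain-Python version applied to toy word vectors. The `cosine` helper and the three vectors (one entry per document, for four imaginary documents) are made up for illustration; the cells that follow use `cosine_similarity` from `scikit-learn` on the real data.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy word vectors: counts of each word in 4 made-up documents
movie = [3, 1, 0, 2]
good = [2, 1, 0, 1]
shoe = [0, 0, 1, 0]

print(cosine(movie, good))  # close to 1: the words occur in the same documents
print(cosine(movie, shoe))  # 0.0: the words never share a document
```

Because the measure depends only on the angle between the vectors, a word that is simply more frequent overall is not automatically more similar to everything else.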

Let's check the similarity between the word _movi_ and the word _good_ in our Bag of Words representation. These are words that typically show up side by side in movie reviews (at least the positive ones) and as such we would expect a bigger similarity when compared with other words.

In [32]:
cosine_similarity(bow['movi'].values.reshape(1,-1), bow['good'].values.reshape(1,-1))

array([[0.46827681]])

Now, let's see the similarity of _movi_ and _shoe_. We should get a lower similarity score.

In [33]:
cosine_similarity(bow['movi'].values.reshape(1,-1), bow['shoe'].values.reshape(1,-1))

array([[0.14554543]])

Let's check the same similarity scores but in the tf-idf representation:

In [34]:
cosine_similarity(tf_idf_df['movi'].values.reshape(1,-1), tf_idf_df['good'].values.reshape(1,-1))

array([[0.49023951]])

In [35]:
cosine_similarity(tf_idf_df['movi'].values.reshape(1,-1), tf_idf_df['shoe'].values.reshape(1,-1))

array([[0.03487183]])

Nice! The gap between the similarities of these pairs of words increased with our TF-IDF representation. This suggests that, at least for these word pairs, TF-IDF produces more discriminative features than raw counts, which should help when we feed these feature matrices to a classifier.

**Note**: Our evaluation focuses only on word similarities within the context of this dataset and its document distribution (i.e. word X has a similar distribution in these docs as word Y). This means that you should not take this as a general statement about the English language (i.e. word X is close to word Y). That's why we speak in relative terms in our comparisons, saying that a pair of words has "higher" or "lower" similarity, and not in absolute terms.

Congratulations on finishing Part 2! The next and last part will be about sentiment analysis, a very common introductory exercise to NLP.

You'll also see that `scikit-learn` comes with implementations of both Bag of Words and TF-IDF vectorizers, which will make our lives easier!