# BLU07 - Part 2 of 3 - From words to vectors

In [1]:
import numpy as np
import pandas as pd
from collections import Counter, OrderedDict
import re
import string
from numpy import inf

from nltk.tokenize import WordPunctTokenizer
from nltk.stem.snowball import SnowballStemmer
from nltk.corpus import stopwords
 
from sklearn import preprocessing

As you may have noticed, text data is a bit different from other datasets you have seen -- it's just a bunch of words strung together! Where are the features in a tightly organized table? Unlike other data you have worked with in previous BLUs, text is unstructured and thus needs some additional work on our end to make it structured and ready to be handled by a machine learning algorithm.

Language can be messy, but one thing is clear: we need features. The process of getting features out of a text is called **vectorization**. Remember those nice tidy datasets where each sample was a row in a table and each feature was a column? Well, in those data, each sample was a vector in the feature space. Feature space is an n-dimensional space where each axis represents the values of a feature. Every sample from the dataset is a point in that space whose coordinates are that sample's feature values.

In our situation, each sample is a text document. Vectorization means that we're going to define the features and map each sample as a point (or vector) in that feature space. The simplest way to define the features is to take all the words in the documents as features. In that case, the feature space is the **vocabulary** of all the words in our documents and the vector of each document is represented by the count of each word in the given document.

<img src="./media/vectors.jpg" width="400">
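To make the idea concrete before touching real data, here is a minimal sketch on two made-up sentences. The names `toy_docs`, `toy_vocab`, and `toy_vectors` are only for illustration and are not used in the rest of the notebook.

```python
from collections import Counter

# Two toy documents (made up for illustration)
toy_docs = ["the cat sat on the mat", "the dog sat"]

# Vocabulary: every distinct word, in a fixed (sorted) order
toy_vocab = sorted({word for doc in toy_docs for word in doc.split()})

# One count vector per document, one coordinate per vocabulary word
toy_vectors = [
    [Counter(doc.split())[word] for word in toy_vocab]
    for doc in toy_docs
]

print(toy_vocab)
print(toy_vectors)
```

Each document becomes a point in a space with as many axes as there are words in the vocabulary.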

But enough talk - let's get our hands dirty!

We're going to work with movie reviews from IMDB and look at a few ways to vectorize them. Let's load the dataset.

In [2]:
df = pd.read_csv('./data/imdb_sentiment.csv')
df.head()

Unnamed: 0,sentiment,text
0,Negative,"Aldolpho (Steve Buscemi), an aspiring film mak..."
1,Negative,"An unfunny, unworthy picture which is an undes..."
2,Negative,A failure. The movie was just not good. It has...
3,Positive,I saw this movie Sunday afternoon. I absolutel...
4,Negative,Disney goes to the well one too many times as ...


As you can see, there are two columns in this dataset - one for the labels and another for the text of the movie reviews. Each sample is labeled as a positive or negative review. Our goal is to retrieve meaningful features from the text so that in the next notebook we can use a machine learning model to predict if a given review is positive or negative.

Let's see a positive and a negative example.

In [3]:
pos_example = df.text[4835]
print(df.sentiment[4835])
print(pos_example)

Positive
"The Lion King" is without a doubt my favorite Disney movie of all time, so I figured maybe I should give the sequels a chance and I did. Lion King 1 1/2 was pretty good and had it's good laughs and fun with Timon and Pumba. Only problem, I feel sometimes no explanations are needed because they can create plot holes and just the feeling of wanting your own explanation. Well, I would highly recommend this movie for lion King fans or just a night with the family. It's a fun flick with the same laughs and lovable characters as the first. So, hopefully, I'll get the same with the third installment to the Lion King series. Sit back and just think Hakuna Matata! It means no worries! <br /><br />8/10


Nice! So that is a review about *The Lion King 1 1/2* (a.k.a. *The Lion King 3* in some countries). It seems the reviewer liked it.

In [4]:
neg_example = df.text[4]
print(df.sentiment[4])
print(neg_example)

Negative
Disney goes to the well one too many times as anybody who has seen the original LITTLE MERMAID will feel blatantly ripped off. Celebrating the birth of their daughter Melody, Ariel and Eric plan on introducing her to King Triton. The celebration is quickly crashed by Ursula 's sister, Morgana who plans to use Melody as a defense tool to get the King 's trident. Stopping the attack, Ariel and Eric build a wall around the ocean while Melody grows up wondering why she cannot go in there.<br /><br />Awful and terrible is what describes this direct to video sequel. LITTLE MERMAID 2 gives you that feeling everything you watch seemed to have come straight other Disney movies. I guess Disney can only plagiarize itself! Do not tell me that the penguin and walrus does not remind you of another duo from the LION KING!<br /><br />Other disappointing moments include the rematch between Sebastien and Louie, the royal chef. They terribly under played it! The climax between Morgana and EVERYO

Yikes. I guess that's a pass for this one, right?

Let's get the first 200 documents of this dataset to run experiments faster.

In [5]:
docs = df.text[:200]

## 1. Preprocessing

As we've learned in the previous learning notebook, we should tokenize and stem our text before feature extraction. Let's initialize our favorite tokenizer and stemmer and put the workflow we learned into practice.

In [6]:
tokenizer = WordPunctTokenizer()
stemmer = SnowballStemmer("english", ignore_stopwords=True)

We are keeping the stopwords for now, but we don't want to stem them: `ignore_stopwords=True` tells the stemmer to leave stopwords unchanged.

As usual, the data is not clean, so we'll have to deal with that before tokenization and stemming. We can see from the examples above that our _corpus_ (just a fancy way of saying a document collection for language analysis) has some HTML substrings `<br />` that are only adding meaningless noise. We can remove these HTML tags with `re.sub()` by substituting every substring that matches the regex `<[^>]*>` with an empty string.
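As a quick check of that regex, the sketch below applies it to the end of the Lion King review. The second half uses a made-up string (`"good<br />bad"`) to show one thing to keep in mind: substituting with an empty string can glue neighbouring words together, whereas substituting with a space keeps them apart. Whether that matters depends on your data.

```python
import re

# Last sentence of the positive review shown above
sample = "It means no worries! <br /><br />8/10"
cleaned = re.sub(r"<[^>]*>", "", sample)
print(cleaned)

# Made-up example: a tag sitting directly between two words
glued = re.sub(r"<[^>]*>", "", "good<br />bad")    # words get merged
spaced = re.sub(r"<[^>]*>", " ", "good<br />bad")  # words stay separate
print(glued, "|", spaced)
```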

We will define a `preprocess()` function that removes these unnecessary HTML tags, tokenizes, and stems our corpus.

In [7]:
def preprocess(doc):
    # remove html tags
    doc = re.sub("<[^>]*>", "", doc)
    # lowercase
    doc = doc.lower()
    # tokenize
    words = tokenizer.tokenize(doc)
    # remove punctuation
    # string.punctuation is a utility that allows us to not have to define all punctuation characters by ourselves
    words = [word for word in words if word not in string.punctuation]
    # stem
    stems = [stemmer.stem(word) for word in words]
    new_doc = " ".join(stems)
    return new_doc

Note that we've used `string.punctuation` instead of defining a list of punctuation characters ourselves. This is a handy constant provided by the `string` module. These are the characters it contains:

In [8]:
print(string.punctuation)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


It doesn't cover everything, like the weird quotation marks from the previous unit, but it's a good base. The simple list comprehension we used to filter the tokens in the function above wouldn't work for the following situation though.

In [9]:
text = "Is this a test? No, it isn't ..."

print([word for word in tokenizer.tokenize(text) if word not in string.punctuation])

['Is', 'this', 'a', 'test', 'No', 'it', 'isn', 't', '...']


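Notice that `...` wasn't removed by this procedure. The expression `word not in string.punctuation` checks whether the token is a substring of the punctuation string, and `...` is not, so the token survives. It's better to use the punctuation list with a regex in this case: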

In [10]:
text = "Is this a test? No, it isn't ..."

pattern = re.compile("[" + re.escape(string.punctuation) + "]")

sentence = " ".join(tokenizer.tokenize(text))

print(re.sub(pattern, '', sentence).split())

['Is', 'this', 'a', 'test', 'No', 'it', 'isn', 't']


Let's break it down:

- First, punctuation is transformed into a regex pattern.
- Then, the text is tokenized and the tokens are saved in a string, separated by spaces.
- `re.sub()` is applied to the string which removes all characters that are in the regex pattern. Since these characters include the `.` the last three dots are removed.
- Finally, the string is split again on spaces to get back the tokens.
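The same steps can also be applied token by token, which avoids joining and re-splitting the text. This is only a sketch: `strip_punctuation` is a hypothetical helper, and the token list is written by hand to mirror what the tokenizer produces for the sentence above.

```python
import re
import string

# Character class matching any single punctuation character
punct_pattern = re.compile("[" + re.escape(string.punctuation) + "]")

def strip_punctuation(tokens):
    """Remove punctuation characters from each token and drop tokens left empty."""
    stripped = (punct_pattern.sub("", token) for token in tokens)
    return [token for token in stripped if token]

# Hand-written tokens for "Is this a test? No, it isn't ..."
tokens = ['Is', 'this', 'a', 'test', '?', 'No', ',', 'it', 'isn', "'", 't', '...']
print(strip_punctuation(tokens))
```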

But back to our dataset. Let's apply the preprocessing function to all the documents.

In [11]:
docs_preprocessed = docs.apply(preprocess)

This is the Little Mermaid review after preprocessing:

In [12]:
docs_preprocessed[4]

'disney goe to the well one too mani time as anybodi who has seen the origin littl mermaid will feel blatant rip off celebr the birth of their daughter melodi ariel and eric plan on introduc her to king triton the celebr is quick crash by ursula s sister morgana who plan to use melodi as a defens tool to get the king s trident stop the attack ariel and eric build a wall around the ocean while melodi grow up wonder why she cannot go in there aw and terribl is what describ this direct to video sequel littl mermaid 2 give you that feel everyth you watch seem to have come straight other disney movi i guess disney can only plagiar itself do not tell me that the penguin and walrus does not remind you of anoth duo from the lion king other disappoint moment includ the rematch between sebastien and louie the royal chef they terribl under play it the climax between morgana and everyon seem to be anoth disappoint i will not give anyth away but in 75 minut everyth seem incred cramp and too much to

Well, you may not understand it very well now, but we actually made the text much more readable for a machine.

You can notice that many of the most common words in the reviews are what we consider stopwords: determiners like "the," "a"; prepositions like "of," "to"; etc. We will probably want to filter these out later on, but now let's get to the vectorization.

## 2. Bag of words

As we said, the simplest way to vectorize a document collection is by counting the words. This technique is called a **Bag of Words (BoW)**. Look at the following image to see how it works.

<img src="./media/bag_of_words.png" width="600">

Basically, we throw all the words from all documents into a bag; those will be our features. Then, for each document, we take the words out of the bag one by one and count how many times the given word appears in that document. What we get is a **document vector**.

We store the document vectors in a table where the column names (feature names) are the words. By doing this, our data becomes structured and tabular and each column represents the number of times a word of the vocabulary appeared in every document, whereas each row corresponds to a document.

Note that this type of vectorization loses all the syntactic information. That is, you could shuffle the words in each document and still get the same vector (that's why it's called a bag of words). This is enough for some applications, but it might not be enough for others and a different representation of text that conserves the word order might be needed.
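A tiny illustration of that loss, using two made-up sentences with opposite meanings but exactly the same words:

```python
from collections import Counter

# Made-up sentences: same words, different meaning
original = "the dog chased the cat".split()
reordered = "the cat chased the dog".split()

# The word counts (the "bag") are identical, so the BoW vectors would be too
same_bag = Counter(original) == Counter(reordered)
print(same_bag)
```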

In the following sections, we're going to manually create the BoW representation of our documents before using 'official' sklearn vectorizers in the next notebook.

### 2.1 Vocabulary

So now we have each document preprocessed into tokens. To transform this data into a vectorized Bag of Words, we first need to define our feature space.

As mentioned before, the feature space will be the vocabulary of our data, so the set of unique tokens in our documents.

To create our vocabulary, we will use a [Counter](https://docs.python.org/3/library/collections.html#collections.Counter). `Counter()` is a dictionary subclass that counts the number of occurrences of different items in an iterable, such as a list. We're going to feed it with our preprocessed documents to count the occurrences of the tokens.

We then sort our dictionary by counts using `Counter()`'s built-in method `.most_common()` and store everything in an `OrderedDict()`. This way, the features in the document vectors will be ordered from the most to the least common words in the corpus (this is not required, but makes data visualization much nicer!).

In [13]:
def build_vocabulary(docs):
    vocabulary = Counter()

    for doc in docs:
        words = doc.split(' ')
        vocabulary.update(words)
    
    return OrderedDict(vocabulary.most_common())

In [14]:
vocab = build_vocabulary(docs_preprocessed)

Let's take a sneak peek at our vocabulary:

In [15]:
len(vocab)

5740

It's not too large. What are the most common words?

In [16]:
# turn into a list of tuples and get the first 20 items
list(vocab.items())[:20]

[('the', 2706),
 ('a', 1361),
 ('and', 1349),
 ('of', 1205),
 ('to', 1115),
 ('is', 815),
 ('it', 786),
 ('in', 719),
 ('i', 690),
 ('this', 594),
 ('that', 581),
 ('s', 541),
 ('movi', 459),
 ('film', 400),
 ('as', 377),
 ('but', 358),
 ('with', 357),
 ('for', 315),
 ('was', 305),
 ('t', 295)]

Does it also remind you of the list of stopwords? We'll deal with this in a minute.

### 2.2 Vectorization

Now that we have our vocabulary, we can create the document vectors. We simply count how many times each term from the vocabulary appears in each document.

In [17]:
def vectorize(docs, vocab):
    vectors = []
    for doc in docs:
        vector = np.array([doc.count(word) for word in vocab])
        vectors.append(vector)

    return vectors

This function returns a list of vectors which is not so pretty to look at. We can visualize it better with a pandas dataframe where the column names are the feature names (the words in the vocabulary).

In [18]:
bow = pd.DataFrame(vectorize(docs_preprocessed, vocab), columns=vocab)
bow.head()

Unnamed: 0,the,a,and,of,to,is,it,in,i,this,...,championship,"...""",endear,cortney,incid,erupt,semblanc,miser,shoe,mail
0,11,94,7,4,7,16,5,11,76,3,...,0,0,0,0,0,0,0,0,0,0
1,0,6,0,0,1,3,2,0,11,1,...,0,0,0,0,0,0,0,0,0,0
2,8,48,6,3,1,6,3,5,35,1,...,0,0,0,0,0,0,0,0,0,0
3,8,36,3,3,3,5,6,4,45,2,...,0,0,0,0,0,0,0,0,0,0
4,21,70,9,4,16,13,6,15,73,1,...,0,0,0,0,0,0,0,0,0,0


Tadaa, that's our bag of words! As you can see, each document is one row in this dataframe and is a vector of the size of the vocabulary (length=5740). Each feature in that vector corresponds to the number of times a word appears in this document.

And yes, it's a humongous dataframe: 5740 features is way more than we've ever seen before. We're going to talk about this in the next BLU.
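One detail of `vectorize()` is worth a second look. Each `doc` is a string, so `doc.count(word)` counts substring matches rather than whole tokens. That would explain why short features such as `a` and `i` show much larger counts than `the` in the table above. The sketch below shows the difference on a made-up sentence and a token-based variant; `vectorize_tokens` is a hypothetical name, not part of the original notebook, and swapping it in would change the numbers shown in the outputs here.

```python
from collections import Counter

doc = "a great cast in a bad drama"  # made-up example

substring_count = doc.count("a")      # counts every letter "a" in the string
token_count = doc.split().count("a")  # counts only the standalone word "a"
print(substring_count, token_count)

def vectorize_tokens(docs, vocab):
    """Count whole tokens per document, one list of counts per document."""
    vectors = []
    for d in docs:
        counts = Counter(d.split())  # missing words count as 0
        vectors.append([counts[word] for word in vocab])
    return vectors

print(vectorize_tokens([doc], ["a", "great", "movie"]))
```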

### 2.3 Stopwords

We're looking for the most meaningful features in our vocabulary so that we can classify each document as a positive or negative review. But, as you can see from the above dataframe, the text is filled with words with no meaning like "the" or "and". This contrasts with words like "love" or "hate" that have a very clear semantic meaning. Those meaningless words are **stopwords** - words that _usually_ don't introduce any meaning to a piece of text and are in the document just for syntactic reasons.

It is important to emphasize the word "usually" in the previous sentence. Sometimes stopwords can be useful features, especially when we use more than just unigrams as features (e.g., bigrams, trigrams, ...), where word order and word combinations start to be relevant.
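Here is a small sketch of why caution is needed. It uses a tiny hand-picked stopword set (an assumption for illustration, not NLTK's list) and two made-up reviews with opposite sentiment.

```python
# Hand-picked stopwords for illustration only (not NLTK's list)
toy_stopwords = {"this", "is", "not", "a"}

reviews = ["this is a good movie", "this is not a good movie"]

# Drop stopwords from each review
filtered = [
    [word for word in review.split() if word not in toy_stopwords]
    for review in reviews
]
print(filtered)
```

After filtering, both reviews are reduced to the same tokens, so a model could no longer tell them apart.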

Let's update our `build_vocabulary()` function and remove the stopwords from the text. This way we will reduce our vocabulary - and thus our feature space - making the data representation more lightweight.

In [19]:
def build_vocabulary_without_stopwords(docs, stop_eng):
    vocabulary = Counter()

    for doc in docs:
        words = [word for word in doc.split(' ') if word not in stop_eng]
        vocabulary.update(words)
    
    return OrderedDict(vocabulary.most_common())

vocab_without_stopwords = build_vocabulary_without_stopwords(docs_preprocessed, stopwords.words('english'))

Check the size of the new vocabulary (it should be smaller than before):

In [20]:
len(vocab_without_stopwords)

5595

It also has more meaningful words now:

In [21]:
# turn into a list of tuples and get the first 20 items
list(vocab_without_stopwords.items())[:20]

[('movi', 459),
 ('film', 400),
 ('one', 238),
 ('like', 204),
 ('time', 124),
 ('get', 118),
 ('watch', 110),
 ('make', 106),
 ('even', 105),
 ('see', 105),
 ('good', 99),
 ('stori', 99),
 ('charact', 98),
 ('end', 97),
 ('scene', 96),
 ('would', 94),
 ('well', 92),
 ('much', 92),
 ('peopl', 92),
 ('love', 86)]

And this is our new BoW vectorization without stopwords:

In [22]:
bow_no_stopwords = pd.DataFrame(vectorize(docs_preprocessed, vocab_without_stopwords), columns=vocab_without_stopwords)
bow_no_stopwords.head()

Unnamed: 0,movi,film,one,like,time,get,watch,make,even,see,...,championship,"...""",endear,cortney,incid,erupt,semblanc,miser,shoe,mail
0,3,5,0,0,2,1,1,3,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,6,0,2,1,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
3,4,0,1,1,1,0,0,0,1,2,...,0,0,0,0,0,0,0,0,0,0
4,1,0,1,0,1,1,1,0,0,4,...,0,0,0,0,0,0,0,0,0,0


### 2.4 Normalized BoW - term frequencies

Another thing that we could do is to normalize our word counts. As you can see, different documents have different numbers of words:

In [23]:
bow_no_stopwords.sum(axis=1).head()

0    724
1     97
2    434
3    417
4    801
dtype: int64

This can introduce bias in our features, so we should normalize each document by its number of words. Instead of having word counts as features, we will have **term frequencies**. This way, the features in any document vector sum to 1.

In [24]:
tf = bow_no_stopwords.div(bow_no_stopwords.sum(axis=1), axis=0)

In [25]:
tf.sample(3)

Unnamed: 0,movi,film,one,like,time,get,watch,make,even,see,...,championship,"...""",endear,cortney,incid,erupt,semblanc,miser,shoe,mail
40,0.001506,0.006024,0.004518,0.0,0.003012,0.0,0.0,0.0,0.001506,0.003012,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.005089,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.007634,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
136,0.013158,0.0,0.001645,0.000822,0.0,0.000822,0.003289,0.0,0.001645,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [26]:
tf.sum(axis=1).head()

0    1.0
1    1.0
2    1.0
3    1.0
4    1.0
dtype: float64

And that's it, our beautiful, normalized bag of words.

### 2.5 Weighting by document frequency

Here is another way to improve our bag of words features. The idea is that very common words that appear in every document are less informative, while rare words are more interesting and can bring us further in the classification task at hand. This strategy was originally developed for document searching. In our example, words like 'film' or 'movie' appear in many documents, so they're not likely to help decide the category where the review belongs.

The weighting term is called **inverse document frequency** - we are giving more weight to rarer words that appear in fewer documents. Document frequency simply means the number of documents in which the word appears. Let's take the first feature, the word 'movi', and see how often it appears.

In [27]:
(bow_no_stopwords['movi']>0).sum()

133

It appears in 133 documents out of 200, so its document frequency is 133/200 which is 0.665 and the inverse of it is about 1.5. Now another word, 'great':

In [28]:
(bow_no_stopwords['great']>0).sum()

62

It appears in 62 documents out of 200, so its document frequency is 62/200 = 0.31 and the inverse of it is about 3.23.

So the values of the feature 'great' would be weighted by a larger weight than the values of the feature 'movi', giving it more importance in a classification model.

That's the general idea. In reality, the inverse document frequency weighting factor is calculated in a slightly more complicated way using a logarithm:

$$ idf(t) = \log\left(\frac{n+1}{df_t+1}\right) + 1 $$

where $t$ is the term (word) for which we are calculating the weight, $n$ is the number of documents, and $df_{t}$ is the number of documents where the term appears.

Here $\log$ is the natural logarithm. This is the formula sklearn uses by default; there are other versions out there that differ in the use of the +1 terms.
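Plugging the two document frequencies from above into this formula gives a feel for the numbers. The helper name `idf` is just for this sketch.

```python
import math

def idf(n_docs, doc_freq):
    """Smoothed inverse document frequency: ln((n + 1) / (df + 1)) + 1."""
    return math.log((n_docs + 1) / (doc_freq + 1)) + 1

idf_movi = idf(200, 133)   # 'movi' appears in 133 of 200 documents
idf_great = idf(200, 62)   # 'great' appears in 62 of 200 documents
print(round(idf_movi, 3), round(idf_great, 3))
```

The weights come out at roughly 1.41 for 'movi' and 2.16 for 'great'. The logarithm compresses the gap compared with the plain inverses (about 1.5 and 3.23), but the rarer word still gets the larger weight.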

The final value of the feature for a given document, **tf-idf**, is calculated as the product of the term frequency and the inverse document frequency:

$$ tf\text{-}idf(t,d) = tf(t,d) \times idf(t) $$

where $d$ is the document for which the term frequency is calculated. It is common to normalize the resulting sample vectors (rows) to a unit norm.
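The whole computation, including the unit-norm step, can be sketched in plain Python on a made-up three-document corpus. All names here (`toy_docs`, `tf_idf_vector`, and so on) are illustrative assumptions, not part of the notebook's pipeline.

```python
import math
from collections import Counter

# Made-up, already tokenized corpus
toy_docs = [["good", "movie"], ["bad", "movie"], ["good", "good", "plot"]]

vocab = sorted({token for doc in toy_docs for token in doc})
n = len(toy_docs)

# Document frequency and smoothed idf for every term
df = {term: sum(term in doc for doc in toy_docs) for term in vocab}
idf = {term: math.log((n + 1) / (df[term] + 1)) + 1 for term in vocab}

def tf_idf_vector(doc):
    """tf-idf weights for one document, scaled to unit (L2) norm."""
    counts = Counter(doc)
    weights = [counts[term] / len(doc) * idf[term] for term in vocab]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]

vectors = [tf_idf_vector(doc) for doc in toy_docs]
for doc, vector in zip(toy_docs, vectors):
    print(doc, [round(w, 3) for w in vector])
```

In the second document, 'bad' (which appears in one document) ends up with a larger weight than 'movie' (which appears in two), even though both occur once.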

Let's calculate tf-idf for our bag of words from above. First we calculate in how many documents each term appears:

In [29]:
(bow_no_stopwords > 0).sum(axis = 0)

movi        133
film        117
one         147
like        103
time         87
           ... 
erupt         1
semblanc      1
miser         1
shoe          2
mail          2
Length: 5595, dtype: int64

Now we calculate tf-idf starting from the BoW without stopwords:

In [30]:
def calculate_tf_idf(bow):
    # Term frequency: divide word count by the number of words in the document
    # We already did this in the previous section
    tf = bow.div(bow.sum(axis=1), axis=0)

    # Document frequency: number of documents containing the word
    df = (bow > 0).sum(axis=0)

    # n: number of documents
    n = bow.shape[0]

    return tf * (np.log((n+1) / (df+1)) + 1)

In [31]:
tf_idf = calculate_tf_idf(bow_no_stopwords)

tf_idf.head()

Unnamed: 0,movi,film,one,like,time,get,watch,make,even,see,...,championship,"...""",endear,cortney,incid,erupt,semblanc,miser,shoe,mail
0,0.005824,0.010584,0.0,0.0,0.005044,0.002461,0.002743,0.00791,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.014489,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.01943,0.0,0.006019,0.003822,0.0,0.0,0.0,0.0,0.004314,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.013482,0.0,0.003132,0.003978,0.004379,0.0,0.0,0.0,0.00449,0.007474,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.001755,0.0,0.001631,0.0,0.00228,0.002224,0.002479,0.0,0.0,0.007782,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


We can further [normalize](https://scikit-learn.org/1.5/modules/generated/sklearn.preprocessing.normalize.html#sklearn.preprocessing.normalize) the sample vectors (each document, row) to unit vectors:

In [32]:
tf_idf_norm = pd.DataFrame(preprocessing.normalize(tf_idf),columns=tf_idf.columns)
tf_idf_norm.head()

Unnamed: 0,movi,film,one,like,time,get,watch,make,even,see,...,championship,"...""",endear,cortney,incid,erupt,semblanc,miser,shoe,mail
0,0.028932,0.052582,0.0,0.0,0.025058,0.012224,0.013626,0.039294,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.051401,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.089806,0.0,0.027819,0.017667,0.0,0.0,0.0,0.0,0.019941,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.05846,0.0,0.013582,0.017251,0.018988,0.0,0.0,0.0,0.019472,0.03241,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.007648,0.0,0.007107,0.0,0.009936,0.009694,0.010806,0.0,0.0,0.03392,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


That's it, the tf-idf is now ready to be fed into a model.

### 2.6 Bag of words with N-Grams

We created our bag of words using individual words, or unigrams. In principle, we could have used other N-grams as features if we considered that this makes sense for the given classification task.

However, it is important to mention that the number of features grows quickly with *N*. Specific word combinations repeat far less often than single words do, so the number of distinct bigrams in a corpus is typically much larger than the number of distinct words, and the number of distinct trigrams is larger still. This makes the resulting dataframe sparser (vectors with many zeros), introducing more complexity and noise and potentially harming the machine learning model's ability to find patterns in the data. Additionally, with larger datasets, the vocabulary size already becomes bigger by itself, so combining this with N-grams with *N* > 1 has an even bigger effect on data sparsity.

On the other hand, using N-grams as features partially preserves word order information, which may be useful for some tasks!
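For illustration, here is a minimal sketch of N-gram extraction on a made-up sentence. `ngrams` is a hypothetical helper written for this example.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "this movie is not good".split()

bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
print(bigrams)
print(trigrams)
```

The bigram `not good` keeps the negation attached to the word it modifies, which is something a unigram bag of words cannot represent.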

Therefore, the choice of *N* to use in document vectorization should take into account both these advantages and disadvantages. You'll see this decision-making process in action in the next notebook.

Congratulations on finishing Part 2! The next and last part will be about sentiment analysis, a very common introductory NLP exercise. We're going to apply our new skills to classify movie reviews as positive or negative.

You'll also see that `scikit-learn` comes with implementations of both Bag of Words and TF-IDF vectorizers, which will make our lives easier.