# Text Processing and Word Embeddings

Welcome to this new exercise! Instead of images, we will now work with text, using Recurrent Neural Networks. Working with text, speech, and similar data is generally called Natural Language Processing (NLP). Text is structured very differently from images: it consists of strings rather than numbers. We therefore need some preprocessing steps that transform raw text into a numerical format. This notebook introduces these basic concepts of NLP pipelines. Specifically, you will learn about:

1. How to preprocess text classification datasets
2. How to create a simple word embedding layer that maps words to dense vectors

# 0. Setup

As usual, we first import some packages to set up this notebook.

In [2]:
import os
import torch
import matplotlib.pyplot as plt
from torch.utils.data import DataLoader

from exercise_code.rnn.sentiment_dataset import (
    create_dummy_data,
    download_data
)

%matplotlib inline
plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

# for auto-reloading external modules
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2

# 1. Preprocessing a Text Classification Dataset

As a starting point, let's load a dummy text classification dataset and get a sense of what it looks like. We take these samples from the IMDb movie review dataset, which includes movie reviews and labels that show whether they are negative (0) or positive (1). You will investigate this task further in the second notebook.

In this section, our goal is to create a text processing dataset. You are not required to write any code in this section. However, the concepts introduced here are important for working with NLP datasets in the future as well as in the rest of this exercise. So take your time to understand the procedure. :)

First, let us download the data and take a look at some data samples.

In [3]:
i2dl_exercises_path = os.path.dirname(os.path.abspath(os.getcwd()))
data_root = os.path.join(i2dl_exercises_path, "datasets", "SentimentData")
path = download_data(data_root)
data = create_dummy_data(path)
for text, label in data:
    print('Text: {}'.format(text))
    print('Label: {}'.format(label))
    print()

Text: I don't know why I like this movie so well, but I never get tired of watching it.
Label: 1

Text: Smallville episode Justice is the best episode of Smallville ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! It's my favorite episode of Smallville! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! !
Label: 1

Text: Smallville episode Justice is the best episode of Smallville ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! 

## 1.1 Tokenizing Data

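As seen above, we loaded 3 positive and 3 negative reviews. Since the basic semantic unit of text is the word, the first thing we need to do is to **tokenize** the dataset, which means converting each review to a list of words. The tokenizer below lowercases the text and splits it on non-word characters, so punctuation is dropped and a contraction such as "don't" becomes the two tokens `don` and `t`.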

In [4]:
import re

# use regular expression to split the sentence
# check https://docs.python.org/3/library/re.html for more information
def tokenize(text):
    return [s.lower() for s in re.split(r'\W+', text) if len(s) > 0]

tokenized_data = []
for text, label in data:
    tokenized_data.append((tokenize(text), label))
    print(tokenized_data[-1], '\n')

(['i', 'don', 't', 'know', 'why', 'i', 'like', 'this', 'movie', 'so', 'well', 'but', 'i', 'never', 'get', 'tired', 'of', 'watching', 'it'], 1) 

(['smallville', 'episode', 'justice', 'is', 'the', 'best', 'episode', 'of', 'smallville', 'it', 's', 'my', 'favorite', 'episode', 'of', 'smallville'], 1) 

(['smallville', 'episode', 'justice', 'is', 'the', 'best', 'episode', 'of', 'smallville', 'it', 's', 'my', 'favorite', 'episode', 'of', 'smallville'], 1) 

(['great', 'movie', 'especially', 'the', 'music', 'etta', 'james', 'at', 'last', 'this', 'speaks', 'volumes', 'when', 'you', 'have', 'finally', 'found', 'that', 'special', 'someone'], 0) 

(['this', 'movie', 'is', 'terrible', 'but', 'it', 'has', 'some', 'good', 'effects'], 0) 

(['long', 'boring', 'blasphemous', 'never', 'have', 'i', 'been', 'so', 'glad', 'to', 'see', 'ending', 'credits', 'roll'], 0) 
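The splitting behaviour seen in the output above can be reproduced on a single sentence. The sketch below repeats the notebook's `tokenize` and compares it with `tokenize_keep_apostrophes`, a hypothetical variant that is not part of the exercise code and only illustrates how a different regular expression would keep contractions together.

```python
import re

def tokenize(text):
    # same tokenizer as in the cell above: split on runs of non-word characters
    return [s.lower() for s in re.split(r'\W+', text) if len(s) > 0]

def tokenize_keep_apostrophes(text):
    # hypothetical variant (not in the exercise code): match runs of letters
    # and digits, optionally joined by apostrophes, so "don't" stays one token
    return re.findall(r"[a-z0-9]+(?:'[a-z0-9]+)*", text.lower())

sample = "I don't know why"
print(tokenize(sample))                   # ['i', 'don', 't', 'know', 'why']
print(tokenize_keep_apostrophes(sample))  # ['i', "don't", 'know', 'why']
```

Which variant is preferable depends on the task. The rest of this notebook keeps using the simpler `tokenize`.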



## 1.2 Creating a Vocabulary

We have converted the dataset into pairs of token lists and corresponding labels. However, strings have variable length and are inconvenient to work with. It would be nicer to represent words with numbers. To do so, we create a <b>vocabulary</b>, which is a dictionary that maps each word to an integer id.

Large datasets contain a huge number of distinct words, most of which occur only rarely. A common way to tackle this problem is to keep only the N most common words in the dataset and thereby restrict the size of the vocabulary.

Let's first compute the word frequencies in our dummy dataset. To compute frequencies, we use the [Counter](https://docs.python.org/3/library/collections.html#collections.Counter) data structure.

In [5]:
from collections import Counter

freqs = Counter()
for tokens, _ in tokenized_data:
    freqs.update(tokens)

freqs

Counter({'i': 4,
         'don': 1,
         't': 1,
         'know': 1,
         'why': 1,
         'like': 1,
         'this': 3,
         'movie': 3,
         'so': 2,
         'well': 1,
         'but': 2,
         'never': 2,
         'get': 1,
         'tired': 1,
         'of': 5,
         'watching': 1,
         'it': 4,
         'smallville': 6,
         'episode': 6,
         'justice': 2,
         'is': 3,
         'the': 3,
         'best': 2,
         's': 2,
         'my': 2,
         'favorite': 2,
         'great': 1,
         'especially': 1,
         'music': 1,
         'etta': 1,
         'james': 1,
         'at': 1,
         'last': 1,
         'speaks': 1,
         'volumes': 1,
         'when': 1,
         'you': 1,
         'have': 2,
         'finally': 1,
         'found': 1,
         'that': 1,
         'special': 1,
         'someone': 1,
         'terrible': 1,
         'has': 1,
         'some': 1,
         'good': 1,
         'effects': 1,
         'long

To create the dictionary, let's select the 20 most common words. In addition to the words that appear in our data, we need two special tokens:

- `<eos>` End of sequence symbol used for padding
- `<unk>` Words unknown in our vocabulary

In [6]:
vocab = {'<eos>': 0, '<unk>': 1}
for token, freq in freqs.most_common(20):
    vocab[token] = len(vocab)
vocab

{'<eos>': 0,
 '<unk>': 1,
 'smallville': 2,
 'episode': 3,
 'of': 4,
 'i': 5,
 'it': 6,
 'this': 7,
 'movie': 8,
 'is': 9,
 'the': 10,
 'so': 11,
 'but': 12,
 'never': 13,
 'justice': 14,
 'best': 15,
 's': 16,
 'my': 17,
 'favorite': 18,
 'have': 19,
 'don': 20,
 't': 21}
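To get a feeling for what restricting the vocabulary costs, one can measure which share of all token occurrences the N most common words cover. The following is a minimal sketch on a made-up toy sentence. The helper name `coverage` and the toy data are assumptions for illustration and do not appear in the exercise code.

```python
from collections import Counter

# made-up toy token list, only for illustration
toy_tokens = ['the', 'movie', 'is', 'the', 'best', 'movie', 'of', 'the', 'year']
toy_freqs = Counter(toy_tokens)

def coverage(freqs, n):
    """Fraction of all token occurrences covered by the n most common words."""
    total = sum(freqs.values())
    kept = sum(count for _, count in freqs.most_common(n))
    return kept / total

# 'the' (3) and 'movie' (2) cover 5 of the 9 tokens
print(coverage(toy_freqs, 2))
```

Note that `Counter.most_common` orders words with equal counts by first occurrence. This is why `don` and `t`, the first words in the data that occur only once, fill the last two slots of the vocabulary above.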

## 1.3 Creating the Dataset

Putting it all together, we can now create a dataset class. First, let's create index-label pairs:

In [7]:
indexed_data = []
for tokens, label in tokenized_data:
    # tokens that are not in the vocabulary are mapped to <unk>
    indices = [vocab.get(token, vocab['<unk>']) for token in tokens]
    indexed_data.append((indices, label))

for indices, label in indexed_data:
    print(indices, ' -> ', label)
    print()

[5, 20, 21, 1, 1, 5, 1, 7, 8, 11, 1, 12, 5, 13, 1, 1, 4, 1, 6]  ->  1

[2, 3, 14, 9, 10, 15, 3, 4, 2, 6, 16, 17, 18, 3, 4, 2]  ->  1

[2, 3, 14, 9, 10, 15, 3, 4, 2, 6, 16, 17, 18, 3, 4, 2]  ->  1

[1, 8, 1, 10, 1, 1, 1, 1, 1, 7, 1, 1, 1, 1, 19, 1, 1, 1, 1, 1]  ->  0

[7, 8, 9, 1, 12, 6, 1, 1, 1, 1]  ->  0

[1, 1, 1, 13, 19, 5, 1, 11, 1, 1, 1, 1, 1, 1]  ->  0
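To see what information is lost by the `<unk>` mapping, it helps to translate indices back into tokens. The sketch below uses a small made-up vocabulary. The names `toy_vocab`, `inv_vocab`, `encode`, and `decode` are hypothetical helpers for illustration, not part of the exercise code.

```python
# small made-up vocabulary, only for illustration
toy_vocab = {'<eos>': 0, '<unk>': 1, 'this': 2, 'movie': 3, 'is': 4}

# inverse mapping from integer id back to token
inv_vocab = {idx: token for token, idx in toy_vocab.items()}

def encode(tokens, vocab):
    # unknown tokens fall back to the <unk> id
    return [vocab.get(token, vocab['<unk>']) for token in tokens]

def decode(indices, inv_vocab):
    return [inv_vocab[idx] for idx in indices]

indices = encode(['this', 'movie', 'is', 'terrible'], toy_vocab)
print(indices)                     # [2, 3, 4, 1]
print(decode(indices, inv_vocab))  # ['this', 'movie', 'is', '<unk>']
```

The round trip is lossy: every out-of-vocabulary word comes back as `<unk>`, so a word like "terrible" is invisible to a model if it is not part of the vocabulary.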



<div class="alert alert-success"> 
    <h3>Task: Check Code</h3>
    <p>We now use the PyTorch dataset class that we provide in the <code>exercise_code/rnn/sentiment_dataset.py</code> file. Please also take a look at the code.</p>
 </div>
    


The dataset class also sorts the sequences by length in descending order. Thanks to this sorting, sequences of similar length end up next to each other, so batches drawn in order need fewer padded elements and we spend less computation on padding values.

In [8]:
from exercise_code.rnn.sentiment_dataset import SentimentDataset

combined_data = [
    (raw_text, tokens, indices, label)
    for (raw_text, label), (tokens, _), (indices, _)
    in zip(data, tokenized_data, indexed_data)
]

dataset = SentimentDataset(combined_data)

for elem in dataset:
    print(elem)
    print()

{'data': tensor([ 1,  8,  1, 10,  1,  1,  1,  1,  1,  7,  1,  1,  1,  1, 19,  1,  1,  1,
         1,  1]), 'label': tensor(0.)}

{'data': tensor([ 5, 20, 21,  1,  1,  5,  1,  7,  8, 11,  1, 12,  5, 13,  1,  1,  4,  1,
         6]), 'label': tensor(1.)}

{'data': tensor([ 2,  3, 14,  9, 10, 15,  3,  4,  2,  6, 16, 17, 18,  3,  4,  2]), 'label': tensor(1.)}

{'data': tensor([ 2,  3, 14,  9, 10, 15,  3,  4,  2,  6, 16, 17, 18,  3,  4,  2]), 'label': tensor(1.)}

{'data': tensor([ 1,  1,  1, 13, 19,  5,  1, 11,  1,  1,  1,  1,  1,  1]), 'label': tensor(0.)}

{'data': tensor([ 7,  8,  9,  1, 12,  6,  1,  1,  1,  1]), 'label': tensor(0.)}



## 1.4 Minibatching
Note that in the dataset we created, not all sequences have the same length. Therefore, we cannot minibatch the data trivially: by default, a `DataLoader` tries to stack the samples of a batch into a single tensor, which fails for tensors of different sizes.

<b>If you run the following cell, you will very likely get an error!</b>

In [9]:
loader = DataLoader(dataset, batch_size=3)

for batch in loader:
    print(batch)

RuntimeError: stack expects each tensor to be equal size, but got [20] at entry 0 and [19] at entry 1

<div class="alert alert-success"> 
    <h3>Task: Check Code</h3>
    <p>To solve the problem, we need to pad the sequences with <code>&lt;eos&gt;</code> tokens, which we indexed as zero. To integrate this approach into the PyTorch <a href="https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader" target="_blank">DataLoader</a> class, we make use of the <code>collate_fn</code> argument. For more details, check out the <code>collate</code> function in <code>exercise_code/rnn/sentiment_dataset.py</code>.</p>
    <p>In addition, we use <a href="https://pytorch.org/docs/stable/generated/torch.nn.utils.rnn.pad_sequence.html" target="_blank">pad_sequence</a>, which pads shorter sequences with 0.</p>
 </div>
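Before looking at the full `collate` function, the padding step can be sketched in isolation. The example below pads three made-up sequences and derives a boolean mask from the sequence lengths. The mask is an illustrative addition and not something the exercise code is known to compute.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# three made-up sequences of different lengths
seqs = [torch.tensor([5, 6, 7]), torch.tensor([8, 9]), torch.tensor([4])]
lengths = torch.tensor([len(s) for s in seqs])

# default batch_first=False: result has shape (max_len, batch_size)
padded = pad_sequence(seqs)
print(padded)
# tensor([[5, 8, 4],
#         [6, 9, 0],
#         [7, 0, 0]])

# mask[t, b] is True where time step t of sequence b holds a real token
mask = torch.arange(padded.shape[0]).unsqueeze(1) < lengths.unsqueeze(0)
print(mask)
```

Each column is one sequence and each row is one time step, which is the same layout as the batches printed below. Passing `batch_first=True` to `pad_sequence` would give the transposed layout `(batch_size, max_len)` instead.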

In [10]:
from torch.nn.utils.rnn import pad_sequence

def collate(batch):
    assert isinstance(batch, list)
    data = pad_sequence([b['data'] for b in batch])
    lengths = torch.tensor([len(b['data']) for b in batch])
    label = torch.stack([b['label'] for b in batch])
    return {
        'data': data,
        'label': label,
        'lengths': lengths
    }

loader = DataLoader(dataset, batch_size=3, collate_fn=collate)
for batch in loader:
    print('Data: \n', batch['data'])
    print('\nLabels: \n', batch['label'])
    print('\nSequence Lengths: \n', batch['lengths'])
    print('\n')

Data: 
 tensor([[ 1,  5,  2],
        [ 8, 20,  3],
        [ 1, 21, 14],
        [10,  1,  9],
        [ 1,  1, 10],
        [ 1,  5, 15],
        [ 1,  1,  3],
        [ 1,  7,  4],
        [ 1,  8,  2],
        [ 7, 11,  6],
        [ 1,  1, 16],
        [ 1, 12, 17],
        [ 1,  5, 18],
        [ 1, 13,  3],
        [19,  1,  4],
        [ 1,  1,  2],
        [ 1,  4,  0],
        [ 1,  1,  0],
        [ 1,  6,  0],
        [ 1,  0,  0]])

Labels: 
 tensor([0., 1., 1.])

Sequence Lengths: 
 tensor([20, 19, 16])


Data: 
 tensor([[ 2,  1,  7],
        [ 3,  1,  8],
        [14,  1,  9],
        [ 9, 13,  1],
        [10, 19, 12],
        [15,  5,  6],
        [ 3,  1,  1],
        [ 4, 11,  1],
        [ 2,  1,  1],
        [ 6,  1,  1],
        [16,  1,  0],
        [17,  1,  0],
        [18,  1,  0],
        [ 3,  1,  0],
        [ 4,  0,  0],
        [ 2,  0,  0]])

Labels: 
 tensor([1., 0., 0.])

Sequence Lengths: 
 tensor([16, 14, 10])




We can see that the two batches have different lengths (20 and 16 time steps), because each batch is padded only to the length of its own longest sequence. This is how the descending sort mentioned in `1.3 Creating the Dataset` saves memory and computation.
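The saving from sorting can be checked with a few lines of arithmetic on the six sequence lengths of the dummy dataset. The helper `padded_elements` is a hypothetical function written for this illustration. It assumes that batches are drawn in order and that each batch is padded to its longest sequence.

```python
def padded_elements(lengths, batch_size):
    """Count the padding entries needed when batching `lengths` in order."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        # every sequence is padded up to the longest one in its batch
        total += sum(max(batch) - n for n in batch)
    return total

# sequence lengths of the six reviews in their original order
lengths = [19, 16, 16, 20, 10, 14]

print(padded_elements(lengths, 3))                        # 22
print(padded_elements(sorted(lengths, reverse=True), 3))  # 13
```

With sorting, the two batches contain 5 and 8 padding entries, which matches the number of zeros in the two tensors printed above.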

# 2. Embeddings

In the previous section, we explored how to convert text into a sequence of integers. In this form, the sequences are still not ready to be used as inputs to the RNNs you implemented in the optional notebook. An integer id identifies a word much like a one-hot encoding does, but the two are not equivalent: a single integer imposes an arbitrary order and magnitude on the words, whereas one-hot vectors treat all words equally.

Moreover, integer ids fail to express the semantic relations between words, and the order of the ids has no meaning. We would like a representation that preserves the semantic meaning of words. For example, as shown in the following picture, the difference between man and woman should be close to the difference between king and queen, since the two pairs differ only in gender. If we use a vector for each word, this relation can be expressed as $vec(\text{woman})-vec(\text{man}) \approx vec(\text{queen}) - vec(\text{king})$. Such vector representations are usually called embeddings.

<img src='https://developers.google.com/machine-learning/crash-course/images/linear-relationships.svg' width=80% height=80%/>
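The relation in the picture can be made concrete with hand-crafted toy vectors. The numbers below are invented for illustration and are not learned embeddings: the first coordinate stands for "royalty" and the second for "gender".

```python
# hand-crafted 2-D toy vectors: [royalty, gender]; not learned embeddings
vec = {
    'man':   [0.0, 0.0],
    'woman': [0.0, 1.0],
    'king':  [1.0, 0.0],
    'queen': [1.0, 1.0],
}

def diff(a, b):
    # element-wise difference of two vectors
    return [x - y for x, y in zip(a, b)]

print(diff(vec['woman'], vec['man']))   # [0.0, 1.0]
print(diff(vec['queen'], vec['king']))  # [0.0, 1.0]
```

Both differences point along the "gender" axis. In trained embeddings the dimensions are not interpretable like this, and such relations hold only approximately.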

While one can use pre-trained embedding vectors such as [word2vec](https://arxiv.org/abs/1301.3781) or [GloVe](https://nlp.stanford.edu/projects/glove/), in this exercise we use randomly initialized embedding vectors that will be trained from scratch together with our networks.

<div class="alert alert-info">

<h3> Task: Implement Embedding</h3>
 <p>In this part, you will implement a simple embedding layer. An embedding is a simple lookup table that stores a dense vector for each word in the vocabulary.</p> 

 <p>Your task is to implement the <code>Embedding</code> class in the <code>exercise_code.rnn.rnn_nn</code> file. Once you are done, run the cell below to test your implementation. Note that we ensure that the <code>&lt;eos&gt;</code> embedding is zero by using the <code>padding_idx</code> argument.</p>

 </div>

In [11]:
import torch.nn as nn

from exercise_code.rnn.rnn_nn import Embedding
from exercise_code.rnn.tests import embedding_output_test


i2dl_embedding = Embedding(len(vocab), 16, padding_idx=0)
pytorch_embedding = nn.Embedding(len(vocab), 16, padding_idx=0)

loader = DataLoader(dataset, batch_size=len(dataset), collate_fn=collate)
for batch in loader:
    x = batch['data']

embedding_output_test(i2dl_embedding, pytorch_embedding, x)


Difference between outputs: 0.0
Test passed :)!


True

In [14]:
print(pytorch_embedding(x))

tensor([[[ 1.2191e+00, -8.5988e-01, -1.3563e+00,  ...,  3.3668e-01,
          -1.3229e+00, -5.5500e-01],
         [ 1.1721e+00,  4.5254e-01, -4.7469e-01,  ...,  4.6849e-01,
           3.6240e-01,  9.9435e-01],
         [-1.8657e-01, -1.3264e+00,  1.3197e-01,  ...,  1.1573e+00,
          -7.6387e-04,  2.9154e-01],
         [-1.8657e-01, -1.3264e+00,  1.3197e-01,  ...,  1.1573e+00,
          -7.6387e-04,  2.9154e-01],
         [ 1.2191e+00, -8.5988e-01, -1.3563e+00,  ...,  3.3668e-01,
          -1.3229e+00, -5.5500e-01],
         [ 1.6401e-03, -2.3469e+00, -1.4465e-01,  ...,  7.0386e-02,
           2.7355e-01,  2.2413e-01]],

        [[ 5.4771e-01,  1.2905e+00,  1.2137e+00,  ..., -2.4904e-02,
           7.0228e-01,  2.5137e+00],
         [-1.5732e-01,  1.8855e+00, -1.7624e-01,  ...,  8.2225e-01,
           7.3651e-01, -2.8617e-01],
         [-1.1045e+00,  1.2915e-01,  5.2119e-01,  ...,  4.8613e-01,
           2.0941e+00,  2.4975e-01],
         [-1.1045e+00,  1.2915e-01,  5.2119e-01,  ...
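The shape and padding behaviour of the output above can be inspected on a smaller example. The sketch below uses PyTorch's own `nn.Embedding` with made-up sizes (5 words, 4 dimensions) rather than the `Embedding` class from the exercise.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# made-up sizes for illustration: 5 words, 4-dimensional vectors
emb = nn.Embedding(num_embeddings=5, embedding_dim=4, padding_idx=0)

# indices with shape (seq_len=3, batch_size=2); index 0 is the padding token
x = torch.tensor([[2, 3],
                  [4, 1],
                  [0, 0]])

out = emb(x)
print(out.shape)  # torch.Size([3, 2, 4])

# the row at padding_idx is initialized to zeros
print(out[2])

# the forward pass is a row lookup in the weight matrix
print(torch.equal(out, emb.weight[x]))  # True
```

The embedding layer appends one dimension to its input: indices of shape `(seq_len, batch_size)` become vectors of shape `(seq_len, batch_size, embedding_dim)`. Identical indices always map to identical vectors, which is why rows repeat in the output above.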

# 3. Conclusion

In this notebook, you learned how to prepare text data and how to create an embedding layer. In the next notebook, you will combine your Embedding and RNN implementations to create a sentiment analysis network!