### A closer look at sentiment analysis in Keras and embeddings.

In this notebook we take a closer look at sentiment analysis and word embeddings.


### Exploding & Vanishing Gradients

In order to train the weights for the gates inside the recurrent unit, we need to minimize some loss-function which measures the difference between the actual output of the network relative to the desired output.

The recurrent units are applied recursively for each word in the input sequence. This means the recurrent gate is applied once for each time-step. The gradient-signals have to flow back from the loss-function all the way to the first time the recurrent gate is used. If the gradient of the recurrent gate is multiplicative, then we essentially have an exponential function.

Some reviews in this dataset have more than 500 words. If such a text is processed in full, the recurrent unit's gate for updating its internal memory-state is applied recursively more than 500 times. If a gradient of just 1.01 is multiplied with itself 500 times then it gives a value of about 145. If a gradient of just 0.99 is multiplied with itself 500 times then it gives a value of about 0.007. These are called exploding and vanishing gradients. The only gradients that can survive recurrent multiplication are 0 and 1.

To avoid these so-called exploding and vanishing gradients, care must be taken when designing the recurrent unit and its gates. That is why the actual implementation of the `GRU` is more complicated, because it tries to send the gradient back through the gates without this distortion.
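
The arithmetic behind exploding and vanishing gradients can be checked directly. The snippet below is only an illustration of repeated multiplication; the variable names are made up for this sketch and it does not compute real gradients of a `GRU`.

```python
# Multiply a gradient factor with itself once per time-step.
steps = 500

for factor in (1.01, 0.99, 1.0):
    product = factor ** steps
    print(f"{factor} ** {steps} = {product:.4f}")

# Keep the two interesting cases around for inspection.
exploding = 1.01 ** steps   # grows to roughly 145
vanishing = 0.99 ** steps   # shrinks to roughly 0.007
```

Only a factor of exactly 1.0 (or 0.0) keeps its value, which is why the gates have to be designed so carefully.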


### Imports

In [1]:
import matplotlib.pyplot as plt
import tensorflow as tf
import numpy as np


from scipy.spatial.distance import cdist
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences


tf.__version__

'2.6.0'

### Data 

We are going to download, extract and load the data using the following helper functions. The code in the following cell was found [here](https://github.com/Hvass-Labs/TensorFlow-Tutorials/blob/master/imdb.py).

In [2]:
import os, glob, sys
import urllib.request
import tarfile, zipfile

### The function that downloads the files from the internet.

Again, the following code was found [here](https://github.com/Hvass-Labs/TensorFlow-Tutorials/blob/master/download.py). Note that the code is not exactly the same. It is a modified version.

In [3]:
class Download:
  def __init__(self):
    pass

  def print_download_progress(self, count, block_size, total_size):
    pct_complete = float(count * block_size) / total_size
    pct_complete = min(1.0, pct_complete)
    msg = "\r- Download progress: {0:.1%}".format(pct_complete)
    sys.stdout.write(msg)
    sys.stdout.flush()

  def download(self, base_url, filename, download_dir):
    save_path = os.path.join(download_dir, filename)
    if not os.path.exists(save_path):
        if not os.path.exists(download_dir):
            os.makedirs(download_dir)
        print("Downloading", filename, "...")
        url = base_url + filename
        file_path, _ = urllib.request.urlretrieve(url=url,
                                                  filename=save_path,
                                                  reporthook=self.print_download_progress)
        print(" Done!")

  def maybe_download_and_extract(self, url, download_dir):
      filename = url.split('/')[-1]
      file_path = os.path.join(download_dir, filename)
      if not os.path.exists(file_path):
          if not os.path.exists(download_dir):
              os.makedirs(download_dir)

          file_path, _ = urllib.request.urlretrieve(url=url,
                                                    filename=file_path,
                                                    reporthook=self.print_download_progress)
          print()
          print("Download finished. Extracting files.")

          if file_path.endswith(".zip"):
              # Unpack the zip-file.
              zipfile.ZipFile(file=file_path, mode="r").extractall(download_dir)
          elif file_path.endswith((".tar.gz", ".tgz")):
              # Unpack the tar-ball.
              tarfile.open(name=file_path, mode="r:gz").extractall(download_dir)

          print("Done.")
      else:
          print("Data has apparently already been downloaded and unpacked.")


In [4]:
download = Download()

In [5]:
data_dir = "data/IMDB/"
data_url = "http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"

def _read_text_file(path):
    with open(path, 'rt', encoding='utf-8') as file:
        lines = file.readlines()
        text = " ".join(lines)
    return text

def maybe_download_and_extract():
    download.maybe_download_and_extract(url=data_url, download_dir=data_dir)

def load_data(train=True):
    train_test_path = "train" if train else "test"
    dir_base = os.path.join(data_dir, "aclImdb", train_test_path)
    path_pattern_pos = os.path.join(dir_base, "pos", "*.txt")
    path_pattern_neg = os.path.join(dir_base, "neg", "*.txt")
    paths_pos = glob.glob(path_pattern_pos)
    paths_neg = glob.glob(path_pattern_neg)
    data_pos = [_read_text_file(path) for path in paths_pos]
    data_neg = [_read_text_file(path) for path in paths_neg]
    x = data_pos + data_neg
    y = [1.0] * len(data_pos) + [0.0] * len(data_neg)
    return x, y


In [6]:
maybe_download_and_extract()

Data has apparently already been downloaded and unpacked.


### Loading the train and test sets


In [7]:
x_train_text, y_train = load_data(train=True)
x_test_text, y_test = load_data(train=False)

y_train = np.array(y_train)
y_test = np.array(y_test)

### Counting examples

In [8]:
print("Train-set size: ", len(x_train_text))
print("Test-set size:  ", len(x_test_text))

Train-set size:  25000
Test-set size:   25000


### Combining features together

In [9]:
features = x_train_text + x_test_text

### Visualizing a single example from the test data.

In [10]:
x_test_text[0]

'I decided I need to lengthen up my review for my all time favorite film. Unlike other war films that focus on the event, Apocalypse Now takes the viewer into a psychological head trip. The sheer surrealism makes the body uncomfortable, yet you can\'t lay your eyes off of it. Based off of Joseph Conrad\'s Heart Of Darkness, Apocalypse Now slowly descends its protagonist, Willard (Martin Sheen) into madness, most likely the same way Kurtz plunged into insanity. The production of this film is notorious for its delays provided by the monsoon season and for Brando\'s unprepared performance (he read his lines from cue cards). There is a documentary titled Apocalypse Now: A filmmakers Apocalypse which shows the hell everyone went through in making this.<br /><br />The opening sequence is one of the most famous and popular in any film. As the blade of the helicopters are heard in slow motion and napalm is dropped in the trees, the song "The End" by the Doors can be heard. The next shot is of 

Checking the label for the above review...

In [11]:
y_test[0]

1.0

### Tokenizer

A neural network cannot work directly on text-strings so we must convert it somehow. There are two steps in this conversion, the first step is called the "tokenizer" which converts words to integers and is done on the data-set before it is input to the neural network. The second step is an integrated part of the neural network itself and is called the "embedding"-layer.

We can tell the `Tokenizer` to use only the most popular words in the dataset. For this simple example we are going to use only `10000` words.
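
To make the first step concrete, here is a simplified pure-Python sketch of what a word-level tokenizer does. It is not the Keras implementation: the helper names `fit_word_index` and `texts_to_sequences` are hypothetical, and the sketch only lowercases and splits on whitespace, whereas the real `Tokenizer` also filters punctuation.

```python
from collections import Counter

def fit_word_index(texts, oov_token="<oov>"):
    """Build a word -> integer mapping, most frequent words first."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    # Index 0 is left free for padding, index 1 is the out-of-vocabulary token.
    word_index = {oov_token: 1}
    for word, _ in counts.most_common():
        word_index[word] = len(word_index) + 1
    return word_index

def texts_to_sequences(texts, word_index, num_words=None, oov_token="<oov>"):
    """Convert texts to lists of integers, mapping rare/unknown words to OOV."""
    oov = word_index[oov_token]
    sequences = []
    for text in texts:
        seq = []
        for word in text.lower().split():
            idx = word_index.get(word, oov)
            # Only indices below num_words are kept; the rest become OOV.
            if num_words is not None and idx >= num_words:
                idx = oov
            seq.append(idx)
        sequences.append(seq)
    return sequences

toy_corpus = ["the movie was great", "the movie was bad", "the plot"]
toy_word_index = fit_word_index(toy_corpus)
toy_sequences = texts_to_sequences(["the ending was great"], toy_word_index, num_words=5)
print(toy_word_index)
print(toy_sequences)
```

In this toy example "ending" was never seen and "great" falls outside the `num_words` limit, so both are replaced by the out-of-vocabulary index.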


In [12]:
num_words = 10_000

In [13]:
tokenizer = Tokenizer(num_words=num_words, oov_token="<oov>")

### Fitting our dataset to the `tokenizer`

We then need to fit the tokenizer on the data using the `tokenizer.fit_on_texts()` function as follows:

In [14]:
tokenizer.fit_on_texts(features)

You can inspect the vocabulary of the tokenizer by calling:

```
tokenizer.word_index
```

In [15]:
tokenizer.word_index

{'<oov>': 1,
 'the': 2,
 'and': 3,
 'a': 4,
 'of': 5,
 'to': 6,
 'is': 7,
 'br': 8,
 'in': 9,
 'it': 10,
 'i': 11,
 'this': 12,
 'that': 13,
 'was': 14,
 'as': 15,
 'for': 16,
 'with': 17,
 'movie': 18,
 'but': 19,
 'film': 20,
 'on': 21,
 'not': 22,
 'you': 23,
 'are': 24,
 'his': 25,
 'have': 26,
 'be': 27,
 'one': 28,
 'he': 29,
 'all': 30,
 'at': 31,
 'by': 32,
 'an': 33,
 'they': 34,
 'so': 35,
 'who': 36,
 'from': 37,
 'like': 38,
 'or': 39,
 'just': 40,
 'her': 41,
 'out': 42,
 'about': 43,
 'if': 44,
 "it's": 45,
 'has': 46,
 'there': 47,
 'some': 48,
 'what': 49,
 'good': 50,
 'when': 51,
 'more': 52,
 'very': 53,
 'up': 54,
 'no': 55,
 'time': 56,
 'my': 57,
 'even': 58,
 'would': 59,
 'she': 60,
 'which': 61,
 'only': 62,
 'really': 63,
 'see': 64,
 'story': 65,
 'their': 66,
 'had': 67,
 'can': 68,
 'me': 69,
 'well': 70,
 'were': 71,
 'than': 72,
 'much': 73,
 'we': 74,
 'bad': 75,
 'been': 76,
 'get': 77,
 'do': 78,
 'great': 79,
 'other': 80,
 'will': 81,
 'also': 82,
 '

### Converting tokens to sequences

We can now convert our texts to sequences of integer tokens using the `texts_to_sequences` method of the tokenizer object as follows.

In [16]:
train_tokens = tokenizer.texts_to_sequences(x_train_text)
test_tokens = tokenizer.texts_to_sequences(x_test_text)

In [17]:
np.array(test_tokens[0])

array([  11,  878,   11,  363,    6,    1,   54,   57,  717,   16,   57,
         30,   56,  522,   20, 1031,   80,  296,  105,   13, 1152,   21,
          2, 1562, 4888,  147,  307,    2,  537,   83,    4, 2007,  409,
       1232,    2, 2025, 8272,  163,    2,  651, 3323,  245,   23,  186,
       4199,  126,  520,  123,    5,   10,  442,  123,    5, 2447,    1,
        490,    5, 2370, 4888,  147, 1374,    1,   93, 2077, 7719, 1522,
       4575,   83, 2911,   89, 1280,    2,  168,   96, 6767,    1,   83,
       5157,    2,  354,    5,   12,   20,    7, 3287,   16,   93,    1,
       2135,   32,    2,    1,  884,    3,   16,    1,    1,  242,   29,
        340,   25,  410,   37, 5204, 4032,   47,    7,    4,  641, 3717,
       4888,  147,    4, 1069, 4888,   61,  277,    2,  604,  305,  415,
        141,    9,  233,   12,    8,    8,    2,  616,  709,    7,   28,
          5,    2,   89,  816,    3, 1088,    9,  100,   20,   15,    2,
       3365,    5,    2,    1,   24,  555,    9,  5

### Padding and Truncating

The reviews have different lengths, so we need to bring them all to the same length by using:

1. Padding
* Sequences that have fewer tokens than the defined length are padded with `0`, either `pre` or `post`.

2. Truncating

* Sequences that have more tokens than the defined length are truncated, either `pre` or `post`.
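
The two operations above can be sketched in plain Python. This is a minimal illustration of the `post` variants only; `pad_post` is a hypothetical helper, not the Keras `pad_sequences` function (whose defaults are `pre` for both padding and truncating).

```python
def pad_post(sequences, maxlen, value=0):
    """Truncate and pad every sequence at the end so all have length maxlen."""
    padded = []
    for seq in sequences:
        seq = list(seq)[:maxlen]                    # truncate at the end ("post")
        seq = seq + [value] * (maxlen - len(seq))   # pad at the end ("post")
        padded.append(seq)
    return padded

# One sequence that is too short and one that is too long.
example = pad_post([[5, 8, 2], [7, 3, 9, 4, 6, 1]], maxlen=4)
print(example)
```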


In [18]:
max_tokens = 100
padding = truncating = "post"

Now we pad our sequences.

In [19]:
train_tokens_padded = pad_sequences(
    train_tokens,
    maxlen=max_tokens,
    truncating=truncating,
    padding=padding
)

test_tokens_padded = pad_sequences(
    test_tokens,
    maxlen=max_tokens,
    truncating=truncating,
    padding=padding
)

In [20]:
train_tokens_padded.shape, test_tokens_padded.shape

((25000, 100), (25000, 100))

### Inverse map

To reconstruct text-strings from lists of tokens we need the inverse mapping from integer-tokens back to words. Here we build it ourselves from `word_index` (the fitted Keras tokenizer also exposes a similar mapping as `tokenizer.index_word`).

In [21]:
idx = tokenizer.word_index
inverse_map = dict(zip(idx.values(), idx.keys()))

The following helper function converts tokens back to a string:

In [22]:
def tokens_to_string(tokens):
  return " ".join([inverse_map[token] for token in tokens if token != 0])

In [23]:
x_train_text[0]

'This movie is perfect for all the romantics in the world. John Ritter has never been better and has the best line in the movie! "Sam" hits close to home, is lovely to look at and so much fun to play along with. Ben Gazzara was an excellent cast and easy to fall in love with. I\'m sure I\'ve met Arthur in my travels somewhere. All around, an excellent choice to pick up any evening.!:-)'

In [24]:
tokens_to_string(train_tokens_padded[0])

"this movie is perfect for all the <oov> in the world john ritter has never been better and has the best line in the movie sam hits close to home is lovely to look at and so much fun to play along with ben <oov> was an excellent cast and easy to fall in love with i'm sure i've met arthur in my travels somewhere all around an excellent choice to pick up any evening"

### Creating an RNN (Recurrent Neural Network)

The first layer that we are going to add to our RNN is the Embedding layer.


### Embedding Layer

This layer converts each integer-token into a vector of values. This is necessary because the integer-tokens may take on values between 0 and 10000 for a vocabulary of 10000 words. The RNN cannot work on values in such a wide range. The embedding-layer is trained as a part of the RNN and will learn to map words with similar semantic meanings to similar embedding-vectors.

_The values of the embedding-vector will generally fall roughly between -1.0 and 1.0, although they may exceed these values somewhat._

The size of the embedding-vector is typically selected between ``100-300``, but it seems to work reasonably well with small values for Sentiment Analysis.


The following cell defines the parameters of our embedding layer.

In [25]:
embbeding_size = 100
input_dim = len(tokenizer.word_index) # size of the full fitted vocabulary (the tokenizer only emits tokens below num_words)
input_length = max_tokens

### GRU
The second layer that we will add to our RNN model is the Gated Recurrent Unit layer.
> Because each of the first two `GRU` layers is followed by another `GRU`, we specify `return_sequences=True` on them. This is because the `GRU` that follows expects a sequence as its input.


### Dense

We are going to add a Dense (fully connected) layer after our third GRU. This layer will have a `sigmoid` activation function since we are doing binary classification.
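
As a rough sketch of why `sigmoid` fits binary classification, the snippet below squashes a raw score into a probability and scores it with binary cross-entropy. The function names are written for this illustration only and the clipping constant `eps` is an assumption, not a value taken from Keras.

```python
import math

def sigmoid(z):
    """Map any real number to the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Loss for a single example with label 0.0 or 1.0."""
    y_prob = min(max(y_prob, eps), 1.0 - eps)   # avoid log(0)
    return -(y_true * math.log(y_prob) + (1.0 - y_true) * math.log(1.0 - y_prob))

prob = sigmoid(0.0)                                       # an undecided model
loss_uncertain = binary_crossentropy(1.0, prob)           # equals ln(2)
loss_confident = binary_crossentropy(1.0, sigmoid(4.0))   # much smaller loss
print(prob, loss_uncertain, loss_confident)
```

A prediction is then read as positive when the probability is above 0.5 and negative otherwise.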


### Now let's build our model.

In [26]:
model = keras.Sequential(
    [keras.layers.Embedding(
        input_dim=input_dim,
        output_dim = embbeding_size,
        input_length = input_length,
        name="embedding_layer"
    ),
    keras.layers.GRU(units=128, return_sequences=True),
    keras.layers.GRU(units=256, return_sequences=True),
    keras.layers.GRU(units=64),
    keras.layers.Dense(1, activation="sigmoid")],
    name="simple_model"
)

# Compiling the model

model.compile(
    loss=keras.losses.binary_crossentropy,
    optimizer="adam",
    metrics=["acc"]
)

model.summary()

Model: "simple_model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_layer (Embedding)  (None, 100, 100)          12425300  
_________________________________________________________________
gru (GRU)                    (None, 100, 128)          88320     
_________________________________________________________________
gru_1 (GRU)                  (None, 100, 256)          296448    
_________________________________________________________________
gru_2 (GRU)                  (None, 64)                61824     
_________________________________________________________________
dense (Dense)                (None, 1)                 65        
Total params: 12,871,957
Trainable params: 12,871,957
Non-trainable params: 0
_________________________________________________________________
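
The parameter counts in the summary can be reproduced by hand. The sketch below assumes the TensorFlow 2 default `reset_after=True` for `GRU`, which gives each of the three gates two bias vectors; the helper functions are hypothetical and exist only for this check.

```python
def embedding_params(input_dim, output_dim):
    # One vector of size output_dim per row of the embedding matrix.
    return input_dim * output_dim

def gru_params(input_size, units):
    # Three gates, each with input weights, recurrent weights and
    # two bias vectors (assumes reset_after=True).
    return 3 * (units * (input_size + units) + 2 * units)

def dense_params(input_size, units):
    # Weights plus one bias per output unit.
    return input_size * units + units

counts = [
    embedding_params(124253, 100),  # embedding_layer
    gru_params(100, 128),           # gru
    gru_params(128, 256),           # gru_1
    gru_params(256, 64),            # gru_2
    dense_params(64, 1),            # dense
]
total = sum(counts)
print(counts, total)
```

Almost all of the parameters sit in the embedding matrix, because `input_dim` covers every word the tokenizer has seen rather than only the `num_words` most frequent ones.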


### Training the RNN model

We are going to use `5%` of the training data for validation, which we specify with the `validation_split` argument of `model.fit()`.
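
One detail worth checking: Keras takes the validation samples from the end of the arrays, before any shuffling. Assuming the labels are ordered as `load_data` returns them (all positive reviews first, then all negative ones), the sketch below suggests the held-out 5% would contain only negative reviews unless the data is shuffled first. The variable names are made up for this illustration.

```python
import random

n_samples = 25_000
labels = [1.0] * 12_500 + [0.0] * 12_500   # order produced by load_data

n_val = n_samples * 5 // 100               # 5% held out for validation
val_labels = labels[-n_val:]               # the last samples are used
print("positives in validation:", sum(val_labels))

# Shuffling the examples beforehand gives a mixed validation set.
indices = list(range(n_samples))
random.Random(42).shuffle(indices)
shuffled_labels = [labels[i] for i in indices]
shuffled_val = shuffled_labels[-n_val:]
print("positives after shuffling:", sum(shuffled_val))
```

In the notebook the same idea would mean applying one shared permutation to `train_tokens_padded` and `y_train` before calling `model.fit()`.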

In [27]:
model.fit(
        train_tokens_padded, y_train,
          validation_split=0.05, epochs=5,
          batch_size=128,
          verbose=1,
          shuffle=True,
          validation_batch_size=64
          )

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f00a21f13d0>

### Evaluating the model

To evaluate the model we are going to call `model.evaluate()` on the test data.

In [28]:
model.evaluate(test_tokens_padded, y_test, verbose=1)



[0.5472753643989563, 0.79448002576828]

### Embeddings

_What an embedding essentially does is group words with similar meanings close to each other in the vector space._


The model cannot work on integer-tokens directly, because they are integer values that may range between 0 and the number of words in our vocabulary, e.g. 10000. So we need to convert the integer-tokens into vectors of values that are roughly between -1.0 and 1.0 which can be used as input to a neural network.

This mapping from integer-tokens to real-valued vectors is also called an "embedding". It is essentially just a matrix where each row contains the vector-mapping of a single token. This means we can quickly look up the mapping of each integer-token by simply using the token as an index into the matrix. The embeddings are learned along with the rest of the model during training.

Ideally the embedding would learn a mapping where words that are similar in meaning also have similar embedding-values.
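
The lookup described above is nothing more than row indexing. Here is a minimal NumPy sketch with a hypothetical toy matrix (6 tokens, 4 dimensions, random values); it is not the trained embedding of the model.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim = 6, 4             # hypothetical toy sizes
toy_embedding = rng.uniform(-0.05, 0.05, size=(vocab_size, embedding_dim))

tokens = np.array([2, 5, 2, 0])              # a short padded toy sequence
vectors = toy_embedding[tokens]              # one row per token

print(vectors.shape)                         # (sequence length, embedding size)
```

The same token always maps to the same row, so both occurrences of token 2 get an identical vector.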

In [29]:
embedding_layer = model.get_layer("embedding_layer")
embedding_layer

<keras.layers.embeddings.Embedding at 0x7f00a37f0b90>

Now we can get the embedding weights of the embedding layer.

In [31]:
embedding_weights = embedding_layer.get_weights()[0]
embedding_weights.shape

(124253, 100)


Let us get the integer-token for the word 'good', which is just an index into the vocabulary.

In [32]:
token_good = tokenizer.word_index['good']
token_good

50

Let us also get the integer-token for the word 'great'.

In [33]:
token_great = tokenizer.word_index['great']
token_great

79

> _These integer-tokens may be far apart and will depend on the frequency of those words in the data-set._

In [34]:
embedding_weights[token_good]

array([ 0.03654959,  0.05182431,  0.00789072, -0.02546148,  0.03894342,
       -0.00170818,  0.00561844,  0.0036592 ,  0.0351015 ,  0.03506294,
        0.00292486, -0.03541147,  0.04214965, -0.01096839,  0.03691989,
        0.00019001,  0.00644179,  0.05432602,  0.02255417,  0.04553063,
       -0.01186652,  0.02498116, -0.00444889,  0.01255342,  0.01616195,
        0.01642016, -0.04517556, -0.03839105, -0.00937105,  0.01329357,
        0.07752912, -0.01429281,  0.03701348, -0.05405187,  0.03863146,
       -0.02035496, -0.02392949, -0.04286262, -0.02362911,  0.04718411,
       -0.00326303, -0.02075995, -0.03899069, -0.0362411 ,  0.00220468,
       -0.03179303,  0.03786547,  0.03486606, -0.02811677, -0.01869663,
        0.01828381,  0.00058268,  0.00803954, -0.0439345 ,  0.00742137,
        0.03557568, -0.00484395,  0.03125272,  0.04664703, -0.04787948,
        0.08408724, -0.01070623,  0.0320828 ,  0.0153351 ,  0.01114408,
        0.03950981,  0.04133573, -0.05306445, -0.0173699 , -0.00

In [35]:
embedding_weights[token_great]

array([ 0.07881194,  0.0643089 , -0.02972414,  0.09887278,  0.1089631 ,
        0.1030881 , -0.09217654, -0.09186198, -0.05157833,  0.02069101,
        0.00117218,  0.00072236,  0.0785922 ,  0.02382322,  0.06888454,
        0.09756953, -0.00853645,  0.01786042,  0.00340038,  0.0597227 ,
       -0.0633194 , -0.0812819 , -0.05079522, -0.07104529,  0.05604706,
       -0.03787503,  0.06226208,  0.0494625 ,  0.00696902,  0.08610889,
       -0.00151188,  0.03483196,  0.03773798, -0.04069658, -0.02446506,
        0.05908711, -0.01301472,  0.01599188, -0.01294837, -0.04162518,
       -0.06321004,  0.02767151, -0.01786947, -0.00384987, -0.06131248,
       -0.05199639, -0.09441877, -0.09548356, -0.09463506,  0.00136399,
       -0.02894701, -0.03729963, -0.05165001, -0.02650168, -0.07028391,
        0.06605057, -0.04024285, -0.04439946,  0.07065816, -0.05921933,
        0.06593991, -0.07370889,  0.026708  ,  0.03967481, -0.01717189,
        0.04019506, -0.1038826 , -0.01338773,  0.03761966,  0.03

### Sorted words


We can also sort all the words in the vocabulary according to their "similarity" in the embedding-space. We want to see if words that have similar embedding-vectors also have similar meanings.

> _Similarity of embedding-vectors can be measured by different metrics, e.g. **Euclidean distance** or cosine distance._
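
As a small illustration of the two metrics, the sketch below computes them with NumPy for a few hand-picked vectors. The helper functions are written for this example; the notebook itself uses `cdist` from SciPy.

```python
import numpy as np

def cosine_distance(u, v):
    """1 - cosine similarity: 0 for same direction, 2 for opposite."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def euclidean_distance(u, v):
    """Straight-line distance between the two vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.linalg.norm(u - v))

same = cosine_distance([1.0, 2.0], [2.0, 4.0])        # same direction
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])  # unrelated
opposite = cosine_distance([1.0, 0.0], [-3.0, 0.0])   # opposite direction
straight = euclidean_distance([0.0, 0.0], [3.0, 4.0])
print(same, orthogonal, opposite, straight)
```

Cosine distance ignores vector length and ranges from 0 to 2, which is why the sorted lists further down run from 0.000 up to values close to 2.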

The following helper function calculates these distances and prints the words in sorted order.

In [36]:
def print_sorted_words(word, metric='cosine'):
    """
    Print the words in the vocabulary sorted according to their
    embedding-distance to the given word.
    Different metrics can be used, e.g. 'cosine' or 'euclidean'.
    """

    # Get the token (i.e. integer ID) for the given word.
    token = tokenizer.word_index[word]

    # Get the embedding for the given word. Note that the
    # embedding-weight-matrix is indexed by the word-tokens
    # which are integer IDs.
    embedding = embedding_weights[token]

    # Calculate the distance between the embeddings for
    # this word and all other words in the vocabulary.
    distances = cdist(embedding_weights, [embedding],
                      metric=metric).T[0]
    
    # Get an index sorted according to the embedding-distances.
    # These are the tokens (integer IDs) for words in the vocabulary.
    # Token 0 is the padding value and has no word, so it is dropped
    # here to keep the words and distances aligned.
    sorted_index = np.argsort(distances)
    sorted_index = sorted_index[sorted_index != 0]

    # Sort the embedding-distances.
    sorted_distances = distances[sorted_index]

    # Sort all the words in the vocabulary according to their
    # embedding-distance. This is a bit excessive because we
    # will only print the top and bottom words.
    sorted_words = [inverse_map[token] for token in sorted_index]

    # Helper-function for printing words and embedding-distances.
    def _print_words(words, distances):
        for i, (word, distance) in enumerate(zip(words, distances)):
            print("{0:.3f} - {1}".format(distance, word))

    # Number of words to print from the top and bottom of the list.
    k = 10

    print("Distance from '{0}':".format(word))

    # Print the words with smallest embedding-distance.
    _print_words(sorted_words[0:k], sorted_distances[0:k])

    print("...")

    # Print the words with highest embedding-distance.
    _print_words(sorted_words[-k:], sorted_distances[-k:])


We can then print the words that are near and far from the word 'great' in terms of their vector-embeddings. Note that these may change each time you train the model.

In [37]:
print_sorted_words('great', metric='cosine')

Distance from 'great':
0.000 - great
0.172 - masterpiece
0.182 - archive
0.187 - ensemble
0.191 - communication
0.195 - malone
0.196 - sergio
0.196 - georges
0.199 - jackson
0.202 - nowadays
...
1.808 - quantum
1.809 - waste
1.809 - prom
1.809 - limp
1.811 - impersonation
1.812 - uninteresting
1.813 - murray
1.814 - lousy
1.825 - pink
1.825 - tiresome


We can also try the word `"worst"`

In [38]:
print_sorted_words('worst', metric='cosine')

Distance from 'worst':
0.000 - worst
0.044 - irwin
0.044 - disappointment
0.047 - dreck
0.048 - downhill
0.050 - lifeless
0.052 - prom
0.053 - forgettable
0.053 - clunky
0.054 - nauseating
...
1.912 - popping
1.912 - darker
1.914 - steamy
1.914 - contributed
1.916 - communication
1.916 - article
1.917 - superbly
1.918 - showdown
1.918 - malone
1.923 - shared


### Saving word embeddings.

We are then going to save our word embeddings as a `txt` file, using the same format as the `glove.6B` word embeddings. The file will look as follows:

```txt
the  0.418 0.24968 -0.41242 0.1217 0.34527 -0.044457 -0.49688 -0.17862 -0.00066023 -0.6566 0.27843 -0.14767 -0.55677 0.14658
to 0.68047 -0.039263 0.30186 -0.17792 0.42962 0.032246 -0.41376 0.13228 -0.29847 -0.085253 0.17118 0.22419 -0.10046 -0.4365
....
```
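
A file in this layout can be read back with a few lines of Python. The sketch below is an assumption about how one might load it: `load_word_embeddings` is a hypothetical helper, and it reads from an in-memory string here instead of a real file so that it runs on its own.

```python
import io

def load_word_embeddings(file_obj):
    """Parse lines of the form 'word v1 v2 ... vn' into a dict."""
    embeddings = {}
    for line in file_obj:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip empty or malformed lines
        embeddings[parts[0]] = [float(x) for x in parts[1:]]
    return embeddings

sample = io.StringIO(
    "the 0.418 0.24968 -0.41242\n"
    "to 0.68047 -0.039263 0.30186\n"
)
loaded = load_word_embeddings(sample)
print(loaded["the"])
```

For a real file, the same function could be called inside `with open(path, encoding="utf-8") as f:`.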


In [50]:
def save_word_embeddings(path, embedding_weights=None, tokenizer=None):
  # Special tokens that should not be written to the file.
  special_tokens = {"<oov>", "<pad>", "<unk>", "<sos>", "<eos>"}

  with open(path, "w", encoding="utf-8") as file_w:
    for w, idx in tokenizer.word_index.items():
      if w in special_tokens:
        continue
      # Skip words whose index falls outside the embedding matrix
      # instead of silently reusing the previous word's weights.
      if idx >= len(embedding_weights):
        continue
      weight = embedding_weights[idx]
      line = f"{w} " + " ".join(f"{round(float(f), 5)}" for f in weight)
      file_w.write(f"{line}\n")
  print("done")

In [46]:
embedding_weights.shape

(124253, 100)

In [49]:
%%time

save_word_embeddings("word-embeddings.100d.txt", embedding_weights, tokenizer)

done
CPU times: user 18.4 s, sys: 310 ms, total: 18.7 s
Wall time: 18.7 s
