## Setup

This section installs the required packages, imports the libraries, and defines helper functions so that the notebook code below stays tidy.

In [1]:
!pip uninstall tensorflow -yq
!pip install "tensorflow-gpu>=2.0" gpustat -Uq



In [0]:
from IPython.display import display, HTML

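def export_html(result, max_activation):
    """Render (word, activation) pairs as HTML spans shaded by activation.

    Positive activations are shaded red and non-positive ones blue;
    max_activation is the activation that maps to full colour.
    """
    output = ""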
    max_activation += 1e-8
    
    for line in result:
        word, activation = line
            
        if activation>0:
            activation = activation/max_activation
            colour = str(int(255 - activation*255))
            tag_open = "<span style='background-color: rgb(255,"+colour+","+colour+");'>"
            
        else:
            activation = -1 * activation/max_activation
            colour = str(int(255 - activation*255))
            tag_open = "<span style='background-color: rgb("+colour+","+colour+",255);'>"
            
        tag_close = "</span>"
        tag = " ".join([tag_open, word, tag_close])
        
        output = output + tag
        
    output = output + ""
    
    return output
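
To make the colour mapping in `export_html` concrete, here is a minimal sketch of the same arithmetic in isolation. The helper name `activation_to_rgb` is hypothetical (it is not part of the notebook); it only mirrors the mapping above, where positive values fade from white to red and non-positive values fade from white to blue.

```python
def activation_to_rgb(activation, max_activation):
    """Map a signed activation to an (r, g, b) tuple, mirroring export_html.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    scale = max_activation + 1e-8  # same epsilon as export_html, avoids division by zero
    if activation > 0:
        level = int(255 - (activation / scale) * 255)
        return (255, level, level)  # white -> red as the activation grows
    level = int(255 - (-activation / scale) * 255)
    return (level, level, 255)      # white -> blue as the activation becomes more negative


print(activation_to_rgb(0.0, 1.0))   # zero activation stays white
print(activation_to_rgb(1.0, 1.0))   # the largest positive value is fully red
print(activation_to_rgb(-0.5, 1.0))  # negative values are tinted blue
```

Because the visualizations below pass squared activations to `export_html`, only the red branch is exercised there; the blue branch matters only for signed inputs.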

In [0]:
import time
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = "retina"
import tensorflow.compat.v2 as tf
from tensorflow.keras import layers

In [0]:
def train_simple_lm(model, x_train, y_train, verbose=2, test=False):
    start_time = time.time()

    print("[Phase 1/3] Warming up...")
    opt = tf.keras.optimizers.Adam(learning_rate=0.001)
    model.compile(loss="sparse_categorical_crossentropy",
                optimizer=opt,
                metrics=["acc"])
    history_1 = model.fit(x_train, y_train, epochs=10,
                          batch_size=1, shuffle=False,
                          callbacks=[], verbose=verbose)
    scores = model.evaluate(x_train, y_train, batch_size=32, verbose=verbose)
    print(" - Loss:", scores[0])
    print(" - Acc: ", scores[1])

    if not test:
        print("[Phase 2/3] Fast training...")
        opt = tf.keras.optimizers.Adam(learning_rate=0.01)
        model.compile(loss="sparse_categorical_crossentropy",
                    optimizer=opt,
                    metrics=["acc"])
        early_stop = tf.keras.callbacks.EarlyStopping(monitor='acc',
                                                      restore_best_weights=True,
                                                      patience=5)
        history_2 = model.fit(x_train, y_train, epochs=100,
                            batch_size=2, shuffle=True,
                            callbacks=[early_stop], verbose=verbose)
        scores = model.evaluate(x_train, y_train, batch_size=32, verbose=verbose)
        print(" - Loss:", scores[0])
        print(" - Acc: ", scores[1])

        print("[Phase 3/3] Train to convergence...")
        opt = tf.keras.optimizers.Adam(learning_rate=0.001)
        model.compile(loss="sparse_categorical_crossentropy",
                    optimizer=opt,
                    metrics=["acc"])
        early_stop = tf.keras.callbacks.EarlyStopping(monitor='acc',
                                                      restore_best_weights=True,
                                                      patience=10)
        history_3 = model.fit(x_train, y_train, epochs=200,
                              batch_size=1, shuffle=True,
                              callbacks=[early_stop], verbose=verbose)
        scores = model.evaluate(x_train, y_train, batch_size=32, verbose=verbose)
        print(" - Loss:", scores[0])
        print(" - Acc: ", scores[1])
        
        opt = tf.keras.optimizers.Adam(learning_rate=0.0001)
        model.compile(loss="sparse_categorical_crossentropy",
                    optimizer=opt,
                    metrics=["acc"])
        early_stop = tf.keras.callbacks.EarlyStopping(monitor='acc',
                                                      restore_best_weights=True,
                                                      patience=10)
        history_4 = model.fit(x_train, y_train, epochs=200,
                              batch_size=1, shuffle=True,
                              callbacks=[early_stop], verbose=verbose)
        scores = model.evaluate(x_train, y_train, batch_size=32, verbose=verbose)
        print(" - Loss:", scores[0])
        print(" - Acc: ", scores[1])

        log_x = history_1.history['loss'] + history_2.history['loss'] + history_3.history['loss'] + history_4.history['loss']
        plt.plot(log_x)
        plt.ylabel('Loss')
        plt.xlabel('Epoch')
        plt.show()
    else:
        log_x = history_1.history['loss']
        plt.plot(log_x)
        plt.ylabel('Loss')
        plt.xlabel('Epoch')
        plt.show()

    end_time = time.time()

    print("Done! Training took", int(end_time-start_time), "seconds")

    return model

# Exploring RNNs

In this notebook, we will train an **LSTM** and a vanilla **RNN** (Keras `SimpleRNN`) on a small character-level language modelling task and visualize how each model processes a sequence.

We will visualize the **activations**, **hidden states** and **information dependency** inside these models.

In [5]:
!gpustat

1d0978a7080a         Wed Jan  1 14:51:43 2020  418.67
[0] Tesla P4         | 44'C,   0 % |     0 /  7611 MB |


In [0]:
seq_len = 64
model_dim = 16
TRAIN = False

## Load Text Data

We will load a short paragraph from Wikipedia about NVIDIA.

The goal here is to train an LSTM and a vanilla RNN to autocomplete the passage: given the previous `seq_len` (64) characters, each model predicts the next character.

In [0]:
text = "Nvidia Corporation is more commonly referred to as Nvidia. It was formerly stylized as nVidia on products from the mid 90s to early 2000s. Nvidia is an American technology company incorporated in Delaware and based in Santa Clara, California. Nvidia designs graphics processing units for the gaming and professional markets, as well as system on a chip units for the mobile computing and automotive market. Nvidia primary GPU product line, labeled GeForce, is in direct competition with Advanced Micro Devices Radeon products. Nvidia expanded its presence in the gaming industry with its handheld Shield Portable, Shield Tablet, and Shield Android TV. Since 2014, Nvidia has diversified its business focusing on four markets: gaming, professional visualization, data centers, and auto. Nvidia is also now focused on artificial intelligence. In addition to GPU manufacturing, Nvidia provides parallel processing capabilities to researchers and scientists that allow them to efficiently run high performance applications. They are deployed in supercomputing sites around the world. "

In [8]:
text = text.lower().replace(" ", "_").replace(",", "")
text_len = len(text)
print("Text length:", text_len)

vocab = sorted(set(text))
vocab_size = len(vocab) + 1
print("Vocab size:", vocab_size)

tokenizer = tf.keras.preprocessing.text.Tokenizer(lower=True, char_level=True)
tokenizer.fit_on_texts([text])

tokens = tokenizer.texts_to_sequences([text])[0]

x_train = []
y_train = []

for i in range(text_len-seq_len):
    x_train.append(tokens[i:i+seq_len])
    y_train.append(tokens[i+seq_len])

Text length: 1069
Vocab size: 33
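
The loop above builds one training example per position by sliding a window of `seq_len` tokens over the text and using the token that follows the window as the label. A minimal sketch of the same idea on a toy string is shown here. The names `make_windows`, `toy_text` and `char_to_id` are hypothetical, and the alphabetical id assignment is an assumption made for readability (the Keras `Tokenizer` assigns ids by character frequency instead).

```python
def make_windows(tokens, seq_len):
    """Return (inputs, targets): each target is the token right after its window."""
    inputs, targets = [], []
    for i in range(len(tokens) - seq_len):
        inputs.append(tokens[i:i + seq_len])
        targets.append(tokens[i + seq_len])
    return inputs, targets


toy_text = "nvidia_is"
# Ids start at 1 so that 0 stays reserved, which is why the notebook
# uses vocab_size = len(vocab) + 1.
char_to_id = {ch: idx + 1 for idx, ch in enumerate(sorted(set(toy_text)))}
id_to_char = {idx: ch for ch, idx in char_to_id.items()}
toy_tokens = [char_to_id[ch] for ch in toy_text]

xs, ys = make_windows(toy_tokens, seq_len=4)
for x, y in zip(xs, ys):
    print("".join(id_to_char[t] for t in x), "->", id_to_char[y])
```

With 1069 characters and a window of 64, this scheme yields 1069 - 64 = 1005 examples, which matches the `1005` shown in the evaluation output later in the notebook.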


## Build LSTM model

In [0]:
tf.keras.backend.clear_session()
tf.config.optimizer.set_jit(False)

In [10]:
l_input = layers.Input(shape=(seq_len,))
l_embed = layers.Embedding(vocab_size, model_dim)(l_input)
l_rnn_1, state_h, state_c = layers.LSTM(model_dim,
                                        return_state=True,
                                        return_sequences=False)(l_embed)
preds = layers.Dense(vocab_size,
                     activation="softmax")(l_rnn_1)

model = tf.keras.models.Model(inputs=l_input, outputs=preds)
model.summary()

Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, 64)]              0         
_________________________________________________________________
embedding (Embedding)        (None, 64, 16)            528       
_________________________________________________________________
lstm (LSTM)                  [(None, 16), (None, 16),  2112      
_________________________________________________________________
dense (Dense)                (None, 33)                561       
Total params: 3,201
Trainable params: 3,201
Non-trainable params: 0
_________________________________________________________________
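
The parameter counts in the summary above can be reproduced by hand. The sketch below uses the standard formulas for each layer type; the function names are hypothetical helpers written for this check, not Keras APIs.

```python
def embedding_params(vocab_size, dim):
    # one dim-sized vector per vocabulary entry
    return vocab_size * dim


def lstm_params(input_dim, units):
    # four gates, each with an input kernel, a recurrent kernel and a bias
    return 4 * (input_dim * units + units * units + units)


def simple_rnn_params(input_dim, units):
    # a single input kernel, recurrent kernel and bias (no gates)
    return input_dim * units + units * units + units


def dense_params(input_dim, units):
    # weight matrix plus bias
    return input_dim * units + units


vocab_size, model_dim = 33, 16
counts = {
    "embedding": embedding_params(vocab_size, model_dim),
    "lstm": lstm_params(model_dim, model_dim),
    "dense": dense_params(model_dim, vocab_size),
}
for name, value in counts.items():
    print(name, value)
print("total", sum(counts.values()))
print("simple_rnn", simple_rnn_params(model_dim, model_dim))
```

The LSTM layer has exactly four times the parameters of a `SimpleRNN` of the same width, because each of its four gates carries its own set of weights.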


## Load/Train LSTM model

In [11]:
if TRAIN:
    model = train_simple_lm(model, x_train, y_train, verbose=2)
    model.save("lstm.h5")
else:
    print("Loading pretrained LSTM model:")
    model_url = "https://github.com/OpenSUTD/machine-learning-workshop/releases/download/v0.0.02/lstm.h5"
    model_path = tf.keras.utils.get_file("lstm.h5", model_url)
    model.load_weights(model_path)
    model.compile(loss="sparse_categorical_crossentropy",
                  optimizer="adam",
                  metrics=["acc"])
    scores = model.evaluate(x_train, y_train, batch_size=32, verbose=2)
    print(" - Loss:", scores[0])
    print(" - Acc: ", scores[1])

Loading pretrained LSTM model:
1005/1 - 2s - loss: 0.2780 - acc: 0.9264
 - Loss: 0.3344373474666728
 - Acc:  0.9263682


## Visualizing the LSTM

### LSTM Plotting Functions

In [0]:
model_act = tf.keras.models.Model(inputs=l_input, outputs=[state_h, state_c])
model_act.save("lstm_act.h5")

def infer_h_c(start_n):
    end_n = start_n + seq_len + 1
    h_list, c_list = [], []
    for n in range(start_n, end_n):
        input_text = text[n-seq_len:n]
        input_tokens = tokenizer.texts_to_sequences([input_text])
        h, c = model_act.predict([input_tokens])
        h, c = h[0], c[0]
        h_list.append(h)
        c_list.append(c)
    print("Magnitude of Hidden State")
    for dim in range(model_dim):
        act_t_list = [a[dim]**2 for a in h_list]
        act_t_max = max(act_t_list)
        result = zip(input_text, act_t_list)
        output = export_html(result, act_t_max)
        if dim < 10:
            dim = "0"+str(dim)
        else:
            dim = str(dim)
        output = "<tt>" + dim + " : " + output + "</tt>"
        display(HTML(output))
    print("")
    print("Magnitude of Cell State")
    for dim in range(model_dim):
        act_t_list = [a[dim]**2 for a in c_list]
        act_t_max = max(act_t_list)
        result = zip(input_text, act_t_list)
        output = export_html(result, act_t_max)
        if dim < 10:
            dim = "0"+str(dim)
        else:
            dim = str(dim)
        output = "<tt>" + dim + " : " + output + "</tt>"
        display(HTML(output))
    print("")

def plot_dependency(n):
    input_text = text[n:n+seq_len]
    input_tokens = tokenizer.texts_to_sequences([input_text])
    label = [text[n+seq_len]]
    label = tokenizer.texts_to_sequences([label])
    loss = tf.keras.losses.SparseCategoricalCrossentropy()
    x = tf.convert_to_tensor(input_tokens, dtype=tf.float32)
    y_true = tf.convert_to_tensor(label, dtype=tf.float32)

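    # Trainable weights are watched by the tape automatically.
    with tf.GradientTape() as g:
        y = model(x)
        loss_value = loss(y_true, y)
    # Gradient of the loss w.r.t. every trainable weight. The first entry
    # belongs to the embedding matrix and is a sparse IndexedSlices whose
    # values hold one row per input position.
    grads = g.gradient(loss_value, model.trainable_weights)
    input_grads = grads[0].values.numpy()
    # Collapse the embedding dimension into one score per input character.
    input_grads = np.sum(np.abs(input_grads)**0.5, axis=-1)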

    result = zip(input_text, input_grads)
    output = export_html(result, max(input_grads))
    output = output + " &nbsp; -> &nbsp; " + text[n+seq_len]
    output = "<tt>" + output + "</tt>"
    display(HTML(output))

### LSTM Visualizations

**Activations** and **Cell State (Memory)**

In [13]:
n = 130 # pick a segment of text

infer_h_c(n)

Magnitude of Hidden State



Magnitude of Cell State





Visualizing **Information Dependency** (*Connectivity*, as described in this [distill.pub blog post](https://distill.pub/2019/memorization-in-rnns/)).

This is the **magnitude of the gradient of the loss (computed from the model's prediction) with respect to each input embedding**. The gradient measures how much the model's output would change in response to a small change in that input, so we use its magnitude as a measure of information dependency.

Hence, we can visualize how much the model's prediction depends on each element of the current sequence.
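
To illustrate the idea without TensorFlow, here is a minimal sketch on a toy model. It is not the notebook's `plot_dependency`: it assumes a scalar-state tanh RNN with a squared-error loss, and it derives the backward pass by hand. All names (`forward`, `input_gradients`, `dependency_scores`) and all numbers are hypothetical. Only the final reduction, the sum of `|gradient| ** 0.5` over the embedding dimension, is taken from the notebook.

```python
import math


def forward(embeddings, w_x, w_h):
    """Run a scalar-state tanh RNN and return the hidden states h_1..h_T."""
    h, states = 0.0, []
    for e in embeddings:
        h = math.tanh(w_h * h + sum(w * x for w, x in zip(w_x, e)))
        states.append(h)
    return states


def loss_fn(embeddings, w_x, w_h, target):
    """Squared error between the final hidden state and the target."""
    return (forward(embeddings, w_x, w_h)[-1] - target) ** 2


def input_gradients(embeddings, w_x, w_h, target):
    """Backpropagate d(loss)/d(embedding) for every time step."""
    states = forward(embeddings, w_x, w_h)
    d_h = 2.0 * (states[-1] - target)         # d(loss)/d(h_T)
    grads = [None] * len(embeddings)
    for t in reversed(range(len(embeddings))):
        d_pre = d_h * (1.0 - states[t] ** 2)  # back through tanh
        grads[t] = [d_pre * w for w in w_x]   # gradient w.r.t. embedding t
        d_h = d_pre * w_h                     # pass on to the previous state
    return grads


def dependency_scores(grads):
    """One score per position: sum of |g| ** 0.5 over the embedding dimension."""
    return [sum(abs(g) ** 0.5 for g in row) for row in grads]


embeddings = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8], [0.9, 0.1]]
w_x, w_h, target = [0.7, -0.4], 0.5, 1.0
grads = input_gradients(embeddings, w_x, w_h, target)
scores = dependency_scores(grads)
for t, score in enumerate(scores):
    print("position", t, "dependency", round(score, 4))
```

A larger score means that nudging the embedding at that position would change the loss more, which is the sense in which the prediction "depends" on that position.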

In [14]:
plot_dependency(n)

We can see how the LSTM decides on the next character when predicting the word "`nvidia`".

The LSTM looks far back into the previous sentence to start predicting the word "`nvidia`", and then gradually relies more on the partially predicted word and sentence to complete the sentence.

In [15]:
s = 177
for i in range(s,s+16):
    plot_dependency(i)

In [16]:
plot_dependency(699)

## Build RNN model

In [0]:
# improve vanilla RNN training speed
# LSTM doesn't need this since it has a cuDNN implementation
tf.keras.backend.clear_session()
tf.config.optimizer.set_jit(True)
unroll = True

In [18]:
l_input = layers.Input(shape=(seq_len,))
l_embed = layers.Embedding(vocab_size, model_dim)(l_input)
l_rnn_1, h = layers.SimpleRNN(model_dim,
                              unroll=unroll,
                              return_state=True,
                              return_sequences=False)(l_embed)
preds = layers.Dense(vocab_size,
                     activation="softmax")(l_rnn_1)

model = tf.keras.models.Model(inputs=l_input, outputs=preds)
model.summary()

Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, 64)]              0         
_________________________________________________________________
embedding (Embedding)        (None, 64, 16)            528       
_________________________________________________________________
simple_rnn (SimpleRNN)       [(None, 16), (None, 16)]  528       
_________________________________________________________________
dense (Dense)                (None, 33)                561       
Total params: 1,617
Trainable params: 1,617
Non-trainable params: 0
_________________________________________________________________


## Load/Train RNN model

In [19]:
if TRAIN:
    model = train_simple_lm(model, x_train, y_train, verbose=2)
    model.save("rnn.h5")
else:
    print("Loading pretrained RNN model:")
    model_url = "https://github.com/OpenSUTD/machine-learning-workshop/releases/download/v0.0.02/rnn.h5"
    model_path = tf.keras.utils.get_file("rnn.h5", model_url)
    model.load_weights(model_path)
    model.compile(loss="sparse_categorical_crossentropy",
                  optimizer="adam",
                  metrics=["acc"])
    scores = model.evaluate(x_train, y_train, batch_size=32, verbose=2)
    print(" - Loss:", scores[0])
    print(" - Acc: ", scores[1])

Loading pretrained RNN model:
1005/1 - 3s - loss: 1.2131 - acc: 0.6308
 - Loss: 1.2072982505779362
 - Acc:  0.6308458


## Visualizing the RNN

### RNN Plotting Functions

In [0]:
model_act = tf.keras.models.Model(inputs=l_input, outputs=[h])
model_act.save("rnn_act.h5")

def infer_h_c(start_n):
    end_n = start_n + seq_len + 1
    h_list, c_list = [], []
    for n in range(start_n, end_n):
        input_text = text[n-seq_len:n]
        input_tokens = tokenizer.texts_to_sequences([input_text])
        h = model_act.predict([input_tokens])[0]
        h_list.append(h)
    print("Magnitude of Hidden State")
    for dim in range(model_dim):
        act_t_list = [a[dim]**2 for a in h_list]
        act_t_max = max(act_t_list)
        result = zip(input_text, act_t_list)
        output = export_html(result, act_t_max)
        if dim < 10:
            dim = "0"+str(dim)
        else:
            dim = str(dim)
        output = "<tt>" + dim + " : " + output + "</tt>"
        display(HTML(output))

### RNN Visualizations

**Activations**

(for the vanilla RNN, the layer output and the **hidden state** are the same vector, so there is no separate cell state to plot)

In [21]:
infer_h_c(n)

Magnitude of Hidden State


**Information Dependency**

Note that the RNN has a much more limited ability to look further back into the sequence to help it make predictions.

As a result, the RNN performs much worse (training accuracy of ~63%, compared to ~93% for the LSTM).
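
One common intuition for this difference can be sketched with simple arithmetic. In a vanilla RNN, the gradient that reaches a distant time step is multiplied at every step by the recurrent weight and the local tanh slope, while along the LSTM cell-state path it is multiplied by the forget gate. The sketch below is a heavily simplified scalar illustration, not a measurement of the trained models above: the function names are hypothetical, and the values 0.9, 0.8 and 0.98 are assumptions chosen only to show the trend.

```python
def rnn_gradient_factor(recurrent_weight, tanh_slope, steps):
    """|d h_T / d h_0| for a scalar tanh RNN with a constant local slope."""
    return abs(recurrent_weight * tanh_slope) ** steps


def lstm_cell_gradient_factor(forget_gate, steps):
    """d c_T / d c_0 along the cell-state path with a constant forget gate."""
    return forget_gate ** steps


steps = 64  # same length as seq_len in this notebook
rnn_factor = rnn_gradient_factor(recurrent_weight=0.9, tanh_slope=0.8, steps=steps)
lstm_factor = lstm_cell_gradient_factor(forget_gate=0.98, steps=steps)

print("vanilla RNN factor over", steps, "steps:", rnn_factor)
print("LSTM cell-state factor over", steps, "steps:", lstm_factor)
```

Under these assumed values the vanilla RNN factor shrinks by many orders of magnitude over 64 steps while the LSTM cell-state factor stays within one order of magnitude, which is consistent with the shorter look-back seen in the RNN dependency plots. Real gates and slopes vary per step and per unit, so this is only a rough guide.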

In [22]:
plot_dependency(n)

In [23]:
s = 177
for i in range(s,s+16):
    plot_dependency(i)

In [24]:
plot_dependency(699)

# Conclusion

In this notebook, you have seen some of the differences in learning and performance characteristics between an LSTM and a vanilla RNN.