## DS807: Applied machine learning
Christian M. Dahl. cmd@sam.sdu.dk.

### Recurrent neural networks

Notes: For the purpose of this notebook, DLWP refers to Deep Learning with Python by François Chollet (ISBN-13: 9781617294433).

### Version

v2: 25-11-2022

## Introduction

*Recurrent neural networks* (RNNs) are a type of neural network used to handle/analyze *sequential* data, i.e. data where the next data point is somehow related to the prior data points.

This is the case in many applications, including forecasting sales, analyzing text, working with videos, and interacting with dynamic environments.

RNNs are a *special case* of feed forward neural networks - just as convolutional neural networks (CNNs) are. And just as CNNs allowed us to more easily work with grids of values (such as images), RNNs more easily allow us to work with sequences of values (such as time series or text).

In this lecture, we will learn how to use RNNs to solve tasks of modelling sequences. Today, we will learn the fundamentals of RNNs, including showcasing how they can be used to solve regression and classification problems.

Next week, we will use an RNN to generate text. In particular, we will use the works of Shakespeare to train a neural network that can act as a playwright - in the style of Shakespeare! To write meaningful text, we must know what is already written! Hence, we exploit the recurrent/sequential structure of the data to create meaningful text.

## Program

After this lecture, you will:
1. Know when to use RNNs, including considerations for the types of data to use.
1. Know about different types of RNNs (standard, LSTM, GRU, and bidirectional).
1. Know how to handle text data, including n-grams and embeddings.
1. Know how to build and train RNNs, including considerations for optimization and regularization.
1. Have applied your knowledge in order to solve classification problems.
1. Have applied your knowledge in order to use an RNN as a playwright in the style of Shakespeare (i.e. text generation).


<img src="./graphics/examplesofsequences.png" alt="Drawing" style="width: 1000px;"/>

Source: "Andrew Ng"

## What is an RNN?

An RNN is a type of network that for each input outputs both the standard output we are familiar with, as well as a "state" that is passed on as part of the input for the next period.

This way, the network makes a prediction - as we are used to - and *summarizes the information needed to be passed on*. In this way, the next prediction the network makes (can) use this information as part of its input.

As such, an RNN is a type of network with a "loop", in that it feeds into itself. This loop is sometimes called the "recurrent connection".

## Input, output, and the recurrent connection

<img src="./graphics/figure_6-9.png" alt="Drawing" style="width: 600px;"/>
Source: "DLWP"

## Neural networks are directed acyclic graphs

We train neural networks by backpropagation, *which is not possible if a true loop exists*.

Such models, with no true loops, are known as directed acyclic graphs (DAGs). Any neural network can be described as a DAG.

But then, how do RNNs work? We use something called "unfolding", which is to say that we "unfold" the loop so that it takes the form of a DAG.

This is possible since we never in practice loop an infinite number of times, but rather some finite number, *k*.

As such, a *k*-loop may be unfolded into a DAG consisting of *k* parts.
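To make the unfolding idea concrete, here is a minimal pure-Python sketch. The scalar weights `w_h`, `w_x`, `b` and the function names are illustrative assumptions, not part of any library. It shows that a loop run for *k* = 3 steps computes the same value as a loop-free composition of three copies of the same cell.

```python
import math

def cell(h_prev, x, w_h=0.5, w_x=1.0, b=0.0):
    """One application of the recurrent cell (scalar toy version)."""
    return math.tanh(w_h * h_prev + w_x * x + b)

def run_loop(xs, h0=0.0):
    """The 'looped' view: the same cell is applied once per time step."""
    h = h0
    for x in xs:
        h = cell(h, x)
    return h

def run_unfolded_3(xs, h0=0.0):
    """The 'unfolded' view for k = 3: a plain composition with no loop."""
    x1, x2, x3 = xs
    return cell(cell(cell(h0, x1), x2), x3)

xs = [0.1, -0.3, 0.7]
looped = run_loop(xs)
unfolded = run_unfolded_3(xs)
print(looped, unfolded)  # the two values are identical
```

Note that all three copies of the cell share the same weights, which is what distinguishes an unfolded RNN from an ordinary three-layer network.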

## Recurrent cells and unfolding

Usually, a recurrent neural network looks something like the image below - i.e. it has a "loop" where it feeds into itself (left). This can then be "unfolded" (right), in which way it looks more like a regular network, with an additional input and output (these are the "hidden states").

<img src="./graphics/RNN_illustrating_same_weights_01.png" alt="Drawing" style="width: 1500px;"/>


## Recurrent cells - what is the hidden state?

A key part of recurrent neural networks is the hidden state, specifically how we construct (output) it and how we use (input) it. 

The idea of the hidden state is to pass along historical information that will be used at later stages.

This can be done in many ways. You can use the raw output, construct some "optimal" summary of the history, or even use earlier raw inputs!

In general, finding some "optimal" summary is the best method. This "optimal" summary is typically found exactly how we normally train neural networks, i.e. as part of the training through stochastic gradient descent.

However, there are multiple types of recurrent cells, and these differ in how they construct, use, and pass on their hidden states.

## Example:

**Text:** "Yesterday, Harry Potter met Hermione Granger"

**Task:** Identify whether each word in the text is a name

**Define vocabulary:**

$ V = (a,aaron,...,granger,...,harry,...,hermione,...,
    met,...,potter,...,yesterday,...,zulu) $


**Define output:**

$(o_1,o_2,o_3,o_4,o_5,o_6)= (0,1,1,0,1,1)$

**Define input (one-hot-encoding):**

$ x_1 = (0,0,...,0,...,0,...,0,...,0,...,0,...,1,...,0) $,
$ x_2 = (0,0,...,0,...,1,...,0,...,0,...,0,...,0,...,0) $,
$ x_3 = (0,0,...,0,...,0,...,0,...,0,...,1,...,0,...,0) $,
$ x_4 = (0,0,...,0,...,0,...,0,...,1,...,0,...,0,...,0) $,
$ x_5 = (0,0,...,0,...,0,...,1,...,0,...,0,...,0,...,0) $,
$ x_6 = (0,0,...,1,...,0,...,0,...,0,...,0,...,0,...,0) $.


$$
\begin{aligned}
h_1 &= g_1(W_{hh} h_0 + W_{hx} x_1 + b_h) \\
o_1 &= g_2(W_{oh} h_1 + b_o) \\
h_2 &= g_1(W_{hh} h_1 + W_{hx} x_2 + b_h) \\
o_2 &= g_2(W_{oh} h_2 + b_o) \\
&\;\;\vdots \\
h_t &= g_1(W_{hh} h_{t-1} + W_{hx} x_t + b_h) \\
o_t &= g_2(W_{oh} h_t + b_o)
\end{aligned}
$$

where $g_1$ is tanh or ReLU, and $g_2$ is sigmoid or softmax.
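The recursion above can be sketched in a few lines of pure Python. This is only an illustration under stated assumptions: the vocabulary is reduced to the six words of the sentence, the hidden size of 3 is arbitrary, and the weights are random and untrained, so the outputs are not meaningful predictions. The helper names (`one_hot`, `matvec`, `random_matrix`) are hypothetical.

```python
import math
import random

vocab = ["granger", "harry", "hermione", "met", "potter", "yesterday"]
sentence = ["yesterday", "harry", "potter", "met", "hermione", "granger"]

def one_hot(word):
    return [1.0 if w == word else 0.0 for w in vocab]

def matvec(W, v):
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in W]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def random_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
n_hidden, n_in = 3, len(vocab)
W_hh = random_matrix(n_hidden, n_hidden)  # hidden state -> hidden state
W_hx = random_matrix(n_hidden, n_in)      # input -> hidden state
W_oh = random_matrix(1, n_hidden)         # hidden state -> output
b_h, b_o = [0.0] * n_hidden, 0.0

h = [0.0] * n_hidden  # h_0
outputs = []
for word in sentence:
    x = one_hot(word)
    pre = [a + c + d for a, c, d in zip(matvec(W_hh, h), matvec(W_hx, x), b_h)]
    h = [math.tanh(p) for p in pre]        # g_1 = tanh
    o = sigmoid(matvec(W_oh, h)[0] + b_o)  # g_2 = sigmoid
    outputs.append(o)

print([round(o, 3) for o in outputs])  # one score in (0, 1) per word
```

The same three weight matrices are reused at every time step. Only the hidden state `h` changes as the sentence is read.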



## Types of recurrent cells

There are three main types of recurrent cells that are used: the $\texttt{SimpleRNN}$, $\texttt{LSTM}$, and $\texttt{GRU}$ layers.

In practice, the first approach is very naïve and often not good, but it serves as a starting point for understanding the more complex versions. Its main weakness is that it has difficulties learning long-term dependencies, as it is not able to effectively pass along information over long distances (i.e. where it needs to be carried forward many steps).

Both LSTM (long short-term memory) and GRU (gated recurrent unit) layers are widely used and very powerful. They both aim to solve the issues related to carrying information forward many steps.

Further, in some cases it makes sense to pass information along *backwards*. This may seem counterintuitive, and in time-series forecasting it certainly often does not make sense, but think of understanding a movie review. The start of the review may well be as important as the end of the review, for which reason reading it "backwards" can still provide information about its contents.

An extension of this idea is to *use both ways*. That is, have a part of your NN that reads the sequence "forward" and another that reads it "backwards". Such a combination is known as a *bidirectional* layer.

## Simple RNN

The *simplest* - hence the name - way to build a recurrent layer is to simply **use its output as the state it passes along to the next step**.

This is what the $\texttt{SimpleRNN}$ layer does. 

However, it is problematic for the same reasons that training deep models (without residual connections) is problematic - backpropagating back through the steps now leads to gradients that easily collapse to zero or diverge, making learning difficult.

Further, the information will have difficulties traversing long distances (as it will be non-linearly transformed at each step).

A $\texttt{SimpleRNN}$ layer with *i* input features and *j* nodes will have $(j + i + 1)j$ trainable parameters (assuming for simplicity that there is no output layer, i.e., $o_t=h_t$)
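The formula can be checked by counting the entries of each weight array separately. The function name below is a hypothetical helper. The two calls use the layer sizes from the model built in the next section.

```python
def simple_rnn_params(i, j):
    """Trainable parameters of a simple recurrent layer with i inputs and j nodes."""
    recurrent_kernel = j * j  # W_hh: hidden state -> hidden state
    input_kernel = i * j      # W_hx: input -> hidden state
    bias = j                  # b_h
    return recurrent_kernel + input_kernel + bias

print(simple_rnn_params(i=10000, j=6))  # 60042
print(simple_rnn_params(i=6, j=6))      # 78
```

Note how the input kernel dominates when the input is a large one-hot vector.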

## Simple RNN

<img src="./graphics/SimpleRNN.jpg" alt="Drawing" style="width: 800px;"/>

## Building a $\texttt{SimpleRNN}$ layer

**Note**: The default activation of this layer is the hyperbolic tangent (and not linear, as with many layers). If you set return_sequences=False, the shape of this output is (batch_size, nb_nodes/units). If you set return_sequences=True, the shape of this output is (batch_size, nb_timesteps, nb_nodes/units).

In [10]:
import tensorflow as tf
print(f'tf-version {tf.__version__}')
nb_input_features = 10000
nb_timesteps = 6
nb_nodes = 6

simple_rnn_model = tf.keras.models.Sequential([
    tf.keras.layers.SimpleRNN(nb_nodes, 
                              input_shape=(nb_timesteps, nb_input_features),
                              return_sequences=True),
    tf.keras.layers.SimpleRNN(units=nb_timesteps, activation="softmax")
])

tf-version 2.10.1


In [15]:
simple_rnn_model.summary()

print(
    f'Number of trainable parameters (o=h) = '
    f'{(nb_input_features + nb_nodes + 1) * nb_nodes}'
)
print(
    f'Number of trainable parameters (o=g2(h)) = '
    f'{(nb_input_features + nb_nodes + 1) * nb_nodes} +'
    f'{(nb_nodes + nb_timesteps + 1) * nb_timesteps}'
)

Model: "sequential_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 simple_rnn_2 (SimpleRNN)    (None, 6, 6)              60042     
                                                                 
 simple_rnn_3 (SimpleRNN)    (None, 6)                 78        
                                                                 
Total params: 60,120
Trainable params: 60,120
Non-trainable params: 0
_________________________________________________________________
Number of trainable parameters (o=h) = 60042
Number of trainable parameters (o=g2(h)) = 60042 +78


## Long Short Term Memory (LSTM)

LSTM layers were introduced in "Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8), 1735-1780".

They address the problem of retaining information over long sequences while making backpropagation feasible even for very long sequences.

How? They add an additional state that is passed along, typically called the *carry* track. Crucially, the carry track is not affected directly by any other operation than multiplication and addition, making information easier to retain over long periods while also allowing gradients to flow more easily through long sequences.

**Note**: The exact structure of an LSTM cell is often interpreted as containing "forget" gates. Chollet warns about this interpretation. We do not strictly enforce "forget" gates, and interpreting what they do is far more complex.

## Example where long range dependence is needed 
<img src="./graphics/ingvald_bleken_all.png" alt="Drawing" style="width: 1000px;"/>

**Note**:  Spouse is located deep in the text

## The LSTM cell and the carry track

<img src="./graphics/The-structure-of-the-Long-Short-Term-Memory-LSTM-neural-network-Reproduced-from-Yan_W640.jpg" alt="Drawing" style="width: 600px;"/>


## LSTM Cells

LSTM cells are a type of recurrent neural network architecture designed for long-term information retention. 

### Key Components

#### 1. Cell State
- Acts as a conveyor belt through the LSTM.
- Transports and stores information across the sequence.

#### 2. Hidden State
- Transfers information down the sequence.
- Interacts with the cell through various gates.



### Core Gates

#### a. Forget Gate
- Decides what information to discard from the cell state.
- Uses a sigmoid function to output values between 0 (forget) and 1 (keep).

#### b. Input Gate
- Updates the cell state with new information.
- Consists of two parts:
  - A sigmoid layer deciding which values to update.
  - A tanh layer creating a vector of new candidate values.

#### c. Output Gate
- Determines the next hidden state.
- Uses the current input and previous hidden state to decide the output.
<br></br>

### Operations
- The cell state is modified through a series of steps involving these gates.
- The forget gate discards irrelevant information.
- The input gate adds new information.
- The output gate updates the hidden state based on the cell state.
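The steps listed above can be sketched for a single scalar LSTM cell as follows. This is a minimal sketch of the standard formulation, not the code of any particular library. The parameter dictionary and its values are made up, chosen so that the forget gate stays close to 1 and the input gate close to 0.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One step of a scalar LSTM cell; p maps block name -> (w_x, w_h, b)."""
    def pre(block):
        w_x, w_h, b = p[block]
        return w_x * x + w_h * h_prev + b
    f = sigmoid(pre("forget"))        # how much of the old cell state to keep
    i = sigmoid(pre("input"))         # how much of the candidate to write
    g = math.tanh(pre("candidate"))   # new candidate value
    o = sigmoid(pre("output"))        # how much of the cell state to expose
    c = f * c_prev + i * g            # carry track: only multiply and add
    h = o * math.tanh(c)              # next hidden state
    return h, c

params = {
    "forget": (0.0, 0.0, 10.0),     # large bias: f is close to 1 (keep)
    "input": (0.0, 0.0, -10.0),     # large negative bias: i is close to 0
    "candidate": (1.0, 1.0, 0.0),
    "output": (1.0, 1.0, 0.0),
}

h, c = 0.0, 1.0
for x in [0.5, -1.0, 2.0, 0.3]:
    h, c = lstm_step(x, h, c, params)
print(c)  # still very close to the initial value 1.0
```

With these settings the cell state passes through four steps almost unchanged, regardless of the inputs, which illustrates why the carry track helps retain information over long distances.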

## Long Short Term Memory (LSTM)
    
<img src="./graphics/a-A-vanilla-LSTM-cell-b-Equations-of-a-vanilla-LSTM-cell.ppm.png" alt="Drawing" style="width: 1000px;"/>


## Exercise

Based on the two previous slides identify (within the LSTM cell)
- input gate
- forget gate
- output gate
- memory cell

## Key take-away from LSTM layers

In using LSTM layers, and in understanding why they often perform so much better than the naïve recurrent layer, knowing the importance of the carry track is crucial.

Often when solving tasks where recurrent layers are useful, the dependencies are long. Think of text data. To analyze it successfully, we *need* to be able to use information about words distanced far from each other.

The introduction of the additional components of the LSTM layer (as opposed to the naïve approach) does, however, lead to many more parameters (specifically four times as many).

1. The input is now transformed in four different ways (once for each of the three gates, and once for the candidate values).
1. The same holds for the hidden state. The carry track itself has no weights of its own.

The activations used in an LSTM cell tend to be a combination of the hyperbolic tangent and the sigmoid (where the sigmoid parts act as the gates, e.g. the "forget" gate).

An $\texttt{LSTM}$ layer with *i* input features and *j* nodes will have $4(j + i + 1)j$ trainable parameters.
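As a quick check of the factor four, the count can be written as four blocks of the simple recurrent layer's size. The function name is a hypothetical helper. The example values match the layer built in the next section.

```python
def lstm_params(i, j):
    """Trainable parameters of an LSTM layer with i inputs and j nodes."""
    per_block = j * j + i * j + j  # recurrent kernel + input kernel + bias
    blocks = ["forget gate", "input gate", "candidate", "output gate"]
    return len(blocks) * per_block

print(lstm_params(i=10, j=4))  # 240
```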

## Key take-away from LSTM layers

### Sigmoid vs Tanh

<img src="./graphics/tanh.jpg" alt="Drawing" style="width: 1000px;"/>

## Building an $\texttt{LSTM}$ layer

**Note**: The default activation of this layer is the hyperbolic tangent (and not linear, as with many layers).

In [3]:
nb_input_features = 10
nb_timesteps = 5
nb_nodes = 4

lstm_model = tf.keras.models.Sequential([
    tf.keras.layers.LSTM(nb_nodes, 
                         input_shape=(nb_timesteps, nb_input_features))
])

lstm_model.summary()

print(f'Number of trainable parameters = '
      f'{4 * (nb_input_features + nb_nodes + 1) * nb_nodes}')

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 lstm (LSTM)                 (None, 4)                 240       
                                                                 
Total params: 240
Trainable params: 240
Non-trainable params: 0
_________________________________________________________________
Number of trainable parameters = 240


## Gated Recurrent Unit (GRU)

GRU layers were introduced in "Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078".

They are a somewhat simpler version of an LSTM layer (the capability of an LSTM layer weakly dominates that of a corresponding GRU layer). However, in practice they do in some cases outperform LSTM layers (with fewer parameters).

A GRU works by combining the "forget" and input gates, and merging the "carry track" with the hidden state. This means the number of parameters is now "only" three times larger than that of the naïve approach.

A $\texttt{GRU}$ layer with *i* input features and *j* nodes will have $3(j + i + 1)j$ trainable parameters OR $3(j + i + 2)j$ trainable parameters, depending on the exact implementation; these are the implementations in TensorFlow 1.x and 2.x, respectively.

The additional $3j$ parameters in the second case are due to separate bias terms for the input and recurrent kernels. This is controlled by the **reset_after** argument of the layer, which you can choose whether to use. It is used by default in TensorFlow 2.x: https://newbedev.com/calculating-the-number-of-parameters-of-a-gru-layer-keras
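Both variants of the count can be compared side by side. The function below is a hypothetical helper. Its `reset_after` flag mirrors the layer argument of the same name and only switches between one and two bias vectors per block.

```python
def gru_params(i, j, reset_after=True):
    """Trainable parameters of a GRU layer with i inputs and j nodes."""
    n_biases = 2 if reset_after else 1  # separate input/recurrent biases, or one bias
    return 3 * (j * j + i * j + n_biases * j)

print(gru_params(10, 4, reset_after=True))   # 192
print(gru_params(10, 4, reset_after=False))  # 180
```

The difference between the two results is exactly $3j = 12$.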

## Gated Recurrent Unit (GRU)

<img src="./graphics/1920px-Gated_Recurrent_Unit,_base_type.svg.png" alt="Drawing" style="width: 600px;"/>
<img src="./graphics/GRUmath.png" alt="Drawing" style="width: 600px;"/>


## Gated Recurrent Unit (GRU)

GRU is a type of recurrent neural network architecture optimized for sequence modeling and handling long-term dependencies.

### Key Features

#### Simplified Structure
- GRU units are designed to be simpler than LSTMs, with fewer gates.

#### Key Components

##### 1. Update Gate
- Determines how much of the past information (from previous time steps) needs to be passed along to the future.
- Balances between the old information (previous hidden state) and the new candidate information.

##### 2. Reset Gate
- Decides how much of the past information to forget.
- Helps the model to decide how much of the past information is irrelevant for the future.

#### Operations

- **Combining Information**: The update gate helps the GRU to capture dependencies over various time scales.
- **Memory Content**: The reset gate allows the GRU to drop any irrelevant information in the future, effectively resetting the memory.

### Advantages

- **Efficiency**: Generally requires fewer computational resources than LSTM.
- **Performance**: Often performs on par with LSTM, especially in smaller datasets or less complex tasks.

GRUs are effective for various sequence modeling tasks, including language modeling, speech recognition, and time series analysis.
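A single scalar GRU step might look like the sketch below. It assumes the convention in which the update gate `z` weights the *old* hidden state (references differ on whether `z` or `1 - z` plays that role), and all parameter values are made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gru_step(x, h_prev, p):
    """One step of a scalar GRU cell; p maps block name -> (w_x, w_h, b)."""
    wz_x, wz_h, bz = p["update"]
    wr_x, wr_h, br = p["reset"]
    wc_x, wc_h, bc = p["candidate"]
    z = sigmoid(wz_x * x + wz_h * h_prev + bz)               # update gate
    r = sigmoid(wr_x * x + wr_h * h_prev + br)               # reset gate
    h_cand = math.tanh(wc_x * x + wc_h * (r * h_prev) + bc)  # candidate state
    return z * h_prev + (1.0 - z) * h_cand                   # blend old and new

params = {
    "update": (0.0, 0.0, 10.0),   # large bias: z is close to 1 (keep old state)
    "reset": (0.0, 0.0, 0.0),
    "candidate": (1.0, 1.0, 0.0),
}

h = 0.8
for x in [0.5, -1.0, 2.0, 0.3]:
    h = gru_step(x, h, params)
print(h)  # stays very close to 0.8
```

Unlike the LSTM, there is no separate carry track: the hidden state itself is carried forward, and the update gate decides how much of it survives each step.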


## Building a $\texttt{GRU}$ layer

**Note**: The default activation of this layer is the hyperbolic tangent (and not linear, as with many layers).

In [4]:
nb_input_features = 10
nb_timesteps = 5
nb_nodes = 4
batch=32
inputs = tf.random.normal([batch, nb_timesteps, nb_input_features])
gru = tf.keras.layers.GRU(units= nb_nodes)
output = gru(inputs)
print(output.shape)

gru = tf.keras.layers.GRU(4, return_sequences=True, return_state=True)
whole_sequence_output, final_state = gru(inputs)
print(whole_sequence_output.shape)

print(final_state.shape)


(32, 4)
(32, 5, 4)
(32, 4)


In [5]:
nb_input_features = 10
nb_timesteps = 5
nb_nodes = 4

gru_model = tf.keras.models.Sequential([
    tf.keras.layers.GRU(nb_nodes, 
                        input_shape=(nb_timesteps, nb_input_features))
]) # depends on whether reset_after is True or False! If False, will subtract 3 * nb_nodes parameters.

gru_model.summary()

print(f'Number of trainable parameters = '
      f'{3 * (nb_input_features + nb_nodes + 2) * nb_nodes}')

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 gru_2 (GRU)                 (None, 4)                 192       
                                                                 
Total params: 192
Trainable params: 192
Non-trainable params: 0
_________________________________________________________________
Number of trainable parameters = 192


## Bidirectional recurrent layers

When we think of sequences, we think of reading them in a forward fashion (from start to end).

In many cases, this makes perfect sense, as the last information is often the most important.

However, in some cases the start of a sequence may be as important as its end, or even more important.

This is, for example, the case when reading a text. However, if we feed a sequence in forward, it is difficult to retain this initial information.

For this reason, some RNNs read sequences backward. But we do not have to choose between these approaches exclusively - we can use *both* directions at once. This is known as a bidirectional layer. It is simply two separate layers (one forward, one backward) that are then merged (**suggestion**: Try to use the functional API to implement a bidirectional layer as an exercise).
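The mechanics can be sketched without any deep learning library. The toy below uses two scalar simple RNNs with made-up weights, one reading the sequence forward and one reading it reversed, and then merges their final states. The function names and weight values are illustrative assumptions.

```python
import math

def run_rnn(xs, w_h, w_x):
    """Final hidden state of a scalar simple RNN read left to right."""
    h = 0.0
    for x in xs:
        h = math.tanh(w_h * h + w_x * x)
    return h

def bidirectional(xs, merge_mode="concat"):
    forward = run_rnn(xs, w_h=0.5, w_x=1.0)          # reads start -> end
    backward = run_rnn(xs[::-1], w_h=-0.3, w_x=0.8)  # own weights, reads end -> start
    if merge_mode == "concat":
        return [forward, backward]
    if merge_mode == "sum":
        return [forward + backward]
    raise ValueError(f"unsupported merge_mode: {merge_mode}")

xs = [0.2, -0.1, 0.9, 0.4]
concat_out = bidirectional(xs, "concat")
sum_out = bidirectional(xs, "sum")
print(concat_out)  # two values: one per direction
print(sum_out)     # one value: the two directions added
```

The two directions have separate weights, which is why the parameter count doubles.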

<img src="./graphics/figure_6-25.png" alt="Drawing" style="width: 600px;"/>
Source: "DLWP"

## Building a bidirectional layer

Wrapping a recurrent layer in a bidirectional layer results in twice as many parameters as usual, since we now use two copies of the recurrent layer.

In [6]:
nb_input_features = 10
nb_timesteps = 5
nb_nodes = 4

bidirectional_concat_lstm_model = tf.keras.models.Sequential([
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(nb_nodes), 
                                  input_shape=(nb_timesteps,
                                               nb_input_features)), 
])

bidirectional_concat_lstm_model.summary()

print(f'Number of trainable parameters = '
      f'{2 * 4 * (nb_input_features + nb_nodes + 1) * nb_nodes}')

Model: "sequential_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 bidirectional (Bidirectiona  (None, 8)                480       
 l)                                                              
                                                                 
Total params: 480
Trainable params: 480
Non-trainable params: 0
_________________________________________________________________
Number of trainable parameters = 480


We can also use addition to merge layers (or some other method, such as multiplication).

This does not change the number of parameters of the layer, but does change the number of outputs - which in turn may change the number of parameters of some of the *other* layers of your model.
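To see the effect on a following layer, consider a hypothetical $\texttt{Dense}$ layer with one output placed after the bidirectional layer with 4 nodes per direction. The helper functions are illustrative only.

```python
def bidirectional_output_size(units, merge_mode):
    # 'concat' stacks both directions; element-wise merges such as 'sum' keep the size
    return 2 * units if merge_mode == "concat" else units

def dense_params(n_inputs, n_outputs):
    return n_inputs * n_outputs + n_outputs  # weights + biases

for mode in ("concat", "sum"):
    size = bidirectional_output_size(units=4, merge_mode=mode)
    print(mode, size, dense_params(size, 1))
# concat 8 9
# sum 4 5
```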

In [7]:
bidirectional_add_lstm_model = tf.keras.models.Sequential([
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(nb_nodes), merge_mode='sum', input_shape=(nb_timesteps, nb_input_features)), 
])

bidirectional_add_lstm_model.summary()

Model: "sequential_4"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 bidirectional_1 (Bidirectio  (None, 4)                480       
 nal)                                                            
                                                                 
Total params: 480
Trainable params: 480
Non-trainable params: 0
_________________________________________________________________


## Common types of RNN networks
<img src="./graphics/onetomanyetc.jpeg" alt="Drawing" style="width: 1000px;"/>

## Common types of RNN networks:

- One to one: Classification of images
- One to many: Generation of text
- Many to one: Sentiment analysis
- Many to many: Translation

## Multiple recurrent layers

In some cases, we may want to stack multiple recurrent layers after each other.

In these cases, we need to pass along the entire sequence to the next layer (and not just the output of the last step of the sequence).

This is done by setting the argument **return_sequences** of a recurrent layer to **True**. As such, this needs to be done for each recurrent layer aside from the last (where we are only interested in the last time step, and as such no longer need the entire sequence).
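The reason is purely one of shapes, which the following sketch tracks by hand. The function is a hypothetical stand-in for a recurrent layer. It only computes output shapes and rejects input that is no longer a sequence.

```python
def recurrent_output_shape(input_shape, units, return_sequences):
    """Output shape of a recurrent layer given (batch, timesteps, features) input."""
    if len(input_shape) != 3:
        raise ValueError("a recurrent layer needs (batch, timesteps, features) input")
    batch, timesteps, _ = input_shape
    return (batch, timesteps, units) if return_sequences else (batch, units)

shape = (None, 5, 10)
shape = recurrent_output_shape(shape, units=4, return_sequences=True)   # (None, 5, 4)
shape = recurrent_output_shape(shape, units=3, return_sequences=False)  # (None, 3)
print(shape)

# Stacking another recurrent layer on a layer that dropped the time axis fails.
try:
    recurrent_output_shape(shape, units=2, return_sequences=False)
    stacked_ok = True
except ValueError:
    stacked_ok = False
print(stacked_ok)  # False
```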

Also, we may mix different types of recurrent layers (although typically this is not done).

## Multiple recurrent layers
<img src="./graphics/deepRNN.png" alt="Drawing" style="width: 1000px;"/>

## Building an RNN with multiple recurrent layers

In [16]:
nb_input_features = 10
nb_timesteps = 5
nb_nodes_1 = 4
nb_nodes_2 = 3
nb_nodes_3 = 2

deep_rnn_model = tf.keras.models.Sequential([
    tf.keras.layers.LSTM(nb_nodes_1, input_shape=(nb_timesteps, nb_input_features), return_sequences=True),
    tf.keras.layers.GRU(nb_nodes_2, return_sequences=True),
    tf.keras.layers.Bidirectional(tf.keras.layers.SimpleRNN(nb_nodes_3)),
    tf.keras.layers.Dense(1), # maybe we want to perform regression, where this might be the final layer
])

deep_rnn_model.summary()

Model: "sequential_4"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 lstm_1 (LSTM)               (None, 5, 4)              240       
                                                                 
 gru_3 (GRU)                 (None, 5, 3)              81        
                                                                 
 bidirectional (Bidirectiona  (None, 4)                24        
 l)                                                              
                                                                 
 dense (Dense)               (None, 1)                 5         
                                                                 
Total params: 350
Trainable params: 350
Non-trainable params: 0
_________________________________________________________________


## Exercise

1. Build three RNNs which take as input a sequence of length 5 with 20 features, use a single recurrent layer with 20 nodes with a $\texttt{ReLU}$ activation function, and then use a $\texttt{Dense}$ layer to perform classification of 10 classes. Use $\texttt{SimpleRNN}$, $\texttt{LSTM}$, and $\texttt{GRU}$ for the recurrent layers (hence 3 total models). What is the number of parameters in each of the 3 cases?
1. Modify your models from **1.** using $\texttt{Bidirectional}$ layers. Do this for (at least) the merging methods $\texttt{concat}$ and $\texttt{sum}$. What is the number of parameters in each of the 6 cases?
1. Modify your models from **1.** by using a *second* recurrent layer (of the same type, so still 3 total models) with 10 nodes. What is the number of parameters in each of the 3 cases?

**Hint**: You may use the notebook I have uploaded under this lecture as a starting point (exercise-rnn.ipynb). It provides some of the code, and you then have to fill in the rest. You do not have to use it - it is there if you think it might be helpful!

## Working with text data

A common use of RNNs is to analyze text data. However, it is not obvious how to work with text data.

Note that the consideration here is not directly related to RNNs, but just as with any model, considerations of how to prepare the data are crucial. Indeed, if we tried to solve our task at hand using another method than an RNN, these considerations are also valid!

If we forecast sales each day, each day is the obvious observational unit to work with. However, we can think of different ways to structure text data. We could split it into sentences, words, or even characters. Each of these options has advantages and disadvantages.

1. Sentence: Not very "flexible" - can only generate sentences from the training data! But the quality of a sentence is quite high.
2. Word: Middle-ground - and one of the most often used methods. Model outputs words from the training data, but this is often "enough" to provide sufficient flexibility and at the same time not reinvent the wheel.
3. Character: Very flexible, but the model now needs to learn to spell individual words, and the output can be complete gibberish (not even bad English but not English at all)!
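The three levels can be compared on a small made-up text. The splitting rules below are deliberately naive (split on full stops and whitespace) and are only meant to show how the number of distinct units and the sequence length change with the level.

```python
text = "The cat sat on the mat. The dog barked."

sentences = [s.strip() for s in text.split(".") if s.strip()]
words = text.lower().replace(".", "").split()
characters = list(text.lower())

print(len(sentences), "sentences")
print(len(words), "words,", len(set(words)), "distinct")
print(len(characters), "characters,", len(set(characters)), "distinct")
```

Moving from sentences to characters, the sequences get longer while the set of distinct units the model has to choose from gets smaller and more reusable.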

## More details on text data

Neural networks work on numbers - not text. So we need some way to represent text data - here in this example in the form of words - using numbers.

There are two very common methods:
1. An old, but still occasionally useful, method is *n*-grams of words.
1. A more modern, and *much* more powerful, method is to use one-hot encoding/hashing and then build embeddings on these.

## *n*-grams of words (bag of words)

Think of the sentence "the cat sat on the mat". 

We can construct sub-sentences of at most length *n*. Extracting all these sub-sentences gives the *n*-gram of the sentence.

**Set of 2-grams**: {the cat, cat sat, sat on, on the, the mat, the, cat, sat, on, mat}.

**Set of 3-grams**: {the cat sat, cat sat on, sat on the, on the mat, the cat, cat sat, sat on, on the, the mat, the, cat, sat, on, mat}.

However, this is not very informative, as it isn't even order-preserving. Further, if a large *n* and long sentences are used, the sets become quite large. It is still useful for shallow learning, but not so much for deep learning.

**Note**: *n*-grams also work on the character level (or sentence level, for that matter - any level, in fact).
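The construction can be written out in a few lines of plain Python. The function name is a hypothetical helper. It collects all word sequences of length 1 up to *n* into a set.

```python
def ngram_set(sentence, n):
    """All word n-grams of length 1..n in the sentence, as a set of strings."""
    tokens = sentence.split()
    grams = set()
    for size in range(1, n + 1):
        for start in range(len(tokens) - size + 1):
            grams.add(" ".join(tokens[start:start + size]))
    return grams

two_grams = ngram_set("the cat sat on the mat", 2)
three_grams = ngram_set("the cat sat on the mat", 3)
print(sorted(two_grams))
print(len(two_grams), len(three_grams))  # 10 14
```

Because the result is a set, the repeated word "the" appears only once and the order of the words in the sentence is lost.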

In [3]:
import tensorflow as tf
import numpy as np
text    = ["the cat sat on the mat"]
encoder = tf.keras.layers.TextVectorization(ngrams=2,max_tokens=100,
                                            output_mode="multi_hot")
encoder.adapt(text) # Computes a vocabulary of string terms from tokens in a dataset.
vocab   = np.array(encoder.get_vocabulary()) # Get and print the vocabulary
print(f'length of vocabulary: {len(vocab)}')
print(f'vocabulary: {vocab}')
encoded_example = []
for ngram in vocab:
    print(ngram)
    print(list(encoder(ngram).numpy()))
    encoded_example.append(list(encoder(ngram).numpy()))

length of vocabulary: 11
vocabulary: ['[UNK]' 'the' 'the mat' 'the cat' 'sat on' 'sat' 'on the' 'on' 'mat'
 'cat sat' 'cat']
[UNK]
[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
the
[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
the mat
[0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0]
the cat
[0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
sat on
[0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0]
sat
[0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
on the
[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0]
on
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
mat
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0]
cat sat
[0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0]
cat
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]


In [2]:
print('transforming the bag of words to a matrix:\n', 
      *encoded_example,sep='\n')

transforming the bag of words to a matrix:

[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
[0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0]
[0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
[0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0]
[0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0]
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0]
[0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0]
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]


## One-hot encoding (hashing) and word embeddings

A much more common way to use text data is to use one-hot encoding (hashing) and word embeddings. 

Contrary to bag-of-words, this allows us to model text as a sequence.

It simply amounts to building a dictionary that maps each word (or more generally each token, which could also be at the character level) to an integer.

For example, we may build the following dictionary for our earlier example:

## One-hot encoding (hashing)

A hash function is a function that can be used to map data of arbitrary size to data of fixed size.
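As a minimal sketch of the idea, the hypothetical helper below maps any token to one of a fixed number of buckets using MD5 from the standard library. The choice of MD5 is an assumption made here for reproducibility, not necessarily what any given library uses.

```python
import hashlib

def hash_token(token, n_buckets):
    """Map a token to an integer in [0, n_buckets) with a stable hash."""
    digest = hashlib.md5(token.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

words = ["the", "cat", "sat", "on", "mat"]
small = [hash_token(w, 4) for w in words]     # 5 words, only 4 buckets
large = [hash_token(w, 1000) for w in words]  # plenty of buckets
print(small)
print(large)

has_collision_small = len(set(small)) < len(words)
print(has_collision_small)  # True: five words cannot fit in four buckets
```

The advantage is that no dictionary has to be stored. The price is that two different tokens may end up with the same index (a collision), which becomes less likely as the number of buckets grows.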

<img src="./graphics/hashing.jpg" alt="Drawing" style="width: 1000px;"/>


In [4]:
## Hashing
one_hot_dict = {
    'the': 0,
    'cat': 1,
    'sat': 2,
    'on': 3,
    'mat': 4,
}

numerical_encoded_sentence = [one_hot_dict[word] 
                              for word in 'the cat sat on the mat'.split(' ')]
print(numerical_encoded_sentence)

[0, 1, 2, 3, 0, 4]


## One-hot encoding

Then, we use one-hot encoding of the numerical data.

In [5]:
import tensorflow as tf

print(tf.keras.utils.to_categorical(numerical_encoded_sentence))

[[1. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0.]
 [0. 0. 1. 0. 0.]
 [0. 0. 0. 1. 0.]
 [1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1.]]


## From raw text to vectors

<img src="./graphics/fromrawtexttovectors.PNG" alt="Drawing" style="width: 1000px;"/>
Source: "DLWP"

## Embedding

One-hot encoding is *sparse*. This is particularly true if there are many classes.

In text data working at the word level, there are a TON (often many thousands) of classes. Hence each input consists of a vector with one 1 and many 0's.

However, words are *related*, and we can often store them much more efficiently. Embedding is one such way, in which a sparse vector is transformed to an (often much smaller) non-sparse vector. In this more dense space, we often see that words such as "lion" and "tiger" might be closer to each other than to "car".

For a problem with *n* words (or tokens, in the general case) and an embedding dimension *k*, the embedding function is a function $f: \{0, 1\}^n \rightarrow \mathbb{R}^k$.

As such, an embedding is "just" a lookup table, "looking up" *k* real numbers for each word.

How do we find these numbers? Typically, they are learned as part of the neural network, like any other weights. Pre-trained versions are also available.
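To make the "lookup table" point concrete, here is a minimal sketch in plain Python (no TensorFlow). The matrix `E` and the helper names are made up for illustration; the numbers are not learned from anything. It shows that multiplying a one-hot vector with the embedding matrix gives exactly the same result as picking the corresponding row.

```python
# Toy embedding matrix: n = 5 tokens, k = 2 dimensions (made-up numbers).
E = [
    [0.1, 0.9],  # 'the'
    [0.8, 0.2],  # 'cat'
    [0.4, 0.4],  # 'sat'
    [0.0, 0.5],  # 'on'
    [0.7, 0.3],  # 'mat'
]

def one_hot(index, n):
    """One-hot vector of length n with a 1 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(n)]

def embed_by_matmul(index):
    """Multiply the one-hot row vector with E (what f does mathematically)."""
    v = one_hot(index, len(E))
    return [sum(v[i] * E[i][j] for i in range(len(E))) for j in range(len(E[0]))]

def embed_by_lookup(index):
    """Pick row `index` of E (what is done in practice)."""
    return E[index]

sentence = [0, 1, 2, 3, 0, 4]  # 'the cat sat on the mat'
print(all(embed_by_matmul(i) == embed_by_lookup(i) for i in sentence))
print([embed_by_lookup(i) for i in sentence])
```

The lookup avoids ever building the sparse one-hot vectors, which is why an embedding layer takes integer indices as input.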

## From one-hot to embeddings

<img src="./graphics/06fig02.jpg" alt="Drawing" style="width: 800px;"/>
Source: "DLWP"

## From one-hot to embeddings

<img src="./graphics/figure_6-1.png" alt="Drawing" style="width: 800px;"/>
Source: "DLWP"

## Embeddings may be interpretable

### Leading example: "King" - "Man" + "Woman" = "Queen"

### Visualization (clustering)

<img src="./graphics/figure_6-3.png" alt="Drawing" style="width: 600px;"/>


Source: "DLWP"
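The "King" - "Man" + "Woman" arithmetic can be sketched with a tiny hand-made example. The 2-dimensional vectors below are purely hypothetical (one axis loosely meaning "royalty", the other "gender"); real embeddings are learned and have many more dimensions, and the analogy rarely works this cleanly.

```python
import math

# Hypothetical 2-D embeddings: axis 0 ~ "royalty", axis 1 ~ "gender".
toy = {
    "king":  [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man":   [0.1, 0.8],
    "woman": [0.1, -0.8],
    "car":   [-0.7, 0.0],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# king - man + woman, computed element by element
query = [k - m + w for k, m, w in zip(toy["king"], toy["man"], toy["woman"])]

# Nearest remaining word by cosine similarity (the three query words are excluded).
candidates = {w: v for w, v in toy.items() if w not in ("king", "man", "woman")}
best = max(candidates, key=lambda w: cosine(query, candidates[w]))
print(best)
```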

## Building $\texttt{Embedding}$ layers and vectors

In [7]:
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import text_to_word_sequence

reviews = ['nice food',
        'amazing restaurant',
        'too good',
        'just loved it!',
        'will go again',
        'horrible food',
        'never go there',
        'poor service',
        'poor quality',
        'needs improvement']
sentiment = np.array([1,1,1,1,1,0,0,0,0,0])

In [8]:
vocabulary_size = 4
print(one_hot("nice food",vocabulary_size))
print(one_hot("amazing restaurant",vocabulary_size))
print("Is there a problem?...collisions?")

print("")
print("Trying with a larger vocabulary")
vocabulary_size = 40 #75
print(one_hot("nice food",vocabulary_size))
print(one_hot("amazing restaurant",vocabulary_size))

[3, 3]
[1, 3]
Is there a problem?...collisions?

Trying with a larger vocabulary
[30, 18]
[28, 36]
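The collision seen above is unavoidable with such a small vocabulary. The sketch below mimics the hashing idea in plain Python. The helper `hashing_trick` is hypothetical, md5 is chosen here only because it gives the same result on every run, and the index range 1..n-1 (with 0 left for padding) is an assumption that is consistent with the output above.

```python
import hashlib

def hashing_trick(text, n):
    """Map each word to an integer in 1..n-1 using a stable hash."""
    words = text.lower().split()
    return [int(hashlib.md5(w.encode("utf-8")).hexdigest(), 16) % (n - 1) + 1
            for w in words]

small = hashing_trick("nice food amazing restaurant", n=4)
large = hashing_trick("nice food amazing restaurant", n=40)

# With n=4 only the indices 1, 2 and 3 exist, so four distinct words must collide.
print(small, "distinct indices:", len(set(small)))
print(large, "distinct indices:", len(set(large)))
```

A larger `n` makes collisions less likely but never impossible, since the hash function knows nothing about which words actually occur in the data.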


In [9]:
encoded_reviews = [one_hot(d, vocabulary_size) for d in reviews]
print(f'one hot encoding/hashing: {encoded_reviews}')
print("Any collisions?")

one hot encoding/hashing: [[30, 18], [28, 36], [31, 27], [16, 23, 28], [6, 14, 4], [1, 18], [38, 14, 26], [5, 17], [5, 4], [32, 23]]
Any collisions?


In [15]:
max_length = 4
padded_reviews = pad_sequences(encoded_reviews, maxlen=max_length, padding='post')
print(f'one hot encoding/hashing: {padded_reviews}')

# Note that there can be a "collision": some words are encoded with the same integer!
# Increasing the vocabulary size reduces the likelihood of a collision, but what are
# the downstream effects of this? A larger embedding layer/matrix?

one hot encoding/hashing: [[30 18  0  0]
 [28 36  0  0]
 [31 27  0  0]
 [16 23 28  0]
 [ 6 14  4  0]
 [ 1 18  0  0]
 [38 14 26  0]
 [ 5 17  0  0]
 [ 5  4  0  0]
 [32 23  0  0]]
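What padding does can be sketched in a few lines of plain Python. The helper `pad_post` is hypothetical and only mirrors the `padding='post'` behaviour used above; for simplicity it cuts sequences that are too long at the end, which is a choice made for this sketch and not necessarily what `pad_sequences` does by default.

```python
def pad_post(sequences, max_length, value=0):
    """Right-pad each sequence with `value`; longer sequences are cut at the end."""
    padded = []
    for seq in sequences:
        kept = seq[:max_length]
        padded.append(kept + [value] * (max_length - len(kept)))
    return padded

encoded = [[30, 18], [16, 23, 28], [6, 14, 4, 9, 2]]
print(pad_post(encoded, max_length=4))
```

All rows now have the same length, which is what allows them to be stacked into a single array and fed to the network in batches.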


In [16]:
#Approach that eliminates collisions without increasing the vocabulary
MAX_VOCAB_SIZE = 40
encoder = tf.keras.layers.TextVectorization(
    max_tokens=MAX_VOCAB_SIZE)
encoder.adapt(reviews)
vocab = np.array(encoder.get_vocabulary())
print(f'length of vocabulary: {len(vocab)}')

encoded_example = encoder(reviews).numpy()
max_length = 4
padded_reviews = pad_sequences(encoded_example, maxlen=max_length,
                               padding='post')
print(f'one hot encoding/hashing: {reviews}')
print(f'one hot encoding/hashing:\n {padded_reviews}')

length of vocabulary: 22
one hot encoding/hashing: ['nice food', 'amazing restaurant', 'too good', 'just loved it!', 'will go again', 'horrible food', 'never go there', 'poor service', 'poor quality', 'needs improvement']
one hot encoding/hashing:
 [[11  4  0  0]
 [20  9  0  0]
 [ 6 19  0  0]
 [15 14 16  0]
 [ 5  3 21  0]
 [18  4  0  0]
 [12  3  7  0]
 [ 2  8  0  0]
 [ 2 10  0  0]
 [13 17  0  0]]


## Building a simple model with an embedding layer

In [17]:
embedding_dimension = 5
embedding_model = tf.keras.models.Sequential([
    tf.keras.layers.Embedding(input_dim=len(vocab), output_dim=embedding_dimension,
                             input_length=max_length,name="embedding"),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

In [18]:
embedding_model.compile(optimizer='adam', loss='binary_crossentropy', 
                        metrics=['accuracy'])
print(embedding_model.summary())

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 4, 5)              110       
                                                                 
 flatten (Flatten)           (None, 20)                0         
                                                                 
 dense (Dense)               (None, 1)                 21        
                                                                 
Total params: 131
Trainable params: 131
Non-trainable params: 0
_________________________________________________________________
None


## Defining input and output and fitting/evaluating the model

In [19]:
X = padded_reviews
y = sentiment

In [20]:
embedding_model.fit(X, y, epochs=50, verbose=0)

<keras.callbacks.History at 0x29c7c3b3670>

In [21]:
# evaluate the model
loss, accuracy = embedding_model.evaluate(X, y)
accuracy



1.0

## Comparing the estimated word vectors

In [22]:
weights = embedding_model.get_layer('embedding').get_weights()[0]
#Horrible
print(f'Horrible: {weights[padded_reviews[5][0]]}')
#Poor
print(f'Poor: {weights[padded_reviews[7][0]]}')
#Nice
print(f'Nice: {weights[padded_reviews[0][0]]}')
#Amazing
print(f'Amazing: {weights[padded_reviews[1][0]]}')

Horrible: [ 0.04541338 -0.02032448  0.01450942  0.09498853  0.07235244]
Poor: [ 0.02196703 -0.09323229  0.03048165  0.04725321  0.04861407]
Nice: [-0.02132252  0.08659092 -0.03001419 -0.09454425 -0.03538161]
Amazing: [-0.00172583  0.05351468 -0.03273857 -0.08425601 -0.08209368]
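One way to compare the vectors is cosine similarity. The sketch below simply copies the four vectors printed above into plain Python lists; the helper `cosine` is written out by hand here. Keep in mind that the model was trained on only ten reviews, so this is an illustration and not evidence of a general pattern.

```python
import math

# Embedding vectors copied from the output above.
vectors = {
    "horrible": [0.04541338, -0.02032448, 0.01450942, 0.09498853, 0.07235244],
    "poor":     [0.02196703, -0.09323229, 0.03048165, 0.04725321, 0.04861407],
    "nice":     [-0.02132252, 0.08659092, -0.03001419, -0.09454425, -0.03538161],
    "amazing":  [-0.00172583, 0.05351468, -0.03273857, -0.08425601, -0.08209368],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

pairs = [("horrible", "poor"), ("nice", "amazing"),
         ("horrible", "nice"), ("poor", "amazing")]
for a, b in pairs:
    print(f"{a:>8} vs {b:<8}: {cosine(vectors[a], vectors[b]):+.2f}")
```

Words with the same sentiment get a positive similarity and words with opposite sentiment a negative one, which is what we would hope for given the task the embedding was trained on.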


## Building a model with embeddings and recurrent layers

Let us try to combine embedding layers and recurrent layers. This is often the basic structure of successful RNNs for text data.

In this example, we will imagine that there are $1000$ possible words in our lookup, we will condense these to 128 dimensions (this is the embedding dimension), we will then use a recurrent layer with $64$ nodes, and finally a fully connected layer with $10$ nodes and a softmax activation function (imagine this is a classification problem with $10$ classes).

Such a network will have 165,898 parameters, since:
1. The embedding layer associates 128 numbers with each of the 1000 words = 128,000 parameters.
1. The recurrent layer chosen is a GRU, which has 128 inputs and 64 outputs. Using our earlier formula, this gives $3(128 + 64 + 2)64=37,248$ parameters.
1. The fully connected layer has 64 inputs and 10 outputs, giving $64\cdot10+10=650$ parameters.
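The arithmetic in the list above can be checked with a few lines of plain Python. The helper names are made up for this sketch; the GRU formula is the one used in the list, where the $+2$ corresponds to two bias vectors per gate.

```python
def embedding_params(vocab_size, embedding_dim):
    """One vector of length embedding_dim per word."""
    return vocab_size * embedding_dim

def gru_params(n_inputs, n_units):
    """Three weight blocks, each with input weights, recurrent weights and two biases."""
    return 3 * (n_inputs + n_units + 2) * n_units

def dense_params(n_inputs, n_outputs):
    """Weight matrix plus one bias per output."""
    return n_inputs * n_outputs + n_outputs

counts = [embedding_params(1000, 128), gru_params(128, 64), dense_params(64, 10)]
print(counts, sum(counts))  # [128000, 37248, 650] 165898
```

The same functions can be reused to sanity-check any `model.summary()` built from these three layer types.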

## A recurrent model with an embedding layer

The below model would serve as a decent starting point for many problems related to text.

In [23]:
model = tf.keras.models.Sequential([
    tf.keras.layers.Embedding(input_dim=1000, output_dim=128),
    tf.keras.layers.GRU(64),
    tf.keras.layers.Dense(10, activation='softmax'),
])

model.summary()

Model: "sequential_7"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, None, 128)         128000    
                                                                 
 gru_4 (GRU)                 (None, 64)                37248     
                                                                 
 dense_2 (Dense)             (None, 10)                650       
                                                                 
Total params: 165,898
Trainable params: 165,898
Non-trainable params: 0
_________________________________________________________________


## The embedding layer architecture

<img src="./graphics/Emded 3 Middle.png" alt="Drawing" style="width: 1000px;"/>

## Exercise

1. Use the IMDB movie review data (positive/negative movie reviews) to build a model that is able to predict the sentiment (positive/negative) from movie reviews. Your initial model should use one embedding layer, one recurrent layer (up to you which type), and a final fully connected layer to perform the classification.
1. Try to improve your model by doing (at least) the following: add an additional recurrent layer and/or use a bidirectional recurrent layer (**note**: If you have a good 1-layer model this may be difficult - just try your best).
1. In my preprocessing of the data, I kept the top 1000 words and let all reviews be 100 words long. Consider changing one/both of these to try to improve your best model (**hint**: the limit of only 100 words is very severe - try doubling it to 200, which will likely improve your performance).

**Hint**: You may use the notebook I have uploaded under this lecture as a starting point (exercise-imdb.ipynb). It provides some of the code, and you then have to fill in the rest. You do not have to use it - it is there if you think it might be helpful!

## Special considerations for RNNs - dropout

Remember how we used some layers differently for CNNs (such as dropout)? Not because it was needed, or even necessarily better, but because there might be cases where the "normal" version works in some unexpected way (which may be detrimental), such as dropout not performing strong regularization if adjacent pixels are highly correlated.

It turns out that such differences are also occasionally warranted for RNNs - and once again dropout is such a case.

First, we may ask: **where** do we even apply dropout? We now have two candidates - the input units and the recurrent units.

Here, one answer would be *both* places. But as it turns out (see next slide), recurrent dropout may be prohibitively slow.

Second, we may ask **how** do we apply dropout? It turns out that fixing the dropout mask along the time-dimension for each forward pass is often helpful (TensorFlow easily handles this).

Great blog: https://adriangcoder.medium.com/a-review-of-dropout-as-applied-to-rnns-72e79ecd5b7b
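The difference between a fresh dropout mask at every time step and a mask that is fixed along the time dimension can be sketched in plain Python. The function names are hypothetical and this is not how TensorFlow implements it internally; it only illustrates the idea on a sequence of all ones.

```python
import random

def dropout_mask(n_units, rate, rng):
    """Inverted-dropout mask: 0 with probability `rate`, else 1 / (1 - rate)."""
    return [0.0 if rng.random() < rate else 1.0 / (1.0 - rate)
            for _ in range(n_units)]

def apply_dropout(sequence, rate, rng, fixed_over_time):
    """Apply dropout to a sequence of feature vectors (one vector per time step)."""
    mask = dropout_mask(len(sequence[0]), rate, rng)
    dropped = []
    for x_t in sequence:
        if not fixed_over_time:
            mask = dropout_mask(len(x_t), rate, rng)  # fresh mask at every step
        dropped.append([m * x for m, x in zip(mask, x_t)])
    return dropped

rng = random.Random(0)
sequence = [[1.0] * 6 for _ in range(4)]  # 4 time steps, 6 features, all ones

fixed = apply_dropout(sequence, rate=0.5, rng=rng, fixed_over_time=True)
varying = apply_dropout(sequence, rate=0.5, rng=rng, fixed_over_time=False)

print("fixed mask (same units dropped at every time step):")
for row in fixed:
    print(row)
print("varying mask (different units dropped at each time step):")
for row in varying:
    print(row)
```

With a fixed mask, a dropped unit stays dropped for the whole sequence, so the network cannot recover the missing information one step later.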

## Special considerations for RNNs - dropout

### Dropout on input units

<img src="./graphics/dropout_simple.JPG" alt="Drawing" style="width: 1000px;"/>

## Special considerations for RNNs - dropout

### Dropout on recurrent units

<img src="./graphics/dropout_lstm.JPG" alt="Drawing" style="width: 1000px;"/>

**Note**: The TensorFlow implementation is based on the variant labelled "ours", described in https://arxiv.org/abs/1603.05118

## Special considerations for RNNs - "hardware acceleration"

In order to make use of optimized implementations of recurrent layers in TensorFlow, a number of conditions have to be met.

To see the full list, look specifically at the documentation of specific layers:
1. https://www.tensorflow.org/api_docs/python/tf/keras/layers/SimpleRNN
1. https://www.tensorflow.org/api_docs/python/tf/keras/layers/LSTM
1. https://www.tensorflow.org/api_docs/python/tf/keras/layers/GRU

In general, however, there are two "easy" ways to break the rules required for efficient implementations:
1. You may be tempted to change the activation function (e.g. to ReLU). This often results in heavy slowdowns, however, so I recommend keeping the hyperbolic tangent (on my computer, changing it can easily make a model 30 times slower!).
1. Using recurrent dropout (i.e. applying dropout to the recurrent state). This often results in heavy slowdowns (on my computer, this can easily make a model 50 times slower!).

## Convolutional layers for sequential data

Recall how I briefly mentioned that convolutional layers could also be used for non-image data?

One such case is sequential data, where 1D convolutions are sometimes very helpful.

A great benefit is that these models are often blazingly fast compared to recurrent models (which are notoriously slow), while still explicitly using parameter sharing to handle sequential data.

They are also much easier to train.

Note, however, that if we want to learn from a very long sequence where the entire sequence is needed for understanding, we need very wide kernels and/or many convolutions after each other.

However, kernels may also be much larger, since they are now vectors rather than matrices (i.e. a kernel of length 9 has as many parameters as a 2D kernel of size (3, 3)).

Further, pooling may still be used (which significantly helps in handling long sequences).

## 1D convolutions

*Note that pooling works similarly, except that instead of a dot product with learned parameters it applies the specified operation (max, average, or something else, depending on the type of pooling).*

<img src="./graphics/Architecture-overview-of-the-CNN-model.png" alt="Drawing" style="width: 1000px;"/>
Source: "DLWP"
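The sliding-window mechanics can be sketched in plain Python for a single input channel and a single filter. The function names and the numbers are made up for illustration; a real `Conv1D` layer additionally sums over all input channels, adds a bias, and learns the kernel values.

```python
def conv1d(sequence, kernel):
    """'Valid' 1D convolution: slide the kernel along the sequence and take dot products."""
    k = len(kernel)
    return [sum(w * x for w, x in zip(kernel, sequence[i:i + k]))
            for i in range(len(sequence) - k + 1)]

def max_pool1d(sequence, pool_size):
    """Non-overlapping max pooling; a trailing incomplete window is dropped."""
    return [max(sequence[i:i + pool_size])
            for i in range(0, len(sequence) - pool_size + 1, pool_size)]

signal = [1, 3, 2, 5, 4, 6, 8, 7]
features = conv1d(signal, kernel=[-1, 0, 1])  # length 8 - 3 + 1 = 6
pooled = max_pool1d(features, pool_size=2)    # length 6 // 2 = 3

print(features)  # [1, 2, 2, 1, 4, 1]
print(pooled)    # [2, 2, 4]
```

Both operations shorten the sequence, which is how stacks of convolution and pooling layers handle long inputs.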

## A convolutional model for sequence data...

... here combined with an embedding layer, which is often done for text data.

In [19]:
vocabulary_size = 10
embedding_dimension = 5

cnn_model = tf.keras.models.Sequential([
    tf.keras.layers.Embedding(input_dim=vocabulary_size, 
                              output_dim=embedding_dimension),
    tf.keras.layers.Conv1D(filters=32, kernel_size=3, activation='relu')
])

cnn_model.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, None, 5)           50        
                                                                 
 conv1d (Conv1D)             (None, None, 32)          512       
                                                                 
Total params: 562
Trainable params: 562
Non-trainable params: 0
_________________________________________________________________


## Combining convolutional and recurrent layers

There is no barrier to combining convolutional and recurrent layers.

Indeed, this may make a lot of sense. Typically, this is done by first applying convolutional (and maybe pooling) layers to shorten a sentence, and then applying recurrent layers to the shorter sentence.

Intuitively (*but always be cautious when making such interpretations*), this works by first combining neighboring words into a new, smaller set of features (i.e. "summarizing" a part of a sentence), and then applying a recurrent neural network to interpret the sequence of these new features.



<img src="./graphics/figure_6-30.png" alt="Drawing" style="width: 800px;"/>
Source: "DLWP"

## A convolutional recurrent model for sequence data...

... here combined with an embedding layer, which is often done for text data.

In [20]:
vocabulary_size = 10
embedding_dimension = 5

cnn_rnn_model = tf.keras.models.Sequential([
    tf.keras.layers.Embedding(input_dim=vocabulary_size, 
                              output_dim=embedding_dimension),
    tf.keras.layers.Conv1D(filters=32, kernel_size=3, activation='relu'),
    tf.keras.layers.GRU(units=16),
])

cnn_rnn_model.summary()

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, None, 5)           50        
                                                                 
 conv1d_1 (Conv1D)           (None, None, 32)          512       
                                                                 
 gru (GRU)                   (None, 16)                2400      
                                                                 
Total params: 2,962
Trainable params: 2,962
Non-trainable params: 0
_________________________________________________________________


## Exercise

1. Try to improve your model from the earlier IMDB exercise by using some of the optimization and regularization tricks you know (for example, try dropout, early stopping, and weight regularization - there are many things to potentially try!).
1. Attempt to solve the IMDB classification task using a convolutional neural network.
1. Build a model which combines convolutional and recurrent layers to solve the IMDB classification task.

**Hint**: You may use the notebook I have uploaded under this lecture as a starting point (exercise-imdb-2.ipynb). It provides some of the code, and you then have to fill in the rest. You do not have to use it - it is there if you think it might be helpful!

## An introduction to generative models

We now have all the pieces needed to do something brand new - instead of focusing on classification or regression$^1$, which is what we have mostly been doing, we can start to *generate* new data.

Specifically, we will build an RNN to generate text by using Shakespeare's plays. We will start *from scratch*, in the sense of loading a text file of raw text. Think of what this means: you can use an entirely similar approach for any text data you can imagine!

The exact implementation can still be done in many ways, but for this example, we will:
1. Use the text at the character level (instead of our current examples, which were built at the word level).
2. Use an embedding layer.
3. Use a GRU layer.
4. Use a dense layer (for character classification).

**Note**: This draws heavily from https://www.tensorflow.org/tutorials/text/text_generation, but I have tried to explain some additional stuff.

$^1$As will become apparent, we actually still perform classification to train our model in the first place.

## Graphic illustration

Putting everything together, we will end up with something like:
<img src="./graphics/text_generation_training.png" alt="Drawing" style="width: 800px;"/>
Source: https://www.tensorflow.org/tutorials/text/text_generation

In [50]:
import os
import tensorflow as tf
import numpy as np

In [51]:
# We load the data from the web
path_to_file = tf.keras.utils.get_file('shakespeare.txt', 
                                       'https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt')

In [52]:
# Read and decode (so that \n becomes an actual line break)
text = open(path_to_file, 'rb').read().decode(encoding='utf-8')
# length of text is the number of characters in it
print(f'Length of text: {len(text)} characters')

Length of text: 1115394 characters


In [53]:
# Take a look at the first 250 characters in text
print(text[:250])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.



In [54]:
# The unique characters in the file
vocab = sorted(set(text))
print (f'{len(vocab)} unique characters')
print(vocab)

65 unique characters
['\n', ' ', '!', '$', '&', "'", ',', '-', '.', '3', ':', ';', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']


In [55]:
# Creating a mapping from unique characters to indices
char2idx = {unique_char: idx for idx, unique_char in enumerate(vocab)}
print(char2idx)

# And convert the entire text to integers (corresponding to characters,
# using the mapping above).
idx2char = np.array(vocab)
text_as_int = np.array([char2idx[char] for char in text])

{'\n': 0, ' ': 1, '!': 2, '$': 3, '&': 4, "'": 5, ',': 6, '-': 7, '.': 8, '3': 9, ':': 10, ';': 11, '?': 12, 'A': 13, 'B': 14, 'C': 15, 'D': 16, 'E': 17, 'F': 18, 'G': 19, 'H': 20, 'I': 21, 'J': 22, 'K': 23, 'L': 24, 'M': 25, 'N': 26, 'O': 27, 'P': 28, 'Q': 29, 'R': 30, 'S': 31, 'T': 32, 'U': 33, 'V': 34, 'W': 35, 'X': 36, 'Y': 37, 'Z': 38, 'a': 39, 'b': 40, 'c': 41, 'd': 42, 'e': 43, 'f': 44, 'g': 45, 'h': 46, 'i': 47, 'j': 48, 'k': 49, 'l': 50, 'm': 51, 'n': 52, 'o': 53, 'p': 54, 'q': 55, 'r': 56, 's': 57, 't': 58, 'u': 59, 'v': 60, 'w': 61, 'x': 62, 'y': 63, 'z': 64}


In [56]:
# Show how the first 13 characters from the text are mapped to integers
print('{} ---- characters mapped to int ---- > {}'.format(text[:13], text_as_int[:13]))

First Citizen ---- characters mapped to int ---- > [18 47 56 57 58  1 15 47 58 47 64 43 52]


## Now to get the data in a format for the network

So far so good - we have the data *as well* as a representation for the data using integers. This is great!

The next steps prepare the data optimally for a (recurrent) neural network and are somewhat advanced...

In [58]:
# The maximum length sentence we want for a single input in characters
seq_length = 100
examples_per_epoch = len(text) // (seq_length + 1)

# Create training examples / targets: 
# tf.data.Dataset.from_tensor_slices converts numpy into tf.data
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)

# Let us print the first 5 characters. Notice they spell "First", as they should!
for idx in char_dataset.take(5):
    print(f'{idx.numpy()} ---- corresponds to ---- > {idx2char[idx.numpy()]}')

18 ---- corresponds to ---- > F
47 ---- corresponds to ---- > i
56 ---- corresponds to ---- > r
57 ---- corresponds to ---- > s
58 ---- corresponds to ---- > t


In [61]:
# We have the characters - now we want to create sequences. 
# Remember seq_length = 100 here! This can be tuned
# We drop the remainder to ensure all sequences are equally long.
sequences = char_dataset.batch(seq_length + 1, drop_remainder=True)

# Let's print the first 2 batches - each of length 101. Why 101 and not 100?
# Because we want to predict the next character.
# For this, we will later (next slide) split each batch into 2 chunks,
# one with characters 1:100 and one with characters 2:101 (both of length 100).
for jter,item in enumerate(sequences.take(2)):
    print(f'Length of batch {jter+1} is {item.numpy().shape} and writes:')    
    print('--------')
    print(''.join(idx2char[item.numpy()]))
    print('--------')    
    print('--------')

Length of batch 1 is (101,) and writes:
--------
First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You 
--------
--------
Length of batch 2 is (101,) and writes:
--------
are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you k
--------
--------


In [39]:
# Now we split the sequences into the input for the network and the output 
#(target) of the network!
def split_input_target(chunk):
    input_text = chunk[:-1] # 1:100
    target_text = chunk[1:] # 2:101

    return input_text, target_text

dataset = sequences.map(split_input_target)
print(dataset) # input and output 100 long now

<MapDataset element_spec=(TensorSpec(shape=(100,), dtype=tf.int32, name=None), TensorSpec(shape=(100,), dtype=tf.int32, name=None))>


In [40]:
# Let us print just the first 2 examples. See how the target is "shifted" 1 character!
for input_example, target_example in  dataset.take(2):
    print('Input data: ', repr(''.join(idx2char[input_example.numpy()])))
    print('Target data:', repr(''.join(idx2char[target_example.numpy()])))

Input data:  'First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou'
Target data: 'irst Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou '
Input data:  'are all resolved rather to die than to famish?\n\nAll:\nResolved. resolved.\n\nFirst Citizen:\nFirst, you '
Target data: 're all resolved rather to die than to famish?\n\nAll:\nResolved. resolved.\n\nFirst Citizen:\nFirst, you k'
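The chunking and shifting done above with `tf.data` can be mimicked in plain Python on a short string. The helper `make_examples` is hypothetical and uses `seq_length=9` only to keep the output readable.

```python
def make_examples(text, seq_length):
    """Cut text into chunks of seq_length + 1 characters and split each into (input, target)."""
    chunk_size = seq_length + 1
    n_chunks = len(text) // chunk_size  # the remainder is dropped
    examples = []
    for c in range(n_chunks):
        chunk = text[c * chunk_size:(c + 1) * chunk_size]
        examples.append((chunk[:-1], chunk[1:]))  # target = input shifted by one
    return examples

examples = make_examples("First Citizen: Before we proceed", seq_length=9)
for input_text, target_text in examples:
    print(repr(input_text), "->", repr(target_text))
```

Each chunk needs one character more than `seq_length`, since the last target character has to come from somewhere.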


In [41]:
# Below we showcase how the individual input/outputs to the network look. 
# Remember that input_example now holds the second example, which starts with "are"!
for i, (input_idx, target_idx) in enumerate(zip(input_example[:5], target_example[:5])):
    print(f"Step_ {i}")
    print("  Input:          {} ({:s})".format(input_idx, repr(idx2char[input_idx])))
    print("  Target output:  {} ({:s})".format(target_idx, repr(idx2char[target_idx])))

Step_ 0
  Input:          39 ('a')
  Target output:  56 ('r')
Step_ 1
  Input:          56 ('r')
  Target output:  43 ('e')
Step_ 2
  Input:          43 ('e')
  Target output:  1 (' ')
Step_ 3
  Input:          1 (' ')
  Target output:  39 ('a')
Step_ 4
  Input:          39 ('a')
  Target output:  50 ('l')


In [42]:
# Now for the batch size - 64 here. That is, each minibatch consists of 64 sequences, 
#each of 100 characters!
batch_size = 64

# Buffer size to shuffle the dataset
# (TF data is designed to work with possibly infinite sequences,
# so it doesn't attempt to shuffle the entire sequence in memory. Instead,
# it maintains a buffer in which it shuffles elements).
buffer_size = 10000

# Note how we drop the remainder here - otherwise not all batches are exactly 64 long.
dataset = dataset.shuffle(buffer_size).batch(batch_size, drop_remainder=True)

dataset

<BatchDataset element_spec=(TensorSpec(shape=(64, 100), dtype=tf.int32, name=None), TensorSpec(shape=(64, 100), dtype=tf.int32, name=None))>

In [43]:
# Now for some parameters to build the network!
vocab_size = len(vocab) # The 65 unique characters defined above

# The embedding dimension
embedding_dim = 256

# Number of RNN units
rnn_units = 1024

In [44]:
# And FINALLY we can build it. Yes, it really is that simple in Keras!
def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(
            input_dim=vocab_size, 
            output_dim=embedding_dim,
            batch_input_shape=[batch_size, None], 
            # None allows for different sequence lengths to be used
        ),
        tf.keras.layers.GRU(
            units=rnn_units,
            return_sequences=True,
            stateful=True,
            recurrent_initializer='glorot_uniform',
        ),
        tf.keras.layers.Dense(vocab_size, activation='softmax')
    ])

    return model

In [45]:
model = build_model(
  vocab_size = vocab_size,
  embedding_dim=embedding_dim,
  rnn_units=rnn_units,
  batch_size=batch_size)

In [46]:
model.summary()

Model: "sequential_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_2 (Embedding)     (64, None, 256)           16640     
                                                                 
 gru_1 (GRU)                 (64, None, 1024)          3938304   
                                                                 
 dense_1 (Dense)             (64, None, 65)            66625     
                                                                 
Total params: 4,021,569
Trainable params: 4,021,569
Non-trainable params: 0
_________________________________________________________________


In [47]:
# Let us test we can perform a forward pass - success!
for input_example_batch, target_example_batch in dataset.take(1):
    print(input_example_batch)
    example_batch_predictions = model(input_example_batch)
    print(example_batch_predictions.shape, "# (batch_size, sequence_length, vocab_size)")

tf.Tensor(
[[12  0 35 ...  5 57  1]
 [63  1 46 ... 46 47 50]
 [43  1 51 ... 57 43 52]
 ...
 [46  1 58 ... 60 43 57]
 [39 56  1 ...  6  1 47]
 [53 53 42 ... 13 30 32]], shape=(64, 100), dtype=int32)
(64, 100, 65) # (batch_size, sequence_length, vocab_size)


In [49]:
# Just for fun, let us look at how our network performs before we train it.
print(tf.math.log(example_batch_predictions)[0])
sampled_indices = tf.random.categorical(tf.math.log(example_batch_predictions)[0], 
                                        num_samples=1)
print(sampled_indices.shape)
sampled_indices = tf.squeeze(sampled_indices, axis=-1).numpy()
print(sampled_indices.shape)

print(f'sampled indices {sampled_indices}')

tf.Tensor(
[[-4.172442  -4.1706777 -4.17662   ... -4.1725254 -4.1558123 -4.168696 ]
 [-4.1792617 -4.172461  -4.1754208 ... -4.1703706 -4.1631985 -4.1841884]
 [-4.1662087 -4.155744  -4.1872277 ... -4.1833854 -4.183887  -4.183751 ]
 ...
 [-4.1796517 -4.184438  -4.173463  ... -4.160218  -4.1683226 -4.1516323]
 [-4.177162  -4.181773  -4.172084  ... -4.161225  -4.1751876 -4.1675224]
 [-4.1922317 -4.1769347 -4.18009   ... -4.1683035 -4.1681237 -4.1680255]], shape=(100, 65), dtype=float32)
(100, 1)
(100,)
sampled indices [62 62  4 54 21 23  8  8 60  3 41 14 21 27 24  3  9 63 50 44  0 14 47 45
  4 12 28 22 35 46 54 50 33 12  4 57 49  7 54 41 28 38 10 39 16  6  5 58
 54 39 45 11 38  2 39 11 52 25 52  4 58  7 40 35 29 62 33 12 53  7 14 38
 45  3 27 39  7 15 47 54  1 43 44 37  8 51  2 41 12 48 45 15 28 38 42  1
 37 63 41  1]
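Why are all the log-probabilities above close to -4.17? An untrained network spreads its probability almost evenly over the 65 characters, and $\log(1/65) \approx -4.174$. The sketch below checks this and mimics the sampling step in plain Python; `sample_categorical` is a hypothetical stand-in for what `tf.random.categorical` is used for here, not its implementation.

```python
import math
import random

vocab_size = 65
uniform_log_prob = math.log(1 / vocab_size)
print(round(uniform_log_prob, 3))  # -4.174

def sample_categorical(log_probs, rng):
    """Draw one index with probability proportional to exp(log_prob)."""
    weights = [math.exp(lp) for lp in log_probs]
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]

rng = random.Random(0)
log_probs = [uniform_log_prob] * vocab_size
samples = [sample_categorical(log_probs, rng) for _ in range(100)]
print(samples[:10])
```

With (nearly) uniform probabilities, the sampled characters are essentially random, which is why the text generated before training is gibberish.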


In [72]:
# Not good! Complete gibberish - but that is expected!
print("Input: \n", repr("".join(idx2char[input_example_batch[0]])))
print()
print("Next Char Predictions: \n", repr("".join(idx2char[sampled_indices ])))

Input: 
 '.\n\nVOLUMNIA:\nSweet madam.\n\nVIRGILIA:\nI am glad to see your ladyship.\n\nVALERIA:\nHow do you both? you '

Next Char Predictions: 
 "wNqxJyfjD\nL s'UyLJI-gEyNyMD,DWtQqcPSW,H.XMEi3FoU,AiO&jNlT\nXyh\nbvc!:Ecv!i;zDotW'csPi!JO&tMSceTKwmJXgq"


In [73]:
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

In [74]:
# Directory where the checkpoints will be saved
checkpoint_dir = 'C:/Users/cmd/Documents'
# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

checkpoint_callback=tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True)

In [75]:
history = model.fit(dataset, epochs=50, callbacks=[checkpoint_callback])

Epoch 1/50
...
Epoch 50/50


We have now trained our model - great! Let us check it out.

In [76]:
for input_example_batch, target_example_batch in dataset.take(1):
    example_batch_predictions = model(input_example_batch)
sampled_indices = tf.random.categorical(tf.math.log(example_batch_predictions)[0], 
                                        num_samples=1)
sampled_indices = tf.squeeze(sampled_indices, axis=-1).numpy()

# Still not good! 
print("Input: \n", repr("".join(idx2char[input_example_batch[0]])))
print()
print("Next Char Predictions: \n", repr("".join(idx2char[sampled_indices ])))    

Input: 
 "MNIA:\nO, he is wounded; I thank the gods for't.\n\nMENENIUS:\nSo do I too, if it be not too much: bring"

Next Char Predictions: 
 "OOE:\nO, se is nounded? I'ehank theeaaws forgt:\n\nEENENIUS:\nIi iooI to;, if it be non\nto ,such,\nbuidgs"


## How to predict more than one character?

Our network only predicts ONE character at a time - but we want to generate long sequences of text!

This is easily handled by simply *using each prediction as input to the next prediction*.

<img src="./graphics/text_generation_sampling.png" alt="Drawing" style="width: 800px;"/>
Source: https://www.tensorflow.org/tutorials/text/text_generation
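The feedback loop itself is independent of the model. In the sketch below, the RNN is replaced by a deliberately naive stand-in (a table of the most frequent next character, built from a short string), so that only the loop is on display. Both helper functions are hypothetical.

```python
def fit_next_char_table(text):
    """For each character, remember the character that most often follows it."""
    counts = {}
    for current, following in zip(text, text[1:]):
        counts.setdefault(current, {})
        counts[current][following] = counts[current].get(following, 0) + 1
    # sorted() makes ties deterministic: the smallest character wins
    return {c: max(sorted(nxt), key=nxt.get) for c, nxt in counts.items()}

def generate(table, start_char, num_generate):
    """Generate text by feeding each prediction back in as the next input."""
    generated = [start_char]
    for _ in range(num_generate):
        next_char = table.get(generated[-1])  # last output is the next input
        if next_char is None:  # character never seen with a successor
            break
        generated.append(next_char)
    return "".join(generated)

table = fit_next_char_table("speak, speak. ")
print(repr(generate(table, "s", 12)))  # 'speak, speak,'
```

Unlike this table, an RNN also carries a hidden state from step to step, so its predictions can depend on much more than the single previous character.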

To keep the prediction step simple, we use a batch size of 1.

Because of the way the RNN state is passed from timestep to timestep, the model only accepts a fixed batch size once built.

To run the model with a different batch_size, we need to rebuild the model and restore the weights from the checkpoint.

In [78]:
model = build_model(vocab_size, embedding_dim, rnn_units, batch_size=1)

model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))

model.build(tf.TensorShape([1, None]))

model.summary()

Model: "sequential_17"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_10 (Embedding)    (1, None, 256)            16640     
                                                                 
 gru_13 (GRU)                (1, None, 1024)           3938304   
                                                                 
 dense_10 (Dense)            (1, None, 65)             66625     
                                                                 
Total params: 4,021,569
Trainable params: 4,021,569
Non-trainable params: 0
_________________________________________________________________


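We need to format the predictions so that they are readable to us "humans".

The function below converts the predicted indices back into characters and joins them into a single string.

Further, it provides some options for how the prediction is made: whether to sample from the predicted distribution or to pick the most likely character (`sample`), and whether to keep the hidden state between steps (`stateless`).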

In [79]:
def generate_text(model, start_string, num_generate=300, sample=True,
                  stateless=False, visuals=False):
    # Converting our start string to numbers (vectorizing)
    input_eval = [char2idx[s] for s in start_string]
    input_eval = tf.expand_dims(input_eval, 0)

    # Empty string to store our results
    text_generated = []

    model.reset_states()
    for i in range(num_generate):
        if visuals:
            print(f'input_eval {i}: {input_eval.shape}')
        
        predictions = model(input_eval)
        
        if visuals:
            print(f'sequence {i}: {predictions.shape}')

        predictions = tf.squeeze(predictions, 0) # remove the batch dimension
        
        if sample:
            predicted_id = tf.random.categorical(tf.math.log(predictions), 
                                                 num_samples=1)[-1,0].numpy()
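        else:
            # Greedy choice: use the LAST time step, as in the sampling branch
            # (index 0 would ignore all but the first character of a longer input)
            predicted_id = tf.argmax(predictions[-1]).numpy()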

        # We pass the predicted character as the next input to the model
        # along with the previous hidden state
        input_eval = tf.expand_dims([predicted_id], 0)
        
        text_generated.append(idx2char[predicted_id])
        if visuals:
            print(f'text: {"".join(text_generated)}')
        
        # if stateless, we DO NOT pass along hidden state. This is a terrible idea
        if stateless:
            model.reset_states()

    return start_string + ''.join(text_generated)

And now, with the model and a way to format the data, there is only one thing to do...

Let us see what I have to say - according to the network!

In [80]:
print(generate_text(model, start_string="LECTURER: ", num_generate=30, visuals=True))

input_eval 0: (1, 10)
sequence 0: (1, 10, 65)
text: y
input_eval 1: (1, 1)
sequence 1: (1, 1, 65)
text: yo
input_eval 2: (1, 1)
sequence 2: (1, 1, 65)
text: you
input_eval 3: (1, 1)
sequence 3: (1, 1, 65)
text: you 
input_eval 4: (1, 1)
sequence 4: (1, 1, 65)
text: you m
input_eval 5: (1, 1)
sequence 5: (1, 1, 65)
text: you mi
input_eval 6: (1, 1)
sequence 6: (1, 1, 65)
text: you mis
input_eval 7: (1, 1)
sequence 7: (1, 1, 65)
text: you mist
input_eval 8: (1, 1)
sequence 8: (1, 1, 65)
text: you mista
input_eval 9: (1, 1)
sequence 9: (1, 1, 65)
text: you mistak
input_eval 10: (1, 1)
sequence 10: (1, 1, 65)
text: you mistake
input_eval 11: (1, 1)
sequence 11: (1, 1, 65)
text: you mistake,
input_eval 12: (1, 1)
sequence 12: (1, 1, 65)
text: you mistake,

input_eval 13: (1, 1)
sequence 13: (1, 1, 65)
text: you mistake,
W
input_eval 14: (1, 1)
sequence 14: (1, 1, 65)
text: you mistake,
We
input_eval 15: (1, 1)
sequence 15: (1, 1, 65)
text: you mistake,
We 
input_eval 16: (1, 1)
sequence 16:

In [81]:
# Or what about you?
print(generate_text(model, start_string="THE STUDENT: "))

THE STUDENT: she is well.
How is't your mistress
We follow'd to-morrow no incense, yet they
Upon this is the rabbect, standing each part
The heavy lion-fiends still live che to them.

COMINIUS:
I know ye well.

ANGELO:
Were you in your suit!

Apacrion, obe in all duty, freed the better, time an arguill
To my det


In [83]:
# Or what about Christian M. Dahl, the course responsible?
print(generate_text(model, start_string="THE RESPONSIBLE: "))

THE RESPONSIBLE: has best believe the
purposes in the field?
I think, say I would say 'twere past all brought together, must
In open rance, as I though my reasons are
cure-diercus for that sounds:
I God, Hermione would the work about the government of Clarence,
That private him sings? called me but every tood,
That 


In [84]:
print(generate_text(model, start_string="Christian: "))

Christian: faith becomes thy foolish knavest in your stomach, though make pale?

CLIFFORD:
Plantagenet! for he gives my sweet wonder and
call me to bey the like ancertier.

PETRUCHIO:
And you.

CORIOLANUS:
Voul-placery and gone.

SEBASTIAN:
Good with an unloverch for my boss:
Might have found inclination: he h


## The importance of sampling

A problem with not using sampling, i.e. always picking the most likely next character, is (aside from it not being very "creative") that it **potentially** results in looping: the generated text gets stuck repeating the same sequence.

The specific example below looks fine, but I promise that there are many catastrophic cases ("my" network just happened to behave well).

<img src="./graphics/IllustratingSampling.PNG" alt="Drawing" style="width: 800px;"/>

In [85]:
print(generate_text(model, start_string="THE STUDENT", num_generate=300, 
                    sample=False))

THE STUDENTIO:
And when the king shall be common fools; if you had
When such a fellow is a gentle provost:
The other for us in a pick, if thou hast
The ordering of the mind of Bolingbroke
It for the son of Henry the Fourth!

KING EDWARD IV:
But now you have made fair work, some dear dog! shall I be so contente
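
To make the difference between the two choices concrete, here is a small sketch in plain Python. The characters and probabilities are made-up numbers for a single fixed context, not values from the trained network, and the helper names `pick_greedy` and `pick_sampled` are hypothetical.

```python
import random
from collections import Counter

# Hypothetical next-character probabilities for one fixed context
# (made-up numbers, not taken from the trained network).
chars = ["e", "a", "t", "z"]
probs = [0.50, 0.30, 0.15, 0.05]


def pick_greedy(chars, probs):
    """No sampling: always return the most likely character."""
    best_index = max(range(len(probs)), key=lambda i: probs[i])
    return chars[best_index]


def pick_sampled(chars, probs, rng):
    """Sampling: draw a character in proportion to its probability."""
    return rng.choices(chars, weights=probs, k=1)[0]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
greedy_picks = [pick_greedy(chars, probs) for _ in range(1000)]
sampled_picks = [pick_sampled(chars, probs, rng) for _ in range(1000)]

greedy_counts = Counter(greedy_picks)
sampled_counts = Counter(sampled_picks)
print(greedy_counts)   # a single character, chosen every time
print(sampled_counts)  # several characters, roughly in proportion to probs
```

Because the greedy choice is deterministic, a model that ends up in a situation it has been in before will produce the same continuation again, which is how loops arise. Sampling breaks this determinism.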


## Importance of state

Without the hidden state, the model goes crazy. This is not surprising: knowing only the last letter is not enough to write something meaningful.

In [86]:
print(generate_text(model, start_string="THE LECTURER: ", num_generate=300, 
                    stateless=True))

THE LECTURER: MINDUCKINCKINUCKIOLANUCKIO:
RDYOMOLOFFFRDULANDWhe,
Whevizen wan,
TENCHARIDULINTIOPELINCKINTHOLANICK:
Whe METRKINCKETERK:
T:
TINUCKILABY:
Whamend he,
TINCUCKIOLUCHOPENCKINCK:
TADUCKINULOFRIUST:
BY:
ANINTERK:
CK:
ANCKINTHANINCKI ISTIOMINCKIOLANUCUCK:
TETHOPENDWhevend;
OMIOFRKINULOFFOWhe,
CKINUCKININES


## No sampling, no state, no fun

In [87]:
print(generate_text(model, start_string="THE LECTURER: ", num_generate=300, 
                    sample=False, stateless=True))

THE LECTURER: INCKINCKINCKINCKINCKINCKINCKINCKINCKINCKINCKINCKINCKINCKINCKINCKINCKINCKINCKINCKINCKINCKINCKINCKINCKINCKINCKINCKINCKINCKINCKINCKINCKINCKINCKINCKINCKINCKINCKINCKINCKINCKINCKINCKINCKINCKINCKINCKINCKINCKINCKINCKINCKINCKINCKINCKINCKINCKINCKINCKINCKINCKINCKINCKINCKINCKINCKINCKINCKINCKINCKINCKINCKINCKINCK
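
The repeating output above is what one would expect. With neither state nor sampling, the next character is a deterministic function of the last character alone, so the text must start repeating after at most `vocab_size` steps. The sketch below illustrates this with a hypothetical lookup table `NEXT_CHAR`, chosen by hand to mimic the output above; it is not extracted from the trained network.

```python
# Hypothetical greedy, stateless "model": the next character depends only on
# the current one. The table is hand-picked to mimic the output above.
NEXT_CHAR = {" ": "I", "I": "N", "N": "C", "C": "K", "K": "I"}


def greedy_stateless_text(next_char, start_char, num_generate):
    """Generate text when each step only sees the previous character."""
    generated = []
    current = start_char
    for _ in range(num_generate):
        current = next_char[current]
        generated.append(current)
    return "".join(generated)


def cycle_length(next_char, start_char):
    """Length of the loop that the generated characters end up in."""
    first_seen = {}
    current = start_char
    step = 0
    while current not in first_seen:
        first_seen[current] = step
        current = next_char[current]
        step += 1
    return step - first_seen[current]


looping_text = greedy_stateless_text(NEXT_CHAR, " ", 12)
loop_length = cycle_length(NEXT_CHAR, " ")
print(looping_text)
print(loop_length)
```

Since a character can only be followed by one fixed successor, the loop can never be longer than the number of distinct characters.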


## How to improve the model

The model presented in these slides is decent but not super impressive. However, it is also very simple (and since we perform character prediction, the model needs to learn spelling, grammar and so on!). You may want to try to improve it.

Specifically, you may want to:
1. Use a larger model: a larger embedding dimension, more units in the GRU, or potentially even more layers.
2. Train for more epochs.
3. Apply different optimization and/or regularization.
4. Use a different architecture.
5. Use words instead of characters.
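
For the last suggestion, the vocabulary has to be built over words rather than characters. A minimal sketch, assuming a simple regular-expression tokenizer: the names `word2idx` and `idx2word` are hypothetical analogues of `char2idx` and `idx2char`, and the short sample string merely stands in for the full corpus.

```python
import re

# Short sample standing in for the full text corpus (assumption).
text = "First Citizen:\nBefore we proceed any further, hear me speak.\n"

# Split into words, punctuation marks and newlines; other whitespace is dropped.
tokens = re.findall(r"\w+|[^\w\s]|\n", text)

# Word-level vocabulary, analogous to the character-level one.
vocab = sorted(set(tokens))
word2idx = {word: index for index, word in enumerate(vocab)}
idx2word = vocab  # position in the list is the word's index

# Vectorize the text and map it back again.
encoded = [word2idx[token] for token in tokens]
decoded = [idx2word[index] for index in encoded]

print(tokens)
print(encoded)
```

A word-level vocabulary is much larger than the 65 characters used here, so the embedding and output layers grow accordingly, but the model no longer has to learn how to spell.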

In general, tinkering with a model to try to improve performance is a great way to become more familiar with a topic - so try to see how much you can improve it!

## Classical language models

An important method in natural language processing (NLP) - as well as other topics - is the concept of *attention*. This has laid the foundation of transformer networks (*not to be confused with spatial transformers!*).

For those of you interested in learning more, I suggest reading more on this. One approach would be to:
1. Read "Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. In Advances in neural information processing systems (pp. 5998-6008)", see https://arxiv.org/abs/1706.03762. This is the paper that introduced the transformer.
1. Then read "Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., ... & Amodei, D. (2020). Language models are few-shot learners. arXiv preprint arXiv:2005.14165", see https://arxiv.org/abs/2005.14165. It shows some of the amazing results obtained by scaling models up BIG (175B parameters).

## Some suggestions for further work

The generative example we worked with today was built directly on raw text. This means that it can work quite easily on other sources of text data.

As such, if you want to explore more, I suggest you find some text data you find fun. This may be a text conversation you have with someone else - maybe a group conversation - a piece of literature you like, or something else.

Then, try to apply the same techniques (most of the code does not need to change). Perhaps you can come up with something fun!

## Advantages and disadvantages of RNNs

Advantages
1. RNNs are a type of model that allows the learning of long-term dependencies by making efficient use of parameter sharing.
1. Although not directly related to RNNs, embedding layers are a powerful tool to map sparse data (such as one-hot encoded words) to dense vectors. Combining embedding layers with RNNs allows for powerful language models.
1. RNNs may be used to perform sequence prediction, including using them in an "autoregressive" way (using their output as a new input), which allows for generative modelling.
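
The point about embeddings can be illustrated in a few lines of plain Python: multiplying a one-hot vector by a weight matrix simply selects one row of that matrix, which is why an embedding layer can be implemented as a lookup. The matrix values below are made up for illustration, and the function names are hypothetical.

```python
# Made-up embedding matrix: 4 tokens, each mapped to a dense 2-dimensional vector.
embedding_matrix = [
    [0.1, -0.3],
    [0.7, 0.2],
    [-0.5, 0.9],
    [0.0, 0.4],
]


def one_hot(index, size):
    """Sparse representation: all zeros except a single one."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def embed_by_matmul(index):
    """Multiply the one-hot vector by the embedding matrix."""
    vector = one_hot(index, len(embedding_matrix))
    dim = len(embedding_matrix[0])
    return [
        sum(vector[row] * embedding_matrix[row][col] for row in range(len(vector)))
        for col in range(dim)
    ]


def embed_by_lookup(index):
    """Select the corresponding row directly, without any multiplication."""
    return list(embedding_matrix[index])


for token_index in range(len(embedding_matrix)):
    print(token_index, embed_by_matmul(token_index), embed_by_lookup(token_index))
```

In a trained model, the rows of the matrix are learned together with the rest of the network.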

Disadvantages
1. RNNs are notoriously difficult and slow to train, since the loop has to be unfolded for backpropagation. This results in extremely deep models for long sequences.
1. This has led research in e.g. NLP to focus on non-RNN models, such as the transformer.
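
One well-known consequence of this depth, not spelled out in the list above, is that gradients can shrink rapidly as they are propagated back through many time steps. The sketch below assumes a scalar toy RNN, $h_t = \tanh(w h_{t-1} + x_t)$, for which $\partial h_t / \partial h_{t-1} = w (1 - h_t^2)$. The function name and the chosen values are hypothetical.

```python
import math


def gradient_through_time(w, inputs, h0=0.0):
    """Return d h_t / d h_0 for every t in a scalar RNN h_t = tanh(w*h_{t-1} + x_t)."""
    h = h0
    grad = 1.0
    grads = []
    for x in inputs:
        h = math.tanh(w * h + x)
        # Chain rule: one extra factor w * (1 - h_t^2) per unrolled time step.
        grad *= w * (1.0 - h * h)
        grads.append(grad)
    return grads


grads = gradient_through_time(w=0.5, inputs=[0.1] * 100)
print(grads[0])    # influence of h_0 after 1 step
print(grads[9])    # after 10 steps
print(grads[-1])   # after 100 steps
```

With $|w| < 1$, each factor is smaller than one in absolute value, so the influence of early time steps on the final state decays at least geometrically with the sequence length. Gated architectures such as the LSTM and GRU were designed to mitigate this.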

## Summary and looking ahead

In this lecture, we dived deep into the world of NLP and, more generally, working with sequential data. We covered RNNs in detail, including their structure and use and the tricks they use to manage sequential data, and got a brief glimpse of what the current SOTA looks like.

Further, we encountered for the first time generative modelling. 

As such, this lecture is a turning point. Moving forward, we will dive beyond the topics covered in *Deep Learning*, moving on to highly advanced generative modelling. We will start by learning about (variational) autoencoders and then move on to generative adversarial networks.

Much of the work done in these fields deals with CNNs, but the methods are much more general. That is, even though we will work extensively with image data, rest assured that the methods are useful in many fields.