## Recurrent Neural Networks

## The basic RNN Cell

RNN cells incorporate this dependence by having a hidden state, or memory, that holds the essence of what has been seen so far. The value of the hidden state at any point in time is a function of the value of the hidden state at the previous time step and the value of the input at the current time step, that is:  
  
$$h_t = \phi(h_{t-1}, x_t)$$
  
Here, $h_t$ and $h_{t-1}$ are the values of the hidden states at the time $t$ and $t-1$ respectively, and $x_t$ is the value of the input at time $t$. Notice that the equation is recursive, that is, $h_{t-1}$ can be represented in terms of $h_{t-2}$ and $x_{t-1}$, and so on, until the beginning of the sequence. This is how RNNs encode and incorporate information from arbitrarily long sequences.  
  
Just as in a traditional neural network, where the learned parameters are stored as weight matrices, the RNN's parameters are defined by the three weight matrices $U$, $V$, and $W$, corresponding to the weight of the input, output, and hidden states respectively.  
  
Note that the weight matrices $U$, $V$, and $W$ are shared across all the time steps. This is because we are applying the same operation to different inputs at each time step. Sharing these weights across all the time steps greatly reduces the number of parameters that the RNN needs to learn.  
  
We can also describe the RNN as a computation graph in terms of equations. The internal state of the RNN at time $t$ is given by the value of the hidden vector $h_t$, which is the sum of the product of the weight matrix $W$ and the hidden state $h_{t-1}$ at time $t-1$, and the product of the weight matrix $U$ and the input $x_t$ at time $t$, passed through a **`tanh`** activation function. The choice of **`tanh`** over other activation functions such as sigmoid has to do with it being more efficient for learning in practice, and it helps combat the vanishing gradient problem.  
  
The output vector $y_t$ at time $t$ is the product of the weight matrix $V$ and the hidden state $h_t$, passed through a softmax activation, such that the resulting vector is a set of output probabilities:
  
$$h_t = \tanh(Wh_{t-1} + Ux_t)$$
$$y_t = \mathrm{softmax}(Vh_t)$$
  
Keras provides the SimpleRNN recurrent layer that incorporates all the logic we have seen so far, as well as the more advanced variants such as LSTM and GRU.
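
The two equations above can be traced in a few lines of NumPy. The following is a minimal sketch rather than what Keras does internally: the sizes, the random weights, and the helper names `softmax` and `rnn_forward` are assumptions made for illustration, and bias terms are omitted, as in the equations.

```python
import numpy as np

def softmax(z):
    # subtract the max before exponentiating for numerical stability
    e = np.exp(z - np.max(z))
    return e / e.sum()

def rnn_forward(xs, U, W, V, h0):
    """Apply h_t = tanh(W h_{t-1} + U x_t) and y_t = softmax(V h_t) over a sequence."""
    h = h0
    hs, ys = [], []
    for x in xs:
        # the same U, W and V are reused at every time step
        h = np.tanh(W @ h + U @ x)
        hs.append(h)
        ys.append(softmax(V @ h))
    return np.stack(hs), np.stack(ys)

# illustrative sizes (assumptions): 3 input features, 4 hidden units, 2 output classes
rng = np.random.default_rng(0)
U = rng.normal(size=(4, 3))
W = rng.normal(size=(4, 4))
V = rng.normal(size=(2, 4))
xs = rng.normal(size=(5, 3))  # a sequence of 5 time steps

hs, ys = rnn_forward(xs, U, W, V, h0=np.zeros(4))
print(hs.shape, ys.shape)  # (5, 4) (5, 2)
```

Because the same `U`, `W`, and `V` are used inside the loop, the number of parameters does not depend on the length of the sequence.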

## Backpropagation through time (BPTT)

Just like traditional neural networks, training RNNs also involves backpropagation of gradients. The difference in this case is that since the weights are shared by all time steps, the gradient at each output depends not only on the current time step, but also on the previous ones. This process is called backpropagation through time. Because the weights $U$, $V$, and $W$ are shared across the different time steps in the case of RNNs, we need to sum up the gradients across the various time steps in the case of BPTT. This is the key difference between traditional backpropagation and BPTT.  
  
Consider the RNN with five time steps shown in *Figure 2*. During the forward pass, the network produces predictions $\hat y_t$ at time $t$ that are compared with the label $y_t$ to compute a loss $L_t$. During backpropagation (shown by the dotted lines), the gradients of the loss with respect to the weights $U$, $V$, and $W$ are computed at each time step, and the parameters are updated with the sum of the gradients:

![image.png](attachment:image.png)

<center>Figure 2: Backpropagation through time</center>

The following equation shows the gradient of the loss with respect to $W$. We focus on this weight because it is the cause of the phenomenon known as the vanishing and exploding gradient problem.  
  
This problem manifests as the gradients of the loss approaching either zero or infinity, making the network hard to train. To understand why this happens, consider the equation of the SimpleRNN we saw earlier; the hidden state $h_t$ is dependent on $h_{t-1}$, which in turn is dependent on $h_{t-2}$, and so on:
  

$$\frac {\partial L}{\partial W} = \sum_t \frac {\partial L_t}{\partial W}$$

Let us now see what happens to this gradient at time step $t=3$. By the chain rule, the gradient of the loss with respect to $W$ can be decomposed into a product of three sub-gradients. The gradient of the hidden state $h_3$ with respect to $W$ can be further decomposed as a sum over the earlier time steps $t$, each term containing the gradient of $h_3$ with respect to $h_t$. Finally, each gradient of $h_3$ with respect to an earlier hidden state $h_t$ can be decomposed as the product of the gradients of each hidden state with respect to the one before it:

$$
\begin{aligned}
\frac{\partial L_3}{\partial W}
&= \frac{\partial L_3}{\partial \hat y_3}
\frac{\partial \hat y_3}{\partial h_3}
\frac{\partial h_3}{\partial W} \\
&= \sum_{t=0}^{3}
\frac{\partial L_3}{\partial \hat y_3}
\frac{\partial \hat y_3}{\partial h_3}
\frac{\partial h_3}{\partial h_t}
\frac{\partial h_t}{\partial W} \\
&= \sum_{t=0}^{3}
\frac{\partial L_3}{\partial \hat y_3}
\frac{\partial \hat y_3}{\partial h_3}
\left(\prod_{j=t+1}^{3}
\frac{\partial h_j}{\partial h_{j-1}}\right)
\frac{\partial h_t}{\partial W}
\end{aligned}
$$

Similar calculations are done to compute the gradients of the other losses $L_0$ through $L_4$ with respect to $W$, and these are summed into the gradient update for $W$. We will not explore the math further in this book, but this WildML blog post (http://www.wildml.com/2015/10/recurrent-neural-networks-tutorial-part-3-backpropagation-through-time-and-vanishing-gradients/) has a very good explanation of BPTT, including a more detailed derivation of the math behind the process.

### Vanishing and exploding gradients

The reason BPTT is particularly sensitive to the problem of vanishing and exploding gradients comes from the product part of the expression representing the final formulation of the gradient of the loss with respect to $W$. Consider the case where the individual gradients of a hidden state with respect to the previous one are less than 1.  
  
As we backpropagate across multiple time steps, the product of gradients gets smaller and smaller, ultimately leading to the problem of vanishing gradients. Similarly, if the gradients are larger than 1, the products get larger and larger, and ultimately lead to the problem of exploding gradients.  
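
To see how quickly the product shrinks, consider a scalar RNN $h_t = \tanh(w h_{t-1} + u x_t)$, for which $\partial h_j / \partial h_{j-1} = w(1 - h_j^2)$. The sketch below is an illustration under that simplifying assumption (a single hidden unit and hand-picked values for `w`, `u`, and the inputs); the function name `hidden_state_gradients` is hypothetical.

```python
import math

def hidden_state_gradients(w, u, inputs, h0=0.0):
    """Return d h_T / d h_t for t = 1..T in the scalar RNN h_t = tanh(w*h_{t-1} + u*x_t)."""
    hs = [h0]
    for x in inputs:
        hs.append(math.tanh(w * hs[-1] + u * x))
    # local derivatives d h_j / d h_{j-1} = w * (1 - h_j^2), for j = 1..T
    local = [w * (1.0 - h * h) for h in hs[1:]]
    grads = []
    prod = 1.0
    # walk backwards from the last time step, accumulating the product
    for d in reversed(local):
        grads.append(prod)
        prod *= d
    return list(reversed(grads))

inputs = [1.0] * 20
grads = hidden_state_gradients(w=0.5, u=1.0, inputs=inputs)
print("gradient w.r.t. the last state: ", grads[-1])
print("gradient w.r.t. the first state:", grads[0])
```

Each factor here has magnitude at most $|w| = 0.5$, so the gradient with respect to the first state is bounded by $0.5^{19}$. With factors larger than 1 the same product would instead grow geometrically.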
  
Of the two, exploding gradients are more easily detectable. The gradients will become very large and turn into **Not a Number (NaN)**, and the training process will crash. Exploding gradients can be controlled by clipping them at a predefined threshold. TensorFlow 2.0 allows you to clip gradients using the **`clipvalue`** or **`clipnorm`** parameter during optimizer construction, or by explicitly clipping gradients using **`tf.clip_by_value`**.  
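
The arithmetic behind the two clipping strategies is simple. The NumPy sketch below illustrates it on a single gradient vector; it is not TensorFlow's implementation, and the helper names are made up for this example. Clipping by value bounds each component independently, while clipping by norm rescales the whole vector and so preserves its direction.

```python
import numpy as np

def clip_by_value(grad, clip_min, clip_max):
    # element-wise clipping: every entry is forced into [clip_min, clip_max]
    return np.clip(grad, clip_min, clip_max)

def clip_by_norm(grad, max_norm):
    # rescale the whole gradient if its L2 norm exceeds max_norm
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        return grad * (max_norm / norm)
    return grad

g = np.array([3.0, -4.0])           # L2 norm is 5
print(clip_by_value(g, -1.0, 1.0))  # [ 1. -1.]
print(clip_by_norm(g, 1.0))         # [ 0.6 -0.8]
```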
  
The effect of vanishing gradients is that gradients from time steps that are far away do not contribute anything to the learning process, so the RNN ends up not learning any long-range dependencies. While there are a few approaches to minimizing the problem, such as proper initialization of the $W$ matrix, more aggressive regularization, using ReLU instead of **`tanh`** activation, and pretraining the layers using unsupervised methods, the most popular solution is to use LSTM or GRU architectures, each of which will be explained shortly. These architectures have been designed to deal with vanishing gradients and learn long-term dependencies more effectively.

## RNN Cell variants

### Long short-term memory (LSTM)

The LSTM is a variant of the SimpleRNN cell that is capable of learning long-term dependencies. We have seen how the SimpleRNN combines the hidden state from the previous time step and the current input through a **`tanh`** layer to implement recurrence. LSTMs also implement recurrence in a similar way, but instead of a single **`tanh`** layer, there are four layers interacting in a very specific way. The following diagram illustrates the transformations that are applied to the hidden state at time step $t$.  
  
The diagram looks complicated, but let us look at it component by component. The line across the top of the diagram is the cell state $c$, representing the internal memory of the unit.  
  
The line across the bottom is the hidden state $h$, and the $i$, $f$, $o$ and $g$ gates are the mechanisms by which the LSTM works around the vanishing gradient problem. During training, the LSTM learns the parameters for these gates:

![image.png](attachment:image.png)

An alternative way to think about how these gates work inside an LSTM cell is to consider the equations for the cell. These equations describe how the value of the hidden state $h_t$ at time $t$ is calculated from the value of the hidden state $h_{t-1}$ at the previous time step.

The set of equations representing an LSTM are shown as follows:  
  
$$
\begin{aligned}
i &= \sigma (W_i h_{t-1} + U_i x_t + V_i c_{t-1}) \\
f &= \sigma (W_f h_{t-1} + U_f x_t + V_f c_{t-1}) \\
o &= \sigma (W_o h_{t-1} + U_o x_t + V_o c_{t-1}) \\
g &= \tanh(W_g h_{t-1} + U_g x_t) \\
c_t &= (f * c_{t-1}) + (g * i) \\
h_t &= \tanh(c_t) * o
\end{aligned}
$$
  
Here $i$, $f$, and $o$ are the input, forget, and output gates. They are computed using the same equations but with different parameter matrices: $W_i$, $U_i$, $V_i$ for the input gate, $W_f$, $U_f$, $V_f$ for the forget gate, and $W_o$, $U_o$, $V_o$ for the output gate. The sigmoid function modulates the output of these gates between 0 and 1, so the output vectors produced can be multiplied element-wise with another vector to define how much of the second vector can pass through the first one.  
  
The forget gate defines how much of the previous cell state $c_{t-1}$ you want to allow to pass through. The input gate defines how much of the newly computed state for the current input $x_t$ you want to let through, and the output gate defines how much of the internal state you want to expose to the next layer. The internal hidden state $g$ is computed based on the current input $x_t$ and the previous hidden state $h_{t-1}$. Notice that the equation for $g$ is identical to that for the SimpleRNN, except that in this case we will modulate the output by the output of the input gate $i$.  
  
Given $i$, $f$, $o$, and $g$, we can now calculate the cell state $c_t$ at time $t$ as the cell state $c_{t-1}$ at time ($t-1$) multiplied by the value of the forget gate $f$, plus the state $g$ multiplied by the input gate $i$. This is basically a way to combine the previous memory and the new input: setting the forget gate to 0 ignores the old memory and setting the input gate to 0 ignores the newly computed state. Finally, the hidden state $h_t$ at time $t$ is computed as the **`tanh`** of the memory $c_t$ at time $t$, multiplied by the output gate $o$.  
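
The LSTM equations above translate almost line by line into NumPy. The following is a minimal sketch that follows the equations as written in this section (including the $V$ terms on $c_{t-1}$ and no bias terms); it is not the Keras implementation, and the sizes, the random weights, and the names `lstm_step` and `p` are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step; p maps names such as 'W_i' to weight matrices."""
    i = sigmoid(p["W_i"] @ h_prev + p["U_i"] @ x_t + p["V_i"] @ c_prev)  # input gate
    f = sigmoid(p["W_f"] @ h_prev + p["U_f"] @ x_t + p["V_f"] @ c_prev)  # forget gate
    o = sigmoid(p["W_o"] @ h_prev + p["U_o"] @ x_t + p["V_o"] @ c_prev)  # output gate
    g = np.tanh(p["W_g"] @ h_prev + p["U_g"] @ x_t)                      # candidate state
    c_t = f * c_prev + g * i   # keep part of the old memory, add part of the new
    h_t = np.tanh(c_t) * o     # expose part of the memory as the hidden state
    return h_t, c_t

# illustrative sizes (assumptions): 3 input features, 4 hidden units
hidden_dim, input_dim = 4, 3
rng = np.random.default_rng(0)
p = {}
for gate in "ifog":
    p["W_" + gate] = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
    p["U_" + gate] = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
for gate in "ifo":
    p["V_" + gate] = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))

h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):
    h, c = lstm_step(x_t, h, c, p)
print(h.shape, c.shape)  # (4,) (4,)
```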
  
If you would like to learn more about LSTMs, please take a look at the WildML RNN tutorial (http://www.wildml.com/2015/10/recurrent-neural-network-tutorial-part-4-implementing-a-grulstm-rnn-with-python-and-theano/) and Christopher Olah's blog post (https://colah.github.io/posts/2015-08-Understanding-LSTMs/). The first covers LSTM in somewhat greater detail and the second takes you step by step through the computations in a very visual way.

### Gated recurrent unit (GRU)

The GRU is a variant of the LSTM and was introduced by Cho, et al. It retains the LSTM's resistance to the vanishing gradient problem, but its internal structure is simpler, and it is therefore faster to train, since fewer computations are needed to make updates to its hidden state.  
  
Instead of the input ($i$), forget ($f$), and output ($o$) gates in the LSTM cell, the GRU cell has two gates, an update gate $z$ and a reset gate $r$. The update gate defines how much previous memory to keep around, and the reset gate defines how to combine the new input with the previous memory. There is no persistent cell state distinct from the hidden state as there is in the LSTM.  
  
The GRU cell defines the computation of the hidden state $h_t$ at time $t$ from the hidden state $h_{t-1}$ at the previous time step using the following set of equations:  
  
$$
\begin{aligned}
z &= \sigma (W_z h_{t-1} + U_z x_t) \\
r &= \sigma (W_r h_{t-1} + U_r x_t) \\
c &= \tanh(W_c (h_{t-1} * r) + U_c x_t) \\
h_t &= (z * c) + ((1-z) * h_{t-1})
\end{aligned}
$$  
  
The outputs of the update gate $z$ and the reset gate $r$ are both computed using a combination of the previous hidden state $h_{t-1}$ and the current input $x_t$. The sigmoid function modulates the output of these functions between 0 and 1. The cell state $c$ is computed as a function of the output of the reset gate $r$ and input $x_t$. Finally, the hidden state $h_t$ at time $t$ is computed as a function of the cell state $c$ and the previous hidden state $h_{t-1}$. The parameters $W_z$, $U_z$, $W_r$, $U_r$, and $W_c$, $U_c$ are learned during training.  
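
As with the LSTM, the GRU equations can be followed step by step in NumPy. This is a sketch of the equations exactly as given in this section (no bias terms), not of the Keras `GRU` layer; the sizes, random weights, and the names `gru_step` and `p` are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, p):
    """One GRU step; p maps names such as 'W_z' to weight matrices."""
    z = sigmoid(p["W_z"] @ h_prev + p["U_z"] @ x_t)        # update gate
    r = sigmoid(p["W_r"] @ h_prev + p["U_r"] @ x_t)        # reset gate
    c = np.tanh(p["W_c"] @ (h_prev * r) + p["U_c"] @ x_t)  # candidate state
    # blend the candidate state with the previous hidden state
    return z * c + (1.0 - z) * h_prev

# illustrative sizes (assumptions): 3 input features, 4 hidden units
hidden_dim, input_dim = 4, 3
rng = np.random.default_rng(0)
p = {}
for name in "zrc":
    p["W_" + name] = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
    p["U_" + name] = rng.normal(scale=0.1, size=(hidden_dim, input_dim))

h = np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):
    h = gru_step(x_t, h, p)
print(h.shape)  # (4,)
```

Compared with the LSTM sketch, there are three pairs of weight matrices instead of four, and only one state vector is carried from step to step.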

### Peephole LSTM

The peephole LSTM is an LSTM variant. It adds "peepholes" to the input, forget, and output gates, so they can see the previous cell state $c_{t-1}$.

## RNN variants

In this section, we will look at a couple of variations on the basic RNN architecture that can provide performance improvements in some specific circumstances. Note that these strategies can be applied for different kinds of RNN cells, as well as for different RNN topologies, which we will learn about later.

### Bidirectional RNNs

We have seen how, at any given time step $t$, the output of the RNN is dependent on the outputs at all previous time steps. However, it is entirely possible that the output is also dependent on the future outputs. This is especially true for applications such as natural language processing where the attributes of the word or phrase we are trying to predict may be dependent on the context given by the entire enclosing sentence, not just the words that came before it.  
  
This problem can be solved using a bidirectional RNN (for example, a bidirectional LSTM), which is essentially two RNNs stacked on top of each other, one reading the input from left to right and the other reading the input from right to left. The output at each time step will be based on the hidden states of both RNNs. Bidirectional RNNs allow the network to place equal emphasis on the beginning and end of the sequence, and typically result in performance improvements.
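
The idea can be sketched with two simple `tanh` RNNs in NumPy. This is only an illustration of the data flow, with made-up sizes, random weights, and hypothetical helper names; in Keras, the `tf.keras.layers.Bidirectional` wrapper provides this behavior for recurrent layers.

```python
import numpy as np

def run_rnn(xs, U, W, h0):
    """Return the hidden state at every time step of a simple tanh RNN."""
    h, hs = h0, []
    for x in xs:
        h = np.tanh(W @ h + U @ x)
        hs.append(h)
    return np.stack(hs)

def bidirectional_rnn(xs, fwd, bwd, hidden_dim):
    h0 = np.zeros(hidden_dim)
    forward_states = run_rnn(xs, fwd["U"], fwd["W"], h0)
    # read the sequence right to left, then flip the result back so that
    # index t of both halves refers to the same input position
    backward_states = run_rnn(xs[::-1], bwd["U"], bwd["W"], h0)[::-1]
    return np.concatenate([forward_states, backward_states], axis=-1)

# illustrative sizes (assumptions): 3 input features, 4 hidden units per direction
input_dim, hidden_dim = 3, 4
rng = np.random.default_rng(0)
fwd = {"U": rng.normal(scale=0.5, size=(hidden_dim, input_dim)),
       "W": rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))}
bwd = {"U": rng.normal(scale=0.5, size=(hidden_dim, input_dim)),
       "W": rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))}
xs = rng.normal(size=(6, input_dim))

out = bidirectional_rnn(xs, fwd, bwd, hidden_dim)
print(out.shape)  # (6, 8)
```

The first four values at each time step summarize the inputs up to that step, and the last four summarize the inputs from that step to the end of the sequence.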

### Stateful RNNs

RNNs can also be stateful, which means that they can maintain state across batches during training. That is, the hidden state computed for a batch of training data will be used as the initial hidden state for the next batch of training data. Setting an RNN to be stateful means that it can build state across its training sequence and even maintain that state when doing predictions. The benefits of using stateful RNNs are smaller network size and/or lower training times. The disadvantage is that we are now responsible for training the network with a batch size that reflects the periodicity of the data and resetting the state after each epoch. In addition, data should not be shuffled while training the network since the order in which the data is presented is relevant for stateful networks.  
  
To set an RNN layer as stateful, set the named parameter **`stateful`** to **`True`**. In our example of a one-to-many topology for learning to generate text, we provide an example of using a stateful RNN. Here, we train using data consisting of contiguous text slices, so setting the LSTM to stateful means that the hidden state generated from the previous text chunk is reused for the current text chunk.
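
The effect of carrying state across chunks can be illustrated outside Keras. In the NumPy sketch below, a long sequence is processed in contiguous chunks of two steps; the weights, the inputs, and the helper name `run_rnn` are hand-picked assumptions. When the final state of one chunk is used as the initial state of the next, the result matches processing the whole sequence at once, whereas resetting the state for each chunk loses the earlier context.

```python
import numpy as np

def run_rnn(xs, U, W, h0):
    """Return the final hidden state of a simple tanh RNN."""
    h = h0
    for x in xs:
        h = np.tanh(W @ h + U @ x)
    return h

# hand-picked weights and inputs (assumptions for illustration)
W = np.array([[0.5, 0.0], [0.0, 0.5]])
U = np.array([[1.0], [-1.0]])
xs = np.linspace(0.1, 0.8, 8).reshape(8, 1)
zeros = np.zeros(2)

# reference: the whole sequence in one pass
full = run_rnn(xs, U, W, zeros)

# "stateful": contiguous chunks, state carried from one chunk to the next
h = zeros
for start in range(0, 8, 2):
    h = run_rnn(xs[start:start + 2], U, W, h)

# "stateless": the state is reset before the last chunk
stateless = run_rnn(xs[6:], U, W, zeros)

print(np.allclose(h, full), np.allclose(stateless, full))  # True False
```

This also shows why the chunks must be presented in order and not shuffled: the carried state is only meaningful if each chunk really continues the previous one.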

## RNN topologies

We have seen examples of how MLP and CNN architectures can be composed to form more complex networks. RNNs offer yet another degree of freedom, in that they allow sequence input and output. This means that RNN cells can be arranged in different ways to build networks that are adapted to solve different types of problems. *Figure 4* shows five different configurations of inputs, hidden layers, and outputs, represented by red, green, and blue boxes respectively:

![image.png](attachment:image.png)

Of these, the first one (one-to-one) is not interesting from a sequence processing point of view, since it can be implemented as a simple Dense network with one input and one output.  
  
The one-to-many case has a single input and outputs a sequence. An example of such a network might be a network that can generate text tags from images [6], containing short text descriptions of different aspects of the images. Such a network would be trained with image input and labeled sequences of text representing the image tags.  
  
The many-to-one case is the reverse; it takes a sequence of tensors as input but outputs a single tensor. An example of such a network would be a sentiment analysis network [7], which takes as input a block of text such as a movie review and outputs a single sentiment value.  
  
The many-to-many use case comes in two flavors. The first one is more popular and is better known as the seq2seq model. In this model, a sequence is read in and produces a context vector representing the input sequence, which is used to generate the output sequence.  
  
The topology has been used with great success in the field of machine translation, as well as problems that can be reframed as machine translation problems. Real life examples of the former can be found in [8,9], and an example of the latter is described in [10].  
  
The second many-to-many type has an output cell corresponding to each input cell. This kind of network is suited for use cases where there is a 1:1 correspondence between the input and output, such as time series. The major difference between this model and the seq2seq model is that the input does not have to be completely encoded before the decoding process begins.  
  
In the next three sections, we provide examples of a one-to-many network that learns to generate text, a many-to-one network that does sentiment analysis, and a many-to-many network of the second type, which predicts **Part-of-speech (POS)** tags for words in a sentence.

## Example - One-to-Many - learning to generate text

RNNs have been used extensively by the **Natural Language Processing (NLP)** community for various applications. One such application is to build language models. A language model is a model that allows us to predict the probability of a word in a text given the previous words. Language models are important for various higher-level tasks such as machine translation, spelling correction, and so on.  
  
The ability of a language model to predict the next word in a sequence makes it a generative model that allows us to generate text by sampling from the output probabilities of different words in the vocabulary. The training data is a sequence of words, and the label is the word appearing at the next time step in the sequence.  
  
For our example, we will train a character-based RNN on the text of the children's stories "Alice in Wonderland" and its sequel "Through the Looking Glass" by Lewis Carroll. We have chosen to build a character-based model because it has a smaller vocabulary and trains quicker. The idea is the same as training and using a word-based language model, except we will use characters instead of words. Once trained, the model can be used to generate some text in the same style.  
  
The data for our example will come from the plain texts of two novels from the Project Gutenberg website [36]. The input to the network is a sequence of 100 characters, and the corresponding output is another sequence of 100 characters, offset from the input by 1 position.  
  
That is, if the input is the sequence [$c_1,\, c_2,\, ...,\,c_n$], the output will be [$c_2,\, c_3,\, ...,\, c_{n+1}$]. We will train the network for 50 epochs, and at the end of every 10 epochs, we will generate a fixed-size sequence of characters starting with a standard prefix. In the following example, we have used the prefix "Alice", the name of the protagonist in our novels.

In [1]:
import os
import numpy as np
import re
import shutil
import tensorflow as tf

In [2]:
DATA_DIR = "data" # where you download the source code
CHECKPOINT_DIR = os.path.join(DATA_DIR, "checkpoints") # where we will save the weights

In [3]:
# download text data and preprocessing
def download_and_read(urls):
    texts = []
    for i, url in enumerate(urls):
        # check to see whether the file is already downloaded, and if not download a file
        p = tf.keras.utils.get_file('ex1-{:d}.txt'.format(i), url, cache_dir=".")
        with open(p, 'r', encoding='utf8') as f:
            text = f.read()
        
        # remove byte order mark
        text = text.replace('\ufeff', '')
        
        # remove newlines
        text = text.replace('\n', ' ')
        text = re.sub(r'\s+', ' ', text)
        
        # add it to the list
        texts.extend(text)
    return texts

In [4]:
# download and read into local data structure (list of chars)
texts = download_and_read([
    "http://www.gutenberg.org/cache/epub/28885/pg28885.txt",
    "https://www.gutenberg.org/files/12/12-0.txt"
])

Next, we will create our vocabulary. In our case, our vocabulary contains 90 unique characters, composed of uppercase and lowercase letters, numbers, and special characters. We also create some mapping dictionaries to convert each vocabulary character to a unique integer and vice versa. As noted earlier, the input and output of the network are sequences of characters. However, the actual input and output of the network are sequences of integers, and we will use these mapping dictionaries to handle this conversion.

In [5]:
# create the vocabulary
vocab = sorted(set(texts))
print('vocab size: {:d}'.format(len(vocab)))

# create mapping from vocab chars to ints
char2idx = {c:i  for i, c in enumerate(vocab)}
idx2char = {i:c for c, i in char2idx.items()}

vocab size: 90


The next step is to use these mapping dictionaries to convert our character sequence input into an integer sequence, and then into a TensorFlow dataset. Each of our sequences is going to be 100 characters long, with the output being offset from the input by 1 character position. We first batch the dataset into slices of 101 characters, then apply the **`split_train_labels()`** function to every element of the dataset to create our sequences dataset, which is a dataset of tuples of two elements, each element of the tuple being an integer vector of size 100 (**`tf.int64`** or **`tf.int32`**, depending on the platform's default NumPy integer type). We then shuffle these sequences and create batches of 64 tuples each for input to our network. Each element of the dataset is now a tuple consisting of a pair of integer matrices, each of size (64, 100).

In [6]:
# numericize the texts
texts_as_ints = np.array([char2idx[c] for c in texts])
data = tf.data.Dataset.from_tensor_slices(texts_as_ints)

In [7]:
# peek at the first 10 elements (one integer per character)
for element in data.take(10):
    print(element)

tf.Tensor(44, shape=(), dtype=int32)
tf.Tensor(75, shape=(), dtype=int32)
tf.Tensor(72, shape=(), dtype=int32)
tf.Tensor(67, shape=(), dtype=int32)
tf.Tensor(62, shape=(), dtype=int32)
tf.Tensor(60, shape=(), dtype=int32)
tf.Tensor(77, shape=(), dtype=int32)
tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(35, shape=(), dtype=int32)
tf.Tensor(78, shape=(), dtype=int32)


In [8]:
# number of characters to show before asking for prediction
# sequence : [None, 100]
seq_length = 100
sequences = data.batch(seq_length + 1, drop_remainder=True)

def split_train_labels(sequence):
    input_seq = sequence[0:-1]
    output_seq = sequence[1:]
    return input_seq, output_seq

sequences = sequences.map(split_train_labels)

In [9]:
# peek at the first 5 (input, label) pairs
for element in sequences.take(5):
    print(element)

(<tf.Tensor: shape=(100,), dtype=int32, numpy=
array([44, 75, 72, 67, 62, 60, 77,  0, 35, 78, 77, 62, 71, 59, 62, 75, 64,
        7, 76,  0, 29, 69, 66, 60, 62,  7, 76,  0, 29, 61, 79, 62, 71, 77,
       78, 75, 62, 76,  0, 66, 71,  0, 51, 72, 71, 61, 62, 75, 69, 58, 71,
       61, 11,  0, 59, 82,  0, 40, 62, 80, 66, 76,  0, 31, 58, 75, 75, 72,
       69, 69,  0, 48, 65, 66, 76,  0, 62, 30, 72, 72, 68,  0, 66, 76,  0,
       63, 72, 75,  0, 77, 65, 62,  0, 78, 76, 62,  0, 72, 63,  0])>, <tf.Tensor: shape=(100,), dtype=int32, numpy=
array([75, 72, 67, 62, 60, 77,  0, 35, 78, 77, 62, 71, 59, 62, 75, 64,  7,
       76,  0, 29, 69, 66, 60, 62,  7, 76,  0, 29, 61, 79, 62, 71, 77, 78,
       75, 62, 76,  0, 66, 71,  0, 51, 72, 71, 61, 62, 75, 69, 58, 71, 61,
       11,  0, 59, 82,  0, 40, 62, 80, 66, 76,  0, 31, 58, 75, 75, 72, 69,
       69,  0, 48, 65, 66, 76,  0, 62, 30, 72, 72, 68,  0, 66, 76,  0, 63,
       72, 75,  0, 77, 65, 62,  0, 78, 76, 62,  0, 72, 63,  0, 58])>)
(<tf.Tensor: shap

In [10]:
# set up for training
# batches : [None, 64, 100]
batch_size = 64
steps_per_epoch = len(texts) // seq_length // batch_size
dataset = sequences.shuffle(10000).batch(batch_size, drop_remainder=True)

We are now ready to define our network. As before, we define our network as a subclass of **`tf.keras.Model`** as shown next. The network is fairly simple; it takes as input a sequence of integers of size 100 (**`num_timesteps`**) and passes them through an Embedding layer so that each integer in the sequence is converted to a vector of size 256 (**`embedding_dim`**). So, assuming a batch size of 64, for our input sequence of size (64, 100), the output of the Embedding layer is a matrix of shape (64, 100, 256).  
  
The next layer is the RNN layer with 100 time steps. The implementation of RNN chosen is a GRU. This GRU layer takes, at each of its time steps, a vector of size (256,) and outputs a vector whose size equals the number of GRU units. Note that in the code that follows, the number of units (the first argument of the **`GRU`** constructor) is set to **`num_timesteps`**, so the output at each time step has shape (100,), which is consistent with the parameter counts in the model summary. Note also that the RNN is stateful, which means that the hidden state output from the previous training batch will be used as the initial state for the current batch. The **`return_sequences=True`** flag also indicates that the RNN will output at each of the time steps rather than an aggregate output at the last time step.  
  
Finally, each of the time steps will emit a vector of shape (100,) into a Dense layer that outputs a vector of shape (90,) (**`vocab_size`**). The output from this layer will be a tensor of shape (64, 100, 90). Each position in the output vector corresponds to a character in our vocabulary, and the values are unnormalized scores (logits) that a softmax converts into the probability of that character occurring at that output position.

In [11]:
class CharGenModel(tf.keras.Model):
    def __init__(self, vocab_size, num_timesteps, embedding_dim, **kwargs):
        super(CharGenModel, self).__init__(**kwargs)
        self.embedding_layer = tf.keras.layers.Embedding(vocab_size, embedding_dim)
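        # the first argument of GRU is the number of units (the output size at
        # each time step); here it is set to num_timesteps, that is, 100 units
        self.rnn_layer = tf.keras.layers.GRU(num_timesteps,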
                                            recurrent_initializer='glorot_uniform',
                                            recurrent_activation='sigmoid',
                                            stateful=True,
                                            return_sequences=True)
        self.dense_layer = tf.keras.layers.Dense(vocab_size)
        
    def call(self, x):
        x = self.embedding_layer(x)
        x = self.rnn_layer(x)
        x = self.dense_layer(x)
        return x

In [12]:
vocab_size = len(vocab)
embedding_dim = 256

In [13]:
model = CharGenModel(vocab_size, seq_length, embedding_dim)
model.build(input_shape=(batch_size, seq_length))

In [14]:
model.summary()

Model: "char_gen_model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        multiple                  23040     
_________________________________________________________________
gru (GRU)                    multiple                  107400    
_________________________________________________________________
dense (Dense)                multiple                  9090      
Total params: 139,530
Trainable params: 139,530
Non-trainable params: 0
_________________________________________________________________
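
The parameter counts in the summary can be reproduced by hand. The sketch below assumes the Keras `GRU` default of `reset_after=True`, under which each of the three weight sets carries two bias vectors; the helper function names are made up for this check.

```python
def embedding_params(vocab_size, embedding_dim):
    # one embedding vector per vocabulary entry
    return vocab_size * embedding_dim

def gru_params(input_dim, units):
    # three weight sets (update gate, reset gate, candidate state), each with an
    # input kernel, a recurrent kernel and two bias vectors (reset_after=True)
    return 3 * (units * input_dim + units * units + 2 * units)

def dense_params(input_dim, output_dim):
    # weight matrix plus one bias per output
    return input_dim * output_dim + output_dim

vocab_size, embedding_dim, units = 90, 256, 100
counts = [
    embedding_params(vocab_size, embedding_dim),
    gru_params(embedding_dim, units),
    dense_params(units, vocab_size),
]
print(counts, sum(counts))  # [23040, 107400, 9090] 139530
```

The GRU count of 107,400 corresponds to 100 units, not 1,024, which confirms that the value passed as the first argument of the `GRU` constructor determines the layer's output size.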


Next we define a loss function and compile our model. We will use the sparse categorical cross-entropy as our loss function because it is the standard loss to use when the labels are integer class indices rather than one-hot vectors. Since the Dense layer has no activation and therefore outputs logits, the loss is computed with **`from_logits=True`**. For the optimizer, we will choose the Adam optimizer:

In [15]:
def loss(labels, predictions):
    return tf.losses.sparse_categorical_crossentropy(labels, predictions, from_logits=True)

In [16]:
model.compile(optimizer=tf.optimizers.Adam(), loss=loss)
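
To make the role of `from_logits=True` concrete, here is what the loss computes for a single position, written in plain Python. This is a sketch of the underlying formula (the negative log of the softmax probability assigned to the true label), not TensorFlow's code; the function name and the example logits are assumptions.

```python
import math

def sparse_ce_from_logits(label, logits):
    # log-softmax computed stably: log p_k = z_k - logsumexp(z)
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return -(logits[label] - log_sum_exp)

logits = [2.0, 1.0, 0.1]  # unnormalized scores for a 3-character vocabulary
loss_if_label_0 = sparse_ce_from_logits(0, logits)
loss_if_label_2 = sparse_ce_from_logits(2, logits)
print(loss_if_label_0, loss_if_label_2)
```

The loss is small when the true label has the highest logit and grows as the model assigns it a lower score. The label is passed as an integer index, which is what "sparse" refers to.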

Normally, the character at each position of the output is found by computing the argmax of the vector at that position, that is, the character corresponding to the maximum probability value. This is known as greedy search. In the case of language models where the output of one time step becomes the input to the next time step, this can lead to repetitive output. The two most common approaches to overcome this problem are either to sample the output randomly or to use beam search, which keeps the $k$ most probable candidates at each time step. Here we will use the **`tf.random.categorical()`** function to sample the output randomly. The following function takes a string as a prefix and uses it to generate a string whose length is specified by **`num_chars_to_generate`**. The temperature parameter is used to control the randomness of the predictions. Lower values will create a more predictable output.  
  
The logic follows a predictable pattern. We convert the sequence of characters in our **`prefix_string`** into a sequence of integers, then **`expand_dims`** to add a batch dimension so the input can be passed into our model. We then reset the state of the model. This is needed because our model is stateful, and we don't want the hidden state for the first timestep in our prediction run to be carried over from the one computed during training. We then run the input through our model and get back a prediction. This is the vector of shape (90,) representing the probabilities of each character in the vocabulary appearing at the next time step. We then reshape the prediction by removing the batch dimension and dividing by the temperature, then randomly sample from the vector. We then set our prediction as the input to the next time step. We repeat this for the number of characters we need to generate, converting each prediction back to character form and accumulating in a list, and returning the list at the end of the loop:

In [17]:
def generate_text(model, prefix_string, char2idx, idx2char, 
                 num_chars_to_generate=1000, temperature=1.0):
    input = [char2idx[s] for s in prefix_string]
    input = tf.expand_dims(input, 0)
    
    text_generated = []
    model.reset_states()
    
    for i in range(num_chars_to_generate):
        preds = model(input)
        # remove the batch dimension (of size 1), then scale by the temperature
        preds = tf.squeeze(preds, 0) / temperature
        # predict char returned by model
        pred_id = tf.random.categorical(preds, num_samples=1)[-1, 0].numpy()
        text_generated.append(idx2char[pred_id])
        # pass the prediction as the next input to the model
        input = tf.expand_dims([pred_id], 0)
        
    return prefix_string + "".join(text_generated)

In [18]:
outs = np.array([[0.1,0.2,0.3,0.4,0.5,0], [0.1,0.2,0.3,0.4,0.5,0]])
tf.random.categorical(outs, num_samples=1)[-1, 0].numpy()

3
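
The effect of the temperature can be seen by looking at the distribution that is being sampled. `tf.random.categorical()` interprets its input as logits, so dividing the logits by the temperature is equivalent to sampling from the softmax of the scaled logits. The plain Python sketch below uses made-up logits and a hypothetical helper name to show how the distribution changes.

```python
import math

def softmax_with_temperature(logits, temperature):
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
for temperature in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

A temperature below 1 sharpens the distribution, so the most likely character is chosen more often, while a temperature above 1 flattens it and produces more varied output.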

Finally, we are ready to run our training and evaluation loop. As mentioned earlier, we will train our network for 50 epochs, and after every 10 epochs, we will try to generate some text with the model trained so far. Our prefix at each stage is the string "Alice". Notice that in order to accommodate a single string prefix, we save the weights after every 10 epochs and build a separate generative model with these weights but with an input shape with a batch size of 1. Here is the code to do this:

In [19]:
num_epochs = 50
for i in range(num_epochs // 10):
    # When passing an infinitely repeating dataset, you must specify the steps_per_epoch argument
    # steps_per_epoch = len(texts) // seq_length // batch_size = 345739 // 100 // 64 = 54
    model.fit(dataset.repeat(), epochs=10, steps_per_epoch=steps_per_epoch)
    
    checkpoint_file = os.path.join(CHECKPOINT_DIR, 'model_epoch_{:d}'.format(i+1))
    model.save_weights(checkpoint_file)
    
    # create generative model using the trained model so far
    gen_model = CharGenModel(vocab_size, seq_length, embedding_dim)
    gen_model.load_weights(checkpoint_file)
    gen_model.build(input_shape=(1, seq_length))
    print('after epoch: {:d}'.format(i+1))
    print(generate_text(gen_model, "Alice ", char2idx, idx2char))
    print("---")

Train for 54 steps
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
after epoch: 1
Alice Lingter," doth----Boud thoug curion, and the ardo trimect. mice wasthe Thinks a cac, you greavther do set oter ase hadled sare, I dry had reazlblefumbour! None hery might a'le torest thef woind, a save assming to bence Theom. "It and the sall leat't Byem wry aich tight though: ‘I sas she; _the beat the firk---saind in ipull Pread Lith ming dopee the Rxptishs the sulding albock thundy ryois.’ Ficr ser othe pear the ho moust. I'ch feey a Yout or: in of jut Quincind it voleghen arself comtle for armorice with said at you gas an rawled leanded a knicut. ‘I’s so din thing so nowiss coean FONE she wastle one surtte?’ "It "bet very lylised." "I_ ha’t promed dor sheen damall explish furstine.’ ‘I suppiers ligen telln! Thea‘But inice wnid LAUCNTENATS ‘I nokes om the begilfe quittight makingadedly mory non’tplid wither said thimere, inty all cumpl

after epoch: 3
Alice appair, I'S Alice, so-dn't Alice know.’ ‘But to Carptenst?’ she said at able---’ Jutter, even well, ‘Lo,’ said the experformoming eBooked her thoust or and mink on jo Cater; Sent, non--wossued pix like in all thim to all you had about chome ouf her by that’s speary. CHAD me edgenot!’ "that’s all think me to out got the tares)d.’ ‘Allice forst much reailing. "ever sigven, the have as the prit, and made his lighttrattens ansterdrauners,’ she't she you, and-buting as sowern, Alice very might doess this you a queennif." "Mever fally amove a fove then my to her here. Thery Pige a don’t leave!’ ‘In a little great be goenough, disten, if Alice sat so allily is! I'm gardened said, it voleet repeated fire it,’ he think it to--rothat coaceaplaghasie of a might down only set into roud side, and that and shop her quartly of comptions of jukn the both she begance up Lichesen nents to one--thought a gits on dish. ‘I very you smile!" said Tiustser hat-very-like. The own are behis

after epoch: 5
Alice to only a good again, she said, gree! Let aw for withour. "When them? I had too-Some was stall douft I doke-borge isigethation: _hoss by her I can there's gener of agerringoc round the March eyes-- that’s not asks pa.neys longs. ‘It's quite fings as be in mise! They cave you?”’ ‘I shall it get inst first? Roow itsual to it, ‘a stophave had look two go and said, still she should Frow. Ah, voice I gave her she hat this found the Gaten among to ensays, of the kince a gone so far tleAM Knithing atain?’ she said anguned, and Alice off your license clots. The exectle, a fes twowdun. And you’ve beenog the copy, as she would know she beginning of the listering to new in as she can-very Gutenberg-tm electronices are and volmishes quite for it washe of the pats daw down thing, eair with pearn and _you_, just vilenth, they were think, very clears, had best warche aloud alought his pay it with I cebre. ‘As musing for the pogs, the bearnly 100/8115 VE RITI MY!’ she WOUMIND. LAT

Generating the next character or next word in the text isn't the only thing you can do with this sort of model. Similar models have been built to make stock price predictions [3] or generate classical music [4]. Andrej Karpathy covers a few other fun examples, such as generating fake Wikipedia pages, algebraic geometry proofs, and Linux source code in his blog post [5].

## Example - Many-to-One - Sentiment Analysis

In this example, we will use a many-to-one network that takes a sentence as input and predicts its sentiment as being either positive or negative. Our dataset is the Sentiment labeled sentences dataset on the UCI Machine Learning Repository [20], a set of 3,000 sentences from reviews on Amazon, IMDb, and Yelp, each labeled with 0 if it expresses a negative sentiment, or 1 if it expresses a positive sentiment.

In [3]:
import numpy as np
import os
import shutil
import tensorflow as tf
from sklearn.metrics import accuracy_score, confusion_matrix

The dataset is provided as a zip file, which expands into a folder containing three files of labeled sentences, one for each provider, with one sentence and label per line, with the sentence and label separated by the tab character. We first download the zip file, then parse the files into a list of (sentence, label) pairs:

In [4]:
def download_and_read(url):
    local_file = url.split('/')[-1]
    local_file = local_file.replace("%20", " ")
    p = tf.keras.utils.get_file(local_file, url, extract=True, cache_dir=".")
    local_folder = os.path.join("datasets", local_file.split('.')[0])
    
    labeled_sentences = []
    for labeled_filename in os.listdir(local_folder):
        if labeled_filename.endswith("_labelled.txt"):
            with open(os.path.join(local_folder, labeled_filename), "r") as f:
                for line in f:
                    sentence, label = line.strip().split('\t')
                    labeled_sentences.append((sentence, label))
    return labeled_sentences
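
The parsing step can be tried without downloading anything. This sketch applies the same tab-split logic to a few in-memory lines. The second sample line is made up for illustration, and `parse_labeled_lines` is a hypothetical helper, not part of the notebook. Unlike the function above, it converts the label to an integer immediately and skips blank lines.

```python
def parse_labeled_lines(lines):
    """Parse 'sentence<TAB>label' lines into (sentence, int_label) pairs."""
    pairs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # split on the last tab so a tab inside the sentence cannot break parsing
        sentence, label = line.rsplit("\t", 1)
        pairs.append((sentence, int(label)))
    return pairs

sample_lines = [
    "So there is no way for me to plug it in here in the US unless I go by a converter.\t0\n",
    "Great phone, works as advertised.\t1\n",  # made-up example line
    "\n",
]
pairs = parse_labeled_lines(sample_lines)
print(pairs)
```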

In [5]:
# download and read data into data structures
labeled_sentences = download_and_read("https://archive.ics.uci.edu/ml/machine-learning-databases/00331/sentiment%20labelled%20sentences.zip")

Downloading data from https://archive.ics.uci.edu/ml/machine-learning-databases/00331/sentiment%20labelled%20sentences.zip


In [6]:
sentences = [s for (s, l) in labeled_sentences]
labels = [int(l) for (s, l) in labeled_sentences]

In [11]:
print(type(sentences), len(sentences))

<class 'list'> 3000


In [12]:
print(type(labels), len(labels))

<class 'list'> 3000


In [13]:
sentences[0]

'So there is no way for me to plug it in here in the US unless I go by a converter.'

In [14]:
labels[0]

0

Our objective is to train the model so that, given a sentence as input, it learns to predict the corresponding sentiment provided in the label. Each sentence is a sequence of words. However, in order to input it into the model, we have to convert it into a sequence of integers. Each integer in the sequence will point to a word. The mapping of integers to words for our corpus is called a vocabulary. Thus we need to tokenize the sentences and produce a vocabulary. This is done using the following code:

In [21]:
tokenizer = tf.keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(sentences)
vocab_size = len(tokenizer.word_counts)
print('vocabulary size: {:d}'.format(vocab_size))
word2idx = tokenizer.word_index
idx2word = {v:k for (k, v) in word2idx.items()}

vocabulary size: 5271


In [20]:
word2idx

{'the': 1,
 'and': 2,
 'i': 3,
 'a': 4,
 'is': 5,
 'it': 6,
 'to': 7,
 'this': 8,
 'of': 9,
 'was': 10,
 'in': 11,
 'for': 12,
 'not': 13,
 'that': 14,
 'with': 15,
 'my': 16,
 'very': 17,
 'good': 18,
 'on': 19,
 'great': 20,
 'you': 21,
 'but': 22,
 'have': 23,
 'movie': 24,
 'are': 25,
 'as': 26,
 'so': 27,
 'phone': 28,
 'film': 29,
 'be': 30,
 'all': 31,
 'one': 32,
 'had': 33,
 'at': 34,
 'food': 35,
 'like': 36,
 'just': 37,
 'place': 38,
 "it's": 39,
 'time': 40,
 'service': 41,
 'an': 42,
 'were': 43,
 'if': 44,
 'from': 45,
 'bad': 46,
 'really': 47,
 'there': 48,
 'they': 49,
 'we': 50,
 'well': 51,
 'out': 52,
 'has': 53,
 'would': 54,
 'about': 55,
 'no': 56,
 'or': 57,
 'your': 58,
 'only': 59,
 'by': 60,
 'best': 61,
 "don't": 62,
 'even': 63,
 'here': 64,
 'ever': 65,
 'up': 66,
 'also': 67,
 'will': 68,
 'back': 69,
 'me': 70,
 'when': 71,
 'more': 72,
 'than': 73,
 'quality': 74,
 'go': 75,
 'what': 76,
 'love': 77,
 'he': 78,
 "i've": 79,
 'can': 80,
 'made': 81,
 'w

Our vocabulary consists of 5271 unique words. It is possible to make the size smaller by dropping words that occur fewer than some threshold number of times, which can be found by inspecting the **`tokenizer.word_counts`** dictionary. In such cases, we need to add 1 to the vocabulary size for the UNK (unknown) entry, which will be used to replace every word that is not found in the vocabulary.  
  
We also construct lookup dictionaries to convert from word to word index and back. The first dictionary is useful during training, in order to construct integer sequences to feed the network. The second dictionary is used to convert from word index back to word in our prediction code later.  
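
To make the idea of a vocabulary and its two lookup dictionaries concrete, here is a minimal plain-Python sketch. It is a simplified stand-in, not the Keras Tokenizer implementation: it lowercases and splits on whitespace only, numbers words from 1 in order of decreasing frequency, and appends an UNK entry when a `min_count` threshold is applied. The names `build_vocab` and `min_count` are assumptions made for this illustration.

```python
from collections import Counter

def build_vocab(texts, min_count=1):
    """Return (word2idx, idx2word) built from whitespace-split, lowercased texts."""
    counts = Counter(w for t in texts for w in t.lower().split())
    # most frequent words first; index 0 is left free for padding
    kept = [w for w, c in counts.most_common() if c >= min_count]
    word2idx = {w: i + 1 for i, w in enumerate(kept)}
    if min_count > 1:
        # one extra entry stands in for every dropped word
        word2idx["UNK"] = len(word2idx) + 1
    idx2word = {i: w for w, i in word2idx.items()}
    return word2idx, idx2word

texts = ["the food was good", "the service was bad", "good phone"]
word2idx, idx2word = build_vocab(texts)
small_w2i, small_i2w = build_vocab(texts, min_count=2)
print(len(word2idx), len(small_w2i))
```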
  
Each sentence can have a different number of words. Our model will require us to provide sequences of integers of identical length for each sentence. In order to support this requirement, it is common to choose a maximum sequence length that is large enough to accommodate most of the sentences in the training set. Any sentences that are shorter will be padded with zeros, and any sentences that are longer will be truncated. An easy way to choose a good value for the maximum sequence length is to look at the sentence length (in number of words) at different percentile positions:

In [23]:
seq_lengths = np.array([len(s.split()) for s in sentences])
print([(p, np.percentile(seq_lengths, p)) for p in [75,80,90,95,99,100]])

[(75, 16.0), (80, 18.0), (90, 22.0), (95, 26.0), (99, 36.0), (100, 71.0)]


As can be seen, the maximum sentence length is 71 words, but 99% of the sentences have at most 36 words. If we choose a value of 64, for example, we should be able to avoid truncating all but a handful of the sentences.  
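
One way to sanity-check a candidate length is to count how many sentences it would cut. The helper below is a hypothetical illustration in plain Python. The sample lengths are made up; on the real data you would pass the `seq_lengths` array computed above.

```python
def fraction_truncated(lengths, max_len):
    """Fraction of sequences longer than max_len (these would be truncated)."""
    lengths = list(lengths)
    too_long = sum(1 for n in lengths if n > max_len)
    return too_long / len(lengths)

sample_lengths = [5, 12, 16, 22, 26, 36, 71, 9, 14, 18]  # made-up word counts
for candidate in (16, 36, 64):
    print(candidate, fraction_truncated(sample_lengths, candidate))
```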
  
The preceding blocks of code can be run interactively multiple times to choose good values of vocabulary size and maximum sequence length respectively. In our example, we have chosen to keep all the words (so **`vocab_size = 5271`**), and we have set our **`max_seqlen`** to 64.  
  
Our next step is to create a dataset that our model can consume. We first use our trained tokenizer to convert each sentence from a sequence of words (**`sentences`**) to a sequence of integers (**`sentences_as_ints`**), where each corresponding integer is the index of the word in the **`tokenizer.word_index`**. It is then truncated and padded with zeros. The labels are also converted to a NumPy array **`labels_as_ints`**, and finally, we combine the tensors **`sentences_as_ints`** and **`labels_as_ints`** to form a TensorFlow dataset:

In [25]:
max_seqlen = 64
# create dataset
sentences_as_ints = tokenizer.texts_to_sequences(sentences)
sentences_as_ints = tf.keras.preprocessing.sequence.pad_sequences(sentences_as_ints, maxlen=max_seqlen)
labels_as_ints = np.array(labels)
dataset = tf.data.Dataset.from_tensor_slices((sentences_as_ints, labels_as_ints))
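
By default, `pad_sequences` pads and truncates at the front of the sequence (`padding='pre'`, `truncating='pre'`), so shorter sentences get leading zeros and longer ones lose their first words. The plain-Python sketch below mimics that default behavior for a single sequence. `pad_or_truncate` is a hypothetical helper written for illustration, not a Keras function.

```python
def pad_or_truncate(seq, maxlen, value=0):
    """Mimic the pad_sequences defaults: keep the last maxlen items, left-pad with value."""
    seq = list(seq)
    if len(seq) > maxlen:
        # 'pre' truncation drops items from the start of the sequence
        seq = seq[len(seq) - maxlen:]
    # 'pre' padding puts the fill value before the real tokens
    return [value] * (maxlen - len(seq)) + seq

print(pad_or_truncate([8, 24, 5, 20], maxlen=6))
print(pad_or_truncate([1, 2, 3, 4, 5, 6, 7], maxlen=6))
```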

We want to set aside 1/3 of the dataset for evaluation. Of the remaining data, we will use 10% as an inline validation dataset that the model will use to gauge its own progress during training, and the remaining as the training dataset. Finally, we create batches of 64 sentences for each dataset:

In [26]:
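# NOTE: shuffle() reshuffles on every pass by default, so the take()/skip() splits below
# can overlap across epochs; pass reshuffle_each_iteration=False to keep the splits fixed
dataset = dataset.shuffle(10000)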
test_size = len(sentences) // 3 # = 1000
val_size = (len(sentences) - test_size) // 10 # = 200

In [31]:
test_dataset = dataset.take(test_size)
val_dataset = dataset.skip(test_size).take(val_size)
train_dataset = dataset.skip(test_size + val_size)
batch_size = 64
train_dataset = train_dataset.batch(batch_size)
val_dataset = val_dataset.batch(batch_size)
test_dataset = test_dataset.batch(batch_size)
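
The sizes of the three splits and the number of batches per epoch follow from simple arithmetic. The sketch below works it through for the 3,000 sentences and the batch size of 64 used here. It assumes the final partial batch is kept, which is the default for `batch()`.

```python
import math

num_sentences = 3000
batch_size = 64

test_size = num_sentences // 3                 # one third held out for testing
val_size = (num_sentences - test_size) // 10   # 10% of the remainder for validation
train_size = num_sentences - test_size - val_size

# the last, smaller batch still counts as a step
train_steps = math.ceil(train_size / batch_size)
val_steps = math.ceil(val_size / batch_size)
test_steps = math.ceil(test_size / batch_size)

print(train_size, val_size, test_size)
print(train_steps, val_steps, test_steps)
```

These step counts agree with what Keras reports during training: 29 training steps and 4 validation steps per epoch.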

Next we define our model. As you can see, the model is fairly straightforward. Each input sentence is a sequence of integers of size **`max_seqlen`** (64). This is input into an Embedding layer that converts each integer into a vector of size 64 (the code reuses **`max_seqlen`** as the embedding dimension). The Embedding layer is created with the size of the vocabulary + 1 entries; the additional entry accounts for the padding integer 0 that was introduced during the **`pad_sequences()`** call above. The vectors at each of the 64 time steps are then fed into a bidirectional LSTM layer. Because **`return_sequences`** is not set, this layer returns only its final output: the forward and backward outputs concatenated into a single vector of size (128,) per sentence. This vector is fed into a Dense layer, which produces a vector of size (64,) with ReLU activation. The output of this Dense layer is then fed into another Dense layer, which outputs a vector of size (1,) per sentence, modulated through a sigmoid activation.  
The model is compiled with the binary cross-entropy loss function and the Adam optimizer, and then trained over 10 epochs:

In [34]:
class SentimentAnalysisModel(tf.keras.Model):
    def __init__(self, vocab_size, max_seqlen, **kwargs):
        super(SentimentAnalysisModel, self).__init__(**kwargs)
        self.embedding = tf.keras.layers.Embedding(vocab_size, max_seqlen)
        self.bilstm = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(max_seqlen))
        self.dense = tf.keras.layers.Dense(64, activation='relu')
        self.out = tf.keras.layers.Dense(1, activation='sigmoid')
        
    def call(self, x):
        x = self.embedding(x)
        x = self.bilstm(x)
        x = self.dense(x)
        x = self.out(x)
        return x

In [35]:
model = SentimentAnalysisModel(vocab_size+1, max_seqlen)
model.build(input_shape=(batch_size, max_seqlen))
model.summary()

Model: "sentiment_analysis_model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        multiple                  337408    
_________________________________________________________________
bidirectional (Bidirectional multiple                  66048     
_________________________________________________________________
dense (Dense)                multiple                  8256      
_________________________________________________________________
dense_1 (Dense)              multiple                  65        
Total params: 411,777
Trainable params: 411,777
Non-trainable params: 0
_________________________________________________________________
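
The parameter counts in the summary can be reproduced by hand, which is a useful check on the layer shapes. The sketch below uses the standard LSTM parameter formula (four gates, each with input weights, recurrent weights, and a bias) and assumes the values used here: 5271 + 1 vocabulary entries, and 64 for both the embedding size and the number of LSTM units.

```python
vocab_plus_pad = 5271 + 1   # vocabulary plus the padding index 0
embed_dim = 64              # the code passes max_seqlen as the embedding size
lstm_units = 64
dense_units = 64

embedding_params = vocab_plus_pad * embed_dim
# one LSTM direction: 4 gates x (input weights + recurrent weights + bias)
lstm_one_direction = 4 * (lstm_units * (embed_dim + lstm_units) + lstm_units)
bilstm_params = 2 * lstm_one_direction
# the Dense layer sees forward and backward outputs concatenated (2 * 64 = 128)
dense_params = (2 * lstm_units) * dense_units + dense_units
out_params = dense_units * 1 + 1

total = embedding_params + bilstm_params + dense_params + out_params
print(embedding_params, bilstm_params, dense_params, out_params, total)
```

The 8,256 parameters of the first Dense layer only work out with a 128-dimensional input, which confirms that the bidirectional layer emits one concatenated vector per sentence.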


In [36]:
# compile
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [43]:
# train
# build paths with os.path.join so they work on any operating system
data_dir = os.path.join('.', 'data')
logs_dir = os.path.join('.', 'logs')
best_model_file = os.path.join(data_dir, "best_model.h5")

checkpoint = tf.keras.callbacks.ModelCheckpoint(best_model_file, 
                    save_weights_only=True, save_best_only=True)
tensorboard = tf.keras.callbacks.TensorBoard(log_dir=logs_dir)

In [44]:
num_epochs = 10
history = model.fit(train_dataset, epochs=num_epochs, validation_data=val_dataset,
                   callbacks=[checkpoint, tensorboard])

Train for 29 steps, validate for 4 steps
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [None]:
%load_ext tensorboard

In [50]:
%tensorboard --logdir ./logs


Our checkpoint callback has saved the best model based on the lowest value of validation loss, and we can now reload it for evaluation against our held-out test set:

In [51]:
best_model = SentimentAnalysisModel(vocab_size+1, max_seqlen)
best_model.build(input_shape=(batch_size, max_seqlen))
best_model.load_weights(best_model_file)
best_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [52]:
test_loss, test_acc = best_model.evaluate(test_dataset)
print('test loss: {:.3f}, test accuracy: {:.3f}'.format(test_loss, test_acc))

test loss: 0.047, test accuracy: 0.991


We can also use **`model.predict()`** to retrieve our predictions, compare them individually to the labels, and use external tools (from scikit-learn, for example) to compute our results:

In [54]:
labels, predictions = [], []
idx2word[0] = 'PAD'
is_first_batch = True
for test_batch in test_dataset:
    inputs_b, labels_b = test_batch
    pred_batch = best_model.predict(inputs_b)
    predictions.extend([(1 if p>0.5 else 0) for p in pred_batch])
    labels.extend([i for i in labels_b])
    
    if is_first_batch:
        # print first batch of label, prediction, and sentence
        for rid in range(inputs_b.shape[0]):
            words = [idx2word[idx] for idx in inputs_b[rid].numpy()]
            words = [w for w in words if w != "PAD"]
            sentence = " ".join(words)
            print('{:d}\t{:d}\t{:s}'.format(labels[rid], predictions[rid], sentence))
        
        is_first_batch = False
        
print('accuracy score: {:.3}'.format(accuracy_score(labels, predictions)))
print('confusion matrix')
print(confusion_matrix(labels, predictions))
    

0	0	if this premise sound stupid that's because it is
1	1	paolo sorrentino has written a wonderful story about loneliness and tony has built one of the most unforgettable characters seen in movies in recent years
0	0	the ambiance here did not feel like a buffet setting but more of a douchey indoor garden for tea and biscuits
1	1	i have recommended it to friends
0	0	mic doesn't work
0	0	does not work for listening to music with the cingular 8125
1	1	the staff is super nice and very quick even with the crazy crowds of the downtown juries lawyers and court staff
1	1	i really like this product over the motorola because it is allot clearer on the ear piece and the mic
0	0	we waited for forty five minutes in vain
1	1	thus far have only visited twice and the food was absolutely delicious each time
1	1	that said our mouths and bellies were still quite pleased
1	1	if you want a movie that's not gross but gives you some chills this is a great choice
0	0	the update procedure is difficult and cumb

For the first batch of 64 sentences in our test dataset, we reconstruct the sentence and display the label (first column) as well as the prediction from the model (second column).
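
For reference, the two metrics reported here are simple to compute by hand. The sketch below is a plain-Python illustration for binary labels, using made-up label and prediction lists. It follows the scikit-learn convention in which rows are true labels and columns are predicted labels. The function names are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def binary_confusion_matrix(y_true, y_pred):
    """2x2 matrix: rows are true labels (0, 1), columns are predicted labels (0, 1)."""
    matrix = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

y_true = [0, 1, 0, 1, 1, 0]  # made-up labels
y_pred = [0, 1, 1, 1, 0, 0]  # made-up predictions
print(accuracy(y_true, y_pred))
print(binary_confusion_matrix(y_true, y_pred))
```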

## Example - Many-to-Many - POS tagging

In this example, we will use a GRU layer to build a network that does part-of-speech (POS) tagging. A POS is a grammatical category of words that are used in the same way across multiple sentences. Examples of POS are nouns, verbs, adjectives, and so on. For example, nouns are typically used to identify things, verbs are typically used to identify what these things do, and adjectives are used to describe attributes of these things. POS tagging used to be done manually, but it is now mostly a solved problem, initially through statistical models, and more recently by using deep learning models in an end-to-end manner, as described in Collobert et al. [21]. For our training data, we will need sentences tagged with part-of-speech tags. The Penn Treebank [22] is one such dataset; it is a human-annotated corpus of about 4.5 million words of American English. However, it is a non-free resource. A 10% sample of the Penn Treebank is freely available as part of NLTK [23], which we will use to train our network.  
  
Our model will take a sequence of words in a sentence as input and output the corresponding POS tag for each word. Thus, for an input sequence consisting of the words [The, cat, sat, on, the, mat], the output sequence should be the POS symbols [DT, NN, VB, IN, DT, NN].  
  
To get the data, first install NLTK by following the steps on the NLTK install page [23]. Then, to install the 10% treebank dataset, run the following at the Python REPL:

In [1]:
import nltk
nltk.download('treebank')

[nltk_data] Downloading package treebank to
[nltk_data]     C:\Users\polas\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\treebank.zip.


True

Once this is done, we are ready to build our network.

In [3]:
import numpy as np
import os
import shutil
import tensorflow as tf

We will lazily import the NLTK treebank dataset into a pair of parallel flat files, one containing the sentences and the other containing a corresponding POS sequence:

In [4]:
def download_and_read(dataset_dir, num_pairs=None):
    sent_filename = os.path.join(dataset_dir, "treebank-sents.txt")
    poss_filename = os.path.join(dataset_dir, "treebank-poss.txt")
    if not (os.path.exists(sent_filename) and os.path.exists(poss_filename)):
        # import lazily: NLTK is only needed the first time, to build the flat files
        import nltk

        os.makedirs(dataset_dir, exist_ok=True)
        sentences = nltk.corpus.treebank.tagged_sents()
        with open(sent_filename, "w") as fsents, open(poss_filename, "w") as fposs:
            for sent in sentences:
                # one sentence per line: words in one file, POS tags in the other
                fsents.write(" ".join([w for w, p in sent]) + "\n")
                fposs.write(" ".join([p for w, p in sent]) + "\n")

    sents, poss = [], []
    with open(sent_filename, "r") as fsent:
        for idx, line in enumerate(fsent):
            # stop before appending, so that exactly num_pairs lines are read
            if num_pairs is not None and idx >= num_pairs:
                break
            sents.append(line.strip())
    with open(poss_filename, "r") as fposs:
        for idx, line in enumerate(fposs):
            if num_pairs is not None and idx >= num_pairs:
                break
            poss.append(line.strip())
    return sents, poss
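
The parallel-file layout is easiest to see on a single sentence. The sketch below uses a hand-written tagged sentence in the same (word, tag) form that `tagged_sents()` yields. The sentence and its tags reuse the illustrative example from the text and are not taken from the corpus.

```python
# one tagged sentence as a list of (word, POS tag) pairs (illustrative, not from the corpus)
tagged_sent = [("The", "DT"), ("cat", "NN"), ("sat", "VB"),
               ("on", "IN"), ("the", "DT"), ("mat", "NN")]

# the line written to the sentences file and the line written to the POS file
sent_line = " ".join(w for w, p in tagged_sent)
poss_line = " ".join(p for w, p in tagged_sent)
print(sent_line)
print(poss_line)

# reading back: the i-th word and the i-th tag stay aligned
pairs = list(zip(sent_line.split(), poss_line.split()))
print(pairs[:2])
```

Because both lines contain the same number of space-separated tokens, splitting each line on spaces recovers the word and tag sequences in matching order.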

In [6]:
# download and read source and target data into data structure
sents, poss = download_and_read("./datasets")
assert(len(sents) == len(poss))
print("# of records: {:d}".format(len(sents)))

# of records: 3914


There are 3914 sentences in our dataset. We will use the TensorFlow (Keras) tokenizer to tokenize the sentences and create a list of sentence tokens. We reuse the same infrastructure to tokenize the parts of speech, although we could have simply split on spaces. Each input record to the network is currently a sequence of text tokens, but it needs to be a sequence of integers. During the tokenizing process, the Tokenizer also maintains the tokens in the vocabulary, from which we can build mappings from token to integer and back.  
  
We have two vocabularies to consider: first, the vocabulary of word tokens in the sentence collection, and second, the vocabulary of POS tags in the part-of-speech collection. The following code shows how to tokenize both collections and generate the necessary mapping dictionaries:

In [7]:
def tokenize_and_build_vocab(texts, vocab_size=None, lower=True):
    if vocab_size is None:
        tokenizer = tf.keras.preprocessing.text.Tokenizer(lower=lower)
    else:
        tokenizer = tf.keras.preprocessing.text.Tokenizer(
            num_words=vocab_size+1, oov_token="UNK", lower=lower)
    tokenizer.fit_on_texts(texts)
    if vocab_size is not None:
        # additional workaround, see issue 8092
        # https://github.com/keras-team/keras/issues/8092
        tokenizer.word_index = {e:i for e, i in tokenizer.word_index.items() 
            if i <= vocab_size+1 }
    word2idx = tokenizer.word_index
    idx2word = {v:k for k, v in word2idx.items()}
    return word2idx, idx2word, tokenizer

In [9]:
# vocabulary sizes
word2idx_s, idx2word_s, tokenizer_s = tokenize_and_build_vocab(sents, vocab_size=9000)
word2idx_t, idx2word_t, tokenizer_t = tokenize_and_build_vocab(poss, vocab_size=38, lower=False)
source_vocab_size = len(word2idx_s)
target_vocab_size = len(word2idx_t)
print("vocab sizes (source): {:d}, (target): {:d}".format(source_vocab_size, target_vocab_size))

vocab sizes (source): 9001, (target): 39
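
The vocabulary cap and the UNK entry can be illustrated without Keras. The sketch below is a simplified stand-in and not the Tokenizer's actual implementation: it reserves index 1 for UNK, assigns indices 2 onward to the most frequent words, and maps every other word to UNK when encoding. `build_capped_vocab` and `encode` are hypothetical names.

```python
from collections import Counter

def build_capped_vocab(texts, vocab_size, unk="UNK"):
    """Keep the vocab_size most frequent words; index 1 is reserved for UNK."""
    counts = Counter(w for t in texts for w in t.split())
    word2idx = {unk: 1}
    for w, _ in counts.most_common(vocab_size):
        word2idx[w] = len(word2idx) + 1
    return word2idx

def encode(text, word2idx, unk="UNK"):
    """Map each word to its index, falling back to the UNK index."""
    return [word2idx.get(w, word2idx[unk]) for w in text.split()]

texts = ["the cat sat on the mat", "the dog sat"]
word2idx = build_capped_vocab(texts, vocab_size=2)
print(len(word2idx))
print(encode("the cat sat", word2idx))
```

With a cap of `vocab_size` words plus UNK, the dictionary has `vocab_size + 1` entries, which is the same pattern as the 9001 and 39 printed above.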


Our sentences are going to be of different lengths, although the number of tokens in a sentence and in its corresponding POS tag sequence are the same. The network expects all inputs to have the same length, so we have to decide on a fixed sentence length. The following (throwaway) code computes various percentiles and prints the sentence length at these percentiles on the console:

In [19]:
sequence_lengths = np.array([len(s.split()) for s in sents])
print([(p, np.percentile(sequence_lengths, p)) for p in [75, 80, 90, 95, 99, 100]])

[(75, 33.0), (80, 35.0), (90, 41.0), (95, 47.0), (99, 58.0), (100, 271.0)]
