# Recurrent Neural Networks

Recurrent neural networks, or RNNs, are a family of neural networks for processing sequential data. Much as a convolutional network is a neural network that is specialized for processing a grid of values $X$ such as an image, a recurrent neural network is a neural network that is specialized for processing a sequence of values $x^{(1)}, \dots, x^{(T)}$.
![alt text](https://stanford.edu/~shervine/images/architecture-rnn.png)

* $x^{<t>}$ is the input at time step $t$. For example, $x^{<1>}$ could be a one-hot vector corresponding to the first word of a sentence.
* $a^{<t>}$ is the hidden state at time step $t$. It’s the “memory” of the network. $a^{<t>}$ is calculated from the previous hidden state and the input at the current step: $a^{<t>}=f(Ux^{<t>} + Wa^{<t-1>})$. The function $f$ is usually a nonlinearity such as tanh or ReLU. $a^{<0>}$, which is required to calculate the first hidden state, is typically initialized to all zeros.
* $y^{<t>}$ is the output at step $t$. For example, if we wanted to predict the next word in a sentence, it would be a vector of probabilities across our vocabulary: $y^{<t>} = \mathrm{softmax}(Va^{<t>})$.

For each time step $t$, the activation $a^{<t>}$ and the output $y^{<t>}$ are expressed as follows:
$\boxed{a^{< t >}=g_1(W_{aa}a^{< t-1 >}+W_{ax}x^{< t >}+b_a)}\quad\mbox{and}\quad\boxed{y^{< t >}=g_2(W_{ya}a^{< t >}+b_y)}$

where $W_{ax}, W_{aa}, W_{ya}, b_a, b_y$ are coefficients that are shared temporally and $g_1, g_2$ are activation functions.
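
The two boxed equations can be traced step by step in plain NumPy. The sketch below is only an illustration under assumptions: it picks $g_1=\tanh$ and $g_2=\mathrm{softmax}$, and the toy sizes, the random weights, and the function names `softmax` and `rnn_forward` are hypothetical rather than taken from this notebook.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def rnn_forward(x_seq, W_aa, W_ax, W_ya, b_a, b_y):
    """Run a vanilla RNN over x_seq of shape (T, n_x).

    Returns the activations, shape (T, n_a), and the outputs, shape (T, n_y).
    """
    a = np.zeros(W_aa.shape[0])  # a^{<0>} initialized to all zeros
    activations, outputs = [], []
    for x_t in x_seq:
        a = np.tanh(W_aa @ a + W_ax @ x_t + b_a)  # assumed g_1 = tanh
        y = softmax(W_ya @ a + b_y)               # assumed g_2 = softmax
        activations.append(a)
        outputs.append(y)
    return np.stack(activations), np.stack(outputs)

# Hypothetical toy sizes: 5 time steps, 3 input features, 4 hidden units, 2 classes
rng = np.random.default_rng(0)
T, n_x, n_a, n_y = 5, 3, 4, 2
params = dict(
    W_aa=rng.normal(scale=0.1, size=(n_a, n_a)),
    W_ax=rng.normal(scale=0.1, size=(n_a, n_x)),
    W_ya=rng.normal(scale=0.1, size=(n_y, n_a)),
    b_a=np.zeros(n_a),
    b_y=np.zeros(n_y),
)
a_seq, y_seq = rnn_forward(rng.normal(size=(T, n_x)), **params)
print(a_seq.shape, y_seq.shape)
```

The same five parameter arrays are reused at every time step, which is what "shared temporally" means: the loop can run over a sequence of any length without adding parameters.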

![alt text](https://stanford.edu/~shervine/images/description-block-rnn.png)

The pros and cons of a typical RNN architecture are summed up in the lists below:

## Advantages 
*  Possibility of processing input of any length
*  Model size not increasing with size of input
*  Computation takes into account historical information
*  Weights are shared across time

## Drawbacks
* Computation being slow
* Difficulty of accessing information from a long time ago
* Cannot consider any future input for the current state


### One to One 
Traditional neural network
$$T_x=T_y=1$$
![alt text](https://stanford.edu/~shervine/images/rnn-one-to-one.png)

### One to Many
Music generation

$$T_x=1, T_y>1$$
![alt text](https://stanford.edu/~shervine/images/rnn-one-to-many.png)

### Many to One
Sentiment classification
$$T_x>1, T_y=1$$
![alt text](https://stanford.edu/~shervine/images/rnn-many-to-one.png)


### Many to Many 
Named entity recognition
$$T_x=T_y$$
![alt text](https://stanford.edu/~shervine/images/rnn-many-to-many-same.png)

### Many to Many
Machine Translation
$$T_x\neq T_y$$
![alt text](https://stanford.edu/~shervine/images/rnn-many-to-many-different.png)
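
In code, the difference between the many-to-one and the equal-length many-to-many layouts often comes down to which hidden states are kept. A minimal NumPy sketch, assuming a tanh RNN without bias terms; the function `run_rnn` and the toy sizes are hypothetical, and the flag name `return_sequence` is borrowed from the `BasicLSTM` class later in this notebook.

```python
import numpy as np

def run_rnn(x_seq, W_aa, W_ax, return_sequence=False):
    """Return every hidden state (many to many) or only the last one (many to one)."""
    a = np.zeros(W_aa.shape[0])
    states = []
    for x_t in x_seq:
        a = np.tanh(W_aa @ a + W_ax @ x_t)
        states.append(a)
    states = np.stack(states)
    return states if return_sequence else states[-1]

rng = np.random.default_rng(0)
T_x, n_x, n_a = 6, 3, 4  # hypothetical toy sizes
W_aa = rng.normal(scale=0.1, size=(n_a, n_a))
W_ax = rng.normal(scale=0.1, size=(n_a, n_x))
x_seq = rng.normal(size=(T_x, n_x))

all_states = run_rnn(x_seq, W_aa, W_ax, return_sequence=True)  # one state per input step
last_state = run_rnn(x_seq, W_aa, W_ax)                        # a single summary state
print(all_states.shape, last_state.shape)
```

A per-step classifier on `all_states` gives the $T_x=T_y$ case, while a single classifier on `last_state` gives the $T_y=1$ case.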

**Loss function** ― In the case of a recurrent neural network, the loss function $\mathcal{L}$ over all time steps is defined as the sum of the losses at each time step:
$$\boxed{\mathcal{L}(\widehat{y},y)=\sum_{t=1}^{T_y}\mathcal{L}(\widehat{y}^{< t >},y^{< t >})}$$
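
As a small worked example of this sum, the sketch below assumes that the per-step loss is the cross-entropy between a predicted probability vector and a one-hot target. The text above does not fix that choice, and the function `sequence_loss` and the numbers are hypothetical.

```python
import numpy as np

def sequence_loss(y_hat, y):
    """Sum the per-step cross-entropy losses.

    y_hat: (T_y, n_classes) predicted probabilities
    y:     (T_y, n_classes) one-hot targets
    """
    per_step = -np.sum(y * np.log(y_hat + 1e-12), axis=1)  # loss at each time step
    return per_step.sum(), per_step

# Hypothetical predictions for T_y = 3 steps and 3 classes
y_hat = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.3, 0.4]])
y = np.array([[1, 0, 0],
              [0, 1, 0],
              [0, 0, 1]])

total, per_step = sequence_loss(y_hat, y)
print(per_step, total)
```

Here the total equals $-\ln(0.7)-\ln(0.8)-\ln(0.4)$, one term per time step.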

**Backpropagation through time** ― Backpropagation is done at each point in time. At time step $T$, the derivative of the loss $\mathcal{L}$ with respect to the weight matrix $W$ is expressed as follows:
$$ \boxed{\frac{\partial \mathcal{L}^{(T)}}{\partial W}=\sum_{t=1}^T\left.\frac{\partial\mathcal{L}^{(T)}}{\partial W}\right|_{(t)}}$$
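
To make the sum over time steps concrete, here is a sketch on a deliberately tiny model. It assumes a scalar RNN $a^{<t>}=\tanh(w\,a^{<t-1>}+u\,x^{<t>})$ with a squared-error loss on the last activation only. The helper names and the numbers are hypothetical, and the result is compared against a finite-difference estimate.

```python
import numpy as np

def forward(w, u, xs):
    """Return [a^{<0>}, a^{<1>}, ..., a^{<T>}] for a scalar tanh RNN."""
    a = [0.0]
    for x in xs:
        a.append(np.tanh(w * a[-1] + u * x))
    return a

def loss(w, u, xs, target):
    """Squared error on the final activation (assumed loss for this sketch)."""
    return 0.5 * (forward(w, u, xs)[-1] - target) ** 2

def bptt_grad_w(w, u, xs, target):
    """Gradient of the loss w.r.t. w as a sum of per-time-step contributions."""
    a = forward(w, u, xs)
    T = len(xs)
    delta = a[T] - target                     # dL/da^{<T>}
    contributions = np.zeros(T)
    for t in range(T, 0, -1):
        dz = delta * (1.0 - a[t] ** 2)        # back through tanh at step t
        contributions[t - 1] = dz * a[t - 1]  # effect of w where it is used at step t
        delta = dz * w                        # propagate to a^{<t-1>}
    return contributions.sum(), contributions

w, u, target = 0.5, 1.0, 0.2
xs = [1.0, -0.5, 0.3, 0.8]
grad, parts = bptt_grad_w(w, u, xs, target)

# Central finite difference as an independent check
eps = 1e-6
numeric = (loss(w + eps, u, xs, target) - loss(w - eps, u, xs, target)) / (2 * eps)
print(grad, numeric)
```

Each entry of `parts` is one term of the sum: the contribution of the shared weight through its use at a single time step.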





In [None]:
import os
import numpy as np

import tensorflow as tf
from tensorflow.contrib.eager.python import tfe


In [None]:
# enable eager mode
tf.enable_eager_execution()
tf.set_random_seed(0)
np.random.seed(0)

In [None]:
if not os.path.exists('weights/'):
    os.makedirs('weights/')

# constants
units = 128
batch_size = 100
epochs = 2
num_classes = 10

In [None]:

class DataLoader():
    """Load data MNIST """
    def __init__(self):
      
        # Download data
        (self.X_train, self.y_train),(self.X_test,self.y_test)= tf.keras.datasets.mnist.load_data()
        
        # Preprocessing
        self.X_train = self.X_train.reshape(-1, 28, 28).astype(np.float32) / 255.0  # Must have shape [batch, H, W]; the RNN reads each row as one time step
        self.X_test = self.X_test.reshape(-1, 28, 28).astype(np.float32) / 255.0
        self.y_train = self.y_train.astype(np.int32)
        self.y_test = self.y_test.astype(np.int32)
        
    def get_batch(self,batch_size):
        # Random sampling of indices: np.random.randint(low, high, size)
        index=np.random.randint(0, self.X_train.shape[0], batch_size)
        return self.X_train[index,:], self.y_train[index]

In [59]:
data=DataLoader()

(60000, 28, 28)

In [61]:
# onehot encoding 
y_train_ohe = tf.one_hot(data.y_train, depth=num_classes).numpy()
y_test_ohe = tf.one_hot(data.y_test, depth=num_classes).numpy()

print('x train', data.X_train.shape)
print('y train', y_train_ohe.shape)
print('x test', data.X_test.shape)
print('y test', y_test_ohe.shape)

('x train', (60000, 28, 28))
('y train', (60000, 10))
('x test', (10000, 28, 28))
('y test', (10000, 10))
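
For reference, the one-hot encoding produced by `tf.one_hot` above can be reproduced in plain NumPy. A minimal sketch, using a small hypothetical label array rather than the MNIST labels:

```python
import numpy as np

num_classes = 10
labels = np.array([3, 0, 9], dtype=np.int32)  # hypothetical labels, not taken from MNIST

# Row i of the identity matrix is the one-hot vector for class i
one_hot = np.eye(num_classes, dtype=np.float32)[labels]
print(one_hot.shape)
```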


# Long Short Term Memory
LSTMs are explicitly designed to avoid the long-term dependency problem. Remembering information for long periods of time is practically their default behavior, not something they struggle to learn.

The key to LSTMs is the cell state, the horizontal line running through the top of the diagram.
The cell state is kind of like a conveyor belt. It runs straight down the entire chain, with only some minor linear interactions. It’s very easy for information to just flow along it unchanged.

![alt text](https://stanford.edu/~shervine/images/lstm.png)

$$\tilde{c}^{< t >}=\textrm{tanh}(W_c[\Gamma_r\star a^{< t-1 >},x^{< t >}]+b_c)$$
$$c^{< t >}= \Gamma_u\star\tilde{c}^{< t >}+\Gamma_f\star c^{< t-1 >}$$
$$a^{< t >}=\Gamma_o\star c^{< t >}$$

The LSTM does have the ability to remove or add information to the cell state, carefully regulated by structures called gates.

**Gates**:
A system of gating units that controls the flow of information:
* Update gate $\Gamma_u$: how much should the past matter now?
* Forget gate $\Gamma_f$: erase a cell or not?
* Output gate $\Gamma_o$: how much of a cell to reveal?
* Relevance gate $\Gamma_r$: drop previous information?

[Understanding LSTMs](http://colah.github.io/posts/2015-08-Understanding-LSTMs/)
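
The three equations above can be sketched as a single NumPy step. The equations do not say how the gates themselves are computed, so this sketch assumes that each gate is a sigmoid of an affine map of $[a^{<t-1>}, x^{<t>}]$. The parameter names, the helpers `init_params` and `lstm_step`, and the toy sizes are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_params(n_a, n_x, rng):
    """Random weights and zero biases for the four gates (u, f, o, r) and the candidate (c)."""
    p = {}
    for name in ("u", "f", "o", "r", "c"):
        p["W_" + name] = rng.normal(scale=0.1, size=(n_a, n_a + n_x))
        p["b_" + name] = np.zeros(n_a)
    return p

def lstm_step(x_t, a_prev, c_prev, p):
    """One step following the three equations above."""
    concat = np.concatenate([a_prev, x_t])
    # Assumed gate form: sigmoid(W [a_prev, x_t] + b)
    gamma_u = sigmoid(p["W_u"] @ concat + p["b_u"])
    gamma_f = sigmoid(p["W_f"] @ concat + p["b_f"])
    gamma_o = sigmoid(p["W_o"] @ concat + p["b_o"])
    gamma_r = sigmoid(p["W_r"] @ concat + p["b_r"])
    # Candidate cell state, with the relevance gate applied to a_prev
    c_tilde = np.tanh(p["W_c"] @ np.concatenate([gamma_r * a_prev, x_t]) + p["b_c"])
    c_t = gamma_u * c_tilde + gamma_f * c_prev  # add new information, keep old
    a_t = gamma_o * c_t                         # reveal part of the cell
    return a_t, c_t

rng = np.random.default_rng(0)
n_a, n_x = 4, 3  # hypothetical toy sizes
p = init_params(n_a, n_x, rng)
x_t, a_prev, c_prev = rng.normal(size=n_x), rng.normal(size=n_a), rng.normal(size=n_a)
a_t, c_t = lstm_step(x_t, a_prev, c_prev, p)

# "Conveyor belt" check: forget gate pushed to ~1 and update gate to ~0
p_keep = dict(p, b_f=np.full(n_a, 20.0), b_u=np.full(n_a, -20.0))
_, c_kept = lstm_step(x_t, a_prev, c_prev, p_keep)
print(np.max(np.abs(c_kept - c_prev)))
```

With the forget gate saturated near 1 and the update gate near 0, the cell state passes through the step almost unchanged, which is the conveyor-belt behaviour described earlier.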



In [None]:
class BasicLSTM(tf.keras.Model):
    def __init__(self, units, return_sequence=False, return_states=False, **kwargs):
        super(BasicLSTM, self).__init__(**kwargs)
        self.units = units
        self.return_sequence = return_sequence
        self.return_states = return_states

        def bias_initializer(_, *args, **kwargs):
            return tf.keras.backend.concatenate([
                tf.keras.initializers.Zeros()((self.units,), *args, **kwargs),  # input gate
                tf.keras.initializers.Ones()((self.units,), *args, **kwargs),  # forget gate
                tf.keras.initializers.Zeros()((self.units * 2,), *args, **kwargs),  # context and output gates
            ])

        self.kernel = tf.keras.layers.Dense(4 * units, use_bias=False)
        self.recurrent_kernel = tf.keras.layers.Dense(4 * units, kernel_initializer='glorot_uniform', bias_initializer=bias_initializer)

    def call(self, inputs, training=None, mask=None, initial_states=None):
        # reset the states initially if not provided, else use those
        if initial_states is None:
            h_state = tf.zeros((inputs.shape[0], self.units))
            c_state = tf.zeros((inputs.shape[0], self.units))
        else:
            assert len(initial_states) == 2, "Must pass a list of 2 states when passing 'initial_states'"
            h_state, c_state = initial_states

        h_list = []
        c_list = []

        for t in range(inputs.shape[1]):
            # LSTM gate steps
            ip = inputs[:, t, :]
            z = self.kernel(ip)
            z += self.recurrent_kernel(h_state)

            z0 = z[:, :self.units]
            z1 = z[:, self.units: 2 * self.units]
            z2 = z[:, 2 * self.units: 3 * self.units]
            z3 = z[:, 3 * self.units:]

            # gate updates
            i = tf.keras.activations.sigmoid(z0)
            f = tf.keras.activations.sigmoid(z1)
            c = f * c_state + i * tf.nn.tanh(z2)

            # state updates
            o = tf.keras.activations.sigmoid(z3)
            h = o * tf.nn.tanh(c)

            h_state = h
            c_state = c

            h_list.append(h_state)
            c_list.append(c_state)

        hidden_outputs = tf.stack(h_list, axis=1)
        hidden_states = tf.stack(c_list, axis=1)

        if self.return_states and self.return_sequence:
            return hidden_outputs, [hidden_outputs, hidden_states]
        elif self.return_states and not self.return_sequence:
            return hidden_outputs[:, -1, :], [h_state, c_state]
        elif self.return_sequence and not self.return_states:
            return hidden_outputs
        else:
            return hidden_outputs[:, -1, :]

In [None]:
class BasicLSTMModel(tf.keras.Model):
    def __init__(self, units, num_classes):
        super(BasicLSTMModel, self).__init__()
        self.units = units
        self.lstm = BasicLSTM(units)
        self.classifier = tf.keras.layers.Dense(num_classes)

    def call(self, inputs, training=None, mask=None):
        h = self.lstm(inputs)
        output = self.classifier(h)

        # softmax op does not exist on the gpu, so always use cpu
        with tf.device('/cpu:0'):
            output = tf.nn.softmax(output)

        return output

In [None]:
device = '/cpu:0' if tfe.num_gpus() == 0 else '/gpu:0'

with tf.device(device):
  
    # build model and optimizer
    model = BasicLSTMModel(units, num_classes)
    model.compile(optimizer=tf.train.AdamOptimizer(0.01), loss='categorical_crossentropy',
                  metrics=['accuracy'])

    dummy_x = tf.zeros((1, 28, 28))
    model._set_inputs(dummy_x)

    # train
    model.fit(data.X_train, y_train_ohe, batch_size=batch_size, epochs=epochs,
              validation_data=(data.X_test, y_test_ohe), verbose=1)

    # evaluate on test set
    scores = model.evaluate(data.X_test, y_test_ohe, batch_size=batch_size, verbose=1)
    print("Final test loss and accuracy :", scores)

    saver = tfe.Saver(model.variables)
    saver.save('weights/06_02_rnn/weights.ckpt')

Epoch 1/2
Epoch 2/2
('Final test loss and accuracy :', [0.1013777525583282, 0.9720000064373017])


### Pre-built LSTM cell

In [None]:
class LSTM(tf.keras.Model):
    def __init__(self, units, num_classes):
        super(LSTM, self).__init__()
        self.units = units
        self.lstm_cell = tf.nn.rnn_cell.LSTMCell(units)  
        self.classifier = tf.keras.layers.Dense(num_classes)

    def call(self, inputs, training=None, mask=None):
        state = self.lstm_cell.zero_state(batch_size=inputs.shape[0], dtype=tf.float32)
        x = inputs

        for t in range(inputs.shape[1]):
            step_input = inputs[:, t, :]  # one image row per time step; avoids shadowing the built-in `input`
            x, state = self.lstm_cell(step_input, state=state)

        output = self.classifier(x)  # feed the last `x` as the hidden embedding of the lstm to the classifier

        # softmax op does not exist on the gpu, so always use cpu
        with tf.device('/cpu:0'):
            output = tf.nn.softmax(output)

        return output

In [None]:

device = '/cpu:0' if tfe.num_gpus() == 0 else '/gpu:0'

with tf.device(device):
  
    # build model and optimizer
    model = LSTM(units, num_classes)
    model.compile(optimizer=tf.train.AdamOptimizer(0.01), loss='categorical_crossentropy',
                  metrics=['accuracy'])

    dummy_x = tf.zeros((1, 28, 28))
    model._set_inputs(dummy_x)

    # train
    model.fit(data.X_train, y_train_ohe, batch_size=batch_size, epochs=epochs,
              validation_data=(data.X_test, y_test_ohe), verbose=1)

    # evaluate on test set
    scores = model.evaluate(data.X_test, y_test_ohe, batch_size=batch_size, verbose=1)
    print("Final test loss and accuracy :", scores)

    saver = tfe.Saver(model.variables)
    saver.save('weights/06_03_rnn/weights.ckpt')