In [1]:
import numpy as np
import pandas as pd
import tensorflow as tf
import os



In [2]:
tf.__version__

'1.5.0'

In [3]:
tf.test.gpu_device_name()

''

In [4]:
# for the sake of reproducibility 

def reset_graph(seed=42):
    tf.reset_default_graph()
    tf.set_random_seed(seed)
    np.random.seed(seed)

# Outline

* [RNN recap](#RNN-recap)
    - [Task1](#Task1)
    - [Dynamic RNN](#Dynamic-RNN)
* [Generate names with RNN](#Name-generation)
    - [Task2](#Task2)
* [SRU implementation](#SRU-implementation)
    - [Task3](#Task3)
* [Bonus part](#Bonus-part)
* [How to evaluate the work](#How-to-evaluate-the-work)

# RNN recap

<img src="./pics/rnn.png" width="90%">

The simplest RNN, consisting of a single layer, receives $x_{(t)}$ and can be written as:

$$y_{(t)} = \phi (x_{(t)}^T \cdot w_x + y_{(t-1)}^T \cdot w_y + b)$$

where
* $x_{(t)}$ -- input vector at time step $t$
* $y_{(t)}$ -- output vector at time step $t$
* $w_x$ -- weight vector for the input
* $w_y$ -- weight vector for the previous output
* $y_{(t-1)}$ -- output vector at the previous time step; for the 0th step it is the zero vector
* $b$ -- bias
* $\phi$ -- some activation function, e.g. ReLU


We should also mention the **hidden_state** ( $h_{(t)}$ ) -- the memory of the recurrent cell.

In the general case $h_{(t)} = f(h_{(t-1)}, x_{(t)})$, and the output is likewise a function of the previous state and the current input: $y_{(t)} = f(h_{(t-1)}, x_{(t)})$. For the simple cell above $h_{(t)} = y_{(t)}$, but in practice more complex architectures are used, where the **hidden_state** is not equal to the RNN output.

------

## Let's write a simple RNN
To write an RNN we need to make a few changes to the formula.

Let's say that the input is not a single vector $x_{(t)}$ but a mini-batch $X_{(t)}$ of $m$ vectors. All subsequent computations will then be in matrix form.

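$$ Y_{(t)} = \phi(X_{(t)} \cdot W_x + Y_{(t-1)} \cdot W_y + b) = \phi([X_{(t)} \; Y_{(t-1)}] \cdot W + b) $$
where
$$ W = \begin{bmatrix} W_x \\ W_y \end{bmatrix} $$

\*Square brackets denote matrix concatenation: $[X_{(t)} \; Y_{(t-1)}]$ places the two matrices side by side (along columns), and $W$ stacks $W_x$ on top of $W_y$ (along rows).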

Dimensions:
* $Y_{(t)}$ -- matrix [$m$ x n_neurons]
* $X_{(t)}$ -- matrix [$m$ x n_features]
* $b$ -- vector of size `n_neurons`
* $W_x$ -- input weights of size [n_features x n_neurons]
* $W_y$ -- output weights of size [n_neurons x n_neurons]
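The equivalence of the two forms can be checked numerically. The following is a minimal NumPy sketch, not part of the original notebook; the sizes and the random values are illustrative assumptions.

```python
import numpy as np

rng = np.random.RandomState(42)
m, n_features, n_neurons = 4, 3, 5  # illustrative sizes

X_t = rng.randn(m, n_features)        # mini-batch at step t
Y_prev = rng.randn(m, n_neurons)      # outputs of step t-1
W_x = rng.randn(n_features, n_neurons)
W_y = rng.randn(n_neurons, n_neurons)
b = np.zeros(n_neurons)

# form with two matrix multiplications
Y_two = np.tanh(X_t @ W_x + Y_prev @ W_y + b)

# form with one matrix multiplication:
# inputs are concatenated along columns, weights are stacked along rows
XY = np.concatenate([X_t, Y_prev], axis=1)   # [m, n_features + n_neurons]
W = np.concatenate([W_x, W_y], axis=0)       # [n_features + n_neurons, n_neurons]
Y_one = np.tanh(XY @ W + b)

print(XY.shape, W.shape)
print(np.allclose(Y_two, Y_one))
```

Both forms give the same $Y_{(t)}$ up to floating-point error, because the block product $[X \; Y] \cdot W$ expands to $X \cdot W_x + Y \cdot W_y$.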

In [5]:
reset_graph() # just clear default graph and set seed for reproducibility

n_features = 3
n_neurons = 5

# two time steps
# the first dimension in shape parameter is None
# because of possibility to feed any sized batch

X0 = tf.placeholder(tf.float32, shape=[None, n_features])
X1 = tf.placeholder(tf.float32, shape=[None, n_features])

Wx = tf.Variable(tf.random_normal(shape=[n_features, n_neurons], dtype=tf.float32))
Wy = tf.Variable(tf.random_normal(shape=[n_neurons, n_neurons], dtype=tf.float32))
b = tf.Variable(tf.zeros([1, n_neurons], dtype=tf.float32))

# tanh as phi
Y0 = tf.tanh(tf.matmul(X0, Wx) + b)
Y1 = tf.tanh(tf.matmul(Y0, Wy) + tf.matmul(X1, Wx) + b)

init = tf.global_variables_initializer()

In [6]:
# mini-batches of size 4
X0_batch = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 0, 1]])  # time step 0 of mini-batch
X1_batch = np.array([[9, 8, 7], [0, 0, 0], [6, 5, 4], [3, 2, 1]])  # time step 1 of mini-batch


with tf.Session() as sess:
    init.run()
    Y0_val, Y1_val = sess.run([Y0, Y1], feed_dict={X0: X0_batch, X1: X1_batch})

In [7]:
Y0_val

array([[-0.0664006 ,  0.9625767 ,  0.68105793,  0.7091854 , -0.898216  ],
       [ 0.9977755 , -0.719789  , -0.9965761 ,  0.9673924 , -0.9998972 ],
       [ 0.99999774, -0.99898803, -0.9999989 ,  0.9967762 , -0.9999999 ],
       [ 1.        , -1.        , -1.        , -0.99818915,  0.9995087 ]],
      dtype=float32)

In [8]:
Y1_val

array([[ 1.        , -1.        , -1.        ,  0.4020025 , -0.9999998 ],
       [-0.12210421,  0.6280527 ,  0.9671843 , -0.9937122 , -0.25839362],
       [ 0.9999983 , -0.9999994 , -0.9999975 , -0.8594331 , -0.9999881 ],
       [ 0.99928284, -0.99999815, -0.9999058 ,  0.9857963 , -0.92205757]],
      dtype=float32)

## Task1

Perform the same computation using only one matrix multiplication per time step.

In [23]:
reset_graph() # just clear default graph and set seed for reproducibility

X0 = tf.placeholder(tf.float32, [None, n_features])
X1 = tf.placeholder(tf.float32, [None, n_features])

# W stacks Wx (first n_features rows) on top of Wy (last n_neurons rows)
W = tf.Variable(tf.random_normal(shape=[n_features + n_neurons, n_neurons], dtype=tf.float32))
b = tf.Variable(tf.zeros([1, n_neurons], dtype=tf.float32))

# tanh as phi
# step 0: the previous output is a zero matrix, so only the Wx rows of W contribute
Y0 = tf.tanh(tf.matmul(X0, W[:n_features]) + b)
print(X0.shape)
print(Y0.shape)

# step 1: concatenate the input with the previous output, then one matmul with W
z = tf.concat([X1, Y0], axis=-1)
print(z.shape)
Y1 = tf.tanh(tf.matmul(z, W) + b)

# the initializer must be created after the variables of the new graph
init = tf.global_variables_initializer()

with tf.Session() as sess:
    init.run()
    Y0_val_1, Y1_val_1 = sess.run([Y0, Y1], feed_dict={X0: X0_batch, X1: X1_batch})

(?, 3)
(?, 5)
(?, 8)

# Dynamic RNN

TensorFlow has a function `tf.contrib.rnn.static_rnn`, which creates a copy of the cell of the desired type for each time step (unrolling). Our implementation above follows `tf.nn.rnn_cell.BasicRNNCell`. This approach has a drawback: long sequences can need a lot of memory, and all of it has to be allocated at once. TF offers another option, `dynamic_rnn`, where memory is allocated dynamically for each provided sequence, according to its length.

Let's rewrite the code with `dynamic_rnn`.

As always in TensorFlow, the first step is writing a recipe.

In [None]:
n_steps = 2
n_features = 3
n_neurons = 5

reset_graph() # just clear default graph and set seed for reproducibility

# adding new parts to the default graph
X = tf.placeholder(tf.float32, [None, n_steps, n_features])

# we have created the same cell in the Task1;
basic_cell = tf.nn.rnn_cell.BasicRNNCell(num_units=n_neurons)

In [None]:
seq_length = tf.placeholder(tf.int32, [None]) # create placeholder to feed in real values;

# create dynamic_rnn and connect all existing graph components to it (i.e basic_cell, X, seq_length);
outputs, states = tf.nn.dynamic_rnn(basic_cell, X, dtype=tf.float32,
                                    sequence_length=seq_length)

Now create matrices with real values in numpy.

Notice that `X_batch` now has shape `[batch_size, n_steps, n_features]` (matching the placeholder shape `[None, n_steps, n_features]`), which differs from `Task1`.

That is because all time steps of the batch instances are now stored in a single array `X_batch` (in `Task1` we used two separate matrices, `X0` and `X1`, to feed the values for each time step).
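As a quick check of this layout, the two per-step matrices from `Task1` can be stacked into one array with NumPy. This is only an illustrative sketch; the name `X_stacked` is hypothetical and not part of the notebook.

```python
import numpy as np

X0_batch = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 0, 1]])  # time step 0
X1_batch = np.array([[9, 8, 7], [0, 0, 0], [6, 5, 4], [3, 2, 1]])  # time step 1

# stack along a new axis 1: the result is [batch_size, n_steps, n_features]
X_stacked = np.stack([X0_batch, X1_batch], axis=1)
print(X_stacked.shape)  # (4, 2, 3)
print(X_stacked[1])     # both time steps of the second instance
```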

In [None]:
X_batch = np.array([
        # step 0     step 1
        [[0, 1, 2], [9, 8, 7]], # instance 1
        [[3, 4, 5], [0, 0, 0]], # instance 2 (padded with zero vectors)
        [[6, 7, 8], [6, 5, 4]], # instance 3
        [[9, 0, 1], [3, 2, 1]], # instance 4
    ])

# sequence lengths
seq_length_batch = np.array([2, 1, 2, 2]) # note the length of second instance is 1

Feed these values into the network to get the values of outputs and states.

In [None]:
# create new session context manager;
# session will be closed as soon as this cell finish running
with tf.Session() as sess:
    tf.global_variables_initializer().run() # initialize all variables
    
    # run session and feed input values into the network, get outputs and states values
    outputs_val, states_val = sess.run(
        [outputs, states], feed_dict={X: X_batch, seq_length: seq_length_batch})

The shape of `outputs_val` is `[batch_size, time_steps, n_neurons]`, as it holds the outputs for each time step of each instance.

The shape of `states_val` is `[batch_size, n_neurons]`, as it holds only the last state for each instance of the batch.

__For the BasicRNNCell outputs and states are the same.__

In [None]:
print(outputs_val.shape)
print(states_val.shape)

In [None]:
# for the second sample there are zeros in output 
print(outputs_val)

In [None]:
# but in state there are not
print(states_val)

If we pass the `sequence_length` parameter to `dynamic_rnn`, it stops computing states once the actual sequence has ended: outputs past that point are zeros, and the last valid state is returned. If we don't provide `sequence_length`, the computation continues over the padding, and useful information about the sequence can be lost if the padding is long enough.
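The effect of `sequence_length` can be mimicked with a toy NumPy loop. This is a sketch under assumptions: `rnn_with_lengths` is a hypothetical helper with random weights, and it only imitates the masking behaviour described above, not TensorFlow's actual implementation.

```python
import numpy as np

def rnn_with_lengths(X, seq_length, W_x, W_y, b):
    """Toy tanh RNN; X has shape [batch, n_steps, n_features]."""
    batch, n_steps, _ = X.shape
    n_neurons = W_y.shape[0]
    state = np.zeros((batch, n_neurons))
    outputs = np.zeros((batch, n_steps, n_neurons))
    for t in range(n_steps):
        new_state = np.tanh(X[:, t] @ W_x + state @ W_y + b)
        active = (t < seq_length)[:, None]              # [batch, 1], True while the sequence lasts
        outputs[:, t] = np.where(active, new_state, 0)  # zeros after the sequence has ended
        state = np.where(active, new_state, state)      # keep the last valid state
    return outputs, state

rng = np.random.RandomState(42)
X_batch = np.array([
    [[0, 1, 2], [9, 8, 7]],
    [[3, 4, 5], [0, 0, 0]],   # padded instance, real length is 1
    [[6, 7, 8], [6, 5, 4]],
    [[9, 0, 1], [3, 2, 1]],
])
seq_length_batch = np.array([2, 1, 2, 2])
W_x, W_y, b = rng.randn(3, 5), rng.randn(5, 5), np.zeros(5)

outputs_val, states_val = rnn_with_lengths(X_batch, seq_length_batch, W_x, W_y, b)
print(outputs_val[1])  # second row is all zeros
print(states_val[1])   # equals the output at step 0
```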

# Name generation

Let's try to do something useful with our RNNs.

_Teaser:_

* It is hard to choose a name for a variable. But it's much harder to choose a name for a person.
  So let's make a neural net do it instead!
* The dataset consists of 8 thousand names of people from different cultures all around the world.
* Our toy task is to train a model for name generation.

In [None]:
start_token = " "

with open("names") as f:
    names = f.readlines()
    names = [start_token + name.lower() for name in names]

In [None]:
print('n samples = ', len(names))
for x in names[::1000]:
    print(x.strip().capitalize())

### Text processing

Let's take all the letters, disregarding case, plus the symbol ')' for the end of a name.

In [None]:
token_set = set()
for name in names:
    for letter in name:
        token_set.add(letter)


token_set.add(')')
tokens = list(token_set)
tokens.sort()

print('n_tokens = ', len(tokens))

In [None]:
token_to_id = {t: i for i, t in enumerate(tokens)}

id_to_token = {i: t for i, t in enumerate(tokens)}
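To see how the two dictionaries work together, here is a small round trip on a toy vocabulary. The names `toy_names`, `toy_tokens` and the two toy dictionaries are hypothetical and exist only for this illustration.

```python
# hypothetical toy data: names already prefixed with the start token " "
toy_names = [" anna", " bob"]

# same recipe as above: collect the symbols, add the end token, sort
toy_tokens = sorted(set("".join(toy_names)) | {")"})
toy_token_to_id = {t: i for i, t in enumerate(toy_tokens)}
toy_id_to_token = {i: t for i, t in enumerate(toy_tokens)}

encoded = [toy_token_to_id[ch] for ch in " bob)"]          # symbols -> ids
decoded = "".join(toy_id_to_token[i] for i in encoded)      # ids -> symbols

print(toy_tokens)
print(encoded)
print(repr(decoded))
```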

### Name length distribution

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
plt.hist(list(map(len, names)))

# max length of a name in this dataset
MAX_LEN = min([60, max(list(map(len, names)))])-1

print(MAX_LEN)

### Convert symbols to their ids

In [None]:
names_ix = list(map(lambda name: list(map(token_to_id.get, name + ')')), names))


for i in range(len(names_ix)):
    names_ix[i] = names_ix[i][:MAX_LEN+1] #crop too long
    
    if len(names_ix[i]) < MAX_LEN+1:
        names_ix[i] += [token_to_id[" "]]*(MAX_LEN+1 - len(names_ix[i])) #pad too short
        
assert len(set(map(len, names_ix))) == 1

names_ix = np.array(names_ix)

In [None]:
names_ix[:10]

### Batch generator

In [None]:
def sample_batch(data, batch_size):
    
    rows = data[np.random.randint(0, len(data), size=batch_size)]
    x = rows[:, :-1]
    y = rows[:, 1:]
    
    count = lambda r: np.sum([id_to_token[t] != ' ' for t in r])
    lengths = list(map(count, x))
    
    return x, y, lengths
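The slicing in `sample_batch` makes the target at every position equal to the next input token. A minimal sketch with hypothetical token ids (0 as the start/padding token, 1 as the end token) shows the shift:

```python
import numpy as np

# hypothetical encoded rows: 0 is the start/padding token, 1 is the end token
rows = np.array([[0, 5, 6, 7, 1, 0],
                 [0, 8, 9, 1, 0, 0]])

x = rows[:, :-1]  # everything except the last token
y = rows[:, 1:]   # everything except the first token

print(x)
print(y)
# y[:, t] is the token that follows x[:, t] in the original row
```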

In [None]:
x, y, length = sample_batch(names_ix, 10)
y.shape

In [None]:
x

In [None]:
y

In [None]:
length

## Network architecture and text generation process


We will implement the class `MyLittleNetwork`, which will be used to generate sequences.

<img src="https://vignette.wikia.nocookie.net/mlp/images/4/48/FANMADE_Rainbow_Dash_flying.png/revision/latest?cb=20121227194529" width="100" align="right">

The implemented class will have two useful properties:
* Several instances of the class can live in one default graph thanks to `tf.variable_scope()`
* Each class instance can be created with its own recurrent cell type.

These properties are useful because we want to compare several cell types by creating several class instances.

**Outline of our work**

1. **[[Build]](#Building-network-graph)** Creating network graph in `MyLittleNetwork.__init__` method
2. **[[Train]](#Train-part)** Creating train procedure in `MyLittleNetwork.train` method
3. **[[Infer]](#Sequence-generation)** Creating generation procedure in `MyLittleNetwork.generate_sample` method



### Building network graph

Consider method `__init__`. It takes several parameters that will be further discussed.

```python
...................................................
def __init__(self, scope_name,
             embedding_size = 8,
             cell_class = tf.contrib.rnn.BasicRNNCell,
             cell_params_dict = {'num_units': 60, 'activation':tf.tanh},
             vocabulary_size = len(tokens)):
...................................................         
```
 
Here `scope_name` is used to separate the graph variables belonging to this particular instance of the class `MyLittleNetwork`. `tf.variable_scope` adds `scope_name` as a prefix to the full name of all graph variables created inside it. In other words, `tf.variable_scope` provides namespaces in TF.
We save the `scope_name` parameter in `self.scope_name` so that this part of the global default graph can be used together with a particular class instance.

In the snippet below we create placeholders for the inputs `_X`, the targets `_y`, the sequence lengths and the learning rate within `scope_name`. We do that using the context manager `with tf.variable_scope`.
```python
...................................................
self.scope_name = scope_name
with tf.variable_scope(self.scope_name):
    self._X = tf.placeholder(tf.int32, [None, None], name= 'X')
    self._y = tf.placeholder(tf.int32, [None, None], name = 'y')
    self._lengths = tf.placeholder(tf.int32, [None], name = 'lengths')
    self._learning_rate_ph = tf.placeholder(dtype=tf.float32, shape=[], name = 'learning_rate_ph')
...................................................
```

Look at the `_X` placeholder. It expects int32 input values, and this is not a mistake.

As input we feed a sequence of numbers (the token ids from our dictionary). The shape of `_X` is not fixed in advance and could be anything, but in practice it is `(batch_size, max_sequence_length)`.

As you may remember from the part about [dynamic_rnn](#Dynamic-RNN), `dynamic_rnn` takes inputs with the shape `(batch_size, max_sequence_length, n_features)`. But how do we get the `n_features` dimension of the inputs? We use the `tf.nn.embedding_lookup` function to map the indices in `_X` to rows of the embedding matrix. We store the resulting vectors in the `embed` variable. `_embedding_mtx` is a regular `tf.Variable` with shape `(vocab_size, embedding_size)`.
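Conceptually, an embedding lookup is just row indexing into the embedding matrix. The NumPy sketch below illustrates the shapes involved; the sizes and the random matrix are assumptions made for the example, and it is not the TensorFlow code required by the task.

```python
import numpy as np

rng = np.random.RandomState(0)
vocab_size, embedding_size = 6, 4          # illustrative sizes
embedding_mtx = rng.randn(vocab_size, embedding_size)

# token ids with shape (batch_size=2, max_sequence_length=3)
X_ids = np.array([[0, 3, 5],
                  [0, 2, 2]])

# each id is replaced by the corresponding row of the embedding matrix
embed = embedding_mtx[X_ids]
print(embed.shape)  # (2, 3, 4) = (batch_size, max_sequence_length, embedding_size)
```

The last dimension, `embedding_size`, plays the role of `n_features` for the recurrent cell.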

After obtaining the embedding vectors for the input `_X`, we feed them into the RNN loop (i.e. `dynamic_rnn`), which returns outputs and states (you may remember how it works from the [dynamic_rnn](#Dynamic-RNN) paragraph).

We could then use `rnn_outputs`, `states`, or both to obtain logits. You can try different settings.
The simplest way is to use `rnn_outputs`, as it contains information about each time step (which is more than `states` contains). So, use any option to obtain `_pred_logits` (i.e. unnormalized scores for each token in the vocabulary).

In the last line of the snippet below you have to convert the targets `_y` to a one-hot representation using a tf function.
```python

...................................................
self._embedding_mtx = <create matrix of embeddings>
embed = < embed the input sequence >

self._cell = cell_class(**cell_params_dict)

rnn_outputs, states = tf.nn.dynamic_rnn(< choose params >)
self._pred_logits = < make logits >
labels_one_hot = < create one_hot for targets self._y >
...................................................
```

This is the last part of the architecture implementation.

`tf.nn.softmax_cross_entropy_with_logits` measures the probability error in discrete classification tasks in which the classes are mutually exclusive. For efficiency, the function applies `softmax` to the unnormalized logits internally. It returns a tensor of the same type as the logits, with the shape of the logits minus the last (vocabulary) dimension, holding the softmax cross-entropy loss. That means the loss is computed separately for each time step of each instance in the batch, over the whole vocabulary, using the formula: $$- \sum_i y_i \cdot \log(\hat{y}_i)$$

Since $y$ is a one-hot distribution, the resulting loss (for each position) depends only on the predicted probability of the correct token. Minimizing this loss makes the estimate $\hat{y}$ more similar to the distribution $y$.

But the loss must be a scalar value, not a tensor. That's why we apply `tf.reduce_mean` next.
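The loss computation can be reproduced in NumPy to make the shapes explicit. This is a minimal sketch, not the TensorFlow implementation; `softmax_cross_entropy` is a hypothetical helper, and the all-zero logits are chosen so that the result is easy to verify by hand.

```python
import numpy as np

def softmax_cross_entropy(labels_one_hot, logits):
    """Cross entropy over the last axis, computed from unnormalized logits."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -(labels_one_hot * log_probs).sum(axis=-1)

vocab_size = 4
y = np.array([[1, 3, 0],
              [2, 2, 1]])                    # targets, shape (batch_size, time_steps)
labels_one_hot = np.eye(vocab_size)[y]       # shape (2, 3, 4)
logits = np.zeros((2, 3, vocab_size))        # uniform prediction over 4 tokens

stepwise = softmax_cross_entropy(labels_one_hot, logits)  # shape (2, 3)
loss = stepwise.mean()                                     # scalar

print(stepwise.shape)
print(loss, np.log(vocab_size))  # both equal log(4) for a uniform prediction
```

With a uniform prediction every position contributes $-\log(1/4) = \log 4$, so the mean is $\log 4$ as well.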

Given the loss function, we can take `AdamOptimizer` and minimize it (i.e. calculate the gradients for a particular input and apply them to update the network params). We do that in the last line of this snippet. We also define `_pred_probas`, which is used only to generate sequences at the inference stage and is not needed at the train stage.

```python
....................................................
self._stepwise_cross_entropy = tf.nn.softmax_cross_entropy_with_logits(
                                                        labels=labels_one_hot,
                                                        logits=self._pred_logits)
self._loss = tf.reduce_mean(self._stepwise_cross_entropy, name='loss')

self._pred_probas = tf.nn.softmax(self._pred_logits, name='pred_probas')

self._train_op = tf.train.AdamOptimizer(self._learning_rate_ph).minimize(
    self._loss, name='train_op')
...................................................
```

### Train part

To run computations we need to create a new `tf.Session` or use an existing one that has not been closed yet.
In this code snippet a new session is created, but it is used without a context manager (i.e. without `with tf.Session() ...`). This matters because the session is also used at the inference stage, and we don't want to close it right after training finishes.

As always, we initialize the variables in this session within this variable scope.
```python
...................................................

def train(self, n_epochs=10, batches_per_epoch = 500, batch_size = 10, lr = 1e-2):

    losses = []
    self.sess = tf.Session() 
    with tf.variable_scope(self.scope_name):
        self.sess.run(tf.global_variables_initializer())
...................................................
```

In each epoch, for each batch, we run `_train_op` and get the value of `_loss`.

All the loss values are collected into the `losses` list, which is returned at the end of training.

See the next paragraph to understand the `generate_sample` function.

```python
...................................................
for epoch in range(n_epochs):
    print(">>Generated: ", self.generate_sample(n_snippets=6))
    print("-------\n")
    avg_cost = 0
    for batch in range(batches_per_epoch):
        x_, y_, len_ = sample_batch(names_ix, batch_size)

        _, iloss = self.sess.run([self._train_op, self._loss],
                                   {self._X: x_,
                                    self._y: y_,
                                    self._lengths: len_,
                                    self._learning_rate_ph: lr})
        avg_cost += iloss
        losses.append(iloss)

    print("EPOCH: ", epoch)
    print("AVERAGE LOSS: ", avg_cost / batches_per_epoch)

print(">>Generated: ", self.generate_sample(n_snippets=6))
...................................................
```

### Sequence generation
**Inference stage**

<img src="http://tommymullaney.com/img/google-hangouts-feature.png" width="400">

**How does it work?**

* Take a seed phrase
* Feed it to the network
* Predict the next token
    * The next token is sampled from the distribution predicted by the model
* Append the token to the seed phrase
* Repeat (from step 2)


**`def generate_sample()`** in the *class `MyLittleNetwork`* does exactly that, but it uses `numpy` for sampling.
It runs the session to get the probability distribution for the last token, then samples the next token from that distribution with `numpy`. The token is appended to the seed phrase, and the process starts again by feeding the extended phrase into the network. The picture illustrates the process. The generated name ends at the end token (here `)`) or when the maximum length is reached.
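The loop described above can be sketched independently of TensorFlow. In the sketch below `generate` and `toy_model` are hypothetical: `toy_model` stands in for the network and simply returns a fixed distribution, and the loop stops as soon as the end token is sampled, which is a simplification of what `generate_sample` does.

```python
import numpy as np

def generate(next_token_probs, seed_ids, end_id, max_len, rng):
    """Sample tokens one by one until the end token or max_len is reached."""
    ids = list(seed_ids)
    while len(ids) < max_len:
        p = next_token_probs(ids)                 # distribution over the vocabulary
        next_id = int(rng.choice(len(p), p=p))    # sample the next token id
        ids.append(next_id)
        if next_id == end_id:
            break
    return ids

def toy_model(ids):
    """Hypothetical stand-in for the network: vocabulary of 4 tokens, id 3 is the end token."""
    p = np.zeros(4)
    if len(ids) >= 3:
        p[3] = 1.0           # after three tokens always predict the end token
    else:
        p[1] = p[2] = 0.5    # otherwise choose between tokens 1 and 2
    return p

rng = np.random.RandomState(42)
sample = generate(toy_model, seed_ids=[0], end_id=3, max_len=10, rng=rng)
print(sample)
```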


It could be implemented more efficiently using `tf.multinomial` and `tf.while_loop`.
You could try to implement this sequence generation function using tf only. This part of the task is challenging and entirely optional.


## Task2

Add your code where necessary to create the network architecture.


In [None]:
reset_graph()

class MyLittleNetwork:
    def __init__(self, scope_name,
                 embedding_size = 8,
                 cell_class = tf.contrib.rnn.BasicRNNCell,
                 cell_params_dict = {'num_units': 60, 'activation':tf.tanh},
                 vocabulary_size = len(tokens)):
        
        self.scope_name = scope_name
        
        with tf.variable_scope(self.scope_name):
            
            #################### PLACE FOR YOUR CODE  BELOW #########################
            
            self._X = tf.placeholder(tf.int32, [None, None], name= 'X')
            self._y = tf.placeholder(tf.int32, [None, None], name = 'y')
            self._lengths = tf.placeholder(tf.int32, [None], name = 'lengths')
            self._learning_rate_ph = tf.placeholder(dtype=tf.float32, shape=[], name = 'learning_rate_ph')

            self._embedding_mtx = <create matrix of embeddings>
            embed = < embed the input sequence >
            
            self._cell = cell_class(**cell_params_dict)

            rnn_outputs, states = tf.nn.dynamic_rnn(< choose params >)
            self._pred_logits = < make logits >
            labels_one_hot = < create one_hot for targets self._y >
            
            ##################### END OF YOUR TASK HERE ##############################

            self._stepwise_cross_entropy = tf.nn.softmax_cross_entropy_with_logits(
                                                                    labels=labels_one_hot,
                                                                    logits=self._pred_logits)

            self._loss = tf.reduce_mean(self._stepwise_cross_entropy, name='loss')

            self._pred_probas = tf.nn.softmax(self._pred_logits, name='pred_probas')

            self._train_op = tf.train.AdamOptimizer(self._learning_rate_ph).minimize(
                self._loss, name='train_op')
    
    def generate_sample(self, seed_phrase=None, N=MAX_LEN, n_snippets=1):
        """
        If you don't want to reimplement the function with tf
                        don't touch it!
        """
        if seed_phrase is None:
            seed_phrase = ' '
        elif seed_phrase[0].isalpha():
            seed_phrase = ' ' + seed_phrase
        seed_phrase = seed_phrase.lower()
        seed_phrase = np.array([token_to_id[tok] for tok in seed_phrase])
        L = len(seed_phrase)
        snippets = []
        
        with tf.variable_scope(self.scope_name):
            for _ in range(n_snippets):
                x = np.zeros(N)
                x[:len(seed_phrase)] = seed_phrase
                for n in range(N - L):
                    feed_dict = {self._X: x[:L + n].reshape([1, -1]), self._lengths: [len(x)]}
                    p = self.sess.run(self._pred_probas[:, -1], feed_dict=feed_dict).reshape(-1)
                    ix = np.random.choice(np.arange(len(tokens)), p=p)
                    x[L + n] = ix
                snippet = ''.join([id_to_token[idx] for idx in x])
                if ')' in snippet:
                    upto = snippet.index(')')
                    snippet = snippet[:upto]
                snippets.append(snippet.strip().capitalize())
        return snippets

    def train(self, n_epochs=10, batches_per_epoch = 500, batch_size = 10, lr = 1e-2):

        losses = []
        self.sess = tf.Session() 
        with tf.variable_scope(self.scope_name):
            self.sess.run(tf.global_variables_initializer())

            for epoch in range(n_epochs):
                print(">>Generated: ", self.generate_sample(n_snippets=6))
                print("-------\n")
                avg_cost = 0
                for batch in range(batches_per_epoch):
                    x_, y_, len_ = sample_batch(names_ix, batch_size)

                    _, iloss = self.sess.run([self._train_op, self._loss],
                                               {self._X: x_,
                                                self._y: y_,
                                                self._lengths: len_,
                                                self._learning_rate_ph: lr})
                    avg_cost += iloss
                    losses.append(iloss)

                print("EPOCH: ", epoch)
                print("AVERAGE LOSS: ", avg_cost / batches_per_epoch)

            print(">>Generated: ", self.generate_sample(n_snippets=6))
        return losses

In [None]:
myBasicNN = MyLittleNetwork(scope_name="BasicRNNCell")

### Which params in the network are trainable?

Sometimes it is useful to look at trainable network parameters.

* **for comparison**
 - It is especially useful for comparing one recurrent cell type to another.

* **for sanity check**
 - Another reason is to check that everything is OK in your current default graph. There may be redundant, unwanted components. They may not take part in training but can clutter the graph visualisation in TensorBoard. Or maybe you forgot to set `trainable=False` for an embedding matrix with pretrained embeddings.

In [None]:
tf.trainable_variables(scope=myBasicNN.scope_name)

## Let's train it

**now train basic rnn**

In [None]:
loss_hist_basic_rnn = myBasicNN.train(n_epochs=1)

In [None]:
%time loss_hist_basic_rnn = myBasicNN.train(n_epochs=5)

In [None]:
%time myBasicNN.generate_sample(seed_phrase='Puti', n_snippets=6)

In [None]:
myBasicNN.generate_sample(seed_phrase='Q', n_snippets=6)

In [None]:
myBasicNN.generate_sample(seed_phrase='Eug', n_snippets=6)

In [None]:
myBasicNN.generate_sample(seed_phrase='Lu', n_snippets=6)

**now let's train LSTM**

In [None]:
myBasicLSTM = MyLittleNetwork(scope_name="BasicLSTMCell", cell_class=tf.nn.rnn_cell.BasicLSTMCell)

**check trainable params**

If you look at the shapes, you will see that LSTM has more params than BasicRNN.

In [None]:
tf.trainable_variables(scope=myBasicLSTM.scope_name)
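The difference in size can also be estimated by hand. The sketch below counts only the cell parameters (kernel and bias) and assumes the cell input is the embedding of size 8 and `num_units` is 60, as in the defaults above; the two helper functions are hypothetical and written for this illustration.

```python
def basic_rnn_param_count(input_size, num_units):
    # one kernel [input_size + num_units, num_units] plus one bias [num_units]
    return (input_size + num_units) * num_units + num_units

def basic_lstm_param_count(input_size, num_units):
    # four gates share one kernel [input_size + num_units, 4 * num_units]
    # plus one bias [4 * num_units]
    return (input_size + num_units) * 4 * num_units + 4 * num_units

embedding_size, num_units = 8, 60
rnn_params = basic_rnn_param_count(embedding_size, num_units)
lstm_params = basic_lstm_param_count(embedding_size, num_units)

print(rnn_params)   # 4140
print(lstm_params)  # 16560
```

Under these assumptions the LSTM cell has four times as many cell parameters as the basic RNN cell.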

In [None]:
# %time (unlike %timeit) keeps loss_hist_basic_lstm available for the loss plot below
%time loss_hist_basic_lstm = myBasicLSTM.train(n_epochs=5)

In [None]:
%time myBasicLSTM.generate_sample(seed_phrase='Puti', n_snippets=6)

# SRU implementation

There are a lot of different types of recurrent cells.
The $SRU$ (Simple Recurrent Unit) is one created to address the parallelism issue.

It was introduced in the paper [**TRAINING RNNS AS FAST AS CNNS**](https://arxiv.org/abs/1709.02755).

**Equations from the article**

1. $\tilde{x_{t}} = Wx_t$
2. $f_t = \sigma(W_f x_t + b_f)$
3. $r_t = \sigma(W_r x_t + b_r)$
4. $c_t = f_t \odot c_{t-1} + (1-f_t) \odot \tilde{x_{t}}$
5. $h_t = r_t \odot g(c_t) + (1-r_t) \odot x_t$

$\odot$ -- means point-wise multiplication

__Description__


Let's look at the first two steps.

[1] Given an input $x_t$ at time $t$, we compute a linear
transformation $\tilde{x_{t}}$ ...

[2] ... and the forget gate $f_t$

These computations depend on $x_t$ only, which _enables computing them in parallel_ across all time steps.
The forget gate is used to modulate the internal state $c_t$, which is used to compute the output state $h_t$.

In the simplest form of the $SRU$ the equation for $h_t$ looks like this:
$$h_t = g(c_t)$$
where $g(\cdot)$ is an activation function. But the authors decided to use a **highway connection** (i.e. to pass the inputs directly to the higher layers), and they added the reset gate [3] for this purpose. The reset gate is used to compute the output state $h_t$ [5] as a combination of the internal state $g(c_t)$ and the input $x_t$.
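The five equations can be traced in a few lines of NumPy. This is an illustrative sketch for a single sequence, not the TensorFlow cell required in the task below: `sru_forward` is a hypothetical helper, it uses the row-vector convention (`x @ W` instead of $Wx_t$), and it assumes the input and hidden sizes are equal so that the highway term $(1-r_t) \odot x_t$ is well defined.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sru_forward(x, W, W_f, b_f, W_r, b_r, g=np.tanh):
    """x: [n_steps, d]. Returns h: [n_steps, d] and the final c: [d]."""
    # equations 1-3 depend on x only, so they are computed for all steps at once
    x_tilde = x @ W
    f = sigmoid(x @ W_f + b_f)
    r = sigmoid(x @ W_r + b_r)

    # equations 4-5: only cheap element-wise operations remain in the loop
    c = np.zeros(x.shape[1])
    h = np.zeros_like(x)
    for t in range(x.shape[0]):
        c = f[t] * c + (1.0 - f[t]) * x_tilde[t]
        h[t] = r[t] * g(c) + (1.0 - r[t]) * x[t]
    return h, c

rng = np.random.RandomState(42)
n_steps, d = 5, 3  # illustrative sizes
x = rng.randn(n_steps, d)
W, W_f, W_r = (rng.randn(d, d) * 0.1 for _ in range(3))
b_f, b_r = np.zeros(d), np.zeros(d)

h, c = sru_forward(x, W, W_f, b_f, W_r, b_r)
print(h.shape, c.shape)
```

The matrix multiplications sit outside the time loop, which is the property that makes the SRU easy to parallelize.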

**Optional question to check yourself:**

*does SRUCell have to be as fast on inference stage as it was on the train stage?*

## Task3

Implement the SRU cell in the TF framework.

In [None]:
import tensorflow as tf
from tensorflow.contrib.rnn import RNNCell
from tensorflow.python.ops import variable_scope

In [None]:
class SRUCell(RNNCell):
    """Simple recurrent unit cell.
    The implementation of: https://arxiv.org/abs/1709.02755.
    """

    def __init__(self, num_units, activation=tf.nn.tanh, reuse=None):
        super(SRUCell, self).__init__(_reuse=reuse)
        self._num_units = num_units 
        self._activation = activation

        self.Wr = tf.Variable(self.init_matrix([self._num_units, self._num_units]))
        self.br = tf.Variable(self.init_matrix([self._num_units]))

        self.Wf = tf.Variable(self.init_matrix([self._num_units, self._num_units]))
        self.bf = tf.Variable(self.init_matrix([self._num_units]))

        self.W = tf.Variable(self.init_matrix([self._num_units, self._num_units]))
        
        # this will be True if we call the cell once
        self.used = False

    @property
    def state_size(self):
        return self._num_units

    @property
    def output_size(self):
        return self._num_units

    def call(self, inputs, state, scope=None):
        # return the (output, state) pair produced by __call__
        return self.__call__(inputs, state, scope)

    def __call__(self, inputs, state, scope=None):
        """
        f - forget gate
        r - reset gate
        c - final cell
        :param inputs:
        :param state:
        :param scope:
        :return: state, cell
        """

        if self.used:
            # The 'projector' layer already exists, so reuse=True makes
            # tf.layers.dense fetch the existing variables named 'projector'
            # instead of creating new ones.
            projected_inputs = tf.layers.dense(inputs, self._num_units, name='projector', reuse=True)
        else:
            # On the first call we create the 'projector' layer;
            # it projects the inputs to a convenient dimension (num_units).
            projected_inputs = tf.layers.dense(inputs,  self._num_units, name='projector')
            self.used = True
            
        with variable_scope.variable_scope(scope or type(self).__name__):

            # just to clarify
            # c_prev = state
            <write code to compute f, r, c>

            hidden_state = <compute hidden here>

            return hidden_state, c

    def init_matrix(self, shape):
        return tf.random_normal(shape, stddev=0.1)


## Checking

Check your implementation by running **Name generation** with this custom cell.

* Train `MyLittleNetwork` with your `SRUCell`. If your implementation is right it should work.
* Plot loss history of your model on the one plot with `BasicRNNCell` model and `BasicLSTMCell` model.

In [None]:
mySRUModel = MyLittleNetwork(scope_name='BasicSRU', cell_class=SRUCell,
                             cell_params_dict={'num_units': 120, 'activation':tf.tanh})

In [None]:
tf.trainable_variables(scope=mySRUModel.scope_name)

In [None]:
import timeit

start = timeit.default_timer()
for i in range(3):
    sru_loss_history = mySRUModel.train(n_epochs=1)
stop = timeit.default_timer()
execution_time = stop-start
print('OVERALL: {}; For one cycle: {}'.format(execution_time, execution_time/3))

In [None]:
%time sru_loss_history = mySRUModel.train(n_epochs=5)

In [None]:
%time mySRUModel.generate_sample(seed_phrase='Pur', n_snippets=6)

### Plot losses

In [None]:
def running_mean(x, N=1000):
    cumsum = np.cumsum(np.insert(x, 0, 0))
    return (cumsum[N:] - cumsum[:-N]) / float(N)

In [None]:
plt.figure(figsize=(9, 5))
plt.plot(running_mean(loss_hist_basic_rnn), label='BasicRNNCell', alpha=0.4)
plt.plot(running_mean(loss_hist_basic_lstm), label='BasicLSTMCell', alpha=0.4)
plt.plot(running_mean(sru_loss_history), label='SRUCell', alpha=0.4)

plt.title("Loss history")
plt.legend()
plt.show()

## Bonus part
### Do more interesting stuff

* Multi-layer (MultiRNNCell);
* Try to generate tweets using [this](http://study.mokoron.com) dataset.



# How to evaluate the work

**Check if the work meets the requirements below.**

Calculate final mark based on collected points.

* The code in Task1 contains 2 matrix multiplications in total and produces a similar (or the same) result as the code without this optimization. **(+2)**
* Generated names look like names; **(+3)**
* The model with SRU generates names as well as the models above; **(+3)**
* There's a plot with the loss history at the end; **(+1)**
    - loss should decrease smoothly
    
    
* If any **optional task** has been done (any of the tasks below); **(+1)**

    * The plot also includes a model with a multi-layer RNN or something similar;
    * Another dataset has been tried and the results are shown in the notebook;
    * The `generate_sample` function has been rewritten in tf;

**Final mark = ( sum of all the points )/2**