In [None]:
!wget https://raw.githubusercontent.com/brunoklein99/deep-learning-notes/master/dinos.txt -P datasets

In [None]:
import numpy as np
import random
import pprint
from lib.utils import *

# 1) Data Loading

In [None]:
# Read the whole file and close it deterministically
with open('./datasets/dinos.txt', 'r') as f:
    data = f.read()
data = data.lower()
chars = list(set(data))
data_size, vocab_size = len(data), len(chars)
print('There are %d total characters and %d unique characters in your data.' % (data_size, vocab_size))

There are 19909 total characters and 27 unique characters in your data.



* The characters are a-z (26 characters) plus the "\n" (or newline character).
* In this assignment, the newline character "\n" plays a role similar to the `<EOS>` (or "End of sentence") token discussed in lecture.  
    - Here, "\n" indicates the end of the dinosaur name rather than the end of a sentence.
* `char_to_ix`: In the cell below, we create a python dictionary (i.e., a hash table) to map each character to an index from 0-26.
* `ix_to_char`: We also create a second python dictionary that maps each index back to the corresponding character.
    -  This will help you figure out what index corresponds to what character in the probability distribution output of the softmax layer.

In [None]:
chars = sorted(chars)
print(chars)

['\n', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']


In [None]:
char_to_ix = {ch:i for i,ch in enumerate(chars) }
ix_to_char = {i:ch for i,ch in enumerate(chars) }
ix_to_char

{0: '\n',
 1: 'a',
 2: 'b',
 3: 'c',
 4: 'd',
 5: 'e',
 6: 'f',
 7: 'g',
 8: 'h',
 9: 'i',
 10: 'j',
 11: 'k',
 12: 'l',
 13: 'm',
 14: 'n',
 15: 'o',
 16: 'p',
 17: 'q',
 18: 'r',
 19: 's',
 20: 't',
 21: 'u',
 22: 'v',
 23: 'w',
 24: 'x',
 25: 'y',
 26: 'z'}
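
As a quick check of the two dictionaries, a name can be converted to indices and back. The sketch below is self-contained and rebuilds the vocabulary as newline plus a-z, matching the 27 characters listed above. The helper names `encode` and `decode` are hypothetical and are not part of the notebook's `lib.utils`.

```python
import string

# Assumption: the vocabulary is "\n" plus a-z, as printed above.
chars = sorted("\n" + string.ascii_lowercase)
char_to_ix = {ch: i for i, ch in enumerate(chars)}
ix_to_char = {i: ch for i, ch in enumerate(chars)}

def encode(name):
    """Map a string to a list of vocabulary indices (hypothetical helper)."""
    return [char_to_ix[ch] for ch in name]

def decode(indices):
    """Map a list of vocabulary indices back to a string (hypothetical helper)."""
    return "".join(ix_to_char[i] for i in indices)

encoded = encode("rex\n")
print(encoded)                 # [18, 5, 24, 0]
print(repr(decode(encoded)))   # 'rex\n'
```

Index 0 is the newline, so a decoded sequence ending in 0 is a complete name.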

# 2) Building blocks of the model

Your model will have the following structure:

- Initialize parameters
- Run the optimization loop
    - Forward propagation to compute the loss function
    - Backward propagation to compute the gradients of the loss with respect to the parameters
    - Clip the gradients to avoid exploding gradients
    - Using the gradients, update your parameters with the gradient descent update rule.
- Return the learned parameters
    
<img src="https://github.com/sebastianbirk/coursera-deep-learning-specialization/blob/master/05_sequence_models/02_character_level_language_model_with_numpy_rnn/images/rnn.png?raw=true" style="width:450;height:300px;">
<caption><center> **Figure 1**: Recurrent Neural Network, similar to what you had built in the previous notebook "Building a Recurrent Neural Network - Step by Step".  </center></caption>

* At each time-step, the RNN tries to predict the next character given the previous characters.
* The dataset $\mathbf{X} = (x^{\langle 1 \rangle}, x^{\langle 2 \rangle}, ..., x^{\langle T_x \rangle})$ is a list of characters in the training set.
* $\mathbf{Y} = (y^{\langle 1 \rangle}, y^{\langle 2 \rangle}, ..., y^{\langle T_x \rangle})$ is the same list of characters but shifted one character forward.
* At every time-step $t$, $y^{\langle t \rangle} = x^{\langle t+1 \rangle}$.  The target at time $t$ is the same as the input at time $t + 1$.
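
The shift between $\mathbf{X}$ and $\mathbf{Y}$ can be seen on a single name. The sketch below follows the convention the training loop uses later in this notebook: `None` stands for the all-zero first input, and the newline index closes the target sequence. The function name `make_example` is hypothetical.

```python
import string

# Assumption: the vocabulary is "\n" plus a-z, indexed in sorted order.
chars = sorted("\n" + string.ascii_lowercase)
char_to_ix = {ch: i for i, ch in enumerate(chars)}

def make_example(name):
    """Build one (X, Y) training pair from a name (hypothetical helper)."""
    # X starts with None, which stands for the zero vector fed at the first step.
    X = [None] + [char_to_ix[ch] for ch in name]
    # Y is X shifted by one step and terminated by the newline index.
    Y = X[1:] + [char_to_ix["\n"]]
    return X, Y

X, Y = make_example("rex")
print(X)  # [None, 18, 5, 24]
print(Y)  # [18, 5, 24, 0]
```

Both lists have the same length, and `Y[t] == X[t + 1]` for every step except the last, where the target is the newline.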


## 2.1 Clipping the gradients in the optimization loop

**Exploding gradients**
* When gradients are very large, they're called "exploding gradients."
* Exploding gradients make the training process more difficult, because the parameter updates may be so large that they "overshoot" the optimal values.

Recall that your overall loop structure usually consists of:
* forward pass,
* cost computation,
* backward pass,
* parameter update.

Before updating the parameters, you will perform gradient clipping to make sure that your gradients are not "exploding."

**Gradient clipping**

In the exercise below, you will implement a function `clip` that takes in a dictionary of gradients and returns the gradients with any out-of-range values clipped.
* There are different ways to clip gradients.
* We will use a simple element-wise clipping procedure, in which every element of the gradient vector is clipped to lie within a range [-N, N].
* For example, if N=10
    - The range is [-10, 10]
    - If any component of the gradient vector is greater than 10, it is set to 10.
    - If any component of the gradient vector is less than -10, it is set to -10.
    - If any components are between -10 and 10, they keep their original values.
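
The N=10 case above can be checked on a small array. This is a minimal sketch with made-up gradient values, using `np.clip` for the element-wise operation.

```python
import numpy as np

# Made-up gradient values, for illustration only
grad = np.array([[-25.0, -10.0, -3.5],
                 [0.0, 9.99, 42.0]])
N = 10

# Values below -N become -N, values above N become N, the rest are unchanged
clipped = np.clip(grad, -N, N)
print(clipped.tolist())  # [[-10.0, -10.0, -3.5], [0.0, 9.99, 10.0]]
```

Only the two out-of-range entries change. Element-wise clipping can therefore alter the direction of the gradient vector, which is one reason other clipping schemes exist.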

<img src="https://github.com/sebastianbirk/coursera-deep-learning-specialization/blob/master/05_sequence_models/02_character_level_language_model_with_numpy_rnn/images/clip.png?raw=true" style="width:400;height:150px;">
<caption><center> **Figure 2**: Visualization of gradient descent with and without gradient clipping, in a case where the network is running into "exploding gradient" problems. </center></caption>

In [None]:
def clip(gradients, maxValue):
    dWaa, dWax, dWya, db, dby = gradients['dWaa'], gradients['dWax'], gradients['dWya'], gradients['db'], gradients['dby']

    # clip to mitigate exploding gradients, loop over [dWax, dWaa, dWya, db, dby]
    for gradient in [dWax, dWaa, dWya, db, dby]:
        np.clip(gradient, -maxValue, maxValue, out=gradient)
    gradients = {"dWaa": dWaa, "dWax": dWax, "dWya": dWya, "db": db, "dby": dby}
    return gradients

In [None]:
# Test with a maxValue of 5
maxValue = 5
dWax = np.random.randn(5,3)*10
dWaa = np.random.randn(5,5)*10
dWya = np.random.randn(2,5)*10
db = np.random.randn(5,1)*10
dby = np.random.randn(2,1)*10
gradients = {"dWax": dWax, "dWaa": dWaa, "dWya": dWya, "db": db, "dby": dby}
gradients = clip(gradients, maxValue)
print("gradients[\"dWaa\"][1][2] =", gradients["dWaa"][1][2])
print("gradients[\"dWax\"][3][1] =", gradients["dWax"][3][1])
print("gradients[\"dWya\"][1][2] =", gradients["dWya"][1][2])
print("gradients[\"db\"][4] =", gradients["db"][4])
print("gradients[\"dby\"][1] =", gradients["dby"][1])

gradients["dWaa"][1][2] = 4.373479451817902
gradients["dWax"][3][1] = 5.0
gradients["dWya"][1][2] = 3.3458167583803884
gradients["db"][4] = [-1.25954685]
gradients["dby"][1] = [-3.3819235]


## 2.2 Sampling

Now assume that your model is trained. You would like to generate new text (characters). The process of generation is explained in the picture below:

<img src="https://github.com/sebastianbirk/coursera-deep-learning-specialization/blob/master/05_sequence_models/02_character_level_language_model_with_numpy_rnn/images/dinos3.png?raw=true" style="width:500;height:300px;">
<caption><center> **Figure 3**: In this picture, we assume the model is already trained. We pass in $x^{\langle 1\rangle} = \vec{0}$ at the first time step, and have the network sample one character at a time. </center></caption>
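
One sampling step from Figure 3 can be sketched in isolation: turn the logits into a probability distribution, draw an index from it, and feed the corresponding one-hot vector back in as the next input. The logits here are random stand-ins for the network output, and the sketch uses NumPy's `default_rng` generator with an arbitrary seed rather than the global random state used in the notebook.

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability before exponentiating
    e_z = np.exp(z - np.max(z))
    return e_z / e_z.sum(axis=0)

rng = np.random.default_rng(0)             # arbitrary seed, an assumption
vocab_size = 27

z = rng.standard_normal((vocab_size, 1))   # stand-in for Wya @ a + by
y = softmax(z)                             # probabilities over the vocabulary

# Draw one character index according to the distribution y
idx = rng.choice(vocab_size, p=y.ravel())

# The sampled character becomes the next input, encoded as a one-hot vector
x_next = np.zeros((vocab_size, 1))
x_next[idx] = 1
```

Sampling from `y` instead of taking its argmax is what lets repeated calls produce different names from the same parameters.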

In [None]:
def sample(parameters, char_to_ix, seed=99):

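    # Note: `seed` is not used in this implementation; sampling draws from NumPy's global random state.
    # Retrieve parameters and relevant shapes from "parameters" dictionary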
    Waa, Wax, Wya, by, b = parameters['Waa'], parameters['Wax'], parameters['Wya'], parameters['by'], parameters['b']
    vocab_size = by.shape[0]
    n_a = Waa.shape[1]

    # Step 1: Create a zero vector x to use as the first input, and a zero initial hidden state a_prev
    x = np.zeros((vocab_size, 1))
    a_prev = np.zeros((n_a, 1))

    # Create an empty list of indices
    indices = []
    idx = -1

    # Loop over time-steps t. At each time-step:
    # sample a character from a probability distribution and append its index (`idx`) to the list "indices".
    # We'll stop if we reach 50 characters
    # Setting the maximum number of characters helps with debugging and prevents infinite loops.
    counter = 0
    newline_character = char_to_ix['\n']

    while (idx != newline_character and counter != 50):
        # Step 2: Forward propagate x: hidden state a, logits z, then softmax probabilities y
        a = np.tanh(np.dot(Wax, x) + np.dot(Waa, a_prev) + b)
        z = np.dot(Wya, a) + by
        y = softmax(z)

        # Step 3: Sample the index of a character within the vocabulary from the probability distribution y
        idx = np.random.choice(vocab_size, p=y.ravel())

        # Append the index to "indices"
        indices.append(idx)

        # Step 4: Overwrite the input character as the one corresponding to the sampled index.
        x = np.zeros((vocab_size, 1))
        x[idx] = 1

        # Update "a_prev" to be "a"
        a_prev = a
        counter +=1

    if (counter == 50):
        indices.append(char_to_ix['\n'])
    return indices

In [None]:
_, n_a = 20, 100
Wax, Waa, Wya = np.random.randn(n_a, vocab_size), np.random.randn(n_a, n_a), np.random.randn(vocab_size, n_a)
b, by = np.random.randn(n_a, 1), np.random.randn(vocab_size, 1)
parameters = {"Wax": Wax, "Waa": Waa, "Wya": Wya, "b": b, "by": by}

indices = sample(parameters, char_to_ix, 0)
print("Sampling:")
print("list of sampled indices:\n", indices)
print("list of sampled characters:\n", [ix_to_char[i] for i in indices])

Sampling:
list of sampled indices:
 [16, 25, 17, 5, 7, 12, 15, 16, 1, 15, 26, 5, 15, 21, 9, 26, 23, 16, 15, 10, 1, 15, 8, 0]
list of sampled characters:
 ['p', 'y', 'q', 'e', 'g', 'l', 'o', 'p', 'a', 'o', 'z', 'e', 'o', 'u', 'i', 'z', 'w', 'p', 'o', 'j', 'a', 'o', 'h', '\n']


# 3) Model Building

* Given the dataset of dinosaur names, we use each line of the dataset (one name) as one training example.
* Every 2000 steps of stochastic gradient descent, the model samples 7 names (the default value of `dino_names`) to show how the algorithm is doing.
* Remember to shuffle the dataset, so that stochastic gradient descent visits the examples in random order.

In [None]:
def clip(gradients, maxValue):
    dWaa, dWax, dWya, db, dby = gradients['dWaa'], gradients['dWax'], gradients['dWya'], gradients['db'], gradients['dby']

    # clip to mitigate exploding gradients, loop over [dWax, dWaa, dWya, db, dby]
    for gradient in [dWax, dWaa, dWya, db, dby]:
        np.clip(gradient, -maxValue, maxValue, out=gradient)

    gradients = {"dWaa": dWaa, "dWax": dWax, "dWya": dWya, "db": db, "dby": dby}
    return gradients

def initialize_parameters(n_a, n_x, n_y):
    Wax = np.random.randn(n_a, n_x)*0.01 # input to hidden (50, 27)
    Waa = np.random.randn(n_a, n_a)*0.01 # hidden to hidden (50, 50)
    Wya = np.random.randn(n_y, n_a)*0.01 # hidden to output (27, 50)
    b = np.zeros((n_a, 1)) # hidden bias (50, 1)
    by = np.zeros((n_y, 1)) # output bias (27, 1)

    parameters = {"Wax": Wax, "Waa": Waa, "Wya": Wya, "b": b,"by": by}
    return parameters

def get_initial_loss(vocab_size, seq_length):
    return -np.log(1.0 / vocab_size) * seq_length

def smooth(loss, cur_loss):
    return loss * 0.999 + cur_loss * 0.001

def softmax(x):
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum(axis=0)

def rnn_step_forward(parameters, a_prev, x):
    Waa, Wax, Wya, by, b = parameters['Waa'], parameters['Wax'], parameters['Wya'], parameters['by'], parameters['b']
    a_next = np.tanh(np.dot(Wax, x) + np.dot(Waa, a_prev) + b) # (50, 1)
    p_t = softmax(np.dot(Wya, a_next) + by) # (27, 1)
    return a_next, p_t

def rnn_forward(X, Y, a0, parameters, vocab_size=27):
    # Initialize x, a and y_hat as empty dictionaries
    x, a, y_hat = {}, {}, {}
    a[-1] = np.copy(a0) # (50, 1)

    # initialize your loss to 0
    loss = 0
    for t in range(len(X)):
        # Set x[t] to the one-hot vector for X[t]; if X[t] is None, x[t] stays all zeros.
        x[t] = np.zeros((vocab_size, 1)) # (27, 1)
        if X[t] is not None:
            x[t][X[t]] = 1

        # Run one step forward of the RNN
        a[t], y_hat[t] = rnn_step_forward(parameters, a[t-1], x[t])

        # Update the loss by subtracting the log-probability of the target character (softmax cross-entropy).
        loss -= np.log(y_hat[t][Y[t], 0])
    cache = (y_hat, a, x)
    return loss, cache

def update_parameters(parameters, gradients, lr):
    parameters['Wax'] += -lr * gradients['dWax']
    parameters['Waa'] += -lr * gradients['dWaa']
    parameters['Wya'] += -lr * gradients['dWya']
    parameters['b'] += -lr * gradients['db']
    parameters['by'] += -lr * gradients['dby']
    return parameters

def rnn_step_backward(dy, gradients, parameters, x, a, a_prev):
    gradients['dWya'] += np.dot(dy, a.T)
    gradients['dby'] += dy
    da = np.dot(parameters['Wya'].T, dy) + gradients['da_next'] # backprop into the hidden state a
    daraw = (1 - a * a) * da # backprop through tanh nonlinearity
    gradients['db'] += daraw
    gradients['dWax'] += np.dot(daraw, x.T)
    gradients['dWaa'] += np.dot(daraw, a_prev.T)
    gradients['da_next'] = np.dot(parameters['Waa'].T, daraw)
    return gradients

def rnn_backward(X, Y, parameters, cache):
    gradients = {}
    (y_hat, a, x) = cache
    Waa, Wax, Wya, by, b = parameters['Waa'], parameters['Wax'], parameters['Wya'], parameters['by'], parameters['b']

    # each one should be initialized to zeros of the same dimension as its corresponding parameter
    gradients['dWax'], gradients['dWaa'], gradients['dWya'] = np.zeros_like(Wax), np.zeros_like(Waa), np.zeros_like(Wya)
    gradients['db'], gradients['dby'] = np.zeros_like(b), np.zeros_like(by)
    gradients['da_next'] = np.zeros_like(a[0])

    # Backpropagate through time
    for t in reversed(range(len(X))):
        dy = np.copy(y_hat[t])
        dy[Y[t]] -= 1
        gradients = rnn_step_backward(dy, gradients, parameters, x[t], a[t], a[t-1])
    return gradients, a

def optimize(X, Y, a_prev, parameters, learning_rate = 0.01):
    # Forward propagate through time
    loss, cache = rnn_forward(X, Y, a_prev, parameters)
    # Backpropagate through time
    gradients, a = rnn_backward(X, Y, parameters, cache)
    # Clip your gradients between -5 (min) and 5 (max)
    gradients = clip(gradients, 5)
    # Update parameters
    parameters = update_parameters(parameters, gradients, learning_rate)
    return loss, gradients, a[len(X)-1]

In [None]:
def model(data, ix_to_char, char_to_ix, num_iterations=2000, n_a=50, dino_names=7, vocab_size=27):

    # Retrieve n_x and n_y from vocab_size
    n_x, n_y = vocab_size, vocab_size # 27

    # Initialize parameters
    parameters = initialize_parameters(n_a, n_x, n_y)

    # Initialize loss (this is required because we want to smooth our loss)
    loss = get_initial_loss(vocab_size, dino_names) # scalar

    # Build list of all dinosaur names (training examples).
    with open("./datasets/dinos.txt") as f:
        examples = f.readlines()
    examples = [x.lower().strip() for x in examples]

    # Shuffle list of all dinosaur names
    np.random.shuffle(examples)

    # Initialize the hidden state of your RNN
    a_prev = np.zeros((n_a, 1)) #(50, 1)

    # Optimization loop
    for j in range(num_iterations):
        # define one training example (X,Y)
        index = j % len(examples)
        X = [None] + [char_to_ix[ch] for ch in examples[index]]
        Y = X[1:] + [char_to_ix["\n"]]

        # Perform one optimization step: Forward-prop -> Backward-prop -> Clip -> Update parameters
        # Choose a learning rate of 0.01
        curr_loss, gradients, a_prev = optimize(X, Y, a_prev, parameters)

        # Keep an exponentially weighted moving average of the loss so that the reported value is smooth.
        loss = smooth(loss, curr_loss)

        # Every 2000 iterations, generate `dino_names` names with sample() to check whether the model is learning properly
        if j % 2000 == 0:
            print('Iteration: %d, Loss: %f' % (j, loss) + '\n')

            # The number of dinosaur names to print
            for name in range(dino_names):
                sampled_indices = sample(parameters, char_to_ix)
                print_sample(sampled_indices, ix_to_char)
            print('\n')
    return parameters

In [None]:
parameters = model(data, ix_to_char, char_to_ix)

Iteration: 0, Loss: 23.100519

Emxlxjtwaqckolvpapftmkwblkrifncrequcgfxqrcpuqysthu
Laepgprgjarmtczzcobdtpzqwcgai
Ticwtwahdjirlbygznwdtovowmcbaguoh
Ywyughwhgionja
Tomgvgdanurktsptezrxtagxjutvmjeflgadtsdsmcxdznvgbd
Dhbb
Lukoeodeztwzzcggnarppaddjnfphluwjhypkopwbmifailf




```Python
Iteration: 34000, Loss: 22.447230

Onyxipaledisons
Kiabaeropa
Lussiamang
Pacaeptabalsaurus
Xosalong
Eiacoteg
Troia
```

The RNN-Shakespeare model is very similar to the one you have built for dinosaur names. The only major differences are:
- LSTMs instead of the basic RNN to capture longer-range dependencies
- The model is a deeper, stacked LSTM model (2 layers)
- Using Keras instead of hand-written NumPy code to simplify the implementation

**References**:
- This exercise took inspiration from Andrej Karpathy's implementation: https://gist.github.com/karpathy/d4dee566867f8291f086. To learn more about text generation, also check out Karpathy's [blog post](http://karpathy.github.io/2015/05/21/rnn-effectiveness/).
- For the Shakespearian poem generator, our implementation was based on the implementation of an LSTM text generator by the Keras team: https://github.com/keras-team/keras/blob/master/examples/lstm_text_generation.py