# Character level language model - Dinosaurus Island

Using a list of dinosaur names, build a character-level language model to generate new dinosaur names.

In [1]:
import numpy as np
from utils import *
import random
import pprint

## 1. Problem Statement

### 1.1 - Dataset and Preprocessing

In [2]:
data = open('dinos.txt', 'r').read()
data = data.lower()
chars = list(set(data))
data_size, vocab_size = len(data), len(chars)
print('There are %d total characters and %d unique characters in your data.' % (data_size, vocab_size))

chars = sorted(chars)
print(chars)

There are 19909 total characters and 27 unique characters in your data.
['\n', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']


In [3]:
char_to_ix = { ch:i for i,ch in enumerate(chars) } # creates a dictionary that maps each char to an index (0-26)
ix_to_char = { i:ch for i,ch in enumerate(chars) } # creates a dictionary that maps each index to its char
pp = pprint.PrettyPrinter(indent=4)
pp.pprint(ix_to_char)

{   0: '\n',
    1: 'a',
    2: 'b',
    3: 'c',
    4: 'd',
    5: 'e',
    6: 'f',
    7: 'g',
    8: 'h',
    9: 'i',
    10: 'j',
    11: 'k',
    12: 'l',
    13: 'm',
    14: 'n',
    15: 'o',
    16: 'p',
    17: 'q',
    18: 'r',
    19: 's',
    20: 't',
    21: 'u',
    22: 'v',
    23: 'w',
    24: 'x',
    25: 'y',
    26: 'z'}


### 1.2 - Overview of the model

Your model will have the following structure: 

1. Initialize parameters 
2. Run the optimization loop
    - Forward propagation to compute the loss function
    - Backward propagation to compute the gradients of the loss with respect to the parameters
    - Clip the gradients to avoid exploding gradients
    - Using the gradients, update your parameters with the gradient descent update rule.
3. Return the learned parameters 


At each time-step, the RNN tries to predict what the next character is given the previous characters. 
* $\mathbf{X} = (x^{\langle 1 \rangle}, x^{\langle 2 \rangle}, ..., x^{\langle T_x \rangle})$ = a list of characters in the training set
* $\mathbf{Y} = (y^{\langle 1 \rangle}, y^{\langle 2 \rangle}, ..., y^{\langle T_x \rangle})$ = the same list of characters but shifted one character forward

At every $t$, $y^{\langle t \rangle} = x^{\langle t+1 \rangle}$.
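
To make the shift concrete, here is a minimal sketch that builds `X` and `Y` for a single name. The name `"trex"` and the hand-built vocabulary are illustrative assumptions; the notebook builds the real vocabulary from `dinos.txt`. The `None` placeholder at the start of `X` mirrors the training loop in section 3.2, where it stands in for the zero-vector first input.

```python
# Illustrative vocabulary: newline plus the 26 lowercase letters (same ordering as the notebook's sorted chars).
chars = ['\n'] + [chr(c) for c in range(ord('a'), ord('z') + 1)]
char_to_ix = {ch: i for i, ch in enumerate(chars)}

name = "trex"  # hypothetical training example

# X starts with a placeholder for the zero-vector input, followed by the character indices.
X = [None] + [char_to_ix[ch] for ch in name]

# Y is X shifted one position to the left, ending with the newline (end-of-name) index.
Y = X[1:] + [char_to_ix['\n']]

print(X)
print(Y)
```

Each `Y[t]` equals `X[t+1]`, and the final target is the newline character that marks the end of the name.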

## 2 - Building blocks of the model

### 2.1 - Gradient Clipping

**Exploding gradients** occur when gradient values become very large. They make training unstable, because a large gradient produces a large parameter update that can "overshoot" the optimal values. Clipping the gradients before each parameter update keeps every gradient entry within a fixed range.

**`clip() Overview`**

`Clips the gradients' values between minimum and maximum.`
    
**`Arguments`**
```
gradients = dictionary of gradients "dWaa", "dWax", "dWya", "db", "dby"
maxValue = everything above this number is set to this number, and everything less than -maxValue is set to -maxValue
```
**`Returns`**
```
gradients = dictionary of clipped gradients
```

In [4]:
def clip(gradients, maxValue):
    
    dWaa, dWax, dWya, db, dby = gradients['dWaa'], gradients['dWax'], gradients['dWya'], gradients['db'], gradients['dby']
   
    for gradient in [dWax, dWaa, dWya, db, dby]:
        np.clip(gradient, -maxValue, maxValue, out=gradient)
    
    gradients = {"dWaa": dWaa, "dWax": dWax, "dWya": dWya, "db": db, "dby": dby}
    
    return gradients
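
As a quick illustration of what the clipping does to a single gradient array, the sketch below applies `np.clip` in place with `out=`, the same call used inside `clip()`. The array values and the name `max_value` are made up for the example.

```python
import numpy as np

# A made-up gradient with two entries outside the range [-5, 5].
grad = np.array([[-12.0, 0.5],
                 [3.0, 40.0]])
max_value = 5

# Clip in place: entries below -max_value become -max_value, entries above max_value become max_value.
np.clip(grad, -max_value, max_value, out=grad)
print(grad)
```

Entries already inside the range are left unchanged, so only the direction-preserving magnitude cap is applied element by element.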

### 2.2 - Sampling

With sampling, we assume that our model is trained and generate new text (characters).

<img src="images/dinos3.png" style="width:500;height:300px;">
<caption><center> **Figure 3**: In this picture, we assume the model is already trained. We pass in $x^{\langle 1\rangle} = \vec{0}$ at the first time step, and have the network sample one character at a time. </center></caption>

**Step 1**: Input the "dummy" vector of zeros $x^{\langle 1 \rangle} = \vec{0}$ and set $a^{\langle 0 \rangle} = \vec{0}$.


**Step 2**: Run one step of forward propagation to get $a^{\langle 1 \rangle}$ and $\hat{y}^{\langle 1 \rangle}$. $\hat{y}^{\langle t+1 \rangle}_i$ represents the probability that the character indexed by "i" is the next character.  


$$ a^{\langle t+1 \rangle} = \tanh(W_{ax}  x^{\langle t+1 \rangle } + W_{aa} a^{\langle t \rangle } + b)\tag{1}$$

$$ z^{\langle t + 1 \rangle } = W_{ya}  a^{\langle t + 1 \rangle } + b_y \tag{2}$$

$$ \hat{y}^{\langle t+1 \rangle } = softmax(z^{\langle t + 1 \rangle })\tag{3}$$
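
Equations (1) to (3) can be sketched for a single time step as follows. The local `softmax` is an assumption about what the helper imported from `utils` does, and the random weights are placeholders rather than trained parameters. The sizes `n_a = 50` and `vocab_size = 27` match the defaults used later in `model()`.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a column vector (assumed equivalent to the utils helper).
    e = np.exp(z - np.max(z))
    return e / e.sum(axis=0)

vocab_size, n_a = 27, 50
rng = np.random.default_rng(0)

# Placeholder parameters with the shapes the notebook expects.
Wax = rng.standard_normal((n_a, vocab_size)) * 0.01
Waa = rng.standard_normal((n_a, n_a)) * 0.01
Wya = rng.standard_normal((vocab_size, n_a)) * 0.01
b = np.zeros((n_a, 1))
by = np.zeros((vocab_size, 1))

# Step 1 inputs: zero input vector and zero initial hidden state.
x = np.zeros((vocab_size, 1))
a_prev = np.zeros((n_a, 1))

a = np.tanh(Wax @ x + Waa @ a_prev + b)  # equation (1)
z = Wya @ a + by                         # equation (2)
y_hat = softmax(z)                       # equation (3)

print(y_hat.shape, float(y_hat.sum()))
```

With zero inputs and zero biases, the first hidden state is zero and `y_hat` is uniform over the vocabulary; trained, nonzero biases are what make the first sampled letter non-uniform.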


**Step 3**: Now that we have $\hat{y}^{\langle t+1 \rangle}$, we select the next letter in the dinosaur name. Rather than always taking the most likely letter, which would produce the same name every time, we sample the next character's index from the probability distribution specified by $\hat{y}^{\langle t+1 \rangle }$. This means that if $\hat{y}^{\langle t+1 \rangle }_i = 0.16$, you will pick the index "i" with 16% probability. 
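
The sampling step can be checked empirically with a small sketch. The four-element distribution `p` is a made-up stand-in for $\hat{y}$, and drawing many samples at once is only for illustration; `sample()` draws one index per time step.

```python
import numpy as np

np.random.seed(0)

# Made-up probability distribution over four indices (stands in for y_hat.ravel()).
p = np.array([0.1, 0.0, 0.7, 0.2])

# Draw many indices according to p and measure how often each one appears.
draws = np.random.choice(len(p), size=10000, p=p)
freq = np.bincount(draws, minlength=len(p)) / len(draws)

print(freq)
```

The empirical frequencies should land close to `p`, and an index with probability 0 is never drawn.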


**Step 4**: Update `x` from $x^{\langle t \rangle }$ to $x^{\langle t + 1 \rangle }$, the one-hot vector corresponding to the randomly selected character. We iterate this process until a "\n" character is picked, indicating the end of the dinosaur name. 

**`sample() Overview`**

`Samples a sequence of characters according to the probability distributions output by the RNN.`

**`Arguments`**
```
parameters = dictionary of parameters Waa, Wax, Wya, by, and b
char_to_ix = dictionary mapping each char to an index
seed = integer used to seed NumPy's random number generator
```
**`Returns`**
```
indices = list of length n containing indices of sampled chars
```

In [9]:
def sample(parameters, char_to_ix, seed):
    
    # retrieve parameters and shapes
    Waa, Wax, Wya, by, b = parameters['Waa'], parameters['Wax'], parameters['Wya'], parameters['by'], parameters['b']
    vocab_size = by.shape[0]
    n_a = Waa.shape[1]
    
    # Step 1: Create Zero Vector X 
    x = np.zeros((vocab_size, 1))          # initialize sequence generation
    a_prev = np.zeros((n_a, 1))            # initialize a_prev
    indices = []                           # indices of characters to generate
    idx = -1                               # initialize idx (index of the one-hot vector x that is set to 1) 
    
    counter = 0
    newline_character = char_to_ix['\n']
    
    while (idx != newline_character and counter != 50):
        
        # Step 2: Forward Propagate X
        a = np.tanh(np.dot(Wax, x) + np.dot(Waa, a_prev) + b)
        z = np.dot(Wya, a) + by
        y = softmax(z)
        
        np.random.seed(counter+seed) 
        
        # Step 3: Sample Char Index from Probability Distribution y
        idx = np.random.choice(list(range(vocab_size)), p=y.ravel())
        indices.append(idx)
        
        # Step 4: Overwrite Input X with Char of `idx`
        x = np.zeros((vocab_size, 1))
        x[idx] = 1
        
        a_prev = a
        seed += 1
        counter +=1
        
    if (counter == 50):
        indices.append(char_to_ix['\n'])
    
    return indices
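
The list returned by `sample()` contains indices, not text. A minimal sketch of turning it back into a name is shown below; `decode_sample` is a hypothetical helper written for this example (the notebook itself uses `print_sample` from `utils`, whose implementation is not shown here), and the index list is hand-picked.

```python
# Illustrative vocabulary: newline plus the 26 lowercase letters.
chars = ['\n'] + [chr(c) for c in range(ord('a'), ord('z') + 1)]
ix_to_char = {i: ch for i, ch in enumerate(chars)}

def decode_sample(indices, ix_to_char):
    # Hypothetical helper: map indices to characters, drop the trailing newline, capitalize.
    text = ''.join(ix_to_char[ix] for ix in indices)
    return text.strip().capitalize()

# Hand-picked indices ending with 0, the newline that terminates a name.
name = decode_sample([20, 18, 5, 24, 0], ix_to_char)
print(name)
```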

## 3. Build the Language Model 

Implement the optimization process by performing one step of stochastic gradient descent with clipped gradients. 

**Helper Functions**

```python
def rnn_forward(X, Y, a_prev, parameters):
    # Performs the forward propagation through the RNN and computes the cross-entropy loss.
    return loss, cache
    
def rnn_backward(X, Y, parameters, cache):
    # Performs backprop to compute loss gradients with respect to parameters, and returns all hidden states.
    return gradients, a

def update_parameters(parameters, gradients, learning_rate):
    # Updates parameters using the Gradient Descent Update Rule.
    return parameters

def clip(gradients, maxValue):
    # Clips the gradients' values between minimum and maximum.
    return gradients
```
**`optimize() Overview`**

`Execute one step of the optimization to train the model.`    

**`Arguments`**
```
X = list of integers, where each integer is the index of a character in the vocabulary
Y = same as X but shifted one index to the left
a_prev = previous hidden state
parameters = python dictionary containing:
                    Wax = input weight            (n_a, n_x)
                    Waa = hidden state weight     (n_a, n_a)
                    Wya = output weight           (n_y, n_a)
                    b = bias                      (n_a, 1)
                    by = output bias              (n_y, 1)
learning_rate = model's learning rate
```
**`Returns`**
```
loss = value of cross-entropy loss
gradients = python dictionary containing:
                    dWax = Gradients of input-to-hidden weights         (n_a, n_x)
                    dWaa = Gradients of hidden-to-hidden weights        (n_a, n_a)
                    dWya = Gradients of hidden-to-output weights        (n_y, n_a)
                    db = Gradients of bias vector                       (n_a, 1)
                    dby = Gradients of output bias vector               (n_y, 1)
a[len(X)-1] = last hidden state                                         (n_a, 1)
```

In [10]:
def optimize(X, Y, a_prev, parameters, learning_rate = 0.01):
    
    loss, cache = rnn_forward(X, Y, a_prev, parameters)                   # forward propagate 
    gradients, a = rnn_backward(X, Y, parameters, cache)                  # backpropagate
    gradients = clip(gradients, 5)                                        # clip gradients
    parameters = update_parameters(parameters, gradients, learning_rate)  # update parameters
        
    return loss, gradients, a[len(X)-1]
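
The last step inside `optimize()` relies on `update_parameters` from `utils`, which is not shown in this notebook. A plausible sketch of the gradient descent update rule it names is below; `update_parameters_sketch` and the toy values are assumptions for illustration, relying only on the `dWaa`/`Waa` naming convention used above.

```python
import numpy as np

def update_parameters_sketch(parameters, gradients, learning_rate):
    # Hypothetical stand-in for utils.update_parameters:
    # move each parameter a small step against its gradient.
    for name in parameters:
        parameters[name] -= learning_rate * gradients['d' + name]
    return parameters

# Toy parameters and already-clipped gradients (values are made up).
parameters = {'Waa': np.ones((2, 2)), 'b': np.zeros((2, 1))}
gradients = {'dWaa': np.full((2, 2), 5.0), 'db': np.full((2, 1), -5.0)}

parameters = update_parameters_sketch(parameters, gradients, learning_rate=0.01)
print(parameters['Waa'])
print(parameters['b'])
```

Because the gradients are clipped to [-5, 5] and the default learning rate is 0.01, no single parameter entry moves by more than 0.05 in one step.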

### 3.2 - Train the Model 

After shuffling the dataset, we train with stochastic gradient descent and, every 2000 steps, sample a few names to see how the algorithm is doing. 
    
**`model() Overview`**

`Trains the model and generates dinosaur names.`

**`Arguments`**
```
data = text corpus
ix_to_char = dictionary that maps the index to a character
char_to_ix = dictionary that maps a character to an index
num_iterations = number of iterations to train the model for
n_a = number of units of the RNN cell
dino_names = number of dinosaur names to sample at each checkpoint (every 2000 iterations)
vocab_size = number of unique characters found in the text, size of the vocabulary
```
**`Returns`**
```
parameters = learned parameters
```

In [12]:
def model(data, ix_to_char, char_to_ix, num_iterations = 35000, n_a = 50, dino_names = 7, vocab_size = 27):
    
    n_x, n_y = vocab_size, vocab_size
    parameters = initialize_parameters(n_a, n_x, n_y)   # initialize parameters
    loss = get_initial_loss(vocab_size, dino_names)     # initialize loss
    
    # list of training examples (dino names)
    with open("dinos.txt") as f:
        examples = f.readlines()
    examples = [x.lower().strip() for x in examples]
    np.random.seed(0)
    np.random.shuffle(examples)
     
    a_prev = np.zeros((n_a, 1))     # initialize hidden state of the RNN
     
    for j in range(num_iterations):
        
        # define one training example (X,Y)
        index = j % len(examples)
        X = [None] + [char_to_ix[ch] for ch in examples[index]] 
        Y = X[1:] + [char_to_ix["\n"]]
        
        # perform one optimization step: Forward Prop > Back Prop > Clip > Update Parameters
        curr_loss, gradients, a_prev = optimize(X, Y, a_prev, parameters)
        
        # smooth the loss so the printed values are less noisy (smoothing does not affect training)
        loss = smooth(loss, curr_loss)

        # Every 2000 iterations, sample dino_names names to check if the model is learning properly
        if j % 2000 == 0:
            
            print('Iteration: %d, Loss: %f' % (j, loss) + '\n')
            seed = 0
            
            for name in range(dino_names):
                
                # Sample indices and print them
                sampled_indices = sample(parameters, char_to_ix, seed)
                print_sample(sampled_indices, ix_to_char)
                seed += 1

            print('\n')
        
    return parameters
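
The helpers `get_initial_loss` and `smooth` come from `utils` and are not defined in this notebook. The sketch below shows one common way such helpers are written, purely as an assumption: the initial loss is the cross-entropy of a uniform guess over a short sequence, and smoothing is an exponential moving average. The `_sketch` names and the `beta` value are hypothetical.

```python
import numpy as np

def get_initial_loss_sketch(vocab_size, seq_length):
    # Assumed definition: cross-entropy of a uniform predictor summed over seq_length characters.
    return -np.log(1.0 / vocab_size) * seq_length

def smooth_sketch(loss, cur_loss, beta=0.999):
    # Assumed definition: exponential moving average of the per-example loss.
    return beta * loss + (1 - beta) * cur_loss

loss = get_initial_loss_sketch(vocab_size=27, seq_length=7)
print(loss)

# Each new example nudges the running value only slightly toward its own loss.
loss = smooth_sketch(loss, cur_loss=30.0)
print(loss)
```

Under these assumptions the reported loss changes slowly from one iteration to the next, which is why it is useful for monitoring but has no influence on the parameter updates.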

In [13]:
parameters = model(data, ix_to_char, char_to_ix)

Iteration: 0, Loss: 23.087336

Nkzxwtdmfqoeyhsqwasjkjvu
Kneb
Kzxwtdmfqoeyhsqwasjkjvu
Neb
Zxwtdmfqoeyhsqwasjkjvu
Eb
Xwtdmfqoeyhsqwasjkjvu


Iteration: 2000, Loss: 27.884160

Liusskeomnolxeros
Hmdaairus
Hytroligoraurus
Lecalosapaus
Xusicikoraurus
Abalpsamantisaurus
Tpraneronxeros


Iteration: 4000, Loss: 25.901815

Mivrosaurus
Inee
Ivtroplisaurus
Mbaaisaurus
Wusichisaurus
Cabaselachus
Toraperlethosdarenitochusthiamamumamaon


Iteration: 6000, Loss: 24.608779

Onwusceomosaurus
Lieeaerosaurus
Lxussaurus
Oma
Xusteonosaurus
Eeahosaurus
Toreonosaurus


Iteration: 8000, Loss: 24.070350

Onxusichepriuon
Kilabersaurus
Lutrodon
Omaaerosaurus
Xutrcheps
Edaksoje
Trodiktonus


Iteration: 10000, Loss: 23.844446

Onyusaurus
Klecalosaurus
Lustodon
Ola
Xusodonia
Eeaeosaurus
Troceosaurus


Iteration: 12000, Loss: 23.291971

Onyxosaurus
Kica
Lustrepiosaurus
Olaagrraiansaurus
Yuspangosaurus
Eealosaurus
Trognesaurus


Iteration: 14000, Loss: 23.382338

Meutromodromurus
Inda
Iutroinatorsaurus
Maca
Yusteratop

## 4. Shakespeare Poem Generator

In [14]:
from __future__ import print_function
from keras.callbacks import LambdaCallback
from keras.models import Model, load_model, Sequential
from keras.layers import Dense, Activation, Dropout, Input, Masking
from keras.layers import LSTM
from keras.utils.data_utils import get_file
from keras.preprocessing.sequence import pad_sequences
from shakespeare_utils import *
import sys
import io

Using TensorFlow backend.


Loading text data...
Creating training set...
number of training examples: 31412
Vectorizing training set...
Loading model...


In [15]:
print_callback = LambdaCallback(on_epoch_end=on_epoch_end)

model.fit(x, y, batch_size=128, epochs=1, callbacks=[print_callback])

Epoch 1/1


<keras.callbacks.History at 0x7f3eb85ee588>

In [16]:
generate_output()

Write the beginning of your poem, the Shakespeare machine will complete it. Your input is: know thyself


Here is your poem: 

know thyself this,
that than thy gride tunch stepe wor's conse make.
 love heart thatel nor cinter helf my sight,
it siviftatthe sud mesthire my beling his'.
of thy mall naf enders may nimpece ang thee?
ey for made end borte all my pose will thee,
might life my to? wi the vo deed adeloned,
and soum your of, out letge is my a do all,
haff withret fur mear's upess geasss not,
wimend, and my to coverses worme ha