### Load the training data
The network needs a large plain-text file as input.
The content of the file is used to train the network.

In [4]:
data = open('kafka.txt', 'r').read()
## set(data) keeps only the unique chars; list() makes them indexable
chars = list(set(data)) 
data_size, vocab_size = len(data), len(chars)
print('data has %d chars, %d unique' % (data_size, vocab_size))

data has 137629 chars, 81 unique


### Encode/Decode char/vector
Neural networks operate on vectors (a vector is an array of floats), so we need a way to encode a char as a vector and to decode it back.
We'll count the number of unique chars (vocab_size). That will be the size of the vector. The vector contains only zeros, except at the position of the char, where the value is 1.

In [2]:
## these are 2 dicts to convert a char to an int and an int back to a char
char_to_ix = { ch:i for i,ch in enumerate(chars)}
ix_to_char = { i:ch for i, ch in enumerate(chars)}
print(char_to_ix)
print(ix_to_char)

{'\n': 0, 'C': 31, '!': 3, ' ': 4, '"': 5, '%': 6, '$': 7, "'": 8, ')': 9, '(': 10, '*': 11, '-': 12, ',': 13, '/': 2, '.': 15, '1': 16, '0': 17, '3': 18, '2': 19, '5': 20, '4': 21, '7': 22, '6': 23, '9': 24, '8': 25, ';': 26, ':': 27, '?': 28, 'A': 29, '@': 30, '\xc3': 1, 'B': 32, 'E': 33, 'D': 34, 'G': 35, 'F': 36, 'I': 37, 'H': 38, 'K': 39, 'J': 40, 'M': 41, 'L': 42, 'O': 43, 'N': 44, 'Q': 45, 'P': 46, 'S': 47, 'R': 48, 'U': 49, 'T': 50, 'W': 51, 'V': 52, 'Y': 53, 'X': 54, 'd': 59, 'a': 55, 'c': 56, 'b': 57, 'e': 58, '\xa7': 14, 'g': 60, 'f': 61, 'i': 62, 'h': 63, 'k': 64, 'j': 65, 'm': 66, 'l': 67, 'o': 68, 'n': 69, 'q': 70, 'p': 71, 's': 72, 'r': 73, 'u': 74, 't': 75, 'w': 76, 'v': 77, 'y': 78, 'x': 79, 'z': 80}
{0: '\n', 1: '\xc3', 2: '/', 3: '!', 4: ' ', 5: '"', 6: '%', 7: '$', 8: "'", 9: ')', 10: '(', 11: '*', 12: '-', 13: ',', 14: '\xa7', 15: '.', 16: '1', 17: '0', 18: '3', 19: '2', 20: '5', 21: '4', 22: '7', 23: '6', 24: '9', 25: '8', 26: ';', 27: ':', 28: '?', 29: 'A', 30: '

The dictionaries defined above allow us to create a vector of size 81 (vocab_size) instead of 256.
Here is an example for the char 'a'.
The vector contains only zeros, except at position char_to_ix['a'] where we put a 1.

In [6]:
import numpy as np

vector_for_char_a = np.zeros((vocab_size, 1))
vector_for_char_a[char_to_ix['a']] = 1
print(vector_for_char_a.ravel())

[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.]
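
Decoding goes the other way: the position of the 1 in the vector gives the index, and ix_to_char maps that index back to a char. The following is a minimal sketch, assuming a short made-up string in place of kafka.txt so that it runs on its own. The helper names `encode` and `decode` are hypothetical and are not used elsewhere in this notebook.

```python
import numpy as np

# Hypothetical toy text standing in for the contents of kafka.txt
data = 'hello world'
chars = sorted(set(data))  # sorted only to make the indices reproducible
vocab_size = len(chars)
char_to_ix = {ch: i for i, ch in enumerate(chars)}
ix_to_char = {i: ch for i, ch in enumerate(chars)}

def encode(ch):
    """Return the one-hot column vector (vocab_size x 1) for a char."""
    vec = np.zeros((vocab_size, 1))
    vec[char_to_ix[ch]] = 1
    return vec

def decode(vec):
    """Return the char at the position holding the largest value in vec."""
    return ix_to_char[int(np.argmax(vec))]

decoded = decode(encode('w'))
print(decoded)
```

Using argmax rather than searching for an exact 1 means the same helper would also work on a probability vector, where it picks the most likely char.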


In [7]:
#model parameters

hidden_size = 100
seq_length = 25
learning_rate = 1e-1

Wxh = np.random.randn(hidden_size, vocab_size) * 0.01 #input to hidden
Whh = np.random.randn(hidden_size, hidden_size) * 0.01 #hidden to hidden
Why = np.random.randn(vocab_size, hidden_size) * 0.01 #hidden to output
bh = np.zeros((hidden_size, 1))
by = np.zeros((vocab_size, 1))

The parameters are:
* Wxh are the parameters that connect the vector containing one input char to the hidden layer.
* Whh are the parameters that connect the hidden layer to itself. This is the key to the RNN: recursion is done by feeding the hidden state from the previous iteration back into the hidden layer at the next iteration.
* Why are the parameters that connect the hidden layer to the output.
* bh contains the hidden bias
* by contains the output bias
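
To see how these five parameters fit together, here is a sketch of a single time step with the shapes used in this notebook (hidden_size = 100, vocab_size = 81). The fixed random seed and the input index 5 are arbitrary choices made for illustration only.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed, only to make the sketch reproducible
hidden_size, vocab_size = 100, 81

Wxh = np.random.randn(hidden_size, vocab_size) * 0.01   # input to hidden
Whh = np.random.randn(hidden_size, hidden_size) * 0.01  # hidden to hidden
Why = np.random.randn(vocab_size, hidden_size) * 0.01   # hidden to output
bh = np.zeros((hidden_size, 1))
by = np.zeros((vocab_size, 1))

# One-hot input for an arbitrary char index, and an all-zero previous state
x = np.zeros((vocab_size, 1))
x[5] = 1
h_prev = np.zeros((hidden_size, 1))

# The new hidden state mixes the current input with the previous hidden state
h = np.tanh(np.dot(Wxh, x) + np.dot(Whh, h_prev) + bh)
# The output holds one unnormalized score per char in the vocabulary
y = np.dot(Why, h) + by

print(h.shape)
print(y.shape)
```

Feeding `h` back in as `h_prev` on the next step is what carries information from one char to the next.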

#### Define the loss function
The loss is a key concept in all neural network training. It is a value that describes how good our model is.
The smaller the loss, the better our model is.
(A good model is a model where the predicted output is close to the training output.)
During the training phase we want to minimize the loss.
The loss function calculates the loss and also the gradients (via a backward pass):
* It performs a forward pass: it predicts the next char given a char from the training set.
* It calculates the loss by comparing the predicted char to the target char. (The target char is the char that follows the input char in the training set.)
* It performs a backward pass to calculate the gradients.
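
The comparison between prediction and target can be sketched for a single time step as a softmax followed by a cross-entropy loss. The three scores below are made-up values for a toy vocabulary of 3 chars. Subtracting the maximum score before exponentiating is an addition in this sketch for numerical safety; it does not change the resulting probabilities.

```python
import numpy as np

def softmax(y):
    """Turn a column of unnormalized scores into probabilities that sum to 1."""
    e = np.exp(y - np.max(y))  # shifting by the max avoids overflow in exp
    return e / np.sum(e)

# Made-up output scores for a toy vocabulary of 3 chars
y = np.array([[2.0], [1.0], [0.1]])
p = softmax(y)

# Cross-entropy loss: minus the log of the probability given to the target char
loss_if_target_0 = -np.log(p[0, 0])  # target is the highest-scoring char
loss_if_target_2 = -np.log(p[2, 0])  # target is the lowest-scoring char

# Gradient of the loss with respect to y: probabilities minus the one-hot target
dy = np.copy(p)
dy[0] -= 1

print(loss_if_target_0 < loss_if_target_2)
```

The loss is small when the model gives the target char a high probability and grows as that probability shrinks, which is why minimizing it pushes the predictions toward the training text.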

This function takes as input:
* a list of input chars
* a list of target chars
* the previous hidden state

This function outputs:
* the loss
* the gradient for each parameter
* the last hidden state
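
As a sketch of what these arguments could look like, the targets are the inputs shifted one char to the right, and the hidden state starts at zero. The toy string, the short seq_length, and the position variable `p` are assumptions made so the example runs on its own.

```python
import numpy as np

# Hypothetical toy text standing in for the contents of kafka.txt
data = 'hello world'
chars = sorted(set(data))
char_to_ix = {ch: i for i, ch in enumerate(chars)}

hidden_size = 100
seq_length = 4  # shortened for the toy text; the notebook uses 25
p = 0           # hypothetical position of the current chunk in data

# Inputs are seq_length chars starting at p; targets are the same chars shifted by one
inputs = [char_to_ix[ch] for ch in data[p:p + seq_length]]
targets = [char_to_ix[ch] for ch in data[p + 1:p + seq_length + 1]]

# Before the first chunk there is no previous hidden state, so start from zeros
hprev = np.zeros((hidden_size, 1))

print(inputs)
print(targets)
```

Each target is the char the network should predict after seeing the input at the same position, so `inputs[1:]` and `targets[:-1]` are identical.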

In [None]:
def lossFun(inputs, targets, hprev):
    """
    inputs, targets are both lists of integers (char indices).
    hprev is a (hidden_size x 1) array holding the initial hidden state.
    Returns the loss, the gradients on the model parameters, and the last hidden state.
    """
    # Store the inputs, hidden states, outputs, and probabilities in dicts
    # keyed by time step, with one vector per step (seq_length = 25 steps here):
    #   xs: one-hot encoded input chars, each (vocab_size x 1)
    #   hs: hidden states, each (hidden_size x 1), plus an entry at key -1
    #       holding the initial state needed to compute the hidden state at t = 0
    #   ys: unnormalized scores for the next char, each (vocab_size x 1)
    #   ps: ys normalized into probabilities for the next char
    # Dicts are used instead of lists because of the -1 key: as a list index,
    # -1 would wrap around to the final element.
    xs, hs, ys, ps = {}, {}, {}, {}
    # Initialize with the previous hidden state. np.copy makes a separate array;
    # plain "=" would only bind a second name to the same array, so hs[-1]
    # would change whenever hprev is modified in place.
    hs[-1] = np.copy(hprev)
    # Initialize the loss to 0
    loss = 0

    # forward pass
    for t in range(len(inputs)):
        xs[t] = np.zeros((vocab_size, 1))  # 1-of-k encoding: start from a zero vector for the t-th input
        xs[t][inputs[t]] = 1  # set the position given by the integer in "inputs" to 1
        hs[t] = np.tanh(np.dot(Wxh, xs[t]) + np.dot(Whh, hs[t-1]) + bh)  # hidden state
        ys[t] = np.dot(Why, hs[t]) + by  # unnormalized log probabilities for next chars
        ps[t] = np.exp(ys[t]) / np.sum(np.exp(ys[t]))  # probabilities for next chars
        loss += -np.log(ps[t][targets[t], 0])  # softmax (cross-entropy loss)

    # backward pass: compute gradients going backwards
    # initialize the gradient arrays, one per set of weights
    dWxh, dWhh, dWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
    dbh, dby = np.zeros_like(bh), np.zeros_like(by)
    dhnext = np.zeros_like(hs[0])
    for t in reversed(range(len(inputs))):
        # start from the output probabilities
        dy = np.copy(ps[t])
        # gradient of the loss w.r.t. ys[t]: probabilities minus the one-hot target
        dy[targets[t]] -= 1  # backprop into y
        # gradient of the hidden-to-output weights: dy times the transposed hidden state
        dWhy += np.dot(dy, hs[t].T)
        # gradient of the output bias
        dby += dy
        # Multiplying by the transposed weight matrix moves the error backward
        # through the network, giving a measure of the error at the hidden layer.
        # dhnext adds the gradient flowing back from time step t+1.
        dh = np.dot(Why.T, dy) + dhnext  # backprop into h
        dhraw = (1 - hs[t] * hs[t]) * dh  # backprop through tanh nonlinearity
        dbh += dhraw  # gradient of the hidden bias
        dWxh += np.dot(dhraw, xs[t].T)  # gradient of the input-to-hidden weights
        dWhh += np.dot(dhraw, hs[t-1].T)  # gradient of the hidden-to-hidden weights
        dhnext = np.dot(Whh.T, dhraw)

    for dparam in [dWxh, dWhh, dWhy, dbh, dby]:
        np.clip(dparam, -5, 5, out=dparam)  # clip to mitigate exploding gradients
        
    return loss, dWxh, dWhh, dWhy, dbh, dby, hs[len(inputs)-1]
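
The clipping step at the end of lossFun works element by element and in place. Here is a small sketch with a made-up gradient array, showing that values outside [-5, 5] are pulled to the nearest bound while the others are left untouched.

```python
import numpy as np

# Made-up gradient values; two of them are outside the [-5, 5] range
grad = np.array([[-12.0, 0.5],
                 [3.0, 40.0]])

# out=grad writes the clipped values back into the same array
np.clip(grad, -5, 5, out=grad)

print(grad)
```

Because the array is modified in place, the names dWxh, dWhh, dWhy, dbh and dby returned by lossFun already hold the clipped values.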
