# Recurrent Neural Networks

**Nietzsche Philosophy**   
We will learn to write philosophy like Nietzsche, one character at a time. This is like the word-at-a-time language model from lesson 4, but this time we work one character at a time.

An RNN is no different from what you have already learned. To show this, we will build one using plain PyTorch layers that we have already used.

<img src="./WhyWeNeedRNNs.png" style='width:500px;height:400px;'>

**Basic Idea of RNN**  
The basic idea of an RNN is to keep track of state over long-term dependencies. Consider the diagram, with the template language in the black box. The model needs to remember that it is inside a comment, or that it is after the start and before the end of a loop. This kind of state is difficult to track with CNNs or basic deep learning models, but it comes naturally to an RNN.

A stateful representation keeps track of where we are now. It gives us memory, support for variable-length sequences, and long-term dependencies.

**Swiftkey Blog Post**  
SwiftKey posted a blog about their new language model a year ago. 

**Andrej Karpathy**  
A character-level RNN that produces LaTeX. He passed in some LaTeX text and the model started generating text of its own. 



### Notation
Start with something that is not an RNN. Introduce NN notation using boxes, circles and triangles. 

<img src = "NN_Notation.png" style="width:500px;height:300px;" >

**Every shape represents a set of activations**    

***Rectangle*** - Input activations. The number of rows is the batch size; the number of columns is the number of input variables.    

***Arrow*** - A layer operation, possibly more than one. The first arrow is a matrix product followed by a ReLU.  

***Circle*** - Hidden activations. Remember, an activation is a number calculated by, for example, a ReLU or a matrix product. A circle represents the matrix of activations that results from the operations on the incoming arrow, of size batch size x number of columns.

***Triangle*** - Output activations, produced by another set of operations (another arrow). 

**ML Course vs DL course**  
Note: in the ML course we build things up from the foundations. If you haven't built one of these networks from scratch, try it; we do it in the ML course. This course is top down.

**Some Conventions Conv Net** 

- Activation function is always ReLU for hidden layers 
- Output for classification is always Softmax
- No. of rows is always the batch size   
- For conv nets, the No. of channels is the No. of filters
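
To make the notation concrete, here is a minimal sketch of the rectangle, arrow, circle, triangle pipeline in plain Python. The weights (`w_hidden`, `w_out`) and the helper functions are made up for illustration, and biases are left out to keep it short; the point is only how the shapes flow, with the number of rows staying equal to the batch size.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def relu(m):
    """Element-wise ReLU."""
    return [[max(0.0, v) for v in row] for row in m]

def softmax(m):
    """Row-wise softmax: every row ends up summing to 1."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def shape(m):
    return (len(m), len(m[0]))

# Rectangle: input activations, batch size 2 x 3 input variables
x = [[1.0, 2.0, 3.0],
     [0.5, -1.0, 2.0]]

# Hypothetical weights: 3 inputs -> 4 hidden activations -> 2 classes
w_hidden = [[0.1, -0.2, 0.3, 0.0],
            [0.0, 0.1, -0.1, 0.2],
            [0.2, 0.0, 0.1, -0.3]]
w_out = [[0.5, -0.5],
         [0.1, 0.2],
         [-0.3, 0.3],
         [0.0, 0.4]]

hidden = relu(matmul(x, w_hidden))        # first arrow -> circle
output = softmax(matmul(hidden, w_out))   # second arrow -> triangle

print(shape(x), shape(hidden), shape(output))
```

Every shape keeps 2 rows (the batch size); only the number of columns changes from arrow to arrow.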

## First NLP Example

**Predict 3rd Character**   
Predict the third character of a sequence from the previous two characters. 

<img src="NLP_Prediction_Example.png" style="width:600px;height:400px;">

Our input is the first character of each string in the mini-batch. If it were one-hot encoded, the width would be the number of characters in the vocabulary. In the implementation it is not one-hot encoded: it is an integer lookup into an embedding layer (as discussed in the lesson 5 MovieLens lesson), which is mathematically identical. 
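
The claim that an embedding lookup is mathematically identical to a one-hot matrix product can be checked with a small sketch. The matrix `emb` and the helpers `one_hot` and `vec_mat` are invented for this illustration, not part of the lesson code.

```python
def one_hot(index, size):
    """Vector of zeros with a single 1.0 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def vec_mat(v, m):
    """Row vector times matrix (matrix stored as a list of rows)."""
    return [sum(vi * row[j] for vi, row in zip(v, m)) for j in range(len(m[0]))]

# Hypothetical embedding matrix: vocabulary of 4 characters, 3 factors each
emb = [[0.1, 0.2, 0.3],
       [1.0, 1.1, 1.2],
       [2.0, 2.1, 2.2],
       [3.0, 3.1, 3.2]]

char_index = 2
via_product = vec_mat(one_hot(char_index, len(emb)), emb)  # one-hot x matrix
via_lookup = emb[char_index]                               # plain row lookup

print(via_product, via_lookup)
```

The product only ever picks out one row, which is why the lookup is used in practice: it gives the same numbers without multiplying by all the zeros.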

The activations go through a fully connected layer. 

Next is another fully connected layer. 

Then character 2 is brought into the second circle. The arrow from character 2 represents a matrix product whose output has the same dimensions as the other arrow coming into the second circle. The outputs of the two arrows are added.

The final output is a matrix product followed by a softmax that gives a probability for each character in the vocabulary.

In summary, the network contains two hidden layers and three matrix products (counting only one between each pair of layers, not the extra input), the first of which goes through an embedding layer. We also have a second input coming in. 

Now, let's implement that. We will develop this from scratch without TorchText.

**Why characters not words**  
Character models can be useful. Generally you want to combine character-level and word-level models. We will discuss word-level models in part 2.


This is what the actual model looks like for predicting the 4-th character

<img src="RNN_Prediction2.png" style="width:600px;height:400px;">


All the arrows of the same color use the same weight matrix:  
- input embeddings 
- inner layer matrices
- output matrix

So there are 3 different matrix sizes.


In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

from fastai.io import *
from fastai.conv_learner import *

from fastai.column_data import *

## Setup

We're going to download the collected works of Nietzsche to use as our data for this class.

In [2]:
PATH='data/nietzsche/'

In [3]:
get_data("https://s3.amazonaws.com/text-datasets/nietzsche.txt", f'{PATH}nietzsche.txt')
text = open(f'{PATH}nietzsche.txt').read()
print('corpus length:', len(text))

corpus length: 600893


In [4]:
text[:400]

'PREFACE\n\n\nSUPPOSING that Truth is a woman--what then? Is there not ground\nfor suspecting that all philosophers, in so far as they have been\ndogmatists, have failed to understand women--that the terrible\nseriousness and clumsy importunity with which they have usually paid\ntheir addresses to Truth, have been unskilled and unseemly methods for\nwinning a woman? Certainly she has never allowed herself '

In [4]:
# set() gives the unique characters; sort them
# 84 unique characters, plus 1 for the padding character added below
chars = sorted(list(set(text)))
vocab_size = len(chars)+1
print('total chars:', vocab_size)

total chars: 85


Sometimes it's useful to have a zero value in the dataset, e.g. for padding

In [5]:
# put padding character, null, at the start
chars.insert(0, "\0")

# this is our vocab
''.join(chars[1:-6])

'\n !"\'(),-.0123456789:;=?ABCDEFGHIJKLMNOPQRSTUVWXYZ[]_abcdefghijklmnopqrstuvwxy'

Map from chars to indices and back again

In [6]:
# maps every character to a unique id
char_indices = {c: i for i, c in enumerate(chars)}
# maps every unique id to a character
indices_char = {i: c for i, c in enumerate(chars)}

*idx* will be the data we use from now on - it simply converts all the characters to their index (based on the mapping above)

In [7]:
# convert every character in the text to its index
idx = [char_indices[c] for c in text]

idx[:10]

[40, 42, 29, 30, 25, 27, 29, 1, 1, 1]

In [8]:
''.join(indices_char[i] for i in idx[:70])

'PREFACE\n\n\nSUPPOSING that Truth is a woman--what then? Is there not gro'
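
As a small self-contained illustration of the same mapping, the sketch below builds a vocabulary for a toy string, reserves index 0 for padding, and checks that encoding followed by decoding returns the original text. The names `sample`, `vocab`, `char_to_idx` and `idx_to_char` are stand-ins for `text`, `chars`, `char_indices` and `indices_char`.

```python
sample = "hello world"

# unique characters, sorted, with the padding character at index 0
vocab = sorted(set(sample))
vocab.insert(0, "\0")

char_to_idx = {c: i for i, c in enumerate(vocab)}
idx_to_char = {i: c for i, c in enumerate(vocab)}

encoded = [char_to_idx[c] for c in sample]             # text -> indices
decoded = "".join(idx_to_char[i] for i in encoded)     # indices -> text

print(len(vocab), encoded, repr(decoded))
```

Because the padding character never occurs in the text, index 0 never appears in `encoded`.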

## Three char model

### Create inputs

Create a list of every 3rd character (the step is cs=3), starting at the 0th, 1st, 2nd, then 3rd characters

In [9]:
# four lists: the 0th, 1st, 2nd and 3rd character of each group
# (the 3rd is the label to predict)
# in the next step each list is turned into a numpy array
cs=3
c1_dat = [idx[i]   for i in range(0, len(idx)-cs, cs)]
c2_dat = [idx[i+1] for i in range(0, len(idx)-cs, cs)]
c3_dat = [idx[i+2] for i in range(0, len(idx)-cs, cs)]
c4_dat = [idx[i+3] for i in range(0, len(idx)-cs, cs)]

# each of these is a list
len(c1_dat)

200297
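
To see what these four lists contain, here is the same slicing applied to a toy index list, where every value equals its own position. `toy_idx` is a stand-in for the real `idx`; everything else mirrors the cell above.

```python
toy_idx = list(range(10))   # stand-in for idx: value == position
cs = 3

starts = range(0, len(toy_idx) - cs, cs)     # 0, 3, 6
c1 = [toy_idx[i] for i in starts]
c2 = [toy_idx[i + 1] for i in starts]
c3 = [toy_idx[i + 2] for i in starts]
c4 = [toy_idx[i + 3] for i in starts]        # the label for each group

for group in zip(c1, c2, c3, c4):
    print(group[:3], "->", group[3])
```

Each group of three inputs is followed by its label, and the label of one group is also the first input of the next group.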

Our inputs

In [10]:
# np.stack turns each list into a numpy array
# inputs: the 0th, 1st and 2nd characters
x1 = np.stack(c1_dat)
x2 = np.stack(c2_dat)
x3 = np.stack(c3_dat)


x1.size

200297

In [15]:
# np.stack demo (optional): get familiar with np.stack

a = np.array([[1,2,3],[4,5,6]])
b = np.array([[7,8,9],[10,11,12]]) 


print('a=',a)
print('a.shape',a.shape)

print('\n')

print('b=',b)
print('b.shape',b.shape)

print('\n')

print('np.stack(a,b) = ')
print(np.stack((a,b)))

print('\n')

# stack along the 0 dimension ... row
print('\n')
c=np.stack((a,b),0)
print('c= np.stack((a,b),0)')
print('c =',c)
print('c.shape', c.shape)

print('\n')

d=np.stack((a,b),1)
print('d= np.stack((a,b),1)')
print('d=',d)
print('d.shape', d.shape)

print('\n')

e=np.vstack((a,b))
print('e= np.vstack((a,b))')
print('e=',e)
print('e.shape', e.shape)

print('\n')

f=np.hstack((a,b))
print('f= np.hstack((a,b))')
print('f=',f)
print('f.shape', f.shape)

a= [[1 2 3]
 [4 5 6]]
a.shape (2, 3)


b= [[ 7  8  9]
 [10 11 12]]
b.shape (2, 3)


np.stack(a,b) = 
[[[ 1  2  3]
  [ 4  5  6]]

 [[ 7  8  9]
  [10 11 12]]]




c= np.stack((a,b),0)
c = [[[ 1  2  3]
  [ 4  5  6]]

 [[ 7  8  9]
  [10 11 12]]]
c.shape (2, 2, 3)


d= np.stack((a,b),1)
d= [[[ 1  2  3]
  [ 7  8  9]]

 [[ 4  5  6]
  [10 11 12]]]
d.shape (2, 2, 3)


e= np.vstack((a,b))
e= [[ 1  2  3]
 [ 4  5  6]
 [ 7  8  9]
 [10 11 12]]
e.shape (4, 3)


f= np.hstack((a,b))
f= [[ 1  2  3  7  8  9]
 [ 4  5  6 10 11 12]]
f.shape (2, 6)


Our output

In [11]:
# next character in the list
y = np.stack(c4_dat)

The first 4 inputs and outputs

In [12]:
# for example x1, 2, 3 ... 
#  first few items 40, 42, 29
#  first few items ... grab idx 40, 42, 29, and predict 30
#  next grab item with idx 30, 25, 27, and predict item with idx 29
#     and so on
x1[:4], x2[:4], x3[:4]

(array([40, 30, 29,  1]), array([42, 25,  1, 43]), array([29, 27,  1, 45]))

In [22]:
# next character to predict 
y[:4]

array([30, 29,  1, 40])

In [23]:
# 200K of these that we will model
x1.shape, y.shape

((200297,), (200297,))

### Create and train model

Pick a size for our hidden state

In [13]:
# 256 hidden activations
n_hidden = 256

The number of latent factors to create (i.e. the size of the embedding matrix)

In [14]:
# size of embeddings
# 42 approximately half the number of characters
n_fac = 42

In [16]:
class Char3Model(nn.Module):
    def __init__(self, vocab_size, n_fac):
        super().__init__()
        self.e = nn.Embedding(vocab_size, n_fac)

        # The 'green arrow' from our diagram - the layer operation from input to hidden
        self.l_in = nn.Linear(n_fac, n_hidden)

        # The 'orange arrow' from our diagram - the layer operation from hidden to hidden
        #  notice that this matrix is square
        #  the trick is to make it a square matrix
        self.l_hidden = nn.Linear(n_hidden, n_hidden)
        
        # The 'blue arrow' from our diagram - the layer operation from hidden to output
        self.l_out = nn.Linear(n_hidden, vocab_size)
        
    # pass in three characters
    def forward(self, c1, c2, c3):
        # Each character goes through embedding -> linear -> ReLU.
        # self.e produces embeddings of length n_fac (42);
        # each in* has size n_hidden (input activations).
        in1 = F.relu(self.l_in(self.e(c1)))
        in2 = F.relu(self.l_in(self.e(c2)))
        in3 = F.relu(self.l_in(self.e(c3)))
        
        # create a PyTorch Variable of zeros on the GPU
        h = V(torch.zeros(in1.size()).cuda())
        # h is all zeros at this point, so adding it is not needed for the
        # first input, but it keeps the three steps identical.
        # These three identical steps are what we later REPLACE with a loop;
        # that repetition is what makes it a **RECURRENT** neural network.
        h = F.tanh(self.l_hidden(h+in1))
        # from the first hidden state to the next, adding the 2nd input
        h = F.tanh(self.l_hidden(h+in2))
        # from the second hidden state to the next, adding the 3rd input
        h = F.tanh(self.l_hidden(h+in3))
        
        return F.log_softmax(self.l_out(h), dim=-1)
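
Because the same three layers are reused for every character, the number of parameters does not depend on how many characters are fed in. The arithmetic below is a rough check, assuming the sizes used in this notebook (vocabulary 85, 42 factors, 256 hidden activations) and that each fully connected layer has a bias, which is the default for nn.Linear. The helper `linear_params` is made up for this sketch.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

vocab_size, n_fac, n_hidden = 85, 42, 256

params = {
    "embedding": vocab_size * n_fac,                 # lookup table, no bias
    "l_in": linear_params(n_fac, n_hidden),          # green arrow
    "l_hidden": linear_params(n_hidden, n_hidden),   # orange arrow (square)
    "l_out": linear_params(n_hidden, vocab_size),    # blue arrow
}
total = sum(params.values())

for name, count in params.items():
    print(f"{name:10s} {count:7d}")
print(f"{'total':10s} {total:7d}")
```

Feeding in 3 characters or 8 leaves this total unchanged, since forward only applies the same layers more times.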

In [20]:
# use the same ColumnarModelData class that we used before
# from_arrays hands back the arrays we pass in (stacked with np.stack) as mini-batches
# bs=512: the data is tiny, so we can use a larger batch size

# use fastai to set up the data, but train a plain PyTorch model;
#  we will not create a fastai Learner
md = ColumnarModelData.from_arrays('.', [-1], np.stack([x1,x2,x3], axis=1), y, bs=512)

In [19]:
# grab first row and predict first element of next row
#   note in RNN example below the characters are overlapping 
#   here they are not overlapping
np.stack([x1,x2,x3], axis=1)

array([[40, 42, 29],
       [30, 25, 27],
       [29,  1,  1],
       ...,
       [72, 62, 67],
       [59, 74, 65],
       [67, 58, 72]])

In [21]:
# Standard PyTorch model
# We are using PyTorch directly, so we need to put it on the GPU ourselves
m = Char3Model(vocab_size, n_fac).cuda()

In [23]:
# grab an iterator over the training data loader
it = iter(md.trn_dl)
*xs,yt = next(it)  # grab the next mini-batch: the inputs xs and the target yt
t = m(*V(xs))



In [26]:
# here are our xs
# we expect xs to have length 3 because we pass in 3 things
len(xs)

3

In [33]:
xs

[tensor([58, 72, 58, 31,  2, 72, 61, 59, 54, 77, 56,  2, 58, 68, 54,  2, 58,  2,
         62, 56, 71, 67, 67,  2, 58, 67, 67, 72, 69, 57, 58, 62, 73,  2,  2, 71,
         72, 72, 73, 78, 72, 68, 73, 72, 58, 62, 66, 74,  2, 59, 73, 61,  1, 61,
          2,  8, 72, 68, 67, 55, 66, 59,  2, 68, 54,  2, 72, 73, 72,  8, 62,  2,
         73,  2, 62, 68, 73, 67, 58,  2, 65, 66, 62,  2, 74, 58, 60, 75,  1, 59,
         68, 71,  2,  2, 56, 43, 59, 61,  2,  1, 58, 67, 73, 56, 72,  2, 57, 72,
         72,  2,  2, 59, 68, 72, 75, 71,  2, 60, 58,  2,  2,  1, 58, 62, 68, 62,
         67,  8, 59, 61,  2, 62, 73, 56, 73, 68, 67, 58, 62, 71, 67, 45, 74,  1,
          2, 56,  2, 67,  2,  2, 76, 68, 71,  2, 79, 68, 61, 59, 33, 58, 73,  2,
         55, 72, 54, 67,  1, 54, 73,  2, 54, 68, 66, 67, 75, 73, 61, 78, 72, 58,
          2, 73, 67,  2,  8, 62,  5, 78, 73, 73, 67, 62, 61, 72,  1, 58, 74, 62,
          2, 61, 56,  2, 58,  2,  5, 54, 73, 54, 24, 62, 68, 72, 67, 62, 73, 73,
         54, 71, 62, 57, 76,

In [29]:
# 0th element is length 512, batch size
xs[0].size()

torch.Size([512])

In [34]:
t

tensor([[-4.3356, -4.4544, -4.5640,  ..., -4.6722, -4.2110, -4.5510],
        [-4.3434, -4.4378, -4.6888,  ..., -4.6448, -4.2506, -4.3595],
        [-3.9783, -4.5035, -4.5312,  ..., -4.5166, -4.1282, -4.0794],
        ...,
        [-4.2302, -4.4665, -4.6359,  ..., -4.6978, -4.2540, -4.3633],
        [-4.2957, -4.4809, -4.7079,  ..., -4.5581, -3.9241, -4.6099],
        [-4.2088, -4.4721, -4.5719,  ..., -4.4334, -4.2776, -4.4175]],
       device='cuda:0', grad_fn=<LogSoftmaxBackward>)

In [35]:
# PyTorch optimizer
# m.parameters() gives the list of things to optimize
opt = optim.Adam(m.parameters(), 1e-2)

In [36]:
fit(m, md, 1, opt, F.nll_loss)

epoch      trn_loss   val_loss                               
    0      2.093053   0.377143  





[0.37714290618896484]

In [38]:
set_lrs(opt, 0.001)

In [39]:
fit(m, md, 1, opt, F.nll_loss)

epoch      trn_loss   val_loss                               
    0      1.84138    0.379587  





[0.37958669662475586]

### Test model

Write a function to test the model.

In [40]:
# pass in 3 characters
def get_next(inp):
    # tensor of character indexes
    idxs = T(np.array([char_indices[c] for c in inp]))
    # turn idxs into variables and pass them to our model
    p = m(*VV(idxs))
    # argmax grabs the index of the most likely character;
    # to_np turns the prediction into a numpy array
    i = np.argmax(to_np(p))
    return chars[i]

In [41]:
# 'T' is a good way to start a sentence after 'y. '
get_next('y. ')



'T'

In [42]:
get_next('ppl')



'e'

In [43]:
get_next(' th')



'e'

In [44]:
get_next('and')



' '

## Our first RNN!

Now let's create an RNN.

An RNN is exactly the same network, just drawn more compactly: instead of drawing every hidden layer separately, the hidden-to-hidden arrow loops back into the same circle.

The key is how many times to go around the circle.

Orange is the hidden-to-hidden matrix.


<img src="./RNN.png" style="width:400px;height:300px;">

###  Generalize the input

Initialize the hidden activations to zero, just like in the code example above (class Char3Model).

Now it is a very simple architecture. Every input is treated the same way, including the first, because the hidden activations start at zero.

<img src="./RNN2.png" style="width:400px;height:300px;">

### Create inputs

This is the size of our unrolled RNN.

In [45]:
# this time 8 characters (cs=8)
# predict the 9th character from the previous 8
cs=8

For each of 0 through 7, create a list of every 8th character with that starting point. These will be the 8 inputs to our model.

In [None]:
# create lists of 8 consecutive characters (0 to 7); the next character will be our output
c_in_dat = [[idx[i+j] for i in range(cs)] for j in range(len(idx)-cs)]

Then create a list of the next character in each of these series. This will be the labels for our model.

In [47]:
c_out_dat = [idx[j+cs] for j in range(len(idx)-cs)]

In [48]:
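# earlier we stacked three separate arrays as columns (axis=1);
# here each item is already a row of 8 overlapping characters, so stack along axis 0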
xs = np.stack(c_in_dat, axis=0)

In [49]:
xs.shape

(600885, 8)

In [50]:
y = np.stack(c_out_dat)

So each row below is one series of 8 characters from the text.

In [None]:
# overlapping groups of 8
#  0th to 7th
#  1st to 8th
#  the model will grab the first row and 
#    predict the last item of the second row 
# and so on

# for example, 
#  for the 3rd row the prediction will be 43, 
#  the last item of the 4th row

xs[:cs,:cs]

...and this is the next character after each sequence.

In [52]:
y[:cs]

array([ 1,  1, 43, 45, 40, 40, 39, 43])
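
The overlap between consecutive rows is easier to see on a toy list. The sketch below reuses the two comprehensions from above on `toy_idx`, a stand-in for `idx` in which every value equals its position.

```python
toy_idx = list(range(12))   # stand-in for idx: value == position
cs = 8

# every window of 8 consecutive items, moving forward one position at a time
c_in = [[toy_idx[i + j] for i in range(cs)] for j in range(len(toy_idx) - cs)]
# the item that follows each window is its label
c_out = [toy_idx[j + cs] for j in range(len(toy_idx) - cs)]

for row, label in zip(c_in, c_out):
    print(row, "->", label)
```

Each row repeats 7 of the 8 items of the row before it, and each label is the last item of the next row.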

### Create and train model

In [None]:
val_idx = get_cv_idxs(len(idx)-cs-1)

In [None]:
md = ColumnarModelData.from_arrays('.', val_idx, xs, y, bs=512)

In [None]:
# same code as before,
#  but the body of forward is replaced with a loop
#  tanh is like a sigmoid that is offset and rescaled to the range -1 to 1
#  it is common to use tanh in the state-to-state transition
#  to keep activations within bounds,
#  so state to state uses tanh rather than ReLU

# when unrolled, this is a network 8 layers deep
class CharLoopModel(nn.Module):
    # This is an RNN!
    def __init__(self, vocab_size, n_fac):
        super().__init__()
        self.e = nn.Embedding(vocab_size, n_fac)
        self.l_in = nn.Linear(n_fac, n_hidden)
        self.l_hidden = nn.Linear(n_hidden, n_hidden)
        self.l_out = nn.Linear(n_hidden, vocab_size)

    def forward(self, *cs):
        bs = cs[0].size(0)
        h = V(torch.zeros(bs, n_hidden).cuda())
        for c in cs:
            inp = F.relu(self.l_in(self.e(c)))
            h = F.tanh(self.l_hidden(h+inp))
        
        return F.log_softmax(self.l_out(h), dim=-1)
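
The refactoring from three hand-written steps to a loop can be checked without PyTorch. This is a minimal sketch with made-up 2-element states and weights (`w_hh`, `b_hh`, `inputs`); `step` plays the role of `F.tanh(self.l_hidden(h+inp))`.

```python
import math

def linear(w, b, x):
    """y = W x + b, with W stored as a list of rows."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]

def step(h, inp, w, b):
    """One trip around the circle: add the input to the state, then tanh(linear)."""
    summed = [hi + xi for hi, xi in zip(h, inp)]
    return [math.tanh(v) for v in linear(w, b, summed)]

# Hypothetical hidden-to-hidden weights and three already-embedded inputs
w_hh = [[0.5, -0.3], [0.2, 0.8]]
b_hh = [0.1, -0.1]
inputs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

# Unrolled version: one line per character
h = [0.0, 0.0]
h = step(h, inputs[0], w_hh, b_hh)
h = step(h, inputs[1], w_hh, b_hh)
h = step(h, inputs[2], w_hh, b_hh)
unrolled = h

# Loop version: works for any number of characters
h = [0.0, 0.0]
for inp in inputs:
    h = step(h, inp, w_hh, b_hh)
looped = h

print(unrolled, looped)
```

Both versions perform the same operations in the same order, so the final states match exactly; the loop just removes the fixed limit of three characters.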

In [None]:
m = CharLoopModel(vocab_size, n_fac).cuda()
opt = optim.Adam(m.parameters(), 1e-2)

In [None]:
fit(m, md, 1, opt, F.nll_loss)

[ 0.       2.02986  1.99268]                                



In [None]:
set_lrs(opt, 0.001)

In [None]:
fit(m, md, 1, opt, F.nll_loss)

[ 0.       1.73588  1.75103]                                 



In [None]:
# CONCATENATE hidden and input ... don't add
# maybe we should not add the input and hidden state together
# the input state and hidden state are qualitatively different:
#   the input is the encoding of one character
#   h is the encoding of the series of characters so far
#   adding them may lose information
# it may be better to concatenate them rather than add them
# the input layer then needs to go from n_fac + n_hidden to n_hidden
#   to make the dimensions work

# design heuristic:
# IF YOU HAVE DIFFERENT KINDS OF INFORMATION TO COMBINE, THEN YOU WANT TO
#  CONCATENATE, not ADD
# then convert back to a fixed size with a matrix product
class CharLoopConcatModel(nn.Module):
    def __init__(self, vocab_size, n_fac):
        super().__init__()
        self.e = nn.Embedding(vocab_size, n_fac)
        self.l_in = nn.Linear(n_fac+n_hidden, n_hidden)
        self.l_hidden = nn.Linear(n_hidden, n_hidden)
        self.l_out = nn.Linear(n_hidden, vocab_size)
        
    def forward(self, *cs):
        bs = cs[0].size(0)
        h = V(torch.zeros(bs, n_hidden).cuda())
        for c in cs:
            # this is of size n_fac+n_hidden
            inp = torch.cat((h, self.e(c)), 1)
            # now back to size n_hidden
            inp = F.relu(self.l_in(inp))
            # same square matrix as before, n_hidden x n_hidden
            h = F.tanh(self.l_hidden(inp))
        
        return F.log_softmax(self.l_out(h), dim=-1)
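
A tiny example shows why adding can lose information while concatenating does not. The vectors below are invented; in the real model the state and the input have different meanings, and the concatenated vector is mapped back to n_hidden by `l_in`.

```python
def add(h, x):
    """Element-wise sum: the result has the same size as each input."""
    return [a + b for a, b in zip(h, x)]

def concat(h, x):
    """Concatenation: the result keeps both vectors side by side."""
    return h + x

# Two different situations: the state and the input have swapped values
state_a, char_a = [1.0, 2.0], [3.0, 4.0]
state_b, char_b = [3.0, 4.0], [1.0, 2.0]

print(add(state_a, char_a), add(state_b, char_b))        # indistinguishable
print(concat(state_a, char_a), concat(state_b, char_b))  # still different
```

After adding, the two cases give the same vector, so no later layer can tell them apart. After concatenating, the following matrix product can weight the state and the input separately.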

In [None]:
m = CharLoopConcatModel(vocab_size, n_fac).cuda()
opt = optim.Adam(m.parameters(), 1e-3)

In [None]:
it = iter(md.trn_dl)
*xs,yt = next(it)
t = m(*V(xs))

In [None]:
fit(m, md, 1, opt, F.nll_loss)

[ 0.       1.81654  1.78501]                                



In [None]:
set_lrs(opt, 1e-4)

In [None]:
fit(m, md, 1, opt, F.nll_loss)

[ 0.       1.69008  1.69936]                                 



### Test model

In [None]:
def get_next(inp):
    idxs = T(np.array([char_indices[c] for c in inp]))
    p = m(*VV(idxs))
    i = np.argmax(to_np(p))
    return chars[i]

In [None]:
get_next('for thos')

'e'

In [None]:
get_next('part of ')

't'

In [None]:
get_next('queens a')

'n'

## RNN with pytorch

Now do it with PyTorch.

PyTorch will write the loop and the linear input layer automatically.

Use the nn.RNN class.

In [None]:
# same thing in less code with PyTorch
# nn.RNN class
class CharRnn(nn.Module):
    def __init__(self, vocab_size, n_fac):
        super().__init__()
        self.e = nn.Embedding(vocab_size, n_fac)
            # create the RNN
        self.rnn = nn.RNN(n_fac, n_hidden)
        self.l_out = nn.Linear(n_hidden, vocab_size)
        
    def forward(self, *cs):
        bs = cs[0].size(0)
        # create our starting point: the initial hidden state
        h = V(torch.zeros(1, bs, n_hidden))
        # embeddings
        inp = self.e(torch.stack(cs))
        
        # this call is our for loop: pass in the input and the initial
        #   hidden state, get back the output and the final hidden state
        #   h is the hidden state of activations, the circle (size 256)
        outp,h = self.rnn(inp, h)
        # outp holds the hidden state at every time step, stacked on top of
        #   each other; we return only the last one, see outp[-1]
        # h is a rank 3 tensor: the extra leading axis supports
        #   bidirectional and multi-layer RNNs; for now its size is 1
        
        return F.log_softmax(self.l_out(outp[-1]), dim=-1)

In [None]:
m = CharRnn(vocab_size, n_fac).cuda()
opt = optim.Adam(m.parameters(), 1e-3)

In [None]:
it = iter(md.trn_dl)
*xs,yt = next(it)

In [None]:
t = m.e(V(torch.stack(xs)))
t.size()

torch.Size([8, 512, 42])

In [None]:
ht = V(torch.zeros(1, 512,n_hidden))
outp, hn = m.rnn(t, ht)
outp.size(), hn.size()

(torch.Size([8, 512, 256]), torch.Size([1, 512, 256]))

In [None]:
t = m(*V(xs)); t.size()

torch.Size([512, 85])

In [None]:
# 4 epochs
fit(m, md, 4, opt, F.nll_loss)

[ 0.       1.86065  1.84255]                                 
[ 1.       1.68014  1.67387]                                 
[ 2.       1.58828  1.59169]                                 
[ 3.       1.52989  1.54942]                                 



In [None]:
set_lrs(opt, 1e-4)

In [None]:
# 2 more epochs
fit(m, md, 2, opt, F.nll_loss)

[ 0.       1.46841  1.50966]                                 
[ 1.       1.46482  1.5039 ]                                 



### Test model

In [None]:
def get_next(inp):
    idxs = T(np.array([char_indices[c] for c in inp]))
    p = m(*VV(idxs))
    i = np.argmax(to_np(p))
    return chars[i]

In [None]:
get_next('for thos')

'e'

In [None]:
# get the next character, then feed in a new window of 8 characters that includes it
def get_next_n(inp, n):
    res = inp
    for i in range(n):
        c = get_next(inp)
        res += c
        inp = inp[1:]+c
    return res

In [None]:
# here are 40 characters
get_next_n('for thos', 40)

'for those the same the same the same the same th'
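
The repeating output is what you would expect from this kind of generation: the prediction is a deterministic function of the current window, so once a window comes back, everything after it repeats. The sketch below shows the same sliding-window loop with a toy predictor; `CYCLE` and `toy_get_next` are hypothetical stand-ins for the trained model.

```python
CYCLE = {"a": "b", "b": "c", "c": "a"}

def toy_get_next(inp):
    """Hypothetical stand-in for the model: looks only at the last character."""
    return CYCLE[inp[-1]]

def get_next_n(inp, n, get_next=toy_get_next):
    """Generate n characters, sliding the input window forward each time."""
    res = inp
    for _ in range(n):
        c = get_next(inp)
        res += c
        inp = inp[1:] + c   # drop the oldest character, append the new one
    return res

generated = get_next_n("ab", 4)
print(generated)
```

A common way to avoid such loops is to sample the next character from the predicted probabilities rather than always taking the argmax; that is a different technique from the one used in the cell above.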

## Multi-output model

An RNN with an output for each iteration: after each circle of the unrolled network, produce an output. With a 3-character input, spit out an output after each character. 

For example:  
In the for loop do something like  
```
results = []              # start with an empty list
results.append(output)    # inside the loop, after each character
```

This is interesting because the previous network has inefficiencies. Grab the first row, then the next row: all but one of its characters overlap the previous row. We are recalculating the same embeddings and transitions, which is very inefficient: we recalculate 7 out of 8 characters and then add one more at the end. 

Let's not do it that way. The basic idea is to take in non-overlapping sets of characters instead. 

<img src="./RNN3.png" style="width:400px;height:300px;">

It should not be more or less accurate, but it should be more efficient.
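
As a rough illustration of the multi-output idea, here is a toy scalar RNN that appends one output per input step. The weights `w_x`, `w_h`, `w_out` and the function name are made up for this sketch; the real model uses matrices and a softmax output.

```python
import math

def multi_output_rnn(inputs, w_x=0.5, w_h=0.9, w_out=2.0):
    """Toy scalar RNN that emits one output after every input step."""
    h = 0.0
    results = []                        # start with an empty list
    for x in inputs:
        h = math.tanh(w_h * h + w_x * x)
        results.append(w_out * h)       # one output per trip around the circle
    return results

outputs = multi_output_rnn([1.0, 0.0, -1.0])
print(outputs)
```

Three inputs give three outputs, and the last one is what the single-output model would have returned.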

### Setup

Let's take non-overlapping sets of characters this time

In [None]:
c_in_dat = [[idx[i+j] for i in range(cs)] for j in range(0, len(idx)-cs-1, cs)]

Then create the exact same thing, offset by 1, as our labels

In [None]:
c_out_dat = [[idx[i+j] for i in range(cs)] for j in range(1, len(idx)-cs, cs)]

In [None]:
xs = np.stack(c_in_dat)
xs.shape

(75111, 8)

In [None]:
ys = np.stack(c_out_dat)
ys.shape

(75111, 8)

In [None]:
xs[:cs,:cs]

array([[40, 42, 29, 30, 25, 27, 29,  1],
       [ 1,  1, 43, 45, 40, 40, 39, 43],
       [33, 38, 31,  2, 73, 61, 54, 73],
       [ 2, 44, 71, 74, 73, 61,  2, 62],
       [72,  2, 54,  2, 76, 68, 66, 54],
       [67,  9,  9, 76, 61, 54, 73,  2],
       [73, 61, 58, 67, 24,  2, 33, 72],
       [ 2, 73, 61, 58, 71, 58,  2, 67]])

In [None]:
ys[:cs,:cs]

array([[42, 29, 30, 25, 27, 29,  1,  1],
       [ 1, 43, 45, 40, 40, 39, 43, 33],
       [38, 31,  2, 73, 61, 54, 73,  2],
       [44, 71, 74, 73, 61,  2, 62, 72],
       [ 2, 54,  2, 76, 68, 66, 54, 67],
       [ 9,  9, 76, 61, 54, 73,  2, 73],
       [61, 58, 67, 24,  2, 33, 72,  2],
       [73, 61, 58, 71, 58,  2, 67, 68]])
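
The same construction on a toy list makes the relationship between inputs and labels explicit. `toy_idx` is a stand-in for `idx` in which every value equals its position, and a smaller `cs` keeps the printout short.

```python
toy_idx = list(range(20))   # stand-in for idx: value == position
cs = 4

# non-overlapping inputs: windows start at 0, 4, 8, ...
c_in = [[toy_idx[i + j] for i in range(cs)]
        for j in range(0, len(toy_idx) - cs - 1, cs)]
# labels: the same windows shifted right by one position
c_out = [[toy_idx[i + j] for i in range(cs)]
         for j in range(1, len(toy_idx) - cs, cs)]

for x_row, y_row in zip(c_in, c_out):
    print(x_row, "->", y_row)
```

Every label is the character that follows the input in the same column, so each row gives `cs` training targets instead of one.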

### Create and train model

In [None]:
val_idx = get_cv_idxs(len(xs)-cs-1)

In [None]:
md = ColumnarModelData.from_arrays('.', val_idx, xs, ys, bs=512)

In [None]:
class CharSeqRnn(nn.Module):
    def __init__(self, vocab_size, n_fac):
        super().__init__()
        self.e = nn.Embedding(vocab_size, n_fac)
        self.rnn = nn.RNN(n_fac, n_hidden)
        self.l_out = nn.Linear(n_hidden, vocab_size)
        
    def forward(self, *cs):
        bs = cs[0].size(0)
        h = V(torch.zeros(1, bs, n_hidden))
        inp = self.e(torch.stack(cs))
        outp,h = self.rnn(inp, h)
        # this time return the output for every time step ... last time we returned only the last one
        return F.log_softmax(self.l_out(outp), dim=-1)

In [None]:
m = CharSeqRnn(vocab_size, n_fac).cuda()
opt = optim.Adam(m.parameters(), 1e-3)

In [None]:
it = iter(md.trn_dl)
*xst,yt = next(it)

In [None]:
# labels are 512 x 8 because now we are predicting 8 things every time
yt

## Negative log likelihood loss function

F.nll_loss, as used so far, expects a rank 2 tensor of predictions (one row per item in the mini-batch) and a rank 1 tensor of targets.

Here we have 8 time steps (8 characters), and for each one 85 probabilities (one per character in the vocabulary), for every item in our mini-batch, so we have a rank 3 tensor. The negative log likelihood loss function will therefore spit out an error.

We will write a custom loss function for sequences.

It will receive the input and the target, flatten them, and call F.nll_loss.

The first two axes of the targets need to be transposed to match the predictions. 

In PyTorch, the sequence length is the first axis.

The second axis is the batch size.

The third axis is the hidden state (or, after the output layer, the vocabulary size).



In [None]:
# we need a negative log likelihood loss function for sequences

def nll_loss_seq(inp, targ):
    # get the sizes
    # in PyTorch the sequence length is the first axis,
    # the second axis is the batch size,
    # the third axis is the size of each prediction (here vocab_size)
    sl,bs,nh = inp.size()
    # targ is 512 x 8, so transpose it to 8 x 512 to match inp
    # if you see an error saying the tensor is not contiguous, call .contiguous()
    # view(-1) flattens: -1 means "as long as it needs to be"
    # flatten the inputs and targets, then call PyTorch's F.nll_loss
    targ = targ.transpose(0,1).contiguous().view(-1)
    return F.nll_loss(inp.view(-1,nh), targ)
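
The transpose-and-flatten step is easy to get wrong, so here is a small sketch of the same idea with nested lists instead of tensors. It assumes the layout described above: predictions indexed as [time step][batch item][class], targets as [batch item][time step]. The toy probabilities and both function bodies are illustrative only.

```python
import math

def nll_loss(log_probs, targets):
    """Mean negative log-likelihood: one row of log-probs and one target per item."""
    return -sum(row[t] for row, t in zip(log_probs, targets)) / len(targets)

def nll_loss_seq(inp, targ):
    """inp: [seq_len][batch][n_classes] log-probs; targ: [batch][seq_len] indices."""
    # flatten the predictions in (time step, batch item) order
    flat_inp = [row for step in inp for row in step]
    # transpose the targets to [seq_len][batch], then flatten in the same order
    flat_targ = [t for step in zip(*targ) for t in step]
    return nll_loss(flat_inp, flat_targ)

lp = math.log
# seq_len = 3, batch = 2, 2 classes
inp = [
    [[lp(0.9), lp(0.1)], [lp(0.2), lp(0.8)]],   # time step 0: item 0, item 1
    [[lp(0.3), lp(0.7)], [lp(0.6), lp(0.4)]],   # time step 1
    [[lp(0.5), lp(0.5)], [lp(0.1), lp(0.9)]],   # time step 2
]
targ = [[0, 1, 0],    # item 0: target class at steps 0, 1, 2
        [1, 0, 1]]    # item 1

loss = nll_loss_seq(inp, targ)
print(loss)
```

Without the transpose, the flattened targets would be in (batch item, time step) order and would be paired with the wrong rows of predictions.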

In [None]:
# pass the loss function to fit
# fit implements the training loop
# md, the model data object, wraps the training and validation data
fit(m, md, 4, opt, nll_loss_seq)

[ 0.       2.59241  2.40251]                                
[ 1.       2.28474  2.19859]                                
[ 2.       2.13883  2.08836]                                
[ 3.       2.04892  2.01564]                                



In [None]:
set_lrs(opt, 1e-4)

In [None]:
fit(m, md, 1, opt, nll_loss_seq)

[ 0.       1.99819  2.00106]                               



### Identity init!

In [None]:
m = CharSeqRnn(vocab_size, n_fac).cuda()
opt = optim.Adam(m.parameters(), 1e-2)

In [None]:
m.rnn.weight_hh_l0.data.copy_(torch.eye(n_hidden))


    1     0     0  ...      0     0     0
    0     1     0  ...      0     0     0
    0     0     1  ...      0     0     0
       ...          ⋱          ...       
    0     0     0  ...      1     0     0
    0     0     0  ...      0     1     0
    0     0     0  ...      0     0     1
[torch.cuda.FloatTensor of size 256x256 (GPU 0)]

In [None]:
fit(m, md, 4, opt, nll_loss_seq)

[ 0.       2.39428  2.21111]                                
[ 1.       2.10381  2.03275]                                
[ 2.       1.99451  1.96393]                               
[ 3.       1.93492  1.91763]                                



In [None]:
set_lrs(opt, 1e-3)

In [None]:
fit(m, md, 4, opt, nll_loss_seq)

[ 0.       1.84035  1.85742]                                
[ 1.       1.82896  1.84887]                                
[ 2.       1.81879  1.84281]                               
[ 3.       1.81337  1.83801]                                



## Stateful model

### Review previous model

**Unrolled Model**  
- Review the basic unrolled RNN and its code

**Rolled Model**   
- Refactored with a loop (sections overlap, so the same characters are recomputed many times)

**Multi-Output Model**  
- More efficient: the multi-output model no longer has overlapping sections

### Stateful Model
 
- Every time we called forward, we were resetting the hidden state h to zero.
- Instead, don't throw it away. Store it after each pass through the loop.
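
The difference can be sketched with a toy scalar RNN in pure Python. `ToyRnn` and its weights are invented for illustration; the real models in this notebook use `nn.RNN` and friends.

```python
import math

class ToyRnn:
    """Scalar toy RNN: h = tanh(w_h * h + w_x * x). Weights are arbitrary."""
    def __init__(self, stateful, w_h=0.5, w_x=1.0):
        self.stateful, self.w_h, self.w_x = stateful, w_h, w_x
        self.h = 0.0

    def forward(self, xs):
        if not self.stateful:
            self.h = 0.0  # stateless: forget everything at the start of each call
        outputs = []
        for x in xs:
            self.h = math.tanh(self.w_h * self.h + self.w_x * x)
            outputs.append(self.h)
        return outputs

sequence = [0.1, 0.4, -0.3, 0.8, 0.2, -0.5]
first, second = sequence[:3], sequence[3:]

# Reference: process the whole sequence in one go.
whole = ToyRnn(stateful=True).forward(sequence)

# Stateful: h carries over from the first chunk to the second.
stateful = ToyRnn(stateful=True)
stateful_out = stateful.forward(first) + stateful.forward(second)

# Stateless: h is reset to zero before the second chunk.
stateless = ToyRnn(stateful=False)
stateless_out = stateless.forward(first) + stateless.forward(second)

print(stateful_out == whole, stateless_out == whole)
```

Only the stateful version reproduces what the model would compute if it saw the two chunks as one continuous sequence.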


When using an existing API which expects data to be in a certain format, you can either change your data to fit that format or write your own dataset subclass to handle the format your data is already in. Either is fine, but in this case we will put our data in the format TorchText already supports. The fast.ai wrapper around TorchText already has something where you can have a training path and a validation path, with one or more text files in each path containing a bunch of text that gets concatenated together for your language model.
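
One way to prepare that layout is sketched below in plain Python. The function name `split_corpus` is hypothetical, and splitting by lines with an 80/20 cut follows the comments in the setup cell; adjust it if your corpus is structured differently.

```python
from pathlib import Path

def split_corpus(src, out_dir, train_frac=0.8):
    """Write the first `train_frac` of the lines to trn/trn.txt and the rest to val/val.txt."""
    lines = Path(src).read_text(encoding="utf-8").splitlines(keepends=True)
    cut = int(len(lines) * train_frac)
    out_dir = Path(out_dir)
    for name, part in (("trn", lines[:cut]), ("val", lines[cut:])):
        (out_dir / name).mkdir(parents=True, exist_ok=True)
        (out_dir / name / f"{name}.txt").write_text("".join(part), encoding="utf-8")
    return cut, len(lines) - cut

# Example (paths follow the notebook's layout):
# split_corpus("data/nietzsche/nietzsche.txt", "data/nietzsche")
```

The split is not shuffled: the validation set is simply the tail of the document.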

### Setup

In [None]:
# Use torchtext for convenience.
# APIs like torchtext or fastai have methods that expect
#   data in a given format.
# You can modify the data to fit the dataset classes or write your
#   own dataset subclass.
# Usually you want to put the data into the format of torchtext or fastai.
#
# In this case, we made copies of the Nietzsche data in the torchtext format:
#   one copied into training (delete the last 20% of rows)
#   and another into validation (delete all but the last 20%).
# In practice you don't necessarily want to shuffle; use the last 20%.
#
# You need to create the training and validation sets yourself.
from torchtext import vocab, data

from fastai.nlp import *
from fastai.lm_rnn import *

PATH='data/nietzsche/'

TRN_PATH = 'trn/'
VAL_PATH = 'val/'
TRN = f'{PATH}{TRN_PATH}'
VAL = f'{PATH}{VAL_PATH}'

# Note: The student needs to practice her shell skills and prepare her own dataset before proceeding:
# - trn/trn.txt (first 80% of nietzsche.txt)
# - val/val.txt (last 20% of nietzsche.txt)

%ls {PATH}

models/  nietzsche.txt  trn/  val/


In [None]:
%ls {PATH}trn

trn.txt


In [None]:
# Same as before.
# Create a torchtext Field:
#   a description of how to pre-process the text.
#   Lower-case it, though mixed case would also work.
# Character model, so every character is a token:
#   in Python, list("abc") gives ["a", "b", "c"].
# Each mini-batch will contain a list of characters.

TEXT = data.Field(lower=True, tokenize=list)

# bs is the batch size, as before.
# bptt is the number of characters per sequence (now that we know what it means, we no longer call it cs).
# n_hidden is the number of hidden activations.

bs=64; bptt=8; n_fac=42; n_hidden=256

# dictionary with train validation and test
FILES = dict(train=TRN_PATH, validation=VAL_PATH, test=VAL_PATH)

# Model data.
# min_freq=3: probably no character occurs fewer than 3 times.

# After this line, TEXT has an extra attribute called vocab:
#   TEXT.vocab.itos is the list of unique items in the vocabulary,
#   TEXT.vocab.stoi is the reverse mapping (item to index).
md = LanguageModelData.from_text_files(PATH, TEXT, **FILES, bs=bs, bptt=bptt, min_freq=3)

# We can't shuffle like with images.
# bptt will vary slightly between mini-batches to add some variance to training.
#   The batch size must remain constant, because h has to line up.
#   len(md.trn_dl) is the number of mini-batches.
#   md.nt is the number of tokens (unique items in the vocabulary).
len(md.trn_dl), md.nt, len(md.trn_ds), len(md.trn_ds[0].text)

(963, 56, 1, 493747)
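
To make `tokenize=list`, `min_freq`, `itos` and `stoi` concrete, here is a rough pure-Python sketch of a character vocabulary. It is a simplification, not torchtext's implementation: the special tokens and the ordering by frequency are assumptions about how such a vocabulary is typically laid out, and `build_char_vocab` and `numericalize` are illustrative names.

```python
from collections import Counter

def build_char_vocab(text, min_freq=3, specials=("<unk>", "<pad>")):
    """Return (itos, stoi) for a character-level vocabulary."""
    counts = Counter(list(text.lower()))  # list("abc") -> ["a", "b", "c"]
    # Keep characters seen at least min_freq times, most frequent first.
    kept = sorted((c for c, n in counts.items() if n >= min_freq),
                  key=lambda c: (-counts[c], c))
    itos = list(specials) + kept
    stoi = {c: i for i, c in enumerate(itos)}
    return itos, stoi

def numericalize(text, stoi):
    """Map each character to its index, falling back to <unk>."""
    return [stoi.get(c, stoi["<unk>"]) for c in text.lower()]

itos, stoi = build_char_vocab("aaaa bbb cc d")
print(itos)
print(numericalize("abcd", stoi))
```

Characters below the frequency threshold ("c" and "d" in this toy text) map to the unknown token.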

### RNN Stateful Model

A few "wrinkles"

#### Wrinkle No. 1: BPTT repackage h

- see code comments in "CharSeqStatefulRnn2"

#### Wrinkle 2 Mini-batches

How do we set up the data for an RNN? We talked about this in the original RNN lecture.
```

Take the entire long document (all IMDB reviews, or the works of Nietzsche)
and split it into 64 equal-size chunks (not chunks of size 64).

Suppose the corpus is 64 million characters; then each of the 64 chunks has size 1 million.
  
 -----------------------------------------------------
|        |        |        |        |        |        |
 -----------------------------------------------------
     
     
   Then take the chunks of 1 million and put them on top of each other
     to create a stack of 64 chunks.
   Then each mini-batch is a vertical slice of width bptt.
    
              predict over the bptt
   mini         bptt
   batch        <---><---><---><---> ...
         ^      -----+-----+--------------------------
         |     |        |         |        |         | 
         |      --------------------------------------
         |      -------------------------------------
     bs  |     |        |         |        |         | 
         |      -------------------------------------     
         |      --------------------------------------
         |     |        |         |         |        | 
         v      --------------------------------------  
         ^      -------------------------------------
         |     |        |         |        |         | 
         |      -------------------------------------
         |      -------------------------------------
     bs  |     |        |         |        |         | 
         |      ----------------- --------------------     
         |       ------------------------------------
         |      |        |         |        |        | 
         v      ------------------------------------- 
         
         ...
         
     After each mini-batch, go to the next slice.
     
     We look at these in parallel; each chunk is far enough away
       from the others.
     The start of each chunk (e.g. the 1,000,001st character) is in the middle
     of a sentence, but that is rare, so it is okay.
    

```




### Wrinkle No 2 (Hiromi notes - Lesson 7, Pt1)

* How to create mini-batches. We do not want to process one section at a time, but a bunch in parallel at a time.
* When we started looking at TorchText for the first time, we talked about how it creates these mini-batches.
* Jeremy said we take a whole long document consisting of the entire works of Nietzsche or all of the IMDB reviews concatenated together, we split this into 64 equal sized chunks (NOT chunks of size 64).
* For a document that is 64 million characters long, each “chunk” will be 1 million characters. We stack them together and now split them by bptt, so 1 mini-batch consists of a 64 by bptt matrix.
* The first character of the second chunk (the 1,000,001st character) is likely to be in the middle of a sentence. But that is okay, since it only happens once every million characters.
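
The chunk-and-stack procedure can be sketched in a few lines of pure Python. This is a simplified reading of the notes, not the fastai loader: the names `batchify` and `get_batch` are illustrative, the layout is sequence-first (rows are time steps, columns are chunks), and the random variation of bptt is left out.

```python
def batchify(tokens, bs):
    """Split `tokens` into `bs` equal chunks and lay them out as columns.

    Returns a list of rows; row t holds the t-th token of every chunk,
    so the result has shape (chunk_len, bs). Leftover tokens are dropped.
    """
    chunk_len = len(tokens) // bs
    chunks = [tokens[b * chunk_len:(b + 1) * chunk_len] for b in range(bs)]
    return [[chunks[b][t] for b in range(bs)] for t in range(chunk_len)]

def get_batch(data, i, bptt):
    """Return inputs data[i:i+bptt] and targets shifted one step ahead."""
    seq_len = min(bptt, len(data) - 1 - i)
    x = data[i:i + seq_len]
    y = data[i + 1:i + 1 + seq_len]
    return x, y

tokens = list("abcdefghijklmnopqrstuvwxyz")
data = batchify(tokens, bs=4)   # 4 chunks of 6 characters; "yz" is dropped
x, y = get_batch(data, 0, bptt=3)
for row_x, row_y in zip(x, y):
    print(row_x, "->", row_y)
```

Each column continues its own chunk from one mini-batch to the next, which is what lets the hidden state be carried over. Note also that the final slice can come out shorter than bptt.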



<img src="./RNNWrinkle2.png" style="width:400px;height:300px;">

### Question: How do we choose the size of bptt? [21:36]
There are a couple things to think about:
* The first is that the mini-batch matrix has a size of bs (# of chunks) by bptt, so your GPU RAM must be able to fit that times your embedding size. So if you get a CUDA out of memory error, you need to reduce one of these.
* If your training is unstable (e.g. your loss is shooting off to NaN suddenly), then you could try decreasing your bptt, because there are fewer layers for the gradient to explode through.
* If it is too slow [22:44], try decreasing your bptt, because it does those steps one at a time. The for loop cannot be parallelized (in the current version). There is a recent architecture called QRNN (Quasi-Recurrent Neural Network) which does parallelize it, and we hope to cover it in part 2.
* So pick the highest number that satisfies all these.

### Wrinkle 3   
The last mini-batch is likely to be smaller than the others.

### Last wrinkle - PyTorch loss function

PyTorch loss functions are not happy receiving a rank 3 tensor. They expect rank 2 or rank 4.

```
       ------
      |       |
      |       |
      |       |
       -------

```
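
As a rough illustration of the shapes the loss expects, here is a pure-Python version of log-softmax followed by negative log likelihood. It mirrors the idea of `F.log_softmax` and `F.nll_loss` on already flattened data, but it is a sketch with list-of-lists inputs, not the PyTorch implementation.

```python
import math

def log_softmax(logits):
    """Log-probabilities for one row of scores (numerically stable)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def nll_loss(log_probs, targets):
    """Mean negative log likelihood.

    log_probs: list of rows (rank 2: one row per prediction, one column per class)
    targets:   flat list of class indices (rank 1), one per row
    """
    return -sum(row[t] for row, t in zip(log_probs, targets)) / len(targets)

# (sl=2, bs=2, vocab=4) scores, already flattened to 4 rows of 4 columns.
scores = [[0.0, 0.0, 0.0, 0.0],
          [2.0, 0.0, 0.0, 0.0],
          [0.0, 3.0, 0.0, 0.0],
          [0.0, 0.0, 0.0, 1.0]]
targets = [1, 0, 1, 3]
log_probs = [log_softmax(row) for row in scores]
print(nll_loss(log_probs, targets))
```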

### How to pick bptt (question)
As high as it can be, subject to the constraints below.

**Memory**  
The matrix for a mini-batch is bptt x bs x embedding length, so GPU RAM must be able to fit it. If you get a CUDA out of memory error, you need to reduce one of those.

**Stability**  
If training is unstable due to exploding gradients, decrease bptt so there are fewer layers to backpropagate through.

**Performance**  
If training is too slow, decreasing bptt will help.
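
A back-of-the-envelope version of the memory point, assuming 32-bit floats and counting only the embedded input tensor (training also keeps activations for every time step, so treat this as a lower bound; the function name is illustrative):

```python
def embedded_batch_bytes(bs, bptt, n_emb, bytes_per_float=4):
    """Size of one embedded mini-batch tensor of shape (bptt, bs, n_emb)."""
    return bptt * bs * n_emb * bytes_per_float

# Values from this notebook: bs=64, bptt=8, n_fac=42.
print(embedded_batch_bytes(64, 8, 42))
# Doubling bptt doubles the size, as does doubling bs.
print(embedded_batch_bytes(64, 16, 42))
```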


### Data Augmentation (question)
Someone in a recent Kaggle competition won by randomly inserting different rows. 


## RNN

In [None]:
# Stateful RNN

# The last mini-batch will likely be smaller (see Wrinkle No. 3 in forward).
class CharSeqStatefulRnn(nn.Module):
    def __init__(self, vocab_size, n_fac, bs):
        super().__init__()
        self.vocab_size = vocab_size
        self.e = nn.Embedding(vocab_size, n_fac)
        self.rnn = nn.RNN(n_fac, n_hidden)
        self.l_out = nn.Linear(n_hidden, vocab_size)
        # One more line in the constructor: init_hidden
        #   sets self.h to zeros.
        self.init_hidden(bs)
        
    def forward(self, cs):
        bs = cs[0].size(0)
        # Wrinkle No. 3
        #   The last mini-batch is likely to be shorter than the rest
        #   (it might be exactly the right size, but that is unlikely).
        #   So check the batch size: self.h.size(1) is the batch size
        #   the hidden state was created for. If it differs, re-initialize.
        if self.h.size(1) != bs: self.init_hidden(bs)
        outp,h = self.rnn(self.e(cs), self.h)
        # Store h in self.h.
        # Wrinkle No. 1
        #   Don't just do self.h = h: the history would grow far too big.
        #     Suppose we train on a document of 1 million characters:
        #     the unrolled network would have 1 million circles.
        #     Backpropagating how much character 1 affects the final answer
        #     is like a 1-million-layer fully connected net: the chain rule
        #     needs 1 million matrix multiplies, which is very memory intensive.
        #   So from time to time we forget the history: repackage_var grabs
        #     the tensor (no history) and creates a new variable from it.
        #   We call it inside forward, at the end of each pass through cs = 8
        #     characters. We keep the state but throw away the history of how
        #     it got there, so backprop only goes through those 8 steps.
        #   This approach is called BPTT (backprop through time). We saw it in
        #     our original RNN lesson with bptt = 70.
        #   Another reason not to backprop through too many layers is to limit
        #     exploding gradients. A longer bptt will capture more state.
        self.h = repackage_var(h)
        # Wrinkle No. 4 (the last one), an annoyance in PyTorch:
        #   Loss functions such as F.nll_loss are not happy with a rank 3 tensor.
        #   There is no good reason; oddly they handle rank 2 or rank 4.
        #   For each time period and each batch item we have predictions,
        #   and we also have actuals. We want to check if they are the same,
        #   so we flatten both out with ".view":
        #     number of columns = size of vocab (a probability for each letter),
        #     number of rows = as many as necessary (-1), i.e. bs x bptt.
        #   For the target, torchtext automatically makes it flat.
        #
        #   Be careful: as of PyTorch 0.3 you need to tell softmax which axis
        #   to compute over. We want the last axis (the one that sums to 1).
        #   This notebook uses PyTorch 0.3.

        return F.log_softmax(self.l_out(outp), dim=-1).view(-1, self.vocab_size)
    
    def init_hidden(self, bs): self.h = V(torch.zeros(1, bs, n_hidden))
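
The "keep the state, drop the history" idea behind `repackage_var` can be shown without any autograd machinery. The toy below tracks history as a chain of parent pointers; `Node`, `step`, `repackage` and `run` are invented for this sketch. In more recent PyTorch versions the usual idiom for the same purpose is `h.detach()`.

```python
class Node:
    """A value plus a pointer to the node it was computed from."""
    def __init__(self, value, parent=None):
        self.value, self.parent = value, parent

def step(h, x):
    """One toy recurrent step; the new node remembers where it came from."""
    return Node(0.5 * h.value + x, parent=h)

def history_length(node):
    """Number of steps backprop would have to walk through."""
    n = 0
    while node.parent is not None:
        node, n = node.parent, n + 1
    return n

def repackage(node):
    """Keep the state (value) but drop the history (parent chain)."""
    return Node(node.value)

def run(batches, truncate):
    """Process all batches; return the final state and the longest history seen."""
    h = Node(0.0)
    longest = 0
    for batch in batches:
        for x in batch:
            h = step(h, x)
        longest = max(longest, history_length(h))
        if truncate:
            h = repackage(h)
    return h.value, longest

batches = [[1.0] * 8 for _ in range(5)]   # 5 mini-batches with bptt = 8
print(run(batches, truncate=False))
print(run(batches, truncate=True))
```

Both runs end in the same state, but with truncation the history never grows beyond one mini-batch of steps.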

Running on a PC

Surface Book has a GTX 1060 built in.
It is 3 times slower than a 1080 Ti, about the same as an AWS P2 instance.

In [None]:
# create model and optimizer, with the models parameters, and fit it
m = CharSeqStatefulRnn(md.nt, n_fac, 512).cuda()
opt = optim.Adam(m.parameters(), 1e-3)

In [None]:
fit(m, md, 4, opt, F.nll_loss)

[ 0.       1.81983  1.81247]                                 
[ 1.       1.63097  1.66228]                                 
[ 2.       1.54433  1.57824]                                 
[ 3.       1.48563  1.54505]                                 



In [None]:
set_lrs(opt, 1e-4)

fit(m, md, 4, opt, F.nll_loss)

[ 0.       1.4187   1.50374]                                 
[ 1.       1.41492  1.49391]                                 
[ 2.       1.41001  1.49339]                                 
[ 3.       1.40756  1.486  ]                                 



### RNN loop

No one really uses the PyTorch RNN cell in practice because of gradient explosion. 

Instead, replace the RNN cell with a GRU cell.

In [None]:
# RNNCell from the PyTorch source:
#   this is the definition of an RNN cell in PyTorch.
#   It is a matrix multiply of weights x inputs, plus biases.
#   Interestingly, they do not concatenate the input and hidden state;
#   they sum the two linear outputs. Either method works.
# Also notice the use of tanh:
#   it looks like a sigmoid stretched to twice the height and shifted down by 1,
#   forcing values to lie between -1 and 1.
# Since we multiply by the weight matrix again and again,
#   an unbounded function such as ReLU may have more of a gradient
#   explosion problem.
# You can ask PyTorch's RNN cell for a different nonlinearity,
#   but tanh is the typical RNN nonlinearity.

def RNNCell(input, hidden, w_ih, w_hh, b_ih, b_hh):
    return F.tanh(F.linear(input, w_ih, b_ih) + F.linear(hidden, w_hh, b_hh))

In [None]:
# One more refactoring of the RNN: same as before, but replace nn.RNN
#   with an explicit loop over nn.RNNCell (defined as above in PyTorch).
#   It is just matrix multiplications of weights;
#   they do not concatenate, they sum.
class CharSeqStatefulRnn2(nn.Module):
    def __init__(self, vocab_size, n_fac, bs):
        super().__init__()
        self.vocab_size = vocab_size
        self.e = nn.Embedding(vocab_size, n_fac)
        # Replace nn.RNN with nn.RNNCell from PyTorch.
        #   For reference, the definition of nn.RNNCell is shown above;
        #   you can now read the PyTorch source.
        # It is a matrix multiply of weights x inputs, plus biases.
        # Interestingly, they do not concatenate the input and hidden state;
        #   either method works.
        self.rnn = nn.RNNCell(n_fac, n_hidden)
        self.l_out = nn.Linear(n_hidden, vocab_size)
        self.init_hidden(bs)
        
    def forward(self, cs):
        bs = cs[0].size(0)
        if self.h.size(1) != bs: self.init_hidden(bs)
        outp = []
        o = self.h
        for c in cs: 
            # Each time we call the RNN cell, append the result to a list;
            # the results are stacked together afterwards.
            o = self.rnn(self.e(c), o)
            outp.append(o)
        outp = self.l_out(torch.stack(outp))
        self.h = repackage_var(o)
        return F.log_softmax(outp, dim=-1).view(-1, self.vocab_size)
    
    def init_hidden(self, bs): self.h = V(torch.zeros(1, bs, n_hidden))

In [None]:
m = CharSeqStatefulRnn2(md.nt, n_fac, 512).cuda()
opt = optim.Adam(m.parameters(), 1e-3)

In [None]:
fit(m, md, 4, opt, F.nll_loss)

[ 0.       1.81013  1.7969 ]                                 
[ 1.       1.62515  1.65346]                                 
[ 2.       1.53913  1.58065]                                 
[ 3.       1.48698  1.54217]                                 



## GRU - Gated Recurrent Unit

Once per mini batch

http://www.wildml.com/2015/10/recurrent-neural-network-tutorial-part-4-implementing-a-grulstm-rnn-with-python-and-theano/ 

<img src="./GRUCell2.png" style="width:300px;height:150px;">

http://colah.github.io/posts/2015-08-Understanding-LSTMs/

<img src="./GRUCellandEquations.png" style="width:600px;height:200px;">

\*in these equations sigma denotes a sigmoid function

### GRU Cell and Equations 

* Normally, the input gets multiplied by a weight matrix to create new activations h, which get added to the existing activations straight away. That is not what happens with a GRU.
* The input goes into h˜, and it doesn’t just get added to the previous activations: the previous activations first get multiplied by r (the reset gate), which has a value between 0 and 1.

* r is calculated as below — matrix multiplication of some weight matrix and the concatenation of our previous hidden state and new input. In other words, this is a little one hidden layer neural net. It gets put through the sigmoid function as well. This mini neural net learns to determine how much of the hidden states to remember (maybe forget it all when it sees a full-stop character — beginning of a new sentence).

* z gate (update gate) determines to what degree to use h˜ (the new input version of the hidden state) and to what degree to leave the hidden state the same as before.

**reset gate**  
The input goes into h~ and the previous activations get multiplied by r, the reset gate, which lies between 0 and 1. r = sigmoid(Wr x [ht-1, xt]), where [ht-1, xt] is the concatenation of the previous hidden state and the new input xt (sigma here denotes the sigmoid, not a standard deviation).

This is like a mini hidden layer, a logistic regression, a mini neural net. 

This mini neural net decides how much of the hidden state to remember: in what cases do I forget the past? The result is used, along with the input, to create h tilde, the candidate new hidden state.

**z (update gate)**   
How much of the new hidden state, h tilde, to use for updating the hidden state h. It is not restricted to 1 or 0: looking at the equations, it is a linear interpolation. It is not a switch; it can be put in any position.

```
linear interpolation, not a switch:
h_t = (1 - z_t) * h_{t-1} + z_t * h_tilde_t
```
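
A scalar sketch of one GRU step, following the gate descriptions above. The weights in `params` are made up, and real GRUs work on vectors with weight matrices; this is only meant to show how r and z enter the update.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def interpolate(h_prev, h_tilde, z):
    """z = 0 keeps the old state, z = 1 takes the candidate, anything between blends."""
    return (1 - z) * h_prev + z * h_tilde

def gru_step(h_prev, x, p):
    """One scalar GRU step. `p` holds made-up weights: per gate, one for h and one for x."""
    r = sigmoid(p["w_rh"] * h_prev + p["w_rx"] * x)                # reset gate
    z = sigmoid(p["w_zh"] * h_prev + p["w_zx"] * x)                # update gate
    h_tilde = math.tanh(p["w_hh"] * (r * h_prev) + p["w_hx"] * x)  # candidate state
    return interpolate(h_prev, h_tilde, z)

params = dict(w_rh=0.3, w_rx=-0.2, w_zh=0.1, w_zx=0.7, w_hh=0.9, w_hx=0.5)
h = 0.0
for x in [1.0, 0.5, -1.0]:
    h = gru_step(h, x, params)
    print(h)
```

Note that the PyTorch reference code shown later in this section writes the update as `newgate + inputgate * (hidden - newgate)`, which is the same interpolation with the roles of z and 1 - z swapped.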

**Tanh** 
* Question about tanh [44:06]: As we saw last week, tanh forces the value to be between -1 and 1. Since we are multiplying by this weight matrix again and again, we would worry that relu (since it is unbounded) might have more of a gradient explosion problem. Having said that, RNNCell lets you specify a different nonlinearity (the default is tanh), so you can ask it to use relu if you want to.
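
The worry about unbounded activations can be illustrated with a one-dimensional toy: apply the same weight and nonlinearity over and over, as an RNN does across time steps. The weight 1.5 and the helper `iterate` are arbitrary choices for this sketch.

```python
import math

def relu(x):
    return max(0.0, x)

def iterate(nonlinearity, w, h0, steps):
    """Apply h = nonlinearity(w * h) repeatedly, as an RNN does over time."""
    h = h0
    for _ in range(steps):
        h = nonlinearity(w * h)
    return h

w, h0, steps = 1.5, 1.0, 50
print(iterate(math.tanh, w, h0, steps))  # stays inside (-1, 1)
print(iterate(relu, w, h0, steps))       # grows like w ** steps
```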


In [None]:
# same as before but replaces RNN cell with GRU cell
# will get better results ... down to 1.37 vs ~1.5
class CharSeqStatefulGRU(nn.Module):
    def __init__(self, vocab_size, n_fac, bs):
        super().__init__()
        self.vocab_size = vocab_size
        self.e = nn.Embedding(vocab_size, n_fac)
        self.rnn = nn.GRU(n_fac, n_hidden)
        self.l_out = nn.Linear(n_hidden, vocab_size)
        self.init_hidden(bs)
        
    def forward(self, cs):
        bs = cs[0].size(0)
        if self.h.size(1) != bs: self.init_hidden(bs)
        outp,h = self.rnn(self.e(cs), self.h)
        self.h = repackage_var(h)
        return F.log_softmax(self.l_out(outp), dim=-1).view(-1, self.vocab_size)
    
    def init_hidden(self, bs): self.h = V(torch.zeros(1, bs, n_hidden))

In [None]:
# GRU From the pytorch source code - for reference

def GRUCell(input, hidden, w_ih, w_hh, b_ih, b_hh):
    gi = F.linear(input, w_ih, b_ih)
    gh = F.linear(hidden, w_hh, b_hh)
    i_r, i_i, i_n = gi.chunk(3, 1)
    h_r, h_i, h_n = gh.chunk(3, 1)

    resetgate = F.sigmoid(i_r + h_r)
    inputgate = F.sigmoid(i_i + h_i)
    newgate = F.tanh(i_n + resetgate * h_n)
    return newgate + inputgate * (hidden - newgate)

In [None]:
m = CharSeqStatefulGRU(md.nt, n_fac, 512).cuda()

opt = optim.Adam(m.parameters(), 1e-3)

In [None]:
fit(m, md, 6, opt, F.nll_loss)

[ 0.       1.68409  1.67784]                                 
[ 1.       1.49813  1.52661]                                 
[ 2.       1.41674  1.46769]                                 
[ 3.       1.36359  1.43818]                                 
[ 4.       1.33223  1.41777]                                 
[ 5.       1.30217  1.40511]                                 



In [None]:
set_lrs(opt, 1e-4)

In [None]:
fit(m, md, 3, opt, F.nll_loss)

[ 0.       1.22708  1.36926]                                 
[ 1.       1.21948  1.3696 ]                                 
[ 2.       1.22541  1.36969]                                 



### Putting it all together: LSTM

**Chris Colah's blog**  
http://colah.github.io/posts/2015-08-Understanding-LSTMs/

- LSTMs have one more piece of state compared with the GRU
- it is called the cell state, and it sits alongside the hidden state
- so we have to keep a tuple of matrices, all the same size as the hidden state
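
To show what the extra piece of state looks like, here is a scalar sketch of one LSTM step using the standard gate equations. The weights are made up, and the helper names (`lstm_step`, `make_params`) are illustrative; the point is that the state passed around is the pair (h, c).

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, state, p):
    """One scalar LSTM step. `state` is the tuple (h, c); `p` holds made-up weights."""
    h, c = state
    i = sigmoid(p["w_ix"] * x + p["w_ih"] * h + p["b_i"])    # input gate
    f = sigmoid(p["w_fx"] * x + p["w_fh"] * h + p["b_f"])    # forget gate
    g = math.tanh(p["w_gx"] * x + p["w_gh"] * h + p["b_g"])  # candidate cell value
    o = sigmoid(p["w_ox"] * x + p["w_oh"] * h + p["b_o"])    # output gate
    c_new = f * c + i * g            # cell state: keep some old, add some new
    h_new = o * math.tanh(c_new)     # hidden state: gated view of the cell state
    return h_new, c_new

def make_params(b_i=0.0, b_f=0.0):
    """All weights 0.5; only the input and forget gate biases are adjustable."""
    p = {f"w_{gate}{src}": 0.5 for gate in "ifgo" for src in "xh"}
    p.update(b_i=b_i, b_f=b_f, b_g=0.0, b_o=0.0)
    return p

state = (0.0, 0.0)   # both parts start at zero
for x in [1.0, -0.5, 0.25]:
    state = lstm_step(x, state, make_params())
print(state)
```

With the forget gate pushed towards 1 and the input gate towards 0, the cell state is carried forward almost unchanged, which is how an LSTM can hold on to information over many steps.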

In [None]:
# doubled the size of the hidden layer
from fastai import sgdr
n_hidden=512

In [None]:
# Same as before, but replace GRU with LSTM.

# The hidden state is now a tuple of matrices of the same size.
# Also put dropout inside the RNN with PyTorch:
#   dropout=0.5 is applied to the outputs of each LSTM layer except the last.
# Doubled the size of the hidden layer, in the hope that this
# will allow it to learn more.

# The code is otherwise identical.
class CharSeqStatefulLSTM(nn.Module):
    def __init__(self, vocab_size, n_fac, bs, nl):
        super().__init__()
        self.vocab_size,self.nl = vocab_size,nl
        self.e = nn.Embedding(vocab_size, n_fac)
        self.rnn = nn.LSTM(n_fac, n_hidden, nl, dropout=0.5)
        self.l_out = nn.Linear(n_hidden, vocab_size)
        self.init_hidden(bs)
        
    def forward(self, cs):
        bs = cs[0].size(0)
        if self.h[0].size(1) != bs: self.init_hidden(bs)    
        outp,h = self.rnn(self.e(cs), self.h)
        self.h = repackage_var(h)
        return F.log_softmax(self.l_out(outp), dim=-1).view(-1, self.vocab_size)
    
    def init_hidden(self, bs):
        self.h = (V(torch.zeros(self.nl, bs, n_hidden)),
                  V(torch.zeros(self.nl, bs, n_hidden)))

## SGDR without the learner class. Fastai Callbacks

- Demonstrates how to use fastai callbacks
- As before, create the model m, a standard PyTorch model
- opt = optim.Adam(m.parameters(), learning_rate)
- Use lo, the fastai layer optimizer


In [None]:
m = CharSeqStatefulLSTM(md.nt, n_fac, 512, 2).cuda()
# lo is a fastai LayerOptimizer object. It takes:
#   the optim.Adam class constructor from PyTorch,
#   the model,
#   the learning rate,
#   an optional weight decay.
# lo is tiny.
# The only reason lo exists is for differential learning rates and differential weight decay.
#   Also, all the mechanics in fastai assume you have one of these,
#   so if you want to use SGDR you need one of these.
lo = LayerOptimizer(optim.Adam, m, 1e-2, 1e-5)

In [None]:
# Behind the scenes, the lo.opt property gives you the optimizer.
lo.opt

In [None]:
os.makedirs(f'{PATH}models', exist_ok=True)

In [None]:
# Now we can pass in the optimizer (and, later, callbacks).
fit(m, md, 2, lo.opt, F.nll_loss)

[ 0.       1.72032  1.64016]                                 
[ 1.       1.62891  1.58176]                                 



### Callbacks

There are lots of cool things that we can do with callbacks:  
- save our model automatically
- things at start of training, end of training, epoch, batch 
- SGDR cb callback
- new approach to weight decay
- draw graphs

In [None]:
# cosine annealing call back
#   requires lo object
# pass in optimizer to fit
on_end = lambda sched, cycle: save_model(m, f'{PATH}models/cyc_{cycle}')
# len(md.trn_dl) is how often (in mini-batches) the learning rate resets
# the on_cycle_end callback saves the model

cb = [CosAnneal(lo, len(md.trn_dl), cycle_mult=2, on_cycle_end=on_end)]
# pass in layer optimizer 
# cosine annealing by changing the learning rate inside this object
# call back will update the learning rate in the optimizer
fit(m, md, 2**4-1, lo.opt, F.nll_loss, callbacks=cb)

[ 0.       1.47969  1.4472 ]                                 
[ 1.       1.51411  1.46612]                                 
[ 2.       1.412    1.39909]                                 
[ 3.       1.53689  1.48337]                                 
[ 4.       1.47375  1.43169]                                 
[ 5.       1.39828  1.37963]                                 
[ 6.       1.34546  1.35795]                                 
[ 7.       1.51999  1.47165]                                 
[ 8.       1.48992  1.46146]                                 
[ 9.       1.45492  1.42829]                                 
[ 10.        1.42027   1.39028]                              
[ 11.        1.3814    1.36539]                              
[ 12.        1.33895   1.34178]                              
[ 13.        1.30737   1.32871]                              
[ 14.        1.28244   1.31518]                              
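
The epoch count `2**4-1` and the jumps in the log above can be explained with a little arithmetic. The sketch below assumes the usual SGDR setup (a first cycle of one epoch, each cycle `cycle_mult` times longer than the last, and a half-cosine learning-rate curve within each cycle); the function names are illustrative and this is not fastai's own code.

```python
import math

def cycle_lengths(n_cycles, cycle_len=1, cycle_mult=2):
    """Length in epochs of each successive cycle."""
    return [cycle_len * cycle_mult ** i for i in range(n_cycles)]

def cos_anneal_lr(init_lr, it, n_iter):
    """Learning rate after `it` of the `n_iter` iterations in the current cycle."""
    return init_lr / 2 * (1 + math.cos(math.pi * it / n_iter))

lengths = cycle_lengths(4)
print(lengths, sum(lengths))   # four cycles add up to 2**4 - 1 epochs

# Epoch index (0-based, as in the training log) at which each cycle ends.
ends, total = [], 0
for n in lengths:
    total += n
    ends.append(total - 1)
print(ends)
```

Under these assumptions the cycles end at rows 0, 2, 6 and 14 of the log, which is where the validation loss is lowest before jumping back up when the learning rate restarts.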



In [None]:
# get down to 1.25
on_end = lambda sched, cycle: save_model(m, f'{PATH}models/cyc_{cycle}')
cb = [CosAnneal(lo, len(md.trn_dl), cycle_mult=2, on_cycle_end=on_end)]
fit(m, md, 2**6-1, lo.opt, F.nll_loss, callbacks=cb)

[ 0.       1.46053  1.43462]                                 
[ 1.       1.51537  1.47747]                                 
[ 2.       1.39208  1.38293]                                 
[ 3.       1.53056  1.49371]                                 
[ 4.       1.46812  1.43389]                                 
[ 5.       1.37624  1.37523]                                 
[ 6.       1.3173   1.34022]                                 
[ 7.       1.51783  1.47554]                                 
[ 8.       1.4921   1.45785]                                 
[ 9.       1.44843  1.42215]                                 
[ 10.        1.40948   1.40858]                              
[ 11.        1.37098   1.36648]                              
[ 12.        1.32255   1.33842]                              
[ 13.        1.28243   1.31106]                              
[ 14.        1.25031   1.2918 ]                              
[ 15.        1.49236   1.45316]                              
[ 16.   

### Test

In [None]:
def get_next(inp):
    idxs = TEXT.numericalize(inp)
    p = m(VV(idxs.transpose(0,1)))
    r = torch.multinomial(p[-1].exp(), 1)
    return TEXT.vocab.itos[to_np(r)[0]]

In [None]:
# pass in some text
get_next('for thos')

'e'

In [None]:
def get_next_n(inp, n):
    res = inp
    for i in range(n):
        c = get_next(inp)
        res += c
        inp = inp[1:]+c
    return res
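
The two functions above combine sampling from the predicted distribution with a sliding window over the most recent characters. A pure-Python sketch of the same two ideas follows, with a stand-in for the model; `sample_index`, `generate` and `toy_next_char` are hypothetical names, and `random.choices` plays the role that `torch.multinomial` plays in the notebook.

```python
import math
import random

def sample_index(log_probs, rng=random):
    """Draw one index with probability exp(log_prob)."""
    weights = [math.exp(lp) for lp in log_probs]
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]

def generate(seed, n, next_char):
    """Append n characters, feeding back a sliding window of len(seed) characters."""
    res, window = seed, seed
    for _ in range(n):
        c = next_char(window)
        res += c
        window = window[1:] + c   # drop the oldest character, add the newest
    return res

# Stand-in "model": always predicts the letter after the last one seen.
def toy_next_char(window):
    return chr((ord(window[-1]) - ord("a") + 1) % 26 + ord("a"))

print(generate("for thos", 5, toy_next_char))
```

Sampling (rather than always taking the most likely character) is what gives the generated text its variety.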

In [None]:
print(get_next_n('for thos', 400))

for those the skemps), or
imaginates, though they deceives. it should so each ourselvess and new
present, step absolutely for the
science." the contradity and
measuring, 
the whole!

293. perhaps, that every life a values of blood
of
intercourse when it senses there is unscrupulus, his very rights, and still impulse, love?
just after that thereby how made with the way anything, and set for harmless philos


## Don't be disheartened ... 
The difference between output that looks bad and output that looks good is not that large from a loss perspective. So if you're getting something that is beginning to look like English, don't be disheartened; you're very close. Maybe you just need a slightly better loss.