https://github.com/fastai/fastai/blob/master/courses/dl1/lesson6-rnn.ipynb

In Lesson 4 (http://course.fast.ai/lessons/lesson4.html) we modelled text one word at a time (01:23:30).

Now we work one character at a time.

A Recurrent Neural Network (RNN) is not fundamentally different from the networks we saw before. The basic problem it solves is keeping state for long-term dependencies, because tokens at the end of a sentence often refer back to the beginning. Note that this can also be done with a convolutional network, but with an RNN it is much more straightforward.

Stateful Representation:

- where are we now
- long-term dependencies
- memory
- variable-length sequences

This is prep for lesson 11: 

TO CHECK: 
The question is whether one can use a character model to build a kind of hidden language that acts as a translation layer between a source and a target language. This might be more useful for visualisation than directly for the algorithm.

In [6]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

from fastai.io import *
from fastai.conv_learner import *

from fastai.column_data import *

## Setup


We're going to download the collected works of Nietzsche to use as our data for this class.

In [7]:
PATH='data/nietzsche/'

In [8]:
get_data("https://s3.amazonaws.com/text-datasets/nietzsche.txt", f'{PATH}nietzsche.txt')
text = open(f'{PATH}nietzsche.txt').read()
print('corpus length:', len(text))

nietzsche.txt: 606kB [00:02, 257kB/s]                             

corpus length: 600893





In [9]:
text[:400]

'PREFACE\n\n\nSUPPOSING that Truth is a woman--what then? Is there not ground\nfor suspecting that all philosophers, in so far as they have been\ndogmatists, have failed to understand women--that the terrible\nseriousness and clumsy importunity with which they have usually paid\ntheir addresses to Truth, have been unskilled and unseemly methods for\nwinning a woman? Certainly she has never allowed herself '

In [10]:
chars = sorted(list(set(text))) ## gives the unique letters
vocab_size = len(chars)+1
print('total chars:', vocab_size)

total chars: 85


Sometimes it's useful to have a zero value in the dataset, e.g. for padding

In [11]:
chars.insert(0, "\0") ## padding character

''.join(chars[1:-6]) ## this is what our char-level vocab 'chars' looks like

'\n !"\'(),-.0123456789:;=?ABCDEFGHIJKLMNOPQRSTUVWXYZ[]_abcdefghijklmnopqrstuvwxy'

Map from chars to indices and back again

In [12]:
char_indices = {c: i for i, c in enumerate(chars)}
indices_char = {i: c for i, c in enumerate(chars)}

idx will be the data we use from now on - it simply converts all the characters to their index (based on the mapping above)

Combining character-level with word-level models can be quite useful, e.g. for sequence-to-sequence translation. Instead of treating newly encountered words as unknown or unusual, one can fall back to a character-level model.

An 'in between' option is BPE (byte pair encoding): https://arxiv.org/abs/1508.07909

In [13]:
idx = [char_indices[c] for c in text]

idx[:10]

[40, 42, 29, 30, 25, 27, 29, 1, 1, 1]

In [14]:
''.join(indices_char[i] for i in idx[:70])

'PREFACE\n\n\nSUPPOSING that Truth is a woman--what then? Is there not gro'
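To check the mapping by hand, here is a minimal sketch on a toy string. All `toy_*` names are made up for illustration and are not part of the notebook.

```python
# Toy version of the char <-> index mapping (all toy_* names are hypothetical)
toy_text = "abba cab"

toy_chars = sorted(set(toy_text))   # unique characters, sorted
toy_chars.insert(0, "\0")           # index 0 is reserved for padding

toy_char_indices = {c: i for i, c in enumerate(toy_chars)}
toy_indices_char = {i: c for i, c in enumerate(toy_chars)}

# Encode the text as indices, then decode it again
toy_idx = [toy_char_indices[c] for c in toy_text]
decoded = "".join(toy_indices_char[i] for i in toy_idx)

print(toy_idx)
print(decoded)
```

Decoding the indices gives back the original string, which is a quick sanity check for the two dictionaries.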

## Three char model

### Create inputs

Create a list of every 3rd character (cs=3), starting at the 0th, 1st, 2nd, then 3rd characters

In [15]:
## lists of the characters at offsets 0, 1, 2 and 3, each taken with a stride of cs
cs=3
c1_dat = [idx[i]   for i in range(0, len(idx)-cs, cs)]
c2_dat = [idx[i+1] for i in range(0, len(idx)-cs, cs)]
c3_dat = [idx[i+2] for i in range(0, len(idx)-cs, cs)]
c4_dat = [idx[i+3] for i in range(0, len(idx)-cs, cs)]

Our inputs

In [16]:
## the inputs are the first 3 lists above; np.stack turns each one into an array
x1 = np.stack(c1_dat)
x2 = np.stack(c2_dat)
x3 = np.stack(c3_dat)

Our output

In [17]:
y = np.stack(c4_dat)

The first 4 inputs and outputs

In [18]:
## we use 3 characters to predict the 4th
x1[:4], x2[:4], x3[:4]

(array([40, 30, 29,  1]), array([42, 25,  1, 43]), array([29, 27,  1, 45]))

In [19]:
## so we use e.g. 40, 42, 29 to predict 30. 30 is then also the first input of the next sample.
## then we use 30, 25, 27 to predict 29 and so on
y[:4]

array([30, 29,  1, 40])

In [20]:
x1.shape, y.shape

((200297,), (200297,))
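The pairing of three inputs with one target can be replayed on the first few indices of the corpus. This is a sketch: `toy_idx` is reconstructed from the outputs printed above, and the names `starts`, `inputs` and `targets` are illustrative only.

```python
# First 13 indices of the corpus, reconstructed from the outputs printed above
toy_idx = [40, 42, 29, 30, 25, 27, 29, 1, 1, 1, 43, 45, 40]
cs = 3

starts = range(0, len(toy_idx) - cs, cs)               # 0, 3, 6, 9
inputs = [tuple(toy_idx[i:i + cs]) for i in starts]    # three characters in
targets = [toy_idx[i + cs] for i in starts]            # the fourth character out

for chars_in, char_out in zip(inputs, targets):
    print(chars_in, "->", char_out)
```

Each target is also the first input of the following sample, because consecutive samples start cs characters apart.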

### Create and train model

Pick a size for our hidden state

In [21]:
## how many activations do we need
n_hidden = 256

The number of latent factors to create (i.e. the size of the embedding matrix)

In [22]:
## size of our embeddings / around half the number of characters
n_fac = 42
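An embedding is a lookup into a matrix with one row per vocabulary entry and n_fac columns. The sketch below uses plain Python lists and a made-up 4 x 3 matrix to show that the lookup gives the same result as multiplying a one-hot vector by that matrix. It does not use nn.Embedding itself.

```python
# Hypothetical tiny embedding matrix: 4 vocabulary entries, 3 latent factors each
emb = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]

def one_hot(i, n):
    # vector of length n with a 1.0 at position i
    return [1.0 if j == i else 0.0 for j in range(n)]

def vec_mat(v, M):
    # row vector times matrix
    return [sum(v[r] * M[r][c] for r in range(len(M))) for c in range(len(M[0]))]

char_index = 2
via_one_hot = vec_mat(one_hot(char_index, len(emb)), emb)
via_lookup = emb[char_index]

print(via_one_hot, via_lookup)
```

The lookup avoids building the one-hot vector, which is why the models here pass character indices rather than one-hot encoded inputs.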

In [23]:
## This char model is a standard fully connected model!
## The one addition is that we add each of the inputs one at a time


## Note that all arrows of the same colour share one weight matrix. Why?
## GREEN: for the characters, e.g. why would a character have a different meaning depending on whether it is 1st, 2nd or 3rd?
## ORANGE: for moving between characters, the weights should be the same too.

## Both arrows say: take a character and represent it as a set of features

class Char3Model(nn.Module):
    def __init__(self, vocab_size, n_fac):
        super().__init__()
        ## IN: vocabulary size / Out: factors in the embedding
        self.e = nn.Embedding(vocab_size, n_fac)

        # The 'green arrow' from our diagram - the layer operation from input to hidden
        self.l_in = nn.Linear(n_fac, n_hidden)

        # The 'orange arrow' from our diagram - the layer operation from hidden to hidden
        self.l_hidden = nn.Linear(n_hidden, n_hidden)
        
        # The 'blue arrow' from our diagram - the layer operation from hidden to output
        self.l_out = nn.Linear(n_hidden, vocab_size)
    ## we pass in 3 characters

    ## NOTE: the code below is rather repetitive: we have 2x3 lines that are almost identical.
    ## This should really be a loop
    def forward(self, c1, c2, c3):
        ## pass every character through an embedding / linear layer and relu
        
        ## dimensions:
            ## self.e gives a vector of length n_fac (42) --> l_in maps it to size n_hidden
        in1 = F.relu(self.l_in(self.e(c1)))
        in2 = F.relu(self.l_in(self.e(c2)))
        in3 = F.relu(self.l_in(self.e(c3)))
        
        ## We always take the previous hidden state h and add the result for the next character
        h = V(torch.zeros(in1.size()).cuda()) ## h starts out as just a bunch of zeros
        ## dimensions:
            ## in1 is of size n_hidden, h is of size n_hidden --> adding them keeps size n_hidden
            ## self.l_hidden --> returns size n_hidden

        ## hyperbolic tangent (tanh):
            ## - like a sigmoid, rescaled and offset to the range (-1, 1)
            ## - commonly used in this transition because it stops the activations from 'flying' too high or too low

        h = F.tanh(self.l_hidden(h+in1))
        h = F.tanh(self.l_hidden(h+in2))
        h = F.tanh(self.l_hidden(h+in3))

        ## dim=-1: take the softmax over the vocabulary axis, as in the later models
        return F.log_softmax(self.l_out(h), dim=-1)

In [24]:
## ColumnarModelData: whatever you pass in as inputs (here x1, x2, x3)
## is passed on to forward(self, c1, c2, c3) above

md = ColumnarModelData.from_arrays('.', [-1], np.stack([x1,x2,x3], axis=1), y, bs=512)

In [25]:
## create a standard pytorch model
m = Char3Model(vocab_size, n_fac).cuda()

RuntimeError: Cannot initialize CUDA without ATen_cuda library. PyTorch splits its backend into two shared libraries: a CPU library and a CUDA library; this error has occurred because you are trying to use some CUDA functionality, but the CUDA library has not been loaded by the dynamic linker for some reason.  The CUDA library MUST be loaded, EVEN IF you don't directly use any symbols from the CUDA library! One common culprit is a lack of -Wl,--no-as-needed in your link arguments; many dynamic linkers will delete dynamic library dependencies if you don't depend on any of their symbols.  You can check if this has occurred by using ldd on your binary to see if there is a dependency on *_cuda.so library.

In [None]:
## this allows peeking inside to see what's going on
it = iter(md.trn_dl) ## iterate through the training set
*xs,yt = next(it) ## call next to grab a mini batch; it returns all the xs and the yt tensor
t = m(*V(xs)) ## use the model as if it were a function; pass in the variablised version of the tensors
## len xs = 3 from forward(self, c1, c2, c3):
## xs[0].size --> 512 i.e. the batch size
## not one hot encoded, we use an embedding

In [None]:
## pytorch optimiser
opt = optim.Adam(m.parameters(), 1e-2)

In [None]:
## fit the model
fit(m, md, 1, opt, F.nll_loss)

In [None]:
set_lrs(opt, 0.001)

In [None]:
fit(m, md, 1, opt, F.nll_loss)

### Test model

In [None]:
## pass in 3 chars, e.g. 'y. '

def get_next(inp):
    ## turn input into a tensor of an array of the character index for each character in that list
    idxs = T(np.array([char_indices[c] for c in inp]))
    p = m(*VV(idxs)) ## turn those into variable, then pass that to our model
    i = np.argmax(to_np(p)) ## argmax to get the right character. to_np to convert into numpy format
    return chars[i]

In [None]:
get_next('y. ')

In [None]:
get_next('ppl')

In [None]:
get_next(' th')

In [None]:
get_next('and')

## Our first RNN!

This is the same as Char3Model, only instead of writing out the same step for each character we make a loop. The very first input character can also be part of the loop, because we start out with a hidden state initialised to zeros.

Note that an RNN shares weights across time steps. In convolutional networks we share filters.
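To see that a loop computes exactly what the unrolled Char3Model steps compute, here is a sketch in plain Python with a made-up 2 x 2 weight matrix. `W_h`, `inputs`, `step` and `run_loop` are illustrative names. Biases and the embedding / input layers are left out.

```python
import math

def mat_vec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def step(W_h, h, inp):
    # one hidden-to-hidden step: h_new = tanh(W_h @ (h + inp))
    summed = [a + b for a, b in zip(h, inp)]
    return [math.tanh(z) for z in mat_vec(W_h, summed)]

W_h = [[0.5, -0.2], [0.1, 0.3]]                  # hypothetical shared weights
inputs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]    # stand-ins for in1, in2, in3

# Unrolled, written out step by step
h = [0.0, 0.0]
h = step(W_h, h, inputs[0])
h = step(W_h, h, inputs[1])
h = step(W_h, h, inputs[2])
h_unrolled = h

# Looped: the same weights are reused at every time step
def run_loop(W_h, inputs):
    h = [0.0] * len(W_h)
    for inp in inputs:
        h = step(W_h, h, inp)
    return h

h_looped = run_loop(W_h, inputs)
print(h_unrolled, h_looped)
```

The loop version also works for any number of inputs, which is what lets the next model take 8 characters instead of 3.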

### Create inputs

This is the size of our unrolled RNN.

In [None]:
cs=8

For each starting position in the text, create a list of the 8 consecutive characters that begin there. These overlapping windows will be the 8 inputs to our model.

In [None]:
c_in_dat = [[idx[i+j] for i in range(cs)] for j in range(len(idx)-cs)]

Then create a list of the next character in each of these series. This will be the labels for our model.

In [None]:
c_out_dat = [idx[j+cs] for j in range(len(idx)-cs)]

In [None]:
xs = np.stack(c_in_dat, axis=0)

In [None]:
xs.shape

In [None]:
y = np.stack(c_out_dat)

So each row below is one series of 8 characters from the text.

In [None]:
xs[:cs,:cs]

How to read this: 
- chars 0-7: 40, 42, 29, 30, 25, 27, 29,  1 --> the label is the last entry of the next row, so 1
- chars 1-8: 42, 29, 30, 25, 27, 29,  1,  1 --> the label is the last entry of the next row, so 1
- chars 2-9: 29, 30, 25, 27, 29,  1,  1,  1 --> the label is the last entry of the next row, so 43
- chars 3-10: 30, 25, 27, 29,  1,  1,  1, 43 --> the label is the last entry of the next row  ...


Note that because the windows overlap, we re-calculate the same things over and over again. 


...and this is the next character after each sequence.

In [None]:
y[:cs] ## the next character after each row above (the last entry of the following row), i.e. always the prediction target
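The overlapping windows and their labels can be reproduced on the first 13 indices of the corpus. This is a sketch: `toy_idx` is reconstructed from the outputs printed earlier, and `xs_toy` / `y_toy` are illustrative names.

```python
# First 13 indices of the corpus, reconstructed from the outputs printed earlier
toy_idx = [40, 42, 29, 30, 25, 27, 29, 1, 1, 1, 43, 45, 40]
cs = 8

# One window of cs characters per starting position j (windows overlap)
xs_toy = [[toy_idx[i + j] for i in range(cs)] for j in range(len(toy_idx) - cs)]
# The label of window j is the character right after it
y_toy = [toy_idx[j + cs] for j in range(len(toy_idx) - cs)]

for window, label in zip(xs_toy, y_toy):
    print(window, "->", label)
```

Because consecutive windows are shifted by one character, the label of each window equals the last entry of the next window.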

### Create and train model

In [None]:
val_idx = get_cv_idxs(len(idx)-cs-1)

In [None]:
md = ColumnarModelData.from_arrays('.', val_idx, xs, y, bs=512)

In [None]:
## Note: unrolled, this is a deep network, e.g. 8 layers.
## Therefore, be careful with training (deeper is harder to train)
class CharLoopModel(nn.Module):
    # This is an RNN!
    def __init__(self, vocab_size, n_fac):
        super().__init__()
        self.e = nn.Embedding(vocab_size, n_fac)
        self.l_in = nn.Linear(n_fac, n_hidden)
        self.l_hidden = nn.Linear(n_hidden, n_hidden)
        self.l_out = nn.Linear(n_hidden, vocab_size)
        
    def forward(self, *cs):
        bs = cs[0].size(0)
        h = V(torch.zeros(bs, n_hidden).cuda())
        ## Same as before in Char3Model;
        ## only this loop is new, generalising Char3Model
        for c in cs:
            inp = F.relu(self.l_in(self.e(c)))
            ## NOTE: we add h (hidden state: encoding of the characters so far) and inp (encoding of the current character)
            ## adding them might lose information, see below
            h = F.tanh(self.l_hidden(h+inp))
#######################
## BEFORE, now looped

        ##in1 = F.relu(self.l_in(self.e(c1)))
        ##in2 = F.relu(self.l_in(self.e(c2)))
        ##in3 = F.relu(self.l_in(self.e(c3)))
        
        
        ##h = V(torch.zeros(in1.size()).cuda()) ## hs is just a bunch of zeros
        
        ##h = F.tanh(self.l_hidden(h+in1))
        ##h = F.tanh(self.l_hidden(h+in2))
        ##h = F.tanh(self.l_hidden(h+in3))

#########################
        
        return F.log_softmax(self.l_out(h), dim=-1)

In [None]:
m = CharLoopModel(vocab_size, n_fac).cuda()
opt = optim.Adam(m.parameters(), 1e-2)

In [None]:
fit(m, md, 1, opt, F.nll_loss)

In [None]:
set_lrs(opt, 0.001)

In [None]:
fit(m, md, 1, opt, F.nll_loss)

In [None]:
## Heuristic: to combine things of different kinds, concatenate rather than add, so as not to lose information

class CharLoopConcatModel(nn.Module):
    def __init__(self, vocab_size, n_fac):
        super().__init__()
        self.e = nn.Embedding(vocab_size, n_fac)
        ## the input is now n_fac + n_hidden wide --> BEFORE / ABOVE: self.l_in = nn.Linear(n_fac, n_hidden)
        ## this way the dimensions check out
        self.l_in = nn.Linear(n_fac+n_hidden, n_hidden)
        self.l_hidden = nn.Linear(n_hidden, n_hidden)
        self.l_out = nn.Linear(n_hidden, vocab_size)
        
    def forward(self, *cs):
        bs = cs[0].size(0)
        ## for loop needs a starting point to be able to work
        h = V(torch.zeros(bs, n_hidden).cuda())
        for c in cs:
            ## size: n_fac + n_hidden
            inp = torch.cat((h, self.e(c)), 1)
            ## back to size n_hidden
            inp = F.relu(self.l_in(inp))
            ## same square matrix as before 
            h = F.tanh(self.l_hidden(inp))
        
        return F.log_softmax(self.l_out(h), dim=-1)
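A small illustration of why concatenating can preserve information that adding throws away. The vectors are made up, and both have length 2 here so that addition is defined. In the model the two parts have lengths n_hidden and n_fac.

```python
def add(a, b):
    return [x + y for x, y in zip(a, b)]

def concat(a, b):
    return list(a) + list(b)

# Two different (hidden state, character encoding) pairs, hypothetical values
h1, e1 = [1.0, 0.0], [0.0, 1.0]
h2, e2 = [0.0, 1.0], [1.0, 0.0]

added_1, added_2 = add(h1, e1), add(h2, e2)
joined_1, joined_2 = concat(h1, e1), concat(h2, e2)

print(added_1, added_2)      # the sums are identical
print(joined_1, joined_2)    # the concatenations are still different
```

After adding, the next layer cannot tell the two situations apart. After concatenating, it still can, at the cost of a wider input layer.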

In [None]:
m = CharLoopConcatModel(vocab_size, n_fac).cuda()
opt = optim.Adam(m.parameters(), 1e-3)

In [None]:
it = iter(md.trn_dl)
*xs,yt = next(it)
t = m(*V(xs))

In [None]:
fit(m, md, 1, opt, F.nll_loss)

In [None]:
set_lrs(opt, 1e-4)

In [None]:
fit(m, md, 1, opt, F.nll_loss)

### Test model

In [None]:
def get_next(inp):
    idxs = T(np.array([char_indices[c] for c in inp]))
    p = m(*VV(idxs))
    i = np.argmax(to_np(p))
    return chars[i]

In [None]:
get_next('for thos') ## we pass in 8 things

In [None]:
get_next('part of ')

In [None]:
get_next('queens a')

## RNN with pytorch

What can PyTorch do for us?
- write the loop
- write the input layers

CharRnn does the same job as CharLoopConcatModel, only written with PyTorch's built-in nn.RNN

In [None]:
class CharRnn(nn.Module):
    def __init__(self, vocab_size, n_fac):
        super().__init__()
        self.e = nn.Embedding(vocab_size, n_fac)
        ## create an RNN
        self.rnn = nn.RNN(n_fac, n_hidden) ## wrapped up and made easier than in CharLoopConcatModel
        self.l_out = nn.Linear(n_hidden, vocab_size)
        
    def forward(self, *cs):
        bs = cs[0].size(0)
        ## starting point for the loop
        ## the leading 1 is the number of layers times the number of directions (one layer, not bidirectional)
        h = V(torch.zeros(1, bs, n_hidden)) ## rank 3 tensor / above it was a rank 2 tensor
        inp = self.e(torch.stack(cs))
        ## the for loop happens in here
        ## h is the initial hidden state and the starting point
        ## the hidden state has size 256 and carries the state from one char to the next
        outp,h = self.rnn(inp, h) ## getting back the hidden state is useful. 
        
        ## pytorch returns the hidden state of every time step in outp; here we just want the last one: -1
        ## pass it through l_out to get the correct vocab size
        return F.log_softmax(self.l_out(outp[-1]), dim=-1)

In [None]:
m = CharRnn(vocab_size, n_fac).cuda()
opt = optim.Adam(m.parameters(), 1e-3)

In [None]:
it = iter(md.trn_dl)
*xs,yt = next(it)

In [None]:
t = m.e(V(torch.stack(xs)))
t.size()

In [None]:
ht = V(torch.zeros(1, 512,n_hidden))
outp, hn = m.rnn(t, ht)
outp.size(), hn.size()

In [None]:
t = m(*V(xs)); t.size()

In [None]:
fit(m, md, 4, opt, F.nll_loss)

In [None]:
set_lrs(opt, 1e-4)

In [None]:
fit(m, md, 2, opt, F.nll_loss)

### Test model

In [None]:
def get_next(inp):
    idxs = T(np.array([char_indices[c] for c in inp]))
    p = m(*VV(idxs))
    i = np.argmax(to_np(p))
    return chars[i]

In [None]:
get_next('for thos')

In [None]:
def get_next_n(inp, n):
    res = inp
    for i in range(n):
        c = get_next(inp)
        res += c
        inp = inp[1:]+c ## result from last run, so we feed into each other
    return res

In [None]:
get_next_n('for thos', 40) ## give start sequence
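The sliding-window logic of get_next_n can be tried without a trained model. In this sketch `toy_next` is a made-up stand-in for get_next that simply returns the letter after the last character of the window, and `generate` mirrors get_next_n with the predictor passed in as an argument.

```python
def generate(next_char, seed, n):
    # Same loop as get_next_n, with the predictor passed in explicitly
    res = seed
    window = seed
    for _ in range(n):
        c = next_char(window)
        res += c
        window = window[1:] + c   # drop the oldest char, append the new one
    return res

def toy_next(window):
    # Hypothetical predictor: the next letter of the alphabet after the last char
    return chr((ord(window[-1]) - ord("a") + 1) % 26 + ord("a"))

print(generate(toy_next, "abc", 4))
```

The window always keeps the same length, so the model sees an input of the size it was trained on at every step.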

## Multi-output model

Idea: we put the output layer inside the loop, i.e. there is not one single output (the next char) but rather an output for every input, e.g. for 3 inputs we have 3 outputs.

Technically this is a small step, but it helps a lot with efficiency and performance. Above, we re-calculated a lot.

### Setup

Let's take non-overlapping sets of characters this time

In [None]:
c_in_dat = [[idx[i+j] for i in range(cs)] for j in range(0, len(idx)-cs-1, cs)]

Then create the exact same thing, offset by 1, as our labels

In [None]:
c_out_dat = [[idx[i+j] for i in range(cs)] for j in range(1, len(idx)-cs, cs)]

In [None]:
xs = np.stack(c_in_dat)
xs.shape

In [None]:
ys = np.stack(c_out_dat)
ys.shape

The windows do not overlap: after we see chars 0-7 we predict chars 1-8. This is the same task as above, but much more efficient.

lesson 7: http://course.fast.ai/lessons/lesson7.html minute 6.30

In [None]:
xs[:cs,:cs]

In [None]:
ys[:cs,:cs]
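The same construction on a toy sequence makes the offset-by-one relationship easy to see. This is a sketch: `toy_idx` is just 0..19 standing in for idx, and cs is set to 4 to keep the output short.

```python
toy_idx = list(range(20))   # hypothetical stand-in for idx
cs = 4

# Non-overlapping input windows: starting positions 0, cs, 2*cs, ...
xs_toy = [[toy_idx[i + j] for i in range(cs)] for j in range(0, len(toy_idx) - cs - 1, cs)]
# The same windows shifted by one character are the labels
ys_toy = [[toy_idx[i + j] for i in range(cs)] for j in range(1, len(toy_idx) - cs, cs)]

for x_row, y_row in zip(xs_toy, ys_toy):
    print(x_row, "->", y_row)
```

Every input position now has its own label, so one pass over a window yields cs training signals instead of one.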

### Create and train model

In [None]:
val_idx = get_cv_idxs(len(xs)-cs-1)

In [None]:
md = ColumnarModelData.from_arrays('.', val_idx, xs, ys, bs=512)

In [None]:
## same as above, with only 1 change below
## optimisation (next lesson): we should keep the hidden state between batches, as something was learnt
class CharSeqRnn(nn.Module):
    def __init__(self, vocab_size, n_fac):
        super().__init__()
        self.e = nn.Embedding(vocab_size, n_fac)
        self.rnn = nn.RNN(n_fac, n_hidden)
        self.l_out = nn.Linear(n_hidden, vocab_size)
        ## Note (see below): store h in self and keep updating it so we don't forget it between mini batches: CharSeqStatefulRnn
        ##  self.h = V(torch.zeros(1, bs, n_hidden))
    
    def forward(self, *cs):
        bs = cs[0].size(0)
        ## Note: this is a problem because with every mini batch we forget h, i.e. what was learnt before: it is reset to zeros

        h = V(torch.zeros(1, bs, n_hidden))
        ## Optimisation: avoid gradient explosion by initialising with an identity matrix (see below)
        inp = self.e(torch.stack(cs))
        outp,h = self.rnn(inp, h)
        ## note: above we used outp[-1] to take only the last time step. Now we grab all of them
        return F.log_softmax(self.l_out(outp), dim=-1) 

In [None]:
m = CharSeqRnn(vocab_size, n_fac).cuda()
opt = optim.Adam(m.parameters(), 1e-3)

In [None]:
it = iter(md.trn_dl)
*xst,yt = next(it)
## yt labels is 512 x 8

- The negative log likelihood loss function expects a rank 2 tensor of predictions and a rank 1 tensor of targets
- In an RNN we have 8 time steps, with one prediction per vocabulary entry for each, and that for 512 items in the mini batch -- a rank 3 tensor
- PyTorch would crash 

In [None]:
# custom loss function 
def nll_loss_seq(inp, targ):
    ## inp comes out of the model as: 1 time steps, 2 batch size, 3 predictions per vocabulary entry
    sl,bs,nh = inp.size() ## 8, 512, vocab size (nh is the vocab size here, not the hidden size)
    ## yt.size is 512 x 8,
    ## so transpose its first 2 axes to match. 
    ## pytorch only updates internal metadata when transposing,
    ## which sometimes gives a 'contiguous' error --> just add .contiguous()
    ## view(-1) is like numpy's reshape, used here to flatten
    targ = targ.transpose(0,1).contiguous().view(-1)
    ## each prediction has one entry per vocabulary item
    return F.nll_loss(inp.view(-1,nh), targ) ## call the pytorch version, which expects rank 2 predictions
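Why the targets have to be transposed before flattening can be seen with nested lists. In this sketch each entry is a made-up tag saying which time step and which batch item it belongs to. No PyTorch is involved.

```python
sl, bs = 3, 2   # hypothetical sequence length and batch size

# Predictions are time-major: preds[t][b]
preds = [[f"t{t}b{b}" for b in range(bs)] for t in range(sl)]
# Targets are batch-major: targ[b][t]
targ = [[f"t{t}b{b}" for t in range(sl)] for b in range(bs)]

flat_preds = [p for row in preds for p in row]

# Transpose the targets to time-major, then flatten
targ_t = [list(col) for col in zip(*targ)]
flat_targ = [x for row in targ_t for x in row]

# Flattening without the transpose pairs predictions with the wrong targets
flat_targ_wrong = [x for row in targ for x in row]

print(flat_preds)
print(flat_targ)
print(flat_targ_wrong)
```

Only the transposed version lines up entry by entry with the flattened predictions, which is what the loss function needs.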

In [None]:
## lowest level fast.ai abstraction 
fit(m, md, 4, opt, nll_loss_seq) ## other than MD (test, training & validation set) this is all pytorch standard stuff

In [None]:
set_lrs(opt, 1e-4)

In [None]:
fit(m, md, 1, opt, nll_loss_seq)

### Identity init!

Initialising the hidden-to-hidden weights with the identity matrix helps to avoid gradient explosion: the hidden state is multiplied by this matrix again and again, 
and with an identity matrix it neither shrinks nor explodes.

Geoffrey Hinton et al. https://arxiv.org/abs/1504.00941
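The effect of repeated multiplication can be sketched in plain Python. This ignores the non-linearity and the gradients and only shows what 50 multiplications by the same matrix do to the length of a vector. The matrices and the starting vector are made up.

```python
def mat_vec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def scaled_identity(n, s):
    # n x n matrix with s on the diagonal and zeros elsewhere
    return [[s if i == j else 0.0 for j in range(n)] for i in range(n)]

def norm_after(W, v, steps):
    # length of v after multiplying it by W `steps` times
    for _ in range(steps):
        v = mat_vec(W, v)
    return sum(x * x for x in v) ** 0.5

v0 = [3.0, 4.0]   # length 5

norm_identity = norm_after(scaled_identity(2, 1.0), v0, 50)
norm_bigger = norm_after(scaled_identity(2, 1.1), v0, 50)
norm_smaller = norm_after(scaled_identity(2, 0.9), v0, 50)

print(norm_identity, norm_bigger, norm_smaller)
```

A matrix only slightly "larger" than the identity blows the vector up, one slightly "smaller" shrinks it towards zero, and the identity leaves it unchanged.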

In [None]:
m = CharSeqRnn(vocab_size, n_fac).cuda()
opt = optim.Adam(m.parameters(), 1e-2)

In [None]:
## this is the implementation of the paper
m.rnn.weight_hh_l0.data.copy_(torch.eye(n_hidden)) ## eye is the identity matrix in pytorch

In [None]:
fit(m, md, 4, opt, nll_loss_seq)

In [None]:
set_lrs(opt, 1e-3)

In [None]:
fit(m, md, 4, opt, nll_loss_seq)

## Stateful model

### Setup

Minute 23:30 lesson 7 http://course.fast.ai/lessons/lesson7.html


In [None]:
from torchtext import vocab, data

from fastai.nlp import *
from fastai.lm_rnn import *

PATH='data/nietzsche/'

TRN_PATH = 'trn/'
VAL_PATH = 'val/'
TRN = f'{PATH}{TRN_PATH}'
VAL = f'{PATH}{VAL_PATH}'

# Note: The student needs to practice her shell skills and prepare her own dataset before proceeding:
# - trn/trn.txt (first 80% of nietzsche.txt)
# - val/val.txt (last 20% of nietzsche.txt)

%ls {PATH}

In [None]:
%ls {PATH}trn

In [None]:
## torchtext Field: a description of how to go about pre-processing the text
## lowercasing is not strictly necessary; tokenize=list gives a character model: list('abc') --> ['a','b','c']
TEXT = data.Field(lower=True, tokenize=list)
## TEXT also holds the vocabulary: TEXT.vocab.itos lists all unique tokens, and stoi is the reverse mapping
## batch size, number of chars for bptt, n_fac = size of the embedding, n_hidden = size of the hidden state = size of the 'orange circles' (see the lecture)
bs=64; bptt=8; n_fac=42; n_hidden=256

##
FILES = dict(train=TRN_PATH, validation=VAL_PATH, test=VAL_PATH) ## data
 ## min_freq is probably irrelevant here: probably no char occurs fewer than 3 times

md = LanguageModelData.from_text_files(PATH, TEXT, **FILES, bs=bs, bptt=bptt, min_freq=3)
len(md.trn_dl), md.nt, len(md.trn_ds), len(md.trn_ds[0].text)

## len(md.trn_dl) = length of the dataloader = how many mini batches
## md.nt = number of tokens, i.e. how many unique things are in the vocabulary, e.g. 56
##
##
## len(md.trn_ds[0].text) = 493747 characters; len(md.trn_dl) = 963 is roughly 493747 / bs / bptt
## --> note this number does not match exactly, because bptt is randomised (instead of shuffling the data) 
## so 5% of the time it will be 8/2, but it is constant within a mini batch, where we do a matrix multiplication

### RNN

In [5]:
??repackage_var

Object `repackage_var` not found.


In [None]:
class CharSeqStatefulRnn(nn.Module):
    def __init__(self, vocab_size, n_fac, bs):
        self.vocab_size = vocab_size
        super().__init__()
        self.e = nn.Embedding(vocab_size, n_fac)
        self.rnn = nn.RNN(n_fac, n_hidden)
        self.l_out = nn.Linear(n_hidden, vocab_size)
        ## new line
        self.init_hidden(bs)
        
    def forward(self, cs):
        bs = cs[0].size(0)
        ## the last minibatch likely is of different size than the others
        ## recreate hidden state if batch size changes
        if self.h.size(1) != bs: self.init_hidden(bs)
        ## RNN takes in self.h / spits out the new hidden state
        outp,h = self.rnn(self.e(cs), self.h)
        ## the last hidden state must be stored away, so it is passed over to the next mini batch
        ## Backprop through time (BPTT):
            ## why not self.h = h ? 
            ## if the document trained on is 1 million characters long, the unrolled RNN would be a million layers long
            ## like a 1 million layer fully connected network: the chain rule would need to propagate back through this huge chain
            ## Therefore: from time to time forget the HISTORY, but keep the STATE
            ## repackage_var: grabs the tensor out of the hidden state 
            ## (a tensor has no history) and makes a new variable from it. 
            ## So after 8 time steps (bptt) it throws away the history 
            ## This is all there is to backprop through time
            ## Note: if there are instabilities, then more layers lead to more instability
            ## the longer bptt, the further back you can reach to have a kind of state
            
        self.h = repackage_var(h)
        ## the pytorch loss function CANNOT take a rank 3 tensor (pytorch issue, no specific reason)
        ## rank 2 and 4 are fine
        ## .view: flattens things out / torchtext automatically flattened the target already
        return F.log_softmax(self.l_out(outp), dim=-1).view(-1, self.vocab_size)
    ## set self.h to zeros
    def init_hidden(self, bs): self.h = V(torch.zeros(1, bs, n_hidden))

- PyTorch requires us to say over which axis to take the softmax
- F.log_softmax(self.l_out(outp), dim=-1).view(-1, self.vocab_size)
- We want the last axis, which contains the probability per letter of the alphabet
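What "log softmax over the last axis" means can be sketched in plain Python on a made-up tensor of shape time steps x batch x vocab = 2 x 1 x 3. The function below is a textbook log-softmax for one vector, not PyTorch's implementation.

```python
import math

def log_softmax(scores):
    # numerically stable log-softmax of a single vector
    m = max(scores)
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_sum for s in scores]

# Hypothetical raw outputs: 2 time steps, 1 batch item, 3 vocabulary entries
scores = [[[1.0, 2.0, 3.0]], [[0.0, 0.0, 0.0]]]

# Apply log-softmax along the last axis only (the vocabulary axis)
log_probs = [[log_softmax(vec) for vec in step] for step in scores]

# Exponentiating gives probabilities that sum to 1 for every (time step, batch item)
prob_sums = [[sum(math.exp(x) for x in vec) for vec in step] for step in log_probs]
print(prob_sums)
```

Normalising over any other axis would mix different time steps or batch items, and the rows would no longer be probability distributions over characters.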


Mini Batching / Parallelism

- Issue: efficiency: memory / performance / stability
- Goal: look through chunks of data in parallel
- Example: 
    - make 64 equally sized chunks of the text corpus. With 64 million characters we would have 64 x 1 million 
    - then we build each mini batch like this: from each 1 million part we take a chunk of length BPTT
    - so the sequence length of a mini batch equals the backprop-through-time value BPTT
- How to pick BPTT size? batch size * batch size (?)
- we can use torchtext for this kind of chunking
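The chunking described above can be sketched in plain Python. This is an illustration of the idea only, not the torchtext or fastai code. `batchify` and `minibatches` are hypothetical helper names, and bptt is fixed here rather than randomised.

```python
def batchify(tokens, bs):
    # Split the token stream into bs contiguous chunks and lay them out as columns:
    # data[t][b] is the t-th token of chunk b
    n = len(tokens) // bs
    return [[tokens[b * n + t] for b in range(bs)] for t in range(n)]

def minibatches(data, bptt):
    # Walk down the rows bptt steps at a time; targets are the inputs shifted by one row
    for i in range(0, len(data) - 1, bptt):
        seq_len = min(bptt, len(data) - 1 - i)
        yield data[i:i + seq_len], data[i + 1:i + 1 + seq_len]

tokens = list(range(24))          # hypothetical corpus of 24 tokens
data = batchify(tokens, bs=3)     # 8 rows, 3 columns
batches = list(minibatches(data, bptt=4))

for x, y in batches:
    print(x, "->", y)
```

Each column continues where it left off in the previous mini batch, which is what makes it meaningful to carry the hidden state over from one mini batch to the next.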


In [None]:
## create our model
m = CharSeqStatefulRnn(md.nt, n_fac, 512).cuda()
## optimiser with that models parameters
opt = optim.Adam(m.parameters(), 1e-3)

In [None]:
## fit this
fit(m, md, 4, opt, F.nll_loss)


[ 0.       1.81983  1.81247]                                 
[ 1.       1.63097  1.66228]                                 
[ 2.       1.54433  1.57824]                                 
[ 3.       1.48563  1.54505]                                 



In [None]:
set_lrs(opt, 1e-4)

fit(m, md, 4, opt, F.nll_loss)


[ 0.       1.4187   1.50374]                                 
[ 1.       1.41492  1.49391]                                 
[ 2.       1.41001  1.49339]                                 
[ 3.       1.40756  1.486  ]                                 



### RNN loop

In [None]:
# From the pytorch source, for reference

def RNNCell(input, hidden, w_ih, w_hh, b_ih, b_hh):
    ## note: they do not concatenate, but add
    ## tanh: like a sigmoid with double the height, minus 1
        ## so it lies between -1 and +1
        ## a relu might lead to gradient explosion
        ## note: you can choose to use relu as the non-linearity instead
    return F.tanh(F.linear(input, w_ih, b_ih) + F.linear(hidden, w_hh, b_hh))
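The remark that tanh is like a sigmoid with double the height, minus 1, corresponds to the identity tanh(x) = 2 * sigmoid(2x) - 1. A quick numerical check in plain Python:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh_via_sigmoid(x):
    # stretch the sigmoid to double height, shift it down by 1, and rescale the input
    return 2.0 * sigmoid(2.0 * x) - 1.0

for x in [-3.0, -1.0, 0.0, 0.5, 2.0]:
    print(x, math.tanh(x), tanh_via_sigmoid(x))
```

Both columns agree up to floating point error, and all values stay strictly between -1 and +1.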

In [None]:
## same as before, 
    ## only removed outp,h = self.rnn(self.e(cs), self.h)
    ## instead: self.rnn = nn.RNNCell(n_fac, n_hidden) --> see above
class CharSeqStatefulRnn2(nn.Module):
    def __init__(self, vocab_size, n_fac, bs):
        super().__init__()
        self.vocab_size = vocab_size
        self.e = nn.Embedding(vocab_size, n_fac)
        self.rnn = nn.RNNCell(n_fac, n_hidden)
        self.l_out = nn.Linear(n_hidden, vocab_size)
        self.init_hidden(bs)
        
    def forward(self, cs):
        bs = cs[0].size(0)
        if self.h.size(1) != bs: self.init_hidden(bs)
        outp = []
        o = self.h
        ## we need the for loop back
        for c in cs: 
            o = self.rnn(self.e(c), o)
            ## append, so that everything can be stacked together afterwards
            outp.append(o)
        outp = self.l_out(torch.stack(outp))
        self.h = repackage_var(o)
        return F.log_softmax(outp, dim=-1).view(-1, self.vocab_size)
    
    def init_hidden(self, bs): self.h = V(torch.zeros(1, bs, n_hidden))

In [None]:
m = CharSeqStatefulRnn2(md.nt, n_fac, 512).cuda()
opt = optim.Adam(m.parameters(), 1e-3)

In [None]:
fit(m, md, 4, opt, F.nll_loss)


[ 0.       1.81013  1.7969 ]                                 
[ 1.       1.62515  1.65346]                                 
[ 2.       1.53913  1.58065]                                 
[ 3.       1.48698  1.54217]                                 



### GRU: Gated Recurrent Unit

The plain RNN above is rarely used in practice, because gradient explosions are an issue (forcing a low learning rate and a small bptt).
Instead, a GRU cell is often used.

http://www.wildml.com/2015/10/recurrent-neural-network-tutorial-part-4-implementing-a-grulstm-rnn-with-python-and-theano/

A GRU uses small internal neural nets as gates. A reset gate decides when to forget the internal state, e.g. when a full stop ends a sentence, the old state can be thrown away. Another gate decides how much to use from the previous hidden state versus from the new input. Mathematically this is an interpolation.

http://colah.github.io/posts/2015-08-Understanding-LSTMs/

In [None]:
class CharSeqStatefulGRU(nn.Module):
    def __init__(self, vocab_size, n_fac, bs):
        super().__init__()
        self.vocab_size = vocab_size
        self.e = nn.Embedding(vocab_size, n_fac)
        self.rnn = nn.GRU(n_fac, n_hidden)
        self.l_out = nn.Linear(n_hidden, vocab_size)
        self.init_hidden(bs)
        
    def forward(self, cs):
        bs = cs[0].size(0)
        if self.h.size(1) != bs: self.init_hidden(bs)
        outp,h = self.rnn(self.e(cs), self.h)
        self.h = repackage_var(h)
        return F.log_softmax(self.l_out(outp), dim=-1).view(-1, self.vocab_size)
    
    def init_hidden(self, bs): self.h = V(torch.zeros(1, bs, n_hidden))

In [None]:
# From the pytorch source code - for reference
## with some optimisations 
## this GRU cell can replace the RNN cell
def GRUCell(input, hidden, w_ih, w_hh, b_ih, b_hh):
    gi = F.linear(input, w_ih, b_ih)
    gh = F.linear(hidden, w_hh, b_hh)
    i_r, i_i, i_n = gi.chunk(3, 1)
    h_r, h_i, h_n = gh.chunk(3, 1)

    resetgate = F.sigmoid(i_r + h_r)
    inputgate = F.sigmoid(i_i + h_i)
    newgate = F.tanh(i_n + resetgate * h_n)
    return newgate + inputgate * (hidden - newgate)
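The last line of the reference function is an interpolation between the old hidden state and the new candidate. Here is a scalar sketch in plain Python that follows the same gate structure. The weights in `w` and the extra bias `b_z` are made up, and everything is a single number instead of a tensor.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x, h, w, b_z=0.0):
    r = sigmoid(w["ir"] * x + w["hr"] * h)           # reset gate
    z = sigmoid(w["iz"] * x + w["hz"] * h + b_z)     # 'inputgate' in the reference code
    n = math.tanh(w["in"] * x + r * (w["hn"] * h))   # 'newgate': the candidate state
    return n + z * (h - n)                           # equals z*h + (1-z)*n

w = {k: 0.5 for k in ["ir", "iz", "in", "hr", "hz", "hn"]}   # hypothetical weights
x, h = 1.0, 0.8

kept = gru_step(x, h, w, b_z=50.0)          # gate close to 1: keep the old state
replaced = gru_step(x, h, w, b_z=-50.0)     # gate close to 0: take the candidate
candidate = math.tanh(w["in"] * x + sigmoid(w["ir"] * x + w["hr"] * h) * (w["hn"] * h))

print(kept, replaced, candidate)
```

With the gate saturated at 1 the cell passes the old hidden state through untouched, which is how a GRU can hold on to state over many time steps.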

In [None]:
m = CharSeqStatefulGRU(md.nt, n_fac, 512).cuda()

opt = optim.Adam(m.parameters(), 1e-3)

In [None]:
fit(m, md, 6, opt, F.nll_loss)


[ 0.       1.68409  1.67784]                                 
[ 1.       1.49813  1.52661]                                 
[ 2.       1.41674  1.46769]                                 
[ 3.       1.36359  1.43818]                                 
[ 4.       1.33223  1.41777]                                 
[ 5.       1.30217  1.40511]                                 



In [None]:
set_lrs(opt, 1e-4)

In [None]:
fit(m, md, 3, opt, F.nll_loss)


[ 0.       1.22708  1.36926]                                 
[ 1.       1.21948  1.3696 ]                                 
[ 2.       1.22541  1.36969]                                 



### Putting it all together: LSTM

An LSTM is like a GRU, only it has an additional piece of state: the cell state. 

Cell state: because of it, the hidden state is now a tuple (hidden state, cell state)
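A scalar sketch of one LSTM step in plain Python shows where the tuple comes from. It follows the standard textbook LSTM equations. The weights in `w` and the extra biases `b_i` and `b_f` are made up, and this is not the nn.LSTM implementation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h, c, w, b_i=0.0, b_f=0.0):
    i = sigmoid(w["ii"] * x + w["hi"] * h + b_i)   # input gate
    f = sigmoid(w["if"] * x + w["hf"] * h + b_f)   # forget gate
    g = math.tanh(w["ig"] * x + w["hg"] * h)       # candidate cell update
    o = sigmoid(w["io"] * x + w["ho"] * h)         # output gate
    c_new = f * c + i * g                          # cell state
    h_new = o * math.tanh(c_new)                   # hidden state
    return h_new, c_new                            # the state is a tuple

w = {k: 0.5 for k in ["ii", "if", "ig", "io", "hi", "hf", "hg", "ho"]}   # hypothetical weights
x, h, c = 1.0, 0.2, 0.7

state = lstm_step(x, h, c, w)
# Forget gate close to 1 and input gate close to 0: the cell state is carried over unchanged
h_keep, c_keep = lstm_step(x, h, c, w, b_i=-50.0, b_f=50.0)

print(state, c_keep)
```

This is why init_hidden in the LSTM model has to create two tensors instead of one.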

In [None]:
from fastai import sgdr

n_hidden=512 ## doubled the size of the hidden layer, see also dropout below

In [None]:
## code as above, only LSTM / see comments
class CharSeqStatefulLSTM(nn.Module):
    def __init__(self, vocab_size, n_fac, bs, nl):
        super().__init__()
        self.vocab_size,self.nl = vocab_size,nl
        self.e = nn.Embedding(vocab_size, n_fac)
        self.rnn = nn.LSTM(n_fac, n_hidden, nl, dropout=0.5) ## added dropout inside the RNN: applied to the outputs of each LSTM layer except the last
        self.l_out = nn.Linear(n_hidden, vocab_size)
        self.init_hidden(bs)
        
    def forward(self, cs):
        bs = cs[0].size(0)
        if self.h[0].size(1) != bs: self.init_hidden(bs)
        ## pass in self.h
        outp,h = self.rnn(self.e(cs), self.h)
        self.h = repackage_var(h) ## bptt as before
        return F.log_softmax(self.l_out(outp), dim=-1).view(-1, self.vocab_size)
    
    def init_hidden(self, bs):
        ## return a tuple to account for also the cell state
        self.h = (V(torch.zeros(self.nl, bs, n_hidden)),
                  V(torch.zeros(self.nl, bs, n_hidden)))

Fast.ai specific

SGDR with callbacks

In [None]:
m = CharSeqStatefulLSTM(md.nt, n_fac, 512, 2).cuda() ## standard pytorch model
## instead of using an optimiser like Adam directly from pytorch, we now use the fast.ai LayerOptimizer class
## Optimiser: optim.Adam
## Model: m 
## learning rate: 1e-2
## weight decay: 1e-5
## Goal: differential learning rates & weight decay
## you also need this class for callbacks and SGDR

lo = LayerOptimizer(optim.Adam, m, 1e-2, 1e-5)

## try lo.opt in a cell to get the optimiser

In [None]:
os.makedirs(f'{PATH}models', exist_ok=True)

In [None]:
fit(m, md, 2, lo.opt, F.nll_loss)


[ 0.       1.72032  1.64016]                                 
[ 1.       1.62891  1.58176]                                 



In [None]:
on_end = lambda sched, cycle: save_model(m, f'{PATH}models/cyc_{cycle}')
## len(md.trn_dl) --> length of an epoch / length of that dataloader
##
## on_cycle_end=on_end --> callback at the end of each cycle, here it saves the model
## SGDR callback
cb = [CosAnneal(lo, len(md.trn_dl), cycle_mult=2, on_cycle_end=on_end)]
## lo.opt is the optimiser
## callback: cosine annealing callback, which requires a LayerOptimizer object. 
## It changes the learning rate inside lo
fit(m, md, 2**4-1, lo.opt, F.nll_loss, callbacks=cb) ## improvements


[ 0.       1.47969  1.4472 ]                                 
[ 1.       1.51411  1.46612]                                 
[ 2.       1.412    1.39909]                                 
[ 3.       1.53689  1.48337]                                 
[ 4.       1.47375  1.43169]                                 
[ 5.       1.39828  1.37963]                                 
[ 6.       1.34546  1.35795]                                 
[ 7.       1.51999  1.47165]                                 
[ 8.       1.48992  1.46146]                                 
[ 9.       1.45492  1.42829]                                 
[ 10.        1.42027   1.39028]                              
[ 11.        1.3814    1.36539]                              
[ 12.        1.33895   1.34178]                              
[ 13.        1.30737   1.32871]                              
[ 14.        1.28244   1.31518]                              
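The choice of 2**4-1 epochs fits cycles that double in length: 1 + 2 + 4 + 8 = 15. The sketch below uses the standard SGDR cosine formula in plain Python. It is not the CosAnneal implementation from fast.ai, and `cycle_lengths` and `cosine_lr` are hypothetical helper names.

```python
import math

def cycle_lengths(n_cycles, cycle_len=1, cycle_mult=2):
    # length of each cycle in epochs: 1, 2, 4, 8, ...
    return [cycle_len * cycle_mult ** k for k in range(n_cycles)]

def cosine_lr(lr_max, step, steps_in_cycle):
    # learning rate falls from lr_max to 0 along half a cosine wave
    return lr_max / 2 * (1 + math.cos(math.pi * step / steps_in_cycle))

lengths = cycle_lengths(4)
# epochs at which the learning rate jumps back up to lr_max
restart_epochs = [sum(lengths[:k]) for k in range(1, len(lengths))]

print(lengths, sum(lengths), restart_epochs)
print([round(cosine_lr(1e-2, s, 10), 5) for s in range(11)])
```

In the log above the losses go up at epochs 1, 3 and 7, which matches where the restarts fall in this sketch, and they reach their lowest values at the end of each cycle.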



In [None]:
on_end = lambda sched, cycle: save_model(m, f'{PATH}models/cyc_{cycle}')
cb = [CosAnneal(lo, len(md.trn_dl), cycle_mult=2, on_cycle_end=on_end)]
fit(m, md, 2**6-1, lo.opt, F.nll_loss, callbacks=cb)


[ 0.       1.46053  1.43462]                                 
[ 1.       1.51537  1.47747]                                 
[ 2.       1.39208  1.38293]                                 
[ 3.       1.53056  1.49371]                                 
[ 4.       1.46812  1.43389]                                 
[ 5.       1.37624  1.37523]                                 
[ 6.       1.3173   1.34022]                                 
[ 7.       1.51783  1.47554]                                 
[ 8.       1.4921   1.45785]                                 
[ 9.       1.44843  1.42215]                                 
[ 10.        1.40948   1.40858]                              
[ 11.        1.37098   1.36648]                              
[ 12.        1.32255   1.33842]                              
[ 13.        1.28243   1.31106]                              
[ 14.        1.25031   1.2918 ]                              
[ 15.        1.49236   1.45316]                              
[ 16.   

### Test

In [None]:
def get_next(inp):
    idxs = TEXT.numericalize(inp)
    p = m(VV(idxs.transpose(0,1)))
    r = torch.multinomial(p[-1].exp(), 1)
    return TEXT.vocab.itos[to_np(r)[0]]

In [None]:
get_next('for thos')

'e'
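This version of get_next samples from the predicted distribution (torch.multinomial) instead of always taking the most likely character (argmax), which keeps generated text from getting stuck in repetitive loops. A plain-Python sketch of the difference, with a made-up vocabulary and made-up probabilities:

```python
import random

def argmax_next(probs, vocab):
    # always pick the most likely character
    best = max(range(len(probs)), key=probs.__getitem__)
    return vocab[best]

def sample_next(probs, vocab, rng):
    # draw one character at random, weighted by its probability
    return rng.choices(vocab, weights=probs, k=1)[0]

vocab = ["a", "b", "c"]      # hypothetical vocabulary
probs = [0.2, 0.5, 0.3]      # hypothetical model output after exp()

rng = random.Random(0)
greedy = [argmax_next(probs, vocab) for _ in range(5)]
samples = [sample_next(probs, vocab, rng) for _ in range(1000)]

print(greedy)
print({c: samples.count(c) for c in vocab})
```

The greedy choice is always the same, while the sampled characters follow the predicted probabilities roughly in proportion.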

In [None]:
def get_next_n(inp, n):
    res = inp
    for i in range(n):
        c = get_next(inp)
        res += c
        inp = inp[1:]+c
    return res

In [None]:
print(get_next_n('for thos', 400))

for those the skemps), or
imaginates, though they deceives. it should so each ourselvess and new
present, step absolutely for the
science." the contradity and
measuring, 
the whole!

293. perhaps, that every life a values of blood
of
intercourse when it senses there is unscrupulus, his very rights, and still impulse, love?
just after that thereby how made with the way anything, and set for harmless philos
