# Human numbers

We're going to try to create a language model that can predict the next number written out in English.

In [2]:
#uncomment next line, if using colab
!curl -s https://course.fast.ai/setup/colab | bash
from fastai.text import *

Updating fastai...
Done.


In [0]:
bs=64

## Data

In [4]:
path = untar_data(URLs.HUMAN_NUMBERS)
path.ls()

[PosixPath('/root/.fastai/data/human_numbers/train.txt'),
 PosixPath('/root/.fastai/data/human_numbers/valid.txt')]

This dataset contains all the numbers from 1 to 9,999 written out in English.

In [0]:
def readnums(d): return [', '.join(o.strip() for o in open(path/d).readlines())]

In [6]:
train_txt = readnums('train.txt'); train_txt[0][:80]

'one, two, three, four, five, six, seven, eight, nine, ten, eleven, twelve, thirt'

In [7]:
valid_txt = readnums('valid.txt'); valid_txt[0][-80:]

' nine thousand nine hundred ninety eight, nine thousand nine hundred ninety nine'

In [0]:
train = TextList(train_txt, path=path)
valid = TextList(valid_txt, path=path)

src = ItemLists(path=path, train=train, valid=valid).label_for_lm()
data = src.databunch(bs=bs)

In this case, the training set is the numbers from 1 to 8,000, and the validation set is the numbers from 8,001 onwards. We combine them together and turn that into a data bunch.

In [9]:
train[0].text[:80]

'xxbos one , two , three , four , five , six , seven , eight , nine , ten , eleve'

Here are the first 80 characters. It starts with a special token, xxbos. Anything starting with xx is a special fastai token; bos is the beginning of stream token.

In [10]:
len(data.valid_ds[0][0].data)

13017

The validation set contains about 13,000 tokens (13,017 exactly). That's 13,000 words or punctuation marks, because everything between spaces is a separate token.

In [11]:
data.bptt, len(data.valid_dl)

(70, 3)

In [12]:
13017/70/bs

2.905580357142857

The batch size that we asked for was 64, and by default it uses something called bptt of 70. bptt stands for "backprop through time". That's the sequence length: the number of tokens we look at at one time in each row of a batch. So for the validation set we grab the entire string of 13,000 tokens and split it into 64 roughly equal-sized segments. The first 1/64 of the document is piece 1, the second 1/64 is piece 2, and so on.

Then we split each of those 64 segments into pieces of length 70. Those 13,000 tokens, divided by the batch size and divided by 70, come to about 2.9, so there are going to be 3 batches.
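
The arithmetic above can be checked in a few lines of plain Python. This is only a sketch of the counting: the rounding-up step is an assumption about why `len(data.valid_dl)` comes out as 3, not something taken from the fastai source.

```python
import math

n_tokens = 13017   # length of the validation token stream
bs = 64            # batch size (number of segments)
bptt = 70          # tokens per row in each batch

tokens_per_row = n_tokens / bs                 # roughly 203.4 tokens per segment
n_batches = math.ceil(tokens_per_row / bptt)   # 2.9 rounds up to 3 batches
total_slots = n_batches * bs * bptt            # elements across all 3 batches

print(n_batches, total_slots)  # 3 13440
```

The 13,440 slots are slightly more than the 13,017 tokens, which is why the last batch cannot be filled from the stream alone.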

In [0]:
it = iter(data.valid_dl)
x1,y1 = next(it)
x2,y2 = next(it)
x3,y3 = next(it)
it.close()

In [14]:
x1.numel()+x2.numel()+x3.numel()

13440

In [15]:
x1.shape,y1.shape

(torch.Size([64, 70]), torch.Size([64, 70]))

In [16]:
x2.shape,y2.shape

(torch.Size([64, 70]), torch.Size([64, 70]))

As you can see, each batch is 64 by 70: 64 rows (one per segment), each holding bptt = 70 tokens. The data loader for language models can also slightly randomize bptt to give you a bit more shuffling, which helps the model, although in the output above both batches have exactly 70 columns.

In [17]:
x1[0]

tensor([ 2, 19, 11, 12,  9, 19, 11, 13,  9, 19, 11, 14,  9, 19, 11, 15,  9, 19,
        11, 16,  9, 19, 11, 17,  9, 19, 11, 18,  9, 19, 11, 19,  9, 19, 11, 20,
         9, 19, 11, 29,  9, 19, 11, 30,  9, 19, 11, 31,  9, 19, 11, 32,  9, 19,
        11, 33,  9, 19, 11, 34,  9, 19, 11, 35,  9, 19, 11, 36,  9, 19],
       device='cuda:0')

In [18]:
y1[0]

tensor([19, 11, 12,  9, 19, 11, 13,  9, 19, 11, 14,  9, 19, 11, 15,  9, 19, 11,
        16,  9, 19, 11, 17,  9, 19, 11, 18,  9, 19, 11, 19,  9, 19, 11, 20,  9,
        19, 11, 29,  9, 19, 11, 30,  9, 19, 11, 31,  9, 19, 11, 32,  9, 19, 11,
        33,  9, 19, 11, 34,  9, 19, 11, 35,  9, 19, 11, 36,  9, 19, 11],
       device='cuda:0')

So here, you can see the first row of X (remember, we've numericalized all these) and here's the first row of Y. You'll see that x1[0] is [ 2, 19, 11, 12,  9, ...] and y1[0] is [19, 11, 12,  9, 19, ...]. So y1 is offset by 1 from x1, because that's what you want to do with a language model: we want to predict the next word. So after 2 should come 19, and after 19 should come 11.
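
The offset can be verified directly on the first few ids printed above. The second half is a minimal sketch of how such a pair could be cut from one stream of ids; the variable names are illustrative, not fastai's.

```python
# First eight ids of x1[0] and y1[0], copied from the outputs above.
x_row = [2, 19, 11, 12, 9, 19, 11, 13]
y_row = [19, 11, 12, 9, 19, 11, 13, 9]

# The target at position i is the input at position i + 1.
assert y_row[:-1] == x_row[1:]

# Cutting an (input, target) pair out of a single stream of ids:
stream = [2, 19, 11, 12, 9, 19, 11, 13, 9]
x = stream[:-1]   # everything except the last token
y = stream[1:]    # the same tokens shifted left by one
```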

In [0]:
v = data.valid_ds.vocab

In [20]:
v.textify(x1[0])

'xxbos eight thousand one , eight thousand two , eight thousand three , eight thousand four , eight thousand five , eight thousand six , eight thousand seven , eight thousand eight , eight thousand nine , eight thousand ten , eight thousand eleven , eight thousand twelve , eight thousand thirteen , eight thousand fourteen , eight thousand fifteen , eight thousand sixteen , eight thousand seventeen , eight'

In [21]:
v.textify(y1[0])

'eight thousand one , eight thousand two , eight thousand three , eight thousand four , eight thousand five , eight thousand six , eight thousand seven , eight thousand eight , eight thousand nine , eight thousand ten , eight thousand eleven , eight thousand twelve , eight thousand thirteen , eight thousand fourteen , eight thousand fifteen , eight thousand sixteen , eight thousand seventeen , eight thousand'

You can grab the vocab for this dataset, and a vocab has a textify method, so if we look at exactly the same thing but with textify, it will just look each id up in the vocab. Here you can see xxbos eight thousand one, whereas in the y there's no xxbos, it's just eight thousand one. So after xxbos is eight, after eight is thousand, after thousand is one.
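
Conceptually, textify is an id-to-string lookup followed by a join. The sketch below uses a partial mapping inferred from the tensors and strings printed above; the real vocab holds all 40 tokens, and the standalone `textify` function here is a hypothetical stand-in for the method, not fastai's implementation.

```python
# Partial id -> token mapping, read off by lining up x1[0] with its textified form.
itos = {2: 'xxbos', 9: ',', 11: 'thousand', 12: 'one', 13: 'two', 19: 'eight'}

def textify(ids):
    """Look each id up in the mapping and join the tokens with spaces."""
    return ' '.join(itos[i] for i in ids)

text = textify([2, 19, 11, 12, 9, 19, 11, 13])
print(text)  # xxbos eight thousand one , eight thousand two
```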

In [22]:
v.textify(x2[0])

'thousand eighteen , eight thousand nineteen , eight thousand twenty , eight thousand twenty one , eight thousand twenty two , eight thousand twenty three , eight thousand twenty four , eight thousand twenty five , eight thousand twenty six , eight thousand twenty seven , eight thousand twenty eight , eight thousand twenty nine , eight thousand thirty , eight thousand thirty one , eight thousand thirty two ,'

In [23]:
v.textify(x3[0])

'eight thousand thirty three , eight thousand thirty four , eight thousand thirty five , eight thousand thirty six , eight thousand thirty seven , eight thousand thirty eight , eight thousand thirty nine , eight thousand forty , eight thousand forty one , eight thousand forty two , eight thousand forty three , eight thousand forty four , eight thousand forty five , eight thousand forty six , eight'

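x1[0] stops partway through 8,018 (it ends on "eight"), x2[0] picks up with "thousand eighteen", and x3[0] carries on from 8,033 all the way up to 8,046. Note that we're always looking at row 0 here, so this is the first segment as it runs through the first, second and third mini batch.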

In [24]:
v.textify(x1[1])

', eight thousand forty six , eight thousand forty seven , eight thousand forty eight , eight thousand forty nine , eight thousand fifty , eight thousand fifty one , eight thousand fifty two , eight thousand fifty three , eight thousand fifty four , eight thousand fifty five , eight thousand fifty six , eight thousand fifty seven , eight thousand fifty eight , eight thousand fifty nine ,'

In [25]:
v.textify(x2[1])

'eight thousand sixty , eight thousand sixty one , eight thousand sixty two , eight thousand sixty three , eight thousand sixty four , eight thousand sixty five , eight thousand sixty six , eight thousand sixty seven , eight thousand sixty eight , eight thousand sixty nine , eight thousand seventy , eight thousand seventy one , eight thousand seventy two , eight thousand seventy three , eight thousand'

In [26]:
v.textify(x3[1])

'seventy four , eight thousand seventy five , eight thousand seventy six , eight thousand seventy seven , eight thousand seventy eight , eight thousand seventy nine , eight thousand eighty , eight thousand eighty one , eight thousand eighty two , eight thousand eighty three , eight thousand eighty four , eight thousand eighty five , eight thousand eighty six , eight thousand eighty seven , eight thousand eighty'

In [27]:
v.textify(x3[-1])

'ninety , nine thousand nine hundred ninety one , nine thousand nine hundred ninety two , nine thousand nine hundred ninety three , nine thousand nine hundred ninety four , nine thousand nine hundred ninety five , nine thousand nine hundred ninety six , nine thousand nine hundred ninety seven , nine thousand nine hundred ninety eight , nine thousand nine hundred ninety nine xxbos eight thousand one , eight'

Then we can go right back to the start, but look at row index 1, which is the second segment. It starts around 8,046, which is where row 0 left off; the seam is not exact because the last mini batch wasn't quite complete. What this means is that every mini batch joins up with the previous mini batch. You can go straight from x1[0] to x2[0]: it continues 8,017, 8,018. If you do the same thing for row 1 (x1[1], x2[1], x3[1]), you'll also see they join up. So all the mini batches join up.
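
The layout described above can be illustrated with a toy stream of integers. This is a simplified sketch, not fastai's data loader: the hypothetical `batchify` helper drops the ragged tail for simplicity, whereas the `x3[-1]` output above suggests the real loader wraps around to the start of the stream.

```python
def batchify(stream, bs, bptt):
    """Split one stream into bs contiguous rows, then cut the rows into bptt-wide batches."""
    row_len = len(stream) // bs   # tokens per row; any remainder is dropped
    rows = [stream[i * row_len:(i + 1) * row_len] for i in range(bs)]
    # Batch k holds columns [k*bptt, (k+1)*bptt) of every row.
    return [[row[j:j + bptt] for row in rows] for j in range(0, row_len, bptt)]

stream = list(range(24))
batches = batchify(stream, bs=2, bptt=4)

for k, batch in enumerate(batches):
    print(k, batch)
# 0 [[0, 1, 2, 3], [12, 13, 14, 15]]
# 1 [[4, 5, 6, 7], [16, 17, 18, 19]]
# 2 [[8, 9, 10, 11], [20, 21, 22, 23]]
```

Row 0 of each batch continues exactly where row 0 of the previous batch stopped, and the same holds for row 1. That continuity is what later makes it possible to carry hidden state from one batch to the next.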

In [28]:
data.show_batch(ds_type=DatasetType.Valid)

idx,text
0,"thousand forty seven , eight thousand forty eight , eight thousand forty nine , eight thousand fifty , eight thousand fifty one , eight thousand fifty two , eight thousand fifty three , eight thousand fifty four , eight thousand fifty five , eight thousand fifty six , eight thousand fifty seven , eight thousand fifty eight , eight thousand fifty nine , eight thousand sixty , eight thousand sixty"
1,"eight , eight thousand eighty nine , eight thousand ninety , eight thousand ninety one , eight thousand ninety two , eight thousand ninety three , eight thousand ninety four , eight thousand ninety five , eight thousand ninety six , eight thousand ninety seven , eight thousand ninety eight , eight thousand ninety nine , eight thousand one hundred , eight thousand one hundred one , eight thousand one"
2,"thousand one hundred twenty four , eight thousand one hundred twenty five , eight thousand one hundred twenty six , eight thousand one hundred twenty seven , eight thousand one hundred twenty eight , eight thousand one hundred twenty nine , eight thousand one hundred thirty , eight thousand one hundred thirty one , eight thousand one hundred thirty two , eight thousand one hundred thirty three , eight thousand"
3,"three , eight thousand one hundred fifty four , eight thousand one hundred fifty five , eight thousand one hundred fifty six , eight thousand one hundred fifty seven , eight thousand one hundred fifty eight , eight thousand one hundred fifty nine , eight thousand one hundred sixty , eight thousand one hundred sixty one , eight thousand one hundred sixty two , eight thousand one hundred sixty three"
4,"thousand one hundred eighty three , eight thousand one hundred eighty four , eight thousand one hundred eighty five , eight thousand one hundred eighty six , eight thousand one hundred eighty seven , eight thousand one hundred eighty eight , eight thousand one hundred eighty nine , eight thousand one hundred ninety , eight thousand one hundred ninety one , eight thousand one hundred ninety two , eight thousand"


## Single fully connected model

In [0]:
data = src.databunch(bs=bs, bptt=3)

In [30]:
x,y = data.one_batch()
x.shape,y.shape

(torch.Size([64, 3]), torch.Size([64, 3]))

In [31]:
nv = len(v.itos); nv

40

In [0]:
nh=64

In [0]:
def loss4(input,target): return F.cross_entropy(input, target[:,-1])
def acc4 (input,target): return accuracy(input, target[:,-1])

![alt text](https://github.com/hiromis/notes/raw/master/lesson7/49.png)

In [0]:
class Model0(nn.Module):
    def __init__(self):
        super().__init__()
        self.i_h = nn.Embedding(nv,nh)  # green arrow
        self.h_h = nn.Linear(nh,nh)     # brown arrow
        self.h_o = nn.Linear(nh,nv)     # blue arrow
        self.bn = nn.BatchNorm1d(nh)
        
    def forward(self, x):
        h = self.bn(F.relu(self.h_h(self.i_h(x[:,0]))))
        if x.shape[1]>1:
            h = h + self.i_h(x[:,1])
            h = self.bn(F.relu(self.h_h(h)))
        if x.shape[1]>2:
            h = h + self.i_h(x[:,2])
            h = self.bn(F.relu(self.h_h(h)))
        return self.h_o(h)

It contains one embedding (the green arrow), one hidden-to-hidden layer (the brown arrow), and one hidden-to-output layer (the blue arrow). So each colored arrow has a single matrix. In the forward pass, we take our first input x[:,0], put it through input to hidden (the green arrow) and then through hidden to hidden, ReLU and batch norm to create our first set of activations, which we call h. Then, assuming there is a second word (sometimes we might be at the end of a batch where there isn't one), we add to h the result of x[:,1] put through the green arrow (that's i_h). Our new h is the result of those two added together, put through hidden to hidden (the brown arrow), then ReLU, then batch norm. For the third word, we do exactly the same thing. Finally, the blue arrow: put it through h_o.

In [0]:
learn = Learner(data, Model0(), loss_func=loss4, metrics=acc4)

In [36]:
learn.fit_one_cycle(6, 1e-4)

epoch,train_loss,valid_loss,acc4,time
0,3.632127,3.603174,0.053539,00:01
1,3.113236,3.075947,0.392463,00:01
2,2.532481,2.57296,0.453585,00:01
3,2.201106,2.303107,0.452895,00:01
4,2.06736,2.207222,0.453125,00:01
5,2.039479,2.19383,0.453355,00:01


## Exercise: Refactor the same model with a loop

![alt text](https://github.com/hiromis/notes/raw/master/lesson7/50.png)


An RNN is just a refactoring of the model above.

In [0]:
class Model1(nn.Module):
    def __init__(self):
        super().__init__()
        self.i_h = nn.Embedding(nv,nh)  # green arrow
        self.h_h = nn.Linear(nh,nh)     # brown arrow
        self.h_o = nn.Linear(nh,nv)     # blue arrow
        self.bn = nn.BatchNorm1d(nh)
        
    def forward(self, x):
        h = torch.zeros(x.shape[0], nh).to(device=x.device)
        # loop over every input
        return self.h_o(h)
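
One possible way to fill in the loop is sketched below as a self-contained PyTorch module. It is one reading of the exercise rather than the only answer; `nv` and `nh` are redefined here with the values used earlier so the block runs on its own, and the random input stands in for a real batch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

nv, nh = 40, 64  # vocab size and hidden size, as defined earlier

class Model1(nn.Module):
    def __init__(self):
        super().__init__()
        self.i_h = nn.Embedding(nv, nh)  # green arrow
        self.h_h = nn.Linear(nh, nh)     # brown arrow
        self.h_o = nn.Linear(nh, nv)     # blue arrow
        self.bn = nn.BatchNorm1d(nh)

    def forward(self, x):
        h = torch.zeros(x.shape[0], nh, device=x.device)
        # The same add / linear / ReLU / batch norm step, once per input token.
        for i in range(x.shape[1]):
            h = h + self.i_h(x[:, i])
            h = self.bn(F.relu(self.h_h(h)))
        return self.h_o(h)

x = torch.randint(0, nv, (64, 3))  # stand-in for a batch with bptt=3
out = Model1()(x)
print(out.shape)  # torch.Size([64, 40])
```

Because h starts at zero, the first pass through the loop reduces to the first line of Model0's forward, so the two models compute the same thing for three inputs.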

In [0]:
learn = Learner(data, Model1(), loss_func=loss4, metrics=acc4)

In [39]:
learn.fit_one_cycle(6, 1e-4)

epoch,train_loss,valid_loss,acc4,time
0,3.700336,3.711624,0.0,00:00
1,3.67998,3.689831,0.025735,00:00
2,3.655822,3.667598,0.025735,00:00
3,3.638037,3.652325,0.025735,00:00
4,3.629193,3.645549,0.025735,00:00
5,3.627209,3.644487,0.025735,00:00


## Multi fully connected model

![alt text](https://github.com/hiromis/notes/raw/master/lesson7/52.png)

Previously we were comparing the result of our model to just the last word of the sequence. That is very wasteful, because there are a lot of words in the sequence. So let's compare every word in x to every word in y. To do that, we need to change the diagram so it's not just one triangle at the end of the loop, but the triangle is inside the loop. In other words, after every loop, predict: loop, predict, loop, predict.



In [0]:
data = src.databunch(bs=bs, bptt=20)

In [41]:
x,y = data.one_batch()
x.shape,y.shape

(torch.Size([64, 20]), torch.Size([64, 20]))

## Exercise:
Create an array, and every time you go through the loop, append h_o(h) to it. Now, for n inputs, it creates n outputs. So it is predicting after every word.

In [0]:
class Model2(nn.Module):
    def __init__(self):
        super().__init__()
        self.i_h = nn.Embedding(nv,nh)
        self.h_h = nn.Linear(nh,nh)
        self.h_o = nn.Linear(nh,nv)
        self.bn = nn.BatchNorm1d(nh)
        
    def forward(self, x):
        h = torch.zeros(x.shape[0], nh).to(device=x.device)
        res = []
        #loop over every input and append h_o(h) to the array
        return torch.stack(res, dim=1)
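
A possible completion of this exercise is sketched below. Where exactly batch norm is applied inside the loop is a choice made for this sketch; the constants are redefined so the block is self-contained, and the random input stands in for a real batch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

nv, nh = 40, 64  # vocab size and hidden size, as defined earlier

class Model2(nn.Module):
    def __init__(self):
        super().__init__()
        self.i_h = nn.Embedding(nv, nh)
        self.h_h = nn.Linear(nh, nh)
        self.h_o = nn.Linear(nh, nv)
        self.bn = nn.BatchNorm1d(nh)

    def forward(self, x):
        h = torch.zeros(x.shape[0], nh, device=x.device)
        res = []
        for i in range(x.shape[1]):
            h = h + self.i_h(x[:, i])
            h = F.relu(self.h_h(h))
            res.append(self.h_o(self.bn(h)))  # one prediction per input token
        return torch.stack(res, dim=1)        # (batch, sequence, vocab)

x = torch.randint(0, nv, (64, 20))  # stand-in for a batch with bptt=20
out = Model2()(x)
print(out.shape)  # torch.Size([64, 20, 40])
```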

In [0]:
learn = Learner(data, Model2(), metrics=accuracy)

In [47]:
learn.fit_one_cycle(10, 1e-4, pct_start=0.1)

epoch,train_loss,valid_loss,accuracy,time
0,3.675185,3.78383,0.028267,00:00
1,3.575442,3.668788,0.100781,00:00
2,3.452242,3.553083,0.138707,00:00
3,3.324661,3.44734,0.234659,00:00
4,3.205369,3.356991,0.261435,00:00
5,3.10294,3.286192,0.267472,00:00
6,3.023824,3.240035,0.272017,00:00
7,2.969144,3.215157,0.27358,00:00
8,2.935789,3.205647,0.274858,00:00
9,2.918441,3.20424,0.275071,00:00


Previously we had about 45%, now we have about 27%. Why is it worse? It's worse because now, when you are trying to predict the second word, you only have one word of state to use. When you're looking at the third word, you only have two words of state to use. So it's a much harder problem for the model to solve. The key problem is here:

***h = torch.zeros(...)***

You reset the state to zero every time you start another BPTT sequence.

## Exercise: Maintain state

Let's keep h. And we can, because remember, each batch connects to the previous batch. It's not shuffled like it is in image classification. So let's take this exact model and replicate it again, but let's move the creation of h into the constructor.

In [0]:
class Model3(nn.Module):
    def __init__(self):
        super().__init__()
        self.i_h = nn.Embedding(nv,nh)
        self.h_h = nn.Linear(nh,nh)
        self.h_o = nn.Linear(nh,nv)
        self.bn = nn.BatchNorm1d(nh)
        self.h = torch.zeros(bs, nh).cuda()
        
    def forward(self, x):
        #replicate the previous model without resetting h
        res = self.h_o(res)
        return res
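
One way this exercise could be completed is sketched below. Two things are assumptions of the sketch: the state is kept on the CPU so the block runs without a GPU (the notebook uses `.cuda()`), and the state is detached after every batch so that backprop does not reach back through all earlier batches.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

bs, nv, nh = 64, 40, 64  # batch size, vocab size, hidden size, as defined earlier

class Model3(nn.Module):
    def __init__(self):
        super().__init__()
        self.i_h = nn.Embedding(nv, nh)
        self.h_h = nn.Linear(nh, nh)
        self.h_o = nn.Linear(nh, nv)
        self.bn = nn.BatchNorm1d(nh)
        self.h = torch.zeros(bs, nh)  # created once, not on every forward call

    def forward(self, x):
        res = []
        h = self.h                    # start from the state left by the previous batch
        for i in range(x.shape[1]):
            h = h + self.i_h(x[:, i])
            h = F.relu(self.h_h(h))
            res.append(self.bn(h))
        self.h = h.detach()           # keep the values, drop the gradient history
        return self.h_o(torch.stack(res, dim=1))

model = Model3()
x = torch.randint(0, nv, (bs, 20))
out1 = model(x)
out1.sum().backward()   # backprop through the first batch only
out2 = model(x)         # starts from the detached state of the first batch
out2.sum().backward()
print(out2.shape)       # torch.Size([64, 20, 40])
```

Without the `detach()` call, the second `backward()` would try to go back through the graph of the first batch, which has already been freed.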

In [0]:
learn = Learner(data, Model3(), metrics=accuracy)

In [50]:
learn.fit_one_cycle(20, 3e-3)

epoch,train_loss,valid_loss,accuracy,time
0,3.591737,3.532741,0.199006,00:00
1,3.253751,2.977511,0.354545,00:00
2,2.592783,2.036293,0.466761,00:00
3,2.003195,1.889399,0.400213,00:00
4,1.690694,1.791617,0.467685,00:00
5,1.481871,1.679199,0.4875,00:00
6,1.303349,1.644444,0.522301,00:00
7,1.138757,1.554977,0.529048,00:00
8,0.987126,1.665315,0.547301,00:00
9,0.861744,1.719088,0.575,00:00


## Deep RNN using PyTorch nn.RNN module
What you could do, though, is at the end of every loop, rather than just spitting out an output, feed it into another RNN. So you have an RNN going into an RNN. That's nice because we now have more layers of computation, so you would expect it to work better.

![alt text](https://github.com/hiromis/notes/raw/master/lesson7/54.png)

Let's take this code (Model3) and replace it with the equivalent built in [PyTorch code](https://pytorch.org/docs/stable/nn.html#recurrent-layers):

In [0]:
class Model4(nn.Module):
    def __init__(self):
        super().__init__()
        self.i_h = nn.Embedding(nv,nh)
        self.rnn = nn.RNN(nh, nh, 2, batch_first=True)
        self.h_o = nn.Linear(nh,nv)
        self.bn = BatchNorm1dFlat(nh)
        self.h = torch.zeros(2, bs, nh).cuda()
        
    def forward(self, x):
        res,h = self.rnn(self.i_h(x), self.h)
        self.h = h.detach()
        return self.h_o(self.bn(res))

In [0]:
learn = Learner(data, Model4(), metrics=accuracy)

In [65]:
learn.fit_one_cycle(20, 3e-3)

epoch,train_loss,valid_loss,accuracy,time
0,3.406644,3.009712,0.37294,00:00
1,2.86588,2.28924,0.46321,00:00
2,2.266347,1.856459,0.467472,00:00
3,1.827456,1.759345,0.454048,00:00
4,1.490581,1.506842,0.487571,00:00
5,1.15747,1.357541,0.542117,00:00
6,0.862163,1.110684,0.652841,00:00
7,0.615088,0.962589,0.695739,00:00
8,0.444303,0.941189,0.719673,00:00
9,0.329872,0.901171,0.740909,00:00


## Exercise: Create the same model with a [GRU](https://pytorch.org/docs/stable/nn.html#gru) with 2 layers


In [0]:
class Model5(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x):

        return self.h_o(self.bn(res))
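
A possible solution mirrors Model4 with `nn.RNN` swapped for `nn.GRU`. To keep the sketch runnable without fastai, a plain `nn.BatchNorm1d` applied to the flattened activations stands in for `BatchNorm1dFlat` (my understanding is that this is roughly what that layer does), and the state lives on the CPU instead of calling `.cuda()`.

```python
import torch
import torch.nn as nn

bs, nv, nh = 64, 40, 64  # batch size, vocab size, hidden size, as defined earlier

class Model5(nn.Module):
    def __init__(self):
        super().__init__()
        self.i_h = nn.Embedding(nv, nh)
        self.rnn = nn.GRU(nh, nh, 2, batch_first=True)  # 2 stacked GRU layers
        self.h_o = nn.Linear(nh, nv)
        self.bn = nn.BatchNorm1d(nh)
        self.h = torch.zeros(2, bs, nh)                  # one state per layer

    def forward(self, x):
        res, h = self.rnn(self.i_h(x), self.h)
        self.h = h.detach()
        # Flatten (bs, seq, nh) to (bs*seq, nh) for batch norm, then restore the shape.
        flat = self.bn(res.reshape(-1, nh))
        return self.h_o(flat.view(*res.shape))

model = Model5()
x = torch.randint(0, nv, (bs, 20))
out = model(x)
print(out.shape)  # torch.Size([64, 20, 40])
```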

In [0]:
learn = Learner(data, Model5(), metrics=accuracy)

In [68]:
learn.fit_one_cycle(10, 1e-2)

epoch,train_loss,valid_loss,accuracy,time
0,3.046449,2.489613,0.441548,00:00
1,1.882761,1.533837,0.629475,00:00
2,0.962722,1.186184,0.774006,00:00
3,0.46554,1.24177,0.801136,00:00
4,0.231859,1.404719,0.803054,00:00
5,0.121732,1.296224,0.823366,00:00
6,0.067599,1.344958,0.821875,00:00
7,0.040112,1.36173,0.822514,00:00
8,0.025851,1.389524,0.821591,00:00
9,0.018318,1.374482,0.822017,00:00


## fin