# Continuation of Makemore - WaveNet
Based on the YouTube series by Andrej Karpathy.

We previously built the MLP from [Bengio et al.](https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf) to make a simple character-level model. This time we will add a tree-like structure that works us towards a CNN similar to [DeepMind WaveNet 2016](https://arxiv.org/abs/1609.03499). Note that this only looks at the basic architecture and does not implement the residual connections or gated activation units at this time.

I will be building something similar to [this tool](https://github.com/karpathy/makemore/tree/master) from scratch. Note that I am following the tutorial step by step, without looking at the final repo. We want to get more characters into the NN, and we want to fuse them progressively across layers instead of squashing everything into a single hidden layer.
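To make the fusing idea concrete, here is a minimal pure-Python sketch of how an 8-character context gets merged two neighbours at a time. It is only an illustration, not the model code: `fuse_pairs` is a hypothetical helper I made up for this example, and the real model does the equivalent grouping on tensors.

```python
def fuse_pairs(items, n=2):
    """Group consecutive items into chunks of n (hypothetical helper)."""
    assert len(items) % n == 0, "length must be divisible by the group size"
    return [tuple(items[i:i + n]) for i in range(0, len(items), n)]

context = list("..diondr")  # an 8-character context, '.' is padding

# Keep fusing neighbours until a single node summarizes the whole context
level = context
levels = [level]
while len(level) > 1:
    level = fuse_pairs(level)
    levels.append(level)

for depth, lvl in enumerate(levels):
    print(depth, len(lvl), lvl)
```

Each level halves the number of positions (8, 4, 2, 1), so information from distant characters only meets in the deeper levels.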

Overall, I will work through these models:
- Bigram (one character predicts the next one with a lookup table of counts)
- MLP, following [Bengio et al. 2003](https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf)
- CNN, following [DeepMind WaveNet 2016](https://arxiv.org/abs/1609.03499) 
- RNN, following [Mikolov et al. 2010](https://www.fit.vutbr.cz/research/groups/speech/publi/2010/mikolov_interspeech2010_IS100722.pdf)
- LSTM, following [Graves et al. 2014](https://arxiv.org/abs/1308.0850)
- GRU, following [Kyunghyun Cho et al. 2014](https://arxiv.org/abs/1409.1259)
- Transformer, following [Vaswani et al. 2017](https://arxiv.org/abs/1706.03762)
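As a reminder of what the first entry in the list means, here is a rough sketch of a count-based bigram table. The word list `toy_words` is a made-up stand-in for `data/names.txt`, and `next_char_probs` is a hypothetical helper, so treat this as an illustration rather than the code from the earlier notebook.

```python
from collections import Counter

# Hypothetical toy word list, standing in for data/names.txt
toy_words = ["emma", "ava", "mia"]

# Count how often each character follows another; '.' marks start and end
counts = Counter()
for w in toy_words:
    seq = ["."] + list(w) + ["."]
    for ch1, ch2 in zip(seq, seq[1:]):
        counts[(ch1, ch2)] += 1

def next_char_probs(ch):
    """Normalize the counts of every bigram that starts with ch."""
    row = {b: c for (a, b), c in counts.items() if a == ch}
    total = sum(row.values())
    return {b: c / total for b, c in row.items()}

print(next_char_probs("."))  # distribution over the first character
```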

In [2]:
import torch
import math
import numpy as np
import matplotlib.pyplot as plt
import random
import torch.nn.functional as F
%matplotlib inline

In [3]:
words = open("data/names.txt", "r").read().splitlines()
print(f'Total number of words: {len(words)}')
smallest = min(len(w) for w in words)
largest = max(len(w) for w in words)
print(f'Smallest Word is {smallest} char while the largest is {largest} char')

Total number of words: 32033
Smallest Word is 2 char while the largest is 15 char


In [4]:
import random
random.seed(42)
random.shuffle(words)

In [5]:
# build the vocabulary of chars and mappings
chars = sorted(list(set(''.join(words))))
str_to_ind = {s:i + 1 for i,s in enumerate(chars)}
str_to_ind['.'] = 0
ind_to_str = {i:s for s,i in str_to_ind.items()}
print(ind_to_str)
vocab_size = len(ind_to_str)
print(vocab_size)

{1: 'a', 2: 'b', 3: 'c', 4: 'd', 5: 'e', 6: 'f', 7: 'g', 8: 'h', 9: 'i', 10: 'j', 11: 'k', 12: 'l', 13: 'm', 14: 'n', 15: 'o', 16: 'p', 17: 'q', 18: 'r', 19: 's', 20: 't', 21: 'u', 22: 'v', 23: 'w', 24: 'x', 25: 'y', 26: 'z', 0: '.'}
27


In [46]:
def build_dataset(words, block_size=3):
    X, Y = [], []
    for word in words:
        context = [0] * block_size # padded
        for ch in word + '.':
            ind = str_to_ind[ch]
            X.append(context)
            Y.append(ind)
            context = context[1:] + [ind] # crop and append

    X = torch.tensor(X)
    Y = torch.tensor(Y)
    print(X.shape, Y.shape)
    return X, Y


n1 = int(0.8 * len(words))
n2 = int(0.9 * len(words))

block_size=8
X_train, Y_train = build_dataset(words[:n1], block_size=block_size)
X_dev, Y_dev = build_dataset(words[n1:n2], block_size=block_size)
X_test, Y_test = build_dataset(words[n2:], block_size=block_size)

torch.Size([182625, 8]) torch.Size([182625])
torch.Size([22655, 8]) torch.Size([22655])
torch.Size([22866, 8]) torch.Size([22866])


In [47]:
for x,y in zip(X_train[:20], Y_train[:20]):
    print(''.join(ind_to_str[ix.item()] for ix in x), '-->', ind_to_str[y.item()])

........ --> y
.......y --> u
......yu --> h
.....yuh --> e
....yuhe --> n
...yuhen --> g
..yuheng --> .
........ --> d
.......d --> i
......di --> o
.....dio --> n
....dion --> d
...diond --> r
..diondr --> e
.diondre --> .
........ --> x
.......x --> a
......xa --> v
.....xav --> i
....xavi --> e


In [68]:
class Linear:
    def __init__(self, fan_in, fan_out, bias=True):
        # kaiming normal
        self.weight = torch.randn((fan_in, fan_out)) / fan_in**0.5 
        self.bias = torch.zeros(fan_out) if bias else None
        
    def __call__(self, x):
        self.out = x @ self.weight
        if self.bias is not None:
            self.out += self.bias
        return self.out
    
    def parameters(self):
        return [self.weight] + ([] if self.bias is None else [self.bias])
    
class BatchNorm1D:
    # this departs from PyTorch's BatchNorm1d, which assumes (N, C, L); we assume (N, L, C)
    
    def __init__(self, dim, eps=1e-5, momentum=0.1):
        self.eps = eps
        self.momentum = momentum
        self.training = True
        
        # training with backprop
        self.gamma = torch.ones(dim)
        self.beta = torch.zeros(dim)
        
        
        # buffers 
        self.running_mean = torch.zeros(dim)
        self.running_var = torch.ones(dim)
        
    def __call__(self, x):
        # forward pass
        if self.training:
            if x.ndim == 2:
                dim = 0
            elif x.ndim == 3:
                dim = (0,1)
            x_mean = x.mean(dim, keepdim=True)
            x_var = x.var(dim, keepdim=True, unbiased=True) 
        else:
            x_mean = self.running_mean
            x_var = self.running_var
        
        x_hat = (x - x_mean) / torch.sqrt(x_var + self.eps)
        self.out = self.gamma * x_hat + self.beta
        
        # update buffers
        if self.training:
            with torch.no_grad():
                hinderance = 1 - self.momentum
                self.running_mean = hinderance * self.running_mean + self.momentum * x_mean
                self.running_var = hinderance * self.running_var + self.momentum * x_var
        return self.out
                
    def parameters(self):
        return [self.gamma, self.beta]
    
class Tanh:
    def __call__(self, x):
        self.out = torch.tanh(x)
        return self.out
    
    def parameters(self):
        return []

class Embedding:
    def __init__(self, num_embeddings, embedding_dim):
        self.weight = torch.randn((num_embeddings, embedding_dim))
        
    def __call__(self, IX):
        self.out = self.weight[IX]
        return self.out
    
    def parameters(self):
        return [self.weight]
    
class FlattenConsecutive:
    def __init__(self, n):
        self.n = n
    
    def __call__(self, x):
        B, T, C, = x.shape
        x = x.view(B, T//self.n, C*self.n)
        
        if x.shape[1] == 1:
            x = x.squeeze(dim=1)
        
        self.out = x
        return self.out
    
    def parameters(self):
        return []
    
class Sequential:
    def __init__(self, layers):
        self.layers = layers
        
    def __call__(self, x):
        for layer in self.layers:
            x = layer(x)
        self.out = x
        return self.out
    
    def parameters(self):
        # get parameters from all layers into one list
        return [p for layer in self.layers for p in layer.parameters()]

In [69]:
torch.manual_seed(42) # seed rng for reproducibility

<torch._C.Generator at 0x108690990>

In [77]:
n_emb_dim = 24
n_hidden = 128
dim_groups = 2

model = Sequential([
    Embedding(vocab_size, n_emb_dim),
    FlattenConsecutive(dim_groups), Linear(n_emb_dim * dim_groups, n_hidden, bias=False), BatchNorm1D(n_hidden), Tanh(),
    FlattenConsecutive(dim_groups), Linear(n_hidden * dim_groups, n_hidden, bias=False), BatchNorm1D(n_hidden), Tanh(),
    FlattenConsecutive(dim_groups), Linear(n_hidden * dim_groups, n_hidden, bias=False), BatchNorm1D(n_hidden), Tanh(),
    Linear(n_hidden  , vocab_size)
])

with torch.no_grad():
    # last layer: make less confident
    model.layers[-1].weight *= 0.1 
            
parameters = model.parameters() 
print(sum(p.nelement() for p in parameters)) # number of params

# require grad
for p in parameters:
    p.requires_grad = True

22397


In [78]:
total_epochs = 0
ud = []
lossi = []

In [79]:
epoch = 200000
learning_rate = 0.1
decay = 0.1
batch_size = 32
decay_threshold = 150000

for layer in model.layers:
    layer.training = True

In [None]:
for i in range(epoch):
    
    # construct a minibatch: faster, although the gradient is only approximate
    ix = torch.randint(0, X_train.shape[0], (batch_size, ))
    Xb, Yb = X_train[ix], Y_train[ix]
    
    logits = model(Xb)                 
    loss = F.cross_entropy(logits, Yb)

    # backward pass
    for p in parameters:
        p.grad = None # zero_grad
    loss.backward()
    
    # update (the learning rate decay happens further down)
    for p in parameters:
        p.data += -learning_rate * p.grad
        
     
    # track stats
    if i % 10000 == 0 or i == epoch-1:
        print(f"epoch: {total_epochs:6} batch: {i:7}/{epoch:7} loss: {loss.item():5.3f}")

    lossi.append(loss.log10().item())
    with torch.no_grad():
        ud.append([((learning_rate*p.grad).std() / p.data.std()).log10().item() for p in parameters])
        
    # decay learning rate
    if total_epochs > 0 and total_epochs%decay_threshold == 0:
        learning_rate *= decay
        
    total_epochs += 1

epoch:      0 batch:       0/ 200000 loss: 3.298
epoch:  10000 batch:   10000/ 200000 loss: 2.208
epoch:  20000 batch:   20000/ 200000 loss: 2.228
epoch:  30000 batch:   30000/ 200000 loss: 1.694
epoch:  40000 batch:   40000/ 200000 loss: 2.202
epoch:  50000 batch:   50000/ 200000 loss: 2.582
epoch:  60000 batch:   60000/ 200000 loss: 2.017
epoch:  70000 batch:   70000/ 200000 loss: 2.276
epoch:  80000 batch:   80000/ 200000 loss: 1.788
epoch:  90000 batch:   90000/ 200000 loss: 1.785
epoch: 100000 batch:  100000/ 200000 loss: 1.730
epoch: 110000 batch:  110000/ 200000 loss: 2.063


In [None]:
print(len(lossi))
r = len(lossi)%1000
plt.plot(torch.tensor(lossi[r:]).view(-1, 1000).mean(1))

In [None]:
@torch.no_grad()
def split_loss(split):
    x,y = {
        'train': (X_train, Y_train),
        'valid': (X_dev, Y_dev),
        'test': (X_test, Y_test),
    }[split]
    
    # Evaluate parameters
    logits = model(x)
    loss = F.cross_entropy(logits, y)
    print(split, loss.item())
    
for layer in model.layers:
    layer.training = False
split_loss('train')
split_loss('valid')

# Playground

In [62]:
e = torch.rand(4, 8, 10)
explicit = torch.cat([e[:, ::2, :], e[:, 1::2, :]], dim=2)
explicit.shape

torch.Size([4, 4, 20])

In [64]:
(e.view(4, 4, 20) == explicit).all().item()

True

In [75]:
ix = torch.randint(0, X_train.shape[0], (4,))
Xb, Yb = X_train[ix], Y_train[ix]
logits = model(Xb)
print(Xb.shape)
Xb

torch.Size([4, 8])


tensor([[ 0,  0,  0,  0,  0,  0, 12,  5],
        [ 0,  0,  0,  0,  0,  2, 18,  5],
        [ 0,  0,  0,  0,  5, 13, 13,  5],
        [ 0,  0,  0, 20,  8,  1, 14,  9]])

In [76]:
for layer in model.layers:
    print(layer.__class__.__name__, ":", tuple(layer.out.shape))

Embedding : (4, 8, 10)
FlattenConsecutive : (4, 4, 20)
Linear : (4, 4, 200)
BatchNorm1D : (4, 4, 200)
Tanh : (4, 4, 200)
FlattenConsecutive : (4, 2, 400)
Linear : (4, 2, 200)
BatchNorm1D : (4, 2, 200)
Tanh : (4, 2, 200)
FlattenConsecutive : (4, 400)
Linear : (4, 200)
BatchNorm1D : (4, 200)
Tanh : (4, 200)
Linear : (4, 27)


In [None]:
model.layers[3].running_mean.shape

# Sampling

In [None]:
for _ in range(20):
    out = []
    context = [0] * block_size
    while True:
        logits = model(torch.tensor([context]))
        probs = F.softmax(logits, dim=1) 
        
        # sample
        index = torch.multinomial(probs, num_samples=1).item()
        
        # shift context window
        context = context[1:] + [index]
        if index == 0:
            break
            
        out.append(index)
    
    print(''.join(ind_to_str[i] for i in out))

--------------

# loss log

1. Default start with 

      - train 2.016745090484619
      - valid 2.3248491287231445

    ```
    block_size = 3
    n_emb_dim = 10
    n_hidden = 200
    input_size = n_emb_dim * block_size

    C = torch.randn((vocab_size, n_emb_dim))

    layers = [
        Linear(input_size, n_hidden, bias=False), BatchNorm1D(n_hidden), Tanh(),
        Linear(n_hidden  , vocab_size)
    ]
    total_epochs = 0
    ud = []
    lossi = []
    epoch = 200000
    learning_rate = 0.1
    decay = 0.1
    batch_size = 32
    decay_threshold = 100000
    ```

 
2. Add in the model and more layers. I was concerned here because my numbers diverged from his more than expected, despite using the same setup and seeds.

      - train 1.9769409894943237
      - valid 2.3193278312683105
    
     During step 3 I found the bug; I needed to restart and rerun my notebook to see it. The numbers after the rerun:
      - train 2.0587270259857178
      - valid 2.1071510314941406
     


    ```
    decay_threshold = 150000
    model = Sequential([
        Embedding(vocab_size, n_emb_dim),
        Flatten(),
        Linear(input_size, n_hidden, bias=False), BatchNorm1D(n_hidden), Tanh(),
        Linear(n_hidden  , vocab_size)
    ])
    ```
   
   
3. Up the context

    - train 1.9163451194763184
    - valid 2.034252166748047
    
     ``` 
     block_size = 8
     
     ```
     
4. Go to wavenet style
    Changed `Flatten` -> `FlattenConsecutive`, which keeps the tensors 3-dimensional, for the 8-character context. `n_hidden` was changed to keep the model at roughly 22k parameters.
    

    ```
    n_hidden = 68
    model = Sequential([
    Embedding(vocab_size, n_emb_dim),
    FlattenConsecutive(dim_groups), Linear(n_emb_dim * dim_groups, n_hidden, bias=False), 
    BatchNorm1D(n_hidden), Tanh(),
    FlattenConsecutive(dim_groups), Linear(n_hidden * dim_groups, n_hidden, bias=False), 
    BatchNorm1D(n_hidden), Tanh(),
    FlattenConsecutive(dim_groups), Linear(n_hidden * dim_groups, n_hidden, bias=False), 
    BatchNorm1D(n_hidden), Tanh(),
    Linear(n_hidden  , vocab_size)])
    ```

5. Fix batch norm
    
    This broadcast correctly but didn't do what we wanted it to do. You can see this by going step by step or by looking at `model.layers[3].running_mean.shape`: it's `[1, 4, 68]` instead of the desired `[1, 1, 68]`.
    
    I didn't run the full model as it takes a while, but I tested on a smaller one. We expect a small improvement.
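The shape problem from step 5 is easy to reproduce in isolation. This is a small sketch on a random tensor with the same layout as the output of the first `Linear` (batch, groups, channels), with an arbitrary batch size of 32; it is not the `BatchNorm1D` class itself.

```python
import torch

x = torch.randn(32, 4, 68)  # (batch, groups, channels)

# Reducing over the batch dimension only keeps separate statistics per position
mean_per_position = x.mean(0, keepdim=True)

# Reducing over batch and position gives one statistic per channel
mean_per_channel = x.mean((0, 1), keepdim=True)

print(mean_per_position.shape)  # torch.Size([1, 4, 68])
print(mean_per_channel.shape)   # torch.Size([1, 1, 68])
```

Both versions broadcast against `x` without error, which is why the bug was silent. With the `(0, 1)` reduction each channel's statistics are estimated from 32 * 4 values instead of 32.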
    
    
6. Increase model size 



--------------