# 🏄‍♂️ You will feel surfing!
Here, we will pick up where we left off in the last notebook.

**There:**
- We started out with the original modularized code.
- Then we **cleaned** up the code further by introducing the `Embedding`, `Flatten` and finally the `Sequential` classes, which made our code more "pytorchized".
- Now, we will use them here to **build** the WaveNet.

But before that, we need to **understand** the structure of the WaveNet, to finally be able to surf over it! 🏄‍♂️ <br>
Excited enough? Let's go.

# 🖼 Visualizing the difference
I know, it may be *(very)* confusing from the paper... and also from Andrej's lecture, and you are probably thinking... **okay... but what does it look like? What is going on inside?**

<img src="./images/inside.gif" width=200px height=200px>

I have tried to brainstorm over it, and let's have a look at this thing.

# 🛷 Till now

<img src="./images/simple-net.png">

# 🧠 And, this will be the story

<img src="./images/wavenet.png">

# Woo...
Yeah, just one added dimension and a bunch of axis concatenation; apart from that, everything stays the same.

- We will be calculating things in the same way, with plain matrix multiplication.
- That means, **instead of combining the embeddings** of **all** characters at once as in the old method, **now** we will **combine the embeddings of only two neighbouring (even-odd) characters** at a time. The combined vector therefore always has size `embedding_size * 2` in the first layer (and `n_neurons * 2` in the later ones).
- Passing the pairs separately lets the network learn the hierarchical "wavenet" relationship.
- Of course, you can give it some more time to come up with some "philosophy" but this is how the architecture looks like in a nutshell.
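
To make the pairing idea concrete, here is a minimal sketch with made-up sizes (`batch`, `n_hidden` and the random tensors are illustrative assumptions, not values from this notebook). It contrasts squashing all eight embeddings together with grouping them into four neighbouring pairs, and shows that a matrix multiplication only touches the last axis.

```python
import torch

torch.manual_seed(0)

# Illustrative sizes only (assumptions for this sketch)
batch, block_size, n_embd, n_hidden = 4, 8, 10, 16

emb = torch.randn(batch, block_size, n_embd)           # (4, 8, 10): one embedding per character

# Old approach: squash all 8 embeddings of an example into one long vector
flat_all = emb.view(batch, block_size * n_embd)        # (4, 80)

# Hierarchical approach: fuse neighbouring characters in pairs (1,2), (3,4), ...
pairs = emb.view(batch, block_size // 2, n_embd * 2)   # (4, 4, 20)

# A matrix multiplication acts on the last axis only, so every pair
# is processed by the same weights
W = torch.randn(n_embd * 2, n_hidden)
hidden = pairs @ W                                     # (4, 4, 16)

print(flat_all.shape, pairs.shape, hidden.shape)
```

The extra middle axis simply behaves like a second batch axis: each of the four pairs is pushed through the same weight matrix independently.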

# 👨‍💻 Code

> **NOTE**: The code is kind of a *boilerplate*, so it is just copied from the previous notebook with the updated `Embedding` and related classes 😄

# 1️⃣ Loading & creating the dataset

In [1]:
import torch
import numpy as np
import matplotlib.pyplot as plt
import torch.nn.functional as F

# loading the dataset
with open("./names.txt", "r") as file:
    names = file.read().splitlines()

# total unique characters
characters = sorted(list(set(''.join(names))))

# Building index-to-char and char-to-index
number_to_chr = {k:v for k, v in enumerate(["."] + characters)}
chr_to_number = {v:k for k, v in enumerate(["."] + characters)}

👉 Dataset creation

In [2]:
# This function will build the dataset and return the X, Y
# Used when we have multiple splits :)
block_size = 8
def build_dataset(shuffled_names):
    sot = chr_to_number["."]

    X = []
    y = []

    for name in shuffled_names: #FOR ALL NAMES
        window_chars = [sot] * block_size
        name = name + "."

        for ch in name:
            _3chars = ''.join(
                list(
                    map(lambda x:number_to_chr[x], window_chars)
                )) 
            ch_index = chr_to_number[ch]

            X.append(window_chars)
            y.append(ch_index)
            window_chars = window_chars[1:] + [ch_index]

    X = torch.tensor(X)
    y = torch.tensor(y)
    return X, y

In [3]:
import random
random.seed(42)
random.shuffle(names) # In-place shuffling. The first name will no longer be "emma"


train_idx = int(0.8 * len(names)) # 80%
val_idx = int(0.9 * len(names)) # 90% - 80% = 10%

Xtrain, ytrain = build_dataset(names[:train_idx])
Xval, yval = build_dataset(names[train_idx:val_idx])
Xtest, ytest = build_dataset(names[val_idx:])

print(f"* {Xtrain.shape = }\n* {Xval.shape = }\n* {Xtest.shape = }")

* Xtrain.shape = torch.Size([182625, 8])
* Xval.shape = torch.Size([22655, 8])
* Xtest.shape = torch.Size([22866, 8])


👉 What is in the training?

In [4]:
for x, y in zip(Xtrain[:20], ytrain[:20]):
    print(''.join(number_to_chr[ix.item()] for ix in x), "→", number_to_chr[y.item()])
    if y.item() == 0: print()

........ → y
.......y → u
......yu → h
.....yuh → e
....yuhe → n
...yuhen → g
..yuheng → .

........ → d
.......d → i
......di → o
.....dio → n
....dion → d
...diond → r
..diondr → e
.diondre → .

........ → x
.......x → a
......xa → v
.....xav → i
....xavi → e


# 2️⃣ We "pytorchified" the code in the last notebook.
*Including the `Flatten`, `Embedding` and `Sequential`.*

## Creating a `Linear` class 

In [5]:
class Linear:
    """
    This will be used to create a Linear Layer of `n_ins` and `n_outs`
    and also performs the matrix multiplication
    
    - Possible to enable/disable the bias
    - Automatically set the weights and initialize them with Kaiming
    """
    
    def __init__(self, n_ins, n_outs, bias=True):
        self.weight = torch.randn(n_ins, n_outs) / n_ins**0.5
        self.bias = torch.zeros(n_outs) if bias else None
        
    def __call__(self, x):
        self.out = x @ self.weight
        if self.bias is not None:
            self.out += self.bias
        return self.out
    
    def parameters(self):
        return [self.weight] + ([] if self.bias is None else [self.bias])
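
As a quick sanity check of the `/ n_ins**0.5` scaling used in `Linear`, the sketch below compares the spread of the outputs with and without it. The sizes and the random inputs are assumptions chosen just for illustration.

```python
import torch

torch.manual_seed(0)
n_ins, n_outs = 200, 200                              # illustrative sizes
x = torch.randn(1000, n_ins)                          # unit-variance inputs

w_raw = torch.randn(n_ins, n_outs)                    # no scaling
w_scaled = torch.randn(n_ins, n_outs) / n_ins**0.5    # same scaling as in `Linear`

std_raw = (x @ w_raw).std().item()
std_scaled = (x @ w_scaled).std().item()

# Without scaling the outputs blow up roughly by sqrt(n_ins);
# with scaling they stay close to unit standard deviation
print(f"unscaled: {std_raw:.2f}, scaled: {std_scaled:.2f}")
```

Keeping the pre-activations near unit scale is what stops the `Tanh` layers that follow from saturating right at initialization.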

## Creating a `BatchNorm` class 
*This is the buggy implementation, with the mean being calculated only over the `0`th axis; we will fix this later, once we have performed the basic test*.

In [6]:
class BatchNorm1d:
    """
    This will implement the whole batchnorm stuff that can later be added 
    with the linear layer.
    
    - Perform normalization
    - Keep track of the statistics of the batch "while training" and "while evaluation".
    - Distinction between training and evaluation/inference.
    """
    
    def __init__(self, dim, eps=1e-5, momentum=0.1):
        """
        `eps`: Adds a small number to the denominator while standardizing to
            avoid a division-by-zero error
            
        `momentum`: Used in the calculation of the running statistics while
            training; it sets how much of the current batch's mean and variance
            is mixed into the running values. Higher momentum means the running
            statistics follow the current batch more closely, and vice versa.
        """
        
        self.dim = dim
        self.eps = eps
        self.momentum = momentum
        self.training = True # Will be explained later in a bit below.
        
        ### For scaling & shifting
        # Scaler will be called `gamma`
        # Shifter will be called `beta`
        self.gamma = torch.ones(dim)
        self.beta = torch.zeros(dim)
        
        ### Keep track of running mean and variance for the inference!
        self.running_mean = torch.zeros(dim)
        self.running_var = torch.ones(dim)
        
        
    def __call__(self, x):   
        ### If `training` then calculate the mean and var 
        if self.training:
            xmean = x.mean(0, keepdims=True)
            xvar = x.var(0, keepdims=True)
        ### If `not training` then use the running mean and var
        else:
            xmean = self.running_mean
            xvar = self.running_var
            
        ### Normalize!
        xhat = (x - xmean) / torch.sqrt(xvar + self.eps)
        self.out = self.gamma * xhat + self.beta
        
        ### Calculate the running mean and variance
        if self.training:
            with torch.no_grad():
                self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * xmean
                self.running_var = (1 - self.momentum) * self.running_var + self.momentum * xvar
        return self.out
    
    def parameters(self):
        return [self.gamma, self.beta]

## Creating a `Tanh` class 

In [7]:
class Tanh:
    """
    Just to calculate the `tanh`
    """
    
    def __call__(self, x):
        self.out = torch.tanh(x)
        return self.out
    
    def parameters(self):
        return []

## Creating a `Embedding` class

In [8]:
class Embedding:
    """
    1. It will initialize the weights
    2. It is able to call the weights based on their index
    """
    def __init__(self, vocab_size, n_embd):
        self.weights = torch.randn(vocab_size, n_embd)
        
    def __call__(self, IX):
        self.out = self.weights[IX]
        return self.out
    
    def parameters(self):
        return [self.weights]

## Creating a `Flatten` class

> As Andrej demonstrated [in this clip](https://youtube.com/clip/UgkxwVvcaO-5voBsSDQhVH0qvo0wqJeUjEEc), the `view` operation gives exactly the same result as the "explicit" even-odd concatenation described above, so we will use that 😉

In [9]:
class Flatten:
    """
    This will be the concatenator, BUT with the updated Wavenet
    based settings.

    So instead of flattening things out, of all embeddings together,
    here we will do the `.view()` operation.

    ### NOTE ###
    Here I am taking the constant `2` which is the only number
    to let us continue with the EVEN / ODD example explained above.

    What Andrej has done is taking `n` as an initial input which would
    replace the constant `2`. But we will continue our WaveNet example
    with this constant `2` without making things complicated.

    For now, just understand that `2` is the number which will get the 
    math correct in the `.view()` operation and result in the perfect
    concatenation operation.
    
    """
    def __call__(self, x):
        n_samples, block_size, emb_size = x.shape
        self.out = x.view(n_samples, block_size//2, emb_size*2)
        if self.out.shape[1] == 1:
            self.out = torch.squeeze(self.out, dim=1)
        return self.out
    
    def parameters(self):
        return []
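
The claim that `.view()` reproduces the explicit even-odd concatenation can be checked directly. This is a small sketch on a random tensor (the sizes are illustrative assumptions), not part of the model itself.

```python
import torch

torch.manual_seed(0)
e = torch.randn(4, 8, 10)   # (n_samples, block_size, emb_size), illustrative sizes

# Explicit version: take the even positions and the odd positions,
# then concatenate them along the embedding axis
explicit = torch.cat([e[:, ::2, :], e[:, 1::2, :]], dim=2)   # (4, 4, 20)

# Implicit version: the same regrouping expressed as a reshape
implicit = e.view(4, 4, 20)

print(torch.equal(explicit, implicit))  # True
```

The `view` version is preferable because it only reinterprets the existing memory layout, while `torch.cat` has to allocate and copy a new tensor.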

## Creating a `Sequential` class

In [10]:
class Sequential:
    """
    We will simply replace the explicit LIST keeping
    and FOR LOOPING for the forward pass, in this
    single class.
    
    This is a very neat thing to do.
    """
    
    def __init__(self, layers):
        self.layers = layers
        
    def __call__(self, x):
        for layer in self.layers:  # use the instance's own layers, not the global `layers`
            x = layer(x)
        self.out = x
        return self.out
    
    def parameters(self):
        parameters = []
        for layer in self.layers:
            for p in layer.parameters():
                parameters.append(p)
        return parameters

## 🧠 Model

In [11]:
torch.manual_seed(42);

In [12]:
n_embd = 10
n_neurons = 200
vocab_size = len(number_to_chr) # 27

layers = [
    Embedding(vocab_size, n_embd), 
    Flatten(), Linear(n_embd * 2, n_neurons), BatchNorm1d(n_neurons), Tanh(),
    Flatten(), Linear(n_neurons * 2, n_neurons), BatchNorm1d(n_neurons), Tanh(),
    Flatten(), Linear(n_neurons * 2, n_neurons), BatchNorm1d(n_neurons), Tanh(),
    Linear(n_neurons, vocab_size), 
]

model = Sequential(layers) ### WE WILL CALL `MODEL` ✨

In [13]:
with torch.no_grad():
    model.layers[-1].weight *= 0.1

parameters = model.parameters()

print(sum(p.nelement() for p in parameters))
for p in parameters:
    p.requires_grad = True

171497
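
The printed total can be reproduced by hand. The helper below is a hypothetical function (not part of the notebook) that adds up the parameters of the stack defined above: the embedding table, three `Linear` + `BatchNorm1d` blocks, and the output `Linear`.

```python
def count_parameters(vocab_size, n_embd, n_neurons, n_blocks=3):
    """Hypothetical helper: parameter count of the layer stack built above."""
    total = vocab_size * n_embd                     # Embedding table
    fan_in = n_embd * 2                             # first Flatten pairs two embeddings
    for _ in range(n_blocks):
        total += fan_in * n_neurons + n_neurons     # Linear weight + bias
        total += 2 * n_neurons                      # BatchNorm gamma + beta
        fan_in = n_neurons * 2                      # next Flatten pairs two hidden vectors
    total += n_neurons * vocab_size + vocab_size    # output Linear weight + bias
    return total

print(count_parameters(27, 10, 200))  # 171497
print(count_parameters(27, 10, 48))   # 12201
```

The second call corresponds to the smaller 48-neuron configuration used later in this notebook.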


In [14]:
epochs = 10_000 # we will break after the first step, don't worry
batch_size = 32
losses = []

for i in range(epochs):
    sample_idx = torch.randint(0, Xtrain.shape[0], (batch_size,))
    Xb, Yb = Xtrain[sample_idx], ytrain[sample_idx]
    
    logits = model(Xb)
    loss = F.cross_entropy(logits, Yb) 
    
    # 2️⃣ Backward
    for p in parameters:
        p.grad = None
    loss.backward()
    
    # 3️⃣ Update - with decay
    learning_rate = 0.1 if i < 10_000 else 0.01
    for p in parameters:
        p.data += -learning_rate * p.grad
        
    if i % 10000 == 0:
        print(f'{i:7d}/{epochs:7d}: {loss.item():.4f}')
    losses.append(loss.log10().item()) # for better visualization 
    break

      0/  10000: 3.3014


In [15]:
x = Xb
for idx, layer in enumerate(model.layers):
    x = layer(x)
    layer_name = layer.__class__.__name__
    print(layer_name, ":",layer.out.shape, end="\n" if layer_name != "Tanh" else "\n\n")

Embedding : torch.Size([32, 8, 10])
Flatten : torch.Size([32, 4, 20])
Linear : torch.Size([32, 4, 200])
BatchNorm1d : torch.Size([32, 4, 200])
Tanh : torch.Size([32, 4, 200])

Flatten : torch.Size([32, 2, 400])
Linear : torch.Size([32, 2, 200])
BatchNorm1d : torch.Size([32, 2, 200])
Tanh : torch.Size([32, 2, 200])

Flatten : torch.Size([32, 400])
Linear : torch.Size([32, 200])
BatchNorm1d : torch.Size([32, 200])
Tanh : torch.Size([32, 200])

Linear : torch.Size([32, 27])


## 🤔 Which is currently...

<img src="./images/wavenet-wiz.png">

Everything looks nice! 🎇

In [15]:
n_embd = 10
n_neurons = 48 ### Andrej used 68; I will use 48 to roughly match my previous model (12297)
vocab_size = len(number_to_chr)

layers = [
    Embedding(vocab_size, n_embd), 
    Flatten(), Linear(n_embd * 2, n_neurons), BatchNorm1d(n_neurons), Tanh(),
    Flatten(), Linear(n_neurons * 2, n_neurons), BatchNorm1d(n_neurons), Tanh(),
    Flatten(), Linear(n_neurons * 2, n_neurons), BatchNorm1d(n_neurons), Tanh(),
    Linear(n_neurons, vocab_size), 
]

model = Sequential(layers) ### WE WILL CALL `MODEL` ✨

In [16]:
with torch.no_grad():
    model.layers[-1].weight *= 0.1

parameters = model.parameters()

print(sum(p.nelement() for p in parameters))
for p in parameters:
    p.requires_grad = True

12201


In [17]:
epochs = 1_00_000
batch_size = 32
losses = []

for i in range(epochs):
    sample_idx = torch.randint(0, Xtrain.shape[0], (batch_size,))
    Xb, Yb = Xtrain[sample_idx], ytrain[sample_idx]
    
    # 1️⃣ Forward
    logits = model(Xb)
    loss = F.cross_entropy(logits, Yb) 
    
    # 2️⃣ Backward
    for p in parameters:
        p.grad = None
    loss.backward()
    
    # 3️⃣ Update - with decay
    learning_rate = 0.1 if i < 10_000 else 0.01
    for p in parameters:
        p.data += -learning_rate * p.grad
        
    if i % 10000 == 0:
        print(f'{i:7d}/{epochs:7d}: {loss.item():.4f}')
    losses.append(loss.log10().item()) # for better visualization 

      0/ 100000: 3.3048
  10000/ 100000: 2.1995
  20000/ 100000: 2.0995
  30000/ 100000: 2.0591
  40000/ 100000: 2.5509
  50000/ 100000: 2.1282
  60000/ 100000: 1.6855
  70000/ 100000: 2.5461
  80000/ 100000: 2.3684
  90000/ 100000: 2.7769


In [18]:
@torch.no_grad() # NEW - Will disable the gradient tracking temporarily - for performance's sake
def split_loss(split: str):
    x, y = {
        'train': (Xtrain, ytrain),
        'test': (Xtest, ytest),
        'val': (Xval, yval)
    }[split]
    
    logits = model(x)
    final_loss = F.cross_entropy(logits, y)
    print(split.title(), ":\t", round(final_loss.item(), 5))

In [19]:
for layer in model.layers:
    layer.training = False

In [40]:
split_loss('train')
split_loss('val')
split_loss('test')

Train :	 2.0576
Val :	 2.09363
Test :	 2.08772


⌛ Before, in **Simple ANN**: Test Loss = `2.15053` <br>
⌚ Now, in **WaveNet *(with BatchNorm Bug)***: Test Loss = `2.0865`

## 🐞 That BatchNorm Bug

## Creating a *bug free* `BatchNorm` class 

<img src="./images/batch_norm_mean_bug.png">
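
The bug is easiest to see from the shapes of the statistics. Here is a minimal sketch on a random 3D tensor with the same layout as the output of the first `Linear` layer (the tensor itself is an illustrative assumption):

```python
import torch

torch.manual_seed(0)
x = torch.randn(32, 4, 200)   # (batch, positions, channels)

# Buggy behaviour: reducing only over axis 0 keeps a separate
# statistic for each of the 4 positions
mean_buggy = x.mean(0, keepdim=True)        # (1, 4, 200)

# Intended behaviour: treat both the batch and the position axes as
# "batch" axes, so there is exactly one statistic per channel
mean_fixed = x.mean((0, 1), keepdim=True)   # (1, 1, 200)

print(mean_buggy.shape, mean_fixed.shape)
```

Thanks to broadcasting the buggy version still runs without errors, which is why it goes unnoticed; it just estimates each statistic from 32 values instead of 32 * 4.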

### So, the updated `BatchNorm` code is written below 👇

In [21]:
class BatchNorm1d:
    """
    The updated BatchNorm class, where the running mean and variance will 
    take care of the 3D input.
    """
    
    def __init__(self, dim, eps=1e-5, momentum=0.1):
        self.dim = dim
        self.eps = eps
        self.momentum = momentum
        self.training = True
        
        self.gamma = torch.ones(dim)
        self.beta = torch.zeros(dim)
        
        self.running_mean = torch.zeros(dim)
        self.running_var = torch.ones(dim)
        
        
    def __call__(self, x):   
        ## UPDATED CODE ##
        if self.training:
            if x.ndim == 2:
                dim = 0
            if x.ndim == 3:
                dim = (0, 1)
            xmean = x.mean(dim, keepdims=True)
            xvar = x.var(dim, keepdims=True)
        else:
            xmean = self.running_mean
            xvar = self.running_var
            
        xhat = (x - xmean) / torch.sqrt(xvar + self.eps)
        self.out = self.gamma * xhat + self.beta
        
        if self.training:
            with torch.no_grad():
                self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * xmean
                self.running_var = (1 - self.momentum) * self.running_var + self.momentum * xvar
        return self.out
    
    def parameters(self):
        return [self.gamma, self.beta]

👉 Creating a model *(again)*

In [22]:
n_embd = 10
n_neurons = 48 ### Andrej used 68; I will use 48 to roughly match my previous model (12297)
vocab_size = len(number_to_chr)

layers = [
    Embedding(vocab_size, n_embd), 
    Flatten(), Linear(n_embd * 2, n_neurons), BatchNorm1d(n_neurons), Tanh(),
    Flatten(), Linear(n_neurons * 2, n_neurons), BatchNorm1d(n_neurons), Tanh(),
    Flatten(), Linear(n_neurons * 2, n_neurons), BatchNorm1d(n_neurons), Tanh(),
    Linear(n_neurons, vocab_size), 
]

model = Sequential(layers) ### WE WILL CALL `MODEL` ✨

In [23]:
with torch.no_grad():
    model.layers[-1].weight *= 0.1

parameters = model.parameters()

print(sum(p.nelement() for p in parameters))
for p in parameters:
    p.requires_grad = True

12201


In [24]:
epochs = 1_00_000
batch_size = 32
losses = []

for i in range(epochs):
    sample_idx = torch.randint(0, Xtrain.shape[0], (batch_size,))
    Xb, Yb = Xtrain[sample_idx], ytrain[sample_idx]

    # 1️⃣ Forward
    logits = model(Xb)
    loss = F.cross_entropy(logits, Yb) 
    
    # 2️⃣ Backward
    for p in parameters:
        p.grad = None
    loss.backward()
    
    # 3️⃣ Update - with decay
    learning_rate = 0.1 if i < 10_000 else 0.01
    for p in parameters:
        p.data += -learning_rate * p.grad
        
    if i % 10000 == 0:
        print(f'{i:7d}/{epochs:7d}: {loss.item():.4f}')
    losses.append(loss.log10().item()) # for better visualization 

      0/ 100000: 3.3010
  10000/ 100000: 2.2331
  20000/ 100000: 2.2508
  30000/ 100000: 2.0198
  40000/ 100000: 1.9547
  50000/ 100000: 1.7769
  60000/ 100000: 2.0611
  70000/ 100000: 1.7319
  80000/ 100000: 2.4099
  90000/ 100000: 2.1044


In [25]:
for layer in model.layers:
    layer.training = False

In [73]:
split_loss('train')
split_loss('val')
split_loss('test')

Train :	 2.05237
Val :	 2.09428
Test :	 2.08826


⌛ Before that, in **Simple ANN**: Test Loss = `2.15053` <br>
⏲ Before, in **WaveNet *(with BatchNorm Bug)***: Test Loss = `2.0865` <br>
⌚ Now, in **WaveNet *(without BatchNorm Bug)***: Test Loss = `2.08826`

### Inference!! 🎉

In [107]:
for _ in range(20):
    out = []
    context = [0] * block_size
    
    while True:
        x = torch.tensor([context])
        logits = model(x)
        
        probs = F.softmax(logits, dim=1)
        ix = torch.multinomial(probs, num_samples=1).item()
        context = context[1:] + [ix]
        out.append(ix)
        
        if ix == 0:
            break
            
    print(''.join(number_to_chr[i] for i in out))

thama.
nezianna.
darrett.
vihahorree.
brocllyn.
ejanie.
yaden.
yabeolade.
kalys.
yannett.
azeana.
jahariof.
tynna.
alygin.
caadian.
aiv.
sheri.
joha.
myael.
kiadsle.


# 🎁 Bonus 
Let's try, **if the model can complete your name** or not 😆

In [134]:
name = "aayu" # Enter the first half of a name
initials = list(map(lambda c: chr_to_number[c], name))
initials

[1, 1, 25, 21]

In [135]:
# Prepending "dots" to fill up the block size
context = [0] * (block_size - len(initials)) + initials
context

[0, 0, 0, 0, 1, 1, 25, 21]

In [151]:
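for trial in range(10):
    out = initials.copy()
    # Reset the context on every trial so that each sample starts from the initials
    context = [0] * (block_size - len(initials)) + initials
    while True: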
        x = torch.tensor([context])
        logits = model(x)
        
        probs = F.softmax(logits, dim=1)
        ix = torch.multinomial(probs, num_samples=1).item()
        context = context[1:] + [ix]
        out.append(ix)
        
        if ix == 0:
            break
            
    print(''.join(number_to_chr[i] for i in out))

aayua.
aayue.
aayus.
aayun.
aayufei.
aayu.
aayu.
aayu.
aayuawsen.
aayua.


> Oops, it didn't complete my name even a single time 😅

## 😒 But,
There seem to be some limitations. The model:
1. Seems to require a "minimum" number of layers, which depends on the `block_size`.
    - In this case, we have **block_size=8**, for which we have to implement a **minimum** of `3` hidden layers.
    - We can't lower the number of hidden layers, otherwise the **second dimension** won't end up being `1`.
2. Following on from the **first** point... if we have a block size of, say, `12`, or even `3`, it won't work.
    - Because this WaveNet seems to work only when the block size is `2`, `4`, `8`, `16`, `32` and so on.
    - If you choose any other block size, the shapes won't match up and you will get an error.
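
Both points follow from the fact that every `Flatten` halves the number of positions. A small hypothetical helper (not part of the notebook, and assuming the constant pairing factor of `2` used in `Flatten`) makes the constraint explicit:

```python
def n_pairing_layers(block_size: int) -> int:
    """Hypothetical helper: how many pair-wise Flatten stages are needed to
    reduce `block_size` positions down to one. Raises if that is impossible."""
    # A power of two has exactly one bit set, so n & (n - 1) is zero
    if block_size < 2 or (block_size & (block_size - 1)) != 0:
        raise ValueError(f"block_size={block_size} is not a power of two")
    # For a power of two, bit_length() - 1 equals log2(block_size)
    return block_size.bit_length() - 1

print(n_pairing_layers(8))   # 3
print(n_pairing_layers(16))  # 4
```

Calling it with `12` or `3` raises a `ValueError`, which mirrors the shape error the `.view()` call would produce somewhere inside the network.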
  
> But anyways... we have learnt something new here 🤘

# With that said,
Let's meet in the master lecture next, where we will build the GPT 😈