# Makemore 3 - Exercises

Exercises from the [makemore #3 video](https://www.youtube.com/watch?v=P6sfmUTpUmc).<br>
The video description holds the exercises, which are also listed below.

1. Watch the [makemore #3 video](https://www.youtube.com/watch?v=P6sfmUTpUmc) on YouTube
2. Come back and complete the exercises to level up :)

In [1]:
import torch
import random
import torch.nn.functional as F
import matplotlib.pyplot as plt

%matplotlib inline

## Exercise 1 - Dead or Alive?

**Objective:** We did not get around to seeing what happens when you initialize all weights and biases to zero.<br>
Try this and train the neural net. You might think that either (1) or (2) below happens, but what you will actually see is (3):
1. the network trains just fine,
2. the network doesn't train at all, or
3. the network trains, but only partially, and achieves a pretty bad final performance.

Inspect the gradients and activations to figure out what is happening and why the network (spoiler) is only partially training, and what part is being trained exactly (and why).
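
As a warm-up for the inspection, here is a minimal sketch of the kind of gradient check that helps. It uses a toy two-layer MLP without BatchNorm, so it is not the exercise's model. The shapes, the random data, and the names `params` and `grad_max` are assumptions made up for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(8, 4)          # hypothetical batch: 8 examples, 4 features
y = torch.randint(0, 3, (8,))  # hypothetical labels over 3 classes

# Toy MLP with every weight and bias initialized to zero
params = {
    'W1': torch.zeros(4, 5), 'b1': torch.zeros(5),
    'W2': torch.zeros(5, 3), 'b2': torch.zeros(3),
}
for p in params.values():
    p.requires_grad = True

h = torch.tanh(x @ params['W1'] + params['b1'])  # all zeros, since tanh(0) = 0
logits = h @ params['W2'] + params['b2']         # all zeros -> uniform prediction
loss = F.cross_entropy(logits, y)
loss.backward()

# Largest absolute gradient entry per parameter
grad_max = {name: p.grad.abs().max().item() for name, p in params.items()}
for name, g_max in grad_max.items():
    print(f'{name}: max |grad| = {g_max:.4f}')
```

In this toy setting only the output bias `b2` receives a nonzero gradient: the hidden activations are zero, which zeroes the gradient of `W2`, and `W2` itself is zero, which blocks the gradient from flowing back to `W1` and `b1`. Whether the full model with BatchNorm behaves the same way is what the exercise asks you to find out.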

In [None]:
words = open('../names.txt', 'r').read().splitlines() # read in all the words
print(words[:5])           # Show the first five words
print(len(words), 'words') # Total number of words in our dataset

In [64]:
# Build a vocabulary of characters and map them to integers (these will be the index tokens)
chars = sorted(list(set(''.join(words))))  # set(): Throwing out letter duplicates
stoi = {s:i+1 for i,s in enumerate(chars)} # Map each char to an index, starting at 1
stoi['.'] = 0                              # Add this special symbol's entry explicitly
itos = {i:s for s,i in stoi.items()}       # Switch order of (char, counter) to (counter, char)
vocab_size = len(itos)

In [None]:
# Build the dataset
block_size = 3 # Context length: We look at this many characters to predict the next one

def build_dataset(words):
  X, Y = [], []

  for w in words:
    context = [0] * block_size
    for ch in w + '.':
      ix = stoi[ch]
      X.append(context)
      Y.append(ix)
      context = context[1:] + [ix] # Crop and append

  X = torch.tensor(X)
  Y = torch.tensor(Y)

  print(X.shape, Y.shape)
  return X, Y

# Randomize the dataset (with reproducibility)
random.seed(42)
random.shuffle(words)

# These are the "markers" we will use to divide the dataset
n1 = int(0.8 * len(words))
n2 = int(0.9 * len(words))

# Dividing the dataset into train, dev and test splits
Xtr, Ytr = build_dataset(words[:n1])     # 80%
Xdev, Ydev = build_dataset(words[n1:n2]) # 10%
Xte, Yte = build_dataset(words[n2:])     # 10%

In [1]:
# TODO: Modify the model implementation to initialize all weights and biases to zero

# Linear Layer Definition (mimicking torch.nn.Linear's structure)
class Linear:

  def __init__(self, fan_in, fan_out, bias=True):
    self.weight = torch.randn((fan_in, fan_out), generator=g) / fan_in ** 0.5
    self.bias = torch.zeros(fan_out) if bias else None # Biases are optional here

  def __call__(self, x):
    self.out = x @ self.weight # x @ W (batch of row vectors times weight matrix)
    if self.bias is not None:  # Add biases if so desired
      self.out += self.bias
    return self.out

  def parameters(self):
    return [self.weight] + ([] if self.bias is None else [self.bias]) # return layer's tensors
  

class BatchNorm1d:

  def __init__(self, dim, eps=1e-5, momentum=0.1):
    self.eps = eps
    self.momentum = momentum
    self.training = True
    # Initialize Parameters (trained with backprop)
    # (bngain -> gamma, bnbias -> beta)
    self.gamma = torch.ones(dim)
    self.beta = torch.zeros(dim)
    # Initialize Buffers
    # (Trained with a running 'momentum update')
    self.running_mean = torch.zeros(dim)
    self.running_var = torch.ones(dim)


  def __call__(self, x):
    # Forward-Pass
    if self.training:
      xmean = x.mean(0, keepdim=True) # Batch mean
      xvar = x.var(0, keepdim=True)   # Batch variance
    else:
      xmean = self.running_mean # Using the running mean as basis
      xvar = self.running_var   # Using the running variance as basis

    # Normalize to unit variance
    xhat = (x - xmean) / torch.sqrt(xvar + self.eps)
    self.out = self.gamma * xhat + self.beta  # Apply batch gain and bias

    # Update the running buffers
    if self.training:
      with torch.no_grad():
        self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * xmean
        self.running_var = (1 - self.momentum) * self.running_var + self.momentum * xvar

    return self.out


  def parameters(self):
    return [self.gamma, self.beta] # return layer's tensors

# Wraps torch.tanh() in a class so it can be stacked like the other layers
class Tanh:
  def __call__(self, x):
    self.out = torch.tanh(x)
    return self.out
  def parameters(self):
    return []

In [None]:
n_embd = 10 # the dimensionality of the character embedding vectors
n_hidden = 100 # the number of neurons in the hidden layer of the MLP
g = torch.Generator().manual_seed(2147483647) # for reproducibility

C = torch.randn((vocab_size, n_embd), generator=g)

layers = [Linear(n_embd * block_size, n_hidden), BatchNorm1d(n_hidden), Tanh(),
          Linear(n_hidden, n_hidden), BatchNorm1d(n_hidden), Tanh(),
          Linear(n_hidden, n_hidden), BatchNorm1d(n_hidden), Tanh(),
          Linear(n_hidden, n_hidden), BatchNorm1d(n_hidden), Tanh(),
          Linear(n_hidden, n_hidden), BatchNorm1d(n_hidden), Tanh(),
          Linear(n_hidden, vocab_size), BatchNorm1d(vocab_size)]

with torch.no_grad():
  # Last layer: make less confident
  layers[-1].gamma *= 0.1 # As last layer is a Batch-Normalization
  # All other layers: apply gain
  for layer in layers[:-1]:
    if isinstance(layer, Linear):
      layer.weight *= 1.0

# Embedding matrix + all parameters in all layers = total involved parameters
parameters = [C] + [p for layer in layers for p in layer.parameters()]
print(f'Params: {sum(p.nelement() for p in parameters)}') # number of parameters in total

# These parameters will be affected by backpropagation
for p in parameters:
  p.requires_grad = True

In [None]:
# Same optimization as was built in the video
max_steps = 200000
batch_size = 32
lossi = [] # Keeping track of loss
ud = []    # Keeping track of Update-to-Data ratio

for i in range(max_steps):
  # Minibatch construct
  ix = torch.randint(0, Xtr.shape[0], (batch_size,), generator=g)
  Xb, Yb = Xtr[ix], Ytr[ix] # batch X,Y

  # Forward pass
  emb = C[Xb] # embed the characters into vectors
  x = emb.view(emb.shape[0], -1) # concatenate the vectors
  for layer in layers:
    x = layer(x)
  loss = F.cross_entropy(x, Yb) # loss function

  # Backward pass
  for layer in layers:
    layer.out.retain_grad() # AFTER_DEBUG: would take out retain_grad
  for p in parameters:
    p.grad = None
  loss.backward()

  # Update
  lr = 0.1 if i < 150000 else 0.01 # step learning rate decay
  for p in parameters:
    p.data += -lr * p.grad

  # Tracking the stats
  if i % 10000 == 0: # Print every once in a while
    print(f'{i:7d}/{max_steps:7d}: {loss.item():.4f}')
  lossi.append(loss.log10().item())
  with torch.no_grad():
    ud.append([((lr*p.grad).std() / p.data.std()).log10().item() for p in parameters])

In [2]:
# TODO: Describe what kind of model behavior you observe (and why it adheres to behavior option 3 from the task)

## Exercise 2 - Folding BatchNorm

**Objective:** BatchNorm, unlike other normalization layers such as LayerNorm or GroupNorm, has the big advantage that after training, the BatchNorm gamma/beta can be "folded into" the weights of the preceding Linear layers, effectively erasing the need to forward it at test time.<br>
- Set up a small $3$-layer MLP with BatchNorms,
- Train the network, then
- "fold" the BatchNorm gamma/beta into the preceding `Linear` layer's $W,\ b$ by creating a new $W2,\ b2$ and erasing the BatchNorm.
- Verify that this gives the same forward pass during inference.

We will see that the BatchNorm is there just for stabilizing the training, and can be thrown out after training is done! Pretty cool.
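
To make the folding idea concrete, here is a minimal sketch on stand-alone tensors rather than on the trained model. All values are random stand-ins, and the names `scale`, `y_ref` and `y_fold` are made up for illustration. It assumes inference mode, i.e. the BatchNorm uses its running mean and variance instead of batch statistics.

```python
import torch

torch.manual_seed(0)
fan_in, fan_out, eps = 6, 4, 1e-5

# Random stand-ins for a trained Linear layer
W = torch.randn(fan_in, fan_out)
b = torch.randn(fan_out)

# Random stand-ins for trained BatchNorm parameters and running buffers
gamma = torch.randn(fan_out)
beta = torch.randn(fan_out)
running_mean = torch.randn(fan_out)
running_var = torch.rand(fan_out) + 0.5  # keep the variances positive

# Reference: Linear followed by BatchNorm at inference time
x = torch.randn(5, fan_in)
h = x @ W + b
y_ref = gamma * (h - running_mean) / torch.sqrt(running_var + eps) + beta

# Fold: absorb the per-feature scale and shift into new weights W2, b2
scale = gamma / torch.sqrt(running_var + eps)  # shape (fan_out,)
W2 = W * scale                                 # scales each output column of W
b2 = (b - running_mean) * scale + beta
y_fold = x @ W2 + b2

print((y_ref - y_fold).abs().max().item())     # largest deviation between both paths
```

The two outputs should agree up to floating-point rounding. Note that in the `BatchNorm1d` class above, the running buffers take shape `(1, dim)` after the first update because of `keepdim=True`; broadcasting handles that shape the same way.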

In [None]:
# TODO: Make sure you revert the zero initialization from the last exercise back to random initialization here

In [None]:
# TODO: Cut out a 3-layer MLP from the above model code (don't fold anything yet, this is just supposed to be the baseline)

We apply the exact same training routine as before:

In [None]:
# Same optimization as last time
max_steps = 200000
batch_size = 32
lossi = [] # Keeping track of loss
ud = []    # Keeping track of Update-to-Data ratio

for i in range(max_steps):
  # Minibatch construct
  ix = torch.randint(0, Xtr.shape[0], (batch_size,), generator=g)
  Xb, Yb = Xtr[ix], Ytr[ix] # batch X,Y

  # Forward pass
  emb = C[Xb] # embed the characters into vectors
  x = emb.view(emb.shape[0], -1) # concatenate the vectors
  for layer in layers:
    x = layer(x)
  loss = F.cross_entropy(x, Yb) # loss function

  # Backward pass
  for layer in layers:
    layer.out.retain_grad() # AFTER_DEBUG: would take out retain_grad
  for p in parameters:
    p.grad = None
  loss.backward()

  # Update
  lr = 0.1 if i < 150000 else 0.01 # step learning rate decay
  for p in parameters:
    p.data += -lr * p.grad

  # Tracking the stats
  if i % 10000 == 0: # Print every once in a while
    print(f'{i:7d}/{max_steps:7d}: {loss.item():.4f}')
  lossi.append(loss.log10().item())
  with torch.no_grad():
    ud.append([((lr*p.grad).std() / p.data.std()).log10().item() for p in parameters])

As a reminder, this is how BatchNorm is formulated:<br>
![](./img/batch_norm_recipe.PNG)
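
In case the image does not render, the standard BatchNorm recipe can be written out as follows, assuming that is what the image shows. Here $x_1, \dots, x_m$ are the pre-activations of one feature across a mini-batch of size $m$, and $\epsilon$ corresponds to `eps` in the `BatchNorm1d` class above:

$$\mu_B = \frac{1}{m} \sum_{i=1}^{m} x_i, \qquad \sigma_B^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2$$

$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y_i = \gamma \, \hat{x}_i + \beta$$

One small deviation: `x.var(0, keepdim=True)` in the class above uses PyTorch's default unbiased estimate, which divides by $m - 1$ instead of $m$. At inference time, $\mu_B$ and $\sigma_B^2$ are replaced by the fixed `running_mean` and `running_var`, so the layer reduces to a fixed per-feature scale and shift.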

In [108]:
# TODO: Fold BatchNorm1d layers into their preceding Linear layers

In [None]:
# TODO: Verify the folding correctness by comparing the outputs of the original and the folded model on some dummy input