# PyTorch-ifying Batch Normalization

In this notebook we will use ready-made classes from PyTorch rather than writing custom NN layers with BN and activation by hand. This is how it would be deployed in production. Let's go through some prerequisites first.

1. [nn.Linear](https://docs.pytorch.org/docs/stable/generated/torch.nn.Linear.html#torch.nn.Linear)
- Initializes a linear layer ($Wx + b$) with `in_features`, `out_features`, and `bias` as parameters. When the layer is followed by BN, `bias=False` can be set, since BN subtracts the batch mean and adds its own bias anyway.
- Also see _how_ the weights and biases are initialized when `nn.Linear()` is called:<br>
Values are sampled uniformly from $U(-\sqrt{k},\sqrt{k})$, where $k = \frac{1}{in\_features}$. Note that this has the same $\frac{1}{\sqrt{fan\_in}}$ scaling as `kaiming init`, but without the gain factor $\frac{5}{3}$ for tanh!
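
The initialization described above can be checked empirically. The snippet below is a minimal sketch; the layer sizes and the seed are arbitrary choices for illustration, not values from this notebook. Note that `nn.Linear` stores its weight as `(out_features, in_features)`, the transpose of the `(fan_in, fan_out)` layout used in the hand-written class later on.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed, only for reproducibility

in_features, out_features = 200, 100  # illustrative sizes
layer = nn.Linear(in_features, out_features, bias=False)

bound = 1 / math.sqrt(in_features)  # sqrt(k) with k = 1/in_features
w = layer.weight.detach()           # shape: (out_features, in_features)

print(tuple(w.shape))               # (100, 200)
print(layer.bias)                   # None, because bias=False

# Every weight lies inside [-sqrt(k), sqrt(k)]
print(bool(w.abs().max() <= bound))

# The std of U(-b, b) is b / sqrt(3); compare with the empirical std
print(w.std().item(), bound / math.sqrt(3))
```

Because the samples are uniform, the standard deviation of the weights is $\sqrt{k}/\sqrt{3}$ rather than $\sqrt{k}$ itself.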


2. [nn.BatchNorm1d](https://docs.pytorch.org/docs/stable/generated/torch.nn.BatchNorm1d.html#torch.nn.BatchNorm1d1)
- Mathematically: $y = \frac{x - \mathbb{E}[x]}{\sqrt{\text{Var}[x] + \varepsilon}} \cdot \gamma + \beta$, where $\gamma$ = `bngain` and $\beta$ = `bnbias`

Let's break down its arguments: <br>
- `eps` ($\varepsilon$) keeps the denominator away from zero when the batch variance is close to $0$
- `momentum` is the rate at which the batch statistics (`bnmeani`) update `running_mean` ($\alpha$ in the previous files)
- `affine` must be `True` for _bngain_ and _bnbias_ to be learnable
- `track_running_stats` lets the layer estimate the overall mean and variance during training itself; these running estimates are then used at inference
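
To make the formula and the arguments above concrete, here is a small sketch that compares `nn.BatchNorm1d` against a manual computation. The batch shape, scaling, and seed are assumptions made for illustration. One detail worth knowing: in training mode PyTorch normalizes with the biased batch variance, but updates `running_var` with the unbiased one.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed, only for reproducibility

bn = nn.BatchNorm1d(4)          # defaults: eps=1e-5, momentum=0.1, affine=True
x = torch.randn(32, 4) * 3 + 2  # toy batch: 32 examples, 4 features

bn.train()
y = bn(x).detach()

with torch.no_grad():
    mean = x.mean(dim=0, keepdim=True)
    var_biased = x.var(dim=0, keepdim=True, unbiased=False)
    var_unbiased = x.var(dim=0, keepdim=True, unbiased=True)

    # y = (x - E[x]) / sqrt(Var[x] + eps) * gamma + beta
    y_manual = (x - mean) / torch.sqrt(var_biased + bn.eps) * bn.weight + bn.bias

    # Buffers start at mean=0, var=1 and move by `momentum` towards the batch stats
    expected_mean = (1 - bn.momentum) * 0.0 + bn.momentum * mean.squeeze(0)
    expected_var = (1 - bn.momentum) * 1.0 + bn.momentum * var_unbiased.squeeze(0)

print(torch.allclose(y, y_manual, atol=1e-5))
print(torch.allclose(bn.running_mean, expected_mean, atol=1e-6))
print(torch.allclose(bn.running_var, expected_var, atol=1e-6))

# In eval mode the running statistics replace the batch statistics
bn.eval()
with torch.no_grad():
    y_eval = bn(x)
    y_eval_manual = (x - bn.running_mean) / torch.sqrt(bn.running_var + bn.eps)
print(torch.allclose(y_eval, y_eval_manual, atol=1e-5))
```

Here `bn.weight` and `bn.bias` play the roles of $\gamma$ and $\beta$; they are initialized to ones and zeros, so the affine step is an identity until training changes them.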



In [1]:
import torch
import torch.nn as nn
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
words = open('names.txt', 'r').read().splitlines()

In [10]:
allchars = sorted(set(''.join(words)))

stoi = {s:i+1 for i,s in enumerate(allchars)}
stoi['.'] = 0

itos = {i:s for s,i in stoi.items()}

vocab_size = len(itos)

In [12]:
# build the dataset
block_size = 3 # context length: how many characters do we take to predict the next one?

def build_dataset(words):  
  X, Y = [], []
  
  for w in words:
    context = [0] * block_size
    for ch in w + '.':
      ix = stoi[ch]
      X.append(context)
      Y.append(ix)
      context = context[1:] + [ix] # crop and append

  X = torch.tensor(X)
  Y = torch.tensor(Y)
  print(X.shape, Y.shape)
  return X, Y

import random
random.seed(42)
random.shuffle(words)
n1 = int(0.8*len(words))
n2 = int(0.9*len(words))

Xtr,  Ytr  = build_dataset(words[:n1])     # 80%
Xdev, Ydev = build_dataset(words[n1:n2])   # 10%
Xte,  Yte  = build_dataset(words[n2:])     # 10%

torch.Size([182625, 3]) torch.Size([182625])
torch.Size([22655, 3]) torch.Size([22655])
torch.Size([22866, 3]) torch.Size([22866])
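
To see what `build_dataset` produces, the sliding context window can be traced for a single word. This is only an illustrative sketch: the word `"emma"` and the a-z alphabet are assumptions, whereas the real `stoi` above is derived from `names.txt`.

```python
block_size = 3  # same context length as above
word = "emma"   # illustrative word, assumed for this example

# Assumed vocabulary: lowercase a-z plus '.' at index 0
stoi = {ch: i + 1 for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz")}
stoi["."] = 0
itos = {i: ch for ch, i in stoi.items()}

pairs = []
context = [0] * block_size  # start with '...' padding
for ch in word + ".":
    ix = stoi[ch]
    pairs.append(("".join(itos[i] for i in context), ch))
    context = context[1:] + [ix]  # crop and append

for ctx, target in pairs:
    print(ctx, "--->", target)
# ... ---> e
# ..e ---> m
# .em ---> m
# emm ---> a
# mma ---> .
```

Each word of length $n$ therefore contributes $n + 1$ rows to `X` and `Y`, which is why the dataset has far more examples than there are words.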


Let's make our network deeper and our layers more general, instead of defining each layer explicitly.

The classes we create here follow the same API as the `nn.Module` classes in PyTorch.

In [13]:
g = torch.Generator().manual_seed(200989800)

In [16]:
class Linear:

    def __init__(self, fan_in, fan_out, bias = True):
        # scale by 1/sqrt(fan_in) so the pre-activations keep roughly unit variance
        self.weight = torch.randn((fan_in, fan_out), generator=g) / fan_in**0.5
        self.bias = torch.zeros(fan_out) if bias else None
    
    def __call__(self, x):
        self.out = x @ self.weight
        if self.bias is not None:
            self.out += self.bias
        return self.out
    
    def parameters(self):
        params = [self.weight] + ([] if self.bias is None else [self.bias])
        return params
    

class BatchNorm1d:

    def __init__(self, dim, eps = 1e-5, momentum = 0.1):
        self.eps = eps
        self.momentum = momentum
        self.training = True
        # parameters (trained with backprop)
        self.gamma = torch.ones(dim)
        self.beta = torch.zeros(dim)
        # buffers (trained with a running 'momentum update')
        self.running_mean = torch.zeros(dim)
        self.running_var = torch.ones(dim)
    
    def __call__(self, x):
        if self.training:
            xmean = x.mean(dim = 0, keepdim = True)
            xvar = x.var(dim = 0, keepdim = True)

        else: 
            xmean = self.running_mean
            xvar = self.running_var
        # apply to data
        xhat = (x-xmean)/torch.sqrt(xvar + self.eps)

        self.out = self.gamma * xhat + self.beta

        if self.training:
            with torch.no_grad():  # buffer updates must not be tracked by autograd
                self.running_mean = self.momentum * xmean + (1 - self.momentum) * self.running_mean
                self.running_var = self.momentum * xvar + (1 - self.momentum) * self.running_var

        return self.out

    def parameters(self):
        return [self.gamma, self.beta]

class Tanh:

    def __call__(self, x):
        self.out = torch.tanh(x)
        return self.out
    def parameters(self):
        return []
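
For comparison, the three hand-written classes above correspond to `nn.Linear`, `nn.BatchNorm1d`, and `nn.Tanh` in PyTorch. The following is a minimal sketch of stacking the built-in versions; the sizes `n_in`, `n_hidden`, and `batch` are assumed values for illustration, not taken from this notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed, only for reproducibility

# Assumed sizes, for illustration only
n_in, n_hidden, batch = 30, 100, 32

block = nn.Sequential(
    nn.Linear(n_in, n_hidden, bias=False),  # bias dropped: BatchNorm's beta takes its role
    nn.BatchNorm1d(n_hidden),
    nn.Tanh(),
)

x = torch.randn(batch, n_in)
out = block(x)  # training mode: BatchNorm uses the batch statistics
print(out.shape)  # torch.Size([32, 100])

# Linear weight (30*100) + BatchNorm gamma (100) + beta (100)
n_params = sum(p.numel() for p in block.parameters())
print(n_params)  # 3200

# Eval mode switches BatchNorm to its running statistics,
# which also makes a single-example forward pass possible.
block.eval()
single = block(torch.randn(1, n_in))
print(single.shape)  # torch.Size([1, 100])
```

Calling `block.eval()` plays the same role as setting `self.training = False` on the hand-written `BatchNorm1d`.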