Q1) What are the parameters of batchnorm? What information do you need to train the parameters?  

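A1) Batchnorm computes $ \frac{X - \mu}{\sigma} \cdot \gamma + \beta $. $\gamma$ and $\beta$ are the parameters: they are optimized on the training data. $\mu$ and $\sigma$ are not learned; they are computed from the data itself (the mean and standard deviation of the current batch during training).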
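As a quick numeric check of the formula above, the sketch below normalizes a small list in plain Python. The input values, `gamma`, `beta`, and `eps` are made up for illustration; `eps` is the small stability constant that implementations usually add under the square root, and it is not part of the formula as written in the answer.

```python
import statistics

x = [2.0, 4.0, 6.0, 8.0]           # made-up activations of one channel in one batch
gamma, beta, eps = 1.5, 0.5, 1e-5  # illustrative values; gamma and beta would be learned

mu = statistics.fmean(x)           # batch mean, computed from the data
var = statistics.pvariance(x)      # batch variance, computed from the data

# normalize to roughly zero mean and unit variance, then apply scale and shift
x_hat = [(v - mu) / (var + eps) ** 0.5 for v in x]
y = [gamma * v + beta for v in x_hat]

print(statistics.fmean(x_hat), statistics.pstdev(x_hat))
print(statistics.fmean(y), statistics.pstdev(y))
```

After normalization the output mean is set by `beta` and the output spread by `gamma`, which is why only those two need to be trained.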

Q2) Why do we use an 'exponentially weighted moving average' (a chain of linear interpolations) of the training statistics at inference time? In other words, why can't we use the mean and variance of a single training batch at inference time?  
A2) The statistics of a single training batch can be unrepresentative. If we get a totally different type of image at inference time, we cannot fairly assess the model, since the mean/variance obtained from one training batch could be irrelevant to the validation/test data. Averaging over many batches gives a more stable estimate.
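The "chain of linear interpolations" can be sketched in plain Python as follows. The per-batch means are made-up numbers and `lerp` here is a hand-written helper that mirrors what `lerp_` does in the `BatchNorm` class further down, so treat it as an illustration rather than the course's code.

```python
def lerp(start, end, weight):
    """Linear interpolation: move `weight` of the way from start to end."""
    return start + weight * (end - start)

batch_means = [0.9, 1.1, 5.0, 1.0, 0.95]  # made-up per-batch means; 5.0 is an outlier batch
mom = 0.1                                  # plays the same role as `mom` in BatchNorm
running = 0.0                              # the running mean starts at zero

for m in batch_means:
    # equivalent to: running = (1 - mom) * running + mom * m
    running = lerp(running, m, mom)

print(running)
```

Each batch moves the running value by only 10% of the gap, so a single unusual batch (the 5.0 above) cannot dominate the statistics used at inference time.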

In [1]:
!git clone https://github.com/fastai/course-v3/

Cloning into 'course-v3'...
remote: Enumerating objects: 5893, done.
remote: Counting objects: 100% (3/3), done.
remote: Compressing objects: 100% (3/3), done.
remote: Total 5893 (delta 0), reused 2 (delta 0), pack-reused 5890
Receiving objects: 100% (5893/5893), 263.03 MiB | 33.42 MiB/s, done.
Resolving deltas: 100% (3249/3249), done.


In [2]:
%cd /content/course-v3/nbs/dl2/
%load_ext autoreload
%autoreload 2
%matplotlib inline
from exp.nb_06 import *

/content/course-v3/nbs/dl2


In [3]:
def get_data():
    # path = datasets.download_data(MNIST_URL, ext='.gz')
    path = '/content/mnist.pkl.gz'
    with gzip.open(path, 'rb') as f:
        ((x_train, y_train), (x_valid, y_valid), _) = pickle.load(f, encoding='latin-1')
    return map(tensor, (x_train,y_train,x_valid,y_valid))

- get the data and bring it to a normalized scale
- dataset class, which supports slicing and reports its length
- databunch instance, which provides 1) the dataloaders and 2) the number of classes
    1. dataloader
        - yields training data in random order (when shuffle=True)
        - if tensors don't have the same length, pad at the start or end (e.g. pytorch [pad_packed_sequence](https://pytorch.org/docs/stable/generated/torch.nn.utils.rnn.pad_packed_sequence.html))
- callback functions
    - recorder, which records learning rate and loss
    - avgstats, which records total loss and batch count
    - cuda: move data to the GPU before each batch
    - unflatten image

- define convolution channels 
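The dataset and dataloader items in the list above can be sketched without any library. `TinyDataset` and `batches` are hypothetical names chosen for this illustration; they are not the `Dataset` and `get_dls` imported from `exp.nb_06`, and padding is left out.

```python
import random

class TinyDataset:
    """Hypothetical stand-in for a dataset: supports len() and indexing/slicing."""
    def __init__(self, x, y):
        self.x, self.y = x, y
    def __len__(self):
        return len(self.x)
    def __getitem__(self, i):
        return self.x[i], self.y[i]

def batches(ds, bs, shuffle=False, seed=0):
    """Yield (x, y) mini-batches, visiting samples in random order when shuffle=True."""
    idxs = list(range(len(ds)))
    if shuffle:
        random.Random(seed).shuffle(idxs)
    for start in range(0, len(idxs), bs):
        chunk = idxs[start:start + bs]
        yield [ds.x[i] for i in chunk], [ds.y[i] for i in chunk]

ds = TinyDataset(list(range(10)), [v % 2 for v in range(10)])
print(len(ds), ds[2:5])
for xb, yb in batches(ds, bs=4, shuffle=True):
    print(xb, yb)
```

Shuffling only the index list keeps inputs and labels aligned, and the last batch is simply shorter when the dataset size is not a multiple of the batch size.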


In [5]:
??AvgStats
??AvgStatsCallback

In [6]:
x_train, y_train, x_valid, y_valid = get_data()
x_train, x_valid = normalize_to(x_train, x_valid)

train_ds, valid_ds = Dataset(x_train, y_train), Dataset(x_valid, y_valid)
nh, bs = 50, 512
c = y_train.max().item()+1
loss_func = F.cross_entropy

data = DataBunch(*get_dls(train_ds, valid_ds, bs), c)
mnist_view = view_tfm(1, 28, 28)
cbfs = [Recorder,
        partial(AvgStatsCallback, accuracy),
        # CudaCallback,
        partial(BatchTransformXCallback, mnist_view)]
nfs = [8, 16, 32, 64, 64]
learn, run = get_learn_run(nfs, data, 0.4, conv_layer, cbs=cbfs)

%time run.fit(2, learn)

train: [1.39714296875, tensor(0.5276)]
valid: [0.2345330322265625, tensor(0.9307)]
train: [0.1934852734375, tensor(0.9416)]
valid: [0.1346829345703125, tensor(0.9600)]
CPU times: user 22.9 s, sys: 600 ms, total: 23.5 s
Wall time: 12.2 s


🎮 Q3: Implement customized batch norm, and plot activations’ mean and std.
- Caution: during training it uses the batch's mean and variance, while at inference time it uses their exponentially weighted moving averages.
- Caution: batch norm broadcasts its statistics across the batch, so the shape of the normalizing statistics differs depending on the type of normalization.
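To make the second caution concrete, the sketch below prints the shape of the mean for three kinds of normalization. The tensor sizes are made up; only the choice of `dim` matters.

```python
import torch

x = torch.randn(20, 6, 10, 10)  # (batch, channels, height, width), made-up sizes

bn_mean = x.mean(dim=(0, 2, 3), keepdim=True)  # batch norm: one statistic per channel
ln_mean = x.mean(dim=(1, 2, 3), keepdim=True)  # layer norm: one statistic per sample
in_mean = x.mean(dim=(2, 3), keepdim=True)     # instance norm: one per sample and channel

print(bn_mean.shape, ln_mean.shape, in_mean.shape)

# keepdim=True leaves size-1 axes in place, so each result broadcasts against x
centered = x - bn_mean
```

Because the reduced axes are kept with size 1, the subtraction works unchanged for all three variants.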

In [9]:
xb, yb = next(iter(data.train_dl))
xb = xb.view(-1, 1, 28, 28)

In [11]:
xb.mean(dim=(0, 2, 3), keepdim=True).shape

torch.Size([1, 1, 1, 1])

In [None]:
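# lerp needs an end tensor and a weight: start + weight * (end - start)
xb.lerp(torch.zeros_like(xb), 0.1).shape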

In [12]:
# customized batchnorm
class BatchNorm(nn.Module):
    def __init__(self, nf, mom=0.1, eps=1e-5):
        super().__init__()
        self.mom, self.eps = mom, eps

        # learnable scale and shift, one value per channel
        self.mult = nn.Parameter(torch.ones(nf, 1, 1))
        self.add  = nn.Parameter(torch.zeros(nf, 1, 1))
        # running statistics: saved with the model but not trained
        self.register_buffer('vars', torch.ones(1, nf, 1, 1))
        self.register_buffer('means', torch.zeros(1, nf, 1, 1))

    def update_stats(self, x):
        # per-channel statistics of the current batch
        m = x.mean(dim=(0, 2, 3), keepdim=True)
        v = x.var(dim=(0, 2, 3), keepdim=True)
        # exponentially weighted moving averages, used at inference time
        self.means.lerp_(m, self.mom)
        self.vars.lerp_(v, self.mom)
        return m, v

    def forward(self, x):
        if self.training:  # nn.Module flag, toggled by model.train() / model.eval()
            # under no_grad, m and v are treated as constants in the backward pass
            with torch.no_grad(): # TODO: where do we need this, and where not?
                m, v = self.update_stats(x)
        else:
            m, v = self.means, self.vars
        x = (x - m) / (v+self.eps).sqrt()
        return x * self.mult + self.add

In [13]:
def conv_layer(ni, nf, ks=3, stride=2, bn=True, **kwargs):
    layers = [nn.Conv2d(ni, nf, ks, padding=ks//2, stride=stride, bias = not bn), GeneralRelu(**kwargs)]
    if bn: layers.append(BatchNorm(nf))
    return nn.Sequential(*layers)

In [15]:
def init_cnn(m, uniform=False):
    f = init.kaiming_uniform_ if uniform else init.kaiming_normal_
    init_cnn_(m, f)
def init_cnn_(m, f):
    if isinstance(m, nn.Conv2d):
        # the actual initialization happens here
        f(m.weight, a=0.1)
        # when a bias exists (i.e. no batchnorm) make sure it's zero
        if getattr(m, 'bias', None) is not None: m.bias.data.zero_()
    for layer in m.children(): init_cnn_(layer, f)
def get_learn_run(nfs, data, lr, layer, cbs=None, opt_func = None, uniform = False, **kwargs):
    model = get_cnn_model(data, nfs, layer, **kwargs)
    init_cnn(model, uniform = uniform)
    return get_runner(model, data, lr=lr, cbs=cbs, opt_func=opt_func)

In [17]:
learn, run = get_learn_run(nfs, data, 0.9, conv_layer, cbs=cbfs)

- plot activations



🎮 Q4: Use built-in batchnorm of pytorch
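One possible starting point for Q4 is sketched below. `conv_bn_layer` is a hypothetical name, and plain `nn.ReLU` is swapped in for the course's `GeneralRelu` so that the block runs on its own; `nn.BatchNorm2d` is the built-in PyTorch layer.

```python
import torch
from torch import nn

def conv_bn_layer(ni, nf, ks=3, stride=2):
    """Hypothetical variant of conv_layer using nn.BatchNorm2d and plain nn.ReLU."""
    return nn.Sequential(
        # no conv bias: the batchnorm layer already supplies a learnable shift
        nn.Conv2d(ni, nf, ks, padding=ks // 2, stride=stride, bias=False),
        nn.ReLU(),
        nn.BatchNorm2d(nf, eps=1e-5, momentum=0.1),
    )

layer = conv_bn_layer(1, 8)
x = torch.randn(16, 1, 28, 28)  # made-up MNIST-sized batch

layer.train()
out = layer(x)       # uses batch statistics and updates running_mean / running_var
print(out.shape, layer[2].running_mean.shape)

layer.eval()
out_eval = layer(x)  # uses the stored running statistics instead
```

The built-in layer keeps its running statistics as buffers named `running_mean` and `running_var`, which correspond to `means` and `vars` in the custom class.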

🎮 Q5: Add scheduler and train
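For Q5, the idea of a schedule that warms the learning rate up and then decays it can be sketched in plain Python. `sched_lin` and `combine` below are written from scratch for this illustration, the learning-rate values are made up, and wiring the schedule into the training loop is left to the course's own callbacks.

```python
def sched_lin(start, end):
    """Return a schedule mapping training progress pos in [0, 1] to a value."""
    return lambda pos: start + pos * (end - start)

def combine(pcts, scheds):
    """Chain schedules; pcts are the fractions of training given to each one."""
    bounds = [0.0]
    for p in pcts:
        bounds.append(bounds[-1] + p)
    def inner(pos):
        for i, sched in enumerate(scheds):
            # pick the phase containing pos; the last phase also covers pos == 1
            if pos < bounds[i + 1] or i == len(scheds) - 1:
                span = bounds[i + 1] - bounds[i]
                return sched((pos - bounds[i]) / span)
    return inner

# warm up from 0.6 to 2.0 over the first 30% of training, then decay to 0.1
lr_at = combine([0.3, 0.7], [sched_lin(0.6, 2.0), sched_lin(2.0, 0.1)])
for pos in (0.0, 0.15, 0.3, 0.65, 1.0):
    print(pos, round(lr_at(pos), 3))
```

Expressing the schedule as a function of overall progress keeps it independent of the number of epochs and batches.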

📝 Q6: Explain difference between batchnorm and layernorm. Implement Layernorm class.
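A minimal sketch for the implementation part of Q6, assuming 4D image activations and a single scalar scale and shift (both are assumptions of this sketch, not requirements of layer norm):

```python
import torch
from torch import nn

class LayerNorm(nn.Module):
    """Sketch: normalize each sample over (channels, height, width)."""
    def __init__(self, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.mult = nn.Parameter(torch.tensor(1.0))  # single scalar scale (assumption)
        self.add = nn.Parameter(torch.tensor(0.0))   # single scalar shift (assumption)

    def forward(self, x):
        m = x.mean(dim=(1, 2, 3), keepdim=True)  # per-sample mean, shape (N, 1, 1, 1)
        v = x.var(dim=(1, 2, 3), keepdim=True)   # per-sample variance
        x = (x - m) / (v + self.eps).sqrt()
        return x * self.mult + self.add

x = torch.randn(4, 6, 10, 10) * 3 + 2  # made-up activations with mean 2 and std 3
out = LayerNorm()(x)
print(out.mean(dim=(1, 2, 3)), out.std(dim=(1, 2, 3)))
```

Unlike the batch norm class, this version keeps no running statistics and never looks across the batch axis, so it behaves the same in training and at inference time.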

📝 🎮 Q7: Implement an InstanceNorm class. Why do you think a model trained with instance norm cannot work well as a classification model?

🎮 Q8: GroupNorm: initialize activations with N=20, channels=6, height=width=10, then 1) split the 6 channels into 3 groups, 2) split the 6 channels into 6 groups (instance norm), and 3) put all 6 channels into a single group (layer norm).
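The three configurations in Q8 can be set up with PyTorch's built-in `nn.GroupNorm`, as in this sketch (the input is random data of the requested shape):

```python
import torch
from torch import nn

x = torch.randn(20, 6, 10, 10)  # N=20, C=6, H=W=10

gn3 = nn.GroupNorm(num_groups=3, num_channels=6)  # 3 groups of 2 channels
gn6 = nn.GroupNorm(num_groups=6, num_channels=6)  # 6 groups of 1 channel: instance-norm-like
gn1 = nn.GroupNorm(num_groups=1, num_channels=6)  # 1 group of 6 channels: layer-norm-like

for name, gn in [("3 groups", gn3), ("6 groups", gn6), ("1 group", gn1)]:
    print(name, gn(x).shape)
```

The output shape is the same in all three cases; what changes is the set of elements that share one mean and variance.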

📝 🎮 Q9: Fastai introduces a RunningBatchNorm class for use with small batch sizes. Implement it and explain why you think small batch sizes cause problems.
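One way to start on the opinion part of Q9 is to measure how noisy a per-batch variance estimate is at different batch sizes. The sketch below uses simulated standard-normal activations and does not implement RunningBatchNorm itself.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

def spread_of_batch_variance(bs, trials=500):
    """Std of the per-batch variance estimate when activations are N(0, 1)."""
    estimates = []
    for _ in range(trials):
        batch = [rng.gauss(0.0, 1.0) for _ in range(bs)]
        estimates.append(statistics.variance(batch))  # unbiased sample variance
    return statistics.pstdev(estimates)

small = spread_of_batch_variance(2)
large = spread_of_batch_variance(64)
print(f"bs=2: {small:.3f}  bs=64: {large:.3f}")
```

With only a couple of samples per batch the variance estimate swings widely from batch to batch, and dividing activations by such an unstable quantity is one plausible reason training becomes erratic.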