Table of contents
1. Initial Setup
2. Basic Training Loop
3. Using parameters and optim
4. Dataset and DataLoader
5. Validation

# 1. Initial Setup

## Data

### Fastai for Colab env

In [1]:
!git clone http://github.com/fastai/course-v3.git

Cloning into 'course-v3'...
remote: Enumerating objects: 5498, done.
remote: Total 5498 (delta 0), reused 0 (delta 0), pack-reused 5498
Receiving objects: 100% (5498/5498), 258.00 MiB | 35.59 MiB/s, done.
Resolving deltas: 100% (2992/2992), done.


In [2]:
%cd course-v3/nbs/dl2

/content/course-v3/nbs/dl2


In [3]:
from exp.nb_02 import *

In [4]:
!cat exp/nb_02.py


#################################################
### THIS FILE WAS AUTOGENERATED! DO NOT EDIT! ###
#################################################
# file to edit: dev_nb/02_fully_connected.ipynb

from exp.nb_01 import *

def get_data():
    path = datasets.download_data(MNIST_URL, ext='.gz')
    with gzip.open(path, 'rb') as f:
        ((x_train, y_train), (x_valid, y_valid), _) = pickle.load(f, encoding='latin-1')
    return map(tensor, (x_train,y_train,x_valid,y_valid))

def normalize(x, m, s): return (x-m)/s

def test_near_zero(a,tol=1e-3): assert a.abs()<tol, f"Near zero: {a}"

from torch.nn import init

def mse(output, targ): return (output.squeeze(-1) - targ).pow(2).mean()

from torch import nn

### import data

In [5]:
train_x, train_y, valid_x, valid_y = get_data()

Downloading http://deeplearning.net/data/mnist/mnist.pkl.gz


In [6]:
train_x.shape, train_x.type()

(torch.Size([50000, 784]), 'torch.FloatTensor')

In [7]:
n_in, nh, n_out = train_x.shape[1], 32, train_y.max().item()+1; n_in, nh, n_out

(784, 32, 10)

[^1]

In [8]:
class Model(nn.Module):
    def __init__(self, n_in, nh, n_out):
        super().__init__()
        self.layers = [nn.Linear(n_in, nh), nn.ReLU(), nn.Linear(nh, n_out)]
    def __call__(self, x):
        for l in self.layers: x = l(x)
        return x
    # def forward(self, x, y):
    #     for l in self.layers: x = l(x)
    #     return(self.loss(x, y))

In [9]:
m = Model(n_in, nh, n_out)

In [10]:
y_hat = m(train_x)

## Loss Function

### Softmax

* Softmax formula: $\text{softmax}(x)_i = \frac{\exp(x_i)}{\sum_j \exp(x_j)}$
* We need softmax at the last layer to map the raw outputs to probabilities: each value lies between 0 and 1, and each row sums to 1
* The code for softmax is below

In [11]:
def softmax(x): return x.exp() / x.exp().sum(-1, keepdim=True)
# def softmax(x): return x.exp() / x.exp().sum()

[^2]

In [12]:
p = softmax(y_hat); p[0].sum()

tensor(1., grad_fn=<SumBackward0>)

- We need the *log* of softmax because we will use cross-entropy, which has the form $H(X, q) = H(X) + D(p \| q) = -\sum_{x} p(x) \log q(x)$ [^3]

In [13]:
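# The final .log() matters: without it this function is identical to softmax,
# which is what made the test_near check in cell 18 raise the AssertionError shown there.
def log_softmax(x): return (x.exp() / x.exp().sum(-1, keepdim=True)).log()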

In [14]:
a = log_softmax(y_hat)

### Cross Entropy

- See footnote 3 for further information on cross-entropy
- One possible reason: information theory is built on discrete, determinate values
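
A short worked step, assuming the target distribution $p$ is one-hot (all probability mass on the true class $y$, as with the MNIST labels here), suggests why only the log-probability of the correct class ends up in the loss:

$$
H(X, q) = -\sum_{x} p(x) \log q(x), \qquad
p(x) = \begin{cases} 1 & x = y \\ 0 & x \neq y \end{cases}
\;\;\Longrightarrow\;\;
H(X, q) = -\log q(y)
$$

Averaged over a batch of $n$ samples this becomes $\frac{1}{n} \sum_{i=1}^{n} -\log q_i(y_i)$, which is the negative log likelihood defined in the next section.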

### Negative log likelihood function

In [15]:
def nll(x, y): return -x[range(x.shape[0]), y].mean()  # x: log-probabilities (predictions), y: target class indices

In [16]:
nll(y_hat, train_y)

tensor(0.0017, grad_fn=<NegBackward>)

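- The negative sign comes from entropy: an outcome with probability $p$ carries information $-\log p$, so, roughly, rarer outcomes are more informative [^4]
- `nll` expects log-probabilities, so the meaningful call is `nll(log_softmax(y_hat), train_y)`; the value above was computed from the raw outputs `y_hat`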
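
To see what the indexing `x[range(n), y]` picks out, here is a plain-Python sketch on a made-up batch of 3 samples and 3 classes. The numbers and the helper name `nll_plain` are illustrative assumptions, not part of the notebook. For each row it takes the log-probability at the target index, then averages and negates.

```python
import math

def nll_plain(log_probs, targets):
    """Mean negative log-likelihood; log_probs is a list of rows of log-probabilities."""
    # pick row[i][targets[i]] for every row, the list equivalent of x[range(n), y]
    picked = [row[t] for row, t in zip(log_probs, targets)]
    return -sum(picked) / len(picked)

# Hypothetical predicted probabilities for 3 samples over 3 classes
probs = [[0.7, 0.2, 0.1],
         [0.1, 0.8, 0.1],
         [0.3, 0.3, 0.4]]
log_probs = [[math.log(p) for p in row] for row in probs]
targets = [0, 1, 2]

loss = nll_plain(log_probs, targets)
print(round(loss, 4))
```

The loss only depends on the probabilities assigned to the correct classes (0.7, 0.8 and 0.4 here); the other entries in each row do not enter the sum.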

### LogSumExp

- Further tasks (after reading some refs) [^5]
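
One reason to prefer LogSumExp, sketched here in plain Python: exponentiating large values overflows, whereas subtracting the maximum first keeps every exponent at or below zero. The identity used is $\log \sum_j e^{x_j} = m + \log \sum_j e^{x_j - m}$ with $m = \max_j x_j$. The function names and input values are illustrative assumptions; the notebook itself relies on PyTorch's `logsumexp`.

```python
import math

def logsumexp(xs):
    """Stable log(sum(exp(x))) using the max-subtraction trick."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def log_softmax_plain(xs):
    """log_softmax(x)_i = x_i - logsumexp(x)."""
    lse = logsumexp(xs)
    return [x - lse for x in xs]

logits = [1000.0, 1001.0, 1002.0]  # made-up values, large enough to overflow exp()

try:
    naive = math.log(sum(math.exp(x) for x in logits))
except OverflowError:
    naive = None  # math.exp(1000.0) is out of range for a float

stable = log_softmax_plain(logits)
total_prob = sum(math.exp(v) for v in stable)  # should be 1 up to rounding
```

The naive version fails on these inputs, while the stable version returns finite log-probabilities that exponentiate back to a distribution summing to 1.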

In [17]:
def log_softmax(x): return x - x.logsumexp(-1, keepdim=True)

In [18]:
test_near(log_softmax(y_hat), a)

AssertionError: ignored

# 2. Basic Training Loop

### 1.  

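First, get the predictions (i.e. the output) from the model<br/>
Second, compute the loss from the predictions and the targets<br/>
Third, compute the gradients of the loss with respect to the parameters<br/>
Fourth, update the parameters using the gradients and the hyperparameters (e.g. the learning rate)<br/>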
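
The four steps can be sketched without any framework. The toy below fits a single weight `w` to the made-up data `y = 2x` using mean squared error, with the gradient written out by hand. The data, the learning rate and all names are assumptions for illustration only; the notebook's own loop uses PyTorch autograd instead.

```python
# Toy data following y = 2 * x (hypothetical)
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

w = 0.0     # the single parameter
lr = 0.05   # learning rate (hyperparameter)
losses = []

for step in range(50):
    # 1. get the predictions from the model
    preds = [w * x for x in xs]
    # 2. compute the loss from predictions and targets (mean squared error)
    loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs)
    losses.append(loss)
    # 3. compute the gradient: d/dw mean((w*x - y)^2) = mean(2 * x * (w*x - y))
    grad = sum(2 * x * (p - y) for x, p, y in zip(xs, preds, ys)) / len(xs)
    # 4. update the parameter using the gradient and the learning rate
    w -= lr * grad

print(w, losses[0], losses[-1])
```

With these values the error in `w` shrinks by a constant factor each step, so `w` approaches 2 and the loss approaches 0.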

### 2.

In [43]:
def accuracy(out, trg): return (torch.argmax(out, dim=-1) == trg).float().mean()
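
As a sanity check on what `accuracy` computes, here is a plain-Python sketch on a made-up batch: take the index of the largest score in each row (the argmax) and count how often it equals the target. The helper name `accuracy_plain` and the numbers are illustrative assumptions.

```python
def accuracy_plain(outputs, targets):
    """Fraction of rows whose largest score sits at the target index."""
    # argmax per row: the index whose value is largest
    preds = [max(range(len(row)), key=row.__getitem__) for row in outputs]
    return sum(p == t for p, t in zip(preds, targets)) / len(targets)

outputs = [[0.1, 2.0, -1.0],   # argmax 1
           [1.5, 0.2, 0.3],    # argmax 0
           [0.0, 0.1, 0.2],    # argmax 2
           [0.9, 0.8, 0.7]]    # argmax 0
targets = [1, 0, 1, 0]

acc = accuracy_plain(outputs, targets)  # 3 of the 4 rows match their target
print(acc)
```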

### 3.

In [26]:
bs = 64

In [27]:
xb, yb = train_x[:bs], train_y[:bs]

In [28]:
xb.shape, yb.shape

(torch.Size([64, 784]), torch.Size([64]))

In [34]:
m = Model(n_in, nh, n_out); pred = m(xb)

In [44]:
accuracy(pred, yb)

tensor(0.0781)

In [46]:
del m, pred, xb, yb
import gc
gc.collect()

600

### 4.

In [47]:
n_in, nh, n_out

(784, 32, 10)

In [57]:
epochs = 1
lr = 0.5

In [48]:
n = train_x.shape[0]

In [49]:
import torch.nn.functional as F
loss_fun = F.cross_entropy

In [58]:
def fit():
    # create the model once, so its parameters and gradients persist across batches
    model = Model(n_in, nh, n_out)
    for epoch in range(epochs):
        for i in range((n-1)//bs +1): # [^6]
            srt, end = bs*i, bs*(i+1)
            xb, yb = train_x[srt:end], train_y[srt:end]
            loss = loss_fun(model(xb), yb)
            loss.backward()
            with torch.no_grad(): # [^8]
                for l in model.layers:
                    if hasattr(l, 'weight'): # [^7]
                        l.weight -= l.weight.grad * lr
                        l.bias -= l.bias.grad * lr
                        l.weight.grad.zero_() # [^8]
                        l.bias.grad.zero_()
    return model
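
The loop bound `(n-1)//bs + 1` used above is a ceiling division: it counts the final, possibly shorter, batch without adding an empty one when `n` is an exact multiple of `bs`. A small plain-Python check follows; the helper `batch_slices` is a hypothetical name, and `min` makes explicit the clamping that tensor slicing does implicitly.

```python
def batch_slices(n, bs):
    """Return (start, end) index pairs covering range(n) in batches of size bs."""
    n_batches = (n - 1) // bs + 1          # ceil(n / bs) for n >= 1
    return [(bs * i, min(bs * (i + 1), n)) for i in range(n_batches)]

# 50000 samples with bs=64: 781 full batches plus one final batch of 16
slices = batch_slices(50000, 64)
print(len(slices), slices[-1])

# exact multiple of the batch size: no extra, empty batch
even = batch_slices(128, 64)
print(even)
```

Using `n//bs + 1` instead would give 3 batches for `n=128, bs=64`, the last of them empty.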

In [59]:
m=fit()

In [61]:
xb, yb = train_x[:64], train_y[:64]

In [62]:
loss_fun(m(xb), yb), accuracy(m(xb), yb)

(tensor(0.1746, grad_fn=<NllLossBackward>), tensor(0.9375))

# 3. Using parameters and optim

## Parameters

### 1. Re-define model using nn.Module

In [67]:
# [^10]

class Model(nn.Module):
    def __init__(self, n_in, nh, n_out):
        super().__init__()
        self.l1 = nn.Linear(n_in, nh)
        self.l2 = nn.Linear(nh, n_out)
    def __call__(self, inp):
        return self.l2(F.relu(self.l1(inp)))

In [68]:
model = Model(n_in, nh, n_out)

In [69]:
model(train_x)

tensor([[-0.1136, -0.1931,  0.1854,  ..., -0.0405, -0.1048, -0.1394],
        [-0.0593, -0.1301,  0.2580,  ..., -0.0151, -0.0813, -0.0631],
        [-0.0358,  0.0146,  0.0793,  ..., -0.0828, -0.0170, -0.2184],
        ...,
        [-0.0436,  0.0086,  0.1209,  ..., -0.0345,  0.0280, -0.2004],
        [-0.0250,  0.0863,  0.0958,  ..., -0.0683, -0.0426, -0.2373],
        [-0.0649,  0.0403,  0.0983,  ..., -0.1354, -0.0169, -0.1508]],
       grad_fn=<AddmmBackward>)

### 2. nn.Module.named_children

In [70]:
[f"{name}: {layer}" for name, layer in model.named_children()]

['l1: Linear(in_features=784, out_features=32, bias=True)',
 'l2: Linear(in_features=32, out_features=10, bias=True)']

In [71]:
model.l1

Linear(in_features=784, out_features=32, bias=True)

In [73]:
dir(model.l1)

['__call__',
 '__class__',
 '__constants__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattr__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_apply',
 '_backward_hooks',
 '_buffers',
 '_forward_hooks',
 '_forward_pre_hooks',
 '_get_name',
 '_load_from_state_dict',
 '_load_state_dict_pre_hooks',
 '_modules',
 '_named_members',
 '_parameters',
 '_register_load_state_dict_pre_hook',
 '_register_state_dict_hook',
 '_replicate_for_data_parallel',
 '_save_to_state_dict',
 '_slow_forward',
 '_state_dict_hooks',
 '_version',
 'add_module',
 'apply',
 'bfloat16',
 'bias',
 'buffers',
 'children',
 'cpu',
 'cuda',
 'double',
 'dump_patches',
 'eval',
 'extra_repr',
 'float',
 'forward',
 'half',
 'in_featu

In [76]:
help(model.l1.__setattr__)

Help on method __setattr__ in module torch.nn.modules.module:

__setattr__(name, value) method of torch.nn.modules.linear.Linear instance
    Implement setattr(self, name, value).



### 3. fit function, second version

In [80]:
def fit():
    model = Model(n_in, nh, n_out)
    for epoch in range(epochs):
        for i in range((n-1)//bs +1):
            srt, end = bs*i, bs*(i+1)
            xb, yb = train_x[srt:end], train_y[srt:end]
            loss = loss_fun(model(xb), yb)
            
            loss.backward()
            with torch.no_grad():
                for param in model.parameters(): param -= param.grad * lr # [^9]
                model.zero_grad()
    return model

- What changed, and why is it shorter?

1. This time we zero the gradients on the model as a whole, not on each layer
2. We no longer need the check for whether a layer has a weight (i.e. parameters), since `model.parameters()` already yields only the parameters

In [81]:
m = fit()
loss_fun(m(xb), yb), accuracy(m(xb), yb)

(tensor(0.1144, grad_fn=<NllLossBackward>), tensor(0.9688))

### DummyModule class to simulate pytorch's `__setattr__`
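
A minimal sketch of the idea, assuming the goal is to show how an assignment such as `self.l1 = ...` can register a submodule automatically. `TinyLayer` is a hypothetical stand-in for `nn.Linear` so that the example runs without PyTorch; the real `nn.Module.__setattr__` handles more cases (parameters, buffers and so on).

```python
class TinyLayer:
    """Hypothetical stand-in for nn.Linear: holds a weight and a bias."""
    def __init__(self, n_in, n_out):
        self.weight = [[0.0] * n_in for _ in range(n_out)]
        self.bias = [0.0] * n_out

    def parameters(self):
        yield self.weight
        yield self.bias

class DummyModule:
    def __init__(self, n_in, nh, n_out):
        self._modules = {}              # must exist before any public attribute is set
        self.l1 = TinyLayer(n_in, nh)
        self.l2 = TinyLayer(nh, n_out)

    def __setattr__(self, k, v):
        # record every public attribute as a submodule, then store it normally
        if not k.startswith("_"):
            self._modules[k] = v
        super().__setattr__(k, v)

    def parameters(self):
        # walk the registered submodules and yield each of their parameters
        for layer in self._modules.values():
            yield from layer.parameters()

mdl = DummyModule(784, 32, 10)
print(list(mdl._modules))             # ['l1', 'l2']
print(len(list(mdl.parameters())))    # 4: two weights and two biases
```

Because registration happens inside `__setattr__`, the training loop can iterate over `parameters()` without knowing the layer names, which is the behavior the second `fit` function relies on.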

In [82]:
## Registering modules
## nn.ModuleList
## nn.Sequential
## Optim

# Notes

[^1]: I no longer have to define the parameters by hand. At first, though, I tried to initialize `Model` with `w1, b1, ...`, which means I hadn't practiced the previous part enough (the `torch.nn` part, where the parameters come from)

[^2]: Be careful: `train_x.sum(-1)` drops the reduced dimension (the rank goes down by one), so you need the argument `keepdim: bool`, which defaults to `False`

[^3]: See *eq (2.46)*, Foundations of Statistical Natural Language Processing, Christopher D. Manning and Hinrich Schütze

[^4]: Should verify that cross-entropy equals the negative log likelihood applied to the log-softmax output

[^5]: check later

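[^6]: Be careful to subtract 1 from the dataset size before the integer division; otherwise, when the dataset size is exactly divisible by the batch size, the loop would run one extra, empty batch<br/>
[^7]: Study `hasattr()` further<br/>
[^8]: 1) Careful: these are in-place operations / 2) I should learn more about why this is needed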

[^9]: `parameters` is a method, whereas `layers` was an attribute. This time we zero the gradients on the model, not on each layer

[^10]: Things that happen when you don't specify inheritance