# Part 2.3. Weight Initialization

In [0]:
import torch
import torchvision.datasets as dsets
import torchvision.transforms as transforms
import random

In [0]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# for reproducibility
random.seed(777)
torch.manual_seed(777)
if device == 'cuda':
    torch.cuda.manual_seed_all(777)

## 1. Why Good Initialization

![10-1.png](./img/10-1.png)

As the figure shows, model performance varies depending on how the weights are initialized.
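To make the effect concrete, here is a minimal sketch, not taken from the lecture, that pushes random inputs through a stack of `tanh` layers and compares how large the activations are at the last layer under two initializations. The helper `last_layer_activation_scale`, the depth, the width, and the `std=0.01` baseline are all assumptions chosen for illustration.

```python
import torch

torch.manual_seed(0)

def last_layer_activation_scale(init_fn, depth=10, width=256):
    """Mean |activation| after `depth` tanh layers initialized with `init_fn` (hypothetical helper)."""
    x = torch.randn(512, width)
    for _ in range(depth):
        layer = torch.nn.Linear(width, width, bias=False)
        init_fn(layer.weight)              # initialize the weights in place
        with torch.no_grad():
            x = torch.tanh(layer(x))
    return x.abs().mean().item()

# Baseline (assumption): small Gaussian weights with a fixed std of 0.01
small_scale = last_layer_activation_scale(
    lambda w: torch.nn.init.normal_(w, std=0.01))

# Xavier uniform, which scales the weights by the layer size
xavier_scale = last_layer_activation_scale(torch.nn.init.xavier_uniform_)

print('small-std init :', small_scale)
print('xavier init    :', xavier_scale)
```

With the fixed small standard deviation the signal shrinks at every layer and is close to zero by the last one, while the Xavier-initialized stack keeps activations at a usable scale.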

## 2. RBM and DBM

![10-2.png](./img/10-2.png)

An RBM (Restricted Boltzmann Machine) is a structure in which, as shown above, nodes within the same layer are not connected to each other, while nodes in adjacent layers are fully connected.

![10-3.png](./img/10-3.png)

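Pre-training with RBMs proceeds one layer at a time: stack a layer, train it as an RBM, fix the weights of that lower layer, then stack another layer on top and repeat. After pre-training, the actual training (fine-tuning) is run to obtain the final weights.

However, this approach is rarely used today.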
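For reference, the greedy layer-wise procedure described above can be sketched roughly as follows. This is a simplified illustration under several assumptions: the `TinyRBM` class, the one-step contrastive divergence (CD-1) update, the toy binary data, and the layer sizes are hypothetical and are not part of the original notebook.

```python
import torch

torch.manual_seed(0)

class TinyRBM:
    """Minimal Bernoulli RBM trained with one step of contrastive divergence (CD-1)."""

    def __init__(self, n_visible, n_hidden):
        self.W = torch.randn(n_visible, n_hidden) * 0.01
        self.b_v = torch.zeros(n_visible)
        self.b_h = torch.zeros(n_hidden)

    def hidden_prob(self, v):
        return torch.sigmoid(v @ self.W + self.b_h)

    def visible_prob(self, h):
        return torch.sigmoid(h @ self.W.t() + self.b_v)

    def cd1_step(self, v0, lr=0.1):
        h0 = self.hidden_prob(v0)                       # positive phase
        v1 = self.visible_prob(torch.bernoulli(h0))     # reconstruction
        h1 = self.hidden_prob(v1)                       # negative phase
        batch = v0.shape[0]
        self.W += lr * (v0.t() @ h0 - v1.t() @ h1) / batch
        self.b_v += lr * (v0 - v1).mean(dim=0)
        self.b_h += lr * (h0 - h1).mean(dim=0)

# Toy binary data standing in for real inputs (assumption)
data = torch.bernoulli(torch.full((64, 20), 0.5))

layer_sizes = [20, 12, 8]
rbms = []
layer_input = data
for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    rbm = TinyRBM(n_in, n_out)
    for _ in range(10):
        rbm.cd1_step(layer_input)
    rbms.append(rbm)                             # this layer's weights stay fixed from here on
    layer_input = rbm.hidden_prob(layer_input)   # its activations feed the next RBM

# Copy the pre-trained weights into Linear layers as the starting point for fine-tuning
linears = []
for rbm in rbms:
    lin = torch.nn.Linear(rbm.W.shape[0], rbm.W.shape[1])
    with torch.no_grad():
        lin.weight.copy_(rbm.W.t())   # Linear stores weights as (out, in)
        lin.bias.copy_(rbm.b_h)
    linears.append(lin)
```

The resulting `linears` would then be combined into a network and trained with ordinary backpropagation, which corresponds to the fine-tuning step.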

## 3. Xavier and He Initialization
The weights are initialized according to the shape of each layer. Here $n_{in}$ is the number of inputs to the layer and $n_{out}$ is the number of outputs.



### 3.1. Xavier Initialization
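#### Xavier Normal Initialization
The initial values are drawn from a normal distribution with the variance below (equivalently, standard deviation $\sqrt{\frac{2}{n_{in}+n_{out}}}$).

$$W \sim N(0, Var(W))$$
$$Var(W) = \frac{2}{n_{in}+n_{out}}$$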

#### Xavier Uniform Initialization
The initial values are drawn from a continuous uniform distribution.

$$W \sim U(-\sqrt{\frac{6}{n_{in}+n_{out}}}, +\sqrt{\frac{6}{n_{in}+n_{out}}})$$
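As a quick sanity check, the Xavier formulas can be compared against what `torch.nn.init.xavier_uniform_` and `torch.nn.init.xavier_normal_` produce. The sketch below assumes a 784-to-256 layer and the default gain of 1; the variable names are illustrative only.

```python
import math
import torch

torch.manual_seed(0)

n_in, n_out = 784, 256
layer = torch.nn.Linear(n_in, n_out)

# Xavier uniform: every weight should lie within +/- sqrt(6 / (n_in + n_out))
xavier_bound = math.sqrt(6.0 / (n_in + n_out))
torch.nn.init.xavier_uniform_(layer.weight)
xavier_max_abs = layer.weight.abs().max().item()

# Xavier normal: the sample std should be close to sqrt(2 / (n_in + n_out))
xavier_std = math.sqrt(2.0 / (n_in + n_out))
torch.nn.init.xavier_normal_(layer.weight)
xavier_empirical_std = layer.weight.std().item()

print('uniform bound :', xavier_bound, ' max |w| :', xavier_max_abs)
print('normal std    :', xavier_std, ' sample std :', xavier_empirical_std)
```

Both variants target the same variance: a uniform distribution on $[-a, a]$ has variance $a^2/3$, which equals $\frac{2}{n_{in}+n_{out}}$ for the bound above.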


### 3.2. He Initialization
Unlike Xavier initialization, He initialization does not use the number of outputs $n_{out}$. Otherwise it has the same form as Xavier.

#### He Normal Initialization
$$W \sim N(0, Var(W))$$
$$Var(W) = \frac{2}{n_{in}}$$

#### He Uniform Initialization

$$W \sim U(-\sqrt{\frac{6}{n_{in}}}, +\sqrt{\frac{6}{n_{in}}})$$
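In PyTorch, He initialization is exposed under the name Kaiming: `torch.nn.init.kaiming_uniform_` and `torch.nn.init.kaiming_normal_`. The sketch below assumes a 256-to-256 layer with `mode='fan_in'` and `nonlinearity='relu'`, and checks the results against the formulas above; the variable names are illustrative only.

```python
import math
import torch

torch.manual_seed(0)

n_in, n_out = 256, 256
layer = torch.nn.Linear(n_in, n_out)

# He uniform: every weight should lie within +/- sqrt(6 / n_in)
he_bound = math.sqrt(6.0 / n_in)
torch.nn.init.kaiming_uniform_(layer.weight, mode='fan_in', nonlinearity='relu')
he_max_abs = layer.weight.abs().max().item()

# He normal: the sample std should be close to sqrt(2 / n_in)
he_std = math.sqrt(2.0 / n_in)
torch.nn.init.kaiming_normal_(layer.weight, mode='fan_in', nonlinearity='relu')
he_empirical_std = layer.weight.std().item()

print('uniform bound :', he_bound, ' max |w| :', he_max_abs)
print('normal std    :', he_std, ' sample std :', he_empirical_std)
```

Because the MNIST model below uses ReLU activations, swapping `xavier_uniform_` for one of these calls would be a natural variation to try.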

## 4. MNIST Example

In [0]:
# MNIST dataset
mnist_train = dsets.MNIST(root='MNIST_data/',
                          train=True,
                          transform=transforms.ToTensor(),
                          download=True)

mnist_test = dsets.MNIST(root='MNIST_data/',
                         train=False,
                         transform=transforms.ToTensor(),
                         download=True)

In [0]:
# parameters
learning_rate = 0.001
training_epochs = 15
batch_size = 100

In [0]:
# dataset loader
data_loader = torch.utils.data.DataLoader(dataset=mnist_train,
                                          batch_size=batch_size,
                                          shuffle=True,
                                          drop_last=True)

In [0]:
# nn layers
linear1 = torch.nn.Linear(784, 256, bias=True)
linear2 = torch.nn.Linear(256, 256, bias=True)
linear3 = torch.nn.Linear(256, 10, bias=True)
relu = torch.nn.ReLU()

In [22]:
# xavier uniform initialization
torch.nn.init.xavier_uniform_(linear1.weight)
torch.nn.init.xavier_uniform_(linear2.weight)
torch.nn.init.xavier_uniform_(linear3.weight)

Parameter containing:
tensor([[-0.0215, -0.0894,  0.0598,  ...,  0.0200,  0.0203,  0.1212],
        [ 0.0078,  0.1378,  0.0920,  ...,  0.0975,  0.1458, -0.0302],
        [ 0.1270, -0.1296,  0.1049,  ...,  0.0124,  0.1173, -0.0901],
        ...,
        [ 0.0661, -0.1025,  0.1437,  ...,  0.0784,  0.0977, -0.0396],
        [ 0.0430, -0.1274, -0.0134,  ..., -0.0582,  0.1201,  0.1479],
        [-0.1433,  0.0200, -0.0568,  ...,  0.0787,  0.0428, -0.0036]],
       requires_grad=True)
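When a model has many layers, calling the initializer on each one by hand gets repetitive. One alternative, sketched here as an assumption rather than as the notebook's approach, is to pass an initialization function to `Module.apply`, which visits every submodule. The helper `init_xavier` and the `demo_model` are hypothetical, and zeroing the biases is an extra choice that the cell above does not make.

```python
import math
import torch

def init_xavier(module):
    """Apply Xavier uniform init to every Linear layer (hypothetical helper)."""
    if isinstance(module, torch.nn.Linear):
        torch.nn.init.xavier_uniform_(module.weight)
        torch.nn.init.zeros_(module.bias)   # assumption: the original cell leaves biases at their defaults

# Same layer shapes as the notebook's model, built separately for illustration
demo_model = torch.nn.Sequential(
    torch.nn.Linear(784, 256), torch.nn.ReLU(),
    torch.nn.Linear(256, 256), torch.nn.ReLU(),
    torch.nn.Linear(256, 10),
)
demo_model.apply(init_xavier)

last_bound = math.sqrt(6.0 / (256 + 10))
print(demo_model[4].weight.abs().max().item() <= last_bound)
```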

In [0]:
# model
model = torch.nn.Sequential(linear1, relu, linear2, relu, linear3).to(device)

In [0]:
# define cost/loss & optimizer
criterion = torch.nn.CrossEntropyLoss().to(device)    # Softmax is internally computed.
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

In [25]:
# train
total_batch = len(data_loader)
for epoch in range(training_epochs):
    avg_cost = 0

    for X, Y in data_loader:
        # reshape input image into [batch_size by 784]
        # label is not one-hot encoded
        X = X.view(-1, 28 * 28).to(device)
        Y = Y.to(device)

        optimizer.zero_grad()
        hypothesis = model(X)
        cost = criterion(hypothesis, Y)
        cost.backward()
        optimizer.step()

        # accumulate as a Python float so the autograd graph is not kept alive
        avg_cost += cost.item() / total_batch

    print('Epoch:', '%04d' % (epoch + 1), 'cost =', '{:.9f}'.format(avg_cost))

print('Learning finished')

Epoch: 0001 cost = 0.249894276
Epoch: 0002 cost = 0.094109461
Epoch: 0003 cost = 0.060697529
Epoch: 0004 cost = 0.043816913
Epoch: 0005 cost = 0.031662427
Epoch: 0006 cost = 0.027006302
Epoch: 0007 cost = 0.021771355
Epoch: 0008 cost = 0.017671170
Epoch: 0009 cost = 0.016151655
Epoch: 0010 cost = 0.014032764
Epoch: 0011 cost = 0.013825929
Epoch: 0012 cost = 0.012349785
Epoch: 0013 cost = 0.012296054
Epoch: 0014 cost = 0.009266987
Epoch: 0015 cost = 0.009194419
Learning finished


In [26]:
# test
with torch.no_grad():
    # .data / .targets replace the deprecated .test_data / .test_labels aliases
    X_test = mnist_test.data.view(-1, 28 * 28).float().to(device)
    Y_test = mnist_test.targets.to(device)

    prediction = model(X_test)
    correct_prediction = torch.argmax(prediction, 1) == Y_test
    accuracy = correct_prediction.float().mean()
    print('Accuracy:', accuracy.item())

Accuracy: 0.9821000099182129
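One detail worth noting: the training batches pass through `transforms.ToTensor()`, which scales pixels to the range [0, 1], while the test cell above feeds the raw tensor with values from 0 to 255. A possible way to evaluate with the same preprocessing as training is to iterate over a `DataLoader`. The sketch below is an assumption, not the notebook's method: `evaluate` is a hypothetical helper, and synthetic tensors stand in for MNIST so the example runs without a download.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def evaluate(model, loader, device='cpu'):
    """Return classification accuracy over a loader of (image, label) batches."""
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for X, Y in loader:
            X = X.view(X.size(0), -1).to(device)   # flatten each image to 784 values
            Y = Y.to(device)
            pred = torch.argmax(model(X), dim=1)
            correct += (pred == Y).sum().item()
            total += Y.size(0)
    return correct / total

# Synthetic stand-in for the test set (assumption): 50 random "images" in [0, 1]
torch.manual_seed(0)
fake_images = torch.rand(50, 1, 28, 28)
fake_labels = torch.randint(0, 10, (50,))
fake_loader = DataLoader(TensorDataset(fake_images, fake_labels), batch_size=10)

toy_model = torch.nn.Linear(784, 10)
acc = evaluate(toy_model, fake_loader)
print('Accuracy:', acc)
```

With the real data, the loader would be built from `mnist_test` (for example `DataLoader(mnist_test, batch_size=batch_size)`) and the trained `model` passed in place of `toy_model`.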


