# PyTorch optimizer and loss function

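Once the network is built, we need a way to decide how its parameters should be updated. For that we set a criterion that measures the model's current performance: the loss function.

With an objective function in place, we can use a gradient descent algorithm (the optimizer) to update the network's parameters.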


## PyTorch Loss function

These are the most commonly used PyTorch built-in loss functions. Beyond these, most losses have to be custom-defined.

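- ```nn.L1Loss:``` L1 loss (mean absolute difference)
- ```nn.MSELoss:``` L2 loss (mean squared difference)
- ```nn.CrossEntropyLoss:``` cross entropy, for classification tasks (takes raw logits)
- ```nn.NLLLoss:``` negative log likelihood, for classification tasks (takes log-probabilities)
- ```nn.BCELoss:``` binary cross entropy, for binary classification (takes probabilities)
- ```nn.BCEWithLogitsLoss:``` Sigmoid combined with binary cross entropy (takes raw logits)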
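
As a minimal sketch of how the classification losses in this list relate to each other: `nn.CrossEntropyLoss` on raw logits should give the same value as `log_softmax` followed by `nn.NLLLoss`, and `nn.BCEWithLogitsLoss` should match `sigmoid` followed by `nn.BCELoss`. The tensors `logits`, `target`, `bin_logits` and `bin_target` are made-up illustration values.

```python
import torch
import torch.nn as nn

# Made-up example: 3 samples, 4 classes
logits = torch.tensor([[2.0, 0.5, -1.0, 0.1],
                       [0.2, 1.5, 0.3, -0.7],
                       [-0.5, 0.0, 2.2, 1.0]])
target = torch.tensor([0, 1, 2])  # class indices

# CrossEntropyLoss works directly on raw logits
ce = nn.CrossEntropyLoss()(logits, target)
# NLLLoss expects log-probabilities, so apply log_softmax first
nll = nn.NLLLoss()(torch.log_softmax(logits, dim=1), target)

# Binary case: one logit per sample, float targets in {0, 1}
bin_logits = torch.tensor([0.8, -1.2, 2.5])
bin_target = torch.tensor([1.0, 0.0, 1.0])

bce_logits = nn.BCEWithLogitsLoss()(bin_logits, bin_target)
bce = nn.BCELoss()(torch.sigmoid(bin_logits), bin_target)

print(torch.allclose(ce, nll))          # both paths give the same loss
print(torch.allclose(bce_logits, bce))
```

The logits-based variants are usually the safer choice because they fold the activation into the loss in a numerically stable way.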

Loss functions from other useful packages:

- ```LabelSmoothingLoss:``` [label smoothing](https://github.com/PistonY/torch-toolbox)

In [31]:
import torch
import torch.nn as nn

# Variable is deprecated; tensors track gradients directly via requires_grad
x = torch.tensor([[1., 2.], [3., 4.]], requires_grad=True)
y = torch.tensor([[5., 6.], [7., 8.]], requires_grad=True)

criterion = nn.MSELoss()
loss = criterion(x, y)
loss.backward()

print('gradient of x:\n', x.grad)
print('gradient of y:\n', y.grad)

gradient of x:
 tensor([[-2., -2.],
        [-2., -2.]])
gradient of y:
 tensor([[2., 2.],
        [2., 2.]])
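
The printed gradients can be checked by hand. For a mean squared error over N elements, the gradient with respect to `x` is 2 (x - y) / N, and the gradient with respect to `y` is its negative. Here every element has x - y = -4 and N = 4, which gives -2 and 2. A small sketch of that check, re-creating the same tensors:

```python
import torch
import torch.nn as nn

x = torch.tensor([[1., 2.], [3., 4.]], requires_grad=True)
y = torch.tensor([[5., 6.], [7., 8.]], requires_grad=True)

loss = nn.MSELoss()(x, y)
loss.backward()

# Analytic gradient of mean((x - y)^2) over N = 4 elements
n = x.numel()
expected_dx = (2 * (x - y) / n).detach()

print(torch.allclose(x.grad, expected_dx))    # True
print(torch.allclose(y.grad, -expected_dx))   # True
```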


### Customize Loss function

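There are many ways to define a custom loss function, but you have to make sure the gradient can still be backpropagated through it. One approach I recommend is writing a subclass of Module.

Below we write a custom MSELoss function.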

In [33]:
x.grad.zero_()
y.grad.zero_()

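class MSELoss(nn.Module):
    """Mean squared error written with differentiable torch operations."""

    def __init__(self):
        super().__init__()

    def forward(self, x, y):
        # torch.mean and ** are tracked by autograd, so gradients flow back to x and y
        return torch.mean((x - y) ** 2)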

criterion = MSELoss()
loss = criterion(x, y)
loss.backward()

print('gradient of x:\n', x.grad)
print('gradient of y:\n', y.grad)

gradient of x:
 tensor([[-2., -2.],
        [-2., -2.]])
gradient of y:
 tensor([[2., 2.],
        [2., 2.]])


In PyTorch gradients accumulate across backward passes, so we need to run

```python
x.grad.zero_()
y.grad.zero_()
```

to reset the previously stored gradients to zero.
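
To see the accumulation directly, here is a minimal sketch with a made-up tensor `w` (not part of the example above): calling `backward()` twice without zeroing doubles the stored gradient, and zeroing restores the expected value.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

# First backward pass: d/dw sum(3 * w) = 3 for each element
(3 * w).sum().backward()
first = w.grad.clone()

# Second backward pass without zeroing: the new gradient is added on top
(3 * w).sum().backward()
accumulated = w.grad.clone()

# Zero in place, then run backward again
w.grad.zero_()
(3 * w).sum().backward()
after_reset = w.grad.clone()

print(first)        # tensor([3., 3.])
print(accumulated)  # tensor([6., 6.])
print(after_reset)  # tensor([3., 3.])
```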



## [Optimizer](https://pytorch.org/docs/stable/optim.html)

These are the most commonly used PyTorch built-in optimizers. Beyond these, most optimizers have to be custom-defined.

- ```Adagrad:``` (Class Adagrad)
- ```Adam:``` (Class Adam)
- ```AdamW:``` (Class AdamW)
- ```LBFGS:``` (Class LBFGS)
- ```RMSprop:``` (Class RMSprop)
- ```SGD:``` (Class SGD)

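Once the gradients have been computed by backpropagation, we call the `step` method of the optimizer class to update the parameters.

Note: the gradients must be zeroed before every update, otherwise gradients from earlier iterations are added to the new ones.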
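
Put together, one update consists of zeroing the gradients, running the forward pass, calling `backward()`, and calling `step()`. The following is a minimal sketch of that loop under assumed toy conditions: the single linear layer, the random data, and the SGD learning rate of 0.05 are illustration choices, not part of the original notebook.

```python
import torch
import torch.nn as nn
from torch.optim import SGD

torch.manual_seed(0)

# Hypothetical toy setup: a single linear layer fitted to random data
model = nn.Linear(4, 1)
inputs = torch.randn(16, 4)
targets = torch.randn(16, 1)

criterion = nn.MSELoss()
optimizer = SGD(model.parameters(), lr=0.05)

losses = []
for step in range(20):
    optimizer.zero_grad()                      # 1. clear gradients from the previous step
    loss = criterion(model(inputs), targets)   # 2. forward pass and loss
    loss.backward()                            # 3. backpropagate
    optimizer.step()                           # 4. update the parameters
    losses.append(loss.item())

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```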



In [38]:
from torch.optim import Adam

class MyModule(nn.Module):

    def __init__(self, in_features=512, out_features=64, depth=5):
        super().__init__()

        self.layers = nn.Sequential()
        for i in range(depth):
            self.layers.add_module(f'linear{i+1}', nn.Linear(in_features, in_features // 2))
            self.layers.add_module(f'relu{i+1}', nn.ReLU(inplace=True))
            in_features = in_features // 2
        self.layers.add_module(f'linear{depth+1}', nn.Linear(in_features, out_features))

        self.activation = nn.Sigmoid()

    def forward(self, x):
        x = self.layers(x)
        x = self.activation(x)
        return x

model = MyModule()

optimizer = Adam(model.parameters(), lr=0.01)

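if __name__ == '__main__':
    print(optimizer)
    for i in range(5):
        optimizer.zero_grad()  # clear gradients left over from the previous iteration
        # a real training loop would compute a loss and call loss.backward() here
        optimizer.step()       # update the parameters (nothing changes here, since no gradients exist)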

Adam (
Parameter Group 0
    amsgrad: False
    betas: (0.9, 0.999)
    eps: 1e-08
    lr: 0.01
    maximize: False
    weight_decay: 0
)


### Learning Rate Schedule

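An optimizer takes two main inputs: the parameters to update and the learning rate. A learning rate scheduler adjusts the optimizer's learning rate dynamically during training. The commonly used ones are listed below.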
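
The learning rate itself is stored in `optimizer.param_groups`, and a scheduler works by rewriting that entry. As a rough sketch of the same thing done by hand (the dummy parameter and the factor 0.5 are illustration choices):

```python
import torch
import torch.nn as nn
from torch.optim import Adam

# Dummy parameter so that an optimizer can be built without a real model
optimizer = Adam([nn.Parameter(torch.zeros(1))], lr=0.01)

# Manual "schedule": halve the learning rate of every parameter group
for group in optimizer.param_groups:
    group['lr'] *= 0.5

print(optimizer.param_groups[0]['lr'])  # 0.005
```

The schedulers below automate this kind of adjustment according to a fixed rule or a monitored value.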

#### ReduceLROnPlateau

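Reduces the learning rate when the monitored value (typically the loss) stops improving, that is, when training reaches a plateau.

**Parameter:**

- patience (int) - number of epochs without improvement to tolerate before lowering lr
- factor (float) - factor the lr is multiplied by (new lr = lr * factor)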

In [48]:
LEARNING_RATE = 0.01
model = MyModule()

optimizer = Adam(model.parameters(), lr = LEARNING_RATE)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.4, patience=2)

for i in range(1, 15):
    optimizer.zero_grad()
    optimizer.step()
    print(f"Epoch [{i}]  lr: {optimizer.param_groups[0]['lr']}")
    loss = 1  # constant placeholder loss, so the scheduler never sees an improvement
    scheduler.step(loss)  # pass the monitored value so the scheduler can check whether it is decreasing

Epoch [1]  lr: 0.01
Epoch [2]  lr: 0.01
Epoch [3]  lr: 0.01
Epoch [4]  lr: 0.01
Epoch [5]  lr: 0.004
Epoch [6]  lr: 0.004
Epoch [7]  lr: 0.004
Epoch [8]  lr: 0.0016
Epoch [9]  lr: 0.0016
Epoch [10]  lr: 0.0016
Epoch [11]  lr: 0.00064
Epoch [12]  lr: 0.00064
Epoch [13]  lr: 0.00064
Epoch [14]  lr: 0.00025600000000000004
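
The run above uses a constant loss, so the learning rate drops every three epochs after the first. As a sketch of how the scheduler reacts to a loss that improves and then stalls, the example below feeds it a made-up list `val_losses`; the dummy parameter, `factor=0.5`, and the loss values are assumptions for illustration.

```python
import torch
import torch.nn as nn
from torch.optim import Adam
from torch.optim.lr_scheduler import ReduceLROnPlateau

# Dummy parameter so that an optimizer can be built without a real model
params = [nn.Parameter(torch.zeros(1))]
optimizer = Adam(params, lr=0.01)
scheduler = ReduceLROnPlateau(optimizer, mode='min', factor=0.5, patience=2)

# Made-up validation losses: improving at first, then flat, then improving again
val_losses = [1.0, 0.8, 0.7, 0.7, 0.7, 0.7, 0.6]

lrs = []
for val_loss in val_losses:
    optimizer.step()           # no gradients here, so nothing is updated
    scheduler.step(val_loss)   # scheduler compares val_loss with the best value so far
    lrs.append(optimizer.param_groups[0]['lr'])

print(lrs)  # [0.01, 0.01, 0.01, 0.01, 0.01, 0.005, 0.005]
```

The learning rate is halved only after the third consecutive epoch without improvement, one more than `patience`.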


#### StepLR

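Multiplies the learning rate by a fixed factor every `step_size` epochs.

**Parameter:**

- step_size (int) - number of epochs between learning rate decays
- gamma (float) - factor the lr is multiplied by at the end of each period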

In [53]:
optimizer = Adam(model.parameters(), lr = LEARNING_RATE)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma = .1)

for i in range(1, 15):
    optimizer.zero_grad()
    optimizer.step()
    print(f"Epoch [{i}]  lr: {optimizer.param_groups[0]['lr']}")
    scheduler.step()

Epoch [1]  lr: 0.01
Epoch [2]  lr: 0.01
Epoch [3]  lr: 0.01
Epoch [4]  lr: 0.01
Epoch [5]  lr: 0.01
Epoch [6]  lr: 0.01
Epoch [7]  lr: 0.01
Epoch [8]  lr: 0.01
Epoch [9]  lr: 0.01
Epoch [10]  lr: 0.01
Epoch [11]  lr: 0.001
Epoch [12]  lr: 0.001
Epoch [13]  lr: 0.001
Epoch [14]  lr: 0.001
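
The output above is consistent with the closed form lr_t = base_lr * gamma ** (t // step_size), where t counts the scheduler steps taken so far. A small sketch comparing that formula with `StepLR` (the dummy parameter only exists so that an optimizer can be constructed):

```python
import math
import torch
import torch.nn as nn
from torch.optim import Adam
from torch.optim.lr_scheduler import StepLR

base_lr, step_size, gamma = 0.01, 10, 0.1

optimizer = Adam([nn.Parameter(torch.zeros(1))], lr=base_lr)
scheduler = StepLR(optimizer, step_size=step_size, gamma=gamma)

scheduler_lrs, closed_form_lrs = [], []
for t in range(14):  # t = number of scheduler steps taken so far
    scheduler_lrs.append(optimizer.param_groups[0]['lr'])
    closed_form_lrs.append(base_lr * gamma ** (t // step_size))
    optimizer.step()
    scheduler.step()

matches = all(math.isclose(a, b) for a, b in zip(scheduler_lrs, closed_form_lrs))
print(matches)
```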


#### MultiStepLR

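Multiplies the learning rate by a fixed factor each time the step count reaches one of the given milestones.

**Parameter:**

- milestones (list) - step counts at which the lr is decayed, in increasing order
- gamma (float) - factor the lr is multiplied by at each milestone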

In [54]:
optimizer = Adam(model.parameters(), lr = LEARNING_RATE)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[5, 12], gamma = .1)

for i in range(1, 15):
    optimizer.zero_grad()
    optimizer.step()
    print(f"Epoch [{i}]  lr: {optimizer.param_groups[0]['lr']}")
    scheduler.step()

Epoch [1]  lr: 0.01
Epoch [2]  lr: 0.01
Epoch [3]  lr: 0.01
Epoch [4]  lr: 0.01
Epoch [5]  lr: 0.01
Epoch [6]  lr: 0.001
Epoch [7]  lr: 0.001
Epoch [8]  lr: 0.001
Epoch [9]  lr: 0.001
Epoch [10]  lr: 0.001
Epoch [11]  lr: 0.001
Epoch [12]  lr: 0.001
Epoch [13]  lr: 0.0001
Epoch [14]  lr: 0.0001
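
Equivalently, the learning rate after t scheduler steps is base_lr * gamma ** k, where k is the number of milestones that are less than or equal to t. A sketch of that check using `bisect_right` from the standard library (the dummy parameter is an illustration device):

```python
import math
from bisect import bisect_right

import torch
import torch.nn as nn
from torch.optim import Adam
from torch.optim.lr_scheduler import MultiStepLR

base_lr, milestones, gamma = 0.01, [5, 12], 0.1

optimizer = Adam([nn.Parameter(torch.zeros(1))], lr=base_lr)
scheduler = MultiStepLR(optimizer, milestones=milestones, gamma=gamma)

scheduler_lrs, closed_form_lrs = [], []
for t in range(14):  # t = number of scheduler steps taken so far
    scheduler_lrs.append(optimizer.param_groups[0]['lr'])
    # bisect_right counts how many milestones are <= t
    closed_form_lrs.append(base_lr * gamma ** bisect_right(milestones, t))
    optimizer.step()
    scheduler.step()

matches = all(math.isclose(a, b) for a, b in zip(scheduler_lrs, closed_form_lrs))
print(matches)
```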


#### ExponentialLR

Multiplies the learning rate by a fixed factor at every step.

**Parameter:**

- gamma (float) - factor the lr is multiplied by at each step

In [55]:
optimizer = Adam(model.parameters(), lr = LEARNING_RATE)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma = .1)

for i in range(1, 15):
    optimizer.zero_grad()
    optimizer.step()
    print(f"Epoch [{i}]  lr: {optimizer.param_groups[0]['lr']}")
    scheduler.step()

Epoch [1]  lr: 0.01
Epoch [2]  lr: 0.001
Epoch [3]  lr: 0.0001
Epoch [4]  lr: 1e-05
Epoch [5]  lr: 1.0000000000000002e-06
Epoch [6]  lr: 1.0000000000000002e-07
Epoch [7]  lr: 1.0000000000000004e-08
Epoch [8]  lr: 1.0000000000000005e-09
Epoch [9]  lr: 1.0000000000000006e-10
Epoch [10]  lr: 1.0000000000000006e-11
Epoch [11]  lr: 1.0000000000000006e-12
Epoch [12]  lr: 1.0000000000000007e-13
Epoch [13]  lr: 1.0000000000000008e-14
Epoch [14]  lr: 1.0000000000000009e-15
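
With `gamma = .1` the learning rate shrinks tenfold at every step, following lr_t = base_lr * gamma ** t, and is effectively zero after a handful of epochs. The sketch below compares that setting with a gentler value of 0.95, which is an assumption chosen purely for illustration; `exponential_schedule` is a hypothetical helper, not a PyTorch function.

```python
import math
import torch
import torch.nn as nn
from torch.optim import Adam
from torch.optim.lr_scheduler import ExponentialLR

def exponential_schedule(base_lr, gamma, epochs):
    """Run ExponentialLR on a dummy parameter and return the lr seen at each epoch."""
    optimizer = Adam([nn.Parameter(torch.zeros(1))], lr=base_lr)
    scheduler = ExponentialLR(optimizer, gamma=gamma)
    lrs = []
    for _ in range(epochs):
        lrs.append(optimizer.param_groups[0]['lr'])
        optimizer.step()
        scheduler.step()
    return lrs

fast = exponential_schedule(0.01, gamma=0.1, epochs=14)    # setting used above
slow = exponential_schedule(0.01, gamma=0.95, epochs=14)   # gentler, illustrative value

# Both follow lr_t = base_lr * gamma ** t
print(all(math.isclose(lr, 0.01 * 0.1 ** t) for t, lr in enumerate(fast)))
print(all(math.isclose(lr, 0.01 * 0.95 ** t) for t, lr in enumerate(slow)))
print(f"after 13 steps: gamma=0.1 -> {fast[-1]:.1e}, gamma=0.95 -> {slow[-1]:.1e}")
```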


#### MultiplicativeLR

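At every step, multiplies the current learning rate by the factor returned by a user-defined function.

**Parameter:**

- lr_lambda (fcn or list) - user-defined function that returns the multiplicative factor for a given epoch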

In [59]:
def lr_lambda(epoch):
    return 0.2 if epoch < 5 else 0.9

optimizer = Adam(model.parameters(), lr = LEARNING_RATE)
scheduler = torch.optim.lr_scheduler.MultiplicativeLR(optimizer, lr_lambda = lr_lambda)

for i in range(1, 15):
    optimizer.zero_grad()
    optimizer.step()
    print(f"Epoch [{i}]  lr: {optimizer.param_groups[0]['lr']}")
    scheduler.step()

Epoch [1]  lr: 0.01
Epoch [2]  lr: 0.002
Epoch [3]  lr: 0.0004
Epoch [4]  lr: 8e-05
Epoch [5]  lr: 1.6000000000000003e-05
Epoch [6]  lr: 1.4400000000000003e-05
Epoch [7]  lr: 1.2960000000000003e-05
Epoch [8]  lr: 1.1664000000000002e-05
Epoch [9]  lr: 1.0497600000000003e-05
Epoch [10]  lr: 9.447840000000002e-06
Epoch [11]  lr: 8.503056000000003e-06
Epoch [12]  lr: 7.652750400000004e-06
Epoch [13]  lr: 6.8874753600000035e-06
Epoch [14]  lr: 6.198727824000003e-06


#### LambdaLR

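At every step, sets the learning rate to the initial learning rate multiplied by the value returned by a user-defined function. Unlike MultiplicativeLR, the factor is always applied to the initial lr, not to the current one.

**Parameter:**

- lr_lambda (fcn or list) - user-defined function that returns the factor applied to the initial lr for a given epoch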

In [61]:
def lr_lambda(epoch):
    return 0.2 if epoch < 5 else 0.9

optimizer = Adam(model.parameters(), lr = LEARNING_RATE)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda = lr_lambda)

for i in range(1, 15):
    optimizer.zero_grad()
    optimizer.step()
    print(f"Epoch [{i}]  lr: {optimizer.param_groups[0]['lr']}")
    scheduler.step()

Epoch [1]  lr: 0.002
Epoch [2]  lr: 0.002
Epoch [3]  lr: 0.002
Epoch [4]  lr: 0.002
Epoch [5]  lr: 0.002
Epoch [6]  lr: 0.009000000000000001
Epoch [7]  lr: 0.009000000000000001
Epoch [8]  lr: 0.009000000000000001
Epoch [9]  lr: 0.009000000000000001
Epoch [10]  lr: 0.009000000000000001
Epoch [11]  lr: 0.009000000000000001
Epoch [12]  lr: 0.009000000000000001
Epoch [13]  lr: 0.009000000000000001
Epoch [14]  lr: 0.009000000000000001
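
The same `lr_lambda` gives very different schedules in the two classes: `LambdaLR` computes lr_t = base_lr * lr_lambda(t), while `MultiplicativeLR` computes lr_t = lr_(t-1) * lr_lambda(t). A side-by-side sketch over the first seven epochs, where `run` is a hypothetical helper and the dummy parameter is an illustration device:

```python
import torch
import torch.nn as nn
from torch.optim import Adam
from torch.optim.lr_scheduler import LambdaLR, MultiplicativeLR

def lr_lambda(epoch):
    return 0.2 if epoch < 5 else 0.9

def run(scheduler_cls, epochs=7, base_lr=0.01):
    """Return the lr seen at each epoch for the given scheduler class."""
    optimizer = Adam([nn.Parameter(torch.zeros(1))], lr=base_lr)
    scheduler = scheduler_cls(optimizer, lr_lambda=lr_lambda)
    lrs = []
    for _ in range(epochs):
        lrs.append(optimizer.param_groups[0]['lr'])
        optimizer.step()
        scheduler.step()
    return lrs

lambda_lrs = run(LambdaLR)                  # factor applied to the initial lr
multiplicative_lrs = run(MultiplicativeLR)  # factor applied to the current lr

for epoch, (a, b) in enumerate(zip(lambda_lrs, multiplicative_lrs), start=1):
    print(f"Epoch [{epoch}]  LambdaLR: {a:.3g}  MultiplicativeLR: {b:.3g}")
```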
