## CB02-4 Part Five: Typical Training Loop

### 01 combining everything so far

In [1]:
import torch

class mlp(torch.nn.Module):
    def __init__(self, num_inputs, num_outputs):
        super().__init__()
        
        self.layers = torch.nn.Sequential(
            
            # 1st hidden layer
            torch.nn.Linear(num_inputs, 16),
            torch.nn.ReLU(),

            # 2nd hidden layer
            torch.nn.Linear(16, 16),
            torch.nn.ReLU(),

            # output layer
            torch.nn.Linear(16, num_outputs)
        )

    def forward(self, x):
        logits = self.layers(x)
        return logits

In [2]:
x_train = torch.tensor([
    [-1.2, 3.1],
    [-0.9, 2.9],
    [-0.5, 2.6],
    [2.3, -1.1],
    [2.7, -1.5]
])
y_train = torch.tensor([0, 0, 0, 1, 1])

x_test = torch.tensor([
    [-0.8, 2.8],
    [2.6, -1.6],
])
y_test = torch.tensor([0, 1])

from torch.utils.data import Dataset, DataLoader
class ToyDataset(Dataset):
    def __init__(self, x, y):
        self.features = x
        self.labels = y

    def __len__(self):
        return self.labels.shape[0]  # number of samples; same as self.features.shape[0]
    
    def __getitem__(self, idx):
        return self.features[idx], self.labels[idx]

train_dataset = ToyDataset(x_train, y_train)
test_dataset = ToyDataset(x_test, y_test)

torch.manual_seed(0)
train_loader = DataLoader(
    dataset=train_dataset,
    batch_size=2,
    shuffle=True,
    num_workers=0,
    drop_last=True  # 5 samples with batch_size=2: the last, incomplete batch is dropped
)

test_loader = DataLoader(
    dataset=test_dataset,
    batch_size=2,
    shuffle=False,
    num_workers=0,
)

### 02 typical training loop

In [3]:
torch.manual_seed(0)
model = mlp(2, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=1)

num_epochs = 3

for epoch in range(num_epochs):

    model.train()  # training mode; no effect here (no dropout or batch norm layers), but good practice

    for idx, (features, labels) in enumerate(train_loader):
        # clear the gradients for every batch
        optimizer.zero_grad()

        # forward pass
        logits = model(features)

        # compute the loss
        loss = torch.nn.functional.cross_entropy(logits, labels)
        
        # backward pass
        loss.backward()

        # update weights & biases through SGD
        optimizer.step()

        print(f'Epoch: {epoch}, Batch: {idx}, Loss: {loss:.2f}')
    
    model.eval()  # evaluation mode; no effect here (no dropout or batch norm layers), but good practice

Epoch: 0, Batch: 0, Loss: 0.80
Epoch: 0, Batch: 1, Loss: 0.82
Epoch: 1, Batch: 0, Loss: 0.30
Epoch: 1, Batch: 1, Loss: 0.22
Epoch: 2, Batch: 0, Loss: 0.00
Epoch: 2, Batch: 1, Loss: 0.00


In [22]:
model

mlp(
  (layers): Sequential(
    (0): Linear(in_features=2, out_features=16, bias=True)
    (1): ReLU()
    (2): Linear(in_features=16, out_features=16, bias=True)
    (3): ReLU()
    (4): Linear(in_features=16, out_features=2, bias=True)
  )
)

Note:

1. In practice, a validation dataset is usually used to find the optimal hyperparameter settings.
A validation dataset is similar to a test set. However,
whereas the test set should be used only once to avoid biasing the evaluation,
the validation set is typically used many times to tweak the model settings.

2. Preventing undesired gradient accumulation: it is important to include an
optimizer.zero_grad() call in each update round to reset the gradients. Otherwise, the
gradients from successive backward passes are added together, which is usually undesired.
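The accumulation described in note 2 can be seen on a single parameter. This is a minimal sketch, not part of the notebook: the tensors `w` and `x` are made up for illustration, and the gradient is cleared by hand to mimic what optimizer.zero_grad() does for every parameter it manages.

```python
import torch

# Hypothetical one-weight "model": loss = w * x, so d(loss)/dw = x = 3
w = torch.tensor([2.0], requires_grad=True)
x = torch.tensor([3.0])

# First backward pass: the stored gradient is 3
(w * x).sum().backward()
first = w.grad.clone()

# Second backward pass without clearing: the new gradient is added to the old one
(w * x).sum().backward()
accumulated = w.grad.clone()

# Clear the gradient by hand, then run backward again: back to 3
w.grad = None
(w * x).sum().backward()
after_reset = w.grad.clone()

print(first, accumulated, after_reset)
```

The second value is twice the first, which is why the training loop clears the gradients at the start of every batch.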

### 03 Evaluation

In [4]:
model.eval()
with torch.no_grad():
    output = model(x_train)
output


tensor([[ 8.0681, -7.6774],
        [ 7.3428, -6.9849],
        [ 6.2796, -5.9794],
        [-3.8422,  4.2457],
        [-4.7082,  5.1806]])

In [5]:
prob = torch.softmax(output, dim=1)

In [6]:
predictions = torch.argmax(prob, dim=1)
torch.sum(predictions == y_train)

tensor(5)

In [7]:

model.eval()
with torch.no_grad():
    output = model(x_test)
prob = torch.softmax(output, dim=1)
predictions = torch.argmax(prob, dim=1)
num_correct = torch.sum(predictions == y_test)

num_correct.item() / len(y_test)

1.0
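The accuracy computation above is repeated for the training and test data, so it could be wrapped in a helper that works on any DataLoader. The sketch below is one possible way to do that; the name `compute_accuracy` and the hand-weighted stand-in model `demo_model` are assumptions made for illustration, not part of the notebook.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def compute_accuracy(model, dataloader):
    """Return the fraction of correctly classified examples in dataloader."""
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for features, labels in dataloader:
            logits = model(features)
            # argmax over logits gives the same class as argmax over softmax(logits)
            predictions = torch.argmax(logits, dim=1)
            correct += (predictions == labels).sum().item()
            total += labels.shape[0]
    return correct / total

# Stand-in model with hand-set weights, so the result does not depend on training
demo_model = torch.nn.Linear(2, 2)
with torch.no_grad():
    demo_model.weight.copy_(torch.tensor([[-1.0, 1.0], [1.0, -1.0]]))
    demo_model.bias.zero_()

x_demo = torch.tensor([[-0.8, 2.8], [2.6, -1.6]])
y_demo = torch.tensor([0, 1])
demo_loader = DataLoader(TensorDataset(x_demo, y_demo), batch_size=2)

demo_accuracy = compute_accuracy(demo_model, demo_loader)
print(demo_accuracy)
```

With the notebook's own objects, the equivalent call would be compute_accuracy(model, test_loader). Counting correct predictions batch by batch keeps memory use bounded when the dataset is too large to pass through the model in one call.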

### Note: Generated by Kimi

Softmax and cross-entropy are common concepts in machine learning and deep learning, and they are usually used together in classification problems.

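### Softmax function
The softmax function converts a vector of real numbers into a probability distribution. Given a vector $z$ whose $i$-th element is $z_i$, softmax is defined as:

$\text{Softmax}(z_i) = \frac{e^{z_i}}{\sum_{j} e^{z_j}}$

The denominator is the sum of the exponentials of all elements of $z$. The output is a probability distribution: every element lies between 0 and 1, and all elements sum to 1.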
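As a small worked example of the formula, the sketch below implements softmax in plain Python. The input vector is made up, and subtracting the maximum before exponentiating is an added numerical-stability step that does not change the result.

```python
import math

def softmax(z):
    # Subtracting the maximum leaves the result unchanged but avoids overflow in exp
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 2.0, 3.0])
print([round(p, 4) for p in probs])  # [0.09, 0.2447, 0.6652]
print(sum(probs))
```

The largest input receives the largest probability, and the three values sum to 1 up to floating-point rounding.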

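### Cross-entropy loss
The cross-entropy loss measures the difference between two probability distributions. In classification it is typically used to measure how far the model's predicted distribution is from the distribution given by the true labels. For binary classification, the cross-entropy loss can be written as:

$L = -\left( y \log(p) + (1 - y) \log(1 - p) \right)$

where $y$ is the true label (0 or 1) and $p$ is the model's predicted probability that the sample belongs to class 1.

For multi-class classification, the cross-entropy loss can be written as:

$L = -\sum_{c=1}^{M} y_{o,c} \log(p_{o,c})$

where $M$ is the number of classes, $y_{o,c}$ is a binary indicator (0 or 1) that equals 1 if class $c$ is the correct class for sample $o$ and 0 otherwise, and $p_{o,c}$ is the model's predicted probability that sample $o$ belongs to class $c$.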
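The two formulas agree when there are only two classes. The sketch below checks this in plain Python; the function names and the probability 0.8 are chosen purely for illustration.

```python
import math

def cross_entropy(probs, target):
    # With a one-hot label, only the true class's term survives the sum
    return -math.log(probs[target])

def binary_cross_entropy(p, y):
    # p is the predicted probability of class 1, y is the true label (0 or 1)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

p = 0.8  # hypothetical predicted probability of class 1
multi = cross_entropy([1 - p, p], target=1)
binary = binary_cross_entropy(p, y=1)
print(round(multi, 4), round(binary, 4))
```

Both calls evaluate to $-\log(0.8) \approx 0.2231$, since the binary formula is the multi-class formula with $M = 2$.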

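### Combining softmax and cross-entropy
In multi-class classification, softmax is typically applied to the model's outputs (unnormalized scores called logits) to turn them into a probability distribution. That distribution is then used to compute the cross-entropy loss that trains the model. By minimizing the cross-entropy loss, the model adjusts its parameters so that the predicted distribution moves as close as possible to the distribution of the true labels. In PyTorch, torch.nn.functional.cross_entropy expects the raw logits and applies the softmax (as a log-softmax) internally, which is why the model above returns logits rather than probabilities.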
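To make the connection concrete, the sketch below compares torch.nn.functional.cross_entropy on raw logits with a manual log-softmax followed by the negative log-likelihood of the true class. The logits and labels are made-up values, not outputs of the notebook's model.

```python
import torch
import torch.nn.functional as F

# Hypothetical logits for two samples and two classes, with their true labels
logits = torch.tensor([[2.0, -1.0], [0.5, 1.5]])
labels = torch.tensor([0, 1])

# Built-in loss: takes logits directly and averages over the batch
builtin = F.cross_entropy(logits, labels)

# Manual version: log-softmax, pick the true class's log-probability, average
log_probs = torch.log_softmax(logits, dim=1)
manual = -log_probs[torch.arange(len(labels)), labels].mean()

print(builtin, manual)
```

The two values match, so applying softmax to the logits before passing them to cross_entropy would apply it twice and distort the loss.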


### 04 Model Save & load

In [8]:
torch.save(model.state_dict(), 'model_v1.pth') # .pth or .pt are the most common extensions

In [9]:
new_model = mlp(2, 2)
new_model.load_state_dict(torch.load('model_v1.pth'))

<All keys matched successfully>

In [10]:
new_model.eval()
with torch.no_grad():
    output = new_model(x_train)
prob = torch.softmax(output, dim=1)
predictions = torch.argmax(prob, dim=1)
num_correct = torch.sum(predictions == y_train)

num_correct.item() / len(y_train)

1.0
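Matching accuracy suggests the weights were restored, and a stricter check is to compare the parameters directly. The sketch below does this on a small stand-in layer and saves to an in-memory buffer instead of a file so that it leaves nothing on disk; the names `original` and `restored` are hypothetical.

```python
import io
import torch

torch.manual_seed(0)
original = torch.nn.Linear(2, 2)

# Save the state dict into an in-memory buffer (a file path works the same way)
buffer = io.BytesIO()
torch.save(original.state_dict(), buffer)
buffer.seek(0)

# A fresh layer starts with different random weights until the state is loaded
restored = torch.nn.Linear(2, 2)
restored.load_state_dict(torch.load(buffer))

# Every tensor in the restored state dict should equal the original exactly
params_match = all(
    torch.equal(a, b)
    for a, b in zip(original.state_dict().values(), restored.state_dict().values())
)
print(params_match)
```

The same comparison could be run between model and new_model from the cells above.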