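# **Model Training (Classification Problems)**
This notebook walks through the details to watch for when training a model on classification tasks.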

## Chapter Outline
* ### Binary classification
    * ### [Dataset Creating / Loading](#DatasetCreating/Loading)
    * ### [Data Preprocessing](#DataPreprocessing)
    * ### [Model Building](#ModelBuilding)
    * ### [Model Training](#ModelTraining)
    * ### [Model Evaluation](#ModelEvaluation)
* ### Multi-class classification
---

## Import Packages

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from tqdm.auto import tqdm

# PyTorch packages
import torch
import torch.nn as nn

<a name="DatasetCreating/Loading"></a>
## Dataset Creating / Loading

In [None]:
# Download and unzip the data
!wget -q https://github.com/TA-aiacademy/course_3.0/releases/download/DL/Data_part2.zip
!unzip -q Data_part2.zip

In [None]:
train_df = pd.read_csv('./Data/FilmComment_train.csv')
test_df = pd.read_csv('./Data/FilmComment_test.csv')

In [None]:
train_df.head()

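* #### Movie review dataset
The training and test sets contain 6,250 and 2,500 reviews respectively. Each of the 9,997 feature columns corresponds to a common word and is 1 if that word appears in the review and 0 otherwise; `y_label` marks whether the review is positive.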
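To make the encoding concrete, here is a minimal sketch of how a review could be turned into such a 0/1 vector. The five-word vocabulary, the sample review, and the `encode_review` helper are hypothetical and exist only for illustration; the dataset used here was encoded ahead of time with its own vocabulary.

```python
# Hypothetical 5-word vocabulary; the real dataset uses 9,997 common words.
vocabulary = ["good", "bad", "boring", "great", "plot"]

def encode_review(review, vocabulary):
    """Return a 0/1 vector: 1 if the vocabulary word occurs in the review, else 0."""
    words = set(review.lower().split())
    return [1 if word in words else 0 for word in vocabulary]

sample_review = "Great plot and great acting"
encoded = encode_review(sample_review, vocabulary)
print(encoded)  # [0, 0, 0, 1, 1]
```

Note that "great" occurs twice but still contributes a single 1: the encoding records presence, not counts.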

In [None]:
X_df = train_df.iloc[:, :-1].values
y_df = train_df.y_label.values

In [None]:
X_test = test_df.iloc[:, :-1].values
y_test = test_df.y_label.values

<a name="DataPreprocessing"></a>
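## Data Preprocessing

* ### Data Normalization
Every feature in this dataset already lies in the range 0 to 1, and all features are encoded the same way (word present or absent), so the raw values can be used directly as training data.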
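A quick way to confirm that no rescaling is needed is to inspect the value range. The sketch below uses a small made-up array, `X_toy`, as a stand-in; in this notebook the same check could be run on `X_df`.

```python
import numpy as np

# Made-up stand-in for the feature matrix (the real one comes from FilmComment_train.csv).
X_toy = np.array([[0, 1, 1],
                  [1, 0, 0]])

# Smallest and largest feature value in the matrix.
print(X_toy.min(), X_toy.max())

# True only when every entry is exactly 0 or 1.
is_binary = np.isin(X_toy, [0, 1]).all()
print(is_binary)
```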

* ### Data Splitting

In [None]:
# train, valid/test dataset split
from sklearn.model_selection import train_test_split
X_train, X_valid, y_train, y_valid = train_test_split(X_df, y_df, test_size=0.2, random_state=5566)

In [None]:
print(f'X_train shape: {X_train.shape}')
print(f'X_valid shape: {X_valid.shape}')
print(f'y_train shape: {y_train.shape}')
print(f'y_valid shape: {y_valid.shape}')

In [None]:
# Build torch dataset and dataloader
from torch.utils.data import TensorDataset, DataLoader

BATCH_SIZE = 512

train_dataset = TensorDataset(torch.from_numpy(X_train).float(),
                              torch.from_numpy(y_train).unsqueeze(1).float())
valid_dataset = TensorDataset(torch.from_numpy(X_valid).float(),
                              torch.from_numpy(y_valid).unsqueeze(1).float())

train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
valid_loader = DataLoader(valid_dataset, batch_size=BATCH_SIZE, shuffle=False)

In [None]:
for x, y in train_loader:
    print(x.shape, y.shape, y[:10])
    break

<a name="ModelBuilding"></a>
## Model Building

In [None]:
torch.manual_seed(5566)

model = nn.Sequential(
    nn.Linear(X_train.shape[1], 16),
    nn.ReLU(),
    nn.Linear(16, 16),
    nn.ReLU(),
    nn.Linear(16, 1),
    nn.Sigmoid(),
)

print(model)

<a name="ModelTraining"></a>
## Model Training

* ### Set up the optimizer, loss function, and metrics needed for training

In [None]:
optimizer = torch.optim.RMSprop(model.parameters(), lr=0.001)
loss_fn = nn.BCELoss()

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'device: {device}')
model = model.to(device)
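As a sanity check on what `nn.BCELoss` computes, the sketch below evaluates the binary cross-entropy formula by hand on three made-up predictions and compares the result with PyTorch's. The tensors `probs` and `targets` are illustrative values, not taken from the dataset.

```python
import torch
import torch.nn as nn

# Illustrative sigmoid outputs and 0/1 labels, shaped (batch, 1) like the model output.
probs = torch.tensor([[0.9], [0.2], [0.6]])
targets = torch.tensor([[1.0], [0.0], [1.0]])

# Binary cross-entropy: -mean(y * log(p) + (1 - y) * log(1 - p))
manual_loss = -(targets * torch.log(probs)
                + (1 - targets) * torch.log(1 - probs)).mean()
torch_loss = nn.BCELoss()(probs, targets)

print(manual_loss.item(), torch_loss.item())
```

Both numbers should agree up to floating-point error. `nn.BCELoss` expects probabilities in [0, 1], which is why the model above ends with `nn.Sigmoid()`, and it expects float targets with the same shape as the predictions.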

In [None]:
def train_epoch(model, optimizer, loss_fn, train_dataloader, val_dataloader):
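    # Train for one epoch
    model.train()
    total_train_loss = 0
    total_train_correct = 0
    for x, y in tqdm(train_dataloader, leave=False):
        x, y = x.to(device), y.to(device) # move the batch to the device (GPU if available)
        y_pred = model(x) # forward pass: compute predictions
        loss = loss_fn(y_pred, y) # compute the loss
        optimizer.zero_grad() # reset gradients to zero
        loss.backward() # backpropagate to compute gradients
        optimizer.step() # update the model parameters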

        total_train_loss += loss.item()
        total_train_correct += ((y_pred > 0.5) == (y > 0.5)).sum().item()

    # Validate for one epoch
    model.eval()
    total_val_loss = 0
    total_val_correct = 0
    # Disable gradient computation to speed up evaluation
    with torch.no_grad():
        for x, y in val_dataloader:
            x, y = x.to(device), y.to(device)
            y_pred = model(x)
            loss = loss_fn(y_pred, y)
            total_val_loss += loss.item()
            total_val_correct += ((y_pred > 0.5) == (y > 0.5)).sum().item()

    avg_train_loss = total_train_loss / len(train_dataloader)
    avg_train_acc = total_train_correct / len(train_dataloader.dataset)
    avg_val_loss = total_val_loss / len(val_dataloader)
    avg_val_acc = total_val_correct / len(val_dataloader.dataset)
    return avg_train_loss, avg_val_loss, avg_train_acc, avg_val_acc

In [None]:
train_loss_log = []
val_loss_log = []
train_acc_log = []
val_acc_log = []
for epoch in tqdm(range(20)):
    avg_train_loss, avg_val_loss, avg_train_acc, avg_val_acc = train_epoch(model, optimizer, loss_fn, train_loader, valid_loader)
    train_loss_log.append(avg_train_loss)
    val_loss_log.append(avg_val_loss)
    train_acc_log.append(avg_train_acc)
    val_acc_log.append(avg_val_acc)
    print(f'Epoch: {epoch}, Train Loss: {avg_train_loss:.3f}, Val Loss: {avg_val_loss:.3f} | Train Acc: {avg_train_acc:.3f}, Val Acc: {avg_val_acc:.3f}')

<a name="ModelEvaluation"></a>
## Model Evaluation

* ### Visualize the metrics recorded during training

In [None]:
plt.figure(figsize=(15, 4))
plt.subplot(1, 2, 1)
plt.plot(range(len(train_loss_log)), train_loss_log, label='train_loss')
plt.plot(range(len(val_loss_log)), val_loss_log, label='valid_loss')
plt.xlabel('Epochs')
plt.ylabel('Binary crossentropy')
plt.legend()

plt.subplot(1, 2, 2)
plt.plot(range(len(train_acc_log)), train_acc_log, label='train_acc')
plt.plot(range(len(val_acc_log)), val_acc_log, label='valid_acc')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()

* ### Model Predictions

In [None]:
# predict on the whole validation set (valid_loader)
model.eval()

y_pred = []
with torch.no_grad():
    for x, _ in tqdm(valid_loader):
        x = x.to(device)
        y_pred.append(model(x))

y_pred = torch.cat(y_pred).cpu()
y_pred_class = (y_pred > 0.5).squeeze(1).int()

print(y_pred_class, y_pred_class.shape)

* ### Visualize the results

In [None]:
plt.figure(figsize=(15, 4))
plt.scatter(range(y_pred.shape[0]), y_pred)
plt.hlines(0.5, 0, y_pred.shape[0], colors='red', label='y=0.5')
plt.legend()
plt.show()

----------------
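This example is a binary classification task, so y can be a one-dimensional array whose values 0 and 1 denote the two classes (positive and negative reviews).
![](https://hackmd.io/_uploads/SyTA5tU-p.png)

**How should the labels be represented for multi-class classification?** PyTorch uses **integer values** to denote the classes (0, 1, ..., n-1).

**How does this affect training?**
The component most directly tied to y is the loss function; for multi-class problems, use `torch.nn.CrossEntropyLoss()`.


*   Predictions (y_pred): one score per class for each sample, i.e. n values per sample
*   Labels (y): an integer from 0 to n-1 giving the correct class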
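The shapes involved can be sketched with made-up scores for a 3-class problem; the tensor values below are illustrative only. `nn.CrossEntropyLoss` takes raw scores (logits) of shape `(batch, n)` and integer class indices of shape `(batch,)` with dtype `long`. It applies log-softmax internally, so the model should not end with a softmax or sigmoid layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Made-up logits for 4 samples and 3 classes: shape (4, 3).
logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 1.5, 0.3],
                       [-0.5, 0.0, 2.5],
                       [1.0, 1.0, 1.0]])
# Integer class indices in 0..n-1: shape (4,), dtype long.
labels = torch.tensor([0, 1, 2, 1])

ce_loss = nn.CrossEntropyLoss()(logits, labels)
# Equivalent computation: log-softmax followed by negative log-likelihood.
nll_loss = F.nll_loss(F.log_softmax(logits, dim=1), labels)

print(ce_loss.item(), nll_loss.item())
```

The two values match, which shows that no one-hot encoding of the labels is required.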





----------------------------

## Multi-class classification

### Dataset Creating / Loading

In [None]:
train_df = pd.read_csv('./Data/FilmComment_train.csv')
test_df = pd.read_csv('./Data/FilmComment_test.csv')

In [None]:
X_df = train_df.iloc[:, :-1].values
y_df = train_df.y_label.values

In [None]:
X_test = test_df.iloc[:, :-1].values
y_test = test_df.y_label.values

<a name="DataPreprocessing"></a>
### Data Preprocessing

In [None]:
# train, valid/test dataset split
from sklearn.model_selection import train_test_split
X_train, X_valid, y_train, y_valid = train_test_split(X_df, y_df,
                                                      test_size=0.2,
                                                      random_state=5566)

In [None]:
print(f'X_train shape: {X_train.shape}')
print(f'X_valid shape: {X_valid.shape}')
print(f'y_train shape: {y_train.shape}')
print(f'y_valid shape: {y_valid.shape}')

In [None]:
# Build torch dataset and dataloader
from torch.utils.data import TensorDataset, DataLoader

BATCH_SIZE = 512

train_dataset = TensorDataset(torch.from_numpy(X_train).float(),
                              torch.from_numpy(y_train).long())
valid_dataset = TensorDataset(torch.from_numpy(X_valid).float(),
                              torch.from_numpy(y_valid).long())

train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
valid_loader = DataLoader(valid_dataset, batch_size=BATCH_SIZE, shuffle=False)

for x, y in train_loader:
    print(x.shape, y.shape, y[:10])
    break

### Model Building

In [None]:
torch.manual_seed(5566)

# No sigmoid needed: the final layer outputs one score each for class 0 and class 1
model = nn.Sequential(
    nn.Linear(X_train.shape[1], 16),
    nn.ReLU(),
    nn.Linear(16, 16),
    nn.ReLU(),
    nn.Linear(16, 2),
)

print(model)

### Model Training

* #### Set up the optimizer and loss function needed for training

In [None]:
optimizer = torch.optim.RMSprop(model.parameters(), lr=0.001)
loss_fn = nn.CrossEntropyLoss() # loss function for multi-class classification

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'device: {device}')
model = model.to(device)

In [None]:
def train_epoch(model, optimizer, loss_fn, train_dataloader, val_dataloader):
    # Train for one epoch
    model.train()
    total_train_loss = 0
    total_train_correct = 0
    for x, y in tqdm(train_dataloader, leave=False):
        x, y = x.to(device), y.to(device) # move the batch to the device (GPU if available)
        y_pred = model(x) # forward pass: compute predictions
        loss = loss_fn(y_pred, y) # compute the loss
        optimizer.zero_grad() # reset gradients to zero
        loss.backward() # backpropagate to compute gradients
        optimizer.step() # update the model parameters

        total_train_loss += loss.item()
        # argmax returns the index of the highest-scoring class; compare it with the label
        total_train_correct += ((y_pred.argmax(dim=1) == y).sum().item())

    # Validate for one epoch
    model.eval()
    total_val_loss = 0
    total_val_correct = 0
    # Disable gradient computation to speed up evaluation
    with torch.no_grad():
        for x, y in val_dataloader:
            x, y = x.to(device), y.to(device)
            y_pred = model(x)
            loss = loss_fn(y_pred, y)
            total_val_loss += loss.item()
            # argmax returns the index of the highest-scoring class; compare it with the label
            total_val_correct += ((y_pred.argmax(dim=1) == y).sum().item())

    avg_train_loss = total_train_loss / len(train_dataloader)
    avg_train_acc = total_train_correct / len(train_dataloader.dataset)
    avg_val_loss = total_val_loss / len(val_dataloader)
    avg_val_acc = total_val_correct / len(val_dataloader.dataset)

    return avg_train_loss, avg_val_loss, avg_train_acc, avg_val_acc
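The accuracy bookkeeping inside `train_epoch` can be checked on a small example. The logits and labels below are made up for illustration.

```python
import torch

# Made-up logits for 4 samples and 2 classes, plus the true class indices.
example_logits = torch.tensor([[1.2, -0.3],
                               [0.1, 0.4],
                               [-2.0, 0.5],
                               [0.7, 0.2]])
example_labels = torch.tensor([0, 1, 0, 0])

# argmax over dim=1 picks the index of the largest score in each row.
predicted_classes = example_logits.argmax(dim=1)
# Count the rows where the predicted class equals the label.
num_correct = (predicted_classes == example_labels).sum().item()

print(predicted_classes)  # tensor([0, 1, 1, 0])
print(num_correct)        # 3
```

Dividing `num_correct` by the number of samples (here 3 / 4) gives the accuracy, which is what the function does with `len(train_dataloader.dataset)`.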

In [None]:
train_loss_log = []
val_loss_log = []
train_acc_log = []
val_acc_log = []
for epoch in tqdm(range(20)):
    avg_train_loss, avg_val_loss, avg_train_acc, avg_val_acc = train_epoch(model, optimizer, loss_fn, train_loader, valid_loader)
    train_loss_log.append(avg_train_loss)
    val_loss_log.append(avg_val_loss)
    train_acc_log.append(avg_train_acc)
    val_acc_log.append(avg_val_acc)
    print(f'Epoch: {epoch}, Train Loss: {avg_train_loss:.3f}, Val Loss: {avg_val_loss:.3f} | Train Acc: {avg_train_acc:.3f}, Val Acc: {avg_val_acc:.3f}')

### Model Evaluation

* #### Visualize the metrics recorded during training

In [None]:
plt.figure(figsize=(15, 4))
plt.subplot(1, 2, 1)
plt.plot(range(len(train_loss_log)), train_loss_log, label='train_loss')
plt.plot(range(len(val_loss_log)), val_loss_log, label='valid_loss')
plt.xlabel('Epochs')
plt.ylabel('Cross-entropy')
plt.legend()

plt.subplot(1, 2, 2)
plt.plot(range(len(train_acc_log)), train_acc_log, label='train_acc')
plt.plot(range(len(val_acc_log)), val_acc_log, label='valid_acc')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()

* #### Model Predictions

In [None]:
# predict on the whole validation set (valid_loader)
model.eval()

y_pred = []
with torch.no_grad():
    for x, _ in tqdm(valid_loader):
        x = x.to(device)
        y_pred.append(model(x))

y_pred = torch.cat(y_pred).cpu()
y_pred_class = y_pred.argmax(dim=1)

print(y_pred_class, y_pred_class.shape)
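Because this model outputs raw scores rather than probabilities, `y_pred` cannot be read as a probability directly. A minimal sketch of recovering class probabilities with softmax, shown on made-up logits (in the notebook, `y_pred` would take the place of `sample_logits`):

```python
import torch

# Made-up logits standing in for the model output: shape (3, 2).
sample_logits = torch.tensor([[2.0, 0.0],
                              [0.0, 0.0],
                              [-1.0, 3.0]])

# Softmax over dim=1 turns each row into probabilities that sum to 1.
sample_probs = torch.softmax(sample_logits, dim=1)
# Probability assigned to class 1 for each sample.
positive_prob = sample_probs[:, 1]

print(sample_probs)
print(positive_prob)
```

Softmax preserves the ordering of the scores within a row, so taking `argmax` of the logits or of the probabilities yields the same predicted class.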

### Quiz
Use Data/pkgo_train.csv to build a multi-class classifier that predicts five kinds of Pokemon, and tune the model (number of layers, number of neurons) to reach a higher accuracy.

pkgo_train is a dataset describing Pokemon sightings in Pokemon Go. The columns are:
* latitude, longitude: location (latitude and longitude)
* local.xx: time (extracted format mm-dd'T'hh-mm-ss.ms'Z')
* appearedTimeOfDay: one of four periods: night, evening, afternoon, morning
* appearedHour/Minute: local hour / minute
* appearedDayOfWeek: Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday
* appearedDay/Month: local day of the month / month
* terrainType: terrain type
* closeToWater: whether the location is close to water (within 100 m)
* city: city
* continent: continent
* weather: weather type (Foggy, Clear, PartlyCloudy, MostlyCloudy, Overcast, Rain, BreezyandOvercast, LightRain, Drizzle, BreezyandPartlyCloudy, HeavyRain, BreezyandMostlyCloudy, Breezy, Windy, WindyandFoggy, Humid, Dry, WindyandPartlyCloudy, DryandMostlyCloudy, DryandPartlyCloudy, DrizzleandBreezy, LightRainandBreezy, HumidandPartlyCloudy, HumidandOvercast, RainandWindy)
* temperature: temperature in degrees Celsius
* windSpeed: wind speed (km/h)
* windBearing: wind direction
* pressure: air pressure
* sunrise/sunsetXX: sunrise and sunset information
* population_density: population density
* urban/suburban/midurban/rural: how urban the sighting location is (population density below 200 is rural, at least 200 and below 400 is midUrban, at least 400 and below 800 is subUrban, above 800 is urban)
* gymDistanceKm: distance to the nearest gym
* gymInxx: whether a gym is within the specified distance
* cooc1-cooc151: whether another Pokemon appeared within 100 m in the preceding 24 hours
* category: kind of Pokemon
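One possible starting point for the quiz is sketched below on a tiny made-up table rather than the real file. The column names come from the list above, but every value is invented and the class names in `category` are placeholders. Whether the provided CSV still needs this kind of encoding depends on how it was prepared, so treat this as an assumption to verify after loading the data.

```python
import pandas as pd
import torch

# Tiny made-up table using a few of the column names described above.
toy_df = pd.DataFrame({
    "temperature": [21.5, 30.1, 15.0, 25.3],
    "closeToWater": [True, False, False, True],
    "appearedTimeOfDay": ["night", "morning", "evening", "night"],
    "category": ["A", "B", "A", "C"],  # placeholder class names
})

# One-hot encode the string column, then cast every feature to float32.
features = pd.get_dummies(toy_df.drop(columns="category"),
                          columns=["appearedTimeOfDay"]).astype("float32")
# Map class names to integer indices 0..n-1, as nn.CrossEntropyLoss expects.
class_codes, class_names = pd.factorize(toy_df["category"])

X_toy_tensor = torch.tensor(features.to_numpy(), dtype=torch.float32)
y_toy_tensor = torch.tensor(class_codes, dtype=torch.long)
print(X_toy_tensor.shape, y_toy_tensor)
```

With five classes the final `nn.Linear` layer would need five outputs, and `class_names` keeps the mapping from each index back to its original label.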