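### **Loading the data**
Data source: [TIMIT](https://academictorrents.com/details/34e2b78745138186976cbc27939b1b34d18bd5b3)

Download the data from Google Drive and unzip it. You should get a `timit_11` folder containing the following files:
*  train_11.npy     (training data)
*  train_label_11.npy  (training labels)
*  test_11.npy     (test data)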



In [3]:
# !gdown --id '1HPkcmQmFGu-3OknddKIa5dNDsR05lIQR' --output data.zip
# !unzip data.zip
# !ls

Downloading...
From: https://drive.google.com/uc?id=1HPkcmQmFGu-3OknddKIa5dNDsR05lIQR
To: /content/data.zip
100% 372M/372M [00:04<00:00, 78.9MB/s]
Archive:  data.zip
   creating: timit_11/
  inflating: timit_11/train_11.npy   
  inflating: timit_11/test_11.npy    
  inflating: timit_11/train_label_11.npy  
data.zip  drive  sample_data  timit_11


### **Preparing the data**
Load the training and test data from the `.npy` files.

In [55]:
import numpy as np

print('Loading data ...')

data_root = './timit_11/'    # folder created by unzipping data.zip
train = np.load(data_root + 'train_11.npy')
train_label = np.load(data_root + 'train_label_11.npy')
test = np.load(data_root + 'test_11.npy')

print('Size of training data: {}'.format(train.shape))
print('Size of testing data: {}'.format(test.shape))

Loading data ...
Size of training data: (1229932, 429)
Size of testing data: (451552, 429)
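
The second dimension, 429, is consistent with each row being 11 consecutive frames of 39 features each (11 × 39 = 429), which is also what the `timit_11` name suggests. The notebook does not state this layout, so treat it as an assumption. Under that assumption, a row can be viewed as a small (frames, features) matrix. A minimal sketch with random stand-in data:

```python
import numpy as np

# Assumption: each 429-dim row is 11 frames x 39 features, concatenated frame by frame.
N_FRAMES, N_FEATS = 11, 39

fake = np.random.rand(4, N_FRAMES * N_FEATS).astype(np.float32)  # stand-in for `train`
framed = fake.reshape(-1, N_FRAMES, N_FEATS)                      # (samples, frames, features)
center = framed[:, N_FRAMES // 2, :]                              # middle frame of each window

print(framed.shape)   # (4, 11, 39)
print(center.shape)   # (4, 39)
```

If the layout assumption holds, this view is what a convolutional or recurrent model would consume instead of the flat 429-dim vector.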


### **Creating the dataset**

In [56]:
import torch
from torch.utils.data import Dataset

class TIMITDataset(Dataset):
    def __init__(self, X, y=None):
        self.data = torch.from_numpy(X).float()      # convert the data to a float tensor
        if y is not None:
            y = y.astype(int)            # cast the labels to int
            self.label = torch.LongTensor(y)      # store the labels as a LongTensor
        else:
            self.label = None

    def __getitem__(self, idx):
        if self.label is not None:
            return self.data[idx], self.label[idx]   # one training sample and its label
        else:
            return self.data[idx]            # one test sample (no label)

    def __len__(self):
        return len(self.data)

### **Splitting the training data**
Split the training data into a training set and a validation set. `VAL_RATIO` controls the proportion between the two.

In [57]:
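VAL_RATIO = 0.1     # fraction of the training data held out for validation

percent = int(train.shape[0] * (1 - VAL_RATIO))     # number of samples in the training split
# the first `percent` rows become the training set; the remaining rows become the validation set
train_x, train_y = train[:percent], train_label[:percent]
val_x, val_y = train[percent:], train_label[percent:]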
print('Size of training set: {}'.format(train_x.shape))
print('Size of validation set: {}'.format(val_x.shape))

Size of training set: (1106938, 429)
Size of validation set: (122994, 429)
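
The split above takes the last 10% of rows in file order. One common alternative is a seeded random permutation. Whether that suits this data depends on how the rows are ordered, which the notebook does not say, so this is only a sketch; `split_indices` is a hypothetical helper, not part of the original code.

```python
import numpy as np

def split_indices(n_samples, val_ratio, seed=0):
    """Return (train_idx, val_idx) from a seeded random permutation (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_samples)
    n_train = int(n_samples * (1 - val_ratio))   # same size rule as the notebook's `percent`
    return perm[:n_train], perm[n_train:]

train_idx, val_idx = split_indices(1229932, 0.1)
print(len(train_idx), len(val_idx))
```

The index arrays would then be used as `train[train_idx]` and `train[val_idx]`. Note that fancy indexing copies the selected rows, which matters for memory at this dataset size.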


### **Creating the DataLoader**
Wrap each Dataset in a DataLoader. The batch size is set here.


In [58]:
from torch.utils.data import DataLoader

BATCH_SIZE = 128    # adjust the batch size here

train_set = TIMITDataset(train_x, train_y)
val_set = TIMITDataset(val_x, val_y)
# shuffle only the training data
train_loader = DataLoader(train_set, batch_size=BATCH_SIZE, shuffle=True)
val_loader = DataLoader(val_set, batch_size=BATCH_SIZE, shuffle=False)
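
With the default `drop_last=False`, a DataLoader yields one final, smaller batch whenever the dataset size is not a multiple of the batch size. The arithmetic for the split sizes printed earlier can be sketched in plain Python:

```python
import math

BATCH_SIZE = 128
n_train, n_val = 1106938, 122994   # split sizes printed by the notebook

summary = {}
for name, n in [('train', n_train), ('val', n_val)]:
    n_batches = math.ceil(n / BATCH_SIZE)          # what len(DataLoader) reports when drop_last=False
    last_batch = n - (n_batches - 1) * BATCH_SIZE  # size of the final, possibly smaller batch
    summary[name] = (n_batches, last_batch)
    print(name, n_batches, last_batch)
```

This is why the training loop later divides the accuracy count by the number of samples but the summed loss by the number of batches.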

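### **Freeing unneeded variables**
This dataset is quite large, so keep an eye on Colab's memory usage.

Deleting variables that are no longer needed saves memory. If you still need any of the variables deleted in this block, remove the block or run the cleanup later.

Reference: https://docs.python.org/zh-tw/3/library/gc.html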

In [59]:
import gc
# delete the raw arrays with del (the data now lives in the datasets)
del train, train_label, train_x, train_y, val_x, val_y
gc.collect()     # collect unreachable (unreferenced) objects

2506
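
In CPython, `del` only removes a name. An object is freed immediately once its last reference disappears, while objects caught in a reference cycle wait for the cycle collector that `gc.collect()` triggers. The number it returns (2506 above) is the count of unreachable objects it found. A small sketch of the difference, using a hypothetical `Blob` class as a stand-in for a large object:

```python
import gc
import weakref

class Blob:
    """Stand-in for a large object (hypothetical)."""
    pass

a, b = Blob(), Blob()
a.other, b.other = b, a        # reference cycle: refcounts never drop to zero on their own
probe = weakref.ref(a)         # lets us observe whether the object is still alive

del a, b                       # removes the names, but the cycle keeps both objects alive
alive_before = probe() is not None
collected = gc.collect()       # the cycle collector finds and frees the unreachable pair
alive_after = probe() is not None

print(alive_before, alive_after, collected > 0)
```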

### **Creating the model**
Define the model architecture and try to get a good score.

In [61]:
import torch.nn as nn

class Classifier(nn.Module):
    def __init__(self):
        super(Classifier, self).__init__()

        # self.layer1 = nn.Linear(429, 1024)
        # self.layer2 = nn.Linear(1024, 512)
        # self.layer3 = nn.Linear(512, 128)
        # self.out = nn.Linear(128, 39)

        # self.act_fn = nn.Sigmoid()
        self.net = nn.Sequential(
            nn.Linear(429, 2048),
            nn.ReLU(),
            nn.BatchNorm1d(2048),
            nn.Dropout(p=0.5),
            nn.Linear(2048, 1024),
            nn.ReLU(),
            nn.BatchNorm1d(1024),
            nn.Dropout(p=0.4),
            nn.Linear(1024, 512),
            nn.ReLU(),
            nn.BatchNorm1d(512),
            nn.Dropout(p=0.3),
            nn.Linear(512, 256),
            nn.ReLU(),
            nn.BatchNorm1d(256),
            nn.Dropout(p=0.2),
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.BatchNorm1d(128),
            nn.Dropout(p=0.1),
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.BatchNorm1d(64),
            nn.Linear(64, 39),
        )

    def forward(self, x):
        # x = self.layer1(x)
        # x = self.act_fn(x)

        # x = self.layer2(x)
        # x = self.act_fn(x)

        # x = self.layer3(x)
        # x = self.act_fn(x)

        # x = self.out(x)

        # return x
        return self.net(x)   # logits of shape (batch, 39); squeeze(1) was a no-op here
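
To get a feel for the size of this network, the trainable parameter count can be worked out from the layer widths alone. This is a plain-Python sketch of the arithmetic, not a call into PyTorch:

```python
# Layer widths of the nn.Sequential above, input to output
widths = [429, 2048, 1024, 512, 256, 128, 64, 39]

# nn.Linear(in, out) holds in*out weights plus out biases
linear_params = sum(i * o + o for i, o in zip(widths[:-1], widths[1:]))
# nn.BatchNorm1d(n) holds n scales and n shifts (default affine=True); one follows each hidden layer
bn_params = sum(2 * n for n in widths[1:-1])

total = linear_params + bn_params
print(f'{total:,} trainable parameters')
```

With the model instantiated, `sum(p.numel() for p in model.parameters() if p.requires_grad)` should report the same figure.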

### **Training**

In [62]:
# pick the device: GPU if available, otherwise CPU
def get_device():
  return 'cuda' if torch.cuda.is_available() else 'cpu'

Fix the random seeds so that, as long as the parameters are unchanged, the model produces the same output on every run.

Reference: https://zhuanlan.zhihu.com/p/161575780

In [64]:
def same_seeds(seed):
    torch.manual_seed(seed)     # seed PyTorch's random number generator
    if torch.cuda.is_available():
        # torch.cuda.manual_seed(seed)    # seed the current GPU only
        torch.cuda.manual_seed_all(seed)  # seed every GPU
    np.random.seed(seed)
    torch.backends.cudnn.benchmark = False    # disable cuDNN auto-tuning, which can pick different algorithms between runs
    torch.backends.cudnn.deterministic = True   # restrict cuDNN to deterministic algorithms
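
The idea behind seeding is that a pseudo-random generator started from the same seed produces the same sequence. `same_seeds` covers PyTorch and NumPy; it does not seed Python's built-in `random` module, which matters only if other code uses it. A minimal sketch of the principle, using `random` and NumPy rather than PyTorch (`draw` is a hypothetical helper):

```python
import random
import numpy as np

def draw(seed):
    """Seed Python's and NumPy's generators, then draw a few numbers (hypothetical helper)."""
    random.seed(seed)          # not covered by same_seeds in the notebook
    np.random.seed(seed)
    return random.random(), np.random.rand(3)

py_a, np_a = draw(0)
py_b, np_b = draw(0)
py_c, np_c = draw(1)

print(py_a == py_b, np.array_equal(np_a, np_b))   # same seed, same draws
print(np.array_equal(np_a, np_c))                 # different seed, different draws
```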

Set the training hyperparameters

In [66]:
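# fix the random seed
same_seeds(0)

# select the device
device = get_device()
print(f'DEVICE: {device}')

# training hyperparameters
num_epoch = 60
learning_rate = 0.0001

# path where the checkpoint is saved
model_path = './model.ckpt'

# create the model, then define the loss function and the optimizer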
model = Classifier().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate, weight_decay=0.0005)

DEVICE: cuda
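
`nn.CrossEntropyLoss` expects raw logits and applies log-softmax internally, which is why the model ends in a plain `nn.Linear(64, 39)` with no softmax layer. The per-sample computation can be sketched in plain Python; `cross_entropy` here is an illustrative helper, not the PyTorch implementation:

```python
import math

def cross_entropy(logits, target):
    """Cross-entropy of one sample from raw logits: -log softmax(logits)[target]."""
    m = max(logits)                                        # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

uniform = cross_entropy([0.0] * 39, target=5)              # all 39 classes equally likely
confident = cross_entropy([0.0] * 38 + [10.0], target=38)  # strongly favors the correct class

print(round(uniform, 4))      # equals log(39)
print(confident < uniform)
```

A model that guesses uniformly over 39 classes has a loss of log 39, about 3.66, which gives a reference point for reading the loss values in the training log.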


Start training

In [67]:
best_acc = 0.0
for epoch in range(num_epoch):
    train_acc = 0.0
    train_loss = 0.0
    val_acc = 0.0
    val_loss = 0.0

    # training
    model.train()   # put the model in training mode
    for i, data in enumerate(train_loader):
        inputs, labels = data                    # unpack one batch of inputs and labels
        inputs, labels = inputs.to(device), labels.to(device)   # move the batch to the device
        optimizer.zero_grad()                     # reset gradients so they do not accumulate across batches
        outputs = model(inputs)                   # forward pass
        batch_loss = criterion(outputs, labels)           # cross-entropy loss for this batch
        _, train_pred = torch.max(outputs, 1)           # index of the largest logit along dim 1 (the class dimension)
        batch_loss.backward()                     # backpropagation: compute gradients
        optimizer.step()                        # update the parameters

        # compare train_pred with labels element-wise (True where they match),
        # sum the matches in this batch, and read the count out as a Python number with item()
        train_acc += (train_pred.cpu() == labels.cpu()).sum().item()
        # accumulate this batch's loss
        train_loss += batch_loss.item()

    # validation
    if len(val_set) > 0:
        model.eval()                          # put the model in evaluation mode
        with torch.no_grad():                     # disable gradient tracking
            for i, data in enumerate(val_loader):
                inputs, labels = data
                inputs, labels = inputs.to(device), labels.to(device)
                outputs = model(inputs)
                batch_loss = criterion(outputs, labels)
                _, val_pred = torch.max(outputs, 1)     # index of the highest-scoring class

                val_acc += (val_pred.cpu() == labels.cpu()).sum().item()
                val_loss += batch_loss.item()

            # len(train_set) is the number of training samples; len(train_loader) is the number of batches (len(train_set) / batch_size, rounded up)
            print('[{:03d}/{:03d}] Train Acc: {:3.6f} Loss: {:3.6f} | Val Acc: {:3.6f} loss: {:3.6f}'.format(
                epoch + 1, num_epoch, train_acc/len(train_set), train_loss/len(train_loader), val_acc/len(val_set), val_loss/len(val_loader)
            ))

            # if validation accuracy improved, save a checkpoint of the model
            # model.state_dict() returns the model's parameters and buffers (commonly used for saving)
            if val_acc > best_acc:
                best_acc = val_acc
                torch.save(model.state_dict(), model_path)
                print('saving model with acc {:.3f}'.format(best_acc/len(val_set)))
    else:
        print('[{:03d}/{:03d}] Train Acc: {:3.6f} Loss: {:3.6f}'.format(
            epoch + 1, num_epoch, train_acc/len(train_set), train_loss/len(train_loader)
        ))

[001/060] Train Acc: 0.547156 Loss: 1.552408 | Val Acc: 0.665878 loss: 1.070760
saving model with acc 0.666
[002/060] Train Acc: 0.626883 Loss: 1.204728 | Val Acc: 0.691139 loss: 0.970993
saving model with acc 0.691
[003/060] Train Acc: 0.648854 Loss: 1.121065 | Val Acc: 0.707221 loss: 0.913175
saving model with acc 0.707
[004/060] Train Acc: 0.662427 Loss: 1.069816 | Val Acc: 0.713734 loss: 0.884824
saving model with acc 0.714
[005/060] Train Acc: 0.671913 Loss: 1.034243 | Val Acc: 0.722035 loss: 0.857114
saving model with acc 0.722
[006/060] Train Acc: 0.679157 Loss: 1.006031 | Val Acc: 0.726361 loss: 0.838246
saving model with acc 0.726
[007/060] Train Acc: 0.686898 Loss: 0.980815 | Val Acc: 0.728377 loss: 0.828933
saving model with acc 0.728
[008/060] Train Acc: 0.691589 Loss: 0.962700 | Val Acc: 0.733678 loss: 0.810986
saving model with acc 0.734
[009/060] Train Acc: 0.696736 Loss: 0.944851 | Val Acc: 0.735459 loss: 0.804998
saving model with acc 0.735
[010/060] Train Acc: 0.70003
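
The accuracy figures in the log come from counting matches batch by batch and dividing by the number of samples at the end. The same bookkeeping can be sketched with NumPy on made-up logits (the numbers below are illustrative, not taken from the model):

```python
import numpy as np

# Stand-in logits for two batches of 3 samples over 4 classes (made-up numbers)
batches = [
    (np.array([[2.0, 0.1, 0.3, 0.0], [0.2, 1.5, 0.1, 0.0], [0.1, 0.2, 0.3, 3.0]]), np.array([0, 1, 2])),
    (np.array([[0.0, 0.0, 1.0, 0.0], [1.0, 0.0, 0.0, 0.0], [0.0, 2.0, 0.0, 0.0]]), np.array([2, 0, 1])),
]

correct, total = 0, 0
for logits, labels in batches:
    pred = logits.argmax(axis=1)            # same role as torch.max(outputs, 1)[1]
    correct += int((pred == labels).sum())  # count of matches in this batch
    total += len(labels)

accuracy = correct / total                  # divide by the number of samples, not batches
print(correct, total, accuracy)
```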

### **Testing**
Create the test dataset and load the trained model from the checkpoint.

In [68]:
# create test_set and test_loader
test_set = TIMITDataset(test, None)
test_loader = DataLoader(test_set, batch_size=BATCH_SIZE, shuffle=False)

# create the model and load its parameters from the checkpoint
model = Classifier().to(device)
model.load_state_dict(torch.load(model_path))

<All keys matched successfully>
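
The `<All keys matched successfully>` output is the return value of `load_state_dict`, which reports any missing or unexpected keys. A minimal round trip, using a tiny `nn.Linear` as a stand-in for `Classifier` and an in-memory buffer instead of `./model.ckpt`:

```python
import io
import torch
import torch.nn as nn

src = nn.Linear(4, 2)                 # tiny stand-in for Classifier
buffer = io.BytesIO()                 # in-memory file instead of './model.ckpt'
torch.save(src.state_dict(), buffer)

buffer.seek(0)
dst = nn.Linear(4, 2)                 # fresh model with different random weights
result = dst.load_state_dict(torch.load(buffer, map_location='cpu'))

print(result.missing_keys, result.unexpected_keys)   # both empty when the architectures match
print(torch.equal(src.weight, dst.weight))
```

Passing `map_location` is useful when a checkpoint saved on a GPU is loaded on a machine without one.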

Run inference on the test set

In [69]:
predict = []                       # collects the predicted class indices
model.eval()                        # put the model in evaluation mode
with torch.no_grad():
    for i, data in enumerate(test_loader):
        inputs = data
        inputs = inputs.to(device)
        outputs = model(inputs)
        _, test_pred = torch.max(outputs, 1)  # index of the highest-scoring class

        # store the predictions for this batch
        for y in test_pred.cpu().numpy():
            predict.append(y)

Write the predictions to a `.csv` file, then download `prediction.csv` and upload it to Kaggle for scoring.

In [70]:
with open('prediction.csv', 'w') as f:
    f.write('Id,Class\n')
    for i, y in enumerate(predict):
        f.write('{},{}\n'.format(i, y))
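
Before uploading, it can be worth reading the file back to confirm the header and row count. The sketch below writes the same `Id,Class` layout with the standard `csv` module and parses it again; the predictions are made-up values and an in-memory buffer stands in for `prediction.csv`:

```python
import csv
import io

predict = [3, 3, 17, 0]                    # made-up class indices standing in for the real predictions

buffer = io.StringIO()                     # in-memory file instead of 'prediction.csv'
writer = csv.writer(buffer, lineterminator='\n')
writer.writerow(['Id', 'Class'])
writer.writerows(enumerate(predict))       # rows of (index, class)

buffer.seek(0)
rows = list(csv.DictReader(buffer))
print(len(rows) == len(predict))
print(rows[2])
```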