# **Homework 2-1 Phoneme Classification**

* Slides: https://speech.ee.ntu.edu.tw/~hylee/ml/ml2021-course-data/hw/HW02/HW02.pdf
* Video (Chinese): https://youtu.be/PdjXnQbu2zo
* Video (English): https://youtu.be/ESRr-VCykBs


**Reference note for training**
https://blog.csdn.net/qq_42994201/article/details/121324301?ops_request_misc=%257B%2522request%255Fid%2522%253A%2522166376957116800184125803%2522%252C%2522scm%2522%253A%252220140713.130102334..%2522%257D&request_id=166376957116800184125803&biz_id=0&utm_medium=distribute.pc_search_result.none-task-blog-2~all~sobaiduend~default-1-121324301-null-null.142^v49^control,201^v3^add_ask&utm_term=%E6%9D%8E%E5%AE%8F%E6%AF%852021%E4%BD%9C%E4%B8%9A2&spm=1018.2226.3001.4187


## The DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus (TIMIT)
The TIMIT corpus of read speech was designed to provide speech data for the acquisition of acoustic-phonetic knowledge and for the development and evaluation of automatic speech recognition systems.

This homework is a multiclass classification task: 
we train a deep neural network classifier to predict the phoneme of each frame in the TIMIT speech corpus.

link: https://academictorrents.com/details/34e2b78745138186976cbc27939b1b34d18bd5b3

## Download Data
Download the data from Google Drive, then unzip it.

You should have `timit_11/train_11.npy`, `timit_11/train_label_11.npy`, and `timit_11/test_11.npy` after running this block.<br><br>
`timit_11/`
- `train_11.npy`: training data<br>
- `train_label_11.npy`: training label<br>
- `test_11.npy`:  testing data<br><br>

**Note: if the Google Drive link is dead, you can download the data directly from Kaggle and upload it to the workspace.**




Reference notebook used to pass the strong baseline:

https://github.com/1am9trash/Hung_Yi_Lee_ML_2021/blob/main/hw/hw2/hw2_code.ipynb


In [None]:
!gdown --id '1HPkcmQmFGu-3OknddKIa5dNDsR05lIQR' --output data.zip
!unzip data.zip
!ls 

Downloading...
From: https://drive.google.com/uc?id=1HPkcmQmFGu-3OknddKIa5dNDsR05lIQR
To: /content/data.zip
100% 372M/372M [00:01<00:00, 239MB/s]
Archive:  data.zip
   creating: timit_11/
  inflating: timit_11/train_11.npy   
  inflating: timit_11/test_11.npy    
  inflating: timit_11/train_label_11.npy  
data.zip  sample_data  timit_11


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Import packages
# PyTorch
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import train_test_split

# For data preprocess
import numpy as np
import csv
import os

# Utility
import gc

my_seed = 0
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
np.random.seed(my_seed)
torch.manual_seed(my_seed)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(my_seed)

## Preparing Data
Load the training and testing data from the `.npy` file (NumPy array).

In [None]:
import numpy as np

print('Loading data ...')

data_root='./timit_11/'
train = np.load(data_root + 'train_11.npy')
train_label = np.load(data_root + 'train_label_11.npy')
test = np.load(data_root + 'test_11.npy')

print('Size of training data: {}'.format(train.shape))
print('Size of testing data: {}'.format(test.shape))

Loading data ...
Size of training data: (1229932, 429)
Size of testing data: (451552, 429)
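
Each row has 429 features. According to the homework slides, this is assumed to be 11 consecutive frames of 39-dimensional MFCC features concatenated together, with the label describing the center frame; the arrays themselves do not confirm that layout, so treat it as an assumption. Under that assumption, a row can be viewed as an 11 x 39 matrix. A minimal sketch on a synthetic array standing in for `train` (the helper `to_frames` is hypothetical):

```python
import numpy as np

N_FRAMES = 11   # assumed context window: 5 past frames, the center frame, 5 future frames
N_MFCC = 39     # assumed number of MFCC features per frame

def to_frames(batch):
    """Reshape (n, 429) rows into (n, 11, 39) stacks of frames."""
    return batch.reshape(-1, N_FRAMES, N_MFCC)

# Synthetic stand-in for two rows of `train`
demo = np.arange(2 * N_FRAMES * N_MFCC, dtype=np.float32).reshape(2, -1)
frames = to_frames(demo)
center = frames[:, N_FRAMES // 2, :]   # the frame the label is assumed to describe
print(frames.shape, center.shape)
```

Viewing the data this way can be handy for sanity checks or for trying models that use the frame order, but the fully connected classifier in this notebook consumes the flat 429-dimensional rows directly.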


## **(Added) Data analysis: the class distribution turns out to be very uneven**

In [None]:
print ("Total number {:d}".format(train_label.shape[0]))  # number of labeled frames
train_cnt = np.zeros((39), dtype=int)
for i in range(39):  # count the samples of each label (labels are stored as strings)
    train_cnt[i] = np.sum(train_label == str(i))

total = np.sum(train_cnt)  # named `total` to avoid shadowing the built-in sum()
print ("\n   class   count    rate")
for i in range(39):
    print ("{:8d}".format(i), end='')
    print ("{:8d}".format(train_cnt[i]), end='')
    print ("  {:.4f}".format(train_cnt[i] / total))


Total number 1229932

   class   count    rate
       0   62708  0.0510
       1   83746  0.0681
       2   35048  0.0285
       3   59031  0.0480
       4   38930  0.0317
       5   26380  0.0214
       6    4038  0.0033
       7   73827  0.0600
       8   28797  0.0234
       9   34289  0.0279
      10   11028  0.0090
      11   11711  0.0095
      12   26790  0.0218
      13   43410  0.0353
      14   39583  0.0322
      15   11342  0.0092
      16   20922  0.0170
      17   51533  0.0419
      18   24938  0.0203
      19   47059  0.0383
      20    8508  0.0069
      21    7083  0.0058
      22    7050  0.0057
      23   10663  0.0087
      24    3883  0.0032
      25    8219  0.0067
      26    7825  0.0064
      27    6059  0.0049
      28   11492  0.0093
      29   21012  0.0171
      30   25094  0.0204
      31   31618  0.0257
      32   12003  0.0098
      33   22907  0.0186
      34    6920  0.0056
      35   84521  0.0687
      36   27088  0.0220
      37   14164  0.0115
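
The per-class loop above scans the label array 39 times. As a sketch of a vectorized alternative, the counts can be computed in one pass with `np.bincount` after converting the string labels to integers. The function name `class_distribution` and the tiny label array are hypothetical stand-ins for illustration:

```python
import numpy as np

def class_distribution(labels, num_classes=39):
    """Return (counts, rates) per class for string or integer labels."""
    ids = np.asarray(labels).astype(int)
    counts = np.bincount(ids, minlength=num_classes)
    rates = counts / counts.sum()
    return counts, rates

# Synthetic labels stored as strings, like `train_label`
demo_labels = np.array(['0', '2', '2', '1', '2', '0'])
counts, rates = class_distribution(demo_labels, num_classes=4)
for cls, (c, r) in enumerate(zip(counts, rates)):
    print("{:8d}{:8d}  {:.4f}".format(cls, c, r))
```

`minlength` keeps a slot for classes that never occur, so the result always has one entry per class.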
   

## Create Dataset

In [None]:
import torch
from torch.utils.data import Dataset

class TIMITDataset(Dataset):
    def __init__(self, X, y=None):
        self.data = torch.from_numpy(X).float()
        if y is not None:
            y = y.astype(int)  # np.int was removed in NumPy 1.24; the built-in int is equivalent
            self.label = torch.LongTensor(y)
        else:
            self.label = None

    def __getitem__(self, idx):
        if self.label is not None:
            return self.data[idx], self.label[idx]
        else:
            return self.data[idx]

    def __len__(self):
        return len(self.data)


Split the labeled data into a training set and a validation set. You can modify the variable `VAL_RATIO` to change the ratio of validation data.

In [None]:
VAL_RATIO = 0.01
percent = int(train.shape[0] * (1 - VAL_RATIO))
train_x, train_y, val_x, val_y = train[:percent], train_label[:percent], train[percent:], train_label[percent:]
# Alternative: split randomly into train/val so that both sets follow the same distribution
#train_indices, valid_indices = train_test_split([i for i in range(train.shape[0])], test_size=VAL_RATIO, random_state=1)
#train_x, train_y, val_x, val_y = train[train_indices,:], train_label[train_indices], train[valid_indices,:], train_label[valid_indices]
print('Size of training set: {}'.format(train_x.shape))
print('Size of validation set: {}'.format(val_x.shape))

Size of training set: (1217632, 429)
Size of validation set: (12300, 429)
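
Taking the last 1% of rows as validation data does not guarantee that every class shows up in the validation set. One possible remedy, sketched here with NumPy only, is a stratified random split that holds out roughly the same fraction of each class. The function `stratified_split` is a hypothetical helper, not part of the homework template:

```python
import numpy as np

def stratified_split(labels, val_ratio, seed=0):
    """Return (train_idx, val_idx) with about val_ratio of every class held out."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    train_idx, val_idx = [], []
    for cls in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == cls))
        n_val = max(1, int(round(len(idx) * val_ratio)))  # at least one sample per class
        val_idx.append(idx[:n_val])
        train_idx.append(idx[n_val:])
    return np.concatenate(train_idx), np.concatenate(val_idx)

# Synthetic, imbalanced labels standing in for `train_label`
demo_labels = np.array(['0'] * 90 + ['1'] * 10)
tr, va = stratified_split(demo_labels, val_ratio=0.1)
print(len(tr), len(va))
```

The index arrays could then be used as `train[tr]`, `train_label[tr]`, and so on. If neighbouring rows share overlapping context frames, a random split may make validation accuracy look more optimistic than a sequential one, so compare both before trusting the number.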


Create a data loader from the dataset; feel free to tweak the variable `BATCH_SIZE` here.

# Record the sample indices of each class
These are used later when resampling the data.

In [None]:
# record the sample indices of each class to make sampling easier
train_class = []
idx = np.arange(train_x.shape[0])  # named `idx` to avoid shadowing the built-in id()
for i in range(39):
    train_class.append(idx[train_y == str(i)]) 

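# Pad the number of samples in each class
The data analysis above shows that class sizes differ widely, which can hurt training,
so the small classes need to be filled up.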

In [None]:
number = 10000  # pad every class up to 10,000 samples
BATCH_SIZE = 2048  # raised from 64 to 2048: larger batches train faster and represent the distribution better

print ("Sample data:")
print ("\n   class   count")
for i in range(len(train_class)):
    if (train_class[i].shape[0] < number):  # classes with fewer than 10,000 training samples
        print ("{:8d}".format(i), end='')
        print ("{:8d}".format(number - train_class[i].shape[0]))

        # draw (number - class size) indices from this class, with replacement, to fill it up to 10,000
        extra = np.random.choice(train_class[i], size=number - train_class[i].shape[0])
        train_x = np.vstack((train_x, train_x[extra]))
        # append matching labels, stored as strings like the rest of train_y
        train_y = np.append(train_y, np.full(extra.shape[0], str(i)))

print ("\n", train_x.shape, train_y.shape)
train_dataset = TIMITDataset(train_x, train_y)
train_dataloader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True, drop_last=True)

Sample data:

   class   count
       6    5999
      20    1517
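
Duplicating rows is one way to handle imbalance. A different technique, shown here only as a sketch and not used in this notebook, is to weight the loss by inverse class frequency. The helper name `inverse_frequency_weights` is hypothetical; its result could be converted to a float tensor and passed as the `weight` argument of `nn.CrossEntropyLoss`.

```python
import numpy as np

def inverse_frequency_weights(counts):
    """Weights proportional to 1/count, scaled so that they average to 1."""
    counts = np.asarray(counts, dtype=np.float64)
    weights = 1.0 / np.maximum(counts, 1.0)   # guard against empty classes
    return weights * len(counts) / weights.sum()

# Counts of classes 5, 6 and 7 from the class table earlier in this notebook
demo_counts = [26380, 4038, 73827]
w = inverse_frequency_weights(demo_counts)
print(w)
```

Rare classes receive the largest weights, so their errors count more in the loss without any rows being copied. Whether this helps more than oversampling here is untested.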


In [None]:

from torch.utils.data import DataLoader

#train_set = TIMITDataset(train_x, train_y)
val_set = TIMITDataset(val_x, val_y)
#train_loader = DataLoader(train_set, batch_size=BATCH_SIZE, shuffle=True) #only shuffle the training data
val_loader = DataLoader(val_set, batch_size=BATCH_SIZE, shuffle=False)

Clean up the unneeded variables to save memory.<br>

**Note: if you need these variables later, remove this block or clean them up later.<br>The data is quite large, so be aware of memory usage in Colab.**

In [None]:
import gc

del train, train_label, train_x, train_y, val_x, val_y
gc.collect()

## Create Model

Define the model architecture. You are encouraged to change and experiment with it.

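**Notes**

1. Changed the original activation function to ReLU().
2. Added Batch Normalization to flatten the error surface, so training is smoother and less likely to get stuck at a local minimum or an odd saddle point.
3. Dropout: randomly deactivates some neurons during training.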
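
To make the dropout note concrete, here is a minimal NumPy sketch of inverted dropout, the scheme `nn.Dropout` follows: during training each element is zeroed with probability `p` and the survivors are scaled by `1 / (1 - p)`; at evaluation time the input passes through unchanged. The standalone `dropout` function is a hypothetical illustration, not the PyTorch implementation.

```python
import numpy as np

def dropout(x, p=0.5, training=True, rng=None):
    """Inverted dropout: zero each element with probability p and rescale the rest."""
    if not training or p == 0.0:
        return x
    rng = np.random.default_rng() if rng is None else rng
    keep = rng.random(x.shape) >= p          # True where the unit is kept
    return x * keep / (1.0 - p)

x = np.ones((4, 8))
train_out = dropout(x, p=0.5, training=True, rng=np.random.default_rng(0))
eval_out = dropout(x, p=0.5, training=False)
print(np.unique(train_out), eval_out.mean())
```

This is why the training loop has to call `model.train()` and `model.eval()` at the right moments: the same layer behaves differently in the two modes.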


In [None]:
import torch
import torch.nn as nn

class Classifier(nn.Module):
    def __init__(self):
        super(Classifier, self).__init__()
        self.layer1 = nn.Linear(429, 2048) #layer1 output 1024 -> 2048
        self.layer2 = nn.Linear(2048, 2048)
        self.layer3 = nn.Linear(2048, 2048)
        self.layer4 = nn.Linear(2048, 1024)
        self.layer5 = nn.Linear(1024, 512)
        self.layer6 = nn.Linear(512, 128)
        self.out = nn.Linear(128, 39) 

        # define a batch normalization layer for each hidden layer (1-6)

        self.bn1 = nn.BatchNorm1d(2048)#this is placed at the output of layer1
        self.bn2 = nn.BatchNorm1d(2048)
        self.bn3 = nn.BatchNorm1d(2048)
        self.bn4 = nn.BatchNorm1d(1024)
        self.bn5 = nn.BatchNorm1d(512)
        self.bn6 = nn.BatchNorm1d(128)

        #drop out

        self.drop = nn.Dropout(0.5)

        self.act_fn = nn.ReLU() # ReLU activation (for comparison, a single linear layer followed by a sigmoid is logistic regression)

    def forward(self, x): # changed to apply ReLU first, then batch normalization
        x = self.layer1(x)
        x = self.act_fn(x)
        x = self.bn1(x)
        x = self.drop(x)

        x = self.layer2(x)
        x = self.act_fn(x)
        x = self.bn2(x)
        x = self.drop(x)

        x = self.layer3(x)
        x = self.act_fn(x)
        x = self.bn3(x)
        x = self.drop(x)

        x = self.layer4(x)
        x = self.act_fn(x)
        x = self.bn4(x)
        x = self.drop(x)

        x = self.layer5(x)
        x = self.act_fn(x)
        x = self.bn5(x)
        x = self.drop(x)

        x = self.layer6(x)
        x = self.act_fn(x)
        x = self.bn6(x)
        x = self.drop(x)

        x = self.out(x)
        
        return x
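
To get a feel for the size of this architecture, the learnable parameters can be counted by hand from the layer widths. The sketch below uses plain Python; `count_parameters` is a hypothetical helper, and the widths are copied from the `Classifier` definition above.

```python
# Layer widths copied from the Classifier definition: input, six hidden layers, output
widths = [429, 2048, 2048, 2048, 1024, 512, 128, 39]

def count_parameters(widths):
    """Count Linear weights and biases, plus BatchNorm1d scale and shift for hidden layers."""
    linear = sum(i * o + o for i, o in zip(widths[:-1], widths[1:]))
    # BatchNorm1d learns one scale and one shift per hidden unit;
    # its running mean/variance are buffers, not learnable parameters
    batchnorm = sum(2 * o for o in widths[1:-1])
    return linear, batchnorm

linear, batchnorm = count_parameters(widths)
print("linear: {:,}  batchnorm: {:,}  total: {:,}".format(linear, batchnorm, linear + batchnorm))
```

On the real model the same total should be obtainable with `sum(p.numel() for p in model.parameters())`.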

## Training

In [None]:
#check device
def get_device():
  return 'cuda' if torch.cuda.is_available() else 'cpu'

Fix random seeds for reproducibility.

In [None]:
# fix random seed
def same_seeds(seed):
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  
    np.random.seed(seed)  
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True

Feel free to change the training parameters here.

**Change notes**
1. Increased the number of epochs from 20 to 100.
2. Added weight decay to keep the model from overfitting.

In [None]:
# fix random seed for reproducibility
same_seeds(0)

# get device 
device = get_device()
print(f'DEVICE: {device}')

# training parameters
num_epoch = 100             # number of training epoch
learning_rate = 0.0001       # learning rate
weight_decay_l1 = 0.0 
weight_decay_l2 = 0.001

# the path where checkpoint saved
model_path = './model.ckpt'

# create model, define a loss function, and optimizer
#model = Classifier().to(device)
#criterion = nn.CrossEntropyLoss() 
#optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

**Note:**
I add the L2 regularization term to the loss by hand here, to see whether it improves the results.
(To try on 9/22.)
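
Written out, the objective minimized by the training code that follows is the cross-entropy loss plus the two penalty terms, where $\lambda_1$ corresponds to `weight_decay_l1` (0.0) and $\lambda_2$ to `weight_decay_l2` (0.001). The symbols $\theta$, $\lambda_1$, and $\lambda_2$ are notation introduced here for illustration:

```latex
\mathcal{L}_{\text{total}}(\theta)
  = \mathcal{L}_{\text{CE}}(\theta)
  + \lambda_1 \sum_j \lvert \theta_j \rvert
  + \lambda_2 \sum_j \theta_j^{2}

\frac{\partial}{\partial \theta_j}\Bigl(\lambda_2 \sum_k \theta_k^{2}\Bigr)
  = 2\,\lambda_2\,\theta_j
```

Because the sums run over everything returned by `model.parameters()`, biases and batch normalization parameters are penalized as well. For comparison, the `weight_decay` argument of `torch.optim.SGD` and `torch.optim.Adam` adds `weight_decay * θ` to the gradient, so setting it to $2\lambda_2$ would contribute the same L2 gradient as the hand-written term.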

In [None]:
# L1/L2 regularization term that is added to the loss
def cal_regularization(model, weight_decay_l1, weight_decay_l2):
    l1 = 0
    l2 = 0
    for i in model.parameters():
        l1 += torch.sum(abs(i)) # L1 penalty
        l2 += torch.sum(torch.pow(i, 2)) # L2 penalty added to the loss
        # the weight_decay values act as the lambda coefficients in front
    return weight_decay_l1 * l1 + weight_decay_l2 * l2


def train_model(num_epoch, learning_rate, weight_decay_l1, weight_decay_l2,
                train_dataset, train_dataloader,
                valid_dataset, valid_dataloader):
    model = Classifier().to(device)
    criterion = nn.CrossEntropyLoss()

    best_acc = 0.0
    for epoch in range(num_epoch):
        # use Adam at first because it converges quickly, then switch to SGD with momentum, which is more stable
        if epoch == 0:
            optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
        elif epoch == 35:
            optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate, momentum=0.9)

        train_acc = 0.0
        train_loss = 0.0
        val_acc = 0.0
        val_loss = 0.0

        # training
        model.train() # set the model to training mode
        for i, data in enumerate(train_dataloader):
            inputs, labels = data
            inputs, labels = inputs.to(device), labels.to(device)
    
            optimizer.zero_grad() 
            outputs = model(inputs) 
    
            batch_loss = criterion(outputs, labels)
            _, train_pred = torch.max(outputs, 1) # get the index of the class with the highest probability
            (batch_loss + cal_regularization(model, weight_decay_l1, weight_decay_l2)).backward() # add the L1/L2 regularization terms
            # weight_decay_l1 is set to 0 above, so only L2 is active; to use L1, change the hyperparameters above

            optimizer.step() 
    
            train_acc += (train_pred.cpu() == labels.cpu()).sum().item()
            train_loss += batch_loss.item()
    
        # validation
        if len(valid_dataset) > 0:
            model.eval() # set the model to evaluation mode
            with torch.no_grad():
                for i, data in enumerate(valid_dataloader):
                    inputs, labels = data
                    inputs, labels = inputs.to(device), labels.to(device)
                    outputs = model(inputs)
                    batch_loss = criterion(outputs, labels) 
                    _, val_pred = torch.max(outputs, 1) # get the index of the class with the highest probability
                
                    val_acc += (val_pred.cpu() == labels.cpu()).sum().item()
                    val_loss += batch_loss.item()
    
                print ("[{:03d}/{:03d}] Train Acc: {:3.6f} Loss: {:3.6f} | Val Acc: {:3.6f} loss: {:3.6f}".format(
                    epoch + 1, num_epoch, train_acc / len(train_dataset), train_loss / len(train_dataloader), val_acc / len(valid_dataset), val_loss / len(valid_dataloader)
                ))
    
                # if the model improves, save a checkpoint at this epoch
                if val_acc > best_acc:
                    best_acc = val_acc
                    torch.save(model.state_dict(), model_path)
                    print ("saving model with acc {:.3f}".format(best_acc / len(valid_dataset)))
        else:
            print("[{:03d}/{:03d}] Train Acc: {:3.6f} Loss: {:3.6f}".format(
                epoch + 1, num_epoch, train_acc / len(train_dataset), train_loss / len(train_dataloader)
            ))

    # if not validating, save the last epoch
    if len(valid_dataset) == 0:
        torch.save(model.state_dict(), model_path)
        print("saving model at last epoch")


In [None]:
# the validation set and loader were created above as val_set and val_loader
train_model(num_epoch, learning_rate, weight_decay_l1, weight_decay_l2, train_dataset, train_dataloader, val_set, val_loader)

# train_x and train_y were already deleted in the cleanup cell
del train_dataset, train_dataloader, val_set, val_loader
gc.collect()

## Testing

Create a testing dataset, and load model from the saved checkpoint.

In [None]:
# create testing dataset
test_set = TIMITDataset(test, None)
test_loader = DataLoader(test_set, batch_size=BATCH_SIZE, shuffle=False)

# create model and load weights from checkpoint
model = Classifier().to(device)
model.load_state_dict(torch.load(model_path))

Make prediction.

In [None]:
predict = []
model.eval() # set the model to evaluation mode
with torch.no_grad():
    for i, data in enumerate(test_loader):
        inputs = data
        inputs = inputs.to(device)
        outputs = model(inputs)
        _, test_pred = torch.max(outputs, 1) # get the index of the class with the highest probability

        for y in test_pred.cpu().numpy():
            predict.append(y)

Write prediction to a CSV file.

After this block finishes running, download the file `prediction.csv` from the files section on the left-hand side and submit it to Kaggle.

In [None]:
with open('prediction.csv', 'w') as f:
    f.write('Id,Class\n')
    for i, y in enumerate(predict):
        f.write('{},{}\n'.format(i, y))
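
The notebook imports `csv` but writes the file with plain string formatting. As an optional sketch, the same `Id,Class` layout can be produced with `csv.writer`. The function `write_predictions` is a hypothetical helper, and an in-memory buffer stands in for the real file so the example has no side effects:

```python
import csv
import io

def write_predictions(predictions, fileobj):
    """Write an 'Id,Class' header followed by one row per prediction."""
    writer = csv.writer(fileobj, lineterminator='\n')
    writer.writerow(['Id', 'Class'])
    for i, y in enumerate(predictions):
        writer.writerow([i, int(y)])   # int() also accepts NumPy integer scalars

buffer = io.StringIO()   # stands in for open('prediction.csv', 'w', newline='')
write_predictions([3, 0, 38], buffer)
print(buffer.getvalue())
```

For a file with two integer columns both approaches give the same bytes, so this is a matter of taste rather than a fix.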