<a href="https://colab.research.google.com/github/ga642381/ML2021-Spring/blob/main/HW02/HW02-1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Homework 2-1 Phoneme Classification**

* Slides: https://speech.ee.ntu.edu.tw/~hylee/ml/ml2021-course-data/hw/HW02/HW02.pdf
* Video (Chinese): https://youtu.be/PdjXnQbu2zo
* Video (English): https://youtu.be/ESRr-VCykBs


## The DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus (TIMIT)
The TIMIT corpus of read speech was designed to provide speech data for acquiring acoustic-phonetic knowledge and for developing and evaluating automatic speech recognition systems.

This homework is a multiclass classification task: we are going to train a deep neural network classifier to predict the phoneme of each frame of speech from the TIMIT corpus.

Dataset link: https://academictorrents.com/details/34e2b78745138186976cbc27939b1b34d18bd5b3

## Download Data
Download the data from Google Drive, then unzip it.

You should have `timit_11/train_11.npy`, `timit_11/train_label_11.npy`, and `timit_11/test_11.npy` after running this block.<br><br>
`timit_11/`
- `train_11.npy`: training data<br>
- `train_label_11.npy`: training labels<br>
- `test_11.npy`: testing data<br><br>

**Note: if the Google Drive link is dead, you can download the data directly from Kaggle and upload it to the workspace.**




In [1]:
# !gdown --id '1HPkcmQmFGu-3OknddKIa5dNDsR05lIQR' --output data.zip
# !unzip data.zip
!ls 

HW02-1.ipynb  HW02.pdf		    model.ckpt	    sampleSubmission.csv
HW02-2.ipynb  ml2021spring-hw2.zip  prediction.csv  timit_11


## Preparing Data
Load the training and testing data from the `.npy` file (NumPy array).

In [2]:
import numpy as np

print('Loading data ...')

data_root='./timit_11/timit_11/'
train = np.load(data_root + 'train_11.npy')
train_label = np.load(data_root + 'train_label_11.npy')
test = np.load(data_root + 'test_11.npy')

print('Size of training data: {}'.format(train.shape))
print('Size of testing data: {}'.format(test.shape))

Loading data ...
Size of training data: (1229932, 429)
Size of testing data: (451552, 429)
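
The 429 columns can be read as a window of frames. The sketch below assumes, as the `_11` suffix suggests and as the course material describes, that each row concatenates 11 consecutive frames of 39 MFCC features (11 x 39 = 429), stored frame after frame. The layout and the constant names are assumptions, and the array is random toy data rather than the real corpus.

```python
import numpy as np

# Toy stand-in for the real array: 4 samples with 429 features each.
rng = np.random.default_rng(0)
toy = rng.standard_normal((4, 429)).astype(np.float32)

N_FRAMES, N_MFCC = 11, 39  # assumption: 11 context frames x 39 MFCC features
assert N_FRAMES * N_MFCC == toy.shape[1]

# Assumed layout: the frames are concatenated one after another.
frames = toy.reshape(-1, N_FRAMES, N_MFCC)
# Under this assumption the middle frame is the one being classified.
center = frames[:, N_FRAMES // 2, :]
print(frames.shape, center.shape)
```

If the layout assumption holds, the same reshape would let a convolutional or recurrent model see the window as a short sequence instead of a flat vector.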


## Create Dataset

In [3]:
import torch
from torch.utils.data import Dataset

class TIMITDataset(Dataset):
    def __init__(self, X, y=None):
        self.data = torch.from_numpy(X).float()
        if y is not None:
            y = y.astype(np.int64)  # the np.int alias was removed in NumPy 1.24
            self.label = torch.LongTensor(y)
        else:
            self.label = None

    def __getitem__(self, idx):
        if self.label is not None:
            return self.data[idx], self.label[idx]
        else:
            return self.data[idx]

    def __len__(self):
        return len(self.data)


In [4]:
# fix random seed
def same_seeds(seed):
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  
    np.random.seed(seed)  
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True

In [5]:
# fix random seed for reproducibility
same_seeds(42)
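
A quick way to see what seeding buys is to draw random numbers twice after re-seeding. This is a minimal sketch; the helper `draw` is hypothetical and only covers the PyTorch CPU generator and NumPy's global generator.

```python
import numpy as np
import torch

def draw(seed):
    # Re-seed both generators, then draw a few random numbers from each.
    torch.manual_seed(seed)
    np.random.seed(seed)
    return torch.rand(3), np.random.rand(3)

t1, n1 = draw(42)
t2, n2 = draw(42)
same_torch = torch.equal(t1, t2)
same_numpy = np.array_equal(n1, n2)
print(same_torch, same_numpy)
```

Identical draws after identical seeds are what make the shuffling, the oversampling, and the weight initialization repeatable between runs.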

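Before splitting, the next two cells oversample the rare classes so that every class has at least `number` samples. The labeled data is then split into a training set and a validation set. You can modify the variable `VAL_RATIO` to change the proportion of validation data.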

In [6]:
# Collect the sample indices of each class; classes with few samples will be reused (oversampled)
train_class = []
id = np.arange(train.shape[0])
for i in range(39):
    train_class.append(id[train_label == str(i)])

In [7]:
number = 10000       # every class gets at least 10000 samples
print ("Sample data:")
print ("\n   class   count")
for i in range(len(train_class)):
    print ("{:8d}".format(i), end='')
    print ("{:8d}".format(train_class[i].shape[0]))
    if (train_class[i].shape[0] < number):

        id = np.random.choice(train_class[i], size=number-train_class[i].shape[0])
        train = np.vstack((train, train[id]))
        label = np.empty((id.shape[0]), dtype=int)
        train_label = np.append(train_label, label)
        train_label[-id.shape[0]:] = int(i)

Sample data:

   class   count
       0   62708
       1   83746
       2   35048
       3   59031
       4   38930
       5   26380
       6    4038
       7   73827
       8   28797
       9   34289
      10   11028
      11   11711
      12   26790
      13   43410
      14   39583
      15   11342
      16   20922
      17   51533
      18   24938
      19   47059
      20    8508
      21    7083
      22    7050
      23   10663
      24    3883
      25    8219
      26    7825
      27    6059
      28   11492
      29   21012
      30   25094
      31   31618
      32   12003
      33   22907
      34    6920
      35   84521
      36   27088
      37   14164
      38  178713


In [8]:
from sklearn.model_selection import train_test_split
# shuffle randomly and split
VAL_RATIO = 0.2

train_indices, valid_indices = train_test_split([i for i in range(train.shape[0])], test_size=VAL_RATIO, random_state=42)
train_x = train[train_indices, :]
train_y = train_label[train_indices]
val_x = train[valid_indices, :]
val_y = train_label[valid_indices]

print('Size of training set: {}'.format(train_x.shape))
print('Size of validation set: {}'.format(val_x.shape))

Size of training set: (1008277, 429)
Size of validation set: (252070, 429)
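
The reported sizes can be checked by hand from the per-class counts printed earlier. The sketch below assumes the oversampling threshold of 10000 used above, and that `train_test_split` rounds the validation size up and gives the remainder to the training set.

```python
import math

# Per-class counts copied from the table printed above (classes 0..38).
counts = [
    62708, 83746, 35048, 59031, 38930, 26380, 4038, 73827, 28797, 34289,
    11028, 11711, 26790, 43410, 39583, 11342, 20922, 51533, 24938, 47059,
    8508, 7083, 7050, 10663, 3883, 8219, 7825, 6059, 11492, 21012,
    25094, 31618, 12003, 22907, 6920, 84521, 27088, 14164, 178713,
]
MIN_PER_CLASS = 10000
VAL_RATIO = 0.2

original = sum(counts)
# Rows added by oversampling: the shortfall of every class below the threshold.
added = sum(MIN_PER_CLASS - c for c in counts if c < MIN_PER_CLASS)
total = original + added
n_val = math.ceil(VAL_RATIO * total)  # assumption: validation size rounds up
n_train = total - n_val
print(original, added, total, n_train, n_val)
```

Under these assumptions the numbers agree with the shapes printed by the notebook: 1229932 original rows, 30415 added rows, and a 1008277 / 252070 split.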


Create a data loader from the dataset. Feel free to tweak the variable `BATCH_SIZE` here.

In [9]:
BATCH_SIZE = 2048

from torch.utils.data import DataLoader

train_set = TIMITDataset(train_x, train_y)
val_set = TIMITDataset(val_x, val_y)
train_loader = DataLoader(train_set, batch_size=BATCH_SIZE, shuffle=True) #only shuffle the training data
val_loader = DataLoader(val_set, batch_size=BATCH_SIZE, shuffle=False)
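
As a rough guide to what `BATCH_SIZE` implies, the number of batches per epoch is the dataset size divided by the batch size, rounded up, because the last batch may be smaller. The sketch below shows this on a toy `TensorDataset` (a stand-in, not the notebook's `TIMITDataset`) and then repeats the arithmetic for the set sizes reported above.

```python
import math
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset: 10 samples with batch size 4 gives batches of 4, 4 and 2.
toy_set = TensorDataset(torch.zeros(10, 429), torch.zeros(10, dtype=torch.long))
toy_loader = DataLoader(toy_set, batch_size=4, shuffle=False)
batch_sizes = [x.shape[0] for x, _ in toy_loader]
print(len(toy_loader), batch_sizes)

# Same arithmetic for the set sizes reported in this notebook.
train_batches = math.ceil(1008277 / 2048)
val_batches = math.ceil(252070 / 2048)
print(train_batches, val_batches)
```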

Clean up the unneeded variables to save memory.<br>

**Note: if you need these variables later, remove this block or clean them up later.<br>The data is quite large, so be aware of memory usage in Colab.**

In [10]:
import gc

del train, train_label, train_x, train_y, val_x, val_y, train_class, train_indices, valid_indices
gc.collect()

40

## Create Model

Define the model architecture. You are encouraged to change and experiment with it.

In [11]:
import torch
import torch.nn as nn

class Classifier(nn.Module):
    def __init__(self):
        super(Classifier, self).__init__()
        self.net = nn.Sequential(
            nn.BatchNorm1d(429),
            nn.Linear(429, 1024),
            nn.ReLU(),
            nn.BatchNorm1d(1024),
            nn.Dropout(p = 0.2),
            
            nn.Linear(1024, 512),
            nn.ReLU(),
            nn.BatchNorm1d(512),
#             nn.Dropout(p = 0.1),
            
            nn.Linear(512, 256),
            nn.ReLU(),
            nn.BatchNorm1d(256),
#             nn.Dropout(p = 0.1),
            
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.BatchNorm1d(128),
#             nn.Dropout(p = 0.1),
            
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.BatchNorm1d(64),
#             nn.Dropout(p = 0.1),
            
            nn.Linear(64, 39),
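            # Note: nn.CrossEntropyLoss expects raw logits and applies log-softmax itself;
            # this Sigmoid squashes the scores into (0, 1) before the loss sees them.
            nn.Sigmoid()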
        )

    def forward(self, x):
        return self.net(x)
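
Because the network ends in `nn.Sigmoid()`, every score handed to `nn.CrossEntropyLoss` lies in (0, 1). A minimal sketch of what that implies: the best case is a score of 1 for the correct class and 0 for the other 38, which gives a loss floor of ln(e + 38) - 1, roughly 2.71. The variable names are illustrative only.

```python
import math
import torch
import torch.nn.functional as F

NUM_CLASSES = 39

# Best case when every score is confined to [0, 1]:
# the target class sits at 1 and all other classes sit at 0.
logits = torch.zeros(1, NUM_CLASSES)
logits[0, 0] = 1.0
target = torch.tensor([0])

bounded_floor = F.cross_entropy(logits, target).item()
closed_form = math.log(math.e + (NUM_CLASSES - 1)) - 1.0
uniform_loss = math.log(NUM_CLASSES)  # loss of a uniform guess over 39 classes
print(round(bounded_floor, 4), round(closed_form, 4), round(uniform_loss, 4))
```

This would be consistent with the losses logged during training staying close to 2.9 even as accuracy improves. Returning the raw output of the last `nn.Linear` is the usual pairing with `nn.CrossEntropyLoss`, though doing so would change the numbers shown in this notebook.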

## Training

In [12]:
#check device
def get_device():
  return 'cuda' if torch.cuda.is_available() else 'cpu'

The random seeds were already fixed above with `same_seeds(42)`.

Feel free to change the training parameters here.

In [13]:
# get device 
device = get_device()
print(f'DEVICE: {device}')

# training parameters
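num_epoch = 100               # number of training epochs
learning_rate = 0.0001       # learning rate
weight_decay_l1 = 0.0        # coefficient of the extra L1 penalty (0.0 disables it)
weight_decay_l2 = 0.0        # coefficient of the extra L2 penalty (0.0 disables it)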

# path where the checkpoint is saved
model_path = './model.ckpt'

# create model, define a loss function, and optimizer
model = Classifier().to(device)
criterion = nn.CrossEntropyLoss() 
# the optimizer is created inside the training loop (Adam at first, SGD once epoch == 25)
# optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate, momentum=0.9)

DEVICE: cuda


In [14]:
def cal_regularization(model, weight_decay_l1, weight_decay_l2):
    l1 = 0
    l2 = 0
    for i in model.parameters():
        l1 += torch.sum(abs(i))
        l2 += torch.sum(torch.pow(i, 2))
    return weight_decay_l1 * l1 + weight_decay_l2 * l2
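
To make the penalty concrete, here is a small worked example on a single linear layer with hand-set weights. The helper `l1_l2_penalty` is a hypothetical stand-in that follows the same idea as `cal_regularization`: sum the absolute values and the squares of all parameters, then weight the two sums.

```python
import torch
import torch.nn as nn

def l1_l2_penalty(model, l1_coef, l2_coef):
    # Hypothetical helper: sum |w| and w^2 over all parameters, then weight them.
    l1 = sum(p.abs().sum() for p in model.parameters())
    l2 = sum(p.pow(2).sum() for p in model.parameters())
    return l1_coef * l1 + l2_coef * l2

layer = nn.Linear(2, 1)
with torch.no_grad():
    layer.weight.copy_(torch.tensor([[1.0, -2.0]]))
    layer.bias.copy_(torch.tensor([3.0]))

# L1 sum = 1 + 2 + 3 = 6, L2 sum = 1 + 4 + 9 = 14
penalty = l1_l2_penalty(layer, l1_coef=0.1, l2_coef=0.01)   # 0.1 * 6 + 0.01 * 14
zero_penalty = l1_l2_penalty(layer, l1_coef=0.0, l2_coef=0.0)
print(penalty.item(), zero_penalty.item())
```

With both coefficients at 0.0, as in the training parameters above, the term adds nothing to the loss. Note that the sums include biases and batch-norm parameters, since they iterate over every parameter.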

In [15]:
# start training

best_acc = 0.0
for epoch in range(num_epoch):
    train_acc = 0.0
    train_loss = 0.0
    val_acc = 0.0
    val_loss = 0.0
    if epoch == 0:
        optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate, weight_decay=0.002)
    elif epoch == 25:
        optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate, momentum=0.9, weight_decay=0.01)
    # training
    model.train() # set the model to training mode
    for i, data in enumerate(train_loader):
        inputs, labels = data
        inputs, labels = inputs.to(device), labels.to(device)
        optimizer.zero_grad() 
        outputs = model(inputs) 
        batch_loss = criterion(outputs, labels)
        _, train_pred = torch.max(outputs, 1) # get the index of the class with the highest probability
        (batch_loss + cal_regularization(model, weight_decay_l1, weight_decay_l2)).backward()
        optimizer.step() 

        train_acc += (train_pred.cpu() == labels.cpu()).sum().item()
        train_loss += batch_loss.item()

    # validation
    if len(val_set) > 0:
        model.eval() # set the model to evaluation mode
        with torch.no_grad():
            for i, data in enumerate(val_loader):
                inputs, labels = data
                inputs, labels = inputs.to(device), labels.to(device)
                outputs = model(inputs)
                batch_loss = criterion(outputs, labels) 
                _, val_pred = torch.max(outputs, 1) # get the index of the class with the highest probability

                val_acc += (val_pred.cpu() == labels.cpu()).sum().item()
                val_loss += batch_loss.item()

            print('[{:03d}/{:03d}] Train Acc: {:3.6f} Loss: {:3.6f} | Val Acc: {:3.6f} loss: {:3.6f}'.format(
                epoch + 1, num_epoch, train_acc/len(train_set), train_loss/len(train_loader), val_acc/len(val_set), val_loss/len(val_loader)
            ))

            # if the model improves, save a checkpoint at this epoch
            if val_acc > best_acc:
                best_acc = val_acc
                torch.save(model.state_dict(), model_path)
                print('saving model with acc {:.3f}'.format(best_acc/len(val_set)))
    else:
        print('[{:03d}/{:03d}] Train Acc: {:3.6f} Loss: {:3.6f}'.format(
            epoch + 1, num_epoch, train_acc/len(train_set), train_loss/len(train_loader)
        ))

# if not validating, save the last epoch
if len(val_set) == 0:
    torch.save(model.state_dict(), model_path)
    print('saving model at last epoch')


[001/100] Train Acc: 0.485024 Loss: 3.304951 | Val Acc: 0.551755 loss: 3.219815
saving model with acc 0.552
[002/100] Train Acc: 0.548813 Loss: 3.180503 | Val Acc: 0.550867 loss: 3.142329
[003/100] Train Acc: 0.543141 Loss: 3.099175 | Val Acc: 0.548915 loss: 3.056462
[004/100] Train Acc: 0.559803 Loss: 3.017860 | Val Acc: 0.585389 loss: 2.983200
saving model with acc 0.585
[005/100] Train Acc: 0.590321 Loss: 2.958682 | Val Acc: 0.609462 loss: 2.936193
saving model with acc 0.609
[006/100] Train Acc: 0.618624 Loss: 2.924558 | Val Acc: 0.636383 loss: 2.911903
saving model with acc 0.636
[007/100] Train Acc: 0.639625 Loss: 2.904064 | Val Acc: 0.654394 loss: 2.895743
saving model with acc 0.654
[008/100] Train Acc: 0.654365 Loss: 2.890891 | Val Acc: 0.665847 loss: 2.885902
saving model with acc 0.666
[009/100] Train Acc: 0.665669 Loss: 2.881559 | Val Acc: 0.671401 loss: 2.878563
saving model with acc 0.671
[010/100] Train Acc: 0.674911 Loss: 2.874346 | Val Acc: 0.684449 loss: 2.871754
savi

[087/100] Train Acc: 0.777322 Loss: 2.883220 | Val Acc: 0.760856 loss: 2.899426
[088/100] Train Acc: 0.778617 Loss: 2.884254 | Val Acc: 0.761134 loss: 2.900215
saving model with acc 0.761
[089/100] Train Acc: 0.778229 Loss: 2.885338 | Val Acc: 0.760229 loss: 2.901222
[090/100] Train Acc: 0.778235 Loss: 2.886339 | Val Acc: 0.760999 loss: 2.902770
[091/100] Train Acc: 0.778814 Loss: 2.887333 | Val Acc: 0.760094 loss: 2.903592
[092/100] Train Acc: 0.778743 Loss: 2.888431 | Val Acc: 0.761098 loss: 2.904687
[093/100] Train Acc: 0.778813 Loss: 2.889554 | Val Acc: 0.760725 loss: 2.905674
[094/100] Train Acc: 0.778713 Loss: 2.890544 | Val Acc: 0.761348 loss: 2.906827
saving model with acc 0.761
[095/100] Train Acc: 0.779567 Loss: 2.891590 | Val Acc: 0.762074 loss: 2.907552
saving model with acc 0.762
[096/100] Train Acc: 0.779023 Loss: 2.892600 | Val Acc: 0.761332 loss: 2.909085
[097/100] Train Acc: 0.779704 Loss: 2.893586 | Val Acc: 0.761431 loss: 2.909833
[098/100] Train Acc: 0.779609 Loss: 
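
The accuracy figures in the log come from counting correct predictions per batch and dividing by the dataset size. The loss figures are the mean of the per-batch losses. A minimal sketch of the counting step on a hand-made batch (the scores and labels are toy values):

```python
import torch

# Toy scores for 4 samples and 3 classes.
outputs = torch.tensor([[0.1, 0.7, 0.2],
                        [0.8, 0.1, 0.1],
                        [0.3, 0.3, 0.4],
                        [0.2, 0.5, 0.3]])
labels = torch.tensor([1, 0, 1, 1])

# torch.max along dim 1 returns (max values, indices); the indices are the predictions.
_, pred = torch.max(outputs, 1)
correct = (pred == labels).sum().item()
accuracy = correct / len(labels)
print(pred.tolist(), correct, accuracy)
```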

## Testing

Create a testing dataset, and load the model from the saved checkpoint.

In [16]:
# create testing dataset
test_set = TIMITDataset(test, None)
test_loader = DataLoader(test_set, batch_size=BATCH_SIZE, shuffle=False)

# create model and load weights from checkpoint
model = Classifier().to(device)
model.load_state_dict(torch.load(model_path, map_location=device))

<All keys matched successfully>
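
The save and load round trip can be tried in isolation with a tiny model. This is a sketch under stated assumptions: the two `nn.Linear` layers and the file name `toy.ckpt` are placeholders, not the notebook's `Classifier` or `model.ckpt`.

```python
import os
import tempfile
import torch
import torch.nn as nn

src = nn.Linear(4, 2)
dst = nn.Linear(4, 2)  # same architecture, different random weights

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy.ckpt')  # hypothetical file name
    torch.save(src.state_dict(), path)
    # map_location='cpu' lets a checkpoint saved on a GPU load on a CPU-only machine.
    state = torch.load(path, map_location='cpu')

result = dst.load_state_dict(state)
weights_match = torch.equal(src.weight, dst.weight) and torch.equal(src.bias, dst.bias)
print(result, weights_match)
```

`load_state_dict` only copies tensors into an existing module, so the model must be constructed with the same architecture before loading, as the cell above does with `Classifier()`.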

Make predictions.

In [17]:
predict = []
model.eval() # set the model to evaluation mode
with torch.no_grad():
    for i, data in enumerate(test_loader):
        inputs = data
        inputs = inputs.to(device)
        outputs = model(inputs)
        _, test_pred = torch.max(outputs, 1) # get the index of the class with the highest probability

        for y in test_pred.cpu().numpy():
            predict.append(y)

Write prediction to a CSV file.

After running this block, download the file `prediction.csv` from the files section on the left-hand side and submit it to Kaggle.

In [18]:
with open('prediction.csv', 'w') as f:
    f.write('Id,Class\n')
    for i, y in enumerate(predict):
        f.write('{},{}\n'.format(i, y))
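
An equivalent way to produce the same `Id,Class` layout is the standard `csv` module. The sketch below writes to an in-memory buffer and reads the result back as a sanity check; the `predict` list holds toy values, not real model output.

```python
import csv
import io

predict = [38, 38, 0, 7]  # toy predictions, not real model output

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator='\n')
writer.writerow(['Id', 'Class'])
writer.writerows(enumerate(predict))  # each row is (index, predicted class)
text = buffer.getvalue()

# Read the text back to confirm the header and the number of rows.
rows = list(csv.reader(io.StringIO(text)))
print(rows)
```

Checking that the file has one header row plus one row per test sample (451552 for this dataset) before uploading can catch a truncated prediction list early.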