# HW1: Frame-Level Speech Recognition

In this homework, you will be working with MFCC data consisting of 28 features at each time step/frame. Your model should be able to recognize the phoneme that occurred in that frame.

To run this notebook, run the cells in order. Note that this notebook was run in a Kaggle notebook, so you will have to add the competition dataset. The notebook is made up of the following sections:
- Install and Import libraries
- Install Kaggle API
- Datasets and Dataloaders
- Parameter configuration
- Network architecture
- Criterion and optimizer
- Training and validation functions
- Training and logging checkpoint and hyperparameters on WandB
- Testing and Kaggle submission
#### Dataset and dataloaders
To improve memory efficiency, I preallocated two zero-filled arrays: one with the shape of the final concatenated MFCCs plus context padding, and one for the transcripts. Appending each utterance's frames to a list and then concatenating consumes almost double the memory, so instead I loaded each utterance and wrote it directly into the preallocated array using indexing and slicing.
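
The two-pass idea can be sketched as follows. This is a minimal illustration, with small in-memory arrays standing in for the `.npy` files; `fill_preallocated` and its arguments are hypothetical names that do not appear in the notebook.

```python
import numpy as np

def fill_preallocated(utterances, context, n_features=28):
    """Copy variable-length utterances into one zero-padded buffer."""
    # Pass 1: total number of frames across all utterances.
    total = sum(u.shape[0] for u in utterances)
    # `context` rows of zeros are reserved at each end of the buffer.
    buffer = np.zeros((total + 2 * context, n_features), dtype=np.float32)
    # Pass 2: write each utterance into its slot, with no intermediate list.
    cursor = context
    for u in utterances:
        buffer[cursor:cursor + u.shape[0]] = u
        cursor += u.shape[0]
    return buffer

# Hypothetical toy data: three "utterances" of 4, 2 and 5 frames.
utterances = [np.full((n, 28), i + 1, dtype=np.float32)
              for i, n in enumerate([4, 2, 5])]
buffer = fill_preallocated(utterances, context=3)
print(buffer.shape)
```

Only the final buffer is ever held in memory, plus one utterance at a time, which is where the saving over list-then-concatenate comes from.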
#### Architectures
Here are a few of the many architectures that I tried. All of them used a fixed learning rate of 1e-3.
- Diamond architecture: 
- - Activation: ReLU 
- - Optimizer: ADAM
- - Context: 20
- - 3 hidden layers with width 512, 1024, 512
- - Results: 76%
- Diamond architecture: 
- - Activation: ReLU 
- - Optimizer: ADAM
- - Context: 20
- - 5 hidden layers with width 512, 1024, 2048, 1024, 512
- - Results: 79.5%
- Diamond architecture: 
- - Activation: ReLU 
- - Optimizer: ADAM
- - Batch normalization after every layer
- - Dropout of 0.5 after every layer
- - Context: 30
- - 5 hidden layers with width 512, 1024, 2048, 1024, 512
- - Results: 81% accuracy on validation but terrible on Kaggle test set
- Multi-stage Diamond architecture: 
- - Activation: ReLU  
- - Optimizer: ADAM 
- - Batchnorm after every alternate layer
- - Context: 30
- - 7 hidden layers of width 1024, 2048, 2048, 1024, 2024, 512, 512
- - Results: Overfit
- Cylinder architecture of width 2048: 
- - Activation: SiLU
- - Optimizer: ADAMW
- - weight_decay: 1e-4
- - 5 hidden layers of width 2048
- - dropout of 0.2 after every layer
- - Context: 30
- - Batchnorm after every alternate layer
- - Results: stuck at 84.5% accuracy.
- Cylinder architecture: 
- - Activation: SiLU
- - Optimizer: ADAMW
- - weight_decay: 1e-4
- - 4 hidden layers of width 3072
- - dropout of 0.2 after every layer
- - Batchnorm after every alternate layer
- - Results: stuck around 82% accuracy.
- Cylinder architecture of width 2048: 
- - Activation: SiLU
- - Optimizer: ADAMW
- - weight_decay: 1e-4
- - batch size: 8000
- - Context: 30
- - 6 hidden layers of width 2048
- - Dropout of 0.2 after every layer
- - Batchnorm after every alternate layer
- - Results: reached around 83% accuracy, then started overfitting
##### The following model is the one that gave me the best results

- Cylinder architecture: 
- - Activation: ReLu 
- - Optimizer: ADAMW
- - Batch size: 8192
- - Epochs: 80
- - Context: 30
- - weight_decay: 1e-4
- - 6 hidden layers of width 2048
- - Dropout of 0.3 for the first 2 layers then 0.2 for the rest
- - Batchnorm after every alternate layer
- - 24,580,138 parameters
- - Results: reached around 86.2% validation accuracy.

When training these models, I saved three checkpoints: the model with the lowest validation loss, the model with the best validation accuracy, and the current model. The best-accuracy model was used for test prediction.
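
The bookkeeping behind those three checkpoints can be sketched in plain Python. This is an illustration only: `update_best` and the checkpoint labels are hypothetical, the metric values are made up, and the actual saving (for example with `torch.save`) is left out so the logic stands on its own.

```python
def update_best(val_loss, val_acc, best):
    """Return the names of the checkpoints that should be rewritten this epoch."""
    to_save = ["current"]              # the latest weights are always saved
    if val_loss < best["loss"]:
        best["loss"] = val_loss        # remember the new lowest loss
        to_save.append("best_loss")
    if val_acc > best["acc"]:
        best["acc"] = val_acc          # remember the new highest accuracy
        to_save.append("best_acc")
    return to_save

best = {"loss": float("inf"), "acc": 0.0}
history = [(0.90, 0.70), (0.80, 0.69), (0.85, 0.72)]   # made-up (val_loss, val_acc)
for val_loss, val_acc in history:
    print(update_best(val_loss, val_acc, best))
```

Each tracker is updated only by its own metric, so an epoch with a lower loss but no accuracy gain rewrites the best-loss checkpoint and leaves the best-accuracy one untouched.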


# Libraries

In [1]:
!pip install torchsummaryX wandb --quiet

In [None]:
import torch
from torch.cuda.amp import autocast, GradScaler
import numpy as np
from torchsummaryX import summary
import sklearn
import gc
import zipfile
import pandas as pd
from tqdm.auto import tqdm
import os
import datetime
import wandb
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print("Device: ", device)

In [3]:
# ### If you are using colab, you can import google drive to save model checkpoints in a folder
# from google.colab import drive
# drive.mount('/content/drive')

In [4]:
### PHONEME LIST
PHONEMES = [
            '[SIL]',   'AA',    'AE',    'AH',    'AO',    'AW',    'AY',
            'B',     'CH',    'D',     'DH',    'EH',    'ER',    'EY',
            'F',     'G',     'HH',    'IH',    'IY',    'JH',    'K',
            'L',     'M',     'N',     'NG',    'OW',    'OY',    'P',
            'R',     'S',     'SH',    'T',     'TH',    'UH',    'UW',
            'V',     'W',     'Y',     'Z',     'ZH',    '[SOS]', '[EOS]']

# Kaggle

This section contains code that installs Kaggle's API and creates kaggle.json with your username and API key. Make sure to fill those in so that you can download the competition data successfully.

In [None]:
!pip install --upgrade --force-reinstall --no-deps kaggle==1.5.8
!mkdir -p /root/.kaggle

with open("/root/.kaggle/kaggle.json", "w+") as f:
    f.write('{"username":"gbwiraye","key":"<Kaggle API key>"}')
    # Put your kaggle username & key here

!chmod 600 /root/.kaggle/kaggle.json

In [6]:
# # commands to download data from kaggle

# !kaggle competitions download -c 11785-hw1p2-f23
# !mkdir '/kaggle/working/data'

# !unzip -qo /kaggle/working/11785-hw1p2-f23.zip -d '/kaggle/working/data'

# Dataset

This section covers the dataset/dataloader class for speech data. You will have to spend time writing code to create this class successfully. We have given you a lot of comments guiding you on what code to write at each stage, from top to bottom of the class. Please try and take your time figuring this out, as it will immensely help in creating dataset/dataloader classes for future homeworks.

Before running the following cells, please take some time to analyse the structure of the data. Try loading a single MFCC and its transcript, then print out the shapes and the values. Do the transcripts look like phonemes?
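
A quick sanity check along those lines might look like the sketch below. The arrays are made-up stand-ins for what one `np.load(...)` call on an MFCC file and on a transcript file would return, and `check_alignment` is a hypothetical helper. It assumes, as the dataset class does, that each transcript is bracketed by `[SOS]` and `[EOS]` and otherwise has one label per frame.

```python
import numpy as np

def check_alignment(mfcc, transcript):
    """Return (num_frames, num_labels) after dropping the [SOS]/[EOS] markers."""
    labels = transcript[1:-1]
    assert mfcc.ndim == 2, "expected a (frames, features) array"
    assert mfcc.shape[0] == len(labels), "expected one label per frame"
    return mfcc.shape[0], len(labels)

# Made-up data: 5 frames of 28 features and a matching transcript.
mfcc = np.random.randn(5, 28).astype(np.float32)
transcript = np.array(['[SOS]', '[SIL]', 'HH', 'AH', 'AH', '[SIL]', '[EOS]'])

print(mfcc.shape, transcript.shape)
print(transcript)
print(check_alignment(mfcc, transcript))
```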

In [7]:
class AudioDataset(torch.utils.data.Dataset):

    def __init__(self, root, phonemes = PHONEMES, context =0, partition = "train-clean-100"):

        self.context = context
        self.phonemes = phonemes

        # MFCC directory - use partition to access train/dev directories from kaggle data using root
        self.mfcc_dir       = os.path.join(root, partition,'mfcc')
        # Transcripts directory - use partition to access train/dev directories from kaggle data using root
        self.transcript_dir = os.path.join(root, partition,'transcript')

        # List files in self.mfcc_dir using os.listdir in sorted order
        mfcc_names          = sorted(os.listdir(self.mfcc_dir))
        # List files in self.transcript_dir using os.listdir in sorted order
        transcript_names    = sorted(os.listdir(self.transcript_dir))

        # Making sure that we have the same no. of mfcc and transcripts
        assert len(mfcc_names) == len(transcript_names)

        length = len(mfcc_names)
        T = 0

        for i in range(length):
            #   Load a single mfcc
            mfcc  = np.load(os.path.join(self.mfcc_dir, mfcc_names[i]))
            # Extract the length of the mfcc
            T += mfcc.shape[0]

        self.mfccs = np.zeros((2 * self.context + T, 28),dtype = np.float32)
        self.transcripts = np.zeros((T,), dtype = np.uint8)
        # Encoding phonemes
        PHONEMES_endoded = {p:i for i, p in enumerate(self.phonemes)}

        cy, cx = 0,self.context

        for i in range(length):
            #   Load a single mfcc
            mfcc        = np.load(os.path.join(self.mfcc_dir, mfcc_names[i]))
            #   Do Cepstral Normalization of mfcc (explained in writeup)
            mfcc = (mfcc-np.mean(mfcc, axis=0)) / np.std(mfcc, axis=0) #Added axis =0, as advised by the TA on Piazza
            #Watch out for an issue with the datatype, needs to be tensors not numpy arrays
            #   Load the corresponding transcript and Remove [SOS] and [EOS] from the transcript
            transcript  = np.load(os.path.join(self.transcript_dir, transcript_names[i]))[1:-1]
            #  Convert transcript to a sequence of integers based on self.phonemes
            y = [PHONEMES_endoded[phoneme] for phoneme in transcript]

            self.mfccs[cx:cx + mfcc.shape[0]] = mfcc
            self.transcripts[cy:cy + mfcc.shape[0]] = y

            cx += mfcc.shape[0]
            cy += mfcc.shape[0]

        self.length = T

    def __len__(self):
        return self.length

    def __getitem__(self, ind):

        # Based on context and offset, return a frame at given index with context frames to the left, and right.
        frames = self.mfccs[ind : ind + 2 * self.context + 1]
        # After slicing, you get an array of shape 2*context+1 x 28. But our MLP needs 1d data and not 2d.
        frames = frames.flatten() # Flatten to get 1d data

        frames      = torch.FloatTensor(frames) # Convert to tensors
        phonemes    = torch.tensor(self.transcripts[ind])

        return frames, phonemes
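
The indexing in `__getitem__` relies on the `context` rows of zeros placed in front of the data: because of that offset, the slice `[ind : ind + 2 * context + 1]` is centered on frame `ind`. The toy sketch below uses one feature per frame so the windows are easy to read; the helper `window` and the toy arrays are hypothetical.

```python
import numpy as np

context = 2
frames = np.arange(1, 6, dtype=np.float32).reshape(5, 1)   # 5 frames, 1 feature each

# Same layout as self.mfccs: zeros, then the data, then zeros.
padded = np.zeros((frames.shape[0] + 2 * context, 1), dtype=np.float32)
padded[context:context + frames.shape[0]] = frames

def window(ind):
    # Same slice as AudioDataset.__getitem__, applied to the toy buffer.
    return padded[ind:ind + 2 * context + 1].flatten()

print(window(0))   # first frame: left context is zero padding
print(window(2))   # middle frame: real frames on both sides
print(window(4))   # last frame: right context is zero padding
```

Note that the padding sits only at the two ends of the concatenated array, so in the dataset above a window near an utterance boundary contains frames from the neighboring utterance rather than zeros.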

In [8]:
class AudioTestDataset(torch.utils.data.Dataset):

    # Test dataset: same as AudioDataset, but there are no transcripts for this partition
    # Important: Read the mfccs in sorted order, do NOT shuffle the data here or in your dataloader.
    def __init__(self, root, context=0, partition= "test-clean"):

        self.context = context

        # MFCC directory - use partition to access the test directory from kaggle data using root
        self.mfcc_dir       = os.path.join(root, partition,'mfcc')

        # List files in self.mfcc_dir using os.listdir in sorted order
        mfcc_names          = sorted(os.listdir(self.mfcc_dir))

        length = len(mfcc_names)
        T = 0

        for i in range(length):
            #   Load a single mfcc
            mfcc  = np.load(os.path.join(self.mfcc_dir, mfcc_names[i]))
            # Extract the length of the mfcc
            T += mfcc.shape[0]

        self.mfccs = np.zeros((2 * self.context + T, 28),dtype = np.float32)

        cx = self.context

        for i in range(length):
            #   Load a single mfcc
            mfcc        = np.load(os.path.join(self.mfcc_dir, mfcc_names[i]))
            #   Do Cepstral Normalization of mfcc (explained in writeup)
            mfcc = (mfcc-np.mean(mfcc, axis=0)) / np.std(mfcc, axis=0) #Added axis =0, as advised by the TA on Piazza
            #Watch out for an issue with the datatype, needs to be tensors not numpy arrays

            self.mfccs[cx:cx + mfcc.shape[0]] = mfcc
            cx += mfcc.shape[0]

        self.length = T

    def __len__(self):
        return self.length

    def __getitem__(self, ind):

        # Based on context and offset, return a frame at given index with context frames to the left, and right.
        frames = self.mfccs[ind : ind + 2 * self.context + 1]
        # After slicing, you get an array of shape 2*context+1 x 28. But our MLP needs 1d data and not 2d.
        frames = frames.flatten() # Flatten to get 1d data

        frames      = torch.FloatTensor(frames) # Convert to tensors

        return frames



# Parameters Configuration

Storing your parameters and hyperparameters in a single configuration dictionary makes it easier to keep track of them during each experiment. It can also be used with weights and biases to log your parameters for each experiment and keep track of them across multiple experiments.

In [10]:
config = {
    'epochs': 80,
    'batch_size': 8192,
    'context': 30,
    'init_lr': 1e-3,
    'architecture': 'high-cutoff',
    'weight_decay': 1e-4,  # L2 regularization strength
}


# Create Datasets

In [11]:
#TODO: Create a dataset object using the AudioDataset class for the training data 
train_data = AudioDataset(root='/kaggle/input/11785-hw1p2-f23/11-785-f23-hw1p2/', context=config['context'])

# TODO: Create a dataset object using the AudioDataset class for the validation data
val_data = AudioDataset(root='/kaggle/input/11785-hw1p2-f23/11-785-f23-hw1p2/', partition='dev-clean', context=config['context'])
# TODO: Create a dataset object using the AudioTestDataset class for the test data
test_data = AudioTestDataset(root='/kaggle/input/11785-hw1p2-f23/11-785-f23-hw1p2/', partition='test-clean', context=config['context'])

In [None]:
# Define dataloaders for train, val and test datasets
# Dataloaders will yield a batch of frames and phonemes of given batch_size at every iteration
# We shuffle train dataloader but not val & test dataloader. Why?

train_loader = torch.utils.data.DataLoader(
    dataset     = train_data,
    num_workers = 2,
    batch_size  = config['batch_size'],
    pin_memory  = True,
    shuffle     = True
)

val_loader = torch.utils.data.DataLoader(
    dataset     = val_data,
    num_workers = 2,
    batch_size  = config['batch_size'],
    pin_memory  = True,
    shuffle     = False
)

test_loader = torch.utils.data.DataLoader(
    dataset     = test_data,
    num_workers = 2,
    batch_size  = config['batch_size'],
    pin_memory  = True,
    shuffle     = False
)


print("Batch size     : ", config['batch_size'])
print("Context        : ", config['context'])
print("Input size     : ", (2*config['context']+1)*28)
print("Output symbols : ", len(PHONEMES))

print("Train dataset samples = {}, batches = {}".format(len(train_data), len(train_loader)))
print("Validation dataset samples = {}, batches = {}".format(len(val_data), len(val_loader)))
print("Test dataset samples = {}, batches = {}".format(len(test_data), len(test_loader)))

In [None]:
# Testing code to check if your data loaders are working
for i, data in enumerate(train_loader):
    frames, phoneme = data
    print(frames.shape, phoneme.shape)
    break

# Network Architecture


This section defines your network architecture for the homework. We have given you a sample architecture that can easily clear the very low cutoff for the early submission deadline.

In [15]:
class Network(torch.nn.Module):

    def __init__(self, input_size, output_size):

        super(Network, self).__init__()

        self.model = torch.nn.Sequential(
            torch.nn.Linear(input_size, 2048),
            torch.nn.BatchNorm1d(2048),
            torch.nn.GELU(),
            torch.nn.Dropout(0.3),

            torch.nn.Linear(2048, 2048),
            torch.nn.GELU(),
            torch.nn.Dropout(0.3),

            torch.nn.Linear(2048, 2048),
            torch.nn.BatchNorm1d(2048),
            torch.nn.GELU(),
            torch.nn.Dropout(0.2),

            torch.nn.Linear(2048, 2048),
            torch.nn.GELU(),
            torch.nn.Dropout(0.2),

            torch.nn.Linear(2048, 2048),
            torch.nn.BatchNorm1d(2048),
            torch.nn.GELU(),
            torch.nn.Dropout(0.2),

            torch.nn.Linear(2048, 2048),
            torch.nn.GELU(),
            torch.nn.Dropout(0.2),

            torch.nn.Linear(2048, output_size)
        )

    def forward(self, x):
        out = self.model(x)
        return out

# Define Model, Loss Function and Optimizer

Here we define the model, loss function, optimizer and optionally a learning rate scheduler.

In [16]:
INPUT_SIZE  = (2*config['context'] + 1) * 28 # Why is this the case?
model       = Network(INPUT_SIZE, len(train_data.phonemes)).to(device)
# summary(model, frames.to(device))
# Check number of parameters of your network
# Remember, you are limited to 20 million parameters for HW1 (including ensembles)

In [17]:
def count_parameters(model=model):
    params = [p.numel() for p in model.parameters() if p.requires_grad]
    print(f'Model total parameters: {sum(params):>7}')

In [None]:
count_parameters()
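
As a cross-check on the printed count, the number of trainable parameters of this kind of MLP can be worked out by hand. The sketch below is a plain-Python calculation, not part of the notebook; `mlp_param_count` is a hypothetical helper, and it assumes the layout of the `Network` class above (six hidden layers of width 2048, BatchNorm1d after the first, third and fifth, 42 output classes, context 30).

```python
def mlp_param_count(input_size, hidden_widths, output_size, batchnorm_layers):
    """Count trainable parameters of a plain MLP.

    batchnorm_layers: indices of hidden layers followed by BatchNorm1d,
    which adds one weight and one bias per feature.
    """
    total, prev = 0, input_size
    for i, width in enumerate(hidden_widths):
        total += prev * width + width          # Linear weights + biases
        if i in batchnorm_layers:
            total += 2 * width                 # BatchNorm1d weight + bias
        prev = width
    total += prev * output_size + output_size  # output Linear
    return total

input_size = (2 * 30 + 1) * 28                 # 61 frames of 28 features each
n = mlp_param_count(input_size, [2048] * 6, 42, batchnorm_layers={0, 2, 4})
print(f"{n:,}")
```

Under these assumptions the result is 24,580,138, the same figure quoted in the architecture notes. Dropout and the activation functions contribute no parameters, and the BatchNorm running statistics are buffers rather than trainable parameters.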

In [19]:
criterion = torch.nn.CrossEntropyLoss() # Defining Loss function.
# We use CE because the task is multi-class classification

# optimizer = torch.optim.Adam(model.parameters(), lr= config['init_lr']) #Defining Optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=config['init_lr'], weight_decay=config['weight_decay'])

# Recommended : Define Scheduler for Learning Rate,
# including but not limited to StepLR, MultiStepLR, CosineAnnealingLR, ReduceLROnPlateau, etc.
# You can refer to Pytorch documentation for more information on how to use them.

# Is your training time very high?
# Look into mixed precision training if your GPU (Tesla T4, V100, etc) can make use of it
# Refer - https://pytorch.org/docs/stable/notes/amp_examples.html
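
The comments above recommend a learning-rate scheduler, while this notebook keeps the rate fixed at `init_lr`. For reference, the closed form behind a single cosine decay (the shape that `CosineAnnealingLR` follows without restarts) is easy to tabulate in plain Python. The sketch below is illustrative only: `cosine_lr` is a hypothetical helper, and `eta_min=1e-5` is an assumed floor, not a value used in the notebook.

```python
import math

def cosine_lr(epoch, base_lr=1e-3, eta_min=1e-5, t_max=80):
    """Learning rate after `epoch` steps of a single cosine decay."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / t_max))

# Tabulate the schedule at a few epochs of an 80-epoch run.
for epoch in (0, 20, 40, 60, 80):
    print(epoch, f"{cosine_lr(epoch):.6f}")
```

The rate starts at `base_lr`, passes the midpoint between `base_lr` and `eta_min` halfway through, and ends at `eta_min`.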

# Training and Validation Functions

This section covers the training, and validation functions for each epoch of running your experiment with a given model architecture. The code has been provided to you, but we recommend going through the comments to understand the workflow to enable you to write these loops for future HWs.

In [None]:
torch.cuda.empty_cache()
gc.collect()

In [21]:
def train(model, dataloader, optimizer, criterion):

    model.train()
    tloss, tacc = 0, 0 # Monitoring loss and accuracy
    batch_bar   = tqdm(total=len(dataloader), dynamic_ncols=True, leave=False, position=0, desc='Train')

    for i, (frames, phonemes) in enumerate(dataloader):

        ### Initialize Gradients
        optimizer.zero_grad()

        ### Move Data to Device (Ideally GPU)
        frames      = frames.to(device)
        phonemes    = phonemes.to(device)

        ### Forward Propagation
        logits  = model(frames)

        ### Loss Calculation
        loss    = criterion(logits, phonemes)

        ### Backward Propagation
        loss.backward()

        ### Gradient Descent
        optimizer.step()

        tloss   += loss.item()
        tacc    += torch.sum(torch.argmax(logits, dim= 1) == phonemes).item()/logits.shape[0]

        batch_bar.set_postfix(loss="{:.04f}".format(float(tloss / (i + 1))),
                              acc="{:.04f}%".format(float(tacc*100 / (i + 1))))
        batch_bar.update()

        ### Release memory
        del frames, phonemes, logits
        torch.cuda.empty_cache()

    batch_bar.close()
    tloss   /= len(dataloader)
    tacc    /= len(dataloader)

    return tloss, tacc

In [22]:
def eval(model, dataloader):

    model.eval() # set model in evaluation mode
    vloss, vacc = 0, 0 # Monitoring loss and accuracy
    batch_bar   = tqdm(total=len(dataloader), dynamic_ncols=True, position=0, leave=False, desc='Val')

    for i, (frames, phonemes) in enumerate(dataloader):

        ### Move data to device (ideally GPU)
        frames      = frames.to(device)
        phonemes    = phonemes.to(device)

        # makes sure that there are no gradients computed as we are not training the model now
        with torch.inference_mode():
            ### Forward Propagation
            logits  = model(frames)
            ### Loss Calculation
            loss    = criterion(logits, phonemes)

        vloss   += loss.item()
        vacc    += torch.sum(torch.argmax(logits, dim= 1) == phonemes).item()/logits.shape[0]

        # Do you think we need loss.backward() and optimizer.step() here?

        batch_bar.set_postfix(loss="{:.04f}".format(float(vloss / (i + 1))),
                              acc="{:.04f}%".format(float(vacc*100 / (i + 1))))
        batch_bar.update()

        ### Release memory
        del frames, phonemes, logits
        torch.cuda.empty_cache()

    batch_bar.close()
    vloss   /= len(dataloader)
    vacc    /= len(dataloader)

    return vloss, vacc
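
One detail worth knowing about both loops: they average the per-batch accuracies, which weights every batch equally even when the final batch is smaller. The plain-Python sketch below contrasts that with a sample-weighted accuracy; the function names and the numbers are made up for illustration.

```python
def batch_mean_accuracy(batches):
    """Average of per-batch accuracies (what the loops above report)."""
    return sum(correct / size for correct, size in batches) / len(batches)

def sample_weighted_accuracy(batches):
    """Total correct predictions divided by total samples."""
    return sum(c for c, _ in batches) / sum(s for _, s in batches)

# Made-up (num_correct, batch_size) pairs; the last batch is small.
batches = [(80, 100), (80, 100), (1, 10)]
print(batch_mean_accuracy(batches))
print(sample_weighted_accuracy(batches))
```

Only the final batch can differ in size, so the gap between the two numbers is usually small when there are many batches, but the sample-weighted version is the one that matches accuracy over the whole dataset.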

# Weights and Biases Setup

In [None]:
wandb.login(key="<wandb API key>") #API Key is in your wandb account, under settings (wandb.ai/settings)

In [None]:
# Create your wandb run
run = wandb.init(
    name    = "second-run-Kaggle_5", ### Wandb creates random run names if you skip this field, we recommend you give useful names
#     reinit  = True, ### Allows reinitalizing runs when you re-run this cell
    id     = "jkjzosux", ### Insert specific run id here if you want to resume a previous run
    resume = "must", ### You need this to resume previous runs, but comment out reinit = True when using this
    project = "hw1p2", ### Project should be created in your wandb account
    config  = config ### Wandb Config for your run
)

In [None]:
# Save your model architecture as a string with str(model)
model_arch  = str(model)

### Save it in a txt file
arch_file   = open("model_arch.txt", "w")
file_write  = arch_file.write(model_arch)
arch_file.close()

### log it in your wandb run with wandb.save()
wandb.save('model_arch.txt')

# Experiment

Now, it is time to finally run your ablations! Have fun!

In [None]:
%%time
# Iterate over number of epochs to train and evaluate your model
torch.cuda.empty_cache()
gc.collect()
# wandb.watch(model, log="all")
best_val_loss = float('inf')
best_val_acc = 0
for epoch in range(config['epochs']):

    print("\nEpoch {}/{}".format(epoch+1, config['epochs']))

    curr_lr                 = float(optimizer.param_groups[0]['lr'])
    train_loss, train_acc   = train(model, train_loader, optimizer, criterion)
    val_loss, val_acc       = eval(model, val_loader)
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        torch.save(model.state_dict(), 'best_loss_HW1P2_2_5.pth')  # Save the best model
        wandb.save("best_loss_HW1P2_2_5.pth")
        
    if val_acc > best_val_acc:
        best_val_acc = val_acc
        torch.save(model.state_dict(), 'best_acc_HW1P2_2_5.pth')  # Save the best-accuracy model
        wandb.save("best_acc_HW1P2_2_5.pth")

    torch.save(model.state_dict(), 'HW1P2_2_5.pth')  # Save the current model
    wandb.save("HW1P2_2_5.pth")

    print("\tTrain Acc {:.04f}%\tTrain Loss {:.04f}\t Learning Rate {:.07f}".format(train_acc*100, train_loss, curr_lr))
    print("\tVal Acc {:.04f}%\tVal Loss {:.04f}".format(val_acc*100, val_loss))

    ### Log metrics at each epoch in your run
    # Optionally, you can log at each batch inside train/eval functions
#     (explore wandb documentation/wandb recitation)
    wandb.log({'train_acc': train_acc*100, 'train_loss': train_loss,
               'val_acc': val_acc*100, 'valid_loss': val_loss, 'lr': curr_lr})

    ### Highly Recommended: Save checkpoint in drive and/or wandb if accuracy is better than your current best

### Finish your wandb run
# run.finish()

In [None]:
model.load_state_dict(torch.load('best_acc_HW1P2_2_5.pth', map_location=torch.device(device)))

# Testing and submission to Kaggle

Before we get to the following code, make sure to look at the submission format given in *sample_submission.csv*. Once you have done so, it is time to fill in the following function to complete your inference on the test data. Refer to the eval function from the previous cells to get an idea of how to go about completing this function.

In [36]:
def test(model, test_loader):
    ### Put the model in evaluation mode for inference (disables dropout, uses BatchNorm running statistics)
    model.eval()

    ### List to store predicted phonemes of test data
    test_predictions = []

    ### Which mode do you need to avoid gradients?
    with torch.no_grad():

        for i, mfccs in enumerate(tqdm(test_loader)):

            mfccs   = mfccs.to(device)

            logits  = model(mfccs)

            ### Get most likely predicted phoneme with argmax
            predicted_phonemes = torch.argmax(logits, dim=1).cpu().numpy().flatten()

            ### How do you store predicted_phonemes with test_predictions? Hint, look at eval
            test_predictions.extend(predicted_phonemes)

    return test_predictions

In [None]:
predictions = test(model, test_loader)

In [38]:
PHONEMES_decoder = {i:p for i, p in enumerate(PHONEMES)}

In [39]:
predictions=[PHONEMES_decoder[target_value] for target_value in predictions]

In [40]:
### Create CSV file with predictions
with open("./submission.csv", "w+") as f:
    f.write("id,label\n")
    for i in range(len(predictions)):
        f.write("{},{}\n".format(i, predictions[i]))
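
The decode-and-write steps above can also be expressed with the standard `csv` module. The sketch below is an alternative under stated assumptions, not the notebook's method: `write_submission` is a hypothetical helper, and it uses a shortened phoneme list and an in-memory file so that it runs without the dataset.

```python
import csv
import io

def write_submission(predicted_ids, phonemes, fileobj):
    """Write `id,label` rows, decoding integer class ids to phoneme strings."""
    writer = csv.writer(fileobj, lineterminator="\n")
    writer.writerow(["id", "label"])
    for i, class_id in enumerate(predicted_ids):
        writer.writerow([i, phonemes[int(class_id)]])

# Toy example: three predictions over a three-phoneme list.
buffer = io.StringIO()
write_submission([0, 2, 1], ['[SIL]', 'AA', 'AE'], buffer)
print(buffer.getvalue())
```

Passing a file opened with `open("submission.csv", "w", newline="")` instead of the `StringIO` buffer would write the same rows to disk.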

In [None]:
# ### Submit to kaggle competition using kaggle API (Uncomment below to use)
!kaggle competitions submit -c 11785-hw1p2-f23 -f ./submission.csv -m "Test Submission"

# ### However, it's always safer to download the csv file and then upload to kaggle