# HW3P2: Utterance to Phoneme Mapping


For this homework, you would ideally have learned:

• To solve a sequence-to-sequence problem using Sequence models.
    – How to set up GRU/LSTM based models in PyTorch
    – How to utilize CNNs as feature extractors
    – How to handle sequential data
    – How to pad / pack batches of variable length data
    – How to train the model using CTC Loss
    – How to optimize the model
    – How to implement and utilize decoders such as greedy and beam decoders

• To explore architectures and hyperparameters for the optimal solution
– To identify and tabulate all the various design/architecture choices, parameters
and hyperparameters that affect your solution
– To devise strategies to search through this space of options to find the best solution

• The process of staging the exploration
– To initially set up a simple solution that is easily implemented and optimized
– To stage your data to efficiently search through the space of solutions
– To subset promising configurations/settings and tune them to obtain higher performance

• To engineer the solution using your tools
– To use objects from the PyTorch framework to build a GRU/LSTM based model
– To deal with issues of data loading, memory usage, arithmetic precision etc. to maximize the time efficiency of your training and inference

# README

## Instructions to run code: All cells need to be run! 
(Note: The training cell was run twice to reach the high cutoff. At the time of submission I had re-run the same cell (instead of running a separate copy of it) due to a lack of time, so the output logs of the first training run are not visible. The additional training was run for only 10 epochs.)

Ablation Strategies: 

1) Architectures considered:
- Single 1D convolutional layer + BatchNorm feeding the LSTM (easy to implement; reached the low cutoff)
- 2 1D-Conv layers with ReLU (allowed me to reach the medium cutoff successfully; however, improvements thereafter were tedious and required reconsidering the architecture)
- 2 1D-Conv layers with GELU (best performance: a much-improved Levenshtein distance was achieved within 15 epochs)
- Additionally, after a group discussion and a suggestion from the TAs, introducing GELU with dropout in the classification layer substantially improved performance, and convergence was reached easily

2) Epochs: 
- Trained for 35 epochs in total. Training for more epochs improved performance; however, beyond about 50 epochs the gain in performance no longer justified the resource consumption.

3) Hyperparameters: 
* Learning rate tuning was not required; an LR of 0.002 reached the high cutoff.
* Batch sizes from 16 to 128 were tried. A batch_size of 32 was finally selected for its acceptable performance relative to the compute units consumed, although lower batch sizes notably improved performance.
* LockedDropout was implemented as recommended

4) Data loading scheme:
* No transforms were required to reach cutoff.

# Installs

## wandb

You will need to fetch your API key from wandb.ai.

In [1]:
!pip install wandb -q


In [2]:
import wandb
wandb.login(key="<YOUR_WANDB_API_KEY>") # Paste your own key from wandb.ai; never commit a real key

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
[34m[1mwandb[0m: W&B API key is configured. Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


True

In [None]:
run = wandb.init(
    name = "early-submission", ## Wandb creates random run names if you skip this field
    reinit = True, ### Allows reinitializing runs when you re-run this cell
    # run_id = ### Insert specific run id here if you want to resume a previous run
    # resume = "must" ### You need this to resume previous runs, but comment out reinit = True when using this
    project = "hw3p2-ablations", ### Project should be created in your wandb account 
    # config = config ### Wandb Config for your run
)

## Levenshtein

This may take a while

In [4]:
!pip install python-Levenshtein
!git clone --recursive https://github.com/parlance/ctcdecode.git
!pip install wget
%cd ctcdecode
!pip install .
%cd ..

!pip install torchsummaryX

Collecting python-Levenshtein
  Downloading python_Levenshtein-0.20.8-py3-none-any.whl (9.4 kB)
Collecting Levenshtein==0.20.8
  Downloading Levenshtein-0.20.8-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (175 kB)
Collecting rapidfuzz<3.0.0,>=2.3.0
  Downloading rapidfuzz-2.13.2-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.2 MB)
Installing collected packages: rapidfuzz, Levenshtein, python-Levenshtein
Successfully installed Levenshtein-0.20.8 python-Levenshtein-0.20.8 rapidfuzz-2.13.2
Cloning into 'ctcdecode'...
remote: Enumerating objects: 1102, done.
remote: Counting objects: 100% (39/39), done.
remote: Compressing objects: 100% (25/25), done.
remote: Total 1102 (delta 16), reused 32 (delta 14), pack-reused 1063
Receiving objects: 100% (1102/1102), 782.27 KiB | 6.74 MiB/s, done.
Resolving deltas: 100% (529/529), 

## imports

In [5]:
import torch
import random
import numpy as np
import torch.nn as nn
import torch.nn.functional as F
from torchsummaryX import summary
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence

import torchaudio.transforms as tat

from sklearn.metrics import accuracy_score
import gc

import zipfile
import pandas as pd
from tqdm import tqdm
import os
import datetime

# imports for decoding and distance calculation
import ctcdecode
import Levenshtein
from ctcdecode import CTCBeamDecoder

import warnings
warnings.filterwarnings('ignore')

device = 'cuda' if torch.cuda.is_available() else 'cpu'
print("Device: ", device)

Device:  cuda


# Kaggle Setup

In [6]:
!pip install --upgrade --force-reinstall --no-deps kaggle==1.5.8
!mkdir /root/.kaggle

with open("/root/.kaggle/kaggle.json", "w+") as f:
    f.write('{"username":"<YOUR_KAGGLE_USERNAME>","key":"<YOUR_KAGGLE_API_KEY>"}') # TODO: Put your kaggle username & key here

!chmod 600 /root/.kaggle/kaggle.json

Collecting kaggle==1.5.8
  Downloading kaggle-1.5.8.tar.gz (59 kB)
[?25l[K     |█████▌                          | 10 kB 25.3 MB/s eta 0:00:01[K     |███████████                     | 20 kB 11.7 MB/s eta 0:00:01[K     |████████████████▋               | 30 kB 9.7 MB/s eta 0:00:01[K     |██████████████████████▏         | 40 kB 4.4 MB/s eta 0:00:01[K     |███████████████████████████▊    | 51 kB 5.3 MB/s eta 0:00:01[K     |████████████████████████████████| 59 kB 3.2 MB/s 
[?25hBuilding wheels for collected packages: kaggle
  Building wheel for kaggle (setup.py) ... [?25l[?25hdone
  Created wheel for kaggle: filename=kaggle-1.5.8-py3-none-any.whl size=73275 sha256=535f0676eae547f7409c6775e330fff99197661fb2a35814fc82ccbfc5342aaf
  Stored in directory: /root/.cache/pip/wheels/de/f7/d8/c3902cacb7e62cb611b1ad343d7cc07f42f7eb76ae3a52f3d1
Successfully built kaggle
Installing collected packages: kaggle
  Attempting uninstall: kaggle
    Found existing installation: kaggle 1.5.12
   

In [7]:
!kaggle competitions download -c 11-785-f22-hw3p2

Downloading 11-785-f22-hw3p2.zip to /content
100% 8.88G/8.88G [00:53<00:00, 109MB/s]
100% 8.88G/8.88G [00:53<00:00, 178MB/s]


In [8]:
'''
This will take a couple minutes, but you should see at least the following:
11-785-f22-hw3p2.zip  ctcdecode  hw3p2
'''
!unzip -q 11-785-f22-hw3p2.zip
!ls

11-785-f22-hw3p2.zip  ctcdecode  hw3p2	sample_data  wandb


# Google Drive

In [9]:
# from google.colab import drive # Link your drive if you are a colab user
# drive.mount('/content/drive') # Models in this HW take a long time to train, so make sure to save them here

import os.path as path 
if not path.exists("/content/drive"):
    !sudo add-apt-repository -y ppa:alessandro-strada/ppa 2>&1 > /dev/null
    !sudo apt-get update -qq 2>&1 > /dev/null
    !sudo apt -y install -qq google-drive-ocamlfuse 2>&1 > /dev/null
    !google-drive-ocamlfuse

    !sudo apt-get install -qq w3m # to act as web browser 
    !xdg-settings set default-web-browser w3m.desktop # to set default browser
    %cd /content
    !mkdir drive
    %cd drive
#     !mkdir MyDrive
    %cd ..
    %cd ..
    !google-drive-ocamlfuse /content/drive/MyDrive

W: GPG error: https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  InRelease: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY A4B469963BF863CC
E: The repository 'https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  InRelease' is no longer signed.
W: GPG error: https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  InRelease: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY A4B469963BF863CC
E: The repository 'https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  InRelease' is no longer signed.


debconf: unable to initialize frontend: Dialog
debconf: (No usable dialog-like program is installed, so the dialog based frontend cannot be used. at /usr/share/perl5/Debconf/FrontEnd/Dialog.pm line 76, <> line 1.)
debconf: falling back to frontend: Readline
debconf: unable to initialize frontend: Readline
debconf

# Dataset and Dataloader

In [10]:
# ARPABET PHONEME MAPPING
# DO NOT CHANGE
# This overwrites the phonetics.py file.

CMUdict_ARPAbet = {
    "" : " ",
    "[SIL]": "-", "NG": "G", "F" : "f", "M" : "m", "AE": "@", 
    "R"    : "r", "UW": "u", "N" : "n", "IY": "i", "AW": "W", 
    "V"    : "v", "UH": "U", "OW": "o", "AA": "a", "ER": "R", 
    "HH"   : "h", "Z" : "z", "K" : "k", "CH": "C", "W" : "w", 
    "EY"   : "e", "ZH": "Z", "T" : "t", "EH": "E", "Y" : "y", 
    "AH"   : "A", "B" : "b", "P" : "p", "TH": "T", "DH": "D", 
    "AO"   : "c", "G" : "g", "L" : "l", "JH": "j", "OY": "O", 
    "SH"   : "S", "D" : "d", "AY": "Y", "S" : "s", "IH": "I",
    "[SOS]": "[SOS]", "[EOS]": "[EOS]"}

    

CMUdict = list(CMUdict_ARPAbet.keys())
ARPAbet = list(CMUdict_ARPAbet.values())


PHONEMES = CMUdict
mapping = CMUdict_ARPAbet
LABELS = ARPAbet[:-2]

In [11]:
# You might want to play around with the mapping as a sanity check here
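
One possible sanity check is to round-trip a short transcript through the mapping. The sketch below is self-contained and uses only a small excerpt of `CMUdict_ARPAbet` (an assumption made for brevity, so the indices differ from those of the full table); `encode` and `decode` are hypothetical helper names, not part of the handout.

```python
# Excerpt of the notebook's CMUdict_ARPAbet mapping. Only a few entries are
# copied here, so these indices are NOT the ones the full table produces.
mapping_excerpt = {
    "": " ",        # index 0, which torch.nn.CTCLoss treats as blank by default
    "[SIL]": "-",
    "HH": "h",
    "AH": "A",
    "L": "l",
    "OW": "o",
}
phonemes = list(mapping_excerpt.keys())
labels = list(mapping_excerpt.values())


def encode(transcript):
    """Map phoneme names to integer indices, as __getitem__ does with PHONEMES.index."""
    return [phonemes.index(p) for p in transcript]


def decode(indices):
    """Map integer indices back to the one-character labels and join them."""
    return "".join(labels[i] for i in indices)


indices = encode(["HH", "AH", "L", "OW"])
print(indices)          # [2, 3, 4, 5]
print(decode(indices))  # hAlo
```

The one-character labels are what make the later string-based Levenshtein distance meaningful: each phoneme counts as exactly one symbol.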

### Train Data

In [15]:
class AudioDataset(torch.utils.data.Dataset):

    # For this homework, we give you full flexibility to design your data set class.
    # Hint: The data from HW1 is very similar to this HW

    #TODO
    def __init__(self,data_path): 
        '''
        Initializes the dataset.
        INPUTS: What inputs do you need here?
        '''

        self.data_path = data_path
        self.mfcc_dir = self.data_path + '/mfcc/'
        self.transcript_dir = self.data_path + 'transcript/raw/'

        mfcc_names = sorted(os.listdir(self.mfcc_dir))
        transcript_names = sorted(os.listdir(self.transcript_dir))

        assert len(mfcc_names) == len(transcript_names)

        self.mfcc, self.transcript = [], []

        # num_examples = int(len(self.mfcc)*percent_data/100)  # Toy Dataset creation

        self.PHONEMES = PHONEMES

        # TODO:
        # Iterate through mfccs and transcripts
        for i in range(0, len(mfcc_names)):
            mfcc = np.load(self.mfcc_dir + mfcc_names[i], allow_pickle=True)
            # print(mfcc[i])
        #   Optionally do Cepstral Normalization of mfcc
            mfcc = (mfcc - mfcc.mean(axis=0))/mfcc.std(axis=0)
        #   Load the corresponding transcript
            transcript = np.load(self.transcript_dir + transcript_names[i],allow_pickle=True)[1:-1] # Remove [SOS] and [EOS] from the transcript (Is there an efficient way to do this
            # without traversing through the transcript)
        #   Append each mfcc to self.mfcc, transcript to self.transcript
            self.mfcc.append(mfcc)
            self.transcript.append(transcript)

        #TODO
        # WHAT SHOULD THE LENGTH OF THE DATASET BE?
        self.length = len(self.transcript)
        
        #TODO
        # HOW CAN WE REPRESENT PHONEMES? CAN WE CREATE A MAPPING FOR THEM?
        # HINT: TENSORS CANNOT STORE NON-NUMERICAL VALUES OR STRINGS

        #TODO
        # CREATE AN ARRAY OF ALL FEATURES AND LABELS
        # WHAT NORMALIZATION TECHNIQUE DID YOU USE IN HW1? CAN WE USE IT HERE?
        '''
        You may decide to do this in __getitem__ if you wish.
        However, doing this here will make the __init__ function take the load of
        loading the data, and shift it away from training.
        '''
       
    def __len__(self):
        
        '''
        TODO: What do we return here?
        '''
        return self.length

    def __getitem__(self, ind):
        '''
        TODO: RETURN THE MFCC COEFFICIENTS AND ITS CORRESPONDING LABELS

        If you didn't do the loading and processing of the data in __init__,
        do that here.

        Once done, return a tuple of features and labels.
        '''
        
        mfcc = torch.FloatTensor(self.mfcc[ind]) # Convert to Tensors
        # transcript = torch.tensor([self.PHONEMES.index(i) for i in self.transcript[ind]], dtype=torch.long)
        transcript = torch.LongTensor([self.PHONEMES.index(i) for i in self.transcript[ind]])
        
        return mfcc, transcript

    def collate_fn(self,batch):
        '''
        TODO:
        1.  Extract the features and labels from 'batch'
        2.  We will additionally need to pad both features and labels,
            look at pytorch's docs for pad_sequence
        3.  This is a good place to perform transforms, if you so wish. 
            Performing them on batches will speed the process up a bit.
        4.  Return batch of features, labels, lengths of features, 
            and lengths of labels.
        '''
        # batch of input mfcc coefficients
        batch_mfcc = [i for i,j in batch]
        # batch of output phonemes
        batch_transcript = [j for i,j in batch]

        # HINT: CHECK OUT -> pad_sequence (imported above)
        # Also be sure to check the input format (batch_first)
        batch_mfcc_pad = pad_sequence(batch_mfcc, batch_first = True)
        # lengths_mfcc = [len(m) for m in batch_mfcc] 
        lengths_mfcc = [m.shape[0] for m in batch_mfcc]

        batch_transcript_pad = pad_sequence(batch_transcript, batch_first = True)
        #lengths_transcript = [len(t) for t in batch_transcript] 
        lengths_transcript = [t.shape[0] for t in batch_transcript]

        # You may apply some transformation, Time and Frequency masking, here in the collate function;
        # Food for thought -> Why are we applying the transformation here and not in the __getitem__?
        #                  -> Would we apply transformation on the validation set as well?
        #                  -> Is the order of axes / dimensions as expected for the transform functions?
        
        # Return the following values: padded features, padded labels, actual length of features, actual length of the labels
        return batch_mfcc_pad, batch_transcript_pad, torch.tensor(lengths_mfcc), torch.tensor(lengths_transcript)
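
To make the padding step concrete, here is a pure-Python illustration of what `pad_sequence(..., batch_first=True)` does with its default padding value of 0. `pad_batch` is a hypothetical stand-in written for this note, not a PyTorch function.

```python
def pad_batch(sequences, pad_value=0):
    """Right-pad variable-length lists to the longest one and return the true lengths."""
    lengths = [len(seq) for seq in sequences]
    max_len = max(lengths)
    padded = [list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences]
    return padded, lengths


# Three label sequences of different lengths, as collate_fn might receive them.
padded, lengths = pad_batch([[5, 7, 9], [4], [8, 2]])
print(padded)   # [[5, 7, 9], [4, 0, 0], [8, 2, 0]]
print(lengths)  # [3, 1, 2]
```

The true lengths have to travel with the batch because the padded tensor alone cannot distinguish padding from data: here the pad value 0 is also the index of the blank symbol, so `CTCLoss` and `pack_padded_sequence` rely on the lengths to ignore the padded positions.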

       

### Test Data

In [16]:
# Test Dataloader
#TODO
class AudioDatasetTest(torch.utils.data.Dataset):
    
  # Load the directory and all files in them
    def __init__(self,data_path):

        self.data_path = data_path
        self.mfcc_dir = self.data_path + '/mfcc/' 

        mfcc_names = sorted(os.listdir(self.mfcc_dir))

        self.mfcc = [] 

        self.PHONEMES = PHONEMES


        # TODO:
        # Iterate through mfccs and transcripts
        for i in range(0, len(mfcc_names)):
          mfcc = np.load(self.mfcc_dir + mfcc_names[i], allow_pickle = True)
          # print(mfcc[i])
        # Optionally do Cepstral Normalization of mfcc
          mfcc = (mfcc - mfcc.mean(axis=0))/mfcc.std(axis=0)
        # Append each mfcc to self.mfcc
          self.mfcc.append(mfcc)

        self.length = len(self.mfcc)

    def __len__(self):

        return self.length

    def __getitem__(self, ind):

        mfcc = torch.FloatTensor(self.mfcc[ind]) # Convert to Tensors

        return mfcc

    def collate_fn(self,batch):

        batch_mfcc = [i for i in batch] 
        batch_mfcc_pad = pad_sequence(batch_mfcc, batch_first = True) 
        # lengths_mfcc = [len(b) for b in batch_mfcc]
        lengths_mfcc = [b.shape[0] for b in batch_mfcc]

        return batch_mfcc_pad, torch.tensor(lengths_mfcc)

### Data - Hyperparameters

In [17]:
BATCH_SIZE = 32 # Increase if your device can handle it

transforms = [] # set of transformations
# You may pass this as a parameter to the dataset class above
# This will help modularize your implementation

# root = '/content/hw3p2' 

### Data loaders

In [18]:
# get me RAMMM!!!! 
import gc
gc.collect()

153

In [19]:
# Create objects for the dataset class
# train_data = AudioDataset('/content/hw3p2/train-clean-100/') # Low Cut-off
train_data = AudioDataset('/content/hw3p2/train-clean-360/')
val_data = AudioDataset('/content/hw3p2/dev-clean/')
test_data = AudioDatasetTest('/content/hw3p2/test-clean/')

# Do NOT forget to pass in the collate function as parameter while creating the dataloader
train_loader = torch.utils.data.DataLoader(train_data, collate_fn=train_data.collate_fn,
                                           batch_size=BATCH_SIZE, pin_memory= True,
                                           shuffle= True, num_workers= 4) 
val_loader = torch.utils.data.DataLoader(val_data, collate_fn=val_data.collate_fn,
                                           batch_size=BATCH_SIZE, pin_memory= True,
                                           shuffle= False, num_workers= 2)
test_loader = torch.utils.data.DataLoader(test_data, collate_fn=test_data.collate_fn,
                                           batch_size=BATCH_SIZE, pin_memory= True,
                                           shuffle= False, num_workers= 2)

print("Batch size: ", BATCH_SIZE)
print("Train dataset samples = {}, batches = {}".format(train_data.__len__(), len(train_loader)))
print("Val dataset samples = {}, batches = {}".format(val_data.__len__(), len(val_loader)))
print("Test dataset samples = {}, batches = {}".format(test_data.__len__(), len(test_loader)))

Batch size:  32
Train dataset samples = 104014, batches = 3251
Val dataset samples = 2703, batches = 85
Test dataset samples = 2620, batches = 82
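
The batch counts printed above follow directly from the sample counts: without `drop_last`, a `DataLoader` yields the ceiling of samples divided by batch size. A small check of that arithmetic (`num_batches` is a hypothetical helper used only for this illustration):

```python
import math


def num_batches(num_samples, batch_size, drop_last=False):
    """Number of batches a loader yields for the given dataset size."""
    if drop_last:
        return num_samples // batch_size       # the incomplete last batch is discarded
    return math.ceil(num_samples / batch_size)  # the incomplete last batch is kept


for name, n in [("train", 104014), ("val", 2703), ("test", 2620)]:
    print(name, num_batches(n, 32))
# train 3251
# val 85
# test 82
```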


In [20]:
# sanity check
for data in train_loader:
    x, y, lx, ly = data
    print(x.shape, y.shape, lx.shape, ly.shape)
    break 

torch.Size([32, 1640, 15]) torch.Size([32, 204]) torch.Size([32]) torch.Size([32])


# Model Config

In [21]:
OUT_SIZE = len(LABELS)
OUT_SIZE

41

## Basic

In [23]:
import torch.nn as nn
class LockedDropout(nn.Module):
    """ LockedDropout applies the same dropout mask to every time step.

    **Thank you** to Sales Force for their initial implementation of :class:`WeightDrop`. Here is
    their `License
    <https://github.com/salesforce/awd-lstm-lm/blob/master/LICENSE>`__.

    Args:
        p (float): Probability of an element in the dropout mask to be zeroed.
    """

    def __init__(self, p=0.5):
        super().__init__()
        self.p = p

    def forward(self, x):
        """
        Args:
            x (:class:`torch.FloatTensor` [sequence length, batch size, rnn hidden size]): Input to
                apply dropout to.
        """
        if not self.training or not self.p:
            return x
        x = x.clone()
        mask = x.new_empty(1, x.size(1), x.size(2), requires_grad=False).bernoulli_(1 - self.p)
        mask = mask.div_(1 - self.p)
        mask = mask.expand_as(x)
        return x * mask



In [24]:
# torch.cuda.empty_cache()
# import torch.nn.functional as F

# class Network(nn.Module):

#     def __init__(self, input_size, embed_dim, hidden_dim, out_size, dropout_rate):

#         super(Network, self).__init__()

#         # Adding some sort of embedding layer or feature extractor might help performance.
#         self.embedding = nn.Sequential(nn.Conv1d(in_channels = input_size, out_channels = embed_dim, bias = False, kernel_size = 3, padding = 1, stride = 1),
#                                        nn.BatchNorm1d(embed_dim),          
#                                                   )

#         # TODO : look up the documentation. You might need to pass some additional parameters.
#         self.lstm = nn.LSTM(input_size = embed_dim, hidden_size = hidden_dim, num_layers = 3, bidirectional = True) 
       
#         #droupout1d !!
#         self.lstm_dropout = LockedDropout(p = dropout_rate)

#         self.classification = nn.Sequential(
#             nn.Dropout(p=dropout_rate),
#             nn.Linear(hidden_dim*2, out_size),
#             #TODO: Linear layer with in_features from the lstm module above and out_features = OUT_SIZE
#         )
        
#         # self.classification.apply(self.init_weights)
#         # self.lstm.apply(self.init_weights)

#         self.logSoftmax = nn.LogSoftmax(dim = 2) #TODO: Apply a log softmax here. Which dimension would apply it on ?

#     def init_weights(self, m):
#       if isinstance(m,torch.nn.Linear):
#         torch.nn.init.xavier_uniform_(m.weight)
#       if isinstance(m,torch.nn.Conv1d):
#         torch.nn.init.xavier_uniform_(m.weight)

#     def forward(self, x, lx):
#         #TODO
#         # The forward function takes 2 parameter inputs here. Why?
#         # Refer to the handout for hints
        
#         out = x.permute((0,2,1))
#         out = self.embedding(out)
#         out = out.permute((0,2,1))

#         out = self.lstm_dropout(out)
        
#         packed_input = pack_padded_sequence(out, lx, enforce_sorted=False, batch_first = True)

#         lstm_out, hidden_dims = self.lstm(packed_input)
        

#         lstm_pad_pack, lx  = pad_packed_sequence(lstm_out, batch_first = True)

#         out = self.classification(lstm_pad_pack)
#         out = self.logSoftmax(out)
            
#         out = out.permute((1,0,2))

#         return out, lx

In [25]:
torch.cuda.empty_cache()
import torch.nn.functional as F

class Network(nn.Module):

    def __init__(self, input_size, embed_dim, hidden_dim, out_size, dropout_rate):

        super(Network, self).__init__()

        self.embedding = nn.Sequential(nn.Conv1d(in_channels = input_size, out_channels = embed_dim, bias = False, kernel_size = 1, padding = 0, stride = 1),
                                       nn.BatchNorm1d(embed_dim),
                                       nn.GELU(),
                                       nn.Conv1d(in_channels = embed_dim, out_channels = embed_dim, bias = False, kernel_size = 3, padding = 1, stride = 1, groups = embed_dim),
                                       nn.BatchNorm1d(embed_dim),
                                       nn.GELU(),
                                      #  nn.Conv1d(in_channels = embed_dim, out_channels = embed_dim, bias = False, kernel_size = 1, padding = 0, stride = 1),
                                      #  nn.BatchNorm1d(embed_dim),
                                      #  nn.GELU(),
                                       nn.Dropout(0.2)
                                                  )

        # TODO : look up the documentation. You might need to pass some additional parameters.
        self.lstm = nn.LSTM(input_size = embed_dim, hidden_size = hidden_dim, num_layers = 2, bidirectional = True) 
       
        #droupout1d !!

        self.classification = nn.Sequential(
            nn.Linear((hidden_dim * 2), 2048),
            nn.GELU(),
            nn.Dropout(0.3),
            nn.Linear(2048, out_size)
            #TODO: Linear layer with in_features from the lstm module above and out_features = OUT_SIZE
        )
        
        self.logSoftmax = nn.LogSoftmax(dim = 2) #TODO: Apply a log softmax here. Which dimension would apply it on ?

    def forward(self, x, lx):
        #TODO
        # The forward function takes 2 parameter inputs here. Why?
        # Refer to the handout for hints
        
        out = torch.permute(x, (0,2,1))
        out = self.embedding(out)
        out = torch.permute(out, (0,2,1))
        
        packed_input = pack_padded_sequence(out, lx, enforce_sorted=False, batch_first = True)

        lstm_out, hidden_dims = self.lstm(packed_input)
        

        lstm_pad_pack, lx  = pad_packed_sequence(lstm_out, batch_first = True)

        out = self.classification(lstm_pad_pack)
        out = self.logSoftmax(out)
            
        out = torch.permute(out, (1,0,2))

        return out, lx
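
The network returns per-frame log-probabilities over the label set, which still have to be turned into a phoneme string. This notebook uses a beam decoder for that; the simplest alternative is greedy decoding, sketched below in plain Python for a single utterance. The function name and the toy three-symbol label set are assumptions for illustration, and blank is assumed to sit at index 0 as in the mapping above.

```python
def greedy_ctc_decode(frame_log_probs, labels, blank=0):
    """Greedy CTC decoding for one utterance.

    frame_log_probs: list of per-frame score lists, one score per label.
    Takes the best label per frame, collapses repeats, then drops blanks.
    """
    best_path = [max(range(len(frame)), key=frame.__getitem__) for frame in frame_log_probs]
    decoded = []
    previous = blank
    for idx in best_path:
        # Emit a symbol only when it differs from the previous frame and is not blank.
        if idx != previous and idx != blank:
            decoded.append(labels[idx])
        previous = idx
    return "".join(decoded)


toy_labels = [" ", "h", "A"]   # index 0 is the blank
frames = [
    [-3.0, -0.1, -2.5],  # h
    [-3.0, -0.2, -2.0],  # h (repeat, collapsed)
    [-0.1, -3.0, -3.0],  # blank
    [-2.0, -0.3, -2.2],  # h (new symbol, because a blank separated it)
    [-2.5, -2.0, -0.2],  # A
    [-2.5, -2.1, -0.1],  # A (repeat, collapsed)
    [-0.2, -2.5, -2.5],  # blank
]
print(greedy_ctc_decode(frames, toy_labels))  # hhA
```

Greedy decoding corresponds to following only the single best path, so it is cheap but can miss label sequences whose probability is spread over many alignments, which is what a beam search tries to recover.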

## INIT

In [26]:
torch.cuda.empty_cache()

model = Network(input_size=15, embed_dim=128,hidden_dim=256,out_size=41,dropout_rate=0.2).to(device)
summary(model, x.to(device), lx) # x and lx come from the sanity check above :)

                             Kernel Shape      Output Shape     Params  \
Layer                                                                    
0_embedding.Conv1d_0         [15, 128, 1]   [32, 128, 1640]      1.92k   
1_embedding.BatchNorm1d_1           [128]   [32, 128, 1640]      256.0   
2_embedding.GELU_2                      -   [32, 128, 1640]          -   
3_embedding.Conv1d_3          [1, 128, 3]   [32, 128, 1640]      384.0   
4_embedding.BatchNorm1d_4           [128]   [32, 128, 1640]      256.0   
5_embedding.GELU_5                      -   [32, 128, 1640]          -   
6_embedding.Dropout_6                   -   [32, 128, 1640]          -   
7_lstm                                  -      [38939, 512]  2.367488M   
8_classification.Linear_0     [512, 2048]  [32, 1640, 2048]  1.050624M   
9_classification.GELU_1                 -  [32, 1640, 2048]          -   
10_classification.Dropout_2             -  [32, 1640, 2048]          -   
11_classification.Linear_3     [2048, 

| Layer | Kernel Shape | Output Shape | Params | Mult-Adds |
|---|---|---|---|---|
| 0_embedding.Conv1d_0 | [15, 128, 1] | [32, 128, 1640] | 1920 | 3148800 |
| 1_embedding.BatchNorm1d_1 | [128] | [32, 128, 1640] | 256 | 128 |
| 2_embedding.GELU_2 | - | [32, 128, 1640] | | |
| 3_embedding.Conv1d_3 | [1, 128, 3] | [32, 128, 1640] | 384 | 629760 |
| 4_embedding.BatchNorm1d_4 | [128] | [32, 128, 1640] | 256 | 128 |
| 5_embedding.GELU_5 | - | [32, 128, 1640] | | |
| 6_embedding.Dropout_6 | - | [32, 128, 1640] | | |
| 7_lstm | - | [38939, 512] | 2367488 | 2359296 |
| 8_classification.Linear_0 | [512, 2048] | [32, 1640, 2048] | 1050624 | 1048576 |
| 9_classification.GELU_1 | - | [32, 1640, 2048] | | |


In [27]:
torch.cuda.empty_cache()
gc.collect()


106

# Training Config

In [28]:
train_config = {
    "beam_width" : 2,
    "lr" : 2e-3,
    "epochs" : 25
    } # Feel free to add more items here

In [29]:
#TODO

criterion = torch.nn.CTCLoss()# Define CTC loss as the criterion. How would the losses be reduced?
# CTC Loss: https://pytorch.org/docs/stable/generated/torch.nn.CTCLoss.html
# Refer to the handout for hints

optimizer =  torch.optim.AdamW(model.parameters(),lr=train_config['lr'],weight_decay=5e-5) # What goes in here?

# Declare the decoder. Use the CTC Beam Decoder to decode phonemes
# CTC Beam Decoder Doc: https://github.com/parlance/ctcdecode
decoder = CTCBeamDecoder(labels=LABELS,beam_width=train_config['beam_width'],num_processes=4,log_probs_input=True)#TODO 

scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer,mode='min', factor=0.5, patience=1,verbose=True)#TODO

# Mixed Precision, if you need it
scaler = torch.cuda.amp.GradScaler()
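
With `factor=0.5` and `patience=1`, the scheduler halves the learning rate once the monitored metric has failed to improve for two consecutive steps. The pure-Python simulation below sketches that rule. It is a simplification: it ignores the real scheduler's `threshold` and `cooldown` settings, it assumes the scheduler is stepped once per epoch on a validation metric, and the metric values are made up for illustration.

```python
def simulate_plateau_schedule(val_metrics, lr=2e-3, factor=0.5, patience=1):
    """Simplified ReduceLROnPlateau (mode='min'): returns the LR after each epoch."""
    best = float("inf")
    bad_epochs = 0
    history = []
    for metric in val_metrics:
        if metric < best:
            best = metric
            bad_epochs = 0
        else:
            bad_epochs += 1
        # Reduce only once the number of bad epochs exceeds the patience.
        if bad_epochs > patience:
            lr *= factor
            bad_epochs = 0
        history.append(lr)
    return history


# Hypothetical validation distances over seven epochs.
schedule = simulate_plateau_schedule([10.0, 9.0, 9.5, 9.2, 8.0, 8.1, 8.3])
print(schedule)  # [0.002, 0.002, 0.002, 0.001, 0.001, 0.001, 0.0005]
```

A patience of 1 is fairly aggressive: two noisy epochs in a row are enough to halve the rate, so the choice of monitored metric matters.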




### Levenshtein

In [30]:
# Use debug = True to see debug outputs
def calculate_levenshtein(h, y, lh, ly, decoder, labels, debug = False):

    if debug:
        print("\n----- IN LEVENSHTEIN -----\n")
        # Add any other debug statements as you may need
        # you may want to use debug in several places in this function
        
    # TODO: look at docs for CTC.decoder and find out what is returned here
    # h = h.permute(1,0,2)
    h = torch.permute(h,(1,0,2))
    beam_results, beam_scores, timesteps, out_seq_len = decoder.decode(h, seq_lens = lh)

    batch_size = len(beam_results) # TODO
    distance = 0 # Initialize the distance to be 0 initially
    
    for i in range(len(beam_results)):
        # TODO: Loop through each element in the batch
        if out_seq_len[i][0] != 0:
            decoded_slice = "".join([labels[p] for p in beam_results[i,0,:out_seq_len[i,0]]])
            target = "".join([labels[int(q)] for q in y[i,0:ly[i]]])
            d = Levenshtein.distance(decoded_slice,target)
            distance += d
            # print(distance)
            # pass

    distance /= batch_size # Average over the batch so the metric does not depend on batch size

    return distance
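
For reference, the quantity that `Levenshtein.distance` returns is the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into the other. Because every phoneme maps to one character, this is an edit distance over phonemes. A minimal dynamic-programming sketch in plain Python (not the library's implementation, which is an optimized C extension):

```python
def levenshtein(a, b):
    """Edit distance between strings a and b using a rolling DP row."""
    previous = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]                   # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
print(levenshtein("hAlo", "hElo"))       # 1
```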

In [31]:
# ANOTHER SANITY CHECK

with torch.no_grad():
  for i, (out,y,out_lengths,ly) in enumerate(train_loader):
      
      #TODO: 
      # Follow the following steps, and 
      # Add some print statements here for sanity checking
      out_tmp, y = out.to(device), y.to(device) # Move features and targets to the device; lengths stay on the CPU
      out, out_lengths = model(out_tmp, out_lengths) # Pass the on-device features, not the CPU copy
      # print(out.shape,out_lengths.shape)

      #1. What values are you returning from the collate function
      #2. Move the features and target to <DEVICE>
      #3. Print the shapes of each to get a fair understanding 
      #4. Pass the inputs to the model
            # Think of the following before you implement:
            # 4.1 What will be the input to your model?
            # 4.2 What would the model output?
            # 4.3 Print the shapes of the output to get a fair understanding 

      # Calculate loss: https://pytorch.org/docs/stable/generated/torch.nn.CTCLoss.html
      # Calculating the loss is not straightforward. Check the input format of each parameter
      
      loss = criterion(out,y,out_lengths,ly) # What goes in here?
      print(f"loss: {loss}")

      distance = calculate_levenshtein(out, y, out_lengths, ly, decoder, LABELS, debug = False)
      print(f"lev-distance: {distance}")

      break # one iteration is enough

loss: 30.61703872680664
lev-distance: 470.875


# Training

### Eval function
Writing a function to do one round of evaluation will help make your code more modular. You can, however, choose to skip this if you'd like.

In [32]:
torch.cuda.empty_cache()
def evaluate(data_loader, model):
    model.eval()
    dist = 0
    loss = 0
    batch_bar = tqdm(total=len(data_loader), dynamic_ncols=True, leave=False, position=0, desc='Val') 
    # TODO Fill this function out, if you're using it.
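    for i, (h,y,lh,ly) in enumerate(data_loader):
        h,y = h.to(device),y.to(device) # lengths (lh, ly) are left as returned by the loader
        
        with torch.inference_mode():
            out, out_length = model(h,lh)
            l = criterion(out,y,out_length,ly)
            d = calculate_levenshtein(out,y,out_length,ly,decoder,LABELS,debug=False)
        
        # Accumulate first so the progress bar shows a running mean that includes this batch
        loss += l.item() # store a Python float instead of a device tensor
        dist += d
        
        batch_bar.set_postfix(loss = f"{loss/ (i+1):.4f}", Distance = f"{dist/(i+1):.4f}")
        
        batch_bar.update()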
        
    batch_bar.close()
    del out, out_length, lh, ly, h, y

    loss /=len(data_loader)
    dist /=len(data_loader)
    
    
    print(f"\n Validation Loss: {loss:.4f}")
    print(f"\n Distance: {dist:.4f}")
    
    return loss, dist

### Training Setup

In [33]:
# This is for checkpointing, if you're doing it over multiple sessions

last_epoch_completed = 0
start = last_epoch_completed
end = train_config['epochs']
best_val_dist = float("inf") # if you're restarting from some checkpoint, use what you saw there.
dist_freq = 1

Again, writing a train step might help your code be more modular. You may choose to skip this and write the whole thing out in the training loop below if you wish.

In [34]:
def train_step(train_loader, model, optimizer, criterion, scheduler, scaler):
    
    batch_bar = tqdm(total=len(train_loader), dynamic_ncols=True, leave=False, position=0, desc='Train') 
    train_loss = 0
    model.train()


    for i, data in enumerate(train_loader):
        h, y, lh, ly = data
        optimizer.zero_grad()
        h, y = h.to(device), y.to(device)

        # TODO: Fill this with the help of your sanity check

        with torch.cuda.amp.autocast():
            out,out_length = model(h,lh)
            loss = criterion(out,y,out_length,ly)

        # HINT: Are you using mixed precision? 

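        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()

        # Accumulate a Python float so each batch's autograd graph can be freed
        train_loss += loss.item()

        batch_bar.set_postfix(
            loss = f"{train_loss/ (i+1):.4f}",
            lr = f"{optimizer.param_groups[0]['lr']}"
        )
        batch_bar.update()
    
    batch_bar.close()
    del out, out_length, lh, ly, h, y

    train_loss /= len(train_loader) # average loss per batch over the epoch
    print(f"\n Training Loss: {train_loss:.4f}") # report the epoch average, not the last batch's loss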

    return train_loss # And anything else you may wish to get out of this function

### Train Loop

In [None]:
torch.cuda.empty_cache()
gc.collect()

#TODO: Please complete the training loop

for epoch in range(train_config["epochs"]):

    # one training step
    # one validation step (if you want)

    print("\nEpoch {}/{}".format(epoch+1,train_config["epochs"]))

    # HINT: Calculating levenshtein distance takes a long time. Do you need to do it every epoch?
    # Does the training step even need it? 

    train_loss = train_step(train_loader, model,optimizer, criterion, scheduler, scaler)
    val_loss, val_dist = evaluate(val_loader, model)

    # Where you have your scheduler.step depends on the scheduler you use.
    scheduler.step(val_dist)
    
    
    # Use the below code to save models
    if val_dist < best_val_dist:
      #path = os.path.join(root_path, model_directory, 'checkpoint' + '.pth')
      # print("Saving model")
      # torch.save({'model_state_dict':model.state_dict(),
      #             'optimizer_state_dict':optimizer.state_dict(),
      #             'train_loss': train_loss,
      #             'val_dist': val_dist, 
      #             'epoch': epoch}, 
      #             '/content/gdrive/MyDrive/hw3p2-checkpoint-ak.pth')
      best_val_dist = val_dist
#       wandb.save('checkpoint.pth')
    

    # You may want to log some hyperparameters and results on wandb
#     wandb.log()

run.finish()
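
The loop above calls `scheduler.step(val_dist)` once per epoch, after validation. That call pattern, together with the "reducing learning rate of group 0" lines in the logs, is consistent with a reduce-on-plateau scheduler. The sketch below is a simplified pure-Python stand-in for that rule, not PyTorch's implementation. The class name `PlateauLR`, the starting learning rate, `factor`, `patience` and the distance values are all illustrative assumptions.

```python
class PlateauLR:
    """Simplified reduce-on-plateau rule (hypothetical stand-in, not torch's scheduler)."""

    def __init__(self, lr, factor=0.5, patience=2):
        self.lr = lr
        self.factor = factor        # multiply the learning rate by this on a plateau
        self.patience = patience    # number of non-improving epochs tolerated
        self.best = float("inf")    # best (lowest) metric seen so far
        self.bad_epochs = 0

    def step(self, metric):
        """Record one epoch's validation metric and return the learning rate to use next."""
        if metric < self.best:
            self.best = metric
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        if self.bad_epochs > self.patience:
            self.lr *= self.factor
            self.bad_epochs = 0
        return self.lr


# Illustrative validation distances, not values from this notebook's runs
sched = PlateauLR(lr=5e-4, factor=0.5, patience=2)
history = [sched.step(d) for d in [4.9, 4.8, 4.85, 4.81, 4.82, 4.7]]
```

Because the rule needs the validation metric, the scheduler has to be stepped after `evaluate`, and skipping validation on some epochs would also mean skipping the scheduler step on those epochs.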

# Extra Training
High Cutoff

In [40]:
torch.cuda.empty_cache()
gc.collect()

#TODO: Please complete the training loop

for epoch in range(10): # additional training runs for only 10 epochs instead of train_config["epochs"]

    # one training step
    # one validation step (if you want)
    print("\nEpoch {}/{}".format(epoch+1,train_config["epochs"]))  # the denominator was not changed to 10, so the logs below read "Epoch k/25"
    # HINT: Calculating levenshtein distance takes a long time. Do you need to do it every epoch?
    # Does the training step even need it? 
    train_loss = train_step(train_loader,model,optimizer,criterion,scheduler,scaler)
    val_loss, val_dist = evaluate(val_loader,model)
    
    scheduler.step(val_dist)
    # Where you have your scheduler.step depends on the scheduler you use.
    
    
    # Use the below code to save models
    if val_dist < best_val_dist:
      #path = os.path.join(root_path, model_directory, 'checkpoint' + '.pth')
      # print("Saving model")
      # torch.save({'model_state_dict':model.state_dict(),
      #             'optimizer_state_dict':optimizer.state_dict(),
      #             'train_loss': train_loss,
      #             'val_dist': val_dist, 
      #             'epoch': epoch}, 
      #             '/content/gdrive/MyDrive/hw3p2-checkpoint-ak.pth')
      best_val_dist = val_dist
#       wandb.save('checkpoint.pth')
    

    # You may want to log some hyperparameters and results on wandb
#     wandb.log()

run.finish()


Epoch 1/25
 Training Loss: 0.1243
 Validation Loss: 0.2433
 Distance: 4.8894
Epoch    26: reducing learning rate of group 0 to 2.5000e-04.

Epoch 2/25
 Training Loss: 0.1211
 Validation Loss: 0.2372
 Distance: 4.7538

Epoch 3/25
 Training Loss: 0.1780
 Validation Loss: 0.2379
 Distance: 4.7695

Epoch 4/25
 Training Loss: 0.1150
 Validation Loss: 0.2381
 Distance: 4.7460

Epoch 5/25
 Training Loss: 0.1463
 Validation Loss: 0.2382
 Distance: 4.7257

Epoch 6/25
 Training Loss: 0.1419
 Validation Loss: 0.2384
 Distance: 4.7584

Epoch 7/25
 Training Loss: 0.1958
 Validation Loss: 0.2376
 Distance: 4.7400
Epoch    32: reducing learning rate of group 0 to 1.2500e-04.

Epoch 8/25
 Training Loss: 0.1667
 Validation Loss: 0.2377
 Distance: 4.7094

Epoch 9/25
 Training Loss: 0.1272
 Validation Loss: 0.2380
 Distance: 4.6906

Epoch 10/25
 Training Loss: 0.1105
 Validation Loss: 0.2374
 Distance: 4.6531




# Generate Predictions and Submit to Kaggle

In [41]:
#TODO: Make predictions

# Follow the steps below:
# 1. Create a new object for CTCBeamDecoder with larger (why?) number of beams
# 2. Get prediction string by decoding the results of the beam decoder

decoder_test = CTCBeamDecoder(labels = LABELS, beam_width = 10, num_processes = 4, log_probs_input = True)

def make_output(h, lh, decoder, LABELS):

    h = torch.permute(h,(1,0,2)) # swap the first two axes: (T, B, C) as used for the CTC loss -> (B, T, C) for the beam decoder
    # print(h.shape)
    # Use the decoder passed in as an argument rather than the global decoder_test
    beam_results, beam_scores, timesteps, out_seq_len = decoder.decode(h, seq_lens=lh)
    batch_size = len(beam_results) # number of utterances in the batch

    preds = []
    for i in range(batch_size): # Loop through each element in the batch
        # Take the top beam, truncated to its decoded length. An empty decoding yields "",
        # which keeps one prediction per utterance so rows stay aligned in the submission.
        h_string = "".join([LABELS[b] for b in beam_results[i,0,:out_seq_len[i,0]]])
        preds.append(h_string)
    
    return preds
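
For comparison with the beam decoder used above, greedy CTC decoding can be sketched in a few lines: take the best label per frame, collapse consecutive repeats, then drop blanks. This is a toy pure-Python illustration rather than the notebook's decoder. The function name `greedy_ctc_decode`, the three-symbol label set and the choice of index 0 as the blank are assumptions made for the example.

```python
def greedy_ctc_decode(frame_scores, labels, blank=0):
    """Greedy CTC decoding: per-frame argmax, collapse repeats, remove blanks."""
    # Best label index for every frame
    best_path = [max(range(len(frame)), key=frame.__getitem__) for frame in frame_scores]

    decoded = []
    prev = None
    for idx in best_path:
        # Emit a symbol only when it differs from the previous frame and is not blank
        if idx != prev and idx != blank:
            decoded.append(labels[idx])
        prev = idx
    return "".join(decoded)


toy_labels = ["-", "a", "b"]  # hypothetical label set; "-" stands for the CTC blank
toy_scores = [
    [0.1, 0.8, 0.1],    # a
    [0.2, 0.7, 0.1],    # a (repeat, collapsed)
    [0.9, 0.05, 0.05],  # blank (separates the two a's)
    [0.1, 0.6, 0.3],    # a
    [0.1, 0.2, 0.7],    # b
    [0.2, 0.1, 0.7],    # b (repeat, collapsed)
]
toy_prediction = greedy_ctc_decode(toy_scores, toy_labels)
```

Greedy decoding commits to the single best label at every frame, whereas a beam decoder keeps several candidate prefixes alive, which is why a wider beam can recover sequences that the greedy path misses at the cost of slower decoding.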

In [42]:
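def predict(data_loader,model,decoder,debug=False):
  model.eval()
  pred=[]
  for i, data in enumerate(data_loader):
    x,lx = data
    x = x.to(device)
    with torch.inference_mode(): # no gradients are needed at test time
      output, l = model(x, lx)
    
    # Decode with the output lengths returned by the model; they differ from lx if the model downsamples in time
    predictions = make_output(output, l, decoder, LABELS)
    pred.extend(predictions)
  return pred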

In [43]:
for data in test_loader:
    x, lx = data
    print(x.shape, lx.shape)
    break 

torch.Size([32, 825, 15]) torch.Size([32])


In [44]:
#TODO:
# Write a function (predict) to generate predictions and submit the file to Kaggle

torch.cuda.empty_cache()
predictions = predict(test_loader, model, decoder_test)
import pandas as pd

# with open("submission.csv", "w+") as f:
#     f.write("index,label\n")
#     for i in range(len(predictions)):
#         f.write("{},{}\n".format(i, predictions[i]))

df = pd.read_csv('/content/hw3p2/test-clean/transcript/random_submission.csv')
df.label = predictions

df.to_csv('submission.csv', index = False)
!kaggle competitions submit -c 11-785-f22-hw3p2 -f submission.csv -m "I made it!"

100% 208k/208k [00:01<00:00, 172kB/s]
Successfully submitted to Automatic Speech Recognition (ASR)
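
The commented-out alternative in the cell above writes the `index,label` file by hand. A minimal sketch of the same idea using the standard `csv` module follows, assuming that header and a zero-based row index. The function name `format_submission` is hypothetical, and the sample strings are placeholders rather than real model output.

```python
import csv
import io


def format_submission(predictions):
    """Return CSV text with an index,label header and one row per prediction."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["index", "label"])
    for i, pred in enumerate(predictions):
        writer.writerow([i, pred])  # csv.writer quotes labels that contain commas or quotes
    return buffer.getvalue()


sample_csv = format_submission(["abc", "d,e", ""])  # placeholder predictions
```

Writing one row per test utterance, including empty predictions, keeps the row count equal to the number of utterances, which is what the `df.label = predictions` assignment above also relies on.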