<center>
    <h1>Utterance to Phoneme Mapping</h1>
    <h2>A Speech Recognition Task</h2>
    <h3>Meriem Fouad</h3>
</center>

# Introduction

This project is in the field of speech recognition. The goal is to predict the phonemes of a recording. The picture below illustrates the task at hand, where an audio recording of a person saying the word “yes” has been passed though a neural network, which outputs the phonetic transcription, /Y/ /EH/ /S/. My task is to build a recurrent neural network that performs this task.

<p align="center">
    <img src="yes_example.png" width="250">
</p>

## Data and Alignment Challenge:

The data consists of speech recordings that are parametrized as a sequence of feature vectors (mel spectral vectors, each representing a “frame” of speech), which arrive at a rate of 100 frames per second. 

The challenge here is that the output and input are not time-synchronous. It is unknown a priori which phonemes occur in the output, the length of the output sequence, or even when to output the phonemes the recording. The image below illustrates this issue: the input to the model is a sequence of feature vectors (shown by the red boxes) from a speech recording. The network must analyze the input sequence and generate the output phoneme sequence (e.g. “/Y/ /EH/ /S/”) that best matches the audio. The output is not time-synchronous with the input. 

<p align="center">
    <img src="alignment_issue.png" width="250">
</p>

## Connectionist Temporal Classification

To tackle this alignment we issue, we use a **Connectionist Temporal Classification** (CTC) where we decompose the inference into a 2-step process:

1. **Step 1** A neural network generates outputs at every time step. This eliminates the uncertainty of knowing when to generate the (sporadic) outputs
2. **Step 2** We perform a **dynamic programming (DP)** search-like operation on the complete set of outputs generated by the network, to generate the actual final outputs

Here's an illustration of the CTC framework:

<p align="center">
    <img src="ctc.png" width="250">
</p>


## Network :

My Automatic Speech Recognition network consists of the following architecture:

1. Encoder:
  a. Embedding with 4 CNNs and Relu activation,
  b. 1 Bi-directional Long Short-Term Memory (1 Bi-LSTM)
  c. 2 pBlstms (pyramidal Bi-LSTMs)

2. Decoder: MLP 

3. Inference: Beam Search

## Hyper Parameter Experimentation

1. **Data Augmentation**: Experimented with Frequency and Time masking

2. **Additional Params**: 

    I experimented with a variety of parameters including: beam width, batch size, adam weight decay rate, locked dropout, embed size, scheduler type and initialization linear rate.

    I logged my experiments on wandb. I'm adding below how a variety of my models performed.

<p align="center">
    <img src="wandb.png" width="700">
</p>

To see the wandb log of experiments, go to [this link](https://wandb.ai/mfouad-cmu/hw3p2/table)


## Evaluation Metric:

We used the Levenshtein Distance as the main evaluation metrics. Levenshtein Distance takes as input two strings, and the computed distance is the number of character differences between the strings.



#### Note: 
This project was part of a Deep Learning course (11-785: Introduction to Deep Learning) I took at Carnegie Mellon University during Fall 2024.


# Installs

In [None]:
%pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 torchtext==0.14.1 torchaudio==0.13.1 torchdata==0.5.1 --extra-index-url https://download.pytorch.org/whl/cu117 -q

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.8/1.8 GB[0m [31m889.9 kB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m24.3/24.3 MB[0m [31m25.4 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m4.2/4.2 MB[0m [31m81.1 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.0/2.0 MB[0m [31m27.8 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m4.6/4.6 MB[0m [31m84.2 MB/s[0m eta [36m0:00:00[0m
[?25h

In [None]:
!pip install torchsummaryX==1.3.0
!pip install pandas==1.5.2
!pip install wandb --quiet
!pip install python-Levenshtein -q

Collecting torchsummaryX==1.3.0
  Downloading torchsummaryX-1.3.0-py3-none-any.whl.metadata (325 bytes)
Downloading torchsummaryX-1.3.0-py3-none-any.whl (3.6 kB)
Installing collected packages: torchsummaryX
Successfully installed torchsummaryX-1.3.0
Collecting pandas==1.5.2
  Downloading pandas-1.5.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (11 kB)
Downloading pandas-1.5.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.1 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m12.1/12.1 MB[0m [31m66.7 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: pandas
  Attempting uninstall: pandas
    Found existing installation: pandas 2.2.2
    Uninstalling pandas-2.2.2:
      Successfully uninstalled pandas-2.2.2
[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
bigframes 1.24.0 requires pandas>

In [None]:
!git clone --recursive https://github.com/parlance/ctcdecode.git
!pip install wget -q

Cloning into 'ctcdecode'...
remote: Enumerating objects: 1102, done.[K
remote: Counting objects: 100% (39/39), done.[K
remote: Compressing objects: 100% (25/25), done.[K
remote: Total 1102 (delta 16), reused 32 (delta 14), pack-reused 1063 (from 1)[K
Receiving objects: 100% (1102/1102), 782.27 KiB | 2.20 MiB/s, done.
Resolving deltas: 100% (529/529), done.
Submodule 'third_party/ThreadPool' (https://github.com/progschj/ThreadPool.git) registered for path 'third_party/ThreadPool'
Submodule 'third_party/kenlm' (https://github.com/kpu/kenlm.git) registered for path 'third_party/kenlm'
Cloning into '/content/ctcdecode/third_party/ThreadPool'...
remote: Enumerating objects: 82, done.        
remote: Counting objects: 100% (26/26), done.        
remote: Compressing objects: 100% (9/9), done.        
Receiving objects: 100% (82/82), 13.34 KiB | 297.00 KiB/s, done.
Resolving deltas: 100% (36/36), done.
remote: Total 82 (delta 19), reused 17 (delta 17), pack-reused 56 (from 1)        
Cloni

In [None]:
%cd ctcdecode
!pip install . -q
%cd ..

/content/ctcdecode
  Preparing metadata (setup.py) ... [?25l[?25hdone
  Building wheel for ctcdecode (setup.py) ... [?25l[?25hdone
/content


# Imports

In [1]:
import os
import torch
import random
import numpy as np
import torch.nn as nn
import torch.nn.functional as F
from torchsummaryX import summary
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence

import torchaudio.transforms as tat

from sklearn.metrics import accuracy_score
import gc

import zipfile
import pandas as pd
from tqdm import tqdm
import os
import datetime

# imports for decoding and distance calculation
import ctcdecode
import Levenshtein
from ctcdecode import CTCBeamDecoder

import warnings
warnings.filterwarnings('ignore')

device = 'cuda' if torch.cuda.is_available() else 'cpu'
print("Device: ", device)

Device:  cuda


# Kaggle Competition and Data Setup

In [None]:
!pip install --upgrade --force-reinstall --no-deps kaggle==1.5.8 -q
!mkdir /root/.kaggle

with open("/root/.kaggle/kaggle.json", "w+") as f:
    f.write('{"username":".....","key":"......"}')

!chmod 600 /root/.kaggle/kaggle.json

[?25l     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/59.2 kB[0m [31m?[0m eta [36m-:--:--[0m[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m59.2/59.2 kB[0m [31m4.1 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
  Building wheel for kaggle (setup.py) ... [?25l[?25hdone


In [None]:
!kaggle competitions download -c 11-785-hw3p2-f24

Downloading 11-785-hw3p2-f24.zip to /content
100% 3.96G/3.97G [00:49<00:00, 67.3MB/s]
100% 3.97G/3.97G [00:49<00:00, 85.9MB/s]


In [None]:
'''
This will take a couple minutes, but you should see at least the following:
11-785-f24-hw3p2  ctcdecode  hw3p2asr-f24.zip  sample_data
'''
!unzip -q 11-785-hw3p2-f24.zip
!ls

11785-f24-hw3p2  11-785-hw3p2-f24.zip  ctcdecode  sample_data


# Dataset and Dataloader

In [2]:
# ARPABET PHONEME MAPPING

CMUdict_ARPAbet = {
    "" : " ",
    "[SIL]": "-", "NG": "G", "F" : "f", "M" : "m", "AE": "@",
    "R"    : "r", "UW": "u", "N" : "n", "IY": "i", "AW": "W",
    "V"    : "v", "UH": "U", "OW": "o", "AA": "a", "ER": "R",
    "HH"   : "h", "Z" : "z", "K" : "k", "CH": "C", "W" : "w",
    "EY"   : "e", "ZH": "Z", "T" : "t", "EH": "E", "Y" : "y",
    "AH"   : "A", "B" : "b", "P" : "p", "TH": "T", "DH": "D",
    "AO"   : "c", "G" : "g", "L" : "l", "JH": "j", "OY": "O",
    "SH"   : "S", "D" : "d", "AY": "Y", "S" : "s", "IH": "I",
    "[SOS]": "[SOS]", "[EOS]": "[EOS]"
}


CMUdict = list(CMUdict_ARPAbet.keys())
ARPAbet = list(CMUdict_ARPAbet.values())


PHONEMES = CMUdict[:-2]
LABELS = ARPAbet[:-2]



### Train Data

In [3]:
class AudioDataset(torch.utils.data.Dataset):


    def __init__(self, root, transformations, phonemes = PHONEMES, partition = "train-clean-100",
                 val_data = False):

        # Load the directory and all files in them

        self.mfcc_dir = f"{root}/{partition}/mfcc"
        self.transcript_dir = f"{root}/{partition}/transcript"

        # Important: Load the files in sorted order of their names in the directory.
        mfcc_names = sorted(os.listdir(self.mfcc_dir))
        transcript_names = sorted(os.listdir(self.transcript_dir))

        self.phonemes = phonemes
        self.val_data = val_data
        self.transformations = transformations

        # Making sure that we have the same no. of mfcc and transcripts
        assert len(mfcc_names) == len(transcript_names)


        self.mfccs, self.transcripts = [], []

        for i in range(len(mfcc_names)):
        # Load a single mfcc
            mfcc        = np.load(f"{self.mfcc_dir}/{mfcc_names[i]}")
            #  Do Cepstral Normalization of mfcc 
            mfcc = (mfcc - np.mean(mfcc, axis=0)) / np.std(mfcc, axis=0)
            # Remove [SOS] and [EOS] from the transcript
            transcript  = np.load(f"{self.transcript_dir}/{transcript_names[i]}")[1:-1]
            #  Append each mfcc to self.mfcc, transcript to self.transcript
            self.mfccs.append(mfcc)

            # Map these phonemes to integers
            # Create a dictionary from the list where they keys are the phenome and the values are the indices,
            transcript = np.vectorize(dict(zip(self.phonemes, range(len(self.phonemes)))).__getitem__)(transcript)

            self.transcripts.append(transcript)

        # Length of the dataset is now the length of concatenated mfccs/transcripts
        self.length = len(self.mfccs)


    def __len__(self):

        return self.length

    def __getitem__(self, ind):
        '''
        RETURNs THE MFCC COEFFICIENTS AND ITS CORRESPONDING LABELS
        '''

        mfcc = torch.FloatTensor(self.mfccs[ind])
        transcript = torch.tensor(self.transcripts[ind])
        return mfcc, transcript


    def collate_fn(self,batch):
        """
        
        Collate function for batching sequences.
        
        This function extracts features and labels from the input batch, pads them to the same length, 
        and applies transformations for efficiency.


        Args:
            batch (list): A list of tuples containing (features, labels).

        Returns:
            tuple: A tuple containing:
                - Padded features (Tensor)
                - Padded labels (Tensor)
                - Lengths of features (Tensor)
                - Lengths of labels (Tensor)
        """

        # batch of input mfcc coefficients
        batch_mfcc = [item[0] for item in batch]
        # batch of output phonemes
        batch_transcript = [item[1] for item in batch]


        batch_mfcc_pad = pad_sequence(batch_mfcc, batch_first = True)
        lengths_mfcc = [item.shape[0] for item in batch_mfcc]

        batch_transcript_pad = pad_sequence(batch_transcript, batch_first = True)
        lengths_transcript = [item.shape[0] for item in batch_transcript]


        # Only apply transformations to train data (Not val):
        if self.val_data == False:
            for transform in self.transformations:
                batch_mfcc_pad = transform(batch_mfcc_pad)

        return batch_mfcc_pad, batch_transcript_pad, torch.tensor(lengths_mfcc), torch.tensor(lengths_transcript)



### Test Data

In [4]:
# Test Dataloader

class AudioDatasetTest(torch.utils.data.Dataset):

    def __init__(self, root, phonemes = PHONEMES, partition= "test-clean"):

        self.phonemes   = phonemes

        self.mfcc_dir       = f"{root}/{partition}/mfcc"

        # List files in sefl.mfcc_dir using os.listdir in sorted order
        mfcc_names          = sorted(os.listdir(self.mfcc_dir))

        self.mfccs = []

        for i in range(len(mfcc_names)):
            mfcc        = np.load(f"{self.mfcc_dir}/{mfcc_names[i]}")
        #   Do Cepstral Normalization of mfcc
            mfcc = (mfcc - np.mean(mfcc, axis=0)) / np.std(mfcc, axis=0)

            self.mfccs.append(mfcc)

        # Length of the dataset is now the length of concatenated mfccs
        self.length = len(self.mfccs)


    def __len__(self):
        return self.length

    def __getitem__(self, ind):

        mfcc = torch.FloatTensor(self.mfccs[ind])

        return mfcc

    def collate_fn(self, batch):

        batch_mfcc_pad = pad_sequence(batch, batch_first = True)
        lengths_mfcc = [item.shape[0] for item in batch]

        return batch_mfcc_pad, torch.tensor(lengths_mfcc)

### Config - Hyperparameters

In [5]:
root = '/content/11785-f24-hw3p2'
# root = '/kaggle/working/11785-f24-hw3p2'

# Update this based on desired experimentations
config = {
    "beam_width" : 10,
    "init_lr"         : 1e-3,
    "epochs"     : 25,
    "batch_size" : 32, # Increase if device can handle it
    'adamw_decay': 7.5e-3,
}


freq_masking = tat.FrequencyMasking(freq_mask_param=5, iid_masks = True)
time_masking = tat.TimeMasking(time_mask_param=5, iid_masks = True)

transforms = [freq_masking, time_masking] # set of tranformations

### Data loaders

In [6]:
# get me RAMMM!!!!
import gc
gc.collect()

0

In [7]:
# Create objects for the dataset class
train_data = AudioDataset(root = root, transformations = transforms)
val_data = AudioDataset(root = root, transformations = transforms, partition= "dev-clean", val_data=True)
test_data = AudioDatasetTest(root = root, partition="test-clean")

# Do NOT forget to pass in the collate function as parameter while creating the dataloader
train_loader = torch.utils.data.DataLoader(
    dataset     = train_data,
    num_workers = 4,
    batch_size  = config['batch_size'],
    pin_memory  = True,
    shuffle     = True,
    collate_fn = train_data.collate_fn
)

val_loader = torch.utils.data.DataLoader(
    dataset     = val_data,
    num_workers = 2,
    batch_size  = config['batch_size'],
    pin_memory  = True,
    shuffle     = False,
    collate_fn = val_data.collate_fn
)

test_loader = torch.utils.data.DataLoader(
    dataset     = test_data,
    num_workers = 2,
    batch_size  = config['batch_size'],
    pin_memory  = True,
    shuffle     = False,
    collate_fn = test_data.collate_fn
)

print("Batch size: ", config['batch_size'])
print("Train dataset samples = {}, batches = {}".format(train_data.__len__(), len(train_loader)))
print("Val dataset samples = {}, batches = {}".format(val_data.__len__(), len(val_loader)))
print("Test dataset samples = {}, batches = {}".format(test_data.__len__(), len(test_loader)))

Batch size:  32
Train dataset samples = 28539, batches = 892
Val dataset samples = 2703, batches = 85
Test dataset samples = 2620, batches = 82


In [8]:
# sanity check
for data in train_loader:
    x, y, lx, ly = data
    print(x.shape, y.shape, lx.shape, ly.shape)
    break

torch.Size([32, 1642, 28]) torch.Size([32, 219]) torch.Size([32]) torch.Size([32])


In [9]:
# sanity check
for data in test_loader:
    x, lx = data
    print(x.shape, lx.shape)
    break

torch.Size([32, 1675, 28]) torch.Size([32])


# NETWORK

In [8]:
# Locked DROPOUT (increases performance!)

class LockedDropout(nn.Module):
    def __init__(self, dropout = 0.5):
        super(LockedDropout, self).__init__()
        self.dropout = dropout

    def forward(self, x):
        if not self.training or self.dropout == 0:
            return x
        x, x_lens = pad_packed_sequence(x, batch_first=True)
        m = x.new_empty(1, x.size(1), x.size(2), requires_grad=False).bernoulli_(1 - self.dropout)
        mask = m / (1 - self.dropout)
        mask = mask.expand_as(x)
        x = mask * x

        x = pack_padded_sequence(x, x_lens, batch_first=True, enforce_sorted=False)
        return x

# Reference: 
# Locked Dropout Code Inspired From:
# https://github.com/salesforce/awd-lstm-lm/blob/dfd3cb0235d2caf2847a4d53e1cbd495b781b5d2/locked_dropout.py#L5


## ASR Network

### Pyramid Bi-LSTM (pBLSTM)

In [9]:
# Utils for network
torch.cuda.empty_cache()

class PermuteBlock(torch.nn.Module):
    def forward(self, x):
        return x.transpose(1, 2)

In [11]:
from re import template
class pBLSTM(torch.nn.Module):

    '''
    Pyramidal BiLSTM

    At each step,
    1. Pad your input if it is packed (Unpack it)
    2. Reduce the input length dimension by concatenating feature dimension
    3. Pack input
    4. Pass it into LSTM layer

    To make our implementation modular, we pass 1 layer at a time.
    '''

    def __init__(self, input_size, hidden_size):
        super(pBLSTM, self).__init__()

        # Initialize a single layer bidirectional LSTM with the given input_size and hidden_size
        self.blstm = nn.LSTM(input_size = 2*input_size, hidden_size = hidden_size, num_layers = 1,
                             batch_first = True, bidirectional = True)


    def forward(self, x_packed): # x_packed is a PackedSequence

        # Pad Packed Sequence
        x_unpacked, x_lens = pad_packed_sequence(x_packed, batch_first=True)

        # Call self.trunc_reshape() which downsamples the time steps of x and increases the feature dimensions
        # self.trunc_reshape will return 2 outputs
        x, x_lens = self.trunc_reshape(x_unpacked, x_lens)

        # Pack Padded Sequence
        x_packed = pack_padded_sequence(x, x_lens, batch_first = True, enforce_sorted=False)

        # Pass the sequence through bLSTM
        out, _ = self.blstm(x_packed)

        return out

    def trunc_reshape(self, x, x_lens):
        # Handle cases with odd number of timesteps
        if x.shape[1] % 2 != 0:
          x = x[:, :-1, :] # trim to an even length by deleting the final vector


        # Reshape x. When reshaping x, you have to reduce number of timesteps by a downsampling factor
        # while increasing number of features by the same factor
        x_downsampled = torch.reshape(input = x, shape = (x.shape[0], x.shape[1]//2, x.shape[2]*2))

        # Reduce lengths by the same downsampling factor
        x_lens = x_lens // 2
        return x_downsampled, x_lens

### Encoder

In [12]:
class Encoder(torch.nn.Module):
    '''
    The Encoder takes utterances as inputs and returns latent feature representations
    '''
    def __init__(self, input_size, encoder_hidden_size):
        super(Encoder, self).__init__()

        self.embedding = torch.nn.Sequential(

            PermuteBlock(),

            nn.Conv1d(in_channels = input_size, out_channels = 64, kernel_size = 5, padding = 2, stride = 1),
            torch.nn.ReLU(),
            torch.nn.BatchNorm1d(64),

            nn.Conv1d(in_channels = 64, out_channels = 128, kernel_size = 5, padding = 2, stride = 1),
            torch.nn.ReLU(),
            torch.nn.BatchNorm1d(128),
            nn.Dropout(0.3),

            nn.Conv1d(in_channels = 128, out_channels = 256, kernel_size = 5, padding = 2, stride = 1),
            torch.nn.ReLU(),
            torch.nn.BatchNorm1d(256),
            nn.Dropout(0.2),

            torch.nn.Conv1d(in_channels = 256, out_channels = input_size, kernel_size = 5,
                            stride = 1, padding = 2),
            torch.nn.ReLU(),
            torch.nn.BatchNorm1d(input_size),
            nn.Dropout(0.1),

            PermuteBlock()

        )

        self.BLSTM = nn.LSTM(input_size = input_size, hidden_size = encoder_hidden_size,
                             num_layers = 4, batch_first=True, bidirectional=True)


        self.pBLSTMs = torch.nn.Sequential( #
            pBLSTM(input_size = 2*encoder_hidden_size, hidden_size = encoder_hidden_size),
            LockedDropout(dropout=0.3),
            pBLSTM(input_size = encoder_hidden_size*2, hidden_size = encoder_hidden_size),
            LockedDropout(dropout=0.3)


        )

    def forward(self, x, x_lens):
        # Call the embedding layer
        x = self.embedding(x)
        # Pack Padded Sequence
        x_packed = pack_padded_sequence(x, x_lens, batch_first = True, enforce_sorted = False)
        # Pass through BLSTM layer
        out, _ = self.BLSTM(x_packed)
        # Pass Sequence through the pyramidal Bi-LSTM layer
        out = self.pBLSTMs(out)
        # Pad Packed Sequence
        encoder_outputs, encoder_lens = pad_packed_sequence(out, batch_first = True)

        return encoder_outputs, encoder_lens

### Decoder

In [13]:
class Decoder(torch.nn.Module):

    def __init__(self, embed_size, output_size= 41):
        super().__init__()

        self.mlp = torch.nn.Sequential(
            nn.Linear(2*embed_size, 2*embed_size),
            nn.ReLU(),
            PermuteBlock(),
            torch.nn.BatchNorm1d(2*embed_size),
            PermuteBlock(),
            nn.Linear(2*embed_size, output_size),
        )

        self.softmax = torch.nn.LogSoftmax(dim=2)

    def forward(self, encoder_out):
        # call MLP
        out = self.mlp(encoder_out)
        out = self.softmax(out)

        return out

In [14]:
class ASRModel(torch.nn.Module):

    def __init__(self, input_size, embed_size= 192, output_size= len(PHONEMES)):
        super().__init__()

        self.encoder        = Encoder(input_size=input_size, encoder_hidden_size = embed_size)
        self.decoder        = Decoder(embed_size = embed_size, output_size = output_size)



    def forward(self, x, lengths_x):

        encoder_out, encoder_lens   = self.encoder(x, lengths_x)
        decoder_out                 = self.decoder(encoder_out)

        return decoder_out, encoder_lens

## Initialize ASR Network

In [15]:
model = ASRModel(
    input_size  = 28,
    embed_size  = 256,
    output_size = len(PHONEMES)
).to(device)
print(model)
#summary(model, x.to(device), lx)

ASRModel(
  (encoder): Encoder(
    (embedding): Sequential(
      (0): PermuteBlock()
      (1): Conv1d(28, 64, kernel_size=(5,), stride=(1,), padding=(2,))
      (2): ReLU()
      (3): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (4): Conv1d(64, 128, kernel_size=(5,), stride=(1,), padding=(2,))
      (5): ReLU()
      (6): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (7): Dropout(p=0.3, inplace=False)
      (8): Conv1d(128, 256, kernel_size=(5,), stride=(1,), padding=(2,))
      (9): ReLU()
      (10): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (11): Dropout(p=0.2, inplace=False)
      (12): Conv1d(256, 28, kernel_size=(5,), stride=(1,), padding=(2,))
      (13): ReLU()
      (14): BatchNorm1d(28, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (15): Dropout(p=0.1, inplace=False)
      (16): PermuteBlock()
    )
    (BLSTM): LSTM(28, 256

# Training Config
Initialize Loss Criterion, Optimizer, CTC Beam Decoder, Scheduler, Scaler (Mixed-Precision), etc.

In [16]:
# Define CTC loss as the criterion
# More on CTC Loss: https://pytorch.org/docs/stable/generated/torch.nn.CTCLoss.html
criterion = nn.CTCLoss(blank = 0, reduction = 'mean') ## We want expected loss so I'm using the mean

optimizer = torch.optim.AdamW(model.parameters(), lr=config['init_lr'],
                             weight_decay = config['adamw_decay'])

# Declare the decoder. Use the CTC Beam Decoder to decode phonemes
# More on CTC Beam Decoder Doc: https://github.com/parlance/ctcdecode
decoder =  CTCBeamDecoder(LABELS, beam_width=config['beam_width'], log_probs_input = True) # reminder to use a bigger beam width for testing

# scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma = 0.9)
# Next try a cosine annealing w/ 1e-5 lr
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.5, patience=5, verbose=True)

# Mixed Precision, if you need it
scaler = torch.cuda.amp.GradScaler()

# Decode Prediction

In [17]:
def decode_prediction(output, output_lens, decoder, PHONEME_MAP= LABELS):

    beam_results, beam_scores, timesteps, out_lens = decoder.decode(output,
                                                                    seq_lens= output_lens)

    pred_strings                    = []

    # print("this is the len of pred strings", output_lens.shape[0])
    for i in range(output_lens.shape[0]):

        #Create the prediction from the output of decoder.decode. Map it using PHONEME_MAP.
        top_beam = beam_results[i][0][:out_lens[i][0]]

        top_beam = "".join([PHONEME_MAP[i] for i in top_beam])

        pred_strings.append(top_beam)

    return pred_strings

def calculate_levenshtein(output, label, output_lens, label_lens, decoder,
                          PHONEME_MAP= LABELS): # y - sequence of integers

    dist            = 0
    batch_size      = label.shape[0]

    pred_strings    = decode_prediction(output, output_lens, decoder, PHONEME_MAP)

    for i in range(batch_size):
        # Get predicted string and label string for each element in the batch
        pred_string = pred_strings[i][:label_lens[i]]
        label_string = "".join(PHONEME_MAP[j] for j in label[i])[:label_lens[i]]

        dist += Levenshtein.distance(pred_string, label_string)

    dist /= batch_size
    return dist

# Test Implementation

In [18]:
# test code to check shapes

model.eval()
for i, data in enumerate(val_loader, 0):
    x, y, lx, ly = data
    x, y = x.to(device), y.to(device)
    h, lh = model(x, lx)
    print(h.shape)
    h = torch.permute(h, (1, 0, 2))
    print(h.shape, y.shape)
    loss = criterion(h, y, lh, ly)
    print(loss)

    print(calculate_levenshtein(torch.permute(h, (1, 0, 2)), y, lx, ly, decoder, LABELS))

    break

torch.Size([32, 734, 41])
torch.Size([734, 32, 41]) torch.Size([32, 265])
tensor(7.5534, device='cuda:0', grad_fn=<MeanBackward0>)
69.25


# WandB

You will need to fetch your api key from wandb.ai

In [18]:
import wandb
# enter your wandb key here
wandb.login(key=".....")

[34m[1mwandb[0m: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.
[34m[1mwandb[0m: Currently logged in as: [33mmfouad[0m ([33mmfouad-cmu[0m). Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


True

In [19]:
model_run_name = "highcutoff_model+highembed"
run = wandb.init(
    #name = model_run_name, ## Wandb creates random run names if you skip this field
    #reinit = True, ### Allows reinitalizing runs when you re-run this cell
    id = 'uq6xbtt2',### Insert specific run id here if you want to resume a previous run
    resume = "must", ### You need this to resume previous runs, but comment out reinit = True when using this
    project = "hw3p2", ### Projects should be created in your wandb account
    config = config ### Wandb Config for your run
)

# Train Functions

In [20]:
from tqdm import tqdm

def train_model(model, train_loader, criterion, optimizer):

    model.train()
    batch_bar = tqdm(total=len(train_loader), dynamic_ncols=True,
                     leave=False, position=0, desc='Train')

    total_loss = 0
    for i, data in enumerate(train_loader):
        optimizer.zero_grad()
        x, y, lx, ly = data
        x, y = x.to(device), y.to(device)

        with torch.cuda.amp.autocast():
            h, lh = model(x, lx)
            h = torch.permute(h, (1, 0, 2))
            loss = criterion(h, y, lh, ly)

        total_loss += loss.item()

        batch_bar.set_postfix(
            loss="{:.04f}".format(float(total_loss / (i + 1))),
            lr="{:.06f}".format(float(optimizer.param_groups[0]['lr'])))

        batch_bar.update() # Update tqdm bar

        # Another couple things needed for FP16.
        scaler.scale(loss).backward() # This is a replacement for loss.backward()
        scaler.step(optimizer) # This is a replacement for optimizer.step()
        scaler.update() # This is something added just for FP16

        del x, y, lx, ly, h, lh, loss
        torch.cuda.empty_cache()

    batch_bar.close() # You need this to close the tqdm bar

    return total_loss / len(train_loader)


def validate_model(model, val_loader, decoder, phoneme_map= LABELS):

    model.eval()
    batch_bar = tqdm(total=len(val_loader), dynamic_ncols=True, position=0, leave=False, desc='Val')

    total_loss = 0
    vdist = 0

    for i, data in enumerate(val_loader):

        x, y, lx, ly = data
        x, y = x.to(device), y.to(device)

        with torch.inference_mode():
            h, lh = model(x, lx)
            h = torch.permute(h, (1, 0, 2))
            loss = criterion(h, y, lh, ly)

        total_loss += float(loss)
        vdist += calculate_levenshtein(torch.permute(h, (1, 0, 2)), y, lh, ly, decoder, phoneme_map)

        batch_bar.set_postfix(loss="{:.04f}".format(float(total_loss / (i + 1))), dist="{:.04f}".format(float(vdist / (i + 1))))

        batch_bar.update()

        del x, y, lx, ly, h, lh, loss
        torch.cuda.empty_cache()

    batch_bar.close()
    total_loss = total_loss/len(val_loader)
    val_dist = vdist/len(val_loader)
    return total_loss, val_dist

## Training Setup

In [21]:
def save_model(model, optimizer, scheduler, metric, epoch, path):
    torch.save(
        {'model_state_dict'         : model.state_dict(),
         'optimizer_state_dict'     : optimizer.state_dict(),
         'scheduler_state_dict'     : scheduler.state_dict(),
         metric[0]                  : metric[1],
         'epoch'                    : epoch},
         path
    )

def load_model(path, model, metric= 'valid_acc', optimizer= None, scheduler= None):

    checkpoint = torch.load(path)
    model.load_state_dict(checkpoint['model_state_dict'])

    if optimizer != None:
        optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
    if scheduler != None:
        scheduler.load_state_dict(checkpoint['scheduler_state_dict'])

    epoch   = checkpoint['epoch']
    metric  = checkpoint[metric]

    return [model, optimizer, scheduler, epoch, metric]

In [22]:
# This is for checkpointing, if doing it over multiple sessions

last_epoch_completed = 25
start = last_epoch_completed
end = config["epochs"]
best_lev_dist = float("5.6141") # if restarting from some checkpoint, use what you saw there.
epoch_model_path = 'epoch_model.pth'# set the model path
best_model_path = 'best_model.pth' # set best model path

In [23]:
# Retrieve past model

artifact = wandb.restore('best_model.pth', run_path='mfouad-cmu/hw3p2/uq6xbtt2').name
checkpoint = torch.load(artifact, weights_only=False)
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
scheduler.load_state_dict(checkpoint['scheduler_state_dict'])

try:
    print(f"Resuming from epoch {checkpoint['epoch']}")
except:
    print(f"Resuming")

Resuming from epoch 22


In [24]:
torch.cuda.empty_cache()
gc.collect()

# Training Loop
config['epochs'] = 40

for epoch in range(23, config['epochs']):

    print("\nEpoch: {}/{}".format(epoch+1, config['epochs']))

    curr_lr = float(optimizer.param_groups[0]['lr'])

    train_loss              = train_model(model, train_loader, criterion, optimizer)
    valid_loss, valid_dist  = validate_model(model, val_loader, decoder, phoneme_map= LABELS)
    scheduler.step(valid_dist)

    print("\tTrain Loss {:.04f}\t Learning Rate {:.07f}".format(train_loss, curr_lr))
    print("\tVal Dist {:.04f}%\t Val Loss {:.04f}".format(valid_dist, valid_loss))

    wandb.log({
        'train_loss': train_loss,
        'valid_dist': valid_dist,
        'valid_loss': valid_loss,
        'lr'        : curr_lr
    })

    save_model(model, optimizer, scheduler, ['valid_dist', valid_dist], epoch, epoch_model_path)
    wandb.save(epoch_model_path)
    print("Saved epoch model")

    if valid_dist <= best_lev_dist:
        best_lev_dist = valid_dist
        save_model(model, optimizer, scheduler, ['valid_dist', valid_dist], epoch, best_model_path)
        wandb.save(best_model_path)
        print("Saved best model")

run.finish()


Epoch: 24/40




	Train Loss 0.1755	 Learning Rate 0.0010000
	Val Dist 5.5226%	 Val Loss 0.2674
Saved epoch model
Saved best model

Epoch: 25/40




	Train Loss 0.1646	 Learning Rate 0.0010000
	Val Dist 5.3398%	 Val Loss 0.2630
Saved epoch model
Saved best model

Epoch: 26/40




	Train Loss 0.1591	 Learning Rate 0.0010000
	Val Dist 5.2346%	 Val Loss 0.2616
Saved epoch model
Saved best model

Epoch: 27/40




	Train Loss 0.1541	 Learning Rate 0.0010000
	Val Dist 5.3506%	 Val Loss 0.2667
Saved epoch model

Epoch: 28/40




	Train Loss 0.1545	 Learning Rate 0.0010000
	Val Dist 5.4596%	 Val Loss 0.2715
Saved epoch model

Epoch: 29/40




	Train Loss 0.1484	 Learning Rate 0.0010000
	Val Dist 5.4998%	 Val Loss 0.2801
Saved epoch model

Epoch: 30/40




	Train Loss 0.1468	 Learning Rate 0.0010000
	Val Dist 5.4542%	 Val Loss 0.2725
Saved epoch model

Epoch: 31/40




	Train Loss 0.1405	 Learning Rate 0.0010000
	Val Dist 5.3615%	 Val Loss 0.2766
Saved epoch model

Epoch: 32/40




	Train Loss 0.1399	 Learning Rate 0.0010000
	Val Dist 5.2130%	 Val Loss 0.2695
Saved epoch model
Saved best model

Epoch: 33/40




	Train Loss 0.1332	 Learning Rate 0.0010000
	Val Dist 5.1273%	 Val Loss 0.2672
Saved epoch model
Saved best model

Epoch: 34/40




	Train Loss 0.1311	 Learning Rate 0.0010000
	Val Dist 5.2376%	 Val Loss 0.2719
Saved epoch model

Epoch: 35/40




	Train Loss 0.1339	 Learning Rate 0.0010000
	Val Dist 5.1772%	 Val Loss 0.2667
Saved epoch model

Epoch: 36/40




	Train Loss 0.1262	 Learning Rate 0.0010000
	Val Dist 5.1790%	 Val Loss 0.2646
Saved epoch model

Epoch: 37/40




	Train Loss 0.1264	 Learning Rate 0.0010000
	Val Dist 5.1754%	 Val Loss 0.2775
Saved epoch model

Epoch: 38/40




	Train Loss 0.1241	 Learning Rate 0.0010000
	Val Dist 5.1783%	 Val Loss 0.2683
Saved epoch model

Epoch: 39/40




Epoch 00039: reducing learning rate of group 0 to 5.0000e-04.
	Train Loss 0.1170	 Learning Rate 0.0010000
	Val Dist 5.2178%	 Val Loss 0.2702
Saved epoch model

Epoch: 40/40




	Train Loss 0.0931	 Learning Rate 0.0005000
	Val Dist 4.8452%	 Val Loss 0.2652
Saved epoch model
Saved best model


VBox(children=(Label(value='212.623 MB of 254.317 MB uploaded\r'), FloatProgress(value=0.836054148737092, max=…

0,1
lr,████████████████▁
train_loss,█▇▇▆▆▆▆▅▅▄▄▄▄▄▄▃▁
valid_dist,█▆▅▆▇█▇▆▅▄▅▄▄▄▄▅▁
valid_loss,▃▂▁▃▅█▅▇▄▃▅▃▂▇▄▄▂

0,1
lr,0.0005
train_loss,0.09311
valid_dist,4.84522
valid_loss,0.26523


# Generate Predictions and Submit to Kaggle

In [25]:
# Make predictions

# 1. Create a new object for CTCBeamDecoder with larger number of beams
# 2. Get prediction string by decoding the results of the beam decoder

torch.backends.cudnn.enabled = False
TEST_BEAM_WIDTH = 10

test_decoder    = CTCBeamDecoder(LABELS, beam_width=TEST_BEAM_WIDTH, log_probs_input= True)
results = []

model.eval()
print("Testing")
for data in tqdm(test_loader):

    x, lx   = data
    x       = x.to(device)

    with torch.no_grad():
        h, lh = model(x, lx)

    prediction_string = decode_prediction(output = h, output_lens=lx, decoder = test_decoder)
    #save the output in results array.
    results.append(prediction_string)

    del x, lx, h, lh
    torch.cuda.empty_cache()

Testing


100%|██████████| 82/82 [01:41<00:00,  1.23s/it]


In [26]:
results_unbatched = []
for i in range(len(results)):
  for j in range(len(results[i])):
    results_unbatched.append(results[i][j])

In [27]:
data_dir = "/content/random_submission.csv"
df = pd.read_csv(data_dir)
df.label = results_unbatched
df.to_csv('submission_7_test10.csv', index = False)

In [None]:
!kaggle competitions submit -c hw3p2-785-f24 -f early_submission_2.csv -m "I made it!"
