# 11785 HW3P2: Automatic Speech Recognition

**Welcome to HW3P2. In this homework, you will be using the same data from HW1 but will be incorporating sequence models. We recommend you get familaried with sequential data and the working of RNNs, LSTMs and GRUs to have a smooth learning in this part of the homework.**

Disclaimer: This starter notebook will not be as elaborate as that of HW1P2 or HW2P2. You will need to do most of the implementation in this notebook because, it is expected after 2 HWs, you will be in a position to write a notebook from scratch. You are welcomed to reuse the code from the previous starter notebooks but may also need to make appropriate changes for this homework. <br>
We have also given you 3 log files for the Very Low Cutoff (Levenshtein Distance = 30) so that you can observe how loss decreases.

Common errors which you may face


*   Shape errors: Half of the errors from this homework will account to this category. Try printing the shapes between intermediate steps to debug
*   CUDA out of Memory: When your architecture has a lot of parameters, this can happen. Golden keys for this is, (1) Reducing batch_size (2) Call *torch.cuda.empty_cache* often, even inside your training loop, (3) Call *gc.collect* if it helps and (4) Restart run time if nothing works







# Prelimilaries

You will need to install packages for decoding and calculating the Levenshtein distance

In [1]:
!pip install python-Levenshtein
!git clone --recursive https://github.com/parlance/ctcdecode.git
!pip install wget
%cd ctcdecode
!pip install .
%cd ..

!pip install torchsummaryX # We also install a summary package to check our model's forward before training

Collecting python-Levenshtein
  Downloading python-Levenshtein-0.12.2.tar.gz (50 kB)
[?25l[K     |██████▌                         | 10 kB 30.5 MB/s eta 0:00:01[K     |█████████████                   | 20 kB 38.5 MB/s eta 0:00:01[K     |███████████████████▌            | 30 kB 21.7 MB/s eta 0:00:01[K     |██████████████████████████      | 40 kB 9.2 MB/s eta 0:00:01[K     |████████████████████████████████| 50 kB 4.6 MB/s 
Building wheels for collected packages: python-Levenshtein
  Building wheel for python-Levenshtein (setup.py) ... [?25l[?25hdone
  Created wheel for python-Levenshtein: filename=python_Levenshtein-0.12.2-cp37-cp37m-linux_x86_64.whl size=149857 sha256=20c08bc93327c04719051254cdbd24a1dd43d0a78d444afc8c063b24da071046
  Stored in directory: /root/.cache/pip/wheels/05/5f/ca/7c4367734892581bb5ff896f15027a932c551080b2abd3e00d
Successfully built python-Levenshtein
Installing collected packages: python-Levenshtein
Successfully installed python-Levenshtein-0.12.2
Clon

**bold text**# Libraries

In [2]:
import torch
import numpy as np
import torch.nn as nn
import torch.nn.functional as F
from torchsummaryX import summary
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence
from torch.autograd import Variable
from torch.utils import data as tud

from sklearn.metrics import accuracy_score
import gc
import zipfile
import pandas as pd
from tqdm import tqdm
import os
import datetime

# imports for decoding and distance calculation
import ctcdecode
import Levenshtein
from ctcdecode import CTCBeamDecoder

import warnings
warnings.filterwarnings('ignore')

device = 'cuda' if torch.cuda.is_available() else 'cpu'
print("Device: ", device)
import csv

Device:  cuda


# Kaggle (TODO)

You need to set up your Kaggle and download the data

In [3]:
!pip install --upgrade --force-reinstall --no-deps kaggle==1.5.8
!mkdir /root/.kaggle

with open("/root/.kaggle/kaggle.json", "w+") as f:
    f.write('{"username":"mangalamsahai","key":"521f66540469b3a12f7b11566d8b1c14"}') # Put your kaggle username & key here

!chmod 600 /root/.kaggle/kaggle.json

Collecting kaggle==1.5.8
  Downloading kaggle-1.5.8.tar.gz (59 kB)
[?25l[K     |█████▌                          | 10 kB 34.9 MB/s eta 0:00:01[K     |███████████                     | 20 kB 22.2 MB/s eta 0:00:01[K     |████████████████▋               | 30 kB 16.5 MB/s eta 0:00:01[K     |██████████████████████▏         | 40 kB 15.4 MB/s eta 0:00:01[K     |███████████████████████████▊    | 51 kB 8.9 MB/s eta 0:00:01[K     |████████████████████████████████| 59 kB 4.2 MB/s 
[?25hBuilding wheels for collected packages: kaggle
  Building wheel for kaggle (setup.py) ... [?25l[?25hdone
  Created wheel for kaggle: filename=kaggle-1.5.8-py3-none-any.whl size=73275 sha256=ce07ee8070d2daf3ce334a88ab3934c636bcc39f86a0c09614530021078c8113
  Stored in directory: /root/.cache/pip/wheels/de/f7/d8/c3902cacb7e62cb611b1ad343d7cc07f42f7eb76ae3a52f3d1
Successfully built kaggle
Installing collected packages: kaggle
  Attempting uninstall: kaggle
    Found existing installation: kaggle 1.5.12
 

In [4]:
!kaggle competitions download -c 11-785-s22-hw3p2

!unzip -q 11-785-s22-hw3p2.zip

!ls

Downloading 11-785-s22-hw3p2.zip to /content
 99% 1.83G/1.84G [00:30<00:00, 100MB/s] 
100% 1.84G/1.84G [00:30<00:00, 64.3MB/s]
11-785-s22-hw3p2.zip  ctcdecode  hw3p2_student_data  phonemes.py  sample_data


# Dataset and dataloading (TODO)

In [5]:
# PHONEME_MAP is the list that maps the phoneme to a single character. 
# The dataset contains a list of phonemes but you need to map them to their corresponding characters to calculate the Levenshtein Distance
# You final submission should not have the phonemes but the mapped string
# No TODOs in this cell

PHONEME_MAP = [
    " ",
    ".", #SIL
    "a", #AA
    "A", #AE
    "h", #AH
    "o", #AO
    "w", #AW
    "y", #AY
    "b", #B
    "c", #CH
    "d", #D
    "D", #DH
    "e", #EH
    "r", #ER
    "E", #EY
    "f", #F
    "g", #G
    "H", #H
    "i", #IH 
    "I", #IY
    "j", #JH
    "k", #K
    "l", #L
    "m", #M
    "n", #N
    "N", #NG
    "O", #OW
    "Y", #OY
    "p", #P 
    "R", #R
    "s", #S
    "S", #SH
    "t", #T
    "T", #TH
    "u", #UH
    "U", #UW
    "v", #V
    "W", #W
    "?", #Y
    "z", #Z
    "Z" #ZH
]

In [89]:
import pdb
from phonemes import PHONEMES
# This cell is where your actual TODOs start
# You will need to implement the Dataset class by your own. You may also implement it similar to HW1P2 (dont require context)
# The steps for implementation given below are how we have implemented it.
# However, you are welcomed to do it your own way if it is more comfortable or efficient. 

class LibriSamples(torch.utils.data.Dataset):

    def __init__(self, data_path, partition= "train"): # You can use partition to specify train or dev

        if partition=='train':
           self.X_dir = os.path.join(data_path,"mfcc") # data_path = /content/hw3p2_student_data/hw3p2_student_data
           self.Y_dir = os.path.join(data_path,"transcript") # data_path = /content/hw3p2_student_data/hw3p2_student_data
        elif partition=='dev':
           self.X_dir = os.path.join(data_path,"mfcc") # data_path = /content/hw3p2_student_data/hw3p2_student_data
           self.Y_dir = os.path.join(data_path,"transcript") # data_path = /content/hw3p2_student_data/hw3p2_student_data
           

        
        self.X_files = os.listdir(self.X_dir)
        self.Y_files = os.listdir(self.Y_dir)
        
        #print(self.X_files.shape)
        #self.X=np.zeros(len(self.X_files))
        #self.Y=np.zeros(len(self.Y_files))
        #for i in range(len(self.X_files)):
         #   self.X[i],self.Y[i] =self._getitem_(i)
             


        #self.Y_files = self.Y1_files[1:-1]
        # remove <eos> & <sos> from Y
        # TODO: store PHONEMES from phonemes.py inside the class. phonemes.py will be downloaded from kaggle.
        # You may wish to store PHONEMES as a class attribute or a global variable as well.
        self.PHONEMES = PHONEMES
        print("X_files",len(self.X_files))
        print("Y_files",len(self.Y_files))
        assert(len(self.X_files) == len(self.Y_files))

        
        pass

    def __len__(self):
        return len(self.X_files)

    def __getitem__(self, ind):
    
        X = np.load(os.path.join(self.X_dir,self.X_files[ind]))    # TODO: Load the mfcc npy file at the specified index ind in the directory
        #pdb.set_trace()
        Y1 = np.load(os.path.join(self.Y_dir,self.Y_files[ind]))    # TODO: Load the corresponding transcripts
        
        Y2 =np.array(Y1[1:-1])
        #print(Y2)
        Y = np.zeros(Y2.shape[0])
        
        for i in range(Y2.shape[0]):
            #pdb.set_trace()
            Y[i] = self.PHONEMES.index(Y2[i])
            

        # Remember, the transcripts are a sequence of phonemes. Eg. np.array(['<sos>', 'B', 'IH', 'K', 'SH', 'AA', '<eos>'])
        # You need to convert these into a sequence of Long tensors

        # Tip: You may need to use self.PHONEMES
        # Remember, PHONEMES or PHONEME_MAP do not have '<sos>' or '<eos>' but the transcripts have them. 
        # You need to remove '<sos>' and '<eos>' from the trancripts. 
        # Inefficient way is to use a for loop for this. Efficient way is to think that '<sos>' occurs at the start and '<eos>' occurs at the end.
        
        # Amend the self.Phonemes list
        
        Yy = torch.LongTensor(Y)      # TODO: Convert sequence of  phonemes into sequence of Long tensors
        #pdb.set_trace()
        return torch.tensor(X), Yy
    
    def collate_fn(batch):

        #batch_x=[]
        #batch_y=[]
        #batch_x_pad=[]
        #batch_y_pad=[]
        #lengths_x=[]
        #lengths_y=[]
        batch_x = [x for x,y in batch]
        #pdb.set_trace()
        batch_y = [y for x,y in batch]
        #for i in range(len(batch)):
        #    batch_x.append(torch.from_numpy(np.array([x for x,y in batch])[i]))
        #    batch_y.append(np.array([y for x,y in batch])[i])
        
                         # TODO: pad the sequence with pad_sequence (already imported)
        
        #pdb.set_trace()
        batch_x_pad = pad_sequence(batch_x)
        #pdb.set_trace()
        lengths_x = [x.shape[0] for x,y in batch]                   # TODO: Get original lengths of the sequence before padding
        batch_y_pad = pad_sequence(batch_y,batch_first=True)        # TODO: pad the sequence with pad_sequence (already imported)
        lengths_y = [y.shape[0] for x,y in batch]                   # TODO: Get original lengths of the sequence before padding
        #pdb.set_trace()
        return batch_x_pad, batch_y_pad, torch.tensor(lengths_x), torch.tensor(lengths_y)


# You can either try to combine test data in the previous class or write a new Dataset class for test data
class LibriSamplesTest(torch.utils.data.Dataset):

     def __init__(self, data_path, test_order): # test_order is the csv similar to what you used in hw1
         self.data_path=data_path
         
         test_order_list = list(pd.read_csv(data_path + test_order).file)
         self.X = [np.load(data_path + "/mfcc/"+v) for v in test_order_list]
#          # self.X_dir = os.path.join(data_path,"mfcc") # data_path = /content/hw3p2_student_data/hw3p2_student_data
#            #self.Y_dir = os.path.join(data_path,"transcript") # data_path = /content/hw3p2_student_data/hw3p2_student_data
        
#         # elif partition=='dev':
#         #    self.X_dir = os.path.join(data_path,"mfcc") # data_path = /content/hw3p2_student_data/hw3p2_student_data
#         #    self.Y_dir = os.path.join(data_path,"transcript") # data_path = /content/hw3p2_student_data/hw3p2_student_data
#         # self.X_files = os.listdir(self.X_dir)
          # if test_order:
#         #    with open(test_order) as f:  # Need for csv file here.
#         #         subset = csv.reader(f) # read csv file
#         #         test_order_list = list(subset) # TODO: open test_order.csv as a list
#         #    for file in test_order_list:  
#         #        self.X.append(np.load(os.path.join("/content/hw3p2_student_data/hw3p2_student_data/test",file)))     

#         # Load .npy files.
                
#         else:
       
#         # You can load the files here or save the paths here and load inside __getitem__ like the previous class
    
     def __len__(self):
         return len(self.X)
    
     def __getitem__(self, ind):
         # TODOs: Need to return only X because this is the test dataset

         return torch.tensor(self.X[ind])
    
     def collate_fn(batch):
        # batch_x=[]
        # batch_x_pad=[]
        # lengths_x=[]
        # for i in range(len(batch)):
        #     batch_x.append(torch.from_numpy(np.array([x for x in batch])[i]))
            
        
        #                  # TODO: pad the sequence with pad_sequence (already imported)
        
        # batch_x_pad.append(pad_sequence(batch_x))
        # lengths_x.append(len(batch_x))                   # TODO: Get original lengths of the sequence before padding
        #         # TODO: pad the sequence with pad_sequence (already imported)
         
         
         
        #      # TODO: pad the sequence with pad_sequence (already imported)
        #       # TODO: Get original lengths of the sequence before padding
        batch_x = [x for x in batch]
        #pdb.set_trace()
        #batch_y = [y for x,y in batch]
        #for i in range(len(batch)):
        #    batch_x.append(torch.from_numpy(np.array([x for x,y in batch])[i]))
        #    batch_y.append(np.array([y for x,y in batch])[i])
        
                         # TODO: pad the sequence with pad_sequence (already imported)
        
        #pdb.set_trace()
        batch_x_pad = pad_sequence(batch_x)
        lengths_x = [x.shape[0] for x in batch]                   # TODO: Get original lengths of the sequence before padding
        
        return batch_x_pad, torch.tensor(lengths_x)

In [90]:
batch_size = 64

root = "/content/hw3p2_student_data/hw3p2_student_data" # TODO: Where your hw3p2_student_data folder is

train_data = LibriSamples(os.path.join(root,"train"),partition='train')
val_data = LibriSamples(os.path.join(root,'dev'),partition='dev')
test_data = LibriSamplesTest(os.path.join(root,'test'),'/test_order.csv')


train_loader = DataLoader(train_data,batch_size=batch_size,shuffle=True,collate_fn=LibriSamples.collate_fn) # TODO: Define the train loader. Remember to pass in a parameter (function) for the collate_fn argument 
val_loader = DataLoader(val_data,batch_size=batch_size,shuffle=True,collate_fn=LibriSamples.collate_fn) # TODO: Define the val loader. Remember to pass in a parameter (function) for the collate_fn argument 
test_loader = DataLoader(test_data,batch_size=batch_size,shuffle=False,collate_fn=LibriSamplesTest.collate_fn) # TODO: Define the test loader. Remember to pass in a parameter (function) for the collate_fn argument 


print("Batch size: ", batch_size)
print("Train dataset samples = {}, batches = {}".format(train_data.__len__(), len(train_loader)))
print("Val dataset samples = {}, batches = {}".format(val_data.__len__(), len(val_loader)))
print("Test dataset samples = {}, batches = {}".format(test_data.__len__(), len(test_loader)))

X_files 28539
Y_files 28539
X_files 2703
Y_files 2703
Batch size:  64
Train dataset samples = 28539, batches = 446
Val dataset samples = 2703, batches = 43
Test dataset samples = 2620, batches = 41


In [75]:
x,y,lx,ly = next(iter(val_loader))

In [91]:
# Optional
# Test code for checking shapes and return arguments of the train and val loaders
for data in val_loader:
    x, y, lx, ly = data # if you face an error saying "Cannot unpack", then you are not passing the collate_fn argument
    print(x.shape, y.shape, lx.shape, ly.shape)
    break

torch.Size([2436, 64, 13]) torch.Size([64, 288]) torch.Size([64]) torch.Size([64])


In [None]:
print(lx.shape)

torch.Size([128])


# Model Configuration (TODO)

In [92]:
def init_weights(m):
    if isinstance(m,nn.Linear):
       nn.init.kaiming_normal(m.weight)
       nn.init.normal_(m.bias)

    elif isinstance(m,nn.Conv1d):
         nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')

    elif isinstance(m,nn.BatchNorm1d):
         nn.init.constant_(m.weight,1)
         nn.init.constant_(m.bias,0)        

In [64]:
class Block(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv1d(in_channels=dim, out_channels=dim, kernel_size=7, padding=3, groups=dim),
            nn.BatchNorm1d(dim),
            nn.Conv1d(dim, dim * 4, kernel_size=1),
            nn.GELU(),
            nn.Conv1d(dim * 4, dim, kernel_size=1)
        )
        self.layer_scale = nn.Parameter(torch.ones(1, dim, 1) * 1e-2, requires_grad=True)
  
    def forward(self, x):
        out = x
        x = self.backbone(x)
        x = self.layer_scale * x
        x += out
        return x

In [93]:
import pdb
from torch.nn.modules import dropout
class Network(nn.Module):

    def __init__(self,input_size=13,hidden_size=256): # You can add any extra arguments as you wish

        super(Network, self).__init__()

        # Embedding layer converts the raw input into features which may (or may not) help the LSTM to learn better 
        # For the very low cut-off you dont require an embedding layer. You can pass the input directly to the  LSTM
        self.embedding = nn.Sequential(nn.Conv1d(in_channels=input_size,out_channels=hidden_size,kernel_size=3,stride=2,padding=1,bias=False),
                                       nn.BatchNorm1d(hidden_size),
                                       #nn.Conv1d(128,256,kernel_size=1,stride=1)
                                       nn.GELU(),
                                       Block(hidden_size),
                                       nn.Dropout(0.4)
                                       )

        # nn.Sequential(nn.Conv1d(13,64,kernel_size=3,stride=1,bias=False,padding=1),
        #                                nn.BatchNorm1d(64),
        #                                nn.LeakyReLU(),
        #                                #nn.MaxPool1d(kernel_size=2,stride=2),
        #                                #nn.Dropout(p=0.45),
        #                                nn.Conv1d(64,128,stride=1,kernel_size=1),
        #                                nn.BatchNorm1d(128),
        #                                nn.LeakyReLU(),
        #                                #nn.MaxPool1d(kernel_size=2,stride=2),
        #                                nn.Dropout(0.45),
        #                                #nn.Conv1d(256,512,stride=1,kernel_size=1),
        #                                #nn.BatchNorm1d(512),
        #                                #nn.LeakyReLU(),
        #                                #nn.MaxPool1d(kernel_size=2,stride=2),
        #                                #nn.Dropout(0.45),
        #                               )
            
        # Frequency Masking.
        # GPU With High RAM
        # for p in self.embedding:
        #     if isinstance(p,nn.Conv1d):
        #        nn.init.kaiming_normal_(p.weight,mode='fan_out',nonlinearity='relu')
        #     elif isinstance(p,nn.BatchNorm1d):
        #          nn.init.constant_(p.weight,1)
        #          nn.init.constant_(p.bias,0)
        
        self.lstm = nn.LSTM(input_size=hidden_size,hidden_size=hidden_size,num_layers=4,bidirectional=True,dropout=0.4)


        #self.lstm = nn.LSTM(input_size=input_size, hidden_size=hidden_size, num_layers=3)# TODO: # Create a single layer, uni-directional LSTM with hidden_size = 256
        # Use nn.LSTM() Make sure that you give in the proper arguments as given in https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html
                
        self.classification = nn.Sequential(nn.Linear(hidden_size*2,hidden_size*4),
                                            nn.GELU(),
                                            #nn.Dropout(p=0.45),
                                            nn.Linear(hidden_size*4,41)
                                            )
        #nn.Linear(hidden_size,41)# TODO: Create a single classification layer using nn.Linear()

        self.apply(init_weights)

    def forward(self, x, l): # TODO: You need to pass atleast 1 more parameter apart from self and x

        # x is returned from the dataloader. So it is assumed to be padded with the help of the collate_fn
        #pdb.set_trace()
        
        out = torch.permute(x,(1,2,0))
        out = self.embedding(out)
        out = torch.permute(out,(2,0,1))
        #pdb.set_trace()
        #out = out.permute(2,0,1)
        #pdb.set_trace()
        #l = torch.div(l,8)
        #l = l.clamp(max=out.shape[2])
        l = ((l - 3 + 2) // 2) + 1
        packed_input = pack_padded_sequence(out,l,enforce_sorted=False) # TODO: Pack the input with pack_padded_sequence. Look at the parameters it requires
        #pdb.set_trace()
        out1, (out2, out3) = self.lstm(packed_input)# TODO: Pass packed input to self.lstm
        # As you may see from the LSTM docs, LSTM returns 3 vectors. Which one do you need to pass to the next function?
        out, lengths  = pad_packed_sequence(out1) # TODO: Need to 'unpack' the LSTM output using pad_packed_sequence
        #pdb.set_trace()
        out = self.classification(out)# TODO: Pass unpacked LSTM output to the classification layer
        out = out.log_softmax(2)# Optional: Do log softmax on the output. Which dimension?
        #pdb.set_trace()
        return out,lengths # TODO: Need to return 2 variables

model = Network().to(device)
print(model)
#summary(model, x.to(device), lx) # x and lx are from the previous cell

Network(
  (embedding): Sequential(
    (0): Conv1d(13, 256, kernel_size=(3,), stride=(2,), padding=(1,), bias=False)
    (1): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): GELU()
    (3): Block(
      (backbone): Sequential(
        (0): Conv1d(256, 256, kernel_size=(7,), stride=(1,), padding=(3,), groups=256)
        (1): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (2): Conv1d(256, 1024, kernel_size=(1,), stride=(1,))
        (3): GELU()
        (4): Conv1d(1024, 256, kernel_size=(1,), stride=(1,))
      )
    )
    (4): Dropout(p=0.4, inplace=False)
  )
  (lstm): LSTM(256, 256, num_layers=4, dropout=0.4, bidirectional=True)
  (classification): Sequential(
    (0): Linear(in_features=512, out_features=1024, bias=True)
    (1): GELU()
    (2): Linear(in_features=1024, out_features=41, bias=True)
  )
)


## CovNext + Residual

In [53]:
class StageLayers(nn.Module):
  
  def __init__(self,
                 in_channels,
                 out_channels,
                 stride,
               padding,
               flag,
                ):
        super().__init__() # Just have to do this for all nn.Module classes

        self.in_channels = in_channels
        self.out_channels = out_channels
        self.stride = stride
        self.padding= padding
        self.flag= flag
        #self.drop_block = DropBlock2D(block_size=3, drop_prob=0.3)        
        self.Layer = nn.Sequential(
            nn.Conv1d(in_channels=in_channels,out_channels=out_channels,kernel_size=7,stride=self.stride,padding=self.padding,groups=in_channels),
            nn.BatchNorm1d(num_features=out_channels),
            nn.ReLU(),
            nn.Conv1d(in_channels=out_channels,out_channels=4*out_channels,kernel_size=1,stride=1,padding=0),
            nn.BatchNorm1d(num_features=4*out_channels),
            nn.ReLU(),
            nn.Conv1d(in_channels=4*out_channels,out_channels=out_channels,kernel_size=1,stride=1,padding=0),
            nn.BatchNorm1d(num_features=out_channels),
            #nn.Conv2d(in_channels=in_channel,out_channels=in_channel*4,kernel_size=1,stride=1,padding=0),
         )
        
  def forward(self, x):
        
        #y= torch.tensor(nn.Identity(x))
        size = (x.shape[2]+2*self.padding-7)//self.stride +1
        y= nn.functional.interpolate(x,size)
        out = self.Layer(x)

        if self.flag==1:
          #  print("out,x",x.shape)
          #  print("out,y",y.shape)
           return out
        
        #if self.in_channels==self.out_channels and x.shape[2]==out.shape[2]:
             #print(out.shape)
        #      #print(x.shape)
        # elif self.flag==2:
        #    return out
        else: 
           #pdb.set_trace()
          #  print("out+y,y.shape",y.shape)
          #  print("out+y,out.shape",out.shape)
          #  print("out+y,flag",self.flag)
           return out+y
      
    
        
class CovNext(nn.Module):
     def __init__(self, num_classes= 256):
         super().__init__() # Already features indented from previous class. 
         #self.drop_block = DropBlock2D(block_size=3, drop_prob=0.3)  
         self.num_classes = num_classes
         
         self.stem = nn.Sequential(
            nn.Conv1d(in_channels=13,out_channels=96,kernel_size=4,stride=4,padding=0),
            nn.BatchNorm1d(num_features=96),
            nn.ReLU(),
            )
         
        #  self.stage1_maxpool = nn.Sequential(
        #    nn.MaxPool2d(kernel_size=3,stride=2,padding=1), 
        #     )
        # # 56*56
         self.stage_cfgs = [
              # in_channel, #blocks
            [96, 96, 3],
            [96, 192, 3], 
            [192, 384, 9], 
            [384, 768, 3], 
            ]
                           
         layers = []
         #pdb.set_trace()
         i=0
         for curr_stage in self.stage_cfgs:
             in_channels, out_channels, num_blocks = curr_stage
             for block_idx in range(num_blocks):
                 print(i)
                 print(block_idx)
                 if block_idx==0: 
                       stride=1 if in_channels==13 else 2
                       padding=0 if in_channels==13 else 3
                      #  flag=2
                 else:
                    stride= 1
                    padding= 3
                    #flag=0      

                #  if block_idx==num_blocks-1:
                #     flag=1
                #  else:
                #      flag=0     

                 layers.append(StageLayers(
                 in_channels = in_channels,
                 out_channels = out_channels,
                 padding = padding,  
                 stride= stride,
                 flag= 1 if block_idx==0 else 0
                 ))
                 in_channels = out_channels
             
             i=i+1  
                 
                # in_channels=num_channels
                # In channels of the next block is the out_channels of the current one
                # in_channels = in_channels*4 
            
         self.layers = nn.Sequential(*layers) # Done, save them to the class

       
         self.cls_layer = nn.Sequential(
             #nn.Dropout(p=0.1),
             #pdb.set_trace(),
             nn.AdaptiveAvgPool1d(256),
            #  nn.Flatten(),
            #  nn.Linear(768,num_classes),
           )

      #   self._initialize_weights()

     #def _initialize_weights(self):
     #   """
     #   Usually, I like to use default pytorch initialization for stuff, but
     #   MobileNetV2 made a point of putting in some custom ones, so let's just
     #   use them.
     #   """
     #   for m in self.modules():
     #       if isinstance(m, nn.Conv2d):
     #          n = m.kernel_size[0] * m.kernel_size[1] * m.out_channels
     #           m.weight.data.normal_(0, math.sqrt(2. / n))
     #           if m.bias is not None:
     #              m.bias.data.zero_()
     #       elif isinstance(m, nn.BatchNorm2d):
     #           m.weight.data.fill_(1)
     #          m.bias.data.zero_()
     #       elif isinstance(m, nn.Linear):
     #           m.weight.data.normal_(0, 0.01)
     #          m.bias.data.zero_()

     def forward(self, x):
        out = self.stem(x)
        #out = self.drop_block(out)
        #out = self.stage1_maxpool(out)
        #pdb.set_trace()
        out = self.layers(out)
        #print(out.shape)
        #pdb.set_trace()
        out = self.cls_layer(out)
        
        

        return out

# New Model

In [None]:
class Block(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv1d(in_channels=dim, out_channels=dim, kernel_size=3, padding=1, groups=dim),
            nn.BatchNorm1d(dim),
            nn.Conv1d(dim, dim * 4, kernel_size=1),
            nn.GELU(),
            nn.Conv1d(dim * 4, dim, kernel_size=1)
        )
        self.layer_scale = nn.Parameter(torch.ones(1, dim, 1) * 1e-2, requires_grad=True)
  
    def forward(self, x):
        out = x
        x = self.backbone(x)
        x = self.layer_scale * x
        x += out
        return x


# Training Configuration (TODO)

In [94]:
epoches= 200
lr=2e-3
criterion = nn.CTCLoss() # TODO: What loss do you need for sequence to sequence models? 
# Do you need to transpose or permute the model output to find out the loss? Read its documentation

optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=1e-5) # TODO: Adam works well with LSTM (use lr = 2e-3)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.8, patience=2, min_lr=1e-7)
decoder = CTCBeamDecoder(
    labels=PHONEMES,
    model_path=None,
    alpha=0,
    beta=0,
    beam_width=10,
    log_probs_input=True
)
# TODO: Intialize the CTC beam decoder
# Check out https://github.com/parlance/ctcdecode for the details on how to implement decoding
# Do you need to give log_probs_input = True or False?

In [95]:
# this function calculates the Levenshtein distance 

def calculate_levenshtein(h, y, lh, ly, decoder, PHONEME_MAP):

    # h - ouput from the model. Probability distributions at each time step 
    # y - target output sequence - sequence of Long tensors
    # lh, ly - Lengths of output and target
    # decoder - decoder object which was initialized in the previous cell
    # PHONEME_MAP - maps output to a character to find the Levenshtein distance
   
    # TODO: You may need to transpose or permute h based on how you passed it to the criterion
    beam_results, beam_scores, timesteps, out_seq_len =decoder.decode(torch.permute(h,(1,0,2)),seq_lens=lh)
    # Print out the shapes often to debug
     
    # TODO: call the decoder's decode method and get beam_results and out_len (Read the docs about the decode method's outputs)
    # Input to the decode method will be h and its lengths lh 
    # You need to pass lh for the 'seq_lens' parameter. This is not explicitly mentioned in the git repo of ctcdecode.

    batch_size = h.size()[1]


    dist = 0

    for i in range(batch_size): # Loop through each element in the batch

        h_sliced = beam_results[i,0,0:out_seq_len[i][0]]# TODO: Get the output as a sequence of numbers from beam_results
        # Remember that h is padded to the max sequence length and lh contains lengths of individual sequences
        # Same goes for beam_results and out_lens
        # You do not require the padded portion of beam_results - you need to slice it with out_lens 
        # If it is confusing, print out the shapes of all the variables and try to understand

        h_string = ''.join([PHONEME_MAP[hi]for hi in h_sliced])# TODO: MAP the sequence of numbers to its corresponding characters with PHONEME_MAP and merge everything as a single string

        y_sliced = y[i][:ly[i].item()]# TODO: Do the same for y - slice off the padding with ly
        y_string = ''.join([PHONEME_MAP[yi] for yi in y_sliced])# TODO: MAP the sequence of numbers to its corresponding characters with PHONEME_MAP and merge everything as a single string
        
        dist += Levenshtein.distance(h_string, y_string)

    dist/=batch_size

    return dist

In [None]:
m = nn.AdaptiveAvgPool1d(5)
input = torch.randn(1, 64, 8)
output = m(input)
print(output.shape)

In [96]:
# Optional but recommended

for i, data in enumerate(train_loader, 0):
    
    # Write a test code do perform a single forward pass and also compute the Levenshtein distance
    # Make sure that you are able to get this right before going on to the actual training
    # You may encounter a lot of shape errors
    # Printing out the shapes will help in debugging
    # Keep in mind that the Loss which you will use requires the input to be in a different format and the decoder expects it in a different format
    # Make sure to read the corresponding docs about it
    x,y,lx,ly = data
    x = x.to(device) #cuda()
    y = y.to(device) #cuda()
    # # lx = lx.cuda()
    #lx = lx.to(device)
    #pdb.set_trace()
    outputs,l = model(x,lx)

    break # one iteration is enough

In [97]:
torch.cuda.empty_cache() # Use this often

# TODO: Write the model evaluation function if you want to validate after every epoch
def model_eval():
  total_d = 0
  total_l = 0
  val_loss = 0
  for i,data in enumerate(val_loader):
      x,y,lx,ly = data
    
      x= x.cuda()
      y= y.cuda()

      with torch.no_grad():
           outputs, l =model(x,lx)
           loss = criterion(outputs,y,l,ly)
      val_loss += loss     
    #pdb.set_trace()
      d = calculate_levenshtein(outputs,y,l, ly,decoder,PHONEME_MAP) ## (BatchSize, Seqlen, Features=41)

      total_d +=float(d)
  model.train() 
  print(f"loss: {total_loss / len(val_loader)}, Average Distance: {total_d / len(val_loader)}")
    
  return  val_loss  

# You are free to write your own code for model evaluation or you can use the code from previous homeworks' starter notebooks
# However, you will have to make modifications because of the following.
# (1) The dataloader returns 4 items unlike 2 for hw2p2
# (2) The model forward returns 2 outputs
# (3) The loss may require transpose or permuting

# Note that when you give a higher beam width, decoding will take a longer time to get executed
# Therefore, it is recommended that you calculate only the val dataset's Levenshtein distance (train not recommended) with a small beam width
# When you are evaluating on your test set, you may have a higher beam width

In [98]:
from google.colab import drive

drive.mount("/content/drive", force_remount=True)

Mounted at /content/drive


In [None]:
save_checkpoint = torch.load("/content/drive/MyDrive/BiDirLSTM_LRScheduler_model_epoch_27.pth") # Epoch done till now 67

In [None]:
model.load_state_dict(save_checkpoint['model_state_dict'])

<All keys matched successfully>

In [None]:
optimizer.load_state_dict(save_checkpoint['optimizer_state_dict'])

In [None]:
epoch = save_checkpoint['epoch']

In [99]:
import pdb
torch.cuda.empty_cache()
#scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=(len(train_loader) * epoches)) # New Addition
#scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.8, patience=3, mode='min', verbose=True)
# TODO: Write the model training code 
scaler = torch.cuda.amp.GradScaler()
# You are free to write your own code for training or you can use the code from previous homeworks' starter notebooks
model.train()
# However, you will have to make modifications because of the following.
for epoch in range(epoches):
    batch_bar = tqdm(total=len(train_loader), dynamic_ncols=True, leave=False, position=0, desc='Train')

    total_dist = 0
    total_loss = 0
    Val_loss = 0

    for i, data in enumerate(train_loader):
        optimizer.zero_grad()
        x, y, lx, ly = data
        x = x.to(device)
        y = y.to(device)

        with torch.cuda.amp.autocast():
            outputs, l = model(x, lx)
            #print(i)
            #pdb.set_trace()
            loss = criterion(outputs, y, l, ly)      #.permute(1,0,2)

        #num_correct += int((torch.argmax(outputs, axis=1) == y).sum())
        total_loss += float(loss)

        batch_bar.set_postfix(
            loss="{:.04f}".format(float(total_loss / (i + 1))),
            lr="{:.04f}".format(float(optimizer.param_groups[0]['lr'])))
        #pdb.set_trace()
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()

                 
        batch_bar.update()
    #Val_loss = model_eval()
    #scheduler.step(Val_loss)   # Validation Loss  
    batch_bar.close()
    
    print("Epoch {}/{}: Train Loss {:.04f}, Learning Rate {:.04f}".format(
        epoch + 1,
        epoches,
        float(total_loss / len(train_loader)),
        float(optimizer.param_groups[0]['lr'])),
        float(float(total_dist / len(train_loader))))
    
    # if epoch%3==0: 
    #    model_eval()
    Val_loss = model_eval()
    scheduler.step(Val_loss)

    checkpoint= {
               'epoch': epoch,
               'model_state_dict': model.state_dict(),
               'optimizer_state_dict': optimizer.state_dict(),
               'loss': criterion,
               }
    torch.save(checkpoint,"/content/drive/MyDrive/Mangalam_model_epoch_"+str(epoch)+".pth")
# (1) The dataloader returns 4 items unlike 2 for hw2p2
# (2) The model forward returns 2 outputs
# (3) The loss may require transpose or permuting

# Tip: Implement mixed precision training



Epoch 1/200: Train Loss 2.4286, Learning Rate 0.0020 0.0
loss: 25.18957822128784, Average Distance: 29.192296511627905




Epoch 2/200: Train Loss 0.7553, Learning Rate 0.0020 0.0
loss: 7.834122297375701, Average Distance: 20.38825096899225




Epoch 3/200: Train Loss 0.5558, Learning Rate 0.0020 0.0
loss: 5.764792332122492, Average Distance: 16.999006782945738




Epoch 4/200: Train Loss 0.4674, Learning Rate 0.0020 0.0
loss: 4.8480757488760835, Average Distance: 15.251501937984495




Epoch 5/200: Train Loss 0.4155, Learning Rate 0.0020 0.0
loss: 4.309796738070111, Average Distance: 15.622141472868217




Epoch 6/200: Train Loss 0.3780, Learning Rate 0.0020 0.0
loss: 3.9206904165966567, Average Distance: 14.709883720930232




Epoch 7/200: Train Loss 0.3511, Learning Rate 0.0020 0.0
loss: 3.6413888252058695, Average Distance: 13.221487403100776




Epoch 8/200: Train Loss 0.3346, Learning Rate 0.0020 0.0
loss: 3.4706220155538516, Average Distance: 12.86419573643411




Epoch 9/200: Train Loss 0.3173, Learning Rate 0.0020 0.0
loss: 3.291118153999018, Average Distance: 12.6765503875969




Epoch 10/200: Train Loss 0.3046, Learning Rate 0.0020 0.0
loss: 3.1595571609430535, Average Distance: 12.246027131782945


Train:   2%|▏         | 10/446 [00:10<06:58,  1.04it/s, loss=0.2794, lr=0.0020]

KeyboardInterrupt: ignored

# Submit to kaggle (TODO)

In [None]:
save_checkpoint = torch.load("/content/drive/MyDrive/model_epoch_29.pth")

In [None]:
model.load_state_dict(save_checkpoint['model_state_dict'])

<All keys matched successfully>

In [None]:
optimizer.load_state_dict(save_checkpoint['optimizer_state_dict'])

In [None]:
def model_eval():
  total_d=0
  for i,data in enumerate(val_loader):
    x,y,lx,ly = data
    x= x.cuda()
    y= y.cuda()

    with torch.no_grad():
      outputs, l =model(x,lx)

      d = calculate_levenshtein(outputs,y,l, ly,decoder,PHONEME_MAP)

      total_d +=float(d)

  print("Total Distance:{},Average Distance:{}",format(total_d),format(total_d/len(val_loader)))

In [None]:
model_eval()

KeyboardInterrupt: ignored

In [None]:
# TODO: Write your model evaluation code for the test dataset
# You can write your own code or use from the previous homewoks' stater notebooks
# You can't calculate loss here. Why?
model.eval()
pred = []
for i, data in enumerate(test_loader,0):
    x, lx = data
    x = x.cuda()

    with torch.no_grad():
        outputs, l = model(x, lx)

    beam_results, beam_scores, timesteps, out_lengths = decoder.decode(torch.permute(outputs, (1, 0, 2)), seq_lens=l)

    for i in range(outputs.size()[1]): # Loop through each element in the batch
        
        h_sliced = beam_results[i, 0, 0:out_lengths[i][0]]
        h_string = ''.join([PHONEME_MAP[hi] for hi in h_sliced])
        pred.append(h_string)

df = pd.DataFrame(pred, columns=['predictions'])
df.index.name = 'id'   

In [None]:
print("Test")
df.head(10)

Test


Unnamed: 0_level_0,predictions
id,Unnamed: 1_level_1
0,.HwtytOldDhtRUhTDhstrgOkRrst.
1,.RImEnyimpoRI.DIIvliNhzmOstlhvlI.
2,.Hwn?UmEbIWhniNaR?UtbIgiN?emhSrpIs.
3,..pHIbhlsofrhnDhmyt.ekses.brnz.
4,..DesWiTwvRekhniNinDhpEnzhvDhHaRt.hnsOtgOz.on.
5,.DhcyldHAdidnEdhv.gREs.WicdiznatinbARIhblI.kOg...
6,.olhbwtHimWhzhthmhldhvbRyt.hndbROkhnkelrzkAtrd...
7,.EfhtbIgRenjhn.frst.DhtDIt.ots.hvisrdhnkeRktr....
8,.DhpitIDetmImhstwn.ANdh.
9,.HIWhzshvtHaRdidinipechs.sedbeT.AndbIiNhnlhvHI...


In [None]:
print("Train")


In [None]:
# TODO: Generate the csv file
df.to_csv('/content/submission.csv')

In [None]:
!kaggle competitions submit -c 11-785-s22-hw3p2 -f submission.csv -m New_Submission

100% 204k/204k [00:02<00:00, 92.2kB/s]
Successfully submitted to Automatic Speech Recognition (ASR)