## Frame Level Speech Recognition with Neural Networks

### IDC410 Machine Learning Assessment

#### Submitted to Prof Sarab Anand

## Question for the Assessment
In this coursework you will take your knowledge of feedforward neural networks and apply it to the task of speech recognition.

You are provided a dataset of audio recordings (utterances) and their phoneme state (subphoneme) labels. The data comes from articles published in the Wall Street Journal (WSJ) that are read aloud and labelled using the original text. If you have not encountered speech data before or have not heard of phonemes or spectrograms, we will clarify these here:


### Phonemes and Phoneme States

Just as letters are the atomic elements of written language, phonemes are the atomic elements of speech. We need a way to distinguish the different sounds in speech, which may or may not correspond to the same letter or combination of letters in the written alphabet. For example, the words "jet" and "ridge" both contain the same sound, and we refer to this elemental sound as the phoneme "JH". For this challenge we will consider 46 phonemes in the English language.

["+BREATH+", "+COUGH+", "+NOISE+", "+SMACK+", "+UH+", "+UM+", "AA", "AE", "AH", "AO", "AW", "AY", "B", "CH", "D", "DH", "EH", "ER", "EY", "F", "G", "HH", "IH", "IY", "JH", "K", "L", "M", "N", "NG", "OW", "OY", "P", "R", "S", "SH", "SIL", "T", "TH", "UH", "UW", "V", "W", "Y", "Z", "ZH"]

A powerful technique in speech recognition is to model speech as a Markov process with unobserved states. This model considers observed speech to be dependent on unobserved state transitions. We refer to these unobserved states as phoneme states or subphonemes. For each phoneme, there are 3 respective phoneme states. Therefore, for our 46 phonemes, there exist 138 respective phoneme states.
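As a quick check of the count above, the sketch below enumerates three states per phoneme. The pairing of (phoneme, state) tuples with list positions is purely illustrative: the source does not say how the 138 integer labels map to phonemes, so the ordering here is an assumption.

```python
# The 46 phonemes listed above.
PHONEMES = [
    "+BREATH+", "+COUGH+", "+NOISE+", "+SMACK+", "+UH+", "+UM+",
    "AA", "AE", "AH", "AO", "AW", "AY", "B", "CH", "D", "DH", "EH", "ER",
    "EY", "F", "G", "HH", "IH", "IY", "JH", "K", "L", "M", "N", "NG",
    "OW", "OY", "P", "R", "S", "SH", "SIL", "T", "TH", "UH", "UW", "V",
    "W", "Y", "Z", "ZH",
]

STATES_PER_PHONEME = 3

# Hypothetical enumeration: every phoneme paired with state indices 0, 1, 2.
phoneme_states = [(p, s) for p in PHONEMES for s in range(STATES_PER_PHONEME)]

print(len(PHONEMES), "phonemes ->", len(phoneme_states), "phoneme states")
```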


Hidden Markov Models (HMMs) estimate the parameters of this unobserved Markov process (transition and emission probabilities) that maximize the likelihood of the observed speech data.

Your task is to instead take a model-free approach and classify mel spectrogram frames using a neural network that takes a frame (plus optional context) and outputs class probabilities for all 138 phoneme states. Performance on the task will be measured by classification accuracy on a held-out set of labelled mel spectrogram frames. Training/dev labels are provided as integers [0-137].


### Representing Speech

As a first step, the speech must be converted into a feature representation that can be fed into the network.

In our representation, utterances have been converted to "mel spectrograms", which are pictorial representations that characterize how the frequency content of the signal varies with time. The frequency-domain representation of the audio signal provides more useful features for distinguishing phonemes than the raw waveform.

For a more intuitive understanding, consider attempting to determine which instruments are playing in an orchestra given an audio recording of a performance. By looking only at the amplitude of the signal of the orchestra over time, it is nearly impossible to distinguish one source from another. But if the signal is transformed into the frequency domain, we can use our knowledge that flutes produce higher frequency sounds and bassoons produce lower frequency sounds. In speech, a similar phenomenon is observed when the vocal tract produces sounds at varying frequencies.

To convert the speech to a mel spectrogram, it is segmented into little "frames", each 25ms wide, where the "stride" between adjacent frames is 10ms. Thus we get 100 such frames per second of speech.

From each frame, we compute a single "mel spectral" vector, where the components of the vector represent the (log) energy in the signal in different frequency bands. In the data we have given you, we have 40-dimensional mel-spectral vectors, i.e. we have computed energies in 40 frequency bands.

Thus, we get 100 40-dimensional mel spectral (row) vectors per second of speech in the recording. Each one of these vectors is referred to as a frame. The details of how mel spectrograms are computed from speech are explained in the attached blog.

Thus, for a T-second recording, the entire spectrogram is a 100T x 40 matrix, comprising 100T 40-dimensional vectors (at 100 vectors (frames) per second).
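The frame arithmetic can be sketched as follows. The helper names are hypothetical, and the count assumes that only complete 25 ms windows are kept (no padding at the end of the recording); under that assumption the exact number of frames is slightly below 100T, which is why 100T x 40 is best read as an approximation.

```python
def num_frames(duration_s, window_ms=25, stride_ms=10):
    """Number of complete analysis windows in a recording (assumes no padding)."""
    duration_ms = int(round(duration_s * 1000))
    if duration_ms < window_ms:
        return 0
    return (duration_ms - window_ms) // stride_ms + 1


def spectrogram_shape(duration_s, n_mels=40):
    """Shape [time, frequency] of the mel spectrogram of one utterance."""
    return (num_frames(duration_s), n_mels)


print(spectrogram_shape(3.0))  # (298, 40), close to 100 * 3 = 300 frames
```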

The training data comprises:
<li> Speech recordings
<li> Frame-level phoneme state labels

The test data comprises:
<li> Speech recordings
<li> Phoneme state labels are not given


### Expected from Us

Your job is to identify the phoneme state label for each frame in the test data set. It is important to note that utterances are of variable length. We are providing you code to load and parse the raw files into the expected format. For now we are only providing dev data files as the training file is very large.


### Dataset
#### Feature File 

[train|dev|test].npy contain a numpy object array of shape [utterances]. Each utterance is a float32 ndarray of shape [time, frequency], where time is the length of the utterance. The frequency dimension is always 40, but the time dimension varies from utterance to utterance.

#### Label Files

[train|dev]_labels.npy contain a numpy object array of shape [utterances]. Each element in the array is an int32 array of shape [time] and provides the phoneme state label for each frame. There are 138 distinct labels [0-137], one for each subphoneme.
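To make the layout concrete, the sketch below builds a tiny synthetic dataset in the same format (random values and made-up utterance lengths, not the real data) and checks the shapes described above.

```python
import numpy as np

rng = np.random.default_rng(0)
lengths = [5, 8, 3]  # hypothetical utterance lengths in frames

# Object arrays of shape [utterances]; each element has its own length.
features = np.empty(len(lengths), dtype=object)
labels = np.empty(len(lengths), dtype=object)
for i, t in enumerate(lengths):
    features[i] = rng.standard_normal((t, 40)).astype(np.float32)  # [time, frequency]
    labels[i] = rng.integers(0, 138, size=t).astype(np.int32)      # one label per frame

# Every utterance has exactly one label per frame and 40 frequency bands.
for x, y in zip(features, labels):
    assert x.shape == (len(y), 40)

total_frames = sum(len(y) for y in labels)
print(features.shape, total_frames)
```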



You can download the dataset from [here](https://www.kaggle.com/c/cmu-11785-deep-learning-hw1-p2/data)



### Implementation
The dataset files are nearly 8GB in size, so we can't load them directly into a Google Colab notebook; instead we make use of Google Drive.

Upload the files to Google Drive and use the Drive feature of Google Colaboratory. Run the code below; it will show you a link. Visit that link, give confirmation, copy the auth code and paste it in the dialog box that appears.
This lets you access the files in your Google Drive.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

If it shows Mounted at /content/drive,
then your Google Drive has been synced with Google Colab.

### Use GPU
As the data file is large and the ML model requires intensive computation, it is better to run the whole process on a GPU.

At the top, click on Runtime, select Change runtime type, and select GPU.

Runtime >> Change runtime type >> GPU



Check the details of the GPU. It's better to run the whole program on Google Colab, since Google Colab lets us use a GPU and the Google backend to make our program train faster.

In [None]:
## check the version of CUDA and other details about the GPU
!nvidia-smi

We have CUDA version 11.2. Neural networks are usually built with either TensorFlow or PyTorch; I am using PyTorch here.
To download the best-suited version of PyTorch for your system, visit [here](https://pytorch.org/).

Select the options that match your system's specification, copy the generated pip install command, and run it below to install PyTorch.

In [None]:
pip install torch==1.8.1+cu111 torchvision==0.9.1+cu111 torchaudio===0.8.1 -f https://download.pytorch.org/whl/torch_stable.html

Since PyTorch is a big library, it'll take some time to install.
When it finishes, it shows
Successfully installed torch followed by its version.

In [None]:
#importing the basic libraries
import numpy as np               # numpy for matrix operation
import sys                       # the sys module provides information about constants, functions and methods of the Python interpreter
import matplotlib.pyplot as plt  # matplotlib is used to plot the graphs and figures
import time                      # the time() function returns the number of seconds passed since epoch, so we import time library

# getting the ML libraries to make our neural network
import torch                     # getting the PyTorch module imported
import torch.nn as nn            # importing the neural network module from PyTorch(torch) by the name nn
import torch.nn.functional as F  # torch.nn.functional contains functions like convolution layers, pooling functions etc
import torch.optim as optim      # torch.optim is a package implementing various optimization algorithms.

# getting some functions from Pytorch libraries
from torch.utils import data        # importing the data module (Dataset, DataLoader) from PyTorch
from torchvision import transforms  # Transforms are common image transformations. They can be chained together using Compose
from torch.optim.lr_scheduler import StepLR  # StepLR decays the learning rate of each parameter group by gamma every step_size epochs.
from torch.optim.lr_scheduler import ReduceLROnPlateau # ReduceLROnPlateau reduces the learning rate when a monitored metric has stopped improving.


The ML model we want to train has a training file of 6.1GB, and therefore needs a lot of computing power. Using a GPU is a must. Google Colab offers a GPU backend and 12GB of RAM.
PyTorch can check whether a CUDA-capable GPU is available.

In [None]:
# check for GPU
cuda = torch.cuda.is_available() # this line checks whether GPU is available on the device and returns True if it is available
cuda



In [None]:
# Checking the PyTorch version
print(torch.__version__)



## Loading the Dataset

Since the data files are too large to upload directly to Google Colab, we sync Google Drive, upload the data files there, and read them with the load function from the numpy library.

In [None]:
# loading the data file
train_labels = np.load('/content/drive/My Drive/train_labels.npy',allow_pickle=True)  # Allow loading pickled object arrays stored in npy files.
dev_train = np.load('/content/drive/My Drive/dev.npy',allow_pickle=True)              # Reasons for disallowing pickles include security
dev_labels = np.load('/content/drive/My Drive/dev_labels.npy',allow_pickle=True)      # As loading pickled data can execute arbitrary code. 
test =  np.load('/content/drive/My Drive/test.npy',allow_pickle=True)                 # If pickles are disallowed, loading object arrays will fail.

In [None]:
# loading the training dataset
train = np.load('/content/drive/My Drive/train.npy',allow_pickle=True)     # this file will take longer to load

## Dataloader
In the dataloader, I pad the feature matrix of every utterance with k zero frames on both sides and stack all features and all labels into one large sequence each in the init part. The concatenation of a frame with its context frames is done in the get item part. Depending on the system, it might take a long time to load the train data into the train loader.
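The padding and context concatenation can be illustrated on a single toy utterance. This is a simplified sketch with a hypothetical helper, not the dataset class itself: it pads one utterance with k zero frames on each side and flattens the 2k+1 frames around frame t into one input vector of size (2k+1) x 40.

```python
import numpy as np


def frame_with_context(utterance, t, k):
    """Return frame t plus k context frames on each side as one flat vector."""
    padded = np.pad(utterance, ((k, k), (0, 0)), mode="constant", constant_values=0)
    window = padded[t:t + 2 * k + 1]  # frame t sits at padded index t + k
    return window.reshape(-1)


utt = np.arange(6 * 40, dtype=np.float32).reshape(6, 40)  # toy utterance with 6 frames
x = frame_with_context(utt, t=0, k=13)
print(x.shape)  # (1080,) because (2 * 13 + 1) * 40 = 1080
```

For the first frame of an utterance, the left half of the window consists entirely of zero padding, which is why edge frames still produce full-sized inputs.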



In [None]:
# MyDataset wraps the utterances as a PyTorch Dataset; each item is one frame with k context frames on either side, plus its label

class MyDataset(data.Dataset):
    def __init__(self, X,Y,k):
       
        self.X = X               # intitalize all the parameters as self
        self.Y = Y
        self.k = k
        self.samples = []        # initialize empty arrays/lists which will be used further
        self.labels = []
        self.length = []
        self._init_dataset()
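        self.ind = np.arange(self.length[-1])    # one entry per labelled frame
        km = [self.k*(2*i+1) for i in range(len(self.length))]   # padding rows that precede the frames of utterance i
        
        # shift each frame index so that it points at the same frame inside the padded sample list
        b = 0
        for i in range(self.length[-1]):
            if i == self.length[b]:   # frame i is the first frame of the next utterance
                b = b+1
            self.ind[i] = self.ind[i] + km[b]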
        
    # number of labelled frames in the dataset
    def __len__(self):
        return len(self.labels)
    # concatenate the 2k+1 frames centred on the requested frame into one flat input vector
    def __getitem__(self,index):
        X = np.concatenate((self.samples[self.ind[index]-self.k:self.ind[index]+ self.k+1]),axis=0)
        labels = self.labels[index]
        return torch.from_numpy(X).float(),torch.tensor(labels).long()
    
    def _init_dataset(self):
        s = 0
        for i in range(len(self.X)):
            # zero-pad k frames at both ends so that edge frames get a full context window
            p = np.pad(self.X[i], ((self.k, self.k), (0, 0)), 'constant', constant_values=0)
            s = s + len(self.X[i])
            self.length.append(s)           # cumulative number of unpadded frames
            self.samples.extend(p)          # extend in place instead of rebuilding the list each time
            self.labels.extend(self.Y[i])

In [None]:
# making the class TestDataset, which is used in the testing phase of our model
# it is similar to the MyDataset class made above, but has no labels
class TestDataset(data.Dataset):
    def __init__(self, X,k):
       
        self.X = X
        self.k = k
        self.samples = []
        self.length = []
        self._init_dataset()
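        self.ind = np.arange(self.length[-1])    # one entry per test frame
        km = [self.k*(2*i+1) for i in range(len(self.length))]   # padding rows that precede the frames of utterance i
        
        # shift each frame index so that it points at the same frame inside the padded sample list
        b = 0
        for i in range(self.length[-1]):
            if i == self.length[b]:   # frame i is the first frame of the next utterance
                b = b+1
            self.ind[i] = self.ind[i] + km[b]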
        

    def __len__(self):
        return self.length[-1]   # total number of frames across all test utterances

    def __getitem__(self,index):
        X = np.concatenate((self.samples[self.ind[index]-self.k:self.ind[index]+ self.k+1]),axis=0)
        return torch.from_numpy(X).float()
    
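    def _init_dataset(self):
        s = 0
        for i in range(len(self.X)):
            # zero-pad k frames at both ends so that edge frames get a full context window
            p = np.pad(self.X[i], ((self.k, self.k), (0, 0)), 'constant', constant_values=0)
            s = s + len(self.X[i])
            self.length.append(s)           # cumulative number of unpadded frames
            self.samples.extend(p)          # extend in place instead of rebuilding the list each time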

### Difference between num_workers and batch_size

People often confuse num_workers with batch_size, but the two are unrelated. Say you set batch_size to 20 and the training set has 2000 instances; then each epoch contains 100 iterations, and in each iteration the data loader returns a batch of 20 instances. Setting num_workers > 0 starts worker processes that prepare batches in the background, so that the next batch is ready when the current one has been consumed. More workers consume more memory but help speed up the I/O process.

Here, num_workers > 0 is used only when running on a GPU.
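The iteration count in the example above depends only on the dataset size and the batch size. The hypothetical helper below mirrors how the number of batches per epoch is usually computed (rounding up unless the last incomplete batch is dropped); num_workers never enters the formula.

```python
import math


def batches_per_epoch(n_samples, batch_size, drop_last=False):
    """Number of iterations in one epoch for a given dataset and batch size."""
    if drop_last:
        return n_samples // batch_size
    return math.ceil(n_samples / batch_size)


# The worker count changes how batches are prepared, not how many there are.
for workers in (0, 2, 8):
    print("num_workers =", workers, "->", batches_per_epoch(2000, 20), "iterations")
```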

In [None]:
# Having more workers will increase the memory usage and that's the most serious overhead
num_workers = 8 if cuda else 0  # use 8 worker processes if a GPU is available, else load data in the main process

# preparing the model to train
train_dataset = MyDataset(train, train_labels,13) 

train_loader_args = dict(shuffle=True, batch_size=256, num_workers=num_workers, pin_memory=True)

train_loader = data.DataLoader(train_dataset, **train_loader_args)

## Model
I have used k=13 (input_size=1080) and a batch size of 256, initialized the model with Xavier initialization, used batch norm after the activations, and used GELU as the activation function (torch 1.4 required). I have used the Adam optimizer with the default learning rate, halved the learning rate every 5 epochs, and ran it for around 30 epochs (I do not exactly remember how many epochs I ran; 30 is a conservative estimate).

In [None]:
# Validation
num_workers = 8 
val_dataset = MyDataset(dev_train, dev_labels,13)   # dev_train holds the dev features loaded from dev.npy
val_loader_args = dict(shuffle=False, batch_size=256, num_workers=num_workers, pin_memory=True)
val_loader = data.DataLoader(val_dataset, **val_loader_args)

In [None]:

# Testing
test_dataset = TestDataset(test,13)
test_loader_args = dict(shuffle=False, batch_size=1, num_workers=num_workers, pin_memory=True)
test_loader = data.DataLoader(test_dataset, **test_loader_args)

## Xavier Initialization

Xavier initialization, originally proposed by Xavier Glorot and Yoshua Bengio in "Understanding the difficulty of training deep feedforward neural networks", is a weight initialization technique that tries to make the variance of the outputs of a layer equal to the variance of its inputs. This idea turned out to be very useful in practice. Naturally, this initialization depends on the layer's activation function. In their paper, Glorot and Bengio considered the logistic sigmoid activation function, which was the default choice at that time.

Later on, the sigmoid activation was surpassed by ReLU, because it helped with the vanishing / exploding gradients problem. Consequently, a new initialization technique appeared, which applied the same idea (balancing the variance of the activations) to this new activation function. It was proposed by Kaiming He et al. in "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification", and it is now often referred to as He initialization.

In TensorFlow, He initialization is implemented in the variance_scaling_initializer() function (which is, in fact, a more general initializer, but by default performs He initialization), while the Xavier initializer is, logically, xavier_initializer().

<li> He initialization works better for layers with ReLU activation.
<li> Xavier initialization works better for layers with sigmoid activation.

Read about it more [here](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf) and [here](https://andyljones.tumblr.com/post/110998971763/an-explanation-of-xavier-initialization)
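For reference, the standard normal-distribution variants of the two schemes use Var(W) = 2 / (fan_in + fan_out) for Xavier (Glorot) and Var(W) = 2 / fan_in for He. The sketch below only computes the resulting standard deviations; the helper names and the example layer size (1080 inputs, 2048 outputs) are chosen for illustration.

```python
import math


def glorot_normal_std(fan_in, fan_out):
    """Standard deviation of Xavier/Glorot normal initialization."""
    return math.sqrt(2.0 / (fan_in + fan_out))


def he_normal_std(fan_in):
    """Standard deviation of He normal initialization."""
    return math.sqrt(2.0 / fan_in)


fan_in, fan_out = 1080, 2048  # illustrative layer size
print("Xavier std:", glorot_normal_std(fan_in, fan_out))
print("He std:    ", he_normal_std(fan_in))
```

He initialization gives a larger spread for the same layer whenever fan_out > fan_in is not the dominant term, which compensates for ReLU zeroing out roughly half of the activations.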

In [None]:
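# the Xavier (Glorot) function

def init_xavier(m):
  if isinstance(m, nn.Linear):
    fan_in = m.weight.size()[1]    # number of input features
    fan_out = m.weight.size()[0]   # number of output features
    std = np.sqrt(2.0/(fan_in + fan_out))   # Glorot normal: Var(W) = 2/(fan_in + fan_out)
    m.weight.data.normal_(0,std)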

In [None]:
def init_he(m):
  if isinstance(m, nn.Linear):
    fan_in = m.weight.size()[1]    # number of input features
    std = np.sqrt(2.0/fan_in)      # He normal: Var(W) = 2/fan_in
    m.weight.data.normal_(0,std)

## Functions Used in Making the Neural Network

### nn.Linear()
torch.nn.Linear(in_features, out_features, bias=True)\
Applies a linear transformation to the incoming data: $y = xA^T + b$

###### Parameters
<li> in_features – size of each input sample

<li> out_features – size of each output sample

<li> bias – If set to False, the layer will not learn an additive bias. Default: True


###### Shape
<li> Input: $(N, *, H_{in})$  where $*$ means any number of additional dimensions and $H_{in} = \text{in_features}$

<li> Output: $(N, *, H_{out})$
where all but the last dimension are the same shape as the input and $H_{out} = \text{out_features}$
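The transformation and its shapes can be reproduced in plain NumPy. This is a sketch with toy sizes and random values; as in nn.Linear, the weight matrix is stored with shape (out_features, in_features).

```python
import numpy as np

rng = np.random.default_rng(0)
N, in_features, out_features = 4, 6, 3  # toy sizes

x = rng.standard_normal((N, in_features))             # input batch
A = rng.standard_normal((out_features, in_features))  # weight matrix
b = rng.standard_normal(out_features)                 # bias vector

y = x @ A.T + b  # y = x A^T + b
print(y.shape)   # (4, 3): the last dimension changes from in_features to out_features
```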


### nn.GELU()

Applies the Gaussian Error Linear Units function:

$\text{GELU}(x) = x * \Phi(x)$

where $\Phi(x)$ is the Cumulative Distribution Function of the standard Gaussian distribution.


###### Shape
<li> Input: $(N, *)$ where $*$ means any number of additional dimensions

<li> Output: $(N, *)$ same shape as the input.
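The GELU formula can be evaluated directly with the error function, since the standard Gaussian CDF is $\Phi(x) = \frac{1}{2}(1 + \text{erf}(x / \sqrt{2}))$. The scalar helper below is a minimal sketch of that formula, not the PyTorch implementation.

```python
import math


def gelu(x):
    """GELU(x) = x * Phi(x), with Phi the standard Gaussian CDF."""
    phi = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    return x * phi


for value in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(value, gelu(value))
```

For large positive inputs GELU behaves like the identity, and for large negative inputs it approaches zero, much like ReLU but with a smooth transition around the origin.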


### nn.BatchNorm1d()

Applies Batch Normalization over a 2D or 3D input (a mini-batch of 1D inputs with optional additional channel dimension). Read more about it [here](https://arxiv.org/abs/1502.03167).

$y= \frac{x - E[x]}{\sqrt{Var[x] + \epsilon}} * \gamma + \beta$


The mean and standard-deviation are calculated per-dimension over the mini-batches and $\gamma$ and $\beta$ are learnable parameter vectors of size C (where C is the input size). By default, the elements of $\gamma$ are set to 1 and the elements of $\beta$ are set to 0. The standard-deviation is calculated via the biased estimator, equivalent to torch.var(input, unbiased=False).

Also by default, during training this layer keeps running estimates of its computed mean and variance, which are then used for normalization during evaluation. The running estimates are kept with a default momentum of 0.1.
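The training-mode computation can be sketched in NumPy for a 2D input of shape (batch, features). The function name is hypothetical and the running estimates used at evaluation time are left out; eps = 1e-5 is assumed as the small constant.

```python
import numpy as np


def batchnorm_train(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale and shift."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)  # biased estimator (divides by N), as described above
    return (x - mean) / np.sqrt(var + eps) * gamma + beta


rng = np.random.default_rng(0)
x = rng.standard_normal((256, 8)) * 3.0 + 5.0  # features with mean ~5 and std ~3
y = batchnorm_train(x, gamma=np.ones(8), beta=np.zeros(8))
print(y.mean(axis=0).round(3), y.var(axis=0).round(3))
```

With the default gamma = 1 and beta = 0, every feature of the output has approximately zero mean and unit variance within the batch.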




In [None]:
# SIMPLE MODEL DEFINITION
class Simple_MLP(nn.Module):
    def __init__(self, size_list):
        super(Simple_MLP, self).__init__()
        layers = []
        self.size_list = size_list
        for i in range(len(size_list) - 2):
            layers.append(nn.Linear(size_list[i],size_list[i+1]))  # fully connected hidden layer
            #layers.append(nn.ReLU())
            layers.append(nn.GELU())                               # GELU activation
            layers.append(nn.BatchNorm1d(size_list[i+1]))          # batch norm applied after the activation
            #layers.append(nn.Dropout(0.04*i,True))
        layers.append(nn.Linear(size_list[-2], size_list[-1]))     # output layer: one logit per phoneme state
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

#### Cross Entropy Loss
This criterion combines LogSoftmax and NLLLoss in one single class. It is useful when training a classification problem with C classes. If provided, the optional argument weight should be a 1D Tensor assigning a weight to each of the classes. This is particularly useful when you have an unbalanced training set.
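What "LogSoftmax followed by NLLLoss" means for a single sample can be sketched in plain Python. The helper is hypothetical and ignores class weights and batching; it only shows the two steps.

```python
import math


def cross_entropy(logits, target):
    """Negative log-probability of the target class under softmax(logits)."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    log_probs = [z - log_sum_exp for z in logits]  # LogSoftmax
    return -log_probs[target]                      # NLLLoss for one sample


# A network that outputs identical logits for all 138 phoneme states
# assigns probability 1/138 to every class, so the loss is log(138).
print(cross_entropy([0.0] * 138, target=5), math.log(138))
```

The uniform case is a useful sanity check: a freshly initialized classifier over 138 classes should start with a loss of roughly log(138), about 4.93.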


#### Adam Optimizer
Adam is a replacement optimization algorithm for stochastic gradient descent for training deep learning models.
Adam combines the best properties of the AdaGrad and RMSProp algorithms to provide an optimization algorithm that can handle sparse gradients on noisy problems.
Adam is relatively easy to configure, and the default configuration parameters do well on most problems.
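A single Adam update for one scalar parameter can be sketched as follows. The function is a hypothetical illustration of the update rule with the commonly used defaults (lr = 1e-3, betas = 0.9 and 0.999, eps = 1e-8); it is not the torch.optim implementation.

```python
import math


def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; m and v are running averages, t is the step count (from 1)."""
    m = beta1 * m + (1 - beta1) * grad         # first moment (momentum-like)
    v = beta2 * v + (1 - beta2) * grad * grad  # second moment (RMSProp-like)
    m_hat = m / (1 - beta1 ** t)               # bias correction
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# Minimize f(p) = p^2 (gradient 2p) for a few steps, starting from p = 1.
p, m, v = 1.0, 0.0, 0.0
for t in range(1, 4):
    p, m, v = adam_step(p, 2.0 * p, m, v, t)
    print(t, p)
```

Because the step is normalized by the square root of the second moment, the first update moves the parameter by roughly lr regardless of the gradient's magnitude.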



#### StepLR 
Decays the learning rate of each parameter group by gamma every
step_size epochs. Notice that such decay can happen simultaneously with
other changes to the learning rate from outside this scheduler. When
last_epoch=-1, sets initial lr as lr.
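The resulting schedule has a simple closed form when nothing else modifies the learning rate. The helper below is a hypothetical re-implementation of that formula, shown with the settings used in this notebook (step_size=5, gamma=0.5) and assuming Adam's default base learning rate of 1e-3.

```python
def step_lr(base_lr, epoch, step_size=5, gamma=0.5):
    """Learning rate in a given epoch (counted from 0) under step decay."""
    return base_lr * gamma ** (epoch // step_size)


schedule = [step_lr(1e-3, epoch) for epoch in range(30)]
for epoch in (0, 4, 5, 10, 29):
    print(epoch, schedule[epoch])
```

Over 30 epochs the rate is halved five times, so the last five epochs run at 1/32 of the initial learning rate.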

In [None]:
model = Simple_MLP([1080, 2048, 2048,  1024, 1024, 1024, 512, 512, 256, 138])
model.apply(init_xavier)                     # applying the Xavier weight initialization
criterion = nn.CrossEntropyLoss()            # combines LogSoftmax and NLLLoss; accepts optional per-class weights

optimizer = optim.Adam(model.parameters())  # Adam with its default configuration

scheduler = StepLR(optimizer, step_size=5, gamma=0.5)   # halve the learning rate every 5 epochs
#scheduler = ReduceLROnPlateau(optimizer, mode='max', factor=0.1, patience=2, verbose=True)
device = torch.device("cuda" if cuda else "cpu") # run the model on the GPU if one is available
model.to(device)                            # moving the model to the selected device
print(model)                                # Printing the Model
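To get a feeling for the size of this network, the learnable parameters can be counted from the layer sizes alone. The helper below is a hypothetical sketch that assumes the structure of Simple_MLP: every hidden block is Linear, GELU, BatchNorm1d, followed by one final Linear layer.

```python
def count_mlp_parameters(size_list):
    """Count learnable parameters for Linear -> GELU -> BatchNorm1d blocks plus a final Linear."""
    total = 0
    for i in range(len(size_list) - 1):
        fan_in, fan_out = size_list[i], size_list[i + 1]
        total += fan_in * fan_out + fan_out  # weight matrix and bias of nn.Linear
        if i < len(size_list) - 2:
            total += 2 * fan_out             # gamma and beta of nn.BatchNorm1d
    return total


sizes = [1080, 2048, 2048, 1024, 1024, 1024, 512, 512, 256, 138]
n_params = count_mlp_parameters(sizes)
print(f"{n_params:,} learnable parameters")
```

Most of the parameters sit in the first two hidden layers, because their weight matrices are by far the largest.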

In [None]:
# train_epoch runs one pass over the training data and returns the mean loss and the accuracy

def train_epoch(model, train_loader, criterion, optimizer):
    model.train()

    running_loss = 0.0
    total_predictions = 0.0
    correct_predictions = 0.0
    model.to(device)
    
    start_time = time.time()   # saving the time when the model start running
    
    # Print Learning Rate
    
    for batch_idx, (data, target) in enumerate(train_loader):   
        optimizer.zero_grad()   # .backward() accumulates gradients
        data = data.to(device)
        target = target.to(device) # all data & model on same device

        outputs = model(data)
        _, predicted = torch.max(outputs.data, 1)
        
        total_predictions += target.size(0)
        correct_predictions += (predicted == target).sum().item()
        
        loss = criterion(outputs, target)                                        # the loss function
        running_loss += loss.item()

        loss.backward()
        optimizer.step()
    scheduler.step()
    end_time = time.time()
    
    running_loss /= len(train_loader)
    acc = (correct_predictions/total_predictions)*100.0
    print('Training Loss: ', running_loss, 'Time: ',end_time - start_time, 's')  # printing the Training Loss and the Time required to run
    print('Training Accuracy: ', acc, '%')                                       # printing the Training Accuracy
    return running_loss,acc

In [None]:
def val_model(model, val_loader, criterion):
    with torch.no_grad():
        model.eval()
        model.to(device)

        running_loss = 0.0
        total_predictions = 0.0
        correct_predictions = 0.0

        for batch_idx, (data, target) in enumerate(val_loader):   
            data = data.to(device)
            target = target.to(device)

            outputs = model(data)

            _, predicted = torch.max(outputs.data, 1)
            total_predictions += target.size(0)
            correct_predictions += (predicted == target).sum().item()

            loss = criterion(outputs, target).detach()
            running_loss += loss.item()


        running_loss /= len(val_loader)
        acc = (correct_predictions/total_predictions)*100.0
        print('Validation Loss: ', running_loss)
        print('Validation Accuracy: ', acc, '%')
        return running_loss, acc

In [None]:
def test_model(model, test_loader):
    with torch.no_grad():
        model.eval()
        pred = []

        for batch_idx, (data) in enumerate(test_loader):   
            data = data.to(device)
            outputs = model(data)

            _, predicted = torch.max(outputs.data, 1)
            pred.extend(predicted.cpu().numpy())   # keep every prediction in the batch, whatever the batch size

        return np.array(pred)

In [None]:
# Using the above mentioned functions here to actually train the model

n_epochs = 30     # number of epochs this model will run
Train_acc = []    # declared empty arrays in the beginning 
Train_loss = []   # they will be filled when the code runs
Val_loss = []
Val_acc = []

for i in range(n_epochs):
    print('Epoch: ',i+1)
    print('LR: ', scheduler.get_last_lr())   # get_last_lr() reports the learning rate currently in use
    train_loss,acc = train_epoch(model, train_loader, criterion, optimizer)
    val_loss, val_acc = val_model(model, val_loader, criterion)
    Train_loss.append(train_loss)
    Train_acc.append(acc)
    Val_loss.append(val_loss)
    Val_acc.append(val_acc)
    print('='*20)
    #scheduler.step(val_acc)
    torch.save(model.state_dict(), '/content/drive/My Drive/model1.pt')  # saving the model weights to Google Drive after every epoch

In [None]:
# finding the prediction of the model

pred= test_model(model, test_loader)

### Generating the Output csv file

In [None]:
# this code will make a new csv file in the synced Google drive


with open('/content/drive/My Drive/monitsharma.csv', 'w') as w: # open (or create) monitsharma.csv on Drive in write mode
    w.write('id,label\n')                                       # write the header row: id and label
    for i in range(len(pred)):                                  # one row per test frame
        w.write(str(i)+','+str(pred[i])+'\n')                   # frame index and predicted phoneme state

### Plotting various functions

In [None]:
# plotting the training loss against the epoch number
plt.title('Training Loss')
plt.xlabel('Epoch Number')
plt.ylabel('Loss')
plt.plot(Train_loss)

In [None]:
# Plotting the validation loss against the epoch number
plt.title('Val Loss')
plt.xlabel('Epoch Number')
plt.ylabel('Loss')
plt.plot(Val_loss)

In [None]:
# Plotting the validation accuracy of our model against the epoch number

plt.title('Val Accuracy')
plt.xlabel('Epoch Number')
plt.ylabel('Accuracy (%)')
plt.plot(Val_acc)