# Frame Level Speech Recognition with Neural Networks

## IDC410 Machine Learning Assignment

### Submitted to Prof. Sarab Anand

## Question for the Assessment
In this coursework you will take your knowledge of feedforward neural networks and apply it to the task of speech recognition.

You are provided a dataset of audio recordings (utterances) and their phoneme state (subphoneme) labels. The data comes from articles published in the Wall Street Journal (WSJ) that are read aloud and labelled using the original text. If you have not encountered speech data before or have not heard of phonemes or spectrograms, we will clarify these here:

## Phonemes and Phoneme States

As letters are the atomic elements of written language, phonemes are the atomic elements of speech. It is crucial for us to have a means to distinguish different sounds in speech that may or may not represent the same letter or combinations of letters in the written alphabet. For example, the words "jet" and "ridge" both contain the same sound and we refer to this elemental sound as the phoneme "JH". For this challenge we will consider 46 phonemes in the English language.

["+BREATH+", "+COUGH+", "+NOISE+", "+SMACK+", "+UH+", "+UM+", "AA", "AE", "AH", "AO", "AW", "AY", "B", "CH", "D", "DH", "EH", "ER", "EY", "F", "G", "HH", "IH", "IY", "JH", "K", "L", "M", "N", "NG", "OW", "OY", "P", "R", "S", "SH", "SIL", "T", "TH", "UH", "UW", "V", "W", "Y", "Z", "ZH"]

A powerful technique in speech recognition is to model speech as a Markov process with unobserved states. This model considers observed speech to be dependent on unobserved state transitions. We refer to these unobserved states as phoneme states or subphonemes. Each phoneme has 3 phoneme states, so our 46 phonemes give 46 x 3 = 138 phoneme states.

Hidden Markov Models (HMMs) estimate the parameters of this unobserved Markov process (transition and emission probabilities) that maximize the likelihood of the observed speech data.
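To make "likelihood of the observed speech data" concrete, here is a minimal sketch of the forward algorithm, which computes that likelihood for a given set of HMM parameters. The two-state model and all of its numbers are made up for illustration. They are not taken from the assignment data, and the function name is hypothetical.

```python
import numpy as np

def hmm_likelihood(initial, transition, emission, observations):
    """Forward algorithm: P(observations), summed over all hidden state paths.

    initial:    (S,)   probability of starting in each state
    transition: (S, S) transition[i, j] = P(next state j | current state i)
    emission:   (S, V) emission[s, o]   = P(observation o | state s)
    """
    alpha = initial * emission[:, observations[0]]
    for obs in observations[1:]:
        # Propagate one step through the chain, then weight by the emission
        alpha = (alpha @ transition) * emission[:, obs]
    return float(alpha.sum())

# Toy parameters, invented for this example only
initial = np.array([0.6, 0.4])
transition = np.array([[0.7, 0.3],
                       [0.4, 0.6]])
emission = np.array([[0.9, 0.1],
                     [0.2, 0.8]])

likelihood = hmm_likelihood(initial, transition, emission, [0, 1])
print(likelihood)
```

Training an HMM means searching for the parameters that make this quantity as large as possible on the training data.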

Your task is to instead take a model-free approach and classify mel spectrogram frames using a neural network that takes a frame (plus optional context) and outputs class probabilities for all 138 phoneme states. Performance on the task will be measured by classification accuracy on a held-out set of labelled mel spectrogram frames. Training/dev labels are provided as integers [0-137].
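The "optional context" mentioned above can be sketched as follows: each frame is concatenated with the k frames before and after it, with zero padding at the utterance boundaries. The function name `add_context` and the choice of zero padding are assumptions made for this illustration. The assignment does not prescribe either.

```python
import numpy as np

def add_context(utterance, k):
    """Stack each frame with k frames of left and right context.

    utterance: (T, D) array of frames
    returns:   (T, (2k + 1) * D) array; the centre block is the frame itself
    """
    T, D = utterance.shape
    # Zero-pad k frames at the start and at the end of the utterance
    padded = np.pad(utterance, ((k, k), (0, 0)), mode="constant")
    # windows[t, i] is the frame at offset (i - k) from frame t
    windows = np.stack([padded[i:i + T] for i in range(2 * k + 1)], axis=1)
    return windows.reshape(T, (2 * k + 1) * D)

utterance = np.arange(12, dtype=np.float32).reshape(3, 4)   # 3 frames, 4 bands
with_context = add_context(utterance, k=1)
print(with_context.shape)
```

With 40-dimensional frames and k = 5, for example, each network input would have 11 x 40 = 440 values.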


## Representing Speech

As a first step, the speech must be converted into a feature representation that can be fed into the network.

In our representation, utterances have been converted to "mel spectrograms", which are pictorial representations that characterize how the frequency content of the signal varies with time. The frequency-domain representation of the audio signal provides more useful features for distinguishing phonemes than the raw waveform does.

For a more intuitive understanding, consider attempting to determine which instruments are playing in an orchestra given an audio recording of a performance. By looking only at the amplitude of the signal of the orchestra over time, it is nearly impossible to distinguish one source from another. But if the signal is transformed into the frequency domain, we can use our knowledge that flutes produce higher frequency sounds and bassoons produce lower frequency sounds. In speech, a similar phenomenon is observed when the vocal tract produces sounds at varying frequencies.

To convert the speech to a mel spectrogram, it is segmented into little "frames", each 25ms wide, where the "stride" between adjacent frames is 10ms. Thus we get 100 such frames per second of speech.

From each frame, we compute a single "mel spectral" vector, where the components of the vector represent the (log) energy in the signal in different frequency bands. In the data we have given you, we have 40-dimensional mel-spectral vectors, i.e. we have computed energies in 40 frequency bands.

Thus, we get 100 40-dimensional mel spectral (row) vectors per second of speech in the recording. Each one of these vectors is referred to as a frame. The details of how mel spectrograms are computed from speech are explained in the attached blog.

Thus, for a T-second recording, the entire spectrogram is a 100T x 40 matrix, comprising 100T 40-dimensional vectors (at 100 vectors (frames) per second).
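The framing step can be sketched as below. The 16 kHz sample rate and the function name `frame_signal` are assumptions made for this example, and the sketch stops at cutting the waveform into frames. It does not compute the mel energies themselves.

```python
import numpy as np

def frame_signal(signal, sample_rate=16000, window_ms=25, stride_ms=10):
    """Cut a 1-D waveform into overlapping frames (one row per frame)."""
    window = int(sample_rate * window_ms / 1000)   # samples per frame
    stride = int(sample_rate * stride_ms / 1000)   # samples between frame starts
    # Only frames that fit entirely inside the signal are kept (no padding)
    num_frames = 1 + (len(signal) - window) // stride
    starts = stride * np.arange(num_frames)
    index = starts[:, None] + np.arange(window)[None, :]
    return signal[index]

one_second = np.arange(16000, dtype=np.float32)    # stand-in for real audio
frames = frame_signal(one_second)
print(frames.shape)   # (98, 400)
```

Without padding, one second yields 98 frames rather than exactly 100, because the last frames would run past the end of the signal. For longer recordings the rate approaches the 100 frames per second quoted above.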

The training data comprises:

<li>Speech recordings
<li>Frame-level phoneme state labels

The test data comprises:

<li>Speech recordings
<li>Phoneme state labels are not given



### Expected from Us

Your job is to identify the phoneme state label for each frame in the test data set. It is important to note that utterances are of variable length. We are providing you with code to load and parse the raw files into the expected format. For now we are only providing dev data files, as the training file is very large.

### Dataset

#### Feature File
[train|dev|test].npy contain a numpy object array of shape [utterances]. Each utterance is a float32 ndarray of shape [time, frequency], where time is the length of the utterance. The frequency dimension is always 40, but the time dimension is of variable length.


#### Label File

[train|dev]_labels.npy contain a numpy object array of shape [utterances]. Each element in the array is an int32 array of shape [time] and provides the phoneme state label for each frame. There are 138 distinct labels [0-137], one for each subphoneme.
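A quick sanity check of the format described above might look like the sketch below. It builds a small synthetic stand-in for the feature and label files, because the real files are large. The helper name `check_alignment` is hypothetical.

```python
import numpy as np

def check_alignment(features, labels, num_bands=40, num_classes=138):
    """Verify every utterance/label pair and return the total frame count."""
    assert len(features) == len(labels), "expected one label array per utterance"
    total_frames = 0
    for utterance, utterance_labels in zip(features, labels):
        # Each utterance is [time, 40]; each label array is [time]
        assert utterance.ndim == 2 and utterance.shape[1] == num_bands
        assert utterance_labels.shape == (utterance.shape[0],)
        assert utterance_labels.min() >= 0 and utterance_labels.max() < num_classes
        total_frames += utterance.shape[0]
    return total_frames

# Synthetic stand-in with three utterances of different lengths
rng = np.random.default_rng(0)
lengths = [5, 8, 3]
features = np.empty(len(lengths), dtype=object)
labels = np.empty(len(lengths), dtype=object)
for n, length in enumerate(lengths):
    features[n] = rng.standard_normal((length, 40)).astype(np.float32)
    labels[n] = rng.integers(0, 138, size=length).astype(np.int32)

print(check_alignment(features, labels))   # 16
```

The same function can be called on the arrays returned by `np.load(..., allow_pickle=True)` once the real files are available.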

You can download the dataset from [here](https://www.kaggle.com/c/cmu-11785-deep-learning-hw1-p2/data)






Importing all the basic and ML Libraries

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import os

In [None]:
import torch
cuda = torch.cuda.is_available()
print(cuda)

## Implementation
The dataset files are nearly 8GB in size, so we can't load them directly into a Google Colab notebook. Instead, we make use of Google Drive.

Upload the files to Google Drive and use the Drive feature of Google Colaboratory. Run the code below and it will show you a link. Visit that link, give confirmation, copy the auth code and paste it into the dialog box that appears. This lets you access the files in your Google Drive.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:

train_labels = np.load('/content/drive/My Drive/train_labels.npy',allow_pickle=True)
dev_train = np.load('/content/drive/My Drive/dev.npy',allow_pickle=True)
dev_labels = np.load('/content/drive/My Drive/dev_labels.npy',allow_pickle=True)
test =  np.load('/content/drive/My Drive/test.npy',allow_pickle=True)

In [None]:
train = np.load('/content/drive/My Drive/train.npy',allow_pickle=True)

In [None]:
# Checking the Shape of the training Data
train.shape

### TORCH.UTILS.DATA

At the heart of PyTorch data loading utility is the torch.utils.data.DataLoader class. It represents a Python iterable over a dataset, with support for
<li> map-style and iterable-style datasets,
<li> customizing data loading order,
<li> automatic batching,
<li> single- and multi-process data loading,
<li> automatic memory pinning.


#### DataLoader
           DataLoader(dataset, batch_size=1, shuffle=False, sampler=None,
           batch_sampler=None, num_workers=0, collate_fn=None,
           pin_memory=False, drop_last=False, timeout=0,
           worker_init_fn=None, *, prefetch_factor=2,
           persistent_workers=False)

The most important argument of DataLoader constructor is dataset, which indicates a dataset object to load data from.
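Because utterances have different lengths, the default batching cannot stack them once `batch_size` is larger than 1. One common workaround, sketched here under assumptions, is a custom `collate_fn` that pads every utterance to the longest one in the batch. The name `pad_collate` is hypothetical, and the label padding value of -100 is chosen because it is the default `ignore_index` of `torch.nn.CrossEntropyLoss`. The notebook itself keeps `batch_size = 1` and does not use this.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

def pad_collate(batch):
    """Pad a list of (features, labels) pairs to the longest utterance."""
    features, labels = zip(*batch)
    lengths = torch.tensor([f.shape[0] for f in features])
    # Features are padded with zeros, labels with -100 (ignored by the loss)
    padded_features = pad_sequence(list(features), batch_first=True)
    padded_labels = pad_sequence(list(labels), batch_first=True, padding_value=-100)
    return padded_features, padded_labels, lengths

# Toy utterances of 5, 8 and 3 frames; a plain list works as a map-style dataset
toy = [(torch.randn(t, 40), torch.randint(0, 138, (t,))) for t in (5, 8, 3)]
loader = DataLoader(toy, batch_size=3, shuffle=False, collate_fn=pad_collate)

batch_features, batch_labels, batch_lengths = next(iter(loader))
print(batch_features.shape, batch_labels.shape, batch_lengths.tolist())
```

The returned lengths make it possible to mask or pack the padded frames later if needed.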


In [None]:
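from torch.utils.data import DataLoader, Dataset

# Custom dataset: one item is a whole utterance together with its per-frame labels.
# It replaces torch's built-in TensorDataset, so that name is not imported above.
class TensorDataset(Dataset):
    def __init__(self, x, y):
        super().__init__()
        assert len(x) == len(y)
        self._x = x
        self._y = y
    
    def __len__(self):
        return len(self._x)
      
    def __getitem__(self, index):
        # Features are float32 [time, 40]; labels are integer class indices [time]
        features = torch.as_tensor(self._x[index], dtype=torch.float32)
        labels = torch.as_tensor(self._y[index], dtype=torch.long)
        return features, labels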

#### TensorDataset
A dataset of utterances and their labels.

Stores the feature and label arrays internally, which are then indexed inside `__getitem__()`.

In [None]:
train_dataset = TensorDataset(train, train_labels)

load_train = DataLoader(
    train_dataset,
    batch_size = 1,
    shuffle=False,
    pin_memory=True
)

dev_dataset = TensorDataset(dev_train, dev_labels)

load_valid = DataLoader(
    dev_dataset,
    batch_size = 1
)


In [None]:

import torch
DEVICE = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

To work with a GPU in PyTorch, make sure that both the model and the tensors are moved to the GPU (cuda) device; otherwise PyTorch raises an error. Also note that Colab provides only around 11GB of memory. When training on a GPU, you can set pin_memory=True in the DataLoader, which places each batch in page-locked host memory and speeds up the copy to the GPU.

In [None]:

DEVICE

In [None]:
embedding_dim = 40          # input feature size: 40 mel frequency bands per frame
hidden_dim = 10             # size of the LSTM hidden state (per direction)
vocab_size = 138            # number of phoneme state classes, labels [0-137]
layers=4                    # number of stacked LSTM layers

def hidden_init():
    # Random initial (hidden, cell) states. The first dimension is layers*2
    # because the LSTM is bidirectional; the second is the batch size (1).
    return (torch.rand(layers*2, 1, hidden_dim).to(DEVICE) ,
            torch.rand(layers*2, 1, hidden_dim).to(DEVICE))

hidden_init()

## LSTM Model

An LSTM (long short-term memory) network is a special kind of recurrent neural network that is capable of learning long-term dependencies in data. This is possible because the recurring module of the model combines four layers that interact with each other.



In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
img = mpimg.imread('LSTM.jpg')
imgplot = plt.imshow(img)
plt.show()

The picture above depicts the four neural network layers as yellow boxes, point-wise operators as green circles, inputs as yellow circles and the cell state as blue circles. An LSTM module has a cell state and three gates, which give it the power to selectively learn, unlearn or retain information at each unit. The cell state lets information flow through the units largely unaltered, because it is subject to only a few linear interactions. Each unit has an input gate, an output gate and a forget gate, which add information to the cell state or remove it. The forget gate uses a sigmoid function to decide which information from the previous cell state should be forgotten. The input gate controls the flow of new information into the current cell state through a point-wise multiplication of a sigmoid output and a tanh output. Finally, the output gate decides which information is passed on to the next hidden state.
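The gate description above corresponds to the following single time step, written in plain NumPy as a sketch. The weight names `W`, `U`, `b` and the gate ordering inside them are conventions chosen for this example. `torch.nn.LSTM` stores its parameters in its own layout.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step.

    x: (D,) input, h_prev/c_prev: (H,) previous hidden and cell state
    W: (4H, D), U: (4H, H), b: (4H,), stacked as [input, forget, candidate, output]
    """
    H = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[0:H])            # input gate: how much new information to write
    f = sigmoid(z[H:2 * H])        # forget gate: how much old cell state to keep
    g = np.tanh(z[2 * H:3 * H])    # candidate values to write
    o = sigmoid(z[3 * H:4 * H])    # output gate: how much cell state to expose
    c = f * c_prev + i * g         # new cell state
    h = o * np.tanh(c)             # new hidden state
    return h, c

D, H = 40, 10                      # 40 mel bands in, hidden size 10
rng = np.random.default_rng(0)
W = rng.standard_normal((4 * H, D)) * 0.1
U = rng.standard_normal((4 * H, H)) * 0.1
b = np.zeros(4 * H)
h, c = lstm_step(rng.standard_normal(D), np.zeros(H), np.zeros(H), W, U, b)
print(h.shape, c.shape)
```

The only operations applied directly to the cell state are a multiplication by the forget gate and an addition, which is the "few linear interactions" mentioned above.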

In [None]:
class LSTM_model(torch.nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim):
        super(LSTM_model, self).__init__()
        self.vocab_size = vocab_size          # number of output classes (138)
        self.embedding_dim = embedding_dim    # input feature size (40)
        self.hidden_dim = hidden_dim          # hidden dimension
        
        self.lstm = torch.nn.LSTM(embedding_dim, hidden_dim, num_layers=layers, dropout = 0.2, bidirectional = True).to(DEVICE)
        self.linear = torch.nn.Linear(hidden_dim*2, vocab_size)       # *2 because the LSTM is bidirectional
        self.softmax = torch.nn.functional.softmax                    # not used in forward(): the loss expects raw scores
        
    def forward(self, encrypted):
        lstm_in = encrypted.transpose(0,1)          # (batch, time, 40) -> (time, batch, 40)

        lstm_out, lstm_hidden = self.lstm(lstm_in.float(), hidden_init())
        
        scores = self.linear(lstm_out)              # (time, batch, classes)
        scores = scores.transpose(1, 2)             # (time, classes, batch), as CrossEntropyLoss expects

        return scores

model = LSTM_model(vocab_size, embedding_dim, hidden_dim)

When loading a model on a GPU that was trained and saved on GPU, simply convert the initialized model to a CUDA optimized model using model.to(torch.device('cuda')). Also, be sure to use the .to(torch.device('cuda')) function on all model inputs to prepare the data for the model. Note that calling my_tensor.to(device) returns a new copy of my_tensor on GPU. It does NOT overwrite my_tensor. Therefore, remember to manually overwrite tensors: my_tensor = my_tensor.to(torch.device('cuda')).
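The reassignment pattern described above can be sketched as follows. The helper name `move_batch` and the small linear model are stand-ins invented for this example, and the code falls back to the CPU when no GPU is present.

```python
import torch

DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def move_batch(features, labels, device=DEVICE):
    """Return copies of the batch on `device`; .to() does not modify in place."""
    features = features.to(device)   # reassign: the original tensor is unchanged
    labels = labels.to(device)
    return features, labels

# Stand-in model: 40 mel bands in, 138 class scores out
model = torch.nn.Linear(40, 138).to(DEVICE)

features, labels = move_batch(torch.randn(4, 40), torch.randint(0, 138, (4,)))
scores = model(features)             # model and inputs are on the same device
print(scores.shape, scores.device)
```

If the inputs were left on the CPU while the model sat on the GPU, the forward call would fail with a device mismatch error.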


model.cuda() and model.to(device) are functionally the same when device is a CUDA device, although in our runs they gave different running times.
Both send each parameter of the model to the GPU, one after the other.


In [None]:

model = model.to(DEVICE)

In [None]:
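# Load the weights of a previously trained model; map_location lets a
# checkpoint saved on GPU be loaded when only a CPU is available
model.load_state_dict(torch.load('/content/drive/My Drive/model_5.sav', map_location=DEVICE))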

### Cross Entropy Loss
This criterion combines LogSoftmax and NLLLoss in a single class. It is useful when training a classification problem with C classes. If provided, the optional argument weight should be a 1D Tensor assigning a weight to each of the classes. This is particularly useful when you have an unbalanced training set.
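The "LogSoftmax followed by NLLLoss" combination can be written out in a few lines of NumPy. This is a sketch of the unweighted, mean-reduced case only, with function names chosen for this example.

```python
import numpy as np

def log_softmax(logits):
    """Row-wise log-softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def nll_loss(log_probs, targets):
    """Mean negative log-probability assigned to the correct class."""
    return float(-log_probs[np.arange(len(targets)), targets].mean())

def cross_entropy(logits, targets):
    return nll_loss(log_softmax(logits), targets)

# Two frames, 138 classes, all scores equal: the model is maximally unsure
logits = np.zeros((2, 138))
targets = np.array([5, 17])
loss = cross_entropy(logits, targets)
print(loss, np.log(138))
```

A model that assigns equal scores to all 138 phoneme states has a loss of log(138), roughly 4.93, which is a useful reference point when reading the training log.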

### Adam Optimizer
Adam is an optimization algorithm that can be used in place of classical stochastic gradient descent for training deep learning models. Adam combines the best properties of the AdaGrad and RMSProp algorithms to provide an optimization algorithm that can handle sparse gradients on noisy problems. Adam is relatively easy to configure, and the default configuration parameters do well on most problems.
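For reference, a single Adam update can be sketched in NumPy as below. The default values mirror those of `torch.optim.Adam`. The function name `adam_step` and the toy parameter and gradient values are made up for this example.

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step counter."""
    m = beta1 * m + (1 - beta1) * grad          # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2     # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction for the zero start
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

param = np.array([1.0, -2.0])
grad = np.array([0.5, -4.0])
m = np.zeros_like(param)
v = np.zeros_like(param)

param, m, v = adam_step(param, grad, m, v, t=1)
print(param)
```

On the first step the update has a magnitude of about `lr` for every parameter regardless of the gradient's scale, which is part of why the default learning rate works across many problems.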

In [None]:
# making the LSTM trainer
losses = []        # empty array for the losses to be appended 

class LSTM_Trainer():
    def __init__(self, model):
        self.model = model
        self.loss_fn = torch.nn.CrossEntropyLoss().to(DEVICE)
        self.optimizer = torch.optim.Adam(model.parameters(), lr=0.001) 

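    def get_loss(self, encrypted, original) :
        # Features must stay floating point: casting them to long would
        # truncate the log mel energies to integers
        encrypted = encrypted.to(DEVICE).float()
        original = original.to(DEVICE).long()        # class indices

        scores = self.model(encrypted)               # (time, classes, batch)
        original = original.transpose(0,1)           # (batch, time) -> (time, batch)

        loss = self.loss_fn(scores, original)  # <- Training loss
        return loss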

    def train(self, num_epochs):
        accuracies, max_accuracy = [], 0
        best_valid_loss = 10   # V.high initialization

        with open(os.path.join(PATH, 'history.csv'),'w') as writer:
            for N in range(num_epochs):
                print('Epoch: {}'.format(N))
                for i, (encrypted, original) in enumerate(load_train):  #dataset(num_examples):
                    self.optimizer.zero_grad()
  
                    loss = self.get_loss(encrypted, original)  # <- Training loss
                    loss.backward()

                    self.optimizer.step()

                    # Validation, run every `validation_time` training utterances
                    if i % validation_time == 0:

                        print('Validation:' + str(i))
                        validation_loss = []
                        self.model.eval()              # switch off dropout while evaluating
                        with torch.no_grad():          # no gradients needed for validation
                            for (val_encrypted, val_original) in load_valid:    #val dataset(num_examples):
                                val_loss = self.get_loss(val_encrypted, val_original) 
                                validation_loss.append(val_loss.item())
                        self.model.train()             # back to training mode

                        avg_loss = sum(validation_loss) / len(validation_loss)
                        print('Training Loss: {:6.4f}'.format(loss.item()))
                        print('Validation Loss: {:6.4f}'.format(avg_loss))        
                        writer.write(str(N)+','+str(i)+','+str(loss.item())+','+str(avg_loss))
                        writer.write('\n')

                # Saving the model after an epoch
                model_saved = os.path.join(PATH, 'model_' + str(N+1) + '.sav')
                torch.save(self.model.state_dict(), model_saved)

                print('Train Loss at end of epoch: {:6.4f}'.format(loss.item()))

In [None]:
trainer = LSTM_Trainer(model)  # create the trainer; training starts when train() is called

In [None]:
n_epochs = 15                  # number of passes (epochs) over the training set

In [None]:

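# Directory where the training history and model checkpoints are written
PATH = os.getcwd()
# Resume from the saved model; map_location also allows loading on a CPU-only machine
model.load_state_dict(torch.load('/content/drive/My Drive/model_5.sav', map_location=DEVICE))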

In [None]:
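# Validate 20 times per epoch. Integer division keeps validation_time an int,
# so that `i % validation_time == 0` fires at regular intervals
validation_time = max(1, len(train) // 20)
print(validation_time)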

In [None]:
# Below call starts training models for required epochs

trainer.train(n_epochs)
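The task is scored by frame-level classification accuracy, which the trainer above does not report. A minimal sketch of that metric follows, assuming the predictions are available as one integer array per utterance, in the same layout as the label files. The function name `frame_accuracy` and the toy values are hypothetical.

```python
import numpy as np

def frame_accuracy(predictions, labels):
    """Fraction of frames whose predicted phoneme state equals the label.

    predictions, labels: sequences of 1-D integer arrays, one per utterance
    """
    all_predictions = np.concatenate(list(predictions))
    all_labels = np.concatenate(list(labels))
    assert all_predictions.shape == all_labels.shape
    return float((all_predictions == all_labels).mean())

# Toy example: two utterances, five frames in total, one frame wrong
toy_labels = [np.array([3, 3, 7]), np.array([0, 1])]
toy_predictions = [np.array([3, 4, 7]), np.array([0, 1])]
accuracy = frame_accuracy(toy_predictions, toy_labels)
print(accuracy)   # 0.8
```

Concatenating before averaging weights every frame equally, so long utterances count for more than short ones.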

In [None]:
from torch.utils.data import DataLoader, Dataset

# Dataset for the test set, which has features only (no labels)
class TestDataset(Dataset):
    def __init__(self, x):
        super().__init__()
        self._x = x
    
    def __len__(self):
        return len(self._x)
      
    def __getitem__(self, index):
        x_item = self._x[index]
        return torch.FloatTensor(x_item)

In [None]:
test_dataset = TestDataset(test)

load_test = DataLoader(
    test_dataset,
    batch_size = 1,
    shuffle=False,
    pin_memory=True
)

## Softmax Function
Softmax extends the idea of logistic regression into a multi-class world. That is, Softmax assigns decimal probabilities to each class in a multi-class problem. Those decimal probabilities must add up to 1.0. This additional constraint helps training converge more quickly than it otherwise would.

Softmax is implemented through a neural network layer just before the output layer. The Softmax layer must have the same number of nodes as the output layer.
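A minimal NumPy sketch of the softmax function itself is shown below. Subtracting the row maximum before exponentiating is a standard trick that avoids overflow and does not change the result. The example scores are arbitrary.

```python
import numpy as np

def softmax(logits, axis=-1):
    """Turn raw scores into probabilities that sum to 1 along `axis`."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

logits = np.array([[2.0, 1.0, 0.1],
                   [1000.0, 1000.0, 1000.0]])   # large scores would overflow a naive exp
probs = softmax(logits)
print(probs.sum(axis=1))
print(probs.argmax(axis=1), logits.argmax(axis=1))
```

Softmax preserves the ordering of the scores, so taking the argmax of the probabilities gives the same class as taking the argmax of the raw scores.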

#### Softmax Options
Consider the following variants of Softmax:

<li> Full Softmax is the Softmax we've been discussing; that is, Softmax calculates a probability for every possible class.

<li> Candidate sampling means that Softmax calculates a probability for all the positive labels but only for a random sample of negative labels. For example, if we are interested in determining whether an input image is a beagle or a bloodhound, we don't have to provide probabilities for every non-doggy example.

Full Softmax is fairly cheap when the number of classes is small but becomes prohibitively expensive when the number of classes climbs. Candidate sampling can improve efficiency in problems having a large number of classes.


### One Label vs. Many Labels
Softmax assumes that each example is a member of exactly one class. Some examples, however, can simultaneously be a member of multiple classes. For such examples:

<li>You may not use Softmax.
<li>You must rely on multiple logistic regressions.

For example, suppose your examples are images containing exactly one item: a piece of fruit. Softmax can determine the likelihood of that one item being a pear, an orange, an apple, and so on. If your examples are images containing all sorts of things (bowls of different kinds of fruit, for instance), then you'll have to use multiple logistic regressions instead.
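The difference between the two cases can be sketched numerically. The scores below are invented for a hypothetical three-class fruit example, and the 0.5 decision threshold is an assumption.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical raw scores for the classes pear, orange, apple
logits = np.array([2.0, -1.0, 0.5])

# One label per example: softmax makes the classes compete, probabilities sum to 1
softmax_probs = np.exp(logits - logits.max())
softmax_probs = softmax_probs / softmax_probs.sum()

# Many labels per example: one independent logistic regression per class
sigmoid_probs = sigmoid(logits)
present = sigmoid_probs > 0.5

print(softmax_probs, softmax_probs.sum())
print(sigmoid_probs, present)
```

The frame classification task in this notebook is the single-label case: every frame belongs to exactly one of the 138 phoneme states, so softmax applies.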

In [None]:
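soft = torch.nn.Softmax(dim=1)     # normalise over the class dimension

model.eval()                       # switch off dropout for inference

with open('monitsharma.csv', 'w') as output:
    output.write('id,label\n')
    output_id = 0
    with torch.no_grad():
        for encrypted in load_test:
            encrypted = encrypted.to(DEVICE)
            scores = model(encrypted)                   # (time, classes, 1)

            soft_scores = soft(scores.squeeze(2))       # (time, classes) probabilities
            predictions = soft_scores.argmax(dim=1)     # most likely class per frame
            for prediction in predictions.tolist():
                output.write('{},{}\n'.format(output_id, prediction))
                output_id += 1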
