# Frame Level Speech Recognition with Neural Networks
## IDC410 Machine Learning Assignment
### Submitted to Prof. Sarab Anand
#### Question for the Assessment
In this coursework you will take your knowledge of feedforward neural networks and apply it to the task of speech recognition.

You are provided a dataset of audio recordings (utterances) and their phoneme state (subphoneme) labels. The data comes from articles published in the Wall Street Journal (WSJ) that are read aloud and labelled using the original text. If you have not encountered speech data before or have not heard of phonemes or spectrograms, we will clarify these here:

#### Phonemes and Phoneme States
As letters are the atomic elements of written language, phonemes are the atomic elements of speech. It is crucial for us to have a means to distinguish different sounds in speech that may or may not represent the same letter or combination of letters in the written alphabet. For example, the words "jet" and "ridge" both contain the same sound, and we refer to this elemental sound as the phoneme "JH". For this challenge we will consider 46 phonemes in the English language.

["+BREATH+", "+COUGH+", "+NOISE+", "+SMACK+", "+UH+", "+UM+", "AA", "AE", "AH", "AO", "AW", "AY", "B", "CH", "D", "DH", "EH", "ER", "EY", "F", "G", "HH", "IH", "IY", "JH", "K", "L", "M", "N", "NG", "OW", "OY", "P", "R", "S", "SH", "SIL", "T", "TH", "UH", "UW", "V", "W", "Y", "Z", "ZH"]

A powerful technique in speech recognition is to model speech as a Markov process with unobserved states. This model considers observed speech to be dependent on unobserved state transitions. We refer to these unobserved states as phoneme states or subphonemes. Each phoneme has 3 phoneme states, so our 46 phonemes give 46 × 3 = 138 phoneme states.
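The count above can be checked by enumerating every (phoneme, state) pair. This is only a sketch: the ordering of the pairs, and therefore which integer label would correspond to which pair, is an assumption made for illustration and is not specified in the assignment text.

```python
# The 46 phonemes listed in the assignment text.
PHONEMES = ["+BREATH+", "+COUGH+", "+NOISE+", "+SMACK+", "+UH+", "+UM+",
            "AA", "AE", "AH", "AO", "AW", "AY", "B", "CH", "D", "DH",
            "EH", "ER", "EY", "F", "G", "HH", "IH", "IY", "JH", "K",
            "L", "M", "N", "NG", "OW", "OY", "P", "R", "S", "SH", "SIL",
            "T", "TH", "UH", "UW", "V", "W", "Y", "Z", "ZH"]
STATES_PER_PHONEME = 3

# Hypothetical layout (an assumption, not given in the assignment):
# label = 3 * phoneme_index + state_index.
phoneme_states = [(p, s) for p in PHONEMES for s in range(STATES_PER_PHONEME)]


def describe_label(label):
    """Map an integer label to (phoneme, state) under the assumed layout."""
    return phoneme_states[label]


print(len(PHONEMES), len(phoneme_states))
print(describe_label(0), describe_label(137))
```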

Hidden Markov Models (HMMs) estimate the parameters of this unobserved Markov process (transition and emission probabilities) that maximize the likelihood of the observed speech data.

Your task is to instead take a model-free approach and classify mel spectrogram frames using a neural network that takes a frame (plus optional context) and outputs class probabilities for all 138 phoneme states. Performance on the task will be measured by classification accuracy on a held-out set of labelled mel spectrogram frames. Training/dev labels are provided as integers [0-137].
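A minimal sketch of how class probabilities and frame accuracy relate, using NumPy and made-up scores (the function names and toy values are assumptions for illustration). The raw scores of a network are turned into probabilities with a softmax, the predicted label is the most probable class, and accuracy is the fraction of frames whose prediction matches the label.

```python
import numpy as np

NUM_CLASSES = 138  # one class per phoneme state


def softmax(scores):
    """Row-wise softmax; subtracting the row maximum avoids overflow."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def frame_accuracy(scores, labels):
    """Fraction of frames whose most probable class equals the label."""
    predictions = softmax(scores).argmax(axis=1)
    return float((predictions == labels).mean())


rng = np.random.default_rng(0)
labels = rng.integers(0, NUM_CLASSES, size=8)
scores = rng.normal(size=(8, NUM_CLASSES))

# Toy data: make the correct class win for the first 6 frames
# and a wrong class win for the last 2 frames.
scores[np.arange(6), labels[:6]] += 100.0
wrong = (labels[6:] + 1) % NUM_CLASSES
scores[np.arange(6, 8), wrong] += 100.0

print(frame_accuracy(scores, labels))
```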

#### Representing Speech
As a first step, the speech must be converted into a feature representation that can be fed into the network.

In our representation, utterances have been converted to "mel spectrograms", which are pictorial representations that characterize how the frequency content of the signal varies with time. The frequency-domain representation of the audio signal provides more useful features for distinguishing phonemes than the raw waveform does.

For a more intuitive understanding, consider attempting to determine which instruments are playing in an orchestra given an audio recording of a performance. By looking only at the amplitude of the signal of the orchestra over time, it is nearly impossible to distinguish one source from another. But if the signal is transformed into the frequency domain, we can use our knowledge that flutes produce higher frequency sounds and bassoons produce lower frequency sounds. In speech, a similar phenomenon is observed when the vocal tract produces sounds at varying frequencies.

To convert the speech to a mel spectrogram, it is segmented into little "frames", each 25ms wide, where the "stride" between adjacent frames is 10ms. Thus we get 100 such frames per second of speech.
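The framing step can be sketched with NumPy as follows. The 16 kHz sample rate and the absence of padding are assumptions (neither is stated in the assignment), so one second of audio gives slightly fewer than 100 frames here; with padding at the edges the count would be closer to 100.

```python
import numpy as np


def frame_signal(signal, sample_rate, win_ms=25.0, hop_ms=10.0):
    """Cut a 1-D signal into overlapping frames of win_ms, hop_ms apart."""
    win = int(round(sample_rate * win_ms / 1000.0))   # samples per frame
    hop = int(round(sample_rate * hop_ms / 1000.0))   # samples between frame starts
    if len(signal) < win:
        return np.empty((0, win), dtype=signal.dtype)
    n_frames = 1 + (len(signal) - win) // hop
    # Row r holds the sample indices r*hop, r*hop + 1, ..., r*hop + win - 1.
    idx = np.arange(win)[None, :] + hop * np.arange(n_frames)[:, None]
    return signal[idx]


sample_rate = 16000                                   # assumed, not given in the text
t = np.arange(sample_rate) / sample_rate              # one second of audio
signal = np.sin(2.0 * np.pi * 440.0 * t)

frames = frame_signal(signal, sample_rate)
print(frames.shape)                                   # (number of frames, samples per frame)
```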

From each frame, we compute a single "mel spectral" vector, where the components of the vector represent the (log) energy in the signal in different frequency bands. In the data we have given you, we have 40-dimensional mel-spectral vectors, i.e. we have computed energies in 40 frequency bands.

Thus, we get 100 40-dimensional mel spectral (row) vectors per second of speech in the recording. Each one of these vectors is referred to as a frame. The details of how mel spectrograms are computed from speech are explained in the attached blog.

Thus, for a T-second recording, the entire spectrogram is a 100T x 40 matrix, comprising 100T 40-dimensional vectors (at 100 vectors (frames) per second).
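As a rough illustration of how each frame becomes a 40-dimensional vector of log energies, here is a common textbook recipe in NumPy. The 16 kHz sample rate, 512-point FFT, Hamming window, and this particular mel formula are all assumptions; the attached blog describes the procedure actually used for the data.

```python
import numpy as np


def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular filters spaced evenly on the mel scale from 0 Hz to Nyquist."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_mels + 2)
    hz_points = mel_to_hz(mel_points)
    fft_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    fb = np.zeros((n_mels, len(fft_freqs)))
    for m in range(n_mels):
        left, center, right = hz_points[m], hz_points[m + 1], hz_points[m + 2]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        fb[m] = np.maximum(0.0, np.minimum(rising, falling))
    return fb


def log_mel_spectrogram(frames, fb, n_fft):
    """frames: [n_frames, frame_len] -> log mel energies: [n_frames, n_mels]."""
    windowed = frames * np.hamming(frames.shape[1])
    power = np.abs(np.fft.rfft(windowed, n=n_fft, axis=-1)) ** 2
    return np.log(power @ fb.T + 1e-10)               # small constant avoids log(0)


sample_rate, n_fft, n_mels = 16000, 512, 40           # assumed values
rng = np.random.default_rng(0)
frames = rng.normal(size=(98, 400))                   # stand-in for 25 ms frames
fb = mel_filterbank(n_mels, n_fft, sample_rate)
spectrogram = log_mel_spectrogram(frames, fb, n_fft)
print(spectrogram.shape)                              # (frames, mel bands)
```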

###### The training data comprises:

<li>Speech Recordings
<li>Frame Level Phoneme State labels

###### The test data comprises:

<li>Speech Recordings
<li>Phoneme state labels are not given


#### Expected from Us
Your job is to identify the phoneme state label for each frame in the test data set. It is important to note that utterances are of variable length. We are providing you code to load and parse the raw files into the expected format. For now we are only providing dev data files as the training file is very large.

#### Dataset
###### Feature Files
[train|dev|test].npy each contain a NumPy object array of shape [utterances]. Each utterance is a float32 ndarray of shape [time, frequency], where time is the length of the utterance. The frequency dimension is always 40, but the time dimension varies from utterance to utterance.

###### Label Files
[train|dev]_labels.npy each contain a NumPy object array of shape [utterances]. Each element in the array is an int32 array of shape [time] that provides the phoneme state label for each frame. There are 138 distinct labels [0-137], one for each subphoneme.
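To make the file layout concrete, the sketch below builds a tiny in-memory stand-in for the feature and label arrays. The utterance lengths and random values are made up; only the shapes and dtypes follow the description above.

```python
import numpy as np

rng = np.random.default_rng(0)
lengths = [120, 75, 310]                      # made-up utterance lengths in frames

# Object arrays can hold one differently sized ndarray per utterance.
features = np.empty(len(lengths), dtype=object)
labels = np.empty(len(lengths), dtype=object)
for i, t in enumerate(lengths):
    features[i] = rng.normal(size=(t, 40)).astype(np.float32)    # [time, frequency]
    labels[i] = rng.integers(0, 138, size=t).astype(np.int32)    # [time]

# Every frame has exactly one label.
for x, y in zip(features, labels):
    assert x.shape == (len(y), 40)

total_frames = sum(x.shape[0] for x in features)
print(features.shape, total_frames)
```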

# Import Files

Importing all the necessary libraries.

In [None]:
import numpy as np                # numpy for basic matrix operations
import os                         # operating system, to use files from my PC
import torch                      # PyTorch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms
from torch.autograd import Variable
import time                       # time library 
from datetime import datetime
from torch.optim.lr_scheduler import StepLR
import argparse                    # argument parser

# Train, Validation, Test NumPy files loading


Loading the train, dev and test files directly from my PC (not using the Google Drive feature here).

In [None]:
class LibriSpeech():

    def __init__(self, data_path):
        self.data_path = data_path
        self.dev_set = None
        self.train_set = None
        self.test_set = None
  
    @property
    def dev(self):
        if self.dev_set is None:
            self.dev_set = load_data(self.data_path, 'dev')
        return self.dev_set

    @property
    def train(self):
        if self.train_set is None:
            self.train_set = load_data(self.data_path, 'train')
        return self.train_set
  
    @property
    def test(self):
        if self.test_set is None:
            self.test_set = (np.load(os.path.join(self.data_path, 'test.npy'), encoding='bytes', allow_pickle=True), None)
        return self.test_set

    
def load_data(path, name):
    return (
        np.load(os.path.join(path, '{}.npy'.format(name)), encoding='bytes', allow_pickle=True),
        np.load(os.path.join(path, '{}_labels.npy'.format(name)), encoding='bytes', allow_pickle=True)
    )

# DataSet Class

In [None]:
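class MyDataset(torch.utils.data.Dataset):
    """Serves one frame at a time together with k context frames on each side."""

    def __init__(self, librispeech, k=15, lowest=0.1):
        self.k = k
        self.x_list = librispeech[0]                                     # utterances, each [time, features]
        self.y_list = librispeech[1] if len(librispeech) == 2 else None  # labels (None for the test set)

        # Map a flat sample index to (utterance index, frame index).
        self.idx_map = []
        for i, xs in enumerate(self.x_list):
            for j in range(xs.shape[0]):
                self.idx_map.append((i, j))

        # Triangular weights over the 2k+1 frames of a window: 1.0 at the centre frame,
        # falling linearly to `lowest` at both edges. np.linspace gives exactly 2k+1
        # values, whereas np.arange with a float step can be off by one.
        ramp = np.linspace(lowest, 1.0, k + 1)
        self.win_mask = np.concatenate((ramp, ramp[-2::-1]))

        # self.win_mask = np.concatenate( (np.zeros(k), np.array([1.]), np.zeros(k)), axis=None )
        # Repeat each frame weight once per feature so the mask matches the flattened window.
        self.win_mask = np.repeat(self.win_mask, self.x_list[0].shape[1]).astype(np.float32)

    def __getitem__(self, idx):
        i, j = self.idx_map[idx]
        # mode='clip' repeats the first/last frame when the window runs past the utterance edges.
        context = self.x_list[i].take(range(j - self.k, j + self.k + 1), mode='clip', axis=0).flatten()
        context = context * self.win_mask
        xi = torch.from_numpy(context).float()
        # A Python int is collated into an int64 tensor, which F.cross_entropy expects.
        yi = int(self.y_list[i][j]) if self.y_list is not None else -1
        return xi, yi

    def __len__(self):
        return len(self.idx_map)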
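The context window used by `MyDataset` can be seen on a toy utterance. This sketch uses 5 frames with 2 features each and k = 2 (values chosen only for readability) and builds a triangular mask like `win_mask` with `np.linspace`. With `mode='clip'`, indices before the first frame map to frame 0, so the edge frame is repeated.

```python
import numpy as np


def context_window(utterance, j, k):
    """Frames j-k .. j+k of an utterance, repeating edge frames when out of range."""
    idx = np.arange(j - k, j + k + 1)
    return utterance.take(idx, mode='clip', axis=0)


def triangular_mask(k, lowest=0.1):
    """Weights rising linearly from `lowest` to 1.0 at the centre and back down."""
    ramp = np.linspace(lowest, 1.0, k + 1)
    return np.concatenate((ramp, ramp[-2::-1]))


utt = np.arange(10, dtype=np.float32).reshape(5, 2)   # frame r is [2r, 2r + 1]
win = context_window(utt, j=0, k=2)                   # frames 0, 0, 0, 1, 2
mask = triangular_mask(2)

# One weight per frame, repeated for every feature of that frame.
weighted = win.flatten() * np.repeat(mask, utt.shape[1])

print(win[:, 0])
print(mask)
print(weighted.shape)
```

The flattened, weighted window is what the network receives as input for the centre frame.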

# Xavier Initialization

Xavier initialization, proposed by Xavier Glorot and Yoshua Bengio in "Understanding the difficulty of training deep feedforward neural networks", is a weight initialization technique that tries to keep the variance of a layer's outputs equal to the variance of its inputs. This idea turned out to be very useful in practice. Naturally, the right scaling depends on the layer's activation function. In their paper, Glorot and Bengio considered saturating activations such as the logistic sigmoid and tanh, which were the default choices at the time.

Later on, the sigmoid activation was largely replaced by ReLU, because ReLU mitigates the vanishing gradient problem. Consequently, a new initialization technique appeared that applies the same idea (balancing the variance of the activations) to this activation function. It was proposed by Kaiming He et al. in "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification", and it is now often referred to as He initialization.

In TensorFlow, He initialization is implemented in the variance_scaling_initializer() function (which is, in fact, a more general initializer, but by default performs He initialization), while the Xavier initializer is, logically, xavier_initializer(). In PyTorch, which this notebook uses, the corresponding functions are torch.nn.init.kaiming_uniform_() / kaiming_normal_() and torch.nn.init.xavier_uniform_() / xavier_normal_().

He initialization works better for layers with ReLU activation.
Xavier initialization works better for layers with sigmoid activation.
Note that the model in this notebook uses ReLU layers but is initialized with Xavier initialization.

In [None]:
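def init_xavier(m):
    # Applied to every submodule via model.apply(); only Linear layers are touched.
    if isinstance(m, nn.Linear):
        torch.nn.init.xavier_uniform_(m.weight)   # weights: Xavier (Glorot) uniform
        torch.nn.init.constant_(m.bias, 0.01)     # biases: small positive constant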
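The difference between the two schemes can be seen numerically. The sketch below is a NumPy-only illustration, not the notebook's model: it pushes random inputs through a stack of bias-free ReLU layers and compares the mean squared activation at the end. Layer width, depth, and the use of Gaussian weights are assumptions made for simplicity; `xavier_uniform_` draws from a uniform distribution with the same variance, 2 / (fan_in + fan_out).

```python
import numpy as np


def xavier_std(fan_in, fan_out):
    """Standard deviation used by Xavier (Glorot) initialization."""
    return np.sqrt(2.0 / (fan_in + fan_out))


def he_std(fan_in, fan_out):
    """Standard deviation used by He (Kaiming) initialization, fan-in mode."""
    return np.sqrt(2.0 / fan_in)


def mean_square_after_relu_stack(std_fn, width=1024, depth=10, batch=256, seed=0):
    """Mean squared activation after `depth` random linear + ReLU layers."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(batch, width))               # inputs with unit variance
    for _ in range(depth):
        w = rng.normal(scale=std_fn(width, width), size=(width, width))
        a = np.maximum(a @ w, 0.0)                    # linear layer followed by ReLU
    return float(np.mean(a ** 2))


xavier_ms = mean_square_after_relu_stack(xavier_std)
he_ms = mean_square_after_relu_stack(he_std)
print("Xavier:", xavier_ms)
print("He:    ", he_ms)
```

With ReLU, each layer zeroes roughly half of the pre-activations, so the Xavier-scaled signal shrinks by about a factor of two per layer, while the He scaling compensates for that loss. In the notebook's model, BatchNorm after every linear layer also rescales the activations, which reduces the practical impact of this choice.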

# Model

Defining the layers and parameters of the model.

In [None]:
class MyModel(nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
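        k = 15                              # context frames on each side of the centre frame
        # NOTE: 13 features per frame and 346 output classes are hard-coded here, whereas the
        # assignment text describes 40-dimensional frames and 138 labels; adjust to the data in use.
        in_size = ((k * 2) + 1) * 13        # flattened window of 2k+1 frames
        out_size = 346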

        layers = []
        size_list = [in_size, in_size, 1024,  2048, 2048, 1024, 512, out_size, out_size]

        for i in range(len(size_list) - 2):
            layers.append(nn.Linear(size_list[i],size_list[i+1]))
            layers.append(nn.BatchNorm1d(size_list[i+1]))
            layers.append(nn.ReLU())
            layers.append(nn.Dropout(0.2))

        layers.append(nn.Linear(size_list[-2], size_list[-1]))
        self.net = nn.Sequential(*layers)
        print(self.net)
        
    def forward(self, x):
        return self.net(x)


In [None]:
def get_model(k):
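    # NOTE: 13 features per frame and 346 output classes are hard-coded here, whereas the
    # assignment text describes 40-dimensional frames and 138 labels; adjust to the data in use.
    in_size = ((k * 2) + 1) * 13            # flattened window of 2k+1 frames
    out_size = 346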

    layers = []
    size_list = [in_size, in_size, 1024,  2048, 2048, 1024, 512, out_size, out_size]

    for i in range(len(size_list) - 2):
        layers.append(nn.Linear(size_list[i],size_list[i+1]))
        layers.append(nn.BatchNorm1d(size_list[i+1]))
        layers.append(nn.ReLU())
        layers.append(nn.Dropout(0.2))
        
    layers.append(nn.Linear(size_list[-2], size_list[-1]))
    mynet = nn.Sequential(*layers)
    print(mynet)
    return mynet

# Train and Test Function

In [None]:
def train(epoch, model, optimizer, train_loader, scheduler, args):
    model.train()                                   # enable dropout and batch-norm updates

    t0 = time.time()
    for batch_idx, (data, target) in enumerate(train_loader):
        if args.cuda:
            data, target = data.cuda(), target.cuda()
        optimizer.zero_grad()
        output = model(data)                        # raw class scores (logits)
        loss = F.cross_entropy(output, target.long())
        loss.backward()
        optimizer.step()
        if batch_idx % args.log_interval == 0:
            print('Train Epoch: {} Batch: {} [{}/{} ({:.0f}%, time:{:.2f}s)]\tLoss: {:.6f}'.format(
                epoch, batch_idx, batch_idx * len(data), len(train_loader.dataset),
                100. * batch_idx / len(train_loader), time.time() - t0,
                loss.item()))
            t0 = time.time()
    # NOTE: the scheduler is never stepped, so the StepLR decay is not applied and the
    # learning rate stays at args.lr for every epoch; uncomment the next line to enable it.
    #scheduler.step()

def test(model, test_loader, args):
    model.eval()                                    # disable dropout, use running batch-norm stats
    test_loss = 0.0
    correct = 0
    seen = 0
    with torch.no_grad():                           # no gradients are needed for evaluation
        for data, target in test_loader:
            if args.cuda:
                data, target = data.cuda(), target.cuda()
            target = target.long()
            output = model(data)
            test_loss += F.cross_entropy(output, target, reduction='sum').item()  # sum up batch loss
            pred = output.argmax(dim=1)             # index of the highest class score
            correct += (pred == target).sum().item()
            seen += target.size(0)                  # count the frames actually evaluated

    test_loss /= seen
    accuracy = 100. * correct / seen
    print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
        test_loss, correct, seen, accuracy))
    return "{:.4f}%".format(accuracy)

# Arguments

Setting all the hyperparameters

In [None]:
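class Argument():
    def __init__(self):
        self.batch_size = 256                   # mini-batch size
        self.epochs = 29                        # number of passes over the training set
        self.lr = 0.001                         # learning rate
        self.cuda = torch.cuda.is_available()   # use the GPU when one is available
        self.data_dir = "./data/"               # directory containing the .npy files
        self.K = 15                             # context frames on each side of the centre frame
        self.seed = 1001                        # random seed
        self.momentum = 0.9                     # unused: the optimizer is Adam
        self.log_interval = 1000                # print the training loss every N batches
        self.weights_dir = "./weights/"         # directory for saved checkpoints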
        
args = Argument()

# Dataloading

In [None]:
torch.manual_seed(args.seed)  # seeds the CPU and CUDA random number generators

In [None]:
librispeech_loader = LibriSpeech(args.data_dir)

kwargs = {'num_workers': 1, 'pin_memory': True} if args.cuda else {}

# Training: shuffle the frames and drop the last incomplete batch.
train_loader = torch.utils.data.DataLoader(
    MyDataset(librispeech_loader.train, k=args.K),
    batch_size=args.batch_size, shuffle=True, drop_last=True, **kwargs)

# Validation: keep every frame so the reported accuracy covers the whole dev set.
test_loader = torch.utils.data.DataLoader(
    MyDataset(librispeech_loader.dev, k=args.K),
    batch_size=args.batch_size, shuffle=False, **kwargs)


# Model initialization

In [None]:

model = get_model(args.K)
model.apply(init_xavier)  # applying Xavier weight initialization
if args.cuda:
    model.cuda()

Sequential(
  (0): Linear(in_features=403, out_features=403, bias=True)
  (1): BatchNorm1d(403, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (2): ReLU()
  (3): Dropout(p=0.2, inplace=False)
  (4): Linear(in_features=403, out_features=1024, bias=True)
  (5): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (6): ReLU()
  (7): Dropout(p=0.2, inplace=False)
  (8): Linear(in_features=1024, out_features=2048, bias=True)
  (9): BatchNorm1d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (10): ReLU()
  (11): Dropout(p=0.2, inplace=False)
  (12): Linear(in_features=2048, out_features=2048, bias=True)
  (13): BatchNorm1d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (14): ReLU()
  (15): Dropout(p=0.2, inplace=False)
  (16): Linear(in_features=2048, out_features=1024, bias=True)
  (17): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (18): ReLU()
  (19): Dr

# Optimizer and Scheduler

## Adam Optimizer
Adam is an optimization algorithm commonly used in place of plain stochastic gradient descent for training deep learning models. It combines ideas from the AdaGrad and RMSProp algorithms: it keeps running averages of the gradient and of its square, and uses them to scale the step for each parameter, which helps with sparse gradients and noisy problems. Adam is relatively easy to configure, and its default parameters do well on many problems.
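A single Adam update can be written out in a few lines of NumPy. This is a sketch of the standard update rule with the usual default hyperparameters, not the PyTorch implementation itself; the toy parameter and gradient values are made up.

```python
import numpy as np


def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step count."""
    m = beta1 * m + (1.0 - beta1) * grad          # running average of the gradient
    v = beta2 * v + (1.0 - beta2) * grad ** 2     # running average of the squared gradient
    m_hat = m / (1.0 - beta1 ** t)                # bias correction for the zero start
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v


theta = np.array([1.0, -2.0])
grad = np.array([0.5, -4.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)

theta, m, v = adam_step(theta, grad, m, v, t=1)
print(theta)
```

On the first step the update is close to `lr * sign(grad)` for every parameter, regardless of the gradient's magnitude, which is why both parameters above move by about 0.001.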

## StepLR
Decays the learning rate of each parameter group by gamma every step_size epochs. Notice that such decay can happen simultaneously with other changes to the learning rate from outside this scheduler. When last_epoch=-1, sets initial lr as lr.

In [None]:
optimizer = optim.Adam(model.parameters(), lr=args.lr)
scheduler = StepLR(optimizer, step_size=5, gamma=0.5)
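With `step_size=5` and `gamma=0.5`, the learning rate halves every five epochs, provided `scheduler.step()` is called once per epoch. The pure-Python sketch below evaluates the closed form of that schedule; `step_lr` is a hypothetical helper written for illustration, not a PyTorch function.

```python
def step_lr(initial_lr, epoch, step_size=5, gamma=0.5):
    """Learning rate in effect during a 0-based epoch under a step decay."""
    return initial_lr * gamma ** (epoch // step_size)


# Schedule for 29 epochs starting from lr = 0.001.
schedule = [step_lr(0.001, epoch) for epoch in range(29)]

for epoch in (0, 4, 5, 10, 28):
    print(epoch, schedule[epoch])
```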

# Training

In [None]:
os.makedirs(args.weights_dir, exist_ok=True)      # create the checkpoint directory once
for epoch in range(1, args.epochs + 1):
    print(datetime.now())
    print('LR: ', scheduler.get_last_lr())
    train(epoch, model, optimizer, train_loader, scheduler, args)
    acc_str = test(model, test_loader, args)
    # Save a checkpoint after every epoch.
    torch.save(model.state_dict(), os.path.join(args.weights_dir, "data_{:03d}.pth".format(epoch)))

## Load best weight
#### Note: the model must be created first (see Model initialization) before executing the steps below.

In [None]:

model.load_state_dict(torch.load(args.weights_dir+'/data_029.pth'))

<All keys matched successfully>

## Evaluate Function

In [None]:
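def eval_model(model, test_loader):
    device = next(model.parameters()).device          # run on whichever device the model lives on
    model.eval()
    pred = []
    with torch.no_grad():
        for data, _ in test_loader:                   # test-set targets are placeholders (-1)
            outputs = model(data.to(device))
            predicted = outputs.argmax(dim=1, keepdim=True)   # shape [batch, 1]
            pred.append(predicted.cpu().numpy())      # keep every prediction in the batch

    return np.concatenate(pred, axis=0)               # shape [num_frames, 1]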

## Test Data Loading

In [None]:
librispeech_loader = LibriSpeech(args.data_dir)
kwargs = {'num_workers': 1, 'pin_memory': True} if args.cuda else {}

eval_loader = torch.utils.data.DataLoader(
    MyDataset(librispeech_loader.test, k=args.K),
    batch_size=1, **kwargs)

## Prediction

In [None]:
pred = eval_model(model, eval_loader)

## Save Prediction for Assignment Submission

In [None]:
with open('monitsharma.csv', 'w') as w:
    w.write('id,label\n')
    for i in range(len(pred)):
        w.write(str(i) + ',' + str(pred[i][0]) + '\n')