# Mini-project: Vocal Activity Detection

We will implement a vocal activity detection (VAD) system in this mini-project. VAD aims to detect the speech and non-speech regions in a given utterance, and it is an important module in many speech communication applications:
  - Assume that you are making a phone call, and your signal strength or network condition is bad. Typically, a speech communication system encodes/compresses your voice into a compact representation, transmits it to the receiver, and decodes/decompresses it back into your voice. Without VAD, this transmission has to be applied to both speech and non-speech regions. With VAD, the system does not need to transmit the non-speech or silent regions, as they don't contain useful information. This saves computation and network bandwidth.
  
A VAD module detects speech and non-speech regions through a classifier. The problem of VAD can be defined as a binary classification problem, where we assign a label "yes" or "no" to each short frame denoting whether it is a speech or non-speech frame. There are many ways that we can construct a classifier, and in this mini-project we will implement two methods: a support vector machine (SVM) classifier and a neural network classifier.

## 0. Data Preparation

Before we start building the models, let's first prepare the dataset for VAD and do some simple visualization.

The audio, adopted from https://github.com/jtkim-kaist/VAD, is a 15 minute clip recorded at a bus station by a cellphone.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import soundfile as sf
import librosa
import time

# load the audio file
audio, sr = sf.read('park.wav')

# visualize the audio
plt.plot(audio)
ticks_in_second = np.arange(len(audio)) / sr
max_second = np.floor(ticks_in_second[-1])
plt.xticks(ticks=np.arange(0, (max_second+1)*sr, (max_second+1)*sr/4), 
           labels=np.arange(0, (max_second+1), (max_second+1)/4))
plt.xlabel('Time (s)')
plt.tight_layout()

We can listen to part of it to get an intuition about what it sounds like:

In [None]:
from IPython.display import Audio
Audio(audio[sr*15:sr*25], rate=sr)

You can hear wind noise, street noise, and people speaking. The task of VAD is to detect the regions where people speak.

In this mini-project we will train and test our models in the STFT domain. Let's calculate the **STFT magnitude spectrogram** and **Mel spectrogram** of this audio. 

Feel free to use librosa functions. Use **win_length=n_fft=512** and **hop_length=256** for STFT and Mel spectrogram, and **n_mels=64** for Mel.
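
To make the two parameters concrete, here is a minimal NumPy sketch of how a signal is cut into frames with the settings above. `frame_signal` is a hypothetical helper written for illustration, not a librosa function, and it applies no padding. librosa pads the signal by default (`center=True`), so its frame count will differ slightly from this sketch.

```python
import numpy as np

def frame_signal(x, win_length, hop_length):
    """Cut a 1-D signal into frames of win_length samples,
    starting a new frame every hop_length samples (no padding)."""
    num_frames = 1 + (len(x) - win_length) // hop_length
    starts = hop_length * np.arange(num_frames)
    return np.stack([x[s:s + win_length] for s in starts])

# toy signal: sample n simply holds the value n
x = np.arange(2048, dtype=float)
frames = frame_signal(x, win_length=512, hop_length=256)

print(frames.shape)  # (7, 512)
# with hop_length = win_length / 2, consecutive frames share half their samples
print(np.array_equal(frames[0, 256:], frames[1, :256]))  # True
```

Each frame is then windowed and passed through an FFT to give one column of the spectrogram.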

What is the difference between win_length and hop_length? What happens if hop_length > win_length?

**TODO**: enter your answer here

In [None]:
# TODO: calculate Magnitude Spectrogram

audio_spec = None
audio_spec_mag = None

plt.figure(figsize=(10, 5))
plt.imshow(audio_spec_mag[:,1500:2000]**0.33, origin='lower')
plt.tight_layout()

In [None]:
# TODO: calculate Mel Spectrogram
# use 64 filters for Mel filterbank

mel_spec = None

# visualize part of it
plt.figure(figsize=(20, 5))
plt.imshow(mel_spec[:,1500:2000]**0.33, origin='lower')
plt.tight_layout()

Next we can load the frame-level labels for this audio clip. I've already processed the labels so that they match the length of the magnitude spectrogram above.

In [None]:
# load the label
label = np.asarray(np.load('audio_label.npy'))

# visualize part of it
plt.figure(figsize=(20, 5))
plt.plot(label[1000:6000])
plt.tight_layout()

In [None]:
# visualize all of it
plt.figure(figsize=(20, 5))
plt.plot(label)
plt.tight_layout()

Here we will use the first 10000 frames for testing, 10000-20000 frames for validation, and the rest for training.

In [None]:
train_data = mel_spec[:,20000:]
val_data = mel_spec[:,10000:20000]
test_data = mel_spec[:,:10000]

train_label = label[20000:]
val_label = label[10000:20000]
test_label = label[:10000]

# MVN (mean and variance normalization), using training-set statistics only.
# Divide by the standard deviation so that each Mel bin has unit variance.
train_mean = np.mean(train_data, 1)
train_std = np.std(train_data, 1)

train_data = (train_data - train_mean[:,np.newaxis]) / train_std[:,np.newaxis]
val_data = (val_data - train_mean[:,np.newaxis]) / train_std[:,np.newaxis]
test_data = (test_data - train_mean[:,np.newaxis]) / train_std[:,np.newaxis]

Now we are done with data preparation. Note that here we load all the data directly into memory, as the dataset is pretty small. For large-scale datasets, you may need to save the processed data to disk and load it with the *DataLoader* method from the Neural Network Basics homework.

In [None]:
# TODO: implement accuracy calculation
def Accuracy(predicts, labels):
    '''
    Compute accuracy of predicted labels against true labels.
    args:
        predicts: binary array or tensor of shape (num_frame, )
        labels: binary array or tensor of shape (num_frame, )
    return:
        scalar float accuracy between 0 and 1
    '''
    
    raise NotImplementedError

## 1. Support Vector Machine

Determining whether a frame is speech or non-speech corresponds to a binary classification problem - we have two classes "speech" and "non-speech", and the model needs to make a prediction. 

Support vector machines (SVMs) were a class of powerful classifiers before the dominance of neural networks, and they are still among the most important and widely used tools today. I will not go into the details of SVMs, and will just borrow a figure from [the Wikipedia page](https://en.wikipedia.org/wiki/Support_vector_machine) to show the intuition behind them:

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/7/72/SVM_margin.png/494px-SVM_margin.png">

As you can see, a linear SVM attempts to find a "line" ("hyperplane" in a high-dimensional space) that separates the two classes of features. It is generally used as a linear classifier, as it assumes that the features from different classes are **linearly separable**. 

When you train an SVM classifier, you fit the classifier to the training data and the corresponding labels to learn its parameters (i.e. model weights). At test time, you feed only the test data into the model, and the model predicts the labels.
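
As a minimal sketch of this fit-then-predict workflow, the example below trains a linear SVM on synthetic two-class data. The toy data and all `toy_*` names are made up for illustration and are not part of the VAD dataset. Note that sklearn expects inputs of shape (num_samples, num_features), so features stored as (num_features, num_frames) need to be transposed.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# two Gaussian blobs, stored features-first as (num_features, num_frames)
class0 = rng.normal(-2.0, 1.0, size=(2, 200))
class1 = rng.normal(+2.0, 1.0, size=(2, 200))
toy_data = np.concatenate([class0, class1], axis=1)        # (2, 400)
toy_label = np.concatenate([np.zeros(200), np.ones(200)])  # (400,)

# training: the classifier sees both the features and the labels
clf = LinearSVC(random_state=0, max_iter=1000)
clf.fit(toy_data.T, toy_label)

# prediction: the classifier sees only the features
toy_predict = clf.predict(toy_data.T)
toy_acc = float(np.mean(toy_predict == toy_label))
print(toy_predict.shape, toy_acc)
```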

In this mini-project we will not ask you to implement an SVM classifier from scratch. Instead, you are asked to use the [*sklearn* library](https://scikit-learn.org/stable/modules/svm.html) and train an SVM classifier with the data we prepared above.

We will use a linear SVM classifier, denoted by [*LinearSVC*](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html#sklearn.svm.LinearSVC) in the sklearn library, for this section. Click the link for *LinearSVC* to see its usage, take a look at the examples provided, and apply it to our VAD data. You do not need to change any default parameters.

Don't worry about *ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.*

In [None]:
from sklearn.svm import LinearSVC

# TODO: use the LinearSVC class for the VAD data
# Set random_state to 0 and max_iter to 1000 (an integer) for a fair comparison

# Instantiate a SVM classifier
svm = None

# Train your classifier 


# Make prediction on the training, validation, and test sets
train_predict_svm = None
val_predict_svm = None
test_predict_svm = None

# Calculate accuracy on the training, validation, and test sets
train_acc_svm = None
val_acc_svm = None
test_acc_svm = None

print(f'Train set accuracy: {str(round(train_acc_svm*100, 1))}%')
print(f'Val set accuracy: {str(round(val_acc_svm*100, 1))}%')
print(f'Test set accuracy: {str(round(test_acc_svm*100, 1))}%')

A linear SVM works best when the features are linearly separable. However, this is rarely the case in practice. For features that are not linearly separable, we can look for a transformed representation in which they become linearly separable. In SVMs, this is done with the **kernel trick**: a kernel function implicitly applies a nonlinear mapping to the features, which gives us a nonlinear SVM classifier. We will not go into the details of the kernel trick in this mini-project.
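
To illustrate the idea of a transformed feature space, here is a small NumPy sketch on made-up data: two concentric circles cannot be separated by a line in 2D, but they can be separated by a plane once the squared radius is added as a third feature. This sketch computes the mapping explicitly for clarity. A kernel SVM avoids that by only evaluating inner products in the mapped space through the kernel function.

```python
import numpy as np

rng = np.random.default_rng(0)

# class 0 lies on a circle of radius 1, class 1 on a circle of radius 3
angles = rng.uniform(0.0, 2.0 * np.pi, size=200)
radii = np.concatenate([np.full(100, 1.0), np.full(100, 3.0)])
points = np.stack([radii * np.cos(angles), radii * np.sin(angles)], axis=1)  # (200, 2)
labels = np.concatenate([np.zeros(100), np.ones(100)])

# explicit nonlinear mapping: (x1, x2) -> (x1, x2, x1^2 + x2^2)
squared_radius = np.sum(points ** 2, axis=1, keepdims=True)
mapped = np.concatenate([points, squared_radius], axis=1)  # (200, 3)

# in the mapped space, the plane "third coordinate = 5" separates the classes
predicted = (mapped[:, 2] > 5.0).astype(float)
print(np.mean(predicted == labels))  # 1.0
```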

As a final step, visualize the predictions from the Linear-SVM and compare with the target labels. What do you find?

In [None]:
# TODO: visualize the predictions by Linear-SVM and compare with target label
# for train, val, and test set


## 2. Neural Network Classifier

Now we build a simple neural network classifier. We will use the 3-layer MLP from the Neural Network Basics homework as the classifier. The MLP we discussed in the previous homework was built for autoencoding, but here we need to perform a binary classification task. As a result, we need to change both the input and output dimensions of the model.

The input dimension here is the dimension of the Mel spectrogram features, i.e. the number of Mel bins (remember that the MLP processes each frame independently). Since we are doing binary classification, the output dimension can be 1, representing the **probability** that the input frame is a speech frame. We use the *Sigmoid function* to ensure that the output lies in the valid range for a probability, i.e. between 0 and 1.

Implement the MLP below.

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

# TODO: adopt the 3-layer MLP from the Neural Network Basics homework here
# Remember to modify the input/output dimensions
    
class MLP(nn.Module):
    pass


In the autoencoding task we used the mean-square error (MSE) as the training objective. For classification tasks, we typically use **cross-entropy (CE)** as the training objective. CE measures how close two distributions are, and it is always nonnegative: lower values of CE mean higher similarity between the two distributions. In our case where the target is either 0 or 1, i.e. the binary classification task, the loss is called the [**binary cross-entropy (BCE)** loss](https://towardsdatascience.com/understanding-binary-cross-entropy-log-loss-a-visual-explanation-a3ac6025181a):
<img src="https://miro.medium.com/max/1096/1*rdBw0E-My8Gu3f_BOB6GMA.png">

Here **y** is the label (0 or 1), **p(y)** is the predicted probability (between 0 and 1), and **N** is the total number of training samples you have in a batch. The closer **p(y)** is to **y**, the lower value the loss function will get.
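
In case the image does not load, the standard form of the BCE loss, written with the same symbols (and with $i$ indexing the samples in the batch), is:

$$
\mathrm{BCE} = -\frac{1}{N}\sum_{i=1}^{N}\Big[\, y_i \log p(y_i) + (1 - y_i)\log\big(1 - p(y_i)\big) \Big]
$$

For a sample with $y_i = 1$ only the first term is active, and for a sample with $y_i = 0$ only the second term is active.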

Implement the BCE function in PyTorch. Note that PyTorch does have a built-in function for BCE, but you should not use it directly here.


In [None]:
# TODO: implement BCE function

def BCE(probs, labels, eps=1e-9):
    """
    BCE between the predicted probabilities and the true labels.
    args:
        probs: shape (num_frame,), torch tensor
        labels: shape (num_frame,), torch tensor
        eps: a small value, what is this for?
    return:
        scalar float binary cross entropy averaged over the batch
    """
    raise NotImplementedError


Remember that during the training phase of the MLP, we evaluate the model performance on the validation set to select the best model. However, knowing the value of BCE does not mean that we know the actual classification accuracy. Thus we also need another function to calculate the accuracy given the predicted probabilities.

Typically we use a simple threshold-based rule to determine the predicted label from the predicted probability: if the predicted probability is higher than 0.5, we assign label 1, otherwise we assign label 0. We can then compare the predicted labels with the target labels to calculate the accuracy.
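
As a quick worked example of the threshold rule, consider the made-up probabilities below (NumPy is used here only for illustration, and the `toy_*` names are hypothetical):

```python
import numpy as np

toy_probs = np.array([0.10, 0.80, 0.50, 0.65, 0.30])
toy_target = np.array([0, 1, 1, 1, 1])

# probabilities strictly higher than 0.5 become label 1, everything else label 0
toy_pred = (toy_probs > 0.5).astype(int)
print(toy_pred)  # [0 1 0 1 0]

# three of the five predictions match the target
toy_accuracy = np.mean(toy_pred == toy_target)
print(toy_accuracy)  # 0.6
```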

Now train the MLP model with the training pipeline we used in the previous homework. Use BCE function for training, and Accuracy for validation to select the best model. 

Also note that the accuracy should be calculated on the entire validation set (10000 frames), so you should feed all the 10000 frames into the Accuracy function to get the correct value. You need to achieve over **85% accuracy on validation set** to get the full marks here, so play with the configurations of the MLP - nonlinearity in the hidden layers, number of hidden units, number of layers, learning rate, or any other tricks you know. Try to hit the highest accuracy you can achieve!

You may also encounter NaNs during training. Think about why they appear and how you can avoid them.

In [None]:
from torch.utils.data import Dataset, TensorDataset, DataLoader

# TODO: Create DataLoader for all 3 dataset partitions. 
# But first, you need to create Dataset objects.
# You can either save and load all the data via h5py, 
# or directly cast numpy arrays to TensorDataset.

# Reference: https://pytorch.org/tutorials/beginner/basics/data_tutorial.html

batch_size = 32 # Pass it to DataLoader. Feel free to change it.
train_dataset = None
val_dataset = None
test_dataset = None
train_loader = None
val_loader = None
test_loader = None
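
As a rough sketch of the second option (casting NumPy arrays directly), the example below wraps random toy arrays in a `TensorDataset`. The `toy_*` arrays are placeholders for illustration. The transpose assumes features stored as (num_features, num_frames), and the cast to float is an assumption that suits a loss computed on float probabilities.

```python
import numpy as np
import torch
from torch.utils.data import TensorDataset, DataLoader

# placeholder data: 64 features per frame, 100 frames, binary labels
toy_feats = np.random.randn(64, 100)
toy_labels = np.random.randint(0, 2, size=100)

# one dataset item = (feature vector of one frame, label of that frame)
toy_dataset = TensorDataset(
    torch.from_numpy(toy_feats.T).float(),   # (num_frames, num_features)
    torch.from_numpy(toy_labels).float(),    # (num_frames,)
)
toy_loader = DataLoader(toy_dataset, batch_size=32, shuffle=True)

batch_feats, batch_labels = next(iter(toy_loader))
print(batch_feats.shape, batch_labels.shape)  # torch.Size([32, 64]) torch.Size([32])
```

Shuffling is usually only needed for the training loader.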

In [None]:
# TODO: functions for training and validation
# Use BCE during training, and Accuracy during validation

def train(model, loader, opt, loss_fn):
    '''
    Training Step, called once per epoch.
    Model is updated.
    args:
        model: nn.Module
        loader: torch DataLoader
        opt: torch.optim
        loss_fn: differentiable loss_fn(probs, labels) -> scalar
    return:
        average training loss for this epoch
    '''
    model.train()
    train_loss_batch = []
    for data, labels in loader:
        raise NotImplementedError
    
    train_loss_epoch = sum(train_loss_batch) / len(train_loss_batch)
    
    return float(train_loss_epoch)


def validate(model, loader, criterion, return_results=False):
    '''
    Validation Step, called once per epoch.
    args:
        model: nn.Module
        loader: torch DataLoader
        criterion: criterion(predicts, labels) -> scalar
    return:
        average accuracy of all frames, and if return_results,
        (predicted probabilities, predicted labels)
    
    '''
    model.eval()
    # probs is probability between 0 and 1;
    # predicts is predicted label of 0 or 1.
    all_labels, all_probs, all_predicts \
        = torch.zeros(0), torch.zeros(0), torch.zeros(0)
    
    with torch.no_grad():
        for data, labels in loader:
            raise NotImplementedError
        
            all_labels = torch.cat([all_labels, labels])
            all_probs = torch.cat([all_probs, probs])
            all_predicts = torch.cat([all_predicts, predicts])    
            
    val_acc_epoch = None
    
    if return_results:
        return float(val_acc_epoch), (all_probs, all_predicts)
    else:
        return float(val_acc_epoch)


In [None]:
# TODO: instantiate your model
mlp = None
    
# TODO: initialize the optimizer here
# you can still use Adam as the optimizer
opt = None

# hyperparameters

total_epoch = None  # train the model for ? epochs
model_save = 'best_MLP_VAD.pt'  # path to save the best validation model

# main function

train_loss = []
val_acc = []

for epoch in range(1, total_epoch + 1):
    train_loss_epoch = None
    val_acc_epoch = None
    
    train_loss.append(train_loss_epoch)
    val_acc.append(val_acc_epoch)
    
    print(f'Epoch No.{epoch}:')
    print(f'      Training loss: {str(round(train_loss_epoch, 2))}')
    print(f'      Validation accuracy: {str(round(val_acc_epoch*100, 1))}%')
    
    if train_loss[-1] == np.min(train_loss):
        print('      Best training model found.')
    if val_acc[-1] == np.max(val_acc):
        # save current best model on validation set
        with open(model_save, 'wb') as f:
            torch.save(mlp.state_dict(), f)
            print('      Best validation model found and saved.')
    
    print('-' * 99)
    

In [None]:
# TODO: evaluate the best model found on the validation set on the training, validation and test sets
# Remember to load the best validation model above before making the evaluation


# TODO: print the accuracy on the three datasets
final_train_acc = None
final_val_acc = None
final_test_acc, (test_prob_mlp, test_predict_mlp) = None, None

# val acc should be above 0.85
print(f'Final train acc {str(round(final_train_acc, 2))} | val acc {str(round(final_val_acc, 2))} | test acc {str(round(final_test_acc, 2))}')

We evaluate the model with accuracy. Why can't we also train the model with accuracy instead of binary cross-entropy?

**TODO**: enter your answer here

What's the difference between the accuracy on validation set and test set? Why?

**TODO**: enter your answer here

Now we do some visualization on the predictions. What can you find?

In [None]:
# TODO: visualize the output from the MLP on the test set (both probability and predicted label) and compare with the target label

Now you have a linear SVM classifier and a nonlinear MLP classifier. What are the pros and cons of each?

**TODO**: enter your answer here

## 3. Post-processing

Until now, we are directly using the model predictions as the predicted labels for speech and non-speech regions. There are many ways that we can improve the results, and here we will try one of the post-processing methods - **smoothing**.

Smoothing is based on a straightforward intuition: if a frame is non-speech, then it is highly probable that its neighbouring frames are also non-speech. We can thus apply local smoothing to the model predictions via a moving average: the smoothed output at each frame is the average of the predictions in its neighbouring frames.

Let's implement the smoothing strategy for both Linear-SVM outputs and MLP outputs. Here let's set the number of neighbouring frames to 7: 3 in the past, 1 at the current frame, and 3 in the future. For the first 3 and last 3 frames, where there are not enough past or future frames, we keep the predictions unchanged.
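
To see what a moving average does, here is a tiny worked example with a shorter window of 3 frames (1 past, 1 current, 1 future) on a made-up prediction sequence. It uses an explicit loop for readability and is only an illustration of the idea, not the requested 7-frame implementation.

```python
import numpy as np

toy = np.array([0., 0., 1., 0., 1., 1., 1., 0., 1., 1.])

# keep the first and last frame unchanged, average over 3 frames elsewhere
smoothed = toy.copy()
for t in range(1, len(toy) - 1):
    smoothed[t] = toy[t - 1:t + 2].mean()

print(np.round(smoothed, 2))
# [0.   0.33 0.33 0.67 0.67 1.   0.67 0.67 0.67 1.  ]

# thresholding at 0.5 removes the isolated 1 and fills the isolated 0s
print((smoothed > 0.5).astype(int))  # [0 0 0 1 1 1 1 1 1 1]
```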

***Hint***: We can implement smoothing as a 1D convolution. The convolution kernel has length of 7, and is centered so that it sees 3 samples before and 3 samples after. Then, what should be the weights of this kernel?


In [None]:
# TODO: implement smoothing
def smoothing(prediction, context=7):
    # prediction: shape (num_frame,)
    raise NotImplementedError


In [None]:
# TODO: smoothing for Linear-SVM output label, print the accuracy

In [None]:
# TODO: smoothing for MLP output probability, print the accuracy
# The predicted labels are calculated after smoothing on the probability

In [None]:
# TODO: smoothing for MLP output label, print the accuracy

Visualize the MLP probabilities after smoothing and compare them with the original probabilities. What do you find?

In [None]:
# TODO: visualization of predicted labels after smoothing the MLP output probabilities

**TODO**: enter your discussions here

## Discussion: Evaluation Metrics for VAD

We only used the classification accuracy as the evaluation metric for the systems above. However, there are well-defined evaluation metrics for the VAD task. VAD systems are typically evaluated with the following 4 metrics [(Beritelli, F., et al., 2002)](https://ieeexplore.ieee.org/abstract/document/995824/):
  -  **Front End Clipping (FEC)**: Clipping introduced in passing from noise to speech activity.
  -  **Mid Speech Clipping (MSC)**: Clipping due to speech misclassified as noise.
  -  **OVER**: Noise interpreted as speech due to the VAD flag remaining active in passing from speech activity to noise.
  -  **Noise Detected as Speech (NDS)**: Noise interpreted as speech within a silence period.

The metrics can be visualized as follows. The image is adopted from https://github.com/jtkim-kaist/VAD.
<img src="https://user-images.githubusercontent.com/24668469/49742392-cd778680-fcdb-11e8-96b9-a599a4f85f4f.PNG">

If you are interested, you can try to implement the four metrics and evaluate the three models above with them. This is not part of the required exercises in this mini-project.

## Discussion: Input to the Models

We only used frame-level features in the two models we built above. However, temporal information can be very important in audio processing tasks. There are several ways you can utilize the temporal information:
  -  Use **context of features**: instead of using the feature of the current frame, concatenate the features in the neighbouring frames to the current frame to create a **context**.
  -  Use **recurrent neural networks**: similar to the example we have in the Neural Network Basics homework.
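
The first option can be sketched in a few lines of NumPy. `add_context` is a hypothetical helper written for this illustration, and repeating the edge frames as padding is an assumption (zero padding would be another reasonable choice).

```python
import numpy as np

def add_context(feats, left=2, right=2):
    """Stack every frame with its neighbouring frames.
    feats: (num_features, num_frames)
    returns: ((left + right + 1) * num_features, num_frames)
    """
    num_frames = feats.shape[1]
    # repeat the first/last frame so that every frame has enough neighbours
    padded = np.pad(feats, ((0, 0), (left, right)), mode='edge')
    # slice k holds, for each frame t, the frame at offset (k - left) from t
    shifted = [padded[:, k:k + num_frames] for k in range(left + right + 1)]
    return np.concatenate(shifted, axis=0)

toy_feats = np.arange(12, dtype=float).reshape(2, 6)  # 2 features, 6 frames
toy_context = add_context(toy_feats, left=1, right=1)
print(toy_context.shape)  # (6, 6)
```

With a context like this, the input dimension of the classifier grows by a factor of `left + right + 1`.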
  
You can play with different types of input features and compare the performance. This is not part of the required exercises in this mini-project.