# I. Phoneme classification 

This is the second part of **Homework 1** and it accounts for 60 points of the total score of 100 points for **Homework 1**.

In this assignment, you are asked to classify audio segments into 7 phoneme classes (VS, NF, SF, WF, ST, CL, and q for glottal). The meaning of those classes (except q) is given in the following table.


![Phoneme classes](image/phoneclass.png)

This assignment contains three tasks:

    - Task 1: Data Preparation and Feature extraction (10 points)
    - Task 2: A simple frame-based classification (35 points)
    - Task 3: Written report (15 points)

For Task 1, you are asked to use librosa to extract MFCC features from WAV files.

For Task 2, you are given a training set and a validation set in the folder audio/part-2. Your method will be evaluated on a hidden test set. Students who achieve results better than our simple baseline can get the maximum score for Task 2.

For Task 3, you are asked to report your experimental results.

## II. Submission instruction 

Notes:

    1. The notebook will generate Python files in the `submission` folder. The `submission` folder must be uploaded to the course website.
    2. While solving the assignment, do **NOT** change class and method names; otherwise the autograder tests will fail. However, you can add utility functions to the base class if needed.
    3. You must also upload a **PDF version** of the notebook (which will primarily be used to grade the report section of the notebook).
    
Put the `submission` folder and the 2 PDFs exported from the 2 notebooks into one folder named [student_name]_[student_id], zip the folder, and upload it to the course website (the teaching cube).

## III. Setup


In [1]:
%load_ext autoreload
%autoreload 2

`submission` contains all the files that you will submit; `output` contains the output files of your model.

In [2]:
import os

# Create the folders used by the notebook (no error if they already exist).
os.makedirs('submission', exist_ok=True)
os.makedirs('output', exist_ok=True)

# Make `submission` importable as a package.
try:
    open('submission/__init__.py', 'x').close()
except FileExistsError:
    pass

## IV. Data Preprocessing and Feature Extraction (10 points)


In [12]:
%%writefile submission/features.py
import glob
import pickle
import librosa
import scipy
import numpy as np
from tqdm import tqdm
from collections import Counter
from sklearn import preprocessing

def wavfile_to_mfccs(wavfile):
    """
    Returns (mfccs_and_deltas, hop_length, n_fft), where mfccs_and_deltas is a
    matrix of shape (nframes, 39): 13 MFCCs plus their deltas and delta-deltas
    for each 20ms frame in the wavfile.
    """
    x, sampling_rate = librosa.load(wavfile)

    window_duration_ms = 20
    n_fft = int((window_duration_ms / 1000.) * sampling_rate)

    hop_duration_ms = 10
    hop_length = int((hop_duration_ms / 1000.) * sampling_rate)

    mfcc_count = 13

    #### BEGIN YOUR CODE
    # Call librosa.feature.mfcc to get mfcc features for each frame of 20ms
    # Call librosa.feature.delta on the mfccs to get mfcc_delta
    # Call librosa.feature.delta with order 2 on the mfccs to get mfcc_delta2
    # Stack all of them (mfcc, mfcc_delta, mfcc_delta2) together 
    # to get the matrix mfccs_and_deltas of size (#frames, 39)
    # Pass the signal as a keyword argument; recent librosa versions require y=...
    mfcc = librosa.feature.mfcc(y=x, sr=sampling_rate, n_mfcc=mfcc_count,
                                n_fft=n_fft, hop_length=hop_length)
    mfcc_delta = librosa.feature.delta(mfcc, order=1)
    mfcc_delta2 = librosa.feature.delta(mfcc, order=2)
    # Stack to (39, #frames), then transpose to (#frames, 39).
    mfccs_and_deltas = np.concatenate((mfcc, mfcc_delta, mfcc_delta2), axis=0).T
    #### END YOUR CODE
    return mfccs_and_deltas, hop_length, n_fft

class ShortTimeAnalysis:
    def __init__(self):
        pass
    
    def perform(self, wavfile):
        pass
    
class MFCCAnalysis(ShortTimeAnalysis):
    def perform(self, wavfile):
        return wavfile_to_mfccs(wavfile)


In [57]:
# Testing your method
from submission.features import wavfile_to_mfccs

wavfile = "audio/part-2/train/SA1.WAV"

X, hop_length, window_len = wavfile_to_mfccs(wavfile)
print(X.shape)

(394, 39)
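
The window and hop sizes in `wavfile_to_mfccs` depend on the sampling rate returned by `librosa.load`. The sketch below reproduces that arithmetic without loading any audio. It assumes the default target rate of `librosa.load` (22050 Hz) and centered framing, which gives one frame per hop plus one. `frame_params` and `n_centered_frames` are hypothetical helper names, not part of the assignment.

```python
def frame_params(sampling_rate, window_ms=20, hop_ms=10):
    """Return (n_fft, hop_length) in samples, mirroring wavfile_to_mfccs."""
    n_fft = int((window_ms / 1000.) * sampling_rate)
    hop_length = int((hop_ms / 1000.) * sampling_rate)
    return n_fft, hop_length


def n_centered_frames(n_samples, hop_length):
    """Frame count under centered framing (assumed here, see lead-in)."""
    return 1 + n_samples // hop_length


n_fft, hop_length = frame_params(22050)
print(n_fft, hop_length)                      # 441 220
print(n_centered_frames(22050, hop_length))   # 101 frames for one second of audio
```

Note that with a 10 ms hop, one second of audio yields roughly 100 frames regardless of the window length.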


In [13]:
%%writefile submission/dataset.py
import glob
import pickle
import librosa
import scipy
import numpy as np
from tqdm import tqdm
from collections import Counter
from sklearn import preprocessing
from submission.features import MFCCAnalysis  # default analysis used below

unique_classes = ['CL', 'SF', 'VS', 'WF', 'ST', 'NF', "q"]

def read_data_folder(data_path):
    """
    @return
    wav_files: list of file paths to WAV files in the train or val folder.
    labels_files: list of file paths to PHNCLS files in the train or val folder.
    """
    # get all the WAV and PHNCLS in the folder data_path
    wav_files = sorted(glob.glob(data_path + "/*.WAV"))
    labels_files = sorted(glob.glob(data_path + "/*.PHNCLS"))
    return wav_files, labels_files

def extract_features(wavfile, label_file, first_seg_id=0, stanalysis=MFCCAnalysis()):
    """
    Extract segment labels and representations.

    @arguments:
    wavfile: path to wav file
    label_file: path to PHNCLS file
    first_seg_id: segment_id of the first segment of the current file.
                  When you process a list of files, you may want segment id to increase globally.

    @returns:
    X: #frames, #features
    y: #frames

    frame2seg: mapping from frame id to segment id
    y_seg: segment labels (segment-based groundtruth)
    """
    X_st, hop_length, window_len = stanalysis.perform(wavfile)

    seg_labels = {}
    point_seg_ids = []
    with open(label_file, 'r') as f:
        for line in f.readlines():
            start_frame, end_frame, label = line.split(' ')
            start_frame = int(start_frame)
            end_frame = int(end_frame)
            
            label = label.strip()
            segment_id = len(seg_labels) + first_seg_id
            seg_labels[segment_id] = label
            
            phn_frames = end_frame - start_frame
            point_seg_ids.extend([segment_id]*phn_frames) # point_seg_ids stores segment ids for every sample point.

    X = []
    y = []
    frame_seg_ids = []
    curr_frame = curr_hop = 0
    while (curr_frame < (len(point_seg_ids) - window_len)):
        ### BEGIN YOUR CODE
        # extract the segment ids for the sample points within the frame 
        # from curr_frame to curr_frame + window_len
        # Since one frame may overlap with more than one segment, 
        # sample points within the frame may be assigned with multiple segment ids.
        # We get the major segment id as the segment id corresponding to the current frame.
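        # Majority vote over the sample points of the frame, in the style of
        # Boyer-Moore: keep one candidate id and a running counter.
        pointer = curr_frame
        segment_id = point_seg_ids[curr_frame]
        segment_id_count = 0
        while pointer < curr_frame + window_len:
            if point_seg_ids[pointer] == segment_id:
                segment_id_count += 1
            else:
                segment_id_count -= 1
                if segment_id_count < 0:
                    # the candidate has been outvoted: switch to the new id
                    segment_id_count = 0
                    segment_id = point_seg_ids[pointer]
            pointer += 1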
    
        ### END YOUR CODE
        
        label = seg_labels[segment_id]
        y.append(label)
        X.append(X_st[curr_hop,:])
        frame_seg_ids.append(segment_id)
        curr_hop += 1
        curr_frame += hop_length

    return X, y, frame_seg_ids, seg_labels


def prepare_data(wavfiles, label_files, stanalysis=MFCCAnalysis()):
    X = []
    y = []
    segment_ids = []
    seg2labels = {}
    
    file_seg_id = 0
    for i in tqdm(range(len(wavfiles))):
        wavfile = wavfiles[i]
        label_file = label_files[i]
        x_, y_, seg_ids_, seg_labels_ = extract_features(
            wavfile, label_file, first_seg_id=file_seg_id, stanalysis=stanalysis)

        file_seg_id += len(seg_labels_)
        for k,v in seg_labels_.items():
            seg2labels[k] = v

        X.append(x_)
        y.extend(y_)
        segment_ids.extend(seg_ids_)
        

    X = np.concatenate(X)
    return X, y, segment_ids, seg2labels
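
The majority rule described in the comments of `extract_features` can also be written with `collections.Counter`. The sketch below is an alternative formulation on toy data, not the code used in the cell above. `majority_segment_id` is a hypothetical helper name, and ties are resolved in favor of the id seen first in the window.

```python
from collections import Counter


def majority_segment_id(point_seg_ids, start, window_len):
    """Most frequent segment id among samples [start, start + window_len)."""
    window = point_seg_ids[start:start + window_len]
    return Counter(window).most_common(1)[0][0]


# Toy example: segment 0 covers 6 sample points, segment 1 covers 4.
point_seg_ids = [0] * 6 + [1] * 4
print(majority_segment_id(point_seg_ids, 0, 4))  # 0 (window lies inside segment 0)
print(majority_segment_id(point_seg_ids, 4, 4))  # 0 (tie [0, 0, 1, 1], first-seen id wins)
print(majority_segment_id(point_seg_ids, 5, 4))  # 1 (window [0, 1, 1, 1])
```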

In [58]:
# Testing 
from submission.dataset import read_data_folder, prepare_data

wavfiles, label_files = read_data_folder("audio/part-2/train/")
X, y, segment_ids, seg2labels = prepare_data(wavfiles[0:1], label_files[0:1])
print (X.shape)
print (len(y))
print (len(segment_ids))
print (len(seg2labels))
assert X.shape[0] == len(y) == len(segment_ids)

100%|██████████| 1/1 [00:00<00:00,  5.25it/s]

(284, 39)
284
284
41





In [10]:
%%writefile submission/preprocessing.py
import numpy as np  # needed for np.sqrt in normalize_mean
from sklearn import preprocessing

# Data normalization
def normalize_mean(X):
    """
    Using scikit learn preprocessing to transform feature matrix
    using StandardScaler with mean and standard deviation
    """
    ### BEGIN YOUR CODE
    scaler=preprocessing.StandardScaler()
    X=scaler.fit_transform(X)
    ### END YOUR CODE
    return X, scaler.mean_, np.sqrt(scaler.var_)

def apply_normalize_mean(X, scaler_mean, scaler_std):
    """
    Apply normalization, fitted on the training dataset, to a testing dataset.
    
    @arguments:
    X: #frames, #features (in case we use mfcc, #features is 39)
    scaler_mean: mean of the fitted StandardScaler used in normalize_mean.
    scaler_std: standard deviation of the fitted StandardScaler used in normalize_mean.
    
    @returns:
    X: normalized matrix
    """
    ### BEGIN YOUR CODE
    X=X-scaler_mean
    X=X/scaler_std

    ### END YOUR CODE
    return X


In [112]:
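# Testing
from submission.preprocessing import normalize_mean

# normalize_mean returns three values: the scaled matrix, the mean, and the std
X, scaler_mean, scaler_std = normalize_mean(X)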
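
The two normalization functions implement a fit-on-train, apply-to-test pattern. The sketch below shows the same arithmetic in plain NumPy on synthetic data (the random matrices are stand-ins, not the assignment's features). It uses the population standard deviation (`ddof=0`), which is what `np.sqrt(scaler.var_)` corresponds to.

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(100, 39))  # synthetic stand-in
X_val = rng.normal(loc=5.0, scale=2.0, size=(20, 39))     # synthetic stand-in

# Fit the statistics on the training frames only.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)  # population std (ddof=0)

X_train_norm = (X_train - mean) / std
X_val_norm = (X_val - mean) / std  # reuse the training statistics

print(np.allclose(X_train_norm.mean(axis=0), 0.0))  # True
print(np.allclose(X_train_norm.std(axis=0), 1.0))   # True
```

The validation matrix is generally not exactly zero-mean after this step, since its own statistics are never used.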

## V. Phone Classifier (35 points)

In this part, we will perform isolated phone classification. We assume that phones are well segmented. This section includes two tasks: phone classifier with MFCC, and building your best model.

### Build your best model (10 points)

Try different methods to perform feature extraction from frames and report your best result. 

### Phone Classifier with MFCC features (20 points)

In [15]:
%%writefile submission/model.py

import numpy as np
from collections import Counter
from torch import nn
from torch.nn import Sequential
from torch.utils.data import TensorDataset
from torch.utils.data import DataLoader
import torch

def onehot_matrix(samples_vec, num_classes):
    """
    >>> onehot_matrix(np.array([1, 0, 3]), 4)
    [[ 0.  1.  0.  0.]
     [ 1.  0.  0.  0.]
     [ 0.  0.  0.  1.]]

    >>> onehot_matrix(np.array([2, 2, 0]), 3)
    [[ 0.  0.  1.]
     [ 0.  0.  1.]
     [ 1.  0.  0.]]

    Ref: http://bit.ly/1VSKbuc
    """
    num_samples = samples_vec.shape[0]

    onehot = np.zeros(shape=(num_samples, num_classes))
    onehot[range(0, num_samples), samples_vec] = 1

    return onehot

def segment_based_evaluation(y_pred, segment_ids, segment2label):
    """
    @arguments:
    y_pred: predicted labels of frames
    segment_ids: segment id of frames
    segment2label: mapping from segment id to label
    """
    seg_pred = {}
    for frame_id, seg_id in enumerate(segment_ids):
        if seg_id not in seg_pred:
            seg_pred[seg_id] = []
        seg_pred[seg_id].append(y_pred[frame_id])

    ncorrect = 0
    for seg_id in seg_pred.keys():
        predicted = seg_pred[seg_id]
        c = Counter(predicted)
        predicted_label = c.most_common()[0][0] # take the majority voting

        if predicted_label == segment2label[seg_id]:
            ncorrect += 1

    accuracy = ncorrect/len(segment2label)
    print('Segment-based Accuracy using %d testing samples: %f' % (len(segment2label), accuracy))

class Net(nn.Module):
    def __init__(self):
        super(Net,self).__init__()
        self.lstm=nn.LSTM(input_size=3,hidden_size=64,batch_first=True)
        self.fcmodel=Sequential(
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(in_features=64,out_features=64),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(in_features=64,out_features=32),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(in_features=32,out_features=7)
        )
    def forward(self,x):
        x,(hn,cn)=self.lstm(x)
        x=x.permute([1,0,2])
        x=x[-1]
        x=self.fcmodel(x)
        return x

class PhonemeClassifier(object):
    def __init__(self):
        unique_phonemes = ['CL', 'SF', 'VS', 'WF', 'ST', 'NF', "q"]
        self.labels = unique_phonemes

    def label_to_ids(self, y):
        y_ = [self.labels.index(label) for label in y]
        return y_

    def id_to_label(self, y):
        y_ = [self.labels[i] for i in y]
        return y_
        
    def train(self, X_train, y_train):
        y_train = self.label_to_ids(y_train)
        y_train = np.asarray(y_train)
        ### BEGIN YOUR CODE
        X_train=torch.from_numpy(X_train).float()
        X_train=X_train.reshape((-1,13,3))
        y_train=torch.Tensor(y_train).long()
        print(X_train.shape,y_train.shape)
        dataset=TensorDataset(X_train,y_train)
        dataset=DataLoader(dataset,batch_size=4096,shuffle=True)
        device = ('cuda' if torch.cuda.is_available() else 'cpu')
        model=Net()
        model=model.to(device)
        loss_fn=nn.CrossEntropyLoss()
        loss_fn=loss_fn.to(device)
        optimizer=torch.optim.Adam(model.parameters(),lr=0.01)
        model.train()
        for epoch in range(20):
            loss_sum=0
            for data in dataset:
                x,y=data
                x=x.to(device)
                y=y.to(device)
                y_pred=model(x)
                loss=loss_fn(y_pred,y)
                loss_sum+=loss.item()  # accumulate a Python float, not a graph-attached tensor
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            print("Epoch: {}, Loss_sum: {}".format(epoch,loss_sum))
        ### END YOUR CODE
        self.model = model

    def test(self, X_test, y_test):
        """
        @arguments:
        X_test: #frames, #features (39 for mfcc)
        y_test: frame labels
        """
        ### BEGIN YOUR CODE
        # perform prediction and get out_classes array
        # which contain class label id for each frame.
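        X_test=torch.from_numpy(X_test).float()
        X_test=X_test.reshape((-1,13,3))
        device = ('cuda' if torch.cuda.is_available() else 'cpu')
        X_test=X_test.to(device)
        self.model.eval()
        with torch.no_grad():  # inference only, no gradients needed
            label_predict=self.model(X_test)
        # move the predicted class ids back to the CPU as plain ints
        out_classes=torch.argmax(label_predict,dim=1).cpu().tolist()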
        ### END YOUR CODE
        
        out_classes = self.id_to_label(out_classes) # from id to string
        out_classes = np.asarray(out_classes)
        acc = sum(out_classes == y_test) * 1.0 / len(out_classes)
        print('Frame-based Accuracy using %d testing samples: %f' % (X_test.shape[0], acc))

        return out_classes


Overwriting submission/model.py
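
The `reshape((-1, 13, 3))` in `train` and `test` determines what the LSTM sees at each of its 13 time steps. Since each 39-dimensional frame is stored as 13 MFCCs, then 13 deltas, then 13 delta-deltas, the sketch below labels every position by its group to show how two possible layouts differ. The alternative layout is shown only for comparison; whether it would improve accuracy is not established here.

```python
import numpy as np

# Label each position of one 39-dim frame with its group:
# 0 = MFCC, 1 = delta, 2 = delta-delta (13 coefficients each).
group = np.repeat([0, 1, 2], 13).reshape(1, 39)

as_13x3 = group.reshape(-1, 13, 3)                      # layout used in the model
as_steps = group.reshape(-1, 3, 13).transpose(0, 2, 1)  # alternative layout

print(as_13x3[0, 0])   # [0 0 0] -> three consecutive MFCCs
print(as_steps[0, 0])  # [0 1 2] -> one coefficient with its delta and delta-delta
```

Both arrays have shape `(1, 13, 3)`, so either can be fed to an LSTM with `input_size=3`.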


In [24]:
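from submission import model
from submission.dataset import read_data_folder, prepare_data
from submission.preprocessing import normalize_mean, apply_normalize_mean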

wav_files, label_files = read_data_folder("audio/part-2/train")
X_train, y_train, _, _  = prepare_data(wav_files, label_files)
X_train, scaler_mean, scaler_std= normalize_mean(X_train)
wav_files, label_files = read_data_folder("audio/part-2/val")
print (len(wav_files), len(label_files))
X_test, y_test, test_seg_ids, test_seg2labels  = prepare_data(
        wav_files, label_files)

X_test = apply_normalize_mean(X_test, scaler_mean,scaler_std)
cls = model.PhonemeClassifier()
cls.train(X_train, y_train)
y_pred = cls.test(X_test, y_test)

model.segment_based_evaluation(y_pred, test_seg_ids, test_seg2labels)
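
Segment-based evaluation groups the frame predictions by segment id and takes a majority vote within each segment. The sketch below is a minimal version of that idea on toy labels. Unlike `segment_based_evaluation`, it returns the accuracy instead of printing it, and `segment_accuracy` is a hypothetical name.

```python
from collections import Counter, defaultdict


def segment_accuracy(y_pred, segment_ids, segment2label):
    """Majority-vote the frame predictions per segment, then score segments."""
    votes = defaultdict(list)
    for label, seg_id in zip(y_pred, segment_ids):
        votes[seg_id].append(label)
    correct = sum(
        Counter(labels).most_common(1)[0][0] == segment2label[seg_id]
        for seg_id, labels in votes.items()
    )
    return correct / len(segment2label)


# Toy data: three segments with 3, 1, and 2 frames.
y_pred = ['VS', 'VS', 'CL', 'NF', 'SF', 'SF']
segment_ids = [0, 0, 0, 1, 2, 2]
segment2label = {0: 'VS', 1: 'ST', 2: 'SF'}
print(segment_accuracy(y_pred, segment_ids, segment2label))  # 2 of 3 segments correct
```

In this toy case the wrong frame inside segment 0 is outvoted, so a frame error does not necessarily become a segment error.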


In [16]:
%%writefile submission/best_model.py
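import librosa
import numpy as np
from submission import model
from submission import features as ft
from submission.dataset import read_data_folder, prepare_data
from submission.preprocessing import normalize_mean, apply_normalize_mean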

class BestAnalysis(ft.MFCCAnalysis):
    def perform(self,wavfile):
        ### BEGIN YOUR CODE
        ##
        x, sampling_rate = librosa.load(wavfile)
        window_duration_ms = 40
        n_fft = int((window_duration_ms / 1000.) * sampling_rate)

        hop_duration_ms = 10
        hop_length = int((hop_duration_ms / 1000.) * sampling_rate)
        mfcc_count = 13
        # Pass the signal as a keyword argument; recent librosa versions require y=...
        mfcc = librosa.feature.mfcc(y=x, sr=sampling_rate, n_mfcc=mfcc_count,
                                    n_fft=n_fft, hop_length=hop_length)
        mfcc_delta = librosa.feature.delta(mfcc, order=1)
        mfcc_delta2 = librosa.feature.delta(mfcc, order=2)
        # Stack to (39, #frames), then transpose to (#frames, 39).
        mfccs_and_deltas = np.concatenate((mfcc, mfcc_delta, mfcc_delta2), axis=0).T
        ##
        #### END YOUR CODE
        return mfccs_and_deltas, hop_length, n_fft
        
    
wav_files, label_files = read_data_folder("audio/part-2/train")
X_train, y_train, _, _  = prepare_data(wav_files, label_files, stanalysis=BestAnalysis())

X_train, scaler_mean ,scaler_std= normalize_mean(X_train)
    
wav_files, label_files = read_data_folder("audio/part-2/val")
print (len(wav_files), len(label_files))
X_test, y_test, test_seg_ids, test_seg2labels  = prepare_data(
        wav_files, label_files, stanalysis=BestAnalysis())

X_test = apply_normalize_mean(X_test, scaler_mean,scaler_std)
cls = model.PhonemeClassifier()
cls.train(X_train, y_train)
y_pred = cls.test(X_test, y_test)

model.segment_based_evaluation(y_pred, test_seg_ids, test_seg2labels)


100%|██████████| 342/342 [00:56<00:00,  6.04it/s]


83 83


100%|██████████| 83/83 [00:14<00:00,  5.54it/s]


torch.Size([74923, 13, 3]) torch.Size([74923])
Epoch: 0, Loss_sum: 31.466890335083008
Epoch: 1, Loss_sum: 28.98883056640625
Epoch: 2, Loss_sum: 28.808786392211914
Epoch: 3, Loss_sum: 28.703929901123047
Epoch: 4, Loss_sum: 28.651626586914062
Epoch: 5, Loss_sum: 28.679304122924805
Epoch: 6, Loss_sum: 28.591684341430664
Epoch: 7, Loss_sum: 28.58216667175293
Epoch: 8, Loss_sum: 28.600013732910156
Epoch: 9, Loss_sum: 28.54745101928711
Epoch: 10, Loss_sum: 28.551931381225586
Epoch: 11, Loss_sum: 28.54212188720703
Epoch: 12, Loss_sum: 28.545263290405273
Epoch: 13, Loss_sum: 28.50725555419922
Epoch: 14, Loss_sum: 28.51063346862793
Epoch: 15, Loss_sum: 28.51105499267578
Epoch: 16, Loss_sum: 28.469661712646484
Epoch: 17, Loss_sum: 28.413740158081055
Epoch: 18, Loss_sum: 28.174589157104492
Epoch: 19, Loss_sum: 27.88580894470215
Frame-based Accuracy using 19258 testing samples: 0.513761
Segment-based Accuracy using 3324 testing samples: 0.436823
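
The printed `Loss_sum` is a sum of per-batch mean cross-entropy values, so it is easier to read after dividing by the number of batches. The sketch below does that arithmetic for the first and last epochs of the log above and compares the result with the loss of a uniform guess over 7 classes. It assumes the default `DataLoader` behavior of keeping the last partial batch and the default mean reduction of `nn.CrossEntropyLoss`.

```python
import math

n_train, batch_size, n_classes = 74923, 4096, 7
n_batches = math.ceil(n_train / batch_size)  # last partial batch is kept

first_epoch = 31.466890335083008 / n_batches  # Loss_sum of epoch 0
last_epoch = 27.88580894470215 / n_batches    # Loss_sum of epoch 19
chance = math.log(n_classes)  # cross-entropy of a uniform guess

print(n_batches)              # 19
print(round(first_epoch, 3))  # 1.656
print(round(last_epoch, 3))   # 1.468
print(round(chance, 3))       # 1.946
```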


## VI. Written Report (15%)


### Question 1 (3%):

Describe your classification model.

**Your answer**
First, I reshape each frame's input from shape (1, 39) to (13, 3) and pass it through an LSTM (hidden_size=64), keeping the output at the last time step. I then follow it with three linear layers of shape 64*64, 64*32, and 32*7 respectively (the activation function is ReLU, with dropout of 0.5).

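
As a rough indication of model size, the number of trainable parameters of the network described above can be counted by hand. The sketch assumes the standard single-layer, unidirectional `nn.LSTM` parameterization (four gates, each with input weights, recurrent weights, and two bias vectors). `lstm_params` and `linear_params` are hypothetical helper names.

```python
def lstm_params(input_size, hidden_size):
    # Four gates, each with input weights, recurrent weights and two bias vectors.
    return 4 * (input_size * hidden_size
                + hidden_size * hidden_size
                + 2 * hidden_size)


def linear_params(in_features, out_features):
    # Weight matrix plus bias vector.
    return in_features * out_features + out_features


total = (lstm_params(3, 64)
         + linear_params(64, 64)
         + linear_params(64, 32)
         + linear_params(32, 7))
print(total)  # 24135
```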
### Question 2 (2%)
What is the best performance that you can get with MFCC on the validation set?

**Your answer**
Frame-based Accuracy: 0.518901
Segment-based Accuracy: 0.438327

### Question 3 (10%): 

Describe the methods that you have tried and the results that you have achieved when you worked on the "best model" task.

**Your answer**
I tried using only fully connected layers, LSTMs, CNNs, and combinations of them, but none yielded better results because of over-fitting. I also tried different values of hop_length and window_length, but this usually results in a trade-off (a longer window_length gives better frame-based accuracy but worse segment-based accuracy). More features, different optimizers, and different learning rates also gave similar results.