# ECS7020P_miniproject_basic

# 1 Author

**Student Name**: Yiu Hang Cheung  
**Student ID**: 220777713

# 2 Problem formulation

The objective is to use the MLEnd London Sounds dataset to build a machine learning pipeline that takes an audio segment as input and predicts whether it was recorded indoors or outdoors.<br>
Audio signals are a complicated data type to analyse, so a key part of the problem is deciding which features to extract and pass to the model.

# 3 Machine Learning pipeline

The inputs are the raw audio files. 

They are passed into the machine learning pipeline which includes the feature extraction, data standardisation and the model.

For the feature extraction part, each raw audio file is loaded and transformed into 2 audio time series: 1 for the first 3 seconds, 1 for the following 3 seconds.

The first 3 seconds of the audio segment contain the voice of the participant and the following 3 seconds contain only the background sound.

Then, for each of the 2 time series, the Mel-frequency cepstral coefficients (MFCC), root mean square (RMS) value, zero-crossing rate and spectral flatness are computed. These are the attributes passed into the model for training.

Before training the model, the attributes are standardised and a grid search with cross-validation is performed to find the optimal hyperparameters.

The model is then trained, and the final output is a binary class, 0 or 1, with 0 meaning outdoor and 1 meaning indoor.

# 4 Transformation stage

The transformation stage includes feature extraction and standardisation.

The raw audio files and their corresponding labels are the inputs.

The feature extraction part is carried out by the getXy function, which uses several feature extraction functions from the librosa library.

Step 1: load each audio segment to get 2 scaled audio time series: 1 for the first 3 seconds, 1 for the following 3 seconds.

Step 2: pass the 2 audio time series to extract the following features:

 - 13 Mel-frequency cepstral coefficients (MFCC) in which each MFCC captures certain details of the spectral envelope of the audio signal; 
    
 - root mean square (RMS) value which measures the average loudness;
 
 - zero-crossing rate, which indicates how often the audio waveform changes sign; and

 - spectral flatness, which quantifies how noise-like a sound is.
    
All of the above feature extraction functions return values for each frame, so the mean is taken over the frames of each audio time series.
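As a rough illustration of "compute a feature per frame, then average over frames", the sketch below uses only the standard library on a toy signal. The helper names (`frame_signal`, `rms`, `zero_crossing_rate`) and the tiny frame length are assumptions made for this example; librosa's own framing, padding and zero-crossing conventions differ in detail.

```python
import math

def frame_signal(signal, frame_length, hop_length):
    """Split a 1-D signal into frames (no padding, illustrative only)."""
    return [signal[start:start + frame_length]
            for start in range(0, len(signal) - frame_length + 1, hop_length)]

def rms(frame):
    """Root mean square of one frame."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum((a >= 0) != (b >= 0) for a, b in zip(frame, frame[1:]))
    return crossings / (len(frame) - 1)

# Toy signal: 8 samples alternating in sign, followed by 8 samples of silence.
signal = [1.0, -1.0] * 4 + [0.0] * 8
frames = frame_signal(signal, frame_length=4, hop_length=4)

rms_per_frame = [rms(f) for f in frames]
zcr_per_frame = [zero_crossing_rate(f) for f in frames]

# One attribute per feature: the mean over all frames.
rms_mean = sum(rms_per_frame) / len(rms_per_frame)
zcr_mean = sum(zcr_per_frame) / len(zcr_per_frame)
print(rms_mean, zcr_mean)
```

Averaging discards how the feature evolves within the 3-second window, which keeps the attribute vector short at the cost of temporal detail.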

As a result, 32 attributes, i.e. (13 + 3) × 2, are obtained for each audio segment, and the getXy function returns an array of attributes X, an array of the corresponding labels y and an array of the file names.

Step 3: standardise the attributes, z = (x − μ)/σ, so they are ready to be passed into the model. This is done after the getXy function, right before passing them into the model.
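The standardisation step can be sketched for a single attribute as follows. This is a minimal, dependency-free illustration with made-up numbers; the function names are hypothetical. It assumes the population standard deviation, which is also what `StandardScaler` and NumPy's default `std` use, and it reuses the training mean and standard deviation for new data.

```python
import statistics

def fit_standardiser(column):
    """Return the mean and population standard deviation of one attribute."""
    mu = statistics.fmean(column)
    sigma = statistics.pstdev(column)
    return mu, sigma

def standardise(column, mu, sigma):
    """Apply z = (x - mu) / sigma to every value."""
    return [(x - mu) / sigma for x in column]

# Made-up values for one attribute.
train = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
new = [5.0, 13.0]

mu, sigma = fit_standardiser(train)      # statistics come from training data only
z_train = standardise(train, mu, sigma)
z_new = standardise(new, mu, sigma)      # new data reuses the same mu and sigma
print(mu, sigma, z_new)
```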

# 5 Modelling

A support vector classifier (SVC) is built because each sample is represented by 32 attributes and an SVC can take advantage of kernel methods to deal with high-dimensional data.<br>
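For intuition about the kernel idea, the RBF kernel measures similarity between two attribute vectors as exp(−γ‖x − z‖²). The sketch below is a plain-Python illustration of that formula only, not of how the SVC is trained; the function name and the example vectors are assumptions.

```python
import math

def rbf_kernel(x, z, gamma):
    """RBF similarity: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

a = [0.0, 0.0]
b = [3.0, 4.0]          # squared distance from a is 25

same = rbf_kernel(a, a, gamma=0.01)    # identical points give similarity 1
apart = rbf_kernel(a, b, gamma=0.01)   # exp(-0.25)
print(same, apart)
```

A larger gamma makes the similarity fall off faster with distance, so each training sample influences a smaller neighbourhood.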

# 6 Methodology

Given that there are about 2500 samples, not all of them may be required to train the model.<br>
Beyond a certain number of samples, the model is not expected to improve noticeably with more data.

In the experiment, 500, 1000 and 1500 samples are fed into the pipeline in turn, and a cross-validated grid search is performed each time to find the optimal hyperparameters: kernel, C and gamma.<br>
Cross-validation allows more samples to be used for training and gives an average accuracy.<br>
At the same time, it can give us an idea of the number of samples required to train the model. 

After grid-search, the optimised set of hyperparameters and the number of required samples can be decided.<br>
The model will then be trained and the remaining samples will be used as a validation set.

The model performance will be assessed by the accuracy.

# 7 Dataset

The label csv file is loaded.

After exploration, a sample with a wrong spot label ('room13') is removed.

In [1]:
import pandas as pd

MLENDLS_df = pd.read_csv('MLEndLS.csv').set_index('file_id')
MLENDLS_df

file_id,area,spot,in_out,Participant
0001.wav,british,street,outdoor,S151
0002.wav,kensington,dinosaur,indoor,S127
0003.wav,campus,square,outdoor,S18
0004.wav,kensington,hintze,indoor,S179
0005.wav,campus,square,outdoor,S176
...,...,...,...,...
2496.wav,westend,trafalgar,outdoor,S151
2497.wav,campus,square,outdoor,S6
2498.wav,westend,national,indoor,S96
2499.wav,british,room12,indoor,S73


In [2]:
MLENDLS_df.groupby('area')['spot'].value_counts()

area        spot      
Euston      gardens        41
            library        41
            upper          41
            forecourt      40
            ritblat        36
            piazza         35
british     greatcourt     84
            room12         83
            forecourt      82
            square         77
            street         77
            room13          1
campus      square        142
            ground        139
            canal         137
            curve         136
            reception     136
            library       133
kensington  albert         26
            hintze         26
            dinosaur       25
            marine         24
            pond           22
            cromwell       21
southbank   waterloo       41
            bridge         40
            skate          40
            food           39
            book           36
            royal          33
westend     piazza        116
            charing       113
            marke

In [3]:
MLENDLS_df.loc[MLENDLS_df['spot'] == 'room13']

file_id,area,spot,in_out,Participant
0762.wav,british,room13,indoor,S71


In [4]:
MLENDLS_df = MLENDLS_df.drop(index = ['0762.wav'], axis = 0)
MLENDLS_df.shape

(2499, 4)

In [5]:
import numpy as np
import matplotlib.pyplot as plt
import os, sys, re, pickle, glob
import urllib.request
import IPython.display as ipd
from tqdm import tqdm
import librosa

In [6]:
dataset_path = './dataset/*.wav'
files = glob.glob(dataset_path)
len(files)

2499

In [7]:
def getXy(files, labels_file, scale_audio=False, onlySingleDigit=False):
    # onlySingleDigit is unused; it is kept so that existing calls still work.
    X, y, file_id = [], [], []
    for file in tqdm(files):
        try:
            fileID = os.path.basename(file)
            # Label: True for indoor, False for outdoor
            yi = labels_file.loc[fileID]['in_out'] == 'indoor'

            # sr=None keeps the native sampling rate of each file
            fs = None
            # First 3 seconds (voice) and the following 3 seconds (background)
            x_voice, fs_voice = librosa.load(file, sr=fs, duration=3)
            x_bg, fs_bg = librosa.load(file, sr=fs, offset=3, duration=3)
            if scale_audio: x_voice = x_voice/np.max(np.abs(x_voice))
            if scale_audio: x_bg = x_bg/np.max(np.abs(x_bg))

            # Frame-wise features of the voice segment, averaged over frames
            mfcc_voice = librosa.feature.mfcc(y=x_voice, sr=fs_voice, n_mfcc=13).mean(axis=1)
            rms_voice = librosa.feature.rms(y=x_voice).mean(axis=1)
            zcr_voice = librosa.feature.zero_crossing_rate(x_voice).mean(axis=1)
            flat_voice = librosa.feature.spectral_flatness(y=x_voice).mean(axis=1)

            # Same features for the background segment
            mfcc_bg = librosa.feature.mfcc(y=x_bg, sr=fs_bg, n_mfcc=13).mean(axis=1)
            rms_bg = librosa.feature.rms(y=x_bg).mean(axis=1)
            zcr_bg = librosa.feature.zero_crossing_rate(x_bg).mean(axis=1)
            flat_bg = librosa.feature.spectral_flatness(y=x_bg).mean(axis=1)

            # 16 voice attributes followed by 16 background attributes
            xi = list(mfcc_voice)+list(rms_voice)+list(zcr_voice)+list(flat_voice)+list(mfcc_bg)+list(rms_bg)+list(zcr_bg)+list(flat_bg)
            X.append(xi)
            y.append(int(yi))
            file_id.append(fileID)
        except Exception:
            # Files that cannot be loaded or processed are skipped
            continue

    return np.array(X), np.array(y), np.array(file_id, ndmin=2).T

In [8]:
X,y,file_id = getXy(files, labels_file=MLENDLS_df, scale_audio=True, onlySingleDigit=True)

  if scale_audio: x_voice = x_voice/np.max(np.abs(x_voice))
  if scale_audio: x_bg = x_bg/np.max(np.abs(x_bg))
  return f(*args, **kwargs)
100%|███████████████████████████████████████| 2499/2499 [01:33<00:00, 26.70it/s]


As described above, the getXy function will return an array of attributes X and an array of the corresponding label y.

In [9]:
print(X.shape)
X

(2496, 32)


array([[-1.42263245e+02,  1.73113449e+02, -2.35170441e+01, ...,
         1.27644911e-01,  3.98411106e-02,  1.41168246e-04],
       [-1.92750443e+02,  1.63034164e+02, -2.76769543e+01, ...,
         1.46406367e-01,  4.34853101e-02,  1.28979329e-04],
       [-1.39498154e+02,  1.73110825e+02, -1.93795586e+01, ...,
         1.79077715e-01,  3.98090613e-02,  6.16960635e-04],
       ...,
       [-1.53162552e+02,  1.64293106e+02, -2.06665134e+01, ...,
         2.09317267e-01,  3.40081292e-02,  4.72155283e-04],
       [-1.39244354e+02,  1.77909454e+02, -1.21692982e+01, ...,
         2.43926510e-01,  3.87966789e-02,  4.76350484e-04],
       [-2.60122894e+02,  1.80845459e+02,  4.24531269e+00, ...,
         2.71132231e-01,  1.08402208e-02,  5.16847394e-05]])

In [10]:
print(y.shape)
y

(2496,)


array([0, 0, 1, ..., 1, 0, 0])

In [11]:
file_id

array([['2217.wav'],
       ['0400.wav'],
       ['0366.wav'],
       ...,
       ['1061.wav'],
       ['0419.wav'],
       ['1707.wav']], dtype='<U8')

After feature extraction, 2496 samples remain: in addition to the mislabelled sample that was removed, 3 audio files could not be processed by the getXy function.
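One way to find out which files were skipped is to compare the input paths with the file names that came back. The sketch below is an assumption-laden illustration: `find_dropped` is a hypothetical helper and the file names are placeholders, not files from the dataset.

```python
import os

def find_dropped(paths, processed_ids):
    """Return base names of input paths that are missing from processed_ids."""
    processed = set(processed_ids)
    names = (os.path.basename(p) for p in paths)
    return sorted(name for name in names if name not in processed)

# Placeholder names for illustration only.
example_paths = ['./dataset/a.wav', './dataset/b.wav', './dataset/c.wav']
example_processed = ['a.wav', 'c.wav']

dropped = find_dropped(example_paths, example_processed)
print(dropped)
```

With the notebook's variables, a call along the lines of `find_dropped(files, file_id.ravel())` would list the skipped recordings so they can be inspected.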

The extracted features are saved into a csv file for better visualisation.

In [12]:
headerList = 'mfcc_voice1,mfcc_voice2,mfcc_voice3,mfcc_voice4,mfcc_voice5,mfcc_voice6,mfcc_voice7,mfcc_voice8,mfcc_voice9,mfcc_voice10,mfcc_voice11,mfcc_voice12,mfcc_voice13,rms_voice,zcr_voice,flat_voice,mfcc_bg1,mfcc_bg2,mfcc_bg3,mfcc_bg4,mfcc_bg5,mfcc_bg6,mfcc_bg7,mfcc_bg8,mfcc_bg9,mfcc_bg10,mfcc_bg11,mfcc_bg12,mfcc_bg13,rms_bg,zcr_bg,flat_bg'

np.savetxt('dataset_mfcc_rms_zcr_flat.csv', X, delimiter=',',header=headerList, comments = '')
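The column names follow a regular pattern, so the same header could also be generated instead of typed out. The sketch below is one possible way to do this; the variable names are chosen for the example.

```python
# Build the 32 column names: 13 MFCCs plus RMS, ZCR and flatness per segment.
segments = ['voice', 'bg']
header_names = []
for seg in segments:
    header_names += [f'mfcc_{seg}{i}' for i in range(1, 14)]
    header_names += [f'rms_{seg}', f'zcr_{seg}', f'flat_{seg}']

generated_header = ','.join(header_names)
print(len(header_names))
print(generated_header)
```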

In [13]:
feature = pd.read_csv('dataset_mfcc_rms_zcr_flat.csv')

feature['label'] = y
feature.insert(0,'file_id', file_id)
feature.set_index('file_id', inplace=True)

feature

file_id,mfcc_voice1,mfcc_voice2,mfcc_voice3,mfcc_voice4,mfcc_voice5,mfcc_voice6,mfcc_voice7,mfcc_voice8,mfcc_voice9,mfcc_voice10,...,mfcc_bg8,mfcc_bg9,mfcc_bg10,mfcc_bg11,mfcc_bg12,mfcc_bg13,rms_bg,zcr_bg,flat_bg,label
2217.wav,-142.263245,173.113449,-23.517044,37.460068,-15.150518,13.917201,-1.881720,9.590215,-8.548068,11.520725,...,5.469003,-1.257863,9.518785,-14.527369,12.028168,-4.529459,0.127645,0.039841,0.000141,0
0400.wav,-192.750443,163.034164,-27.676954,52.614693,-17.837755,2.103657,13.813667,-11.270192,7.480297,-4.410752,...,-9.319770,4.421349,4.199004,-9.060643,14.125307,-7.535830,0.146406,0.043485,0.000129,0
0366.wav,-139.498154,173.110825,-19.379559,15.026322,3.873517,-2.987137,-1.913729,-2.535430,-6.533541,0.581664,...,1.105968,-9.854543,2.977386,-9.269053,10.686778,-6.429621,0.179078,0.039809,0.000617,1
1078.wav,-128.251419,158.219223,-26.150808,27.838219,-12.987073,7.041806,-13.931302,4.611495,-7.031096,-0.885988,...,4.525082,-9.932795,2.138498,-6.409645,9.112489,-8.362327,0.202194,0.050842,0.001178,1
0372.wav,-264.056976,159.688904,9.983857,30.402243,24.612171,4.881150,13.008461,8.326639,-10.351038,2.310470,...,13.876879,-14.496681,3.934372,17.097912,-14.882359,14.659012,0.219160,0.013904,0.000047,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1713.wav,-238.112946,186.876846,-9.478312,16.544563,15.094830,8.133016,4.176082,-10.214108,2.697186,-5.185079,...,-9.484781,2.221844,-5.322598,-6.382371,3.360484,-4.689069,0.192952,0.045058,0.000053,1
1075.wav,-109.072418,202.586578,-66.948524,59.769085,-41.917896,15.550333,-12.643340,-11.744051,8.915149,-24.430691,...,-11.595716,7.062524,-25.740412,11.605927,-12.562830,3.730646,0.245812,0.050523,0.000161,1
1061.wav,-153.162552,164.293106,-20.666513,41.581753,4.386655,12.441601,3.453288,11.269331,-9.243830,1.476318,...,11.732153,-6.484863,6.391390,-2.444995,7.363931,-11.940302,0.209317,0.034008,0.000472,1
0419.wav,-139.244354,177.909454,-12.169298,13.599029,-9.332035,30.310354,6.266045,6.107516,9.643875,7.326506,...,5.437186,8.939524,5.805763,-12.421236,21.317339,0.520296,0.243927,0.038797,0.000476,0


In [14]:
print('A summary of the dataset:')
feature.info()

A summary of the dataset:
<class 'pandas.core.frame.DataFrame'>
Index: 2496 entries, 2217.wav to 1707.wav
Data columns (total 33 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mfcc_voice1   2496 non-null   float64
 1   mfcc_voice2   2496 non-null   float64
 2   mfcc_voice3   2496 non-null   float64
 3   mfcc_voice4   2496 non-null   float64
 4   mfcc_voice5   2496 non-null   float64
 5   mfcc_voice6   2496 non-null   float64
 6   mfcc_voice7   2496 non-null   float64
 7   mfcc_voice8   2496 non-null   float64
 8   mfcc_voice9   2496 non-null   float64
 9   mfcc_voice10  2496 non-null   float64
 10  mfcc_voice11  2496 non-null   float64
 11  mfcc_voice12  2496 non-null   float64
 12  mfcc_voice13  2496 non-null   float64
 13  rms_voice     2496 non-null   float64
 14  zcr_voice     2496 non-null   float64
 15  flat_voice    2496 non-null   float64
 16  mfcc_bg1      2496 non-null   float64
 17  mfcc_bg2      2496 non-null   float64
 

The following is one of the samples, showing its raw audio and labels, followed by the extracted features.

In [15]:
print('Raw data:')
display(ipd.Audio(files[100]))
print(MLENDLS_df.loc[files[100].split('/')[-1]])
print(f'\nExtracted feature:\n{feature.iloc[100,:]}\n')

Raw data:


area           kensington
spot             dinosaur
in_out             indoor
Participant          S191
Name: 1496.wav, dtype: object

Extracted feature:
mfcc_voice1    -136.795441
mfcc_voice2     177.262329
mfcc_voice3     -27.063068
mfcc_voice4      18.417841
mfcc_voice5     -13.696946
mfcc_voice6       3.122657
mfcc_voice7       0.664295
mfcc_voice8       2.906834
mfcc_voice9     -12.023077
mfcc_voice10      9.911384
mfcc_voice11      0.949110
mfcc_voice12      5.787125
mfcc_voice13     -0.938689
rms_voice         0.137610
zcr_voice         0.046025
flat_voice        0.005246
mfcc_bg1        -78.733383
mfcc_bg2        180.009476
mfcc_bg3        -37.808296
mfcc_bg4         13.002582
mfcc_bg5        -13.505101
mfcc_bg6          4.501716
mfcc_bg7         -8.189581
mfcc_bg8          2.296819
mfcc_bg9        -15.913625
mfcc_bg10         9.272951
mfcc_bg11        -3.374328
mfcc_bg12         4.179346
mfcc_bg13        -5.449243
rms_bg            0.207878
zcr_bg            0.052693
flat_bg  

# 8 Results

In [16]:
from sklearn import svm
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

A 5-fold cross-validated grid-search is carried out to find the optimal set of hyperparameters.

The search is first run with 500 samples.

Before training, it is important to ensure the dataset is balanced.

In [17]:
sample_size = 500

print(' The number of indoor recordings is ', np.count_nonzero(y[0:sample_size]))
print(' The number of outdoor recordings is ', sample_size - np.count_nonzero(y[0:sample_size]))

 The number of indoor recordings is  237
 The number of outdoor recordings is  263


In [18]:
pipeline = Pipeline([('scaler', StandardScaler()),
                       ('svc', svm.SVC())])

# import Grid Search class
from sklearn.model_selection import GridSearchCV

# make lists of different parameters to check
parameters = {
    'svc__kernel': ['poly','rbf'],
    'svc__C': [0.1,1,10,100],
    'svc__gamma': [1,0.1,0.01,0.001],
        }

# initialize
grid_pipeline = GridSearchCV(pipeline, parameters, cv=5)

# fit
grid_pipeline.fit(X[0:500], y[0:500])

print('Best hyperparameter setting: {0}.'.format(grid_pipeline.best_estimator_))
print('Average accuracy across folds of best hyperparameter setting: {0}.'.format(grid_pipeline.best_score_))
# Note: X and y include the samples used for fitting, so this is not a held-out test score
print('Test dataset accuracy of best hyperparameter setting: {0}.'.format(grid_pipeline.score(X, y)))

Best hyperparameter setting: Pipeline(steps=[('scaler', StandardScaler()), ('svc', SVC(C=100, gamma=0.001))]).
Average accuracy across folds of best hyperparameter setting: 0.7799999999999999.
Test dataset accuracy of best hyperparameter setting: 0.7832532051282052.
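To get a feel for the cost of this search, the number of candidate settings and model fits can be counted directly from the grid. The sketch below only enumerates the combinations with `itertools.product`; it does not train anything, and the variable names are chosen for the example.

```python
from itertools import product

# Same grid as in the search above.
param_grid = {
    'svc__kernel': ['poly', 'rbf'],
    'svc__C': [0.1, 1, 10, 100],
    'svc__gamma': [1, 0.1, 0.01, 0.001],
}

# Every combination of one value per hyperparameter.
combos = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]

n_folds = 5
n_fits = len(combos) * n_folds   # one fit per combination per fold
print(len(combos), n_fits)
```

On top of these fits, `GridSearchCV` refits the best setting once on all supplied samples, since `refit` defaults to `True`.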


The grid-search process is repeated with 1000 samples.

In [19]:
sample_size = 1000

print(' The number of indoor recordings is ', np.count_nonzero(y[0:sample_size]))
print(' The number of outdoor recordings is ', sample_size - np.count_nonzero(y[0:sample_size]))

 The number of indoor recordings is  489
 The number of outdoor recordings is  511


In [20]:
pipeline = Pipeline([('scaler', StandardScaler()),
                       ('svc', svm.SVC())])

# import Grid Search class
from sklearn.model_selection import GridSearchCV

# make lists of different parameters to check
parameters = {
    'svc__kernel': ['poly','rbf'],
    'svc__C': [0.1,1,10,100],
    'svc__gamma': [1,0.1,0.01,0.001],
        }

# initialize
grid_pipeline = GridSearchCV(pipeline, parameters, cv=5)

# fit
grid_pipeline.fit(X[0:1000], y[0:1000])

print('Best hyperparameter setting: {0}.'.format(grid_pipeline.best_estimator_))
print('Average accuracy across folds of best hyperparameter setting: {0}.'.format(grid_pipeline.best_score_))
# Note: X and y include the samples used for fitting, so this is not a held-out test score
print('Test dataset accuracy of best hyperparameter setting: {0}.'.format(grid_pipeline.score(X, y)))

Best hyperparameter setting: Pipeline(steps=[('scaler', StandardScaler()), ('svc', SVC(C=10, gamma=0.01))]).
Average accuracy across folds of best hyperparameter setting: 0.8030000000000002.
Test dataset accuracy of best hyperparameter setting: 0.8197115384615384.


Finally, 1500 samples are used.

In [21]:
sample_size = 1500

print(' The number of indoor recordings is ', np.count_nonzero(y[0:sample_size]))
print(' The number of outdoor recordings is ', sample_size - np.count_nonzero(y[0:sample_size]))

 The number of indoor recordings is  733
 The number of outdoor recordings is  767


In [22]:
pipeline = Pipeline([('scaler', StandardScaler()),
                       ('svc', svm.SVC())])

# import Grid Search class
from sklearn.model_selection import GridSearchCV

# make lists of different parameters to check
parameters = {
    'svc__kernel': ['poly','rbf'],
    'svc__C': [0.1,1,10,100],
    'svc__gamma': [1,0.1,0.01,0.001],
        }

# initialize
grid_pipeline = GridSearchCV(pipeline, parameters, cv=5)

# fit
grid_pipeline.fit(X[0:1500], y[0:1500])

print('Best hyperparameter setting: {0}.'.format(grid_pipeline.best_estimator_))
print('Average accuracy across folds of best hyperparameter setting: {0}.'.format(grid_pipeline.best_score_))
# Note: X and y include the samples used for fitting, so this is not a held-out test score
print('Test dataset accuracy of best hyperparameter setting: {0}.'.format(grid_pipeline.score(X, y)))

Best hyperparameter setting: Pipeline(steps=[('scaler', StandardScaler()), ('svc', SVC(C=100, gamma=0.001))]).
Average accuracy across folds of best hyperparameter setting: 0.7873333333333333.
Test dataset accuracy of best hyperparameter setting: 0.8108974358974359.


The cross-validated accuracy does not improve when 1500 samples are used instead of 1000, so 1000 samples are taken to be enough to train the model.
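The comparison across sample sizes can be written down compactly. The sketch below simply hard-codes the rounded cross-validated accuracies reported in the three searches above and picks the largest; the dictionary and variable names are assumptions for the example.

```python
# Rounded mean cross-validated accuracies reported above, keyed by sample size.
cv_accuracy = {500: 0.780, 1000: 0.803, 1500: 0.7873}

best_size = max(cv_accuracy, key=cv_accuracy.get)
gain_500_to_1000 = cv_accuracy[1000] - cv_accuracy[500]
gain_1000_to_1500 = cv_accuracy[1500] - cv_accuracy[1000]

print(best_size)
print(round(gain_500_to_1000, 4), round(gain_1000_to_1500, 4))
```

The differences are only a couple of percentage points, and the spread across folds is not reported here, so this choice is best read as a heuristic rather than a firm conclusion.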

Also, the optimal set of hyperparameters found with 1000 samples is:
kernel='rbf', C=10, gamma=0.01

Therefore, an SVC model is trained with the above hyperparameters on 1000 samples and the training accuracy is computed.

Note that the attributes are standardised before being passed into the SVC model.

The model is then validated with the remaining 1496 samples and the validation accuracy is also computed.

In [23]:
X_train = X[0:1000]
y_train = y[0:1000]

X_val = X[1000:len(X)]
y_val = y[1000:len(X)]

mean = X_train.mean(0)
sd =  X_train.std(0)

X_train = (X_train-mean)/sd
X_val  = (X_val-mean)/sd

model = svm.SVC(C=10, gamma=0.01, kernel='rbf')
model.fit(X_train,y_train)

y_train_pred = model.predict(X_train)
y_val_pred = model.predict(X_val)

print('Training Accuracy =', np.mean(y_train_pred == y_train))
print('Validation Accuracy =', np.mean(y_val_pred == y_val))

Training Accuracy = 0.883
Validation Accuracy = 0.7774064171122995


The validation accuracy obtained (about 77.7%) is slightly lower than the average accuracy during the cross-validated grid-search (about 80.3%).<br>
A possible reason is that much more unseen data is used in validation than in each fold of the grid search.<br>
In the above training process, 1000 samples are used to train the model and the remaining 1496 samples are used to validate it.<br>
During grid-search with 1000 samples, by contrast, the model is trained with 80% of the data (i.e. 800 samples) and validated with 20% of the data (i.e. 200 samples) in each fold.
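The split sizes quoted above can be checked with a few lines of arithmetic. The helper below is a hypothetical sketch that mimics how a k-fold splitter distributes samples (earlier folds receive the remainder); it ignores stratification, which changes which samples land in a fold but not the approximate sizes.

```python
def kfold_sizes(n_samples, n_folds):
    """Return (train_size, validation_size) for each fold."""
    base, extra = divmod(n_samples, n_folds)
    val_sizes = [base + 1 if i < extra else base for i in range(n_folds)]
    return [(n_samples - v, v) for v in val_sizes]

grid_search_folds = kfold_sizes(1000, 5)
print(grid_search_folds)

# Final hold-out split used after the grid search.
n_total, n_train = 2496, 1000
n_val = n_total - n_train
print(n_train, n_val)
```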

It is noticeable that there is a gap of about 10 percentage points between the training accuracy (88.3%) and the validation accuracy (77.7%).<br>
This suggests the model may have overfitted the training data.<br>
However, the result from grid-search shows that the quality of the model does not improve when more data is used for training.<br>
More useful features may be needed to improve the model.

Overall, for a roughly balanced two-class problem, an accuracy over 75% is a fairly good performance.

# 9 Conclusions

With 13 MFCCs, the RMS value, the zero-crossing rate and the spectral flatness, the model is able to achieve over 75% accuracy, implying that the extracted features, or at least some of them, are meaningful to the problem.

However, the SVC model is difficult to interpret: with the rbf kernel, feature importances are not directly available, so some of the attributes may be less relevant than others.

The experiment is based on the assumption that all participants say "This is London." in the first 3 seconds and that each audio file is at least 6 seconds long. In practice, some participants may not have followed the instructions strictly.
To improve the quality of input, one of the possibilities is to inspect the audio files and filter out the ones not meeting the requirements.
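Part of such a filter could be automated. The sketch below checks only the duration requirement, using the standard-library `wave` module on uncompressed PCM WAV files; it does not check whether the phrase is actually spoken. The helper names and the generated silent test files are assumptions for this example.

```python
import os
import tempfile
import wave

def wav_duration_seconds(path):
    """Duration of a PCM WAV file: number of frames divided by frame rate."""
    with wave.open(path, 'rb') as wf:
        return wf.getnframes() / wf.getframerate()

def keep_long_enough(paths, min_seconds=6.0):
    """Keep only the files that last at least min_seconds."""
    return [p for p in paths if wav_duration_seconds(p) >= min_seconds]

def write_silence(path, seconds, rate=8000):
    """Write a mono 16-bit silent WAV file (used only to demo the filter)."""
    with wave.open(path, 'wb') as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(rate)
        wf.writeframes(b'\x00\x00' * int(seconds * rate))

# Demo on two generated files: one too short, one long enough.
tmp_dir = tempfile.mkdtemp()
short_path = os.path.join(tmp_dir, 'short.wav')
long_path = os.path.join(tmp_dir, 'long.wav')
write_silence(short_path, 2)
write_silence(long_path, 7)

kept = [os.path.basename(p) for p in keep_long_enough([short_path, long_path])]
print(kept)
```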