# <center>**Principles of Machine Learning**</center>
## <center>**Coursework 2a (basic implementation)**</center>

**Table of Contents**

Declaration

1. Section 1: Author

2. Section 2: Problem Formulation

3. Section 3: Machine Learning Pipeline
  * Section 3.1: Loading the Data
  * Section 3.2: Data Preparation

4. Section 4: Transformation Stage 
  * Section 4.1: Feature Extraction

5. Section 5: Modelling

6. Section 6: Methodology
  * Section 6.1: Model Fitting, Training and Validation
  * Section 6.2: Testing
    * Section 6.2.1: Class 1
    * Section 6.2.2: Class 0

7. Section 7: Dataset

8. Section 8: Results

9. Section 9: Conclusions
 
**Declaration:** Some of the code used in this assignment has been adapted and customized from www.docs.python.org/, www.matplotlib.org/stable/, www.pandas.pydata.org/docs, www.stackoverflow.com/questions/, www.geeksforgeeks.org/, www.kite.com/python/, www.codegrepper.com/, www.stats.stackexchange.com/questions/, www.machinelearningmind.com/, www.kaggle.com/, www.scikit-learn.org, www.towardsdatascience.com/, www.github.com/, www.librosa.org/blog/2019/07/17/resample-on-load/#resample-on-load/, and Principles of Machine Learning Lab, Tutorial and Lecture Notes.

### **Section 1: Author**<br> 
**Student Name**: Kweku Esuon Acquaye<br> 

### **Section 2: Problem Formulation**<br> 
This report uses data science methods to analyse audio files from the MLEnd Hums and Whistles dataset and to build a machine learning pipeline that takes a Potter or StarWars audio segment as input and predicts its song label (either Potter or StarWars). It constitutes Coursework 2a, in fulfilment of the requirements of the Principles of Machine Learning module.<br> 

### **Section 3: Machine Learning Pipeline**<br> 
The following steps constitute the machine learning pipeline built to achieve the purpose of this task:<br> 

#### **Section 3.1: Loading the Data**<br> 
The following steps import the necessary dependencies and mount the drive where the original audio data files are stored (i.e. make the drive directly available to Colab).

In [None]:
# Importing libraries
from google.colab import drive

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import os, sys, re, pickle, glob
import urllib.request
import zipfile

import IPython.display as ipd
from tqdm import tqdm
import librosa

# Mounting Google Drive 
drive.mount('/content/drive')

Mounted at /content/drive


A function to download files is defined as follows:

In [None]:
# Creating function
def download_url(url, save_path):
    with urllib.request.urlopen(url) as dl_file:
        with open(save_path, 'wb') as out_file:
            out_file.write(dl_file.read())

print("Zip download function created.")

Zip download function created.
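The function above reads the whole response into memory before writing it to disk. For larger archives, a streamed variant may be preferable. The following is a minimal sketch and not part of the original pipeline: the name `download_url_streamed` and the `chunk_size` parameter are hypothetical, and only `urllib.request` and `shutil` from the standard library are used.

```python
import shutil
import urllib.request


def download_url_streamed(url, save_path, chunk_size=1024 * 1024):
    """Copy the response body to save_path in chunks (hypothetical helper)."""
    with urllib.request.urlopen(url) as response, open(save_path, "wb") as out_file:
        # copyfileobj reads and writes at most chunk_size bytes at a time
        shutil.copyfileobj(response, out_file, chunk_size)
    return save_path
```

It takes the same `url` and `save_path` arguments as `download_url`, so it could be swapped in without changing the download cells.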


The necessary files are downloaded as follows (variable names are reassigned):

In [None]:
# Downloading Potter1 files
url  = "https://collect.qmul.ac.uk/down?t=6PF2H2T4GP4I1LPK/45OGPQAA1S6E5AEAA2RLEU0"
save_path = '/content/drive/MyDrive/Data2/MLEndHW2/potter1_sample.zip'
download_url(url, save_path)
print("Potter1 zip file downloaded.")

Potter1 zip file downloaded.


In [None]:
# Downloading Potter2 files
url  = "https://collect.qmul.ac.uk/down?t=4H6293T0GAQLO4RG/55TMAI1BTPMG7I1A71923EO"
save_path = '/content/drive/MyDrive/Data2/MLEndHW2/potter2_sample.zip'
download_url(url, save_path)
print("Potter2 zip file downloaded.")

Potter2 zip file downloaded.


In [None]:
# Downloading StarWars1 files
url  = "https://collect.qmul.ac.uk/down?t=45PGP4DEJH9JLV8U/4H7ITD5VEBGREV4UGNI8SAO"
save_path = '/content/drive/MyDrive/Data2/MLEndHW2/starwars1_sample.zip'
download_url(url, save_path)
print("StarWars1 zip file downloaded.")

StarWars1 zip file downloaded.


In [None]:
# Downloading StarWars2 files
url  = "https://collect.qmul.ac.uk/down?t=652DPSGDVKPDJ7NV/6PF3DFT4AN3AOPS5JJS9KFG"
save_path = '/content/drive/MyDrive/Data2/MLEndHW2/starwars2_sample.zip'
download_url(url, save_path)
print("StarWars2 zip file downloaded.")

StarWars2 zip file downloaded.


The following cell checks for the presence of the downloaded files:

In [None]:
# Checking success of downloads
path = '/content/drive/MyDrive/Data2/MLEndHW2'
os.listdir(path)

['potter1_sample.zip',
 'potter2_sample.zip',
 'starwars1_sample.zip',
 'starwars2_sample.zip']

Download successful.<br> 
The next few steps extract the audio files from the zip files:

In [None]:
# Extracting Potter1 files
directory_to_extract_to = '/content/drive/MyDrive/Data2/MLEndHW2/potter_sample1/'
zip_path = '/content/drive/MyDrive/Data2/MLEndHW2/potter1_sample.zip'
with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall(directory_to_extract_to)

# Counting files
sample_path = '/content/drive/MyDrive/Data2/MLEndHW2/potter_sample1/*.wav'
files1 = glob.glob(sample_path)
print("There are ", len(files1), " audio files in Potter1 folder.")

There are  206  audio files in Potter1 folder.


In [None]:
# Extracting Potter2 files
directory_to_extract_to = '/content/drive/MyDrive/Data2/MLEndHW2/potter_sample2/'
zip_path = '/content/drive/MyDrive/Data2/MLEndHW2/potter2_sample.zip'
with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall(directory_to_extract_to)

# Counting files
sample_path = '/content/drive/MyDrive/Data2/MLEndHW2/potter_sample2/*.wav'
files2 = glob.glob(sample_path)
print("There are ", len(files2), " audio files in Potter2 folder.")

There are  205  audio files in Potter2 folder.


In [None]:
# Extracting StarWars1 files
directory_to_extract_to = '/content/drive/MyDrive/Data2/MLEndHW2/starwars_sample1/'
zip_path = '/content/drive/MyDrive/Data2/MLEndHW2/starwars1_sample.zip'
with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall(directory_to_extract_to)

# Counting files
sample_path = '/content/drive/MyDrive/Data2/MLEndHW2/starwars_sample1/*.wav'
files3 = glob.glob(sample_path)
print("There are ", len(files3), " audio files in StarWars1 folder.")

There are  208  audio files in StarWars1 folder.


In [None]:
# Extracting StarWars2 files
directory_to_extract_to = '/content/drive/MyDrive/Data2/MLEndHW2/starwars_sample2/'
zip_path = '/content/drive/MyDrive/Data2/MLEndHW2/starwars2_sample.zip'
with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall(directory_to_extract_to)

# Counting files
sample_path = '/content/drive/MyDrive/Data2/MLEndHW2/starwars_sample2/*.wav'
files4 = glob.glob(sample_path)
print("There are ", len(files4), " audio files in StarWars2 folder.")

There are  205  audio files in StarWars2 folder.
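The four extraction cells above repeat the same two operations: extract an archive, then count the `.wav` files it produced. The following is a minimal sketch of how they could be folded into one helper. The function name `extract_and_count` and the `pattern` parameter are hypothetical, and only the standard library is used.

```python
import glob
import os
import zipfile


def extract_and_count(zip_path, target_dir, pattern="*.wav"):
    """Extract zip_path into target_dir and count the extracted files matching pattern."""
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        zip_ref.extractall(target_dir)
    # Only files directly inside target_dir are counted, as in the cells above
    return len(glob.glob(os.path.join(target_dir, pattern)))
```

Calling it once per archive, with the zip and folder paths used in this notebook, would be expected to reproduce the four counts reported above.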


#### **Section 3.2: Data Preparation**<br> 

The downloaded zip files are deleted from drive after extraction to conserve storage space.<br> 

Due to the disparate acoustic properties of hums and whistles, it is decided to use hums only to obtain better model accuracy in this task. It is also decided to use sample 1 audio files (i.e. Potter1 and StarWars1) as training and validation data, with sample 2 (i.e. Potter2 and StarWars2) as test data.<br> 

The extracted files from each sample 1 folder are thus manually separated into a combined Potter and StarWars hum folder labelled "potter_and_starwars1" and a combined Potter and StarWars whistle folder which is not used in this task. File names in the combined hum folder are manually cleaned to obtain uniform/identical filename conventions for further analysis.<br> 

65 Potter hum files and 65 StarWars hum files are transferred from the sample 2 folders to the sample 1 folder "potter_and_starwars1" to increase the number of training and validation samples, leaving 69 StarWars and 70 Potter files in their respective folders for testing.
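The manual separation of hums and whistles described above could, in principle, be scripted. The following is a minimal sketch, assuming the cleaned naming convention `<participant>_<interpretation>_<number>_<song>.wav`. The helper name `select_interpretation` is hypothetical and was not used to produce the folders listed below.

```python
import os


def select_interpretation(paths, interpretation="hum"):
    """Return the paths whose file name has the given interpretation type as its second field."""
    selected = []
    for path in paths:
        fields = os.path.basename(path).split("_")
        # fields[1] is the interpretation type, e.g. 'hum' or 'whistle'
        if len(fields) >= 4 and fields[1].lower() == interpretation:
            selected.append(path)
    return selected
```

File names that do not follow the convention are skipped rather than guessed at, so they would still need manual review.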

In [None]:
# Checking folders
path = '/content/drive/MyDrive/Data2/MLEndHW2'
os.listdir(path)

['potter_sample2',
 'starwars_sample1',
 'starwars_sample2',
 'starwars_whistle_sample1',
 'starwars_hum_sample1',
 'potter_sample1',
 'potter_whistle_sample1',
 'potter_and_starwars1']

The following step loads and counts the training and validation data:

In [None]:
# Loading training and validation data
sample_path = '/content/drive/MyDrive/Data2/MLEndHW2/potter_and_starwars1/*.wav'
files = glob.glob(sample_path)
len(files)

345

The next step uses Python lists to create a table-like array with information extracted from the file names above, prior to creating a pandas dataframe:

In [None]:
# Extracting info from filenames
# Expected pattern: <participant>_<interpretation>_<number>_<song>.wav
data_table = []

for file in files:
  file_name = file.split('/')[-1]
  try:
    parts = file_name.split('_')
    participant_ID = parts[0]
    interpretation_type = parts[1]
    interpretation_number = parts[2]
    song = parts[3].split('.')[0]
    data_table.append([file_name,participant_ID,interpretation_type,interpretation_number, song])
  except IndexError:
    # Report any file name that does not follow the expected pattern
    print(file_name)

data_table

[['S39_hum_2_StarWars.wav', 'S39', 'hum', '2', 'StarWars'],
 ['S43_hum_2_StarWars.wav', 'S43', 'hum', '2', 'StarWars'],
 ['S44_hum_2_StarWars.wav', 'S44', 'hum', '2', 'StarWars'],
 ['S45_hum_2_StarWars.wav', 'S45', 'hum', '2', 'StarWars'],
 ['S50_hum_2_StarWars.wav', 'S50', 'hum', '2', 'StarWars'],
 ['S52_hum_2_StarWars.wav', 'S52', 'hum', '2', 'StarWars'],
 ['S54_hum_2_StarWars.wav', 'S54', 'hum', '2', 'StarWars'],
 ['S56_hum_2_StarWars.wav', 'S56', 'hum', '2', 'StarWars'],
 ['S57_hum_2_StarWars.wav', 'S57', 'hum', '2', 'StarWars'],
 ['S64_hum_2_StarWars.wav', 'S64', 'hum', '2', 'StarWars'],
 ['S66_hum_2_StarWars.wav', 'S66', 'hum', '2', 'StarWars'],
 ['S67_hum_2_StarWars.wav', 'S67', 'hum', '2', 'StarWars'],
 ['S68_hum_2_StarWars.wav', 'S68', 'hum', '2', 'StarWars'],
 ['S72_hum_2_StarWars.wav', 'S72', 'hum', '2', 'StarWars'],
 ['S75_hum_2_StarWars.wav', 'S75', 'hum', '2', 'StarWars'],
 ['S76_hum_2_StarWars.wav', 'S76', 'hum', '2', 'StarWars'],
 ['S77_hum_2_StarWars.wav', 'S77', 'hum'

The following step creates a pandas dataframe of all 345 data files for subsequent analysis using the above extracted information:

In [None]:
# Creating dataframe of training files
data_df = pd.DataFrame(data_table,columns=['file_id','participant','interpretation','number','song']).set_index('file_id') 
data_df

| file_id | participant | interpretation | number | song |
| --- | --- | --- | --- | --- |
| S39_hum_2_StarWars.wav | S39 | hum | 2 | StarWars |
| S43_hum_2_StarWars.wav | S43 | hum | 2 | StarWars |
| S44_hum_2_StarWars.wav | S44 | hum | 2 | StarWars |
| S45_hum_2_StarWars.wav | S45 | hum | 2 | StarWars |
| S50_hum_2_StarWars.wav | S50 | hum | 2 | StarWars |
| ... | ... | ... | ... | ... |
| S164_hum_2_StarWars.wav | S164 | hum | 2 | StarWars |
| S166_hum_1_StarWars.wav | S166 | hum | 1 | StarWars |
| S166_hum_2_StarWars.wav | S166 | hum | 2 | StarWars |
| S130_hum_2_StarWars.wav | S130 | hum | 2 | StarWars |


### **Section 4: Transformation Stage**<br> 
Of the many properties of sound, four are extracted from each audio segment and used as features to build the model: power, pitch mean, pitch standard deviation, and fraction of voiced region.<br> 

#### **Section 4.1: Feature Extraction**
The next step defines a function for determining the pitch of an audio segment:

In [None]:
def getPitch(x,fs,winLen=0.02):
  p = winLen*fs
  frame_length = int(2**int(p-1).bit_length())
  hop_length = frame_length//2
  f0, voiced_flag, voiced_probs = librosa.pyin(y=x, fmin=80, fmax=450, sr=fs, frame_length=frame_length,hop_length=hop_length)
  return f0,voiced_flag

print("Pitch function created.")

Pitch function created.
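The frame length in `getPitch` is obtained by rounding the window length in samples up to a power of two, and the hop length is half of that. The following sketch reproduces only this arithmetic, without calling librosa. The helper name `frame_and_hop` is hypothetical, and the two sampling rates are illustrative assumptions, since the native rate of the recordings is not stated in this report.

```python
def frame_and_hop(fs, win_len=0.02):
    """Reproduce the frame and hop length arithmetic used in getPitch."""
    p = win_len * fs                                  # window length in samples
    frame_length = int(2 ** int(p - 1).bit_length())  # power of two at or above p (whole-number p)
    hop_length = frame_length // 2                     # 50% overlap between frames
    return frame_length, hop_length


print(frame_and_hop(44100))  # (1024, 512)
print(frame_and_hop(22050))  # (512, 256)
```

At 44100 Hz, for example, a 20 ms window is 882 samples, which is rounded up to a 1024-sample frame.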


The following step defines a function for obtaining the 4 features:

In [None]:
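def getXy(files,labels_file, scale_audio=False, onlySingleDigit=False):
  # Note: onlySingleDigit is kept for compatibility with the calls below but is not used
  X,y =[],[]
  for file in tqdm(files):
    fileID = file.split('/')[-1]
    # Label: True if the song is StarWars, False otherwise
    yi = labels_file.loc[fileID]['song']=='StarWars'

    # sr=None keeps the file's native sampling rate
    # (librosa.load would otherwise resample to its default of 22050 Hz)
    x, fs = librosa.load(file,sr=None)
    if scale_audio: x = x/np.max(np.abs(x))  # scale the peak amplitude to 1
    f0, voiced_flag = getPitch(x,fs,winLen=0.02)

    # Feature 1: mean signal power
    power = np.sum(x**2)/len(x)
    # Features 2 and 3: pitch statistics over voiced frames (0 if no frame is voiced)
    pitch_mean = np.nanmean(f0) if np.mean(np.isnan(f0))<1 else 0
    pitch_std  = np.nanstd(f0) if np.mean(np.isnan(f0))<1 else 0
    # Feature 4: fraction of frames flagged as voiced
    voiced_fr = np.mean(voiced_flag)

    xi = [power,pitch_mean,pitch_std,voiced_fr]
    X.append(xi)
    y.append(yi)

  return np.array(X),np.array(y)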

print("Feature extraction function created.")

Feature extraction function created.
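To make the four features concrete, the sketch below computes them on a tiny hand-made example instead of a real recording. It mirrors the expressions in `getXy` but takes the pitch track as an input, so librosa is not needed. The helper name `summarise_features` and the toy arrays are assumptions for illustration only.

```python
import numpy as np


def summarise_features(x, f0, voiced_flag):
    """Return [power, pitch mean, pitch std, voiced fraction] for one audio segment."""
    power = np.sum(x ** 2) / len(x)
    has_pitch = not np.all(np.isnan(f0))              # False when no frame is voiced
    pitch_mean = np.nanmean(f0) if has_pitch else 0   # NaN marks unvoiced frames
    pitch_std = np.nanstd(f0) if has_pitch else 0
    voiced_fr = np.mean(voiced_flag)
    return [power, pitch_mean, pitch_std, voiced_fr]


# Toy segment: 4 samples and 4 pitch frames, 2 of which are voiced
x = np.array([0.5, -0.5, 0.5, -0.5])
f0 = np.array([np.nan, 100.0, 200.0, np.nan])
voiced_flag = np.array([False, True, True, False])
print(summarise_features(x, f0, voiced_flag))
```

For this toy input the power is 0.25, the pitch mean is 150 Hz, the pitch standard deviation is 50 Hz, and the voiced fraction is 0.5.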


### **Section 5: Modelling**<br> 

Classification is set such that StarWars evaluates to 1 (True) and Potter to 0 (False) during definition of the above function.<br> 

A NumPy predictor array `X` and a binary label vector `y` are obtained next from the audio data, and their shapes are output as follows:

In [None]:
# Creating predictor array and label vector
X,y = getXy(files, labels_file=data_df, scale_audio=True, onlySingleDigit=True)

100%|██████████| 345/345 [16:27<00:00,  2.86s/it]


In [None]:
# Outputting shapes and arrays
print('The shape of X is', X.shape) 
print('The shape of y is', y.shape)
print('The features matrix is', X)
print('The labels vector is', y)

The shape of X is (345, 4)
The shape of y is (345,)
The features matrix is [[1.89902990e-02 1.66573250e+02 2.51369453e+01 7.02312139e-01]
 [2.13878114e-02 2.99355612e+02 5.85071440e+01 6.69505963e-01]
 [2.75914705e-02 2.31122662e+02 6.30361534e+01 7.84673844e-01]
 ...
 [3.11158252e-02 2.55290575e+02 6.29238360e+01 7.96138996e-01]
 [5.63672664e-02 2.19312082e+02 5.18155427e+01 8.01391036e-01]
 [7.03785661e-03 1.44719474e+02 4.76234711e+01 1.01314772e-01]]
The labels vector is [ True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True False  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True False False  True  True
  True  T

To check for class imbalance, the following output is generated:

In [None]:
# Checking for class imbalance
print(' The number of StarWars recordings in the training dataset is ', np.count_nonzero(y))
print(' The number of Potter recordings in the training dataset is ', y.size - np.count_nonzero(y))

 The number of StarWars recordings in the training dataset is  171
 The number of Potter recordings in the training dataset is  174


There is no class imbalance.<br> 

### **Section 6: Methodology**<br> 

#### **Section 6.1: Model Fitting, Training and Validation**<br> 

In the next few steps, the data are split into training (75%) and validation (25%) sets, the features are normalised using the mean and standard deviation of the training set, and a Support Vector Machine (SVM) classifier is fitted:

In [None]:
from sklearn import svm
from sklearn.model_selection import train_test_split

# Splitting data
X_train, X_val, y_train, y_val = train_test_split(X,y,test_size=0.25)
X_train.shape, X_val.shape, y_train.shape, y_val.shape

((258, 4), (87, 4), (258,), (87,))

In [None]:
# Normalising data
mean = X_train.mean(0)
sd =  X_train.std(0)

X_train = (X_train-mean)/sd
X_val  = (X_val-mean)/sd

print("Training data normalised.")

Training data normalised.


In [None]:
# Building model
model  = svm.SVC(C=1,gamma=2)
model.fit(X_train,y_train)

yt_p = model.predict(X_train)
yv_p = model.predict(X_val)

print('Training Accuracy is', np.mean(yt_p==y_train))
print('Validation Accuracy is', np.mean(yv_p==y_val))
print('The support vectors are', model.support_vectors_.shape)

Training Accuracy is 0.813953488372093
Validation Accuracy is 0.5977011494252874
The support vectors are (237, 4)
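For reference, `svm.SVC` uses a radial basis function (RBF) kernel by default, so with `gamma=2` the fitted classifier has the standard form below. The notation is the usual textbook one and is not defined elsewhere in this report: $\mathbf{x}_i$ are the support vectors, $y_i \in \{-1, +1\}$ their labels, $\alpha_i$ the dual coefficients and $b$ the intercept.

$$K(\mathbf{x}, \mathbf{x}') = \exp\left(-\gamma \lVert \mathbf{x} - \mathbf{x}' \rVert^2\right), \qquad \gamma = 2$$

$$\hat{y}(\mathbf{x}) = \operatorname{sign}\left(\sum_{i \in \mathrm{SV}} \alpha_i \, y_i \, K(\mathbf{x}_i, \mathbf{x}) + b\right), \qquad 0 \le \alpha_i \le C, \quad C = 1$$

A larger $\gamma$ makes the kernel narrower, so each support vector influences only its immediate neighbourhood in feature space.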


#### **Section 6.2: Testing**<br> 
The pipeline and model built above are now tested, first on the sample 2 StarWars hum recordings to obtain a test accuracy for True labels (1), and then on the sample 2 Potter hum recordings to obtain a test accuracy for False labels (0):<br> 

#### **Section 6.2.1: Class 1**<br>
Testing for true labels using StarWars files from sample 2:

In [None]:
# Loading StarWars test data
sample_path = '/content/drive/MyDrive/Data2/MLEndHW2/starwars_sample2/*.wav'
files = glob.glob(sample_path)
len(files)

69

In [None]:
# Outputting file names
for file in files:
  print(file.split('/')[-1])

S170_hum_2_StarWars.wav
S171_hum_2_StarWars.wav
S172_hum_2_StarWars.wav
S173_hum_2_StarWars.wav
S174_hum_2_StarWars.wav
S175_hum_2_StarWars.wav
S176_hum_2_StarWars.wav
S179_hum_2_StarWars.wav
S182_hum_2_StarWars.wav
S184_hum_2_Starwars.wav
S187_hum_2_StarWars.wav
S188_hum_2_StarWars.wav
S190_hum_2_StarWars.wav
S193_hum_2_StarWars.wav
S195_hum_2_StarWars.wav
S197_hum_2_StarWars.wav
S198_hum_2_StarWars.wav
S200_hum_2_StarWars.wav
S203_hum_2_StarWars.wav
S204_hum_2_StarWars.wav
S207_hum_2_StarWars.wav
S208_hum_2_StarWars.wav
S209_hum_2_StarWars.wav
S213_hum_2_StarWars.wav
S214_hum_2_StarWars.wav
S215_hum_2_StarWars.wav
S217_hum_2_StarWars.wav
S218_hum_2_StarWars.wav
S221_hum_1_StartWars.wav
S221_hum_2_StarWars.wav
S222_hum_2_StarWars.wav
S169_hum_1_StarWars.wav
S169_hum_2_StarWars.wav
S177_hum_1_StarWars.wav
S177_hum_2_StarWars.wav
S178_hum_2_StarWars.wav
S181_hum_1_StarWars.wav
S181_hum_2_StarWars.wav
S185_hum_1_StarWars.wav
S185_hum_2_StarWars.wav
S186_hum_1_StarWars.wav
S186_hum_2_Star
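The listing above includes the spellings `Starwars` and `StartWars`. Since `getXy` compares the song field with the exact string `'StarWars'`, files with these spellings would be labelled `False`. A minimal sketch of one possible normalisation step follows. The helper name `canonical_song` and its matching rules are assumptions for illustration and are not part of the pipeline that produced the results in this report.

```python
def canonical_song(raw_label):
    """Map spelling variants of a song label to 'StarWars' or 'Potter' (hypothetical helper)."""
    label = raw_label.strip().lower()
    if label.startswith("star"):      # covers 'StarWars', 'Starwars' and 'StartWars'
        return "StarWars"
    if label.startswith("potter"):
        return "Potter"
    # Fail loudly rather than silently assigning a class
    raise ValueError(f"Unrecognised song label: {raw_label!r}")
```

Applying such a step to the `song` field before building the dataframe would keep the labels consistent with the comparison in `getXy`.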

In [None]:
# Extracting info from filenames
# Expected pattern: <participant>_<interpretation>_<number>_<song>.wav
test1_data_table = []

for file in files:
  file_name = file.split('/')[-1]
  try:
    parts = file_name.split('_')
    participant_ID = parts[0]
    interpretation_type = parts[1]
    interpretation_number = parts[2]
    song = parts[3].split('.')[0]
    test1_data_table.append([file_name,participant_ID,interpretation_type,interpretation_number, song])
  except IndexError:
    # Report any file name that does not follow the expected pattern
    print(file_name)

test1_data_table

[['S170_hum_2_StarWars.wav', 'S170', 'hum', '2', 'StarWars'],
 ['S171_hum_2_StarWars.wav', 'S171', 'hum', '2', 'StarWars'],
 ['S172_hum_2_StarWars.wav', 'S172', 'hum', '2', 'StarWars'],
 ['S173_hum_2_StarWars.wav', 'S173', 'hum', '2', 'StarWars'],
 ['S174_hum_2_StarWars.wav', 'S174', 'hum', '2', 'StarWars'],
 ['S175_hum_2_StarWars.wav', 'S175', 'hum', '2', 'StarWars'],
 ['S176_hum_2_StarWars.wav', 'S176', 'hum', '2', 'StarWars'],
 ['S179_hum_2_StarWars.wav', 'S179', 'hum', '2', 'StarWars'],
 ['S182_hum_2_StarWars.wav', 'S182', 'hum', '2', 'StarWars'],
 ['S184_hum_2_Starwars.wav', 'S184', 'hum', '2', 'Starwars'],
 ['S187_hum_2_StarWars.wav', 'S187', 'hum', '2', 'StarWars'],
 ['S188_hum_2_StarWars.wav', 'S188', 'hum', '2', 'StarWars'],
 ['S190_hum_2_StarWars.wav', 'S190', 'hum', '2', 'StarWars'],
 ['S193_hum_2_StarWars.wav', 'S193', 'hum', '2', 'StarWars'],
 ['S195_hum_2_StarWars.wav', 'S195', 'hum', '2', 'StarWars'],
 ['S197_hum_2_StarWars.wav', 'S197', 'hum', '2', 'StarWars'],
 ['S198_

In [None]:
# Creating dataframe of StarWars test files
test1_data_df = pd.DataFrame(test1_data_table,columns=['file_id','participant','interpretation','number','song']).set_index('file_id') 
test1_data_df

| file_id | participant | interpretation | number | song |
| --- | --- | --- | --- | --- |
| S170_hum_2_StarWars.wav | S170 | hum | 2 | StarWars |
| S171_hum_2_StarWars.wav | S171 | hum | 2 | StarWars |
| S172_hum_2_StarWars.wav | S172 | hum | 2 | StarWars |
| S173_hum_2_StarWars.wav | S173 | hum | 2 | StarWars |
| S174_hum_2_StarWars.wav | S174 | hum | 2 | StarWars |
| ... | ... | ... | ... | ... |
| S212_hum_2_StarWars.wav | S212 | hum | 2 | StarWars |
| S216_hum_1_StarWars.wav | S216 | hum | 1 | StarWars |
| S216_hum_2_StarWars.wav | S216 | hum | 2 | StarWars |
| S219_hum_1_StarWars.wav | S219 | hum | 1 | StarWars |


In [None]:
# Creating StarWars test arrays
X1_test,y1_test = getXy(files, labels_file=test1_data_df, scale_audio=True, onlySingleDigit=True)

100%|██████████| 69/69 [03:36<00:00,  3.14s/it]


In [None]:
# Outputting shapes and arrays
print('The shape of X1_test is', X1_test.shape) 
print('The shape of y1_test is', y1_test.shape)
print('The features matrix is', X1_test)
print('The labels vector is', y1_test)

The shape of X1_test is (69, 4)
The shape of y1_test is (69,)
The features matrix is [[4.53736205e-02 1.32514717e+02 2.14151036e+01 6.11042945e-01]
 [7.11877295e-03 1.71945229e+02 2.76353999e+01 7.72591263e-01]
 [8.84609043e-03 2.48088311e+02 6.84344252e+01 9.05864198e-01]
 [3.93043296e-02 1.22612629e+02 2.00148462e+01 7.13400901e-01]
 [2.47371672e-02 1.25672151e+02 3.11825594e+01 6.61263276e-01]
 [8.99394202e-03 1.14699589e+02 2.75154362e+01 3.51194319e-01]
 [5.20351802e-02 3.36725613e+02 7.04607928e+01 8.33718245e-01]
 [3.87394080e-02 2.12875665e+02 5.36093520e+01 9.00075700e-01]
 [3.63039098e-02 3.16365914e+02 5.03864443e+01 5.75394665e-01]
 [3.66001933e-02 2.32735760e+02 2.16121088e+01 8.27586207e-01]
 [3.82226550e-02 1.41027851e+02 2.09852182e+01 6.92406692e-01]
 [1.06047041e-02 1.35052958e+02 3.10681866e+01 6.72250546e-01]
 [1.77533947e-02 2.31432478e+02 5.73590714e+01 7.74689100e-01]
 [3.29931180e-02 2.04415692e+02 3.10259205e+01 8.32630098e-01]
 [2.95374415e-02 2.17359247e+02 6

In [None]:
# Normalising StarWars test data
mean = X1_test.mean(0)
sd =  X1_test.std(0)

X1_test = (X1_test-mean)/sd

print("StarWars test data normalised.")

StarWars test data normalised.
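The cell above standardises the test features with the test set's own mean and standard deviation. A common alternative is to reuse the statistics estimated on the training split, so that test samples are mapped into the same feature space in which the model was fitted. Because this test set contains a single class, centring it on its own mean may also remove an offset that distinguishes the classes. The sketch below illustrates the alternative with NumPy on made-up numbers. The helper names `fit_standardiser` and `apply_standardiser` are hypothetical.

```python
import numpy as np


def fit_standardiser(X_train):
    """Estimate per-feature mean and standard deviation on the training data only."""
    return X_train.mean(axis=0), X_train.std(axis=0)


def apply_standardiser(X, mean, sd):
    """Standardise any split with statistics estimated elsewhere (here: on the training set)."""
    return (X - mean) / sd


# Made-up data: two training rows and one test row, two features each
X_train_demo = np.array([[0.0, 0.0], [2.0, 4.0]])
X_test_demo = np.array([[3.0, 6.0]])

train_mean, train_sd = fit_standardiser(X_train_demo)           # mean [1, 2], sd [1, 2]
print(apply_standardiser(X_test_demo, train_mean, train_sd))    # test row becomes [2, 2]
```

In this notebook, that would mean keeping the training `mean` and `sd` from Section 6.1 under separate names, since the cell above reassigns them.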


In [None]:
# Testing model using StarWars hum files
test1_p = model.predict(X1_test)

print('Testing Accuracy for StarWars is', np.mean(test1_p==y1_test))

Testing Accuracy for StarWars is 0.5652173913043478


#### **Section 6.2.2: Class 0**<br>
Testing for false labels using Potter files from sample 2:

In [None]:
# Loading Potter test data
sample_path = '/content/drive/MyDrive/Data2/MLEndHW2/potter_sample2/*.wav'
files = glob.glob(sample_path)
len(files)

70

In [None]:
# Outputting file names
for file in files:
  print(file.split('/')[-1])

S155_hum_2_Potter.wav
S156_hum_2_Potter.wav
S157_hum_1_Potter.wav
S157_hum_2_Potter.wav
S158_hum_2_Potter.wav
S159_hum_2_Potter.wav
S160_hum_2_Potter.wav
S161_hum_2_Potter.wav
S163_hum_2_Potter.wav
S165_hum_2_Potter.wav
S166_hum_1_Potter.wav
S166_hum_2_Potter.wav
S167_hum_2_Potter.wav
S168_hum_2_Potter.wav
S169_hum_1_Potter.wav
S169_hum_2_Potter.wav
S170_hum_2_Potter.wav
S171_hum_2_Potter.wav
S172_hum_2_Potter.wav
S173_hum_2_Potter.wav
S174_hum_2_Potter.wav
S175_hum_2_Potter.wav
S176_hum_2_Potter.wav
S177_hum_1_Potter.wav
S177_hum_2_Potter.wav
S179_hum_2_Potter.wav
S180_hum_2_Potter.wav
S182_hum_2_Potter.wav
S183_hum_2_Potter.wav
S184_hum_2_Potter.wav
S185_hum_1_Potter.wav
S185_hum_2_Potter.wav
S187_hum_2_Potter.wav
S188_hum_2_Potter.wav
S190_hum_2_Potter.wav
S191_hum_1_Potter.wav
S191_hum_2_Potter.wav
S192_hum_1_Potter.wav
S192_hum_2_Potter.wav
S193_hum_2_Potter.wav
S195_hum_2_Potter.wav
S196_hum_1_Potter.wav
S196_hum_2_Potter.wav
S197_hum_2_Potter.wav
S198_hum_2_Potter.wav
S199_hum_1

In [None]:
# Extracting info from filenames
# Expected pattern: <participant>_<interpretation>_<number>_<song>.wav
test2_data_table = []

for file in files:
  file_name = file.split('/')[-1]
  try:
    parts = file_name.split('_')
    participant_ID = parts[0]
    interpretation_type = parts[1]
    interpretation_number = parts[2]
    song = parts[3].split('.')[0]
    test2_data_table.append([file_name,participant_ID,interpretation_type,interpretation_number, song])
  except IndexError:
    # Report any file name that does not follow the expected pattern
    print(file_name)

test2_data_table

[['S155_hum_2_Potter.wav', 'S155', 'hum', '2', 'Potter'],
 ['S156_hum_2_Potter.wav', 'S156', 'hum', '2', 'Potter'],
 ['S157_hum_1_Potter.wav', 'S157', 'hum', '1', 'Potter'],
 ['S157_hum_2_Potter.wav', 'S157', 'hum', '2', 'Potter'],
 ['S158_hum_2_Potter.wav', 'S158', 'hum', '2', 'Potter'],
 ['S159_hum_2_Potter.wav', 'S159', 'hum', '2', 'Potter'],
 ['S160_hum_2_Potter.wav', 'S160', 'hum', '2', 'Potter'],
 ['S161_hum_2_Potter.wav', 'S161', 'hum', '2', 'Potter'],
 ['S163_hum_2_Potter.wav', 'S163', 'hum', '2', 'Potter'],
 ['S165_hum_2_Potter.wav', 'S165', 'hum', '2', 'Potter'],
 ['S166_hum_1_Potter.wav', 'S166', 'hum', '1', 'Potter'],
 ['S166_hum_2_Potter.wav', 'S166', 'hum', '2', 'Potter'],
 ['S167_hum_2_Potter.wav', 'S167', 'hum', '2', 'Potter'],
 ['S168_hum_2_Potter.wav', 'S168', 'hum', '2', 'Potter'],
 ['S169_hum_1_Potter.wav', 'S169', 'hum', '1', 'Potter'],
 ['S169_hum_2_Potter.wav', 'S169', 'hum', '2', 'Potter'],
 ['S170_hum_2_Potter.wav', 'S170', 'hum', '2', 'Potter'],
 ['S171_hum_2_

In [None]:
# Creating dataframe of Potter test files
test2_data_df = pd.DataFrame(test2_data_table,columns=['file_id','participant','interpretation','number','song']).set_index('file_id') 
test2_data_df

| file_id | participant | interpretation | number | song |
| --- | --- | --- | --- | --- |
| S155_hum_2_Potter.wav | S155 | hum | 2 | Potter |
| S156_hum_2_Potter.wav | S156 | hum | 2 | Potter |
| S157_hum_1_Potter.wav | S157 | hum | 1 | Potter |
| S157_hum_2_Potter.wav | S157 | hum | 2 | Potter |
| S158_hum_2_Potter.wav | S158 | hum | 2 | Potter |
| ... | ... | ... | ... | ... |
| S194_hum_1_Potter.wav | S194 | hum | 1 | Potter |
| S205_hum_1_Potter.wav | S205 | hum | 1 | Potter |
| S205_hum_2_Potter.wav | S205 | hum | 2 | Potter |
| S186_hum_2_Potter.wav | S186 | hum | 2 | Potter |


In [None]:
# Creating Potter test arrays
X2_test,y2_test = getXy(files, labels_file=test2_data_df, scale_audio=True, onlySingleDigit=True)

100%|██████████| 70/70 [03:50<00:00,  3.30s/it]


In [None]:
# Outputting shapes and arrays
print('The shape of X2_test is', X2_test.shape) 
print('The shape of y2_test is', y2_test.shape)
print('The features matrix is', X2_test)
print('The labels vector is', y2_test)

The shape of X2_test is (70, 4)
The shape of y2_test is (70,)
The features matrix is [[4.95476205e-02 1.94632747e+02 4.76049717e+01 8.90649762e-01]
 [1.10784931e-01 3.34012826e+02 6.53425353e+01 8.55380398e-01]
 [5.81637583e-02 3.28034088e+02 6.38598631e+01 8.74641834e-01]
 [3.88690236e-02 3.15976009e+02 6.17583792e+01 8.79831342e-01]
 [9.75654389e-04 1.58089534e+02 4.50818990e+01 6.51574803e-01]
 [3.83276450e-02 1.78047177e+02 3.85875162e+01 6.97208738e-01]
 [5.40660186e-02 1.76546413e+02 3.64734953e+01 7.32255798e-01]
 [2.92108152e-02 1.31276817e+02 3.35135544e+01 8.46762590e-01]
 [7.20354242e-02 1.59955427e+02 2.71549388e+01 7.51713633e-01]
 [1.81691902e-03 1.72736544e+02 3.82057350e+01 5.83734617e-01]
 [4.13946184e-02 3.24745991e+02 6.60763330e+01 8.52509653e-01]
 [5.08846242e-02 3.36594815e+02 6.46604346e+01 7.72200772e-01]
 [1.03243906e-01 1.59938486e+02 3.04685561e+01 6.95094340e-01]
 [1.89953410e-02 2.16332023e+02 3.12571291e+01 8.07951482e-01]
 [2.90136546e-02 2.87373148e+02 4

In [None]:
# Normalising Potter test data
mean = X2_test.mean(0)
sd =  X2_test.std(0)

X2_test = (X2_test-mean)/sd

print("Potter test data normalised.")

Potter test data normalised.


In [None]:
# Testing model using Potter hum files
test2_p = model.predict(X2_test)

print('Testing Accuracy for Potter is', np.mean(test2_p==y2_test))

Testing Accuracy for Potter is 0.5428571428571428


### **Section 7: Dataset**<br> 
The dataset for this analysis is the MLEnd Hums and Whistles public dataset version 0. It consists of participant-submitted 15-second humming and whistling recordings of fragments of 8 different movie songs. Each participant submitted 2 humming and 2 whistling renditions per song (32 per participant), along with their demographic data. Demographic data is currently unavailable for this task.<br> 

With 210 participants there are 210 x 4 x 8 = 6720 audio files, anonymised with sample numbers, e.g. S12.


### **Section 8: Results**<br> 
The results obtained with the dataset and pipeline above can be summarised as:<br> 

**i. Training Accuracy** 

In [None]:
# Training accuracy at 2 decimal places
print("The training accuracy of the pipeline was obtained to be: ", str(round(0.813953488372093, 2)))

The training accuracy of the pipeline was obtained to be:  0.81


**ii. Validation Accuracy**

In [None]:
# Validation accuracy at 2 decimal places
print("The validation accuracy of the pipeline was obtained to be: ", str(round(0.5977011494252874, 2)))

The validation accuracy of the pipeline was obtained to be:  0.6


**iii. Support Vectors**<br> 
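The array of support vectors has shape **(237, 4)**, from the output of the model-building cell in Section 6.1 above. That is, 237 of the 258 training samples are support vectors, each described by the 4 features. Support vectors are the training points that lie on or within the margin around the decision boundary; they alone determine its position and orientation. That so large a share of the training set acts as support vectors suggests that the two classes overlap considerably in this feature space.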

**iv. True Class Test Accuracy**

In [None]:
# Testing accuracy of true class at 2 decimal places
print("The testing accuracy of the true class was obtained to be: ", str(round(0.5652173913043478, 2)))

The testing accuracy of the true class was obtained to be:  0.57


**v. False Class Test Accuracy**

In [None]:
# Testing accuracy of false class at 2 decimal places
print("The testing accuracy of the false class was obtained to be: ", str(round(0.5428571428571428, 2)))

The testing accuracy of the false class was obtained to be:  0.54
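The two test accuracies can be combined into a single figure by weighting each with the number of files in its test set (69 StarWars and 70 Potter files, from Section 6.2). A minimal sketch of the arithmetic follows. The function name `pooled_accuracy` is hypothetical.

```python
def pooled_accuracy(accuracies, counts):
    """Mean of per-set accuracies, weighted by the number of samples in each set."""
    correct = sum(a * n for a, n in zip(accuracies, counts))
    return correct / sum(counts)


overall = pooled_accuracy([0.5652173913043478, 0.5428571428571428], [69, 70])
print(round(overall, 2))  # 0.55
```

This corresponds to 39 and 38 matching predictions respectively, i.e. 77 of the 139 test files.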


### **Section 9: Conclusions**<br> 
The training accuracy of 0.81 is fairly good, given that audio recordings are complex and that participant submissions are varied and sometimes inconsistent. In addition, only 345 samples were available for training and validation after cleaning.<br>  

The testing accuracies of 0.57 and 0.54, however, are poor. For a binary classification task with roughly balanced classes, an accuracy close to 0.5 is no better than random guessing.<br> 

There is strong evidence of overfitting in this model. A general sign of overfitting is a much higher accuracy on the training data (e.g. 99%) than on validation or test data (e.g. 75%). Although the training accuracy obtained here is not itself very high, the gap between it (0.81) and the validation (0.60) and testing (0.57 and 0.54) accuracies suggests that the model has overfitted the training data.