# Environment setup


Our first step is to load the Python libraries that we will be using in this project.

In [None]:
import os, glob

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import IPython.display as ipd

from tqdm import tqdm
import librosa

from sklearn import svm
from sklearn.model_selection import train_test_split

# Exploring our dataset sample


Run the following cell to check the contents of our project directory. The output should include this Jupyter notebook and the `data` directory containing our recordings:

In [None]:
os.listdir(os.getcwd())

Let's check how many audio files we have in the `data` directory:

In [None]:
sample_path = 'data/*.wav'
files = glob.glob(sample_path)
len(files)

This figure should be interpreted as the number of **samples** in our dataset. Let's listen to some random audio files:

In [None]:
for _ in range(5):
  # Draw a random index from the actual number of files instead of a hard-coded 98
  n = np.random.randint(len(files))
  display(ipd.Audio(files[n]))

Can you recognise each song? Can you tell whether the performers are humming or whistling the song?

Let's have a look at the names of the 98 audio files:

In [None]:
for file in files:
  print(file.split('/')[-1])

The name of each file follows the naming convention `[Participant ID]_[type of recording]_[song].wav`. We can parse each file name and extract this information. Let's do it for the first one:

In [None]:
print('The full path to the first audio file is: ', files[0])
print('\n')
print('The name of the first audio file is: ', files[0].split('/')[-1])
print('    The participant ID is: ', files[0].split('/')[-1].split('_')[0])
print('    The type of interpretation is: ', files[0].split('/')[-1].split('_')[1])
print('    The song is: ', files[0].split('/')[-1].split('_')[2].split('.')[0])
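
The repeated `split` calls above can be gathered into a small helper. The following is a minimal sketch, assuming that every file name follows the naming convention exactly. The function `parse_file_name` and the example path are hypothetical and are introduced here only for illustration. The sketch relies on `os.path`, so it does not depend on the path separator used by the operating system.

```python
import os

def parse_file_name(path):
    """Split '[Participant ID]_[type of recording]_[song].wav' into its parts."""
    file_name = os.path.basename(path)        # drop the directory part
    stem, _ = os.path.splitext(file_name)     # drop the '.wav' extension
    # Split on the first two underscores only, so a song name may contain '_'
    participant, interpretation, song = stem.split('_', 2)
    return file_name, participant, interpretation, song

# Hypothetical path, used only to illustrate the convention
print(parse_file_name('data/P01_hum_SongA.wav'))
```

A helper like this keeps the parsing logic in one place, which makes it easier to adapt if the naming convention changes.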

Now that we know how to extract participant ID, type of interpretation and song, let's create a table-like structure using Python lists that collects all this information from each of the audio files:

In [None]:
MLENDHW_table = [] 

for file in files:
  # Extract the file name once and parse it according to the naming convention
  file_name = file.split('/')[-1]
  participant_ID = file_name.split('_')[0]
  interpretation_type = file_name.split('_')[1]
  song = file_name.split('_')[2].split('.')[0]
  MLENDHW_table.append([file_name, participant_ID, interpretation_type, song])

MLENDHW_table

Next we load this table into a Pandas DataFrame. 

In [None]:
MLENDHW_df = pd.DataFrame(MLENDHW_table,columns=['file_id','participant','interpretation','song']).set_index('file_id') 
MLENDHW_df

After this step, we have created a Pandas DataFrame representing our dataset. Our dataset consists of 98 samples described by 4 attributes, namely audio recording, participant ID, type of interpretation and name of the song. 

In this machine learning project we want to build a prediction pipeline that takes an audio recording as an input and labels it as either hum or whistle. We can use the audio recording and type of interpretation attributes of the samples in this dataset to build our target prediction pipeline.

 

# Feature extraction

Audio files are complex data types. Specifically they are **discrete signals** or **time series**, consisting of values on a 1D temporal grid. These values are known as *samples* in signal theory, which might be a bit confusing, as we have used the term *sample* to refer to the *items* in our dataset. 

The **sampling frequency** is the rate at which samples in an audio file are produced. For instance, a sampling frequency of 5 Hz indicates that we produce 5 samples per second, or 1 sample every 0.2 s.
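
As a quick check of this arithmetic, the sketch below computes the time between samples and the duration of a signal from its sampling frequency. The signal length `n_samples` is a hypothetical value chosen for illustration.

```python
fs = 5                      # sampling frequency in Hz (the example above)
sampling_period = 1 / fs    # seconds between consecutive samples
n_samples = 20              # hypothetical signal length, in samples
duration = n_samples / fs   # signal duration in seconds

print('Sampling period (s):', sampling_period)
print('Duration (s):', duration)
```

The same relation, duration = number of samples / sampling frequency, is what the plotting cell uses to build its time axis.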

Let's plot one of our audio signals:

In [None]:
n=0
fs = None 

x, fs = librosa.load(files[n],sr=fs)
t = np.arange(len(x))/fs

plt.plot(t,x)
plt.xlabel('time (sec)')
plt.ylabel('amplitude')
plt.show()

display(ipd.Audio(files[n]))

Can you recognise the song and interpretation type? Does it agree with the values shown in the `MLENDHW_df` DataFrame? Let's check it:
 

In [None]:
MLENDHW_df.loc[files[n].split('/')[-1]]

By changing the value of `n`, you can listen to other examples.

How complex is an audio signal, exactly? Let's start by looking at the number of samples in our first audio file:

In [None]:
n=0
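# sr=None loads the file at its native sampling frequency,
# without relying on the value of fs left over from an earlier cell
x, fs = librosa.load(files[n],sr=None)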

print('This audio signal has', len(x), 'samples')

If we are using a raw audio signal as the input of a machine learning model, we will be operating in a predictor space consisting of hundreds of thousands of dimensions. Compare this figure with the number of samples that we have in our dataset. Do you think we have enough samples to train a model that takes one of these audio signals as an input? 

One approach to deal with high dimensionality is to extract a few features from our signals and use these features as predictors, instead of the raw audio signal. In this notebook we will define four audio features, namely:


1.   Power.
2.   Fraction of voiced region.
3.   Pitch mean.
4.   Pitch standard deviation.
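
To make these four definitions concrete, here is a minimal sketch that computes them on synthetic data rather than on a real recording. The sine wave, the sampling frequency and the per-frame pitch values in `f0` are all assumptions made up for this illustration. In particular, it assumes that unvoiced frames are marked with `NaN` in the pitch track.

```python
import numpy as np

# Synthetic stand-ins (assumptions, not real recordings)
fs = 1000                            # sampling frequency in Hz
t = np.arange(fs) / fs               # one second of time stamps
x = np.sin(2 * np.pi * 50 * t)       # unit-amplitude 50 Hz sine wave

# Hypothetical per-frame pitch estimates; NaN marks unvoiced frames
f0 = np.array([np.nan, 200.0, 210.0, 190.0, np.nan])
voiced_flag = ~np.isnan(f0)

power = np.mean(x**2)                # average energy per sample
voiced_fr = np.mean(voiced_flag)     # fraction of frames that are voiced
pitch_mean = np.nanmean(f0)          # mean pitch over voiced frames
pitch_std = np.nanstd(f0)            # pitch spread over voiced frames

print(power, voiced_fr, pitch_mean, pitch_std)
```

Each recording is reduced to a handful of numbers this way, regardless of how many samples the raw signal contains.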
   

In the next cell, we define a new function that gets the pitch of an audio signal (do not worry if you do not know what it is, but feel free to independently read about it!).

In [None]:
def getPitch(x,fs,winLen=0.02):
  p = winLen*fs
  frame_length = int(2**int(p-1).bit_length())
  hop_length = frame_length//2
  f0, voiced_flag, voiced_probs = librosa.pyin(y=x, fmin=80, fmax=450, sr=fs,
                                                 frame_length=frame_length,hop_length=hop_length)
  return f0,voiced_flag
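
To see what the first three lines of `getPitch` do, the sketch below repeats the same arithmetic on its own. The helper name `frame_and_hop` and the two sampling frequencies are assumptions chosen for illustration; the recordings in `data` may use a different rate.

```python
def frame_and_hop(fs, winLen=0.02):
    """Reproduce the frame and hop length arithmetic used in getPitch."""
    p = winLen * fs                                    # window length in samples
    frame_length = int(2 ** int(p - 1).bit_length())   # power-of-two frame size
    hop_length = frame_length // 2                     # 50% overlap between frames
    return frame_length, hop_length

# Two common sampling frequencies, used here only as examples
for fs in (22050, 44100):
    print(fs, frame_and_hop(fs))
```

For these two rates, the 20 ms window is rounded up to a power of two (512 and 1024 samples respectively), and consecutive frames overlap by half a frame.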

The next cell defines a function that returns a NumPy array (`X`) containing two of the audio features (namely 'power' and 'fraction of voiced region'), which will be used as predictors, and a binary NumPy array (`y`) that indicates whether the type of interpretation is a hum (`y=1`) or a whistle (`y=0`).

In [None]:
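def getXy(files,labels_file, scale_audio=False, onlySingleDigit=False):
  # Note: onlySingleDigit is accepted but not used inside this function
  X,y =[],[]
  for file in tqdm(files):
    fileID = file.split('/')[-1]
    # Label: True for a hum, False for a whistle
    yi = labels_file.loc[fileID]['interpretation']=='hum'

    # sr=None loads the audio at its native sampling frequency
    fs = None
    x, fs = librosa.load(file,sr=fs)
    # Optionally normalise the signal to a peak amplitude of 1
    if scale_audio: x = x/np.max(np.abs(x))
    f0, voiced_flag = getPitch(x,fs,winLen=0.02)

    power = np.sum(x**2)/len(x)
    voiced_fr = np.mean(voiced_flag)
    # Pitch statistics ignore unvoiced (NaN) frames; 0 if no frame is voiced
    pitch_mean = np.nanmean(f0) if np.mean(np.isnan(f0))<1 else 0
    pitch_std  = np.nanstd(f0) if np.mean(np.isnan(f0))<1 else 0

    # Only power and voiced fraction are used as predictors here
    xi = [power, voiced_fr]
    X.append(xi)
    y.append(yi)

  return np.array(X),np.array(y)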

Let's apply `getXy` to our dataset sample to obtain the NumPy predictor array (`X`) and the binary label array (`y`). This could take a while, as we are processing each of the 98 recordings.

In [None]:
X,y = getXy(files, labels_file=MLENDHW_df, scale_audio=True, onlySingleDigit=True)

The next cell shows the shape of `X` and `y`:

In [None]:
print('The shape of X is', X.shape) 
print('The shape of y is', y.shape)

As you can see, we have 98 items, each described by 2 features (stored in `X`) and one binary label (stored in `y`). Is our dataset balanced? Let's have a look:

In [None]:
print(' The number of hum recordings is ', np.count_nonzero(y))
print(' The number of whistle recordings is ', y.size - np.count_nonzero(y))

# Modeling

Let's train a machine learning model using the dataset that we have just created. We will first split the dataset defined by `X` and `y` into a training set and a validation set. Then we will fit the support vector machine (SVM) classifier provided by scikit-learn to the training set, and finally we will compute the accuracy of the trained model on the validation set.

In [None]:
X_train, X_val, y_train, y_val = train_test_split(X,y,test_size=0.3)
X_train.shape, X_val.shape, y_train.shape, y_val.shape

Can you identify the number of items in the training and validation sets?

Let's now fit an SVM model and print both the training accuracy and the validation accuracy.


In [None]:
model  = svm.SVC(C=1)
model.fit(X_train,y_train)

yt_p = model.predict(X_train)
yv_p = model.predict(X_val)

print('Training Accuracy', np.mean(yt_p==y_train))
print('Validation  Accuracy', np.mean(yv_p==y_val))

Compare the training and validation accuracies. Is our model overfitting, underfitting, or performing well? What do you think the accuracy of a random classifier would be?
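
One way to approach the last question is to compute the expected accuracy of classifiers that ignore the audio entirely and only look at the label frequencies. The sketch below is an illustration under stated assumptions: `baseline_accuracies` is a hypothetical helper, and `y_example` is a made-up label array, not the `y` obtained from our recordings.

```python
import numpy as np

def baseline_accuracies(y):
    """Expected accuracy of two label-only baselines for binary labels."""
    p = float(np.mean(y))               # fraction of positive (hum) labels
    majority = max(p, 1 - p)            # always predict the most frequent class
    proportional = p**2 + (1 - p)**2    # guess at random using the class frequencies
    return majority, proportional

# Hypothetical labels: 6 hums and 4 whistles
y_example = np.array([True] * 6 + [False] * 4)
print(baseline_accuracies(y_example))
```

A trained model is only useful if its validation accuracy is clearly above baselines like these. For a perfectly balanced dataset, both baselines reduce to 0.5.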
