# 1 Author

**Student Name**:  Salman Ali Sayyed

**Student ID**:  220663575



# 2 Problem formulation

**Basic component**

The task is to classify an input audio file as either an indoor or an outdoor recording. To do this, a machine learning model is built that predicts the class from features extracted from the audio file.

# 3 Machine Learning pipeline

For this binary classification solution, the following steps were taken:

1. **Data Load**: All the zip files of audio data were uploaded to Google Drive and extracted into a single folder (1_dataset). The paths of all files with the .wav extension were then collected using Python's glob package and stored in the variable files.
2. **Reading csv**: The csv containing the file names and labels was read and analysed.
3. **Feature extraction**: Two functions, getPitch and getXy, were defined to extract features from each audio file. Using Python's librosa package, spectral features (spectral centroid, bandwidth, contrast, flatness and rolloff) were computed and, together with audio power, pitch mean, pitch standard deviation and the fraction of voiced frames, collected in a list to form the predictors. A boolean stating whether a particular audio file is indoor or not was used as the label.
4. **Splitting dataset**: The predictors and labels were split into training and validation sets using sklearn.
5. **Selecting Model**: After trying various classifiers and adjusting their hyperparameters, the model was narrowed down to RandomForestClassifier with the hyperparameters shown in the code below.
6. **Validation**: Performance of the model is analysed based on accuracy, precision, recall and f1-score, which are displayed using sklearn's classification report.

Note
The code that loads the data and extracts the features is shown for demonstration only, as this has already been done and the results stored in a csv.

In [None]:
from google.colab import drive

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import os,sys,re,pickle,glob
import urllib.request
import zipfile

import IPython.display as ipd
from tqdm import tqdm
import librosa

drive.mount('/content/drive')

In [None]:
directory_to_extract_to = '/content/drive/MyDrive/Data/ml_full_dataset/1_dataset'
zip_dir = '/content/drive/MyDrive/Data/ml_full_dataset'

# Extract the five dataset archives MLEndLS_1.zip ... MLEndLS_5.zip into one folder
for i in range(1, 6):
    zip_path = f'{zip_dir}/MLEndLS_{i}.zip'
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        zip_ref.extractall(directory_to_extract_to)
    
directory_path='/content/drive/MyDrive/Data/ml_full_dataset/1_dataset/*.wav'
files=glob.glob(directory_path)

MLEndLS=pd.read_csv('./MLEndLS.csv').set_index('file_id')

# 4 Transformation stage
The transformations applied are as follows:

1. Label transformation
Since we only want to know whether the audio is indoor or outdoor, the labels were transformed into a boolean which states whether the recording is indoor or not.

2. Feature extraction
Outdoor recordings are expected to be noisier than indoor ones, so their spectral features are expected to take higher values. Therefore spectral features such as spectral centroid, bandwidth, contrast, flatness and rolloff are used as predictors.

3. Feature transformation
Each extracted feature is a numpy array, so its mean is taken and used as a predictor.
Applying principal component analysis decreased the accuracy, so it has not been implemented.
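
The label and feature transformations described above can be illustrated on toy data. This is a minimal sketch, not the notebook's actual pipeline: the file names, the `labels` table and the random `frame_feature` array are hypothetical stand-ins, and only the `in_out` column name is taken from the code below.

```python
import numpy as np
import pandas as pd

# Hypothetical label table mirroring the 'in_out' column used in this notebook.
labels = pd.DataFrame(
    {"in_out": ["indoor", "outdoor", "indoor"]},
    index=["0001.wav", "0002.wav", "0003.wav"],
)

# Label transformation: True for indoor recordings, False otherwise.
is_indoor = (labels["in_out"] == "indoor").to_numpy()

# Stand-in for a frame-level feature of shape (1, n_frames), which is the
# shape librosa.feature.spectral_centroid returns for a single clip.
rng = np.random.default_rng(0)
frame_feature = rng.uniform(500.0, 3000.0, size=(1, 120))

# Feature transformation: collapse all frames into one scalar predictor.
clip_feature = float(np.mean(frame_feature))

print(is_indoor, clip_feature)
```

Taking the mean discards how a feature evolves over time within a clip, which keeps the predictor vector short at the cost of temporal detail.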

In [None]:
def getPitch(x,fs,winLen=0.02):
  #winLen = 0.02 
  p = winLen*fs
  frame_length = int(2**int(p-1).bit_length())
  hop_length = frame_length//2
  f0, voiced_flag, voiced_probs = librosa.pyin(y=x, fmin=80, fmax=450, sr=fs,
                                                 frame_length=frame_length,hop_length=hop_length)
  return f0,voiced_flag

In [None]:
def getXy(files,labels_file, scale_audio=False, onlySingleDigit=False):
  X,y =[],[]
  for file in tqdm(files):
    fileID = file.split('/')[-1]
    yi = labels_file.loc[fileID]['in_out']=='indoor'

    fs = None # sr=None keeps the file's native sampling rate (librosa's default would be 22050)
    x, fs = librosa.load(file,sr=fs)
    spectralCentroid=librosa.feature.spectral_centroid(y=x,sr=fs)
    spectral_bandwidth=librosa.feature.spectral_bandwidth(y=x,sr=fs)
    spectral_contrast=librosa.feature.spectral_contrast(y=x,sr=fs)
    spectral_flatness=librosa.feature.spectral_flatness(y=x)
    spectral_rolloff=librosa.feature.spectral_rolloff(y=x,sr=fs)
    if scale_audio: x = x/np.max(np.abs(x))
    f0, voiced_flag = getPitch(x,fs,winLen=0.02)
      
    power = np.sum(x**2)/len(x)
    pitch_mean = np.nanmean(f0) if np.mean(np.isnan(f0))<1 else 0
    pitch_std  = np.nanstd(f0) if np.mean(np.isnan(f0))<1 else 0
    voiced_fr = np.mean(voiced_flag)

    xi = [power,pitch_mean,pitch_std,voiced_fr,np.mean(spectralCentroid),np.mean(spectral_bandwidth),np.mean(spectral_contrast),np.mean(spectral_flatness),np.mean(spectral_rolloff)]
    X.append(xi)
    y.append(yi)

  return np.array(X),np.array(y)

In [None]:
X,y = getXy(files, labels_file=MLEndLS, scale_audio=True, onlySingleDigit=True)

In [None]:
import csv
import pickle

with open("X.csv","w+") as my_csv:
    csvWriter = csv.writer(my_csv,delimiter=',')
    csvWriter.writerows(X)
    
with open("y", "wb") as fp:
  pickle.dump(y, fp)

In [None]:
X=pd.read_csv('X.csv')
with open("y", "rb") as fp:
  y=pickle.load(fp)

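# X.csv was written without a header row, so read_csv used the first sample as column names;
# drop the first label as well to keep X and y aligned
y=np.delete(y,0)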

print(X.shape)
print(y.shape)

(2491, 9)
(2491,)
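
Because X.csv has no header row, one sample is lost when it is read back. One possible way to avoid this is to store predictors and labels together with named columns. The sketch below uses synthetic data; the column names in `feature_names`, the `indoor` column and the file name `features.csv` are assumptions made for illustration.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical column names for the nine predictors built in getXy.
feature_names = [
    "power", "pitch_mean", "pitch_std", "voiced_fr",
    "spectral_centroid", "spectral_bandwidth", "spectral_contrast",
    "spectral_flatness", "spectral_rolloff",
]

# Small synthetic stand-ins for X and y.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(5, len(feature_names)))
y_demo = np.array([True, False, True, True, False])

# Store predictors and label in one table with a header row.
table = pd.DataFrame(X_demo, columns=feature_names)
table["indoor"] = y_demo

with tempfile.TemporaryDirectory() as tmp_dir:
    cache_path = os.path.join(tmp_dir, "features.csv")
    table.to_csv(cache_path, index=False)
    restored = pd.read_csv(cache_path)

# All rows survive the round trip, so X and y stay aligned.
X_back = restored[feature_names].to_numpy()
y_back = restored["indoor"].to_numpy()
print(X_back.shape, y_back.shape)
```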


# 5 Modelling

1. **Random Forest Classifier**: A Random Forest Classifier was trained because it handles non-linear relationships between the predictors and the label effectively. It was chosen because it gave better accuracy than SVM, Decision Tree Classifier and Logistic Regression. The hyperparameters max_features=9, max_depth=13 and n_estimators=100 were selected based on the accuracy score.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(X,y,test_size=0.3)

# from sklearn.tree import DecisionTreeClassifier
# model=DecisionTreeClassifier()

from sklearn.ensemble import RandomForestClassifier
model=RandomForestClassifier(max_features=9,max_depth=13,n_estimators=100)

# from sklearn import svm
# model=svm.SVC(C=0.1, gamma=2)

model.fit(X_train,y_train)


RandomForestClassifier(max_depth=13, max_features=9)
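
A comparison between candidate classifiers, as described at the start of this section, could be made more repeatable with cross-validation. The following is a sketch on synthetic data generated by `make_classification`, not the comparison that was actually run; the 5-fold setting, the scaling pipelines and the fixed `random_state` values are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the nine audio predictors and the indoor label.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=9, n_informative=5, random_state=0
)

candidates = {
    "random_forest": RandomForestClassifier(
        max_features=9, max_depth=13, n_estimators=100, random_state=0
    ),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    # SVM and logistic regression are scale-sensitive, so they are scaled first.
    "svm": make_pipeline(StandardScaler(), SVC()),
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
}

# Mean 5-fold cross-validated accuracy for each candidate.
cv_accuracy = {
    name: cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy").mean()
    for name, model in candidates.items()
}

for name, score in sorted(cv_accuracy.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

Averaging over several folds reduces the dependence of the ranking on one particular train/validation split.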

# 6 Methodology

The performance of the model is analysed based on the following scores:
1. **Accuracy**: The ratio of correct predictions to the total number of predictions.
2. **Precision**: The ratio of true positives to the sum of true positives and false positives.
3. **Recall**: The ratio of true positives to the sum of true positives and false negatives.
4. **F1-score**: The harmonic mean of precision and recall, which combines the two into a single metric ranging from 0 to 1.
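
The four definitions above can be checked by hand on a small example. This sketch uses made-up labels, with True standing for the positive (indoor) class, and compares the manual values against sklearn, whose metric functions take the true labels first and the predictions second.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

# Toy labels; True plays the role of "indoor", the positive class.
y_true = np.array([True, True, True, False, False, False, True, False])
y_pred = np.array([True, False, True, False, True, False, True, False])

# Confusion-matrix counts.
tp = np.sum(y_true & y_pred)      # predicted indoor, actually indoor
fp = np.sum(~y_true & y_pred)     # predicted indoor, actually outdoor
fn = np.sum(y_true & ~y_pred)     # predicted outdoor, actually indoor
tn = np.sum(~y_true & ~y_pred)    # predicted outdoor, actually outdoor

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# sklearn expects (y_true, y_pred) in that order.
print(accuracy, accuracy_score(y_true, y_pred))
print(precision, precision_score(y_true, y_pred))
print(recall, recall_score(y_true, y_pred))
print(f1, f1_score(y_true, y_pred))
```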


# 7 Dataset

1. 2491 audio files were used in total.
2. Input features were extracted and labels assigned for all of them.
3. The data was then divided into training and validation sets in a 70:30 ratio.
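
A 70:30 split can be made reproducible and class-balanced by passing `random_state` and `stratify` to `train_test_split`. The sketch below uses synthetic data; both arguments are assumptions and are not used in this notebook's own split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 200 samples, 9 predictors, slightly imbalanced labels.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 9))
y_demo = np.array([True] * 90 + [False] * 110)

# stratify keeps the indoor/outdoor ratio the same in both subsets;
# random_state makes the split identical on every run.
X_train_demo, X_val_demo, y_train_demo, y_val_demo = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

print(X_train_demo.shape, X_val_demo.shape)
print(y_train_demo.mean(), y_val_demo.mean())
```

With a fixed seed, scores from different runs of the notebook become directly comparable.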

# 8 Results



In [None]:
yt_p = model.predict(X_train)
yv_p = model.predict(X_val)

print('Training Accuracy', np.mean(yt_p==y_train))
print('Validation  Accuracy', np.mean(yv_p==y_val))

from sklearn.metrics import classification_report,precision_score,recall_score,f1_score

# sklearn metrics expect (y_true, y_pred) in that order
print("Training Precision=",precision_score(y_train,yt_p))
print("Training Recall=",recall_score(y_train,yt_p))
print("Training f1-score=",f1_score(y_train,yt_p))

print("Validation Precision=",precision_score(y_val,yv_p))
print("Validation Recall=",recall_score(y_val,yv_p))
print("Validation f1-score=",f1_score(y_val,yv_p))
print("Training classification report",classification_report(y_train, yt_p))
print("Validation classification report",classification_report(y_val, yv_p))

Training Accuracy 0.9948364888123924
Validation  Accuracy 0.6270053475935828
Training Precision= 0.9915356711003628
Training Recall= 0.9975669099756691
Training f1-score= 0.9945421467556095
Validation Precision= 0.6182965299684543
Validation Recall= 0.5536723163841808
Validation f1-score= 0.5842026825633383
Training classification report               precision    recall  f1-score   support

       False       1.00      0.99      1.00       921
        True       0.99      1.00      0.99       822

    accuracy                           0.99      1743
   macro avg       0.99      0.99      0.99      1743
weighted avg       0.99      0.99      0.99      1743

Validation classification report               precision    recall  f1-score   support

       False       0.63      0.69      0.66       394
        True       0.62      0.55      0.58       354

    accuracy                           0.63       748
   macro avg       0.63      0.62      0.62       748
weighted avg       0.63     

In [None]:
# Normalization
mean = X_train.mean(0)
sd =  X_train.std(0)

X_train = (X_train-mean)/sd
X_val  = (X_val-mean)/sd

model.fit(X_train,y_train)

yt_p = model.predict(X_train)
yv_p = model.predict(X_val)

print('Training Accuracy', np.mean(yt_p==y_train))
print('Validation  Accuracy', np.mean(yv_p==y_val))
# sklearn metrics expect (y_true, y_pred) in that order
print("Training Precision=",precision_score(y_train,yt_p))
print("Training Recall=",recall_score(y_train,yt_p))
print("Training f1-score=",f1_score(y_train,yt_p))

print("Validation Precision=",precision_score(y_val,yv_p))
print("Validation Recall=",recall_score(y_val,yv_p))
print("Validation f1-score=",f1_score(y_val,yv_p))
print("Training classification report",classification_report(y_train, yt_p))
print("Validation classification report",classification_report(y_val, yv_p))

Training Accuracy 0.9948364888123924
Validation  Accuracy 0.6270053475935828
              precision    recall  f1-score   support

       False       1.00      0.99      1.00       921
        True       0.99      1.00      0.99       822

    accuracy                           0.99      1743
   macro avg       0.99      0.99      0.99      1743
weighted avg       0.99      0.99      0.99      1743

              precision    recall  f1-score   support

       False       0.63      0.69      0.66       394
        True       0.62      0.55      0.58       354

    accuracy                           0.63       748
   macro avg       0.63      0.62      0.62       748
weighted avg       0.63      0.63      0.63       748



# 9 Conclusions

After training different models and comparing them through the defined metrics, RandomForestClassifier was found to be the best.
Since the dataset is imbalanced, with more outdoor labels than indoor ones, we cannot rely on accuracy alone and need to compute precision and recall to get a better understanding of the model's performance.
The scores of the model before and after normalization are as follows:

**Before normalization**

Training Accuracy=0.9942627653471027

Training Precision=0.9915356711003628

Training Recall=0.9975669099756691

Training f1-score=0.9945421467556095

Validation Accuracy=0.6350267379679144

Validation Precision=0.6182965299684543

Validation Recall=0.5536723163841808

Validation f1-score=0.5842026825633383

**After normalization**

Training Accuracy=0.9959839357429718

Training Precision=0.9927095990279465

Training Recall=0.9987775061124694

Training f1-score=0.9957343083485679

Validation Accuracy=0.6283422459893048

Validation Precision=0.6219512195121951

Validation Recall=0.5698324022346368

Validation f1-score=0.5947521865889213

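Validation accuracy is slightly higher before normalization (0.635 versus 0.628), while validation precision, recall and f1-score are marginally higher after it. The differences are small, and tree-based models such as random forests are largely insensitive to feature scaling, so I chose not to normalize the data and to keep the first set of results.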
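
The effect of normalization on a random forest can be checked in isolation by fixing `random_state`, so that the randomness of training does not mask it. The sketch below runs on synthetic data from `make_classification`, not on the audio features; the seeds and the helper `fit_predict` are assumptions introduced for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the nine predictors and the indoor label.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=9, n_informative=5, random_state=0
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# Same normalization as in the Results section: training mean and std only.
mean = X_tr.mean(axis=0)
sd = X_tr.std(axis=0)
X_tr_n = (X_tr - mean) / sd
X_va_n = (X_va - mean) / sd

def fit_predict(train_X, val_X):
    """Fit a forest with a fixed seed and return validation predictions."""
    forest = RandomForestClassifier(
        max_features=9, max_depth=13, n_estimators=100, random_state=0
    )
    forest.fit(train_X, y_tr)
    return forest.predict(val_X)

pred_raw = fit_predict(X_tr, X_va)
pred_norm = fit_predict(X_tr_n, X_va_n)

# Share of validation samples that receive the same prediction either way.
agreement = np.mean(pred_raw == pred_norm)
print("Share of identical validation predictions:", agreement)
```

If the agreement is close to 1, differences between runs with and without normalization are more likely caused by unseeded randomness than by the scaling itself.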
