# ECS7020P mini-project submission

The mini-project has two separate components:


1.   **Basic component** [6 marks]: Using the MLEnd London Sounds dataset, build a machine learning pipeline that takes as an input an audio segment and predicts whether the audio segment has been recorded indoors or outdoors.
2.   **Advanced component** [10 marks]: Formulate your own machine learning problem and build a machine learning solution using the MLEnd London Sounds dataset. 

Your submission will consist of two Jupyter notebooks, one for the basic component and another one for advanced component. Please **name each notebook**:

* ECS7020P_miniproject_basic.ipynb
* ECS7020P_miniproject_advanced.ipynb

then **zip and submit them toghether**.

Each uploaded notebook should include: 

*   **Text cells**, describing concisely each step and results.
*   **Code cells**, implementing each step.
*   **Output cells**, i.e. the output from each code cell.

and **should have the structure** indicated below. Notebooks might not be run, please make sure that the output cells are saved.

How will we evaluate your submission?

*   Conciseness in your writing (10%).
*   Correctness in your methodology (30%).
*   Correctness in your analysis and conclusions (30%).
*   Completeness (10%).
*   Originality (10%).
*   Efforts to try something new (10%).

Suggestion: Why don't you use **GitHub** to manage your project? GitHub can be used as a presentation card that showcases what you have done and gives evidence of your data science skills, knowledge and experience. 

Each notebook should be structured into the following 9 sections:


# 1 Author

**Student Name**:  
**Student ID**:  



# 2 Problem formulation

Describe the machine learning problem that you want to solve and explain what's interesting about it.

# 3 Machine Learning pipeline

Describe your ML pipeline. Clearly identify its input and output, any intermediate stages (for instance, transformation -> models), and intermediate data moving from one stage to the next. It's up to you to decide which stages to include in your pipeline. 

input audio -> mfcc coefficient extraction -- MFCC Coefficients -> data cleanup --  -> PCA dimensionality reduction -- PCA -> SVM Classifier -> output label (indoor/outdoor)

# 4 Transformation stage

Describe any transformations, such as feature extraction. Identify input and output. Explain why you have chosen this transformation stage.

In [None]:
PCA

# 5 Modelling

Describe the ML model(s) that you will build. Explain why you have chosen them.

# 6 Methodology

Describe how you will train and validate your models, how model performance is assesssed (i.e. accuracy, confusion matrix, etc)

# 7 Dataset

Describe the dataset that you will use to create your models and validate them. If you need to preprocess it, do it here. Include visualisations too. You can visualise raw data samples or extracted features.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import librosa.display
import math

import os, sys, re, pickle, glob
import urllib.request
import zipfile

import IPython.display as ipd
from tqdm import tqdm
import librosa
import scipy.stats as stats

In [4]:
sample_path = 'MLEndLS/*.wav'
files = glob.glob(sample_path)
MLENDLS_df = pd.read_csv('./MLEndLS.csv').set_index('file_id') 

In [5]:
aud_data = []
labels = []

for n in range(0,len(files)):
# for n in range(50,100):
    a, fs = librosa.load(files[n],sr=None)    
    data = librosa.stft(a)
    print(data.shape)
    # Trimming to get a standard input size
    data = data[:321048]
    if len(data) < 321048: continue
        
    aud_data.append(data)
    labels.append(MLENDLS_df.loc[files[n].split('/')[-1]].in_out)
    
print("Done")

(1025, 591)
(1025, 711)
(1025, 771)
(1025, 622)
(1025, 607)
(1025, 562)
(1025, 702)
(1025, 653)
(1025, 583)
(1025, 2309)
(1025, 629)
(1025, 602)
(1025, 657)
(1025, 713)
(1025, 637)
(1025, 1292)
(1025, 678)
(1025, 648)
(1025, 719)
(1025, 799)
(1025, 678)
(1025, 723)
(1025, 870)
(1025, 647)
(1025, 690)
(1025, 657)
(1025, 690)
(1025, 701)
(1025, 739)
(1025, 605)
(1025, 734)
(1025, 896)
(1025, 603)
(1025, 615)
(1025, 640)
(1025, 691)
(1025, 646)
(1025, 653)
(1025, 603)
(1025, 615)
(1025, 670)
(1025, 1005)
(1025, 740)
(1025, 603)
(1025, 692)
(1025, 646)
(1025, 731)
(1025, 604)
(1025, 807)
(1025, 688)
(1025, 644)
(1025, 692)
(1025, 713)
(1025, 607)
(1025, 536)
(1025, 793)
(1025, 1357)
(1025, 639)
(1025, 667)
(1025, 691)
(1025, 678)
(1025, 777)
(1025, 695)
(1025, 701)
(1025, 827)
(1025, 1001)
(1025, 623)
(1025, 639)
(1025, 628)
(1025, 638)
(1025, 653)
(1025, 939)
(1025, 796)
(1025, 601)
(1025, 689)
(1025, 920)
(1025, 922)
(1025, 708)
(1025, 683)
(1025, 689)
(1025, 673)
(1025, 615)
(1025, 643)

(1025, 610)
(1025, 781)
(1025, 690)
(1025, 615)
(1025, 888)
(1025, 777)
(1025, 609)
(1025, 931)
(1025, 618)
(1025, 764)
(1025, 680)
(1025, 994)
(1025, 598)
(1025, 604)
(1025, 675)
(1025, 690)
(1025, 661)
(1025, 665)
(1025, 696)
(1025, 657)
(1025, 1721)
(1025, 642)
(1025, 648)
(1025, 1153)
(1025, 605)
(1025, 678)
(1025, 640)
(1025, 669)
(1025, 807)
(1025, 634)
(1025, 920)
(1025, 740)
(1025, 754)
(1025, 558)
(1025, 719)
(1025, 610)
(1025, 660)
(1025, 852)
(1025, 607)
(1025, 641)
(1025, 623)
(1025, 855)
(1025, 857)
(1025, 631)
(1025, 658)
(1025, 727)
(1025, 603)
(1025, 654)
(1025, 673)
(1025, 764)
(1025, 722)
(1025, 603)
(1025, 622)
(1025, 622)
(1025, 609)
(1025, 659)
(1025, 667)
(1025, 746)
(1025, 660)
(1025, 625)
(1025, 675)
(1025, 801)
(1025, 633)
(1025, 696)
(1025, 753)
(1025, 722)
(1025, 784)
(1025, 925)
(1025, 605)
(1025, 680)
(1025, 610)
(1025, 633)
(1025, 839)
(1025, 619)
(1025, 607)
(1025, 820)
(1025, 691)
(1025, 861)
(1025, 822)
(1025, 702)
(1025, 660)
(1025, 682)
(1025, 660)
(1

(1025, 704)
(1025, 674)
(1025, 607)
(1025, 690)
(1025, 941)
(1025, 659)
(1025, 1108)
(1025, 601)
(1025, 717)
(1025, 827)
(1025, 640)
(1025, 659)
(1025, 645)
(1025, 637)
(1025, 664)
(1025, 827)
(1025, 542)
(1025, 690)
(1025, 672)
(1025, 615)
(1025, 640)
(1025, 649)
(1025, 804)
(1025, 607)
(1025, 629)
(1025, 684)
(1025, 618)
(1025, 742)
(1025, 603)
(1025, 526)
(1025, 667)
(1025, 787)
(1025, 628)
(1025, 845)
(1025, 672)
(1025, 626)
(1025, 600)
(1025, 657)
(1025, 653)
(1025, 669)
(1025, 745)
(1025, 646)
(1025, 649)
(1025, 1209)
(1025, 698)
(1025, 946)
(1025, 789)
(1025, 690)
(1025, 653)
(1025, 668)
(1025, 696)
(1025, 752)
(1025, 624)
(1025, 704)
(1025, 640)
(1025, 604)
(1025, 680)
(1025, 604)
(1025, 908)
(1025, 754)
(1025, 677)
(1025, 646)
(1025, 615)
(1025, 733)
(1025, 899)
(1025, 642)
(1025, 667)
(1025, 733)
(1025, 690)
(1025, 686)
(1025, 816)
(1025, 720)
(1025, 689)
(1025, 690)
(1025, 615)
(1025, 639)
(1025, 669)
(1025, 684)
(1025, 839)
(1025, 648)
(1025, 813)
(1025, 771)
(1025, 665)
(1

(1025, 637)
(1025, 610)
(1025, 814)
(1025, 686)
(1025, 640)
(1025, 671)
(1025, 852)
(1025, 915)
(1025, 628)
(1025, 603)
(1025, 680)
(1025, 767)
(1025, 1243)
(1025, 612)
(1025, 715)
(1025, 1086)
(1025, 911)
(1025, 752)
(1025, 677)
(1025, 688)
(1025, 848)
(1025, 775)
(1025, 739)
(1025, 703)
(1025, 775)
(1025, 697)
(1025, 697)
(1025, 767)
(1025, 777)
(1025, 646)
(1025, 629)
(1025, 723)
(1025, 601)
(1025, 889)
(1025, 1094)
(1025, 630)
(1025, 1007)
(1025, 1205)
(1025, 690)
(1025, 620)
(1025, 1665)
(1025, 534)
(1025, 669)
(1025, 676)
(1025, 757)
(1025, 605)
(1025, 704)
(1025, 753)
(1025, 613)
(1025, 851)
(1025, 675)
(1025, 605)
(1025, 687)
(1025, 640)
(1025, 810)
(1025, 729)
(1025, 704)
(1025, 869)
(1025, 756)
(1025, 636)
(1025, 610)
(1025, 604)
(1025, 675)
(1025, 715)
(1025, 643)
(1025, 603)
(1025, 666)
(1025, 761)
(1025, 607)
(1025, 979)
(1025, 727)
(1025, 800)
(1025, 677)
(1025, 684)
(1025, 633)
(1025, 607)
(1025, 627)
(1025, 858)
(1025, 662)
(1025, 603)
(1025, 698)
(1025, 646)
(1025, 643

In [1]:
import pandas as pd
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

numComponents = 7
pca = PCA(n_components=numComponents)
pca.fit(aud_data)

projected = pca.transform(aud_data)
projected = pd.DataFrame(projected,columns=['pc1','pc2','pc3','pc4','pc5','pc6','pc7'],index=range(1,len(aud_data)+1))
# projected['label'] = labels
# display(projected)

NameError: name 'aud_data' is not defined

In [31]:
projected = projected.drop(columns=[
#     'pc3',
    'pc4',
    'pc5',
#     'pc6',
#     'pc7'
])

In [46]:
from sklearn.model_selection import GridSearchCV, train_test_split

X_train, X_val, y_train, y_val = train_test_split(projected,labels,test_size=0.2)
print(X_train.shape, X_val.shape)

(35, 7) (9, 7)


# 8 Results

Carry out your experiments here, explain your results.

In [6]:
from sklearn import svm

parameters = {'C':[1,2,3,4,5,10]}

svc = svm.SVC()
clf = GridSearchCV(svc, parameters,cv=5)

clf.fit(X_train,y_train)

print('Hyperparameters: ', clf.best_estimator_)
print('Average accuracy: ', clf.best_score_)
print('Test dataset accuracy:', clf.score(X_val, y_val))

NameError: name 'GridSearchCV' is not defined

In [34]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

from sklearn.model_selection import GridSearchCV, train_test_split

X_train, X_val, y_train, y_val = train_test_split(projected,labels,test_size=0.2)
print(X_train.shape, X_val.shape)
clf = RandomForestClassifier(max_depth=7)
clf.fit(X_train,y_train)

yt_p = clf.predict(X_train)
yv_p = clf.predict(X_val)

print('Training Accuracy', np.mean(yt_p==y_train))
print('Validation  Accuracy', np.mean(yv_p==y_val))

(1983, 3) (496, 3)
Training Accuracy 0.7100353000504287
Validation  Accuracy 0.5866935483870968


In [25]:
from sklearn import svm

parameters = {'n_estimators':[50,200,300], 'max_features':[0.01,0.05,1]}

rfc = RandomForestClassifier(random_state=0)
clf = GridSearchCV(rfc, parameters,cv=5)

clf.fit(X_train,y_train)

print('Hyperparameters: ', clf.best_estimator_)
print('Average accuracy: ', clf.best_score_)
print('Test dataset accuracy:', clf.score(X_val, y_val))

Hyperparameters:  RandomForestClassifier(max_features=0.01, n_estimators=300, random_state=0)
Average accuracy:  0.5890122891382338
Test dataset accuracy: 0.5907258064516129


In [165]:
model  = svm.SVC(C=1)
model.fit(X_train,y_train)

yt_p = model.predict(X_train)
yv_p = model.predict(X_val)

print('Training Accuracy', np.mean(yt_p==y_train))
print('Validation  Accuracy', np.mean(yv_p==y_val))
print('The support vectors are', model.support_vectors_.shape)

Training Accuracy 0.7685325264750378
Validation  Accuracy 0.7096774193548387
The support vectors are (1216, 6)


In [100]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

clf = DecisionTreeClassifier(max_depth=None, min_samples_split=102,
     random_state=0)
scores = cross_val_score(clf, X_train, y_train, cv=5)
print(scores.mean())

clf = RandomForestClassifier(n_estimators=55, max_depth=None,
     min_samples_split=2, random_state=0)
scores = cross_val_score(clf, X_train, y_train, cv=5)
print(scores.mean())

0.5244675978932907
0.5194272701829377


# 9 Conclusions

Your conclusions, improvements, etc should go here