# MusicGroup Machine Learning Test

Marc Siquier Penyafort  
marcsiquierpenyafort@gmail.com

#### Import all necessary packages and functions from utils file
I use `sklearn` to train and test the GMMs and to evaluate the results.   
Warnings are silenced because, with a high train percentage, the test set becomes very small and some classes may end up with 0 predicted instances, which triggers warnings during evaluation.

In [1]:
from utils import *
import warnings
warnings.filterwarnings('ignore')

#### Fetch all json database files in `databaseDir` directory.
**Note:** Please set `databaseDir` to the dataset folder.

In [2]:
databaseDir = '/home/msiquier/Documents/music_group_ml_test/music_group_ml_test_data'
jsonfiles = fetchFiles(databaseDir, '.json')
print("Number of json files fetched: " + str(len(jsonfiles)))

Number of json files fetched: 2831
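The fetching step can be sketched as below. `fetchFiles` lives in the `utils` file and its implementation is not shown here, so `fetch_files_sketch` is a hypothetical stand-in that assumes a recursive search and a case-insensitive extension match.

```python
import os


def fetch_files_sketch(root_dir, extension):
    """Return the sorted paths of all files under root_dir with the given extension.

    Hypothetical stand-in for fetchFiles; the recursive walk and the
    case-insensitive match are assumptions, not the notebook's actual behaviour.
    """
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in filenames:
            if name.lower().endswith(extension.lower()):
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)
```

Calling `fetch_files_sketch(databaseDir, '.json')` would then play the role of the `fetchFiles` call above.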


#### Import and divide dataset 
Here I import all samples from the fetched json files, taking into account that each file can contain more than one sample.   
The `data` structure is a Python dictionary with the instruments as keys; each key maps to a list of feature vectors. It can also be converted to X/Y vectors with `convertToXY`, but I will keep the dict structure.   
Afterwards I randomly divide the dataset into train and test sets (keeping the dict structure), ensuring that every class contributes the given percentage of train samples. 

In [3]:
data = importData(jsonfiles)
#X, Y = convertToXY(data)
train, test = randomTrainTest(data, percentage_train=0.7)
printDataStats(train, test)

Created training and testing sets with the following number of samples:

	Train	Test	Total	Class

	384	164	548	hihat
	1181	506	1687	bass
	2397	1026	3423	guitar
	161	69	230	saxophone
	281	120	401	tom
	698	299	997	vocals
	281	120	401	snare
	546	233	779	piano
	283	120	403	kick

	6212	2657	8869	TOTAL


As we can see, there are 9 classes with a total of 8869 samples.   
Dividing the dataset with a train percentage of 0.7 gives 6212 training samples and 2657 test samples.   
The number of instances differs widely between classes (from 230 saxophone samples to 3423 guitar samples), so the dataset is unbalanced. Probabilistic generative models such as GMMs tend to cope well with this kind of dataset, since each class gets its own model fitted only on its own samples.
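The per-class split can be sketched as follows. `random_train_test_sketch` is a hypothetical stand-in for `randomTrainTest`, whose code is in the `utils` file. Rounding the train size up is an assumption, although it does reproduce the per-class counts printed in the table above.

```python
import math
import random


def random_train_test_sketch(data, percentage_train=0.7, seed=None):
    """Split a {label: [samples]} dict into train/test dicts, class by class."""
    rng = random.Random(seed)
    train, test = {}, {}
    for label, samples in data.items():
        shuffled = list(samples)      # copy so the input dict is left untouched
        rng.shuffle(shuffled)
        # Assumption: the train size is rounded up for every class.
        cut = int(math.ceil(percentage_train * len(shuffled)))
        train[label] = shuffled[:cut]
        test[label] = shuffled[cut:]
    return train, test
```

Splitting each class separately keeps the class proportions of the full dataset in both sets, which matters here because the smallest class has fewer than 250 samples.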

#### Compute and Score GMMs
Here I fit one GMM per class on the train set, using 6 Gaussian components and `full` covariance (each component has its own general covariance matrix).  
The predicted class for each test sample is the one whose GMM gives the best score.   
**Note:** As the dataset is divided randomly, results vary with each execution of the code below.
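To make the decision rule concrete, here is a minimal sketch of "one GMM per class, pick the best score". It is a simplification under stated assumptions: the features are one-dimensional, the mixture parameters in `toy_gmms` are made up for illustration rather than fitted, and all function names are hypothetical. In the notebook the models are fitted with `sklearn` through `computeGMMS`.

```python
import math


def gaussian_logpdf(x, mean, var):
    """Log density of a 1-D Gaussian with the given mean and variance."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def gmm_loglik(x, components):
    """Log-likelihood of x under a mixture given as [(weight, mean, var), ...]."""
    terms = [math.log(w) + gaussian_logpdf(x, m, v) for w, m, v in components]
    top = max(terms)  # log-sum-exp trick for numerical stability
    return top + math.log(sum(math.exp(t - top) for t in terms))


def predict(x, class_gmms):
    """Return the label whose mixture assigns x the highest log-likelihood."""
    return max(class_gmms, key=lambda label: gmm_loglik(x, class_gmms[label]))


# Made-up parameters for two classes, for illustration only.
toy_gmms = {
    "low": [(0.5, -2.0, 1.0), (0.5, 0.0, 0.5)],
    "high": [(0.3, 4.0, 1.0), (0.7, 6.0, 2.0)],
}
```

With these toy parameters, `predict(-1.0, toy_gmms)` selects `"low"` and `predict(5.0, toy_gmms)` selects `"high"`.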

In [4]:
gmms = computeGMMS(train, n_components=6, covariance_type='full')
correct, predicted = scoreGMMS(gmms, test)
classificationReport(correct, predicted)


Classification report

             precision    recall  f1-score   support

       bass       0.94      0.87      0.91       506
     guitar       0.90      0.97      0.93      1026
      hihat       0.86      0.80      0.83       164
       kick       0.89      0.82      0.86       120
      piano       0.98      0.95      0.97       233
  saxophone       0.96      0.77      0.85        69
      snare       0.91      0.80      0.85       120
        tom       0.85      0.82      0.84       120
     vocals       0.86      0.93      0.89       299

avg / total       0.91      0.91      0.91      2657

Confusion Matrix

[[441  52   1   7   0   0   0   5   0]
 [ 12 994   5   0   3   2   3   0   7]
 [  1   1 131   0   0   0   3   0  28]
 [  5   5   0  99   0   0   0  11   0]
 [  0  11   0   0 222   0   0   0   0]
 [  0   8   0   0   1  53   0   0   7]
 [  1  11   8   0   0   0  96   2   2]
 [  5   8   0   5   0   0   3  99   0]
 [  3  11   8   0   0   0   0   0 277]]
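The report and the confusion matrix above carry the same information, which can be checked with a short sketch. It assumes that rows are true classes and columns are predicted classes, in the alphabetical order used by the report. This is consistent with the row sums, which match the `support` column. The helper name `per_class_metrics` is hypothetical.

```python
labels = ["bass", "guitar", "hihat", "kick", "piano",
          "saxophone", "snare", "tom", "vocals"]

# Confusion matrix copied from the output above (rows: true, columns: predicted).
confusion = [
    [441, 52, 1, 7, 0, 0, 0, 5, 0],
    [12, 994, 5, 0, 3, 2, 3, 0, 7],
    [1, 1, 131, 0, 0, 0, 3, 0, 28],
    [5, 5, 0, 99, 0, 0, 0, 11, 0],
    [0, 11, 0, 0, 222, 0, 0, 0, 0],
    [0, 8, 0, 0, 1, 53, 0, 0, 7],
    [1, 11, 8, 0, 0, 0, 96, 2, 2],
    [5, 8, 0, 5, 0, 0, 3, 99, 0],
    [3, 11, 8, 0, 0, 0, 0, 0, 277],
]


def per_class_metrics(matrix, names):
    """Return {label: (precision, recall, support)} from a confusion matrix."""
    metrics = {}
    for i, name in enumerate(names):
        true_pos = matrix[i][i]
        support = sum(matrix[i])                 # samples whose true class is i
        predicted = sum(row[i] for row in matrix)  # samples predicted as class i
        precision = true_pos / float(predicted) if predicted else 0.0
        recall = true_pos / float(support) if support else 0.0
        metrics[name] = (precision, recall, support)
    return metrics


metrics = per_class_metrics(confusion, labels)
correct_total = sum(confusion[i][i] for i in range(len(labels)))
accuracy = correct_total / float(sum(sum(row) for row in confusion))
```

For example, bass recall is 441 / 506 and vocals precision is 277 / 321, which agree with the 0.87 and 0.86 shown in the report. Most of the confusion is concentrated in a few cells, such as 52 bass samples predicted as guitar and 28 hihat samples predicted as vocals.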


#### Fine-Tuning the models
Now that we see that these models work well on this dataset, I will try to fine-tune some of their parameters, namely the number of components and the covariance matrix type. To avoid execution-dependent results, each configuration is run several times (splitting the dataset, training and testing each time; `nruns=20` in the cell below) and the average of the results is returned.

Here we test how the model behaves when varying the number of Gaussian components and the covariance type:

In [5]:
for cov_type in ['spherical', 'diag', 'tied', 'full']:
    # One table per covariance type
    print('\ncovariance\tcomp\tPrec\tRec\tf1-s\n')
    for comp in range(1, 11):
        # Average precision, recall and f1-score over nruns random splits
        stats = runXtimes(data, percentage_train=0.7, n_components=comp, covariance_type=cov_type, nruns=20)
        print('{:<8}\t{}\t'.format(cov_type, comp) + '\t'.join(['%.3f' % a for a in stats]))


covariance	comp	Prec	Rec	f1-s

spherical	1	0.904	0.897	0.897
spherical	2	0.908	0.903	0.903
spherical	3	0.912	0.910	0.909
spherical	4	0.913	0.911	0.911
spherical	5	0.912	0.911	0.910
spherical	6	0.910	0.909	0.908
spherical	7	0.910	0.908	0.906
spherical	8	0.904	0.902	0.899
spherical	9	0.898	0.894	0.889
spherical	10	0.897	0.892	0.887

covariance	comp	Prec	Rec	f1-s

diag    	1	0.905	0.896	0.896
diag    	2	0.908	0.904	0.903
diag    	3	0.909	0.906	0.906
diag    	4	0.912	0.911	0.910
diag    	5	0.913	0.911	0.910
diag    	6	0.910	0.908	0.908
diag    	7	0.910	0.908	0.907
diag    	8	0.907	0.904	0.902
diag    	9	0.900	0.895	0.890
diag    	10	0.894	0.890	0.882

covariance	comp	Prec	Rec	f1-s

tied    	1	0.907	0.899	0.899
tied    	2	0.910	0.905	0.905
tied    	3	0.913	0.910	0.910
tied    	4	0.913	0.911	0.911
tied    	5	0.913	0.912	0.911
tied    	6	0.912	0.910	0.909
tied    	7	0.907	0.905	0.903
tied    	8	0.904	0.902	0.900
tied    	9	0.903	0.899	0.894
tied    	10	0.898	0.893	0.887

covariance	comp	Prec

In [6]:
stats = runXtimes(data, percentage_train=0.7, n_components=4, covariance_type='tied', nruns=10, printa=True)

Iter	Prec	Rec	f1-s

1	0.914	0.911	0.910
2	0.917	0.916	0.915
3	0.911	0.906	0.906
4	0.905	0.900	0.899
5	0.909	0.906	0.906
6	0.917	0.915	0.915
7	0.913	0.910	0.910
8	0.903	0.900	0.899
9	0.917	0.916	0.915
10	0.907	0.906	0.905

avg:	0.911	0.909	0.908
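The `avg` row is the column-wise mean of the ten iterations, which the sketch below reproduces from the printed values. It also computes the sample standard deviation, which is not part of the original output and is added here only to show how much a single random split can move the scores. The variable names are hypothetical.

```python
import math

# (precision, recall, f1-score) per iteration, copied from the output above.
runs = [
    (0.914, 0.911, 0.910), (0.917, 0.916, 0.915), (0.911, 0.906, 0.906),
    (0.905, 0.900, 0.899), (0.909, 0.906, 0.906), (0.917, 0.915, 0.915),
    (0.913, 0.910, 0.910), (0.903, 0.900, 0.899), (0.917, 0.916, 0.915),
    (0.907, 0.906, 0.905),
]


def mean(values):
    return sum(values) / float(len(values))


def sample_std(values):
    """Sample standard deviation (divides by n - 1)."""
    centre = mean(values)
    return math.sqrt(sum((v - centre) ** 2 for v in values) / (len(values) - 1))


columns = list(zip(*runs))               # one tuple per metric
averages = [mean(c) for c in columns]    # about 0.911, 0.909, 0.908
spreads = [sample_std(c) for c in columns]
```

The spread across iterations is roughly half a percentage point, so differences of a few thousandths between two configurations are within the run-to-run noise.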



In [7]:
data_norm = normalize_data(data)
stats = runXtimes(data_norm, percentage_train=0.7, n_components=4, covariance_type='tied', nruns=10, printa=True)

Iter	Prec	Rec	f1-s

1	0.918	0.899	0.901
2	0.919	0.896	0.898
3	0.936	0.932	0.933
4	0.935	0.931	0.931
5	0.933	0.927	0.928
6	0.916	0.890	0.894
7	0.912	0.889	0.891
8	0.906	0.874	0.877
9	0.905	0.885	0.885
10	0.923	0.906	0.910

avg:	0.920	0.903	0.905
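The code of `normalize_data` is in the `utils` file and is not shown here. A common choice is z-score standardisation of each feature, so the sketch below assumes that; `normalize_data_sketch` is a hypothetical stand-in and may differ from what the notebook actually does.

```python
import math


def normalize_data_sketch(data):
    """Standardise every feature to zero mean and unit variance.

    data is a {label: [feature_vector, ...]} dict; the statistics are
    computed over all classes together (an assumption).
    """
    all_vectors = [vec for vectors in data.values() for vec in vectors]
    count = float(len(all_vectors))
    dims = len(all_vectors[0])
    means = [sum(v[d] for v in all_vectors) / count for d in range(dims)]
    stds = []
    for d in range(dims):
        var = sum((v[d] - means[d]) ** 2 for v in all_vectors) / count
        stds.append(math.sqrt(var) or 1.0)  # constant features: avoid dividing by 0
    return {
        label: [[(v[d] - means[d]) / stds[d] for d in range(dims)] for v in vectors]
        for label, vectors in data.items()
    }
```

Note that in the cell above normalization is applied to the whole dataset before `runXtimes` splits it, so any statistics it uses are computed from train and test samples together.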

