**Machine Learning Basic Principles 2018 - Data Analysis Project Report**

# Predicting Music Genre of Songs using Support Vector Classification

## Abstract

*Precise _summary_ of the whole report, previews the contents and results. Must be a single paragraph between 100 and 200 words.*

## 1. Introduction

### 1.1. Background
Aalto University offers a mandatory introductory course on machine learning, Machine Learning Basic Principles. The course deliverables include a Data Analysis Project, which can be done alone or in a team of two students and must be implemented in Python in a Jupyter Notebook.

### 1.2. Problem Statement
For Autumn 2018, the Data Analysis Project tasks participating students with creating a music-genre classification algorithm. The project involves designing a complete machine learning solution that identifies the music genre of songs.

### 1.3. Motivation

### 1.4. Description of Contents


*Background, problem statement, motivation, many references, description of
contents. Introduces the reader to the topic and the broad context within which your
research/project fits*

*- What do you hope to learn from the project?*
*- What question is being addressed?*
*- Why is this task important? (motivation)*

*Keep it short (half to 1 page).*

## 2. Data analysis

*Briefly describe the data (class distribution, dimensionality) and how it will affect
classification. Visualize the data. Don’t focus too much on the meaning of the features,
unless you want to.*

*- Include histograms showing class distribution.*



In [1]:
# Import libraries
import pandas as pd
import numpy as np
from sklearn import preprocessing
from sklearn.svm import SVC
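# GridSearchCV lives in sklearn.model_selection (sklearn.grid_search was removed in scikit-learn 0.20)
from sklearn.model_selection import GridSearchCV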



In [2]:
# Load the data and cleanup
dfx = pd.read_csv("train_data.csv", header = None)
dfy = pd.read_csv("train_labels.csv" , header= None)
dftest =  pd.read_csv("test_data.csv", header = None)
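## Create Label List
# Flatten the single-column label frame into a plain Python list
# (DataFrame.as_matrix() is no longer available in current pandas)
Y_train = dfy.values.ravel().tolist()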


In [3]:
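# Normalise Test and Train data
# Fit the quantile transform on the training data only, then apply the same
# mapping to the test data so that both sets share one scale
scaler = preprocessing.QuantileTransformer()
np_scaled_train = scaler.fit_transform(dfx)
X_train = pd.DataFrame(np_scaled_train)

np_scaled_test = scaler.transform(dftest)
X_test = pd.DataFrame(np_scaled_test)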

In [4]:
# Analysis of the input data


In [5]:
# Correlation Matrix
corr_matrix = X_train.corr()
corr_matrix_absolute = X_train.corr().abs()
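
The class distribution requested for this section could be tabulated along the following lines. This is only a sketch: `toy_labels` is a hypothetical stand-in for `Y_train`, and the text bars replace a proper histogram plot.

```python
import numpy as np

# Hypothetical labels standing in for Y_train (not the project data)
toy_labels = [1, 1, 1, 1, 2, 2, 3, 1, 2, 1]

# Count how often each class occurs
classes, counts = np.unique(toy_labels, return_counts=True)
proportions = counts / counts.sum()

for cls, n, p in zip(classes, counts, proportions):
    # A text bar gives a quick histogram without a plotting dependency
    print(f"class {cls}: {n:3d} ({p:.0%}) " + "#" * n)
```

Running the same counting on `Y_train` would show how imbalanced the genres are, which matters when interpreting accuracy later.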

## 3. Methods and experiments

*- Explain your whole approach (you can include a block diagram showing the steps in your process).* 

*- What methods/algorithms were used, and why were they chosen?*

*- What evaluation methodology (cross CV, etc.).*



In [6]:
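# Feature Reduction using Upper Triangular Matrix from Correlation Matrix
# Keep only the strict upper triangle so that each feature pair is checked once
# (np.bool was removed from NumPy; the built-in bool is equivalent here)
upper_matrix = corr_matrix_absolute.where(np.triu(np.ones(corr_matrix_absolute.shape), k=1).astype(bool))
# Mark every column that correlates above 0.90 with an earlier column
to_drop_index = [column for column in upper_matrix.columns if any(upper_matrix[column] > 0.90)]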

X_train_reduced = X_train.drop(X_train.columns[to_drop_index], axis=1)
X_test_reduced = X_test.drop(X_test.columns[to_drop_index], axis=1)
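
To make the correlation filter above easier to follow, here is the same idea on a three-column toy table. The table and the `toy_*` names are invented for illustration; the 0.90 threshold is the one used in the cell above.

```python
import numpy as np
import pandas as pd

# Toy feature table (hypothetical): column 1 is an exact multiple of column 0,
# column 2 is only weakly related to either
toy = pd.DataFrame({
    0: [1.0, 2.0, 3.0, 4.0, 5.0],
    1: [2.0, 4.0, 6.0, 8.0, 10.0],
    2: [5.0, 1.0, 4.0, 2.0, 3.0],
})

toy_corr = toy.corr().abs()
# Keep only entries strictly above the diagonal so each pair is examined once
mask = np.triu(np.ones(toy_corr.shape), k=1).astype(bool)
toy_upper = toy_corr.where(mask)

# A column is dropped if it correlates strongly with any earlier column
toy_drop = [col for col in toy_upper.columns if (toy_upper[col] > 0.90).any()]
toy_reduced = toy.drop(columns=toy_drop)
print(toy_drop)
```

Only the later column of each highly correlated pair is removed, so one representative of every correlated group stays in the data.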

In [7]:
## WARNING - Do not run this block, as it normally takes 2-5 hours to execute.
## This block exists only to find the best score, kernel, gamma and other values for the SVM used in the next block.
## Running this block is not required for building the model and predicting outcomes.

parameter_candidates = [
  {'C': [1, 10, 100, 1000,10000], 'kernel': ['linear']},
  {'C': [1, 10, 100, 1000,10000], 'gamma': [0.1,0.01,0.001, 0.0001], 'kernel': ['rbf']},
]
clf = GridSearchCV(estimator=SVC(), param_grid=parameter_candidates, n_jobs=-1)
clf.fit(X_train_reduced, Y_train) 
print('Best score for data1:', clf.best_score_) 
print('Best C:',clf.best_estimator_.C) 
print('Best Kernel:',clf.best_estimator_.kernel)
print('Best Gamma:',clf.best_estimator_.gamma)

Best score for data1: 0.6552830621132248
Best C: 10
Best Kernel: rbf
Best Gamma: 0.01
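
The search above does not pass a `cv` argument, so it uses the default cross-validation of `GridSearchCV`. A minimal sketch of making the evaluation methodology explicit follows. It assumes stratified 3-fold splits, a deliberately small grid, and synthetic data from `make_classification` in place of the project files; none of these choices come from the original experiment.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

# Synthetic stand-in for the reduced training set (not the project data)
X_demo, y_demo = make_classification(
    n_samples=120, n_features=8, n_informative=5, n_classes=3, random_state=0
)

# A much smaller grid than the one above, so the sketch runs in seconds
demo_grid = [
    {"C": [1, 10], "kernel": ["linear"]},
    {"C": [1, 10], "gamma": [0.1, 0.01], "kernel": ["rbf"]},
]

# Stratified folds keep class proportions similar in every split
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
search = GridSearchCV(SVC(), demo_grid, cv=cv, scoring="accuracy")
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```

Fixing the splitter and its `random_state` makes the reported best score reproducible between runs, which helps when comparing it against the Kaggle score.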


In [10]:
# Creating ML Model with SVC
# probability=True enables predict_proba, which the log-loss submission needs
model = SVC(kernel='rbf', C=10, gamma=0.01, probability=True).fit(X_train_reduced, Y_train)

In [11]:
# Generating Predictions using Model for both Accuracy and Log-Loss
model_predictions_accuracy = model.predict(X_test_reduced)
model_predictions_log_loss = model.predict_proba(X_test_reduced)
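
The two kinds of predictions above feed two different metrics. As a sketch of how each metric is computed, the following uses invented labels and probabilities for three classes. The test labels are not part of this notebook, so on the project data the same calculation would need a held-out part of the training set.

```python
import numpy as np

# Hypothetical true labels and predicted class probabilities for 3 classes
y_true = np.array([0, 1, 2, 1])
proba = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
    [0.5, 0.4, 0.1],
])

# Accuracy: fraction of samples whose most probable class is the true one
y_pred = proba.argmax(axis=1)
accuracy = (y_pred == y_true).mean()

# Log loss: mean negative log of the probability assigned to the true class
eps = 1e-15  # guards against log(0)
p_true = np.clip(proba[np.arange(len(y_true)), y_true], eps, 1.0)
log_loss = -np.log(p_true).mean()

print(accuracy, log_loss)
```

Accuracy only checks the top-ranked class, while log loss also penalises confident probabilities placed on the wrong class, so the two measures can rank models differently.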

## 4. Results

*Summarize the results of the experiments without discussing their implications.*

*- Include both performance measures (accuracy and LogLoss).*

*- How does it perform on Kaggle compared to the training data?*

*- Include a confusion matrix.*



In [12]:
# Export Predictions to File

## Accuracy
np.savetxt("accuracy_solution.csv", 
           np.dstack((np.arange(1, model_predictions_accuracy.size+1),model_predictions_accuracy))[0],
           delimiter=',', comments="", fmt='%i,%i',
           header="Sample_id,Sample_label")

## Log_loss
# One-column array of sample ids 1..N placed in front of the class probabilities
sample_id_column = np.arange(1, X_test_reduced.shape[0] + 1).reshape(-1, 1)
log_loss_solution = np.hstack((sample_id_column, model_predictions_log_loss))
np.savetxt("log_loss_solution.csv", log_loss_solution, 
           delimiter=',', comments="", fmt=','.join(['%i'] + ['%1.10f']*10),
           header="Sample_id,Class_1,Class_2,Class_3,Class_4,Class_5,Class_6,Class_7,Class_8,Class_9,Class_10")

In [13]:
#Confusion matrix ...
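
The confusion matrix cell above is left unfinished. One possible way to build such a table is sketched below with invented labels. The helper `confusion_matrix_counts` is hypothetical, and the sketch assumes labels encoded from 0; if the genres are numbered from 1, as the submission header suggests, they would need shifting by one first.

```python
import numpy as np

def confusion_matrix_counts(y_true, y_pred, n_classes):
    """Return an (n_classes x n_classes) table: rows = true, columns = predicted."""
    table = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        table[t, p] += 1
    return table

# Hypothetical labels encoded as 0..2 (not the project data)
demo_true = [0, 0, 1, 1, 2, 2, 2]
demo_pred = [0, 1, 1, 1, 2, 0, 2]

cm = confusion_matrix_counts(demo_true, demo_pred, n_classes=3)
print(cm)
```

The diagonal holds the correctly classified samples, so `cm.trace() / cm.sum()` gives the accuracy. Because the test labels are not available here, the table would have to be computed on held-out training data.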

## 5. Discussion/Conclusions

*Interpret and explain your results.*

*- Discuss the relevance of the performance measures (accuracy and LogLoss) for
imbalanced multiclass datasets.*

*- How the results relate to the literature.*

*- Suggestions for future research/improvement.*

*- Did the study answer your questions?*



## 6. References

*List of all the references cited in the document*

## Appendix
*Any additional material needed to complete the report can be included here. For example, if you want to keep additional source code, additional images or plots, mathematical derivations, etc. The content should be relevant to the report and should help explain or visualize something mentioned earlier. **You can remove the whole Appendix section if there is no need for it.***