Machine learning project

This project tackles a music genre classification challenge proposed on Kaggle. The full details are in the report; here we give a brief summary of the work, from basic exploratory data analysis (EDA) to model selection and performance evaluation.

Exploratory data analysis and preprocessing

The dataset contains about 50,000 instances (each corresponding to a song) described by approximately 18 attributes. A few of the attributes are of no use for the classification task and are eventually discarded. The target attribute, music_genre, gives the genre associated with each song and takes 10 classes (Rock, Alternative, Jazz, Classical, etc.), all balanced with 5,000 instances each.

Looking at the boxplots of each attribute grouped by music genre, we can see that most attributes do not differ noticeably between genres (the picture below shows a sample of the plot).

Based on this analysis, we expect the models to perform poorly on the classification task, because the attributes in the dataset carry little information or variance that separates the genres. Only the Classical genre should be distinguished well from the others.

Missing values were replaced by the attribute's median when they made up less than 10% of the attribute; otherwise the entire attribute was dropped.
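The imputation rule above can be sketched as follows. This is a minimal Python illustration, not the project's own R code: the function name `impute_or_drop`, the column names and the values are made up, and only the 10% threshold comes from the text.

```python
from statistics import median


def impute_or_drop(columns, max_missing=0.10):
    """Fill missing values (None) with the column median, or drop the column.

    `columns` maps a column name to a non-empty list of numbers or None.
    Columns whose share of missing values is below `max_missing` are
    imputed; all other columns are dropped.
    """
    cleaned = {}
    for name, values in columns.items():
        present = [v for v in values if v is not None]
        missing_share = (len(values) - len(present)) / len(values)
        if missing_share >= max_missing:
            continue  # too many nulls: drop the whole attribute
        fill = median(present)
        cleaned[name] = [fill if v is None else v for v in values]
    return cleaned


# Hypothetical toy data: 5% missing in "tempo", 50% missing in "duration_ms".
toy = {
    "tempo": [120.0, None] + [100.0] * 18,
    "duration_ms": [None, None, 200000, 210000],
}
cleaned = impute_or_drop(toy)
print(list(cleaned))  # only the imputed column survives
```

The median is used rather than the mean because it is less sensitive to outliers, which matters for skewed attributes.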

Correlation and PCA

We applied principal component analysis (PCA) to the dataset and computed the correlation matrix. We observed essentially no correlation between attributes, and PCA did not yield components that explain much of the variance, indicating once again that the models will struggle to classify the music genres correctly.
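As an illustration of the correlation part of this step only, a Pearson correlation matrix can be computed as in the sketch below. It is written in plain Python rather than the project's R, and the attribute names and values are invented for the example.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def correlation_matrix(columns):
    """Return a nested dict with the correlation of every pair of columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical toy attributes, not taken from the real dataset.
toy = {
    "energy": [0.2, 0.4, 0.6, 0.8],
    "loudness": [-20.0, -15.0, -10.0, -5.0],
    "tempo": [120.0, 90.0, 90.0, 120.0],
}
matrix = correlation_matrix(toy)
for name, row in matrix.items():
    print(name, {other: round(value, 2) for other, value in row.items()})
```

Values close to 0 off the diagonal correspond to the "no correlation" situation described above; values close to 1 or -1 would indicate redundant attributes.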

Model selection and performance

Classification was performed with a Support Vector Machine (SVM), whose hyperparameters, such as cost and kernel, were tuned by grid search, and a simple neural network (NN) with two hidden layers.
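The idea behind grid search is to evaluate every combination of candidate hyperparameter values and keep the best one. The sketch below shows that loop in plain Python; it is not the project's R code. The grid values are illustrative, and `toy_score` is a made-up placeholder standing in for a real validation score (for example, the cross-validated accuracy of an SVM trained with those parameters).

```python
from itertools import product
from math import log10


def grid_search(param_grid, score_fn):
    """Try every parameter combination and return the best one with its score."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for combo in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, combo))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    """Placeholder score (NOT a real model): peaks at cost=1 with a radial kernel."""
    bonus = 1.0 if params["kernel"] == "radial" else 0.0
    return bonus - log10(params["cost"]) ** 2


# Illustrative grid; the values actually searched in the project are not stated.
grid = {"cost": [0.1, 1, 10], "kernel": ["linear", "radial"]}
best_params, best_score = grid_search(grid, toy_score)
print(best_params, best_score)
```

The cost of an exhaustive search grows with the product of the grid sizes, so grids are usually kept small or coarse.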

As expected, neither model performed well on the classification task, owing to the lack of useful information in the data. According to the confusion matrices, the SVM reaches an average accuracy of 54.0% and the NN, slightly better, an accuracy of 55.6%. Also as expected, only the Classical genre is classified slightly better than the others. These results are in line with those reported by other users on Kaggle, suggesting that a good classification cannot be obtained from these attributes alone.
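For reference, this is how accuracy figures can be read off a confusion matrix. The sketch is in Python and uses an invented 3-class matrix, not the project's results; it assumes that rows are true classes and columns are predicted classes.

```python
def overall_accuracy(cm):
    """Share of all samples that lie on the diagonal (correct predictions)."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total


def per_class_accuracy(cm):
    """For each true class (row), the share of its samples predicted correctly."""
    return [row[i] / sum(row) for i, row in enumerate(cm)]


# Hypothetical confusion matrix: rows = true class, columns = predicted class.
toy_cm = [
    [8, 1, 1],
    [2, 5, 3],
    [1, 3, 6],
]
print(f"overall: {overall_accuracy(toy_cm):.3f}")
print("per class:", per_class_accuracy(toy_cm))
```

With balanced classes, as in this dataset, the overall accuracy equals the mean of the per-class values.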

Improvements using NLP techniques

We had initially discarded the track_name attribute, thinking it was useless for classification purposes. However, simple word clouds of the track names revealed differences between music genres, as can be seen in the picture below.

To use this new information, the top 6 words of each music genre were turned into dummy variables that record whether each word appears in the song name. In this way a bag-of-words technique was used to enrich the data, with the aim of achieving higher performance in the classification task.
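The feature construction described above can be sketched as follows. This is a minimal Python illustration, not the project's R code: the tokenizer, the function names and the track titles are assumptions made for the example, and no stop-word filtering is applied.

```python
import re
from collections import Counter


def tokenize(title):
    """Lowercase a track name and split it into alphabetic words."""
    return re.findall(r"[a-z']+", title.lower())


def top_words_per_genre(tracks, k=6):
    """Return the k most frequent title words for each genre."""
    counts = {}
    for title, genre in tracks:
        counts.setdefault(genre, Counter()).update(tokenize(title))
    return {genre: [w for w, _ in c.most_common(k)] for genre, c in counts.items()}


def dummy_variables(title, vocabulary):
    """One 0/1 indicator per vocabulary word: 1 if the word occurs in the title."""
    words = set(tokenize(title))
    return [1 if word in words else 0 for word in vocabulary]


# Hypothetical (title, genre) pairs, invented for the example.
tracks = [
    ("Symphony No. 5", "Classical"),
    ("Piano Sonata No. 14", "Classical"),
    ("Midnight Blues", "Blues"),
    ("Blues for You", "Blues"),
]
top = top_words_per_genre(tracks, k=2)
vocabulary = sorted({word for words in top.values() for word in words})
print(vocabulary)
print(dummy_variables("Late Night Blues", vocabulary))
```

The resulting indicator columns are appended to the numeric attributes, so the classifier sees both the original attributes and the presence of genre-typical words in the title.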

By also increasing the depth of the NN, we reached a good average accuracy across classes of 72.6% (+/- 19%), significantly improving performance on the classes Alternative, Blues, Country, Rap, Electronic, Hip Hop and Jazz (7 of the 10 classes).
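A figure of the form "mean (+/- spread)" can be obtained from the per-class accuracies as in the sketch below. It assumes that the spread is the population standard deviation across classes, which the text does not state explicitly, and it uses invented values and placeholder genre labels rather than the project's results.

```python
from statistics import mean, pstdev


def summarize(per_class_accuracy):
    """Mean and population standard deviation of per-class accuracies."""
    values = list(per_class_accuracy.values())
    return mean(values), pstdev(values)


# Hypothetical per-class accuracies, not the project's results.
toy = {"Genre A": 0.9, "Genre B": 0.7, "Genre C": 0.5}
average, spread = summarize(toy)
print(f"{average:.1%} (+/- {spread:.1%})")
```

A large spread relative to the mean, as reported above, signals that the average hides substantial differences between easy and hard classes.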

The picture above compares the models trained on the original data (SVM and NN) with the second NN, trained on the data enriched with the bag-of-words technique.

More information is available in the report (Relazione 2021_ProgettoML_AlDaLearn-8.pdf) or, in a shorter form, in the presentation (Presentazione_ML.pdf), both included in this repository.
