# COGS 118A - Final Project

# Insert title here

## Group members

- Abraham Torok
- Rohil Ahuja
- Brinda Narayanan
- Jorge Acevado

# Abstract 

Our goal is to build a multi-class classification model that predicts the genre of previously unseen songs. We use the Music Genre Classification dataset, which consists of roughly 18,000 songs described by 15 numerical features and was originally used in an online hackathon. After exploring the data, we selected a random forest as one of our models because it can handle high-dimensional data with many numerical features. We will also train a support vector machine (SVM) as a separate model so that we can compare the accuracy of different types of multi-class models. Performance will be measured with a log-loss metric, which can be cross-referenced with the winners of the original competition as well as with submissions on Kaggle, where the dataset was made publicly available.

###ADD LINE ABOUT RESULTS HERE

# Background

With the rise of streaming services such as Spotify and Apple Music, people are constantly listening to music and trying to expand their taste. As access to music has increased and the number of genres has grown, classifying music into genres has become much more important to the general population. There are 10 major classifications of music: <br>
1. Pop
2. Rock
3. Indie Rock
4. EDM
5. Jazz
6. Country
7. Hip Hop & Rap
8. Classical Music
9. Latin Music
10. K-pop <a name = 'analyticsteps'></a>[<sup>[1]</sup>](#analyticsteps).
<br>

Using machine learning to classify music genres is a relatively new concept, but one that has shown great potential, especially for recommendations. Previous work has used several types of models, including a convolutional neural network that determined the genre from audio signals. The dataset used in that work contained a CSV file with many quantitative measurements of each song, as well as an audio recording of the song. The authors examined plots of the WAV recordings in conjunction with the numerical data and built a classifier that reached 92.93% accuracy <a name = 'clairvoyant'></a>[<sup>[2]</sup>](#clairvoyant). <br>

Kaggle also hosts many other completed projects. These projects use different types of machine learning models to classify songs into genres. One project used PCA and t-SNE to group the data into clusters. Another used a random forest to classify the songs into genres. All of these methods reported high accuracy, and each went on to build a recommendation system on top of the trained model to recommend songs to users <a name="kaggle"></a> [<sup>[3]</sup>](#kaggle).


# Problem Statement

Our problem revolves around classifying songs into music genres. For many listeners, a major difficulty is discovering new music that fits the genres they are interested in. Users do not want to listen to many new songs just to find music in their genre, and then narrow down even further to music that fits their taste. By creating a classifier that groups music into genres, we can help create better recommendations for users looking to discover new songs. We will compare two types of models, a random forest and a support vector machine, and train both of them to classify new songs. <br>

Given numerical measures of a song such as loudness, acousticness, and instrumentalness, we can classify a new song into one of the labels present in the training data. All of these measures can be extracted from the song itself using third-party tools. Because the measurements are taken directly from the songs, they should be easy to replicate. Different software tools may produce slightly different values, but most of these differences should amount to negligible noise. In addition, the problem is quantifiable because we can use error metrics (F1 score and log loss) to quantify the performance of a model for music genre classification.

# Data



Dataset link: https://www.kaggle.com/datasets/purumalgi/music-genre-classification?select=train.csv

The dataset comes from a MachineHack hackathon in which 15 numeric features are provided for each song, along with an artist name and track title. The data is already split into training and testing CSV files, with ~18,000 observations in the training set and ~7,700 observations in the testing set. The features for each observation include musical attributes such as key, tempo, and time signature, as well as perceptual measures such as liveness, danceability, and speechiness. The data is already quite clean: there is little missing data and, apart from the artist name and track title, all fields are numerical, so pre-processing is minimal. We used the simple imputer from sklearn to replace missing values with the column mean.
The critical variables in our dataset include Class, which is our target variable and represents the music genre in numerical format (e.g., 1 for rock). The audio feature variables (danceability, acousticness, etc.) are also critical because they are the inputs used to train the model, and they are likewise represented numerically.



During our exploration of the data, it became apparent that some of our features, such as speechiness and instrumentalness, had a roughly exponential, heavily right-skewed distribution. To reduce the skew and bring these features closer to a normal distribution, we applied a log transformation.

## Features Pre-Transformation
Skew: 1.53 (speechiness) and 3.0 (instrumentalness)
![Screen%20Shot%202023-03-08%20at%201.10.51%20PM.png](attachment:Screen%20Shot%202023-03-08%20at%201.10.51%20PM.png)


## Features Post-Transformation
Skew: -0.2 (speechiness) and 1.16 (instrumentalness)
![Screen%20Shot%202023-03-08%20at%201.11.20%20PM.png](attachment:Screen%20Shot%202023-03-08%20at%201.11.20%20PM.png)

The transformation substantially reduced the skew of both features, which should help the final classifier.
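
The effect of a log transformation on skew can be illustrated with a minimal sketch. The feature values below are made up for illustration, and the use of `log1p` (log of one plus the value, which stays finite at zero) is an assumption, since the exact form of the transformation is not stated here.

```python
import math

def skewness(values):
    """Sample skewness: third central moment divided by the std cubed."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def log_transform(values):
    """Apply log(1 + x); log1p is an assumed choice that keeps zeros finite."""
    return [math.log1p(v) for v in values]

# Hypothetical right-skewed feature values, not taken from the dataset.
raw = [0.01, 0.02, 0.02, 0.03, 0.05, 0.08, 0.10, 0.30, 0.60, 0.95]
transformed = log_transform(raw)

skew_before = skewness(raw)
skew_after = skewness(transformed)
print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")
```

Because the logarithm compresses large values more than small ones, the long right tail is pulled in and the skew of the transformed values is lower than that of the raw values.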

# Proposed Solution


One possible solution to the problem of classifying songs into music genres is to use machine learning algorithms that automatically learn the patterns and characteristics of each genre from its audio features. Two such algorithms are the Support Vector Machine (SVM) classifier and the Random Forest classifier.

To apply either algorithm to our dataset of roughly 18,000 songs and 15 numerical features, we first need to preprocess the data. In this step, we select the relevant features of each song, such as danceability, energy, key, loudness, acousticness, and tempo. These features allow us to compare songs and help in classifying the genres.

Next, we split the dataset into training, validation, and testing sets. The training set is used to train the classifiers, the validation set is used to tune the hyperparameters of each model, and the testing set is used to evaluate the performance of the trained classifier.
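
A three-way split of this kind could be sketched with two calls to scikit-learn's `train_test_split`. The 60/20/20 proportions, the stratification by genre, and the synthetic data are assumptions made for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the song table: 1,000 rows, 15 numeric features,
# 10 genre labels. The real data would be loaded from train.csv instead.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 15))
y = rng.integers(0, 10, size=1000)

# First carve off 20% as the held-out test set, stratified by genre.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0
)
# Then split the remainder 75/25, giving 60/20/20 overall (assumed ratios).
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0
)

print(len(X_train), len(X_val), len(X_test))
```

Stratifying keeps the genre proportions similar across the three sets, which matters when class labels are unevenly distributed.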

To train the random forest, we will use the 15 numerical features as input and the genre label as output. The model will be optimized by tuning hyperparameters such as the number of trees in the forest and the maximum depth of each tree. In training the SVM classifier, we will experiment with different kernel functions and use cross-validation to find the best kernel coefficient and regularization parameter.
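
The tuning described above could be sketched with scikit-learn's `GridSearchCV`. The parameter grids, the weighted F1 scoring, the 3-fold cross-validation, and the synthetic data are all illustrative assumptions rather than the settings used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Small synthetic multi-class problem standing in for the song features.
X, y = make_classification(
    n_samples=300, n_features=15, n_informative=8,
    n_classes=3, random_state=0,
)

# Random forest: search over number of trees and maximum tree depth.
rf_search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [5, 10]},
    scoring="f1_weighted",
    cv=3,
)
rf_search.fit(X, y)

# SVM: search over kernel, regularization strength C, and kernel coefficient.
# Scaling lives inside the pipeline so each fold is scaled on its own data.
svm_search = GridSearchCV(
    Pipeline([("scale", StandardScaler()), ("svc", SVC())]),
    param_grid={
        "svc__kernel": ["linear", "rbf"],
        "svc__C": [0.1, 1, 10],
        "svc__gamma": ["scale", 0.01],
    },
    scoring="f1_weighted",
    cv=3,
)
svm_search.fit(X, y)

print("RF best:", rf_search.best_params_)
print("SVM best:", svm_search.best_params_)
```

The `gamma` setting only affects the RBF kernel; it is ignored when the linear kernel is selected.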

To evaluate the performance of our models, we will use metrics such as accuracy, precision, recall, and F1 score. We will also compare our models against a benchmark model, such as a logistic regression classifier or a K-Nearest Neighbor classifier, both of which have been shown to work well in music genre classification tasks. This comparison will allow us to demonstrate the effectiveness of our proposed solution.

To implement the solution, we can use Python and the scikit-learn library, which offers implementations of the SVM and random forest classifiers. We can use train_test_split from the same library to split the dataset into training and testing sets, and StandardScaler, also from scikit-learn, to standardize the features during preprocessing.

# Evaluation Metrics

Some evaluation metrics we can employ are the F1 score for each one-vs-rest classifier, as well as the log-loss metric that was used to score submissions in the competition this dataset was originally created for. Another possible metric is accuracy.

F1 score = 2 * (precision * recall) / (precision + recall)

Log Loss:

![log_loss_eqn.png](./log_loss_eqn.png)

Accuracy = (True Positives + True Negatives) / (True Positives + True Negatives + False Positives + False Negatives)

The F1 score is the harmonic mean of precision and recall, providing a single balanced measure of the model's performance. We will use a weighted F1 score in order to account for the uneven distribution of class labels in our dataset. The weighted F1 score scales the score for each class by the number of observations in that class. A high F1 score means that the model is correctly identifying the members of each class while keeping the number of false positives and false negatives low. In the context of our problem statement, a high F1 score means that our model is accurately classifying songs into different genres, while also minimizing the risk of recommending songs that do not fit the user's preferred genres.
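
The weighted F1 score can be made concrete with a small hand-rolled sketch. The labels below are hypothetical, and the function is an illustration of the definition above rather than the implementation used in this project (scikit-learn's `f1_score` with `average="weighted"` computes the same quantity).

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = 0.0
    for label, count in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        total += f1 * count
    return total / len(y_true)

# Hypothetical labels: class 0 is three times as common as class 1 or 2.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 2, 0]
score = weighted_f1(y_true, y_pred)
print(round(score, 3))  # 0.793
```

Here the per-class F1 scores are 5/6, 0.8, and 2/3, and weighting them by class sizes 6, 2, and 2 gives about 0.793, so the majority class contributes most to the final score.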

Log loss measures the difference between the predicted probabilities of each class and the true class labels, where a lower log loss value indicates better performance. Log loss takes into account not only the correctness of the predicted class but also the confidence in the prediction, which can be important in cases where some classes are more similar to each other than others. In the context of our problem statement, log loss would provide a more nuanced understanding of the performance of our model by considering the probability distribution of the predicted classes. A low log loss value would indicate that the model is accurately predicting the probabilities of each class, which is important for providing users with relevant and diverse music recommendations.
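
The way log loss rewards confidence can be shown with a short sketch. It assumes the standard multi-class definition (mean negative natural log of the probability assigned to the true class); the probability tables are hypothetical.

```python
import math

def log_loss(y_true, probabilities, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    total = 0.0
    for label, row in zip(y_true, probabilities):
        p = min(max(row[label], eps), 1 - eps)  # clip to avoid log(0)
        total -= math.log(p)
    return total / len(y_true)

y_true = [0, 1, 2]
# Both hypothetical models pick the correct class every time,
# but the first assigns it much higher probability.
confident = [[0.9, 0.05, 0.05], [0.05, 0.9, 0.05], [0.05, 0.05, 0.9]]
hesitant = [[0.4, 0.3, 0.3], [0.3, 0.4, 0.3], [0.3, 0.3, 0.4]]

loss_confident = log_loss(y_true, confident)
loss_hesitant = log_loss(y_true, hesitant)
print(f"{loss_confident:.3f} {loss_hesitant:.3f}")  # 0.105 0.916
```

Both models have identical accuracy, yet the hesitant one is penalized with a much higher log loss, which is the nuance that accuracy alone cannot capture.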

Accuracy is the most straightforward metric for evaluating classification models, as it simply measures the proportion of correct predictions out of all predictions made. In the context of our problem statement, accuracy tells us the percentage of correctly classified songs out of all the songs that were classified. This metric is important because it directly reflects the user experience: the higher the accuracy, the more likely users are to find music in the genres they are interested in.

Ultimately, we decided to focus on accuracy and F1 score as our primary evaluation metrics because they are easier to interpret and more directly relevant to our problem statement. We believe that accurately classifying songs into their respective genres is the most important aspect of our problem, and accuracy and F1 score are well-suited to evaluate this performance.

# Results

### Subsection 1

To summarize, our dataset contains several numerical measures for each song and labels each song with a genre, represented by a numerical value. After examining the dataset and formatting the data so that it was usable for our project, we moved on to model selection. Two models came to mind: support vector machines (SVMs) and random forests. We thought SVMs would be appropriate because this dataset is not high-dimensional and because kernels make SVMs flexible on non-linear data. We also expected the robustness of SVMs to noise to be beneficial for audio-derived data, since audio processing software can easily introduce noise into the measurements. The second model we selected was the random forest classifier, in part because of its flexibility. With some categorical variables and many numerical variables, we thought a random forest would model our data well. In addition, a random forest can help us identify the importance of individual features, which could aid us in training the final model.

Linear regression was one model we chose not to use, because our data was not linearly separable. Logistic regression was another model we chose not to use, because of the non-binary nature of our classification task: with 10 different genres to predict, we considered it impractical. We also considered using a neural network to classify the songs into genres. After weighing a few factors, we felt that a neural network was not the best method for the problem at hand. First, while the dataset is large, with over 17,000 observations, it is not so large that a neural network would outperform other models by enough to justify its computational cost. Second, the relationships in this dataset are complex, but not so complex that less intensive algorithms would necessarily perform worse than a neural network. Finally, given the complexity of a neural network, our final model could be more prone to overfitting, giving us an inaccurate classifier.

Given these factors, we decided to select two methods: support vector machines and random forests.

### Subsection 2

After selecting the types of models we wanted to use, we performed some basic pre-processing to make the data usable. As mentioned in earlier sections, some of our features were highly skewed, so we applied a log transformation to the instrumentalness and speechiness of the songs to make the data more suitable for training our model.

For the Random Forest Model, we additionally One-Hot Encoded the key variable, because...

GridSearchCV

To understand the importance of each feature within the context of our models, we performed permutation feature selection on both model types. The results for the random forest are as follows: <br>

![random_forestFS.png](./random_forestFS.png)

For SVM: <br>

![.png](./.png)
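
A permutation-based importance analysis of the kind described above could be sketched as follows. The use of scikit-learn's `permutation_importance`, the synthetic data, and the model settings are assumptions for illustration; they are not the code that produced the figures in this section.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: with shuffle=False the first 3 columns are informative
# and the remaining 3 are pure noise.
X, y = make_classification(
    n_samples=400, n_features=6, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, shuffle=False, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Shuffle one column at a time on held-out data and record the score drop.
result = permutation_importance(
    model, X_test, y_test, n_repeats=10, random_state=0
)
ranking = result.importances_mean.argsort()[::-1]
for idx in ranking:
    print(f"feature {idx}: {result.importances_mean[idx]:.3f}")
```

Features whose shuffling barely changes the score are candidates for removal, while a large drop marks a feature the model relies on. The same call works for an SVM, since the function only needs a fitted estimator.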

### Subsection 3

Once we had all our preliminary feature selection results, we aimed to train a base model and measure its efficacy.

For the SVM:

For the Random Forest: We performed a train/test split of 0.8/0.2 and trained a random forest model to classify the test data into genres. Using a simple imputer with the mean strategy, we replaced the missing values in each column. We then used a StandardScaler to scale the data before training. In the baseline random forest, we fit the model using 500 trees, a maximum depth of 10 for each tree, and a minimum of 5 samples required to split an internal node. The minimum number of samples required at a leaf node was 3, and the number of features considered when looking for the best split was set to the total number of features.

![RF_baseline.png](./RF_baseline.png)

After training the model on this data, our baseline model performed with an accuracy score of 52.2%. 
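
The baseline configuration described above could be expressed as a scikit-learn pipeline along these lines. The hyperparameters follow the text; the synthetic data, the injected missing values, and the random seeds are assumptions, so the printed accuracy says nothing about the real dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the song features, with some values knocked out
# so the imputer has work to do.
X, y = make_classification(
    n_samples=500, n_features=15, n_informative=8,
    n_classes=4, random_state=0,
)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.02] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

baseline = make_pipeline(
    SimpleImputer(strategy="mean"),
    StandardScaler(),
    RandomForestClassifier(
        n_estimators=500,
        max_depth=10,
        min_samples_split=5,
        min_samples_leaf=3,
        max_features=None,  # consider every feature at each split
        random_state=0,
    ),
)
baseline.fit(X_train, y_train)
accuracy = accuracy_score(y_test, baseline.predict(X_test))
print(f"accuracy on synthetic data: {accuracy:.3f}")
```

Wrapping the imputer and scaler in the pipeline means their statistics are learned from the training split only, which avoids leaking information from the test split.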


### Subsection 4

From here, we progressed into more fine-tuned models.

For the SVM:

For the RF: This time, we created a train/test split of 0.75/0.25 and applied the preprocessing discussed earlier. First, we log-transformed two of the columns, instrumentalness and speechiness. We then removed features that carried no useful signal and could be seen as noise, such as the artist name and the track name. From there, we trained a new model, again using a simple imputer and a standard scaler, but this time with 200 trees, a maximum depth of 20 for each tree, a minimum of 2 samples to split an internal node, and a minimum of 5 samples at a leaf node. The number of features considered when looking for the best split was once again set to the total number of features.

![RF_new.png](./RF_new.png)

After training the model on this data, the new accuracy score was 53.7%, a small improvement over the 52.2% baseline.
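
One way to fold the log transformation into the revised model is a `ColumnTransformer` inside the pipeline, sketched below. The hyperparameters follow the text; the column positions, the use of `log1p`, the synthetic data, and the toy target are assumptions for illustration only.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

rng = np.random.default_rng(0)
n = 400
# Columns 0 and 1 play the role of the skewed features
# (speechiness, instrumentalness); columns 2-5 are other numeric features.
skewed = rng.exponential(scale=0.1, size=(n, 2))
other = rng.normal(size=(n, 4))
X = np.hstack([skewed, other])
y = (other[:, 0] + 5 * skewed[:, 0] > 0.5).astype(int)  # toy target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Log-transform only the two skewed columns; pass the rest through.
log_skewed = ColumnTransformer(
    [("log", FunctionTransformer(np.log1p), [0, 1])],
    remainder="passthrough",
)
tuned = make_pipeline(
    SimpleImputer(strategy="mean"),
    log_skewed,
    StandardScaler(),
    RandomForestClassifier(
        n_estimators=200,
        max_depth=20,
        min_samples_split=2,
        min_samples_leaf=5,
        max_features=None,
        random_state=0,
    ),
)
tuned.fit(X_train, y_train)
test_accuracy = tuned.score(X_test, y_test)
print(f"accuracy on synthetic data: {test_accuracy:.3f}")
```

Text columns such as the artist and track names would be dropped before this step, since the pipeline expects purely numeric input.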

### Subsection 5 

INCLUDE LEARNING CURVES SOMEWHERE


# Discussion

### Interpreting the result

### Limitations

### Ethics & Privacy

There are no privacy issues with this project, as we are using public data from Kaggle. All Kaggle users agree to the terms of use when creating an account, before they can post or download datasets.

A potential ethical concern is that the data is subjective. We do not know how measures such as “speechiness” and “danceability” were determined. However, our goal for this project is to build an ML model that classifies music genres, so this does not prevent us from carrying out the project. The model could also be misled by these measures or overfit.

### Conclusion

# Footnotes
<a name="analyticsteps"></a>1.[^](#analyticsteps): Rawat, Soumyaa. “Music Genre Classification Using Machine Learning.” Analytics Steps, https://www.analyticssteps.com/blogs/music-genre-classification-using-machine-learning. <br> 
<a name="clairvoyant"></a>2.[^](#clairvoyant): “Music Genre Classification Using CNN.” Clairvoyant, https://www.clairvoyant.ai/blog/music-genre-classification-using-cnn. <br>
<a name="kaggle"></a>3.[^](#kaggle): Malgi, Purushottam. “Music Genre Classification.” Kaggle, 7 Aug. 2021, https://www.kaggle.com/datasets/purumalgi/music-genre-classification/code?select=train.csv. 
