# COGS 118B - Final Project Report

## Introduction

This project analyzes the relationship between a song's genre and its other features (such as beats per minute, energy, valence, and liveness). Using the results of this analysis, we aim to determine to what extent the features of a song decide its genre.

We pursue this goal using three unsupervised machine learning algorithms (K-Means Clustering, Principal Component Analysis (PCA), and Gaussian Mixture Model (GMM)) on song data from the "Spotify - All Time Top 2000s Mega Dataset", available at https://www.kaggle.com/iamsumat/spotify-top-2000s-mega-dataset. More information on the pre-processing of the data, K-Means, and Principal Component Analysis is available in the Methods section.

## Motivation

Music has a number of distinguishing features that group similar songs into categories called genres. As technology plays a growing role in the world of music, these characteristics are becoming increasingly distinct and digitally measurable. We therefore thought it relevant to use the unsupervised machine learning concept of clustering to find out whether these characteristics truly decide which genre a song belongs to. For artists, knowing this can help them keep creating music that fits the genre they hope to belong to.

We selected this dataset because it had enough relevant data for us to work with (20 genres, 8 features, and 1994 songs).

We chose K-Means, PCA, and GMM because we learned these concepts over the course of this class, and by implementing them we hope to find out whether these algorithms lead to similar results when applied to the same dataset.

## Related Work

The most common use of machine learning with music is song recommendation. Most of these algorithms were initially relational rather than based on the musical features of the songs, but many newer projects recommend songs based on the qualities of the music users enjoy rather than on the artist. One such project, described at https://towardsdatascience.com/music-to-my-ears-an-unsupervised-approach-to-user-specific-song-recommendation-6c291acc2c12, uses the Million Song Dataset, which we also considered for our project. However, we chose the dataset used here because it is easier to compute with and more relevant to our question.

## Methods

### Pre-Processing

The first step in pre-processing was exploring the data. The dataset we used has 1994 rows x 14 columns. We checked for missing values and found none.

Next, we extracted the audio features relevant to clustering (columns 4 to 12, excluding column 10) and displayed their distributions. Column 10 is the song length and is therefore irrelevant; the remaining columns are Beats Per Minute (BPM), Energy, Danceability, Loudness (dB), Liveness, Valence, Acousticness, and Speechiness.
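
The missing-value check and the column selection described above can be sketched with pandas as follows. This is a minimal illustration, not the attached code: the two rows are made-up stand-ins for the Kaggle CSV, the column names and their order are assumptions, and the column numbers are assumed to be zero-based positions.

```python
import pandas as pd

# Assumed column layout (14 columns); the real CSV may name or order these differently.
columns = [
    "Title", "Artist", "Top Genre", "Year",
    "Beats Per Minute (BPM)", "Energy", "Danceability", "Loudness (dB)",
    "Liveness", "Valence", "Length (Duration)", "Acousticness",
    "Speechiness", "Popularity",
]

# Made-up rows standing in for the real dataset.
rows = [
    ["Song A", "Artist A", "album rock", 1975, 120, 80, 55, -7, 12, 60, 201, 10, 3, 70],
    ["Song B", "Artist B", "dance pop", 2010, 98, 65, 72, -5, 9, 40, 245, 25, 5, 82],
]
df = pd.DataFrame(rows, columns=columns)

# Total number of missing cells across the whole table.
total_missing = int(df.isna().sum().sum())

# Zero-based positions 4..12, skipping position 10 (song length).
feature_positions = [i for i in range(4, 13) if i != 10]
feats = df.iloc[:, feature_positions]
```

Selecting by position mirrors the description in the text; selecting by column name would be more robust if the column order ever changed.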

Moving forward, we extracted and displayed the 149 different genres in the dataset. However, 149 genres would lead to too many clusters. To combat this, we reduced the number of genres by grouping similar genres together (e.g., 'album rock' and 'alternative rock' are grouped under 'rock'). Additionally, we kept only the most represented genres in the dataset. This resulted in 20 genres. Subsequently, we removed 113 rows containing songs that did not belong to any of these genres.
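
One way to sketch the genre grouping is substring matching against a list of parent genres. The parent list, the sample genre strings, and the `group_genre` helper below are hypothetical; the project's actual 20 genres and grouping rules are in the attached code.

```python
from collections import Counter

# Hypothetical parent genres; order matters because the first match wins.
PARENT_GENRES = ["rock", "pop", "metal", "hip hop"]


def group_genre(raw_genre):
    """Return the first parent genre contained in raw_genre, or None if none matches."""
    for parent in PARENT_GENRES:
        if parent in raw_genre:
            return parent
    return None


# Made-up raw genre labels, one per song.
raw_genres = ["album rock", "alternative rock", "dance pop", "dutch cabaret", "glam metal"]
grouped = [group_genre(g) for g in raw_genres]

# Songs whose genre maps to no parent are dropped, as with the removed rows.
kept = [g for g in grouped if g is not None]

# Counting the grouped genres shows which ones are most represented.
genre_counts = Counter(kept)
```

A genre such as 'pop rock' would match more than one parent, so a substring approach like this needs a deliberate ordering of the parent list.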

The resulting dataset has 1881 rows x 14 columns.

Refer to code attached for implementation and visualization.

### K-Means

K-Means is an algorithm that iterates between two steps. The first step finds the cluster membership of each datapoint while keeping the cluster centers fixed. The cluster centers are randomly initialized, and each datapoint is assigned to the cluster center at the smallest Euclidean distance. The second step repositions each cluster center to the mean position of all datapoints assigned to that cluster. The cluster memberships are kept fixed during this second step.
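
In standard textbook notation (our own choice of symbols, not taken from the attached code), with datapoints $x_n$, cluster centers $\mu_k$, and binary membership indicators $r_{nk}$, the two steps alternately reduce the same objective:

$$
J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \, \lVert x_n - \mu_k \rVert^2
$$

The first step minimizes $J$ over the memberships with the centers fixed, and the second minimizes $J$ over the centers with the memberships fixed:

$$
r_{nk} =
\begin{cases}
1 & \text{if } k = \arg\min_{j} \lVert x_n - \mu_j \rVert^2 \\
0 & \text{otherwise}
\end{cases}
\qquad
\mu_k = \frac{\sum_{n} r_{nk} \, x_n}{\sum_{n} r_{nk}}
$$

Because neither step can increase $J$, the iteration settles at a local minimum, which depends on the random initialization.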

In the context of our project, we executed K-Means twice: pre-PCA and post-PCA. The reason for running the algorithm both times is stated in the Results section below. We stored the relevant features in 'Feats' (BPM, Energy, Danceability, Loudness, Liveness, Valence, Acousticness, Speechiness) and set the number of clusters to 10 (K = 10), where each cluster represents a genre.

With regard to the implementation, the first step is initializing random centroids (cluster centers) in the 'centroids' variable. The remaining steps occur within a while loop that runs until the algorithm converges, that is, until an if statement confirms that the centroids moved by less than 10^-6.

Within the loop, we first calculate the Euclidean distance between each datapoint (a song's feature vector) and each centroid, stored in the variable 'dist_mat'. Next, we define 'rnk', a matrix that stores the cluster membership of each datapoint by assigning it to the closest centroid: the entry for a datapoint and its nearest cluster center is 1, whereas the entries for that datapoint and all other, farther cluster centers are 0. Finally, we move each centroid toward the mean position of the datapoints that are members of its cluster, by calculating the average of all datapoints in that cluster and moving the centroid closer to that average.
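
The loop described above can be sketched in NumPy as follows. This is an illustrative reconstruction, not the attached code: it reuses the names `centroids`, `dist_mat`, and `rnk` from the text, while the `k_means` function, its `tol`, `seed`, and `max_iter` parameters, initialization from randomly chosen datapoints, and moving each centroid all the way to the cluster mean are assumptions.

```python
import numpy as np


def k_means(feats, k, tol=1e-6, seed=0, max_iter=500):
    """Cluster the rows of feats (N x D float array); return (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Assumed initialization: k distinct datapoints chosen at random.
    centroids = feats[rng.choice(len(feats), size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        # dist_mat[n, j]: Euclidean distance from datapoint n to centroid j.
        dist_mat = np.linalg.norm(feats[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist_mat.argmin(axis=1)
        # rnk[n, j] is 1 if datapoint n belongs to cluster j, otherwise 0.
        rnk = np.eye(k)[labels]
        counts = rnk.sum(axis=0)
        # Move each non-empty cluster's centroid to the mean of its members.
        new_centroids = centroids.copy()
        nonempty = counts > 0
        new_centroids[nonempty] = (rnk.T @ feats)[nonempty] / counts[nonempty][:, None]
        # Converged once the centroids move by less than tol.
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break
    return centroids, labels


# Usage on two made-up, well-separated blobs of 2-D points.
rng = np.random.default_rng(1)
blob_a = rng.normal(loc=0.0, scale=0.1, size=(20, 2))
blob_b = rng.normal(loc=5.0, scale=0.1, size=(20, 2))
centroids, labels = k_means(np.vstack([blob_a, blob_b]), k=2)
```

Since the result depends on the random initialization, running the sketch with several seeds and comparing the outcomes is a cheap sanity check.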

Refer to code attached for implementation.

### Principal Component Analysis (PCA)

Principal Component Analysis (PCA) focuses on dimensionality reduction. It projects the data onto a smaller number of orthogonal directions (principal components) that capture most of the variance, which removes redundant dimensions. We are then left with a compact representation of the features that we hope retains the information most relevant to genre, which is our ultimate goal.
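
A minimal sketch of PCA via the singular value decomposition is shown below. The `pca` function and the synthetic data are illustrative assumptions, not the project's implementation; in particular, features on very different scales (such as BPM and loudness) are often standardized first, which this sketch does not do.

```python
import numpy as np


def pca(feats, n_components):
    """Project feats (N x D) onto its top n_components principal directions."""
    mean = feats.mean(axis=0)
    centered = feats - mean
    # Rows of vt are the principal directions, ordered by decreasing singular value.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    # Variance captured along each retained direction.
    explained_variance = singular_values[:n_components] ** 2 / (len(feats) - 1)
    projected = centered @ components.T
    return projected, components, explained_variance


# Made-up data: the third feature is a multiple of the first, so it is redundant.
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))
feats = np.column_stack([base[:, 0], base[:, 1], 2.0 * base[:, 0]])
projected, components, explained_variance = pca(feats, n_components=2)
```

In this example two components reproduce the three-feature data exactly, because the third feature carries no independent information. Note that each component is a combination of the original features rather than a subset of them.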

### GMM

## Results


### K-Means

### PCA

### GMM

## Discussion


## Contributions

Pre-processing: Nour Yehia and Zack

K-Means Algorithm: Bryan and Manan

Principal Component Analysis: Darren

Report + Video Presentation: Rahul

This is the general structure of the division of work for this project. The individual(s) named next to each module above were the leads for that module. However, regular meetings and communication via Discord allowed all team members to work together across all modules of the project.

## Code

The GitHub repository containing our entire project, including code:

https://github.com/DarrenWu42/Cogs118b-Song-Analysis

## References

https://www.kaggle.com/iamsumat/spotify-top-2000s-mega-dataset

https://towardsdatascience.com/music-to-my-ears-an-unsupervised-approach-to-user-specific-song-recommendation-6c291acc2c12 