# COGS 118A - Project Checkpoint

# Names


- Abraham Torok
- Rohil Ahuja
- Brinda Narayanan
- Jorge Acevado

# Abstract 


Our goal is to build a multi-class classification model that predicts the genre of previously unseen songs, using a dataset of roughly 18,000 songs described by 15 numerical features. The dataset was originally used in an online hackathon. After exploring the data, we chose a random forest as one of our models because it can handle high-dimensional data with many numerical features. We will also train an SVM as a separate model so that we can compare accuracy across different types of multi-class models. Performance will be measured with a log-loss metric, which can be cross-referenced against the winners of the original competition and against submissions on Kaggle, where the dataset was made publicly available.

# Background

With the rise of streaming services such as Spotify and Apple Music, people are constantly listening to music and trying to expand their taste. As access to music has grown and the number of genres has increased, distinguishing between genres has become much more important to the general population. There are 10 major classifications of music: <br>
1. Pop
2. Rock
3. Indie Rock
4. EDM
5. Jazz
6. Country
7. Hip Hop & Rap
8. Classical Music
9. Latin Music
10. K-pop <a name = 'analyticsteps'></a>[<sup>[1]</sup>](#analyticstep).
<br>

Using machine learning to classify music genres is a relatively new idea, but it has shown great potential, especially for recommendations. Previous work has approached the task with different types of models; one example used a convolutional neural network that determined the genre from audio signals. The dataset in that work contained a CSV file with many quantitative descriptors of each song, as well as an audio recording of the song. From the audio recordings, the authors examined the waveform plots in conjunction with the numerical data and built a classifier that achieved 92.93% accuracy <a name = 'clairvoyant'></a>[<sup>[2]</sup>](#clairvoyant). <br>

Kaggle also hosts many other completed projects that use different machine learning models to classify songs into genres. One project used PCA and t-SNE to project the songs into a low-dimensional space and group them into clusters. Another used a random forest to classify the songs into genres. These projects reported high accuracy and, once the model was trained, built a recommendation system on top of it to suggest songs to users <a name="kaggle"></a> [<sup>[3]</sup>](#kaggle).


# Problem Statement

Our problem is classifying songs into music genres. For many listeners, a major difficulty is discovering new music within the genres they are interested in. Users do not want to listen to many unfamiliar songs just to find music in their genre, and then narrow the results even further to match their taste. A classifier that groups music by genre can support better recommendations for users looking to discover new songs. We will compare two types of models, a random forest and a support vector machine, and train both to classify new songs. <br>

Given numerical measures of a song such as loudness, acousticness, and instrumentalness, we can classify a new song into one of the genre labels present in the training data. All of these features can be measured from the song itself, using third-party tools to extract them from a new track. Because the measurements are taken directly from the songs, they should also be easy to replicate. As with any software, values may differ slightly between extraction tools, but most of these differences should amount to negligible noise.

# Data

UPDATED FROM PROPOSAL!

You should have obtained and cleaned (if necessary) data you will use for this project.

Please give the following information for each dataset you are using
- link/reference to obtain it
- description of the size of the dataset (# of variables, # of observations)
- what an observation consists of
- what some critical variables are, how they are represented
- any special handling, transformations, cleaning, etc you have done should be demonstrated here!


# Proposed Solution

One possible solution to the problem of classifying songs into music genres is to use machine learning algorithms that automatically learn the patterns and characteristics of each genre from its audio features. Two such algorithms are the Support Vector Machine (SVM) classifier and the Random Forest classifier.

To apply either algorithm to our dataset of roughly 18,000 songs and 15 numerical features, we first need to preprocess the data by extracting the relevant features from each song. These include danceability, energy, key signature, loudness, acousticness, and tempo, among others. Extracting these features lets us compare songs and helps in classifying their genres.

Next, we split the dataset into training, validation, and test sets. The training set is used to train the classifiers, the validation set is used to tune the hyperparameters of each model, and the test set is used to evaluate the performance of the trained classifiers.
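A minimal sketch of the three-way split, assuming a synthetic stand-in for the song data (`make_classification` with 15 features and 5 classes) and a 60/20/20 ratio. Neither the ratio nor the stratification is fixed by this proposal; both are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the song data (assumption): 1,000 rows,
# 15 numerical features, 5 genre labels.
X, y = make_classification(
    n_samples=1000, n_features=15, n_informative=10,
    n_classes=5, random_state=0,
)

# Hold out 20% as the test set, stratified so each genre keeps its share.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0
)

# Split the remaining 80% into 75/25, giving 60/20/20 overall.
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, stratify=y_trainval, random_state=0
)

print(X_train.shape, X_val.shape, X_test.shape)
```

Calling `train_test_split` twice is needed because each call produces only two partitions. Stratifying on the label keeps the genre proportions similar across the three sets, which matters if some genres are rare.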

To train the random forest, we will use the 15 numerical features as input and the genre label as output. The model will be optimized by tuning hyperparameters such as the number of trees in the forest and the maximum depth of each tree. For the SVM classifier, we will experiment with different kernel functions and use techniques like cross-validation to tune the kernel coefficient and the regularization parameter.
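The tuning described above could be sketched with scikit-learn's `GridSearchCV`. The grids, the 3-fold setting, the accuracy scoring, and the synthetic data are all assumptions chosen to keep the example small; they are not the values we have committed to.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the training data (assumption).
X, y = make_classification(
    n_samples=600, n_features=15, n_informative=10,
    n_classes=5, random_state=0,
)

# Random forest: search over number of trees and maximum tree depth.
rf_search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [5, None]},
    cv=3,
    scoring="accuracy",
)
rf_search.fit(X, y)

# SVM: search over kernel, regularization parameter C, and (for RBF only)
# the kernel coefficient gamma. Scaling sits inside the pipeline so it is
# re-fit on the training portion of every fold.
svm_search = GridSearchCV(
    make_pipeline(StandardScaler(), SVC()),
    param_grid=[
        {"svc__kernel": ["linear"], "svc__C": [0.1, 1, 10]},
        {"svc__kernel": ["rbf"], "svc__C": [0.1, 1, 10],
         "svc__gamma": ["scale", 0.01]},
    ],
    cv=3,
    scoring="accuracy",
)
svm_search.fit(X, y)

print("RF :", rf_search.best_params_, round(rf_search.best_score_, 3))
print("SVM:", svm_search.best_params_, round(svm_search.best_score_, 3))
```

Splitting the SVM grid into two dictionaries avoids fitting redundant combinations, since `gamma` has no effect on a linear kernel.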

To evaluate our models, we will use metrics such as accuracy, precision, recall, and F1 score. We will also compare them against a benchmark model such as a logistic regression classifier or a K-Nearest Neighbors classifier, both of which have been shown to work well in music genre classification tasks. This comparison will let us demonstrate the effectiveness of our proposed solution.
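As a rough illustration of this comparison, the sketch below scores a logistic regression baseline and a random forest on the same held-out split. The synthetic data, the macro averaging of precision, recall, and F1, and the model settings are assumptions made for the example only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the song data (assumption).
X, y = make_classification(
    n_samples=1000, n_features=15, n_informative=10,
    n_classes=5, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    "logistic regression (baseline)": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

results = {}
for name, model in models.items():
    y_pred = model.fit(X_train, y_train).predict(X_test)
    # Macro averaging weights every genre equally, regardless of its size.
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_test, y_pred, average="macro", zero_division=0
    )
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
    print(name, {k: round(v, 3) for k, v in results[name].items()})
```

Macro averaging is one reasonable choice for a multi-class problem with uneven genre sizes; weighted or per-class reporting would be equally valid alternatives.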

To implement the solution, we can use Python and the scikit-learn library, which provides implementations of both the SVM and random forest classifiers. We can use `train_test_split` from the same library to split the dataset into training and testing sets, and the `StandardScaler` class from scikit-learn to standardize the features during preprocessing.
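One detail worth showing is that the scaler should learn its statistics from the training set only and then be applied unchanged to the test set. The sketch below assumes synthetic data that has been shifted and stretched to mimic unscaled features; it is not our actual preprocessing code.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the song data (assumption), moved onto an
# arbitrary scale so that standardization has a visible effect.
X, y = make_classification(
    n_samples=1000, n_features=15, n_informative=10,
    n_classes=5, random_state=0,
)
X = X * 50.0 + 100.0

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0
)

# Fit on the training set only, so no test-set information leaks in.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Training features now have mean ~0 and standard deviation ~1 per column;
# test features are close to that but not exact, which is expected.
print(np.round(X_train_scaled.mean(axis=0), 3))
print(np.round(X_train_scaled.std(axis=0), 3))
```

Standardization matters more for the SVM than for the random forest, because tree splits do not depend on the scale of a feature.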

# Evaluation Metrics

TODO
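For reference, the log-loss metric named in the abstract has the following standard multi-class form. The symbols are notation introduced here, not taken from the competition's documentation: $N$ is the number of songs, $K$ the number of genres, $y_{ik}$ is 1 if song $i$ belongs to genre $k$ and 0 otherwise, and $p_{ik}$ is the model's predicted probability that song $i$ belongs to genre $k$.

$$
\mathrm{LogLoss} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} y_{ik} \log p_{ik}
$$

Lower values are better, and a confident wrong prediction is penalized much more heavily than an uncertain one.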

# Preliminary results

NEW SECTION!

Please show any preliminary results you have managed to obtain.

Examples would include:
- Analyzing the suitability of a dataset or algorithm for prediction/solving your problem 
- Performing feature selection or hand-designing features from the raw data. Describe the features available/created and/or show the code for selection/creation
- Showing the performance of a base model/hyper-parameter setting.  Solve the task with one "default" algorithm and characterize the performance level of that base model.
- Learning curves or validation curves for a particular model
- Tables/graphs showing the performance of different models/hyper-parameters



# Ethics & Privacy

If your project has obvious potential concerns with ethics or data privacy discuss that here.  Almost every ML project put into production can have ethical implications if you use your imagination. Use your imagination.

Even if you can't come up with an obvious ethical concern that should be addressed, you should know that a large number of ML projects that go into production have unintended consequences and ethical problems once in production. How will your team address these issues?

Consider a tool to help you address the potential issues such as https://deon.drivendata.org

# Team Expectations 

Put things here that cement how you will interact/communicate as a team, how you will handle conflict and difficulty, how you will handle making decisions and setting goals/schedule, how much work you expect from each other, how you will handle deadlines, etc...
* *Team Expectation 1*
* *Team Expectation 2*
* *Team Expectation 3*
* ...

# Project Timeline Proposal

UPDATE THE PROPOSAL TIMELINE ACCORDING TO WHAT HAS ACTUALLY HAPPENED AND HOW IT HAS AFFECTED YOUR FUTURE PLANS

| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 1/20  |  1 PM |  Brainstorm topics/questions (all)  | Determine best form of communication; Discuss and decide on final project topic; discuss hypothesis; begin background research | 
| 1/26  |  10 AM |  Do background research on topic (Pelé) | Discuss ideal dataset(s) and ethics; draft project proposal | 
| 2/1  | 10 AM  | Edit, finalize, and submit proposal; Search for datasets (Beckenbauer)  | Discuss Wrangling and possible analytical approaches; Assign group members to lead each specific part   |
| 2/14  | 6 PM  | Import & Wrangle Data, do some EDA (Maradona) | Review/Edit wrangling/EDA; Discuss Analysis Plan   |
| 2/23  | 12 PM  | Finalize wrangling/EDA; Begin programming for project (Cruyff) | Discuss/edit project code; Complete project |
| 3/13  | 12 PM  | Complete analysis; Draft results/conclusion/discussion (Carlos)| Discuss/edit full project |
| 3/19  | Before 11:59 PM  | NA | Turn in Final Project  |

# Footnotes
<a name="lorenznote"></a>1.[^](#lorenz): Lorenz, T. (9 Dec 2021) Birds Aren’t Real, or Are They? Inside a Gen Z Conspiracy Theory. *The New York Times*. https://www.nytimes.com/2021/12/09/technology/birds-arent-real-gen-z-misinformation.html<br> 
<a name="admonishnote"></a>2.[^](#admonish): Also refs should be important to the background, not some randomly chosen vaguely related stuff. Include a web link if possible in refs as above.<br>
<a name="sotanote"></a>3.[^](#sota): Perhaps the current state of the art solution such as you see on [Papers with code](https://paperswithcode.com/sota). Or maybe not SOTA, but rather a standard textbook/Kaggle solution to this kind of problem
