# COGS 118A - Project Proposal

# Project Description

You will design and execute a machine learning project. There are a few constraints on the nature of the allowed project. 
- The problem addressed will not be a "toy problem" or "common training students problem" like mtcars, iris, palmer penguins etc.
- The dataset will have >1k observations and >5 variables. I'd prefer more like >10k observations and >10 variables. A general rule is that if you have >100x more observations than variables, your solution will likely generalize a lot better. The goal of training a supervised machine learning model is to learn the underlying pattern in a dataset in order to generalize well to unseen data, so choosing a large dataset is very important.

- The project will include a model selection and/or feature selection component where you will be looking for the best setup to maximize the performance of your ML system.
- You will evaluate the performance of your ML system using more than one appropriate metric
- You will be writing a report describing and discussing these accomplishments


Feel free to delete this description section when you hand in your proposal.

### Peer Review

You will all have an opportunity to look at the Project Proposals of other groups to fuel your creativity and get more ideas for how you can improve your own projects. 

Both the project proposal and project checkpoint will have peer review.

# Names


- Jota Yamaguchi(A15581388)
- David Wang(A16035204)
- Wei Zhong (A15568452)
- Josie Li (A15924327)
- Mary Kovic (A15292606)


# Abstract 
Music has become significantly more accessible with the introduction of streaming applications such as Spotify and SoundCloud. We are no longer limited to listening to the radio or buying CDs. These applications not only let us listen to music wherever and whenever we want, but also expose measurements of audio features computed by the app. In this project, we will predict the genre of a song from the features provided by Spotify. Our dataset from Kaggle contains 6917 songs, each described by 23 columns (shape 6917 x 23). Some of these are simple characteristics such as artist or release date. Others are more complex measures that Spotify derives from the audio, such as ‘danceability’ or ‘liveliness’. Most importantly, the first column contains the genre of each song; this is the target we will predict from the remaining columns. We will measure success by the accuracy of our model, that is, the percentage of songs it assigns to the correct genre.

# Background

Spotify has an enormous library of music, with more than 70 million tracks as of 2021 <a name="ref_1"></a>[<sup>[1]</sup>](https://backlinko.com/spotify-users). With such an expansive array of songs for users to choose from, relevant recommendations are integral to enhancing user experience.

Given the features of a song, Spotify users can be recommended music that more accurately represents their tastes. Furthermore, recommendations can include lesser-known artists to curate a more diversified musical palette. Researchers at Spotify have been working on improving the recommendation system for users. They found that users were more satisfied when recommended music included less popular tracks or tracks that strayed from their usual genre <a name="ref_2"></a>[<sup>[2]</sup>](https://research.atspotify.com/shifting-consumption-towards-diverse-content-via-reinforcement-learning/).

There has been some success in predicting genre based on the waveform of the song <a name="ref_3"></a>[<sup>[3]</sup>](https://towardsdatascience.com/predicting-music-genres-using-waveform-features-5080e788eb64), but we aim to predict genre based on the qualitative characteristics of a song. These characteristics include (but are not limited to) “Danceability”, “Acousticness”, and “Energy” <a name="ref_4"></a>[<sup>[4]</sup>](https://medium.com/@boplantinga/what-do-spotifys-audio-features-tell-us-about-this-year-s-eurovision-song-contest-66ad188e112a).

Songs in a particular genre can be distinguished by a particular style: songs in the same genre bear similar characteristics, form, and/or style <a name="ref_5"></a>[<sup>[5]</sup>](https://towardsdatascience.com/predicting-music-genres-using-waveform-features-5080e788eb64). Therefore, we hypothesize that the characteristics of a song (as indicated above) will predict the genre that the song falls under. Classifying songs by these descriptive characteristics is also a more intuitive approach than working from raw waveforms.

# Problem Statement

In our project, we will try to predict the genre of a song from the features provided in the dataset (23 columns, the first of which is the genre label). Because genres are discrete categories, we will treat this as a supervised classification problem. Most of the features are represented as numerical values, binary (two-valued) categories, or booleans, which lets us express the classification task mathematically. We will use accuracy (the percentage of songs classified correctly) to measure our model's success. If our model is successful, the approach should be replicable for any song on Spotify for which the same features are available.
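As a minimal sketch of the setup described above, the snippet below separates the genre label (first column) from the remaining feature columns and computes accuracy as the fraction of correct predictions. The toy rows, the feature values, and the helper names `split_target`, `accuracy`, and `majority_baseline` are all hypothetical and are not taken from the actual dataset. The majority-class predictor is only an assumed reference point for reading an accuracy score, not the benchmark model for this project.

```python
from collections import Counter


def split_target(rows):
    """Separate the first column (genre label) from the remaining feature columns."""
    labels = [row[0] for row in rows]
    features = [row[1:] for row in rows]
    return features, labels


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on empty inputs")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def majority_baseline(labels):
    """Return the most frequent label (a trivial reference predictor)."""
    return Counter(labels).most_common(1)[0][0]


# Hypothetical toy rows: genre first, then two made-up numeric features.
toy_rows = [
    ("rock", 0.41, 0.88),
    ("rock", 0.35, 0.91),
    ("jazz", 0.55, 0.30),
    ("pop", 0.72, 0.65),
]

features, labels = split_target(toy_rows)

# Predict the most common genre for every song and score it.
baseline_label = majority_baseline(labels)
baseline_preds = [baseline_label] * len(labels)
baseline_accuracy = accuracy(labels, baseline_preds)

print(f"baseline label: {baseline_label}, accuracy: {baseline_accuracy:.2f}")
```

Any trained classifier would be scored with the same `accuracy` function on held-out songs; a score close to the majority-class value would suggest the model has learned little beyond the most common genre.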


# Data

You should have a strong idea of what dataset(s) will be used to accomplish this project. 

If you know what (some of) the data you will use are, please give the following information for each dataset:
- link/reference to obtain it
- description of the size of the dataset (# of variables, # of observations)
- what an observation consists of
- what some critical variables are, how they are represented
- any special handling, transformations, cleaning, etc will be needed

If you don't yet know what your dataset(s) will be, you should be able to describe what you desire in terms of the above bullets

# Proposed Solution

In this section, clearly describe a solution to the problem. The solution should be applicable to the project domain and appropriate for the dataset(s) or input(s) given. Provide enough detail (e.g., algorithmic description and/or theoretical properties) to convince us that your solution is applicable. Make sure to describe how the solution will be tested.  

If you know details already, describe how (e.g., library used, function calls) you plan to implement the solution in a way that is reproducible.

If it is appropriate to the problem statement, describe a benchmark model<a name="sota"></a>[<sup>[3]</sup>](#sotanote) against which your solution will be compared. 

# Evaluation Metrics

Propose at least one evaluation metric that can be used to quantify the performance of both the benchmark model and the solution model. The evaluation metric(s) you propose should be appropriate given the context of the data, the problem statement, and the intended solution. Describe how the evaluation metric(s) are derived and provide an example of their mathematical representations (if applicable). Complex evaluation metrics should be clearly defined and quantifiable (can be expressed in mathematical or logical terms).

# Ethics & Privacy

There are very few ethics and privacy concerns with this project. The dataset comes from the Spotify account “The Sounds of Spotify”, a verified public account run by Spotify that showcases music from every genre, year, etc. The biggest concern is with Spotify’s audio features: we do not know how Spotify computes these features or their values, and we do not know whether they are accurate. However, because all of the songs in the dataset come from Spotify, we can assume that the features were measured in the same manner for every song. We therefore expect the features to be consistent enough across songs to use as inputs for predicting genre.


# Team Expectations 

Put things here that cement how you will interact/communicate as a team, how you will handle conflict and difficulty, how you will handle making decisions and setting goals/schedule, how much work you expect from each other, how you will handle deadlines, etc...
* *Be proactive about deadlines*
* *Complete the work that is assigned to you*
* *Communicate any trouble and ask for help if you need it*
* *Meet deadlines; work should be done before the deadline*
* *Make time for meetings whenever possible*
* *Since we use Discord, check Discord messages often*
* *Discuss any conflict within the group first*
* *Use a majority vote when we have to make decisions*
* *Each person should have an equal amount of work*


# Project Timeline Proposal

Replace this with something meaningful that is appropriate for your needs. It doesn't have to be something that fits this format.  It doesn't have to be set in stone... "no battle plan survives contact with the enemy". But you need a battle plan nonetheless, and you need to keep it updated so you understand what you are trying to accomplish, who's responsible for what, and what the expected due dates are for each item.

| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 4/16  |  9 PM |  Brainstorm topics/questions (all)  | Determine best form of communication; Discuss and decide on final project topic; discuss hypothesis; begin background research | 
| 4/20  |  9 PM |  Do background research on topic (Mary) | Discuss ideal dataset(s) and ethics; draft project proposal (everyone)| 
| 4/22  | 9 PM  | Edit, finalize, and submit proposal; Search for datasets (Jota)  | Discuss Wrangling and possible analytical approaches (Wei); Assign group members to lead each specific part (Josie)  |
| 4/29  | 9 PM  | Import & Wrangle Data; do some EDA (David) | Review/Edit wrangling/EDA; Discuss Analysis Plan   |
| 5/6  | 9 PM | Finalize wrangling/EDA; Begin programming for project (David) | Discuss/edit project code; Complete project |
| 5/13  | 9 PM  | Complete analysis; Draft results/conclusion/discussion (Josie)| Discuss/edit full project |
| 6/8  | Before 11:59 PM  | NA | Turn in Final Project  |

# Footnotes
<a name="footnote1"></a>1.[^](#ref_1): “Spotify User Stats (Updated Oct 2021).” Backlinko, 14 Oct. 2021, https://backlinko.com/spotify-users. <br>

<a name="2"></a>2.[^](#ref_2): “Shifting Consumption towards Diverse Content via Reinforcement Learning.” Spotify Research, 30 Sept. 2021, https://research.atspotify.com/shifting-consumption-towards-diverse-content-via-reinforcement-learning/. <br>

<a name="sotanote"></a>3.[^](#ref_3): Venturott, Pedro Henrique Gomes. “Predicting Music Genres Using Waveform Features.” Medium, Towards Data Science, 3 Mar. 2021, https://towardsdatascience.com/predicting-music-genres-using-waveform-features-5080e788eb64. <br>

<a name="footnote4"></a>4.[^](#ref_4): Plantinga, Bo. “What Do Spotify's Audio Features Tell Us about This Year's Eurovision Song Contest? 🤔.” Medium, 29 Apr. 2018, https://medium.com/@boplantinga/what-do-spotifys-audio-features-tell-us-about-this-year-s-eurovision-song-contest-66ad188e112a. <br>

<a name="footnote5"></a>5.[^](#ref_5): Venturott, Pedro Henrique Gomes. “Predicting Music Genres Using Waveform Features.” Medium, Towards Data Science, 3 Mar. 2021, https://towardsdatascience.com/predicting-music-genres-using-waveform-features-5080e788eb64. <br>