# COGS 118A - Project Proposal

# Project Description

You will design and execute a machine learning project. There are a few constraints on the nature of the allowed project. 
- The problem addressed will not be a "toy problem" or a common student training problem like mtcars, iris, palmer penguins, etc.
- The dataset will have >1k observations and >5 variables. I'd prefer more like >10k observations and >10 variables. A general rule is that if you have >100x more observations than variables, your solution will likely generalize a lot better. The goal of training a supervised machine learning model is to learn the underlying pattern in a dataset in order to generalize well to unseen data, so choosing a large dataset is very important.

- The project will include a model selection and/or feature selection component where you will be looking for the best setup to maximize the performance of your ML system.
- You will evaluate the performance of your ML system using more than one appropriate metric.
- You will write a report describing and discussing these accomplishments.


Feel free to delete this description section when you hand in your proposal.

### Peer Review

You will all have an opportunity to look at the Project Proposals of other groups to fuel your creativity and get more ideas for how you can improve your own projects. 

Both the project proposal and project checkpoint will have peer review.

# Names


- Wesley Nguyen
- Jay Buensuceso
- Aniket Dhar
- Juhita Vijjali


# Abstract 
The goal of this project is to design a more organic recommendation system for music, leveraging the Spotify API. The recorded data quantifies various characteristics of songs, including acousticness, danceability, and energy, allowing songs to be compared numerically to one another. With these metrics, the characteristics of a user's given playlist can be quantified, and songs that share similar qualities with those in the playlist can be recommended. Additionally, songs with a lower similarity score can be recommended to determine whether the user may like genres beyond the ones already in their playlist, allowing the recommendation system to feel more organic. The success of this model can be measured by how long and how many times recommended songs are played, as well as by changes in the overall composition of the measured parameters in the user's playlist.
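The comparison described above can be sketched in a few lines. This is a minimal illustration, not the project's final method: cosine similarity, the playlist centroid, the three-feature vectors (acousticness, danceability, energy), and all song names and values are assumptions made up for the example.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def playlist_centroid(playlist):
    """Average each feature across the songs in a playlist."""
    n = len(playlist)
    return [sum(song[i] for song in playlist) / n for i in range(len(playlist[0]))]

def recommend(playlist, candidates, n_similar=2, n_explore=1):
    """Return (similar, exploratory) lists of candidate song names.

    similar: the candidates closest to the playlist centroid.
    exploratory: the least similar of the remaining candidates, included
    to test whether the user enjoys music outside their usual profile.
    """
    centroid = playlist_centroid(playlist)
    ranked = sorted(
        candidates,
        key=lambda name: cosine_similarity(centroid, candidates[name]),
        reverse=True,
    )
    similar = ranked[:n_similar]
    remaining = ranked[n_similar:]
    explore = remaining[-n_explore:] if n_explore else []
    return similar, explore

# Hypothetical feature vectors: [acousticness, danceability, energy]
playlist = [[0.10, 0.80, 0.90], [0.20, 0.70, 0.80]]
candidates = {
    "song_a": [0.12, 0.78, 0.88],
    "song_b": [0.20, 0.70, 0.75],
    "song_c": [0.90, 0.20, 0.10],
    "song_d": [0.50, 0.50, 0.50],
}
similar, explore = recommend(playlist, candidates)
print("similar:", similar, "exploratory:", explore)
```

Separating the "similar" and "exploratory" picks keeps the diversity knob explicit: raising `n_explore` trades familiarity for variety.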

# Background

While choosing a topic for our project, we came across an interesting paper, 'Algorithms and Curated Playlist Effect on Music Streaming Satisfaction'<a name="lorenz"></a>[<sup>[1]</sup>](#lorenznote), which studied algorithmically created playlists and their effects on users. It found that the more a user interacted with the music streaming app, the more satisfied they were with the curated playlist<a name="lorenz"></a>[<sup>[1]</sup>](#lorenznote). If algorithmically curated music has such an effect on listeners, we thought it would be a great idea to create our own program that builds playlists based on the songs listeners like. However, we ran into an issue when we came across a second study, 'Algorithmic Effects on the Diversity of Consumption on Spotify'<a name="admonish"></a>[<sup>[2]</sup>](#admonishnote). This paper explained that algorithmically created playlists have less musical diversity, and that when people listened to more diverse music, they moved away from algorithmic consumption and increased their organic consumption<a name="admonish"></a>[<sup>[2]</sup>](#admonishnote). This flaw in algorithmically curated playlists sparked our group's idea: to create playlists that reflect organically consumed, diverse music as closely as possible.


# Problem Statement

The problem we are attempting to solve is that algorithmic playlists (playlists generated by an algorithm) are not as diverse as organically curated playlists. As described in our background, if a user enjoys an algorithmically curated playlist, there is a higher retention rate on the application. However, the drawback of algorithmic playlists is that they are less diverse than organically curated playlists, which leads users to step away from the algorithmic playlists.

Many algorithms struggle with organic recommendation systems, instead prioritizing the recommendation of content users are already interested in. The interest of users can be quantified by how long and how many times they may engage with a certain creator, piece of media, or other form of content, with better recommendations having greater amounts of engagement than poorer recommendations. Furthermore, time on the platform, like ratios, and user-driven recommendations can be used as further parameters to quantify how good these recommendations are.

Thus, by creating a model that can curate a playlist algorithmically while still offering a sufficiently diverse selection of music, the client, in this case Spotify, can retain the users who would otherwise have stepped away toward more organically diverse music. To achieve this, algorithms must replicate the sporadic nature of organic recommendations and find ways to predict new content the user will enjoy.

# Data


Primary Dataset:

- dataset.csv

- https://www.kaggle.com/datasets/maharshipandya/-spotify-tracks-dataset?resource=download

- Size of dataset: greater than 114K datapoints, 10 variables

- Critical variables: 
    - artists: string
    - track_name: string
    - popularity: number
    - explicit: boolean
    - danceability: number
    - duration_ms: number
    - energy: number

- All other variables can be cleaned out of the training data
    - track_id
    - album_name
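The cleaning step listed above might look roughly like the sketch below. It uses only the standard library and reads from an in-memory string so that it runs on its own; the two sample rows are made up, and the column headers (`track_name`, `album_name`, and so on) are assumed to match the snake_case headers of the Kaggle file. In practice the same function would be given an open handle to `dataset.csv`.

```python
import csv
import io

# Columns kept for training; everything else (e.g. track_id, album_name) is dropped.
KEEP = ["artists", "track_name", "popularity", "explicit",
        "danceability", "duration_ms", "energy"]
NUMERIC = ["popularity", "danceability", "duration_ms", "energy"]

# Two hypothetical rows standing in for dataset.csv (values are invented).
SAMPLE_CSV = """track_id,artists,album_name,track_name,popularity,duration_ms,explicit,danceability,energy
id001,Artist A,Album A,Song A,73,230666,False,0.676,0.461
id002,Artist B,Album B,Song B,55,149610,True,0.42,0.166
"""

def load_tracks(file_obj):
    """Read track rows, keep only the critical variables, and convert types."""
    rows = []
    for raw in csv.DictReader(file_obj):
        row = {col: raw[col] for col in KEEP}
        for col in NUMERIC:
            row[col] = float(row[col])
        # The file is assumed to store booleans as the text "True"/"False".
        row["explicit"] = row["explicit"].strip().lower() == "true"
        rows.append(row)
    return rows

tracks = load_tracks(io.StringIO(SAMPLE_CSV))
for track in tracks:
    print(track)
```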

# Proposed Solution

Our solution to the problem of organic machine recommendation combines batch and stochastic gradient descent with k-fold cross-validation. Since our dataset already contains most of the song data we need, batch gradient descent is well suited to fitting an initial set of weights, which can then be updated in real time using stochastic gradient descent. The online nature of stochastic gradient descent allows the recommendation system to evolve with the user's preferences, growing from the warm start produced by batch gradient descent. The lower per-update computational cost of stochastic gradient descent also makes k-fold cross-validation practical, allowing us to score new recommendations against our proposed metrics and determine how well the model is performing. The stochastic weights can then be adjusted if recommendations are measured to do poorly, or reinforced if they do well. In this way the algorithm can be tested, and it would be a viable approach to the problem of organic recommendation.
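To make the idea concrete, here is a minimal sketch of the batch-then-online training loop with a k-fold helper. Several things are assumptions for illustration only: the model is a plain logistic regression predicting whether a user likes a song, the two features stand for danceability and energy, the six labeled songs are invented, and the learning rates and epoch count are arbitrary.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, x):
    """Probability that the user likes a song; weights[0] is the bias."""
    return sigmoid(weights[0] + sum(w * xi for w, xi in zip(weights[1:], x)))

def log_loss(weights, X, y):
    """Mean cross-entropy loss over a dataset."""
    eps = 1e-12
    total = 0.0
    for xi, yi in zip(X, y):
        p = predict(weights, xi)
        total -= yi * math.log(p + eps) + (1 - yi) * math.log(1 - p + eps)
    return total / len(X)

def batch_gradient_descent(X, y, lr=0.5, epochs=200):
    """Fit initial weights using the full dataset on every step."""
    weights = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for xi, yi in zip(X, y):
            err = predict(weights, xi) - yi
            grad[0] += err
            for j, xij in enumerate(xi):
                grad[j + 1] += err * xij
        weights = [w - lr * g / len(X) for w, g in zip(weights, grad)]
    return weights

def sgd_update(weights, x, y, lr=0.1):
    """One online update from a single new (song, feedback) observation."""
    err = predict(weights, x) - y
    return [weights[0] - lr * err] + [
        w - lr * err * xi for w, xi in zip(weights[1:], x)
    ]

def k_fold_splits(n, k, seed=0):
    """Yield (train_indices, test_indices) for k shuffled folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]

def cross_val_accuracy(X, y, k=3):
    """Fraction of held-out songs classified correctly across all folds."""
    correct = 0
    for train, test in k_fold_splits(len(X), k):
        w = batch_gradient_descent([X[i] for i in train], [y[i] for i in train])
        correct += sum((predict(w, X[i]) >= 0.5) == bool(y[i]) for i in test)
    return correct / len(X)

# Hypothetical songs: [danceability, energy]; label 1 means the user liked it.
X = [[0.9, 0.8], [0.8, 0.9], [0.7, 0.7], [0.2, 0.3], [0.3, 0.2], [0.1, 0.1]]
y = [1, 1, 1, 0, 0, 0]

warm_weights = batch_gradient_descent(X, y)          # warm start from batch training
new_song, liked = [0.2, 0.2], 1                      # the user unexpectedly likes a mellow song
online_weights = sgd_update(warm_weights, new_song, liked)

print("batch loss:", log_loss(warm_weights, X, y))
print("p(like) before:", predict(warm_weights, new_song))
print("p(like) after: ", predict(online_weights, new_song))
print("cross-validated accuracy:", cross_val_accuracy(X, y))
```

The single `sgd_update` call is the piece that would run each time new listening feedback arrives, while `batch_gradient_descent` would only be rerun occasionally on the full dataset.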

# Evaluation Metrics

Given the context of predicting new music content, the primary evaluation metric we will use is accuracy, because we want to determine whether the music our machine learning model predicts is actually music that makes sense to be played. One possible way to measure how accurate our model is would be to create our own playlists (sets of songs) and check whether the music predicted by our model falls within those playlists.

We also plan to experiment with confusion matrices and calculate other metrics such as recall, precision, and F1 score to see what those results can tell us and how they could be used to improve our model.
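The metrics named above all follow from the four cells of a binary confusion matrix. The sketch below assumes a setup where `y_true[i]` is 1 if candidate song `i` appears in a hand-made reference playlist and `y_pred[i]` is 1 if the model recommended it; the label lists are invented for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 from the confusion counts."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical labels for eight candidate songs.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]   # 1 = song is in the reference playlist
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]   # 1 = model recommended the song

print(confusion_counts(y_true, y_pred))
print(classification_metrics(y_true, y_pred))
```

Reporting precision and recall alongside accuracy matters here because most candidate songs will not be in any given playlist, so a model that recommends nothing could still score a high accuracy.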

# Ethics & Privacy

To generate the data the model needs to build a recommended playlist, the user has to provide information about the types of songs they listen to, whether they are fine with explicit content, and other preferences such as whether they want their playlist to be danceable. So that users understand how their data is being used, we plan to state explicitly how their account will be used in conjunction with our project and to adhere to those written conditions.

# Team Expectations 

* Communicate if you are unable to make a meeting; meetings will typically be on Tuesdays at 6 PM
* Ask when you need help; deadlines are normally weekly, so we can all work together
* Don't take on more than you can handle
* If conflict arises, discuss as an entire group, don't make individual decisions
* Check Discord regularly for communication

# Project Timeline Proposal


| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 2/5  |  6 PM |  Brainstorm topics/questions (all)  | Get to know each other; Determine best form of communication; Brainstorm project ideas; discuss hypothesis; begin background research | 
| 2/15  |  2 PM |  Do background research on topic (all) | Continue brainstorming and finalize project topic; Discuss ideal dataset(s) and ethics; Find datasets | 
| 2/21  | 6 PM  | Edit, finalize, and submit proposal (all); Upload datasets (Neel)  | Finalize project proposal; Assign group members to lead each specific part   |
| 2/28  | 6 PM  | Import & Wrangle Data, do some EDA (all) | Review/Edit wrangling/EDA; Discuss Analysis Plan; Start working on Checkpoint: most likely will need to update timeline based on progress   |
| 3/7  | 6 PM  | Finalize wrangling/EDA; Begin programming for project (all) | Discuss/edit project code; Complete and review checkpoint |
| 3/14  | 12 PM  | Complete analysis; Draft results/conclusion/discussion (all) | Discuss/edit full project |
| 3/20  | Before 11:59 PM  | NA | Turn in Final Project  |

# Footnotes
<a name="lorenznote"></a>1.[^](#lorenz): Sanchez, Johny. “Algorithms and Curated Playlist Effect on Music Streaming Satisfaction ...” Texas Christian University, https://repository.tcu.edu/bitstream/handle/116099117/22417/Sanchez__Johny-Honors_Project.pdf. <br> 
<a name="admonishnote"></a>2.[^](#admonish): Anderson, Ashton, et al. “Algorithmic Effects on the Diversity of Consumption on Spotify.” University of Toronto, https://www.cs.utoronto.ca/~ashton/pubs/alg-effects-spotify-www2020.pdf.<br>

