# COGS 118A - Project Checkpoint

**EDITED PROBLEM STATEMENT AND METRICS SECTION USING FEEDBACK FROM PROPOSAL**

# Names

- Wesley Nguyen
- Jay Buensuceso
- Aniket Dhar
- Juhita Vijjali

# Abstract 
The goal of this project is to design a more organic recommendation system for music, leveraging the Spotify API. The recorded data quantifies various characteristics of songs, including acousticness, danceability, and energy, allowing songs to be compared numerically to one another. With these metrics, the relationships among the songs in a user's given playlist can be quantified, and songs that share similar qualities with those in the playlist can be recommended. Additionally, songs with a smaller similarity score can be recommended to determine whether the user may like genres beyond the ones already in their playlist, allowing the recommendation system to feel more organic. The success of this model can be determined by measuring how long and how many times recommended songs are played, as well as potential changes in the overall composition of measured parameters in the user's playlist.

# Background

While choosing a topic for our project, we came across an interesting paper, 'ALGORITHMS AND CURATED PLAYLIST EFFECT ON MUSIC STREAMING SATISFACTION'<a name="lorenz"></a>[<sup>[1]</sup>](#lorenznote), which studied algorithmically created playlists and their effects on users. It found that the more a user interacted with the music streaming app, the more satisfied they were with the curated playlist<a name="lorenz"></a>[<sup>[1]</sup>](#lorenznote). If algorithmically curated music had such an effect on listeners, we thought it would be a great idea to create our own program that builds playlists based on the songs listeners like. However, we ran into one issue when we came across the study 'Algorithmic Effects on the Diversity of Consumption on Spotify'<a name="admonish"></a>[<sup>[2]</sup>](#admonishnote). This paper explained that algorithmically created playlists have less musical diversity, and that when people listened to more diverse music, they moved away from algorithmic consumption and increased their organic consumption<a name="admonish"></a>[<sup>[2]</sup>](#admonishnote). This flaw in algorithmically curated playlists sparked the idea for our group to attempt to create a playlist that reflects organically consumed, diverse music as closely as possible.

# Problem Statement

The problem we are attempting to solve is that algorithmic playlists, meaning playlists generated by an algorithm, are not as diverse as organically curated playlists. As described in our background, if a user enjoys an algorithmically curated playlist, retention on the application is higher. However, the downside of algorithmic playlists is that they are less diverse than organically curated playlists, which leads users to step away from the algorithmic playlists.

Many algorithms struggle with organic recommendation systems, instead prioritizing the recommendation of content users are already interested in. The interest of users can be quantified by how long and how many times they may engage with a certain creator, piece of media, or other form of content, with better recommendations having greater amounts of engagement than poorer recommendations. Furthermore, time on the platform, like ratios, and user-driven recommendations can be used as further parameters to quantify how good these recommendations are.

Thus, by creating a model that can curate a playlist algorithmically, but also have a diverse enough selection of music, the client, in this case Spotify, can retain the userbase that would have stepped away towards the more organically diverse curated music. Taking this into account, algorithms instead must replicate the sporadicity of organic recommendations, and determine methods of predicting new content the user will enjoy.

We plan to address this using the K Nearest Neighbors (KNN) algorithm to generate a playlist of songs based on an initial query song.

# Data

Primary Dataset:

- dataset.csv

- https://www.kaggle.com/datasets/maharshipandya/-spotify-tracks-dataset?resource=download

- Size of dataset: greater than 114K datapoints, 10 variables

- Critical variables: 
    - artists: string
    - track_name: string
    - popularity: number
    - explicit: boolean
    - danceability: number
    - duration_ms: number
    - energy: number
    
- All other variables can be cleaned out of the training data
    - track_id
    - album_name


In [1]:
%pip install spotipy
%pip install pandas

Note: you may need to restart the kernel to use updated packages.


In [18]:
#Imports and Setup
import pandas as pd
import numpy as np
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
from sklearn import preprocessing
from sklearn.neighbors import NearestNeighbors
import random

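#Set up spotipy
#Read the API credentials from environment variables instead of hardcoding them in the notebook
import os
cid = os.environ['SPOTIPY_CLIENT_ID']
secret = os.environ['SPOTIPY_CLIENT_SECRET']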

client_credentials_manager = SpotifyClientCredentials(client_id=cid, client_secret=secret)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)

In [19]:
df = pd.read_csv('dataset.csv')
df.head()

|  | Unnamed: 0 | track_id | artists | album_name | track_name | popularity | duration_ms | explicit | danceability | energy | ... | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | time_signature | track_genre |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 5SuOikwiRyPMVoIQDJUgSV | Gen Hoshino | Comedy | Comedy | 73 | 230666 | False | 0.676 | 0.461 | ... | -6.746 | 0 | 0.143 | 0.0322 | 1e-06 | 0.358 | 0.715 | 87.917 | 4 | acoustic |
| 1 | 1 | 4qPNDBW1i3p13qLCt0Ki3A | Ben Woodward | Ghost (Acoustic) | Ghost - Acoustic | 55 | 149610 | False | 0.42 | 0.166 | ... | -17.235 | 1 | 0.0763 | 0.924 | 6e-06 | 0.101 | 0.267 | 77.489 | 4 | acoustic |
| 2 | 2 | 1iJBSr7s7jYXzM8EGcbK5b | Ingrid Michaelson;ZAYN | To Begin Again | To Begin Again | 57 | 210826 | False | 0.438 | 0.359 | ... | -9.734 | 1 | 0.0557 | 0.21 | 0.0 | 0.117 | 0.12 | 76.332 | 4 | acoustic |
| 3 | 3 | 6lfxq3CG4xtTiEg7opyCyx | Kina Grannis | Crazy Rich Asians (Original Motion Picture Sou... | Can't Help Falling In Love | 71 | 201933 | False | 0.266 | 0.0596 | ... | -18.515 | 1 | 0.0363 | 0.905 | 7.1e-05 | 0.132 | 0.143 | 181.74 | 3 | acoustic |
| 4 | 4 | 5vjLSffimiIP26QG5WcN2K | Chord Overstreet | Hold On | Hold On | 82 | 198853 | False | 0.618 | 0.443 | ... | -9.681 | 1 | 0.0526 | 0.469 | 0.0 | 0.0829 | 0.167 | 119.949 | 4 | acoustic |


In [20]:
#Drop identifier columns that are not useful as model features
df = df.drop(columns=['Unnamed: 0', 'track_id', 'album_name'])
df.head()

|  | artists | track_name | popularity | duration_ms | explicit | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | time_signature | track_genre |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Gen Hoshino | Comedy | 73 | 230666 | False | 0.676 | 0.461 | 1 | -6.746 | 0 | 0.143 | 0.0322 | 1e-06 | 0.358 | 0.715 | 87.917 | 4 | acoustic |
| 1 | Ben Woodward | Ghost - Acoustic | 55 | 149610 | False | 0.42 | 0.166 | 1 | -17.235 | 1 | 0.0763 | 0.924 | 6e-06 | 0.101 | 0.267 | 77.489 | 4 | acoustic |
| 2 | Ingrid Michaelson;ZAYN | To Begin Again | 57 | 210826 | False | 0.438 | 0.359 | 0 | -9.734 | 1 | 0.0557 | 0.21 | 0.0 | 0.117 | 0.12 | 76.332 | 4 | acoustic |
| 3 | Kina Grannis | Can't Help Falling In Love | 71 | 201933 | False | 0.266 | 0.0596 | 0 | -18.515 | 1 | 0.0363 | 0.905 | 7.1e-05 | 0.132 | 0.143 | 181.74 | 3 | acoustic |
| 4 | Chord Overstreet | Hold On | 82 | 198853 | False | 0.618 | 0.443 | 2 | -9.681 | 1 | 0.0526 | 0.469 | 0.0 | 0.0829 | 0.167 | 119.949 | 4 | acoustic |


# Proposed Solution

Our solution to the problem of organic machine recommendation is to implement both batch and stochastic gradient descent, together with k-fold cross validation, to create a system that can recommend songs organically. Since our dataset already contains most of the song data we need, batch gradient descent is well suited to creating an initial set of weights for the algorithm, which can then be updated in real time using stochastic gradient descent. The online nature of stochastic gradient descent allows the recommendation system to evolve with the user's preferences and grow from the warm start generated by batch gradient descent. The lower computational cost of stochastic gradient descent also makes k-fold cross validation practical, allowing us to score new recommendations against our proposed metrics and determine how well the model is operating. The stochastic weights can then be adjusted if recommendations perform poorly, or reinforced if they perform well. In this manner, the algorithm can be tested and would be a viable approach to the problem of organic recommendation.
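As a rough illustration of the warm start followed by online updates described above, here is a minimal sketch. It assumes a plain linear model trained on squared error, with synthetic song features and a made-up "engagement score" as the target; the names `initial_taste`, `drifted_taste`, `batch_gradient_descent`, and `sgd_update` are hypothetical and not part of our notebook.

```python
import numpy as np

def batch_gradient_descent(X, y, lr=0.5, epochs=500):
    """Fit linear weights by full-batch gradient descent on mean squared error."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        grad = 2.0 * X.T @ (X @ w - y) / len(y)  # gradient over the whole dataset
        w -= lr * grad
    return w

def sgd_update(w, x, y, lr=0.05):
    """Apply one stochastic (online) update from a single new observation."""
    grad = 2.0 * (x @ w - y) * x
    return w - lr * grad

rng = np.random.default_rng(0)

# Hypothetical data: 3 song features in [0, 1] and a synthetic engagement score.
initial_taste = np.array([0.5, -0.2, 0.8])
X = rng.uniform(0.0, 1.0, size=(200, 3))
y = X @ initial_taste

# Warm start: learn the initial weights from the full dataset.
w = batch_gradient_descent(X, y)
warm_start_error = np.linalg.norm(w - initial_taste)

# The user's taste drifts; update the weights one observation at a time.
drifted_taste = np.array([0.1, 0.3, 0.8])
error_before = np.linalg.norm(w - drifted_taste)
for _ in range(2000):
    x_new = rng.uniform(0.0, 1.0, size=3)
    w = sgd_update(w, x_new, x_new @ drifted_taste)
error_after = np.linalg.norm(w - drifted_taste)

print(f"warm-start error: {warm_start_error:.4f}")
print(f"error vs. drifted taste: {error_before:.3f} -> {error_after:.3f}")
```

In this toy setting the batch step recovers the initial weights and the online updates then track the drifted preferences. Real listening data would be noisy, so the learning rates here should be treated as placeholders.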

# Evaluation Metrics

Given the context of predicting new music content, the evaluation metric we will be using is accuracy. This is because we want to determine whether the music that our machine learning model predicts is actually music that makes sense to be played. One possible way to determine how accurate our model is would be to create our own playlists/sets of songs and determine whether the music predicted by our model falls in that playlist.

We also plan to play around with confusion matrices and calculating other metrics like recall, precision, and F1 scores to see what those results could tell us and how they could possibly be used to better our model. 

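$$\text{accuracy} = \frac{TP + TN}{P + N}$$

$$\text{recall} = \frac{TP}{P}$$

$$\text{precision} = \frac{TP}{PP}$$

$$F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$

Here $TP$ and $TN$ are the numbers of true positives and true negatives, $P$ and $N$ are the numbers of actual positives and negatives, and $PP$ is the number of predicted positives. Precision is also known as the positive predictive value (PPV) and recall as the true positive rate (TPR).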
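To make these metrics concrete, the following sketch computes them for a made-up example. It assumes each candidate song is labeled 1 if it belongs to a hand-built reference playlist and 0 otherwise, and that the model's output is a 0/1 "recommended" flag per song; the function `classification_metrics` and the lists `in_playlist` and `recommended` are hypothetical.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, recall, precision, and F1 from binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    recall = tp / (tp + fn) if (tp + fn) else 0.0      # TP / P
    precision = tp / (tp + fp) if (tp + fp) else 0.0   # TP / PP
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "recall": recall, "precision": precision, "f1": f1}

# Hypothetical labels: 1 = song is in the reference playlist / was recommended.
in_playlist = [1, 1, 1, 0, 0, 0, 1, 0]
recommended = [1, 0, 1, 0, 1, 0, 1, 0]

metrics = classification_metrics(in_playlist, recommended)
print(metrics)
```

With 3 true positives, 3 true negatives, 1 false positive, and 1 false negative, all four metrics come out to 0.75 for this toy example.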

# Preliminary results





In [64]:
#Label-encode categorical variables (the one-hot encoding attempt is commented out below)
bool_var = df.select_dtypes(include= ['boolean'])

encoder = preprocessing.LabelEncoder()
label = bool_var.apply(encoder.fit_transform)
df['explicit'] = encoder.fit_transform(df['explicit'])
df['genre_code'] = encoder.fit_transform(df['track_genre'])
#onehot_encoder = preprocessing.OneHotEncoder()
#genredf = pd.DataFrame(onehot_encoder.fit_transform(df[['track_genre']]).toarray(), columns = df['track_genre'].unique())
#artistdf = pd.DataFrame(onehot_encoder.fit_transform(df[['artists']]).toarray(), columns = df['artists'].unique())
#df = df.join(genredf)
#df = df.join(artistdf)

In [69]:
#Keep only the features used for the initial KNN test
predf = df.drop(columns=['artists', 'popularity', 'track_genre', 'track_name', 'valence', 'instrumentalness', 'acousticness', 'duration_ms', 'key'])
predf.head()

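|  | explicit | danceability | energy | loudness | mode | speechiness | liveness | tempo | time_signature | genre_code |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.676 | 0.461 | -6.746 | 0 | 0.143 | 0.358 | 87.917 | 4 | 0 |
| 1 | 0 | 0.42 | 0.166 | -17.235 | 1 | 0.0763 | 0.101 | 77.489 | 4 | 0 |
| 2 | 0 | 0.438 | 0.359 | -9.734 | 1 | 0.0557 | 0.117 | 76.332 | 4 | 0 |
| 3 | 0 | 0.266 | 0.0596 | -18.515 | 1 | 0.0363 | 0.132 | 181.74 | 3 | 0 |
| 4 | 0 | 0.618 | 0.443 | -9.681 | 1 | 0.0526 | 0.0829 | 119.949 | 4 | 0 |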


In [70]:
#Copying 1 song here from predf due to large dataset size - can be replaced with mean album parameters later with spotipy
example_user = predf.iloc[90000]
k = 5
model = NearestNeighbors(n_neighbors=k, algorithm="auto", metric='euclidean')

In [71]:
model.fit(predf)
distances, indices = model.kneighbors([example_user])



In [72]:
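#Shuffle a copy of the neighbor indices so the playlist is not ordered strictly by distance
recommendations = indices[0].copy()
np.random.shuffle(recommendations)
#Keep at most 20 songs (with k = 5 this keeps all 5 neighbors)
recommendations = recommendations[:20]
recommended_songs = df.iloc[recommendations, :]
recommended_songs.head()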

|  | artists | track_name | popularity | duration_ms | explicit | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | time_signature | track_genre | genre_code |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 90538 | Grupo Lluvia | Una Vieja Canción De Amor | 32 | 210053 | 0 | 0.702 | 0.351 | 7 | -14.238 | 1 | 0.0304 | 0.227 | 9e-06 | 0.11 | 0.599 | 99.514 | 3 | rock-n-roll | 91 |
| 91200 | Elvis Presley | Can't Help Falling in Love | 80 | 182360 | 0 | 0.396 | 0.293 | 2 | -14.062 | 1 | 0.0275 | 0.941 | 0.000196 | 0.105 | 0.343 | 100.307 | 3 | rock | 90 |
| 92000 | Elvis Presley | Can't Help Falling in Love | 80 | 182360 | 0 | 0.396 | 0.293 | 2 | -14.062 | 1 | 0.0275 | 0.941 | 0.000196 | 0.105 | 0.343 | 100.307 | 3 | rockabilly | 92 |
| 90000 | Elvis Presley | Can't Help Falling in Love | 80 | 182360 | 0 | 0.396 | 0.293 | 2 | -14.062 | 1 | 0.0275 | 0.941 | 0.000196 | 0.105 | 0.343 | 100.307 | 3 | rock-n-roll | 91 |
| 90666 | Johnny Rivers | Tracks Of My Tears | 31 | 180800 | 0 | 0.423 | 0.39 | 2 | -13.829 | 1 | 0.0292 | 0.591 | 0.0 | 0.312 | 0.7 | 100.475 | 4 | rock-n-roll | 91 |


Preliminary results show that our dataset contains duplicate values. Using the test case of Elvis Presley's "Can't Help Falling in Love", we find three copies of the song in the dataset that are identical apart from their genre label, which skews our KNN algorithm.
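One possible way to handle these duplicates is sketched below. It assumes that two rows with the same `artists` and `track_name` are the same song, which would discard the alternate genre labels and could also merge genuinely different recordings that share a title. The small `songs` frame is a hypothetical stand-in built from the rows shown above, not the full dataset.

```python
import pandas as pd

# Hypothetical miniature of the dataset, modeled on the duplicated rows above.
songs = pd.DataFrame({
    "artists": ["Elvis Presley", "Elvis Presley", "Elvis Presley", "Johnny Rivers"],
    "track_name": ["Can't Help Falling in Love"] * 3 + ["Tracks Of My Tears"],
    "tempo": [100.307, 100.307, 100.307, 100.475],
    "track_genre": ["rock-n-roll", "rock", "rockabilly", "rock-n-roll"],
})

# Keep the first occurrence of each (artists, track_name) pair.
deduped = (
    songs.drop_duplicates(subset=["artists", "track_name"], keep="first")
         .reset_index(drop=True)
)
print(deduped)
```

Resetting the index keeps positional lookups with `iloc` aligned with the rows passed to the KNN model after the duplicates are removed.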

For this initial test, we chose to include only the features explicit, danceability, energy, loudness, mode, speechiness, liveness, tempo, time_signature, and genre_code in order to retrieve music similar to the original query sent into the model. This is because popularity and duration were being weighted more heavily than the other features, leading to songs from wildly different genres being included in the recommendations (one test had heavy metal included with lullaby songs).

However, upon further investigation, we believe that we should normalize the features and reintroduce duration and popularity to the model to see whether the recommendations improve.
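The effect of normalization on a Euclidean KNN can be sketched with a toy example. The four songs below are invented so that `duration_ms`, which is measured in the hundreds of thousands, dominates the raw distance. Standardizing with scikit-learn's `StandardScaler` is one possible choice of normalization, not necessarily the one we will settle on, and `nearest_other_song` is a hypothetical helper.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Hypothetical toy songs; columns: danceability, tempo, duration_ms
features = np.array([
    [0.40, 100.0, 180000.0],  # 0: query song
    [0.42, 101.0, 240000.0],  # 1: similar feel, different length
    [0.95, 170.0, 181000.0],  # 2: different feel, similar length
    [0.10,  60.0, 300000.0],  # 3: different on every feature
])

def nearest_other_song(X, query_index=0):
    """Return the index of the closest song other than the query itself."""
    model = NearestNeighbors(n_neighbors=2, metric="euclidean").fit(X)
    _, indices = model.kneighbors(X[[query_index]])
    return int(indices[0][1])  # position 0 is the query at distance zero

raw_neighbor = nearest_other_song(features)
scaled_neighbor = nearest_other_song(StandardScaler().fit_transform(features))

print("nearest without scaling:", raw_neighbor)
print("nearest with scaling:", scaled_neighbor)
```

Without scaling, the nearest song is the one with a similar length (song 2); after standardization, each feature contributes on a comparable scale and the song with a similar feel (song 1) is returned instead.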

# Ethics & Privacy

In order to generate the data that the model will use to produce a recommended playlist, the user has to input information about the types of songs they listen to, whether they are fine with explicit content, and other preferences such as whether they want their playlist to be danceable. So that users understand how their data is being used, we plan to state explicitly how their account would be used in conjunction with our project and to stick to those written conditions.

# Team Expectations 

* Communicate if you are unable to make a meeting; meetings will typically be on Tuesdays at 6 PM
* Ask when you need help; deadlines are normally weekly so we can all work together
* Don't take on more than you can handle
* If conflict arises, discuss it as an entire group; don't make individual decisions
* Check Discord regularly for communication

# Project Timeline Proposal

| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 2/5  |  6 PM |  Brainstorm topics/questions (all)  | Get to know each other; Determine best form of communication; Brainstorm project ideas; discuss hypothesis; begin background research | 
| 2/15  |  2 PM |  Do background research on topic (all) | Continue brainstorming and finalize project topic; Discuss ideal dataset(s) and ethics; Find datasets | 
| 2/21  | 6 PM  | Edit, finalize, and submit proposal (all); Upload datasets (Neel)  | Finalize project proposal; Assign group members to lead each specific part   |
| 2/28  | 6 PM  | Import & Wrangle Data, do some EDA (all) | Review/Edit wrangling/EDA; Discuss Analysis Plan; Start working on Checkpoint: most likely will need to update timeline based on progress   |
| 3/7  | 6 PM  | Finalize wrangling/EDA; Begin programming for project (all) | Discuss/edit project code; Complete and review checkpoint |
| 3/14  | 12 PM  | Complete analysis; Draft results/conclusion/discussion (all) | Discuss/edit full project |
| 3/20  | Before 11:59 PM  | NA | Turn in Final Project  |

# Footnotes
<a name="lorenznote"></a>1.[^](#lorenz): Sanchez, Johny. “Algorithms and Curated Playlist Effect on Music Streaming Satisfaction ...” Texas Christian University, https://repository.tcu.edu/bitstream/handle/116099117/22417/Sanchez__Johny-Honors_Project.pdf. <br> 
<a name="admonishnote"></a>2.[^](#admonish): Anderson, Ashton, et al. “Algorithmic Effects on the Diversity of Consumption on Spotify.” University of Toronto, https://www.cs.utoronto.ca/~ashton/pubs/alg-effects-spotify-www2020.pdf.<br>