<a href="https://colab.research.google.com/github/CS418/group-project-blackbox/blob/main/Spotify.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Project Introduction**
This project predicts the popularity of songs based purely on song metrics such as key, danceability, and acousticness. Predicting in advance whether something will become popular is an important research subject in many industries, and it has recently become especially important for the growing, competitive music industry. With the wide use of digital music platforms (Spotify, Billboard, Lastfm), data is easy to obtain and listening behavior is easy to observe. This makes forecasting more practical, and the same data is frequently used in recommendation systems.

**Any changes?**
Yes. We are currently focusing on the first hypothesis and on understanding the preferences of an individual user rather than comparing data across continents.
We are starting by understanding which features lead a user to like a song.
The hypothesis we are trying to gain insight into is:
the chances of a user liking a song, based on their current favorite song.


In [None]:
from google.colab import drive 
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


# Data Extraction


### Extract Song titles from my Playlist using Spotify API
### API :  GET https://api.spotify.com/v1/playlists/{playlist_id}/tracks


In [None]:
import json

# Load the playlist response saved from the Spotify API
with open('gdrive/My Drive/playlist.json') as f:
    data = json.load(f)

# Collect each track's id and name, plus a plain list of ids for later requests
track_ids_names = []
track_ids = []
for d in data['items']:
    track_ids_names.append({'id': d['track']['id'], 'name': d['track']['name']})
    track_ids.append(d['track']['id'])
track_ids_names

[{'id': '7zqt0ygYwyKODlFdQei9JP', 'name': 'Numb - Acoustic'},
 {'id': '2o4KEc2eSVUytZvTzrmyDB', 'name': 'Let Her Go'},
 {'id': '095AYlZ56ruP67DLDqq1KG', 'name': 'Bleeding Love - Acoustic'},
 {'id': '2AMgVizCthD4TDGPzh8mnU', 'name': 'Unstoppable - Acoustic'},
 {'id': '3LK7SPiHGMXhNgNGj8zCdd', 'name': "Say You Won't Let Go - Acoustic"},
 {'id': '31NgcsBeirPuW7KOM5Em7N', 'name': 'Young Dumb & Broke - Acoustic'},
 {'id': '0XQl455sLGJYqFRXQLxonl', 'name': "It'll Be Okay - Acoustic"},
 {'id': '0KfTSjLex3EzAQRKlSyEOi', 'name': 'Sunroof - Acoustic'},
 {'id': '4Dsu3mYLx5yrwYRNC6068w', 'name': 'Sign Of The Times - Acoustic'},
 {'id': '11WCNXGNBU8KOjSbrmHJpV', 'name': 'Stay - Acoustic'},
 {'id': '1r7dJi79oGDYcvGGYbG0bn', 'name': 'Young And Beautiful - Acoustic'},
 {'id': '3bygSayUYIsaqYfP9Wo8U9', 'name': 'Love Song - Acoustic'},
 {'id': '1IlCOX7DvLFOngKIP5IsTB', 'name': 'Easily - Acoustic'},
 {'id': '5ii7QwYTHiOk989QnuL3zQ', 'name': 'Runaway - Acoustic'},
 {'id': '5Ep5Fw8d3M5YzIglXziYCT', 'name':

## Extract track features of all the songs in the playlist


In [None]:
import requests

headers = {
    'Accept': 'application/json',
    'Content-Type': 'application/json',
    'Authorization': 'Bearer BQApG4L-oRYsrw68aQ4fsKy-Ua_FbeDxvIrh9WR1EmaH3HSG80VJH4m37j5Ivrbff2ON5znOxve8vlhJXMjWpNTzWK3Fd5czqD_AvKVc3kO18pKOGEaiw0NHHYPhez2-qpVt2jMAFGzo-40VfFBtpCT1hyg2mZ-xJ58gaqaHiyRMF1I21TlLWjwt_-R4gJvsZeY',
}


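# Request the audio features of every track in the playlist, one call per track
song_features = []
for t in track_ids:
  response = requests.get('https://api.spotify.com/v1/audio-features/'+t, headers=headers)
  response.raise_for_status()  # stop early on an expired token or an unknown track id
  song_features.append(response.json())
  
song_features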


track_ids_names

In [None]:
import json

with open('gdrive/My Drive/track_features.json', 'w') as f:
    json.dump(song_features, f)

print(song_features)

In [None]:
import pandas as pd

df = pd.read_json('gdrive/My Drive/track_features.json')

df.to_csv('gdrive/My Drive/extracted_track_features.csv', encoding='utf-8', index=False)

# Drop the columns that are only metadata

df = df.drop(columns=['type', 'id', 'uri', 'track_href', 'analysis_url'])
df

# Data Pre-processing


In [None]:
# instrumentalness and time_signature do not have any impact on the data, so we remove them

df = df.drop(columns=['instrumentalness','time_signature'])

# Every song in this playlist is a liked song, so label all rows with 1
df['liked'] = 1
df

In [None]:
df.info()

df.to_csv('gdrive/My Drive/cleaned_liked_songs.csv', encoding='utf-8', index=False)


In [None]:
# We added the features of the 95 most disliked songs to the data after querying the Spotify API for them


data=pd.read_csv('gdrive/My Drive/cleaned_liked_disliked_songs.csv').dropna(axis=1, how='all')
data.head()

# If the liked songs all appear at the beginning, the model may mostly learn what a liked song looks like, so we shuffle the songs

data = data.sample(frac=1)
data.head()


In [None]:
# The songs were already shuffled above, so just display them before saving

data

data.to_csv('gdrive/My Drive/final_data_spotify.csv', encoding='utf-8', index=False)
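
The shuffle used above can be made repeatable. The following is a minimal sketch on a made-up toy frame (the `track` column and its values are placeholders, not the project's data), assuming the liked rows come first as the comment above implies. The `random_state` value is an arbitrary choice.

```python
import pandas as pd

# Toy frame where all liked rows come first (placeholder values only).
toy = pd.DataFrame({"track": list("abcdef"), "liked": [1, 1, 1, 0, 0, 0]})

# frac=1 returns every row in random order; random_state (arbitrary here) makes
# the order repeatable, and reset_index discards the old row positions.
shuffled = toy.sample(frac=1, random_state=0).reset_index(drop=True)
print(shuffled)
```

Fixing the seed is optional; it only matters if the same row order should come back on every run.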




# Data is cleaned and ready to use

# EDA
## The main features we are getting in this dataset are
### acousticness : A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.

### danceability : Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.

### duration_ms : The duration of the track in milliseconds.

### energy : Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.


### key : The key the track is in. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on.

### liveness : Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.

### loudness : The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.

### mode : Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.

### speechiness : Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.

### tempo : The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.

### valence : A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).

### And the variable that has to be predicted :
### liked : 1 for liked songs , 0 for disliked songs
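
To make the encoded features above easier to read, here is a small sketch that turns the integer `key` and `mode` values into a name and buckets `speechiness` by the 0.33 and 0.66 thresholds. The helper names `key_name` and `speechiness_bucket` are hypothetical and not part of this notebook or of any Spotify library.

```python
# Pitch-class names for the integer `key` feature (0 = C, 1 = C♯/D♭, 2 = D, ...).
PITCH_CLASSES = ["C", "C♯/D♭", "D", "D♯/E♭", "E", "F",
                 "F♯/G♭", "G", "G♯/A♭", "A", "A♯/B♭", "B"]


def key_name(key, mode):
    """Return a readable key such as 'D minor' (hypothetical helper)."""
    if not 0 <= key <= 11:
        return "unknown"  # anything outside 0..11 is treated as "no key"
    return f"{PITCH_CLASSES[key]} {'major' if mode == 1 else 'minor'}"


def speechiness_bucket(value):
    """Bucket speechiness with the 0.33 / 0.66 thresholds described above."""
    if value > 0.66:
        return "speech"
    if value >= 0.33:
        return "music and speech"
    return "music"


print(key_name(2, 0), "|", speechiness_bucket(0.05))
```

Applied to the dataset, something like `data.apply(lambda r: key_name(int(r['key']), int(r['mode'])), axis=1)` would give one label per song, assuming both columns are present.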


In [None]:
data=pd.read_csv('gdrive/My Drive/final_data_spotify.csv').dropna(axis=1, how='all')
data.head()

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns


data.info()
data.isnull().sum()

data['liked'].value_counts()

In [None]:
corr = data.corr()[["liked"]]
fig, ax = plt.subplots(figsize=(10,10)) 
sns.heatmap(
    corr, 
    annot=True,
    ax=ax
);

Songs I like tend to be:
  1. acoustic
  2. melodic (mode)
  3. loud
  4. in certain keys (key)

Songs I don't like tend to be:
  1. high energy
  2. short in duration
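
Readings like the ones above can also be checked numerically by sorting the correlations with `liked` instead of reading them off the heatmap. This is a minimal sketch on made-up toy rows; the numbers are placeholders chosen only to show the mechanics and are not the project's results.

```python
import pandas as pd

# Made-up toy rows, only to show the mechanics; these are not the project's data.
toy = pd.DataFrame({
    "acousticness": [0.9, 0.8, 0.7, 0.2, 0.1, 0.3],
    "energy":       [0.2, 0.3, 0.4, 0.9, 0.8, 0.7],
    "liked":        [1,   1,   1,   0,   0,   0],
})

# Pearson correlation of every feature with the binary target,
# sorted so the strongest positive relationship comes first.
ranked = toy.corr()["liked"].drop("liked").sort_values(ascending=False)
print(ranked)
```

With the real dataset, the same expression on `data` would list the features from most positively to most negatively correlated with `liked`.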

In [None]:
# If the liked songs all appear at the beginning, the model may mostly learn what a liked song looks like, so we shuffle the songs

data = data.sample(frac=1)
data


## Data Preprocessing


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score , roc_auc_score , confusion_matrix

X = data.drop('liked' , axis = 1)
y = data['liked']


X_train , X_test , y_train , y_test = train_test_split(X , y , test_size=0.2)

print("the shapes of x-train and x-test are : " , X_train.shape , X_test.shape)
print("the shapes of y-train and y-test are : " , y_train.shape , y_test.shape)
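
With a dataset this small, a purely random split can leave the test set with an uneven share of liked and disliked songs. The sketch below shows one possible variant using the `stratify` and `random_state` parameters of `train_test_split`; the toy frame and the seed `42` are assumptions for illustration, not the notebook's data or settings.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with a 50/50 class balance; the values are placeholders.
toy = pd.DataFrame({"feature": range(20), "liked": [1] * 10 + [0] * 10})
X_toy = toy.drop(columns="liked")
y_toy = toy["liked"]

# stratify keeps the liked/disliked ratio the same in both splits;
# random_state makes the split repeatable (42 is an arbitrary choice).
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)
print(y_tr.value_counts().to_dict(), y_te.value_counts().to_dict())
```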


In [None]:
from sklearn.tree import DecisionTreeClassifier , plot_tree


dt_clf = DecisionTreeClassifier()
dt_clf.fit(X_train , y_train)

y_preds_dt = dt_clf.predict(X_test)


print("Accuracy Score of the Decision Tree Model" , accuracy_score(y_test , y_preds_dt))
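
A single accuracy score from one small test split can vary a lot from run to run. As a hedged sketch, k-fold cross-validation gives a steadier estimate. The data below is synthetic, generated with `make_classification` and sized to roughly mirror this dataset; it is a stand-in, not the song features themselves.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the song features (195 rows, 11 features); not the real data.
X_demo, y_demo = make_classification(n_samples=195, n_features=11, random_state=0)

# 5-fold cross-validation: the model is trained five times, and each row
# is used for testing exactly once.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X_demo, y_demo, cv=5)
print("fold accuracies:", scores.round(2))
print("mean accuracy:", scores.mean().round(2))
```

Replacing `X_demo` and `y_demo` with `X` and `y` would apply the same check to the real data.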


In [None]:
cm  = confusion_matrix(y_test , y_preds_dt)

# confusion_matrix orders the classes as 0 then 1, so "No" (not liked) comes first
x_axis_labels = ["No" , "Yes"]
y_axis_labels = ["No" , "Yes"]

f , ax = plt.subplots(figsize=(10,7))
sns.heatmap(cm , annot=True, linewidths=0.2 , linecolor="black" , fmt=".0f" , ax=ax , cmap="Greens" , 
           xticklabels=x_axis_labels , yticklabels=y_axis_labels)
plt.xlabel("PREDICTED LABEL")
plt.ylabel("TRUE LABEL")
plt.title("Confusion Matrix Of Decision Tree Model")
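
A confusion matrix also yields precision and recall, which say more than accuracy alone about how the liked class is handled. The sketch below uses hypothetical counts laid out in the order scikit-learn uses for labels `[0, 1]` (rows are true labels, columns are predicted labels); the numbers are invented for illustration and are not this model's results.

```python
# Hypothetical counts in the layout confusion_matrix returns for labels [0, 1]:
# rows are true labels, columns are predicted labels.
cm_demo = [[15, 4],    # true 0 (not liked): 15 predicted 0, 4 predicted 1
           [3, 17]]    # true 1 (liked):      3 predicted 0, 17 predicted 1

tn, fp = cm_demo[0]
fn, tp = cm_demo[1]

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the songs predicted liked, the share that were liked
recall = tp / (tp + fn)      # of the liked songs, the share the model found

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```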