# Spotify Project

## Setup
#### The setup portion of this project deals with the following concepts:
<li> API Calls with Python </li>
<li> Data Manipulation with Python </li>
<li> Web Crawling with Selenium </li>
<li> Pandas DataFrame Generation </li>

In [11]:
import requests as rq # tool to make api calls
import pandas as pd # data processing
import base64 # to encode the credentials for api calls
import json # for responses to api calls
from urllib.parse import quote # for url encoding
from datetime import datetime, timedelta # constructing urls
pd.set_option("display.max_columns", 100)

In [8]:
url = 'https://accounts.spotify.com/api/token' # api call to receive token
clientID = '67158a9f3e804254bfe2e64fb370b549' # app id that is created with spotify 
clientS = 'c875dc2bdf7d4cd2ae696b7ff434b502' # secret for app that is registered

# need to get a token from spotify to validate our api calls
# much easier to create a function to do this, since we may be 
# using this often
def get_token(clientID,clientS,url):
    newS = clientID + ':' + clientS
    newS = base64.b64encode(newS.encode("utf-8"))
    newS64 = str(newS, "utf-8")
    headers = {
        'Authorization' : 'Basic ' + newS64
    }
    data = {
        'grant_type' : 'client_credentials'
    }
    resp = rq.post(url, headers=headers, data=data)
    token = resp.json()['access_token']
    return(token)
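
The header-building step inside `get_token` can be checked on its own. The sketch below repeats only that step with placeholder credentials (`my-client-id` and `my-client-secret` are made up for illustration, and `basic_auth_header` is a hypothetical helper name), so it runs without contacting Spotify.

```python
import base64

def basic_auth_header(client_id, client_secret):
    """Build the HTTP Basic Authorization header sent with the token request."""
    # Join the two values with a colon, then base64-encode the bytes.
    raw = f"{client_id}:{client_secret}".encode("utf-8")
    encoded = base64.b64encode(raw).decode("utf-8")
    return {"Authorization": "Basic " + encoded}

# Placeholder credentials, not real ones.
header = basic_auth_header("my-client-id", "my-client-secret")
print(header)
```

Decoding the part after `Basic ` with `base64.b64decode` should give back the original `id:secret` string, which is a quick way to confirm the encoding step.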

In [9]:
token = get_token(clientID,clientS,url)
token

'BQBCgLbCpXTc547yuHaFpHyRuKrC5QCn_2eDVp2RttFVQvJPJwgJjo7vIiRg_rmH08rTtMv_zQEldFqNLF8'

In [10]:
# In order to get song information, we need to obtain a song id
# this is only done by using spotify's search api, let's create
# a function to search a song with spotify's api and return the id
# of that song
# 'https://api.spotify.com/v1/search?q={}%20artist:{}&type=track'(api endpoint that we are filling with variables)
def get_song_id(songName,artist,token):
    

SyntaxError: unexpected EOF while parsing (<ipython-input-10-2e8a4d1a5415>, line 7)

In [None]:
songID = get_song_id("Circles","Post Malone",token)
songID
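
The body of `get_song_id` is left blank above. One possible way to fill it in is sketched here, split into two small pieces that can be tried without a network call. The helper names `build_search_url` and `extract_first_track_id` are hypothetical, and the sketch assumes the search response nests its results under `tracks` and then `items`, with an `id` field on each track.

```python
from urllib.parse import quote

# Endpoint template taken from the comment in the cell above.
SEARCH_TEMPLATE = "https://api.spotify.com/v1/search?q={}%20artist:{}&type=track"

def build_search_url(song_name, artist):
    """Fill the search endpoint with a URL-encoded song name and artist."""
    return SEARCH_TEMPLATE.format(quote(song_name), quote(artist))

def extract_first_track_id(search_response):
    """Return the id of the first track in a parsed search response, or None."""
    # Assumed shape: {"tracks": {"items": [{"id": "..."}, ...]}}
    items = search_response.get("tracks", {}).get("items", [])
    return items[0]["id"] if items else None

print(build_search_url("Circles", "Post Malone"))
```

With these two pieces, `get_song_id` could request the URL with `rq.get(url, headers={'Authorization': 'Bearer ' + token}).json()` and pass the result to `extract_first_track_id`. Returning `None` when nothing matches lets the caller skip songs that the search cannot find.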

### Next Steps
We now have a way to use Spotify's API to find the ID of a song from its name and artist. That puts us in a good spot, because all we need now is a list of popular songs. To get one, we will use a good friend of ours: Selenium!

In [None]:
from selenium import webdriver
import chromedriver_binary

https://spotifycharts.com/regional/us/weekly/2019-09-20--2019-09-27

In [None]:
# we need a way to retrieve the song information off of the page
# we can use a function to pass in a list of web elements and extract
# their necessary data
# this function will be used within the function below
def get_all_songs(webEL):

In [None]:
# With the function above, we can get all song information on a single
# page. But we are going to need many more songs in order to train a
# model, so this function iterates through multiple pages and
# calls the above function to extract the songs. Once the songs are
# extracted, we navigate to a new page and repeat the process. It returns
# a list of all song names and artists
def get_page_songs():

https://spotifycharts.com/regional/us/weekly/2019-09-20--2019-09-27

The trailing date range in the URL above (2019-09-20--2019-09-27) is what the function below creates.

In [None]:
# this function generates a string for the url and returns the
# start date so we can use this function in a loop
def construct_date_string(endDate):
    startDate = endDate - timedelta(days=7)
    endString = datetime.strftime(endDate,"%Y-%m-%d")
    startString = datetime.strftime(startDate,"%Y-%m-%d")
    finalString = startString + '--' + endString
    return(startDate,finalString)
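
Because `construct_date_string` returns the start date along with the string, each call can feed the next one. The sketch below is one way that loop might look: it restates the function so the block runs on its own, and `weekly_chart_urls` and `BASE_URL` are hypothetical names added for illustration.

```python
from datetime import datetime, timedelta

# Base of the chart URL shown above; the date range is appended to it.
BASE_URL = "https://spotifycharts.com/regional/us/weekly/"

def construct_date_string(endDate):
    """Return the start of the week ending on endDate, plus 'start--end'."""
    startDate = endDate - timedelta(days=7)
    finalString = startDate.strftime("%Y-%m-%d") + "--" + endDate.strftime("%Y-%m-%d")
    return startDate, finalString

def weekly_chart_urls(end_date, weeks):
    """Walk backwards one week at a time and collect the chart URLs."""
    urls = []
    for _ in range(weeks):
        # The returned start date becomes the end date of the previous week.
        end_date, date_string = construct_date_string(end_date)
        urls.append(BASE_URL + date_string)
    return urls

for url in weekly_chart_urls(datetime(2019, 9, 27), 3):
    print(url)
```

Starting from 2019-09-27, the first URL produced is the one shown above, and each later URL covers the week before it.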

In [None]:
# After all the song names and artists are saved into a list,
# we can call another api endpoint to retrieve information about
# each song (more features for it). We can use our "get_song_id" function
# to get the song's id that we will need for calling the api
def build_data(all_songs,token):

In [None]:
all_songs = get_page_songs()

In [None]:
DF = build_data(all_songs[:10],token) # doing the first 10 since we have around 4000 and it would take too long

### Retrieving Label
To predict the popularity of songs, the model needs a label to train on. This program uses each song's Spotify popularity score as that label. Spotify has an API call that returns the popularity of a track as a number between 0 and 100. After we get the popularity value of each song, we will append it to our existing DataFrame. Then we can do some machine learning!!

In [12]:
# load the data into a dataframe
DF = pd.read_csv('allSongs.csv')
DF

Unnamed: 0.1,Unnamed: 0,Song Name,Artist,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,type,id,uri,track_href,analysis_url,duration_ms,time_signature
0,0,Circles,Post Malone,0.695,0.762,0.0,-3.497,1.0,0.0395,0.1920,0.002440,0.0863,0.553,120.042,audio_features,21jGcNKet2qwijlDFuPiPb,spotify:track:21jGcNKet2qwijlDFuPiPb,https://api.spotify.com/v1/tracks/21jGcNKet2qw...,https://api.spotify.com/v1/audio-analysis/21jG...,215280.0,4.0
1,1,Ransom,Lil Tecca,0.745,0.642,7.0,-6.257,0.0,0.2870,0.0204,0.000000,0.0658,0.226,179.974,audio_features,1lOe9qE0vR9zwWQAOk6CoO,spotify:track:1lOe9qE0vR9zwWQAOk6CoO,https://api.spotify.com/v1/tracks/1lOe9qE0vR9z...,https://api.spotify.com/v1/audio-analysis/1lOe...,131240.0,4.0
2,2,BOP,DaBaby,0.769,0.787,11.0,-3.909,1.0,0.3670,0.1890,0.000000,0.1290,0.836,126.770,audio_features,6Ozh9Ok6h4Oi1wUSLtBseN,spotify:track:6Ozh9Ok6h4Oi1wUSLtBseN,https://api.spotify.com/v1/tracks/6Ozh9Ok6h4Oi...,https://api.spotify.com/v1/audio-analysis/6Ozh...,159715.0,4.0
3,3,Truth Hurts,Lizzo,0.715,0.624,4.0,-3.046,0.0,0.1140,0.1100,0.000000,0.1230,0.412,158.087,audio_features,5qmq61DAAOUaW8AUo8xKhh,spotify:track:5qmq61DAAOUaW8AUo8xKhh,https://api.spotify.com/v1/tracks/5qmq61DAAOUa...,https://api.spotify.com/v1/audio-analysis/5qmq...,173325.0,4.0
4,4,223's (feat. 9lokknine),YNW Melly,0.931,0.502,0.0,-9.311,0.0,0.3530,0.0389,0.000000,0.0912,0.712,94.999,audio_features,4sjiIpEv617LDXaidKioOI,spotify:track:4sjiIpEv617LDXaidKioOI,https://api.spotify.com/v1/tracks/4sjiIpEv617L...,https://api.spotify.com/v1/audio-analysis/4sji...,176640.0,4.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3678,3995,Verte Ir,DJ Luian,0.857,0.647,11.0,-4.152,0.0,0.0984,0.3910,0.000001,0.0842,0.553,95.982,audio_features,4lzxJ4jCuFDXXGkE1LmpKR,spotify:track:4lzxJ4jCuFDXXGkE1LmpKR,https://api.spotify.com/v1/tracks/4lzxJ4jCuFDX...,https://api.spotify.com/v1/audio-analysis/4lzx...,267500.0,4.0
3679,3996,This Feeling,The Chainsmokers,0.575,0.571,1.0,-7.906,1.0,0.0439,0.0558,0.000000,0.0912,0.449,105.049,audio_features,4NBTZtAt1F13VvlSKe6KTl,spotify:track:4NBTZtAt1F13VvlSKe6KTl,https://api.spotify.com/v1/tracks/4NBTZtAt1F13...,https://api.spotify.com/v1/audio-analysis/4NBT...,197947.0,4.0
3680,3997,Best Part (feat. Daniel Caesar),H.E.R.,0.473,0.371,4.0,-10.219,0.0,0.0405,0.7950,0.000000,0.1090,0.413,75.208,audio_features,4OBZT9EnhYIV17t4pGw7ig,spotify:track:4OBZT9EnhYIV17t4pGw7ig,https://api.spotify.com/v1/tracks/4OBZT9EnhYIV...,https://api.spotify.com/v1/audio-analysis/4OBZ...,209400.0,4.0
3681,3998,No Role Modelz,J. Cole,0.696,0.521,10.0,-8.465,0.0,0.3320,0.3020,0.000000,0.0565,0.458,100.000,audio_features,62vpWI1CHwFy7tMIcSStl8,spotify:track:62vpWI1CHwFy7tMIcSStl8,https://api.spotify.com/v1/tracks/62vpWI1CHwFy...,https://api.spotify.com/v1/audio-analysis/62vp...,292987.0,4.0


In [13]:
# get the url and headers for api request
url = "https://api.spotify.com/v1/tracks/11dFghVXANMlKmJXsNCbNl" # ending is the song ID
headers = {
        'Authorization' : 'Bearer '+token # still need to use our token
}

In [15]:
# make request to the url
response = rq.get(url,headers=headers).json()

In [16]:
# show the popularity of the song we received
response['popularity']

7

In [19]:
# create function with the input being the song ID and the output being the popularity of that song
def get_popularity(songID):
    url = "https://api.spotify.com/v1/tracks/" + songID
    headers = {
        'Authorization' : 'Bearer '+token # still need to use our token
    }
    response = rq.get(url,headers=headers).json()
    return(response['popularity'])

In [20]:
# use the function above to get every popularity rating of the songs in our data
# save all the values into a list
tempList = []
for el in DF['id'][:10]:
    tempList.append(get_popularity(el))
tempList

[96, 61, 85, 94, 90, 83, 91, 92, 93, 89]
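
Calling `get_popularity` once per song means one request per track, which adds up with thousands of songs. Spotify's Web API also has a "Get Several Tracks" endpoint (`/v1/tracks?ids=...`) that takes a comma-separated list of IDs. The sketch below only builds the batched URLs. The limit of 50 IDs per request is an assumption to check against the current API documentation, and `chunked` and `batch_track_urls` are hypothetical helper names.

```python
def chunked(items, size):
    """Yield consecutive slices of items, each with at most size elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def batch_track_urls(song_ids, batch_size=50):
    """Build one tracks URL per batch of song IDs."""
    base = "https://api.spotify.com/v1/tracks?ids="
    return [base + ",".join(batch) for batch in chunked(list(song_ids), batch_size)]

# The first three song IDs from the table above, in batches of two.
ids = ["21jGcNKet2qwijlDFuPiPb", "1lOe9qE0vR9zwWQAOk6CoO", "6Ozh9Ok6h4Oi1wUSLtBseN"]
for url in batch_track_urls(ids, batch_size=2):
    print(url)
```

Each URL would be requested with the same `Authorization` header as before. Assuming the response holds a `tracks` list, the popularity of each song would be read from the entries in that list.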

In [21]:
DF2 = pd.read_csv('allSongsLabeled.csv')

In [22]:
DF2

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,Song Name,Artist,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,type,id,uri,track_href,analysis_url,duration_ms,time_signature,Popularity
0,0,0,Circles,Post Malone,0.695,0.762,0.0,-3.497,1.0,0.0395,0.1920,0.002440,0.0863,0.553,120.042,audio_features,21jGcNKet2qwijlDFuPiPb,spotify:track:21jGcNKet2qwijlDFuPiPb,https://api.spotify.com/v1/tracks/21jGcNKet2qw...,https://api.spotify.com/v1/audio-analysis/21jG...,215280.0,4.0,96
1,1,1,Ransom,Lil Tecca,0.745,0.642,7.0,-6.257,0.0,0.2870,0.0204,0.000000,0.0658,0.226,179.974,audio_features,1lOe9qE0vR9zwWQAOk6CoO,spotify:track:1lOe9qE0vR9zwWQAOk6CoO,https://api.spotify.com/v1/tracks/1lOe9qE0vR9z...,https://api.spotify.com/v1/audio-analysis/1lOe...,131240.0,4.0,61
2,2,2,BOP,DaBaby,0.769,0.787,11.0,-3.909,1.0,0.3670,0.1890,0.000000,0.1290,0.836,126.770,audio_features,6Ozh9Ok6h4Oi1wUSLtBseN,spotify:track:6Ozh9Ok6h4Oi1wUSLtBseN,https://api.spotify.com/v1/tracks/6Ozh9Ok6h4Oi...,https://api.spotify.com/v1/audio-analysis/6Ozh...,159715.0,4.0,85
3,3,3,Truth Hurts,Lizzo,0.715,0.624,4.0,-3.046,0.0,0.1140,0.1100,0.000000,0.1230,0.412,158.087,audio_features,5qmq61DAAOUaW8AUo8xKhh,spotify:track:5qmq61DAAOUaW8AUo8xKhh,https://api.spotify.com/v1/tracks/5qmq61DAAOUa...,https://api.spotify.com/v1/audio-analysis/5qmq...,173325.0,4.0,94
4,4,4,223's (feat. 9lokknine),YNW Melly,0.931,0.502,0.0,-9.311,0.0,0.3530,0.0389,0.000000,0.0912,0.712,94.999,audio_features,4sjiIpEv617LDXaidKioOI,spotify:track:4sjiIpEv617LDXaidKioOI,https://api.spotify.com/v1/tracks/4sjiIpEv617L...,https://api.spotify.com/v1/audio-analysis/4sji...,176640.0,4.0,90
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3678,3678,3995,Verte Ir,DJ Luian,0.857,0.647,11.0,-4.152,0.0,0.0984,0.3910,0.000001,0.0842,0.553,95.982,audio_features,4lzxJ4jCuFDXXGkE1LmpKR,spotify:track:4lzxJ4jCuFDXXGkE1LmpKR,https://api.spotify.com/v1/tracks/4lzxJ4jCuFDX...,https://api.spotify.com/v1/audio-analysis/4lzx...,267500.0,4.0,88
3679,3679,3996,This Feeling,The Chainsmokers,0.575,0.571,1.0,-7.906,1.0,0.0439,0.0558,0.000000,0.0912,0.449,105.049,audio_features,4NBTZtAt1F13VvlSKe6KTl,spotify:track:4NBTZtAt1F13VvlSKe6KTl,https://api.spotify.com/v1/tracks/4NBTZtAt1F13...,https://api.spotify.com/v1/audio-analysis/4NBT...,197947.0,4.0,83
3680,3680,3997,Best Part (feat. Daniel Caesar),H.E.R.,0.473,0.371,4.0,-10.219,0.0,0.0405,0.7950,0.000000,0.1090,0.413,75.208,audio_features,4OBZT9EnhYIV17t4pGw7ig,spotify:track:4OBZT9EnhYIV17t4pGw7ig,https://api.spotify.com/v1/tracks/4OBZT9EnhYIV...,https://api.spotify.com/v1/audio-analysis/4OBZ...,209400.0,4.0,79
3681,3681,3998,No Role Modelz,J. Cole,0.696,0.521,10.0,-8.465,0.0,0.3320,0.3020,0.000000,0.0565,0.458,100.000,audio_features,62vpWI1CHwFy7tMIcSStl8,spotify:track:62vpWI1CHwFy7tMIcSStl8,https://api.spotify.com/v1/tracks/62vpWI1CHwFy...,https://api.spotify.com/v1/audio-analysis/62vp...,292987.0,4.0,82


In [None]:
# append the list we just created to the DataFrame that has all our data within it

After these steps we are ready to use the data for some machine learning!!!

### Machine Learning Concepts
Now that we are getting into the machine learning part of the project, let's go over some important machine learning topics that we need to think about.

In [24]:
DF2

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,Song Name,Artist,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,type,id,uri,track_href,analysis_url,duration_ms,time_signature,Popularity
0,0,0,Circles,Post Malone,0.695,0.762,0.0,-3.497,1.0,0.0395,0.1920,0.002440,0.0863,0.553,120.042,audio_features,21jGcNKet2qwijlDFuPiPb,spotify:track:21jGcNKet2qwijlDFuPiPb,https://api.spotify.com/v1/tracks/21jGcNKet2qw...,https://api.spotify.com/v1/audio-analysis/21jG...,215280.0,4.0,96
1,1,1,Ransom,Lil Tecca,0.745,0.642,7.0,-6.257,0.0,0.2870,0.0204,0.000000,0.0658,0.226,179.974,audio_features,1lOe9qE0vR9zwWQAOk6CoO,spotify:track:1lOe9qE0vR9zwWQAOk6CoO,https://api.spotify.com/v1/tracks/1lOe9qE0vR9z...,https://api.spotify.com/v1/audio-analysis/1lOe...,131240.0,4.0,61
2,2,2,BOP,DaBaby,0.769,0.787,11.0,-3.909,1.0,0.3670,0.1890,0.000000,0.1290,0.836,126.770,audio_features,6Ozh9Ok6h4Oi1wUSLtBseN,spotify:track:6Ozh9Ok6h4Oi1wUSLtBseN,https://api.spotify.com/v1/tracks/6Ozh9Ok6h4Oi...,https://api.spotify.com/v1/audio-analysis/6Ozh...,159715.0,4.0,85
3,3,3,Truth Hurts,Lizzo,0.715,0.624,4.0,-3.046,0.0,0.1140,0.1100,0.000000,0.1230,0.412,158.087,audio_features,5qmq61DAAOUaW8AUo8xKhh,spotify:track:5qmq61DAAOUaW8AUo8xKhh,https://api.spotify.com/v1/tracks/5qmq61DAAOUa...,https://api.spotify.com/v1/audio-analysis/5qmq...,173325.0,4.0,94
4,4,4,223's (feat. 9lokknine),YNW Melly,0.931,0.502,0.0,-9.311,0.0,0.3530,0.0389,0.000000,0.0912,0.712,94.999,audio_features,4sjiIpEv617LDXaidKioOI,spotify:track:4sjiIpEv617LDXaidKioOI,https://api.spotify.com/v1/tracks/4sjiIpEv617L...,https://api.spotify.com/v1/audio-analysis/4sji...,176640.0,4.0,90
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3678,3678,3995,Verte Ir,DJ Luian,0.857,0.647,11.0,-4.152,0.0,0.0984,0.3910,0.000001,0.0842,0.553,95.982,audio_features,4lzxJ4jCuFDXXGkE1LmpKR,spotify:track:4lzxJ4jCuFDXXGkE1LmpKR,https://api.spotify.com/v1/tracks/4lzxJ4jCuFDX...,https://api.spotify.com/v1/audio-analysis/4lzx...,267500.0,4.0,88
3679,3679,3996,This Feeling,The Chainsmokers,0.575,0.571,1.0,-7.906,1.0,0.0439,0.0558,0.000000,0.0912,0.449,105.049,audio_features,4NBTZtAt1F13VvlSKe6KTl,spotify:track:4NBTZtAt1F13VvlSKe6KTl,https://api.spotify.com/v1/tracks/4NBTZtAt1F13...,https://api.spotify.com/v1/audio-analysis/4NBT...,197947.0,4.0,83
3680,3680,3997,Best Part (feat. Daniel Caesar),H.E.R.,0.473,0.371,4.0,-10.219,0.0,0.0405,0.7950,0.000000,0.1090,0.413,75.208,audio_features,4OBZT9EnhYIV17t4pGw7ig,spotify:track:4OBZT9EnhYIV17t4pGw7ig,https://api.spotify.com/v1/tracks/4OBZT9EnhYIV...,https://api.spotify.com/v1/audio-analysis/4OBZ...,209400.0,4.0,79
3681,3681,3998,No Role Modelz,J. Cole,0.696,0.521,10.0,-8.465,0.0,0.3320,0.3020,0.000000,0.0565,0.458,100.000,audio_features,62vpWI1CHwFy7tMIcSStl8,spotify:track:62vpWI1CHwFy7tMIcSStl8,https://api.spotify.com/v1/tracks/62vpWI1CHwFy...,https://api.spotify.com/v1/audio-analysis/62vp...,292987.0,4.0,82


In [35]:
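# numeric_only=True skips the text columns; recent pandas versions raise an error without it
DF2.corr(numeric_only=True)['Popularity']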

Unnamed: 0         -0.154239
Unnamed: 0.1       -0.154085
danceability       -0.005625
energy             -0.046067
key                -0.071899
loudness            0.032116
mode                0.031063
speechiness        -0.106884
acousticness        0.078108
instrumentalness    0.002326
liveness            0.017777
valence             0.077350
tempo              -0.117984
duration_ms         0.081662
time_signature     -0.028775
Popularity          1.000000
Name: Popularity, dtype: float64

#### Thinking about the data
Our data has multiple columns with very different ranges of values. Some are negative (loudness), some are very small (instrumentalness), and some are very large (duration_ms). We want to scale all of them to a similar range, between 0 and 1. scikit-learn has scalers that we can use for this.
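
To make the idea of scaling concrete, here is a plain-Python sketch of min-max scaling, which maps the smallest value in a column to 0 and the largest to 1. It is an illustration of the arithmetic rather than scikit-learn's own code, `min_max_scale` is a hypothetical helper, and the sample values are the five loudness readings visible at the top of the table above.

```python
def min_max_scale(values):
    """Rescale values so the minimum becomes 0.0 and the maximum becomes 1.0."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # Every value is identical, so there is no range to spread them over.
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]

# Loudness of the first five songs in the table (all negative numbers).
loudness = [-3.497, -6.257, -3.909, -3.046, -9.311]
scaled = min_max_scale(loudness)
print(scaled)
```

The quietest song (-9.311) ends up at 0 and the loudest (-3.046) at 1, with the others in between, so the column is now on the same footing as columns like danceability.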

But before we scale any of the values, we need to split our data into training and testing data. For example, if you were taking a test and had the exact questions beforehand, you would be able to ace that test. But does that mean you truly know the material, or do you just know how to answer those questions? This is what we are trying to avoid with our model. We want to give it practice problems that are similar to the test but not the exact questions on it. By splitting the data, we give our model a study guide, and then give it the test after it has studied the questions on that study guide.
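
The study-guide idea can be sketched in plain Python: shuffle the row positions, then hold some of them back as the test. This is a simplified illustration of a random split under my own assumptions, not scikit-learn's implementation, and `simple_train_test_split` is a hypothetical name.

```python
import math
import random

def simple_train_test_split(rows, test_size=0.33, seed=42):
    """Shuffle row positions and hold out a fraction of them as the test set."""
    indices = list(range(len(rows)))
    # A seeded generator makes the shuffle repeatable from run to run.
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(len(rows) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train = [rows[i] for i in train_idx]
    test = [rows[i] for i in test_idx]
    return train, test

rows = list(range(8))  # stand-in for eight songs
train, test = simple_train_test_split(rows, test_size=0.25)
print(len(train), "training rows,", len(test), "test rows")
```

No row appears in both lists, which is the point: the model never sees the test questions while it studies.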

Here is how to set up a train/test split on our data.

Our X values are the questions we are giving our model, and the y values are the answers.

So we need to choose the columns in our data that will be our questions, along with the answers to those questions, which are the popularity values.

In [25]:
columns = ['danceability','energy','key','loudness','mode','speechiness','acousticness','instrumentalness','liveness','valence','tempo']

In [27]:
from sklearn.model_selection import train_test_split
X = DF2[columns]
y = DF2['Popularity']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [29]:
y_train

421      92
123      84
2110     82
1220     72
819      82
       ... 
1130     90
1294    100
860      88
3507     76
3174     89
Name: Popularity, Length: 2467, dtype: int64

In [30]:
from sklearn.preprocessing import MinMaxScaler

In [33]:
# fit the scaler on the training data only, so nothing about the test set leaks in
scaler = MinMaxScaler()
scaler.fit(X_train)

X_train_scaled = scaler.transform(X_train)

In [34]:
X_train_scaled

array([[0.67716535, 0.45672915, 1.        , ..., 0.17988485, 0.39020435,
        0.22297016],
       [0.63385827, 0.56336289, 1.        , ..., 0.08823875, 0.49399935,
        0.10817684],
       [0.70734908, 0.80356942, 0.54545455, ..., 0.1364117 , 0.69402098,
        0.48831161],
       ...,
       [0.90419948, 0.37366708, 0.72727273, ..., 0.07648925, 0.17288355,
        0.39592523],
       [0.80052493, 0.58693456, 0.72727273, ..., 0.1035131 , 0.42696508,
        0.40331758],
       [0.70341207, 0.48479066, 0.54545455, ..., 0.06191987, 0.53724727,
        0.51083349]])