## Spotify Project
This project is meant to help you understand how to use Python to aggregate data into a usable format for simple machine learning.

The end goal is to build a program that aggregates Spotify song data using API calls and web scraping, then performs some machine learning on that data.

Topics covered:
* HTTP requests
* Web scraping with Selenium
* Data manipulation with pandas
* Data modeling with sklearn

### Key Points
To call Spotify's API, we need an access token. Spotify documents the process in detail, but in short: we first register an app with Spotify to obtain a client ID and client secret, then send those credentials to a token URL that Spotify provides, and the token comes back in the response.

Once we have the token, we can call most of Spotify's API endpoints (our token carries only basic access; authorization scopes are beyond this project). We want information on songs, but the API endpoint for song information only accepts a song ID. How do we get the song ID?

The way we will get a song's ID is the 'search' API endpoint. It accepts many parameters, but we will give it only the song name and artist. The response contains that song's ID, which we can then use to request more information about the song.

Once we have a way to get a song's ID from its name and artist, we need a list of songs with their names and artists. This is where web scraping comes into play. Spotify publishes a list of weekly top songs; we can pull the song names and artists from it and save them in a list.

After all this, we have what we need to build our data. The steps are as follows:

#### Steps:
1. Get a token
2. Get a list of songs by using Selenium to scrape each song's name and artist
3. Get the ID of each song in our list using the search API
4. Call the song-data API endpoint for each song (by ID) and save the data in a list
5. Turn the list generated in step 4 into a dataframe
6. Celebrate, because the boring stuff is over
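
The steps above can be sketched as a single function. This is only an outline under assumptions: `run_pipeline` and its four callable parameters are hypothetical names standing in for the token, scraping, search, and song-data functions built later in this notebook, and plain dicts stand in for the dataframe so the sketch has no dependencies.

```python
def run_pipeline(get_token, scrape_songs, find_song_id, fetch_features):
    """Outline of steps 1-5; every argument is a caller-supplied function (hypothetical names)."""
    token = get_token()                                # step 1
    rows = []
    for name, artist in scrape_songs():                # step 2
        song_id = find_song_id(name, artist, token)    # step 3
        if not song_id:
            continue                                   # skip songs the search could not match
        features = fetch_features(song_id, token)      # step 4
        rows.append({'Song Name': name, 'Artist': artist, **features})
    return rows                                        # step 5: pd.DataFrame(rows) would build the dataframe


# Usage with stand-in functions (no network access needed)
rows = run_pipeline(
    get_token=lambda: 'fake-token',
    scrape_songs=lambda: [('Circles', 'Post Malone'), ('Unknown', 'Nobody')],
    find_song_id=lambda name, artist, token: 'id-1' if name == 'Circles' else '',
    fetch_features=lambda song_id, token: {'id': song_id, 'tempo': 120.0},
)
print(rows)
```

Passing the functions in as arguments keeps the outline testable without a Spotify account; in the notebook itself the real functions are simply called in this order.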

In [1]:
import requests as rq # tool to make api calls
import pandas as pd # data processing
import base64 # to encode the credentials for api calls
import json # for responses to api calls
from urllib.parse import quote # for url encoding
from datetime import datetime, timedelta # constructing urls

In [2]:
url = 'https://accounts.spotify.com/api/token' # api call to receive token
clientID = '67158a9f3e804254bfe2e64fb370b549' # app id that is created with spotify 
clientS = 'c875dc2bdf7d4cd2ae696b7ff434b502' # secret for app that is registered

In [3]:
# We need a token from Spotify to authorize our API calls.
# Wrapping this in a function makes it easy to request a fresh
# token whenever we need one.
def get_token(clientID, clientS, url):
    # Basic auth: base64-encode "clientID:clientS"
    credentials = clientID + ':' + clientS
    newS64 = base64.b64encode(credentials.encode("utf-8")).decode("utf-8")
    data = {
        'grant_type': 'client_credentials'
    }
    headers = {
        'Authorization': 'Basic ' + newS64
    }
    resp = rq.post(url, headers=headers, data=data)
    resp.raise_for_status()  # fail loudly on bad credentials instead of a KeyError below
    token = resp.json()['access_token']
    return token

In [4]:
token = get_token(clientID,clientS,url)
token

'BQDG0hQvMmzkjEGAVB1YRCfXzRIihrfYvpjIlag2fb0tTdTayceY2ONXDbTWjDM4xVMZH5XFRfYov8RGnRA'

In [5]:
# To get song information we need a song ID. Here we look it up with
# Spotify's search API: this function searches for a song by name and
# artist and returns the ID of the first result whose name matches.
def get_song_id(songName, artist, token):
    songNameQuote = quote(songName)  # URL-encode the song name
    artistQuote = quote(artist)      # URL-encode the artist name
    url2 = 'https://api.spotify.com/v1/search?q={}%20artist:{}&type=track'.format(songNameQuote, artistQuote)
    print(url2)
    headers = {
        'Authorization': 'Bearer ' + token
    }
    resp = rq.get(url2, headers=headers)  # call the search API
    info = resp.json()
    items = info['tracks']['items']  # all matching tracks
    songID = ''  # stays empty if no result has exactly this name
    for el in items:
        if el['name'] == songName:
            songID = el['uri'].split(':')[-1]  # 'spotify:track:<id>' -> '<id>'
            break
    return songID  # the ID to use in later API calls

In [6]:
songID = get_song_id("Circles","Post Malone",token)
songID

https://api.spotify.com/v1/search?q=Circles%20artist:Post%20Malone&type=track


'21jGcNKet2qwijlDFuPiPb'
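
The exact comparison `el['name'] == songName` in `get_song_id` misses results whose title differs only in capitalization or surrounding whitespace. A minimal sketch of a more forgiving matcher follows. `pick_track_id` is a hypothetical helper, and the sample dict is hand-written to imitate the `tracks` / `items` part of a search response that the function above reads; it is not real API output.

```python
def pick_track_id(search_json, song_name):
    """Return the ID of the first track whose name matches song_name, ignoring case; '' if none."""
    wanted = song_name.strip().casefold()
    for item in search_json.get('tracks', {}).get('items', []):
        if item.get('name', '').strip().casefold() == wanted:
            return item['uri'].split(':')[-1]   # 'spotify:track:<id>' -> '<id>'
    return ''


# Hand-written stand-in for a search response (not real API output)
sample = {'tracks': {'items': [
    {'name': 'Circles - Live', 'uri': 'spotify:track:AAA'},
    {'name': 'circles', 'uri': 'spotify:track:BBB'},
]}}
print(pick_track_id(sample, 'Circles'))   # prints BBB
```

Using `.get` with defaults also means an error response without a `tracks` key yields an empty string rather than a `KeyError`.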

### Next Steps
We now have a way to use Spotify's API to find a song's ID from its name and artist. This is great. What do you think the next step would be?

In [7]:
from selenium import webdriver
import chromedriver_binary

In [8]:
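def get_all_songs(webEL):
    songsList = []
    for el in webEL:
        # split the cell's HTML on '>' so the text after each tag can be read
        tempEl = el.get_attribute('innerHTML')
        tempEl = tempEl.split('>')
        if len(tempEl) > 3:  # tempEl[3] is read below, so at least 4 pieces are needed
            songName = tempEl[1].split('<')[0]
            artist = tempEl[3].split('<')[0][3:]  # drop the leading 'by '
            songsList.append([songName, artist])
    return songsList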
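
Splitting on `>` works only as long as the cell's markup keeps exactly the same shape. A slightly sturdier sketch uses the standard library's HTML parser to collect the text chunks instead. The markup in the example is an assumption modeled on what the split-based parser above expects (song name in the first tag, `by <artist>` in the second), and `parse_track_cell` is a hypothetical helper.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects the non-empty text chunks found between tags."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def parse_track_cell(inner_html):
    """Return [song_name, artist] from one cell, or None if fewer than two text chunks are found."""
    collector = _TextCollector()
    collector.feed(inner_html)
    collector.close()
    if len(collector.chunks) < 2:
        return None
    song_name, artist = collector.chunks[0], collector.chunks[1]
    if artist.startswith('by '):
        artist = artist[3:]   # same 'by ' prefix the original strips with [3:]
    return [song_name, artist]


# Assumed markup, not copied from the live page
print(parse_track_cell('<strong>Circles</strong> <span>by Post Malone</span>'))
# prints ['Circles', 'Post Malone']
```

A side benefit is that HTML entities such as `&amp;` are decoded, which matters because the scraped name is later compared against the name returned by the search API.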

In [9]:
from selenium.webdriver.common.by import By

def get_page_songs():
    allSongs = []
    d = datetime.today() - timedelta(days=3)
    driver = webdriver.Chrome()
    try:
        for i in range(20):  # walk back through 20 weekly charts
            d, dateRangeString = construct_date_string(d)
            driver.get('https://spotifycharts.com/regional/us/weekly/' + dateRangeString)
            # find_elements_by_class_name was removed in Selenium 4
            songNames = driver.find_elements(By.CLASS_NAME, "chart-table-track")
            allSongs.extend(get_all_songs(songNames))
    finally:
        driver.quit()  # close the browser even if a page fails to load
    return allSongs

In [10]:
def construct_date_string(endDate):
    startDate = endDate - timedelta(days=7)
    endString = datetime.strftime(endDate,"%Y-%m-%d")
    startString = datetime.strftime(startDate,"%Y-%m-%d")
    finalString = startString + '--' + endString
    return(startDate,finalString)
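
As a worked example, here is what the function produces for a fixed date. The function is repeated in condensed form so the snippet runs on its own, and the date is arbitrary.

```python
from datetime import datetime, timedelta


def construct_date_string(endDate):
    startDate = endDate - timedelta(days=7)
    finalString = startDate.strftime("%Y-%m-%d") + '--' + endDate.strftime("%Y-%m-%d")
    return startDate, finalString


d = datetime(2019, 10, 10)           # arbitrary example date
for _ in range(2):                   # walk back two weekly windows
    d, dateRangeString = construct_date_string(d)
    print(dateRangeString)
# prints:
# 2019-10-03--2019-10-10
# 2019-09-26--2019-10-03
```

Each call returns the window's start date, which the loop in `get_page_songs` feeds back in as the end date of the previous week.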

In [11]:
def build_data(all_songs, token):
    keys = ['Song Name', 'Artist']
    url = 'https://api.spotify.com/v1/audio-features/'
    headers = {
        'Authorization': 'Bearer ' + token
    }
    rows = []  # collect finished rows instead of removing from all_songs while looping over it
    for song in all_songs:
        try:
            songid = get_song_id(song[0], song[1], token)
            features = rq.get(url + songid, headers=headers).json()
            if len(keys) == 2:  # take the column names from the first response
                print(features.keys())
                keys.extend(features.keys())
            if features:
                rows.append(song[:2] + list(features.values()))
        except Exception:
            # song is a list, so convert it before concatenating
            print("There was an error on song: " + str(song))
    allDataDF = pd.DataFrame(rows, columns=keys)
    return allDataDF

In [12]:
#all_songs = get_page_songs()

In [13]:
#DF = build_data(all_songs,token)
#DF.dropna(inplace=True)
#DF.to_csv('allSongs.csv')
DF = pd.read_csv('allSongs.csv')

In [14]:
DF

Unnamed: 0.1,Unnamed: 0,Song Name,Artist,danceability,energy,key,loudness,mode,speechiness,acousticness,...,liveness,valence,tempo,type,id,uri,track_href,analysis_url,duration_ms,time_signature
0,0,Circles,Post Malone,0.695,0.762,0.0,-3.497,1.0,0.0395,0.1920,...,0.0863,0.553,120.042,audio_features,21jGcNKet2qwijlDFuPiPb,spotify:track:21jGcNKet2qwijlDFuPiPb,https://api.spotify.com/v1/tracks/21jGcNKet2qw...,https://api.spotify.com/v1/audio-analysis/21jG...,215280.0,4.0
1,1,Ransom,Lil Tecca,0.745,0.642,7.0,-6.257,0.0,0.2870,0.0204,...,0.0658,0.226,179.974,audio_features,1lOe9qE0vR9zwWQAOk6CoO,spotify:track:1lOe9qE0vR9zwWQAOk6CoO,https://api.spotify.com/v1/tracks/1lOe9qE0vR9z...,https://api.spotify.com/v1/audio-analysis/1lOe...,131240.0,4.0
2,2,BOP,DaBaby,0.769,0.787,11.0,-3.909,1.0,0.3670,0.1890,...,0.1290,0.836,126.770,audio_features,6Ozh9Ok6h4Oi1wUSLtBseN,spotify:track:6Ozh9Ok6h4Oi1wUSLtBseN,https://api.spotify.com/v1/tracks/6Ozh9Ok6h4Oi...,https://api.spotify.com/v1/audio-analysis/6Ozh...,159715.0,4.0
3,3,Truth Hurts,Lizzo,0.715,0.624,4.0,-3.046,0.0,0.1140,0.1100,...,0.1230,0.412,158.087,audio_features,5qmq61DAAOUaW8AUo8xKhh,spotify:track:5qmq61DAAOUaW8AUo8xKhh,https://api.spotify.com/v1/tracks/5qmq61DAAOUa...,https://api.spotify.com/v1/audio-analysis/5qmq...,173325.0,4.0
4,4,223's (feat. 9lokknine),YNW Melly,0.931,0.502,0.0,-9.311,0.0,0.3530,0.0389,...,0.0912,0.712,94.999,audio_features,4sjiIpEv617LDXaidKioOI,spotify:track:4sjiIpEv617LDXaidKioOI,https://api.spotify.com/v1/tracks/4sjiIpEv617L...,https://api.spotify.com/v1/audio-analysis/4sji...,176640.0,4.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3678,3995,Verte Ir,DJ Luian,0.857,0.647,11.0,-4.152,0.0,0.0984,0.3910,...,0.0842,0.553,95.982,audio_features,4lzxJ4jCuFDXXGkE1LmpKR,spotify:track:4lzxJ4jCuFDXXGkE1LmpKR,https://api.spotify.com/v1/tracks/4lzxJ4jCuFDX...,https://api.spotify.com/v1/audio-analysis/4lzx...,267500.0,4.0
3679,3996,This Feeling,The Chainsmokers,0.575,0.571,1.0,-7.906,1.0,0.0439,0.0558,...,0.0912,0.449,105.049,audio_features,4NBTZtAt1F13VvlSKe6KTl,spotify:track:4NBTZtAt1F13VvlSKe6KTl,https://api.spotify.com/v1/tracks/4NBTZtAt1F13...,https://api.spotify.com/v1/audio-analysis/4NBT...,197947.0,4.0
3680,3997,Best Part (feat. Daniel Caesar),H.E.R.,0.473,0.371,4.0,-10.219,0.0,0.0405,0.7950,...,0.1090,0.413,75.208,audio_features,4OBZT9EnhYIV17t4pGw7ig,spotify:track:4OBZT9EnhYIV17t4pGw7ig,https://api.spotify.com/v1/tracks/4OBZT9EnhYIV...,https://api.spotify.com/v1/audio-analysis/4OBZ...,209400.0,4.0
3681,3998,No Role Modelz,J. Cole,0.696,0.521,10.0,-8.465,0.0,0.3320,0.3020,...,0.0565,0.458,100.000,audio_features,62vpWI1CHwFy7tMIcSStl8,spotify:track:62vpWI1CHwFy7tMIcSStl8,https://api.spotify.com/v1/tracks/62vpWI1CHwFy...,https://api.spotify.com/v1/audio-analysis/62vp...,292987.0,4.0


### Retrieving Label
To predict the popularity of songs, the model needs a target value to train on. This project uses each song's popularity as that prediction target. Spotify has an API call that returns it as a number between 0 and 100. After we get the popularity value of each song, we will append it to our existing dataframe. Then we can do some machine learning!

In [15]:
url = "https://api.spotify.com/v1/tracks/11dFghVXANMlKmJXsNCbNl"
headers = {
        'Authorization' : 'Bearer '+token
}

In [16]:
response = rq.get(url,headers=headers).json()

In [17]:
print('Popularity Score: ' + str(response['popularity']) + '/100')
response

Popularity Score: 7/100


{'album': {'album_type': 'single',
  'artists': [{'external_urls': {'spotify': 'https://open.spotify.com/artist/6sFIWsNpZYqfjUpaCgueju'},
    'href': 'https://api.spotify.com/v1/artists/6sFIWsNpZYqfjUpaCgueju',
    'id': '6sFIWsNpZYqfjUpaCgueju',
    'name': 'Carly Rae Jepsen',
    'type': 'artist',
    'uri': 'spotify:artist:6sFIWsNpZYqfjUpaCgueju'}],
  'available_markets': [],
  'external_urls': {'spotify': 'https://open.spotify.com/album/0tGPJ0bkWOUmH7MEOR77qc'},
  'href': 'https://api.spotify.com/v1/albums/0tGPJ0bkWOUmH7MEOR77qc',
  'id': '0tGPJ0bkWOUmH7MEOR77qc',
  'images': [{'height': 640,
    'url': 'https://i.scdn.co/image/966ade7a8c43b72faa53822b74a899c675aaafee',
    'width': 640},
   {'height': 300,
    'url': 'https://i.scdn.co/image/107819f5dc557d5d0a4b216781c6ec1b2f3c5ab2',
    'width': 300},
   {'height': 64,
    'url': 'https://i.scdn.co/image/5a73a056d0af707b4119a883d87285feda543fbb',
    'width': 64}],
  'name': 'Cut To The Feeling',
  'release_date': '2017-05-26',
 

In [18]:
def get_populatity(songID):
    url = "https://api.spotify.com/v1/tracks/" + str(songID)
    headers = {
        'Authorization' : 'Bearer '+token
    }
    response = rq.get(url,headers=headers).json()
    return(response['popularity'])
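
`get_populatity` makes one request per track, so labeling several thousand songs means several thousand calls. Spotify also offers a "several tracks" endpoint (`/v1/tracks?ids=...`) that takes a comma-separated list of IDs; the limit of 50 IDs per request used below is an assumption to check against the current API documentation. A minimal sketch of the grouping step, with `batch_urls` as a hypothetical helper:

```python
def batch_urls(song_ids, batch_size=50):
    """Yield one 'several tracks' URL per group of at most batch_size IDs."""
    base = 'https://api.spotify.com/v1/tracks?ids='
    for start in range(0, len(song_ids), batch_size):
        yield base + ','.join(song_ids[start:start + batch_size])


ids = ['id{}'.format(n) for n in range(120)]   # placeholder IDs, not real tracks
urls = list(batch_urls(ids))
print(len(urls))   # 3 requests instead of 120
```

Each URL would be fetched with the same `Authorization` header as above; the response is expected to hold a `tracks` list from which the `popularity` values can be read.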

In [21]:
popList = []
for i, el in enumerate(DF['id']):
    popList.append(get_populatity(el))
    if i % 1000 == 0:
        print(i)  # progress indicator
popList

0
1000
2000
3000


[96,
 61,
 85,
 94,
 90,
 83,
 91,
 92,
 93,
 89,
 90,
 88,
 80,
 95,
 91,
 89,
 93,
 100,
 79,
 96,
 89,
 90,
 80,
 90,
 94,
 97,
 90,
 96,
 89,
 58,
 87,
 85,
 87,
 87,
 94,
 88,
 77,
 77,
 56,
 86,
 91,
 97,
 87,
 77,
 76,
 86,
 83,
 76,
 85,
 92,
 89,
 94,
 57,
 92,
 75,
 84,
 82,
 86,
 90,
 86,
 94,
 81,
 89,
 89,
 89,
 85,
 85,
 84,
 75,
 83,
 86,
 83,
 85,
 89,
 87,
 90,
 83,
 95,
 83,
 83,
 92,
 80,
 79,
 83,
 90,
 82,
 92,
 89,
 91,
 82,
 85,
 88,
 78,
 88,
 56,
 82,
 90,
 84,
 97,
 84,
 89,
 94,
 87,
 87,
 85,
 83,
 82,
 87,
 82,
 82,
 81,
 85,
 89,
 92,
 81,
 77,
 76,
 86,
 83,
 83,
 89,
 83,
 87,
 84,
 86,
 89,
 82,
 88,
 86,
 81,
 81,
 83,
 79,
 88,
 86,
 86,
 83,
 87,
 77,
 85,
 84,
 72,
 88,
 82,
 82,
 81,
 89,
 84,
 92,
 76,
 79,
 82,
 87,
 84,
 86,
 86,
 85,
 87,
 88,
 89,
 81,
 86,
 85,
 87,
 80,
 84,
 87,
 86,
 81,
 85,
 78,
 82,
 86,
 84,
 84,
 79,
 85,
 84,
 83,
 81,
 89,
 84,
 65,
 82,
 96,
 61,
 92,
 94,
 89,
 91,
 90,
 93,
 90,
 100,
 93,
 88,
 95,
 89,
 91,
 90

In [22]:
DF['Popularity'] = popList

In [24]:
DF.to_csv('allSongsLabeled.csv')

In [26]:
columns = ['danceability','energy','key','loudness','mode','speechiness','acousticness','instrumentalness','liveness','valence','tempo']

In [28]:
from sklearn.model_selection import train_test_split
X = DF[columns]
y = DF['Popularity']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [29]:
X_train

Unnamed: 0,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo
421,0.732,0.463,11.0,-6.972,0.0,0.0287,0.37400,0.000000,0.1940,0.397,95.971
123,0.699,0.558,11.0,-7.622,1.0,0.0603,0.46200,0.000000,0.1160,0.493,79.992
2110,0.755,0.772,6.0,-5.585,1.0,0.4000,0.12800,0.000000,0.1570,0.678,132.906
1220,0.848,0.455,1.0,-9.568,1.0,0.0746,0.00252,0.000000,0.1120,0.660,163.004
819,0.596,0.552,0.0,-10.278,0.0,0.0970,0.07650,0.334000,0.1040,0.112,97.949
...,...,...,...,...,...,...,...,...,...,...,...
1130,0.599,0.887,4.0,-3.967,1.0,0.0984,0.01920,0.000001,0.3000,0.881,170.918
1294,0.759,0.540,9.0,-6.039,0.0,0.0287,0.03700,0.000000,0.0945,0.750,116.947
860,0.905,0.389,8.0,-14.505,1.0,0.3320,0.74000,0.162000,0.1060,0.196,120.046
3507,0.826,0.579,8.0,-8.241,0.0,0.0801,0.00881,0.000000,0.1290,0.431,121.075


In [30]:
y_train

421      92
123      84
2110     82
1220     72
819      82
       ... 
1130     90
1294    100
860      88
3507     76
3174     89
Name: Popularity, Length: 2467, dtype: int64

In [37]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

scaler.fit(X_train)

X_train_scaled = scaler.transform(X_train)

In [38]:
X_train_scaled

array([[0.67716535, 0.45672915, 1.        , ..., 0.17988485, 0.39020435,
        0.22297016],
       [0.63385827, 0.56336289, 1.        , ..., 0.08823875, 0.49399935,
        0.10817684],
       [0.70734908, 0.80356942, 0.54545455, ..., 0.1364117 , 0.69402098,
        0.48831161],
       ...,
       [0.90419948, 0.37366708, 0.72727273, ..., 0.07648925, 0.17288355,
        0.39592523],
       [0.80052493, 0.58693456, 0.72727273, ..., 0.1035131 , 0.42696508,
        0.40331758],
       [0.70341207, 0.48479066, 0.54545455, ..., 0.06191987, 0.53724727,
        0.51083349]])
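
With its default settings, `MinMaxScaler` rescales each column to the range 0 to 1 using the minimum and maximum seen during `fit`: `x_scaled = (x - min) / (max - min)`. A dependency-free sketch of that arithmetic for a single column follows; `fit_min_max` and `apply_min_max` are hypothetical helpers, not sklearn functions, and the sample values are taken from the `key` column shown in `X_train`.

```python
def fit_min_max(column):
    """Learn the minimum and maximum of one training column."""
    return min(column), max(column)


def apply_min_max(column, lo, hi):
    """Rescale values with a previously learned (lo, hi); assumes hi > lo."""
    return [(x - lo) / (hi - lo) for x in column]


train_key = [11.0, 6.0, 0.0, 8.0]   # 'key' values like those in X_train
lo, hi = fit_min_max(train_key)
print(apply_min_max(train_key, lo, hi))
```

The results for 11, 6 and 8 are consistent with the 1.0, 0.54545455 and 0.72727273 entries in the third column of the array above. The same fitted scaler should later be applied to the test set with `scaler.transform(X_test)`, so that test data is scaled with the training minimum and maximum rather than its own.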