# Dissecting Spotify Valence - ETL
In this assignment I try to dissect Spotify's valence metric. This file contains the data ETL.

---

> Iliadis Viktoras, Undergraduate Student <br />
> Department of Management Science and Technology <br />
> Athens University of Economics and Business <br />
> iliadisviktoras@gmail.com

To keep the code clean I decided to split the work across files. In this file we extract the charts from Zenodo and use the Spotify API, through the spotipy library, to get each song's audio features and audio analysis.

In [1]:
import pandas as pd
import numpy as np
import scipy.stats as stats
import statsmodels.api as sm
import statsmodels.formula.api as smf

import glob
import re
from datetime import datetime

import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

### Extracting to a DF the Zenodo Data
We select `charts.zip` from:
* The [Zenodo](https://zenodo.org/record/4778563#.YgAF4bpBy3A) dataset from the spotify_anova class file, with the 2017-2019 charts.

I copy the process from the spotify_anova example, leaving out some features that I find redundant for the purpose of this analysis.

In [2]:
header = 0
dfs = []
for file in glob.glob('Charts/*/201?/*.csv'):
    weekly_chart = pd.read_csv(file, header=header, sep='\t')
    dfs.append(weekly_chart)
all_charts = pd.concat(dfs)

#### Spotify API
To get each track's audio features and audio analysis, we need to set up access to the Spotify API.
You can do the same at [Spotify for Developers](https://developer.spotify.com/).
We also create a file, imported later as `spotify_config`, to store the credentials:

```
config = {
    'client_id' : 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX',
    'client_secret' :'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'
}
```
* We will get, for each of the top-streaming tracks, its [Track features](https://developer.spotify.com/documentation/web-api/reference/#/operations/get-several-audio-features) and its [Audio Analysis](https://developer.spotify.com/documentation/web-api/reference/#/operations/get-audio-analysis).
* To do that, we'll create a dictionary keyed by `song_id`, with the values being the audio features for the specific track.
* See the [Spotipy Documentation](https://spotipy.readthedocs.io/en/2.19.0/) for all the functions called.

In [4]:
from spotify_config import config

client_credentials_manager = SpotifyClientCredentials(config['client_id'],
                                                      config['client_secret'])
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)


* We start with the audio features, using the API function that accepts up to 100 IDs at a time.
* The results go into a dictionary keyed by `song_id`.

In [None]:
features = {}
all_track_ids = list(all_charts['song_id'].unique())

start = 0
num_tracks = 100
while start < len(all_track_ids):
    tracks_batch = all_track_ids[start:start+num_tracks]
    features_batch = sp.audio_features(tracks_batch)
    features.update({ track_id : track_features 
                     for track_id, track_features in zip(tracks_batch, features_batch) })
    start += num_tracks
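
The batching idea can also be sketched without touching the API. The example below is a minimal sketch, not the code used for this assignment: `chunked`, `collect_features` and `fake_fetch_batch` are hypothetical names, and `fake_fetch_batch` only stands in for `sp.audio_features`. It assumes that the endpoint can return a null entry for an id it cannot resolve, and skips such entries so they never reach the DataFrame step.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def collect_features(track_ids, fetch_batch, batch_size=100):
    """Build a {track_id: features} dict, skipping ids whose entry is None."""
    features = {}
    for batch in chunked(track_ids, batch_size):
        for track_id, track_features in zip(batch, fetch_batch(batch)):
            if track_features is not None:
                features[track_id] = track_features
    return features


# Hypothetical stand-in for sp.audio_features; the values are made up.
def fake_fetch_batch(batch):
    return [None if tid == "t3" else {"id": tid, "valence": 0.5} for tid in batch]


demo_ids = [f"t{i}" for i in range(1, 6)]
demo_features = collect_features(demo_ids, fake_fetch_batch, batch_size=2)
print(sorted(demo_features))
```

With the real client, `fetch_batch` would presumably be `sp.audio_features` and `batch_size` would stay at 100.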

* We turn the dictionary into a DataFrame.
* We keep only the features that offer information about the track.

In [None]:
tracks = pd.DataFrame.from_dict(features, orient='index')
tracks = tracks.drop(['type','id','uri','track_href','analysis_url'],axis=1)
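
To make the shape of the result concrete, here is a small sketch on made-up data. The ids and values in `toy_features` are invented for illustration only. It shows that `orient='index'` turns the dictionary keys into the row index, and that `errors="ignore"` lets the drop succeed even when a listed column is absent.

```python
import pandas as pd

# Toy stand-in for the `features` dictionary; ids and values are made up.
toy_features = {
    "id_a": {"valence": 0.81, "tempo": 120.0,
             "type": "audio_features", "uri": "spotify:track:id_a"},
    "id_b": {"valence": 0.35, "tempo": 96.5,
             "type": "audio_features", "uri": "spotify:track:id_b"},
}

# Dictionary keys become the index, inner keys become the columns.
toy_tracks = pd.DataFrame.from_dict(toy_features, orient="index")

# errors="ignore" keeps the call safe when a listed column does not exist.
toy_tracks = toy_tracks.drop(columns=["type", "uri", "analysis_url"], errors="ignore")
print(toy_tracks)
```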

* We continue with the audio analysis.

Unfortunately, unlike the audio features, which we can request in batches of 100, the audio analysis must be extracted track by track. That takes a significant amount of time, so we do it once and then export the joined data to a CSV file to speed up later runs. The code is kept for documentation purposes.
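
The "compute once, then reuse the CSV" idea can be sketched as a small helper. This is only an illustration under assumptions: `load_or_build` and `slow_build` are hypothetical names, and the temporary directory replaces the real working folder so the example leaves no files behind.

```python
import os
import tempfile

import pandas as pd


def load_or_build(csv_path, build):
    """Return the cached table at `csv_path`, or call `build()` and cache its result."""
    if os.path.exists(csv_path):
        return pd.read_csv(csv_path, index_col=0)
    table = build()
    table.to_csv(csv_path)
    return table


# A counter stands in for the slow, track-by-track API extraction.
calls = {"count": 0}


def slow_build():
    calls["count"] += 1
    return pd.DataFrame({"TotalSegments": [10, 12]}, index=["id_a", "id_b"])


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "TrackFeatures.csv")
    first = load_or_build(path, slow_build)   # builds and writes the CSV
    second = load_or_build(path, slow_build)  # reads the CSV, no rebuild
```

The second call never reaches `slow_build`, which is the point of exporting the joined data.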

**From the audio analysis we keep, for each unique song:**
* The number of segments and sections; perhaps the consistency or inconsistency of the sound has something to do with the valence. We only count those with a confidence greater than 0.3.
* The average dominance of each of the 12 pitches across the song's segments.
* The average of each of the 12 timbre values; we are going to try using them both independently and summed.
* The average tatum duration; I found that tatums were heavily used in the development of Echo Nest's algorithms.

For each of the features mentioned above we create a corresponding function. Each one takes as input the audio analysis that we extract with the help of spotipy.

In [10]:
# Calculates, for each of the 12 pitches, the average across the track's confident segments
def CalcPitch(song_analysis):
    # one running sum per pitch
    pitchesvalues = [0] * 12
    Totalvalues = 0
    # we iterate over all segments, keeping those with confidence above 0.3
    for segment in song_analysis['segments']:
        if segment['confidence'] > 0.3:
            Totalvalues += 1
            # we add each pitch's value to its running sum
            for counter, pitch in enumerate(segment['pitches']):
                pitchesvalues[counter] += pitch
    # no segment passed the confidence filter: return NaN instead of dividing by zero
    if Totalvalues == 0:
        return [np.nan] * 12
    # we get each pitch's average by dividing by the number of retained segments
    # returns a list of 12 values, one average per pitch
    return [x / Totalvalues for x in pitchesvalues]

# Calculates, for each of the 12 timbre values, the average across the track's confident segments
def CalcTimbre(song_analysis):
    # one running sum per timbre value
    Timbrevalues = [0] * 12
    Totalvalues = 0
    # we iterate over all segments, keeping those with confidence above 0.3
    for segment in song_analysis['segments']:
        if segment['confidence'] > 0.3:
            Totalvalues += 1
            # we add each timbre value to its running sum
            for counter, timbre in enumerate(segment['timbre']):
                Timbrevalues[counter] += timbre
    # no segment passed the confidence filter: return NaN instead of dividing by zero
    if Totalvalues == 0:
        return [np.nan] * 12
    # we get each timbre's average by dividing by the number of retained segments
    # returns a list of 12 values, one average per timbre value
    return [x / Totalvalues for x in Timbrevalues]


# Calculates the average tatum duration over tatums with confidence above 0.3
def AVGTatum(song_analysis):
    tatumsum = 0
    countvalues = 0
    # for each of the available tatums
    for tatum in song_analysis['tatums']:
        if tatum['confidence'] > 0.3:
            countvalues += 1
            # we sum every retained duration
            tatumsum += tatum['duration']
    # no tatum passed the confidence filter: return NaN instead of dividing by zero
    if countvalues == 0:
        return np.nan
    # we divide by the number of retained tatums to get the average duration
    tatumavg = tatumsum / countvalues
    return round(tatumavg, 4)

# Calculates the total number of segments in one track, counting only those with a confidence greater than 0.3
def TotalSegments(song_analysis):
    sumsegments = 0 
    for index in range(len(song_analysis['segments'])):
        if (song_analysis['segments'][index]['confidence'] > 0.3):
            sumsegments += 1
    return sumsegments

# Calculates the total number of sections in one track, counting only those with a confidence greater than 0.3
def TotalSections(song_analysis):
    sumsections = 0 
    for index in range(len(song_analysis['sections'])):
        if (song_analysis['sections'][index]['confidence'] > 0.3):
            sumsections += 1
    return sumsections
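
The pitch and timbre averages follow the same pattern, so they can be checked on a tiny made-up analysis. The sketch below is an alternative formulation with NumPy, not the functions used above: `mean_over_confident_segments` and `toy_analysis` are hypothetical names, and the numbers are invented so the result is easy to verify by hand.

```python
import numpy as np


def mean_over_confident_segments(song_analysis, key, min_confidence=0.3):
    """Average the 12-value `key` vectors over segments above the confidence cutoff."""
    rows = [segment[key]
            for segment in song_analysis["segments"]
            if segment["confidence"] > min_confidence]
    if not rows:
        # No segment passed the cutoff; NaN marks the value as missing.
        return [float("nan")] * 12
    return np.mean(np.array(rows, dtype=float), axis=0).tolist()


# Made-up analysis with three segments; the middle one falls below the cutoff.
toy_analysis = {
    "segments": [
        {"confidence": 0.9, "pitches": [1.0] * 12, "timbre": [10.0] * 12},
        {"confidence": 0.1, "pitches": [0.0] * 12, "timbre": [99.0] * 12},
        {"confidence": 0.5, "pitches": [0.5] * 12, "timbre": [20.0] * 12},
    ]
}

toy_pitches = mean_over_confident_segments(toy_analysis, "pitches")
toy_timbre = mean_over_confident_segments(toy_analysis, "timbre")
print(toy_pitches)
```

Only the first and third segments contribute, so every pitch averages to 0.75 and every timbre value to 15.0.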


In [None]:
analysis = {}
counter = 0
while counter < len(all_track_ids):
    # select one of the unique track ids (13880 in this dataset)
    track = all_track_ids[counter]
    # get the track's audio analysis from the API
    song_analysis = sp.audio_analysis(track)

    # use the functions we created to calculate the features for this id
    EngineeredFeatures = {}
    EngineeredFeatures['TotalSegments'] = TotalSegments(song_analysis)
    EngineeredFeatures['TotalSections'] = TotalSections(song_analysis)
    EngineeredFeatures['AvgTatumDuration'] = AVGTatum(song_analysis)
    # get the 12 values for pitch and timbre
    Pitch12 = CalcPitch(song_analysis)
    Timbre12 = CalcTimbre(song_analysis)
    for index in range(len(Pitch12)):
        EngineeredFeatures['Pitch'+str(index+1)] = Pitch12[index]
    for index in range(len(Timbre12)):
        EngineeredFeatures['Timbre'+str(index+1)] = Timbre12[index]

    analysis.update({ track: EngineeredFeatures })
    counter += 1
analysisDF = pd.DataFrame.from_dict(analysis, orient='index')
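
Because this loop makes one request per track, a single failed request would stop the whole run. One possible safeguard, sketched here under assumptions, is to record the failing ids and carry on. `collect_analysis`, `fake_fetch_analysis` and `count_confident` are hypothetical names; the fake fetcher stands in for `sp.audio_analysis`, and catching every `Exception` is a deliberately coarse choice for illustration.

```python
def collect_analysis(track_ids, fetch, extract):
    """Apply extract(fetch(track_id)) per track and record the ids that raise."""
    results, failed = {}, []
    for track_id in track_ids:
        try:
            results[track_id] = extract(fetch(track_id))
        except Exception as error:
            # Keep the id and the reason so the track can be retried later.
            failed.append((track_id, repr(error)))
    return results, failed


# Hypothetical stand-in for sp.audio_analysis; it fails for the id "bad".
def fake_fetch_analysis(track_id):
    if track_id == "bad":
        raise RuntimeError("simulated request failure")
    return {"segments": [{"confidence": 0.9}, {"confidence": 0.2}]}


def count_confident(song_analysis):
    kept = sum(1 for s in song_analysis["segments"] if s["confidence"] > 0.3)
    return {"TotalSegments": kept}


results, failed = collect_analysis(["a", "bad", "b"], fake_fetch_analysis, count_confident)
print(results, failed)
```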

Now we join the audio features with the features engineered from the audio analysis, and export the result to CSV so we can use it for the analysis in the main notebook.

In [None]:
JoinedDF = tracks.join(analysisDF)
JoinedDF.to_csv("TrackFeatures.csv")
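
`DataFrame.join` aligns rows on the index and keeps every row of the left table by default, so a track without analysis features ends up with missing values rather than being dropped. The sketch below shows this on made-up data; `left`, `right` and the ids are invented for illustration.

```python
import pandas as pd

# Made-up tables that share an index of track ids.
left = pd.DataFrame({"valence": [0.8, 0.3, 0.6]}, index=["id_a", "id_b", "id_c"])
right = pd.DataFrame({"TotalSegments": [700, 650]}, index=["id_a", "id_b"])

# Default join is a left join on the index.
joined = left.join(right)

# Tracks present on the left only get NaN in the right-hand columns.
missing = joined[joined["TotalSegments"].isna()].index.tolist()
print(missing)
```

A check like `missing` could be run on `JoinedDF` before exporting, to see how many tracks lack analysis features.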