# The data behind Spotify music

- [Spotify](https://www.spotify.com/) is a service with more than **70 million tracks**, and with **345 million active users per month**. With these numbers, Data Analysis is a must! The Spotify databases are enriched with lots of features: *popularity*, *danceability*, *key*... you can see them for yourself in its [API](https://developer.spotify.com/documentation/web-api/).


- In my analysis, I'm going to use a [dataset](https://www.kaggle.com/yamaerenay/spotify-dataset-19212020-160k-tracks) collected from this API by a Kaggle user, Yamac Eren Ay. There are two main datasets, one for songs and another one for artists (the other datasets are derived from these two by aggregation).

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
import re

In [2]:
import ipywidgets as widgets
from ipywidgets import interact

In [3]:
import chart_studio.plotly as py
import plotly.graph_objs as go
from plotly.offline import iplot, init_notebook_mode
init_notebook_mode(connected=True)

import cufflinks as cf
cf.go_offline(connected=True)
cf.set_config_file(colorscale='plotly', world_readable=True)





## Songs dataset

In [4]:
df = pd.read_csv('./data/data.csv')
print(df.shape)
df.head()

(174389, 19)


Unnamed: 0,acousticness,artists,danceability,duration_ms,energy,explicit,id,instrumentalness,key,liveness,loudness,mode,name,popularity,release_date,speechiness,tempo,valence,year
0,0.991,['Mamie Smith'],0.598,168333,0.224,0,0cS0A1fUEUd1EW3FcF8AEI,0.000522,5,0.379,-12.628,0,Keep A Song In Your Soul,12,1920,0.0936,149.976,0.634,1920
1,0.643,"[""Screamin' Jay Hawkins""]",0.852,150200,0.517,0,0hbkKFIJm7Z05H8Zl9w30f,0.0264,5,0.0809,-7.261,0,I Put A Spell On You,7,1920-01-05,0.0534,86.889,0.95,1920
2,0.993,['Mamie Smith'],0.647,163827,0.186,0,11m7laMUgmOKqI3oYzuhne,1.8e-05,0,0.519,-12.098,1,Golfing Papa,4,1920,0.174,97.6,0.689,1920
3,0.000173,['Oscar Velazquez'],0.73,422087,0.798,0,19Lc5SfJJ5O1oaxY0fpwfh,0.801,2,0.128,-7.311,1,True House Music - Xavier Santos & Carlos Gomi...,17,1920-01-01,0.0425,127.997,0.0422,1920
4,0.295,['Mixe'],0.704,165224,0.707,1,2hJjbsLCytGsnAHfdsLejp,0.000246,10,0.402,-6.036,0,Xuniverxe,2,1920-10-01,0.0768,122.076,0.299,1920


First, some cleaning:
- Drop duplicates
- Check for NaN
- Tokenize artists

In [5]:
# check for duplicated items, and drop them
print(f'{df.duplicated().sum()} duplicated tracks')
df = df.drop_duplicates()
df = df.reset_index(drop=True)

# check for NaN values
print(f'{df.isna().sum().sum()} NaN values')

# tokenize artist column
def tokenize_str(text):
    # raw string, so that \[ and \] are not treated as (invalid) string escapes
    regex_rule = re.compile(r"""['"\[\]]""")
    text = re.sub(regex_rule, "", text)
    return text.split(",")

df['artists'] = df['artists'].apply(tokenize_str)
df.head()

2159 duplicated tracks
0 NaN values


Unnamed: 0,acousticness,artists,danceability,duration_ms,energy,explicit,id,instrumentalness,key,liveness,loudness,mode,name,popularity,release_date,speechiness,tempo,valence,year
0,0.991,[Mamie Smith],0.598,168333,0.224,0,0cS0A1fUEUd1EW3FcF8AEI,0.000522,5,0.379,-12.628,0,Keep A Song In Your Soul,12,1920,0.0936,149.976,0.634,1920
1,0.643,[Screamin Jay Hawkins],0.852,150200,0.517,0,0hbkKFIJm7Z05H8Zl9w30f,0.0264,5,0.0809,-7.261,0,I Put A Spell On You,7,1920-01-05,0.0534,86.889,0.95,1920
2,0.993,[Mamie Smith],0.647,163827,0.186,0,11m7laMUgmOKqI3oYzuhne,1.8e-05,0,0.519,-12.098,1,Golfing Papa,4,1920,0.174,97.6,0.689,1920
3,0.000173,[Oscar Velazquez],0.73,422087,0.798,0,19Lc5SfJJ5O1oaxY0fpwfh,0.801,2,0.128,-7.311,1,True House Music - Xavier Santos & Carlos Gomi...,17,1920-01-01,0.0425,127.997,0.0422,1920
4,0.295,[Mixe],0.704,165224,0.707,1,2hJjbsLCytGsnAHfdsLejp,0.000246,10,0.402,-6.036,0,Xuniverxe,2,1920-10-01,0.0768,122.076,0.299,1920
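Note that the regex above also removes apostrophes that belong to a name (*Screamin' Jay Hawkins* becomes *Screamin Jay Hawkins*) and leaves a leading space after each comma. A possible alternative, sketched here under the assumption that every cell of the `artists` column is a valid Python list literal, is to parse it with `ast.literal_eval`. The helper name `parse_artists` is made up for this example:

```python
import ast

def parse_artists(text):
    """Parse a string such as "['A', 'B']" into a list of artist names."""
    # literal_eval evaluates the list literal stored in the CSV without
    # executing arbitrary code, so quotes inside a name are kept intact.
    return [name.strip() for name in ast.literal_eval(text)]

print(parse_artists("[\"Screamin' Jay Hawkins\"]"))
print(parse_artists("['J Balvin', 'Jeon', 'Anitta']"))
```

If the assumption holds, `df['artists'].apply(parse_artists)` could be used in place of `tokenize_str` on the raw column.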


### Song popularity analysis

Let's convert the duration to minutes, so that it is easier to read:

In [6]:
df['duration_min'] = df['duration_ms']/(1000*60)

And now let's plot one feature vs the other:

In [7]:
# I'm going to take a sample, because the graph is so slow...
df_plot = df.sample(frac=0.1)

In [8]:
@interact
def time_plot(x=list(df_plot.drop(['duration_ms','artists'], axis=1).columns), 
              y=list(df_plot.drop(['duration_ms','artists'], axis=1).columns)):

    df_plot.iplot(kind='scatter', x=x, y=y, mode='markers', size=3,
             xTitle=x.title(), yTitle=y.title(), theme='white', colors='#1DB954',
             title=f'{x.title()} vs {y.title()}')

interactive(children=(Dropdown(description='x', options=('acousticness', 'danceability', 'energy', 'explicit',…

We can't see much on these graphs!
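When a scatter plot is too noisy to read, a correlation table can give a first numeric summary. The sketch below uses a small made-up frame (`toy`) instead of the real `df`, and Pearson correlation, which only captures linear relationships:

```python
import pandas as pd

# Hypothetical toy frame standing in for a few numeric columns of `df`.
toy = pd.DataFrame({
    "popularity":   [10, 20, 30, 40, 50],
    "loudness":     [-20.0, -15.0, -12.0, -8.0, -5.0],
    "acousticness": [0.9, 0.8, 0.5, 0.3, 0.1],
})

# Pearson correlation of every feature against popularity.
corr = toy.corr()["popularity"].drop("popularity")
print(corr.sort_values())
```

On the real data the same call would be applied to the numeric columns of `df` only.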

### How has music taste changed over time?

Let's aggregate by year, now:

In [9]:
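# numeric_only=True skips text/list columns (artists, id, name...), which recent pandas versions refuse to average
mean_by_year = df.groupby(by='year').mean(numeric_only=True).reset_index()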
mean_by_year

Unnamed: 0,year,acousticness,danceability,duration_ms,energy,explicit,instrumentalness,key,liveness,loudness,mode,popularity,speechiness,tempo,valence,duration_min
0,1920,0.631242,0.515750,238092.997135,0.418700,0.123209,0.354219,4.770774,0.216049,-12.654020,0.636103,0.610315,0.082984,113.226900,0.498210,3.968217
1,1921,0.862105,0.432171,257891.762821,0.241136,0.070513,0.337158,5.108974,0.205219,-16.811660,0.666667,0.391026,0.078952,102.425397,0.378276,4.298196
2,1922,0.828934,0.575620,140135.140496,0.226173,0.000000,0.254776,4.842975,0.256662,-20.840083,0.661157,0.090909,0.464368,100.033149,0.571190,2.335586
3,1923,0.957247,0.577341,177942.362162,0.262406,0.000000,0.371733,4.810811,0.227462,-14.129211,0.789189,5.205405,0.093949,114.010730,0.625492,2.965706
4,1924,0.940200,0.549894,191046.707627,0.344347,0.000000,0.581701,5.648305,0.235219,-14.231343,0.754237,0.661017,0.092089,120.689572,0.663725,3.184112
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
97,2017,0.203493,0.581529,243134.111058,0.687585,0.227404,0.242731,5.318269,0.231528,-8.014637,0.608654,32.733654,0.097748,121.703343,0.435017,4.052235
98,2018,0.231001,0.603681,225477.958462,0.659335,0.248462,0.226369,5.408846,0.233501,-8.192142,0.583846,28.498462,0.122919,123.557107,0.423175,3.757966
99,2019,0.262731,0.605277,225277.108084,0.627724,0.242519,0.217415,5.357749,0.218535,-8.536509,0.599375,33.439482,0.107521,122.213681,0.458687,3.754618
100,2020,0.203989,0.603137,218745.913530,0.673927,0.189034,0.242623,5.353519,0.240397,-8.007122,0.591107,27.051282,0.099877,123.974626,0.455352,3.645765


In [10]:
@interact
def time_plot(y=list(mean_by_year.drop('duration_ms', axis=1).columns)[1:]):

    mean_by_year.iplot(kind='scatter', x='year', y=y, mode='markers', size=5,
             xTitle='year', yTitle=y.title(), theme='white', colors='#1DB954',
             title=f'mean {y.title()} over time')
    

interactive(children=(Dropdown(description='y', options=('acousticness', 'danceability', 'energy', 'explicit',…

- Well, we can see that the songs are becoming:
    - longer,
    - louder, 
    - with faster tempos 


- The change in valence (the positiveness of a track) is also interesting: it had more variance in the past, then peaked in the 1980s and has been going down until last year.


- The mean key also varies less than it used to, and it is settling:
    - between F (*Somebody That I Used to Know (Gotye), Yesterday, Hey Jude*)
    - and F# (*Born This Way (Lady Gaga), I Wanna Dance with Somebody (Whitney Houston)*)
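The remarks about variance can be checked numerically. A minimal sketch, on a made-up frame, computes the mean and the standard deviation per year. Keep in mind that this measures the spread of tracks *within* a year, which is not the same quantity as the year-to-year scatter of the means seen in the plot:

```python
import pandas as pd

# Hypothetical values; the real analysis would use df[['year', 'valence']].
toy = pd.DataFrame({
    "year":    [1950, 1950, 1950, 2010, 2010, 2010],
    "valence": [0.1, 0.5, 0.9, 0.4, 0.5, 0.6],
})

# One row per year, with the mean and the sample standard deviation.
by_year = toy.groupby("year")["valence"].agg(["mean", "std"])
print(by_year)
```

In this toy example both years have the same mean, but the older one is much more spread out.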

### Oldest and newest songs

In [11]:
df.sort_values(by=['year','release_date']).iloc[0]

acousticness                           0.991
artists                        [Mamie Smith]
danceability                           0.598
duration_ms                           168333
energy                                 0.224
explicit                                   0
id                    0cS0A1fUEUd1EW3FcF8AEI
instrumentalness                    0.000522
key                                        5
liveness                               0.379
loudness                             -12.628
mode                                       0
name                Keep A Song In Your Soul
popularity                                12
release_date                            1920
speechiness                           0.0936
tempo                                149.976
valence                                0.634
year                                    1920
duration_min                         2.80555
Name: 0, dtype: object
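The `release_date` column mixes year-only values (`1920`) with full dates (`1920-01-05`), so sorting it as text puts the bare years first. If a true chronological order is needed, one option is to pad the short values before parsing. This is only a sketch: `pad_date` is a made-up helper, the `YYYY-MM` case is an assumption (it does not appear in the rows shown here), and padding with January 1st is an arbitrary choice:

```python
import pandas as pd

dates = pd.Series(["1920", "1920-01-05", "1921-03"])

def pad_date(text):
    """Complete 'YYYY' or 'YYYY-MM' strings to 'YYYY-MM-DD'."""
    if len(text) == 4:   # year only
        return text + "-01-01"
    if len(text) == 7:   # year and month (assumed format)
        return text + "-01"
    return text

# All values now share one format, so an explicit format string is safe.
parsed = pd.to_datetime(dates.map(pad_date), format="%Y-%m-%d")
print(parsed)
```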

Newest and fastest song (the one with the highest tempo):

In [12]:
df.sort_values(by=['year','tempo'], ascending=[False,False]).iloc[0]

acousticness                            0.0489
artists             [J Balvin,  Jeon,  Anitta]
danceability                             0.642
duration_ms                             189653
energy                                   0.837
explicit                                     0
id                      0caF41CcNMjpG6AoV6xHB9
instrumentalness                             0
key                                          4
liveness                                0.0898
loudness                                -3.356
mode                                         1
name                                   Machika
popularity                                   0
release_date                        2021-01-22
speechiness                              0.367
tempo                                  211.968
valence                                  0.523
year                                      2021
duration_min                           3.16088
Name: 19865, dtype: object

### What makes a song popular?

Let's aggregate by popularity, to see these relationships more clearly:

In [13]:
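# numeric_only=True skips text/list columns, which recent pandas versions refuse to average
mean_by_popularity = df.groupby(by='popularity').mean(numeric_only=True).reset_index()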
mean_by_popularity

Unnamed: 0,popularity,acousticness,danceability,duration_ms,energy,explicit,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,valence,year,duration_min
0,0,0.686870,0.532300,231501.194593,0.362382,0.079765,3.415497e-01,5.224194,0.220316,-14.030980,0.674268,0.195604,112.122256,0.498639,1959.237257,3.858353
1,1,0.664209,0.515032,232236.778967,0.385811,0.029703,2.922477e-01,5.428686,0.225508,-13.750756,0.683168,0.161253,114.508610,0.463973,1962.585764,3.870613
2,2,0.632248,0.512122,231608.636993,0.423087,0.023063,2.913310e-01,5.190959,0.234347,-12.754102,0.695111,0.122784,115.547566,0.477612,1967.328875,3.860144
3,3,0.631016,0.511365,211356.902778,0.444724,0.011218,3.199950e-01,5.142628,0.239475,-12.294946,0.697115,0.086457,116.805191,0.499185,1971.490385,3.522615
4,4,0.650044,0.502556,208418.051916,0.418885,0.006799,3.412457e-01,5.313350,0.238968,-12.631209,0.697157,0.095630,116.796001,0.484398,1971.213226,3.473634
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
93,93,0.718000,0.675500,164633.000000,0.461000,0.500000,7.950000e-07,7.000000,0.111800,-8.939500,1.000000,0.203750,104.047000,0.311450,2020.000000,2.743883
94,94,0.177660,0.789600,167209.400000,0.510800,0.600000,2.600000e-02,4.200000,0.134720,-6.823800,0.600000,0.132320,103.832000,0.559200,2020.000000,2.786823
95,95,0.306500,0.797000,192018.500000,0.619500,1.000000,2.726000e-04,2.500000,0.108000,-7.108500,0.000000,0.103200,136.918000,0.491500,2020.000000,3.200308
96,96,0.344500,0.718500,156425.500000,0.762000,1.000000,0.000000e+00,3.500000,0.182550,-4.164500,0.500000,0.062350,117.502000,0.719000,2020.000000,2.607092


And now let's see the relationships between popularity and the other variables

In [14]:
@interact
def time_plot(x=list(mean_by_popularity.drop(['duration_ms'], axis=1).columns)[1:]):
   
    mean_by_popularity.iplot(kind='scatter', x=x, y='popularity', mode='markers', size=5,
             xTitle=x.title(), yTitle='popularity', theme='white', colors='#1DB954',
             title=f'Popularity vs mean {x.title()}')
    

interactive(children=(Dropdown(description='x', options=('acousticness', 'danceability', 'energy', 'explicit',…

- It turns out that **more recent songs are more popular**. This is partly a consequence of the way popularity is calculated: it takes into account the total number of plays and **how recent** those plays are. This can bias the relationship between popularity and the other features of the songs.


- In summary, the popular songs tend to be:
    - more danceable, 
    - more heavily produced (less acoustic), 
    - louder, 
    - and not very long (around 3.5 min)
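One possible way to reduce the recency bias mentioned above is to compare each track only with tracks from the same year, by subtracting the yearly mean from every column before correlating. The sketch below uses made-up numbers and is not part of the original analysis:

```python
import pandas as pd

# Hypothetical toy data: two release years, three tracks each.
toy = pd.DataFrame({
    "year":       [1960, 1960, 1960, 2020, 2020, 2020],
    "popularity": [5, 10, 15, 60, 70, 80],
    "loudness":   [-15.0, -12.0, -9.0, -9.0, -6.0, -3.0],
})

cols = ["popularity", "loudness"]
# Subtract the mean of the release year from every track.
centered = toy[cols] - toy.groupby("year")[cols].transform("mean")

raw = toy["popularity"].corr(toy["loudness"])
within = centered["popularity"].corr(centered["loudness"])
print(f"raw correlation: {raw:.3f}, within-year correlation: {within:.3f}")
```

The within-year value describes whether louder tracks are more popular *among tracks of the same year*, which is closer to the question asked in this section.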


### Examples of the most and least popular songs

Which are the **least popular** songs?

In [15]:
df[df['popularity']== 0]

Unnamed: 0,acousticness,artists,danceability,duration_ms,energy,explicit,id,instrumentalness,key,liveness,loudness,mode,name,popularity,release_date,speechiness,tempo,valence,year,duration_min
7,0.99600,[Mamie Smith & Her Jazz Hounds],0.474,186173,0.239,0,02FzJbHtqElixxCmrpSCUa,0.186,9,0.195,-9.712,1,Arkansas Blues,0,1920,0.0289,78.784,0.366,1920,3.102883
8,0.99600,[Francisco Canaro],0.469,146840,0.238,0,02i59gYdjlhBmbbWhf8YuK,0.960,8,0.149,-18.717,1,La Chacarera - Remasterizado,0,1920-07-08,0.0741,130.060,0.621,1920,2.447333
9,0.00682,[Meetya],0.571,476304,0.753,0,06NUxS2XL3efRh0bloxkHm,0.873,8,0.092,-6.943,1,Broken Puppet - Original Mix,0,1920-01-01,0.0446,126.993,0.119,1920,7.938400
10,0.95200,[Dorville],0.688,150067,0.220,0,07jrRR1CUUoPb1FLfSy9Jh,0.000,6,0.262,-15.208,0,Oouin,0,1920,0.8450,82.024,0.414,1920,2.501117
11,0.99600,[Francisco Canaro],0.579,167213,0.356,0,0ANuF7SvPeIHanGcCpy9jR,0.948,10,0.174,-14.574,1,Desengaño - Remasterizado,0,1920-07-08,0.0394,131.494,0.703,1920,2.786883
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
172224,0.79500,[Alessia Cara],0.429,144720,0.211,0,3N3Wi5Un7iT8amLezSRwub,0.000,4,0.196,-11.665,1,A Little More,0,2021-01-22,0.0360,94.710,0.228,2021,2.412000
172225,0.79500,[Alessia Cara],0.429,144720,0.211,0,45XnLMuqf3vRfskEAMUeCH,0.000,4,0.196,-11.665,1,A Little More,0,2021-01-22,0.0360,94.710,0.228,2021,2.412000
172226,0.79500,[Alessia Cara],0.429,144720,0.211,0,4pPFI9jsguIh3wC7Otoyy8,0.000,4,0.196,-11.665,1,A Little More,0,2021-01-22,0.0360,94.710,0.228,2021,2.412000
172227,0.79500,[Alessia Cara],0.429,144720,0.211,0,52YtxLVUyvtiGPxwwxayHZ,0.000,4,0.196,-11.665,1,A Little More,0,2021-01-22,0.0360,94.710,0.228,2021,2.412000


There are a lot of songs with popularity zero! Let's take the least danceable and quietest ones:

In [16]:
df[df['popularity']==0].sort_values(by=['danceability', 'loudness']).head(13)

Unnamed: 0,acousticness,artists,danceability,duration_ms,energy,explicit,id,instrumentalness,key,liveness,loudness,mode,name,popularity,release_date,speechiness,tempo,valence,year,duration_min
22610,0.0,[Benny Goodman],0.0,5991,0.0,0,3IcXTeq9O2dpsSXsDj9naH,0.0,0,0.0,-60.0,0,Pause Track - Live,0,1938,0.0,0.0,0.0,1938,0.09985
22679,0.0,[Benny Goodman],0.0,6362,0.0,0,523qs4UcGlQ6ycdha1VGqs,0.0,0,0.0,-60.0,0,Pause Track - Live,0,1938,0.0,0.0,0.0,1938,0.106033
61738,0.0,[Future Rapper],0.0,420000,0.0,0,0Rd7eiAZGayLT8TmrVpQzG,0.0,0,0.0,-60.0,0,StaggerLee Has His Day at the Beach,0,1949-02-17,0.0,0.0,0.0,1949,7.0
61830,0.0,[Sarah Vaughan],0.0,5108,0.0,0,0hr9kRUi2X4MXc72A4VxG4,0.0,0,0.0,-60.0,0,Pause Track,0,1949,0.0,0.0,0.0,1949,0.085133
142980,0.0,[Sarah Vaughan],0.0,6467,0.0,0,3lRVIn6D6EUbvkOgPZAU1H,0.0,0,0.0,-60.0,0,Pause Track,0,1949,0.0,0.0,0.0,1949,0.107783
144203,0.978,[Unspecified],0.0,42107,0.189,0,07kyGuUKm3yIFs8AoLExJj,0.92,10,0.114,-36.524,1,Stethoscope Sounds: Normal Heart and Lung Soun...,0,1955-01-01,0.0,0.0,0.0,1955,0.701783
156496,0.137,[N.A.T.E. Jones],0.0,8042,0.279,1,334gn6CIPuCzq5laDm1m0Y,0.000122,10,0.342,-29.832,0,Happy New Years!! (Clear Vision),0,2020-01-01,0.0,0.0,0.0,2020,0.134033
159136,0.729,"[Wolfgang Amadeus Mozart, Sebastian Fischer, ...",0.0,11493,0.146,0,0CpFb3YFhEo2EgDOSQaGbe,0.0,7,0.402,-27.851,1,"Die Zauberflöte, K.620 / Act 1: Ist's denn Wir...",0,1955-01-01,0.0,0.0,0.0,1955,0.19155
95426,0.974,[Ogden Nash],0.0,9680,0.116,0,0R5mcFkjcryBDgw9vKpFSo,0.0457,5,0.205,-27.357,1,The Perfect Husband,0,1950-01-01,0.0,0.0,0.0,1950,0.161333
92718,0.227,[Corey-G],0.0,12000,0.101,1,1N4jTBZnj2M3SOTLB8FXPs,0.0,5,0.264,-26.782,0,Happy New Year 2019,0,2019-01-01,0.0,0.0,0.0,2019,0.2


The first 12 tracks are pause tracks, or speech introductions... there is even a track with just silence for 7 minutes!!

The first one that is really a song is a tango, *La mina del Ford*, from the singer *Ignacio Corsini*
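Tracks like these could be filtered out before the analysis. A rough sketch, assuming that a tempo of zero or a very short duration marks a non-musical track; the 30-second threshold and the third row of the toy frame are made up for the example:

```python
import pandas as pd

toy = pd.DataFrame({
    "name":        ["Pause Track", "Happy New Year 2019", "Hypothetical song"],
    "tempo":       [0.0, 0.0, 120.0],
    "duration_ms": [5108, 12000, 180000],
})

MIN_DURATION_MS = 30_000  # assumed threshold, not taken from the dataset

# Keep tracks with a detected tempo and a reasonable length.
is_music = (toy["tempo"] > 0) & (toy["duration_ms"] >= MIN_DURATION_MS)
songs = toy[is_music]
print(songs)
```

A rule this simple will miss some cases (a long spoken track with a detected tempo, for instance), so the result is worth inspecting by hand.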

Let's see the **most popular** songs:

In [17]:
df[df['popularity']> 95].sort_values(by='popularity', ascending=False)

Unnamed: 0,acousticness,artists,danceability,duration_ms,energy,explicit,id,instrumentalness,key,liveness,loudness,mode,name,popularity,release_date,speechiness,tempo,valence,year,duration_min
19768,0.721,[Olivia Rodrigo],0.585,242014,0.436,1,7lPN2DXiMsVn7XUKtOW1CS,1.3e-05,10,0.105,-8.761,1,drivers license,100,2021-01-08,0.0601,143.874,0.132,2021,4.033567
19588,0.221,"[24kGoldn, iann dior]",0.7,140526,0.722,1,3tjFYV6RSFtuktYl3ZtYcq,0.0,7,0.272,-3.558,0,Mood (feat. iann dior),96,2020-07-24,0.0369,90.989,0.756,2020,2.3421
19591,0.468,[Ariana Grande],0.737,172325,0.802,1,35mvY5S1H3J2QZyna3TFe0,0.0,0,0.0931,-4.771,1,positions,96,2020-10-30,0.0878,144.015,0.682,2020,2.872083


All of these are famous artists: Olivia Rodrigo is a former Disney Channel actress, 24kGoldn is a rapper with other successful songs, and Ariana Grande is, well, Ariana Grande.

### Does the key of a song affect its other features?

In [18]:
mean_by_key = df.groupby(by='key').mean(numeric_only=True).reset_index()
# the duration must come from mean_by_key itself (not from mean_by_year)
mean_by_key['duration_min'] = mean_by_key['duration_ms']/(1000*60)
mean_by_key['key'] = mean_by_key['key'].map(
    {0:'C', 1:'C#', 2:'D', 3:'D#', 4:'E', 5:'F', 6:'F#', 7:'G', 8:'G#', 9:'A', 10:'A#', 11:'B'})
mean_by_key

Unnamed: 0,key,acousticness,danceability,duration_ms,energy,explicit,instrumentalness,liveness,loudness,mode,popularity,speechiness,tempo,valence,year,duration_min
0,C,0.518295,0.532754,232269.551497,0.46086,0.057347,0.191891,0.213467,-11.979978,0.808521,26.039199,0.095243,117.188636,0.526403,1975.701244,3.871159
1,C#,0.439267,0.564078,231804.163926,0.498519,0.139034,0.192247,0.208647,-11.887598,0.755569,25.789356,0.176447,115.967498,0.505154,1978.435865,3.863403
2,D,0.487484,0.521206,236255.26984,0.495301,0.053015,0.190954,0.213842,-11.540599,0.778837,26.978526,0.084144,117.897979,0.523099,1977.392203,3.937588
3,D#,0.685961,0.496564,226733.973283,0.384241,0.0299,0.244568,0.198109,-13.000934,0.776024,21.109911,0.086853,114.867187,0.497562,1967.682447,3.778900
4,E,0.475399,0.523851,235468.511801,0.505455,0.055962,0.185151,0.220298,-11.407504,0.559692,27.624182,0.082983,117.845289,0.513354,1979.111771,3.924475
5,F,0.582946,0.529292,231789.071398,0.437731,0.042961,0.206887,0.210515,-12.287754,0.656252,24.555379,0.09021,115.222857,0.53021,1972.781717,3.863151
6,F#,0.437523,0.561007,227724.400463,0.512985,0.11067,0.188979,0.214424,-11.499361,0.579034,26.37004,0.15711,115.936454,0.521821,1979.628638,3.795407
7,G,0.497629,0.534822,232274.13513,0.481861,0.058779,0.195315,0.217148,-11.745701,0.801118,26.122721,0.089693,117.394189,0.533041,1977.108748,3.871236
8,G#,0.550528,0.533388,230736.403868,0.461607,0.069302,0.2075,0.197323,-11.86954,0.810959,25.168942,0.098754,115.884323,0.518518,1974.814941,3.845607
9,A,0.446203,0.535586,230870.395578,0.520386,0.047803,0.175271,0.205842,-11.230454,0.659446,27.461629,0.095671,119.419618,0.543805,1978.606437,3.847840


In [19]:
@interact
def key_plot(y=list(mean_by_key.drop(['duration_ms', 'key'], axis=1).columns)):
  
    mean_by_key.iplot(kind='scatter', x='key', y=y, mode='lines+markers', size=5,
             xTitle='key', yTitle=y.title(), theme='white', colors='#1DB954',
             title=f'{y.title()} mean for every key')

interactive(children=(Dropdown(description='y', options=('acousticness', 'danceability', 'energy', 'explicit',…

- The songs in **D#** are the least popular on average (a mean popularity of about 21, against 24 to 28 for the other keys shown above)! These are songs like *Rolling in the Deep* (*Adele*) and *Bohemian Rhapsody* (*Queen*)
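Before reading too much into a difference between keys, it helps to check how many tracks sit behind each mean. A small sketch on made-up data, using the same pitch-class order as the mapping above (`KEY_NAMES` and `toy` are names introduced for this example):

```python
import pandas as pd

KEY_NAMES = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']

# Hypothetical toy data; the real analysis would use df[['key', 'popularity']].
toy = pd.DataFrame({
    "key":        [0, 0, 3, 3, 3, 5],
    "popularity": [30, 20, 10, 20, 30, 40],
})

# Number of tracks and mean popularity for every key.
summary = toy.groupby("key")["popularity"].agg(["count", "mean"])
summary.index = [KEY_NAMES[k] for k in summary.index]
print(summary)
```

A key with few tracks, or with tracks concentrated in older decades (note the lower mean `year` for D# in the table above), can have a low mean popularity for reasons unrelated to the key itself.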