TristanT56/My-Spotify-Data-Analysis---Python

My Spotify Data Analysis - Python

Introduction

The main goal of this project is to practice my Python data science skills and to introduce you to another part of myself: my taste in music.

In this project, I will analyse my Spotify data from July 2020 to July 2021.

First, I will clean and prepare the data. Then, I will do an exploratory data analysis (EDA) to better understand the data and finish cleaning any weird or wrong outliers. Finally comes the analysis, which I will split into two parts:

Part 1 - The objectives are:

  • To find out which are my top artists and my top tracks.
  • To see my music consumption on Spotify per month during the analysis period (2020-07 to 2021-07).
  • To see when I listen to Spotify the most during the week (by day of the week and by hour).

Part 2 - The objectives are:

  • Find the tracks corresponding to the max and min for each audio feature.
  • Find the audio features corresponding to all the songs I listened to.
  • Compare them to the audio features of my top tracks.
  • Analyse the audio features corresponding to the tracks I listen to in the morning, afternoon, evening and night/party.
  • Using these findings and only the audio features, find a track I might like to listen to in the morning, in the afternoon, in the evening and during a party at night.

About the data:

There are two datasets. The first is my historical Spotify data (in JSON format), which I got by requesting it from my Spotify account. The second is the audio features of the tracks I listened to (in CSV format), which I got using the Spotify API (we will see that the API sometimes couldn't find information for some tracks).
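This project reads a single file, StreamingHistory0.json. As a hedged sketch, and assuming an export that is split across several files following the same naming pattern, the helper below (a hypothetical name, not part of the original project) concatenates every matching file in a folder into one DataFrame:

```python
from pathlib import Path

import pandas as pd


def load_streaming_history(folder):
    """Concatenate every StreamingHistory*.json file found in `folder`.

    Hypothetical helper: it assumes each file is a JSON list of records
    with the same keys (endTime, artistName, trackName, msPlayed).
    Raises ValueError if no file matches, because pd.concat needs at
    least one frame.
    """
    files = sorted(Path(folder).glob("StreamingHistory*.json"))
    frames = [pd.read_json(path, orient="records") for path in files]
    return pd.concat(frames, ignore_index=True)
```

With a single file in the folder, this should give the same table as the direct pd.read_json call used below.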

A) Cleaning & Preparation

Table 1 - My Spotify historical data:

#Let's import my spotify historical data into a DataFrame and explore with .head() and .info()
import pandas as pd
spotify_tt = pd.read_json(r'C:\Users\Tristan\Documents\DATA\spotify_project\StreamingHistory0.json', orient = 'records')
spotify_tt.head()
   endTime           artistName      trackName       msPlayed
0  2020-07-16 15:43  Rich Mullins    Hold Me Jesus       8631
1  2020-07-16 15:43  Paolo Conte     L'Orchestrina     198840
2  2020-07-16 15:53  Josh Wilson     Savior, Please      8540
3  2020-07-17 09:29  Isaac Delusion  fancy              31861
4  2020-07-17 09:30  B77             Fleur              23520
spotify_tt.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6859 entries, 0 to 6858
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   endTime     6859 non-null   object
 1   artistName  6859 non-null   object
 2   trackName   6859 non-null   object
 3   msPlayed    6859 non-null   int64 
dtypes: int64(1), object(3)
memory usage: 214.5+ KB

Let's see if we can remove some unnecessary records from our table, to reduce the work and time needed later to ask the Spotify API for the audio features of each track.

#Converting msPlayed to min_played and changing the column name
spotify_tt.msPlayed = spotify_tt.msPlayed /  60000
spotify_tt.rename(columns = {'msPlayed':'min_played'}, inplace=True)

#Let's see the first values:
print(spotify_tt.min_played.value_counts().sort_index().head(10))
0.000000    164
0.000017      3
0.000067      1
0.000083      2
0.000100      1
0.000133      3
0.000183      2
0.000200      1
0.000233      1
0.000300      1
Name: min_played, dtype: int64
#These results are strange, let's analyse all that:
from matplotlib import pyplot as plt
import seaborn as sns
fig, ax = plt.subplots()
sns.histplot(spotify_tt['min_played'], ax=ax)

ax2 = plt.axes([0.4, 0.3, 0.45, 0.5], facecolor='y')
sns.histplot(spotify_tt['min_played'],binwidth=0.5, ax=ax2)
ax2.set_title('zoom : tracks < 5min')
ax2.set_xlabel('min_played (binwidth = 0.5 = 30s)')
ax2.set_ylabel('Nb of tracks')
ax2.set_xlim([0,5])
ax.set_title("Nb of tracks by duration (min played)")
ax.set_ylabel('Nb of tracks')

durationtrack_png

spotify_tt[spotify_tt['min_played'] < 0.5].count()
endTime       1960
artistName    1960
trackName     1960
min_played    1960
dtype: int64

There are 1960 plays that lasted less than 30 seconds (0.5 min). These are probably tracks I skipped after the first few seconds. We will drop them.

#let's drop those records and validate that with assert and .shape before and after the drop
print(spotify_tt.shape)

spotify_tt = spotify_tt[spotify_tt.min_played >= 0.5]
assert spotify_tt[spotify_tt['min_played'] < 0.5].empty

print(spotify_tt.shape)
(6859, 4)
(4899, 4)
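To put the drop in perspective, 1960 of the 6859 plays (about 28.6%) lasted less than 30 seconds. The sketch below shows, on hypothetical toy data, how a boolean mask gives both the count and the share of short plays in one pass:

```python
import pandas as pd

# Hypothetical toy data: minutes played for six plays
plays = pd.DataFrame({"min_played": [0.0, 0.1, 0.4, 0.5, 2.3, 3.1]})

skipped = plays["min_played"] < 0.5   # True for plays shorter than 30 seconds
n_skipped = int(skipped.sum())        # number of short plays
share_skipped = skipped.mean()        # fraction of short plays

kept = plays[~skipped]                # same filter as min_played >= 0.5
print(n_skipped, share_skipped, kept.shape)
```

Summing a boolean mask counts the True values and its mean is the proportion, so no separate .count() call per column is needed.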

OK, now let's look at the tracks played for more than 8 minutes and for more than 10 minutes:

print(spotify_tt[spotify_tt['min_played'] > 8].count())
spotify_tt[spotify_tt['min_played'] > 8].head()
endTime       49
artistName    49
trackName     49
min_played    49
dtype: int64
     endTime           artistName               trackName                        min_played
39   2020-08-12 07:56  B.B. King                Why I Sing The Blues               8.623333
42   2020-08-12 08:10  The Districts            Young Blood                        8.680217
83   2020-08-12 10:48  Peter Cat Recording Co.  Memory Box                         8.064000
136  2020-08-13 10:52  Joakim                   Nothing Gold - Todd Terje Remix    9.024467
150  2020-08-13 11:41  Donna Summer             I Feel Love - 12" Version          8.253100
print(spotify_tt[spotify_tt['min_played'] > 10].count())
spotify_tt[spotify_tt['min_played'] > 10]
endTime       8
artistName    8
trackName     8
min_played    8
dtype: int64
      endTime           artistName        trackName                                min_played
164   2020-08-14 15:17  Michael Kiwanuka  Cold Little Heart                         10.067467
802   2020-08-25 13:57  Love De-Luxe      Here Comes That Sound Again               11.173433
1796  2020-11-10 15:42  Lil Dicky         Truman                                    10.240067
4169  2021-04-15 21:24  Le Joboscope      Data scientist                            30.725217
4170  2021-04-15 21:54  Le Joboscope      Data analyst                              17.910500
6211  2021-07-05 15:10  A suivre          Beatmakers S1 (2/10) : Etienne de Crécy   28.405400
6213  2021-07-05 15:53  A suivre          Beatmakers S1 (8/10) : Synapson           23.405700
6215  2021-07-06 07:46  A suivre          Beatmakers S1 (2/10) : Etienne de Crécy   10.871333

All seems normal. Plays longer than 8 minutes are tracks that really are more than 8 minutes long (I checked on Spotify). Plays longer than 10 minutes are mostly podcasts, or tracks of around 8, 9 or 10 minutes that I rewound before they ended.

#Now, let's rename the columns and reorder the table
spotify_tt.rename(columns = {'endTime':'datetime', 'artistName': 'artist', 'trackName':'track'}, inplace= True)
spotify_tt = spotify_tt[['datetime', 'track', 'artist', 'min_played']]
spotify_tt.head()
    datetime          track               artist          min_played
1   2020-07-16 15:43  L'Orchestrina       Paolo Conte       3.314000
3   2020-07-17 09:29  fancy               Isaac Delusion    0.531017
7   2020-07-17 09:31  Plein de bisous     Lewis OfMan       1.056267
8   2020-07-17 09:32  Le métro et le bus  Lewis OfMan       0.925917
10  2020-07-17 09:36  La légende urbaine  Voyou             3.779767
spotify_tt.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4899 entries, 1 to 6858
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   datetime    4899 non-null   object 
 1   track       4899 non-null   object 
 2   artist      4899 non-null   object 
 3   min_played  4899 non-null   float64
dtypes: float64(1), object(3)
memory usage: 191.4+ KB
#We need to convert datetime to a datetime format (we will set that as index later)
spotify_tt.datetime = pd.to_datetime(spotify_tt.datetime)
spotify_tt.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4899 entries, 1 to 6858
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   datetime    4899 non-null   datetime64[ns]
 1   track       4899 non-null   object        
 2   artist      4899 non-null   object        
 3   min_played  4899 non-null   float64       
dtypes: datetime64[ns](1), float64(1), object(2)
memory usage: 191.4+ KB
#Last check for missing data
spotify_tt.isna().sum()
datetime      0
track         0
artist        0
min_played    0
dtype: int64
#Ok it's clean now, let's save it into a new csv file:
spotify_tt.to_csv(r'C:\Users\Tristan\Documents\DATA\spotify_project\spotify_tt_clean.csv', index=False)

OK, we cleaned up our historical data table and deleted 1960 unnecessary records, which reduces the number of tracks to request from the Spotify API when getting the audio features. To query the API, I followed Vlad Gheorghe's article: https://towardsdatascience.com/get-your-spotify-streaming-history-with-python-d5a208bbcbd3
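The script that produced the audio features file is not shown in this README. Purely as an illustration, and not as the method from the linked article, the sketch below looks up one track and returns its audio features. It assumes a `client` object that behaves like a spotipy.Spotify instance (a `search` method and an `audio_features` method); the function name is hypothetical.

```python
def fetch_audio_features(client, track, artist):
    """Return the audio-features dict for one track, or None if not found.

    Assumption: `client` exposes search(q=..., type=..., limit=...) and
    audio_features(list_of_ids), as a spotipy.Spotify object does.
    """
    # Ask the search endpoint for the best match on track and artist name
    result = client.search(q=f"track:{track} artist:{artist}", type="track", limit=1)
    items = result["tracks"]["items"]
    if not items:
        # Nothing found: this is the kind of track that ends up with NaN features
        return None

    # Fetch the audio features for the matched track id
    features = client.audio_features([items[0]["id"]])
    return features[0] if features else None
```

Returning None for tracks without a match keeps the missing cases explicit, which is consistent with the rows full of NaN that appear in the features table.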

I now have a csv file with my audio features and other information, let's open it:

Table 2 - The audio features of my tracks:

#Let's import the csv file and explore it
features = pd.read_csv(r'C:\Users\Tristan\Documents\DATA\spotify_project\features.csv')
features.head()
Unnamed: 0 danceability energy key loudness mode speechiness acousticness instrumentalness liveness ... tempo type id uri track_href analysis_url duration_ms time_signature albumName albumID
0 Junk___Étienne de Crécy 0.797 0.579 2.0 -9.340 1.0 0.0497 0.00522 0.871 0.0502 ... 124.988 audio_features 0R1l25gRCjthLiYRraCVZW spotify:track:0R1l25gRCjthLiYRraCVZW https://api.spotify.com/v1/tracks/0R1l25gRCjth... https://api.spotify.com/v1/audio-analysis/0R1l... 297293.0 4.0 Commercial EP 3 6a4usN5WtkByPHdcOdCD23
1 The Letter___The Box Tops 0.638 0.428 9.0 -12.156 0.0 0.0687 0.25200 0.000 0.1320 ... 139.434 audio_features 6RJK553YhstRzyKA4mug09 spotify:track:6RJK553YhstRzyKA4mug09 https://api.spotify.com/v1/tracks/6RJK553YhstR... https://api.spotify.com/v1/audio-analysis/6RJK... 112800.0 4.0 The Letter/Neon Rainbow 08mPxuP35Db56jUUgRvGFs
2 Zoom Zoom___Polo & Pan 0.678 0.849 7.0 -6.983 1.0 0.0367 0.04590 0.613 0.0833 ... 94.003 audio_features 1gWnuGAiTk3Q4yrIbwymUK spotify:track:1gWnuGAiTk3Q4yrIbwymUK https://api.spotify.com/v1/tracks/1gWnuGAiTk3Q... https://api.spotify.com/v1/audio-analysis/1gWn... 209800.0 4.0 Caravelle 0SuFqlCe5i30Fr75ZlPQVT
3 Californie___Caballero & JeanJass 0.920 0.516 11.0 -8.257 0.0 0.1190 0.38000 0.000 0.0698 ... 112.013 audio_features 0sJX7GTLCNowidzM9HfaH5 spotify:track:0sJX7GTLCNowidzM9HfaH5 https://api.spotify.com/v1/tracks/0sJX7GTLCNow... https://api.spotify.com/v1/audio-analysis/0sJX... 258891.0 4.0 Double hélice 3 6nMcxKyjXxxA0WeIpOpnuJ
4 Una Rosa Blanca___Ibrahim Maalouf 0.463 0.672 8.0 -6.071 0.0 0.0498 0.72800 0.743 0.0993 ... 74.916 audio_features 4MOCTiC5mMrJuhLFSNjiIM spotify:track:4MOCTiC5mMrJuhLFSNjiIM https://api.spotify.com/v1/tracks/4MOCTiC5mMrJ... https://api.spotify.com/v1/audio-analysis/4MOC... 338000.0 4.0 S3NS 1XWCws077Z4B9SwwUzOAfo

5 rows × 21 columns

features.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1951 entries, 0 to 1950
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Unnamed: 0        1951 non-null   object 
 1   danceability      1866 non-null   float64
 2   energy            1866 non-null   float64
 3   key               1866 non-null   float64
 4   loudness          1866 non-null   float64
 5   mode              1866 non-null   float64
 6   speechiness       1866 non-null   float64
 7   acousticness      1866 non-null   float64
 8   instrumentalness  1866 non-null   float64
 9   liveness          1866 non-null   float64
 10  valence           1866 non-null   float64
 11  tempo             1866 non-null   float64
 12  type              1866 non-null   object 
 13  id                1866 non-null   object 
 14  uri               1866 non-null   object 
 15  track_href        1866 non-null   object 
 16  analysis_url      1866 non-null   object 
 17  duration_ms       1866 non-null   float64
 18  time_signature    1866 non-null   float64
 19  albumName         1866 non-null   object 
 20  albumID           1866 non-null   object 
dtypes: float64(13), object(8)
memory usage: 320.2+ KB
features.type.value_counts()
audio_features    1866
Name: type, dtype: int64

OK, we will keep only the audio features that interest us: danceability, energy, valence, loudness, instrumentalness, acousticness, tempo and mode.

PS:

  • Speechiness is not interesting for music; it is more useful for podcasts. Instrumentalness is its counterpart for music (1 = no vocals and 0 = lots of vocals).

  • Liveness: we don't care whether a track is live or not, we are only interested in the music itself and its characteristics.

  • Type: as we can see above, all my data has the type 'audio_features', so it is not interesting.

  • Id: we can also drop this, as we will use 'track' and 'artist' from the first column (which we will split) to merge with my historical data table.

features['track'] = features['Unnamed: 0'].str.split('___').str[0]
features['artist'] = features['Unnamed: 0'].str.split('___').str[1]
features = features[['track', 'artist', 'danceability', 'energy', 'valence', 'loudness','instrumentalness', 'acousticness', 'tempo', 'mode']]
features.head()
   track            artist                danceability  energy  valence  loudness  instrumentalness  acousticness    tempo  mode
0  Junk             Étienne de Crécy             0.797   0.579   0.0415    -9.340             0.871       0.00522  124.988   1.0
1  The Letter       The Box Tops                 0.638   0.428   0.9010   -12.156             0.000       0.25200  139.434   0.0
2  Zoom Zoom        Polo & Pan                   0.678   0.849   0.5020    -6.983             0.613       0.04590   94.003   1.0
3  Californie       Caballero & JeanJass         0.920   0.516   0.5340    -8.257             0.000       0.38000  112.013   0.0
4  Una Rosa Blanca  Ibrahim Maalouf              0.463   0.672   0.5220    -6.071             0.743       0.72800   74.916   0.0
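The split above calls str.split twice on the same column. A possible alternative, sketched here on hypothetical toy rows, splits once with expand=True and limits the split to the first '___' so that a separator appearing later in the string would stay in the artist part:

```python
import pandas as pd

# Hypothetical toy rows using the same 'track___artist' key format
raw = pd.DataFrame(
    {"Unnamed: 0": ["Junk___Étienne de Crécy", "The Letter___The Box Tops"]}
)

# One pass: n=1 splits on the first separator only, expand=True returns two columns
raw[["track", "artist"]] = raw["Unnamed: 0"].str.split("___", n=1, expand=True)
print(raw[["track", "artist"]])
```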
#Let's make sure we don't have duplicated rows
assert features[features.duplicated()].empty
#Let's check missing data
features.isna().sum()
track                0
artist               0
danceability        85
energy              85
valence             85
loudness            85
instrumentalness    85
acousticness        85
tempo               85
mode                85
dtype: int64
features[features.danceability.isna()]
track artist danceability energy valence loudness instrumentalness acousticness tempo mode
22 Vertigo Valley French 79 NaN NaN NaN NaN NaN NaN NaN NaN
55 Do It Camp Claude NaN NaN NaN NaN NaN NaN NaN NaN
63 Your Night Con Funk Shun NaN NaN NaN NaN NaN NaN NaN NaN
79 When The Sun Goes Down Arctic Monkeys NaN NaN NaN NaN NaN NaN NaN NaN
84 fancy Isaac Delusion NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ...
1860 Harlem Shuffle - Alternate Take The Foundations NaN NaN NaN NaN NaN NaN NaN NaN
1862 Heaven The Blaze NaN NaN NaN NaN NaN NaN NaN NaN
1924 Slim's Night Out PillowTalk NaN NaN NaN NaN NaN NaN NaN NaN
1933 Ti voglio Ornella Vanoni NaN NaN NaN NaN NaN NaN NaN NaN
1936 Vieille branche Biga Ranx NaN NaN NaN NaN NaN NaN NaN NaN

85 rows × 10 columns

Those are the tracks that the API did not find. Let's drop them.

#let's drop those records and validate the drop
print(features.shape)
features = features.dropna(subset = ['danceability'])
print(features.isna().sum())
print(features.shape)
(1951, 10)
track               0
artist              0
danceability        0
energy              0
valence             0
loudness            0
instrumentalness    0
acousticness        0
tempo               0
mode                0
dtype: int64
(1866, 10)
#Let's check the data type
features.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1866 entries, 0 to 1950
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   track             1866 non-null   object 
 1   artist            1866 non-null   object 
 2   danceability      1866 non-null   float64
 3   energy            1866 non-null   float64
 4   valence           1866 non-null   float64
 5   loudness          1866 non-null   float64
 6   instrumentalness  1866 non-null   float64
 7   acousticness      1866 non-null   float64
 8   tempo             1866 non-null   float64
 9   mode              1866 non-null   float64
dtypes: float64(8), object(2)
memory usage: 160.4+ KB
#Let's convert mode to integer (1= Major mode, 0= Minor mode)
features['mode'] = features['mode'].astype('int')
features.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1866 entries, 0 to 1950
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   track             1866 non-null   object 
 1   artist            1866 non-null   object 
 2   danceability      1866 non-null   float64
 3   energy            1866 non-null   float64
 4   valence           1866 non-null   float64
 5   loudness          1866 non-null   float64
 6   instrumentalness  1866 non-null   float64
 7   acousticness      1866 non-null   float64
 8   tempo             1866 non-null   float64
 9   mode              1866 non-null   int32  
dtypes: float64(7), int32(1), object(2)
memory usage: 153.1+ KB
#Ok, now it is clean, let's save it into a new csv file:
features.to_csv(r'C:\Users\Tristan\Documents\DATA\spotify_project\features_clean.csv', index=False)

B) Exploratory Data Analysis (EDA)

OK, now that our two tables are pretty much clean, let's do some exploratory data analysis. It will help us understand our data and also finish cleaning up possible wrong outliers.

1 - Creating a function for EDA visualisation:

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
import numpy as np

#Let's create a function to plot a histogram with a boxplot
def boxdistplot(x,l, **kwargs):
    ax = sns.distplot(x, hist_kws=dict(alpha=0.3), bins = 25, color ='g', ax=l)
    ax.set_xlim([min(x)-(max(x)/20), max(x)+(max(x)/20)])
    
    kdelimit = ax.lines[0]
    xkde = kdelimit.get_xdata()
    ykde = kdelimit.get_ydata()
    left = x.mean() - x.std()
    right = x.mean() + x.std()
    ax.vlines(x.mean(), 0, np.interp(x.mean(), xkde, ykde), color='r', label ="mean: "+ str(round(x.mean(),2)))
    ax.fill_between(xkde, 0, ykde, where=(left <= xkde) & (xkde <= right), interpolate=True, facecolor='r', alpha=0.2,
                    label="std: "+ str(round(x.std(),2)))    
    ax.vlines(x.median(), 0, np.interp(x.median(), xkde, ykde), color='b', label="median: "+ str(round(x.median(),2)) )
    ax.vlines(x.quantile(0.25), 0, np.interp(x.quantile(0.25), xkde, ykde), alpha=0, label="q1: "+ str(round(x.quantile(0.25),2)) )    
    ax.vlines(x.quantile(0.75), 0, np.interp(x.quantile(0.75), xkde, ykde), alpha=0, label="q3: "+ str(round(x.quantile(0.75),2)) )    

    ax.legend(prop={"size":9})
    
    ax2 = ax.twinx()
    sns.boxplot(x=x, ax=ax2, color = 'g')
    ax2.set(ylim=(-5, 5))

2 - Table 1 EDA:

We already started the EDA for spotify_tt in the Cleaning & Preparation part, as we needed to explore the min_played column to remove unwanted records and reduce the number of tracks to request from the Spotify API.

So here we will only explore artists and tracks.

Histograms and boxplots - Distribution of the number of plays by Artists and Tracks:

spotify_tt.artist.value_counts().describe()
count    650.000000
mean       7.536923
std       17.470539
min        1.000000
25%        1.000000
50%        2.000000
75%        7.750000
max      309.000000
Name: artist, dtype: float64
spotify_tt.track.value_counts().describe()
count    1543.000000
mean        3.174984
std         4.656332
min         1.000000
25%         1.000000
50%         1.000000
75%         3.000000
max        45.000000
Name: track, dtype: float64
#Let's use the function
fig, ax = plt.subplots(2, 1)

boxdistplot(spotify_tt.artist.value_counts(), ax[0])
plt.title('Nb of plays by artist')

boxdistplot(spotify_tt.track.value_counts(), ax[1])
plt.title('Nb of plays by track')

plt.subplots_adjust(right=1.5, top= 1.5 , hspace=0.4)
plt.show()

EDAtable1_png

Interpretation:

Artist: .describe() shows that I listened to 650 different artists. The distribution of the number of plays by artist is highly right-skewed (median = 2, mean = 7.54 and std = 17.47). In other words, I played half of the artists at most twice, while a few artists were played many, many times. For example, I played one artist 309 times between July 2020 and July 2021 (wow!).

Track: .describe() shows that I listened to 1543 different tracks. The distribution of the number of plays by track is also highly right-skewed (median = 1, mean = 3.17 and std = 4.66). In other words, I played at least half of the tracks only once, while a few tracks were played many times. For example, I played one track 45 times between July 2020 and July 2021.
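The right skew can also be read from numbers instead of the plot. As a minimal sketch on hypothetical toy data, a mean well above the median and a positive skewness coefficient both point to a few heavily played artists pulling the distribution to the right:

```python
import pandas as pd

# Hypothetical toy history: one heavily played artist, several rarely played ones
artists = pd.Series(["A"] * 20 + ["B"] * 3 + ["C"] * 2 + ["D", "E", "F", "G"])

plays_per_artist = artists.value_counts()

summary = {
    "mean": plays_per_artist.mean(),
    "median": plays_per_artist.median(),
    "skew": plays_per_artist.skew(),  # positive for a right-skewed distribution
    "share_played_at_most_twice": (plays_per_artist <= 2).mean(),
}
print(summary)
```

The same calls applied to spotify_tt.artist.value_counts() or spotify_tt.track.value_counts() would give the corresponding figures for the real data.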

3 - Table 2 EDA:

Merging the two tables:

We will merge the two tables so that my audio feature statistics are weighted by the number of plays of each track (one row per play in my historical data table).
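To see why merging weights the statistics, here is a small worked sketch on hypothetical toy tables: the mean of a feature over the merged rows equals the mean over distinct tracks weighted by their play counts.

```python
import numpy as np
import pandas as pd

# Hypothetical toy tables: track "a" was played three times, track "b" once
history = pd.DataFrame({"track": ["a", "a", "a", "b"]})
feats = pd.DataFrame({"track": ["a", "b"], "energy": [0.8, 0.2]})

# One row per play, each carrying the features of the track played
merged = history.merge(feats, on="track", how="inner")
mean_over_plays = merged["energy"].mean()

# Same result from the distinct tracks, weighted by number of plays
play_counts = history["track"].value_counts().reindex(feats["track"])
weighted_mean = np.average(feats["energy"].to_numpy(), weights=play_counts.to_numpy())

unweighted_mean = feats["energy"].mean()
print(mean_over_plays, weighted_mean, unweighted_mean)
```

Here the play-weighted mean is (3 × 0.8 + 1 × 0.2) / 4 = 0.65, while the unweighted mean over the two tracks is 0.5.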

#Let's merge the two tables
spotify_features = spotify_tt.merge(features, on = ['track', 'artist'], how='inner')
spotify_features.head()
datetime track artist min_played danceability energy valence loudness instrumentalness acousticness tempo mode
0 2020-07-16 15:43:00 L'Orchestrina Paolo Conte 3.314000 0.758 0.819 0.748 -8.698 0.000184 0.578 117.47 1
1 2020-08-15 19:14:00 L'Orchestrina Paolo Conte 3.311583 0.758 0.819 0.748 -8.698 0.000184 0.578 117.47 1
2 2020-08-17 13:01:00 L'Orchestrina Paolo Conte 3.311533 0.758 0.819 0.748 -8.698 0.000184 0.578 117.47 1
3 2020-08-23 15:41:00 L'Orchestrina Paolo Conte 3.311450 0.758 0.819 0.748 -8.698 0.000184 0.578 117.47 1
4 2020-08-25 14:27:00 L'Orchestrina Paolo Conte 3.314000 0.758 0.819 0.748 -8.698 0.000184 0.578 117.47 1
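An inner merge silently drops the plays whose track has no audio features. Going by the shapes printed in this document (4899 rows before the merge, 4654 after), that is 245 plays. As a hedged sketch on hypothetical toy tables, a left merge with indicator=True makes those unmatched plays visible:

```python
import pandas as pd

# Hypothetical toy tables: track "c" has no audio features
history = pd.DataFrame(
    {"track": ["a", "a", "b", "c"], "artist": ["x", "x", "y", "z"]}
)
feats = pd.DataFrame(
    {"track": ["a", "b"], "artist": ["x", "y"], "energy": [0.8, 0.2]}
)

# The _merge column tells where each row was found: 'both' or 'left_only'
check = history.merge(feats, on=["track", "artist"], how="left", indicator=True)
unmatched = check[check["_merge"] == "left_only"]
print(unmatched[["track", "artist"]])
```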

Histograms and boxplots - Part 1 - Defining features and looking for outliers that shouldn't be there:

#Let's use the function
fig, ax = plt.subplots(2, 3)

boxdistplot(spotify_features['danceability'], ax[0, 0])
boxdistplot(spotify_features['energy'], ax[0, 1])
boxdistplot(spotify_features['valence'], ax[0, 2])
boxdistplot(spotify_features['loudness'], ax[1, 0])
boxdistplot(spotify_features['instrumentalness'], ax[1, 1])
boxdistplot(spotify_features['acousticness'], ax[1, 2])

plt.title('Audio features distributions')
plt.subplots_adjust(right=2, top= 1.5 , wspace=0.3, hspace=0.3)
plt.show()

EDA1table21_png

fig, ax = plt.subplots(1, 2, squeeze=False)

boxdistplot(spotify_features['tempo'], ax[0, 0])
sns.histplot(spotify_features, x='mode', hue='mode',  ax=ax[0, 1])
perc = spotify_features['mode'].value_counts(normalize=True).mul(100).round(1).astype(str) + '%'
ax[0, 1].legend( ['major = '+ perc.iloc[0], 'minor = '+ perc.iloc[1]], title='Mode:', loc="upper center")


plt.subplots_adjust(right=1.5, top= 0.75 , wspace=0.3, hspace=0.3)
plt.show()

EDA1table22_png

Tempo: the boxplot shows weird outliers at both ends (tempo < 80 and tempo > 200). Let's check them:
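The points a boxplot draws as outliers are those beyond the whiskers, which by default sit at 1.5 times the interquartile range from the quartiles. A minimal sketch of that rule, on hypothetical tempo values and with a hypothetical helper name, can list them without reading the plot:

```python
import pandas as pd


def boxplot_outliers(values, k=1.5):
    """Return the values lying beyond k * IQR from the quartiles."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]


# Hypothetical tempo values with one suspiciously low and one suspiciously high entry
tempo = pd.Series([0.0, 95.0, 100.0, 110.0, 112.0, 118.0, 120.0, 125.0, 128.0, 245.5])
outliers = boxplot_outliers(tempo)
print(outliers)
```

The cut-offs of 80 and 200 used in this README were read from the plot; the fences computed this way on the real data may differ.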

spotify_features[spotify_features['tempo'] < 80].drop_duplicates(subset = ['track', 'artist']).sort_values('tempo').head(5)
datetime track artist min_played danceability energy valence loudness instrumentalness acousticness tempo mode
4254 2021-06-07 15:48:00 Lovin' Feeling French 79 3.415550 0.0000 0.609 0.0000 -8.737 0.627000 0.13500 0.000 1
209 2020-08-12 09:33:00 Awake Electric Guest 5.014433 0.0993 0.626 0.0399 -8.212 0.001610 0.00499 49.452 0
4005 2021-05-06 13:33:00 When I Look Up Jack Johnson 0.969783 0.5350 0.185 0.4950 -17.911 0.000007 0.70100 58.583 1
2180 2020-10-23 12:06:00 Baby Jane Arthur Dupont 2.783550 0.6010 0.375 0.4330 -9.164 0.005400 0.64900 59.993 1
2991 2020-12-28 21:43:00 Le chat Pow Wow 2.848917 0.3830 0.230 0.4660 -12.615 0.000000 0.79600 60.067 1
spotify_features[spotify_features['tempo'] > 200].drop_duplicates(subset = ['track', 'artist']).sort_values('tempo', ascending = False).head(5)
datetime track artist min_played danceability energy valence loudness instrumentalness acousticness tempo mode
3226 2021-02-20 18:19:00 La main à la pâte L'Entourloop 2.458050 0.601 0.5260 0.535 -8.521 0.001230 0.096 245.511 0
902 2020-08-15 15:06:00 Hell N Back Bakar 3.557383 0.584 0.6840 0.720 -4.314 0.000091 0.312 210.164 0
1110 2020-08-15 20:29:00 It's Too Late Carole King 3.886583 0.450 0.4420 0.812 -12.718 0.005640 0.493 208.282 0
28 2020-07-17 10:05:00 On a marché sur la lune Voyou 3.611767 0.327 0.8910 0.162 -6.908 0.049600 0.416 204.851 1
50 2020-08-12 07:25:00 Foule sentimentale Chilly Gonzales 1.783017 0.312 0.0948 0.539 -15.544 0.864000 0.935 204.544 0

Tempo: there seem to be a lot of errors in Spotify's tempo recognition. Some very low tempos are real, like "When I Look Up" by Jack Johnson (I checked on https://songdata.io/track/50lUuRklAxwZ8G7uFzX8td/When-I-Look-Up-by-Jack-Johnson ).

Others are not. For example, the tempo of "Awake" by Electric Guest has been roughly halved: its real tempo is 111 BPM (on https://songdata.io/track/2QoC2SAaXpLrpVNCGJnrgd/Awake-by-Electric-Guest). In fact, the detected tempo is often half or double the real one, because some elements in the music mislead the recognition. This is why we have very low tempos but also very high ones.

We will simply drop the extreme values (tempo = 0 and tempo = 245) and keep in mind not to take the tempo analysis too seriously.

#Let's drop the extreme tempos and validate the drop with assert and .shape
print(spotify_features.shape)

spotify_features = spotify_features[spotify_features['tempo'] > 0]
spotify_features = spotify_features[spotify_features['tempo'] < 230]
#The asserts mirror the filters: no tempo <= 0 and no tempo >= 230 should remain
assert spotify_features[spotify_features['tempo'] <= 0].empty
assert spotify_features[spotify_features['tempo'] >= 230].empty

print(spotify_features.shape)
(4654, 12)
(4652, 12)

Mode: tells whether the music is in major or minor mode. 58.2% of the plays in my Spotify history are in major mode.

Danceability: describes how danceable a track is. 0: not danceable / 1: highly danceable.

We can see one weird outlier on the boxplot, so let's check whether we have incorrect extreme values.

#Danceability < 0.2: music that is hard to dance to.
#(Awake and Shadows do not belong here)
spotify_features[(spotify_features['danceability'] < 0.2)].drop_duplicates(subset = ['track', 'artist']).sort_values('danceability').head(5)
datetime track artist min_played danceability energy valence loudness instrumentalness acousticness tempo mode
209 2020-08-12 09:33:00 Awake Electric Guest 5.014433 0.0993 0.62600 0.0399 -8.212 0.00161 0.004990 49.452 0
3531 2021-04-09 12:25:00 L'estasi dell'oro Ennio Morricone 3.384433 0.1360 0.48400 0.1020 -12.945 0.52300 0.715000 99.566 1
1961 2020-08-31 09:44:00 Shadows Talisco 1.857733 0.1570 0.00476 0.0305 -33.114 0.95500 0.952000 131.936 1
2654 2020-12-12 21:08:00 Between the Buttons French 79 5.292000 0.1600 0.38600 0.0348 -12.019 0.68000 0.171000 90.379 0
4570 2021-07-08 17:25:00 I'm Alive The Hives 2.000233 0.1610 0.90200 0.1250 -2.717 0.00560 0.000083 86.841 0
#Danceability > 0.9: music with a groove, something you can dance to.
#I generally agree with the results, but I find it hard to believe that Quick Drive and Liquid Sunshine are in the top 5...
#(Credits by Parcels or Last Night a D.J. Saved My Life are better suited to the top 5).
#It may also depend on the type of dance...
#(Quick Drive and Liquid Sunshine do not belong here)
spotify_features[(spotify_features['danceability'] > 0.9)].drop_duplicates(subset = ['track', 'artist']).sort_values('danceability', ascending= False).head(8)
datetime track artist min_played danceability energy valence loudness instrumentalness acousticness tempo mode
4334 2021-06-09 10:23:00 Quick Drive Niko B 3.040300 0.980 0.495 0.950 -4.997 0.000118 0.00731 120.036 1
2760 2020-12-16 17:44:00 Credits (feat. Dean Dawson) Parcels 1.042600 0.977 0.499 0.967 -9.662 0.000000 0.16500 115.015 1
4621 2021-07-10 23:43:00 Tshegue Tshegue 2.983050 0.970 0.526 0.373 -8.392 0.000201 0.01720 123.971 1
4100 2021-05-22 09:51:00 Last Night a D.J. Saved My Life Indeep 5.659150 0.968 0.345 0.954 -14.170 0.003440 0.15400 109.803 1
1542 2020-08-17 23:07:00 Liquid Sunshine Biga Ranx 3.224883 0.964 0.406 0.723 -8.259 0.020600 0.03290 112.018 0
1033 2020-08-15 19:04:00 Shake That Eminem 6.512400 0.963 0.643 0.534 -5.785 0.000049 0.05070 107.005 0
4119 2021-05-29 11:04:00 Best Friend Foster The People 1.916850 0.959 0.598 0.408 -5.534 0.000000 0.03580 127.028 1
4208 2021-06-05 10:01:00 Yard Man - Original Chris Michaels 1.659050 0.958 0.799 0.791 -5.126 0.000062 0.02380 127.985 0
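The same pattern (keep one row per track, sort by a feature, take the first rows) is repeated for every audio feature in this section. As a sketch, it could be wrapped in a small helper; the function name and the toy data are hypothetical, only the 'track' and 'artist' column names come from the tables above:

```python
import pandas as pd


def feature_extremes(df, feature, n=5, lowest=True):
    """Return the n distinct tracks with the lowest (or highest) `feature`."""
    # Keep one row per (track, artist) so repeated plays do not fill the result
    unique_tracks = df.drop_duplicates(subset=["track", "artist"])
    return unique_tracks.sort_values(feature, ascending=lowest).head(n)


# Hypothetical toy data: track "a" was played twice
plays = pd.DataFrame(
    {
        "track": ["a", "a", "b", "c"],
        "artist": ["x", "x", "y", "z"],
        "danceability": [0.9, 0.9, 0.1, 0.5],
    }
)

least_danceable = feature_extremes(plays, "danceability", n=2)
most_danceable = feature_extremes(plays, "danceability", n=2, lowest=False)
print(least_danceable, most_danceable, sep="\n")
```

Unlike the cells in this section, the helper does not apply a threshold first; filtering before the call (for example on danceability < 0.2) would reproduce them exactly.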

Energy: describes whether a track is calm/peaceful or one that gives you energy and the urge to clap your hands, jump, run, ... 0: calm music / 1: high energy.

#Energy < 0.2: quiet/peaceful/slow music (often piano, like Debussy's Claire de lune).
#(Nostalgia and Shadows do not belong here).
spotify_features[(spotify_features['energy'] < 0.2)].drop_duplicates(subset = ['track', 'artist']).sort_values('energy').head(5)
datetime track artist min_played danceability energy valence loudness instrumentalness acousticness tempo mode
746 2020-08-14 20:42:00 Nostalgia Ronnie Pacitti 3.932650 0.165 0.00243 0.0368 -37.115 0.875 0.985 68.420 1
1961 2020-08-31 09:44:00 Shadows Talisco 1.857733 0.157 0.00476 0.0305 -33.114 0.955 0.952 131.936 1
3635 2021-04-16 12:11:00 Claire de lune Claude Debussy 2.534983 0.365 0.01000 0.0364 -25.268 0.924 0.995 135.048 1
3056 2021-01-06 11:48:00 This Way Or Another Owen Kennedy 0.500783 0.356 0.02160 0.1190 -21.830 0.899 0.995 100.084 1
581 2020-08-13 11:28:00 Overnight Chilly Gonzales 3.380667 0.388 0.02440 0.1890 -25.245 0.894 0.991 80.132 0
#Energy > 0.9: fast music that gives you the urge to jump, clap, run, ...
#(Ritmo Especial doesn't belong here)
spotify_features[(spotify_features['energy'] > 0.9)].drop_duplicates(subset = ['track', 'artist']).sort_values('energy', ascending= False).head(5)
datetime track artist min_played danceability energy valence loudness instrumentalness acousticness tempo mode
2428 2020-10-30 12:08:00 Ritmo Especial Daniel Maloso 4.116833 0.802 0.996 0.496 -4.665 0.861000 0.00939 120.996 1
4200 2021-06-05 09:44:00 TURN OFF THE LIGHTS Dog Blood 0.500383 0.756 0.994 0.765 -2.466 0.717000 0.00217 128.027 0
2974 2020-12-28 21:09:00 Hippy Hippy Shake Big Soul 2.846367 0.738 0.994 0.529 -5.445 0.000325 0.04700 129.307 1
4567 2021-07-08 17:17:00 Two-Timing Touch and Broken Bones The Hives 2.008433 0.342 0.992 0.938 -3.251 0.006040 0.01900 165.230 0
3179 2021-02-04 21:03:00 Banana Split Lio 1.374283 0.677 0.985 0.967 -3.577 0.005240 0.06350 156.141 1

Valence: describes whether a track puts you in a good or a bad/melancholic mood. 0: bad/melancholic mood / 1: good mood.

#Valence < 0.2: dark/bad mood/melancholic music
#(Shadows doesn't belong here)
spotify_features[(spotify_features['valence'] < 0.2)].drop_duplicates(subset = ['track', 'artist']).sort_values('valence').head(5)
datetime track artist min_played danceability energy valence loudness instrumentalness acousticness tempo mode
3172 2021-01-30 15:30:00 Ephos Flug 7.872167 0.719 0.72700 0.0296 -10.563 0.945000 0.0423 132.000 1
4435 2021-06-12 19:07:00 Queens The Blaze 2.217667 0.617 0.58900 0.0301 -12.767 0.268000 0.6900 125.058 1
1961 2020-08-31 09:44:00 Shadows Talisco 1.857733 0.157 0.00476 0.0305 -33.114 0.955000 0.9520 131.936 1
317 2020-08-12 11:22:00 Loreley Kölsch 5.750000 0.781 0.50900 0.0311 -7.793 0.000022 0.2910 127.964 0
4252 2021-06-07 15:36:00 After Party French 79 5.455100 0.532 0.57700 0.0335 -12.320 0.905000 0.1640 119.041 1
#Valence > 0.9: Music good mood/happy
spotify_features[(spotify_features['valence'] > 0.9)].drop_duplicates(subset = ['track', 'artist']).sort_values('valence', ascending= False).head(5)
datetime track artist min_played danceability energy valence loudness instrumentalness acousticness tempo mode
384 2020-08-12 13:11:00 Gotta Go Home Boney M. 3.760433 0.781 0.936 0.980 -5.843 0.052200 0.2860 131.659 1
934 2020-08-15 17:31:00 September Earth, Wind & Fire 3.584700 0.697 0.832 0.979 -7.264 0.001310 0.1680 125.926 1
469 2020-08-12 14:25:00 Passe mon truc Stupeflip 3.280217 0.679 0.876 0.978 -6.033 0.149000 0.0961 160.459 1
3202 2021-02-05 21:20:00 Je fume pu d'shit Stupeflip 3.282000 0.894 0.544 0.976 -5.402 0.004150 0.0672 107.292 1
3259 2021-02-20 21:07:00 Pata Pata - Mono Version Miriam Makeba 2.867800 0.837 0.853 0.975 -5.417 0.000003 0.6150 126.845 1

Loudness: to analyse how loud the track is. Typical range: -60 to 0 dB.

#loudness < -20: Music like piano songs that are not loud
#(Nostalgia and shadows do not belong here)
spotify_features[(spotify_features['loudness'] < -20)].drop_duplicates(subset = ['track', 'artist']).sort_values('loudness').head(5)
datetime track artist min_played danceability energy valence loudness instrumentalness acousticness tempo mode
746 2020-08-14 20:42:00 Nostalgia Ronnie Pacitti 3.932650 0.165 0.00243 0.0368 -37.115 0.875 0.985 68.420 1
1961 2020-08-31 09:44:00 Shadows Talisco 1.857733 0.157 0.00476 0.0305 -33.114 0.955 0.952 131.936 1
3635 2021-04-16 12:11:00 Claire de lune Claude Debussy 2.534983 0.365 0.01000 0.0364 -25.268 0.924 0.995 135.048 1
581 2020-08-13 11:28:00 Overnight Chilly Gonzales 3.380667 0.388 0.02440 0.1890 -25.245 0.894 0.991 80.132 0
1271 2020-08-16 15:45:00 The Entertainer Liberace 2.045517 0.466 0.15000 0.7960 -24.581 0.877 0.992 159.517 1
#Loudness > -5 : Music like rock songs that are loud
spotify_features[(spotify_features['loudness'] > -5)].drop_duplicates(subset = ['track', 'artist']).sort_values('loudness', ascending = False).head(5)
datetime track artist min_played danceability energy valence loudness instrumentalness acousticness tempo mode
205 2020-08-12 09:28:00 Menez daou Les Ramoneurs De Menhirs 5.162217 0.442 0.946 0.660 0.074 0.00767 0.09380 177.137 1
4357 2021-06-09 12:06:00 Poundshop Kardashians Sam Fender 2.654917 0.557 0.875 0.770 -1.596 0.00000 0.04190 138.023 0
4338 2021-06-09 10:41:00 Bear Claws The Academic 3.568633 0.552 0.877 0.628 -1.879 0.00000 0.01420 97.056 0
2956 2020-12-28 20:38:00 Louxor J'Adore - Katerine vs Joachim Garraud Philippe Katerine 3.118433 0.471 0.929 0.696 -1.897 0.49600 0.00217 134.717 0
4198 2021-06-05 09:42:00 BREAK LAW Dog Blood 0.512767 0.679 0.950 0.120 -2.247 0.00205 0.01480 107.994 1

Instrumentalness: to analyse whether a track has vocals or not. 0: a lot of vocals / 1: no vocals.

#Instru 0.9 - 1: no vocals at all.
#(Shadows doesn't belong here)
spotify_features[(spotify_features['instrumentalness'] > 0.9)].drop_duplicates(subset = ['track', 'artist']).sort_values('instrumentalness', ascending=False).head(5)
datetime track artist min_played danceability energy valence loudness instrumentalness acousticness tempo mode
4639 2021-07-11 12:17:00 Bilboquet (Sirba) Polo & Pan 3.269100 0.803 0.78500 0.3270 -7.757 0.962 0.483 100.028 0
3588 2021-04-13 17:30:00 Where Is My Mind Maxence Cyrin 2.752667 0.333 0.11400 0.0566 -21.255 0.960 0.906 141.839 1
3629 2021-04-16 11:29:00 Walk to School Philip Glass 1.742283 0.225 0.03680 0.0388 -23.039 0.956 0.977 169.851 0
1961 2020-08-31 09:44:00 Shadows Talisco 1.857733 0.157 0.00476 0.0305 -33.114 0.955 0.952 131.936 1
3023 2020-12-31 17:14:00 Hip Hop First of All Guts 2.675700 0.652 0.54900 0.1660 -9.090 0.950 0.218 83.007 0
#Instru 0.5 - 0.9: Almost no voice. 
#Mostly it's techno/electro music with voice samples as in CamelPhat's Cola. 
#Or a song with just a few backing vocals like l'estasi dell'oro.
spotify_features[(spotify_features['instrumentalness'] > 0.5)].drop_duplicates(subset = ['track', 'artist']).sort_values('instrumentalness').head(5)
datetime track artist min_played danceability energy valence loudness instrumentalness acousticness tempo mode
869 2020-08-15 14:18:00 Cola CamelPhat 3.728650 0.706 0.740 0.444 -7.904 0.512 0.02450 122.007 1
3531 2021-04-09 12:25:00 L'estasi dell'oro Ennio Morricone 3.384433 0.136 0.484 0.102 -12.945 0.523 0.71500 99.566 1
3467 2021-03-27 21:49:00 Salam Aleykoum Salut C'est Cool 5.450317 0.690 0.963 0.154 -6.341 0.527 0.00776 138.001 1
2158 2020-10-14 17:24:00 Pigalle Bellaire 4.169933 0.906 0.661 0.228 -7.532 0.527 0.00417 124.987 0
2606 2020-12-09 11:57:00 Enoi - Âme Live Version Âme 0.727617 0.709 0.736 0.389 -9.095 0.531 0.00849 124.407 1
#Instru 0.1 - 0.5: Music with more vocals but still a lot of techno/electro.
spotify_features[(spotify_features['instrumentalness'] > 0.1)].drop_duplicates(subset = ['track', 'artist']).sort_values('instrumentalness').head(5)
datetime track artist min_played danceability energy valence loudness instrumentalness acousticness tempo mode
4424 2021-06-12 18:14:00 Don't Wanna Dance Boston Bun 2.823367 0.797 0.827 0.680 -5.607 0.101 0.03340 121.979 1
3119 2021-01-21 04:49:00 Dr. Greenthumb Cypress Hill 3.154217 0.803 0.547 0.144 -8.779 0.102 0.04820 103.853 0
3323 2021-03-13 04:04:00 Bout de bois Salut C'est Cool 3.447933 0.702 0.758 0.134 -10.740 0.104 0.00295 165.963 1
9 2020-07-17 09:32:00 Le métro et le bus Lewis OfMan 0.925917 0.592 0.576 0.276 -8.279 0.107 0.49600 110.045 1
1604 2020-08-21 10:33:00 Il fait chaud Corine 2.294550 0.679 0.834 0.453 -6.947 0.108 0.02100 108.000 0
#Instru 0 - 0.1 : Songs with a lot of vocals.
spotify_features[(spotify_features['instrumentalness'] < 0.1)].drop_duplicates(subset = ['track', 'artist']).sort_values('instrumentalness').head(5)
datetime track artist min_played danceability energy valence loudness instrumentalness acousticness tempo mode
3054 2020-12-31 19:34:00 Chérie Amadou & Mariam 1.943750 0.834 0.964 0.680 -3.822 0.0 0.1980 127.994 1
3126 2021-01-30 04:38:00 Les prisons de Nantes Tri Yann 2.345400 0.528 0.395 0.962 -10.130 0.0 0.8040 161.460 1
3065 2021-01-14 23:26:00 Is This Love Bob Marley & The Wailers 3.845550 0.776 0.559 0.758 -8.375 0.0 0.1100 122.242 0
3063 2021-01-14 23:22:00 One Love / People Get Ready - Medley Bob Marley & The Wailers 2.882217 0.725 0.523 0.950 -9.593 0.0 0.0783 76.292 1
3055 2021-01-03 18:34:00 You Really Got Me - Mono Mix The Kinks 2.237717 0.573 0.939 0.963 -6.441 0.0 0.4930 137.382 1

Acousticness: to analyse if the track is more or less acoustic. 0: not acoustic / 1: 100% acoustic.

#Acousticness < 0.2 : Not acoustic music (like band music or electro/techno)
spotify_features[(spotify_features['acousticness'] < 0.2)].drop_duplicates(subset = ['track', 'artist']).sort_values('acousticness').head(5)
datetime track artist min_played danceability energy valence loudness instrumentalness acousticness tempo mode
4569 2021-07-08 17:23:00 Good Samaritan The Hives 3.115000 0.186 0.924 0.2570 -3.830 0.00968 0.000002 149.966 1
4231 2021-06-05 10:15:00 Robot Rock Daft Punk 0.802233 0.590 0.787 0.5980 -5.766 0.84500 0.000007 111.926 1
1510 2020-08-17 18:29:00 Hellifornia Gesaffelstein 1.349033 0.537 0.784 0.0384 -3.734 0.28800 0.000009 93.988 1
4371 2021-06-09 12:57:00 Mixtape 2003 The Academic 3.404167 0.288 0.933 0.3450 -4.120 0.33300 0.000013 167.109 1
3625 2021-04-15 18:19:00 Force majeure Gaspard Augé 3.435283 0.578 0.696 0.2190 -6.866 0.87400 0.000048 119.999 1
#Acousticness > 0.9 : Highly acoustic music (like piano)
spotify_features[(spotify_features['acousticness'] > 0.9)].drop_duplicates(subset = ['track', 'artist']).sort_values('acousticness', ascending= False).head(5)
datetime track artist min_played danceability energy valence loudness instrumentalness acousticness tempo mode
3635 2021-04-16 12:11:00 Claire de lune Claude Debussy 2.534983 0.365 0.0100 0.0364 -25.268 0.924 0.995 135.048 1
3056 2021-01-06 11:48:00 This Way Or Another Owen Kennedy 0.500783 0.356 0.0216 0.1190 -21.830 0.899 0.995 100.084 1
1271 2020-08-16 15:45:00 The Entertainer Liberace 2.045517 0.466 0.1500 0.7960 -24.581 0.877 0.992 159.517 1
581 2020-08-13 11:28:00 Overnight Chilly Gonzales 3.380667 0.388 0.0244 0.1890 -25.245 0.894 0.991 80.132 0
3630 2021-04-16 11:35:00 Gaze Moux 2.669333 0.574 0.0577 0.0849 -22.165 0.912 0.991 140.994 0

OK, so to get better results we have to drop the outliers that ended up in places where they don't belong. As with the tempo, the recognition is sometimes wrong because of certain elements of the music.

#Let's drop those records and validate the drop with .shape and assert
outlier_tracks = ['Shadows', 'Nostalgia', 'Awake', 'Quick Drive', 'Liquid sunshine', 'Ritmo Especial']
print(spotify_features.shape)
#Keep only the rows whose track is not in the list of outliers
spotify_features = spotify_features[~spotify_features['track'].isin(outlier_tracks)]
print(spotify_features.shape)
#Check that none of the outlier tracks is left
assert not spotify_features['track'].isin(outlier_tracks).any()
(4652, 12)
(4638, 12)

Rescaling the audio features for comparison:

To be able to compare the features, we need to put them on the same scale. We can use min-max scaling to map their values to the range 0 to 1. (They are all already between 0 and 1, except for loudness, but I still prefer to rescale all of them.)

from sklearn.preprocessing import MinMaxScaler
#Let's rescale the features
min_max_scaler_tt = MinMaxScaler()

spotify_features.iloc[:,4:10]=min_max_scaler_tt.fit_transform(spotify_features.iloc[:,4:10])
spotify_features.head()
datetime track artist min_played danceability energy valence loudness instrumentalness acousticness tempo mode
0 2020-07-16 15:43:00 L'Orchestrina Paolo Conte 3.314000 0.739596 0.822154 0.755892 0.653855 0.000191 0.580904 117.47 1
1 2020-08-15 19:14:00 L'Orchestrina Paolo Conte 3.311583 0.739596 0.822154 0.755892 0.653855 0.000191 0.580904 117.47 1
2 2020-08-17 13:01:00 L'Orchestrina Paolo Conte 3.311533 0.739596 0.822154 0.755892 0.653855 0.000191 0.580904 117.47 1
3 2020-08-23 15:41:00 L'Orchestrina Paolo Conte 3.311450 0.739596 0.822154 0.755892 0.653855 0.000191 0.580904 117.47 1
4 2020-08-25 14:27:00 L'Orchestrina Paolo Conte 3.314000 0.739596 0.822154 0.755892 0.653855 0.000191 0.580904 117.47 1
spotify_features.describe()
min_played danceability energy valence loudness instrumentalness acousticness tempo mode
count 4638.000000 4638.000000 4638.000000 4638.000000 4638.000000 4638.000000 4638.000000 4638.000000 4638.000000
mean 3.488633 0.646954 0.634980 0.572766 0.678102 0.173087 0.264564 116.959213 0.581716
std 1.507279 0.176804 0.189612 0.271474 0.117976 0.296806 0.276047 24.042351 0.493330
min 0.500017 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 58.583000 0.000000
25% 2.737333 0.536266 0.519309 0.359217 0.616151 0.000013 0.033063 99.822250 0.000000
50% 3.469892 0.673603 0.659553 0.594907 0.690040 0.002401 0.157787 117.010000 1.000000
75% 4.216121 0.774078 0.782520 0.806397 0.760398 0.212058 0.438943 127.982000 1.000000
max 30.725217 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 210.164000 1.000000
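
As a quick sanity check on what the rescaling step does, min-max scaling maps each value x to (x - min) / (max - min), so the smallest value becomes 0 and the largest becomes 1. The sketch below applies that formula by hand to a few loudness values taken from the tables above. The helper name `min_max_scale` is hypothetical (it is not part of scikit-learn), and mapping a constant column to 0 is simply a choice made for this example.

```python
def min_max_scale(values):
    """Rescale a list of numbers to the [0, 1] range (hypothetical helper)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; map everything to 0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# A few loudness values (in dB) picked from the tables above.
loudness_db = [-25.268, -12.767, -5.843, 0.074]
scaled = min_max_scale(loudness_db)

# The quietest track ends up at 0, the loudest at 1, the others in between.
print([round(s, 3) for s in scaled])
```

Because the result depends on the min and max of the column, the scaled loudness of a track changes if the quietest or loudest track is removed from the data.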

Ok, now let's analyse the correlation between features

Correlation matrix of the audio features:

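#Only numeric columns can be correlated, so track, artist and datetime are left out
sns.heatmap(spotify_features.select_dtypes('number').corr().round(1), vmin=-1, vmax=1, annot = True,
            cbar_kws={'label': '1 = Positively correlated\n -1 = Negatively correlated'})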
plt.title('Correlation between features')
plt.show()

corrmatrix_png

Correlation results:

1 - Valence is moderately positively correlated (0.4) with danceability and energy. If the music scores high on danceability or energy, it is more likely to score high on valence as well. In other words, music that makes you want to jump (energy) or dance (danceability) also tends to put you in a good mood (valence).

2 - Loudness is quite strongly positively correlated (0.7) with energy. If the music scores high on loudness, it is more likely to score high on energy as well. In other words, the louder the music, the more it gives you the urge to jump/clap (energy).

3 - Acousticness is fairly negatively correlated (-0.6) with energy. If music scores high on acousticness, it is more likely to score low on energy. In other words, the more acoustic the music, the less likely it is to make you want to jump/clap (energy).

4 - In line with points 2 and 3, acousticness is also somewhat negatively correlated (-0.4) with loudness. If music scores high on acousticness, it is more likely to score low on loudness. In other words, the less acoustic the music (like a rock band), the louder it is likely to be.

So, for example, if you are listening to 100% piano music (very acoustic), the music is likely to have little energy and therefore a lower valence. It is also unlikely to be very loud. In the end, this piano music is likely to be quiet/peaceful/slow and to put you in a somewhat melancholic mood.
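
To make the sign of these coefficients concrete, here is a minimal sketch that computes the Pearson correlation by hand on six tracks taken from the tables above (raw values, before rescaling). The function name `pearson` is an assumption for illustration, and with only a handful of hand-picked tracks the magnitudes will not match the heatmap; only the signs are meant to be comparable.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists (hypothetical helper)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Claire de lune, Overnight, The Entertainer, Queens, Gotta Go Home, Chérie
loudness = [-25.268, -25.245, -24.581, -12.767, -5.843, -3.822]
energy = [0.0100, 0.0244, 0.150, 0.589, 0.936, 0.964]
acousticness = [0.995, 0.991, 0.992, 0.690, 0.286, 0.198]

r_loud_energy = pearson(loudness, energy)
r_acoustic_energy = pearson(acousticness, energy)

# Louder tracks have more energy (positive), acoustic tracks have less (negative).
print(round(r_loud_energy, 2), round(r_acoustic_energy, 2))
```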

Histograms and Boxplots - Part 2 - Understanding my audio features preferences:

Let's replot the histograms and boxplots now that we have eliminated some weird outliers:

fig, ax = plt.subplots(2, 3)

boxdistplot(spotify_features['danceability'], ax[0, 0])
boxdistplot(spotify_features['energy'], ax[0, 1])
boxdistplot(spotify_features['valence'], ax[0, 2])
boxdistplot(spotify_features['loudness'], ax[1, 0])
boxdistplot(spotify_features['instrumentalness'], ax[1, 1])
boxdistplot(spotify_features['acousticness'], ax[1, 2])

plt.subplots_adjust(right=2, top= 1.5 , wspace=0.3, hspace=0.3)
plt.show()

EDA2table21_png

fig, ax = plt.subplots(1, 2, squeeze=False)

boxdistplot(spotify_features['tempo'], ax[0, 0])
sns.histplot(spotify_features, x='mode', hue='mode',  ax=ax[0, 1])
#spotify_features['mode'].loc[spotify_features['mode']==1].count()
perc = spotify_features['mode'].value_counts(normalize=True).mul(100).round(1).astype(str) + '%'
#Look the shares up by mode value (1 = major, 0 = minor) rather than by position
ax[0, 1].legend( ['major = '+ perc.loc[1], 'minor = '+ perc.loc[0]], title='Mode:', loc="upper center")


plt.subplots_adjust(right=1.5, top= 0.75 , wspace=0.3, hspace=0.3)
plt.show()

EDA2table22_png

My audio features results:

Danceability: My musical tastes in terms of danceability are diverse (std = 0.18), but I seem to have a slight preference for music that is quite danceable (mean = 0.65).

Energy: As with danceability, my musical tastes in terms of energy are diverse (std = 0.19) and I have a slight preference for music that has some energy (mean = 0.63).

Valence: Regarding valence, my musical tastes are very diverse (std = 0.27 and mean = 0.57). We can see two peaks/modes, which indicate that I like music with a high valence score around 0.8 but also calm/melancholic music with a valence score around 0.45. (You can even see that I sometimes like to listen to music with a very low valence score, for example dark techno/electro music like 'Gesaffelstein'.)

Loudness: My musical taste in terms of loudness is not very diverse (std = 0.12): I like music that is quite loud (mean = 0.68).

Instrumentalness: My instrumentalness data are highly right-skewed (median close to 0, mean = 0.17 and std = 0.3). Most of my music has an instrumentalness score close to 0, but since the spread is large (std = 0.3), there is also some music with a very high instrumentalness score. In other words, I prefer songs with vocals (instrumentalness close to 0), but I also like techno/electro music without any vocals, like 'Bilboquet (Sirba)' by 'Polo & Pan', or 100% piano music (without vocals).

Acousticness: My acousticness data are also right-skewed (median = 0.16, mean = 0.26 and std = 0.28). Most of my music has a low acousticness score (median = 0.16), but since the spread is large (std = 0.28), there is also some music with a very high acousticness score. In other words, I prefer band music or electro/techno music (acousticness < 0.2), but I also like quiet acoustic music like piano.

Tempo: As we said before, this tempo analysis has to be taken with caution, as Spotify's API seems to make a lot of errors in tempo recognition. But to give a general idea, I seem to like all kinds of tempo, with a preference for tempos around 117 BPM (std = 24, mean and median = 117).

Mode: I seem to prefer music in major mode (58.2% of all my music). But since the difference is not large, I probably don't care much about the mode of the music.

So, to sum up: My musical tastes are very diverse. I like all kinds of tempo and both modes, although I have a slight preference for music in major mode with a tempo around 117 BPM. I listen to music that is more or less danceable, music with a lot of energy but also sometimes calm music, and music with a high valence but also music that is more melancholic, chill or even dark. I like loud music and I prefer non-acoustic music with vocals. However, I also sometimes like music without vocals, where I can focus on the different sounds, like electro/techno music, or calm/chill acoustic music like piano music.
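
The "right-skewed" reading for instrumentalness and acousticness relies on a simple rule of thumb: when a few very high values pull the mean well above the median, the distribution has a long right tail. A minimal sketch with made-up instrumentalness scores (not my real data) shows the effect:

```python
from statistics import mean, median

# Made-up scores: mostly vocal tracks (close to 0) plus a few instrumental ones.
scores = [0.0, 0.0, 0.0, 0.0, 0.002, 0.01, 0.1, 0.5, 0.9, 0.96]

m = mean(scores)
med = median(scores)
print(f"mean = {m:.2f}, median = {med:.3f}")

# A mean clearly above the median hints at a right-skewed distribution.
is_right_skewed = m > med
print("right-skewed" if is_right_skewed else "not right-skewed")
```

This is only a heuristic; the histograms and boxplots above remain the more reliable way to judge the shape of each feature.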

#Ok, we will save that and start the analysis:
#let's save it into a new csv file:
spotify_features.to_csv(r'C:\Users\Tristan\Documents\DATA\spotify_project\spotify_features_clean.csv', index=False)

C) Analysis Part 1 - Tops

In this first part of the analysis, my objectives are:

  • To know which are my top artists and my top tracks.
  • To see my music consumption on Spotify per month during the analysis period (2020-07/2021-07).
  • To see when I listen to Spotify the most during the week (by day of the week and by hour).

For this part, we will use the first table 'spotify_tt' and not the full table with the features, because 'spotify_tt' contains all my historical data, including the tracks for which the Spotify API could not find any information. As a reminder, we had to delete 85 records after merging the two tables because they were tracks from spotify_tt for which the API could not find any features.
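
One way to see which plays have no audio features is a left merge with `indicator=True`. This is only a sketch on toy data: the frames `plays` and `features`, the podcast row and the energy values are invented for illustration, and it is not necessarily how the 85 records were identified earlier.

```python
import pandas as pd

# Toy listening history (hypothetical rows).
plays = pd.DataFrame({
    'track': ['Gloria', 'Ophelia', 'Some Podcast Episode'],
    'artist': ['The Lumineers', 'The Lumineers', 'Unknown Show'],
})

# Toy audio features; the energy values are made up.
features = pd.DataFrame({
    'track': ['Gloria', 'Ophelia'],
    'artist': ['The Lumineers', 'The Lumineers'],
    'energy': [0.7, 0.6],
})

# indicator=True adds a '_merge' column telling where each row was found.
merged = plays.merge(features, on=['track', 'artist'], how='left', indicator=True)

# 'left_only' rows are plays for which no audio features exist.
missing = merged[merged['_merge'] == 'left_only']
print(missing[['track', 'artist']])
```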

#We need to set datetime as index
spotify_tt.set_index('datetime', inplace = True)
spotify_tt.head()
track artist min_played
datetime
2020-07-16 15:43:00 L'Orchestrina Paolo Conte 3.314000
2020-07-17 09:29:00 fancy Isaac Delusion 0.531017
2020-07-17 09:31:00 Plein de bisous Lewis OfMan 1.056267
2020-07-17 09:32:00 Le métro et le bus Lewis OfMan 0.925917
2020-07-17 09:36:00 La légende urbaine Voyou 3.779767

1 - Top artists:

#Top 10 artist by nb of plays:
top10_artist_count = spotify_tt['artist'].value_counts().reset_index().head(10)
top10_artist_count.columns = ['artist', 'nb_of_plays']
top10_artist_count
            artist  nb_of_plays
0    The Lumineers          309
1           Jungle          125
2         La Femme          107
3             Guts           95
4    L'Impératrice           85
5        easy life           79
6          Parcels           78
7        Lil Dicky           77
8       Polo & Pan           69
9  Kid Francescoli           68
import seaborn as sns
import matplotlib.pyplot as plt

sns.set_style("white")
sns.barplot(x= 'artist', y= 'nb_of_plays', data= top10_artist_count, palette='winter_r')
plt.xticks(rotation=70,  ha="right", rotation_mode="anchor")
plt.xlabel(None)
plt.ylabel('Nb of plays\n')
plt.title('Top 10 artists (Nb of plays)')

top10_artist_count['P'] = top10_artist_count.nb_of_plays.astype('str')
plt.legend( top10_artist_count['artist']+ '  (played '+ top10_artist_count['P'] +' times)',loc = 2, bbox_to_anchor = (1,1))

plt.savefig(r'C:\Users\Tristan\Documents\DATA\spotify_project\topartist1.png', bbox_inches='tight')
plt.show()

artists1_png

#Top 10 artist by nb of hours:
top10_artist_h = spotify_tt.groupby('artist')['min_played'].sum().div(60).round(1).sort_values(ascending=False).reset_index().head(10)
top10_artist_h.columns = ['artist', 'nb_of_hours']
top10_artist_h
           artist  nb_of_hours
0   The Lumineers         16.9
1          Jungle          6.9
2        La Femme          6.3
3            Guts          6.1
4   L'Impératrice          5.6
5       Lil Dicky          5.0
6         Parcels          4.9
7      Polo & Pan          4.8
8  Isaac Delusion          4.5
9       easy life          3.9
sns.set_style("white")
sns.barplot(x= 'artist', y= 'nb_of_hours', data= top10_artist_h, palette='winter_r')
plt.xticks(rotation=70,  ha="right", rotation_mode="anchor")
plt.xlabel(None)
plt.ylabel('Nb of hours\n')
plt.title('Top 10 artists (Nb of Hours)')

top10_artist_h['H'] = top10_artist_h.nb_of_hours.astype('str')
plt.legend( top10_artist_h['artist']+ '  ('+ top10_artist_h['H'] +'H)',  loc = 2, bbox_to_anchor = (1,1))

plt.savefig(r'C:\Users\Tristan\Documents\DATA\spotify_project\topartist2.png', bbox_inches='tight')
plt.show()

artists2_png

2 - Top tracks:

#Top 10 track by nb of plays:
top10_track_count = spotify_tt[['track', 'artist']].value_counts().reset_index().head(10)
top10_track_count.columns = ['track','artist' ,'nb_of_plays']
top10_track_count
                           track                    artist  nb_of_plays
0                  Disco Inferno               The Trammps           40
1                   Sledgehammer             Peter Gabriel           33
2                       Nomalizo               Letta Mbulu           32
3        Nothing But A Heartache           The Flirtations           31
4           Last Train to London  Electric Light Orchestra           30
5                       Isabella            Isaac Delusion           28
6                           Idol          Mind Enterprises           27
7                         Baianá                  Bakermat           27
8  (More and More) It Ain't Easy                    Jungle           27
9                       Maryland                  Elephanz           26
sns.set_style("white")
sns.barplot(x= 'track', y= 'nb_of_plays', data= top10_track_count, palette='winter_r')
plt.xticks(rotation=70,  ha="right", rotation_mode="anchor")
plt.xlabel(None)
plt.ylabel('Nb of plays\n')
plt.title('Top 10 tracks (Nb of plays)')

top10_track_count['P'] = top10_track_count.nb_of_plays.astype('str')
plt.legend(top10_track_count['track'] + ' - ' + top10_track_count['artist']+ '  (played '+ top10_track_count['P'] +' times)',  loc = 2, bbox_to_anchor = (1,1))

plt.savefig(r'C:\Users\Tristan\Documents\DATA\spotify_project\toptrack1.png', bbox_inches='tight')
plt.show()

tracks1_png

#Top 10 track by nb of hours:
top10_track_h = spotify_tt.groupby(['track', 'artist'])['min_played'].sum().div(60).round(1).sort_values(ascending=False).reset_index().head(10)
top10_track_h.columns = ['track', 'artist', 'nb_of_hours']
top10_track_h
                           track                    artist  nb_of_hours
0    Here Comes That Sound Again              Love De-Luxe          2.6
1                   Sledgehammer             Peter Gabriel          2.5
2                       Nomalizo               Letta Mbulu          2.5
3                       Isabella            Isaac Delusion          2.3
4                  Disco Inferno               The Trammps          2.2
5                    I Feel Love              Donna Summer          2.0
6           Last Train to London  Electric Light Orchestra          2.0
7        Leader Of The Landslide             The Lumineers          1.9
8                Esperar Pra Ver                  Poolside          1.7
9  Psycho Killer - 2005 Remaster             Talking Heads          1.6
sns.set_style("white")
sns.barplot(x= 'track', y= 'nb_of_hours', data= top10_track_h, palette='winter_r')
plt.xticks(rotation=70,  ha="right", rotation_mode="anchor")
plt.xlabel(None)
plt.ylabel('Nb of hours\n')
plt.title('Top 10 tracks (Nb of hours)')

top10_track_h['H'] = top10_track_h.nb_of_hours.astype('str')
plt.legend(top10_track_h['track'] + ' - ' + top10_track_h['artist']+ '  ('+ top10_track_h['H'] +'H)',  loc = 2, bbox_to_anchor = (1,1))

plt.savefig(r'C:\Users\Tristan\Documents\DATA\spotify_project\toptrack2.png', bbox_inches='tight')
plt.show()

tracks2_png

3 - Top tracks of my top 1 artist:

#Top 10 track of The Lumineers by nb of plays:

Top1 = spotify_tt.loc[spotify_tt.artist.str.contains('Lumineers'), 'track'].value_counts().reset_index().head(10)
Top1.columns = ['track', 'nb_of_plays']
Top1
                                track  nb_of_plays
0                              Gloria           22
1                  Sleep On The Floor           20
2             Leader Of The Landslide           20
3  It Wasn't Easy To Be Happy For You           19
4                             Ophelia           19
5       Soundtrack Song - Bonus Track           18
6                               April           17
7                           Cleopatra           17
8                            Patience           16
9                    Salt And The Sea           16
sns.set_style("white")
sns.barplot(x= 'track', y= 'nb_of_plays', data= Top1, palette='winter_r')
plt.xticks(rotation=70,  ha="right", rotation_mode="anchor")
plt.xlabel(None)
plt.ylabel('Nb of plays\n')
plt.title('Top 10 tracks (Nb of plays) of my Top 1 artist: The Lumineers')

Top1['P'] = Top1.nb_of_plays.astype('str')
plt.legend( Top1['track']+ '  ('+ Top1['P'] +')',  loc = 2, bbox_to_anchor = (1,1))

plt.savefig(r'C:\Users\Tristan\Documents\DATA\spotify_project\toptracktopartist.png', bbox_inches='tight')
plt.show()

trackartist1_png

4 - Creating a wordcloud visual with my top 100 artists:

#Let's create a wordcloud visual with my top 100 artists for the project image on my website
from wordcloud import WordCloud

wc_artist = spotify_tt['artist'].value_counts().head(100)
fig, ax = plt.subplots(figsize=(20,15))
wordcloud = WordCloud(width=1000,height=600, max_words=100,relative_scaling=0.78,normalize_plurals=False).generate_from_frequencies(wc_artist)
ax.imshow(wordcloud, interpolation='bilinear')

#Hide the axes before saving, otherwise they end up in the saved image
plt.axis('off')
plt.savefig(r'C:\Users\Tristan\Documents\DATA\spotify_project\wc_top100.png', bbox_inches='tight')
plt.show()

wctop100_png

5 - My music consumption on spotify per month during the analysis period (2020-07/2021-07):

month_h = spotify_tt.min_played.resample('M').sum().div(60).reset_index()
month_h.columns = ['month', 'nb_of_hours']
month_h
         month  nb_of_hours
 0  2020-07-31     0.759305
 1  2020-08-31    54.920501
 2  2020-09-30    11.011556
 3  2020-10-31    11.429484
 4  2020-11-30     5.951911
 5  2020-12-31    29.259204
 6  2021-01-31    12.622811
 7  2021-02-28    14.628442
 8  2021-03-31    19.739608
 9  2021-04-30    29.808107
10  2021-05-31    23.025133
11  2021-06-30    28.704854
12  2021-07-31    43.618953
from datetime import datetime

sns.set_theme(style="white")
fig, ax = plt.subplots()
sns.lineplot(x= 'month', y = 'nb_of_hours' ,data = month_h, linewidth = 3, ax = ax)
sns.despine()
plt.xticks(month_h.month, rotation = 45, ha="right", rotation_mode="anchor")
plt.xlabel('\nMonths (2020-07/2021-07)')
plt.ylabel('Nb of hours\n')
ax.axvspan(datetime(2020,9,30), datetime(2020,12,15), alpha=0.1, color='red')
ax.axvspan(datetime(2021,4,3), datetime(2021,5,3), alpha=0.1, color='red')

ax.annotate('Lockdown\n in France',
            fontsize=10,
            fontweight='demi',
            xy=(datetime(2020,11,8), 45),  
            xycoords='data',
            xytext=(45, 10),      
            textcoords='offset points',
            arrowprops=dict(arrowstyle="->", color = 'black')) 
ax.annotate('',
            xy=(datetime(2021,4,25), 45),  
            xycoords='data',
            xytext=(-35, 10),      
            textcoords='offset points',
            arrowprops=dict(arrowstyle="->", color = 'black')) 

plt.title('My music consumption on spotify per month (From 2020-07 to 2021-07)')

plt.savefig(r'C:\Users\Tristan\Documents\DATA\spotify_project\months.png', bbox_inches='tight')
plt.show()

months_png

Interpretation:

We can see that I listen to Spotify the most during the summer and around Christmas and New Year's Eve. We can also see that during the 2020-09/2020-12 lockdown my music consumption on Spotify dropped (this is a correlation and not necessarily a cause-and-effect relationship). The 2021 lockdown does not seem to have affected my consumption.

6 - Which day do I listen to Spotify the most?:

day_h = spotify_tt.groupby(spotify_tt.index.date)['min_played'].sum().div(60).reset_index()
day_h.columns = ['weekday', 'nb_of_hours']
day_h['weekday'] = pd.to_datetime(day_h['weekday'])

weekday_h = day_h.groupby(day_h['weekday'].dt.day_name())['nb_of_hours'].mean().reset_index()
weekday_h['weekday'] = pd.Categorical(weekday_h['weekday'], ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'])
weekday_h = weekday_h.sort_values('weekday')
weekday_h
     weekday  nb_of_hours
1     Monday     1.641194
5    Tuesday     1.540520
6  Wednesday     1.357016
4   Thursday     1.031079
0     Friday     1.559729
2   Saturday     2.104713
3     Sunday     1.293779
sns.set_theme(style="whitegrid")
fig, ax = plt.subplots()
sns.barplot(x= 'weekday', y = 'nb_of_hours' ,data = weekday_h, palette="Set2", ax = ax)
sns.despine()
plt.xticks(rotation = 45, ha="right", rotation_mode="anchor")
plt.xlabel(None)
plt.ylabel('Nb of hours (mean)\n')
plt.title('Which day I listen to spotify the most?')

plt.savefig(r'C:\Users\Tristan\Documents\DATA\spotify_project\days.png', bbox_inches='tight')
plt.show()

days_png

Interpretation:

Saturday seems to be my favourite day to listen to music on Spotify, and Thursday the day I listen the least.
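
Note that the mean above is taken over the days that appear in the history, so days without any play are not counted as zero. If you prefer to include them, one option is to resample to a daily frequency first. A minimal sketch on toy data (the frame `plays` and its values are made up):

```python
import pandas as pd

# Toy history: two plays on Saturday 2021-07-03, none on Saturday 2021-07-10,
# one play on Saturday 2021-07-17.
plays = pd.DataFrame(
    {'min_played': [60.0, 30.0, 120.0]},
    index=pd.to_datetime(['2021-07-03 10:00', '2021-07-03 18:00', '2021-07-17 20:00']),
)

# Mean over days with at least one play (same logic as the groupby on the date).
played_days_h = plays.groupby(plays.index.date)['min_played'].sum().div(60)
mean_played_days = played_days_h.mean()

# Mean over every calendar day: resample('D') creates the silent days with a sum of 0.
daily_h = plays['min_played'].resample('D').sum().div(60)
by_weekday = daily_h.groupby(daily_h.index.day_name()).mean()
mean_all_saturdays = by_weekday['Saturday']

print(mean_played_days, mean_all_saturdays)
```

The second figure is lower because the silent Saturday counts as 0 hours. Which of the two is more meaningful depends on the question being asked.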

7 - Heatmap: When do I listen to Spotify the most during the week (by days and hours)?:

spotify_tt.reset_index(inplace=True)
spotify_tt['weekday'] = spotify_tt['datetime'].dt.day_name()
spotify_tt['hour'] = spotify_tt['datetime'].dt.hour
hmap_hourday = spotify_tt.groupby(['hour', 'weekday'])['min_played'].sum().div(60).reset_index()
hmap_hourday.columns = ['hour', 'weekday', 'nb_of_hours']
hmap_hourday['weekday'] = pd.Categorical(hmap_hourday['weekday'], ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'])
hmap_hourday = hmap_hourday.sort_values(['hour', 'weekday'])
#Pass index/columns/values by keyword (positional arguments are rejected by pandas 2.x)
hmap_hourday_pivot = hmap_hourday.pivot(index='hour', columns='weekday', values='nb_of_hours')
hmap_hourday_pivot
weekday Monday Tuesday Wednesday Thursday Friday Saturday Sunday
hour
0 NaN 0.090550 1.145496 0.950365 1.861331 0.791024 2.532518
1 NaN 0.933529 NaN 0.060663 1.664840 0.206100 1.244205
2 NaN 0.013941 NaN NaN 0.501499 0.948349 NaN
3 NaN NaN NaN 0.922506 0.434322 0.997728 0.187207
4 NaN NaN NaN 0.986153 0.209337 0.592768 0.962670
5 NaN NaN NaN 0.128917 1.419416 1.054068 0.913024
6 NaN 0.848112 NaN NaN 0.093284 0.543313 NaN
7 2.154869 0.721891 0.315204 0.018783 0.242081 0.053570 0.096764
8 1.769703 0.820385 1.963738 0.815128 2.254765 2.400634 0.180764
9 2.857846 1.493154 4.203145 0.756811 1.543869 3.530228 1.437077
10 2.070544 1.371110 3.375253 1.078541 3.498350 3.374417 2.896623
11 2.059680 1.075490 3.763706 0.921599 3.946406 3.550700 1.727983
12 2.460083 2.109759 4.860632 2.268029 6.334099 1.219263 2.150987
13 5.135331 2.873739 4.130917 2.411043 5.135137 1.031386 1.732020
14 4.815619 4.349594 3.419427 2.264124 4.091344 2.703529 2.331014
15 3.278965 3.857129 3.186157 2.820599 3.884292 3.009397 2.631171
16 2.042431 2.919122 2.517530 2.515338 1.224134 2.016299 3.550102
17 0.961435 1.998544 1.112515 1.650704 1.125251 4.317831 2.161906
18 2.253352 1.074518 1.174136 0.745287 1.297907 5.773222 1.517422
19 2.569555 0.987227 0.035959 0.706920 2.774501 3.926154 0.265591
20 1.416190 0.165743 NaN 2.371770 1.823672 6.048907 1.196161
21 0.956355 0.674201 0.319539 3.366483 3.476316 4.371774 0.559844
22 1.046890 1.081222 0.253854 2.092066 3.405139 4.072307 0.758668
23 1.539813 1.351448 0.862225 2.111617 2.349238 4.503722 0.016977
fig, ax = plt.subplots(figsize=(3,6))
ax = sns.heatmap(hmap_hourday_pivot.fillna(0), robust=True, cmap="viridis", cbar_kws={'label': 'Total hours spent listening to music on Spotify'}, ax = ax)
ax.set(title="When do I listen to Spotify the most during the week? ", xlabel=None ,ylabel="Hour of the day")
colorbar = ax.collections[0].colorbar
colorbar.set_ticks([0,1,2,3,4,5])
colorbar.set_ticklabels(['0H', '1H', '2H', '3H', '4H', '5H'])

plt.savefig(r'C:\Users\Tristan\Documents\DATA\spotify_project\heatmap_weekday.png', bbox_inches='tight')
plt.show()

hours_png

Interpretation:

We can see that from Monday to Friday, I listen the most between 9H and 15H (my peak hours are around my lunch break: 12H, 13H and 14H). At the weekend, I listen the most on Saturday from 17H onwards.
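
The peak hours read off the heatmap can also be pulled out programmatically with `idxmax`. A minimal sketch on a small made-up pivot table (hours as rows, weekdays as columns); `toy_pivot` is a hypothetical stand-in, not the real `hmap_hourday_pivot`:

```python
import pandas as pd

# Made-up listening hours for three hours of the day and two weekdays.
toy_pivot = pd.DataFrame(
    {'Monday': [2.5, 5.1, 1.0], 'Saturday': [1.2, None, 6.0]},
    index=pd.Index([12, 13, 20], name='hour'),
)

# Treat missing cells as 0 hours, then take the row label (hour) of each column's maximum.
peak_hour = toy_pivot.fillna(0).idxmax()
print(peak_hour)
```

On the real table the same call would return one peak hour per weekday, which is a quick way to double-check what the colours suggest.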

D) Analysis Part 2 - Audio features

In this second part of the analysis, my objectives are:

  • Find the tracks corresponding to the max and min for each audio feature.
  • Find the audio features corresponding to all the songs I listened to.
  • Compare it to the audio features of my top tracks.
  • Analyse the audio features corresponding to the tracks I listen to in the morning, afternoon, evening and night/party.
  • Using these findings and only the audio features, find a track I might like to listen to in the morning/afternoon/evening/and during a party at night.

For this part, we will use the full table with the audio features, 'spotify_features'.

1 - The tracks corresponding to the max and min for each audio feature:

features = ['danceability', 'energy', 'valence', 'loudness', 'instrumentalness', 'acousticness']

for n, feature in enumerate(features):
    col = spotify_features[feature]
    #Distinct tracks sitting exactly at the maximum / minimum of this feature
    top = spotify_features.loc[col == col.max()].drop_duplicates(subset = ['track','artist'])
    bottom = spotify_features.loc[col == col.min()].drop_duplicates(subset = ['track','artist'])
    #Many tracks share the minimum instrumentalness, so only show a reproducible sample of 2
    if feature == 'instrumentalness':
        bottom = bottom.sample(2, random_state=10)
    #Blank separator between two features
    if n > 0:
        print('\n ')
    print('\n----MAX ' + feature)
    print(top[['track', 'artist']])
    print('\n----MIN ' + feature)
    print(bottom[['track', 'artist']])
----MAX danceability
                            track   artist
2760  Credits (feat. Dean Dawson)  Parcels

----MIN danceability
                  track           artist
3531  L'estasi dell'oro  Ennio Morricone

 

----MAX energy
                    track     artist
2974    Hippy Hippy Shake   Big Soul
4200  TURN OFF THE LIGHTS  Dog Blood

----MIN energy
               track          artist
3635  Claire de lune  Claude Debussy

 

----MAX valence
             track    artist
384  Gotta Go Home  Boney M.

----MIN valence
      track artist
3172  Ephos   Flug

 

----MAX loudness
          track                    artist
205  Menez daou  Les Ramoneurs De Menhirs

----MIN loudness
               track          artist
3635  Claire de lune  Claude Debussy

 

----MAX instrumentalness
                  track      artist
4639  Bilboquet (Sirba)  Polo & Pan

----MIN instrumentalness
            track      artist
1004  Gold Digger  Kanye West
2047    La source        1995

 

----MAX acousticness
                    track          artist
3056  This Way Or Another    Owen Kennedy
3635       Claire de lune  Claude Debussy

----MIN acousticness
               track     artist
4569  Good Samaritan  The Hives
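The twelve lookups above all follow the same pattern, so they could also be written as one small helper called in a loop. This is a minimal sketch on a made-up table: `extreme_tracks` and the `toy` DataFrame are hypothetical names, not part of my notebook.

```python
import pandas as pd

# Hypothetical toy stand-in for spotify_features (values are made up).
toy = pd.DataFrame({
    "track": ["A", "A", "B", "C"],
    "artist": ["X", "X", "Y", "Z"],
    "danceability": [0.9, 0.9, 0.2, 0.5],
    "energy": [0.1, 0.1, 0.8, 0.8],
})

def extreme_tracks(df, feature, how="max"):
    """Return the unique (track, artist) rows where `feature` hits its max or min."""
    target = df[feature].max() if how == "max" else df[feature].min()
    rows = df.loc[df[feature] == target]
    return rows.drop_duplicates(subset=["track", "artist"])[["track", "artist"]]

for feature in ["danceability", "energy"]:
    for how in ("max", "min"):
        print(f"\n----{how.upper()} {feature}")
        print(extreme_tracks(toy, feature, how))
```

Ties are kept on purpose: as in the output above, several tracks can share the same maximum or minimum.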

2 - The audio features corresponding to all the tracks I listened to:

spotify_features.iloc[:,4:10].mean().reset_index()
              index         0
0      danceability  0.646954
1            energy  0.634980
2           valence  0.572766
3          loudness  0.678102
4  instrumentalness  0.173087
5      acousticness  0.264564

import plotly.graph_objects as go

categories = ['Danceability', 'Energy', 'Valence', 'Loudness','Instrumentalness', 'Acousticness']

fig = go.Figure()

fig.add_trace(go.Scatterpolar(
      r= spotify_features.iloc[:,4:10].mean(),
      theta=categories,
      fill='toself',
      name = 'All tracks in my Spotify history (2020/2021): average weighted by nb of plays.'
))

fig.update_layout(
    title = "Audio features corresponding to all the tracks I listened to",
    polar=dict(
    radialaxis=dict(
      visible=True,
      range=[0, 1]
    )),
  showlegend=True)

fig.update_layout(legend=dict(
    orientation="h",
    yanchor="bottom",
    y=-0.3,
    xanchor="left",
    x=0
))

fig.write_image(r'C:\Users\Tristan\Documents\DATA\spotify_project\all_tracks.png')
fig.show()

alltracks_png

Interpretation:

Although my tastes are diverse (see Exploratory Data Analysis results in part B), I seem to prefer music that is quite danceable with some energy and also quite loud. I prefer music with a good valence. I prefer music that is not acoustic and that has vocals.
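The chart legend says the average is "weighted by nb of plays". That follows from the table having one row per play: a plain mean over the rows gives each track a weight equal to its number of plays. The sketch below illustrates this on made-up numbers (the `history` table and its values are assumptions).

```python
import pandas as pd

# Hypothetical history: one row per play (values made up).
history = pd.DataFrame({
    "track": ["A", "A", "A", "B"],
    "energy": [0.2, 0.2, 0.2, 0.8],
})

# Mean over every play: tracks played more often count more.
per_play_mean = history["energy"].mean()

# The same number, written explicitly as a weighted mean over unique tracks.
unique = history.groupby("track").agg(
    energy=("energy", "first"),
    plays=("energy", "size"),
)
weighted_mean = (unique["energy"] * unique["plays"]).sum() / unique["plays"].sum()

# Mean over unique tracks only: every track counts once.
unweighted_mean = unique["energy"].mean()

print(per_play_mean, weighted_mean, unweighted_mean)
```

Here the per-play mean is pulled towards track A because it was played three times, which is the behaviour I want when describing what I actually listen to.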

3 - Audio features: all tracks VS top tracks:

#My top 5 tracks
top5 = spotify_features[['track', 'artist']].value_counts().reset_index().head(5)
top5.columns = ['track','artist' ,'nb_of_plays']
top5
                     track                    artist  nb_of_plays
0            Disco Inferno               The Trammps           40
1             Sledgehammer             Peter Gabriel           33
2                 Nomalizo               Letta Mbulu           32
3  Nothing But A Heartache           The Flirtations           31
4     Last Train to London  Electric Light Orchestra           30

#My top 1 track from my top 1 artist
top1_artist = spotify_features.loc[spotify_features.artist.str.contains('Lumineers'), ['track', 'artist']].value_counts().reset_index().head(1)
top1_artist.columns = ['track','artist' ,'nb_of_plays']
top1_artist
    track         artist  nb_of_plays
0  Gloria  The Lumineers           22

top5_top1artist = pd.concat([top5, top1_artist])
top5_top1artist
                     track                    artist  nb_of_plays
0            Disco Inferno               The Trammps           40
1             Sledgehammer             Peter Gabriel           33
2                 Nomalizo               Letta Mbulu           32
3  Nothing But A Heartache           The Flirtations           31
4     Last Train to London  Electric Light Orchestra           30
0                   Gloria             The Lumineers           22

top5_top1artist_features = spotify_features.loc[(spotify_features['track'].isin(top5_top1artist['track'])) &
                           (spotify_features['artist'].isin(top5_top1artist['artist']))]
top5_top1artist_features.head()
datetime track artist min_played danceability energy valence loudness instrumentalness acousticness tempo mode
247 2020-08-12 10:31:00 Last Train to London Electric Light Orchestra 4.315017 0.702735 0.535569 0.972643 0.610962 0.00082 0.397989 121.493 1
248 2020-08-15 20:16:00 Last Train to London Electric Light Orchestra 4.499267 0.702735 0.535569 0.972643 0.610962 0.00082 0.397989 121.493 1
249 2020-08-17 17:57:00 Last Train to London Electric Light Orchestra 4.499533 0.702735 0.535569 0.972643 0.610962 0.00082 0.397989 121.493 1
250 2020-08-25 15:56:00 Last Train to London Electric Light Orchestra 4.490467 0.702735 0.535569 0.972643 0.610962 0.00082 0.397989 121.493 1
251 2020-08-26 16:30:00 Last Train to London Electric Light Orchestra 4.499100 0.702735 0.535569 0.972643 0.610962 0.00082 0.397989 121.493 1
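One thing to keep in mind with the filter above: the two `isin()` tests are independent, so a row is kept whenever its track name is in the list and its artist is in the list, even if that exact pair is not one of the six. I have not checked whether this happens in my data. The sketch below, on made-up rows, shows the difference with an inner merge on both columns (`plays`, `wanted`, `loose` and `strict` are hypothetical names).

```python
import pandas as pd

# Made-up plays: the last row pairs a listed track name with a different listed artist.
plays = pd.DataFrame({
    "track":  ["Gloria", "Sledgehammer", "Gloria"],
    "artist": ["The Lumineers", "Peter Gabriel", "Peter Gabriel"],
})
wanted = pd.DataFrame({
    "track":  ["Gloria", "Sledgehammer"],
    "artist": ["The Lumineers", "Peter Gabriel"],
})

# Two independent isin() tests: keeps all three rows, including the mismatched pair.
loose = plays.loc[
    plays["track"].isin(wanted["track"]) & plays["artist"].isin(wanted["artist"])
]

# Inner merge on both columns: keeps only exact (track, artist) pairs.
strict = plays.merge(wanted, on=["track", "artist"], how="inner")

print(len(loose), len(strict))
```

Note that `merge` returns a fresh index, so the original row labels (247, 248, ...) would not be preserved with this variant.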
top_tracks = top5_top1artist_features.iloc[:, 4:10].mean().reset_index()
all_tracks = spotify_features.iloc[:,4:10].mean().reset_index()
vs = top_tracks.merge(all_tracks, on= 'index')
vs.rename(columns = {'index':'audio_feature',   '0_x':'top_tracks', '0_y':'all_tracks'}, inplace =True)
vs['diff'] =  vs['top_tracks'] - vs['all_tracks']
vs
      audio_feature  top_tracks  all_tracks      diff
0      danceability    0.646419    0.646954 -0.000535
1            energy    0.706922    0.634980  0.071942
2           valence    0.746982    0.572766  0.174216
3          loudness    0.731358    0.678102  0.053256
4  instrumentalness    0.134393    0.173087 -0.038694
5      acousticness    0.223034    0.264564 -0.041530

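The comparison table above is built with a merge and then a rename of the automatic `0_x` / `0_y` columns. An alternative that names the columns up front is to put the two mean Series side by side in one DataFrame. This is only a sketch on made-up numbers; `all_rows`, `top_rows` and `vs_sketch` are hypothetical names.

```python
import pandas as pd

features = ["energy", "valence"]
# Made-up rows standing in for spotify_features and top5_top1artist_features.
all_rows = pd.DataFrame({"energy": [0.4, 0.6, 0.8], "valence": [0.3, 0.5, 0.7]})
top_rows = pd.DataFrame({"energy": [0.8, 0.8], "valence": [0.7, 0.9]})

# Build the comparison directly from the two mean Series, naming columns up front.
vs_sketch = pd.DataFrame({
    "top_tracks": top_rows[features].mean(),
    "all_tracks": all_rows[features].mean(),
})
vs_sketch["diff"] = vs_sketch["top_tracks"] - vs_sketch["all_tracks"]
vs_sketch = vs_sketch.rename_axis("audio_feature").reset_index()
print(vs_sketch)
```

A positive `diff` means the feature is higher in the top tracks than in the full history, exactly as in the table above.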
from plotly.subplots import make_subplots

categories = ['Danceability', 'Energy', 'Valence', 'Loudness','Instrumentalness', 'Acousticness']



fig = make_subplots(rows=1, cols=1, specs=[[{'type': 'polar'}]])


fig.add_trace(go.Scatterpolar(
      r= top5_top1artist_features.iloc[:, 4:10].mean(),
      theta=categories,
      fill='toself',
      name='Top 5 tracks & top 1 track of my top artist: average weighted by nb of plays.'
), 1,1)


fig.add_trace(go.Scatterpolar(
      r= spotify_features.iloc[:,4:10].mean(),
      theta=categories,
      fill='toself',
      name = 'All tracks in my Spotify history (2020/2021): average weighted by nb of plays.'
), 1,1)



fig.update_layout(
    title = "Audio features: all tracks VS top tracks\n",
    
    polar=dict(
    angularaxis_showticklabels=True,
    radialaxis_showticklabels=True,
    radialaxis=dict(
     visible=True,
      range=[0, 1])),
  showlegend=True)


fig.update_layout(legend=dict(
    orientation="h",
    yanchor="bottom",
    y=-0.3,
    xanchor="right",
    x=1
))


fig.write_image(r'C:\Users\Tristan\Documents\DATA\spotify_project\all_vs_tops.png')
fig.show()

allvstoptracks_png

Interpretation:

We can see that my top tracks have more valence, more energy and are a little bit louder than the rest of the tracks I listened to. The danceability is the same. And my top tracks are a bit less acoustic and instrumental.

In other words, I'll probably like a track more if it puts me in a good mood (valence) and gives me energy. But it has to have vocals and not be acoustic.

Let's see the audio features of my top tracks + my top track of my top artist:

top5_top1artist_unique = top5_top1artist_features.drop_duplicates(subset = ['track','artist'])
top5_top1artist_unique
datetime track artist min_played danceability energy valence loudness instrumentalness acousticness tempo mode
247 2020-08-12 10:31:00 Last Train to London Electric Light Orchestra 4.315017 0.702735 0.535569 0.972643 0.610962 0.000820 0.397989 121.493 1
1746 2020-08-30 09:30:00 Gloria The Lumineers 3.599617 0.460166 0.715447 0.662247 0.759490 0.000005 0.661306 78.211 1
2225 2020-10-26 10:40:00 Nomalizo Letta Mbulu 2.720633 0.840666 0.480691 0.726431 0.550627 0.002401 0.424119 98.276 1
2257 2020-10-26 10:55:00 Sledgehammer Peter Gabriel 5.413750 0.619501 0.685976 0.414983 0.791769 0.000000 0.024721 96.458 1
2301 2020-10-26 11:02:00 Disco Inferno The Trammps 3.564000 0.749108 0.847561 0.847433 0.845119 0.628898 0.026229 129.323 1
2518 2020-11-13 10:16:00 Nothing But A Heartache The Flirtations 2.737333 0.419738 0.941057 0.833754 0.803370 0.000273 0.000166 111.845 1
categories = ['Dan.', 'Ener.', 'Val.', 'Loud.','Inst.', 'Acou.']



fig = make_subplots(rows=2, cols=3, specs=[[{'type': 'polar'}]*3]*2)




fig.add_trace(go.Scatterpolar(
      r= top5_top1artist_unique.iloc[4, 4:10],
      theta=categories,
      fill='toself',
      name="Disco Inferno - The Trammps (Top 1)"
), 1,1)


fig.add_trace(go.Scatterpolar(
      r=top5_top1artist_unique.iloc[3, 4:10],
      theta=categories,
      fill='toself',
      name="Sledgehammer - Peter Gabriel (Top 2)"
), 1, 2)


fig.add_trace(go.Scatterpolar(
      r=top5_top1artist_unique.iloc[2, 4:10],
      theta=categories,
      fill='toself',
      name="Nomalizo - Letta Mbulu (Top 3)"
), 1,3)

fig.add_trace(go.Scatterpolar(
      r=top5_top1artist_unique.iloc[5, 4:10],
      theta=categories,
      fill='toself',
      name="Nothing But A Heartache - The Flirtations (Top 4)"
), 2,1)

fig.add_trace(go.Scatterpolar(
      r=top5_top1artist_unique.iloc[0, 4:10],
      theta=categories,
      fill='toself',
      name="Last Train to London - Electric Light Orchestra (Top 5)"
), 2,2)

fig.add_trace(go.Scatterpolar(
      r= top5_top1artist_unique.iloc[1, 4:10],
      theta=categories,
      fill='toself',
      name="Gloria - The Lumineers (Top 1 track of my top 1 artist)"
), 2,3)






fig.update_layout(
    title = "Audio features of my top tracks",
    
    polar=dict(
    angularaxis_showticklabels=True,
    radialaxis_showticklabels=True,
    radialaxis=dict(
     visible=True,
      range=[0, 1])),
    
    polar2=dict(
    angularaxis_showticklabels=False,
    radialaxis_showticklabels=False,
    radialaxis=dict(
     visible=True,
      range=[0, 1])),
    
    polar3=dict(
    angularaxis_showticklabels=False,
    radialaxis_showticklabels=False,
    radialaxis=dict(
     visible=True,
      range=[0, 1])),
    
    polar4=dict(
    angularaxis_showticklabels=False,
    radialaxis_showticklabels=False,
    radialaxis=dict(
     visible=True,
      range=[0, 1])),
    
    polar5=dict(
    angularaxis_showticklabels=False,
    radialaxis_showticklabels=False,
    radialaxis=dict(
     visible=True,
      range=[0, 1])),
    
    polar6=dict(
    angularaxis_showticklabels=False,
    radialaxis_showticklabels=False,
    radialaxis=dict(
     visible=True,
      range=[0, 1])),
  showlegend=True)



fig.update_layout(legend=dict(
    orientation="h",
    yanchor="bottom",
    y=-0.8,
    xanchor="left",
    x=0
))


fig.write_image(r'C:\Users\Tristan\Documents\DATA\spotify_project\tops_features.png')
fig.show()

toptracks_png

Interpretation:

Here again we can see that my tastes are very diverse. My top tracks do not match each other; each has its own audio feature profile. However, they all have either a good valence, a lot of energy or a good danceability score, and they are all loud. Disco Inferno has a higher instrumentalness score (it has moments with fewer vocals). Nomalizo, Last Train To London and Gloria are more acoustic.
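The radar charts above pick each track by its row position (`iloc[4]`, `iloc[3]`, ...), which silently points at the wrong track if the row order ever changes. A lookup by track name is a possible alternative. The sketch below uses a reduced, hypothetical `unique_tracks` table with values rounded from the table above.

```python
import pandas as pd

features = ["danceability", "energy"]
# Reduced stand-in for top5_top1artist_unique (values rounded, only two features kept).
unique_tracks = pd.DataFrame({
    "track": ["Last Train to London", "Gloria", "Disco Inferno"],
    "artist": ["Electric Light Orchestra", "The Lumineers", "The Trammps"],
    "danceability": [0.70, 0.46, 0.75],
    "energy": [0.54, 0.72, 0.85],
})

by_name = unique_tracks.set_index("track")

# Select the radar values by track name instead of by row position.
r_values = by_name.loc["Disco Inferno", features]
print(r_values.tolist())
```

This assumes track names are unique in the table; if two artists shared a title, the index would need both `track` and `artist`.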

4 - The audio features corresponding to the tracks I listen to in the morning, afternoon, evening and night/party:

#Let's set datetime as index 
spotify_features.set_index('datetime', inplace = True)
spotify_features.head()
track artist min_played danceability energy valence loudness instrumentalness acousticness tempo mode
datetime
2020-07-16 15:43:00 L'Orchestrina Paolo Conte 3.314000 0.739596 0.822154 0.755892 0.653855 0.000191 0.580904 117.47 1
2020-08-15 19:14:00 L'Orchestrina Paolo Conte 3.311583 0.739596 0.822154 0.755892 0.653855 0.000191 0.580904 117.47 1
2020-08-17 13:01:00 L'Orchestrina Paolo Conte 3.311533 0.739596 0.822154 0.755892 0.653855 0.000191 0.580904 117.47 1
2020-08-23 15:41:00 L'Orchestrina Paolo Conte 3.311450 0.739596 0.822154 0.755892 0.653855 0.000191 0.580904 117.47 1
2020-08-25 14:27:00 L'Orchestrina Paolo Conte 3.314000 0.739596 0.822154 0.755892 0.653855 0.000191 0.580904 117.47 1
#Let's use the datetimeindex to find the hour of listening
spotify_features.reset_index(inplace= True)
spotify_features['hour'] = pd.DatetimeIndex(spotify_features["datetime"]).hour
spotify_features.head()
datetime track artist min_played danceability energy valence loudness instrumentalness acousticness tempo mode hour
0 2020-07-16 15:43:00 L'Orchestrina Paolo Conte 3.314000 0.739596 0.822154 0.755892 0.653855 0.000191 0.580904 117.47 1 15
1 2020-08-15 19:14:00 L'Orchestrina Paolo Conte 3.311583 0.739596 0.822154 0.755892 0.653855 0.000191 0.580904 117.47 1 19
2 2020-08-17 13:01:00 L'Orchestrina Paolo Conte 3.311533 0.739596 0.822154 0.755892 0.653855 0.000191 0.580904 117.47 1 13
3 2020-08-23 15:41:00 L'Orchestrina Paolo Conte 3.311450 0.739596 0.822154 0.755892 0.653855 0.000191 0.580904 117.47 1 15
4 2020-08-25 14:27:00 L'Orchestrina Paolo Conte 3.314000 0.739596 0.822154 0.755892 0.653855 0.000191 0.580904 117.47 1 14
#Let's create a function to create the column period (values: morning/afternoon/evening/night_party)
def periodofday(x):
    if (x >= 6) and (x < 12):
        return 'morning'
    elif (x >= 12) and (x < 20 ):
        return 'afternoon'
    elif (x >= 20) or (x < 1):
        return 'evening'
    elif (x >= 1) and (x < 6):
        return 'night_party'

spotify_features['period'] = spotify_features['hour'].apply(periodofday)
spotify_features['period'] = pd.Categorical(spotify_features['period'], ['morning', 'afternoon', 'evening', 'night_party'])
spotify_features['period'].value_counts()
afternoon      2324
morning        1113
evening         950
night_party     251
Name: period, dtype: int64
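Because the evening period wraps around midnight (20H to 1H), it is worth checking the boundary hours. The sketch below restates the same rules and labels all 24 hours once in a lookup table that can be reused with `Series.map`; the `hour_to_period` dictionary is an assumption of this sketch, not something used in the analysis above.

```python
import pandas as pd

def periodofday(x):
    # Same rules as the function above.
    if 6 <= x < 12:
        return "morning"
    elif 12 <= x < 20:
        return "afternoon"
    elif x >= 20 or x < 1:
        return "evening"
    else:
        return "night_party"

# Label all 24 hours once, then reuse the lookup table with Series.map.
hour_to_period = {h: periodofday(h) for h in range(24)}

hours = pd.Series([0, 1, 5, 6, 11, 12, 19, 20, 23])
print(hours.map(hour_to_period).tolist())
```

Hour 0 falls in the evening and hours 1 to 5 in the night/party period, which matches the intervals used in the chart legends.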
#Let's create  a table for each period
morning = spotify_features.loc[spotify_features['period'] == 'morning']
afternoon = spotify_features.loc[spotify_features['period'] == 'afternoon']
evening = spotify_features.loc[spotify_features['period'] == 'evening']
night = spotify_features.loc[spotify_features['period'] == 'night_party']
#Let's compare the audio features of each period

categories = ['Danceability', 'Energy', 'Valence', 'Loudness','Instrumentalness', 'Acousticness']



fig = make_subplots(rows=1, cols=1, specs=[[{'type': 'polar'}]])


fig.add_trace(go.Scatterpolar(
      r= night.iloc[:, 4:10].mean(),
      theta=categories,
      fill='toself',
      name='Night (party): 1 - 6H'
), 1,1)


fig.add_trace(go.Scatterpolar(
      r= evening.iloc[:, 4:10].mean(),
      theta=categories,
      fill='toself',
      name='Evening: 20 - 1H'
), 1,1)


fig.add_trace(go.Scatterpolar(
      r= afternoon.iloc[:, 4:10].mean(),
      theta=categories,
      fill='toself',
      name='Afternoon: 12 - 20H'
), 1,1)


fig.add_trace(go.Scatterpolar(
      r= morning.iloc[:, 4:10].mean(),
      theta=categories,
      fill='toself',
      name='Morning: 6 - 12H'
), 1,1)




fig.update_layout(
    title = "Audio features by periods of the day",
    
    polar=dict(
    angularaxis_showticklabels=True,
    radialaxis_showticklabels=True,
    radialaxis=dict(
     visible=True,
      range=[0, 1])),
  showlegend=True)


fig.update_layout(legend=dict(
    orientation="h",
    yanchor="bottom",
    y=-0.3,
    xanchor="left",
    x=0
))


fig.write_image(r'C:\Users\Tristan\Documents\DATA\spotify_project\audio_features_day.png')
fig.show()

daytracks_png

Interpretation:

It can be seen that as the day progresses, I listen to music that has more valence and more energy and is more danceable. I also seem to listen to more acoustic music in the morning (and afternoon) than in the evening and night/party.
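The same comparison can be read as numbers instead of from the radar chart by grouping on the `period` column. This is a minimal sketch on made-up plays; only the column names come from this project.

```python
import pandas as pd

# Made-up plays with a period label (hypothetical values).
plays = pd.DataFrame({
    "period": ["morning", "morning", "night_party", "night_party"],
    "energy": [0.4, 0.6, 0.8, 1.0],
    "acousticness": [0.5, 0.7, 0.1, 0.3],
})

# One mean per period and feature, in a single table.
period_means = plays.groupby("period")[["energy", "acousticness"]].mean()
print(period_means)

# Difference between the night/party profile and the morning profile.
night_vs_morning = period_means.loc["night_party"] - period_means.loc["morning"]
print(night_vs_morning)
```

In this toy example the night/party rows have more energy and less acousticness than the morning rows, which is the kind of shift described above.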

So if I want music for the morning, it should be a track with a high score for acousticness and a low score for other features (compared to all the tracks I have listened to). For the night it should be a track with a low score for acousticness and high score for the other features (compared to all the tracks I have listened to). For the afternoon and evening it should be between those two extremes.

That is why I will use:

  • for the morning, audio features that are under the average (above for acousticness) of the morning audio features.
  • for the afternoon, audio features that are between the quartile 1 and quartile 3 of the afternoon audio features.
  • for the evening, audio features that are between the quartile 1 and quartile 3 of the evening audio features.
  • for the night/party, audio features that are above the average (under for acousticness) of the night audio features.
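The quartile rule for the afternoon and evening can be sketched as one reusable mask: a track qualifies if every feature lies strictly between Q1 and Q3 of the reference period. `within_iqr` is a hypothetical helper and the numbers are made up.

```python
import pandas as pd

def within_iqr(df, reference, features):
    """Boolean mask: rows of df strictly between Q1 and Q3 of `reference` for every feature."""
    mask = pd.Series(True, index=df.index)
    for feature in features:
        q1 = reference[feature].quantile(0.25)
        q3 = reference[feature].quantile(0.75)
        mask &= (df[feature] > q1) & (df[feature] < q3)
    return mask

# Made-up values (hypothetical): the reference period and some candidate tracks.
reference = pd.DataFrame({
    "energy":  [0.0, 0.2, 0.4, 0.6, 0.8],
    "valence": [0.1, 0.3, 0.5, 0.7, 0.9],
})
candidates = pd.DataFrame({
    "energy":  [0.4, 0.7, 0.3],
    "valence": [0.5, 0.5, 0.95],
})

iqr_mask = within_iqr(candidates, reference, ["energy", "valence"])
print(iqr_mask.tolist())
```

Only the first candidate passes: the second has too much energy and the third too much valence for the middle band.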

5 - Using these findings and only the audio features, let's find a track I might like to listen to in the morning/afternoon/evening/and during a party at night:

morning_example = spotify_features.loc[(spotify_features.danceability < morning.danceability.mean() ) &
                       (spotify_features.energy < morning.energy.mean() ) &
                       (spotify_features.valence < morning.valence.mean()) &
                       (spotify_features.loudness < morning.loudness.mean()) &
                       (spotify_features.instrumentalness < morning.instrumentalness.mean() )&
                       (spotify_features.acousticness > morning.acousticness.mean() )
                       ].reset_index().drop_duplicates(subset=['track','artist']).sample(n=1,random_state=3)

morning_example
index datetime track artist min_played danceability energy valence loudness instrumentalness acousticness tempo mode hour period
308 3983 2021-05-06 13:07:00 To The Sea Jack Johnson 3.49935 0.539834 0.429878 0.510732 0.68128 0.00005 0.566833 160.393 1 13 afternoon
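Writing one comparison per feature by hand makes it easy to pair a column with the wrong mean. A possible alternative is a helper that names each feature once and states on which side of the reference mean it should fall. `mean_threshold_mask` is a hypothetical helper and all values below are made up.

```python
import pandas as pd

def mean_threshold_mask(df, reference, below, above):
    """Boolean mask: rows under the reference mean for `below` features and over it for `above` ones."""
    mask = pd.Series(True, index=df.index)
    for feature in below:
        mask &= df[feature] < reference[feature].mean()
    for feature in above:
        mask &= df[feature] > reference[feature].mean()
    return mask

# Made-up values (hypothetical) standing in for the morning table and the full history.
morning_ref = pd.DataFrame({
    "energy": [0.4, 0.6],
    "loudness": [0.5, 0.7],
    "acousticness": [0.2, 0.4],
})
history = pd.DataFrame({
    "energy":       [0.3, 0.3, 0.9],
    "loudness":     [0.5, 0.8, 0.5],
    "acousticness": [0.6, 0.6, 0.6],
})

mean_mask = mean_threshold_mask(
    history, morning_ref, below=["energy", "loudness"], above=["acousticness"]
)
print(mean_mask.tolist())
```

For the night/party rule the same helper could be called with the lists swapped, so that acousticness is the only feature required to be below the mean.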
afternoon_example = spotify_features.loc[(spotify_features.danceability > afternoon.danceability.quantile(0.25) ) &
                       (spotify_features.danceability < afternoon.danceability.quantile(0.75) ) &
                       (spotify_features.energy > afternoon.energy.quantile(0.25) ) &
                       (spotify_features.energy < afternoon.energy.quantile(0.75) ) &
                       (spotify_features.valence > afternoon.valence.quantile(0.25) ) &
                       (spotify_features.valence < afternoon.valence.quantile(0.75)) &
                       (spotify_features.loudness > afternoon.loudness.quantile(0.25) )&
                       (spotify_features.loudness < afternoon.loudness.quantile(0.75) )&
                       (spotify_features.instrumentalness > afternoon.instrumentalness.quantile(0.25) )&
                       (spotify_features.instrumentalness < afternoon.instrumentalness.quantile(0.75) )&
                       (spotify_features.acousticness > afternoon.acousticness.quantile(0.25) )&
                       (spotify_features.acousticness < afternoon.acousticness.quantile(0.75) )
                       ].reset_index().drop_duplicates(subset=['track','artist']).sample(n=1,random_state=1234)

afternoon_example
index datetime track artist min_played danceability energy valence loudness instrumentalness acousticness tempo mode hour period
57 1566 2020-08-19 23:58:00 Knee Socks Arctic Monkeys 4.292717 0.629013 0.54065 0.58649 0.668653 0.003025 0.123616 97.975 0 23 evening
evening_example = spotify_features.loc[(spotify_features.danceability > evening.danceability.quantile(0.25) ) &
                       (spotify_features.danceability < evening.danceability.quantile(0.75) ) &
                       (spotify_features.energy > evening.energy.quantile(0.25) ) &
                       (spotify_features.energy < evening.energy.quantile(0.75) ) &
                       (spotify_features.valence > evening.valence.quantile(0.25) ) &
                       (spotify_features.valence < evening.valence.quantile(0.75)) &
                       (spotify_features.loudness > evening.loudness.quantile(0.25) )&
                       (spotify_features.loudness < evening.loudness.quantile(0.75) )&
                       (spotify_features.instrumentalness > evening.instrumentalness.quantile(0.25) )&
                       (spotify_features.instrumentalness < evening.instrumentalness.quantile(0.75) )&
                       (spotify_features.acousticness > evening.acousticness.quantile(0.25) )&
                       (spotify_features.acousticness < evening.acousticness.quantile(0.75) )
                       ].reset_index().drop_duplicates(subset=['track','artist']).sample(n=1,random_state=98765)

evening_example
index datetime track artist min_played danceability energy valence loudness instrumentalness acousticness tempo mode hour period
44 3250 2021-03-08 16:46:00 Natural Blues Moby 2.178883 0.64566 0.650407 0.67803 0.685266 0.052183 0.048942 107.992 0 16 afternoon
night_party_example =  spotify_features.loc[(spotify_features.danceability > night.danceability.mean() ) &
                       (spotify_features.energy > night.energy.mean() ) &
                       (spotify_features.valence > night.valence.mean()) &
                       (spotify_features.loudness > night.loudness.mean()) &
                       (spotify_features.instrumentalness > night.instrumentalness.mean() )&
                       (spotify_features.acousticness < night.acousticness.mean() )
                       ].reset_index().drop_duplicates(subset=['track','artist']).sample(n=1,random_state=125631)
night_party_example
index datetime track artist min_played danceability energy valence loudness instrumentalness acousticness tempo mode hour period
147 3091 2021-01-15 02:37:00 In Degrees - Purple Disco Machine Remix Foals 4.0534 0.841855 0.767276 0.90846 0.771013 0.435551 0.003817 121.986 1 2 night_party
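Each example above is a single random draw from a pool of candidates, so the pick depends on `random_state`, and dropping duplicates before sampling decides how likely each track is to be drawn. The sketch below shows both points on a made-up pool (the `pool` table is an assumption).

```python
import pandas as pd

# Made-up candidate pool (hypothetical), with repeated plays of the same track.
pool = pd.DataFrame({
    "track": ["A", "A", "B", "C", "C", "C"],
    "artist": ["X", "X", "Y", "Z", "Z", "Z"],
})

# Dropping duplicates first means each track is equally likely to be drawn,
# however many times it was played.
unique_pool = pool.drop_duplicates(subset=["track", "artist"])
print(len(unique_pool))

# The same random_state always returns the same row.
pick_1 = unique_pool.sample(n=1, random_state=3)
pick_2 = unique_pool.sample(n=1, random_state=3)
same_pick = pick_1["track"].iloc[0] == pick_2["track"].iloc[0]
print(same_pick)
```

Printing the size of the pool before sampling is a cheap way to see how selective each set of conditions is.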

Results:

So the tracks for my day could be:

  • Morning (6 - 12H): 'To The Sea' by 'Jack Johnson'
  • Afternoon (12 - 20H): 'Knee Socks' by 'Arctic Monkeys'
  • Evening (20 - 1H): 'Natural Blues' by 'Moby'
  • During a party at night (1 - 6H): 'In Degrees - Purple Disco Machine Remix' by 'Foals'

Comment: Indeed, this could be what I listen to during different times of the day (and at a party in the night).

Conclusion

We have seen in the exploratory data analysis (EDA) as well as in the results of the analysis that my musical tastes are very diverse. Indeed, I like to listen to quiet music such as solo piano, louder music such as rock bands, music with vocals, but sometimes also music without any vocals such as electro/techno. I have a small preference for music that puts me in a good mood and gives me energy, like my favourite track 'Disco Inferno', but I also like melancholic or more acoustic music, like the tracks by my favourite artist 'The Lumineers'. For example, I love to listen to 'The Lumineers' when I'm working; it helps me to concentrate.

I seem to listen to more music in the summer and during the Christmas/New Year period. Across the week, we saw that I listen to more music on Saturdays and less on Thursdays. It was also interesting to see that between Monday and Friday I listen to the most music during my lunch break (around 11H, 12H, 13H, 14H), while at the weekend it is clearly on Saturday afternoon after 17H that I listen to Spotify the most.

Finally, we found a pattern between the time of day and the audio features I prefer. We saw that as the day progresses, I listen to music with more valence and more energy that is more danceable and less acoustic. We also conducted an interesting experiment to find a song I might like for each period of the day, simply by selecting a song within a certain range of audio features. The result is indeed relevant, as the four tracks chosen could really match what I like to listen to at each time of the day.

Comment: To be even more precise and relevant, we would need more information about the tracks, such as their genre.

from IPython.display import Image
Image(r'C:\Users\Tristan\Documents\DATA\spotify_project\spotify_lists.png')

sumup_png
