# Top 50 Spotify Tracks - 2020 TC DA Project

_For information about this notebook please refer to README_

## Imports and Loads

In [1]:
import pandas as pd
import numpy as np

In [2]:
data = pd.read_csv('spotifytoptracks.csv', index_col=0)
data.head(3)

Unnamed: 0,artist,album,track_name,track_id,energy,danceability,key,loudness,acousticness,speechiness,instrumentalness,liveness,valence,tempo,duration_ms,genre
0,The Weeknd,After Hours,Blinding Lights,0VjIjW4GlUZAMYd2vXMi3b,0.73,0.514,1,-5.934,0.00146,0.0598,9.5e-05,0.0897,0.334,171.005,200040,R&B/Soul
1,Tones And I,Dance Monkey,Dance Monkey,1rgnBhdG2JDFTbYkYRZAku,0.593,0.825,6,-6.401,0.688,0.0988,0.000161,0.17,0.54,98.078,209755,Alternative/Indie
2,Roddy Ricch,Please Excuse Me For Being Antisocial,The Box,0nbXyq5TXYPCO7pr3N8S4I,0.586,0.896,10,-6.687,0.104,0.0559,0.0,0.79,0.642,116.971,196653,Hip-Hop/Rap


## Cleaning

In [3]:
data.isnull().any(axis=None)

False

In [4]:
data.duplicated().any()

False

The data appears to contain no null values and no duplicate entries. Since the dataset consists of the top Spotify tracks, extreme values are genuine observations rather than errors, so there is no reason to look for outliers or to drop them.
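As a hedged illustration of what these checks do and do not catch, the sketch below uses a small made-up frame (the `toy` rows are hypothetical, not taken from the dataset). It shows that `duplicated()` on whole rows can miss a repeated `track_id` when another column differs, and that `isnull().sum()` reports nulls per column.

```python
import pandas as pd

# Hypothetical rows for illustration only; not real dataset entries.
toy = pd.DataFrame({
    "track_id": ["a", "b", "a"],
    "track_name": ["X", "Y", "X (remaster)"],
    "energy": [0.5, None, 0.5],
})

# Number of missing values in each column.
nulls_per_column = toy.isnull().sum()

# Rows that are identical in every column.
full_row_dupes = int(toy.duplicated().sum())

# Rows that repeat an earlier track_id, even if other columns differ.
id_dupes = int(toy.duplicated(subset=["track_id"]).sum())

print(nulls_per_column)
print(full_row_dupes, id_dupes)
```

Whether a per-column or per-identifier check is worth adding depends on how the source file was assembled, which this notebook does not state.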

Let's check the genre values, since that's where there could be potential for duplication.

In [5]:
data.genre.unique()

array(['R&B/Soul', 'Alternative/Indie', 'Hip-Hop/Rap', 'Dance/Electronic',
       'Nu-disco', 'Pop', 'R&B/Hip-Hop alternative', 'Pop/Soft Rock',
       'Pop rap', ' Electro-pop', 'Hip-Hop/Trap', 'Dance-pop/Disco',
       'Disco-pop', 'Dreampop/Hip-Hop/R&B',
       'Alternative/reggaeton/experimental', 'Chamber pop'], dtype=object)

Okay, so we see that many of the genres share keywords, denoting a possible mix of genres or a sub-genre of music. Given this data, there is no way to tell which of the two or three candidate genres a song is closest to, or whether it is a unique genre in its own right. Therefore we'll keep these labels as stand-alone genres.
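One detail in the output above is that `' Electro-pop'` carries a leading space. In this dataset it does not create a duplicate label, but stray whitespace is a common source of hidden duplicates in categorical columns. A minimal sketch of normalising labels with `str.strip()`, on a made-up frame (the `toy` values and the `genre_clean` column name are assumptions for illustration):

```python
import pandas as pd

# Hypothetical labels; the padded variants mimic the ' Electro-pop' case.
toy = pd.DataFrame({"genre": ["Pop", " Electro-pop", "Electro-pop", "Pop "]})

# Remove leading and trailing whitespace from every label.
toy["genre_clean"] = toy["genre"].str.strip()

n_before = toy["genre"].nunique()
n_after = toy["genre_clean"].nunique()
print(n_before, n_after)
```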

In [6]:
data.describe()

Unnamed: 0,energy,danceability,key,loudness,acousticness,speechiness,instrumentalness,liveness,valence,tempo,duration_ms
count,50.0,50.0,50.0,50.0,50.0,50.0,50.0,50.0,50.0,50.0,50.0
mean,0.6093,0.71672,5.72,-6.2259,0.256206,0.124158,0.015962,0.196552,0.55571,119.69046,199955.36
std,0.154348,0.124975,3.709007,2.349744,0.26525,0.116836,0.094312,0.17661,0.216386,25.414778,33996.122488
min,0.225,0.351,0.0,-14.454,0.00146,0.029,0.0,0.0574,0.0605,75.801,140526.0
25%,0.494,0.6725,2.0,-7.5525,0.0528,0.048325,0.0,0.09395,0.434,99.55725,175845.5
50%,0.597,0.746,6.5,-5.9915,0.1885,0.07005,0.0,0.111,0.56,116.969,197853.5
75%,0.72975,0.7945,8.75,-4.2855,0.29875,0.1555,2e-05,0.27125,0.72625,132.317,215064.0
max,0.855,0.935,11.0,-3.28,0.934,0.487,0.657,0.792,0.925,180.067,312820.0


## Exploration

The data contains the following number of observations and features, respectively:

In [7]:
data.shape

(50, 16)

In [8]:
data.columns

Index(['artist', 'album', 'track_name', 'track_id', 'energy', 'danceability',
       'key', 'loudness', 'acousticness', 'speechiness', 'instrumentalness',
       'liveness', 'valence', 'tempo', 'duration_ms', 'genre'],
      dtype='object')

artist, album, track_name, track_id, key and genre are __categorical__ features.
On the other hand, energy, danceability, loudness, acousticness, speechiness, instrumentalness, liveness, valence, tempo and duration_ms are __numeric__ features.
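Although `key` is stored as an integer, it is a label rather than a quantity. A sketch of making that explicit follows, assuming the column uses standard pitch-class notation (0 = C, 1 = C♯/D♭, and so on up to 11 = B), which is how Spotify documents its audio features. The `PITCH_CLASSES` list and the `key_name` column are names introduced here for illustration; the three rows reuse the `key` values shown in `data.head(3)`.

```python
import pandas as pd

# Assumed mapping: standard pitch-class notation, index = key value.
PITCH_CLASSES = ["C", "C#/Db", "D", "D#/Eb", "E", "F",
                 "F#/Gb", "G", "G#/Ab", "A", "A#/Bb", "B"]

toy = pd.DataFrame({
    "track_name": ["Blinding Lights", "Dance Monkey", "The Box"],
    "key": [1, 6, 10],
})

# Map the integers to names and store them as an ordered set of categories.
key_names = toy["key"].map(dict(enumerate(PITCH_CLASSES)))
toy["key_name"] = pd.Categorical(key_names, categories=PITCH_CLASSES)
print(toy)
```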

Let's see which artists had more than one hit song in 2020.

In [9]:
hits_per_artist = data.groupby('artist')['track_name'].count()
hits_per_artist[hits_per_artist > 1].sort_values(ascending=False)

artist
Billie Eilish    3
Dua Lipa         3
Travis Scott     3
Harry Styles     2
Justin Bieber    2
Lewis Capaldi    2
Post Malone      2
Name: track_name, dtype: int64

We see that there are __7__ artists who had more than one hit track in __2020__, with __Billie Eilish__, __Dua Lipa__ and __Travis Scott__ leading at 3 tracks each.

In [10]:
data['artist'].unique().size

40

In total, 40 artists share the top 50 tracks.

In [11]:
hits_per_album = data.groupby('album')['track_name'].count()
hits_per_album[hits_per_album > 1].sort_values(ascending=False)

album
Future Nostalgia        3
Changes                 2
Fine Line               2
Hollywood's Bleeding    2
Name: track_name, dtype: int64

In [12]:
data['album'].unique().size

45

The album Future Nostalgia had 3 songs in the top 50, while Changes, Fine Line and Hollywood's Bleeding had 2 each.
In total there are 45 albums in this list.
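The per-artist and per-album counts above can also be obtained with `value_counts()`, which counts and sorts in one step. This is only an alternative sketch on made-up data (the artist names `A`, `B`, `C` are placeholders), not the method used in the cells above.

```python
import pandas as pd

# Placeholder artists for illustration only.
toy = pd.DataFrame({"artist": ["A", "B", "A", "C", "B", "A"]})

# Count tracks per artist, sorted from most to fewest.
counts = toy["artist"].value_counts()

# Keep only artists that appear more than once.
repeat_artists = counts[counts > 1]

# Number of distinct artists.
n_artists = toy["artist"].nunique()
print(repeat_artists, n_artists)
```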

Now, let's see how easy it is to dance to some of these.

The following songs have a danceability score above 0.7:

In [13]:
data.query('danceability > 0.7')['track_name']

1                                      Dance Monkey
2                                           The Box
3                             Roses - Imanbek Remix
4                                   Don't Start Now
5                      ROCKSTAR (feat. Roddy Ricch)
7                  death bed (coffee for your head)
8                                           Falling
10                                             Tusa
13                                  Blueberry Faygo
14                         Intentions (feat. Quavo)
15                                     Toosie Slide
17                                           Say So
18                                         Memories
19                       Life Is Good (feat. Drake)
20                 Savage Love (Laxed - Siren Beat)
22                                      Breaking Me
24                              everything i wanted
25                                         Señorita
26                                          bad guy
27          

And these have a danceability score below 0.4:

In [14]:
data.query('danceability < 0.4')['track_name']

44    lovely (with Khalid)
Name: track_name, dtype: object
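Instead of running one query per threshold, the same cut-offs can be applied in a single pass with `pd.cut`. The sketch below is an assumption-laden illustration: the track names are placeholders, the danceability values are borrowed from the outputs above, and the bucket labels `low`/`mid`/`high` are invented here. Note that `pd.cut` intervals are closed on the right by default, so a value of exactly 0.4 or 0.7 would fall into the lower bucket, unlike the strict comparisons in the queries.

```python
import pandas as pd

# Placeholder names; values taken from the describe() and head() outputs.
toy = pd.DataFrame({
    "track_name": ["track_a", "track_b", "track_c", "track_d"],
    "danceability": [0.351, 0.514, 0.825, 0.896],
})

# Bins are (0.0, 0.4], (0.4, 0.7], (0.7, 1.0].
toy["bucket"] = pd.cut(
    toy["danceability"],
    bins=[0.0, 0.4, 0.7, 1.0],
    labels=["low", "mid", "high"],
)

bucket_counts = toy["bucket"].value_counts()
print(toy)
```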

Let's check out the loudness aspect of our data.

These are the loudest tracks with loudness score above -5:

In [15]:
data.query('loudness > -5')['track_name']

4                                   Don't Start Now
6                                  Watermelon Sugar
10                                             Tusa
12                                          Circles
16                                    Before You Go
17                                           Say So
21                                        Adore You
23                           Mood (feat. iann dior)
31                                   Break My Heart
32                                         Dynamite
33                 Supalonely (feat. Gus Dapperton)
35                  Rain On Me (with Ariana Grande)
37    Sunflower - Spider-Man: Into the Spider-Verse
38                                            Hawái
39                                          Ride It
40                                       goosebumps
43                                          Safaera
48                                         Physical
49                                       SICKO MODE
Name: track_

Conversely, these are the quietest ones with loudness below -8.

In [16]:
data.query('loudness < -8')['track_name']

7                   death bed (coffee for your head)
8                                            Falling
15                                      Toosie Slide
20                  Savage Love (Laxed - Siren Beat)
24                               everything i wanted
26                                           bad guy
36                               HIGHEST IN THE ROOM
44                              lovely (with Khalid)
47    If the World Was Ending - feat. Julia Michaels
Name: track_name, dtype: object

Let's check which tracks are at the extremes of the dataset with respect to length.

In [17]:
min_max_dur = data[['track_name','duration_ms']].agg(['min', 'max'])

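Convert the duration to minutes. Note that `agg(['min', 'max'])` is applied to each column separately, so the `track_name` values below are the alphabetically first and last titles, not necessarily the tracks with the shortest and longest durations.

In [18]:
min_max_dur['duration_m'] = min_max_dur['duration_ms'] / 60000
min_max_dur.drop('duration_ms', axis=1)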

Unnamed: 0,track_name,duration_m
min,Adore You,2.3421
max,lovely (with Khalid),5.213667
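To pair each extreme duration with the track it actually belongs to, one option is to look up the rows via `idxmin()` and `idxmax()`. The sketch below uses only the three rows shown in `data.head(3)`, so its result says nothing about the full top 50; the `extremes` name is introduced here for illustration.

```python
import pandas as pd

# The three rows visible in data.head(3).
toy = pd.DataFrame({
    "track_name": ["Blinding Lights", "Dance Monkey", "The Box"],
    "duration_ms": [200040, 209755, 196653],
})

# Row labels of the shortest and longest tracks.
shortest = toy["duration_ms"].idxmin()
longest = toy["duration_ms"].idxmax()

# Select whole rows, so name and duration stay together.
extremes = toy.loc[[shortest, longest], ["track_name", "duration_ms"]].copy()
extremes.index = ["min", "max"]
extremes["duration_m"] = extremes["duration_ms"] / 60000

# For contrast: a column-wise minimum of the names is purely alphabetical.
alphabetical_first = toy["track_name"].min()
print(extremes)
```

In this small sample the alphabetically first title is "Blinding Lights", while the shortest track is "The Box", which is the kind of mismatch a column-wise aggregation can hide.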


Now let's examine the genre aspect of our dataset.
For starters, which of the genres is the most popular one?

In [19]:
data.groupby('genre')['track_name'].count().sort_values(ascending=False)

genre
Pop                                   14
Hip-Hop/Rap                           13
Dance/Electronic                       5
Alternative/Indie                      4
 Electro-pop                           2
R&B/Soul                               2
Alternative/reggaeton/experimental     1
Chamber pop                            1
Dance-pop/Disco                        1
Disco-pop                              1
Dreampop/Hip-Hop/R&B                   1
Hip-Hop/Trap                           1
Nu-disco                               1
Pop rap                                1
Pop/Soft Rock                          1
R&B/Hip-Hop alternative                1
Name: track_name, dtype: int64

Within the 50 most popular tracks the leading genre is Pop (14 tracks), very closely followed by Hip-Hop/Rap (13 tracks). If we classified the other sub-genres as belonging to Pop or Hip-Hop/Rap, this could change the picture. However, additional data would be needed to do so.
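To get a rough feel for how much the picture could change, the sketch below applies a naive keyword rule to the genre counts printed above: a label counts towards a family if it merely contains the keyword. This rule is an assumption made for illustration, not a defensible classification; mixed labels such as "Pop rap" are counted in both families, and the regular expression `hip-hop|rap` also matches "Trap".

```python
import pandas as pd

# Genre counts copied from the output above (labels kept verbatim,
# including the leading space in ' Electro-pop').
genre_counts = pd.Series({
    "Pop": 14, "Hip-Hop/Rap": 13, "Dance/Electronic": 5,
    "Alternative/Indie": 4, " Electro-pop": 2, "R&B/Soul": 2,
    "Alternative/reggaeton/experimental": 1, "Chamber pop": 1,
    "Dance-pop/Disco": 1, "Disco-pop": 1, "Dreampop/Hip-Hop/R&B": 1,
    "Hip-Hop/Trap": 1, "Nu-disco": 1, "Pop rap": 1,
    "Pop/Soft Rock": 1, "R&B/Hip-Hop alternative": 1,
})

# Naive, case-insensitive keyword matching on the labels.
is_pop_like = genre_counts.index.str.contains("pop", case=False)
is_hiphop_like = genre_counts.index.str.contains("hip-hop|rap", case=False)

pop_like = int(genre_counts[is_pop_like].sum())
hiphop_like = int(genre_counts[is_hiphop_like].sum())
print(pop_like, hiphop_like)
```

Under this crude rule the gap between the two families widens rather than narrows, but the overlap between them shows why proper genre assignment would need more information than the labels alone.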

Alright, now let's have a look at the genres which only have 1 song in the top 50 chart.

In [20]:
least_to_most = data.groupby('genre')['track_name'].count().sort_values()
least_to_most[least_to_most == 1]

genre
Alternative/reggaeton/experimental    1
Chamber pop                           1
Dance-pop/Disco                       1
Disco-pop                             1
Dreampop/Hip-Hop/R&B                  1
Hip-Hop/Trap                          1
Nu-disco                              1
Pop rap                               1
Pop/Soft Rock                         1
R&B/Hip-Hop alternative               1
Name: track_name, dtype: int64

So it seems that most of these genres are derivatives or 'different flavours' of the more popular ones.

Finally, let's see how many genres we have in total.

In [21]:
data.genre.unique().size

16

Now we'll look at correlations between features.

In [22]:
corr_data = data.corr(numeric_only=True)
corr_data.head()

Unnamed: 0,energy,danceability,key,loudness,acousticness,speechiness,instrumentalness,liveness,valence,tempo,duration_ms
energy,1.0,0.152552,0.062428,0.79164,-0.682479,0.074267,-0.385515,0.069487,0.393453,0.075191,0.081971
danceability,0.152552,1.0,0.285036,0.167147,-0.359135,0.226148,-0.017706,-0.006648,0.479953,0.168956,-0.033763
key,0.062428,0.285036,1.0,-0.009178,-0.113394,-0.094965,0.020802,0.278672,0.120007,0.080475,-0.003345
loudness,0.79164,0.167147,-0.009178,1.0,-0.498695,-0.021693,-0.553735,-0.069939,0.406772,0.102097,0.06413
acousticness,-0.682479,-0.359135,-0.113394,-0.498695,1.0,-0.135392,0.352184,-0.128384,-0.243192,-0.241119,-0.010988


Let's examine the strongest positive and negative correlations. We'll use the common threshold of 0.7 to denote a strong correlation.

In [23]:
str_pos = corr_data[corr_data[corr_data > 0.7].count() > 1]
str_pos[str_pos.index]

Unnamed: 0,energy,loudness
energy,1.0,0.79164
loudness,0.79164,1.0


So the only two features strongly positively correlated are energy and loudness.

What about the negative?

In [24]:
str_neg = corr_data[corr_data[corr_data < -0.7].count() > 0]
str_neg[str_neg.index]

So, at our threshold, no features are strongly negatively correlated. The closest pair in the rows shown above is energy and acousticness, at about -0.68.
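An alternative way to read off strong correlations is to list each feature pair once, by keeping only the upper triangle of the matrix and stacking it into a Series. The sketch below is a possible approach, not the one used in the cells above. It rebuilds a 3x3 excerpt of the matrix from the values printed earlier; the names `upper` and `pairs` are introduced here.

```python
import numpy as np
import pandas as pd

# Excerpt of the correlation matrix, values copied from the output above.
features = ["energy", "loudness", "acousticness"]
corr = pd.DataFrame(
    [[1.0, 0.79164, -0.682479],
     [0.79164, 1.0, -0.498695],
     [-0.682479, -0.498695, 1.0]],
    index=features, columns=features,
)

# True strictly above the diagonal, so each pair appears once
# and the trivial self-correlations are excluded.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna()

strong_pos = pairs[pairs > 0.7]
strong_neg = pairs[pairs < -0.7]
most_negative = pairs.idxmin()
print(strong_pos)
print(most_negative)
```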

Are there any pairs of features with exactly zero correlation?

In [25]:
(corr_data == 0).any(axis=None)

False
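An exact comparison with 0 is a very strict test for floating-point correlations, so a small tolerance may be more informative. The sketch below uses a 3x3 excerpt rebuilt from the values printed earlier and an arbitrary cut-off of 0.01, which is an assumption chosen for illustration only.

```python
import numpy as np
import pandas as pd

# Excerpt of the correlation matrix, values copied from the output above.
features = ["danceability", "key", "liveness"]
corr = pd.DataFrame(
    [[1.0, 0.285036, -0.006648],
     [0.285036, 1.0, 0.278672],
     [-0.006648, 0.278672, 1.0]],
    index=features, columns=features,
)

# Exact test: is any coefficient precisely zero?
exact_zero = bool((corr == 0).any(axis=None))

# Tolerant test: list each pair once and keep those with |r| < 0.01.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna()
near_zero = pairs[pairs.abs() < 0.01]
print(exact_zero)
print(near_zero)
```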

Another thing we can do is compare the means of different features by genre.

In [26]:
comparisons = data.pivot_table(['danceability','loudness', 'acousticness'],'genre').loc[
              ["Hip-Hop/Rap", "Pop", "Dance/Electronic", "Alternative/Indie"]]
comparisons

genre,acousticness,danceability,loudness
Hip-Hop/Rap,0.188741,0.765538,-6.917846
Pop,0.323843,0.677571,-6.460357
Dance/Electronic,0.09944,0.755,-5.338
Alternative/Indie,0.5835,0.66175,-5.421


Alternative/Indie and Dance/Electronic are at the extremes in terms of acousticness, being the maximum and minimum respectively. Somewhat interesting is the fact that Pop is more acoustically driven than Hip-Hop/Rap, which might indicate general trends in production technique choices by genre in 2020. However, this data is not sufficient to answer that question.

In [27]:
comparisons['danceability']

genre
Hip-Hop/Rap          0.765538
Pop                  0.677571
Dance/Electronic     0.755000
Alternative/Indie    0.661750
Name: danceability, dtype: float64

Quite surprisingly, Hip-Hop/Rap is the most danceable genre, with a mean danceability higher than that of Dance/Electronic music itself. Alternative/Indie and Pop are the least danceable ones. However, the differences between the genres are not large (about 0.10 between the highest and lowest means).

In [28]:
comparisons['loudness']

genre
Hip-Hop/Rap         -6.917846
Pop                 -6.460357
Dance/Electronic    -5.338000
Alternative/Indie   -5.421000
Name: loudness, dtype: float64

In terms of loudness, Hip-Hop/Rap and Dance/Electronic lie at the extremes, being the minimum and the maximum respectively.
Surprisingly, Alternative/Indie, which is usually perceived as quite a "relaxed" genre of music, is relatively high in loudness. However, we should keep in mind that there are only 4 Alternative/Indie entries, which might give a skewed view of the tendency.
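Since small groups can skew a mean, it can help to report the group size next to it. A minimal sketch with `groupby(...).agg(["mean", "count"])` on made-up numbers (the `toy` loudness values are hypothetical, not dataset entries):

```python
import pandas as pd

# Hypothetical loudness values for illustration only.
toy = pd.DataFrame({
    "genre": ["Pop", "Pop", "Pop", "Alternative/Indie"],
    "loudness": [-6.0, -7.0, -5.0, -5.4],
})

# Mean loudness per genre together with the number of tracks behind it.
summary = toy.groupby("genre")["loudness"].agg(["mean", "count"])
print(summary)
```

Reading the `count` column alongside the mean makes it easier to judge how much weight a comparison between genres can bear.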

## Going Forward

An interesting feature to examine would be musical key and mode and how they correlate with the other features of the dataset. In general, much more could be uncovered with more data, such as total listens for each song, listening times for each song, and geographical listener distribution, to name a few. One potentially interesting examination would be to segregate the songs by mode (Lydian, Mixolydian, Double Harmonic, etc.), gather data on the occurrences of these modes in different countries' folk tunes, and see whether there is some correlation between the modes of the preferred modern songs and the most common modes of folk songs in each country. Combining various sources of data would facilitate finding counter-intuitive and hitherto unknown insights.