<h1>Spotify - Top 50 tracks of 2020</h1>
<a href="https://www.kaggle.com/datasets/atillacolak/top-50-spotify-tracks-2020" target="_blank">Dataset source</a>

<h1>Purpose of the analysis</h1>
<p>The analysis explores several numerical features of the most popular songs on the music streaming platform Spotify. The aim is to assemble a "blueprint" of a hit song.</p>

<h1 id="initial-setup">Initial Setup</h1>
<p>Import the modules, read the file contents, print data head.</p>

In [1]:
import numpy as np
import pandas as pd
from typing import List

In [2]:
df_tracks = pd.read_csv('../data/spotify_top_tracks.csv', index_col=0)
df_tracks.reset_index(inplace=True)
df_tracks = df_tracks.rename(columns = {'index':'placement'})
df_tracks['placement'] = df_tracks['placement'] + 1
df_tracks.head(5).T

Unnamed: 0,0,1,2,3,4
placement,1,2,3,4,5
artist,The Weeknd,Tones And I,Roddy Ricch,SAINt JHN,Dua Lipa
album,After Hours,Dance Monkey,Please Excuse Me For Being Antisocial,Roses (Imanbek Remix),Future Nostalgia
track_name,Blinding Lights,Dance Monkey,The Box,Roses - Imanbek Remix,Don't Start Now
track_id,0VjIjW4GlUZAMYd2vXMi3b,1rgnBhdG2JDFTbYkYRZAku,0nbXyq5TXYPCO7pr3N8S4I,2Wo6QQD1KMDWeFkkjLqwx5,3PfIrDoz19wz7qK7tYeu62
energy,0.73,0.593,0.586,0.721,0.793
danceability,0.514,0.825,0.896,0.785,0.793
key,1,6,10,8,11
loudness,-5.934,-6.401,-6.687,-5.457,-4.521
acousticness,0.00146,0.688,0.104,0.0149,0.0123


<h1 id="data-cleaning">Data Cleaning</h1>
<p>Check whether all numeric values fall within the expected range, query for missing values, and confirm that the dataset contains no duplicate tracks.</p>

In [3]:
df_tracks.describe()

Unnamed: 0,placement,energy,danceability,key,loudness,acousticness,speechiness,instrumentalness,liveness,valence,tempo,duration_ms
count,50.0,50.0,50.0,50.0,50.0,50.0,50.0,50.0,50.0,50.0,50.0,50.0
mean,25.5,0.6093,0.71672,5.72,-6.2259,0.256206,0.124158,0.015962,0.196552,0.55571,119.69046,199955.36
std,14.57738,0.154348,0.124975,3.709007,2.349744,0.26525,0.116836,0.094312,0.17661,0.216386,25.414778,33996.122488
min,1.0,0.225,0.351,0.0,-14.454,0.00146,0.029,0.0,0.0574,0.0605,75.801,140526.0
25%,13.25,0.494,0.6725,2.0,-7.5525,0.0528,0.048325,0.0,0.09395,0.434,99.55725,175845.5
50%,25.5,0.597,0.746,6.5,-5.9915,0.1885,0.07005,0.0,0.111,0.56,116.969,197853.5
75%,37.75,0.72975,0.7945,8.75,-4.2855,0.29875,0.1555,2e-05,0.27125,0.72625,132.317,215064.0
max,50.0,0.855,0.935,11.0,-3.28,0.934,0.487,0.657,0.792,0.925,180.067,312820.0


<p>No extreme values exist among the numeric variables: the minimum and maximum of each stay within the ranges defined in the dataset description.</p>
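<p>The range check can also be made explicit instead of being read off the summary table. The following is a minimal sketch: the helper `out_of_range` and the bounds in `ASSUMED_BOUNDS` are illustrative assumptions, not values taken from the dataset description, and the toy frame stands in for `df_tracks`.</p>

```python
import pandas as pd

# Hypothetical bounds for illustration only; the authoritative ranges
# are the ones given in the dataset description.
ASSUMED_BOUNDS = {
    "energy": (0.0, 1.0),
    "danceability": (0.0, 1.0),
    "key": (0, 11),
    "valence": (0.0, 1.0),
}


def out_of_range(df: pd.DataFrame, bounds: dict) -> pd.Series:
    """Count, per column, the values that fall outside [low, high]."""
    counts = {}
    for column, (low, high) in bounds.items():
        # between() is inclusive on both ends; missing values count as violations.
        counts[column] = int((~df[column].between(low, high)).sum())
    return pd.Series(counts, dtype="int64")


# Toy frame standing in for df_tracks; the second energy value is deliberately invalid.
sample = pd.DataFrame({
    "energy": [0.73, 1.20],
    "danceability": [0.514, 0.825],
    "key": [1, 6],
    "valence": [0.33, 0.51],
})
violations = out_of_range(sample, ASSUMED_BOUNDS)
print(violations[violations > 0])
```

<p>An empty result from the final filter would mean that every checked column stays within its assumed bounds.</p>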

In [4]:
print(f"There is a total of {df_tracks['track_id'].nunique()} unique tracks.")

There is a total of 50 unique tracks.


In [5]:
df_tracks.isna().sum()

placement           0
artist              0
album               0
track_name          0
track_id            0
energy              0
danceability        0
key                 0
loudness            0
acousticness        0
speechiness         0
instrumentalness    0
liveness            0
valence             0
tempo               0
duration_ms         0
genre               0
dtype: int64

<p>There are no missing values or duplicate tracks, and all values fall within the specified ranges. Proceeding to exploratory analysis.</p>
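<p>Counting unique `track_id` values only catches duplicates that share an ID. As a complementary sketch, the hypothetical helper `duplicate_tracks` below compares artist and track name after normalizing case and whitespace; the IDs in the toy frame are made up.</p>

```python
import pandas as pd


def duplicate_tracks(df: pd.DataFrame) -> pd.DataFrame:
    """Return rows whose (artist, track_name) pair repeats, ignoring case and outer whitespace."""
    normalized = pd.DataFrame({
        "artist": df["artist"].str.strip().str.lower(),
        "track_name": df["track_name"].str.strip().str.lower(),
    })
    # keep=False marks every member of a duplicate group, not just the later ones.
    return df[normalized.duplicated(keep=False)]


# Toy frame: the first two rows are the same song under different IDs.
sample = pd.DataFrame({
    "artist": ["Dua Lipa", "dua lipa ", "The Weeknd"],
    "track_name": ["Physical", "physical", "Blinding Lights"],
    "track_id": ["id_a", "id_b", "id_c"],  # made-up IDs
})
dupes = duplicate_tracks(sample)
print(dupes)
```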

<h1 id="exploratory-analysis">Exploratory Data Analysis</h1>

<h2 id="dataset-structure">Dataset Structure</h2>
<p>Familiarize with the dataset.</p>

<b>QUESTION</b>: How many observations are there in this dataset?

In [6]:
total_rows, total_cols = df_tracks.shape
print(f'The dataset contains {total_rows} observations.')

The dataset contains 50 observations.


<hr>
<b>QUESTION</b>: How many features does this dataset have?

In [7]:
numeric_cols = df_tracks.select_dtypes(include=[np.number]).columns
categorical_cols = df_tracks.select_dtypes(exclude=[np.number]).columns
print(f"The dataset has {len(numeric_cols)} numeric columns, {len(categorical_cols)} categorical columns.")

The dataset has 12 numeric columns, 5 categorical columns.


<hr>
<b>QUESTION</b>: Which of the features are categorical?

In [8]:
print(f"Dataset's categorical columns: {', '.join(categorical_cols)}")

Dataset's categorical columns: artist, album, track_name, track_id, genre


<hr>
<b>QUESTION</b>: Which of the features are numeric?

In [9]:
print(f"Dataset's numeric columns: {', '.join(numeric_cols)}")

Dataset's numeric columns: placement, energy, danceability, key, loudness, acousticness, speechiness, instrumentalness, liveness, valence, tempo, duration_ms


<h2 id="artists-overview">Artists' Overview</h2>
<p>Explore the artists of the top 50 tracks.</p>

<b>QUESTION</b>: Are there any artists that have more than 1 popular track? If yes, which and how many?

In [10]:
artist_tracks = df_tracks.groupby('artist')['track_id'].count()
multiple_tracks = artist_tracks[artist_tracks > 1]
multiple_tracks = multiple_tracks.sort_values(ascending=False)
print('The following artists have multiple tracks in the top 50:')
for artist, tracks in multiple_tracks.items():
    print(f'{artist.title()} has {tracks} tracks')

The following artists have multiple tracks in the top 50:
Billie Eilish has 3 tracks
Dua Lipa has 3 tracks
Travis Scott has 3 tracks
Harry Styles has 2 tracks
Justin Bieber has 2 tracks
Lewis Capaldi has 2 tracks
Post Malone has 2 tracks


<b>QUESTION</b>: Who was the most popular artist?
<p>"Most popular" can be interpreted in more than one way. It could be the artist with the #1 track, although that might single out a "one-hit wonder". Another approach is to check for consistency: find the artist with the most songs in the top 50 and, if several artists tie, sum each one's song placements; the smallest sum wins. The analysis below explores both options.</p>

<i>Approach #1</i>: Author of the #1 track.

In [11]:
top_1_artist = df_tracks.loc[df_tracks['placement'] == 1, 'artist']
# Use positional access so the lookup does not depend on the row label being 0
print(f"The author of the #1 track is {top_1_artist.iloc[0]}")

The author of the #1 track is The Weeknd


<i>Approach #2</i>: Artist with most songs in the top 50 & smallest sum of placements

In [12]:
popular_artists = {}
artists_max_tracks = list(multiple_tracks[multiple_tracks == multiple_tracks.max()].index)
for artist in artists_max_tracks:
    # Retrieve songs' placements for artists with most tracks in TOP50.
    # Read the 1-based 'placement' column, not the 0-based row index.
    placements = df_tracks.loc[df_tracks['artist'] == artist, 'placement'].tolist()
    popular_artists[artist] = placements

# Retrieve the most popular artist (sum of placements is the smallest)
most_popular_artist = min(popular_artists, key=lambda artist: sum(popular_artists[artist]))
track_placements = popular_artists[most_popular_artist]
print(f"The most popular artist is {most_popular_artist}")
print(f"The artist has {len(track_placements)} tracks in the top 50")
print(f"Their songs have the following placements: {track_placements}")

The most popular artist is Dua Lipa
The artist has 3 tracks in the top 50
Their songs have the following placements: [5, 32, 49]


<p><b>Dua Lipa</b> is the most popular artist under the second approach, with songs at placements 5, 32, and 49.</p>
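<p>The second approach (most tracks first, smallest placement sum as the tie-breaker) can also be expressed as a single grouped aggregation. This is a sketch under assumptions: `rank_artists` is a hypothetical helper, and the artists A, B, and C with their placements are made up.</p>

```python
import pandas as pd


def rank_artists(df: pd.DataFrame) -> pd.DataFrame:
    """Rank artists by track count (descending), then by summed placement (ascending)."""
    summary = df.groupby("artist")["placement"].agg(tracks="count", placement_sum="sum")
    return summary.sort_values(["tracks", "placement_sum"], ascending=[False, True])


# Made-up placements for hypothetical artists A, B, and C.
sample = pd.DataFrame({
    "artist": ["A", "B", "A", "B", "C"],
    "placement": [1, 2, 10, 3, 4],
})
ranking = rank_artists(sample)
top_artist = ranking.index[0]
print(ranking)
print(f"Most popular artist in the toy data: {top_artist}")
```

<p>Here A and B both have two tracks, and B wins the tie because its placements sum to 5 against 11 for A.</p>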

<hr>
<b>QUESTION</b>: How many artists in total have their songs in the top 50?

In [13]:
total_artists = df_tracks['artist'].nunique()
print(f"{total_artists} artists are in the top 50.")

40 artists are in the top 50.


<p>Overall, the top 50 is very diverse: 40 artists share the 50 tracks. No single artist dominated the chart; the most songs any one artist had was 3. The next sub-section will explore the albums.</p>

<h2 id="albums-overview">Albums' Overview</h2>
<p>Explore the featured albums within the top 50 tracks.</p>

<hr>
<b>QUESTION</b>: Are there any albums that have more than 1 popular track? If yes, which and how many?

In [14]:
# Artist is included to prevent cases where two separate artists may have an album with the same name.
total_albums = df_tracks.groupby(['artist','album'])['track_id'].count()
popular_albums = total_albums[total_albums > 1]
print(f"{len(popular_albums)} albums have more than 1 song in the top 50.")
print(', '.join([index for index in popular_albums.index.get_level_values(1)]))

4 albums have more than 1 song in the top 50.
Future Nostalgia, Fine Line, Changes, Hollywood's Bleeding


<hr>
<b>QUESTION</b>: How many albums in total have their songs in the top 50?

In [15]:
print(f"{len(total_albums)} albums are in the top 50.")

45 albums are in the top 50.


<p>The albums show a similar, expected outcome: the list is especially diverse, and only 4 albums accumulated more than one song in the top 50. Drawing out the elements of a hit song by analyzing an artist or an album might not be the best approach in this case. Proceeding to the analysis of the numerical variables.</p>

<h2 id="danceability">Danceability</h2>
<p>Draw insights based on the danceability score.</p>

<b>QUESTION</b>: Which tracks have a danceability score above 0.7?

In [16]:
high_danceability = df_tracks[df_tracks['danceability'] > 0.7].sort_values(by='danceability', ascending=False)
print("Top 10 tracks by high danceability score (sorted by descending order):")
print('\n'.join(high_danceability['track_name'].head(10)))
# Count every track above the threshold, not only the 10 printed above
print(f"\nA total of {len(high_danceability)} tracks have a high danceability score.")

Top 10 tracks by high danceability score (sorted by descending order):
WAP (feat. Megan Thee Stallion)
The Box
Ride It
Sunday Best
Supalonely (feat. Gus Dapperton)
goosebumps
SICKO MODE
Toosie Slide
Dance Monkey
Godzilla (feat. Juice WRLD)

A total of 32 tracks have a high danceability score.


<hr>
<b>QUESTION</b>: Which tracks have a danceability score below 0.4?

In [17]:
low_danceability = df_tracks[df_tracks['danceability'] < 0.4].sort_values(by='danceability', ascending=True)
print("The following tracks have a low danceability score (sorted by ascending order):")
print('\n'.join([track for track in low_danceability['track_name']]))

The following tracks have a low danceability score (sorted by ascending order):
lovely (with Khalid)


A high danceability score appears to be a common trait of hit songs: 32 of the 50 tracks score above 0.7. Let's proceed to loudness.
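<p>Because 0.7 is a chosen cut-off, it helps to see how the share of "danceable" tracks moves as the cut-off changes. The sketch below uses a hypothetical helper, `share_above`, and five made-up scores standing in for `df_tracks['danceability']`.</p>

```python
import pandas as pd


def share_above(values: pd.Series, threshold: float) -> float:
    """Fraction of values strictly greater than the threshold."""
    return float((values > threshold).mean())


# Five made-up scores standing in for df_tracks['danceability'].
scores = pd.Series([0.51, 0.82, 0.90, 0.65, 0.79])
for threshold in (0.6, 0.7, 0.8):
    print(f"> {threshold}: {share_above(scores, threshold):.0%}")
```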

<h2 id="loudness">Loudness</h2>
<p>Explore the effect of loudness among the top songs. Loudness is measured in decibels (dB); the more negative the value, the quieter the track.</p>

<b>QUESTION: </b>Which tracks have their loudness above -5?

In [18]:
loud_tracks = df_tracks[df_tracks['loudness'] > -5].sort_values(by='loudness', ascending=False)
print("The following tracks have a loudness rating above -5dB (sorted by descending order):")
print('\n'.join([track for track in loud_tracks['track_name']]))
print(f"\nA total of {len(loud_tracks)} tracks are relatively loud.")

The following tracks have a loudness rating above -5dB (sorted by descending order):
Tusa
goosebumps
Break My Heart
Hawái
Circles
Mood (feat. iann dior)
Adore You
SICKO MODE
Physical
Rain On Me (with Ariana Grande)
Safaera
Watermelon Sugar
Ride It
Sunflower - Spider-Man: Into the Spider-Verse
Dynamite
Don't Start Now
Say So
Supalonely (feat. Gus Dapperton)
Before You Go

A total of 19 tracks are relatively loud.


<hr>
<b>QUESTION</b>: Which tracks have their loudness below -8?

In [19]:
quiet_tracks = df_tracks[df_tracks['loudness'] < -8].sort_values(by='loudness', ascending=True)
print("The following tracks have a loudness rating below -8dB (sorted by ascending order):")
print('\n'.join([track for track in quiet_tracks['track_name']]))

The following tracks have a loudness rating below -8dB (sorted by ascending order):
everything i wanted
bad guy
lovely (with Khalid)
If the World Was Ending - feat. Julia Michaels
Toosie Slide
death bed (coffee for your head)
HIGHEST IN THE ROOM
Falling
Savage Love (Laxed - Siren Beat)


<p>It seems that louder does not necessarily mean better, at least in mainstream music. Fewer than half of the tracks (19 of 50) are louder than -5 dB. Let's quickly check the track duration stats.</p>
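<p>The two loudness questions can be answered in one pass by binning the values. This is a sketch that reuses the -8 dB and -5 dB cut-offs; `loudness_bands` is a hypothetical helper, and the sample values are taken from the head and summary tables earlier in the notebook.</p>

```python
import numpy as np
import pandas as pd


def loudness_bands(loudness: pd.Series) -> pd.Series:
    """Count tracks per band: quiet (<= -8 dB), moderate (-8 to -5 dB), loud (> -5 dB)."""
    bands = pd.cut(
        loudness,
        bins=[-np.inf, -8, -5, np.inf],
        labels=["quiet", "moderate", "loud"],
    )
    # Bins are right-inclusive, so a value of exactly -8 counts as quiet
    # and exactly -5 as moderate (the cells above use strict comparisons).
    return bands.value_counts(sort=False)


sample = pd.Series([-5.934, -6.401, -4.521, -14.454, -3.28])
counts = loudness_bands(sample)
print(counts)
```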

<h2 id="length">Track Length</h2>
<p>A quick check of the longest and shortest tracks.</p>

<b>QUESTION</b>: Which track is the longest?

In [20]:
longest_track = df_tracks.loc[df_tracks['duration_ms'].idxmax()]
# Convert milliseconds to decimal minutes
longest_track_duration = round(longest_track['duration_ms'] / 60000, 2)
print(f"The longest track is {longest_track['track_name']} at {longest_track_duration} minutes")

The longest track is SICKO MODE at 5.21 minutes


<hr>
<b>QUESTION</b>: Which track is the shortest?

In [21]:
shortest_track = df_tracks.loc[df_tracks['duration_ms'].idxmin()]
# Convert milliseconds to decimal minutes
shortest_track_duration = round(shortest_track['duration_ms'] / 60000, 2)
print(f"The shortest track is {shortest_track['track_name']} at {shortest_track_duration} minutes")

The shortest track is Mood (feat. iann dior) at 2.34 minutes
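<p>The durations above are in decimal minutes, so 5.21 does not mean 5 minutes 21 seconds. As a small sketch, the hypothetical helper `format_duration` converts milliseconds into the more familiar minutes:seconds form, using the maximum and minimum `duration_ms` values from the summary table.</p>

```python
def format_duration(duration_ms: int) -> str:
    """Convert milliseconds to a 'minutes:seconds' string, rounded to the nearest second."""
    total_seconds = round(duration_ms / 1000)
    minutes, seconds = divmod(total_seconds, 60)
    return f"{minutes}:{seconds:02d}"


# Maximum and minimum duration_ms values from the describe() output.
print(format_duration(312820))  # 5:13
print(format_duration(140526))  # 2:21
```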


<h2 id="genre-analysis">Genre Analysis</h2>
<p>Finish basic data analytics with an overview of genre popularity.</p>

<b>QUESTION</b>: Which genre is the most popular?

In [22]:
genres_count = df_tracks['genre'].value_counts()
max_count = genres_count.max()
most_popular_genres = genres_count[genres_count == max_count]

# Loop through the list in case there is more than one genre with the same tracks' count.
print("The most popular genre(s):")
for genre, count in most_popular_genres.items():
    print(f"{genre} with {count} tracks")

The most popular genre(s):
Pop with 14 tracks


<hr>
<b>QUESTION</b>: Which genres have just one song in the top 50?

In [23]:
one_track_genres = genres_count[genres_count == 1]
print("Genres with only one song in the top 50:")
print('\n'.join([genre for genre, _ in one_track_genres.items()]))

Genres with only one song in the top 50:
R&B/Hip-Hop alternative
Nu-disco
Pop/Soft Rock
Pop rap
Hip-Hop/Trap
Dance-pop/Disco
Disco-pop
Dreampop/Hip-Hop/R&B
Alternative/reggaeton/experimental
Chamber pop


<hr>
<b>QUESTION</b>: How many genres in total are represented in the top 50?

In [24]:
print(f"{len(genres_count)} genres are in the top 50")

16 genres are in the top 50


<b>ADDITIONAL ANALYSIS</b>: How many tracks' genres have 'pop' in them?

In [25]:
pop_tracks = df_tracks.loc[df_tracks['genre'].str.lower().str.contains('pop')]
print(f"{len(pop_tracks)} tracks have the substring 'pop' in them")

22 tracks have the substring 'pop' in them


Unsurprisingly, <b>Pop</b>, whose name is short for <i>"popular"</i>, is the most popular genre in the top 50: 14 tracks are labeled Pop, and 22 tracks in total have 'pop' somewhere in their genre label.
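<p>A substring match treats every label containing 'pop' alike. An alternative is to split compound labels on '/' and count each part separately. This is a sketch with a hypothetical helper, `genre_token_counts`, run on a few labels that appear in the outputs above.</p>

```python
import pandas as pd


def genre_token_counts(genres: pd.Series) -> pd.Series:
    """Split compound labels such as 'Hip-Hop/Rap' on '/' and count each lowercased part."""
    tokens = genres.str.lower().str.split("/").explode().str.strip()
    return tokens.value_counts()


sample = pd.Series(["Pop", "Hip-Hop/Rap", "Pop/Soft Rock", "Dance/Electronic", "Pop"])
counts = genre_token_counts(sample)
print(counts)
```

<p>Unlike the substring check, this counts exact parts only, so a label such as 'Dance-pop' would be tallied under its own name rather than under 'pop'.</p>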

<h1 id="statistical-analysis">Statistical Analysis</h1>
<p>Let's proceed into a more complex and insightful form of analysis using correlation statistics. A hit song is most likely a gestalt - a combination of many elements that make up a greater whole. Let's try to identify these elements.</p>

<b>QUESTION</b>: Which features are strongly positively correlated?

In [26]:
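# Create the correlation matrix
corr_matrix = df_tracks[numeric_cols].corr()
# Blank out the diagonal (self-correlations of 1.0) without writing into .values,
# which can be read-only when pandas copy-on-write is enabled
corr_matrix = corr_matrix.mask(np.eye(len(corr_matrix), dtype=bool))
corr_matrix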

Unnamed: 0,placement,energy,danceability,key,loudness,acousticness,speechiness,instrumentalness,liveness,valence,tempo,duration_ms
placement,,0.030381,-0.176321,-0.052844,0.034935,-0.036557,0.09579,-0.003126,-0.063216,-0.034159,0.081289,0.309563
energy,0.030381,,0.152552,0.062428,0.79164,-0.682479,0.074267,-0.385515,0.069487,0.393453,0.075191,0.081971
danceability,-0.176321,0.152552,,0.285036,0.167147,-0.359135,0.226148,-0.017706,-0.006648,0.479953,0.168956,-0.033763
key,-0.052844,0.062428,0.285036,,-0.009178,-0.113394,-0.094965,0.020802,0.278672,0.120007,0.080475,-0.003345
loudness,0.034935,0.79164,0.167147,-0.009178,,-0.498695,-0.021693,-0.553735,-0.069939,0.406772,0.102097,0.06413
acousticness,-0.036557,-0.682479,-0.359135,-0.113394,-0.498695,,-0.135392,0.352184,-0.128384,-0.243192,-0.241119,-0.010988
speechiness,0.09579,0.074267,0.226148,-0.094965,-0.021693,-0.135392,,0.028948,-0.142957,0.053867,0.215504,0.366976
instrumentalness,-0.003126,-0.385515,-0.017706,0.020802,-0.553735,0.352184,0.028948,,-0.087034,-0.203283,0.018853,0.184709
liveness,-0.063216,0.069487,-0.006648,0.278672,-0.069939,-0.128384,-0.142957,-0.087034,,-0.033366,0.025457,-0.090188
valence,-0.034159,0.393453,0.479953,0.120007,0.406772,-0.243192,0.053867,-0.203283,-0.033366,,0.045089,-0.039794


In [27]:
strong_corr_threshold = 0.6
strong_pos_corr = corr_matrix[corr_matrix >= strong_corr_threshold]
strong_pos_corr = strong_pos_corr.unstack().sort_values(ascending=False).drop_duplicates().dropna()
for (corr_1, corr_2), corr_value in strong_pos_corr.items():
    print(f"{corr_1} is strongly positively correlated to {corr_2} at value {round(corr_value, 2)}")

energy is strongly positively correlated to loudness at value 0.79


<p>Unfortunately, `placement` has no strong correlation with any other variable; its largest coefficient is 0.31 with `duration_ms`, well below the 0.6 threshold. The only strong positive correlation, and perhaps an obvious one, is between energy and loudness. In line with the earlier observations, a louder song does not rank higher within the top 50, but loudness goes hand in hand with energy, which could make a track a good contender for a dancefloor playlist.</p>
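<p>Since `placement` is a rank rather than a measurement, a rank-based (Spearman) correlation is a reasonable cross-check on the Pearson values. The sketch below computes it as a Pearson correlation on ranked data; `rank_correlation` is a hypothetical helper and the five rows are made up.</p>

```python
import pandas as pd


def rank_correlation(df: pd.DataFrame, target: str) -> pd.Series:
    """Spearman correlation of every other column with `target`, computed as Pearson on ranks."""
    ranked = df.rank()  # ties receive their average rank
    return ranked.corr()[target].drop(target).sort_values()


# Made-up rows standing in for df_tracks[numeric_cols].
sample = pd.DataFrame({
    "placement": [1, 2, 3, 4, 5],
    "duration_ms": [150000, 160000, 155000, 200000, 210000],
    "energy": [0.9, 0.7, 0.8, 0.5, 0.6],
})
result = rank_correlation(sample, "placement")
print(result)
```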

<hr>
<b>QUESTION</b>: Which features are strongly negatively correlated?

In [28]:
strong_neg_corr = corr_matrix[corr_matrix <= strong_corr_threshold * -1]
strong_neg_corr = strong_neg_corr.unstack().sort_values(ascending=False).drop_duplicates().dropna()
for (corr_1, corr_2), corr_value in strong_neg_corr.items():
    print(f"{corr_1} is strongly negatively correlated to {corr_2} at value {round(corr_value, 2)}")

energy is strongly negatively correlated to acousticness at value -0.68


<p>An expected result: acoustic music tends to be more relaxing and is meant for attentive listening rather than for the dancefloor or mosh pits.</p>

<hr>
<b>QUESTION</b>: Which features are not correlated?

In [29]:
weak_corr_threshold = 0.03
weak_corr = corr_matrix[(corr_matrix <= weak_corr_threshold) & (corr_matrix >= weak_corr_threshold * -1)]
weak_corr = weak_corr.unstack().sort_values(ascending=False).drop_duplicates().dropna()
for (corr_1, corr_2), corr_value in weak_corr.items():
    print(f"{corr_1} is not correlated to {corr_2} at value {round(corr_value, 2)}")

instrumentalness is not correlated to speechiness at value 0.03
tempo is not correlated to liveness at value 0.03
key is not correlated to instrumentalness at value 0.02
tempo is not correlated to instrumentalness at value 0.02
placement is not correlated to instrumentalness at value -0.0
duration_ms is not correlated to key at value -0.0
danceability is not correlated to liveness at value -0.01
loudness is not correlated to key at value -0.01
duration_ms is not correlated to acousticness at value -0.01
danceability is not correlated to instrumentalness at value -0.02
loudness is not correlated to speechiness at value -0.02


<p>Many feature pairs show no linear correlation. One notable pair is `placement` and `instrumentalness`, with a coefficient of roughly -0.003. This hints that a strong instrumental component might not be necessary to produce a hit song, although the evidence is thin: the median `instrumentalness` is 0 and the 75th percentile is 0.00002, so the feature barely varies across the top 50. Well-known hits like <i>Bohemian Rhapsody</i> or <i>Billie Jean</i> have been covered a cappella by many with great success!</p>
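<p>The cells above remove mirrored pairs with `drop_duplicates()`, which compares correlation values and would also drop a distinct pair that happens to share a value with another. One alternative, sketched here with a hypothetical helper `correlation_pairs` and made-up columns, is to read only the upper triangle of the matrix so that each pair appears exactly once.</p>

```python
import numpy as np
import pandas as pd


def correlation_pairs(df: pd.DataFrame) -> pd.Series:
    """Return each unordered column pair once, indexed by (col_a, col_b), sorted by correlation."""
    corr = df.corr()
    # k=1 selects the strict upper triangle, skipping the diagonal and mirrored entries.
    rows, cols = np.triu_indices(len(corr), k=1)
    index = pd.MultiIndex.from_arrays([corr.index[rows], corr.columns[cols]])
    return pd.Series(corr.to_numpy()[rows, cols], index=index).sort_values(ascending=False)


# Made-up columns: b rises with a, c falls as a rises.
sample = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [2, 4, 6, 8],
    "c": [4, 3, 2, 1],
})
pairs = correlation_pairs(sample)
strong_positive = pairs[pairs >= 0.6]
strong_negative = pairs[pairs <= -0.6]
print(pairs)
```

<p>In the toy data, (a, c) and (b, c) share the same coefficient, yet both pairs are kept.</p>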

<h2 id="genre-comparison">Genres' Comparison</h2>
<p>For a selected list of genres, compare a specific numeric variable.</p>

<b>QUESTION</b>: How does the danceability score compare between Pop, Hip-Hop/Rap, Dance/Electronic, and Alternative/Indie genres?

In [30]:
def compare_genres(selected_genres: List[str], variable: str, df: pd.DataFrame, sort_ascending: bool=False) -> None:
    """Given a list of genres and a variable name, print each genre's mean score in sorted order."""
    genre_comparison = df.groupby('genre')[variable].mean()
    genre_comparison = genre_comparison[selected_genres].sort_values(ascending=sort_ascending)
    print(f"Selected genres by {variable} score:")
    for genre, score in genre_comparison.items():
        print(f"{genre.title()} with a score of {score:.2f}")

In [31]:
selected_genres = ['Pop', 'Hip-Hop/Rap', 'Dance/Electronic', 'Alternative/Indie']
compare_genres(selected_genres, 'danceability', df_tracks)

Selected genres by danceability score:
Hip-Hop/Rap with a score of 0.77
Dance/Electronic with a score of 0.76
Pop with a score of 0.68
Alternative/Indie with a score of 0.66


<b>ANSWER</b>: Hip-Hop/Rap is the most danceable of the four genres, while Alternative/Indie is the least danceable.

<hr>
<b>QUESTION</b>: How does the loudness score compare between Pop, Hip-Hop/Rap, Dance/Electronic, and Alternative/Indie genres?

In [32]:
selected_genres = ['Pop', 'Hip-Hop/Rap', 'Dance/Electronic', 'Alternative/Indie']
compare_genres(selected_genres, 'loudness', df_tracks)

Selected genres by loudness score:
Dance/Electronic with a score of -5.34
Alternative/Indie with a score of -5.42
Pop with a score of -6.46
Hip-Hop/Rap with a score of -6.92


<b>ANSWER</b>: Dance/Electronic is the loudest of the four genres, while Hip-Hop/Rap is the quietest.

<hr>
<b>QUESTION</b>: How does the acousticness score compare between Pop, Hip-Hop/Rap, Dance/Electronic, and Alternative/Indie genres?

In [33]:
selected_genres = ['Pop', 'Hip-Hop/Rap', 'Dance/Electronic', 'Alternative/Indie']
compare_genres(selected_genres, 'acousticness', df_tracks)

Selected genres by acousticness score:
Alternative/Indie with a score of 0.58
Pop with a score of 0.32
Hip-Hop/Rap with a score of 0.19
Dance/Electronic with a score of 0.10


<b>ANSWER</b>: Alternative/Indie is the most acoustic of the four genres, while Dance/Electronic is the least acoustic.
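<p>With only 50 tracks, some genre means rest on very few songs, so it can help to report the track count and spread next to each mean. This is a sketch: `genre_summary` is a hypothetical helper and the scores in the toy frame are made up.</p>

```python
import pandas as pd


def genre_summary(df: pd.DataFrame, variable: str, genres: list) -> pd.DataFrame:
    """Mean, standard deviation, and track count of `variable` for the selected genres."""
    selected = df[df["genre"].isin(genres)]
    summary = selected.groupby("genre")[variable].agg(["mean", "std", "count"])
    return summary.sort_values("mean", ascending=False)


# Made-up scores standing in for df_tracks.
sample = pd.DataFrame({
    "genre": ["Pop", "Pop", "Hip-Hop/Rap", "Hip-Hop/Rap", "Dance/Electronic"],
    "danceability": [0.60, 0.80, 0.90, 0.70, 0.75],
})
summary = genre_summary(sample, "danceability", ["Pop", "Hip-Hop/Rap", "Dance/Electronic"])
print(summary)
```

<p>A genre with a single track shows a count of 1 and no standard deviation, which signals that its mean should be read with caution.</p>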

<h1 id="conclusion">Conclusion</h1>
<p>The compiled list of top tracks of 2020 has given a great overview of what the Spotify userbase likes to stream. The analysis has given some hints on what a hit song could include:</p>
<ol>
    <li>Consider the elements of the <b>Pop</b> genre, as its formula is still popular among listeners;</li>
    <li>Make sure the song is <b>danceable</b>; a high rating is prevalent in the charts;</li>
    <li>For added <b>energy</b>, consider adding more <b>loudness</b>, but don't overdo it;</li>
    <li><b>Instrumentalness</b> is not a must-have for a hit song. Electronic music and vocal-driven Hip-Hop/Rap score high in <b>danceability</b>.</li>
</ol>

<p>Even though the findings are plentiful, there are many ways the study could be improved.</p>

<h1 id="improvements">Improvements for Future Studies</h1>
<p>While the dataset has provided valuable insights, the following additions can support the selection of a hit song's criteria:</p>
<h2>Numeric Variables</h2>
<ul>
    <li>Tempo - BPM of the song (already in the dataset as `tempo`, but not explored in this analysis)</li>
    <li>Lyrical complexity score</li>
</ul>

<h2>Additional Analysis</h2>
<ul>
    <li>Gender influence in specific genre success</li>
    <li>Most successful sub-genres of pop</li>
</ul>

<h2>Dataset Extensions</h2>
<ul>
    <li>Top 100 charts</li>
    <li>Charts by region / continent</li>
</ul>