# Top 50 Spotify tracks analysis

## Introduction

The goal of this analysis is to examine Spotify's top 50 tracks of 2020 and quantify what makes a hit song.

The main questions are which artists are the most popular and which track features are the most and least correlated.

In [1]:
# The libraries were installed in 2024

import numpy as np
import pandas as pd # data processing, CSV file.
import matplotlib.pyplot as plt
import seaborn as sns

# List the input files available in the Kaggle environment

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/top-50-spotify-tracks-2020/spotifytoptracks.csv


In [2]:
#read CSV file and put a name for a dataframe
df_old = pd.read_csv('/kaggle/input/top-50-spotify-tracks-2020/spotifytoptracks.csv')
df_old.head() #observe the data columns

Unnamed: 0.1,Unnamed: 0,artist,album,track_name,track_id,energy,danceability,key,loudness,acousticness,speechiness,instrumentalness,liveness,valence,tempo,duration_ms,genre
0,0,The Weeknd,After Hours,Blinding Lights,0VjIjW4GlUZAMYd2vXMi3b,0.73,0.514,1,-5.934,0.00146,0.0598,9.5e-05,0.0897,0.334,171.005,200040,R&B/Soul
1,1,Tones And I,Dance Monkey,Dance Monkey,1rgnBhdG2JDFTbYkYRZAku,0.593,0.825,6,-6.401,0.688,0.0988,0.000161,0.17,0.54,98.078,209755,Alternative/Indie
2,2,Roddy Ricch,Please Excuse Me For Being Antisocial,The Box,0nbXyq5TXYPCO7pr3N8S4I,0.586,0.896,10,-6.687,0.104,0.0559,0.0,0.79,0.642,116.971,196653,Hip-Hop/Rap
3,3,SAINt JHN,Roses (Imanbek Remix),Roses - Imanbek Remix,2Wo6QQD1KMDWeFkkjLqwx5,0.721,0.785,8,-5.457,0.0149,0.0506,0.00432,0.285,0.894,121.962,176219,Dance/Electronic
4,4,Dua Lipa,Future Nostalgia,Don't Start Now,3PfIrDoz19wz7qK7tYeu62,0.793,0.793,11,-4.521,0.0123,0.083,0.0,0.0951,0.679,123.95,183290,Nu-disco


In [3]:
#start the row indexing from 1
df_old.index = np.arange(1, len(df_old) + 1)

In [4]:
#drop Unnamed column
df = df_old.drop('Unnamed: 0', axis=1)

In [5]:
#drop track id column which is not needed for the analysis
df = df.drop('track_id', axis=1)

### How many observations are there in this dataset?

In [6]:
num_observations = df.shape[0]
num_observations

50

### How many features does this dataset have?

In [7]:
num_features = df.shape[1]
num_features

15

### Which of the features are categorical?

In [8]:
categorical_features = df.select_dtypes(include=['object', 'category']).columns.tolist()

categorical_features

['artist', 'album', 'track_name', 'genre']

### Which of the features are numeric?

In [9]:
numeric_features = df.select_dtypes(include=['number']).columns.tolist()
numeric_features

['energy',
 'danceability',
 'key',
 'loudness',
 'acousticness',
 'speechiness',
 'instrumentalness',
 'liveness',
 'valence',
 'tempo',
 'duration_ms']

In [10]:
#Check the main view of the dataset
df.head()

Unnamed: 0,artist,album,track_name,energy,danceability,key,loudness,acousticness,speechiness,instrumentalness,liveness,valence,tempo,duration_ms,genre
1,The Weeknd,After Hours,Blinding Lights,0.73,0.514,1,-5.934,0.00146,0.0598,9.5e-05,0.0897,0.334,171.005,200040,R&B/Soul
2,Tones And I,Dance Monkey,Dance Monkey,0.593,0.825,6,-6.401,0.688,0.0988,0.000161,0.17,0.54,98.078,209755,Alternative/Indie
3,Roddy Ricch,Please Excuse Me For Being Antisocial,The Box,0.586,0.896,10,-6.687,0.104,0.0559,0.0,0.79,0.642,116.971,196653,Hip-Hop/Rap
4,SAINt JHN,Roses (Imanbek Remix),Roses - Imanbek Remix,0.721,0.785,8,-5.457,0.0149,0.0506,0.00432,0.285,0.894,121.962,176219,Dance/Electronic
5,Dua Lipa,Future Nostalgia,Don't Start Now,0.793,0.793,11,-4.521,0.0123,0.083,0.0,0.0951,0.679,123.95,183290,Nu-disco


### Data cleaning

The following cells check for null or NA/NaN values.
Fortunately, the dataset is clean and no further cleaning is needed.

In [11]:
df.isnull().any()

artist              False
album               False
track_name          False
energy              False
danceability        False
key                 False
loudness            False
acousticness        False
speechiness         False
instrumentalness    False
liveness            False
valence             False
tempo               False
duration_ms         False
genre               False
dtype: bool

With only 50 rows, duplicates are unlikely.
Still, the check below confirms that assumption.

In [12]:
df.duplicated()

1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14    False
15    False
16    False
17    False
18    False
19    False
20    False
21    False
22    False
23    False
24    False
25    False
26    False
27    False
28    False
29    False
30    False
31    False
32    False
33    False
34    False
35    False
36    False
37    False
38    False
39    False
40    False
41    False
42    False
43    False
44    False
45    False
46    False
47    False
48    False
49    False
50    False
dtype: bool
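
For a quick yes/no answer, the row-by-row listing can be collapsed into a single count. The sketch below uses a small made-up frame (`toy`, not the Spotify data) to show the idea; on the real data the equivalent call would be `df.duplicated().sum()`.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the check
toy = pd.DataFrame({
    "artist": ["A", "B", "A"],
    "track_name": ["x", "y", "x"],
})

# duplicated() flags each row that repeats an earlier row;
# summing the boolean Series counts those rows
n_duplicates = int(toy.duplicated().sum())
print(n_duplicates)
```

A count of zero means no row repeats an earlier one.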

In [13]:
#check if there are NA or NaN values
df.isna().any()

artist              False
album               False
track_name          False
energy              False
danceability        False
key                 False
loudness            False
acousticness        False
speechiness         False
instrumentalness    False
liveness            False
valence             False
tempo               False
duration_ms         False
genre               False
dtype: bool

In [14]:
len(df.columns)

15

In [15]:
# Summary statistics for all numeric variables
df.describe()

Unnamed: 0,energy,danceability,key,loudness,acousticness,speechiness,instrumentalness,liveness,valence,tempo,duration_ms
count,50.0,50.0,50.0,50.0,50.0,50.0,50.0,50.0,50.0,50.0,50.0
mean,0.6093,0.71672,5.72,-6.2259,0.256206,0.124158,0.015962,0.196552,0.55571,119.69046,199955.36
std,0.154348,0.124975,3.709007,2.349744,0.26525,0.116836,0.094312,0.17661,0.216386,25.414778,33996.122488
min,0.225,0.351,0.0,-14.454,0.00146,0.029,0.0,0.0574,0.0605,75.801,140526.0
25%,0.494,0.6725,2.0,-7.5525,0.0528,0.048325,0.0,0.09395,0.434,99.55725,175845.5
50%,0.597,0.746,6.5,-5.9915,0.1885,0.07005,0.0,0.111,0.56,116.969,197853.5
75%,0.72975,0.7945,8.75,-4.2855,0.29875,0.1555,2e-05,0.27125,0.72625,132.317,215064.0
max,0.855,0.935,11.0,-3.28,0.934,0.487,0.657,0.792,0.925,180.067,312820.0


### How many albums in total have their songs in the top 50?

In [16]:
unique_albums = df['album'].nunique()
unique_albums

45

### Are there any albums that have more than 1 popular track? If yes, which and how many?

In [17]:
# Group by 'album' and count the number of tracks for each album
albums_tracks_no = df.groupby('album').track_name.count()

# Filter albums with more than one track
filter_tr_alb = albums_tracks_no[albums_tracks_no > 1]

# Sort albums in descending order by number of tracks
sorted_albums = filter_tr_alb.sort_values(ascending=False)

# Display the result
print(sorted_albums)

album
Future Nostalgia        3
Changes                 2
Fine Line               2
Hollywood's Bleeding    2
Name: track_name, dtype: int64


### How many artists in total have their songs in the top 50?

In [18]:
unique_artists = df['artist'].nunique()
unique_artists

40

### Are there any artists that have more than 1 popular track? If yes, which and how many?

In [19]:
# Find artists with more than 1 popular track
artists_with_multiple_tracks = df['artist'].value_counts()
artists_with_multiple_tracks = artists_with_multiple_tracks[artists_with_multiple_tracks > 1]

artists_with_multiple_tracks

artist
Billie Eilish    3
Dua Lipa         3
Travis Scott     3
Justin Bieber    2
Harry Styles     2
Lewis Capaldi    2
Post Malone      2
Name: count, dtype: int64

### Who was the most popular artist?

In [20]:
most_popular_artist = artists_with_multiple_tracks.idxmax()
most_popular_artist_count = artists_with_multiple_tracks.max()

print('The most popular artist is', most_popular_artist, 'with', most_popular_artist_count, 'tracks')

The most popular artist is Billie Eilish with 3 tracks
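
`idxmax()` returns only the first label when several share the maximum, and the counts from cell 19 show three artists tied at 3 tracks each. A minimal sketch for listing all tied leaders, with those counts entered by hand into `counts` (a name introduced here for illustration):

```python
import pandas as pd

# Track counts copied from the output of cell 19
counts = pd.Series(
    {"Billie Eilish": 3, "Dua Lipa": 3, "Travis Scott": 3,
     "Justin Bieber": 2, "Harry Styles": 2, "Lewis Capaldi": 2,
     "Post Malone": 2},
    name="count",
)

# Keep every artist whose count equals the maximum, not just the first
top_artists = counts[counts == counts.max()].index.tolist()
print(top_artists)
```

By this measure Billie Eilish, Dua Lipa, and Travis Scott share first place.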


### Which tracks have a danceability score above 0.7?

In [21]:
dance_filter1 = df['danceability'] > 0.7
dance_tracks = df.loc[dance_filter1, ['track_name', 'danceability']]
dance_tracks

Unnamed: 0,track_name,danceability
2,Dance Monkey,0.825
3,The Box,0.896
4,Roses - Imanbek Remix,0.785
5,Don't Start Now,0.793
6,ROCKSTAR (feat. Roddy Ricch),0.746
8,death bed (coffee for your head),0.726
9,Falling,0.784
11,Tusa,0.803
14,Blueberry Faygo,0.774
15,Intentions (feat. Quavo),0.806


### Which tracks have a danceability score below 0.4?

In [22]:
dance_filter2 = df['danceability'] < 0.4

dance_tracks2 = df.loc[dance_filter2, ['track_name', 'danceability']]
dance_tracks2

Unnamed: 0,track_name,danceability
45,lovely (with Khalid),0.351


### Which tracks have their loudness above -5?

In [23]:
loud_filter1 = df['loudness'] > -5

loud_track1 = df.loc[loud_filter1, ['track_name', 'loudness']]
loud_track1

Unnamed: 0,track_name,loudness
5,Don't Start Now,-4.521
7,Watermelon Sugar,-4.209
11,Tusa,-3.28
13,Circles,-3.497
17,Before You Go,-4.858
18,Say So,-4.577
22,Adore You,-3.675
24,Mood (feat. iann dior),-3.558
32,Break My Heart,-3.434
33,Dynamite,-4.41


### Which tracks have their loudness below -8?

In [24]:
loud_filter2 = df['loudness'] < -8

# Select track names and loudness for tracks quieter than -8 dB
loud_track2 = df.loc[loud_filter2, ['track_name', 'loudness']]
loud_track2


### Which track is the longest?

In [25]:
max_length_index = df['duration_ms'].idxmax()
max_length_track = df.loc[max_length_index, ['track_name', 'duration_ms']]
max_length_track

track_name     SICKO MODE
duration_ms        312820
Name: 50, dtype: object

### Which track is the shortest?

In [26]:
min_length = df['duration_ms'].idxmin()
min_length_track = df.loc[min_length, ['track_name', 'duration_ms']]
min_length_track

track_name     Mood (feat. iann dior)
duration_ms                    140526
Name: 24, dtype: object
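
Durations in milliseconds are hard to read at a glance. A small helper (`ms_to_min_sec`, a hypothetical name not used elsewhere in this notebook) can convert them to minutes and seconds; the two inputs below are the longest and shortest durations reported above.

```python
def ms_to_min_sec(duration_ms: int) -> str:
    """Format a duration in milliseconds as M:SS, dropping fractional seconds."""
    total_seconds = duration_ms // 1000
    minutes, seconds = divmod(total_seconds, 60)
    return f"{minutes}:{seconds:02d}"

longest = ms_to_min_sec(312820)   # SICKO MODE
shortest = ms_to_min_sec(140526)  # Mood (feat. iann dior)
print(longest, shortest)
```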

### Which genre is the most popular?

In [27]:
#find the count of all genres
genre_counts = df['genre'].value_counts()

#find the maximum count of a genre
most_popular_genre = genre_counts.idxmax()
most_popular_genre_count = genre_counts.max()

print(f"The most popular genre is {most_popular_genre} with {most_popular_genre_count} tracks")

The most popular genre is Pop with 14 tracks


### Which genres have just one song on the top 50?

In [28]:
genre_counts = df['genre'].value_counts()

# Filter genres that have just one track
one_track_genres = genre_counts[genre_counts == 1].index

# Retrieve the tracks corresponding to those genres
one_genre_tracks = df[df['genre'].isin(one_track_genres)][['track_name', 'genre']]

one_genre_tracks


Unnamed: 0,track_name,genre
5,Don't Start Now,Nu-disco
9,Falling,R&B/Hip-Hop alternative
13,Circles,Pop/Soft Rock
24,Mood (feat. iann dior),Pop rap
28,WAP (feat. Megan Thee Stallion),Hip-Hop/Trap
32,Break My Heart,Dance-pop/Disco
33,Dynamite,Disco-pop
38,Sunflower - Spider-Man: Into the Spider-Verse,Dreampop/Hip-Hop/R&B
44,Safaera,Alternative/reggaeton/experimental
45,lovely (with Khalid),Chamber pop


### How many genres in total are represented in the top 50?

In [29]:
number_genres = df['genre'].nunique()

number_genres

16

### Which features are strongly positively correlated?

In [30]:
# Specify only the numeric columns for correlation
numeric_columns = ['energy', 'danceability', 'key', 'loudness', 'acousticness', 'speechiness', 'instrumentalness', 'liveness', 'valence', 'tempo', 'duration_ms']

# Calculate the correlation matrix for the specified numeric columns
correlation_matrix_specified = df[numeric_columns].corr()

# Find strongly positively correlated features (correlation > 0.7)
strong_positive_correlations_specified = correlation_matrix_specified[correlation_matrix_specified > 0.7].stack().reset_index()
strong_positive_correlations_specified = strong_positive_correlations_specified[strong_positive_correlations_specified['level_0'] != strong_positive_correlations_specified['level_1']]
strong_positive_correlations_specified.columns = ['Feature 1', 'Feature 2', 'Correlation']

strong_positive_correlations_specified

Unnamed: 0,Feature 1,Feature 2,Correlation
1,energy,loudness,0.79164
4,loudness,energy,0.79164


### Which features are strongly negatively correlated?

In [31]:
# Find the strongest negative correlation
strongest_negative_correlation = correlation_matrix_specified.min().min()

# Identify the features with the strongest negative correlation
strongest_negative_features = correlation_matrix_specified.stack().reset_index()
strongest_negative_features.columns = ['Feature 1', 'Feature 2', 'Correlation']
strongest_negative_features = strongest_negative_features[(strongest_negative_features['Correlation'] == strongest_negative_correlation) & (strongest_negative_features['Feature 1'] != strongest_negative_features['Feature 2'])]

strongest_negative_correlation, strongest_negative_features

(-0.6824785203241528,
        Feature 1     Feature 2  Correlation
 4         energy  acousticness    -0.682479
 44  acousticness        energy    -0.682479)

### Which features are not correlated?

In [32]:
# Set a threshold for non-correlation
threshold = 0.2

# Find pairs of features with correlation coefficients close to zero
non_correlated_features = correlation_matrix_specified[(correlation_matrix_specified > -threshold) & (correlation_matrix_specified < threshold)].stack().reset_index()
non_correlated_features = non_correlated_features[non_correlated_features['level_0'] != non_correlated_features['level_1']]
non_correlated_features.columns = ['Feature 1', 'Feature 2', 'Correlation']

non_correlated_features


Unnamed: 0,Feature 1,Feature 2,Correlation
0,energy,danceability,0.152552
1,energy,key,0.062428
2,energy,speechiness,0.074267
3,energy,liveness,0.069487
4,energy,tempo,0.075191
...,...,...,...
69,duration_ms,acousticness,-0.010988
70,duration_ms,instrumentalness,0.184709
71,duration_ms,liveness,-0.090188
72,duration_ms,valence,-0.039794
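
Because a correlation matrix is symmetric, stacking it lists every pair twice (for example `energy`/`loudness` and `loudness`/`energy`). One way to keep each pair once is to mask everything except the upper triangle before stacking. This is a sketch on made-up data (`toy` and its columns are hypothetical); on the notebook's data the same steps would apply to `correlation_matrix_specified`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the numeric track features
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],   # perfectly correlated with "a"
    "c": [4.0, 3.0, 2.0, 1.0],   # perfectly anti-correlated with "a"
})
corr = toy.corr()

# Boolean mask that is True strictly above the diagonal
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# where() blanks the masked-out cells to NaN; stack() reshapes to long
# form and dropna() removes any blanked cells that remain
pairs = corr.where(upper).stack().dropna().reset_index()
pairs.columns = ["Feature 1", "Feature 2", "Correlation"]
print(pairs)
```

Using `k=1` also excludes the diagonal, so the separate filter that removes self-correlations is no longer needed.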


### How does the danceability score compare between Pop, Hip-Hop/Rap, Dance/Electronic, and Alternative/Indie genres?

In [33]:
#Filter genres for comparison

comp_genres = ['Pop', 'Hip-Hop/Rap', 'Dance/Electronic', 'Alternative/Indie']

# Filter the DataFrame to include only the needed genres
filtered_df = df[df['genre'].isin(comp_genres)]

# Group by the 'genre' column and calculate the mean danceability score
mean_danceability = filtered_df.groupby('genre')['danceability'].mean().reset_index()

#sort values
mean_danceability_sorted = mean_danceability.sort_values(by='danceability', ascending=False).reset_index(drop=True)

print("Mean Danceability Score by Genre:")
print(mean_danceability_sorted)

Mean Danceability Score by Genre:
               genre  danceability
0        Hip-Hop/Rap      0.765538
1   Dance/Electronic      0.755000
2                Pop      0.677571
3  Alternative/Indie      0.661750


### How does the loudness score compare between Pop, Hip-Hop/Rap, Dance/Electronic, and Alternative/Indie genres?

In [34]:
loud_genres = ['Pop', 'Hip-Hop/Rap', 'Dance/Electronic', 'Alternative/Indie']

# Filter the DataFrame to include only the needed genres
filtered_df = df[df['genre'].isin(loud_genres)]

# Group by the 'genre' column and calculate the mean loudness
mean_loudness = filtered_df.groupby('genre')['loudness'].mean().reset_index()

#sort values
mean_loudness_sorted = mean_loudness.sort_values(by='loudness', ascending=False).reset_index(drop=True)

print("Mean Loudness Score by Genre:")
print(mean_loudness_sorted)

Mean Loudness Score by Genre:
               genre  loudness
0   Dance/Electronic -5.338000
1  Alternative/Indie -5.421000
2                Pop -6.460357
3        Hip-Hop/Rap -6.917846


### How does the acousticness score compare between Pop, Hip-Hop/Rap, Dance/Electronic, and Alternative/Indie genres?

In [35]:
acou_genres = ['Pop', 'Hip-Hop/Rap', 'Dance/Electronic', 'Alternative/Indie']

# Filter the DataFrame to include only the needed genres
filtered_df = df[df['genre'].isin(acou_genres)]

# Group by the 'genre' column and calculate the mean acousticness score
mean_acousticness = filtered_df.groupby('genre')['acousticness'].mean().reset_index()

#sort values
mean_acousticness_sorted = mean_acousticness.sort_values(by='acousticness', ascending=False).reset_index(drop=True)

print("Mean Acousticness Score by Genre:")
print(mean_acousticness_sorted)

Mean Acousticness Score by Genre:
               genre  acousticness
0  Alternative/Indie      0.583500
1                Pop      0.323843
2        Hip-Hop/Rap      0.188741
3   Dance/Electronic      0.099440
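
The three genre comparisons repeat the same filter-and-group steps. They could be computed in one pass by selecting several columns before taking the mean. This is a sketch on a small made-up frame (the values in `toy` are invented for illustration, not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy rows; values are invented for illustration
toy = pd.DataFrame({
    "genre": ["Pop", "Pop", "Hip-Hop/Rap", "Hip-Hop/Rap", "Rock"],
    "danceability": [0.6, 0.8, 0.9, 0.7, 0.5],
    "loudness": [-6.0, -4.0, -7.0, -5.0, -3.0],
    "acousticness": [0.2, 0.4, 0.1, 0.3, 0.9],
})

genres = ["Pop", "Hip-Hop/Rap"]
features = ["danceability", "loudness", "acousticness"]

# Filter to the genres of interest, then average all features at once
summary = toy[toy["genre"].isin(genres)].groupby("genre")[features].mean()
print(summary)
```

The result has one row per genre and one column per feature, which makes side-by-side comparison easier than three separate tables.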


## Conclusion

The main conclusions of the top 50 tracks analysis are:

- The top 50 features **40 artists** and **45 albums**.
- Three artists (Billie Eilish, Dua Lipa, and Travis Scott) each have 3 tracks in the top 50.
- **Loudness** and **energy** are the most strongly correlated features (about 0.79): louder tracks tend to have higher energy. Loudness is measured in decibels (dB), and perceived loudness contributes to the energy score, so this relationship is not surprising.
- Acousticness and energy are the most **negatively correlated** features (about -0.68): highly acoustic tracks tend to have low energy, which fits the expectation that acoustic music is usually quieter and less intense.
- **Acousticness** is highest in Alternative/Indie (mean 0.58) among the four genres compared.
- **Loudness** is highest in Dance/Electronic (mean -5.34 dB) and Alternative/Indie (mean -5.42 dB); Pop and Hip-Hop/Rap are quieter on average.
- **Danceability** is highest in Hip-Hop/Rap (mean 0.77) and Dance/Electronic (mean 0.76).

#### For further analysis:
- More visualizations should be added.
- Determine which feature dominates in each genre (e.g., check whether acousticness is the most prominent feature of Alternative/Indie, and similarly for other genres).
- Examine which features, or combinations of features, affect success in the top charts, and whether they have a significant effect at all.