<h1>Identifying Key Features of a Hit Spotify Track</h1>

<h5>The goal of this analysis is to identify the features that increase the likelihood of a track becoming a hit on Spotify.</h5>
<h5>Such insights could benefit music producers who want to craft their music to match the taste of the average Spotify listener.</h5>

Dataset source:
https://www.kaggle.com/datasets/atillacolak/top-50-spotify-tracks-2020

In [None]:
import pandas as pd
import numpy as np
df = pd.read_csv("spotifytoptracks.csv", index_col=0)
df.head()

Checking whether any of the Series contain null values.

In [None]:
df.isnull().any()

Checking for duplicate rows in the DataFrame.

In [None]:
df[df.duplicated(keep=False)]
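The null and duplicate checks above can be folded into one reusable summary. This is only a minimal sketch: `quality_report` is a hypothetical helper, and the toy DataFrame stands in for `df` so the block runs without the CSV file.

```python
import pandas as pd

def quality_report(frame: pd.DataFrame) -> dict:
    """Summarize missing values and fully duplicated rows in one place."""
    null_counts = frame.isnull().sum()
    return {
        "rows": len(frame),
        # Only report the columns that actually contain nulls.
        "columns_with_nulls": null_counts[null_counts > 0].to_dict(),
        # duplicated() flags every repeat after the first occurrence.
        "duplicate_rows": int(frame.duplicated().sum()),
    }

# Hypothetical toy data, not taken from the Spotify dataset.
toy = pd.DataFrame({
    "artist": ["A", "B", "B", None],
    "track_name": ["x", "y", "y", "z"],
})
report = quality_report(toy)
print(report)
```

Calling `quality_report(df)` on the real table would give the same three numbers in a single cell.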

Number of observations in the dataset:

In [None]:
len(df)

Number of features in the dataset:

In [None]:
len(df.columns)

Alternative overview of the DataFrame:

In [None]:
df.info()

Identifying categorical features:

In [None]:
for col in df.columns:
    print(f"{df[col].nunique()} in {col}")
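One way to turn the unique-value counts into a shortlist of categorical candidates is a simple cardinality cutoff. The sketch below assumes a hypothetical helper `low_cardinality_columns` and an arbitrary `max_unique` threshold, and uses toy data. With only 50 rows in the real table, any cutoff still needs a manual check, as the instrumentalness case further down shows.

```python
import pandas as pd

def low_cardinality_columns(frame, max_unique=20):
    """Return the columns whose number of distinct values is at most max_unique."""
    counts = frame.nunique()
    return counts[counts <= max_unique].index.tolist()

# Hypothetical toy data, not taken from the Spotify dataset.
toy = pd.DataFrame({
    "key": [0, 1, 1, 5, 5, 5],
    "genre": ["Pop", "Pop", "Rap", "Rap", "Pop", "Rock"],
    "tempo": [90.1, 120.5, 100.0, 85.3, 140.2, 99.9],
})
candidates = low_cardinality_columns(toy, max_unique=3)
print(candidates)
```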

Key (a good fit for categorical data) - encodes the musical key the track is in:

In [None]:
df['key'].value_counts()

Overview of genres (good fit for categorical data):

In [None]:
df['genre'].value_counts()

Instrumentalness - the closer the instrumentalness value is to 1.0, the greater the likelihood that the track contains no vocal content.<br />
(Not a good fit for categorical data)

In [None]:
df['instrumentalness'].value_counts().sort_values()

Assigning the 'category' dtype to the 'key' and 'genre' Series:

In [None]:
df[['key','genre']] = df[['key','genre']].astype('category')
df.info()

Identifying the numeric Series:

In [None]:
n = df.select_dtypes(include='number').columns
print(f"There are a total of {len(n)} numeric Series in the DataFrame:")
print(*n, sep="\n")

Artists with multiple songs on the list:

In [None]:
a = df.loc[df['artist'].duplicated(), 'artist'].unique()
print(f"There are a total of {len(a)} artists with multiple tracks on the list:\n")
print(*a, sep="\n")

List of duplicate artists and their song names:

In [None]:
df[df['artist'].duplicated(keep=False)][['artist', 'track_name']].sort_values(by=['artist','track_name'])
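The table of repeat artists can also be condensed into a track count per artist. A minimal sketch with `value_counts`, using hypothetical toy data in place of `df`:

```python
import pandas as pd

# Hypothetical toy data, not taken from the Spotify dataset.
toy = pd.DataFrame({
    "artist": ["A", "B", "A", "C", "B", "A"],
    "track_name": ["t1", "t2", "t3", "t4", "t5", "t6"],
})

# value_counts() sorts by frequency, so the most frequent artist comes first.
tracks_per_artist = toy["artist"].value_counts()
repeat_artists = tracks_per_artist[tracks_per_artist > 1]
print(repeat_artists)
```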

Identifying the #1 artist:

In [None]:
# Assumes the rows are ordered by chart position, so the first row is rank #1.
a = df['artist'].iloc[0]
print(f"The artist at the top of the list is:\n{a}")

Number of unique artists on the list:

In [None]:
c = len(df['artist'].unique())
print(f"Number of unique artists on the list:\n{c}")

Identifying albums that have multiple tracks on the list:

In [None]:
a = df.loc[df['album'].duplicated(), 'album'].unique()
print(f"There are {len(a)} albums with multiple tracks on the list:\n")
print(*a, sep="\n")

Number of unique albums on the list:

In [None]:
a = len(df['album'].unique())
print(f"There are {a} unique albums on the list.\n")

Danceability describes how suitable a track is for dancing based on a combination of musical elements including<br />
tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.

In [None]:
d = df['track_name'].loc[df['danceability'] > 0.7]
print(f"There are {len(d)} tracks with danceability score above 0.7:\n")
print(*d, sep="\n")

In [None]:
d = df['track_name'].loc[df['danceability'] < 0.4]
print(f"There are {len(d)} tracks with danceability score below 0.4:\n")
print(*d, sep="\n")
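The two threshold filters above can be expressed as a single banding step with `pd.cut`. This is a sketch on hypothetical scores, and the band labels are illustrative. Note that `pd.cut` uses right-closed intervals by default, so a score of exactly 0.4 lands in "low" and exactly 0.7 in "medium", which differs slightly from the strict `< 0.4` filter used above.

```python
import pandas as pd

# Hypothetical danceability scores, not taken from the Spotify dataset.
scores = pd.Series([0.35, 0.55, 0.72, 0.90, 0.40, 0.70], name="danceability")

# Bands: [0.0, 0.4], (0.4, 0.7], (0.7, 1.0]
bands = pd.cut(
    scores,
    bins=[0.0, 0.4, 0.7, 1.0],
    labels=["low", "medium", "high"],
    include_lowest=True,
)
band_counts = bands.value_counts().sort_index()
print(band_counts)
```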

Loudness Units relative to Full Scale (LUFS) is a standardized measure of the perceived loudness of audio that takes into account how humans hear sound.<br />
Spotify normalizes tracks to a target level of -14 dB LUFS.<br />
Loudness in the DataFrame is expressed in Integrated LUFS (iLUFS), which shows the original loudness of the track.

In [None]:
l = df['track_name'].loc[df['loudness'] > -5]
print(f"There are {len(l)} tracks with loudness score above -5:\n")
print(*l, sep="\n")

In [None]:
l = df['track_name'].loc[df['loudness'] < -8]
print(f"There are {len(l)} tracks with loudness score below -8:\n")
print(*l, sep="\n")
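To illustrate what normalization to -14 LUFS means for the values above, the sketch below computes the gain offset between a track's original loudness and the target. It assumes normalization is a plain gain change and ignores any limiter or playback setting. `normalization_gain` is a hypothetical helper, and the sample loudness values are made up.

```python
TARGET_LUFS = -14.0  # target level quoted in the text above

def normalization_gain(track_lufs, target=TARGET_LUFS):
    """Gain in dB needed to move a track from its original loudness to the target."""
    return target - track_lufs

# Hypothetical loudness values: a loud master, a quieter one, one already on target.
for loudness in (-4.0, -8.5, -14.0):
    gain = normalization_gain(loudness)
    print(f"{loudness:+.1f} LUFS -> gain of {gain:+.1f} dB")
```

Under this assumption, a track mastered at -4 LUFS is turned down by 10 dB, so the louder masters flagged above do not play back louder once normalization is applied.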

Longest track:

In [None]:
t = df['track_name'].loc[df['duration_ms'].idxmax()]
count = df['duration_ms'].max()
print(f"The track with the longest duration ({count} ms) is: \n{t}")

Shortest track:

In [None]:
t = df['track_name'].loc[df['duration_ms'].idxmin()]
count = df['duration_ms'].min()
print(f"The track with the shortest duration ({count} ms) is: \n{t}")
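Durations in milliseconds are hard to read at a glance. A small formatting helper, sketched below, converts them to minutes and seconds. `format_duration` is a hypothetical name, and the sample values are made up.

```python
def format_duration(duration_ms):
    """Convert a duration in milliseconds to an 'm:ss' string, rounded to the nearest second."""
    total_seconds = int(round(float(duration_ms) / 1000))
    minutes, seconds = divmod(total_seconds, 60)
    return f"{minutes}:{seconds:02d}"

# Hypothetical durations, not taken from the Spotify dataset.
for ms in (200040, 140526):
    print(ms, "ms =", format_duration(ms))
```

The helper could be applied to the results above, for example `format_duration(df['duration_ms'].max())`.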

Most popular genre on the list:

In [None]:
g = df['genre'].value_counts().idxmax()
count = df['genre'].value_counts().max()
print(f'The most popular genre is "{g}" with {count} occurrences on the list.')

Genres that only appear on the list once:

In [None]:
u = df['genre'].value_counts()[df['genre'].value_counts() == 1].index
print(f'There are {len(u)} genres with a single occurrence on the list:\n')
print(*u, sep="\n")

Number of different genres on the list:

In [None]:
u = df['genre'].value_counts().index
print(f'There are {len(u)} different genres in the list:\n')
print(*u, sep="\n")

Pearson correlation for Series with continuous data:

In [None]:
pearson_corr = df.corr(method='pearson',numeric_only=True)
pearson_corr

In [None]:
# Keep coefficients with |r| > 0.5; the strict upper triangle (k=1) drops
# the diagonal and the mirrored duplicates.
mask = np.triu(np.ones_like(pearson_corr, dtype=bool), k=1)
strong_corr = pearson_corr[pearson_corr.abs() > 0.5].where(mask)
print("The Pearson correlation indicates moderate-to-strong relationships in these data pairs:")
strong_corr.dropna(how="all").dropna(axis=1, how="all").fillna("")
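A matrix with blanked-out cells can be hard to scan. As an alternative view, the sketch below lists each strongly correlated pair once in a flat table, strongest first. `strong_pairs` is a hypothetical helper, the 0.5 threshold mirrors the one used above, and the toy data is made up.

```python
import numpy as np
import pandas as pd

def strong_pairs(frame, threshold=0.5):
    """List each feature pair once if its |Pearson r| exceeds threshold."""
    corr = frame.corr(method="pearson", numeric_only=True)
    # Index pairs of the strict upper triangle: no diagonal, no mirrored duplicates.
    rows, cols = np.triu_indices(len(corr), k=1)
    records = [
        (corr.index[i], corr.columns[j], corr.iat[i, j])
        for i, j in zip(rows, cols)
    ]
    pairs = pd.DataFrame(records, columns=["feature_a", "feature_b", "r"])
    strong = pairs[pairs["r"].abs() > threshold]
    # Strongest relationships first, regardless of sign.
    order = strong["r"].abs().sort_values(ascending=False).index
    return strong.loc[order].reset_index(drop=True)

# Hypothetical toy data, not taken from the Spotify dataset.
toy = pd.DataFrame({
    "energy": [1.0, 2.0, 3.0, 4.0, 5.0],
    "loudness": [2.0, 4.0, 6.0, 8.0, 10.5],
    "tempo": [3.0, 1.0, 4.0, 1.0, 5.0],
})
result = strong_pairs(toy)
print(result)
```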

Spearman correlation for the Series with categorical data (the category codes are treated as ranks here, although key and genre have no natural order):

In [None]:
df['genre_encoded'] = pd.factorize(df['genre'])[0]
df['genre_encoded'] = df['genre_encoded'].astype('category')
spearman_corr = df[['key','genre_encoded']].corr(method='spearman',numeric_only=False)
spearman_corr

In [None]:
mask = np.triu(np.ones_like(spearman_corr, dtype=bool), k=1)
print("The Spearman correlation for the categorical data indicates no meaningful associations: ")
spearman_corr.where(mask).dropna(how="all").dropna(axis=1, how="all").fillna("")

Comparing the average danceability, loudness, and acousticness of the top 4 genres:



In [None]:
selected_genres = df[['genre','danceability','loudness','acousticness']].loc[df['genre'].isin(['Pop','Hip-Hop/Rap','Dance/Electronic','Alternative/Indie'])]
selected_genres = selected_genres.groupby('genre', observed=True).mean()

In [None]:
danceability = selected_genres['danceability'].sort_values(ascending=False)
danceability

In [None]:
loudness = selected_genres['loudness'].sort_values(ascending=False)
loudness

In [None]:
acousticness = selected_genres['acousticness'].sort_values(ascending=False)
acousticness
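The per-genre averages can also be computed in one table that includes the number of tracks behind each mean, which helps when judging how much weight an average deserves. A minimal sketch using named aggregation on hypothetical toy data; the output column names are illustrative.

```python
import pandas as pd

# Hypothetical toy data, not taken from the Spotify dataset.
toy = pd.DataFrame({
    "genre": ["Pop", "Pop", "Hip-Hop/Rap", "Hip-Hop/Rap", "Hip-Hop/Rap"],
    "danceability": [0.60, 0.70, 0.80, 0.90, 0.70],
    "loudness": [-6.0, -5.0, -7.0, -6.0, -8.0],
})

summary = (
    toy.groupby("genre")
    .agg(
        n_tracks=("danceability", "size"),          # tracks per genre
        mean_danceability=("danceability", "mean"),
        mean_loudness=("loudness", "mean"),
    )
    .sort_values("mean_danceability", ascending=False)
)
print(summary)
```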

<h3>Summarizing the findings</h3>

The following features may increase the likelihood of a song becoming popular on Spotify:
<ul>
  <li>The track should be in the Pop or Hip-Hop/Rap genre.</li>
  <li>The track should have a high danceability score.</li>
  <li>Loudness should be moderate. However, Spotify normalizes the loudness of all tracks to -14 dB LUFS.</li>
  <li>The track should revolve around lyrics (instrumentalness = 0).</li>
  <li>Pop tracks should have a moderate level of acousticness, whereas for Hip-Hop/Rap tracks it is less important.</li>
</ul>