# Nigerian Music scraped from Spotify - an analysis

In [None]:
# install the Seaborn package for data visualization.

%pip install seaborn

In [None]:
import matplotlib.pyplot as plt
import pandas as pd

In [None]:
# Load the song data from nigerian-songs.csv into a dataframe.

df = pd.read_csv("./nigerian-songs.csv")
df.head()

Get information about the dataframe

In [None]:
df.info()

Double-check for null values by calling isnull() and verifying that the sum for every column is 0

In [None]:
df.isnull().sum()
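
As a small sketch of what this check reports, the cell below runs the same call on a made-up dataframe. The `toy` frame and its values are assumptions for illustration only and are not taken from nigerian-songs.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from nigerian-songs.csv
toy = pd.DataFrame({
    "name": ["song a", "song b", "song c"],
    "popularity": [30, np.nan, 12],
})

null_counts = toy.isnull().sum()      # number of nulls in each column
total_nulls = int(null_counts.sum())  # number of nulls in the whole frame

print(null_counts)
print("total nulls:", total_nulls)
```

A total of 0 would mean that no column needs imputation or row removal before the analysis continues.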

Look at the general values of the data. Note that popularity can be 0, and many rows have that value

In [None]:
df.describe()

A popularity of 0 indicates a song that has no ranking. We will remove those rows later, once the genres have been narrowed down
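
One way to quantify how many rows have a popularity of 0 is to count them directly. This is a minimal sketch on a made-up frame; the `toy` values are assumptions, and on the real data you would use `df` in place of `toy`.

```python
import pandas as pd

# Hypothetical toy data, not taken from nigerian-songs.csv
toy = pd.DataFrame({"popularity": [0, 25, 0, 40, 7]})

# A boolean comparison yields True/False per row; summing counts the True values
zero_count = int((toy["popularity"] == 0).sum())
zero_share = zero_count / len(toy)

print(f"{zero_count} of {len(toy)} rows ({zero_share:.0%}) have popularity 0")
```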

Let's examine the genres. Quite a few are listed as 'Missing', which means they aren't categorized with a genre in the dataset

In [None]:
# use a barplot to find out the most popular genres

import seaborn as sns

top = df['artist_top_genre'].value_counts()
plt.figure(figsize=(10,7))
sns.barplot(x=top[:5].index,y=top[:5].values)
plt.xticks(rotation=45)
plt.title('Top genres',color = 'blue')

Remove rows whose genre is 'Missing', as they are not classified in Spotify


In [None]:
df = df[df['artist_top_genre'] != 'Missing']
top = df['artist_top_genre'].value_counts()
plt.figure(figsize=(10,7))
sns.barplot(x=top.index,y=top.values)
plt.xticks(rotation=45)
plt.title('Top genres',color = 'blue')

By far, the top three genres dominate this dataset. Let's concentrate on afro dancehall, afropop, and nigerian pop. Additionally, filter the dataset to remove anything with a 0 popularity value, meaning it was not classified with a popularity in the dataset and can be considered noise for our purposes

In [None]:
df = df[(df['artist_top_genre'] == 'afro dancehall') | (df['artist_top_genre'] == 'afropop') | (df['artist_top_genre'] == 'nigerian pop')]
df = df[(df['popularity'] > 0)]
top = df['artist_top_genre'].value_counts()
plt.figure(figsize=(10,7))
sns.barplot(x=top.index,y=top.values)
plt.xticks(rotation=45)
plt.title('Top genres',color = 'blue')
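
The same genre and popularity filter can also be written with `isin`, which avoids chaining several `==` comparisons. This is a sketch on a made-up frame: the `toy` data and the `top_genres` list name are assumptions for illustration, while the column names follow the dataset.

```python
import pandas as pd

# Hypothetical toy data, not taken from nigerian-songs.csv
toy = pd.DataFrame({
    "artist_top_genre": ["afropop", "other", "nigerian pop", "afro dancehall", "afropop"],
    "popularity": [30, 12, 0, 25, 0],
})

top_genres = ["afro dancehall", "afropop", "nigerian pop"]

# Keep rows that are in one of the three genres AND have a popularity above 0
mask = toy["artist_top_genre"].isin(top_genres) & (toy["popularity"] > 0)
filtered = toy[mask]

print(filtered)
```

Combining both conditions in one mask filters the frame in a single step, and the list of genres is easy to extend.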

The data is not strongly correlated except between energy and loudness, which makes sense. Popularity corresponds to release date, which also makes sense, as more recent songs are probably more popular. Length and energy seem to be correlated as well; perhaps shorter songs are more energetic?

In [None]:
# do a quick test to see if the data correlates in any particularly strong way

df_modified = df.iloc[:, 4:]
corrmat = df_modified.corr()
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corrmat, vmax=.8, square=True);

# The only strong correlation is between energy and loudness, which is not too surprising, given that loud music is usually pretty energetic. Otherwise, the correlations are relatively weak.

Note that correlation does not imply causation! We have proof of correlation but no proof of causation.
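
A heatmap is read by eye, so it can help to list the strongest pairs numerically as well. The sketch below does this on synthetic data in which `loudness` is deliberately built to track `energy`. The `toy` frame and its random values are assumptions for illustration; on the real data you would start from `corrmat`.

```python
import numpy as np
import pandas as pd

# Synthetic data: loudness is constructed from energy, length is independent
rng = np.random.default_rng(0)
n = 200
energy = rng.normal(size=n)
toy = pd.DataFrame({
    "energy": energy,
    "loudness": energy + 0.3 * rng.normal(size=n),
    "length": rng.normal(size=n),
})

toy_corrmat = toy.corr()

# Keep only the upper triangle (k=1 drops the diagonal) so each pair appears once
upper_mask = np.triu(np.ones(toy_corrmat.shape, dtype=bool), k=1)
upper = toy_corrmat.where(upper_mask)

# Flatten to (column, column) -> correlation and sort by absolute strength
pairs = upper.stack().dropna().sort_values(key=abs, ascending=False)

print(pairs)
```

Sorting by absolute value keeps strong negative correlations near the top, where a plain descending sort would push them to the bottom.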

Do the genres differ significantly in their danceability, relative to their popularity? Examine the distribution of popularity and danceability for our top three genres along the x and y axes

In [None]:
# examine our top three genres data distribution for popularity and danceability along a given x and y axis

import seaborn as sns

sns.set_theme(style="ticks")

# Show the joint distribution using kernel density estimation
g = sns.jointplot(
    data = df,
    x="popularity", y="danceability", hue="artist_top_genre",
    kind="kde",
)

The plot shows concentric contours around a general point of convergence, indicating where the points are most densely distributed.

Note that this example uses a KDE (Kernel Density Estimate) graph that represents the data using a continuous probability density curve. This allows us to interpret data when working with multiple distributions.
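
To make the idea of a kernel density estimate concrete, here is a minimal one-dimensional sketch written with NumPy only. It assumes a Gaussian kernel and a hand-picked bandwidth; the `gaussian_kde_1d` helper and the sample values are hypothetical, and the joint plot above is two-dimensional with a bandwidth chosen by seaborn, so this illustrates the principle and does not reproduce that plot.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Evaluate a Gaussian KDE on grid: the average of one bump per sample."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Standardized distance from every grid point to every sample
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    # Average over samples, then rescale so the curve integrates to 1
    return kernels.mean(axis=1) / bandwidth

# Hypothetical danceability values, not taken from nigerian-songs.csv
samples = np.array([0.55, 0.60, 0.62, 0.70, 0.85])
grid = np.linspace(0.0, 1.5, 301)
density = gaussian_kde_1d(samples, grid, bandwidth=0.05)

# A probability density should have total area close to 1
area = float(np.sum(density) * (grid[1] - grid[0]))
print("area under the curve:", round(area, 3))
```

A smaller bandwidth gives a spikier curve that follows individual points, while a larger one smooths them into a single broad hump.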

In general, the three genres align in terms of their popularity and danceability. Try a scatterplot to check the distribution of data per genre

In [None]:
sns.FacetGrid(df, hue="artist_top_genre") \
   .map(plt.scatter, "popularity", "danceability") \
   .add_legend()

A scatterplot of the same axes shows a similar pattern of convergence.

In general, scatterplots are a good way to show clusters of data, so mastering this type of visualization is very useful for clustering.