# Full Mini EDA + Knowledge Engineering Assignment – Spotify Dataset
## Step 1 – Basic Understanding
- Load the dataset.
- Print shape, info(), head().
- Check missing values (which columns, how much).
- Number of unique artists, albums, and genres.
- Are there duplicate tracks?

## Step 2 – Data Cleaning
- Remove duplicate rows.
- Ensure all audio features have numeric dtypes (danceability, energy, loudness, tempo, valence, etc.).

## Step 3 – Descriptive Statistics
- Mean, median, std for tempo, energy, danceability, loudness, valence.
- Song with highest danceability.
- Song with highest energy.
- Artist with most tracks.
- Most common genre.

## Step 4 – Audio Feature Distributions
- Histograms for tempo, danceability, energy, loudness, valence.
- Which features look normally distributed? Which are skewed?
- Do most songs have high/low energy?
- Any extreme outliers in track duration?

## Step 5 – Bivariate Analysis
- Scatter: tempo vs danceability.
- Compare energy vs loudness (are they correlated?).
- Boxplot: danceability across genres.
- Do popular tracks (popularity > 80) have higher energy/danceability?
- Genre-specific tempo differences.

## Step 6 – Correlation & Relationships
- Correlation matrix of numeric features.
- Heatmap visualization.

## Step 7 – Visualization Tasks
- Barplot: Top 10 artists with most tracks.
- Pie chart: genre share of songs.
- Radar plot: average audio features by genre.

## Step 8 – Knowledge Engineering (Feature Insights)
- Which features drive popularity?
- Relationship between valence (happiness) and mode (major/minor).
- Build a Mood Score = (0.5 × valence + 0.3 × danceability + 0.2 × energy).
- Which year had the happiest mood score?
- Which genre scores highest on mood?

## Step 9 – Insights & Reporting
- 3–5 key insights about how Spotify songs have evolved (tempo, duration, features).
- Which genres dominate overall? Which are niche?
- Do artists stick to one style or vary widely across features?
- If you were Spotify, which features would you use to design personalized playlists?

In [None]:
print("Importing Modules")
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

## Step 1 – Basic Understanding

In [None]:
print("Load the dataset")
spotify_df=pd.read_csv("Spotify_dataset.csv")
spotify_df.sample(2)

In [None]:
print(f"Print shape => {spotify_df.shape}\nhead()\n{spotify_df.head()}")
print("\ninfo()\n")
spotify_df.info()

In [None]:
print("Check missing values (which columns, how much).")
spotify_df.isnull().sum()
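
The raw counts above answer "which columns", but "how much" is easier to read as a percentage. The following is a minimal sketch, not part of the original notebook: `missing_summary` is a hypothetical helper, and the toy frame only stands in for `spotify_df` with made-up values.

```python
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values, for columns that have any."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing_count": counts,
        "missing_pct": (counts / len(df) * 100).round(2),
    })
    # Keep only columns with at least one missing value, worst first
    return summary[summary["missing_count"] > 0].sort_values(
        "missing_count", ascending=False
    )

# Toy frame standing in for spotify_df (values are invented for illustration)
toy = pd.DataFrame({
    "track_name": ["a", "b", None, "d"],
    "genre": ["pop", None, None, "rock"],
    "tempo": [120.0, 98.5, 101.2, 140.0],
})
report = missing_summary(toy)
print(report)
```

On the real data the call would presumably be `missing_summary(spotify_df)`.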

In [None]:
print("Number of unique artists, albums, and genres")
n_artists=spotify_df['artist_name'].nunique()
# No album column is referenced in this notebook, so unique track names are counted in its place
n_tracks=spotify_df['track_name'].nunique()
n_genres=spotify_df['genre'].nunique()
print(f"Unique artists: {n_artists}\nUnique track names: {n_tracks}\nUnique genres: {n_genres}")

In [None]:
# True if at least one track name appears more than once
has_duplicates=spotify_df['track_name'].duplicated().any()
print(f"Are there duplicate tracks? {has_duplicates}")
spotify_df[spotify_df['track_name'].duplicated()]['track_name'].sort_values(ascending=False)

## Step 2 – Data Cleaning

In [None]:
# Check duplicates by track ID + track name + artist
dup_subset=['track_id','track_name','artist_name']
print(f"Duplicates before dropping: {spotify_df.duplicated(subset=dup_subset).sum()}")
# Drop duplicates
spotify_df.drop_duplicates(subset=dup_subset,inplace=True,ignore_index=True)
# Check duplicates after dropping
print(f"Duplicates after dropping: {spotify_df.duplicated(subset=dup_subset).sum()}")
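
Step 2 also asks that the audio features have numeric dtypes, which the cell above does not cover. One possible approach is sketched below using `pd.to_numeric`. The helper name `coerce_numeric` and the toy values are assumptions for illustration; whether the real columns need any conversion depends on how the CSV was parsed.

```python
import pandas as pd

audio_features = ["danceability", "energy", "loudness", "tempo", "valence"]

def coerce_numeric(df: pd.DataFrame, columns) -> pd.DataFrame:
    """Return a copy of df with the given columns converted to numeric dtypes."""
    out = df.copy()
    for col in columns:
        # errors="coerce" turns unparseable entries into NaN instead of raising
        out[col] = pd.to_numeric(out[col], errors="coerce")
    return out

# Toy frame with some features stored as strings (invented values)
toy = pd.DataFrame({
    "danceability": ["0.71", "0.55", "n/a"],
    "energy": [0.8, 0.4, 0.6],
    "loudness": ["-5.2", "-7.9", "-11.0"],
    "tempo": ["120", "98.5", "140"],
    "valence": [0.9, 0.3, 0.5],
})
cleaned = coerce_numeric(toy, audio_features)
print(cleaned.dtypes)
print(cleaned.isnull().sum())  # shows how many values failed to parse
```

Checking the NaN counts after coercion is worthwhile, since a bad entry silently becomes a missing value.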

## Step 3 – Descriptive Statistics

In [None]:
# Mean, median, std for tempo, energy, danceability, loudness, valence.
print(f"Tempo Mean, Median, Std\nMean = {spotify_df['tempo'].mean()}\nMedian = {spotify_df['tempo'].median()}\nStd = {spotify_df['tempo'].std()}")
print(f"\nEnergy Mean, Median, Std\nMean = {spotify_df['energy'].mean()}\nMedian = {spotify_df['energy'].median()}\nStd = {spotify_df['energy'].std()}")
print(f"\nDanceability Mean, Median, Std\nMean = {spotify_df['danceability'].mean()}\nMedian = {spotify_df['danceability'].median()}\nStd = {spotify_df['danceability'].std()}")
print(f"\nLoudness Mean, Median, Std\nMean = {spotify_df['loudness'].mean()}\nMedian = {spotify_df['loudness'].median()}\nStd = {spotify_df['loudness'].std()}")
print(f"\nValence Mean, Median, Std\nMean = {spotify_df['valence'].mean()}\nMedian = {spotify_df['valence'].median()}\nStd = {spotify_df['valence'].std()}")
print("The code above is repetitive; the next cell does the same with a loop.")

In [None]:
column_list=['tempo', 'energy', 'danceability', 'loudness', 'valence']
for col in column_list:
    print(f"\n{col} Mean, Median, Std\nMean = {spotify_df[col].mean():.2f}\nMedian = {spotify_df[col].median():.2f}\nStd = {spotify_df[col].std():.2f}")
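
The same statistics can also be collected into a single table with `DataFrame.agg`, which is easier to compare across features than printed lines. This is a small sketch on a toy frame with invented values, not the notebook's data.

```python
import pandas as pd

# Toy frame standing in for spotify_df (invented values)
toy = pd.DataFrame({
    "tempo": [100.0, 120.0, 140.0],
    "energy": [0.2, 0.5, 0.8],
})

# One row per statistic, one column per feature
summary = toy[["tempo", "energy"]].agg(["mean", "median", "std"]).round(2)
print(summary)
```

With the real data, the column selection would presumably be the five features in `column_list`.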

In [None]:
#  Song with highest danceability.
res=spotify_df.loc[spotify_df['danceability'].idxmax(),['track_name','danceability']]
print(f"Song - {res.track_name} with highest danceability - {res.danceability}")

In [None]:
# Song with highest energy.
res=spotify_df.loc[spotify_df['energy'].idxmax(),['track_name','energy']]
print(f"Song - {res.track_name} with highest energy - {res.energy}")

In [None]:
# Artist with most tracks
print("Top 5 artists with most tracks")
spotify_df.groupby("artist_name")["track_name"].count().sort_values(ascending=False).head(5)

In [None]:
# Most common genre
print("Top 5 most common genres")
# value_counts() already sorts in descending order
spotify_df['genre'].value_counts().head(5)

## Step 4 – Audio Feature Distributions

In [None]:
# Histograms for tempo, danceability, energy, loudness, valence.
column_list=['tempo', 'energy', 'danceability', 'loudness', 'valence']
colors=['red', 'blue', 'green', 'aqua', 'pink']
fig,ax=plt.subplots(2,3,figsize=(15,8))
ax=ax.flatten()
for i,(feature,color) in enumerate(zip(column_list,colors)):
    ax[i].hist(spotify_df[feature],bins=30,edgecolor="black",label=feature,color=color)
    ax[i].legend()
    ax[i].set_title(f"{feature.capitalize()} Distribution")
    ax[i].set_xlabel(feature.capitalize())
ax[-1].axis('off')  # the 2x3 grid has six slots but only five features
fig.supylabel("Count")
fig.suptitle("Distributions of Audio Features")
plt.tight_layout()
plt.show()


In [None]:
# Which features look normally distributed? Which are skewed?
features = ['tempo','energy','danceability','loudness','valence','duration_ms']

fig, axes = plt.subplots(2, 3, figsize=(15,8))
axes = axes.flatten()

for i, col in enumerate(features):
    axes[i].hist(spotify_df[col], bins=30, edgecolor="black", color="skyblue")
    axes[i].set_title(f"{col} Distribution")

plt.tight_layout()
plt.show()
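
Histograms give a visual answer; sample skewness gives a number to go with it. The sketch below uses `DataFrame.skew` and labels each feature with a rule-of-thumb cutoff of 0.5, which is an assumption chosen here and not something the assignment specifies. `skew_report` is a hypothetical helper and the toy values are invented.

```python
import pandas as pd

def skew_report(df: pd.DataFrame, columns, threshold: float = 0.5) -> pd.DataFrame:
    """Sample skewness per column plus a coarse shape label."""
    skew = df[columns].skew()

    def label(s: float) -> str:
        if abs(s) < threshold:
            return "roughly symmetric"
        return "right-skewed" if s > 0 else "left-skewed"

    return pd.DataFrame({"skewness": skew.round(2), "shape": skew.apply(label)})

# Toy frame standing in for spotify_df (invented values)
toy = pd.DataFrame({
    "tempo": [100, 110, 120, 130, 140],                          # evenly spread
    "duration_ms": [180000, 180000, 180000, 200000, 600000],     # long right tail
    "loudness": [-5, -5, -5, -6, -30],                           # long left tail
})
report = skew_report(toy, ["tempo", "duration_ms", "loudness"])
print(report)
```

A skewness near zero does not by itself establish normality, so the labels are best read alongside the histograms.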


In [None]:
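# Do most songs have high/low energy?
spotify_df['energy_type']=pd.cut(spotify_df['energy'],bins=[0,0.33,0.66,1],labels=['low','medium','high'],include_lowest=True)
# Percentage of songs per category, kept in low -> medium -> high order so the bar colors match
res=(spotify_df['energy_type'].value_counts(normalize=True)*100).reindex(['low','medium','high'])
# Report the largest category from the data instead of hard-coding the answer
print(f"{res.round(2)}\n\nMost songs have {res.idxmax()} energy")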

plt.figure(figsize=(6,5))
res.plot(kind='bar', color=['skyblue','orange','red'], edgecolor='black')
plt.title("Distribution of Song Energy Levels")
plt.ylabel("Percentage (%) of Songs")
plt.xlabel("Energy Category")
plt.show()

In [None]:
# Any extreme outliers in track duration?
spotify_df['duration_min']=round(spotify_df['duration_ms']/60000,2)
Q1 = spotify_df['duration_min'].quantile(0.25)
Q3 = spotify_df['duration_min'].quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5*IQR
upper_bound = Q3 + 1.5*IQR

outliers = spotify_df[(spotify_df['duration_min'] < lower_bound) | (spotify_df['duration_min'] > upper_bound)]
print(f"Number of duration outliers (1.5×IQR rule): {len(outliers)}")
print(outliers[['track_name','artist_name','duration_min']].head())


## Step 5 – Bivariate Analysis

In [None]:
# Scatter: tempo vs danceability.
plt.scatter(spotify_df['tempo'],spotify_df['danceability'],alpha=0.6,color='green',edgecolor='black')
plt.xlabel("Tempo (BPM)")
plt.ylabel("Danceability")
plt.title("Tempo vs Danceability")
plt.show()

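# Correlation check
corr_val=spotify_df['tempo'].corr(spotify_df['danceability'])
print(f"Correlation between tempo and danceability is {corr_val:.2f}")
# |r| >= 0.7 is used here as a rule-of-thumb cutoff for "highly correlated"
print(f"Tempo and danceability are {'highly' if abs(corr_val) >= 0.7 else 'not highly'} correlated")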

In [None]:
# Compare energy vs loudness (are they correlated?).
plt.scatter(spotify_df['energy'],spotify_df['loudness'],alpha=0.6,color='yellow',edgecolor='black')
plt.xlabel("Energy")
plt.ylabel("Loudness (dB)")
plt.title("Energy vs Loudness")
plt.show()

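corr_val=spotify_df['energy'].corr(spotify_df['loudness'])
print(f"Correlation between energy and loudness is {corr_val:.2f}")
# |r| >= 0.7 is used here as a rule-of-thumb cutoff for "highly correlated"
print(f"Energy and loudness are {'highly' if abs(corr_val) >= 0.7 else 'not highly'} correlated")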

In [None]:
# Boxplot: danceability across genres.
# DataFrame.boxplot creates its own figure, so pass figsize here instead of calling plt.figure first
spotify_df.boxplot(column='danceability',by='genre',rot=90,figsize=(20,8))
plt.title("Danceability across Genres")
plt.suptitle("")  # removes automatic 'Boxplot grouped by genre' title
plt.xlabel("Genre")
plt.ylabel("Danceability")
plt.show()

In [None]:
# Do popular tracks (popularity > 80) have higher energy/danceability?
spotify_df['popularity_tag']=np.where(spotify_df['popularity'] > 80,'More Popular','Less Popular')
result=spotify_df.groupby('popularity_tag')[['energy','danceability']].mean().round(2)
# Print the group means and let the numbers answer the question
print("Mean energy/danceability by popularity group (popular = popularity > 80)\n",result)

fig,ax=plt.subplots(1,2,figsize=(10,5))
ax[0].bar(result.index,result['energy'],color=['yellow','aqua'], alpha=0.5,edgecolor='green')
ax[0].set_title("Popularity vs Energy")

ax[1].bar(result.index,result['danceability'],color=['yellow','aqua'], alpha=0.5,edgecolor='green')
ax[1].set_title("Popularity vs Danceability")

fig.supxlabel("Popularity")
fig.supylabel("Mean")
plt.tight_layout()
plt.show()

In [None]:
# Genre-specific tempo differences.
res=spotify_df.groupby('genre')['tempo'].mean().sort_values()

plt.figure(figsize=(12,6))
plt.bar(res.index,res.values,color="yellow",edgecolor='green')
plt.xlabel("Genre")
plt.xticks(rotation=90)
plt.ylabel("Tempo Mean")
plt.title('Genre-specific tempo differences')
plt.show()

## Step 6 – Correlation & Relationships






In [None]:
# Correlation matrix of numeric features.
corr_matrix=round(spotify_df.corr(numeric_only=True),2)
fig, ax = plt.subplots(figsize=(10, 6))
cax = ax.matshow(corr_matrix, cmap="coolwarm")
fig.colorbar(cax)

# Set ticks
ax.set_xticks(range(len(corr_matrix.columns)))
ax.set_yticks(range(len(corr_matrix.columns)))
ax.set_xticklabels(corr_matrix.columns, rotation=90)
ax.set_yticklabels(corr_matrix.columns)

plt.title("Correlation Matrix of Numeric Features", pad=20)
plt.show()

In [None]:
# Heatmap visualization.
fig, ax = plt.subplots(figsize=(10, 6))
im = ax.imshow(corr_matrix, cmap="coolwarm")

# Add colorbar
cbar = plt.colorbar(im)
cbar.set_label("Correlation")

# Annotate with values
for i in range(len(corr_matrix)):
    for j in range(len(corr_matrix)):
        ax.text(j, i, f"{corr_matrix.iloc[i, j]:.2f}",
                ha="center", va="center", color="black", fontsize=8)

ax.set_xticks(range(len(corr_matrix.columns)))
ax.set_yticks(range(len(corr_matrix.columns)))
ax.set_xticklabels(corr_matrix.columns, rotation=90)
ax.set_yticklabels(corr_matrix.columns)

plt.title("Heatmap of Correlations")
plt.tight_layout()
plt.show()
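
A heatmap with many features can be hard to scan, so it may help to list the strongest pairs explicitly. The sketch below walks each unordered pair of numeric columns once and ranks them by absolute correlation. `top_correlated_pairs` is a hypothetical helper and the toy values are invented.

```python
from itertools import combinations

import pandas as pd

def top_correlated_pairs(df: pd.DataFrame, n: int = 3) -> pd.Series:
    """Return the n feature pairs with the largest absolute correlation."""
    corr = df.corr(numeric_only=True)
    # combinations() yields each unordered pair once and skips the diagonal
    pairs = pd.Series({(a, b): corr.loc[a, b] for a, b in combinations(corr.columns, 2)})
    order = pairs.abs().sort_values(ascending=False).index
    return pairs.reindex(order).head(n)

# Toy frame standing in for spotify_df (invented values)
toy = pd.DataFrame({
    "energy": [0.1, 0.3, 0.5, 0.7, 0.9],
    "loudness": [-20, -15, -10, -6, -3],
    "acousticness": [0.9, 0.7, 0.5, 0.3, 0.1],
    "tempo": [120, 90, 130, 100, 125],
})
top = top_correlated_pairs(toy, n=3)
print(top.round(2))
```

The sign is kept in the output, so strongly negative pairs are reported alongside strongly positive ones.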


## Step 7 – Visualization Tasks

In [None]:
# Barplot: Top 10 artists with most tracks.
artist_list=spotify_df.drop_duplicates().groupby('artist_name')['track_name'].count().sort_values().tail(10)
plt.bar(artist_list.index,artist_list.values,color='green',edgecolor='brown',alpha=0.6)
plt.xticks(rotation=90)
plt.xlabel("Artist Name")
plt.ylabel("No of tracks")
plt.title('Top 10 artists with most tracks')
plt.show()

In [None]:
# Pie chart: genre share of songs.
genre_list=spotify_df.drop_duplicates().groupby('genre')['track_name'].count().sort_values().tail(10)
plt.pie(genre_list,labels=genre_list.index,autopct='%1.1f%%',startangle=140,counterclock=False,wedgeprops={'edgecolor':'black'})
plt.title("Genre Share of Songs (Top 10)", fontsize=14)
plt.show()


In [None]:
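# Radar plot: average audio features by genre
features=["danceability", "energy", "valence", "tempo", "acousticness"]
radar_df=spotify_df.drop_duplicates()[['genre']+features].copy()
# Tempo is in BPM while the other features lie in [0, 1]; min-max scale it so it does not dominate the plot
radar_df['tempo']=(radar_df['tempo']-radar_df['tempo'].min())/(radar_df['tempo'].max()-radar_df['tempo'].min())
# Mean feature values for the first three genres in alphabetical order
genre_list=radar_df.groupby('genre')[features].mean().sort_index().head(3).round(2)
angles = np.linspace(0, 2*np.pi, len(features), endpoint=False).tolist()
angles+=angles[:1]  # repeat the first angle to close the polygon
fig, ax = plt.subplots(figsize=(6,6), subplot_kw=dict(polar=True))
for genre in genre_list.index:
    value=genre_list.loc[genre].tolist()
    value+=value[:1]
    ax.plot(angles, value, label=genre, linewidth=2)
    ax.fill(angles, value, alpha=0.25)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(["tempo (scaled)" if f=="tempo" else f for f in features])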

plt.title("Average Audio Features by Genre", fontsize=14, pad=20)
plt.legend(loc='upper right', bbox_to_anchor=(1.2, 1.1))
plt.show()

## Step 8 – Knowledge Engineering (Feature Insights)

In [None]:
# Which features drive popularity? Start with average popularity by artist and by genre.
artist_popularity=round(spotify_df.drop_duplicates().groupby('artist_name')['popularity'].mean().sort_values(ascending=False).head(10))
plt.bar(artist_popularity.index,artist_popularity.values)
plt.xticks(rotation=60,ha='right')
plt.xlabel("Artist Name")
plt.ylabel("Average Popularity")
plt.title("Top 10 Artists by Average Popularity")
plt.show()

genre_popularity=round(spotify_df.drop_duplicates().groupby('genre')['popularity'].mean().sort_values(ascending=False).head(10))
plt.bar(genre_popularity.index,genre_popularity.values)
plt.xticks(rotation=60,ha='right')
plt.xlabel("Genre Name")
plt.ylabel("Average Popularity")
plt.title("Top 10 Genres by Average Popularity")
plt.show()

artist_genre_popularity=round(spotify_df.drop_duplicates().groupby(['artist_name','genre'])['popularity'].mean().sort_values(ascending=False).head(10))
print(artist_genre_popularity,"\n")
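
The plots above rank artists and genres, but they do not show which audio features move with popularity. One simple way to look at that is to correlate each feature with `popularity` and rank by absolute value, as sketched below. `popularity_drivers` is a hypothetical helper, the toy values are invented, and a correlation only indicates association, not that a feature causes popularity.

```python
import pandas as pd

def popularity_drivers(df: pd.DataFrame, features, target: str = "popularity") -> pd.Series:
    """Correlation of each feature with the target, strongest association first."""
    corr = df[features].corrwith(df[target])
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order).round(2)

# Toy frame standing in for spotify_df (invented values)
toy = pd.DataFrame({
    "popularity": [10, 30, 50, 70, 90],
    "danceability": [0.2, 0.4, 0.5, 0.7, 0.9],
    "acousticness": [0.9, 0.8, 0.4, 0.3, 0.1],
    "tempo": [120, 90, 130, 100, 125],
})
drivers = popularity_drivers(toy, ["danceability", "acousticness", "tempo"])
print(drivers)
```

On the real data, the feature list would presumably include the audio columns used elsewhere in this notebook.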


In [None]:
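# Relationship between valence (happiness) and mode (major/minor).
# Mode is encoded as 1 = major, 0 = minor; groupby sorts keys ascending, so select by label, not position
mode_valence=spotify_df.drop_duplicates().groupby('mode')['valence'].mean().round(2)
print(f"Major Valence (happiness) = {mode_valence.loc[1]}\nMinor Valence (happiness) = {mode_valence.loc[0]}")

spotify_df.boxplot(column="valence", by="mode", grid=False)
# Boxes are drawn in ascending key order: position 1 is mode 0 (minor), position 2 is mode 1 (major)
plt.xticks([1,2], ["Minor (0)" , "Major (1)"])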
plt.ylabel("Valence (Happiness)")
plt.title("Valence by Musical Mode")
plt.suptitle("")  # remove automatic title
plt.show()

In [None]:
# Build a Mood Score = (0.5 × valence + 0.3 × danceability + 0.2 × energy).
print("Added new column [mood_score]")
mood_score=(0.5*spotify_df['valence']+0.3*spotify_df['danceability']+0.2*spotify_df['energy'])
spotify_df['mood_score']=mood_score
plt.hist(spotify_df['mood_score'],bins=30,color='pink',edgecolor='black')
plt.xlabel("Mood Score")
plt.ylabel("Number of Tracks")
plt.title("Distribution of Mood Scores")
plt.show()
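
The outline also asks which year had the happiest mood score. A possible sketch follows, assuming the dataset has a release-year column; the column name `year` is hypothetical, since no year column appears in this notebook, and the toy values are invented.

```python
import pandas as pd

# Toy frame standing in for spotify_df; the 'year' column name is an assumption
toy = pd.DataFrame({
    "year": [2019, 2019, 2020, 2020, 2021],
    "valence": [0.2, 0.4, 0.8, 0.6, 0.5],
    "danceability": [0.5, 0.5, 0.7, 0.9, 0.6],
    "energy": [0.4, 0.6, 0.9, 0.7, 0.5],
})

# Same weights as the Mood Score defined in the assignment
toy["mood_score"] = (
    0.5 * toy["valence"] + 0.3 * toy["danceability"] + 0.2 * toy["energy"]
)

# Mean mood score per year, then the year with the highest mean
yearly_mood = toy.groupby("year")["mood_score"].mean()
happiest_year = yearly_mood.idxmax()
print(yearly_mood.round(2))
print(f"Happiest year: {happiest_year}")
```

Years with very few tracks can top this ranking by chance, so a minimum track count per year may be worth adding on real data.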

In [None]:
# Which genre scores highest on mood?
print('Top 10 genres by mean mood score')
top_genre_mood=spotify_df.drop_duplicates().groupby('genre')['mood_score'].mean().sort_values(ascending=False).head(10)
plt.figure(figsize=(10,5))
plt.bar(top_genre_mood.index, top_genre_mood.values, color="orange", edgecolor="black", alpha=0.7)
plt.xticks(rotation=45, ha="right")
plt.ylabel("Average Mood Score")
plt.title("Top 10 Genres by Mood Score")
plt.tight_layout()
plt.show()

## Step 9 – Insights & Reporting

In [None]:
#  Which genres dominate overall? Which are niche?
genre_counts=spotify_df.drop_duplicates()['genre'].value_counts()
dominating_genre=genre_counts.head(10)
plt.bar(dominating_genre.index,dominating_genre.values,color="purple",edgecolor="yellow")
plt.xticks(rotation=60,ha="right")
plt.xlabel("genre")
plt.ylabel("Count")
plt.title("Top 10 genres that dominate overall")
plt.show()

In [None]:
niche_genre=genre_counts.tail(10)
plt.bar(niche_genre.index,niche_genre.values,color="purple",edgecolor="yellow")
plt.xticks(rotation=60,ha="right")
plt.xlabel("genre")
plt.ylabel("Count")
plt.title("10 least common (niche) genres")
plt.show()

In [None]:
# Do artists stick to one style or vary widely across features?
features=['danceability', 'energy', 'valence', 'tempo', 'acousticness']
fig,ax=plt.subplots(2,3,figsize=(20,12))
ax=ax.flatten()
for i,feature in enumerate(features):
    # Per-artist mean and std for the first five artists (alphabetical order)
    stats=spotify_df.groupby('artist_name')[feature].agg(['mean','std']).head()
    ax[i].plot(stats.index,stats['mean'],label="Mean")
    ax[i].plot(stats.index,stats['std'],label="STD")
    ax[i].legend()
    ax[i].set_xlabel("Artist Name")
    ax[i].set_ylabel("Value")
    ax[i].set_title(f"{feature.upper()} Graph")
    ax[i].tick_params(axis='x',rotation=45)
ax[-1].axis('off')  # the 2x3 grid has six slots but only five features
# Neutral title: the plot covers only five artists, so it cannot settle the question on its own
plt.suptitle("MEAN AND STD OF AUDIO FEATURES PER ARTIST")
plt.tight_layout()
plt.show()
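
To go beyond a handful of artists, the per-artist standard deviation can be tabulated for everyone with enough tracks. The sketch below is one way to do that; `artist_variability` and the `min_tracks` cutoff are assumptions introduced here, the toy values are invented, and tempo is left out because it is on a different scale from the 0 to 1 features.

```python
import pandas as pd

def artist_variability(df: pd.DataFrame, features, min_tracks: int = 3) -> pd.DataFrame:
    """Per-artist std of each feature, most consistent artists first."""
    counts = df.groupby("artist_name").size()
    spread = df.groupby("artist_name")[features].std()
    # A std from one or two tracks says little, so keep artists with enough tracks
    spread = spread.loc[counts >= min_tracks].copy()
    spread["mean_std"] = spread[features].mean(axis=1)
    return spread.sort_values("mean_std")

# Toy frame standing in for spotify_df (invented values)
toy = pd.DataFrame({
    "artist_name": ["A", "A", "A", "B", "B", "B", "C"],
    "energy": [0.80, 0.82, 0.78, 0.10, 0.50, 0.90, 0.50],
    "valence": [0.50, 0.55, 0.60, 0.20, 0.90, 0.50, 0.50],
})
variability = artist_variability(toy, ["energy", "valence"])
print(variability.round(3))
```

A low `mean_std` suggests an artist stays close to one style, while a high value suggests wider variation across tracks.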

In [None]:
print("Saving Cleaned and Modified Spotify Dataset")
# index=False avoids writing the row index as an extra unnamed column
spotify_df.to_csv("cleaned_spotify_dataset.csv",index=False)

### EDA on Spotify dataset completed successfully 