# 04 - Clustering & Archetype Analysis

## Description
This notebook applies unsupervised machine learning (K-Means clustering) to segment songs into distinct groups based on their performance metrics across various platforms. The objective is to identify and define actionable archetypes of 'hit songs' (e.g., 'TikTok Viral', 'Streaming Juggernaut', 'Radio Hit'), providing a strategic framework for marketing and A&R decisions.

## Analysis Pipeline:
1.  **Feature Selection & Preparation:** Select and scale the features that define a song's performance profile.
2.  **Optimal Cluster-Count (K):** Use the Elbow Method to determine the ideal number of clusters for segmentation.
3.  **K-Means Clustering:** Apply the algorithm to label each song with its corresponding archetype.
4.  **Cluster Profile Analysis:** Compare the per-cluster feature means to understand each cluster's unique characteristics and define the archetypes.
5.  **Visualization:** Use PCA to visualize the distinct clusters in a 2D space.

### 1. Setup and Data Loading

In [6]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from pathlib import Path
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Load the cleaned dataset
PROJECT_ROOT = Path.cwd().parent
CLEANED_DATA_FILE = PROJECT_ROOT / 'data' / 'processed' / 'cleaned_spotify_data_2024.csv'
df = pd.read_csv(CLEANED_DATA_FILE)

print("Setup complete. Cleaned data loaded and ready for clustering.")

Setup complete. Cleaned data loaded and ready for clustering.


### 2. Feature Selection, Transformation, and Scaling
We cluster songs on their cross-platform performance profiles. We select key metrics, fill missing values with 0, apply a `log1p` transformation to tame their highly skewed distributions, and then standardize each feature to a mean of 0 and a standard deviation of 1. Scaling is essential for distance-based algorithms like K-Means: without it, features with larger numeric ranges would dominate the Euclidean distance.

In [7]:
# Select features that define a song's success profile
cluster_features = [
    'spotify_streams', 'youtube_views', 'tiktok_views',
    'shazam_counts', 'airplay_spins', 'spotify_playlist_count'
]

# Fill any potential NaNs with 0 and apply log transformation
features_log = np.log1p(df[cluster_features].fillna(0))

# Scale the data
scaler = StandardScaler()
features_scaled = scaler.fit_transform(features_log)

print("Features have been selected, transformed (log), and scaled.")

Features have been selected, transformed (log), and scaled.
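
As a sanity check on the transformation described above, the following minimal sketch uses synthetic log-normal data (an assumption, not the real dataset) to show how `log1p` reduces skew and how `StandardScaler` yields zero-mean, unit-variance columns. The `skewness` helper is hypothetical and defined only for this illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic, heavily right-skewed "stream counts" (illustrative only)
rng = np.random.default_rng(0)
raw = rng.lognormal(mean=15, sigma=2, size=(1000, 2))

def skewness(x):
    """Sample skewness (third standardized moment) per column."""
    centered = x - x.mean(axis=0)
    return (centered ** 3).mean(axis=0) / x.std(axis=0) ** 3

logged = np.log1p(raw)                           # compress the long right tail
scaled = StandardScaler().fit_transform(logged)  # mean 0, std 1 per column

print("skew before log1p:", skewness(raw).round(2))
print("skew after log1p: ", skewness(logged).round(2))
print("column means:", scaled.mean(axis=0).round(6))
print("column stds: ", scaled.std(axis=0).round(6))
```

Running the same `skewness` check on `df[cluster_features]` before and after the transformation would confirm whether the log step is doing useful work on the real data.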


### 3. Finding the Optimal Number of Clusters (K)
We use the Elbow Method to identify the optimal number of clusters. We look for the 'elbow' point where adding more clusters no longer yields a significant decrease in inertia (within-cluster sum of squares).

In [8]:
inertia = []
K = range(1, 11)

for k in K:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init='auto')
    kmeans.fit(features_scaled)
    inertia.append(kmeans.inertia_)

# Plot the Elbow Method curve
fig = go.Figure(data=go.Scatter(x=list(K), y=inertia, mode='lines+markers'))
fig.update_layout(
    title='Elbow Method for Optimal K',
    xaxis_title='Number of Clusters (k)',
    yaxis_title='Inertia',
    annotations=[
        dict(x=4, y=inertia[3], ax=0, ay=-40, xref='x', yref='y', showarrow=True, arrowhead=2, text='Optimal K (Elbow Point)')
    ]
)
fig.show()

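**Observation:** The elbow appears at **k=4** (annotated in the plot). Because the Elbow Method is a visual heuristic, we treat this as a reasonable choice rather than a proven optimum: it suggests that our songs can be meaningfully segmented into four archetypes.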
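
Since the elbow is judged by eye, a second criterion can help confirm the choice of k. The sketch below uses the silhouette score (`sklearn.metrics.silhouette_score`), which this notebook does not otherwise use. The data is a synthetic stand-in for `features_scaled` with four well-separated blobs, so the result illustrates the technique only and says nothing about the real dataset.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for features_scaled: 4 well-separated blobs in 6 dimensions
centers = 8 * np.eye(6)[:4]
X, _ = make_blobs(n_samples=600, centers=centers, cluster_std=1.0, random_state=42)

silhouette_by_k = {}
for k in range(2, 11):  # the silhouette score needs at least 2 clusters
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X)
    silhouette_by_k[k] = silhouette_score(X, labels)

best_k = max(silhouette_by_k, key=silhouette_by_k.get)
for k, score in silhouette_by_k.items():
    print(f"k={k}: silhouette={score:.3f}")
print("k with highest silhouette:", best_k)
```

Swapping `X` for `features_scaled` would apply the same check to the songs. If the silhouette peak and the elbow disagree, both candidate values of k are worth profiling.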

### 4. Applying K-Means and Analyzing Cluster Profiles
Now we apply K-Means with our chosen k=4 and then analyze the resulting clusters to define our archetypes.

In [9]:
# Apply K-Means with k=4
kmeans = KMeans(n_clusters=4, random_state=42, n_init='auto')
df['cluster'] = kmeans.fit_predict(features_scaled)

print("K-Means applied. Songs have been assigned to 4 clusters.")

# Profile each cluster by the mean of the original (non-scaled) features.
# Note: these are raw per-cluster means, not the K-Means centroids (which live in log-scaled space).
cluster_profile = df.groupby('cluster')[cluster_features].mean().sort_values('spotify_streams', ascending=False)

print("Cluster Profiles (Mean Values per Feature):")
cluster_profile

K-Means applied. Songs have been assigned to 4 clusters.
Cluster Profiles (Mean Values per Feature):


| cluster | spotify_streams | youtube_views | tiktok_views | shazam_counts | airplay_spins | spotify_playlist_count |
|---|---|---|---|---|---|---|
| 2 | 689572900.0 | 585726200.0 | 830140300.0 | 0.0 | 75141.826923 | 75902.639423 |
| 0 | 629405300.0 | 476115000.0 | 1133629000.0 | 3557560.0 | 75664.810508 | 93197.199397 |
| 3 | 137940000.0 | 248737100.0 | 337677200.0 | 76309.71 | 5136.153595 | 861.712418 |
| 1 | 91035350.0 | 150446700.0 | 650495900.0 | 388300.0 | 9192.795316 | 12797.47556 |


#### Defining the Archetypes
Based on the mean values above, we can assign a persona to each cluster:

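-   **Cluster 0: The Global Superstars:** Leads on four of the six metrics (`tiktok_views`, `shazam_counts`, `airplay_spins`, `spotify_playlist_count`) and is a close second on `spotify_streams` and `youtube_views`. These are the mega-hits with massive streaming numbers, huge YouTube views, and significant radio play.

-   **Cluster 2: The Radio & Playlist Hits:** This group has the highest mean `spotify_streams` and `youtube_views` and strong playlist numbers. Its `airplay_spins` are on par with Cluster 0 and roughly 8 to 15 times those of Clusters 1 and 3, which points to heavy traditional media backing. What separates it from Cluster 0 is a mean `shazam_counts` of exactly 0. This may reflect missing Shazam values that were filled with 0 in Step 2 rather than genuinely zero activity, so the boundary between these two clusters should be interpreted with caution.

-   **Cluster 1: The Digital Natives:** Lowest mean `spotify_streams` and `youtube_views`, but `tiktok_views` about seven times its Spotify streams and the second-highest `shazam_counts`, with far less radio play than the top two clusters. These hits live and breathe on social and digital platforms and may not have crossed over into traditional media.

-   **Cluster 3: The Niche & Emerging Hits:** This is the largest group, representing songs that are successful enough to make the dataset but have modest performance across platforms, including the lowest `tiktok_views`, `airplay_spins`, and `spotify_playlist_count` of any cluster.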
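
The profile table reports arithmetic means of the raw features, which are not the same as the K-Means centroids, and it does not show how many songs fall in each cluster. The sketch below shows one way to recover both: it maps the centroids back to original units by inverting the scaling and then the `log1p` step, and counts songs per cluster. The `demo` frame is synthetic and carries only two of the six features, purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Small synthetic stand-in for df[cluster_features] (illustrative values only)
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    'spotify_streams': rng.lognormal(18, 1.5, 300),
    'airplay_spins': rng.lognormal(8, 2.0, 300),
})

scaler = StandardScaler()
scaled = scaler.fit_transform(np.log1p(demo))
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10).fit(scaled)

# Undo the scaling, then undo log1p, to express centroids in original units
centroids = pd.DataFrame(
    np.expm1(scaler.inverse_transform(kmeans.cluster_centers_)),
    columns=demo.columns,
).rename_axis('cluster')

# Number of songs per cluster, indexed by cluster label
sizes = pd.Series(kmeans.labels_).value_counts().sort_index().rename('n_songs')

print(centroids.join(sizes))
```

Back-transformed centroids behave like geometric means, so on skewed data they will generally sit below the arithmetic means in the profile table. Comparing the two views shows how much a few very large tracks pull each cluster's average.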

### 5. Visualizing the Clusters with PCA
To visualize our 6-dimensional clusters, we use Principal Component Analysis (PCA) to project the data down to two dimensions.

In [10]:
# Reduce dimensions for visualization
pca = PCA(n_components=2)
pca_result = pca.fit_transform(features_scaled)

# Add PCA results to the DataFrame
df['pca1'] = pca_result[:, 0]
df['pca2'] = pca_result[:, 1]

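# Map cluster numbers to archetype names for a better legend.
# Keys are the cluster IDs from the profile table above. Mapping by sorted position
# would label Cluster 2 (highest spotify_streams) as 'Global Superstar' and
# contradict the archetype definitions. Re-check these IDs if the data or seed changes.
archetype_map = {
    0: 'Global Superstar',
    2: 'Radio & Playlist Hit',
    1: 'Digital Native',
    3: 'Niche/Emerging Hit'
}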
df['archetype'] = df['cluster'].map(archetype_map)

# Create interactive scatter plot
fig = px.scatter(df, x='pca1', y='pca2', color='archetype',
                 hover_name='track', hover_data=['artist', 'spotify_streams'],
                 title='Song Archetype Clusters (PCA Projection)',
                 labels={'pca1': 'Principal Component 1 (Overall Scale)', 'pca2': 'Principal Component 2 (Performance Style)'},
                 category_orders={'archetype': ['Global Superstar', 'Radio & Playlist Hit', 'Digital Native', 'Niche/Emerging Hit']})
fig.show()
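
The axis labels "Overall Scale" and "Performance Style" are interpretations of the two components. One way to check them is to inspect the explained variance and the feature loadings of the fitted PCA. The sketch below does this on synthetic correlated data (an assumption standing in for `features_scaled`, with only three of the feature names reused for readability).

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

feature_names = ['spotify_streams', 'youtube_views', 'tiktok_views']
rng = np.random.default_rng(0)

# Synthetic correlated features: a shared "scale" factor plus independent noise
scale_factor = rng.normal(size=(500, 1))
X = StandardScaler().fit_transform(scale_factor + 0.5 * rng.normal(size=(500, 3)))

pca = PCA(n_components=2).fit(X)

# Rows are features, columns are components; each entry is that feature's weight
loadings = pd.DataFrame(pca.components_.T, index=feature_names, columns=['pca1', 'pca2'])

print("Explained variance ratio:", pca.explained_variance_ratio_.round(3))
print(loadings.round(3))
```

If every feature loads on `pca1` with the same sign and similar magnitude, reading that axis as overall scale is supported. Mixed signs on `pca2` would indicate which platforms trade off against each other. On the real data, the same two prints can be applied to the `pca` object fitted above.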

### 6. Executive Summary & Actionable Insights

The clustering analysis segmented the hit songs into four archetypes with distinct cross-platform performance profiles, providing a strategic framework for marketing and A&R decisions.

**Actionable Insights per Archetype:**

1.  **For Global Superstars:** The goal is to leverage their immense cross-platform appeal for high-value partnerships, global tours, and brand endorsements. Their marketing is about maintaining omnipresence.

2.  **For Radio & Playlist Hits:** These tracks have proven appeal to a broad audience and curators. The strategy should focus on securing sync licenses for film, TV, and commercials, as their high radio play indicates a 'safe' and popular sound.

3.  **For Digital Natives:** These songs resonate deeply with online audiences. The strategy should be to double down on digital marketing, influencer collaborations, and converting Spotify/YouTube success into a dedicated fanbase through social media engagement.

4.  **For Niche/Emerging Hits:** These songs represent a testing ground. The strategy is to analyze which of these tracks show early signs of crossing over (e.g., a sudden spike in `shazam_counts` or `tiktok_posts`) and then invest marketing resources to push them into a higher tier.