# Mapping the Musical Landscape: A Study of Clusters in Apple Music Data

Introduction

Music streaming platforms like Apple Music have transformed the way users experience music. With millions of tracks spanning a wide range of genres and styles, there is growing interest in grouping songs by their inherent characteristics rather than by explicit genre labels. This approach opens new possibilities for recommendation systems, playlist creation, and music discovery.

Unlike Spotify, Apple Music does not provide granular audio features such as danceability or energy through its public API. However, Apple Music does offer access to metadata like artist, genre, duration, and release date. For this project, we use these metadata features to perform clustering analysis and investigate whether songs in a user’s library can be grouped based on stylistic patterns.

This project explores clustering techniques and shows how basic music metadata can still offer useful insights into listening behavior, preferred genres, and potential playlist structures. It illustrates how unsupervised machine learning can be applied in a creative domain.

Core Questions

Can metadata-based clustering reveal distinctive groups of songs that represent certain listening moods or stylistic tendencies?

Do these clusters suggest non-traditional categorizations that go beyond standard genre labels?

How might these insights improve personalized listening experiences on platforms like Apple Music?

By addressing these questions, we aim to explore the potential of unsupervised clustering methods in the music domain, even with limited acoustic information.



Understanding Clustering

Clustering is an unsupervised learning technique used to group data points with similar characteristics without using predefined labels. It allows for exploratory data analysis and can uncover hidden structures within datasets. In this project, clustering helps group songs that share similar metadata characteristics.

We use K-Means clustering, a popular method due to its simplicity and speed. K-Means works by:

1. Selecting a number of clusters (k).

2. Randomly initializing k centroids.

3. Assigning each point to the nearest centroid.

4. Recalculating each centroid as the mean of its assigned points.

5. Repeating the assignment and recalculation steps until the centroids stabilize.
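
The loop described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the scikit-learn implementation used later in the notebook; the function name `kmeans_sketch`, its parameters, and the toy points are assumptions made for this example.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Illustrative K-Means loop; not the scikit-learn implementation."""
    rng = np.random.default_rng(seed)
    # Initialize: pick k distinct data points as the starting centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign: each point goes to its nearest centroid (Euclidean distance).
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update: move each centroid to the mean of its points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop: the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Toy data: two well-separated groups, made up for illustration.
toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
toy_labels, toy_centroids = kmeans_sketch(toy, k=2)
print(toy_labels)
```

The first three points should end up sharing one label and the last three the other; which group is labeled 0 depends on the random initialization.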

We choose k using the Elbow Method, a visual heuristic: plot the inertia (within-cluster sum of squares) for a range of k values and look for the point beyond which adding more clusters yields only marginal reductions. The elbow is a judgment call rather than a guaranteed optimum, but it is a practical way to balance complexity and interpretability.
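
To make "inertia" concrete, the sketch below recomputes it by hand and compares it with the value scikit-learn reports. The points are toy values invented for this example, and `n_init=10` is set explicitly as an assumption so the result does not depend on the installed version's default.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D points, made up for illustration only.
points = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0],
                   [5.0, 7.0], [3.5, 5.0], [4.5, 5.0], [3.5, 4.5]])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

# Inertia: sum of squared distances from each point to its assigned centroid.
assigned_centroids = model.cluster_centers_[model.labels_]
manual_inertia = np.sum((points - assigned_centroids) ** 2)

print(manual_inertia, model.inertia_)
```

The two printed numbers should agree up to floating-point rounding. Because inertia can only stay the same or shrink as k grows, the elbow plot looks for diminishing returns rather than a minimum.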

The Dataset

The dataset comes from a user’s Apple Music library, accessed via Apple’s MusicKit API and saved as a CSV file for analysis. The data includes the following features:

track_name: Title of the song

artist: Name of the artist

album: Album title

genre: Primary genre classification

release_date: Date the track was released

duration_ms: Duration of the track in milliseconds

track_number: Track number in the album

This metadata, while limited compared to audio feature sets from services like Spotify, still provides valuable insights into a user's listening habits, genre preferences, and release trends over time.
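
Before running the analysis, it can help to confirm that an export actually contains the columns listed above. The following is a minimal sketch; the helper name `missing_columns` and the two placeholder rows are invented for illustration and are not real library data.

```python
import pandas as pd

# Column names taken from the feature list above.
EXPECTED_COLUMNS = ["track_name", "artist", "album", "genre",
                    "release_date", "duration_ms", "track_number"]

def missing_columns(frame):
    """Return the expected columns that are absent from the DataFrame."""
    return [col for col in EXPECTED_COLUMNS if col not in frame.columns]

# Placeholder rows (invented, not real library data) to demonstrate the check.
sample = pd.DataFrame({
    "track_name": ["Song A", "Song B"],
    "artist": ["Artist X", "Artist Y"],
    "album": ["Album 1", "Album 2"],
    "genre": ["Pop", "Rock"],
    "release_date": ["2019-05-01", "2003-11-17"],
    "duration_ms": [215000, 187000],
})
print(missing_columns(sample))  # ['track_number']
```

An empty list means the export matches the expected schema and the cells that follow should run without column errors.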



In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

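# Load the library export; the file name is a placeholder for your own CSV
df = pd.read_csv('your_apple_music_dataset.csv')
# Dates that cannot be parsed become NaT (year NaN) instead of raising an error
df['release_year'] = pd.to_datetime(df['release_date'], errors='coerce').dt.year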

num_features = ['duration_ms', 'track_number', 'release_year']

plt.figure(figsize=(15, 5))
for i, feature in enumerate(num_features):
    plt.subplot(1, 3, i + 1)
    sns.histplot(df[feature], kde=True)
    plt.title(f'Distribution of {feature}')
plt.tight_layout()
plt.show()

In [None]:
plt.figure(figsize=(10, 6))
sns.countplot(y='genre', data=df, order=df['genre'].value_counts().head(10).index)
plt.title('Top 10 Genres')
plt.show()

In [None]:
from sklearn.preprocessing import StandardScaler
import numpy as np

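# Drop rows with missing values; copy so that adding columns later
# does not trigger pandas' SettingWithCopyWarning
df_clean = df.dropna().copy()
X_num = df_clean[['duration_ms', 'track_number', 'release_year']]
# One-hot encode genre as floats so it can sit alongside the scaled numerics
X_cat = pd.get_dummies(df_clean['genre'], dtype=float)

# Standardize numeric features to zero mean and unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_num)

# Final feature matrix: scaled numeric columns followed by genre indicators
X = np.concatenate([X_scaled, X_cat.to_numpy()], axis=1)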

In [None]:
from sklearn.cluster import KMeans

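# Fit K-Means for k = 1..10 and record the inertia of each fit
inertia = []
for k in range(1, 11):
    # n_init is set explicitly because its default differs across scikit-learn versions
    model = KMeans(n_clusters=k, n_init=10, random_state=42)
    model.fit(X)
    inertia.append(model.inertia_)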

plt.figure(figsize=(8, 5))
plt.plot(range(1, 11), inertia, marker='o')
plt.title('Elbow Method for Optimal k')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.show()

In [None]:
from sklearn.decomposition import PCA

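# k=4 is an assumed choice; substitute the elbow read off the plot above
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

# Project the feature matrix onto two principal components, for plotting only
pca = PCA(n_components=2)
pca_result = pca.fit_transform(X)

# Attach cluster labels and 2-D coordinates; assign() returns a new DataFrame,
# which avoids writing into a possible view of df
df_clean = df_clean.assign(cluster=labels,
                           PC1=pca_result[:, 0],
                           PC2=pca_result[:, 1])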

plt.figure(figsize=(8, 6))
sns.scatterplot(x='PC1', y='PC2', hue='cluster', data=df_clean, palette='Set2')
plt.title('K-Means Clustering (PCA)')
plt.show()

In [None]:
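# Mean of each numeric feature within each cluster
cluster_means = df_clean.groupby('cluster')[['duration_ms', 'track_number', 'release_year']].mean()
print(cluster_means)

# One panel per feature: milliseconds, track positions, and years sit on very
# different scales, so a shared y-axis would flatten all but duration_ms
axes = cluster_means.plot(kind='bar', subplots=True, layout=(1, 3), sharey=False,
                          figsize=(15, 4), legend=False, rot=0)
for ax in axes.ravel():
    ax.set_xlabel('Cluster')
    ax.set_ylabel('Average Value')
plt.suptitle('Average Feature Values per Cluster')
plt.tight_layout()
plt.show()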

This analysis reveals how each cluster differs based on numeric metadata. For example:

A cluster with higher average track_number could represent deep cuts or full album listens.

A cluster with recent release_year values could highlight a preference for newer music.

Storytelling (Clustering Analysis)

Cluster analysis can surface patterns in the user’s library. Some clusters group songs that were released in similar years, while others are dominated by a single genre. Even though Apple Music doesn’t expose deep audio features, the groupings are coherent enough to hint at user preferences.

This suggests that metadata alone can uncover listening trends. For instance:

A cluster of high track numbers may indicate that the user tends to listen to albums in full.

A genre-dominant cluster may suggest mood-based playlists or favorite genres.
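
One way to check whether a cluster is genre-dominant is to cross-tabulate cluster labels against genre. The sketch below uses a small placeholder table (invented for illustration); with the notebook's data, `df_clean` would take the place of `demo` once the clustering cell has added the `cluster` column.

```python
import pandas as pd

# Placeholder cluster assignments (invented for illustration).
demo = pd.DataFrame({
    "genre":   ["Pop", "Pop", "Rock", "Rock", "Rock", "Jazz"],
    "cluster": [0, 0, 1, 1, 0, 1],
})

# Share of each genre within each cluster (each row sums to 1).
genre_share = pd.crosstab(demo["cluster"], demo["genre"], normalize="index")

# Most common genre per cluster and the fraction of the cluster it covers.
summary = pd.DataFrame({
    "dominant_genre": genre_share.idxmax(axis=1),
    "share": genre_share.max(axis=1),
})
print(summary)
```

A share close to 1 points to a cluster that is essentially a single genre, while a low share suggests the cluster is held together by the numeric features instead.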

With further refinement, such analysis can even inspire new kinds of music recommendations driven not just by genre, but by context (e.g., “late-night chill,” “throwback hits,” “energetic Monday”).

Impact

Positive Impacts

Enables playlist recommendations based on user behavior rather than just genre.

Helps users rediscover older songs they enjoy.

Promotes new music that matches listening patterns.

Supports music discovery for niche or emerging genres based on metadata similarities.

Potential Concerns

May reinforce echo chambers in recommendations.

May carry a bias toward mainstream genres present in the available metadata.

Ignores lyrical or cultural context of music.

Lack of acoustic features may oversimplify music classification.

Conclusion

While Apple Music lacks detailed acoustic data, clustering using metadata still offers valuable insights into user preferences. With careful preprocessing and thoughtful interpretation, we can extract meaningful patterns that support playlist generation and personalized music discovery.

This project demonstrates the viability of applying unsupervised learning to personal music libraries using basic features. Future extensions could incorporate sentiment analysis of lyrics, integration with external platforms like Last.fm for richer tagging, or even building recommendation engines that dynamically adjust playlists based on user mood and historical trends.

