# Mapping the Musical Landscape: A Study of Clusters in Apple Music Data

## Introduction
Music streaming platforms like Apple Music have transformed how we experience music. This project explores clustering songs from a user's Apple Music library based on metadata to uncover listening patterns.

### Core Questions
1. Can metadata-based clustering reveal distinctive moods or styles?
2. Do clusters reflect non-traditional categorizations?
3. Can these insights improve recommendation systems?

## Understanding Clustering
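K-Means clustering groups similar songs by partitioning them into k clusters so that each song belongs to the cluster with the nearest centroid (the mean of that cluster's members). Here the features are metadata such as genre, release year, and duration.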
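
As a rough illustration of how the algorithm iterates, the sketch below implements the standard assign-then-update loop (Lloyd's algorithm) in NumPy. The function name `kmeans_sketch` and the toy data are hypothetical; the notebook itself relies on scikit-learn's `KMeans`, which adds better initialisation and multiple restarts.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Minimal Lloyd's algorithm: returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Initialise centroids with k distinct rows chosen at random
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assignment step: nearest centroid by squared Euclidean distance
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its members
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving
        centroids = new_centroids
    return labels, centroids

# Hypothetical toy data: [standardised duration, standardised release year]
toy = np.array([[-1.0, -1.1], [-0.9, -1.0], [-1.1, -0.9],
                [1.0, 1.1], [0.9, 1.0], [1.1, 0.9]])
labels, centroids = kmeans_sketch(toy, k=2)
print(labels)
print(centroids)
```

The cluster numbering depends on the random initialisation, so only the grouping is meaningful, not the label values themselves.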

## The Dataset
The dataset includes:
- track_name
- artist
- album
- genre
- release_date
- duration_ms
- track_number

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Example DataFrame setup
df = pd.read_csv('your_apple_music_dataset.csv')
# Unparseable dates become NaT (and NaN years) instead of raising an error
df['release_year'] = pd.to_datetime(df['release_date'], errors='coerce').dt.year

num_features = ['duration_ms', 'track_number', 'release_year']

plt.figure(figsize=(15, 5))
for i, feature in enumerate(num_features):
    plt.subplot(1, 3, i + 1)
    sns.histplot(df[feature], kde=True)
    plt.title(f'Distribution of {feature}')
plt.tight_layout()
plt.show()

In [None]:
plt.figure(figsize=(10, 6))
sns.countplot(y='genre', data=df, order=df['genre'].value_counts().head(10).index)
plt.title('Top 10 Genres')
plt.show()

In [None]:
from sklearn.preprocessing import StandardScaler
import numpy as np

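# Clean data: drop rows with missing values, and copy so that adding columns
# later does not trigger pandas' SettingWithCopyWarning
df_clean = df.dropna().copy()
X_num = df_clean[['duration_ms', 'track_number', 'release_year']]
# One-hot encode genre as floats so it concatenates cleanly with the scaled features
X_cat = pd.get_dummies(df_clean['genre'], dtype=float)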

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_num)

X = np.concatenate([X_scaled, X_cat], axis=1)
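
The same scale-and-encode step can also be expressed with scikit-learn's `ColumnTransformer`, which keeps the preprocessing in a single reusable object. This is only a sketch of one alternative, not the approach used in the cell above; the `tracks` DataFrame and its values are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical rows with the same columns used in the notebook
tracks = pd.DataFrame({
    'duration_ms': [210000, 185000, 320000, 240000],
    'track_number': [1, 4, 2, 7],
    'release_year': [1999, 2015, 1975, 2021],
    'genre': ['Rock', 'Pop', 'Jazz', 'Pop'],
})

preprocess = ColumnTransformer(
    transformers=[
        # Standardise numeric columns to zero mean and unit variance
        ('num', StandardScaler(), ['duration_ms', 'track_number', 'release_year']),
        # One-hot encode genre; unseen genres at transform time become all zeros
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['genre']),
    ],
    sparse_threshold=0.0,  # always return a dense array
)
X_tracks = preprocess.fit_transform(tracks)
print(X_tracks.shape)
```

Note that each one-hot genre column contributes to the Euclidean distance alongside the scaled numeric features, so the number of genres affects how strongly genre drives the clustering.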

In [None]:
from sklearn.cluster import KMeans

inertia = []
for k in range(1, 11):
    # Set n_init explicitly: its default changed across scikit-learn versions
    model = KMeans(n_clusters=k, n_init=10, random_state=42)
    model.fit(X)
    inertia.append(model.inertia_)

plt.figure(figsize=(8, 5))
plt.plot(range(1, 11), inertia, marker='o')
plt.title('Elbow Method for Optimal k')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.show()
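
The elbow in an inertia plot can be ambiguous, so a second criterion is often useful. The sketch below computes the silhouette score for several values of k; it is a complementary check added here for illustration, not part of the original analysis. `X_demo` is synthetic data standing in for the notebook's feature matrix `X`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the feature matrix X: three well-separated blobs
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X_demo = np.vstack([c + rng.normal(scale=0.3, size=(50, 2)) for c in centers])

sil_scores = {}
for k in range(2, 7):  # the silhouette score needs at least 2 clusters
    labels_k = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_demo)
    sil_scores[k] = silhouette_score(X_demo, labels_k)

# Higher silhouette means tighter, better-separated clusters
best_k = max(sil_scores, key=sil_scores.get)
print(sil_scores)
print('k with highest silhouette:', best_k)
```

On real library data the two criteria may disagree, in which case the interpretability of the resulting clusters is a reasonable tiebreaker.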

In [None]:
from sklearn.decomposition import PCA

# k=4 is assumed here; choose the value suggested by the elbow plot for your data
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

pca = PCA(n_components=2)
pca_result = pca.fit_transform(X)

df_clean['cluster'] = labels
df_clean['PC1'] = pca_result[:, 0]
df_clean['PC2'] = pca_result[:, 1]

plt.figure(figsize=(8, 6))
sns.scatterplot(x='PC1', y='PC2', hue='cluster', data=df_clean, palette='Set2')
plt.title('K-Means Clustering (PCA)')
plt.show()

In [None]:
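num_features = ['duration_ms', 'track_number', 'release_year']
cluster_means = df_clean.groupby('cluster')[num_features].mean()
print(cluster_means)

# The features sit on very different scales (milliseconds vs. years), so plot
# each cluster mean as a z-score relative to the whole library
cluster_means_z = (cluster_means - df_clean[num_features].mean()) / df_clean[num_features].std()
cluster_means_z.plot(kind='bar', figsize=(10, 6))
plt.title('Average Feature Values per Cluster (z-scores)')
plt.xlabel('Cluster')
plt.ylabel('Standard deviations from library mean')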
plt.xticks(rotation=0)
plt.tight_layout()
plt.show()
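
Because genre is part of the feature matrix, it can also help to see which genres dominate each cluster. The sketch below uses `pd.crosstab` on a small hypothetical table (`demo`) standing in for `df_clean`; the genres and cluster labels are made up for illustration.

```python
import pandas as pd

# Hypothetical labelled tracks standing in for df_clean
demo = pd.DataFrame({
    'genre': ['Rock', 'Rock', 'Pop', 'Pop', 'Jazz', 'Rock'],
    'cluster': [0, 0, 1, 1, 2, 1],
})

# Rows: clusters; columns: genres; values: share of each cluster's tracks
genre_share = pd.crosstab(demo['cluster'], demo['genre'], normalize='index')
print(genre_share.round(2))
```

Applied to the real data, the call would be `pd.crosstab(df_clean['cluster'], df_clean['genre'], normalize='index')`.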

## Conclusion
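Metadata-based clustering, even without audio features, can surface patterns in a listening library, such as groupings by era, track length, and genre. Whether those groupings correspond to moods or styles depends on inspecting the clusters for the library at hand. Future work could integrate external APIs to add richer features and test whether the clusters improve recommendations.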