# Clustering and Dimensionality Reduction for Patient Wellness Profiles


In this notebook, we analyze a simulated dataset of health and wellness indicators and segment
patients into groups with k-means clustering. We then apply Principal Component Analysis (PCA)
and compare silhouette scores before and after the reduction to evaluate its impact on clustering.


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
import plotly.express as px

## Simulated Dataset Creation

In [None]:
# Simulate a wellness dataset; each feature is drawn independently of the others
np.random.seed(42)  # fixed seed for reproducibility
n_samples = 300
data = {
    'exercise_minutes_per_day': np.random.normal(30, 10, n_samples),
    'healthy_meals_per_day': np.random.randint(1, 4, n_samples),  # integers 1-3 (upper bound is exclusive)
    'hours_sleep': np.random.normal(7, 1.5, n_samples),
    'stress_level': np.random.randint(1, 10, n_samples),  # integers 1-9 (upper bound is exclusive)
    'BMI': np.random.normal(25, 4, n_samples)
}
df = pd.DataFrame(data)
df.head()

## Exploratory Data Analysis

In [None]:
# Summary statistics and pairplot
display(df.describe())
sns.pairplot(df)
plt.suptitle("Pairplot of Wellness Features", y=1.02)
plt.show()

## Data Preprocessing

In [None]:
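# Standardize each feature to zero mean and unit variance so that no
# single feature dominates the Euclidean distances used by k-means
scaler = StandardScaler()
scaled_data = scaler.fit_transform(df)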
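
As a rough illustration of what the standardization step computes, the sketch below applies the same column-wise z-score by hand in NumPy. The toy matrix `X` is hypothetical and is not the notebook's data. `StandardScaler` divides by the population standard deviation (`ddof=0`), so the sketch does the same.

```python
import numpy as np

# Hypothetical toy matrix: 3 patients x 2 features on different scales
X = np.array([[30.0, 7.0],
              [45.0, 6.0],
              [15.0, 8.0]])

# Column-wise z-score using the population standard deviation (ddof=0)
mean = X.mean(axis=0)
std = X.std(axis=0)  # NumPy's default is ddof=0
X_scaled = (X - mean) / std

print(X_scaled.mean(axis=0))  # approximately 0 for each column
print(X_scaled.std(axis=0))   # approximately 1 for each column
```

After scaling, a one-unit difference means "one standard deviation" for every feature, which is what makes distances across features comparable.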

## K-Means Clustering (Before PCA)

In [None]:
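# Fit k-means for k = 2..6 and record the silhouette score for each k
sil_scores = []
k_range = range(2, 7)
for k in k_range:
    # n_init is set explicitly because its default differs across scikit-learn versions
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = kmeans.fit_predict(scaled_data)
    sil_scores.append(silhouette_score(scaled_data, labels))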

# Plot silhouette scores
plt.plot(k_range, sil_scores, marker='o')
plt.title('Silhouette Scores for k-means (Before PCA)')
plt.xlabel('Number of Clusters')
plt.ylabel('Silhouette Score')
plt.grid(True)
plt.show()
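
The plot shows the scores, but choosing k can also be done programmatically. The following is a minimal sketch, assuming a hypothetical helper named `silhouette_by_k` and stand-in random data (`X_demo`) so that the block runs on its own; in the notebook, the standardized feature matrix would be passed instead.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def silhouette_by_k(X, k_values, random_state=42):
    """Return {k: silhouette score} for k-means fitted on X (hypothetical helper)."""
    scores = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = model.fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return scores


# Stand-in data (assumption): 300 samples, 5 independent features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 5))

scores = silhouette_by_k(X_demo, range(2, 7))
best_k = max(scores, key=scores.get)  # k with the highest silhouette score

for k, s in scores.items():
    print(f"k={k}: silhouette={s:.3f}")
print("best k by silhouette:", best_k)
```

The highest silhouette score is only a heuristic; when all scores are low, the data may simply have no strong cluster structure for any k.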

## Principal Component Analysis (PCA)

In [None]:
# Project the standardized features onto the first two principal components
pca = PCA(n_components=2)
reduced_data = pca.fit_transform(scaled_data)

# Plot PCA components
plt.figure(figsize=(8,6))
plt.scatter(reduced_data[:, 0], reduced_data[:, 1], alpha=0.6)
plt.title("PCA - 2 Principal Components")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.grid(True)
plt.show()
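
A scatter plot alone does not show how much information the two components keep. The sketch below reads `explained_variance_ratio_` from a fitted `PCA` to check this. It uses stand-in random data (`X_demo`, an assumption) rather than the notebook's matrix, so the printed numbers are illustrative only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Stand-in for a standardized 300 x 5 matrix (assumption: independent columns)
rng = np.random.default_rng(42)
X_demo = StandardScaler().fit_transform(rng.normal(size=(300, 5)))

# Share of total variance captured by each of the first two components
pca_demo = PCA(n_components=2).fit(X_demo)
ratios = pca_demo.explained_variance_ratio_
print("variance explained per component:", np.round(ratios, 3))
print("total retained by 2 components:", round(float(ratios.sum()), 3))

# Fitting all components shows how the cumulative share grows to 1.0
cumulative = np.cumsum(PCA().fit(X_demo).explained_variance_ratio_)
print("cumulative:", np.round(cumulative, 3))
```

If two components retain only a modest share of the variance, clusters found in the 2D projection describe only part of the original feature space.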

## K-Means Clustering (After PCA)

In [None]:
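# Repeat the k-means sweep on the 2D PCA projection
sil_scores_pca = []
for k in k_range:
    # n_init is set explicitly because its default differs across scikit-learn versions
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels_pca = kmeans.fit_predict(reduced_data)
    sil_scores_pca.append(silhouette_score(reduced_data, labels_pca))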

# Plot silhouette scores after PCA
plt.plot(k_range, sil_scores_pca, marker='o', color='green')
plt.title('Silhouette Scores for k-means (After PCA)')
plt.xlabel('Number of Clusters')
plt.ylabel('Silhouette Score')
plt.grid(True)
plt.show()
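
Silhouette scores computed in the 2D projection and in the full feature space are measured with different distances, so they are not directly comparable. One possible way to put them on a common footing, sketched below with stand-in random data and an illustrative `k = 3` (both assumptions), is to also score the PCA-derived labels against the full standardized matrix.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Stand-in for the standardized features and their 2D projection (assumption)
rng = np.random.default_rng(42)
X_full = StandardScaler().fit_transform(rng.normal(size=(300, 5)))
X_2d = PCA(n_components=2).fit_transform(X_full)

k = 3  # illustrative choice, not a recommendation
labels_full = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_full)
labels_2d = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_2d)

# Each clustering scored in the space it was fitted in
score_full = silhouette_score(X_full, labels_full)
score_2d_own = silhouette_score(X_2d, labels_2d)

# PCA-derived labels scored in the full space, for a like-for-like comparison
score_2d_in_full = silhouette_score(X_full, labels_2d)

print(f"full-space clusters, full-space score: {score_full:.3f}")
print(f"2D clusters, 2D score:                 {score_2d_own:.3f}")
print(f"2D clusters, full-space score:         {score_2d_in_full:.3f}")
```

Comparing the first and third numbers indicates whether clustering on the projection gives a better or worse partition of the original data, independent of the change in dimensionality.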

## Final Observations and Summary


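- PCA reduced the five standardized features to two components, which makes the clusters easy to plot in 2D; the projection discards whatever variance the remaining components carry.
- k-means partitioned the patients into groups with some overlap. Because the simulated features are sampled independently, these groups should not be read as true underlying wellness profiles.
- Silhouette scores were slightly higher post-PCA. The two sets of scores are computed in different feature spaces (five dimensions versus two), so the increase suggests more compact clusters in the projection rather than a better segmentation of the original data.
- Applied to real patient data, this method can support targeted wellness interventions based on patient similarity.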
