# Clustering Wellness Data with PCA
This notebook demonstrates how to use clustering techniques and PCA to find patterns in a simulated wellness dataset.

## Dataset Creation
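We create a synthetic dataset with 300 samples. Each feature is drawn independently, and all features are standardized before clustering:
- Exercise Minutes per Day (normal, mean 30, SD 10)
- Healthy Meals per Day (integer, 1 to 3)
- Hours of Sleep (normal, mean 7, SD 1.5)
- Stress Level (integer, 1 to 9)
- BMI (normal, mean 25, SD 4)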

In [None]:
# Create the synthetic dataset
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

np.random.seed(42)
data = pd.DataFrame({
    'Exercise_Minutes': np.random.normal(30, 10, 300),
    'Healthy_Meals': np.random.randint(1, 4, 300),
    'Sleep_Hours': np.random.normal(7, 1.5, 300),
    'Stress_Level': np.random.randint(1, 10, 300),
    'BMI': np.random.normal(25, 4, 300)
})
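# Standardize each column to zero mean and unit variance so that no feature
# dominates the distance calculations used by clustering and PCA
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)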
data.head()
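
As a quick sanity check on the scaling step, the sketch below rebuilds the same dataset and confirms that every standardized column has mean 0 and standard deviation 1. The names `col_means` and `col_stds` are introduced here for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Rebuild the synthetic dataset so this cell runs on its own
np.random.seed(42)
data = pd.DataFrame({
    'Exercise_Minutes': np.random.normal(30, 10, 300),
    'Healthy_Meals': np.random.randint(1, 4, 300),
    'Sleep_Hours': np.random.normal(7, 1.5, 300),
    'Stress_Level': np.random.randint(1, 10, 300),
    'BMI': np.random.normal(25, 4, 300)
})
data_scaled = StandardScaler().fit_transform(data)

# StandardScaler uses the population standard deviation (ddof=0),
# which is also NumPy's default for ndarray.std
col_means = data_scaled.mean(axis=0)
col_stds = data_scaled.std(axis=0)

for name, m, s in zip(data.columns, col_means, col_stds):
    print(f'{name}: mean={m:.3f}, std={s:.3f}')
```

Note that `pandas` computes the sample standard deviation (ddof=1) by default, so `pd.DataFrame(data_scaled).std()` would report values slightly above 1.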

## Exploratory Data Analysis
We examine feature relationships using a correlation heatmap.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 6))
sns.heatmap(data.corr(), annot=True, cmap='coolwarm')
plt.title('Feature Correlation Heatmap')
plt.tight_layout()
plt.savefig('/mnt/data/heatmap.png')
plt.show()
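
The heatmap can be complemented with a ranked list of feature pairs. The sketch below rebuilds the dataset and sorts the ten unique pairs by absolute correlation; `pairs` is a name introduced here for illustration. Because every column is generated independently, the coefficients should all be close to zero.

```python
import numpy as np
import pandas as pd

# Rebuild the synthetic dataset so this cell runs on its own
np.random.seed(42)
data = pd.DataFrame({
    'Exercise_Minutes': np.random.normal(30, 10, 300),
    'Healthy_Meals': np.random.randint(1, 4, 300),
    'Sleep_Hours': np.random.normal(7, 1.5, 300),
    'Stress_Level': np.random.randint(1, 10, 300),
    'BMI': np.random.normal(25, 4, 300)
})

corr = data.corr()
cols = list(corr.columns)

# Collect each unordered pair once (upper triangle, diagonal excluded)
pairs = [
    (a, b, corr.loc[a, b])
    for i, a in enumerate(cols)
    for b in cols[i + 1:]
]

# Strongest relationships first, regardless of sign
pairs.sort(key=lambda p: abs(p[2]), reverse=True)

for a, b, r in pairs:
    print(f'{a} vs {b}: r = {r:+.3f}')
```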

## Hierarchical Clustering
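We build the hierarchy with Ward's method, which at each step merges the two clusters whose union least increases the total within-cluster variance, and plot a dendrogram to see how the samples group.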

In [None]:
from scipy.cluster.hierarchy import dendrogram, linkage

linkage_matrix = linkage(data_scaled, method='ward')
plt.figure(figsize=(10, 6))
# Hide the 300 leaf labels, which would otherwise overlap on the x-axis
dendrogram(linkage_matrix, no_labels=True)
plt.title('Dendrogram - Hierarchical Clustering')
plt.tight_layout()
plt.savefig('/mnt/data/dendrogram.png')
plt.show()
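
A dendrogram only shows the merge order; to obtain actual cluster labels, the tree has to be cut. The sketch below uses SciPy's `fcluster` with the `maxclust` criterion. The cut at 3 clusters is an arbitrary illustrative choice, not a value taken from this notebook's results.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.preprocessing import StandardScaler

# Rebuild the synthetic dataset so this cell runs on its own
np.random.seed(42)
data = pd.DataFrame({
    'Exercise_Minutes': np.random.normal(30, 10, 300),
    'Healthy_Meals': np.random.randint(1, 4, 300),
    'Sleep_Hours': np.random.normal(7, 1.5, 300),
    'Stress_Level': np.random.randint(1, 10, 300),
    'BMI': np.random.normal(25, 4, 300)
})
data_scaled = StandardScaler().fit_transform(data)

linkage_matrix = linkage(data_scaled, method='ward')

# Cut the tree so that at most 3 flat clusters remain (assumed value)
hier_labels = fcluster(linkage_matrix, t=3, criterion='maxclust')

# fcluster labels start at 1, so drop the unused bin 0
sizes = np.bincount(hier_labels)[1:]
print('Cluster sizes:', sizes)
```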

## Dimensionality Reduction (PCA)
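PCA projects the standardized data onto 2 components so it can be plotted in a plane. The cell prints the fraction of total variance those components retain; because the features here are generated independently, that fraction is typically well below 100%.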

In [None]:
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
pca_data = pca.fit_transform(data_scaled)
print('Explained variance by 2 components:', pca.explained_variance_ratio_.sum())
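
To judge whether 2 components are enough, it helps to look at the cumulative explained variance across all components and at the loadings of the first two. The sketch below fits a full PCA on the rebuilt dataset; `pca_full`, `cumulative`, and `loadings` are names introduced here for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Rebuild the synthetic dataset so this cell runs on its own
np.random.seed(42)
data = pd.DataFrame({
    'Exercise_Minutes': np.random.normal(30, 10, 300),
    'Healthy_Meals': np.random.randint(1, 4, 300),
    'Sleep_Hours': np.random.normal(7, 1.5, 300),
    'Stress_Level': np.random.randint(1, 10, 300),
    'BMI': np.random.normal(25, 4, 300)
})
data_scaled = StandardScaler().fit_transform(data)

# Keep all components to see how variance accumulates
pca_full = PCA().fit(data_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
for i, c in enumerate(cumulative, start=1):
    print(f'First {i} component(s): {c:.1%} of variance')

# Loadings: weight of each original feature in PC1 and PC2
loadings = pd.DataFrame(
    pca_full.components_[:2].T,
    index=data.columns,
    columns=['PC1', 'PC2']
)
print(loadings.round(3))
```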

## K-Means Clustering
We test K values from 2 to 6 and score each clustering with the silhouette coefficient, which ranges from -1 to 1. The k with the highest score gives the best-separated clusters.

In [None]:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

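k_range = range(2, 7)
scores = []
for k in k_range:
    # Fix random_state so the scores are reproducible across runs
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(data_scaled)
    scores.append(silhouette_score(data_scaled, labels))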

plt.plot(list(k_range), scores, marker='o')
plt.title('Silhouette Scores for K-Means Clustering')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Silhouette Score')
plt.savefig('/mnt/data/silhouette_scores.png')
plt.show()
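
Instead of reading the best k off the plot, it can be selected programmatically. This is a minimal sketch on the rebuilt dataset; `scores_by_k` and `best_k` are names introduced here, and `random_state=42` is an assumption added for reproducibility.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Rebuild the synthetic dataset so this cell runs on its own
np.random.seed(42)
data = pd.DataFrame({
    'Exercise_Minutes': np.random.normal(30, 10, 300),
    'Healthy_Meals': np.random.randint(1, 4, 300),
    'Sleep_Hours': np.random.normal(7, 1.5, 300),
    'Stress_Level': np.random.randint(1, 10, 300),
    'BMI': np.random.normal(25, 4, 300)
})
data_scaled = StandardScaler().fit_transform(data)

# Map each candidate k to its silhouette score
scores_by_k = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(data_scaled)
    scores_by_k[k] = silhouette_score(data_scaled, labels)

# The k with the highest silhouette score
best_k = max(scores_by_k, key=scores_by_k.get)
print('Scores:', {k: round(s, 3) for k, s in scores_by_k.items()})
print('Best k:', best_k)
```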

In [None]:
# Cluster the 2-D PCA projection; random_state makes the labels reproducible
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
clusters = kmeans.fit_predict(pca_data)

plt.scatter(pca_data[:, 0], pca_data[:, 1], c=clusters, cmap='viridis')
plt.title('PCA Plot with K-Means Clusters')
plt.savefig('/mnt/data/pca_clusters.png')
plt.show()
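
The scatter plot shows where the clusters sit in PCA space, but not what they mean. One way to interpret them is to average the original, unscaled features per cluster. The sketch below assumes the same 5-cluster K-Means on the 2-component projection; `labeled` and `profile` are names introduced here, and `random_state=42` is added for reproducibility.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Rebuild the synthetic dataset so this cell runs on its own
np.random.seed(42)
data = pd.DataFrame({
    'Exercise_Minutes': np.random.normal(30, 10, 300),
    'Healthy_Meals': np.random.randint(1, 4, 300),
    'Sleep_Hours': np.random.normal(7, 1.5, 300),
    'Stress_Level': np.random.randint(1, 10, 300),
    'BMI': np.random.normal(25, 4, 300)
})
data_scaled = StandardScaler().fit_transform(data)
pca_data = PCA(n_components=2).fit_transform(data_scaled)

clusters = KMeans(n_clusters=5, n_init=10, random_state=42).fit_predict(pca_data)

# Attach labels to the unscaled data so the averages are in real units
labeled = data.assign(Cluster=clusters)
profile = labeled.groupby('Cluster').mean().round(2)
profile['Size'] = labeled.groupby('Cluster').size()
print(profile)
```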

## Conclusion
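We grouped the synthetic samples into five clusters with K-Means and inspected their hierarchy with Ward linkage. PCA reduced the five features to two components for plotting; whether it also improves cluster quality would need a silhouette comparison in both feature spaces.
These methods could help healthcare organizations find trends and tailor wellness plans.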
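
One way to gauge PCA's effect on the clustering is to compare silhouette scores with and without the projection. The sketch below is an illustrative check, not part of the original analysis: it assumes k=5 and `random_state=42`, and all variable names are introduced here. Scores computed in different feature spaces are not directly comparable, so the PCA-based labels are also scored in the full 5-feature space.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Rebuild the synthetic dataset so this cell runs on its own
np.random.seed(42)
data = pd.DataFrame({
    'Exercise_Minutes': np.random.normal(30, 10, 300),
    'Healthy_Meals': np.random.randint(1, 4, 300),
    'Sleep_Hours': np.random.normal(7, 1.5, 300),
    'Stress_Level': np.random.randint(1, 10, 300),
    'BMI': np.random.normal(25, 4, 300)
})
data_scaled = StandardScaler().fit_transform(data)
pca_data = PCA(n_components=2).fit_transform(data_scaled)

k = 5  # assumed cluster count for this comparison
labels_full = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(data_scaled)
labels_pca = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(pca_data)

# Each clustering scored in the space it was fitted on
score_full = silhouette_score(data_scaled, labels_full)
score_pca = silhouette_score(pca_data, labels_pca)

# PCA-based labels scored in the full space for a like-for-like comparison
score_pca_in_full = silhouette_score(data_scaled, labels_pca)

print(f'K-Means on scaled data:          {score_full:.3f}')
print(f'K-Means on PCA data (2-D space): {score_pca:.3f}')
print(f'PCA labels scored in 5-D space:  {score_pca_in_full:.3f}')
```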