In [1]:
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
import pandas as pd

# Generate synthetic data
X, _ = make_blobs(n_samples=1000, centers=3, n_features=4, random_state=42)

# Fit KMeans clustering model
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(X)

# Get cluster labels and add them to the original data
cluster_labels = kmeans.labels_
data_with_clusters = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(X.shape[1])])
data_with_clusters['cluster'] = cluster_labels

# Analyze feature means for each cluster
cluster_means = data_with_clusters.groupby('cluster').mean()
print("Feature Means for Each Cluster:")
print(cluster_means)

# Extract the cluster centers (centroid coordinates); these are locations, not importance scores
cluster_centers = kmeans.cluster_centers_
feature_importance = pd.DataFrame(cluster_centers, columns=[f"feature_{i}" for i in range(X.shape[1])])
feature_importance.index.name = 'cluster'
print("\nFeature Importance for Each Cluster:")
print(feature_importance)


Feature Means for Each Cluster:
         feature_0  feature_1  feature_2  feature_3
cluster                                            
0         2.081282   4.081350  -9.654618   9.438292
1        -2.479722   9.036261   4.681010   2.024284
2        -6.883695  -6.744685  -8.852820   7.315158

Feature Importance for Each Cluster:
         feature_0  feature_1  feature_2  feature_3
cluster                                            
0         2.081282   4.081350  -9.654618   9.438292
1        -2.479722   9.036261   4.681010   2.024284
2        -6.883695  -6.744685  -8.852820   7.315158





This code generates synthetic data with four features and three clusters using `make_blobs`, then fits a KMeans clustering model to it. After clustering, it computes the mean feature values for each cluster and extracts the cluster centers (centroid coordinates). The two printed tables match because a K-means centroid is the mean of the points assigned to it (up to the convergence tolerance).

The `cluster_means` DataFrame provides the mean feature values for each cluster, allowing you to understand the typical values of each feature within each cluster.

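Despite its name, the `feature_importance` DataFrame holds the cluster centers: each row corresponds to a cluster and each column to a feature. These values are centroid coordinates in the original feature units, not importance scores, so a larger value does not mean that the feature contributes more to that cluster. What distinguishes clusters is how far apart their centroids are on a feature relative to the spread within the clusters. For example, `feature_2` clearly separates cluster 1 (4.68) from clusters 0 and 2 (-9.65 and -8.85), but barely separates clusters 0 and 2 from each other.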

By comparing the centroids across clusters, you can describe what characterizes each cluster and see which features separate them. Because raw differences depend on each feature's scale, comparing features fairly requires a standardized measure of separation.
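
One scale-independent way to quantify how much a feature separates the clusters is the fraction of its total variance that lies between clusters (often called eta squared). The sketch below is an illustration, not part of the original notebook: the function name `variance_explained_by_clusters` and the toy data are assumptions made for the example, and the function assumes no feature is constant.

```python
import numpy as np

def variance_explained_by_clusters(X, labels):
    """Per-feature fraction of total variance that lies between clusters.

    Hypothetical helper for illustration. Returns values in [0, 1];
    assumes every feature has non-zero variance.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    grand_mean = X.mean(axis=0)
    # Total sum of squares per feature
    total_ss = ((X - grand_mean) ** 2).sum(axis=0)
    # Between-cluster sum of squares per feature
    between_ss = np.zeros(X.shape[1])
    for k in np.unique(labels):
        members = X[labels == k]
        between_ss += len(members) * (members.mean(axis=0) - grand_mean) ** 2
    return between_ss / total_ss

# Toy data: feature 0 separates the two groups, feature 1 does not
X_toy = np.array([[0.0, 5.0], [1.0, 6.0], [10.0, 5.0], [11.0, 6.0]])
labels_toy = np.array([0, 0, 1, 1])
print(variance_explained_by_clusters(X_toy, labels_toy))
```

With the variables from the cell above, the call would be `variance_explained_by_clusters(X, kmeans.labels_)`. Values near 1 mean the clusters account for almost all of a feature's variation; values near 0 mean the feature looks the same in every cluster.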

## Cohen's d Clustering Analysis

Cohen's d is a measure of effect size that expresses the difference between the means of two groups in standard deviation units. It is usually reported alongside hypothesis tests to show how large a difference is, rather than whether it is statistically significant. The same idea can be applied to a clustering analysis to measure how strongly each feature separates a pair of clusters.

Here's how you can use Cohen's d statistic to assess the importance of differences between clusters based on feature values:

1. **Calculate Cluster Means and Standard Deviations:**
   - For each feature, calculate the mean and standard deviation within each cluster.

2. **Compute Cohen's d for Each Feature:**
   - For each feature, compute Cohen's d statistic between pairs of clusters.
   - Cohen's d is calculated as the difference between the means of two clusters divided by the pooled standard deviation.

3. **Interpretation:**
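   - Judge the magnitude by the absolute value; the sign only shows which cluster has the higher mean.
   - By Cohen's conventional benchmarks, |d| around 0.2 is small, 0.5 is medium, and 0.8 or more is large, indicating a substantial difference between clusters for that feature.
   - Values of |d| near zero suggest that the two clusters barely differ on that feature.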

4. **Significance Testing (Optional):**
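   - You can optionally test whether the differences between clusters are statistically significant, using t-tests or non-parametric tests depending on the distribution of the data. Keep in mind that the clusters were chosen to separate the data, so tests run on the same data used for clustering tend to overstate significance; treat the resulting p-values as descriptive.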
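
Written out, the calculation in step 2 takes the following form for two clusters with sizes $n_1, n_2$, means $\bar{x}_1, \bar{x}_2$, and sample variances $s_1^2, s_2^2$ (the symbols are notation introduced here for illustration):

$$
d = \frac{\bar{x}_1 - \bar{x}_2}{s_p},
\qquad
s_p = \sqrt{\frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2}}
$$

For instance, if two clusters have means 10 and 8 on a feature and both have a standard deviation of 2, then $s_p = 2$ and $d = (10 - 8)/2 = 1$, a large effect by the benchmarks in step 3.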

The following simplified example calculates Cohen's d for each feature between every pair of clusters:


In [3]:
import numpy as np

def cohen_d(x, y):
    """
    Compute Cohen's d between two groups using the pooled standard deviation.
    """
    nx = len(x)
    ny = len(y)
    dof = nx + ny - 2
    # Pooled variance: sample variances (ddof=1) weighted by their degrees of freedom
    pooled_var = ((nx - 1) * np.var(x, ddof=1) + (ny - 1) * np.var(y, ddof=1)) / dof
    d = (np.mean(x) - np.mean(y)) / np.sqrt(pooled_var)
    return d

# Assuming 'data_with_clusters' contains feature data with cluster labels
clusters = data_with_clusters['cluster'].unique()

# Calculate Cohen's d for each feature between pairs of clusters
cohen_d_values = {}
for feature in data_with_clusters.columns[:-1]:  # Exclude the 'cluster' column
    for i, cluster1 in enumerate(clusters):
        for cluster2 in clusters[i+1:]:
            group1 = data_with_clusters[data_with_clusters['cluster'] == cluster1][feature]
            group2 = data_with_clusters[data_with_clusters['cluster'] == cluster2][feature]
            d = cohen_d(group1, group2)
            key = f"{feature}_({cluster1}, {cluster2})"
            cohen_d_values[key] = d

# Print Cohen's d values
for key, value in cohen_d_values.items():
    print(f"{key}: {value}")


feature_0_(1, 0): -4.6567512673119795
feature_0_(1, 2): 4.689583509913727
feature_0_(0, 2): 9.162977574582701
feature_1_(1, 0): 4.765364823116116
feature_1_(1, 2): 15.921366443059682
feature_1_(0, 2): 10.871355828924463
feature_2_(1, 0): 14.412185361468007
feature_2_(1, 2): 13.484863049433713
feature_2_(0, 2): -0.7812998043968105
feature_3_(1, 0): -7.4411647366494
feature_3_(1, 2): -5.250361978068482
feature_3_(0, 2): 2.090582126467888




In this code:
- We define a function `cohen_d` that computes Cohen's d between two groups using the pooled standard deviation.
- We iterate over each feature and each pair of clusters, computing Cohen's d for every combination.
- The resulting `cohen_d_values` dictionary maps each feature and cluster pair to its Cohen's d value. Pairs are labeled in the order returned by `unique()`, and d is positive when the first cluster in the pair has the higher mean.

The Cohen's d values show which features contribute most to the differences between clusters. In the output above, every |d| exceeds 2 except for `feature_2` between clusters 0 and 2 (d ≈ -0.78), which matches their close means on that feature (-9.65 vs -8.85). Those two clusters are therefore told apart mainly by the other features.
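
To compare features at a glance, the pairwise values can be summarized per feature. The sketch below is one possible approach, not part of the original notebook: the helper name `rank_features_by_effect_size` is hypothetical, it assumes keys formatted like those in `cohen_d_values`, and the example dictionary reuses a few values from the output above, rounded to two decimals. It reports the smallest |d| alongside the mean, because an average can hide a pair of clusters that a feature fails to separate.

```python
from collections import defaultdict

def rank_features_by_effect_size(d_values):
    """Summarize pairwise Cohen's d values per feature.

    Hypothetical helper for illustration. Expects keys such as
    "feature_0_(1, 0)" and returns (feature, (min |d|, mean |d|))
    tuples sorted by mean |d| in descending order.
    """
    grouped = defaultdict(list)
    for key, d in d_values.items():
        # Strip the trailing "_(a, b)" pair label to recover the feature name
        feature = key.rsplit("_(", 1)[0]
        grouped[feature].append(abs(d))
    summary = {
        feature: (min(values), sum(values) / len(values))
        for feature, values in grouped.items()
    }
    return sorted(summary.items(), key=lambda item: item[1][1], reverse=True)

# Rounded values copied from the notebook output for two of the features
example = {
    "feature_2_(1, 0)": 14.41,
    "feature_2_(1, 2)": 13.48,
    "feature_2_(0, 2)": -0.78,
    "feature_3_(1, 0)": -7.44,
    "feature_3_(1, 2)": -5.25,
    "feature_3_(0, 2)": 2.09,
}

ranked = rank_features_by_effect_size(example)
for feature, (smallest, average) in ranked:
    print(f"{feature}: min |d| = {smallest:.2f}, mean |d| = {average:.2f}")
```

In this example `feature_2` has the higher mean |d| but also the lowest minimum, which reflects its weak separation of clusters 0 and 2. In the notebook, the same call would be `rank_features_by_effect_size(cohen_d_values)`.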