# K-Means Clustering for Customer Segmentation

This notebook demonstrates the application of the K-Means clustering algorithm for customer segmentation. We will explore a synthetic customer dataset, preprocess the data, train the model, evaluate the results, and derive insights on customer behavior.

## Introduction

K-Means clustering is an unsupervised machine learning algorithm used to partition data into distinct clusters. Its primary objective is to minimize the within-cluster variance, thus grouping similar data points together. 

**Significance and Real-World Applications:**

- **Customer Segmentation:** Helps businesses tailor marketing strategies based on customer purchasing behavior and demographics.
- **Image Segmentation:** Used for partitioning images into meaningful segments for object detection or medical imaging.
- **Gene Expression Analysis:** Aids in identifying patterns in gene expression for biological research.

In this notebook, we focus on customer segmentation to understand different customer groups.

## Dataset Description & Exploratory Analysis

### Dataset Description

For demonstration purposes, we use a synthetic customer dataset with the following features:

- **CustomerID:** Unique identifier for each customer.
- **Age:** Age of the customer.
- **Annual Income (k$):** Annual income in thousands of dollars.
- **Spending Score (1-100):** A score from 1 to 100 that summarizes customer behavior and spending habits (simulated here).

The dataset is simulated to resemble common customer segmentation problems seen in retail analytics.

### Exploratory Analysis

We will perform an initial exploratory data analysis (EDA) to understand the distribution and relationships of the features.

In [None]:
# Import the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
# Create a synthetic customer dataset
np.random.seed(42)
n_samples = 200
data = {
    "CustomerID": np.arange(1, n_samples + 1),
    "Age": np.random.randint(18, 70, size=n_samples),
    "Annual Income (k$)": np.random.randint(15, 140, size=n_samples),
    "Spending Score (1-100)": np.random.randint(1, 101, size=n_samples)
}

df = pd.DataFrame(data)
df.head()

In [None]:
# Basic exploratory analysis
print("Dataset Summary:\n", df.describe())

In [None]:
print("\nDataset Info:\n")
df.info()  # info() prints its report directly and returns None, so it is not wrapped in print()

In [None]:
# Pairplot to visualize relationships between features
sns.pairplot(df[['Age', 'Annual Income (k$)', 'Spending Score (1-100)']], diag_kind='kde')
plt.show()

## Data Preprocessing

### Steps Involved

1. **Handling Missing Values:** Check and impute or drop missing values. (Our synthetic data does not have missing values.)
2. **Feature Scaling:** Standardize features to bring them to the same scale. This is important for distance-based algorithms like K-Means.
3. **Dimensionality Reduction (if necessary):** In cases of high-dimensional data, techniques like PCA may be applied. In this project, we work with three features, so this step is not required.
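
To make step 2 concrete, the sketch below standardizes a small array by hand and compares the result with `StandardScaler`. The array `X_toy` and its values are illustrative assumptions, not rows of the notebook's dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: columns are age, income (k$), spending score
X_toy = np.array([
    [25.0, 30.0, 80.0],
    [40.0, 90.0, 20.0],
    [60.0, 120.0, 50.0],
])

# Manual z-score: subtract the column mean and divide by the column
# standard deviation (population std, ddof=0, as StandardScaler does)
z_manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

z_sklearn = StandardScaler().fit_transform(X_toy)

print(np.allclose(z_manual, z_sklearn))
print(z_sklearn.mean(axis=0).round(6), z_sklearn.std(axis=0).round(6))
```

After scaling, every column has mean 0 and standard deviation 1, so no single feature dominates the Euclidean distances merely because of its units.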

Let's proceed with scaling the data.

In [None]:
# Check for missing values
print('Missing values in each column:\n', df.isnull().sum())

In [None]:
from sklearn.preprocessing import StandardScaler

# Drop CustomerID as it's not relevant for clustering
df_clean = df.drop('CustomerID', axis=1)

# Feature scaling using StandardScaler
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df_clean)

# Convert the scaled data back to a DataFrame
df_scaled = pd.DataFrame(df_scaled, columns=df_clean.columns)
df_scaled.head()

## Mathematical Explanation

The goal of K-Means is to partition the dataset into **K** clusters such that the within-cluster sum of squares is minimized.

### Objective Function

The objective function is defined as:

$J = \sum_{i=1}^{K} \sum_{x \in C_i} \|x - \mu_i\|^2$

where:

- **$K$**: Number of clusters
- **$C_i$**: Set of points in cluster *i*
- **$\mu_i$**: Centroid of cluster *i*
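
As a rough check of the formula, the sketch below computes $J$ by hand for a fitted model and compares it with scikit-learn's `inertia_` attribute. The data `X` and the three `centers` are made-up assumptions used only for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Hypothetical data: 60 points in 2-D scattered around three made-up centers
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# J = sum over clusters of squared distances from each member to its centroid
J = 0.0
for i in range(km.n_clusters):
    members = X[km.labels_ == i]
    J += np.sum((members - km.cluster_centers_[i]) ** 2)

print(J, km.inertia_)
```

The two printed numbers should agree up to floating-point error, which is why `inertia_` can serve as the WCSS value in an elbow plot.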

### Distance Metric

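K-Means uses the Euclidean distance between a point $x$ and a centroid $\mu$ (each with $n$ features) to decide which cluster the point belongs to; a smaller distance indicates greater similarity: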

$d(x, \mu) = \sqrt{\sum_{j=1}^{n} (x_j - \mu_j)^2}$
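
The distance formula drives the assignment step: each point is given to its nearest centroid. A minimal sketch with NumPy broadcasting follows; the points and centroids are hypothetical values chosen so the result is easy to verify by eye.

```python
import numpy as np

# Hypothetical points and centroids in 2-D
X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5]])
centroids = np.array([[1.0, 1.5], [8.5, 9.0]])

# Pairwise Euclidean distances, shape (n_points, n_centroids)
diff = X[:, np.newaxis, :] - centroids[np.newaxis, :, :]
distances = np.sqrt((diff ** 2).sum(axis=2))

# Assignment step: index of the nearest centroid for each point
labels = distances.argmin(axis=1)
print(labels)  # [0 0 1 1]
```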

### Centroid Update

At each iteration, the centroid of each cluster is updated as the mean of all points assigned to that cluster:

$\mu_i = \frac{1}{|C_i|} \sum_{x \in C_i} x$
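
A small worked instance of the update rule, assuming a hypothetical assignment of four points to two clusters:

```python
import numpy as np

X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5]])
labels = np.array([0, 0, 1, 1])  # hypothetical current assignment
K = 2

# Update step: each centroid becomes the mean of the points assigned to it
centroids = np.array([X[labels == i].mean(axis=0) for i in range(K)])
print(centroids)  # [[1.25 1.5 ] [8.5  8.75]]
```

This sketch assumes every cluster has at least one member; an empty cluster would produce a NaN mean and needs separate handling.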

### Convergence Criteria

The algorithm iterates between assigning points to clusters and updating centroids until the centroids do not change significantly or a maximum number of iterations is reached.
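
The loop described above can be sketched in a few lines of NumPy. This is a simplified illustration, not scikit-learn's implementation: the function name `kmeans_lloyd`, its parameters, the plain random initialization, and the toy data are all assumptions made for the example.

```python
import numpy as np

def kmeans_lloyd(X, k, max_iter=100, tol=1e-4, seed=0):
    """Minimal K-Means sketch; returns (centroids, labels, n_iter)."""
    rng = np.random.default_rng(seed)
    # Simple random initialization: pick k distinct data points
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for n_iter in range(1, max_iter + 1):
        # Assignment step: nearest centroid by squared Euclidean distance
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: mean of assigned points (an empty cluster keeps its centroid)
        new_centroids = centroids.copy()
        for i in range(k):
            members = X[labels == i]
            if len(members) > 0:
                new_centroids[i] = members.mean(axis=0)
        # Convergence check: stop once the centroids barely move
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break
    return centroids, labels, n_iter

# Hypothetical toy data: two blobs in 2-D
rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.3, size=(50, 2)),
    rng.normal(loc=[5.0, 5.0], scale=0.3, size=(50, 2)),
])
centroids, labels, n_iter = kmeans_lloyd(X, k=2)
print(centroids.round(2), n_iter)
```

Because the result depends on the starting centroids, a single run can settle in a local minimum. scikit-learn mitigates this with k-means++ initialization and, via `n_init`, by keeping the best of several runs.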

## Model Training & Evaluation

We will now implement the K-Means clustering algorithm using scikit-learn. To determine the optimal number of clusters, we'll utilize the Elbow Method and the Silhouette Score.

In [None]:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

wcss = []
silhouette_scores = []
K_range = range(2, 11)  # testing cluster numbers from 2 to 10

for k in K_range:
    # n_init is set explicitly because its default differs across scikit-learn versions
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
    kmeans.fit(df_scaled)
    wcss.append(kmeans.inertia_)
    cluster_labels = kmeans.labels_
    silhouette_avg = silhouette_score(df_scaled, cluster_labels)
    silhouette_scores.append(silhouette_avg)

In [None]:
# Plotting the Elbow Method
plt.figure(figsize=(8, 5))
plt.plot(K_range, wcss, marker='o')
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.tight_layout()
plt.show()

In [None]:
# Plotting the Silhouette Scores
plt.figure(figsize=(8, 5))
plt.plot(K_range, silhouette_scores, marker='o', color='orange')
plt.title('Silhouette Scores')
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette Score')
plt.tight_layout()
plt.show()

In [None]:
# The synthetic features are uniformly random, so the plots above may not show a sharp elbow;
# for illustration we proceed with 4 clusters
optimal_k = 4
kmeans_final = KMeans(n_clusters=optimal_k, n_init=10, random_state=42)
cluster_labels = kmeans_final.fit_predict(df_scaled)

# Append the cluster labels to the original dataframe
df['Cluster'] = cluster_labels
df.head()

## Model Analysis & Visualization

We now visualize the clustering results. Below are two key visualizations:

1. **Scatter Plot:** Plots 'Annual Income' vs. 'Spending Score' colored by cluster labels.
2. **Heatmap:** Shows the average feature values for each cluster (cluster profiles).

In [None]:
# Scatter Plot: Annual Income vs Spending Score
plt.figure(figsize=(8, 6))
sns.scatterplot(x='Annual Income (k$)', y='Spending Score (1-100)', hue='Cluster', 
                palette='viridis', data=df, legend='full')
plt.title('Customer Segmentation (Scatter Plot)')
plt.show()

In [None]:
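# Heatmap: Cluster Profiles (CustomerID is an identifier, so it is excluded from the averages)
cluster_profile = df.drop(columns='CustomerID').groupby('Cluster').mean()
plt.figure(figsize=(8, 6))
sns.heatmap(cluster_profile, annot=True, fmt='.1f', cmap='coolwarm')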
plt.title('Cluster Profile Heatmap')
plt.show()
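
Cluster profiles can also be read from the fitted centroids. Since the model is trained on standardized features, `cluster_centers_` is expressed in z-score units, and `scaler.inverse_transform` maps it back to the original units. The sketch below is self-contained and uses a hypothetical stand-in table named `features` rather than the notebook's `df`.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Hypothetical stand-in for the notebook's feature table
features = pd.DataFrame({
    "Age": rng.integers(18, 70, size=200),
    "Annual Income (k$)": rng.integers(15, 140, size=200),
    "Spending Score (1-100)": rng.integers(1, 101, size=200),
})

scaler = StandardScaler()
X_scaled = scaler.fit_transform(features)
km = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X_scaled)

# Centroids live in standardized space; map them back to years, k$, and score points
centroids_original = pd.DataFrame(
    scaler.inverse_transform(km.cluster_centers_),
    columns=features.columns,
)
print(centroids_original.round(1))
```

Each row of this table should be close to the per-cluster feature means, so it offers a compact numeric companion to a profile heatmap.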

## Discussion

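The clustering splits the customers into four segments. Because the synthetic features are sampled independently and uniformly, the segments here partition a roughly uniform cloud of points rather than recover naturally occurring groups. On real customer data, the cluster profiles are typically easier to interpret. For instance: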

- Some clusters may correspond to high-income customers with a high spending score, indicating premium customers.
- Other clusters might represent lower-income customers with different spending behaviors.

Potential challenges include:

- **Scale Sensitivity:** K-Means is sensitive to the scale of features, making preprocessing crucial.
- **Choosing K:** The choice of the number of clusters can greatly affect results.
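
One way to make the choice of K less ad hoc is to pick the value with the highest average silhouette score. The sketch below illustrates this on hypothetical data generated with `make_blobs` (four well-separated groups, not the notebook's dataset); on data without clear structure the maximum may be weak and should be treated with caution.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Hypothetical data with four well-separated groups
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 0], [0, 10], [10, 10]],
    cluster_std=0.8,
    random_state=0,
)

scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# Select the k with the highest average silhouette score
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```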

Future improvements could involve:

- Experimenting with different distance metrics or clustering algorithms.
- Incorporating additional customer features for richer segmentation.
- Using dimensionality reduction techniques if additional features are included.
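
As an illustration of the last point, scaling, PCA, and K-Means can be chained in a scikit-learn pipeline. The 10-feature dataset below is a hypothetical placeholder generated with `make_blobs`, and the 90% variance threshold is an arbitrary choice for the example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical wider dataset: 300 customers, 10 numeric features
X, _ = make_blobs(n_samples=300, n_features=10, centers=3, random_state=0)

pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.90),  # keep enough components to explain about 90% of the variance
    KMeans(n_clusters=3, n_init=10, random_state=0),
)
labels = pipeline.fit_predict(X)

pca = pipeline.named_steps["pca"]
print(pca.n_components_, pca.explained_variance_ratio_.sum().round(3))
```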

## Conclusion

In this project, we applied the K-Means clustering algorithm to perform customer segmentation. Through exploratory analysis, data preprocessing, and model evaluation (using the Elbow Method and Silhouette Score), we partitioned the customers into four groups and profiled each one. Applied to real customer data, this kind of segmentation can help businesses tailor their marketing and service strategies to customer profiles.

## References

1. MacQueen, J. (1967). _Some Methods for Classification and Analysis of Multivariate Observations_. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability.
2. Jain, A. K. (2010). _Data clustering: 50 years beyond K-means_. Pattern Recognition Letters, 31(8), 651-666.
3. Pedregosa, F., et al. (2011). _Scikit-learn: Machine Learning in Python_. Journal of Machine Learning Research, 12, 2825-2830.

Additional resources and documentation from scikit-learn and relevant data science tutorials were also used.