# Retail Customer Segmentation using K-Means Clustering

This notebook demonstrates customer segmentation with the K-Means clustering algorithm. We cluster customers on two features, annual income and spending score, to identify distinct customer groups.

## 1. Import Required Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Set plot style and palette (the bare 'seaborn' style name is unavailable in recent Matplotlib releases)
sns.set_theme(style='darkgrid', palette='husl')

## 2. Load and Explore the Dataset

In [None]:
# Load the dataset
df = pd.read_csv('Mall_Customers.csv')

# Display first few rows
print('First few rows of the dataset:\n')
display(df.head())

# Dataset information
print('\nDataset Info:\n')
df.info()  # info() prints directly and returns None, so it is not wrapped in print()

# Basic statistics
print('\nBasic Statistics:\n')
display(df.describe())

## 3. Data Preprocessing

In [None]:
# Select features for clustering
features = ['Annual Income (k$)', 'Spending Score (1-100)']
X = df[features]

# Scale the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print('Scaled features shape:', X_scaled.shape)
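
`StandardScaler` rescales each column to zero mean and unit variance, so income (in k$) and spending score (1 to 100) contribute comparably to the Euclidean distances that K-Means minimizes. The following sketch checks this against a manual z-score on a small made-up array (the values are illustrative, not taken from the dataset):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical sample: columns stand in for income (k$) and spending score
sample = np.array([[15.0, 39.0],
                   [16.0, 81.0],
                   [54.0, 50.0],
                   [78.0, 17.0],
                   [137.0, 83.0]])

scaled = StandardScaler().fit_transform(sample)

# Manual z-score; StandardScaler uses the population standard deviation (ddof=0)
manual = (sample - sample.mean(axis=0)) / sample.std(axis=0)

print('Matches manual z-score:', np.allclose(scaled, manual))
print('Column means:', scaled.mean(axis=0).round(6))
print('Column stds: ', scaled.std(axis=0).round(6))
```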

## 4. Determine Optimal Number of Clusters using Elbow Method

In [None]:
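# Compute inertia (within-cluster sum of squared distances) for each k
distortions = []
K = range(1, 11)

for k in K:
    # n_init is set explicitly so results do not depend on the scikit-learn version's default
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
    kmeans.fit(X_scaled)
    distortions.append(kmeans.inertia_)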

# Plot elbow curve
plt.figure(figsize=(10, 6))
plt.plot(K, distortions, 'bx-')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia (within-cluster sum of squares)')
plt.title('Elbow Method for Choosing k')
plt.show()
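
The elbow can be hard to read by eye. One common cross-check is to compare the mean silhouette score across candidate values of k and see whether it peaks near the elbow. The sketch below uses synthetic blobs (five made-up centers, an assumption for illustration) rather than the mall data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: five well-separated blobs (hypothetical centers)
rng = np.random.default_rng(0)
centers = np.array([[20, 20], [20, 80], [55, 50], [90, 20], [90, 80]], dtype=float)
points = np.vstack([c + rng.normal(scale=3.0, size=(40, 2)) for c in centers])
points_scaled = StandardScaler().fit_transform(points)

# The silhouette score is undefined for a single cluster, so start at k=2
scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(points_scaled)
    scores[k] = silhouette_score(points_scaled, labels)

best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f'k={k}: silhouette={score:.3f}')
print('Best k by silhouette:', best_k)
```

On real data the peak is usually less sharp than on clean blobs, so it is best read alongside the elbow curve rather than in place of it.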

## 5. Perform K-Means Clustering

In [None]:
# Apply K-Means clustering
optimal_k = 5  # Chosen from the elbow in the curve above
kmeans = KMeans(n_clusters=optimal_k, n_init=10, random_state=42)
clusters = kmeans.fit_predict(X_scaled)

# Calculate silhouette score
silhouette_avg = silhouette_score(X_scaled, clusters)
print(f'Silhouette Score: {silhouette_avg:.3f}')
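
A single average can hide one poorly separated cluster. Silhouette values range from -1 to 1, and values near 1 mean a point sits well inside its own cluster. As a rough sketch on synthetic data (the blobs are made up for illustration), `silhouette_samples` returns per-point values that can be averaged per cluster:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

# Synthetic stand-in data: three blobs, the last one deliberately more spread out
rng = np.random.default_rng(1)
blobs = np.vstack([
    rng.normal(loc=[0, 0], scale=0.3, size=(50, 2)),
    rng.normal(loc=[5, 0], scale=0.3, size=(50, 2)),
    rng.normal(loc=[2.5, 5], scale=1.0, size=(50, 2)),
])

labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(blobs)
per_point = silhouette_samples(blobs, labels)

# Mean silhouette per cluster; a low value flags a loose or overlapping cluster
per_cluster = {int(c): float(per_point[labels == c].mean()) for c in np.unique(labels)}
for cluster_id, value in per_cluster.items():
    print(f'Cluster {cluster_id}: mean silhouette = {value:.3f}')

# The overall score is the mean of the per-point values
overall = silhouette_score(blobs, labels)
print(f'Overall: {overall:.3f}')
```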

## 6. Visualize the Clusters

In [None]:
# Create scatter plot
plt.figure(figsize=(10, 6))
scatter = plt.scatter(df['Annual Income (k$)'], df['Spending Score (1-100)'], 
                     c=clusters, cmap='viridis')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.title('Customer Segments')
plt.legend(*scatter.legend_elements(), title='Cluster')  # cluster IDs are labels, not a continuous scale
plt.show()
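
Because the model is fitted on scaled features while the plot uses the original units, `kmeans.cluster_centers_` cannot be drawn on these axes directly. A minimal sketch of converting centroids back with `inverse_transform`, again on synthetic data with hypothetical centers:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for (income, spending score); the centers are hypothetical
rng = np.random.default_rng(0)
true_centers = np.array([[20, 20], [20, 80], [55, 50], [90, 20], [90, 80]], dtype=float)
data = np.vstack([c + rng.normal(scale=3.0, size=(40, 2)) for c in true_centers])

demo_scaler = StandardScaler()
data_scaled = demo_scaler.fit_transform(data)
model = KMeans(n_clusters=5, n_init=10, random_state=42).fit(data_scaled)

# cluster_centers_ live in scaled space; map them back to the original units
centers_original = demo_scaler.inverse_transform(model.cluster_centers_)
print(np.round(centers_original, 1))
```

The recovered centroids could then be overlaid on the scatter plot with something like `plt.scatter(centers_original[:, 0], centers_original[:, 1], marker='X', c='red')`.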

## 7. Analyze Clusters

In [None]:
# Add cluster labels to the original dataframe
df['Cluster'] = clusters

# Calculate cluster statistics
cluster_stats = df.groupby('Cluster').agg({
    'Annual Income (k$)': ['mean', 'min', 'max'],
    'Spending Score (1-100)': ['mean', 'min', 'max'],
    'CustomerID': 'count'
}).round(2)

print('Cluster Statistics:\n')
display(cluster_stats)

# Save segmented customers
df.to_csv('segmented_customers.csv', index=False)

## 8. Conclusions

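With k = 5 on income and spending score, the analysis points to five customer segments. Cluster numbers are arbitrary, so match each cluster to a segment using the cluster statistics from Section 7: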

1. **High Income, High Spending**: Target for luxury products
2. **High Income, Low Spending**: Potential for targeted marketing
3. **Average Income, Average Spending**: Standard customers
4. **Low Income, High Spending**: Careful credit assessment needed
5. **Low Income, Low Spending**: Budget-conscious customers
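
Matching cluster numbers to the segment names above needs an explicit rule. One possible sketch labels a centroid by its position in scaled (z-score) space. The `level` and `describe_segment` helpers and the 0.5 threshold are assumptions for illustration, not part of the analysis in this notebook:

```python
def level(z, threshold=0.5):
    """Map a z-score to 'Low', 'Average', or 'High' (threshold is a hypothetical cutoff)."""
    if z > threshold:
        return 'High'
    if z < -threshold:
        return 'Low'
    return 'Average'


def describe_segment(z_income, z_spending, threshold=0.5):
    """Build a segment name from a centroid's scaled income and spending values."""
    return f'{level(z_income, threshold)} Income, {level(z_spending, threshold)} Spending'


# Hypothetical scaled centroids: (income z-score, spending z-score)
example_centers = [(1.0, 1.2), (1.1, -1.3), (-0.2, 0.0), (-1.3, 1.1), (-1.3, -1.1)]
for cluster_id, (z_inc, z_sp) in enumerate(example_centers):
    print(cluster_id, describe_segment(z_inc, z_sp))
```

With the fitted model, the inputs would be the rows of `kmeans.cluster_centers_`, which are already in scaled units.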

These insights can be used for:
- Targeted marketing campaigns
- Product recommendations
- Customer retention strategies
- Risk assessment