# Customer Segmentation Analysis for Marketing Strategy

This notebook implements customer segmentation using K-Means and DBSCAN clustering algorithms on the Mall Customer Segmentation dataset. The analysis identifies distinct customer segments based on spending behavior and demographic characteristics, which can be used to develop targeted marketing strategies.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import silhouette_score
from kneed import KneeLocator
import os
import warnings
warnings.filterwarnings("ignore")

# Set styling for plots
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette("colorblind")

# Create output directory for plots
if not os.path.exists('plots'):
    os.makedirs('plots')
    print("Created 'plots' directory for saving visualizations")

print("Setup complete!")

## 1. Data Loading and Exploration

First, we'll load the Mall Customer Segmentation dataset and explore its structure.

In [None]:
# Load the Mall Customer Segmentation dataset
try:
    # Attempt to load the dataset from a local file
    df = pd.read_csv('Mall_Customers.csv')
    print("Dataset loaded successfully from local file.")
except FileNotFoundError:
    # If file not found, download from URL
    print("Local file not found. Downloading from URL...")
    url = "https://raw.githubusercontent.com/jeffrey125/Mall-Customer-Segmentation/master/Mall_Customers.csv"
    df = pd.read_csv(url)
    # Save locally for future use
    df.to_csv('Mall_Customers.csv', index=False)
    print("Dataset saved locally as 'Mall_Customers.csv'")

print(f"Dataset shape: {df.shape}")
df.head()

In [None]:
# Display basic information about the dataset
print("Dataset Information:")
df.info()

In [None]:
# Summary statistics
print("Summary Statistics:")
df.describe()

In [None]:
# Check for missing values
print("Checking for missing values:")
df.isnull().sum()

## 2. Exploratory Data Analysis

Let's visualize the distributions and relationships in our dataset.

In [None]:
# Distribution of Age
plt.figure(figsize=(10, 6))
sns.histplot(df['Age'], kde=True)
plt.title('Distribution of Customer Age')
plt.savefig('plots/age_distribution.png')
plt.show()

In [None]:
# Distribution of Annual Income
plt.figure(figsize=(10, 6))
sns.histplot(df['Annual Income (k$)'], kde=True)
plt.title('Distribution of Annual Income')
plt.savefig('plots/income_distribution.png')
plt.show()

In [None]:
# Distribution of Spending Score
plt.figure(figsize=(10, 6))
sns.histplot(df['Spending Score (1-100)'], kde=True)
plt.title('Distribution of Spending Score')
plt.savefig('plots/spending_distribution.png')
plt.show()

In [None]:
# Gender distribution
plt.figure(figsize=(8, 6))
gender_counts = df['Gender'].value_counts()
plt.pie(gender_counts, labels=gender_counts.index, autopct='%1.1f%%', startangle=90)
plt.title('Gender Distribution')
plt.savefig('plots/gender_distribution.png')
plt.show()

In [None]:
# Annual Income vs Spending Score - Key relationship for segmentation
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Annual Income (k$)', y='Spending Score (1-100)', data=df, hue='Gender')
plt.title('Annual Income vs Spending Score')
plt.savefig('plots/income_vs_spending.png')
plt.show()

## 3. Data Preprocessing

Now we'll prepare our data for clustering by encoding categorical variables and scaling the features.

In [None]:
# Create a copy of the dataset for preprocessing
data = df.copy()

# Encode Gender (Male: 0, Female: 1)
data['Gender'] = data['Gender'].map({'Male': 0, 'Female': 1})

# Features for clustering
features = ['Age', 'Annual Income (k$)', 'Spending Score (1-100)', 'Gender']
X = data[features]

# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print("Data standardized.")

# Display the standardized features
standardized_df = pd.DataFrame(X_scaled, columns=features)
standardized_df.head()
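
To make the standardization step concrete, the sketch below reproduces what `StandardScaler` does on a tiny array: subtract each column's mean and divide by its population standard deviation. The values in `X_toy` are hypothetical and are not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy values standing in for age and annual income (k$)
X_toy = np.array([[19.0, 15.0],
                  [35.0, 60.0],
                  [52.0, 120.0]])

# StandardScaler uses the population standard deviation (ddof=0),
# which is also NumPy's default for np.std
z_manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)
z_sklearn = StandardScaler().fit_transform(X_toy)

# Both routes should give the same standardized values
print(np.allclose(z_manual, z_sklearn))
```

After this step every column has mean 0 and unit variance, so no single feature dominates the Euclidean distances that K-Means and DBSCAN rely on.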

## 4. Principal Component Analysis (PCA)

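We'll apply PCA to project the standardized features onto two components for visualization. The clustering in later sections still runs on the full standardized feature set, not on the PCA output.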

In [None]:
# Apply PCA for dimensionality reduction
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Plot the PCA results
plt.figure(figsize=(10, 8))
plt.scatter(X_pca[:, 0], X_pca[:, 1], alpha=0.7)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA of Customer Data')
plt.savefig('plots/pca_visualization.png')
plt.show()

# Explained variance
print(f"Explained variance ratio by the first two components: {pca.explained_variance_ratio_}")
print(f"Total explained variance: {sum(pca.explained_variance_ratio_):.2f}")
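
One way to judge how much a two-component plot hides is to look at the cumulative explained variance across all components. The sketch below does this on synthetic data; `X_demo` is a hypothetical stand-in for the four standardized features, so the printed numbers say nothing about the real dataset.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))   # hypothetical stand-in for four features
X_demo[:, 1] += 0.8 * X_demo[:, 0]   # make two columns correlated

X_demo_scaled = StandardScaler().fit_transform(X_demo)
pca_full = PCA().fit(X_demo_scaled)  # no n_components: keep every component
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

for i, share in enumerate(cumulative, start=1):
    print(f"First {i} component(s): {share:.2f} of total variance")
```

If the first two components capture only a modest share, clusters that look overlapping in the 2D plot may still be well separated in the full feature space.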

## 5. K-Means Clustering

We'll apply K-Means clustering to identify customer segments.

In [None]:
# Find a suitable number of clusters using the Elbow Method (WCSS) and silhouette scores
wcss = []
silhouette_scores = []
range_n_clusters = range(2, 11)

for n_clusters in range_n_clusters:
    kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
    cluster_labels = kmeans.fit_predict(X_scaled)
    
    wcss.append(kmeans.inertia_)
    silhouette_scores.append(silhouette_score(X_scaled, cluster_labels))

# Plot the Elbow Method
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.plot(range_n_clusters, wcss, marker='o')
plt.title('Elbow Method for Optimal k')
plt.xlabel('Number of Clusters')
plt.ylabel('WCSS')

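# Use KneeLocator to find the elbow point
kl = KneeLocator(list(range_n_clusters), wcss, curve="convex", direction="decreasing")
optimal_k = kl.elbow
if optimal_k is None:
    # No clear elbow detected: fall back to the k with the highest silhouette score
    optimal_k = range_n_clusters[int(np.argmax(silhouette_scores))]
print(f"Selected number of clusters: {optimal_k}")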

# Plot Silhouette Scores
plt.subplot(1, 2, 2)
plt.plot(range_n_clusters, silhouette_scores, marker='o')
plt.title('Silhouette Score for Optimal k')
plt.xlabel('Number of Clusters')
plt.ylabel('Silhouette Score')
plt.tight_layout()
plt.savefig('plots/kmeans_optimal_k.png')
plt.show()
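
The WCSS plotted in the elbow curve is the sum of squared distances from each point to its assigned cluster center, which scikit-learn exposes as `inertia_`. The following sketch checks that by hand on synthetic blobs; `X_blobs` is illustrative data, not the customer dataset.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X_blobs, _ = make_blobs(n_samples=150, centers=3, random_state=0)
km = KMeans(n_clusters=3, random_state=42, n_init=10).fit(X_blobs)

# Look up the center assigned to each point, then sum the squared distances
assigned_centers = km.cluster_centers_[km.labels_]
wcss_manual = np.sum((X_blobs - assigned_centers) ** 2)

print(f"manual WCSS: {wcss_manual:.3f}, inertia_: {km.inertia_:.3f}")
```

Because WCSS can only decrease as k grows, the elbow is a heuristic for diminishing returns rather than a true optimum, which is why the silhouette score is plotted alongside it.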

In [None]:
# Apply K-Means with the optimal number of clusters
kmeans = KMeans(n_clusters=optimal_k, random_state=42, n_init=10)
kmeans_labels = kmeans.fit_predict(X_scaled)

# Add cluster labels to the original dataframe
data['KMeans_Cluster'] = kmeans_labels

# Visualize the K-Means clusters in PCA space
plt.figure(figsize=(12, 10))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=kmeans_labels, cmap='viridis', alpha=0.8)
# Centers are already in standardized space (K-Means was fit on X_scaled), so only PCA is applied
centers_pca = pca.transform(kmeans.cluster_centers_)
plt.scatter(centers_pca[:, 0], centers_pca[:, 1], s=100, c='red', marker='X')
plt.title(f'K-Means Clustering with {optimal_k} Clusters (PCA Visualization)')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.colorbar(label='Cluster')
plt.savefig('plots/kmeans_clusters_pca.png')
plt.show()
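
Cluster centers from a model fitted on standardized data are expressed in standardized units. To read them in the original units, the scaling can be undone with `inverse_transform`. This is a minimal sketch on synthetic data; `X_raw`, `sc`, and `km` are hypothetical names used only for this example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Hypothetical income (k$) and spending-score columns
X_raw = np.column_stack([rng.uniform(15, 140, 120), rng.uniform(1, 100, 120)])

sc = StandardScaler()
X_std = sc.fit_transform(X_raw)
km = KMeans(n_clusters=3, random_state=42, n_init=10).fit(X_std)

# Centers live in standardized space; undo the scaling to get k$ and score points
centers_original = sc.inverse_transform(km.cluster_centers_)
print(np.round(centers_original, 1))
```

In this notebook the equivalent call would be `scaler.inverse_transform(kmeans.cluster_centers_)`, with the columns in the same order as `features`.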

### K-Means Cluster Analysis

In [None]:
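# Analyze the K-Means clusters
# Restrict to the clustering features so identifier columns are not averaged;
# the mean of the encoded Gender column is the share of female customers per cluster
print("K-Means Cluster Analysis:")
cluster_analysis = data.groupby('KMeans_Cluster')[features].mean()
cluster_analysis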

In [None]:
# Calculate cluster sizes
cluster_sizes = data['KMeans_Cluster'].value_counts().sort_index()
print("Cluster Sizes:")
for cluster, size in cluster_sizes.items():
    print(f"Cluster {cluster}: {size} customers")

In [None]:
# Visualize cluster characteristics
plt.figure(figsize=(15, 10))
for i, feature in enumerate(features):
    plt.subplot(2, 2, i+1)
    sns.boxplot(x='KMeans_Cluster', y=feature, data=data)
    plt.title(f'{feature} by Cluster')
plt.tight_layout()
plt.savefig('plots/kmeans_cluster_characteristics.png')
plt.show()

## 6. DBSCAN Clustering

Let's apply DBSCAN clustering for comparison.

In [None]:
# Apply DBSCAN; eps and min_samples come from prior analysis that is not shown in this notebook
dbscan = DBSCAN(eps=1.1, min_samples=3)
dbscan_labels = dbscan.fit_predict(X_scaled)

# Add cluster labels to the dataframe
data['DBSCAN_Cluster'] = dbscan_labels

# Get number of clusters (excluding noise points with label -1)
n_clusters = len(set(dbscan_labels)) - (1 if -1 in dbscan_labels else 0)
print(f"DBSCAN identified {n_clusters} clusters.")

# Calculate the silhouette score on non-noise points only; noise (-1) is not a cluster
non_noise = dbscan_labels != -1
if n_clusters > 1:
    db_silhouette = silhouette_score(X_scaled[non_noise], dbscan_labels[non_noise])
    print(f"Silhouette score (noise excluded): {db_silhouette:.3f}")
else:
    print("Cannot calculate silhouette score: fewer than two clusters found.")

# Count points in each cluster
unique_labels, counts = np.unique(dbscan_labels, return_counts=True)
for label, count in zip(unique_labels, counts):
    if label == -1:
        print(f"Noise points: {count}")
    else:
        print(f"Cluster {label}: {count} points")
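
The notebook does not show how `eps` was selected. One common heuristic, which may or may not be what was used here, is the k-distance curve: sort each point's distance to its k-th nearest neighbor and look for a sharp bend. The sketch below computes that curve on synthetic data; `X_demo` is illustrative and the resulting values do not apply to the customer dataset.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

X_demo, _ = make_blobs(n_samples=200, centers=4, random_state=0)
X_demo = StandardScaler().fit_transform(X_demo)

min_samples = 3  # same value as in the DBSCAN cell
# Querying the training set returns each point as its own nearest neighbor
# (distance 0), which matches DBSCAN counting the point itself in min_samples
nn = NearestNeighbors(n_neighbors=min_samples).fit(X_demo)
distances, _ = nn.kneighbors(X_demo)
k_distances = np.sort(distances[:, -1])

# A sharp bend in the sorted curve is a heuristic starting point for eps
print(np.percentile(k_distances, [50, 90, 99]).round(3))
```

Plotting `k_distances` with `plt.plot` makes the bend easier to spot than the percentiles alone.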

In [None]:
# Visualize DBSCAN clusters in PCA space
plt.figure(figsize=(12, 10))
# Create a colormap that handles noise points (-1) differently
n_clusters_for_color = max(1, len(set(dbscan_labels)) - (1 if -1 in dbscan_labels else 0))
colors = plt.cm.viridis(np.linspace(0, 1, n_clusters_for_color))

for i, label in enumerate(sorted(set(dbscan_labels))):
    if label == -1:
        # Black for noise points
        mask = dbscan_labels == label
        plt.scatter(X_pca[mask, 0], X_pca[mask, 1], c='black', marker='x', alpha=0.5, label='Noise')
    else:
        mask = dbscan_labels == label
        color_idx = i if -1 not in dbscan_labels else i-1
        plt.scatter(X_pca[mask, 0], X_pca[mask, 1], c=[colors[color_idx % len(colors)]], alpha=0.8, label=f'Cluster {label}')

plt.title('DBSCAN Clustering Results (PCA Visualization)')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend()
plt.savefig('plots/dbscan_clusters_pca.png')
plt.show()

## 7. Marketing Strategy Recommendations

Based on our customer segments, we can develop targeted marketing strategies.

In [None]:
# Save the segmented data
segmented_data = df.copy()
segmented_data['KMeans_Cluster'] = data['KMeans_Cluster']
segmented_data['DBSCAN_Cluster'] = data['DBSCAN_Cluster']
segmented_data.to_csv('segmented_customers.csv', index=False)
print("Segmented data saved to 'segmented_customers.csv'")

### Customer Segment Descriptions and Marketing Recommendations

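Based on our K-Means analysis, here are the identified customer segments and recommendations. The cluster numbers below assume a six-cluster solution; if `optimal_k` or the label order differs on your run, match each profile against the cluster means table before using it.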

**Standard Female Customers (Cluster 0)**:
- *Profile*: Middle-aged female customers with moderate income and moderate-to-low spending scores
- *Recommendations*: Basic loyalty programs, practical promotions, family-oriented marketing

**Young Male Spenders (Cluster 1)**:
- *Profile*: Young male customers with moderate income but high spending scores
- *Recommendations*: Trending products, technology-focused marketing, social media campaigns

**Young Female Budget Shoppers (Cluster 2)**:
- *Profile*: Young female customers with lower income but moderate-high spending scores
- *Recommendations*: Value-oriented promotions, trending but affordable product lines, influencer marketing

**Older Male Conservatives (Cluster 3)**:
- *Profile*: Older male customers with moderate income and low spending scores
- *Recommendations*: Quality-focused messaging, durability emphasis, value-for-money propositions

**Affluent High-Spenders (Cluster 4)**:
- *Profile*: Predominantly female customers with high income and very high spending scores
- *Recommendations*: Premium loyalty programs, exclusive shopping events, personalized shopping experiences

**Affluent Savers (Cluster 5)**:
- *Profile*: Mixed gender customers with high income but very low spending scores
- *Recommendations*: High-quality, long-lasting product marketing, emphasis on investment value, prestige branding

DBSCAN produced broader, largely gender-based segments, which could support general gender-specific campaigns. K-Means produced more detailed segments that are easier to act on for targeted marketing.
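
A cross-tabulation of the two label columns is a simple way to check how the K-Means and DBSCAN segments relate to each other. The sketch below uses a handful of hypothetical labels rather than the notebook's results.

```python
import pandas as pd

# Hypothetical labels for eight customers; -1 marks DBSCAN noise
labels = pd.DataFrame({
    "KMeans_Cluster": [0, 0, 1, 1, 2, 2, 3, 3],
    "DBSCAN_Cluster": [0, 0, 1, 1, 0, 0, 1, -1],
})

# Rows: K-Means clusters, columns: DBSCAN clusters, cells: customer counts
overlap = pd.crosstab(labels["KMeans_Cluster"], labels["DBSCAN_Cluster"])
print(overlap)
```

With the notebook's data, passing `data['KMeans_Cluster']` and `data['DBSCAN_Cluster']` to `pd.crosstab` would show which K-Means segments fall inside each DBSCAN segment.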

## 8. Conclusion

In this analysis, we identified distinct customer segments using two clustering algorithms. On this dataset, K-Means produced more detailed and actionable segments than DBSCAN.

**Key Insights:**

1. **Segmentation Power**: Customer segmentation reveals distinct groups with varying purchasing behaviors and demographic characteristics.
2. **Algorithm Comparison**: K-Means produced more interpretable and marketing-relevant clusters than DBSCAN on this dataset.
3. **Marketing Applications**: Each identified segment represents a unique opportunity for targeted marketing strategies.
4. **Future Directions**: The analysis could be enhanced by incorporating additional behavioral data, such as purchase history and browsing behavior.

The segmentation results demonstrate how data-driven analysis can inform more effective marketing by tailoring campaigns to specific customer groups rather than relying on a one-size-fits-all strategy.