Objective:
To segment the customers into distinct groups based on their spending behavior, annual income, and other demographic factors. This will help in targeted marketing and personalized offers to enhance customer engagement and increase sales.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Load the dataset

# Every later cell refers to the DataFrame as `df`, so load it under that name
df = pd.read_csv('customer_segmentation_dataset.csv')

# Display the first few rows of the dataset
df.head()

In [None]:
# Step 1: Basic Information

# Display the shape of the dataset
shape = df.shape

# Display the columns and their data types
data_types = df.dtypes

shape, data_types

In [None]:
# Step 2: Missing Values

# Check for missing values in each column
missing_values = df.isnull().sum()

missing_values

In [None]:
# Step 3: Data Types

# Convert object columns to category data type for efficiency
categorical_columns = ['Gender', 'Occupation', 'Education', 'Marital Status', 'Online Purchase Frequency', 'Preferred Shopping Mode', 'Loyalty Program Member']
for col in categorical_columns:
    df[col] = df[col].astype('category')

# Display the updated data types
updated_data_types = df.dtypes

updated_data_types

In [None]:
# Step 4: Unique Values

# Check for columns with a single unique value
unique_values = df.nunique()
columns_with_single_unique_value = unique_values[unique_values == 1].index.tolist()

columns_with_single_unique_value

In [None]:
# Step 5: Statistical Summary

# Get a statistical overview of the numerical columns
statistical_summary = df.describe()

statistical_summary

In [None]:
# Step 6: Outliers

# Set the style for seaborn plots
sns.set_style('whitegrid')

# Plot boxplots for numerical columns to identify outliers
numerical_columns = ['Age', 'Annual Income (k$)', 'Spending Score (1-100)']
plt.figure(figsize=(15, 7))

for i, col in enumerate(numerical_columns, 1):
    plt.subplot(1, 3, i)
    sns.boxplot(y=df[col])
    plt.title(f'Boxplot of {col}')
    plt.ylabel(col)

plt.tight_layout()
plt.show()

Age: The distribution of age seems fairly normal with no visible outliers.

Annual Income (k$): The distribution of annual income is slightly right-skewed, but there are no clear outliers.

Spending Score (1-100): The spending score is uniformly distributed across the range, and there are no outliers.

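From the boxplots, the dataset appears relatively clean, with no significant outliers that might skew the analysis. A boxplot summarizes spread and skew rather than the full shape of a distribution, so the "normal" and "uniform" readings above are rough impressions; a histogram would be needed to confirm them.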
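
The visual check above can be complemented with a numeric one. The sketch below applies the 1.5 × IQR rule, which is the default whisker rule in seaborn and matplotlib boxplots, and counts the values that fall beyond the whiskers. The `iqr_outlier_counts` helper and the small `demo` frame (including its deliberately extreme income of 300) are hypothetical and only illustrate the idea; in this notebook the call would presumably be `iqr_outlier_counts(df[numerical_columns])`.

```python
import pandas as pd

def iqr_outlier_counts(frame: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each column."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # Compare each column against its own bounds
    outside = frame.lt(lower, axis="columns") | frame.gt(upper, axis="columns")
    return outside.sum()

# Hypothetical demo data, not taken from the customer dataset
demo = pd.DataFrame({
    "Age": [22, 25, 31, 38, 44, 52, 60, 67],
    "Annual Income (k$)": [15, 28, 35, 42, 50, 58, 64, 300],
})

outlier_counts = iqr_outlier_counts(demo)
print(outlier_counts)
```

A count of zero for every column would support the reading that there are no outliers.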

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Selecting numerical columns for clustering
features = ['Age', 'Annual Income (k$)', 'Spending Score (1-100)']
X = df[features]

# Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Determine the optimal number of clusters using the Elbow Method
inertia = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, random_state=42)
    kmeans.fit(X_scaled)
    inertia.append(kmeans.inertia_)

# Plotting the Elbow Method graph
plt.figure(figsize=(10, 6))
plt.plot(range(1, 11), inertia, marker='o', linestyle='--')
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.show()

From the graph, the elbow appears at around 5 clusters: beyond that point, adding another cluster reduces inertia only marginally. K=5 is therefore a reasonable choice for this dataset.
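
Reading an elbow off a plot is a judgment call, so it can help to cross-check it with a second criterion. The sketch below uses the silhouette score from scikit-learn, which was not part of the original analysis, and picks the K with the highest average score. `X_demo` is synthetic data from `make_blobs` standing in for `X_scaled`; the range of K values is likewise an assumption.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for X_scaled (assumption: the notebook would pass X_scaled here)
X_demo, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=42)

silhouette_by_k = {}
for k in range(2, 9):  # the silhouette score needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_demo)
    silhouette_by_k[k] = silhouette_score(X_demo, labels)

# K with the highest mean silhouette score
best_k = max(silhouette_by_k, key=silhouette_by_k.get)
print(best_k, round(silhouette_by_k[best_k], 3))
```

If the silhouette score and the elbow point to different values of K, that is a signal to inspect both candidate solutions rather than trust either criterion alone.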

In [None]:
# Apply K-Means clustering with 5 clusters
kmeans = KMeans(n_clusters=5, random_state=42)
clusters = kmeans.fit_predict(X_scaled)

# Add the cluster labels to the original dataframe
df['Cluster'] = clusters

# Display the first few rows of the dataframe with cluster labels
df.head()

The "Cluster" column indicates the cluster to which each customer belongs, ranging from 0 to 4 (since we chose 5 clusters). The numbers are arbitrary identifiers and carry no inherent order.

In [None]:
# Group by the cluster labels and calculate the mean for numerical columns
cluster_profile_numerical = df.groupby('Cluster')[['Age', 'Annual Income (k$)', 'Spending Score (1-100)']].mean()

cluster_profile_numerical

From the above table, we can make some preliminary observations:

Cluster 0: Younger customers with lower income and lower spending scores.

Cluster 1: Middle-aged to older customers with high income and high spending scores.

Cluster 2: Older customers with lower-middle income and average spending scores.

Cluster 3: Younger customers with middle income and high spending scores.

Cluster 4: Middle-aged customers with high income but lower spending scores.
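
Labels such as "lower" and "higher" are relative to the whole customer base. One way to make them explicit is to express each cluster mean as a distance from the overall mean, in units of the overall standard deviation. The sketch below does this on a small hypothetical `demo` frame; the values are made up, and in the notebook the same three lines would presumably be applied to `df`.

```python
import pandas as pd

# Hypothetical demo data, not taken from the customer dataset
demo = pd.DataFrame({
    "Cluster": [0, 0, 1, 1, 2, 2],
    "Age": [25, 27, 55, 57, 40, 42],
    "Spending Score (1-100)": [20, 24, 80, 76, 50, 50],
})
features = ["Age", "Spending Score (1-100)"]

cluster_means = demo.groupby("Cluster")[features].mean()
overall_mean = demo[features].mean()
overall_std = demo[features].std()

# Positive: the cluster mean is above the overall mean; negative: below it
relative_profile = (cluster_means - overall_mean) / overall_std
print(relative_profile.round(2))
```

Values near zero correspond to "average" and large positive or negative values to "high" or "low", which gives the verbal labels a consistent basis.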

In [None]:
# Group by the cluster labels and calculate the mode for categorical columns
categorical_columns = ['Gender', 'Occupation', 'Education', 'Marital Status', 'Online Purchase Frequency', 'Preferred Shopping Mode', 'Loyalty Program Member']
cluster_profile_categorical = df.groupby('Cluster')[categorical_columns].agg(lambda x: x.mode().iloc[0])

cluster_profile_categorical

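From the above table, we can further refine our observations. Each entry is the most frequent category (the mode) within the cluster, so the descriptions below characterize the typical member, not every member: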

Cluster 0: Younger female managers with PhDs, married, rarely purchase online, prefer online shopping, and are members of the loyalty program.

Cluster 1: Middle-aged to older female engineers with master's degrees, divorced, rarely purchase online, prefer in-store shopping, and are not members of the loyalty program.

Cluster 2: Older male doctors with high school education, single, purchase online monthly, have no preference between online and in-store shopping, and are members of the loyalty program.

Cluster 3: Younger female students with high school education, single, purchase online weekly, have no preference between online and in-store shopping, and are members of the loyalty program.

Cluster 4: Middle-aged male lawyers with PhDs, single, purchase online monthly, prefer in-store shopping, and are not members of the loyalty program.
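
The mode names the most frequent category but not how dominant it is. The sketch below pairs each mode with the share of the cluster it covers, using a row-normalized `pd.crosstab`. The `demo` frame is hypothetical; in the notebook the same call could presumably be repeated for each entry of `categorical_columns`.

```python
import pandas as pd

# Hypothetical demo data, not taken from the customer dataset
demo = pd.DataFrame({
    "Cluster": [0, 0, 0, 0, 1, 1, 1, 1],
    "Gender": ["Female", "Female", "Female", "Male",
               "Male", "Male", "Female", "Female"],
})

# Row-normalized crosstab: share of each category within each cluster
shares = pd.crosstab(demo["Cluster"], demo["Gender"], normalize="index")

# Most frequent category per cluster and the fraction of the cluster it covers
modal_summary = pd.DataFrame({
    "mode": shares.idxmax(axis=1),
    "share": shares.max(axis=1),
})
print(modal_summary)
```

In this toy example cluster 1 is split evenly, so its "mode" is only a tie-break. A share close to one divided by the number of categories means the modal label says little about the cluster.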

In [None]:
# Visualizing the clusters

plt.figure(figsize=(20, 6))

# Age vs. Annual Income
plt.subplot(1, 3, 1)
sns.scatterplot(x='Age', y='Annual Income (k$)', hue='Cluster', data=df, palette='viridis', s=60, edgecolor='w')
plt.title('Age vs. Annual Income')

# Age vs. Spending Score
plt.subplot(1, 3, 2)
sns.scatterplot(x='Age', y='Spending Score (1-100)', hue='Cluster', data=df, palette='viridis', s=60, edgecolor='w')
plt.title('Age vs. Spending Score')

# Annual Income vs. Spending Score
plt.subplot(1, 3, 3)
sns.scatterplot(x='Annual Income (k$)', y='Spending Score (1-100)', hue='Cluster', data=df, palette='viridis', s=60, edgecolor='w')
plt.title('Annual Income vs. Spending Score')

plt.tight_layout()
plt.show()

The scatter plots above show the clusters for each pair of features:

Age vs. Annual Income:
The clusters seem to differentiate customers primarily based on age, with younger customers having a wider range of incomes compared to older customers.

Age vs. Spending Score:
The clusters show a clear distinction between younger customers with high spending scores and older customers with lower spending scores. There's also a cluster of middle-aged customers with average spending scores.

Annual Income vs. Spending Score:
This visualization provides a clear distinction between the clusters. We can see groups of customers with low income and low spending scores, high income and high spending scores, high income but low spending scores, and so on.

These visualizations provide a comprehensive view of how customers are grouped based on their age, annual income, and spending scores. The clusters can be used to devise targeted marketing strategies, personalized offers, or other business initiatives.

In [None]:
# Detailed Cluster Profiles and Cluster Sizes

# Cluster sizes
cluster_sizes = df['Cluster'].value_counts().sort_index()

# Detailed profiles for numerical columns
cluster_profiles_numerical = df.groupby('Cluster')[['Age', 'Annual Income (k$)', 'Spending Score (1-100)']].describe()

cluster_sizes, cluster_profiles_numerical

Detailed Profiles for Numerical Columns:

Cluster 0 (164 customers):

Age: Ranges from 18 to 52 years, with an average age of ~32.4 years.
Annual Income: Ranges from $10k to $58k, with an average income of ~$29.9k.
Spending Score: Ranges from 1 to 63, with an average score of ~25.0.

Cluster 1 (185 customers):

Age: Ranges from 28 to 69 years, with an average age of ~52.7 years.
Annual Income: Ranges from $45k to $99k, with an average income of ~$79.2k.
Spending Score: Ranges from 27 to 100, with an average score of ~74.8.

Cluster 2 (242 customers):

Age: Ranges from 44 to 69 years, with an average age of ~59.0 years.
Annual Income: Ranges from $10k to $78k, with an average income of ~$36.2k.
Spending Score: Ranges from 3 to 100, with an average score of ~49.7.

Cluster 3 (208 customers):

Age: Ranges from 18 to 52 years, with an average age of ~30.7 years.
Annual Income: Ranges from $10k to $96k, with an average income of ~$47.4k.
Spending Score: Ranges from 44 to 100, with an average score of ~76.3.

Cluster 4 (201 customers):

Age: Ranges from 18 to 68 years, with an average age of ~38.3 years.
Annual Income: Ranges from $52k to $99k, with an average income of ~$78.8k.
Spending Score: Ranges from 1 to 65, with an average score of ~24.0.
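
As a quick consistency check on the figures above, the sketch below computes each cluster's share of the customer base and the size-weighted average of the reported mean ages. The numbers are copied from the profiles above; the variable names are illustrative only.

```python
# Cluster sizes and mean ages as reported in the profiles above
reported_sizes = {0: 164, 1: 185, 2: 242, 3: 208, 4: 201}
reported_mean_age = {0: 32.4, 1: 52.7, 2: 59.0, 3: 30.7, 4: 38.3}

total_customers = sum(reported_sizes.values())

# Share of the customer base in each cluster
cluster_share = {c: n / total_customers for c, n in reported_sizes.items()}

# The size-weighted average of the cluster means approximates the overall
# mean age (up to rounding of the reported per-cluster averages)
overall_mean_age = sum(
    reported_sizes[c] * reported_mean_age[c] for c in reported_sizes
) / total_customers

for c, share in cluster_share.items():
    print(f"Cluster {c}: {reported_sizes[c]} customers ({share:.1%})")
print(f"Total: {total_customers}, size-weighted mean age: {overall_mean_age:.1f}")
```

The sizes sum to 1,000 customers, and no cluster holds less than about a sixth of them, so the segments are reasonably balanced.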

In [None]:
# Visualizing the distribution of features within each cluster

features_to_visualize = ['Age', 'Annual Income (k$)', 'Spending Score (1-100)']
n_clusters = 5

plt.figure(figsize=(15, 10))

for i, feature in enumerate(features_to_visualize, 1):
    plt.subplot(3, 1, i)
    for cluster in range(n_clusters):
        # `fill=True` replaces the deprecated `shade=True` argument
        sns.kdeplot(df[df['Cluster'] == cluster][feature], label=f'Cluster {cluster}', fill=True)
    plt.title(f'Distribution of {feature} by Cluster')
    plt.legend()

plt.tight_layout()
plt.show()

The density plots provide insights into the distribution of Age, Annual Income (k$), and Spending Score (1-100) across the different clusters:

Age:

Cluster 0 & 3: Dominated by younger customers.

Cluster 1: Primarily consists of middle-aged to older customers.

Cluster 2: Dominated by older customers.

Cluster 4: Has a broader age range but peaks around middle age.

Annual Income (k$):

Cluster 0: Customers with lower incomes.

Cluster 1 & 4: Customers with higher incomes.

Cluster 2: A wider range of incomes but peaks in the lower-middle range.

Cluster 3: Middle income range.

Spending Score (1-100):

Cluster 0 & 4: Customers with lower spending scores.

Cluster 1 & 3: Customers with higher spending scores.

Cluster 2: Average spending scores.

In [None]:
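# Analyzing the centroids of each cluster
# Note: K-Means was fit on X_scaled, so these values are in standardized units (z-scores), not years, k$, or score points
centroids = pd.DataFrame(kmeans.cluster_centers_, columns=['Age', 'Annual Income (k$)', 'Spending Score (1-100)'])
centroids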

In [None]:
# Checking the number of unique cluster labels and centroids
unique_clusters = df['Cluster'].unique()
number_of_centroids = centroids.shape[0]

unique_clusters, number_of_centroids

In [None]:
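# Re-visualizing the data points with the centroids overlaid and the legend outside the plot

# Map the centroids from standardized units back to the original feature scale
centroids_original = scaler.inverse_transform(kmeans.cluster_centers_)

plt.figure(figsize=(20, 6))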

# Age vs. Annual Income
plt.subplot(1, 3, 1)
sns.scatterplot(x=df['Age'], y=df['Annual Income (k$)'], hue='Cluster', data=df, palette='viridis', s=60, edgecolor='w', alpha=0.7)
sns.scatterplot(x=centroids_original[:, 0], y=centroids_original[:, 1], color='red', s=200, label='Centroids', marker='X')
plt.title('Age vs. Annual Income with Centroids')
plt.legend(loc='upper left', bbox_to_anchor=(1, 1))

# Age vs. Spending Score
plt.subplot(1, 3, 2)
sns.scatterplot(x=df['Age'], y=df['Spending Score (1-100)'], hue='Cluster', data=df, palette='viridis', s=60, edgecolor='w', alpha=0.7)
sns.scatterplot(x=centroids_original[:, 0], y=centroids_original[:, 2], color='red', s=200, label='Centroids', marker='X')
plt.title('Age vs. Spending Score with Centroids')
plt.legend(loc='upper left', bbox_to_anchor=(1, 1))

# Annual Income vs. Spending Score
plt.subplot(1, 3, 3)
sns.scatterplot(x=df['Annual Income (k$)'], y=df['Spending Score (1-100)'], hue='Cluster', data=df, palette='viridis', s=60, edgecolor='w', alpha=0.7)
sns.scatterplot(x=centroids_original[:, 1], y=centroids_original[:, 2], color='red', s=200, label='Centroids', marker='X')
plt.title('Annual Income vs. Spending Score with Centroids')
plt.legend(loc='upper left', bbox_to_anchor=(1, 1))

plt.tight_layout()
plt.show()

From the centroids, we can infer the following:

Cluster 0: Represents younger customers with lower incomes and lower spending scores.

Cluster 1: Represents middle-aged to older customers with higher incomes and higher spending scores.

Cluster 2: Represents older customers with moderate incomes and average spending scores.

Cluster 3: Represents younger customers with average incomes but higher spending scores.

Cluster 4: Represents middle-aged customers with higher incomes but lower spending scores.

These centroids essentially provide a summary of the "average" customer in each cluster, which can be useful for understanding the core characteristics of each segment.
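
Because K-Means was fit on standardized features, `kmeans.cluster_centers_` holds z-scores. Reading a centroid as an "average customer" requires mapping it back with x = z × σ + μ, which is what `StandardScaler.inverse_transform` computes. The sketch below checks that equivalence on hypothetical data; `X_demo`, `scaler_demo`, and `centroid_z` are made-up stand-ins, not values from this analysis.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the three clustering features (Age, Income, Score)
X_demo = np.array([
    [20.0, 30.0, 80.0],
    [30.0, 50.0, 60.0],
    [40.0, 70.0, 40.0],
    [50.0, 90.0, 20.0],
])
scaler_demo = StandardScaler().fit(X_demo)

# A made-up centroid in standardized (z-score) space
centroid_z = np.array([[1.0, -0.5, 0.0]])

# Manual back-transform: x = z * sigma + mu
manual = centroid_z * scaler_demo.scale_ + scaler_demo.mean_
# Library back-transform
via_sklearn = scaler_demo.inverse_transform(centroid_z)

print(manual)
print(np.allclose(manual, via_sklearn))
```

A z-score of 0 maps back to the feature's mean, so a centroid near zero on every axis describes a customer close to the overall average.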

With these insights, businesses can tailor their marketing campaigns, product offerings, and services to cater to the specific needs and preferences of each customer segment.