In [1]:
import pandas as pd
import numpy as np

# Set random seed for reproducibility
np.random.seed(42)

# Generate synthetic data
n_samples = 1000

data = {
    'CustomerID': np.arange(1, n_samples + 1),
    'Age': np.random.randint(18, 70, size=n_samples),
    'Gender': np.random.choice(['Male', 'Female'], size=n_samples),
    'AnnualIncome': np.random.normal(50000, 15000, size=n_samples).round(2),
    'SpendingScore': np.random.randint(1, 101, size=n_samples),
    'MembershipYears': np.random.randint(1, 11, size=n_samples)
}

# Create DataFrame
df = pd.DataFrame(data)

df.head()

,CustomerID,Age,Gender,AnnualIncome,SpendingScore,MembershipYears
0,1,56,Male,49753.66,11,9
1,2,69,Male,67825.9,49,1
2,3,46,Male,87903.99,38,1
3,4,32,Female,42036.97,64,7
4,5,60,Male,42658.41,69,3


In [4]:
import plotly.express as px
import plotly.graph_objects as go

# Summary statistics
summary_stats = df.describe()
summary_stats

,CustomerID,Age,AnnualIncome,SpendingScore,MembershipYears
count,1000.0,1000.0,1000.0,1000.0,1000.0
mean,500.5,43.819,51204.48724,50.847,5.532
std,288.819436,14.99103,14898.527905,28.676505,2.898168
min,1.0,18.0,6179.74,1.0,1.0
25%,250.75,31.0,40976.3225,26.0,3.0
50%,500.5,44.0,51135.79,51.0,6.0
75%,750.25,56.0,61131.55,75.0,8.0
max,1000.0,69.0,97896.61,100.0,10.0


- The average customer age is approximately 44 years, with a standard deviation of about 15 years; ages span 18 to 69.
- The average annual income is around USD 51,204, with a standard deviation of approximately USD 14,899; incomes range from about USD 6,180 to USD 97,897.
- The average spending score is about 51, with a standard deviation of around 29, covering the full 1 to 100 scale.
- The average membership duration is roughly 5.5 years, with a standard deviation of about 2.9 years, across a range of 1 to 10 years.

In [6]:
# Distribution of Age
fig_age = px.histogram(df, x='Age', nbins=20, title='Age Distribution')
fig_age.show()

# Distribution of Annual Income
fig_income = px.histogram(df, x='AnnualIncome', nbins=20, title='Annual Income Distribution')
fig_income.show()

# Distribution of Spending Score
fig_spending = px.histogram(df, x='SpendingScore', nbins=20, title='Spending Score Distribution')
fig_spending.show()

# Distribution of Membership Years
fig_membership = px.histogram(df, x='MembershipYears', nbins=10, title='Membership Years Distribution')
fig_membership.show()

- Age Distribution: The age distribution is relatively uniform across the 18 to 69 range, with at most a slight concentration around ages 40-50.
- Annual Income Distribution: The annual income distribution is approximately normal, centered around USD 50,000, with a few outliers on both ends.
- Spending Score Distribution: The spending score is roughly uniformly distributed over 1 to 100, indicating a wide range of spending behaviors among customers.
- Membership Years Distribution: Membership years are spread fairly evenly from 1 to 10. The median is 6 years, so at least half of the customers have been members for 6 years or more.
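
The shape claims above can also be checked numerically rather than by eye. The following is a minimal sketch, assuming SciPy is available; `df_check` is a name chosen for this example, and the data is rebuilt with the same recipe as the first cell so the snippet runs on its own. It reports sample skewness per column and a chi-square goodness-of-fit test of `MembershipYears` against equal counts per year.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Rebuild the synthetic data with the same recipe as the first cell so the
# snippet runs on its own; `df_check` is a name chosen for this sketch.
np.random.seed(42)
n_samples = 1000
df_check = pd.DataFrame({
    'CustomerID': np.arange(1, n_samples + 1),
    'Age': np.random.randint(18, 70, size=n_samples),
    'Gender': np.random.choice(['Male', 'Female'], size=n_samples),
    'AnnualIncome': np.random.normal(50000, 15000, size=n_samples).round(2),
    'SpendingScore': np.random.randint(1, 101, size=n_samples),
    'MembershipYears': np.random.randint(1, 11, size=n_samples),
})

numeric_cols = ['Age', 'AnnualIncome', 'SpendingScore', 'MembershipYears']

# Sample skewness: values near 0 indicate a roughly symmetric distribution.
skewness = df_check[numeric_cols].skew()
print(skewness)

# Chi-square goodness-of-fit: are the ten membership-year values equally frequent?
year_counts = df_check['MembershipYears'].value_counts().sort_index()
chi2_stat, chi2_p = stats.chisquare(year_counts)
print(f"MembershipYears uniformity: chi2={chi2_stat:.2f}, p={chi2_p:.3f}")
```

A large p-value here would be consistent with a uniform distribution; it does not prove one.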

In [7]:
# Scatter plot of Age vs. Annual Income
fig_age_income = px.scatter(df, x='Age', y='AnnualIncome', title='Age vs. Annual Income')
fig_age_income.show()

# Scatter plot of Age vs. Spending Score
fig_age_spending = px.scatter(df, x='Age', y='SpendingScore', title='Age vs. Spending Score')
fig_age_spending.show()

# Scatter plot of Annual Income vs. Spending Score
fig_income_spending = px.scatter(df, x='AnnualIncome', y='SpendingScore', title='Annual Income vs. Spending Score')
fig_income_spending.show()

- Age vs. Annual Income: There is no clear correlation between age and annual income (r ≈ 0.01), indicating that income levels are distributed across all age groups.
- Age vs. Spending Score: There is no distinct pattern between age and spending score (r ≈ 0.09), suggesting that spending behavior is largely unrelated to age.
- Annual Income vs. Spending Score: There is no strong correlation between annual income and spending score (r ≈ -0.03), indicating that higher income does not necessarily translate to higher spending scores.
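
To put numbers behind the scatter plots, the Pearson correlation and its p-value can be computed for each pair. This is a minimal sketch, assuming SciPy is available; `df_check` and `pair_results` are names chosen for this example, and the data is rebuilt as in the first cell so the snippet is standalone.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

# Rebuild the synthetic data with the same recipe as the first cell.
np.random.seed(42)
n_samples = 1000
df_check = pd.DataFrame({
    'CustomerID': np.arange(1, n_samples + 1),
    'Age': np.random.randint(18, 70, size=n_samples),
    'Gender': np.random.choice(['Male', 'Female'], size=n_samples),
    'AnnualIncome': np.random.normal(50000, 15000, size=n_samples).round(2),
    'SpendingScore': np.random.randint(1, 101, size=n_samples),
    'MembershipYears': np.random.randint(1, 11, size=n_samples),
})

# The three pairs shown in the scatter plots.
pairs = [
    ('Age', 'AnnualIncome'),
    ('Age', 'SpendingScore'),
    ('AnnualIncome', 'SpendingScore'),
]

pair_results = {}
for x_col, y_col in pairs:
    # pearsonr returns the correlation coefficient and a two-sided p-value.
    r, p_value = pearsonr(df_check[x_col], df_check[y_col])
    pair_results[(x_col, y_col)] = (r, p_value)
    print(f"{x_col} vs. {y_col}: r={r:.3f}, p={p_value:.3f}")
```

With 1,000 rows, even a very small correlation can reach statistical significance, so the size of `r` matters more than the p-value alone.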

In [8]:
# Box plot of Annual Income by Gender
fig_income_gender = px.box(df, x='Gender', y='AnnualIncome', title='Annual Income by Gender')
fig_income_gender.show()

# Box plot of Spending Score by Gender
fig_spending_gender = px.box(df, x='Gender', y='SpendingScore', title='Spending Score by Gender')
fig_spending_gender.show()

- Age, Income, and Spending Distributions: Age and spending scores are roughly uniformly distributed, while annual income follows an approximately normal distribution centered around USD 50,000.
- Scatter Plots: There are no clear correlations between age and annual income, age and spending score, or annual income and spending score.
- Box Plots by Gender: Both annual income and spending scores show similar distributions across genders, suggesting no notable gender-based differences in these metrics.
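
Box plots only suggest whether the groups differ. One way to test this is Welch's t-test, which is not part of the original analysis. The sketch below assumes SciPy is available and uses the original string `Gender` labels; `df_check` and `gender_tests` are names chosen for this example.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

# Rebuild the synthetic data with the same recipe as the first cell.
np.random.seed(42)
n_samples = 1000
df_check = pd.DataFrame({
    'CustomerID': np.arange(1, n_samples + 1),
    'Age': np.random.randint(18, 70, size=n_samples),
    'Gender': np.random.choice(['Male', 'Female'], size=n_samples),
    'AnnualIncome': np.random.normal(50000, 15000, size=n_samples).round(2),
    'SpendingScore': np.random.randint(1, 101, size=n_samples),
    'MembershipYears': np.random.randint(1, 11, size=n_samples),
})

gender_tests = {}
for col in ['AnnualIncome', 'SpendingScore']:
    male = df_check.loc[df_check['Gender'] == 'Male', col]
    female = df_check.loc[df_check['Gender'] == 'Female', col]
    # equal_var=False selects Welch's t-test, which does not assume equal variances.
    t_stat, p_value = ttest_ind(male, female, equal_var=False)
    gender_tests[col] = (t_stat, p_value)
    print(f"{col}: t={t_stat:.3f}, p={p_value:.3f}")
```

A p-value above the chosen significance level (0.05 is a common convention) would be consistent with the box plots showing no difference between genders.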

In [13]:
# Convert categorical variables to numerical
from sklearn.preprocessing import LabelEncoder

# Encode 'Gender' column
label_encoder = LabelEncoder()
df['Gender'] = label_encoder.fit_transform(df['Gender'])

# Drop 'CustomerID' column
df_no_id = df.drop(columns=['CustomerID'])

# Correlation matrix
correlation_matrix = df_no_id.corr()
correlation_matrix

,Age,Gender,AnnualIncome,SpendingScore,MembershipYears
Age,1.0,0.010002,0.012247,0.088405,0.014867
Gender,0.010002,1.0,0.029061,0.017756,-0.020577
AnnualIncome,0.012247,0.029061,1.0,-0.034258,-0.004107
SpendingScore,0.088405,0.017756,-0.034258,1.0,0.03786
MembershipYears,0.014867,-0.020577,-0.004107,0.03786,1.0


In [14]:
# Correlation matrix (CustomerID excluded: it is an identifier, not a feature)
correlation_matrix = df_no_id.corr()

# Heatmap of the correlation matrix
fig_heatmap = px.imshow(correlation_matrix, text_auto=True, title='Correlation Matrix Heatmap')
fig_heatmap.show()

Correlation Matrix: The heatmap confirms the lack of strong correlations between the numerical variables, reinforcing the observations from the scatter plots. The largest off-diagonal correlation in absolute value is about 0.09 (Age vs. SpendingScore).
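
Instead of reading the heatmap cell by cell, the feature pairs can be ranked by absolute correlation. This is a minimal sketch; `df_check`, `pair_corr`, and `ranked` are names chosen for this example, and `Gender` is left out because it is still a string column in this standalone rebuild.

```python
import numpy as np
import pandas as pd

# Rebuild the synthetic data with the same recipe as the first cell.
np.random.seed(42)
n_samples = 1000
df_check = pd.DataFrame({
    'CustomerID': np.arange(1, n_samples + 1),
    'Age': np.random.randint(18, 70, size=n_samples),
    'Gender': np.random.choice(['Male', 'Female'], size=n_samples),
    'AnnualIncome': np.random.normal(50000, 15000, size=n_samples).round(2),
    'SpendingScore': np.random.randint(1, 101, size=n_samples),
    'MembershipYears': np.random.randint(1, 11, size=n_samples),
})

numeric_cols = ['Age', 'AnnualIncome', 'SpendingScore', 'MembershipYears']
corr = df_check[numeric_cols].corr()

# Indices of the upper triangle (k=1 skips the diagonal), so each pair appears once.
rows, cols = np.triu_indices(len(numeric_cols), k=1)
pair_corr = pd.Series(
    corr.to_numpy()[rows, cols],
    index=[f"{numeric_cols[i]} vs {numeric_cols[j]}" for i, j in zip(rows, cols)],
)

# Rank pairs by absolute correlation, strongest first.
ranked = pair_corr.sort_values(key=np.abs, ascending=False)
print(ranked)
```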

In [15]:
from sklearn.cluster import KMeans

# Selecting features for clustering
features = df[['Age', 'AnnualIncome', 'SpendingScore', 'MembershipYears']]

# Applying K-Means clustering
kmeans = KMeans(n_clusters=4, random_state=42)
df['Cluster'] = kmeans.fit_predict(features)

# Visualizing the clusters
fig_clusters = px.scatter_3d(df, x='Age', y='AnnualIncome', z='SpendingScore', color='Cluster', title='Customer Segments')
fig_clusters.show()

**Customer Segmentation**
Objective: The goal of customer segmentation is to identify distinct groups of customers based on their characteristics. This helps in tailoring marketing strategies and improving customer satisfaction.

We used K-Means clustering, a popular unsupervised machine learning algorithm, to segment customers. The algorithm partitions the data into $k$ clusters, where each customer belongs to the cluster with the nearest mean.

- Features Used:
    - Age
    - Annual Income
    - Spending Score
    - Membership Years

- Steps:
    - Feature Selection: We selected the numerical features for clustering.
    - K-Means Clustering: We applied K-Means with $k = 4$ clusters.
    - Visualization: We visualized the clusters in a 3D scatter plot based on age, annual income, and spending score.
  
- Results:
    - The 3D scatter plot shows four customer segments, separated mainly along the annual income axis.
    - Each segment groups customers with similar characteristics, which can be used for targeted marketing and personalized services.
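
K-Means assigns points by Euclidean distance, so a feature measured in tens of thousands (annual income) can outweigh features measured in tens (age, spending score). A common remedy, which the original analysis does not apply, is to standardize the features first. The sketch below compares both variants, assuming scikit-learn is available; `df_check`, `labels_raw`, `labels_scaled`, and `profile` are names chosen for this example, and `n_init=10` is set explicitly as an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Rebuild the synthetic data with the same recipe as the first cell.
np.random.seed(42)
n_samples = 1000
df_check = pd.DataFrame({
    'CustomerID': np.arange(1, n_samples + 1),
    'Age': np.random.randint(18, 70, size=n_samples),
    'Gender': np.random.choice(['Male', 'Female'], size=n_samples),
    'AnnualIncome': np.random.normal(50000, 15000, size=n_samples).round(2),
    'SpendingScore': np.random.randint(1, 101, size=n_samples),
    'MembershipYears': np.random.randint(1, 11, size=n_samples),
})

feature_cols = ['Age', 'AnnualIncome', 'SpendingScore', 'MembershipYears']

# Standardize each feature to zero mean and unit variance.
scaled = StandardScaler().fit_transform(df_check[feature_cols])

# Cluster once on the raw features and once on the standardized ones.
labels_raw = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(df_check[feature_cols])
labels_scaled = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(scaled)

# Compare per-cluster feature means (in original units) for both variants.
profile = df_check.assign(RawCluster=labels_raw, ScaledCluster=labels_scaled)
print(profile.groupby('RawCluster')[feature_cols].mean().round(1))
print(profile.groupby('ScaledCluster')[feature_cols].mean().round(1))
```

If the raw clusters differ only in `AnnualIncome` while the scaled clusters differ across several features, the unscaled segmentation is effectively a one-dimensional split on income.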

In [16]:
# Calculating mean values of features for each cluster
cluster_means = df.groupby('Cluster').mean()

# Displaying the mean values
cluster_means

Cluster,CustomerID,Age,Gender,AnnualIncome,SpendingScore,MembershipYears
0,509.345679,43.438272,0.5,28427.328457,51.716049,5.41358
1,491.904615,43.956923,0.498462,57241.732123,52.381538,5.470769
2,515.151976,43.568389,0.537994,44418.934255,50.094225,5.662614
3,481.695652,44.358696,0.565217,72727.541957,48.717391,5.51087


K-Means produced four customer segments, but they differ almost exclusively in annual income: the cluster means range from about USD 28,400 to USD 72,700, while mean age (43.4 to 44.4), spending score (48.7 to 52.4), and membership years (5.4 to 5.7) are nearly identical across clusters. Because the features were not scaled before clustering, annual income, whose values are orders of magnitude larger than those of the other features, dominates the Euclidean distances that K-Means minimizes. Apart from income, the customer base therefore looks homogeneous, and targeted marketing strategies may need to draw on more nuanced customer attributes.
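
How well separated the segments are can be quantified with the silhouette score, which ranges from -1 to 1; values near 0 indicate overlapping clusters. This is a minimal sketch, assuming scikit-learn is available and that features are standardized first; the range of `k` values tried (2 to 6) and the names `df_check` and `silhouette_by_k` are choices made for this example, not part of the original analysis.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Rebuild the synthetic data with the same recipe as the first cell.
np.random.seed(42)
n_samples = 1000
df_check = pd.DataFrame({
    'CustomerID': np.arange(1, n_samples + 1),
    'Age': np.random.randint(18, 70, size=n_samples),
    'Gender': np.random.choice(['Male', 'Female'], size=n_samples),
    'AnnualIncome': np.random.normal(50000, 15000, size=n_samples).round(2),
    'SpendingScore': np.random.randint(1, 101, size=n_samples),
    'MembershipYears': np.random.randint(1, 11, size=n_samples),
})

feature_cols = ['Age', 'AnnualIncome', 'SpendingScore', 'MembershipYears']
scaled = StandardScaler().fit_transform(df_check[feature_cols])

# Silhouette score for several candidate cluster counts.
silhouette_by_k = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(scaled)
    silhouette_by_k[k] = silhouette_score(scaled, labels)
    print(f"k={k}: silhouette={silhouette_by_k[k]:.3f}")
```

Uniformly low scores across all `k` would support the reading that the data has no strong natural cluster structure.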

Possible nuanced customer attributes that could be considered for targeted marketing strategies include:

- **Customer Lifetime Value (CLV)**: This metric estimates the total revenue a business can expect from a customer over the entire duration of their relationship. It helps in identifying high-value customers who may warrant special attention.

- **Purchase Frequency**: Understanding how often customers make purchases can help in creating personalized marketing campaigns aimed at increasing purchase frequency.

- **Product Preferences**: Analyzing the types of products or services that different customer segments prefer can help in tailoring marketing messages and promotions.

- **Customer Feedback and Reviews**: Sentiment analysis of customer feedback and reviews can provide insights into customer satisfaction and areas for improvement.

- **Engagement with Marketing Channels**: Tracking how customers interact with various marketing channels (e.g., email, social media, in-store) can help in optimizing marketing efforts.

- **Geographic Location**: Understanding the geographic distribution of customers can help in creating location-specific marketing campaigns.

- **Behavioral Data**: Analyzing customer behavior on the website or app, such as pages visited, time spent, and actions taken, can provide insights into customer interests and intent.

- **Demographic Information**: While age and gender are already considered, other demographic factors like education level, occupation, and family size could provide additional insights.

- **Social Media Activity**: Monitoring customers' social media activity and engagement can help in understanding their interests and preferences.

- **Referral Sources**: Identifying how customers found out about the business (e.g., word of mouth, online ads, search engines) can help in optimizing marketing spend.