# Unlocking Customer Insights: A Comprehensive Segmentation Approach 

## Introduction

In today's highly competitive market, understanding customer behaviour is crucial for businesses that want to tailor their marketing strategies effectively. This project uses data-driven techniques to segment customers based on various attributes.

Using the [Customer segmentation dataset from Kaggle](https://www.kaggle.com/datasets/vetrirah/customer/data), we will employ clustering algorithms to identify distinct customer groups, analyze their characteristics, and provide actionable insights. This segmentation will help businesses optimize their marketing efforts, enhance customer satisfaction, and ultimately drive growth.

<center>
    <figure>
    <img src = "https://media.licdn.com/dms/image/C4D12AQEQSgGXtud3dA/article-cover_image-shrink_600_2000/0/1588440796258?e=2147483647&v=beta&t=0SORV_gIMlAEmaTdqPJG_UgY3PyN0TTMBOXnfmTKgJI" alt="Customer segmentation" style="width:500px;height:350px;">
    <figcaption> How to target the right customers? </figcaption>
    </figure>
</center>

## Dataset

An automobile company plans to enter new markets with its existing products (P1, P2, P3, P4 and P5). After intensive market research, the company has deduced that the behavior of the new market is similar to that of its existing market.

In the existing market, the sales team has classified all customers into 4 segments (A, B, C and D) and performed segmented outreach and communication for each segment. This strategy has worked exceptionally well. The company plans to use the same strategy in the new markets and has identified 2627 new potential customers.

Our task is to help the manager predict the right segment for each new customer.

## Data exploration and pre-processing

In [1]:
import pandas as pd
import plotly.graph_objects as go
import plotly.express as px
import seaborn as sns
from sklearn.cluster import KMeans, DBSCAN
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

file_path = 'C:/Users/Francisco Valerio/Desktop/customer-segmentation/data/Train.csv'

data_train = pd.read_csv(file_path)

data_train.head()

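| | ID | Gender | Ever_Married | Age | Graduated | Profession | Work_Experience | Spending_Score | Family_Size | Var_1 | Segmentation |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 462809 | Male | No | 22 | No | Healthcare | 1.0 | Low | 4.0 | Cat_4 | D |
| 1 | 462643 | Female | Yes | 38 | Yes | Engineer | NaN | Average | 3.0 | Cat_4 | A |
| 2 | 466315 | Female | Yes | 67 | Yes | Engineer | 1.0 | Low | 1.0 | Cat_6 | B |
| 3 | 461735 | Male | Yes | 67 | Yes | Lawyer | 0.0 | High | 2.0 | Cat_6 | B |
| 4 | 462669 | Female | Yes | 40 | Yes | Entertainment | NaN | High | 6.0 | Cat_6 | A |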


In [2]:
data_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8068 entries, 0 to 8067
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   ID               8068 non-null   int64  
 1   Gender           8068 non-null   object 
 2   Ever_Married     7928 non-null   object 
 3   Age              8068 non-null   int64  
 4   Graduated        7990 non-null   object 
 5   Profession       7944 non-null   object 
 6   Work_Experience  7239 non-null   float64
 7   Spending_Score   8068 non-null   object 
 8   Family_Size      7733 non-null   float64
 9   Var_1            7992 non-null   object 
 10  Segmentation     8068 non-null   object 
dtypes: float64(2), int64(2), object(7)
memory usage: 693.5+ KB


In [3]:
data_train.describe()

| | ID | Age | Work_Experience | Family_Size |
|---|---|---|---|---|
| count | 8068.0 | 8068.0 | 7239.0 | 7733.0 |
| mean | 463479.214551 | 43.466906 | 2.641663 | 2.850123 |
| std | 2595.381232 | 16.711696 | 3.406763 | 1.531413 |
| min | 458982.0 | 18.0 | 0.0 | 1.0 |
| 25% | 461240.75 | 30.0 | 0.0 | 2.0 |
| 50% | 463472.5 | 40.0 | 1.0 | 3.0 |
| 75% | 465744.25 | 53.0 | 4.0 | 4.0 |
| max | 467974.0 | 89.0 | 14.0 | 9.0 |


In [4]:
data_train.shape

(8068, 11)

In [5]:
for column in data_train.columns:

    if column != 'ID':

        unique_values = data_train[column].unique()

        print(f"{column}: {unique_values}")

Gender: ['Male' 'Female']
Ever_Married: ['No' 'Yes' nan]
Age: [22 38 67 40 56 32 33 61 55 26 19 70 58 41 31 79 49 18 36 35 45 42 83 27
 28 47 29 57 76 25 72 48 74 59 39 51 30 63 52 60 68 86 50 43 80 37 46 69
 78 71 82 23 20 85 21 53 62 75 65 89 66 73 77 87 84 81 88]
Graduated: ['No' 'Yes' nan]
Profession: ['Healthcare' 'Engineer' 'Lawyer' 'Entertainment' 'Artist' 'Executive'
 'Doctor' 'Homemaker' 'Marketing' nan]
Work_Experience: [ 1. nan  0.  4.  9. 12.  3. 13.  5.  8. 14.  7.  2.  6. 10. 11.]
Spending_Score: ['Low' 'Average' 'High']
Family_Size: [ 4.  3.  1.  2.  6. nan  5.  8.  7.  9.]
Var_1: ['Cat_4' 'Cat_6' 'Cat_7' 'Cat_3' 'Cat_1' 'Cat_2' nan 'Cat_5']
Segmentation: ['D' 'A' 'B' 'C']


In [6]:
print(data_train.isnull().sum())

ID                   0
Gender               0
Ever_Married       140
Age                  0
Graduated           78
Profession         124
Work_Experience    829
Spending_Score       0
Family_Size        335
Var_1               76
Segmentation         0
dtype: int64


In [7]:
# Fill missing categorical values with the most frequent value (mode)
categorical_columns = ['Ever_Married', 'Graduated', 'Profession', 'Var_1']

for column in categorical_columns:
    mode_val = data_train[column].mode()[0]
    data_train[column] = data_train[column].fillna(mode_val)

# Fill missing numerical values with the column mean
numerical_columns = ['Age', 'Work_Experience', 'Family_Size']

for column in numerical_columns:
    mean_val = data_train[column].mean()
    data_train[column] = data_train[column].fillna(mean_val)
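
The `describe()` output above shows that `Work_Experience` is right-skewed (mean 2.64 against a median of 1.0), so the mean may not be the most representative fill value. As a possible alternative, the sketch below imputes with the median instead. It runs on a small made-up frame named `toy` (an assumption for illustration, not the customer data) so that it is self-contained.

```python
import numpy as np
import pandas as pd

# Toy stand-in for data_train; the values are made up for illustration
toy = pd.DataFrame({
    "Work_Experience": [1.0, np.nan, 0.0, 4.0, 9.0, 1.0],
    "Family_Size": [4.0, 3.0, np.nan, 2.0, 6.0, 3.0],
})

for column in ["Work_Experience", "Family_Size"]:
    # The median is less affected by a few large values than the mean
    median_val = toy[column].median()
    toy[column] = toy[column].fillna(median_val).astype(int)

print(toy)
```

Whether the median or the mean is the better choice depends on the downstream model, so this is worth treating as an option to compare rather than a fixed rule.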


In [8]:
print(data_train.isnull().sum())

ID                 0
Gender             0
Ever_Married       0
Age                0
Graduated          0
Profession         0
Work_Experience    0
Spending_Score     0
Family_Size        0
Var_1              0
Segmentation       0
dtype: int64


In [9]:
data_train.drop_duplicates(inplace=True)

In [10]:
# Cast to integers; this truncates the imputed means (e.g. 2.64 becomes 2)
data_train['Work_Experience'] = data_train['Work_Experience'].astype('int')
data_train['Family_Size'] = data_train['Family_Size'].astype('int')

In [11]:
data_train = data_train.drop(columns=['Var_1'])

In [12]:
fig_hist_age = px.histogram(data_train, x = 'Age', nbins = 10, color_discrete_sequence=['#72B7B2'], marginal = "box")

fig_hist_age.update_layout(
    title = {
        "text": "Customers' age histogram",
        'font': {'size': 16}
    },
    xaxis_title = "Age",
    yaxis_title = 'Count',
    xaxis = dict(
        title_font_size = 16
    ),
    yaxis = dict(
        title_font_size = 16
    ),
    legend_title_text = "Age",
    legend = dict(
        font_size = 14,
        title_font_size = 14
    ),
    template = 'simple_white'
)

fig_hist_age.show()

In [13]:
fig_hist_family_size = px.histogram(data_train, x= 'Family_Size', nbins = 5, color_discrete_sequence=['#00CC96'], marginal = "box")

fig_hist_family_size.update_layout(
    title={
        "text": "Customers' family size histogram",
        'font': {'size': 16}
    },
    xaxis_title = "Family Size",
    yaxis_title = 'Count',
    xaxis=dict(
        title_font_size=16
    ),
    yaxis=dict(
        title_font_size=16
    ),
    legend_title_text = 'Number of Family members',
    legend=dict(
        font_size=14,
        title_font_size=14
    ),
    template='simple_white'
)



fig_hist_family_size.show()

In [14]:
fig_hist_work_exp = px.histogram(data_train, x= 'Work_Experience', nbins = 5, color_discrete_sequence=['#3366CC'], marginal = "box")

fig_hist_work_exp.update_layout(
    title={
        "text": "Customers' work experience histogram",
        'font': {'size': 16}
    },
    xaxis_title = "Work Experience (Years)",
    yaxis_title = 'Count',
    xaxis=dict(
        title_font_size=16
    ),
    yaxis=dict(
        title_font_size=16
    ),
    legend_title_text = 'Years of work experience',
    legend=dict(
        font_size=14,
        title_font_size=14
    ),
    template='simple_white'
)



fig_hist_work_exp.show()

In [15]:
data_train['Profession'] = data_train['Profession'].astype(str)

group_profession = data_train['Profession'].value_counts().reset_index(name = 'count')

fig_professions = px.bar(group_profession, x = 'Profession', y = 'count',
                         title = "Customers' Professions",
                         labels = {'count': 'Count', 'Profession': 'Profession'},
                         color = 'Profession',
                         text = 'count',
                         text_auto=True,
                         color_discrete_sequence=px.colors.qualitative.Pastel,
                         template='simple_white')

fig_professions.show()





In [16]:
group_gender = data_train['Gender'].value_counts().reset_index(name = 'count')

fig_gender = px.bar(group_gender, x = 'Gender', y = 'count',
                         title = "Customers' Gender",
                         labels = {'count': 'Count', 'Gender': 'Gender'},
                         color = 'Gender',
                         text = 'count',
                         text_auto=True,
                         color_discrete_sequence=px.colors.qualitative.Dark2,
                         template='simple_white')

fig_gender.show()





In [17]:
group_married = data_train['Ever_Married'].value_counts().reset_index(name = 'count')

fig_married = px.bar(group_married, x = 'Ever_Married', y = 'count',
                         title = "Customers' Marital Status",
                         labels = {'count': 'Count', 'Ever_Married': 'Have ever married?'},
                         color = 'Ever_Married',
                         text = 'count',
                         text_auto=True,
                         color_discrete_sequence=px.colors.qualitative.Set1,
                         template = 'simple_white')

fig_married.show()





In [18]:
group_spending = data_train['Spending_Score'].value_counts().reset_index(name = 'count')

fig_spending = px.bar(group_spending, x = 'Spending_Score', y = 'count',
                         title = "Customers' Spending Score",
                         labels = {'count': 'Count', 'Spending_Score': 'Spending Score'},
                         color = 'Spending_Score',
                         text = 'count',
                         text_auto=True,
                         color_discrete_sequence=px.colors.qualitative.Antique,
                         template = 'simple_white')

fig_spending.show()





In [19]:
correlations = data_train[['Age', 'Work_Experience', 'Family_Size']].corr()

fig_corr = go.Figure(data = go.Heatmap(z = correlations.values,
                                       x = correlations.columns,
                                       y = correlations.columns,
                                       colorscale='sunset'))

fig_corr.update_layout(title = 'Correlation matrix of numerical variables', xaxis_nticks = 36, template = 'simple_white')
fig_corr.show()

## Feature encoding and scaling

In [20]:
categorical_variables = ['Gender', 'Ever_Married', 'Graduated', 'Profession', 'Spending_Score', 'Segmentation']

# One LabelEncoder is refit on every column, so after the loop
# encoder.classes_ only holds the mapping of the last encoded column
encoder = LabelEncoder()

for col in categorical_variables:
    if col in data_train.columns:
        data_train[col] = encoder.fit_transform(data_train[col])
    else:
        print(f"Column '{col}' not found in dataframe.")

data_train.head()

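| | ID | Gender | Ever_Married | Age | Graduated | Profession | Work_Experience | Spending_Score | Family_Size | Segmentation |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 462809 | 1 | 0 | 22 | 0 | 5 | 1 | 2 | 4 | 3 |
| 1 | 462643 | 0 | 1 | 38 | 1 | 2 | 2 | 0 | 3 | 0 |
| 2 | 466315 | 0 | 1 | 67 | 1 | 2 | 1 | 2 | 1 | 1 |
| 3 | 461735 | 1 | 1 | 67 | 1 | 7 | 0 | 1 | 2 | 1 |
| 4 | 462669 | 0 | 1 | 40 | 1 | 3 | 2 | 1 | 6 | 0 |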
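
In the output above, `Spending_Score` is encoded as Average = 0, High = 1, Low = 2, because `LabelEncoder` sorts the classes alphabetically. If the natural order Low < Average < High matters for later steps, an explicit mapping is one way to keep it. The sketch below is a minimal illustration on a made-up frame; the names `toy`, `spending_order` and `Spending_Score_ordinal` are assumptions, not part of the original notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the Spending_Score column
toy = pd.DataFrame({"Spending_Score": ["Low", "Average", "High", "Low"]})

# LabelEncoder assigns codes in alphabetical order of the class names
alphabetical = LabelEncoder().fit_transform(toy["Spending_Score"])

# Explicit mapping that preserves the ordinal meaning (hypothetical helper dict)
spending_order = {"Low": 0, "Average": 1, "High": 2}
toy["Spending_Score_ordinal"] = toy["Spending_Score"].map(spending_order)

print(list(alphabetical))
print(toy)
```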


In [21]:
standard_scaler = StandardScaler()

data_train[numerical_columns] = standard_scaler.fit_transform(data_train[numerical_columns])

data_train.head()

| | ID | Gender | Ever_Married | Age | Graduated | Profession | Work_Experience | Spending_Score | Family_Size | Segmentation |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 462809 | 1 | 0 | -1.284623 | 0 | 5 | -0.487443 | 2 | 0.785536 | 3 |
| 1 | 462643 | 0 | 1 | -0.327151 | 1 | 2 | -0.178099 | 0 | 0.122735 | 0 |
| 2 | 466315 | 0 | 1 | 1.408268 | 1 | 2 | -0.487443 | 2 | -1.202868 | 1 |
| 3 | 461735 | 1 | 1 | 1.408268 | 1 | 7 | -0.796787 | 1 | -0.540066 | 1 |
| 4 | 462669 | 0 | 1 | -0.207467 | 1 | 3 | -0.178099 | 1 | 2.111139 | 0 |


## Segmentation techniques

In [22]:
# Fit K-Means for k = 1..19 and record the SSE (inertia) for each k
sse = []

for k in range(1,20):
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(data_train[numerical_columns])
    sse.append(kmeans.inertia_)

In [23]:
fig_sse = go.Figure()

# The x values must cover every k that was fitted (one SSE value per k)
fig_sse.add_trace(go.Scatter(x = list(range(1, len(sse) + 1)), y = sse, mode = 'lines+markers'))

fig_sse.update_layout(
    title = "Elbow Method For Optimal k",
    xaxis_title = 'Number of Clusters',
    yaxis_title = 'SSE',
    xaxis = dict(tickmode = 'linear'),
    template = 'simple_white'
)

fig_sse.show()
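
The elbow in an SSE curve can be hard to read, so a complementary check is to compute the silhouette score for each candidate k and look for the highest value. The sketch below is a minimal illustration on synthetic data from `make_blobs` (not the customer dataset); the range of k and `n_init=10` are assumptions chosen for the example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic 2-D data standing in for the scaled numerical features
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=42)

scores = {}
for k in range(2, 8):  # the silhouette score needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# Candidate k with the highest average silhouette
best_k = max(scores, key=scores.get)
print(scores)
print(f"Best k by silhouette: {best_k}")
```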

In [24]:
optimal_k = 4

kmeans = KMeans(n_clusters=optimal_k, random_state = 42)

kmeans.fit(data_train[numerical_columns])

data_train['Cluster'] = kmeans.labels_
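
Since the stated goal is to assign the new potential customers to a group, one possible approach is to reuse the scaler and the K-Means model fitted on the training data, rather than refitting them on the new data. The sketch below illustrates that idea on random arrays; `train` and `new` are hypothetical stand-ins for the scaled training features and the new customers, and `n_init=10` is an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
train = rng.normal(size=(200, 3))  # stand-in for Age, Work_Experience, Family_Size
new = rng.normal(size=(10, 3))     # stand-in for the new customers

# Fit the scaler and the clustering model on the training data only
scaler = StandardScaler().fit(train)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42).fit(scaler.transform(train))

# Apply the same transformation, then assign each new row to its nearest centroid
new_clusters = kmeans.predict(scaler.transform(new))
print(new_clusters)
```

Using `transform` and `predict` (instead of `fit_transform` and `fit_predict`) keeps the new customers on the same scale and in the same cluster numbering as the training set.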

In [25]:
data_train['Cluster'].value_counts()

Cluster
2    2775
1    1906
3    1818
0    1569
Name: count, dtype: int64

In [26]:
data_train.head()

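| | ID | Gender | Ever_Married | Age | Graduated | Profession | Work_Experience | Spending_Score | Family_Size | Segmentation | Cluster |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 462809 | 1 | 0 | -1.284623 | 0 | 5 | -0.487443 | 2 | 0.785536 | 3 | 1 |
| 1 | 462643 | 0 | 1 | -0.327151 | 1 | 2 | -0.178099 | 0 | 0.122735 | 0 | 2 |
| 2 | 466315 | 0 | 1 | 1.408268 | 1 | 2 | -0.487443 | 2 | -1.202868 | 1 | 3 |
| 3 | 461735 | 1 | 1 | 1.408268 | 1 | 7 | -0.796787 | 1 | -0.540066 | 1 | 3 |
| 4 | 462669 | 0 | 1 | -0.207467 | 1 | 3 | -0.178099 | 1 | 2.111139 | 0 | 1 |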


## Evaluation of segmentation

In [27]:
from sklearn.metrics import silhouette_score

silhouette_avg = silhouette_score(data_train[numerical_columns], data_train['Cluster'])

print(f"Silhouette Score for {optimal_k} clusters: {silhouette_avg}")

Silhouette Score for 4 clusters: 0.35918045725184217
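
The silhouette score only measures how compact and well separated the clusters are. Because the training data also carries the sales team's `Segmentation` labels, it may be informative to check how the clusters line up with those segments. The sketch below shows one way to do this with a contingency table and the adjusted Rand index, using a small made-up frame (`toy` is an assumption for illustration).

```python
import pandas as pd
from sklearn.metrics import adjusted_rand_score

# Toy stand-in: known segments next to cluster assignments
toy = pd.DataFrame({
    "Segmentation": ["A", "A", "A", "B", "B", "B", "C", "C", "D", "D"],
    "Cluster":      [0,   0,   1,   1,   1,   1,   2,   2,   3,   3],
})

# Rows are the known segments, columns the clusters, cells the customer counts
contingency = pd.crosstab(toy["Segmentation"], toy["Cluster"])

# 1.0 means identical groupings; values near 0 mean agreement close to chance
ari = adjusted_rand_score(toy["Segmentation"], toy["Cluster"])

print(contingency)
print(f"Adjusted Rand index: {ari:.3f}")
```

The index is invariant to how the clusters are numbered, so cluster 0 does not need to correspond to segment A for the comparison to be valid.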


In [28]:
fig_scatter_matrix = px.scatter_matrix(
    data_train,
    dimensions=['Age', 'Gender', 'Family_Size'],
    color = 'Cluster',
    title = "Scatter Matrix of Features by Cluster"
)

fig_scatter_matrix.show()

In [30]:
feature_cols = ['Gender', 'Ever_Married', 'Age', 'Graduated', 'Profession', 'Work_Experience', 'Spending_Score', 'Family_Size', 'Segmentation', 'Cluster']

data_features = data_train[feature_cols]

cluster_summary = data_features.groupby('Cluster').mean().reset_index()

fig_bar_summary = px.bar(
    cluster_summary.melt(id_vars = 'Cluster', var_name = 'Feature', value_name='Value'),
    x = 'Feature',
    y = 'Value',
    color = 'Cluster',
    barmode='group',
    title = "Cluster Feature Summary"
)

fig_bar_summary.show()

In [39]:
from sklearn.cluster import DBSCAN

dbscan = DBSCAN(eps=0.2, min_samples=5)

data_train['Cluster_DBSCAN'] = dbscan.fit_predict(data_train[numerical_columns])
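
DBSCAN results depend strongly on `eps`, and points that do not belong to any dense region receive the label `-1` (noise). A common heuristic for choosing `eps` is to inspect the sorted distances to the k-th nearest neighbour. The sketch below illustrates this on synthetic `make_blobs` data; taking the 90th percentile as `eps_candidate` is an assumption made for the example, not a rule.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

# Synthetic 2-D data standing in for the scaled numerical features
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=42)

min_samples = 5

# Querying the fitted data itself returns each point as its own first
# neighbour, which matches DBSCAN counting the point in min_samples
nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
distances, _ = nn.kneighbors(X)
k_distances = np.sort(distances[:, -1])

# Hypothetical choice: a high percentile of the k-distance curve
eps_candidate = float(np.percentile(k_distances, 90))

labels = DBSCAN(eps=eps_candidate, min_samples=min_samples).fit_predict(X)
n_noise = int((labels == -1).sum())
n_clusters = len(set(labels.tolist()) - {-1})

print(f"eps={eps_candidate:.3f}, clusters={n_clusters}, noise points={n_noise}")
```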


In [33]:
from sklearn.mixture import GaussianMixture
gmm = GaussianMixture(n_components=optimal_k, random_state=42)

data_train['Cluster_GMM'] = gmm.fit_predict(data_train[numerical_columns])
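
For a Gaussian mixture, the number of components does not have to be borrowed from K-Means. One option is to compare the Bayesian information criterion (BIC) across several values and prefer the lowest. The sketch below is a minimal illustration on synthetic `make_blobs` data; the candidate range 1 to 6 is an assumption.

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic 2-D data standing in for the scaled numerical features
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=42)

bic = {}
for n in range(1, 7):
    gmm = GaussianMixture(n_components=n, random_state=42).fit(X)
    bic[n] = gmm.bic(X)  # lower BIC indicates a better fit/complexity trade-off

best_n = min(bic, key=bic.get)
print(bic)
print(f"Lowest BIC at n_components={best_n}")
```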

In [36]:
fig_kmeans = px.scatter(data_train, x = 'Age', y ='Work_Experience', color = 'Cluster', title = 'K-Means Clusters')
fig_kmeans.show()

In [41]:
fig_dbscan = px.scatter(data_train, x = 'Age', y ='Work_Experience', color = 'Cluster_DBSCAN', title = 'DBSCAN Clusters')
fig_dbscan.show()

In [38]:
fig_gmm = px.scatter(data_train, x='Age', y='Work_Experience', color='Cluster_GMM', title='GMM Clusters')
fig_gmm.show()

## Customer profiling

## Implementation of segmentation strategy

## Conclusions