# Cluster Analysis

Customer segmentation groups customers by shared characteristics and behaviors so that each group can be understood on its own terms. In this project, **Principal Component Analysis (PCA)** was used for dimensionality reduction and **KMeans** was used to group customers into clusters. The resulting segmentation aims to uncover patterns that can inform targeted strategies and decision-making.

This phase focuses on analyzing the clusters to understand their distinct features and distributions. By exploring the characteristics that define each group and differentiate them from others, we aim to:

- Identify key attributes that influence the formation of clusters.
- Interpret the patterns and preferences within each segment.
- Highlight differences across clusters to provide actionable insights.

The findings from this analysis will serve as a foundation for strategic business decisions, such as personalized marketing, product recommendations, and resource allocation.
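As a reminder of how labels such as `Kmeans_labels` can be produced, here is a minimal sketch of a PCA plus KMeans pipeline with scikit-learn. It is not the project's actual clustering code: the data is synthetic, and the scaling step, the number of components (2), and the number of clusters (4) are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for numeric customer features (illustrative values only).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Income": rng.normal(50_000, 15_000, size=200),
    "Recency": rng.integers(0, 100, size=200),
    "MntWines": rng.integers(0, 1_000, size=200),
    "MntMeatProducts": rng.integers(0, 800, size=200),
})

# Assumed settings: standardize, keep 2 principal components, fit 4 clusters.
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=2, random_state=0),
    KMeans(n_clusters=4, n_init=10, random_state=0),
)

# fit_predict runs every step and returns one cluster label per row.
labels = pipeline.fit_predict(toy)
toy["Kmeans_labels"] = labels
```

Scaling before PCA keeps large-valued columns such as `Income` from dominating the components, which is why it is included in this sketch.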


In [42]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import sweetviz as sv

%run ../customer_personality_analysis/utils/pandas_explorer.py

In [43]:
path = '../customer_personality_analysis/data/clustered.csv'
df = pd.read_csv(path)
df.head()

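| Unnamed: 0 | Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Recency | MntWines | MntFruits | MntMeatProducts | ... | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Response | YearJoining | QuarterJoining | Kmeans_labels |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1957 | Graduation | Single | 58138.0 | 0 | 0 | 58 | 635 | 88 | 546 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 2012 | 3 | 3 |
| 1 | 1954 | Graduation | Single | 46344.0 | 1 | 1 | 38 | 11 | 1 | 6 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2014 | 1 | 3 |
| 2 | 1965 | Graduation | Together | 71613.0 | 0 | 0 | 26 | 426 | 49 | 127 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2013 | 3 | 0 |
| 3 | 1984 | Graduation | Together | 26646.0 | 1 | 0 | 26 | 11 | 4 | 20 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2014 | 1 | 0 |
| 4 | 1981 | PhD | Married | 58293.0 | 1 | 0 | 94 | 173 | 43 | 118 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2014 | 1 | 7 |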


## Selecting features used for clustering

In [47]:
features = [
    'Year_Birth',
    'Education',
    'Marital_Status',
    'Income',
    'Kidhome',
    'Teenhome',
    'YearJoining',
    'Kmeans_labels'
]

df[features].head()

|   | Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | YearJoining | Kmeans_labels |
|---|---|---|---|---|---|---|---|---|
| 0 | 1957 | Graduation | Single | 58138.0 | 0 | 0 | 2012 | 3 |
| 1 | 1954 | Graduation | Single | 46344.0 | 1 | 1 | 2014 | 3 |
| 2 | 1965 | Graduation | Together | 71613.0 | 0 | 0 | 2013 | 0 |
| 3 | 1984 | Graduation | Together | 26646.0 | 1 | 0 | 2014 | 0 |
| 4 | 1981 | PhD | Married | 58293.0 | 1 | 0 | 2014 | 7 |
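With the feature subset in place, one way to start comparing clusters is a per-cluster summary. The sketch below is only an illustration: it rebuilds the five preview rows shown above as a small frame named `toy` (a hypothetical name) so that it runs on its own, and the choice of statistics (mean, median, category shares) is an assumption rather than the notebook's method. On the real data, `df[features]` would take the place of `toy`.

```python
import pandas as pd

# The five preview rows from the output above, rebuilt so the sketch is self-contained.
toy = pd.DataFrame({
    "Year_Birth": [1957, 1954, 1965, 1984, 1981],
    "Education": ["Graduation", "Graduation", "Graduation", "Graduation", "PhD"],
    "Marital_Status": ["Single", "Single", "Together", "Together", "Married"],
    "Income": [58138.0, 46344.0, 71613.0, 26646.0, 58293.0],
    "Kidhome": [0, 1, 0, 1, 1],
    "Teenhome": [0, 1, 0, 0, 0],
    "YearJoining": [2012, 2014, 2013, 2014, 2014],
    "Kmeans_labels": [3, 3, 0, 0, 7],
})

numeric_cols = ["Year_Birth", "Income", "Kidhome", "Teenhome"]

# Number of customers in each cluster.
cluster_sizes = toy["Kmeans_labels"].value_counts().sort_index()

# Mean and median of the numeric features, one row per cluster.
numeric_profile = toy.groupby("Kmeans_labels")[numeric_cols].agg(["mean", "median"])

# Share of each education level within a cluster (each row sums to 1).
education_share = pd.crosstab(
    toy["Kmeans_labels"], toy["Education"], normalize="index"
)
```

Five rows are far too few to draw conclusions from; the point is only the shape of the summary tables.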


## Separating clusters into different datasets

In [49]:
# clusters = df['Kmeans_labels'].sort_values().unique()
# for cluster in clusters:
#     plt.figure(figsize=(24,6))
#     for index,feature in enumerate(df[features].drop(columns=['Kmeans_labels']).columns):
#         plt.subplot(1,7,index + 1)
#         df[df['Kmeans_labels'] == cluster].explorer.categorical_dist(feature)
#         plt.title(f"{feature}'s distribution - cluster:{cluster}")
#         plt.tight_layout() 
#         plt.yticks(range(0,501,100))
#     plt.show()
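A minimal sketch of the separation this section's title describes, assuming a plain dictionary of DataFrames keyed by cluster label is enough. The frame `toy` and the name `cluster_frames` are hypothetical and hold only the five preview rows shown earlier; on the real data, `df[features]` would be grouped instead.

```python
import pandas as pd

# Small self-contained frame built from the preview rows shown earlier.
toy = pd.DataFrame({
    "Year_Birth": [1957, 1954, 1965, 1984, 1981],
    "Education": ["Graduation", "Graduation", "Graduation", "Graduation", "PhD"],
    "Income": [58138.0, 46344.0, 71613.0, 26646.0, 58293.0],
    "Kmeans_labels": [3, 3, 0, 0, 7],
})

# One DataFrame per cluster label; reset_index gives each subset its own 0..n-1 index.
cluster_frames = {
    label: group.reset_index(drop=True)
    for label, group in toy.groupby("Kmeans_labels")
}

# Example access: all customers assigned to cluster 3.
cluster_3 = cluster_frames[3]
```

Keeping the subsets in a dictionary avoids one variable per cluster and makes it easy to loop over the clusters when plotting or summarizing.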