# 🧠 Customer Segmentation Analysis with Pandas

## 1. Load and Preview Dataset
Read the dataset using `pd.read_excel()` and check its structure.

In [84]:
import pandas as pd 
import numpy as np 

In [85]:
df = pd.read_excel(r"C:\Users\mirza\Desktop\customer_segmentation\data\Segmentation.xlsx")

## 2. Understand the Structure of the Dataset
Explore the basic structure, null values, and statistical overview.


In [86]:
df.head()

Unnamed: 0,ID,Gender,Ever_Married,Age,Graduated,Profession,Work_Experience,Spending_Score,Family_Size,Segmentation
0,462809,Male,No,22,No,Healthcare,1.0,Low,4.0,D
1,462643,Female,Yes,38,Yes,Engineer,,Average,3.0,A
2,466315,Female,Yes,67,Yes,Engineer,1.0,Low,1.0,B
3,461735,Male,Yes,67,Yes,Lawyer,0.0,High,2.0,B
4,462669,Female,Yes,40,Yes,Entertainment,,High,6.0,A


In [87]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8068 entries, 0 to 8067
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   ID               8068 non-null   int64  
 1   Gender           8068 non-null   object 
 2   Ever_Married     7928 non-null   object 
 3   Age              8068 non-null   int64  
 4   Graduated        7990 non-null   object 
 5   Profession       7944 non-null   object 
 6   Work_Experience  7239 non-null   float64
 7   Spending_Score   8068 non-null   object 
 8   Family_Size      7733 non-null   float64
 9   Segmentation     8068 non-null   object 
dtypes: float64(2), int64(2), object(6)
memory usage: 630.4+ KB


In [88]:
df.describe()

Unnamed: 0,ID,Age,Work_Experience,Family_Size
count,8068.0,8068.0,7239.0,7733.0
mean,463479.214551,43.466906,2.641663,2.850123
std,2595.381232,16.711696,3.406763,1.531413
min,458982.0,18.0,0.0,1.0
25%,461240.75,30.0,0.0,2.0
50%,463472.5,40.0,1.0,3.0
75%,465744.25,53.0,4.0,4.0
max,467974.0,89.0,14.0,9.0


In [89]:
df.shape

(8068, 10)

In [90]:
df.sample(5)

Unnamed: 0,ID,Gender,Ever_Married,Age,Graduated,Profession,Work_Experience,Spending_Score,Family_Size,Segmentation
6799,460257,Female,No,28,Yes,Healthcare,1.0,Low,6.0,D
3532,462879,Male,Yes,74,Yes,Lawyer,1.0,High,2.0,A
1294,461103,Female,No,30,No,Engineer,1.0,Low,5.0,D
17,461644,Male,No,31,No,Healthcare,1.0,Low,6.0,B
6308,467482,Male,Yes,38,Yes,Executive,0.0,High,3.0,C


## 3. Handling Missing Values
Some imputations could be improved by considering correlated features (e.g. filling `Work_Experience` using `Age`, `Profession`).


In [91]:
df.isnull().sum()

ID                   0
Gender               0
Ever_Married       140
Age                  0
Graduated           78
Profession         124
Work_Experience    829
Spending_Score       0
Family_Size        335
Segmentation         0
dtype: int64

## 4. Correlation Matrix
Understanding relationships between numerical features (e.g. Age vs Work Experience).


In [131]:
df['Work_Experience'] = df.groupby(['Graduated', 'Profession'])['Work_Experience']\
                          .transform(lambda x: x.fillna(x.median()))

In [132]:
df['Family_Size'] = df.groupby('Ever_Married')['Family_Size']\
                      .transform(lambda x: x.fillna(x.median()))


## 5. Group Analysis on Spending Score
Compare average Spending Scores across:
- Gender
- Marital Status
- Profession


In [134]:
df['Ever_Married'].fillna(df['Ever_Married'].mode()[0], inplace=True)
df['Graduated'].fillna(df['Graduated'].mode()[0], inplace=True)
df['Profession'].fillna(df['Profession'].mode()[0], inplace=True)

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df['Ever_Married'].fillna(df['Ever_Married'].mode()[0], inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df['Graduated'].fillna(df['Graduated'].mode()[0], inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermedia

## Sütunların unikal dəyərlərini yoxlamaq

In [135]:
df['Graduated'].value_counts()

Graduated
Yes    5046
No     3022
Name: count, dtype: int64

In [136]:
df['Profession'].unique()

array(['Healthcare', 'Engineer', 'Lawyer', 'Entertainment', 'Artist',
       'Executive', 'Doctor', 'Homemaker', 'Marketing'], dtype=object)

## Ən çox rast gəlinən peşəni tap (mode)

In [137]:
df['Profession'].mode()[0]

'Artist'

In [138]:
df['Segmentation'].value_counts() #Hər seqmentdə neçə nəfər var?

Segmentation
D    2268
A    1972
C    1970
B    1858
Name: count, dtype: int64

In [139]:
df.groupby('Profession')['Segmentation'].value_counts().unstack().fillna(0)
#peşə sahibləri hansı seqmentə düşür?

Segmentation,A,B,C,D
Profession,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Artist,591,778,1083,188
Doctor,199,143,140,206
Engineer,259,189,75,176
Entertainment,365,221,148,215
Executive,125,183,175,116
Healthcare,106,101,146,979
Homemaker,73,55,28,90
Lawyer,197,158,140,128
Marketing,57,30,35,170


In [140]:
df.groupby('Gender')['Segmentation'].value_counts().unstack().fillna(0)
#Kişilər və qadınlar hansı seqmentlərə düşürlər?

Segmentation,A,B,C,D
Gender,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Female,909,861,922,959
Male,1063,997,1048,1309


In [141]:
df.groupby('Segmentation')['Age'].mean()
#hər seqment üçün ortalama yaş nə qədərdir?

Segmentation
A    44.924949
B    48.200215
C    49.144162
D    33.390212
Name: Age, dtype: float64

In [142]:
df.groupby('Segmentation')[['Family_Size', 'Work_Experience']].mean()
#Hər seqmentdə ortalama ailə üzvü sayı

Unnamed: 0_level_0,Family_Size,Work_Experience
Segmentation,Unnamed: 1_level_1,Unnamed: 2_level_1
A,2.466531,2.851665
B,2.703983,2.405382
C,2.975127,2.272314
D,3.216931,2.973456


In [143]:
df[df['Spending_Score'] == 'High'].groupby('Segmentation')['Gender'].value_counts()


Series([], Name: count, dtype: int64)

In [144]:
df.dtypes


ID                      int64
Gender                 object
Ever_Married           object
Age                     int64
Graduated              object
Profession             object
Work_Experience       float64
Spending_Score        float64
Family_Size           float64
Segmentation           object
Spending_Score_Num      int64
dtype: object

In [145]:
df['Segmentation'].value_counts(normalize=True) * 100
# Segmentlərin sayını və faizlərini tapaq


Segmentation
D    28.111056
A    24.442241
C    24.417452
B    23.029251
Name: proportion, dtype: float64

In [146]:
df2 = pd.read_excel(r"C:\Users\mirza\Downloads\Segmentation.xlsx")
df['Spending_Score'] = df2['Spending_Score'].values


In [147]:
print(df2['Spending_Score'].unique())


['Low' 'Average' 'High']


In [148]:
mapping = {
    'Low': 1,
    'Average': 2,
    'High': 3
}

df['Spending_Score_Num'] = df2['Spending_Score'].map(mapping)

df.groupby('Profession')['Spending_Score_Num'].mean().sort_values(ascending=False)


Profession
Executive        2.454090
Lawyer           2.069021
Artist           1.587500
Engineer         1.487840
Homemaker        1.455285
Entertainment    1.433087
Doctor           1.347384
Marketing        1.284247
Healthcare       1.099099
Name: Spending_Score_Num, dtype: float64

In [149]:
# Peşəyə görə neçə nəfər hansı səviyyədədir
df2.groupby(['Profession', 'Spending_Score']).size().unstack().fillna(0)


Spending_Score,Average,High,Low
Profession,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Artist,1011,242,1263
Doctor,177,31,480
Engineer,221,60,418
Entertainment,319,46,584
Executive,75,398,126
Healthcare,42,45,1245
Homemaker,60,26,160
Lawyer,18,324,281
Marketing,17,33,242


In [150]:
# Ümumi paylanma
df2['Spending_Score'].value_counts()


Spending_Score
Low        4878
Average    1974
High       1216
Name: count, dtype: int64