
# Customer Segmentation (Business & Teaching Version)

This notebook demonstrates **end-to-end customer segmentation** using clustering.

Key preprocessing steps:
- Identifier columns are dropped so they do not distort distances
- Categorical features are one-hot encoded
- Numerical features are standardized



## 1. Problem Definition

Segment customers into meaningful groups using **unsupervised learning**.



## 2. Import Libraries


In [None]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer
from sklearn.metrics import silhouette_score



## 3. Load Dataset


In [None]:

df = pd.read_csv("customer_segmentation.csv")
df.head()



## 4. Drop Identifier Columns


In [None]:

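# Heuristic: flag any column whose name contains "id" (case-insensitive).
# A substring test also matches names such as "Paid" or "Widowed",
# so review the list before dropping.
id_columns = [c for c in df.columns if 'id' in c.lower()]
df_model = df.drop(columns=id_columns)

id_columns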
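
The substring test above is convenient but broad. The sketch below compares it with a stricter name-based rule on a set of hypothetical column names (none of them are taken from the dataset); `looks_like_id` is an illustrative helper, not part of pandas.

```python
import re
import pandas as pd

# Hypothetical column names, chosen to show where a substring test goes wrong.
toy = pd.DataFrame(columns=["CustomerID", "order_id", "Paid", "Widowed", "Age"])


def looks_like_id(name: str) -> bool:
    """Return True for names such as 'id', 'order_id', 'Customer_ID', 'CustomerID'."""
    # "id" as a whole name or as a trailing token after an underscore
    if re.search(r"(^|_)id$", name, flags=re.IGNORECASE):
        return True
    # Camel-case suffix such as "CustomerID" or "customerId" (case-sensitive)
    return re.search(r"[a-z]I[Dd]$", name) is not None


# Substring rule: also catches "Paid" and "Widowed"
substring_hits = [c for c in toy.columns if "id" in c.lower()]

# Stricter rule: only names that end in an id-like token
strict_hits = [c for c in toy.columns if looks_like_id(c)]

print(substring_hits)
print(strict_hits)
```

Whichever rule is used, it is worth printing the matched columns and checking them by eye before dropping anything.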



## 5. Separate Feature Types


In [None]:

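# Use the generic 'number' selector so int32/float32 columns are not missed,
# and treat both object and category dtypes as categorical.
num_features = df_model.select_dtypes(include='number').columns.tolist()
cat_features = df_model.select_dtypes(include=['object', 'category']).columns.tolist()

num_features, cat_features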



## 6. Preprocessing (Scaling + Encoding)


In [None]:

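# Standardize numeric columns and one-hot encode categorical ones.
# All categories are kept (no drop='first'): dropping a level is meant for
# linear models and would make distances between categories unequal.
preprocessor = ColumnTransformer([
    ('num', StandardScaler(), num_features),
    ('cat', OneHotEncoder(), cat_features)
])

X = preprocessor.fit_transform(df_model)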



## 7. Choose Number of Clusters


In [None]:

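# Within-cluster sum of squares (inertia) for k = 2..10.
# n_init is set explicitly so results do not depend on the scikit-learn default.
k_values = range(2, 11)
wcss = []
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    km.fit(X)
    wcss.append(km.inertia_)

plt.plot(k_values, wcss, marker='o')
plt.title('Elbow Method')
plt.xlabel('Number of clusters (k)')
plt.ylabel('WCSS (inertia)')
plt.show()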


In [None]:

# Silhouette score per k (ranges from -1 to 1; higher is better)
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = km.fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))
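
Instead of reading the printed scores by eye, the best-scoring k can be picked programmatically. This is a minimal sketch on synthetic, well-separated blobs (an assumption made so the example is self-contained); on real customer data the scores are usually closer together and should be weighed against the elbow plot and business needs.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three clearly separated groups (illustrative only)
toy_X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.6,
    random_state=42,
)

# Silhouette score for each candidate k
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(toy_X)
    scores[k] = silhouette_score(toy_X, labels)

# Pick the k with the highest score
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```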



## 8. Train Final Model


In [None]:

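# k=3 is assumed here; set it to the value suggested by the elbow and silhouette results above.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
df['Segment'] = kmeans.fit_predict(X)

df.head()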



## 9. Business Interpretation


In [None]:

# Average of each numeric feature per segment, in original (unscaled) units
df.groupby('Segment')[list(num_features)].mean()
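
Numeric means cover only part of a segment profile. Segment sizes and the mix of categorical values per segment are often just as useful. The sketch below uses a small hand-made frame; the `region` and `income` columns are hypothetical.

```python
import pandas as pd

# Toy labelled data (hypothetical columns and values)
toy = pd.DataFrame({
    "Segment": [0, 0, 1, 1, 1, 2],
    "region": ["north", "south", "north", "north", "west", "south"],
    "income": [30, 50, 70, 80, 90, 40],
})

# How many customers fall into each segment
sizes = toy["Segment"].value_counts().sort_index()

# Share of each region within a segment (each row sums to 1)
region_share = pd.crosstab(toy["Segment"], toy["region"], normalize="index")

print(sizes)
print(region_share.round(2))
```

A very small segment, or one dominated by a single category, may point to an outlier group rather than a marketable segment.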



## 10. Teaching Summary

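- ID columns are removed because they carry no behavioral signal and would distort distance calculations
- Categorical features are one-hot encoded and numerical features are standardized, so all features contribute on a comparable scale
- `ColumnTransformer` applies scaling and encoding together in a single, reproducible preprocessing step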
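
One way to make the workflow reusable is to chain preprocessing and clustering in a scikit-learn `Pipeline`, so new customers go through exactly the same transformations before being assigned a segment. This is a sketch under assumed column names and toy values, with k=2 chosen only because the toy data has two obvious groups.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data with two obvious groups (hypothetical columns and values)
toy = pd.DataFrame({
    "age": [22, 25, 27, 55, 60, 63],
    "income": [28_000, 31_000, 30_000, 82_000, 90_000, 87_000],
    "region": ["north", "north", "south", "west", "west", "south"],
})

segmenter = Pipeline([
    ("prep", ColumnTransformer([
        ("num", StandardScaler(), ["age", "income"]),
        ("cat", OneHotEncoder(), ["region"]),
    ])),
    ("cluster", KMeans(n_clusters=2, n_init=10, random_state=42)),
])

# Fit preprocessing and clustering in one call
toy_labels = segmenter.fit_predict(toy)

# Assign a new customer using the already-fitted scaler, encoder, and centroids
new_customer = pd.DataFrame({"age": [24], "income": [29_000], "region": ["north"]})
new_label = segmenter.predict(new_customer)

print(toy_labels)
print(new_label)
```

Keeping the steps in one object also means a single artifact can be saved and reloaded later, rather than a separate scaler, encoder, and model.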
