# **Project-4**

***Project Title:*** Customer Segmentation

***Project Description:*** In this project, you will build a machine learning model to segment customers based on their demographics and spending behavior. The data set includes information on customers' age, gender, annual income, and spending score (a metric assigned by the mall based on how much customers spend and how often they visit).

***Dataset Details:*** The data set contains 200 records of customers.

***Dataset Location:*** Canvas -> Modules -> Week 12 -> Datasets -> **"customers.csv"**.

***Tasks:***

1) *Data Exploration and Preprocessing:* You will explore the data set, handle missing values, perform feature engineering, and preprocess the data to get it ready for model building.

2) *Model Building:* You will train and evaluate several unsupervised clustering models on the preprocessed data set, including k-means clustering and DBSCAN.

3) *Model Evaluation:* You will evaluate the clustering results using silhouette and inertia scores. You will also analyze the resulting customer segments and interpret their characteristics.

4) *Deployment:* Once you have identified the customer segments, you can use them to personalize marketing campaigns, improve customer retention, and optimize product recommendations.

This project will give you hands-on experience with unsupervised clustering, data preprocessing, and model evaluation. It also has real-world applications in marketing and e-commerce, where customer segmentation can help businesses tailor their offerings to different customer groups.
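Task 2 names DBSCAN alongside k-means. As a point of reference, here is a minimal sketch of a DBSCAN fit. It runs on synthetic points from `make_blobs` rather than on `customers.csv`, and the `eps` and `min_samples` values are illustrative assumptions that would need tuning on the real data.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 200 points, like the 200 customer records.
points, _ = make_blobs(n_samples=200, centers=3, cluster_std=0.6, random_state=0)

# DBSCAN is distance based, so features are standardized first.
scaled = StandardScaler().fit_transform(points)

# eps and min_samples are illustrative guesses, not tuned values.
dbscan = DBSCAN(eps=0.3, min_samples=5)
labels = dbscan.fit_predict(scaled)

# DBSCAN marks noise points with the label -1; they are not a cluster.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(f"Clusters found: {n_clusters}, noise points: {n_noise}")

# The silhouette score needs at least two clusters; noise points are excluded.
mask = labels != -1
if n_clusters >= 2:
    sil = silhouette_score(scaled[mask], labels[mask])
    print(f"Silhouette (noise excluded): {sil:.3f}")
```

Unlike k-means, DBSCAN does not take the number of clusters as input and has no `inertia_` attribute, so the silhouette score is the more natural common metric when comparing the two.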

In [9]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

customers = pd.read_csv("customers.csv")

missingValues = customers.isna()
noMissingValues = True  # flips to False as soon as a missing cell is found

# Report every missing cell by row index and column name
for col in missingValues.columns:
    for index, isMissing in missingValues[col].items():
        if isMissing:
            print(f"Missing value at index {index} in column {col}")
            noMissingValues = False

if noMissingValues:
    print("There are no missing values in the DataFrame")



There are no missing values in the DataFrame
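The same check can be written with pandas' vectorized helpers instead of a nested loop. The sketch below uses a small hypothetical frame (`demo`) with one deliberately missing value, since `customers.csv` itself contains none.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for customers.csv with one missing Age value.
demo = pd.DataFrame({
    "Age": [19, 21, np.nan],
    "Annual Income (k$)": [15, 15, 16],
})

# Number of missing values in each column.
missing_per_column = demo.isna().sum()
print(missing_per_column)

# (row label, column name) for every missing cell.
rows, cols = np.where(demo.isna().to_numpy())
missing_cells = [(demo.index[r], demo.columns[c]) for r, c in zip(rows, cols)]
print(missing_cells)

if not missing_cells:
    print("There are no missing values in the DataFrame")
```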


In [21]:
from sklearn.model_selection import train_test_split

# StratifiedShuffleSplit / StratifiedKFold alternatives are left disabled;
# a plain random split is used instead.

# Hold out the spending score and keep the remaining columns as inputs
X = customers.drop(columns=['Spending Score (1-100)'])
y = customers['Spending Score (1-100)']

# 80/20 split with a fixed seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Column groups consumed by the ColumnTransformer in the following cells.
# Note: CustomerID is a row identifier, yet it is treated as a numeric feature here.
num_feat = ['CustomerID', 'Age', 'Annual Income (k$)']
cat_feat = ['Gender']

In [22]:
print(X_train)

     CustomerID  Gender  Age  Annual Income (k$)
79           80  Female   49                  54
197         198    Male   32                 126
38           39  Female   36                  37
24           25  Female   54                  28
122         123  Female   40                  69
..          ...     ...  ...                 ...
106         107  Female   66                  63
14           15    Male   37                  20
92           93    Male   48                  60
179         180    Male   35                  93
102         103    Male   67                  62

[160 rows x 4 columns]
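The printed frame includes `CustomerID` and leaves out the spending score. Because k-means relies on Euclidean distance, an identifier column and features on very different scales can dominate the clusters. The sketch below shows one possible alternative preprocessing, not the one used in this notebook: it drops `CustomerID`, keeps the spending score as a feature, and standardizes the numeric columns. The `demo` frame is randomly generated stand-in data with the same column names, and `scaled_preprocess` is a hypothetical name.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Randomly generated stand-in data (assumption), same columns as customers.csv.
rng = np.random.default_rng(0)
n = 50
demo = pd.DataFrame({
    "CustomerID": np.arange(1, n + 1),
    "Gender": rng.choice(["Male", "Female"], size=n),
    "Age": rng.integers(18, 71, size=n),
    "Annual Income (k$)": rng.integers(15, 138, size=n),
    "Spending Score (1-100)": rng.integers(1, 101, size=n),
})

numeric_cols = ["Age", "Annual Income (k$)", "Spending Score (1-100)"]
categorical_cols = ["Gender"]

# Columns not listed (here CustomerID) are dropped, since remainder="drop" is the default.
# sparse_threshold=0 forces a dense array as output.
scaled_preprocess = ColumnTransformer(
    [
        ("num", StandardScaler(), numeric_cols),
        ("cat", OneHotEncoder(drop="if_binary"), categorical_cols),
    ],
    sparse_threshold=0,
)

features = scaled_preprocess.fit_transform(demo)
print(features.shape)
```

With `drop="if_binary"`, a two-valued column such as `Gender` is encoded as a single 0/1 column instead of two redundant ones.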


In [23]:
from sklearn.metrics import silhouette_score
from sklearn.cluster import KMeans
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# Create a ColumnTransformer
preprocess = ColumnTransformer([
    ('num_feat', 'passthrough', num_feat),
    ('cat_feat', OneHotEncoder(), cat_feat)
])

# Create a pipeline
pipeline = Pipeline([
    ('preprocessor', preprocess),
    # n_init is set explicitly to 10, the default used in this run
    ('model', KMeans(n_clusters=5, n_init=10, random_state=42))
])

# Fit the pipeline and get cluster labels
y_pred = pipeline.fit_predict(X_train)

# Get the inertia (within-cluster sum of squares)
#inertia = pipeline.named_steps['model'].inertia_
#print(f"Inertia: {inertia}")

# Calculate silhouette score on the transformed training data
#sil_score = silhouette_score(preprocess.transform(X_train), y_pred)
#print(f"Silhouette Score: {sil_score}")




In [31]:
from sklearn.metrics import silhouette_score
from sklearn.cluster import KMeans
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

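# Pass numeric columns through unchanged and one-hot encode Gender
preprocess = ColumnTransformer([
    ('num_feat', 'passthrough', num_feat),
    ('cat_feat', OneHotEncoder(), cat_feat)
])

pipeline = Pipeline([
    ('preprocessor', preprocess),
    # n_init is set explicitly to 10, the default used in this run
    ('model', KMeans(n_clusters=5, n_init=10, random_state=42))
])

# KMeans is unsupervised: fit() accepts y_train but ignores it
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

# Inertia describes the fit on the training split
inertia = pipeline.named_steps['model'].inertia_

# Silhouette is computed on the held-out split, using the fitted preprocessor
sil_score = silhouette_score(preprocess.transform(X_test), y_pred)

print(f"Inertia: {inertia}\nSilhouette Score: {sil_score}")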

Inertia: 58092.85693679771
Silhouette Score: 0.33334214313347543
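A single inertia value is hard to interpret on its own, because inertia generally keeps falling as the number of clusters grows. A common approach is to sweep several values of `n_clusters` and compare inertia (the "elbow" heuristic) with the silhouette score. The sketch below does this on synthetic `make_blobs` points (an assumption, standing in for the preprocessed customer data); the range 2 to 8 is an arbitrary illustrative choice.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data (assumption), not the customer records.
points, _ = make_blobs(n_samples=200, centers=4, cluster_std=1.0, random_state=42)

# Map each k to (inertia, silhouette score).
results = {}
for k in range(2, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = model.fit_predict(points)
    results[k] = (model.inertia_, silhouette_score(points, labels))

for k, (inertia_k, sil_k) in results.items():
    print(f"k={k}: inertia={inertia_k:.1f}, silhouette={sil_k:.3f}")

# The silhouette score is bounded in [-1, 1]; higher means better-separated clusters.
best_k = max(results, key=lambda k: results[k][1])
print(f"Highest silhouette at k={best_k}")
```

Both metrics here are computed on the same points the model was fitted on, which keeps them directly comparable across values of `k`.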


