### Data Preprocessing

In [1]:
import numpy as np
import pandas as pd

Load Dataset

In [2]:

data = pd.read_csv(r"C:\Users\Alex Marco\Downloads\Projects\Customer Segmentation\Data\Mall_Customers.csv")

print(data.head())
print(data.info())
print(data.describe())
print(data.shape)


   CustomerID  Gender  Age  Annual Income (k$)  Spending Score (1-100)
0           1    Male   19                  15                      39
1           2    Male   21                  15                      81
2           3  Female   20                  16                       6
3           4  Female   23                  16                      77
4           5  Female   31                  17                      40
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 5 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   CustomerID              200 non-null    int64 
 1   Gender                  200 non-null    object
 2   Age                     200 non-null    int64 
 3   Annual Income (k$)      200 non-null    int64 
 4   Spending Score (1-100)  200 non-null    int64 
dtypes: int64(4), object(1)
memory usage: 7.9+ KB
None
       CustomerID         Age  Annual Income (k$)  

Check for Missing Values

In [3]:

null = data.isnull().sum()
print(null)

CustomerID                0
Gender                    0
Age                       0
Annual Income (k$)        0
Spending Score (1-100)    0
dtype: int64


Feature Engineering <br>
Create new features that better represent customer spending behavior.

#### Spending Efficiency
This feature measures how high a customer's spending score is relative to their annual income.

$$ \text{Spending Efficiency} = \frac{\text{Spending Score}}{\text{Annual Income (k\$)}} $$

#### Interpretation
*   **High value ($\approx 1$ or more)**: Spends heavily despite income (impulsive spender).
*   **Low value (<0.2)**: Conservative or low spender relative to income.

In [4]:
data['Spending_Efficiency'] = data['Spending Score (1-100)'] / data['Annual Income (k$)']
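
As a quick sanity check of the formula, the sketch below applies it to the five rows shown in the `data.head()` output and buckets the result with the interpretation thresholds. The `label_efficiency` helper and the "moderate" middle band are illustrative assumptions and are not part of the original notebook.

```python
import pandas as pd

# First five rows, copied from the data.head() output above
sample = pd.DataFrame({
    'Annual Income (k$)': [15, 15, 16, 16, 17],
    'Spending Score (1-100)': [39, 81, 6, 77, 40],
})

sample['Spending_Efficiency'] = (
    sample['Spending Score (1-100)'] / sample['Annual Income (k$)']
)

def label_efficiency(value):
    """Hypothetical helper: bucket a ratio using the thresholds above."""
    if value >= 1.0:
        return 'high'
    if value < 0.2:
        return 'low'
    return 'moderate'  # assumed middle band, not defined in the text

sample['Efficiency_Label'] = sample['Spending_Efficiency'].apply(label_efficiency)
print(sample)
```

Four of these five ratios are well above 1 because the incomes in these rows are small, so the ratio is strongly driven by the income scale. It may be worth inspecting the distribution of `Spending_Efficiency` over the full dataset before relying on fixed thresholds.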



#### Income-Spend Interaction
The product of income and spending score highlights customers who are both wealthy and heavy spenders.

$$ \text{Income-Spend Interaction} = \text{Annual Income (k\$)} \times \text{Spending Score (1-100)} $$

#### Interpretation
*   **High value (High income × high spending score)**: High-value premium customers.
*   **Low value (Low income × low spending score)**: Budget buyers.


In [5]:
data['Income_Spend_Interaction'] = data['Annual Income (k$)'] * data['Spending Score (1-100)']
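
One property of the product is worth keeping in mind: it is symmetric, so it cannot tell a low-income heavy spender from a high-income light spender. The sketch below illustrates this with two hypothetical customers (the values are made up for illustration) and shows that the ratio feature separates them.

```python
import pandas as pd

# Two hypothetical customers with mirrored income and spending score
pairs = pd.DataFrame({
    'Annual Income (k$)': [15, 81],
    'Spending Score (1-100)': [81, 15],
})

# The product is identical for both rows
pairs['Income_Spend_Interaction'] = (
    pairs['Annual Income (k$)'] * pairs['Spending Score (1-100)']
)

# The ratio differs sharply between the two rows
pairs['Spending_Efficiency'] = (
    pairs['Spending Score (1-100)'] / pairs['Annual Income (k$)']
)
print(pairs)
```

Both rows get the same interaction value, while their efficiency ratios fall on opposite sides of the thresholds. This is one reason to keep both engineered features rather than only one of them.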


In [6]:
## Features to Keep
feature = [
    'Gender',
    'Age',
    'Annual Income (k$)',
    'Spending Score (1-100)',
    'Spending_Efficiency',
    'Income_Spend_Interaction'
]

df = data[feature]

Encode Categorical Features <br>
Convert text features such as "Gender" into numeric codes:

In [14]:
from sklearn.preprocessing import LabelEncoder
import joblib

# Work on an explicit copy so the assignment below does not trigger
# pandas' SettingWithCopyWarning (df was created by slicing `data`).
df = df.copy()

le = LabelEncoder()
df['Gender'] = le.fit_transform(df['Gender'])  # classes are sorted: Female=0, Male=1
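
Rather than relying on a comment for the label mapping, it can be read from the fitted encoder. The sketch below uses a small stand-in list of gender values (an assumption for illustration) and a separate encoder named `le_demo`, so it does not interfere with the notebook's `le`.

```python
from sklearn.preprocessing import LabelEncoder

le_demo = LabelEncoder()
codes = le_demo.fit_transform(['Male', 'Male', 'Female', 'Female', 'Female'])

# classes_ is sorted, and each class's integer code is its index in classes_
mapping = {str(label): code for code, label in enumerate(le_demo.classes_)}
print(mapping)
print(codes)
```

The same dictionary comprehension applied to the notebook's `le` would show the mapping actually used for `df['Gender']`.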


Feature Scaling <br>
K-Means and other distance-based algorithms are sensitive to feature scale, so standardize the data before clustering.

In [16]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled_features = scaler.fit_transform(df)


# Save the fitted scaler so the same transformation can be reused later
joblib.dump(scaler, '../Model/scaler.pkl')

['../Model/scaler.pkl']
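
To confirm what the scaler does and that the saved file can be reused, here is a minimal sketch on a small stand-in frame. The `demo` data, the `demo_scaler` name, and the temporary file location are assumptions for illustration; the notebook itself writes to `../Model/scaler.pkl`.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Small stand-in frame using values from the data.head() output
demo = pd.DataFrame({
    'Age': [19, 21, 20, 23, 31],
    'Annual Income (k$)': [15, 15, 16, 16, 17],
})

demo_scaler = StandardScaler()
demo_scaled = demo_scaler.fit_transform(demo)

# After standardization each column has mean 0 and population std 1
col_means = demo_scaled.mean(axis=0)
col_stds = demo_scaled.std(axis=0)  # ddof=0, matching StandardScaler

# Round-trip the scaler through joblib in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'scaler.pkl')
    joblib.dump(demo_scaler, path)
    reloaded = joblib.load(path)

# The reloaded scaler reproduces the same transformation
reloaded_scaled = reloaded.transform(demo)
print(np.allclose(reloaded_scaled, demo_scaled))
```

Loading the saved scaler and calling `transform` (not `fit_transform`) is what keeps new data on the same scale as the data used for fitting.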

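Save the processed features for later use. Note that `df` holds the encoded but unscaled values; the scaled array is `scaled_features`.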

In [11]:
df.to_csv(r'C:\Users\Alex Marco\Downloads\Projects\Customer Segmentation\Data\Processed\processed_customers.csv', index=False)