# Data Preprocessing and Feature Creation

#### 1. Introduction
#### 2. Importing Required Libraries
#### 3. Loading the Dataset
#### 4. Encoding Categorical Variables
#### 5. Feature Selection
#### 6. RFM Feature Preparation
#### 7. Feature Scaling
#### 8. Saving Processed Data
#### 9. Summary

## Introduction
Data preprocessing is a crucial step in machine learning pipelines.
This notebook prepares the dataset for clustering by encoding the
categorical Gender column, selecting features, deriving proxy RFM
features, and scaling the result.

## Importing Required Libraries

In [5]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, LabelEncoder
import pickle


## Loading the Dataset

In [6]:
df = pd.read_csv("Mall_Customers.csv")

## Encoding Categorical Variables

In [16]:
# Encode Gender as integers; LabelEncoder assigns codes in sorted order of the labels
le = LabelEncoder()
df['Gender'] = le.fit_transform(df['Gender'])
df.head()

| | CustomerID | Gender | Age | Annual Income (k$) | Spending Score (1-100) |
|---|---|---|---|---|---|
| 0 | 1 | 1 | 19 | 15 | 39 |
| 1 | 2 | 1 | 21 | 15 | 81 |
| 2 | 3 | 0 | 20 | 16 | 6 |
| 3 | 4 | 0 | 23 | 16 | 77 |
| 4 | 5 | 0 | 31 | 17 | 40 |
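The integer codes above are only meaningful if the label-to-code mapping is known. The following is a minimal sketch of how that mapping could be inspected and reversed; the toy `gender` series is an assumption standing in for the raw `Gender` column, whose original string values are not shown in this notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy column standing in for the raw df['Gender'] values.
gender = pd.Series(["Male", "Male", "Female", "Female", "Female"])

le = LabelEncoder()
codes = le.fit_transform(gender)

# classes_ holds the original labels in sorted order; the position is the code.
mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the integer codes.
restored = le.inverse_transform(codes)
print(list(restored))
```

Keeping the fitted encoder (or the mapping) around makes it possible to translate cluster summaries back into readable labels later.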


## Feature Selection
Only the features relevant to clustering are kept.
CustomerID is excluded because it is an arbitrary identifier and carries no behavioral information.

In [17]:
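# .copy() gives an independent DataFrame, so adding columns later does not raise SettingWithCopyWarning
df_selected = df[['Gender', 'Age', 'Annual Income (k$)', 'Spending Score (1-100)']].copy()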
df_selected.head()

| | Gender | Age | Annual Income (k$) | Spending Score (1-100) |
|---|---|---|---|---|
| 0 | 1 | 19 | 15 | 39 |
| 1 | 1 | 21 | 15 | 81 |
| 2 | 0 | 20 | 16 | 6 |
| 3 | 0 | 23 | 16 | 77 |
| 4 | 0 | 31 | 17 | 40 |
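An equivalent way to express the same selection is to drop the identifier column instead of listing the columns to keep. This is a sketch on a small hand-built frame (the three rows are copied from the output above, not loaded from the CSV); `DataFrame.drop` returns a new object, so columns added afterwards do not touch the original `df`.

```python
import pandas as pd

# Toy frame with the first three rows shown above (not read from the CSV file).
df = pd.DataFrame({
    "CustomerID": [1, 2, 3],
    "Gender": [1, 1, 0],
    "Age": [19, 21, 20],
    "Annual Income (k$)": [15, 15, 16],
    "Spending Score (1-100)": [39, 81, 6],
})

# drop() returns a new DataFrame without the identifier column.
df_selected = df.drop(columns=["CustomerID"])

# Adding a column to the selection leaves the original frame unchanged.
df_selected["Recency"] = 100 - df_selected["Spending Score (1-100)"]

print(list(df_selected.columns))
print(list(df.columns))
```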


## RFM Feature Preparation

In [18]:
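# Create proxy RFM (Recency, Frequency, Monetary) features.
# The dataset has no transaction history, so each proxy is derived from an existing column.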
df_selected['Recency'] = 100 - df_selected['Spending Score (1-100)']
df_selected['Frequency'] = df_selected['Annual Income (k$)']
df_selected['Monetary'] = df_selected['Spending Score (1-100)']

df_selected.head()

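| | Gender | Age | Annual Income (k$) | Spending Score (1-100) | Recency | Frequency | Monetary |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 19 | 15 | 39 | 61 | 15 | 39 |
| 1 | 1 | 21 | 15 | 81 | 19 | 15 | 81 |
| 2 | 0 | 20 | 16 | 6 | 94 | 16 | 6 |
| 3 | 0 | 23 | 16 | 77 | 23 | 16 | 77 |
| 4 | 0 | 31 | 17 | 40 | 60 | 17 | 40 |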
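Because each proxy is computed directly from an existing column, it may be worth checking how much new information it adds. The sketch below rebuilds the proxies on the five rows shown above and inspects their correlation with the source columns; it is an illustration on a small sample, not a result for the full dataset.

```python
import pandas as pd

# First five rows of the two source columns, copied from the output above.
sample = pd.DataFrame({
    "Annual Income (k$)": [15, 15, 16, 16, 17],
    "Spending Score (1-100)": [39, 81, 6, 77, 40],
})

# Same proxy definitions as in the cell above.
sample["Recency"] = 100 - sample["Spending Score (1-100)"]
sample["Frequency"] = sample["Annual Income (k$)"]
sample["Monetary"] = sample["Spending Score (1-100)"]

corr = sample.corr()

# Correlation of each proxy with the column it was derived from.
recency_corr = float(corr.loc["Spending Score (1-100)", "Recency"])
monetary_corr = float(corr.loc["Spending Score (1-100)", "Monetary"])
frequency_corr = float(corr.loc["Annual Income (k$)", "Frequency"])

print(round(recency_corr, 3), round(monetary_corr, 3), round(frequency_corr, 3))
```

A correlation of exactly +1 or -1 means the proxy is a linear copy of its source, which is expected here given the definitions.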


## Feature Scaling

In [19]:
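# Standardize every column to zero mean and unit variance
scaler = StandardScaler()

# fit_transform learns each column's mean and standard deviation, then applies them
scaled_features = scaler.fit_transform(df_selected)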

df_scaled = pd.DataFrame(
    scaled_features,
    columns=df_selected.columns
)

df_scaled.head()

| | Gender | Age | Annual Income (k$) | Spending Score (1-100) | Recency | Frequency | Monetary |
|---|---|---|---|---|---|---|---|
| 0 | 1.128152 | -1.424569 | -1.738999 | -0.434801 | 0.434801 | -1.738999 | -0.434801 |
| 1 | 1.128152 | -1.281035 | -1.738999 | 1.195704 | -1.195704 | -1.738999 | 1.195704 |
| 2 | -0.886405 | -1.352802 | -1.70083 | -1.715913 | 1.715913 | -1.70083 | -1.715913 |
| 3 | -0.886405 | -1.137502 | -1.70083 | 1.040418 | -1.040418 | -1.70083 | 1.040418 |
| 4 | -0.886405 | -0.563369 | -1.66266 | -0.39598 | 0.39598 | -1.66266 | -0.39598 |


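Many clustering algorithms, k-means for example, are distance-based, so features with larger numeric ranges would otherwise dominate the distance.
Standardizing each feature to zero mean and unit variance puts them on a comparable scale.
Note that Recency, Frequency, and Monetary are linear copies of Spending Score and Annual Income (their scaled values above are identical up to sign), so those two signals still carry extra weight in distance calculations.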
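What StandardScaler computes can be reproduced by hand: each value becomes z = (x - mean) / std, where std is the population standard deviation (`ddof=0`). The sketch below checks this on the first five Age values from the table above; because it fits on only five rows, the numbers will differ from the full-dataset output shown earlier.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy input: the first five Age values from the table above.
ages = pd.DataFrame({"Age": [19, 21, 20, 23, 31]})

scaler = StandardScaler()
scaled = scaler.fit_transform(ages)

# Manual standardization; ddof=0 matches the population std used by StandardScaler.
manual = (ages - ages.mean()) / ages.std(ddof=0)

print(np.allclose(scaled, manual.to_numpy()))

# After scaling, the column has mean 0 and standard deviation 1.
print(np.allclose(scaled.mean(axis=0), 0.0), np.allclose(scaled.std(axis=0), 1.0))
```

Note that pandas defaults to the sample standard deviation (`ddof=1`), so omitting `ddof=0` in the manual version would give slightly different values.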

## Saving Processed Data

In [20]:
df_scaled.to_csv(
    "../data/processed/mall_customers_scaled.csv",
    index=False
)

In [21]:
with open("../data/processed/mall_customers_scaled.pkl", "wb") as f:
    pickle.dump(df_scaled, f)

In [26]:
with open("../model/scaler.pkl", "wb") as f:
    pickle.dump(scaler, f)
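The fitted scaler is saved so that new data can later be transformed with the same means and standard deviations. The following is a minimal sketch of that round trip; the two-column toy frame, the temporary file path, and the `new_customer` row are assumptions for illustration and do not reproduce the seven-column scaler saved above.

```python
import os
import pickle
import tempfile

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy training data (hypothetical subset of columns and rows).
cols = ["Age", "Annual Income (k$)"]
train = pd.DataFrame([[19, 15], [21, 15], [20, 16], [23, 16], [31, 17]], columns=cols)
scaler = StandardScaler().fit(train)

# Save and reload the fitted scaler, using a temporary directory for the sketch.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "scaler.pkl")
    with open(path, "wb") as f:
        pickle.dump(scaler, f)
    with open(path, "rb") as f:
        loaded = pickle.load(f)

# New rows must have the same columns, in the same order, as the training data.
new_customer = pd.DataFrame([[25, 16]], columns=cols)
new_scaled = loaded.transform(new_customer)
print(new_scaled)
```

Use `transform`, not `fit_transform`, on new data so the statistics learned during preprocessing are reused. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.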

## Summary

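- The categorical Gender column was label-encoded.
- CustomerID was excluded during feature selection.
- Proxy RFM features were added.
- All features were scaled using StandardScaler.
- The processed dataset and the fitted scaler were saved, so the data is now ready for clustering.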