### Introduction

This notebook focuses on the preprocessing steps required to prepare the dataset for machine learning models. The dataset contains both categorical and numerical variables, which require specific transformations to ensure compatibility with algorithms and to improve model performance. 

The key preprocessing steps covered include:

- **Handling of categorical variables** using One-Hot Encoding (OHE) to convert them into a numerical format.
- **Scaling of numerical variables** using MinMaxScaler to rescale each one to the [0, 1] range.

These transformations put every feature in numeric form and on a comparable scale, which many algorithms need before modeling.



In [1]:
import pandas as pd
import numpy as np



%run ../customer_personality_analysis/utils/pandas_missing_handler.py
%run ../customer_personality_analysis/utils/pandas_explorer.py

## Data Loading and First Look:

In [2]:
path = '../customer_personality_analysis/data/cleaned_data.csv'
df = pd.read_csv(path).drop(columns=['Unnamed: 0'])
df.head()

Unnamed: 0,Year_Birth,Education,Marital_Status,Income,Kidhome,Teenhome,Recency,MntWines,MntFruits,MntMeatProducts,...,NumWebVisitsMonth,AcceptedCmp3,AcceptedCmp4,AcceptedCmp5,AcceptedCmp1,AcceptedCmp2,Complain,Response,YearJoining,QuarterJoining
0,1957,Graduation,Single,58138.0,0,0,58,635,88,546,...,7,0,0,0,0,0,0,1,2012,3
1,1954,Graduation,Single,46344.0,1,1,38,11,1,6,...,5,0,0,0,0,0,0,0,2014,1
2,1965,Graduation,Together,71613.0,0,0,26,426,49,127,...,4,0,0,0,0,0,0,0,2013,3
3,1984,Graduation,Together,26646.0,1,0,26,11,4,20,...,6,0,0,0,0,0,0,0,2014,1
4,1981,PhD,Married,58293.0,1,0,94,173,43,118,...,5,0,0,0,0,0,0,0,2014,1


## Scaling Numerical Features:

In [3]:
from sklearn.preprocessing import MinMaxScaler

# Creating a copy of the dataset:
encoded_df = df.copy()

# Selecting numerical features:
numerical_features = encoded_df.select_dtypes(include=[np.number]).columns

# Initializing MinMaxScaler:
scaler = MinMaxScaler()

# Applying min-max scaling (each numerical column is mapped to the [0, 1] range):
encoded_df[numerical_features] = scaler.fit_transform(encoded_df[numerical_features])
encoded_df.head()

Unnamed: 0,Year_Birth,Education,Marital_Status,Income,Kidhome,Teenhome,Recency,MntWines,MntFruits,MntMeatProducts,...,NumWebVisitsMonth,AcceptedCmp3,AcceptedCmp4,AcceptedCmp5,AcceptedCmp1,AcceptedCmp2,Complain,Response,YearJoining,QuarterJoining
0,0.303571,Graduation,Single,0.516867,0.0,0.0,0.585859,0.425318,0.442211,0.554878,...,0.7,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.666667
1,0.25,Graduation,Single,0.396485,0.5,0.5,0.383838,0.007368,0.005025,0.006098,...,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,0.446429,Graduation,Together,0.654408,0.0,0.0,0.262626,0.285332,0.246231,0.129065,...,0.4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.666667
3,0.785714,Graduation,Together,0.195425,0.5,0.0,0.262626,0.007368,0.020101,0.020325,...,0.6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,0.732143,PhD,Married,0.518449,0.5,0.0,0.949495,0.115874,0.21608,0.119919,...,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
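
The scaled values above follow the min-max formula `(x - min) / (max - min)`, applied per column. The sketch below checks that formula against `MinMaxScaler` on a toy column. The bounds 1940 and 1996 are an assumption inferred from the displayed rows (they reproduce `Year_Birth` 1957 → 0.303571 and 1954 → 0.25); they were not read from the dataset itself.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy column of birth years. The bounds 1940 and 1996 are assumed for
# illustration (inferred from the displayed output, not read from the data).
years = np.array([[1940.0], [1954.0], [1957.0], [1996.0]])

# Manual min-max scaling: (x - min) / (max - min), computed per column.
col_min = years.min(axis=0)
col_max = years.max(axis=0)
manual = (years - col_min) / (col_max - col_min)

# The same transformation via scikit-learn.
scaled = MinMaxScaler().fit_transform(years)

print(np.allclose(manual, scaled))
print(scaled.ravel().round(6))
```

Because the scaler stores the fitted minimum and maximum, calling `transform` on new rows reuses the same bounds, so values outside the fitted range can fall outside [0, 1].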


## Mapping Categorical Features to Numerical:

In [4]:
# Selecting categorical features:
categorical_features = encoded_df.select_dtypes(exclude=[np.number]).columns

# Getting dummy variables (dtype=int casts only the new dummy columns,
# so the scaled numerical features keep their float values):
encoded_df = pd.get_dummies(encoded_df, columns=list(categorical_features), dtype=int)
encoded_df.head()

Unnamed: 0,Year_Birth,Income,Kidhome,Teenhome,Recency,MntWines,MntFruits,MntMeatProducts,...,Education_2n Cycle,Education_Basic,Education_Graduation,Education_Master,Education_PhD,Marital_Status_Divorced,Marital_Status_Married,Marital_Status_Single,Marital_Status_Together,Marital_Status_Widow
0,0.303571,0.516867,0.0,0.0,0.585859,0.425318,0.442211,0.554878,...,0,0,1,0,0,0,0,1,0,0
1,0.25,0.396485,0.5,0.5,0.383838,0.007368,0.005025,0.006098,...,0,0,1,0,0,0,0,1,0,0
2,0.446429,0.654408,0.0,0.0,0.262626,0.285332,0.246231,0.129065,...,0,0,1,0,0,0,0,0,1,0
3,0.785714,0.195425,0.5,0.0,0.262626,0.007368,0.020101,0.020325,...,0,0,1,0,0,0,0,0,1,0
4,0.732143,0.518449,0.5,0.0,0.949495,0.115874,0.21608,0.119919,...,0,0,0,0,1,0,1,0,0,0
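
A minimal sketch, on a toy frame, of how the integer cast interacts with scaled columns during one-hot encoding. The three `Income` values are taken from the scaled output earlier in this notebook; the frame itself is illustrative. Passing `dtype=int` to `pd.get_dummies` casts only the dummy columns, whereas calling `.astype(int)` on the whole frame truncates every scaled value below 1 to 0.

```python
import pandas as pd

# Toy frame: one already-scaled numerical column and one categorical column.
toy = pd.DataFrame({
    "Income": [0.516867, 0.396485, 0.518449],
    "Education": ["Graduation", "Graduation", "PhD"],
})

# dtype=int applies only to the newly created dummy columns,
# so Income keeps its float values.
encoded = pd.get_dummies(toy, columns=["Education"], dtype=int)

# Casting the whole frame truncates each scaled value in [0, 1) to 0.
truncated = pd.get_dummies(toy, columns=["Education"]).astype(int)

print(encoded)
print(truncated["Income"].tolist())
```

Comparing the two `Income` columns is a quick way to confirm that scaling survived the encoding step.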


## Saving the Preprocessed Dataset to a CSV File:

In [5]:
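# Writing the preprocessed data. The index is saved as an extra first column
# (read back as 'Unnamed: 0'), because index=False is not passed:
encoded_df.to_csv('../customer_personality_analysis/data/preprocessed.csv')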

## Preprocessing Summary

In this preprocessing stage, the following actions were taken to prepare the dataset for machine learning models:

1. **One-Hot Encoding for categorical variables:**  
   `Education` and `Marital_Status` were transformed using One-Hot Encoding (OHE): each category became its own binary (0/1) column, so no artificial ordering is imposed on the categories.

2. **MinMaxScaler for numerical variables:**  
   Numerical variables were rescaled with MinMaxScaler, which maps each column to the [0, 1] range so that all variables have a comparable scale. This is min-max normalization, not standardization, which would instead give each column zero mean and unit variance.

The resulting dataset is fully numeric and is saved as `preprocessed.csv` for the modeling steps.
