## Introduction

This notebook focuses on the preprocessing steps required to prepare the dataset for machine learning models. The dataset contains both categorical and numerical variables, which require specific transformations to ensure compatibility with algorithms and to improve model performance. 

The key preprocessing steps covered include:

- **Handling of categorical variables** using One-Hot Encoding (OHE) to convert them into a numerical format.
- **Scaling of numerical variables** using MinMaxScaler to rescale each of them to the same [0, 1] range.

These transformations are essential for ensuring that the dataset is ready for further analysis and modeling.



In [84]:
import pandas as pd
import numpy as np



%run ../customer_personality_analysis/utils/pandas_missing_handler.py
%run ../customer_personality_analysis/utils/pandas_explorer.py

## Data Loading and First Look:

In [85]:
path = '../customer_personality_analysis/data/cleaned_data.csv'
df = pd.read_csv(path).drop(columns=['Unnamed: 0'])
df.head()

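|   | Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Recency | MntWines | MntFruits | MntMeatProducts | ... | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Response | YearJoining | QuarterJoining |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1957 | Graduation | Single | 58138.0 | 0 | 0 | 58 | 635 | 88 | 546 | ... | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 2012 | 3 |
| 1 | 1954 | Graduation | Single | 46344.0 | 1 | 1 | 38 | 11 | 1 | 6 | ... | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2014 | 1 |
| 2 | 1965 | Graduation | Together | 71613.0 | 0 | 0 | 26 | 426 | 49 | 127 | ... | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2013 | 3 |
| 3 | 1984 | Graduation | Together | 26646.0 | 1 | 0 | 26 | 11 | 4 | 20 | ... | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2014 | 1 |
| 4 | 1981 | PhD | Married | 58293.0 | 1 | 0 | 94 | 173 | 43 | 118 | ... | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2014 | 1 |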


## Feature Selection

In [86]:
selected_features = ['Year_Birth','Education','Marital_Status','Income','Kidhome','Teenhome','YearJoining']

- Only demographic features were selected, since the goal is demographic segmentation.
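
Before slicing the frame, it can help to confirm that every selected name is actually a column. The following is a minimal sketch on a toy frame; `missing_columns` is a hypothetical helper introduced here for illustration and is not part of the notebook's utility scripts.

```python
import pandas as pd


def missing_columns(frame: pd.DataFrame, wanted: list[str]) -> list[str]:
    """Return the names in `wanted` that are not columns of `frame`."""
    return [name for name in wanted if name not in frame.columns]


# Toy frame with two of the columns shown in the preview above.
toy = pd.DataFrame({"Year_Birth": [1957, 1984], "Income": [58138.0, 26646.0]})

print(missing_columns(toy, ["Year_Birth", "Income"]))   # []
print(missing_columns(toy, ["Year_Birth", "Kidhome"]))  # ['Kidhome']
```

An empty list means the selection can be applied safely with `df[selected_features]`.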

## Scaling Numerical Features:

In [87]:
df[selected_features].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2214 entries, 0 to 2213
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Year_Birth      2214 non-null   int64  
 1   Education       2214 non-null   object 
 2   Marital_Status  2214 non-null   object 
 3   Income          2214 non-null   float64
 4   Kidhome         2214 non-null   int64  
 5   Teenhome        2214 non-null   int64  
 6   YearJoining     2214 non-null   int64  
dtypes: float64(1), int64(4), object(2)
memory usage: 121.2+ KB


In [88]:
from sklearn.preprocessing import MinMaxScaler

# Creating a copy of the dataset:
encoded_df = df[selected_features].copy()

# Selecting numerical features:
numerical_features = encoded_df.select_dtypes(include=[np.number]).columns

# Initializing MinMaxScaler:
scaler = MinMaxScaler()

# Applying min-max scaling (each numerical column is rescaled to [0, 1]):
encoded_df[numerical_features] = scaler.fit_transform(encoded_df[numerical_features])
encoded_df.head()

|   | Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | YearJoining |
|---|---|---|---|---|---|---|---|
| 0 | 0.303571 | Graduation | Single | 0.516867 | 0.0 | 0.0 | 0.0 |
| 1 | 0.25 | Graduation | Single | 0.396485 | 0.5 | 0.5 | 1.0 |
| 2 | 0.446429 | Graduation | Together | 0.654408 | 0.0 | 0.0 | 0.5 |
| 3 | 0.785714 | Graduation | Together | 0.195425 | 0.5 | 0.0 | 1.0 |
| 4 | 0.732143 | PhD | Married | 0.518449 | 0.5 | 0.0 | 1.0 |
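
With its default `feature_range` of (0, 1), MinMaxScaler maps each column as `(x - min) / (max - min)`. The sketch below reproduces that formula by hand in pandas, using the `YearJoining` values of the first five rows (2012, 2014, 2013, 2014, 2014). `min_max_scale` is a hypothetical helper written for illustration; it is not the scikit-learn implementation.

```python
import pandas as pd


def min_max_scale(column: pd.Series) -> pd.Series:
    """Rescale a numeric column to [0, 1] via (x - min) / (max - min)."""
    span = column.max() - column.min()
    if span == 0:
        # A constant column has no range; return zeros to avoid dividing by zero.
        return pd.Series(0.0, index=column.index, name=column.name)
    return (column - column.min()) / span


toy_years = pd.Series([2012, 2014, 2013, 2014, 2014], name="YearJoining")
print(min_max_scale(toy_years).tolist())  # [0.0, 1.0, 0.5, 1.0, 1.0]
```

The result matches the scaled `YearJoining` column in the output above, where 2012 maps to 0.0, 2013 to 0.5, and 2014 to 1.0.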


## Encoding Categorical Features as Numerical Columns:

In [None]:
# Selecting categorical features:
categorical_features = encoded_df.select_dtypes(include='object').columns

# Creating dummy variables as 0/1 integers (dtype=int avoids a separate bool-to-int cast):
encoded_df = pd.get_dummies(encoded_df, columns=categorical_features, dtype=int)
encoded_df.head()

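|   | Year_Birth | Income | Kidhome | Teenhome | YearJoining | Education_2n Cycle | Education_Basic | Education_Graduation | Education_Master | Education_PhD | Marital_Status_Divorced | Marital_Status_Married | Marital_Status_Single | Marital_Status_Together | Marital_Status_Widow |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.303571 | 0.516867 | 0.0 | 0.0 | 0.0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | 0.25 | 0.396485 | 0.5 | 0.5 | 1.0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 0.446429 | 0.654408 | 0.0 | 0.0 | 0.5 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3 | 0.785714 | 0.195425 | 0.5 | 0.0 | 1.0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | 0.732143 | 0.518449 | 0.5 | 0.0 | 1.0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |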
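
To make the encoding concrete, here is a small sketch of `pd.get_dummies` on a toy frame containing only the two categorical columns. The three rows are illustrative, and `dtype=int` is used so the indicator columns come out as 0/1 integers directly.

```python
import pandas as pd

toy = pd.DataFrame({
    "Education": ["Graduation", "PhD", "Graduation"],
    "Marital_Status": ["Single", "Married", "Together"],
})

# One indicator column is created per category, named <column>_<category>.
dummies = pd.get_dummies(toy, columns=["Education", "Marital_Status"], dtype=int)

print(dummies.columns.tolist())
# ['Education_Graduation', 'Education_PhD',
#  'Marital_Status_Married', 'Marital_Status_Single', 'Marital_Status_Together']

# Each row has exactly one 1 among the indicators of a given variable.
education_cols = [c for c in dummies.columns if c.startswith("Education_")]
print(dummies[education_cols].sum(axis=1).tolist())  # [1, 1, 1]
```

Only categories that occur in the data get a column, which is why the toy frame produces fewer columns than the full dataset.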


## Saving the Preprocessed Dataset to a CSV File:

In [92]:
encoded_df.to_csv('../customer_personality_analysis/data/preprocessed.csv')
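
By default, `to_csv` also writes the row index, which comes back as an `Unnamed: 0` column when the file is read again (the same column that the loading cell drops from `cleaned_data.csv`). The sketch below shows the difference on a toy frame, using in-memory buffers instead of real files. Whether to pass `index=False` here is a judgment call, since downstream notebooks may expect that column.

```python
import io

import pandas as pd

toy = pd.DataFrame({"Income": [0.516867, 0.396485], "Kidhome": [0.0, 0.5]})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()
print(cols_with_index)  # ['Unnamed: 0', 'Income', 'Kidhome']

# index=False writes only the data columns.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()
print(cols_without_index)  # ['Income', 'Kidhome']
```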

## Preprocessing Summary

In this preprocessing stage, the following actions were taken to prepare the dataset for machine learning models:

1. **One-Hot Encoding for categorical variables:**  
   The categorical variables (`Education` and `Marital_Status`) were transformed using One-Hot Encoding (OHE) into binary 0/1 indicator columns, one per category, so that no category information is lost.

2. **MinMaxScaler for numerical variables:**  
   Numerical variables were rescaled with MinMaxScaler, which maps each column to the [0, 1] range so that all variables share a comparable scale. This is min-max normalization, not standardization to zero mean and unit variance.

These transformations ensure that the dataset is adequately prepared for further modeling steps.
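
The two properties summarized above (all columns numeric, all values within [0, 1]) can be checked programmatically before modeling. This is a minimal sketch; `check_preprocessed` and its `tol` parameter are hypothetical names introduced for illustration, and the toy frames stand in for the real data.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def check_preprocessed(frame: pd.DataFrame, tol: float = 1e-9) -> list[str]:
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    for name in frame.columns:
        column = frame[name]
        if not is_numeric_dtype(column):
            problems.append(f"{name}: non-numeric")
        elif column.isna().any():
            problems.append(f"{name}: contains missing values")
        elif column.min() < -tol or column.max() > 1 + tol:
            problems.append(f"{name}: values outside [0, 1]")
    return problems


good = pd.DataFrame({"Income": [0.5, 0.2], "Education_PhD": [0, 1]})
bad = pd.DataFrame({"Income": [58138.0, 0.2], "Education": ["PhD", "Basic"]})

print(check_preprocessed(good))  # []
print(check_preprocessed(bad))   # ['Income: values outside [0, 1]', 'Education: non-numeric']
```

The small tolerance allows for floating-point rounding at the boundaries of the scaled range.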
