### Introduction

This notebook focuses on the preprocessing steps required to prepare the dataset for machine learning models. The dataset contains both categorical and numerical variables, which require specific transformations to ensure compatibility with algorithms and to improve model performance. 

The key preprocessing steps covered include:

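- **Creation of level features** by binning `Age`, `Income` and `Tenure_Months` at their quartiles.
- **Encoding of categorical variables**: ordinal encoding for `Education`, which has a natural order, and One-Hot Encoding (OHE) for `Marital_Status`.
- **Scaling of numerical variables** using MinMaxScaler so that they all fall within the [0, 1] range.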

These transformations are essential for ensuring that the dataset is ready for further analysis and modeling.



In [1]:
import pandas as pd
import numpy as np



%run ../customer_personality_analysis/utils/pandas_missing_handler.py
%run ../customer_personality_analysis/utils/pandas_explorer.py

## Data Load and First Visualization:

In [2]:
path = '../customer_personality_analysis/data/cleaned_data.csv'
df = pd.read_csv(path).drop(columns=['Unnamed: 0'])
df.head()

Unnamed: 0,Year_Birth,Education,Marital_Status,Income,Kidhome,Teenhome,Recency,MntWines,MntFruits,MntMeatProducts,...,AcceptedCmp5,AcceptedCmp1,AcceptedCmp2,Complain,Response,Tenure_Months,Age,Total_bought,Total_purchases,Cmp_Accepted
0,1957,Graduation,Uncommitted,58138.0,0,0,58,635,88,546,...,0,0,0,0,1,21,57,1617,25,0
1,1954,Graduation,Uncommitted,46344.0,1,1,38,11,1,6,...,0,0,0,0,0,3,60,27,6,0
2,1965,Graduation,Committed,71613.0,0,0,26,426,49,127,...,0,0,0,0,0,10,49,776,21,0
3,1984,Graduation,Committed,26646.0,1,0,26,11,4,20,...,0,0,0,0,0,4,30,53,8,0
4,1981,PhD,Committed,58293.0,1,0,94,173,43,118,...,0,0,0,0,0,5,33,422,19,0


## Creating new features

In [3]:
# Label a numeric feature by quartile:
# 3 labels split at Q1 and Q3; 4 labels split at Q1, Q2 and Q3
def categorice_feature(data, feature, labels):
    if len(labels) == 3:
        Q1, Q3 = data[feature].quantile([0.25, 0.75])
        # Applying labels
        categories = data[feature].apply(lambda x: labels[0] if x <= Q1 else (labels[1] if x <= Q3 else labels[2]))
    elif len(labels) == 4:
        Q1, Q2, Q3 = data[feature].quantile([0.25, 0.50, 0.75])
        categories = data[feature].apply(lambda x: labels[0] if x <= Q1 else (labels[1] if x <= Q2 else labels[2] if x <= Q3 else labels[3]))
    else:
        # Any other number of labels is not supported
        raise ValueError("labels must contain 3 or 4 items")
    return categories
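
To make the quartile rule concrete, here is a minimal sketch on toy data. It is not the notebook's function: `bin_by_quartiles` and `toy_ages` are hypothetical names, and `pd.cut` is used as a stand-in that should give the same result as the three-label branch above, because its default right-closed bins also send values equal to a threshold to the lower level.

```python
import numpy as np
import pandas as pd

def bin_by_quartiles(values: pd.Series, labels: list) -> pd.Series:
    """Assign three labels using Q1 and Q3 as right-closed thresholds."""
    q1, q3 = values.quantile([0.25, 0.75])
    # Bins are (-inf, Q1], (Q1, Q3], (Q3, inf); this fails if Q1 == Q3
    bins = [-np.inf, q1, q3, np.inf]
    return pd.cut(values, bins=bins, labels=labels).astype(int)

# Toy data (assumption): Q1 = 30 and Q3 = 50 for these five values
toy_ages = pd.Series([20, 30, 40, 50, 60])
toy_levels = bin_by_quartiles(toy_ages, labels=[1, 2, 3])
print(toy_levels.tolist())  # [1, 1, 2, 2, 3]
```

Because the thresholds are quartiles, the middle level covers roughly half of the rows in the three-label case, while the four-label case yields levels of roughly equal size.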

In [4]:
# Creating 'Age_level' feature
labels = [1,2,3]

df['Age_level'] = categorice_feature(
    data = df,
    feature = 'Age',
    labels=labels,
)

# Creating 'Income_level' feature based on 'Income'
Income_level = [1,2,3,4]
df['Income_level'] = categorice_feature(
    data=df,
    feature='Income',
    labels=Income_level
)

# Creating 'Tenure_level' feature
tenure_levels = [1,2,3,4]
df['Tenure_level'] = categorice_feature(
    data=df,
    feature='Tenure_Months',
    labels=tenure_levels
)

## Features Selection

In [5]:
selected_features = ['Education','Marital_Status','Income_level','Kidhome','Teenhome','Tenure_level','Age_level']

- Only demographic features were selected, since the goal is demographic segmentation.

## Mapping Categorical Features to Numerical:

In [6]:
from sklearn.preprocessing import OrdinalEncoder

# Creating a copy of the dataset:
encoded_df = df[selected_features].copy()

# Initialize the OrdinalEncoder with categories defined manually
education_levels = ['Basic', '2n Cycle', 'Graduation', 'Master', 'PhD']
encoder = OrdinalEncoder(categories=[education_levels])

# Fit and transform the 'Education' column
encoded_df['Education'] = encoder.fit_transform(df[['Education']])

# Applying OHE to the Marital_Status feature and dropping the redundant 'Uncommitted' column
encoded_df = pd.get_dummies(columns=['Marital_Status'], data=encoded_df).drop(columns=["Marital_Status_Uncommitted"])
encoded_df['Marital_Status_Committed'] = encoded_df['Marital_Status_Committed'].astype(int)

# Display the first few rows
encoded_df.head()

Unnamed: 0,Education,Income_level,Kidhome,Teenhome,Tenure_level,Age_level,Marital_Status_Committed
0,2.0,3,0,0,4,3,0
1,2.0,2,1,1,1,3,0
2,2.0,4,0,0,2,2,1
3,2.0,1,1,0,1,1,1
4,4.0,3,1,0,1,1,1
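
The two encodings in this cell can be illustrated with a small sketch on toy rows. As an assumption made for illustration, it uses `pd.Categorical(...).codes` as a pandas-only stand-in for `OrdinalEncoder`: both assign each category its position in the supplied list, although `OrdinalEncoder` returns floats. The `toy` DataFrame is hypothetical.

```python
import pandas as pd

education_levels = ['Basic', '2n Cycle', 'Graduation', 'Master', 'PhD']

# Toy rows (assumption), not taken from the real dataset
toy = pd.DataFrame({
    'Education': ['Graduation', 'PhD', 'Basic'],
    'Marital_Status': ['Uncommitted', 'Committed', 'Committed'],
})

# Ordinal encoding: each category becomes its position in education_levels
toy['Education'] = pd.Categorical(
    toy['Education'], categories=education_levels, ordered=True
).codes

# One-hot encoding: one 0/1 column per category; with only two categories,
# one of the columns is redundant and can be dropped
toy = pd.get_dummies(toy, columns=['Marital_Status'], dtype=int)
toy = toy.drop(columns=['Marital_Status_Uncommitted'])

print(toy)
```

Ordinal encoding preserves the ranking between education levels, whereas one-hot encoding makes no ordering assumption, which suits `Marital_Status`.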


## Scaling Numerical Features:

In [7]:
from sklearn.preprocessing import MinMaxScaler

# Selecting numerical features:
numerical_features = encoded_df.select_dtypes(include=[np.number]).columns

# Initializing MinMaxScaler:
scaler = MinMaxScaler()

# Applying min-max scaling (each column is rescaled to the [0, 1] range):
encoded_df[numerical_features] = scaler.fit_transform(encoded_df[numerical_features])
encoded_df.head()

Unnamed: 0,Education,Income_level,Kidhome,Teenhome,Tenure_level,Age_level,Marital_Status_Committed
0,0.5,0.666667,0.0,0.0,1.0,1.0,0.0
1,0.5,0.333333,0.5,0.5,0.0,1.0,0.0
2,0.5,1.0,0.0,0.0,0.333333,0.5,1.0
3,0.5,0.0,0.5,0.0,0.0,0.0,1.0
4,1.0,0.666667,0.5,0.0,0.0,0.0,1.0
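
The values in this output follow from the min-max formula that `MinMaxScaler` applies to each column with its default settings: x' = (x - min) / (max - min). The sketch below reproduces it by hand with pandas; the `toy` DataFrame is a hypothetical example, chosen so that `Income_level` spans 1 to 4 and `Kidhome` spans 0 to 2.

```python
import pandas as pd

# Toy data (assumption) with the same value ranges as two notebook columns
toy = pd.DataFrame({
    'Income_level': [3, 2, 4, 1],
    'Kidhome': [0, 1, 0, 2],
})

# Column-wise min-max scaling: subtract the minimum, divide by the range
scaled = (toy - toy.min()) / (toy.max() - toy.min())

# Income_level 3 becomes (3 - 1) / (4 - 1) = 0.666..., Kidhome 1 becomes 0.5
print(scaled.round(6))
```

Min-max scaling differs from standardization, which subtracts the mean and divides by the standard deviation and does not bound the result to [0, 1].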


## Saving the Preprocessed Dataset to a CSV File:

In [8]:
encoded_df.to_csv('../customer_personality_analysis/data/preprocessed.csv')
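
Note that `to_csv` writes the DataFrame index by default, which is why a reloaded file shows an `Unnamed: 0` column like the one dropped in the data load cell. The sketch below demonstrates this with an in-memory buffer; the `toy` DataFrame is hypothetical, and passing `index=False` is one possible option, not a change the notebook makes.

```python
import io
import pandas as pd

toy = pd.DataFrame({'Education': [0.5, 1.0], 'Kidhome': [0.0, 0.5]})

# Default behaviour: the index is written as an unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
reloaded_default = pd.read_csv(with_index)
print(list(reloaded_default.columns))  # ['Unnamed: 0', 'Education', 'Kidhome']

# With index=False the file holds only the data columns
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
reloaded = pd.read_csv(without_index)
print(list(reloaded.columns))  # ['Education', 'Kidhome']
```

If the index is kept in the file, downstream notebooks need to drop `Unnamed: 0` or read the file with `index_col=0`.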

## Preprocessing Summary

In this preprocessing stage, the following actions were taken to prepare the dataset for machine learning models:

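1. **Quartile-based level features:**  
   `Age`, `Income` and `Tenure_Months` were binned at their quartiles into the ordinal features `Age_level`, `Income_level` and `Tenure_level`.

2. **Encoding of categorical variables:**  
   `Education` was mapped to ordered integers with OrdinalEncoder, and `Marital_Status` was converted with One-Hot Encoding (OHE) into a single binary column, `Marital_Status_Committed`, after dropping the redundant `Marital_Status_Uncommitted` column.

3. **MinMaxScaler for numerical variables:**  
   Numerical variables were rescaled with MinMaxScaler, which maps each variable to the [0, 1] range so that all variables have a comparable scale.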

These transformations ensure that the dataset is adequately prepared for further modeling steps.
