# Dataset Preprocessing

This notebook builds on `03_imputation_clipped_mean.csv` and covers the following:

1. reviewing columns and clipping obviously wrong values
2. handling skewed distributions
3. scaling/normalizing the data

In [32]:
import numpy as np
import pandas as pd  # needed for pd.read_csv below
from scipy.stats import skew

In [33]:
# review feature distributions (using data wrangler)

train_dataset = pd.read_csv("../data/03_imputation_clipped_mean.csv")
y_col = "Premium Amount"

train_dataset.head(1)

Unnamed: 0,Age,Annual Income,Number of Dependents,Health Score,Previous Claims,Vehicle Age,Credit Score,Insurance Duration,Premium Amount,Policy Duration Mins,Gender_Male,Marital Status_Married,Marital Status_Single,Education Level_High School,Education Level_Master's,Education Level_PhD,Occupation_Self-Employed,Occupation_Unemployed,Location_Suburban,Location_Urban,Policy Type_Comprehensive,Policy Type_Premium,Customer Feedback_Good,Customer Feedback_Poor,Smoking Status_Yes,Exercise Frequency_Monthly,Exercise Frequency_Rarely,Exercise Frequency_Weekly,Property Type_Condo,Property Type_House
0,19.0,10049.0,1.0,22.598761,2.0,17.0,372.0,5.0,2869.0,518866.882265,False,True,False,False,False,False,True,False,False,True,False,True,False,True,False,False,False,True,False,True


# 1. Preprocessing for linear models

Transform y (`y_col`) along with the features.

Apply z-score standardization.

In [34]:
linear_train_dataset = train_dataset.copy()

## 1.1 Handle Skewed Distributions


### expand for more info

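Why: Many machine learning models (especially linear ones) behave better when features and the target are roughly symmetric. Strictly speaking, linear regression assumes normally distributed residuals rather than normally distributed features, but a long tail produces high-leverage points and tends to skew the residuals as well.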

How:

Log Transformation: For strongly positive-skewed data (e.g., income).

Box-Cox Transformation: For reducing skewness in a broader range of distributions (requires data > 0).

Winsorization: Replace extreme values with the nearest threshold value instead of removing them.

Example:

Column: Salary

Data: [25k, 30k, 28k, 1.2M, 29k, 31k]

Action: Apply log transformation (natural log).

Result: [10.13, 10.31, 10.24, 14.00, 10.28, 10.34]
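
The three techniques listed above can be sketched on the salary example as follows. This is a minimal illustration, not the code used later in this notebook. The variable names and the 20% upper winsorization limit are assumptions chosen for the demo. Box-Cox is shown with a fixed `lmbda=0`, where it reduces to the natural log; omitting `lmbda` would let SciPy estimate it instead.

```python
import numpy as np
from scipy.stats import boxcox
from scipy.stats.mstats import winsorize

# Salary example from the text (values in currency units)
salary = np.array([25_000, 30_000, 28_000, 1_200_000, 29_000, 31_000], dtype=float)

# Log transformation (natural log) compresses the long right tail
log_salary = np.log(salary)

# Box-Cox requires strictly positive data; with lmbda=0 it equals the natural log
boxcox_salary = boxcox(salary, lmbda=0.0)

# Winsorization: cap the top 20% of values (one value out of six here)
# at the largest remaining value instead of dropping the row
winsorized_salary = np.asarray(winsorize(salary, limits=(0.0, 0.2)))

print(np.round(log_salary, 2))
print(winsorized_salary)
```

Note that the log transform shrinks the gap between 1.2M and the other salaries but does not remove it, while winsorization replaces the 1.2M entry with 31k.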

### Apply Transforms including `y_col`

In [35]:
# Columns most likely to need a transform:
# log_transform_cols = ['Annual Income', 'Health Score', 'Previous Claims', y_col]

# Since skew can be measured directly, check every non-binary column instead
non_binary_cols = [col for col in linear_train_dataset.columns if linear_train_dataset[col].dtype != 'bool']


In [36]:
# Apply transformations iteratively until skew is within [-0.5, 0.5]
for col in non_binary_cols:
    max_iterations = 3  # Prevent excessive loops
    iteration = 0

    while iteration < max_iterations:
        skew_value = skew(linear_train_dataset[col])
        
        if -0.5 <= skew_value <= 0.5:
            break  # Stop if skew is already in range
        
        if skew_value > 0.5:
            linear_train_dataset[col] = np.log1p(linear_train_dataset[col])  # Log transform for positive skew
        elif skew_value < -0.5:
            linear_train_dataset[col] = linear_train_dataset[col]**2  # Square for negative skew
            
        iteration += 1
    
    final_skew = skew(linear_train_dataset[col])
    print(f"Transformed {col} {iteration} time(s). Final skew: {final_skew:.2f}")


Transformed Age 0 time(s). Final skew: -0.01
Transformed Annual Income 3 time(s). Final skew: 0.01
Transformed Number of Dependents 0 time(s). Final skew: -0.01
Transformed Health Score 0 time(s). Final skew: 0.29
Transformed Previous Claims 1 time(s). Final skew: -0.17
Transformed Vehicle Age 0 time(s). Final skew: -0.02
Transformed Credit Score 0 time(s). Final skew: -0.12
Transformed Insurance Duration 0 time(s). Final skew: -0.01
Transformed Premium Amount 3 time(s). Final skew: 0.13
Transformed Policy Duration Mins 0 time(s). Final skew: -0.01
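
Because `y_col` is transformed here, predictions from a model trained on this dataset come out in the transformed space and need to be mapped back to premium amounts. The loop above does not record which steps it applied, so the sketch below shows one possible way to track them and undo them. The helpers `reduce_skew` and `invert_transforms` and the demo values are hypothetical and not part of this notebook. The z-score step in section 1.2 would also need to be undone first, using the stored mean and standard deviation.

```python
import numpy as np
import pandas as pd
from scipy.stats import skew

def reduce_skew(series, max_iterations=3, threshold=0.5):
    """Apply log1p / square steps like the loop above and record each step."""
    steps = []
    values = series.astype(float)
    for _ in range(max_iterations):
        s = skew(values)
        if s > threshold:
            values = np.log1p(values)   # positive skew
            steps.append("log1p")
        elif s < -threshold:
            values = values ** 2        # negative skew
            steps.append("square")
        else:
            break                       # skew already within range
    return values, steps

def invert_transforms(values, steps):
    """Undo the recorded steps in reverse order.

    Assumes values were non-negative before any squaring step.
    """
    for step in reversed(steps):
        values = np.expm1(values) if step == "log1p" else np.sqrt(values)
    return values

# Made-up, right-skewed demo data
demo = pd.Series([100.0, 150.0, 200.0, 250.0, 5000.0, 20000.0])
transformed, steps = reduce_skew(demo)
restored = invert_transforms(transformed, steps)
print(steps)
```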


## 1.2 Standardizing features 

### (expand for info on Scaling/Normalization/Standardization)

Why: Helps models converge faster and prevents one feature from dominating due to its scale.

Methods:

Normalization: Scale to a fixed range [0, 1].

Standardization: Center data around 0 with unit variance.

Robust Scaling: Scale using the median and IQR (useful for data with outliers).

Example:

Column: Height

Data: [150, 160, 170, 180]

Action: Normalize.

Result: [0, 0.33, 0.67, 1.0]
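
For comparison, the three methods above can be applied to the same height data. This is a small NumPy sketch with illustrative variable names. The standardization uses the sample standard deviation (`ddof=1`) to match the pandas `.std()` default used later in this notebook.

```python
import numpy as np

height = np.array([150.0, 160.0, 170.0, 180.0])

# Normalization (min-max): scale to [0, 1]
normalized = (height - height.min()) / (height.max() - height.min())

# Standardization (z-score): zero mean, unit sample variance
standardized = (height - height.mean()) / height.std(ddof=1)

# Robust scaling: center on the median, divide by the interquartile range
q1, median, q3 = np.percentile(height, [25, 50, 75])
robust = (height - median) / (q3 - q1)

print(np.round(normalized, 2))
print(np.round(standardized, 2))
print(np.round(robust, 2))
```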

### Applying z-score standardization


In [38]:
# Perform Z-score standardization for non-binary columns
for col in non_binary_cols:
    mean = linear_train_dataset[col].mean()
    std = linear_train_dataset[col].std()
    linear_train_dataset[col] = (linear_train_dataset[col] - mean) / std

linear_train_dataset.head(1)

Unnamed: 0,Age,Annual Income,Number of Dependents,Health Score,Previous Claims,Vehicle Age,Credit Score,Insurance Duration,Premium Amount,Policy Duration Mins,Gender_Male,Marital Status_Married,Marital Status_Single,Education Level_High School,Education Level_Master's,Education Level_PhD,Occupation_Self-Employed,Occupation_Unemployed,Location_Suburban,Location_Urban,Policy Type_Comprehensive,Policy Type_Premium,Customer Feedback_Good,Customer Feedback_Poor,Smoking Status_Yes,Exercise Frequency_Monthly,Exercise Frequency_Rarely,Exercise Frequency_Weekly,Property Type_Condo,Property Type_House
0,-1.64847,-0.65692,-0.747535,-0.255551,1.190605,1.286338,-1.567243,-0.007023,1.730884,-1.301872,False,True,False,False,False,False,True,False,False,True,False,True,False,True,False,False,False,True,False,True
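
The loop above computes the mean and standard deviation from the training data and then discards them. If the same scaling has to be applied to a test set or to new rows later, one option is to keep those statistics and reuse them, as in the sketch below. The helpers `fit_standardizer` and `apply_standardizer` and the toy frames are hypothetical, not part of this notebook.

```python
import pandas as pd

def fit_standardizer(df, cols):
    """Return per-column mean and sample std (pandas default, ddof=1)."""
    return df[cols].mean(), df[cols].std()

def apply_standardizer(df, cols, mean, std):
    """Standardize `cols` with previously fitted statistics; other columns are untouched."""
    out = df.copy()
    out[cols] = (out[cols] - mean) / std
    return out

# Toy data standing in for the training set and for unseen rows
train = pd.DataFrame({"Age": [20.0, 30.0, 40.0], "Smoker": [True, False, True]})
new_rows = pd.DataFrame({"Age": [30.0, 50.0], "Smoker": [False, False]})

cols = ["Age"]
mean, std = fit_standardizer(train, cols)
train_scaled = apply_standardizer(train, cols, mean, std)
new_scaled = apply_standardizer(new_rows, cols, mean, std)  # reuses training statistics
print(new_scaled)
```

Fitting on the training data only keeps information from the unseen rows out of the scaling step.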


## 1.3 Saving dataset

In [39]:
linear_train_dataset.to_csv('../data/04_standardized_preprocessed_dataset.csv')

# 2. Preprocessing for tree-based models

Don't transform y.

Apply normalization (min-max scaling).

In [41]:
tree_train_dataset = train_dataset.copy()

### 2.1 Transform skewed distributions

In [43]:
non_binary_cols = [col for col in tree_train_dataset.columns if tree_train_dataset[col].dtype != 'bool' and col != y_col]

# Apply transformations iteratively until skew is within [-0.5, 0.5]
for col in non_binary_cols:
    max_iterations = 3  # Prevent excessive loops
    iteration = 0

    while iteration < max_iterations:
        skew_value = skew(tree_train_dataset[col])
        
        if -0.5 <= skew_value <= 0.5:
            break  # Stop if skew is already in range
        
        if skew_value > 0.5:
            tree_train_dataset[col] = np.log1p(tree_train_dataset[col])  # Log transform for positive skew
        elif skew_value < -0.5:
            tree_train_dataset[col] = tree_train_dataset[col]**2  # Square for negative skew
            
        iteration += 1
    
    final_skew = skew(tree_train_dataset[col])
    print(f"Transformed {col} {iteration} time(s). Final skew: {final_skew:.2f}")


Transformed Age 0 time(s). Final skew: -0.01
Transformed Annual Income 3 time(s). Final skew: 0.01
Transformed Number of Dependents 0 time(s). Final skew: -0.01
Transformed Health Score 0 time(s). Final skew: 0.29
Transformed Previous Claims 1 time(s). Final skew: -0.17
Transformed Vehicle Age 0 time(s). Final skew: -0.02
Transformed Credit Score 0 time(s). Final skew: -0.12
Transformed Insurance Duration 0 time(s). Final skew: -0.01
Transformed Policy Duration Mins 0 time(s). Final skew: -0.01


### 2.2 Normalizing features (scaling to [0, 1])

In [44]:
# Perform Min-Max Normalization for non-binary columns
for col in non_binary_cols:
    min_val = tree_train_dataset[col].min()
    max_val = tree_train_dataset[col].max()
    tree_train_dataset[col] = (tree_train_dataset[col] - min_val) / (max_val - min_val)

tree_train_dataset.head(1)


Unnamed: 0,Age,Annual Income,Number of Dependents,Health Score,Previous Claims,Vehicle Age,Credit Score,Insurance Duration,Premium Amount,Policy Duration Mins,Gender_Male,Marital Status_Married,Marital Status_Single,Education Level_High School,Education Level_Master's,Education Level_PhD,Occupation_Self-Employed,Occupation_Unemployed,Location_Suburban,Location_Urban,Policy Type_Comprehensive,Policy Type_Premium,Customer Feedback_Good,Customer Feedback_Poor,Smoking Status_Yes,Exercise Frequency_Monthly,Exercise Frequency_Rarely,Exercise Frequency_Weekly,Property Type_Condo,Property Type_House
0,0.021739,0.315364,0.25,0.386282,0.682606,0.894737,0.117757,0.5,2869.0,0.120873,False,True,False,False,False,False,True,False,False,True,False,True,False,True,False,False,False,True,False,True


### 2.3 Saving Dataset

In [45]:
tree_train_dataset.to_csv('../data/04_normalized_preprocessed_dataset.csv')