# 1. IMPORTS
---

## 1.1. Libraries

In [9]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, RobustScaler

# 2. DATA PREPARATION
---

In [10]:
# Load the processed dataset
dfp = pd.read_csv('../data/interim/hi_cs_processed.csv')

## 2.1. Creating features

In [11]:
# Policy sales channel 2: keep channels 152, 26 and 124 as labels, group all the rest as 'others'
dfp['policy_sales_channel2'] = dfp['policy_sales_channel'].astype('int64').astype(str)
dfp.loc[~dfp['policy_sales_channel'].isin([152, 26, 124]), 'policy_sales_channel2'] = 'others'
dfp = pd.get_dummies(dfp, columns=['policy_sales_channel2'], dtype='int64')

# Drop policy_sales_channel and redundant policy_sales_channel2_others
dfp.drop(['policy_sales_channel', 'policy_sales_channel2_others'], axis=1, inplace=True)
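
To make the grouping step easier to follow, here is a minimal sketch on toy data. The `toy` frame and its values are made up for illustration and are not the real dataset; only the column names and the three kept channels come from the cell above.

```python
import pandas as pd

# Toy data standing in for the real column (values are illustrative only)
toy = pd.DataFrame({'policy_sales_channel': [152.0, 26.0, 124.0, 160.0, 3.0, 152.0]})

top_channels = [152, 26, 124]

# Keep the selected channels as string labels and collapse the rest into 'others'
toy['policy_sales_channel2'] = (
    toy['policy_sales_channel'].astype('int64').astype(str)
       .where(toy['policy_sales_channel'].isin(top_channels), 'others')
)

# One-hot encode the grouped column
toy = pd.get_dummies(toy, columns=['policy_sales_channel2'], dtype='int64')

# 'others' becomes the reference level: a row of zeros in the three dummies
toy = toy.drop(columns=['policy_sales_channel', 'policy_sales_channel2_others'])
dummy_cols = sorted(toy.columns)
```

Dropping the `others` dummy loses no information, since a row with zeros in all three remaining columns can only belong to that group.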

## 2.2. Encoder

In [12]:
# Vehicle damage
dfp['vehicle_damage'] = dfp['vehicle_damage'].map({'Yes': 1, 'No': 0}).astype('int64')

# Gender (0 = male, 1 = female)
dfp['gender'] = dfp['gender'].map({'Male': 0, 'Female': 1}).astype('int64')

# Vehicle age
dfp['vehicle_age'] = dfp['vehicle_age'].map({'< 1 Year': 1, '1-2 Year': 2, '> 2 Years': 3}).astype('int64')
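
One caveat with `.map(...)`: a label missing from the dictionary turns into `NaN`, and the `astype('int64')` call then fails with a less obvious error. A possible guard is sketched below. `map_strict` is a hypothetical helper of mine, not part of pandas, and the toy series is illustrative.

```python
import pandas as pd

def map_strict(series, mapping):
    """Map labels to integers, failing loudly on labels missing from the mapping."""
    mapped = series.map(mapping)
    # Labels that produced NaN were not found in the mapping
    unknown = series[mapped.isna()].unique()
    if len(unknown) > 0:
        raise ValueError(f"Unmapped labels in '{series.name}': {list(unknown)}")
    return mapped.astype('int64')

# Toy series using the same labels as the vehicle_age mapping above
toy = pd.Series(['< 1 Year', '1-2 Year', '> 2 Years'], name='vehicle_age')
encoded = map_strict(toy, {'< 1 Year': 1, '1-2 Year': 2, '> 2 Years': 3})
```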

I'll test both `vehicle_age` and `vehicle_age2` with the feature importance methods, so I'll keep both for now and drop one of them afterwards.

## 2.3. Dropping some redundant features

In [13]:
# Reorder the variables and drop region_code and previously_insured
wanted_vars = [
       'id', 'age', 'vehicle_damage', 'annual_premium', 'vintage',
       'famous_region', 'vehicle_age', 'vehicle_age2', 
       'hi_customer_profitability', 'famous_policy_sales_channel', 
       'policy_sales_channel2_124', 'policy_sales_channel2_152', 
       'policy_sales_channel2_26', 'gender', 'response'
]

dfp = dfp[wanted_vars]

There are 4 continuous and **10 categorical features**. Maybe **CatBoost** would be a good choice of algorithm!

In [14]:
# Save the data before the train/test split
dfp.to_csv('../data/interim/hi_cs_pre_training.csv', index=False)

## 2.4. Train and test datasets

In [15]:
# X and y
X = dfp.drop('response', axis=1)
y = dfp['response']

In [16]:
# Train and test datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, 
                                                    random_state=42, stratify=y)
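
A quick way to confirm what `stratify=y` does is to compare the positive rate in both splits. The sketch below uses a synthetic target with 12% positives; that rate is an assumption chosen for illustration, not the rate in the real `response` column.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (12% positives); not the real dataset
rng = np.random.default_rng(42)
toy_X = pd.DataFrame({'age': rng.integers(20, 80, size=1000)})
toy_y = pd.Series((np.arange(1000) % 25 < 3).astype(int), name='response')

# Same call pattern as above, applied to the toy data
_, _, y_tr, y_te = train_test_split(toy_X, toy_y, test_size=0.2,
                                    random_state=42, stratify=toy_y)

# With stratification the two rates should be (almost) identical
train_rate = y_tr.mean()
test_rate = y_te.mean()
```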

In [17]:
# Saving X_train and y_train
X_train.to_csv('../data/processed/X_train.csv', index=False)
y_train.to_csv('../data/processed/y_train.csv', index=False)

# Saving X_test and y_test
X_test.to_csv('../data/processed/X_test.csv', index=False)
y_test.to_csv('../data/processed/y_test.csv', index=False)

## 2.5. Rescaling

I'll rescale the data with two methods: **Min Max Scaler** and **Robust Scaler**. Apparently, none of the continuous variables follows a Normal distribution. Where a variable has a lot of outliers, Robust Scaler is the better choice; otherwise, I'll use Min Max.

**SCALERS DEFINED:**

- **Min Max Scaler:**
    - `age`
    - `vintage`
- **Robust Scaler:**
    - `annual_premium`
    - `hi_customer_profitability`

In [18]:
# Define the scalers and the columns they should be applied to
scalers = [
    (MinMaxScaler(), ['age', 'vintage']),
    (RobustScaler(), ['annual_premium', 'hi_customer_profitability'])
]

# Define the column transformer with specified scalers
transformers = [(scaler.__class__.__name__.lower(), scaler, cols) for scaler, cols in scalers]
preprocessor = ColumnTransformer(transformers=transformers, remainder='passthrough')

# Apply rescaling to columns
X_train_rescaled = preprocessor.fit_transform(X_train)
X_test_rescaled = preprocessor.transform(X_test)

# Get feature names and strip the '<transformer name>__' prefix added by ColumnTransformer
X_train_cols = list(preprocessor.get_feature_names_out())
X_train_cols = [x.split('__', 1)[1] for x in X_train_cols]

In [19]:
# Saving X_train_rescaled and X_test_rescaled
pd.DataFrame(X_train_rescaled, columns=X_train_cols)\
    .to_csv('../data/processed/X_train_rescaled.csv', index=False)
pd.DataFrame(X_test_rescaled, columns=X_train_cols)\
    .to_csv('../data/processed/X_test_rescaled.csv', index=False)