<div style= "color:SkyBlue; font-size:16px;padding:10px;">
    
**Regression:**

<span style="color:PaleGoldenRod; text-transform: uppercase;">Predict</span> the <span style="color:PaleGoldenRod; text-transform: uppercase;">Insurance Premium</span> from available demographic information.

**Scoring Criteria:**

Submissions are evaluated using <span style="color:PaleGoldenRod; text-transform: uppercase;">Root Mean Squared Logarithmic Error</span> (RMSLE).
</div>
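
For reference, RMSLE is the root mean squared difference between `log(1 + prediction)` and `log(1 + actual)`. The function below is a minimal sketch of that definition; the sample values are invented for illustration and are not taken from the competition data.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root Mean Squared Logarithmic Error for non-negative values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if (y_true < 0).any() or (y_pred < 0).any():
        raise ValueError("RMSLE is undefined for negative values")
    # log1p(x) = log(1 + x), so a value of zero is handled safely
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

# Illustrative values only
actual = np.array([100.0, 1000.0, 2000.0])
predicted = np.array([110.0, 900.0, 2500.0])
print(f"RMSLE: {rmsle(actual, predicted):.4f}")
```

Because the error is measured in log space, RMSLE penalizes relative error rather than absolute error, which matters for a target with a long upper tail.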

---
# 💾 Initialize and Load Data

In [None]:
# Import libraries
import warnings
warnings.filterwarnings("ignore")

# Update libraries
!pip install --upgrade scikit-learn
!pip install --upgrade plotly  ## 5.24.1 -> 6.3.1
!pip install --upgrade seaborn  ##  0.12.2 ->  0.12.3

# data manipulation
import numpy as np
import pandas as pd

# import common libraries and toolkits
from multiprocessing import Pool, cpu_count
#import sys
#import os

# machine learning libraries
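import sklearn as skl
# `import sklearn` alone does not guarantee submodules are loaded;
# import the ones referenced as skl.<submodule> below
import sklearn.cluster
import sklearn.decomposition
import sklearn.ensemble
import sklearn.linear_model
import sklearn.preprocessing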
import lightgbm as lgb
import xgboost as xgb
import catboost as catb
#import umap

# visualization libraries
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
#import plotly as go

# time management
#import optuna
from time import time
from tqdm import tqdm

# other useful libraries
#import math 
#from scipy import stats
#import itertools
#import random

#import pkg_resources
#print("hdbscan:", pkg_resources.get_distribution("hdbscan").version)
#print("sklearn:", skl.__version__)
#print("plotly:", go.__version__)


# reuse my kaggle tabular data functions
import urllib.request

url = "https://raw.githubusercontent.com/2awesome-rob/iron_fungi/main/my_kaggle_functions.py"
urllib.request.urlretrieve(url, "my_kaggle_functions.py")
import my_kaggle_functions as mkf

In [None]:
#Specify PATH
PATH = "/kaggle/input/playground-series-s4e12/"
print(f"Home path is: {PATH}")

#Set globals and load data
DEVICE, CORES  = mkf.set_globals(verbose = True)
XY, features, targets, target = mkf.load_tabular_data(PATH)

---
# 🧭 Exploratory Data Analysis
## 🔎 Target

In [None]:
# show target stats and plots
mkf.summarize_data(XY, target)
mkf.plot_target_eda(XY, target, title = f'{target} distribution')

#### 👀 Target Observations and Notes

Target: Premium Amount (float). Cost of premium in notional dollars.

Target distribution has a strong right skew (median \$865, mean \$1103), with a minimum value of 2 (near 0) and outliers at high values.
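
With the mean above the median, the long tail is on the high side (a right skew). The sketch below shows the same pattern on synthetic log-normal data, which is only a stand-in for a premium-like target and not the competition data, and how `log1p` compresses that tail.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-in: strictly positive values with a long upper tail
premium = pd.Series(rng.lognormal(mean=6.5, sigma=1.0, size=10_000))

print(f"mean: {premium.mean():.0f}, median: {premium.median():.0f}")
print(f"skew (raw):   {premium.skew():.2f}")

# log1p compresses the upper tail; RMSLE is computed in this same space
log_premium = np.log1p(premium)
print(f"skew (log1p): {log_premium.skew():.2f}")
```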

In [None]:
# add labels for plotting;
# if target is numeric, specify categorical cuts.
XY, targets = mkf.get_target_labels(XY, target, targets, cuts=8)

---
## 🔍 Features

In [None]:
mkf.summarize_data(XY, features)

In [None]:
#mkf.plot_features_eda(XY, features[:14], target, 'label', 
#                      high_label="pricey", low_label="cheap")

#### 👀 Feature Observations and Notes
**The target signal is very weak in any given feature**

19 predictive features. 

- 5 numeric / magnitude / count
    - annual_income, number_of_dependents, health_score, previous_claims, credit_score
- 3 numeric / timedelta
    - age, vehicle_age, insurance_duration
- 1 object/ datetime
    - policy_start_date
- 4 object / categorical string 
    - marital_status, occupation, location, property_type
- 4 object / ordinal string 
    - education_level, policy_type, customer_feedback, exercise_frequency
- 2 object / bool string
    - gender, smoking_status 

---
## 🔰 Out of the Box Performance

In [None]:
mkf.plot_feature_corr(XY, features, target)

In [None]:
training_features = [f for f in features if 
                     XY[f].dtype!='object']

X_train, y_train, X_val, y_val, X_test, y_test = mkf.split_training_data(
    XY, training_features, target, validation_size = 0.2, drop_na=True
)

In [None]:
### plot mutual information - may require cleaning prior to evaluation
mi_scores = mkf.get_feature_mutual_info(X_train, y_train)

In [None]:
# check feature_importance
model = lgb.LGBMRegressor(verbose=-1, n_jobs=CORES)

feature_importance = mkf.get_feature_importance(
    X_train, X_val, y_train, y_val, task="regression"
)

In [None]:
single_feature_model, _ = mkf.train_and_score_model(
    X_train[[feature_importance.index[0]]], X_val[[feature_importance.index[0]]], y_train, y_val, 
    model, task="regression_rmsle"
)

In [None]:
oob_model, _ = mkf.train_and_score_model(
    X_train, X_val, y_train, y_val, 
    model, task="regression_rmsle"
)

#### 👀 Initial Model Observations and Notes
- As expected based on above EDA, the feature information scores are **very** low!
- Single Feature Model is NOT predictive
- OOB Model is also NOT predictive
- Pulling signal from noise will be challenging!!
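
One way to put a number on "not predictive" is to compare a model's RMSLE against the best constant prediction. Under RMSLE that constant is the mean of the target in `log1p` space, mapped back with `expm1`. The sketch below uses synthetic data as an assumption; with the notebook's data, `y` would be the validation target.

```python
import numpy as np

def rmsle(y_true, y_pred):
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

rng = np.random.default_rng(0)
y = rng.lognormal(mean=6.5, sigma=1.0, size=5_000)  # synthetic target

# The constant that minimizes RMSLE is the log1p-space mean, mapped back
best_constant = np.expm1(np.mean(np.log1p(y)))
baseline_score = rmsle(y, np.full_like(y, best_constant))

# The arithmetic mean is a worse constant under RMSLE for skewed targets
mean_score = rmsle(y, np.full_like(y, y.mean()))

print(f"log-mean baseline RMSLE: {baseline_score:.4f}")
print(f"arithmetic-mean RMSLE:   {mean_score:.4f}")
```

A model whose validation RMSLE is close to the log-mean baseline has learned little beyond the target's central tendency.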

---
# 📏 Target Engineering

In [None]:
mkf.plot_feature_transforms(XY, target)

In [None]:
XY, targets, TargetTransformer = mkf.get_target_transformer(
    XY, target, targets, name='pwr', TargetTransformer=skl.preprocessing.PowerTransformer()
)
ttarget=targets[-1]
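
Training on a transformed target means predictions have to be mapped back to the original scale before scoring or submitting. The sketch below shows that round trip with scikit-learn's `PowerTransformer` on synthetic data; how `mkf.get_target_transformer` applies the transformer internally is not shown in this notebook, so this is only the generic pattern.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
y = rng.lognormal(mean=6.5, sigma=1.0, size=2_000)  # synthetic skewed target

# PowerTransformer expects a 2D array; Yeo-Johnson is the default method
pt = PowerTransformer()
y_t = pt.fit_transform(y.reshape(-1, 1)).ravel()

# With the default standardize=True the output is centered and unit-scaled
print(f"transformed mean: {y_t.mean():.3f}, std: {y_t.std():.3f}")

# Predictions made in transformed space are mapped back with inverse_transform
y_back = pt.inverse_transform(y_t.reshape(-1, 1)).ravel()
print("round trip ok:", np.allclose(y, y_back))
```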

---
# 📐 Feature Engineering
## 🧹 Clean

In [None]:
XY = mkf.check_duplicates(XY, features, ttarget, drop=False)

In [None]:
# missing values
missing_data_features = mkf.plot_null_data(XY, features, verbose=False)

In [None]:
#correct datatype
XY['policy_start_date'] = pd.to_datetime(XY['policy_start_date'])

#binary map
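# assign the result back; inplace replace on a selected column is unreliable in recent pandas
XY['gender'] = XY['gender'].replace({'Female': False, 'Male': True})
XY['smoking_status'] = XY['smoking_status'].replace({'Yes': True, 'No': False})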

In [None]:
## cleans categoricals and imputes new 'unk' category for missing data
training_features = [f for f in features if XY[f].dtype=='object']
XY = mkf.clean_categoricals(XY, training_features, string_length=3, fillna=True)

In [None]:
#imputer strategies
#fill NaN with -1 as a sentinel value
XY['number_of_dependents'] = XY['number_of_dependents'].fillna(-1)
XY['previous_claims'] = XY['previous_claims'].fillna(-1)

#tag incomplete data rows before filling
XY['missing_credit'] = XY['credit_score'].isna()
XY['missing_health'] = XY['health_score'].isna()

#impute to mean or median (hard-coded values)
XY['vehicle_age'] = XY['vehicle_age'].fillna(10)
XY['insurance_duration'] = XY['insurance_duration'].fillna(5)
#XY['annual_income'] = skl.impute.SimpleImputer(strategy='median').fit_transform(XY['annual_income'].values.reshape(-1,1))

training_features = ['age', 'health_score', 'credit_score']
XY = mkf.impute_using(XY, 'insurance_duration', ['annual_income'])
XY = mkf.impute_using(XY, 'annual_income', training_features)

#use KNN to impute - this will be slow on large data sets
#XY[training_features] = skl.impute.KNNImputer(n_neighbors=3).fit_transform(XY[training_features])

In [None]:
# missing values
missing_data_features = mkf.plot_null_data(XY, features, verbose=False)

In [None]:
training_features = [f for f in features if 
                     (XY[f].dtype=='object' or XY[f].dtype=='category')]

mkf.check_categoricals(XY, training_features, pct_diff=0.5)

In [None]:
#XY = mkf.denoise_categoricals(XY, training_features, target=ttarget, threshold=0.1)

In [None]:
training_features = [f for f in features if 
                     (XY[f].dtype=='float' or XY[f].dtype=='int')]
#XY = mkf.tag_outliers_by_neighbors(XY, training_features, n_neighbors=5)

#### 👀 Cleaning Observations and Notes
- occupation and previous_claims missing 30% of data!!
- number_of_dependents and credit_score missing 10% of data!!
- Seven other features also missing data
- Categorical data "filled" by creating new category "unk"
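
The tag-then-fill pattern used above can be sketched on a toy frame. The column names mirror the notebook, but the values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy data; values are invented for illustration
df = pd.DataFrame({
    "credit_score": [700.0, np.nan, 650.0, np.nan, 580.0],
    "occupation": ["emp", None, "sel", "emp", None],
})

# 1) Tag missing rows first; the information is lost once values are filled
df["missing_credit"] = df["credit_score"].isna()

# 2) Fill the numeric column with its median (assigned back, not inplace)
df["credit_score"] = df["credit_score"].fillna(df["credit_score"].median())

# 3) Give the categorical an explicit "unk" level instead of dropping rows
df["occupation"] = df["occupation"].fillna("unk")

print(df)
```

Keeping the indicator column lets a tree model split on missingness itself, which can matter when 10% to 30% of a feature is absent.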

---
## 📅 Datetime Feature Extraction

In [None]:
#transform date time to training features
XY = mkf.get_cycles_from_datetime(XY, 'policy_start_date', drop=True)
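
Calendar fields wrap around: December is adjacent to January, which a plain integer encoding hides. A common remedy is a sine/cosine pair per cycle. The sketch below is a generic version of that idea; `add_cyclical` and its column names are hypothetical, and the exact features produced by `mkf.get_cycles_from_datetime` may differ.

```python
import numpy as np
import pandas as pd

def add_cyclical(df, col, period, values):
    """Add <col>_sin and <col>_cos columns for a value that wraps at `period`."""
    angle = 2 * np.pi * values / period
    df[f"{col}_sin"] = np.sin(angle)
    df[f"{col}_cos"] = np.cos(angle)
    return df

dates = pd.to_datetime(["2023-01-01", "2023-07-02", "2023-12-31"])
df = pd.DataFrame({"policy_start_date": dates})

# Month as a 12-step cycle (0..11), so December sits next to January
month = df["policy_start_date"].dt.month - 1
df = add_cyclical(df, "month", 12, month)
print(df.round(3))
```

In the encoded space the January and December rows end up close together, while January and July sit on opposite sides of the circle.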

## 🔠 Categorical Feature Extraction

In [None]:
#Convert categoricals to numeric training info
training_features = [f for f in XY.columns if f not in targets and 
                     XY[f].dtype=='category']
for f in training_features[:3]:
    print(f"{f}: {list(XY[f].unique())}")

# ordinal features to integers (assigned back rather than replaced inplace)
XY['education_level'] = XY['education_level'].replace({'hig': 0, 'bac': 1, 'mas':2, 'phd':3})
XY['customer_feedback'] = XY['customer_feedback'].replace({'unk': -1, 'poo': 1, 'ave':2, 'goo':3})
XY['exercise_frequency'] = XY['exercise_frequency'].replace({'dai': 0, 'wee': 1, 'mon':2, 'rar':3})

#consider using ordinal encoding if appropriate
# pass a 2D frame to the encoder and flatten the (n, 1) result to a 1D column
XY['policy_type_enc'] = skl.preprocessing.OrdinalEncoder(dtype=np.int8).fit_transform(XY[['policy_type']]).ravel()
XY['policy_type_enc'] = XY['policy_type_enc'].astype('category')

#one hot encode non-ordinal categoricals
training_features = ['marital_status', 'occupation', 'location', 'property_type', 'policy_type']
XY = pd.get_dummies(XY,columns=training_features)
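
The split between the two encodings comes down to whether the category levels have a natural order. A minimal sketch on a toy frame follows; the abbreviated level names are illustrative assumptions, not values read from the dataset.

```python
import pandas as pd

# Toy data; level names are illustrative
df = pd.DataFrame({
    "education_level": ["hig", "phd", "bac", "mas"],
    "location": ["urb", "rur", "sub", "urb"],
})

# Ordinal: an explicit mapping preserves the natural order of the levels
edu_order = {"hig": 0, "bac": 1, "mas": 2, "phd": 3}
df["education_level"] = df["education_level"].map(edu_order)

# Nominal: one-hot encoding avoids implying an order between locations
df = pd.get_dummies(df, columns=["location"])
print(df)
```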

## 🧮 Expert Features

In [None]:
training_features = [f for f in XY.columns if f not in targets and
                     (XY[f].dtype=='float' or XY[f].dtype=='int') and
                    '_sin' not in f and '_cos' not in f]

mkf.print_pca_loadings(XY, training_features)

In [None]:
def get_domain_expert_features(df):
    #PCA 10 health to wealth ratio
    df['pc10'] = (df['annual_income'] * df['credit_score']) / df['health_score'] * (2+df['previous_claims']) 
    return df
    
XY = get_domain_expert_features(XY)

In [None]:
# add simple feature interactions
training_features = [f for f in XY.columns if f not in targets and
                     f in features and
                     (XY[f].dtype=='float' or XY[f].dtype=='int')]

XY = mkf.get_feature_interactions(XY, training_features)

In [None]:

XY = mkf.get_feature_by_grouping_on_cat(XY, ['age', 'credit_score', 'annual_income'], 'education_level')

XY = mkf.get_feature_cat_interactions(XY, ['previous_claims', 'exercise_frequency', 'education_level'], 'number_of_dependents')

## ⚖️ Scale/Transform Features

In [None]:
training_features = [f for f in XY.columns if f not in targets and
                     (XY[f].dtype=='float' or XY[f].dtype=='int') and
                     '_sin' not in f and
                     '_cos' not in f and
                     '*' not in f]

for feat in training_features[-5:]:
    mkf.plot_feature_transforms(XY, feat)

In [None]:
mkf.check_all_features_scaled(XY, targets)

In [None]:
#scale / transform numeric features
# standard transform features
training_features = ['health_score', 'credit_score']
XY = mkf.get_transformed_features(
    XY, training_features, skl.preprocessing.StandardScaler()
)

# power transform features
training_features = ['annual_income', 'pc10']
XY = mkf.get_transformed_features(
    XY, training_features, skl.preprocessing.PowerTransformer()
)

# minmax transform features
training_features = ['age', 'number_of_dependents', 'previous_claims', 'vehicle_age', 'insurance_duration',
                    'policy_start_date_dummy', 'policy_start_date_doy']
XY = mkf.get_transformed_features(
    XY, training_features, skl.preprocessing.MinMaxScaler()
)

mkf.check_all_features_scaled(XY, targets)

---
## ➖ Dimension Reduction
#### Embeddings

In [None]:
training_features = [f for f in XY.columns if f in features and
                     (XY[f].dtype=='float' or XY[f].dtype=='int')]

XY = mkf.get_embeddings(XY, training_features, 
    skl.decomposition.PCA(n_components=4), "pca_orig_",
    target=target, verbose=True
)

In [None]:
training_features = [f for f in XY.columns if f not in targets and
                     XY[f].dtype!='bool']

XY = mkf.get_embeddings(XY, training_features, 
    skl.decomposition.PCA(n_components=12), "pca_all_",
    target=target, verbose=True
)

In [None]:
# rbf features were low info/low importance
#XY = mkf.get_embeddings(XY, training_features, 
#    skl.kernel_approximation.RBFSampler(n_components=16), "rbf_",
#    verbose=False
#)

#umap embedding can be slow
#XY = mkf.get_embeddings(XY, training_features, 
#    umap.UMAP(n_components=16), "umap_all_", sample_size=0.1,
#    target=target, verbose=True
#)

#### Clustering

In [None]:
training_features = [f for f in XY.columns if f not in targets and
                     "pca_" in f]

XY = mkf.get_clusters(XY, training_features,
    skl.cluster.KMeans(init="k-means++", n_clusters=6, random_state=69),  "k_means_pca")

In [None]:
training_features = [f for f in XY.columns if f not in targets]

XY = mkf.get_clusters(XY, training_features,
    skl.cluster.KMeans(init="k-means++", n_clusters=12, random_state=69),  "k_means_all")

In [None]:
#DBSCAN memory usage can be excessive 
#XY = mkf.get_clusters(XY, training_features,
#    skl.cluster.DBSCAN(eps=2, min_samples=333, metric='euclidean', leaf_size=30, n_jobs=CORES), "dbscan_pca", 
#    target=ttarget)

---
## 📋 Evaluate Performance

In [None]:
training_features = [f for f in XY.columns if f not in targets and
                     XY[f].dtype!='object']

X_train, y_train, X_val,  y_val, X_test, y_test = mkf.split_training_data(
    XY, training_features, ttarget, validation_size = 0.2
)

In [None]:
#plot feature correlation
mkf.plot_feature_corr(XY, training_features, ttarget)

In [None]:
### updated mutual information
mi_scores = mkf.get_feature_mutual_info(X_train, y_train)

In [None]:
### updated feature information
feature_importance = mkf.get_feature_importance(
    X_train, X_val, y_train, y_val, task="regression"
)

In [None]:
updated_model, _ = mkf.train_and_score_model(
    X_train, X_val, y_train, y_val, 
    model, task="regression_rmsle", 
    TargetTransformer = TargetTransformer
)

---
## 🆖 Outliers

In [None]:
XY['model_residual'] = XY[ttarget] - updated_model.predict(XY[training_features])
targets.append('model_residual')

In [None]:
XY, _ = mkf.get_outliers(XY, 'model_residual', deviations = 5, remove=False)
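
Residual-based tagging flags rows the model misses by an unusually wide margin. The sketch below assumes `deviations` means standard deviations from the mean residual; `tag_outliers` is a hypothetical helper, and `mkf.get_outliers` may use a different rule.

```python
import numpy as np
import pandas as pd

def tag_outliers(df, col, deviations=5.0):
    """Flag rows where `col` is more than `deviations` std devs from its mean."""
    z = (df[col] - df[col].mean()) / df[col].std()
    out = df.copy()
    out[f"{col}_outlier"] = z.abs() > deviations
    return out

rng = np.random.default_rng(0)
resid = rng.normal(0.0, 1.0, size=1_000)  # synthetic residuals
resid[:3] = [12.0, -15.0, 20.0]           # plant three obvious outliers

df = tag_outliers(pd.DataFrame({"model_residual": resid}), "model_residual")
print(df["model_residual_outlier"].sum(), "rows flagged")
```

Tagging rather than removing, as with `remove=False` above, keeps the rows available while still letting later steps down-weight or exclude them.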

# 🏃‍♂️ Training and Evaluation
## 👯‍♀️ Model Selection

- base_models -> model classes
- models -> base models with hyperparameters
- training_features -> list of training features for each model class

In [None]:
important_features = [f for f in feature_importance.index.tolist() if 
                      (feature_importance[f] > 10 or
                       mi_scores[f] > 0)]

print(f"Not using {[f for f in feature_importance.index.tolist() if f not in important_features]}")

In [None]:
base_models = {
    'linear' : skl.linear_model.LinearRegression,
    'lgb' : lgb.LGBMRegressor,
    'lgb2' : lgb.LGBMRegressor,
    'catb' : catb.CatBoostRegressor,
    'hgb' : skl.ensemble.HistGradientBoostingRegressor,
    'xgb' : xgb.XGBRegressor
}

params = {
    'linear' : {}
}

models, training_features = mkf.get_ready_models(XY, important_features, ttarget, 
    base_models, task='regression', direction='minimize', hyper_params=params,
    n_features=0, n_trials=3, CORES=CORES, DEVICE=DEVICE, verbose=False,
)

## 🏋️‍♂️ Model Training

In [None]:
trained_models, stacking_model = mkf.cv_train_models(XY, training_features, ttarget,
    models, task='regression_rmsle', TargetTransformer=TargetTransformer, 
    folds=5
    )

---
# 🔮 Predict & Submit

In [None]:
predictions = mkf.submit_cv_predict(X_test, y_test, training_features, target, 
                      trained_models, task='regression_rmsle',
                      meta_model=stacking_model,
                      TargetTransformer=TargetTransformer,
                      path=PATH, verbose=True)