# Statistical Modeling

Tasks:
- Data Preparation
- Model Building
- Model Evaluation
- Feature Importance Analysis
  - Use SHAP (SHapley Additive exPlanations) or LIME (Local Interpretable Model-agnostic Explanations) to interpret the model's predictions and understand how individual features influence the outcomes.
- Report a comparison of the models' performance.


In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

import os
import sys
sys.path.append(os.path.abspath(os.path.join('..','src')))
from eda import EDA

from scipy.stats import chi2_contingency, ttest_ind, fisher_exact

import warnings
warnings.filterwarnings('ignore')

In [2]:
# get the CSV file
df_insurance = pd.read_csv("df_insurance.csv")

# instantiate the class
eda = EDA(df_insurance)

# change the datatype to appropriate type
eda.change_dtype()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 990010 entries, 0 to 990009
Data columns (total 16 columns):
 #   Column                    Non-Null Count   Dtype         
---  ------                    --------------   -----         
 0   Gender                    990010 non-null  category      
 1   Province                  990010 non-null  category      
 2   PostalCode                990010 non-null  category      
 3   TransactionMonth          990010 non-null  datetime64[ns]
 4   VehicleType               990010 non-null  category      
 5   RegistrationYear          990010 non-null  category      
 6   SumInsured                990010 non-null  float64       
 7   TermFrequency             990010 non-null  category      
 8   TotalPremium              990010 non-null  float64       
 9   Product                   990010 non-null  category      
 10  CoverType                 990010 non-null  category      
 11  TotalClaims               990010 non-null  float64       
 12  St

## Data Preparation

### Feature Engineering

In [3]:
# new features
df_insurance = eda.get_dataframe()

df_insurance['Profit'] = df_insurance['TotalPremium'] - df_insurance['TotalClaims']                       # premium collected minus claims paid
df_insurance['Premium_per_claims'] = df_insurance['TotalPremium'] / (df_insurance['TotalClaims'] + 1e-6)  # small constant avoids division by zero
df_insurance['RiskScore'] = df_insurance['TotalClaims'] * df_insurance['SumInsured']                      # claims weighted by the insured value
df_insurance['ClaimSeverity'] = df_insurance['TotalClaims'] / df_insurance['SumInsured']                  # amount claimed relative to the insured value


In [4]:
# select the features to work with
features = [
    'Gender',
    'VehicleType',
    'SumInsured',
    'TermFrequency',
    'TotalPremium',
    'Product',
    'CoverType',
    'TotalClaims',
    'StatutoryRiskType',
    'CalculatedPremiumPerTerm',
    'Profit',
    'Premium_per_claims',
    'RiskScore',
    'ClaimSeverity',
]
df_insurance = df_insurance[features]

eda = EDA(df_insurance)

A high SumInsured combined with a high TotalClaims suggests a high-risk policyholder.
A low TotalClaims with a high SumInsured indicates a low-risk policyholder.
RiskScore (TotalClaims multiplied by SumInsured) therefore combines the exposure (the amount insured) with the claims made against the policy.
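
To make the engineered features concrete, here is a minimal sketch on three made-up policies (the values are illustrative and not taken from `df_insurance`). It applies the same formulas as the cell above, with one assumed addition: a zero `SumInsured` is replaced by `NaN` before dividing, so `ClaimSeverity` does not become infinite.

```python
import numpy as np
import pandas as pd

# Toy policies with made-up values (not from df_insurance)
toy = pd.DataFrame({
    "TotalPremium": [100.0, 100.0, 100.0],
    "TotalClaims": [0.0, 50.0, 5000.0],
    "SumInsured": [200000.0, 200000.0, 0.0],
})

# Same formulas as the feature-engineering cell
toy["Profit"] = toy["TotalPremium"] - toy["TotalClaims"]
toy["RiskScore"] = toy["TotalClaims"] * toy["SumInsured"]

# Assumed safeguard: treat a zero SumInsured as missing instead of dividing by it
toy["ClaimSeverity"] = toy["TotalClaims"] / toy["SumInsured"].replace(0, np.nan)

print(toy)
```

Note that the third policy has the largest claim but a RiskScore of zero, because its SumInsured is zero. Whether such rows exist in the real data is not shown here, so this is worth checking before RiskScore is used as a target.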

### Encoding Categorical Data

In [5]:
# one-hot encoding
new_df = eda.one_hot()

print(f'Shape of the new dataframe with encoded categories: {new_df.shape}')

Shape of the new dataframe with encoded categories: (990010, 39)
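
The body of `eda.one_hot()` is not shown in this notebook. As a rough sketch of what such a step might look like, the example below uses `pandas.get_dummies` with `drop_first=True` on a toy frame. The choice of `drop_first=True` is an assumption: it is consistent with the correlation table, which lists `Gender_Male` and `Gender_Not specified` but no `Gender_Female`.

```python
import pandas as pd

# Toy frame with made-up values; only the column names mirror the notebook
toy = pd.DataFrame({
    "Gender": pd.Categorical(["Male", "Female", "Not specified", "Male"]),
    "SumInsured": [5000.0, 7500.0, 3000.0, 10000.0],
})

# One indicator column per category level, dropping the first level ("Female")
# so the indicators are not perfectly collinear
encoded = pd.get_dummies(toy, columns=["Gender"], drop_first=True, dtype=int)

print(encoded.columns.tolist())
print(encoded.shape)
```

Dropping one level per categorical feature matters most for the linear model, where a full set of indicators would be linearly dependent.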


In [6]:
# check for correlation
correlation = new_df.corr()
# correlation['RiskScore'].sort_values(ascending=False)
correlation

Feature,SumInsured,TotalPremium,TotalClaims,CalculatedPremiumPerTerm,Profit,Premium_per_claims,RiskScore,ClaimSeverity,Gender_Male,Gender_Not specified,...,CoverType_Keys and Alarms,CoverType_Own Damage,CoverType_Passenger Liability,CoverType_Roadside Assistance,CoverType_Signage and Vehicle Wraps,CoverType_Standalone passenger liability,CoverType_Third Party,CoverType_Third Party Only,CoverType_Trailer,CoverType_Windscreen
SumInsured,1.0,-0.080085,-0.006895,-0.12001,0.004587,-0.079582,-0.004284,-0.008287,-0.005401,0.005304,...,-0.134079,-0.089055,0.988368,-0.024277,-0.13447,0.068668,-0.022672,0.004778,-0.003448,-0.135544
TotalPremium,-0.080085,1.0,0.008409,0.470701,0.020404,0.996549,0.00145,-0.003039,-0.029113,0.031512,...,-0.13111,0.069974,-0.131555,-0.016985,-0.136302,0.004238,0.492159,0.009204,-0.001756,-0.079921
TotalClaims,-0.006895,0.008409,1.0,0.026253,-0.999585,-0.011318,0.934181,0.025781,-0.002851,0.00302,...,-0.009181,0.072017,-0.009219,-0.001626,-0.009191,-0.000637,-0.007004,-0.000217,-0.000238,-0.00742
CalculatedPremiumPerTerm,-0.12001,0.470701,0.026253,1.0,-0.012687,0.468928,0.013761,-0.007796,-0.003271,0.004538,...,-0.176377,0.469457,-0.178633,-0.022012,-0.181482,-0.004011,0.442096,0.003789,-0.002609,-0.126958
Profit,0.004587,0.020404,-0.999585,-0.012687,1.0,0.040028,-0.933977,-0.025864,0.002011,-0.002111,...,0.005402,-0.069988,0.005427,0.001137,0.005263,0.000759,0.021183,0.000482,0.000187,0.005116
Premium_per_claims,-0.079582,0.996549,-0.011318,0.468928,0.040028,1.0,-0.008382,-0.008647,-0.0291,0.031543,...,-0.130729,0.066811,-0.131174,-0.016905,-0.135935,0.004299,0.493691,0.009244,-0.001742,-0.080188
RiskScore,-0.004284,0.00145,0.934181,0.013761,-0.933977,-0.008382,1.0,-0.000416,-0.002404,0.002687,...,-0.006807,0.055258,-0.006827,-0.001222,-0.006807,-0.000471,-0.003084,-0.00016,-0.000176,-0.006821
ClaimSeverity,-0.008287,-0.003039,0.025781,-0.007796,-0.025864,-0.008647,-0.000416,1.0,0.001095,-0.001884,...,-0.007022,-0.007044,-0.007043,-0.000368,-0.007022,-0.000486,-0.007038,-0.000165,-0.000182,0.060957
Gender_Male,-0.005401,-0.029113,-0.002851,-0.003271,0.002011,-0.0291,-0.002404,0.001095,1.0,-0.926053,...,0.000295,-7.6e-05,-9.2e-05,0.067953,0.000295,0.003278,8e-06,-0.001696,-0.001863,2.3e-05
Gender_Not specified,0.005304,0.031512,0.00302,0.004538,-0.002111,0.031543,0.002687,-0.001884,-0.926053,1.0,...,-0.001166,-0.000758,-0.000745,-0.077774,-0.001166,-0.003115,-0.000853,0.001832,0.002012,-0.00087


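**Target feature: RiskScore**
- RiskScore has a strong positive correlation with TotalClaims (0.93) and a strong negative correlation with Profit (-0.93). Both are expected, since RiskScore is computed from TotalClaims and Profit is TotalPremium minus TotalClaims. It shows no significant correlation with the other features.

- Profit is almost perfectly negatively correlated with TotalClaims (-0.9996) but, surprisingly, is nearly uncorrelated with TotalPremium (0.02).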
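
The ranking of features by their correlation with the target can be sketched as follows. The data is synthetic (the distributions are arbitrary assumptions), but the target is built the same way as in this notebook, which shows why a feature used to construct the target ends up at the top of the list.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Synthetic, independent features (distributions chosen only for illustration)
toy = pd.DataFrame({
    "TotalClaims": rng.exponential(1000.0, n),
    "SumInsured": rng.uniform(1e4, 1e5, n),
    "TotalPremium": rng.uniform(50.0, 500.0, n),
})
toy["RiskScore"] = toy["TotalClaims"] * toy["SumInsured"]

# Correlation of every feature with the target, ordered by absolute strength
corr = toy.corr()["RiskScore"].drop("RiskScore")
order = corr.abs().sort_values(ascending=False).index
target_corr = corr.loc[order]

print(target_corr)
```

Because RiskScore is a function of TotalClaims and SumInsured, keeping those columns (or Profit, which is derived from TotalClaims) as predictors leaks the target into the inputs. Which columns the `EDA` model methods actually use is not shown here.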

## Model Building

In [7]:
# Linear model
print('Linear Model')
eda.linear_model()

Linear Model
Mean: 5.0416483077470125e+17
R2: 0.499795575416775
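
The internals of `eda.linear_model()` are not shown. A minimal sketch of a comparable baseline, assuming scikit-learn and a simple hold-out split, is given below on synthetic data. The output label "Mean" above is presumably an error metric such as mean squared error, but the notebook does not say so; the sketch reports MSE explicitly.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic regression problem: linear signal plus a little noise
X = rng.normal(size=(500, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=500)

# Hold out 20% of the rows for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MSE: {mse:.4f}")
print(f"R2: {r2:.4f}")
```

An error on the order of 1e17, as reported above, is plausible when the target is a product of two monetary amounts; scaling or log-transforming the target may make the metric easier to interpret.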


In [8]:
# Random Forest
print('RandomForest')
eda.ensemble_model()

RandomForest
Fitting 3 folds for each of 10 candidates, totalling 30 fits
Mean: 7.598345059458892e+17
R2: -0.0019911019279070796
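
The log line "Fitting 3 folds for each of 10 candidates" is consistent with a randomized hyperparameter search using 10 sampled candidates and 3-fold cross-validation. The sketch below assumes scikit-learn's `RandomizedSearchCV`; the parameter grid and the synthetic data are illustrative guesses, not the settings used by `eda.ensemble_model()`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.default_rng(0)

# Small synthetic regression problem with a non-linear term
X = rng.normal(size=(300, 4))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.1, size=300)

# Hypothetical search space: 3 x 3 x 3 = 27 possible combinations
param_distributions = {
    "n_estimators": [10, 20, 40],
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 2, 4],
}

# Sample 10 of the 27 combinations and score each with 3-fold CV (30 fits)
search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=10,
    cv=3,
    scoring="r2",
    random_state=0,
)
search.fit(X, y)

print(len(search.cv_results_["params"]))
print(search.best_params_)
```

With only 10 candidates, a poor result can reflect the search space as much as the model, so it can be worth comparing against a forest with default settings.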


In [9]:
# Xgboost model
print('xgboost')
eda.xgboost_model()

xgboost
Fitting 3 folds for each of 81 candidates, totalling 243 fits
Mean: 9.030364935490618e+17
R2: -0.42043736027503753
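
The task list asks for a comparison of model performance. The sketch below collects the figures printed above into one table and ranks the models by R2. The column name "Mean" is kept as printed, since the notebook does not state which metric it is. The extra column relies on the standard definition of R2, under which a value at or below zero means the model does no better than always predicting the mean of the target.

```python
import pandas as pd

# Metrics copied from the cell outputs above
results = pd.DataFrame(
    {
        "model": ["Linear", "RandomForest", "XGBoost"],
        "Mean": [5.0416483077470125e17, 7.598345059458892e17, 9.030364935490618e17],
        "R2": [0.499795575416775, -0.0019911019279070796, -0.42043736027503753],
    }
).set_index("model")

# R2 > 0 means the model explains some variance beyond a constant mean prediction
results["beats_mean_baseline"] = results["R2"] > 0

ranked = results.sort_values("R2", ascending=False)
print(ranked)
```

On these numbers the linear model is the only one that improves on the mean baseline, while both tuned tree ensembles fall at or below it.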
