# STATISTICAL MODELLING

### Assume that the claims count random variable N has a Poisson distribution with given years at risk v > 0 and expected frequency λ > 0, so that E[N] = λv. We aim to model the expected frequency λ so that it captures structural differences (heterogeneity) and systematic effects between insurance policies and risks.
### v measures the volume of the portfolio. By the aggregation property, the total claims count of a portfolio of independent Poisson risks is again Poisson distributed, with the volume-weighted expected frequency.
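
As a small numerical illustration of the aggregation property, the sketch below uses three made-up policies (the volumes and frequencies are illustrative and not taken from the data). It computes the volume-weighted frequency and checks by simulation that the aggregated count has mean and variance close to λv.

```python
import numpy as np

# Hypothetical policies: years at risk v_i and expected frequencies lambda_i
v = np.array([0.5, 1.0, 0.25])
lam = np.array([0.10, 0.20, 0.05])

v_total = v.sum()
# Volume-weighted expected frequency of the aggregated portfolio
lam_agg = (v * lam).sum() / v_total

# Simulate independent Poisson counts per policy and aggregate them
rng = np.random.default_rng(0)
n_total = rng.poisson(lam * v, size=(200_000, 3)).sum(axis=1)

print(lam_agg)                        # aggregated frequency
print(n_total.mean(), n_total.var())  # both should be close to lam_agg * v_total
```

For a Poisson variable the mean and the variance coincide, which is why both printed sample moments are compared with the same target.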

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from patsy import dmatrices
import seaborn as sns
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
import statsmodels.api as sm
from statsmodels.tools.eval_measures import rmse

In [2]:
claimsdf = pd.read_csv('/home/julian/Cursos/Ironhack/Proyectos/ProyectoFinal/Claims-Frequency-Predictions/Notebooks/claimsdf_1.csv')

In [3]:
claimsdf.head()

Unnamed: 0,ClaimNb,Exposure,Area,BonusMalus,VehBrand,Region,empirical_frequencies,VehGas_Regular,VehPower_,VehAge_,DrivAge_,log_density
0,1,0.1,4,50,9,1,10.0,1,5,1,6,7.104144
1,1,0.77,4,50,9,1,1.298701,1,5,1,6,7.104144
2,1,0.75,2,50,9,5,1.333333,0,6,2,6,3.988984
3,1,0.09,2,50,9,7,11.111111,0,7,1,5,4.330733
4,1,0.84,2,50,9,7,1.190476,0,7,1,5,4.330733


In [4]:
claimsdf.drop(columns=['empirical_frequencies'], inplace=True)

## Poisson-GENERALIZED LINEAR MODEL

- The feature components interact multiplicatively in our Poisson GLM because of the log link. One of the main tasks is to analyze whether this multiplicative structure is appropriate. As we saw in the EDA, the frequencies are non-linearly related to Vehicle Age and Driver Age, so for the GLM we partition these variables into classes and treat them as categorical.
- We consider 3 continuous feature components (Area, BonusMalus, log-Density), 1 binary feature component (VehGas) and 5 dummy-encoded categorical feature components (VehPower, VehAge, DrivAge, VehBrand, Region), which gives a feature space dimension q = 3 + 1 + 8 + 2 + 7 + 10 + 11 = 42.

In [5]:
claimsdf = pd.get_dummies(claimsdf, columns=['VehPower_', 'VehAge_', 'DrivAge_', 'VehBrand', 'Region'], drop_first=True)

In [6]:
claimsdf.head(2)

Unnamed: 0,ClaimNb,Exposure,Area,BonusMalus,VehGas_Regular,log_density,VehPower__5,VehPower__6,VehPower__7,VehPower__8,...,Region_4,Region_5,Region_6,Region_7,Region_8,Region_9,Region_10,Region_11,Region_12,Region_13
0,1,0.1,4,50,1,7.104144,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0.77,4,50,1,7.104144,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0


- REFERENCE LEVELS (the levels absorbed by the intercept β0): `VehPower__4`, `VehAge__1`, `DrivAge__1`, `VehBrand_1`, `Region_1`
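
To make the role of the reference level concrete, here is a minimal sketch on a toy column (the values are illustrative only). With `drop_first=True`, `pd.get_dummies` omits the first level, so a row at that level has zeros in all dummy columns and is described by the intercept alone.

```python
import pandas as pd

# Toy column with three levels; the values are illustrative only
toy = pd.DataFrame({"VehAge_": [1, 2, 3, 2]})

# drop_first=True drops the first (smallest) level, which becomes the reference
encoded = pd.get_dummies(toy, columns=["VehAge_"], drop_first=True)

print(list(encoded.columns))  # dummy columns for levels 2 and 3 only
print(encoded.astype(int))
```

A categorical feature with k levels therefore contributes k - 1 columns to the design matrix.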

### MODEL TRAINING

- Maximizing the log-likelihood in the parameter β is equivalent to minimizing the deviance loss in β. In this spirit, the deviance loss plays the role of the canonical objective function to be minimized. For the Poisson model this objective is the scaled Poisson deviance loss.

- We randomly select roughly 80% of the data for training and leave the rest for testing (each row is assigned to the training set independently with probability 0.8):

In [7]:
sample = np.random.rand(len(claimsdf)) < 0.8

In [8]:
claimsdf_train = claimsdf[sample]

In [9]:
claimsdf_train.shape

(542397, 45)

In [10]:
claimsdf_test = claimsdf[~sample]

In [11]:
claimsdf_test.shape

(135616, 45)
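
The split above draws from NumPy's global random state, so the exact train and test rows change on every run. A minimal sketch of a repeatable variant follows; the seed and the row count are hypothetical choices made only for this example.

```python
import numpy as np

# Hypothetical seed and row count; any fixed seed makes the split repeatable
rng = np.random.default_rng(42)
n_rows = 1_000

# Each row goes to the training set independently with probability 0.8
mask = rng.random(n_rows) < 0.8

train_idx = np.flatnonzero(mask)
test_idx = np.flatnonzero(~mask)
print(len(train_idx), len(test_idx))
```

With the notebook's data frame, such a mask would be used in the same way as `sample`, that is `claimsdf[mask]` and `claimsdf[~mask]`.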

In [12]:
expr = """ Q('ClaimNb') ~ Q('Area') + Q('BonusMalus') + Q('log_density') + Q('VehGas_Regular') + 
                          Q('VehPower__5') + Q('VehPower__6') + Q('VehPower__7') + Q('VehPower__8') +
                          Q('VehPower__9') + Q('VehPower__10') + Q('VehPower__11') + Q('VehPower__12') + 
                          Q('VehAge__2') + Q('VehAge__3') + Q('DrivAge__2') + Q('DrivAge__3') + 
                          Q('DrivAge__4') + Q('DrivAge__5') + Q('DrivAge__6') + Q('DrivAge__7') + Q('DrivAge__8') +
                          Q('VehBrand_2') + Q('VehBrand_3') + Q('VehBrand_4') + Q('VehBrand_5') + Q('VehBrand_6') +
                          Q('VehBrand_7') + Q('VehBrand_8') + Q('VehBrand_9') + Q('VehBrand_10') + Q('VehBrand_11') +
                          Q('Region_2') + Q('Region_3') + Q('Region_4') + Q('Region_5') + Q('Region_6') + 
                          Q('Region_7') + Q('Region_8') + Q('Region_9') + Q('Region_10') + Q('Region_11') +
                          Q('Region_12') """
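
A long formula like the one above can also be assembled from a list of column names, which avoids typing mistakes. This is only a sketch: the feature list below is a short hypothetical subset, and `expr_built` is a name introduced for this example.

```python
# Hypothetical subset of column names, for illustration only
response = "ClaimNb"
features = ["Area", "BonusMalus", "log_density", "VehPower__5", "VehPower__6"]

# Q('...') quotes each name so that patsy treats it literally
rhs = " + ".join(f"Q('{name}')" for name in features)
expr_built = f"Q('{response}') ~ {rhs}"
print(expr_built)
```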

- Build the matrices for the specified model:

In [13]:
y_train, X_train = dmatrices(expr, claimsdf_train, return_type='dataframe')

- Fit a Poisson-GLM to the TRAIN SET

In [14]:
poisson_model1 = sm.GLM(y_train, X_train, exposure=claimsdf_train.Exposure, family=sm.families.Poisson()).fit()
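
To show how the exposure enters the fit, the sketch below implements iteratively reweighted least squares for a Poisson model with log link in plain NumPy, with log(exposure) as a fixed offset. It is an illustration on synthetic data with made-up coefficients, not statsmodels' implementation; `fit_poisson_irls` is a hypothetical helper.

```python
import numpy as np

def fit_poisson_irls(X, y, exposure, n_iter=50, tol=1e-10):
    """Poisson regression with log link and log(exposure) as a fixed offset."""
    offset = np.log(exposure)
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        lin = X @ beta
        mu = np.exp(lin + offset)   # expected counts = exposure * frequency
        z = lin + (y - mu) / mu     # working response (offset removed)
        XtW = X.T * mu              # IRLS weights equal mu for the log link
        beta_new = np.linalg.solve(XtW @ X, XtW @ z)
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return beta

# Synthetic data with made-up coefficients
rng = np.random.default_rng(1)
n = 20_000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
exposure = rng.uniform(0.1, 1.0, size=n)
beta_true = np.array([-2.0, 0.3])
y = rng.poisson(exposure * np.exp(X @ beta_true))

beta_hat = fit_poisson_irls(X, y, exposure)
mu_hat = exposure * np.exp(X @ beta_hat)
print(beta_hat)
print(y.sum(), mu_hat.sum())  # equal at the optimum because of the intercept
```

The last line illustrates the balance property of a Poisson GLM with log link and an intercept: the fitted expected counts sum to the observed number of claims.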

In [15]:
print(poisson_model1.summary())

                 Generalized Linear Model Regression Results                  
Dep. Variable:           Q('ClaimNb')   No. Observations:               542397
Model:                            GLM   Df Residuals:                   542354
Model Family:                 Poisson   Df Model:                           42
Link Function:                    log   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:            -1.1384e+05
Date:                Mon, 04 Oct 2021   Deviance:                   1.7232e+05
Time:                        18:17:52   Pearson chi2:                 1.32e+06
No. Iterations:                     7                                         
Covariance Type:            nonrobust                                         
                          coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------------
Intercept              -3.4260    

- From the results of `poisson_model1` we can see that the variables `Area`, `VehPower__12`, `VehBrand_2`, `VehBrand_3`, `VehBrand_4`, `VehBrand_6`, `VehBrand_7`, `VehBrand_8`, `VehBrand_10`, `VehBrand_11`, `Region_10`, `Region_12` have a p-value above 0.05 (for an assumed α = 0.05). We should therefore reconsider their inclusion in the model, because they do not appear to be significant for frequency modelling.
- As we saw in the correlation analysis, `Area` is highly correlated with `Density`, which might explain why `Area` is not significant.
- Of the 11 `VehBrand` classes, 8 are non-significant (relative to the reference class `VehBrand_1`).

#### DISPERSION:

- The deviance statistics should account for potential over- or under-dispersion (φ ≠ 1). In the Poisson model the variance equals the mean by definition (φ = 1). We can estimate φ empirically with Pearson’s (distribution-free) dispersion estimate and with the deviance dispersion estimate.
- The `Scale` of 1.0000 reported for `poisson_model1` is not an empirical estimate: statsmodels fixes the scale at 1 for the Poisson family unless told otherwise, so it cannot by itself show that the mean and the variance are equal. Pearson's dispersion estimate is the Pearson chi2 statistic divided by the residual degrees of freedom. With the values in the summary (1.32e+06 / 542354) it is about 2.4, which is above 1 and suggests over-dispersion rather than ruling it out.

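- DEVIANCE DISPERSION:

In [16]:
# Note: the divisor 542395 is hard-coded; the summary reports Df Residuals = 542354 (poisson_model1.df_resid)
sum((poisson_model1.resid_deviance) ** 2) / 542395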

0.31770810751737816

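- PEARSON DISPERSION:

In [17]:
# Note: same hard-coded divisor as above; the residual degrees of freedom are poisson_model1.df_resid
(sum((poisson_model1.resid_pearson) ** 2)) / 542395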

2.4275883196464054
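
The two dispersion estimates can be written out directly, dividing by the residual degrees of freedom (number of observations minus number of fitted parameters). The sketch below is a NumPy illustration on synthetic counts that are exactly Poisson with a low mean; `dispersion_estimates` is a hypothetical helper and the mean of 0.1 is a made-up value.

```python
import numpy as np

def dispersion_estimates(y, mu, n_params):
    """Return (Pearson, deviance) dispersion estimates for a Poisson fit."""
    df_resid = len(y) - n_params
    pearson = np.sum((y - mu) ** 2 / mu) / df_resid
    # Unit deviance; the y*log(y/mu) term is taken as 0 when y == 0
    ylogy = np.where(y > 0, y * np.log(np.where(y > 0, y, 1.0) / mu), 0.0)
    deviance = np.sum(2.0 * (ylogy - (y - mu))) / df_resid
    return pearson, deviance

# Synthetic, genuinely Poisson counts with a low mean (illustrative values)
rng = np.random.default_rng(0)
y = rng.poisson(0.1, size=200_000)

# Intercept-only fit: the fitted mean is the sample mean (one parameter)
mu_hat = np.full(y.shape, y.mean())

phi_pearson, phi_deviance = dispersion_estimates(y, mu_hat, n_params=1)
print(phi_pearson, phi_deviance)
```

In this synthetic case the Pearson estimate comes out close to 1 while the deviance estimate is well below 1, even though the data are exactly Poisson. For low claim frequencies a deviance dispersion below 1 therefore need not indicate under-dispersion. For a fitted statsmodels GLM, the same quantities should be available as `pearson_chi2 / df_resid` and `deviance / df_resid` on the results object.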

#### TRAINING DEVIANCE-LOSS:

In [18]:
loss_train = sum((poisson_model1.resid_deviance) ** 2) / X_train.shape[0]
loss_train

0.31770693602082667
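
For reference, the quantity computed above can be written explicitly. The indexing by policies i = 1, ..., n is notation introduced here; N_i, v_i and λ_i are the claims count, the exposure and the fitted frequency of policy i. The sum of squared deviance residuals is the scaled Poisson deviance, and the training loss is its average:

```latex
D^*(\boldsymbol{N}, \boldsymbol{\lambda})
  = 2 \sum_{i=1}^{n} \left[ \lambda_i v_i - N_i
      - N_i \log\!\left( \frac{\lambda_i v_i}{N_i} \right) \right],
\qquad
\mathcal{L}_{\text{train}} = \frac{1}{n}\, D^*(\boldsymbol{N}, \boldsymbol{\lambda}),
```

where the term N_i log(λ_i v_i / N_i) is set to zero when N_i = 0.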

#### AKAIKE INFORMATION CRITERION

In [19]:
poisson_model1.aic

227773.02917851752

#### X2-STATISTIC: NULL DEVIANCE - RESIDUAL DEVIANCE:

In [20]:
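# The residual deviance is the sum of squared deviance residuals (poisson_model1.deviance);
# summing the unsquared residuals with sum(poisson_model1.resid_deviance) does not give it.
x2_statistic = poisson_model1.null_deviance - poisson_model1.deviance
x2_statistic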

#### ROOT MEAN SQUARE ERROR:

In [59]:
"""y_test, X_test = dmatrices(expr, claimsdf_test, return_type='dataframe')
poisson_model1_pred = poisson_model1.predict(X_test)
pred_test1=poisson_model1_pred[:13562]
y_test1=y_test[:13562]"""
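
When the RMSE is computed on claim counts, the predictions have to be on the same scale as the observations. If the predictions are frequencies per unit exposure, as would be the case when `predict` is called without an `exposure` argument (an assumption worth checking against the statsmodels documentation), they need to be multiplied by the exposure first. A minimal sketch with made-up numbers:

```python
import numpy as np

# Made-up observations, fitted frequencies and exposures
y = np.array([0, 1, 0, 2])
rate = np.array([0.1, 0.2, 0.1, 0.4])       # expected claims per year at risk
exposure = np.array([1.0, 0.5, 0.25, 1.0])  # years at risk

pred_counts = rate * exposure               # expected claim counts
rmse_counts = np.sqrt(np.mean((y - pred_counts) ** 2))
rmse_rates = np.sqrt(np.mean((y - rate) ** 2))  # ignores exposure
print(rmse_counts, rmse_rates)
```

The two values differ, so it matters which scale the predictions are on before they are compared with `ClaimNb`.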

## Zero Inflated Poisson-GENERALIZED LINEAR MODEL

In [78]:
zip_poisson_model = sm.ZeroInflatedPoisson(endog=y_train, exog=X_train, exog_infl=X_train, exposure=claimsdf_train.Exposure, inflation='logit').fit()



         Current function value: 0.209716
         Iterations: 35
         Function evaluations: 41
         Gradient evaluations: 41




In [79]:
print(zip_poisson_model.summary())

                     ZeroInflatedPoisson Regression Results                    
Dep. Variable:            Q('ClaimNb')   No. Observations:               542397
Model:             ZeroInflatedPoisson   Df Residuals:                   542354
Method:                            MLE   Df Model:                           42
Date:                 Mon, 04 Oct 2021   Pseudo R-squ.:                 0.02549
Time:                         18:50:51   Log-Likelihood:            -1.1375e+05
converged:                       False   LL-Null:                   -1.1672e+05
Covariance Type:             nonrobust   LLR p-value:                     0.000
                                  coef    std err          z      P>|z|      [0.025      0.975]
-----------------------------------------------------------------------------------------------
inflate_Intercept               0.3933      0.270      1.457      0.145      -0.136       0.922
inflate_Q('Area')              -0.0371      0.046     -0.808      0.419 

'Region_11'
'Region_10'
'Region_9'
'Region_6'
'Region_3'
'Region_2'
'VehBrand_11'
'VehBrand_10'
'VehBrand_8'
'VehBrand_7'
'VehBrand_5'
'VehBrand_4'
'VehBrand_3'
'DrivAge__8'
'DrivAge__7'
'DrivAge__6'
'DrivAge__5'
'VehPower__12'
'VehPower__11'
'VehPower__7'

inflate_Intercept               0.3933      0.270      1.457      0.145      -0.136       0.922
inflate_Q('Area')
inflate_Q('log_density')  
inflate_Q('VehPower__5')        0.0736      0.055      1.328      0.184      -0.035       0.182
inflate_Q('VehPower__6')        0.0469      0.055      0.847      0.397      -0.062       0.156
inflate_Q('VehPower__7')  
inflate_Q('VehPower__9')        0.0524      0.079      0.661      0.508      -0.103       0.208
inflate_Q('VehPower__10')       0.0615      0.081      0.754      0.451      -0.098       0.221
inflate_Q('VehPower__11')       0.0946      0.102      0.927      0.354      -0.105       0.295
inflate_Q('VehPower__12') 
inflate_Q('DrivAge__2')         0.1541      0.234      0.658      0.510      -0.305       0.613
inflate_Q('DrivAge__3')         0.2614      0.212      1.235      0.217      -0.153       0.676
inflate_Q('DrivAge__4')         0.2419      0.210      1.153      0.249      -0.169       0.653
inflate_Q('DrivAge__5')         0.0641      0.212      0.303      0.762      -0.350       0.479
inflate_Q('DrivAge__6')         0.1122      0.212      0.529      0.597      -0.303       0.528
inflate_Q('DrivAge__7')         0.1224      0.214      0.571      0.568      -0.298       0.542
inflate_Q('DrivAge__8') 
inflate_Q('VehBrand_3')         0.1055      0.063      1.681      0.093      -0.018       0.229
inflate_Q('VehBrand_4')         0.1191      0.085      1.408      0.159      -0.047       0.285
inflate_Q('VehBrand_5')         0.0800      0.070      1.138      0.255      -0.058       0.218
inflate_Q('VehBrand_6')         0.1311      0.080      1.636      0.102      -0.026       0.288
inflate_Q('VehBrand_7')         0.0947      0.098      0.966      0.334      -0.097       0.287
inflate_Q('VehBrand_8')         0.0838      0.109      0.768      0.442      -0.130       0.297
inflate_Q('VehBrand_9')         0.0583      0.058      0.999      0.318      -0.056       0.173
inflate_Q('VehBrand_10')        0.0969      0.111      0.870      0.385      -0.121       0.315
inflate_Q('VehBrand_11')        0.1157      0.207      0.559      0.576      -0.290       0.521
inflate_Q('Region_2')           0.1213      0.130      0.931      0.352      -0.134       0.377
inflate_Q('Region_3')           0.0643      0.048      1.341      0.180      -0.030       0.158
inflate_Q('Region_4')           0.1576      0.096      1.642      0.101      -0.031       0.346
inflate_Q('Region_5')           0.1265      0.081      1.562      0.118      -0.032       0.285
inflate_Q('Region_6')    

inflate_Q('Region_9')           0.0747      0.063      1.194      0.232      -0.048       0.197
inflate_Q('Region_10')          0.0951      0.205      0.464      0.642      -0.306       0.496
inflate_Q('Region_11')


- From the results of `zip_poisson_model` we can see that the variables `Area`, `VehPower__12`, `VehBrand_2`, `VehBrand_3`, `VehBrand_4`, `VehBrand_6`, `VehBrand_7`, `VehBrand_8`, `VehBrand_10`, `VehBrand_11`, `Region_10`, `Region_12` have a p-value above 0.05 (for an assumed α = 0.05). We should therefore reconsider their inclusion in the model, because they do not appear to be significant for frequency modelling.
- The summary reports `converged: False`, so these estimates and p-values should be read with caution.
- As we saw in the correlation analysis, `Area` is highly correlated with `Density`, which might explain why `Area` is not significant.
- Of the 11 `VehBrand` classes, 8 are non-significant (relative to the reference class `VehBrand_1`).
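
The z statistic, p-value and confidence interval in each summary row follow from the coefficient and its standard error. The sketch below reproduces them for the `inflate_Intercept` row shown above, using the rounded values from the summary, so small differences in the last digit are expected.

```python
import math

# Values copied from the inflate_Intercept row of the summary above
coef, std_err = 0.3933, 0.270

z = coef / std_err
# Two-sided p-value under a standard normal reference distribution
p_value = math.erfc(abs(z) / math.sqrt(2))
# 95% confidence interval; 1.96 is the usual normal quantile
ci_low, ci_high = coef - 1.96 * std_err, coef + 1.96 * std_err

print(round(z, 3), round(p_value, 3), round(ci_low, 3), round(ci_high, 3))
```

A variable is flagged as non-significant at α = 0.05 exactly when this interval contains zero.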

In [81]:
zip_poisson_model.aic

227584.9486195171
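
The AIC is 2k - 2ℓ, where k is the number of estimated parameters and ℓ the log-likelihood, so its value depends on how parameters are counted. The sketch below works only with the numbers reported above. It assumes that both reported AIC values use k = Df Model + 1 = 43, which is consistent with the rounded log-likelihoods in the two summaries, and then recomputes the ZIP value counting the inflation coefficients as well.

```python
def aic(loglik, n_params):
    """Akaike information criterion: 2k - 2*loglik."""
    return 2 * n_params - 2 * loglik

aic_poisson = 227773.02917851752   # reported for poisson_model1
aic_zip = 227584.9486195171        # reported for zip_poisson_model

# Assumption: both reported values count Df Model + 1 = 43 parameters
k_reported = 43
loglik_poisson = (2 * k_reported - aic_poisson) / 2
loglik_zip = (2 * k_reported - aic_zip) / 2
print(loglik_poisson, loglik_zip)

# The ZIP model above has a count part and an inflation part of 43 columns each
k_zip_full = 2 * 43
aic_zip_full = aic(loglik_zip, k_zip_full)
print(aic_zip_full, aic_zip_full < aic_poisson)
```

Under this assumption the ZIP model still has the lower AIC after counting all 86 coefficients, but the margin shrinks, and the comparison inherits the caveat that the ZIP fit did not converge.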

### MODEL SELECTION

- The likelihood ratio test based on the Poisson deviance can be applied recursively to a sequence of nested models. This leads to a step-wise reduction of model complexity, similar in spirit to the analysis of variance (ANOVA) in Listing 2.7, and is often referred to as backward model selection.
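
A single step of such a test can be sketched on synthetic data. The example compares an intercept-only Poisson model with a model that adds one binary feature; both have closed-form maximum likelihood estimates, so no fitting routine is needed. The data, the frequencies and the helper `poisson_deviance` are made up for this illustration. For one dropped parameter the statistic is compared with a chi-square distribution with 1 degree of freedom, whose tail probability equals erfc(sqrt(x / 2)).

```python
import math
import numpy as np

rng = np.random.default_rng(0)
n = 50_000
group = rng.integers(0, 2, size=n)            # hypothetical binary feature
exposure = rng.uniform(0.1, 1.0, size=n)
true_rate = np.where(group == 1, 0.10, 0.05)  # made-up frequencies
y = rng.poisson(exposure * true_rate)

def poisson_deviance(y, mu):
    # y*log(y/mu) is taken as 0 when y == 0
    ylogy = np.where(y > 0, y * np.log(np.where(y > 0, y, 1.0) / mu), 0.0)
    return 2.0 * np.sum(ylogy - (y - mu))

# Reduced model: one common frequency (intercept only)
mu_reduced = exposure * (y.sum() / exposure.sum())
# Full model: one frequency per group (intercept + one dummy)
rate_by_group = np.array(
    [y[group == g].sum() / exposure[group == g].sum() for g in (0, 1)]
)
mu_full = exposure * rate_by_group[group]

lr_stat = poisson_deviance(y, mu_reduced) - poisson_deviance(y, mu_full)
# One extra parameter, so chi-square with 1 degree of freedom
p_value = math.erfc(math.sqrt(lr_stat / 2))
print(lr_stat, p_value)
```

A small p-value means the reduced model fits significantly worse, so the feature is kept. In a backward selection the same comparison is repeated after each removal; when more than one parameter is dropped at once, the reference distribution has as many degrees of freedom as parameters removed.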