# STATISTICAL MODELLING

### Assume that the claims count random variable N has a Poisson distribution with given years at risk v > 0 and expected frequency λ > 0. We aim to model the expected frequency λ so that it can incorporate structural differences (heterogeneity), or systematic effects, between different insurance policies and risks.
### v measures the volume of the aggregated portfolio. The aggregation property says that the aggregated portfolio again has a Poisson distribution, with a volume-weighted expected frequency.
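
The aggregation property can be checked numerically. The sketch below uses three made-up policies (the volumes and frequencies are illustrative assumptions, not values from the data set) and compares the simulated aggregate claims count with the volume-weighted expected frequency:

```python
import numpy as np

# Hypothetical portfolio: three policies with their own volumes and frequencies.
v = np.array([0.5, 1.0, 2.5])        # years at risk per policy (made-up values)
lam = np.array([0.05, 0.10, 0.20])   # expected frequency per policy (made-up values)

# Volume-weighted expected frequency of the aggregated portfolio.
v_total = v.sum()
lam_portfolio = (v * lam).sum() / v_total

# Simulate independent Poisson counts per policy and add them up.
rng = np.random.default_rng(0)
n_sims = 200_000
N_total = rng.poisson(lam * v, size=(n_sims, len(v))).sum(axis=1)

print(lam_portfolio * v_total)          # theoretical mean of the aggregate count
print(N_total.mean(), N_total.var())    # for a Poisson variable both should be close to it
```

Mean and variance of the simulated aggregate agreeing with `lam_portfolio * v_total` is what one expects if the sum is again Poisson distributed.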

In [2]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from patsy import dmatrices
import seaborn as sns
from sklearn.model_selection import train_test_split
import statsmodels.api as sm
from statsmodels.tools.eval_measures import rmse

In [4]:
claimsdf = pd.read_csv('/home/julian/Cursos/Ironhack/Proyectos/ProyectoFinal/Claims-Frequency-Predictions/Notebooks/claimsdf_1.csv')

In [5]:
claimsdf.head()

Unnamed: 0,ClaimNb,Exposure,Area,BonusMalus,VehBrand,Region,empirical_frequencies,VehGas_Regular,VehPower_,VehAge_,DrivAge_,log_density
0,1,0.1,4,50,9,1,10.0,1,5,1,6,7.104144
1,1,0.77,4,50,9,1,1.298701,1,5,1,6,7.104144
2,1,0.75,2,50,9,5,1.333333,0,6,2,6,3.988984
3,1,0.09,2,50,9,7,11.111111,0,7,1,5,4.330733
4,1,0.84,2,50,9,7,1.190476,0,7,1,5,4.330733


In [6]:
claimsdf.drop(columns=['empirical_frequencies'], inplace=True)

## Poisson-GENERALIZED LINEAR MODEL

- The feature components interact multiplicatively in our Poisson GLM, so one of the main tasks is to check whether this multiplicative structure is appropriate. Because the frequencies are non-linearly related to Vehicle Age and Driver Age, as we saw in the EDA, we partition these two features into classes and treat them as categorical variables.
- We consider 3 continuous feature components (Area, BonusMalus, log-Density), 1 binary feature component (VehGas) and 5 categorical feature components (VehPower, VehAge, DrivAge, VehBrand, Region).
- We dummy-encode the categorical features, dropping one reference level per feature, so that the design matrix has full rank and the MLE for β is unique.
- In total, we get a model with 42 variables.
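
To make the multiplicative structure concrete: with a log link the linear predictor is additive, so the expected frequency factorizes into one factor per feature. A minimal sketch, where the coefficient values are hypothetical and not the fitted ones:

```python
import numpy as np

# Hypothetical coefficients on the log scale (not the fitted values).
beta0 = -3.0         # intercept: log frequency of the reference policy
beta_bonus = 0.02    # effect per BonusMalus point
beta_gas = 0.10      # effect of the VehGas_Regular dummy

bonus_malus, gas_regular = 60, 1

# Log link: the linear predictor is additive ...
log_freq = beta0 + beta_bonus * bonus_malus + beta_gas * gas_regular
freq_additive = np.exp(log_freq)

# ... so the expected frequency is a product of per-feature factors.
freq_product = (np.exp(beta0)
                * np.exp(beta_bonus * bonus_malus)
                * np.exp(beta_gas * gas_regular))

print(freq_additive, freq_product)
```

Each `exp(beta * x)` factor acts as a relativity applied to the base frequency `exp(beta0)`.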

In [18]:
claimsdf = pd.get_dummies(claimsdf, columns=['VehPower_', 'VehAge_', 'DrivAge_', 'VehBrand', 'Region'], drop_first=True)

In [19]:
claimsdf.head(2)

Unnamed: 0,ClaimNb,Exposure,Area,BonusMalus,VehGas_Regular,log_density,VehPower__5,VehPower__6,VehPower__7,VehPower__8,...,Region_4,Region_5,Region_6,Region_7,Region_8,Region_9,Region_10,Region_11,Region_12,Region_13
0,1,0.1,4,50,1,7.104144,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0.77,4,50,1,7.104144,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0


- REFERENCE LEVELS (the levels absorbed by the intercept b0): `VehPower__4`, `VehAge__1`, `DrivAge__1`, `VehBrand_1`, `Region_1.0`

### MODEL TRAINING

- Maximizing the log-likelihood over β is equivalent to minimizing the deviance loss over β. In this spirit, the deviance loss plays the role of the canonical objective function to be minimized; here it is the scaled Poisson deviance loss.

- We randomly (uniformly) select 80% of data for training and leave 20% for testing:

In [20]:
sample = np.random.rand(len(claimsdf)) < 0.8

In [21]:
claimsdf_train = claimsdf[sample]

In [22]:
claimsdf_train.shape

(541937, 45)

In [23]:
claimsdf_test = claimsdf[~sample]

In [24]:
claimsdf_test.shape

(136076, 45)
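
The mask above is drawn without a fixed seed, so the train/test split (and every number derived from it) changes between runs. One possible way to make it reproducible is sketched below; `split_mask`, its default seed and the toy frame are illustrative assumptions:

```python
import numpy as np
import pandas as pd

def split_mask(n_rows, train_frac=0.8, seed=42):
    """Boolean mask: True marks rows assigned to the training set."""
    rng = np.random.default_rng(seed)   # fixed seed -> same split on every run
    return rng.random(n_rows) < train_frac

# Toy stand-in for claimsdf, only to show the mechanics.
toy = pd.DataFrame({"ClaimNb": [0, 1, 0, 0, 2, 0, 0, 1, 0, 0]})
mask = split_mask(len(toy))
toy_train, toy_test = toy[mask], toy[~mask]

print(toy_train.shape, toy_test.shape)
```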

In [25]:
expr = """ Q('ClaimNb') ~ Q('Area') + Q('BonusMalus') + Q('log_density') + Q('VehGas_Regular') + 
                          Q('VehPower__5') + Q('VehPower__6') + Q('VehPower__7') + Q('VehPower__8') +
                          Q('VehPower__9') + Q('VehPower__10') + Q('VehPower__11') + Q('VehPower__12') + 
                          Q('VehAge__2') + Q('VehAge__3') + Q('DrivAge__2') + Q('DrivAge__3') + 
                          Q('DrivAge__4') + Q('DrivAge__5') + Q('DrivAge__6') + Q('DrivAge__7') + Q('DrivAge__8') +
                          Q('VehBrand_2') + Q('VehBrand_3') + Q('VehBrand_4') + Q('VehBrand_5') + Q('VehBrand_6') +
                          Q('VehBrand_7') + Q('VehBrand_8') + Q('VehBrand_9') + Q('VehBrand_10') + Q('VehBrand_11') +
                          Q('Region_2') + Q('Region_3') + Q('Region_4') + Q('Region_5') + Q('Region_6') + 
                          Q('Region_7') + Q('Region_8') + Q('Region_9') + Q('Region_10') + Q('Region_11') +
                          Q('Region_12') """

- Build the matrices for the specified model:

In [27]:
y_train, X_train = dmatrices(expr, claimsdf_train, return_type='dataframe')

- Fit a Poisson-GLM to the TRAIN SET

In [14]:
poisson_model1 = sm.GLM(y_train, X_train, exposure=claimsdf_train.Exposure, family=sm.families.Poisson()).fit()

In [15]:
print(poisson_model1.summary())

                 Generalized Linear Model Regression Results                  
Dep. Variable:           Q('ClaimNb')   No. Observations:               542397
Model:                            GLM   Df Residuals:                   542354
Model Family:                 Poisson   Df Model:                           42
Link Function:                    log   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:            -1.1384e+05
Date:                Mon, 04 Oct 2021   Deviance:                   1.7232e+05
Time:                        18:17:52   Pearson chi2:                 1.32e+06
No. Iterations:                     7                                         
Covariance Type:            nonrobust                                         
                          coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------------
Intercept              -3.4260    

- From the results of `poisson_model1` we can see that the variables `Area`, `VehPower__12`, `VehBrand_2`, `VehBrand_3`, `VehBrand_4`, `VehBrand_6`, `VehBrand_7`, `VehBrand_8`, `VehBrand_10`, `VehBrand_11` and `Region_12` have a p-value larger than 0.05 (for an assumed alpha = 0.05), so we should reconsider their inclusion in the model, because they don't seem to be significant for frequency modelling.
- As we saw in the correlation analysis, `Area` is highly correlated with `Density`, which might explain why the former is not significant.
- Of the 11 `VehBrand` classes, 8 turned out to be non-significant.

#### DISPERSION:

- The deviance statistics should account for potential over- or under-dispersion (φ != 1). In the Poisson model the variance equals the mean by definition (φ = 1). We can estimate this parameter empirically with Pearson's (distribution-free) dispersion estimate and with the deviance dispersion estimate.
- The summary of `poisson_model1` reports `scale` = 1, but for the Poisson family statsmodels fixes the scale at 1 by default instead of estimating it, so this value by itself does not show that the mean and variance are equal. To judge over-dispersion we compute the deviance dispersion and the Pearson dispersion (sum of squared Pearson residuals / residual degrees of freedom) directly.

- DEVIANCE AND PEARSON DISPERSION:

In [16]:
sum((poisson_model1.resid_deviance) ** 2) / 542395

0.31770810751737816

In [17]:
(sum((poisson_model1.resid_pearson) ** 2)) / 542395

2.4275883196464054
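
Both estimates are usually normalized by the residual degrees of freedom (reported as `Df Residuals` in the summary, and available as `df_resid` on a fitted statsmodels GLM) rather than by a hard-coded count. A self-contained sketch on simulated data; the helper names are illustrative and the toy "fit" is an intercept-only model:

```python
import numpy as np

def poisson_deviance(y, mu):
    """Poisson deviance 2 * sum(y * log(y / mu) - (y - mu)), with y * log(y) = 0 at y = 0."""
    y = np.asarray(y, dtype=float)
    mu = np.asarray(mu, dtype=float)
    safe_y = np.where(y > 0, y, 1.0)   # avoids log(0); those terms are multiplied by y = 0
    return 2.0 * np.sum(y * np.log(safe_y / mu) - (y - mu))

def dispersion_estimates(y, mu, n_params):
    """Pearson and deviance dispersion, both divided by the residual degrees of freedom."""
    y = np.asarray(y, dtype=float)
    mu = np.asarray(mu, dtype=float)
    df_resid = len(y) - n_params
    pearson = np.sum((y - mu) ** 2 / mu) / df_resid
    deviance = poisson_deviance(y, mu) / df_resid
    return pearson, deviance

rng = np.random.default_rng(0)
y_toy = rng.poisson(2.0, size=50_000)
mu_toy = np.full(y_toy.shape, y_toy.mean())   # intercept-only fit: one parameter
pearson_phi, deviance_phi = dispersion_estimates(y_toy, mu_toy, n_params=1)
print(pearson_phi, deviance_phi)
```

For data that really are Poisson, the Pearson estimate should be close to 1; values clearly above 1 point towards over-dispersion.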

#### TRAINING DEVIANCE-LOSS:

In [18]:
loss_train = sum((poisson_model1.resid_deviance) ** 2) / X_train.shape[0]
loss_train

0.31770693602082667
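
For reference, the quantity computed above is the average Poisson deviance loss. A sketch of its usual form, writing $y_i$ for the observed claim counts, $v_i$ for the exposures and $x_i$ for the feature vectors (this notation is an assumption, the notebook does not fix one):

$$
\mu_i = v_i \exp\left(x_i^\top \beta\right), \qquad
D(y, \mu) = 2 \sum_{i=1}^{n} \left[ y_i \log\frac{y_i}{\mu_i} - (y_i - \mu_i) \right], \qquad
\mathcal{L}(\beta) = \frac{1}{n} D(y, \mu),
$$

with the convention $y_i \log y_i = 0$ when $y_i = 0$. Because $D$ equals twice the gap between the saturated log-likelihood and the model log-likelihood, and the saturated part does not depend on $\beta$, minimizing $D$ and maximizing the log-likelihood give the same $\beta$.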

#### AKAIKE INFORMATION CRITERION
- Akaike's information criterion (AIC) adds a penalty term for over-fitting to the in-sample fit, in order to mimic an out-of-sample loss.

In [19]:
poisson_model1.aic

227773.02917851752
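
As a sanity check, the AIC can be reproduced from its definition, 2k - 2·log-likelihood, where k counts the estimated parameters. The sketch below plugs in the rounded values printed in the summary (42 slopes plus the intercept, log-likelihood -1.1384e+05); `aic` is an illustrative helper, not a statsmodels function:

```python
def aic(log_likelihood, n_params):
    """AIC = 2 * (number of estimated parameters) - 2 * (maximized log-likelihood)."""
    return 2 * n_params - 2 * log_likelihood

# Rounded values from the summary: 42 slopes + 1 intercept, log-likelihood -1.1384e+05.
approx_aic = aic(-1.1384e05, 43)
print(approx_aic)
```

The result differs from the reported value only in the last digits, because the summary rounds the log-likelihood to five significant figures.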

#### X2-STATISTIC: NULL DEVIANCE - RESIDUAL DEVIANCE:

In [20]:
x2_statistic = (poisson_model1.null_deviance) - sum(poisson_model1.resid_deviance)
x2_statistic

274700.6579827171

#### ROOT MEAN SQUARE ERROR:

In [59]:
"""y_test, X_test = dmatrices(expr, claimsdf_test, return_type='dataframe')
poisson_model1_pred = poisson_model1.predict(X_test)
pred_test1=poisson_model1_pred[:13562]
y_test1=y_test[:13562]"""

## Zero Inflated Poisson-GENERALIZED LINEAR MODEL
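
As a quick reminder of the model fitted in this section: a zero-inflated Poisson mixes a point mass at zero (with probability ϕ, modelled by the logit part) with an ordinary Poisson count. A minimal sketch of the resulting probability mass function, with made-up parameter values; `zip_pmf` and its arguments are illustrative names, not statsmodels API:

```python
import math

def zip_pmf(k, infl_prob, mu):
    """P(N = k) for a zero-inflated Poisson with inflation probability infl_prob and Poisson mean mu."""
    poisson_k = math.exp(-mu) * mu ** k / math.factorial(k)
    if k == 0:
        # Zeros come either from the inflation part or from the Poisson part.
        return infl_prob + (1.0 - infl_prob) * poisson_k
    return (1.0 - infl_prob) * poisson_k

infl_prob, mu = 0.3, 0.1   # made-up values
p_zero = zip_pmf(0, infl_prob, mu)
total = sum(zip_pmf(k, infl_prob, mu) for k in range(50))
print(p_zero, total)
```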

In [None]:
zip_poisson_model = sm.ZeroInflatedPoisson(endog=y_train, exog=X_train, exog_infl=X_train, exposure=claimsdf_train.Exposure, inflation='logit').fit()

In [33]:
print(zip_poisson_model.summary())

                     ZeroInflatedPoisson Regression Results                    
Dep. Variable:            Q('ClaimNb')   No. Observations:               541937
Model:             ZeroInflatedPoisson   Df Residuals:                   541894
Method:                            MLE   Df Model:                           42
Date:                 Tue, 05 Oct 2021   Pseudo R-squ.:                 0.02579
Time:                         00:19:09   Log-Likelihood:            -1.1380e+05
converged:                       False   LL-Null:                   -1.1681e+05
Covariance Type:             nonrobust   LLR p-value:                     0.000
                                  coef    std err          z      P>|z|      [0.025      0.975]
-----------------------------------------------------------------------------------------------
inflate_Intercept               0.3936      0.239      1.650      0.099      -0.074       0.861
inflate_Q('Area')              -0.0195      0.041     -0.474      0.635 

- From the results of `zip_poisson_model` we can see that the following variables have a p-value larger than 0.05 (for an assumed alpha = 0.05), so we should reconsider their inclusion in the model:

- ZERO-INFLATION VARIABLES: the ones that the logistic regression part of the ZIP model didn't find useful for estimating the ϕ parameter:

  `Intercept`, `Area`, `log_density` 
  
  `VehPower__5`, `VehPower__6`, `VehPower__7`, `VehPower__9`, `VehPower__10`, `VehPower__11`, `VehPower__12`
  
  `DrivAge__2`, `DrivAge__3`, `DrivAge__4`, `DrivAge__5`, `DrivAge__6`, `DrivAge__7`, `DrivAge__8`
  
  `VehBrand_3`, `VehBrand_4`, `VehBrand_5`, `VehBrand_6`, `VehBrand_7`, `VehBrand_8`, `VehBrand_9`, `VehBrand_10`, 
  `VehBrand_11` 
  
  `Region_2`, `Region_3`, `Region_4`, `Region_5`, `Region_6`, `Region_9`, `Region_10`, `Region_11`
  

- COUNT-MODEL VARIABLES: the ones that are not significant in the Poisson part of the ZIP model, which estimates ClaimNb:

   `VehBrand_3`, `VehBrand_4`, `VehBrand_5`, `VehBrand_7`, `VehBrand_8`, `VehBrand_9`, `VehBrand_10`,
   `VehBrand_11`

   `Region_2`, `Region_3`, `Region_6`, `Region_9`, `Region_10`, `Region_11`

   `DrivAge__5`, `DrivAge__6`, `DrivAge__7`, `DrivAge__8`

   `VehPower__7`, `VehPower__11`, `VehPower__12`
 

- For the logistic part, we'll cut: 
  `Intercept`, `Area`, `log_density`, `DrivAge` (rejected 7 out of 8 classes), `Vehicle Brand` (rejected 9 of 11 classes), `Region` (rejected 8 out of 13), and `VehPower` (rejected 7 out of 9 classes)

- For the Poisson part, we'll cut only `Vehicle Brand` (rejected 8 of 11 classes), which coincides with the results of the full Poisson model. We'll keep `Drivers Age`, although the rejected classes are the ones with the lowest observed frequencies. With respect to `Region`, we'll keep it because some of the rejected classes account for higher observed frequency values; the rejection may come from the slight correlation between `Region` and both `log_density` and `Area`.

In [34]:
zip_poisson_model.aic

227679.03056086283

In [68]:
expr = """ Q('ClaimNb') ~ Q('Area') + Q('BonusMalus') + Q('log_density') + Q('VehGas_Regular') + 
                          Q('VehPower__5') + Q('VehPower__6') + Q('VehPower__7') + Q('VehPower__8') +
                          Q('VehPower__9') + Q('VehPower__10') + Q('VehPower__11') + Q('VehPower__12') + 
                          Q('VehAge__2') + Q('VehAge__3') + Q('DrivAge__2') + Q('DrivAge__3') + 
                          Q('DrivAge__4') + Q('DrivAge__5') + Q('DrivAge__6') + Q('DrivAge__7') + Q('DrivAge__8') +
                          Q('Region_2') + Q('Region_3') + Q('Region_4') + Q('Region_5') + Q('Region_6') + 
                          Q('Region_7') + Q('Region_8') + Q('Region_9') + Q('Region_10') + Q('Region_11') +
                          Q('Region_12') """

In [69]:
y_train, X_train = dmatrices(expr, claimsdf_train, return_type='dataframe')

In [70]:
y_train2, X_train2 = dmatrices(expr, claimsdf_train, return_type='dataframe')

In [71]:
X_train2.drop(columns=["Q('Area')", "Q('log_density')", "Q('DrivAge__4')", "Q('DrivAge__5')", "Q('DrivAge__6')", "Q('DrivAge__7')", "Q('DrivAge__8')", "Q('Region_2')", "Q('Region_3')", "Q('Region_4')", "Q('Region_5')",
       "Q('Region_6')", "Q('Region_7')", "Q('Region_8')", "Q('Region_9')",
       "Q('Region_10')", "Q('Region_11')", "Q('Region_12')", "Q('VehPower__5')", "Q('VehPower__6')", "Q('VehPower__7')", "Q('VehPower__8')", "Q('VehPower__9')", "Q('VehPower__10')", "Q('VehPower__11')", "Q('VehPower__12')"], inplace=True)

In [74]:
zip_poisson_model = sm.ZeroInflatedPoisson(endog=y_train, exog=X_train, exog_infl=X_train2, exposure=claimsdf_train.Exposure, inflation='logit').fit()



         Current function value: 0.210128
         Iterations: 35
         Function evaluations: 42
         Gradient evaluations: 42




In [75]:
print(zip_poisson_model.summary())

                     ZeroInflatedPoisson Regression Results                    
Dep. Variable:            Q('ClaimNb')   No. Observations:               541937
Model:             ZeroInflatedPoisson   Df Residuals:                   541904
Method:                            MLE   Df Model:                           32
Date:                 Tue, 05 Oct 2021   Pseudo R-squ.:                 0.02511
Time:                         01:25:29   Log-Likelihood:            -1.1388e+05
converged:                       False   LL-Null:                   -1.1681e+05
Covariance Type:             nonrobust   LLR p-value:                     0.000
                                  coef    std err          z      P>|z|      [0.025      0.975]
-----------------------------------------------------------------------------------------------
inflate_Intercept               0.3884      0.063      6.117      0.000       0.264       0.513
inflate_Q('BonusMalus')        -0.0149      0.001    -15.284      0.000 

In [76]:
zip_poisson_model.aic

227818.01696923067

### MODEL SELECTION

- The likelihood ratio test based on the Poisson deviance can be applied recursively to a sequence of nested models. This leads to a step-wise reduction of model complexity, similar in spirit to the analysis of variance (ANOVA) in Listing 2.7, and is often referred to as backward model selection.