# STATISTICAL MODELLING

### Assume that the claims count random variable N has a Poisson distribution with given years at risk v > 0 and expected frequency λ > 0, i.e. N ~ Poi(λv). We aim to model the expected frequency λ so that it can capture structural differences (heterogeneity), or systematic effects, between different insurance policies and risks.
### v measures the volume of the portfolio. By the aggregation property, the total claims count of a portfolio of independent Poisson risks is again Poisson distributed, with a volume-weighted expected frequency.

In [7]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from patsy import dmatrices
import seaborn as sns
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
import statsmodels.api as sm

In [8]:
claimsdf = pd.read_csv('/home/julian/Cursos/Ironhack/Proyectos/ProyectoFinal/Claims-Frequency-Predictions/Notebooks/claimsdf_1.csv')

In [9]:
claimsdf.head()

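| | ClaimNb | Exposure | Area | BonusMalus | VehBrand | Region | empirical_frequencies | VehGas_Regular | VehPower_ | VehAge_ | DrivAge_ | log_density |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.1 | 4 | 50 | 9 | 1.0 | 10.0 | 1 | 5 | 1 | 6 | 7.104144 |
| 1 | 1 | 0.77 | 4 | 50 | 9 | 1.0 | 1.298701 | 1 | 5 | 1 | 6 | 7.104144 |
| 2 | 1 | 0.75 | 2 | 50 | 9 | 5.0 | 1.333333 | 0 | 6 | 2 | 6 | 3.988984 |
| 3 | 1 | 0.09 | 2 | 50 | 9 | 7.0 | 11.111111 | 0 | 7 | 1 | 5 | 4.330733 |
| 4 | 1 | 0.84 | 2 | 50 | 9 | 7.0 | 1.190476 | 0 | 7 | 1 | 5 | 4.330733 |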


In [10]:
claimsdf.drop(columns=['empirical_frequencies', 'Exposure'], inplace=True)

## GENERALIZED LINEAR MODELS (Poisson case)

- The feature components interact multiplicatively in our Poisson GLM because of the log link. One of the main tasks is to analyze whether this multiplicative interaction is appropriate. As we saw in the EDA, the frequencies are non-linearly related to Vehicle Age and Driver Age, so for the GLM approach we partition these two features into classes and treat them as categorical variables.
- We consider 3 continuous feature components (Area, BonusMalus, log-Density), 1 binary feature component (VehGas) and 5 categorical feature components (VehPower, VehAge, DrivAge, VehBrand, Region), which we dummy-encode. This gives a feature space dimension q = 3 + 1 + 8 + 2 + 7 + 10 + 11 = 42.
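As a quick check of the two points above, the following sketch recomputes the feature dimension and illustrates the multiplicative structure implied by a log link. The level counts are inferred from the dummy columns used in this notebook, and the coefficient values are hypothetical, chosen only for illustration.

```python
import math

# Number of levels per categorical feature, inferred from the dummy columns
# (drop_first=True removes one reference level per feature).
levels = {"VehPower": 9, "VehAge": 3, "DrivAge": 8, "VehBrand": 11, "Region": 12}
n_continuous = 3  # Area, BonusMalus, log_density
n_binary = 1      # VehGas_Regular

q = n_continuous + n_binary + sum(k - 1 for k in levels.values())
print(q)  # 42

# Hypothetical coefficients (NOT taken from the fitted model).
beta0, beta_bm, beta_gas = -4.0, 0.02, 0.1
bonus_malus, veh_gas = 50, 1

# With a log link the linear predictor is exponentiated ...
freq = math.exp(beta0 + beta_bm * bonus_malus + beta_gas * veh_gas)
# ... which equals a base frequency times one factor per feature.
freq_product = (
    math.exp(beta0) * math.exp(beta_bm * bonus_malus) * math.exp(beta_gas * veh_gas)
)
print(math.isclose(freq, freq_product))  # True
```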

In [11]:
claimsdf = pd.get_dummies(claimsdf, columns=['VehPower_', 'VehAge_', 'DrivAge_', 'VehBrand', 'Region'], drop_first=True)

In [12]:
claimsdf.head(2)

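| | ClaimNb | Area | BonusMalus | VehGas_Regular | log_density | VehPower__5 | VehPower__6 | VehPower__7 | VehPower__8 | VehPower__9 | ... | Region_3.0 | Region_4.0 | Region_5.0 | Region_6.0 | Region_7.0 | Region_8.0 | Region_9.0 | Region_10.0 | Region_11.0 | Region_12.0 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 4 | 50 | 1 | 7.104144 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 4 | 50 | 1 | 7.104144 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |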


### MODEL TRAINING

- Maximizing the log-likelihood with respect to the parameter β is equivalent to minimizing the deviance loss with respect to β. In this spirit, the deviance loss plays the role of the canonical objective function to be minimized. The objective referred to here is the scaled Poisson deviance loss.

- We randomly (uniformly) select 80% of data for training and leave 20% for testing:

In [13]:
sample = np.random.rand(len(claimsdf)) < 0.8

In [14]:
claimsdf_train = claimsdf[sample]

In [17]:
claimsdf_train.shape

(543149, 43)

In [24]:
claimsdf_train.head()

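| | ClaimNb | Area | BonusMalus | VehGas_Regular | log_density | VehPower__5 | VehPower__6 | VehPower__7 | VehPower__8 | VehPower__9 | ... | Region_3.0 | Region_4.0 | Region_5.0 | Region_6.0 | Region_7.0 | Region_8.0 | Region_9.0 | Region_10.0 | Region_11.0 | Region_12.0 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 4 | 50 | 1 | 7.104144 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1 | 2 | 50 | 0 | 3.988984 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 1 | 2 | 50 | 0 | 4.330733 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 5 | 1 | 5 | 50 | 1 | 8.007367 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | 1 | 5 | 50 | 1 | 8.007367 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |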


In [18]:
claimsdf_test = claimsdf[~sample]

In [20]:
claimsdf_test.shape

(134864, 43)
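The mask above is drawn without a fixed seed, so the train and test sets change on every run. A minimal sketch of a reproducible variant follows. It assumes a seeded NumPy generator (the seed value 42 is arbitrary) and uses a small toy frame in place of `claimsdf`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for claimsdf (hypothetical data).
toy = pd.DataFrame({"ClaimNb": np.arange(1000) % 3})

rng = np.random.default_rng(42)    # fixed seed: an assumption, any integer works
mask = rng.random(len(toy)) < 0.8  # True for roughly 80% of the rows

toy_train = toy[mask]
toy_test = toy[~mask]

print(toy_train.shape, toy_test.shape)
```

Re-creating the generator with the same seed reproduces the same mask, so the shapes and all downstream results stay comparable between runs.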

In [29]:
expr = """ Q('ClaimNb') ~ Q('Area') + Q('BonusMalus') + Q('log_density') + Q('VehGas_Regular') + 
                          Q('VehPower__5') + Q('VehPower__6') + Q('VehPower__7') + Q('VehPower__8') +
                          Q('VehPower__9') + Q('VehPower__10') + Q('VehPower__11') + Q('VehPower__12') + 
                          Q('VehAge__2') + Q('VehAge__3') + Q('DrivAge__2') + Q('DrivAge__3') + 
                          Q('DrivAge__4') + Q('DrivAge__5') + Q('DrivAge__6') + Q('DrivAge__7') + Q('DrivAge__8') +
                          Q('VehBrand_2') + Q('VehBrand_3') + Q('VehBrand_4') + Q('VehBrand_5') + Q('VehBrand_6') +
                          Q('VehBrand_7') + Q('VehBrand_8') + Q('VehBrand_9') + Q('VehBrand_10') + Q('VehBrand_11') +
                          Q('Region_2.0') + Q('Region_3.0') + Q('Region_4.0') + Q('Region_5.0') + Q('Region_6.0') + 
                          Q('Region_7.0') + Q('Region_8.0') + Q('Region_9.0') + Q('Region_10.0') + Q('Region_11.0') +
                          Q('Region_12.0') """

- Build the matrices for the specified model:

In [30]:
y_train, X_train = dmatrices(expr, claimsdf_train, return_type='dataframe')

In [31]:
y_test, X_test = dmatrices(expr, claimsdf_test, return_type='dataframe')

In [32]:
poisson_model1 = sm.GLM(y_train, X_train, family=sm.families.Poisson()).fit()

In [33]:
print(poisson_model1.summary())

                 Generalized Linear Model Regression Results                  
Dep. Variable:           Q('ClaimNb')   No. Observations:               543149
Model:                            GLM   Df Residuals:                   543106
Model Family:                 Poisson   Df Model:                           42
Link Function:                    log   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:            -1.1298e+05
Date:                Sun, 03 Oct 2021   Deviance:                   1.7051e+05
Time:                        11:18:15   Pearson chi2:                 5.87e+05
No. Iterations:                     7                                         
Covariance Type:            nonrobust                                         
                          coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------------
Intercept              -4.1696    

- From the results of `poisson_model1` we can see that the variables `Area`, `VehPower__12`, `VehBrand_2`, `VehBrand_3`, `VehBrand_4`, `VehBrand_6`, `VehBrand_7`, `VehBrand_8`, `VehBrand_10`, `VehBrand_11`, `Region_10.0` and `Region_12.0` have a p-value larger than 0.05 (for an assumed alpha = 0.05), so we should reconsider their inclusion in the model, because they do not seem to be significant for frequency modelling.
- As we saw in the correlation analysis, `Area` is highly correlated with `Density`, which might explain why the former is not significant.
- Of the 10 `VehBrand` dummy variables (11 classes, one used as reference), 8 are not significant.
- But before excluding them, we should check for overdispersion.

In [34]:
poisson_model1_pred = poisson_model1.get_prediction(X_test)

In [36]:
poisson_model1_predictions = poisson_model1_pred.summary_frame()
print(poisson_model1_predictions)

            mean   mean_se  mean_ci_lower  mean_ci_upper
0       0.069756  0.001773       0.066365       0.073320
4       0.046379  0.001638       0.043277       0.049704
15      0.117812  0.004275       0.109723       0.126497
18      0.050005  0.003794       0.043095       0.058022
20      0.050005  0.003794       0.043095       0.058022
...          ...       ...            ...            ...
677999  0.057363  0.001606       0.054300       0.060598
678007  0.055607  0.001473       0.052793       0.058571
678010  0.049910  0.001246       0.047527       0.052412
678011  0.047462  0.002507       0.042795       0.052639
678012  0.022898  0.000922       0.021160       0.024778

[134864 rows x 4 columns]
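The predicted means can be scored out of sample with the Poisson deviance loss mentioned in the training section. The helper below is a sketch: `poisson_deviance` is a hypothetical name, and it returns the average unit deviance without any exposure weighting.

```python
import numpy as np

def poisson_deviance(y, mu):
    """Average Poisson deviance: 2 * mean(mu - y + y * log(y / mu))."""
    y = np.asarray(y, dtype=float)
    mu = np.asarray(mu, dtype=float)
    # y * log(y / mu) is set to 0 when y == 0 (its limiting value).
    ratio = np.where(y > 0, y / mu, 1.0)
    return 2.0 * np.mean(mu - y + y * np.log(ratio))

# Tiny worked example: one policy without claims, one with a single claim.
print(poisson_deviance([0, 1], [0.5, 1.0]))  # 0.5
```

In this notebook it could be called as `poisson_deviance(y_test.values.ravel(), poisson_model1_predictions["mean"].values)`, assuming both objects keep the same row order.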


### OVERDISPERSION CHECK

- Deviance statistics can account for potential over- or under-dispersion φ ≠ 1. In the Poisson model the dispersion is fixed at φ = 1 by definition (variance equal to the mean), so this adjustment does not apply. We can nevertheless estimate the dispersion parameter empirically from our data, using Pearson's (distribution-free) dispersion estimate and the deviance dispersion estimate.
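Both estimates divide a goodness-of-fit statistic by the residual degrees of freedom. The sketch below plugs in the values printed in the `poisson_model1` summary above. On the fitted object the same quantities are available as `pearson_chi2`, `deviance` and `df_resid`.

```python
# Values copied from the printed summary of poisson_model1.
pearson_chi2 = 5.87e5   # Pearson chi2
deviance = 1.7051e5     # Deviance
df_resid = 543106       # Df Residuals

phi_pearson = pearson_chi2 / df_resid
phi_deviance = deviance / df_resid

print(round(phi_pearson, 3))   # 1.081
print(round(phi_deviance, 3))  # 0.314
```

The Pearson estimate is slightly above 1, while the deviance estimate is well below 1, so the two should be interpreted with care before concluding anything about overdispersion.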

### MODEL SELECTION

- The likelihood ratio test based on the Poisson deviance can be applied recursively to a sequence of nested models. This leads to a step-wise reduction of model complexity, similar in spirit to the analysis of variance (ANOVA) in Listing 2.7, and it is often referred to as backward model selection.