# STATISTICAL MODELLING

### Assume that the claims count random variable N has a Poisson distribution, N ~ Poi(λv), with given years at risk v > 0 and expected frequency λ > 0. We aim to model the expected frequency λ so that it captures structural differences (heterogeneity), or systematic effects, between different insurance policies and risks.
### v measures the volume of the aggregated portfolio. The aggregation property says that the aggregated claims count of independent Poisson policies is again Poisson, with the total volume and a volume-weighted expected frequency.
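
As a small numerical illustration of the aggregation property, assume independent policies with N_i ~ Poi(λ_i v_i). Their sum is then Poisson with volume v = Σ v_i and volume-weighted frequency λ = Σ (v_i / v) λ_i. The policy values below are made up for illustration and are not taken from the data set.

```python
import numpy as np

# Hypothetical policies: years at risk v_i and expected frequencies lambda_i.
v = np.array([0.10, 0.77, 0.75, 0.09, 0.84])
lam = np.array([0.05, 0.08, 0.12, 0.10, 0.07])

# Aggregated volume and volume-weighted expected frequency.
v_total = v.sum()
lam_total = np.sum(v / v_total * lam)

# The aggregated Poisson mean equals the sum of the individual Poisson means.
assert np.isclose(lam_total * v_total, np.sum(lam * v))
print(f"v = {v_total:.2f}, lambda = {lam_total:.4f}")
```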

In [63]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from patsy import dmatrices
import seaborn as sns
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
import statsmodels.api as sm

In [64]:
claimsdf = pd.read_csv('/home/julian/Cursos/Ironhack/Proyectos/ProyectoFinal/Claims-Frequency-Predictions/Notebooks/claimsdf_1.csv')

In [65]:
claimsdf.head()

Unnamed: 0,ClaimNb,Exposure,Area,BonusMalus,VehBrand,Density,Region,empirical_frequencies,VehGas_Regular,VehPower_,VehAge_,DrivAge_
0,1,0.1,4,50,9,1217,1.0,10.0,1,5,1,6
1,1,0.77,4,50,9,1217,1.0,1.298701,1,5,1,6
2,1,0.75,2,50,9,54,5.0,1.333333,0,6,2,6
3,1,0.09,2,50,9,76,7.0,11.111111,0,7,1,5
4,1,0.84,2,50,9,76,7.0,1.190476,0,7,1,5


In [66]:
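# Drop the target-derived empirical_frequencies (ClaimNb / Exposure) and Exposure; no exposure offset is available afterwards.
claimsdf.drop(columns=['empirical_frequencies', 'Exposure'], inplace=True)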

## GENERALIZED LINEAR MODELS (Poisson case)

- The feature components interact multiplicatively in our Poisson GLM, as a consequence of the log link. One of the main tasks is to analyze whether this multiplicative structure is appropriate. Because the EDA showed that the frequencies are non-linearly related to Vehicle Age and Driver Age, we partition these two variables into classes and treat them as categorical variables.
- We consider 3 continuous feature components (Area, BonusMalus, log-Density), 1 binary feature component (VehGas) and 5 dummy-encoded categorical feature components (VehPower, VehAge, DrivAge, VehBrand, Region), so the feature space has dimension q = 3 + 1 + 8 + 2 + 7 + 10 + 11 = 42.
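
The partition-then-encode step can be sketched on toy data as follows. The ages and the bin edges are illustrative assumptions, not the classes used to build `claimsdf_1.csv`. The point is only that a categorical feature with K classes contributes K - 1 dummy columns once a reference class is dropped.

```python
import pandas as pd

# Toy data; the ages and bin edges are assumptions chosen for illustration.
toy = pd.DataFrame({"DrivAge": [19, 23, 28, 35, 47, 58, 74, 85]})

# Partition the continuous age into ordered classes (right-closed intervals).
edges = [17, 20, 25, 30, 40, 50, 70, 80, 120]
labels = list(range(1, len(edges)))  # class labels 1..8
toy["DrivAge_"] = pd.cut(toy["DrivAge"], bins=edges, labels=labels)

# Dummy-encode with the first class as reference level: K classes -> K - 1 columns.
dummies = pd.get_dummies(toy["DrivAge_"], prefix="DrivAge_", drop_first=True)
n_classes = toy["DrivAge_"].cat.categories.size
print(n_classes, dummies.shape[1])
```

With 8 age classes this yields 7 dummy columns, which is how the DrivAge term contributes 7 to q.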

In [67]:
claimsdf = pd.get_dummies(claimsdf, columns=['VehPower_', 'VehAge_', 'DrivAge_', 'VehBrand', 'Region'], drop_first=True)

In [None]:
claimsdf.head(2)

### MODEL TRAINING

- We'll split our portfolio into a TRAIN SET and a TEST SET.

- Maximizing the log-likelihood over the parameter β is equivalent to minimizing the deviance loss over β. In this spirit, the deviance loss plays the role of the canonical objective function to be minimized; here it is the scaled Poisson deviance loss.

- We randomly assign each policy to the training set with probability 0.8, so roughly 80% of the data is used for training and the remaining 20% or so for testing.
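
A minimal sketch of the Poisson deviance loss, written with NumPy only. The function name `poisson_deviance` and the toy inputs are assumptions for illustration. The deviance is D = 2 Σ [N_i log(N_i / μ_i) - (N_i - μ_i)], with N_i log(N_i / μ_i) taken as 0 when N_i = 0. It differs from -2 times the log-likelihood only by a term that does not depend on β, which is why both lead to the same estimate.

```python
import numpy as np

def poisson_deviance(y, mu):
    """Poisson deviance between observed counts y and fitted means mu (> 0)."""
    y = np.asarray(y, dtype=float)
    mu = np.asarray(mu, dtype=float)
    # Use ratio 1 where y == 0 so that y * log(y / mu) evaluates to 0 there.
    ratio = np.where(y > 0, y / mu, 1.0)
    return 2.0 * np.sum(y * np.log(ratio) - (y - mu))

# A perfect fit has zero deviance; any other fit has positive deviance.
print(poisson_deviance([1, 2], [1.0, 2.0]))
print(poisson_deviance([0, 1, 2], [0.5, 0.5, 0.5]))
```

Dividing by the number of observations gives an average loss that can be compared between the train and test sets.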

In [69]:
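# Bernoulli(0.8) mask: True rows go to the training set (no seed is set, so the split changes between runs)
sample = np.random.rand(len(claimsdf)) < 0.8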

In [70]:
claimsdf_train = claimsdf[sample]

In [75]:
claimsdf_train

Unnamed: 0,ClaimNb,Area,BonusMalus,Density,VehGas_Regular,VehPower__5,VehPower__6,VehPower__7,VehPower__8,VehPower__9,...,Region_3.0,Region_4.0,Region_5.0,Region_6.0,Region_7.0,Region_8.0,Region_9.0,Region_10.0,Region_11.0,Region_12.0
0,1,4,50,1217,1,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,4,50,1217,1,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,2,50,54,0,0,1,0,0,0,...,0,0,1,0,0,0,0,0,0,0
3,1,2,50,76,0,0,0,1,0,0,...,0,0,0,0,1,0,0,0,0,0
4,1,2,50,76,0,0,0,1,0,0,...,0,0,0,0,1,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
678008,0,5,50,3317,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
678009,0,5,95,9850,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
678010,0,4,50,1323,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
678011,0,2,50,95,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [72]:
claimsdf_test = claimsdf[~sample]
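
The same mask-based split can be made reproducible with a seeded generator. This is a sketch on a toy frame; the seed value and the names `toy`, `toy_train` and `toy_test` are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the portfolio; only the row count matters here.
toy = pd.DataFrame({"ClaimNb": np.zeros(1000, dtype=int)})

# Seeded generator so that the same split is reproduced on every run.
rng = np.random.default_rng(42)
mask = rng.random(len(toy)) < 0.8

toy_train = toy[mask]
toy_test = toy[~mask]

# Each row is kept with probability 0.8, so the training share is only approximately 80%.
train_share = len(toy_train) / len(toy)
print(f"train share: {train_share:.3f}")
```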

In [76]:
expr = """ Q('ClaimNb') ~ Q('Area') + Q('BonusMalus') + Q('Density') + Q('VehGas_Regular') + Q('VehPower__5') + Q('VehPower__6') +
           Q('VehPower__7') + Q('VehPower__8') + Q('VehPower__9') + Q('VehPower__10') + Q('VehPower__11') + Q('VehPower__12') + Q('VehAge__2') + Q('VehAge__3') + 
           Q('DrivAge__2') + Q('DrivAge__3') + Q('DrivAge__4') + Q('DrivAge__5') + Q('DrivAge__6') + Q('DrivAge__7') + Q('DrivAge__8') + Q('VehBrand_2') + 
           Q('VehBrand_3') + Q('VehBrand_4') + Q('VehBrand_5') + Q('VehBrand_6') + Q('VehBrand_7') + Q('VehBrand_8') + Q('VehBrand_9') + Q('VehBrand_10') + 
           Q('VehBrand_11') + Q('Region_2.0') + Q('Region_3.0') + Q('Region_4.0') + Q('Region_5.0') + Q('Region_6.0') + Q('Region_7.0') + Q('Region_8.0') + 
           Q('Region_9.0') + Q('Region_10.0') + Q('Region_11.0') + Q('Region_12.0') """

In [77]:
y_train, X_train = dmatrices(expr, claimsdf_train, return_type='dataframe')

In [78]:
y_test, X_test = dmatrices(expr, claimsdf_test, return_type='dataframe')

### OVER DISPERSION CHECK

- Deviance statistics can account for potential over- or under-dispersion φ ≠ 1. In the Poisson model the dispersion is φ = 1 by definition (variance equal to mean), but we can estimate it empirically from our data with Pearson's (distribution-free) dispersion estimate and with the deviance dispersion estimate. Estimates clearly above 1 indicate over-dispersion.
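
Both dispersion estimates can be sketched with NumPy, given observed counts and fitted means. The function name `dispersion_estimates` and the toy inputs are assumptions for illustration; `n_params` counts all estimated coefficients, including the intercept.

```python
import numpy as np

def dispersion_estimates(y, mu, n_params):
    """Pearson and deviance dispersion estimates for a Poisson GLM.

    y: observed counts, mu: fitted means (> 0),
    n_params: number of estimated parameters, including the intercept.
    """
    y = np.asarray(y, dtype=float)
    mu = np.asarray(mu, dtype=float)
    df_resid = y.size - n_params

    # Pearson: sum of squared Pearson residuals over residual degrees of freedom.
    pearson = np.sum((y - mu) ** 2 / mu) / df_resid

    # Deviance: Poisson deviance over residual degrees of freedom,
    # with y * log(y / mu) taken as 0 when y == 0.
    ratio = np.where(y > 0, y / mu, 1.0)
    deviance = 2.0 * np.sum(y * np.log(ratio) - (y - mu)) / df_resid
    return pearson, deviance

phi_pearson, phi_deviance = dispersion_estimates([0, 1, 2, 3], [1.0, 1.0, 2.0, 2.0], n_params=2)
print(phi_pearson, phi_deviance)
```

For a fitted statsmodels GLM result `res`, the same quantities are available as `res.pearson_chi2 / res.df_resid` and `res.deviance / res.df_resid`.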

### MODEL SELECTION

- The likelihood ratio test based on the Poisson deviance can be applied recursively to a sequence of nested models. This leads to a step-wise reduction of model complexity, similar in spirit to the analysis of variance (ANOVA) in Listing 2.7, and is often referred to as backward model selection.
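
A single step of this procedure can be sketched as below, assuming φ = 1 so that the difference in deviance between the reduced and the full model is approximately chi-square distributed under the null hypothesis, with degrees of freedom equal to the number of dropped parameters. The function name and the deviance values are hypothetical.

```python
from scipy.stats import chi2

def likelihood_ratio_test(deviance_reduced, deviance_full, df_diff):
    """Likelihood ratio test for nested Poisson GLMs.

    deviance_reduced: deviance of the smaller (null hypothesis) model,
    deviance_full: deviance of the larger model,
    df_diff: number of parameters dropped in the reduced model.
    """
    stat = deviance_reduced - deviance_full
    p_value = chi2.sf(stat, df_diff)  # upper tail of the chi-square distribution
    return stat, p_value

# Hypothetical deviances: dropping 2 parameters raises the deviance by 10.
stat, p_value = likelihood_ratio_test(110.0, 100.0, df_diff=2)
print(stat, p_value)
```

A small p-value speaks against dropping the parameters. In backward selection the test is repeated, removing one feature component at a time until every remaining component is significant.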