### Log-normal regression does not do a great job of predicting the actual ClaimAmount. In this notebook, I set up a gradient boosting Poisson regression model to estimate the expected ClaimNb per unit of Exposure (the claim frequency). Code for the same data set has been developed [elsewhere](https://scikit-learn.org/stable/auto_examples/linear_model/plot_poisson_regression_non_normal_loss.html). Here I adapt it to my work for the sake of method comparison.

In [1]:
import pandas as pd
import arff
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.compose import ColumnTransformer
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.preprocessing import OrdinalEncoder
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split

In [2]:
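# Load the French motor third-party liability claim frequency data (freMTPL2freq)
data_freq = arff.load('freMTPL2freq.arff')
df_freq = pd.DataFrame(data_freq, columns=["IDpol", "ClaimNb", "Exposure", "Area", "VehPower",
"VehAge","DrivAge", "BonusMalus", "VehBrand", "VehGas", "Density", "Region"])

# Observed claim frequency: number of claims per unit of exposure
df_freq['Frequency'] = df_freq.ClaimNb / df_freq.Exposure

# Hold out 20% of the policies for evaluation
df_train, df_test = train_test_split(df_freq, test_size=0.2, random_state=1337)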

In [None]:
tree_preprocessor = ColumnTransformer(
    [
        (
            "categorical",
            OrdinalEncoder(),
            ["VehBrand", "VehPower", "VehGas", "Region", "Area"],
        ),
        ("numeric", "passthrough", ["VehAge", "DrivAge", "BonusMalus", "Density"]),
    ],
    remainder="drop",
)
poisson_gbrt = Pipeline(
    [
        ("preprocessor", tree_preprocessor),
        (
            "regressor",
            HistGradientBoostingRegressor(loss="poisson", max_leaf_nodes=128),
        ),
    ]
)
poisson_gbrt.fit(
    df_train, df_train["Frequency"], regressor__sample_weight=df_train["Exposure"]
)
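The fit above uses `Frequency` as the target and `Exposure` as the sample weight. A minimal sketch of why that pairing makes sense, on synthetic data rather than the freMTPL2 set: for a constant prediction, the exposure-weighted Poisson loss is minimized at the pooled rate `sum(ClaimNb) / sum(Exposure)`, not at the unweighted mean of the per-policy frequencies. The simulated rate of 0.1, the grid search, and all names below are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

# Synthetic portfolio (assumption: constant true claim rate of 0.1 per unit exposure)
rng = np.random.default_rng(0)
n = 1000
exposure = rng.uniform(0.05, 1.0, size=n)
true_rate = 0.1
claim_nb = rng.poisson(true_rate * exposure)
frequency = claim_nb / exposure


def weighted_poisson_loss(mu, y, w):
    """Poisson negative log-likelihood for a constant prediction mu,
    weighted by w, with terms that do not depend on mu dropped."""
    return np.sum(w * (mu - y * np.log(mu)))


# Brute-force search for the best constant prediction on a fine grid (step 1e-4)
grid = np.linspace(0.01, 0.5, 4901)
losses = np.array([weighted_poisson_loss(m, frequency, exposure) for m in grid])
best_mu = grid[np.argmin(losses)]

pooled_rate = claim_nb.sum() / exposure.sum()  # total claims / total exposure
unweighted_mean = frequency.mean()             # dominated by short-exposure policies

print("best constant under weighted loss:", best_mu)
print("pooled rate:", pooled_rate)
print("unweighted mean frequency:", unweighted_mean)
```

The first two printed values should agree up to the grid resolution, while the unweighted mean generally differs because policies with a small `Exposure` and a single claim produce very large individual frequencies.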


In [4]:
y_pred = poisson_gbrt.predict(df_test)
# Unweighted ratio of summed predicted frequencies to summed observed frequencies
print("Ratio ExpectedClaimRate/ClaimRate: ", y_pred.sum() / df_test.Frequency.sum())

Ratio ExpectedClaimRate/ClaimRate:  0.45712068402106587
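
The ratio printed above compares unweighted sums of frequencies, whereas the model was fitted with `Exposure` as the sample weight. A possible complementary check, sketched here under that assumption, converts predicted frequencies back to expected claim counts before comparing them with the observed counts. The helper `exposure_weighted_ratio` and the toy arrays are hypothetical and only illustrate how the two ratios can diverge.

```python
import numpy as np


def exposure_weighted_ratio(pred_frequency, claim_nb, exposure):
    """Expected number of claims divided by observed number of claims.

    Predicted frequencies are multiplied by exposure to get expected counts.
    (Hypothetical helper, not part of the original notebook.)
    """
    pred_frequency = np.asarray(pred_frequency, dtype=float)
    claim_nb = np.asarray(claim_nb, dtype=float)
    exposure = np.asarray(exposure, dtype=float)
    return np.sum(pred_frequency * exposure) / np.sum(claim_nb)


# Toy portfolio: one short-exposure policy with a claim, two full-year policies without
claims = np.array([0, 1, 0])
expo = np.array([1.0, 0.1, 1.0])

# A model that predicts the pooled rate for everyone is calibrated on claim counts
pooled = claims.sum() / expo.sum()
pred = np.full(3, pooled)

ratio_weighted = exposure_weighted_ratio(pred, claims, expo)
ratio_unweighted = pred.sum() / (claims / expo).sum()

print("exposure-weighted ratio:", ratio_weighted)
print("unweighted frequency ratio:", ratio_unweighted)
```

In the notebook this would be called as `exposure_weighted_ratio(y_pred, df_test["ClaimNb"], df_test["Exposure"])`. In the toy case the weighted ratio is 1 while the unweighted one is far below 1, so part of the shortfall reported above may come from the choice of metric rather than from the model itself.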


### The gradient boosting model still underestimates the claim rate (ratio of about 0.46), but to a somewhat lesser extent