# Formula Interface Tutorial: Revisiting French Motor Third-Party Liability Claims


**Intro**

This tutorial showcases the formula interface of `glum`. It allows for the specification of the design matrix and the response variable using so-called [Wilkinson-formulas](https://www.jstor.org/stable/2346786) instead of constructing it by hand. This kind of model specification should be familiar to R users or those who have used the `statsmodels`/`lienarmodels` Python packages before. This tutorial aims to introduce the basics of working with formulas to other users, as well as highlighting some important differences between `glum`s and other packages' formula implementations.

For a more in-depth look at how formulas work, please take a look at the [documentation of `formulaic`](https://matthewwardrop.github.io/formulaic/), the package on which `glum`'s formula interface is based.


**Background**

This tutorial reimplements and extends the combined frequency-severity model from Chapter 4 of the [GLM tutorial](tutorials/glm_french_motor_tutorial/glm_french_motor.html). If you would like to know more about the setting, the data, or GLM modeling in general, please check that out first. 



## Table of Contents
* [1. Load and Prepare Datasets from Openml.org](#1.-Load-and-Prepare-Datasets-from-Openml.org)
* [2. Frequency GLM - Poisson Distribution](#2.-Frequency-GLM---Poisson-Distribution)
* [3. Severity GLM - Gamma Distribution](#3.-Severity-GLM---Gamma-distribution)
* [4. Combined GLM - Tweedie Distribution](#4.-Combined-GLM---Tweedie-Distribution)

In [2]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy.optimize as optimize
import scipy.stats
from dask_ml.preprocessing import Categorizer
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import ShuffleSplit
from glum import GeneralizedLinearRegressor
from glum import TweedieDistribution

from load_transform import load_transform

## 1. Load and prepare datasets from Openml<a class="anchor"></a>
[back to table of contents](#Table-of-Contents)

First, we load in our [dataset from openML]("https://www.openml.org/d/41214") and apply several transformations. In the interest of simplicity, we do not include the data loading and preparation code in this notebook.

In [3]:
df = load_transform()
with pd.option_context('display.max_rows', 10):
    display(df)

Unnamed: 0_level_0,ClaimNb,Exposure,Area,VehPower,VehAge,DrivAge,BonusMalus,VehBrand,VehGas,Density,Region,ClaimAmount,ClaimAmountCut
IDpol,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
1,0,0.10000,D,5,0,5,50,B12,Regular,1217,R82,0.0,0.0
3,0,0.77000,D,5,0,5,50,B12,Regular,1217,R82,0.0,0.0
5,0,0.75000,B,6,1,5,50,B12,Diesel,54,R22,0.0,0.0
10,0,0.09000,B,7,0,4,50,B12,Diesel,76,R72,0.0,0.0
11,0,0.84000,B,7,0,4,50,B12,Diesel,76,R72,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
6114326,0,0.00274,E,4,0,5,50,B12,Regular,3317,R93,0.0,0.0
6114327,0,0.00274,E,4,0,4,95,B12,Regular,9850,R11,0.0,0.0
6114328,0,0.00274,D,6,1,4,50,B12,Diesel,1323,R82,0.0,0.0
6114329,0,0.00274,B,4,0,5,50,B12,Regular,95,R26,0.0,0.0


## 2. Reproducing the model from Tutorial 1

Now, let us start by fitting a very simple model. As usual, let's divide our samples into a training and a test set so that we get valid out-of-sample goodness-of-fit measures. Perhaps less usually, we do not create separate `y` and `X` data frames for our label and features – the formula will take care of that for us.

We still have some preprocessing to do:
 - Many of the ordinal or nominal variables are encoded as integers, instead as categoricals. We will need to convert these so that `glum` will know to estimate a separate coefficient for each of their levels.
 - The outcome variable is a transformation of other columns. We need to create it first.

As we will see later on, these steps can actually be incorporated into the formula itself, but let's not overcomplicate things at first.

In [11]:
ss = ShuffleSplit(n_splits=1, test_size=0.1, random_state=42)
train, test = next(ss.split(df))

df = df.assign(PurePremium = lambda x: x["ClaimAmountCut"] / x["Exposure"])

glm_categorizer = Categorizer(columns=["VehBrand", "VehGas", "Region", "Area", "DrivAge", "VehAge", "VehPower"])
df_train = glm_categorizer.fit_transform(df.iloc[train])
df_test = glm_categorizer.transform(df.iloc[test])


formula = "PurePremium ~ VehBrand + VehGas + Region + Area + DrivAge + VehAge + VehPower + BonusMalus + Density"

This example demonstrates the basic idea behind formulas: the outcome variable and the predictors are separated by a tilde (`~`), and different prefictors are separated by plus signs (`+`). Thus, formulas provide a concise way of specifying a model without the need to create dataframes by hand.

In [12]:
TweedieDist = TweedieDistribution(1.5)
t_glm1 = GeneralizedLinearRegressor(family=TweedieDist, alpha_search=True, l1_ratio=1, fit_intercept=True, formula=formula)
t_glm1.fit(df_train, sample_weight=df['Exposure'].values[train])

pd.DataFrame({'coefficient': np.concatenate(([t_glm1.intercept_], t_glm1.coef_))},
             index=['intercept'] + t_glm1.feature_names_).T

Unnamed: 0,intercept,VehBrand[T.B1],VehBrand[T.B10],VehBrand[T.B11],VehBrand[T.B12],VehBrand[T.B13],VehBrand[T.B14],VehBrand[T.B2],VehBrand[T.B3],VehBrand[T.B4],...,VehAge[T.1],VehAge[T.2],VehPower[T.4],VehPower[T.5],VehPower[T.6],VehPower[T.7],VehPower[T.8],VehPower[T.9],BonusMalus,Density
coefficient,2.88667,-0.064157,0.0,0.231868,-0.211061,0.054979,-0.270346,-0.071453,0.00291,0.059324,...,0.008117,-0.229906,-0.111796,-0.123388,0.060757,0.005179,-0.021832,0.208158,0.032508,2e-06


## Fun with Functions

The previous example is only scratching the surface of what formulas are capable of. For example, they are capable of evaluating arbitrary Python expressions, which act as if they saw the columns of the input data frame as local variables (`pandas.Series`). The way to tell `glum` that a part of the formula should be evaluated as a Python expression before applying the formula grammar to it is to enclose it in curly braces. As an example, we can easily do the following within the formula itself:

 - Create the outcome variable on the fly instead of doing it beforehand.
 - Include the logarithm of a certain variable in the model.<sup>1</sup>
 - Include a basis spline interpolation of a variable to capture non-linearities in its effect.<sup>2</sup>

Let's try it out!

<sup>1</sup>: This works because formulas can include variables from the local scope, such as the imported `numpy` namespace. (Even more precisely, certain often-used `numpy` functions are special-cased, so the curly braces are not even strictly necessary here.)

<sup>2</sup>: `bs` is one of the several built-in `formulaic` functions that aim to simplify preprocessing steps. You can learn more about them [in `formulaic`'s docs](https://matthewwardrop.github.io/formulaic/guides/transforms/).

In [15]:
formula_fun = "{ClaimAmountCut / Exposure} ~ VehBrand + VehGas + Region + Area + DrivAge + VehAge + VehPower + bs(BonusMalus, 3) + {np.log(Density)}"
t_glm2 = GeneralizedLinearRegressor(family=TweedieDist, alpha_search=True, l1_ratio=1, fit_intercept=True, formula=formula_fun)
t_glm2.fit(df_train, sample_weight=df['Exposure'].values[train])

pd.DataFrame({'coefficient': np.concatenate(([t_glm2.intercept_], t_glm2.coef_))},
             index=['intercept'] + t_glm2.feature_names_).T

Unnamed: 0,intercept,VehBrand[T.B1],VehBrand[T.B10],VehBrand[T.B11],VehBrand[T.B12],VehBrand[T.B13],VehBrand[T.B14],VehBrand[T.B2],VehBrand[T.B3],VehBrand[T.B4],...,VehPower[T.4],VehPower[T.5],VehPower[T.6],VehPower[T.7],VehPower[T.8],VehPower[T.9],"bs(BonusMalus, 3)[1]","bs(BonusMalus, 3)[2]","bs(BonusMalus, 3)[3]",np.log(Density)
coefficient,3.808829,-0.060201,0.0,0.242194,-0.202517,0.063471,-0.345415,-0.072546,0.00777,0.079391,...,-0.113038,-0.127255,0.060209,0.005577,-0.032114,0.207355,3.178178,0.361951,8.231846,0.121944


## Categorical Variables



`formulaic` also provides extensive support for encoding categorical variables. Unfortunately, not all of that is available in `glum`, as `tabmat` can only handle categorical variables as if they were one-hot-encoded. (This is the price one pays for performance. If you really need another encoding scheme, you can always do that separately in `pandas`/`formulaic` and use them as normal numeric variables.)

The main function one needs to be aware of in the context of categorixals is simply called `C()`. A variable placed within it is always converted to a categorical, regardless of it's type. Let's try it out on our dataset!

In [18]:
df_train_noncat = df.iloc[train]
df_test_noncat = df.iloc[test]

df_train_noncat.dtypes

ClaimNb             int64
Exposure          float64
Area               object
VehPower            int64
VehAge              int64
DrivAge             int64
BonusMalus          int64
VehBrand           object
VehGas             object
Density             int64
Region             object
ClaimAmount       float64
ClaimAmountCut    float64
ClaimFreq         float64
PurePremium       float64
dtype: object

Even though some of the variables are integers in this dataset, they are handled as categoricals thanks to the `C()` function. Strings, such as `VehBrand` or `VehGas` would have been handled as categorical by default anyway, but using the `C()` function never hurts: if applied to something that is already a caetgorical variable, it does not have any effect outside of the feature name.

In [17]:
formula_cat = "PurePremium ~ C(VehBrand) + C(VehGas) + C(Region) + C(Area) + C(DrivAge) + C(VehAge) + C(VehPower) + BonusMalus + Density"

t_glm3 = GeneralizedLinearRegressor(family=TweedieDist, alpha_search=True, l1_ratio=1, fit_intercept=True, formula=formula_fun)
t_glm3.fit(df_train, sample_weight=df['Exposure'].values[train])

pd.DataFrame({'coefficient': np.concatenate(([t_glm3.intercept_], t_glm3.coef_))},
             index=['intercept'] + t_glm3.feature_names_).T

Unnamed: 0,intercept,VehBrand[T.B1],VehBrand[T.B10],VehBrand[T.B11],VehBrand[T.B12],VehBrand[T.B13],VehBrand[T.B14],VehBrand[T.B2],VehBrand[T.B3],VehBrand[T.B4],...,VehPower[T.4],VehPower[T.5],VehPower[T.6],VehPower[T.7],VehPower[T.8],VehPower[T.9],"bs(BonusMalus, 3)[1]","bs(BonusMalus, 3)[2]","bs(BonusMalus, 3)[3]",np.log(Density)
coefficient,3.808829,-0.060201,0.0,0.242194,-0.202517,0.063471,-0.345415,-0.072546,0.00777,0.079391,...,-0.113038,-0.127255,0.060209,0.005577,-0.032114,0.207355,3.178178,0.361951,8.231846,0.121944


Finally, prediction works as expected with categorical variables. `glum` keeps track of the levels present in the training dataset, and makes sure that categorical variables in unseen datasets are also properly aligned, even if they have missing or unknown levels.<sup>3</sup> Therefore, one can simply use predict, and `glum` does The Right Thing™ by default.

<sup>3</sup>: This is made possible due to `glum` saving a [`ModelSpec` object](https://matthewwardrop.github.io/formulaic/guides/model_specs/), which contains any information necessary for reapplying the transitions that were done during the formula materialization process. It is especially relevant in the case of [stateful transforms](https://matthewwardrop.github.io/formulaic/guides/transforms/), such as creating categorical variables.

In [19]:
t_glm3.predict(df_test_noncat)

array([71.14008987, 17.22303128, 62.87949515, ..., 23.01366586,
       16.36815769, 77.96147907])

## Interactions and Structural Full-rankness

The attentive reader might have noticed that the first level of each categorical variable is omitted from the model. This is a manifestation of the more general concept of [ensuring structural full-rankedness](https://matthewwardrop.github.io/formulaic/guides/contrasts/#guaranteeing-structural-full-rankness)<sup>4</sup>. By default, `glum` and `formulaic` will try to make sure that one does not fall into the [Dummy Variable Trap](https://en.wikipedia.org/wiki/Dummy_variable_(statistics)). Moreover, it even does it in the case of (possibly multi-way) interactions involving categorical variables. It will always drop the necessary number of levels, and no more. If you want to opt out of this behavior (for example because you would like to penalize all levels equally), simply set the `drop_first` parameter during model initialization to `False`.

<sup>4</sup>: Note, that it does not guarantee that the design matrix is actually full rank. For example, two identical numerical variables will still lead to a rank-deficient design matrix.

# Miscellaneous features