# Regularized GLM for Claim Frequency

This notebook demonstrates **regularization** (Ridge, Lasso, Elastic Net) for GLMs, using the same insurance dataset as the frequency example.

Regularization is useful when:
- You have many predictors and want to prevent overfitting
- You want automatic variable selection (Lasso)
- You have correlated predictors (Ridge, Elastic Net)

In [1]:
import polars as pl
import numpy as np
import rustystats as rs

# Load the insurance data
data = pl.read_parquet("https://raw.githubusercontent.com/PricingFrontier/pricing-data-example/917c853e256df8d5814721ab56f72889a908bb08/data/processed/frequency_set.parquet")
print(f"Dataset: {data.shape[0]:,} rows, {data.shape[1]} columns")
data.head()

Dataset: 678,012 rows, 13 columns


| IDpol | VehPower | VehAge | DrivAge | BonusMalus | VehBrand | VehGas | Area | Density | Region | Group | Exposure | ClaimCount |
|-------|----------|--------|---------|------------|----------|--------|------|---------|--------|-------|----------|------------|
| i64 | f64 | f64 | f64 | f64 | str | str | str | f64 | str | str | f64 | i32 |
| 2124053 | 5.0 | 1.0 | 31.0 | 60.0 | "B2" | "Diesel" | "C" | 393.0 | "Centre" | "5" | 0.53 | 0 |
| 1049168 | 4.0 | 2.0 | 73.0 | 50.0 | "B12" | "Regular" | "D" | 983.0 | "Pays-de-la-Loire" | "2" | 0.1 | 0 |
| 134313 | 4.0 | 11.0 | 60.0 | 62.0 | "B1" | "Regular" | "E" | 3744.0 | "Provence-Alpes-Cotes-D'Azur" | "1" | 1.0 | 0 |
| 1145209 | 7.0 | 9.0 | 37.0 | 50.0 | "B12" | "Regular" | "C" | 204.0 | "Pays-de-la-Loire" | "2" | 0.06 | 0 |
| 2281532 | 5.0 | 4.0 | 43.0 | 54.0 | "B1" | "Diesel" | "E" | 3317.0 | "Provence-Alpes-Cotes-D'Azur" | "3" | 0.5 | 0 |


## 1. Standard GLM (No Regularization)

First, let's fit a standard Poisson GLM for comparison.

In [2]:
# Standard GLM (no regularization)
model_std = rs.glm(
    formula="ClaimCount ~ VehPower + VehAge + C(Area) + C(Region)",
    data=data,
    family="poisson"
).fit()

print(f"Standard GLM: {len(model_std.params)} parameters")
print(f"Deviance: {model_std.deviance:.2f}")
print(f"Converged: {model_std.converged}")

Standard GLM: 28 parameters
Deviance: 212748.25
Converged: True
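
The `Deviance` printed above is the Poisson deviance, 2 * sum(y * log(y / mu) - (y - mu)), where the log term is taken as zero when y = 0. The following is a minimal sketch of that formula in plain Python, run on made-up counts and fitted means rather than the notebook's data; `poisson_deviance` is a hypothetical helper, not part of `rustystats`.

```python
import math

def poisson_deviance(y, mu):
    """Poisson deviance: 2 * sum(y*log(y/mu) - (y - mu)), with 0*log(0) taken as 0."""
    total = 0.0
    for yi, mi in zip(y, mu):
        log_term = yi * math.log(yi / mi) if yi > 0 else 0.0
        total += log_term - (yi - mi)
    return 2.0 * total

# Made-up observed counts and fitted means, for illustration only
y = [0, 1, 0, 2, 0]
mu_good = [0.2, 0.9, 0.1, 1.5, 0.3]  # means that track the counts
mu_flat = [0.6] * 5                  # intercept-only fit: everyone gets the mean count

print(f"tracking fit: {poisson_deviance(y, mu_good):.4f}")
print(f"flat fit:     {poisson_deviance(y, mu_flat):.4f}")
```

A lower deviance means the fitted means sit closer to the observed counts, which is why it serves as the common yardstick for the models that follow.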


## 2. Ridge Regularization (L2)

Ridge regression adds an L2 penalty that shrinks coefficients toward zero but keeps all variables in the model. Use `l1_ratio=0.0` for pure Ridge.

In [3]:
# Ridge regularization (L2 penalty)
model_ridge = rs.glm(
    formula="ClaimCount ~ VehPower + VehAge + C(Area) + C(Region)",
    data=data,
    family="poisson"
).fit(alpha=0.1, l1_ratio=0.0)  # l1_ratio=0 means pure Ridge

print(f"Ridge GLM (α=0.1):")
print(f"  Penalty type: {model_ridge.penalty_type}")
print(f"  Deviance: {model_ridge.deviance:.2f}")
print(f"  Non-zero coefficients: {model_ridge.n_nonzero()} (Ridge keeps all)")

Ridge GLM (α=0.1):
  Penalty type: ridge
  Deviance: 212748.25
  Non-zero coefficients: 27 (Ridge keeps all)
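
To see why Ridge shrinks coefficients without ever setting them to zero, it helps to look at the simplest possible case: one predictor, squared-error loss, no intercept. This is an illustrative least-squares stand-in, not the penalized Poisson fit that `rustystats` performs. For the toy objective sum((y - b*x)^2) + lam * b^2, the minimizer has the closed form b = sum(x*y) / (sum(x^2) + lam); `ridge_slope` and the data are made up for the sketch.

```python
def ridge_slope(x, y, lam):
    """Minimizer of sum((y - b*x)^2) + lam * b^2 for a single predictor."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + lam)

# Made-up data with a slope of roughly 2
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]

for lam in [0.0, 1.0, 10.0, 100.0]:
    print(f"lambda={lam:>5}: slope={ridge_slope(x, y, lam):.4f}")
```

The penalty only enlarges the denominator, so the slope moves toward zero as `lam` grows but stays non-zero for any finite `lam`.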


## 3. Lasso Regularization (L1)

Lasso adds an L1 penalty that can shrink coefficients exactly to zero, performing automatic **variable selection**. Use `l1_ratio=1.0` for pure Lasso.

In [4]:
# Lasso regularization (L1 penalty)
model_lasso = rs.glm(
    formula="ClaimCount ~ VehPower + VehAge + C(Area) + C(Region)",
    data=data,
    family="poisson"
).fit(alpha=0.01, l1_ratio=1.0)  # l1_ratio=1 means pure Lasso

print(f"Lasso GLM (α=0.01):")
print(f"  Penalty type: {model_lasso.penalty_type}")
print(f"  Deviance: {model_lasso.deviance:.2f}")
print(f"  Non-zero coefficients: {model_lasso.n_nonzero()} out of {len(model_lasso.params)-1}")
print(f"\n  Selected features: {model_lasso.selected_features()}")

Lasso GLM (α=0.01):
  Penalty type: lasso
  Deviance: 212748.25
  Non-zero coefficients: 27 out of 27

  Selected features: ['VehPower', 'VehAge', 'Area[T.B]', 'Area[T.C]', 'Area[T.D]', 'Area[T.E]', 'Area[T.F]', 'Region[T.Aquitaine]', 'Region[T.Auvergne]', 'Region[T.Basse-Normandie]', 'Region[T.Bourgogne]', 'Region[T.Bretagne]', 'Region[T.Centre]', 'Region[T.Champagne-Ardenne]', 'Region[T.Corse]', 'Region[T.Franche-Comte]', 'Region[T.Haute-Normandie]', 'Region[T.Ile-de-France]', 'Region[T.Languedoc-Roussillon]', 'Region[T.Limousin]', 'Region[T.Midi-Pyrenees]', 'Region[T.Nord-Pas-de-Calais]', 'Region[T.Pays-de-la-Loire]', 'Region[T.Picardie]', 'Region[T.Poitou-Charentes]', "Region[T.Provence-Alpes-Cotes-D'Azur]", 'Region[T.Rhone-Alpes]']
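
The exact zeros that Lasso can produce come from the soft-thresholding operator that L1-penalized solvers such as coordinate descent typically apply to each coefficient. The sketch below uses a hypothetical helper name and made-up inputs; whether `rustystats` uses this exact update internally is not stated in this notebook.

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t; anything within [-t, t] becomes exactly 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

# Made-up unpenalized coefficient estimates
raw = [1.2, -0.4, 0.03, -0.7]

for t in [0.05, 0.5, 1.0]:
    shrunk = [round(soft_threshold(z, t), 3) for z in raw]
    n_kept = sum(1 for s in shrunk if s != 0.0)
    print(f"threshold={t}: {shrunk} ({n_kept} non-zero)")
```

Coefficients survive only while their magnitude exceeds the threshold. In the run above, all 27 coefficients stayed non-zero at α=0.01, which suggests the effective threshold at that setting was too small to remove any of them.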


## 4. Elastic Net (L1 + L2)

Elastic Net combines L1 and L2 penalties. It can select variables like Lasso while handling correlated predictors better. Use `0 < l1_ratio < 1` for Elastic Net.

In [5]:
# Elastic Net (mix of L1 and L2)
model_enet = rs.glm(
    formula="ClaimCount ~ VehPower + VehAge + C(Area) + C(Region)",
    data=data,
    family="poisson"
).fit(alpha=0.01, l1_ratio=0.5)  # 50% L1, 50% L2

print(f"Elastic Net GLM (α=0.01, l1_ratio=0.5):")
print(f"  Penalty type: {model_enet.penalty_type}")
print(f"  Deviance: {model_enet.deviance:.2f}")
print(f"  Non-zero coefficients: {model_enet.n_nonzero()}")

Elastic Net GLM (α=0.01, l1_ratio=0.5):
  Penalty type: elasticnet
  Deviance: 212748.25
  Non-zero coefficients: 27
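
One common way to write the Elastic Net penalty is the glmnet / scikit-learn style form alpha * (l1_ratio * sum(|b|) + 0.5 * (1 - l1_ratio) * sum(b^2)). The sketch below assumes that form; the exact scaling `rustystats` applies is not shown in this notebook, and `elastic_net_penalty` is a hypothetical helper evaluated on made-up coefficients.

```python
def elastic_net_penalty(coefs, alpha, l1_ratio):
    """alpha * (l1_ratio * sum|b| + 0.5 * (1 - l1_ratio) * sum(b^2)).

    The caller is expected to pass slope coefficients only (no intercept).
    """
    l1 = sum(abs(b) for b in coefs)
    l2 = sum(b * b for b in coefs)
    return alpha * (l1_ratio * l1 + 0.5 * (1.0 - l1_ratio) * l2)

coefs = [0.5, -0.25, 0.0, 1.0]  # made-up slope coefficients

ridge_only = elastic_net_penalty(coefs, alpha=0.1, l1_ratio=0.0)  # pure L2
lasso_only = elastic_net_penalty(coefs, alpha=0.1, l1_ratio=1.0)  # pure L1
half_half = elastic_net_penalty(coefs, alpha=0.1, l1_ratio=0.5)   # equal mix

print(f"ridge={ridge_only:.6f}  lasso={lasso_only:.6f}  mix={half_half:.6f}")
```

Under this parameterization `l1_ratio` interpolates linearly between the two pure penalties, so `l1_ratio=0.5` gives exactly the average of the Ridge and Lasso penalty values.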


In [6]:
# Get diagnostics object
diag = model_enet.diagnostics(
    data=data,
    categorical_factors=["Region", "Area"],
    continuous_factors=["VehPower", "VehAge"]
)

# Export as JSON for LLM consumption
json_output = diag.to_json()

In [7]:
print(json_output)



## 5. Compare Coefficients

Let's compare how regularization affects the coefficient estimates.

In [8]:
import polars as pl

# Get feature names
feature_names = model_std.feature_names

# Create comparison table
comparison = pl.DataFrame({
    'Feature': feature_names,
    'Standard': model_std.params,
    'Ridge': model_ridge.params,
    'Lasso': model_lasso.params,
    'ElasticNet': model_enet.params,
})

# Show first 10 features
comparison.head(10)

| Feature | Standard | Ridge | Lasso | ElasticNet |
|---------|----------|-------|-------|------------|
| str | f64 | f64 | f64 | f64 |
| "Intercept" | -2.796935 | -2.799459 | -2.797706 | -2.797461 |
| "VehPower" | -0.008496 | -0.008496 | -0.008496 | -0.008496 |
| "VehAge" | -0.021198 | -0.021198 | -0.021198 | -0.021198 |
| "Area[T.B]" | 0.048648 | 0.048633 | 0.048636 | 0.048641 |
| "Area[T.C]" | 0.08998 | 0.089969 | 0.089971 | 0.089975 |
| "Area[T.D]" | 0.200838 | 0.200824 | 0.200828 | 0.200832 |
| "Area[T.E]" | 0.243506 | 0.243497 | 0.243498 | 0.243502 |
| "Area[T.F]" | 0.343668 | 0.343608 | 0.343642 | 0.343652 |
| "Region[T.Aquitaine]" | -0.308201 | -0.305647 | -0.307416 | -0.307666 |
| "Region[T.Auvergne]" | -0.494127 | -0.491335 | -0.493298 | -0.493558 |

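The differences are easier to read as relative shrinkage, (|standard| - |penalized|) / |standard|. The sketch below hard-codes four rows copied from the comparison table above instead of reading them from the fitted models; `shrinkage_pct` is a hypothetical helper.

```python
# (feature, standard, ridge, lasso, elastic net), copied from the comparison table
rows = [
    ("VehPower", -0.008496, -0.008496, -0.008496, -0.008496),
    ("Area[T.F]", 0.343668, 0.343608, 0.343642, 0.343652),
    ("Region[T.Aquitaine]", -0.308201, -0.305647, -0.307416, -0.307666),
    ("Region[T.Auvergne]", -0.494127, -0.491335, -0.493298, -0.493558),
]

def shrinkage_pct(standard, penalized):
    """Percentage by which the penalized coefficient sits closer to zero."""
    return 100.0 * (abs(standard) - abs(penalized)) / abs(standard)

for name, std, ridge, lasso, enet in rows:
    print(
        f"{name:<22} ridge {shrinkage_pct(std, ridge):.2f}%  "
        f"lasso {shrinkage_pct(std, lasso):.2f}%  "
        f"enet {shrinkage_pct(std, enet):.2f}%"
    )
```

On these rows every penalized coefficient is within 1% of its unpenalized value, which is consistent with the near-identical deviances reported in the earlier sections.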

## 6. Cross-Validation for Optimal Alpha

Use cross-validation to find the optimal regularization strength. This helps balance model fit vs. complexity.

In [9]:
# Build design matrix for CV (formula API builds this internally)
from rustystats.formula import build_design_matrix

y, X, names = build_design_matrix(
    "ClaimCount ~ VehPower + VehAge + C(Area) + C(Region)",
    data
)
print(f"Design matrix: {X.shape[0]:,} rows × {X.shape[1]} columns")

Design matrix: 678,012 rows × 28 columns
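
The idea behind choosing alpha by cross-validation can be sketched without any library: split the rows into K folds, fit on K-1 of them for each candidate alpha, score the held-out fold with the Poisson deviance, and keep the alpha with the lowest average score. The code below is a self-contained illustration on synthetic data with a single predictor and a Ridge penalty. It is not the `rustystats` cross-validation API, which this notebook does not show, and every function name in it is hypothetical.

```python
import math
import random

def poisson_deviance(y, mu):
    """Mean Poisson deviance; the y*log(y/mu) term is 0 when y == 0."""
    total = 0.0
    for yi, mi in zip(y, mu):
        total += (yi * math.log(yi / mi) if yi > 0 else 0.0) - (yi - mi)
    return 2.0 * total / len(y)

def fit_poisson_ridge(x, y, alpha, n_iter=25):
    """Newton fit of log(mu) = b0 + b1*x, minimizing mean NLL + 0.5*alpha*b1^2."""
    n = len(y)
    b0, b1 = math.log(max(sum(y) / n, 1e-8)), 0.0
    for _ in range(n_iter):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Gradient and Hessian of the penalized objective (intercept unpenalized)
        g0 = sum(m - yi for m, yi in zip(mu, y)) / n
        g1 = sum((m - yi) * xi for m, yi, xi in zip(mu, y, x)) / n + alpha * b1
        h00 = sum(mu) / n
        h01 = sum(m * xi for m, xi in zip(mu, x)) / n
        h11 = sum(m * xi * xi for m, xi in zip(mu, x)) / n + alpha
        det = h00 * h11 - h01 * h01
        b0 -= (h11 * g0 - h01 * g1) / det
        b1 -= (h00 * g1 - h01 * g0) / det
    return b0, b1

def sample_poisson(rng, mu):
    """Knuth's method; adequate for the small means used here."""
    limit, k, p = math.exp(-mu), 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k

# Synthetic data (not the insurance set): true model log(mu) = -0.5 + 0.8*x
rng = random.Random(0)
x = [rng.uniform(-1.0, 1.0) for _ in range(300)]
y = [sample_poisson(rng, math.exp(-0.5 + 0.8 * xi)) for xi in x]

k = 5
folds = [list(range(i, len(x), k)) for i in range(k)]  # interleaved fold indices
alphas = [0.0, 0.01, 0.1, 1.0, 10.0]

cv_scores = {}
for alpha in alphas:
    fold_scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(len(x)) if i not in held]
        b0, b1 = fit_poisson_ridge([x[i] for i in train], [y[i] for i in train], alpha)
        mu = [math.exp(b0 + b1 * x[i]) for i in held_out]
        fold_scores.append(poisson_deviance([y[i] for i in held_out], mu))
    cv_scores[alpha] = sum(fold_scores) / k

best_alpha = min(cv_scores, key=cv_scores.get)
print({a: round(s, 4) for a, s in cv_scores.items()}, "best:", best_alpha)
```

Scoring on held-out rows is what separates this from simply minimizing the training deviance, which would always favor alpha = 0.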


## 7. Final Model with Optimal Alpha

Fit the final model using the CV-selected alpha on the full dataset.

In [9]:
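# Fit final model with the CV-selected alpha.
# `cv_result` must already exist: it is the output of the cross-validation
# run from Section 6 (that call is not shown in this notebook export).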
model_final = rs.glm(
    formula="ClaimCount ~ VehPower + VehAge + C(Area) + C(Region)",
    data=data,
    family="poisson"
).fit(alpha=cv_result.alpha_best, l1_ratio=1.0)

print(model_final.summary())

                                 GLM Results                                  

Family:              Poisson         No. Observations:        678012
Link Function:       (default)       Df Residuals:            677984
Method:              IRLS + Lasso    Df Model:                    27
Scale:               1.0000          Alpha (λ):               0.0000
L1 Ratio:            1.00            Iterations:                   6
Non-zero coefs:      27             

Log-Likelihood:         -140874.1999 Deviance:                212748.2537
AIC:                     281804.3999 Null Deviance:           214041.4441
BIC:                     282124.3537 Pearson chi2:              716467.64
Converged:           True           

------------------------------------------------------------------------------
Variable                             Coef    Std.Err        z    P>|z|                 95% CI     
------------------------------------------------------------------------------
Intercept           
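
The information criteria in this summary can be checked by hand, assuming the usual definitions AIC = -2*logL + 2k and BIC = -2*logL + k*ln(n), with k taken as the number of estimated parameters (observations minus residual degrees of freedom). The numbers below are copied from the printed summary; that `rustystats` uses exactly these definitions is an inference from the arithmetic matching, not something the notebook states.

```python
import math

# Values copied from the printed summary
log_lik = -140874.1999
n_obs = 678012
df_resid = 677984
deviance = 212748.2537
null_deviance = 214041.4441

k = n_obs - df_resid  # 28 parameters: 27 slopes plus the intercept
aic = -2.0 * log_lik + 2.0 * k
bic = -2.0 * log_lik + k * math.log(n_obs)
deviance_explained = 1.0 - deviance / null_deviance

print(f"k={k}  AIC={aic:.4f}  BIC={bic:.4f}")
print(f"Share of null deviance explained: {deviance_explained:.4%}")
```

Both criteria agree with the summary up to rounding of the printed log-likelihood, and the fitted model explains roughly 0.6% of the null deviance.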

In [10]:
# Show which variables were selected
selected_names = model_final.selected_features()

print(f"\nSelected {len(selected_names)} variables:")
for name in selected_names:
    print(f"  - {name}")


Selected 27 variables:
  - VehPower
  - VehAge
  - Area[T.B]
  - Area[T.C]
  - Area[T.D]
  - Area[T.E]
  - Area[T.F]
  - Region[T.Aquitaine]
  - Region[T.Auvergne]
  - Region[T.Basse-Normandie]
  - Region[T.Bourgogne]
  - Region[T.Bretagne]
  - Region[T.Centre]
  - Region[T.Champagne-Ardenne]
  - Region[T.Corse]
  - Region[T.Franche-Comte]
  - Region[T.Haute-Normandie]
  - Region[T.Ile-de-France]
  - Region[T.Languedoc-Roussillon]
  - Region[T.Limousin]
  - Region[T.Midi-Pyrenees]
  - Region[T.Nord-Pas-de-Calais]
  - Region[T.Pays-de-la-Loire]
  - Region[T.Picardie]
  - Region[T.Poitou-Charentes]
  - Region[T.Provence-Alpes-Cotes-D'Azur]
  - Region[T.Rhone-Alpes]


## Summary

| Method | Alpha | L1 Ratio | Non-zero | Deviance |
|--------|-------|----------|----------|----------|
| Standard GLM | 0 | - | All | 212748.25 (baseline) |
| Ridge | 0.1 | 0.0 | 27 of 27 | 212748.25 |
| Lasso | 0.01 | 1.0 | 27 of 27 | 212748.25 |
| Elastic Net | 0.01 | 0.5 | 27 of 27 | 212748.25 |
| CV-Optimal | `cv_result.alpha_best` (printed as 0.0000) | 1.0 | 27 of 27 | 212748.25 |

**Key Takeaways:**
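- **Ridge** shrinks coefficients and stabilizes estimates under multicollinearity, but keeps all variables
- **Lasso** can perform variable selection by shrinking weak predictors exactly to zero; at α=0.01 on this dataset it kept all 27
- **Elastic Net** combines both: selection from the L1 part, stability under correlated predictors from the L2 part
- **Cross-validation** picks the regularization strength that scores best on held-out data, balancing fit against complexity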