
# GeoEmbeddings for Poisson GLM – Full Educational Workflow

## Conceptual Introduction

Traditional actuarial Poisson GLMs use handcrafted features and exposure offsets to model claim frequency.  
However, spatial risk is often complex, nonlinear, and multi-dimensional.

Instead of feeding raw coordinates directly into a GLM, we first learn a **vector representation (embedding)** of geography using an autoencoder.

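As a compact statement of the two-stage idea, the model can be written as below. The notation is an assumption introduced here for illustration and is not taken from the cited paper: $g_i$ is the vector of geographic variables for policy $i$, $x_i$ the traditional rating variables, $e_i$ the exposure, $N_i$ the claim count, and $f_{\text{enc}}$ the encoder half of the autoencoder.

$$
z_i = f_{\text{enc}}(g_i), \qquad
N_i \sim \text{Poisson}(\mu_i), \qquad
\log \mu_i = \log e_i + \beta_0 + x_i^\top \beta + z_i^\top \gamma
$$

The term $\log e_i$ is the offset: its coefficient is fixed at 1, so $\mu_i / e_i$ is the expected claim frequency per unit of exposure.
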
### Why is this novel?

- Raw coordinates do not encode spatial similarity: entered as linear terms, longitude and latitude can only describe a risk surface that is a plane.
- High-dimensional spatial features (hazard, census, terrain) interact nonlinearly.
- Representation learning compresses that spatial structure into one low-dimensional dense vector per location.
- The GLM remains interpretable while benefiting from richer spatial structure.

This idea is inspired by Blier-Wong et al. (2021), *Geographic Ratemaking with Spatial Embeddings*.

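To make the first bullet concrete, here is a minimal sketch, assuming a hypothetical risk surface that peaks at the origin and decays with distance. It compares a linear fit on raw coordinates with a fit on a single derived spatial feature. The surface, the helper `r_squared`, and the sample size are illustrative assumptions, not part of the workflow below.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
lon = rng.uniform(-100, 100, n)
lat = rng.uniform(-100, 100, n)

# Hypothetical risk surface: highest at the origin, decaying with distance.
risk = np.exp(-(lon**2 + lat**2) / 20000)

def r_squared(features, target):
    """R^2 of an ordinary least-squares fit with an intercept."""
    X = np.column_stack([np.ones(len(target))] + list(features))
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ coef
    return 1.0 - resid.var() / target.var()

r2_raw = r_squared([lon, lat], risk)            # linear in raw coordinates
r2_dist = r_squared([lon**2 + lat**2], risk)    # one derived spatial feature

print(f"R^2 with raw lon/lat:      {r2_raw:.3f}")
print(f"R^2 with squared distance: {r2_dist:.3f}")
```

Because the surface is symmetric around the origin, a plane in longitude and latitude explains almost none of it, while a feature that captures the spatial structure explains most of it. An embedding is meant to supply features of the second kind without hand-crafting them.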

## 1. Imports

In [None]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
import statsmodels.formula.api as smf
from sklearn.metrics import mean_squared_error

np.random.seed(42)
torch.manual_seed(42)


## 2. Simulate Spatial and Traditional Data

In [None]:

N = 60000

longitude = np.random.uniform(-100,100,N)
latitude  = np.random.uniform(-100,100,N)

pop_density = np.exp(-(longitude**2+latitude**2)/20000) * np.random.uniform(10,300,N)
median_income = 35000 + 10000*np.sin(longitude/20) + 8000*np.cos(latitude/25)

flood_hazard = np.clip(np.sin(longitude/15)+np.cos(latitude/15),0,3)
storm_exposure = flood_hazard * np.random.rand(N)

elevation = 0.005*(longitude**2+latitude**2)+5*np.random.randn(N)
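# Rows are in random order, not spatial order, so this gradient is a noisy
# stand-in for roughness rather than a true spatial derivative.
terrain_roughness = np.abs(np.gradient(elevation))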

insured_value = np.random.uniform(50000,400000,N)
building_age  = np.random.randint(0,80,N)
policy_count  = np.random.randint(1,4,N)
exposure      = np.random.randint(1,5,N)

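# Standardize the two large-scale spatial drivers before they enter the linear
# predictor. With raw units, 0.3*pop_density alone can reach about 90, and
# exp(90) is far beyond what np.random.poisson accepts.
def zscore(x):
    return (x - x.mean()) / x.std()

linear_traditional = 0.000002*insured_value + 0.05*building_age + 0.3*policy_count
linear_spatial = 0.3*zscore(pop_density) + 0.5*flood_hazard + 0.2*storm_exposure + 0.1*zscore(elevation)

# The intercept of -6 is an illustrative choice that keeps claim frequencies small.
true_linear = -6.0 + linear_traditional + linear_spatial
rate = np.exp(true_linear)          # expected claims per unit of exposure
claim_count = np.random.poisson(rate*exposure)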

df = pd.DataFrame({
    "longitude":longitude,"latitude":latitude,
    "pop_density":pop_density,"median_income":median_income,
    "flood_hazard":flood_hazard,"storm_exposure":storm_exposure,
    "elevation":elevation,"terrain_roughness":terrain_roughness,
    "insured_value":insured_value,"building_age":building_age,
    "policy_count":policy_count,"exposure":exposure,
    "claim_count":claim_count
})

df.head()

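The simulation multiplies the rate by the exposure before sampling, which is the assumption the GLM offset relies on later. The following sketch, with a hypothetical constant rate of 0.08, shows why the mean count per policy and the count per unit of exposure are different quantities.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
true_rate = 0.08                      # hypothetical claims per unit of exposure
exposure = rng.integers(1, 5, n)      # 1 to 4 units, as in the simulation above
claims = rng.poisson(true_rate * exposure)

naive_mean = claims.mean()                 # ignores exposure
freq = claims.sum() / exposure.sum()       # exposure-weighted frequency

print(f"mean claims per policy: {naive_mean:.4f}")
print(f"claims per exposure:    {freq:.4f}")
```

Only the exposure-weighted frequency estimates the underlying rate. The per-policy mean is inflated by the average exposure.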

## 3. Train/Test Split

In [None]:

train = df.sample(frac=0.7, random_state=42)
test  = df.drop(train.index)


## 4. Train Spatial Autoencoder

In [None]:

geo_vars = ["longitude","latitude","pop_density","median_income",
            "flood_hazard","storm_exposure","elevation","terrain_roughness"]

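# Standardize with training-set statistics so that large-scale columns such as
# median_income do not dominate the MSE loss; test rows reuse the same statistics.
# The geographic columns are overwritten in place; the GLMs below do not use them.
train, test = train.copy(), test.copy()
geo_mean, geo_std = train[geo_vars].mean(), train[geo_vars].std()
train[geo_vars] = (train[geo_vars] - geo_mean) / geo_std
test[geo_vars]  = (test[geo_vars] - geo_mean) / geo_std

geo_train = torch.FloatTensor(train[geo_vars].values)
loader = DataLoader(TensorDataset(geo_train), batch_size=1024, shuffle=True)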

class GeoAutoencoder(nn.Module):
    def __init__(self,input_dim,latent_dim=5):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim,64),
            nn.ReLU(),
            nn.Linear(64,latent_dim)
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim,64),
            nn.ReLU(),
            nn.Linear(64,input_dim)
        )
    def forward(self,x):
        z = self.encoder(x)
        recon = self.decoder(z)
        return z,recon

model = GeoAutoencoder(len(geo_vars),5)
optimizer = torch.optim.Adam(model.parameters(),lr=1e-3)
criterion = nn.MSELoss()

for epoch in range(20):
    total_loss = 0
    for (xb,) in loader:
        z,recon = model(xb)
        loss = criterion(recon,xb)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    print(f"Epoch {epoch+1}, Loss={total_loss/len(loader):.4f}")

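A reconstruction loss is hard to judge in isolation. One possible reference point is a linear baseline: PCA with the same latent size, computed on standardized inputs. The sketch below uses a synthetic stand-in for the geographic matrix, and `pca_reconstruction_mse` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

def pca_reconstruction_mse(X, k):
    """Mean squared reconstruction error of rank-k PCA on column-standardized X."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    # Rows of Vt are the principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(Xs, full_matrices=False)
    components = Vt[:k]
    recon = Xs @ components.T @ components
    return float(np.mean((Xs - recon) ** 2))

rng = np.random.default_rng(0)
# Synthetic stand-in: 8 columns driven by 3 latent factors plus small noise.
latent = rng.normal(size=(2000, 3))
X = latent @ rng.normal(size=(3, 8)) + 0.1 * rng.normal(size=(2000, 8))

pca_mse = {k: pca_reconstruction_mse(X, k) for k in (1, 3, 5, 8)}
for k, mse in pca_mse.items():
    print(f"k={k}: MSE={mse:.4f}")
```

In the notebook, `X` would be the geographic feature matrix and `k` the latent size. If the autoencoder, trained on equally standardized inputs, does not reach a lower loss than this baseline at the same latent size, the nonlinear layers are probably not adding much.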

## 5. Extract Embeddings

In [None]:

with torch.no_grad():
    train_emb = model.encoder(torch.FloatTensor(train[geo_vars].values)).numpy()
    test_emb  = model.encoder(torch.FloatTensor(test[geo_vars].values)).numpy()

for i in range(train_emb.shape[1]):
    train[f"emb_{i}"] = train_emb[:,i]
    test[f"emb_{i}"]  = test_emb[:,i]

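Nothing in a plain autoencoder forces the latent dimensions to be uncorrelated, and strongly collinear embedding columns can make the GLM coefficients unstable. A quick check before fitting might look like the sketch below. It uses synthetic arrays in place of the real embeddings, and `max_abs_offdiag_corr` is a hypothetical helper.

```python
import numpy as np

def max_abs_offdiag_corr(emb):
    """Largest absolute pairwise correlation between embedding columns."""
    corr = np.corrcoef(emb, rowvar=False)
    off_diag = corr - np.diag(np.diag(corr))
    return float(np.abs(off_diag).max())

rng = np.random.default_rng(0)
independent = rng.normal(size=(1000, 5))

# Make column 4 almost a copy of column 0 to mimic a redundant latent dimension.
redundant = independent.copy()
redundant[:, 4] = redundant[:, 0] + 0.05 * rng.normal(size=1000)

corr_independent = max_abs_offdiag_corr(independent)
corr_redundant = max_abs_offdiag_corr(redundant)
print(f"independent columns: {corr_independent:.3f}")
print(f"redundant column:    {corr_redundant:.3f}")
```

In the notebook, the same function could be applied to `train_emb`. A value close to 1 suggests reducing the latent size or dropping a column.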

## 6. Baseline Poisson GLM

In [None]:

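import statsmodels.api as sm  # sm.families holds the GLM family classes

# offset() is not a built-in of the patsy formula language, so the log-exposure
# offset is passed through the offset argument instead.
formula_base = "claim_count ~ insured_value + building_age + policy_count"

baseline_glm = smf.glm(
    formula=formula_base,
    data=train,
    family=sm.families.Poisson(),
    offset=np.log(train["exposure"])
).fit()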

print(baseline_glm.summary())

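To show how the offset enters the fit, here is a minimal NumPy sketch of iteratively reweighted least squares for a log-link Poisson GLM. It is an illustration under stated assumptions, not what statsmodels does internally line by line. The function name, the simulated data, and the coefficients are hypothetical.

```python
import numpy as np

def fit_poisson_irls(X, y, offset, n_iter=25):
    """Fit a log-link Poisson GLM by iteratively reweighted least squares.

    X must have an intercept in column 0; offset is log(exposure).
    """
    beta = np.zeros(X.shape[1])
    # Start the intercept at the log of the overall claims-per-exposure rate.
    beta[0] = np.log(y.sum() / np.exp(offset).sum())
    for _ in range(n_iter):
        eta = X @ beta + offset          # linear predictor including the offset
        mu = np.exp(eta)
        # Working response with the offset removed, so beta never absorbs it.
        z = (eta - offset) + (y - mu) / mu
        XtW = X.T * mu                   # IRLS weights equal mu for the log link
        beta = np.linalg.solve(XtW @ X, XtW @ z)
    return beta

rng = np.random.default_rng(0)
n = 50_000
x = rng.normal(size=n)
exposure = rng.integers(1, 5, n)
true_beta = np.array([-2.0, 0.5])        # hypothetical intercept and slope
X = np.column_stack([np.ones(n), x])
y = rng.poisson(exposure * np.exp(X @ true_beta))

beta_hat = fit_poisson_irls(X, y, np.log(exposure))
print("estimated coefficients:", beta_hat)
```

The offset is added to the linear predictor with a fixed coefficient of 1 and subtracted again from the working response, so the estimated coefficients describe the rate per unit of exposure.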

## 7. Enhanced Poisson GLM with Embeddings

In [None]:

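import statsmodels.api as sm  # harmless if sm is already imported

emb_cols = [f"emb_{i}" for i in range(train_emb.shape[1])]
formula_enh = (
    "claim_count ~ insured_value + building_age + policy_count + "
    + " + ".join(emb_cols)
)

# The log-exposure offset is passed as an argument because patsy formulas
# have no built-in offset() term.
enhanced_glm = smf.glm(
    formula=formula_enh,
    data=train,
    family=sm.families.Poisson(),
    offset=np.log(train["exposure"])
).fit()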

print(enhanced_glm.summary())


## 8. Evaluation

In [None]:

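# Supply the test-set offset explicitly so that predictions are expected claim
# counts for each policy's exposure, comparable with claim_count.
test_offset = np.log(test["exposure"])
pred_base = baseline_glm.predict(test, offset=test_offset)
pred_enh  = enhanced_glm.predict(test, offset=test_offset)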

print("Baseline MSE:", mean_squared_error(test["claim_count"],pred_base))
print("Enhanced MSE:", mean_squared_error(test["claim_count"],pred_enh))

plt.scatter(test["claim_count"],pred_base,alpha=0.3,label="Baseline")
plt.scatter(test["claim_count"],pred_enh,alpha=0.3,label="Enhanced")
plt.xlabel("Observed claim count")
plt.ylabel("Predicted claim count")
plt.legend()
plt.title("Observed vs. Predicted Claim Counts")
plt.show()

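MSE treats every error alike, whereas count data is usually scored with the Poisson deviance, and a scatter of observed against predicted counts is hard to read when most counts are 0 or 1. The sketch below shows one possible alternative: the mean Poisson deviance plus a comparison of predicted and observed averages by prediction decile. Both helpers and the simulated data are hypothetical.

```python
import numpy as np

def mean_poisson_deviance(y, mu):
    """Mean Poisson deviance; the y*log(y/mu) term is taken as 0 when y == 0."""
    y = np.asarray(y, dtype=float)
    mu = np.asarray(mu, dtype=float)
    ratio = np.where(y > 0, y / mu, 1.0)
    return float(np.mean(2.0 * (y * np.log(ratio) - (y - mu))))

def calibration_by_decile(y, mu, n_bins=10):
    """Average predicted and observed counts within bins of sorted predictions."""
    y = np.asarray(y, dtype=float)
    mu = np.asarray(mu, dtype=float)
    bins = np.array_split(np.argsort(mu), n_bins)
    return [(float(mu[b].mean()), float(y[b].mean())) for b in bins]

rng = np.random.default_rng(0)
mu_true = rng.uniform(0.05, 1.0, 20_000)     # hypothetical expected counts
y = rng.poisson(mu_true)

dev_good = mean_poisson_deviance(y, mu_true)
dev_flat = mean_poisson_deviance(y, np.full_like(mu_true, y.mean()))
print(f"deviance, true means: {dev_good:.4f}")
print(f"deviance, flat mean:  {dev_flat:.4f}")

table = calibration_by_decile(y, mu_true)
for pred, obs in table:
    print(f"predicted {pred:.3f}  observed {obs:.3f}")
```

In the notebook, `y` would be `test["claim_count"].to_numpy()` and `mu` would be `np.asarray(pred_base)` or `np.asarray(pred_enh)`. A lower deviance and decile averages that track each other more closely would both favor the enhanced model.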