# Using Geographic data in GLMs

One interesting use case for GLMs with regularization is to use geographic data in forecasting tasks. Say you want to predict risk for car insurance. Accounting for geography is important, because where people live impacts the frequency and severity of their accidents.

Examples:
 - higher frequency of accidents in urban areas, lower severity on average
 - more weather related accidents near the coasts or mountains
 - higher severity of accidents in wealthier areas, because repair costs are higher

While one could try to find more direct predictors for the examples above, it is very convenient to simply use zip code as a predictor. The problem with using zip code is that it is a very high-dimensional fixed effect, and some zip codes may have very little exposure (i.e. very few observations). So estimates may be very noisy (or not defined) for some zip codes and fairly precise for others. Another problem with zip codes is that they tell us where drivers live, not necessarily where they drive.

# 1. The traditional approach

In practice, actuaries use spatial smoothing to rein in the zip code fixed effects. The (simplified) process works as follows:
- Train a GLM such as $E[y|x] = \exp(X' \beta)$
- Retrieve zip code level residuals from this model (as in the zip code fixed effects that are obtained using $X'\hat \beta$ as offset)
- Spatially smooth the zip code level residuals using some kernel, e.g. Gaussian. We take into account the exposure of zip codes and the distance between zip codes.
- Take the smooth zip code level residuals and cluster them into 30 zones using, e.g., Ward clustering.
- Take the zones, one-hot encode them, call them $Z$, and fit the GLM with them: $E[y|x] = \exp(X' \beta + Z' \gamma)$
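
The clustering and encoding steps at the end of this list can be sketched as follows. This is a minimal illustration, not the production procedure: it assumes that we cluster on the smoothed residual values alone (so zones need not be geographically contiguous), and the function name `residuals_to_zones` is hypothetical. It relies on SciPy's `linkage` and `fcluster`.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def residuals_to_zones(smoothed, n_zones=30):
    """Ward-cluster smoothed zip code residuals and one-hot encode the zones.

    smoothed : (J,) smoothed residual per zip code
    Returns integer zone labels (J,) and a one-hot matrix Z of shape (J, k),
    where k <= n_zones is the number of clusters actually found.
    """
    # Assumption: the only clustering feature is the smoothed residual itself.
    features = np.asarray(smoothed, dtype=float).reshape(-1, 1)
    tree = linkage(features, method="ward")
    # fcluster labels start at 1; shift them to start at 0.
    labels = fcluster(tree, t=n_zones, criterion="maxclust") - 1
    one_hot = np.eye(labels.max() + 1)[labels]
    return labels, one_hot
```

To use $Z$ in the final GLM, the zone of each zip code is mapped back to the observations in that zip code. If the model has an intercept, one zone column would typically be dropped as the reference level.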

## How does the smoothing work?

- $r_j$ refers to the residual of region $j$
- $d_{jj'}$ refers to the distance between regions $j$ and $j'$
- $e_j$ refers to exposure of region $j$
- $f(\cdot)$ is a Kernel that we use for weighting
 
The smoothed residual is given by:
 
\begin{align}
\tilde {r_j} = w_j r_j + (1 - w_j) \frac{\sum_{j' \neq j} f(d_{jj'}) e_{j'} r_{j'}}{\sum_{j' \neq j} f(d_{jj'}) e_{j'}}
\end{align}
 
where $w_j$ refers to the weight placed on region $j$'s own residual:
 
\begin{align}
w_j = \left(\frac{e_j}{\sum_{j' \neq j} f(d_{jj'}) e_{j'}}\right) ^ \rho
\end{align}
 
$f(\cdot)$ can be any kind of Kernel; here we use a Gaussian Kernel:
 
\begin{align}
f(d) = \exp\left( -\frac{1}{2} \left(\frac{d}{h}\right)^2 \right),
\end{align}
where $h$ refers to the bandwidth.
 
In this form, the smoother has two hyperparameters $h$ and $\rho$.
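
A minimal NumPy sketch of this smoother may help to make the formulas concrete. The function names are hypothetical, and it assumes a dense $J \times J$ distance matrix, which is only practical for a moderate number of regions. Clipping $w_j$ to $[0, 1]$ is an added assumption that is not part of the formula above; without it, a region with much more exposure than its neighbors would get a weight above one.

```python
import numpy as np


def gaussian_kernel(d, h):
    """Gaussian kernel f(d) = exp(-0.5 * (d / h) ** 2) with bandwidth h."""
    return np.exp(-0.5 * (d / h) ** 2)


def smooth_residuals(r, e, dist, h, rho):
    """Exposure- and distance-weighted smoothing of region-level residuals.

    r    : (J,) raw residuals r_j
    e    : (J,) exposures e_j
    dist : (J, J) pairwise distances d_{jj'}
    """
    k = gaussian_kernel(np.asarray(dist, dtype=float), h)
    np.fill_diagonal(k, 0.0)              # exclude j' = j from the sums
    denom = k @ e                         # sum_{j' != j} f(d_jj') e_j'
    neighbor_avg = (k @ (e * r)) / denom  # weighted average of the neighbors
    w = (e / denom) ** rho                # weight on the region's own residual
    w = np.clip(w, 0.0, 1.0)              # assumption: keep w_j in [0, 1]
    return w * r + (1.0 - w) * neighbor_avg


# Toy example with three regions on a line.
dist = np.array([[0.0, 1.0, 2.0],
                 [1.0, 0.0, 1.0],
                 [2.0, 1.0, 0.0]])
r = np.array([0.0, 1.0, 0.0])
e = np.array([100.0, 1.0, 100.0])
smoothed = smooth_residuals(r, e, dist, h=1.0, rho=0.5)
```

In the toy example, the middle region has little exposure, so its residual is pulled strongly towards that of its high-exposure neighbors. Note that the denominator can underflow to zero when a region has no other region within a few bandwidths, so $h$ should not be chosen too small relative to typical distances.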

## Where do we get the residuals from?

What's the correct residual for smoothing?
 
Index observations by $i$ and regions by $j$. All individuals in region $j$
are given by $I(j)$. The region of observation $i$ is given by $j(i)$.
 
Using a log link function, the conditional expectation of $y_i/e_i$ is modeled as
$$
\mathbb E[y_i / e_i | x_i] = \exp(x_i' \beta),
$$
where $e_i$ denotes exposure.
 
We run this model, which gives us $\hat{\beta}$.
 
So what's the correct residual to use? Note that we want the residual on the
zip code level. So we _define_ the residual as the $\alpha_j$ that solve
$$
\mathbb E[y_i / e_i | x_i] = \exp(x_i' \hat{\beta} + \alpha_{j(i)}).
$$
 
How to estimate $\{\alpha_j\}_j$?
 
Using $e_i$ to denote exposure, the interesting bit of the log-likelihood of a
Tweedie GLM with power $p \in (1, 2)$ is given by
$$\sum_i e_i \left(\frac{y_i}{e_i} \frac{\exp(x_i' \hat{\beta} + \alpha_{j(i)}) ^ {1-p}}{1-p} - \frac{\exp(x_i' \hat{\beta} + \alpha_{j(i)}) ^ {2 - p}}{2-p}\right),$$
where we use $e_i$ as weights. We skip the expression for
$p=1$ and $p=2$, but the first order conditions are the same.
 
For every region $j$, the first order condition with respect to $\alpha_j$ is given by:
 
$$\sum_{i \in I(j)} e_i \left( \frac{y_i}{e_i} \exp(x_i' \hat{\beta} + \alpha_j)^{1-p} - \exp(x_i' \hat{\beta} + \alpha_j)^{2-p} \right) = 0.$$
 
Re-arranging, we get
 
$$\sum_{i \in I(j)} y_i \exp(x_i' \hat{\beta} + \alpha_j)^{1-p} = \sum_{i \in I(j)} e_i \exp(x_i' \hat{\beta} + \alpha_j)^{2-p}$$
 
and then
 
$$\exp(\alpha_j)^{1-p} \sum_{i \in I(j)} y_i \exp(x_i' \hat{\beta})^{1-p} = \exp(\alpha_j)^{2-p} \sum_{i \in I(j)} e_i \exp(x_i' \hat{\beta})^{2-p},$$
 
which means we can solve for $\alpha_j$ directly:
 
$$\alpha_j = \log \left( \sum_{i \in I(j)} y_i \exp(x_i' \hat{\beta})^{1-p} \right) - \log \left( \sum_{i \in I(j)} e_i \exp(x_i' \hat{\beta})^{2-p} \right).$$
 
For Poisson ($p=1$) this simplifies to:
 
$$\alpha_j = \log \left( \sum_{i \in I(j)} y_i \right) - \log \left( \sum_{i \in I(j)} e_i \exp(x_i' \hat{\beta}) \right).$$
 
For Gamma ($p=2$) this simplifies to:
 
$$\alpha_j = \log \left( \sum_{i \in I(j)} y_i \exp(-x_i' \hat{\beta}) \right) - \log \left( \sum_{i \in I(j)} e_i \right).$$
 
Note that $\alpha_j$ is a zip-code level fixed effect. However, unlike a
typical fixed effect, it is estimated sequentially, i.e. we're not jointly
estimating it with the main effects $\beta$. To the extent that regional effects
can be explained by $x_i$, the model will load on $\beta$, not $\alpha_j$.
 
Note that for Poisson, we have a bit of an issue for all zip codes $j$ for
which $y_i = 0$ for all $i \in I(j)$, because the fixed effects would be
$-\infty$. As a first pass, we may just want to hard code a lower bound.
It would be cleaner to use a Bayesian approach (or something similar to a
Bayesian approach, such as a credibility method).
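
The closed-form solution for $\alpha_j$ derived above can be computed with two grouped sums. The following is a minimal sketch, assuming that regions are coded as integers $0, \dots, J-1$ and that the first-stage linear predictor $x_i' \hat{\beta}$ is already available as an array; the function name `region_effects` is hypothetical.

```python
import numpy as np


def region_effects(y, e, eta, region, n_regions, p):
    """Closed-form alpha_j given the first-stage linear predictor.

    y         : (n,) responses y_i
    e         : (n,) exposures e_i
    eta       : (n,) first-stage linear predictor x_i' beta_hat
    region    : (n,) integer region index j(i) in 0..n_regions-1
    n_regions : number of regions J
    p         : Tweedie power (1 for Poisson, 2 for Gamma)
    """
    # exp(eta) ** (1 - p) is written as exp((1 - p) * eta).
    num = np.bincount(region, weights=y * np.exp((1 - p) * eta), minlength=n_regions)
    den = np.bincount(region, weights=e * np.exp((2 - p) * eta), minlength=n_regions)
    # Regions where all y_i are zero give -inf; regions without any
    # observations give nan. Both are passed through for the caller to handle.
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.log(num) - np.log(den)


y = np.array([1.0, 2.0, 3.0, 0.0, 0.0])
e = np.array([1.0, 1.0, 2.0, 2.0, 1.0])
eta = np.zeros(5)
region = np.array([0, 0, 1, 1, 2])
alpha = region_effects(y, e, eta, region, n_regions=3, p=1.0)
```

In this toy example the third region has no claims, so its effect is $-\infty$. The simple lower bound mentioned above could then be applied with `np.maximum(alpha, lower)` for some chosen value `lower`.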

## Extensions

- Typical extensions of this approach involve using some geographic features (e.g., does the person live in a city? average income of the zip code?) in the first stage model and only smoothing the residual that is not easily explained by such variables. This also has the advantage of not treating areas that are very close but very different as (almost) equal.
- Other extensions involve different distance metrics:
    - as the crow flies
    - actual driving times from centroid to centroid
    - neighborhood based distance measure: how many borders do I have to cross to travel from $j$ to $j'$


# 2. Using Regularized GLMs

Instead of the multi-step procedure outlined above, we can also do all of this in one go. See Section 6 in [this tutorial](https://github.com/lorentzenchr/Tutorial_freMTPL2/blob/b99b688f4be3c50d9a3356cc95bc4504742040d0/glm_freMTPL2_example.ipynb).

\begin{align*}
\mathbb E[y_{ij} / e_{ij} | x_{ij}] = \exp(x_{ij}' \beta + \alpha_j)
\end{align*}

Again, we use $i$ to index observations and $j$ to index regions.

When we estimate the GLM, we use L2 regularization on the vector of zip code effects $\alpha$. 

\begin{align}
\min_{\alpha, \beta} -\mathcal L + \lambda \alpha P \alpha'
\end{align}

where $\mathcal L$ is the log-likelihood, $\alpha$ is a $1 \times J$ vector, and $P$ is a $J \times J$ matrix. $\lambda$ is a scalar (for convenience; it can also be absorbed by $P$). This is known as [Tikhonov Regularization](https://en.wikipedia.org/wiki/Tikhonov_regularization).
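
To make the objective concrete, here is a minimal sketch for the Poisson case. It assumes integer region codes and a given penalty matrix $P$; the function name is hypothetical, and terms of the log-likelihood that do not depend on the parameters are dropped. It only evaluates the objective and says nothing about how to minimize it efficiently.

```python
import numpy as np


def penalized_poisson_objective(beta, alpha, X, y, e, region, P, lam):
    """Negative Poisson log-likelihood (up to constants) plus lam * alpha P alpha'.

    X      : (n, k) design matrix
    y      : (n,) responses, e : (n,) exposures
    region : (n,) integer region index of each observation
    P      : (J, J) penalty matrix, lam : scalar penalty strength
    """
    eta = X @ beta + alpha[region]         # x_ij' beta + alpha_j
    mu = np.exp(eta)                       # expected value of y / e
    neg_loglik = np.sum(e * mu - y * eta)  # exposure-weighted, constants dropped
    return neg_loglik + lam * (alpha @ P @ alpha)
```

With `lam=0` this reduces to the unpenalized fixed effects model; larger values of `lam` pull the $\alpha_j$ towards whatever structure $P$ encodes.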

Creating $P$ is straightforward but a little bit of work. We want the differences between $\alpha_j$ and $\alpha_{j'}$ to be regularized. $\alpha_j$ and $\alpha_{j'}$ may be further apart from each other when regions $j$ and $j'$ are far apart or when $j$ and $j'$ have lots of exposure. The weight matrix $P$ can encode the (exposure-weighted) distance between zip code areas (same as above).
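
One way to build such a $P$ is as a weighted graph Laplacian, so that the quadratic form equals a weighted sum of squared differences between region effects. The sketch below is an illustration under assumptions: the function name is hypothetical, the pairwise weights come from a Gaussian kernel on distance only, and exposure weighting is left out because its exact form is not pinned down here. Any symmetric, non-negative weight matrix could be passed instead.

```python
import numpy as np


def laplacian_penalty(weights):
    """Build P with alpha @ P @ alpha = sum_{j < j'} w_jj' * (alpha_j - alpha_j')**2.

    weights : (J, J) symmetric, non-negative pairwise weights w_jj'
    """
    w = np.array(weights, dtype=float)  # copy, so the input is not modified
    np.fill_diagonal(w, 0.0)            # a region is not compared with itself
    return np.diag(w.sum(axis=1)) - w


dist = np.array([[0.0, 1.0, 4.0],
                 [1.0, 0.0, 3.0],
                 [4.0, 3.0, 0.0]])
h = 2.0  # hypothetical bandwidth
W = np.exp(-0.5 * (dist / h) ** 2)  # nearby regions get large weights
P = laplacian_penalty(W)

alpha = np.array([0.3, -0.1, 0.5])
penalty = alpha @ P @ alpha
```

A matrix built this way does not penalize a common shift of all $\alpha_j$, since its rows sum to zero. The overall level of the region effects is then identified only through the intercept or an additional constraint.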

## What makes this tricky?
- We have very many zip codes ($\approx10,000$)

## First steps
- Start with [this tutorial](https://github.com/lorentzenchr/Tutorial_freMTPL2/blob/b99b688f4be3c50d9a3356cc95bc4504742040d0/glm_freMTPL2_example.ipynb), familiarize yourself with the problem, and adapt it to work with our code base
- Extend the neighborhood based distance metric used there to a distance based metric
- Compare the results with the traditional step-by-step approach

## Next steps
- Find (or generate) a data set for a prediction task that has a high-dimensional regional fixed effect
- Figure out how to handle the large number of fixed effects (e.g. do not construct the full hessian)
- Construct the weight matrix $P$ (consider preserving sparsity for regions that are far apart)
- Figure out how to effectively tune the various hyperparameters that we have ($\lambda$ and the parameters that went into the construction of $P$)

## Extensions
- There's no need to restrict this to L2 regularization

# Literature

- tbc