# Using geographic data in GLMs

One interesting use case for GLMs with regularization is to incorporate geographic data in forecasting tasks. Say you want to predict risk for car insurance. It is important to take geography into account, because where people live impacts the frequency with which they have accidents as well as the severity of those accidents.

Examples:
 - higher frequency of accidents in urban areas, lower severity on average;
 - more weather-related accidents near the coasts or in the mountains;
 - higher severity of accidents in wealthier areas, because repair costs are higher.

While one could try to find more direct predictors for the examples above, it is very convenient to simply include zip codes as predictors. However, zip codes imply a very high-dimensional fixed effect, and some may contain very little exposure (i.e., very few observations). As a consequence, estimates may be very noisy (or not defined at all) for some zip codes and fairly precise for others. Another problem is that zip codes tell us where drivers live, but not necessarily where they drive.

# 1. The traditional approach

In practice, actuaries use spatial smoothing to rein in the zip code effects. The (simplified) process works as follows:
- Train a first GLM, such as $E[y|x] = \exp(X' \beta)$, to obtain a linear predictor $X'\hat \beta$.
- Retrieve residuals for each zip code from this model, i.e., the zip code fixed effects from a second GLM with $X' \hat \beta$ as an offset.
- Smooth the residuals with some kernel (e.g., Gaussian), taking into account both the exposure in each zip code and the distance between them.
- Take the smoothed residuals and cluster them into a number of zones using, e.g., Ward clustering.
- Take the zones, encode them into indicators $Z$ and fit a third GLM with them: $E[y|x] = \exp(X' \beta + Z' \gamma)$.

## How does the smoothing work?

Let

- $r_j$ be the residual of region $j$;
- $d_{jj'}$ be the distance between regions $j$ and $j'$;
- $e_j$ be exposure of region $j$;
- $f(\cdot)$ be a kernel for use in weighting.
 
The smoothed residual is given by:
 
\begin{equation*}
\tilde{r}_j = w_j r_j + (1 - w_j) \frac{\sum_{j' \neq j} f(d_{jj'}) e_{j'} r_{j'}}{\sum_{j' \neq j} f(d_{jj'}) e_{j'}},
\end{equation*}
 
where $w_j$ refers to the weight placed on one's own average:
 
\begin{equation*}
w_j = \left[\frac{e_j}{\sum_{j' \neq j} f(d_{jj'}) e_{j'}}\right] ^ \rho.
\end{equation*}
 
$f(\cdot)$ can be any kernel; here, we use a Gaussian kernel:
 
\begin{equation*}
f(d) = \exp\left[ -\frac{1}{2} \left(\frac{d}{h}\right)^2 \right],
\end{equation*}
where $h$ is the bandwidth.
 
In this form, the smoother has two hyperparameters: the bandwidth $h$ and a curvature parameter $\rho$.
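
The smoother above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function name `smooth_residuals` and its argument names are hypothetical, and the pairwise distance matrix `d` is assumed to be precomputed.

```python
import numpy as np


def smooth_residuals(r, e, d, h, rho):
    """Exposure- and distance-weighted smoothing of regional residuals.

    r: (J,) residuals, e: (J,) exposures, d: (J, J) pairwise distances,
    h: kernel bandwidth, rho: curvature parameter.
    All names are illustrative assumptions, not part of any library.
    """
    r = np.asarray(r, dtype=float)
    e = np.asarray(e, dtype=float)
    # Gaussian kernel f(d) = exp(-0.5 * (d / h)^2)
    k = np.exp(-0.5 * (np.asarray(d, dtype=float) / h) ** 2)
    np.fill_diagonal(k, 0.0)  # exclude the region itself (j' != j)
    denom = k @ e  # sum_{j' != j} f(d_jj') e_j'
    neighbour_mean = (k @ (e * r)) / denom
    w = (e / denom) ** rho  # weight on the region's own residual
    return w * r + (1.0 - w) * neighbour_mean
```

As written, the formula does not bound $w_j$ by one: when a region's own exposure exceeds the kernel-weighted exposure of its neighbours, $w_j > 1$. Capping the weight at one would be an additional modelling choice that the formula itself does not make. The sketch also assumes every region has at least one neighbour with positive kernel weight, so that the denominator is nonzero.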

## Where do we get the residuals from?

What is the correct residual for smoothing?
 
Index observations by $i$ and regions by $j$. All individuals in region $j$ are given by $I(j)$. The region of observation $i$ is given by $j(i)$.
 
Using a log link function, the conditional expectation of $y_i/e_i$ is modeled as

\begin{equation*}
\mathbb E[y_i / e_i | x_i] = \exp(x_i' \beta),
\end{equation*}

where $e_i$ denotes exposure.
 
We run this model, which gives us $\hat{\beta}$.
 
So what is the correct residual to use? Note that we want the residual on the zip code level. So we _define_ the residual as the $\{\alpha_j\}_j$ that solve

\begin{equation*}
\mathbb E[y_i / e_i | x_i] = \exp(x_i' \hat{\beta} + \alpha_{j(i)}).
\end{equation*}
 
How to estimate these $\{\alpha_j\}_j$?
 
For concreteness, consider a Tweedie GLM. Using $e_i$ to denote exposure, the interesting bit of the log-likelihood of a Tweedie GLM with power $p \in (1, 2)$ is given by

\begin{equation*}
\sum_i e_i \left(\frac{y_i}{e_i} \frac{\exp(x_i' \hat{\beta} + \alpha_{j(i)}) ^ {1-p}}{1-p} - \frac{\exp(x_i' \hat{\beta} + \alpha_{j(i)}) ^ {2 - p}}{2-p}\right),
\end{equation*}

where we use $e_i$ as weights. We skip the expression for $p=1$ and $p=2$, but the first order conditions are the same.
 
The first-order condition with respect to $\alpha_j$ is given by, for all $j$:
 
\begin{equation*}
\sum_{i \in I(j)} e_i \left( \frac{y_i}{e_i} \exp(x_i' \hat{\beta} + \alpha_j)^{1-p} - \exp(x_i' \hat{\beta} + \alpha_j)^{2-p} \right) = 0.
\end{equation*}
 
Re-arranging, we get
 
\begin{equation*}
\sum_{i \in I(j)} y_i \exp(x_i' \hat{\beta} + \alpha_j)^{1-p} = \sum_{i \in I(j)} e_i \exp(x_i' \hat{\beta} + \alpha_j)^{2-p}
\end{equation*}
 
and then
 
\begin{equation*}
\exp(\alpha_j)^{1-p} \sum_{i \in I(j)} y_i \exp(x_i' \hat{\beta})^{1-p} = \exp(\alpha_j)^{2-p} \sum_{i \in I(j)} e_i \exp(x_i' \hat{\beta})^{2-p},
\end{equation*}
 
which means that we can solve for $\alpha_j$ directly:
 
\begin{equation*}
\alpha_j = \log \left( \sum_{i \in I(j)} y_i \exp(x_i' \hat{\beta})^{1-p} \right) - \log \left( \sum_{i \in I(j)} e_i \exp(x_i' \hat{\beta})^{2-p} \right).
\end{equation*}
 
For Poisson ($p=1$), this simplifies to:
 
\begin{equation*}
\alpha_j = \log \left( \sum_{i \in I(j)} y_i \right) - \log \left( \sum_{i \in I(j)} e_i \exp(x_i' \hat{\beta}) \right).
\end{equation*}
 
For Gamma ($p=2$), this simplifies to:
 
\begin{equation*}
\alpha_j = \log \left( \sum_{i \in I(j)} y_i \exp(-x_i' \hat{\beta}) \right) - \log \left( \sum_{i \in I(j)} e_i \right).
\end{equation*}
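
The closed-form solution above amounts to two grouped sums per region. The following is a sketch under stated assumptions: `zip_code_effects` and its arguments are hypothetical names, `y` holds claim totals (so that `y / e` is the modeled rate), `eta` holds the first-stage linear predictor $x_i' \hat{\beta}$, and regions are encoded as integers $0, \dots, J-1$.

```python
import numpy as np


def zip_code_effects(y, e, eta, region, n_regions, p):
    """Closed-form alpha_j for a Tweedie GLM with log link and fixed offset eta.

    y: (n,) claim totals, e: (n,) exposures, eta: (n,) values of x_i' beta_hat,
    region: (n,) integer region index j(i), p: Tweedie power in [1, 2].
    All names are illustrative assumptions.
    """
    y, e, eta = (np.asarray(a, dtype=float) for a in (y, e, eta))
    # sum_{i in I(j)} y_i exp(eta_i)^(1 - p)
    num = np.bincount(region, weights=y * np.exp((1 - p) * eta), minlength=n_regions)
    # sum_{i in I(j)} e_i exp(eta_i)^(2 - p)
    den = np.bincount(region, weights=e * np.exp((2 - p) * eta), minlength=n_regions)
    # Regions without claims give log(0) = -inf; regions without data give nan.
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.log(num) - np.log(den)
```

Setting `p=1` or `p=2` reproduces the Poisson and Gamma special cases, since the corresponding exponent on $\exp(x_i' \hat{\beta})$ becomes zero.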
 
Note that $\alpha_j$ is a zip-code level fixed effect. However, unlike a typical fixed effect, it is estimated sequentially, i.e., it is not estimated jointly with the main effects $\beta$. To the extent that regional effects can be explained by $x_i$, the model will load on $\beta$, not on $\alpha_j$.
 
Note that for Poisson, there is an issue for every zip code $j$ in which $y_i = 0$ for all $i \in I(j)$: the fixed effect would be $-\infty$. As a first pass, we may just want to hard-code a lower bound. It would be cleaner to use a Bayesian approach (or something similar to a Bayesian approach, such as a credibility method).

## Extensions

- Typical extensions of this approach involve using some geographic features (e.g., does the person live in a city? average income of the zip code?) in the first-stage model and only smoothing variation that is not easily explained by such variables. This procedure has the advantage of not treating areas that are very close in space but very different in their (observed) characteristics as equal or nearly so.
- Other extensions involve different distance metrics:
    - as the crow flies;
    - actual driving times from centroid to centroid;
    - neighborhood-based distance: how many borders do I have to cross to travel from $j$ to $j'$.
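
Two of the distance metrics listed above are easy to sketch with the standard library alone. The function names are hypothetical, the great-circle distance assumes a spherical Earth with a mean radius of 6371 km, and the adjacency mapping (region to bordering regions) is assumed to be given. Driving times would require an external routing source and are not covered here.

```python
import math
from collections import deque


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle ("as the crow flies") distance between two points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def border_crossings(adjacency, start):
    """Number of borders crossed from `start` to each reachable region.

    adjacency: dict mapping a region to an iterable of bordering regions.
    Uses breadth-first search, so each count is the minimum number of crossings.
    """
    hops = {start: 0}
    queue = deque([start])
    while queue:
        j = queue.popleft()
        for k in adjacency.get(j, ()):
            if k not in hops:
                hops[k] = hops[j] + 1
                queue.append(k)
    return hops
```

Regions that cannot be reached from `start` (e.g., islands) are simply absent from the returned dictionary, so a caller would need to decide how to treat them.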

# 2. Using regularized GLMs

Instead of the multi-step procedure outlined above, we can also do all of this in one go by including the zip code effects directly in the model (see Section 6 in [this tutorial](https://github.com/lorentzenchr/Tutorial_freMTPL2/blob/b99b688f4be3c50d9a3356cc95bc4504742040d0/glm_freMTPL2_example.ipynb)):

\begin{align*}
\mathbb E[y_{ij} / e_{ij} | x_{ij}] = \exp(x_{ij}' \beta + \alpha_j)
\end{align*}

Again, we use $i$ to index observations and $j$ to index regions.

When we estimate the GLM, we use L2 regularization on the vector of zip code effects $\alpha$:

\begin{equation*}
\min_{\alpha, \beta} -\mathcal L + \lambda \alpha' P \alpha
\end{equation*}

where $\mathcal L$ is the log-likelihood, $\alpha$ is a $J \times 1$ vector, $P$ is a $J \times J$ matrix and $\lambda$ is a scalar (which, for convenience, can also be incorporated into $P$). This is known as [Tikhonov regularization](https://en.wikipedia.org/wiki/Tikhonov_regularization).
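
To make the objective concrete, here is a minimal sketch for the Poisson case that evaluates the penalized negative log-likelihood together with its gradients. It is an illustration only: the function and argument names are hypothetical, `y` holds claim counts, regions are encoded as integers, and terms of the log-likelihood that do not depend on $(\beta, \alpha)$ are dropped.

```python
import numpy as np


def penalized_poisson_objective(beta, alpha, X, region, y, e, P, lam):
    """Negative Poisson log-likelihood (up to constants) plus lam * alpha' P alpha.

    X: (n, k) features, region: (n,) integer index j(i), y: (n,) claim counts,
    e: (n,) exposures, P: (J, J) penalty matrix, lam: penalty strength.
    Returns the objective and its gradients with respect to beta and alpha.
    """
    eta = X @ beta + alpha[region]  # linear predictor x_i' beta + alpha_j(i)
    mu = e * np.exp(eta)  # expected claim count per observation
    nll = np.sum(mu - y * eta)  # constants in (beta, alpha) are dropped
    resid = mu - y
    grad_beta = X.T @ resid
    # Sum the score contributions of all observations in each region.
    grad_alpha = np.bincount(region, weights=resid, minlength=alpha.size)
    grad_alpha = grad_alpha + lam * (P + P.T) @ alpha
    return nll + lam * alpha @ P @ alpha, grad_beta, grad_alpha
```

A function of this shape could be handed to a generic gradient-based optimizer. Only $\alpha$ enters the penalty, so $\beta$ is left unregularized, as in the objective above.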

Creating $P$ is straightforward, but takes a little bit of work. We want to regularize the differences between $\alpha_j$ and $\alpha_{j'}$: the farther apart regions $j$ and $j'$ are and the greater the exposure in each of them, the more we let $\alpha_j$ and $\alpha_{j'}$ differ. The weight matrix $P$ conveys the (exposure-weighted) distance between zip codes, much like the kernel in the sequential procedure above.
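
One way to penalize pairwise differences is to build $P$ as a weighted graph Laplacian, for which $\alpha' P \alpha = \sum_{j < j'} w_{jj'} (\alpha_j - \alpha_{j'})^2$. The sketch below assumes a symmetric matrix of non-negative pairwise weights $w_{jj'}$ is already available; the function name is hypothetical, and how the weights should depend on distance and exposure is left open here.

```python
import numpy as np


def laplacian_penalty(weights):
    """Build P such that alpha' P alpha = sum_{j<j'} w_jj' (alpha_j - alpha_j')^2.

    weights: symmetric (J, J) array of non-negative pairwise weights w_jj'.
    A larger weight pulls the two regional effects closer together.
    """
    w = np.array(weights, dtype=float)  # copy, so the input is not modified
    np.fill_diagonal(w, 0.0)  # a region is not penalized against itself
    return np.diag(w.sum(axis=1)) - w
```

Because every row of this $P$ sums to zero, adding a constant to all $\alpha_j$ leaves the penalty unchanged: only differences between regions are regularized, not their overall level. Weights that are exactly zero for distant pairs yield zero entries in $P$.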

## What makes this tricky?

- We have very many zip codes ($\approx 10{,}000$).

## First steps
- Starting with [this tutorial](https://github.com/lorentzenchr/Tutorial_freMTPL2/blob/b99b688f4be3c50d9a3356cc95bc4504742040d0/glm_freMTPL2_example.ipynb), familiarize yourself with the problem and adapt it to work with our code base.
- Extend the neighborhood-based distance metric used there to an actual distance-based metric.
- Compare the results with the traditional step-by-step approach.

## Next steps

- Find (or generate) a data set for a prediction task that has a high dimensional regional fixed effect.
- Figure out how to handle the large number of fixed effects (e.g., do not construct the full Hessian).
- Construct the weight matrix $P$ (consider preserving sparsity for regions that are far apart).
- Figure out how to effectively tune the various hyperparameters that we have ($\lambda$ and the parameters that went into the construction of $P$).

## Extensions

- There's no need to restrict this to L2 regularization.

# Literature

- tbc