In [2]:
import sys
import os
if not any(path.endswith('textbook') for path in sys.path):
    sys.path.append(os.path.abspath('../../..'))
from textbook_utils import *

In [3]:
outlier_czs = [34105, 34113, 34112, 34106]
df = (
    pd.read_csv('data/mobility.csv')
    # filter out rows with NaN AUM values
    .query('not aum.isnull()', engine='python')
    # take out outlier CZs
    .query('cz not in @outlier_czs')
)

(sec:linear_multi)=
# Multiple Linear Model

So far, we've used a single predictor variable $ x $ to predict the outcome
$ y $.
Now, we'll introduce the *multiple linear model*, a linear model that uses
multiple predictors to predict $ y $.
This is useful because additional predictors can improve our model's fit to
the data and make its predictions more accurate.
After defining the multiple linear model, we'll use it to predict AUM using
a combination of variables.

If we have multiple predictors, we say that $ x $ is a $ p $-dimensional
column vector
$ x = [ x_1, x_2, \ldots, x_p ] $. Then, for a given $ x $, the
outcome $ y $ depends on a linear combination of the $ x_i $ plus an error
term $ \epsilon $:

$$
\begin{aligned}
y = \theta_0 + \theta_1 x_1 + \ldots + \theta_p x_p + \epsilon
\end{aligned}
$$

Similar to the simple linear model, our
multiple linear model $ f_{\theta}(x) $ predicts $ y $ for a given $ x $:

$$
\begin{aligned}
f_{\theta}(x) = \theta_0 + \theta_1 x_1 + \ldots + \theta_p x_p
\end{aligned}
$$

We can simplify this notation if we add an intercept term to $ x $ so that
$ x = [ 1, x_1, x_2, \ldots, x_p ] $.
Since we also write our model parameters as a column vector
$ \theta = [ \theta_0, \theta_1, \ldots, \theta_p ] $, we can
use the definition of the dot product to
write our model as:

$$
\begin{aligned}
f_{\theta}(x) &= \theta_0 + \theta_1 x_1 + \ldots + \theta_p x_p \\
&= \theta \cdot x \\
\end{aligned}
$$
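To make the dot-product form concrete, here is a minimal sketch in NumPy. The parameter and predictor values are made up for illustration and don't come from the Opportunity data.

```python
import numpy as np

# Hypothetical parameters [theta_0, theta_1, theta_2] (made-up values)
theta = np.array([1.0, 2.0, -0.5])

# One observation with p = 2 predictors (made-up values)
x_raw = np.array([3.0, 4.0])

# Prepend a 1 so that theta_0 acts as the intercept
x = np.concatenate([[1.0], x_raw])

# The written-out sum and the dot product give the same prediction
pred_sum = theta[0] + theta[1] * x_raw[0] + theta[2] * x_raw[1]
pred_dot = theta @ x
print(pred_sum, pred_dot)
```

Prepending the 1 is what lets us fold the intercept into a single dot product.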

As a final simplification, we'll use matrix notation to show how our
model works on our entire dataset.
Before, we said that a single observation is $ (x, y) $, where $ x $ is a
vector of predictor variables and $ y $ is the scalar outcome. 
Now, we'll say that $ X $ is a matrix. Each row of $ X $ has the predictors
for a single observation.
We'll also say that $ y $ is a vector (instead of scalar) with the outcomes
for each observation:

$$
\begin{aligned}
X = 
\begin{bmatrix}
1 & x_{11} & x_{12} & \cdots & x_{1p} \\
1 & x_{21} & x_{22} & \cdots & x_{2p} \\
  &        & \vdots &        &        \\
1 & x_{n1} & x_{n2} & \cdots & x_{np} \\
\end{bmatrix}
& &
y =  
\begin{bmatrix}
y_1 \\
y_2 \\
\vdots \\
y_n
\end{bmatrix}
\end{aligned}
$$


We call $ X $ the *design matrix*.
It's an $ n \times (p + 1) $ matrix (remember that we added an extra dimension
for the intercept term).
Now, we can write the predictions for the entire dataset using
matrix multiplication:

$$
\begin{aligned}
f_{\theta}(X) &= X \theta
\end{aligned}
$$
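As a quick check of the matrix form, here is a minimal sketch with made-up numbers (not the Opportunity data). The names `predictors` and `theta` are our own choices for illustration.

```python
import numpy as np

# Made-up predictors: n = 3 observations, p = 2 predictors each
predictors = np.array([
    [3.0, 4.0],
    [0.0, 2.0],
    [1.0, 1.0],
])

# Design matrix: a column of ones followed by the predictors
X = np.column_stack([np.ones(len(predictors)), predictors])

# Hypothetical parameters [theta_0, theta_1, theta_2]
theta = np.array([1.0, 2.0, -0.5])

# One matrix-vector product gives a prediction for every row of X
predictions = X @ theta
print(X.shape)        # n rows, p + 1 columns
print(predictions)    # one prediction per observation
```

Each entry of `predictions` is the dot product of `theta` with one row of `X`.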

$ X $ is an $ n \times (p + 1) $ matrix and $ \theta $ is a $ (p + 1) $-dimensional column vector.
This means that $ X \theta $ is
an $ n $-dimensional column vector. Each item in the vector
is the model's prediction for one observation.
It's easier to understand the design matrix through an example, so let's
return to the Opportunity data.

Now that we have our data prepared for modeling, in the next section we'll 
fit our model by finding the $ \hat{\theta} $ that minimizes our loss.

(sec:linear_multi_fit)=
## Fitting the Multiple Linear Model

For an $ n \times (p + 1) $ design matrix $ X $, an $ n $-dimensional
column vector of outcomes $ y $, and a $ (p + 1) $-dimensional column 
vector of model parameters $ \theta $, we assume that:

$$
\begin{aligned}
y = X \theta + \epsilon
\end{aligned}
$$

Here, $ \epsilon $ is an $ n $-dimensional column vector that represents the
sampling error.
We define the multiple linear model as:

$$
\begin{aligned}
f_{\theta}(X) = X \theta
\end{aligned}
$$

Similar to the simple linear model, we'll fit $ f_{\theta}(X) $ using
the squared loss function.
We want to find the model parameters $ \hat{\theta} $ that minimize the
mean squared loss:

$$
\begin{aligned}
L(\theta, X, y)
 &= \frac{1}{n} \left | y - f_{\theta}(X) \right|^2
\end{aligned}
$$

Here, we're using the notation $ |v|^2 $ for a vector $ v $ as a
shorthand for the sum of each vector element squared [^l2]:
$ |v|^2 = \sum_i v_i^2 $ .
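The loss translates almost directly into code. The sketch below is one way to write it, assuming `X` already contains the intercept column; the function name `mse_loss` and the small dataset are made up for illustration.

```python
import numpy as np

def mse_loss(theta, X, y):
    """Mean squared loss: (1/n) * |y - X @ theta|^2."""
    residuals = y - X @ theta
    return np.sum(residuals ** 2) / len(y)

# Made-up data: an intercept column and one predictor
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 4.0])
theta = np.array([1.0, 1.0])

loss = mse_loss(theta, X, y)

# The sum of squared residuals equals the squared l2 norm of the residuals
same = np.isclose(loss, np.linalg.norm(y - X @ theta) ** 2 / len(y))
print(loss, same)
```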

[^l2]: $ |v| $ is also called the $ \ell_2 $ norm of a
vector $ v $.

In this section, we'll fit our model by figuring out what the
minimizing $ \hat{\theta} $ is.
One idea is to use calculus as we did for the simple linear model.
However, this approach needs knowledge of vector calculus that we won't
cover in this book.
Instead, we'll use a geometric argument.

## A Geometric Problem

Our goal is to find the $ \hat{\theta} $ that minimizes our loss
function---we want to make $ L(\theta, X, y) $ as small as possible
for a given $ X $ and $ y $.
The key insight is that we can restate this goal in a geometric way.
Remember: the model predictions $ f_{\theta}(X) $ and the true outcomes
$ y $ are both vectors.
We can treat vectors as points---for example, we can plot
the vector $ [ 2, 3 ] $ at $ x = 2, y = 3 $ in 2D space.
Then, minimizing $ L(\theta, X, y) $ is equivalent to finding
$ \hat{\theta} $ that makes $ f_{\theta}(X) $ as close as possible to
$ y $ when we plot them as points.
As depicted in {numref}`Figure %s <fig:geom-2d>`, different values of
$ \theta $ give different predictions $ f_{\theta}(X) $ (hollow points).
Then, $ \hat{\theta} $ is the vector of parameters that puts
$ f_{\theta}(X) $ as close to $ y $ (filled point) as possible.

```{figure} figures/geom-2d.svg
---
name: fig:geom-2d
width: 250px
---

A plot showing different values of $ f_{\theta}(X) $ (hollow points) and
the outcome vector $ y $ (filled point).
```
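We can see the same idea numerically. The sketch below uses a made-up dataset and three arbitrary candidate values of $ \theta $, then measures how far each prediction vector lies from $ y $.

```python
import numpy as np

# Made-up data: an intercept column and one predictor
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 4.0])

# Arbitrary candidate parameter vectors (chosen for illustration only)
candidates = [
    np.array([0.0, 0.0]),
    np.array([1.0, 1.0]),
    np.array([1.0, 2.0]),
]

distances = []
losses = []
for theta in candidates:
    # Euclidean distance between the prediction vector and y
    distance = np.linalg.norm(y - X @ theta)
    distances.append(distance)
    # The mean squared loss is the squared distance divided by n
    losses.append(distance ** 2 / len(y))
    print(theta, distance, losses[-1])
```

Because the loss is the squared distance divided by $ n $, the candidate closest to $ y $ is also the one with the smallest loss.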

Next, we'll look at the possible values of $ f_{\theta}(X) $.
In {numref}`Figure %s <fig:geom-2d>`, we showed a few possible 
$ f_{\theta}(X) $.
Instead of just plotting a few possible points, we can
plot *all* possible values of $ f_{\theta}(X) $ by varying $ \theta $.
This results in a subspace of possible $ f_{\theta}(X) $ values, as shown in
{numref}`Figure %s <fig:geom-span>`.

```{figure} figures/geom-span.svg
---
name: fig:geom-span
width: 250px
---

A plot showing all possible values of $ f_{\theta}(X) $ as a line.
```

In {numref}`Figure %s <fig:geom-span>`, we drew the set of
possible $ f_{\theta}(X) $ values as a line.
Since our model is $ f_{\theta}(X) = X \theta $, from a property of
matrix-vector multiplication we know that $ f_{\theta}(X) $ is a linear
combination of the columns of $ X $. The set of all such linear combinations
is the span of the columns of $ X $, which we write as $ \text{span}(X) $.
Now, we need to figure out which point within $ \text{span}(X) $ lies the
closest to $ y $.
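The property of matrix-vector multiplication used here is easy to check. In this sketch, with made-up values for `X` and `theta`, the product $ X \theta $ matches the columns of $ X $ weighted by the entries of $ \theta $.

```python
import numpy as np

# Made-up design matrix with an intercept column and one predictor
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
theta = np.array([2.0, -1.0])

# Matrix-vector product
as_product = X @ theta

# The same vector, built as a weighted sum of the columns of X
as_combination = theta[0] * X[:, 0] + theta[1] * X[:, 1]

print(as_product, as_combination)
```

Whatever $ \theta $ we pick, the prediction vector stays inside the span of the columns of $ X $.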

As {numref}`Figure %s <fig:geom-span>` suggests, the closest point to $ y $ 
is the point where the error $ \epsilon = y - f_{\hat{\theta}}(X) $ is perpendicular
to $ \text{span}(X) $. We'll leave the complete proof as an exercise.
With this final fact, we can solve for $ \hat{\theta} $:

$$
\begin{aligned}
f_{\hat{\theta}}(X) + \epsilon &= y \\
X \hat{\theta} + \epsilon &= y \\
X^\top X \hat{\theta} + X^\top \epsilon &= X^\top y
    & (\text{left-multiply both sides by } X^\top) \\
X^\top X \hat{\theta} &= X^\top y
    & (X^\top \epsilon = 0 \text{ since } \epsilon \perp \text{span}(X)) \\
\hat{\theta} &= (X^\top X)^{-1} X^\top y
    & (\text{assuming } X^\top X \text{ is invertible})
\end{aligned}
$$


With this derivation done, we can now write a short function to
fit the multiple linear model using $ X $ and $ y $.

In [3]:
def fit(X, y):
    # Closed-form solution: theta_hat = (X^T X)^{-1} X^T y
    return np.linalg.inv(X.T @ X) @ X.T @ y
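As a sanity check, we can try `fit` on simulated data. The sketch below restates `fit` so that it runs on its own; the dataset, the random seed, and the parameters in `true_theta` are made up for illustration. It checks that the fitted residuals are perpendicular to every column of $ X $, which is the fact the derivation relied on.

```python
import numpy as np

def fit(X, y):
    # Restated from the text: theta_hat = (X^T X)^{-1} X^T y
    return np.linalg.inv(X.T @ X) @ X.T @ y

def mse_loss(theta, X, y):
    # Mean squared loss (helper defined here for this example)
    return np.sum((y - X @ theta) ** 2) / len(y)

# Simulated data with made-up parameters and a small amount of noise
rng = np.random.default_rng(42)
n = 50
predictors = rng.normal(size=(n, 2))
X = np.column_stack([np.ones(n), predictors])
true_theta = np.array([1.0, 2.0, -0.5])
y = X @ true_theta + rng.normal(scale=0.1, size=n)

theta_hat = fit(X, y)

# Residuals should be perpendicular to each column of X,
# so these dot products should all be close to zero
residuals = y - X @ theta_hat
orthogonality = X.T @ residuals

# Nudging theta_hat away from the fitted value should not lower the loss
perturbed = theta_hat + np.array([0.1, 0.0, 0.0])
loss_hat = mse_loss(theta_hat, X, y)
loss_perturbed = mse_loss(perturbed, X, y)

print(theta_hat)
print(orthogonality)
print(loss_hat, loss_perturbed)
```

In practice, `np.linalg.solve(X.T @ X, X.T @ y)` or `np.linalg.lstsq` is often preferred over forming the inverse explicitly, since both tend to be more numerically stable.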

Notice that deriving $ \hat{\theta} $ for the multiple linear model also gives
us $ \hat{\theta} $ for the simple linear model. If we set
$ X $ to contain the intercept column and one column of features, the
formula for $ \hat{\theta} $ gives us the intercept and slope of the best-fit
line.
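We can check this special case on a small made-up dataset. The sketch below compares the matrix formula against the standard least-squares expressions for the slope and intercept of a line, written here in terms of deviations from the means.

```python
import numpy as np

# Made-up data for a single predictor
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Design matrix with an intercept column and one feature column
X = np.column_stack([np.ones(len(x)), x])
theta_hat = np.linalg.inv(X.T @ X) @ X.T @ y

# Least-squares slope and intercept for a line
x_dev = x - x.mean()
y_dev = y - y.mean()
slope = np.sum(x_dev * y_dev) / np.sum(x_dev ** 2)
intercept = y.mean() - slope * x.mean()

print(theta_hat)          # [intercept, slope] from the matrix formula
print(intercept, slope)   # the same values from the line formulas
```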