## Part 2: A Failure Mode for Linear Regression

In [1]:
import numpy as np
import pandas as pd

from glm.glm import GLM
from glm.families import Gaussian

%matplotlib inline
import matplotlib.pyplot as plt

1. Create a feature matrix X with two columns and 100 rows. The first column should be an intercept column of all 1.0's, and the second should be randomly sampled from any distribution (a uniform is fine).

In [2]:
X = np.zeros(shape=(100, 2))

X[:, 0] = 1.0
X[:, 1] = np.random.uniform(size=100)

2. Create a target vector from a linear data generating process

In [3]:
# We'll use a smaller scale so the parameter estimates have
# smaller variance.
y = 1.0 + 2.0 * X[:, 1] + np.random.normal(scale=0.25, size=100)

3. Fit a linear regression to (X, y) data. Look at the fit coefficients (i.e. the parameter estimates in statistical language). Are they what you expect them to be? If you had fit the model to 1,000,000 data points, what would change about them?

In [4]:
model = GLM(family=Gaussian())
model.fit(X, y)
model.summary()

Gaussian GLM Model Summary.
Name         Parameter Estimate  Standard Error
-----------------------------------------------
Intercept                  1.06            0.05
X1                         1.96            0.08


If we use more data points, we expect the parameter estimates to approximate the true values more precisely, and the standard errors to shrink. This relies on the data points being sampled independently of one another, and the statistical property that guarantees the better approximation is called **consistency**.
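As a rough check of this claim, the sketch below refits the same data generating process at increasing sample sizes. It uses `np.linalg.lstsq` rather than the `GLM` class, and the helper `fit_ols` and the seed are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only


def fit_ols(n):
    """Simulate n points from y = 1 + 2x + noise and return the least-squares fit."""
    X = np.column_stack([np.ones(n), rng.uniform(size=n)])
    y = 1.0 + 2.0 * X[:, 1] + rng.normal(scale=0.25, size=n)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


# The estimates should settle toward (1, 2) as n grows.
estimates = {n: fit_ols(n) for n in (100, 10_000, 1_000_000)}
for n, coef in estimates.items():
    print(f"n={n:>9,}: intercept={coef[0]:.4f}, slope={coef[1]:.4f}")
```

The exact numbers depend on the seed, but the estimates at the largest sample size should sit much closer to the true values than those at 100 points.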

4. Create a new feature matrix X with three columns and 100 rows. Make the first two columns the same as your previous X, but make the third column a copy of the second column, i.e., X should have the same data in the second and third column.

In [5]:
X = np.zeros(shape=(100, 3))

X[:, 0] = 1.0
X[:, 1] = np.random.uniform(size=100)
X[:, 2] = X[:, 1]

y = 1.0 + 2.0 * X[:, 1] + np.random.normal(scale=0.25, size=100)

5. Fit a linear regression to the new (X, y) data (y should be the same as it was in the previous example). What happened?

In [6]:
model = GLM(family=Gaussian())
model.fit(X, y)

LinAlgError: Singular matrix

6. Hopefully you got a `LinAlgError`, so there's something unfortunate going on here. Think about what the correct answer should be: what coefficients should the model return?

It's not possible to say what the correct answer is in this situation.  We created the `y` vector using the equation:

$$ y = 1 + 2 x_1 + \epsilon $$

But in our current setup with two copies of the same predictor, this is exactly the same equation as:

$$ y = 1 + 2 x_2 + \epsilon $$

In fact, there is an infinite number of equivalent expressions:

$$ y = 1 + x_1 + x_2 + \epsilon $$
$$ y = 1 + 1.5 x_1 + 0.5 x_2 + \epsilon $$

and so forth. There's no way to say that any one of these is *better* than the others, since they all give the same predictions.

In terms of algebra, we are looking for solutions to the following equation:

$$ X^t X \beta = X^t y $$

But the *rank* of the matrix $X^t X$ is two rather than three, since the columns of $X$ are linearly dependent. This means that this system of linear equations has an infinite number of solutions. This is the source of the `LinAlgError`: a *singular* matrix is a square matrix without full rank.
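The rank deficiency can be checked directly with NumPy. This is a minimal sketch that rebuilds a matrix with a duplicated column (the seed and variable names are assumptions). It computes the rank of $X$, which always equals the rank of $X^t X$, and confirms that several of the equivalent coefficient vectors above give identical fitted values.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

X = np.ones((100, 3))
X[:, 1] = rng.uniform(size=100)
X[:, 2] = X[:, 1]  # third column duplicates the second

# X has three columns, but only two of them are linearly independent.
rank_X = np.linalg.matrix_rank(X)
print("columns:", X.shape[1], "rank:", rank_X)

# Three of the equivalent coefficient vectors (intercept, x_1, x_2).
candidates = [
    np.array([1.0, 2.0, 0.0]),
    np.array([1.0, 0.0, 2.0]),
    np.array([1.0, 1.5, 0.5]),
]
fitted = [X @ beta for beta in candidates]
same_fit = all(np.allclose(fitted[0], f) for f in fitted[1:])
print("identical fitted values:", same_fit)
```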

7. Create a new feature matrix where one column is a multiple of another, and fit a linear regression again. What happened this time? How can you explain it?

In [7]:
X = np.zeros(shape=(100, 3))

X[:, 0] = 1.0
X[:, 1] = np.random.uniform(size=100)
X[:, 2] = 2 * X[:, 1]

y = 1.0 + 2.0 * X[:, 1] + np.random.normal(scale=0.25, size=100)

In [8]:
model = GLM(family=Gaussian())
model.fit(X, y)

LinAlgError: Singular matrix

We've got more or less the same issue as before.  This time the following equations are all equivalent:

$$ y = 1 + 2 x_1 + \epsilon $$
$$ y = 1 + x_2 + \epsilon $$
$$ y = 1 + x_1 + 0.5 x_2 + \epsilon $$

So we again have an issue with the rank of the matrix $X^t X$.
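One way to see why these equations are interchangeable: since $x_2 = 2 x_1$, the coefficient direction $(0, 2, -1)$ is mapped to zero by $X$, so adding any multiple of it to a coefficient vector leaves the fitted values unchanged. The sketch below illustrates this with NumPy. The name `null_direction` and the seed are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

X = np.ones((100, 3))
X[:, 1] = rng.uniform(size=100)
X[:, 2] = 2 * X[:, 1]  # third column is a multiple of the second

# Because x_2 = 2 * x_1, we have 2 * x_1 - x_2 = 0 for every row.
null_direction = np.array([0.0, 2.0, -1.0])
print(np.allclose(X @ null_direction, 0.0))

beta = np.array([1.0, 2.0, 0.0])  # y = 1 + 2 x_1
# t = -1 gives [1, 0, 1] and t = -0.5 gives [1, 1, 0.5], as in the text.
shifted = [beta + t * null_direction for t in (-1.0, -0.5, 3.0)]
all_equal = all(np.allclose(X @ beta, X @ b) for b in shifted)
print("same fitted values for every shifted vector:", all_equal)
```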

8. Create one last feature matrix where one column is a linear combination of two or more other columns. Fit a linear regression using it. What happened this time? Can you explain it?

In [9]:
X = np.zeros(shape=(100, 4))

X[:, 0] = 1.0
X[:, 1] = np.random.uniform(size=100)
X[:, 2] = np.random.uniform(size=100)
X[:, 3] = 2 * X[:, 1] - 3 * X[:, 2]

y = 1.0 + 2.0 * X[:, 1] - X[:, 2] + np.random.normal(scale=0.25, size=100)

In [10]:
model = GLM(family=Gaussian())
model.fit(X, y)

<glm.glm.GLM at 0x1a17d85908>

This time the model fit...

In [11]:
model.summary()

Gaussian GLM Model Summary.
Name         Parameter Estimate  Standard Error
-----------------------------------------------
Intercept                  1.02            0.07
X1                         0.97      1362631.85
X2                         0.52      2043947.78
X3                         0.52       681315.93


Oh wow, look at those standard errors (you'll find they are either **massive** or `nan`).

This is a sign that we have some problems in our regression. Before, our regression failed outright because we had an exact linear dependency. Here, floating point error kept the dependency from being exact, so the fit went through, but the linear dependency in our matrix has still led to very badly estimated parameters.

Notice that the parameter estimates returned are very bad estimates of the truth. This is a common situation when our matrix has either an exact or an approximate linear dependency. The results from these regressions should not be trusted! It's indicating to us that we could remove one of our predictors without suffering:

In [12]:
X_small = X[:, [0, 1, 2]]

model = GLM(family=Gaussian())
model.fit(X_small, y)

<glm.glm.GLM at 0x1a17d85630>

In [13]:
model.summary()

Gaussian GLM Model Summary.
Name         Parameter Estimate  Standard Error
-----------------------------------------------
Intercept                  1.02            0.07
X1                         2.01            0.08
X2                        -1.04            0.09


That's much better.
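A common numerical diagnostic for this situation is the condition number of the design matrix. The sketch below rebuilds matrices like the ones above (with an assumed seed, so the values differ from the notebook's) and compares `np.linalg.cond` before and after dropping the redundant column. This is an illustration added here, not something the `GLM` class is known to report.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

X = np.ones((100, 4))
X[:, 1] = rng.uniform(size=100)
X[:, 2] = rng.uniform(size=100)
X[:, 3] = 2 * X[:, 1] - 3 * X[:, 2]  # linear combination of columns 1 and 2
X_small = X[:, [0, 1, 2]]            # redundant column removed

# Ratio of largest to smallest singular value; huge values signal
# (near) linear dependency among the columns.
cond_full = np.linalg.cond(X)
cond_small = np.linalg.cond(X_small)
print(f"with redundant column:    {cond_full:.3g}")
print(f"without redundant column: {cond_small:.3g}")
```

The full matrix should report an enormous (possibly infinite) condition number, while the reduced one should be modest.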

9. Hopefully you've seen a few linear regressions fail at this point. Why did they fail? What is the failure mode for linear regression?

Linear regressions fail when the columns of the predictor matrix $X$ are linearly dependent, which leads to infinitely many equally good solutions. This was explicit in our first few examples.

In our final example, we again had a linear dependency in $X$, but the computer did not catch it exactly. Instead, it was indicated by the massive standard errors of our parameter estimates. This tends to happen when our $X$ matrix contains either an exact or an approximate linear dependency.
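The approximate case can be reproduced on purpose. The sketch below makes the last column an exact linear combination plus a small perturbation, then computes the classical standard errors $\sqrt{\hat\sigma^2 \, \mathrm{diag}((X^t X)^{-1})}$ by hand. The helper `ols_standard_errors`, the perturbation scale, and the seed are assumptions for illustration, and this formula is not necessarily how the `GLM` class computes its summary.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
n = 100


def ols_standard_errors(X, y):
    """Classical OLS standard errors: sqrt(sigma^2 * diag((X^T X)^{-1}))."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    sigma2 = resid @ resid / (X.shape[0] - X.shape[1])
    return np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))


x1 = rng.uniform(size=n)
x2 = rng.uniform(size=n)
# Approximate dependency: an exact combination plus a small perturbation.
x3 = 2 * x1 - 3 * x2 + rng.normal(scale=1e-4, size=n)
y = 1.0 + 2.0 * x1 - x2 + rng.normal(scale=0.25, size=n)

X_near = np.column_stack([np.ones(n), x1, x2, x3])
X_small = X_near[:, :3]  # drop the nearly redundant column

se_near = ols_standard_errors(X_near, y)
se_small = ols_standard_errors(X_small, y)
print("nearly dependent columns:", np.round(se_near, 2))
print("redundant column removed:", np.round(se_small, 2))
```

The standard errors for the predictors involved in the near dependency should come out orders of magnitude larger than in the reduced model, even though the fit itself succeeds.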