# 7 James-Stein Estimation and Ridge Regression

The James-Stein estimator accepts a small bias in exchange for a reduction in variance. The result is often a smaller overall mean squared error than that of the MLE.

## James-Stein Estimation 

### Posterior Mean 

First suppose $\mu\sim N(M,A)$ and $(x|\mu)\sim N(\mu,1)$, where $M,A\in\mathbb R$ are known, $x\in\mathbb R$ is observed, and $\mu$ is unobserved. By Bayes' rule the posterior distribution is
$(\mu|x)\sim N(M+B(x-M), B)$ where $B = A(A+1)^{-1}<1$. We thus define the posterior mean of $\mu$, or Bayes estimator of $\mu$, by
$$\hat\mu^{\rm Bayes} = M+B(x-M).$$

Then 
$$\mathbb E\left\{(\hat\mu^{\rm Bayes} - \mu)^2\right\} = B.$$

However, if given $x$ we estimate $\mu$ by the MLE, which is $\hat\mu^{\rm MLE} = x$, then
$$\mathbb E\left\{(\hat\mu^{\rm MLE} - \mu)^2\right\} = 1>B.$$

Hence the Bayes estimator has the smaller expected squared error here, because it uses the prior information $\mu\sim N(M,A)$.
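The two risks above can be checked with a small Monte Carlo sketch. The values of `M`, `A`, the seed, and the number of draws are illustrative assumptions, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
M, A = 2.0, 0.5              # assumed prior mean and variance (illustrative)
B = A / (A + 1.0)            # shrinkage factor B = A / (A + 1)
n_draws = 200_000

mu = rng.normal(M, np.sqrt(A), size=n_draws)   # mu ~ N(M, A)
x = rng.normal(mu, 1.0)                        # x | mu ~ N(mu, 1)

mu_bayes = M + B * (x - M)   # posterior mean (Bayes estimator)
mu_mle = x                   # maximum likelihood estimator

risk_bayes = np.mean((mu_bayes - mu) ** 2)   # expected to be close to B
risk_mle = np.mean((mu_mle - mu) ** 2)       # expected to be close to 1
print(risk_bayes, risk_mle)
```

Under these assumptions the first number should land near $B = 1/3$ and the second near $1$, up to simulation noise.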

### Unknown Prior Parameters

Suppose $\mu\sim N(M,A)$ and $(x|\mu)\sim N(\mu,1)$ and only $x\in\mathbb R$ is known while $M,A$ are fixed but unknown. Still we have 
$(\mu|x,M,A)\sim N(M+B(x-M),B)$ where $B = A(A+1)^{-1}$. It suffices to estimate the parameters $M,A$.

When we have $n>3$ independent observations $x_1,\dotsc,x_n$, each generated as above with its own $\mu_i\sim N(M,A)$, the following estimators are unbiased for $M$ and $B$:
$$\hat M =\bar x\quad\quad \hat B =1 - \frac{n-3}{\sum_{i=1}^n (x_i - \bar x)^2}.$$


**Proof** Marginally $x_i\sim N(M,A+1)$ independently, so $\mathbb E(\hat M) = \mathbb E(\bar x) = M$. For $\hat B$, note that $\sum_{i=1}^n (x_i-\bar x)^2\sim (A+1)\chi_{n-1}^2$, so $\hat B$ has the same distribution as $1 - \frac{n-3}{(A+1)\chi_{n-1}^2}$ and
$$\mathbb E(\hat B ) = 1 - \frac{n-3}{A+1}\int_{0}^\infty \frac{1}{t}\cdot\frac{1}{2^{\frac{n-1}{2}}\Gamma(\frac{n-1}{2})}t^{\frac{n-1}{2}-1}e^{-\frac{t}{2}}\,dt= 1 - \frac{n-3}{A+1}\cdot\frac{2^{\frac{n-3}{2}}\Gamma(\frac{n-3}{2})}{2^{\frac{n-1}{2}}\Gamma(\frac{n-1}{2})}=1-\frac{1}{A+1}=\frac{A}{A+1}.
$$
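The unbiasedness of $\hat M$ and $\hat B$ can also be checked numerically. The sketch below draws directly from the marginal distribution $x_i\sim N(M,A+1)$; the values of `M`, `A`, `n`, and the number of replications are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
M, A, n = 1.0, 2.0, 10       # illustrative values with n > 3
n_rep = 200_000              # number of simulated data sets

# Each row is one data set x_1, ..., x_n with x_i ~ N(M, A + 1).
x = rng.normal(M, np.sqrt(A + 1.0), size=(n_rep, n))

M_hat = x.mean(axis=1)
S = np.sum((x - M_hat[:, None]) ** 2, axis=1)   # sum of squared deviations
B_hat = 1.0 - (n - 3) / S                       # no square root on S

print(M_hat.mean(), M)              # average of M_hat vs. M
print(B_hat.mean(), A / (A + 1.0))  # average of B_hat vs. B
```

The averages over many replications should agree with $M$ and $B=A/(A+1)$ up to simulation noise.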


## James-Stein Theorem

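Suppose $(x_i|\mu_i)\sim N(\mu_i,1)$ independently for $i=1,2,\dotsc,n$ and $n>3$. Let $\hat\mu^{\rm MLE}_i = x_i$ and define the James-Stein estimator by plugging the estimates $\hat M,\hat B$ into the posterior mean, $\hat\mu^{\rm JS}_i = \hat M+\hat B(x_i-\hat M)$. Then
$$\mathbb E\left\{\Vert \hat\mu^{\rm JS} - \mu\Vert^2\right\} < \mathbb E\left\{\Vert\hat\mu^{\rm MLE} - \mu\Vert^2\right\}$$
where $\mu = [\mu_1,\dotsc,\mu_n]^T$. The inequality holds for every fixed $\mu$, and therefore also when each $\mu_i$ is a random variable.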
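A simulation sketch of the theorem for one fixed mean vector follows. The helper `james_stein` and the particular choice of `mu` are assumptions made for illustration; the estimator shrinks toward $\bar x$ with factor $1-(n-3)/\sum_i(x_i-\bar x)^2$.

```python
import numpy as np

def james_stein(x):
    """Shrink each row of x toward its own mean with the estimated factor B_hat."""
    n = x.shape[-1]
    x_bar = x.mean(axis=-1, keepdims=True)
    S = np.sum((x - x_bar) ** 2, axis=-1, keepdims=True)
    B_hat = 1.0 - (n - 3) / S
    return x_bar + B_hat * (x - x_bar)

rng = np.random.default_rng(2)
mu = np.linspace(-1.0, 1.0, 10)                    # fixed, arbitrary mean vector
x = rng.normal(mu, 1.0, size=(100_000, mu.size))   # x_i | mu_i ~ N(mu_i, 1)

# Total squared error, averaged over the simulated data sets.
risk_js = np.mean(np.sum((james_stein(x) - mu) ** 2, axis=1))
risk_mle = np.mean(np.sum((x - mu) ** 2, axis=1))
print(risk_js, risk_mle)
```

The MLE risk should be close to $n=10$, and the James-Stein risk should be visibly smaller.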

## Ridge Regression

In linear regression ${\rm argmin}_{\hat\beta} \Vert y - X\hat\beta\Vert^2$, we can add a regularization term to form a new problem, called ridge regression:

$${\rm argmin}_{\hat\beta}\left\{\Vert y - X\hat\beta\Vert^2 + \lambda\Vert\hat\beta\Vert^2\right\}$$
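Setting the gradient $2X^T(X\hat\beta-y)+2\lambda\hat\beta$ to zero gives the closed form $\hat\beta=(X^TX+\lambda I_p)^{-1}X^Ty$. The sketch below computes it and cross-checks it against ordinary least squares on augmented data; the helper name `ridge`, the random data, and the value of `lam` are illustrative assumptions.

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge solution (X^T X + lam * I_p)^{-1} X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(3)
X = rng.standard_normal((20, 5))
y = rng.standard_normal(20)
lam = 0.7

beta = ridge(X, y, lam)

# Same problem written as least squares on augmented data:
# ||y - X b||^2 + lam ||b||^2 = ||[y; 0] - [X; sqrt(lam) I] b||^2.
X_aug = np.vstack([X, np.sqrt(lam) * np.eye(X.shape[1])])
y_aug = np.concatenate([y, np.zeros(X.shape[1])])
beta_aug, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)

print(np.max(np.abs(beta - beta_aug)))   # difference at rounding level
```

Using `np.linalg.solve` rather than an explicit inverse is a common numerical choice and does not change the solution.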

### Bayesian Rationale

Ridge regression can be motivated from a Bayesian viewpoint. Assume the true parameter follows a Gaussian prior,
$$\beta\sim N (0,\Sigma).$$

Together with the assumption $\epsilon\sim N(0,\sigma^2I_n)$ and $y = X\beta+\epsilon$ we learn that 
$$\left[\begin{matrix}y\\ \beta\end{matrix}\right]
\sim N\left(0,\left[\begin{matrix} X\Sigma X^T+\sigma^2I_n & X\Sigma  \\ \Sigma  X^T & \Sigma\end{matrix}\right]\right).
$$

Then the conditional Gaussian distribution is given by 
$$(\beta|y)\sim N\big(\Sigma X^T(X\Sigma X^T+\sigma^2I_n)^{-1}y,\ \dotsc\big)$$
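For reference, the conditional mean follows from the standard result for partitioned zero-mean Gaussian vectors, stated here in generic symbols $a$, $b$ that are not part of the notation above:

$$\left[\begin{matrix}a\\ b\end{matrix}\right]
\sim N\left(0,\left[\begin{matrix} \Sigma_{aa} & \Sigma_{ab}  \\ \Sigma_{ba} & \Sigma_{bb}\end{matrix}\right]\right)
\quad\Rightarrow\quad
(b|a)\sim N\big(\Sigma_{ba}\Sigma_{aa}^{-1}a,\ \Sigma_{bb}-\Sigma_{ba}\Sigma_{aa}^{-1}\Sigma_{ab}\big).
$$

Taking $a=y$ and $b=\beta$ gives $\Sigma_{ba}=\Sigma X^T$ and $\Sigma_{aa}=X\Sigma X^T+\sigma^2I_n$, which matches the mean above.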

When $\Sigma = \frac{2}{\lambda}\sigma^2I_p$, we can estimate $\beta$ by the conditional expectation,
$$\hat\beta = \Sigma X^T(X\Sigma X^T+\sigma^2I_n)^{-1}y=X^T(X X^T+\frac{\lambda}{2}I_n)^{-1}y.$$

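Note the following identity. Both matrices in parentheses are invertible when $\lambda>0$, so we may multiply by their inverses on the left and on the right: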
$$(X^TX+\frac{\lambda}{2}I_p)X^T = X^T(XX^T+\frac{\lambda}{2} I_n)\quad\Rightarrow\quad  X^T(XX^T+\frac{\lambda}{2} I_n)^{-1} = (X^TX+\frac{\lambda}{2}I_p)^{-1}X^T.$$

Thus, $\hat\beta = (X^TX+\frac{\lambda}{2}I_p)^{-1}X^Ty$ is exactly the ridge regression solution with penalty $\frac\lambda 2$, that is, ${\rm argmin}_{\beta}\left\{\Vert y - X\beta\Vert^2+\frac \lambda 2\Vert \beta\Vert^2\right\}$.

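A quick numerical check of the identity with $\frac{\lambda}{2}=0.5$ (the difference is at the level of floating-point rounding):

```python
import numpy as np

np.random.seed(0)
X = np.random.randn(8, 4)            # n = 8 observations, p = 4 features
y = np.random.randn(X.shape[0])
half_lam = 0.5                       # lambda / 2
# n x n form: X^T (X X^T + (lambda/2) I_n)^{-1} y
b1 = X.T @ np.linalg.inv(X @ X.T + half_lam * np.eye(X.shape[0])) @ y
# p x p form: (X^T X + (lambda/2) I_p)^{-1} X^T y
b2 = np.linalg.inv(X.T @ X + half_lam * np.eye(X.shape[1])) @ X.T @ y
b1 - b2
```

```
array([-8.32667268e-17, -1.11022302e-15, -2.08166817e-16, -1.11022302e-16])
```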

## Shrinkage

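Shrinkage lowers the total squared error, but not every coordinate benefits. When there is an outlier, that is, a $\mu_i$ far from the others, these estimators pull its estimate toward the rest and estimate it badly. It is better to remove outliers from the training data first.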
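The effect on an outlier can be sketched with a small simulation. The configuration below (19 means at zero and one hypothetical outlier at 5) and the helper `james_stein` are illustrative assumptions.

```python
import numpy as np

def james_stein(x):
    """Shrink each row of x toward its own mean with factor 1 - (n - 3) / S."""
    n = x.shape[-1]
    x_bar = x.mean(axis=-1, keepdims=True)
    S = np.sum((x - x_bar) ** 2, axis=-1, keepdims=True)
    return x_bar + (1.0 - (n - 3) / S) * (x - x_bar)

rng = np.random.default_rng(4)
mu = np.zeros(20)
mu[-1] = 5.0                                      # hypothetical outlier
x = rng.normal(mu, 1.0, size=(50_000, mu.size))   # x_i | mu_i ~ N(mu_i, 1)

err_js = (james_stein(x) - mu) ** 2
err_mle = (x - mu) ** 2

total_js, total_mle = err_js.sum(axis=1).mean(), err_mle.sum(axis=1).mean()
outlier_js, outlier_mle = err_js[:, -1].mean(), err_mle[:, -1].mean()
print(total_js, total_mle)       # risk summed over all 20 coordinates
print(outlier_js, outlier_mle)   # risk of the outlier coordinate only
```

In this setup the total James-Stein risk should stay below that of the MLE, while the outlier coordinate alone is estimated worse than by the MLE.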