## Honor 4 High Dimensional Data Analysis

## Background


### Rise of Dimensionality

High-dimensional data arise in a wide range of applications, for example microarray or proteomics data in disease classification, spatial-temporal data, and high-resolution images.

Sometimes the goal is only to predict future observations accurately, in which case a black-box model such as a neural network can work well. However, if we want to gain insight into the underlying relations among features, such as relations between genes, the problem becomes much more challenging.

### Impact of Dimensionality

High-dimensional data pose several challenges:

* Expensive Computation (large time complexity, slow convergence)
* Expensive Storage
* Hard Optimization (trapped in local minima, overfitting)
* Numerical Instability (accumulated noise, ...)


### Spurious Correlation

Spurious correlation refers to an apparent relation between variables that is absent from the underlying distribution. In high-dimensional data it typically arises from overfitting.

#### Example

Consider $n$ samples $(x,y)$ where $x\in\mathbb R^p$ and $y\in\mathbb R$. Each entry of $x$ is drawn independently from the Gaussian distribution $N(0,1)$, and $y$ is drawn from $N(0,1)$ independently of $x$.

By construction, $y$ is independent of $x$. However, in the undersampled high-dimensional regime $n\ll p$, even a linear regression fits the sample well and reports a strong correlation. This is called spurious correlation.
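The effect can be reproduced numerically. The sketch below is only an illustration: the sizes `n = 50`, `p = 200` and the random seed are arbitrary choices, and NumPy's minimum-norm `lstsq` solution stands in for "a linear regression" when $p>n$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 200  # undersampled regime: n << p (illustrative sizes)

X = rng.standard_normal((n, p))  # features, i.i.d. N(0, 1)
y = rng.standard_normal(n)       # response, drawn independently of X

# Minimum-norm least squares solution (X has more columns than rows).
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
residual = y - X @ beta_hat
r_squared = 1.0 - residual @ residual / np.sum((y - y.mean()) ** 2)

# Largest absolute sample correlation between y and any single feature.
Xc = X - X.mean(axis=0)
yc = y - y.mean()
corr = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
max_abs_corr = np.max(np.abs(corr))

print(f"in-sample R^2: {r_squared:.6f}")
print(f"max |corr(x_j, y)|: {max_abs_corr:.3f}")
```

Because $X$ has rank $n$ almost surely, the fit interpolates the sample, so the in-sample $R^2$ is essentially $1$ even though no feature carries information about $y$.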

## Multiple Linear Regression

Now we review the concept of multiple linear regression. 



Let $X\in\mathbb R^{n\times p}$ be the covariate matrix and $y\in\mathbb R^n$ be the responses. The true model is $y = X\beta + \epsilon$, where $\epsilon$ is the noise and $\beta$ is the vector of unknown coefficients. We estimate $\beta$ by least squares:
$$\hat\beta = {\rm argmin}_{\beta} \Vert y - X\beta\Vert^2$$

### Heteroskedasticity

When $\mathbb E(\epsilon_i) = 0$ but ${\rm Var}(\epsilon_i) = \sigma^2w_i$ where $w_i$ is a known positive constant and $\sigma^2$ is unknown (i.e. the noise scale differs across observations), the problem is said to be heteroskedastic.

We can reweight the problem by setting $\tilde y_i = w_i^{-\frac 12}y_i$, $\tilde x_i = w_i^{-\frac 12}x_i$ and $\tilde \epsilon_i = w_i^{-\frac 12}\epsilon_i\sim (0,\sigma^2)$. The noise $\tilde\epsilon$ is now homoskedastic, so we can use OLS (ordinary least squares):

$$\hat \beta = (\tilde X^T\tilde X)^{-1}\tilde X^T\tilde y=(X^TW_0^{-1}X)^{-1}X^TW_0^{-1}y.$$

Here $W_0 = {\rm diag}[w_1,\dotsc,w_n]\in\mathbb R^{n\times n}$, so that ${\rm Cov}(\epsilon)=\sigma^2W_0$ when the noise terms are uncorrelated. This is called weighted least squares.
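As a quick numerical check, the two routes (rescaling the rows and then running OLS, or applying the closed form directly) should return the same estimate. The following is a minimal sketch on simulated data; the coefficients, weights, and noise level are made-up values for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 3
X = rng.standard_normal((n, p))
beta = np.array([1.0, -2.0, 0.5])   # hypothetical true coefficients
w = rng.uniform(0.5, 5.0, size=n)   # known positive weights w_i
sigma = 0.3                         # noise scale, unknown in practice
y = X @ beta + sigma * np.sqrt(w) * rng.standard_normal(n)

# Route 1: rescale each row by w_i^{-1/2}, then run OLS.
scale = 1.0 / np.sqrt(w)
X_tilde = X * scale[:, None]
y_tilde = y * scale
beta_rescaled, *_ = np.linalg.lstsq(X_tilde, y_tilde, rcond=None)

# Route 2: closed form (X^T W0^{-1} X)^{-1} X^T W0^{-1} y.
XtWinv = X.T / w                    # equals X^T W0^{-1} because W0 is diagonal
beta_closed = np.linalg.solve(XtWinv @ X, XtWinv @ y)

print(beta_rescaled)
print(beta_closed)
```

Note that $\sigma^2$ never enters the estimator: only the relative weights $w_i$ matter.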

#### Generalized Least Squares

In practice, the true $W_0$ that describes the scaling of the variances is often not known in advance. However, for any fixed positive definite $W$ we choose, even a **wrong** one, the estimator $\hat\beta = (X^TW^{-1}X)^{-1}X^TW^{-1}y$ is unbiased. Treating $X$ as fixed and using $\mathbb E\epsilon = 0$,
$$\mathbb E\hat\beta = \mathbb E\left\{(X^TW^{-1}X)^{-1}X^TW^{-1}y\right\} = \mathbb E\left\{(X^TW^{-1}X)^{-1}X^TW^{-1}(X\beta+\epsilon)\right\}=\beta.$$


So it suffices to look into the variance of $\hat\beta$. Writing ${\rm Cov}(\epsilon)=\sigma^2W_0$ with $W_0={\rm diag}[w_1,\dotsc,w_n]$, the variance has the form below, and under mild conditions it vanishes at rate $O(n^{-1})$ as $n\rightarrow \infty$.
$${\rm Var}(\hat\beta) =\sigma^2(X^TW^{-1}X)^{-1}(X^TW^{-1}W_0W^{-1}X)(X^TW^{-1}X)^{-1}= O(n^{-1}) $$
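Both claims can be checked by simulation. The sketch below assumes a fixed design, noise covariance $\sigma^2\,{\rm diag}(w_0)$, and a deliberately misspecified $W=I$ (plain OLS); all numerical values and the helper names `gls` and `sandwich_variance` are illustrative choices rather than part of the notes.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, sigma = 100, 2, 1.0
X = rng.standard_normal((n, p))          # fixed design
beta = np.array([2.0, -1.0])             # hypothetical true coefficients
w0 = np.linspace(0.2, 5.0, n)            # true variance scales w_i

def gls(X, y, w):
    """Return (X^T W^{-1} X)^{-1} X^T W^{-1} y for diagonal W = diag(w)."""
    XtWinv = X.T / w
    return np.linalg.solve(XtWinv @ X, XtWinv @ y)

def sandwich_variance(X, w, w0, sigma):
    """Covariance of gls(., ., w) when Cov(eps) = sigma^2 diag(w0)."""
    XtWinv = X.T / w
    bread = np.linalg.inv(XtWinv @ X)
    meat = (XtWinv * w0) @ XtWinv.T      # X^T W^{-1} W0 W^{-1} X
    return sigma**2 * bread @ meat @ bread

reps = 5000
w_wrong = np.ones(n)                     # misspecified W = I, i.e. plain OLS
est_wrong = np.empty((reps, p))
for r in range(reps):
    y = X @ beta + sigma * np.sqrt(w0) * rng.standard_normal(n)
    est_wrong[r] = gls(X, y, w_wrong)

mean_wrong = est_wrong.mean(axis=0)                  # should be close to beta
var_wrong_mc = np.cov(est_wrong, rowvar=False)       # Monte Carlo covariance
var_wrong_formula = sandwich_variance(X, w_wrong, w0, sigma)
var_right_formula = sandwich_variance(X, w0, w0, sigma)
```

In this setup the wrong $W$ still gives an estimate centered at $\beta$, but its variance is larger than with the correct weights, which is why the choice of $W$ matters for efficiency rather than for bias.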

### Ridge Regression

Ridge regression handles the case where $X^TX$ is singular or $X$ has small singular values by using $\hat\beta = (X^TX+\lambda I)^{-1}X^Ty$ with $\lambda>0$. It also trades some bias for reduced variance, and for a suitable $\lambda$ it attains a smaller MSE than OLS.
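A minimal sketch of the ridge estimator in the singular case $p>n$ follows. The data, the sparse coefficient vector, and the grid of $\lambda$ values are assumptions made for illustration, and `ridge` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 30, 100                      # p > n, so X^T X is singular
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:5] = 1.0                      # hypothetical true coefficients
y = X @ beta + 0.1 * rng.standard_normal(n)

def ridge(X, y, lam):
    """Solve (X^T X + lam I) beta = X^T y for lam > 0."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rank_gram = np.linalg.matrix_rank(X.T @ X)   # at most n, hence less than p
beta_ridge = ridge(X, y, lam=1.0)            # well defined despite singular X^T X

# Larger lam shrinks the solution towards zero.
norms = [np.linalg.norm(ridge(X, y, lam)) for lam in (0.1, 1.0, 10.0, 100.0)]
print(rank_gram, norms)
```

Adding $\lambda I$ lifts every eigenvalue of $X^TX$ by $\lambda$, which is what makes the linear system solvable here.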

#### Bridge Regression

A generalization of ridge regression is the bridge regression:
$$\min_\beta \left\{\Vert y -X\beta\Vert^2 + \lambda \sum_j |\beta_j|^q\right\}.$$

Here $q\geqslant 1$ guarantees convexity. When $q =2$ it is ridge regression. When $q = 1$ it reduces to the LASSO.

## Subset Selection

Subset selection might outperform ridge regression in sparse models.


Assume the true model is $y = f(x) +\epsilon$, where the noise terms $\epsilon \sim (0,\sigma^2)$ are uncorrelated and homoskedastic. Let $p$ be the dimension (the number of selected features) of the fitted model $\hat f$, and let $n$ be the total number of samples.


Common criteria for subset selection include $C_p$, AIC, and BIC.


### $C_p$

The criterion $C_p$ is given by $C_p = \frac{{\rm SSE}_P}{\sigma^2}- (n - 2p)$, where ${\rm SSE}_P$ is the residual sum of squares of the fitted model.

In multiple linear regression $f(x) = x^T\beta$, let $P$ be the chosen subset of features and $X_P$ be the corresponding covariate submatrix. Then we have the following theorem:

**Theorem** Let $\hat y_P = X_P(X_P^TX_P)^{-1}X_P^Ty$ be the least squares fit using the feature subset $P$. If the true model is $y = X\beta +\epsilon$ with $\epsilon\sim (0,\sigma^2)$, then
$$\mathbb E C_p =\mathbb E \left\{ \frac{1}{\sigma^2} \Vert \hat y_P - \mathbb E(y)\Vert^2\right\}$$

**Proof** On the one hand, let $H_P = X_P(X_P^TX_P)^{-1}X_P^T$ be the hat matrix. Assuming $X_P$ has full column rank $p$, $H_P$ is symmetric and idempotent with ${\rm tr}(H_P)=p$, and $\hat y_P = H_P y$, so
$$\begin{aligned}\mathbb E {\rm SSE}_P &=\mathbb E\Vert y - H_Py\Vert^2 =
\mathbb E\Vert (I - H_P)(X\beta + \epsilon)\Vert^2
\\ & = \beta^T X^T(I - H_P)^2X\beta +\sigma^2 {\rm tr}\left\{(I - H_P)^2\right\}
\\ & = \beta^T X^T(I - H_P)X\beta +\sigma^2 (n-p).
\end{aligned}
$$

This implies $\mathbb E C_p = \sigma^{-2}  \beta^T X^T(I - H_P)X\beta+p$.

On the other hand,
$$\begin{aligned}\mathbb E \left\{ \frac{1}{\sigma^2} \Vert \hat y_P - \mathbb E(y)\Vert^2\right\}
&=\frac{1}{\sigma^2}\mathbb E\Vert H_P (X\beta  + \epsilon) - X\beta\Vert^2
\\ &= \frac{1}{\sigma^2}\left(\beta^TX^T(I - H_P)^2X\beta + \sigma^2{\rm tr}(H_P^2)\right)
\\ &= \sigma^{-2}  \beta^T X^T(I - H_P)X\beta+p.
\end{aligned}
$$

The two expressions coincide, which proves the theorem.
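The identity can also be checked by Monte Carlo. The sketch below assumes Gaussian noise with known $\sigma$, a fixed design, and a subset $P$ that deliberately omits two relevant features; every numerical value is a made-up illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
n, sigma = 40, 1.0
X = rng.standard_normal((n, 6))                      # full covariate matrix
beta = np.array([1.0, -1.0, 0.5, 0.0, 0.0, 0.3])     # hypothetical truth
subset = [0, 1, 3]                                   # chosen subset P (omits features 2 and 5)
p = len(subset)

X_P = X[:, subset]
H_P = X_P @ np.linalg.solve(X_P.T @ X_P, X_P.T)      # hat matrix of the submodel
mean_y = X @ beta                                    # E(y)

reps = 20000
eps = sigma * rng.standard_normal((reps, n))
Y = mean_y + eps                                     # each row is one simulated y
Y_hat = Y @ H_P                                      # H_P is symmetric, so row i is H_P y_i

sse = np.sum((Y - Y_hat) ** 2, axis=1)
cp = sse / sigma**2 - (n - 2 * p)                    # C_p for each replicate
scaled_risk = np.sum((Y_hat - mean_y) ** 2, axis=1) / sigma**2

# Common theoretical value: sigma^{-2} beta^T X^T (I - H_P) X beta + p.
theory = mean_y @ (np.eye(n) - H_P) @ mean_y / sigma**2 + p

print(cp.mean(), scaled_risk.mean(), theory)
```

The three printed numbers should agree up to simulation error, although individual values of $C_p$ fluctuate far more than the scaled prediction error they estimate.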



### AIC

Here "IC" in "AIC" stands for "information criterion" (AIC is the Akaike information criterion). Let $\mathcal L$ be the likelihood of the observations evaluated at the fitted model, then

$${\rm AIC} = -2\log \mathcal L + 2p$$

In multiple linear regression with independent Gaussian noise $N(0,\sigma^2)$, where the same known $\sigma^2$ is used for every candidate model, minimizing AIC is equivalent to minimizing the $C_p$ criterion.
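The equivalence follows from a short derivation, sketched here under the assumptions that $\sigma^2$ is known and shared by all candidate models, and that $\mathcal L$ is the Gaussian likelihood evaluated at the least squares fit on the subset $P$, whose residual sum of squares is ${\rm SSE}_P$:

$$\begin{aligned}
-2\log\mathcal L &= n\log(2\pi\sigma^2) + \frac{{\rm SSE}_P}{\sigma^2},\\
{\rm AIC} &= \frac{{\rm SSE}_P}{\sigma^2} + 2p + n\log(2\pi\sigma^2) = C_p + n + n\log(2\pi\sigma^2).
\end{aligned}$$

The two criteria differ by a term that does not depend on the chosen model, so they rank candidate subsets identically.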

### BIC

BIC refers to the Bayesian information criterion.

$${\rm BIC} = -2\log \mathcal L + p\log n$$

### $L_0$ Penalized Least Squares

In multiple linear regression, consider the general form of $L_0$ penalty,
$$\min_\beta \left\{ \Vert y - X\beta\Vert^2 + \lambda \Vert \beta\Vert_0\right\}.$$

Here the $0$-norm $\Vert \beta\Vert_0$ is the number of nonzero entries of $\beta$. Using the notation above, $\Vert \beta\Vert_0 = p$.

<br>

The $C_p$ criterion is equivalent to using $\lambda = 2\sigma^2$: multiplying $C_p$ by $\sigma^2$ gives ${\rm SSE}_P + 2\sigma^2 p - n\sigma^2$, and the last term does not depend on the model. As mentioned earlier, AIC and $C_p$ are equivalent if the noise is Gaussian and the same $\sigma^2$ is used, so AIC also corresponds to $\lambda = 2\sigma^2$.

Analogously, BIC is equivalent to $\lambda = \sigma^2\log n$ in the penalty under the same assumptions.
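For a small number of features, the $L_0$-penalized problem can be solved by brute force over all subsets, which makes the two choices of $\lambda$ easy to compare. The sketch below assumes a known $\sigma$, no intercept, and simulated data with a made-up sparse coefficient vector; `sse_of_subset` and `best_subset` are hypothetical helper names.

```python
import itertools
import numpy as np

rng = np.random.default_rng(5)
n, d, sigma = 100, 6, 1.0
X = rng.standard_normal((n, d))
beta = np.array([2.0, -1.5, 0.0, 0.0, 1.0, 0.0])   # hypothetical sparse truth
y = X @ beta + sigma * rng.standard_normal(n)

def sse_of_subset(X, y, subset):
    """Residual sum of squares of the least squares fit on the given columns."""
    if not subset:
        return float(y @ y)
    X_P = X[:, list(subset)]
    coef, *_ = np.linalg.lstsq(X_P, y, rcond=None)
    r = y - X_P @ coef
    return float(r @ r)

def best_subset(X, y, lam):
    """Exhaustively minimize SSE_P + lam * |P| over all subsets P."""
    d = X.shape[1]
    candidates = (
        s for k in range(d + 1) for s in itertools.combinations(range(d), k)
    )
    return min(candidates, key=lambda s: sse_of_subset(X, y, s) + lam * len(s))

subset_cp = best_subset(X, y, lam=2 * sigma**2)            # C_p / AIC penalty
subset_bic = best_subset(X, y, lam=sigma**2 * np.log(n))   # BIC penalty

print("C_p / AIC:", subset_cp)
print("BIC:      ", subset_bic)
```

Since $\log n > 2$ once $n\geqslant 8$, BIC penalizes model size more heavily and never selects a larger subset than $C_p$ in this formulation. The exhaustive search visits $2^d$ subsets, so it is only feasible for small $d$.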
