# Feature selection methods

**Feature selection** methods can be divided into three categories:

## Filter methods
Filter methods rank features by their intrinsic properties (i.e., the “relevance” of the features). Relevance is measured with statistical analysis rather than with the cross-validation performance of a model.

1. information gain
2. chi-square test
    > A chi-square test is used to test the independence of two events. In feature selection, we can test the dependence between a predictor and the response. The higher the chi-square value, the more strongly the feature depends on the response, and the better a candidate it is for model training.
3. fisher score
4. correlation coefficient
    > Test correlation between all predictors. Features with high correlation are more linearly dependent on each other and hence have almost the same effect on the dependent variable. So, when two features have high correlation, we can drop one of the two features. 
5. variance threshold
   > Variance Thresholds: We assume that features with **a higher variance may contain more useful information**, so we simply compute the variance of each feature, and select the subset of features based on a specified threshold. E.g., “keep all features that have a variance greater or equal to x” or “keep the top k features with the largest variance”.
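
A minimal sketch of two of the filters above (variance threshold and correlation coefficient), assuming NumPy is available. The toy data, the helper names `variance_filter` and `correlation_filter`, and the cut-offs `1e-8` and `0.95` are illustrative assumptions, not recommended defaults.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)   # nearly a copy of x1
x3 = rng.normal(size=n)               # independent feature
x4 = np.full(n, 3.0)                  # constant feature, zero variance
X = np.column_stack([x1, x2, x3, x4])

def variance_filter(X, threshold):
    """Indices of columns whose variance is at least `threshold`."""
    return [j for j in range(X.shape[1]) if X[:, j].var() >= threshold]

def correlation_filter(X, cols, max_abs_corr):
    """Greedily keep a column only if it is not too correlated with a kept one."""
    kept = []
    for j in cols:
        corrs = [abs(np.corrcoef(X[:, j], X[:, k])[0, 1]) for k in kept]
        if all(c <= max_abs_corr for c in corrs):
            kept.append(j)
    return kept

after_variance = variance_filter(X, threshold=1e-8)
selected = correlation_filter(X, after_variance, max_abs_corr=0.95)
print(after_variance, selected)
```

The constant column is removed by the variance filter, and the near-duplicate of `x1` is removed by the correlation filter. Neither step looks at the response or fits a model, which is what makes them filter methods.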

## Wrapper methods
Wrapper methods measure the “usefulness” of features based on the classifier performance. They are essentially solving the “real” problem, which is optimizing the classifier performance, but they are also computationally more expensive compared to filter methods due to the repeated learning steps and cross-validation.

1. recursive feature elimination
2. sequential feature selection algorithms
3. genetic algorithms

## Embedded methods
Embedded methods are quite similar to wrapper methods, since they are also used to optimize the objective function or performance of a learning algorithm or model. The difference is that the selection relies on a metric that is intrinsic to model building and is applied during learning itself.

1. L1 (LASSO) regularization
    > L1 (or LASSO) regression for generalized linear models can be understood as adding a penalty against complexity to reduce the degree of overfitting or variance of a model by adding more bias.
2. decision tree

# Subset Selection / Wrapper methods

**Subset Selection:** Also called **wrapper methods** in feature selection. This approach involves identifying a subset of the p predictors that we believe to be related to the response. We then fit a model using least squares on the reduced set of variables.

## Best Subset Selection

To perform best subset selection, we fit a separate regression model for each possible combination of the $p$ predictors, which gives $2^p$ models in total.

1. Let **$M_0$ denote the null model**, which contains no predictors. This model simply predicts the sample mean for each observation.
 
2. For k=1, we **fit all** p models that contain exactly one predictor. **Pick the best** among these models, and call it $M_1$. Here best is defined as having the smallest RSS, or equivalently largest $R^2$.

3. For k=2, fit all $\binom{p}{2}= p(p-1)/2$ models that contain exactly two predictors, and pick the best as $M_2$.

4. In general, for each k = 1,...,p, fit all $\binom{p}{k}$ models that contain exactly k predictors, and pick the best among these models as $M_k$.

5. Select **a single best model** from among $M_0,..., M_p$. We can use **cross-validated prediction MSE, $C_p (AIC)$, $BIC$, or adjusted $R^2$** as criterion.

> Among these p+1 models, as **the number of features included in the model increases**, RSS **decreases monotonically** and $R^2$ **increases monotonically**. Therefore, if we use these statistics to select the best model, then we will always end up with a model involving **all of the variables**. A low RSS or a high $R^2$ indicates a model with a **low training error**, which by no means guarantees a **low test error**. Therefore, we use cross-validated prediction MSE, $C_p (AIC)$, $BIC$, or adjusted $R^2$.

**Drawbacks:**

1. The number of models that this procedure fits grows exponentially with p ($2^p$ in total). Therefore, best subset selection becomes **computationally infeasible** for values of p greater than around 40, even with extremely fast modern computers.
2. The larger the **search space** (p is large), the higher the chance of finding **overfitting models**, which look good on the training data, but might not have any predictive power on future data.
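
The procedure above can be sketched as a brute-force search, assuming NumPy and a small p. The helper names `rss` and `best_subset` and the simulated data are hypothetical, introduced only for illustration; step 5 (choosing among $M_0,..., M_p$) is left out.

```python
import itertools
import numpy as np

def rss(X, y, cols):
    """RSS of the least squares fit on columns `cols` plus an intercept."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)

def best_subset(X, y):
    """Return {k: (columns of M_k, RSS of M_k)} for k = 0, ..., p."""
    p = X.shape[1]
    best = {}
    for k in range(p + 1):
        fits = ((cols, rss(X, y, cols))
                for cols in itertools.combinations(range(p), k))
        best[k] = min(fits, key=lambda t: t[1])   # smallest RSS of size k
    return best

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(size=100)  # only x0, x3 matter

models = best_subset(X, y)
for k, (cols, value) in models.items():
    print(k, cols, round(value, 1))
```

The printed RSS never increases with k, which is why the final choice among $M_0,..., M_p$ has to use a criterion other than training RSS.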

## Stepwise Selection

### Forward Stepwise Selection

Forward Stepwise Selection begins with **the null model** containing no predictors, and then iteratively adds **the most useful predictor, one-at-a-time**. 

Forward stepwise selection is **computationally efficient**: it considers a much smaller set of models than best subset selection. Unlike best subset selection, which involves fitting $2^p$ models, forward stepwise selection involves fitting one null model, along with p − k models in the kth iteration, for k = 0,...,p−1. This **amounts to a total of $1+\sum_{k=0}^{p-1}(p-k) = 1+p(p+1)/2$ models.**
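
As a quick arithmetic check of the two model counts, here is a small sketch using only the standard library. The function names are hypothetical, and p = 20 is just an example value.

```python
from math import comb

def n_models_best_subset(p):
    """Sum of C(p, k) over k = 0..p, which equals 2**p."""
    return sum(comb(p, k) for k in range(p + 1))

def n_models_forward_stepwise(p):
    """One null model plus (p - k) fits in iteration k = 0..p-1."""
    return 1 + sum(p - k for k in range(p))

p = 20
print(n_models_best_subset(p), n_models_forward_stepwise(p))
```

For p = 20 this gives 1,048,576 models for best subset selection against 211 for forward stepwise selection.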

1. Let $M_0$ denote the null model, which contains no predictors.
2. For k=0, consider all p models that contain exactly one predictor. Choose the best among these p models, and call it $M_1$. We can define the best model as the one with the smallest RSS or highest $R^2$.
3. For k=1, consider the p-1 features that were not selected in previous rounds. Select the feature that, when added to $M_1$, gives the best fit (smallest RSS or highest $R^2$). The model with the additional feature is $M_2$.
4. Repeat the process until all p predictors have been added, producing models $M_0,..., M_p$. Select **a single best model** among them. We can use **cross-validated prediction MSE, $C_p (AIC)$, $BIC$, or adjusted $R^2$** as criterion.
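
The steps above can be sketched as a greedy loop, assuming NumPy. The helpers `rss` and `forward_stepwise` and the simulated data are hypothetical; the sketch only builds the path $M_0,..., M_p$ and leaves the final choice among them to a criterion such as cross-validation.

```python
import numpy as np

def rss(X, y, cols):
    """RSS of the least squares fit on columns `cols` plus an intercept."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)

def forward_stepwise(X, y):
    """Return [(columns of M_k, RSS of M_k)] for k = 0, ..., p."""
    selected, remaining = [], list(range(X.shape[1]))
    path = [((), rss(X, y, []))]                  # M_0: the null model
    while remaining:
        # Add the predictor that lowers RSS the most given those already in.
        best_j = min(remaining, key=lambda j: rss(X, y, selected + [j]))
        selected.append(best_j)
        remaining.remove(best_j)
        path.append((tuple(selected), rss(X, y, selected)))
    return path

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(size=100)

path = forward_stepwise(X, y)
for cols, value in path:
    print(cols, round(value, 1))
```

Each model on the path contains the previous one, so the search is nested by construction. That nesting is what makes the method cheap, and also what prevents it from exploring every subset.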


**Difference between best subset and forward stepwise selection**

<img src="./images/65.png" width=900>

Though forward stepwise tends to do well in practice, it is **not guaranteed to find the best possible model** out of all $2^p$ models containing subsets of the p predictors. 
> For instance, suppose that in a given data set with p = 3 predictors, the best possible one-variable model contains X1, and the best possible two-variable model instead contains X2 and X3. Then forward stepwise selection will fail to select the best possible two-variable model, because the best model for only 1 predictor will contain X1, so the final best model must also contain X1 together with one additional variable.

### Backward Stepwise Selection (Recursive Feature Elimination)

Backward Stepwise Selection begins with **the full model** containing all p predictors, and then iteratively removes **the least useful predictor, one-at-a-time**.

1. Let $M_p$ denote the full model, which contains all p predictors.
2. For k=p, consider all p models which remove exactly one predictor from $M_p$. Choose the best among these p models, and call it $M_{p-1}$. We can define the best model as the one with the smallest RSS or highest $R^2$.
3. For k=p-1, consider the p-1 features that remain in $M_{p-1}$. Select the feature that, when removed, gives the best fit (smallest RSS or highest $R^2$). The model after removing the feature is $M_{p-2}$.
4. Repeat the process until no predictors remain, producing models $M_0,..., M_p$. Select **a single best model** among them. We can use **cross-validated prediction MSE, $C_p (AIC)$, $BIC$, or adjusted $R^2$** as criterion.

Like forward stepwise selection, the backward selection approach searches through only **1+p(p+1)/2** models, and so can be applied in settings where p is too large to apply best subset selection. Unlike forward selection, it requires n > p so that the full least squares model can be fit.

Also like forward stepwise selection, backward stepwise selection is **not guaranteed to yield the best model** containing a subset of the p predictors.
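
A matching sketch for backward stepwise selection, again assuming NumPy. The helpers `rss` and `backward_stepwise` and the simulated data are hypothetical, introduced only for illustration.

```python
import numpy as np

def rss(X, y, cols):
    """RSS of the least squares fit on columns `cols` plus an intercept."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)

def backward_stepwise(X, y):
    """Return [(columns, RSS)] from the full model M_p down to M_0."""
    selected = list(range(X.shape[1]))
    path = [(tuple(selected), rss(X, y, selected))]   # M_p: the full model
    while selected:
        # Drop the predictor whose removal increases RSS the least.
        drop = min(selected,
                   key=lambda j: rss(X, y, [c for c in selected if c != j]))
        selected.remove(drop)
        path.append((tuple(selected), rss(X, y, selected)))
    return path

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(size=100)

path = backward_stepwise(X, y)
for cols, value in path:
    print(cols, round(value, 1))
```

In this simulated example the three noise columns are dropped first, so the two-predictor model on the path contains columns 0 and 3.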

### Hybrid Approaches (Bidirectional elimination)

1. Variables are added to the model sequentially, in analogy to forward selection.
2. However, after adding each new variable, the method may also remove any variables that no longer provide an improvement in the model fit.

Such an approach attempts to more closely mimic best subset selection while retaining the computational advantages of forward and backward stepwise selection.

# Evaluation metric for choosing the optimal regression model

The training error can be a poor estimate of the test error. Therefore, RSS, MSE(RSS/n) and $R^2$ are not suitable for selecting the best model among a collection of models with different numbers of predictors.

**2 Methods**:

1. **Indirectly** estimate test error by making an **adjustment to the training error** to account for the bias due to overfitting.

2. **Directly** estimate the test error, using either a validation set approach or a cross-validation approach

## Statistical evaluation metrics that adjust the training error for bias: $C_p$, $AIC$, $BIC$, Adjusted $R^2$
These metrics indirectly estimate the test error by making an adjustment to the training error.

**Why adjusting training error?**
- The training set error is generally an underestimate of the test error. When we achieve a model with minimum training error, it doesn't guarantee that the test error will also be the smallest.
- Especially the training error will decrease as more variables are included in the model, but the test error may not. 
- Therefore, training set RSS and training set $R^2$ cannot be used for model selection.


### Mallows' $C_p$

For a fitted least squares model containing d predictors, the $C_p$ estimate of test MSE is

\begin{align}
C_p= \frac{1}{n}(RSS+2d\hat{\sigma}^2)
\end{align}

where $\hat{\sigma}^2$ is an estimate of the variance of the error $\epsilon$.

Mallows' original statistic is usually written as

\begin{align}
C_p'= \frac{RSS}{\hat{\sigma}^2}+2d-n
\end{align}

The two forms are not equal, but they are related by $C_p=\frac{\hat{\sigma}^2}{n}(C_p'+n)$. Since $\hat{\sigma}^2$ and n are the same for every candidate model, both forms rank the models identically and are minimized by the same model.

**Note**:
- The $C_p$ statistic adds a **penalty** of $2d\hat{\sigma}^2$ to the training RSS in order to adjust for the fact that the training error tends to underestimate the test error.
- The **penalty increases as the number of predictors in the model increases**; this is intended to adjust for the corresponding decrease in training RSS.
- If $\hat{\sigma}^2$ is an unbiased estimate of $\sigma^2$, then $C_p$ is an unbiased estimate of test **MSE**. Typically, we estimate $\sigma^2$ using $MSE_{all}$, the mean squared error obtained from fitting the model containing **all of the candidate predictors**.

**How to determine which set of models is best with $C_p$ statistic?**

The "near d" guidelines below refer to the original form $C_p'$, with d counting every estimated coefficient (including the intercept) and $\hat{\sigma}^2 = MSE_{all}$.

1. Choose the model with **the lowest $C_p$ value** (this applies to either form).
2. Identify the model for which the $C_p'$ value is **near d**.
> When the $C_p'$ value is near d, the bias is small (next to none). When it's much greater than d, the bias is substantial. When it's below d, it is due to sampling error; interpret as no bias.
3. The full model always yields $C_p'$ = d, so don't select the full model based on this rule.
4. If **all models**, except the full model, yield a large $C_p'$ not near d, it suggests some **important predictor(s) are missing**. In this case, we are well-advised to identify the predictors that are missing.
5. When more than one model has a small $C_p'$ value near d, in general, choose **the simpler model** or the model that meets your research needs.
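
A minimal numerical sketch of Mallows' statistic, assuming NumPy, simulated data, and nested candidate models. It computes the scaled form $\frac{1}{n}(RSS+2d\hat{\sigma}^2)$ and the classical form $\frac{RSS}{\hat{\sigma}^2}+2d-n$. As an assumption made for this sketch, d counts every estimated coefficient including the intercept, and $\hat{\sigma}^2$ is the full-model RSS divided by its residual degrees of freedom.

```python
import numpy as np

def fit_rss(A, y):
    """RSS of the least squares fit of y on the columns of A."""
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)

rng = np.random.default_rng(1)
n, p = 200, 6
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(size=n)

# Nested candidates: intercept plus the first k predictors, k = 0..p.
designs = [np.column_stack([np.ones(n), X[:, :k]]) for k in range(p + 1)]
rss_values = [fit_rss(A, y) for A in designs]
n_params = [A.shape[1] for A in designs]            # d, intercept included

sigma2_hat = rss_values[-1] / (n - n_params[-1])    # MSE of the full model

cp_scaled = [(r + 2 * d * sigma2_hat) / n
             for r, d in zip(rss_values, n_params)]
cp_classical = [r / sigma2_hat + 2 * d - n
                for r, d in zip(rss_values, n_params)]

for d, a, b in zip(n_params, cp_scaled, cp_classical):
    print(d, round(a, 3), round(b, 2))
```

With this choice of $\hat{\sigma}^2$ the classical value for the full model equals its number of coefficients, and both forms are minimized by the same candidate.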

### AIC
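The AIC criterion is defined for a large class of models fit by maximum likelihood. For a linear model with Gaussian errors, maximum likelihood and least squares are the same thing.
In this case AIC is given by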

\begin{align}
AIC=\frac{1}{n\hat{\sigma}^2}(RSS+2d\hat{\sigma}^2)
\end{align}

For least squares models, $C_p$ and AIC are proportional to each other.

Given a set of candidate models for the data, the preferred model is the one with the **minimum AIC value**. AIC examines **goodness of fit**, but it also includes **a penalty** that is an increasing function of the number of estimated parameters, which discourages overfitting.

### BIC
For the least squares model with d predictors, the BIC is, up to irrelevant constants, given by

\begin{align}
BIC=\frac{1}{n}(RSS+\log(n)d\hat{\sigma}^2)
\end{align}

BIC will tend to take on a small value for a model with **a low test error**.
Since log(n) > 2 for any n > 7, the BIC statistic generally places a **heavier penalty** on models with many variables, and hence results in the selection of **smaller models** than $C_p$.
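
The three formulas above can be compared side by side with a short sketch. The function names and all numbers (n, $\hat{\sigma}^2$, and the RSS of each candidate) are made up purely to exercise the formulas.

```python
import math

def cp(rss, d, n, sigma2):
    return (rss + 2 * d * sigma2) / n

def aic(rss, d, n, sigma2):
    return (rss + 2 * d * sigma2) / (n * sigma2)

def bic(rss, d, n, sigma2):
    return (rss + math.log(n) * d * sigma2) / n

# Hypothetical values: d (number of predictors) -> training RSS.
n, sigma2 = 100, 1.5
candidates = {3: 177.0, 4: 172.0, 5: 169.5}

best_cp = min(candidates, key=lambda d: cp(candidates[d], d, n, sigma2))
best_aic = min(candidates, key=lambda d: aic(candidates[d], d, n, sigma2))
best_bic = min(candidates, key=lambda d: bic(candidates[d], d, n, sigma2))
print(best_cp, best_aic, best_bic)
```

With these made-up numbers, $C_p$ and AIC both pick d = 4 (they differ only by the factor $1/\hat{\sigma}^2$), while BIC picks d = 3 because its per-predictor penalty $\log(n)\hat{\sigma}^2$ is larger than $2\hat{\sigma}^2$.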

### Adjusted $R^2$ 

Recall:
\begin{align}
R^2=1 - \frac{RSS}{TSS} = 1-\frac{RSS}{\sum(y_i-\bar{y})^2}
\end{align}

**TSS**: total sum of squares for the response


**Why not $R^2$?**
Since RSS always decreases as more variables are added to the model, the $R^2$ always increases as more variables are added. 


For a least squares model with d variables, **the adjusted $R^2$** statistic is calculated as
\begin{align}
Adjusted \, R^2=1 - \frac{RSS/(n-d-1)}{TSS/(n-1)}
\end{align}


**How to determine which set of models is best with the adjusted $R^2$**:
- **A large value of adjusted $R^2$** indicates a model with a small test error. In theory, the model with the largest adjusted $R^2$ will have **only correct variables and no noise variables**. Maximizing the adjusted $R^2$ is equivalent to minimizing $\frac{RSS}{n-d-1}$, which may increase or decrease as variables are added because of the presence of d in the denominator.
> Why: Once **all of the correct variables** have been included in the model, adding additional *noise* variables will lead to only a **very small decrease in RSS**. Because d also grows by one, such variables will lead to an increase in $\frac{RSS}{n-d-1}$, and hence a decrease in the adjusted $R^2$. Therefore, unlike the $R^2$ statistic, the adjusted $R^2$ statistic **pays a price for the inclusion of unnecessary variables** in the model.
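
A small worked example of that price, using only arithmetic. The values of n, TSS, and RSS are hypothetical, chosen so that adding a fourth (noise) variable barely lowers the RSS.

```python
def r2(rss, tss):
    return 1 - rss / tss

def adjusted_r2(rss, tss, n, d):
    return 1 - (rss / (n - d - 1)) / (tss / (n - 1))

n, tss = 50, 100.0
rss_3, rss_4 = 20.0, 19.9   # RSS with d = 3, then with one extra noise variable

print(r2(rss_3, tss), r2(rss_4, tss))
print(adjusted_r2(rss_3, tss, n, 3), adjusted_r2(rss_4, tss, n, 4))
```

Here $R^2$ rises from 0.800 to 0.801, while the adjusted $R^2$ falls from about 0.787 to about 0.783, because $\frac{RSS}{n-d-1}$ goes up from 20/46 to 19.9/45.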

## Directly estimate the test error using validation or cross-validation 

We can also directly estimate the test error using the validation set and cross-validation method. We can compute the validation set error or the cross-validation error for each model under consideration, and then select the model for which the resulting estimated test error is smallest. 

**Advantage over statistics evaluation metrics: $C_p, AIC, BIC$, and Adjusted $R^2$**: 
- Direct estimate of the test error, and makes fewer assumptions about the true underlying model. 
- Used in a wider range of model selection tasks, even in cases where it is hard to pinpoint the model degrees of freedom or hard to estimate the error variance $\sigma^2$.

**One-standard-error rule**: We first calculate the standard error of the estimated test MSE for each model size, and then select the smallest model for which the estimated test error is within one standard error of the lowest point on the curve.
 - **Rationale**: if a set of models appear to **be more or less equally good**, then we might as well choose the **simplest model**—that is, the model with the smallest number of predictors. 
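
The one-standard-error rule can be sketched as follows, assuming NumPy and a table of per-fold test MSEs that has already been computed. The function name and the fold errors are hypothetical; the standard error is taken as the sample standard deviation across folds divided by the square root of the number of folds, which is one common convention.

```python
import numpy as np

def one_standard_error_rule(fold_errors, sizes):
    """Return (size with lowest mean CV error, size chosen by the 1-SE rule).

    fold_errors: rows are folds, columns are model sizes (smallest first).
    """
    fold_errors = np.asarray(fold_errors, dtype=float)
    n_folds = fold_errors.shape[0]
    mean = fold_errors.mean(axis=0)
    se = fold_errors.std(axis=0, ddof=1) / np.sqrt(n_folds)
    best = int(np.argmin(mean))
    threshold = mean[best] + se[best]
    # Smallest model whose mean error is within one SE of the minimum.
    chosen = next(i for i in range(len(sizes)) if mean[i] <= threshold)
    return sizes[best], sizes[chosen]

# Hypothetical 4-fold CV errors for models with 1..5 predictors.
fold_errors = [
    [9.0, 5.2, 4.8, 4.4, 4.9],
    [10.0, 5.6, 5.0, 5.0, 5.0],
    [11.0, 4.8, 4.4, 4.3, 4.7],
    [10.0, 5.4, 5.0, 5.1, 5.0],
]
lowest, chosen = one_standard_error_rule(fold_errors, sizes=[1, 2, 3, 4, 5])
print(lowest, chosen)
```

In this made-up table the four-predictor model has the lowest mean error (4.7), but the three-predictor model (4.8) lies within one standard error of it, so the rule prefers the smaller model.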

# Regularization / Shrinkage / Embedded methods

Shrinkage Methods fit a model containing all p predictors using a technique that **constrains or regularizes the coefficient estimates**, or equivalently, that shrinks the coefficient estimates towards zero.
The two best-known techniques for shrinking the regression coefficients towards zero are **ridge regression and the lasso**.

## Ridge Regression

Recall that the least squares fitting procedure estimates β0, β1, ..., βp using the values that minimize:

\begin{align}
RSS=\sum_{i=1}^n\left( y_i-\beta_0-\sum_{j=1}^px_{ij}\beta_j \right)^2
\end{align}


**Ridge regression** coefficient estimates $\hat{\beta}^R$ are the values that minimize

\begin{align}
\sum_{i=1}^n\left( y_i-\beta_0-\sum_{j=1}^px_{ij}\beta_j \right)^2+\lambda\sum_{j=1}^p\beta_j^2=RSS+\lambda\sum_{j=1}^p\beta_j^2
\end{align}

where λ ≥ 0 is a tuning parameter, to be determined separately. **When λ = 0, the penalty term has no effect, and ridge regression will produce the least squares estimates.** However, as λ → ∞, the impact of the shrinkage penalty grows, and the coefficient estimates will approach zero. 


**Trade-off between RSS and the shrinkage penalty:**
1. As with least squares, ridge regression seeks coefficient estimates that fit the data well, by making the RSS small. 
2. **Shrinkage penalty:**  $\lambda\sum_{j=1}^p\beta_j^2$ is small when β1, . . . , βp are close to zero, and so it has the effect of shrinking the estimates of βj towards zero. 


**Note:**
 - The intercept will not be shrunk. 
 - The penalty will not set any of the coefficients exactly to zero (unless λ = ∞). Therefore, the ridge regression will include **all p predictors** in the final model. This can create a challenge in **model interpretation** in settings in which the number of variables p is quite large.
 - This penalty term gives ridge regression an advantage over least squares, which is rooted in **the bias-variance trade-off**. It works best in situations where the least squares estimates have high variance. As λ increases, the flexibility of the fit decreases, leading to decreased variance but increased bias.
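
A minimal sketch of the shrinkage effect, assuming NumPy and simulated data. It uses the standard closed-form ridge solution $(X^TX+\lambda I)^{-1}X^Ty$ on centered data, which is not derived in these notes, and leaves the intercept unpenalized. The helper name `ridge_fit` is hypothetical.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge estimates (intercept, beta) with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean            # centering removes the intercept
    p = X.shape[1]
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
    intercept = y_mean - x_mean @ beta
    return intercept, beta

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(size=50)

lambdas = [0.0, 1.0, 10.0, 100.0, 1000.0]
norms = []
for lam in lambdas:
    _, beta = ridge_fit(X, y, lam)
    norms.append(float(np.linalg.norm(beta)))
    print(lam, np.round(beta, 3))
```

The coefficient vector shrinks toward zero as λ grows, but for any finite λ the estimates generally stay non-zero, so no predictor is dropped.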


**Standardize before applying ridge**:
-  The **standard least squares** coefficient estimates are **scale equivariant**: multiplying $X_j$ by a constant c simply leads to a scaling of the least squares coefficient estimates by a factor of 1/c, so $X_j\hat{\beta}_j$ stays the same.

- In contrast, the **ridge regression coefficient** estimates can **change substantially** when multiplying a given predictor by a constant. 

> $X_j\hat{\beta}_{j,\lambda}^R$ will depend not only on the value of λ, but also on the scaling of the jth predictor, and even on the scaling of the other predictors. **It is best to apply ridge regression after standardizing the predictors**, using the formula

> \begin{align}
\tilde{x}_{ij}=\frac{x_{ij}}{\sqrt{\frac{1}{n}\sum_{i=1}^n(x_{ij}-\bar{x}_j)^2}}
\end{align}
The denominator is the estimated standard deviation of the jth predictor, so all standardized predictors have a standard deviation of one.
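
A small sketch of why standardizing matters, assuming NumPy and simulated data. The helpers `ridge_coefs` and `standardize` are hypothetical, and the rescaling factor of 1000 and λ = 10 are arbitrary illustrative values.

```python
import numpy as np

def ridge_coefs(X, y, lam):
    """Ridge slope estimates on centered data (lam = 0 gives least squares)."""
    Xc, yc = X - X.mean(axis=0), y - y.mean()
    return np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ yc)

def standardize(X):
    """Divide each column by its standard deviation (1/n version, ddof=0)."""
    return X / X.std(axis=0)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=100)

X_rescaled = X.copy()
X_rescaled[:, 0] *= 1000.0          # e.g. a change of units for predictor 0

# Least squares: the coefficient scales by 1/1000, so X_0 * beta_0 is unchanged.
ols, ols_rescaled = ridge_coefs(X, y, 0.0), ridge_coefs(X_rescaled, y, 0.0)

# Ridge on raw predictors: the contribution of predictor 0 changes.
ridge, ridge_rescaled = ridge_coefs(X, y, 10.0), ridge_coefs(X_rescaled, y, 10.0)

# Ridge on standardized predictors: the unit change no longer matters.
ridge_std = ridge_coefs(standardize(X), y, 10.0)
ridge_rescaled_std = ridge_coefs(standardize(X_rescaled), y, 10.0)

print(ols[0], 1000.0 * ols_rescaled[0])
print(ridge[0], 1000.0 * ridge_rescaled[0])
print(ridge_std[0], ridge_rescaled_std[0])
```

The first and third printed pairs agree, while the second pair does not: without standardization, the amount of shrinkage applied to a predictor depends on the units it happens to be measured in.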

## The Lasso

The lasso coefficients, $\hat{\beta}_\lambda^L$, minimize the quantity
\begin{align}
\sum_{i=1}^n\left( y_i-\beta_0-\sum_{j=1}^px_{ij}\beta_j \right)^2+\lambda\sum_{j=1}^p|\beta_j|=RSS+\lambda\sum_{j=1}^p|\beta_j|
\end{align}

When λ = 0, then the lasso simply gives **the least squares fit**, and when λ becomes sufficiently large, the lasso gives the null model in which all coefficient estimates equal zero.


For a fixed value of λ, the lasso fits only a single model, and the procedure can be performed quite quickly. It has **computational advantages** over some other **feature selection techniques**, e.g. best subset selection.
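
One way to minimize the lasso objective above is cyclic coordinate descent with soft-thresholding. This is a swapped-in, commonly used algorithm that these notes do not describe, shown here as a minimal NumPy sketch. The helper names, the simulated data, and the λ values are illustrative assumptions; with the objective written as RSS + λΣ|β_j|, each coordinate update thresholds at λ/2.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z toward zero by t, and set it to zero if |z| <= t."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimize RSS + lam * sum(|beta_j|) on centered data."""
    Xc, yc = X - X.mean(axis=0), y - y.mean()
    p = X.shape[1]
    beta = np.zeros(p)
    col_ss = (Xc ** 2).sum(axis=0)              # x_j' x_j for each column
    for _ in range(n_iter):
        for j in range(p):
            # Residual with predictor j's current contribution added back.
            partial_resid = yc - Xc @ beta + Xc[:, j] * beta[j]
            rho = Xc[:, j] @ partial_resid
            beta[j] = soft_threshold(rho, lam / 2.0) / col_ss[j]
    return beta

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(size=100)

beta_ols = lasso_coordinate_descent(X, y, lam=0.0)
beta_sparse = lasso_coordinate_descent(X, y, lam=100.0)
beta_null = lasso_coordinate_descent(X, y, lam=1e4)
print(np.round(beta_ols, 3), np.round(beta_sparse, 3), beta_null)
```

With λ = 0 the result matches least squares, with a moderate λ the three noise coefficients are set exactly to zero, and with a very large λ every coefficient is zero (the null model).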

## How to select the tuning parameter

**GridSearchCV** (from scikit-learn). This lets us automatically run k-fold cross-validation over a range of regularization strengths in order to find the best value of the tuning parameter, which scikit-learn calls `alpha` and which plays the role of λ above.
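
A minimal sketch of such a search, assuming scikit-learn is installed. The simulated data set, the grid of 20 `alpha` values, and the choice of 5 folds are illustrative assumptions. Note that scikit-learn's `Lasso` scales the RSS by 1/(2n) in its objective, so its `alpha` is not numerically equal to the λ in the formula above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Simulated data: 20 predictors, of which only 5 carry signal.
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

# Standardize inside the pipeline so each CV fold is scaled on its own
# training part only.
pipeline = make_pipeline(StandardScaler(), Lasso(max_iter=10_000))
param_grid = {"lasso__alpha": np.logspace(-3, 1, 20)}

search = GridSearchCV(pipeline, param_grid, cv=5,
                      scoring="neg_mean_squared_error")
search.fit(X, y)

best_alpha = search.best_params_["lasso__alpha"]
coefs = search.best_estimator_.named_steps["lasso"].coef_
n_nonzero = int(np.sum(coefs != 0))
print(best_alpha, n_nonzero)
```

The same pattern works for ridge regression by swapping `Lasso` for `Ridge` and the parameter name for `ridge__alpha`.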

## Lasso v.s. Ridge

- **Ridge**: **adding l2 penalty** $β_j^2$, which is the **square** of the magnitude of coefficients to the loss function. 
 - The l2 penalty will not set any of the coefficients exactly to zero (unless λ = ∞).
 - This can create a challenge in model interpretation in settings in which the number of variables p is quite large.
 
- **Lasso**: **adding l1 penalty** $|β_j|$, which is the **absolute value** of the magnitude of coefficients to the loss function. 
 - The l1 penalty has the effect of forcing some of the coefficient estimates to be exactly equal to zero when the tuning parameter λ is sufficiently large. 
 - As a result, the lasso performs **variable selection**, which yields **sparse models**, that is, models that involve only a subset of the variables. Such models are easier to interpret.
 
- **Cross-validation** can be used in order to determine which approach is better on a particular data set.


### Explanation

**Another Formulation for Ridge Regression, the Lasso and best subset selection:**

The lasso and ridge regression coefficient estimates solve the problems

\begin{align}
\underset{\beta}{\text{minimize}} \left\{\sum_{i=1}^n\left( y_i-\beta_0-\sum_{j=1}^px_{ij}\beta_j \right)^2\right\}\,\, \text{subject to} \, \sum_{j=1}^p|\beta_j|\leq s \\
\underset{\beta}{\text{minimize}} \left\{\sum_{i=1}^n\left( y_i-\beta_0-\sum_{j=1}^px_{ij}\beta_j \right)^2\right\}\,\, \text{subject to} \, \sum_{j=1}^p\beta_j^2\leq s
\end{align}   

When we perform the lasso or ridge regression, we are trying to find the set of coefficient estimates that leads to the smallest RSS, subject to the constraint that there is a budget s for how large $\sum_{j=1}^p|\beta_j|$ (lasso) or $\sum_{j=1}^p\beta_j^2$ (ridge) can be. When s is extremely large, this budget is not very restrictive, and so the coefficient estimates can be large. When s is small, the estimates must be small to stay within the budget.



**Graph illustration of the situation of ridge and the lasso**

<img src="./images/66.png" width=600>

The lasso and ridge regression coefficient estimates are given by the first point at which an ellipse contacts the constraint region. 

**Ridge regression**: has **a circular constraint** with no sharp points, so this intersection will **not generally occur on an axis**, and the ridge regression coefficient estimates will generally all be non-zero.


**The lasso**: the constraint has **corners** at each of the axes, and so the ellipse will often intersect the constraint region at an axis. When this occurs, one of the coefficients equals zero.
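
The same contrast can be seen algebraically in a simplified special case, assumed here only for illustration: n = p, X is the identity matrix, and there is no intercept. Least squares then gives $\hat{\beta}_j = y_j$, and minimizing $\sum_j (y_j-\beta_j)^2$ plus each penalty coordinate by coordinate gives

\begin{align}
\hat{\beta}_j^{R} &= \frac{y_j}{1+\lambda} \\
\hat{\beta}_j^{L} &= \begin{cases} y_j-\lambda/2 & \text{if } y_j>\lambda/2 \\ y_j+\lambda/2 & \text{if } y_j<-\lambda/2 \\ 0 & \text{if } |y_j|\leq\lambda/2 \end{cases}
\end{align}

Ridge shrinks every coefficient by the same proportion and never reaches zero for finite λ, while the lasso moves each coefficient toward zero by a constant amount λ/2 and sets the small ones exactly to zero.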


**A close connection between the lasso, ridge regression, and best subset selection**:

Best subset selection is equivalent to:
\begin{align}
\underset{\beta}{\text{minimize}} \left\{\sum_{i=1}^n\left( y_i-\beta_0-\sum_{j=1}^px_{ij}\beta_j \right)^2\right\}\,\, \text{subject to} \, \sum_{j=1}^pI(\beta_j\neq 0)\leq s
\end{align}  

This amounts to finding a set of coefficient estimates such that RSS is as small as possible, subject to the constraint that no more than s coefficients can be nonzero.

Therefore, we can interpret **ridge regression** and **the lasso** as computationally feasible alternatives to **best subset selection**.