Regularization Regression And Cross Validation

1) Subset Selection of Predictors
* Ways to reduce complexity of the model:
* $Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \cdots + \beta_pX_p + \epsilon$
    1. **Subset Selection** - choose subset of $p$ predictors
    2. **Regularization** - keep $p$ predictors, shrink coefficient estimates towards 0 (some variable selection for lasso)
    3. **Dimensionality Reduction** - project $p$ predictors into $M$-dimensional space where $M<p$
* Subset selection techniques:
    1. **Best subset** - try every model with every possible combination of $p$ predictors
        * (-) computationally intensive (search space of $2^p$ models), especially for large $p$
        * (+) higher chance of finding models that look good on training data (low bias)
        * (-) with so many candidates, the chosen model may overfit and have little predictive power on future data (high variance)
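    2. **Stepwise selection** - at each step, one variable is considered for addition to (forward) or removal from (backward) the current model according to some criterion
        * e.g. forward selection: $M_0 \rightarrow M_1 \rightarrow M_2 \rightarrow \dots \rightarrow M_p$
            * where $M_0$ is the null model with no predictors
            * where $M_1$ adds to $M_0$ the one predictor that gives the smallest $RSS$ (equivalently, the largest $R^2$)
            * where $M_2$ adds to $M_1$ the one remaining predictor that gives the smallest $RSS$ or largest $R^2$
        * now that there are $p+1$ candidate models, are $RSS$ and $R^2$ good metrics for choosing the best one? No: training $RSS$ never increases (and $R^2$ never decreases) as predictors are added, so they always favor the largest model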
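
The forward stepwise path described above can be sketched as follows. This is a minimal illustration, not a reference implementation: it assumes NumPy is available, the helper names `rss` and `forward_stepwise` are made up for this example, and the data are synthetic (only columns 2 and 0 carry signal).

```python
import numpy as np

def rss(X, y, cols):
    """RSS of an OLS fit using an intercept plus the given predictor columns."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)

def forward_stepwise(X, y):
    """Return the nested models M_0, ..., M_p as (columns, RSS) pairs."""
    p = X.shape[1]
    selected = []
    path = [([], rss(X, y, []))]  # M_0: intercept only
    while len(selected) < p:
        remaining = [j for j in range(p) if j not in selected]
        # Add the single predictor that lowers the training RSS the most
        best = min(remaining, key=lambda j: rss(X, y, selected + [j]))
        selected = selected + [best]
        path.append((selected, rss(X, y, selected)))
    return path

# Synthetic data (an assumption for illustration): y depends on columns 2 and 0 only
rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.normal(size=(n, p))
y = 3.0 * X[:, 2] - 2.0 * X[:, 0] + rng.normal(scale=0.5, size=n)

path = forward_stepwise(X, y)
for cols, value in path:
    print(len(cols), cols, round(value, 2))
```

The printed RSS keeps shrinking all the way to $M_p$, which is why training RSS alone cannot pick among the $p+1$ candidates.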
* Ways to choose the best model among several candidates:
    1. **Cross-validation** - estimates how well a model will generalize without permanently setting aside a large part of the data for testing (unlike a single 70%/30% split). In k-fold cross-validation, the measures of fit (prediction errors) from the individual folds are averaged to give a more accurate estimate of predictive performance.
    2. **Mallows's $C_p$** - takes into account the residual sum of squares, the number of predictors, and an estimate of the error variance in the linear model
        * equation: $C_p=\frac{1}{n}(RSS+2\underline{p}\hat{\sigma}^2)$ (minimize)
        * $p$ = total # of parameters
        * $\hat{\sigma}^2$ = an estimate of variance of error $\epsilon$
        * small value of $C_p$ means that model is relatively precise
        * (-) $C_p$ approximation is only valid for large sample size
        * (-) $C_p$ cannot handle complex collections of models as in the variable selection or feature selection problem
    3. **Akaike Information Criterion (AIC)** - an estimate of the relative information lost when a given model is used to represent the process that generated the data. It deals with the trade-off between goodness of fit and simplicity of the model
        * equation: $AIC = -2\log L+2\underline{p}$ (minimize)
        * $L$ = maximized value of the likelihood function for model estimated
        * the preferred model is the one with the minimum AIC value
        * penalizes large number of estimated parameters (discourages overfitting)
        * for a linear model with Gaussian errors, AIC and Mallows's $C_p$ can be shown to be equivalent (they rank models the same way)
    4. **Bayesian Information Criterion (BIC)** - similar to AIC, it considers the likelihood and the # of parameters. The difference is that the BIC penalty is larger than the AIC penalty.
        * equation: $BIC = \frac{1}{n}(RSS+\log(n)\underline{p}\hat{\sigma}^2)$ (minimize)
        * similar to $C_p$ (and AIC), except the factor $2$ is replaced by $\log(n)$
        * Generally, BIC exacts a heavier penalty for more variables
            * $\log(n) > 2$ for $n>7$
        * BIC assumes that the data distribution is in an exponential family
        * (-) BIC is only valid for sample size $n$ much larger than the number $p$ of parameters in the model
        * (-) BIC cannot handle complex collections of models as in the variable selection or feature selection problem in high-dimension
    5. **Adjusted $R^2$** - modifies $R^2$ to penalize extra explanatory variables added to the model
        * equation: Adjusted $R^2= 1-\frac{\frac{RSS}{n-p-1}}{\frac{TSS}{n-1}}$
        * unlike $R^2$, it can decrease when an added variable contributes little; unlike the criteria above, larger is better (maximize)
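
As a worked example of the formulas above, the sketch below evaluates $C_p$, BIC, and adjusted $R^2$ exactly as written in these notes. The AIC line is an assumption: it uses the Gaussian form with known variance, where $-2\log L = RSS/\hat{\sigma}^2$ plus a constant that is the same for every model. The function name `selection_criteria` and all the numbers ($n$, $TSS$, $\hat{\sigma}^2$, and the RSS per model) are made up purely for illustration.

```python
import math

def selection_criteria(rss, tss, n, p, sigma2):
    """Evaluate the model-selection criteria from these notes for one candidate model."""
    cp = (rss + 2 * p * sigma2) / n
    bic = (rss + math.log(n) * p * sigma2) / n
    adj_r2 = 1 - (rss / (n - p - 1)) / (tss / (n - 1))
    # Assumption: Gaussian errors with known variance, constant term dropped
    aic = rss / sigma2 + 2 * p
    return {"cp": cp, "aic": aic, "bic": bic, "adj_r2": adj_r2}

# Hypothetical numbers, not real results
n, tss, sigma2 = 100, 500.0, 1.0
rss_by_p = {2: 120.0, 3: 116.0, 4: 115.5}

results = {p: selection_criteria(r, tss, n, p, sigma2) for p, r in rss_by_p.items()}

best_cp = min(results, key=lambda p: results[p]["cp"])
best_aic = min(results, key=lambda p: results[p]["aic"])
best_bic = min(results, key=lambda p: results[p]["bic"])
best_adj_r2 = max(results, key=lambda p: results[p]["adj_r2"])  # maximize, not minimize
print(best_cp, best_aic, best_bic, best_adj_r2)
```

With these made-up inputs, $C_p$, AIC, and adjusted $R^2$ prefer the 3-predictor model while BIC prefers the 2-predictor model, which reflects its heavier $\log(n)$ penalty.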

2) Cross Validation
* **Normal Cross-Validation** (also called the validation-set or hold-out approach) - randomly divide data into training set and validation set (e.g. 50/50, 60/40, 70/30, 80/20, etc.)
    * Cross-Validation steps:
        1. fit model on training set
        2. use fitted model to predict responses for validation set
        3. compute validation-set error
            * quantitative response: typically MSE/RMSE
            * qualitative response: typically misclassification rate (or related metrics such as accuracy, precision, recall, F1)
    * example: fitting MPG, $Y$, from horsepower, $X$ using different polynomial fits
    * (-) validation error can be highly variable depending on random split
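
A minimal sketch of the validation-set steps above, assuming NumPy. Since the MPG/horsepower data are not included in these notes, the example uses synthetic data with a quadratic relationship; `validation_mse` is a hypothetical helper name.

```python
import numpy as np

def validation_mse(x, y, degree, rng, train_fraction=0.5):
    """Fit a polynomial on a random training split and return the validation MSE."""
    idx = rng.permutation(len(x))
    n_train = int(train_fraction * len(x))
    train, val = idx[:n_train], idx[n_train:]
    coefs = np.polyfit(x[train], y[train], degree)   # step 1: fit on training set
    pred = np.polyval(coefs, x[val])                 # step 2: predict validation set
    return float(np.mean((y[val] - pred) ** 2))      # step 3: validation-set error

# Synthetic stand-in for the MPG example (an assumption): the truth is quadratic
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=200)
y = 1.0 + 0.5 * x - 1.5 * x**2 + rng.normal(scale=0.5, size=200)

for degree in (1, 2, 5):
    # Repeat with different random splits to see how much the estimate moves
    errors = [validation_mse(x, y, degree, rng) for _ in range(5)]
    print(degree, np.round(errors, 3))
```

Each row prints five validation errors for the same polynomial degree; the spread within a row is the split-to-split variability noted above.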
* **K-Fold Cross-Validation** - randomly divide data into $k$ folds and use each fold in turn as the validation set, training on the remaining $k-1$ folds
![kfold_cv](kfold_validation.png)
    * K-Fold Cross-Validation steps: (run these steps $k$ times)
        1. fit model on training set using $k-1$ folds
        2. use fitted model to predict responses for validation set
        3. compute validation-set error
            * quantitative response: typically MSE/RMSE
            * qualitative response: typically misclassification rate (or related metrics such as accuracy, precision, recall, F1)
    * equation: $CV_{k}=\frac{1}{k}\sum_{i=1}^k MSE_i$
    * example: fitting MPG, $Y$, from horsepower, $X$ using different polynomial fits
    * (-) requires fitting the model $k$ times; the estimate still depends on the random fold assignment, though usually less than with a single split
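
The $CV_k$ formula above can be sketched in a few lines, assuming NumPy. The data are synthetic (a quadratic truth standing in for the MPG example), and `k_fold_cv_mse` is a name chosen for this illustration.

```python
import numpy as np

def k_fold_cv_mse(x, y, degree, k, rng):
    """Return CV_k: the average validation MSE over k folds for a polynomial fit."""
    folds = np.array_split(rng.permutation(len(x)), k)
    mses = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        coefs = np.polyfit(x[train], y[train], degree)
        pred = np.polyval(coefs, x[val])
        mses.append(np.mean((y[val] - pred) ** 2))   # MSE_i for fold i
    return float(np.mean(mses))                      # (1/k) * sum of MSE_i

# Synthetic data (an assumption for illustration)
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=200)
y = 1.0 + 0.5 * x - 1.5 * x**2 + rng.normal(scale=0.5, size=200)

cv_scores = {d: k_fold_cv_mse(x, y, d, k=5, rng=np.random.default_rng(1)) for d in range(1, 6)}
best_degree = min(cv_scores, key=cv_scores.get)
print(cv_scores, best_degree)
```

Reusing the same seed for every degree keeps the fold assignment identical across candidates, so the comparison is not confounded by different splits.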
* **Iterated K-Fold Validation With Shuffling** - consists of applying k-fold validation multiple times, shuffling the data every time before splitting it $K$ ways
    * **randomly shuffling** prevents you from excluding specific classes of labels when splitting data into training and test sets
        * e.g. image classification of digits: first 80% of array for training set (class 0-7) and remaining 20% as test set (class 8-9)
    * final score is the average of the scores obtained at each run of k-fold validation
    * (-) requires training and evaluating $P \times K$ models (where $P$ is the number of iterations you use), which can be very expensive
    * extremely helpful when you have relatively little data available and need to evaluate the model as precisely as possible
    * also, extremely helpful in Kaggle competitions
    * example code:
        ```python
        import numpy as np

        # Placeholders: `data` and `test_data` hold your samples, and `get_model()`
        # returns a fresh, untrained model with `train` and `evaluate` methods.
        # This shows a single k-fold pass.
        k = 4
        num_validation_samples = len(data) // k

        # Shuffle in place so every fold gets a random mix of samples
        np.random.shuffle(data)

        validation_scores = []
        for fold in range(k):
            start = num_validation_samples * fold
            stop = num_validation_samples * (fold + 1)
            # Selects the validation-data partition
            validation_data = data[start:stop]
            # Uses the remainder of the data as training data. np.concatenate joins
            # the two slices; on NumPy arrays the + operator would add element-wise.
            training_data = np.concatenate([data[:start], data[stop:]])

            # Creates a brand-new instance of the model (untrained)
            model = get_model()
            model.train(training_data)
            validation_score = model.evaluate(validation_data)
            validation_scores.append(validation_score)

        # Validation score: average of the validation scores of the k folds
        validation_score = np.average(validation_scores)

        # Trains the final model on all non-test data available
        model = get_model()
        model.train(data)
        test_score = model.evaluate(test_data)
        ```
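
To make the "iterated" part explicit, here is a small self-contained sketch that reshuffles before each of the $P$ passes and averages all $P \times K$ scores. It uses only the standard library; `repeated_k_fold` is a hypothetical helper, the toy targets are made up, and the "model" is simply the training mean so that the example runs without any ML library.

```python
import random
import statistics

def repeated_k_fold(n_samples, k, repeats, seed=0):
    """Yield (train_indices, val_indices) for `repeats` shuffled k-fold passes."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    fold_size = n_samples // k
    for _ in range(repeats):
        rng.shuffle(indices)  # fresh shuffle before every k-fold pass
        for fold in range(k):
            start = fold * fold_size
            # The last fold also takes any leftover samples
            stop = (fold + 1) * fold_size if fold < k - 1 else n_samples
            yield indices[:start] + indices[stop:], indices[start:stop]

# Toy targets, purely illustrative
y = [float(i % 7) for i in range(50)]

scores = []
for train, val in repeated_k_fold(len(y), k=4, repeats=3):
    prediction = statistics.fmean(y[i] for i in train)  # stand-in "model": training mean
    scores.append(statistics.fmean((y[i] - prediction) ** 2 for i in val))

# Final score: average over all P x K = 3 x 4 = 12 validation scores
final_score = statistics.fmean(scores)
print(len(scores), round(final_score, 3))
```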
* **Redundancy in the data** - if some data points in your data appear twice (fairly common with real-world data), then shuffling the data and splitting it into a training set and a validation set can put copies of the same point in both sets
    * In effect, you'll be testing on part of your training data
    * Make sure your training set and validation set are disjoint (having no elements in common)
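
One simple way to enforce this, sketched with the standard library only: drop exact duplicates before splitting, then check that the two sets share no rows. The helper names and the toy rows are assumptions for this example; in practice you would shuffle the deduplicated rows before splitting, and near-duplicates need a domain-specific check.

```python
def drop_duplicates(rows):
    """Keep the first occurrence of each row, preserving order."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

def overlap(train, val):
    """Return the set of rows that appear in both the training and validation sets."""
    return {tuple(r) for r in train} & {tuple(r) for r in val}

# Toy rows with two repeated samples (illustrative only)
rows = [[1.0, 2.0], [3.0, 4.0], [1.0, 2.0], [5.0, 6.0], [3.0, 4.0], [7.0, 8.0]]

# Naive split: a duplicated row ends up on both sides
naive_train, naive_val = rows[:3], rows[3:]
print(overlap(naive_train, naive_val))

# Deduplicate first, then split: the sets are disjoint
unique_rows = drop_duplicates(rows)
train, val = unique_rows[:3], unique_rows[3:]
print(overlap(train, val))
```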

3) **Bias-Variance Tradeoff** - the problem of simultaneously minimizing two sources of error, bias and variance, that prevent supervised learning algorithms from generalizing beyond their training set
![bias_variance](bias_variance_tradeoff.png)
* situation: fit a model, $\hat{f}(x)$, to some training data and let $(x_0,y_0)$ be a test observation from the population. Assume the true model is $Y=f(X)+\epsilon$, where $f(x)=E(Y|X=x)$; the expected test error at $x_0$ then decomposes into three parts
* equation: $E(y_0-\hat{f}(x_0))^2=Var(\hat{f}(x_0))+[Bias(\hat{f}(x_0))]^2+Var(\epsilon)$
    * applies to modeling in general (not just linear regression)
    * to minimize the expected test MSE/RMSE we can **reduce Variance**, $Var(\hat{f}(x_0))$, and **reduce Bias**, $Bias(\hat{f}(x_0))$, but we can't do much about the Irreducible Error, $Var(\epsilon)$
    * generally speaking, the *more flexible* the model, the *greater the variance* (and the lower the bias)
    * $Bias(\hat{f}(x_0)) = E[\hat{f}(x_0)]-f(x_0)$ - difference between the expected prediction of our model and the correct value we are trying to predict
    * $Var(\hat{f}(x_0))$ - amount by which $\hat{f}$ would change if we estimated it using a different training dataset
    * $Var(\epsilon)$ - irreducible error: the noise in $Y=f(X)+\epsilon$ that no model can remove
* reducing the complexity of the model curbs overfitting and thus reduces the variance (how much the fit changes across different training sets), usually at the cost of some added bias
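
The decomposition can be checked numerically with a small simulation, sketched below under stated assumptions: NumPy is available, the true function is $f(x)=x^2$, the noise has standard deviation 0.5, and a rigid model (degree-1 polynomial) is compared with a flexible one (degree-5 polynomial) at the test point $x_0=0$. The function names are made up for this example.

```python
import numpy as np

def f(x):
    """True regression function (an assumption for this simulation)."""
    return x ** 2

def bias_variance(degree, x0=0.0, n=30, sigma=0.5, n_sims=2000, seed=0):
    """Estimate squared bias and variance of a polynomial fit's prediction at x0."""
    rng = np.random.default_rng(seed)
    preds = np.empty(n_sims)
    for s in range(n_sims):
        # Draw a fresh training set each time and refit the model
        x = rng.uniform(-1, 1, size=n)
        y = f(x) + rng.normal(scale=sigma, size=n)
        preds[s] = np.polyval(np.polyfit(x, y, degree), x0)
    bias_sq = (preds.mean() - f(x0)) ** 2   # [E f_hat(x0) - f(x0)]^2
    variance = preds.var()                  # Var(f_hat(x0)) across training sets
    return float(bias_sq), float(variance)

sigma = 0.5
for degree in (1, 5):
    bias_sq, variance = bias_variance(degree, sigma=sigma)
    expected_test_mse = variance + bias_sq + sigma ** 2
    print(degree, round(bias_sq, 4), round(variance, 4), round(expected_test_mse, 4))
```

In this setup the straight line cannot represent the curve at $x_0$ (high bias, low variance), while the degree-5 fit is nearly unbiased but varies more from one training set to the next.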

4) **Regularization** - mitigates the overfitting problem in statistical models. This method reduces the values of the coefficients (aka **shrinkage**)
* in linear regression, we find the estimates for all coefficients that minimize RSS
    * Linear Regression: $RSS = \sum_{i=1}^n\Big(y_i-\beta_0-\sum_{j=1}^p \beta_j x_{ij}\Big)^2$
* **Ridge Regression** - great technique for analyzing multiple regression data that suffers from multicollinearity
    * minimize: $RSS + \lambda\sum_{j=1}^p \beta_j^2$
    * $\lambda \geq 0$ is a tuning parameter to be determined
    * uses an L2 penalty (the sum of squared coefficients)
    * the sum starts at $j=1$, so the intercept $\beta_0$ is not penalized
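
For ridge, the minimizer has a closed form. The sketch below assumes NumPy and centers the data so that the intercept is left unpenalized; `ridge_fit` is a name chosen for this example, and the data are synthetic with two strongly correlated predictors.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize RSS + lam * sum_j beta_j^2 with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean          # centering removes the intercept
    p = X.shape[1]
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
    beta0 = y_mean - x_mean @ beta
    return beta0, beta

# Synthetic data (an assumption): columns 0 and 1 are strongly correlated
rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = 1.0 + 2.0 * x1 + 2.0 * x2 - 1.0 * x3 + rng.normal(scale=0.5, size=n)

for lam in (0.0, 1.0, 10.0, 100.0):
    beta0, beta = ridge_fit(X, y, lam)
    print(lam, round(float(beta0), 3), np.round(beta, 3))
```

With $\lambda=0$ this reproduces ordinary least squares; as $\lambda$ grows, the coefficient vector shrinks toward zero without any coefficient becoming exactly zero.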
* **Lasso Regression** - lasso (least absolute shrinkage and selection operator) is a regression analysis method that performs both variable selection and regularization
    * minimize: $RSS + \lambda\sum_{j=1}^p \big|\beta_j\big|$
    * uses an L1 penalty (the sum of absolute coefficient values)
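
The lasso has no closed form, but cyclic coordinate descent with soft-thresholding is one common way to solve it. The sketch below is an illustration under assumptions, not the method of any particular library: it assumes NumPy, uses a fixed number of sweeps instead of a convergence test, and the names `soft_threshold` and `lasso_fit` are made up here. Because the objective is written as $RSS + \lambda\sum|\beta_j|$, the threshold is $\lambda/2$.

```python
import numpy as np

def soft_threshold(rho, t):
    """Shrink rho toward zero by t, returning exactly 0 when |rho| <= t."""
    return np.sign(rho) * max(abs(rho) - t, 0.0)

def lasso_fit(X, y, lam, n_iter=200):
    """Minimize RSS + lam * sum_j |beta_j| by cyclic coordinate descent."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean           # unpenalized intercept via centering
    p = X.shape[1]
    beta = np.zeros(p)
    col_sq = (Xc ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            # Residual with predictor j's current contribution added back
            partial_resid = yc - Xc @ beta + Xc[:, j] * beta[j]
            rho = Xc[:, j] @ partial_resid
            beta[j] = soft_threshold(rho, lam / 2) / col_sq[j]
    beta0 = y_mean - x_mean @ beta
    return beta0, beta

# Synthetic data (an assumption): only columns 0 and 3 carry signal
rng = np.random.default_rng(0)
n, p = 100, 6
X = rng.normal(size=(n, p))
y = 1.0 + 4.0 * X[:, 0] - 3.0 * X[:, 3] + rng.normal(scale=0.5, size=n)

beta0, beta = lasso_fit(X, y, lam=80.0)
print(round(float(beta0), 3), np.round(beta, 3))
```

The coefficients of the noise columns come out as exact zeros, which is the variable-selection behavior described above, while the two signal coefficients are shrunk toward zero.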
* Ridge vs Lasso:
![ridge_lasso](ridge_vs_lasso.png)
    * when $\lambda=0$, both reduce to ordinary least squares linear regression
    * as $\lambda$ increases, both models become less flexible, reducing variance, but increasing bias
    * lasso has the advantage of variable selection as well (especially nice when $p$ is large)
    * neither universally dominates, but in general one might expect Lasso to do better when the response is a function of relatively *few* predictors (however, it's not always true, so make sure to cross-validate)
    * when two predictors are highly correlated, L1 (Lasso) penalties tend to pick one of the two while L2 (Ridge) will keep both and shrink the coefficients
    * in general L1 (Lasso) penalties are better at recovering sparse signals
    * L2 (Ridge) penalties often give lower prediction error when many predictors each contribute a little
* Choosing $\lambda$:
![lambda](choosing_lambda.png)
    * fit a model for each value on a grid of $\lambda$, and then choose the *$\lambda$ which minimizes cross-validated error*
    * in least squares linear regression, the $\beta$ coefficient estimates are *scale equivariant* (rescaling a predictor rescales its coefficient so that the fit stays the same)
        * in other words, multiplying $X_j$ by a constant $c$ scales the least squares coefficient estimate by $\frac{1}{c}$, so that $X_j\hat{\beta}_j$ remains the same
    * in ridge regression, rescaling a predictor can change the $\beta$ coefficient estimates (and $X_j\hat{\beta}_j$) *substantially*, because of the penalty term in the ridge cost function
        * make sure to **standardize** the predictors before fitting, using:
            * $\tilde{x}_{ij}=\frac{x_{ij}}{\sqrt{\frac{1}{n}\sum_{i=1}^n(x_{ij}-\bar{x}_j)^2}}$
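
The points above can be tied together in one sketch, assuming NumPy: standardize with the $\frac{1}{n}$ standard deviation from the formula, confirm that least squares fits are unaffected by rescaling a predictor while ridge fits are not, and pick $\lambda$ from a grid by 5-fold cross-validation. All names, the grid, and the synthetic data are assumptions for illustration. For simplicity the scaling is computed on the full data; a stricter version would compute it inside each training fold.

```python
import numpy as np

def standardize(X):
    """Divide each column by its standard deviation (np.std uses the 1/n form by default)."""
    return X / X.std(axis=0)

def ridge_fit(X, y, lam):
    """Closed-form ridge fit with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ yc)
    return y_mean - x_mean @ beta, beta

def cv_mse(X, y, lam, folds):
    """Average validation MSE of ridge with penalty lam over the given folds."""
    mses = []
    for i, val in enumerate(folds):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        beta0, beta = ridge_fit(X[train], y[train], lam)
        mses.append(np.mean((y[val] - beta0 - X[val] @ beta) ** 2))
    return float(np.mean(mses))

# Synthetic predictors on very different scales (an assumption)
rng = np.random.default_rng(0)
n = 120
X = rng.normal(size=(n, 4)) * np.array([1.0, 50.0, 0.1, 5.0])
y = 2.0 + X @ np.array([1.0, 0.02, 5.0, -0.2]) + rng.normal(scale=1.0, size=n)

def fitted(Xm, lam):
    beta0, beta = ridge_fit(Xm, y, lam)
    return beta0 + Xm @ beta

# Rescale one predictor: OLS fitted values are unchanged, ridge fitted values are not
X_rescaled = X.copy()
X_rescaled[:, 0] *= 100.0
ols_same = np.allclose(fitted(X, 0.0), fitted(X_rescaled, 0.0))
ridge_same = np.allclose(fitted(X, 10.0), fitted(X_rescaled, 10.0))
print(ols_same, ridge_same)

# Standardize, then choose lambda by 5-fold cross-validation on a small grid
X_std = standardize(X)
folds = np.array_split(rng.permutation(n), 5)
lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]
cv_errors = {lam: cv_mse(X_std, y, lam, folds) for lam in lambdas}
best_lam = min(cv_errors, key=cv_errors.get)
print(cv_errors, best_lam)
```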