# Statistical Learning

The following statistical learning notes are based on concepts from this [textbook](https://www.statlearning.com/).

- What: 
    - Statistical learning is the practice of estimating the unknown function `f` in the equation `Y = f(X) + epsilon`. This is a very general way of describing how the response `Y` relates to the features `X = (X1,...,Xn)`: `f` represents the systematic information that X provides about Y, and `epsilon` is a random error term (noise) with mean zero that is independent of X.

- Why:
    - Prediction: When X is known but Y isn't, you predict Y using an estimate `f_hat` of `f`, so the equation is `Y_hat = f_hat(X)`. The accuracy of `Y_hat` depends on the `reducible` and `irreducible` error. Reducible error comes from `f_hat` being an imperfect estimate of `f`, and your goal is to minimize it as much as possible. Irreducible error comes from `epsilon`, the variation/noise that you cannot control, so it remains even if `f_hat` estimates `f` perfectly.
    - Inference: When you want to understand the relationship between X and Y, `f_hat` can't be treated as a black box; you need to know its exact form. You'll want to know:
        - Which (top few) features are associated with the outcome of Y?
        - What's the relationship between Y and each X?
        - Can the relationship between Y and each X be adequately summarized using a linear equation, or is the relationship more complicated?
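
To make the split between reducible and irreducible error concrete, here is a minimal Python sketch on simulated data. The true function `f`, the imperfect estimate `f_hat_bad`, and the noise level are all made-up assumptions for illustration; in a real problem `f` is unknown.

```python
import random

def f(x):
    # Hypothetical "true" function; in a real problem this is unknown.
    return 2.0 + 3.0 * x

def f_hat_bad(x):
    # Hypothetical imperfect estimate of f (wrong slope on purpose).
    return 2.0 + 2.0 * x

def simulate(n, noise_sd, seed=0):
    # Draw X uniformly on [0, 1] and set Y = f(X) + epsilon.
    rng = random.Random(seed)
    xs = [rng.uniform(0.0, 1.0) for _ in range(n)]
    ys = [f(x) + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys

def mean_squared_error(predict, xs, ys):
    # Average squared difference between observed and predicted Y.
    return sum((y - predict(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs, ys = simulate(n=20000, noise_sd=0.5)

# Even the true f cannot beat the noise: this is roughly Var(epsilon) = 0.25.
irreducible = mean_squared_error(f, xs, ys)

# An imperfect estimate pays the irreducible error plus a reducible part.
total = mean_squared_error(f_hat_bad, xs, ys)
reducible = total - irreducible

print(f"irreducible ~ {irreducible:.3f}, reducible ~ {reducible:.3f}")
```

Improving `f_hat_bad` can shrink the reducible part toward zero, but no choice of estimate removes the part that comes from `epsilon`.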


## How to estimate `f`?
1. Parametric methods: Assume a functional form for `f` (for example, linear), which reduces the problem of estimating `f` down to just estimating a set of parameters. The disadvantage of the parametric approach is that the model you choose will usually not match the true unknown form of `f`. To address this, you can choose more flexible models that can fit many different functional forms for `f`, but this usually means estimating more parameters. More complex models can lead to `overfitting`, where the fitted model follows the noise too closely.
2. Non-parametric methods: This approach doesn't make explicit assumptions about the functional form of `f`. It just tries to find an estimate of `f` that is as close to the data points as possible without being too wiggly or rough. The advantage is that avoiding these assumptions gives non-parametric methods a better chance of fitting a wider range of possible shapes for `f`. The disadvantage is that since they don't reduce the problem of estimating `f` to a small number of parameters, a large number of observations is needed to obtain an accurate estimate of `f`.
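
The two approaches can be contrasted with a small Python sketch. It is only an illustration under assumptions: the parametric method here is a least squares straight line, the non-parametric method is a k-nearest-neighbors average, and the toy data (`y = x^2` with no noise) and the function names `fit_line` and `knn_predict` are made up for this example.

```python
def fit_line(xs, ys):
    # Parametric: assume f(x) = intercept + slope * x and estimate 2 parameters
    # with the closed-form least squares solution.
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

def knn_predict(xs, ys, x0, k):
    # Non-parametric: no assumed form for f; average the responses of the
    # k training points whose x is closest to x0.
    nearest = sorted(zip(xs, ys), key=lambda pair: abs(pair[0] - x0))[:k]
    return sum(y for _, y in nearest) / k

# Toy data from a curved relationship (y = x^2), so a straight line is misspecified.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.0, 1.0, 4.0, 9.0, 16.0]

intercept, slope = fit_line(xs, ys)
line_prediction = intercept + slope * 2.5
knn_prediction = knn_predict(xs, ys, x0=2.5, k=2)

print(f"true value: {2.5 ** 2}, line: {line_prediction}, 2-NN: {knn_prediction}")
```

On this toy data the straight line misses the curvature while the neighbor average stays closer to the true value, but with only a handful of observations a non-parametric estimate can be unreliable, which is the trade-off described in the list above.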

### The Tradeoff Between Prediction Accuracy and Model Interpretability
- Reasons to use a more restrictive model: 
    - The main goal is inference, since restrictive models are easier to interpret
    - Example models: least squares linear regression
- Reasons to use a less restrictive model:
    - The main goal is prediction, and interpretability matters less
    - Example models: bagging, boosting, SVMs w/ non-linear kernels, neural networks


### Supervised V. Unsupervised

- Supervised learning: each observation has predictor measurements (features) and an associated response (label). The goal of supervised learning is to fit a model that relates the response to the predictors, so that you can predict future responses or better understand the relationship between the response and the predictors.
- Unsupervised learning: each observation has predictor measurements (features) but no associated response (label). It's not possible to fit a linear regression model since there is no response variable to predict. The goal of unsupervised learning is to understand the relationships between the variables or between the observations.

### Regression V. Classification
- Qualitative Variables (Categorical variables): These are variables that take a value in one of several classes. (e.g. married or not, gay or not)
    - Problems with qualitative labels are classification problems.
- Quantitative Variables: These are numerical values (e.g. age, height, income, price, etc.) 
    - Problems with quantitative labels are regression problems. 
    
**Note**: 
- Logistic regression is typically used as a classification method, but since it estimates class probabilities, it can be thought of as a regression method as well.
- We usually select the statistical learning method based on whether the label we're trying to predict is quantitative or qualitative. 
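
The point about logistic regression can be sketched in a few lines of Python. This is a minimal illustration, assuming a single feature and hypothetical, hand-picked coefficients `b0` and `b1` (nothing is fitted here): the model outputs a quantitative probability, and a threshold turns it into a qualitative class label.

```python
import math

def predict_proba(x, b0, b1):
    # Logistic function: maps the linear score b0 + b1 * x to a probability in (0, 1).
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

def predict_class(x, b0, b1, threshold=0.5):
    # Classification step: label 1 if the estimated probability reaches the threshold.
    return 1 if predict_proba(x, b0, b1) >= threshold else 0

# Hypothetical coefficients, chosen only for illustration.
b0, b1 = -1.0, 1.0

for x in (0.0, 1.0, 2.0):
    p = predict_proba(x, b0, b1)
    print(f"x={x}: probability={p:.3f}, class={predict_class(x, b0, b1)}")
```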

## Assessing Model Accuracy
### Measure Quality of Fit
Quality of fit measures how close the predicted values are to the true values. For regression, the most common measure of quality of fit is the `mean squared error (MSE)`: the average of the squared differences between the true and predicted responses. What we care about is the accuracy of the predictions on data the model hasn't been trained on (the test MSE), not on the training data.

The smaller the MSE is, the closer the predicted values are to the true values. You'll want to find the model with the lowest test MSE, which is not necessarily the model with the lowest training MSE.
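
The difference between MSE on training data and MSE on unseen data can be shown with a short Python sketch. Everything here is an assumption made for illustration: the data are simulated from a sine curve plus noise, and the model is a 1-nearest-neighbor predictor, picked because it reproduces the training data exactly.

```python
import math
import random

def mse(y_true, y_pred):
    # Mean squared error: average squared gap between true and predicted values.
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)

def nearest_neighbor_predict(train_x, train_y, x0):
    # Very flexible model: return the response of the closest training point.
    i = min(range(len(train_x)), key=lambda j: abs(train_x[j] - x0))
    return train_y[i]

rng = random.Random(1)

def make_data(n):
    # Simulated data: y = sin(2 * pi * x) + noise (an assumed toy setup).
    xs = [rng.uniform(0.0, 1.0) for _ in range(n)]
    ys = [math.sin(2 * math.pi * x) + rng.gauss(0.0, 0.3) for x in xs]
    return xs, ys

train_x, train_y = make_data(50)
test_x, test_y = make_data(50)

train_pred = [nearest_neighbor_predict(train_x, train_y, x) for x in train_x]
test_pred = [nearest_neighbor_predict(train_x, train_y, x) for x in test_x]

train_mse = mse(train_y, train_pred)  # each point is its own nearest neighbor
test_mse = mse(test_y, test_pred)     # error on data the model has not seen

print(f"training MSE: {train_mse:.3f}, test MSE: {test_mse:.3f}")
```

A training MSE of zero looks perfect, yet the test MSE is clearly positive, which is why model quality is judged on data that were held out from training.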

### Bias-Variance Trade-Off
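The expected test MSE at a point `x0` is composed of 3 fundamental quantities:
- the variance of `f_hat(x0)`
- the squared bias of `f_hat(x0)`
- the variance of the error term `epsilon` (the irreducible error)

More flexible methods generally have higher variance and lower bias, so the goal is a method for which both are low.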
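
A simulation can make this decomposition tangible. The Python sketch below is a rough check under stated assumptions: a made-up true function `f(x) = x^2`, Gaussian noise, and a deliberately inflexible model that always predicts the training mean. It compares the average squared test error at a point `X0` with the sum of variance, squared bias, and noise variance.

```python
import random
import statistics

def f(x):
    # Hypothetical true function, assumed known only so that bias can be measured.
    return x ** 2

NOISE_SD = 0.5   # standard deviation of epsilon (assumption)
X0 = 0.9         # point at which predictions are evaluated
N_TRAIN = 30     # observations per training set
N_REPS = 5000    # number of simulated training sets

rng = random.Random(0)
predictions = []     # f_hat(X0) obtained from each training set
squared_errors = []  # (y0 - f_hat(X0))^2 for a fresh test response y0

for _ in range(N_REPS):
    train_y = [f(rng.uniform(0.0, 1.0)) + rng.gauss(0.0, NOISE_SD)
               for _ in range(N_TRAIN)]
    # Deliberately inflexible model: always predict the training mean.
    f_hat_x0 = statistics.fmean(train_y)
    y0 = f(X0) + rng.gauss(0.0, NOISE_SD)
    predictions.append(f_hat_x0)
    squared_errors.append((y0 - f_hat_x0) ** 2)

variance = statistics.pvariance(predictions)
bias_squared = (statistics.fmean(predictions) - f(X0)) ** 2
noise_variance = NOISE_SD ** 2
expected_test_mse = statistics.fmean(squared_errors)

print(f"variance={variance:.3f}, bias^2={bias_squared:.3f}, "
      f"Var(epsilon)={noise_variance:.3f}")
print(f"sum of the three parts   = {variance + bias_squared + noise_variance:.3f}")
print(f"simulated test MSE at X0 = {expected_test_mse:.3f}")
```

The two printed totals should agree up to simulation noise. Because the constant model is so rigid, most of its reducible error is squared bias rather than variance; a more flexible model would shift that balance.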

## Linear Regression

### Simple Linear Regression

### Multiple Linear Regression

### Linear Regression Models V. K-Nearest Neighbors

## Classification

### Logistic Regression

### Generative Models for Classification

### Compare Classification Models

## Resampling Methods

### Cross-Validation

### Bootstrap

## Linear Model Selection + Regularization

### Subset Selection

### Shrinkage Methods

### Dimension Reduction Methods

## Polynomial Regression

### Step Functions

### Basis Functions

### Regression Splines

### Smoothing Splines

### Local Regression

### Generalized Additive Models

### Non-Linear Modeling

## Tree-Based Methods

### Bagging, Random Forests, Boosting, Bayesian Additive Regression Trees

## Support Vector Machines

### Maximal Margin Classifier

### Support Vector Classifiers

### SVMs with 2+ Classes

### Relationship to Logistic Regression

## Deep Learning

### Single Layer Neural Networks

### Multilayer Neural Networks

### Convolutional Neural Networks

### Document Classification

### Recurrent Neural Networks

### Fitting a Neural Network

### Interpolation and Double Descent

## Survival Analysis + Censored Data

### Kaplan-Meier Survival Curve

### Log-Rank Test

### Regression Models w/ a Survival Response

### Shrinkage for the Cox Model

## Unsupervised Learning

### Principal Component Analysis

### Missing Values and Matrix Completion

### Clustering Methods

## Multiple Testing

### Hypothesis Testing

### Multiple Testing

### Family-Wise Error Rate

### Re-sampling Approach to P-Values, and False Discovery Rates