# Resampling

In statistics, resampling is any of a variety of methods for doing one of the following:

- Estimating the precision of sample statistics (medians, variances, percentiles) by using subsets of available data (**jackknifing**) or drawing randomly with replacement from a set of data points (**bootstrapping**)
- Exchanging labels on data points when performing significance tests (**permutation tests**, also called exact tests, randomization tests, or re-randomization tests)
- Validating models by using random subsets (**bootstrapping**, **cross-validation**)
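
To make the label-exchange idea behind permutation tests concrete, here is a minimal sketch of a two-sided Monte Carlo permutation test for a difference in means. The function name `permutation_test`, its parameters, and the sample values are hypothetical, chosen only for illustration; an exact test would enumerate every relabeling instead of sampling them.

```python
import random
from statistics import mean

def permutation_test(a, b, n_permutations=5000, seed=0):
    """Estimate a two-sided p-value for the difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    n_a = len(a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # exchange the group labels at random
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # The add-one correction counts the observed labeling itself,
    # so the estimated p-value is never exactly zero.
    return (extreme + 1) / (n_permutations + 1)

# Made-up measurements, for illustration only.
control = [1.0, 2.0, 3.0, 4.0, 5.0]
treated = [10.0, 11.0, 12.0, 13.0, 14.0]
p_value = permutation_test(control, treated)
print(p_value)
```

A small p-value means that few random relabelings produce a difference as large as the observed one.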

------------

# Cross Validation

Cross-validation, sometimes called rotation estimation, is a model validation technique for assessing how the results of a statistical analysis will generalize to an independent data set. It is mainly used in settings where the goal is prediction, and one wants to estimate how accurately a predictive model will perform in practice. In a prediction problem, a model is usually given a dataset of known data on which it is trained (the training dataset) and a dataset of previously unseen data against which it is tested (the testing dataset).

The goal of cross-validation is to set aside part of the known data as a validation dataset on which the model is "tested" during the training phase. This limits problems like overfitting and gives insight into how the model will generalize to an independent dataset (i.e., an unknown dataset, for instance from a real problem).
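
As a rough illustration of how a validation set can be carved out of the known data, the sketch below generates index splits for hold-out, k-fold, and leave-one-out cross-validation using only the standard library. The function names (`holdout_split`, `kfold_split`, `loocv_split`) and their parameters are assumptions made for this example, not a prescribed interface for the exercises.

```python
import random

def holdout_split(n_samples, test_fraction=0.25, seed=0):
    """Return (train_idx, test_idx) for a single random hold-out split."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = max(1, round(n_samples * test_fraction))
    return sorted(indices[n_test:]), sorted(indices[:n_test])

def kfold_split(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold sizes differ by at most one when k does not divide n_samples.
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield sorted(train_idx), sorted(test_idx)

def loocv_split(n_samples):
    """Leave-one-out: every sample is the test set exactly once."""
    for i in range(n_samples):
        yield [j for j in range(n_samples) if j != i], [i]

n_rows = 32  # e.g. the number of rows in mtcars
train_idx, test_idx = holdout_split(n_rows)
folds = list(kfold_split(n_rows, k=5))
print(len(train_idx), len(test_idx), [len(te) for _, te in folds])
```

Returning indices rather than copies of the data keeps the splitters independent of how many features the dataset has.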

## Exercise 1 - Explore the mtcars data set from the ggplot package

## Exercise 2 - Build a Cross Validation class 

It should do the following:
- Support Hold-Out, LOOCV (leave-one-out cross-validation) and k-Fold
- Take in specific parameters for each
- Return train and test sets
- Start out with 1 dimension / feature
- *Optional* - Build for more dimensions / features

## Exercise 3 - Try it out on the mtcars data set

## Exercise 4 - Check via Scikit-Learn
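
One possible way to run the check, assuming a scikit-learn version that provides the `sklearn.model_selection` module. The array `X` is a synthetic stand-in for a single feature column, not the mtcars data; the split sizes it produces can be compared against those of a hand-built class.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut, train_test_split

X = np.arange(12).reshape(-1, 1)  # synthetic stand-in for one feature

# Hold-out: a single random train/test split.
train, test = train_test_split(X, test_size=0.25, random_state=0)

# k-fold: every sample appears in exactly one test fold.
kf = KFold(n_splits=4, shuffle=True, random_state=0)
kfold_sizes = [(len(tr), len(te)) for tr, te in kf.split(X)]

# LOOCV: one split per sample.
loo_splits = list(LeaveOneOut().split(X))

print(len(train), len(test), kfold_sizes, len(loo_splits))
```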

------------

# Bootstrapping

Bootstrapping is a statistical method for estimating the sampling distribution of an estimator by sampling with replacement from the original sample, most often with the purpose of deriving robust estimates of standard errors and confidence intervals of a population parameter like a mean, median, proportion, odds ratio, correlation coefficient or regression coefficient.
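
The definition above can be sketched in a few lines of standard-library Python: resample with replacement, recompute the statistic each time, and summarize the resulting replicates. The function name `bootstrap`, its parameters, and the synthetic sample are assumptions for illustration; the interval shown is the simple percentile interval, which is only one of several bootstrap interval constructions.

```python
import random
from statistics import mean, quantiles, stdev

def bootstrap(data, statistic=mean, n_resamples=2000, seed=0):
    """Return (standard error, 95% percentile interval, replicates)."""
    rng = random.Random(seed)
    n = len(data)
    # Each replicate is the statistic of n draws with replacement.
    replicates = [statistic(rng.choices(data, k=n)) for _ in range(n_resamples)]
    std_error = stdev(replicates)
    # With n=40 the cut points fall at 2.5%, 5%, ..., 97.5%.
    cuts = quantiles(replicates, n=40)
    return std_error, (cuts[0], cuts[-1]), replicates

# Synthetic sample, generated only to exercise the function.
gen = random.Random(1)
sample = [gen.gauss(20, 5) for _ in range(30)]
se, (ci_low, ci_high), reps = bootstrap(sample)
print(se, ci_low, ci_high)
```

The returned replicates can be passed to any plotting library to draw the histogram of the bootstrap distribution.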

## Exercise 1 - Build a Bootstrap Class 

It should do the following:
- Take in the appropriate parameters
- Calculate various common statistics of interest
    - Example: for the mean or median, calculate the standard deviation of the bootstrap estimates to estimate the standard error, take the 2.5th and 97.5th percentiles as a 95% confidence interval, and draw a histogram of the bootstrap distribution

## Exercise 2 - Try it out on the mtcars data set

## Exercise 3 - Check via scikits.bootstrap or NumPy
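
A possible NumPy-only check, assuming NumPy 1.17 or later for `np.random.default_rng`. For the sample mean, the bootstrap standard error should land close to the textbook estimate `s / sqrt(n)`, which gives an independent number to compare against. The data here are synthetic, not mtcars.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=20.0, scale=6.0, size=50)  # synthetic stand-in

n_resamples = 10_000
# Each row of idx selects one resample of the data, with replacement.
idx = rng.integers(0, data.size, size=(n_resamples, data.size))
boot_means = data[idx].mean(axis=1)

boot_se = boot_means.std(ddof=1)
analytic_se = data.std(ddof=1) / np.sqrt(data.size)
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])

print(boot_se, analytic_se, ci_low, ci_high)
```

This comparison only works for statistics with a known standard-error formula, such as the mean; for the median, the bootstrap result has to be judged on its own.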

*Scikit-learn deprecated and later removed its `Bootstrap` class, which is why this check uses scikits.bootstrap or NumPy instead.*