# Machine learning - Assignment 5 - The bootstrap method
____
**Author**: Kemal Cikota

**Course**: Machine learning
____

## Introduction

In this assignment, I will explore the use of **bootstrap methods** for estimating the accuracy of regression model parameters. The difference between bootstrap methods and classical statistical methods for estimating such quantities is that the bootstrap does not rely on strict assumptions about the distribution of the data. This report includes a conceptual/theoretical part, which discusses k-fold cross-validation, and a practical part where I apply bootstrap methods to estimate standard errors for linear and quadratic regression models.

I am a complete nerd when it comes to cars, so this dataset was quite intuitive and easy for me to understand. For those who don't know the features, [this source](https://islp.readthedocs.io/en/latest/datasets/Auto.html) can be used as a reference for what they mean. These short descriptions can be of great help.

## Conceptual Questions

**1. Explain how k-fold cross-validation is implemented**
K-fold cross-validation is a resampling technique used to evaluate the performance of a model while making efficient use of the available data. Here is how it may be implemented:

- <ins>Divide data</ins>: The data is first divided into K parts of roughly equal size. These K parts are called the "folds", hence the name "K-fold".
- <ins>Train and validate (iteratively)</ins>: In each iteration, one of the folds is used as the validation set, while the remaining K-1 folds are combined to form the training set. The model is trained on these combined training folds and evaluated on the validation fold. This process is repeated K times, with each fold serving as the validation set exactly once.
- <ins>Compute performance metrics</ins>: After these K iterations, the performance metrics (such as $R^2$ or MSE) from each iteration are averaged to provide an overall estimate.
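
The three steps above can be sketched in a few lines of NumPy. This is only a minimal illustration under my own assumptions: the function name `k_fold_mse`, the synthetic data and the choice of ordinary least squares as the model are all made up for the example and are not part of the assignment.

```python
import numpy as np

def k_fold_mse(X, y, k=5, seed=0):
    """Estimate the test MSE of a least-squares fit with k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(y))      # shuffle the row indices once
    folds = np.array_split(indices, k)     # step 1: k folds of (nearly) equal size
    fold_mse = []
    for i in range(k):
        # step 2: fold i is the validation set, the other k-1 folds form the training set
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        A_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
        beta = np.linalg.lstsq(A_train, y[train_idx], rcond=None)[0]
        A_val = np.column_stack([np.ones(len(val_idx)), X[val_idx]])
        residuals = y[val_idx] - A_val @ beta
        fold_mse.append(float(np.mean(residuals ** 2)))
    # step 3: average the per-fold metric
    return float(np.mean(fold_mse)), fold_mse

# Synthetic data (hypothetical, only for illustration)
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 + 2.0 * X[:, 0] + rng.normal(0, 1, size=100)

cv_mse, per_fold = k_fold_mse(X, y, k=5)
print(f"5-fold CV MSE: {cv_mse:.3f}")
print("per-fold MSE:", [round(m, 3) for m in per_fold])
```

Since the noise in the synthetic data has variance 1, the averaged MSE should land somewhere near 1, while the individual folds scatter around that value.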

The value of K used in practice can vary a lot depending on the data and circumstances, but K=5 or K=10 are common choices, and higher values like 15 can also be used. There is also a special case of K-fold called "Leave-One-Out Cross-Validation" (LOOCV), where K=N and N is the total number of data points. This is often used when the dataset is relatively small, because each model is trained on N-1 data points, which maximizes the amount of data available for training.
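
To make the relation between K and LOOCV concrete, here is a small sketch using scikit-learn's `KFold`, `LeaveOneOut` and `cross_val_score`. The data is synthetic and the sample size of 40 is an arbitrary assumption; the point is only that LOOCV behaves like K-fold with K=N, so it fits the model N times.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

# Synthetic data (hypothetical, not the Auto dataset)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(40, 1))
y = 1.0 + 0.5 * X[:, 0] + rng.normal(0, 1, size=40)

schemes = {
    "5-fold": KFold(n_splits=5, shuffle=True, random_state=0),
    "10-fold": KFold(n_splits=10, shuffle=True, random_state=0),
    "LOOCV": LeaveOneOut(),  # one fold per data point, i.e. K = N
}

results = {}
for name, cv in schemes.items():
    # scikit-learn reports the negated MSE so that "higher is better"
    scores = cross_val_score(LinearRegression(), X, y, cv=cv,
                             scoring="neg_mean_squared_error")
    results[name] = {"n_fits": len(scores), "mse": float(-scores.mean())}
    print(f"{name:8s} model fits = {len(scores):3d}   CV MSE = {-scores.mean():.3f}")
```

MSE is used as the metric here because $R^2$ is not defined on a validation set that contains a single point, which is what LOOCV produces.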

In general though, K-fold is useful because it averages the performance metric over several validation folds. This gives a more reliable estimate of how the model performs on unseen data than a single train/test split, which makes overfitting easier to detect.

**2. What are advantages and disadvantages of k-fold cross-validation relative to the validation set approach?**
I have already discussed what K-fold is and roughly what it does in the previous conceptual task. The validation set approach is just a simple train/test split, where the data is randomly split (typically with a fixed random seed) into one training set and one validation set, and the model is trained and evaluated only once.

The advantages of using K-fold over the standard validation set approach are:
- Instead of using a fixed portion of the data for validation, every data point is used for validation exactly once and for training in the other K-1 iterations. Each model is therefore trained on a larger share of the data (a fraction (K-1)/K) than in a typical single split, so the performance estimate is less pessimistic about how the model would do on new data.
- Because K-fold "averages out" or "flattens" the performance measurements, the final value we get for some metric (for example $R^2$) is more stable and reliable: it is the average over multiple iterations, which reduces the variance compared to the validation set approach. This is important because the validation set approach can sometimes make a model look much better than it actually is when the split is "lucky" and the validation set contains easy examples. This can also work the other way around for "unlucky" splits, so that's why it is good to have an average.
- Evaluating the model on several different train/validation partitions makes it less likely that our conclusions (for example, which model is best) are overfitted to one particular split of the data.
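
The "lucky" and "unlucky" split effect can be illustrated with a small experiment. The sketch below repeats a 50/50 validation split and a 10-fold cross-validation for 20 different random seeds on synthetic data and compares how much the resulting MSE estimates move around. The data, the split ratio and the number of seeds are my own assumptions for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Synthetic data (hypothetical, only for illustration)
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(100, 1))
y = 2.0 + 1.5 * X[:, 0] + rng.normal(0, 2, size=100)

holdout_mse, kfold_mse = [], []
for seed in range(20):
    # Validation set approach: one random split, one fit
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.5, random_state=seed)
    model = LinearRegression().fit(X_tr, y_tr)
    holdout_mse.append(mean_squared_error(y_val, model.predict(X_val)))

    # 10-fold CV: ten fits, metric averaged over the folds
    cv = KFold(n_splits=10, shuffle=True, random_state=seed)
    scores = cross_val_score(LinearRegression(), X, y, cv=cv,
                             scoring="neg_mean_squared_error")
    kfold_mse.append(float(-scores.mean()))

print(f"validation set: mean = {np.mean(holdout_mse):.2f}, std = {np.std(holdout_mse):.2f}")
print(f"10-fold CV:     mean = {np.mean(kfold_mse):.2f}, std = {np.std(kfold_mse):.2f}")
```

The standard deviation across seeds is the interesting number: a smaller value means the estimate depends less on how the data happened to be split.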

The disadvantages of using K-fold over the standard validation set approach are:
- K-fold requires roughly K times more computation than a single validation set approach, because the model has to be trained K times. So if we have K=10 or even higher, the model is fitted 10 or more times instead of once. For large datasets or complex models that are expensive to train, this can be very impractical.
- Sometimes it can be hard to find the best K for a given dataset. K-fold is also somewhat harder to implement than a normal validation set approach, as it requires us to partition the data carefully and loop over it in multiple iterations.

**3. What are advantages and disadvantages of k-fold cross-validation relative to LOOCV?**

The advantages of using K-fold over LOOCV are:
- Because LOOCV uses K=N, where N is the number of data points, it can be very computationally heavy for larger datasets, whereas regular K-fold usually uses something like K=5 or K=10. This makes LOOCV less reasonable for larger datasets, because it requires N model fits, which can be many times more than even K=10.
- Because LOOCV leaves out only one data point at a time, the training set in each iteration is almost the entire dataset, so the N fitted models are nearly identical to each other. K-fold has more diversity in the training subsets for each iteration.
- K-fold typically gives a more stable estimate by averaging over folds with more varied training sets. LOOCV tends to produce an estimate with higher variance, because it averages the results of N highly correlated models and each validation set consists of a single data point.

The disadvantages of using K-fold over LOOCV are:
- Even though K-fold has lower variance than LOOCV, it will have a somewhat higher bias in the performance estimates. LOOCV uses almost all the data (N-1 points) for training, which makes each model very similar to the one trained on the full dataset, while K-fold trains on only a fraction (K-1)/K of the data.
- Because we choose K ourselves in K-fold, and the data is randomly partitioned into folds, the outcome depends on our choice of K and on the particular random split. LOOCV, in contrast, is deterministic: there is only one way to leave out each data point, and every data point is used for validation exactly once.
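
The last point can be checked with a short sketch. Running LOOCV several times on the same data should give the same estimate every time, while shuffled 5-fold cross-validation changes with the random seed. The synthetic data and the helper name `cv_mse` are assumptions made for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

# Synthetic data (hypothetical, only for illustration)
rng = np.random.default_rng(2)
X = rng.uniform(0, 10, size=(30, 1))
y = 4.0 - 0.8 * X[:, 0] + rng.normal(0, 1, size=30)

def cv_mse(cv):
    """Average validation MSE of a linear regression under the given CV scheme."""
    scores = cross_val_score(LinearRegression(), X, y, cv=cv,
                             scoring="neg_mean_squared_error")
    return float(-scores.mean())

# LOOCV has no random component: repeated runs use the same N splits
loocv_runs = [cv_mse(LeaveOneOut()) for _ in range(3)]

# Shuffled K-fold depends on how the rows are assigned to folds
kfold_runs = [cv_mse(KFold(n_splits=5, shuffle=True, random_state=s)) for s in range(3)]

print("LOOCV :", [round(m, 4) for m in loocv_runs])
print("5-fold:", [round(m, 4) for m in kfold_runs])
```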

## Practical

For the practical part of the assignment, I work with the Auto.csv dataset, which consists of some information about cars like the name and year, as well as some stats about each car like weight, horsepower, miles per gallon (MPG) and engine displacement. With this data I will use bootstrap methods to estimate the standard errors of the parameters of linear and quadratic regression models.
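
Before touching the real data, the general idea of a bootstrap standard error can be sketched on synthetic numbers. The recipe is: resample the rows with replacement, refit the model on each resample, and take the standard deviation of the refitted coefficients. Everything in the block below is an assumption made for illustration: the helper name `bootstrap_coef_se`, the number of resamples (1000), and the made-up `hp`/`mpg` arrays, which only stand in for the real columns and are not values from Auto.csv.

```python
import numpy as np

def bootstrap_coef_se(features, y, n_boot=1000, seed=0):
    """Bootstrap standard errors of least-squares coefficients (intercept first)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    A = np.column_stack([np.ones(n), features])   # add an intercept column
    coefs = np.empty((n_boot, A.shape[1]))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)          # draw n rows with replacement
        coefs[b] = np.linalg.lstsq(A[idx], y[idx], rcond=None)[0]
    # spread of the refitted coefficients = bootstrap standard error
    return coefs.std(axis=0, ddof=1)

# Made-up stand-ins for horsepower and mpg (NOT the real Auto data)
rng = np.random.default_rng(3)
hp = rng.uniform(50, 200, size=200)
mpg = 60.0 - 0.4 * hp + 0.001 * hp ** 2 + rng.normal(0, 4, size=200)

se_linear = bootstrap_coef_se(hp, mpg)                                 # mpg ~ hp
se_quadratic = bootstrap_coef_se(np.column_stack([hp, hp ** 2]), mpg)  # mpg ~ hp + hp^2

print("linear    SE (intercept, hp):      ", se_linear)
print("quadratic SE (intercept, hp, hp^2):", se_quadratic)
```

Resampling whole rows (pairs of predictor and response) keeps the relationship between the variables intact in every resample, which is why no assumption about the error distribution is needed.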

### Load the data and get an overview of the data


In [4]:
import pandas as pd # Never coded in R before but this seems to be the equivalent of library(pandas) in R
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats
import statsmodels.api as sm
import numpy as np

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA # pip install scikit-learn
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis as QDA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, accuracy_score


# load Auto.csv
auto = pd.read_csv('Auto.csv')

# Set pandas option to display all columns
pd.set_option('display.max_columns', None)

Once the dataset is loaded, we can display the number of columns (variables) and their names. Note that this count includes every column in the file, not only the predictors.

In [3]:
numFeatures = auto.shape[1]
print(numFeatures)

featureNames = auto.columns.tolist()
print(featureNames, end="\n\n")

10
['Unnamed: 0', 'mpg', 'cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'year', 'origin', 'name']
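
The `'Unnamed: 0'` entry in the output above looks like a row index that was saved together with the file rather than a real feature (this is an assumption based on its name). A possible way to handle it is to pass `index_col=0` to `pd.read_csv`. The sketch below demonstrates this on a tiny in-memory CSV with the same header; the two data rows are made-up placeholders, not rows from Auto.csv, and treating `mpg` as the response is also an assumption.

```python
import io
import pandas as pd

# Tiny stand-in file with the same header layout (placeholder values, NOT real data)
toy_csv = io.StringIO(
    ",mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin,name\n"
    "0,20.0,4,100.0,90,2500,15.0,75,1,car a\n"
    "1,25.0,4,120.0,80,2300,16.0,76,2,car b\n"
)

raw = pd.read_csv(toy_csv)                   # the empty header cell becomes 'Unnamed: 0'
toy_csv.seek(0)                              # rewind the buffer to read it again
indexed = pd.read_csv(toy_csv, index_col=0)  # use the first column as the row index

# Assumption: mpg is the response and name is an identifier, so neither is a predictor
predictors = indexed.drop(columns=["mpg", "name"]).columns.tolist()

print(raw.shape[1], raw.columns[0])
print(indexed.shape[1])
print(len(predictors), predictors)
```

With the real file, the same call would be `pd.read_csv('Auto.csv', index_col=0)`.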

