<a href="https://colab.research.google.com/github/AzlinRusnan/Machine-Learning/blob/main/Resampling_Methods.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Resampling Methods**

- **Resampling** means repeatedly drawing samples from a training set and refitting a model on each sample to obtain additional information about the fitted model.

- Common resampling methods: **bootstrapping** and **cross validation**.

- **Cross validation** can be used to estimate the test error in order to evaluate model performance (model assessment) or to select an appropriate level of flexibility (model selection).

- **Bootstrapping** provides a measure of the accuracy of a parameter estimate or of a statistical learning method.

### **Cross Validation**

**Cross Validation** is a technique used in machine learning to evaluate the performance of a model. The main goal is to make sure that the model works well on unseen data (data it hasn't been trained on).

Here's how it works:

1. **Divide the Data:** Split the entire dataset into several smaller parts, called "folds". A common approach is to split it into 5 or 10 parts (folds).

2. **Train and Test:** For each fold:
- Use some of the folds to train the model.
- Use the remaining fold to test the model.

3. **Repeat and Average:** Repeat this process for each fold, so every fold is used as a test set once. Then, average the results to get a final performance estimate.

This process helps in understanding how the model performs on different subsets of the data and gives a better idea of its true performance.
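
The three steps above can be sketched with scikit-learn's `KFold`. This is only an illustration under stated assumptions: the data are synthetic (not the Auto dataset used later), and the names `X_demo`, `y_demo`, `fold_mse`, and `cv_mse` are made up for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Synthetic data (an assumption for illustration, not the Auto dataset)
rng = np.random.default_rng(0)
X_demo = rng.uniform(50, 200, size=(100, 1))
y_demo = 40 - 0.15 * X_demo[:, 0] + rng.normal(0, 3, size=100)

# Step 1: divide the data into 5 folds
kf = KFold(n_splits=5, shuffle=True, random_state=1)

# Step 2: train on four folds, test on the held-out fold
fold_mse = []
for train_idx, test_idx in kf.split(X_demo):
    fold_model = LinearRegression().fit(X_demo[train_idx], y_demo[train_idx])
    fold_pred = fold_model.predict(X_demo[test_idx])
    fold_mse.append(mean_squared_error(y_demo[test_idx], fold_pred))

# Step 3: average the per-fold errors into one estimate
cv_mse = np.mean(fold_mse)
print(fold_mse)
print(cv_mse)
```

Each observation lands in the test fold exactly once, so `cv_mse` uses every row for evaluation without ever scoring a model on data it was trained on.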

### **Bootstrapping**

**Bootstrapping** is another technique used to estimate the performance of a model, especially when the dataset is small.

Here's how it works:

1. **Generate Samples:** Randomly select samples from the original dataset to create many new datasets (called "bootstrap samples"). Each sample is created by randomly picking data points with replacement (meaning the same data point can be picked more than once).

2. **Train and Test:** For each bootstrap sample:
- Train the model on the sample.
- Test the model on the data points that were not included in the sample (called the "out-of-bag" data).

3. **Aggregate Results:** Calculate the performance metric for each sample and then average these results to get a final estimate.

Bootstrapping helps in getting an idea of the variability and reliability of the model's performance.
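
A minimal sketch of the bootstrap loop described above, assuming synthetic data and a simple linear model. The names `X_demo`, `y_demo`, `oob_mse`, and `slopes` are hypothetical and chosen for this example only. Besides the out-of-bag error, the loop also records the fitted slope, whose spread across bootstrap samples gives a rough measure of how variable that parameter estimate is.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic data (an assumption for illustration, not the Auto dataset)
rng = np.random.default_rng(0)
n = 100
X_demo = rng.uniform(50, 200, size=(n, 1))
y_demo = 40 - 0.15 * X_demo[:, 0] + rng.normal(0, 3, size=n)

n_boot = 200
oob_mse = []
slopes = []
for _ in range(n_boot):
    # Step 1: draw n row indices with replacement
    boot_idx = rng.integers(0, n, size=n)
    # Rows never drawn form the out-of-bag set
    oob_mask = np.ones(n, dtype=bool)
    oob_mask[boot_idx] = False
    if not oob_mask.any():
        continue  # skip the (very unlikely) case of an empty out-of-bag set
    # Step 2: train on the bootstrap sample, test on the out-of-bag rows
    boot_model = LinearRegression().fit(X_demo[boot_idx], y_demo[boot_idx])
    oob_pred = boot_model.predict(X_demo[oob_mask])
    oob_mse.append(mean_squared_error(y_demo[oob_mask], oob_pred))
    slopes.append(boot_model.coef_[0])

# Step 3: aggregate across bootstrap samples
print(np.mean(oob_mse))          # average out-of-bag MSE
print(np.std(slopes, ddof=1))    # bootstrap standard error of the slope
```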

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [3]:
file_path = '/content/gdrive/MyDrive/STQD 6024 Machine Learning/auto.csv'
df1 = pd.read_csv(file_path)
df1.head()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin,name
0,18.0,8,307.0,130,3504,12.0,70,1,chevrolet chevelle malibu
1,15.0,8,350.0,165,3693,11.5,70,1,buick skylark 320
2,18.0,8,318.0,150,3436,11.0,70,1,plymouth satellite
3,16.0,8,304.0,150,3433,12.0,70,1,amc rebel sst
4,17.0,8,302.0,140,3449,10.5,70,1,ford torino


In [4]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 392 entries, 0 to 391
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           392 non-null    float64
 1   cylinders     392 non-null    int64  
 2   displacement  392 non-null    float64
 3   horsepower    392 non-null    int64  
 4   weight        392 non-null    int64  
 5   acceleration  392 non-null    float64
 6   year          392 non-null    int64  
 7   origin        392 non-null    int64  
 8   name          392 non-null    object 
dtypes: float64(3), int64(5), object(1)
memory usage: 27.7+ KB


We begin by using the ${\tt sample()}$ function to split the set of observations into two halves, by selecting a random subset of 196 observations out of the original 392 observations. We refer to these observations as the training set.

We'll use the ${\tt random\_state}$ parameter to set a seed for the random number generator used by ${\tt sample()}$, so that you'll obtain precisely the same results each time. It is generally a good idea to set a random seed when performing an analysis such as cross-validation that contains an element of randomness, so that the results obtained can be reproduced precisely at a later time.
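
As a small check of the reproducibility claim, the sketch below samples twice from a toy DataFrame with the same seed. The DataFrame `demo_df` and the names `first` and `second` are hypothetical and stand in for the real data.

```python
import pandas as pd

# Toy DataFrame (an assumption for illustration)
demo_df = pd.DataFrame({"x": range(10)})

# Two draws with the same seed select the same rows in the same order
first = demo_df.sample(5, random_state=1)
second = demo_df.sample(5, random_state=1)

print(first.index.tolist())
print(first.equals(second))
```

Dropping `random_state` would generally give a different subset on each call, which is why the seed is fixed before splitting.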

In [8]:
# Training set: 196 randomly selected rows from df1 (seeded for reproducibility)
train_df = df1.sample(196, random_state = 1)
# Test set: the rows of df1 whose index labels are not in train_df
test_df = df1.drop(train_df.index)

X_train = train_df['horsepower'].values.reshape(-1,1)
y_train = train_df['mpg']
X_test = test_df['horsepower'].values.reshape(-1,1)
y_test = test_df['mpg']

#.values: Converts the selected column into a NumPy array.
#.reshape(-1, 1): Reshapes the array to have one column and as many rows as needed (-1 infers the number of rows automatically). This is necessary because many machine learning algorithms expect the input to be in a specific shape (2D array).

#X_train and y_train: Used to train the model, enabling it to learn the relationship between input features and output targets.
#X_test and y_test: Used to evaluate the model's performance on new, unseen data, ensuring it generalizes well and isn't just memorizing the training data.
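
To make the effect of `.values.reshape(-1, 1)` concrete, here is a small sketch on a hypothetical three-element column (`demo_col` is made up for this example; the values mirror the first few horsepower entries shown earlier).

```python
import pandas as pd

# Hypothetical column standing in for df1['horsepower']
demo_col = pd.Series([130, 165, 150], name="horsepower")

as_1d = demo_col.values                  # plain 1-D NumPy array
as_2d = demo_col.values.reshape(-1, 1)   # one column, rows inferred

print(as_1d.shape)  # (3,)
print(as_2d.shape)  # (3, 1)
```

scikit-learn estimators expect the feature matrix to be 2-D (rows are observations, columns are features), which is why a single predictor has to be reshaped into a one-column array.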

We then use ${\tt LinearRegression()}$ to fit a linear regression to predict ${\tt mpg}$ from ${\tt horsepower}$ using only
the observations corresponding to the training set.

In [9]:
from sklearn.linear_model import LinearRegression

lm = LinearRegression() #Think of it as creating a new blank model that we will train with data.

model = lm.fit(X_train, y_train)

#X_train: The input features for training (in this case, 'horsepower' values).
#y_train: The target values for training (in this case, 'mpg' values).

#model: Stores the trained model. After training, this model has learned the relationship between 'horsepower' and 'mpg'.

We now use the ${\tt predict()}$ function to estimate the response for the test
observations, and we use ${\tt sklearn}$ to calculate the MSE.

In [10]:
pred = model.predict(X_test)

from sklearn.metrics import mean_squared_error

MSE = mean_squared_error(y_test, pred)

print(MSE)

23.361902892587224


#### **pred = model.predict(X_test)**

- **What happens here:** Now, the trained model uses the patterns and parameters it learned during training to predict the target values for the new input features provided in X_test.

- **Input:** X_test, which contains the input features (e.g., 'horsepower' values) from the test dataset.

- **Output:** pred, which contains the predicted target values (e.g., 'mpg' values) based on X_test.

**Analogy**

Think of the training process as teaching a student how to solve math problems using a set of practice problems (X_train and y_train). Once the student has learned how to solve these problems, you give them a new set of problems (X_test) and ask them to solve them. The student's solutions to these new problems are the predictions (pred).

So, the line **model.predict(X_test)** means that the model, which has already been trained on the training data **(X_train and y_train)**, is now being used to predict the target values for the test data **(X_test)**.


#### **MSE = mean_squared_error(y_test, pred)**

- This line calculates the mean squared error between the actual 'mpg' values (from y_test) and the predicted 'mpg' values (from pred).
  -  y_test: The actual target values for the test data.
  -  pred: The predicted target values.

- MSE: Stores the calculated mean squared error. The mean squared error is a measure of how well the model's predictions match the actual values. A lower value indicates better performance.
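
For $n$ observations with actual values $y_i$ and predictions $\hat{y}_i$, the mean squared error is $\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$. The sketch below computes it by hand and compares it with `mean_squared_error`; the four value pairs are made up for illustration and are not model output.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up actual and predicted values (an assumption for illustration)
y_true_demo = np.array([18.0, 15.0, 18.0, 16.0])
y_pred_demo = np.array([17.0, 16.5, 19.0, 16.0])

# Squared errors are 1, 2.25, 1, 0, so the mean is 4.25 / 4
manual_mse = np.mean((y_true_demo - y_pred_demo) ** 2)
sklearn_mse = mean_squared_error(y_true_demo, y_pred_demo)

print(manual_mse)   # 1.0625
print(sklearn_mse)  # 1.0625
```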


In [11]:
# Training MSE, for comparison with the test MSE computed above
train_MSE = mean_squared_error(y_train, model.predict(X_train))
print(train_MSE)

24.62301015144335


Therefore, the estimated test MSE for the linear regression fit is 23.36 (the training MSE, for comparison, is 24.62). We
can use the ${\tt PolynomialFeatures()}$ function to estimate the test error for the quadratic
and cubic regressions.

**Purpose:** By comparing the test MSEs of linear, quadratic, and cubic regressions, we can determine which model best captures the underlying patterns in the data and generalizes well to new data.
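
One way this comparison might look is sketched below. It runs on simulated data rather than `auto.csv`, so the numbers it prints say nothing about the real dataset; the names `hp`, `mpg_sim`, and `test_mse_by_degree` are hypothetical, and the simulated curve is an arbitrary choice made only so that the example has some curvature to fit.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures

# Simulated predictor and response (an assumption, not the Auto data)
rng = np.random.default_rng(1)
hp = rng.uniform(50, 220, size=(392, 1))
mpg_sim = 57 - 0.47 * hp[:, 0] + 0.0012 * hp[:, 0] ** 2 + rng.normal(0, 4, size=392)

# Validation-set split: half for training, half for testing
perm = rng.permutation(392)
train_idx, test_idx = perm[:196], perm[196:]

test_mse_by_degree = {}
for degree in (1, 2, 3):  # linear, quadratic, cubic
    poly = PolynomialFeatures(degree=degree, include_bias=False)
    X_tr = poly.fit_transform(hp[train_idx])  # build the powers on training rows
    X_te = poly.transform(hp[test_idx])       # same transformation on test rows
    fit = LinearRegression().fit(X_tr, mpg_sim[train_idx])
    test_mse_by_degree[degree] = mean_squared_error(mpg_sim[test_idx], fit.predict(X_te))

print(test_mse_by_degree)
```

With the notebook's own variables, the same loop would use `X_train`, `X_test`, `y_train`, and `y_test` in place of the simulated arrays.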