# Model Validation

In machine learning and deep learning, model validation is the process of checking that a model is reliable and accurate before it is put to use.

Model validation involves assessing how well a model performs on unseen data, which is essential for determining its predictive capabilities and applicability in real-world scenarios.

Different types of models (predictive, descriptive, and prescriptive) each have specific validation needs depending on their intended use. Poorly validated models can lead to significant consequences, including incorrect predictions, financial losses, and ethical dilemmas in decision-making.


## Train-Test Split

The train-test split is a fundamental step in machine learning and statistics to evaluate how well a model can generalize to new data. Here's how it works:

The idea is to split your dataset into two parts: the training set and the test set. The training set is used to "train" the model (meaning it uses this data to learn patterns), and the test set is used to evaluate its performance on unseen data.

Common ratios are 70-30, 80-20, or even 90-10 (training-test). A popular choice is 80-20 (80% of data for training, 20% for testing), though it can vary based on the dataset size.

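Without a held-out test set, you cannot tell whether the model has overfit: a model that memorizes its training data can score very well on that data and still perform poorly on new data. Evaluating on data the model never saw during training gives a more honest estimate of how it will perform in general.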

In [None]:
from sklearn.model_selection import train_test_split # scikit-learn's helper for splitting data into training and test sets
from sklearn.datasets import load_iris

Data_x, Data_y = load_iris(return_X_y = True, as_frame = True)

# Hold out 20% of the rows as a test set so the model can later be evaluated on data it was not trained on
X_train, X_test, Y_train, Y_test = train_test_split(Data_x, Data_y, test_size=0.2, shuffle = True)
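The cell above only performs the split. As a minimal sketch of the evaluation step, the example below fits a classifier on the training part and compares accuracy on both parts. The choice of LogisticRegression, the `stratify=y` argument, and the fixed `random_state` are assumptions made for illustration and are not part of the original cell.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True, as_frame=True)

# stratify=y keeps the class proportions the same in both parts;
# random_state makes the split repeatable (both are illustrative choices).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Fit on the training part only.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# A large gap between the two scores is a typical sign of overfitting.
train_acc = accuracy_score(y_train, clf.predict(X_train))
test_acc = accuracy_score(y_test, clf.predict(X_test))
print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
```

The test accuracy is the number to report, since the training accuracy is measured on data the model has already seen.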


## K-Fold Cross-Validation

K-fold cross-validation assesses a model's performance more reliably than a single train-test split. The dataset is divided into k folds of approximately equal size. The model is trained on k−1 of these folds and validated on the remaining fold, and the process is repeated k times so that each fold serves as the validation set exactly once. The evaluation metric (such as accuracy or precision) is then averaged across the k folds.

Because every sample is used for both training and validation, the technique makes efficient use of the data and is especially helpful with small datasets. It does not prevent overfitting by itself; rather, it gives a less noisy estimate of how well the model generalizes, which makes overfitting easier to detect.

In [None]:
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import numpy as np

# Sample data (features and labels)
X = np.array([[1, 2], [2, 3], [3, 4], [4, 5], [5, 6], [6, 7], [7, 8], [8, 9], [9, 10], [10, 11]])
y = np.array([0, 0, 0, 1, 1, 1, 0, 0, 1, 1])  # Example labels (binary classification)

# Define the model (Logistic Regression in this case)
model = LogisticRegression()

# Set up K-Fold Cross-Validation with k=5
kf = KFold(n_splits=5, shuffle= True)

# List to store accuracy for each fold
accuracies = []

# Perform K-Fold Cross-Validation
for train_index, test_index in kf.split(X):
    # Split data into training and test sets for this fold
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # Train the model
    model.fit(X_train, y_train)

    # Predict on the test set
    y_pred = model.predict(X_test)

    # Calculate accuracy for this fold
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)

# Calculate and print the average accuracy across all folds
average_accuracy = np.mean(accuracies)
print("Accuracies for each fold:", accuracies)
print("Average Accuracy:", average_accuracy)


Accuracies for each fold: [1.0, 0.5, 0.5, 0.5, 0.0]
Average Accuracy: 0.5
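With only ten samples, each fold above contains two test points, so the per-fold accuracies can only be 0.0, 0.5, or 1.0 and the average is very noisy. As a hedged sketch of the same procedure on a larger dataset, the example below uses scikit-learn's `cross_val_score` helper, which runs the fit-and-score loop internally. The Iris dataset, `StratifiedKFold`, and the fixed `random_state` are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)

# StratifiedKFold keeps the class proportions roughly equal in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# cross_val_score clones the model, fits it on k-1 folds and scores it on the
# remaining fold, once per fold, returning one score per fold.
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="accuracy"
)

print("Accuracies for each fold:", scores)
print("Average accuracy:", scores.mean())
```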


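## Leave-One-Out Cross-Validation

Leave-One-Out Cross-Validation (LOOCV) is a special case of k-fold cross-validation in which k equals the number of samples: each iteration trains on all samples but one and tests on the single sample that was left out. It is particularly useful when the dataset is small, as it maximizes the amount of training data available in each iteration, but it requires fitting the model once per sample, which becomes expensive on large datasets.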

In [None]:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data  # Features
y = iris.target  # Target variable

# Initialize the model
model = RandomForestClassifier(n_estimators=100, random_state=42)

# Initialize Leave-One-Out Cross-Validation
loo = LeaveOneOut()

# Lists to store predictions and true labels
y_true = []
y_pred = []

# Perform Leave-One-Out Cross-Validation
for train_index, test_index in loo.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # Train the model
    model.fit(X_train, y_train)

    # Make predictions
    prediction = model.predict(X_test)

    # Store true and predicted values
    y_true.append(y_test[0])
    y_pred.append(prediction[0])

# Calculate accuracy
accuracy = accuracy_score(y_true, y_pred)
print(f'Leave-One-Out Cross-Validation Accuracy: {accuracy:.2f}')


Leave-One-Out Cross-Validation Accuracy: 0.95
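To make the link with k-fold cross-validation concrete, the sketch below checks that `LeaveOneOut` produces the same splits as `KFold` with one fold per sample, then runs LOOCV through `cross_val_score`. The `KNeighborsClassifier` is an assumption picked only because it is quick to fit 150 times; it is not the model used in the cell above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
n_samples = X.shape[0]

loo = LeaveOneOut()
kf = KFold(n_splits=n_samples)  # no shuffling: fold i holds out row i

# Both splitters hold out exactly one row per iteration, in the same order.
same_splits = all(
    np.array_equal(loo_test, kf_test)
    for (_, loo_test), (_, kf_test) in zip(loo.split(X), kf.split(X))
)
print("number of LOO splits:", loo.get_n_splits(X))
print("identical to KFold(n_splits=n_samples):", same_splits)

# Each score is 0.0 or 1.0 because every test set has a single sample,
# so the mean of the scores is the overall LOOCV accuracy.
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=loo)
print(f"LOOCV accuracy: {scores.mean():.3f}")
```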


## Bootstrapping

Bootstrapping is a statistical resampling technique that repeatedly draws samples, with replacement, from the original dataset; here, a model is trained on each resample. The method is especially useful when the underlying distribution of the data is unknown or when the sample size is limited.

Bootstrapping involves drawing multiple samples from the dataset, with each sample being the same size as the original, and the same data point can be selected multiple times within a single sample.
Typically, thousands of bootstrap samples are generated, and for each sample, a statistic (e.g., mean, median, variance) is calculated, creating a distribution of that statistic.
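The procedure just described can be sketched in a few lines of NumPy. The example below bootstraps the mean of a small synthetic sample and reads a 95% percentile interval off the resulting distribution. The data, the seed, and the number of resamples are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic sample, used here only for illustration.
data = rng.normal(loc=5.0, scale=2.0, size=50)

n_boot = 2000
boot_means = np.empty(n_boot)
for i in range(n_boot):
    # Draw a resample of the same size as the data, with replacement,
    # so the same observation can appear more than once.
    resample = rng.choice(data, size=data.size, replace=True)
    boot_means[i] = resample.mean()

# The spread of the bootstrap means approximates the sampling
# variability of the mean; the percentiles give a simple interval.
lower, upper = np.percentile(boot_means, [2.5, 97.5])
print(f"sample mean: {data.mean():.3f}")
print(f"95% percentile interval: [{lower:.3f}, {upper:.3f}]")
```

The same loop works for other statistics, such as the median or variance, by changing the line that computes `resample.mean()`.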

One of the key advantages of bootstrapping is that it is non-parametric, meaning it does not assume a specific distribution for the data, making it versatile for various datasets. However, bootstrapping does have its disadvantages; it can be computationally intensive due to the need to generate many samples, and its effectiveness may be compromised if the original dataset is too small, leading to unreliable or unstable results.



One way to picture bootstrapping for models is by analogy with random search for hyperparameter tuning. Instead of testing random parameter combinations, you train multiple models on different random resamples of the data and store every one of them. In the end, predictions are made by combining the outputs of all these models, a technique known as bootstrap aggregating, or bagging.

In [None]:
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split


Data_x, Data_y = load_iris(return_X_y = True, as_frame = True)
# Note: test_size=0.8 holds out 80% of the rows for testing, so only 30 of the 150 rows are used for training.
X_train, X_test, Y_train, Y_test = train_test_split(Data_x, Data_y, test_size=0.8, shuffle = True)

class RF_bootstrapped():

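  def __init__(self, iterations = 100, best_parameters = None, Best_Model_accuracy = 0, best_model = None, models = None):

    self.iterations = iterations
    # None is used as the default and fresh lists are created here, so separate instances never share the same list.
    self.best_parameters = best_parameters if best_parameters is not None else []
    self.Best_Model_accuracy = Best_Model_accuracy
    self.best_model = best_model
    self.models = models if models is not None else []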
  def fit(self, X_train, Y_train, X_test, y_test):
    try:

      n_samples = X_train.shape[0] # Number of training rows; each bootstrap sample draws this many rows.

      for iteration in range(self.iterations):

        # Draw n_samples row positions from 0 to n_samples - 1 with replacement, so positions can repeat.
        Bootstrap_indeces = np.random.choice(np.arange(n_samples), size = n_samples, replace= True)

        Bootstrap_x = X_train.iloc[Bootstrap_indeces] # Feature rows at the sampled positions.
        Bootstrap_y = Y_train.iloc[Bootstrap_indeces] # Matching labels for those rows.

        Random_forest = RandomForestClassifier(n_estimators = 6 , max_leaf_nodes = 4, max_depth = 13)

        Random_forest.fit(Bootstrap_x, Bootstrap_y) # Trains the model on this bootstrap sample.

        self.models.append(Random_forest) # Keep every model so Aggregate_predict can vote across all of them.

        report_considered = classification_report(
            y_test, Random_forest.predict(X_test), output_dict=True
        )

        if report_considered['accuracy'] > self.Best_Model_accuracy: # Track the most accurate model seen so far.
            self.Best_Model_accuracy = report_considered['accuracy']

            self.best_model = Random_forest

        print("******************************************************************************************************************************************************")
        print(f"The best Accuracy from Bootstrapping is {self.Best_Model_accuracy}")
        print("******************************************************************************************************************************************************")
      # Returning the single best model is optional, since every model is stored for Aggregate_predict.
      # Picking the best model by test-set accuracy also makes that accuracy an optimistic estimate.
      return self.best_model

    except KeyboardInterrupt:

      print("********************************************************************************************************************************************************")
      print(f"Interrupted after training {len(self.models)} models")
      print(f"The best accuracy so far is {self.Best_Model_accuracy}")
      print("********************************************************************************************************************************************************")
      return self.best_model


#----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
#----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

  def Aggregate_predict(self, X): # This uses all the models trained during bootstrapping to make a prediction based on majority voting.

      # Generate an array of predictions from each model in the ensemble, where each model predicts labels for all instances in X.
      Predictions = np.array([model.predict(X) for model in self.models])

      # Apply majority voting to determine the final prediction for each instance:
      final_predictions = np.apply_along_axis(lambda x: np.bincount(x).argmax(), axis=0, arr=Predictions)

      return final_predictions

#----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
#----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------


Test = RF_bootstrapped()

Model_bootstrapped = Test.fit(X_train, Y_train, X_test, Y_test)





******************************************************************************************************************************************************
The best Accuracy from Bootstrapping is 0.95
******************************************************************************************************************************************************
******************************************************************************************************************************************************
The best Accuracy from Bootstrapping is 0.95
******************************************************************************************************************************************************
******************************************************************************************************************************************************
The best Accuracy from Bootstrapping is 0.95
**************************************************************************************************************
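The class above scores every resampled model on the same held-out test set. A common alternative, sketched here as an assumption rather than as the approach used above, scores each model on the rows its bootstrap sample happened to skip (the out-of-bag rows), so no separate test set is needed. The decision tree, the seed, and the 50 resamples are illustrative choices.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
n_samples = X.shape[0]

oob_scores = []
for _ in range(50):
    # Row positions drawn with replacement form the bootstrap sample.
    boot_idx = rng.integers(0, n_samples, size=n_samples)

    # Rows that were never drawn are "out of bag" for this resample.
    oob_mask = np.ones(n_samples, dtype=bool)
    oob_mask[boot_idx] = False

    model = DecisionTreeClassifier(max_depth=3, random_state=0)
    model.fit(X[boot_idx], y[boot_idx])

    # Score only on rows the model did not see during training.
    oob_scores.append(model.score(X[oob_mask], y[oob_mask]))

print(f"mean out-of-bag accuracy: {np.mean(oob_scores):.3f}")
print(f"std across resamples: {np.std(oob_scores):.3f}")
```

On average roughly a third of the rows are out of bag in each resample, which is why each model still has a reasonable number of unseen rows to be scored on.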

### Fitting

The fit method from the class above, shown on its own for readability.

In [None]:
def fit(self, X_train, Y_train, X_test, y_test):

    n_samples = X_train.shape[0] # Number of training rows; each bootstrap sample draws this many rows.

    for iteration in range(self.iterations):

        # Draw n_samples row positions from 0 to n_samples - 1 with replacement, so positions can repeat.
        Bootstrap_indeces = np.random.choice(np.arange(n_samples), size = n_samples, replace= True)

        Bootstrap_x = X_train.iloc[Bootstrap_indeces] # Feature rows at the sampled positions.
        Bootstrap_y = Y_train.iloc[Bootstrap_indeces] # Matching labels for those rows.

        Random_forest = RandomForestClassifier(n_estimators = 6 , max_leaf_nodes = 4, max_depth = 13)

        Random_forest.fit(Bootstrap_x, Bootstrap_y) # Trains the model on this bootstrap sample.

        self.models.append(Random_forest) # Keep every model so Aggregate_predict can vote across all of them.

        report_considered = classification_report(
            y_test, Random_forest.predict(X_test), output_dict=True
        )

        if report_considered['accuracy'] > self.Best_Model_accuracy: # Track the most accurate model seen so far.
            self.Best_Model_accuracy = report_considered['accuracy']

            self.best_model = Random_forest

        print("******************************************************************************************************************************************************")
        print(f"The best Accuracy from Bootstrapping is {self.Best_Model_accuracy}")
        print("******************************************************************************************************************************************************")

    # Returning the best model is optional. Every model is already stored, so instead of
    # relying on the single most accurate one, consider combining all of them for an
    # ensemble prediction, which is often more robust.
    return self.best_model

### Predictions

The Aggregate_predict method from the class above, shown on its own for readability.

In [None]:
def Aggregate_predict(self, X): # This uses all the models trained during bootstrapping to make a prediction based on majority voting.

    # Generate an array of predictions from each model in the ensemble, where each model predicts labels for all instances in X.
    Predictions = np.array([model.predict(X) for model in self.models])

    # Apply majority voting to determine the final prediction for each instance:
    final_predictions = np.apply_along_axis(lambda x: np.bincount(x).argmax(), axis=0, arr=Predictions)

    return final_predictions
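To see what the majority vote does, here is a small worked example on a hypothetical prediction matrix (three models, four instances). It applies the same `np.bincount(...).argmax()` step column by column. Note that `np.bincount` only accepts non-negative integers, so this approach assumes integer class labels such as 0, 1, 2.

```python
import numpy as np

# Hypothetical predictions: one row per model, one column per instance.
predictions = np.array([
    [0, 1, 2, 1],
    [0, 2, 2, 1],
    [1, 1, 2, 0],
])

# For each column, count how often each label occurs and keep the most frequent one.
final = np.apply_along_axis(
    lambda votes: np.bincount(votes).argmax(), axis=0, arr=predictions
)

print(final)  # [0 1 2 1]
```

When two labels receive the same number of votes, `argmax` returns the first maximum, so ties resolve to the smallest label.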