# Parameters vs Hyperparameters

**Parameters** are the internal values that a model learns from the training data. These values adjust automatically during the learning process to help the model make accurate predictions. For example, in linear regression, parameters are the coefficients (the slope and intercept of the line), and in neural networks, they are the weights and biases that determine how data is passed through layers. Parameters are the model's way of adapting to the data and are updated during training.

**Hyperparameters** are settings that are defined before the model begins training, and they control the learning process itself. They are not learned from the data; the user sets them, either by hand or through a search procedure such as grid search, to influence how the model learns. Examples include the number of trees in a random forest or the learning rate in a neural network. They typically stay fixed during a training run and are tuned between runs to improve performance.
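
The distinction can be sketched with scikit-learn. This is a minimal illustration on made-up data (the toy `X` and `y` are assumptions, not part of the original notes): the slope and intercept are learned by `fit()`, while `n_estimators` is chosen beforehand and is not changed by fitting.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Toy data (made up for illustration): y = 3x + 2 exactly
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 3.0 * X.ravel() + 2.0

# Parameters: learned from the data during fit()
lin = LinearRegression().fit(X, y)
print("learned slope:", lin.coef_[0])
print("learned intercept:", lin.intercept_)

# Hyperparameters: chosen before fit() and left unchanged by it
forest = RandomForestRegressor(n_estimators=20, random_state=0)
forest.fit(X, y)
print("n_estimators setting:", forest.get_params()["n_estimators"])
print("trees actually built:", len(forest.estimators_))
```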

# Metrics

## Classification

**Dependent on the Probability Threshold:**
Metrics like accuracy, precision, recall, F1 score, false positive rate (FPR), and false negative rate (FNR) rely on a specific probability threshold. This threshold decides whether a prediction is classified as positive or negative. Changing the threshold (e.g., 0.5, 0.7) affects how many predictions are classified as each class, which in turn alters these metrics.

* Accuracy: Measures the percentage of correct predictions at a chosen threshold.
* Precision: Measures the proportion of positive predictions that are correct.
* Recall: Measures how many actual positives were correctly identified.
* F1 Score: Balances precision and recall.
* FPR/FNR: Measure how often the model incorrectly classifies negatives or positives, respectively.
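
To make the threshold dependence concrete, here is a small pure-Python sketch. The labels and probabilities are made up for illustration, and `threshold_metrics` is a hypothetical helper, not a library function. The same predicted probabilities give different metric values at thresholds 0.5 and 0.7.

```python
# Made-up labels and predicted probabilities (illustrative only)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_prob = [0.9, 0.8, 0.6, 0.4, 0.75, 0.3, 0.2, 0.2, 0.1, 0.1]


def safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def threshold_metrics(y_true, y_prob, threshold):
    """Compute threshold-dependent metrics for binary labels."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "accuracy": safe_div(tp + tn, len(y_true)),
        "precision": precision,
        "recall": recall,
        "f1": safe_div(2 * precision * recall, precision + recall),
        "fpr": safe_div(fp, fp + tn),
        "fnr": safe_div(fn, fn + tp),
    }


at_05 = threshold_metrics(y_true, y_prob, 0.5)
at_07 = threshold_metrics(y_true, y_prob, 0.7)
for name in at_05:
    print(f"{name}: {at_05[name]:.3f} at 0.5, {at_07[name]:.3f} at 0.7")
```

Raising the threshold here turns one true positive into a false negative, so recall and accuracy drop while the false positive rate stays the same.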


**Independent of the Probability Threshold:**
ROC-AUC (Receiver Operating Characteristic - Area Under the Curve) is independent of any specific threshold. It evaluates the model's ability to distinguish between classes across all possible thresholds, providing a more holistic view of how well the model separates positives and negatives.
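
One way to see why no threshold is needed: for binary labels, the area under the ROC curve equals the probability that a randomly chosen positive gets a higher score than a randomly chosen negative, with ties counted as half. The sketch below computes that pairwise quantity directly on made-up scores; `pairwise_auc` is a hypothetical helper written for this illustration.

```python
# Made-up labels and predicted probabilities (illustrative only)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_prob = [0.9, 0.8, 0.6, 0.4, 0.75, 0.3, 0.2, 0.2, 0.1, 0.1]


def pairwise_auc(y_true, y_prob):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [p for t, p in zip(y_true, y_prob) if t == 1]
    neg = [p for t, p in zip(y_true, y_prob) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


auc = pairwise_auc(y_true, y_prob)
print("ROC-AUC:", round(auc, 4))
```

Only the ordering of the scores matters here, which is why the result does not change when a different cut-off is chosen.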

## Regression

* Mean Squared Error (MSE): Averages the squared differences between predicted and actual values, so larger errors are penalized more heavily. Lower MSE is better.
* Root Mean Squared Error (RMSE): The square root of MSE, expressed in the same units as the target, which makes it easier to interpret. Lower RMSE is better.
* Mean Absolute Error (MAE): Averages the absolute differences between predicted and actual values. Each error contributes in proportion to its size, which makes MAE less sensitive to outliers than MSE.
* R-squared (R²): The proportion of the variation in the target that the model explains. A value closer to 1 means the model fits the data well.

**Side notes**
* **Square:** Highlights bigger errors.
* **Root:** Makes the numbers easier to understand.
* **Absolute:** Treats all errors fairly, focusing on how far off we are in general.
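
The four regression metrics can be computed by hand in a few lines. The values below are made up for illustration; one prediction is off by 4 so that the effect of squaring is visible.

```python
import math

# Made-up targets and predictions; the last prediction is off by 4
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 13.0]

n = len(y_true)
errors = [p - t for t, p in zip(y_true, y_pred)]

mse = sum(e ** 2 for e in errors) / n    # squaring highlights the big error
rmse = math.sqrt(mse)                    # back in the units of the target
mae = sum(abs(e) for e in errors) / n    # each error counted in proportion

mean_true = sum(y_true) / n
ss_res = sum(e ** 2 for e in errors)                 # unexplained variation
ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total variation
r2 = 1 - ss_res / ss_tot

print("MSE:", mse)
print("RMSE:", round(rmse, 4))
print("MAE:", mae)
print("R^2:", round(r2, 4))
```

The single error of 4 contributes 16 of the 18 squared-error units, so MSE (4.5) is much larger than MAE (1.5).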

# Cross validation

**K-Fold cross-validation** involves splitting the dataset into k equal parts (called folds). The model is trained on k-1 folds and tested on the remaining fold. This process is repeated k times, with each fold used once as the test set. It’s commonly used for general model performance evaluation when you have enough data. The assumption is that the observations are independent, so that a random split produces folds that are representative of the whole dataset. When k is small, each model is trained on a smaller share of the data (only half of it for k = 2), so the performance estimate tends to be pessimistic.

**Leave-One-Out (LOO)** cross-validation is a special case of K-Fold where k equals the number of samples. Each iteration trains the model on all but one data point and tests on that single point. It’s used when working with very small datasets, where every individual data point is crucial. The assumption is that all data points are independent. Because the model is refit once per sample, LOO is computationally intensive on anything but small datasets. Each model trains on almost all of the data, so the estimate has low bias, but it can have high variance because the training sets are nearly identical and each test set is a single point.

**Leave-P-Out (LPO)** cross-validation generalizes LOO: instead of leaving out just one data point, p points are left out for testing in each iteration, and every possible combination of p points is used. This is useful for small datasets but can be computationally expensive, because the number of combinations grows rapidly with p and with the dataset size. The assumption remains that the data is evenly distributed, and the technique provides detailed model evaluation when p is small, though it is impractical for large datasets.
* For example, with 100 data points and p = 2, each unique pair of data points is used exactly once as the test set, and the remaining 98 points are used for training.
* Every data point appears in the test set across different iterations (in 99 of the 4,950 pairs in this example), but the same pair of data points is never repeated in a test set.
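
The counting behind Leave-P-Out can be sketched with the standard library alone. This illustration enumerates the splits by hand for a toy case of 5 samples (an assumed size chosen so the output stays readable) and then computes the count for 100 samples with p = 2.

```python
from itertools import combinations
from math import comb

n_samples, p = 5, 2
indices = list(range(n_samples))

# Every combination of p indices is a test set; the rest is the training set
splits = []
for test_idx in combinations(indices, p):
    train_idx = [i for i in indices if i not in test_idx]
    splits.append((train_idx, list(test_idx)))

for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)

print("splits enumerated:", len(splits))
print("C(5, 2):", comb(n_samples, p))
print("C(100, 2):", comb(100, 2))
```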

**Repeated K-Fold** cross-validation takes the K-Fold approach and repeats it multiple times with different random splits of the data. This helps average out randomness and provides a more robust estimate of model performance. The assumption is similar to K-Fold: the data is randomly distributed, but by repeating the process, any variance due to random splits is smoothed out. This method is particularly helpful when the dataset is small, because averaging over multiple repetitions reduces the variance that comes from any single random split.

**Stratified K-Fold** cross-validation ensures that each fold has approximately the same class distribution as the original dataset, which is especially important for classification problems with imbalanced classes. It’s used when dealing with classification tasks, where you want to maintain the proportion of each class across training and test sets. The assumption is that every class has enough samples to be represented in each fold (at least k samples per class). With plain K-Fold on imbalanced data, some folds can end up with few or no minority-class samples; stratification prevents this.
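
A small sketch of the difference, using made-up labels (90 negatives followed by 10 positives, an assumption chosen to exaggerate the effect). Without shuffling, plain `KFold` puts every positive in the last fold, while `StratifiedKFold` spreads them evenly.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Made-up imbalanced labels: 90 negatives followed by 10 positives
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((100, 1))  # features are irrelevant for the split itself


def positives_per_fold(cv):
    """Count the positive samples in each test fold."""
    return [int(y[test_idx].sum()) for _, test_idx in cv.split(X, y)]


plain = positives_per_fold(KFold(n_splits=5))
strat = positives_per_fold(StratifiedKFold(n_splits=5))

print("KFold positives per test fold:          ", plain)
print("StratifiedKFold positives per test fold:", strat)
```

Shuffling in `KFold` would soften the effect, but only stratification guarantees the class proportions in every fold.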

In [None]:
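from math import comb

import numpy as np

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

from sklearn.model_selection import (
    KFold,
    RepeatedKFold,
    LeaveOneOut,
    LeavePOut,
    StratifiedKFold,
    cross_validate,
    train_test_split,
)

# Load the data as pandas objects (later cells call .head()) and hold out a test set.
# Assumption: the original split is not shown; test_size=0.3 and random_state=0 are illustrative.
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)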

### K-Fold Cross-Validation


In [None]:
# Step 1: Create a Logistic Regression model
# Logistic Regression with L2 regularization, C=10, using 'liblinear' solver, fixed random state, and max iterations set to 10000
logit = LogisticRegression(
    penalty='l2', C=10, solver='liblinear', random_state=4, max_iter=10000)

# Step 2: Set up K-Fold Cross-Validation
# Creating a K-Fold object with 5 splits, shuffling the data, and setting a random state for reproducibility
kf = KFold(n_splits=5, shuffle=True, random_state=4)

# Step 3: Estimate generalization error
# Using cross-validation to evaluate the model's performance on training and test data
clf = cross_validate(
    logit,
    X_train,  # Features for training
    y_train,  # Target variable for training
    scoring='accuracy',  # Metric for evaluation
    return_train_score=True,  # Include training scores in the output
    cv=kf,  # Use K-Fold cross-validation
)

# Step 4: Retrieve test scores
# Printing the per-fold test scores from the cross-validation results
print(clf['test_score'])

# Step 5: Retrieve training scores
# Printing the per-fold training scores from the cross-validation results
print(clf['train_score'])

# Step 6: Print mean train set accuracy
# Calculating and printing the mean accuracy and standard deviation for the training set
print('mean train set accuracy: ', np.mean(clf['train_score']), ' +- ', np.std(clf['train_score']))

# Step 7: Print mean test set accuracy
# Calculating and printing the mean accuracy and standard deviation for the test set
print('mean test set accuracy: ', np.mean(clf['test_score']), ' +- ', np.std(clf['test_score']))

### Repeated K-Fold


In [None]:
# Step 1: Create a Logistic Regression model
# Logistic Regression with L2 regularization, C=1, using 'liblinear' solver, fixed random state, and max iterations set to 10000
logit = LogisticRegression(
    penalty='l2', C=1, solver='liblinear', random_state=4, max_iter=10000)

# Step 2: Set up Repeated K-Fold Cross-Validation
# Creating a Repeated K-Fold object with 5 splits, repeating 10 times, and a random state for reproducibility
rkf = RepeatedKFold(
    n_splits=5,  # Number of folds
    n_repeats=10,  # Number of repetitions
    random_state=4,  # Seed for reproducibility
)

# Step 3: Print expected number of performance metrics
# Calculating and displaying the expected number of performance metrics (5 folds * 10 repeats)
print('We expect K * n performance metrics: ', 5 * 10)

# Step 4: Estimate generalization error
# Using cross-validation to evaluate the model's performance on training and test data
clf = cross_validate(
    logit,
    X_train,  # Features for training
    y_train,  # Target variable for training
    scoring='accuracy',  # Metric for evaluation
    return_train_score=True,  # Include training scores in the output
    cv=rkf,  # Use Repeated K-Fold cross-validation
)

# Step 5: Print the number of metrics obtained
# Displaying the total number of test scores obtained from cross-validation
print('Number of metrics obtained: ', len(clf['test_score']))

# Step 6: Access test scores
# Printing the test scores from the cross-validation results
print(clf['test_score'])

# Step 7: Print mean train set accuracy
# Calculating and printing the mean accuracy and standard deviation for the training set
print('mean train set accuracy: ', np.mean(clf['train_score']), ' +- ', np.std(clf['train_score']))

# Step 8: Print mean test set accuracy
# Calculating and printing the mean accuracy and standard deviation for the test set
print('mean test set accuracy: ', np.mean(clf['test_score']), ' +- ', np.std(clf['test_score']))

### Leave One Out


In [None]:
# Step 1: Create a Logistic Regression model
# Logistic Regression with L2 regularization, C=1, using 'liblinear' solver, fixed random state, and max iterations set to 10000
logit = LogisticRegression(
    penalty='l2', C=1, solver='liblinear', random_state=4, max_iter=10000)

# Step 2: Set up Leave One Out Cross-Validation
# Creating a Leave One Out cross-validation object for evaluating the model
loo = LeaveOneOut()

# Step 3: Print expected number of metrics
# Displaying the expected number of metrics, which is equal to the number of samples in the training set
print('We expect as many metrics as data in the train set: ', len(X_train))

# Step 4: Estimate generalization error
# Using cross-validation to evaluate the model's performance on training and test data
clf = cross_validate(
    logit,
    X_train,  # Features for training
    y_train,  # Target variable for training
    scoring='accuracy',  # Metric for evaluation
    return_train_score=True,  # Include training scores in the output
    cv=loo,  # Use Leave One Out cross-validation
)

# Step 5: Print the number of metrics obtained
# Displaying the total number of test scores obtained from cross-validation
print('Number of metrics obtained: ', len(clf['test_score']))

# Step 6: Access test scores
# Printing the test scores from the cross-validation results (each is 0 or 1, since every test set is a single sample)
print(clf['test_score'])

# Step 7: Print mean train set accuracy
# Calculating and printing the mean accuracy and standard deviation for the training set
print('mean train set accuracy: ', np.mean(clf['train_score']), ' +- ', np.std(clf['train_score']))

# Step 8: Print mean test set accuracy
# Calculating and printing the mean accuracy and standard deviation for the test set
print('mean test set accuracy: ', np.mean(clf['test_score']), ' +- ', np.std(clf['test_score']))

### Leave P Out

In [None]:
# Step 1: Create a Logistic Regression model
# Logistic Regression with L2 regularization, C=1, using 'liblinear' solver, fixed random state, and max iterations set to 10000
logit = LogisticRegression(
    penalty='l2', C=1, solver='liblinear', random_state=4, max_iter=10000)

# Step 2: Set up Leave P Out Cross-Validation
# Creating a Leave P Out cross-validation object where p=2, meaning 2 samples will be left out for testing
lpo = LeavePOut(p=2)

# Step 3: Take a smaller sample of the data
# Selecting a smaller sample of 100 data points to avoid memory issues during computation
X_train_small = X_train.head(100)
y_train_small = y_train.head(100)

# Step 4: Calculate expected number of metrics
# Calculating and printing the expected number of metrics based on combinations of 100 data points taken 2 at a time
print('We expect : ', comb(100, 2), ' metrics')

# Step 5: Estimate generalization error
# Using cross-validation to evaluate the model's performance on the small sample of training data
clf = cross_validate(
    logit,
    X_train_small,  # Features for training
    y_train_small,  # Target variable for training
    scoring='accuracy',  # Metric for evaluation
    return_train_score=True,  # Include training scores in the output
    cv=lpo,  # Use Leave P Out cross-validation
)

# Step 6: Print the number of metrics obtained
# Displaying the total number of test scores obtained from cross-validation
print('Number of metrics obtained: ', len(clf['test_score']))

# Step 7: Print mean train set accuracy
# Calculating and printing the mean accuracy and standard deviation for the training set
print('mean train set accuracy: ', np.mean(clf['train_score']), ' +- ', np.std(clf['train_score']))

# Step 8: Print mean test set accuracy
# Calculating and printing the mean accuracy and standard deviation for the test set
print('mean test set accuracy: ', np.mean(clf['test_score']), ' +- ', np.std(clf['test_score']))

### Stratified K-Fold Cross-Validation


In [None]:
# Step 1: Create a Logistic Regression model
# Logistic Regression with L2 regularization, C=1, using 'liblinear' solver, fixed random state, and max iterations set to 10000
logit = LogisticRegression(
    penalty='l2', C=1, solver='liblinear', random_state=4, max_iter=10000)

# Step 2: Set up Stratified K-Fold Cross-Validation
# Creating a Stratified K-Fold cross-validation object with 5 splits, shuffling data, and a fixed random state
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=4)

# Step 3: Estimate generalization error
# Using cross-validation to evaluate the model's performance on the training data
clf = cross_validate(
    logit,
    X_train,  # Features for training
    y_train,  # Target variable for training
    scoring='accuracy',  # Metric for evaluation
    return_train_score=True,  # Include training scores in the output
    cv=skf,  # Use Stratified K-Fold cross-validation
)

# Step 4: Print the number of metrics obtained
# Displaying the total number of test scores obtained from cross-validation
print('Number of metrics obtained: ', len(clf['test_score']))

# Step 5: Print mean train set accuracy
# Calculating and printing the mean accuracy and standard deviation for the training set
print('mean train set accuracy: ', np.mean(clf['train_score']), ' +- ', np.std(clf['train_score']))

# Step 6: Print mean test set accuracy
# Calculating and printing the mean accuracy and standard deviation for the test set
print('mean test set accuracy: ', np.mean(clf['test_score']), ' +- ', np.std(clf['test_score']))

# Cross validation for hyperparameters

## K-Fold Cross-Validation

In [None]:
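import matplotlib.pyplot as plt
import pandas as pd

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

from sklearn.model_selection import (
    KFold,
    RepeatedKFold,
    LeaveOneOut,
    LeavePOut,
    StratifiedKFold,
    GridSearchCV,
    train_test_split,
)

# Load the data and hold out a test set for the final accuracy checks.
# Assumption: the original split is not shown; test_size=0.3 and random_state=0 are illustrative.
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)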

In [None]:
# Step 1: Logistic Regression model definition
# Initializes a logistic regression model with L2 regularization and specified settings
logit = LogisticRegression(
    penalty ='l2',        # Regularization type: L2
    C=1,                  # Inverse of regularization strength
    solver='liblinear',   # Optimization algorithm used
    random_state=4,       # Seed for reproducibility
    max_iter=10000        # Maximum number of iterations for convergence
)

# Step 2: Define hyperparameter space for tuning
# Creates a dictionary of hyperparameters to search through during Grid Search
param_grid = dict(
    penalty=['l1', 'l2'],  # Regularization options to explore
    C=[0.1, 1, 10],        # Values for inverse regularization strength to explore
)

# Step 3: Set up K-Fold Cross-Validation
# Sets up 5-fold cross-validation with shuffling to create training/validation splits
kf = KFold(n_splits=5, shuffle=True, random_state=4)  # 5-fold CV with shuffling

# Step 4: Set up Grid Search for hyperparameter tuning
# Configures Grid Search to tune hyperparameters using accuracy and cross-validation
clf = GridSearchCV(
    logit,                  # Model to tune
    param_grid,             # Hyperparameter space to explore
    scoring='accuracy',     # Metric to optimize
    cv=kf,                  # Cross-validation strategy
    refit=True,             # Refits the best model on all the data passed to fit()
)

# Step 5: Fit Grid Search to training data
# Trains the model using Grid Search with the training data
search = clf.fit(X_train, y_train)

# Step 6: Retrieve best hyperparameters found
# Prints the optimal hyperparameters found during Grid Search
print(search.best_params_)

# Step 7: Create a DataFrame with the results of the grid search
# Converts cross-validation results into a DataFrame for analysis
results = pd.DataFrame(search.cv_results_)[['params', 'mean_test_score', 'std_test_score']]
print(results.shape)  # Display shape of results DataFrame
print(results)  # Show results DataFrame

# Step 8: Sort results by mean test score in descending order
# Sorts the DataFrame by mean test score to find the best performing hyperparameters
results.sort_values(by='mean_test_score', ascending=False, inplace=True)

# Step 9: Reset index of the results DataFrame
# Resets the index of the DataFrame after sorting for better readability
results.reset_index(drop=True, inplace=True)

# Step 10: Plot mean test scores with error bars
# Visualizes the mean test scores with error bars showing standard deviation
results['mean_test_score'].plot(
    yerr=[results['std_test_score'], results['std_test_score']],  # Error bars for standard deviation
    subplots=True            # Create subplots for better visual clarity
)

# Step 11: Label the y-axis
# Adds a label to the y-axis of the plot indicating it represents accuracy
plt.ylabel('Mean Accuracy')

# Step 12: Label the x-axis
# Adds a label to the x-axis of the plot indicating it represents hyperparameter space
plt.xlabel('Hyperparameter space')

## Repeated K-Fold

In [None]:
# Logistic Regression
# Defines a logistic regression model with L2 regularization and specific configurations
logit = LogisticRegression(
    penalty ='l2',         # Regularization type: L2
    C=1,                   # Inverse of regularization strength
    solver='liblinear',    # Optimization algorithm used
    random_state=4,        # Seed for reproducibility
    max_iter=10000         # Maximum number of iterations for convergence
)

# hyperparameter space
# Creates a dictionary of hyperparameters for tuning the model
param_grid = dict(
    penalty=['l1', 'l2'],  # Regularization options to explore
    C=[0.1, 1, 10],        # Values for inverse regularization strength to explore
)

# Repeated K-Fold Cross-Validation
# Sets up repeated K-Fold cross-validation with 5 splits and 10 repeats
rkf = RepeatedKFold(
    n_splits=5,           # Number of splits for K-Fold
    n_repeats=10,         # Number of times to repeat the cross-validation
    random_state=4,       # Seed for reproducibility
)

# search
# Configures Grid Search with the model, hyperparameters, accuracy metric, and repeated K-Fold CV
clf = GridSearchCV(
    logit,                  # Model to tune
    param_grid,            # Hyperparameter space to explore
    scoring='accuracy',     # Metric to optimize
    cv=rkf,                # Cross-validation strategy
    refit=True,            # Refits the best model on all the data passed to fit()
)

search = clf.fit(X_train, y_train)  # Fits the model to the training data using Grid Search

# best hyperparameters
# Prints the optimal hyperparameters from Grid Search
print(search.best_params_)

# Creates a DataFrame from Grid Search results, showing hyperparameters, mean, and standard deviation of scores
results = pd.DataFrame(search.cv_results_)[['params', 'mean_test_score', 'std_test_score']]
print(results.shape)  # Displays the shape of the results DataFrame

print(results)  # Shows the results DataFrame

# Sorts the DataFrame by mean test score in descending order
results.sort_values(by='mean_test_score', ascending=False, inplace=True)

# Resets the index of the DataFrame after sorting
results.reset_index(drop=True, inplace=True)

# Plots mean test scores with error bars indicating the standard deviation
results['mean_test_score'].plot(yerr=[results['std_test_score'], results['std_test_score']], subplots=True)

# Adds a label to the y-axis representing accuracy
plt.ylabel('Mean Accuracy')

# Adds a label to the x-axis representing hyperparameter space
plt.xlabel('Hyperparameter space')

# let's get the predictions
# Generates predictions for the training and testing sets using the best model
train_preds = search.predict(X_train)
test_preds = search.predict(X_test)

# Prints the training accuracy by comparing predictions to actual labels
print('Train Accuracy: ', accuracy_score(y_train, train_preds))

# Prints the testing accuracy by comparing predictions to actual labels
print('Test Accuracy: ', accuracy_score(y_test, test_preds))

## Stratified K-Fold Cross-Validation


In [None]:
# Logistic Regression
# Defines a logistic regression model with L2 regularization and specific configurations
logit = LogisticRegression(
    penalty ='l2',         # Regularization type: L2
    C=1,                   # Inverse of regularization strength
    solver='liblinear',    # Optimization algorithm used
    random_state=4,        # Seed for reproducibility
    max_iter=10000         # Maximum number of iterations for convergence
)

# hyperparameter space
# Creates a dictionary of hyperparameters for tuning the model
param_grid = dict(
    penalty=['l1', 'l2'],  # Regularization options to explore
    C=[0.1, 1, 10],        # Values for inverse regularization strength to explore
)

# Stratified Cross-Validation
# Sets up Stratified K-Fold cross-validation with 5 splits and shuffling for stratified sampling
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=4)  # Stratified K-Fold CV

# search
# Configures Grid Search with the model, hyperparameters, accuracy metric, and Stratified K-Fold CV
clf = GridSearchCV(
    logit,                  # Model to tune
    param_grid,            # Hyperparameter space to explore
    scoring='accuracy',     # Metric to optimize
    cv=skf,                # Cross-validation strategy
    refit=True,            # Refits the best model on all the data passed to fit()
)

search = clf.fit(X_train, y_train)  # Fits the model to the training data using Grid Search

# best hyperparameters
# Prints the optimal hyperparameters from Grid Search
print(search.best_params_)

# Creates a DataFrame from Grid Search results, showing hyperparameters, mean, and standard deviation of scores
results = pd.DataFrame(search.cv_results_)[['params', 'mean_test_score', 'std_test_score']]  # Extract relevant results
print(results.shape)  # Displays the shape of the results DataFrame

print(results.head())  # Shows the first few rows of the results DataFrame

# Sorts the DataFrame by mean test score in descending order
results.sort_values(by='mean_test_score', ascending=False, inplace=True)

# Resets the index of the DataFrame after sorting
results.reset_index(drop=True, inplace=True)

# Plots mean test scores with error bars indicating the standard deviation
results['mean_test_score'].plot(yerr=[results['std_test_score'], results['std_test_score']], subplots=True)

# Adds a label to the y-axis representing accuracy
plt.ylabel('Mean Accuracy')

# Adds a label to the x-axis representing hyperparameter space
plt.xlabel('Hyperparameter space')

# let's get the predictions
# Generates predictions for the training and testing sets using the best model
train_preds = search.predict(X_train)  # Predictions on training data
test_preds = search.predict(X_test)      # Predictions on testing data

# Prints the training accuracy by comparing predictions to actual labels
print('Train Accuracy: ', accuracy_score(y_train, train_preds))

# Prints the testing accuracy by comparing predictions to actual labels
print('Test Accuracy: ', accuracy_score(y_test, test_preds))

# Group Cross Validation

**Group K-Fold Cross-Validation**
* What it is: A variation of K-Fold cross-validation where the data is divided into groups (or clusters). Each group contains related observations, and the goal is to ensure that all observations from a group are either in the training set or the validation set, but not both.
* When to use it: Use Group K-Fold when you have data that is grouped in some way (like patients in a hospital or students in a school), and you want to prevent data leakage. For example, you wouldn’t want to train and test on data from the same patient.
* Additional Example: In a study analyzing the effectiveness of a new teaching method, if you have multiple test scores from the same classrooms, Group K-Fold would ensure that all scores from a particular classroom are kept together, preventing any classroom's data from appearing in both the training and validation sets.
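
A minimal sketch of Group K-Fold with scikit-learn's `GroupKFold`. The data and the group ids (think of them as hypothetical patient or classroom ids) are made up for illustration. The check inside the loop confirms that no group appears on both sides of a split.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Made-up data: 12 observations from 4 groups (e.g. 4 hypothetical patients)
X = np.arange(12).reshape(-1, 1)
y = np.array([0, 1] * 6)
groups = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4])

gkf = GroupKFold(n_splits=2)

fold_groups = []
for train_idx, test_idx in gkf.split(X, y, groups):
    train_groups = set(groups[train_idx].tolist())
    test_groups = set(groups[test_idx].tolist())
    fold_groups.append((train_groups, test_groups))
    # A group is never split between training and validation
    assert train_groups.isdisjoint(test_groups)
    print("train groups:", sorted(train_groups), "test groups:", sorted(test_groups))
```

The same `groups` array can be passed to `cross_validate` or `GridSearchCV` through their `groups` argument when `cv` is a group-aware splitter.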

**Leave-One-Group-Out Cross-Validation (LOGO)**
* What it is: A specific case of Group K-Fold where you leave out one group at a time as the validation set while using all other groups for training. This is repeated for each group.
* When to use it: Use LOGO when you want to assess model performance while ensuring that the model never sees data from a group during training. It’s useful for small datasets or when groups represent distinct entities.
* Additional Example: In a clinical trial with multiple patients, if you want to evaluate a model predicting treatment outcomes, LOGO would leave out one patient’s data for validation while training on the remaining patients. This way, the model is tested on completely unseen patient data.
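
Leave-One-Group-Out can be sketched the same way with scikit-learn's `LeaveOneGroupOut`. The data and group ids are made up for illustration; with three groups there are three splits, each holding out exactly one group.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

# Made-up data: 9 observations from 3 groups (e.g. 3 hypothetical patients)
X = np.arange(9).reshape(-1, 1)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0])
groups = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3])

logo = LeaveOneGroupOut()
n_splits = logo.get_n_splits(X, y, groups)
print("number of splits:", n_splits)

held_out = []
for train_idx, test_idx in logo.split(X, y, groups):
    test_groups = np.unique(groups[test_idx]).tolist()
    held_out.append(test_groups)
    print("held-out group:", test_groups, "test rows:", test_idx.tolist())
```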

**Leave-P-Groups-Out Cross-Validation**
* What it is: A generalization of LOGO where you leave out p groups at a time for validation while using the remaining groups for training. This is done repeatedly for different combinations of groups.
* When to use it: Use this method when you have larger datasets and want a more robust evaluation by leaving out multiple groups at once. This helps in assessing how well the model generalizes across multiple related groups.
* Additional Example: In a study analyzing customer behavior across different stores, if you have customer purchase data grouped by store locations, you could use Leave-P-Groups-Out to leave out data from several stores for validation while training on the data from the other stores. This allows you to assess how well the model predicts customer behavior in new, unseen store locations.
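
A sketch of Leave-P-Groups-Out using scikit-learn's `LeavePGroupsOut`, again on made-up data. The group ids stand in for hypothetical store locations; with 4 groups and 2 held out at a time, there are C(4, 2) = 6 splits.

```python
import numpy as np
from sklearn.model_selection import LeavePGroupsOut

# Made-up data: 8 observations from 4 groups (e.g. 4 hypothetical stores)
X = np.arange(8).reshape(-1, 1)
y = np.array([0, 1] * 4)
groups = np.array([1, 1, 2, 2, 3, 3, 4, 4])

lpgo = LeavePGroupsOut(n_groups=2)
n_splits = lpgo.get_n_splits(X, y, groups)
print("number of splits:", n_splits)

held_out = []
for train_idx, test_idx in lpgo.split(X, y, groups):
    test_groups = np.unique(groups[test_idx]).tolist()
    held_out.append(tuple(test_groups))
    print("held-out groups:", test_groups)
```

As with Leave-P-Out on individual samples, the number of splits grows combinatorially with the number of groups, so this is practical mainly when the number of groups is modest.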

# Assumptions

**Assumption for grouped CV:**

The dataset contains observations that are grouped together, and it is assumed that data from the same group may be correlated. The goal is to prevent data leakage by ensuring that all observations from a single group are either included in the training set or the validation set, but not both.

**Assumption for non-grouped CV:**

The dataset consists of independent and identically distributed (i.i.d.) observations. It is assumed that the observations are not related to one another, allowing for random sampling without concern for data leakage.