## Early Stopping
Early stopping monitors the model's validation performance during training and halts training once that performance stops improving (for example, when the validation loss no longer decreases), which is usually a sign that the model is starting to overfit. Some model types support early stopping natively.

Stopping Condition:
If the monitored metric does not improve for a specified number of epochs (called patience), training is stopped. Many implementations can then restore the model's weights to the point where validation performance was best, although this is often an option rather than the default.

Min Delta: The minimum change in the monitored metric to qualify as an improvement.
For example, if min_delta=0.01, the metric must improve by at least 0.01 to be considered an improvement.
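
The patience and min_delta rule can be sketched in plain Python, independent of any framework. The `EarlyStopper` class and the loss values below are hypothetical and made up for illustration; the sketch assumes a metric where lower is better (such as validation loss).

```python
class EarlyStopper:
    """Track a 'lower is better' metric and signal when to stop (illustrative sketch)."""

    def __init__(self, patience=3, min_delta=0.01):
        self.patience = patience        # epochs to wait without an improvement
        self.min_delta = min_delta      # minimum decrease that counts as an improvement
        self.best_value = float("inf")
        self.best_epoch = None
        self.bad_epochs = 0

    def step(self, epoch, value):
        """Record one epoch's metric; return True if training should stop."""
        if value < self.best_value - self.min_delta:
            self.best_value = value
            self.best_epoch = epoch     # a real training loop would checkpoint weights here
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up validation losses: improving at first, then flat and rising
val_losses = [0.90, 0.70, 0.60, 0.595, 0.61, 0.62, 0.63, 0.64]

stopper = EarlyStopper(patience=3, min_delta=0.01)
stopped_at = None
for epoch, loss in enumerate(val_losses):
    if stopper.step(epoch, loss):
        stopped_at = epoch
        break

print(f"Stopped at epoch {stopped_at}; best epoch {stopper.best_epoch} "
      f"with loss {stopper.best_value:.3f}")
```

Note that the drop from 0.60 to 0.595 is smaller than min_delta, so it does not reset the patience counter; the best checkpoint stays at epoch 2 and training stops three epochs later.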


* Disadvantages of Early Stopping:

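 Risk of Underfitting:
 If patience is too low, training may stop before the model has fully learned the data. More generally, poorly chosen values for patience or min delta can lead to suboptimal results.
 Dependence on the Validation Set:
 The quality of early stopping depends on how representative the validation set is. A small or unrepresentative set gives a noisy signal, which can trigger stopping too early or too late.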


* Best Practices for Early Stopping

1) Choose the Right Metric: Use a metric that aligns with your problem (e.g., validation loss for regression, accuracy for classification).
2) Set Appropriate Patience: Start with a moderate patience value (e.g., 10-20 epochs) and adjust based on the dataset size and complexity.
3) Use a Representative Validation Set: Ensure the validation set is representative of the overall data distribution.
4) Combine with Other Regularization Techniques: Use early stopping alongside dropout, weight decay, or data augmentation for better generalization.
5) Monitor Training and Validation Curves: Plot training and validation metrics to visually inspect for overfitting and determine the best stopping point.
6) Know Where It Applies: Early stopping and the regularization techniques above work naturally with iteratively trained models, such as gradient boosting models (XGBoost, LightGBM, CatBoost) and deep learning models.
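
For practice 5, a small helper can summarize recorded curves before (or alongside) plotting them. This is a minimal sketch: the function name `summarize_curves` and the loss values are hypothetical, and it assumes one loss value per epoch with lower being better.

```python
def summarize_curves(train_losses, val_losses):
    """Return the best validation epoch and the validation-minus-training gap per epoch."""
    if len(train_losses) != len(val_losses):
        raise ValueError("Both curves need exactly one value per epoch")
    best_epoch = min(range(len(val_losses)), key=val_losses.__getitem__)
    gaps = [v - t for t, v in zip(train_losses, val_losses)]
    return best_epoch, gaps


# Made-up curves: training loss keeps falling while validation loss turns upward
train_losses = [0.80, 0.55, 0.40, 0.30, 0.22, 0.15]
val_losses = [0.85, 0.62, 0.50, 0.48, 0.52, 0.58]

best_epoch, gaps = summarize_curves(train_losses, val_losses)
print(f"Best validation epoch: {best_epoch}")
for epoch, gap in enumerate(gaps):
    marker = "  <- past the best epoch" if epoch > best_epoch else ""
    print(f"epoch {epoch}: val - train = {gap:.2f}{marker}")
```

A gap that keeps widening after the best validation epoch is the pattern to look for in the plots: training loss still improves while validation loss gets worse.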

In [None]:
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Example: Load a dataset (replace this with your actual dataset)
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
X, y = data.data, data.target

# Step 1: Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 2: Convert the data into DMatrix format (required by XGBoost)
dtrain = xgb.DMatrix(X_train, label=y_train)
dval = xgb.DMatrix(X_val, label=y_val)

# Step 3: Define the parameters for the XGBoost model
params = {
    'objective': 'binary:logistic',  # Binary classification
    'eval_metric': 'logloss',        # Metric to monitor (log loss for binary classification)
    'max_depth': 6,                  # Maximum depth of a tree
    'eta': 0.1,                      # Learning rate
    'subsample': 0.8,                # Subsample ratio of the training instances
    'colsample_bytree': 0.8,         # Subsample ratio of columns
    'reg_alpha': 0.1,  # L1 Regularization
    'reg_lambda': 0.5,  # L2 Regularization
    'seed': 42                       # Random seed for reproducibility
}

# Step 4: Train the model with early stopping
evals = [(dtrain, 'train'), (dval, 'val')]  # Evaluation sets to monitor; the last one ('val') drives early stopping
num_round = 100  # Maximum number of boosting rounds

# Train the model
model = xgb.train(
    params,
    dtrain,
    num_round,
    evals=evals,
    early_stopping_rounds=10,  # Stop if validation logloss has not improved for 10 rounds
    verbose_eval=10            # Print evaluation results every 10 rounds
)

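# Step 5: Make predictions on the validation set
# xgb.train returns the booster from the last round, not the best one, so limit
# prediction to the best iteration explicitly (newer XGBoost versions may already do this)
y_pred = model.predict(dval, iteration_range=(0, model.best_iteration + 1))
y_pred_binary = [1 if p > 0.5 else 0 for p in y_pred]  # Convert probabilities to binary predictions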

# Step 6: Evaluate the model
accuracy = accuracy_score(y_val, y_pred_binary)
print(f"Validation Accuracy: {accuracy:.4f}")

In [None]:
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score

# Load example dataset (Breast Cancer dataset)
data = load_breast_cancer()
X, y = data.data, data.target

# Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Create LightGBM datasets
train_data = lgb.Dataset(X_train, label=y_train)
val_data = lgb.Dataset(X_val, label=y_val, reference=train_data)

# Define LightGBM parameters
params = {
    'objective': 'binary',  # Binary classification
    'metric': 'binary_logloss',  # Metric to monitor (log loss for binary classification)
    'boosting_type': 'gbdt',  # Gradient Boosting Decision Tree
    'num_leaves': 31,  # Number of leaves in a tree
    'learning_rate': 0.05,  # Learning rate
    'feature_fraction': 0.9,  # Fraction of features to use for each tree
    'verbose': -1  # Suppress LightGBM logs
}

# Train the model with early stopping
num_round = 1000  # Maximum number of boosting rounds
early_stopping_rounds = 50  # Stop if no improvement for 50 rounds

model = lgb.train(
    params,
    train_data,
    num_boost_round=num_round,
    valid_sets=[val_data],
    callbacks=[
        lgb.early_stopping(early_stopping_rounds),  # Early stopping callback
        # Log evaluation every 10 rounds (replaces verbose_eval, which
        # lgb.train no longer accepts in LightGBM 4.x)
        lgb.log_evaluation(period=10),
    ],
)

# Make predictions on the validation set
y_pred = model.predict(X_val, num_iteration=model.best_iteration)
y_pred_class = [1 if p > 0.5 else 0 for p in y_pred]  # Convert probabilities to class labels

# Evaluate the model
accuracy = accuracy_score(y_val, y_pred_class)
print(f"Validation Accuracy: {accuracy:.4f}")
print(f"Best Iteration: {model.best_iteration}")

## Regularization techniques 
Regularization techniques are essential for controlling model complexity and preventing overfitting.

They work by adding a penalty term to the loss function, which discourages the model from fitting the noise in the training data. Regularization techniques like L1 (Lasso) and L2 (Ridge) penalize the magnitude of the model's coefficients.

Because the penalty acts on coefficient size, it is sensitive to feature scale. If features are on different scales, the penalty affects them unevenly: a feature measured in small units needs a large coefficient and is therefore penalized more heavily than an equally informative feature measured in large units, leading to suboptimal results. To address this, standardize or normalize your features during the preprocessing phase.
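
The effect of feature scale can be sketched with scikit-learn's `StandardScaler` placed inside a pipeline, so that the scaling statistics are learned from the training split only. The synthetic data and the choice of `Lasso(alpha=0.1)` are assumptions made for illustration; the two features contribute equally to the target but live on very different scales.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: two equally informative features on very different scales
rng = np.random.default_rng(0)
n_samples = 500
x_small = rng.normal(0.0, 0.01, n_samples)    # tiny numeric range
x_large = rng.normal(0.0, 100.0, n_samples)   # large numeric range
X_demo = np.column_stack([x_small, x_large])
y_demo = 100.0 * x_small + 0.01 * x_large + rng.normal(0.0, 0.1, n_samples)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Same penalty strength, with and without standardization
raw_model = Lasso(alpha=0.1).fit(X_tr, y_tr)
scaled_model = make_pipeline(StandardScaler(), Lasso(alpha=0.1)).fit(X_tr, y_tr)

raw_coefs = raw_model.coef_
scaled_coefs = scaled_model.named_steps["lasso"].coef_  # coefficients in standardized units

print("Raw coefficients:   ", raw_coefs)
print("Scaled coefficients:", scaled_coefs)
print("R^2 raw:   ", raw_model.score(X_te, y_te))
print("R^2 scaled:", scaled_model.score(X_te, y_te))
```

In this setup the unscaled model should drop the small-scale feature entirely, while the standardized pipeline keeps both features with similar coefficients.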

Regularization Strength (alpha): Controls the overall impact of the penalty term. Higher values of alpha increase regularization, which reduces overfitting but can cause underfitting if pushed too far.
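
For reference, the penalized objectives can be written out explicitly. The forms below follow the scikit-learn documentation for `Lasso`, `Ridge`, and `ElasticNet`, with $n$ samples, coefficient vector $w$, strength $\alpha$, and mixing parameter $\rho$ (the `l1_ratio` argument); other libraries may scale the terms differently.

$$
\text{Lasso:}\quad \min_w \; \frac{1}{2n}\lVert y - Xw\rVert_2^2 + \alpha \lVert w\rVert_1
$$

$$
\text{Ridge:}\quad \min_w \; \lVert y - Xw\rVert_2^2 + \alpha \lVert w\rVert_2^2
$$

$$
\text{ElasticNet:}\quad \min_w \; \frac{1}{2n}\lVert y - Xw\rVert_2^2 + \alpha\rho\lVert w\rVert_1 + \frac{\alpha(1-\rho)}{2}\lVert w\rVert_2^2
$$

Because the data-fit term in Ridge is not divided by $2n$, the same numeric value of alpha does not mean the same amount of regularization across these three estimators.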



#### L1 Regularization (Lasso)
Encourages sparsity by shrinking some coefficients to zero, effectively performing feature selection.
Use Case: When you suspect that only a subset of features is important.

In [None]:
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize Lasso model
lasso = Lasso(alpha=0.1)  # alpha is the regularization strength (lambda)

# Train the model
lasso.fit(X_train, y_train)

# Evaluate the model
y_pred = lasso.predict(X_test)
print("MSE:", mean_squared_error(y_test, y_pred))

# View coefficients (some may be zero)
print("Coefficients:", lasso.coef_)
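
To make the sparsity effect visible, the sketch below counts non-zero coefficients as alpha grows. It uses synthetic data from `make_regression` rather than the notebook's dataset, and the alpha values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic regression problem: 20 features, only 5 of which carry signal
X_syn, y_syn = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=10.0, random_state=0
)

alphas = [0.01, 1.0, 10.0, 1e6]
nonzero_counts = []
for alpha in alphas:
    lasso_demo = Lasso(alpha=alpha, max_iter=10_000).fit(X_syn, y_syn)
    nonzero_counts.append(int(np.count_nonzero(lasso_demo.coef_)))

for alpha, count in zip(alphas, nonzero_counts):
    print(f"alpha={alpha}: {count} non-zero coefficients out of {X_syn.shape[1]}")
```

With a very small alpha most coefficients survive; with a very large alpha every coefficient is driven to zero and the model predicts only the intercept.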

#### L2 Regularization (Ridge)
Shrinks all coefficients but does not set them to zero, resulting in smaller, non-zero values.
Use Case: When all features are potentially relevant, but you want to prevent overfitting.

In [None]:
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize Ridge model
ridge = Ridge(alpha=1.0)  # alpha is the regularization strength (lambda)

# Train the model
ridge.fit(X_train, y_train)

# Evaluate the model
y_pred = ridge.predict(X_test)
print("MSE:", mean_squared_error(y_test, y_pred))

# View coefficients (all are non-zero but smaller)
print("Coefficients:", ridge.coef_)
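
The shrinkage can also be seen directly from the closed-form ridge solution without an intercept, $w = (X^\top X + \alpha I)^{-1} X^\top y$. The sketch below evaluates it with NumPy on made-up data; the helper name `ridge_closed_form` and the "true" weights are hypothetical.

```python
import numpy as np


def ridge_closed_form(X, y, alpha):
    """Solve (X^T X + alpha * I) w = X^T y; no intercept term is fitted."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)


# Made-up linear data with known weights plus noise
rng = np.random.default_rng(0)
X_r = rng.normal(size=(100, 5))
true_w = np.array([3.0, -2.0, 0.5, 0.0, 1.0])
y_r = X_r @ true_w + rng.normal(scale=0.5, size=100)

alphas = [0.0, 1.0, 10.0, 100.0, 1000.0]
norms = []
for alpha in alphas:
    w = ridge_closed_form(X_r, y_r, alpha)
    norms.append(float(np.linalg.norm(w)))
    print(f"alpha={alpha}: ||w|| = {norms[-1]:.3f}, "
          f"non-zero coefficients = {np.count_nonzero(w)}")
```

The coefficient norm falls steadily as alpha increases, yet the coefficients stay non-zero, which is the contrast with Lasso.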

#### ElasticNet Regularization
Combines the L1 and L2 penalties, so it can set some coefficients to zero (like Lasso) while shrinking others (like Ridge).
Use Case: When you want a balance between feature selection and preventing overfitting.

In [None]:
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize ElasticNet model
elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5)  # l1_ratio controls the mix of L1 and L2

# Train the model
elastic_net.fit(X_train, y_train)

# Evaluate the model
y_pred = elastic_net.predict(X_test)
print("MSE:", mean_squared_error(y_test, y_pred))

# View coefficients (some may be zero, others shrunk)
print("Coefficients:", elastic_net.coef_)
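
To illustrate what `l1_ratio` does, the sketch below checks that `l1_ratio=1.0` reproduces Lasso and then prints how the fit changes for a few mixes. The synthetic data and the alpha and `l1_ratio` values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso

X_syn, y_syn = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=10.0, random_state=0
)

# With l1_ratio=1.0 only the L1 term remains, so the fit should match Lasso
enet_l1 = ElasticNet(alpha=1.0, l1_ratio=1.0).fit(X_syn, y_syn)
lasso_ref = Lasso(alpha=1.0).fit(X_syn, y_syn)
same_as_lasso = bool(np.allclose(enet_l1.coef_, lasso_ref.coef_))
print("Matches Lasso at l1_ratio=1.0:", same_as_lasso)

# Lower l1_ratio values shift weight from the L1 term to the L2 term
for l1_ratio in [0.1, 0.5, 0.9, 1.0]:
    enet = ElasticNet(alpha=1.0, l1_ratio=l1_ratio).fit(X_syn, y_syn)
    print(f"l1_ratio={l1_ratio}: {np.count_nonzero(enet.coef_)} non-zero, "
          f"||w||_2 = {np.linalg.norm(enet.coef_):.2f}")
```

How many coefficients end up at zero for intermediate mixes depends on the data, so the printed counts are best read as an exploration aid rather than a rule.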