### Model Selection

### Overview

This notebook focuses on training a `VotingClassifier`, an ensemble machine learning model that combines predictions from multiple individual classifiers to improve overall performance. The data used for training both the individual classifiers and the ensemble model is sourced from the `data/processed` directory. The main steps involved in this notebook are:

- **Model Selection**: Choose multiple classifiers (e.g., KNN, Logistic Regression, SVM, XGBoost) to be part of the ensemble.
- **Hyperparameter Tuning**: Perform hyperparameter optimization for each classifier to identify the best configuration for the given dataset.
- **Ensemble Creation**: Combine the tuned classifiers into a `VotingClassifier` ensemble, which aggregates the predictions of the individual models through majority voting (or soft voting based on predicted probabilities).
- **Model Training**: Train the `VotingClassifier` on the processed data and evaluate its performance.
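
The difference between the two voting modes in the list above comes down to simple arithmetic. The sketch below is not the scikit-learn implementation: the helper names `hard_vote` and `soft_vote` and the probabilities are made up for illustration, and all classifiers are assumed to carry equal weight.

```python
from collections import Counter
from statistics import mean


def hard_vote(probs, threshold=0.5):
    """Majority vote over each classifier's thresholded class label."""
    labels = [int(p >= threshold) for p in probs]
    return Counter(labels).most_common(1)[0][0]


def soft_vote(probs, threshold=0.5):
    """Average the class-1 probabilities, then threshold the mean."""
    return int(mean(probs) >= threshold)


# Hypothetical class-1 probabilities from three classifiers for one match
probs = [0.45, 0.45, 0.90]
print("hard vote:", hard_vote(probs))  # two of three classifiers say class 0
print("soft vote:", soft_vote(probs))  # the mean probability is 0.6, so class 1
```

A single confident classifier can flip a soft vote, which is why soft voting only makes sense when every member produces usable probabilities.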

### Prerequisites

To run this notebook, you need to set up a conda environment with all required dependencies.

Example setup:
```bash
cd path/to/conda/dir
conda env create -f aifootball_predictions.yaml
conda activate aifootball_predictions
python -m ipykernel install --user --name aifootball_predictions --display-name "aifootball_predictions"
```


In [2]:
# Import the necessary libraries
import numpy as np
import pandas as pd
import os
from sklearn.model_selection import train_test_split
from sklearn.experimental import enable_halving_search_cv # noqa
from sklearn.model_selection import GridSearchCV, cross_val_score, KFold, HalvingGridSearchCV
from sklearn.metrics import make_scorer, accuracy_score, precision_score, f1_score, roc_auc_score
from sklearn.pipeline import Pipeline
import xgboost as xgb
import tensorflow as tf
from tensorflow.python.client import device_lib
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout, BatchNormalization
from tensorflow.keras.optimizers import Adam
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, HistGradientBoostingClassifier, VotingClassifier
from skopt import BayesSearchCV
from skopt.space import Real, Integer, Categorical

#### GPU checks

In [2]:
# Check the available devices
print(device_lib.list_local_devices())

[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 905076402564013371
xla_global_id: -1
, name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 1734606848
locality {
  bus_id: 1
  links {
  }
}
incarnation: 11367158361600899005
physical_device_desc: "device: 0, name: NVIDIA GeForce RTX 3050 Laptop GPU, pci bus id: 0000:01:00.0, compute capability: 8.6"
xla_global_id: 416903419
]


In [3]:
# Check if TensorFlow is built with CUDA support
tf.test.is_built_with_cuda()

True

In [4]:
# Check the available GPUs
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))

Num GPUs Available:  1


In [5]:
print(tf.config.list_physical_devices('GPU'))

[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]


#### Model Selection

In [4]:
# read the data
uk_data = pd.read_csv('../data/processed/E0_merged_preprocessed.csv')

In [7]:
# show the first 5 rows of the data
uk_data.head()

| Unnamed: 0 | Date | Div | Time | HomeTeam | AwayTeam | FTR | HTR | Referee | Season | Last5HomeOver2.5Perc | ... | HomeOver2.5Perc | AvgLast5AwayGoalsConceded | AvgLast5HomeGoalsScored | AwayOver2.5Perc | AvgLast5HomeGoalsConceded | AvgLast5AwayGoalsScored | B365C<2.5 | MaxC>2.5 | HR | Over2.5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2022-08-05 | E0 | 20:00 | Crystal Palace | Arsenal | A | A | A Taylor | 2022/2023 | 0.0 | ... | 42.11 | 0.0 | 0.0 | 47.37 | 2.0 | 2.0 | 1.72 | 2.19 | 0 | 0 |
| 1 | 2022-08-20 | E0 | 17:30 | Bournemouth | Arsenal | A | A | C Pawson | 2022/2023 | 50.0 | ... | 47.37 | 0.0 | 1.0 | 47.37 | 1.5 | 2.5 | 2.1 | 1.9 | 0 | 1 |
| 2 | 2022-09-04 | E0 | 16:30 | Man United | Arsenal | H | H | P Tierney | 2022/2023 | 100.0 | ... | 57.89 | 1.0 | 2.0 | 47.37 | 1.33 | 2.0 | 2.1 | 1.82 | 0 | 1 |
| 3 | 2022-09-18 | E0 | 12:00 | Brentford | Arsenal | A | A | D Coote | 2022/2023 | 75.0 | ... | 47.37 | 0.75 | 2.5 | 47.37 | 1.5 | 2.25 | 2.1 | 1.81 | 0 | 1 |
| 4 | 2022-10-16 | E0 | 14:00 | Leeds | Arsenal | A | A | C Kavanagh | 2022/2023 | 40.0 | ... | 63.16 | 0.6 | 1.2 | 47.37 | 0.6 | 2.0 | 2.3 | 1.64 | 0 | 0 |


In [5]:
# select the target variable
y = uk_data['Over2.5'].values

# Select only numerical columns for X, excluding 'Date' and the target variable 'Over2.5'
numerical_columns = uk_data.select_dtypes(include=['number']).columns
X = uk_data[numerical_columns].drop(columns=['Over2.5']).values

In [9]:
numerical_columns

Index(['Last5HomeOver2.5Perc', 'Last5AwayOver2.5Perc', 'HST', 'AST',
       'HomeOver2.5Perc', 'AvgLast5AwayGoalsConceded',
       'AvgLast5HomeGoalsScored', 'AwayOver2.5Perc',
       'AvgLast5HomeGoalsConceded', 'AvgLast5AwayGoalsScored', 'B365C<2.5',
       'MaxC>2.5', 'HR', 'Over2.5'],
      dtype='object')

In [10]:
y

array([0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1,
       1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1,
       1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1,
       1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0,
       1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1,
       1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0,
       1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0,
       1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0,
       0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0,
       0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1,
       1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0,
       0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1,
       0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0,

In [11]:
X

array([[  0.  ,   0.  ,   2.  , ...,   1.72,   2.19,   0.  ],
       [ 50.  ,  50.  ,   1.  , ...,   2.1 ,   1.9 ,   0.  ],
       [100.  ,  66.67,   6.  , ...,   2.1 ,   1.82,   0.  ],
       ...,
       [ 40.  ,  60.  ,   9.  , ...,   1.95,   2.  ,   0.  ],
       [ 80.  ,  60.  ,  12.  , ...,   4.  ,   1.28,   0.  ],
       [ 60.  ,  40.  ,  14.  , ...,   5.  ,   1.19,   0.  ]])

### Nested Cross Validation

In [13]:
def create_dnn_model(input_dim: int, dropout_rate: float = 0.5) -> tf.keras.Model:
    """
    Creates a Deep Neural Network (DNN) model for binary classification.

    Parameters:
    ----------
    input_dim : int
        The number of input features (dimensions).
    dropout_rate : float, optional
        The dropout rate to be used in Dropout layers to prevent overfitting (default is 0.5).

    Returns:
    -------
    tf.keras.Model
        A compiled DNN model ready for training.
    """
    model = Sequential()

    # Input layer
    model.add(Dense(128, activation='relu', input_dim=input_dim))
    model.add(BatchNormalization())
    model.add(Dropout(dropout_rate))

    # Hidden layers
    model.add(Dense(64, activation='relu'))
    model.add(BatchNormalization())
    model.add(Dropout(dropout_rate))

    model.add(Dense(32, activation='relu'))
    model.add(BatchNormalization())
    model.add(Dropout(dropout_rate))

    # Output layer
    model.add(Dense(1, activation='sigmoid'))

    # Compile the model
    model.compile(optimizer=Adam(learning_rate=0.001),
                  loss='binary_crossentropy',
                  metrics=['accuracy'])

    return model


In [52]:
def create_lstm_model(input_shape: tuple, dropout_rate: float = 0.5) -> tf.keras.Model:
    """
    Creates an LSTM model for binary classification.

    Parameters:
    ----------
    input_shape : tuple
        The shape of the input data (timesteps, features).
    dropout_rate : float, optional
        The dropout rate to be used in Dropout layers to prevent overfitting (default is 0.5).

    Returns:
    -------
    tf.keras.Model
        A compiled LSTM model ready for training.
    """
    model = Sequential()

    # Stacked LSTM layers: the first returns the full sequence so the second can consume it
    model.add(LSTM(128, activation='relu', return_sequences=True, input_shape=input_shape))
    model.add(BatchNormalization())
    model.add(Dropout(dropout_rate))

    model.add(LSTM(64, activation='relu'))
    model.add(BatchNormalization())
    model.add(Dropout(dropout_rate))

    # Dense hidden layer
    model.add(Dense(32, activation='relu'))
    model.add(BatchNormalization())
    model.add(Dropout(dropout_rate))

    # Output layer
    model.add(Dense(1, activation='sigmoid'))

    # Compile the model
    model.compile(optimizer=Adam(learning_rate=0.001),
                  loss='binary_crossentropy',
                  metrics=['accuracy'])

    return model
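
`create_lstm_model` expects each sample to be shaped `(timesteps, features)`, whereas the feature matrix used in this notebook is two-dimensional. One possible way to bridge the gap, shown here purely as an assumption since the notebook does not define it, is to stack consecutive rows into sliding windows. The helper `make_windows`, the window length, and the demo arrays are hypothetical.

```python
import numpy as np


def make_windows(features, targets, timesteps):
    """Stack `timesteps` consecutive rows into one sample, labelled with the last row's target."""
    n_windows = len(features) - timesteps + 1
    X_seq = np.stack([features[i:i + timesteps] for i in range(n_windows)])
    y_seq = targets[timesteps - 1:]
    return X_seq, y_seq


# Toy data: 10 rows with 2 features each
demo_X = np.arange(20, dtype=float).reshape(10, 2)
demo_y = np.array([0, 1] * 5)

X_seq, y_seq = make_windows(demo_X, demo_y, timesteps=3)
print(X_seq.shape, y_seq.shape)  # (8, 3, 2) (8,)
```

For this to be meaningful on match data, the rows would need to be ordered chronologically (and probably grouped per team) before windowing; otherwise a window mixes unrelated matches.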


In [68]:
# Logistic Regression Model and Hyperparameters
lr_model = LogisticRegression(solver='liblinear')
lr_param_grid = {
    'C': [0.01, 0.1, 1, 10],
    'penalty': ['l1', 'l2'],
    'max_iter': [2000, 3000]  # Add max_iter as a hyperparameter to tune
}

# K-Nearest Neighbors Model and Hyperparameters
knn_model = KNeighborsClassifier()
knn_param_grid = {
    'n_neighbors': [3, 5, 7, 9],
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan']
}

# Support Vector Machine Model and Hyperparameters
svm_model = SVC(probability=True)
svm_param_grid = {
    'C': [0.1, 1, 10],
    'kernel': ['linear', 'rbf', 'poly'],
    'gamma': ['scale', 'auto'],
    'degree': [2, 3, 4, 5],
    'class_weight': [None, 'balanced']

}

# Random Forest Model and Hyperparameters
rf_model = RandomForestClassifier(random_state=42)

rf_param_grid = {
    'n_estimators': [50, 100, 200],  # Number of trees in the forest
    'max_depth': [3, 5, 7, 9],  # Maximum depth of the tree (None means nodes are expanded until all leaves are pure)
    #'min_samples_split': [2, 5, 10],  # Minimum number of samples required to split an internal node
    #'min_samples_leaf': [1, 2, 4],  # Minimum number of samples required to be at a leaf node
    #'max_features': ['auto', 'sqrt', 'log2'],  # Number of features to consider when looking for the best split
    'bootstrap': [True],  # Whether bootstrap samples are used when building trees
    #'class_weight': [None, 'balanced', 'balanced_subsample']  # Weighing of classes in case of class imbalance
}

# XGBoost Model and Hyperparameters
xgb_model = xgb.XGBClassifier(tree_method="hist",
                              eval_metric='logloss',
                              device="cuda",  # Use GPU for training
                              )
xgb_param_grid = {
    'n_estimators': [50, 100, 150, 200],
    'max_depth': [3, 5, 7, 9],
    'learning_rate': [0.01, 0.1, 0.2],  # Learning rate
}

# HistGradientBoostingClassifier Model
hgb_model = HistGradientBoostingClassifier(random_state=42)

# Hyperparameter grid for HistGradientBoostingClassifier
hgb_param_grid = {
    'learning_rate': [0.01, 0.1, 0.2],  # Learning rate
    'max_iter': [100, 200, 300],  # Number of boosting iterations
    'max_depth': [3, 5, 7],  # Maximum depth of the tree
    #'min_samples_leaf': [10, 20, 30],  # Minimum number of samples required to be at a leaf node
    'l2_regularization': [0.0, 0.1, 0.5],  # L2 regularization strength
    #'max_bins': [255, 511],  # Maximum number of bins used for discretizing features
    'early_stopping': [True]
}

In [69]:
# Define scoring metrics
accuracy_scorer = make_scorer(accuracy_score)
precision_scorer = make_scorer(precision_score)
f1_scorer = make_scorer(f1_score)
roc_auc_scorer = make_scorer(roc_auc_score, greater_is_better=True, response_method='predict_proba')
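
For reference, the accuracy, precision, and F1 scorers defined above reduce to simple ratios of confusion-matrix counts. The sketch below recomputes them in plain Python on made-up labels; `binary_metrics` is a hypothetical helper, not part of scikit-learn.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for 0/1 labels from confusion counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels: 3 true positives, 1 false negative, 1 false positive, 3 true negatives
demo_true = [1, 1, 1, 0, 0, 0, 1, 0]
demo_pred = [1, 0, 1, 0, 1, 0, 1, 0]
metrics = binary_metrics(demo_true, demo_pred)
print(metrics)
```

With these counts every metric comes out at 0.75, which makes the example easy to verify by hand.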

In [70]:
# 10-fold cross-validation
cv = KFold(n_splits=10, shuffle=True, random_state=42)

# Combine the models and hyperparameters into a dictionary
models = {
    'XGBoost': (xgb_model, xgb_param_grid),
    'HistGradientBoosting': (hgb_model, hgb_param_grid),
    #'LSTM': (lstm_model, lstm_param_grid),
    #'Neural Network': (dnn_model, dnn_param_grid),
    'Logistic Regression': (lr_model, lr_param_grid),
    'KNN': (knn_model, knn_param_grid),
    'SVM': (svm_model, svm_param_grid),
    'Random Forest': (rf_model, rf_param_grid),
}

results = {}
best_params = {}

for model_name, (model, param_grid) in models.items():
    print(f"Evaluating {model_name}...")
    
    # Initialize HalvingGridSearchCV with the inner cross-validation and hyperparameter grid
    grid_search = HalvingGridSearchCV(estimator=model, param_grid=param_grid, cv=cv, scoring=accuracy_scorer, verbose=0)
        
    # Fit the grid search on the whole dataset to get the best parameters
    grid_search.fit(X, y)

    # Re-score the selected estimator on the same folds used for tuning (tends to be optimistic)
    cv_score = cross_val_score(grid_search.best_estimator_, X, y, cv=cv, scoring=accuracy_scorer)
    
    # Store the results and best parameters
    results[model_name] = cv_score
    best_params[model_name] = grid_search.best_params_
    
    print(f"{model_name} - {accuracy_scorer._score_func.__name__}: {np.mean(cv_score):.4f} ± {np.std(cv_score):.4f}")
    print(f"Best parameters for {model_name}: {grid_search.best_params_}")

Evaluating XGBoost...


KeyboardInterrupt: 
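
The loop above tunes each model on the full dataset and then re-scores the selected estimator on the same folds, so the reported accuracy can be optimistic. A minimal sketch of a nested scheme, in which the search object itself is passed to `cross_val_score` so that tuning happens inside every outer training fold, is given below. It assumes synthetic data from `make_classification` and a small logistic-regression grid, and uses plain `GridSearchCV` instead of the halving search to keep the example short.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

# Synthetic stand-in for the notebook's X and y
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=42)

inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)  # used for tuning
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)  # used for evaluation

search = GridSearchCV(
    estimator=LogisticRegression(solver="liblinear"),
    param_grid={"C": [0.1, 1, 10]},
    cv=inner_cv,
    scoring="accuracy",
)

# Each outer fold refits the whole search on its training part only
nested_scores = cross_val_score(search, X_demo, y_demo, cv=outer_cv, scoring="accuracy")
print(f"Nested CV accuracy: {nested_scores.mean():.4f} ± {nested_scores.std():.4f}")
```

The outer scores estimate how well the tuning procedure generalizes; a final model would still be obtained by fitting the search once on all available data.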

In [64]:
# Show the best parameters for each model
for model_name, params in best_params.items():
    print(f"Best parameters for {model_name}: {params}")

Best parameters for XGBoost: {'learning_rate': 0.2, 'max_depth': 3, 'n_estimators': 150}
Best parameters for HistGradientBoosting: {'early_stopping': True, 'l2_regularization': 0.1, 'learning_rate': 0.1, 'max_depth': 3, 'max_iter': 100}
Best parameters for Logistic Regression: {'C': 10, 'max_iter': 2000, 'penalty': 'l2'}
Best parameters for KNN: {'metric': 'manhattan', 'n_neighbors': 7, 'weights': 'uniform'}
Best parameters for SVM: {'C': 0.1, 'class_weight': None, 'degree': 2, 'gamma': 'scale', 'kernel': 'linear'}
Best parameters for Random Forest: {'bootstrap': True, 'max_depth': 9, 'n_estimators': 100}


In [65]:
# Compare models
print(f"\nModel Comparison {accuracy_scorer._score_func.__name__}:")
for model_name, scores in results.items():
    print(f"{model_name}: {np.mean(scores):.4f} ± {np.std(scores):.4f}")


Model Comparison accuracy_score:
XGBoost: 0.7711 ± 0.0417
HistGradientBoosting: 0.8013 ± 0.0438
Logistic Regression: 0.8053 ± 0.0407
KNN: 0.7750 ± 0.0401
SVM: 0.8132 ± 0.0347
Random Forest: 0.7974 ± 0.0409


- **First run** (hyperparameter selection)
  - Best parameters:
    - XGBoost: `{'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 50}`
    - Logistic Regression: `{'C': 0.1, 'penalty': 'l2'}`
    - KNN: `{'metric': 'manhattan', 'n_neighbors': 9, 'weights': 'distance'}`
    - SVM: `{'C': 0.1, 'gamma': 'scale', 'kernel': 'linear'}`
  - Model comparison (`accuracy_score`):
    - XGBoost: 0.7740 ± 0.0378
    - Logistic Regression: 0.8125 ± 0.0196
    - KNN: 0.7634 ± 0.0317
    - SVM: 0.8058 ± 0.0308

- **Second run** (hyperparameter selection)
  - Best parameters:
    - XGBoost: `{'colsample_bytree': 0.6, 'learning_rate': 0.2, 'max_depth': 3, 'n_estimators': 200}`
    - Logistic Regression: `{'C': 0.1, 'penalty': 'l2', 'solver': 'liblinear'}`
    - KNN: `{'metric': 'manhattan', 'n_neighbors': 9, 'weights': 'distance'}`
    - SVM: `{'C': 10, 'class_weight': 'balanced', 'degree': 2, 'gamma': 'scale', 'kernel': 'rbf'}`
  - Model comparison (`accuracy_score`):
    - XGBoost: 0.7740 ± 0.0267
    - Logistic Regression: 0.7966 ± 0.0281
    - KNN: 0.7713 ± 0.0328
    - SVM: 0.8125 ± 0.0233

- **Third run**
  - Best parameters:
    - XGBoost: `{'max_depth': 3, 'n_estimators': 50}`
    - Logistic Regression: `{'C': 1, 'penalty': 'l1', 'solver': 'liblinear'}`
    - KNN: `{'metric': 'manhattan', 'n_neighbors': 5, 'weights': 'uniform'}`
    - SVM: `{'C': 10, 'class_weight': None, 'degree': 3, 'gamma': 'scale', 'kernel': 'poly'}`
    - Random Forest: `{'max_depth': 7, 'n_estimators': 200}`
  - Model comparison (`accuracy_score`):
    - XGBoost: 0.7674 ± 0.0411
    - Logistic Regression: 0.8125 ± 0.0218
    - KNN: 0.7514 ± 0.0286
    - SVM: 0.8125 ± 0.0306
    - Random Forest: 0.7699 ± 0.0291

- **Fourth run** (feature scaling + maximum-variance feature selected in clustering)
  - Model comparison (`accuracy_score`):
    - XGBoost: 0.5585 ± 0.0375
    - Logistic Regression: 0.5851 ± 0.0096
    - KNN: 0.5492 ± 0.0492
    - SVM: 0.5664 ± 0.0405
    - Random Forest: 0.5864 ± 0.0057

### Ensemble Learning

In [67]:
# Initialize the models with the best hyperparameters
best_lr_model = LogisticRegression(**best_params['Logistic Regression'])
best_knn_model = KNeighborsClassifier(**best_params['KNN'])
best_svm_model = SVC(**best_params['SVM'], probability=True)
best_rf_model = RandomForestClassifier(**best_params['Random Forest'])
best_xgb_model = xgb.XGBClassifier(**best_params['XGBoost'])
best_hgb_model = HistGradientBoostingClassifier(**best_params['HistGradientBoosting'])

# Combine the models into a voting classifier
voting_clf = VotingClassifier(estimators=[
    ('lr', best_lr_model),
    ('knn', best_knn_model),
    ('svm', best_svm_model),
    ('rf', best_rf_model),
    ('xgb', best_xgb_model),
    ('hgb', best_hgb_model)
], voting='soft')  # 'soft' for probability-based voting, 'hard' for majority voting

# Fit the voting classifier
voting_clf.fit(X, y)

# Evaluate the ensemble using cross-validation
cv_scores = cross_val_score(voting_clf, X, y, cv=10, scoring=accuracy_scorer)
print(f"Voting Classifier - {accuracy_scorer._score_func.__name__}: {np.mean(cv_scores):.4f} ± {np.std(cv_scores):.4f}")

Voting Classifier - accuracy_score: 0.8092 ± 0.0429
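
Re-instantiating each estimator from `best_params` alone drops any settings that were fixed on the base model but not searched over, such as `solver='liblinear'` or `random_state=42`. One way to carry them along, sketched here as an alternative rather than as what the notebook does, is to clone the base estimator and apply the tuned parameters on top. The parameter values below are made up for illustration.

```python
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression

# Base model with a setting that is not part of the search grid
base_lr = LogisticRegression(solver="liblinear")

# Hypothetical search result (stands in for best_params['Logistic Regression'])
tuned_params = {"C": 10, "penalty": "l1", "max_iter": 2000}

# clone() copies the constructor settings; set_params() overlays the tuned values
tuned_lr = clone(base_lr).set_params(**tuned_params)

print(tuned_lr.get_params()["solver"])  # the fixed solver is preserved
print(tuned_lr.get_params()["C"])       # the tuned value is applied
```

This matters most when the tuned value depends on the fixed one: an `'l1'` penalty, for instance, is not supported by the default `lbfgs` solver.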


### Bayesian Search for Hyperparameter Tuning

In [2]:
import warnings
from sklearn.exceptions import ConvergenceWarning

# Suppress the ConvergenceWarning
warnings.filterwarnings("ignore", category=ConvergenceWarning)

In [6]:
# Define models and hyperparameters with Bayesian Optimization
lr_model = LogisticRegression()
lr_param_space = {
    'C': Real(0.01, 10, prior='log-uniform'),
    'penalty': Categorical(['l1', 'l2']),
    'solver': Categorical(['liblinear', 'saga']),
    'max_iter': Integer(2000, 3000)
}

knn_model = KNeighborsClassifier()
knn_param_space = {
    'n_neighbors': Integer(3, 9),
    'weights': Categorical(['uniform', 'distance']),
    'metric': Categorical(['euclidean', 'manhattan'])
}

svm_model = SVC(probability=True)
svm_param_space = {
    'C': Real(0.1, 10, prior='log-uniform'),
    'kernel': Categorical(['linear', 'rbf', 'poly']),
    'gamma': Categorical(['scale', 'auto']),
    'degree': Integer(2, 5),
    'class_weight': Categorical([None, 'balanced'])
}

rf_model = RandomForestClassifier(random_state=42)
rf_param_space = {
    'n_estimators': Integer(50, 200),
    'max_depth': Integer(3, 9),
    'bootstrap': Categorical([True])
}

xgb_model = xgb.XGBClassifier(tree_method="hist", eval_metric='logloss')
xgb_param_space = {
    'n_estimators': Integer(50, 200),
    'max_depth': Integer(3, 9),
    'learning_rate': Real(0.01, 0.2, prior='log-uniform')
}

hgb_model = HistGradientBoostingClassifier(random_state=42)
hgb_param_space = {
    'learning_rate': Real(0.01, 0.2, prior='log-uniform'),
    'max_iter': Integer(100, 300),
    'max_depth': Integer(3, 7),
    'l2_regularization': Real(0.0, 0.5, prior='uniform'),
    'early_stopping': Categorical([True])
}

# Define scoring metrics
accuracy_scorer = make_scorer(accuracy_score)
precision_scorer = make_scorer(precision_score)
f1_scorer = make_scorer(f1_score)
roc_auc_scorer = make_scorer(roc_auc_score, greater_is_better=True, response_method='predict_proba')
scorer = accuracy_scorer  # metric used for both tuning and evaluation

# 10-fold cross-validation
cv = KFold(n_splits=10, shuffle=True, random_state=42)

# Combine the models and hyperparameters into a dictionary
models = {
    'Logistic Regression': (lr_model, lr_param_space),
    'KNN': (knn_model, knn_param_space),
    'SVM': (svm_model, svm_param_space),
    'Random Forest': (rf_model, rf_param_space),
    'XGBoost': (xgb_model, xgb_param_space),
    'HistGradientBoosting': (hgb_model, hgb_param_space),
}

results = {}
best_params = {}

for model_name, (model, param_space) in models.items():
    print(f"Evaluating {model_name}...")

    # Initialize BayesSearchCV with cross-validation and parameter space
    bayes_search = BayesSearchCV(estimator=model, search_spaces=param_space, cv=cv, scoring=scorer, n_iter=50, random_state=42, n_jobs=-1, verbose=0)

    # Fit the Bayesian search on the whole dataset to get the best parameters
    bayes_search.fit(X, y)

    # Re-score the selected estimator on the same folds used for tuning (tends to be optimistic)
    cv_score = cross_val_score(bayes_search.best_estimator_, X, y, cv=cv, scoring=scorer)

    # Store the results and best parameters
    results[model_name] = cv_score
    best_params[model_name] = bayes_search.best_params_

    print(f"{model_name} - {scorer._score_func.__name__}: {np.mean(cv_score):.4f} ± {np.std(cv_score):.4f}")
    print(f"Best parameters for {model_name}: {bayes_search.best_params_}")

Evaluating Logistic Regression...


KeyboardInterrupt: 
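
Several of the search spaces above use `prior='log-uniform'`, which spreads candidates evenly across orders of magnitude rather than evenly across the raw range. The sketch below illustrates that idea in plain Python for the range used for `C` in the logistic regression space; it is not skopt's sampler, and `sample_log_uniform` is a hypothetical helper.

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(42)
samples = [sample_log_uniform(0.01, 10, rng) for _ in range(10_000)]

# The range 0.01..10 spans three decades, two of which lie below 1,
# so roughly two thirds of the draws should fall below 1.
share_below_one = sum(s < 1 for s in samples) / len(samples)
print(f"Share of samples below 1: {share_below_one:.2f}")
```

Under a plain uniform prior on the same range, only about 10% of draws would fall below 1, so small regularization strengths would rarely be explored.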

In [None]:
# Initialize the models with the best hyperparameters
best_lr_model = LogisticRegression(**best_params['Logistic Regression'])
best_knn_model = KNeighborsClassifier(**best_params['KNN'])
best_svm_model = SVC(**best_params['SVM'], probability=True)
best_rf_model = RandomForestClassifier(**best_params['Random Forest'])
best_xgb_model = xgb.XGBClassifier(**best_params['XGBoost'])
best_hgb_model = HistGradientBoostingClassifier(**best_params['HistGradientBoosting'])

# Combine the models into a voting classifier
voting_clf = VotingClassifier(estimators=[
    ('lr', best_lr_model),
    ('knn', best_knn_model),
    ('svm', best_svm_model),
    ('rf', best_rf_model),
    ('xgb', best_xgb_model),
    ('hgb', best_hgb_model)
], voting='soft')  # 'soft' for probability-based voting, 'hard' for majority voting

# Fit the voting classifier
voting_clf.fit(X, y)

# Evaluate the ensemble using cross-validation
cv_scores = cross_val_score(voting_clf, X, y, cv=10, scoring=accuracy_scorer)
print(f"Voting Classifier - {accuracy_scorer._score_func.__name__}: {np.mean(cv_scores):.4f} ± {np.std(cv_scores):.4f}")