<a href="https://colab.research.google.com/github/DrDavidL/learning-dhds/blob/main/Part_3_Modeling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

- Author: David Liebovitz, MD  
- Additional code updates and comments contributed by Xinyang(Oliver) Zhou, Northwestern Univ.
- Updated by Jay Manadan
- For Northwestern University Feinberg School of Medicine  
- May use with attribution

[Part 1](https://drive.google.com/file/d/1N9U5NybETiu6JdUBoiFvuLiZZ6UuG0Nx/view?usp=sharing), [Part 2](https://drive.google.com/file/d/1JGePTrMJfnjH4WWpu_F77Oy_rLRR2pCW/view?usp=sharing), [Part 3](https://drive.google.com/file/d/1UrGbYapImkuFA4RTp1rYC1zLCaCgaAxQ/view?usp=sharing),
[Part 4](https://drive.google.com/file/d/1h7NnkPIihE_JuztP7uEcEK3T5JMcDHk4/view?usp=sharing), [Part 5](https://drive.google.com/file/d/1qxjFy2fPc72tkyJxupQ6EruvFR94txaf/view?usp=sharing), [Part 6](https://drive.google.com/file/d/1bzYpdEFhFlr1_uqMIwNKesJ7XAWRSB5O/view?usp=sharing)


# Part 3: Generating and Testing a Predictive Model!

Run the first code cell below to get everything ready. This sets up your cloud computer and installs the tools we need.









In [None]:
# Be sure to run this critical first cell to import libraries and prepare our cloud computer!
# Click the "play" button to the left or press Ctrl+Enter to execute.
# If you haven't logged into Google, you'll be prompted.
# Ignore any warning prompt if you see one and proceed.
# This will then load the necessary libraries so we can analyze our data!

# Install missing libraries

%pip install mljar-scikit-plot xgboost imbalanced-learn -q

# Essential imports
import math
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.graph_objs as go
import plotly.offline as ply
import ipywidgets as widgets

# Scikit-learn and data preprocessing
from sklearn.model_selection import train_test_split, cross_validate, GridSearchCV, StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn import metrics, svm

# Metrics and plotting
from sklearn.metrics import (roc_curve, roc_auc_score, precision_recall_curve,
                             f1_score, auc, log_loss, recall_score, precision_score,
                             average_precision_score, classification_report, accuracy_score,
                             RocCurveDisplay, PrecisionRecallDisplay, confusion_matrix, ConfusionMatrixDisplay)

import scikitplot as skplt
from scikitplot.metrics import plot_roc_curve, plot_precision_recall_curve, plot_confusion_matrix

# XGBoost
import xgboost as xgb
from xgboost import XGBClassifier

# Imbalanced-learn
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE

# Google Colab-specific settings
from google.colab import data_table
data_table.enable_dataframe_formatter()

# Plotly settings
import plotly.offline
plotly.offline.init_notebook_mode(connected=True)

# Display options for pandas DataFrame
pd.options.display.max_columns = 50
pd.options.display.max_rows = 30

# Set seaborn default figure size
sns.set(rc={'figure.figsize':(12, 6)})



# Data import

As in Part 1 and Part 2, this notebook utilizes a dataset sourced from University of Virginia studies by Dr. Robert Schorling of several hundred rural African American patients. Additional information on this data set is available here:   
> https://hbiostat.org/data/repo/diabetes.html   

Using this data source, Dr. Robert Hoyt (https://data.world/rhoyt) assigned patients to a diabetes category if their hemoglobin A1c values were 6.5 or greater (https://data.world/informatics-edu/diabetes-prediction). The dataset  explored here is the modified version from Dr. Hoyt, almost ready to go for our explorations.

Our next step: Let's retrieve the data to use in our Notebook!



In [None]:
# Run this cell to reference the website that is holding the diabetes data in a CSV (comma separated values) file!
website = "https://drive.google.com/uc?export=download&id=1PQM8eQnQpaJwe9mAVb_XBpZoWoA5nTlM"

# polyp example
website_no = 'https://raw.githubusercontent.com/DrDavidL/dhds/main/colon_path_pred.csv'

# The command below reads the CSV file from the website into a DataFrame named df_dm!
df_dm = pd.read_csv(website)

# Let's view 10 rows of data and enable sorting for the columns. Click column headers to get a feel for the max/min for columns.
# What's the maximum SBP? Lowest HDL? Note the Filter option at the top right. Click and filter to see how many patients are over 85.
data_table.DataTable(df_dm, include_index=False, num_rows_per_page=10)

In [None]:
# Once we load the data, we’ll turn categories like "male/female" and "diabetes/no diabetes" into numbers (1 or 0). This is important because most machine learning models need numbers, not words.
# Now, "no diabetes" will become "0" and "diabetes" will become "1" where they appear in the Diabetes column of our dataframe.

# Replace categorical values in 'Diabetes' column with numerical values (0 and 1).
df_dm['Diabetes'] = df_dm['Diabetes'].replace(['No diabetes', 'Diabetes'], [0, 1])

# Replace categorical values in 'Gender' column with numerical values (1 for 'male', 0 for 'female').
df_dm['Gender'] = df_dm['Gender'].replace(['male', 'female'], [1, 0])

# Let's check that the categorical columns are now updated as integers with (1 or 0) for entries.
print(f'Gender values:  \n{df_dm["Gender"].value_counts()} \n')
print(f'Diabetes values:  \n{df_dm["Diabetes"].value_counts()}')


In [None]:
# Prior to running a model, let's view the correlations with presence of diabetes for each of the variables.

# Calculate the correlation matrix for df_dm dataframe.
corr_df = df_dm.corr()

# Sort correlations with 'Diabetes' column in descending order and round to 3 decimal places.
diabetes_corr = corr_df['Diabetes'].round(3).sort_values(ascending=False)

# Print correlations with presence of diabetes for each variable.
print(diabetes_corr)


**Finding Relationships in the Data**

We’ll check which features (like glucose or cholesterol) are related to having diabetes.

A high positive number means that as the feature increases, diabetes is more likely. A negative number means that as the feature increases, diabetes is less likely.

For example:

Glucose (0.689) has a strong positive relationship with diabetes.

HDL cholesterol (-0.123) has a weak negative relationship with diabetes.
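
To make the idea of a correlation number concrete, here is a minimal sketch of how a Pearson correlation can be computed by hand. The six patients and their values are made up for illustration (they are not from the real dataset), and `pearson_r` is a hypothetical helper name, not a pandas function.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How much the two variables move together around their means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # How much each variable moves on its own.
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up values for six hypothetical patients (illustration only).
glucose = [85, 92, 101, 140, 180, 220]
hdl = [62, 55, 58, 45, 49, 40]
diabetes = [0, 0, 0, 1, 1, 1]

r_glucose = pearson_r(glucose, diabetes)
r_hdl = pearson_r(hdl, diabetes)
print(f"glucose vs diabetes: {r_glucose:.3f}")  # positive: higher glucose goes with diabetes
print(f"HDL vs diabetes:     {r_hdl:.3f}")      # negative: higher HDL goes with no diabetes
```

In this toy data the glucose correlation comes out positive and the HDL correlation negative, the same directions as in the real results above.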

In [None]:
# Another approach uses colors instead of numbers to pick out strong correlations.

# Create a subplot with a specified size for the heatmap.
fig, ax = plt.subplots(figsize=(8, 8))

# Define a colormap with diverging colors for the heatmap.
cmap = sns.diverging_palette(0, 230, 90, 60, as_cmap=True)

# Calculate the correlation matrix for df_dm dataframe.
corr = df_dm.corr()

# Create a mask to hide the upper triangle of the heatmap (repeated information).
mask = np.triu(np.ones_like(corr, dtype=bool))

# Plot the heatmap using seaborn's heatmap function.
sns.heatmap(corr, xticklabels=corr.columns, yticklabels=corr.columns,
            mask=mask, cmap=cmap, vmax=.7, vmin=-.13, ax=ax)

# Set plot parameters and labels.
ax.set_title('Heatmap Showing Correlations between Variables')
plt.show()


**Using a Heatmap to See Relationships**

A heatmap is a colorful chart that shows how different variables are related. It helps us quickly spot strong relationships.

Look at the row labeled "Diabetes" in the heatmap. The darker blue areas mean strong positive relationships (like with glucose), and pink areas mean negative relationships (like with HDL).

# Getting Ready to Build a Model

We want to teach a model to predict whether someone has diabetes based on their other health info.

To do this:

1. We split the data into two parts:

  *   A `training set` (80%) to build the model

  *   A `test set` (20%) to check how well it works

2. We separate the data into inputs (like glucose, age, etc.) and outputs (whether or not the person has diabetes). We will use `X_train` and `X_test` for the inputs of the training and test sets, respectively, and `y_train` and `y_test` for the corresponding outputs (often called "labels"), i.e., whether or not a patient has diabetes.

3. We scale the numbers so they’re all on a similar range. This helps the model treat all features fairly. One common choice, standardization, sets each feature's mean to 0 and its standard deviation to 1. The pipeline in this notebook instead uses `MinMaxScaler`, which rescales each feature to the range 0 to 1. Either way, putting the features on a common scale helps us compare the weight of the features inside our model.
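
The pipeline later in this notebook uses `MinMaxScaler`. As a rough sketch of what min-max scaling does, the example below rescales made-up weights by hand. The helper names `fit_min_max` and `apply_min_max` are hypothetical. Note that the minimum and maximum are learned from the training values only and then reused for the test values, so the test set stays "pure."

```python
def fit_min_max(train_values):
    """Learn the scaling range from the training data only."""
    return min(train_values), max(train_values)

def apply_min_max(values, lo, hi):
    """Rescale values so that lo maps to 0 and hi maps to 1."""
    return [(v - lo) / (hi - lo) for v in values]

# Made-up weights (illustration only).
train_weight = [120, 150, 200, 250]
test_weight = [135, 260]

lo, hi = fit_min_max(train_weight)
train_scaled = apply_min_max(train_weight, lo, hi)
test_scaled = apply_min_max(test_weight, lo, hi)

print(train_scaled)  # training values land between 0 and 1
print(test_scaled)   # a test value above the training maximum lands above 1
```

A test value outside the training range can fall slightly below 0 or above 1. That is expected, and it is the price of not peeking at the test data.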


In [None]:
# Assign inputs and outputs
# "y" is considered the output we'd like to predict. Here, it's whether or not the patient has Diabetes so we select that column from our dataset
y = df_dm['Diabetes']

# The inputs are "X" and are all the columns except for the 'Diabetes' column.
X = df_dm.drop(columns=['Diabetes'])

# Now we split our data into a "training set" where our model learns the relationship between the input variables and diabetes.
# And, the other split is a "testing set." This we hold apart from our model creation so it's pure for testing how good our model is!
# There are many ways one can slip-up and, e.g., include the test data when using data statistics in the model creation. This is cheating!

# Splitting the dataset into training and testing sets, with 80% for training and 20% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X,                       # Input features (all columns except 'Diabetes')
    y,                       # Target output ('Diabetes' column)
    random_state=42,         # Setting a random seed for reproducibility
    shuffle=True,            # Shuffling the data before splitting
    test_size=0.2,           # Percentage of data to allocate to the test set
    stratify=df_dm['Diabetes']  # Preserving the proportion of diabetic and non-diabetic cases
)
# You can pick your favorite number for "random_state"! It is used for the consistency of the data split/model performance.
# With the same random state value, the train/test datasets will look the same no matter how many times you run the code.

Next, let's examine our training dataset! How many patients have DM? Is it a skewed, or imbalanced dataset?

In [None]:
# Print the fractions of diabetic and non-diabetic patients in the training set
print(f'Diabetes Fractions: \n{y_train.value_counts(normalize=True).round(2)}\n') # Provide proportion instead of actual counts

# Plot a bar chart showing the counts of diabetic (1) and non-diabetic (0) patients in the training set
y_train.value_counts().plot(kind='bar', title="Patients without (0) and with (1) Diabetes") # Bar chart

## Making our first model!

Only about 15% of patients in our training data have diabetes. That’s a big imbalance, which can make it harder for the model to learn. To fix this, we’ll use a method called SMOTE. It creates synthetic examples of patients with diabetes, so the data is more balanced.
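
Here is a simplified sketch of the core idea behind SMOTE: a synthetic patient is placed at a random spot on the line between two real patients from the minority class. This is an illustration under stated assumptions, not the `imbalanced-learn` implementation (the real method picks the second patient from among the nearest minority-class neighbors). The patients and the helper name `synthetic_point` are made up.

```python
import random

def synthetic_point(point, neighbor, rng):
    """Return a random point on the line segment between two minority-class points."""
    u = rng.random()  # random fraction between 0 and 1
    return [p + u * (q - p) for p, q in zip(point, neighbor)]

rng = random.Random(11)  # fixed seed so the result is reproducible

# Two made-up patients with diabetes: [glucose, BMI]
patient_a = [180.0, 31.0]
patient_b = [220.0, 35.0]

new_patient = synthetic_point(patient_a, patient_b, rng)
print(new_patient)  # each value lies between the two real patients' values
```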

Next, we scale all the input numbers so they're on a similar range. For example, someone’s weight might be 200, while their waist/hip ratio might be 0.7. We want to make sure big numbers don’t unfairly influence the model just because of their size.

Once that’s done, we’ll put everything into a pipeline—a series of steps that prepare the data and then train the model. The machine learning method we’ll use first is called logistic regression.

We also have two options for training:

  *   Use the entire training dataset at once
  *   Or use a method called **stratified k-fold cross-validation**, which splits the training data into smaller pieces (called folds), trains the model on some folds, and tests it on others. This helps us get a more accurate idea of how well the model might perform on new data. We’ll use 5 folds for this.

The next step will run this full process and show how our diabetes prediction model performs.
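
To see what "stratified" means in practice, here is a minimal hand-rolled sketch that deals 100 made-up patients (15% with diabetes) into 5 folds so that every fold keeps the same share of diabetes cases. The function `stratified_folds` is a hypothetical helper for illustration. Unlike the `StratifiedKFold` call used in this notebook, it does not shuffle the data first.

```python
def stratified_folds(labels, n_folds):
    """Assign each sample index to a fold, dealing out each class round-robin."""
    folds = [[] for _ in range(n_folds)]
    for cls in sorted(set(labels)):
        indices = [i for i, y in enumerate(labels) if y == cls]
        for position, idx in enumerate(indices):
            folds[position % n_folds].append(idx)
    return folds

# 100 made-up patients: 85 without diabetes (0) and 15 with diabetes (1).
labels = [0] * 85 + [1] * 15
folds = stratified_folds(labels, n_folds=5)

for k, held_out in enumerate(folds):
    held_out_set = set(held_out)
    # Train on every patient that is not in the held-out fold.
    train_idx = [i for i in range(len(labels)) if i not in held_out_set]
    positives = sum(labels[i] for i in held_out)
    print(f"fold {k}: train on {len(train_idx)}, validate on {len(held_out)} "
          f"({positives} with diabetes)")
```

Each fold takes one turn as the validation slice while the model trains on the other four, and the scores from the five turns are averaged.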



In [None]:
# Note the 3 ingredients in our pipeline for model creation. First, we address the imbalanced data by generating synthetic data using
# SMOTE that retains characteristics of the diabetes patients already in the dataset. Then, we "scale" the values to lessen the impact of different value
# ranges as mentioned above. Then, in the pipeline, we are going to use "Logistic Regression" to generate our first model!

pipeline = Pipeline(steps=[
    ['smote', SMOTE(random_state=11)],  # Step 1: Address imbalanced data using SMOTE
    ['scaler', MinMaxScaler()],         # Step 2: Scale feature values using MinMaxScaler
    ['classifier', LogisticRegression(random_state=11, max_iter=1000)]  # Step 3: Logistic Regression classifier
])

# Remember k-fold cross validation as mentioned above? This will slice our training data into 5 sections and pass each slice through our pipeline.
stratified_kfold = StratifiedKFold(n_splits=5, # K value of your choice
                                       shuffle=True, # Shuffle the data before splitting
                                       random_state=11 # Random state of your choice
                                   )

# There are even more advanced techniques to "tweak" how our model works. Suffice it to say, we're going to try
# a variety of these tweaks all at the same time, and then pick the best tweak possible!
param_grid = {'classifier__C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}  # Hyperparameter grid for Logistic Regression
grid_search = GridSearchCV(estimator=pipeline,
                           param_grid=param_grid,
                           scoring='roc_auc',  # Use ROC AUC for scoring
                           cv=stratified_kfold,  # Use stratified k-fold cross-validation
                           n_jobs=-1, refit=True)  # Use all CPU cores, refit with best parameters

# Here we send our data into our pipeline! Find the best estimators.
grid_search.fit(X_train, y_train)

# Lastly, let's name our best model setting. Get the best estimators and model.
clf_lg = grid_search.best_estimator_

# How did our model do? These steps gather all the generated predictions and probabilities.
test_prob1 = clf_lg.predict_proba(X_test)[:, 1] # Make predictions (probability)
train_prob1 = clf_lg.predict_proba(X_train)[:, 1]
test_pred1 = clf_lg.predict(X_test) # Make predictions (class)
train_pred1 = clf_lg.predict(X_train)

# Mean cross-validated ROC AUC of our best tweaked version (GridSearchCV scored with 'roc_auc', so this is not accuracy).
# round(float(...)) works whether the score is a NumPy number or a plain Python number.
cv_score = round(float(grid_search.best_score_), 3)

# Here we calculate accuracy (the fraction of correct class predictions) on the training and test datasets.
train_acc = accuracy_score(y_train, train_pred1)
test_acc = accuracy_score(y_test, test_pred1)

# Print out evaluation metrics
print('Our Logistic Regression Model Metrics\n')

print('Accuracy (training) = {:.3f}'.format(train_acc))
print('Accuracy (testing) = {:.3f}\n'.format(test_acc))

# Generate ROC AUC scores, a more comprehensive evaluation criterion than accuracy
print('AUC (cross-validation) = {:.3f}'.format(cv_score))
print('AUC (training) = {:.3f}'.format(roc_auc_score(y_train, train_prob1)))
print('AUC (testing) = {:.3f}\n'.format(roc_auc_score(y_test, test_prob1)))

# Generate the F1 score, another comprehensive evaluation criterion.
# It combines Precision (which penalizes false positives) and Recall (which penalizes false negatives).
print('F1 score (training) = {:.5f}'.format(f1_score(y_train, train_pred1)))
print('F1 score (testing) = {:.5f}'.format(f1_score(y_test, test_pred1)))

## Accuracy and F1 statistics

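When you read articles about machine learning models, you'll encounter these terms:

> Accuracy: The fraction of correct predictions overall, counting both positive and negative predictions. Values range from 0 to 1, and higher is better. Because only about 15% of our patients have diabetes, a model that always predicted "no diabetes" would still score about 0.85, so accuracy alone can mislead.

> AUC (Area Under the Curve): Tells how well the model can separate diabetes from no-diabetes. A score near 1 is best, and 0.5 is no better than guessing.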

> F1 Score: Balances two things:
  *   Precision: Of all the people the model said had diabetes, how many really did?
  *   Recall: Of all the people who had diabetes, how many did the model catch?
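
As a small worked example of these definitions, the sketch below computes precision, recall, and F1 from counts of true positives, false positives, and false negatives. The counts are hypothetical, and `precision_recall_f1` is an illustrative helper name. F1 is the harmonic mean of precision and recall.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from prediction counts."""
    precision = tp / (tp + fp)  # of everyone flagged, how many truly had diabetes
    recall = tp / (tp + fn)     # of everyone with diabetes, how many were flagged
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return precision, recall, f1

# Hypothetical counts: 8 correct flags, 4 false alarms, 2 missed cases.
tp, fp, fn = 8, 4, 2
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
# prints: precision=0.667 recall=0.800 F1=0.727
```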

**Choosing the Best Threshold**

These measures characterize the model overall. When we deploy a model, say in an EHR to give a heads up that your patient may have diabetes, we can set a threshold for a specific risk level that will switch the predicted category from one to another.

The model gives us a probability for each patient (like 52% chance of diabetes). We set a threshold (like 50%) to decide when to predict "yes" for diabetes. A higher threshold gives fewer false alarms (false positives), but may miss some real cases. A lower threshold catches more diabetes cases, but may also flag more people who don’t have it.
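
The trade-off can be seen with a handful of made-up probabilities. This sketch counts false alarms and missed cases at three thresholds. The data and the helper `count_errors` are hypothetical, and treating a probability exactly equal to the threshold as a "yes" is an assumption made here for simplicity.

```python
def count_errors(truth, probabilities, threshold):
    """Return (false_alarms, missed_cases) when predicting 1 for probability >= threshold."""
    predicted = [1 if p >= threshold else 0 for p in probabilities]
    false_alarms = sum(1 for t, y in zip(truth, predicted) if t == 0 and y == 1)
    missed_cases = sum(1 for t, y in zip(truth, predicted) if t == 1 and y == 0)
    return false_alarms, missed_cases

# Eight made-up patients: true status and the model's predicted probability.
truth = [0, 0, 0, 0, 1, 1, 1, 1]
probs = [0.05, 0.20, 0.35, 0.52, 0.48, 0.90, 0.65, 0.30]

for threshold in (0.3, 0.5, 0.7):
    false_alarms, missed_cases = count_errors(truth, probs, threshold)
    print(f"threshold {threshold}: {false_alarms} false alarms, {missed_cases} missed cases")
# prints:
# threshold 0.3: 2 false alarms, 0 missed cases
# threshold 0.5: 1 false alarms, 2 missed cases
# threshold 0.7: 0 false alarms, 3 missed cases
```

Raising the threshold trades false alarms for missed cases. Which balance is right depends on the clinical cost of each kind of error.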

Let's look at the ROC curves for our model first, and then we'll explain!


In [None]:
# This first line helps us see plots next to each other.
fig, axs = plt.subplots(1, 2)

# Here we generate the ROC curve from the Training Dataset, drawn directly on the left axes.
RocCurveDisplay.from_estimator(clf_lg, X_train, y_train, ax=axs[0])
axs[0].set_title("Training Dataset ROC Curve")

# Here we generate the ROC curve from the Testing Dataset, drawn directly on the right axes.
RocCurveDisplay.from_estimator(clf_lg, X_test, y_test, ax=axs[1])
axs[1].set_title("Testing Dataset ROC Curve")

# Display the figure containing both plots.
plt.show()


## ROC Curves

An ROC curve shows the trade-off between catching real diabetes cases (true positives) and avoiding false alarms (false positives).

We want the curve to reach the top-left corner. That would mean the model is perfect!

The AUC number tells us how close the curve gets to perfect. Our test AUC is about 0.86, which is pretty good.
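
One way to build intuition for AUC: it equals the chance that a randomly chosen patient with diabetes gets a higher predicted probability than a randomly chosen patient without it. The sketch below computes AUC that way on made-up data. `auc_by_ranking` is a hypothetical helper for illustration, not how scikit-learn computes the value internally.

```python
def auc_by_ranking(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Eight made-up patients: true status and predicted probability.
truth = [0, 0, 0, 0, 1, 1, 1, 1]
probs = [0.05, 0.20, 0.35, 0.52, 0.48, 0.90, 0.65, 0.30]

auc_value = auc_by_ranking(truth, probs)
print(f"AUC = {auc_value:.4f}")  # 13 of the 16 pairs are ranked correctly
# prints: AUC = 0.8125
```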


Next up: Precision-Recall curves:

In [None]:
# This first line helps us see plots next to each other.
fig, axs = plt.subplots(1, 2)

# Here we generate the Precision-Recall curve using the Training Dataset, drawn directly on the left axes.
PrecisionRecallDisplay.from_estimator(clf_lg, X_train, y_train, ax=axs[0])
axs[0].set_title("Training Precision-Recall Curve")

# Here we generate the Precision-Recall curve using the Testing Dataset, drawn directly on the right axes.
PrecisionRecallDisplay.from_estimator(clf_lg, X_test, y_test, ax=axs[1])
axs[1].set_title("Testing Precision-Recall Curve")

# Display the figure containing both plots.
plt.show()


## Precision-Recall Curves

This curve helps when we have imbalanced data (like in our case).

It shows the trade-off between:
  *   Precision: How many of the predicted diabetes cases were correct
  *   Recall: How many actual diabetes cases the model found

A perfect model would reach the top-right corner of this plot.
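
To show where the points on such a curve come from, this sketch steps through every possible threshold on made-up data and records the precision and recall at each one. `precision_recall_points` is a hypothetical helper. The plotting function used in this notebook handles details such as ties and curve endpoints in its own way.

```python
def precision_recall_points(y_true, scores):
    """Return (threshold, precision, recall) for each distinct score, highest threshold first."""
    points = []
    for threshold in sorted(set(scores), reverse=True):
        predicted = [1 if s >= threshold else 0 for s in scores]
        tp = sum(1 for y, p in zip(y_true, predicted) if y == 1 and p == 1)
        fp = sum(1 for y, p in zip(y_true, predicted) if y == 0 and p == 1)
        fn = sum(1 for y, p in zip(y_true, predicted) if y == 1 and p == 0)
        points.append((threshold, tp / (tp + fp), tp / (tp + fn)))
    return points

# Eight made-up patients: true status and predicted probability.
truth = [0, 0, 0, 0, 1, 1, 1, 1]
probs = [0.05, 0.20, 0.35, 0.52, 0.48, 0.90, 0.65, 0.30]

points = precision_recall_points(truth, probs)
for threshold, precision, recall in points:
    print(f"threshold {threshold:.2f}: precision {precision:.2f}, recall {recall:.2f}")
```

At the lowest threshold everyone is flagged, so recall is 1 and precision drops to the share of patients who have diabetes (0.5 in this toy data). On an imbalanced dataset that floor is much lower, which is why this curve is informative there.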



In [None]:
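# confusion_matrix and ConfusionMatrixDisplay were already imported in the first cell.
# If you run this cell on its own, uncomment the line below.
#from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay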

# Generate predictions using the trained model on the test set
predictions = clf_lg.predict(X_test)

# Compute confusion matrix
cm = confusion_matrix(y_test, predictions, labels=clf_lg.classes_)

# Initialize ConfusionMatrixDisplay with computed confusion matrix and display labels
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=clf_lg.classes_)

# Plot the confusion matrix with a color map ('Purples')
disp.plot(cmap='Purples')

# Disable grid lines on the plot
plt.grid(False)

# Set the title for the plot
plt.title("Confusion Matrix for Our Diabetes Prediction Model")

# Display the plot
plt.show()


## Confusion Matrix

This is a table that shows how many patients were correctly or incorrectly predicted:
  *   Top-left: Correctly said "no diabetes"
  *   Bottom-right: Correctly said "yes, diabetes"
  *   Top-right: False alarms (predicted diabetes but they didn’t have it)
  *   Bottom-left: Missed cases (they had diabetes but model said no)

You want to see high numbers in the top-left and bottom-right.
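
The four cells can also be counted by hand. This sketch uses ten made-up patients and lays out the table the same way as above (rows are the true status, columns are the prediction), then derives sensitivity and specificity from it. `confusion_counts` is an illustrative helper name.

```python
def confusion_counts(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]]: rows are actual classes, columns are predicted classes."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 0 and predicted == 0:
            tn += 1  # correctly said "no diabetes"
        elif actual == 0 and predicted == 1:
            fp += 1  # false alarm
        elif actual == 1 and predicted == 0:
            fn += 1  # missed case
        else:
            tp += 1  # correctly said "yes, diabetes"
    return [[tn, fp], [fn, tp]]

# Ten made-up patients.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

matrix = confusion_counts(y_true, y_pred)
(tn, fp), (fn, tp) = matrix
sensitivity = tp / (tp + fn)  # same as recall
specificity = tn / (tn + fp)
print(matrix)  # prints: [[5, 1], [1, 3]]
print(f"sensitivity={sensitivity:.2f} specificity={specificity:.2f}")
```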

# Before you leave...

1. Please complete the ***very brief*** statement of completion survey [here](https://forms.office.com/r/PhxaQ9q4Pm):   


2. Check out the references for more information below!  

3. Check out the **BONUS** machine learning models at the end, below references!

Thank you!!!

# References and more

Perhaps you loved this so much, you want to make your own Colab notebooks! How would you do this???

First, save a copy of this notebook so you can always refer back to copy the libraries and leverage the code used on future notebooks.

Then, visit [here](https://colab.research.google.com/?utm_source=scs-index) for an overview and setup!

Thank you!

David Liebovitz, MD


References:

1. Bento C. Support Vector Machines explained with Python examples [Internet]. Medium. 2020 [cited 2021 Nov 11]. Available from: https://towardsdatascience.com/support-vector-machines-explained-with-python-examples-cb65e8172c85


2. Brownlee J. How to Use ROC Curves and Precision-Recall Curves for Classification in Python [Internet]. Machine Learning Mastery. 2018 [cited 2021 Nov 7]. Available from: https://machinelearningmastery.com/roc-curves-and-precision-recall-curves-for-classification-in-python/

> - This is a great resource for explaining ROC curves and coding for them. (Other pages on the site are fantastic, too.)


3. Dhandhania K. End-to-End Data Science Example: Predicting Diabetes with Logistic Regression [Internet]. Medium. 2018 [cited 2021 Nov 7]. Available from: https://towardsdatascience.com/end-to-end-data-science-example-predicting-diabetes-with-logistic-regression-db9bc88b4d16

> - I came across this website as I was well on my way. There are many similarities to the approach followed above that is applied to a different data set. I applied the normalization, heatmap, and correlations methods discussed here.








# APPENDIX  

Here are examples applying other machine learning algorithms to our data set to generate predictive models:

> - Support Vector Machine  
> - XGBoost
> - Random Forest
> - Gaussian Naive Bayes

The models' performance doesn't change much from one algorithm to the next. Keep in mind that our dataset is small, and some approaches work best with larger or more complex data. Sometimes, as here, logistic regression appears fine!

It's also possible to combine models and "vote." For example, at least 2 of 3 models must agree to make the call. This is an example of an ensemble method. Three "heads" are sometimes better than one!
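
Here is a minimal sketch of that voting idea, using made-up predictions from three hypothetical models for five patients. `majority_vote` is an illustrative helper. For real models, scikit-learn offers a `VotingClassifier` in `sklearn.ensemble`.

```python
def majority_vote(*model_predictions):
    """Predict 1 for a patient when more than half of the models predict 1."""
    n_models = len(model_predictions)
    return [1 if sum(votes) * 2 > n_models else 0
            for votes in zip(*model_predictions)]

# Made-up 0/1 predictions from three hypothetical models for five patients.
logistic_preds = [1, 0, 1, 0, 0]
svm_preds      = [1, 1, 0, 0, 0]
forest_preds   = [0, 1, 1, 0, 1]

combined = majority_vote(logistic_preds, svm_preds, forest_preds)
print(combined)  # prints: [1, 1, 1, 0, 0]
```

The last patient is flagged by only one model, so the combined call is "no diabetes."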

An SVM (support vector machine) example:

In [None]:
# Define a pipeline for SVM model including SMOTE for data balancing, MinMaxScaler for feature scaling, and SVM classifier
pipeline = Pipeline(steps=[
    ['smote', SMOTE(random_state=11)],  # Step 1: Address imbalanced data using SMOTE
    ['scaler', MinMaxScaler()],         # Step 2: Scale feature values using MinMaxScaler
    ['classifier', svm.SVC(random_state=11, max_iter=1000)]  # Step 3: SVM classifier
])

# Define Stratified K-Fold cross-validation for robust evaluation
stratified_kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=11)

# Define parameter grid for SVM hyperparameter tuning
param_grid = {'classifier__C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}

# Setup GridSearchCV to find best SVM model based on ROC AUC scoring
grid_search = GridSearchCV(estimator=pipeline,
                           param_grid=param_grid,
                           scoring='roc_auc',
                           cv=stratified_kfold,
                           n_jobs=-1
                           # If n_jobs = -1, it uses all available cores on your machine for parallel execution.
                           # If n_jobs = 1, it runs the grid search using a single core (no parallelism).
                           # If n_jobs > 1, it specifies the exact number of cores to use for parallel execution.
                           )

# Fit GridSearchCV on training data to find the best model
grid_search.fit(X_train, y_train)

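# Get best cross-validation score (mean ROC AUC across the folds) from GridSearchCV.
# round(float(...)) works whether the score is a NumPy number or a plain Python number.
cv_score = round(float(grid_search.best_score_), 3)

# Evaluate best model on test data (this is ROC AUC, because scoring='roc_auc')
test_score = round(float(grid_search.score(X_test, y_test)), 3)

# Get best estimator (model) from GridSearchCV
clf_svm = grid_search.best_estimator_

# Print SVM model metrics
print(f'SVM Metrics: \nCross-validated ROC AUC (training folds): {cv_score}\nTesting ROC AUC: {test_score}\n')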

# Generate predictions on test set using the best SVM model
predictions = clf_svm.predict(X_test)

# Compute confusion matrix for test predictions
cm = confusion_matrix(y_test, predictions, labels=clf_svm.classes_) # clf_svm.classes_ will get the prediction classes

# Initialize ConfusionMatrixDisplay with computed confusion matrix and display labels
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=clf_svm.classes_)

# Create a figure with two subplots for ROC curve and confusion matrix side-by-side
fig, axs = plt.subplots(1, 2)

# Generate and plot the ROC curve for the best SVM model on the Testing Dataset
RocCurveDisplay.from_estimator(clf_svm, X_test, y_test).plot(ax=axs[0])

# Close the plot to avoid displaying unwanted duplicate plots.
plt.close()

# Set the title for the figure, displaying SVM Testing ROC Curve and SVM Confusion Matrix side by side.
fig.suptitle("""    SVM Testing ROC Curve              \
            SVM Confusion Matrix""")

# Disable grid lines on the confusion matrix plot.
plt.grid(False)

# Plot the confusion matrix with a color map ('Purples')
disp.plot(cmap='Purples', ax=axs[1])

# Display the figure containing both plots.
plt.show()


**Advantages of Support Vector Machines (SVM):**
1. Effective in High-Dimensional Spaces: SVMs perform well in high-dimensional spaces, making them suitable for complex problems where the number of features is large.

2. Versatile Kernel Options: SVMs can use different kernel functions to handle non-linear decision boundaries, including linear, polynomial, radial basis function (RBF), and sigmoid kernels.

3. Effective in Non-Linear Classifications: With appropriate kernels, SVMs can model complex decision boundaries that other algorithms may struggle with.

4. Robust Against Overfitting: SVMs have regularization parameters that help prevent overfitting, even in high-dimensional spaces with small sample sizes.

**Limitations of Support Vector Machines (SVM):**
1. Computational Intensity: Training an SVM can be time-consuming, especially on large datasets. The complexity of SVMs also increases with the size of the dataset.

2. Difficulty in Choosing Kernels: Selecting the right kernel function and tuning its parameters can be challenging and requires domain knowledge and experimentation.

3. Limited Scalability: SVMs may not perform well with datasets that have millions of examples or more, as the training time increases significantly.

An XGBoost example:

In [None]:
# Define a pipeline for XGBoost model including SMOTE for data balancing, MinMaxScaler for feature scaling, and XGBoost classifier
pipeline = Pipeline(steps=[
    ['smote', SMOTE(random_state=11)],  # Step 1: Address imbalanced data using SMOTE with a fixed random state for reproducibility
    ['scaler', MinMaxScaler()],         # Step 2: Scale feature values using MinMaxScaler to ensure all features are on the same scale
    ['classifier', XGBClassifier(random_state=11)]  # Step 3: XGBoost classifier with a fixed random state (max_iter is not an XGBoost parameter; the number of boosting rounds is set with n_estimators)
])

# Define Stratified K-Fold cross-validation for robust evaluation
stratified_kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=11)

# Define parameter grid for XGBoost hyperparameter tuning
param_grid = {'classifier__learning_rate': [0.001, 0.01, 0.1, 1],  # Example parameter grid for learning rate
              'classifier__max_depth': [3, 5, 7]}  # Example parameter grid for max depth

# Setup GridSearchCV to find the best XGBoost model based on ROC AUC scoring
grid_search = GridSearchCV(estimator=pipeline,
                           param_grid=param_grid,
                           scoring='roc_auc',
                           cv=stratified_kfold,
                           n_jobs=-1)

# Fit GridSearchCV on training data to find the best model
grid_search.fit(X_train, y_train)

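# Get best cross-validation score (mean ROC AUC across the folds) from GridSearchCV.
# round(float(...)) works whether the score is a NumPy number or a plain Python number.
cv_score = round(float(grid_search.best_score_), 3)

# Evaluate best model on test data (this is ROC AUC, because scoring='roc_auc')
test_score = round(float(grid_search.score(X_test, y_test)), 3)

# Get best estimator (model) from GridSearchCV
clf_xgb = grid_search.best_estimator_

# Print XGBoost model metrics
print(f'XGB Metrics: \nCross-validated ROC AUC (training folds): {cv_score}\nTesting ROC AUC: {test_score}\n')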

# Generate predictions on test set using the best XGBoost model
predictions = clf_xgb.predict(X_test)

# Compute confusion matrix for test predictions
cm = confusion_matrix(y_test, predictions, labels=clf_xgb.classes_)

# Initialize ConfusionMatrixDisplay with computed confusion matrix and display labels
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=clf_xgb.classes_)

# Create a figure with two subplots for ROC curve and confusion matrix side-by-side
fig, axs = plt.subplots(1, 2)

# Generate and plot the ROC curve for the best XGBoost model on the Testing Dataset
RocCurveDisplay.from_estimator(clf_xgb, X_test, y_test).plot(ax=axs[0])

# Close the plot to avoid displaying unwanted duplicate plots.
plt.close()

# Set the title for the figure, displaying XGB Testing ROC Curve and XGB Confusion Matrix side by side.
fig.suptitle("""XGB Testing ROC Curve          \
            XGB Confusion Matrix""")

# Disable grid lines on the confusion matrix plot.
plt.grid(False)

# Plot the confusion matrix with a color map ('Purples')
disp.plot(cmap='Purples', ax=axs[1])

# Display the figure containing both plots.
plt.show()

**Advantages of XGBoost:**

1. High Performance: XGBoost is known for its high computational efficiency and speed due to its optimized algorithms and parallelization.

2. Handles Missing Data: It has built-in capabilities to handle missing data, which reduces the need for data preprocessing.

3. Regularization: XGBoost includes regularization techniques such as L1 and L2 regularization to prevent overfitting.

4. Flexibility: Supports various objective functions and evaluation metrics, making it adaptable to different types of problems.

**Limitations of XGBoost:**

1. Complexity: The hyperparameter tuning process can be complex and time-consuming due to the large number of parameters.

2. Sensitive to Noisy Data: It can overfit noisy datasets if not properly tuned or regularized.

3. Black Box Nature: Similar to other ensemble methods, XGBoost is less interpretable compared to simpler models like linear regression or decision trees.

A Random Forest example:

In [None]:
# Define Random Forest Classifier with specified parameters
rfc1 = RandomForestClassifier(
    n_estimators=20,   # Number of trees in the forest (default is 100). More trees generally improve performance but increase computation time.
    max_depth=4,       # Maximum depth of each tree in the forest (default is None, which expands nodes until all leaves are pure or until all leaves contain less than min_samples_split samples).
    max_features=14,   # Maximum number of features to consider when splitting a node; it must not exceed the number of input columns. The square root of the total number of features is a common starting point (and is the default, 'sqrt', in recent scikit-learn versions).
    bootstrap=True,    # Whether bootstrap samples are used when building trees. If False, the whole dataset is used to build each tree (default is True).
    random_state=18
)
# You can also use GridSearch method to find the best parameters!

# Train the Random Forest model on the training data
rfc1.fit(X_train, y_train)

# Make predictions on the test and training data
pred = rfc1.predict(X_test)
pred_t = rfc1.predict(X_train)

# Print accuracy scores for training and testing
print("Training Accuracy for Random Forest: ", accuracy_score(y_train, pred_t))
print("Testing Accuracy for Random Forest: ", accuracy_score(y_test, pred))

# Create a figure with two subplots for ROC curve and confusion matrix side-by-side
fig, axs = plt.subplots(1, 2)

# Plot the ROC curve for the Random Forest model on the Testing Dataset using predicted probabilities.
# (Hard 0/1 predictions would give only a single operating point on the curve.)
RocCurveDisplay.from_estimator(rfc1, X_test, y_test, ax=axs[0])

# Compute confusion matrix for test predictions
cm = confusion_matrix(y_test, pred, labels=rfc1.classes_)

# Initialize ConfusionMatrixDisplay with computed confusion matrix and display labels
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=rfc1.classes_)

# Plot the confusion matrix with a color map ('Purples')
disp.plot(cmap='Purples', ax=axs[1])

# Disable grid lines on the confusion matrix plot.
plt.grid(False)

# Display the figure containing both plots.
plt.show()


**Advantages of Random Forest:**

1. High Accuracy: Random Forest generally produces accurate predictions by aggregating the results of many decision trees, using majority voting for classification or averaging for regression.

2. Robust to Overfitting: Because each tree is trained on a bootstrap sample and considers a random subset of features at each split, Random Forest is less prone to overfitting than an individual decision tree.

3. Handles Missing Values and Outliers: Tree splits depend on the ordering of feature values rather than their magnitude, so outliers have limited influence. Some Random Forest implementations can also handle missing values directly; depending on your scikit-learn version, you may still need to impute missing values before fitting.
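
To make the overfitting point concrete, here is a small sketch that compares a single fully grown decision tree with a forest. The synthetic dataset and all variable names (`X_syn`, `single_tree`, `forest`, and so on) are assumptions for illustration and are separate from the notebook's `X_train` and `X_test`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 10% label noise (assumption for illustration only)
X_syn, y_syn = make_classification(
    n_samples=600, n_features=10, n_informative=4, flip_y=0.1, random_state=0
)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, test_size=0.3, stratify=y_syn, random_state=0
)

# A single tree with no depth limit can memorize the training data
single_tree = DecisionTreeClassifier(random_state=0).fit(Xs_train, ys_train)

# A forest averages many trees built on bootstrap samples
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(Xs_train, ys_train)

tree_scores = (single_tree.score(Xs_train, ys_train), single_tree.score(Xs_test, ys_test))
forest_scores = (forest.score(Xs_train, ys_train), forest.score(Xs_test, ys_test))

print("Single tree   train/test accuracy:", tree_scores)
print("Random Forest train/test accuracy:", forest_scores)
```

The single tree reaches perfect training accuracy by memorizing noise. The forest usually generalizes better to the test set, although the exact numbers depend on the data.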

**Limitations of Random Forest:**

1. Model Interpretability: While Random Forest provides feature importance, it is harder to interpret than a single decision tree. Understanding individual predictions can be complex.

2. Computational Complexity: Training a large number of decision trees can be computationally expensive and time-consuming, especially if the number of trees and the depth of each tree are large.

3. Not Suitable for Very Sparse Data: It may not perform well on very sparse, high-dimensional datasets (for example, text data where most feature values are zero), especially when the number of features is much larger than the number of samples.
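
Feature importance is the main interpretability tool a Random Forest offers. Below is a minimal sketch on synthetic data; the dataset and the names `X_imp`, `feature_names`, and `rf_imp` are hypothetical and chosen only for this example.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 3 informative features and 3 noise features (assumption for illustration)
X_imp, y_imp = make_classification(
    n_samples=500, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X_imp.shape[1])]

rf_imp = RandomForestClassifier(n_estimators=50, max_depth=4, random_state=18)
rf_imp.fit(X_imp, y_imp)

# feature_importances_ holds one value per feature; the values sum to 1
importances = pd.Series(rf_imp.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```

If `X_train` is a pandas DataFrame, the same pattern should work for the model above by pairing `rfc1.feature_importances_` with `X_train.columns`. These scores rank features for the whole model and do not explain any single prediction.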

## Gaussian Naive Bayes

In [None]:
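# Define a pipeline for Gaussian Naive Bayes: SMOTE for data balancing, MinMaxScaler for feature scaling, then the Gaussian Naive Bayes classifier.
# Note: SMOTE is a sampler, so Pipeline here must be imbalanced-learn's Pipeline (imblearn.pipeline); scikit-learn's own Pipeline rejects sampler steps.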
pipeline = Pipeline(steps=[
    ['smote', SMOTE(random_state=11)],      # Step 1: Address imbalanced data using SMOTE with a fixed random state for reproducibility
    ['scaler', MinMaxScaler()],             # Step 2: Scale feature values using MinMaxScaler to ensure all features are on the same scale
    ['classifier', GaussianNB()]            # Step 3: Gaussian Naive Bayes classifier
])

# Define Stratified K-Fold cross-validation for robust evaluation
stratified_kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=11)

# Define parameter grid for Gaussian Naive Bayes (var_smoothing is a parameter that helps with numerical stability by adding a small value to variances)
param_grid = {
    'classifier__var_smoothing': np.logspace(0, -9, num=30)  # Vary var_smoothing parameter across a range of values
}

# Setup GridSearchCV to find the best Gaussian Naive Bayes model based on ROC AUC scoring
grid_search = GridSearchCV(estimator=pipeline,
                           param_grid=param_grid,
                           scoring='roc_auc',
                           cv=stratified_kfold,
                           n_jobs=-1)

# Fit GridSearchCV on training data to find the best model
grid_search.fit(X_train, y_train)

# Get best cross-validation score from GridSearchCV
cv_score = grid_search.best_score_.round(3)

# Evaluate best model on test data
test_score = grid_search.score(X_test, y_test).round(3)

# Get best estimator (model) from GridSearchCV
clf_gnb = grid_search.best_estimator_

# Print Gaussian Naive Bayes model metrics (both values are ROC AUC, not accuracy, because scoring='roc_auc')
print(f'Cross-Validation ROC AUC Gaussian Naive Bayes: {cv_score}\nTesting ROC AUC Gaussian Naive Bayes: {test_score}\n')

# Generate predictions on test set using the best Gaussian Naive Bayes model
predictions = clf_gnb.predict(X_test)

# Compute confusion matrix for test predictions
cm = confusion_matrix(y_test, predictions, labels=clf_gnb.classes_)

# Create a figure with two subplots for ROC curve and confusion matrix side-by-side
fig, axs = plt.subplots(1, 2)

# Plot the ROC curve for the best Gaussian Naive Bayes model on the Testing Dataset, directly on the left subplot
RocCurveDisplay.from_estimator(clf_gnb, X_test, y_test, ax=axs[0])

# Title each subplot: ROC curve on the left, confusion matrix on the right
axs[0].set_title("GNB Testing ROC Curve")
axs[1].set_title("GNB Confusion Matrix")

# Initialize ConfusionMatrixDisplay with computed confusion matrix and display labels
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=clf_gnb.classes_)

# Disable grid lines on the confusion matrix plot.
plt.grid(False)

# Plot the confusion matrix with a color map ('Purples')
disp.plot(cmap='Purples', ax=axs[1])

# Display the figure containing both plots.
plt.show()

## Another Gaussian Naive Bayes

A simpler version: no grid search, SMOTE, or cross-validation.

In [None]:
from sklearn.naive_bayes import GaussianNB
# Instantiate Gaussian Naive Bayes classifier
classifier = GaussianNB()

# Train the Gaussian Naive Bayes classifier on the training data
classifier.fit(X_train, y_train)

# Make predictions on the test and training data
y_pred = classifier.predict(X_test)
y_pred_t = classifier.predict(X_train)

# Calculate confusion matrix for test predictions
cm = confusion_matrix(y_test, y_pred)

# Calculate accuracy scores for training and testing
ac = accuracy_score(y_test, y_pred)
act = accuracy_score(y_train, y_pred_t)

# Print accuracy scores for training and testing
print("Naive Bayes Training Accuracy: " + str(act))
print("Naive Bayes Testing Accuracy: " + str(ac))

# Create a figure with two subplots for ROC curve and confusion matrix side-by-side
fig, axs = plt.subplots(1, 2)

# Plot the ROC curve for the Gaussian Naive Bayes model on the Testing Dataset using predicted probabilities.
# (Hard 0/1 predictions would give only a single operating point on the curve.)
RocCurveDisplay.from_estimator(classifier, X_test, y_test, ax=axs[0])

# Initialize ConfusionMatrixDisplay with computed confusion matrix and display labels
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=classifier.classes_)

# Plot the confusion matrix with a color map ('Purples')
disp.plot(cmap='Purples', ax=axs[1])

# Disable grid lines on the confusion matrix plot.
plt.grid(False)

# Display the figure containing both plots.
plt.show()
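
An ROC curve is built by sweeping a decision threshold across the model's predicted scores. The sketch below, on synthetic data, compares the curve computed from hard 0/1 predictions with the curve computed from predicted probabilities. All names (`X_roc`, `gnb_roc`, and so on) are hypothetical and used only for this illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic binary classification data (assumption for illustration only)
X_roc, y_roc = make_classification(n_samples=500, n_features=8, random_state=0)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(
    X_roc, y_roc, test_size=0.3, stratify=y_roc, random_state=0
)

gnb_roc = GaussianNB().fit(Xr_train, yr_train)

hard_labels = gnb_roc.predict(Xr_test)            # only 0 or 1
pos_proba = gnb_roc.predict_proba(Xr_test)[:, 1]  # probability of class 1

fpr_labels, tpr_labels, _ = roc_curve(yr_test, hard_labels)
fpr_proba, tpr_proba, _ = roc_curve(yr_test, pos_proba)

print("Curve points from hard labels:  ", len(fpr_labels))
print("Curve points from probabilities:", len(fpr_proba))
print("AUC from hard labels:  ", round(roc_auc_score(yr_test, hard_labels), 3))
print("AUC from probabilities:", round(roc_auc_score(yr_test, pos_proba), 3))
```

With hard labels the curve has only one operating point between its two corners, so it shows a single threshold. Probabilities trace the trade-off across many thresholds.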

**Gaussian NB**

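***Advantages:***

Efficiency: Gaussian NB is computationally efficient because it only estimates a mean and a variance for each feature within each class, so it can handle large datasets with high-dimensional feature spaces.

Simple and Fast: It's straightforward to implement and works well with small amounts of training data.

***Limitations:***

Independence Assumption: Gaussian NB assumes that features are independent of each other within each class. If features are strongly correlated, the same evidence is counted more than once, which can make predicted probabilities overconfident and hurt performance.

Gaussian Assumption: Each feature is modeled as normally distributed within each class. If the actual feature distributions are far from Gaussian (for example, heavily skewed features), Gaussian NB might not model the data well.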
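
The independence limitation can be illustrated with an extreme case: the same feature copied five times, so the copies are perfectly correlated. This is a minimal sketch on synthetic data, and the names `x_single`, `x_copied`, `nb_single`, and `nb_copied` are assumptions for illustration.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# One informative feature, with equal class sizes (assumption for illustration)
rng = np.random.default_rng(0)
x_class0 = rng.normal(0.0, 1.0, size=100)
x_class1 = rng.normal(1.0, 1.0, size=100)
x_single = np.concatenate([x_class0, x_class1]).reshape(-1, 1)
y_corr = np.repeat([0, 1], 100)

# Five identical copies of the same feature: perfectly correlated columns
x_copied = np.repeat(x_single, 5, axis=1)

nb_single = GaussianNB().fit(x_single, y_corr)
nb_copied = GaussianNB().fit(x_copied, y_corr)

# Predicted probability of class 1 for the same new observation
new_value = 1.0
p_single = nb_single.predict_proba([[new_value]])[0, 1]
p_copied = nb_copied.predict_proba([[new_value] * 5])[0, 1]

print("P(class 1) with one feature:        ", round(p_single, 3))
print("P(class 1) with five copied features:", round(p_copied, 3))
```

The copied columns add no new information, yet the model treats each one as independent evidence, so the probability moves further from 0.5. Real correlated features cause a milder version of the same effect.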