# XGBoost: Predicting heart disease

## Import modules

In [12]:
import numpy as np  
# Importing the NumPy library for efficient numerical operations, such as working with arrays and performing mathematical computations.

import matplotlib.pyplot as plt  
# Importing Matplotlib for creating visualizations, which are essential for understanding data trends and model performance.

import pandas as pd  
# Importing Pandas for data manipulation and analysis, such as loading, transforming, and summarizing datasets using DataFrames.

from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix, classification_report, roc_auc_score  
# Importing various evaluation metrics to measure the performance of classification models, including precision, recall, F1-score, and ROC-AUC.

from xgboost import XGBClassifier  
# Importing the XGBoost classifier, a powerful and efficient machine learning algorithm optimized for gradient boosting.

from sklearn.model_selection import GridSearchCV  
# Importing GridSearchCV to optimize the model's hyperparameters by performing an exhaustive search over a predefined parameter grid.

from sklearn.model_selection import train_test_split  
# Importing train_test_split to divide the dataset into training and testing subsets, ensuring proper evaluation of the model's performance.

import joblib  
# Importing joblib to save and load trained models efficiently, enabling reuse without retraining.

import warnings  
# Importing warnings to handle and suppress unwanted warning messages for a cleaner notebook output.

import os  
# Importing os to interact with the file system, enabling navigation and accessing data stored in directories.

# Traversing the 'data' directory (the same relative folder the CSV is loaded from) and its subdirectories to locate data files.
for dirname, _, filenames in os.walk('data'):  
    for filename in filenames:  
        # Printing the full path of each file to identify available datasets for analysis or modeling.
        print(os.path.join(dirname, filename))

warnings.filterwarnings('ignore')  
# Suppressing warnings to improve notebook readability, especially for warnings that do not impact critical operations.

The cell above imports the modules needed for this notebook.

## Import data

In [2]:
data = pd.read_csv("data/cleaned_merged_heart_dataset.csv")  
# Loading the cleaned and merged heart dataset into a Pandas DataFrame.
# The file path "data/cleaned_merged_heart_dataset.csv" points to a CSV file containing the processed data.

data.info()  
# Displaying a concise summary of the dataset, including:
# - The number of rows and columns (shape of the DataFrame).
# - The data types of each column (e.g., int64, float64, object).
# - The number of non-null entries in each column to identify missing values.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1888 entries, 0 to 1887
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1888 non-null   int64  
 1   sex       1888 non-null   int64  
 2   cp        1888 non-null   int64  
 3   trestbps  1888 non-null   int64  
 4   chol      1888 non-null   int64  
 5   fbs       1888 non-null   int64  
 6   restecg   1888 non-null   int64  
 7   thalachh  1888 non-null   int64  
 8   exang     1888 non-null   int64  
 9   oldpeak   1888 non-null   float64
 10  slope     1888 non-null   int64  
 11  ca        1888 non-null   int64  
 12  thal      1888 non-null   int64  
 13  target    1888 non-null   int64  
dtypes: float64(1), int64(13)
memory usage: 206.6 KB


### Dataset Description:

1. **Structure**:
   - The dataset has **1888 rows** (entries) and **14 columns** (13 features plus the `target` column).
   - The entries represent individual observations, likely patient records in the context of heart health.

2. **Columns**:
   - **13 features** (input variables) and **1 target column** (`target`), which may indicate the outcome of interest (e.g., presence or absence of heart disease).
   - The features are a mix of integers (`int64`) and one floating-point column (`float64`).

3. **Data Quality**:
   - All columns have **1888 non-null values**, indicating no missing data.
   - This ensures consistency and means no immediate need for imputation or removal of rows/columns due to missing entries.

### Insights:
This dataset is well-prepared, with no missing values and appropriate data types. It is ready for predicting the `target` variable based on the other features.

## Data reading

In [3]:
# Display a quick preview of the dataset's structure, statistics, and missing values.

# Display the first five rows of the dataset
print("First five rows of the dataset:")
print(data.head())

# Generate summary statistics for numerical columns
print("\nSummary statistics for numerical columns:")
print(data.describe())

# Check for any missing values in each column
print("\nMissing values per column:")
print(data.isna().sum())

First five rows of the dataset:
   age  sex  cp  trestbps  chol  fbs  restecg  thalachh  exang  oldpeak  \
0   63    1   3       145   233    1        0       150      0      2.3   
1   37    1   2       130   250    0        1       187      0      3.5   
2   41    0   1       130   204    0        0       172      0      1.4   
3   56    1   1       120   236    0        1       178      0      0.8   
4   57    0   0       120   354    0        1       163      1      0.6   

   slope  ca  thal  target  
0      0   0     1       1  
1      0   0     2       1  
2      2   0     2       1  
3      2   0     2       1  
4      2   0     2       1  

Summary statistics for numerical columns:
               age          sex           cp     trestbps         chol  \
count  1888.000000  1888.000000  1888.000000  1888.000000  1888.000000   
mean     54.354343     0.688559     1.279131   131.549258   246.855403   
std       9.081505     0.463205     1.280877    17.556985    51.609329   
min 

### Key Insights from the Dataset:

1. **First Five Rows**:
   - The dataset includes features like `age`, `sex`, and `chol` (cholesterol levels), which are numeric and provide clinical or demographic data for each record.
   - Features like `cp` (possibly chest pain type), `thal`, and `target` are categorical or ordinal, indicated by their discrete integer values.
   - The `target` column appears to represent the outcome (e.g., presence or absence of heart disease), with values of `1` suggesting the presence.

2. **Summary Statistics**:
   - **Age**: Ranges from 29 to 77, with a mean of ~54. This suggests the dataset covers adults, with most patients in their late 40s to early 60s (based on quartiles).
   - **Sex**: Coded as 0 and 1, with a mean of ~0.69, indicating more male patients (likely coded as `1`) than female.
   - **Cholesterol (`chol`)**: Values range widely (126 to 564), with a mean of ~247. This variability is typical in medical datasets.
   - **Chest Pain Type (`cp`)**: The median (`50%`) is `1`, and the max is `4`, indicating multiple chest pain categories. 
   - **Resting Blood Pressure (`trestbps`)**: Ranges from 94 to 200, with a mean of ~132, which aligns with typical ranges in clinical settings.

3. **Missing Values**:
   - No missing values (`0` missing in all columns), indicating the dataset is complete and requires no imputation.

### General Observations:
- The dataset is clean, with no missing values, making it ready for analysis. Class balance is best confirmed from the distribution of `target` rather than assumed.
- It likely involves a classification task where features are used to predict the `target` variable (e.g., heart disease presence).
- Continuous variables (e.g., `chol`, `age`) show variability, which is helpful for machine learning models.
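
Whether the classes are balanced can be checked directly from the label column. The sketch below uses a small hypothetical DataFrame (`toy_data`, with invented values) in place of the notebook's `data`; with the real dataset the same call would be applied to `data["target"]`.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's `data`; the values are invented for illustration.
toy_data = pd.DataFrame({"target": [1, 0, 1, 1, 0, 0, 1, 0]})

# Share of each class; values near 0.5 indicate a balanced binary target.
class_share = toy_data["target"].value_counts(normalize=True).sort_index()
print(class_share)
```

If the shares were far from 0.5, accuracy alone would be a weaker tuning metric, and a stratified split would be worth considering.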


## XGBoost Hyperparameter Tuning Using Grid Search

In [None]:
# Separate features and target variable
X = data.drop(columns=["target"])  
# Removes the target column ('target') from the dataset to use the remaining columns as input features (X).
# This step ensures that the target variable is not included as a predictor.

y = data["target"]  
# Assigns the target column to the variable `y`, representing the dependent variable (output) to be predicted by the model.

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)  
# - Splits the dataset into training (80%) and testing (20%) subsets.
# - `test_size=0.2`: Reserves 20% of the data for testing the model's performance.
# - `random_state=42`: Ensures reproducibility of the split, meaning the same split will occur every time this code is run.

# Define the parameter grid for XGBoost
param_grid = {
    'learning_rate': [0.01, 0.05, 0.1],  # Shrinkage applied to each new tree's contribution. Lower values often generalize better but require more trees.
    'n_estimators': [50, 100, 200],  # Number of boosted trees to build. More trees often improve performance but increase training time.
    'max_depth': [3, 7, 10, None],  # Maximum depth of each tree. Deeper trees can capture more complex patterns but may overfit; `None` leaves XGBoost's default depth in place.
    'subsample': [0.5, 0.9, 1.0],  # Fraction of training rows sampled for each tree. Lower values help prevent overfitting.
    'colsample_bytree': [0.8, 1.0],  # Fraction of features sampled once per tree. Reducing this can improve generalization.
    'reg_alpha': [0, 0.01, 0.1],  # Strength of L1 regularization on leaf weights. Larger values push more leaf weights toward zero.
    'reg_lambda': [1, 2]  # Strength of L2 regularization on leaf weights. Larger values shrink leaf weights to prevent overfitting.
}

# Initialize XGBoost Classifier
xgb = XGBClassifier(objective='binary:logistic', random_state=42, use_label_encoder=False)
# - `objective='binary:logistic'`: Specifies binary classification, with predictions returned as probabilities from the logistic function.
# - `random_state=42`: Ensures reproducibility of the model's internal randomness.
# - `use_label_encoder=False`: Silences a label-encoder deprecation warning in older XGBoost releases; the argument is deprecated, so newer releases may ignore it or warn that it is unused.

# Perform hyperparameter tuning using Grid Search
grid_search = GridSearchCV(
    estimator=xgb, 
    param_grid=param_grid, 
    scoring='accuracy',  # Optimizes the model based on accuracy, a common metric for classification tasks.
    cv=5,  # Performs 5-fold cross-validation for each combination of parameters to ensure robust evaluation.
    verbose=2,  # Prints detailed output about the progress of the grid search.
    n_jobs=-1  # Utilizes all available CPU cores to parallelize computations and speed up the search.
)

# Fit the model to the training data
grid_search.fit(X_train, y_train)  
# Trains the XGBoost model using the training dataset and evaluates all parameter combinations defined in `param_grid`.
# The best combination of hyperparameters is selected based on the highest cross-validation accuracy.




Fitting 5 folds for each of 1296 candidates, totalling 6480 fits
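
The candidate count in this log line follows from the grid itself: it is the product of the number of values listed for each hyperparameter, and the number of fits is that count times the number of folds. A minimal sketch that reproduces the arithmetic with the same grid values (`n_folds` is an illustrative name, not part of the notebook):

```python
from math import prod

# Same values as the `param_grid` defined in the cell above.
param_grid = {
    'learning_rate': [0.01, 0.05, 0.1],
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 7, 10, None],
    'subsample': [0.5, 0.9, 1.0],
    'colsample_bytree': [0.8, 1.0],
    'reg_alpha': [0, 0.01, 0.1],
    'reg_lambda': [1, 2],
}
n_folds = 5  # matches cv=5 in GridSearchCV

# One candidate per combination of values: 3 * 3 * 4 * 3 * 2 * 3 * 2.
n_candidates = prod(len(values) for values in param_grid.values())
n_fits = n_candidates * n_folds
print(n_candidates, n_fits)  # 1296 6480
```

Because the count grows multiplicatively, adding one more value to any list noticeably increases the run time of the search.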


## Model Evaluation and Metrics

In [None]:
# Retrieve the best model from GridSearchCV
best_model = grid_search.best_estimator_  
# - Extracts the model with the optimal hyperparameters found during grid search.
# - This model is tuned to maximize cross-validation accuracy.

# Predict on the test set
y_pred = best_model.predict(X_test)  
# - Generates predictions for the test dataset using the tuned model.
# - Helps evaluate the model's performance on unseen data.

# Best parameters and best score
print("Best Parameters:", grid_search.best_params_)  
# - Displays the optimal combination of hyperparameters selected by grid search.
# - Useful for understanding which settings lead to the best performance.

print("Best Score:", grid_search.best_score_)  
# - Prints the highest mean cross-validation accuracy achieved during grid search.
# - Indicates how well the best configuration scored on the held-out validation folds (not on the test set).

# Precision
precision = precision_score(y_test, y_pred)  
# - Computes precision, which is the fraction of true positive predictions among all positive predictions.
# - High precision means fewer false positives.
print(f"Precision: {precision:.4f}")

# Recall
recall = recall_score(y_test, y_pred)  
# - Computes recall (sensitivity), which is the fraction of true positives correctly identified.
# - High recall means fewer false negatives.
print(f"Recall: {recall:.4f}")

# F1 Score
f1 = f1_score(y_test, y_pred)  
# - Computes the F1 score, a metric that balances precision and recall.
# - Particularly useful for evaluating models on imbalanced datasets.
print(f"F1 Score: {f1:.4f}")

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)  
# - Generates a confusion matrix, summarizing the counts of true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN).
# - Helps visualize the model's prediction errors.
print("Confusion Matrix:")
print(cm)

# Classification Report
report = classification_report(y_test, y_pred)  
# - Produces a detailed report that includes precision, recall, F1-score, and support (number of instances) for each class.
# - Useful for multi-class and binary classification performance evaluation.
print("Classification Report:")
print(report)

# ROC AUC Score (for binary classification)
if len(set(y_test)) == 2:  # Check if it's binary classification
    auc = roc_auc_score(y_test, best_model.predict_proba(X_test)[:, 1])  
    # - Computes the Area Under the Receiver Operating Characteristic Curve (ROC AUC).
    # - The score evaluates the model's ability to discriminate between positive and negative classes across all thresholds.
    # - Uses `predict_proba` to obtain probability estimates for the positive class.
    print(f"ROC AUC: {auc:.4f}")


Best Parameters: {'colsample_bytree': 0.8, 'learning_rate': 0.05, 'max_depth': 10, 'n_estimators': 200, 'reg_alpha': 0, 'reg_lambda': 2, 'subsample': 0.9}
Best Score: 0.9708609271523179
Precision: 0.9531
Recall: 0.9632
F1 Score: 0.9581
Confusion Matrix:
[[179   9]
 [  7 183]]
Classification Report:
              precision    recall  f1-score   support

           0       0.96      0.95      0.96       188
           1       0.95      0.96      0.96       190

    accuracy                           0.96       378
   macro avg       0.96      0.96      0.96       378
weighted avg       0.96      0.96      0.96       378

ROC AUC: 0.9922


### Interpretation of the Model Evaluation Metrics:

1. **Best Parameters**:
   - The optimal hyperparameters found by the grid search are:
     - `colsample_bytree=0.8`: Uses 80% of the features to train each tree, improving generalization and reducing overfitting.
     - `learning_rate=0.05`: A fairly low learning rate shrinks each tree's contribution, which pairs naturally with a larger number of trees.
     - `max_depth=10`: The deepest setting in the grid. Deep trees capture complex interactions but raise the risk of overfitting, which subsampling and regularization help offset.
     - `n_estimators=200`: Builds 200 trees, the largest value in the grid, so a wider range might be worth exploring.
     - `reg_alpha=0`: No L1 regularization applied.
     - `reg_lambda=2`: Moderate L2 regularization reduces overfitting by penalizing large leaf weights.
     - `subsample=0.9`: Uses 90% of the training data for each tree, balancing variance and bias.

2. **Best Score**:
   - The cross-validation accuracy of **97.09%** is the mean accuracy of the best configuration over the five held-out validation folds. It is close to the test-set accuracy of about 96%, which suggests the model generalizes well.

3. **Evaluation Metrics**:
   - **Precision (0.9531)**: Of the 192 cases predicted positive, 183 (95.31%) are actual positives, so false positives are relatively few.
   - **Recall (0.9632)**: Of the 190 actual positives, 183 (96.32%) are identified, so false negatives are relatively few.
   - **F1 Score (0.9581)**: The harmonic mean of precision and recall reflects a balanced performance between avoiding false positives and negatives.

4. **Confusion Matrix**:
   - **True Negatives (TN)**: 179 cases correctly identified as class 0.
   - **False Positives (FP)**: 9 cases incorrectly identified as class 1.
   - **False Negatives (FN)**: 7 cases incorrectly identified as class 0.
   - **True Positives (TP)**: 183 cases correctly identified as class 1.

5. **Classification Report**:
   - **Class 0**: 
     - Precision: 96% (proportion of correctly identified negatives among all predicted negatives).
     - Recall: 95% (proportion of actual negatives correctly identified).
     - F1-Score: 96% (balanced metric).
   - **Class 1**: 
     - Precision: 95%, Recall: 96%, F1-Score: 96%.
   - **Overall Accuracy**: **96%**, indicating strong performance across both classes.

6. **ROC AUC (0.9922)**:
   - Indicates the model’s excellent ability to separate positive and negative classes. A score close to 1 shows high discrimination capability.
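
The precision, recall, F1, and accuracy figures above can all be recomputed from the four confusion-matrix counts. A minimal sketch, using the counts printed in the output and no library calls:

```python
# Counts taken from the confusion matrix printed above: [[TN, FP], [FN, TP]].
tn, fp, fn, tp = 179, 9, 7, 183

precision = tp / (tp + fp)                           # 183 / 192
recall = tp / (tp + fn)                              # 183 / 190
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two
accuracy = (tp + tn) / (tp + tn + fp + fn)           # 362 / 378

print(f"Precision: {precision:.4f}")  # 0.9531
print(f"Recall: {recall:.4f}")        # 0.9632
print(f"F1 Score: {f1:.4f}")          # 0.9581
print(f"Accuracy: {accuracy:.4f}")    # 0.9577
```

The results match the values reported by `precision_score`, `recall_score`, and `f1_score`, and the accuracy rounds to the 0.96 shown in the classification report.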

### Insights:
- The model is highly accurate and balances precision and recall well, making it effective for tasks requiring a balance between false positives and false negatives.
- The ROC AUC of 0.9922 means the model almost always assigns a higher predicted probability to a positive case than to a negative one, so its ranking quality does not depend on the default 0.5 threshold.
- Slight room for improvement may exist by addressing the small number of false positives and false negatives.
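
ROC AUC can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The sketch below illustrates that reading on invented labels and scores. `pairwise_auc` is a hypothetical helper written for this illustration, not a scikit-learn function, and its quadratic loop is only suitable for small inputs.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("Both classes are needed to compute AUC.")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Invented toy data: two negatives and two positives with predicted probabilities.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(toy_labels, toy_scores))  # 0.75
```

Here three of the four positive-negative pairs are ordered correctly, giving 0.75; a value of 0.9922 means nearly every such pair in the test set is ordered correctly.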

## Saving model

In [17]:
# Best model from grid search
best_model = grid_search.best_estimator_

# Save the best model to a file
joblib.dump(best_model, 'best_xgboost_model.pkl')

print("Best model saved successfully!")

Best model saved successfully!
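
A saved file is only useful if it can be read back. The sketch below shows the `joblib.dump` / `joblib.load` round trip on a plain dictionary standing in for `best_model`, written to a temporary directory so it does not touch the notebook's files. The stand-in object and the temporary path are assumptions made for illustration.

```python
import os
import tempfile

import joblib

# Stand-in object; in the notebook this would be `best_model`.
stand_in_model = {"max_depth": 10, "n_estimators": 200}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "best_xgboost_model.pkl")
    joblib.dump(stand_in_model, model_path)    # save to disk
    reloaded_model = joblib.load(model_path)   # read it back

print(reloaded_model == stand_in_model)  # True
```

With the real model, the reloaded object would expose the same `predict` and `predict_proba` methods. Pickled models are generally safest to load with the same library versions that were used to save them.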
