# Random Forest: Predicting heart disease

## Import modules

In [1]:
import numpy as np  # Importing the NumPy library for numerical operations, such as array manipulation and mathematical computations
import matplotlib.pyplot as plt  # Importing Matplotlib for creating visualizations like plots and graphs
import pandas as pd  # Importing Pandas for handling and analyzing structured data using DataFrames
from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix, classification_report, roc_auc_score
# Importing various evaluation metrics from scikit-learn to assess model performance

from sklearn.ensemble import RandomForestClassifier
# Importing the Random Forest algorithm for classification tasks

from sklearn.model_selection import GridSearchCV
# Importing GridSearchCV to perform hyperparameter tuning by testing different parameter combinations

from sklearn.model_selection import train_test_split  
# Importing a function to split datasets into training and testing subsets for model validation

import joblib  
# Importing joblib to save and load models, which is helpful for reusing trained models without retraining

import warnings 
# Importing warnings to handle and suppress unwanted warnings in the notebook output

import os  
# Importing os to interact with the operating system, such as navigating directories or accessing files

# Traversing the 'data' directory (the same relative path used by read_csv below) and its subdirectories to find all files
for dirname, _, filenames in os.walk('data'):
    for filename in filenames:
        # Printing the full path of each file for reference or exploration
        print(os.path.join(dirname, filename))
        
warnings.filterwarnings('ignore')  
# Suppressing all warnings to ensure a cleaner notebook output, especially when warnings are not critical

The cell above imports the modules used in this notebook.

## Import data

In [2]:
data = pd.read_csv("data/cleaned_merged_heart_dataset.csv")  
# Loading the cleaned and merged heart dataset into a Pandas DataFrame.
# The file path "data/cleaned_merged_heart_dataset.csv" points to a CSV file containing the processed data.

data.info()  
# Displaying a concise summary of the dataset, including:
# - The number of rows and columns (shape of the DataFrame).
# - The data types of each column (e.g., int64, float64, object).
# - The number of non-null entries in each column to identify missing values.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1888 entries, 0 to 1887
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1888 non-null   int64  
 1   sex       1888 non-null   int64  
 2   cp        1888 non-null   int64  
 3   trestbps  1888 non-null   int64  
 4   chol      1888 non-null   int64  
 5   fbs       1888 non-null   int64  
 6   restecg   1888 non-null   int64  
 7   thalachh  1888 non-null   int64  
 8   exang     1888 non-null   int64  
 9   oldpeak   1888 non-null   float64
 10  slope     1888 non-null   int64  
 11  ca        1888 non-null   int64  
 12  thal      1888 non-null   int64  
 13  target    1888 non-null   int64  
dtypes: float64(1), int64(13)
memory usage: 206.6 KB


### Dataset Description:

1. **Structure**:
   - The dataset has **1888 rows** (entries) and **14 columns** (13 features plus the target).
   - The entries represent individual observations, likely patient records in the context of heart health.

2. **Columns**:
   - **13 features** (input variables) and **1 target column** (`target`), which may indicate the outcome of interest (e.g., presence or absence of heart disease).
   - The features are a mix of integers (`int64`) and one floating-point column (`float64`).

3. **Data Quality**:
   - All columns have **1888 non-null values**, so there is no missing data.
   - No imputation or removal of rows or columns is needed.

### Insights:
This dataset is well-prepared, with no missing values and appropriate data types. It is ready for predicting the `target` variable based on the other features.

## Data reading

In [3]:
# Display a quick preview of the dataset's structure, statistics, and missing values.

# Display the first five rows of the dataset
print("First five rows of the dataset:")
print(data.head())

# Generate summary statistics for numerical columns
print("\nSummary statistics for numerical columns:")
print(data.describe())

# Check for any missing values in each column
print("\nMissing values per column:")
print(data.isna().sum())

First five rows of the dataset:
   age  sex  cp  trestbps  chol  fbs  restecg  thalachh  exang  oldpeak  \
0   63    1   3       145   233    1        0       150      0      2.3   
1   37    1   2       130   250    0        1       187      0      3.5   
2   41    0   1       130   204    0        0       172      0      1.4   
3   56    1   1       120   236    0        1       178      0      0.8   
4   57    0   0       120   354    0        1       163      1      0.6   

   slope  ca  thal  target  
0      0   0     1       1  
1      0   0     2       1  
2      2   0     2       1  
3      2   0     2       1  
4      2   0     2       1  

Summary statistics for numerical columns:
               age          sex           cp     trestbps         chol  \
count  1888.000000  1888.000000  1888.000000  1888.000000  1888.000000   
mean     54.354343     0.688559     1.279131   131.549258   246.855403   
std       9.081505     0.463205     1.280877    17.556985    51.609329   
min 

### Key Insights from the Dataset:

1. **First Five Rows**:
   - The dataset includes features like `age`, `sex`, and `chol` (cholesterol levels), which are numeric and provide clinical or demographic data for each record.
   - Features like `cp` (possibly chest pain type), `thal`, and `target` are categorical or ordinal, indicated by their discrete integer values.
   - The `target` column appears to represent the outcome (e.g., presence or absence of heart disease), with values of `1` suggesting the presence.

2. **Summary Statistics**:
   - **Age**: Ranges from 29 to 77, with a mean of ~54. This suggests the dataset covers adults, with most patients in their late 40s to early 60s (based on quartiles).
   - **Sex**: Coded as 0 and 1, with a mean of ~0.69, indicating more male patients (likely coded as `1`) than female.
   - **Cholesterol (`chol`)**: Values range widely (126 to 564), with a mean of ~247. This variability is typical in medical datasets.
   - **Chest Pain Type (`cp`)**: The median (`50%`) is `1`, and the max is `4`, indicating multiple chest pain categories. 
   - **Resting Blood Pressure (`trestbps`)**: Ranges from 94 to 200, with a mean of ~132, which aligns with typical ranges in clinical settings.

3. **Missing Values**:
   - No missing values (`0` missing in all columns), indicating the dataset is complete and requires no imputation.

### General Observations:
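- The dataset is clean (no missing values), making it ready for analysis. Class balance is not shown in this output; the test-set support reported later (188 vs. 190) suggests the two classes are roughly balanced.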
- It likely involves a classification task where features are used to predict the `target` variable (e.g., heart disease presence).
- Continuous variables (e.g., `chol`, `age`) show variability, which is helpful for machine learning models.
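
Class balance can be checked directly with `value_counts` rather than inferred. This is a minimal sketch, assuming a DataFrame with a binary `target` column like the one loaded earlier. The small `demo` frame is made-up illustration data, not the heart dataset.

```python
import pandas as pd

# Hypothetical stand-in for `data`; replace with the real DataFrame.
demo = pd.DataFrame({"target": [1, 0, 1, 1, 0, 0, 1, 0]})

# Absolute and relative frequency of each class label.
counts = demo["target"].value_counts().sort_index()
shares = demo["target"].value_counts(normalize=True).sort_index()

print(counts.to_dict())
print(shares.round(3).to_dict())
```

On the real data, shares close to 0.5 for both labels would support treating accuracy as a reasonable selection metric.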


## Random Forest Hyperparameter Tuning Using Grid Search

In [4]:
# Separate features and target variable
X = data.drop(columns=["target"])  
# Removes the target column ('target') from the dataset to use the remaining columns as input features (X).

y = data["target"]  
# Assigns the target column to the variable `y`, representing the dependent variable to be predicted.

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)  
# - Splits the dataset into training (80%) and testing (20%) sets for model evaluation.
# - `test_size=0.2`: Allocates 20% of the data for testing.
# - `random_state=42`: Ensures reproducibility by using a fixed random seed.

# Define the parameter grid for Random Forest
param_grid = {
    'n_estimators': [50, 100, 200],         # Number of trees in the forest to test
    'max_depth': [None, 10, 15, 20, 30],   # Maximum depth of each tree (None allows full growth)
    'max_features': ['sqrt', 'log2', None],      # Number of features considered for the best split:
                                           # - 'sqrt': square root of the total features.
                                           # - 'log2': base-2 logarithm of the total features.
                                           # - None: all features are considered.
    'min_samples_split': [2, 5, 10],       # Minimum number of samples required to split an internal node
    'min_samples_leaf': [1, 2, 4]          # Minimum number of samples required to be at a leaf node
}

# Initialize the Random Forest Classifier
rf = RandomForestClassifier(random_state=42)  
# Random Forest model initialized with a fixed random seed for reproducibility.

# Perform hyperparameter tuning using Grid Search
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, 
                           cv=5, scoring='accuracy', verbose=2, n_jobs=-1)  
# - `GridSearchCV`: Automates hyperparameter tuning by testing all combinations in `param_grid`.
# - `cv=5`: Performs 5-fold cross-validation for each parameter combination to ensure robustness.
# - `scoring='accuracy'`: Uses accuracy as the evaluation metric for model selection.
# - `verbose=2`: Provides detailed output of the grid search progress.
# - `n_jobs=-1`: Utilizes all available CPU cores to speed up computation.

# Fit the model to the training data
grid_search.fit(X_train, y_train)  
# Runs the grid search: each parameter combination is cross-validated on the training data,
# and the best combination is then refit on the full training set (default `refit=True`).



Fitting 5 folds for each of 405 candidates, totalling 2025 fits
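
The candidate and fit counts in this log line follow from the grid itself. As a quick sanity check, the sketch below reproduces them with the standard library only; the variable names `cv_folds`, `n_candidates`, and `n_fits` are introduced here for illustration.

```python
import math

# Same grid as in the cell above.
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 10, 15, 20, 30],
    "max_features": ["sqrt", "log2", None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}
cv_folds = 5

# Grid search tries every combination, so the candidate count is the
# product of the number of values listed for each hyperparameter.
n_candidates = math.prod(len(values) for values in param_grid.values())
n_fits = n_candidates * cv_folds

print(n_candidates, n_fits)  # 405 2025
```

Adding one more value to any hyperparameter multiplies the cost, so the grid size is worth computing before starting a long search.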


## Model Evaluation and Metrics

In [5]:
best_model = grid_search.best_estimator_  
# Retrieves the best model (Random Forest) identified by the grid search.
# This model has the optimal hyperparameters based on the cross-validation process.

# Predict on the test set
y_pred = best_model.predict(X_test)  
# Generates predictions for the test set using the optimized Random Forest model.

# Best parameters and best score
print("Best Parameters:", grid_search.best_params_)  
# Prints the hyperparameter combination that achieved the highest cross-validation accuracy during the grid search.

print("Best Score:", grid_search.best_score_)  
# Displays the cross-validation accuracy score of the best model.

# Precision
precision = precision_score(y_test, y_pred)  
# Computes precision: the proportion of correctly predicted positive cases out of all predicted positives.
print(f"Precision: {precision:.4f}")

# Recall
recall = recall_score(y_test, y_pred)  
# Computes recall (sensitivity): the proportion of actual positive cases that were correctly identified.
print(f"Recall: {recall:.4f}")

# F1 Score
f1 = f1_score(y_test, y_pred)  
# Computes the F1 score: the harmonic mean of precision and recall, useful for imbalanced datasets.
print(f"F1 Score: {f1:.4f}")

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)  
# Creates a confusion matrix summarizing prediction results, laid out as [[TN, FP], [FN, TP]].
print("Confusion Matrix:")
print(cm)

# Classification Report
report = classification_report(y_test, y_pred)  
# Generates a detailed classification report, including precision, recall, F1-score, and support for each class.
print("Classification Report:")
print(report)

# ROC AUC Score (for binary classification)
if len(set(y_test)) == 2:  # Check if it's binary classification
    auc = roc_auc_score(y_test, best_model.predict_proba(X_test)[:, 1])  
    # Computes the ROC AUC score: a metric that evaluates model performance across all classification thresholds.
    # Uses `predict_proba` to get the probability scores for the positive class.
    print(f"ROC AUC: {auc:.4f}")


Best Parameters: {'max_depth': None, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 50}
Best Score: 0.9695364238410595
Precision: 0.9628
Recall: 0.9526
F1 Score: 0.9577
Confusion Matrix:
[[181   7]
 [  9 181]]
Classification Report:
              precision    recall  f1-score   support

           0       0.95      0.96      0.96       188
           1       0.96      0.95      0.96       190

    accuracy                           0.96       378
   macro avg       0.96      0.96      0.96       378
weighted avg       0.96      0.96      0.96       378

ROC AUC: 0.9950


### Interpretation of the Model Evaluation Metrics:

1. **Best Parameters**:
   - The hyperparameters that produced the best model performance are:
     - `max_depth: None`: No limit on tree depth; trees grow until leaves are pure or a node has fewer samples than `min_samples_split`.
     - `max_features: 'sqrt'`: Only a random subset of features is considered when splitting a node, which helps reduce overfitting.
     - `min_samples_leaf: 1`: A leaf node must contain at least 1 sample.
     - `min_samples_split: 5`: A node must have at least 5 samples to be split.
     - `n_estimators: 50`: The number of trees in the Random Forest is set to 50.

2. **Best Score**: 
   - The best mean cross-validation accuracy is **0.9695**. This score is computed on held-out folds of the training set during the grid search, not on the data each model was fitted on and not on the test set.

3. **Precision**: 
   - **0.9628**: This indicates that when the model predicts a positive class (1), 96.28% of the time, the prediction is correct.

4. **Recall**: 
   - **0.9526**: This indicates that the model correctly identified 95.26% of all actual positive instances (class 1) in the test set.

5. **F1 Score**: 
   - **0.9577**: The harmonic mean of precision and recall, showing a balanced model performance (close to 1, which is ideal).

6. **Confusion Matrix**:
   - The confusion matrix shows how many predictions were correct and incorrect:
     - **True Negatives (181)**: Actual class 0 correctly predicted as class 0.
     - **False Positives (7)**: Actual class 0 incorrectly predicted as class 1.
     - **False Negatives (9)**: Actual class 1 incorrectly predicted as class 0.
     - **True Positives (181)**: Actual class 1 correctly predicted as class 1.
   - The model makes few errors: 7 false positives and 9 false negatives out of 378 test samples.

7. **Classification Report**:
   - **Precision**: For each class, the share of predictions of that class that are correct.
   - **Recall**: For each class, the share of actual instances of that class that the model identifies.
   - **F1-score**: The harmonic mean of precision and recall for each class.
   - **Support**: The number of actual occurrences of each class (188 for class 0 and 190 for class 1).

8. **ROC AUC**:
   - **0.9950**: An excellent ROC AUC score. Across all classification thresholds, the model separates class 0 from class 1 almost perfectly.
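
The precision, recall, and F1 values reported above can be recomputed by hand from the confusion matrix. The sketch below uses only the four counts printed in the output and assumes scikit-learn's `[[TN, FP], [FN, TP]]` layout for labels 0 and 1.

```python
# Confusion matrix from the output above; scikit-learn orders it as
# [[TN, FP], [FN, TP]] for labels 0 and 1.
cm = [[181, 7], [9, 181]]
(tn, fp), (fn, tp) = cm

precision = tp / (tp + fp)            # correct among predicted positives
recall = tp / (tp + fn)               # found among actual positives
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"Precision: {precision:.4f}")  # 0.9628
print(f"Recall:    {recall:.4f}")     # 0.9526
print(f"F1 Score:  {f1:.4f}")         # 0.9577
print(f"Accuracy:  {accuracy:.4f}")   # 0.9577
```

Accuracy and F1 coincide here only because the true positive and true negative counts happen to be equal.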

### Overall Conclusion:
With precision, recall, and F1 all around **0.96** on the test set (362 of 378 predictions correct), the model classifies both classes well. The ROC AUC of **0.9950** confirms strong discriminative ability. The test accuracy (~0.96) is close to the cross-validation score (0.9695), which suggests that the hyperparameters selected by grid search generalize to unseen data from the same dataset.
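
One way to read the ROC AUC figure: it equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as half. The sketch below computes it that way on made-up labels and scores. The helper `pairwise_auc` is hypothetical, written for illustration, and is not how scikit-learn computes the metric internally.

```python
def pairwise_auc(y_true, scores):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Made-up labels and scores, for illustration only.
y_true = [0, 0, 1, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]

auc_demo = pairwise_auc(y_true, scores)
print(round(auc_demo, 4))  # 0.8889
```

Under this reading, a value of 0.9950 means that nearly every positive/negative pair in the test set is ordered correctly by the predicted probabilities.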

## Saving model

In [20]:
# Best model from grid search
best_model = grid_search.best_estimator_

# Save the best model to a file
joblib.dump(best_model, 'best_random_forest_model.pkl')

print("Best model saved successfully!")

Best model saved successfully!
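
A saved model is only useful if it can be loaded back and gives the same predictions. The sketch below shows the round trip with `joblib.dump` and `joblib.load`. It uses a small synthetic dataset from `make_classification` and a temporary file named `demo_random_forest_model.pkl`, both of which are assumptions for illustration rather than the notebook's data or file.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the heart data (13 features, binary target).
X, y = make_classification(n_samples=200, n_features=13, random_state=42)
model = RandomForestClassifier(n_estimators=10, random_state=42).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_random_forest_model.pkl")
    joblib.dump(model, path)          # save, as in the cell above
    loaded_model = joblib.load(path)  # load it back

# The reloaded model should reproduce the original predictions exactly.
same_predictions = bool((loaded_model.predict(X) == model.predict(X)).all())
print(same_predictions)  # True
```

In the notebook, the equivalent call would be `joblib.load('best_random_forest_model.pkl')`. Such files are best loaded only from trusted sources and with a scikit-learn version compatible with the one used for saving.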
