Introduction:
This notebook builds a Random Forest classifier that predicts the delay_rrt label from patient data. Preprocessing is limited to two steps: selecting five clinical features (aniongap_min, creatinine_min, resp_rate_mean, pt_max, potassium_min) and dropping rows with missing values in those features or in the target. The training split is oversampled with SMOTE to address class imbalance, and RandomizedSearchCV tunes the Random Forest hyperparameters. The code is organized into data preparation, model training, evaluation on a held-out test set, and a helper function that returns a prediction for a single patient's measurements.

In [1]:
# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from imblearn.over_sampling import SMOTE


# Load the dataset
df = pd.read_csv('cleaned_data.csv')

# Select the five input features; the target column is 'delay_rrt'
features = ['aniongap_min', 'creatinine_min', 'resp_rate_mean', 'pt_max', 'potassium_min']

# Drop rows with missing values in the selected columns
df_clean = df.dropna(subset=features + ['delay_rrt'])

# Splitting data into input features (X) and target (y)
X = df_clean[features]
y = df_clean['delay_rrt']

# Splitting data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Apply SMOTE to the training split only, so the test set keeps its original class distribution
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

# Hyperparameter search space for the Random Forest (class_weight is fixed to 'balanced')
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [10, None],
    'min_samples_split': [2, 5],
    'class_weight': ['balanced']
}

# Sample 5 of the 8 parameter combinations with RandomizedSearchCV, scoring by precision on class 1
random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42), 
    param_grid, 
    cv=3, 
    scoring='precision', 
    n_iter=5, 
    random_state=42
)
# Train the model
random_search.fit(X_resampled, y_resampled)

# Get the best model
best_model = random_search.best_estimator_

# Evaluating the model with the test set
y_pred_balanced = best_model.predict(X_test)
print("Classification Report:\n", classification_report(y_test, y_pred_balanced))

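# Function to predict early dialysis for a single patient
def predict_early_dialysis(aniongap_min, creatinine_min, resp_rate_mean, pt_max, potassium_min, model):
    """Return a recommendation string for one patient's measurements."""
    # Build a one-row DataFrame with the training column names; passing a bare
    # NumPy array to a model fitted on a DataFrame triggers a feature-name warning
    input_data = pd.DataFrame(
        [[aniongap_min, creatinine_min, resp_rate_mean, pt_max, potassium_min]],
        columns=['aniongap_min', 'creatinine_min', 'resp_rate_mean', 'pt_max', 'potassium_min'],
    )

    # Use the trained model to predict the class (0 or 1)
    prediction = model.predict(input_data)

    # Map the predicted class to a message
    if prediction[0] == 1:
        return "Early dialysis is recommended."
    return "Early dialysis is not required."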

# Example of user input and model prediction
result = predict_early_dialysis(aniongap_min=10, creatinine_min=2.0, resp_rate_mean=18, pt_max=30, potassium_min=4.5, model=best_model)
print(result)


Classification Report:
               precision    recall  f1-score   support

           0       0.29      0.29      0.29        63
           1       0.85      0.85      0.85       293

    accuracy                           0.75       356
   macro avg       0.57      0.57      0.57       356
weighted avg       0.75      0.75      0.75       356

Early dialysis is recommended.
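
The two average rows in the report can be reproduced from the per-class rows. The short sketch below uses only the rounded precision and support values printed above (the variable names are illustrative, not part of the notebook), so it matches the report to two decimals rather than recomputing from the raw predictions:

```python
# Per-class precision and support, copied from the classification report above
precision = {0: 0.29, 1: 0.85}
support = {0: 63, 1: 293}

total = sum(support.values())  # 356 test samples

# Macro average: plain mean over classes, each class counts equally
macro_avg = sum(precision.values()) / len(precision)

# Weighted average: mean over classes, weighted by each class's support
weighted_avg = sum(precision[c] * support[c] for c in precision) / total

print(f"macro avg: {macro_avg:.2f}")        # macro avg: 0.57
print(f"weighted avg: {weighted_avg:.2f}")  # weighted avg: 0.75
```

Because class 1 contributes 293 of the 356 samples, the weighted average sits close to the class 1 score, while the macro average is pulled down by class 0.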




Observations:

1. Poor performance on the minority class (class 0): precision, recall, and F1-score are all 0.29.

2. Strong performance on the majority class (class 1): precision, recall, and F1-score are all 0.85.

3. The test set is heavily imbalanced: 63 samples in class 0 versus 293 in class 1.

4. The gap between the macro average (0.57) and the weighted average (0.75) shows that the weighted figures are dominated by class 1 and mask the weak class 0 results.

5. Precision and recall are equal within each class, which means the model makes about as many false positives as false negatives for that class.