Introduction:


This notebook implements a medical prediction system based on ensemble learning: a soft-voting VotingClassifier combines XGBoost and Random Forest models to predict delays in RRT (Renal Replacement Therapy). The inputs are six measurements from a cleaned dataset: calcium_max, calcium_min, creatinine_min, aki_stage, aniongap_min, and pt_max. RandomizedSearchCV tunes the hyperparameters on a 30% subset of the training data, and the best estimator (refit on that same subset) is evaluated on the held-out test set. A helper function, predict_rrt_delay, takes one patient's values and returns a plain yes/no prediction about RRT delay. The model reaches about 81% accuracy on the 356-sample test set, but the output also warns that 26 of the 60 cross-validation fits failed during the search, which should be investigated before the results are relied on.
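
The VotingClassifier in the cell below is created with voting='soft', which means the ensemble averages the class probabilities predicted by each model and picks the class with the highest average. The following is a minimal sketch of that idea in plain Python. The function name soft_vote and the probability values are hypothetical and serve only as illustration; they are not taken from the trained models.

```python
def soft_vote(probabilities_per_model):
    """Average per-class probabilities across models and pick the largest.

    probabilities_per_model: list of lists, one inner list of class
    probabilities per model (all inner lists have the same length).
    """
    n_models = len(probabilities_per_model)
    n_classes = len(probabilities_per_model[0])
    # Unweighted mean of each class probability across the models
    averaged = [
        sum(model_proba[c] for model_proba in probabilities_per_model) / n_models
        for c in range(n_classes)
    ]
    # The predicted class is the index of the highest averaged probability
    predicted_class = max(range(n_classes), key=lambda c: averaged[c])
    return averaged, predicted_class


# Hypothetical probabilities for one patient: [P(class 0), P(class 1)]
xgb_proba = [0.30, 0.70]
rf_proba = [0.60, 0.40]

averaged, predicted = soft_vote([xgb_proba, rf_proba])
print("Averaged probabilities:", averaged)
print("Predicted class:", predicted)
```

In this made-up case the two models disagree, and the more confident one decides the outcome. Hard voting would instead count one vote per model and would need a tie-break rule.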

In [1]:
from sklearn.ensemble import VotingClassifier
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.metrics import classification_report, accuracy_score
import pandas as pd

# Load the dataset
df = pd.read_csv('cleaned_data.csv')
X = df[['calcium_max', 'creatinine_min', 'aki_stage', 'aniongap_min', 'calcium_min', 'pt_max']]
y = df['delay_rrt']

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define reduced parameter grids for XGBoost and Random Forest
xgb_param_grid = {
    'xgb__eta': [0.1],
    'xgb__max_depth': [3, 5],
    'xgb__n_estimators': [100, 300]
}

rf_param_grid = {
    'rf__n_estimators': [100, 300],
    'rf__max_depth': [None, 10],
    'rf__min_samples_split': [2, 5]
}

# Combine the parameter grids
param_grid = {**xgb_param_grid, **rf_param_grid}

# Initialize the models
xgb_model = XGBClassifier(random_state=42)
rf_model = RandomForestClassifier(random_state=42)

# Create the VotingClassifier
voting_model = VotingClassifier(estimators=[('xgb', xgb_model), ('rf', rf_model)], voting='soft')

# Use a smaller subset of the data for initial tuning
X_train_sample, _, y_train_sample, _ = train_test_split(X_train, y_train, train_size=0.3, random_state=42)

# Perform RandomizedSearchCV
random_search = RandomizedSearchCV(
    voting_model,
    param_distributions=param_grid,
    n_iter=20,
    cv=3,
    scoring='accuracy',
    n_jobs=-1,
    random_state=42
)
random_search.fit(X_train_sample, y_train_sample)

# Get the best model
best_model = random_search.best_estimator_

# Evaluate the best model on the full test set
y_pred = best_model.predict(X_test)
print("Best parameters:", random_search.best_params_)
print("Classification Report:\n", classification_report(y_test, y_pred))
print("Accuracy:", accuracy_score(y_test, y_pred))

# Helper that predicts RRT delay for a single patient with the given model
def predict_rrt_delay(calcium_max, creatinine_min, aki_stage, aniongap_min, calcium_min, pt_max, model):
    """Return a yes/no message for one patient's measurements."""
    # Build a one-row DataFrame with the same column names and order used for training
    input_data = pd.DataFrame([[calcium_max, creatinine_min, aki_stage, aniongap_min, calcium_min, pt_max]],
                              columns=['calcium_max', 'creatinine_min', 'aki_stage', 'aniongap_min', 'calcium_min', 'pt_max'])

    # predict returns an array with one label; 1 means a delay is predicted
    prediction = model.predict(input_data)

    return "RRT Delay is predicted." if prediction[0] == 1 else "RRT Delay is not predicted."

# Example of user input and model prediction using the best model
result = predict_rrt_delay(
    calcium_max=9.2, 
    creatinine_min=0.3, 
    aki_stage=2, 
    aniongap_min=11, 
    calcium_min=7.7, 
    pt_max=14.5,  
    model=best_model
)
print(result)


26 fits failed out of a total of 60.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
2 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\Harshvardhan\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\sklearn\model_selection\_validation.py", line 888, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\Harshvardhan\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\sklearn\base.py", line 1473, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Har

Best parameters: {'xgb__n_estimators': 100, 'xgb__max_depth': 3, 'xgb__eta': 0.1, 'rf__n_estimators': 100, 'rf__min_samples_split': 5, 'rf__max_depth': 10}
Classification Report:
               precision    recall  f1-score   support

           0       0.44      0.17      0.25        63
           1       0.84      0.95      0.89       293

    accuracy                           0.81       356
   macro avg       0.64      0.56      0.57       356
weighted avg       0.77      0.81      0.78       356

Accuracy: 0.8146067415730337
RRT Delay is predicted.
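
The warning at the top of this output reports that 26 of the 60 fits (20 sampled candidates × 3 folds) failed, and it suggests setting error_score='raise' to debug them. The traceback is cut off in the captured output, so the actual cause is not known here. The sketch below only shows the debugging pattern on synthetic data: it uses a Random Forest alone and a deliberately invalid max_depth of -1 to provoke a failure. These choices are assumptions made for illustration and are not the cause of the failures above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (assumption: not the notebook's dataset)
X_demo, y_demo = make_classification(n_samples=120, n_features=6, random_state=42)

# max_depth=-1 is deliberately invalid so that one candidate fails to fit
param_distributions = {"max_depth": [3, -1], "n_estimators": [10]}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=2,
    cv=3,
    scoring="accuracy",
    n_jobs=1,              # run sequentially so the traceback is easy to read
    error_score="raise",   # re-raise the first failure instead of scoring it as nan
    random_state=42,
)

first_error = None
try:
    search.fit(X_demo, y_demo)
except Exception as exc:
    # With error_score="raise", the original exception surfaces here
    first_error = exc
    print(f"{type(exc).__name__}: {exc}")
```

Applied to the search in this notebook, the same two settings (error_score='raise' and, temporarily, n_jobs=1) should surface the full error for the first failing fit.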


Observations:

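Overall accuracy is 0.81, slightly below the 0.82 (293/356) that always predicting class 1 would achieve

Excellent recall (0.95) for RRT Delay predictions (class 1)

Good precision (0.84) for RRT Delay cases

High F1-score (0.89) for the majority class

Weak results for the minority class (class 0): precision 0.44, recall 0.17, F1-score 0.25

Weighted average F1-score (0.78) is driven by the majority class; the macro average F1-score (0.57) better reflects the imbalance between classes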
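
To put the accuracy figure in context, it helps to compare it with a majority-class baseline and with balanced accuracy. The short calculation below is a sketch that uses only numbers from the classification report. The per-class counts of correct predictions (11 and 279) are not printed in the output; they are inferred here from the rounded recall values and the exact accuracy, so treat them as a reconstruction.

```python
# Test-set support per class, taken from the classification report
support = {0: 63, 1: 293}

# Correct predictions per class, inferred from the report:
# recall 0.17 * 63 ≈ 11 and recall 0.95 * 293 ≈ 279, and 11 + 279 = 290
# reproduces the printed accuracy of 290 / 356
correct = {0: 11, 1: 279}

total = sum(support.values())

# Fraction of all test samples that were classified correctly
accuracy = sum(correct.values()) / total

# Accuracy of a trivial model that always predicts the most common class
majority_baseline = max(support.values()) / total

# Recall per class, then their unweighted mean (balanced accuracy)
recall = {label: correct[label] / support[label] for label in support}
balanced_accuracy = sum(recall.values()) / len(recall)

print(f"Accuracy:          {accuracy:.4f}")
print(f"Majority baseline: {majority_baseline:.4f}")
print(f"Balanced accuracy: {balanced_accuracy:.4f}")
```

Balanced accuracy weights both classes equally, so it is not inflated by the large share of class 1 samples in the test set.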