Introduction:


This notebook builds an ensemble classifier that predicts RRT delay (`delay_rrt`) from six laboratory and staging features. The ensemble is a scikit-learn VotingClassifier that combines an XGBoost model and a Random Forest model with soft voting, meaning the class probabilities predicted by the two models are averaged and the class with the highest average probability is chosen. The code proceeds in three steps: data preparation, model training, and evaluation with standard classification metrics. A final cell wraps the trained model in a prediction function that takes one patient's values and returns a readable result.
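To make the soft-voting step concrete, here is a minimal sketch of the averaging it performs. The helper `soft_vote` and the probability values are hypothetical and exist only for illustration; they are not taken from the trained models in this notebook.

```python
import numpy as np

def soft_vote(prob_matrices, weights=None):
    """Average per-model class probabilities and pick the most probable class.

    prob_matrices: list of arrays, each shaped (n_samples, n_classes).
    weights: optional per-model weights; None means a plain average.
    """
    stacked = np.stack([np.asarray(p, dtype=float) for p in prob_matrices])
    avg = np.average(stacked, axis=0, weights=weights)
    return avg, avg.argmax(axis=1)

# Hypothetical probabilities for two samples (columns: class 0, class 1)
xgb_probs = [[0.30, 0.70], [0.60, 0.40]]
rf_probs = [[0.45, 0.55], [0.20, 0.80]]

avg_probs, labels = soft_vote([xgb_probs, rf_probs])
print(avg_probs)
print(labels)
```

For the second sample the two models disagree, and the more confident model decides the outcome. This is the practical difference from hard voting, which counts predicted labels and ignores confidence.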

In [1]:
from sklearn.ensemble import VotingClassifier
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score
import pandas as pd

# Load the dataset
df = pd.read_csv('cleaned_data.csv')
X = df[['calcium_max', 'creatinine_min', 'aki_stage', 'aniongap_min', 'calcium_min', 'pt_max']]
y = df['delay_rrt']

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the models
# eta is an alias of learning_rate in XGBoost, so the rate is set only once
xgb_model = XGBClassifier(max_depth=3, learning_rate=0.1, n_estimators=500)
rf_model = RandomForestClassifier(n_estimators=500, random_state=42)

# Combine the XGBoost and Random Forest models using voting
voting_model = VotingClassifier(estimators=[('xgb', xgb_model), ('rf', rf_model)], voting='soft')

# Train the voting model
voting_model.fit(X_train, y_train)

# Evaluate the model
y_pred = voting_model.predict(X_test)
print("Classification Report:\n", classification_report(y_test, y_pred))
print("Accuracy:", accuracy_score(y_test, y_pred))


Classification Report:
               precision    recall  f1-score   support

           0       0.36      0.14      0.20        63
           1       0.84      0.95      0.89       293

    accuracy                           0.80       356
   macro avg       0.60      0.54      0.55       356
weighted avg       0.75      0.80      0.77       356

Accuracy: 0.8033707865168539
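One way to put this accuracy in context is to compare it with a trivial classifier that always predicts the larger class. The sketch below does this using only numbers copied from the report above; the variable names are illustrative, and the balanced accuracy is approximate because the recall values are rounded to two decimals.

```python
# Numbers copied from the classification report above
support = {0: 63, 1: 293}
recall = {0: 0.14, 1: 0.95}
reported_accuracy = 0.8033707865168539

total = sum(support.values())

# Accuracy obtained by always predicting the larger class
majority_baseline = max(support.values()) / total

# Balanced accuracy is the unweighted mean of the per-class recall values
balanced_accuracy = sum(recall.values()) / len(recall)

print(f"Majority-class baseline accuracy: {majority_baseline:.3f}")
print(f"Reported accuracy:                {reported_accuracy:.3f}")
print(f"Balanced accuracy (approximate):  {balanced_accuracy:.3f}")
```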


Observations:

The overall accuracy is 80.3%, which is slightly below the 82.3% (293/356) that a classifier would reach by always predicting Class 1, so accuracy alone says little about how useful the model is.

The test set is strongly imbalanced (293 Class 1 samples vs. 63 Class 0 samples), which inflates both accuracy and the weighted averages.

For RRT Delays (Class 1), precision is 84% and recall is 95%. The high recall partly reflects the model's tendency to predict the majority class.

For No RRT Delays (Class 0), precision is 36% and recall is only 14%, so the model misses most of these cases.

The weighted average F1-score of 0.77 is dominated by Class 1. The macro average F1-score of 0.55 weights both classes equally and shows that the minority class needs better handling.
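Two common starting points for the minority-class problem are a stratified split and class weighting. The sketch below shows both on synthetic data generated with `make_classification`, because `cleaned_data.csv` is not available here; the data, the variable names, and the hyperparameters are assumptions for illustration, and the scores it prints say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with roughly the same 18/82 class ratio as the report
X_demo, y_demo = make_classification(
    n_samples=1780, n_features=6, n_informative=4,
    weights=[0.18, 0.82], random_state=42,
)

# stratify=y_demo keeps the class ratio equal in the training and test sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# class_weight='balanced' weights samples inversely to their class frequency
rf_balanced = RandomForestClassifier(
    n_estimators=200, class_weight='balanced', random_state=42
)
rf_balanced.fit(X_tr, y_tr)
pred = rf_balanced.predict(X_te)

minority_recall = recall_score(y_te, pred, pos_label=0)
balanced_acc = balanced_accuracy_score(y_te, pred)
print("Minority-class recall:", minority_recall)
print("Balanced accuracy:", balanced_acc)
```

XGBClassifier offers a comparable knob, `scale_pos_weight`, for binary problems. Whether either option actually improves Class 0 recall on this dataset would have to be checked on the real data.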

In [2]:

# Function to predict the delay in RRT based on user input
def predict_rrt_delay(calcium_max, creatinine_min, aki_stage, aniongap_min, calcium_min, pt_max, model):
    # Build a one-row DataFrame whose column names match the training features
    input_data = pd.DataFrame([[calcium_max, creatinine_min, aki_stage, aniongap_min, calcium_min, pt_max]], 
                              columns=['calcium_max', 'creatinine_min', 'aki_stage', 'aniongap_min', 'calcium_min', 'pt_max'])
    
    # Use the trained voting model to predict the class (0 or 1)
    prediction = model.predict(input_data)
    
    # Output the result
    if prediction[0] == 1:
        return "RRT Delay is predicted."
    else:
        return "RRT Delay is not predicted."
    
# Example of user input and model prediction
result = predict_rrt_delay(
    calcium_max=9.2, 
    creatinine_min=0.3, 
    aki_stage=2, 
    aniongap_min=11, 
    calcium_min=7.7, 
    pt_max=14.5, 
    model=voting_model
)
print(result)
    

RRT Delay is not predicted.
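A bare label hides how close the call was. Because the ensemble uses soft voting, it also exposes `predict_proba`, so a variant of the function can return the estimated probability alongside the label. The sketch below is one possible shape for such a helper; `predict_rrt_delay_proba`, the `threshold` parameter, and the stand-in model trained on random numbers are all hypothetical and only make the example self-contained.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

FEATURES = ['calcium_max', 'creatinine_min', 'aki_stage',
            'aniongap_min', 'calcium_min', 'pt_max']

def predict_rrt_delay_proba(values, model, threshold=0.5):
    """Return (probability of class 1, predicted label) for one patient.

    values: dict keyed by the names in FEATURES.
    """
    missing = [name for name in FEATURES if name not in values]
    if missing:
        raise ValueError(f"Missing features: {missing}")
    row = pd.DataFrame([[values[name] for name in FEATURES]], columns=FEATURES)
    # Look up the probability column that corresponds to class 1
    class_index = list(model.classes_).index(1)
    prob_delay = float(model.predict_proba(row)[0, class_index])
    return prob_delay, int(prob_delay >= threshold)

# Stand-in model fitted on random numbers, for demonstration only
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(200, len(FEATURES))), columns=FEATURES)
y_demo = (X_demo['creatinine_min'] + rng.normal(scale=0.5, size=200) > 0).astype(int)
demo_model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)

patient = {'calcium_max': 9.2, 'creatinine_min': 0.3, 'aki_stage': 2,
           'aniongap_min': 11, 'calcium_min': 7.7, 'pt_max': 14.5}
prob, label = predict_rrt_delay_proba(patient, demo_model)
print(f"Estimated probability of RRT delay: {prob:.2f} (label {label})")
```

In the notebook itself, `voting_model` would be passed in place of `demo_model`. The 0.5 threshold is a placeholder, not a recommended cut-off.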
