Introduction:


This notebook trains an XGBoost classifier to predict whether a patient needs early dialysis. It reads patient records from 'cleaned_data.csv' and uses six laboratory and clinical features (calcium_max, creatinine_min, aki_stage, aniongap_min, calcium_min, pt_max) to predict the binary target delay_rrt. The pipeline drops rows with missing values, splits the data 80/20 into training and test sets, balances the training set with SMOTE, and tunes hyperparameters with GridSearchCV (3-fold cross-validation, scored on precision). The best model is evaluated on the held-out 20%, and a small helper function takes the six medical parameters as input and returns a recommendation for early dialysis.

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import classification_report
from imblearn.over_sampling import SMOTE
import xgboost as xgb

# Load the dataset
df = pd.read_csv('cleaned_data.csv')

# Features used as model inputs
features = ['calcium_max', 'creatinine_min', 'aki_stage', 'aniongap_min', 'calcium_min', 'pt_max']

# Drop rows with missing values in the selected columns
df_clean = df.dropna(subset=features + ['delay_rrt'])

# Splitting data into input features (X) and target (y)
X = df_clean[features]
y = df_clean['delay_rrt']

# Splitting data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Apply SMOTE to the training split only, so the test set keeps its original class distribution
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

# Define the XGBoost classifier without early stopping
xgb_model = xgb.XGBClassifier(eval_metric='logloss')
param_grid = {
    'n_estimators': [50, 100],        # Reduced n_estimators range for faster grid search
    'max_depth': [3, 5],              # Limited max_depth range
    'learning_rate': [0.1],           # Single learning rate for faster search
    'subsample': [0.8],               # Single subsample value
    'colsample_bytree': [0.8],        # Single colsample_bytree value
    'scale_pos_weight': [1, 2]        # Extra weight on the positive class (training data is already balanced by SMOTE)
}

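# Perform Grid Search with reduced search space and 3-fold CV, scored on precision
# Note: the folds are drawn from SMOTE-resampled data, so cross-validated scores may be optimistic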
grid = GridSearchCV(xgb_model, param_grid, cv=3, scoring='precision', verbose=1)
grid.fit(X_resampled, y_resampled)

# Best hyperparameters and model
best_model = grid.best_estimator_

# Evaluating the model with the test set
y_pred_balanced = best_model.predict(X_test)
print("Classification Report:\n", classification_report(y_test, y_pred_balanced))

# Function to predict early dialysis based on user input
def predict_early_dialysis(calcium_max, creatinine_min, aki_stage, aniongap_min, calcium_min, pt_max, model):
    # Build a one-row DataFrame whose columns match the training feature names and order
    input_data = pd.DataFrame(
        [[calcium_max, creatinine_min, aki_stage, aniongap_min, calcium_min, pt_max]],
        columns=['calcium_max', 'creatinine_min', 'aki_stage', 'aniongap_min', 'calcium_min', 'pt_max'],
    )

    # Use the trained model to predict the class (0 or 1)
    prediction = model.predict(input_data)

    # Class 1 of delay_rrt is treated as "early dialysis recommended"
    if prediction[0] == 1:
        return "Early dialysis is recommended."
    return "Early dialysis is not required."

# Example of user input and model prediction
result = predict_early_dialysis(calcium_max=8.5, creatinine_min=1.8, aki_stage=2, aniongap_min=10, calcium_min=7.5, pt_max=12.5, model=best_model)
print(result)


Fitting 3 folds for each of 8 candidates, totalling 24 fits
Classification Report:
               precision    recall  f1-score   support

           0       0.39      0.43      0.41        63
           1       0.87      0.85      0.86       293

    accuracy                           0.78       356
   macro avg       0.63      0.64      0.63       356
weighted avg       0.79      0.78      0.78       356

Early dialysis is recommended.


Observations:

Strong Class 1 performance (precision: 0.87, recall: 0.85)

Poor Class 0 performance (precision: 0.39, recall: 0.43)

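Significant class imbalance in the test set (63 Class 0 vs. 293 Class 1 samples)

Overall accuracy of 0.78, which is below the 0.82 (293/356) that always predicting Class 1 would achieve on this test set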

Large performance gap between classes (f1-score: 0.41 vs 0.86)
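
The figures in these observations can be cross-checked with a little arithmetic. The sketch below recomputes the per-class F1 scores from the rounded precision and recall values in the report, and compares the reported accuracy with a majority-class baseline. The dictionary report and the helper f1 are illustrative names introduced here. Because the inputs are already rounded to two decimals, the results are approximate.

```python
# Per-class values copied from the classification report above
report = {
    0: {"precision": 0.39, "recall": 0.43, "support": 63},
    1: {"precision": 0.87, "recall": 0.85, "support": 293},
}

def f1(precision, recall):
    # F1 is the harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

f1_scores = {label: f1(row["precision"], row["recall"]) for label, row in report.items()}
total = sum(row["support"] for row in report.values())

# Accuracy obtained by always predicting the most frequent class
majority_baseline = max(row["support"] for row in report.values()) / total

# Support-weighted average of the per-class F1 scores
weighted_f1 = sum(f1_scores[label] * report[label]["support"] for label in report) / total

print(f"F1 class 0: {f1_scores[0]:.2f}")
print(f"F1 class 1: {f1_scores[1]:.2f}")
print(f"Weighted F1: {weighted_f1:.2f}")
print(f"Majority-class baseline accuracy: {majority_baseline:.2f}")
```

The recomputed F1 scores round to the reported 0.41 and 0.86, while the baseline of about 0.82 sits above the model's 0.78 accuracy. That makes accuracy alone a weak summary here, and the per-class metrics are the more informative view.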