Can we predict whether a patient will be readmitted within 30 days of discharge, using only information available during the encounter?


In [49]:
import pandas as pd
import numpy as np

In [50]:
df = pd.read_csv('diabetic_data.csv')

In [51]:
df.shape

(101766, 50)

In [52]:
df.head()

Unnamed: 0,encounter_id,patient_nbr,race,gender,age,weight,admission_type_id,discharge_disposition_id,admission_source_id,time_in_hospital,...,citoglipton,insulin,glyburide-metformin,glipizide-metformin,glimepiride-pioglitazone,metformin-rosiglitazone,metformin-pioglitazone,change,diabetesMed,readmitted
0,2278392,8222157,Caucasian,Female,[0-10),?,6,25,1,1,...,No,No,No,No,No,No,No,No,No,NO
1,149190,55629189,Caucasian,Female,[10-20),?,1,1,7,3,...,No,Up,No,No,No,No,No,Ch,Yes,>30
2,64410,86047875,AfricanAmerican,Female,[20-30),?,1,1,7,2,...,No,No,No,No,No,No,No,No,Yes,NO
3,500364,82442376,Caucasian,Male,[30-40),?,1,1,7,2,...,No,Up,No,No,No,No,No,Ch,Yes,NO
4,16680,42519267,Caucasian,Male,[40-50),?,1,1,7,1,...,No,Steady,No,No,No,No,No,Ch,Yes,NO


In [53]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 101766 entries, 0 to 101765
Data columns (total 50 columns):
 #   Column                    Non-Null Count   Dtype 
---  ------                    --------------   ----- 
 0   encounter_id              101766 non-null  int64 
 1   patient_nbr               101766 non-null  int64 
 2   race                      101766 non-null  object
 3   gender                    101766 non-null  object
 4   age                       101766 non-null  object
 5   weight                    101766 non-null  object
 6   admission_type_id         101766 non-null  int64 
 7   discharge_disposition_id  101766 non-null  int64 
 8   admission_source_id       101766 non-null  int64 
 9   time_in_hospital          101766 non-null  int64 
 10  payer_code                101766 non-null  object
 11  medical_specialty         101766 non-null  object
 12  num_lab_procedures        101766 non-null  int64 
 13  num_procedures            101766 non-null  int64 
 14  num_

In [54]:
df.describe(include="all").transpose()

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
encounter_id,101766.0,,,,165201645.622978,102640295.983458,12522.0,84961194.0,152388987.0,230270887.5,443867222.0
patient_nbr,101766.0,,,,54330400.694947,38696359.346534,135.0,23413221.0,45505143.0,87545949.75,189502619.0
race,101766.0,6.0,Caucasian,76099.0,,,,,,,
gender,101766.0,3.0,Female,54708.0,,,,,,,
age,101766.0,10.0,[70-80),26068.0,,,,,,,
weight,101766.0,10.0,?,98569.0,,,,,,,
admission_type_id,101766.0,,,,2.024006,1.445403,1.0,1.0,1.0,3.0,8.0
discharge_disposition_id,101766.0,,,,3.715642,5.280166,1.0,1.0,1.0,4.0,28.0
admission_source_id,101766.0,,,,5.754437,4.064081,1.0,1.0,7.0,7.0,25.0
time_in_hospital,101766.0,,,,4.395987,2.985108,1.0,2.0,4.0,6.0,14.0


In [55]:
# check true nulls
df.isna().sum().sort_values(ascending=False).head(10)

Unnamed: 0,0
max_glu_serum,96420
A1Cresult,84748
race,0
gender,0
age,0
weight,0
admission_type_id,0
discharge_disposition_id,0
admission_source_id,0
time_in_hospital,0


In [56]:
# check placeholder missing values ("?" stands in for missing data)
(df == "?").sum().sort_values(ascending=False).head(15)

Unnamed: 0,0
weight,98569
medical_specialty,49949
payer_code,40256
race,2273
diag_3,1423
diag_2,358
diag_1,21
admission_type_id,0
patient_nbr,0
encounter_id,0


In [57]:
# normalizing placeholder missing values
df = df.replace("?", np.nan)
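
The same normalization can also be done while loading the file. The sketch below is a minimal illustration, assuming "?" is the only placeholder in use; the three-row inline sample is made up and merely stands in for diabetic_data.csv.

```python
import io
import pandas as pd

# Hypothetical three-row sample standing in for diabetic_data.csv.
sample_csv = io.StringIO(
    "race,weight,time_in_hospital\n"
    "Caucasian,?,1\n"
    "?,?,3\n"
    "AfricanAmerican,[75-100),2\n"
)

# na_values tells read_csv to treat "?" as missing while parsing,
# so a separate df.replace("?", np.nan) pass is not needed.
sample = pd.read_csv(sample_csv, na_values="?")

missing_counts = sample.isna().sum()
print(missing_counts)
```

Either approach should leave the frame in the same state; parsing with `na_values` just means `isna()` reports the placeholders from the start.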

In [58]:
# drop identifiers and leakage-prone columns
id_cols = ['encounter_id', 'patient_nbr']
df = df.drop(columns=id_cols)

In [59]:
# columns with extreme missingness
high_missing = ["weight", "payer_code", "max_glu_serum","A1Cresult"]
df = df.drop(columns=high_missing)

In [60]:
# fill moderately missing categoricals with "Unknown"
for col in ["medical_specialty", "race", "diag_2", "diag_3"]:
    if col in df.columns:
        df[col] = df[col].fillna("Unknown")

In [61]:
# binary target: readmitted within 30 days

df["readmitted_30"] = (df['readmitted'] == "<30").astype(int)
df = df.drop(columns=["readmitted"])

In [62]:
# quick sanity check
df.isna().sum().sort_values(ascending=False).head(10)

Unnamed: 0,0
diag_1,21
race,0
age,0
gender,0
discharge_disposition_id,0
admission_source_id,0
time_in_hospital,0
medical_specialty,0
num_lab_procedures,0
num_procedures,0


In [63]:
df.shape

(101766, 44)

In [64]:
# starting with modeling


In [65]:
# train/test split and preprocessing
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer

In [66]:
df.columns.tolist()

['race',
 'gender',
 'age',
 'admission_type_id',
 'discharge_disposition_id',
 'admission_source_id',
 'time_in_hospital',
 'medical_specialty',
 'num_lab_procedures',
 'num_procedures',
 'num_medications',
 'number_outpatient',
 'number_emergency',
 'number_inpatient',
 'diag_1',
 'diag_2',
 'diag_3',
 'number_diagnoses',
 'metformin',
 'repaglinide',
 'nateglinide',
 'chlorpropamide',
 'glimepiride',
 'acetohexamide',
 'glipizide',
 'glyburide',
 'tolbutamide',
 'pioglitazone',
 'rosiglitazone',
 'acarbose',
 'miglitol',
 'troglitazone',
 'tolazamide',
 'examide',
 'citoglipton',
 'insulin',
 'glyburide-metformin',
 'glipizide-metformin',
 'glimepiride-pioglitazone',
 'metformin-rosiglitazone',
 'metformin-pioglitazone',
 'change',
 'diabetesMed',
 'readmitted_30']

In [67]:
X = df.drop(columns=["readmitted_30"])
y = df["readmitted_30"]

cat_cols = X.select_dtypes(include="object").columns
num_cols = X.select_dtypes(include="number").columns

# numeric columns: median imputation; categorical columns: mode imputation + one-hot
preprocessor = ColumnTransformer(
    transformers=[
        ("num", Pipeline([("imputer", SimpleImputer(strategy="median"))]), num_cols),
        ("cat", Pipeline([
            ("imputer", SimpleImputer(strategy="most_frequent")),
            ("onehot", OneHotEncoder(handle_unknown="ignore")),
        ]), cat_cols),
    ]
)


In [68]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

In [69]:
# baseline model (logistic regression)

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, classification_report

model = Pipeline(steps=[
    ("prep", preprocessor),
    ("clf", LogisticRegression(max_iter=2000, class_weight="balanced", solver="lbfgs")),
])

model.fit(X_train, y_train)

y_prob = model.predict_proba(X_test)[:,1]
y_pred = model.predict(X_test)

roc_auc_score(y_test, y_prob), classification_report(y_test, y_pred)

STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


(np.float64(0.6488714777762981),
 '              precision    recall  f1-score   support\n\n           0       0.92      0.66      0.77     18083\n           1       0.17      0.55      0.26      2271\n\n    accuracy                           0.65     20354\n   macro avg       0.54      0.60      0.51     20354\nweighted avg       0.84      0.65      0.71     20354\n')
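
The convergence warning above suggests scaling the data, and `StandardScaler` is already imported but unused. The sketch below shows one way to add it to the numeric branch. It runs on synthetic data (the column names mirror the notebook, the values are made up), so it only illustrates the wiring; whether it removes the warning or changes ROC AUC on the real data is not shown here.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data; values are made up for illustration only.
rng = np.random.default_rng(42)
n = 400
X_demo = pd.DataFrame({
    "time_in_hospital": rng.integers(1, 15, size=n),
    "num_lab_procedures": rng.integers(1, 120, size=n),
    "gender": rng.choice(["Female", "Male"], size=n),
})
y_demo = (X_demo["time_in_hospital"] + rng.normal(0, 3, size=n) > 8).astype(int)

num_cols_demo = ["time_in_hospital", "num_lab_procedures"]
cat_cols_demo = ["gender"]

# Same numeric branch as the notebook, plus a scaling step after imputation.
numeric_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
categorical_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
scaled_preprocessor = ColumnTransformer([
    ("num", numeric_pipeline, num_cols_demo),
    ("cat", categorical_pipeline, cat_cols_demo),
])

scaled_model = Pipeline([
    ("prep", scaled_preprocessor),
    ("clf", LogisticRegression(max_iter=2000, class_weight="balanced")),
])
scaled_model.fit(X_demo, y_demo)

# Inspect the transformed numeric block: it should be centered near zero.
transformed = scaled_model.named_steps["prep"].transform(X_demo)
if hasattr(transformed, "toarray"):  # ColumnTransformer may return a sparse matrix
    transformed = transformed.toarray()
numeric_means = transformed[:, :len(num_cols_demo)].mean(axis=0)
print(numeric_means)
```

Because the scaler lives inside the pipeline, it is fitted on the training split only and reapplied automatically at prediction time.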

In [70]:
from sklearn.metrics import confusion_matrix

In [71]:
roc_auc = roc_auc_score(y_test, y_prob)
report = classification_report(y_test, y_pred, output_dict=True)
cm = confusion_matrix(y_test, y_pred)

print("ROC AUC:",roc_auc)
print("confusion_matrix:\n", cm)

ROC AUC: 0.6488714777762981
confusion_matrix:
 [[11927  6156]
 [ 1023  1248]]
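
The confusion matrix can be read back into the per-class numbers from the classification report. The sketch below redoes that arithmetic from the four printed counts; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix printed above: rows are true classes, columns are predictions.
cm_demo = np.array([[11927, 6156],
                    [1023, 1248]])
tn, fp, fn, tp = cm_demo.ravel()

recall = tp / (tp + fn)        # share of true readmissions that were flagged
precision = tp / (tp + fp)     # share of flagged patients actually readmitted
specificity = tn / (tn + fp)   # share of non-readmissions left unflagged

print(f"recall={recall:.3f} precision={precision:.3f} specificity={specificity:.3f}")
```

The recall and precision agree with the 0.55 and 0.17 reported for class 1: the model flags a little over half of the true readmissions, and roughly one in six flagged patients is actually readmitted within 30 days.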


In [72]:
# additional questions
# 1. what factors most strongly drive readmission risk?
# 2. can we segment patients into risk tiers instead of binary predictions?
# 3. can this model support intervention impact analysis?
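
For the risk-tier question, one possible starting point is to bin the predicted probabilities. This is a minimal sketch: the probabilities are made up and stand in for `model.predict_proba(X_test)[:, 1]`, and the cut points are hypothetical rather than tuned on this dataset.

```python
import numpy as np
import pandas as pd

# Made-up probabilities standing in for model.predict_proba(X_test)[:, 1].
probs_demo = np.array([0.05, 0.22, 0.41, 0.58, 0.73, 0.91])

# Hypothetical cut points; in practice they would be chosen from validation
# data or from how many patients a care team can follow up with.
tier_edges = [0.0, 0.3, 0.6, 1.0]
tier_labels = ["low", "medium", "high"]

# pd.cut assigns each probability to the interval it falls in (right-inclusive).
tiers = pd.cut(probs_demo, bins=tier_edges, labels=tier_labels, include_lowest=True)
tier_counts = pd.Series(tiers).value_counts(sort=False)
print(tier_counts)
```

Since the model was trained with `class_weight="balanced"`, its probabilities are not calibrated to the true readmission rate, so tiers built this way are better read as a ranking than as absolute risk.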

In [73]:
import sqlite3
import joblib
import os

In [74]:
import json
os.makedirs("metrics", exist_ok=True)

metrics = {
    "roc_auc": roc_auc,
    "classification_report": report,
    "confusion_matrix": cm.tolist()
}

with open("metrics/metrics.json", "w") as f:
    json.dump(metrics, f, indent=2)

print("Metrics saved to metrics/metrics.json")

Metrics saved to metrics/metrics.json


In [75]:
joblib.dump(model, "model.joblib")
print("Model artifact saved")

Model artifact saved


In [76]:
# api

In [77]:
%pip install fastapi uvicorn




In [80]:
!printf '%s\n' \
'from fastapi import FastAPI' \
'from pydantic import BaseModel' \
'from typing import List, Dict, Any' \
'import joblib' \
'import pandas as pd' \
'' \
'# Load trained pipeline' \
'model = joblib.load("model.joblib")' \
'' \
'app = FastAPI(title="Readmission Prediction API")' \
'' \
'@app.get("/health")' \
'def health():' \
'    return {"status": "ok"}' \
'' \
'class PatientRecord(BaseModel):' \
'    data: Dict[str, Any]' \
'' \
'class PredictionRequest(BaseModel):' \
'    records: List[PatientRecord]' \
'' \
'# Extract preprocessing schema from trained pipeline' \
'preprocessor = model.named_steps["prep"]' \
'NUM_COLS = preprocessor.transformers_[0][2]' \
'CAT_COLS = preprocessor.transformers_[1][2]' \
'EXPECTED_COLUMNS = list(NUM_COLS) + list(CAT_COLS)' \
'' \
'def align_columns(df: pd.DataFrame) -> pd.DataFrame:' \
'    for col in NUM_COLS:' \
'        if col not in df.columns:' \
'            df[col] = 0' \
'    for col in CAT_COLS:' \
'        if col not in df.columns:' \
'            df[col] = "Unknown"' \
'    return df[EXPECTED_COLUMNS]' \
'' \
'@app.post("/predict")' \
'def predict(request: PredictionRequest):' \
'    df = pd.DataFrame([r.data for r in request.records])' \
'    df = align_columns(df)' \
'    probs = model.predict_proba(df)[:, 1]' \
'    return {"predictions": probs.tolist()}' \
> /content/app.py


In [81]:
!cat /content/app.py


from fastapi import FastAPI
from pydantic import BaseModel
from typing import List, Dict, Any
import joblib
import pandas as pd

# Load trained pipeline
model = joblib.load("model.joblib")

app = FastAPI(title="Readmission Prediction API")

@app.get("/health")
def health():
    return {"status": "ok"}

class PatientRecord(BaseModel):
    data: Dict[str, Any]

class PredictionRequest(BaseModel):
    records: List[PatientRecord]

# Extract preprocessing schema from trained pipeline
preprocessor = model.named_steps["prep"]
NUM_COLS = preprocessor.transformers_[0][2]
CAT_COLS = preprocessor.transformers_[1][2]
EXPECTED_COLUMNS = list(NUM_COLS) + list(CAT_COLS)

def align_columns(df: pd.DataFrame) -> pd.DataFrame:
    for col in NUM_COLS:
        if col not in df.columns:
            df[col] = 0
    for col in CAT_COLS:
        if col not in df.columns:
            df[col] = "Unknown"
    return df[EXPECTED_COLUMNS]

@app.post("/predict")
def predict(request: PredictionRequest):
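    df = pd.DataFrame([r.data for r in request.records])
    df = align_columns(df)
    probs = model.predict_proba(df)[:, 1]
    return {"predictions": probs.tolist()}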

In [87]:
!pkill -f uvicorn
!pkill -f ngrok


In [88]:
%cd /content
!nohup python -m uvicorn app:app --host 0.0.0.0 --port 8000 > uvicorn.log 2>&1 &



/content


In [91]:
!curl http://localhost:8000/health


{"status":"ok"}

In [90]:
from pyngrok import ngrok

ngrok.kill()   # make sure old tunnel is closed
public_url = ngrok.connect(8000)
print(public_url)



NgrokTunnel: "https://bec5e2ed1640.ngrok-free.app" -> "http://localhost:8000"


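I started with a diabetes hospital readmissions dataset (101,766 encounters, 50 columns) and framed the problem as binary classification: will the patient be readmitted within 30 days? Cleaning was deliberately minimal: I converted "?" placeholders to missing values, dropped the identifier columns and four mostly missing columns, filled the remaining categorical gaps with "Unknown", and derived the binary target from the readmitted column. I then trained a logistic regression model inside a preprocessing pipeline so that feature handling at inference stays consistent with training. On the held-out 20% test split, this baseline reaches a ROC AUC of about 0.65, with recall 0.55 and precision 0.17 for the readmitted class.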

I then persisted the trained pipeline with joblib and exposed it through a FastAPI application. Because a server started in the foreground blocks the Colab cell that launches it, I ran the API as a background uvicorn process and verified it with a local request to /health. Finally, I opened an ngrok tunnel to port 8000, which allowed external access and testing through Swagger UI.

The /predict endpoint accepts a list of structured patient records and returns the predicted probability of 30-day readmission for each one, completing an end-to-end machine learning workflow from data ingestion to live inference.
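
To make the request format concrete, the sketch below builds a body in the shape that `PredictionRequest` and `PatientRecord` in app.py expect. The helper names `build_payload` and `post_predict` are hypothetical, the patient values are made up, and the actual POST is left commented out because it needs the uvicorn process to be running.

```python
import json
import urllib.request

def build_payload(records):
    """Wrap plain feature dicts in the {"records": [{"data": {...}}]} shape."""
    return {"records": [{"data": record} for record in records]}

def post_predict(base_url, payload):
    """POST the payload to <base_url>/predict and return the decoded JSON reply."""
    request = urllib.request.Request(
        url=f"{base_url}/predict",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))

# Illustrative patient; values are made up. Columns omitted here are filled
# with defaults by align_columns on the server side.
example_patient = {
    "age": "[70-80)",
    "gender": "Female",
    "time_in_hospital": 4,
    "num_medications": 12,
    "insulin": "Steady",
}
payload = build_payload([example_patient])
print(json.dumps(payload, indent=2))

# With the uvicorn process running locally:
# result = post_predict("http://localhost:8000", payload)
# print(result["predictions"])
```

The response should contain one probability per submitted record, in the same order as the `records` list.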