

#  Diabetes Readmission Prediction

Diabetes readmissions are a critical challenge in the healthcare sector, contributing significantly to increased medical costs, strained healthcare resources, and poorer patient outcomes. When patients are discharged and then readmitted shortly after, it often signals gaps in care coordination, follow-up, or risk assessment.

This project builds a predictive analytics solution to identify patients at high risk of being readmitted to the hospital after discharge. By analyzing patterns in patient demographics, diagnoses, procedures, and treatment history, we can develop a machine learning model that flags readmission risk before it happens. This empowers hospitals to take preventive action — such as closer follow-up, education, or resource allocation — to reduce avoidable readmissions.

The business and public health impact is substantial: reducing readmissions improves patient outcomes, reduces financial penalties under value-based care models, and frees up limited hospital capacity for new patients.

## Problem Understanding

`Business Challenge:` Hospitals need a reliable way to predict which discharged patients are likely to be readmitted soon after, so that interventions can be applied before deterioration occurs. This can help reduce costs, improve care quality, and meet regulatory standards.

`Technical Approach:` Build a binary classification model using the "Diabetes 130-US hospitals for years 1999–2008" dataset to predict patient readmission. The focus is not just on predictive performance, but also on **model interpretability** — understanding which factors most influence the risk of readmission using tools like SHAP. This will support data-driven decisions by clinicians, case managers, and hospital administrators.





##  Data Understanding & Exploratory Data Analysis 

We begin our project by examining the dataset used for this analysis: the [Diabetes 130-US Hospitals for Years 1999–2008 Dataset](https://archive.ics.uci.edu/dataset/296/diabetes+130-us+hospitals+for+years+1999-2008) from the UCI Machine Learning Repository. It contains over 100,000 de-identified records of diabetic patient hospital encounters across 130 U.S. hospitals over a 10-year period.

Before we begin modeling, it's essential to build a thorough understanding of the dataset. In this phase, we will explore the structure, types, distributions, and relationships in the data to gain actionable insights that will inform our feature engineering and modeling decisions.

The goal is to answer key questions such as:
- What does the dataset look like?
- Are there any missing values, imbalanced classes, or irrelevant columns?
- Which features might be predictive of readmission?
- Are there any data quality issues that need to be addressed?

By performing both summary statistics and deeper feature-level exploration, we can begin forming hypotheses about which patterns may contribute to patient readmission. These insights will guide our preprocessing, feature selection, and model interpretability steps later in the project.

In [1]:
# Core Libraries
import pandas as pd
import numpy as np
import warnings
import joblib

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.colors import ListedColormap

# Scientific Computing
import scipy as sp

# Model Selection and Evaluation
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score, 
    roc_auc_score, classification_report, roc_curve, auc
)
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay


# Machine Learning Models
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

# Preprocessing and Pipelines
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.utils import resample

# Imbalanced Learning
from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import RandomOverSampler

# Explainability
import shap

# Suppress library warnings to keep the notebook output readable
warnings.filterwarnings("ignore")

In [2]:
# Load the dataset 
df = pd.read_csv('Data/diabetic_data.csv')

# Preview the shape of the dataset
print("Dataset Dimensions:")
print(f"Rows: {df.shape[0]} | Columns: {df.shape[1]}\n")

Dataset Dimensions:
Rows: 101766 | Columns: 50



In [3]:
# Display the first 5 rows of the dataset
print(" First 5 Records:")
df.head()

 First 5 Records:


| | encounter_id | patient_nbr | race | gender | age | weight | admission_type_id | discharge_disposition_id | admission_source_id | time_in_hospital | ... | citoglipton | insulin | glyburide-metformin | glipizide-metformin | glimepiride-pioglitazone | metformin-rosiglitazone | metformin-pioglitazone | change | diabetesMed | readmitted |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2278392 | 8222157 | Caucasian | Female | [0-10) | ? | 6 | 25 | 1 | 1 | ... | No | No | No | No | No | No | No | No | No | NO |
| 1 | 149190 | 55629189 | Caucasian | Female | [10-20) | ? | 1 | 1 | 7 | 3 | ... | No | Up | No | No | No | No | No | Ch | Yes | >30 |
| 2 | 64410 | 86047875 | AfricanAmerican | Female | [20-30) | ? | 1 | 1 | 7 | 2 | ... | No | No | No | No | No | No | No | No | Yes | NO |
| 3 | 500364 | 82442376 | Caucasian | Male | [30-40) | ? | 1 | 1 | 7 | 2 | ... | No | Up | No | No | No | No | No | Ch | Yes | NO |
| 4 | 16680 | 42519267 | Caucasian | Male | [40-50) | ? | 1 | 1 | 7 | 1 | ... | No | Steady | No | No | No | No | No | Ch | Yes | NO |


In [4]:
df.columns

Index(['encounter_id', 'patient_nbr', 'race', 'gender', 'age', 'weight',
       'admission_type_id', 'discharge_disposition_id', 'admission_source_id',
       'time_in_hospital', 'payer_code', 'medical_specialty',
       'num_lab_procedures', 'num_procedures', 'num_medications',
       'number_outpatient', 'number_emergency', 'number_inpatient', 'diag_1',
       'diag_2', 'diag_3', 'number_diagnoses', 'max_glu_serum', 'A1Cresult',
       'metformin', 'repaglinide', 'nateglinide', 'chlorpropamide',
       'glimepiride', 'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide',
       'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone',
       'tolazamide', 'examide', 'citoglipton', 'insulin',
       'glyburide-metformin', 'glipizide-metformin',
       'glimepiride-pioglitazone', 'metformin-rosiglitazone',
       'metformin-pioglitazone', 'change', 'diabetesMed', 'readmitted'],
      dtype='object')

The dataset contains **101,766 patient encounters** and **50 columns** (including the two identifiers and the `readmitted` target), covering demographics, hospital admission details, diagnostics, treatment variables, and medication history.

From the first few records, we observe:
- Features such as `race`, `gender`, `age`, and `admission_type_id` offer demographic and procedural context.
- The `weight` column contains placeholder values (`?`), indicating potential missing data.
- The target variable, `readmitted`, appears at the far right, with values such as `NO`, `>30`, and `<30` — suggesting it may need to be converted into a binary outcome for modeling.
- Many medication-related columns record changes in prescriptions (e.g., `insulin`, `change`, `diabetesMed`), which could hold predictive value.


In [5]:
# Show column names and types
df.dtypes

encounter_id                 int64
patient_nbr                  int64
race                        object
gender                      object
age                         object
weight                      object
admission_type_id            int64
discharge_disposition_id     int64
admission_source_id          int64
time_in_hospital             int64
payer_code                  object
medical_specialty           object
num_lab_procedures           int64
num_procedures               int64
num_medications              int64
number_outpatient            int64
number_emergency             int64
number_inpatient             int64
diag_1                      object
diag_2                      object
diag_3                      object
number_diagnoses             int64
max_glu_serum               object
A1Cresult                   object
metformin                   object
repaglinide                 object
nateglinide                 object
chlorpropamide              object
glimepiride         

The dataset consists of a mix of **integer** and **object (categorical/string)** data types. Key observations include:

- Patient identifiers (`encounter_id`, `patient_nbr`) are numeric but likely not useful for modeling.
- Categorical variables such as `race`, `gender`, `age`, and many medication-related columns are stored as `object` types and will require encoding.
- Diagnostic codes (`diag_1`, `diag_2`, `diag_3`) are also stored as strings and may require grouping or dimensionality reduction.
- Several columns use placeholder values (e.g., `'?'`) rather than standard missing value indicators, which we’ll need to handle during preprocessing.

This confirms the need for **targeted data type handling, encoding, and missing value treatment** as we move forward.
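As a minimal sketch of the placeholder handling described above, the `'?'` values could be converted to proper missing values and then counted per column. The `toy` frame below is a hypothetical stand-in for a few columns of `df`, not real data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a few columns of the real dataset
toy = pd.DataFrame({
    "race": ["Caucasian", "?", "AfricanAmerican", "Caucasian"],
    "weight": ["?", "?", "?", "[75-100)"],
    "time_in_hospital": [1, 3, 2, 2],
})

# Replace the '?' placeholder with NaN so pandas treats it as missing
toy_clean = toy.replace("?", np.nan)

# Share of missing values per column, largest first
missing_share = toy_clean.isna().mean().sort_values(ascending=False)
print(missing_share)
```

On the full dataset the same conversion could also happen at load time by passing `na_values='?'` to `pd.read_csv`, which would make `df.isna().sum()` reflect the placeholders as well.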


In [6]:
# Check for duplicate rows
df.duplicated().sum()

np.int64(0)

In [8]:
df.isna().sum()

encounter_id                    0
patient_nbr                     0
race                            0
gender                          0
age                             0
weight                          0
admission_type_id               0
discharge_disposition_id        0
admission_source_id             0
time_in_hospital                0
payer_code                      0
medical_specialty               0
num_lab_procedures              0
num_procedures                  0
num_medications                 0
number_outpatient               0
number_emergency                0
number_inpatient                0
diag_1                          0
diag_2                          0
diag_3                          0
number_diagnoses                0
max_glu_serum               96420
A1Cresult                   84748
metformin                       0
repaglinide                     0
nateglinide                     0
chlorpropamide                  0
glimepiride                     0
acetohexamide 

In [9]:
df['age'].unique()

array(['[0-10)', '[10-20)', '[20-30)', '[30-40)', '[40-50)', '[50-60)',
       '[60-70)', '[70-80)', '[80-90)', '[90-100)'], dtype=object)

In [10]:
# Loop through each column and display its unique value counts (limited for readability)
for col in df.columns:
    print(f"\n🧾 Value Counts for: {col}")
    print("-" * 40)
    print(df[col].value_counts(dropna=False).head(10))


🧾 Value Counts for: encounter_id
----------------------------------------
encounter_id
443867222    1
2278392      1
149190       1
64410        1
500364       1
16680        1
443816024    1
443811536    1
443804570    1
443797298    1
Name: count, dtype: int64

🧾 Value Counts for: patient_nbr
----------------------------------------
patient_nbr
88785891    40
43140906    28
88227540    23
1660293     23
23199021    23
84428613    22
23643405    22
92709351    21
37096866    20
90609804    20
Name: count, dtype: int64

🧾 Value Counts for: race
----------------------------------------
race
Caucasian          76099
AfricanAmerican    19210
?                   2273
Hispanic            2037
Other               1506
Asian                641
Name: count, dtype: int64

🧾 Value Counts for: gender
----------------------------------------
gender
Female             54708
Male               47055
Unknown/Invalid        3
Name: count, dtype: int64

🧾 Value Counts for: age
------------------------

After reviewing the value counts for each feature and referencing the official mapping for hospital administrative codes, we can now better assess the role and quality of key categorical features.

---

**Missing or Placeholder Values**
- Several fields contain placeholders (`'?'` or `NaN`):
  - `weight`: Roughly 97% of values are the `'?'` placeholder, so the column will be dropped.
  - `payer_code`, `medical_specialty`, and `race`: Contain many `'?'` values, but may hold predictive value. We'll treat unknowns explicitly if retained.
  - `max_glu_serum` and `A1Cresult`: Include structured categories and `NaN`s; we’ll treat `NaN` as “Not Measured” and encode ordinally.

**ID Columns**
- `encounter_id` and `patient_nbr` are identifiers and will be dropped.

---

**Hospital Administrative Codes (Mapped)**

 `admission_type_id`
- Includes structured categories like `1 = Emergency`, `2 = Urgent`, `3 = Elective`, etc.
- Categories such as `5 = Not Available`, `6 = NULL`, and `8 = Not Mapped` may reflect poor data quality and will be grouped under `"Unknown"` or `"Other"`.

 `discharge_disposition_id`
- Includes critical outcomes like `1 = Discharged to home`, `7 = Left AMA`, `11 = Expired`, `13–14 = Hospice`, and `19–21 = Expired (hospice)`.
- Certain values (e.g., `18 = NULL`, `25 = Not Mapped`, `26 = Unknown/Invalid`) will be grouped as `"Unknown"` or imputed as a separate category.
- This column may reveal post-discharge care levels and correlate with readmission risk — we will keep it and apply meaningful mapping.

 `admission_source_id`
- Represents referral origin, including `1 = Physician Referral`, `7 = Emergency Room`, `4 = Transfer from a hospital`, etc.
- Codes like `9 = Not Available`, `17 = NULL`, `20 = Not Mapped`, `21 = Unknown/Invalid` will again be mapped to `"Unknown"`.

To improve interpretability and performance, we will **map these numerical codes to their categorical descriptions** using dictionaries, and treat them as categorical features.
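The dictionary-based mapping could look roughly like the sketch below for `admission_type_id`. Only the codes quoted in the text are included, so the dictionary is deliberately partial; the helper name `map_admission_type` and the `"Other"` fallback are illustrative assumptions.

```python
import pandas as pd

# Only the codes quoted in the text are listed here; the remaining codes would
# need to be filled in from the dataset's official ID mapping.
ADMISSION_TYPE_LABELS = {1: "Emergency", 2: "Urgent", 3: "Elective"}
ADMISSION_TYPE_UNKNOWN = {5, 6, 8}  # Not Available, NULL, Not Mapped

def map_admission_type(code: int) -> str:
    """Return a readable label, collapsing low-quality codes into 'Unknown'."""
    if code in ADMISSION_TYPE_UNKNOWN:
        return "Unknown"
    # Codes missing from this partial dictionary fall back to 'Other'
    return ADMISSION_TYPE_LABELS.get(code, "Other")

# Hypothetical sample of codes, one per case of interest
codes = pd.Series([1, 2, 3, 5, 6, 8, 7], name="admission_type_id")
labels = codes.map(map_admission_type).astype("category")
print(labels.value_counts())
```

The same pattern would apply to `discharge_disposition_id` and `admission_source_id`, each with its own dictionary and set of unknown codes.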

---

**Categorical Variables for Encoding**
We'll apply suitable encoding strategies:
- **Binary/Boolean encoding** for features like `change` and `diabetesMed`
- **Ordinal encoding** for ordered features (e.g., `A1Cresult`, `max_glu_serum`)
- **One-hot or grouped encoding** for nominal fields: grouping for high-cardinality ones like the diagnosis codes (`diag_1`, `diag_2`, `diag_3`) and `medical_specialty`, and one-hot encoding for the medication columns
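
A small sketch of the binary and ordinal strategies listed above, using a hypothetical `toy` frame. The category labels (`'Norm'`, `'>7'`, `'>8'`, `'>200'`, `'>300'`) and the rank order, with "Not Measured" placed lowest, are assumptions that should be checked against the actual value counts.

```python
import pandas as pd

# Hypothetical rows; None marks tests that were not performed
toy = pd.DataFrame({
    "change": ["No", "Ch", "Ch"],
    "diabetesMed": ["No", "Yes", "Yes"],
    "A1Cresult": [None, ">7", ">8"],
    "max_glu_serum": ["Norm", None, ">300"],
})

# Binary encoding: 1 when a medication change / a diabetes medication is recorded
toy["change"] = (toy["change"] == "Ch").astype(int)
toy["diabetesMed"] = (toy["diabetesMed"] == "Yes").astype(int)

# Ordinal encoding; the rank order below is an assumption, not a given
A1C_ORDER = {"Not Measured": 0, "Norm": 1, ">7": 2, ">8": 3}
GLU_ORDER = {"Not Measured": 0, "Norm": 1, ">200": 2, ">300": 3}

toy["A1Cresult"] = toy["A1Cresult"].fillna("Not Measured").map(A1C_ORDER)
toy["max_glu_serum"] = toy["max_glu_serum"].fillna("Not Measured").map(GLU_ORDER)
print(toy)
```

Placing "Not Measured" at rank 0 imposes an ordering that may not hold clinically; a separate indicator column for whether the test was performed is one alternative.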

In [11]:
# Quick summary statistics of numerical features
df.describe(include='number').T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| encounter_id | 101766.0 | 165201600.0 | 102640300.0 | 12522.0 | 84961194.0 | 152388987.0 | 230270900.0 | 443867222.0 |
| patient_nbr | 101766.0 | 54330400.0 | 38696360.0 | 135.0 | 23413221.0 | 45505143.0 | 87545950.0 | 189502619.0 |
| admission_type_id | 101766.0 | 2.024006 | 1.445403 | 1.0 | 1.0 | 1.0 | 3.0 | 8.0 |
| discharge_disposition_id | 101766.0 | 3.715642 | 5.280166 | 1.0 | 1.0 | 1.0 | 4.0 | 28.0 |
| admission_source_id | 101766.0 | 5.754437 | 4.064081 | 1.0 | 1.0 | 7.0 | 7.0 | 25.0 |
| time_in_hospital | 101766.0 | 4.395987 | 2.985108 | 1.0 | 2.0 | 4.0 | 6.0 | 14.0 |
| num_lab_procedures | 101766.0 | 43.09564 | 19.67436 | 1.0 | 31.0 | 44.0 | 57.0 | 132.0 |
| num_procedures | 101766.0 | 1.33973 | 1.705807 | 0.0 | 0.0 | 1.0 | 2.0 | 6.0 |
| num_medications | 101766.0 | 16.02184 | 8.127566 | 1.0 | 10.0 | 15.0 | 20.0 | 81.0 |
| number_outpatient | 101766.0 | 0.3693572 | 1.267265 | 0.0 | 0.0 | 0.0 | 0.0 | 42.0 |


The dataset contains several numeric features that capture hospital utilization, medication use, procedures performed, and diagnostic richness. Here are the key insights from the summary statistics:

**Patient and Encounter Identifiers**
- `encounter_id` and `patient_nbr` are unique or repeated identifiers with no analytical value. These should be **dropped** from the dataset.

**Hospital Utilization Metrics**
- `time_in_hospital`: The middle 50% of stays last **2–6 days** (median 4), with a maximum of 14. This feature may correlate with severity or complexity of the case.
- `num_lab_procedures`: Highly variable (1 to 132), with a **median of 44**, indicating broad differences in patient testing frequency.
- `num_procedures` and `num_medications`: Range from 0–6 and 1–81 respectively, showing variation in intervention intensity. `num_medications` will be especially important to track treatment complexity.

**Patient Visit History**
- `number_outpatient`, `number_emergency`, and `number_inpatient`: All show **heavy skew toward zero**, but with some outliers (e.g., 42 outpatient visits, 76 emergency visits). These fields may reflect chronic care needs or poor control of conditions, which could strongly influence readmission.
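
One way to quantify this skew is to look at the share of zeros and the sample skewness before and after a `log(1 + x)` transform. The sketch below uses hypothetical counts that mimic a zero-heavy column with one large outlier; the transform is an illustrative option, not a step the notebook is known to apply.

```python
import numpy as np
import pandas as pd

# Hypothetical visit counts: mostly zeros plus one extreme value
visits = pd.Series([0, 0, 0, 0, 0, 0, 1, 1, 2, 42], name="number_outpatient")

zero_share = (visits == 0).mean()    # fraction of encounters with no prior visits
raw_skew = visits.skew()             # sample skewness of the raw counts
log_skew = np.log1p(visits).skew()   # skewness after a log(1 + x) transform

print(f"zero share: {zero_share:.0%}")
print(f"skew (raw): {raw_skew:.2f} | skew (log1p): {log_skew:.2f}")
```

Tree-based models are largely insensitive to such monotonic transforms, so this matters mainly for linear models like logistic regression.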

**Diagnostic Breadth**
- `number_diagnoses`: Most patients have **6–9 diagnoses**, with a max of 16. This indicates high comorbidity in the population and should be treated as a **key feature** related to patient complexity.


In [10]:
# Set target and inspect it
df['readmitted'] = df['readmitted'].replace({'>30': 'NO'})  # Only consider '<30' as readmitted
print("\nTarget variable breakdown:")
print(df['readmitted'].value_counts())


Target variable breakdown:
readmitted
NO     90409
<30    11357
Name: count, dtype: int64


In [12]:
# Binarize target: 1 = readmitted within 30 days, 0 = otherwise
df['readmitted'] = (df['readmitted'] == '<30').astype(int)

In [13]:
df['readmitted'].value_counts()

readmitted
0    90409
1    11357
Name: count, dtype: int64
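
The degree of class imbalance follows directly from these counts. The short sketch below reuses the two numbers printed above rather than the full `df`.

```python
import pandas as pd

# Class counts copied from the value_counts() output above
counts = pd.Series({0: 90409, 1: 11357}, name="readmitted")

positive_rate = counts[1] / counts.sum()   # share of readmissions within 30 days
imbalance_ratio = counts[0] / counts[1]    # negatives per positive

print(f"Positive rate: {positive_rate:.1%}")              # Positive rate: 11.2%
print(f"Negatives per positive: {imbalance_ratio:.1f}")   # Negatives per positive: 8.0
```

With roughly one positive for every eight negatives, plain accuracy is a weak yardstick: a model that always predicts "not readmitted" would already score about 89%.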

In [None]:
### Data Cleaning & Feature Engineering ###
# Drop identifier columns and columns with many missing ('?') values
df.drop(['encounter_id', 'patient_nbr', 'weight', 'payer_code', 'medical_specialty'], axis=1, inplace=True)


##  Initial Modeling Attempt

This early experiment applied a standard pipeline with minimal cleaning and basic feature engineering. Three models (Logistic Regression, Random Forest, and XGBoost) were evaluated using a stratified train/test split and cross-validation, and SHAP was used for interpretability.

While the setup was valid, the results were suboptimal, particularly recall and F1 score, because preprocessing was limited and the class imbalance was left unaddressed. This highlighted the need for more robust cleaning, domain-driven features, and data balancing, which informed the improved approach later in the project.
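
To illustrate why unaddressed imbalance tends to hurt recall, the sketch below compares a plain and a class-weighted logistic regression on synthetic data with a similar 89/11 class split. It is not the notebook's actual pipeline: the data comes from `make_classification`, the helper `fit_recall` is hypothetical, and `class_weight="balanced"` stands in for whichever balancing method the later approach uses.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with roughly the same 89/11 class split as the target
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.89, 0.11], random_state=42
)

# Stratify so that train and test keep the same class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

def fit_recall(class_weight):
    """Fit a scaled logistic regression and return recall on the positive class."""
    model = Pipeline([
        ("scale", StandardScaler()),
        ("clf", LogisticRegression(class_weight=class_weight, max_iter=1000)),
    ])
    model.fit(X_train, y_train)
    return recall_score(y_test, model.predict(X_test))

recall_plain = fit_recall(None)           # imbalance left unaddressed
recall_weighted = fit_recall("balanced")  # errors on the rare class weighted up

print(f"Recall without weighting: {recall_plain:.2f}")
print(f"Recall with class weights: {recall_weighted:.2f}")
```

Class weighting usually trades some precision for recall, so F1 and the precision-recall curve would need to be checked alongside it.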
