# Machine Learning-Based Prediction of 30-Day Hospital Readmission in Diabetic Patients

## Problem Statement

Hospital readmission within 30 days is a critical indicator of healthcare quality and patient management effectiveness. 
Diabetic patients are at higher risk of complications that may lead to early readmission after discharge.

The objective of this project is to build a machine learning model that predicts whether a diabetic patient will be readmitted within 30 days of hospital discharge based on demographic, clinical, and hospital encounter features.

## Dataset Information

- Source: UCI Machine Learning Repository  
- Dataset Name: Diabetes 130-US Hospitals for Years 1999–2008  
- Number of Instances: 101,766 hospital encounters  
- Number of Columns: 50 (including the `encounter_id` and `patient_nbr` identifiers and the target)  
- Target Variable: readmitted (<30, >30, NO)  
- Contains missing values, encoded as `?`, in several columns (for example `race`, `weight`, `payer_code`, and `medical_specialty`)  
- Includes demographic, clinical, laboratory, medication, and hospital admission details

In [29]:
import pandas as pd

In [30]:
df = pd.read_csv("./data/raw/diabetic_data.csv")

df.head()

,encounter_id,patient_nbr,race,gender,age,weight,admission_type_id,discharge_disposition_id,admission_source_id,time_in_hospital,...,citoglipton,insulin,glyburide-metformin,glipizide-metformin,glimepiride-pioglitazone,metformin-rosiglitazone,metformin-pioglitazone,change,diabetesMed,readmitted
0,2278392,8222157,Caucasian,Female,[0-10),?,6,25,1,1,...,No,No,No,No,No,No,No,No,No,NO
1,149190,55629189,Caucasian,Female,[10-20),?,1,1,7,3,...,No,Up,No,No,No,No,No,Ch,Yes,>30
2,64410,86047875,AfricanAmerican,Female,[20-30),?,1,1,7,2,...,No,No,No,No,No,No,No,No,Yes,NO
3,500364,82442376,Caucasian,Male,[30-40),?,1,1,7,2,...,No,Up,No,No,No,No,No,Ch,Yes,NO
4,16680,42519267,Caucasian,Male,[40-50),?,1,1,7,1,...,No,Steady,No,No,No,No,No,Ch,Yes,NO


In [31]:
df.shape

(101766, 50)

In [32]:
# Replace the "?" placeholder with NA so pandas counts those cells as missing

df.replace('?', pd.NA, inplace=True)
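The same result can also be reached at load time. The sketch below is an alternative to the cell above, not the notebook's own method: it passes `na_values` to `pd.read_csv` so that `?` is parsed as missing while the file is read. The three rows in `sample_csv` are made up for illustration and stand in for `diabetic_data.csv`.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for diabetic_data.csv (illustrative rows only).
sample_csv = io.StringIO(
    "encounter_id,race,weight,payer_code\n"
    "1,Caucasian,?,MC\n"
    "2,?,?,?\n"
    "3,AfricanAmerican,[75-100),?\n"
)

# na_values tells read_csv to treat "?" as a missing-value marker.
sample = pd.read_csv(sample_csv, na_values=["?"])

# Count missing cells per column, as df.isnull().sum() does on the full data.
missing_per_column = sample.isnull().sum()
print(missing_per_column)
```

Declaring the marker at load time keeps the raw-to-missing mapping in one place, which may be convenient if the file is re-read in later notebooks.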

In [33]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 101766 entries, 0 to 101765
Data columns (total 50 columns):
 #   Column                    Non-Null Count   Dtype 
---  ------                    --------------   ----- 
 0   encounter_id              101766 non-null  int64 
 1   patient_nbr               101766 non-null  int64 
 2   race                      99493 non-null   object
 3   gender                    101766 non-null  object
 4   age                       101766 non-null  object
 5   weight                    3197 non-null    object
 6   admission_type_id         101766 non-null  int64 
 7   discharge_disposition_id  101766 non-null  int64 
 8   admission_source_id       101766 non-null  int64 
 9   time_in_hospital          101766 non-null  int64 
 10  payer_code                61510 non-null   object
 11  medical_specialty         51817 non-null   object
 12  num_lab_procedures        101766 non-null  int64 
 13  num_procedures            101766 non-null  int64 
 14  num_

In [34]:
df.isnull().sum()

encounter_id                    0
patient_nbr                     0
race                         2273
gender                          0
age                             0
weight                      98569
admission_type_id               0
discharge_disposition_id        0
admission_source_id             0
time_in_hospital                0
payer_code                  40256
medical_specialty           49949
num_lab_procedures              0
num_procedures                  0
num_medications                 0
number_outpatient               0
number_emergency                0
number_inpatient                0
diag_1                         21
diag_2                        358
diag_3                       1423
number_diagnoses                0
max_glu_serum               96420
A1Cresult                   84748
metformin                       0
repaglinide                     0
nateglinide                     0
chlorpropamide                  0
glimepiride                     0
acetohexamide 
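Raw counts are easier to judge as a share of all rows. The following sketch converts the non-zero counts reported above into percentages; the numbers are copied from the output and `n_rows` comes from `df.shape`. On the full frame, `df.isnull().mean() * 100` should give the same figures without retyping anything.

```python
import pandas as pd

# Missing-value counts copied from the df.isnull().sum() output above.
missing_counts = pd.Series({
    "weight": 98569,
    "max_glu_serum": 96420,
    "A1Cresult": 84748,
    "medical_specialty": 49949,
    "payer_code": 40256,
    "race": 2273,
    "diag_3": 1423,
    "diag_2": 358,
    "diag_1": 21,
})
n_rows = 101766  # number of encounters, from df.shape

# Percentage of rows with a missing value, largest first.
missing_pct = (missing_counts / n_rows * 100).round(2).sort_values(ascending=False)
print(missing_pct)
```

`weight` is missing in roughly 97% of encounters, so it is unlikely to be usable as a feature without heavy imputation. One thing worth checking against the raw file: the high counts for `max_glu_serum` and `A1Cresult` may come from the literal string `None` (test not performed) being parsed as missing by recent pandas versions, rather than from `?` placeholders.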

In [None]:
# Save the cleaned CSV from Day 25 (assumes the ./data/clean/ directory already exists)

df.to_csv("./data/clean/day-25.csv", index=False)

## Target Variable

The target variable for this classification problem is "readmitted".  
It indicates whether a diabetic patient was readmitted to the hospital after discharge.

The original categories are:
- <30  → Readmitted within 30 days  
- \>30  → Readmitted after 30 days  
- NO   → No record of readmission  

For this project, the problem will be treated as a binary classification task focusing on predicting 30-day readmission risk.

In [36]:
df['readmitted'].value_counts()

readmitted
NO     54864
>30    35545
<30    11357
Name: count, dtype: int64
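To see how the classes are balanced, the counts above can be turned into relative frequencies. This is a small sketch using the counts copied from the output; on the full frame, `df['readmitted'].value_counts(normalize=True)` should give the same shares directly.

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({"NO": 54864, ">30": 35545, "<30": 11357}, name="count")

# Relative frequency of each class.
shares = counts / counts.sum()
print(shares.round(4))

# Share of encounters followed by a readmission within 30 days.
positive_rate = shares["<30"]
```

Only about 11% of encounters fall in the `<30` class, so the binary task is imbalanced and plain accuracy would be a misleading metric on its own.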

## Type of Machine Learning Problem

This is a supervised learning problem, since every encounter in the dataset carries a labeled readmission outcome.

It is treated as a binary classification task in which the model predicts:

- 1 → Patient readmitted within 30 days (<30)  
- 0 → Patient not readmitted within 30 days (>30 or NO)

The objective is to identify high-risk diabetic patients who are likely to be readmitted within 30 days after discharge.
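The encoding described above can be sketched as follows. The frame `demo` and the column name `readmitted_30d` are hypothetical and used only for illustration; in the notebook the same expression would be applied to `df['readmitted']`.

```python
import pandas as pd

# Small illustrative frame; in the notebook this would be the full df.
demo = pd.DataFrame({"readmitted": ["NO", ">30", "<30", "NO", "<30"]})

# 1 if readmitted within 30 days, 0 otherwise (">30" and "NO" are merged).
# The column name readmitted_30d is a hypothetical choice.
demo["readmitted_30d"] = (demo["readmitted"] == "<30").astype(int)

print(demo["readmitted_30d"].tolist())  # [0, 0, 1, 0, 1]
```

Comparing against `"<30"` rather than mapping each label by hand means any unexpected label would fall into class 0, so it may be worth checking the set of unique labels first.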