In [15]:
import pandas as pd

df = pd.read_csv("../data/raw/diabetic_data.csv")
df.shape
df.head()
df["readmitted"].value_counts(normalize=True)

readmitted
NO     0.539119
>30    0.349282
<30    0.111599
Name: proportion, dtype: float64

### Target Imbalance

The positive class (readmission within 30 days, `<30`) makes up about **11.2%** of the dataset, while the negative class (no readmission, `NO`, or readmission after 30 days, `>30`) accounts for the remaining **88.8%**. This is a **strong class imbalance**: roughly one encounter in nine is a 30-day readmission.

### Implications for Evaluation

Due to this imbalance, **accuracy alone is not a meaningful evaluation metric**: a model that always predicts no 30-day readmission would reach about 88.8% accuracy while identifying none of the readmitted patients. Evaluation should instead focus on metrics that capture performance on the minority class, such as **recall**, **precision**, **ROC-AUC**, and **precision–recall curves**. In this clinical context, prioritizing **recall** is especially important to minimize missed high-risk patients.
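
The accuracy trap can be checked with simple arithmetic. The sketch below is illustrative only: it does not use the real data, but builds a hypothetical toy label vector (`n_total` and `n_positive` are assumed values chosen to approximate the proportions above) and scores a baseline that always predicts the negative class.

```python
# Hypothetical toy labels: 10,000 encounters with an 11.16% positive rate,
# chosen to approximate the class proportions reported above.
n_total = 10_000
n_positive = 1_116
y_true = [1] * n_positive + [0] * (n_total - n_positive)

# Trivial baseline: always predict "no 30-day readmission".
y_pred = [0] * n_total

# Count the outcomes needed for accuracy and recall.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
false_negatives = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
correct = sum(t == p for t, p in zip(y_true, y_pred))

accuracy = correct / n_total
recall = true_positives / (true_positives + false_negatives)

print(f"accuracy: {accuracy:.4f}")  # accuracy: 0.8884
print(f"recall:   {recall:.4f}")    # recall:   0.0000
```

The baseline looks strong on accuracy yet has zero recall, which is why the minority-class metrics are the ones to watch.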


In [16]:
df["readmitted_30"] = (df["readmitted"] == "<30").astype(int)
df = df.drop(columns=["readmitted"])
df["readmitted_30"].value_counts(normalize=True)

readmitted_30
0    0.888401
1    0.111599
Name: proportion, dtype: float64

### Label Definition

A 30-day readmission (`<30`) is treated as the **positive class**, while all other outcomes (`NO` and `>30`) are treated as **negative**. This framing reflects the clinical cost of **false negatives**: failing to flag a high-risk patient means a missed opportunity to intervene, which could lead to preventable readmissions and worse patient outcomes.
