In [24]:
import pandas as pd

df = pd.read_csv("../data/raw/diabetic_data.csv")
df.shape
df.head()
df["readmitted"].value_counts(normalize=True)

readmitted
NO     0.539119
>30    0.349282
<30    0.111599
Name: proportion, dtype: float64

### Target Imbalance

The positive class (30-day readmission) represents approximately **11%** of the dataset, while the negative class (no readmission or readmission after 30 days) accounts for over **88%**. This indicates a **strong class imbalance**, with readmissions being relatively rare events.

### Implications for Evaluation

Due to this imbalance, **accuracy alone is not a meaningful evaluation metric**, as a model could achieve high accuracy by always predicting no readmission. Instead, evaluation should focus on metrics that better capture performance on the minority class, such as **recall**, **precision**, **ROC-AUC**, and **precision–recall curves**. In this clinical context, prioritizing **recall** is especially important to minimize missed high-risk patients.
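
To make the accuracy argument concrete, the sketch below scores a trivial classifier that always predicts "no 30-day readmission". It is an illustration, not part of the notebook's pipeline: the helper `majority_baseline_metrics` is a hypothetical name, and the labels are reconstructed from the printed class proportion and the 101,766-row count rather than read from the CSV.

```python
def majority_baseline_metrics(y_true):
    """Score a classifier that always predicts 0 (no 30-day readmission)."""
    n = len(y_true)
    n_pos = sum(y_true)
    # Every negative is classified correctly; every positive is missed.
    accuracy = (n - n_pos) / n
    recall = 0.0 if n_pos else float("nan")
    return accuracy, recall


# Assumption: labels rebuilt from the printed proportion (0.111599) and the
# dataset's row count (101766), not loaded from the real file.
n_rows = 101766
n_pos = round(0.111599 * n_rows)
y_true = [1] * n_pos + [0] * (n_rows - n_pos)

accuracy, recall = majority_baseline_metrics(y_true)
print(f"accuracy={accuracy:.3f}, recall={recall:.3f}")
```

The baseline reaches roughly 89% accuracy while identifying none of the readmitted patients, which is why recall and precision on the positive class are more informative here. On real predictions, `sklearn.metrics.accuracy_score` and `sklearn.metrics.recall_score` compute the same two quantities.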


In [25]:
df["readmitted_30"] = (df["readmitted"] == "<30").astype(int)
df = df.drop(columns=["readmitted"])
df["readmitted_30"].value_counts(normalize=True)

readmitted_30
0    0.888401
1    0.111599
Name: proportion, dtype: float64

### Label Definition

A 30-day readmission (`<30`) is treated as the **positive class**, while all other outcomes are treated as **negative**. This framing reflects the clinical cost of **false negatives**, where failing to identify a high-risk patient could lead to preventable readmissions and worse patient outcomes.



In [26]:
df.columns

Index(['encounter_id', 'patient_nbr', 'race', 'gender', 'age', 'weight',
       'admission_type_id', 'discharge_disposition_id', 'admission_source_id',
       'time_in_hospital', 'payer_code', 'medical_specialty',
       'num_lab_procedures', 'num_procedures', 'num_medications',
       'number_outpatient', 'number_emergency', 'number_inpatient', 'diag_1',
       'diag_2', 'diag_3', 'number_diagnoses', 'max_glu_serum', 'A1Cresult',
       'metformin', 'repaglinide', 'nateglinide', 'chlorpropamide',
       'glimepiride', 'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide',
       'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone',
       'tolazamide', 'examide', 'citoglipton', 'insulin',
       'glyburide-metformin', 'glipizide-metformin',
       'glimepiride-pioglitazone', 'metformin-rosiglitazone',
       'metformin-pioglitazone', 'change', 'diabetesMed', 'readmitted_30'],
      dtype='object')

In [27]:
potential_leakage_cols = [
    "diag_1",
    "diag_2",
    "diag_3",
    "encounter_id",  # identifiers: the model could memorize IDs instead of learning generalizable patterns
    "patient_nbr",
    "discharge_disposition_id",
    # add any other columns that could leak outcome information
]

### Data Leakage Considerations

Several features were reviewed for potential data leakage. The identifier columns `encounter_id` and `patient_nbr` were removed so the model cannot memorize individual encounters or patients. The diagnosis codes (`diag_1`, `diag_2`, `diag_3`) and `discharge_disposition_id` were excluded because they correlate strongly with downstream outcomes and may encode post-treatment severity. Only features reasonably available at discharge time were retained, so that evaluation reflects what the model would see in practice.


In [None]:
target = "readmitted_30"
print(len(df.columns))  # total column count before dropping anything
X = df.drop(columns=potential_leakage_cols + [target])
y = df[target]
# Sanity check: X should have 50 - 6 leakage columns - 1 target = 43 columns
print("X shape:", X.shape)
print("y shape:", y.shape)

50
X shape: (101766, 43)
y shape: (101766,)
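
The shape check confirms how many columns were dropped, but not which ones. A small guard along the following lines could make the leakage review explicit. This is a minimal sketch: `assert_no_leakage` is a hypothetical helper, not something defined in the notebook, and the column lists in the demo are toy stand-ins.

```python
def assert_no_leakage(feature_columns, leakage_cols, target):
    """Raise if any excluded column or the target is still among the features."""
    forbidden = set(leakage_cols) | {target}
    leaked = sorted(forbidden.intersection(feature_columns))
    if leaked:
        raise ValueError(f"Columns that should have been dropped: {leaked}")


# Toy stand-ins for X.columns and potential_leakage_cols (illustrative only).
demo_features = ["race", "gender", "age", "time_in_hospital"]
demo_leakage = ["encounter_id", "patient_nbr", "discharge_disposition_id"]

assert_no_leakage(demo_features, demo_leakage, "readmitted_30")
print("no leakage columns found")
```

In the notebook itself the call would be `assert_no_leakage(X.columns, potential_leakage_cols, target)`, since the helper accepts any iterable of column names.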
