## Problem: Predict the risk of patient readmission within 30 days post-discharge.

### Objectives:
- Reduce avoidable readmissions.
- Support physicians with early intervention recommendations.
- Optimize hospital resource allocation.

### Stakeholders:
- Doctors and medical staff
- Hospital administrators

## Data Strategy
### Data Sources:
- Electronic Health Records (EHRs): lab results, vitals, medications
- Patient demographics and prior admission history

### Two Ethical Concerns:
- **Patient Privacy**: Sensitive health data must be protected (HIPAA-compliant handling).
- **Bias**: Historical disparities (e.g., based on insurance or race) may affect prediction fairness.

### Preprocessing Pipeline:
- **1. Data Cleaning**: Handle missing vitals/lab data (e.g., impute using median).
- **2. Feature Engineering**:
    - Calculate time since last admission
    - Number of comorbidities
- **3. Encoding**:
    - One-hot encode categorical features
    - Scale numeric features using MinMaxScaler
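
The two engineered features listed above can be sketched on a toy table. This is only an illustration under stated assumptions: the `patient_id` and `comorbidities` columns are hypothetical and do not exist in the dataset loaded below, and the `;`-separated encoding of comorbidities is made up for the example.

```python
import pandas as pd

# Hypothetical toy admissions table; patient_id and comorbidities are
# illustrative columns, not fields of the dataset used in this notebook.
toy = pd.DataFrame({
    "patient_id": [1, 1, 2],
    "date_of_admission": pd.to_datetime(["2024-01-01", "2024-01-20", "2024-03-05"]),
    "comorbidities": ["diabetes;hypertension", "diabetes", ""],
})

toy = toy.sort_values(["patient_id", "date_of_admission"])

# Time since last admission: gap in days to the same patient's previous admission.
# A patient's first admission has no predecessor and stays NaN.
prev_admission = toy.groupby("patient_id")["date_of_admission"].shift(1)
toy["days_since_last_admission"] = (toy["date_of_admission"] - prev_admission).dt.days

# Number of comorbidities: count the non-empty entries of the ';'-separated string.
toy["n_comorbidities"] = toy["comorbidities"].apply(
    lambda s: len([c for c in s.split(";") if c])
)

print(toy[["patient_id", "days_since_last_admission", "n_comorbidities"]])
```

First admissions keep a missing value here, so the median imputation from step 1 (or an explicit "no prior admission" flag) would still have to handle them.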


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import janitor

# %matplotlib inline
sns.set_theme(style='whitegrid')

In [2]:
patients = pd.read_csv('../data/healthcare_dataset.csv', sep=",")
patients = janitor.clean_names(patients)
patients

Unnamed: 0,name,age,gender,blood_type,medical_condition,date_of_admission,doctor,hospital,insurance_provider,billing_amount,room_number,admission_type,discharge_date,medication,test_results
0,Bobby JacksOn,30,Male,B-,Cancer,2024-01-31,Matthew Smith,Sons and Miller,Blue Cross,18856.281306,328,Urgent,2024-02-02,Paracetamol,Normal
1,LesLie TErRy,62,Male,A+,Obesity,2019-08-20,Samantha Davies,Kim Inc,Medicare,33643.327287,265,Emergency,2019-08-26,Ibuprofen,Inconclusive
2,DaNnY sMitH,76,Female,A-,Obesity,2022-09-22,Tiffany Mitchell,Cook PLC,Aetna,27955.096079,205,Emergency,2022-10-07,Aspirin,Normal
3,andrEw waTtS,28,Female,O+,Diabetes,2020-11-18,Kevin Wells,"Hernandez Rogers and Vang,",Medicare,37909.782410,450,Elective,2020-12-18,Ibuprofen,Abnormal
4,adrIENNE bEll,43,Female,AB+,Cancer,2022-09-19,Kathleen Hanna,White-White,Aetna,14238.317814,458,Urgent,2022-10-09,Penicillin,Abnormal
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
55495,eLIZABeTH jaCkSOn,42,Female,O+,Asthma,2020-08-16,Joshua Jarvis,Jones-Thompson,Blue Cross,2650.714952,417,Elective,2020-09-15,Penicillin,Abnormal
55496,KYle pEREz,61,Female,AB-,Obesity,2020-01-23,Taylor Sullivan,Tucker-Moyer,Cigna,31457.797307,316,Elective,2020-02-01,Aspirin,Normal
55497,HEATher WaNG,38,Female,B+,Hypertension,2020-07-13,Joe Jacobs DVM,"and Mahoney Johnson Vasquez,",UnitedHealthcare,27620.764717,347,Urgent,2020-08-10,Ibuprofen,Abnormal
55498,JENniFER JOneS,43,Male,O-,Arthritis,2019-05-25,Kimberly Curry,"Jackson Todd and Castro,",Medicare,32451.092358,321,Elective,2019-05-31,Ibuprofen,Abnormal


In [3]:
patients.columns.to_list()

['name',
 'age',
 'gender',
 'blood_type',
 'medical_condition',
 'date_of_admission',
 'doctor',
 'hospital',
 'insurance_provider',
 'billing_amount',
 'room_number',
 'admission_type',
 'discharge_date',
 'medication',
 'test_results']

In [4]:
patients['duplicate_count'] = patients.groupby(['name'])['name'].transform('count')
patients

Unnamed: 0,name,age,gender,blood_type,medical_condition,date_of_admission,doctor,hospital,insurance_provider,billing_amount,room_number,admission_type,discharge_date,medication,test_results,duplicate_count
0,Bobby JacksOn,30,Male,B-,Cancer,2024-01-31,Matthew Smith,Sons and Miller,Blue Cross,18856.281306,328,Urgent,2024-02-02,Paracetamol,Normal,1
1,LesLie TErRy,62,Male,A+,Obesity,2019-08-20,Samantha Davies,Kim Inc,Medicare,33643.327287,265,Emergency,2019-08-26,Ibuprofen,Inconclusive,1
2,DaNnY sMitH,76,Female,A-,Obesity,2022-09-22,Tiffany Mitchell,Cook PLC,Aetna,27955.096079,205,Emergency,2022-10-07,Aspirin,Normal,1
3,andrEw waTtS,28,Female,O+,Diabetes,2020-11-18,Kevin Wells,"Hernandez Rogers and Vang,",Medicare,37909.782410,450,Elective,2020-12-18,Ibuprofen,Abnormal,1
4,adrIENNE bEll,43,Female,AB+,Cancer,2022-09-19,Kathleen Hanna,White-White,Aetna,14238.317814,458,Urgent,2022-10-09,Penicillin,Abnormal,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
55495,eLIZABeTH jaCkSOn,42,Female,O+,Asthma,2020-08-16,Joshua Jarvis,Jones-Thompson,Blue Cross,2650.714952,417,Elective,2020-09-15,Penicillin,Abnormal,2
55496,KYle pEREz,61,Female,AB-,Obesity,2020-01-23,Taylor Sullivan,Tucker-Moyer,Cigna,31457.797307,316,Elective,2020-02-01,Aspirin,Normal,2
55497,HEATher WaNG,38,Female,B+,Hypertension,2020-07-13,Joe Jacobs DVM,"and Mahoney Johnson Vasquez,",UnitedHealthcare,27620.764717,347,Urgent,2020-08-10,Ibuprofen,Abnormal,2
55498,JENniFER JOneS,43,Male,O-,Arthritis,2019-05-25,Kimberly Curry,"Jackson Todd and Castro,",Medicare,32451.092358,321,Elective,2019-05-31,Ibuprofen,Abnormal,2


In [5]:
patients.columns.tolist()

['name',
 'age',
 'gender',
 'blood_type',
 'medical_condition',
 'date_of_admission',
 'doctor',
 'hospital',
 'insurance_provider',
 'billing_amount',
 'room_number',
 'admission_type',
 'discharge_date',
 'medication',
 'test_results',
 'duplicate_count']

In [6]:
# adrIENNE bEll
patients[patients['name'] == 'adrIENNE bEll']

Unnamed: 0,name,age,gender,blood_type,medical_condition,date_of_admission,doctor,hospital,insurance_provider,billing_amount,room_number,admission_type,discharge_date,medication,test_results,duplicate_count
4,adrIENNE bEll,43,Female,AB+,Cancer,2022-09-19,Kathleen Hanna,White-White,Aetna,14238.317814,458,Urgent,2022-10-09,Penicillin,Abnormal,2
50144,adrIENNE bEll,44,Female,AB+,Cancer,2022-09-19,Kathleen Hanna,White-White,Aetna,14238.317814,458,Urgent,2022-10-09,Penicillin,Abnormal,2
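
The two rows above agree on every column except `age` (43 vs. 44), which suggests one stay recorded twice rather than two admissions. A minimal sketch of one way to collapse such rows follows. It assumes that records matching on everything except age describe the same stay, and that normalising name casing is acceptable; neither assumption is verified against the dataset.

```python
import pandas as pd

# Toy rows mirroring the pattern above: the first two differ only in age.
toy = pd.DataFrame({
    "name": ["adrIENNE bEll", "adrIENNE bEll", "Bobby JacksOn"],
    "age": [43, 44, 30],
    "date_of_admission": ["2022-09-19", "2022-09-19", "2024-01-31"],
    "discharge_date": ["2022-10-09", "2022-10-09", "2024-02-02"],
})

# Normalise casing so that differently cased spellings of a name compare equal.
toy["name"] = toy["name"].str.strip().str.title()

# Treat rows that match on every column except age as one stay; keep the first.
key_cols = [c for c in toy.columns if c != "age"]
deduped = toy.drop_duplicates(subset=key_cols, keep="first")

print(deduped)
```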


In [7]:
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score

In [8]:
# Ensure dates are datetime
patients['date_of_admission'] = pd.to_datetime(patients['date_of_admission'])
patients['discharge_date'] = pd.to_datetime(patients['discharge_date'])
patients.head()

Unnamed: 0,name,age,gender,blood_type,medical_condition,date_of_admission,doctor,hospital,insurance_provider,billing_amount,room_number,admission_type,discharge_date,medication,test_results,duplicate_count
0,Bobby JacksOn,30,Male,B-,Cancer,2024-01-31,Matthew Smith,Sons and Miller,Blue Cross,18856.281306,328,Urgent,2024-02-02,Paracetamol,Normal,1
1,LesLie TErRy,62,Male,A+,Obesity,2019-08-20,Samantha Davies,Kim Inc,Medicare,33643.327287,265,Emergency,2019-08-26,Ibuprofen,Inconclusive,1
2,DaNnY sMitH,76,Female,A-,Obesity,2022-09-22,Tiffany Mitchell,Cook PLC,Aetna,27955.096079,205,Emergency,2022-10-07,Aspirin,Normal,1
3,andrEw waTtS,28,Female,O+,Diabetes,2020-11-18,Kevin Wells,"Hernandez Rogers and Vang,",Medicare,37909.78241,450,Elective,2020-12-18,Ibuprofen,Abnormal,1
4,adrIENNE bEll,43,Female,AB+,Cancer,2022-09-19,Kathleen Hanna,White-White,Aetna,14238.317814,458,Urgent,2022-10-09,Penicillin,Abnormal,2


In [9]:
# Sort by patient and admission date
patients = patients.sort_values(['name', 'date_of_admission'])

# Create a column for the next admission date for each patient
patients['next_admission'] = patients.groupby('name')['date_of_admission'].shift(-1)
patients

Unnamed: 0,name,age,gender,blood_type,medical_condition,date_of_admission,doctor,hospital,insurance_provider,billing_amount,room_number,admission_type,discharge_date,medication,test_results,duplicate_count,next_admission
28163,AARON DuncAn,22,Male,AB-,Obesity,2019-07-24,Ryan Perry,Welch-Yang,UnitedHealthcare,39906.147308,279,Urgent,2019-08-04,Paracetamol,Abnormal,1,NaT
4570,AARON HicKS,76,Female,A+,Arthritis,2022-03-02,Michael Butler,"Rasmussen Patrick and Newman,",Cigna,10584.185945,187,Elective,2022-03-15,Lipitor,Inconclusive,1,NaT
35390,AARON bAldWIN Jr.,20,Male,O-,Hypertension,2020-10-10,Amy Farley,"Flores Friedman and White,",Medicare,29740.960199,104,Urgent,2020-11-05,Paracetamol,Abnormal,1,NaT
45817,AARON hAWkIns,69,Female,B-,Diabetes,2019-10-17,Kimberly York,"Harris, Hernandez and Vazquez",Aetna,21535.554758,206,Urgent,2019-10-26,Penicillin,Abnormal,1,NaT
20288,AAROn HaRt,18,Male,B-,Cancer,2021-01-13,Sharon Morrison,"Fox Guzman James, and",Aetna,13895.551020,260,Emergency,2021-01-20,Paracetamol,Abnormal,1,NaT
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
35911,zachaRy oDOM,52,Female,O+,Diabetes,2021-07-08,Sherry Durham,Group Turner,Blue Cross,48301.353787,326,Urgent,2021-07-10,Paracetamol,Abnormal,1,NaT
40197,zachaRy raMirEZ,58,Male,AB+,Asthma,2019-06-30,Heather Chen,"and Waters, Williams Daugherty",Aetna,29508.124922,205,Emergency,2019-07-20,Aspirin,Inconclusive,1,NaT
1902,zacharY BauTista,43,Female,AB+,Cancer,2020-08-21,Scott Bell,Rodgers Inc,Medicare,9988.199830,336,Urgent,2020-08-25,Aspirin,Inconclusive,2,2020-08-21
50727,zacharY BauTista,46,Female,AB+,Cancer,2020-08-21,Scott Bell,Rodgers Inc,Medicare,9988.199830,336,Urgent,2020-08-25,Aspirin,Inconclusive,2,NaT


In [10]:
# Calculate days until next admission
patients['days_until_next_admission'] = (patients['next_admission'] - patients['discharge_date']).dt.days
patients

Unnamed: 0,name,age,gender,blood_type,medical_condition,date_of_admission,doctor,hospital,insurance_provider,billing_amount,room_number,admission_type,discharge_date,medication,test_results,duplicate_count,next_admission,days_until_next_admission
28163,AARON DuncAn,22,Male,AB-,Obesity,2019-07-24,Ryan Perry,Welch-Yang,UnitedHealthcare,39906.147308,279,Urgent,2019-08-04,Paracetamol,Abnormal,1,NaT,
4570,AARON HicKS,76,Female,A+,Arthritis,2022-03-02,Michael Butler,"Rasmussen Patrick and Newman,",Cigna,10584.185945,187,Elective,2022-03-15,Lipitor,Inconclusive,1,NaT,
35390,AARON bAldWIN Jr.,20,Male,O-,Hypertension,2020-10-10,Amy Farley,"Flores Friedman and White,",Medicare,29740.960199,104,Urgent,2020-11-05,Paracetamol,Abnormal,1,NaT,
45817,AARON hAWkIns,69,Female,B-,Diabetes,2019-10-17,Kimberly York,"Harris, Hernandez and Vazquez",Aetna,21535.554758,206,Urgent,2019-10-26,Penicillin,Abnormal,1,NaT,
20288,AAROn HaRt,18,Male,B-,Cancer,2021-01-13,Sharon Morrison,"Fox Guzman James, and",Aetna,13895.551020,260,Emergency,2021-01-20,Paracetamol,Abnormal,1,NaT,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
35911,zachaRy oDOM,52,Female,O+,Diabetes,2021-07-08,Sherry Durham,Group Turner,Blue Cross,48301.353787,326,Urgent,2021-07-10,Paracetamol,Abnormal,1,NaT,
40197,zachaRy raMirEZ,58,Male,AB+,Asthma,2019-06-30,Heather Chen,"and Waters, Williams Daugherty",Aetna,29508.124922,205,Emergency,2019-07-20,Aspirin,Inconclusive,1,NaT,
1902,zacharY BauTista,43,Female,AB+,Cancer,2020-08-21,Scott Bell,Rodgers Inc,Medicare,9988.199830,336,Urgent,2020-08-25,Aspirin,Inconclusive,2,2020-08-21,-4.0
50727,zacharY BauTista,46,Female,AB+,Cancer,2020-08-21,Scott Bell,Rodgers Inc,Medicare,9988.199830,336,Urgent,2020-08-25,Aspirin,Inconclusive,2,NaT,


In [11]:
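# NOTE: abs() turns negative gaps (a "next admission" dated before this discharge,
# as in the duplicated zacharY BauTista rows) into positive ones, so such
# duplicated records are later counted as readmissions.
patients['days_until_next_admission'] = patients['days_until_next_admission'].abs()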
patients

Unnamed: 0,name,age,gender,blood_type,medical_condition,date_of_admission,doctor,hospital,insurance_provider,billing_amount,room_number,admission_type,discharge_date,medication,test_results,duplicate_count,next_admission,days_until_next_admission
28163,AARON DuncAn,22,Male,AB-,Obesity,2019-07-24,Ryan Perry,Welch-Yang,UnitedHealthcare,39906.147308,279,Urgent,2019-08-04,Paracetamol,Abnormal,1,NaT,
4570,AARON HicKS,76,Female,A+,Arthritis,2022-03-02,Michael Butler,"Rasmussen Patrick and Newman,",Cigna,10584.185945,187,Elective,2022-03-15,Lipitor,Inconclusive,1,NaT,
35390,AARON bAldWIN Jr.,20,Male,O-,Hypertension,2020-10-10,Amy Farley,"Flores Friedman and White,",Medicare,29740.960199,104,Urgent,2020-11-05,Paracetamol,Abnormal,1,NaT,
45817,AARON hAWkIns,69,Female,B-,Diabetes,2019-10-17,Kimberly York,"Harris, Hernandez and Vazquez",Aetna,21535.554758,206,Urgent,2019-10-26,Penicillin,Abnormal,1,NaT,
20288,AAROn HaRt,18,Male,B-,Cancer,2021-01-13,Sharon Morrison,"Fox Guzman James, and",Aetna,13895.551020,260,Emergency,2021-01-20,Paracetamol,Abnormal,1,NaT,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
35911,zachaRy oDOM,52,Female,O+,Diabetes,2021-07-08,Sherry Durham,Group Turner,Blue Cross,48301.353787,326,Urgent,2021-07-10,Paracetamol,Abnormal,1,NaT,
40197,zachaRy raMirEZ,58,Male,AB+,Asthma,2019-06-30,Heather Chen,"and Waters, Williams Daugherty",Aetna,29508.124922,205,Emergency,2019-07-20,Aspirin,Inconclusive,1,NaT,
1902,zacharY BauTista,43,Female,AB+,Cancer,2020-08-21,Scott Bell,Rodgers Inc,Medicare,9988.199830,336,Urgent,2020-08-25,Aspirin,Inconclusive,2,2020-08-21,4.0
50727,zacharY BauTista,46,Female,AB+,Cancer,2020-08-21,Scott Bell,Rodgers Inc,Medicare,9988.199830,336,Urgent,2020-08-25,Aspirin,Inconclusive,2,NaT,
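
A gap of -4 days means the "next admission" starts before the current discharge, which is what a duplicated record looks like. An alternative to taking the absolute value is to treat negative gaps as missing, so they cannot end up in the 30-day flag. The sketch below uses a toy series (only the -4 comes from the output above; the other values are made up) and is not the approach taken in this notebook.

```python
import pandas as pd

# Toy gaps in days between discharge and the next recorded admission.
# -4 mirrors the duplicated record shown above; 12 and None are made up.
gaps = pd.Series([-4.0, 12.0, None], name="days_until_next_admission")

# Keep only non-negative gaps; negative ones become NaN instead of positive.
valid_gaps = gaps.where(gaps >= 0)

# A 30-day flag built from valid_gaps ignores the overlapping record,
# because NaN never falls between 0 and 30.
readmitted_30d = valid_gaps.between(0, 30).astype(int)

print(pd.DataFrame({"gap": gaps, "valid_gap": valid_gaps, "readmitted_30d": readmitted_30d}))
```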


In [12]:
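# Binary target: 1 if the next admission falls within 30 days of discharge, else 0.
# Rows with no later admission (NaN gap) compare False and therefore get 0.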
patients['readmitted_30d'] = ((patients['days_until_next_admission'] >= 0) & 
                              (patients['days_until_next_admission'] <= 30)).astype(int)
patients

Unnamed: 0,name,age,gender,blood_type,medical_condition,date_of_admission,doctor,hospital,insurance_provider,billing_amount,room_number,admission_type,discharge_date,medication,test_results,duplicate_count,next_admission,days_until_next_admission,readmitted_30d
28163,AARON DuncAn,22,Male,AB-,Obesity,2019-07-24,Ryan Perry,Welch-Yang,UnitedHealthcare,39906.147308,279,Urgent,2019-08-04,Paracetamol,Abnormal,1,NaT,,0
4570,AARON HicKS,76,Female,A+,Arthritis,2022-03-02,Michael Butler,"Rasmussen Patrick and Newman,",Cigna,10584.185945,187,Elective,2022-03-15,Lipitor,Inconclusive,1,NaT,,0
35390,AARON bAldWIN Jr.,20,Male,O-,Hypertension,2020-10-10,Amy Farley,"Flores Friedman and White,",Medicare,29740.960199,104,Urgent,2020-11-05,Paracetamol,Abnormal,1,NaT,,0
45817,AARON hAWkIns,69,Female,B-,Diabetes,2019-10-17,Kimberly York,"Harris, Hernandez and Vazquez",Aetna,21535.554758,206,Urgent,2019-10-26,Penicillin,Abnormal,1,NaT,,0
20288,AAROn HaRt,18,Male,B-,Cancer,2021-01-13,Sharon Morrison,"Fox Guzman James, and",Aetna,13895.551020,260,Emergency,2021-01-20,Paracetamol,Abnormal,1,NaT,,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
35911,zachaRy oDOM,52,Female,O+,Diabetes,2021-07-08,Sherry Durham,Group Turner,Blue Cross,48301.353787,326,Urgent,2021-07-10,Paracetamol,Abnormal,1,NaT,,0
40197,zachaRy raMirEZ,58,Male,AB+,Asthma,2019-06-30,Heather Chen,"and Waters, Williams Daugherty",Aetna,29508.124922,205,Emergency,2019-07-20,Aspirin,Inconclusive,1,NaT,,0
1902,zacharY BauTista,43,Female,AB+,Cancer,2020-08-21,Scott Bell,Rodgers Inc,Medicare,9988.199830,336,Urgent,2020-08-25,Aspirin,Inconclusive,2,2020-08-21,4.0,1
50727,zacharY BauTista,46,Female,AB+,Cancer,2020-08-21,Scott Bell,Rodgers Inc,Medicare,9988.199830,336,Urgent,2020-08-25,Aspirin,Inconclusive,2,NaT,,0


In [13]:
patients['admission_month'] = patients['date_of_admission'].dt.month
patients['admission_dayofweek'] = patients['date_of_admission'].dt.dayofweek

In [14]:
patients.columns.to_list()

['name',
 'age',
 'gender',
 'blood_type',
 'medical_condition',
 'date_of_admission',
 'doctor',
 'hospital',
 'insurance_provider',
 'billing_amount',
 'room_number',
 'admission_type',
 'discharge_date',
 'medication',
 'test_results',
 'duplicate_count',
 'next_admission',
 'days_until_next_admission',
 'readmitted_30d',
 'admission_month',
 'admission_dayofweek']

In [15]:
patients['length_of_stay'] = (patients['discharge_date'] - patients['date_of_admission']).dt.days

In [16]:
patients[['name', 'date_of_admission', 'discharge_date', 'readmitted_30d', 'admission_month',
 'admission_dayofweek', 'length_of_stay']]

Unnamed: 0,name,date_of_admission,discharge_date,readmitted_30d,admission_month,admission_dayofweek,length_of_stay
28163,AARON DuncAn,2019-07-24,2019-08-04,0,7,2,11
4570,AARON HicKS,2022-03-02,2022-03-15,0,3,2,13
35390,AARON bAldWIN Jr.,2020-10-10,2020-11-05,0,10,5,26
45817,AARON hAWkIns,2019-10-17,2019-10-26,0,10,3,9
20288,AAROn HaRt,2021-01-13,2021-01-20,0,1,2,7
...,...,...,...,...,...,...,...
35911,zachaRy oDOM,2021-07-08,2021-07-10,0,7,3,2
40197,zachaRy raMirEZ,2019-06-30,2019-07-20,0,6,6,20
1902,zacharY BauTista,2020-08-21,2020-08-25,1,8,4,4
50727,zacharY BauTista,2020-08-21,2020-08-25,0,8,4,4


In [17]:
# Identifier and helper columns to drop before modelling (previewed here first)
helper_columns = [
   'name', 'doctor', 'hospital', 'date_of_admission', 'discharge_date', 
   'duplicate_count', 'next_admission', 'days_until_next_admission'
]
patients[helper_columns].head()

Unnamed: 0,name,doctor,hospital,date_of_admission,discharge_date,duplicate_count,next_admission,days_until_next_admission
28163,AARON DuncAn,Ryan Perry,Welch-Yang,2019-07-24,2019-08-04,1,NaT,
4570,AARON HicKS,Michael Butler,"Rasmussen Patrick and Newman,",2022-03-02,2022-03-15,1,NaT,
35390,AARON bAldWIN Jr.,Amy Farley,"Flores Friedman and White,",2020-10-10,2020-11-05,1,NaT,
45817,AARON hAWkIns,Kimberly York,"Harris, Hernandez and Vazquez",2019-10-17,2019-10-26,1,NaT,
20288,AAROn HaRt,Sharon Morrison,"Fox Guzman James, and",2021-01-13,2021-01-20,1,NaT,


In [18]:
# Drop helper columns
patients = patients.drop(columns=helper_columns)
patients

Unnamed: 0,age,gender,blood_type,medical_condition,insurance_provider,billing_amount,room_number,admission_type,medication,test_results,readmitted_30d,admission_month,admission_dayofweek,length_of_stay
28163,22,Male,AB-,Obesity,UnitedHealthcare,39906.147308,279,Urgent,Paracetamol,Abnormal,0,7,2,11
4570,76,Female,A+,Arthritis,Cigna,10584.185945,187,Elective,Lipitor,Inconclusive,0,3,2,13
35390,20,Male,O-,Hypertension,Medicare,29740.960199,104,Urgent,Paracetamol,Abnormal,0,10,5,26
45817,69,Female,B-,Diabetes,Aetna,21535.554758,206,Urgent,Penicillin,Abnormal,0,10,3,9
20288,18,Male,B-,Cancer,Aetna,13895.551020,260,Emergency,Paracetamol,Abnormal,0,1,2,7
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
35911,52,Female,O+,Diabetes,Blue Cross,48301.353787,326,Urgent,Paracetamol,Abnormal,0,7,3,2
40197,58,Male,AB+,Asthma,Aetna,29508.124922,205,Emergency,Aspirin,Inconclusive,0,6,6,20
1902,43,Female,AB+,Cancer,Medicare,9988.199830,336,Urgent,Aspirin,Inconclusive,1,8,4,4
50727,46,Female,AB+,Cancer,Medicare,9988.199830,336,Urgent,Aspirin,Inconclusive,0,8,4,4


In [19]:
patients.columns.to_list()

['age',
 'gender',
 'blood_type',
 'medical_condition',
 'insurance_provider',
 'billing_amount',
 'room_number',
 'admission_type',
 'medication',
 'test_results',
 'readmitted_30d',
 'admission_month',
 'admission_dayofweek',
 'length_of_stay']

In [20]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score
# XGBClassifier is imported in the next cell inside a try/except guard,
# so a missing xgboost install does not break this cell.

In [21]:
# Optional: XGBoost (install if needed)
try:
    from xgboost import XGBClassifier
    xgb_installed = True
except ImportError:
    xgb_installed = False

# 1. Select columns
# selected_columns = [
#     'name', 'age', 'gender', 'blood_type', 'medical_condition', 'date_of_admission',
#     'doctor', 'hospital', 'insurance_provider', 'billing_amount', 'room_number',
#     'admission_type', 'discharge_date', 'medication', 'test_results', 'duplicate_count'
# ]
# patients_selected = patients[selected_columns].copy()

In [22]:
patients.info()

<class 'pandas.core.frame.DataFrame'>
Index: 55500 entries, 28163 to 45358
Data columns (total 14 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   age                  55500 non-null  int64  
 1   gender               55500 non-null  object 
 2   blood_type           55500 non-null  object 
 3   medical_condition    55500 non-null  object 
 4   insurance_provider   55500 non-null  object 
 5   billing_amount       55500 non-null  float64
 6   room_number          55500 non-null  int64  
 7   admission_type       55500 non-null  object 
 8   medication           55500 non-null  object 
 9   test_results         55500 non-null  object 
 10  readmitted_30d       55500 non-null  int64  
 11  admission_month      55500 non-null  int32  
 12  admission_dayofweek  55500 non-null  int32  
 13  length_of_stay       55500 non-null  int64  
dtypes: float64(1), int32(2), int64(4), object(7)
memory usage: 5.9+ MB


In [23]:
X = patients.drop(columns=['readmitted_30d'])
y = patients['readmitted_30d']

# 5. Identify numeric and categorical columns
num_cols = X.select_dtypes(exclude=["object"]).columns.tolist()
cat_cols = X.select_dtypes(include=["object"]).columns.tolist()

# 6. Preprocessing pipeline
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', MinMaxScaler())
])
cat_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])
preprocessor = ColumnTransformer([
    ('num', num_pipeline, num_cols),
    ('cat', cat_pipeline, cat_cols)
])

In [24]:
# 7. Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

In [25]:
X_train.columns

Index(['age', 'gender', 'blood_type', 'medical_condition',
       'insurance_provider', 'billing_amount', 'room_number', 'admission_type',
       'medication', 'test_results', 'admission_month', 'admission_dayofweek',
       'length_of_stay'],
      dtype='object')

In [26]:
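# Wrapping y_train in a DataFrame is only used here to inspect its column name.
# scikit-learn expects a 1-D target, so the fits below emit a
# DataConversionWarning (the `column_or_1d` lines in the outputs).
y_train = pd.DataFrame(y_train)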
y_train.columns

Index(['readmitted_30d'], dtype='object')

In [27]:
# 8. Fit and evaluate models

# Logistic Regression
log_reg = Pipeline([
    ('pre', preprocessor),
    ('clf', LogisticRegression(max_iter=1000))
])

log_reg.fit(X_train, y_train)
y_pred_log_reg = log_reg.predict(X_test)

print("Logistic Regression Results:")
print(classification_report(y_test, y_pred_log_reg))

  y = column_or_1d(y, warn=True)


Logistic Regression Results:
              precision    recall  f1-score   support

           0       0.90      1.00      0.95     10000
           1       0.00      0.00      0.00      1100

    accuracy                           0.90     11100
   macro avg       0.45      0.50      0.47     11100
weighted avg       0.81      0.90      0.85     11100



  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


In [28]:
# Random Forest
rf_clf = Pipeline([
    ('pre', preprocessor),
    ('clf', RandomForestClassifier(n_estimators=100, max_depth=None, random_state=42))
])

rf_clf.fit(X_train, y_train)
y_pred_rf_clf = rf_clf.predict(X_test)
print("Random Forest Results:")
print(classification_report(y_test, y_pred_rf_clf))

  return fit_method(estimator, *args, **kwargs)


Random Forest Results:
              precision    recall  f1-score   support

           0       0.89      0.92      0.90     10000
           1       0.00      0.00      0.00      1100

    accuracy                           0.83     11100
   macro avg       0.45      0.46      0.45     11100
weighted avg       0.80      0.83      0.81     11100



In [29]:
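# XGBoost (if installed). The installed XGBoost version does not use
# `use_label_encoder`, hence the "are not used" warning in the output.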
if xgb_installed:
    xgb = Pipeline([
        ('pre', preprocessor),
        ('clf', XGBClassifier(n_estimators=100, max_depth=10, use_label_encoder=True, eval_metric='logloss', random_state=42))
    ])
    
    xgb.fit(X_train, y_train)
    y_pred_xgb = xgb.predict(X_test)
    print("XGBoost Results:")
    print(classification_report(y_test, y_pred_xgb))
else:
    print("XGBoost not installed. Skipping XGBoost model.") 

Parameters: { "use_label_encoder" } are not used.



XGBoost Results:
              precision    recall  f1-score   support

           0       0.90      0.96      0.93     10000
           1       0.00      0.00      0.00      1100

    accuracy                           0.87     11100
   macro avg       0.45      0.48      0.46     11100
weighted avg       0.81      0.87      0.84     11100



In [30]:
if xgb_installed:
    xgb = Pipeline([
        ('pre', preprocessor),
        ('clf', XGBClassifier(n_estimators=100, max_depth=10, use_label_encoder=False, eval_metric='logloss', random_state=42))
    ])
    
    xgb.fit(X_train, y_train)
    y_pred_xgb = xgb.predict(X_test)
    print("XGBoost Results:")
    print(classification_report(y_test, y_pred_xgb))
else:
    print("XGBoost not installed. Skipping XGBoost model.")

Parameters: { "use_label_encoder" } are not used.



XGBoost Results:
              precision    recall  f1-score   support

           0       0.90      0.96      0.93     10000
           1       0.00      0.00      0.00      1100

    accuracy                           0.87     11100
   macro avg       0.45      0.48      0.46     11100
weighted avg       0.81      0.87      0.84     11100



## What do these numbers mean?
- **Class 0 (not readmitted within 30 days):**
    - Precision: 0.90 (90% of predicted 0s are correct)
    - Recall: 0.96 (96% of actual 0s are found)
    - F1-score: 0.93 (harmonic mean of precision and recall)
    - Support: 10,000 samples
- **Class 1 (readmitted within 30 days):**
    - Precision: 0.00
    - Recall: 0.00
    - F1-score: 0.00
    - Support: 1,100 samples
- **Overall accuracy:** 0.87 (87% of all predictions are correct)
- **Macro avg:** Average of metrics for both classes, treating them equally.
- **Weighted avg:** Average of metrics weighted by the number of samples in each class.
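
The definitions above can be checked by hand from confusion-matrix counts. The sketch below assumes a hypothetical classifier that predicts class 0 for every one of the 11,100 test rows; the counts are chosen to match the supports in the reports, not read from an actual confusion matrix.

```python
# Confusion counts for a hypothetical classifier that predicts class 0 for
# every test row (10,000 actual negatives, 1,100 actual positives).
tn, fp, fn, tp = 10_000, 0, 1_100, 0


def safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


precision_0 = safe_div(tn, tn + fn)   # share of predicted 0s that are truly 0
recall_0 = safe_div(tn, tn + fp)      # share of actual 0s that were found
precision_1 = safe_div(tp, tp + fp)   # undefined (0/0) here, reported as 0
recall_1 = safe_div(tp, tp + fn)      # share of actual 1s that were found
accuracy = safe_div(tp + tn, tp + tn + fp + fn)
macro_recall = (recall_0 + recall_1) / 2

print(f"class 0: precision={precision_0:.2f} recall={recall_0:.2f}")
print(f"class 1: precision={precision_1:.2f} recall={recall_1:.2f}")
print(f"accuracy={accuracy:.2f} macro recall={macro_recall:.2f}")
```

Under this assumption accuracy comes out at about 0.90 while the macro-averaged recall is 0.50, which is why accuracy alone says little on data this imbalanced.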

## Interpretation
- **The model predicts almost all samples as class 0.**
- **It fails to identify any class 1 cases** (readmissions within 30 days): precision, recall, and F1-score are all 0 for class 1.
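- **High accuracy (0.87) is misleading** because the dataset is imbalanced: the test set holds 10,000 class 0 samples and only 1,100 class 1 samples, so predicting class 0 for every patient already scores about 0.90.
- **The macro average is low** (F1 of 0.46) because it weights both classes equally and class 1 scores 0. The weighted average stays high (F1 of 0.84) only because class 0 dominates it.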

## What does this mean?
- **The model is not learning to detect readmissions (class 1) at all.**
- This is a classic case of **class imbalance**: the model is biased toward the majority class (class 0).
- **Action:**
    - Try resampling techniques (oversample class 1, undersample class 0)
    - Use class weights in your model
    - Try different algorithms or hyperparameters
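
To make the class-weight option concrete, the sketch below computes the weights by hand. The label counts are hypothetical round numbers with roughly the imbalance seen here. The formula `n_samples / (n_classes * count)` is the heuristic scikit-learn documents for `class_weight='balanced'`, and the negative-to-positive ratio is the value the XGBoost documentation suggests for `scale_pos_weight`.

```python
# Hypothetical training label counts with roughly a 9:1 imbalance.
counts = {0: 40_000, 1: 4_400}

n_samples = sum(counts.values())
n_classes = len(counts)

# "Balanced" heuristic: weight_c = n_samples / (n_classes * count_c),
# so the rarer class receives the larger weight.
class_weight = {c: n_samples / (n_classes * n) for c, n in counts.items()}

# Ratio of negatives to positives, a common choice for scale_pos_weight.
scale_pos_weight = counts[0] / counts[1]

print(class_weight)
print(round(scale_pos_weight, 2))
```

With these counts each class 1 sample weighs about nine times as much as a class 0 sample in the loss, which is the same ratio that `scale_pos_weight` expresses.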

In [31]:
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# Oversample the minority class (class 1)
ros = RandomOverSampler(random_state=42)
X_resampled, y_resampled = ros.fit_resample(X_train, y_train)

In [32]:
# Use Class Weights
# for scikit-learn models
log_reg = Pipeline([
    ('pre', preprocessor),
    ('clf', LogisticRegression(max_iter=1000, class_weight='balanced'))
])

# for XGBoost
# scale_pos_weight = (number of class 0 samples) / (number of class 1 samples).
# y_train is a one-column DataFrame at this point (see In [26]), so the ratio
# is a one-element Series; np.ravel(...)[0] extracts the scalar and also works
# if y_train is a plain Series.
scale_pos_weight = (y_train == 0).sum() / (y_train == 1).sum()
xgb = Pipeline([
    ('pre', preprocessor),
    ('clf', XGBClassifier(n_estimators=100, max_depth=10, eval_metric='logloss', random_state=42,
                          scale_pos_weight=float(np.ravel(scale_pos_weight)[0])))
])

In [33]:
print(scale_pos_weight.dtype)
print(np.array(scale_pos_weight)) # np.array(scale_pos_weight)[0]

float64
[9.08861622]


In [34]:
# Logistic Regression (class_weight='balanced') fitted on the oversampled training data
log_reg.fit(X_resampled, y_resampled)
y_pred_log_reg = log_reg.predict(X_test)
print("Logistic Regression Results (resampled):")
print(classification_report(y_test, y_pred_log_reg))

  y = column_or_1d(y, warn=True)


Logistic Regression Results (resampled):
              precision    recall  f1-score   support

           0       0.90      0.51      0.65     10000
           1       0.10      0.51      0.17      1100

    accuracy                           0.51     11100
   macro avg       0.50      0.51      0.41     11100
weighted avg       0.83      0.51      0.60     11100



In [35]:
# XGBoost (with scale_pos_weight) fitted on the oversampled training data
xgb.fit(X_resampled, y_resampled)
y_pred_xgb = xgb.predict(X_test)
print("XGBoost Results (resampled):")
print(classification_report(y_test, y_pred_xgb))

XGBoost Results (resampled):
              precision    recall  f1-score   support

           0       0.88      0.75      0.81     10000
           1       0.04      0.11      0.06      1100

    accuracy                           0.69     11100
   macro avg       0.46      0.43      0.44     11100
weighted avg       0.80      0.69      0.74     11100



In [40]:
import os
import joblib
# Make sure the model directory exists
os.makedirs("model", exist_ok=True)
joblib.dump(xgb, "model/xgb.pkl")

['model/xgb.pkl']

In [36]:
# Undersample the majority class (class 0)
rus = RandomUnderSampler(random_state=42)
X_resampled, y_resampled = rus.fit_resample(X_train, y_train)

In [37]:
# Logistic Regression (class_weight='balanced') fitted on the undersampled training data
log_reg.fit(X_resampled, y_resampled)
y_pred_log_reg = log_reg.predict(X_test)
print("Logistic Regression Results (resampled):")
print(classification_report(y_test, y_pred_log_reg))

Logistic Regression Results (resampled):
              precision    recall  f1-score   support

           0       0.90      0.50      0.64     10000
           1       0.10      0.52      0.17      1100

    accuracy                           0.50     11100
   macro avg       0.50      0.51      0.41     11100
weighted avg       0.83      0.50      0.60     11100



  y = column_or_1d(y, warn=True)


In [38]:
# XGBoost (with scale_pos_weight) fitted on the undersampled training data
xgb.fit(X_resampled, y_resampled)
y_pred_xgb = xgb.predict(X_test)
print("XGBoost Results (resampled):")
print(classification_report(y_test, y_pred_xgb))

XGBoost Results (resampled):
              precision    recall  f1-score   support

           0       0.88      0.22      0.36     10000
           1       0.09      0.72      0.16      1100

    accuracy                           0.27     11100
   macro avg       0.49      0.47      0.26     11100
weighted avg       0.80      0.27      0.34     11100



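- Models trained after oversampling reach a higher overall accuracy than models trained after undersampling the majority class. The gap is negligible for logistic regression (0.51 vs. 0.50) and large for XGBoost (0.69 vs. 0.27). The undersampled XGBoost, however, finds far more readmissions (class 1 recall 0.72 vs. 0.11), and class 1 precision never exceeds 0.10 in any run.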
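
One way to read the class 1 precision figures in the reports above is to compare them with the share of class 1 in the test set. The sketch below does that arithmetic. The precision values are the two-decimal figures printed in the reports, so the resulting ratios are approximate, and the dictionary keys are labels invented for this example.

```python
# Test-set support taken from the classification reports above.
n_neg, n_pos = 10_000, 1_100
prevalence = n_pos / (n_neg + n_pos)  # share of class 1 in the test set

# Class 1 precision as printed (rounded to two decimals) in each report.
precision_1 = {
    "logreg_oversampled": 0.10,
    "xgb_oversampled": 0.04,
    "logreg_undersampled": 0.10,
    "xgb_undersampled": 0.09,
}

# Lift = precision / prevalence. A value near 1 means flagged patients are
# readmitted about as often as patients picked at random.
lift = {name: p / prevalence for name, p in precision_1.items()}

print(f"prevalence = {prevalence:.3f}")
for name, value in lift.items():
    print(f"{name}: lift = {value:.2f}")
```

None of the four runs reaches a lift clearly above 1, which suggests that the resampled models flag patients no more accurately than random selection would.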