# Healthcare Predictive Analytics for Patient Readmission

# Problem statement


Hospital readmissions within 30 days of discharge represent a significant challenge for healthcare systems, affecting patient outcomes and increasing operational costs. Early identification of patients at high risk for readmission can enable healthcare providers to implement targeted interventions, improve patient care, and reduce readmission rates. The objective is to develop a predictive model that accurately identifies patients at high risk of readmission based on their medical history and other relevant factors.

# Objective

Build a model that estimates the probability that a patient is readmitted within 30 days of discharge, so that healthcare providers can direct interventions to high-risk patients and reduce readmission rates and the associated costs.

# 1. Load dataset

In [14]:
import pandas as pd
import numpy as np

In [15]:
df=pd.read_csv('hospital_readmissions.csv')
df.head()

| | age | time_in_hospital | n_lab_procedures | n_procedures | n_medications | n_outpatient | n_inpatient | n_emergency | medical_specialty | diag_1 | diag_2 | diag_3 | glucose_test | A1Ctest | change | diabetes_med | readmitted |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | [70-80) | 8 | 72 | 1 | 18 | 2 | 0 | 0 | Missing | Circulatory | Respiratory | Other | no | no | no | yes | no |
| 1 | [70-80) | 3 | 34 | 2 | 13 | 0 | 0 | 0 | Other | Other | Other | Other | no | no | no | yes | no |
| 2 | [50-60) | 5 | 45 | 0 | 18 | 0 | 0 | 0 | Missing | Circulatory | Circulatory | Circulatory | no | no | yes | yes | yes |
| 3 | [70-80) | 2 | 36 | 0 | 12 | 1 | 0 | 0 | Missing | Circulatory | Other | Diabetes | no | no | yes | yes | yes |
| 4 | [60-70) | 1 | 42 | 0 | 7 | 0 | 0 | 0 | InternalMedicine | Other | Circulatory | Respiratory | no | no | no | yes | no |


In [16]:
df.columns

Index(['age', 'time_in_hospital', 'n_lab_procedures', 'n_procedures',
       'n_medications', 'n_outpatient', 'n_inpatient', 'n_emergency',
       'medical_specialty', 'diag_1', 'diag_2', 'diag_3', 'glucose_test',
       'A1Ctest', 'change', 'diabetes_med', 'readmitted'],
      dtype='object')

# 2. Data preprocessing

In [17]:
df.describe()

| | time_in_hospital | n_lab_procedures | n_procedures | n_medications | n_outpatient | n_inpatient | n_emergency |
|---|---|---|---|---|---|---|---|
| count | 25000.0 | 25000.0 | 25000.0 | 25000.0 | 25000.0 | 25000.0 | 25000.0 |
| mean | 4.45332 | 43.24076 | 1.35236 | 16.2524 | 0.3664 | 0.61596 | 0.1866 |
| std | 3.00147 | 19.81862 | 1.715179 | 8.060532 | 1.195478 | 1.177951 | 0.885873 |
| min | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 25% | 2.0 | 31.0 | 0.0 | 11.0 | 0.0 | 0.0 | 0.0 |
| 50% | 4.0 | 44.0 | 1.0 | 15.0 | 0.0 | 0.0 | 0.0 |
| 75% | 6.0 | 57.0 | 2.0 | 20.0 | 0.0 | 1.0 | 0.0 |
| max | 14.0 | 113.0 | 6.0 | 79.0 | 33.0 | 15.0 | 64.0 |


In [4]:
df.isnull().sum()

age                  0
time_in_hospital     0
n_lab_procedures     0
n_procedures         0
n_medications        0
n_outpatient         0
n_inpatient          0
n_emergency          0
medical_specialty    0
diag_1               0
diag_2               0
diag_3               0
glucose_test         0
A1Ctest              0
change               0
diabetes_med         0
readmitted           0
dtype: int64
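
`isnull()` reports no gaps, yet the preview of the data shows the literal string `Missing` in `medical_specialty`, so absent values appear to be stored as an ordinary category. The following is a minimal sketch of how such sentinel strings could be counted. The toy frame and its values are illustrative assumptions standing in for `df`, not rows from the real file.

```python
import pandas as pd

# Toy frame standing in for df (illustrative values only, an assumption)
toy = pd.DataFrame({
    "medical_specialty": ["Missing", "Other", "Missing", "InternalMedicine"],
    "diag_3": ["Other", "Missing", "Circulatory", "Respiratory"],
    "time_in_hospital": [8, 3, 5, 2],
})

# isnull() finds nothing because "Missing" is a regular string, not NaN
null_counts = toy.isnull().sum()

# Count the sentinel explicitly in every column
sentinel_counts = (toy == "Missing").sum()
print(sentinel_counts)
```

Running the same comparison on `df` would show how much of each column is actually unknown before deciding whether to keep `Missing` as its own category.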

In [5]:
df.duplicated().sum()

0

In [6]:
len(df.columns)

17

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 17 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   age                25000 non-null  object
 1   time_in_hospital   25000 non-null  int64 
 2   n_lab_procedures   25000 non-null  int64 
 3   n_procedures       25000 non-null  int64 
 4   n_medications      25000 non-null  int64 
 5   n_outpatient       25000 non-null  int64 
 6   n_inpatient        25000 non-null  int64 
 7   n_emergency        25000 non-null  int64 
 8   medical_specialty  25000 non-null  object
 9   diag_1             25000 non-null  object
 10  diag_2             25000 non-null  object
 11  diag_3             25000 non-null  object
 12  glucose_test       25000 non-null  object
 13  A1Ctest            25000 non-null  object
 14  change             25000 non-null  object
 15  diabetes_med       25000 non-null  object
 16  readmitted         25000 non-null  object

In [8]:
# Check for missing values
missing_values = df.isnull().sum()

# Encode categorical variables
df_encoded = pd.get_dummies(df, drop_first=True)

# Display missing values and the first few rows of the encoded dataframe
missing_values, df_encoded.head()


(age                  0
 time_in_hospital     0
 n_lab_procedures     0
 n_procedures         0
 n_medications        0
 n_outpatient         0
 n_inpatient          0
 n_emergency          0
 medical_specialty    0
 diag_1               0
 diag_2               0
 diag_3               0
 glucose_test         0
 A1Ctest              0
 change               0
 diabetes_med         0
 readmitted           0
 dtype: int64,
    time_in_hospital  n_lab_procedures  n_procedures  n_medications  \
 0                 8                72             1             18   
 1                 3                34             2             13   
 2                 5                45             0             18   
 3                 2                36             0             12   
 4                 1                42             0              7   
 
    n_outpatient  n_inpatient  n_emergency  age_[50-60)  age_[60-70)  \
 0             2            0            0        False        False   
 1   
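
To make the effect of `pd.get_dummies(df, drop_first=True)` easier to see, here is a small sketch on a toy frame. The column names mirror the dataset, but the rows are made up, and the value `high` for `glucose_test` is an assumption about the category that `drop_first` removes.

```python
import pandas as pd

# Toy frame with one numeric column, one categorical feature and the target
toy = pd.DataFrame({
    "time_in_hospital": [8, 3, 5],
    "glucose_test": ["no", "normal", "high"],  # "high" is an assumed category
    "readmitted": ["no", "yes", "yes"],
})

# Numeric columns pass through unchanged. Each text column becomes one
# indicator per category, minus the alphabetically first category.
encoded = pd.get_dummies(toy, drop_first=True)
print(list(encoded.columns))
```

Because the target is encoded along with the features, `readmitted` turns into a single `readmitted_yes` column, which is why that name is used as the target later on.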

# 3. Train-test split

In [9]:
from sklearn.model_selection import train_test_split

# Features and target variable
X = df_encoded.drop('readmitted_yes', axis=1)
y = df_encoded['readmitted_yes']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

X_train.shape, X_test.shape, y_train.shape, y_test.shape


((20000, 45), (5000, 45), (20000,), (5000,))
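
The split above is purely random. One possible refinement, sketched here on synthetic data, is to pass `stratify=y` so that the share of readmitted patients is the same in both partitions. The array sizes and the 47% positive rate are illustrative assumptions, not figures taken from the dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for X and y (assumed shapes and class balance)
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array([True] * 470 + [False] * 530)

# stratify=y_toy keeps the class proportions equal in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)
print(y_tr.mean(), y_te.mean())
```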

In [10]:
X

| | time_in_hospital | n_lab_procedures | n_procedures | n_medications | n_outpatient | n_inpatient | n_emergency | age_[50-60) | age_[60-70) | age_[70-80) | ... | diag_3_Missing | diag_3_Musculoskeletal | diag_3_Other | diag_3_Respiratory | glucose_test_no | glucose_test_normal | A1Ctest_no | A1Ctest_normal | change_yes | diabetes_med_yes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8 | 72 | 1 | 18 | 2 | 0 | 0 | False | False | True | ... | False | False | True | False | True | False | True | False | False | True |
| 1 | 3 | 34 | 2 | 13 | 0 | 0 | 0 | False | False | True | ... | False | False | True | False | True | False | True | False | False | True |
| 2 | 5 | 45 | 0 | 18 | 0 | 0 | 0 | True | False | False | ... | False | False | False | False | True | False | True | False | True | True |
| 3 | 2 | 36 | 0 | 12 | 1 | 0 | 0 | False | False | True | ... | False | False | False | False | True | False | True | False | True | True |
| 4 | 1 | 42 | 0 | 7 | 0 | 0 | 0 | False | True | False | ... | False | False | False | True | True | False | True | False | False | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 24995 | 14 | 77 | 1 | 30 | 0 | 0 | 0 | False | False | False | ... | False | False | False | False | True | False | False | True | False | False |
| 24996 | 2 | 66 | 0 | 24 | 0 | 0 | 0 | False | False | False | ... | False | False | True | False | True | False | False | False | True | True |
| 24997 | 5 | 12 | 0 | 6 | 0 | 1 | 0 | False | False | True | ... | False | False | True | False | False | True | True | False | False | False |
| 24998 | 2 | 61 | 3 | 15 | 0 | 0 | 0 | False | False | True | ... | False | False | True | False | True | False | True | False | True | True |


In [11]:
y

0        False
1        False
2         True
3         True
4        False
         ...  
24995     True
24996     True
24997     True
24998    False
24999     True
Name: readmitted_yes, Length: 25000, dtype: bool

In [12]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

# Initialize the model
logreg = LogisticRegression(max_iter=1000, random_state=42)

# Train the model
logreg.fit(X_train, y_train)

# Make predictions
y_pred = logreg.predict(X_test)
y_pred_prob = logreg.predict_proba(X_test)[:, 1]

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred_prob)

accuracy, precision, recall, f1, roc_auc


(0.61,
 0.6272727272727273,
 0.41246797608881297,
 0.4976816074188563,
 0.6465943329484359)
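
The recall of roughly 0.41 above comes from the default 0.5 cut-off that `predict` applies to the predicted probability. Since the goal is to flag high-risk patients, one option is to lower that cut-off and accept more false alarms in exchange for fewer missed readmissions. The sketch below illustrates the idea on synthetic data from `make_classification`; the data, the candidate thresholds, and the variable names are assumptions for illustration and do not reproduce the hospital results.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic binary problem standing in for the encoded hospital data
X_syn, y_syn = make_classification(n_samples=2000, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]  # probability of the positive class

# Turn probabilities into labels at several cut-offs and compare metrics
results = {}
for threshold in (0.5, 0.4, 0.3):
    pred = (proba >= threshold).astype(int)
    results[threshold] = (
        precision_score(y_te, pred, zero_division=0),
        recall_score(y_te, pred),
    )
    print(threshold, results[threshold])
```

Lowering the threshold can only keep recall the same or raise it, usually at the cost of precision, so the right value depends on how expensive a missed readmission is compared with an unnecessary intervention.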

In [13]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

# Initialize the model
model = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the random forest (not the earlier logreg object)
model.fit(X_train, y_train)

# Make predictions with the random forest
y_pred = model.predict(X_test)
y_pred_prob = model.predict_proba(X_test)[:, 1]

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred_prob)

accuracy, precision, recall, f1, roc_auc
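
A single train/test split gives only one estimate per model. As a hedged sketch, logistic regression and a random forest could be compared with 5-fold cross-validation on ROC AUC, as below. The data is synthetic and the dictionary names are hypothetical; on the real data, `X` and `y` from the split above would take the place of `X_syn` and `y_syn`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the encoded features and target
X_syn, y_syn = make_classification(n_samples=500, n_features=10, random_state=42)

candidates = {
    "logreg": LogisticRegression(max_iter=1000, random_state=42),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# Mean ROC AUC over 5 folds for each candidate model
cv_auc = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X_syn, y_syn, cv=5, scoring="roc_auc")
    cv_auc[name] = scores.mean()
    print(f"{name}: mean ROC AUC = {scores.mean():.3f}")
```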

In [20]:
df.columns

Index(['age', 'time_in_hospital', 'n_lab_procedures', 'n_procedures',
       'n_medications', 'n_outpatient', 'n_inpatient', 'n_emergency',
       'medical_specialty', 'diag_1', 'diag_2', 'diag_3', 'glucose_test',
       'A1Ctest', 'change', 'diabetes_med', 'readmitted'],
      dtype='object')

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder

# Load your data
df = pd.read_csv('hospital_readmissions.csv')

# Encode categorical features
label_encoders = {}
for column in df.select_dtypes(include=['object']).columns:
    if column != 'readmitted':
        le = LabelEncoder()
        df[column] = le.fit_transform(df[column])
        label_encoders[column] = le

# Define features and target
X = df.drop('readmitted', axis=1)
y = df['readmitted']

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a RandomForest model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Function to handle unseen labels
def transform_with_unseen(le, value):
    if value in le.classes_:
        return le.transform([value])[0]
    else:
        # Add the unseen value to the classes
        new_classes = np.append(le.classes_, value)
        le.classes_ = new_classes
        return le.transform([value])[0]
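
`transform_with_unseen` appends an unknown value to `classes_`, which gives it an integer code the trained forest never saw. An alternative, sketched here as an assumption rather than as the notebook's method, is scikit-learn's `OrdinalEncoder` with `handle_unknown="use_encoded_value"`, which maps every unseen category to one reserved code. The age brackets used for fitting are a small illustrative subset.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Fit on a few age brackets as they appear in the data (illustrative subset)
train_values = np.array([["[50-60)"], ["[60-70)"], ["[70-80)"]], dtype=object)
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(train_values)

# A known bracket gets its fitted code; a mistyped one gets the reserved -1
known = encoder.transform(np.array([["[60-70)"]], dtype=object))[0, 0]
unseen = encoder.transform(np.array([["60-70"]], dtype=object))[0, 0]
print(known, unseen)
```

With a reserved code, the calling code can detect `-1` and reject or flag the input instead of silently predicting on a value the model has no basis for.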



In [2]:
# Function to predict readmission
def predict_readmission(input_features):
    input_df = pd.DataFrame([input_features], columns=X.columns)
    print("Input DataFrame before transformation:")
    print(input_df)
    for column, le in label_encoders.items():
        input_df[column] = input_df[column].apply(lambda x: transform_with_unseen(le, x))
    print("Input DataFrame after transformation:")
    print(input_df)
    prediction = model.predict(input_df)
    # The target column was left as the raw 'yes'/'no' strings, so compare against 'yes'
    return 'Yes' if prediction[0] == 'yes' else 'No'

# Interactive input
input_features = {}
input_features['age'] = input("Enter age range (e.g., '[60-70)'): ")
input_features['time_in_hospital'] = int(input("Enter time in hospital (days): "))
input_features['n_lab_procedures'] = int(input("Enter number of lab procedures: "))
input_features['n_procedures'] = int(input("Enter number of procedures: "))
input_features['n_medications'] = int(input("Enter number of medications: "))
input_features['n_outpatient'] = int(input("Enter number of outpatient visits: "))
input_features['n_inpatient'] = int(input("Enter number of inpatient visits: "))
input_features['n_emergency'] = int(input("Enter number of emergency visits: "))
input_features['medical_specialty'] = input("Enter medical specialty: ")
input_features['diag_1'] = input("Enter primary diagnosis: ")
input_features['diag_2'] = input("Enter secondary diagnosis: ")
input_features['diag_3'] = input("Enter tertiary diagnosis: ")
input_features['glucose_test'] = input("Enter glucose test result (no/normal/high): ")
input_features['A1Ctest'] = input("Enter A1C test result (no/normal/high): ")
input_features['change'] = input("Enter change in medications (yes/no): ")
input_features['diabetes_med'] = input("Enter diabetes medication (yes/no): ")

# Predict readmission
result = predict_readmission(input_features)
print(f'Readmission: {result}')


Enter age range (e.g., '[60-70)'): 60-70
Enter time in hospital (days): 1
Enter number of lab procedures: 42
Enter number of procedures: 0
Enter number of medications: 7
Enter number of outpatient visits: 0
Enter number of inpatient visits: 0
Enter number of emergency visits: 0
Enter medical specialty: InternalMedicine
Enter primary diagnosis: Other
Enter secondary diagnosis: Circulatory
Enter tertiary diagnosis: Respiratory
Enter glucose test result (no/normal/high): no
Enter A1C test result (no/normal/high): no
Enter change in medications (yes/no): no
Enter diabetes medication (yes/no): yes
Input DataFrame before transformation:
     age  time_in_hospital  n_lab_procedures  n_procedures  n_medications  \
0  60-70                 1                42             0              7   

   n_outpatient  n_inpatient  n_emergency medical_specialty diag_1  \
0             0            0            0  InternalMedicine  Other   

        diag_2       diag_3 glucose_test A1Ctest change diabetes_med  
0  Circulato

In [5]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
import pickle

# Load your data
df = pd.read_csv('hospital_readmissions.csv')

# Encode categorical features
label_encoders = {}
for column in df.select_dtypes(include=['object']).columns:
    if column != 'readmitted':
        le = LabelEncoder()
        df[column] = le.fit_transform(df[column])
        label_encoders[column] = le

# Define features and target
X = df.drop('readmitted', axis=1)
y = df['readmitted']

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a Logistic Regression model
model = LogisticRegression(max_iter=1000, random_state=42)
model.fit(X_train, y_train)

# Save the model, label encoders, and feature column names to a pickle file
with open('model.pkl', 'wb') as file:
    pickle.dump((model, label_encoders, X.columns), file)
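
The saved tuple can be read back with `pickle.load` and unpacked in the same order it was written. The following is a minimal round-trip sketch: it trains a small model on synthetic data and writes to a temporary directory, so the feature names, the empty encoder dictionary, and the file location are all assumptions for illustration.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny synthetic model standing in for the trained readmission model
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = (X_toy[:, 0] > 0).astype(int)
feature_names = ["f0", "f1", "f2"]  # hypothetical column names

clf = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)

# Save (model, encoders, columns) in the same tuple layout as above
path = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(path, "wb") as file:
    pickle.dump((clf, {}, feature_names), file)

# Load and unpack in the same order
with open(path, "rb") as file:
    loaded_model, loaded_encoders, loaded_columns = pickle.load(file)

# The restored model should give the same predictions as the original
same = np.array_equal(clf.predict(X_toy), loaded_model.predict(X_toy))
print(same, loaded_columns)
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that wrote them.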

