# Part 2: Hospital Readmission Prediction

## 1. Data Strategy

**Data Acquisition:**
1.  **Electronic Health Records (EHR):** Primary source for patient demographics, medical history, diagnoses (ICD codes), procedures, medications, lab results, and discharge summaries.
2.  **Administrative Data:** Includes billing information, insurance details, and hospital stay duration.

**Data Preparation Strategy:**
1.  **Feature Selection:** Identify relevant features from the raw data that are most predictive of readmission, such as age, number of diagnoses, number of procedures, and discharge disposition.
2.  **Handling Missing Values:** Address missing data in patient records through imputation (e.g., mean or median for numerical features; most frequent value for categorical features) or by removing features/records with excessive missingness.
3.  **Categorical Feature Encoding:** Convert nominal and ordinal categorical variables (e.g., 'gender', 'primary_diagnosis', 'discharge_to') into numerical formats using techniques like one-hot encoding to make them suitable for machine learning models.
4.  **Numerical Feature Scaling:** Standardize or normalize numerical features (e.g., 'num_procedures', 'days_in_hospital', 'comorbidity_score') to ensure that no single feature dominates the model due to its scale.
5.  **Target Variable Definition:** Clearly define the 'readmitted' target variable (e.g., 1 if readmitted within 30 days, 0 otherwise) and ensure its consistency across the dataset.
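
The imputation, encoding, and scaling steps above can be sketched as a single scikit-learn transformer. This is a minimal illustration, not the notebook's actual preprocessing: the toy values are made up, and the column names are borrowed from the examples in the list and may not match the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy records with gaps; values are invented for illustration only.
toy_df = pd.DataFrame({
    'days_in_hospital': [3.0, np.nan, 7.0, 2.0],
    'num_procedures': [1.0, 0.0, np.nan, 2.0],
    'gender': ['F', 'M', np.nan, 'F'],
    'discharge_to': ['Home', 'Home', 'Rehab', np.nan],
})

numeric_cols = ['days_in_hospital', 'num_procedures']
categorical_cols = ['gender', 'discharge_to']

# Numerical columns: fill gaps with the median, then standardize.
numeric_steps = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler()),
])

# Categorical columns: fill gaps with the most frequent value, then one-hot encode.
categorical_steps = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('encode', OneHotEncoder(handle_unknown='ignore')),
])

toy_preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_steps, numeric_cols),
    ('cat', categorical_steps, categorical_cols),
])

toy_matrix = toy_preprocessor.fit_transform(toy_df)
print(toy_matrix.shape)
```

Fitting the imputers inside the pipeline means their statistics are learned from the training rows only, which avoids leaking validation data into the preprocessing.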

In [None]:
import pandas as pd
import numpy as np

# Load the dataset
train_df = pd.read_csv('../data/hospital readmission/train_df.csv')
test_df = pd.read_csv('../data/hospital readmission/test_df.csv')
train_df.head()

In [None]:
train_df.info()

In [None]:
train_df.describe()

## 2. Model Development

**Model Choice & Justification:**
A **Random Forest Classifier** is selected for this case study because it handles tabular data well, captures non-linear relationships, and is robust to noisy features. Averaging many trees makes it less prone to overfitting than a single decision tree, and its impurity-based feature importances give a coarse view of which inputs drive the predictions.

**Preprocessing Pipeline:**
A `ColumnTransformer` is used to apply different preprocessing steps to different types of features:
*   **Numerical Features:** Standardized using `StandardScaler` (zero mean, unit variance). Tree-based models such as Random Forest are insensitive to feature scale, so this step mainly keeps the pipeline reusable with scale-sensitive models.
*   **Categorical Features:** Converted into numerical format using `OneHotEncoder`. This creates binary columns for each category, avoiding the assumption of ordinality and preventing the model from misinterpreting categorical values as having a numerical relationship.

**Train/Validation Split:**
The training data is split 80/20 into training and validation sets. The training set is used to fit the model; the validation set is held out to assess performance during development and to guide hyperparameter tuning. Evaluating on rows the model has not seen gives an estimate of how well it generalizes and exposes overfitting to the training set.
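
With an imbalanced target, a stratified split keeps the readmission rate the same in both partitions. The sketch below uses synthetic data; the 17% positive rate is an assumption chosen for illustration, and `stratify` is an optional refinement rather than what the notebook's split does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data: 1000 rows, 170 positives (assumed ratio).
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array([1] * 170 + [0] * 830)

# stratify=y_toy preserves the class proportions in both partitions.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

print(f'train positive rate: {y_tr.mean():.3f}')
print(f'validation positive rate: {y_va.mean():.3f}')
```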

In [None]:
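# Identify categorical and numerical features.
# The target is excluded first: 'readmitted' is numeric, so it would otherwise
# land in numerical_features and the ColumnTransformer would fail on X,
# which no longer contains that column.
feature_df = train_df.drop(columns=['readmitted'])
categorical_features = feature_df.select_dtypes(include=['object']).columns
numerical_features = feature_df.select_dtypes(include=np.number).columns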

print('Categorical Features:', categorical_features)
print('Numerical Features:', numerical_features)

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Create the preprocessing pipeline
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ])

In [None]:
from sklearn.model_selection import train_test_split

# Define features and target
X = train_df.drop('readmitted', axis=1)
y = train_df['readmitted']

# Split the data
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

# Create the full pipeline
model_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                               ('classifier', RandomForestClassifier(random_state=42))])

# Train the model
model_pipeline.fit(X_train, y_train)

## 3. Deployment & Optimization

**Deployment Strategies:**
1.  **Real-time Prediction Service:** Deploy the model as a microservice (e.g., using Flask, Django, FastAPI, or cloud functions) that receives patient data via an API and returns readmission predictions in real time. This is crucial for immediate clinical decision-making.
2.  **Batch Prediction for Risk Stratification:** For less urgent scenarios, the model can be run periodically (e.g., daily or weekly) on a batch of patient data to identify high-risk individuals for proactive interventions or resource allocation.
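
The batch strategy above can be sketched as: persist the fitted model once, then load it in a scheduled job and rank patients by predicted risk. Everything named here is hypothetical (the `score_batch` helper, the file name, the toy columns, and the stand-in model); a real job would persist the full preprocessing pipeline rather than a bare classifier.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Stand-in model trained on synthetic data with a made-up target rule.
train_X = pd.DataFrame({
    'age': rng.integers(20, 90, size=100),
    'days_in_hospital': rng.integers(1, 15, size=100),
})
train_y = (train_X['days_in_hospital'] > 7).astype(int)
model = RandomForestClassifier(n_estimators=20, random_state=42).fit(train_X, train_y)


def score_batch(model_path, batch_df):
    """Load a persisted model and return the batch ranked by readmission risk."""
    loaded_model = joblib.load(model_path)
    scored = batch_df.copy()
    # Column 1 of predict_proba is the probability of class 1 (readmitted).
    scored['readmission_risk'] = loaded_model.predict_proba(batch_df)[:, 1]
    return scored.sort_values('readmission_risk', ascending=False)


with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'readmission_model.joblib')
    joblib.dump(model, model_path)  # done once, after training

    batch = pd.DataFrame({'age': [45, 80], 'days_in_hospital': [2, 12]})
    ranked = score_batch(model_path, batch)  # done on each scheduled run

print(ranked)
```

Returning probabilities instead of hard labels lets the care team choose how many high-risk patients to follow up with, based on available capacity.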

**Optimization:**
1.  **Feature Engineering for Clinical Context:** Collaborate with clinicians to derive more meaningful features from EHR data (e.g., severity scores, medication adherence indicators, social determinants of health) that can significantly improve predictive power.
2.  **Addressing Class Imbalance:** The classes are imbalanced: the reported results cover 174 readmitted patients out of 1,000 evaluated (about 17%). Techniques such as oversampling the minority class (e.g., SMOTE), undersampling the majority class, or cost-sensitive learning can improve the model's ability to correctly identify readmitted patients.
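
One low-effort form of cost-sensitive learning is the `class_weight='balanced'` option of `RandomForestClassifier`, which weights each class inversely to its frequency. The sketch below runs on synthetic data from `make_classification` with an assumed 83/17 class ratio; `min_samples_leaf=5` is an arbitrary choice for the illustration, and whether weighting helps on the real data has to be checked on the validation set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight

# Synthetic imbalanced data (assumed ratio, not the hospital dataset).
X_toy, y_toy = make_classification(
    n_samples=2000, n_features=10, n_informative=4,
    weights=[0.83, 0.17], random_state=0,
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

# 'balanced' weights are n_samples / (n_classes * class_count).
class_weights = compute_class_weight('balanced', classes=np.array([0, 1]), y=y_tr)
print(dict(zip([0, 1], class_weights)))

# Compare minority-class recall with and without class weighting.
for label, weight_setting in [('unweighted', None), ('balanced', 'balanced')]:
    clf = RandomForestClassifier(
        n_estimators=100, min_samples_leaf=5,
        class_weight=weight_setting, random_state=42,
    )
    clf.fit(X_tr, y_tr)
    recall = recall_score(y_va, clf.predict(X_va))
    print(f'{label}: minority recall = {recall:.2f}')
```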

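**Model Performance (Factual Results):**
Overall accuracy (0.812) is slightly below the 0.826 that always predicting class 0 would achieve on these 1,000 samples, and recall for the readmitted class is only 0.01, so accuracy alone overstates how useful the model is.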
```
Accuracy: 0.812
Classification Report:
              precision    recall  f1-score   support

           0       0.82      0.98      0.90       826
           1       0.06      0.01      0.01       174

    accuracy                           0.81      1000
   macro avg       0.44      0.49      0.45      1000
weighted avg       0.69      0.81      0.74      1000
```

In [None]:
from sklearn.metrics import accuracy_score, classification_report

# Make predictions on the validation set
y_pred = model_pipeline.predict(X_val)

# Evaluate the model
accuracy = accuracy_score(y_val, y_pred)
report = classification_report(y_val, y_pred)

print(f'Accuracy: {accuracy:.3f}')
print('Classification Report:')
print(report)
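
Because accuracy is dominated by the majority class, ranking-based metrics and the confusion matrix are often more informative for this kind of problem. The sketch below uses ten hand-written labels and scores purely for illustration; in the notebook the scores would come from something like `model_pipeline.predict_proba(X_val)[:, 1]`, and the 0.5 threshold is an assumption.

```python
import numpy as np
from sklearn.metrics import average_precision_score, confusion_matrix, roc_auc_score

# Hand-written example: 8 negatives, 2 positives, with predicted probabilities.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_score = np.array([0.10, 0.20, 0.15, 0.30, 0.05, 0.40, 0.35, 0.60, 0.45, 0.70])

# Hard labels at an assumed 0.5 threshold.
y_hat = (y_score >= 0.5).astype(int)
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()
print(f'TN={tn} FP={fp} FN={fn} TP={tp}')

# Threshold-free metrics computed from the scores themselves.
roc_auc = roc_auc_score(y_true, y_score)
avg_precision = average_precision_score(y_true, y_score)
print(f'ROC AUC: {roc_auc:.4f}')
print(f'Average precision: {avg_precision:.4f}')
```

Here the scores rank most positives above most negatives even though one of the two positives is missed at the 0.5 threshold, which is the kind of gap a lower decision threshold can close.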