# Week 5 Assignment - Part 2: Case Study Application (Practical)

This notebook implements the **Case Study: Hospital Readmission Prediction**.

**Scenario:** A hospital wants an AI system to predict patient readmission risk within 30 days of discharge.

**Tasks Implemented:**
1.  **Data Strategy:** Synthetic data generation and preprocessing.
2.  **Model Development:** Training a Random Forest classifier.
3.  **Evaluation:** Confusion matrix, precision, and recall.
4.  **Optimization:** Reducing model complexity to limit overfitting.

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
import matplotlib.pyplot as plt
import seaborn as sns

# Set random seed for reproducibility
np.random.seed(42)

## 1. Data Strategy: Synthetic Data Generation
Since we don't have real patient data, we will generate a synthetic dataset with features relevant to the problem:
-   `Age`: Patient age (18-90)
-   `Gender`: Male/Female
-   `Length_of_Stay`: Days in hospital (1-30)
-   `Num_Previous_Admissions`: Count of admissions in past year (0-5)
-   `Comorbidity_Index`: Score indicating health status (0-10)
-   `Readmitted`: Target variable (0 = No, 1 = Yes)

In [None]:
# Generate synthetic data
n_samples = 1000

data = {
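    # np.random.randint excludes the upper bound, so add 1 to match the documented ranges
    'Age': np.random.randint(18, 91, n_samples),               # 18-90 inclusive
    'Gender': np.random.choice(['Male', 'Female'], n_samples),
    'Length_of_Stay': np.random.randint(1, 31, n_samples),     # 1-30 inclusive
    # Poisson counts are unbounded above, so clip to the documented 0-5 range
    'Num_Previous_Admissions': np.clip(np.random.poisson(1, n_samples), 0, 5),
    'Comorbidity_Index': np.random.randint(0, 11, n_samples)   # 0-10 inclusive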
}

df = pd.DataFrame(data)

# Generate target variable 'Readmitted' based on some logic + noise
# Higher risk if older, longer stay, more previous admissions, higher comorbidity
risk_score = (
    0.02 * df['Age'] + 
    0.1 * df['Length_of_Stay'] + 
    0.3 * df['Num_Previous_Admissions'] + 
    0.2 * df['Comorbidity_Index']
)

# Min-max scale the risk score to [0, 1] (a relative score, not a calibrated probability)
risk_prob = (risk_score - risk_score.min()) / (risk_score.max() - risk_score.min())

# Add Gaussian noise and label a patient as readmitted when the noisy score exceeds 0.5
df['Readmitted'] = (risk_prob + np.random.normal(0, 0.1, n_samples) > 0.5).astype(int)

print("Dataset Head:")
print(df.head())
print("\nClass Distribution:")
print(df['Readmitted'].value_counts())
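
The min-max scaling above gives a relative score in [0, 1] rather than a calibrated probability. As one possible alternative, the sketch below maps the same linear risk score through a logistic (sigmoid) link and draws each label as a Bernoulli trial. The weights are the ones used above; the mean-centring, the `rng` generator, and the `demo` data frame are assumptions made for this illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # assumed seed, used only for reproducibility
n_samples = 1000

# Hypothetical stand-in for the synthetic features generated above
demo = pd.DataFrame({
    "Age": rng.integers(18, 91, n_samples),               # 18-90 inclusive
    "Length_of_Stay": rng.integers(1, 31, n_samples),     # 1-30 inclusive
    "Num_Previous_Admissions": rng.poisson(1, n_samples),
    "Comorbidity_Index": rng.integers(0, 11, n_samples),  # 0-10 inclusive
})

# Same linear weights as the risk score in the notebook
score = (
    0.02 * demo["Age"]
    + 0.1 * demo["Length_of_Stay"]
    + 0.3 * demo["Num_Previous_Admissions"]
    + 0.2 * demo["Comorbidity_Index"]
)

# Centre the score so the sigmoid is not saturated (an illustrative choice)
logit = score - score.mean()
prob = 1.0 / (1.0 + np.exp(-logit))

# Draw each label as a Bernoulli trial with its own probability
demo["Readmitted"] = rng.binomial(1, prob.to_numpy())

print(demo["Readmitted"].value_counts(normalize=True))
```

With this approach the noise comes from the sampling step itself, so no separate Gaussian noise term or fixed threshold is needed.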

## 2. Preprocessing Pipeline
1.  **Encoding:** Convert `Gender` to numeric.
2.  **Scaling:** Scale numerical features (`Age`, `Length_of_Stay`, etc.).
3.  **Splitting:** Train/Test split.

In [None]:
# 1. Encoding Categorical Variables
# LabelEncoder sorts classes alphabetically: Female -> 0, Male -> 1
le = LabelEncoder()
df['Gender'] = le.fit_transform(df['Gender'])

# 2. Separate Features and Target
X = df.drop('Readmitted', axis=1)
y = df['Readmitted']

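# 3. Train-Test Split (80% Train, 20% Test)
# stratify=y keeps the readmission rate the same in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 4. Scaling (tree-based models do not need it, but it is harmless here)
# Fit the scaler on the training split only, to avoid leaking test statistics
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)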
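
One way to keep encoding and scaling tied to the training split is to wrap them in a scikit-learn `Pipeline`. The sketch below is an illustration, not the notebook's own method: it swaps `LabelEncoder` for `OneHotEncoder` inside a `ColumnTransformer`, and the `toy` data frame and its label rule are made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
n = 200

# Hypothetical toy data; the label depends only on Length_of_Stay
toy = pd.DataFrame({
    "Age": rng.integers(18, 91, n),
    "Gender": rng.choice(["Male", "Female"], n),
    "Length_of_Stay": rng.integers(1, 31, n),
})
toy_y = (toy["Length_of_Stay"] > 15).astype(int)

numeric_cols = ["Age", "Length_of_Stay"]
categorical_cols = ["Gender"]

# Scale numeric columns and one-hot encode the categorical column
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

pipe = Pipeline([
    ("preprocess", preprocess),
    ("model", RandomForestClassifier(n_estimators=50, random_state=0)),
])

Xtr, Xte, ytr, yte = train_test_split(
    toy, toy_y, test_size=0.2, random_state=0, stratify=toy_y
)

# fit() learns the preprocessing statistics from the training split only
pipe.fit(Xtr, ytr)
pipe_acc = pipe.score(Xte, yte)
print(f"Pipeline test accuracy: {pipe_acc:.3f}")
```

Because the preprocessing lives inside the pipeline, the same object can be passed to cross-validation utilities without refitting the scaler by hand.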

## 3. Model Development
We will use a **Random Forest Classifier**. As an ensemble of decision trees, it is robust to noise, captures non-linear relationships between features and the target, and provides feature importance scores.

In [None]:
# Initialize and Train Model
rf_model = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
rf_model.fit(X_train_scaled, y_train)

# Make Predictions
y_pred = rf_model.predict(X_test_scaled)
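
Feature importance is one of the reasons given for choosing a Random Forest, so the sketch below shows how impurity-based importances can be read from a fitted model through its `feature_importances_` attribute. The data set is synthetic and hypothetical: only the `Signal` column drives the label, while `Noise_A` and `Noise_B` are unrelated to it.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
n = 500

# Hypothetical features: one informative column and two pure-noise columns
feat = pd.DataFrame({
    "Signal": rng.normal(size=n),
    "Noise_A": rng.normal(size=n),
    "Noise_B": rng.normal(size=n),
})
label = (feat["Signal"] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
forest.fit(feat, label)

# Impurity-based importances, one value per feature, normalized to sum to 1
importances = pd.Series(
    forest.feature_importances_, index=feat.columns
).sort_values(ascending=False)
print(importances)
```

Applied to the notebook's model, the same pattern would use `rf_model.feature_importances_` with `X.columns` as the index.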

## 4. Evaluation
We calculate the **Confusion Matrix**, **Precision**, and **Recall**.

In [None]:
# Confusion Matrix (labels fixed so the matrix is always 2x2: rows = actual, columns = predicted)
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])

class_names = ['No Readmission', 'Readmission']
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=class_names, yticklabels=class_names)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()

# Classification Report (Precision, Recall, F1-Score)
print("Classification Report:")
print(classification_report(y_test, y_pred))

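# Calculate specific metrics manually for clarity
# (guard against division by zero when a class is never predicted or never present)
tn, fp, fn, tp = cm.ravel()
precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0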

print(f"Precision: {precision:.3f}")
print(f"Recall: {recall:.3f}")
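
To make the manual calculation concrete, here is a small worked example on ten hand-written labels (hypothetical values, not taken from the model above). It checks that the manual formulas agree with scikit-learn's `precision_score` and `recall_score`.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels: 4 actual readmissions, 6 actual non-readmissions
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

# For binary labels [0, 1], ravel() returns TN, FP, FN, TP in that order
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo, labels=[0, 1]).ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")  # TN=4, FP=2, FN=1, TP=3

manual_precision = tp / (tp + fp)  # 3 / 5 = 0.60
manual_recall = tp / (tp + fn)     # 3 / 4 = 0.75

print(f"Precision: {manual_precision:.2f}")
print(f"Recall: {manual_recall:.2f}")

# The library functions give the same values
sk_precision = precision_score(y_true_demo, y_pred_demo)
sk_recall = recall_score(y_true_demo, y_pred_demo)
```

In the readmission setting, the one false negative is a patient who was readmitted without being flagged, which is why recall is often the metric to watch.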

## 5. Optimization (Addressing Overfitting)
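A large gap between training and test accuracy signals overfitting. Constraining tree growth, for example by lowering `max_depth` or raising `min_samples_split`, reduces model complexity and can narrow that gap.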

In [None]:
# Example: Training a simpler model to reduce overfitting
rf_simple = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42) # Reduced max_depth
rf_simple.fit(X_train_scaled, y_train)

print("Training Accuracy (Complex):", rf_model.score(X_train_scaled, y_train))
print("Test Accuracy (Complex):", rf_model.score(X_test_scaled, y_test))
print("-"*30)
print("Training Accuracy (Simple):", rf_simple.score(X_train_scaled, y_train))
print("Test Accuracy (Simple):", rf_simple.score(X_test_scaled, y_test))
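
Rather than comparing two hand-picked settings, the hyperparameters can also be tuned with cross-validation. The following is a minimal sketch using `GridSearchCV`, assuming a small grid over `max_depth` and `min_samples_split` and recall as the selection metric. The grid values, the scoring choice, and the `make_classification` data are illustrative assumptions, not part of the original assignment.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Hypothetical stand-in data set for the illustration
X_demo, y_demo = make_classification(
    n_samples=400, n_features=5, n_informative=3, n_redundant=0, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Assumed search grid: shallower trees and larger split sizes regularize more
param_grid = {
    "max_depth": [3, 5, 10],
    "min_samples_split": [2, 10],
}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=42),
    param_grid,
    cv=5,              # 5-fold cross-validation on the training split
    scoring="recall",  # assumed priority: catch as many readmissions as possible
)
search.fit(X_tr, y_tr)

print("Best parameters:", search.best_params_)
print(f"Mean CV recall: {search.best_score_:.3f}")
# score() uses the same scorer as the search, so this is recall on the held-out split
print(f"Test recall: {search.score(X_te, y_te):.3f}")
```

Because the selection is based on cross-validated scores, the test split stays untouched until the final check.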