## 1. Problem Scope 

### Problem Statement
Design an AI system that predicts whether a patient will be readmitted to the hospital within 30 days after being discharged. The goal is to identify at-risk patients early and reduce unnecessary readmissions, which strain healthcare resources and impact patient outcomes.

### Objectives
1. Predict 30-day readmission risk for each discharged patient.
2. Support early intervention by medical staff for high-risk cases.
3. Reduce overall hospital readmission rates and improve patient care.

### Stakeholders
- **Hospital Administrators** – for cost and resource optimization.
- **Doctors and Nurses** – to prioritize high-risk patients.
- **Patients and Caregivers** – to receive better follow-up care.
- **Health Insurance Providers** – to reduce avoidable claims.


## 2. Data Strategy

### Selected Dataset
**Dataset:** Diabetes 130-US hospitals for readmission prediction  
**Source:** [Kaggle – Brandão Diabetes Readmission Dataset](https://www.kaggle.com/datasets/brandao/diabetes)  
This dataset includes over 100,000 hospital admissions for diabetes patients, with features spanning demographics, clinical data, treatments, and readmission status.

---

### Data Sources
1. **Electronic Health Records (EHRs)**  
   - Diagnosis codes: `diag_1`, `diag_2`, `diag_3`  
   - Lab tests and procedures: `num_lab_procedures`, `num_procedures`

2. **Demographic Data**  
   - `race`, `gender`, `age`

3. **Hospital Administrative Data**  
   - Admission/discharge details: `admission_type_id`, `discharge_disposition_id`, `admission_source_id`  
   - Hospital stay length: `time_in_hospital`

4. **Medication & Treatment Records**  
   - Medications and dosing: `metformin`, `repaglinide`, `insulin`, etc.  
   - Treatment change indicator: `change`

5. **Readmission Outcome**  
   - `readmitted`: categories `<30`, `>30`, `NO`, which will be converted to a binary target (`<30` = 1, otherwise = 0)

---

### Ethical Concerns

1. **Patient Privacy & Confidentiality**  
   - Although the dataset is de-identified, risks remain (e.g., re-identification by combining it with external data).  
   - Store the data securely (encryption, access controls) and avoid sharing sensitive information.

2. **Algorithmic Bias & Fairness**  
   - Certain demographics (e.g., racial groups or age brackets) may be underrepresented.  
   - If uncorrected, the model could underperform for minority groups, perpetuating disparities in post-discharge care.
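
One way to make the fairness concern measurable is to compare each group's share of the data with the model's recall for that group. The snippet below is a minimal sketch on made-up rows; the `y_true` and `y_pred` columns are hypothetical names for the true label and the model's prediction, not fields in the dataset.

```python
import pandas as pd

# Made-up evaluation rows; y_true / y_pred are hypothetical column names.
results = pd.DataFrame(
    {
        "race": ["Caucasian"] * 4 + ["Asian"] * 2,
        "y_true": [1, 1, 0, 0, 1, 1],
        "y_pred": [1, 1, 0, 1, 1, 0],
    }
)

# Share of each group in the evaluation data (representation check).
group_share = results["race"].value_counts(normalize=True)

# Recall per group: among truly readmitted patients, the fraction flagged.
positives = results[results["y_true"] == 1]
recall_by_group = positives.groupby("race")["y_pred"].mean()

print(group_share)
print(recall_by_group)
```

A large recall gap between groups would suggest the model misses at-risk patients in the lower-scoring group more often.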

---

### Preprocessing Pipeline Design

#### 1. **Data Cleaning**
- Replace placeholder values like `?` with `NaN`.
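- Drop columns that are mostly missing and impute the remaining gaps (e.g., with the most frequent category for categorical fields).
- Remove duplicate rows.
- Filter out encounters whose `discharge_disposition_id` makes readmission impossible or not meaningful (e.g., the patient died or was discharged to hospice).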
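
The cleaning steps above can be sketched with pandas as follows. This is an illustration on toy rows rather than the final pipeline, and the set of excluded disposition codes is an assumption that should be confirmed against the ID mapping file distributed with the dataset.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for the real CSV; only a few columns are shown.
raw = pd.DataFrame(
    {
        "race": ["Caucasian", "?", "AfricanAmerican", "AfricanAmerican", "Asian"],
        "discharge_disposition_id": [1, 1, 11, 11, 13],
        "time_in_hospital": [3, 5, 2, 2, 7],
    }
)

# Assumed codes for "expired" and "hospice" dispositions; verify them
# against the dataset's ID mapping before relying on this set.
EXCLUDED_DISPOSITIONS = {11, 13, 14, 19, 20, 21}


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Replace '?' placeholders, drop duplicates, filter dispositions."""
    out = df.replace("?", np.nan)
    out = out.drop_duplicates()
    keep = ~out["discharge_disposition_id"].isin(EXCLUDED_DISPOSITIONS)
    return out.loc[keep].reset_index(drop=True)


cleaned = clean(raw)
print(cleaned)
```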

#### 2. **Feature Encoding**
- Encode categorical variables (`race`, `gender`, `admission_type_id`, `discharge_disposition_id`, `medical_specialty`) via one-hot encoding or label encoding.
- Standardize medication columns (`metformin`, etc.) to indicate whether the patient was on the medication (e.g., binary coding).
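
A minimal sketch of both encoding steps, assuming the medication columns use the string `No` for "not prescribed" and treating any other value as "on the medication". The rows are toy data and only a subset of the columns is shown.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "race": ["Caucasian", "AfricanAmerican", "Caucasian"],
        "gender": ["Female", "Male", "Female"],
        "admission_type_id": [1, 3, 1],
        "metformin": ["No", "Steady", "Up"],
        "insulin": ["Down", "No", "No"],
    }
)

CATEGORICAL = ["race", "gender", "admission_type_id"]
MEDICATIONS = ["metformin", "insulin"]

# Binary medication flags: 1 if the drug was prescribed at all.
# Assumes the value "No" means "not prescribed".
for col in MEDICATIONS:
    df[col] = (df[col] != "No").astype(int)

# One-hot encode the listed columns. Passing `columns=` explicitly also
# encodes integer-coded IDs such as admission_type_id as categories.
encoded = pd.get_dummies(df, columns=CATEGORICAL, dtype=int)
print(encoded.columns.tolist())
```

One-hot encoding avoids implying an order among ID codes, at the cost of more columns; label encoding is more compact but is usually only safe with tree-based models.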

#### 3. **Target Encoding**
- Convert `readmitted` to binary:
  - `<30` → 1 (readmitted within 30 days)  
  - `NO` or `>30` → 0 (not readmitted within 30 days)
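
As a quick illustration, the mapping can be written as a single comparison. The output name `readmitted_30d` is a hypothetical choice, and the values are toy data.

```python
import pandas as pd

readmitted = pd.Series(["<30", ">30", "NO", "<30", "NO"], name="readmitted")

# 1 only for readmission within 30 days; ">30" and "NO" both map to 0.
target = (readmitted == "<30").astype(int).rename("readmitted_30d")

# Share of positive cases, useful for judging class imbalance later.
positive_rate = target.mean()
print(target.tolist(), positive_rate)
```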

#### 4. **Normalization/Scaling**
- Apply `MinMaxScaler` or `StandardScaler` to numerical fields (`time_in_hospital`, `num_lab_procedures`, `number_inpatient`, etc.). Scaling helps gradient-based and distance-based models converge; tree-based models such as Random Forest do not require it.
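
The sketch below shows one way to apply `StandardScaler` with scikit-learn. It fits the scaler on the training split only and reuses it for the test split, which is an added precaution against leakage rather than something the plan above specifies. The values are toy data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

NUMERIC = ["time_in_hospital", "num_lab_procedures", "number_inpatient"]

X = pd.DataFrame(
    {
        "time_in_hospital": [1, 3, 5, 2, 8, 4, 6, 7],
        "num_lab_procedures": [40, 55, 30, 60, 45, 70, 35, 50],
        "number_inpatient": [0, 1, 0, 2, 0, 3, 1, 0],
    }
)

X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

# Learn mean and standard deviation from the training rows only.
scaler = StandardScaler().fit(X_train[NUMERIC])

# Apply the same transformation to both splits, keeping the row index.
X_train_scaled = pd.DataFrame(
    scaler.transform(X_train[NUMERIC]), columns=NUMERIC, index=X_train.index
)
X_test_scaled = pd.DataFrame(
    scaler.transform(X_test[NUMERIC]), columns=NUMERIC, index=X_test.index
)
print(X_train_scaled.round(2))
```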

#### 5. **Feature Engineering**
- **`has_prior_readmission`**: 1 if `number_inpatient` > 0, else 0. Because `number_inpatient` counts inpatient visits in the year before the encounter, this flag is a proxy for prior readmission.
- **`medication_change_flag`**: 1 if `change` == `'Ch'`, else 0.
- **Age binning**: convert `age` intervals (e.g., `[0-10)`, `[10-20)`, etc.) to numeric ordinal codes for easier modeling.
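
These three features can be derived in a few lines of pandas. The sketch assumes the `age` brackets are ten years wide and written with a plain hyphen; `age_ordinal` is a hypothetical column name, and the rows are toy data.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "number_inpatient": [0, 2, 1],
        "change": ["No", "Ch", "Ch"],
        "age": ["[0-10)", "[40-50)", "[90-100)"],
    }
)

# Flag patients with at least one prior inpatient visit.
df["has_prior_readmission"] = (df["number_inpatient"] > 0).astype(int)

# Flag encounters where the medication regimen was changed.
df["medication_change_flag"] = (df["change"] == "Ch").astype(int)

# Ordinal age code from the lower bound of the bracket: "[40-50)" becomes 4.
lower_bound = df["age"].str.extract(r"\[(\d+)-", expand=False).astype(int)
df["age_ordinal"] = lower_bound // 10

print(df)
```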

#### 6. **Class Imbalance Handling**
- If the `<30` class is underrepresented, apply SMOTE or undersample the majority class. Resample only the training split so that the test set keeps the original class distribution.
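
Random undersampling of the majority class can be sketched with pandas alone, as below. The helper `undersample_majority` and the `target` column name are hypothetical; SMOTE would instead require the separate imbalanced-learn package.

```python
import pandas as pd

# Toy training split: 2 positive rows, 8 negative rows.
train = pd.DataFrame(
    {"time_in_hospital": range(10), "target": [1, 1] + [0] * 8}
)


def undersample_majority(df, target_col, random_state=0):
    """Sample every class down to the size of the smallest class."""
    n_minority = df[target_col].value_counts().min()
    parts = [
        group.sample(n=n_minority, random_state=random_state)
        for _, group in df.groupby(target_col)
    ]
    # Shuffle so the classes are not stored in contiguous blocks.
    balanced = pd.concat(parts).sample(frac=1, random_state=random_state)
    return balanced.reset_index(drop=True)


balanced = undersample_majority(train, "target")
print(balanced["target"].value_counts())
```

Undersampling discards majority-class rows, so it trades information for balance; class weighting in the model is an alternative that keeps all rows.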

---

*Actual implementation of this pipeline will be coded in the next step (Step 3).*  


## 3. Model Development

### Model Chosen: Random Forest Classifier

**Why Random Forest?**
- Works well with high-dimensional, structured data like hospital records.
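- Needs no feature scaling and can offset class imbalance through class weighting (e.g., `class_weight="balanced"` in scikit-learn); missing values may still need imputation, depending on the library version.
- Provides interpretability via feature importance, which is crucial in medical decision support.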
- Reduces overfitting compared to a single decision tree.

---

### Data Loading and Preparation
