# Step 5 — Evaluation & One‑Page Report (Cleveland Heart Dataset)

**Objective.** Predict heart disease (binary) using the Cleveland dataset and compare two simple models (Logistic Regression, Random Forest) using **accuracy** as the primary metric.


## Dataset & Preprocessing (summary)

- **Source:** UCI ML Repository — Cleveland subset (processed file)
- **Shape:** 303 rows × 14 columns
- **Target:** `target` (0–4) → **binarized** to `target_bin` (0 = healthy, 1 = disease)
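
The binarization step can be sketched as follows. This is a minimal illustration on a toy column, assuming that every non-zero severity level (1 to 4) is mapped to 1; the summary implies this rule but does not spell it out, and the toy values are not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for the raw target column (values 0-4); not the real data.
df = pd.DataFrame({"target": [0, 2, 1, 0, 4, 3]})

# Assumed rule: any non-zero severity level counts as disease.
df["target_bin"] = (df["target"] > 0).astype(int)

print(df["target_bin"].tolist())  # [0, 1, 1, 0, 1, 1]
```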

**Preprocessing**
- Missing values: `ca`, `thal` → **mode**; all other numeric features (except target) → **median**
- Scaling: Min‑Max to [0, 1] on all features
- Final feature set: 13 columns
- Class balance: ~54% healthy, ~46% disease
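
A rough sketch of the imputation and scaling described above, written in plain pandas. The toy values are hypothetical (only the column names `age`, `ca`, and `thal` come from the dataset), and the notebook itself may have used a different mechanism such as scikit-learn's scalers.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values; only the column names are real.
X = pd.DataFrame({
    "age": [63.0, 67.0, np.nan, 41.0],
    "ca": [0.0, 3.0, 0.0, np.nan],
    "thal": [6.0, 3.0, np.nan, 3.0],
})

mode_cols = ["ca", "thal"]
median_cols = [c for c in X.columns if c not in mode_cols]

# Mode for the discrete columns, median for the remaining numeric ones.
for col in mode_cols:
    X[col] = X[col].fillna(X[col].mode().iloc[0])
for col in median_cols:
    X[col] = X[col].fillna(X[col].median())

# Min-Max scaling per column: (x - min) / (max - min).
X_scaled = (X - X.min()) / (X.max() - X.min())

print(X_scaled)
```

Note that a constant column would produce a zero denominator here, so a real pipeline may need to guard against that case.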


## Results (from Step 4)

- **Logistic Regression Accuracy:** 0.8525  
- **Random Forest Accuracy:** 0.9016  
- **Selected model (by accuracy):** **Random Forest**
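
For context, an accuracy comparison of this kind might look like the sketch below. It runs on synthetic data from `make_classification`, so it will not reproduce the numbers above. The 80/20 stratified split, the random seed, and the hyperparameters are assumptions, not the settings used in Step 4 (although a 61-row test set would be consistent with the reported accuracies, since 52/61 ≈ 0.8525 and 55/61 ≈ 0.9016).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as the feature matrix (303 x 13).
X, y = make_classification(n_samples=303, n_features=13, random_state=42)

# Assumed split: 20% held out, stratified by class.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# Fit each model and record its test-set accuracy.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

# Pick the model with the highest accuracy.
selected = max(scores, key=scores.get)
for name, acc in scores.items():
    print(f"{name}: {acc:.4f}")
print("Selected:", selected)
```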


## Interpretation

- Random Forest achieved higher accuracy (0.9016) than Logistic Regression (0.8525).
- Because the classes are roughly **balanced** (~54% healthy vs. ~46% disease), accuracy is a reasonable first metric.
- In medical settings, also consider:
  - **Recall (Sensitivity)** for the positive class (disease), to reduce false negatives.
  - **Precision** to limit false positives.
  - **F1‑score** as a balanced measure.
- A confusion matrix and classification report for each model help confirm that the accuracy figure is not hiding poor performance on one class.
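
To make the metrics listed above concrete, here is a small worked example that derives them from confusion-matrix counts. The labels are toy values chosen for illustration, not the predictions of either model.

```python
# Toy labels for illustration only; not the models' actual predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # disease, predicted disease
tn = sum(t == 0 and p == 0 for t, p in pairs)  # healthy, predicted healthy
fp = sum(t == 0 and p == 1 for t, p in pairs)  # healthy, predicted disease
fn = sum(t == 1 and p == 0 for t, p in pairs)  # disease, predicted healthy

accuracy = (tp + tn) / len(pairs)
recall = tp / (tp + fn)        # sensitivity: share of disease cases caught
precision = tp / (tp + fp)     # share of disease predictions that are right
f1 = 2 * precision * recall / (precision + recall)

print(tp, tn, fp, fn)                    # 4 4 1 1
print(accuracy, recall, precision, f1)
```

In practice, `sklearn.metrics.confusion_matrix` and `classification_report` compute the same quantities directly from the label arrays.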


## Submission Checklist (Week 1)

- [x] 01_load_and_explore.ipynb — loading, names, `info()`, `describe()`, missingness
- [x] 02_preprocessing.ipynb — imputation, scaling, binary target, build `X`, `y`
- [x] 03_eda.ipynb — class balance + quick plots + correlation heatmap
- [x] 04_model_training.ipynb — Logistic Regression & Random Forest, accuracy comparison
- [x] 05_evaluation_report.ipynb — this one‑pager summary with final results
- [x] Push all notebooks + `data/` README to GitHub (private/public as allowed)
- [x] Short LinkedIn post with problem, approach, and repo link


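In [1]:
# Requires the pypandoc package (pip install pypandoc), the pandoc binary,
# and a LaTeX engine for PDF output.
import pypandoc

# Read the Markdown report, closing the file handle when done.
with open("week1_report.md", encoding="utf-8") as f:
    report_md = f.read()

# Convert to PDF; pandoc writes the result to outputfile.
pypandoc.convert_text(
    report_md,
    "pdf",
    format="md",
    outputfile="week1_report.pdf",
    extra_args=["--standalone"],
)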


ModuleNotFoundError: No module named 'pypandoc'