# CSE 475 Final Project Milestone 3: Final Project Report

Overview:
This notebook analyzes the model developed in Milestone 2.
The goal is to analyze the performance of the final model, outline a comprehensive deployment plan, and discuss expanded ethical considerations.

## Step 1: Final Model Performance Analysis

### Analyze the performance of the final model chosen from Milestone 2
The predictive model I developed in Milestone 2 for **heart disease classification** on the Heart Failure Prediction dataset generalized well and performed well overall. After preprocessing the data and establishing a baseline ANN architecture, I assessed model performance with appropriate metrics and checked robustness using **Stratified K-Fold Cross-Validation**.

1. The ANN demonstrated **consistently strong performance**, achieving:

* **Test Accuracy:** 0.875
* **Precision:** 0.862
* **Recall:** 0.922
* **F1-Score:** 0.891
* **ROC-AUC:** 0.926

These results indicate that the ANN is effective at distinguishing between patients with and without heart disease, especially in terms of **recall**, which is particularly important in medical screening to reduce the number of missed cases.

2. **Confusion Matrix**

The confusion matrix highlights error types:

- True Negatives: 67

- False Positives: 15

- False Negatives: 8

- True Positives: 94

The confusion matrix is also encouraging. With only 8 false negatives out of 102 actual disease cases, the model rarely labels a patient with heart disease as healthy, which reflects strong sensitivity and supports its suitability for screening applications. However, the performance cannot be considered excellent, because 8 missed cases is still high compared to what would be acceptable in real-life medical scenarios.
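As a quick consistency check, the headline metrics can be recomputed directly from the four confusion-matrix counts listed above. This is a minimal sketch that uses only those counts; specificity and the false negative rate are extra quantities derived from the same numbers, not values reported in Milestone 2.

```python
# Confusion-matrix counts reported above for the test set.
tn, fp, fn, tp = 67, 15, 8, 94

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)            # also called sensitivity
specificity = tn / (tn + fp)
f1 = 2 * precision * recall / (precision + recall)
false_negative_rate = fn / (fn + tp)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
print(f"f1={f1:.3f} specificity={specificity:.3f} FNR={false_negative_rate:.3f}")
```

The recomputed accuracy, precision, recall, and F1 match the values listed in the metrics summary, so the confusion matrix and the reported scores are mutually consistent.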

3. **Learning Behavior from Training Curves**

The training curves from Milestone 2 showed:
- Training accuracy rose steadily and smoothly.
- Validation accuracy stabilized between 0.82–0.84, close to training accuracy.
- Validation loss plateaued around 0.40.
- No divergence between train and validation curves → little to no overfitting.
- Early stopping halted training appropriately at the point of optimal validation performance.

The ANN learned meaningful patterns without memorizing the training data. The combination of dropout and early stopping effectively controlled overfitting, which is particularly important given the small dataset size.
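To make the early stopping behavior concrete, here is a minimal sketch of a patience-based stopping rule in plain Python. The `patience` value and the validation losses are made up for illustration; they are not the settings or curves from Milestone 2. In Keras, the equivalent behavior is typically obtained with the `EarlyStopping` callback.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (stop_epoch, best_epoch), both 0-indexed.

    Training stops once the validation loss has failed to improve by more
    than `min_delta` for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    # Patience never ran out: training used every epoch.
    return len(val_losses) - 1, best_epoch


# Illustrative (made-up) validation losses that plateau near 0.40.
losses = [0.62, 0.51, 0.45, 0.42, 0.40, 0.41, 0.40, 0.41]
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=3)
print("stopped at epoch", stop_epoch, "best epoch", best_epoch)
```

Restoring the weights from `best_epoch` rather than from the final epoch is what keeps the deployed model at the point of best validation performance.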


4. **Cross-Validation Consistency**

5-fold stratified cross-validation produced:

* Mean Accuracy ≈ 0.847
* Mean AUC ≈ 0.918

These values are close to the final test performance (accuracy 0.875, ROC-AUC 0.926). The model generalizes consistently across different data splits, supporting its reliability and stability.
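The idea behind stratification can be sketched without any libraries: the indices of each class are dealt evenly across the folds, so each validation fold keeps roughly the same class ratio as the full dataset. The function below is a simplified stand-in written for illustration, not the splitter used in Milestone 2 (scikit-learn's `StratifiedKFold` is the usual choice), and the labels are toy data.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_splits=5, seed=0):
    """Assign every sample index to one of n_splits folds, class by class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices round-robin so each fold gets an equal share.
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)
    return folds


# Toy labels (not the real dataset): 55 positives and 45 negatives.
labels = [1] * 55 + [0] * 45
folds = stratified_folds(labels, n_splits=5)
for k, val_idx in enumerate(folds):
    positives = sum(labels[i] for i in val_idx)
    print(f"fold {k}: size={len(val_idx)}, positive share={positives / len(val_idx):.2f}")
```

Each fold serves once as the validation set while the remaining four are used for training, and the per-fold scores are then averaged.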

5. **Clinical Relevance of Performance Metrics**

In medical risk prediction:

- High recall is crucial to minimize missed diagnoses

- High AUC ensures reliable ranking of patients by risk

- Strong precision reduces unnecessary anxiety and testing

The ANN performs well on all three dimensions, which supports its practical usability as a screening support tool.




### Comparative analysis with alternative models

In the previous milestone, I also developed a Logistic Regression model as a baseline for comparison. Compared to the ANN, Logistic Regression showed lower recall, F1-score, and ROC-AUC, which is consistent with its limited ability to model the nonlinear relationships observed in the data (e.g., Age–MaxHR interaction, nonlinear Oldpeak patterns, and complex categorical interactions).

Both models perform well, but the **ANN shows slightly stronger predictive performance** than Logistic Regression.

* **Accuracy:** ANN (0.875) > LR (0.859)
* **Recall:** ANN (0.922) > LR (0.892) — *ANN misses fewer true disease cases.*
* **F1 Score:** ANN (0.891) > LR (0.875)
* **ROC-AUC:** ANN (0.926) > LR (0.911)

The confusion matrices also show that the ANN produces **fewer false negatives**, which is important in medical prediction problems.

While Logistic Regression provides interpretability and a fast training time, the ANN demonstrated superior discrimination ability and overall predictive performance. The ANN’s higher AUC and F1-score suggest it captures richer structure in the data, making it a better choice for this heart disease prediction task.

### Discuss any limitations of the model and potential areas for improvement.

Although the final ANN model demonstrated strong performance (AUC ≈ 0.93, F1 ≈ 0.89), several limitations remain that should be acknowledged when interpreting the results or considering real-world deployment.

1. **Limited Dataset Size**

The dataset contains fewer than 1,000 samples, which is relatively small for training deep learning models.
- ANN models typically benefit from large datasets; with small data, they are more prone to variance and overfitting.

- Cross-validation helps mitigate variance but does not fully overcome data scarcity.

Possible Improvements:
- Collect more patient data across demographics and clinical settings.
- Use oversampling techniques for tabular data (e.g., SMOTE) to balance classes.


2. **Potential Bias in the Dataset**

The dataset may not fully represent broader clinical populations (e.g., geographic, demographic, or socioeconomic diversity).
- The model might perform differently on populations not represented in the training data.

- Certain conditions (like cholesterol ranges or ECG types) may be underrepresented.

Possible Improvements:
- Evaluate the model on external validation datasets.
- Assess fairness across subgroups (e.g., age, sex) and retrain using bias-mitigation techniques if needed.

3. **Limited Model Interpretability**

ANNs are less interpretable compared to simpler models like Logistic Regression.
- Harder to justify decisions in medical environments where explainability is crucial.
- SHAP improves interpretability but requires additional computation and expertise to interpret correctly.

Possible Improvements:
- Incorporate model-agnostic interpretability tools (LIME, SHAP) into deployment dashboards.
- Consider hybrid models (e.g., monotonic neural networks) for clinical interpretability.

## Step 2: Deployment Plan

### Outline a detailed deployment architecture that includes the model, data pipelines, monitoring systems, and user interfaces.

1. Architecture:
- Data pipeline:
  - Data will most likely come from medical research sources via CSV uploads.
  - Inputs include: Age, Sex, ChestPainType, RestingBP, Cholesterol, MaxHR, Oldpeak, ST_Slope, etc.
- Data preprocessing:
  - Validate inputs (e.g., no negative blood pressure values) and handle missing values.
  - Apply one-hot encoding to categorical variables and ordinal encoding to ordered variables such as ExerciseAngina and ST_Slope, so that all features are numeric and suitable for downstream models.
  - Apply Min-Max scaling to all numeric features so that they lie in a common range.

The preprocessing steps from Milestone 1 can be reused here by wrapping them in a reusable function or pipeline that is packaged with the model.
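A reusable preprocessing function along these lines might look like the sketch below. It covers only a few of the fields to stay short. The category levels, the ordinal order, and especially the scaling bounds in `NUMERIC_BOUNDS` are assumptions for illustration; in practice they would be taken from the fitted Milestone 1 preprocessing and saved alongside the model.

```python
# Assumed category levels and hypothetical Min-Max bounds (illustration only).
CHEST_PAIN_LEVELS = ["ASY", "ATA", "NAP", "TA"]
ST_SLOPE_ORDER = {"Down": 0, "Flat": 1, "Up": 2}
NUMERIC_BOUNDS = {"Age": (20, 90), "RestingBP": (80, 200), "MaxHR": (60, 210)}


def preprocess(record):
    """Validate one raw patient record and return a flat list of numeric features."""
    features = []
    for name, (low, high) in NUMERIC_BOUNDS.items():
        value = record.get(name)
        if value is None:
            raise ValueError(f"missing value for {name}")
        if value < 0:
            raise ValueError(f"{name} cannot be negative")
        # Min-Max scaling, clipped so out-of-range inputs stay within [0, 1].
        scaled = (value - low) / (high - low)
        features.append(min(1.0, max(0.0, scaled)))

    chest_pain = record.get("ChestPainType")
    if chest_pain not in CHEST_PAIN_LEVELS:
        raise ValueError(f"unknown ChestPainType: {chest_pain!r}")
    # One-hot encoding: one indicator column per category level.
    features.extend(1.0 if chest_pain == level else 0.0 for level in CHEST_PAIN_LEVELS)

    slope = record.get("ST_Slope")
    if slope not in ST_SLOPE_ORDER:
        raise ValueError(f"unknown ST_Slope: {slope!r}")
    # Ordinal code rescaled to [0, 1] so it shares the range of the other features.
    features.append(ST_SLOPE_ORDER[slope] / max(ST_SLOPE_ORDER.values()))
    return features


example = {"Age": 55, "RestingBP": 140, "MaxHR": 135,
           "ChestPainType": "ATA", "ST_Slope": "Flat"}
print(preprocess(example))
```

Keeping validation, encoding, and scaling in one function means the web form, the API, and any batch job all apply exactly the same transformation the model was trained on.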

- User Interface:
  - The interface can be a web form where users enter the patient's details, covering each field in the dataset (Age, Sex, RestingBP, Cholesterol, MaxHR, Oldpeak, ChestPainType, etc.).
  - Alternatively, the hospital's ER system can call the model API in the background.

- Model:
  - The model can be served through an API: a small backend app that loads the saved model, receives patient data as JSON objects, applies the same preprocessing steps described above, and returns the predicted probability of heart disease. It could also return a risk category (low, medium, or high).

- Storage and Monitoring:
  - A database or log storage to save inputs, outputs, timestamps, and (when available) true labels.
  - A monitoring component such as a dashboard that tracks each of the following:
    - Number of predictions
    - Distribution of inputs
    - Performance over time

- Security/Access control:
  - To prevent fraudulent input from compromising data integrity or any later retraining, only authenticated users should be allowed to submit values, and data must be encrypted, at minimum in transit.
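The backend described under "Model" can be sketched as a single stateless handler that takes a JSON string and returns a JSON string. This is only an outline under stated assumptions: the risk cutoffs (0.3 and 0.7) are placeholders rather than clinically validated thresholds, and `fake_preprocess` and `fake_model` are hypothetical stand-ins for the real preprocessing step and the loaded Keras model. A web framework would wrap this function in an HTTP route.

```python
import json


def risk_category(probability, low_cutoff=0.3, high_cutoff=0.7):
    """Map a probability to a coarse label; the cutoffs are placeholders."""
    if probability < low_cutoff:
        return "low"
    if probability < high_cutoff:
        return "medium"
    return "high"


def handle_prediction_request(body, preprocess, predict_proba):
    """Stateless handler: parse JSON, preprocess, predict, and return JSON.

    `preprocess` and `predict_proba` are injected, so the handler keeps no
    per-request state and can run in several instances side by side.
    """
    try:
        record = json.loads(body)
        features = preprocess(record)
    except (ValueError, KeyError, TypeError) as exc:
        return json.dumps({"status": 400, "error": str(exc)})
    probability = float(predict_proba(features))
    return json.dumps({
        "status": 200,
        "probability": round(probability, 4),
        "risk_category": risk_category(probability),
    })


# Hypothetical stand-ins for the real preprocessing and the loaded model.
def fake_preprocess(record):
    return [record["Age"] / 100, record["MaxHR"] / 220]


def fake_model(features):
    return 0.9 if features[0] > 0.5 else 0.2


ok_response = handle_prediction_request('{"Age": 63, "MaxHR": 110}', fake_preprocess, fake_model)
bad_response = handle_prediction_request('{"Age": 63}', fake_preprocess, fake_model)
print(ok_response)
print(bad_response)
```

Rejecting malformed requests with an explicit error, instead of guessing missing values, keeps bad inputs out of both the predictions and the stored logs.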





### Simulate a mock deployment and provide a video demonstration or detailed description of the process

The code below shows a brief run of the model on the test dataset.

In [2]:
from google.colab import drive
drive.mount('/content/drive')

import numpy as np
import pandas as pd
from tensorflow import keras

#Load the saved model
model_path = "/content/drive/MyDrive/heart_failure_project/final_ann_model.keras"
final_model = keras.models.load_model(model_path)
print("Model loaded successfully.")

# Load the preprocessed test dataframe
test_path = "/content/drive/MyDrive/heart_failure_project/test_processed.csv"
test_df = pd.read_csv(test_path)

# Separate features
X_new = test_df.drop(columns=["HeartDisease"]).values

# True labels, used to check performance
y_true = test_df["HeartDisease"].values

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Model loaded successfully.


In [5]:
# Make predictions
y_prob = final_model.predict(X_new)
y_pred = (y_prob >= 0.5).astype(int)

# Show a prediction example
i = 100 # index of the sample
print("Predicted probability of heart disease:", y_prob[i][0])
print("Predicted class:", y_pred[i][0])
print("True label:", y_true[i])

[1m6/6[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 4ms/step 
Predicted probability of heart disease: 0.94149846
Predicted class: 1
True label: 1


In a real deployment, the raw clinical data collected would be entered and preprocessed, and the model would predict the probability that the patient has heart disease based on that data.

### Discuss considerations such as scalability, maintenance, and compliance with legal regulations.

To improve scalability, so that the model can handle more users and more inputs, I will implement a stateless API, which makes it possible to run multiple instances in parallel. I will also actively collect more data.

To maintain the model, I will periodically retrain it with new data to avoid performance degradation, gradually increasing the dataset size with each retraining and, as the data allows, the capacity of the model. I will also monitor for data drift and performance drift, checking whether the input distribution has changed or the AUC has dropped. Documenting and versioning each model will also be important, so that versions can be told apart and improvements over previous models can be measured.
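One common way to quantify input drift for a single numeric feature is the Population Stability Index (PSI), which compares the binned distribution of recent inputs against a reference sample such as the training data. The sketch below is an illustration, not part of the Milestone 2 work: the bin count and the simulated MaxHR-like values are assumptions, and the often-quoted rule of thumb that a PSI above roughly 0.25 signals a substantial shift should be treated as a heuristic.

```python
import math


def population_stability_index(expected, actual, n_bins=10):
    """PSI between a reference sample and a recent sample of one feature."""
    low, high = min(expected), max(expected)
    width = (high - low) / n_bins or 1.0   # fall back to 1.0 if all values are equal

    def bin_shares(values):
        counts = [0] * n_bins
        for v in values:
            # Values outside the reference range are clamped into the edge bins.
            k = min(n_bins - 1, max(0, int((v - low) / width)))
            counts[k] += 1
        # A small floor avoids log(0) and division by zero for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = bin_shares(expected), bin_shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Simulated feature values (not real patient data).
reference = [60 + i % 100 for i in range(1000)]   # spread evenly over 60..159
shifted = [v + 30 for v in reference]             # same shape, moved up by 30

psi_same = population_stability_index(reference, reference)
psi_shifted = population_stability_index(reference, shifted)
print(f"PSI unchanged: {psi_same:.3f}  PSI shifted: {psi_shifted:.3f}")
```

A monitoring job could compute this per feature on a schedule and raise an alert, or trigger a retraining review, when the value crosses the chosen threshold.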

To comply with legal regulations, several rules need to be considered during data collection and model development:
- Patient data is protected health information. Its handling must comply with HIPAA or GDPR, as applicable, and the data must be collected from legal, valid sources with approval.
- Encrypt raw data in transit from the web form to the backend API.
- Provide clear documentation of limitations; ensure clinicians remain the ultimate decision-makers.
- The model should be used as decision support, not automatic diagnosis.



## Step 3: Ethical Considerations

### Conduct an in-depth analysis of the ethical implications of your project, focusing on a broader range of issues such as societal impacts and stakeholder effects.

While the model shows strong technical performance, deploying such a system in any real or semi-real clinical context raises important ethical questions. These concerns span societal impacts, stakeholder effects, and the need for fairness, accountability, and regulatory compliance.




### Social Impacts

If used correctly, this model can:
  - Help predict the likelihood that a patient has heart disease based on their medical records, and support earlier detection of high-risk patients, enabling preventive interventions and reducing morbidity and mortality.
  - Standardize risk assessment by providing a consistent, data-driven estimate that complements human judgment.

These benefits, however, depend on careful integration into clinical workflows, appropriate human feedback beforehand, and consistent monitoring.

At the same time, there are risks:
  - Clinicians might give excessive weight to model outputs, potentially relying on the result given by this model but ignoring important clinical nuance.
  - If the training data underrepresents certain populations, the model may perform worse for them, worsening existing health disparities.
  - Payers, insurers, or employers could potentially use this model in an inappropriate way (e.g., to adjust premiums, deny coverage, or discriminate in hiring).

Therefore, any real deployment must explicitly define the appropriate scope, users, and limitations of the model.




### Fairness and Bias

This project uses a relatively small Kaggle dataset, which likely:
  - Does not represent all demographic groups equally (e.g., sex distribution, age range).
  - May be biased toward certain healthcare settings or geographic regions.

If the model is deployed broadly without adjustment, it may:
  - Underperform on underrepresented groups, misclassifying their risk.
  - Produce systematically different false positive or false negative rates by subgroup (e.g., men vs. women).

To improve fairness in a real deployment scenario:
  - Evaluate subgroups separately. By splitting the data into key subgroups (sex, age intervals) and evaluating each one, we can identify where performance is worse and analyze the causes.
  - Retrain the model with more representative data from the target population. If necessary, resampling can also reduce imbalance.
  - Explore fairness constraints, such as equal opportunity and equalized odds, and incorporate fairness metrics into model selection.
  - To remain transparent, clearly document the known limitations and the populations on which the model has or has not been validated. Avoid deploying the model in settings where the training data is clearly not representative.
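Subgroup evaluation and the equal opportunity criterion can be illustrated with a small sketch. Equal opportunity asks that the true positive rate (recall) be similar across groups, so the code computes recall per group and reports the largest gap. The labels, predictions, and group assignments below are made up for illustration and are not results from this project.

```python
from collections import defaultdict


def recall_by_group(y_true, y_pred, groups):
    """True positive rate (recall), computed separately for each subgroup."""
    tp = defaultdict(int)
    fn = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        if truth == 1:               # only actual positives enter recall
            if pred == 1:
                tp[group] += 1
            else:
                fn[group] += 1
    return {g: tp[g] / (tp[g] + fn[g]) for g in set(tp) | set(fn)}


# Tiny made-up example (not project results).
y_true = [1, 1, 1, 1, 0, 1, 1, 1, 1, 0]
y_pred = [1, 1, 1, 0, 0, 1, 1, 0, 0, 1]
sex = ["M", "M", "M", "M", "M", "F", "F", "F", "F", "F"]

rates = recall_by_group(y_true, y_pred, sex)
gap = max(rates.values()) - min(rates.values())
print(rates, "equal-opportunity gap:", gap)
```

A large gap means one group's disease cases are missed more often than another's, which is the situation that should prompt a closer look at the training data for that group.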



### Accountability

ANNs are not inherently interpretable. This project addresses this partially by comparing ANN performance with a Logistic Regression baseline, which is more transparent. In a real-life deployment, feature importance visualizations (SHAP summary plots) should be part of the model documentation, and clinicians should be able to see at least a high-level explanation of why a given patient is labeled high-risk (e.g., elevated Oldpeak, low MaxHR, abnormal ST_Slope).

Accountability should be shared but clear: the model developer, deploying institution, and end-users each have defined responsibilities.

To ensure accountability:
- Log all predictions with timestamps, input features, and model version.
- Establish a review process for adverse events, for example, missed cases where the model predicted low risk.
- Maintain clear documentation of training data sources, preprocessing steps, model architecture, and evaluation metrics.
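A prediction log entry covering the fields listed above could be structured as in the sketch below. The field names, the `ann-v1` version tag, and the feature values are hypothetical; the 0.5 decision threshold matches the one used in the mock deployment code.

```python
import json
from datetime import datetime, timezone


def make_audit_record(features, probability, model_version, threshold=0.5):
    """Build one JSON-serializable log entry for a single prediction."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model_version": model_version,
        "features": features,
        "probability": probability,
        "predicted_class": int(probability >= threshold),
        "true_label": None,   # filled in later if the outcome becomes known
    }


record = make_audit_record(
    features={"Age": 58, "Oldpeak": 2.1, "ST_Slope": "Flat"},   # illustrative values
    probability=0.81,
    model_version="ann-v1",                                      # hypothetical tag
)
print(json.dumps(record))
```

Storing the model version with every entry is what makes a later review of a missed case traceable to the exact model that produced the prediction.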




### Privacy, Security, and Regulatory Compliance

Predicting heart disease involves handling sensitive health information. A real deployment must:
- Comply with health privacy laws, for example by encrypting data in transit, limiting access through authentication and role-based access control, minimizing data retention, and using de-identified data for model monitoring whenever possible.
- Obtain appropriate consent: patients should know how their data is used, especially if it is reused for model training or improvement.
- Support auditability and oversight: regulators or internal ethics boards should be able to review model design, performance, and usage logs.

## Step 4: Overall Summary

This project explored the full development, evaluation, and ethical analysis of a machine learning system designed to predict heart disease risk using the Heart Failure Prediction dataset. Building on the preprocessing and exploratory work conducted in Milestone 1, this notebook implemented, analyzed, and critically evaluated a final predictive model built in Milestone 2, while considering deployment and ethical responsibilities.

After defining an appropriate architecture and regularization strategy, the model was trained using early stopping and validated using 5-Fold Stratified Cross-Validation. The cross-validation results demonstrated strong and consistent performance, with an average AUC of approximately 0.92, indicating excellent discriminative capability across multiple data splits.

A final ANN model was then trained on the full training dataset and evaluated on the independent test set. The model achieved a test accuracy of 0.875, an F1-score of 0.891, and a ROC-AUC of 0.926, showing robust generalization to unseen data. Comparison with a Logistic Regression baseline further highlighted the ANN’s advantage in capturing nonlinear patterns, particularly through its higher recall and AUC. SHAP-based interpretability methods were introduced to provide insight into feature contributions, showing how clinical variables influence predictions at the model level.

Beyond technical performance, the notebook examined how such a model could be responsibly deployed in clinical environments. Considerations surrounding scalability, retraining, and integration into healthcare systems were also discussed.

Finally, the project addressed the ethical implications associated with predictive healthcare models. Issues such as fairness, potential bias, patient and clinician impacts, accountability, transparency, and regulatory compliance were analyzed. Recommendations, including subgroup performance evaluation, secure deployment, comprehensive logging, human oversight, and model interpretability, were provided to support responsible real-world use.

Overall, this notebook demonstrated a holistic understanding of how such a system must be evaluated, explained, deployed, and monitored in order to be safe and beneficial within healthcare settings.
