### Loading the previous stage
This loads the output of the previous stage so that the economic cost modelling can be done in this notebook.

In [None]:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve, accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

In [None]:
hd = pd.read_csv(r"C:\Users\muzam\OneDrive\Desktop\PROJECTS\Resources\datasets\heart.csv")
hd_df = pd.DataFrame(hd)

In [None]:
X = hd_df.drop("target", axis=1)
y = hd_df["target"]

In [None]:
X_train_hd, X_test_hd, y_train_hd, y_test_hd = train_test_split(X, y, test_size=0.25, stratify=y, random_state=1)

In [None]:
clf = RandomForestClassifier()

In [None]:
# Fit only after the train/test split exists
clf.fit(X_train_hd, y_train_hd)

In [None]:
y_prob = clf.predict_proba(X_test_hd)[:, 1]

We will use RandomForestClassifier, as it yielded the most true negatives and the fewest false negatives.

### However
We need to wrap the model in a Pipeline, which requires the data to be preprocessed first.
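
As a rough illustration of what such a Pipeline could look like, the sketch below combines a `ColumnTransformer` with the random forest. It runs on synthetic stand-in data: the column names (`age`, `chol`, `chest_pain_type`), the numeric/categorical split, and the `demo_` variables are assumptions made for the example, not the actual schema of the dataset used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical stand-in data: column names are assumptions, not the real schema.
rng = np.random.default_rng(1)
n_rows = 200
demo_X = pd.DataFrame({
    "age": rng.integers(30, 80, n_rows),
    "chol": rng.integers(150, 350, n_rows),
    "chest_pain_type": rng.integers(0, 4, n_rows),
})
# Synthetic binary target loosely tied to age, so both classes are present.
demo_y = (demo_X["age"] + rng.normal(0, 10, n_rows) > 55).astype(int)

numeric_cols = ["age", "chol"]
categorical_cols = ["chest_pain_type"]

# Scale numeric columns and one-hot encode categorical ones.
preprocessor = ColumnTransformer(transformers=[
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

# Preprocessing and model are fitted together as one estimator.
demo_pipeline = Pipeline(steps=[
    ("preprocess", preprocessor),
    ("model", RandomForestClassifier(max_depth=6, random_state=1)),
])

demo_pipeline.fit(demo_X, demo_y)
demo_prob = demo_pipeline.predict_proba(demo_X)[:, 1]
```

Tree-based models do not strictly need scaling, so the `StandardScaler` step is here mainly to show where numeric preprocessing would go.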

In [None]:
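clf = RandomForestClassifier(max_depth=6, random_state=1)
clf_xgb = XGBClassifier(
    learning_rate=0.1,
    max_depth=6,
    eval_metric='logloss'
)
# Re-creating the estimators discards any earlier fit, so fit them again before predicting
clf.fit(X_train_hd, y_train_hd)
clf_xgb.fit(X_train_hd, y_train_hd)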

In [None]:
y_pred_hd = clf.predict(X_test_hd)

In [None]:
cm_hd = confusion_matrix(y_test_hd, y_pred_hd)
cm_hd

In [None]:
sns.heatmap(
    cm_hd,
    annot=True,
    fmt="d",
    cmap="Purples"
)
plt.xlabel("Predicted Labels")
plt.ylabel("Actual Labels")
plt.title("Heart Disease Confusion Matrix")
plt.show()

# Purpose of Economic Costs & Modelling:
## 🔍 Link model predictions with economic consequences and visualize cost patterns to make smarter medical decisions.

*Note: this is done only for the heart disease dataset (`heart.csv`).*

### 4.1 Define Costs

In [None]:
# Currency: USD. A negative cost is a saving.
COST_TP = -2000  # Savings from early treatments
COST_TN = 0      # Saved additional costs of procedures/medicine
COST_FP = 950    # Unnecessary test
COST_FN = 5000   # Missed diagnosis leading to complications

### 4.2 Compute Expected Costs

In [None]:
costs = []
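
To make the cost arithmetic concrete, here is a minimal sketch that turns one set of confusion-matrix counts into a total cost using the unit costs from 4.1. The helper name `total_cost` and the counts passed to it are hypothetical, chosen only to illustrate the calculation.

```python
# Same unit costs as in 4.1 (USD); a negative value is a saving.
COST_TP, COST_TN, COST_FP, COST_FN = -2000, 0, 950, 5000


def total_cost(tp, tn, fp, fn):
    """Return the total cost in USD for one set of confusion-matrix counts."""
    return tp * COST_TP + tn * COST_TN + fp * COST_FP + fn * COST_FN


# Hypothetical counts, not results from the model in this notebook.
example_cost = total_cost(tp=30, tn=25, fp=5, fn=3)
print(example_cost)  # -40250
```

A negative total means the savings from correctly treated cases outweigh the costs of false alarms and missed cases.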

### 4.3 Threshold Optimization

Please note: This section was developed with guidance from YouTube tutorials and combines that guidance with my own work.
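
As a cross-check on an empirical threshold sweep, the unit costs from 4.1 also imply a break-even probability: flagging a patient is cheaper in expectation once the predicted probability `p` satisfies `p * COST_TP + (1 - p) * COST_FP < p * COST_FN + (1 - p) * COST_TN`. The sketch below solves that inequality for `p`. It assumes the model's probabilities are well calibrated, which a random forest does not guarantee, and the helper name `break_even_threshold` is hypothetical.

```python
# Same unit costs as in 4.1 (USD).
COST_TP, COST_TN, COST_FP, COST_FN = -2000, 0, 950, 5000


def break_even_threshold(cost_tp, cost_tn, cost_fp, cost_fn):
    """Probability above which predicting 'positive' has lower expected cost."""
    extra_cost_false_alarm = cost_fp - cost_tn   # penalty for flagging a healthy patient
    extra_cost_missed_case = cost_fn - cost_tp   # penalty for missing a sick patient
    return extra_cost_false_alarm / (extra_cost_false_alarm + extra_cost_missed_case)


theoretical_thres = break_even_threshold(COST_TP, COST_TN, COST_FP, COST_FN)
print(round(theoretical_thres, 4))  # 0.1195
```

Because a missed case is far more expensive than a false alarm here, the break-even point sits well below the default cut-off of 0.5.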

In [None]:
pred = [] 

#### Determine the optimal threshold (Heart Disease dataset)

In [None]:
auc_score=roc_auc_score(y_test_hd, y_prob)
print('Ensemble test ROC-AUC: {}'.format(auc_score))
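
One way to read the ROC-AUC value: it equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as half. The sketch below computes it that way on a four-sample toy example. The helper `pairwise_auc` and the toy data are hypothetical and unrelated to the heart disease results.

```python
def pairwise_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0      # positive ranked above negative
            elif pos_score == neg_score:
                wins += 0.5      # ties count as half
    return wins / (len(positives) * len(negatives))


toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(toy_labels, toy_scores))  # 0.75
```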

We will now concatenate each set of model predictions to different columns

#### ROC curve
Assists in determining the optimal threshold, by:
* **combining** information about TP, FP, FN, TN
* from **multiple** confusion matrices
* computed at **different** thresholds
* into a single plot to **evaluate** the trade-off between sensitivity and fall-out.
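
The points above can be sketched by hand: each threshold gives one confusion matrix, and each confusion matrix gives one (fall-out, sensitivity) point on the curve. The helper `roc_point` and the toy labels and scores are hypothetical, used only to show the mechanics.

```python
def roc_point(labels, scores, threshold):
    """Return (false positive rate, true positive rate) at one threshold."""
    tp = fp = tn = fn = 0
    for label, score in zip(labels, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    return fp / (fp + tn), tp / (tp + fn)


toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]

# Lowering the threshold moves the point up and to the right along the curve.
roc_points = [roc_point(toy_labels, toy_scores, t) for t in (0.8, 0.4, 0.35, 0.1)]
print(roc_points)  # [(0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]
```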


### #1) Sweep through all the thresholds to find the cost-minimal cut-off

In [None]:
fpr, tpr, thresholds = roc_curve(y_test_hd, y_prob)
fpr, tpr, thresholds

In [None]:
def plot_roc(fpr, tpr):
    # Uses the global auc_score computed above for the legend label
    plt.plot(fpr, tpr, label=f"ROC Curve (AUC = {auc_score:.2f})")
    # Diagonal reference line: performance of a random classifier
    plt.plot([0, 1], [0, 1], linestyle="--", color="red")
    plt.xlabel("False Positive Rate     FP/(FP+TN)")
    plt.ylabel("True Positive Rate     TP/(TP+FN)")
    plt.title("Receiver Operating Characteristic (ROC)")
    plt.legend()
    plt.grid(True)
    plt.show()

In [None]:
plot_roc(fpr, tpr)

#### In essence:
✅ ROC curve aggregates multiple confusion matrices computed at various thresholds into a single plot to evaluate the trade-off between sensitivity and fall-out.

### Calculate accuracy for each threshold applied to the ROC Curve

In [None]:
tn, fp, fn, tp = cm_hd.ravel()  # counts at the default 0.5 threshold
tp, fp, tn, fn

In [None]:
accuracy_ls = []

In [None]:
for thres in thresholds:
    # Classify as positive when the predicted probability reaches the threshold
    y_pred = np.where(y_prob >= thres, 1, 0)

    # labels=[0, 1] guarantees a 2x2 matrix, so ravel() always yields four values
    tn, fp, fn, tp = confusion_matrix(y_test_hd, y_pred, labels=[0, 1]).ravel()

    accuracy = accuracy_score(y_test_hd, y_pred)
    accuracy_ls.append(accuracy)
    # Store this threshold's predictions once (y_pred comes from y_prob only,
    # so looping over clf and clf_xgb would just append duplicates)
    pred.append(pd.Series(y_pred))

    ## Compute Costs ##
    savings = tp * COST_TP            # early treatments: benefit
    avoided_costs = tn * COST_TN      # avoided unnecessary procedures/medicine: benefit
    needless_costs = fp * COST_FP     # false alarms: cost
    missed_case_costs = fn * COST_FN  # missed real cases: cost

    total_cost = savings + avoided_costs + needless_costs + missed_case_costs
    costs.append(total_cost)

In [None]:
min_len = min(len(thresholds), len(costs))
thresholds = thresholds[:min_len]
costs = costs[:min_len]

In [None]:
costs_df = pd.DataFrame({
    'thresholds': thresholds,
    'costs': costs
}).sort_values(by='costs', ascending=False)
costs_df

In [None]:
accuracy_df = pd.DataFrame({
    'thresholds': thresholds,
    'accuracy': accuracy_ls
}).sort_values(by='accuracy', ascending=False)

In [None]:
print(accuracy_df)

### Identify threshold that minimizes total_cost with highest accuracy
#### (Heart Disease dataset)

In [None]:
# index of min. cost
min_cost_idx = np.argmin(costs)

# Obtain threshold of min. cost
optimal_thres = thresholds[min_cost_idx]

# Optimal cost
optimal_cost = costs[min_cost_idx]

min_cost_idx, optimal_thres, optimal_cost
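
`np.argmin` returns the first minimum it finds, so it does not look at accuracy when several thresholds share the same lowest cost. A minimal sketch of a tie-break that prefers the higher accuracy among equal-cost thresholds is shown below. The `sweep` rows are hypothetical values, not results from this notebook.

```python
# Hypothetical sweep results: (threshold, total cost in USD, accuracy).
sweep = [
    (0.30, -41000, 0.82),
    (0.45, -43500, 0.84),
    (0.50, -43500, 0.87),
    (0.70, -39000, 0.85),
]

# Sort key: lowest cost first, then highest accuracy (negated so min() prefers it).
best_thres, best_cost, best_acc = min(sweep, key=lambda row: (row[1], -row[2]))
print(best_thres, best_cost, best_acc)  # 0.5 -43500 0.87
```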

### Note: The extra steps (optimal thresholds, costs, ROC curve functions) were done for learning purposes and may not have been needed
(because of the very low variance in the data)

## 4.4 Heart Disease Health Costs vs. Thresholds Plot

We trim `thresholds` and `costs` to the same length, since `costs` may contain redundant extra entries (padding).

Removing the extra spaces

#### Plot again

In [None]:
plt.figure(figsize=(8, 5))
plt.plot(thresholds, costs, marker='o')
plt.axvline(optimal_thres, color='red', linestyle='--', linewidth=2)
plt.title("Total Expected Cost vs. Classification Threshold")
plt.xlabel("Threshold")
plt.ylabel("Total Cost ($)")
plt.tight_layout()
plt.show()

#### Project Usefulness: 5/10

**Higher than 5?** Only if the project had used variable health-cost data and extended it with actionable insights, integrated it with local health systems, or delivered a tool that stakeholders could use immediately.