---
title: "Practice Activity 9.1: Decision Boundaries"
format: 
  html:
    embed-resources: true
execute:
  echo: true
code-fold: true
author: James Compagno
jupyter: python3
---

In [21]:
import numpy as np
import pandas as pd
import plotnine as p9
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, LogisticRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score, accuracy_score, precision_recall_fscore_support, roc_auc_score, confusion_matrix, classification_report, roc_curve, auc, precision_score, recall_score
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

## The Data

At this link, you will find a dataset containing information about heart disease patients: https://www.dropbox.com/scl/fi/0vrpdnq5asmeulc4gd50y/ha_1.csv?rlkey=ciisalceotl77ffqhqe3kujzv&dl=1

A description of the original dataset can be found here: https://archive.ics.uci.edu/dataset/45/heart+disease (However, this dataset has been cleaned and reduced, and the people have been given fictitious names.)

In [22]:
# Read the data
ha = pd.read_csv("https://www.dropbox.com/scl/fi/0vrpdnq5asmeulc4gd50y/ha_1.csv?rlkey=ciisalceotl77ffqhqe3kujzv&dl=1")
ha = ha.dropna()

ha.describe()

| statistic | age | sex | cp | trtbps | chol | restecg | thalach |
|---|---|---|---|---|---|---|---|
| count | 204.0 | 204.0 | 204.0 | 204.0 | 204.0 | 204.0 | 204.0 |
| mean | 53.813725 | 0.666667 | 2.04902 | 131.245098 | 248.377451 | 0.558824 | 149.147059 |
| std | 9.354781 | 0.472564 | 1.030352 | 18.352024 | 53.176624 | 0.526603 | 23.990925 |
| min | 29.0 | 0.0 | 1.0 | 94.0 | 126.0 | 0.0 | 71.0 |
| 25% | 46.0 | 0.0 | 1.0 | 120.0 | 212.75 | 0.0 | 132.0 |
| 50% | 54.0 | 1.0 | 2.0 | 129.5 | 241.0 | 1.0 | 153.5 |
| 75% | 61.0 | 1.0 | 3.0 | 140.0 | 276.25 | 1.0 | 166.25 |
| max | 77.0 | 1.0 | 4.0 | 200.0 | 564.0 | 2.0 | 202.0 |


In [23]:
# Separate X and Y
X = ha[['age', 'chol']]
y = ha['diagnosis']

# Train/test split (80/20), stratified on the outcome
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=67, stratify=y)

# # Model Library 
# model_library = {}
# records = []

## 1. Logistic Regression

Fit a Logistic Regression using only `age` and `chol` (cholesterol) as predictors.

For a 55-year-old, how high would their cholesterol need to be for the doctors to predict that heart disease is present?

How high for the doctors to estimate a 90% chance that heart disease is present?

In [24]:
model = LogisticRegression()

model.fit(X_train, y_train)

y_pred = model.predict(X_test)

# print() renders the newlines in the report; display() would show the raw string
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

Accuracy: 0.5609756097560976

Classification Report:
              precision    recall  f1-score   support

     Disease       0.59      0.74      0.65        23
  No Disease       0.50      0.33      0.40        18

    accuracy                           0.56        41
   macro avg       0.54      0.54      0.53        41
weighted avg       0.55      0.56      0.54        41


Confusion Matrix:
[[17  6]
 [12  6]]

In [None]:
# Extract the intercept and coefficients
intercept_log = model.intercept_[0]
coefs_log = model.coef_[0]



# Age is fixed at 55
age = 55

display(intercept_log)

display(coefs_log)

np.float64(-2.7223856216250093)

array([0.04415623, 0.00034468])
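One thing worth checking before interpreting these numbers: scikit-learn stores class labels in sorted order in `classes_`, and for a binary `LogisticRegression` the intercept and coefficients describe the log-odds of `classes_[1]`. The classification report above lists `Disease` before `No Disease`, which suggests the fitted log-odds here may refer to `No Disease` rather than `Disease`. A minimal sketch with made-up toy data (the values and the names `X_toy`, `y_toy`, and `toy_model` are hypothetical, chosen only to show the label ordering):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy (age, chol) rows with string labels; NOT taken from the real dataset
X_toy = np.array([[40, 200], [45, 210], [60, 280], [65, 300]], dtype=float)
y_toy = np.array(["No Disease", "No Disease", "Disease", "Disease"])

toy_model = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)

# Labels are stored in sorted order, regardless of their order in y_toy
print(toy_model.classes_)

# coef_ and intercept_ give the log-odds of this label
positive_label = toy_model.classes_[1]
print(positive_label)
```

Running `model.classes_` on the fitted model in this activity would confirm which outcome the 50% and 90% thresholds apply to.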

In [None]:
model_name = "chol_log_50"

# Cholesterol level on the decision boundary (50% probability, log-odds = 0):
# intercept + b_age * age + b_chol * chol = 0, solved for chol
chol_log_50 = (-intercept_log - (coefs_log[0] * age)) / coefs_log[1]

# Store results (only values computed above are recorded)
records = []
records.append({
    "Model": model_name,
    "Classification Type": "Logistic",
    "Age": age,
    "Hyperparameter 1 Name": "intercept_log",
    "Hyperparameter 1 Value": intercept_log,
    "Hyperparameter 2 Name": "coefs_log",
    "Hyperparameter 2 Value": coefs_log,
    "Output Context": "Cholesterol at the 50% decision boundary",
    "Output": chol_log_50,
})

display("Logistic Regression:")
display(f"For a 55-year-old to be on the decision boundary (50% probability), their cholesterol would need to be approximately: {chol_log_50}")


'Logistic Regression:'

'For a 55-year-old to be on the decision boundary (50% probability), their cholesterol would need to be approximately: 852.3651469978166'

In [27]:
# --- Question 2: Cholesterol level for 90% probability ---
# Set the log-odds to log(0.9 / 0.1) = log(9) and solve for chol
chol_log_90 = (np.log(9) - intercept_log - (coefs_log[0] * age)) / coefs_log[1]

display(f"For a 90% chance of heart disease, their cholesterol would need to be approximately: {chol_log_90}")

'For a 90% chance of heart disease, their cholesterol would need to be approximately: 7227.048034076787'
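Both answers come from the same rearrangement: setting `intercept + b_age * age + b_chol * chol` equal to the target log-odds `log(p / (1 - p))` and solving for `chol`. The sketch below wraps that in one helper and plugs in the rounded intercept and coefficients printed above. The function name `chol_for_probability` is illustrative, not part of the original notebook.

```python
import math

def chol_for_probability(intercept, coef_age, coef_chol, age, prob=0.5):
    """Cholesterol at which a linear log-odds model gives probability `prob`."""
    log_odds = math.log(prob / (1.0 - prob))  # equals 0 when prob = 0.5
    return (log_odds - intercept - coef_age * age) / coef_chol

# Rounded logistic regression values printed earlier in this section
b0, b_age, b_chol = -2.7223856216250093, 0.04415623, 0.00034468

chol_50 = chol_for_probability(b0, b_age, b_chol, age=55)
chol_90 = chol_for_probability(b0, b_age, b_chol, age=55, prob=0.9)
print(chol_50, chol_90)
```

Because the coefficients are rounded, the results should agree with the values reported above only to within a fraction of a unit. Both thresholds lie far outside the observed cholesterol range (maximum 564), which reflects how small the `chol` coefficient is.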

## 2. Linear Discriminant Analysis

Fit an LDA model using only `age` and `chol` (cholesterol) as predictors.

For a 55-year-old, how high would their cholesterol need to be for the doctors to predict that heart disease is present?

In [28]:
lda_model = LinearDiscriminantAnalysis()
lda_model.fit(X_train, y_train)

In [29]:
intercept_lda = lda_model.intercept_[0]
coefs_lda = lda_model.coef_[0]

display(intercept_lda)

display(coefs_lda)

np.float64(-2.704485484651287)

array([0.04379621, 0.00035287])

In [32]:
# --- Question: Cholesterol level for the decision boundary ---
chol_lda = (-intercept_lda - (coefs_lda[0] * age)) / coefs_lda[1]
display(f"For a 55-year-old to be on the decision boundary, their cholesterol would need to be approximately: {chol_lda}")

'For a 55-year-old to be on the decision boundary, their cholesterol would need to be approximately: 837.9667990552236'

## 3. Support Vector Classifier

Fit an SVC model using only `age` and `chol` as predictors. Don't forget to tune the regularization parameter.

For a 55-year-old, how high would their cholesterol need to be for the doctors to predict that heart disease is present?
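One possible way to approach this, sketched under several assumptions: a linear-kernel `SVC` inside a `Pipeline` with `StandardScaler`, with `C` tuned by `GridSearchCV`. The data below is synthetic (generated with NumPy, not the real `ha` dataset), and the `C` grid, the scoring metric, and all variable names are illustrative choices. Because the model is fit on standardized features, the boundary is converted back to original units using the scaler's `mean_` and `scale_`.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the (age, chol) training data; NOT the real dataset
rng = np.random.default_rng(0)
X_demo = np.column_stack([rng.normal(54, 9, 200), rng.normal(248, 53, 200)])
score = 0.08 * (X_demo[:, 0] - 54) + 0.02 * (X_demo[:, 1] - 248) + rng.normal(0, 0.5, 200)
y_demo = np.where(score > 0, "Disease", "No Disease")

# Tune the regularization parameter C with 5-fold cross-validation
pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC(kernel="linear"))])
grid = GridSearchCV(pipe, {"svc__C": [0.01, 0.1, 1, 10]}, cv=5, scoring="accuracy")
grid.fit(X_demo, y_demo)

best = grid.best_estimator_
scaler, svc = best.named_steps["scale"], best.named_steps["svc"]
w_age, w_chol = svc.coef_[0]      # weights on the standardized features
b = svc.intercept_[0]
mu_age, mu_chol = scaler.mean_
sd_age, sd_chol = scaler.scale_

# Solve w_age * z_age + w_chol * z_chol + b = 0 for chol in original units
age = 55
chol_svc = mu_chol - sd_chol * (b + w_age * (age - mu_age) / sd_age) / w_chol
print(grid.best_params_, chol_svc)
```

Swapping `X_demo` and `y_demo` for `X_train` and `y_train` would apply the same steps to the real data. As with the other models, the side of the boundary that maps to each label depends on the order of `classes_`.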

## 4. Comparing Decision Boundaries

Make a scatterplot of `age` and `chol`, coloring the points by their true disease outcome.  Add a line to the plot representing the **linear separator** (aka **decision boundary**) for each of the three models above.
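For a model whose boundary satisfies `intercept + b_age * age + b_chol * chol = 0`, the separator can be written as a straight line `chol = slope * age + line_intercept`, which is the form a line layer such as plotnine's `geom_abline(slope=..., intercept=...)` accepts. The sketch below computes those two numbers from the rounded logistic regression and LDA values printed earlier. The helper name `boundary_line` is illustrative, and an SVC line would need its own intercept and coefficients in the same original units.

```python
def boundary_line(intercept, coef_age, coef_chol):
    """Return (slope, line_intercept) of chol = slope * age + line_intercept."""
    return -coef_age / coef_chol, -intercept / coef_chol

# Rounded values printed in the logistic regression and LDA sections
lines = {
    "Logistic": boundary_line(-2.7223856216250093, 0.04415623, 0.00034468),
    "LDA": boundary_line(-2.704485484651287, 0.04379621, 0.00035287),
}

for name, (slope, line_intercept) in lines.items():
    # Cholesterol on each boundary at age 55, as a consistency check
    print(name, slope, line_intercept, slope * 55 + line_intercept)
```

Both slopes are steeply negative, so on an `age` versus `chol` scatterplot the two lines would look nearly vertical: the predicted class is driven almost entirely by age.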