## The Data

At this link, you will find a dataset containing information about heart disease patients: https://www.dropbox.com/scl/fi/0vrpdnq5asmeulc4gd50y/ha_1.csv?rlkey=ciisalceotl77ffqhqe3kujzv&dl=1

A description of the original dataset can be found here: https://archive.ics.uci.edu/dataset/45/heart+disease (However, this dataset has been cleaned and reduced, and the people have been given fictitious names.)

In [1]:
import pandas as pd

df = pd.read_csv("C:/Users/Eddie/Documents/GSB 544/Data/ha_1.csv")

## 1. Logistic Regression

Fit a Logistic Regression using only `age` and `chol` (cholesterol) as predictors.

For a 55 year old, how high would their cholesterol need to be for the doctors to predict heart disease is present?

How high for the doctors to estimate a 90% chance that heart disease is present?

In [2]:
import numpy as np
from sklearn.compose import make_column_selector, ColumnTransformer
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet, LogisticRegression
from sklearn.metrics import r2_score, confusion_matrix, ConfusionMatrixDisplay, accuracy_score, cohen_kappa_score, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, PolynomialFeatures
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

In [37]:
X = df[["age","chol"]]
y = df["diagnosis"]

ct_logistic = ColumnTransformer(
  [
    ("dummify", 
    OneHotEncoder(sparse_output = False, handle_unknown='ignore'),
    make_column_selector(dtype_include=object))
  ],
  remainder = "passthrough"
)

pipeline_logistic = Pipeline(
  [("preprocessing", ct_logistic),
  ("logit_step", LogisticRegression(max_iter = 1000))]
)

logistic = pipeline_logistic.fit(X,y)


In [38]:
coefs = logistic.named_steps['logit_step'].coef_
inter = logistic.named_steps['logit_step'].intercept_

intercept = inter[0]
age_coef = coefs[0,0]
chol_coef = coefs[0,1]

print(age_coef, chol_coef, intercept)

age = 55

0.04686330613036849 0.001801238519029131 -3.240112258859075


In [45]:
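import math

# Decision boundary (p = 0.5): intercept + age_coef*age + chol_coef*chol = 0
chol_55 = (-intercept - (age_coef * age)) / chol_coef
chol_55
# A cholesterol of about 368 is needed for a 55 year old to be predicted as having heart disease

# Log-odds for p = 0.9; the logit uses the natural log, not log10
z = math.log(0.9 / (1 - 0.9))

# log(0.9 / 0.1) = intercept + age_coef*age + chol_coef*x
x = (z - intercept - (age_coef * age)) / chol_coef
x
# About 1588 with the coefficients printed above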
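
Both questions solve the same fitted log-odds equation, `log(p / (1 - p)) = intercept + age_coef*age + chol_coef*chol`, for `chol`. The sketch below wraps that in a helper and checks the result by plugging it back into the sigmoid. It hard-codes the coefficients printed above; the function names `chol_threshold` and `predicted_prob` are made up for this illustration.

```python
import math

# Coefficients as printed by the fitted logistic model earlier in this notebook.
INTERCEPT = -3.240112258859075
AGE_COEF = 0.04686330613036849
CHOL_COEF = 0.001801238519029131


def chol_threshold(age, prob=0.5):
    """Cholesterol at which the model's predicted probability equals `prob`."""
    log_odds = math.log(prob / (1 - prob))  # natural log; equals 0 when prob = 0.5
    return (log_odds - INTERCEPT - AGE_COEF * age) / CHOL_COEF


def predicted_prob(age, chol):
    """Sigmoid of the linear predictor, used here only as a sanity check."""
    eta = INTERCEPT + AGE_COEF * age + CHOL_COEF * chol
    return 1 / (1 + math.exp(-eta))


for p in (0.5, 0.9):
    chol = chol_threshold(55, p)
    print(f"p = {p}: chol = {chol:.1f}, check = {predicted_prob(55, chol):.3f}")
```

The `check` column should reproduce the requested probability, which is a quick way to catch a wrong logarithm base or a sign error in the rearrangement.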

## 2. Linear Discriminant Analysis

Fit an LDA model using only `age` and `chol` (cholesterol)  as predictors.

For a 55 year old, how high would their cholesterol need to be for the doctors to predict heart disease is present?

In [46]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis


In [47]:
X = df[["age","chol"]]
y = df["diagnosis"]

ct_logistic = ColumnTransformer(
  [
    ("dummify", 
    OneHotEncoder(sparse_output = False, handle_unknown='ignore'),
    make_column_selector(dtype_include=object))
  ],
  remainder = "passthrough"
)

pipeline_lda = Pipeline(
  [("preprocessing", ct_logistic),
  ("LDA", LinearDiscriminantAnalysis())]
)

lda = pipeline_lda.fit(X,y)
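
The cell above fits the model but does not yet read off the boundary. A fitted binary LDA exposes `coef_` and `intercept_`, so the cholesterol threshold can be found the same way as for logistic regression. The sketch below is a minimal illustration under stated assumptions: the data are synthetic stand-ins (not the heart disease data), and the helper name `boundary_chol` is hypothetical.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis


def boundary_chol(model, age):
    """Solve intercept + w_age*age + w_chol*chol = 0 for chol.

    Assumes a fitted binary linear classifier whose two features are
    ordered (age, chol), as in this notebook.
    """
    w_age, w_chol = model.coef_[0]
    return -(model.intercept_[0] + w_age * age) / w_chol


# Synthetic stand-in data (NOT the heart disease data), only to make this runnable.
rng = np.random.default_rng(0)
n = 200
age = rng.normal(55, 9, n)
chol = rng.normal(245, 50, n)
# Hypothetical labelling rule with noise so that both features matter.
label = (0.05 * (age - 55) + 0.01 * (chol - 245) + rng.normal(0, 1, n) > 0).astype(int)

demo_lda = LinearDiscriminantAnalysis().fit(np.column_stack([age, chol]), label)
chol_star = boundary_chol(demo_lda, 55)

# On the boundary the decision function is zero, so the predicted class flips there.
print(chol_star, demo_lda.decision_function([[55, chol_star]]))
```

On the notebook's fitted pipeline, the corresponding call would be `boundary_chol(lda.named_steps["LDA"], 55)`. Which side of the boundary counts as disease depends on the order of `classes_`, so it is worth checking that attribute before interpreting the sign.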

## 3. Support Vector Classifier

Fit an SVC model using only `age` and `chol` as predictors.  Don't forget to tune the regularization parameter.

For a 55 year old, how high would their cholesterol need to be for the doctors to predict heart disease is present?

In [None]:
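from sklearn.svm import SVC

X = df[["age","chol"]]
y = df["diagnosis"]

ct_svc = ColumnTransformer(
  [
    ("dummify", 
    OneHotEncoder(sparse_output = False, handle_unknown='ignore'),
    make_column_selector(dtype_include=object))
  ],
  remainder = "passthrough"
)

pipeline_svc = Pipeline(
  [("preprocessing", ct_svc),
  ("svc", SVC(kernel = "linear"))]
)

# Tune the regularization parameter C by cross-validation (a small illustrative grid)
grid_svc = GridSearchCV(pipeline_svc, {"svc__C": [0.001, 0.01, 0.1, 1]}, cv = 5, scoring = "accuracy")
svc = grid_svc.fit(X, y).best_estimator_

# Boundary for a 55 year old: intercept + age_coef*55 + chol_coef*chol = 0
svc_step = svc.named_steps["svc"]
chol_55_svc = (-svc_step.intercept_[0] - svc_step.coef_[0,0] * 55) / svc_step.coef_[0,1]
chol_55_svc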

## 4. Comparing Decision Boundaries

Make a scatterplot of `age` and `chol`, coloring the points by their true disease outcome.  Add a line to the plot representing the **linear separator** (aka **decision boundary**) for each of the three models above.

In [None]:
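import matplotlib.pyplot as plt
from sklearn.svm import SVC

X = df[["age","chol"]]
y = df["diagnosis"]

# Refit the three linear classifiers on the two predictors
# (C = 1 for the SVC is a placeholder; substitute the tuned value)
models = {
  "Logistic": LogisticRegression(max_iter = 1000),
  "LDA": LinearDiscriminantAnalysis(),
  "SVC": SVC(kernel = "linear", C = 1)
}

ages = np.linspace(X["age"].min(), X["age"].max(), 100)

fig, ax = plt.subplots()
for label, group in df.groupby("diagnosis"):
  ax.scatter(group["age"], group["chol"], label = label, alpha = 0.6)

for name, model in models.items():
  model.fit(X, y)
  # Boundary: intercept + age_coef*age + chol_coef*chol = 0, solved for chol
  chol_line = -(model.intercept_[0] + model.coef_[0,0] * ages) / model.coef_[0,1]
  ax.plot(ages, chol_line, label = name + " boundary")

ax.set_xlabel("age")
ax.set_ylabel("chol")
# Keep the view on the data; the boundary lines can extend far outside it
ax.set_ylim(X["chol"].min() - 20, X["chol"].max() + 20)
ax.legend()
plt.show()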