---
title: "Practice Activity 9.2: Multiclass"
format: 
  html:
    embed-resources: true
execute:
  echo: true
code-fold: true
author: James Compagno
jupyter: python3
---


Our dataset consists of clinical data from patients who entered the hospital complaining of chest pain ("angina") during exercise.  The information collected includes:

* `age` : Age of the patient

* `sex` : Sex of the patient

* `cp` : Chest Pain type

    + Value 0: asymptomatic
    + Value 1: typical angina
    + Value 2: atypical angina
    + Value 3: non-anginal pain
   
    
* `trtbps` : resting blood pressure (in mm Hg)

* `chol` : cholesterol in mg/dl fetched via BMI sensor

* `restecg` : resting electrocardiographic results

    + Value 0: normal
    + Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
    + Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria

* `thalach` : maximum heart rate achieved during exercise

* `output` : the doctor's diagnosis of whether the patient is at risk for a heart attack
    + 0 = not at risk of heart attack
    + 1 = at risk of heart attack

In [52]:
import numpy as np
import pandas as pd
import plotnine as p9
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, LogisticRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score, accuracy_score, precision_recall_fscore_support, roc_auc_score, confusion_matrix, classification_report, roc_curve, auc, precision_score, recall_score, f1_score
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

In [53]:
# Read the data
ha = pd.read_csv("https://www.dropbox.com/s/aohbr6yb9ifmc8w/heart_attack.csv?dl=1")

ha.head()

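|   | age | sex | cp | trtbps | chol | restecg | thalach | output |
|---|-----|-----|----|--------|------|---------|---------|--------|
| 0 | 63 | 1 | 3 | 145 | 233 | 0 | 150 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 1 | 187 | 1 |
| 2 | 56 | 1 | 1 | 120 | 236 | 1 | 178 | 1 |
| 3 | 57 | 0 | 0 | 120 | 354 | 1 | 163 | 1 |
| 4 | 57 | 1 | 0 | 140 | 192 | 1 | 148 | 1 |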


In [54]:
# Separate predictors (X) and target (y): chest pain type
X = ha.drop(['cp'], axis=1)
y = ha['cp']

# 80/20 train/test split, stratified on chest pain type
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=67, stratify=y)

# Model library and results records
model_library = {}
records = []

## Q1: Natural Multiclass Models

Fit a multiclass KNN, Decision Tree, and LDA for the heart disease data, this time predicting the type of chest pain (categories 0 to 3) that a patient experiences. For the decision tree, plot the fitted tree and interpret the first couple of splits.


In [55]:
# Create model: standardize features, then 5-nearest-neighbors
model_name = 'KNN'
knn_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier(n_neighbors=5))
])

# Evaluate on the held-out test set
knn_pipe.fit(X_train, y_train)
knn_pred = knn_pipe.predict(X_test)
knn_accuracy = accuracy_score(y_test, knn_pred)

# Store results
records.append({
    "Model": model_name,
    "Classification Type": "KNN",
    "Quality Measure": "Accuracy",
    "Value": knn_accuracy,
})

In [56]:
# Create model: decision tree with depth capped at md
model_name = 'Decision Tree'
md = 3
dt_model = DecisionTreeClassifier(max_depth=md, random_state=42)
dt_model.fit(X_train, y_train)

# Evaluate on the held-out test set
dt_pred = dt_model.predict(X_test)
dt_accuracy = accuracy_score(y_test, dt_pred)

# Store results
records.append({
        "Model": model_name,
        "Classification Type": "Decision Tree",
        "Quality Measure": "Accuracy",
        "Value": dt_accuracy,
    })

In [57]:
# Create model: standardize features, then LDA
model_name = 'LDA'
lda_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('lda', LinearDiscriminantAnalysis())
])
lda_pipe.fit(X_train, y_train)

# Evaluate on the held-out test set
lda_pred = lda_pipe.predict(X_test)
lda_accuracy = accuracy_score(y_test, lda_pred)

# Store results
records.append({
        "Model": model_name,
        "Classification Type": "Linear Discriminant Analysis",
        "Quality Measure": "Accuracy",
        "Value": lda_accuracy,
    })

In [58]:
pd.DataFrame(records)

|   | Model | Classification Type | Quality Measure | Value |
|---|-------|---------------------|-----------------|-------|
| 0 | KNN | KNN | Accuracy | 0.581818 |
| 1 | Decision Tree | Decision Tree | Accuracy | 0.527273 |
| 2 | LDA | Linear Discriminant Analysis | Accuracy | 0.563636 |
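
Q1 also asks for a view of the fitted tree and a reading of its first splits. The following is a minimal sketch of one way to inspect splits, run on synthetic stand-in data rather than the heart data. The names `X_demo`, `y_demo`, `tree_demo`, and the three feature labels are hypothetical and chosen only for illustration; the synthetic labels are constructed so that the first two features matter and the third is noise.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data (NOT the heart data): 200 rows, 3 features, 4 classes.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
# Labels depend only on the signs of the first two features; the third is noise.
y_demo = (X_demo[:, 0] > 0).astype(int) + 2 * (X_demo[:, 1] > 0).astype(int)

# Illustrative labels only; they borrow column names from the heart data.
feature_names = ["age", "thalach", "chol"]
tree_demo = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_demo, y_demo)

# Text rendering of the fitted tree; each extra level of indentation is one split deeper.
tree_text = export_text(tree_demo, feature_names=feature_names)
print(tree_text)

# The root node (index 0) holds the first split: its feature index and threshold.
root_feature = feature_names[tree_demo.tree_.feature[0]]
root_threshold = tree_demo.tree_.threshold[0]
print(f"Root split: {root_feature} <= {root_threshold:.2f}")
```

For the model fitted above, the analogous call would be `export_text(dt_model, feature_names=list(X_train.columns))`, or `sklearn.tree.plot_tree` for a graphical version (which requires matplotlib). Rows that satisfy the root condition go to the left branch, and the second-level splits are then read the same way within each branch.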


## Q2: OvR

Create a new column in the `ha` dataset called `cp_is_3`, which is equal to `1` if the `cp` variable is equal to `3` and `0` otherwise.

Then, fit a Logistic Regression to predict this new target, and report the **F1 Score**.

Repeat for the other three `cp` categories.  Which category was the OvR approach best at distinguishing?

In [59]:
model_name = 'OvR'

for cp_value in [0, 1, 2, 3]:
    y_train_binary = (y_train == cp_value).astype(int)
    y_test_binary = (y_test == cp_value).astype(int)
    
    lr_pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('lr', LogisticRegression(max_iter=1000, class_weight='balanced'))
    ])
    lr_pipe.fit(X_train, y_train_binary)
    
    y_pred = lr_pipe.predict(X_test)
    f1 = f1_score(y_test_binary, y_pred)
    
    records.append({
        "Model": model_name + str(cp_value), 
        "Classification Type": "OvR",
        "Quality Measure": "F1",
        "Value": f1,  
    })

pd.DataFrame(records)

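|   | Model | Classification Type | Quality Measure | Value |
|---|-------|---------------------|-----------------|-------|
| 0 | KNN | KNN | Accuracy | 0.581818 |
| 1 | Decision Tree | Decision Tree | Accuracy | 0.527273 |
| 2 | LDA | Linear Discriminant Analysis | Accuracy | 0.563636 |
| 3 | OvR0 | OvR | F1 | 0.769231 |
| 4 | OvR1 | OvR | F1 | 0.457143 |
| 5 | OvR2 | OvR | F1 | 0.521739 |
| 6 | OvR3 | OvR | F1 | 0.206897 |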
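
In the table above, the OvR model for `cp = 0` has the highest F1 (0.769) and the model for `cp = 3` the lowest (0.207). To make the metric concrete, here is a small sketch of how F1 is computed from binary labels, as the harmonic mean of precision and recall. The helper `f1_from_labels` and the toy labels are hypothetical and exist only for this illustration; they are not taken from the heart data.

```python
def f1_from_labels(y_true, y_pred, positive=1):
    """F1 for one positive class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    if tp == 0:
        return 0.0  # no correct positive predictions, so F1 is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy multiclass labels (made up), turned into a one-vs-rest target for class 3.
cp_demo = [0, 3, 2, 3, 1, 0, 3, 2]
y_true_demo = [int(v == 3) for v in cp_demo]
y_pred_demo = [0, 1, 0, 0, 1, 0, 1, 0]

f1_demo = f1_from_labels(y_true_demo, y_pred_demo)
print(round(f1_demo, 3))
```

Here there are 2 true positives, 1 false positive, and 1 false negative, so precision and recall are both 2/3 and F1 is 2/3. For binary labels, `f1_score` from `sklearn.metrics` computes the same quantity. The column described in the question could be built with `ha['cp_is_3'] = (ha['cp'] == 3).astype(int)`; the loop above builds the equivalent binary targets from `y_train` and `y_test` instead.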


## Q3: OvO

Reduce your dataset to only the `0` and `1` types of chest pain.

Then, fit a Logistic Regression to predict between the two groups, and report the **ROC-AUC**.  

Repeat, comparing category `0` to `2` and then to `3`.  Which pair was the OvO approach best at distinguishing?

In [60]:
model_name = 'OvO'

# Pairs to compare: (0, 1), (0, 2), and (0, 3)
for pair_compare in [1, 2, 3]:
    # Keep only rows whose chest pain type is 0 or the comparison class
    mask_train = y_train.isin([0, pair_compare])
    mask_test = y_test.isin([0, pair_compare])
    
    X_train_ovo = X_train[mask_train]
    y_train_ovo = y_train[mask_train]
    X_test_ovo = X_test[mask_test]
    y_test_ovo = y_test[mask_test]
    
    # Create and fit logistic regression pipeline
    lr_pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('lr', LogisticRegression(max_iter=1000))
    ])
    lr_pipe.fit(X_train_ovo, y_train_ovo)
    
    # Predict probabilities and calculate ROC-AUC.
    # Column 1 is P(cp == pair_compare), since classes_ is sorted as [0, pair_compare].
    y_pred_proba = lr_pipe.predict_proba(X_test_ovo)[:, 1]
    roc_auc = roc_auc_score(y_test_ovo, y_pred_proba)
    
    records.append({
        "Model": model_name + f" 0vs{pair_compare}",
        "Classification Type": "OvO",
        "Quality Measure": "ROC-AUC",
        "Value": roc_auc,
    })

pd.DataFrame(records)

|   | Model | Classification Type | Quality Measure | Value |
|---|-------|---------------------|-----------------|-------|
| 0 | KNN | KNN | Accuracy | 0.581818 |
| 1 | Decision Tree | Decision Tree | Accuracy | 0.527273 |
| 2 | LDA | Linear Discriminant Analysis | Accuracy | 0.563636 |
| 3 | OvR0 | OvR | F1 | 0.769231 |
| 4 | OvR1 | OvR | F1 | 0.457143 |
| 5 | OvR2 | OvR | F1 | 0.521739 |
| 6 | OvR3 | OvR | F1 | 0.206897 |
| 7 | OvO 0vs1 | OvO | ROC-AUC | 0.910256 |
| 8 | OvO 0vs2 | OvO | ROC-AUC | 0.822115 |
| 9 | OvO 0vs3 | OvO | ROC-AUC | 0.894231 |
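
Among the three pairs in the table above, `0` vs `1` has the highest ROC-AUC (0.910) and `0` vs `2` the lowest (0.822). One way to read these numbers: ROC-AUC equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as half. The sketch below computes it that way on made-up scores. The helper `roc_auc_pairwise` and the toy data are hypothetical, used only for illustration.

```python
def roc_auc_pairwise(y_true, scores):
    """ROC-AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative case")
    # A correctly ordered pair counts 1, a tie counts 0.5, a reversed pair counts 0.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy labels and predicted probabilities (made up for this example).
y_true_demo = [0, 0, 1, 1, 0, 1]
scores_demo = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]

auc_demo = roc_auc_pairwise(y_true_demo, scores_demo)
print(round(auc_demo, 3))
```

Of the 9 positive/negative pairs here, 8 are ordered correctly (only the positive scored 0.35 falls below the negative scored 0.4), giving 8/9. This pairwise count is quadratic in the number of rows, so it is meant for intuition; `roc_auc_score` is the practical choice.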
