---
title: "Lab 8: Linear Classifiers"
format: 
  html:
    embed-resources: true
execute:
  echo: true
code-fold: true
author: James Compagno
jupyter: python3
---

# The Data
This week, we consider a dataset generated from text data.

The original dataset can be found here: https://www.kaggle.com/datasets/kingburrito666/cannabis-strains. It consists of user reviews of different strains of cannabis. Users rated their experience with the cannabis strain on a scale of 1 to 5. They also selected words from a long list to describe the Effects and the Flavor of the cannabis.

In the dataset linked above, each row is one strain of cannabis. The average rating of all testers is reported, as well as the most commonly used words for the effect and flavor.

Some data cleaning has been performed for you: The Effect and Flavor columns have been converted to dummy variables indicating if the particular word was used for the particular strain.
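
As a rough illustration of that cleaning step, the sketch below turns comma-separated descriptor strings into 0/1 dummy columns with pandas. The toy rows and the assumption that the raw `Effects` text is comma-separated are for illustration only; the cleaning code actually used for this lab is not shown.

```python
import pandas as pd

# Toy stand-in for the raw data (hypothetical rows, not taken from the real file)
raw = pd.DataFrame({
    "Strain": ["A", "B", "C"],
    "Effects": ["Happy,Relaxed", "Energetic,Happy", "Sleepy"],
})

# One 0/1 column per distinct word found in the comma-separated strings
effect_dummies = raw["Effects"].str.get_dummies(sep=",")

# Put the identifier back next to the indicator columns
cleaned = pd.concat([raw[["Strain"]], effect_dummies], axis=1)
print(cleaned)
```

Each word becomes its own column, so a strain described as "Happy,Relaxed" has a 1 in both `Happy` and `Relaxed` and a 0 everywhere else.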

This cleaned data can be found at: https://www.dropbox.com/s/s2a1uoiegitupjc/cannabis_full.csv

Our goal will be to fit models that distinguish the Sativa types from the Indica types, and then to fit models that also distinguish the Hybrid types.

IMPORTANT: In this assignment, you do not need to consider different feature sets. Normally, this would be a good thing to try - but for this homework, simply include all the predictors for every model.


In [52]:
import numpy as np
import pandas as pd
import plotnine as p9
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, LogisticRegression, Ridge
from sklearn.metrics import (mean_squared_error, r2_score, accuracy_score,
                             precision_recall_fscore_support, roc_auc_score,
                             confusion_matrix, classification_report, roc_curve,
                             auc, precision_score, recall_score, f1_score)
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.svm import SVC


# Part One: Binary Classification

Create a dataset that is limited only to the Sativa and Indica type cannabis strains.

This section asks you to create a final best model for each of the four new model types studied this week: LDA, QDA, SVC, and SVM. For SVM, you may limit yourself to only the polynomial kernel.

For each, you should:

- Choose a metric you will use to select your model, and briefly justify your choice. (Hint: There is no specific target category here, so this should not be a metric that only prioritizes one category.)

- Find the best model for predicting the Type variable. Don't forget to tune any hyperparameters.

- Report the (cross-validated!) metric.

- Fit the final model.

- Output a confusion matrix.
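
The steps in this list can be sketched as one reusable helper. This is a minimal sketch under stated assumptions: the function name `tune_and_evaluate`, the synthetic data from `make_classification`, and the LDA `shrinkage` grid are illustrative choices, not part of the lab's own code.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import GridSearchCV, cross_val_predict


def tune_and_evaluate(model, param_grid, X, y, cv=5):
    """Tune on ROC AUC, then return the refit model, pooled CV AUC, and CV confusion matrix."""
    search = GridSearchCV(model, param_grid, cv=cv, scoring="roc_auc")
    search.fit(X, y)                      # refits the best setting on all rows
    best = search.best_estimator_

    # Out-of-fold predictions: every row is scored by a model that never saw it
    proba = cross_val_predict(best, X, y, cv=cv, method="predict_proba")[:, 1]
    pred = cross_val_predict(best, X, y, cv=cv)

    return best, roc_auc_score(y, proba), confusion_matrix(y, pred)


# Synthetic stand-in for the Indica/Sativa data (assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

best_model, cv_auc, cm = tune_and_evaluate(
    LinearDiscriminantAnalysis(solver="lsqr"),
    {"shrinkage": [None, 0.1, 0.5]},      # hypothetical grid
    X_demo, y_demo,
)
print(round(cv_auc, 3))
print(cm)
```

Keeping the tuning, cross-validated scoring, and confusion matrix in one function would avoid repeating the same block for each model type.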

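For my metric I will choose ROC-AUC. It is computed from the ranking of the predicted probabilities rather than from a single cutoff, and it treats Indica and Sativa symmetrically, so it does not prioritize one category.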
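
A small check of why ROC-AUC suits a problem with no designated target category: swapping which class counts as "positive" leaves the score unchanged, provided the scores are flipped to match. The labels and probabilities below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array(["indica", "indica", "indica", "sativa", "sativa"])
p_sativa = np.array([0.10, 0.40, 0.35, 0.80, 0.30])   # made-up scores

# Treat sativa as the positive class
auc_sativa = roc_auc_score(y_true == "sativa", p_sativa)

# Treat indica as the positive class, scoring with P(indica) = 1 - P(sativa)
auc_indica = roc_auc_score(y_true == "indica", 1 - p_sativa)

print(auc_sativa, auc_indica)
```

Both values equal 4/6 here, because four of the six (sativa, indica) pairs are ranked in the correct order regardless of which class is labelled positive.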

In [46]:
weed = pd.read_csv("https://www.dropbox.com/s/s2a1uoiegitupjc/cannabis_full.csv?dl=1")
weed = weed.dropna()

weed.describe()

Unnamed: 0,Rating,Creative,Energetic,Tingly,Euphoric,Relaxed,Aroused,Happy,Uplifted,Hungry,Talkative,Giggly,Focused,Sleepy,Dry,Mouth,Earthy,Sweet,Citrus,Flowery,Violet,Diesel,Spicy/Herbal,Sage,Woody,Apricot,Grapefruit,Orange,Pungent,Grape,Pine,Skunk,Berry,Pepper,Menthol,Blue,Cheese,Chemical,Mango,Lemon,Peach,Vanilla,Nutty,Chestnut,Tea,Tobacco,Tropical,Strawberry,Blueberry,Mint,Apple,Honey,Lavender,Lime,Coffee,Ammonia,Minty,Tree,Fruit,Butter,Pineapple,Tar,Rose,Plum,Pear
count,2191.0,2191.0,2191.0,2191.0,2191.0,2191.0,2191.0,2191.0,2191.0,2191.0,2191.0,2191.0,2191.0,2191.0,2191.0,2191.0,2191.0,2191.0,2191.0,2191.0,2191.0,2191.0,2191.0,2191.0,2191.0,2191.0,2191.0,2191.0,2191.0,2191.0,2191.0,2191.0,2191.0,2191.0,2191.0,2191.0,2191.0,2191.0,2191.0,2191.0,2191.0,2191.0,2191.0,2191.0,2191.0,2191.0,2191.0,2191.0,2191.0,2191.0,2191.0,2191.0,2191.0,2191.0,2191.0,2191.0,2191.0,2191.0,2191.0,2191.0,2191.0,2191.0,2191.0,2191.0,2191.0
mean,4.430169,0.329986,0.283432,0.151073,0.727522,0.77225,0.088088,0.835691,0.670927,0.208581,0.158832,0.130534,0.264263,0.327704,0.000456,0.000456,0.504336,0.480146,0.240073,0.121406,0.003195,0.109539,0.102693,0.0178,0.116385,0.004108,0.017344,0.035144,0.205842,0.074395,0.154267,0.079416,0.162026,0.026472,0.010497,0.069375,0.02921,0.016887,0.014605,0.086262,0.002282,0.015518,0.01141,0.003195,0.007759,0.004108,0.069831,0.021451,0.06618,0.024646,0.007303,0.014149,0.016887,0.02419,0.010954,0.01278,0.018713,0.015518,0.015518,0.008672,0.019169,0.003651,0.007303,0.000913,0.001369
std,0.419576,0.470315,0.450767,0.358201,0.445336,0.419476,0.283487,0.37064,0.469984,0.406387,0.365602,0.336967,0.441041,0.469484,0.021364,0.021364,0.500095,0.49972,0.427225,0.326673,0.056446,0.312386,0.303627,0.132254,0.320759,0.063974,0.130578,0.184185,0.404408,0.262473,0.361287,0.270448,0.368559,0.160571,0.101941,0.254148,0.168434,0.128878,0.119994,0.280815,0.047727,0.123629,0.106232,0.056446,0.087763,0.063974,0.25492,0.144917,0.248653,0.15508,0.085162,0.118131,0.128878,0.153673,0.10411,0.112348,0.13554,0.123629,0.123629,0.092739,0.137151,0.060329,0.085162,0.030206,0.036986
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,4.3,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,4.4,0.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,4.6,1.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,5.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [47]:
for col in weed.columns:
    if col not in ['Type', 'Strain', 'Effects', 'Flavor']:
        weed[col] = pd.to_numeric(weed[col], errors='coerce')

# Target and predictors (identifier and raw text columns dropped)
y = weed['Type']
X = weed.drop(columns=['Type', 'Strain', 'Effects', 'Flavor'])

# Binary Split 
binary_weed = weed['Type'].isin(['indica', 'sativa'])
X_binary = X[binary_weed] 
y_binary = y[binary_weed]

# Model Library 
model_library = {}
records = []

## Q1: LDA - Linear Discriminant Analysis

In [48]:
model_name = "LDA_Binary"
lda_model = LinearDiscriminantAnalysis()

# Cross-validated prediction
y_pred_cv = cross_val_predict(lda_model, X_binary, y_binary, cv=5)
y_proba_cv = cross_val_predict(lda_model, X_binary, y_binary, cv=5, method='predict_proba')[:, 1]

# Metrics
conf_matrix = confusion_matrix(y_binary, y_pred_cv)
tn, fp, fn, tp = conf_matrix.ravel()
cv_roc_auc = roc_auc_score(y_binary, y_proba_cv)
cv_accuracy = accuracy_score(y_binary, y_pred_cv)
precision = precision_score(y_binary, y_pred_cv, pos_label='sativa', zero_division=0)
recall = recall_score(y_binary, y_pred_cv, pos_label='sativa', zero_division=0)
specificity = tn / (tn + fp) if (tn + fp) > 0 else 0.0

# Fit 
lda_model.fit(X_binary, y_binary)

# Store in model library
model_library[model_name] = lda_model

# Store results 
records.append({
    "Model": model_name,
    "Classification Type": "LDA",
    "Variables Used": "All",
    "Hyperparameter 1 Name": "NA", 
    "Hyperparameter 1 Value": "NA",
    "Hyperparameter 2 Name": "NA", 
    "Hyperparameter 2 Value": "NA",
    "Range Tested": "NA",
    "ROC AUC": cv_roc_auc,
    "CV Accuracy": cv_accuracy,
    "Confusion Matrix": conf_matrix,
    "Precision": precision,
    "Recall": recall,
    "Specificity": specificity,
})

# Print
print("Confusion Matrix (CV):")
print(conf_matrix)

Confusion Matrix (CV):
[[597  62]
 [ 88 321]]


The LDA model is strong overall at identifying the correct strain type, with a ROC-AUC of 0.932. However, its recall was lower (0.785), so some sativa strains were misclassified as indica. The model was very good at classifying indica strains, with a specificity of 0.906.
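
The recall and specificity quoted for LDA can be re-derived by hand from the printed confusion matrix. The sketch below assumes the row and column order is `[indica, sativa]` (scikit-learn sorts labels alphabetically), so sativa is the positive class.

```python
import numpy as np

# Cross-validated LDA confusion matrix from the output above
# rows = true class, columns = predicted class, order = [indica, sativa]
cm = np.array([[597, 62],
               [88, 321]])

tn, fp, fn, tp = cm.ravel()          # sativa is the positive class

recall = tp / (tp + fn)              # share of sativa strains correctly found
specificity = tn / (tn + fp)         # share of indica strains correctly found
precision = tp / (tp + fp)           # share of sativa predictions that are right
accuracy = (tp + tn) / cm.sum()

print(f"recall={recall:.3f} specificity={specificity:.3f} "
      f"precision={precision:.3f} accuracy={accuracy:.3f}")
```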

## Q2: QDA - Quadratic Discriminant Analysis

In [49]:
model_name = "QDA_Binary"
qda_model = QuadraticDiscriminantAnalysis()

# Parameter grid for QDA 
param_grid = {
    'reg_param': [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
}

# GridSearchCV
grid_search = GridSearchCV(
    qda_model,
    param_grid,
    cv=5,
    scoring='roc_auc',
    n_jobs=-1
)
grid_search.fit(X_binary, y_binary)

# Best model
best_qda = grid_search.best_estimator_
best_reg_param = grid_search.best_params_['reg_param']

# Cross-validated prediction with best model
y_pred_cv = cross_val_predict(best_qda, X_binary, y_binary, cv=5)
y_proba_cv = cross_val_predict(best_qda, X_binary, y_binary, cv=5, method='predict_proba')[:, 1]

# Metrics
conf_matrix = confusion_matrix(y_binary, y_pred_cv)
tn, fp, fn, tp = conf_matrix.ravel()
cv_roc_auc = roc_auc_score(y_binary, y_proba_cv)
cv_accuracy = accuracy_score(y_binary, y_pred_cv)
precision = precision_score(y_binary, y_pred_cv, pos_label='sativa', zero_division=0)
recall = recall_score(y_binary, y_pred_cv, pos_label='sativa', zero_division=0)
specificity = tn / (tn + fp) if (tn + fp) > 0 else 0.0

# Store in model library
model_library[model_name] = best_qda

# Store results 
records.append({
    "Model": model_name,
    "Classification Type": "QDA",
    "Variables Used": "All",
    "Hyperparameter 1 Name": "reg_param", 
    "Hyperparameter 1 Value": best_reg_param,
    "Hyperparameter 2 Name": "NA", 
    "Hyperparameter 2 Value": "NA",
    "Range Tested": str(param_grid['reg_param']),
    "ROC AUC": cv_roc_auc,
    "CV Accuracy": cv_accuracy,
    "Confusion Matrix": conf_matrix,
    "Precision": precision,
    "Recall": recall,
    "Specificity": specificity,
})

# Print
print("Confusion Matrix (CV):")
print(conf_matrix)



Confusion Matrix (CV):
[[601  58]
 [ 92 317]]


With a ROC-AUC of 0.937, QDA slightly outperforms LDA overall. QDA correctly identified 4 more indica strains than LDA, at the expense of misclassifying 4 more sativa strains as indica. Precision therefore went up at the cost of recall.

## Q3: SVC - Support Vector Classifier

In [None]:
model_name = "SVC_Binary"
svc_model = SVC(kernel='linear', probability=True, random_state=67)

# Parameter grid for SVC
param_grid = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]
}

# GridSearchCV
grid_search = GridSearchCV(
    svc_model,
    param_grid,
    cv=5,
    scoring='roc_auc',
    n_jobs=-1
)
grid_search.fit(X_binary, y_binary)

# Best model
best_svc = grid_search.best_estimator_
best_C = grid_search.best_params_['C']

# Cross-validated prediction with best model
y_pred_cv = cross_val_predict(best_svc, X_binary, y_binary, cv=5)
y_proba_cv = cross_val_predict(best_svc, X_binary, y_binary, cv=5, method='predict_proba')[:, 1]

# Metrics
conf_matrix = confusion_matrix(y_binary, y_pred_cv)
tn, fp, fn, tp = conf_matrix.ravel()
cv_roc_auc = roc_auc_score(y_binary, y_proba_cv)
cv_accuracy = accuracy_score(y_binary, y_pred_cv)
precision = precision_score(y_binary, y_pred_cv, pos_label='sativa', zero_division=0)
recall = recall_score(y_binary, y_pred_cv, pos_label='sativa', zero_division=0)
specificity = tn / (tn + fp) if (tn + fp) > 0 else 0.0

# Store in model library
model_library[model_name] = best_svc

# Store results 
records.append({
    "Model": model_name,
    "Classification Type": "SVC",
    "Variables Used": "All",
    "Hyperparameter 1 Name": "C", 
    "Hyperparameter 1 Value": best_C,
    "Hyperparameter 2 Name": "NA", 
    "Hyperparameter 2 Value": "NA",
    "Range Tested": str(param_grid['C']),
    "ROC AUC": cv_roc_auc,
    "CV Accuracy": cv_accuracy,
    "Confusion Matrix": conf_matrix,
    "Precision": precision,
    "Recall": recall,
    "Specificity": specificity,
})

# Print
print("Confusion Matrix (CV):")
print(conf_matrix)

## Q4: SVM - Support Vector Machine
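
A minimal sketch of how a polynomial-kernel SVM could be tuned with the same ROC-AUC scoring used for the other models. The synthetic data, the grid values, and the name `poly_search` are assumptions made for illustration; no results for the real data are implied.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for X_binary / y_binary (assumption)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Hypothetical grid over the cost parameter and the polynomial degree
param_grid = {
    "C": [0.1, 1, 10],
    "degree": [2, 3],
}

# The "roc_auc" scorer can rank with decision_function,
# so probability=True is not needed for tuning
poly_search = GridSearchCV(SVC(kernel="poly"), param_grid, cv=5, scoring="roc_auc")
poly_search.fit(X_demo, y_demo)

print(poly_search.best_params_)
print(round(poly_search.best_score_, 3))
```

Leaving `probability=True` off during the search avoids the extra internal cross-validation that probability calibration adds to every fit; it would only be needed afterwards if `predict_proba` is called on the final model.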

In [50]:
dfPt1 = pd.DataFrame(records)
dfPt1.sort_values('ROC AUC', ascending=False)

| | Model | Classification Type | Variables Used | Hyperparameter 1 Name | Hyperparameter 1 Value | Hyperparameter 2 Name | Hyperparameter 2 Value | Range Tested | ROC AUC | CV Accuracy | Confusion Matrix | Precision | Recall | Specificity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | QDA_Binary | QDA | All | reg_param | 0.2 | | | [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, ... | 0.937465 | 0.859551 | [[601, 58], [92, 317]] | 0.845333 | 0.775061 | 0.911988 |
| 0 | LDA_Binary | LDA | All | | | | | | 0.931704 | 0.859551 | [[597, 62], [88, 321]] | 0.83812 | 0.784841 | 0.905918 |


# Part Two: Natural Multiclass
Now use the full dataset, including the Hybrid strains.

## Q1
Fit a decision tree, plot the final fit, and interpret the results.
Your answer here
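
One possible starting point for this question, sketched on synthetic three-class data rather than the real strains. The `max_depth=3` setting and the feature names `x0`..`x5` are arbitrary assumptions chosen to keep the printed tree readable.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic three-class stand-in for the Indica/Sativa/Hybrid data (assumption)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=0,
)

# A shallow tree: depth 3 is an arbitrary choice, not a tuned value
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X_demo, y_demo)

# Text rendering of the fitted splits; sklearn.tree.plot_tree draws the same tree
print(export_text(tree, feature_names=[f"x{i}" for i in range(6)]))
```

On the real data, the features chosen for the first few splits would indicate which effect or flavor words separate the three types most sharply.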

## Q2
Repeat the analyses from Part One for LDA, QDA, and KNN.

## Q3
Were your metrics better or worse than in Part One? Why? Which categories were most likely to get mixed up, according to the confusion matrices? Why?

# Part Three: Multiclass from Binary
Consider two models designed for binary classification: SVC and Logistic Regression.

## Q1
Fit and report metrics for OvR versions of the models. That is, for each of the two model types, create three models:

- Indica vs. Not Indica
- Sativa vs. Not Sativa
- Hybrid vs. Not Hybrid
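
The three one-vs-rest targets in this list can be built by recoding the label column once per type. The sketch below is a hedged example on synthetic data; the relabelling of the generated classes as strain types and the dictionary `ovr_scores` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in: three generated classes relabelled with the strain types
X_demo, y_num = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=0,
)
y_demo = np.array(["indica", "sativa", "hybrid"])[y_num]

ovr_scores = {}
for target in ["indica", "sativa", "hybrid"]:
    # 1 for the target type, 0 for everything else
    y_ovr = (y_demo == target).astype(int)
    scores = cross_val_score(
        LogisticRegression(max_iter=1000), X_demo, y_ovr, cv=5, scoring="roc_auc"
    )
    ovr_scores[f"{target} vs. not {target}"] = scores.mean()

for name, score in ovr_scores.items():
    print(f"{name}: {score:.3f}")
```

The same loop works for SVC by swapping the estimator; every row of the data is used in each of the three models.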

## Q2
Which of the six models did the best job distinguishing the target category from the rest? Which did the worst? Does this make intuitive sense?

## Q3
Fit and report metrics for OvO versions of the models. That is, for each of the two model types, create three models:

- Indica vs. Sativa
- Indica vs. Hybrid
- Hybrid vs. Sativa
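
For the one-vs-one pairs in this list, each model sees only the rows belonging to its two types. A minimal sketch on synthetic data, assuming a linear-kernel SVC and the hypothetical dictionary `pair_scores`:

```python
from itertools import combinations

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in: three generated classes relabelled with the strain types
X_demo, y_num = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=0,
)
y_demo = np.array(["indica", "sativa", "hybrid"])[y_num]

pair_scores = {}
for a, b in combinations(["indica", "sativa", "hybrid"], 2):
    # Keep only the rows for this pair; the third type is dropped entirely
    keep = np.isin(y_demo, [a, b])
    y_pair = (y_demo[keep] == b).astype(int)
    scores = cross_val_score(
        SVC(kernel="linear"), X_demo[keep], y_pair, cv=5, scoring="roc_auc"
    )
    pair_scores[f"{a} vs. {b}"] = scores.mean()

for name, score in pair_scores.items():
    print(f"{name}: {score:.3f}")
```

Unlike the one-vs-rest setup, each of these models is trained on a subset of the rows, so the scores are not directly comparable across pairs with different sample sizes.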

## Q4
Which of the six models did the best job at differentiating the two groups? Which did the worst? Does this make intuitive sense?

## Q5
Suppose you had simply input the full data, with three classes, into the LogisticRegression function. Would this have automatically taken an "OvO" approach or an "OvR" approach?

What about for SVC?

Note: You do not actually have to run code here - you only need to look at sklearn's documentation to see how these functions handle multiclass input.