**Instructions:**

- For questions that require coding, you need to write the relevant code and display its output. Your output should either be the direct answer to the question or clearly display the answer in it.
- For questions that require a written answer (sometimes along with the code), you need to put your answer in a Markdown cell. Writing the answer as a comment or as a print line is not acceptable.
- You need to render this file as HTML using Quarto and submit the HTML file. **Please note that this is a requirement and not optional.** A submission cannot be graded until it is properly rendered.

Import all the libraries and tools you need below.

In [1]:
import warnings
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.linear_model import LinearRegression, LogisticRegression, Lasso, Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, accuracy_score, precision_score, recall_score, confusion_matrix, f1_score, classification_report
from sklearn.preprocessing import StandardScaler, RobustScaler, MinMaxScaler
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_predict, cross_val_score, KFold, StratifiedKFold
from sklearn.svm import SVC, SVR, LinearSVC, LinearSVR
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# Run the line below to install the xgboost library. It is not in Anaconda by default.
!pip install xgboost

from xgboost import XGBClassifier, XGBRegressor





In this assignment, you will use the data from the **cirrhosis_outcomes.csv** file. Each observation is a patient with liver cirrhosis. 

- The `Status` variable represents the survival state of the patient at `N-Days`: `C` for censored (alive), `D` for death and `CL` for censored (alive) with liver transplant.
- All other variables are medical predictors, either about the treatment or the patient.

## 1) Preprocessing (15 points)

### a)

Read the data. Use `index_col=0` to assign the `id` variable to the index; it should not be a predictor. **(2 points)**

In [2]:
# Read the data (pd), assign the id variable to index
cirrhosis = pd.read_csv('cirrhosis_outcomes.csv', index_col = 0)
# Check data
cirrhosis.head()


id,N_Days,Drug,Age,Sex,Ascites,Hepatomegaly,Spiders,Edema,Bilirubin,Cholesterol,Albumin,Copper,Alk_Phos,SGOT,Tryglicerides,Platelets,Prothrombin,Stage,Status
0,999,D-penicillamine,21532,M,N,N,N,N,2.3,316.0,3.35,172.0,1601.0,179.8,63.0,394.0,9.7,3.0,D
1,2574,Placebo,19237,F,N,N,N,N,0.9,364.0,3.54,63.0,1440.0,134.85,88.0,361.0,11.0,3.0,C
2,3428,Placebo,13727,F,N,Y,Y,Y,3.3,299.0,3.55,131.0,1029.0,119.35,50.0,199.0,11.7,4.0,D
3,2576,Placebo,18460,F,N,N,N,N,0.6,256.0,3.5,58.0,1653.0,71.3,96.0,269.0,10.7,3.0,C
4,788,Placebo,16658,F,N,Y,N,N,1.1,346.0,3.65,63.0,1181.0,125.55,96.0,298.0,10.6,4.0,C


### b)

`Status` will be the response (target) variable for the classification task. Print the `value_counts` of the classes. Are the classes balanced? Which one is the minority class? **(5 points)**

In [3]:
# Print the value count for variable "Status"
print(cirrhosis['Status'].value_counts())

Status
C     4965
D     2665
CL     275
Name: count, dtype: int64


The classes are not balanced: `C` has 4,965 observations, `D` has 2,665 and `CL` has only 275. The minority class is `CL`.
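
The size of the imbalance can be quantified from the counts printed above. The following is a minimal sketch in plain Python; the counts are copied by hand from the `value_counts` output and the variable names are illustrative only.

```python
# Class counts copied from the value_counts output above
counts = {"C": 4965, "D": 2665, "CL": 275}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Share of each class in the full data set
for label, share in shares.items():
    print(f"{label}: {share:.1%}")

# Ratio between the largest and the smallest class
imbalance_ratio = max(counts.values()) / min(counts.values())
print(f"Largest-to-smallest class ratio: {imbalance_ratio:.1f}")
```

A ratio far above 1 is a sign that accuracy alone will mostly reflect the largest class.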

### c)

`map` the class labels to 0, 1 and 2. This is necessary because some models that are included do not recognize non-numeric input. **(2 points)**

In [4]:
# Create a dictionary for label mapping
label_mapping = {'C': 0, 'D': 1, 'CL': 2}
# Map the classes using pandas (change the Status column to numeric)
cirrhosis['Status'] = cirrhosis['Status'].map(label_mapping)
# Check data
cirrhosis.head()

id,N_Days,Drug,Age,Sex,Ascites,Hepatomegaly,Spiders,Edema,Bilirubin,Cholesterol,Albumin,Copper,Alk_Phos,SGOT,Tryglicerides,Platelets,Prothrombin,Stage,Status
0,999,D-penicillamine,21532,M,N,N,N,N,2.3,316.0,3.35,172.0,1601.0,179.8,63.0,394.0,9.7,3.0,1
1,2574,Placebo,19237,F,N,N,N,N,0.9,364.0,3.54,63.0,1440.0,134.85,88.0,361.0,11.0,3.0,0
2,3428,Placebo,13727,F,N,Y,Y,Y,3.3,299.0,3.55,131.0,1029.0,119.35,50.0,199.0,11.7,4.0,1
3,2576,Placebo,18460,F,N,N,N,N,0.6,256.0,3.5,58.0,1653.0,71.3,96.0,269.0,10.7,3.0,0
4,788,Placebo,16658,F,N,Y,N,N,1.1,346.0,3.65,63.0,1181.0,125.55,96.0,298.0,10.6,4.0,0


### d)

- Separate the response and the predictors. All variables other than `Status` should be a predictor.
- One-hot-encode the categorical predictors. (This can and should be done with one function in one line.) Use `drop_first=True`.
- Create the training and test data with an 80%-20% split. **Stratify the data.** Use `random_state=42`. 
- Scale the training and the test data.

**(6 points)**

In [5]:
# Separate response and predictors (axis = 1 drops a column)
X = cirrhosis.drop(['Status'], axis = 1) # predictors
y = cirrhosis['Status'] # response

# One-hot-encode categorical predictors (detected automatically, no need to specify columns)
X = pd.get_dummies(X, drop_first= True)

# Create the training and testing dataset with train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, stratify = y, random_state = 42)

# Scale training and testing data (Only for predictors)
scaler = StandardScaler()
# Fit the scaler model
scaler.fit(X_train)
# Use the scaler to transform X_train and X_test
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

## 2) Tuning and Evaluating Different Multi-Class Classifiers (40 points)

### a)

Create four models with the specified inputs:

- A [Logistic Regression](https://scikit-learn.org/1.5/modules/generated/sklearn.linear_model.LogisticRegression.html) model: Use `multi_class = 'ovr'`, `solver = liblinear` and `random_state=1`.
- A [Linear SVC](https://scikit-learn.org/dev/modules/generated/sklearn.svm.LinearSVC.html): Use the `LinearSVC` object for efficiency reasons. Use `multi_class = 'ovr'` and `random_state=1`.
- A [KNN (K-Nearest Neighbors)](https://scikit-learn.org/1.5/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) classifier: Do not use any inputs.
- An [XGBoost](https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn) classifier: Use `random_state=1`. Do not use any other inputs.

**(10 points)**

In [6]:
# Create logistic regression model
logistic_reg = LogisticRegression(multi_class = 'ovr', solver = 'liblinear', random_state = 1)

# Create a linear SVC
linear_SVC = LinearSVC(multi_class = 'ovr', random_state = 1)

# Create a KNN model
knn = KNeighborsClassifier()

# XGBoost classifier
xgboost = XGBClassifier(random_state = 1)


### b)

Note that the links to the model documentations are given in Part a. Using the documentations, answer the following questions:

- Do you see a `multi_class` input option for the models that did not take any such input in Part a? Why is that the case? (Only consider the scikit-learn API for XGBoost and disregard the experimental/work-in-progress inputs; they are not fully developed yet.)
- Among the models that took a `multi_class` input, `ovr` is an option along with some other algorithms. Is **OvO** (One vs One) one of the options? Why do you think this is the case?

**(10 points)**

There is no `multi_class` input option for KNN or XGBoost because they do not need one. A KNN classifier handles multi-class classification inherently: it assigns each observation the most common class among its k nearest neighbors, and that vote works the same way for any number of classes. No multi-class strategy has to be specified.

Similarly, XGBoost is tree-based and performs multi-class classification directly, so no multi-class strategy has to be specified in the model.

For Linear SVC and logistic regression, `multi_class` does not offer OvO. In my opinion, the reason is computational cost. OvO trains one classifier for every pair of classes, so the number of classifiers grows quadratically with the number of classes, whereas OvR trains only one classifier per class. `LinearSVC` in particular is built for efficiency, so OvO is not offered there. `SVC`, which supports non-linear kernels, does use OvO.
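
The cost argument can be made concrete by counting the binary classifiers each strategy trains. This is a small sketch in plain Python; the helper name `n_classifiers` is made up for illustration and is not part of scikit-learn.

```python
from itertools import combinations

def n_classifiers(n_classes):
    """Return (OvR count, OvO count) of binary classifiers for n_classes."""
    ovr = n_classes                          # one classifier per class
    ovo = n_classes * (n_classes - 1) // 2   # one classifier per unordered pair
    return ovr, ovo

# The class pairs OvO would train on for the three mapped labels 0, 1, 2
pairs = list(combinations([0, 1, 2], 2))
print(pairs)

# Growth of both strategies with the number of classes
for k in (3, 5, 10, 20):
    ovr, ovo = n_classifiers(k)
    print(f"{k} classes: OvR = {ovr}, OvO = {ovo}")
```

With only three classes both strategies train three binary classifiers, so the cost difference matters only as the number of classes grows.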



### c)

Using the given hyperparameter grids and the following specifications, tune and evaluate each model:

- Use `cv=5`. The default classification setting of `GridSearchCV` is stratification. (The object requirement in the previous in-class assignment was to get everyone familiar with the usage of those cross-validation setting objects.)
- Use `f1_macro` for scoring. The F1-score of a class is calculated as: $$2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$ The macro F1-score is the unweighted mean of the per-class F1-scores, so every class counts equally regardless of its support. It is a good metric to use if you want to tune your model with both precision and recall.
- Print the cross-validation performance of the best model (`best_score_`).
- Print the `confusion_matrix` and the `classification_report` for the test data.
- Print the **micro** recall score for the test data.

**(20 points)**
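
To show what the `f1_macro` scorer measures, the sketch below computes per-class F1-scores and their unweighted mean from a confusion matrix in plain Python. The matrix is hypothetical and used for illustration only, and `per_class_f1` is an illustrative helper, not a scikit-learn function. It relies on the identity $F_1 = \frac{2TP}{2TP + FP + FN}$, which is equivalent to the precision/recall formula above.

```python
def per_class_f1(cm):
    """Per-class F1 from a confusion matrix (rows = true, columns = predicted)."""
    n = len(cm)
    scores = []
    for k in range(n):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                       # true class k, predicted as another class
        fp = sum(cm[i][k] for i in range(n)) - tp  # predicted k, true class differs
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return scores

# Hypothetical 3-class confusion matrix, for illustration only
cm = [[50, 10, 0],
      [5, 30, 5],
      [2, 3, 5]]

f1_scores = per_class_f1(cm)
macro_f1 = sum(f1_scores) / len(f1_scores)  # every class has the same weight

print([round(s, 3) for s in f1_scores])
print(round(macro_f1, 3))
```

Because each class has the same weight, a class that the model never predicts contributes an F1 of 0 and lowers the macro score by a full third in a three-class problem.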

In [43]:
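# Grid for hyperparameters
# Note: the liblinear solver supports only the 'l1' and 'l2' penalties, so the
# None and 'elasticnet' candidates fail to fit and GridSearchCV scores them as NaN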
grid_lr = {
    'penalty': [None, 'l1', 'l2', 'elasticnet'],
    'l1_ratio': [0, 0.3, 0.6, 1],
    'C': [0.01,0.1,1,10,100]
}

# Do gridsearchCV with default cv = 5 (Grid object)
gscv = GridSearchCV(logistic_reg, grid_lr, cv = 5, scoring = 'f1_macro', n_jobs = -1)

# Fit the grid search on the scaled training data
gscv.fit(X_train_scaled, y_train)

# Print out the best score of the model
print(f"Best Score: {gscv.best_score_}")

# Print confusion matrix for the test data, classification report, and micro recall score
best_model = gscv.best_estimator_
y_pred = best_model.predict(X_test_scaled)
print(f"Confusion Matrix: {confusion_matrix(y_test, y_pred)}")
print(f"Classification Report: {classification_report(y_test, y_pred)}")
print(f"Micro recall: {recall_score(y_test, y_pred, average = 'micro')}")
warnings.filterwarnings("ignore")




Best Score: 0.5239958865593068
Confusion Matrix: [[887 106   0]
 [160 373   0]
 [ 30  25   0]]
Classification Report:               precision    recall  f1-score   support

           0       0.82      0.89      0.86       993
           1       0.74      0.70      0.72       533
           2       0.00      0.00      0.00        55

    accuracy                           0.80      1581
   macro avg       0.52      0.53      0.53      1581
weighted avg       0.77      0.80      0.78      1581

Micro recall: 0.7969639468690702




In [44]:
grid_svm = {
    'C': [0.01, 0.1, 1, 10, 100]
}

# Do gridsearchCV with default cv = 5 (Grid object)
gscv = GridSearchCV(linear_SVC, grid_svm, cv = 5, scoring = 'f1_macro', n_jobs = -1)

# Fit the grid search on the scaled training data
gscv.fit(X_train_scaled, y_train)

# Print out the best score of the model
print(f"Best Score: {gscv.best_score_}")

# Print confusion matrix for the test data, classification report, and micro recall score
best_model = gscv.best_estimator_
y_pred = best_model.predict(X_test_scaled)
print(f"Confusion Matrix: {confusion_matrix(y_test, y_pred)}")
print(f"Classification Report: {classification_report(y_test, y_pred)}")
print(f"Micro recall: {recall_score(y_test, y_pred, average = 'micro')}")

Best Score: 0.521399479134701
Confusion Matrix: [[900  93   0]
 [173 360   0]
 [ 32  23   0]]
Classification Report:               precision    recall  f1-score   support

           0       0.81      0.91      0.86       993
           1       0.76      0.68      0.71       533
           2       0.00      0.00      0.00        55

    accuracy                           0.80      1581
   macro avg       0.52      0.53      0.52      1581
weighted avg       0.77      0.80      0.78      1581

Micro recall: 0.7969639468690702


In [None]:
param_grid = {
    'n_neighbors': np.arange(1,25,2)
}

# Do gridsearchCV with default cv = 5 (Grid object)
gscv = GridSearchCV(knn, param_grid, cv = 5, scoring = 'f1_macro', n_jobs = -1)

# Fit the grid search on the scaled training data
gscv.fit(X_train_scaled, y_train)

# Print out the best score of the model
print(f"Best Score: {gscv.best_score_}")

# Print confusion matrix for the test data, classification report, and micro recall score
best_model = gscv.best_estimator_
y_pred = best_model.predict(X_test_scaled)
print(f"Confusion Matrix: {confusion_matrix(y_test, y_pred)}")
print(f"Classification Report: {classification_report(y_test, y_pred)}")
print(f"Micro recall: {recall_score(y_test, y_pred, average = 'micro')}")


Best Score: 0.5344739252515328
Confusion Matrix: [[814 156  23]
 [176 340  17]
 [ 28  20   7]]
Classification Report:               precision    recall  f1-score   support

           0       0.80      0.82      0.81       993
           1       0.66      0.64      0.65       533
           2       0.15      0.13      0.14        55

    accuracy                           0.73      1581
   macro avg       0.54      0.53      0.53      1581
weighted avg       0.73      0.73      0.73      1581

Micro recall: 0.7343453510436433


In [8]:
best_model

In [11]:
param_grid_xgb = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 4, 5],
    'learning_rate': [0.1, 0.01, 0.001]
}

# Do gridsearchCV with default cv = 5 (Grid object)
gscv = GridSearchCV(xgboost, param_grid_xgb, cv = 5, scoring = 'f1_macro', n_jobs = -1)

# Fit the grid search on the scaled training data
gscv.fit(X_train_scaled, y_train)

# Print out the best score of the model
print(f"Best Score: {gscv.best_score_}")

# Print confusion matrix for the test data, classification report, and micro recall score
best_model = gscv.best_estimator_
y_pred = best_model.predict(X_test_scaled)
print(f"Confusion Matrix: {confusion_matrix(y_test, y_pred)}")
print(f"Classification Report: {classification_report(y_test, y_pred)}")
print(f"Micro recall: {recall_score(y_test, y_pred, average = 'micro')}")

Best Score: 0.6353943747694202
Confusion Matrix: [[902  89   2]
 [144 387   2]
 [ 26  19  10]]
Classification Report:               precision    recall  f1-score   support

           0       0.84      0.91      0.87       993
           1       0.78      0.73      0.75       533
           2       0.71      0.18      0.29        55

    accuracy                           0.82      1581
   macro avg       0.78      0.61      0.64      1581
weighted avg       0.82      0.82      0.81      1581

Micro recall: 0.8216318785578748


In [10]:
best_model

## 3) Interpretation (45 points)

Using the prediction results of all four models, answer the following questions. **You need to justify your answers with the corresponding results for credit.**

### a)

In this classification task, what is the random baseline accuracy that the accuracy values would be compared against? **(5 points)**

The random baseline accuracy is $\displaystyle\frac{100}{\text{Number of Classes}} = \frac{100}{3} \approx 33.33$ %, the expected accuracy of guessing one of the three classes uniformly at random.
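
As a rough check, the expected accuracy of a few naive guessing strategies can be computed from the class counts in Part 1b. The uniform guess corresponds to the answer above. The other two baselines are common alternatives added here only for comparison; the question does not ask for them. All variable names are illustrative.

```python
# Class counts from the value_counts output in Part 1b
counts = [4965, 2665, 275]
total = sum(counts)
priors = [n / total for n in counts]

# Guess each class with probability 1/3, independent of the true class
uniform_baseline = sum(p * (1 / len(counts)) for p in priors)

# Guess each class with probability equal to its share of the data
stratified_baseline = sum(p * p for p in priors)

# Always predict the most frequent class
majority_baseline = max(priors)

print(f"Uniform random guess:    {uniform_baseline:.4f}")
print(f"Stratified random guess: {stratified_baseline:.4f}")
print(f"Majority class:          {majority_baseline:.4f}")
```

Under these assumptions, a model has to clear the majority-class figure, not only one third, before its accuracy says much about what it has learned.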

### b)

How do the linear models handle the minority class? What do the False Negatives (FNs) and False Positives (FPs) of the minority class indicate about the linear models' capacity to handle the minority class? **(10 points)**

Both linear models, logistic regression and Linear SVC, handle the minority class `CL` (class 2) poorly: neither has a single True Positive for it.

Each model has 55 False Negatives and 0 False Positives for class 2. All 55 `CL` patients in the test data are assigned to class 0 or class 1, and no observation is ever predicted as class 2. The models are dominated by the majority classes and effectively ignore the minority class.
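
These counts can be read off the test confusion matrices printed in Part 2c. A minimal sketch in plain Python, with the matrices copied by hand (rows are true classes, columns are predicted classes) and an illustrative helper `fn_fp`:

```python
def fn_fp(cm, k):
    """Return (False Negatives, False Positives) of class k."""
    tp = cm[k][k]
    fn = sum(cm[k]) - tp                 # row k outside the diagonal
    fp = sum(row[k] for row in cm) - tp  # column k outside the diagonal
    return fn, fp

# Test confusion matrices of the two linear models, copied from Part 2c
cm_logreg = [[887, 106, 0], [160, 373, 0], [30, 25, 0]]
cm_svc = [[900, 93, 0], [173, 360, 0], [32, 23, 0]]

for name, cm in [("Logistic Regression", cm_logreg), ("Linear SVC", cm_svc)]:
    fn, fp = fn_fp(cm, 2)
    print(f"{name}: FN = {fn}, FP = {fp} for class 2")
```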

### c)

Is there a considerable difference between the micro and macro recall scores for all models? Why or why not? **(10 points)**

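Yes, there is a considerable difference for all four models. The micro recall is 0.80 for both linear models, 0.73 for KNN and 0.82 for XGBoost, while the macro recall is 0.53, 0.53, 0.53 and 0.61, respectively. The macro recall is the unweighted mean of the per-class recalls, so it is pulled down heavily by the minority class, whose recall is 0.00 for the linear models, 0.13 for KNN and 0.18 for XGBoost. The micro recall pools all observations before computing the score, so the majority classes dominate it and the low recall of the minority class has little impact.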
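
Both averages can be recomputed from the four test confusion matrices in Part 2c. The sketch below uses plain Python with the matrices copied by hand; `micro_macro_recall` is an illustrative helper, not a library function.

```python
def micro_macro_recall(cm):
    """Return (micro, macro) recall for a confusion matrix (rows = true classes)."""
    per_class = [row[k] / sum(row) for k, row in enumerate(cm)]
    # Micro: pool all observations, i.e. correct predictions over all rows
    micro = sum(row[k] for k, row in enumerate(cm)) / sum(sum(row) for row in cm)
    # Macro: unweighted mean of the per-class recalls
    macro = sum(per_class) / len(per_class)
    return micro, macro

# Test confusion matrices copied from Part 2c
confusion = {
    "Logistic Regression": [[887, 106, 0], [160, 373, 0], [30, 25, 0]],
    "Linear SVC": [[900, 93, 0], [173, 360, 0], [32, 23, 0]],
    "KNN": [[814, 156, 23], [176, 340, 17], [28, 20, 7]],
    "XGBoost": [[902, 89, 2], [144, 387, 2], [26, 19, 10]],
}

for name, cm in confusion.items():
    micro, macro = micro_macro_recall(cm)
    print(f"{name:<20} micro = {micro:.2f}  macro = {macro:.2f}")
```

In a single-label multi-class problem the micro recall is the same number as the accuracy, which is why it tracks the majority classes so closely.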

### d)

Compare the test accuracies of the linear models with the KNN classifier. Which one has a higher accuracy? Is accuracy a useful metric to evaluate the model performance in this case, especially regarding the minority class? Why or why not? **(10 points)**

Both linear models have a test accuracy of 0.80, higher than the KNN classifier's 0.73.

Accuracy is not a useful metric here. It is the share of correct predictions over all observations, $\frac{TP+TN}{TP+TN+FP+FN}$ in the binary case, so it is dominated by the majority classes: the linear models reach 0.80 without predicting class 2 even once, while KNN has a lower accuracy but identifies 7 of the 55 class 2 patients (a recall of 0.13). The per-class recall, $\frac{TP}{TP + FN}$, is more informative because it exposes the False Negatives of each class. In a medical setting the most critical error is predicting that a patient will be fine when they in fact die, so False Negatives should be reduced first.

### e)

Which model performs the best overall? How does its performance still change with the support (number of observations) of each class? What do you think can be done to overcome this persistent issue? (You will explore some options in this regard in Homework Assignment 2.) **(10 points)**

XGBoost performs best overall. It achieves the highest accuracy (0.82), which equals its micro recall, as well as the highest macro recall (0.61) and macro F1-score (0.64). Its performance still improves with the support of each class: the F1-score is 0.87 for class 0 (993 observations), 0.75 for class 1 (533 observations) and only 0.29 for class 2 (55 observations), where the recall is just 0.18. The model predicts class 0 best because it has the most observations, and class 2 worst because it has few observations to learn from.

Undersampling the majority class and oversampling the minority class would mitigate the class imbalance and help the model overcome this persistent issue.
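
As an illustration of the resampling idea, the sketch below randomly oversamples every class up to the size of the largest one using only the standard library. It is a simplified stand-in under stated assumptions, not the method Homework Assignment 2 will use: `random_oversample` is a hypothetical helper, the labels are toy data, and in practice the resampling would be applied to the training split only so that the test data keeps its original class distribution.

```python
import random
from collections import Counter

def random_oversample(labels, seed=42):
    """Return row indices in which every class is resampled up to the majority count."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    target = max(len(rows) for rows in by_class.values())
    resampled = []
    for rows in by_class.values():
        resampled.extend(rows)
        # Draw the missing rows with replacement from the same class
        resampled.extend(rng.choices(rows, k=target - len(rows)))
    rng.shuffle(resampled)
    return resampled

# Toy labels with a similar kind of imbalance (hypothetical data)
y_toy = [0] * 18 + [1] * 10 + [2] * 1
idx = random_oversample(y_toy)
balanced_counts = Counter(y_toy[i] for i in idx)
print(balanced_counts)
```

The returned indices could then be used to select rows of the scaled training predictors and the training labels before fitting a model.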