## BME i9400
## Fall 2024
### Homework 4: Stratified K-Fold Cross Validation and L2 Regularized Logistic Regression


**Due date: Wednesday, November 13th 2024, 11:59:59.987 PM EST**

In this homework, you will implement a logistic regression model with L2 regularization, and evaluate it using stratified K-Fold cross-validation.

Stratification refers to the process of rearranging the data so as to ensure that each fold is a good representative of the whole. For example, in a binary classification problem where each class comprises 50% of the data, it is best to arrange the data such that in every fold, each class comprises around half the instances.

In the cells below, I have indicated places where code needs to be added with instructions contained in double hashtags (for example ## DO SOMETHING ##). 

In [11]:
import pandas as pd
import numpy as np
from scipy.special import y_pred
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score, average_precision_score
from sklearn.model_selection import StratifiedKFold
import matplotlib.pyplot as plt


ImportError: DLL load failed while importing _multiarray_umath: The specified module could not be found.

ImportError: numpy._core.multiarray failed to import

### Set the random seed

In [27]:
## DO NOT MODIFY THIS CELL
np.random.seed(42)

### Load the data

In [3]:
## DO NOT MODIFY THIS CELL
df = pd.read_csv('parkinsons.csv')
labels = df["status"].values
features = df.drop(columns=["status", "name"]).values
features.shape, labels.shape

((195, 22), (195,))

### Create an instance of the StratifiedKFold class with 5 folds

In [4]:
## DO NOT MODIFY THIS CELL
skf = StratifiedKFold(n_splits=5)

NameError: name 'StratifiedKFold' is not defined

### Task 1 
**Evaluate a logistic regression model on this dataset using 5-fold stratified cross-validation.**
- You should use the model object created in the above cell
- Do not regularize the classifier!
- Use a value for ```max_iter``` of 10000 and disregard any convergence warnings
- For each of the five folds, compute the area under the ROC curve and the average precision, storing each of them in a list
- Report the average area under the ROC curve and the average average precision across the five folds

In [5]:
rocs = []
prcs = []

## Add code for cross-validation here
X = features
y = labels

for train_index, test_index in skf.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    
    model = LogisticRegression(penalty=None, solver='saga', max_iter = 10000)
    model.fit(X_train, y_train)

    y_pred_probs = model.predict_proba(X_test)[:,1]

    rocs.append(roc_auc_score(y_test, y_pred_probs))
    prcs.append(average_precision_score(y_test, y_pred_probs))

## Report AUROC and average precision here
AUROC = np.mean(rocs)
avg_prc = np.mean(prcs)

print( "Average AUROC: ", AUROC)
print("Average precision across 5 folds: ", avg_prc)

NameError: name 'skf' is not defined

### Task 2
**Repeat Task 1, but this time adding L2 to the logistic regression model.**
- You must evaluate the following values for the hyperparameter ```C```
    - C: 0.01, 0.1, 1, 10, 100, 1000, 10000
- For each hyperparameter value, compute the average area under the ROC curve and the average average precision across the five folds, and store them in a list or numpy array
- Use a sufficiently large value for the ```max_iter``` parameter of the LogisticRegression class to avoid convergence warnings.
- Report the highest value of the average area under the ROC curve and the average average precision.
- Also report the hyperparameters that yield the best average area under the ROC curve and the average average precision.
- Create a plot with L2 hyperparameter on the x-axis, and average ROC and average precision on the y-axis (overlaid) -- use a logarithmic scale for the x-axis.


In [10]:
cc = [0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000]
rocs = np.zeros(shape = (len(cc), 5))
prcs = np.zeros(shape = (len(cc), 5))

## Add code for cross-validation here
for i, C in enumerate(cc):
    for fold, (train_index, test_index) in enumerate(skf.split(X, y)):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]

        model = LogisticRegression(penalty='l2', solver='saga', C=C, max_iter=10000)
        model.fit(X_train, y_train)
        y_pred_probs = model.predict_proba(X_test)[:, 1]

        rocs[i, fold] = roc_auc_score(y_test, y_pred_probs)
        prcs[i, fold] = average_precision_score(y_test, y_pred_probs)


NameError: name 'skf' is not defined

In [9]:
## Compute the average auroc and average precision for each value of cc
AUROC_avg = np.mean(rocs, axis=1)
avg_prc = np.mean(prcs, axis=1)

## Report the highest values of the average auroc and average precision here
best_AUROC = np.max(AUROC_avg)
best_prc = np.max(avg_prc)
best_c_auroc = cc[np.argmax(AUROC_avg)]
best_c_prc = cc[np.argmax(avg_prc)]

print("Best AUROC with c value ", best_AUROC, best_c_auroc)
print("Best Precision with c value ", best_prc, best_c_prc)

Best AUROC with c value  0.0 0.001
Best Precision with c value  0.0 0.001


In [11]:
## Create the plot here
plt.figure(figsize=(10, 6))
plt.plot(cc, np.mean(rocs, axis=1), label='Average ROC AUC')
plt.plot(cc, avg_prc, label='Average Precision')
plt.xscale('log')
plt.xlabel('L2 Hyperparameter (C)')
plt.ylabel('Score')
plt.title('L2 Regularization: ROC AUC and Average Precision')
plt.legend()
plt.show()

NameError: name 'plt' is not defined