**Instructions:**

- For questions that require coding, you need to write the relevant code and display its output. Your output should either be the direct answer to the question or clearly display the answer in it.
- For questions that require a written answer (sometimes along with the code), you need to put your answer in a Markdown cell. Writing the answer as a comment or as a print line is not acceptable.
- You need to render this file as HTML using Quarto and submit the HTML file. **Please note that this is a requirement and not optional.** A submission cannot be graded until it is properly rendered.

Import all the libraries and tools you need below.

In [52]:
# Import all required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import time

from sklearn.linear_model import LinearRegression, LogisticRegression, Ridge, Lasso
from sklearn.preprocessing import StandardScaler, RobustScaler, MinMaxScaler
from sklearn.model_selection import train_test_split, GridSearchCV, KFold, StratifiedKFold, cross_val_predict, cross_val_score
from sklearn.svm import SVC, LinearSVC, SVR, LinearSVR
from sklearn.metrics import accuracy_score, recall_score, precision_score, mean_absolute_error, mean_squared_error, confusion_matrix

## 1)

In this assignment, you will use the **creditcard.csv** file. Each observation is a credit card transaction. Most of the variables are created by a dimensionality reduction method called Principal Component Analysis (PCA), which will be covered in Week 5 or 6. The `Class` values are stored in the last variable and represent whether the transaction is a regular (0) or fraudulent (1) transaction.

### a)

Read the data. Print the number of Class 0 and Class 1 observations. You should see that there is a class imbalance. **(5 points)**

In [57]:
# Read the data
credit_card = pd.read_csv("creditcard.csv")
# Check the data
credit_card.head()
# Check class imbalance
print(f"Class 0: {credit_card['Class'].value_counts()[0]}.")
print(f"Class 1: {credit_card['Class'].value_counts()[1]}")

# There is a severe class imbalance

Class 0: 284315.
Class 1: 492
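
In the output above, fraud accounts for 492 of 284,807 transactions, roughly 0.17%. As a minimal sketch of how the imbalance could be quantified in one step, the snippet below uses a small made-up label column (`toy_class` is hypothetical and stands in for `credit_card['Class']`):

```python
import pandas as pd

# Hypothetical toy labels standing in for credit_card['Class']
toy_class = pd.Series([0] * 95 + [1] * 5, name="Class")

counts = toy_class.value_counts()                 # absolute count per class
shares = toy_class.value_counts(normalize=True)   # relative frequency per class

print(counts)
print(shares)

# Ratio of majority to minority observations
imbalance_ratio = counts.loc[0] / counts.loc[1]
print(f"Majority-to-minority ratio: {imbalance_ratio:.1f}")
```

Using `normalize=True` reports the class shares directly, which tends to be easier to read than raw counts when the classes differ by several orders of magnitude.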


### b)

There are different methods to handle the class imbalance in the dataset at hand. In this assignment, you will undersample the majority class.

There are built-in functions for undersampling in specialized libraries, such as [imblearn](https://imbalanced-learn.org/stable/), which is not installed in Anaconda by default. To make this step reproducible, you will use a more low-level approach with pandas:

- Separate the regular (0) and fraudulent (1) observations into two different DataFrames by filtering.
- Sample 1000 observations from the DataFrame with the majority class. Use `.sample` method with `random_state=2`.
- Concatenate the undersampled majority class DataFrame and the minority class DataFrame.

**(10 points)**

In [62]:
# Apply undersampling

# Separate the dataframe by applying filtering
credit_card_0 = credit_card.loc[credit_card['Class'] == 0]
credit_card_1 = credit_card.loc[credit_card['Class'] == 1]
print(credit_card_0.shape)
print(credit_card_1.shape)

# Sample 1000 observations from dataframe with majority of class (Class = 0)
credit_card_0_sample = credit_card_0.sample(n = 1000, random_state = 2)

# Concatenate the credit card class = 0 with class = 1 
credit_card_combined = pd.concat([credit_card_0_sample, credit_card_1], axis = 0)
credit_card_combined.head()

(284315, 31)
(492, 31)


| | Time | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | ... | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | Amount | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 188671 | 128077.0 | -3.288944 | -4.809848 | -1.231524 | 6.041078 | -1.9484 | 1.772856 | 2.177367 | 0.561664 | -1.859951 | ... | 1.352295 | 0.869342 | 2.760665 | -0.514612 | 0.114622 | 0.531044 | -0.533325 | -0.548096 | 1206.14 | 0 |
| 190546 | 128878.0 | 2.020493 | -1.000658 | -1.039495 | -0.453275 | -0.737034 | -0.43376 | -0.587955 | -0.148476 | -0.491796 | ... | 0.024865 | 0.619987 | -0.041074 | -0.264674 | -0.025601 | 0.910031 | -0.051887 | -0.063195 | 67.95 | 0 |
| 46318 | 42728.0 | 0.883694 | -0.761362 | 0.928801 | 1.389779 | -0.730351 | 1.228938 | -0.771123 | 0.432439 | 1.196648 | ... | -0.036445 | -0.104183 | -0.321204 | -0.812608 | 0.54763 | -0.273658 | 0.055533 | 0.03959 | 150.0 | 0 |
| 267636 | 162855.0 | -0.072377 | 0.7354 | -2.21124 | -2.153156 | 3.556343 | 2.781633 | 1.142571 | 0.602748 | -0.561615 | ... | 0.256899 | 0.796234 | -0.159167 | 0.752505 | -0.263957 | 0.108374 | 0.406915 | 0.263924 | 24.0 | 0 |
| 189610 | 128481.0 | -0.264285 | 0.99004 | -0.643148 | -0.984799 | 0.81384 | 0.033159 | 0.536661 | 0.483326 | -0.368426 | ... | -0.24636 | -0.70831 | -0.024761 | -1.45731 | -0.325381 | 0.213825 | 0.115022 | 0.01099 | 14.28 | 0 |
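
The three undersampling steps (filter, sample, concatenate) can also be wrapped in a small helper. The sketch below is only illustrative: `undersample_majority` and the toy DataFrame are hypothetical names and data, not part of the assignment.

```python
import pandas as pd

def undersample_majority(df, label_col, majority_label, n, seed):
    """Keep n randomly sampled majority rows plus all remaining rows (hypothetical helper)."""
    is_majority = df[label_col] == majority_label
    # Sample without replacement from the majority class only
    majority_sample = df[is_majority].sample(n=n, random_state=seed)
    # Stack the sampled majority rows on top of the untouched minority rows
    return pd.concat([majority_sample, df[~is_majority]], axis=0)

# Toy data: 90 regular (0) and 10 fraudulent (1) rows
toy = pd.DataFrame({"V1": range(100), "Class": [0] * 90 + [1] * 10})
balanced = undersample_majority(toy, "Class", majority_label=0, n=20, seed=2)

print(balanced["Class"].value_counts())
```

Because the original row labels are preserved, the index of the result still identifies which transactions were kept, as in the `head()` output above.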


### c)

Drop the `Amount` and `Time` variables. The rest of the variables (except `Class`) are the predictors.

Create the training and test sets with a 70%-30% split and `random_state=0`. **Stratify the data.**

**Note:** Since the PCA-created predictor values are all in the same order of magnitude, scaling is not necessary.

**(5 points)**

In [63]:
# Drop the amount and time variables
credit_card_combined = credit_card_combined.drop(['Amount', 'Time'], axis = 1)


In [73]:
# Create X and y
X = credit_card_combined.drop(['Class'], axis = 1) # Predictors
y = credit_card_combined['Class'] # Response

# Split data (70-30, stratify = True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0, stratify = y)
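
To see what `stratify` does, the following sketch splits a made-up dataset (`X_toy` and `y_toy` are hypothetical) and compares the Class 1 share in each part. With stratification, both parts should keep approximately the same class proportions as the full data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 30% of them in Class 1
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 70 + [1] * 30)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0, stratify=y_toy
)

# The mean of a 0/1 label vector is the share of Class 1
print("Class 1 share, full data:", y_toy.mean())
print("Class 1 share, train:    ", y_tr.mean())
print("Class 1 share, test:     ", y_te.mean())
```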


## 2)

After preprocessing the data, you will tune and train a **linear** Support Vector Machine (SVM) classifier.

### a)

Create a **linear** SVM classifier. Its training algorithm has some random processes, so keep `random_state=1`. You will need some prediction probabilities, which is not in the original training algorithm of an SVM. To use the extended version of the algorithm, use `probability=True`. **(5 points)**

**Note:** Use `SVC`; you will use `LinearSVC` in the next in-class assignment.

In [37]:
# Create a linear SVM classifier
SVM_model = SVC(kernel= 'linear', random_state= 1, probability= True)
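
As a brief sketch of what `probability=True` enables, the snippet below fits the same kind of classifier on synthetic data (generated with `make_classification`, an assumption made only for illustration) and inspects `predict_proba`. The columns of the returned array follow the order of `classes_`, so column index 1 holds the Class 1 probabilities when the labels are 0 and 1.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic binary data, used only for illustration
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

clf = SVC(kernel="linear", probability=True, random_state=1).fit(X_toy, y_toy)

proba = clf.predict_proba(X_toy[:5])   # one row per observation, one column per class
print(clf.classes_)                    # column order of predict_proba
print(proba.round(3))
```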

### b)

Tune the model with the following specifications:

- For the grid of the **only hyperparameter**, use $10^i$ where i is an integer between -4 and 3 (inclusive).
- Keep your folds **stratified**. In the same object, also `shuffle` the data with `random_state=15`.
- Use both accuracy and recall as your scoring metric, but recall should be the primary (`refit`) metric to find the best hyperparameter.

Print the best hyperparameter value and the cross-validation (CV) recall score. (Both can be returned with the attributes of a `GridSearchCV` object.)

**Note:** This should take 3-4 minutes to run. You can use the given lines to keep track of the elapsed time. (You do not have to and you can delete those lines if you wish.)

**(20 points)**

In [40]:
tic = time.time()

########## YOUR CODE HERE #############

# Create a hyperparameter grid
C_grid = {'C': [10 ** i for i in range(-4,4)]}

# Create a stratified, shuffled K-fold object with StratifiedKFold
cv_settings = StratifiedKFold(n_splits= 5, shuffle= True, random_state= 15)

# Create a CV object
gscv = GridSearchCV(SVM_model, C_grid, cv= cv_settings, scoring= ['accuracy', 'recall'], refit= 'recall', n_jobs= -1)

# Fit the gscv model with data
gscv.fit(X_train, y_train)

# Print the best hyperparams and recall score
print(f"Optimal Hyperparameter: {gscv.best_params_}.")
print(f"Recall Score: {gscv.best_score_}.")



#######################################

toc = time.time()
print('Elapsed Time: ', (toc-tic)/60, 'minutes.')

Optimal Hyperparameter: {'C': 1000}.
Recall Score: 0.9013213981244672.
Elapsed Time:  1.9050789992014567 minutes.
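
With multiple scoring metrics, `best_score_` refers to the `refit` metric only. The sketch below, on a small synthetic dataset and a shortened grid (both are assumptions chosen to keep the run fast), shows how `best_index_` can be used to look up the other metric for the same hyperparameter value.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

# Small synthetic, mildly imbalanced dataset (illustration only)
X_toy, y_toy = make_classification(
    n_samples=150, n_features=5, weights=[0.7, 0.3], random_state=0
)

grid = {"C": [10.0 ** i for i in range(-2, 2)]}   # shortened grid: 0.01 ... 10
folds = StratifiedKFold(n_splits=3, shuffle=True, random_state=15)

search = GridSearchCV(
    SVC(kernel="linear", random_state=1),
    grid,
    cv=folds,
    scoring=["accuracy", "recall"],
    refit="recall",          # recall decides which C is "best"
)
search.fit(X_toy, y_toy)

best = search.best_index_    # row of cv_results_ that won on recall
print(search.best_params_)
print("CV recall:  ", search.cv_results_["mean_test_recall"][best])
print("CV accuracy:", search.cv_results_["mean_test_accuracy"][best])
```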


### c)

Print the `cv_results_` as a DataFrame to get more insight about the cross-validation process. Print only three columns: (1) the hyperparameter values, (2) the average cross-validation accuracy and (3) the average cross-validation recall.

- Did the model sacrifice some accuracy to maximize the recall or is the accuracy increasing with the recall?
- What probability threshold do these accuracy and recall values correspond to?

**Note:** The column names can be a bit misleading - there are not any test results here, only average cross-validation results.

**(10 points)**

In [50]:
# Print CV results
cv_result = pd.DataFrame(gscv.cv_results_)
cv_result.loc[:,['param_C', 'mean_test_accuracy', 'mean_test_recall']]

| | param_C | mean_test_accuracy | mean_test_recall |
|---|---|---|---|
| 0 | 0.0001 | 0.926265 | 0.776343 |
| 1 | 0.001 | 0.939676 | 0.81705 |
| 2 | 0.01 | 0.948293 | 0.857715 |
| 3 | 0.1 | 0.947336 | 0.866411 |
| 4 | 1.0 | 0.951173 | 0.880946 |
| 5 | 10.0 | 0.954053 | 0.895524 |
| 6 | 100.0 | 0.95501 | 0.895524 |
| 7 | 1000.0 | 0.956924 | 0.901321 |


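The model does not sacrifice accuracy to maximize recall. Accuracy rises together with recall as `C` grows (from about 0.926 to 0.957), apart from a negligible dip at `C = 0.1`.

These accuracy and recall values correspond to the default decision rule of the linear SVC model, which is equivalent to a probability threshold of about 0.5. (Strictly speaking, `predict` thresholds the decision function at 0, and the calibrated probabilities from `predict_proba` may not match that rule exactly.)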
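
The default decision rule can be checked empirically. The following sketch uses synthetic data (an assumption for illustration) to compare `predict` with a zero cutoff on `decision_function` and with a 0.5 cutoff on `predict_proba`. The first comparison is expected to match exactly; the second is printed rather than asserted because the calibrated probabilities can disagree with `predict` for points near the boundary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic binary data with labels 0 and 1 (illustration only)
X_toy, y_toy = make_classification(n_samples=300, n_features=5, random_state=0)
clf = SVC(kernel="linear", probability=True, random_state=1).fit(X_toy, y_toy)

hard_labels = clf.predict(X_toy)
# Positive decision values map to classes_[1], which is Class 1 here
from_margin = (clf.decision_function(X_toy) > 0).astype(int)
# Thresholding the calibrated Class 1 probability at 0.5
from_proba = (clf.predict_proba(X_toy)[:, 1] >= 0.5).astype(int)

print("predict == (decision_function > 0):", np.array_equal(hard_labels, from_margin))
print("share agreeing with proba >= 0.5:  ", np.mean(hard_labels == from_proba))
```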

## 3)

For a classification model, tuning the model hyperparameters is not enough. The decision threshold needs to be tuned as well, especially for tasks like fraud detection.

### a)

- For the classification task in this assignment, are False Negatives and False Positives equally important?
- If not, which one is more important to avoid? Why?
- Which classification metric needs to be maximized (ideally)?

**(5 points)**

They are not equally important. In fraud detection, a false negative means the model predicts a regular transaction when fraud was actually committed. A false positive, on the other hand, means the model flags a transaction as fraudulent when it is actually regular. A false negative lets a fraudulent transaction through and can lead to serious consequences such as financial losses or system failure, whereas a false positive is an inconvenience (a regular transaction is wrongly flagged) with far less impact. Therefore, false negatives are more important to avoid.

The classification metric we need to maximize is **recall**, since recall = TP / (TP + FN) increases as the number of false negatives decreases. Perfect recall means there are no false negatives.
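
As a small worked example of the relationship between recall and false negatives, the sketch below computes recall by hand from a confusion matrix and compares it with `recall_score`. The label vectors are made up for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical labels: 4 fraudulent (1) and 6 regular (0) transactions
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# For binary labels [0, 1], ravel() returns tn, fp, fn, tp in this order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

recall = tp / (tp + fn)   # share of actual frauds that were caught
print(f"TP={tp}, FN={fn}, FP={fp}, TN={tn}")
print("Recall by hand:      ", recall)
print("Recall from sklearn: ", recall_score(y_true, y_pred))
```

Here one fraud is missed (FN = 1), so recall is 3 / 4 = 0.75; the single false positive affects accuracy and precision but not recall.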

### b)

In order to tune the decision threshold, you need to find the cross-validation accuracy and recall for all possible threshold values (with a reasonable granularity).

- Start by creating an empty DataFrame with three columns: `thr`, `acc`, and `rec`. You will use this DataFrame to store your results.
- Initialize a `counter` variable at 0. You will use this to index the DataFrame to store your results.
- Using the `best_estimator_` of the cross-validation in Question 2b, obtain the cross-validation Class 1 probabilities of all observations. You need the `cross_val_predict` function for this; please consider checking the posted notes if you are not familiar with its usage. You also need to keep the same `cv` input as in Question 2, so the results are consistent.
- For each threshold value between 0 and 1, with a stepsize of 0.01, calculate the accuracy and recall score when the Class 1 probabilities are converted to class values with the threshold value.
- Store the accuracy and recall values of all thresholds in your DataFrame.

**Note:** This should take 3-4 minutes to run. You can use the given lines to keep track of the elapsed time. (You do not have to and you can delete those lines if you wish.)

**(20 points)**

In [74]:
tic = time.time()

########## YOUR CODE HERE #############

# Create empty data frame
df = pd.DataFrame(columns= ['thr', 'acc', 'rec'])

# Initialize counter variable
counter = 0

# Get the best estimator from the model, find cross val predict prob (Class 1, column 2)
probability_class1 = cross_val_predict(gscv.best_estimator_, X_train, y_train, cv = cv_settings, method= 'predict_proba')[:,1]

# Set the possible threshold
thrs = np.arange(0,1.01,0.01)

# Find accuracy and recall of each threshold, store it in dataframe
for thr in thrs:
    y_pred = (probability_class1 >= thr).astype(int)
    df.loc[counter,'thr'] = thr # Set threshold to thr
    df.loc[counter,'acc'] = accuracy_score(y_train, y_pred)
    df.loc[counter,'rec'] = recall_score(y_train, y_pred)

    counter += 1

#######################################

toc = time.time()
print('Elapsed Time: ', (toc-tic)/60, 'minutes.')

Elapsed Time:  2.2208179871241254 minutes.
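
The threshold loop can also be written without an explicit counter by using NumPy broadcasting. The sketch below is an alternative formulation, not the approach the assignment asks for, and it runs on made-up probabilities (`proba` and `y_true` are hypothetical stand-ins for `probability_class1` and `y_train`).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=500)
# Hypothetical Class 1 scores: positives tend to score higher
proba = np.clip(0.35 * y_true + rng.uniform(0, 0.65, size=500), 0, 1)

# linspace includes the endpoint 1.0 by construction
thrs = np.linspace(0, 1, 101)

# Row i holds the predictions for threshold thrs[i]; shape (101, 500)
pred = proba[None, :] >= thrs[:, None]

acc = (pred == (y_true == 1)).mean(axis=1)   # share of correct predictions
rec = pred[:, y_true == 1].mean(axis=1)      # share of actual positives predicted positive

sweep = pd.DataFrame({"thr": thrs, "acc": acc, "rec": rec})
print(sweep.head())
```

Building the DataFrame from arrays also gives float columns directly, so no later type conversion is needed. Recall can only stay flat or fall as the threshold rises, which is a useful sanity check on the results.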


### c)

Print all the rows with threshold values that return a perfect recall. Is there a threshold value that returns a reasonable accuracy with the perfect recall? (The word "reasonable" sounds subjective, but you should see a clear difference from the accuracy results in Question 2.)

**(5 points)**

In [89]:
# Print all the rows that return perfect recall

# Convert everything to numeric first (the empty DataFrame was created with object-dtype columns)
df = df.astype(float)

df.loc[df['rec'] == 1.00]


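| | thr | acc | rec |
|---|---|---|---|
| 0 | 0.0 | 0.329502 | 1.0 |
| 1 | 0.01 | 0.329502 | 1.0 |
| 2 | 0.02 | 0.329502 | 1.0 |
| 3 | 0.03 | 0.329502 | 1.0 |
| 4 | 0.04 | 0.329502 | 1.0 |
| 5 | 0.05 | 0.329502 | 1.0 |
| 6 | 0.06 | 0.329502 | 1.0 |
| 7 | 0.07 | 0.331418 | 1.0 |
| 8 | 0.08 | 0.334291 | 1.0 |
| 9 | 0.09 | 0.39751 | 1.0 |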


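There are ten threshold values (0.00, 0.01, 0.02, ..., 0.09) that return a perfect recall of 1.0. However, none of them gives a reasonable accuracy: the accuracy stays between about 0.33 and 0.40, far below the roughly 0.95 obtained in Question 2.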

### d)

Find the threshold value that returns the highest recall while the accuracy is above 80%. This will be your tuned threshold. **(5 points)**

In [84]:
# Keep only the thresholds with accuracy above 80%
df_acc = df.loc[df['acc'] > 0.8]

# idxmax returns the index label of the highest recall, which .loc expects
# (argmax would return a position instead)
df_acc.loc[df_acc.rec.idxmax(), :]


thr    0.100000
acc    0.809387
rec    0.869186
Name: 10, dtype: float64
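
The selection rule (highest recall subject to a minimum accuracy) can be expressed as a small function. This is a sketch under the assumption that the results table has `thr`, `acc`, and `rec` columns; `pick_threshold` and the toy table are hypothetical.

```python
import pandas as pd

def pick_threshold(results, min_acc):
    """Return the row with the highest recall among rows with acc > min_acc (hypothetical helper)."""
    eligible = results[results["acc"] > min_acc]
    if eligible.empty:
        return None   # no threshold satisfies the accuracy constraint
    # idxmax gives the index label of the first row with the maximum recall
    return eligible.loc[eligible["rec"].idxmax()]

# Made-up results table for illustration
toy = pd.DataFrame({
    "thr": [0.1, 0.2, 0.3, 0.4],
    "acc": [0.60, 0.82, 0.90, 0.93],
    "rec": [1.00, 0.95, 0.88, 0.70],
})

chosen = pick_threshold(toy, min_acc=0.8)
print(chosen)
```

In the toy table, the threshold 0.1 has the best recall but fails the accuracy constraint, so the function returns the row for 0.2.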

### e)

Using the tuned model from Question 2 and the tuned threshold from Part d in this question, find the test accuracy and recall. Do you observe any overfitting? (This conclusion might be up for debate, so just write what you observe and how you interpret it.)

**(10 points)**

In [86]:
# Get the best model and tuned threshold
best_model = gscv.best_estimator_
threshold = 0.1

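# Refit the best model on the entire training set
# (GridSearchCV already did this because refit is set, so this step is redundant but harmless)
best_model.fit(X_train, y_train)

# Predict and evaluate with test data and the best threshold
# (use >= to match the thresholding rule in Part b)
y_pred = (best_model.predict_proba(X_test)[:,1] >= threshold).astype(int)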
print(f"Accuracy Score: {accuracy_score(y_test, y_pred)}")
print(f"Recall Score: {recall_score(y_test, y_pred)}")

Accuracy Score: 0.9486607142857143
Recall Score: 0.8445945945945946


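At the tuned threshold, the test accuracy (0.949) is well above the cross-validation accuracy (0.809), while the test recall (0.845) is slightly below the cross-validation recall (0.869). The recall gap is only about 2.5 percentage points, so I do not see clear evidence of overfitting.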