## Homework 3: Imbalanced Datasets

### Submission Instructions:

1. Submit a PDF File on GradeScope:

- Please prepare your solutions neatly and compile them into a single PDF file.
- Submit this PDF file on GradeScope before the specified deadline.
- Ensure that your submission is clearly labeled with your UNI ID.
- Ensure that your solutions are entirely original and free from any form of plagiarism.


2. Submit a .ipynb File + PDF File on Courseworks:

- Alongside the PDF submission on GradeScope, also submit your Notebook (.ipynb) file and its corresponding PDF version on the Courseworks platform.
- The Notebook should contain your code, explanations, and any additional details necessary for understanding your solutions.

Please name your solution file in the following format: AML_HW3_Solutions_UNI

Dataset Location - The dataset you will be using for this assignment is called 'onlinefraud.csv'. You can find it in the Courseworks 'Files' section under the 'datasets' folder.

### GIST:
The goal of this assignment is to build a model that can reliably classify online payments into two categories: fraudulent and non-fraudulent. You will notice that, without much effort, you can build a model that gives you a very high ‘accuracy’ score for the given dataset. However, this metric is misleading, since such a model can score well while failing to correctly classify instances of the minority class (‘1’ in this case). This is a consequence of the inherent imbalance in the target class of the dataset.

To solve this issue, you will need to employ ML techniques that are designed to counter class imbalance. Hence, the focus of this assignment is on addressing class imbalance and on evaluating the model with metrics other than accuracy alone.

## Name:  Apurva Patel

## UNI: amp2365

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

#Import below any other package you need for your solution

In [4]:
frauddf=pd.read_csv('/content/drive/MyDrive/AML_3/onlinefraud.csv')

In [5]:
frauddf.head()

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,...,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
0,1,PAYMENT,9839.64,C1231006815,170136.0,...,M1979787155,0.0,0.0,0,0
1,1,PAYMENT,1864.28,C1666544295,21249.0,...,M2044282225,0.0,0.0,0,0
2,1,TRANSFER,181.0,C1305486145,181.0,...,C553264065,0.0,0.0,1,0
3,1,CASH_OUT,181.0,C840083671,181.0,...,C38997010,21182.0,0.0,1,0
4,1,PAYMENT,11668.14,C2048537720,41554.0,...,M1230701703,0.0,0.0,0,0


In [8]:
frauddf.columns

Index(['step', 'type', 'amount', 'nameOrig', 'oldbalanceOrg', 'newbalanceOrig',
       'nameDest', 'oldbalanceDest', 'newbalanceDest', 'isFraud',
       'isFlaggedFraud'],
      dtype='object')

In [9]:
frauddf['isFraud'].value_counts()

0    6354407
1       8213
Name: isFraud, dtype: int64

In [22]:
print(f'Percentage of minority class: {(8213 / (8213 + 6354407)) * 100:.5f}%')

Percentage of minority class: 0.12908%


### **Model Calibration** (USE MODEL AND DATA FROM ASSIGNMENT 2)

In [None]:
#Please import the dataset for assignment 2 and implement the HistGradientBoosting Model again.
#(The one you trained in part 3.1 of Assignment 2)

**Estimate the Brier score for the HistGradientBoosting model (trained with the optimal hyperparameters from Q3.1 in Assignment 2), scored on the test dataset.**

In [None]:
# Your Code Here
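
As a reminder of what is being estimated: the Brier score is the mean squared difference between the predicted probability of the positive class and the 0/1 label, so lower is better. The sketch below computes it in plain Python. The function name `brier_score` and the toy values are illustrative assumptions, not part of the assignment; for binary labels, scikit-learn's `sklearn.metrics.brier_score_loss` computes the same quantity.

```python
def brier_score(y_true, y_prob):
    """Mean squared error between 0/1 labels and predicted P(y = 1)."""
    if len(y_true) != len(y_prob):
        raise ValueError("y_true and y_prob must have the same length")
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


# Toy values for illustration only (not taken from the assignment data).
toy_labels = [0, 1, 1, 0]
toy_probs = [0.1, 0.9, 0.8, 0.3]

toy_brier = brier_score(toy_labels, toy_probs)
print(f"Brier score: {toy_brier:.4f}")
```

A perfectly confident and correct model scores 0, while always predicting 0.5 scores 0.25.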

**Calibrate the trained HistGradientBoosting model using Platt Scaling. Print the Brier score after calibration and plot predicted vs. actual probabilities on the test dataset for the calibrated model.**

In [None]:
#Your Code Here

**Compare the Brier scores from the previous two cells. Does calibration yield better predicted probabilities?**

In [None]:
#Your Code Here

### **Data Exploration & Cleaning**

- The dataset has been downloaded from Kaggle. You are encouraged to check this [link](https://www.kaggle.com/datasets/jainilcoder/online-payment-fraud-detection) to learn more about the dataset you are going to work with.<br> <br>

- _OPTIONAL_ : By now, you should be comfortable with data cleaning. Employ all necessary techniques you feel would help improve your dataset. This includes handling missing values, outliers, datatype discrepancies, etc. Other 'preprocessing' techniques have been included later in the assignment. This part is just about cleaning your dataset (data-munging) and will not be graded.

In [None]:
#import the dataset

In [None]:
#Your code here

### **1. Examining Class Imbalance.**

a. Identify the correct target column. A single-line comment for the answer is sufficient.<br>
b. Examine the class imbalance in the target column. What is its class distribution? Show this information visually using an appropriate scale.<br>
c. What is the degree of imbalance? (Mild/Moderate/Extreme)

In [None]:
#Your code here

In [None]:
#Your code here

In [None]:
#Your code here
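
The degree of imbalance in part (c) can be read off the minority-class share. The sketch below uses one commonly cited rule of thumb (minority share of 20-40% is mild, 1-20% is moderate, below 1% is extreme); these thresholds and both function names are assumptions for illustration, not something the assignment defines. The counts are the ones shown in the `value_counts()` output near the top of this notebook.

```python
def minority_share(class_counts):
    """Return the minority class share of all rows, as a percentage."""
    total = sum(class_counts.values())
    return 100.0 * min(class_counts.values()) / total


def degree_of_imbalance(minority_pct):
    """Bucket the minority share using the assumed rule-of-thumb thresholds."""
    if minority_pct < 1:
        return "Extreme"
    if minority_pct < 20:
        return "Moderate"
    if minority_pct <= 40:
        return "Mild"
    return "Roughly balanced"


# Counts copied from the isFraud value_counts() output in this notebook.
counts = {0: 6354407, 1: 8213}
pct = minority_share(counts)
print(f"{pct:.5f}% minority -> {degree_of_imbalance(pct)}")
```

With counts this lopsided, a linear-scale bar chart makes the minority bar almost invisible, which is why a logarithmic y-axis is a natural choice for part (b).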

### **2. Pre-processing**

a. Encode categorical columns and scale numerical columns. Drop irrelevant features (if any).<br>
b. How did you decide which features to drop? Since there are only 10 features (other than the target column), should we consider including them all?<br>
c. Split the dataset into development and test sets. What splitting methodology did you choose, and why?<br>
d. Print the shape of the development and test sets.

In [None]:
#Your code here

In [None]:
#Your code here

In [None]:
#Your code here

In [None]:
#Your code here
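
A minimal sketch of one way to approach parts (a), (c) and (d) follows. It assumes a tiny synthetic frame that only reuses a few column names from `onlinefraud.csv` (`type`, `amount`, `oldbalanceOrg`, `isFraud`); the values are random and the choice of columns is illustrative, not a recommendation of which features to keep. The split is stratified so both sets keep the same fraud rate, and the encoder and scaler are fitted on the development set only to avoid leaking test-set statistics.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny synthetic frame (assumption): random values, real column names.
rng = np.random.default_rng(42)
n_rows = 200
toy = pd.DataFrame({
    "type": rng.choice(["PAYMENT", "TRANSFER", "CASH_OUT"], size=n_rows),
    "amount": rng.uniform(10, 10_000, size=n_rows),
    "oldbalanceOrg": rng.uniform(0, 50_000, size=n_rows),
    "isFraud": np.r_[np.ones(20, dtype=int), np.zeros(n_rows - 20, dtype=int)],
})

X = toy.drop(columns="isFraud")
y = toy["isFraud"]

# Stratify on the target so both splits keep the same fraud rate.
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

preprocess = ColumnTransformer(
    [
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["type"]),
        ("num", StandardScaler(), ["amount", "oldbalanceOrg"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Fit on the development set only, then reuse the fitted transformer on test.
X_dev_t = preprocess.fit_transform(X_dev)
X_test_t = preprocess.transform(X_test)
print("Development:", X_dev_t.shape, "Test:", X_test_t.shape)
```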

### 3.1 Default Dataset
Use the Decision Tree classifier (with max_depth=10 and random_state=42) and print the AUC and Average Precision values from 5-fold cross-validation.

In [None]:
#Your Code Here
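
The evaluation pattern asked for here can be sketched as below. The data is a synthetic imbalanced stand-in from `make_classification` (an assumption; substitute the preprocessed development set), and the use of a shuffled `StratifiedKFold` is one reasonable choice for keeping the minority class present in every fold, not a requirement stated by the assignment.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for the development set (assumption).
X_dev, y_dev = make_classification(
    n_samples=5000, n_features=10, weights=[0.95, 0.05], random_state=42
)

tree = DecisionTreeClassifier(max_depth=10, random_state=42)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Both metrics are computed from predicted scores within each fold.
scores = cross_validate(
    tree, X_dev, y_dev, cv=cv, scoring=["roc_auc", "average_precision"]
)

print("AUC per fold:", scores["test_roc_auc"].round(3))
print("AP per fold: ", scores["test_average_precision"].round(3))
print("Mean AUC:", scores["test_roc_auc"].mean())
print("Mean AP: ", scores["test_average_precision"].mean())
```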

### 3.2 Balanced Weight

a. Here, we are going to use a 'balanced' decision tree classifier on the same dataset. Use max_depth=10 and random_state=42, and then print the AUC and Average Precision values from 5-fold cross-validation.

In [None]:
#Your Code Here
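
To make 'balanced' concrete: in scikit-learn, `class_weight="balanced"` weights each class by `n_samples / (n_classes * np.bincount(y))`, so the rarer class receives a proportionally larger weight. The sketch below checks this on toy labels (illustrative only, not the assignment data); the same setting is passed to the model as `DecisionTreeClassifier(class_weight="balanced", ...)`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels: 90 negatives and 10 positives (illustrative only).
y_toy = np.array([0] * 90 + [1] * 10)

classes = np.unique(y_toy)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)

# The same numbers by hand: n_samples / (n_classes * count_per_class).
by_hand = len(y_toy) / (len(classes) * np.bincount(y_toy))

print(dict(zip(classes.tolist(), weights.round(4).tolist())))
```

Here each positive example counts nine times as much as a negative one, which offsets the 9:1 class ratio without changing the data itself.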

### 3.3 Random Oversampling

a. Perform random oversampling on the development dataset. (Please set random state to 42 while doing this).
Examine the target column again. What is its class distribution now? Print the shape of the development set.<br>

b. Repeat part 3.1. Use the Decision Tree classifier (with max_depth=10 and random_state=42) and print the AUC and Average Precision values from 5-fold cross-validation.

In [None]:
#Your Code Here

In [None]:
#Your Code Here
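
The idea behind random oversampling can be sketched with `sklearn.utils.resample`: keep every original row and add randomly drawn duplicates of minority rows until the classes are equal in size. The helper name `random_oversample` and the toy arrays are assumptions for illustration. This is not the imbalanced-learn implementation; if that package is installed, `imblearn.over_sampling.RandomOverSampler(random_state=42).fit_resample(X, y)` serves the same purpose.

```python
import numpy as np
from sklearn.utils import resample


def random_oversample(X, y, random_state=42):
    """Add duplicated minority rows (drawn with replacement) until balanced."""
    classes, counts = np.unique(y, return_counts=True)
    majority_count = counts.max()
    X_parts, y_parts = [], []
    for cls, count in zip(classes, counts):
        X_cls = X[y == cls]
        if count < majority_count:
            # Draw only the missing rows, so all originals are kept.
            extra = resample(
                X_cls,
                replace=True,
                n_samples=majority_count - count,
                random_state=random_state,
            )
            X_cls = np.vstack([X_cls, extra])
        X_parts.append(X_cls)
        y_parts.append(np.full(len(X_cls), cls))
    return np.vstack(X_parts), np.concatenate(y_parts)


# Toy development set: 16 negatives and 4 positives (illustrative only).
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0] * 16 + [1] * 4)

X_over, y_over = random_oversample(X_toy, y_toy)
print("Class counts:", np.bincount(y_over), "Shape:", X_over.shape)
```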

### 3.4 Random Undersampling

a. Perform random undersampling on the development dataset. (Please set random state to 42 while doing this).
Examine the target column again. What is its class distribution now? Print the shape of the development set.<br>

b. Repeat part 3.1. Use the Decision Tree classifier (with max_depth=10 and random_state=42) and print the AUC and Average Precision values from 5-fold cross-validation.

In [None]:
#Your Code Here

In [None]:
#Your Code Here
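
Random undersampling is the mirror image: rows of the larger class are dropped at random until every class has as many rows as the smallest one. The NumPy sketch below illustrates the idea; the helper name `random_undersample` and the toy arrays are assumptions, and this is not the imbalanced-learn implementation (there, `imblearn.under_sampling.RandomUnderSampler(random_state=42).fit_resample(X, y)` plays this role).

```python
import numpy as np


def random_undersample(X, y, random_state=42):
    """Keep a random subset of each class, sized to the smallest class."""
    rng = np.random.default_rng(random_state)
    classes, counts = np.unique(y, return_counts=True)
    minority_count = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == cls), size=minority_count, replace=False)
        for cls in classes
    ])
    keep.sort()  # preserve the original row order
    return X[keep], y[keep]


# Toy development set: 16 negatives and 4 positives (illustrative only).
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0] * 16 + [1] * 4)

X_under, y_under = random_undersample(X_toy, y_toy)
print("Class counts:", np.bincount(y_under), "Shape:", X_under.shape)
```

On a dataset as imbalanced as this one, undersampling discards the vast majority of the non-fraud rows, which is worth keeping in mind when comparing results.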

### 3.5 SMOTE

a. Perform Synthetic Minority Oversampling Technique (SMOTE) on the development dataset. (Please set random state to 42 while doing this). Examine the target column again. What is its class distribution now? Print the shape of the development set.<br>

b. Repeat part 3.1. Use the Decision Tree classifier (with max_depth=10 and random_state=42) and print the AUC and Average Precision values from 5-fold cross-validation.

In [None]:
#Your Code Here

In [None]:
#Your Code Here
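
Unlike random oversampling, SMOTE creates new minority rows by interpolating between a minority point and one of its nearest minority neighbours. The sketch below is a simplified illustration of that interpolation step only; the function name `smote_like` and the toy data are assumptions, and it is not the imbalanced-learn implementation. In practice, assuming that package is installed, `imblearn.over_sampling.SMOTE(random_state=42).fit_resample(X, y)` is the usual route.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def smote_like(X_min, n_new, k_neighbors=5, random_state=42):
    """Create n_new synthetic rows on segments between minority neighbours."""
    if len(X_min) < 2:
        raise ValueError("need at least two minority rows to interpolate")
    rng = np.random.default_rng(random_state)
    k = min(k_neighbors, len(X_min) - 1)
    # Ask for k + 1 neighbours because each point is its own nearest neighbour.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    neighbor_idx = nn.kneighbors(X_min, return_distance=False)[:, 1:]

    base = rng.integers(0, len(X_min), size=n_new)  # seed row for each new point
    picked = neighbor_idx[base, rng.integers(0, k, size=n_new)]  # one neighbour each
    gap = rng.random((n_new, 1))  # position along the segment, in [0, 1)
    return X_min[base] + gap * (X_min[picked] - X_min[base])


# Toy minority class: 10 points in 2-D (illustrative only).
X_minority = np.random.default_rng(0).normal(size=(10, 2))
X_synthetic = smote_like(X_minority, n_new=30)
print("Synthetic rows:", X_synthetic.shape)
```

Because every synthetic row lies on a segment between two real minority rows, the new points stay inside the region the minority class already occupies.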

### 3.6 Visual Comparison

Prepare a plot comparing the class distribution of the target column for each of the imbalance techniques used above. Include the default class split as well.

In [None]:
#Your Code Here
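
One possible layout for this comparison is a grouped bar chart with a logarithmic y-axis, so that the small minority bar stays visible next to the majority bar. The counts below are placeholders for illustration only (assumption); substitute the class counts actually observed for each development set.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder counts for illustration only; replace with the real class
# counts of each development set from sections 3.1 to 3.5.
class_counts = pd.DataFrame(
    {
        "Default": {0: 1000, 1: 10},
        "Random Oversampling": {0: 1000, 1: 1000},
        "Random Undersampling": {0: 10, 1: 10},
        "SMOTE": {0: 1000, 1: 1000},
    }
).T  # rows: technique, columns: class label

# One group of bars per technique, one bar per class, log-scaled counts.
ax = class_counts.plot(kind="bar", logy=True, rot=0)
ax.set_xlabel("Technique")
ax.set_ylabel("Number of rows (log scale)")
ax.set_title("Class distribution of isFraud by technique")
ax.legend(title="isFraud")
plt.tight_layout()
```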

### **4: Model Prediction & Evaluation - AUC Scores**
4.1 Make predictions on the test set using the five models that you built and report their AUC values<br>
(Five models include models from - Default Baseline, Random Undersampling, Random Oversampling, SMOTE & Balanced Weight). Did the models with high AUC scores on the development set exhibit similar performance on the test set? Explain.

In [None]:
#Your Code Here
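
When scoring on the test set, AUC should be computed from predicted probabilities (or scores), not from hard 0/1 predictions. A minimal sketch follows, assuming synthetic data and showing only two of the five models for brevity; the resampled variants would be fitted on their resampled development sets and scored on the same untouched test set.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for the fraud dataset (assumption).
X, y = make_classification(
    n_samples=5000, n_features=10, weights=[0.95, 0.05], random_state=42
)
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

models = {
    "Default": DecisionTreeClassifier(max_depth=10, random_state=42),
    "Balanced Weight": DecisionTreeClassifier(
        max_depth=10, random_state=42, class_weight="balanced"
    ),
}

test_auc = {}
for name, model in models.items():
    model.fit(X_dev, y_dev)
    # Use P(class 1) so the ranking of test rows drives the AUC.
    test_auc[name] = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

for name, value in test_auc.items():
    print(f"{name}: test AUC = {value:.3f}")
```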

### **4: Model Prediction & Evaluation - Confusion Matrix**
4.2a. Plot confusion matrices for all five models on the test set. Comment on your results in detail, considering precision, recall and F1 scores. <br>
4.2b. For the dataset at hand, which evaluation metric do you think matters most?

In [None]:
#Your Code Here

In [None]:
#Your Code Here
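
To connect the confusion matrix to the metrics named above, the sketch below derives precision, recall and F1 from the four cell counts in plain Python. The function names and toy predictions are illustrative assumptions; the `(tn, fp, fn, tp)` ordering matches what `sklearn.metrics.confusion_matrix(y_true, y_pred).ravel()` returns for binary labels.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    return tn, fp, fn, tp


def precision_recall_f1(tn, fp, fn, tp):
    """Metrics for the positive class; 0.0 when a denominator is empty."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy example: 95 legitimate and 5 fraudulent payments (illustrative only).
# The model raises one false alarm and catches 2 of the 5 frauds.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 94 + [1] + [1, 1, 0, 0, 0]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
precision, recall, f1 = precision_recall_f1(tn, fp, fn, tp)
accuracy = (tn + tp) / len(y_true)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

In this toy case accuracy is 0.96 even though only 40% of the frauds are caught, which is the gap between accuracy and recall that this assignment is about.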

### **4: Model Prediction & Evaluation - ROC Curves**

4.3 Plot ROC curves for all five models on the test set in a single plot. Recommend which technique is most appropriate and explain why.

In [None]:
#Your Code Here
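
A sketch of overlaying several ROC curves on one set of axes is given below. It assumes synthetic data and, for brevity, only two of the five models; each remaining model would contribute one more `ax.plot` call using its own test-set probabilities.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for the fraud dataset (assumption).
X, y = make_classification(
    n_samples=5000, n_features=10, weights=[0.95, 0.05], random_state=42
)
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

models = {
    "Default": DecisionTreeClassifier(max_depth=10, random_state=42),
    "Balanced Weight": DecisionTreeClassifier(
        max_depth=10, random_state=42, class_weight="balanced"
    ),
}

fig, ax = plt.subplots(figsize=(6, 6))
curves = {}
for name, model in models.items():
    prob = model.fit(X_dev, y_dev).predict_proba(X_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, prob)
    curves[name] = (fpr, tpr)
    ax.plot(fpr, tpr, label=f"{name} (AUC = {auc(fpr, tpr):.3f})")

# Diagonal reference line: the expected curve of a random classifier.
ax.plot([0, 1], [0, 1], linestyle="--", color="grey", label="Chance")
ax.set_xlabel("False positive rate")
ax.set_ylabel("True positive rate")
ax.set_title("ROC curves on the test set")
ax.legend(loc="lower right")
```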