# Lab 06: Building a Fraud Detector

---
author: Ethan Wee
date: October 13, 2023
embed-resources: true
---

## Introduction and Data

**Goal:** The goal of this lab is to create a **fraud detector** that can be used as part of an automated banking system. It should predict whether or not each transaction is fraudulent.

To do this, you'll need to import the following:

In [6]:
# basics
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Machine Learning
from sklearn.datasets import make_classification, load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV

# Binary Classification
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, make_scorer
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import recall_score, precision_score, f1_score, balanced_accuracy_score

You are free to import additional packages and modules as you see fit, and you will almost certainly need to.

The data for this lab originally comes from Kaggle.

- [Kaggle: Credit Card Fraud Detection](https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud/)

>  This dataset presents transactions that occurred in two days, where we have 492 frauds out of 284,807 transactions. The dataset is highly unbalanced, the positive class (frauds) account for 0.172% of all transactions.

> It contains only numerical input variables which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, we cannot provide the original features and more background information about the data. Features V1, V2, … V28 are the principal components obtained with PCA, the only features which have not been transformed with PCA are 'Time' and 'Amount'. Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. The feature 'Amount' is the transaction Amount, this feature can be used for example-dependant cost-sensitive learning. Feature 'Class' is the response variable and it takes value 1 in case of fraud and 0 otherwise. 

We are providing a slightly modified subset of the data for this lab. Beyond subsetting, we have:

- Removed the `Time` variable as it is misleading.
- Slightly modified the ratio of fraud to not fraud.

Note that PCA is a method that we will learn about later in the course. For now, know that it takes some number of features as inputs and outputs the same number of features or fewer, which retain most of the information in the original features. You can assume things like location and type of purchase were among the original input features. (Ever had a credit card transaction denied while traveling?)
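To make the PCA description concrete, here is a minimal sketch on synthetic data. Everything in it (the array `X`, the choice of six features built from two underlying signals, and `n_components=2`) is an assumption for illustration; it is not how the lab data was actually produced.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)

# Hypothetical data: 500 rows and 6 features that are noisy mixtures
# of only 2 underlying signals.
latent = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing + 0.05 * rng.normal(size=(500, 6))

# Ask PCA for 2 output features (principal components).
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

# Fraction of the original variance kept by the 2 components.
retained = pca.explained_variance_ratio_.sum()

print(X.shape, "->", X_reduced.shape)
print(f"variance retained: {retained:.3f}")
```

Because the six synthetic features are driven by only two signals, two components should retain nearly all of the variance here. Real data will rarely compress this cleanly.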

We present the **train** data both as a complete data frame and as separate `X` and `y` data. The former will be useful for calculating summary statistics. The latter will be useful for model training.

Note that we are *not* providing a **test** dataset. Instead, the test dataset will live within the autograder, and once you submit, you will receive feedback and metrics based on the test data. (Therefore, cross-validation or a validation set will be your friend here.)
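As a rough sketch of what cross-validation without a test set might look like, the example below estimates recall and precision with stratified folds. The synthetic data, the logistic regression model, and the choice of five folds are all assumptions for illustration; in the lab you would pass `X_train` and `y_train` and a model of your choosing.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Hypothetical imbalanced data (about 2% positives) standing in for the lab data.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.98], flip_y=0, random_state=0
)

# Stratified folds keep the fraud ratio roughly equal in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

scores = cross_validate(
    LogisticRegression(max_iter=1000),
    X,
    y,
    cv=cv,
    scoring=["recall", "precision"],
)

mean_recall = scores["test_recall"].mean()
mean_precision = scores["test_precision"].mean()
print(f"recall: {mean_recall:.3f}, precision: {mean_precision:.3f}")
```

Stratification matters with classes this unbalanced: a plain random split could leave a fold with very few fraud cases, which makes recall estimates unstable.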

In [2]:
train = pd.read_csv("https://cs307.org/lab/lab-06/data/credit_train.csv")
train


Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,-16.598665,10.541751,-19.818982,6.017295,-13.025901,-4.128779,-14.118865,11.161144,-4.099551,-9.222826,...,1.725853,-1.151606,-0.680052,0.108176,1.066878,-0.233720,1.707521,0.511423,99.99,1
1,1.140431,1.134243,-1.429455,2.012226,0.622800,-1.152923,0.221159,0.037372,0.034486,-1.879644,...,-0.367136,-0.891627,-0.160578,-0.108326,0.668374,-0.352393,0.071993,0.113684,1.00,1
2,-13.897206,6.344280,-14.281666,5.581009,-12.887133,-3.146176,-15.450467,9.060281,-5.486121,-14.676470,...,3.058082,0.941180,-0.232710,0.763508,0.075456,-0.453840,-1.508968,-0.686836,9.99,1
3,-20.906908,9.843153,-19.947726,6.155789,-15.142013,-2.239566,-21.234463,1.151795,-8.739670,-18.271168,...,-1.977196,0.652932,-0.519777,0.541702,-0.053861,0.112671,-3.765371,-1.071238,1.00,1
4,-4.221221,2.871121,-5.888716,6.890952,-3.404894,-1.154394,-7.739928,2.851363,-2.507569,-5.110728,...,1.620591,1.567947,-0.578007,-0.059045,-1.829169,-0.072429,0.136734,-0.599848,7.59,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17810,-0.312808,1.571520,0.625071,2.851596,1.440896,0.370273,1.559423,-0.122626,-2.127059,0.721825,...,-0.072840,-0.276301,-0.144144,-0.813403,-0.094574,-0.086966,0.101273,0.154888,49.66,0
17811,-0.487943,0.571487,2.170264,1.787858,-1.172119,0.854935,-1.604364,-2.430118,0.901141,-1.000582,...,-1.199242,0.664000,-0.318814,0.499873,1.124744,-0.044556,0.037318,0.199837,109.00,0
17812,-0.538047,1.726031,-1.045152,1.006301,1.187114,-0.414069,0.981697,0.382300,-1.123658,-0.669527,...,0.124935,0.468033,-0.218558,0.584019,0.029165,-0.424886,0.321420,0.204597,15.15,0
17813,-1.873035,-1.374780,1.112630,-2.800630,-0.440281,-0.601116,-0.855678,0.612189,-2.374276,0.487293,...,0.025152,-0.106362,-0.187154,-0.422605,0.699307,-0.124145,0.144883,-0.129892,91.00,0


In [3]:
# create X and y for train data
X_train = train.drop("Class", axis=1)
y_train = train["Class"]


In [33]:
print(y_train[y_train == 0].shape)
print(y_train[y_train == 1].shape)



(17715,)
(100,)


In [52]:
train[train["Class"]==1].Amount.mean()

126.91189999999999

In [53]:
train.Amount.mean()

85.6023412854316

In [35]:
17715/(17815)

0.994386752736458
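The ratio above is the accuracy of a model that always predicts "not fraud". A minimal sketch of that baseline follows, rebuilding only the label counts printed earlier (17,715 and 100); the all-zeros feature array is a placeholder assumption, since `DummyClassifier` ignores the features.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Labels with the class counts printed above: 17,715 not fraud (0), 100 fraud (1).
y = np.array([0] * 17715 + [1] * 100)

# Placeholder features; the dummy model never looks at them.
X = np.zeros((len(y), 1))

# Always predict the most frequent class (0, not fraud).
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = baseline.predict(X)

baseline_accuracy = accuracy_score(y, pred)
baseline_recall = recall_score(y, pred)

print(f"accuracy: {baseline_accuracy:.4f}")  # about 0.9944
print(f"recall:   {baseline_recall:.4f}")    # 0.0000, no fraud is ever caught
```

This is why accuracy alone says little here: a model that catches no fraud at all still scores about 99.4%.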

## Summary Statistics (Graded Work)

What summary statistics should be calculated? See the relevant assignment on PrairieLearn!

- [Lab 06: Building a Fraud Detector](https://us.prairielearn.com/pl/course_instance/140731/assessment_instance/6332208)

## Model Training (Graded Work)

For this lab, you may train models however you'd like!

The only rules are:

- Models must start from the given training data, unmodified.
    - Importantly the type and shape of `X_train` and `y_train` should not be changed, and should be the input to your models.
    - That is, any pipeline must start from these. After that, do whatever!
- Your model must have a `predict` method.
- Your model must have a `predict_proba` method.
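One way to sanity-check the two method requirements before submitting is sketched below. The synthetic data and the logistic regression pipeline are assumptions for illustration; the same checks should apply to whatever fitted model you plan to submit.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data standing in for X_train and y_train.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

model = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression()),
]).fit(X, y)

# A pipeline only exposes predict_proba if its final step supports it.
has_predict = callable(getattr(model, "predict", None))
has_predict_proba = callable(getattr(model, "predict_proba", None))
print(has_predict, has_predict_proba)

# Both methods take the same feature input.
labels = model.predict(X[:3])       # one class label per row
probs = model.predict_proba(X[:3])  # one probability per class per row
print(labels.shape, probs.shape)
```

Some estimators lack `predict_proba` under their default settings, so it is worth running a check like this before serializing.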

You will submit your chosen model to an autograder for checking. It will calculate your model's performance on the test data. Notice that, because you will have unlimited attempts, this somewhat encourages checking against test data multiple times. But you know this is bad in practice. Also, if you use cross-validation to find a good model before submitting, hopefully you'll only need to submit once!

In [12]:
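# use this cell to train models

# scale the features, then fit a random forest
credit_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("randomforest", RandomForestClassifier(criterion="gini")),
])

# hyperparameters to search; the "randomforest__" prefix targets that pipeline step
credit_parameter_grid = {
    "randomforest__max_depth": [3, 5, 7],
    "randomforest__min_samples_split": [2, 5, 10],
}

# 5-fold cross-validated grid search (default scoring for classifiers: accuracy)
credit_randomforestgrid = GridSearchCV(credit_pipe, credit_parameter_grid, cv=5)
credit_randomforestgrid.fit(X_train, y_train)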
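When `scoring` is not set, `GridSearchCV` ranks candidates by the estimator's default score, which is accuracy for classifiers. With classes this unbalanced, accuracy may barely separate the candidates, so one option is to select on a metric such as F1 instead. The sketch below shows that variation on synthetic data; the data, the smaller grid, `n_estimators=25`, and the choice of `"f1"` are all assumptions for illustration, not the lab's required approach.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Hypothetical imbalanced data (about 5% positives).
X, y = make_classification(
    n_samples=1000, n_features=8, weights=[0.95], flip_y=0, random_state=0
)

grid = GridSearchCV(
    RandomForestClassifier(n_estimators=25, random_state=0),
    param_grid={"max_depth": [3, 5]},
    scoring="f1",  # rank candidates by F1 on the positive class, not accuracy
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
)
grid.fit(X, y)

print(grid.best_params_)
print(f"best cross-validated F1: {grid.best_score_:.3f}")
```

Other built-in scorer names such as `"recall"`, `"precision"`, or `"balanced_accuracy"` can be passed the same way, depending on which kind of error you care about most.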

To submit your model to the autograder, you will need to [serialize](https://scikit-learn.org/stable/model_persistence.html) it. In the following cell, replace `_____` with the model you have found.

In [13]:
# dump is not among the imports at the top of the notebook
from joblib import dump

dump(credit_randomforestgrid, "fraud_detector.joblib")


['fraud_detector.joblib']

After you run this cell, a file will be written to the same folder as this notebook. Submit that file to the autograder. See the relevant question in the relevant lab on PrairieLearn.
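Before submitting, it can be worth confirming that the serialized file loads back into a working model. A minimal round-trip sketch follows; the synthetic data, the logistic regression model, and the file name `example_model.joblib` are assumptions for illustration.

```python
import os
import tempfile

from joblib import dump, load
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical data and model standing in for your fitted fraud detector.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = LogisticRegression().fit(X, y)

# Write to a temporary folder so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example_model.joblib")  # hypothetical file name
    dump(model, path)
    restored = load(path)

# The restored model should reproduce the original predictions exactly.
same_predictions = bool((restored.predict(X) == model.predict(X)).all())
print(same_predictions)
```

Note that a serialized model should generally be loaded with the same scikit-learn version that wrote it, and `joblib` files should only be loaded from sources you trust.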

- [Lab 06: Building a Fraud Detector](https://us.prairielearn.com/pl/course_instance/140731/assessment_instance/6332208)

## Discussion

In [None]:
# use this cell to create and print any supporting statistics
# the autograder will give you tn, fp, fn, tp for the test data
# with that information, you can calculate any and all relevant metrics


**Graded discussion:** Do you think that your model is good enough for a bank to use in production? Justify your answer. Describe the potential real-world risks of both false positives and false negatives in this case.

Yes, I do. The PrairieLearn autograder gave me 2 false positives and 20 false negatives. I think it is far better to have a false positive than a false negative. For example, if a bank used this model, it would be acceptable to accidentally flag a legitimate customer's transaction as fraud, but very bad to label a fraudulent transaction as legitimate.
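To support a discussion like this with numbers, the four counts reported by the autograder can be turned into standard metrics. The helper below is a minimal sketch; the function name `metrics_from_counts` and the example counts are hypothetical, so substitute the `tn`, `fp`, `fn`, and `tp` values the autograder actually reports.

```python
def metrics_from_counts(tn, fp, fn, tp):
    """Return common binary-classification metrics from confusion-matrix counts."""
    total = tn + fp + fn + tp
    # Guard each ratio against a zero denominator.
    precision = tp / (tp + fp) if (tp + fp) else 0.0    # flagged transactions that are fraud
    recall = tp / (tp + fn) if (tp + fn) else 0.0       # fraud that gets caught
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # legitimate transactions left alone
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": (tn + tp) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }


# Hypothetical counts for illustration only; these are not autograder results.
example = metrics_from_counts(tn=900, fp=10, fn=30, tp=60)
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

Recall describes how much fraud slips through (false negatives), while precision describes how often a flagged transaction was actually legitimate (false positives), so both are relevant to the question above.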

## Submission

Before submitting, please review the [**Lab Policy** document](https://cs307.org/lab.html) on the course website. This document contains additional directions for completing and submitting your lab. It also defines the grading rubric for the lab.

**Be sure that you have added your name at the top of this notebook.**

Once you've reviewed the lab policy document, head to [**Canvas**](https://canvas.illinois.edu) to submit your lab notebook.