# Credit Card Fraud - Jupyter Notebook

## Welcome to your notebook!

This is where you will read, write, and execute Python code. We will work through this notebook together, but you'll find notes included between code cells to help you stay on track.

Developers typically start using a new programming language by figuring out how to get the computer to output "Hello World." You can do so here by clicking into the code cell below and pressing *Shift-Enter*, or by clicking the "Run" button above. Then, modify the code to print out another message!

In [None]:
print("Hello World!")

## Loading the dataset

Our dataset, stored in **data/creditcard.csv**, is from a publicly available set of [credit card transactions](https://www.kaggle.com/mlg-ulb/creditcardfraud). These card-present transactions were made by European cardholders in September 2013. This dataset is commonly referenced in research literature on fraud detection.

Our ability to make good predictions depends on the data we use -- what differences might you expect between the model we will make based on this dataset and models built on more recent data?

In [None]:
import pandas as pd

data = pd.read_csv("data/creditcard_data.csv", index_col=0)

In [None]:
data.head()

In [None]:
data.describe()

## Building our model

We want to build a decision tree that is able to predict whether a certain transaction is fraudulent based on the data available to us.

Again, we won't start from scratch; we'll use a data science toolkit called [sklearn](https://scikit-learn.org/stable/modules/tree.html), but we'll need to specify what data we are using as input and which column we want to predict as output.

In [None]:
from sklearn import tree

# Use all data except the 'Class' column as input
X = data.drop('Class', axis=1)
# Use the 'Class' column as what we want to predict as output
y = data['Class']

# Create an empty model 
model = tree.DecisionTreeClassifier()

# Fit the model to our data
model = model.fit(X, y)

## Evaluating our model

We've built our tree -- now let's test it.

In the cell below, use **model.score(X, y)** to evaluate the accuracy of our tree using our input and output data.

## Model iteration

Just like much of writing is reading and re-writing, when data scientists test their models, they analyze the results and re-build the models.

What might explain the accuracy score of your model?
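
One way to explore this question is to try the same steps on data where we already know there is nothing to learn. The sketch below is only an illustration: it uses made-up random data (the names `X_demo`, `y_demo`, and `demo_model` are placeholders, not part of our notebook) and compares accuracy on the rows the tree was trained on with accuracy on rows it has never seen.

```python
import numpy as np
from sklearn import tree
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, not the credit card dataset):
# 1,000 rows of random features with labels assigned at random,
# so there is no real pattern for a model to learn.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 5))
y_demo = rng.integers(0, 2, size=1000)

# Hold back 30% of the rows for testing
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=99
)

demo_model = tree.DecisionTreeClassifier(random_state=0)
demo_model.fit(X_tr, y_tr)

# Accuracy on the rows the tree was trained on vs. rows it has never seen
train_accuracy = demo_model.score(X_tr, y_tr)
test_accuracy = demo_model.score(X_te, y_te)
print("Training accuracy:", train_accuracy)
print("Testing accuracy:", test_accuracy)
```

An unrestricted decision tree can memorize its training rows, so the training accuracy here looks perfect even though the labels are random. The testing accuracy is a more honest estimate of how the model would do on new transactions.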

In [None]:
from sklearn.model_selection import train_test_split

# Split X and y (our input and outputs) into training and testing datasets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=99)

Below is the code we used to build the model before; **modify it to use your training and testing datasets.**

In [None]:
model = tree.DecisionTreeClassifier()
model = model.fit(X, y)
model.score(X, y)

Again, what might explain the accuracy score of your model?

With data about a transaction and no model to form a prediction, what would you guess?

In [None]:
# Count how many transactions fall into each class (0 = genuine, 1 = fraud)
counts = data["Class"].value_counts()

print(counts)
counts.plot(kind="bar",
           title="Frequency of Genuine vs Fraudulent Transactions")

Why is this a problem?
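
To make the question concrete, here is a minimal sketch using made-up labels rather than our dataset. It assumes 995 genuine and 5 fraudulent transactions (invented numbers) and uses sklearn's `DummyClassifier` to stand in for someone who always guesses the most common class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import recall_score

# Synthetic stand-in labels (an assumption): 995 genuine (0) and 5 fraud (1)
y_demo = np.array([0] * 995 + [1] * 5)
# The baseline ignores the features, so a placeholder column is enough
X_demo = np.zeros((len(y_demo), 1))

# Always predict whichever class is most frequent in the training labels
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_demo, y_demo)
predictions = baseline.predict(X_demo)

baseline_accuracy = baseline.score(X_demo, y_demo)
# Recall: the share of actual fraud cases that were predicted as fraud
fraud_recall = recall_score(y_demo, predictions)
print("Accuracy:", baseline_accuracy)
print("Share of fraud caught:", fraud_recall)
```

The guess-the-majority baseline is right almost every time and still catches no fraud at all, which is why accuracy alone can be misleading when one class is rare.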

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(16,16))

correlation_matrix = data.corr()
sns.heatmap(correlation_matrix)

## Transforming the data

Data scientists make choices that impact model outputs. To deal with this class imbalance, we could choose to oversample the minority class or undersample the majority class; there are trade-offs with each.

With more time, we might test multiple strategies. Today we'll undersample the number of genuine transactions.
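
For comparison, here is a rough sketch of the strategy we are not using today, oversampling. It runs on a tiny made-up table (the `demo` DataFrame and its values are invented for illustration) and draws fraud rows with replacement until the two classes are the same size.

```python
import pandas as pd

# Synthetic stand-in data (an assumption): 8 genuine rows and 2 fraud rows
demo = pd.DataFrame({
    "Amount": [10, 25, 40, 5, 60, 15, 30, 20, 900, 750],
    "Class":  [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
})

demo_genuine = demo[demo["Class"] == 0]
demo_fraud = demo[demo["Class"] == 1]

# Oversample: draw fraud rows with replacement until they match the genuine count
fraud_oversampled = demo_fraud.sample(
    n=len(demo_genuine), replace=True, random_state=99
)

# Combine the untouched genuine rows with the repeated fraud rows
oversampled = pd.concat([demo_genuine, fraud_oversampled])
print(oversampled["Class"].value_counts())
```

Oversampling keeps every genuine transaction but fills the dataset with repeated copies of the same fraud rows, so copies of one row can land in both the training and testing sets if we split afterwards. Undersampling avoids duplicates but throws away most of the genuine transactions.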

**Modify the code below to sample the number of genuine transactions that balance the classes.**

In [None]:
# How many genuine transactions should we use to balance the classes?
number_genuine = 1

# Separate genuine transactions and fraud
genuine = data[data["Class"] == 0].sample(number_genuine)
fraud = data[data["Class"] == 1]

# Combine fraud and genuine
even_data = pd.concat([genuine, fraud])

# Summarize our new dataset, even_data
even_data.describe()

Previously, we used the correlation matrix to get a sense of the predictive power of our initial dataset, "data".

**Modify the code below to view the correlation matrix of our new dataset, "even_data".**

In [None]:
plt.figure(figsize=(16,16))

correlation_matrix = data.corr()
sns.heatmap(correlation_matrix)

Since we have a new dataset, we'll need to recreate our inputs, outputs, and split them into training and testing sets.

In [None]:
# Create inputs and outputs with new dataset
X = even_data.drop('Class', axis=1)
y = even_data['Class']

# Split new inputs and outputs into training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=99)

# Train and score decision tree using new data
model = tree.DecisionTreeClassifier()
model = model.fit(X_train, y_train)
model.score(X_test, y_test)

You'll notice we're repeating a lot of the same code -- let's put it in a function to make it easier to use later.

Run the cell below to *define* a function called fit_and_score_model, which creates a decision tree model to predict the 'Class' column using the dataset you specify. When you provide information to a function, put it in the parentheses.

In [None]:
def fit_and_score_model(data, max_depth=None):
    # Create inputs and outputs
    X = data.drop('Class', axis=1)
    y = data['Class']
    
    # Split inputs and outputs into training and testing
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=99)

    # Train and score decision tree
    # Pass along the max_depth argument so callers can limit the tree's depth
    model = tree.DecisionTreeClassifier(max_depth=max_depth)
    model = model.fit(X_train, y_train)
    print(model.score(X_test, y_test))
    
    return model

Now, let's use the function we just defined.

In [None]:
my_model = fit_and_score_model(even_data)
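
The function also accepts an optional max_depth argument, which is meant to limit how many levels of questions the tree may ask. The sketch below shows what that limit does, using sklearn's `DecisionTreeClassifier` directly on synthetic data from `make_classification` (an assumption for illustration, not our credit card data).

```python
from sklearn import tree
from sklearn.datasets import make_classification

# Synthetic stand-in data (an assumption, not the credit card dataset)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, random_state=99
)

# No limit: with default settings the tree keeps splitting until its leaves are pure
deep_tree = tree.DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

# max_depth=3: the tree stops after at most three levels of splits
shallow_tree = tree.DecisionTreeClassifier(
    max_depth=3, random_state=0
).fit(X_demo, y_demo)

print("Depth with no limit:", deep_tree.get_depth())
print("Depth with max_depth=3:", shallow_tree.get_depth())
```

A shallower tree is easier to read and less likely to memorize the training rows, but it may miss patterns that need more splits to capture.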

## Independent model building

We can continue to iterate on our models with a few additional tools.

In [None]:
def scale_columns(data, column_names):
    from sklearn.preprocessing import RobustScaler
    rob_scaler = RobustScaler()
    temp_data = data.copy()
    
    for column_name in column_names:
        # Stop early if an empty column name was passed in
        if column_name == "":
            print("Enter a column name or list of names that you'd like scaled!")
            return
        
        temp_data[column_name] = rob_scaler.fit_transform(temp_data[column_name].values.reshape(-1,1))
    return temp_data

def drop_outliers(data, column_names, fraud=1, threshold=1.5):
    import numpy as np
    
    for column_name in column_names:
        fraud_values = data[column_name][data["Class"] == fraud].values
        q25, q75 = np.percentile(fraud_values, 25), np.percentile(fraud_values, 75)
        iqr = q75 - q25
        lower, upper = q25 - (iqr * threshold), q75 + (iqr * threshold)
        data = data.drop(data[(data[column_name] > upper) | (data[column_name] < lower)].index)
    return data

print("Functions successfully loaded: ")
print("*\t my_data = scale_columns(data, [\"Column\", \"Name(s)\"])")
print("*\t my_data = drop_outliers(data, [\"Column\", \"Name(s)\"], fraud=1, threshold=1.5)")
print("*\t my_model = fit_and_score_model(data, max_depth=5)")
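
To see what these two helpers compute, here is a small worked sketch on a handful of made-up numbers (the `values` array is an assumption for illustration). It applies the same interquartile range rule that drop_outliers uses with threshold=1.5, then checks by hand that `RobustScaler` with its default settings subtracts the median and divides by the interquartile range.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Made-up values (an assumption) with one obvious extreme point
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 100], dtype=float)

# Interquartile range rule, as in drop_outliers with threshold=1.5
q25, q75 = np.percentile(values, 25), np.percentile(values, 75)
iqr = q75 - q25
lower, upper = q25 - 1.5 * iqr, q75 + 1.5 * iqr

# Keep only the values that fall inside the bounds
kept = values[(values >= lower) & (values <= upper)]
print("Bounds:", lower, upper)
print("Kept:", kept)

# RobustScaler (default settings) subtracts the median and divides by the IQR
scaled = RobustScaler().fit_transform(values.reshape(-1, 1)).ravel()
manual = (values - np.median(values)) / iqr
print("Scaled:", scaled)
print("Matches manual calculation:", np.allclose(scaled, manual))
```

Note that drop_outliers calculates its bounds from the fraud rows only (when fraud=1) and then drops any row, genuine or fraudulent, that falls outside them, so the result depends on which class you choose.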

We'll make a fresh copy of our balanced dataset called my_data so we can experiment. If you ever want to go back to the balanced dataset, run the cell below again.

In [None]:
my_data = even_data.copy()

Now, using the options above, create your own dataset, use it to build a new model, and test it to see how accurate you're able to make it. 