# Credit Card Fraud - Jupyter Notebook
<br><br>
<b>Notebook objective:</b> build a machine learning fraud detection engine in Python

<h2 style="background-color:yellow; text-align:center"><br>Step 1: Python intro<br></h2>
<br>
To introduce you to coding in Python, you're going to run code that prints "Hello World!"
<br><br>
<mark>Click in the gray cell below and hit Shift + Enter to run the code. If it works, edit the code and re-run it to make it print out your name!</mark>

In [None]:
print("Hello World!")

<br>
<h2 style="background-color:Tomato; text-align:center"><br>Step 2: Loading the dataset<br></h2>
<br><br>
Our dataset, <em>creditcard_data.csv</em>, is stored in the folder <em>data</em>, which is in the same folder as this notebook.
<br>That means the "filepath" from here is <b><em>data/creditcard_data.csv</em></b>.

We need to load that data with Python code into a variable. How do we do that?

With a bit of googling, it looks like the easiest way is to use a library of pre-written Python code called [pandas](https://www.google.com/search?q=load+csv+in+python+with+pandas&oq=load+csv+in+python+with+pandas&aqs=chrome..69i57j0l7.5608j1j9&sourceid=chrome&ie=UTF-8) (the name comes from "panel data"), built to make working with data in Python easier.

<mark>Read the code below and try to understand what it is doing (it's ok not to understand every detail). Change INSERT_FILEPATH to the correct filepath and run the code (keep the quotation marks!). If you do it correctly, you will get no error.</mark>

In [None]:
import pandas as pd

data = pd.read_csv("INSERT_FILEPATH", index_col=0)

You have now stored the data inside a variable called <em>data</em>, but we can't see it! Searching online, we can find simple pandas code to help us.
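
To give a feel for what that kind of code looks like, here is a minimal sketch on a tiny made-up table rather than the real dataset. The CSV text, the `toy` variable and its values are assumptions invented for illustration; only `read_csv`, `head` and `describe` are real pandas calls.

```python
import io

import pandas as pd

# Hypothetical miniature CSV; the column names and values are made up.
toy_csv = io.StringIO(
    "id,V1,Amount,Class\n"
    "0,-1.2,10.0,0\n"
    "1,0.4,20.0,0\n"
    "2,2.3,30.0,1\n"
)

# index_col=0 tells pandas to use the first column as the row labels.
toy = pd.read_csv(toy_csv, index_col=0)

# head(n) returns the first n rows of the table.
print(toy.head(2))

# describe() returns summary statistics (count, mean, std, min, ...)
# for each numeric column.
summary = toy.describe()
print(summary.loc["mean", "Amount"])
```

The same two calls work on any DataFrame, whatever its size.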

<mark>Run <b>data.head(5)</b> in the cell below to see the top (head) 5 rows of the data (feel free to experiment with the number). Scroll left and right to see all the columns.</mark>

**Notice how many column headers are anonymized (V1, V2, etc.). This mimics what Mastercard sees in Decision Intelligence. Since our model will learn using correlations, it doesn't need to know what each number represents.**

<mark>Run <b>data.describe()</b> in the cell below to see a summary of the entire dataset. What is the average amount of the transactions in this dataset?</mark>

<mark>While waiting for the facilitator to move on to the next section, discuss the following two questions with your partner:</mark>

1. Our ability to make good predictions depends on the data we use -- what differences might you expect between the model we will make based on this dataset and models built on more recent data?
<br><br>
2. Why do you think it is useful to have anonymized columns?


<br>
<h2 style="background-color:DodgerBlue; text-align:center"><br>Step 3: Building our model<br></h2>
<br>
<b>Mastercard uses a variety of model types to detect fraud. As you know, one of them is the decision tree.</b>

Today, we will build a decision tree that is able to predict whether a certain transaction is fraudulent based on the data available to us.

Again, we won't start from scratch; we'll use a data science toolkit called [sklearn](https://scikit-learn.org/stable/modules/tree.html), but we'll need to specify what data we are using as input and which column we want to predict as output.

<mark>Read the code below and try to understand what it is doing. The gray text after each "#" is a comment: a little bit of text that explains the code but doesn't do anything. Once you're happy, run the code and hope for no errors!</mark>

In [None]:
from sklearn import tree

# Use all data except the 'Class' column as input
X = data.drop('Class', axis=1)
# Use the 'Class' column as what we want to predict as output
y = data['Class']

# Create an empty model 
model = tree.DecisionTreeClassifier()

# Fit the model to our data
model = model.fit(X, y)

<br>
<h2 style="background-color:MediumSeaGreen; text-align:center"><br>Step 4: Evaluating our model<br></h2>
<br>

We've built our tree -- now let's test it.

<mark>In the cell below, use **model.score(X, y)** to evaluate the accuracy of our tree using our input and output data.</mark>

Accuracy ranges from 0.0 (it predicted every transaction wrong) to 1.0 (it predicted every transaction right).
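
To make that number concrete, here is a small sketch of how accuracy is calculated. The labels in `y_true` and `y_pred` are made up for illustration. For classifiers, sklearn's `model.score` reports this same fraction of correct predictions.

```python
from sklearn.metrics import accuracy_score

# Made-up labels for four transactions: 0 = genuine, 1 = fraud.
y_true = [0, 0, 1, 1]   # what actually happened
y_pred = [0, 0, 0, 1]   # what a hypothetical model predicted

# Accuracy by hand: the number of correct predictions divided by the total.
correct = sum(t == p for t, p in zip(y_true, y_pred))
manual_accuracy = correct / len(y_true)

# The same calculation using sklearn's helper function.
sklearn_accuracy = accuracy_score(y_true, y_pred)

print(manual_accuracy, sklearn_accuracy)
```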

<mark>Look at your accuracy and discuss with your partner: is it possible to be too accurate?</mark>

## Model iteration

Just like much of writing is reading and re-writing, when data scientists test their models, they analyze the results and re-build the models.
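
One common part of that cycle is holding back some rows that the model never sees while it is being trained. As a rough sketch, here is how sklearn's `train_test_split` divides a made-up dataset. `X_toy`, `y_toy` and the other names are invented for illustration and are not the credit card data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 rows with 2 features each, and a 0/1 label per row.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 1] * 5)

# test_size=0.3 holds back 30% of the rows for testing.
# random_state fixes the shuffle so the split is the same on every run.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=99
)

# Every row ends up in exactly one of the two sets.
print(len(X_tr), "training rows,", len(X_te), "testing rows")
```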

<mark>Run the code below to split the data into training X and y and testing X and y.</mark>



In [None]:
from sklearn.model_selection import train_test_split

# Split X and y (our input and outputs) into training and testing datasets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=99)

Below is the code we used to build the model before.

<mark>Modify and run it to use your training and testing datasets.</mark>

In [None]:
model = tree.DecisionTreeClassifier()
model = model.fit(X, y)
model.score(X, y)

Again, what might explain the accuracy score of your model?

<mark>Run the code below and interpret the results. What problem is this showing?</mark>

In [None]:
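# Count how many transactions fall in each class (0 = genuine, 1 = fraud)
counts = data["Class"].value_counts()

# Show plots inside the notebook
%matplotlib inline
print(counts)
counts.plot(kind="bar",
            title="Frequency of Genuine vs Fraudulent Transactions")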

Why is this a problem?

## Transforming the data

**Data scientists make choices that impact model outputs. At Mastercard, data scientists are dealing with the same challenge: trying to reduce fraud based on limited datasets.**

To deal with the uneven number of fraud and genuine transactions, we could artificially increase the number of fraud transactions by creating similar transactions, or we could reduce the number of genuine transactions.
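
For comparison, the first option can be roughly sketched as simple random oversampling. Note the assumption: this version only duplicates existing fraud rows and does not create new "similar" transactions, which would need a more advanced technique. The `toy` table and its values are made up for illustration.

```python
import pandas as pd

# Made-up toy data: 8 genuine rows (Class 0) and 2 fraud rows (Class 1).
toy = pd.DataFrame({
    "Amount": [5, 12, 7, 30, 9, 14, 3, 22, 250, 400],
    "Class":  [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
})

toy_genuine = toy[toy["Class"] == 0]
toy_fraud = toy[toy["Class"] == 1]

# Simple random oversampling: draw fraud rows with replacement until
# there are as many fraud rows as genuine rows.
fraud_oversampled = toy_fraud.sample(
    n=len(toy_genuine), replace=True, random_state=0
)

# Combine the genuine rows with the oversampled fraud rows.
balanced = pd.concat([toy_genuine, fraud_oversampled])
print(balanced["Class"].value_counts())
```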

With more time, we might test multiple strategies. Today we'll reduce the number of genuine transactions.

<mark>Run the code and look at the mean of the Class column. What does it mean? Then modify number_genuine so that the number of genuine transactions balances the classes, and re-run the code.</mark>

In [None]:
# How many genuine transactions should we use to balance the classes?
number_genuine = 1

# Separate genuine transactions and fraud
genuine = data[data["Class"] == 0].sample(number_genuine)
fraud = data[data["Class"] == 1]

# Combine fraud and genuine
even_data = pd.concat([genuine, fraud])

# Summarize our new dataset, even_data
even_data.describe()

Since we have a new dataset, we'll need to recreate our inputs, outputs, and split them into training and testing sets.

In [None]:
# Create inputs and outputs with new dataset
X = even_data.drop('Class', axis=1)
y = even_data['Class']

# Split new inputs and outputs into training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=99)

# Train and score decision tree using new data
model = tree.DecisionTreeClassifier(max_depth = 1)
model = model.fit(X_train, y_train)
model.score(X_test, y_test)

Next, let's visualize the decision tree we made. To do this, we've copied some code from the sklearn documentation.

<mark>Run the code below to see your tree!</mark>

In [None]:
import graphviz
dot_data = tree.export_graphviz(model, out_file=None, 
                     feature_names=X.columns.values,  
                     class_names=["Genuine","Fraud"],  
                     filled=True, rounded=True, 
                     special_characters=True)  
graph = graphviz.Source(dot_data)  
graph

You might notice your decision tree is very small. The final step is to evaluate how complex your decision tree needs to be.

<mark>Go back <b>two</b> code cells and change the max_depth of your decision tree (line 9). Run the code, then re-run the code to visualize the tree.</mark>

<mark>What size of decision tree gets the greatest accuracy for your data? Why?</mark>
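
If you'd rather not change max_depth by hand each time, one possible approach is to try several depths in a loop. The sketch below uses synthetic stand-in data from sklearn's `make_classification` (an assumption for illustration, not the credit card dataset), so the numbers it prints will not match yours. Names such as `X_demo` and `results` are made up.

```python
from sklearn import tree
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the credit card dataset).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=99
)

# Train one tree per depth and record its accuracy on the
# training set and on the held-back testing set.
results = {}
for depth in range(1, 11):
    demo_model = tree.DecisionTreeClassifier(max_depth=depth, random_state=0)
    demo_model.fit(X_tr, y_tr)
    results[depth] = (
        demo_model.score(X_tr, y_tr),
        demo_model.score(X_te, y_te),
    )

for depth, (train_acc, test_acc) in results.items():
    print(f"max_depth={depth}: train={train_acc:.3f}, test={test_acc:.3f}")
```

Comparing the two columns for each depth is one way to think about the question above.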