# Credit Card Fraud - Jupyter Notebook
<br><br>
<b>Notebook objective:</b> build a machine learning fraud detection engine in Python

<h2 style="background-color:DarkCyan; text-align:center"><br>Step 1: Python intro<br></h2>
<br>
To introduce you to coding in Python, you're going to run code that prints "Hello World!"
<br><br>
<mark>Click in the gray cell below and hit Shift + Enter to run the code. If it works, you will see text printed out beneath the cell. Edit the code and re-run it to make it print out your name!</mark>

In [None]:
print("Hello World!")

<h2 style="background-color:Tomato; text-align:center"><br>Step 2: Loading the dataset<br></h2>


We need to read in our CSV file of data. To do this with raw Python, we'd have to write a lot of code. Fortunately, someone ([Wes McKinney](https://wesmckinney.com/pages/about.html)) created a "library" of Python code that packages up all that code into a simple function. The library is called pandas (the name comes from "panel data"), which we can import and give a short nickname ("pd").
<br><br>
<mark>Run the code below, without editing it, the same way you did above. If everything works you will get a number next to the cell, and no error.</mark>

In [None]:
import pandas as pd

data = pd.read_csv("data/creditcard_data.csv")

<mark>Run the code below to see the top ("head") 5 rows of the data. Scroll left and right to see all the columns. Then change the number and re-run the code to see what it does.</mark>

In [None]:
data.head(5)

<mark>Run the code below to see summary statistics for the entire dataset. Can you find the average amount of the transactions in this dataset?</mark>

In [None]:
data.describe()

<br>
<h2 style="background-color:DodgerBlue; text-align:center"><br>Step 3: Building our model<br></h2>
<br>
<b>Mastercard uses a variety of model types as part of Decision Intelligence to detect fraud. One of them is the decision tree.</b>

<mark>Read the code below and try to understand what it is doing. The greenish-gray text after each "#" is a comment: a short note that explains the code and has no effect when the code runs. Once you're happy, run the code and hope for no errors!</mark>

In [None]:
# Import the most popular library of code for making decision trees (found by googling)
from sklearn import tree

# Use all data except the 'Fraud' column as input
X = data.drop('Fraud', axis=1)

# Use the 'Fraud' column as what we want to predict as output
Y = data['Fraud']

# Create an empty model 
model = tree.DecisionTreeClassifier()

# Fit the model to our data
model = model.fit(X, Y)

We've built our tree -- now let's test it.

<mark>Run the code below to evaluate the accuracy of our tree using our input and output data.</mark>

In [None]:
model.score(X,Y)

Accuracy ranges from 0.0 (it predicted every transaction wrong) to 1.0 (it predicted every transaction right).
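
To make the definition concrete, here is a minimal sketch of how accuracy can be computed by hand. The labels and predictions are made up for illustration and are not taken from the credit card dataset. For a classifier, scikit-learn's `score` reports this same fraction of correct predictions.

```python
import numpy as np

# Hypothetical labels for six transactions (1 = fraud, 0 = genuine)
truth = np.array([0, 0, 1, 0, 1, 0])
# Hypothetical model predictions for the same six transactions
predictions = np.array([0, 0, 0, 0, 1, 1])

# Accuracy = number of correct predictions / total number of predictions
accuracy = (predictions == truth).mean()
print(accuracy)
```

Here four of the six predictions match the truth, so the accuracy is 4/6.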

<mark>Look at your accuracy and consider: is it possible to be too accurate?</mark>

<br>
<h2 style="background-color:MediumSeaGreen; text-align:center"><br>Step 4: Evaluating our model<br></h2>
<br>

## Model iteration

Just like much of writing is reading and re-writing, when data scientists build their models, they constantly test and re-build them.

<mark>Run the code below to split the data into training X and Y and testing X and Y (this creates four sets of data).</mark>



In [None]:
from sklearn.model_selection import train_test_split

# Split X and Y (our inputs and outputs) into training and testing datasets.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=99)
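
To see what the four datasets look like, here is a small sketch on a made-up table of ten transactions. The `toy` DataFrame, its `Amount` column, and the `toy_` variable names are assumptions for illustration only; they are chosen so that running this does not overwrite the notebook's own `X`, `Y`, or split variables.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: ten transactions, two of them fraud
toy = pd.DataFrame({
    "Amount": [10.0, 25.5, 3.2, 99.9, 47.0, 12.3, 8.8, 150.0, 61.4, 5.0],
    "Fraud":  [0, 0, 0, 1, 0, 0, 0, 1, 0, 0],
})
toy_X = toy.drop("Fraud", axis=1)
toy_Y = toy["Fraud"]

# test_size=0.3 holds back 30% of the rows for testing;
# random_state makes the shuffle repeatable
toy_X_train, toy_X_test, toy_Y_train, toy_Y_test = train_test_split(
    toy_X, toy_Y, test_size=0.3, random_state=99)

print(len(toy_X_train), len(toy_X_test))
```

With ten rows and `test_size=0.3`, seven rows go to training and three to testing, and no row appears in both.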

Below is the code we used to build the model before.

<mark><b>Modify the code</b> so that it fits on your training data and scores on your testing data. (Hint: look at the code in the cell above)</mark>

In [None]:
model = tree.DecisionTreeClassifier()
model = model.fit(X, Y)
model.score(X, Y)

Again, what might explain the accuracy score of your model?

<mark>Run the code below and interpret the results. What problem is this showing?</mark>

In [None]:
%matplotlib inline

# Count how many transactions fall into each class (0 = genuine, 1 = fraud)
counts = data['Fraud'].value_counts()
print(counts)
counts.plot(kind="bar",
            title="Frequency of Genuine vs Fraudulent Transactions")

Why is this a problem?

<br>
<h2 style="background-color:DarkOrange; text-align:center"><br>Step 5: Improving our model<br></h2>
<br>

## Transforming the data

**Data scientists make choices that impact model outputs. At Mastercard, data scientists are dealing with the same challenge: trying to reduce fraud based on limited datasets.**

To deal with the uneven number of fraud and genuine transactions, we could artificially increase the number of fraud transactions by creating similar transactions, or we could reduce the number of genuine transactions.
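
The two strategies above can be sketched on a made-up table. This is only an illustration under stated assumptions: the `toy` DataFrame and its `Amount` column are hypothetical, and the oversampling shown here simply repeats existing fraud rows, which is a simpler stand-in for creating new, similar transactions.

```python
import pandas as pd

# Hypothetical toy data: eight genuine transactions and two fraud
toy = pd.DataFrame({
    "Amount": [10.0, 25.5, 3.2, 47.0, 12.3, 8.8, 61.4, 5.0, 99.9, 150.0],
    "Fraud":  [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
})
toy_genuine = toy[toy["Fraud"] == 0]
toy_fraud = toy[toy["Fraud"] == 1]

# Undersampling: keep only as many genuine rows as there are fraud rows
undersampled = pd.concat(
    [toy_genuine.sample(len(toy_fraud), random_state=0), toy_fraud])

# Oversampling (by duplication): draw fraud rows with replacement
# until there are as many as genuine rows
oversampled = pd.concat(
    [toy_genuine, toy_fraud.sample(len(toy_genuine), replace=True, random_state=0)])

# The mean of a 0/1 column is the fraction of rows labelled 1
print(undersampled["Fraud"].mean(), oversampled["Fraud"].mean())
```

Both results have a `Fraud` mean of 0.5, meaning the classes are balanced, but undersampling leaves 4 rows while oversampling leaves 16.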

With more time, we might test multiple strategies. Today we'll just try reducing the number of genuine transactions.

<mark>Run the code, and look at the mean of the Fraud column. What does it mean? <b>Modify number_genuine to change the number of genuine transactions so that the classes are balanced, and re-run the code</b>.</mark>

In [None]:
# How many genuine transactions should we use to balance the classes?
number_genuine = 1

# Separate genuine transactions and fraud
genuine = data[data['Fraud'] == 0].sample(number_genuine)
fraud = data[data['Fraud'] == 1]

# Combine fraud and genuine
even_data = pd.concat([genuine, fraud])

# Summarize our new dataset, even_data
even_data.describe()

Since we have a new dataset, we'll need to recreate our inputs, outputs, and split them into training and testing sets.

In [None]:
# Create inputs and outputs with new dataset
X = even_data.drop('Fraud', axis=1)
Y = even_data['Fraud']

# Split new inputs and outputs into training and testing
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=99)

# Train and score decision tree using new data
model = tree.DecisionTreeClassifier(max_depth = 1)
model = model.fit(X_train, Y_train)
model.score(X_test, Y_test)

The next step is to visualize the decision tree we made. To do this, we've copied some code from the sklearn documentation.

<mark>Run the code below to see your tree!</mark>

In [None]:
import graphviz
dot_data = tree.export_graphviz(model, out_file=None, 
                     feature_names=X.columns.values,  
                     class_names=["Genuine","Fraud"],  
                     filled=True, rounded=True, 
                     special_characters=True)  
graph = graphviz.Source(dot_data)  
graph

You might notice your decision tree is very small. The final step is to evaluate how complex your decision tree needs to be.

<mark>Go back <b>two</b> code cells and change the max_depth of your decision tree (line 9). Run the code, then re-run the code to visualize the tree.</mark>

<mark>What size of decision tree gets the greatest accuracy for your data? Why?</mark>
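
One way to explore this question systematically is to loop over several depths and compare training accuracy with testing accuracy. The sketch below is an assumption-laden illustration: it uses synthetic data from `make_classification` rather than the credit card dataset, and the `toy_` names and the range of depths are arbitrary choices.

```python
from sklearn import tree
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the credit card dataset)
toy_X, toy_Y = make_classification(
    n_samples=500, n_features=8, n_informative=4, random_state=0)
toy_X_train, toy_X_test, toy_Y_train, toy_Y_test = train_test_split(
    toy_X, toy_Y, test_size=0.3, random_state=99)

# Train one tree per depth and record (training accuracy, testing accuracy)
depth_scores = {}
for depth in range(1, 11):
    toy_model = tree.DecisionTreeClassifier(max_depth=depth, random_state=0)
    toy_model.fit(toy_X_train, toy_Y_train)
    depth_scores[depth] = (toy_model.score(toy_X_train, toy_Y_train),
                           toy_model.score(toy_X_test, toy_Y_test))

for depth, (train_acc, test_acc) in depth_scores.items():
    print(depth, round(train_acc, 3), round(test_acc, 3))
```

Comparing the two columns is the point: a tree that keeps improving on the training data while its testing accuracy stalls or drops is likely memorizing rather than generalizing.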

<br>
<h2 style="background-color:Gold; text-align:center"><br>Step 6: Quantifying the investment<br></h2>
<br>

**Building good models requires time and resources; it is important to focus on valuable investments.**

How do you know your time was well spent?

There are huge costs associated with accepting a fraudulent transaction; [LexisNexis](https://risk.lexisnexis.com/insights-resources/research/2018-true-cost-of-fraud-study-for-the-retail-sector) finds fraud costs retailers an average of $2.94 per fraudulent dollar in fees, prevention, legal costs, etc. Declining a genuine transaction is costly too! [Adyen and 451 Research](https://go.adyen.com/rs/222-DNK-376/images/Retail%20Report%202019.pdf?mkt_tok=eyJpIjoiWXpNeE56Y3paRGszTnpBNSIsInQiOiJaVmJ1NXVJVkZFMkdHY1FCYVRGUENFemlDWnU3RSthM21LRmF3MDdtUldwSjZvMVF6ZzVjTTFjemJKS1BxZUJWWElxejZrQXVKeDhwNlZGVXkwT3FtcTkwd1BFTkwwaWZlV1BFcnM3YmY2aEQ1RnMrT3BFS1g4MTRsaWI3R1BUSSJ9) report 2 in 5 consumers have abandoned a purchase after a declined payment in the past 6 months. Customers are less likely to return to a merchant after a failed payment.

In our simulation, we'll charge $2.94 per dollar for each false approval and the cost of the transaction for each false decline. Of course, every situation is different, and a client will likely have their own costs associated with false approvals and declines.

<mark>Run the cell below to compare the cost of fraud when using your model with the cost of approving all transactions.</mark>

In [None]:
COST_PER_FRAUD_DOLLAR = 2.94 # Cost per dollar of a false approval
COST_PER_FALSE_DECLINE_DOLLAR = 1 # Cost per dollar of a false decline

predictions = list(model.predict(X_test))
truth = list(Y_test)

false_approval_cost = 0
false_approval_num = 0
false_decline_cost = 0
false_decline_num = 0
correct_num = 0
correct_cost = 0

for i in range(len(predictions)):
    if predictions[i] != truth[i]: # If our prediction was wrong
        if truth[i] == 1: # If we falsely approved
            false_approval_cost += (X_test.iloc[i, 0] * COST_PER_FRAUD_DOLLAR) # Cost increases by $2.94 * the amount of the transaction
            false_approval_num += 1
        else: # If we falsely decline
            false_decline_cost += (X_test.iloc[i, 0] * COST_PER_FALSE_DECLINE_DOLLAR) # We miss a sale, cost increases by the amount of the transaction
            false_decline_num += 1
    else: # If our prediction was correct
        correct_num += 1
        if truth[i] == 0: # It's a genuine transaction
            correct_cost += X_test.iloc[i, 0]
print("You processed {} payments. {} were correct predictions, of which the genuine transactions totalled ${} in revenue.\n\nYou had {} false approvals, which cost ${} in fees and administrative costs.\nYou had {} false declines, which cost ${} in missed sales.\n".format(len(predictions), correct_num, round(correct_cost,2),  false_approval_num, round(false_approval_cost, 2), false_decline_num, round(false_decline_cost,2)))
print("{}% of your predictions were incorrect.\nYour loss due to fraud was {}% of revenue.\n".format(round((false_approval_num+false_decline_num)*100/len(predictions),2), round((false_approval_cost+false_decline_cost)/(false_approval_cost+false_decline_cost+correct_cost)*100,2)))

approve_all_cost = 0
approve_all_num = 0
all_genuine_cost = 0

for i in range(len(truth)):
    if truth[i] == 1: # There was fraud
        approve_all_cost += (X_test.iloc[i, 0] * COST_PER_FRAUD_DOLLAR)
        approve_all_num += 1
    else: # Genuine transaction
        all_genuine_cost += X_test.iloc[i, 0]

print("If you had simply approved all {} transactions, you would have falsely approved {} transactions, costing ${} while earning ${} in revenue.\n".format(len(predictions), approve_all_num, round(approve_all_cost, 2), round(all_genuine_cost, 2)))
print("Your model's predictions were worth ${}.".format(round(((correct_cost - false_decline_cost - false_approval_cost)-(all_genuine_cost - approve_all_cost)),2)))
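
The cost rules used in the cell above can also be checked by hand on a few made-up transactions. This is a minimal sketch using NumPy boolean masks instead of a loop; the `amounts`, `truth`, and `predictions` arrays are hypothetical values, not output from the model.

```python
import numpy as np

COST_PER_FRAUD_DOLLAR = 2.94        # Cost per dollar of a false approval
COST_PER_FALSE_DECLINE_DOLLAR = 1   # Cost per dollar of a false decline

# Hypothetical transactions (1 = fraud, 0 = genuine)
amounts = np.array([20.0, 100.0, 50.0, 10.0, 200.0])
truth = np.array([0, 1, 0, 1, 0])
predictions = np.array([0, 0, 1, 1, 0])

# Boolean masks for each outcome
false_approval = (predictions == 0) & (truth == 1)   # fraud we let through
false_decline = (predictions == 1) & (truth == 0)    # genuine sale we blocked
correct_genuine = (predictions == 0) & (truth == 0)  # genuine sale we approved

false_approval_cost = (amounts[false_approval] * COST_PER_FRAUD_DOLLAR).sum()
false_decline_cost = (amounts[false_decline] * COST_PER_FALSE_DECLINE_DOLLAR).sum()
correct_revenue = amounts[correct_genuine].sum()

# Baseline: approve everything, so every fraud is a false approval
approve_all_cost = (amounts[truth == 1] * COST_PER_FRAUD_DOLLAR).sum()
all_genuine_revenue = amounts[truth == 0].sum()

model_value = ((correct_revenue - false_decline_cost - false_approval_cost)
               - (all_genuine_revenue - approve_all_cost))
print(false_approval_cost, false_decline_cost, correct_revenue, model_value)
```

In this toy case the model misses the $100 fraud and blocks a genuine $50 sale, so it comes out worse than approving everything. A negative value like this is a reminder that accuracy alone does not tell you whether a model saves money.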

<mark>What differs in our simulation when compared to reality?</mark>

<mark>What could make our model stronger?</mark>