# Classification of fraudulent credit card transactions

In this exercise, we are predicting [credit card fraud](https://www.kaggle.com/mlg-ulb/creditcardfraud/home).

In [11]:
import sklearn as sk
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

## Importing the data ##

Let's have a look at the data first.

In [12]:
df = pd.read_csv('creditcard.csv')
df.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


The data consists of a number of variables [which are described on Kaggle](https://www.kaggle.com/mlg-ulb/creditcardfraud). There are 28 "secret" variables, plus *Time* (when the transaction took place) and *Amount*.
We can use them all. We are predicting *Class*: was the transaction fraudulent (1) or not (0)? Let's see how many fraudulent cases there are.

The "secret" variables are derived from a principal component analysis (PCA), which is a way to construct features (combinations of the original variables) that can give you better results. If you want to know how PCA works, watch this StatQuest video: https://www.youtube.com/watch?v=FgakZw6K1QQ (it is, however, not important for this assignment).

1. Check how many fraudulent and non-fraudulent cases *Class* contains. What can you conclude from this division?


In [13]:
df["Class"].value_counts()

0    99776
1      223
Name: Class, dtype: int64
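
To make the imbalance concrete, here is a small sketch that only uses the two counts printed above. It computes the share of fraudulent transactions and the accuracy of a hypothetical "model" that labels every transaction as non-fraudulent. The variable names are made up for illustration.

```python
# Class counts copied from the value_counts() output above.
n_legit = 99776
n_fraud = 223
n_total = n_legit + n_fraud

# Share of transactions that are fraudulent.
fraud_share = n_fraud / n_total

# Accuracy of a trivial baseline that always predicts class 0 (no fraud).
baseline_accuracy = n_legit / n_total

print(f"fraud share:       {fraud_share:.4%}")
print(f"baseline accuracy: {baseline_accuracy:.4f}")
```

Only about 0.22% of the cases are fraudulent, so a model can reach an accuracy of roughly 0.998 without detecting a single fraud. Keep this number in mind when judging the accuracy later on.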

2. Let's get our *X* and *y* and split the data.

In [14]:
X = df.loc[:, 'Time' : 'Amount']
y = df['Class']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
X_train.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount
92336,63884,0.805625,-1.695688,1.352358,-0.282137,-2.253216,0.017385,-1.217197,0.20276,-0.152576,...,-0.005321,-0.004854,0.04217,-0.132694,0.590563,-0.139888,1.096028,-0.039289,0.058267,239.76
62015,50089,1.100031,-0.102983,1.111263,1.292381,-0.933713,-0.115089,-0.54014,0.190514,0.607337,...,-0.162889,0.151511,0.592482,-0.12518,0.561148,0.570113,-0.244571,0.051015,0.021777,15.93
5005,4573,-1.552078,-0.531074,1.974243,1.013701,1.024195,-1.144346,0.002698,0.048984,1.339099,...,0.24135,-0.181249,-0.715027,0.328461,0.268026,-0.015884,-0.726971,0.026191,0.155041,83.72
56848,47632,-0.419158,0.949003,1.336161,0.85433,0.442809,0.107296,0.772343,-0.182008,-0.640852,...,0.015254,0.213674,0.779649,-0.181551,0.038274,-0.493093,-0.334844,-0.222714,-0.062746,19.99
99373,67089,0.054913,1.758103,-2.796921,1.320415,1.549875,-0.875941,0.498296,0.236048,-0.683338,...,0.008538,-0.125729,-0.211676,0.142032,-1.351474,-0.221678,-0.351459,-0.10093,-0.083579,8.49
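
With so few fraudulent cases, a purely random split can leave the train and test sets with slightly different fraud rates. The split above does not correct for this. As an optional variant, *train_test_split* accepts a *stratify* argument that keeps the class proportions the same in both parts. The sketch below shows this on hypothetical toy data (not the credit card data), so all names ending in *_toy* are assumptions for illustration.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 1000 rows, of which 20 (2%) belong to class 1.
X_toy = [[i] for i in range(1000)]
y_toy = [1] * 20 + [0] * 980

# stratify=y_toy keeps the share of class 1 equal in the train and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=1, stratify=y_toy
)

print("share of class 1 in train:", sum(y_tr) / len(y_tr))
print("share of class 1 in test: ", sum(y_te) / len(y_te))
```

In the notebook itself this would amount to adding `stratify=y` to the existing call. Doing so changes which rows end up in the test set, so the numbers further down would no longer match exactly.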


## Training the algorithm ##

Let's train the Random Forest algorithm. RF uses randomness, so we need to set a *random_state* if we want the result to be stable for presentation purposes.

I've also set the number of trees (*n_estimators*) to 100. This is the default number of trees in *sklearn* from version 0.22 onwards, since current literature suggests using more trees than the traditional default of 10. Also, computing power has increased (more trees require more computing power). The following might take half a minute or so to run, depending on your machine.

In [15]:
from sklearn.ensemble import RandomForestClassifier
# RF is a random algorithm, so to get the same results we need to use random_state
rf = RandomForestClassifier(random_state=1, n_estimators=100)
rf = rf.fit(X_train, y_train)

3. Calculate the accuracy.

In [16]:
rf.score(X_test, y_test)

0.9994666666666666

## Evaluating the model ##

Let's evaluate the model using our standard approach for a *classification* problem: making a confusion matrix and calculating accuracy, precision and recall.

The confusion matrix uses the *sorted* labels, so 0 comes first, 1 second.

In [17]:
rf.classes_

array([0, 1], dtype=int64)
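
To see what this ordering means for the layout, here is a minimal sketch on a hypothetical five-element example (the *_toy* lists are made up). In *sklearn*, rows of the confusion matrix are the actual classes and columns are the predicted classes, so for labels 0 and 1 the flattened matrix reads: true negatives, false positives, false negatives, true positives.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels, only to illustrate the layout of the matrix.
y_true_toy = [0, 0, 1, 1, 1]
y_pred_toy = [0, 1, 1, 1, 0]

# Rows = actual class, columns = predicted class, both in sorted label order.
cm_toy = confusion_matrix(y_true_toy, y_pred_toy, labels=[0, 1])
print(cm_toy)

# Flattening row by row gives the four cells in this order.
tn, fp, fn, tp = (int(v) for v in cm_toy.ravel())
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
```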

4. Run the predictions and generate a confusion matrix. What can you conclude from this? And what is the precision?

In [20]:
y_pred = rf.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
cm = pd.DataFrame(cm, index=['no fraud (actual)', 'fraud (actual)'])
cm

Unnamed: 0,0,1
no fraud (actual),29933,1
fraud (actual),15,51


In [22]:
51/52

0.9807692307692307
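
The same by-hand calculation can be extended to the other metrics. The sketch below takes the four cells of the confusion matrix above and computes precision, recall, F1 and accuracy from them. The variable names are chosen for illustration.

```python
# Cells copied from the confusion matrix above
# (rows = actual class, columns = predicted class).
tn, fp = 29933, 1   # actual no fraud: predicted no fraud / predicted fraud
fn, tp = 15, 51     # actual fraud:    predicted no fraud / predicted fraud

precision = tp / (tp + fp)   # of all predicted frauds, how many were real?
recall = tp / (tp + fn)      # of all real frauds, how many did we catch?
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision: {precision:.4f}")
print(f"recall:    {recall:.4f}")
print(f"f1-score:  {f1:.4f}")
print(f"accuracy:  {accuracy:.4f}")
```

The accuracy matches the value returned by *rf.score* earlier. The recall is much lower than the precision: 15 of the 66 fraudulent transactions in the test set go undetected.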

Since you have already learned how to calculate precision and recall by hand, we are now going to do it the easy way, using a function called *classification_report*.

In [21]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     29934
           1       0.98      0.77      0.86        66

    accuracy                           1.00     30000
   macro avg       0.99      0.89      0.93     30000
weighted avg       1.00      1.00      1.00     30000
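
The *macro avg* and *weighted avg* rows can look puzzling at first. As a rough sketch, the snippet below reproduces them for the recall column, using the per-class counts from the confusion matrix. The macro average treats both classes equally, while the weighted average weighs each class by its support (number of actual cases). The dictionary names are made up for illustration.

```python
# Support (number of actual cases per class), taken from the report above.
support = {0: 29934, 1: 66}

# Recall per class, computed from the confusion matrix cells.
recall = {0: 29933 / 29934, 1: 51 / 66}

# Macro average: plain mean over the classes, ignoring class size.
macro_recall = sum(recall.values()) / len(recall)

# Weighted average: mean over the classes, weighted by support.
weighted_recall = (
    sum(recall[c] * support[c] for c in support) / sum(support.values())
)

print(f"macro avg recall:    {macro_recall:.2f}")
print(f"weighted avg recall: {weighted_recall:.2f}")
```

Because class 0 is so much larger, the weighted average is dominated by it and hides the weaker recall on the fraud class. The macro average makes that weakness more visible.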



5. Draw your conclusions from this report.