# Preparing the Data
In this step we read the source data, inspect its variables, and look at a few sample records. 
This tells us which columns the data set contains and what their characteristics are. 
We use the Pandas library to create the data frame that the subsequent steps rely on.

In [4]:
import pandas as pd
# Load creditcard.csv using pandas
# Source: https://www.kaggle.com/mlg-ulb/creditcardfraud
datainput = pd.read_csv('creditcard.csv')
# Print the top 5 records
print(datainput[0:5], "\n")
# Print the shape of the complete data set
print("Shape of Complete Data Set")
print(datainput.shape, "\n")

   Time        V1        V2        V3        V4        V5        V6        V7  \
0   0.0 -1.359807 -0.072781  2.536347  1.378155 -0.338321  0.462388  0.239599   
1   0.0  1.191857  0.266151  0.166480  0.448154  0.060018 -0.082361 -0.078803   
2   1.0 -1.358354 -1.340163  1.773209  0.379780 -0.503198  1.800499  0.791461   
3   1.0 -0.966272 -0.185226  1.792993 -0.863291 -0.010309  1.247203  0.237609   
4   2.0 -1.158233  0.877737  1.548718  0.403034 -0.407193  0.095921  0.592941   

         V8        V9  ...       V21       V22       V23       V24       V25  \
0  0.098698  0.363787  ... -0.018307  0.277838 -0.110474  0.066928  0.128539   
1  0.085102 -0.255425  ... -0.225775 -0.638672  0.101288 -0.339846  0.167170   
2  0.247676 -1.514654  ...  0.247998  0.771679  0.909412 -0.689281 -0.327642   
3  0.377436 -1.387024  ... -0.108300  0.005274 -0.190321 -1.175575  0.647376   
4 -0.270533  0.817739  ... -0.009431  0.798278 -0.137458  0.141267 -0.206010   

        V26       V27       V28 
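
Beyond printing the first rows, it can help to list the column names, their data types, and the number of missing values per column. The following is a minimal sketch on a small stand-in frame; the frame `sample` and its values are made up for illustration and are not the real creditcard.csv data.

```python
import pandas as pd

# Hypothetical stand-in for the real data set: a few of the columns seen above,
# filled with made-up values so the example runs without creditcard.csv.
sample = pd.DataFrame({
    "Time": [0.0, 0.0, 1.0],
    "V1": [-1.36, 1.19, -1.36],
    "Amount": [149.62, 2.69, 378.66],
    "Class": [0, 0, 0],
})

# Column names present in the frame
column_names = list(sample.columns)
print(column_names)

# Data type of each column
print(sample.dtypes)

# Number of missing values in each column
missing_per_column = sample.isnull().sum()
print(missing_per_column)
```

The same three calls should work unchanged on datainput once the CSV file is loaded.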

# Checking the Imbalance in the Data
Now we check how the data is distributed between fraudulent and genuine transactions. 
This gives us an idea of what share of the data is expected to be fraudulent. 
In machine learning this is referred to as data imbalance. 
If most of the transactions are not fraudulent, it becomes difficult to judge the few remaining transactions as genuine or not. 
We use the Class column to count the number of fraudulent and genuine transactions and 
then work out the proportion of fraudulent transactions relative to genuine ones.

In [5]:
import pandas as pd
# Load creditcard.csv using pandas
datainput = pd.read_csv('creditcard.csv')
# Class == 1 marks fraudulent transactions, Class == 0 genuine ones
false = datainput[datainput['Class'] == 1]
true = datainput[datainput['Class'] == 0]
# Ratio of fraudulent to genuine transactions
n = len(false)/float(len(true))
print(n)
print('False Detection Cases: {}'.format(len(false)))
print('True Detection Cases: {}'.format(len(true)),"\n")

0.0017304750013189597
False Detection Cases: 492
True Detection Cases: 284315 
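
The value printed above is the ratio of fraudulent to genuine transactions. If the share of fraudulent transactions among all transactions is wanted instead, value_counts with normalize=True is one way to get it. This sketch rebuilds only the label column from the two counts shown in the output, so it does not need the CSV file; the names `labels` and `fraud_percent` are illustrative.

```python
import pandas as pd

# Rebuild the label column from the counts printed above:
# 284315 genuine (Class 0) and 492 fraudulent (Class 1) transactions.
labels = pd.Series([0] * 284315 + [1] * 492, name="Class")

# Absolute number of rows per class
counts = labels.value_counts()
print(counts)

# Share of each class among all rows, expressed as a percentage
shares = labels.value_counts(normalize=True) * 100
fraud_percent = shares.loc[1]
print("Fraudulent share of all transactions: {:.3f}%".format(fraud_percent))
```

With these counts the fraudulent share comes out at roughly 0.17% of all transactions.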



# Details of Transaction Types
We now look more closely at the transaction amounts in each category, fraudulent and non-fraudulent. 
We compute summary statistics such as the mean, standard deviation, maximum value, minimum value and 
different percentiles. This is achieved by using the describe method.

In [6]:
import pandas as pd
#Load the creditcard.csv using pandas
datainput = pd.read_csv('creditcard.csv')

# Split into fraudulent (Class 1) and genuine (Class 0) transactions
false = datainput[datainput['Class'] == 1]
true = datainput[datainput['Class'] == 0]

#False Detection Cases
print("False Detection Cases")
print("----------------------")
print(false.Amount.describe(),"\n")

#True Detection Cases
print("True Detection Cases")
print("----------------------")
print(true.Amount.describe(),"\n")

False Detection Cases
----------------------
count     492.000000
mean      122.211321
std       256.683288
min         0.000000
25%         1.000000
50%         9.250000
75%       105.890000
max      2125.870000
Name: Amount, dtype: float64 

True Detection Cases
----------------------
count    284315.000000
mean         88.291022
std         250.105092
min           0.000000
25%           5.650000
50%          22.000000
75%          77.050000
max       25691.160000
Name: Amount, dtype: float64 
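
The two describe calls above can also be combined into a single table by grouping on the Class column. The sketch below assumes a small made-up frame named `toy` in place of the real data; only the groupby and describe calls carry over.

```python
import pandas as pd

# Hypothetical miniature data set: three genuine and two fraudulent rows.
toy = pd.DataFrame({
    "Amount": [10.0, 20.0, 30.0, 1.0, 99.0],
    "Class": [0, 0, 0, 1, 1],
})

# One row of summary statistics per class, computed on the Amount column
summary = toy.groupby("Class")["Amount"].describe()
print(summary)
```

Having both classes side by side makes it easier to compare, for example, the median and maximum amounts.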



# Separating Features and Label
Before we implement the ML algorithm, we need to decide on the features and the label, which means separating the dependent variable from the independent ones. In our data set the Class column depends on all the other columns. So we create one array from the last column and another array from all the remaining columns. These arrays will be used to train the model that we are going to create.

In [8]:
import pandas as pd
#Load the creditcard.csv using pandas
datainput = pd.read_csv('creditcard.csv')
#separating features(X) and label(y)
# Select all columns except the last for all rows
X = datainput.iloc[:, :-1].values
# Select the last column of all rows
Y = datainput.iloc[:, -1].values

print(X.shape)
print(Y.shape)

(284807, 30)
(284807,)
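
Selecting by position works because Class is the last column. A variant that selects by column name keeps working even if the column order changes. This is a sketch on a made-up frame named `toy`, not the real data.

```python
import pandas as pd

# Hypothetical miniature frame with the label in a column named Class.
toy = pd.DataFrame({
    "Time": [0.0, 1.0, 2.0],
    "V1": [-1.3, 1.1, -0.9],
    "Amount": [10.0, 20.0, 30.0],
    "Class": [0, 0, 1],
})

# Features: every column except Class, as a NumPy array
X = toy.drop(columns=["Class"]).values
# Label: the Class column, as a NumPy array
Y = toy["Class"].values

print(X.shape)
print(Y.shape)
```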


# Train the Model
Now we split the data set into two parts: one for training and one for testing. The test_size parameter decides what fraction of the data set is held out for testing only; here 0.2 means 20%. Evaluating on data the model has not seen helps us gain confidence in the model we are creating.

In [12]:
import pandas as pd
from sklearn.model_selection import train_test_split

#Load the creditcard.csv using pandas
datainput = pd.read_csv('creditcard.csv')

#separating features(X) and label(y)
X = datainput.iloc[:, :-1].values

# Select the last column of all rows
Y = datainput.iloc[:, -1].values

#train_test_split method
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)
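
With heavily imbalanced labels, a plain random split can leave the test set with more or fewer fraudulent rows than expected, and the result changes on every run. One possible refinement, not used in the cells of this walkthrough, is to pass stratify and random_state to train_test_split. The sketch below uses synthetic data with a 2% positive class; the seed values are arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 rows with 5 features, 20 of them labelled 1 (2%).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
Y = np.array([1] * 20 + [0] * 980)

# stratify=Y keeps the class proportions the same in both parts;
# random_state makes the split reproducible across runs.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, stratify=Y, random_state=42
)

print(X_train.shape, X_test.shape)
print("positives in train:", Y_train.sum(), "positives in test:", Y_test.sum())
```

Fixing random_state also means that repeated runs of the later cells would report the same scores instead of slightly different ones each time.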

# Applying Decision Tree Classification
Many different algorithms could be applied to this problem, but we choose a decision tree as our classification algorithm. We set a maximum tree depth of 4, fit the tree on the training data and supply the test samples to predict their classes. Finally, we calculate the accuracy of the predictions on the test set to decide whether to continue with this algorithm.

In [13]:
import pandas as pd
from sklearn import metrics
from sklearn.model_selection import train_test_split

#Load the creditcard.csv using pandas
#datainput = pd.read_csv('creditcard.csv')

#separating features(X) and label(y)
X = datainput.iloc[:, :-1].values
Y = datainput.iloc[:, -1].values

#train_test_split method
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)

#DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier
classifier=DecisionTreeClassifier(max_depth=4)
classifier.fit(X_train,Y_train)
predicted=classifier.predict(X_test)
print("\npredicted values :\n",predicted)

#Accuracy
DT = metrics.accuracy_score(Y_test, predicted) * 100
print("\nThe accuracy score using the DecisionTreeClassifier : ",DT)


predicted values :
 [0 0 0 ... 0 0 0]

The accuracy score using the DecisionTreeClassifier :  99.9420666409185
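
An accuracy above 99.9% needs to be read against the class imbalance found earlier. As a rough reference point, the sketch below computes the accuracy of a hypothetical model that labels every transaction as genuine, using the class counts from the full data set printed in the imbalance step.

```python
# Class counts from the imbalance check on the full data set.
genuine = 284315
fraudulent = 492
total = genuine + fraudulent

# A hypothetical model that always predicts "genuine" is right on every
# genuine row and wrong on every fraudulent row.
baseline_accuracy = genuine / total * 100
print("Accuracy of always predicting genuine: {:.3f}%".format(baseline_accuracy))
```

This baseline already reaches roughly 99.83%, so accuracy alone says little about how well fraud is detected.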


# Finding Evaluation Parameters
Once the accuracy level in the above step is acceptable, we evaluate the model further using additional metrics. We use precision, recall and the F score. Precision is the fraction of relevant instances among the retrieved instances, while recall is the fraction of all relevant instances that were actually retrieved. Here, precision is the fraction of transactions flagged as fraudulent that really are fraudulent, and recall is the fraction of all fraudulent transactions that the model flagged. The F score balances the concerns of precision and recall in a single number.

In [14]:
import pandas as pd
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score

#Load the creditcard.csv using pandas
#datainput = pd.read_csv('creditcard.csv')
#separating features(X) and label(y)

X = datainput.iloc[:, :-1].values
Y = datainput.iloc[:, -1].values

#train_test_split method
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)

#DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier
classifier=DecisionTreeClassifier(max_depth=4)
classifier.fit(X_train,Y_train)
predicted=classifier.predict(X_test)
print("\npredicted values :\n",predicted)

# Accuracy
DT = metrics.accuracy_score(Y_test, predicted) * 100
print("\nThe accuracy score using the DecisionTreeClassifier : ",DT)

# Precision
print('precision')
# Precision = TP / (TP + FP) (where TP = True Positive, FP = False Positive, FN = False Negative)
precision = precision_score(Y_test, predicted, pos_label=1)
print(precision)

# Recall
print('recall')
# Recall = TP / (TP + FN)
recall = recall_score(Y_test, predicted, pos_label=1)
print(recall)

# F1 score
print('f-Score')
# The F1 score is the harmonic mean of precision and recall.
fscore = f1_score(Y_test, predicted, pos_label=1)
print(fscore)


predicted values :
 [0 0 0 ... 0 0 0]

The accuracy score using the DecisionTreeClassifier :  99.91748885221726
precision
0.8144329896907216
recall
0.7314814814814815
f-Score
0.7707317073170732
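
To make the formulas concrete, the three scores can be recomputed by hand from confusion-matrix counts. The counts below (79 true positives, 18 false positives, 29 false negatives) are an inference: they are not reported in the output, but they are consistent with the precision, recall and F score printed above.

```python
# Assumed counts for the positive (fraudulent) class, inferred from the
# printed scores rather than taken from a reported confusion matrix.
tp = 79  # fraudulent transactions correctly flagged
fp = 18  # genuine transactions wrongly flagged as fraudulent
fn = 29  # fraudulent transactions the model missed

# Precision = TP / (TP + FP)
precision = tp / (tp + fp)
# Recall = TP / (TP + FN)
recall = tp / (tp + fn)
# F1 = harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print("precision:", precision)
print("recall:   ", recall)
print("f1:       ", f1)
```

Under these assumed counts, roughly one in four fraudulent transactions in the test set goes undetected, which the accuracy figure alone does not reveal.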
