# CREDIT CARD FRAUD DETECTION

<strong> Problem Statement </strong>
In this project we want to identify fraudulent transactions with Credit Cards. Our objective is to build a Fraud detection system using Machine learning techniques. In the past, such systems were rule-based. Machine learning offers powerful new ways.

The project uses a dataset of 300,000 fully anonymized transactions. Each transation is labelled either fraudulent or not fraudulent. Note that prevalence of fraudulent transactions is very low in the dataset. Less than 0.1% of the card transactions are fraudulent. This means that a system predicting each transaction to be normal can reach an accuracy of over 99.9% despite not detecting any fraudulent transaction. This will necessitate adjustment techniques.

It is a CSV file, contains 31 features, the last feature is used to classify the transaction whether it is a fraud or not.

<b>Case Study:</b> Fraud detection is important to e-coomerce store and lot of money is used to prevention.
We have a e-commerce store that sells books. Thousands of books were sold last year and toda we will use transaction data to build Fraud Detection System. We will use publica;ly available dataset.

<b> Business Problem:</b> Build a classifier that give a new transaction can say whether fraudulent or not with confidence.<br>
<b> Outcome: </b>
0:Non-Fradulent <br>
1:Fradulent

# Importing the libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Importing the dataset

In [2]:
df = pd.read_csv('creditcard.csv')

In [3]:
df.columns

Index(['Time', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10',
       'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20',
       'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount',
       'Class'],
      dtype='object')

# Data Pre Processing

<b> dropping some columns after EDA that these columns give no useful information.

In [4]:
df= df.drop(columns=['Time','V13','V15','V22','V23','V24','V25','V26','V27','V28'])

In [5]:
df.head()

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V12,V14,V16,V17,V18,V19,V20,V21,Amount,Class
0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,0.090794,...,-0.617801,-0.311169,-0.470401,0.207971,0.025791,0.403993,0.251412,-0.018307,149.62,0
1,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,-0.166974,...,1.065235,-0.143772,0.463917,-0.114805,-0.183361,-0.145783,-0.069083,-0.225775,2.69,0
2,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,0.207643,...,0.066084,-0.165946,-2.890083,1.109969,-0.121359,-2.261857,0.52498,0.247998,378.66,0
3,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,-0.054952,...,0.178228,-0.287924,-1.059647,-0.684093,1.965775,-1.232622,-0.208038,-0.1083,123.5,0
4,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,0.753074,...,0.538196,-1.11967,-0.451449,-0.237033,-0.038195,0.803487,0.408542,-0.009431,69.99,0


### Normalising the amount range using Standard Scaler
## Feature Scaling

In [6]:
from sklearn.preprocessing import StandardScaler
sc= StandardScaler()

In [7]:
df['Scaled_Amount']= sc.fit_transform(df['Amount'].values.reshape(-1,1))

In [8]:
df=df.drop(columns=['Amount'],axis=1)

In [9]:
df.head()

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V12,V14,V16,V17,V18,V19,V20,V21,Class,Scaled_Amount
0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,0.090794,...,-0.617801,-0.311169,-0.470401,0.207971,0.025791,0.403993,0.251412,-0.018307,0,0.244964
1,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,-0.166974,...,1.065235,-0.143772,0.463917,-0.114805,-0.183361,-0.145783,-0.069083,-0.225775,0,-0.342475
2,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,0.207643,...,0.066084,-0.165946,-2.890083,1.109969,-0.121359,-2.261857,0.52498,0.247998,0,1.160686
3,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,-0.054952,...,0.178228,-0.287924,-1.059647,-0.684093,1.965775,-1.232622,-0.208038,-0.1083,0,0.140534
4,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,0.753074,...,0.538196,-1.11967,-0.451449,-0.237033,-0.038195,0.803487,0.408542,-0.009431,0,-0.073403


# Splitting the Data

In [10]:
x= df.loc[:,df.columns != 'Class']
y=df['Class']

In [11]:
x.head()

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12,V14,V16,V17,V18,V19,V20,V21,Scaled_Amount
0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,0.090794,-0.5516,-0.617801,-0.311169,-0.470401,0.207971,0.025791,0.403993,0.251412,-0.018307,0.244964
1,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,-0.166974,1.612727,1.065235,-0.143772,0.463917,-0.114805,-0.183361,-0.145783,-0.069083,-0.225775,-0.342475
2,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,0.207643,0.624501,0.066084,-0.165946,-2.890083,1.109969,-0.121359,-2.261857,0.52498,0.247998,1.160686
3,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,-0.054952,-0.226487,0.178228,-0.287924,-1.059647,-0.684093,1.965775,-1.232622,-0.208038,-0.1083,0.140534
4,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,0.753074,-0.822843,0.538196,-1.11967,-0.451449,-0.237033,-0.038195,0.803487,0.408542,-0.009431,-0.073403


In [12]:
y.head()

0    0
1    0
2    0
3    0
4    0
Name: Class, dtype: int64

In [13]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=0)

# Applying machine learning algorithms
<b> Creating a function that will give the Following output when any model is run

In [14]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

def print_score(clf, x_train, y_train, x_test, y_test, train=True):
    if train:
        pred = clf.predict(x_train)
        clf_report = pd.DataFrame(classification_report(y_train, pred, output_dict=True))
        print("Train Result:\n================================================")
        print(f"Accuracy Score: {accuracy_score(y_train, pred) * 100:.2f}%")
        print("_______________________________________________")
        print(f"CLASSIFICATION REPORT:\n{clf_report}")
        print("_______________________________________________")
        print(f"Confusion Matrix: \n {confusion_matrix(y_train, pred)}\n")
        
    elif train==False:
        pred = clf.predict(x_test)
        clf_report = pd.DataFrame(classification_report(y_test, pred, output_dict=True))
        print("Test Result:\n================================================")        
        print(f"Accuracy Score: {accuracy_score(y_test, pred) * 100:.2f}%")
        print("_______________________________________________")
        print(f"CLASSIFICATION REPORT:\n{clf_report}")
        print("_______________________________________________")

# Decision Tree

In [15]:
from sklearn.tree import DecisionTreeClassifier
clf_tree= DecisionTreeClassifier(random_state=0)

clf_tree.fit(x_train,y_train)
print_score(clf_tree, x_train, y_train, x_test, y_test, train=True)
print_score(clf_tree, x_train, y_train, x_test, y_test, train=False)

Train Result:
Accuracy Score: 100.00%
_______________________________________________
CLASSIFICATION REPORT:
                  0      1  accuracy  macro avg  weighted avg
precision       1.0    1.0       1.0        1.0           1.0
recall          1.0    1.0       1.0        1.0           1.0
f1-score        1.0    1.0       1.0        1.0           1.0
support    227454.0  391.0       1.0   227845.0      227845.0
_______________________________________________
Confusion Matrix: 
 [[227454      0]
 [     0    391]]

Test Result:
Accuracy Score: 99.93%
_______________________________________________
CLASSIFICATION REPORT:
                      0           1  accuracy     macro avg  weighted avg
precision      0.999560    0.826087   0.99928      0.912824      0.999253
recall         0.999719    0.752475   0.99928      0.876097      0.999280
f1-score       0.999640    0.787565   0.99928      0.893602      0.999263
support    56861.000000  101.000000   0.99928  56962.000000  56962.000000


# Feature Selection

In [17]:
from sklearn.feature_selection import RFE # Recurssive Feature Elimination
from sklearn.tree import DecisionTreeClassifier


In [19]:
# Reducing the features to 20 from 42
clf_tree= DecisionTreeClassifier(random_state=0)
rfe = RFE(clf_tree, 4)
rfe= rfe.fit(x_train,y_train)




In [20]:
# summarise the selection attributes
# which columns are selected(True) and which are not (False)
print(rfe.support_)

[False False False False False  True False False False  True False False
  True False  True False False False False False]


In [22]:
x_train.columns[rfe.support_]

Index(['V6', 'V10', 'V14', 'V17'], dtype='object')

In [23]:
rfe.ranking_

array([13, 16, 12,  4, 11,  1, 14, 15,  3,  1, 10,  6,  1, 17,  1,  8,  2,
        7,  5,  9])

In [24]:
# Predicting Test Set
clf_tree.fit(x_train[x_train.columns[rfe.support_]],y_train)
y_pred = clf_tree.predict(x_test[x_test.columns[rfe.support_]])

In [25]:
## Making the confusion Matrix
from sklearn.metrics import confusion_matrix, accuracy_score
cm= confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)

[[56839    22]
 [   20    81]]


0.9992626663389628

# K Fold Validation

### Decision Tree

In [28]:
from sklearn.model_selection import cross_val_score

In [29]:
accuracies= cross_val_score(estimator=clf_tree,
                           X=x_train,
                           y=y_train,
                           cv=5)

In [30]:
accuracies.mean()

0.9991178213259013

<b>
The accuracy is so high because the dataset is an imbalanced datset.

In [31]:
df.Class.value_counts()

0    284315
1       492
Name: Class, dtype: int64

## Working with data imbalance part

In [32]:
# create the training df by remerging X_train and y_train
df_train = x_train.join(y_train)
df_train.sample(10)

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V12,V14,V16,V17,V18,V19,V20,V21,Scaled_Amount,Class
126384,-0.535426,1.457386,-2.049344,-1.096206,2.572402,3.756631,-2.203098,-4.481461,-1.013707,-2.015644,...,0.148403,-0.047322,1.393363,0.278406,0.605146,-1.111584,1.249887,-2.673996,-0.350191,0
270269,-1.383006,0.108843,2.217794,-0.841221,-0.701896,0.330197,-0.766704,0.954386,0.298538,-1.071272,...,-0.286435,0.26511,1.220082,-0.779385,1.266334,-0.159559,-0.073452,0.311585,-0.218294,0
93451,-0.376499,1.082959,1.370642,-0.155422,0.229367,-0.596987,0.739614,-0.072111,-0.461119,-0.330588,...,0.136522,-0.019012,0.31279,-0.611987,-0.270833,0.197208,0.154778,-0.253315,-0.342475,0
61361,1.220161,0.110153,0.507274,0.55428,-0.601274,-1.015235,0.024003,-0.161176,0.176541,-0.07855,...,-0.00952,0.386688,0.3727,-0.364132,-0.55734,-0.040998,-0.093703,-0.348283,-0.261353,0
220249,-0.092593,0.955224,-0.944377,-1.128561,1.406329,-1.311666,1.686886,-0.407722,-0.407341,-0.494366,...,-0.200852,0.708699,-0.602373,-0.46662,-0.297642,-0.082927,0.022827,0.251802,-0.257275,0
284091,-1.9707,-2.791704,-0.824508,0.426773,4.413998,-3.177168,-1.427549,0.082781,0.288183,-0.513504,...,0.227115,-0.760862,0.199885,0.304559,-0.022014,-0.140732,0.745278,0.172417,-0.348072,0
19157,0.723013,-1.806477,1.08605,0.08214,-2.07822,-0.18986,-0.779009,-0.056962,0.190734,0.154809,...,0.490543,-1.265307,0.379767,0.939604,-1.582795,0.73176,0.71073,0.404395,0.894173,0
78631,-2.860916,1.586937,-0.817489,-1.780102,-1.728374,-0.395657,-3.043793,-4.47925,-3.865632,-0.322652,...,0.562173,1.172272,-0.278318,1.338399,-0.103846,-0.203206,0.388294,-2.511224,-0.293258,0
126669,0.73266,-0.894062,-0.942705,0.172826,-0.138037,-0.827609,0.851088,-0.421252,-0.100504,-0.230819,...,-0.203881,0.542296,0.204525,-0.441969,-0.327017,0.328869,0.621297,0.098101,1.018114,0
277808,2.01683,-0.152317,-1.315494,0.276013,-0.106577,-1.359209,0.276531,-0.443139,0.334756,0.142343,...,0.418945,0.335213,0.014134,-0.494076,-0.346556,-0.248688,-0.103082,0.237332,-0.193506,0


In [33]:
from sklearn.utils import resample

# Separate majority and minority classes
df_majority = df_train[df_train.Class==0]
df_minority = df_train[df_train.Class==1]

print(df_majority.Class.count())
print("-----------")
print(df_minority.Class.count())
print("-----------")
print(df_train.Class.value_counts())

227454
-----------
391
-----------
0    227454
1       391
Name: Class, dtype: int64


## UpSampling the Data

In [34]:
# Upsample minority class
df_minority_upsampled = resample(df_minority, 
                                 replace=True,     # sample with replacement
                                 n_samples=227454,    # to match majority class
                                 random_state=587) # reproducible results
# Combine majority class with upsampled minority class
df_upsampled = pd.concat([df_majority, df_minority_upsampled])
# Display new class counts
df_upsampled.Class.value_counts()

1    227454
0    227454
Name: Class, dtype: int64

In [35]:
x_upsampled = df_upsampled.drop(['Class'], axis= 1)
y_upsampled = df_upsampled.Class

## DT on UpSampled Data

In [36]:
from sklearn.tree import DecisionTreeClassifier
clf_tree= DecisionTreeClassifier(random_state=0)

clf_tree.fit(x_train,y_train)
print_score(clf_tree, x_upsampled, y_upsampled, x_test, y_test, train=True)
print_score(clf_tree, x_upsampled, y_upsampled, x_test, y_test, train=False)

Train Result:
Accuracy Score: 100.00%
_______________________________________________
CLASSIFICATION REPORT:
                  0         1  accuracy  macro avg  weighted avg
precision       1.0       1.0       1.0        1.0           1.0
recall          1.0       1.0       1.0        1.0           1.0
f1-score        1.0       1.0       1.0        1.0           1.0
support    227454.0  227454.0       1.0   454908.0      454908.0
_______________________________________________
Confusion Matrix: 
 [[227454      0]
 [     0 227454]]

Test Result:
Accuracy Score: 99.93%
_______________________________________________
CLASSIFICATION REPORT:
                      0           1  accuracy     macro avg  weighted avg
precision      0.999560    0.826087   0.99928      0.912824      0.999253
recall         0.999719    0.752475   0.99928      0.876097      0.999280
f1-score       0.999640    0.787565   0.99928      0.893602      0.999263
support    56861.000000  101.000000   0.99928  56962.000000

## Down-Sampling the Data

In [37]:
# Downsample majority class
df_majority_downsampled = resample(df_majority, 
                                 replace=False,    # sample without replacement
                                 n_samples=391,     # to match minority class
                                 random_state=24) # reproducible results
# Combine minority class with downsampled majority class
df_downsampled = pd.concat([df_majority_downsampled, df_minority])
# Display new class counts
df_downsampled.Class.value_counts()

1    391
0    391
Name: Class, dtype: int64

In [38]:
x_downsampled = df_downsampled.drop(['Class'], axis = 1)
y_downsampled = df_downsampled.Class

## DT on Down-Sampled Data

In [39]:
from sklearn.tree import DecisionTreeClassifier
clf_tree= DecisionTreeClassifier(random_state=0)

clf_tree.fit(x_train,y_train)
print_score(clf_tree, x_downsampled, y_downsampled, x_test, y_test, train=True)
print_score(clf_tree, x_downsampled, y_downsampled, x_test, y_test, train=False)

Train Result:
Accuracy Score: 100.00%
_______________________________________________
CLASSIFICATION REPORT:
               0      1  accuracy  macro avg  weighted avg
precision    1.0    1.0       1.0        1.0           1.0
recall       1.0    1.0       1.0        1.0           1.0
f1-score     1.0    1.0       1.0        1.0           1.0
support    391.0  391.0       1.0      782.0         782.0
_______________________________________________
Confusion Matrix: 
 [[391   0]
 [  0 391]]

Test Result:
Accuracy Score: 99.93%
_______________________________________________
CLASSIFICATION REPORT:
                      0           1  accuracy     macro avg  weighted avg
precision      0.999560    0.826087   0.99928      0.912824      0.999253
recall         0.999719    0.752475   0.99928      0.876097      0.999280
f1-score       0.999640    0.787565   0.99928      0.893602      0.999263
support    56861.000000  101.000000   0.99928  56962.000000  56962.000000
___________________________

# Pickle file saving

In [40]:
import pickle

# save the model to disk
filename = 'C://Users//ebineet//Documents//Deployment//Deployment//Credit_Card_Fraud_DT.sav'
pickle.dump(clf_tree, open(filename, 'wb'))