# CREDIT CARD FRAUD DETECTION

<strong> Problem Statement </strong>
In this project we want to identify fraudulent credit card transactions. Our objective is to build a fraud detection system using machine learning techniques. In the past, such systems were rule-based; machine learning lets a model learn fraud patterns directly from labelled transactions instead of relying on hand-written rules.

The project uses a dataset of 284,807 fully anonymized transactions (the 227,845 training rows and 56,962 test rows reported in the results below). Each transaction is labelled either fraudulent or not fraudulent. Note that the prevalence of fraudulent transactions is very low: only 492 of them, about 0.17%, are fraudulent. This means that a system predicting every transaction to be normal reaches an accuracy of about 99.8% without detecting a single fraudulent transaction. Accuracy alone is therefore misleading, and the imbalance calls for adjustment techniques such as resampling or class weighting.
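To make the imbalance concrete, the short sketch below computes the fraud prevalence and the accuracy of a trivial classifier that always predicts class 0. The class counts are taken from the `support` rows of the classification reports printed later in this notebook (training and test sets combined); the variable names are illustrative only.

```python
# Class counts taken from the "support" rows reported later in this notebook
n_legit = 227454 + 56861   # non-fraudulent transactions (train + test)
n_fraud = 391 + 101        # fraudulent transactions (train + test)
n_total = n_legit + n_fraud

# Share of fraudulent transactions in the whole dataset
prevalence = n_fraud / n_total

# A model that always predicts 0 is right on every non-fraudulent row
baseline_accuracy = n_legit / n_total

print(f"Total transactions: {n_total}")
print(f"Fraud prevalence: {prevalence:.4%}")
print(f"Accuracy of always predicting 0: {baseline_accuracy:.4%}")
```

Any model evaluated in this notebook should be judged against this baseline rather than against 0%, which is why precision and recall on class 1 matter more than raw accuracy here.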

The data is a CSV file with 31 columns: 30 features and a final `Class` column that labels each transaction as fraudulent or not.

<b>Case Study:</b> Fraud detection is important to e-commerce stores, and a lot of money is spent on prevention.
We have an e-commerce store that sells books. Thousands of books were sold last year, and today we will use transaction data to build a fraud detection system. We will use a publicly available dataset.

<b> Business Problem:</b> Build a classifier that, given a new transaction, can say with confidence whether it is fraudulent or not.<br>
<b> Outcome: </b>
0: Non-Fraudulent <br>
1: Fraudulent

# Importing the libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Importing the dataset

In [3]:
df = pd.read_csv('creditcard.csv')

In [4]:
df.columns

Index(['Time', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10',
       'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20',
       'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount',
       'Class'],
      dtype='object')

# Data Pre Processing

<b>Dropping columns that EDA showed to carry no useful information.</b>

In [5]:
df= df.drop(columns=['Time','V13','V15','V22','V23','V24','V25','V26','V27','V28'])

In [6]:
df.head()

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V12,V14,V16,V17,V18,V19,V20,V21,Amount,Class
0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,0.090794,...,-0.617801,-0.311169,-0.470401,0.207971,0.025791,0.403993,0.251412,-0.018307,149.62,0
1,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,-0.166974,...,1.065235,-0.143772,0.463917,-0.114805,-0.183361,-0.145783,-0.069083,-0.225775,2.69,0
2,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,0.207643,...,0.066084,-0.165946,-2.890083,1.109969,-0.121359,-2.261857,0.52498,0.247998,378.66,0
3,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,-0.054952,...,0.178228,-0.287924,-1.059647,-0.684093,1.965775,-1.232622,-0.208038,-0.1083,123.5,0
4,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,0.753074,...,0.538196,-1.11967,-0.451449,-0.237033,-0.038195,0.803487,0.408542,-0.009431,69.99,0


## Feature Scaling
### Normalising the Amount range using StandardScaler

In [7]:
from sklearn.preprocessing import StandardScaler
sc= StandardScaler()

In [8]:
df['Scaled_Amount']= sc.fit_transform(df['Amount'].values.reshape(-1,1))

In [9]:
df = df.drop(columns=['Amount'])  # axis=1 is redundant when columns= is given

In [10]:
df.head()

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V12,V14,V16,V17,V18,V19,V20,V21,Class,Scaled_Amount
0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,0.090794,...,-0.617801,-0.311169,-0.470401,0.207971,0.025791,0.403993,0.251412,-0.018307,0,0.244964
1,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,-0.166974,...,1.065235,-0.143772,0.463917,-0.114805,-0.183361,-0.145783,-0.069083,-0.225775,0,-0.342475
2,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,0.207643,...,0.066084,-0.165946,-2.890083,1.109969,-0.121359,-2.261857,0.52498,0.247998,0,1.160686
3,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,-0.054952,...,0.178228,-0.287924,-1.059647,-0.684093,1.965775,-1.232622,-0.208038,-0.1083,0,0.140534
4,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,0.753074,...,0.538196,-1.11967,-0.451449,-0.237033,-0.038195,0.803487,0.408542,-0.009431,0,-0.073403
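As a small illustration of what `StandardScaler` does, the sketch below standardises the five `Amount` values shown in the first `df.head()` output and checks the result against a manual z-score. This is only a toy example: the notebook fits the scaler on the full `Amount` column, so the numbers here will not match the `Scaled_Amount` values above.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# First five Amount values from df.head() (illustration only; the notebook
# fits the scaler on the whole column, so its output differs)
amounts = np.array([149.62, 2.69, 378.66, 123.5, 69.99]).reshape(-1, 1)

scaler = StandardScaler()
scaled = scaler.fit_transform(amounts)

# StandardScaler subtracts the mean and divides by the population
# standard deviation (ddof=0), so a manual z-score should agree
manual = (amounts - amounts.mean()) / amounts.std(ddof=0)

print("matches manual z-score:", np.allclose(scaled, manual))
print("scaled values:", scaled.ravel().round(3))
```

One design point worth considering: fitting the scaler on the full dataset before splitting lets test-set statistics influence the training features. A stricter variant would fit on the training split and only call `transform` on the test split.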


# Splitting the Data

In [11]:
x= df.loc[:,df.columns != 'Class']
y=df['Class']

In [12]:
x.head()

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12,V14,V16,V17,V18,V19,V20,V21,Scaled_Amount
0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,0.090794,-0.5516,-0.617801,-0.311169,-0.470401,0.207971,0.025791,0.403993,0.251412,-0.018307,0.244964
1,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,-0.166974,1.612727,1.065235,-0.143772,0.463917,-0.114805,-0.183361,-0.145783,-0.069083,-0.225775,-0.342475
2,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,0.207643,0.624501,0.066084,-0.165946,-2.890083,1.109969,-0.121359,-2.261857,0.52498,0.247998,1.160686
3,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,-0.054952,-0.226487,0.178228,-0.287924,-1.059647,-0.684093,1.965775,-1.232622,-0.208038,-0.1083,0.140534
4,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,0.753074,-0.822843,0.538196,-1.11967,-0.451449,-0.237033,-0.038195,0.803487,0.408542,-0.009431,-0.073403


In [13]:
y.head()

0    0
1    0
2    0
3    0
4    0
Name: Class, dtype: int64

In [14]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=0)
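With so few positive cases, a purely random split can leave the test set with slightly more or fewer frauds than expected. A possible refinement, not used in the cell above, is the `stratify` argument of `train_test_split`, which keeps the class ratio the same in both splits. The sketch below demonstrates it on synthetic labels; `X_demo` and `y_demo` are hypothetical stand-ins, not the notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, heavily imbalanced labels: 10 positives out of 1000 (hypothetical)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.zeros(1000, dtype=int)
y_demo[:10] = 1

# stratify=y_demo preserves the 1% positive rate in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

print("positives in train:", y_tr.sum(), "of", len(y_tr))
print("positives in test: ", y_te.sum(), "of", len(y_te))
```

Applied to this notebook, the equivalent call would pass `stratify=y` alongside the existing `test_size` and `random_state` arguments.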

# Applying machine learning algorithms
<b>Creating a helper function that prints the following output for any fitted model</b>

In [15]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

def print_score(clf, x_train, y_train, x_test, y_test, train=True):
    if train:
        pred = clf.predict(x_train)
        clf_report = pd.DataFrame(classification_report(y_train, pred, output_dict=True))
        print("Train Result:\n================================================")
        print(f"Accuracy Score: {accuracy_score(y_train, pred) * 100:.2f}%")
        print("_______________________________________________")
        print(f"CLASSIFICATION REPORT:\n{clf_report}")
        print("_______________________________________________")
        print(f"Confusion Matrix: \n {confusion_matrix(y_train, pred)}\n")
        
    else:
        pred = clf.predict(x_test)
        clf_report = pd.DataFrame(classification_report(y_test, pred, output_dict=True))
        print("Test Result:\n================================================")        
        print(f"Accuracy Score: {accuracy_score(y_test, pred) * 100:.2f}%")
        print("_______________________________________________")
        print(f"CLASSIFICATION REPORT:\n{clf_report}")
        print("_______________________________________________")
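To show how the numbers in the classification report relate to the confusion matrix, the sketch below recomputes precision, recall, F1 and accuracy for class 1 by hand. The matrix is the training confusion matrix reported for logistic regression in this notebook; the variable names are illustrative.

```python
import numpy as np

# Training confusion matrix reported for logistic regression in this notebook
# (rows = true class, columns = predicted class)
cm = np.array([[227423, 31],
               [152, 239]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)   # of the transactions flagged as fraud, the share that are fraud
recall = tp / (tp + fn)      # of the actual frauds, the share that were flagged
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / cm.sum()

print(f"precision={precision:.6f}")
print(f"recall={recall:.6f}")
print(f"f1={f1:.6f}")
print(f"accuracy={accuracy:.6f}")
```

The accuracy is dominated by the 227,423 true negatives, while recall shows that 152 of the 391 training frauds are missed. This is why the class 1 columns of the report are the ones to watch.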

# Logistic Regression

In [16]:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(random_state=0)

clf.fit(x_train,y_train)
print_score(clf, x_train, y_train, x_test, y_test, train=True)
print_score(clf, x_train, y_train, x_test, y_test, train=False)

Train Result:
Accuracy Score: 99.92%
_______________________________________________
CLASSIFICATION REPORT:
                       0           1  accuracy      macro avg   weighted avg
precision       0.999332    0.885185  0.999197       0.942259       0.999136
recall          0.999864    0.611253  0.999197       0.805558       0.999197
f1-score        0.999598    0.723147  0.999197       0.861372       0.999123
support    227454.000000  391.000000  0.999197  227845.000000  227845.000000
_______________________________________________
Confusion Matrix: 
 [[227423     31]
 [   152    239]]

Test Result:
Accuracy Score: 99.92%
_______________________________________________
CLASSIFICATION REPORT:
                      0           1  accuracy     macro avg  weighted avg
precision      0.999314    0.885714  0.999175      0.942514      0.999113
recall         0.999859    0.613861  0.999175      0.806860      0.999175
f1-score       0.999587    0.725146  0.999175      0.862367      0.999100
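The recall of about 0.61 on class 1 suggests the default model favours the majority class. One adjustment technique that could be tried, which is not part of the run above, is `class_weight='balanced'`, which reweights the loss inversely to class frequency. The sketch below compares the two settings on synthetic data generated with `make_classification`; the dataset, sizes and variable names are assumptions for illustration, and the results say nothing about how the real data would behave.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: roughly 1% positives (hypothetical)
X_demo, y_demo = make_classification(
    n_samples=20000, n_features=10, n_informative=5,
    weights=[0.99, 0.01], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

results = {}
for label, weight in [("default", None), ("balanced", "balanced")]:
    # class_weight="balanced" scales each class by n_samples / (n_classes * class_count)
    model = LogisticRegression(class_weight=weight, max_iter=1000, random_state=0)
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    results[label] = {
        "recall": recall_score(y_te, pred),
        "precision": precision_score(y_te, pred, zero_division=0),
    }
    print(label, results[label])
```

Balanced weighting typically trades precision for recall, so whether it helps depends on the relative cost of a missed fraud versus a false alarm.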


# Decision Tree

In [17]:
from sklearn.tree import DecisionTreeClassifier
clf= DecisionTreeClassifier(random_state=0)

clf.fit(x_train,y_train)
print_score(clf, x_train, y_train, x_test, y_test, train=True)
print_score(clf, x_train, y_train, x_test, y_test, train=False)

Train Result:
Accuracy Score: 100.00%
_______________________________________________
CLASSIFICATION REPORT:
                  0      1  accuracy  macro avg  weighted avg
precision       1.0    1.0       1.0        1.0           1.0
recall          1.0    1.0       1.0        1.0           1.0
f1-score        1.0    1.0       1.0        1.0           1.0
support    227454.0  391.0       1.0   227845.0      227845.0
_______________________________________________
Confusion Matrix: 
 [[227454      0]
 [     0    391]]

Test Result:
Accuracy Score: 99.93%
_______________________________________________
CLASSIFICATION REPORT:
                      0           1  accuracy     macro avg  weighted avg
precision      0.999560    0.826087   0.99928      0.912824      0.999253
recall         0.999719    0.752475   0.99928      0.876097      0.999280
f1-score       0.999640    0.787565   0.99928      0.893602      0.999263
support    56861.000000  101.000000   0.99928  56962.000000  56962.000000


# Random Forest

In [18]:
from sklearn.ensemble import RandomForestClassifier
clf_rf= RandomForestClassifier(random_state=0)

clf_rf.fit(x_train,y_train)
print_score(clf_rf, x_train, y_train, x_test, y_test, train=True)
print_score(clf_rf, x_train, y_train, x_test, y_test, train=False)

Train Result:
Accuracy Score: 100.00%
_______________________________________________
CLASSIFICATION REPORT:
                  0      1  accuracy  macro avg  weighted avg
precision       1.0    1.0       1.0        1.0           1.0
recall          1.0    1.0       1.0        1.0           1.0
f1-score        1.0    1.0       1.0        1.0           1.0
support    227454.0  391.0       1.0   227845.0      227845.0
_______________________________________________
Confusion Matrix: 
 [[227454      0]
 [     0    391]]

Test Result:
Accuracy Score: 99.95%
_______________________________________________
CLASSIFICATION REPORT:
                      0           1  accuracy     macro avg  weighted avg
precision      0.999631    0.930233  0.999526      0.964932      0.999508
recall         0.999894    0.792079  0.999526      0.895987      0.999526
f1-score       0.999763    0.855615  0.999526      0.927689      0.999507
support    56861.000000  101.000000  0.999526  56962.000000  56962.000000
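All the reports above use the default 0.5 decision threshold. For imbalanced problems, a threshold-free summary such as average precision (the area under the precision-recall curve) is a common complement. The sketch below shows how it could be computed from `predict_proba` on a random forest; it runs on synthetic data, and the dataset and variable names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score, precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (hypothetical stand-in for the notebook's x and y)
X_demo, y_demo = make_classification(
    n_samples=5000, n_features=10, n_informative=5,
    weights=[0.98, 0.02], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X_tr, y_tr)

# Probability of the positive class, used instead of hard 0/1 predictions
scores = forest.predict_proba(X_te)[:, 1]

ap = average_precision_score(y_te, scores)
precision, recall, thresholds = precision_recall_curve(y_te, scores)

print(f"average precision: {ap:.3f}")
print(f"number of candidate thresholds: {len(thresholds)}")
```

In this notebook the same calls would take `clf_rf.predict_proba(x_test)[:, 1]` and `y_test`; the `precision` and `recall` arrays can then be inspected to pick a threshold that matches the business cost of missed frauds.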


# SVM

In [19]:
from sklearn.svm import SVC

# Each kernel is fitted from scratch on the full training set
print("=======================Linear Kernel SVM==========================")
model = SVC(kernel='linear')
model.fit(x_train, y_train)
print_score(model, x_train, y_train, x_test, y_test, train=True)
print_score(model, x_train, y_train, x_test, y_test, train=False)

print("=======================Polynomial Kernel SVM==========================")
model = SVC(kernel='poly', degree=2, gamma='auto')
model.fit(x_train, y_train)
print_score(model, x_train, y_train, x_test, y_test, train=True)
print_score(model, x_train, y_train, x_test, y_test, train=False)

print("=======================Radial Kernel SVM==========================")
model = SVC(kernel='rbf', gamma=1)
model.fit(x_train, y_train)
print_score(model, x_train, y_train, x_test, y_test, train=True)
print_score(model, x_train, y_train, x_test, y_test, train=False)


Train Result:
Accuracy Score: 99.94%
_______________________________________________
CLASSIFICATION REPORT:
                       0           1  accuracy      macro avg   weighted avg
precision       0.999631    0.841096  0.999377       0.920363       0.999359
recall          0.999745    0.785166  0.999377       0.892456       0.999377
f1-score        0.999688    0.812169  0.999377       0.905929       0.999366
support    227454.000000  391.000000  0.999377  227845.000000  227845.000000
_______________________________________________
Confusion Matrix: 
 [[227396     58]
 [    84    307]]

Test Result:
Accuracy Score: 99.94%
_______________________________________________
CLASSIFICATION REPORT:
                      0           1  accuracy     macro avg  weighted avg
precision      0.999666    0.820000   0.99935      0.909833      0.999347
recall         0.999683    0.811881   0.99935      0.905782      0.999350
f1-score       0.999675    0.815920   0.99935      0.907798      0.999349


# Feature Selection

In [20]:
from sklearn.feature_selection import RFE # Recursive Feature Elimination
from sklearn.linear_model import LogisticRegression


In [22]:
# Reducing the 20 features to the 4 most informative ones
clf = LogisticRegression(random_state=0)
rfe = RFE(clf, n_features_to_select=4)  # keyword argument required by recent scikit-learn
rfe = rfe.fit(x_train, y_train)


In [23]:
# Summarise the selected attributes:
# which columns are selected (True) and which are not (False)
print(rfe.support_)

[False False False  True False False False False False  True False False
  True  True False False False False False False]


In [24]:
x_train.columns[rfe.support_]

Index(['V4', 'V10', 'V14', 'V16'], dtype='object')

In [25]:
rfe.ranking_

array([15,  8, 11,  1,  5, 10, 17,  2, 13,  1,  4, 16,  1,  1,  7, 14, 12,
        6,  3,  9])
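The raw `ranking_` array is easier to read when paired with the feature names. The sketch below does this using the column order shown by `x.head()` and the ranking printed above, both copied by hand so the example runs on its own; rank 1 marks the selected features and larger ranks were eliminated earlier.

```python
import pandas as pd

# Column order of x_train as displayed by x.head() earlier in this notebook
feature_names = (
    [f"V{i}" for i in range(1, 13)]
    + ["V14", "V16", "V17", "V18", "V19", "V20", "V21", "Scaled_Amount"]
)

# rfe.ranking_ as printed above
ranking = [15, 8, 11, 1, 5, 10, 17, 2, 13, 1, 4, 16, 1, 1, 7, 14, 12, 6, 3, 9]

# mergesort is stable, so tied features keep their original column order
ranked = pd.Series(ranking, index=feature_names, name="rank").sort_values(kind="mergesort")
selected = ranked[ranked == 1].index.tolist()

print("selected:", selected)
print(ranked.head(8))
```

Inside the notebook the same table could be built directly with `pd.Series(rfe.ranking_, index=x_train.columns)`, which avoids copying the values by hand.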

In [29]:
# Predicting Test Set
clf.fit(x_train[x_train.columns[rfe.support_]],y_train)
y_pred = clf.predict(x_test[x_test.columns[rfe.support_]])

In [30]:
# Confusion matrix and accuracy for the reduced (4-feature) model
from sklearn.metrics import confusion_matrix, accuracy_score
cm= confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)

[[56853     8]
 [   42    59]]


0.9991222218320986

In [33]:
# Analyzing Coefficients
pd.concat([pd.DataFrame(x_train[x_train.columns[rfe.support_]].columns, columns = ["features"]),
           pd.DataFrame(np.transpose(clf.coef_), columns = ["coef"])
           ],axis = 1)


Unnamed: 0,features,coef
0,V4,0.483711
1,V10,-0.28624
2,V14,-0.769235
3,V16,-0.293106
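Logistic regression coefficients act on the log-odds scale, so exponentiating them gives odds ratios: the factor by which the odds of fraud change for a one-unit increase in a feature, with the other features held fixed. The sketch below applies this to the coefficients in the table above, copied by hand. Because the `V` features are anonymized, only the direction and relative size of each effect can be interpreted.

```python
import numpy as np
import pandas as pd

# Coefficients copied from the table above
coefs = pd.Series(
    {"V4": 0.483711, "V10": -0.28624, "V14": -0.769235, "V16": -0.293106},
    name="coef",
)

# exp(coef) > 1 raises the odds of fraud, exp(coef) < 1 lowers them
odds_ratio = np.exp(coefs).rename("odds_ratio")

summary = pd.concat([coefs, odds_ratio], axis=1)
print(summary.round(3))
```

Under this reading, higher `V4` values push a transaction towards the fraud class, while higher `V10`, `V14` and `V16` values push it away, with `V14` having the strongest effect of the four.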
