# CREDIT CARD FRAUD DETECTION

<strong> Problem Statement </strong>
In this project we identify fraudulent credit card transactions. Our objective is to build a fraud detection system using machine learning techniques. In the past, such systems were rule-based; machine learning lets the detection rules be learned from labelled transaction data instead of being written by hand.

The project uses a dataset of 284,807 fully anonymized transactions. Each transaction is labelled either fraudulent or not fraudulent. The prevalence of fraudulent transactions is very low: only 492 of them, about 0.17%, are fraudulent. This means that a system predicting every transaction to be normal reaches an accuracy of over 99.8% despite not detecting a single fraudulent transaction. Accuracy alone is therefore misleading here, and the class imbalance has to be addressed with adjustment techniques such as resampling.
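
As a quick check of the baseline-accuracy argument, the arithmetic can be sketched with the class counts that `df.Class.value_counts()` reports later in this notebook (284,315 non-fraudulent and 492 fraudulent rows). The variable names are illustrative only.

```python
# Class counts as reported by df.Class.value_counts() later in this notebook
n_legit = 284_315
n_fraud = 492
n_total = n_legit + n_fraud

# Share of transactions that are fraudulent
fraud_rate = n_fraud / n_total

# Accuracy of a trivial model that labels every transaction as non-fraudulent
baseline_accuracy = n_legit / n_total

print(f"Total transactions: {n_total}")
print(f"Fraud prevalence:   {fraud_rate:.4%}")
print(f"Baseline accuracy:  {baseline_accuracy:.4%}")
```

Any model has to beat this baseline on metrics other than accuracy, such as recall on the fraudulent class, to be useful.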

The dataset is a CSV file with 31 columns: 30 features plus a final column, Class, that labels each transaction as fraudulent or not.

<b>Case Study:</b> Fraud detection is important to e-commerce stores, which spend a lot of money on prevention.
We have an e-commerce store that sells books. Thousands of books were sold last year, and today we will use transaction data to build a fraud detection system. We will use a publicly available dataset.

<b> Business Problem:</b> Build a classifier that, given a new transaction, can say with confidence whether it is fraudulent or not.<br>
<b> Outcome: </b>
0: Non-Fraudulent <br>
1: Fraudulent

# Importing the libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Importing the dataset

In [2]:
df = pd.read_csv('creditcard.csv')

In [3]:
df.columns

Index(['Time', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10',
       'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20',
       'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount',
       'Class'],
      dtype='object')

# Data Pre Processing

<b>Dropping the columns that EDA showed to carry no useful information.</b>

In [4]:
df= df.drop(columns=['Time','V13','V15','V22','V23','V24','V25','V26','V27','V28'])

In [5]:
df.head()

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V12,V14,V16,V17,V18,V19,V20,V21,Amount,Class
0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,0.090794,...,-0.617801,-0.311169,-0.470401,0.207971,0.025791,0.403993,0.251412,-0.018307,149.62,0
1,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,-0.166974,...,1.065235,-0.143772,0.463917,-0.114805,-0.183361,-0.145783,-0.069083,-0.225775,2.69,0
2,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,0.207643,...,0.066084,-0.165946,-2.890083,1.109969,-0.121359,-2.261857,0.52498,0.247998,378.66,0
3,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,-0.054952,...,0.178228,-0.287924,-1.059647,-0.684093,1.965775,-1.232622,-0.208038,-0.1083,123.5,0
4,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,0.753074,...,0.538196,-1.11967,-0.451449,-0.237033,-0.038195,0.803487,0.408542,-0.009431,69.99,0


## Feature Scaling
### Normalising the amount range using Standard Scaler

In [6]:
from sklearn.preprocessing import StandardScaler
sc= StandardScaler()

In [7]:
df['Scaled_Amount']= sc.fit_transform(df['Amount'].values.reshape(-1,1))

In [8]:
df = df.drop(columns=['Amount'])  # `columns=` already implies axis=1
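
To make the scaling step concrete, here is a minimal sketch of what `StandardScaler` computes, using only the five `Amount` values visible in `df.head()` above. Because the scaler here is fitted on those five values rather than on the full column, the numbers will not match the `Scaled_Amount` values in the notebook output.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# The five Amount values shown in df.head() above (illustration only)
amount = np.array([149.62, 2.69, 378.66, 123.5, 69.99]).reshape(-1, 1)

# What the notebook does: fit the scaler and transform in one step
scaled = StandardScaler().fit_transform(amount)

# The same result by hand: subtract the mean, divide by the
# population standard deviation (ddof=0, NumPy's default)
manual = (amount - amount.mean()) / amount.std()

print(np.allclose(scaled, manual))
print(scaled.ravel())
```

After scaling, the column has zero mean and unit variance, which keeps `Amount` on a scale comparable to the other features.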

In [9]:
df.head()

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V12,V14,V16,V17,V18,V19,V20,V21,Class,Scaled_Amount
0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,0.090794,...,-0.617801,-0.311169,-0.470401,0.207971,0.025791,0.403993,0.251412,-0.018307,0,0.244964
1,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,-0.166974,...,1.065235,-0.143772,0.463917,-0.114805,-0.183361,-0.145783,-0.069083,-0.225775,0,-0.342475
2,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,0.207643,...,0.066084,-0.165946,-2.890083,1.109969,-0.121359,-2.261857,0.52498,0.247998,0,1.160686
3,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,-0.054952,...,0.178228,-0.287924,-1.059647,-0.684093,1.965775,-1.232622,-0.208038,-0.1083,0,0.140534
4,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,0.753074,...,0.538196,-1.11967,-0.451449,-0.237033,-0.038195,0.803487,0.408542,-0.009431,0,-0.073403


# Splitting the Data

In [10]:
x= df.loc[:,df.columns != 'Class']
y=df['Class']

In [11]:
x.head()

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12,V14,V16,V17,V18,V19,V20,V21,Scaled_Amount
0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,0.090794,-0.5516,-0.617801,-0.311169,-0.470401,0.207971,0.025791,0.403993,0.251412,-0.018307,0.244964
1,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,-0.166974,1.612727,1.065235,-0.143772,0.463917,-0.114805,-0.183361,-0.145783,-0.069083,-0.225775,-0.342475
2,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,0.207643,0.624501,0.066084,-0.165946,-2.890083,1.109969,-0.121359,-2.261857,0.52498,0.247998,1.160686
3,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,-0.054952,-0.226487,0.178228,-0.287924,-1.059647,-0.684093,1.965775,-1.232622,-0.208038,-0.1083,0.140534
4,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,0.753074,-0.822843,0.538196,-1.11967,-0.451449,-0.237033,-0.038195,0.803487,0.408542,-0.009431,-0.073403


In [12]:
y.head()

0    0
1    0
2    0
3    0
4    0
Name: Class, dtype: int64

In [13]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=0)
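
With so few fraudulent rows, a purely random split can leave the train and test sets with slightly different fraud rates. One option, shown here as a sketch on synthetic data and not as what this notebook ran, is the `stratify` argument of `train_test_split`, which keeps the class proportions equal in both parts. The arrays `X` and `y` below are made-up stand-ins.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 1,000 rows, 2% positives
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = np.zeros(1000, dtype=int)
y[:20] = 1

# stratify=y preserves the 2% positive rate in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

print("train positive rate:", y_tr.mean())
print("test positive rate: ", y_te.mean())
```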

# Applying machine learning algorithms
<b>Creating a helper function that prints the following evaluation output for any fitted model.</b>

In [14]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

def print_score(clf, x_train, y_train, x_test, y_test, train=True):
    if train:
        pred = clf.predict(x_train)
        clf_report = pd.DataFrame(classification_report(y_train, pred, output_dict=True))
        print("Train Result:\n================================================")
        print(f"Accuracy Score: {accuracy_score(y_train, pred) * 100:.2f}%")
        print("_______________________________________________")
        print(f"CLASSIFICATION REPORT:\n{clf_report}")
        print("_______________________________________________")
        print(f"Confusion Matrix: \n {confusion_matrix(y_train, pred)}\n")
        
    else:
        pred = clf.predict(x_test)
        clf_report = pd.DataFrame(classification_report(y_test, pred, output_dict=True))
        print("Test Result:\n================================================")        
        print(f"Accuracy Score: {accuracy_score(y_test, pred) * 100:.2f}%")
        print("_______________________________________________")
        print(f"CLASSIFICATION REPORT:\n{clf_report}")
        print("_______________________________________________")

# LOGISTIC REGRESSION

In [15]:
from sklearn.linear_model import LogisticRegression
clf_LR = LogisticRegression(random_state=0)

clf_LR.fit(x_train,y_train)
print_score(clf_LR, x_train, y_train, x_test, y_test, train=True)
print_score(clf_LR, x_train, y_train, x_test, y_test, train=False)

Train Result:
Accuracy Score: 99.92%
_______________________________________________
CLASSIFICATION REPORT:
                       0           1  accuracy      macro avg   weighted avg
precision       0.999332    0.885185  0.999197       0.942259       0.999136
recall          0.999864    0.611253  0.999197       0.805558       0.999197
f1-score        0.999598    0.723147  0.999197       0.861372       0.999123
support    227454.000000  391.000000  0.999197  227845.000000  227845.000000
_______________________________________________
Confusion Matrix: 
 [[227423     31]
 [   152    239]]

Test Result:
Accuracy Score: 99.92%
_______________________________________________
CLASSIFICATION REPORT:
                      0           1  accuracy     macro avg  weighted avg
precision      0.999314    0.885714  0.999175      0.942514      0.999113
recall         0.999859    0.613861  0.999175      0.806860      0.999175
f1-score       0.999587    0.725146  0.999175      0.862367      0.999100
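
The class-1 figures in the training report can be recomputed by hand from the training confusion matrix printed above, which makes clear why accuracy looks excellent while recall on fraud is only about 61%. This is a small sketch using the numbers from that output.

```python
import numpy as np

# Training confusion matrix from the output above:
# rows = actual class (0, 1), columns = predicted class (0, 1)
cm = np.array([[227423, 31],
               [152, 239]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)   # of the predicted frauds, how many were real
recall = tp / (tp + fn)      # of the real frauds, how many were caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / cm.sum()

print(f"precision={precision:.6f} recall={recall:.6f} "
      f"f1={f1:.6f} accuracy={accuracy:.6f}")
```

The 152 missed frauds barely move the accuracy because they are dwarfed by the 227,423 correctly classified normal transactions.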


# Feature Selection

In [16]:
from sklearn.feature_selection import RFE # Recursive Feature Elimination
from sklearn.linear_model import LogisticRegression


In [17]:
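# Reducing the features from 20 to 4
clf = LogisticRegression(random_state=0)
# n_features_to_select must be passed by keyword in recent scikit-learn versions
rfe = RFE(clf, n_features_to_select=4)
rfe = rfe.fit(x_train, y_train)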




In [18]:
# summarise the selection attributes
# which columns are selected(True) and which are not (False)
print(rfe.support_)

[False False False  True False False False False False  True False False
  True  True False False False False False False]


In [19]:
x_train.columns[rfe.support_]

Index(['V4', 'V10', 'V14', 'V16'], dtype='object')

In [20]:
rfe.ranking_

array([15,  8, 11,  1,  5, 10, 17,  2, 13,  1,  4, 16,  1,  1,  7, 14, 12,
        6,  3,  9])
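
To show how `support_` and `ranking_` relate, here is a self-contained sketch of RFE on synthetic data from `make_classification`; the dataset and its parameters are assumptions chosen only to mirror the 20-to-4 reduction above.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data with 20 features, 4 of them informative (illustration only)
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=4, random_state=0
)

# Eliminate one feature per round until 4 remain
selector = RFE(
    LogisticRegression(random_state=0, max_iter=1000),
    n_features_to_select=4,
)
selector.fit(X, y)

# support_ is True for the kept features; those features have rank 1.
# Eliminated features are ranked 2, 3, ... in reverse order of removal.
print(selector.support_.sum())
print(selector.ranking_)
```

With 20 features, 4 kept, and one feature removed per step, the ranks run from 1 to 17, which matches the range seen in the `rfe.ranking_` output above.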

In [21]:
# Predicting Test Set
clf.fit(x_train[x_train.columns[rfe.support_]],y_train)
y_pred = clf.predict(x_test[x_test.columns[rfe.support_]])

In [22]:
## Making the confusion Matrix
from sklearn.metrics import confusion_matrix, accuracy_score
cm= confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)

[[56853     8]
 [   42    59]]


0.9991222218320986

In [23]:
# Analyzing Coefficients
pd.concat([pd.DataFrame(x_train[x_train.columns[rfe.support_]].columns, columns = ["features"]),
           pd.DataFrame(np.transpose(clf.coef_), columns = ["coef"])
           ],axis = 1)


Unnamed: 0,features,coef
0,V4,0.483711
1,V10,-0.28624
2,V14,-0.769235
3,V16,-0.293106


# K Fold Validation

### Logistic Regression

In [24]:
from sklearn.model_selection import cross_val_score

In [25]:
accuracies= cross_val_score(estimator=clf_LR,
                           X=x_train,
                           y=y_train,
                           cv=5)

In [26]:
# shows the accuracy of each of the 5 folds
accuracies

array([0.99912221, 0.99918804, 0.99920999, 0.99929777, 0.99905638])

In [27]:
accuracies.mean()

0.999174877658057

<b>
The accuracy is so high because the dataset is imbalanced: the non-fraudulent class dominates, so predicting it almost everywhere already scores well.
</b>

In [28]:
df.Class.value_counts()

0    284315
1       492
Name: Class, dtype: int64
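
Since accuracy is dominated by the majority class, cross-validation can also be scored on recall or F1 for the fraudulent class. The sketch below uses synthetic imbalanced data and `StratifiedKFold`; both are assumptions for illustration and are not what the cells above ran.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced data: roughly 2% positives (illustration only)
X, y = make_classification(
    n_samples=5000, n_features=10, weights=[0.98, 0.02], random_state=0
)

model = LogisticRegression(max_iter=1000, random_state=0)
# Stratified folds keep the positive rate similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

acc = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
rec = cross_val_score(model, X, y, cv=cv, scoring="recall")

print("accuracy per fold:", acc.round(4))
print("recall per fold:  ", rec.round(4))
```

Comparing the two arrays side by side shows how much of the positive class the model actually finds, which the accuracy figures alone hide.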

## Handling the data imbalance

In [29]:
# create the training df by rejoining x_train and y_train
df_train = x_train.join(y_train)
df_train.sample(10)

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V12,V14,V16,V17,V18,V19,V20,V21,Scaled_Amount,Class
133981,1.47082,-1.093877,0.586213,-1.427984,-1.484319,-0.449748,-1.12035,-0.197704,-1.783109,1.369121,...,-0.310096,-0.811774,-0.056602,0.096182,0.311133,-0.259984,-0.138396,-0.058275,-0.166119,0
202199,-0.303567,0.342453,1.39391,-0.427824,0.12005,0.162231,0.286003,0.078517,1.044161,-0.390909,...,-1.012398,-0.171312,-0.171866,-0.415294,0.650071,0.112403,0.092123,0.223831,-0.277746,0
30388,1.387402,-1.2477,0.658385,-1.476512,-1.620395,-0.258452,-1.263674,-0.014959,-1.919455,1.558467,...,-0.041404,-0.417492,0.125204,-0.027769,0.855301,-0.045213,-0.165782,0.004266,-0.113344,0
84447,-0.957317,-3.729433,-1.130661,1.100066,-1.717646,-0.533478,1.488717,-0.349142,0.091474,-0.498225,...,0.204622,0.960795,0.070022,0.007324,0.111324,0.159097,2.118664,0.647212,4.212585,0
162607,0.044414,0.916702,0.281566,-0.617775,0.546759,-0.986355,1.039814,-0.219661,-0.155303,-0.43957,...,0.539467,-0.03784,-0.167457,-0.421102,-0.90861,-0.133821,0.001114,-0.251856,-0.349671,0
32463,1.218485,-0.98821,0.998254,-0.704519,-1.576072,-0.089468,-1.303595,0.317841,-0.367926,0.779892,...,-0.883389,0.202889,1.423518,0.275562,-0.616294,0.082365,-0.01081,0.467524,-0.193346,0
116732,-0.266467,0.956125,1.198786,-0.510644,0.893077,0.262564,0.850658,-0.114955,-0.584,0.037962,...,0.693954,-0.173463,0.655972,-1.305537,0.422226,0.812004,0.296084,-0.30414,-0.338996,0
122560,1.163792,0.212417,0.540331,1.315578,-0.246162,-0.217715,-0.023605,0.023107,0.266675,-0.085484,...,0.320393,0.196921,-0.395852,0.085578,-0.884918,-0.482589,-0.200455,-0.185168,-0.325643,0
224548,2.272594,-1.535335,-0.69349,-1.625756,-1.516127,-0.451512,-1.457541,0.028303,-1.053188,1.770232,...,-0.938033,-0.085696,-0.087821,0.158394,0.779032,0.201567,-0.526074,-0.134616,-0.325243,0
6568,-0.384234,1.015288,1.359077,0.240638,0.07604,-0.530719,0.366273,0.126172,1.261815,-0.852032,...,-3.816768,1.526462,0.176455,1.160916,-0.010953,-0.635733,-0.121413,-0.41545,-0.345313,0


In [30]:
from sklearn.utils import resample

# Separate majority and minority classes
df_majority = df_train[df_train.Class==0]
df_minority = df_train[df_train.Class==1]

print(df_majority.Class.count())
print("-----------")
print(df_minority.Class.count())
print("-----------")
print(df_train.Class.value_counts())

227454
-----------
391
-----------
0    227454
1       391
Name: Class, dtype: int64


## UpSampling the Data

In [31]:
# Upsample minority class
df_minority_upsampled = resample(df_minority, 
                                 replace=True,     # sample with replacement
                                 n_samples=len(df_majority),    # to match majority class (227454)
                                 random_state=587) # reproducible results
# Combine majority class with upsampled minority class
df_upsampled = pd.concat([df_majority, df_minority_upsampled])
# Display new class counts
df_upsampled.Class.value_counts()

1    227454
0    227454
Name: Class, dtype: int64

In [32]:
x_upsampled = df_upsampled.drop(['Class'], axis= 1)
y_upsampled = df_upsampled.Class

## Logistic Regression on UpSampled Data

In [33]:
from sklearn.linear_model import LogisticRegression
clf_LR = LogisticRegression(random_state=0)

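# Fit on the upsampled training set. The output recorded below matches the
# baseline model because that run was fitted on the original x_train, y_train.
clf_LR.fit(x_upsampled, y_upsampled)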
print_score(clf_LR, x_upsampled, y_upsampled, x_test, y_test, train=True)
print_score(clf_LR, x_upsampled, y_upsampled, x_test, y_test, train=False)

Train Result:
Accuracy Score: 80.52%
_______________________________________________
CLASSIFICATION REPORT:
                       0              1  accuracy      macro avg  \
precision       0.719716       0.999777   0.80524       0.859746   
recall          0.999864       0.610616   0.80524       0.805240   
f1-score        0.836969       0.758175   0.80524       0.797572   
support    227454.000000  227454.000000   0.80524  454908.000000   

            weighted avg  
precision       0.859746  
recall          0.805240  
f1-score        0.797572  
support    454908.000000  
_______________________________________________
Confusion Matrix: 
 [[227423     31]
 [ 88567 138887]]

Test Result:
Accuracy Score: 99.92%
_______________________________________________
CLASSIFICATION REPORT:
                      0           1  accuracy     macro avg  weighted avg
precision      0.999314    0.885714  0.999175      0.942514      0.999113
recall         0.999859    0.613861  0.999175      0.8068
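
For upsampling to have any effect, the model has to be fitted on the resampled training data while the test set stays untouched. The following sketch shows that workflow end to end on synthetic data; the dataset, the variable names and the choice of recall as the comparison metric are all assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic imbalanced data: roughly 3% positives (illustration only)
X, y = make_classification(
    n_samples=5000, n_features=10, weights=[0.97, 0.03], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Upsample the minority class of the training set only
X_maj, y_maj = X_tr[y_tr == 0], y_tr[y_tr == 0]
X_min, y_min = X_tr[y_tr == 1], y_tr[y_tr == 1]
X_min_up, y_min_up = resample(
    X_min, y_min, replace=True, n_samples=len(y_maj), random_state=0
)
X_up = np.vstack([X_maj, X_min_up])
y_up = np.concatenate([y_maj, y_min_up])

# Baseline: fitted on the original training data
baseline = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
# Resampled: fitted on the upsampled training data
upsampled = LogisticRegression(max_iter=1000).fit(X_up, y_up)

# Both models are evaluated on the same untouched test set
recall_base = recall_score(y_te, baseline.predict(X_te))
recall_up = recall_score(y_te, upsampled.predict(X_te))
print(f"test recall, baseline:  {recall_base:.3f}")
print(f"test recall, upsampled: {recall_up:.3f}")
```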

## Down-Sampling the Data

In [34]:
# Downsample majority class
df_majority_downsampled = resample(df_majority, 
                                 replace=False,    # sample without replacement
                                 n_samples=len(df_minority),     # to match minority class (391)
                                 random_state=24) # reproducible results
# Combine minority class with downsampled majority class
df_downsampled = pd.concat([df_majority_downsampled, df_minority])
# Display new class counts
df_downsampled.Class.value_counts()

1    391
0    391
Name: Class, dtype: int64

In [35]:
x_downsampled = df_downsampled.drop(['Class'], axis = 1)
y_downsampled = df_downsampled.Class

## Logistic Regression on Down-Sampled Data

In [36]:
from sklearn.linear_model import LogisticRegression
clf_LR = LogisticRegression(random_state=0)

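# Fit on the down-sampled training set. The output recorded below matches the
# baseline model because that run was fitted on the original x_train, y_train.
clf_LR.fit(x_downsampled, y_downsampled)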
print_score(clf_LR, x_downsampled, y_downsampled, x_test, y_test, train=True)
print_score(clf_LR, x_downsampled, y_downsampled, x_test, y_test, train=False)

Train Result:
Accuracy Score: 80.56%
_______________________________________________
CLASSIFICATION REPORT:
                    0           1  accuracy   macro avg  weighted avg
precision    0.720074    1.000000  0.805627    0.860037      0.860037
recall       1.000000    0.611253  0.805627    0.805627      0.805627
f1-score     0.837259    0.758730  0.805627    0.797995      0.797995
support    391.000000  391.000000  0.805627  782.000000    782.000000
_______________________________________________
Confusion Matrix: 
 [[391   0]
 [152 239]]

Test Result:
Accuracy Score: 99.92%
_______________________________________________
CLASSIFICATION REPORT:
                      0           1  accuracy     macro avg  weighted avg
precision      0.999314    0.885714  0.999175      0.942514      0.999113
recall         0.999859    0.613861  0.999175      0.806860      0.999175
f1-score       0.999587    0.725146  0.999175      0.862367      0.999100
support    56861.000000  101.000000  0.999175  
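
The down-sampling step can also be written so that the target size is derived from the data instead of being typed in. `downsample_majority` below is a hypothetical helper, not part of the notebook, demonstrated on a small toy frame.

```python
import pandas as pd
from sklearn.utils import resample


def downsample_majority(df, label_col="Class", random_state=0):
    """Shrink every class to the size of the smallest one (hypothetical helper)."""
    counts = df[label_col].value_counts()
    minority_label = counts.idxmin()
    n_min = counts.min()
    parts = []
    for label, group in df.groupby(label_col):
        if label == minority_label:
            # Keep all minority rows
            parts.append(group)
        else:
            # Sample without replacement down to the minority size
            parts.append(
                resample(group, replace=False, n_samples=n_min,
                         random_state=random_state)
            )
    return pd.concat(parts)


# Toy data: 95 rows of class 0 and 5 rows of class 1
toy = pd.DataFrame({"V1": range(100), "Class": [1] * 5 + [0] * 95})
balanced = downsample_majority(toy)
print(balanced.Class.value_counts())
```

Down-sampling discards most of the majority rows, so it trades information for balance; on this notebook's training set it would keep 782 of 227,845 rows.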

# Pickle file saving

In [37]:
import pickle

In [38]:
# save the model to disk
filename = 'C://Users//ebineet//Documents//Deployment//Deployment//Credit_Card_Claasification_model.sav'
# use a context manager so the file is closed after writing
with open(filename, 'wb') as f:
    pickle.dump(clf_LR, f)

In [39]:
# save a second copy of the model with a .p extension
filename = 'C://Users//ebineet//Documents//Deployment//Deployment//Credit_Card_model.p'
with open(filename, 'wb') as f:
    pickle.dump(clf_LR, f)
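
Before deploying a pickled model it is worth checking that it loads back and predicts identically. The sketch below does this round trip with a small model trained on synthetic data and a temporary file; the path and data are placeholders, not the ones used above.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Small stand-in model (illustration only)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Save to a temporary location instead of a hard-coded user directory
path = os.path.join(tempfile.mkdtemp(), "model.p")
with open(path, "wb") as f:
    pickle.dump(model, f)

# Load the model back and confirm it predicts the same labels
with open(path, "rb") as f:
    restored = pickle.load(f)

same = np.array_equal(model.predict(X), restored.predict(X))
print("round trip OK:", same)
```

Pickle files should only be loaded from trusted sources and with the same scikit-learn version that wrote them, since unpickling executes arbitrary code and the format is not guaranteed to be stable across versions.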