# CREDIT CARD FRAUD DETECTION

<strong> Problem Statement </strong>
This project aims to identify fraudulent credit card transactions. The objective is to build a fraud detection system using machine learning techniques. In the past, such systems were rule-based; machine learning makes it possible to learn fraud patterns directly from labelled transactions.

The project uses a dataset of 284,807 fully anonymized transactions. Each transaction is labelled either fraudulent or not fraudulent. The prevalence of fraudulent transactions is very low: only 492 transactions, about 0.17%, are fraudulent. This means that a system predicting every transaction to be normal reaches an accuracy of over 99.8% without detecting a single fraudulent transaction. Accuracy alone is therefore not a sufficient metric, and the imbalance has to be handled with resampling techniques.
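
The effect of the imbalance on accuracy can be checked with a few lines of arithmetic. The sketch below uses the class counts that `df.Class.value_counts()` prints later in the notebook (284,315 normal, 492 fraudulent); the variable names are only illustrative.

```python
# Class counts as printed by df.Class.value_counts() later in the notebook.
n_normal = 284_315
n_fraud = 492
n_total = n_normal + n_fraud

fraud_rate = n_fraud / n_total

# A "classifier" that labels every transaction as normal is right on every
# normal transaction and wrong on every fraudulent one.
baseline_accuracy = n_normal / n_total
baseline_recall = 0 / n_fraud  # it never flags a single fraud

print(f"transactions:      {n_total}")
print(f"fraud rate:        {fraud_rate:.4%}")
print(f"baseline accuracy: {baseline_accuracy:.4%}")
print(f"baseline recall:   {baseline_recall:.0%}")
```

Any model in this notebook should be judged against this baseline, which is why recall and precision on class 1 matter more than the accuracy score.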

The dataset is a CSV file with 31 columns: 30 features and a final column, `Class`, that labels each transaction as fraudulent or not.

<b>Case Study:</b> Fraud detection is important to an e-commerce store, and a lot of money is spent on prevention.
We have an e-commerce store that sells books. Thousands of books were sold last year, and today we will use transaction data to build a fraud detection system. We will use a publicly available dataset.

<b> Business Problem:</b> Build a classifier that, given a new transaction, can say with confidence whether it is fraudulent or not.<br>
<b> Outcome: </b>
0: Non-Fraudulent <br>
1: Fraudulent

# Importing the libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Importing the dataset

In [2]:
df = pd.read_csv('creditcard.csv')

In [3]:
df.columns

Index(['Time', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10',
       'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20',
       'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount',
       'Class'],
      dtype='object')

# Data Pre Processing

<b>Dropping columns that the EDA showed to carry no useful information.</b>

In [4]:
df= df.drop(columns=['Time','V13','V15','V22','V23','V24','V25','V26','V27','V28'])

In [5]:
df.head()

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V12,V14,V16,V17,V18,V19,V20,V21,Amount,Class
0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,0.090794,...,-0.617801,-0.311169,-0.470401,0.207971,0.025791,0.403993,0.251412,-0.018307,149.62,0
1,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,-0.166974,...,1.065235,-0.143772,0.463917,-0.114805,-0.183361,-0.145783,-0.069083,-0.225775,2.69,0
2,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,0.207643,...,0.066084,-0.165946,-2.890083,1.109969,-0.121359,-2.261857,0.52498,0.247998,378.66,0
3,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,-0.054952,...,0.178228,-0.287924,-1.059647,-0.684093,1.965775,-1.232622,-0.208038,-0.1083,123.5,0
4,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,0.753074,...,0.538196,-1.11967,-0.451449,-0.237033,-0.038195,0.803487,0.408542,-0.009431,69.99,0


## Feature Scaling
### Normalising the amount range using StandardScaler

In [6]:
from sklearn.preprocessing import StandardScaler
sc= StandardScaler()

In [7]:
df['Scaled_Amount']= sc.fit_transform(df['Amount'].values.reshape(-1,1))

In [8]:
df = df.drop(columns=['Amount'])  # columns= already selects the axis
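
As a rough illustration of what `StandardScaler` does to the `Amount` column, the sketch below scales the five amounts shown in `df.head()` and compares the result with the manual formula `(x - mean) / std`. Because the mean and standard deviation here come from only five rows, the values differ from the `Scaled_Amount` column in the notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# The five amounts from df.head(), used only for illustration.
amounts = np.array([149.62, 2.69, 378.66, 123.50, 69.99]).reshape(-1, 1)

scaler = StandardScaler()
scaled = scaler.fit_transform(amounts)

# StandardScaler uses the population standard deviation (ddof=0),
# which is also NumPy's default for .std().
manual = (amounts - amounts.mean()) / amounts.std()

print(scaled.ravel())
print(np.allclose(scaled, manual))
```

After scaling, the column has zero mean and unit variance, which puts it on a scale comparable to the anonymized `V` features.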

In [9]:
df.head()

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V12,V14,V16,V17,V18,V19,V20,V21,Class,Scaled_Amount
0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,0.090794,...,-0.617801,-0.311169,-0.470401,0.207971,0.025791,0.403993,0.251412,-0.018307,0,0.244964
1,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,-0.166974,...,1.065235,-0.143772,0.463917,-0.114805,-0.183361,-0.145783,-0.069083,-0.225775,0,-0.342475
2,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,0.207643,...,0.066084,-0.165946,-2.890083,1.109969,-0.121359,-2.261857,0.52498,0.247998,0,1.160686
3,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,-0.054952,...,0.178228,-0.287924,-1.059647,-0.684093,1.965775,-1.232622,-0.208038,-0.1083,0,0.140534
4,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,0.753074,...,0.538196,-1.11967,-0.451449,-0.237033,-0.038195,0.803487,0.408542,-0.009431,0,-0.073403


# Splitting the Data

In [10]:
x= df.loc[:,df.columns != 'Class']
y=df['Class']

In [11]:
x.head()

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12,V14,V16,V17,V18,V19,V20,V21,Scaled_Amount
0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,0.090794,-0.5516,-0.617801,-0.311169,-0.470401,0.207971,0.025791,0.403993,0.251412,-0.018307,0.244964
1,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,-0.166974,1.612727,1.065235,-0.143772,0.463917,-0.114805,-0.183361,-0.145783,-0.069083,-0.225775,-0.342475
2,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,0.207643,0.624501,0.066084,-0.165946,-2.890083,1.109969,-0.121359,-2.261857,0.52498,0.247998,1.160686
3,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,-0.054952,-0.226487,0.178228,-0.287924,-1.059647,-0.684093,1.965775,-1.232622,-0.208038,-0.1083,0.140534
4,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,0.753074,-0.822843,0.538196,-1.11967,-0.451449,-0.237033,-0.038195,0.803487,0.408542,-0.009431,-0.073403


In [12]:
y.head()

0    0
1    0
2    0
3    0
4    0
Name: Class, dtype: int64

In [13]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=0)
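
With so few fraudulent rows, a plain random split can leave the train and test sets with slightly different fraud rates. One option, shown here as a sketch on synthetic data rather than as the method used in this notebook, is the `stratify` argument of `train_test_split`. The arrays `X_demo` and `y_demo` are made up for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data: 10,000 rows, 20 of them "fraud".
X_demo = rng.normal(size=(10_000, 3))
y_demo = np.zeros(10_000, dtype=int)
y_demo[:20] = 1

# stratify=y_demo keeps the class proportions the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

print(f"fraud share in train: {y_tr.mean():.4%}")
print(f"fraud share in test:  {y_te.mean():.4%}")
```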

# Applying machine learning algorithms
<b>Creating a helper function that prints the following output for any fitted model.</b>

In [14]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

def print_score(clf, x_train, y_train, x_test, y_test, train=True):
    """Print accuracy and the classification report of a fitted classifier.

    With train=True the training set is scored and the confusion matrix is
    printed as well; with train=False the test set is scored.
    """
    if train:
        pred = clf.predict(x_train)
        clf_report = pd.DataFrame(classification_report(y_train, pred, output_dict=True))
        print("Train Result:\n================================================")
        print(f"Accuracy Score: {accuracy_score(y_train, pred) * 100:.2f}%")
        print("_______________________________________________")
        print(f"CLASSIFICATION REPORT:\n{clf_report}")
        print("_______________________________________________")
        print(f"Confusion Matrix: \n {confusion_matrix(y_train, pred)}\n")

    else:
        pred = clf.predict(x_test)
        clf_report = pd.DataFrame(classification_report(y_test, pred, output_dict=True))
        print("Test Result:\n================================================")
        print(f"Accuracy Score: {accuracy_score(y_test, pred) * 100:.2f}%")
        print("_______________________________________________")
        print(f"CLASSIFICATION REPORT:\n{clf_report}")
        print("_______________________________________________")

# Decision Tree

In [16]:
from sklearn.tree import DecisionTreeClassifier
clf_tree= DecisionTreeClassifier(random_state=0)

clf_tree.fit(x_train,y_train)
print_score(clf_tree, x_train, y_train, x_test, y_test, train=True)
print_score(clf_tree, x_train, y_train, x_test, y_test, train=False)

Train Result:
Accuracy Score: 100.00%
_______________________________________________
CLASSIFICATION REPORT:
                  0      1  accuracy  macro avg  weighted avg
precision       1.0    1.0       1.0        1.0           1.0
recall          1.0    1.0       1.0        1.0           1.0
f1-score        1.0    1.0       1.0        1.0           1.0
support    227454.0  391.0       1.0   227845.0      227845.0
_______________________________________________
Confusion Matrix: 
 [[227454      0]
 [     0    391]]

Test Result:
Accuracy Score: 99.93%
_______________________________________________
CLASSIFICATION REPORT:
                      0           1  accuracy     macro avg  weighted avg
precision      0.999560    0.826087   0.99928      0.912824      0.999253
recall         0.999719    0.752475   0.99928      0.876097      0.999280
f1-score       0.999640    0.787565   0.99928      0.893602      0.999263
support    56861.000000  101.000000   0.99928  56962.000000  56962.000000


# Random Forest

In [17]:
from sklearn.ensemble import RandomForestClassifier
clf_rf= RandomForestClassifier(random_state=0)

clf_rf.fit(x_train,y_train)
print_score(clf_rf, x_train, y_train, x_test, y_test, train=True)
print_score(clf_rf, x_train, y_train, x_test, y_test, train=False)

Train Result:
Accuracy Score: 100.00%
_______________________________________________
CLASSIFICATION REPORT:
                  0      1  accuracy  macro avg  weighted avg
precision       1.0    1.0       1.0        1.0           1.0
recall          1.0    1.0       1.0        1.0           1.0
f1-score        1.0    1.0       1.0        1.0           1.0
support    227454.0  391.0       1.0   227845.0      227845.0
_______________________________________________
Confusion Matrix: 
 [[227454      0]
 [     0    391]]

Test Result:
Accuracy Score: 99.95%
_______________________________________________
CLASSIFICATION REPORT:
                      0           1  accuracy     macro avg  weighted avg
precision      0.999631    0.930233  0.999526      0.964932      0.999508
recall         0.999894    0.792079  0.999526      0.895987      0.999526
f1-score       0.999763    0.855615  0.999526      0.927689      0.999507
support    56861.000000  101.000000  0.999526  56962.000000  56962.000000


In [19]:
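# Selecting 4 of the 20 remaining features with recursive feature elimination (RFE)
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(random_state=0)
rfe = RFE(clf, n_features_to_select=4)  # keyword argument is required in recent scikit-learn
rfe = rfe.fit(x_train, y_train)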




In [20]:
# summarise the selection attributes
# which columns are selected(True) and which are not (False)
print(rfe.support_)

[False False False  True False False False False False  True False False
  True  True False False False False False False]


In [21]:
x_train.columns[rfe.support_]

Index(['V4', 'V10', 'V14', 'V16'], dtype='object')

In [22]:
rfe.ranking_

array([15,  8, 11,  1,  5, 10, 17,  2, 13,  1,  4, 16,  1,  1,  7, 14, 12,
        6,  3,  9])
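
The `support_` mask and the `ranking_` array are easier to read when they are paired with the column names. The sketch below does this on synthetic data; the column names `f0` to `f7` and the use of `make_classification` are assumptions made for the example, not part of the notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data with hypothetical column names f0..f7.
X_arr, y_demo = make_classification(
    n_samples=500, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(8)])

selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=4)
selector.fit(X_demo, y_demo)

# Rank 1 marks a selected feature; larger ranks were eliminated earlier.
summary = pd.DataFrame(
    {
        "feature": X_demo.columns,
        "selected": selector.support_,
        "rank": selector.ranking_,
    }
).sort_values("rank")
print(summary)
```

Read this way, the ranking printed above says that the four features with rank 1 were kept and that the feature with rank 17 was the first one to be dropped.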

In [23]:
# Predicting Test Set
clf.fit(x_train[x_train.columns[rfe.support_]],y_train)
y_pred = clf.predict(x_test[x_test.columns[rfe.support_]])

In [24]:
## Making the confusion Matrix
from sklearn.metrics import confusion_matrix, accuracy_score
cm= confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)

[[56853     8]
 [   42    59]]


0.9991222218320986
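
The accuracy above hides how the model does on the fraud class. A short sketch that derives precision, recall and F1 for class 1 directly from the confusion matrix printed above (the variable names are illustrative):

```python
import numpy as np

# Confusion matrix printed above: rows are true classes, columns are predictions.
cm = np.array([[56853, 8],
               [42, 59]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)  # share of flagged transactions that really are fraud
recall = tp / (tp + fn)     # share of frauds that were flagged
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy:  {accuracy:.4f}")
print(f"precision: {precision:.4f}")
print(f"recall:    {recall:.4f}")
print(f"f1-score:  {f1:.4f}")
```

The recall works out to 59 / 101, so the four-feature model misses 42 of the 101 fraudulent transactions in the test set despite its 99.9% accuracy.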

In [25]:
# Analyzing Coefficients
pd.concat([pd.DataFrame(x_train[x_train.columns[rfe.support_]].columns, columns = ["features"]),
           pd.DataFrame(np.transpose(clf.coef_), columns = ["coef"])
           ],axis = 1)


Unnamed: 0,features,coef
0,V4,0.483711
1,V10,-0.28624
2,V14,-0.769235
3,V16,-0.293106


# K Fold Validation

### LR

In [26]:
from sklearn.model_selection import cross_val_score

In [27]:
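# Logistic regression on the full feature set (clf_LR is not defined in any earlier cell)
from sklearn.linear_model import LogisticRegression
clf_LR = LogisticRegression(random_state=0)
accuracies= cross_val_score(estimator=clf_LR,
                           X=x_train,
                           y=y_train,
                           cv=5)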

In [28]:
# shows the accuracy of each of the 5 folds
accuracies

array([0.99912221, 0.99918804, 0.99920999, 0.99929777, 0.99905638])

In [29]:
accuracies.mean()

0.999174877658057

### Decision Tree

In [30]:
from sklearn.model_selection import cross_val_score

In [31]:
accuracies= cross_val_score(estimator=clf_tree,
                           X=x_train,
                           y=y_train,
                           cv=5)

In [32]:
accuracies.mean()

0.9991178213259013

### Random Forest

In [33]:
from sklearn.model_selection import cross_val_score

In [34]:
accuracies= cross_val_score(estimator=clf_rf,
                           X=x_train,
                           y=y_train,
                           cv=5)

In [35]:
accuracies.mean()

0.9995479382913823

### SVM

In [36]:
from sklearn.model_selection import cross_val_score

<b>The accuracy is so high because the dataset is imbalanced.</b>
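
Since accuracy is dominated by the majority class, cross-validation can also be scored on recall. The following is a minimal sketch on synthetic imbalanced data, assuming `StratifiedKFold` and the `scoring` argument of `cross_val_score`; it is not the evaluation used in the cells above, and `X_demo`, `y_demo` stand in for `x_train`, `y_train`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced data (roughly 2% positives).
X_demo, y_demo = make_classification(
    n_samples=5000, n_features=10, weights=[0.98], random_state=0
)

model = LogisticRegression(max_iter=1000)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Same folds, two different metrics.
acc = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")
rec = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="recall")

print(f"mean accuracy: {acc.mean():.4f}")
print(f"mean recall:   {rec.mean():.4f}")
```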

In [37]:
df.Class.value_counts()

0    284315
1       492
Name: Class, dtype: int64

## Handling the Data Imbalance

In [38]:
# create the training df by re-joining x_train and y_train
df_train = x_train.join(y_train)
df_train.sample(10)

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V12,V14,V16,V17,V18,V19,V20,V21,Scaled_Amount,Class
91399,-2.307939,-1.995021,1.633618,-2.413268,-0.867315,0.327203,2.607588,-1.267149,2.279375,-1.014836,...,-0.614645,-1.43297,0.052624,-1.265911,0.473202,0.333513,-0.600572,-0.393535,1.352354,0
276792,1.946051,0.675179,-0.797716,3.70347,0.524524,-0.383933,0.353888,-0.229634,-1.191897,1.528632,...,-0.155767,0.333935,0.84783,-0.913748,-0.341869,-1.790636,-0.268672,0.234448,-0.334918,0
58008,-0.78295,1.35143,1.256045,0.368087,0.050375,-0.882264,0.660864,-0.22295,-0.387727,-0.028516,...,0.135776,-0.561515,0.038275,0.202122,-0.124503,0.602782,0.174696,-0.172119,-0.325523,0
176062,-3.034353,-0.262247,-0.385991,-3.142276,-1.622657,1.461466,-0.988286,1.724881,-2.101497,0.59226,...,-0.601657,0.507125,-0.265931,1.082911,-0.342935,-1.396542,-0.6286,-0.129188,0.278468,0
266971,-2.34748,1.906698,0.518341,-0.37959,-0.888491,-0.5632,-0.383494,0.732846,0.319286,-0.411794,...,1.060598,0.072014,0.061557,-0.153071,-0.107327,0.3712,-0.344062,0.023564,-0.333239,0
132477,-1.024907,1.068628,2.159984,1.219775,-0.635047,0.74728,-0.478264,0.877752,0.058144,-0.48663,...,-0.239382,0.045725,-0.41531,0.514226,0.020502,0.674764,0.044567,-0.016759,-0.333279,0
268966,1.780541,0.008474,-2.549174,1.574685,0.81163,-0.880434,0.755707,-0.285778,0.317556,-0.384304,...,-0.723899,-0.708836,-0.114104,1.039761,0.260487,-0.420624,-0.036483,0.01867,0.208822,0
242041,1.750367,-0.676421,-0.948436,0.966969,0.050144,0.827662,-0.528501,0.250703,0.797027,0.348069,...,0.176625,0.286006,1.047363,-1.412892,1.338847,-0.091558,0.046298,0.270361,0.19051,0
254857,-0.779664,0.164888,0.777698,-0.266798,0.125544,0.077234,-0.542197,0.742019,0.24106,-0.788901,...,0.861052,0.336512,0.137659,-0.449354,0.935278,0.329377,-0.063056,0.462853,-0.333279,0
229018,2.279318,-0.733895,-3.498674,-2.038261,2.447428,2.684337,-0.380628,0.480325,-1.13116,0.953808,...,-0.629595,0.569641,0.355726,0.271427,-1.836285,0.531009,0.003502,0.339869,-0.338237,0


In [39]:
from sklearn.utils import resample

# Separate majority and minority classes
df_majority = df_train[df_train.Class==0]
df_minority = df_train[df_train.Class==1]

print(df_majority.Class.count())
print("-----------")
print(df_minority.Class.count())
print("-----------")
print(df_train.Class.value_counts())

227454
-----------
391
-----------
0    227454
1       391
Name: Class, dtype: int64


## UpSampling the Data

In [40]:
# Upsample minority class
df_minority_upsampled = resample(df_minority, 
                                 replace=True,     # sample with replacement
                                 n_samples=227454,    # to match majority class
                                 random_state=587) # reproducible results
# Combine majority class with upsampled minority class
df_upsampled = pd.concat([df_majority, df_minority_upsampled])
# Display new class counts
df_upsampled.Class.value_counts()

1    227454
0    227454
Name: Class, dtype: int64

In [41]:
x_upsampled = df_upsampled.drop(['Class'], axis= 1)
y_upsampled = df_upsampled.Class

## Logistic Regression on UpSampled Data

In [42]:
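from sklearn.linear_model import LogisticRegression
clf_LR = LogisticRegression(random_state=0)

# Note: this fits on the original x_train/y_train; the "Train Result" below only
# evaluates that model on the upsampled data. To train on the balanced data,
# call clf_LR.fit(x_upsampled, y_upsampled) instead.
clf_LR.fit(x_train,y_train)
print_score(clf_LR, x_upsampled, y_upsampled, x_test, y_test, train=True)
print_score(clf_LR, x_upsampled, y_upsampled, x_test, y_test, train=False)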

Train Result:
Accuracy Score: 80.52%
_______________________________________________
CLASSIFICATION REPORT:
                       0              1  accuracy      macro avg   weighted avg
precision       0.719716       0.999777   0.80524       0.859746       0.859746
recall          0.999864       0.610616   0.80524       0.805240       0.805240
f1-score        0.836969       0.758175   0.80524       0.797572       0.797572
support    227454.000000  227454.000000   0.80524  454908.000000  454908.000000
_______________________________________________
Confusion Matrix: 
 [[227423     31]
 [ 88567 138887]]

Test Result:
Accuracy Score: 99.92%
_______________________________________________
CLASSIFICATION REPORT:
                      0           1  accuracy     macro avg  weighted avg
precision      0.999314    0.885714  0.999175      0.942514      0.999113
recall         0.999859    0.613861  0.999175      0.8068
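
The cell above fits the model on `x_train` and only evaluates it on the upsampled data. A minimal sketch of fitting on the balanced data instead is given below. It uses synthetic data from `make_classification`, so the numbers it prints say nothing about the credit card dataset, and all variable names are illustrative.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic imbalanced data standing in for the credit card set.
X_arr, y_arr = make_classification(
    n_samples=5000, n_features=10, weights=[0.98], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_arr, y_arr, test_size=0.2, random_state=0, stratify=y_arr
)

# Upsample the minority class of the training split only; the test split
# keeps its original class distribution.
train = pd.DataFrame(X_tr).assign(Class=y_tr)
majority = train[train.Class == 0]
minority = train[train.Class == 1]
minority_up = resample(
    minority, replace=True, n_samples=len(majority), random_state=0
)
balanced = pd.concat([majority, minority_up])

X_bal = balanced.drop(columns="Class").to_numpy()
y_bal = balanced["Class"].to_numpy()

plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
upsampled = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)  # fit on balanced data

recall_plain = recall_score(y_te, plain.predict(X_te))
recall_up = recall_score(y_te, upsampled.predict(X_te))
print("recall, fit on original: ", recall_plain)
print("recall, fit on upsampled:", recall_up)
```

The same pattern applies to the down-sampled frame: resample the training split, fit on the resampled features and labels, and evaluate on the untouched test split.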

## Down-Sampling the Data

In [43]:
# Downsample majority class
df_majority_downsampled = resample(df_majority, 
                                 replace=False,    # sample without replacement
                                 n_samples=391,     # to match minority class
                                 random_state=24) # reproducible results
# Combine minority class with downsampled majority class
df_downsampled = pd.concat([df_majority_downsampled, df_minority])
# Display new class counts
df_downsampled.Class.value_counts()

1    391
0    391
Name: Class, dtype: int64

In [44]:
x_downsampled = df_downsampled.drop(['Class'], axis = 1)
y_downsampled = df_downsampled.Class

## Logistic Regression on Down-Sampled Data

In [45]:
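from sklearn.linear_model import LogisticRegression
clf_LR = LogisticRegression(random_state=0)

# Note: this fits on the original x_train/y_train; the "Train Result" below only
# evaluates that model on the down-sampled data. To train on the balanced data,
# call clf_LR.fit(x_downsampled, y_downsampled) instead.
clf_LR.fit(x_train,y_train)
print_score(clf_LR, x_downsampled, y_downsampled, x_test, y_test, train=True)
print_score(clf_LR, x_downsampled, y_downsampled, x_test, y_test, train=False)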

Train Result:
Accuracy Score: 80.56%
_______________________________________________
CLASSIFICATION REPORT:
                    0           1  accuracy   macro avg  weighted avg
precision    0.720074    1.000000  0.805627    0.860037      0.860037
recall       1.000000    0.611253  0.805627    0.805627      0.805627
f1-score     0.837259    0.758730  0.805627    0.797995      0.797995
support    391.000000  391.000000  0.805627  782.000000    782.000000
_______________________________________________
Confusion Matrix: 
 [[391   0]
 [152 239]]

Test Result:
Accuracy Score: 99.92%
_______________________________________________
CLASSIFICATION REPORT:
                      0           1  accuracy     macro avg  weighted avg
precision      0.999314    0.885714  0.999175      0.942514      0.999113
recall         0.999859    0.613861  0.999175      0.806860      0.999175
f1-score       0.999587    0.725146  0.999175      0.862367      0.999100
support    56861.000000  101.000000  0.999175  

## SMOTE

In [46]:
# pip install imbalanced-learn

from imblearn.over_sampling import SMOTE

sm = SMOTE(random_state=587)
x_SMOTE, y_SMOTE = sm.fit_resample(x_train, y_train)  # fit_resample returns the resampled X and y
print(len(y_SMOTE))
print(y_SMOTE.sum())
print(y_SMOTE.value_counts())
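
Unlike plain upsampling, SMOTE does not duplicate minority rows; it creates new points on the line segment between a minority sample and one of its nearest minority neighbours. The sketch below illustrates that interpolation idea in plain NumPy. It is a hand-written illustration under simplifying assumptions, not imbalanced-learn's implementation, and the function name `smote_like` and the sample points are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical minority-class points in two dimensions.
minority = rng.normal(loc=[2.0, -1.0], scale=0.5, size=(20, 2))


def smote_like(points, n_new, k=5, rng=rng):
    """Create n_new synthetic points by interpolating towards near neighbours."""
    # Pairwise Euclidean distances; the diagonal is set to inf so that a
    # point is never chosen as its own neighbour.
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)
    neighbours = np.argsort(dist, axis=1)[:, :k]  # k nearest neighbours per point

    base = rng.integers(0, len(points), size=n_new)          # starting point
    pick = neighbours[base, rng.integers(0, k, size=n_new)]  # one of its neighbours
    lam = rng.random((n_new, 1))                             # weight in [0, 1)

    # x_new = x_i + lam * (x_neighbour - x_i)
    return points[base] + lam * (points[pick] - points[base])


synthetic = smote_like(minority, n_new=100)
print(synthetic.shape)
```

Because every synthetic point is a convex combination of two real minority points, the new samples stay inside the region already covered by the minority class.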