# PROBLEM

### One of the most critical issues that the finance sector faces is fraud. Fraud directly impacts the bottom line of a financial institution. It is estimated that a typical financial institution loses 5% of its revenue to fraud. Applying this estimate to the 2017 Gross World Product of USD 79.6 trillion puts the global loss for 2017 at about USD 4 trillion (more than the GDP of India).

# APPROACH

### The goal of the project is to detect whether a transaction is a normal payment or a fraud.
### The dataset contains transactions made by European cardholders over two days in September 2013.
### For privacy reasons, the dataset has been anonymized and the feature names have been replaced (V1, V2, V3, etc.). Hence, visualization will not yield much insight.
### We will use machine learning algorithms from the Python library scikit-learn to predict fraudulent transactions.

In [1]:
#Importing essential Libraries or packages for the solution

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from pandas import set_option
from pandas.plotting import scatter_matrix

In [3]:
#Loading the dataset csv file with Pandas

df=pd.read_csv("/Users/danishkhan/Downloads/creditcard.csv")

In [4]:
#Understanding and Analysing the data
df.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


In [5]:
#Checking the number of rows and columns

df.shape

(284807, 31)

In [7]:
#checking the datatypes and attributes of the variables

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   Time    284807 non-null  float64
 1   V1      284807 non-null  float64
 2   V2      284807 non-null  float64
 3   V3      284807 non-null  float64
 4   V4      284807 non-null  float64
 5   V5      284807 non-null  float64
 6   V6      284807 non-null  float64
 7   V7      284807 non-null  float64
 8   V8      284807 non-null  float64
 9   V9      284807 non-null  float64
 10  V10     284807 non-null  float64
 11  V11     284807 non-null  float64
 12  V12     284807 non-null  float64
 13  V13     284807 non-null  float64
 14  V14     284807 non-null  float64
 15  V15     284807 non-null  float64
 16  V16     284807 non-null  float64
 17  V17     284807 non-null  float64
 18  V18     284807 non-null  float64
 19  V19     284807 non-null  float64
 20  V20     284807 non-null  float64
 21  V21     28

In [8]:
#counting the null values

df.isnull().sum()

Time      0
V1        0
V2        0
V3        0
V4        0
V5        0
V6        0
V7        0
V8        0
V9        0
V10       0
V11       0
V12       0
V13       0
V14       0
V15       0
V16       0
V17       0
V18       0
V19       0
V20       0
V21       0
V22       0
V23       0
V24       0
V25       0
V26       0
V27       0
V28       0
Amount    0
Class     0
dtype: int64

## Our observations are as follows:

#### 1. NaN values are not present in the dataset: the Non-Null Count of every column matches the number of rows.
#### 2. There are 30 input variables (Time, V1 to V28, and Amount) and 1 output variable (Class).
#### 3. The data type of all the input variables is float64, whereas the data type of the output variable (Class) is int64.
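
The three observations can also be checked in code rather than read off the printed output. The sketch below is only an illustration: it uses a tiny stand-in frame with hypothetical values instead of creditcard.csv, and the names `demo`, `total_missing`, `input_columns` and `dtype_counts` are not part of the original notebook.

```python
import pandas as pd

# Tiny stand-in frame (hypothetical values) with the same kinds of columns.
demo = pd.DataFrame({
    "Time": [0.0, 0.0, 1.0, 1.0],
    "V1": [-1.36, 1.19, -1.36, -0.97],
    "Amount": [149.62, 2.69, 378.66, 123.5],
    "Class": pd.Series([0, 0, 0, 1], dtype="int64"),
})

# Observation 1: total number of missing values across all columns.
total_missing = int(demo.isnull().sum().sum())

# Observation 2: every column except the target is an input variable.
input_columns = [col for col in demo.columns if col != "Class"]

# Observation 3: count how many columns have each data type.
dtype_counts = demo.dtypes.astype(str).value_counts().to_dict()

print(total_missing, len(input_columns), dtype_counts)
```

Running the same three expressions on `df` should reproduce the observations listed above.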


In [11]:
#Showing two digits of precision in the output
set_option("display.precision", 2)

df.describe()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
count,284807.0,285000.0,285000.0,285000.0,285000.0,285000.0,285000.0,285000.0,285000.0,285000.0,...,285000.0,285000.0,285000.0,285000.0,285000.0,285000.0,285000.0,285000.0,284807.0,285000.0
mean,94813.86,3.92e-15,5.68e-16,-8.76e-15,2.81e-15,-1.55e-15,2.04e-15,-1.7e-15,-1.89e-16,-3.15e-15,...,1.47e-16,8.04e-16,5.28e-16,4.46e-15,1.43e-15,1.7e-15,-3.66e-16,-1.22e-16,88.35,0.00173
std,47488.15,1.96,1.65,1.52,1.42,1.38,1.33,1.24,1.19,1.1,...,0.735,0.726,0.624,0.606,0.521,0.482,0.404,0.33,250.12,0.0415
min,0.0,-56.4,-72.7,-48.3,-5.68,-114.0,-26.2,-43.6,-73.2,-13.4,...,-34.8,-10.9,-44.8,-2.84,-10.3,-2.6,-22.6,-15.4,0.0,0.0
25%,54201.5,-0.92,-0.599,-0.89,-0.849,-0.692,-0.768,-0.554,-0.209,-0.643,...,-0.228,-0.542,-0.162,-0.355,-0.317,-0.327,-0.0708,-0.053,5.6,0.0
50%,84692.0,0.0181,0.0655,0.18,-0.0198,-0.0543,-0.274,0.0401,0.0224,-0.0514,...,-0.0295,0.00678,-0.0112,0.041,0.0166,-0.0521,0.00134,0.0112,22.0,0.0
75%,139320.5,1.32,0.804,1.03,0.743,0.612,0.399,0.57,0.327,0.597,...,0.186,0.529,0.148,0.44,0.351,0.241,0.091,0.0783,77.16,0.0
max,172792.0,2.45,22.1,9.38,16.9,34.8,73.3,121.0,20.0,15.6,...,27.2,10.5,22.5,4.58,7.52,3.52,31.6,33.8,25691.16,1.0


In [12]:
#transposing the  dataframe

df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Time,284807.0,94800.0,47488.15,0.0,54201.5,84700.0,139320.5,172792.0
V1,284807.0,3.92e-15,1.96,-56.41,-0.92,0.0181,1.32,2.45
V2,284807.0,5.68e-16,1.65,-72.72,-0.6,0.0655,0.8,22.06
V3,284807.0,-8.76e-15,1.52,-48.33,-0.89,0.18,1.03,9.38
V4,284807.0,2.81e-15,1.42,-5.68,-0.85,-0.0198,0.74,16.88
V5,284807.0,-1.55e-15,1.38,-113.74,-0.69,-0.0543,0.61,34.8
V6,284807.0,2.04e-15,1.33,-26.16,-0.77,-0.274,0.4,73.3
V7,284807.0,-1.7e-15,1.24,-43.56,-0.55,0.0401,0.57,120.59
V8,284807.0,-1.89e-16,1.19,-73.22,-0.21,0.0224,0.33,20.01
V9,284807.0,-3.15e-15,1.1,-13.43,-0.64,-0.0514,0.6,15.59


## We can see that the data for the variables V1 to V28 is already scaled and cleaned, so there is no need for a data cleaning process in this case. Time and Amount remain on their original scales.



In [18]:
class_names = {0:'Not Fraud', 1:'Fraud'}
print(df.Class.value_counts().rename(index = class_names))


Not Fraud    284315
Fraud           492
Name: Class, dtype: int64
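
With classes this imbalanced, it helps to know what a trivial classifier would score before reading any accuracy figure. The sketch below uses only the two counts printed above; the variable names are illustrative and not part of the original notebook.

```python
# Class counts taken from the value_counts() output above.
n_not_fraud = 284315
n_fraud = 492
n_total = n_not_fraud + n_fraud

# Share of fraudulent transactions in the whole dataset.
fraud_rate = n_fraud / n_total

# Accuracy (in percent) of a trivial model that labels every
# transaction as "Not Fraud".
baseline_accuracy = round(n_not_fraud / n_total * 100, 2)

print(f"Fraud rate: {fraud_rate:.4%}")
print(f"Baseline accuracy: {baseline_accuracy}")
```

The baseline comes out at about 99.83, so any accuracy score on this dataset is best read against that number rather than against 0 or 50.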


In [14]:
from sklearn.model_selection import train_test_split
y= df["Class"]
X = df.loc[:, df.columns != 'Class']
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=1/3,random_state=42, stratify=y)

-The "test_size" parameter of train_test_split specifies the fraction of the data that goes into the test dataset. The value 1/3 puts one-third of the rows in the test dataset and two-thirds in the training dataset.

-The "random_state" parameter controls the shuffling. Before the data is split into training and test datasets, it is randomly shuffled. Giving a value for random_state ensures the data is shuffled the same way every time, so that you get consistent training and test datasets.

-The "stratify" parameter ensures that the proportion of class values in the training and test datasets is the same as in the full dataset.
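
The effect of these three parameters can be seen on a small synthetic example. This is a sketch under stated assumptions: the data is randomly generated (3000 rows, 1% positives) rather than loaded from creditcard.csv, and all names ending in `_demo`, `_tr` or `_te` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 3000 rows, 1% positive class.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(3000, 4))
y_demo = np.zeros(3000, dtype=int)
y_demo[:30] = 1

# Same arguments as in the notebook: one-third test data, fixed seed,
# class proportions preserved through stratify.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=1/3, random_state=42, stratify=y_demo
)

# Sizes of the two parts and the share of positives in each.
print(len(y_tr), len(y_te))
print(y_tr.mean(), y_te.mean())
```

The share of positives should be close to 1% in both parts, and repeating the call with the same `random_state` should return the same split.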

In [19]:
X_train.shape

(189871, 30)

In [22]:
y_train.shape

(189871,)

In [23]:
X_test.shape

(94936, 30)

In [24]:
y_test.shape

(94936,)

### Now we will evaluate different machine learning models

We will use Linear as well as Non-Linear Algorithms for this evaluation

### Linear Algorithms

Logistic Regression (LR) and Linear Discriminant Analysis (LDA)

### Non-Linear Algorithms

Gaussian Naive Bayes, Classification and Regression Tree (CART, i.e. Decision Tree), Random Forest, Support Vector Machine (SVM) and K-Nearest Neighbours (KNN)
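
Since every model is trained and scored the same way, the evaluation can be written as a loop. The following is a minimal sketch, not the notebook's own code: it runs on synthetic data from `make_classification`, covers only three of the models, and the names `candidates` and `scores` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the credit card data (assumption).
X_demo, y_demo = make_classification(
    n_samples=1500, n_features=8, weights=[0.9, 0.1], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=1/3, random_state=42, stratify=y_demo
)

# One entry per model; each follows the same fit/predict interface.
candidates = {
    "Linear Discriminant Analysis": LinearDiscriminantAnalysis(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "K-Nearest Neighbours": KNeighborsClassifier(),
}

# Fit each model on the training part and score it on the test part.
scores = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    y_pred = model.predict(X_te)
    scores[name] = round(accuracy_score(y_te, y_pred) * 100, 2)

# Print the models from highest to lowest accuracy.
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score}")
```

Adding another model then only requires one more entry in `candidates`.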



## LOGISTIC REGRESSION

In [76]:
#Import Library for Accuracy Score
from sklearn.metrics import accuracy_score

#Import Library for Logistic Regression
from sklearn.linear_model import LogisticRegression

#Initialize the Logistic Regression Classifier
logisreg = LogisticRegression()

In [62]:
#Train the model using Training Dataset & checking the accuracy

logisreg.fit(X_train,y_train)
y_pred=logisreg.predict(X_test)
acc_logisreg=round(accuracy_score(y_test,y_pred)*100,2)
print(f'The Accuracy of Logistic Regression model is: {acc_logisreg}')

The Accuracy of Logistic Regression model is: 99.91


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
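
The warning above suggests either raising `max_iter` or scaling the data. A minimal sketch of the scaling route is shown here on synthetic data rather than creditcard.csv. `StandardScaler`, `make_pipeline` and `max_iter=1000` are choices made for this illustration and are not part of the original notebook, so the resulting score is not comparable with the 99.91 reported above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the credit card data (assumption).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=42
)
# Put one feature on a much larger scale, similar to Time or Amount.
X_demo[:, 0] *= 10000

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=1/3, random_state=42, stratify=y_demo
)

# Standardizing inside the pipeline means the scaler is fitted on the
# training data only and then applied unchanged to the test data.
scaled_logisreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_logisreg.fit(X_tr, y_tr)

print(round(scaled_logisreg.score(X_te, y_te) * 100, 2))
```

Keeping the scaler in the pipeline avoids leaking test-set statistics into training, which would happen if the whole dataset were scaled before the split.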


## LINEAR DISCRIMINANT ANALYSIS (LDA)

In [63]:
# linear discriminant analysis model

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

In [64]:
#Train the model using Training Dataset & checking the accuracy

linear_dis_an=LinearDiscriminantAnalysis()
linear_dis_an.fit(X_train,y_train)
y_pred=linear_dis_an.predict(X_test)
acc_lda=round(accuracy_score(y_test,y_pred)*100,2)
print(f'The Accuracy of Linear Discriminant Analysis is: {acc_lda}')

The Accuracy of Linear Discriminant Analysis is: 99.93


## GAUSSIAN NAIVE BAYES

In [65]:
#Import Library for Gaussian Naive Bayes

from sklearn.naive_bayes import GaussianNB

In [66]:
gauss_nb= GaussianNB()
gauss_nb.fit(X_train,y_train)
y_pred=gauss_nb.predict(X_test)
acc_nb=round(accuracy_score(y_test,y_pred)*100,2)
print(f'The Accuracy of Gaussian Naive Bayes is: {acc_nb}')


The Accuracy of Gaussian Naive Bayes is: 99.28


## DECISION TREE

In [67]:
#Import Library for Decision Tree

from sklearn.tree import DecisionTreeClassifier

In [68]:
dec_tree=DecisionTreeClassifier()
dec_tree.fit(X_train,y_train)
y_pred=dec_tree.predict(X_test)
acc_dt=round(accuracy_score(y_test,y_pred)*100,2)
print(f'The Accuracy of Decision Tree is: {acc_dt}')


The Accuracy of Decision Tree is: 99.91


## RANDOM FOREST

In [53]:
#Import Library for Random Forest

from sklearn.ensemble import RandomForestClassifier


In [56]:
ran_for=RandomForestClassifier()
ran_for.fit(X_train,y_train)
y_pred=ran_for.predict(X_test)
acc_rf=round(accuracy_score(y_test,y_pred)*100,2)
print(f'The Accuracy of Random Forest is: {acc_rf}')


The Accuracy of Random Forest is: 99.95


## SUPPORT VECTOR MACHINE(SVM)

In [69]:
#Import Library for Support Vector Machine Model

from sklearn import svm

In [71]:
svmachine= svm.SVC()
svmachine.fit(X_train,y_train)
y_pred=svmachine.predict(X_test)
acc_svm=round(accuracy_score(y_test,y_pred)*100,2)
print(f'The Accuracy of Support Vector Machine is: {acc_svm}')

The Accuracy of Support Vector Machine is: 99.83


## K-NEAREST NEIGHBOUR(KNN)

In [73]:
#Import Library for K Nearest Neighbour Model

from sklearn.neighbors import KNeighborsClassifier

In [74]:
knn= KNeighborsClassifier()
knn.fit(X_train,y_train)
y_pred= knn.predict(X_test)
acc_knn=round(accuracy_score(y_test,y_pred)*100,2)
print(f'The Accuracy of K-Nearest Neighbour is: {acc_knn}')

The Accuracy of K-Nearest Neighbour is: 99.83


### We can compare the accuracy of all the models and choose the one with the maximum accuracy

In [77]:
models_check = pd.DataFrame({'Used Models': ['Logistic Regression', 'Linear Discriminant Analysis','Naive Bayes', 'Decision Tree', 'Random Forest', 'Support Vector Machines', 'K - Nearest Neighbors'],
                       'Score': [acc_logisreg, acc_lda, acc_nb, acc_dt, acc_rf, acc_svm, acc_knn]})

models_check.sort_values(by='Score', ascending=False)


Unnamed: 0,Used Models,Score
4,Random Forest,99.95
1,Linear Discriminant Analysis,99.93
0,Logistic Regression,99.91
3,Decision Tree,99.91
5,Support Vector Machines,99.83
6,K - Nearest Neighbors,99.83
2,Naive Bayes,99.28


# We can select the RANDOM FOREST as it has given us the maximum accuracy