# Python Implementation

## Problem Statement

The aim is to detect potentially fraudulent transactions so that credit card customers are not charged for items they did not purchase. The [Credit Card dataset](https://www.kaggle.com/mlg-ulb/creditcardfraud) contains transactions made by European credit card users in September 2013. The goal here is to distinguish fraudulent from genuine credit card transactions by building classification models.

## Importing the required packages


In [1]:
import pandas as pd
import numpy as np

In [2]:
## Importing data from the creditcard dataset

ccdata = pd.read_csv("creditcard.csv")

In [3]:
ccdata.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


In [4]:
ccdata.describe()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
count,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,...,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0
mean,94813.859575,3.91956e-15,5.688174e-16,-8.769071e-15,2.782312e-15,-1.552563e-15,2.010663e-15,-1.694249e-15,-1.927028e-16,-3.137024e-15,...,1.537294e-16,7.959909e-16,5.36759e-16,4.458112e-15,1.453003e-15,1.699104e-15,-3.660161e-16,-1.206049e-16,88.349619,0.001727
std,47488.145955,1.958696,1.651309,1.516255,1.415869,1.380247,1.332271,1.237094,1.194353,1.098632,...,0.734524,0.7257016,0.6244603,0.6056471,0.5212781,0.482227,0.4036325,0.3300833,250.120109,0.041527
min,0.0,-56.40751,-72.71573,-48.32559,-5.683171,-113.7433,-26.16051,-43.55724,-73.21672,-13.43407,...,-34.83038,-10.93314,-44.80774,-2.836627,-10.2954,-2.604551,-22.56568,-15.43008,0.0,0.0
25%,54201.5,-0.9203734,-0.5985499,-0.8903648,-0.8486401,-0.6915971,-0.7682956,-0.5540759,-0.2086297,-0.6430976,...,-0.2283949,-0.5423504,-0.1618463,-0.3545861,-0.3171451,-0.3269839,-0.07083953,-0.05295979,5.6,0.0
50%,84692.0,0.0181088,0.06548556,0.1798463,-0.01984653,-0.05433583,-0.2741871,0.04010308,0.02235804,-0.05142873,...,-0.02945017,0.006781943,-0.01119293,0.04097606,0.0165935,-0.05213911,0.001342146,0.01124383,22.0,0.0
75%,139320.5,1.315642,0.8037239,1.027196,0.7433413,0.6119264,0.3985649,0.5704361,0.3273459,0.597139,...,0.1863772,0.5285536,0.1476421,0.4395266,0.3507156,0.2409522,0.09104512,0.07827995,77.165,0.0
max,172792.0,2.45493,22.05773,9.382558,16.87534,34.80167,73.30163,120.5895,20.00721,15.59499,...,27.20284,10.50309,22.52841,4.584549,7.519589,3.517346,31.6122,33.84781,25691.16,1.0


# Analyzing and Processing the dataset

## Understanding the Dataset

The numerical variables in the dataset are the result of applying a PCA transformation to the original dataset.

- Features V1 to V28 are the principal components obtained from PCA.

- The 'Time' feature is the time (in seconds) elapsed between each transaction and the first transaction in the dataset.

- The 'Amount' feature is the transaction amount.

- The 'Class' feature is the dependent variable, where the value 1 indicates that the transaction is fraudulent and the value 0 indicates otherwise.

We are going to drop the 'Time' column, on the assumption that the time elapsed since the first transaction won't help us while building the classification models.

In [5]:
ccdata.drop('Time', axis = 1, inplace = True)
ccdata.head()

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,0.090794,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,-0.166974,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,0.207643,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,-0.054952,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,0.753074,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


In [6]:
## Checking the 'Class' column of the dataset for any missing values

print("The Total number of missing values found in the Class column of the Credit Card Dataset is", 
      ccdata['Class'].isnull().sum())

The Total number of missing values found in the Class column of the Credit Card Dataset is 0
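
The check above covers only the 'Class' column. A minimal sketch of extending it to every column is shown below, using a tiny made-up frame; the name `demo` and its values are illustrative and are not taken from the credit card data.

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame; the values are made up for this sketch.
demo = pd.DataFrame({
    "V1": [0.1, np.nan, -1.3],
    "Amount": [149.62, 2.69, np.nan],
    "Class": [0, 0, 1],
})

# Count missing values per column, then across the whole frame.
missing_per_column = demo.isnull().sum()
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print("Total missing values:", total_missing)
```

Applied to the real data, the same two calls on `ccdata` would report missing values for all columns at once.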


In [7]:
ccdata.loc[:, 'Class']

0         0
1         0
2         0
3         0
4         0
         ..
284802    0
284803    0
284804    0
284805    0
284806    0
Name: Class, Length: 284807, dtype: int64

In [8]:
ccdata['Class'].unique()

array([0, 1])

## Checking the number of fraudulent cases in the entire dataset


In [9]:
Cases = len(ccdata)
print("The Total amount of transactions in the ccdataset is:", Cases,"\n")

FraudCount = len(ccdata[ccdata['Class'] == 1])
print("The Amount of Fraudulent Transactions in the ccdataset is:", FraudCount,"\n")

NonFraudCount = Cases - FraudCount 
print("The Amount of Non-Fraudulent(Genuine) Transactions in the ccdataset is:", NonFraudCount,"\n")

FraudPercent = FraudCount/Cases * 100
FraudPercent = round(FraudPercent, 3)

print("The Percentage of the Instances where the transaction has been found to be fraudulent:",FraudPercent)

The Total amount of transactions in the ccdataset is: 284807 

The Amount of Fraudulent Transactions in the ccdataset is: 492 

The Amount of Non-Fraudulent(Genuine) Transactions in the ccdataset is: 284315 

The Percentage of the Instances where the transaction has been found to be fraudulent: 0.173


## The need for data scaling

Fraudulent transactions make up only 0.173% of all cases, which shows that the data we are dealing with is highly imbalanced. This needs to be addressed before building the models and evaluating their corresponding results.

To better understand this, we are going to take a closer look at the fraudulent and the non-fraudulent (genuine) transactions.

In [10]:
FraudCases = ccdata[ccdata['Class'] == 1]

In [11]:
NonFraudCases = ccdata[ccdata['Class'] == 0]

In [12]:
print("Statistical Breakdown of the Instances which were found to be Fraudulent: \n\n", FraudCases['Amount'].describe())


Statistical Breakdown of the Instances which were found to be Fraudulent: 

 count     492.000000
mean      122.211321
std       256.683288
min         0.000000
25%         1.000000
50%         9.250000
75%       105.890000
max      2125.870000
Name: Amount, dtype: float64


In [13]:
print("Statistical Breakdown of the Instances which were Genuine: \n\n", NonFraudCases['Amount'].describe())

Statistical Breakdown of the Instances which were Genuine: 

 count    284315.000000
mean         88.291022
std         250.105092
min           0.000000
25%           5.650000
50%          22.000000
75%          77.050000
max       25691.160000
Name: Amount, dtype: float64


PCA (Principal Component Analysis) is a method used to reduce the dimensionality of a dataset while trying to retain as much information as possible. Looking at the statistics of the 'Amount' values for both classes, we can see that its values vary far more widely than those of the other variables. To reduce this wide range of values, we are going to standardise it using the StandardScaler class, which is part of the sklearn package.

Why standardisation? Standardising shifts and rescales the distribution of an attribute so that it has a mean of zero and a standard deviation of 1. Models that rely on distances or on gradient-based optimisation, such as KNN, SVM and Logistic Regression, tend to behave better when the features are on comparable scales, which aligns with our objective.
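
As a worked illustration of the formula z = (x - mean) / std, the sketch below standardises three raw amounts by hand, using the 'Amount' mean and standard deviation reported by `ccdata.describe()` earlier. It assumes that `describe()` reports the sample standard deviation while StandardScaler divides by the population standard deviation, so a small conversion is applied; the helper name `standardise` is hypothetical.

```python
import math

# Summary statistics of 'Amount' as reported by ccdata.describe() above.
n = 284807
mean_amount = 88.349619
sample_std = 250.120109  # describe() reports the sample std (ddof=1)

# StandardScaler divides by the population std (ddof=0), so convert.
population_std = sample_std * math.sqrt((n - 1) / n)


def standardise(amount):
    """Return the z-score of a single transaction amount."""
    return (amount - mean_amount) / population_std


# The first three raw amounts shown by ccdata.head().
for amount in (149.62, 2.69, 378.66):
    print(amount, "->", round(standardise(amount), 6))
```

The results should agree, to about five decimal places, with the first three scaled 'Amount' values that the notebook prints after applying StandardScaler.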

In [14]:
#Importing the StandardScaler class from the sklearn package
from sklearn.preprocessing import StandardScaler 

#setting up the Scaler
scaler = StandardScaler() 

#Loading the values of the 'Amount' column that need to be standardised
scaledamount = ccdata['Amount'].values 

#Overwriting the unscaled values with the scaled ones
ccdata['Amount'] = scaler.fit_transform(scaledamount.reshape(-1,1))

print(ccdata['Amount'].head(10))


0    0.244964
1   -0.342475
2    1.160686
3    0.140534
4   -0.073403
5   -0.338556
6   -0.333279
7   -0.190107
8    0.019392
9   -0.338516
Name: Amount, dtype: float64
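
The cell above fits the scaler on the full dataset, so the mean and standard deviation it learns include rows that later end up in the test set. One common alternative, sketched here on synthetic data, is to fit the scaler on the training split only by wrapping it in a scikit-learn pipeline. The random data, the array shapes and the name `pipe` are assumptions for illustration and are not part of the original notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data: 1000 rows, 3 features on very
# different scales, and a label that depends on the first feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3)) * np.array([1.0, 10.0, 250.0])
y = (X[:, 0] > 1.0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# The pipeline fits StandardScaler on X_train only, then reuses those
# training statistics to transform X_test at prediction time.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_train, y_train)

fitted_scaler = pipe.named_steps["standardscaler"]
print("Means learned from the training split:", fitted_scaler.mean_)
print("Test accuracy:", pipe.score(X_test, y_test))
```

With a single scaled column out of many, the effect on this dataset is likely small, but the pipeline keeps the test set untouched until evaluation.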


## Feature Selection and Splitting the Dataset

In this step, we define the dependent variable (y) and the independent variables (x). Here, the 'Class' feature is going to be our dependent variable (y), which is set against the other columns of the dataset (x).

After deciding on our features, we split the dataset into a training set and a testing set. The training set is used to build the models, and the testing set is used to evaluate the predictions these models make. For this project, we are going to split the data in a 75/25 ratio.

We can split the dataset using the train_test_split function, which is part of the sklearn package.

In [15]:
from sklearn.model_selection import train_test_split # Function for splitting the data

x = ccdata.drop('Class', axis = 1).values #Independent Feature
y = ccdata['Class'].values #dependent variable

x_train, x_test, y_train, y_test = train_test_split(x,y , test_size = 0.25, random_state = 0)
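
With only 492 fraud cases, a purely random split can leave the training and testing sets with slightly different fraud rates. A hedged sketch of one way to keep the class ratio the same in both parts is shown below, using the `stratify` argument of `train_test_split` on synthetic labels; the arrays `X_demo` and `y_demo` are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, highly imbalanced labels: 20 positives out of 10,000 rows.
y_demo = np.zeros(10_000, dtype=int)
y_demo[:20] = 1
X_demo = np.arange(10_000).reshape(-1, 1)

# stratify=y_demo asks for the same class proportions in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

print("Positives in train:", int(y_tr.sum()))
print("Positives in test: ", int(y_te.sum()))
```

In the notebook's own split, this would amount to adding `stratify = y` to the existing call; doing so would change which rows land in the test set and therefore the reported scores.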

# Modelling 

## 1.) Logistic Regression 

In [16]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score,classification_report, f1_score

model = LogisticRegression()

model.fit(x_train,y_train)

y_pred = model.predict(x_test)

ac_lr = accuracy_score(y_test, y_pred)
print("The Accuracy Score of the Logistic Regression Model is",ac_lr)

The Accuracy Score of the Logistic Regression Model is 0.9992977725344794


In [17]:
cm_lr = confusion_matrix(y_test, y_pred)
print("The Confusion Matrix of the Logistic Regression Model:\n\n",cm_lr)

The Confusion Matrix of the Logistic Regression Model:

 [[71071    11]
 [   39    81]]


In [18]:

ar_lr = classification_report(y_test, y_pred)
print("The Classification Report of the Logistic Regression model:\n\n",ar_lr)

The Classification Report of the Logistic Regression model:

               precision    recall  f1-score   support

           0       1.00      1.00      1.00     71082
           1       0.88      0.68      0.76       120

    accuracy                           1.00     71202
   macro avg       0.94      0.84      0.88     71202
weighted avg       1.00      1.00      1.00     71202



## 2.) Decision Tree Classifier


In [19]:
from sklearn.tree import DecisionTreeClassifier 

treemodel =  DecisionTreeClassifier(max_depth = 4, criterion = 'entropy')
treemodel.fit(x_train, y_train)

DecisionTreeClassifier(criterion='entropy', max_depth=4)

In [20]:
tree_y_pred =  treemodel.predict(x_test)

In [21]:
ac_tree = accuracy_score(y_test, tree_y_pred)
print("The Accuracy score of the Decision Tree model is",ac_tree)

The Accuracy score of the Decision Tree model is 0.9993960843796522


In [22]:
cm_tree = confusion_matrix(y_test, tree_y_pred)
print("The Confusion Matrix of the Decision Tree Model:\n",cm_tree)

The Confusion Matrix of the Decision Tree Model:
 [[71068    14]
 [   29    91]]


In [23]:
cr_tree = classification_report(y_test, tree_y_pred)
print("The Classification Report of the Decision Tree Model\n",cr_tree)

The Classification Report of the Decision Tree Model
               precision    recall  f1-score   support

           0       1.00      1.00      1.00     71082
           1       0.87      0.76      0.81       120

    accuracy                           1.00     71202
   macro avg       0.93      0.88      0.90     71202
weighted avg       1.00      1.00      1.00     71202



## 3.) K-Nearest Neighbours

In [24]:
from sklearn.neighbors import KNeighborsClassifier 

knnmodel = KNeighborsClassifier(n_neighbors = 4)

In [25]:
knnmodel.fit(x_train, y_train)

KNeighborsClassifier(n_neighbors=4)

In [26]:
knn_y_pred = knnmodel.predict(x_test)

In [27]:
ac_knn = accuracy_score(y_test, knn_y_pred)
print("The Accuracy score of the KNN model is",ac_knn)

The Accuracy score of the KNN model is 0.9994803516755147


In [28]:
cm_knn = confusion_matrix(y_test, knn_y_pred)
print("The Confusion Matrix of the KNN Model:\n",cm_knn)

The Confusion Matrix of the KNN Model:
 [[71077     5]
 [   32    88]]


In [29]:
cr_knn = classification_report(y_test, knn_y_pred)
print("The Classification Report of the KNN Model:\n",cr_knn)

The Classification Report of the KNN Model:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00     71082
           1       0.95      0.73      0.83       120

    accuracy                           1.00     71202
   macro avg       0.97      0.87      0.91     71202
weighted avg       1.00      1.00      1.00     71202



## 4.) Support Vector Machines


In [30]:
from sklearn.svm import SVC 

svmmodel = SVC()

In [31]:
svmmodel.fit(x_train, y_train)

SVC()

In [32]:
svm_y_pred = svmmodel.predict(x_test)

In [33]:
ac_svm = accuracy_score(y_test, svm_y_pred)
print("The Accuracy score of the SVM model is",ac_svm)

The Accuracy score of the SVM model is 0.9993539507317211


In [34]:
cm_svm = confusion_matrix(y_test, svm_y_pred)
print("The Confusion Matrix of the SVM Model:\n",cm_svm)

The Confusion Matrix of the SVM Model:
 [[71076     6]
 [   40    80]]


In [35]:
cr_svm = classification_report(y_test, svm_y_pred)
print("The Classification Report for the SVM Model:\n",cr_svm)

The Classification Report for the SVM Model:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00     71082
           1       0.93      0.67      0.78       120

    accuracy                           1.00     71202
   macro avg       0.96      0.83      0.89     71202
weighted avg       1.00      1.00      1.00     71202




## 5.) Random Forest Classifier

In [36]:
from sklearn.ensemble import RandomForestClassifier 

rfmodel = RandomForestClassifier(max_depth =  4)

In [37]:
rfmodel.fit(x_train,y_train)

RandomForestClassifier(max_depth=4)

In [38]:
rf_y_pred = rfmodel.predict(x_test)

In [39]:
ac_rf = accuracy_score(y_test, rf_y_pred)
print("The Accuracy score of the Random Forest Tree model is",ac_rf)

The Accuracy score of the Random Forest Tree model is 0.9993820398303418


In [40]:
cm_rf = confusion_matrix(y_test, rf_y_pred)
print("The confusion Matrix of the Random Forest Tree model:\n",cm_rf)

The confusion Matrix of the Random Forest Tree model:
 [[71074     8]
 [   36    84]]


In [41]:
ar_rf = classification_report(y_test,rf_y_pred)
print("The Classification Report of the Random Forest Tree model:\n",ar_rf)

The Classification Report of the Random Forest Tree model:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00     71082
           1       0.91      0.70      0.79       120

    accuracy                           1.00     71202
   macro avg       0.96      0.85      0.90     71202
weighted avg       1.00      1.00      1.00     71202



# Evaluation



We are going to compare the classification models we have built using evaluation metrics from the sklearn package, and use these metrics to decide which model performs best on this dataset. As you can see above, the measures we use for evaluation here are the accuracy score, the confusion matrix and the values in the classification report.


## 1.) Classification Accuracy

Accuracy (the accuracy score, to be specific) is the ratio of the number of correct predictions to the total number of predictions. It's the most common evaluation metric for classification problems.
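
Because fraud is so rare here, accuracy on its own can flatter a model. As a rough illustration, the sketch below computes the accuracy of a hypothetical baseline that labels every test transaction as genuine, using the test-set class counts from the 'support' column of the classification reports above (71082 genuine, 120 fraudulent). The baseline is an assumption introduced for comparison and is not one of the models built in this notebook.

```python
# Class counts in the test set, taken from the 'support' column above.
genuine_count = 71082
fraud_count = 120
total = genuine_count + fraud_count

# A baseline that always predicts "genuine" gets every genuine case right
# and every fraud case wrong.
baseline_accuracy = genuine_count / total
baseline_fraud_recall = 0 / fraud_count

print("Baseline accuracy:", round(baseline_accuracy, 6))
print("Baseline fraud recall:", baseline_fraud_recall)
```

Any accuracy score reported for the five models is best read against this baseline rather than against zero.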

In [42]:
print("ACCURACY SCORES")
print("\nThe Accuracy score of the Logistic Regression model is",round(ac_lr, 6))
print("\nThe Accuracy score of the Decision Tree model is",round(ac_tree, 6))
print("\nThe Accuracy score of the KNN model is",round(ac_knn, 6))
print("\nThe Accuracy score of the SVM model is",round(ac_svm, 6))
print("\nThe Accuracy score of the Random Forest Tree model is",round(ac_rf, 6))

ACCURACY SCORES

The Accuracy score of the Logistic Regression model is 0.999298

The Accuracy score of the Decision Tree model is 0.999396

The Accuracy score of the KNN model is 0.99948

The Accuracy score of the SVM model is 0.999354

The Accuracy score of the Random Forest Tree model is 0.999382


Comparing the accuracy scores of the models, we can see that the KNN model scores highest whereas the Logistic Regression model scores lowest. On this metric, we can take the KNN model to be the most accurate of the 5.

## 2.) Confusion Matrix

In [46]:
print("The Confusion Matrix of the Logistic Regression model is\n",cm_lr)
print("\n\nThe Confusion Matrix for the KNN Model is\n",cm_knn)
print("\n\nThe confusion Matrix of the Decision Tree Model is\n",cm_tree)
print("\n\nThe Confusion Matrix for the SVM Model is\n",cm_svm)
print("\n\nThe confusion Matrix for the Random Forest Tree model:\n",cm_rf)

The Confusion Matrix of the Logistic Regression model is
 [[71071    11]
 [   39    81]]


The Confusion Matrix for the KNN Model is
 [[71077     5]
 [   32    88]]


The confusion Matrix of the Decision Tree Model is
 [[71068    14]
 [   29    91]]


The Confusion Matrix for the SVM Model is
 [[71076     6]
 [   40    80]]


The confusion Matrix for the Random Forest Tree model:
 [[71074     8]
 [   36    84]]


## Breaking down the Confusion Matrix:

The confusion matrix is a table that summarises the performance of a classification model. It sets the predicted values made by the model against the actual values. The output of the table should be read as

| | Predicted Yes | Predicted No |
|---:|:---|:---|
| Actual Yes (Genuine Transaction) | True Positive (TP) | False Negative (FN) |
| Actual No (Fraudulent Transaction) | False Positive (FP) | True Negative (TN) |


True Positives (TP): The prediction is yes and it lines up with the actual value (a Genuine Transaction).

True Negatives (TN): The prediction is no and it is correct, as it's a Fraudulent Transaction.

False Positives (FP): The prediction is yes but it is wrong, as it's a Fraudulent Transaction (Type I error).

False Negatives (FN): The prediction is no but it is wrong, as it's a Genuine Transaction (Type II error).
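
To make the layout concrete, the sketch below labels the four cells of the KNN confusion matrix printed above, following the convention of this section in which a genuine transaction (class 0) is the "Yes" case. Note that scikit-learn's `classification_report` and `f1_score` treat class 1 (fraud) as the positive class by default for binary labels, so the roles of the cells swap when reading those outputs. The variable names are illustrative.

```python
# KNN confusion matrix from above: rows are actual classes (0, 1),
# columns are predicted classes (0, 1).
cm_knn = [[71077, 5],
          [32, 88]]

# Convention used in this section: "Yes" = genuine (class 0).
tp_genuine = cm_knn[0][0]  # genuine predicted as genuine
fn_genuine = cm_knn[0][1]  # genuine predicted as fraudulent
fp_genuine = cm_knn[1][0]  # fraudulent predicted as genuine
tn_genuine = cm_knn[1][1]  # fraudulent predicted as fraudulent

# scikit-learn default for binary metrics: positive = fraud (class 1).
tp_fraud, fn_fraud = cm_knn[1][1], cm_knn[1][0]
fp_fraud, tn_fraud = cm_knn[0][1], cm_knn[0][0]

print("Genuine as positive:", tp_genuine, fn_genuine, fp_genuine, tn_genuine)
print("Fraud as positive:  ", tp_fraud, fn_fraud, fp_fraud, tn_fraud)
```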




#### We get the Classification Report values of a model by using these confusion matrix values in our calculations:

Accuracy = (TruePositives + TrueNegatives) / Total Number of Cases

Precision = TruePositives / (TruePositives + FalsePositives)

Recall = TruePositives / (TruePositives + FalseNegatives)

F-Measure = (2 * Precision * Recall) / (Precision + Recall)
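
As a worked check of these formulas, the sketch below plugs in the cells of the KNN confusion matrix with fraud (class 1) taken as the positive class, which is the convention behind the class-1 row of the classification report. The helper name `report_metrics` is hypothetical.

```python
def report_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F-measure from cell counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = (2 * precision * recall) / (precision + recall)
    return accuracy, precision, recall, f_measure


# KNN matrix [[71077, 5], [32, 88]] with fraud as the positive class.
accuracy, precision, recall, f_measure = report_metrics(tp=88, fp=5, fn=32, tn=71077)

print("Accuracy: ", round(accuracy, 6))
print("Precision:", round(precision, 2))
print("Recall:   ", round(recall, 2))
print("F-measure:", round(f_measure, 6))
```

These values agree with the KNN accuracy score, the class-1 row of its classification report (precision 0.95, recall 0.73) and its F1 score reported elsewhere in this notebook.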


Comparing the above confusion matrices, we can see that the sum of the True Positive and True Negative values is highest for the KNN model (71077 + 88 = 71165). This means that the model classified 71077 genuine cases correctly (the highest of all the models) and flagged 88 fraudulent cases correctly (the second highest, after the Decision Tree's 91). Given its high TP and TN values, we take it to be the most accurate of all the models.
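
The comparison can be reproduced directly from the matrices. A small sketch follows, with the five matrices copied from the printed output above into a dictionary; the names `matrices`, `correct` and `best_model` are illustrative.

```python
# Confusion matrices copied from the printed output above.
matrices = {
    "Logistic Regression": [[71071, 11], [39, 81]],
    "KNN": [[71077, 5], [32, 88]],
    "Decision Tree": [[71068, 14], [29, 91]],
    "SVM": [[71076, 6], [40, 80]],
    "Random Forest": [[71074, 8], [36, 84]],
}

# Correct predictions are the two diagonal cells of each matrix.
correct = {name: cm[0][0] + cm[1][1] for name, cm in matrices.items()}

# List the models from most to fewest correct predictions.
for name, count in sorted(correct.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {count} correct out of 71202")

best_model = max(correct, key=correct.get)
print("Most correct predictions:", best_model)
```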



## 3.) F1 Scores

The accuracy score isn't the only thing we use to evaluate our models. The F1 score, or F-score, is another popular metric used to evaluate classification models. It is defined as the harmonic mean of precision (the number of true positive results divided by the number of all results predicted as positive) and recall (the number of true positive results divided by the number of all samples that should have been identified as positive).
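
The harmonic mean is used because it stays low unless both precision and recall are high. A minimal sketch with made-up values follows; 0.95 and 0.10 are illustrative and are not results from this notebook, and the helper name `f1_from` is hypothetical.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Illustrative values: very precise, but most positives are missed.
precision, recall = 0.95, 0.10

arithmetic_mean = (precision + recall) / 2
harmonic_mean = f1_from(precision, recall)

print("Arithmetic mean:", round(arithmetic_mean, 3))
print("F1 (harmonic):  ", round(harmonic_mean, 3))
```

The arithmetic mean sits above 0.5 here even though only a tenth of the positives are found, while the F1 score stays close to the weaker of the two values.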

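The classification_report function from the sklearn package already shows the precision, recall and F1 score for each model. Here we use the f1_score function from the same package to print the F1 scores of all the models on their own.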

In [43]:
print("F1 SCORES")
print("\nThe F1 score of the Logistic Regression Model is",f1_score(y_test, y_pred))
print("\nThe F1 score of the Decision Tree Model is",f1_score(y_test, tree_y_pred))
print("\nThe F1 score of the KNN Model is",f1_score(y_test, knn_y_pred))
print("\nThe F1 score of the SVM Model is",f1_score(y_test, svm_y_pred))
print("\nThe F1 score of the Random Forest Tree Model is",f1_score(y_test, rf_y_pred))

F1 SCORES

The F1 score of the Logistic Regression Model is 0.7641509433962266

The F1 score of the Decision Tree Model is 0.808888888888889

The F1 score of the KNN Model is 0.8262910798122065

The F1 score of the SVM Model is 0.7766990291262136

The F1 score of the Random Forest Tree Model is 0.7924528301886793


As with the previous evaluation metric, we can see that the KNN model has the best score and that the Logistic Regression model's F1 score is the lowest of the five.
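
The F1 scores above can also be recovered from the confusion matrices, which makes it easy to see precision and recall side by side. A sketch follows, with fraud (class 1) as the positive class and the matrices copied from the earlier output; `fraud_scores` is a hypothetical helper.

```python
def fraud_scores(cm):
    """Precision, recall and F1 for class 1 from a matrix [[TN, FP], [FN, TP]]."""
    fp, fn, tp = cm[0][1], cm[1][0], cm[1][1]
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return precision, recall, f1


# Confusion matrices copied from the printed output earlier in the notebook.
matrices = {
    "Logistic Regression": [[71071, 11], [39, 81]],
    "Decision Tree": [[71068, 14], [29, 91]],
    "KNN": [[71077, 5], [32, 88]],
    "SVM": [[71076, 6], [40, 80]],
    "Random Forest": [[71074, 8], [36, 84]],
}

scores = {name: fraud_scores(cm) for name, cm in matrices.items()}

# Print the models from highest to lowest F1.
for name, (p, r, f1) in sorted(scores.items(), key=lambda kv: kv[1][2], reverse=True):
    print(f"{name:20s} precision={p:.2f} recall={r:.2f} F1={f1:.4f}")
```

On these numbers KNN has the highest F1, while the Decision Tree has the highest recall on the fraud class (91 of 120 caught).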

## Conclusion

So in this Python file, we were able to:

a.) Standardise the 'Amount' feature of an imbalanced dataset whose other features were the result of a PCA transformation,

b.) Set the feature variables and split the entire dataset into Training and Testing Datasets,

c.) Apply 5 different Classification Algorithms to the Training datasets, and

d.) Compare their respective performance by using 3 different evaluation metrics (Accuracy Score, Confusion Matrix and F-Score)

By comparing all the performance metrics, we came to the conclusion that, out of the 5 different classification models we applied to the Credit Card dataset, the KNN (K-Nearest Neighbours) model is the most suitable one for this particular case.