# Data Science Project - Detecting Fraudulent Credit Card Transactions

The research question is whether we can determine if a credit card transaction is fraudulent, using the Credit Card Fraud Detection dataset from Kaggle.

First we need to import the necessary libraries and load the data, then do some early exploratory data analysis to better understand it.

In [2]:
""" Identify whether a credit card transaction is fraudulent, using credit card transaction data from Kaggle. """

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline


# Load the test and training data
# (raw strings stop the backslashes being read as escape sequences)
train_raw_df = pd.read_csv(r".\Data\creditcard_train.csv")
test_raw_df = pd.read_csv(r".\Data\creditcard_test.csv")

# See how many rows and columns there are
train_raw_df.shape
test_raw_df.shape

# Look for null values and make sure data types are matching
print(train_raw_df.info())
print(test_raw_df.info())

# Get a brief visual look at the actual values in the data and make some initial deductions
train_raw_df.head()




<class 'pandas.core.frame.DataFrame'>
RangeIndex: 199364 entries, 0 to 199363
Data columns (total 31 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   Time    199364 non-null  float64
 1   V1      199364 non-null  float64
 2   V2      199364 non-null  float64
 3   V3      199364 non-null  float64
 4   V4      199364 non-null  float64
 5   V5      199364 non-null  float64
 6   V6      199364 non-null  float64
 7   V7      199364 non-null  float64
 8   V8      199364 non-null  float64
 9   V9      199364 non-null  float64
 10  V10     199364 non-null  float64
 11  V11     199364 non-null  float64
 12  V12     199364 non-null  float64
 13  V13     199364 non-null  float64
 14  V14     199364 non-null  float64
 15  V15     199364 non-null  float64
 16  V16     199364 non-null  float64
 17  V17     199364 non-null  float64
 18  V18     199364 non-null  float64
 19  V19     199364 non-null  float64
 20  V20     199364 non-null  float64
 21  V21     19

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,33419.0,-2.178201,-3.132187,1.315758,-0.129783,-2.736013,0.743459,-0.752718,-2.650826,-0.184284,...,-0.828762,-0.219136,-1.004913,0.788588,1.061994,-0.319407,-0.132313,0.333476,937.75,0
1,151317.0,2.064423,0.185575,-1.684612,0.411066,0.479555,-0.797963,0.205544,-0.240568,0.415454,...,-0.351331,-0.876025,0.343288,0.522189,-0.259568,0.173623,-0.05628,-0.029665,1.98,0
2,132434.0,-0.547505,0.798072,-0.719939,-1.129561,0.925708,0.763338,0.231338,0.799204,-0.277812,...,0.366664,1.068933,-0.101523,-1.604148,-0.318277,0.838076,0.012324,-0.015564,11.95,0
3,81787.0,-0.94571,0.323579,0.595681,-1.288095,0.818906,-0.748491,0.890076,-0.130671,-0.471365,...,-0.371528,-1.14951,0.217859,-0.507989,-0.026857,0.591496,-0.326179,-0.007543,24.98,0
4,125062.0,1.898722,-0.321038,-1.771837,0.672408,0.115019,-1.267347,0.61281,-0.44107,0.450298,...,0.015111,0.006269,-0.029094,-0.071333,0.179444,0.378225,-0.106042,-0.059506,104.36,0


A first inspection shows that our 30 features are all numerical, and that the label in the 'Class' column is categorical with just two classes, "1" and "0".

Fortunately, there are no null or missing values in either the training or the test set.

We can also see that features 'V1' to 'V28' might already have been feature-scaled in some way, whereas 'Time' and 'Amount' have not. Let's use .describe() on every column to check this.

In [None]:
for column in train_raw_df:
    print(train_raw_df[column].describe(), "\n")

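The output above supports these initial thoughts: the means of all the columns from 'V1' to 'V28' are extremely close to zero, so these features have at least been centred. A mean of zero alone does not prove z-score standardisation, which would also require each standard deviation to be close to one, so the 'std' rows are worth checking as well. The dataset documentation states that these columns are the output of a PCA transformation, which is consistent with zero-mean features.

It therefore makes sense to standardise the two unscaled features, 'Time' and 'Amount', but only at the point where we use ML models to classify the data.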
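
A minimal sketch of that scaling step, assuming plain z-score standardisation with the mean and standard deviation taken from the training set only, so that nothing about the test set leaks into the transformation. The helper name `standardise_columns` and the toy frames are hypothetical and are not part of the notebook.

```python
import pandas as pd


def standardise_columns(train, test, columns):
    """Z-score the given columns using the training mean and std only."""
    train_scaled = train.copy()
    test_scaled = test.copy()
    for col in columns:
        mean = train[col].mean()
        std = train[col].std(ddof=0)  # population standard deviation
        train_scaled[col] = (train[col] - mean) / std
        test_scaled[col] = (test[col] - mean) / std
    return train_scaled, test_scaled


# Toy stand-ins for train_df / test_df (values are made up for illustration)
toy_train = pd.DataFrame(
    {"Time": [0.0, 10.0, 20.0, 30.0], "Amount": [1.0, 5.0, 9.0, 105.0]}
)
toy_test = pd.DataFrame({"Time": [15.0, 40.0], "Amount": [2.0, 50.0]})

toy_train_scaled, toy_test_scaled = standardise_columns(
    toy_train, toy_test, ["Time", "Amount"]
)
print(toy_train_scaled)
print(toy_test_scaled)
```

The test set is transformed with the training statistics, so its scaled columns will generally not have a mean of exactly zero.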

In the meantime, let us continue with further exploratory data analysis.

In [8]:
# Look for duplicate values
print("Train duplicates:", train_raw_df.duplicated().sum())
print("Test duplicates:", test_raw_df.duplicated().sum())

train_duplicates = train_raw_df[train_raw_df.duplicated()]
train_duplicates.sort_values("Time")

Train duplicates: 585
Test duplicates: 131


Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
152621,26.0,-0.529912,0.873892,1.347247,0.145457,0.414209,0.100223,0.711206,0.176066,-0.286717,...,0.046949,0.208105,-0.185548,0.001031,0.098816,-0.552904,-0.073288,0.023307,6.14,0
162734,74.0,1.038370,0.127486,0.184456,1.109950,0.441699,0.945283,-0.036715,0.350995,0.118950,...,0.102520,0.605089,0.023092,-0.626463,0.479120,-0.166937,0.081247,0.001192,1.18,0
188909,145.0,-2.419486,1.949346,0.552998,0.982710,-0.284815,2.411200,-1.398537,-0.188922,0.675695,...,1.213390,-1.238354,0.007191,-1.724175,0.239721,-0.313607,-0.187431,0.119472,6.74,0
114880,919.0,0.904289,-0.538055,0.396058,0.500680,-0.864473,-0.657199,0.027231,-0.029473,0.265447,...,-0.099460,-0.597579,-0.048666,0.551824,0.182934,0.402176,-0.081357,0.027252,158.00,0
116200,919.0,1.207596,-0.036860,0.572104,0.373148,-0.709633,-0.713698,-0.181105,0.011277,0.283940,...,-0.194591,-0.514717,0.089714,0.543768,0.240581,0.418921,-0.051693,-0.000085,1.00,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
171037,170731.0,2.033492,0.766969,-2.107555,3.631952,1.348594,-0.499907,0.945159,-0.286392,-1.370581,...,0.241894,0.658545,-0.102644,0.580535,0.643637,0.347240,-0.116618,-0.078601,0.76,0
120908,171288.0,1.912550,-0.455240,-1.750654,0.454324,2.089130,4.160019,-0.881302,1.081750,1.022928,...,-0.524067,-1.337510,0.473943,0.616683,-0.283548,-1.084843,0.073133,-0.036020,11.99,0
168708,171627.0,-1.464380,1.368119,0.815992,-0.601282,-0.689115,-0.487154,-0.303778,0.884953,0.054065,...,0.287217,0.947825,-0.218773,0.082926,0.044127,0.639270,0.213565,0.119251,6.82,0
149684,172233.0,-2.691642,3.123168,-3.339407,1.017018,-0.293095,-0.167054,-0.745886,2.325616,-1.634651,...,0.402639,0.259746,-0.086606,-0.097597,0.083693,-0.453584,-1.205466,-0.213020,36.74,0


It appears we have a number of duplicates in our datasets. Unfortunately, our data does not contain a clearly identifiable primary key such as a 'Transaction ID'. If it did, we could simply remove the rows that shared the same transaction ID.

Looking at the documentation for the dataset (Source: https://www.kaggle.com/mlg-ulb/creditcardfraud), we can see that V1-V28 were most likely anonymised to protect users' identities. This suggests that the values in these fields, taken together, could be enough to uniquely identify a person. It is therefore highly improbable that two genuinely separate transactions would share exactly the same values in all of V1-V28 as well as in 'Time' and 'Amount'. On this basis, it seems sensible to remove the duplicate rows.
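
Before dropping anything, it can help to look at every copy of a duplicated row rather than only the repeats. The sketch below uses a small made-up frame (`toy_df` is hypothetical) to show the difference between the default `duplicated()`, `duplicated(keep=False)` and `drop_duplicates()`.

```python
import pandas as pd

# Toy frame (made-up values): rows 0 and 2 are identical in every column
toy_df = pd.DataFrame(
    {
        "Time": [26.0, 74.0, 26.0, 919.0],
        "V1": [-0.53, 1.04, -0.53, 0.90],
        "Amount": [6.14, 1.18, 6.14, 158.00],
        "Class": [0, 0, 0, 0],
    }
)

# duplicated() flags only the second and later copies of a row
n_repeats = toy_df.duplicated().sum()

# keep=False flags every member of a duplicated group, which makes it
# possible to compare the copies side by side before dropping them
all_copies = toy_df[toy_df.duplicated(keep=False)]

# drop_duplicates() keeps the first copy of each group
deduplicated = toy_df.drop_duplicates()

print(n_repeats, len(all_copies), len(deduplicated))  # 1 2 3
```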

In [11]:
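# Remove duplicate data (.copy() gives independent frames, which avoids
# any SettingWithCopyWarning on later column assignments)
train_df = train_raw_df.drop_duplicates().copy()
test_df = test_raw_df.drop_duplicates().copy()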
print(train_raw_df.shape)
print(train_df.shape)


(199364, 31)
(198779, 31)


In [10]:
# Convert the 'Class' column from int64 to category, as we know it is a categorical variable
train_df["Class"] = train_df["Class"].astype("category")
train_df["Class"]


0         0
1         0
2         0
3         0
4         0
         ..
199359    0
199360    0
199361    0
199362    0
199363    0
Name: Class, Length: 199364, dtype: category
Categories (2, int64): [0, 1]

## Imbalanced Data - Undersampling vs Oversampling

Data is imbalanced when the distribution of classes is uneven, with a clear majority class and minority class. Undersampling involves removing examples from the majority class to help balance the data, whereas oversampling involves duplicating examples from the minority class. (Source: https://machinelearningmastery.com/random-oversampling-and-undersampling-for-imbalanced-classification/)
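
The two ideas can be sketched with plain pandas. This is an illustration on a made-up frame (`toy_df` is hypothetical), assuming simple random resampling until both classes have the same number of rows. It is not necessarily the approach the final model will use.

```python
import pandas as pd

# Toy imbalanced frame (made-up values): 8 legitimate rows, 2 fraudulent rows
toy_df = pd.DataFrame({"Amount": range(10), "Class": [0] * 8 + [1] * 2})

majority = toy_df[toy_df["Class"] == 0]
minority = toy_df[toy_df["Class"] == 1]

# Random undersampling: keep only as many majority rows as there are minority rows
undersampled = pd.concat(
    [majority.sample(n=len(minority), random_state=42), minority]
)

# Random oversampling: draw minority rows with replacement until the classes match
oversampled = pd.concat(
    [majority, minority.sample(n=len(majority), replace=True, random_state=42)]
)

print(undersampled["Class"].value_counts().to_dict())
print(oversampled["Class"].value_counts().to_dict())
```

Resampling would normally be applied to the training set only, so that the test set keeps the original class distribution.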





Now we should try to identify influential variables by performing further exploratory analysis whilst also cleaning up the data.


In [None]:
# Research fraud detection
# Do main data cleaning stuff here
# Identify influential variables

Now we want to visualise the data so we need to perform dimensionality reduction.
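
One common way to do this is principal component analysis (PCA). The sketch below is an assumption about how the step could look, not the notebook's chosen method: it projects a synthetic feature matrix onto its first two principal components using NumPy's SVD. The function name `project_to_2d` and the random data are hypothetical.

```python
import numpy as np


def project_to_2d(features):
    """Project rows onto their first two principal components via SVD."""
    centred = features - features.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return centred @ vt[:2].T


# Synthetic stand-in for the feature matrix (random values, not the Kaggle data)
rng = np.random.default_rng(0)
toy_features = rng.normal(size=(200, 30))
toy_2d = project_to_2d(toy_features)

print(toy_2d.shape)  # (200, 2)
```

The two resulting columns could then be passed to `plt.scatter`, coloured by 'Class'. Features on very different scales, such as 'Time' and 'Amount', would usually be standardised first so that they do not dominate the projection.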

In [None]:
# Do dimensionality reduction
# Do some good plt plots

Identify and discuss at least 2 suitable evaluation metrics for this task. Then classify the data.

Since we are working with imbalanced data, it does not make sense to use accuracy as an evaluation metric. This is because if the system just predicted everything to be negative (i.e. class = 0) it would still get a high accuracy score. Instead we should look at precision, recall, and the F1 score, which is the harmonic mean of the previous two.

Since we have a low number of overall positive cases (i.e. where class = 1), recall and F1 score will be the two metrics used here. Recall is the number of true positives divided by the total number of actual positives in the dataset, so it gives a low score if the model produces a lot of false negatives, that is, missed frauds. The F1 score combines precision and recall into a single number, so a model cannot score well by maximising one at the expense of the other. (Source: https://machinelearningmastery.com/classification-accuracy-is-not-enough-more-performance-measures-you-can-use/)
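
A small worked example of why accuracy misleads here. The labels are made up (98 legitimate transactions and 2 frauds), and the helper `precision_recall_f1` is a hypothetical name written out by hand so that each definition is visible.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Made-up labels: 98 legitimate transactions and 2 fraudulent ones
y_true = [0] * 98 + [1] * 2
always_negative = [0] * 100  # a "model" that never predicts fraud

accuracy = sum(t == p for t, p in zip(y_true, always_negative)) / len(y_true)
precision, recall, f1 = precision_recall_f1(y_true, always_negative)

print(accuracy)               # 0.98
print(precision, recall, f1)  # 0.0 0.0 0.0
```

The always-negative model reaches 98% accuracy on these labels while catching no fraud at all, which recall and F1 expose immediately.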

In [None]:
# Research what evaluation metrics are good
# Learn what it means by classify
# Standardise 'Time' and 'Amount'       

Using a model-based method, identify the top 8 most influential variables in the dataset.


In [None]:
# Run a model here to get the top 8 most influential variables
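
One possible model-based approach, sketched here under assumptions, is to fit a random forest and rank the columns by its impurity-based `feature_importances_`. The use of scikit-learn's `RandomForestClassifier`, the `class_weight="balanced"` setting, and the synthetic `toy_X` / `toy_y` data are all illustrative choices and are not something the notebook specifies.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the training data (random values, not the Kaggle data)
rng = np.random.default_rng(0)
toy_X = pd.DataFrame(
    rng.normal(size=(500, 12)), columns=[f"F{i}" for i in range(12)]
)
# The toy label depends only on F3, so F3 should rank highly
toy_y = (toy_X["F3"] > 1.0).astype(int)

model = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=0
)
model.fit(toy_X, toy_y)

# feature_importances_ holds one impurity-based score per input column
importances = pd.Series(model.feature_importances_, index=toy_X.columns)
top_8 = importances.sort_values(ascending=False).head(8)
print(top_8)
```

On the real data, `toy_X` would be replaced by the feature columns of `train_df` and `toy_y` by its 'Class' column.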