Dataset imported from
https://www.kaggle.com/mlg-ulb/creditcardfraud

# Data Exploration

In [1]:
import pandas as pd
import numpy as np
# load dataset
fraud_df = pd.read_csv("creditcard.csv")

In [2]:
# dataset shape
fraud_df.shape

(284807, 31)

In [3]:
fraud_df.columns
#column names

Index(['Time', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10',
       'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20',
       'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount',
       'Class'],
      dtype='object')

In [4]:
#Unique values of target variable :- 
fraud_df['Class'].unique()

array([0, 1])

The target variable Class takes two values:

0 for non-fraudulent transactions
1 for fraudulent transactions

Because the goal is to find fraudulent transactions, fraud is treated as the positive class and is labeled 1.

Next, we check how many samples each target class has.

In [5]:
#Number of samples under each target value :- 
fraud_df['Class'].value_counts()

0    284315
1       492
Name: Class, dtype: int64

We have 284315 non-fraudulent transaction samples and 492 fraudulent transaction samples.
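
To put these counts in perspective, here is a short sketch in plain Python that uses only the two counts printed above (the variable names are illustrative). It computes the share of fraudulent samples and the imbalance ratio:

```python
# Class counts copied from the value_counts() output above.
non_fraud_count = 284315
fraud_count = 492

total = non_fraud_count + fraud_count
fraud_share = fraud_count / total                # fraction of fraudulent rows
imbalance_ratio = non_fraud_count / fraud_count  # non-fraud rows per fraud row

print(f"Total samples     : {total}")
print(f"Fraud share       : {fraud_share:.4%}")
print(f"Non-fraud : fraud : {imbalance_ratio:.0f} : 1")
```

Fraudulent transactions make up about 0.17% of the rows, or roughly one fraud sample for every 578 non-fraud samples.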



# Credit Card Data Preprocessing

Preprocessing is the process of cleaning the dataset. In this step, we apply different methods to the raw data so that the modeling phase receives more meaningful input. These methods include:

Removing duplicate or irrelevant samples
Replacing missing values with the most relevant values
Converting one data type to another, for example categorical values to integers
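
The three steps above can be sketched in plain Python on a made-up toy table. The columns `amount` and `channel` are hypothetical and are not part of the credit card dataset. In pandas, the same steps would typically use `drop_duplicates()`, `fillna()` and `astype()`.

```python
from statistics import median

# Hypothetical toy rows; not taken from the credit card dataset.
rows = [
    {"amount": 10.0, "channel": "web"},
    {"amount": 10.0, "channel": "web"},    # exact duplicate of the first row
    {"amount": None, "channel": "store"},  # missing amount
    {"amount": 30.0, "channel": "web"},
]

# 1. Remove duplicate rows while keeping the original order.
seen = set()
unique_rows = []
for row in rows:
    key = (row["amount"], row["channel"])
    if key not in seen:
        seen.add(key)
        unique_rows.append(dict(row))

# 2. Fill missing amounts with the median of the known amounts.
known = [r["amount"] for r in unique_rows if r["amount"] is not None]
fill_value = median(known)
for r in unique_rows:
    if r["amount"] is None:
        r["amount"] = fill_value

# 3. Convert the categorical column to integer codes.
categories = sorted({r["channel"] for r in unique_rows})
codes = {name: code for code, name in enumerate(categories)}
for r in unique_rows:
    r["channel"] = codes[r["channel"]]

print(unique_rows)
```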

Removing irrelevant columns/features
In our dataset, the only feature we treat as not useful is Time, so we drop it from the dataset.

In [6]:
# make sure which features are useful & which are not
# we can remove irrelevant features
fraud_df = fraud_df.drop(['Time'], axis=1)
fraud_df.columns

Index(['V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11',
       'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20', 'V21',
       'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount', 'Class'],
      dtype='object')

Checking null or NaN values
We can check the data type of every feature and, at the same time, its number of non-null values by using the info() method of pandas.

A null or NaN value means that no value is recorded for that particular feature or attribute.



In [7]:
fraud_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 30 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   V1      284807 non-null  float64
 1   V2      284807 non-null  float64
 2   V3      284807 non-null  float64
 3   V4      284807 non-null  float64
 4   V5      284807 non-null  float64
 5   V6      284807 non-null  float64
 6   V7      284807 non-null  float64
 7   V8      284807 non-null  float64
 8   V9      284807 non-null  float64
 9   V10     284807 non-null  float64
 10  V11     284807 non-null  float64
 11  V12     284807 non-null  float64
 12  V13     284807 non-null  float64
 13  V14     284807 non-null  float64
 14  V15     284807 non-null  float64
 15  V16     284807 non-null  float64
 16  V17     284807 non-null  float64
 17  V18     284807 non-null  float64
 18  V19     284807 non-null  float64
 19  V20     284807 non-null  float64
 20  V21     284807 non-null  float64
 21  V22     28

The output of info() summarizes our dataset:

Total number of samples or rows
Column names
Number of non-null values
The data type of each column

Our dataset doesn't have any null values: the index covers 284807 rows (0 to 284806), and every column reports 284807 non-null values.
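
A more direct check is to count the missing entries per column. The following is a minimal sketch in plain Python on a made-up table, where `count_missing` is a hypothetical helper. In pandas, the usual equivalent is `DataFrame.isnull().sum()`.

```python
import math

def count_missing(columns):
    """Return the number of None/NaN entries for each column."""
    def is_missing(value):
        return value is None or (isinstance(value, float) and math.isnan(value))
    return {name: sum(is_missing(v) for v in values)
            for name, values in columns.items()}

# Hypothetical toy columns; the real dataset reports no missing values.
toy_columns = {
    "V1": [0.5, -1.2, 0.3],
    "Amount": [149.62, float("nan"), None],
}
missing = count_missing(toy_columns)
print(missing)
```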

# Data Transformation
Except for the Amount column, the values of all columns lie within a similar, small range. So let's rescale the Amount values to a comparable range.

We can do this with StandardScaler from the sklearn library.

In [8]:
#few values of Amount column :- 
fraud_df['Amount'][0:4]

0    149.62
1      2.69
2    378.66
3    123.50
Name: Amount, dtype: float64

The values of the Amount feature span a much larger range than the other feature values.

We will rescale them to a smaller range.

In [9]:
# data preprocessing
from sklearn.preprocessing import StandardScaler

# standardize Amount to zero mean and unit variance
fraud_df['norm_amount'] = StandardScaler().fit_transform(
    fraud_df['Amount'].values.reshape(-1, 1))
fraud_df = fraud_df.drop(['Amount'], axis=1)
fraud_df['norm_amount'][0:4]



0    0.244964
1   -0.342475
2    1.160686
3    0.140534
Name: norm_amount, dtype: float64

The scaled result is added to the data frame as a new column named norm_amount, and the original Amount column is then dropped because it is no longer needed.
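
Standardization subtracts the mean and divides by the population standard deviation. The sketch below shows that formula in plain Python, with `standardize` as a hypothetical helper. It is applied only to the four Amount values printed earlier, so the results differ from norm_amount, which is scaled with the mean and standard deviation of all rows.

```python
from statistics import fmean, pstdev

def standardize(values):
    """Scale values to zero mean and unit variance (population std)."""
    mean = fmean(values)
    std = pstdev(values)
    return [(v - mean) / std for v in values]

# First four Amount values shown above, used here only as a toy sample.
sample = [149.62, 2.69, 378.66, 123.50]
scaled = standardize(sample)
print([round(z, 3) for z in scaled])
```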



In [10]:
# Splitting the dataset
# Take all independent columns as X and the target variable as y
# (the target column is the dependent variable).

# features and target
X = fraud_df.drop(['Class'], axis=1)
y = fraud_df[['Class']]

Now we need to split the whole dataset into a training set and a test set. The training set is used to build the model, and the test set is used to evaluate the trained model.

The train_test_split function from the sklearn library performs this split.

In [11]:
# splitting dataset to train & test dataset
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
X_train.shape
X_test.shape
y_train.shape
y_test.shape

(85443, 1)
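
Because fraud cases are rare, a purely random split can leave the test set with a slightly different fraud share than the training set. One common option is a stratified split, which `train_test_split` supports through its `stratify` parameter. This is an assumption here, not something the notebook does. The sketch below illustrates the idea in plain Python on made-up labels; `stratified_split` is a hypothetical helper.

```python
import random

def stratified_split(labels, test_size=0.3, seed=0):
    """Return (train_idx, test_idx) keeping the class ratio in both parts."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Made-up labels: 90 non-fraud (0) and 10 fraud (1) samples.
labels = [0] * 90 + [1] * 10
train_idx, test_idx = stratified_split(labels, test_size=0.3)

print("test fraud count :", sum(labels[i] for i in test_idx))
print("train fraud count:", sum(labels[i] for i in train_idx))
```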

# Building Credit Card Fraud Detection using Machine Learning algorithms

Credit card fraud detection is a classification problem. In classification problems, the target variable takes integer values (0, 1) or categorical values (fraud, non-fraud). The target variable of our dataset, ‘Class’, has only two labels: 0 (non-fraudulent) and 1 (fraudulent).

# Decision tree algorithm Implementation using python sklearn library

In [12]:
## Building decision tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

def decision_tree_classification(X_train, y_train, X_test, y_test):
    # initialize object for DecisionTreeClassifier class
    dt_classifier = DecisionTreeClassifier()
    # train model by using fit method
    print("Model training starts........")
    dt_classifier.fit(X_train, y_train.values.ravel())
    print("Model training completed")
    acc_score = dt_classifier.score(X_test, y_test)
    print(f'Accuracy of model on test dataset :- {acc_score}')
    # predict result using test dataset
    y_pred = dt_classifier.predict(X_test)
    # confusion matrix
    print(f"Confusion Matrix :- \n {confusion_matrix(y_test, y_pred)}")
    # classification report for f1-score
    print(f"Classification Report :- \n {classification_report(y_test, y_pred)}")



# calling decision_tree_classification method to train and evaluate model
decision_tree_classification(X_train, y_train, X_test, y_test)

Model training starts........
Model training completed
Accuracy of model on test dataset :- 0.9992509626300574
Confusion Matrix :- 
 [[85266    30]
 [   34   113]]
Classification Report :- 
               precision    recall  f1-score   support

           0       1.00      1.00      1.00     85296
           1       0.79      0.77      0.78       147

    accuracy                           1.00     85443
   macro avg       0.89      0.88      0.89     85443
weighted avg       1.00      1.00      1.00     85443



Our decision tree classifier reaches about 99.9% accuracy on the test data.

But the F1-score for label 1 is only 0.78, far lower than the accuracy suggests.

Why is accuracy not a suitable evaluation metric for this problem?
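
The class-1 figures in the report can be reproduced by hand from the confusion matrix. This sketch uses plain Python and assumes the sklearn layout, in which rows are actual labels and columns are predicted labels:

```python
# Confusion matrix of the decision tree above: rows = actual, columns = predicted.
tn, fp = 85266, 30
fn, tp = 34, 113

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # how many predicted frauds are real frauds
recall = tp / (tp + fn)      # how many real frauds were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy : {accuracy:.4f}")
print(f"precision: {precision:.2f}")
print(f"recall   : {recall:.2f}")
print(f"f1-score : {f1:.2f}")
```

The 85266 correctly classified non-fraud samples dominate the accuracy, while precision, recall and F1 depend only on the 113, 30 and 34 samples that involve the fraud class.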

# Credit Card Fraud Detection with Random Forest Algorithm

As in the decision tree implementation above, we use X_train and y_train for training and X_test and y_test for evaluation. Here we train an ensemble model, RandomForestClassifier from sklearn, and compare its evaluation results with those of the decision tree.

Random forest algorithm Implementation using sklearn library

In [13]:
## Model with randomforest

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

def random_forest_classifier(X_train, y_train, X_test, y_test):
     # initialize object for RandomForestClassifier class
     rf_classifier = RandomForestClassifier(n_estimators=50)
     # train model by using fit method
     print("Model training starts........")
     rf_classifier.fit(X_train, y_train.values.ravel())
     acc_score = rf_classifier.score(X_test, y_test)
     print(f'Accuracy of model on test dataset :- {acc_score}')
     # predict result using test dataset
     y_pred = rf_classifier.predict(X_test)
     # confusion matrix
     print(f"Confusion Matrix :- \n {confusion_matrix(y_test, y_pred)}")
     # classification report for f1-score
     print(f"Classification Report :- \n {classification_report(y_test, y_pred)}")


# calling random_forest_classifier
random_forest_classifier(X_train, y_train, X_test, y_test)

Model training starts........
Accuracy of model on test dataset :- 0.9994616293903538
Confusion Matrix :- 
 [[85289     7]
 [   39   108]]
Classification Report :- 
               precision    recall  f1-score   support

           0       1.00      1.00      1.00     85296
           1       0.94      0.73      0.82       147

    accuracy                           1.00     85443
   macro avg       0.97      0.87      0.91     85443
weighted avg       1.00      1.00      1.00     85443



# Why Is Accuracy Not Suitable for Data Imbalance Problems?


Now we look at our dataset again and ask which evaluation metrics are best suited to this kind of problem.

For this discussion, we need to recall two things discussed earlier:

The number of samples for each Class (target variable) value.
The evaluation metrics of both the decision tree and random forest classification models.

In [14]:

fraud_df['Class'].value_counts()


0    284315
1       492
Name: Class, dtype: int64

The number of samples for Class-1 (fraudulent) is far smaller than the number of samples for Class-0 (non-fraudulent).

This kind of dataset is called unbalanced (or imbalanced) data, which means the samples of one class label greatly outnumber and dominate those of the other class label.

For a balanced dataset, accuracy is suitable because it divides the number of correctly predicted samples by the total number of samples. For an unbalanced dataset, the majority class dominates that count, so a model can score a very high accuracy while still missing many fraudulent samples.

Accuracy = number of correctly predicted samples / total number of samples
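
As a worked example of this formula, the sketch below (plain Python, using the class counts printed above) evaluates a trivial "model" that labels every transaction as non-fraudulent:

```python
# Class counts from value_counts() above.
non_fraud_count = 284315
fraud_count = 492

# A trivial "model" that labels every transaction as non-fraudulent (0).
correct = non_fraud_count            # all class-0 samples are right
total = non_fraud_count + fraud_count
baseline_accuracy = correct / total
baseline_recall = 0 / fraud_count    # no fraud case is ever detected

print(f"baseline accuracy      : {baseline_accuracy:.4f}")
print(f"baseline recall (fraud): {baseline_recall:.2f}")
```

This baseline scores an accuracy of about 0.998 without detecting a single fraudulent transaction, which is why accuracy alone says little on this dataset.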

We can use any of the metrics below for unbalanced or skewed datasets:

Recall
Precision
F1-score
Area Under ROC curve

For both classification models (decision tree and random forest), these metrics differ sharply from the accuracy: accuracy is above 0.999 for both models, while the recall for class 1 is only 0.77 and 0.73.




# Model Improvement Using Sampling Techniques


Data sampling is the statistical method for selecting data points (here, the data point is a single row) from the whole dataset. In machine learning problems, there are many sampling techniques available.

Here we take undersampling and oversampling strategies for handling imbalanced data.  

What are undersampling and oversampling?
Let us take an example of a dataset that has nine samples:

Six samples belong to class-0
Three samples belong to class-1

Oversampling = the 6 class-0 samples + the 3 class-1 samples repeated 2 times (6 samples), 12 samples in total

Undersampling = the 3 class-1 samples + 3 samples picked from class-0, 6 samples in total

In both cases, the goal is to make the number of samples of both target classes equal.

In the oversampling technique, samples are repeated, and the dataset is larger than the original dataset.

In the undersampling technique, samples are not repeated, and the dataset is smaller than the original dataset.
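
The nine-sample example can be sketched in plain Python as follows. This is an illustration only: it draws the minority samples at random with replacement, which is one common variant of oversampling, rather than repeating each sample exactly twice.

```python
import random
from collections import Counter

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Nine toy samples as (sample_id, label): six of class 0, three of class 1.
class_0 = [(i, 0) for i in range(6)]
class_1 = [(i, 1) for i in range(6, 9)]

# Oversampling: draw class-1 samples with replacement until both classes match.
oversampled = class_0 + rng.choices(class_1, k=len(class_0))

# Undersampling: draw class-0 samples without replacement down to the class-1 size.
undersampled = rng.sample(class_0, k=len(class_1)) + class_1

print(Counter(label for _, label in oversampled))
print(Counter(label for _, label in undersampled))
```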

Applying Sampling Techniques
For undersampling, we count the samples of both classes, take the smaller count, and randomly draw that many samples from the other class to create a new dataset.

The new dataset has an equal number of samples for both target classes.

This is the whole process of undersampling; next we implement it in Python.



In [15]:
# Target class distribution
class_val = fraud_df['Class'].value_counts()
print(f"Number of samples for each class :- \n {class_val}")
non_fraud = class_val[0]
fraud = class_val[1]
print(f"Non Fraudulent Numbers :- {non_fraud}")
print(f"Fraudulent Numbers :- {fraud}")

Number of samples for each class :- 
 0    284315
1       492
Name: Class, dtype: int64
Non Fraudulent Numbers :- 284315
Fraudulent Numbers :- 492


The above is the target class distribution. Now let's see how we can change it.


In [16]:
## Equal both the target samples to the same level
# take indexes of non fraudulent
nonfraud_indexies = fraud_df[fraud_df.Class == 0].index
fraud_indices = np.array(fraud_df[fraud_df['Class'] == 1].index)
# take random samples from non fraudulent that are equal to fraudulent samples
random_normal_indexies = np.random.choice(nonfraud_indexies, fraud, replace=False)
random_normal_indexies = np.array(random_normal_indexies)

Here we first take the indexes of both classes and randomly choose as many Class-0 indexes as there are Class-1 samples.

In the code snippet below, we combine the indexes of both classes and then extract all features for the gathered indexes.

In [17]:
## Undersampling techniques

# concatenate both indices of fraud and non fraud
under_sample_indices = np.concatenate([fraud_indices, random_normal_indexies])

#extract all features from whole data for under sample indices only
under_sample_data = fraud_df.iloc[under_sample_indices, :]

# now we have to divide under sampling data to all features & target
x_undersample_data = under_sample_data.drop(['Class'], axis=1)
y_undersample_data = under_sample_data[['Class']]
# now split dataset to train and test datasets as before
X_train_sample, X_test_sample, y_train_sample, y_test_sample = train_test_split(
    x_undersample_data, y_undersample_data, test_size=0.2, random_state=0)

The code above separates the undersampled data into features (x_undersample_data) and target (y_undersample_data), and then splits it into training and test sets.

# Decision tree classification after applying sampling techniques



In [18]:
## DecisionTreeClassifier after applying undersampling technique

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score

def decision_tree_classification(X_train, y_train, X_test, y_test):
    # initialize object for DecisionTreeClassifier class
    dt_classifier = DecisionTreeClassifier()
    # train model by using fit method
    print("Model training start........")
    dt_classifier.fit(X_train, y_train.values.ravel())
    print("Model training completed")
    acc_score = dt_classifier.score(X_test, y_test)
    print(f'Accuracy of model on test dataset :- {acc_score}')
    # predict result using test dataset
    y_pred = dt_classifier.predict(X_test)
    # confusion matrix
    print(f"Confusion Matrix :- \n {confusion_matrix(y_test, y_pred)}")
    # classification report for f1-score
    print(f"Classification Report :- \n {classification_report(y_test, y_pred)}")
    # area under roc curve
    print(f"AROC score :- \n {roc_auc_score(y_test, y_pred)}")

# calling decision tree classifier function
decision_tree_classification(X_train_sample, y_train_sample,
                             X_test_sample, y_test_sample)

Model training start........
Model training completed
Accuracy of model on test dataset :- 0.9086294416243654
Confusion Matrix :- 
 [[92 14]
 [ 4 87]]
Classification Report :- 
               precision    recall  f1-score   support

           0       0.96      0.87      0.91       106
           1       0.86      0.96      0.91        91

    accuracy                           0.91       197
   macro avg       0.91      0.91      0.91       197
weighted avg       0.91      0.91      0.91       197

AROC score :- 
 0.9119842421729216


# Random Forest Tree Classifier after applying the sampling techniques



In [21]:
## RandomForestClassifier after apply the undersampling techniques

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score

def random_forest_classifier(X_train, y_train, X_test, y_test):
    # initialize object for RandomForestClassifier class
    rf_classifier = RandomForestClassifier(n_estimators=50)
    # train model by using fit method
    print("Model training start........")
    rf_classifier.fit(X_train, y_train.values.ravel())
    acc_score = rf_classifier.score(X_test, y_test)
    print(f'Accuracy of model on test dataset :- {acc_score}')
    # predict result using test dataset
    y_pred = rf_classifier.predict(X_test)
    # confusion matrix
    print(f"Confusion Matrix :- \n {confusion_matrix(y_test, y_pred)}")
    # classification report for f1-score
    print(f"Classification Report :- \n {classification_report(y_test, y_pred)}")
    # area under roc curve
    print(f"AROC score :- \n {roc_auc_score(y_test, y_pred)}")

# calling random forest classifier function
random_forest_classifier(X_train_sample, y_train_sample, X_test_sample, y_test_sample)

Model training start........
Accuracy of model on test dataset :- 0.9593908629441624
Confusion Matrix :- 
 [[103   3]
 [  5  86]]
Classification Report :- 
               precision    recall  f1-score   support

           0       0.95      0.97      0.96       106
           1       0.97      0.95      0.96        91

    accuracy                           0.96       197
   macro avg       0.96      0.96      0.96       197
weighted avg       0.96      0.96      0.96       197

AROC score :- 
 0.9583765291312462
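
The AROC values above are computed from hard 0/1 predictions (`y_pred`) rather than from predicted probabilities. In that case the area under the ROC curve reduces to the average of the true positive rate and the true negative rate. The sketch below reproduces this from the two confusion matrices in plain Python; `auc_from_hard_labels` is a hypothetical helper, not an sklearn function.

```python
def auc_from_hard_labels(confusion):
    """AUC of a classifier evaluated on 0/1 predictions only.

    With a single operating point the ROC curve is two straight segments,
    and the area under it equals (TPR + TNR) / 2.
    """
    (tn, fp), (fn, tp) = confusion
    tpr = tp / (tp + fn)
    tnr = tn / (tn + fp)
    return (tpr + tnr) / 2

# Confusion matrices printed above for the undersampled test set.
decision_tree_cm = [[92, 14], [4, 87]]
random_forest_cm = [[103, 3], [5, 86]]

print(auc_from_hard_labels(decision_tree_cm))
print(auc_from_hard_labels(random_forest_cm))
```

A threshold-independent AUC needs scores instead of labels. One option, assuming the fitted classifier is available, is to pass `rf_classifier.predict_proba(X_test)[:, 1]` to `roc_auc_score` in place of `y_pred`.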


# Result

With the Random Forest classifier trained on the undersampled data, the F1-score is 0.96 for both target values and the Area Under ROC curve is about 0.96, close to 1. The decision tree reaches 0.91 on both metrics.
So the Random Forest algorithm works better here.