<h1>Credit card fraud detection predictive model</h1>

This dataset contains transactions made over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.

It contains only numerical input variables which are the result of a PCA transformation.

Due to confidentiality issues, the original features and further background information about the data are not provided.

<h1>Importing the dataset and libraries and defining values for future use</h1>

In [None]:
import pandas as pd 
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline 
import plotly.graph_objs as go
import plotly.figure_factory as ff
from plotly import tools
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)


import gc
from datetime import datetime 
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from catboost import CatBoostClassifier
from sklearn import svm
import lightgbm as lgb
from lightgbm import LGBMClassifier
import xgboost as xgb

pd.set_option('display.max_columns', 100)


RFC_METRIC = 'gini' 
NUM_ESTIMATORS = 100 
NO_JOBS = 4 



VALID_SIZE = 0.20 
TEST_SIZE = 0.20 
NUMBER_KFOLDS = 5 



RANDOM_STATE = 2018

MAX_ROUNDS = 1000 
EARLY_STOP = 50 
OPT_ROUNDS = 1000 
VERBOSE_EVAL = 50 

<h2>Importing the data</h2>

In [None]:
data_df = pd.read_csv("/kaggle/input/creditcardfraud/creditcard.csv")

<h2>Checking the data</h2>

In [None]:
print("Credit card fraud detection data - rows:", data_df.shape[0],"Columns:",data_df.shape[1])

In [None]:
#Checking some of the rows in data

data_df.head()

In [None]:
# Summary statistics for each column

data_df.describe()

In [None]:
#Checking for missing values in data

total=data_df.isnull().sum().sort_values(ascending = False)
percent =(data_df.isnull().sum()/data_df.isnull().count()*100).sort_values(ascending = False)
pd.concat([total,percent], axis =1, keys=['Total','Percent']).transpose()

There are no missing values in the dataset.
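
As a quick cross-check of the per-column table, the missing-value count can also be reduced to a single number. The sketch below runs on a small hypothetical frame (`toy_df`, which is not the real data and deliberately contains one missing cell) so that it is self-contained; on the actual data the same calls would be applied to `data_df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for data_df (illustration only).
toy_df = pd.DataFrame({
    "V1": [0.1, -1.3, np.nan],
    "Amount": [10.0, 25.5, 3.2],
    "Class": [0, 0, 1],
})

# Total number of missing cells across all columns.
n_missing = int(toy_df.isna().sum().sum())

# Names of the columns that contain at least one missing value.
cols_with_missing = toy_df.columns[toy_df.isna().any()].tolist()

print(n_missing, cols_with_missing)
```

A result of zero missing cells and an empty column list would agree with the statement that the dataset has no missing values.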

<h2>Checking Unbalanced data</h2>

In [None]:
temp=data_df["Class"].value_counts()
df=pd.DataFrame({'Class': temp.index, 'values':temp.values})

trace= go.Bar(
    x=df["Class"],y=df["values"],
    name="Credit card fraud - data unbalanced(Not fraud=0, fraud =1)",
    marker=dict(color ="Red"),
    text=df["values"]
)

data=[trace]
layout=dict(title='Credit Card Fraud Class - data unbalance (Not fraud = 0, Fraud = 1)',
           xaxis=dict(title='Class', showticklabels=True),
            yaxis=dict(title='Number of transactions'),
            hovermode='closest',width=600
           )

fig = dict(data=data, layout=layout)
iplot(fig, filename='class')

Only 492 (or 0.172%) of the transactions are fraudulent. That means the data is highly unbalanced with respect to the target variable Class.
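
To make the imbalance concrete, the sketch below recomputes the fraud share from the two counts quoted in the dataset description, and also shows the accuracy a trivial model would reach by labelling every transaction as not fraud. The variable names are illustrative only. The high accuracy of that trivial model is one common reason to prefer a ranking metric such as ROC AUC (imported earlier as `roc_auc_score`) on data like this.

```python
# Counts quoted in the dataset description.
n_fraud = 492
n_total = 284_807

fraud_share = n_fraud / n_total   # fraction of fraudulent transactions
fraud_pct = 100 * fraud_share     # the same value as a percentage

# Accuracy of a trivial model that labels every transaction "not fraud".
baseline_accuracy = 1 - fraud_share

print(f"fraud share: {fraud_pct:.4f}%")
print(f"always-negative accuracy: {baseline_accuracy:.4%}")
```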

<h1>Data Exploration</h1>

<h2>Transactions in time</h2>

In [None]:
class_0 = data_df.loc[data_df['Class'] == 0]["Time"]
class_1 = data_df.loc[data_df['Class'] == 1]["Time"]

hist_data = [class_0, class_1]
group_labels = ['Not Fraud', 'Fraud']

fig = ff.create_distplot(hist_data, group_labels, show_hist=False, show_rug=False)
fig['layout'].update(title='Credit Card Transactions Time Density Plot', xaxis=dict(title='Time [s]'))
iplot(fig, filename='dist_only')

Fraudulent transactions are distributed more evenly over time than valid transactions: they occur at a similar rate throughout, including during the periods of low legitimate activity that correspond to night-time in European time zones.
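
One way to probe this observation numerically is to bin `Time` into hours and compare the per-class counts. The sketch below is an illustration on a hypothetical toy frame (`toy_df`, not the real data). It assumes that `Time` is measured in seconds, as the axis label of the plot suggests, and that wrapping it to a 24-hour cycle gives a meaningful hour bin; the offset of that bin from the actual local clock time is not known from the data.

```python
import pandas as pd

# Hypothetical toy data standing in for data_df (illustration only).
toy_df = pd.DataFrame({
    "Time": [0, 1800, 3700, 7300, 90000, 93600],
    "Class": [0, 1, 0, 0, 1, 0],
})

# Assumed: Time is in seconds; 3600 s per hour, wrapped to a 24-hour cycle.
toy_df["Hour"] = (toy_df["Time"] // 3600) % 24

# Rows: hour bin; columns: class (0 = not fraud, 1 = fraud); values: counts.
counts_by_hour = pd.crosstab(toy_df["Hour"], toy_df["Class"])

print(counts_by_hour)
```

On the real data, comparing the two columns of such a table hour by hour would show whether the fraud counts stay roughly flat while the legitimate counts dip.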