# Credit Card Fraud Detection

### Problem statement:

Credit card fraud is one of the biggest issues faced by the government, and the amount of money involved is generally enormous. As the world moves further toward digitalization, the risk of online fraud also increases. Websites that accept online payments contribute to the rise in online fraud. In addition, because of the COVID-19 pandemic, many people prefer cashless transactions, which increases the chances of falling victim to such fraud.

It is important that credit card companies are able to recognize fraudulent credit card transactions so that customers are not charged for items they did not purchase. Among all types of online fraud, credit card fraud is an ever-growing menace in the financial industry, so detecting fraudulent transactions is of great importance to any credit card company.

We are going to approach this real life problem using Data Science.

### Proposal:

The goal is to develop a model that gives the best results in identifying fraudulent credit card transactions.

This helps both the credit card company and its customers, who are protected from being charged unnecessarily.

### Data set

The dataset is obtained from Kaggle.
https://www.kaggle.com/mlg-ulb/creditcardfraud

The dataset contains transactions made with credit cards in September 2013 by European cardholders.
It covers transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.
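To make the imbalance concrete, the short sketch below recomputes the fraud rate from the two counts quoted above and derives the accuracy of a trivial "always predict non-fraud" baseline. The variable names are illustrative and not part of the notebook.

```python
# Counts quoted in the dataset description above.
n_transactions = 284_807
n_frauds = 492

fraud_rate = n_frauds / n_transactions

# A "model" that labels every transaction as non-fraud is correct
# on every genuine transaction and wrong on every fraud.
baseline_accuracy = 1 - fraud_rate

print(f"Fraud rate:        {fraud_rate:.3%}")
print(f"Baseline accuracy: {baseline_accuracy:.3%}")
```

Because such a baseline is already correct on almost every row, plain accuracy says little about how well frauds are detected, which is why the imbalance needs attention later on.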

It contains only numerical input variables, most of which are the result of a PCA transformation applied for confidentiality reasons. Features V1, V2, … V28 are the principal components obtained with PCA; the only features that have not been transformed are 'Time' and 'Amount'.

There are 284,807 transactions (rows) and 31 columns in this dataset (30 features plus the 'Class' label).

Time: It contains the seconds elapsed between each transaction and the first transaction in the dataset.

Amount: It is the transaction Amount.

Class: It is the response variable; it takes the value 1 in case of fraud and 0 otherwise.

In [None]:
# Load the packages
import pandas as pd
import pandas_profiling as pp
import matplotlib.pyplot as plt
from matplotlib import gridspec
from matplotlib import __version__ as mpv
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn import __version__ as sklv
import seaborn as sns
import numpy as np
import warnings
warnings.filterwarnings('ignore')
#$$
print('Using version %s of pandas' % pd.__version__)
print('Using version %s of pandas_profiling' % pp.__version__)
print('Using version %s of matplotlib' % mpv)
print('Using version %s of seaborn' % sns.__version__)
print('Using version %s of sklearn' % sklv)
print('Using version %s of numpy' % np.__version__)

In [None]:
# Load data into a dataframe
addr = "./creditcard.csv"
data = pd.read_csv(addr)

In [None]:
# Check the dimension of the table
print("The dimension of the table is: ", data.shape)

In [None]:
# Looking at the column names
data.columns

Since we do not have unseen data, we have two options. First, we can build our model, wait for new data to be generated or captured, and see how we do. This is often infeasible for a myriad of reasons.

Second, we can create a holdout or test set. This set, which we assume to be independent and identically distributed (i.i.d.), will serve as our objective "unseen" data. That said, once we "see" or "peek" at that data, it is no longer unseen.

So we split the data into two sets, train and test. For now we will look only at the train data, so that we can use the test data strictly as future data.
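With a class this rare, a purely random split can leave the two sets with slightly different fraud rates. One option, not used in this notebook, is to pass `stratify` to `train_test_split`. The sketch below shows the idea on synthetic data; the arrays `X` and `y` are made up for illustration and are not the credit card data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 1,000 rows, 2% positives.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 3))
y = np.zeros(1000, dtype=int)
y[:20] = 1

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    train_size=0.80,
    random_state=42,
    stratify=y,  # keep the class ratio the same in both splits
)

print("Train fraud rate:", y_train.mean())
print("Test fraud rate: ", y_test.mean())
```

When splitting a whole DataFrame, as the notebook does, the equivalent would be to pass the label column, for example `stratify=data['Class']`.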

In [None]:
seed = 42
train, test = train_test_split(data,
                               train_size=0.80,
                               random_state=seed,
                               )


# save the train and test file
# again using the '\t' separator to create tab-separated-values files
#train.to_csv(train_path, sep='\t', index=False)
#test.to_csv(test_path, sep='\t', index=False)

In [None]:
train.describe()

In [None]:
# Checking for NULL values
train.isnull().sum().max()

In [None]:
# Profile the training split only, so the test set stays unseen
pp.ProfileReport(train).to_notebook_iframe()

We will first focus on the Time, Amount and Class features, since the rest are anonymized (unnamed).

In [None]:
# The classes are heavily skewed; we need to address this later.
print('No Frauds', round(train['Class'].value_counts()[0]/len(train) * 100,2), '% of the dataset')
print('Frauds', round(train['Class'].value_counts()[1]/len(train) * 100,2), '% of the dataset')

In [None]:
colors = ["#0101DF", "#DF0101"]
# Pass the column as a keyword argument and plot the training split
sns.countplot(x='Class', data=train, palette=colors)
plt.title('Class Distributions ( 0 = No Fraud  1 = Fraud )', fontsize=14)

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(18,4))

# Use the training split so the test set stays unseen
amount_val = train['Amount'].values
time_val = train['Time'].values

sns.distplot(amount_val, ax=ax[0], color='r')
ax[0].set_title('Distribution of Transaction Amount', fontsize=14)
ax[0].set_xlim([min(amount_val), max(amount_val)])

sns.distplot(time_val, ax=ax[1], color='b')
ax[1].set_title('Distribution of Transaction Time', fontsize=14)
ax[1].set_xlim([min(time_val), max(time_val)])



plt.show()

The plot above shows that the distribution of time is bimodal, which indicates a sudden fall in the volume of transactions about 28 hours after the first transaction.
Since the actual time of day of the transactions is not provided, we can assume that the drop in volume occurred during the night.
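Since 'Time' is stored in seconds, reading the plot in hours requires a division by 3,600; 28 hours corresponds to 100,800 seconds. The sketch below groups a few made-up 'Time' values by whole hours elapsed. The helper `hours_since_start` and the sample values are hypothetical and only illustrate the conversion.

```python
from collections import Counter

SECONDS_PER_HOUR = 3600


def hours_since_start(seconds):
    """Convert elapsed seconds to whole hours since the first transaction."""
    return int(seconds // SECONDS_PER_HOUR)


# Made-up 'Time' values in seconds, not taken from the dataset.
sample_times = [0.0, 1800.0, 3600.0, 86_399.0, 100_800.0, 172_792.0]

per_hour = Counter(hours_since_start(t) for t in sample_times)
for hour, count in sorted(per_hour.items()):
    print(f"hour {hour:2d}: {count} transaction(s)")
```

These are hours elapsed since the first transaction, not clock hours, so mapping a dip to "night" remains an assumption.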

Fraud VS Non-Fraud Time Distribution

In [None]:
# Use the training split so the test set stays unseen
fraud_time = train[train['Class'] == 1]['Time']
no_fraud_time = train[train['Class'] == 0]['Time']

fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True, figsize=(20,10))
bins=50

ax1.hist(fraud_time, bins = bins)
ax1.set_title('Fraud')

ax2.hist(no_fraud_time, bins = bins)
ax2.set_title('Normal')

plt.xlabel('Time (in Seconds)')
fig.text(0.04,0.5, 'Number of Transactions', va='center', rotation='vertical')

plt.show()

Amount Distribution

In [None]:
# Use the training split so the test set stays unseen
fraud_amt = train[train['Class'] == 1]['Amount']
no_fraud_amt = train[train['Class'] == 0]['Amount']

plt.subplots(1, 2, figsize=(12, 5))

plt.subplot(1, 2, 1)
sns.distplot(no_fraud_amt)
plt.xlabel('Amount ($)')
plt.title('Distribution of Non-Fraudulent Data Amount')

plt.subplot(1, 2, 2)
sns.distplot(fraud_amt)
plt.xlabel('Amount ($)')
plt.title('Distribution of Fraudulent Data Amount');

In [None]:
# Describing the Fraud amount
print(f"Fraud Amount Info: \n {fraud_amt.describe()}")
print('\n\n')
# Describing the Non-Fraud amount
print(f"Non-Fraud Amount Info: \n {no_fraud_amt.describe()}")

Distribution of Anonymized Features

In [None]:

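# V1..V28 sit in column positions 1 to 28 (position 0 is 'Time')
features = train.iloc[:, 1:29].columns
plt.figure(figsize=(12, 28*4))
# One row per feature, stacked in a single column
gs = gridspec.GridSpec(28, 1)
for i, c in enumerate(features):
    ax = plt.subplot(gs[i])
    # Overlay the fraud and non-fraud distributions of each feature
    sns.distplot(train[c][train.Class == 1], bins=50)
    sns.distplot(train[c][train.Class == 0], bins=50)
    ax.set_xlabel('')
    ax.set_title('histogram of feature: ' + str(c))
plt.show()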




Correlation between features

In [None]:
#%matplotlib inline
#plt.figure(figsize = (20,20))
plt.rcParams['figure.figsize'] = (20,20)
plt.title('Credit Card Transactions Features Correlation Plot (Pearson)')
# Pearson correlation on the training split only
corr = train.corr()
sns.heatmap(corr, xticklabels=corr.columns,yticklabels=corr.columns,linewidths=.1,cmap="Blues")
plt.show()

As we can see, some of the predictors do seem to be correlated with each other, but the majority are not.
This could be due to the following factors:
- The dimensionality of the data has already been reduced using PCA (Principal Component Analysis), so our predictors are principal components, which are orthogonal to each other.
- The huge class imbalance might distort the importance of certain correlations with regard to our class variable.
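The first point can be checked on a small example. The sketch below builds synthetic data with two strongly correlated columns, projects it onto its principal components using an SVD, and compares the correlation matrices before and after. The data and variable names are made up for illustration; this is not how the dataset's V1 to V28 features were produced, which the source does not describe in detail.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data with two strongly correlated columns (not the real dataset).
base = rng.normal(size=(500, 1))
X = np.hstack([
    base + 0.1 * rng.normal(size=(500, 1)),
    2 * base + 0.1 * rng.normal(size=(500, 1)),
    rng.normal(size=(500, 1)),
])

# PCA via SVD of the centred data: the rows of Vt are the principal axes.
X_centred = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(X_centred, full_matrices=False)
scores = X_centred @ Vt.T  # principal component scores

corr_raw = np.corrcoef(X, rowvar=False)
corr_pca = np.corrcoef(scores, rowvar=False)

print("Correlation of raw columns:")
print(np.round(corr_raw, 2))
print("Correlation of principal component scores:")
print(np.round(corr_pca, 2))
```

The off-diagonal entries vanish only on the data the components were fitted to. On a subset such as a train split, or for untransformed columns such as 'Time' and 'Amount', small non-zero correlations can still appear.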

Dividing the data into features and label sets

In [None]:
# Build the feature and label sets from the training split
features = train[train.columns[:-1]]
labels = train['Class']
features.head()


In [None]:
labels.head()

In [None]:
x_train = features.values
y_train = labels.values

Pearson Ranking
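Pearson ranking scores every pair of features by their Pearson correlation coefficient, which ranges from -1 (perfect negative linear relationship) to 1 (perfect positive linear relationship). As a minimal sketch of what that score measures, the function below computes the coefficient by hand for two sequences; the function name and sample values are illustrative only.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Illustrative values only.
amount = [1.0, 2.0, 3.0, 4.0]
doubled = [2.0, 4.0, 6.0, 8.0]
reversed_amount = [4.0, 3.0, 2.0, 1.0]

print(pearson(amount, doubled))
print(pearson(amount, reversed_amount))
```

In practice `np.corrcoef` or `DataFrame.corr()` would be used instead; the hand-written version is only meant to show the calculation.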

In [None]:
#set up the figure size
#%matplotlib inline
plt.rcParams['figure.figsize'] = (15, 7)

# import the package for visualization of the correlation
from yellowbrick.features import Rank2D

# extract the numpy arrays from the data frame
X = x_train

# instantiate the visualizer with the Pearson ranking algorithm
visualizer = Rank2D(features=features.columns, algorithm='pearson')
visualizer.fit(X)                # Fit the data to the visualizer
visualizer.transform(X)             # Transform the data
visualizer.poof(outpath="./pcoords1.png") # Draw/show/poof the data
#plt.show()