In [7]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

In [8]:
# load the csv data
df = pd.read_csv('../input/creditcardfraud/creditcard.csv')
df.head()

In [9]:
df.describe()

In [10]:
df.info()

**Preprocessing the dataset**

In [13]:
df.isnull().sum() #check for null values

**Exploratory Data Analysis**

In [14]:
# Let us first explore the categorical column "Class".
# Pass the column as x= so the call works across seaborn versions.
sns.countplot(x=df['Class'])

**The number of fraudulent transactions is very low compared with the non-fraudulent ones.**

**Because of this imbalance, we will later try balancing the data to see whether the results improve.**
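
The imbalance can also be read as numbers rather than from the plot. The sketch below uses a small made-up Series (`toy_class`, an assumption standing in for `df['Class']`) and counts each class with `value_counts`.

```python
import pandas as pd

# Toy stand-in for df['Class']; the real column comes from creditcard.csv.
toy_class = pd.Series([0] * 995 + [1] * 5, name="Class")

counts = toy_class.value_counts()                # absolute count per class
shares = toy_class.value_counts(normalize=True)  # fraction of rows per class

fraud_share = shares.loc[1]
print(counts)
print(f"Fraud share: {fraud_share:.2%}")
```

On the real data, replacing `toy_class` with `df['Class']` should give the actual class ratio.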

In [15]:
# To display all 28 PCA columns, we loop over them.
df_temp = df.drop(columns=['Time', 'Amount', 'Class'])

# create dist plots, one subplot per column
fig, axes = plt.subplots(ncols=4, nrows=7, figsize=(20, 50))

for col, axis in zip(df_temp.columns, axes.flatten()):
    sns.distplot(df_temp[col], ax=axis)
plt.tight_layout(pad=0.5, w_pad=0.5, h_pad=5)

In [16]:
#Let us explore the column "Time".

sns.distplot(df['Time'])

In [17]:
#To display the column "Amount".

sns.distplot(df['Amount'])


**In this project, a correlation matrix carries little meaningful information. The feature columns have already been transformed with PCA, so their original meaning is not available, and PCA components are by construction largely uncorrelated with one another.**

**Model training**

In [18]:
#Input Split

X = df.drop(columns=['Class'], axis=1)
y = df['Class']


**Standard scaling for all variables except output class**

In [19]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_scaler = sc.fit_transform(X)

In [27]:
x_scaler[-1]  # display the last scaled row

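**All input attributes are in X, and y contains the output Class.**

**After running the code, we can see an array of scaled values. StandardScaler rescales each column to zero mean and unit variance, so the values are not confined to the range 0-1 and can be negative.**

**To understand the process, please go through the formula of StandardScaler: z = (x - mean) / standard deviation, applied column by column.**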
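
As a small check of the formula, the sketch below standardizes a made-up array (`toy`, an assumption, not the credit card data) by hand and compares the result with `StandardScaler`. It also prints the minimum and maximum of the scaled values.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data: two columns on very different scales.
toy = np.array([[1.0, 100.0],
                [2.0, 200.0],
                [3.0, 300.0],
                [4.0, 400.0]])

scaled = StandardScaler().fit_transform(toy)

# Manual version: z = (x - mean) / std per column.
# StandardScaler uses the population standard deviation (ddof=0),
# which is also NumPy's default.
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

print(np.allclose(scaled, manual))
# Standardized values can be negative and can exceed 1.
print(scaled.min(), scaled.max())
```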

**Model Training and Testing**

In [28]:
#Splitting the Data:

# train test split
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, f1_score
x_train, x_test, y_train, y_test = train_test_split(x_scaler, y, test_size=0.25, random_state=42, stratify=y)

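**We pass stratify=y so that the train and test sets keep the same class proportions as the full dataset. This matters because the classes are imbalanced: without it, the test set could end up with very few fraud cases.**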
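
The effect of `stratify` can be seen on a small synthetic example. The arrays `X_toy` and `y_toy` below are assumptions made up for illustration, with 2% positives.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 3))       # made-up features
y_toy = np.array([0] * 980 + [1] * 20)   # 2% positive class

_, _, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42, stratify=y_toy
)

# With stratify, both splits keep the 2% positive rate.
print("train positive rate:", y_tr.mean())
print("test positive rate:", y_te.mean())
```

Dropping the `stratify` argument leaves the positive rate in each split to chance, which is risky when positives are this rare.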

In [36]:
sns.distplot(df['Class'])

In [37]:
#Logistic Regression:

from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
# training
model.fit(x_train, y_train)
# testing
y_pred = model.predict(x_test)
print(classification_report(y_test, y_pred))
print("F1 Score:",f1_score(y_test, y_pred))

**Here, the reported accuracy is close to 100%. This comes from the class imbalance rather than from good fraud detection: most of the accuracy is earned on the non-fraudulent samples.**

**The F1 score is the harmonic mean of precision and recall, so it is a better guide to how well the fraud class is handled.**

**Since the F1 score is around 72%, we should consider a better model.**
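
To see how accuracy and F1 can diverge on imbalanced data, here is a minimal sketch on made-up labels (`y_true` and `y_hat` are assumptions, not the notebook's predictions). It computes precision, recall and F1 by hand for the positive class.

```python
import numpy as np

# 96 negatives and 4 positives; the classifier finds only 1 of the 4
# positives and raises 1 false alarm.
y_true = np.array([0] * 96 + [1] * 4)
y_hat = np.array([0] * 95 + [1] + [1] + [0] * 3)

tp = int(np.sum((y_true == 1) & (y_hat == 1)))  # true positives
fp = int(np.sum((y_true == 0) & (y_hat == 1)))  # false positives
fn = int(np.sum((y_true == 1) & (y_hat == 0)))  # false negatives

accuracy = float(np.mean(y_true == y_hat))
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
# prints: accuracy=0.960 precision=0.500 recall=0.250 f1=0.333
```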

In [38]:
#Random Forest:

from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
# training
model.fit(x_train, y_train)
# testing
y_pred = model.predict(x_test)
print(classification_report(y_test, y_pred))
print("F1 Score:",f1_score(y_test, y_pred))

**Training takes noticeably longer here because of the size of the dataset.**

**The F1 score has improved over logistic regression.**

**The score is still held back by the imbalanced training data.**

**Let us try one boosting model.**

In [39]:
#XGBoost:

from xgboost import XGBClassifier
model = XGBClassifier(n_jobs=-1)
# training
model.fit(x_train, y_train)
# testing
y_pred = model.predict(x_test)
print(classification_report(y_test, y_pred))
print("F1 Score:",f1_score(y_test, y_pred))

**We can observe an F1 score of 86%, which is a good result.**

**However, let us try to balance the data and see whether the F1 score and the macro average improve.**
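
The macro average in `classification_report` is the unweighted mean of the per-class scores, so the rare class counts as much as the common one. A small sketch on made-up labels (`y_true` and `y_hat` are assumptions) shows this for F1.

```python
import numpy as np
from sklearn.metrics import f1_score

# Made-up labels: 96 negatives, 4 positives, only 1 positive detected.
y_true = np.array([0] * 96 + [1] * 4)
y_hat = np.array([0] * 95 + [1] + [1] + [0] * 3)

per_class = f1_score(y_true, y_hat, average=None)  # one F1 per class
macro = f1_score(y_true, y_hat, average="macro")   # unweighted mean of those

print("per-class F1:", per_class)
print("macro F1:", macro)
```

A weak score on the fraud class pulls the macro average down even when the majority class is predicted almost perfectly.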

**Balancing the classes using SMOTE**

**We will now balance the classes to an equal distribution and retrain the same kind of model. Before that, let us look at the class ratio.**

In [41]:
# The gap between the counts of class 0 and class 1 is large.

sns.countplot(x=y_train)

**Class Imbalance:**

In [42]:
# balance the class with equal distribution
from imblearn.over_sampling import SMOTE
over_sample = SMOTE()
x_smote, y_smote = over_sample.fit_resample(x_train, y_train)

**Alternatively, random under-sampling reduces the majority class, and random over-sampling enlarges the minority class by repeating its samples.**

**Any of these balancing methods gives the model a comparable number of examples from each class.**
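
The two random resampling ideas can be sketched with plain NumPy. This is a hand-rolled illustration on made-up data (`X_toy` and `y_toy` are assumptions), not the imblearn implementation.

```python
import numpy as np

rng = np.random.default_rng(42)

X_toy = rng.normal(size=(100, 2))      # made-up features
y_toy = np.array([0] * 90 + [1] * 10)  # 90 majority rows, 10 minority rows

maj_idx = np.flatnonzero(y_toy == 0)
min_idx = np.flatnonzero(y_toy == 1)

# Random over-sampling: draw minority rows with replacement until the
# minority class is as large as the majority class.
over_idx = np.concatenate(
    [maj_idx, rng.choice(min_idx, size=len(maj_idx), replace=True)]
)
X_over, y_over = X_toy[over_idx], y_toy[over_idx]

# Random under-sampling: keep a random subset of majority rows that is
# as large as the minority class.
under_idx = np.concatenate(
    [rng.choice(maj_idx, size=len(min_idx), replace=False), min_idx]
)
X_under, y_under = X_toy[under_idx], y_toy[under_idx]

print(np.bincount(y_over))   # [90 90]
print(np.bincount(y_under))  # [10 10]
```

Over-sampling keeps every original row but repeats minority rows, while under-sampling discards most of the majority class.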

In [43]:
sns.countplot(x=y_smote)

**Now that the classes are equally distributed, the model will give equal weight to both of them during training.**

In [None]:
#XGBoost again:

from xgboost import XGBClassifier
model = XGBClassifier(n_jobs=-1)
# training
model.fit(x_smote, y_smote)
# testing
y_pred = model.predict(x_test)
print(classification_report(y_test, y_pred))
print("F1 Score:",f1_score(y_test, y_pred))

**Since we did not get a better F1 score, we will stick with the imbalanced data for this project.**