## Credit Card Fraud Detection
#### Submitted by: Safalya Mohanta

**Objective:** Predict fraudulent credit card transactions with the help of machine learning models.

In [1]:
# Import the required libraries here
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

from sklearn import metrics
from sklearn import preprocessing

from sklearn import model_selection
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import power_transform

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection  import cross_val_score
from sklearn.metrics import accuracy_score, mean_squared_error

from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
from sklearn.metrics import classification_report

from sklearn import linear_model #import the package
from sklearn.linear_model import LogisticRegression 
from sklearn.model_selection import GridSearchCV

from sklearn import tree
from pprint import pprint

from sklearn.ensemble import RandomForestClassifier

from scipy import stats

In [2]:
from google.colab import drive
drive.mount('/content/drive')

ModuleNotFoundError: No module named 'google'

## Exploratory data analysis -

In [None]:
# Load the given data
df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/creditcard.csv')
df.head()

In [None]:
#observe the different feature type present in the data
print(df.shape)

In [None]:
df.describe().T

In [None]:
df.info()

**Insight:**
* Data has no records with null values
* Data is already PCA transformed
* Data has **284,807 rows** and **31 columns**
* The datatype of the 'Class' variable is 'int'. The Class variable should be categorical (0: non-fraud & 1: fraud), so we need to change its datatype.


In [None]:
#Changing the data type of Class

df['Class'] = df['Class'].astype('category')

#Renaming the classes
df['Class'] = df['Class'].cat.rename_categories({1:'Fraudulent',0:'Non_Fraudulent'})

df['Class']

* Here we will observe the distribution of our classes

In [None]:
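classes=df['Class'].value_counts()
# 'Class' was renamed above, so look the counts up by label rather than by 0/1
normal_share=classes['Non_Fraudulent']/df['Class'].count()*100
fraud_share=classes['Fraudulent']/df['Class'].count()*100
print(normal_share)
print(fraud_share)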

In [None]:
#Creating a df for percentage of each class
class_share = {'Class':['1','0'],'Percentage':[fraud_share,normal_share]}
class_share = pd.DataFrame(class_share)
class_share.head()

* Data is **imbalanced**. Only **0.17%** of the data represents **fraudulent** cases.
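
With a fraud share this small, plain accuracy says very little about a model. A minimal sketch, assuming a hypothetical batch of 10,000 transactions with 17 frauds (roughly the 0.17% share above): a "model" that never flags fraud still scores close to 100% accuracy while catching no fraud at all.

```python
# Hypothetical batch: 10,000 transactions, 17 of them fraudulent (~0.17%).
n_total = 10_000
n_fraud = 17

y_true = [1] * n_fraud + [0] * (n_total - n_fraud)
y_pred = [0] * n_total  # a "model" that always predicts non-fraud

# Accuracy: share of all predictions that are correct.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / n_total

# Recall: share of actual frauds that were flagged.
true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_pos / n_fraud

print(f"accuracy = {accuracy:.4f}")
print(f"recall   = {recall:.4f}")
```

This is why the rest of the notebook looks at AUC, recall and precision rather than accuracy alone.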

In [None]:
# Create a bar plot for the number and percentage of fraudulent vs non-fraudulent transactions
plt.figure(figsize=(13,7))
plt.subplot(121)
plt.title('Fraudulent BarPlot', fontweight='bold',fontsize=14)
ax = df['Class'].value_counts().plot(kind='bar')
total = float(len(df))
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 3,
            '{:1.5f}'.format(height/total),
            ha="center") 


plt.subplot(122)
df["Class"].value_counts().plot.pie(autopct = "%1.5f%%")
plt.show()


In [None]:
# Create a scatter plot to observe the distribution of classes with time
plt.figure(figsize=(10,6))
sns.stripplot(x= 'Class', y= 'Time',data=df)
plt.title('Distribution of Classes with Time\n (0: Non-Fraudulent || 1: Fraudulent)')
plt.show()

**Insight**
* There isn't any particular time interval at which fraudulent transactions happen. It can happen at any time.
* The Time column is evenly distributed for fraudulent transactions and doesn't seem to have any role in deciding whether a transaction is fraud or not.

In [None]:
# Create a scatter plot to observe the distribution of classes with Amount
plt.figure(figsize=(10,6))
sns.stripplot(x= 'Class', y= 'Amount',data=df)
plt.title('Distribution of Classes with Amount\n (0: Non-Fraudulent || 1: Fraudulent)')
plt.show()

**Insight:**
* Fraudulent transactions do not include any high-amount transactions. The maximum amount for a fraudulent transaction is around $2500.

In [None]:
# Drop unnecessary columns
# Dropping the column 'Time' since it does not have any impact on deciding a fraud transaction
df=df.drop('Time',axis=1)
df.shape

In [None]:
plt.figure(figsize=(8,6))

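# 'Class' is now categorical, so restrict the correlation matrix to numeric columns
sns.heatmap(df.select_dtypes(include='number').corr(),linewidths=0.5,cmap='YlGnBu')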

plt.show()

* The data is already PCA transformed, so there isn't much to conclude from the heatmap.

### Splitting the data into train & test data

In [None]:
X = df.drop(["Class"], axis = 1)
y= df['Class']
X.head(5)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=0,stratify=y)
#Using stratify=y so that proportion of each class is same in both train and test set
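
To illustrate what `stratify` does, here is a small sketch on synthetic labels (the arrays `X_demo` and `y_demo` are made up for this example and are not part of the dataset). With 2% positives overall, the stratified split keeps the same share in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 1,000 samples, 2% positives (illustrative only).
y_demo = np.array([1] * 20 + [0] * 980)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

print("overall positive share:", y_demo.mean())
print("train positive share:  ", y_tr.mean())
print("test positive share:   ", y_te.mean())
```

Without `stratify`, a random 20% sample of such rare positives could easily end up with noticeably more or fewer of them than the training part.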

##### Preserve X_test & y_test to evaluate on the test data once you build the model

In [None]:
#print(np.sum(y))
#print(np.sum(y_train))
#print(np.sum(y_test))

In [None]:
print('Total count for each class:\n', y.value_counts())
print("\nCount of each class in train data:\n",y_train.value_counts())
print("\nCount of each class in test data:\n",y_test.value_counts())

### Plotting the distribution of a variable

In [None]:
# Plot a histogram of every column in the train data to check the skewness

collist = list(X_train.columns)

plt.figure(figsize=(20,30))

# One subplot per column on an 8x4 grid
for m, col in enumerate(collist, start=1):
    plt.subplot(8,4,m)
    sns.histplot(X_train[col])

plt.show()

**Insight:**

* The distribution plots show that some variables are clearly skewed, either to the left or to the right.
* So not all variables are normally distributed, even though this is a PCA transformed dataset.
* We will transform the data to remove the skewness.

### If there is skewness present in the distribution use:
- <b>Power Transformer</b> package present in the <b>preprocessing library provided by sklearn</b> to make distribution more gaussian
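
A minimal sketch of the idea on synthetic data (the log-normal feature and the names `train_demo`, `test_demo` and `pt_demo` are assumptions made for this illustration): the transformer learns its parameters on the train part only, and the same fitted transform is then reused on the test part.

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
# Synthetic right-skewed feature (log-normal), split into train and test parts.
train_demo = rng.lognormal(mean=0.0, sigma=1.0, size=(2000, 1))
test_demo = rng.lognormal(mean=0.0, sigma=1.0, size=(500, 1))

pt_demo = PowerTransformer(method="yeo-johnson")
train_t = pt_demo.fit_transform(train_demo)  # learn the transform on train only
test_t = pt_demo.transform(test_demo)        # reuse the fitted transform on test

print("train skew before:", skew(train_demo.ravel()))
print("train skew after: ", skew(train_t.ravel()))
print("test skew after:  ", skew(test_t.ravel()))
```

Fitting on the train part only keeps information from the test data out of the preprocessing step.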

In [None]:
# - Apply : preprocessing.PowerTransformer, fitted on the train data only,
#   then reuse the same fitted transform on the test data
pt = preprocessing.PowerTransformer(method='yeo-johnson')
X_train = pt.fit_transform(X_train)
X_test = pt.transform(X_test)

In [None]:
# Converting X_train & X_test back to dataframe
cols = X.columns

X_train = pd.DataFrame(X_train)
X_train.columns = cols

X_test = pd.DataFrame(X_test)
X_test.columns = cols

In [None]:
# Plot the histograms again for the same set of variables to see the difference

collist = list(X_train.columns)

plt.figure(figsize=(20,30))

# One subplot per column on an 8x4 grid
for m, col in enumerate(collist, start=1):
    plt.subplot(8,4,m)
    sns.histplot(X_train[col])

plt.show()

**Insight:**
* After the power transformation the variables are more Gaussian-like.
* Changes in the V1, V12, V26 and Amount columns are quite evident.
* Skewness has been removed to some extent.

## Model Building
- Build different models on the imbalanced dataset and see the result

In [None]:
# Function to plot ROC curve and classification score which will be used for each model
def plot_roc(fpr,tpr):
    plt.plot(fpr, tpr, color='green', label='ROC')
    plt.plot([0, 1], [0, 1], color='yellow', linestyle='--')
    plt.title("Receiver Operating Characteristic (ROC) Curve")
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")
    plt.legend()
    plt.show()

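def clf_score(clf):
    # Columns of predict_proba follow clf.classes_; with the string labels used here
    # the classes are sorted alphabetically, so column 1 is 'Non_Fraudulent'
    prob = clf.predict_proba(X_test)
    prob = prob[:, 1]
    auc = roc_auc_score(y_test, prob)
    print('AUC: %.2f' % auc)
    # pos_label must match the class whose probability is stored in prob
    fpr, tpr, thresholds = roc_curve(y_test,prob, pos_label='Non_Fraudulent')
    plot_roc(fpr,tpr)
    # Precision, recall and f1-score per class on the test data
    predicted=clf.predict(X_test)
    report = classification_report(y_test, predicted)
    print(report)
    return auc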
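
As a reminder of what the AUC printed by this helper measures, here is a small sketch on toy scores (the arrays `y_toy` and `scores_toy` are made up for the illustration). The AUC equals the share of positive/negative pairs in which the positive sample gets the higher score, and this matches `roc_auc_score`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels and scores (illustrative only): 1 = positive class.
y_toy = np.array([0, 0, 0, 0, 1, 1, 0, 1])
scores_toy = np.array([0.1, 0.2, 0.3, 0.6, 0.5, 0.7, 0.4, 0.9])

pos = scores_toy[y_toy == 1]
neg = scores_toy[y_toy == 0]

# Count positive/negative pairs where the positive is ranked higher
# (ties count as half a win).
wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
auc_pairs = wins / (len(pos) * len(neg))

auc_sklearn = roc_auc_score(y_toy, scores_toy)
print(auc_pairs, auc_sklearn)
```

Because it only depends on how the two classes are ranked against each other, AUC is not inflated by the sheer number of non-fraud rows the way accuracy is.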

#### Logistic Regression

In [None]:
# Logistic Regression
num_C = [0.001,0.01,0.1,1,10,100] #--> list of values
#cv_num =   #--> list of values

In [None]:
for cv_num in num_C:
  clf = LogisticRegression(penalty='l2',C=cv_num,random_state = 0)
  clf.fit(X_train, y_train)
  print('C:', cv_num)
  print('Coefficient of each feature:', clf.coef_)
  print('Training accuracy:', clf.score(X_train, y_train))
  print('Test accuracy:', clf.score(X_test, y_test))
  print('')

* We take the best C value to be the one for which the difference between the train and test scores is the smallest.
* In our case the best value is **C=0.01**


#### Perform cross validation on the X_train & y_train to create:
- X_train_cv
- X_test_cv 
- y_train_cv
- y_test_cv 

In [None]:
#perform cross validation
grid={"C":np.logspace(-3,3,7), "penalty":["l2"]}  # l2 ridge

lsr = LogisticRegression()
clf_lsr_cv = GridSearchCV(lsr,grid,cv=3,scoring='roc_auc')
clf_lsr_cv.fit(X_train,y_train)

#perform hyperparameter tuning
print("tuned hyperparameters (best parameters):",clf_lsr_cv.best_params_)
# scoring='roc_auc', so best_score_ is the mean cross-validated ROC AUC, not accuracy
print("best CV ROC AUC:",clf_lsr_cv.best_score_)
#print the evaluation result by choosing a evaluation metric

#print the optimum value of hyperparameters

In [None]:
# Fitting the model with best parameters .

lsr_best = LogisticRegression(penalty='l2',C=0.01,random_state = 0)
lsr_clf = lsr_best.fit(X_train,y_train)
clf_score(lsr_clf)

* Best parameters : {'C': 0.01, 'penalty': 'l2'}

### Similarly explore other algorithms by building models like:
- KNN
- SVM
- Decision Tree
- Random Forest
- XGBoost

#### KNN

In [None]:
#K-Nearest Neighbor
# Taking only odd integers as K values to apply the majority rule. 
k_range = np.arange(1, 20, 2)
scores = [] #to store cross val score for each k
k_range

In [None]:
# Finding the best k with stratified K-fold method. 
# We will use cv=3 in cross_val_score to specify the number of folds in the (Stratified)KFold.

for k in k_range:
  knn_clf = KNeighborsClassifier(n_neighbors=k)
  # cross_val_score fits a clone of the estimator on each fold, so no separate fit is needed
  score = cross_val_score(knn_clf, X_train, y_train, cv=3, n_jobs = -1)
  scores.append(score.mean())

#Storing the misclassification error (1 - accuracy) to decide optimum k
mse = [1-x for x in scores]

In [None]:
#Plotting a line plot to decide optimum value of K

plt.figure(figsize=(20,8))
plt.subplot(121)
sns.lineplot(x=k_range,y=mse,markers=True,dashes=False)
plt.xlabel("Value of K")
plt.ylabel("Misclassification Error (1 - Accuracy)")
plt.subplot(122)
sns.lineplot(x=k_range,y=scores,markers=True,dashes=False)
plt.xlabel("Value of K")
plt.ylabel("Cross Validation Accuracy")

plt.show()

* From the above plot optimum K value is 3 for KNN
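
The search above scores each K by cross-validated accuracy. On imbalanced data it can be worth repeating the search with a ranking metric as well. A minimal sketch on synthetic data, assuming ROC AUC as the metric (the dataset and the names `X_demo`, `y_demo` and `auc_by_k` are made up for this example):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic imbalanced data (about 5% positives), for illustration only.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=5, n_informative=3, n_redundant=0,
    weights=[0.95, 0.05], random_state=0,
)

k_values = [1, 3, 5, 7]
auc_by_k = {}
for k in k_values:
    knn_demo = KNeighborsClassifier(n_neighbors=k)
    # cross_val_score clones and fits the estimator on each fold itself.
    fold_auc = cross_val_score(knn_demo, X_demo, y_demo, cv=3, scoring="roc_auc")
    auc_by_k[k] = fold_auc.mean()

best_k = max(auc_by_k, key=auc_by_k.get)
print(auc_by_k)
print("best k by mean ROC AUC:", best_k)
```

The K chosen this way may differ from the accuracy-based choice, since accuracy is dominated by the majority class.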

In [None]:
#Fitting the best parameter to the model
# 3 fold cross validation with K=3

knn = KNeighborsClassifier(n_neighbors=3)

knn_clf = knn.fit(X_train,y_train)

In [None]:
# Checking AUC 

clf_score(knn_clf)


* The KNN model with imbalanced data gives an AUC of 0.94, which is pretty good, but recall is 0.77, which should be improved in this case.

#### Decision Tree

In [None]:
# 5 fold cross validation for getting best parameter

depth_score=[]
dep_rng = [x for x in range(1,20)]
for i in dep_rng:
  clf = tree.DecisionTreeClassifier(max_depth=i)
  score_tree = cross_val_score(estimator=clf, X=X_train, y=y_train, cv=5, n_jobs=-1)
  depth_score.append(score_tree.mean())
print(depth_score)

In [None]:
#Plotting depth against score

plt.figure(figsize=(8,6))
sns.lineplot(x=dep_rng,y=depth_score,markers=True,dashes=False)
plt.xlabel("Depth")
plt.ylabel("Cross Validation Accuracy")

plt.show()

* The score for depth=5 is the highest. We will use this in our model.

In [None]:
#Fitting the model with depth=5 and plotting ROC curve

dt = tree.DecisionTreeClassifier(max_depth = 5)
dt_clf = dt.fit(X_train,y_train)

#Plotting ROC
clf_score(dt_clf)

* The AUC score for the decision tree is only 0.88, which is not satisfactory. The precision and recall are also lower than for the KNN and logistic regression models.

#### Random Forest

In [None]:
# Using grid search cv to find the best parameters.

param = {'n_estimators': [10, 20, 30, 40, 50], 'max_depth': [2, 3, 4, 7, 9]}
rfc = RandomForestClassifier()
clf_rfc_cv = GridSearchCV(rfc, param, cv=5,scoring='roc_auc', n_jobs=-1)
clf_rfc_cv.fit(X_train,y_train)

print("tuned hyperparameters (best parameters):",clf_rfc_cv.best_params_)
# scoring='roc_auc', so best_score_ is the mean cross-validated ROC AUC, not accuracy
print("best CV ROC AUC:",clf_rfc_cv.best_score_)

* We will use these parameters for Random Forest: {'max_depth': 9, 'n_estimators': 50}. The cross-validated ROC AUC is 0.979, which is very good.

In [None]:
#Fitting model and plotting ROC

rf = RandomForestClassifier(max_depth=9, n_estimators=50)
RFC_clf = rf.fit(X_train,y_train)

#Plotting ROC
clf_score(RFC_clf)

* We are getting very good precision (0.97) for the Fraudulent class, along with an AUC of 0.97.


#### XGBoost

In [None]:
from xgboost import XGBClassifier

In [None]:
# Using grid search cv to find the best parameters.

xgbst = XGBClassifier()

param_xgb = {'n_estimators': [25],
             } 

clf_xgb_cv = GridSearchCV(xgbst, param_xgb, cv=3,scoring='roc_auc', n_jobs=-1)
clf_xgb_cv.fit(X_train,y_train)

print("tuned hyperparameters (best parameters):",clf_xgb_cv.best_params_)
# scoring='roc_auc', so best_score_ is the mean cross-validated ROC AUC, not accuracy
print("best CV ROC AUC:",clf_xgb_cv.best_score_)

* We got the best parameters for XGBoost as follows.
tuned hyperparameters : {'max_depth': 5, 'min_child_weight': 3, 'n_estimators': 150} AUC : 0.985

In [None]:
#Fitting the model with best parameters.

xgbst = XGBClassifier(n_estimators=25,max_depth=5,min_child_weight=3)

xgb_clf = xgbst.fit(X_train,y_train)

#Plotting ROC
clf_score(xgb_clf)

* Got AUC of 0.96 with f1-score of 0.82 which is good.
* Recall is 0.74 which is better than our other models.

#### Proceed with the model which shows the best result 
- Apply the best hyperparameter on the model
- Predict on the test dataset

* Out of the 5 models XGBoost performed the best with AUC of 0.98 and Recall of 0.78.


In [None]:
clf = XGBClassifier(n_estimators=25,max_depth=5,min_child_weight=3)  #initialise the model with optimum hyperparameters
clf.fit(X_train, y_train)

# print the evaluation score on the X_test by choosing the best evaluation metric
clf_score(clf)

### Print the important features of the best model to understand the dataset
- This will not give much explanation on the already transformed dataset
- But it will help us in understanding if the dataset is not PCA transformed

In [None]:
# Feature importances as a plain list so that .index() can look up positions
var_imp = list(clf.feature_importances_)
sorted_imp = np.sort(clf.feature_importances_)
print('Top var =', var_imp.index(sorted_imp[-1])+1)
print('2nd Top var =', var_imp.index(sorted_imp[-2])+1)
print('3rd Top var =', var_imp.index(sorted_imp[-3])+1)

# Variable on Index-16 and Index-13 seems to be the top 2 variables
top_var_index = var_imp.index(sorted_imp[-1])
second_top_var_index = var_imp.index(sorted_imp[-2])

# 'Class' holds the labels 'Fraudulent' / 'Non_Fraudulent', not 1.0 / 0.0
X_train_1 = X_train.to_numpy()[np.where(y_train=='Fraudulent')]
X_train_0 = X_train.to_numpy()[np.where(y_train=='Non_Fraudulent')]

np.random.shuffle(X_train_0)
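
The same ranking can be obtained more directly with `np.argsort`, which also avoids the repeated `.index()` lookups. A minimal sketch with hypothetical importance values (the names and numbers below are made up and are not the model's actual importances):

```python
import numpy as np

# Hypothetical importance scores for five features (illustrative values).
feature_names = ["V1", "V2", "V3", "V4", "Amount"]
importances = np.array([0.10, 0.05, 0.40, 0.30, 0.15])

# argsort is ascending, so reverse it to get the most important feature first.
order = np.argsort(importances)[::-1]
top3 = [feature_names[i] for i in order[:3]]
print(top3)  # ['V3', 'V4', 'Amount']
```

Since `argsort` returns each position exactly once, two features with the same importance are both listed, whereas `.index()` would return the first of them twice.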



## Model building with balancing Classes

##### Perform class balancing with :
- Random Oversampling
- SMOTE
- ADASYN

#### Class balancing with Random Oversampling

In [None]:
from imblearn.over_sampling import RandomOverSampler
ros = RandomOverSampler()
# fit_resample is the current imblearn method name (fit_sample is the older name)
X_train_ros, y_train_ros = ros.fit_resample(X_train,y_train)
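
Conceptually, random oversampling just repeats minority rows until both classes have the same count. A simplified NumPy sketch of that idea on a toy training set (this illustrates the principle and is not imblearn's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training set: 8 majority rows (label 0) and 2 minority rows (label 1).
X_toy = np.arange(20, dtype=float).reshape(10, 2)
y_toy = np.array([0] * 8 + [1] * 2)

minority_idx = np.flatnonzero(y_toy == 1)
n_needed = (y_toy == 0).sum() - len(minority_idx)

# Draw minority rows with replacement until both classes have equal counts.
extra_idx = rng.choice(minority_idx, size=n_needed, replace=True)
X_bal = np.vstack([X_toy, X_toy[extra_idx]])
y_bal = np.concatenate([y_toy, y_toy[extra_idx]])

print(np.bincount(y_bal))  # [8 8]
```

Only the training data is resampled. The test data keeps its original class distribution so that the evaluation reflects real conditions.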

#### Class balancing with SMOTE

In [None]:
#importing SMOTE

from imblearn.over_sampling import SMOTE

sm = SMOTE()
X_sm, y_sm = sm.fit_resample(X_train, y_train)

In [None]:
#Checking shape and class count after SMOTE
from collections import Counter

print('Resampled dataset shape %s' % Counter(y_sm))
print(X_sm.shape)
print(y_sm.shape)

* As seen above the count of each class is same after SMOTE resampling.
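
Unlike random oversampling, SMOTE creates new minority points by interpolating between a minority sample and one of its nearest minority neighbours. A simplified NumPy sketch of that interpolation step (the helper `synth_point` and the toy points are hypothetical; this is not imblearn's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy minority-class points in 2-D (illustrative values).
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 2.0]])

def synth_point(points, i, k, rng):
    """Interpolate between point i and one of its k nearest minority neighbours."""
    dists = np.linalg.norm(points - points[i], axis=1)
    neighbours = np.argsort(dists)[1:k + 1]  # skip position 0: the point itself
    j = rng.choice(neighbours)
    lam = rng.random()  # position along the segment, in [0, 1)
    return points[i] + lam * (points[j] - points[i])

new_points = np.array(
    [synth_point(minority, i, k=2, rng=rng) for i in range(len(minority))]
)
print(new_points)
```

Each synthetic point lies on the line segment between two existing minority points, so the new samples stay inside the region the minority class already occupies.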

#### Class Balancing with ADASYN

In [None]:
# importing ADASYN

from imblearn.over_sampling import ADASYN

ada = ADASYN()
X_ada, y_ada = ada.fit_resample(X_train, y_train)

In [None]:
# Checking shape and class count after ADASYN
from collections import Counter

print('Resampled dataset shape %s' % Counter(y_ada))
print(X_ada.shape)
print(y_ada.shape)

## Model Building
- Build different models on the balanced dataset and see the result

* Use the tuned models that were built on the imbalanced data with the Random Oversampling, SMOTE and ADASYN techniques and see which one gives the best result.

#### Logistic Regression with Random Oversampling

In [None]:
lsr_best = LogisticRegression(penalty='l2',C=0.01,random_state = 0)
lsr_ros = lsr_best.fit(X_train_ros,y_train_ros)

# Printing ROC curve and accuracy scores
clf_score(lsr_ros)

#### Logistic Regression with SMOTE

In [None]:
# Logistic Regression
# Using the best parameters that we got from the cross validation on imbalanced data.

lsr_best = LogisticRegression(penalty='l2',C=0.01,random_state = 0)
lsr_sm = lsr_best.fit(X_sm,y_sm)

# Printing ROC curve and accuracy scores
clf_score(lsr_sm)

#### Logistic Regression with ADASYN

In [None]:
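# Use a fresh estimator so that the previously fitted model is not refit in place
lsr_best = LogisticRegression(penalty='l2',C=0.01,random_state = 0)
lsr_ada = lsr_best.fit(X_ada,y_ada)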

# Printing ROC curve and accuracy scores
clf_score(lsr_ada)


**Insight**
* AUC & Recall both are better on SMOTE.
* But the f1-score is extremely low. Model is overfitting.
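
A low f1-score next to a high recall usually points to low precision, because f1 is the harmonic mean of the two and is pulled towards the smaller value. A short worked example with hypothetical numbers (not taken from the runs above):

```python
def f1_score_from(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical values, not taken from the runs above.
balanced = f1_score_from(precision=0.80, recall=0.80)
lopsided = f1_score_from(precision=0.05, recall=0.90)

print(f"balanced: {balanced:.3f}")
print(f"lopsided: {lopsided:.3f}")
```

Even with recall at 0.90, a precision of 0.05 keeps the f1-score below 0.1, so it is worth checking precision in the classification report before drawing conclusions from f1 alone.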

#### KNN with Random Oversampling

In [None]:
# KNN with ROS re-sampled data

knn = KNeighborsClassifier(n_neighbors=3)

knn_roc = knn.fit(X_train_ros,y_train_ros)

#Printing ROC 

clf_score(knn_roc)

#### KNN with SMOTE

In [None]:
# KNN with SMOTE re-sampled data

knn = KNeighborsClassifier(n_neighbors=3)

knn_sm = knn.fit(X_sm,y_sm)

#Printing ROC 

clf_score(knn_sm)

#### KNN on ADASYN

In [None]:
# KNN with ADASYN re-sampled data

knn = KNeighborsClassifier(n_neighbors=3)

knn_ada = knn.fit(X_ada,y_ada)

#Printing ROC 

clf_score(knn_ada)


* KNN gives same recall(0.88) on both SMOTE and ADASYN.
* But on SMOTE, the AUC & f1-score are slightly better. So, KNN performs better on SMOTE.

#### Decision Tree with Random Oversampling

In [None]:
# Building model with ROS

dt = tree.DecisionTreeClassifier(max_depth = 5)
dt_ros = dt.fit(X_train_ros,y_train_ros)

#Plotting ROC
clf_score(dt_ros)

#### Decision Tree on SMOTE

In [None]:
# Building model with SMOTE

dt = tree.DecisionTreeClassifier(max_depth = 5)
dt_sm = dt.fit(X_sm,y_sm)

#Plotting ROC
clf_score(dt_sm)


#### Decision Tree on ADASYN

In [None]:
# Building model with ADASYN

dt = tree.DecisionTreeClassifier(max_depth = 5)
dt_ada = dt.fit(X_ada,y_ada)

#Plotting ROC
clf_score(dt_ada)


* AUC is higher in SMOTE by a small margin but Recall is better in ADASYN than SMOTE.
* The Precision is extremely low in both, resulting in low f1-score. So the model is not good enough.

#### Random Forest with Random Oversampling

In [None]:
#Building Random forest with best parameters on Random Oversampling
rf = RandomForestClassifier(max_depth=9, n_estimators=30)
RFC_ros = rf.fit(X_train_ros,y_train_ros)

#Plotting ROC
clf_score(RFC_ros)

#### Random Forest on SMOTE


In [None]:
#Building Random forest with best parameters on SMOTE
rf = RandomForestClassifier(max_depth=9, n_estimators=30)
RFC_sm = rf.fit(X_sm,y_sm)

#Plotting ROC
clf_score(RFC_sm)


#### Random Forest on ADASYN


In [None]:
#Building Random forest with best parameters on ADASYN
rf = RandomForestClassifier(max_depth=9, n_estimators=30)
RFC_ada = rf.fit(X_ada,y_ada)

#Plotting ROC
clf_score(RFC_ada)


**Insight**
* Random Forest performs better overall on SMOTE.
* Both AUC and Recall for fraud transactions are better on ADASYN sampled data, but Precision is extremely low.
* Whereas on SMOTE we have fair precision with good recall, resulting in a fair f1-score (0.57).

#### XGBoost with Random Oversampling

In [None]:
# Convert the oversampled features to a dataframe with the original column names
X_train_ros = pd.DataFrame(X_train_ros)
X_train_ros.columns = cols

xgbst = XGBClassifier(n_estimators=25,max_depth=5,min_child_weight=3)

xgb_ros = xgbst.fit(X_train_ros,y_train_ros)

#Plotting ROC
clf_score(xgb_ros)

#### XGBoost with SMOTE

In [None]:
# Since X_sm and X_ada are arrays, we need to convert them to dataframes to avoid feature mismatch error
X_sm = pd.DataFrame(X_sm)
X_sm.columns = cols

X_ada = pd.DataFrame(X_ada)
X_ada.columns = cols


In [None]:
#Fitting the XGBoost model with best parameters on SMOTE

xgbst = XGBClassifier(n_estimators=25,max_depth=5,min_child_weight=3)

xgb_sm = xgbst.fit(X_sm,y_sm)

#Plotting ROC
clf_score(xgb_sm)

#### XGBoost with ADASYN


In [None]:
#Fitting the XGBoost model with best parameters on ADASYN

xgbst = XGBClassifier(n_estimators=25,max_depth=5,min_child_weight=3)

xgb_ada = xgbst.fit(X_ada,y_ada)

#Plotting ROC
clf_score(xgb_ada)


* AUC is similar in both resampled data scenarios.
* With SMOTE XGBoost gives a better Recall but both have a low precision & f1-score.

### Choosing the best model

* Based on the various scenarios, XGBoost on SMOTE data gave the best evaluation metrics.
* Instead of aiming for overall accuracy, we focus on detecting most of the fraud cases (recall) while keeping the cost at which this is achieved under control (precision).
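
One way to manage this trade-off is to move the decision threshold applied to the predicted fraud probability. A minimal sketch on toy values (the arrays `y_toy` and `p_toy` and the helper `precision_recall_at` are made up for this illustration): a lower threshold catches more fraud at the cost of more false alarms, and a higher one does the opposite.

```python
import numpy as np

# Toy fraud probabilities and true labels (1 = fraud), illustrative only.
y_toy = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 1])
p_toy = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.45, 0.55, 0.60, 0.80, 0.95])

def precision_recall_at(threshold):
    """Precision and recall for the fraud class at a given probability threshold."""
    pred = p_toy >= threshold
    tp = np.sum(pred & (y_toy == 1))
    fp = np.sum(pred & (y_toy == 0))
    fn = np.sum(~pred & (y_toy == 1))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn)
    return precision, recall

for thr in (0.3, 0.5, 0.7):
    prec, rec = precision_recall_at(thr)
    print(f"threshold={thr:.1f}  precision={prec:.2f}  recall={rec:.2f}")
```

In this toy case, raising the threshold from 0.5 to 0.7 removes the last false alarm but misses one of the three frauds.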

In [None]:
#Predicting on the test data using the best model
y_predicted = xgb_sm.predict(X_test)

In [None]:
print(classification_report(y_test, y_predicted))

In [None]:
target = 'Class'
pca_comp = ['V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10',\
       'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19',\
       'V20', 'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28',\
       'Amount']

In [None]:
tmp = pd.DataFrame({'Feature': pca_comp, 'Feature importance': xgb_sm.feature_importances_})
tmp = tmp.sort_values(by='Feature importance',ascending=False)
plt.figure(figsize = (7,4))
plt.title('Features importance',fontsize=14)
s = sns.barplot(x='Feature',y='Feature importance',data=tmp)
s.set_xticklabels(s.get_xticklabels(),rotation=90)
plt.show() 

**Insight**
* V14 and V4 have the highest feature importance in the XGBoost model, so these are the variables to focus on when detecting fraud.

### Closing Notes

* Based on the provided imbalanced data, classification models were built. To work with such an imbalanced dataset, various techniques such as ROS, SMOTE and ADASYN were used to balance the data.

* Well-known classification models such as Random Forest, Logistic Regression and boosting techniques (XGBoost) were used to catch fraudulent transactions.

* The focus was on Recall and AUC, since Accuracy was not a major concern in the given scenario. The features that are important for detecting fraudulent transactions could also be determined.

