# **Scenario:** Predict Credit Card Fraud using **SVM**, **KNN**, and **Naive Bayes**

Credit card fraud happens when someone — a fraudster or a thief — uses your stolen credit card or the information from that card to make unauthorized purchases in your name or take out cash advances using your account.

### **Problem Statement:**

Credit card companies such as **Citibank**, **HSBC**, and **American Express** need to recognize fraudulent credit card transactions so that customers are not charged for items that they did not purchase.

### **Aim:**

In this demo, you will build a classification model to identify fraudulent credit card transactions.

### **Dataset Description**
The dataset contains transactions made by credit cards in September 2013 by European cardholders.

It covers transactions that occurred over two days, with **492** frauds out of **284,807** transactions.

- **Time** - Number of seconds elapsed between this transaction and the first transaction in the dataset
- **V1-V28** - Encrypted (anonymized) attributes (or columns) that protect user identities and sensitive features
- **Amount** - Transaction Amount
- **Class** - **1** for fraudulent transactions, **0** otherwise

### **Tasks to be performed:**

- Install the required dependencies, import the required libraries and load the data set 
- Perform Exploratory Data Analysis (EDA) on the data set
  -  Generate a Data Report using Pandas Profiling and record your observations
  - Plot **Univariate Distributions**
    - What is the distribution of the **amount** & **class** columns in the data set?
    
- Pre-process the data set for modeling
  - Handle Missing values present in the data set
  - Scale the data set using **RobustScaler()**
  - Split the data into training and testing set using sklearn's **train_test_split** function
- Modelling
  - Build and evaluate an SVM Model
  - Build and evaluate a KNN Model
  - Build and evaluate a Naive Bayes Model

- Model Optimization: Implement **GridSearchCV**
- Model Boosting: Implement **Gradient Boosting** & **XGBoost**
- Dealing with Imbalanced Classes: Re-sampling the data set
- Model Interpretation: Interpret Fraud Detection Model With **Eli5**
- Use **PyCaret** to find the best model and perform Automatic Hyperparameter tuning 

  - Import PyCaret and load the data set
  - Initialize or setup the environment 
  - Compare Multiple Models and their Accuracy Metrics
  - Create the model
  - Tune the model
  - Evaluate the model
- Deploy the model using **Streamlit**

### Importing Required Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
sns.set()

import warnings
warnings.filterwarnings("ignore")
import os

In [None]:
cd = pd.read_csv(r"C:\Users\Shivani Dussa\Downloads\creditcard.csv")
cd.shape

In [None]:
cd.head()

In [None]:
cd.columns

In [None]:
cd.Class.value_counts()    # Class 0 = legitimate (284,315 transactions), Class 1 = fraud (only 492 transactions)

## Exploratory Data Analysis 

### *Generate a Data Report using Pandas Profiling and record your observations*

In [None]:
import pandas_profiling 
from pandas_profiling import ProfileReport
prof = ProfileReport(cd)
prof.to_file(output_file = 'output.html') # Creating the Data report 

In [None]:
ProfileReport(cd).to_notebook_iframe()

___
**Observations:**

- There are **31** variables or features in the dataframe and the total number of instances or rows is **284,807**
- We have **30** Numeric and **1** Boolean variable
- There are no missing cells in the dataset which is a big relief
- There are **773** duplicate rows in the data set which accounts for **0.3%** of the entire data set
___

In [None]:
#cd.info()

In [None]:
cd.duplicated().sum()

In [None]:
#cd.isna().sum()

**Note:** Answer the following questions:

- What columns seem to have **outliers** based on **min**, **max** and **percentile values**, **IQR range** along with the **standard deviation** and **mean absolute deviation**?
- What columns have missing values? (Check the **Missing Values** section in **Pandas Profiling**)
- What columns have a high number of zeros/NaN?

- What columns have **high variance** and **standard deviation**?
- Comment on the distribution of the continuous values **(Real Number: ℝ≥0)**
- Do you see any alarming trends in the extreme values (minimum 5 and maximum 5)?
- How many booleans columns are there in the data set and out of those how many are imbalanced?
- Check for **duplicate records** across all columns (**Check Warning Section**)

- How many columns are categorical?
  - Are those categories in sync with the domain categories?
  - Check if all the categories are unique and they represent distinct information
  - Is there any imbalance in the categorical columns?

Based on the above questions and your observations, chart out a plan for **Data Pre-processing** and feature engineering

**Note:** Feature Engineering (Feature Selection and Feature Creation)

- From the **Interaction Tab**, write at least 3 observations that may be very crucial for prediction. Make sure that they are in story format

**For Example:** Av monthly hours vs Satisfaction Level..

- Check the **Pearson** and **Spearman** tabs in the **correlation** section and note down the columns which are highly correlated (Positive and Negative Correlation). Create two bands of thresholds. (Consider 60 (0.6) to 80 (0.8) or 80 to 100 as high) 


In [None]:
fig = px.histogram(cd,x = 'Time')
fig.show()

- **Around 16k, transaction counts are very low**
- **Around 8103k, transaction counts are also very low**

In [None]:
# 'scaled_amount' is only created later in the notebook, so plot the raw 'Amount' column here
fig = px.histogram(cd,x = 'Amount')
fig.show()

- **Most transactions have amounts of up to about 487**

In [None]:
cd.Class.value_counts()

In [None]:
# converting above to percentage
round((cd.Class.value_counts()/cd.shape[0]),5)*100

In [None]:
# 99.827% of the transactions are not fraudulent
# Only 0.173% of the transactions are fraudulent, so frauds are a tiny fraction of the data
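
With so few frauds, plain accuracy can be misleading. The sketch below uses the class counts reported above and a hypothetical "model" that labels every transaction as legitimate; the variable names are illustrative only.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Class counts reported above: 284,315 legitimate and 492 fraudulent transactions
y_true_all = np.concatenate([np.zeros(284315, dtype=int), np.ones(492, dtype=int)])

# Hypothetical "model" that predicts class 0 (legitimate) for every transaction
y_pred_all_zero = np.zeros_like(y_true_all)

baseline_accuracy = accuracy_score(y_true_all, y_pred_all_zero)
baseline_recall = recall_score(y_true_all, y_pred_all_zero)

print(f"accuracy: {baseline_accuracy:.5f}")   # very high, although no fraud is caught
print(f"recall:   {baseline_recall:.5f}")     # share of frauds that were detected
```

Such a baseline scores very high on accuracy while detecting no fraud at all, which is why the later sections also report precision, recall and F1.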

In [None]:
%matplotlib inline

plt.figure(figsize=(12,8))
ax = sns.countplot(x=cd["Class"], color='green')
# Annotate each bar with its share of all transactions
for p in ax.patches:
    height = p.get_height()
    ax.annotate('{:.2f}%'.format(100. * height / len(cd)),
                (p.get_x() + p.get_width() / 2, height), ha='center', va='bottom')
# Show the figure once, after all bars have been annotated
plt.show()

___
**Observations:**

The data set is **Highly Unbalanced** with only **0.17%** of transactions being classified as **Fraudulent**. 

Several ways to approach this Imbalance Classification problem:

- **Acquire More Data** (Not Possible in our case)
- **Changing the performance metric:**
 - Use the **Confusion Matrix**
 - **F1-Score** (Harmonic Mean of **Precision** & **Recall**)
 - **ROC Curves**

- **Re-sampling the dataset:** Essentially this is a method that will process the data to have an approximate 50-50 ratio.

 - **Over-sampling**, which is adding copies of the under-represented class (better when you have little data)

 - **Under-sampling**, which deletes instances from the over-represented class (better when we have lots of data)

**NOTE:** We will first use the 2nd approach (changing the performance metric) and later use **SMOTE** as part of the 3rd approach (re-sampling)
___
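
The two re-sampling ideas can be sketched with plain pandas. This is a minimal illustration on a synthetic frame (the `toy` data and all variable names are made up for the example); it is not SMOTE, which creates synthetic minority samples rather than copies. In practice, re-sampling is usually applied to the training split only.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data: 1,000 rows, 2% positive class
rng = np.random.default_rng(0)
toy = pd.DataFrame({"feature": rng.normal(size=1000),
                    "Class": [1] * 20 + [0] * 980})

fraud = toy[toy["Class"] == 1]
legit = toy[toy["Class"] == 0]

# Random under-sampling: keep as many majority rows as there are minority rows
legit_down = legit.sample(n=len(fraud), random_state=0)
balanced = pd.concat([fraud, legit_down]).sample(frac=1, random_state=0)

# Random over-sampling: draw minority rows with replacement up to the majority size
fraud_up = fraud.sample(n=len(legit), replace=True, random_state=0)
oversampled = pd.concat([legit, fraud_up]).sample(frac=1, random_state=0)

print(balanced["Class"].value_counts())
print(oversampled["Class"].value_counts())
```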

## Data Implementation

**RobustScaler:** The centering and scaling statistics of RobustScaler are based on percentiles (by default the median and the interquartile range), so they are not influenced by a small number of very large outliers. As a result, most of the transformed values fall in a narrow range around zero, while the outliers themselves stay far away from the bulk of the data.
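
As a small check of what the scaler does, the sketch below compares `RobustScaler` with the same transformation written by hand (subtract the median, divide by the interquartile range). The amounts are made-up values with one extreme outlier.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Toy amounts with one extreme outlier (made-up values)
toy_amounts = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 1000.0]).reshape(-1, 1)

scaled_toy = RobustScaler().fit_transform(toy_amounts)

# Same transformation by hand: subtract the median, divide by the IQR
toy_median = np.median(toy_amounts)
q1, q3 = np.percentile(toy_amounts, [25, 75])
manual_toy = (toy_amounts - toy_median) / (q3 - q1)

print(scaled_toy.ravel())
print(np.allclose(scaled_toy, manual_toy))
```

The outlier stays large after scaling, but it does not distort the scale of the remaining values.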

In [None]:
cd['Amount'].values.shape

In [None]:
cd['Time'].values.shape

In [None]:
from sklearn.preprocessing import RobustScaler
rs = RobustScaler()

# Add scaled versions of Amount and Time with fit_transform, then drop the original columns
# (each fit_transform call re-fits the scaler on the column it receives)
cd['scaled_amount'] = rs.fit_transform(cd['Amount'].values.reshape(-1,1)).ravel()
cd['scaled_time'] = rs.fit_transform(cd['Time'].values.reshape(-1,1)).ravel()
cd.drop(['Amount','Time'],axis = 1,inplace = True)

In [None]:
cd.head()

In [None]:
cd.columns

In [None]:
cd = cd[['scaled_amount', 'scaled_time','V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11',
       'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20', 'V21',
       'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Class']]
cd.head(4)

In [None]:
from sklearn.model_selection import train_test_split
X = cd.drop(columns = 'Class')   # all feature columns
y = cd['Class']                  # target as a 1-D Series, which is what the estimators expect

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
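
With such a rare positive class, a purely random split can leave the test set with noticeably more or fewer frauds than expected. One option, shown here only as a sketch on synthetic labels, is the `stratify` argument of `train_test_split`, which keeps the class ratio the same in both splits. The notebook itself does not use it.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1,000 samples with 20 positives (2%)
X_split_toy = np.arange(1000).reshape(-1, 1)
y_split_toy = np.array([1] * 20 + [0] * 980)

# Plain random split
_, _, _, y_te_plain = train_test_split(
    X_split_toy, y_split_toy, test_size=0.2, random_state=0)

# Stratified split: the test set keeps the 2% positive rate
_, _, _, y_te_strat = train_test_split(
    X_split_toy, y_split_toy, test_size=0.2, random_state=0, stratify=y_split_toy)

print("positives in plain test split:     ", y_te_plain.sum())
print("positives in stratified test split:", y_te_strat.sum())
```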

In [None]:
X_train.shape , y_train.shape

In [None]:
X_test.shape , y_test.shape

## Model Building 

### SVM
- Baseline Modeling

In [None]:
from sklearn import svm
shiv = svm.LinearSVC(random_state = 20)
shiv.fit(X_train,y_train)
#pred = shiv.predict(X_test)

In [None]:
pred = shiv.predict(X_test)

In [None]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test,pred)

In [None]:
# True negatives: 85,282 legitimate transactions were correctly classified
# True positives: 106 of the 147 actual frauds were detected (106 true positives + 41 false negatives)
# False positives: 14 legitimate transactions were flagged as fraud

In [None]:
from sklearn.metrics import accuracy_score,precision_score,f1_score,recall_score,roc_auc_score
svm_acc = accuracy_score(y_test,pred)
svm_preci = precision_score(y_test,pred)
svm_recal = recall_score(y_test,pred)
svm_f1 = f1_score(y_test,pred)
svm_roc_auc = roc_auc_score(y_test,pred)
print('Accuracy:',svm_acc)
print('Precision:',svm_preci)
print('recall:',svm_recal)
print('f1 score:',svm_f1)
print('roc auc:',svm_roc_auc)
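
The cell above passes hard class labels to `roc_auc_score`. The metric is defined on a ranking, so it is usually computed from continuous scores (for `LinearSVC`, the values returned by `decision_function`). The made-up example below shows how thresholding the scores first can change the result.

```python
from sklearn.metrics import roc_auc_score

# Made-up labels and continuous scores (e.g. what decision_function would return)
y_true_demo = [0, 0, 0, 1, 1]
scores_demo = [0.1, 0.2, 0.6, 0.4, 0.9]

# Hard labels obtained by thresholding the scores at 0.5
labels_demo = [int(s >= 0.5) for s in scores_demo]

auc_from_scores = roc_auc_score(y_true_demo, scores_demo)
auc_from_labels = roc_auc_score(y_true_demo, labels_demo)

print("AUC from scores:", auc_from_scores)
print("AUC from labels:", auc_from_labels)
```

The label-based value discards the ranking information inside each predicted class, so the two numbers generally differ.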

## KNN

In [None]:
cd.head(2)

In [None]:
from sklearn.neighbors import KNeighborsClassifier
darsh = KNeighborsClassifier(n_neighbors = 3,metric = 'euclidean')  # here k = 3
darsh.fit(X_train,y_train)
predi = darsh.predict(X_test)

In [None]:
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test,predi))

In [None]:
from sklearn.metrics import accuracy_score,precision_score,f1_score,recall_score,roc_auc_score,classification_report
knn_acc = accuracy_score(y_test,predi)
knn_preci = precision_score(y_test,predi)
knn_recal = recall_score(y_test,predi)
knn_f1 = f1_score(y_test,predi)
knn_roc_auc = roc_auc_score(y_test,predi)
print('Accuracy:',knn_acc)
print('Precision:',knn_preci)
print('recall:',knn_recal)
print('f1 score:',knn_f1)
print('roc auc:',knn_roc_auc)

In [None]:
print(classification_report(y_test,predi))

- **macro avg**
 - Computes the metric (for example, F1) for each label and returns the unweighted mean, ignoring how many samples each label has
 
- **weighted avg**

 - Computes the metric for each label and returns the mean weighted by the number of true samples (support) of each label
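
A small made-up example makes the difference between the two averages concrete. It uses the `average` parameter of `f1_score`; the labels are illustrative only.

```python
from sklearn.metrics import f1_score

# Made-up imbalanced labels: 8 negatives, 2 positives
y_true_avg = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred_avg = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

per_class_f1 = f1_score(y_true_avg, y_pred_avg, average=None)       # F1 of class 0 and class 1
macro_f1 = f1_score(y_true_avg, y_pred_avg, average="macro")        # plain mean of the two
weighted_f1 = f1_score(y_true_avg, y_pred_avg, average="weighted")  # mean weighted by support

print("per class:", per_class_f1)
print("macro:    ", macro_f1)
print("weighted: ", weighted_f1)
```

Because the majority class dominates the support, the weighted average sits close to the majority-class F1, while the macro average exposes the weaker minority-class score.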

## How to choose K in KNN

In [None]:
# this cell will take 30-40 mins to execute
error_rate = []
for i in range(1,40):
    knn = KNeighborsClassifier(n_neighbors = i)
    knn.fit(X_train,y_train)
    pred_i = knn.predict(X_test)
    # flatten y_test so the comparison with the 1-D predictions is element-wise
    error_rate.append(np.mean(pred_i != np.ravel(y_test)))
    
plt.figure(figsize = (10,6))
plt.plot(range(1,40),error_rate)

plt.title('Error Rate vs K value')
plt.xlabel('K')
plt.ylabel('Error rate')

# error_rate[0] belongs to k = 1, so add 1 to the list index to get k
print("Minimum error:-",min(error_rate),"at k = ",error_rate.index(min(error_rate)) + 1)
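
The loop above picks K from the test-set error, which lets the test data influence model selection. A possible alternative, sketched here on synthetic data (not the credit card data), is to score each K with cross-validation on the training data and use F1 instead of the raw error rate. The data, the K grid and all names are assumptions made for the illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, imbalanced stand-in data
X_knn_toy, y_knn_toy = make_classification(
    n_samples=600, n_features=8, weights=[0.9, 0.1], random_state=0)

k_values = list(range(1, 16, 2))
cv_f1 = []
for k in k_values:
    knn_cv = KNeighborsClassifier(n_neighbors=k)
    # Mean F1 over 5 cross-validation folds for this value of K
    cv_f1.append(cross_val_score(knn_cv, X_knn_toy, y_knn_toy, cv=5, scoring="f1").mean())

# Look up K through k_values rather than using the list index directly
best_k = k_values[cv_f1.index(max(cv_f1))]
print("Best K by cross-validated F1:", best_k)
```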

In [None]:
y_test.shape

In [None]:
y_train.shape

In [None]:
X_test.shape

In [None]:
X_train.shape

## Re-train the KNN Model with Optimal value k

In [None]:
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors = 7 ,metric = 'euclidean')
model.fit(X_train,y_train)
pred1 = model.predict(X_test)

In [None]:
print(classification_report(y_test,pred1))

## Naive Bayes

In [None]:
from sklearn.naive_bayes import GaussianNB
# Create and train a Gaussian Naive Bayes classifier
gnb = GaussianNB()
gnb.fit(X_train,y_train)
pred2 = gnb.predict(X_test)

In [None]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test,pred2)

In [None]:
from sklearn.metrics import accuracy_score,precision_score,f1_score,recall_score,roc_auc_score,classification_report
# Evaluate the Naive Bayes predictions (pred2), not the KNN predictions (pred1)
acc = accuracy_score(y_test,pred2)
preci = precision_score(y_test,pred2)
recal = recall_score(y_test,pred2)
f1 = f1_score(y_test,pred2)
roc_auc = roc_auc_score(y_test,pred2)
print('Accuracy:',acc)
print('Precision:',preci)
print('recall:',recal)
print('f1 score:',f1)
print('roc auc:',roc_auc)

# Model Evaluation

- **Precision:**
  - What percentage of positive predictions made were correct? This is **Precision**
  - No. of True Positives divided by the no. of True Positives plus the No. of False Positives
 
- **Recall:** Ratio of True Positives to all the positives in your Dataset

- **When to use Precision & Recall:** 
 - In the credit card fraud detection task, let's say we modify the model so that it flags only a single transaction as fraud, and that prediction is correct. 

 - Now, our precision will be 1.0 (no false positives) but our recall will be very low because we will still have many false negatives. 

 - If we go to the other extreme and classify all transactions as fraud, we will have a recall of 1.0 (we'll catch every fraudulent transaction), but our precision will be very low because we'll misclassify many legitimate transactions. In other words, pushing precision up tends to push recall down, and vice versa.

- **F1-Score:**
 F1 Score is the harmonic mean of Precision and Recall. F1 is usually more useful than accuracy, especially when we have an uneven class distribution

 - **When to use F1-Score:** 
   - Useful when you have data with imbalanced classes
   - Let us say, we have a model with a precision of 1, and recall of 0 which gives a simple average as 0.5 and an F1 score of 0
   - If one of the parameters is low, the second one no longer matters in the F1 score 
   - The F1 score favors classifiers that have similar precision and recall
   - F1 score is a better measure to use if you are seeking a balance between Precision and Recall

- **roc_auc_score**
 - roc_auc_score ranges from 0 to 1 and measures how well the model ranks its predicted scores or probabilities. 0.5 is the baseline for random guessing
 - This metric shows how good our model is at ranking predictions
  
   When to use it and when not to?
    - Avoid relying on it alone when your data is heavily imbalanced
    - Use it when you care equally about the positive and negative classes
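
The definitions above can be checked by hand. The sketch below uses the counts quoted in the SVM confusion-matrix comment earlier in this notebook (85,282 true negatives, 14 false positives, 41 false negatives, 106 true positives); the variable names are chosen for this example only.

```python
# Counts quoted in the SVM confusion-matrix comment earlier in the notebook
tn_demo, fp_demo, fn_demo, tp_demo = 85282, 14, 41, 106

# Precision: share of predicted frauds that really are frauds
precision_manual = tp_demo / (tp_demo + fp_demo)

# Recall: share of actual frauds that were detected
recall_manual = tp_demo / (tp_demo + fn_demo)

# F1: harmonic mean of precision and recall
f1_manual = 2 * precision_manual * recall_manual / (precision_manual + recall_manual)

# Accuracy: share of all predictions that are correct
accuracy_manual = (tp_demo + tn_demo) / (tp_demo + tn_demo + fp_demo + fn_demo)

print(f"precision: {precision_manual:.4f}")
print(f"recall:    {recall_manual:.4f}")
print(f"f1:        {f1_manual:.4f}")
print(f"accuracy:  {accuracy_manual:.4f}")
```

Accuracy is close to 1 even though more than a quarter of the frauds are missed, which is the reason for reporting recall and F1 alongside it.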



In [None]:
cd.Class.value_counts()

In [None]:
# Every row needs one value per column, including Recall
results = pd.DataFrame([['Naive Bayes Classifier',acc,preci,recal,f1,roc_auc],
                       ['Support Vector Machine',svm_acc,svm_preci,svm_recal,svm_f1,svm_roc_auc],
                       ['K Nearest Neighbor',knn_acc,knn_preci,knn_recal,knn_f1,knn_roc_auc]
                       ],
            columns = ['Model','Accuracy','Precision','Recall','f1 score','ROC'])
results

## Model Optimization: GridSearchCV

**SVM Hyperparameters:**

- **Gamma**
 - Used with non-linear SVM. Commonly used non-linear kernel is the Radial Basis Function (RBF)
 - Gamma parameter of RBF controls the distance of the influence of a single training point
 - Low values of gamma indicate a large similarity radius which results in more points being grouped together
 - For high values of gamma, the points need to be very close to each other in order to be considered in the same group (or class)
 - Models with very large gamma values tend to overfit.
 
- **C**
 - Adds a penalty for each misclassified data point
 - If C is small, the penalty for misclassified points is low, so a decision boundary with a large margin is chosen at the expense of a greater number of misclassifications
 - If C is large, SVM tries to minimize the number of misclassified examples because of the high penalty, which results in a decision boundary with a smaller margin
 - The penalty is not the same for all misclassified examples
 - It is directly proportional to the distance to the decision boundary

**NOTE:** If you want to learn more about SVM Hyper-parameters, [**Click Here!**](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)
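
To make the role of gamma concrete, the sketch below evaluates the RBF kernel, exp(-gamma * ||a - b||^2), for one fixed pair of points and several gamma values. The helper `rbf_kernel_value` and the points are written for this illustration and are not part of scikit-learn.

```python
import numpy as np

def rbf_kernel_value(a, b, gamma):
    """RBF kernel similarity: exp(-gamma * squared Euclidean distance)."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return np.exp(-gamma * np.sum(diff ** 2))

point_a, point_b = [0.0, 0.0], [1.0, 1.0]   # squared distance between them is 2

for gamma in (0.01, 1.0, 100.0):
    # Small gamma: the points still look similar; large gamma: similarity drops towards 0
    print(gamma, rbf_kernel_value(point_a, point_b, gamma))
```

With a small gamma the two points are still treated as similar, whereas a large gamma makes every point influence only its immediate neighborhood, which is where the tendency to overfit comes from.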

In [None]:
# GridSearchCV
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV


param_grid = {'C':[0.1,1,10],
             'gamma':['scale','auto'],
             'kernel':['rbf']}
svc_grid = GridSearchCV(SVC(),param_grid,refit = True,verbose = 3,cv = 2)

svc_grid.fit(X_train,y_train)

In [None]:
print('Best Parameters:',svc_grid.best_params_,'Best estimator:',svc_grid.best_estimator_)

In [None]:
g_p = svc_grid.predict(X_test)

from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test,g_p))


In [None]:
from sklearn.metrics import accuracy_score,precision_score,f1_score,recall_score,roc_auc_score,classification_report
grid_acc = accuracy_score(y_test,g_p)
grid_preci = precision_score(y_test,g_p)
grid_recal = recall_score(y_test,g_p)
grid_f1 = f1_score(y_test,g_p)
grid_roc_auc = roc_auc_score(y_test,g_p)
print('Accuracy:',grid_acc)
print('Precision:',grid_preci)
print('recall:',grid_recal)
print('f1 score:',grid_f1)
print('roc auc:',grid_roc_auc)

In [None]:
cd2 = {'Model': 'SVC With GridSearchCV', 'Accuracy': grid_acc, 'Precision': grid_preci, 'Recall': grid_recal, 
            'f1 score': grid_f1, 'ROC': grid_roc_auc} 
# DataFrame.append was removed in pandas 2.0, so use pd.concat instead
results2 = pd.concat([results, pd.DataFrame([cd2])], ignore_index = True)
results2

## Model Boosting

**Gradient Boosting Classifier**

Parameters

- **n_estimators:** Represents the number of boosting stages, that is, the number of trees added one after another
- **learning_rate:** Shrinks the contribution of each tree by learning_rate
- **max_features:** Represents the number of features to consider when looking for the best split 
- **max_depth:** Indicates how deep each individual tree can be
- **random_state:** Seeds the random number generator so that the random choices made during training are reproducible from run to run
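
The sketch below illustrates `n_estimators` on synthetic data (not the credit card data): one tree is added per boosting stage, and `staged_predict` exposes the predictions after each stage. The data and variable names are assumptions made for the illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in data
X_gb_toy, y_gb_toy = make_classification(n_samples=500, n_features=10, random_state=0)

gb_toy = GradientBoostingClassifier(
    n_estimators=50, learning_rate=0.1, max_depth=2, random_state=0)
gb_toy.fit(X_gb_toy, y_gb_toy)

# For binary classification, one regression tree is stored per boosting stage
print(gb_toy.estimators_.shape)

# Training accuracy after each boosting stage
staged_acc = [accuracy_score(y_gb_toy, stage_pred)
              for stage_pred in gb_toy.staged_predict(X_gb_toy)]
print("after first stage:", staged_acc[0], "after last stage:", staged_acc[-1])
```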


In [None]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix,classification_report

gbc = GradientBoostingClassifier(n_estimators = 200,learning_rate = 0.1,max_features = 2,max_depth = 2,random_state = 0)
gbc.fit(X_train,y_train)

print('Accuracy score(training set):{0:.3f}'.format(gbc.score(X_train,y_train)))
print('Accuracy score(testing set):{0:.3f}'.format(gbc.score(X_test,y_test)))

pred3 = gbc.predict(X_test)

print('Confusion matrix:')
print(confusion_matrix(y_test,pred3))

print('Classification report:')
print(classification_report(y_test,pred3))

**Let us try different learning rates and compare the classifier's performance at each of them**