#  Project 

Outline: 
* intro
* data
* data cleaning 
* build model
* analysis result

# Introduction
Credit cards are widely used in daily life. However, credit card fraud is a long-standing problem that costs both customers and banks a lot of money. In this project, we analyze credit card fraud data, look for patterns in how fraud occurs, and build models to detect fraudulent transactions.

# Data

**Acknowledgements**

The dataset has been collected and analysed during a research collaboration of Worldline and the Machine Learning Group (http://mlg.ulb.ac.be) of ULB (Université Libre de Bruxelles) on big data mining and fraud detection.
More details on current and past projects on related topics are available on https://www.researchgate.net/project/Fraud-detection-5 and the page of the DefeatFraud project

Please cite the following works:

Andrea Dal Pozzolo, Olivier Caelen, Reid A. Johnson and Gianluca Bontempi. Calibrating Probability with Undersampling for Unbalanced Classification. In Symposium on Computational Intelligence and Data Mining (CIDM), IEEE, 2015

Dal Pozzolo, Andrea; Caelen, Olivier; Le Borgne, Yann-Ael; Waterschoot, Serge; Bontempi, Gianluca. Learned lessons in credit card fraud detection from a practitioner perspective, Expert systems with applications,41,10,4915-4928,2014, Pergamon

Dal Pozzolo, Andrea; Boracchi, Giacomo; Caelen, Olivier; Alippi, Cesare; Bontempi, Gianluca. Credit card fraud detection: a realistic modeling and a novel learning strategy, IEEE transactions on neural networks and learning systems,29,8,3784-3797,2018,IEEE

Dal Pozzolo, Andrea Adaptive Machine learning for credit card fraud detection ULB MLG PhD thesis (supervised by G. Bontempi)

Carcillo, Fabrizio; Dal Pozzolo, Andrea; Le Borgne, Yann-Aël; Caelen, Olivier; Mazzer, Yannis; Bontempi, Gianluca. Scarff: a scalable framework for streaming credit card fraud detection with Spark, Information fusion,41, 182-194,2018,Elsevier

Carcillo, Fabrizio; Le Borgne, Yann-Aël; Caelen, Olivier; Bontempi, Gianluca. Streaming active learning strategies for real-life credit card fraud detection: assessment and visualization, International Journal of Data Science and Analytics, 5,4,285-300,2018,Springer International Publishing

Bertrand Lebichot, Yann-Aël Le Borgne, Liyun He, Frederic Oblé, Gianluca Bontempi Deep-Learning Domain Adaptation Techniques for Credit Cards Fraud Detection, INNSBDDL 2019: Recent Advances in Big Data and Deep Learning, pp 78-88, 2019

Fabrizio Carcillo, Yann-Aël Le Borgne, Olivier Caelen, Frederic Oblé, Gianluca Bontempi Combining Unsupervised and Supervised Learning in Credit Card Fraud Detection Information Sciences, 2019

Yann-Aël Le Borgne, Gianluca Bontempi Machine Learning for Credit Card Fraud Detection - Practical Handbook

## import packages

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
df = pd.read_csv('../input/creditcardfraud/creditcard.csv')
df.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


In [None]:
df.describe()

In [None]:
df.info()

In [None]:
df.shape

In [None]:
# Time vs. Amount: red = fraud, blue = non-fraud
fraud_df = df[df.Class == 1]
normal_df = df[df.Class == 0]
plt.scatter(normal_df.Time, normal_df.Amount, color='blue', label='non-fraud')
plt.scatter(fraud_df.Time, fraud_df.Amount, color='red', label='fraud')
plt.xlabel('Time')
plt.ylabel('Amount')
plt.legend()
plt.show()

In [None]:
# correlation matrix

corr = df.corr()
round(corr,2)

In [None]:
sns.heatmap(corr);

In [None]:
# Compare the distribution of each PCA feature for fraud vs. non-fraud transactions.

fig, axes = plt.subplots(7, 4, figsize=(24, 16))
fig.suptitle('Density Plot for each feature')

# V1..V28 fill the 7x4 grid row by row.
# fill=True replaces the deprecated shade=True.
for i, ax in enumerate(axes.flat, start=1):
    sns.kdeplot(ax=ax, x=f'V{i}', hue='Class', data=df, fill=True)


In [None]:
sns.distplot(df.Time.values)
sns.distplot(fraud_df.Time.values)

In [None]:
sns.distplot(df.Amount.values)
sns.distplot(fraud_df.Amount.values)

In [None]:
# sns.set_theme(style="darkgrid")
sns.countplot(x='Class', hue='Class', data=df)
plt.title('Data Class Distributions  \n (0: No Fraud, 1: Fraud)', fontsize=14)

The plots above show that Time and Amount are on a much larger scale than the other 28 features, so we need to rescale "Time" and "Amount"; otherwise the models we build may be inaccurate.
The classes are also heavily imbalanced: there are far fewer fraudulent transactions than normal ones. Since our objective is to detect fraudulent credit card transactions, we need to handle this imbalance.
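
As a rough illustration of what the rescaling does, the sketch below checks that scikit-learn's `RobustScaler` (the scaler used in the Data Cleaning section) with default settings matches subtracting the median and dividing by the interquartile range. The `amounts` array is made-up toy data, not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Toy transaction amounts with one extreme value (illustrative only).
amounts = np.array([1.0, 2.0, 3.0, 4.0, 1000.0]).reshape(-1, 1)

# Default RobustScaler: center on the median, scale by the IQR (75th - 25th percentile).
scaled = RobustScaler().fit_transform(amounts)

# The same computation done by hand.
median = np.median(amounts)
q1, q3 = np.percentile(amounts, [25, 75])
manual = (amounts - median) / (q3 - q1)

print(np.allclose(scaled, manual))
print(scaled.ravel())
```

Because the median and IQR are barely affected by the single large value, the typical amounts end up close to zero while the outlier stays clearly separated.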

### Understand the data

This dataset contains only numerical variables. For confidentiality reasons, the original features and further background information about the data are not provided. The 28 features V1, V2, ... V28 are principal components obtained with PCA. The features "Time", "Amount", and "Class" are left untransformed. "Class" is the target variable: 1 represents fraud and 0 represents all other cases. "Amount" is the transaction amount, and "Time" is the number of seconds elapsed between each transaction and the first transaction in the dataset. These two features can help analyze fraudulent transaction amounts and any seasonal pattern in fraudulent transactions [1].


There are no missing values.
The scatter plot shows that most fraudulent transactions do not involve large amounts, and that they occur throughout the recorded time span. We found no obvious seasonal pattern over time.
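
One possible way to check for a daily pattern more directly is to fold "Time" into a relative hour of day and compare fraud rates per hour. This is only a sketch on a made-up `toy` frame that reuses the dataset's column names. Since "Time" counts seconds from the first transaction, the hour is relative to that first transaction and not to the wall clock.

```python
import pandas as pd

# Toy frame with the same column names as the dataset (values are made up).
toy = pd.DataFrame({
    "Time": [0, 3600, 7200, 90000, 93600, 100800],
    "Class": [0, 1, 0, 0, 1, 0],
})

# Fold elapsed seconds into a 0-23 hour bucket relative to the first transaction.
toy["hour"] = (toy["Time"] // 3600) % 24

# Fraud rate per relative hour: the mean of the 0/1 label within each bucket.
fraud_rate_by_hour = toy.groupby("hour")["Class"].mean()
print(fraud_rate_by_hour)
```

On the real data, the same grouping applied to `df` would give one fraud rate per hour bucket, which could then be plotted as a bar chart.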

From the correlation matrix and heatmap, we found no obvious collinear relationship between the features.
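
Reading a 31x31 heatmap by eye can miss things, so a numeric check may help. The sketch below lists feature pairs whose absolute correlation exceeds a threshold. The `toy` frame and the 0.8 cutoff are assumptions chosen for illustration, not values from this analysis.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=500)
toy = pd.DataFrame({
    "A": a,
    "B": 2 * a + rng.normal(scale=0.01, size=500),  # nearly collinear with A
    "C": rng.normal(size=500),                      # independent of A and B
})

corr = toy.corr().abs()
# Keep only the strict upper triangle so each pair appears once and the diagonal is dropped.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
high = pairs[pairs > 0.8]  # 0.8 is an arbitrary illustrative threshold
print(high)
```

Applied to `df.corr()`, an empty result would support the conclusion that no strongly collinear pairs exist.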



**Reference** 
1. https://www.kaggle.com/mlg-ulb/creditcardfraud





## Data Cleaning

In [4]:
# rescale "Time" and "Amount"
from sklearn.preprocessing import RobustScaler

robust_scaler = RobustScaler()
df['scaled_amount'] = robust_scaler.fit_transform(df['Amount'].values.reshape(-1,1))
df['scaled_time'] = robust_scaler.fit_transform(df['Time'].values.reshape(-1,1))

df.drop(['Time','Amount'], axis=1, inplace=True)

In [None]:
df.head()

In [5]:
# Split data
from sklearn.model_selection import train_test_split 

# train, test_ = train_test_split(df, train_size=0.8, random_state=2021, shuffle=True )
# train_df, validation = train_test_split(train, test_size=0.5, random_state=2021, shuffle=True )

X = df.drop('Class', axis = 1)
y = df['Class']
train_X,test_X, train_y, test_y = train_test_split(X,y, test_size=0.3, random_state=2021)

# Resulting split: 70% train, 30% test (no separate validation set)

In [None]:
print(train_X.shape, train_y.shape)
print(test_X.shape, test_y.shape)

In [6]:
# Handle imbalanced data
from imblearn.over_sampling import SMOTE
# from imblearn.under_sampling import RandomUnderSampler


oversample = SMOTE()
train_X, train_y = oversample.fit_resample(train_X, train_y)

print(sum(train_y==0))
print(sum(train_y==1))
# Now the class is balanced. 

199014
199014


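SMOTE stands for Synthetic Minority Oversampling Technique. It balances the classes by generating synthetic minority-class samples, each interpolated between an existing minority sample and one of its nearest minority-class neighbors. Here it is applied to the training set only, so the test set keeps its original class distribution.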

Reference: https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/
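
To make the interpolation idea concrete, here is a simplified NumPy sketch of how a single synthetic point could be generated. It is not the imbalanced-learn implementation. The function name `smote_like_sample`, the toy `minority` points, and `k=2` are all hypothetical choices for illustration.

```python
import numpy as np

def smote_like_sample(minority, k=2, rng=None):
    """Return one synthetic point between a random minority sample
    and one of its k nearest minority-class neighbors."""
    rng = np.random.default_rng() if rng is None else rng
    i = rng.integers(len(minority))
    # Euclidean distance from sample i to every minority sample.
    dists = np.linalg.norm(minority - minority[i], axis=1)
    # Position 0 of the sort is the sample itself (distance 0), so skip it.
    neighbors = np.argsort(dists)[1:k + 1]
    j = rng.choice(neighbors)
    gap = rng.random()  # uniform in [0, 1)
    # Move a random fraction of the way from sample i toward neighbor j.
    return minority[i] + gap * (minority[j] - minority[i])

# Four minority points on the corners of the unit square (toy data).
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
rng = np.random.default_rng(2021)
synthetic = np.array([smote_like_sample(minority, k=2, rng=rng) for _ in range(5)])
print(synthetic)
```

Every synthetic point lies on a segment between two real minority points, which is why SMOTE adds new samples without simply duplicating existing ones.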

# Build model

1. logistic regression
2. KNN
3. SVM

In [None]:
# from sklearn.metrics import roc_auc_score
# from sklearn.model_selection import cross_val_predict
# from sklearn.model_selection import cross_val_score

# # Classifier Libraries
# from sklearn.linear_model import LogisticRegression
# from sklearn.svm import SVC
# from sklearn.neighbors import KNeighborsClassifier
# from sklearn.tree import DecisionTreeClassifier


# classifiers = {
#     "LogisiticRegression": LogisticRegression(),
#     "KNearest": KNeighborsClassifier(),
# #     "Support Vector Classifier": SVC(),
# #     "DecisionTreeClassifier": DecisionTreeClassifier()
# }


# for _, classifier in classifiers.items():
#     classifier.fit(train_X, train_y)
#     accuracy = cross_val_score(classifier, train_X, train_y, cv=5)
#     print("Classifiers: ", classifier.__class__.__name__, "Has a training score of", round(accuracy.mean(), 2) * 100, "% accuracy score")
#     pred = cross_val_predict(classifier, train_X, train_y, cv=5) #,
#                              #method="decision_function")
#     print(classifier.__class__.__name__, roc_auc_score(train_y, pred))
    


## SVM

In [None]:
from sklearn.svm import SVC
svm_model = SVC()
from sklearn.model_selection import GridSearchCV
svm_hyparam = {"C": np.arange(1,5), "kernel":["linear", "rbf"]}
svm_cv_model = GridSearchCV(svm_model, svm_hyparam, cv=5).fit(train_X, train_y)

In [None]:
svm_cv_model.best_score_

In [None]:
best_params = svm_cv_model.best_params_
print(best_params)

In [7]:
from sklearn.svm import SVC

# predict the result 
# svm = SVC(C = best_params['C'], kernel=best_params['kernel'], probability=True).fit(train_X, train_y)
svm = SVC(C = 3, kernel='rbf', probability=True).fit(train_X, train_y)

In [8]:
svm_y_predict = svm.predict(test_X)

In [10]:
from sklearn.metrics import accuracy_score
print('accuracy_score', accuracy_score(test_y, svm_y_predict))

accuracy_score 0.9830413257961448


In [11]:
from sklearn.metrics import classification_report

print(classification_report(test_y, svm_y_predict))

              precision    recall  f1-score   support

           0       1.00      0.98      0.99     85301
           1       0.08      0.85      0.14       142

    accuracy                           0.98     85443
   macro avg       0.54      0.92      0.57     85443
weighted avg       1.00      0.98      0.99     85443
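
The 0.98 accuracy should be read with the class imbalance in mind. As a quick sanity check, the sketch below rebuilds labels from the support counts in the report above and scores a hypothetical baseline that never flags fraud. The baseline is an illustration only and is not one of the models in this notebook.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Class counts taken from the support column of the report above.
n_normal, n_fraud = 85301, 142
y_true = np.array([0] * n_normal + [1] * n_fraud)

# Hypothetical baseline: predict "not fraud" for every transaction.
y_pred = np.zeros_like(y_true)

baseline_accuracy = accuracy_score(y_true, y_pred)
baseline_recall = recall_score(y_true, y_pred)  # recall for class 1 (fraud)
print(round(baseline_accuracy, 4), baseline_recall)
```

Such a baseline reaches a higher accuracy than the SVM while catching no fraud at all, so the precision and recall for class 1 are the more informative numbers here.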



## Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# l1 = lasso, l2 = ridge; the liblinear solver supports both penalties
log_hyparam = {"C": np.logspace(-5, 5, 6), "penalty": ["l1", "l2"]}
log = LogisticRegression(solver="liblinear")
log_cv = GridSearchCV(log, log_hyparam, cv=10)
log_cv.fit(train_X, train_y)

print(log_cv.best_params_)
print("accuracy ", log_cv.best_score_)
best_params = log_cv.best_params_


In [None]:
# Refit on the full training set with the best hyperparameters.
log = LogisticRegression(C=best_params['C'], penalty=best_params['penalty'], solver="liblinear").fit(train_X, train_y)

In [None]:

log_y_predict = log.predict(test_X)
print('accuracy_score', accuracy_score(test_y, log_y_predict))

In [None]:
from sklearn.metrics import classification_report

print(classification_report(test_y, log_y_predict))

## KNN

In [None]:
from sklearn.neighbors import KNeighborsClassifier

knn_hyparam = {"n_neighbors": [2, 4, 6, 8, 10], "metric": ["euclidean", "manhattan"]}
KNN = KNeighborsClassifier()
KNN_cv = GridSearchCV(KNN, knn_hyparam, cv=10)
KNN_cv.fit(train_X, train_y)

print(KNN_cv.best_params_)
print("accuracy ", KNN_cv.best_score_)
best_params = KNN_cv.best_params_

In [None]:
# Refit on the full training set with the best hyperparameters.
KNN = KNeighborsClassifier(n_neighbors=best_params['n_neighbors'], metric=best_params['metric']).fit(train_X, train_y)

In [None]:
KNN_y_predict = KNN.predict(test_X)
print('accuracy_score', accuracy_score(test_y, KNN_y_predict))

In [None]:
from sklearn.metrics import classification_report

print(classification_report(test_y, KNN_y_predict))

Reference: 
1. https://www.kaggle.com/janiobachmann/credit-fraud-dealing-with-imbalanced-datasets

2. https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html