# 1. Abstract

### Aim
Credit card based online payments have grown rapidly, compelling financial organisations to implement and continuously improve their fraud detection systems. However, credit card fraud datasets are heavily imbalanced, and different types of misclassification errors carry different costs, so these errors need to be controlled and traded off against each other. Classification techniques are a promising way to separate fraudulent from non-fraudulent transactions. Unfortunately, they tend to perform poorly when the minority class is vastly outnumbered by the majority class. In this study, three families of resampling methods (under sampling, over sampling and hybrid techniques) were therefore applied to the credit card dataset to compensate for the rarity of fraud cases. The three resampled datasets were then classified using several classification techniques. Performance was measured by accuracy, precision, recall and area under the curve (AUC). The findings disclosed that, after resampling the dataset, the models were more practicable, gave better performance and were statistically better.

### Goals and objectives: 

•	The goal of this project is to detect fraudulent credit card transactions in anonymised transactional data by labelling each transaction as fraudulent or genuine. Fraud detection involves monitoring the behavior of users to estimate, detect, or avoid undesirable behavior.
•	The objective is to handle the data imbalance problem and to build simple, commonly used machine learning models, such as logistic regression, random forest and possibly others, and compare how they perform on the chosen metrics (precision, recall, etc.) when predicting fraudulent credit card transactions.


### Data description


The [dataset](https://www.kaggle.com/mlg-ulb/creditcardfraud/data) contains two days of credit card transactions made in September 2013 by European cardholders. The dataset is highly unbalanced: fraudulent transactions make up a very small share of the records. The positive class (frauds) accounts for 0.172% (492 frauds out of 284,807 transactions) of all transactions.

Features V1, V2, ... V28 are the principal components obtained with PCA; the only features that have not been transformed with PCA are 'Time' and 'Amount'. Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. Feature 'Amount' is the transaction amount. Feature 'Class' is the target variable, with value 1 in case of fraud and 0 otherwise.


### Metrics

* Accuracy is usually the first metric that comes to mind when assessing a model's performance. However, we must be careful. The data in this case is highly unbalanced, so accuracy alone is not a good metric. A model that always classifies a transaction as non-fraudulent would reach an astonishing accuracy of 99.83% (284,315 out of 284,807) while detecting no fraud at all. So, what is the solution? We should use other metrics to judge whether a model is good or bad.

* The metrics to be used will be the Area Under the ROC curve (also called AUC), and the recall and precision scores obtained from the confusion matrix. We are also using model accuracy.

* The ROC curve is a plot with the true positive rate on the y-axis and false positive rate on the x-axis. The true positive rate answers the question "When the actual classification is positive, how often does the classifier predict positive?" and the false positive rate answers the question "When the actual classification is negative, how often does the classifier incorrectly predict positive?"

* The AUC shows how good the classifier is in separating the classes. The closer to 1, the better is the classifier.

* Precision answers the question "what proportion of positive identifications was actually correct?" and recall answers "what proportion of actual positives was identified correctly?"

* Model accuracy is the overall proportion of correct predictions. We also include it to check whether the accuracy of the model changes after applying the resampling techniques.

With these four metrics we can tell whether the model performance is good or poor.
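
The definitions above can be checked with a few lines of arithmetic. The sketch below is only illustrative: the helper name `classification_metrics` is hypothetical, and the counts describe a classifier that labels every one of the 284,807 transactions as normal.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return accuracy, precision, recall (TPR) and FPR from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "fpr": fpr}


# A classifier that predicts "normal" for every transaction:
# all 284,315 normal rows are true negatives, all 492 frauds are false negatives.
baseline = classification_metrics(tp=0, fp=0, tn=284315, fn=492)
print("accuracy: {:.2%}".format(baseline["accuracy"]))  # accuracy: 99.83%
print("recall:   {:.2%}".format(baseline["recall"]))    # recall:   0.00%
```

The accuracy looks excellent while recall is zero, which is why accuracy is reported here only alongside precision, recall and AUC.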




### Project Flowchart

![image.png](attachment:image.png)

### Methods used

1. Analysis on the imbalanced dataset
2. Undersampling techniques
3. Oversampling techniques
4. Hybrid techniques
5. Comparing results
6. Predictive models, to check how they perform with our dataset (extra research work)

### Conclusion:

1. Credit card fraud has recently attracted a lot of negative attention.
2. We hear a lot of stories about false transactions, duplicate transactions and erroneous transactions.
3. We used a dataset of credit card transactions from the banking sector.
4. The data is unbalanced, and the main analysis involves predicting credit card fraud in the transactional data.
5. It is important that credit card companies are able to recognize fraudulent credit card transactions so that customers are not charged for items that they did not purchase.
6. Machine learning algorithms are usually designed to improve accuracy by reducing the error. Thus, they do not take into account the class distribution / proportion or balance of classes.
7. Machine learning algorithms tend to produce unsatisfactory classifiers when faced with imbalanced datasets.
8. Thus, there is a high probability of misclassifying the minority class compared to the majority class.
9. Our research focuses on different techniques for handling imbalanced datasets to mitigate this problem.

# PART A (Classification techniques on the imbalanced dataset)

### Import Libraries

#### Overview of this section: We used Logistic Regression, Naive Bayes and Random Forest to classify the transactions as fraud or non-fraud. The results obtained in this section are the results on the imbalanced dataset.

In [1]:
# Import basic libraries 
import pandas as pd
from pandas.plotting import scatter_matrix
import numpy as np
import matplotlib.pyplot as plt
import os
from imblearn.over_sampling import ADASYN 
from collections import Counter
import seaborn as sn

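# plot functions (expected to be a project-local helper module, plot_functions.py,
# which must be importable from the notebook's working directory)
import plot_functions as pf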

# scikit packages
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.naive_bayes import BernoulliNB 
from sklearn import metrics

# settings
%matplotlib inline
sn.set_style("dark")
sn.set_palette("colorblind")

ModuleNotFoundError: No module named 'plot_functions'

### Load Data  
The dataset used in this project is freely available at: https://www.kaggle.com/mlg-ulb/creditcardfraud/data

In [None]:
df = pd.read_csv("creditcard.csv")

In [None]:
# View top 5 records
df.head()

In [None]:
# determine the number of records in the dataset
print('The dataset contains {0} rows and {1} columns.'.format(df.shape[0], df.shape[1]))

In [None]:
# check for missing values and data types of the columns
df.info()

### Explore label class

In [None]:
print('Normal transactions count: ', df['Class'].value_counts().values[0])
print('Fraudulent transactions count: ', df['Class'].value_counts().values[1])

* There are 284315 normal transactions and 492 fraudulent transactions, which confirms that the dataset is highly imbalanced.
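
As a hedged alternative to indexing `value_counts().values` by position, the counts can be looked up by class label, and the class shares can be obtained with `normalize=True`. The sketch below uses a tiny made-up `Class` column (995 zeros and 5 ones), not the real data, so that it runs on its own.

```python
import pandas as pd

# Tiny stand-in for df['Class'] (hypothetical values, not the Kaggle data)
labels = pd.Series([0] * 995 + [1] * 5, name="Class")

counts = labels.value_counts()                # counts indexed by class label
shares = labels.value_counts(normalize=True)  # same, as fractions of all rows

# Look up by label (0 = normal, 1 = fraud) instead of by position
n_normal = int(counts.get(0, 0))
n_fraud = int(counts.get(1, 0))

print("Normal transactions count:     {} ({:.3%})".format(n_normal, shares.get(0, 0.0)))
print("Fraudulent transactions count: {} ({:.3%})".format(n_fraud, shares.get(1, 0.0)))
print("Imbalance ratio (normal per fraud): {:.0f}".format(n_normal / n_fraud))
```

Looking up by label keeps the printout correct even if the order of `value_counts()` changes, for example after resampling makes fraud the more frequent class.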

![image.png](attachment:image.png)

### Separate feature data (predictors) from labels

* Before splitting the data into train and test sets, we separate the predictors from the label column.

In [None]:
# feature data (predictors)
X = df.iloc[:, :-1]

# label class
y = df['Class']

### Standardize data
Data standardization is the process of rescaling one or more attributes so that they have a mean value of 0 and a standard deviation of 1.

In [None]:
scaler = StandardScaler()
scaled_X = scaler.fit_transform(X)
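
To make the rescaling concrete, the sketch below checks on made-up data that `StandardScaler` matches the column-wise formula z = (x - mean) / std, where std is the population standard deviation. The array `X_demo` and its two columns are hypothetical and only mimic features on very different scales.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two hypothetical columns on very different scales
X_demo = np.column_stack([
    rng.normal(50_000, 20_000, 200),
    rng.normal(90, 250, 200),
])

scaled_demo = StandardScaler().fit_transform(X_demo)

# Manual standardization; NumPy's default std (ddof=0) is the population std
manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

print(np.allclose(scaled_demo, manual))  # True
print(scaled_demo.mean(axis=0).round(6), scaled_demo.std(axis=0).round(6))
```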

### Partition data into train and test sets

In [None]:
# Partition data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(scaled_X, y, test_size=0.33, random_state=42)
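
One possible refinement, shown here as a sketch on synthetic data rather than as the method used in this notebook: with so few fraud cases, passing `stratify` keeps the fraud share similar in both splits, and fitting the scaler on the training split only avoids using test-set statistics during training. All `*_demo`, `X_tr` and `X_te` names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 3))      # made-up features
y_demo = np.array([1] * 20 + [0] * 980)  # 2% positives

# stratify keeps the class proportions (almost) equal in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42, stratify=y_demo
)

# Learn mean and standard deviation from the training split only,
# then apply the same transformation to the test split
scaler_demo = StandardScaler().fit(X_tr)
X_tr_scaled = scaler_demo.transform(X_tr)
X_te_scaled = scaler_demo.transform(X_te)

print("fraud share in train: {:.3f}, in test: {:.3f}".format(y_tr.mean(), y_te.mean()))
```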

### Train models without resampling methods

* The models to be used are shown below. There are also functions to plot the confusion matrix and the ROC curve for the models.
Three machine learning algorithms, Logistic Regression, Naive Bayes and Random Forest classifiers, were trained using the processed feature data.

In [None]:
# Train LogisticRegression Model
LGR_Classifier = LogisticRegression()
LGR_Classifier.fit(X_train, y_train);

# Train Random Forest Model
RDF_Classifier = RandomForestClassifier(random_state=0)
RDF_Classifier.fit(X_train, y_train);

# Train Bernoulli Naive Bayes Model
BNB_Classifier = BernoulliNB()
BNB_Classifier.fit(X_train, y_train);

In [None]:
# Evaluate models
modlist = [('RandomForest Classifier', RDF_Classifier),('LogisticRegression', LGR_Classifier),
('Naive Bayes Classifier', BNB_Classifier)] 

# (name, fitted model) pairs used by the evaluation and test loops
models = list(modlist)

print()
print('========================== Model Evaluation Results ========================' "\n")  

for i, v in models:
    # 10-fold cross-validation on the training set (default scoring: accuracy)
    scores = cross_val_score(v, X_train, y_train, cv=10)
    # Predict once on the training data and reuse the predictions for every metric
    train_pred = v.predict(X_train)
    accuracy = metrics.accuracy_score(y_train, train_pred)
    confusion_matrix = metrics.confusion_matrix(y_train, train_pred)
    classification = metrics.classification_report(y_train, train_pred)
    print('===== {} ====='.format(i))
    print()
    print ("Cross Validation Mean Score: ", '{:.1f}%'.format(scores.mean() * 100))
    print()
    print ("Model Accuracy: ", '{:.1f}%'.format(accuracy * 100))
    print()
    print("Confusion Matrix:" "\n", confusion_matrix)
    print()
    print("Classification Report:" "\n", classification)
    print()
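
Because accuracy says little on imbalanced data, cross-validation can also report recall, precision and AUC. The sketch below is one way to do this with scikit-learn's `cross_validate` and stratified folds; it runs on synthetic data from `make_classification`, and the names `X_demo`, `y_demo` and `cv_results` are hypothetical rather than part of this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic imbalanced data (about 5% positives) standing in for the real set
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=0
)

# Stratified folds keep the rare class present in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scoring = ["accuracy", "precision", "recall", "roc_auc"]

cv_results = cross_validate(
    LogisticRegression(max_iter=1000), X_demo, y_demo, cv=cv, scoring=scoring
)

for name in scoring:
    print("{:<10} {:.3f}".format(name, cv_results["test_" + name].mean()))
```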

In [None]:
# Test models
classdict = {'normal':0, 'fraudulent':1}
print()
print('========================== Model Test Results ========================' "\n")   

for i, v in models:
    # Predict once on the held-out test set and reuse the predictions
    test_pred = v.predict(X_test)
    accuracy = metrics.accuracy_score(y_test, test_pred)
    confusion_matrix = metrics.confusion_matrix(y_test, test_pred)
    classification = metrics.classification_report(y_test, test_pred)
    print('=== {} ==='.format(i))
    print ("Model Accuracy: ",  '{:.1f}%'.format(accuracy * 100))
    print()
    print("Confusion Matrix:" "\n", confusion_matrix)
    print()
    pf.plot_confusion_matrix(confusion_matrix, classes = list(classdict.keys()), title='Confusion Matrix Plot', cmap=plt.cm.summer)
    print() 
    print("Classification Report:" "\n", classification) 
    print() 

print('============================= ROC Curve ===============================' "\n")      
pf.plot_roc_auc(arg1=models, arg2=X_test, arg3=y_test)
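
The `plot_functions` module imported as `pf` is not included in this notebook. The following is a minimal sketch of what its two helpers might look like, written only to match the call signatures used above; the function bodies are assumptions, not the original implementation. It assumes every model exposes `predict_proba`, which holds for the three classifiers trained here.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn import metrics


def plot_confusion_matrix(cm, classes, title="Confusion Matrix", cmap=plt.cm.Blues):
    """Draw a confusion matrix (NumPy array) as an annotated heat map; return the Axes."""
    fig, ax = plt.subplots()
    image = ax.imshow(cm, interpolation="nearest", cmap=cmap)
    fig.colorbar(image, ax=ax)
    ticks = np.arange(len(classes))
    ax.set_xticks(ticks)
    ax.set_xticklabels(classes)
    ax.set_yticks(ticks)
    ax.set_yticklabels(classes)
    ax.set_xlabel("Predicted label")
    ax.set_ylabel("True label")
    ax.set_title(title)
    # Write the count into each cell
    for row in range(cm.shape[0]):
        for col in range(cm.shape[1]):
            ax.text(col, row, str(cm[row, col]), ha="center", va="center")
    return ax


def plot_roc_auc(arg1, arg2, arg3):
    """Plot one ROC curve per (name, fitted_model) pair in arg1.

    arg2 is the test feature matrix and arg3 the true test labels.
    Returns a dict mapping each model name to its AUC.
    """
    fig, ax = plt.subplots()
    auc_by_model = {}
    for name, model in arg1:
        # Probability of the positive (fraud) class
        scores = model.predict_proba(arg2)[:, 1]
        fpr, tpr, _ = metrics.roc_curve(arg3, scores)
        auc_by_model[name] = metrics.auc(fpr, tpr)
        ax.plot(fpr, tpr, label="{} (AUC = {:.3f})".format(name, auc_by_model[name]))
    ax.plot([0, 1], [0, 1], linestyle="--", color="grey")  # chance level
    ax.set_xlabel("False positive rate")
    ax.set_ylabel("True positive rate")
    ax.set_title("ROC curve")
    ax.legend(loc="lower right")
    return auc_by_model
```

Saved as `plot_functions.py` next to the notebook, a module along these lines would let `import plot_functions as pf` succeed; returning the AUC values is a convenience added in this sketch.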

### Conclusion 

1. Machine Learning Algorithms are usually designed to improve accuracy by reducing the error. Thus, they do not take into account the class distribution / proportion or balance of classes.

2. Machine learning algorithms tend to produce unsatisfactory classifiers when faced with imbalanced datasets, which is confirmed here.
3. These algorithms tend to show a bias for the majority class, treating the minority class as noise in the dataset.
4. The models reach an accuracy of about 99%, which seems impressive, but is it really? The minority class is largely ignored in this case, and this can prove expensive in some classification problems, such as credit card fraud, which can cost individuals and businesses lots of money.
5. Recall for Logistic Regression, Random Forest and Naive Bayes is 63%, 76% and 67% respectively. Recall is the metric that measures how many of the fraudulent observations we identify. Values this low mean that we did not do a good job of classifying the fraudulent data.



# PART B: Data sampling techniques


In later notebooks we implemented resampling techniques to balance the data. Three widely used families of resampling methods are covered in this study: under sampling, over sampling and hybrid sampling. For under sampling, random under sampling (RUS), Cluster Centroids and AllKNN were chosen. For over sampling, random over sampling (ROS), SMOTE and ADASYN were chosen because they are widely used. ROS is an intuitive way of balancing data, whereas SMOTE is more complex, creating synthetic samples using K-Nearest Neighbours (KNN). For hybrid sampling we used SMOTEENN and SMOTETomek.
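
To illustrate how SMOTE creates synthetic samples from nearest neighbours, here is a simplified sketch of the interpolation step. It is not the imbalanced-learn implementation: the function `smote_sketch` and its parameters are hypothetical, and it only generates new minority rows on the line segments between a minority point and one of its k nearest minority neighbours.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def smote_sketch(X_minority, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    # Ask for k + 1 neighbours: for distinct points, the first one returned
    # is the query point itself, which is dropped with [:, 1:]
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    neighbours = nn.kneighbors(X_minority, return_distance=False)[:, 1:]

    base = rng.integers(0, len(X_minority), size=n_new)         # random minority rows
    picked = neighbours[base, rng.integers(0, k, size=n_new)]   # one neighbour per row
    gap = rng.random((n_new, 1))                                 # position on the segment
    return X_minority[base] + gap * (X_minority[picked] - X_minority[base])


# Made-up minority class: 20 points in two dimensions
rng = np.random.default_rng(1)
X_min = rng.normal(size=(20, 2))
X_new = smote_sketch(X_min, n_new=30)
print(X_new.shape)  # (30, 2)
```

Every synthetic point lies between two existing minority points, which is what distinguishes SMOTE from ROS, where existing rows are simply duplicated.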

![image.png](attachment:image.png)

* The credit card dataset poses a binary classification task: each transaction is classified as either non-fraud (0) or fraud (1). After the data have been resampled, classifiers need to be trained on them to evaluate the resampling methods. Thus, in this study, six different classification techniques were explored: Naïve Bayes (NB), Logistic Regression (LR), Random Forest (RF), Decision Tree (DT), KNN and Ensemble Learning.

# 4. CONTRIBUTION

By us : 70%

We tried to replicate the code of one of the sampling techniques in our own way. All the sampling technique code included in this project comes from our own research.

By External Source: 30%

# 5. CITATION

The dataset is taken from the Kaggle website: https://www.kaggle.com/mlg-ulb/creditcardfraud

### Code sources

#### Imblearn provides some great functionality for dealing with imbalanced data.

1. For Random Under sampling: 
https://imbalanced-learn.org/en/stable/generated/imblearn.under_sampling.RandomUnderSampler.html
2. For Random Oversampling: 
https://imbalanced-learn.readthedocs.io/en/stable/over_sampling.html
3. For SMOTE:
https://imbalanced-learn.readthedocs.io/en/stable/generated/imblearn.over_sampling.SMOTE.html
4. For SMOTE-ENN: 
https://imbalanced-learn.readthedocs.io/en/stable/generated/imblearn.combine.SMOTEENN.html
5. For ADASYN: 
https://imbalanced-learn.org/en/stable/generated/imblearn.over_sampling.ADASYN.html

#### For most of the classification algorithms

* https://scikit-learn.org/stable/supervised_learning.html
* http://scikitlearn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html
* https://www.kaggle.com/randyrose2017/using-scikit-learn-and-keras-for-fraud-detection
* We use t- SNE for visualizing the data and tensor flow to build the predictive model.
https://www.datascience.com/blog/fraud-detection-with-tensorflow



#### Research papers

1. Anderson M. (2008). 'From Subprime Mortgages to Subprime Credit Cards'. Communities and Banking, Federal Reserve Bank of Boston, pp. 21-23.
2. Chan P.K. et al. (1999). 'Distributed Data Mining in Credit Card Fraud Detection'. IEEE Intelligent Systems, pp. 67-74.
3. Chang C. & Chang S. (2010). 'The Design of E-Traveler's Check with Efficiency and Mutual Authentication'. Journal of Networks, Vol. 5, No. 3, pp. 275-282.
4. Delamaire et al. (2009). 'Credit Card Fraud Detection Techniques: A Review'. Banks and Bank Systems, Vol. 4, Issue 2, pp. 57-68.
* http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.96.9248&rep=rep1&type=pdf
* https://thesai.org/Downloads/Volume9No11/Paper_55-Handling_Class_Imbalance_in_Credit_Card_Fraud.pdf
* https://arxiv.org/pdf/1608.06048.pdf
* https://www.hindawi.com/journals/complexity/2018/5764370/
* https://pdfs.semanticscholar.org/0be1/e1f748845244bf8ff4041bb5e7d35b9057ee.pdf?_ga=2.64650601.731729570.1553044261-1175306196.1553044261
* https://www.researchgate.net/publication/326986162_Credit_Card_Fraud_Detection_Using_Machine_Learning_As_Data_Mining_Technique




# 6. COPYRIGHT

Copyright 2019 Monisha Vodnala Fairy DMonte Harshitha Sanikommu

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.