**About the data**

The data for this article can be found here. The dataset contains real bank transactions made by European cardholders in 2013. For confidentiality reasons, the original variables are not shared; most of them have been replaced by PCA-transformed components. The file has 31 columns: Time, the 28 PCA components V1 to V28, Amount, and the final Class column. Once Time is dropped later on, 29 feature columns and 1 class column remain.

**Importing Necessary Libraries**

It is good practice to import all the necessary libraries in one place so that the list can be modified quickly.
For this credit card data, the features are already PCA-transformed, so we will not perform feature selection again. For other datasets, techniques such as RFE, RFECV, SelectKBest, and VIF scores are recommended for finding the best features for your model.
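For datasets whose features are not already PCA components, a feature selection step might look like the sketch below. It runs on synthetic data from `make_classification` (an assumption for illustration, not the credit card data) and keeps 5 features, once with `SelectKBest` and once with `RFE`. The number of features to keep is arbitrary here.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (hypothetical): 500 rows, 10 features, 4 informative.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=4, n_redundant=2, random_state=0
)

# Univariate selection: keep the 5 features with the highest ANOVA F-score.
kbest = SelectKBest(score_func=f_classif, k=5).fit(X_demo, y_demo)
kbest_idx = kbest.get_support(indices=True)

# Recursive feature elimination: repeatedly drop the weakest feature
# according to the coefficients of a logistic regression.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5).fit(X_demo, y_demo)
rfe_idx = rfe.get_support(indices=True)

print("SelectKBest kept columns:", kbest_idx)
print("RFE kept columns:", rfe_idx)
```

The two methods do not have to agree: `SelectKBest` scores each feature on its own, while `RFE` judges features jointly through the fitted model.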

In [1]:
#Packages related to general operating system & warnings
import os 
import warnings
warnings.filterwarnings('ignore')

#Packages related to data importing, manipulation, exploratory data analysis, data understanding
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
from termcolor import colored as cl # text customization

#Packages related to data visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

#Setting plot sizes and type of plot
plt.rc("font", size=14)
plt.rcParams['axes.grid'] = True
plt.figure(figsize=(6,3))
plt.gray()
from matplotlib.backends.backend_pdf import PdfPages
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn import metrics
from sklearn.impute import MissingIndicator, SimpleImputer
from sklearn.preprocessing import  PolynomialFeatures, KBinsDiscretizer, FunctionTransformer
from sklearn.preprocessing import StandardScaler, MinMaxScaler, MaxAbsScaler
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, LabelBinarizer, OrdinalEncoder
import statsmodels.formula.api as smf
import statsmodels.tsa as tsa
from sklearn.linear_model import LogisticRegression, LinearRegression, ElasticNet, Lasso, Ridge
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor, export_graphviz, export_text
from sklearn.ensemble import BaggingClassifier, BaggingRegressor,RandomForestClassifier,RandomForestRegressor
from sklearn.ensemble import GradientBoostingClassifier,GradientBoostingRegressor, AdaBoostClassifier, AdaBoostRegressor 
from sklearn.svm import LinearSVC, LinearSVR, SVC, SVR
from xgboost import XGBClassifier
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

<Figure size 600x300 with 0 Axes>

**Importing Dataset**

Importing the dataset is simple: pandas can read the CSV file directly.

In [2]:
data=pd.read_csv("creditcardpy.csv")

**Data Processing & Understanding**

The main thing you will notice about this data is that it is heavily imbalanced towards one class. That is plausible for this kind of data: many banks have adopted multiple security mechanisms, which makes fraud harder to commit.

Still, when there is a vulnerability in the system, the chance of such activity increases.

That is why the majority of transactions in our dataset are normal and only a small percentage are fraudulent.

Let’s check the transaction distribution.

In [3]:
Total_transactions = len(data)
normal = len(data[data.Class == 0])
fraudulent = len(data[data.Class == 1])
# Share of fraudulent transactions among all transactions
fraud_percentage = round(fraudulent/Total_transactions*100, 2)
print(cl('Total number of Transactions are {}'.format(Total_transactions), attrs = ['bold']))
print(cl('Number of Normal Transactions are {}'.format(normal), attrs = ['bold']))
print(cl('Number of fraudulent Transactions are {}'.format(fraudulent), attrs = ['bold']))
print(cl('Percentage of fraud Transactions is {}'.format(fraud_percentage), attrs = ['bold']))

Total number of Transactions are 284807
Number of Normal Transactions are 284315
Number of fraudulent Transactions are 492
Percentage of fraud Transactions is 0.17


Only 0.17% of transactions are fraudulent.
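The same kind of class breakdown can be obtained in one step with `value_counts`. The sketch below uses a small hypothetical `Class` column (995 normal rows, 5 fraudulent) rather than the real file, so the numbers are only illustrative.

```python
import pandas as pd

# Hypothetical toy labels: 0 = normal, 1 = fraudulent.
toy = pd.DataFrame({"Class": [0] * 995 + [1] * 5})

counts = toy["Class"].value_counts()                 # absolute counts per class
shares = toy["Class"].value_counts(normalize=True)   # fraction of all rows per class

fraud_share_pct = round(shares.loc[1] * 100, 2)
print(counts.to_dict())
print("Fraud share (%):", fraud_share_pct)
```

With `normalize=True` the share is computed against the total number of rows, which is the denominator a percentage of all transactions needs.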

We can also check for null values using the following line of code.

In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   Time    284807 non-null  float64
 1   V1      284807 non-null  float64
 2   V2      284807 non-null  float64
 3   V3      284807 non-null  float64
 4   V4      284807 non-null  float64
 5   V5      284807 non-null  float64
 6   V6      284807 non-null  float64
 7   V7      284807 non-null  float64
 8   V8      284807 non-null  float64
 9   V9      284807 non-null  float64
 10  V10     284807 non-null  float64
 11  V11     284807 non-null  float64
 12  V12     284807 non-null  float64
 13  V13     284807 non-null  float64
 14  V14     284807 non-null  float64
 15  V15     284807 non-null  float64
 16  V16     284807 non-null  float64
 17  V17     284807 non-null  float64
 18  V18     284807 non-null  float64
 19  V19     284807 non-null  float64
 20  V20     284807 non-null  float64
 21  V21     28

According to the non-null counts per column, there are no null values. Feature selection is also not needed for this use case, although you can still try feature selection techniques to check whether they improve the results.

In our data, 28 features are PCA-transformed, but Amount is in its original form. Checking the minimum and maximum of Amount shows a huge range, which can distort our results.

In [5]:
min(data.Amount),max(data.Amount)

(0.0, 25691.16)

In such a case, it is good practice to scale the variable. We can use a standard scaler to fix this.

In [6]:
sc = StandardScaler()
amount = data['Amount'].values
data['Amount'] = sc.fit_transform(amount.reshape(-1, 1))
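As a rough illustration of what the standard scaler does, the sketch below applies it to a handful of hypothetical amounts and compares the result with a manual z-score. The `amounts` array is made up; only the largest value echoes the maximum seen above.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical amounts with one large outlier, similar in spirit to the Amount column.
amounts = np.array([0.0, 10.0, 25.0, 100.0, 25691.16]).reshape(-1, 1)

scaled = StandardScaler().fit_transform(amounts)

# Manual z-score: subtract the mean, divide by the population standard deviation (ddof=0).
manual = (amounts - amounts.mean()) / amounts.std()

print("Matches manual z-score:", np.allclose(scaled, manual))
print("Mean and std after scaling:", scaled.mean().round(6), scaled.std().round(6))
```

In a stricter setup the scaler would be fitted on the training split only and then applied to the test split, so that no test-set statistics leak into training.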

There is one more variable, Time, which could be an external deciding factor, but we can drop it for our modelling process.

In [7]:
data.drop(['Time'], axis=1, inplace=True)

We can also check for duplicate transactions. Before removing duplicates, we have 284807 transactions in our data. Let’s remove the duplicates and observe the change.

In [8]:
data.shape

(284807, 30)

Run the line of code below to remove any duplicates.

In [9]:
data.drop_duplicates(inplace=True)

Let’s now check the count again.

In [10]:
data.shape

(275663, 30)

So, we had 9144 duplicate rows (284807 - 275663 = 9144). These are rows that are identical in every remaining column after Time was dropped.

Here we go! We now have properly scaled data with no duplicates and no missing values. Let’s now split it for model building.
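The number of duplicates can also be counted before dropping them, using `duplicated()`. The sketch below shows this on a hypothetical toy frame; the column names mirror the dataset but the values are made up.

```python
import pandas as pd

# Hypothetical toy frame: rows 0 and 2 are identical, as are rows 1 and 3.
toy = pd.DataFrame({
    "V1": [0.1, 0.5, 0.1, 0.5, 0.9],
    "Amount": [10.0, 20.0, 10.0, 20.0, 30.0],
    "Class": [0, 0, 0, 0, 1],
})

n_duplicates = int(toy.duplicated().sum())   # rows that repeat an earlier row
deduped = toy.drop_duplicates()              # keeps the first occurrence of each row

print("Duplicates found:", n_duplicates)
print("Rows before and after:", len(toy), len(deduped))
```

Counting first makes it easy to confirm that the drop removed exactly as many rows as expected.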

**Train & Test Split**

Before splitting into train and test sets, we need to define the independent and dependent variables. The independent variables (the features) are conventionally called X, and the dependent variable (the target) is called y.

In [11]:
X = data.drop('Class', axis = 1).values
y = data['Class'].values

Now, let’s split the data into train and test sets.

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 1)

That’s it. We now have two different datasets: the train data will be used for training our model, and the unseen test data will be used for testing.
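With classes this imbalanced, a random split can leave the test set with noticeably more or fewer fraud cases than expected. One common option, not used in this article, is the `stratify` argument of `train_test_split`. The sketch below shows it on hypothetical labels with 12 positives out of 1000 rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 1000 rows, 12 of them positive (1.2%).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 12 + [0] * 988)

# stratify=y_demo keeps the class ratio the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=1, stratify=y_demo
)

print("Positives in train:", int(y_tr.sum()), "of", len(y_tr))
print("Positives in test:", int(y_te.sum()), "of", len(y_te))
```

Both splits end up with the same 1.2% share of positives, which makes scores on the test set easier to interpret.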

**Model Building**

We will try different machine learning models one by one. Defining a model is easy: a single line of code can define it and, in the same way, a single line of code can fit it on our data.

We can also tune these models by searching over different parameter values. But if the accuracy is already good with little parameter tuning, there is no need to make things complex.
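If tuning is wanted, `GridSearchCV` (already imported above) is one way to do it. The following is a minimal sketch on synthetic data, not the credit card data; the parameter grid and the choice of `scoring="f1"` are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (hypothetical): roughly 5% positives.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=0
)

# Small illustrative grid; the values are assumptions, not tuned recommendations.
param_grid = {"max_depth": [2, 4, 6], "criterion": ["gini", "entropy"]}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    scoring="f1",   # F1 is more informative than accuracy on imbalanced classes
    cv=3,
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best cross-validated F1:", round(search.best_score_, 3))
```

The fitted `search` object can then be used like any other model, since it refits the best parameter combination on the full training data by default.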

***Decision Tree***

In [13]:
DT = DecisionTreeClassifier(max_depth = 4, criterion = 'entropy')
DT.fit(X_train, y_train)
dt_yhat = DT.predict(X_test)

Let’s check the accuracy of our decision tree model.

In [14]:
print('Accuracy score of the Decision Tree model is {}'.format(accuracy_score(y_test, dt_yhat)))

Accuracy score of the Decision Tree model is 0.9991293748911718


Checking F1-Score for the decision tree model.

In [15]:
print('F1 score of the Decision Tree model is {}'.format(f1_score(y_test, dt_yhat)))

F1 score of the Decision Tree model is 0.7435897435897436


Checking the confusion matrix:

In [16]:
confusion_matrix(y_test, dt_yhat, labels = [0, 1])

array([[68769,    19],
       [   41,    87]], dtype=int64)

Here, the rows are the actual classes and the columns are the predicted classes, in the order [0, 1]: the first row holds the normal transactions and the second row the fraudulent ones. Out of 68769 + 19 = 68788 normal transactions, 68769 were correctly classified as normal and 19 were falsely flagged as fraudulent. Out of 41 + 87 = 128 fraudulent transactions, 87 were correctly detected and 41 were falsely classified as normal.
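The accuracy and F1 values printed earlier can be reproduced by hand from this confusion matrix. The sketch below treats fraud (class 1) as the positive class and also computes, as an assumption-free baseline, the accuracy of a model that labels every test transaction as normal.

```python
import numpy as np

# Confusion matrix reported above (rows: actual 0/1, columns: predicted 0/1).
cm = np.array([[68769, 19],
               [41, 87]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)          # share of flagged transactions that are really fraud
recall = tp / (tp + fn)             # share of fraud cases that were caught
f1 = 2 * precision * recall / (precision + recall)

# Baseline: a model that labels every transaction as normal.
baseline_accuracy = (tn + fp) / cm.sum()

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
print(f"all-normal baseline accuracy={baseline_accuracy:.4f}")
```

The baseline comes out only slightly below the model's accuracy, which is why the F1 score is the more useful number for comparing models on this data.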

Let’s now try different models and check their performance.

***K-Nearest Neighbors***

In [17]:
n = 7
KNN = KNeighborsClassifier(n_neighbors = n)
KNN.fit(X_train, y_train)
knn_yhat = KNN.predict(X_test)

Let’s check the accuracy of our K-Nearest Neighbors model.

In [18]:
print('Accuracy score of the K-Nearest Neighbors model is {}'.format(accuracy_score(y_test, knn_yhat)))

Accuracy score of the K-Nearest Neighbors model is 0.999288989494457


Checking F1-Score for the K-Nearest Neighbors model.

In [19]:
print('F1 score of the K-Nearest Neighbors model is {}'.format(f1_score(y_test, knn_yhat)))

F1 score of the K-Nearest Neighbors model is 0.7949790794979079


***Logistic Regression***

In [20]:
lr = LogisticRegression()
lr.fit(X_train, y_train)
lr_yhat = lr.predict(X_test)

Let’s check the accuracy of our Logistic Regression model.

In [21]:
print('Accuracy score of the Logistic Regression model is {}'.format(accuracy_score(y_test, lr_yhat)))

Accuracy score of the Logistic Regression model is 0.9989552498694062


Checking F1-Score for the Logistic Regression model.

In [22]:
print('F1 score of the Logistic Regression model is {}'.format(f1_score(y_test, lr_yhat)))

F1 score of the Logistic Regression model is 0.6666666666666666


***Support Vector Machines***

In [23]:
svm = SVC()
svm.fit(X_train, y_train)
svm_yhat = svm.predict(X_test)

Let’s check the accuracy of our Support Vector Machines model.

In [24]:
print('Accuracy score of the Support Vector Machines model is {}'.format(accuracy_score(y_test, svm_yhat)))

Accuracy score of the Support Vector Machines model is 0.999318010331418


Checking F1-Score for the Support Vector Machines model.

In [25]:
print('F1 score of the Support Vector Machines model is {}'.format(f1_score(y_test, svm_yhat)))

F1 score of the Support Vector Machines model is 0.7813953488372093


***Random Forest***

In [26]:
rf = RandomForestClassifier(max_depth = 4)
rf.fit(X_train, y_train)
rf_yhat = rf.predict(X_test)

Let’s check the accuracy of our Random Forest model.

In [27]:
print('Accuracy score of the Random Forest model is {}'.format(accuracy_score(y_test, rf_yhat)))

Accuracy score of the Random Forest model is 0.9991293748911718


Checking F1-Score for the Random Forest model.

In [28]:
print('F1 score of the Random Forest model is {}'.format(f1_score(y_test, rf_yhat)))

F1 score of the Random Forest model is 0.724770642201835


***XGBoost***

In [29]:
xgb = XGBClassifier(max_depth = 4)
xgb.fit(X_train, y_train)
xgb_yhat = xgb.predict(X_test)



Let’s check the accuracy of our XGBoost model.

In [30]:
print('Accuracy score of the XGBoost model is {}'.format(accuracy_score(y_test, xgb_yhat)))

Accuracy score of the XGBoost model is 0.999506645771664


Checking F1-Score for the XGBoost model.

In [31]:
print('F1 score of the XGBoost model is {}'.format(f1_score(y_test, xgb_yhat)))

F1 score of the XGBoost model is 0.8495575221238937


**Conclusion**

Well, congratulations! We just reached 99.95% accuracy in our credit card fraud detection. This number should not be surprising, as our data is heavily imbalanced towards one class: with so few fraudulent transactions, even a model that labels everything as normal would score above 99%. That is why the F1 score and the confusion matrix are the more telling checks here. The confusion matrix on the unseen test set gives no obvious sign of overfitting, although comparing train and test scores would be needed to confirm that.

Finally, based on both accuracy and F1 score, XGBoost is the winner for our case. The only catch is the data we received for model training: the features are PCA-transformed versions of the originals. If the actual features follow a similar pattern, then we are doing great!
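To compare the models side by side, the scores printed in the sections above can be collected into one table. The sketch below copies those values by hand; the `scores` and `ranking` names are only illustrative.

```python
import pandas as pd

# Accuracy and F1 values copied from the outputs reported in this article.
scores = pd.DataFrame(
    {
        "model": ["Decision Tree", "K-Nearest Neighbors", "Logistic Regression",
                  "Support Vector Machines", "Random Forest", "XGBoost"],
        "accuracy": [0.9991293748911718, 0.999288989494457, 0.9989552498694062,
                     0.999318010331418, 0.9991293748911718, 0.999506645771664],
        "f1": [0.7435897435897436, 0.7949790794979079, 0.6666666666666666,
               0.7813953488372093, 0.724770642201835, 0.8495575221238937],
    }
)

# Rank by F1, which separates the models more clearly than accuracy does.
ranking = scores.sort_values("f1", ascending=False).reset_index(drop=True)
print(ranking.round(4).to_string(index=False))
```

All six accuracies agree to within about 0.06 percentage points, while the F1 scores range from roughly 0.67 to 0.85, so the ranking is driven almost entirely by how well each model handles the fraud class.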