<a href="https://colab.research.google.com/github/KZoc/MachineLearning-Projects/blob/main/Fraud_Detection/Decision_Tree_Fraud_Detection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1 - Description, Initialization and Importing Libraries

## Data description:
- The dataset contains only numerical input variables, which are the result of a PCA transformation. Due to confidentiality issues, the original features and further background information about the data are not available.


- Features V1, V2, … V28 are the principal components obtained with PCA. The only features that have not been transformed with PCA are 'Time' and 'Amount'.

- Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset.

- Feature 'Amount' is the transaction amount. This feature can be used, for instance, for example-dependent cost-sensitive learning.

- Feature 'Class' is the response variable and it takes value 1 in case of fraud and 0 otherwise.
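
As a hedged illustration of the cost-sensitive idea mentioned for 'Amount', per-row weights can be passed to a classifier through `sample_weight`. The sketch below uses a tiny made-up frame, not the real data, and a hypothetical weighting rule (a fraud row weighs its amount, a valid row weighs 1). The rule is an assumption for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Tiny synthetic stand-in for the real data (all values are made up).
toy = pd.DataFrame({
    "V1":     [0.1, -1.2, 0.3, 2.5, -0.7, 1.9],
    "Amount": [10.0, 250.0, 5.0, 900.0, 20.0, 40.0],
    "Class":  [0, 1, 0, 1, 0, 0],
})

# Hypothetical rule: a fraud row is weighted by its amount, a valid row by 1.
sample_weight = np.where(toy["Class"] == 1, toy["Amount"], 1.0)

# DecisionTreeClassifier.fit accepts per-row weights via sample_weight.
clf = DecisionTreeClassifier(random_state=0)
clf.fit(toy[["V1", "Amount"]], toy["Class"], sample_weight=sample_weight)
print(sample_weight)
```

With weights like these, misclassifying a large fraudulent transaction costs the tree more during training than misclassifying a small one.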


## Kaggle link to the dataset and info:
https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
# Importing libraries

import pandas as pd


In [3]:
dt = pd.read_csv('/content/drive/MyDrive/Alura/Machine Learning/Curso_DT-Fraud-Detection/creditcard.csv')

# 2 - Exploring the data

In [4]:
dt.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


In [5]:
# Verifying dataset size

dt.shape

(284807, 31)

In [6]:
# Counting null values in each column of the dataset

print(dt.isna().sum())

Time      0
V1        0
V2        0
V3        0
V4        0
V5        0
V6        0
V7        0
V8        0
V9        0
V10       0
V11       0
V12       0
V13       0
V14       0
V15       0
V16       0
V17       0
V18       0
V19       0
V20       0
V21       0
V22       0
V23       0
V24       0
V25       0
V26       0
V27       0
V28       0
Amount    0
Class     0
dtype: int64


In [7]:
dt.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   Time    284807 non-null  float64
 1   V1      284807 non-null  float64
 2   V2      284807 non-null  float64
 3   V3      284807 non-null  float64
 4   V4      284807 non-null  float64
 5   V5      284807 non-null  float64
 6   V6      284807 non-null  float64
 7   V7      284807 non-null  float64
 8   V8      284807 non-null  float64
 9   V9      284807 non-null  float64
 10  V10     284807 non-null  float64
 11  V11     284807 non-null  float64
 12  V12     284807 non-null  float64
 13  V13     284807 non-null  float64
 14  V14     284807 non-null  float64
 15  V15     284807 non-null  float64
 16  V16     284807 non-null  float64
 17  V17     284807 non-null  float64
 18  V18     284807 non-null  float64
 19  V19     284807 non-null  float64
 20  V20     284807 non-null  float64
 21  V21     28

In [8]:
# Verifying numbers

n_fraud = dt['Class'].sum()
n_valid = dt.shape[0] - n_fraud
fraud_percent = n_fraud / dt.shape[0]
valid_percent = n_valid / dt.shape[0]


print('Numbers of Fraud transactions: ', n_fraud,  " -> ", "%.2f" %(fraud_percent * 100), '%')
print('Numbers of Valid transactions: ', n_valid,  " -> ", "%.2f" %(valid_percent * 100), '%')

Numbers of Fraud transactions:  492  ->  0.17 %
Numbers of Valid transactions:  284315  ->  99.83 %
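
With only 0.17% fraud, a classifier that always predicts "valid" would already reach 99.83% accuracy, so accuracy alone says little here. The same class-balance check can also be written with `value_counts`. The sketch below runs on a small made-up label column standing in for `dt['Class']`.

```python
import pandas as pd

# Small made-up label column standing in for dt['Class'].
labels = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 0, 1], name="Class")

counts = labels.value_counts()                # absolute count per class
shares = labels.value_counts(normalize=True)  # fractions that sum to 1

print(counts)
print((shares * 100).round(2))
```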


# 3 - Creating Decision Tree

In [9]:
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn import tree

In [10]:
X = dt.drop('Class', axis=1).values
y = dt['Class'].values

In [11]:
validator = StratifiedShuffleSplit(n_splits=1, test_size=0.1, random_state=0)

In [12]:
for train_id, test_id in validator.split(X, y):
  X_train, X_test = X[train_id], X[test_id]
  y_train, y_test = y[train_id], y[test_id]

In [13]:
dt_classificator = tree.DecisionTreeClassifier()
model_dt = dt_classificator.fit(X_train, y_train)
y_pred = model_dt.predict(X_test)

## 3.1 - Creating Functions for Optimization

In [14]:
# Function to select fields

def xy_selector(data):
  X = data.drop('Class', axis=1).values
  y = data['Class'].values
  return X, y

In [15]:
# Function to split the data into train and test sets

def exec_validator(X, y):
  validator = StratifiedShuffleSplit(n_splits=1, test_size=0.1, random_state=0)
  for train_id, test_id in validator.split(X, y):
    X_train, X_test = X[train_id], X[test_id]
    y_train, y_test = y[train_id], y[test_id]
  return X_train, X_test, y_train, y_test
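
To see what the stratified split buys us, the sketch below checks that the positive rate is the same in both partitions. It uses synthetic labels (1000 rows, 2% positives), not the real data, and replaces the single-iteration `for` loop with `next(...)`, which is an equivalent way to take the only split.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic labels: 1000 rows, 2% positives (made up, not the real data).
y_toy = np.array([1] * 20 + [0] * 980)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.1, random_state=0)

# n_splits=1, so next() returns the single (train, test) index pair.
train_idx, test_idx = next(splitter.split(X_toy, y_toy))

print("train positive rate:", y_toy[train_idx].mean())
print("test positive rate: ", y_toy[test_idx].mean())
```

With a rare class such as fraud, a non-stratified split could leave the test set with very few positives or none, which would make precision and recall unstable.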

In [16]:
%%time
# Function to fit a classifier and return its predictions for the test set

def exec_classifier(classificator, X_train, X_test, y_train):
  model_dt = classificator.fit(X_train, y_train)
  y_predict = model_dt.predict(X_test)
  return y_predict

CPU times: user 4 µs, sys: 0 ns, total: 4 µs
Wall time: 9.06 µs


In [17]:
# Pipeline - From dataframe to predicted values

dt_classificator = tree.DecisionTreeClassifier(random_state=0)

X, y = xy_selector(dt)

X_train, X_test, y_train, y_test = exec_validator(X, y)

y_predict = exec_classifier(dt_classificator, X_train, X_test, y_train)

In [18]:
# Saving DecisionTree figure

import matplotlib.pyplot as plt


def save_tree(classificator, name):
  plt.figure(figsize=(200,100))
  tree.plot_tree(classificator, filled=True, fontsize=14)
  plt.savefig(name)
  plt.close()

In [19]:
save_tree(dt_classificator, 'DecisionTree-1.png')
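
A fully grown tree on this many rows produces a very large figure. As a lighter alternative, the top levels can be printed as text with `sklearn.tree.export_text`. The sketch below trains on synthetic data, and the feature names `V1` to `V5` are placeholders chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data standing in for X_train, y_train.
X_toy, y_toy = make_classification(n_samples=500, n_features=5, random_state=0)
clf = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)

# Placeholder feature names; max_depth truncates the printed rules only,
# not the fitted tree itself.
names = [f"V{i}" for i in range(1, 6)]
summary = export_text(clf, feature_names=names, max_depth=2)
print(summary)
```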

# 4 - Verifying Metrics

In [20]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score

In [21]:
def validate_tree(y_test, y_predict):
  print('Accuracy = ', "%.3f" %accuracy_score(y_test, y_predict))
  print('Precision = ', "%.3f" %precision_score(y_test, y_predict))
  print('Recall = ', "%.3f" %recall_score(y_test, y_predict))


  print("\n Confusion Matrix \n", confusion_matrix(y_test, y_predict))

In [22]:
validate_tree(y_test, y_predict)

Accuracy =  0.999
Precision =  0.735
Recall =  0.735

 Confusion Matrix 
 [[28419    13]
 [   13    36]]
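
The three metrics can be recomputed by hand from the confusion matrix above, which makes their meaning explicit. In scikit-learn's layout, rows are the true classes and columns the predicted classes, so the matrix unpacks as TN, FP, FN, TP.

```python
import numpy as np

# Confusion matrix reported above: rows = true class, columns = predicted class.
cm = np.array([[28419, 13],
               [13, 36]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)  # of the predicted frauds, how many were real
recall = tp / (tp + fn)     # of the real frauds, how many were caught

print(f"Accuracy = {accuracy:.3f}, Precision = {precision:.3f}, Recall = {recall:.3f}")
```

Here 13 false positives and 13 false negatives happen to be equal, which is why precision and recall coincide at 36/49.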


# 5 - Configuring Hyperparameters

## 5.1 - Checking tree

In [23]:
print('Tree depth = ', dt_classificator.get_depth())

Tree depth =  21


## 1st Modification:
- max_depth = 10

In [24]:
dt_classificator = tree.DecisionTreeClassifier(max_depth=10, random_state=0)

y_predict = exec_classifier(dt_classificator, X_train, X_test, y_train)

In [25]:
validate_tree(y_test, y_predict)

Accuracy =  0.999
Precision =  0.947
Recall =  0.735

 Confusion Matrix 
 [[28430     2]
 [   13    36]]


In [26]:
print('Tree depth = ', dt_classificator.get_depth())

Tree depth =  10


## 2nd Modification:
- max_depth = 10
- min_samples_leaf = 10

In [27]:
dt_classificator = tree.DecisionTreeClassifier(max_depth=10, min_samples_leaf=10, random_state=0)

y_predict = exec_classifier(dt_classificator, X_train, X_test, y_train)

In [28]:
validate_tree(y_test, y_predict)

Accuracy =  0.999
Precision =  0.860
Recall =  0.755

 Confusion Matrix 
 [[28426     6]
 [   12    37]]
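
The two modifications above change one setting at a time and judge it on the test split. A more systematic alternative, sketched here under assumptions, is a cross-validated grid search on the training data. The data is synthetic and the grid values are illustrative, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for X_train, y_train.
X_toy, y_toy = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=0
)

# Hypothetical grid over the two hyperparameters explored above.
param_grid = {"max_depth": [5, 10, None], "min_samples_leaf": [1, 10]}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    scoring="recall",  # favour catching frauds; "precision" or "f1" also work
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
)
search.fit(X_toy, y_toy)
print(search.best_params_, round(search.best_score_, 3))
```

Selecting hyperparameters with cross-validation keeps the test split untouched, so the final test metrics stay an honest estimate.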


# 6 - Random Forest Classifier

In [29]:
%%time

from sklearn.ensemble import RandomForestClassifier

RFC = RandomForestClassifier(n_estimators=100, random_state=0)
y_pred_RFC = exec_classifier(RFC, X_train, X_test, y_train)

CPU times: user 7min 1s, sys: 476 ms, total: 7min 2s
Wall time: 7min 4s


In [30]:
validate_tree(y_test, y_pred_RFC)

Accuracy =  1.000
Precision =  0.949
Recall =  0.755

 Confusion Matrix 
 [[28430     2]
 [   12    37]]


In [31]:
%%time
# Limiting max_depth to 10

RFC = RandomForestClassifier(n_estimators=100, random_state=0, max_depth=10)
y_pred_RFC = exec_classifier(RFC, X_train, X_test, y_train)

CPU times: user 4min 5s, sys: 274 ms, total: 4min 5s
Wall time: 4min 7s


In [32]:
validate_tree(y_test, y_pred_RFC)

Accuracy =  1.000
Precision =  0.949
Recall =  0.755

 Confusion Matrix 
 [[28430     2]
 [   12    37]]


In [33]:
%%time
# Changing n_estimators to 50

RFC = RandomForestClassifier(n_estimators=50, random_state=0, max_depth=10)
y_pred_RFC = exec_classifier(RFC, X_train, X_test, y_train)

CPU times: user 2min 2s, sys: 114 ms, total: 2min 2s
Wall time: 2min 3s


In [34]:
validate_tree(y_test, y_pred_RFC)

Accuracy =  1.000
Precision =  0.974
Recall =  0.755

 Confusion Matrix 
 [[28431     1]
 [   12    37]]


This Random Forest model returns the best metrics so far (precision 0.974, recall 0.755). I'll use it as a baseline for the next models.
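
The forests above take several minutes to train. The trees in a random forest are built independently, so one option worth trying is `n_jobs=-1`, which asks scikit-learn to use all available CPU cores. The sketch below uses synthetic data, and any speed-up depends on the machine, so no timing is claimed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the credit card dataset.
X_toy, y_toy = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.1, stratify=y_toy, random_state=0
)

# Same settings as the baseline, plus n_jobs=-1 to train trees in parallel.
rfc = RandomForestClassifier(
    n_estimators=50, max_depth=10, random_state=0, n_jobs=-1
)
y_hat = rfc.fit(X_tr, y_tr).predict(X_te)
print(y_hat.shape)
```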

# 7 - Boosting and AdaBoost

In [35]:
# Importing AdaBoostClassifier

from sklearn.ensemble import AdaBoostClassifier

In [36]:
%%time

adaboost_classifier = AdaBoostClassifier(random_state=0)
y_pred_adaboost = exec_classifier(adaboost_classifier, X_train, X_test, y_train)

CPU times: user 1min 53s, sys: 102 ms, total: 1min 53s
Wall time: 1min 54s


In [37]:
validate_tree(y_test, y_pred_adaboost)

# The result is worse than what we had before

Accuracy =  0.999
Precision =  0.889
Recall =  0.653

 Confusion Matrix 
 [[28428     4]
 [   17    32]]


In [38]:
%%time
# Trying to improve the metrics with a few modifications

# Set n_estimators = 100
adaboost_classifier = AdaBoostClassifier(random_state=0, n_estimators=100)

y_pred_adaboost = exec_classifier(adaboost_classifier, X_train, X_test, y_train)

CPU times: user 3min 53s, sys: 210 ms, total: 3min 53s
Wall time: 3min 54s


In [39]:
validate_tree(y_test, y_pred_adaboost)

# Recall improves (0.653 -> 0.776), while precision drops slightly (0.889 -> 0.864)

Accuracy =  0.999
Precision =  0.864
Recall =  0.776

 Confusion Matrix 
 [[28426     6]
 [   11    38]]


In [40]:
%%time
# Trying to improve the metrics with a few modifications

# Set n_estimators = 200
adaboost_classifier = AdaBoostClassifier(random_state=0, n_estimators=200)

y_pred_adaboost = exec_classifier(adaboost_classifier, X_train, X_test, y_train)

CPU times: user 7min 40s, sys: 492 ms, total: 7min 41s
Wall time: 7min 43s


In [41]:
validate_tree(y_test, y_pred_adaboost)

# Both precision and recall improve

Accuracy =  1.000
Precision =  0.929
Recall =  0.796

 Confusion Matrix 
 [[28429     3]
 [   10    39]]


# Conclusion

We ended up with two good models: RandomForestClassifier and AdaBoostClassifier.

We tried a few configurations for each one. Below is the code for the best configuration of each model:

```
RFC = RandomForestClassifier(n_estimators=50, random_state=0, max_depth=10)
y_pred_RFC = exec_classifier(RFC, X_train, X_test, y_train)
```




```
adaboost_classifier = AdaBoostClassifier(random_state=0, n_estimators=200)

y_pred_adaboost = exec_classifier(adaboost_classifier, X_train, X_test, y_train)
```
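
To compare the two configurations with a single number, one option is the F1 score, the harmonic mean of precision and recall. The sketch below recomputes it from the confusion matrices reported earlier. It reflects one test split only, so the small gap should not be read as a firm ranking.

```python
# (tp, fp, fn) read off the confusion matrices reported earlier.
models = {
    "RandomForest (n_estimators=50, max_depth=10)": (37, 1, 12),
    "AdaBoost (n_estimators=200)": (39, 3, 10),
}

f1_scores = {}
for name, (tp, fp, fn) in models.items():
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F1 is the harmonic mean of precision and recall.
    f1_scores[name] = 2 * precision * recall / (precision + recall)
    print(f"{name}: precision={precision:.3f} recall={recall:.3f} "
          f"F1={f1_scores[name]:.3f}")
```

On this split AdaBoost catches two more frauds (39 vs 37) at the cost of two more false alarms (3 vs 1), which is why the two F1 values land so close together.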


