# Planejamento da solução

1. Explorar os Dados usando ferramentas e testes Estatísticos com o objetivo de encontrar inconsistências dos dados e tratar possíveis dados faltantes.
2. Levantar Hipóteses sobre as características de transações legítimas e fraudulentas, validando ou refutando essas hipóteses através dos dados.
3. Preparar os Dados para que os Algoritmos de Machine Learning sejam capazes de aprender a tarefa.
4. Treinar e avaliar vários algoritmos Classificadores.
5. Contabilizar a Performance do Modelo de Machine Learning e transformar em Performance de Negócio.
6. Desenvolver uma API que retorne “Fraude” ou “Legítima” quando receber como entrada uma transação.

# 0.0 Imports

In [1]:
import pandas as pd
import numpy as np
import warnings
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.core.display    import display, HTML
from sklearn import metrics as m
from sklearn.model_selection import StratifiedKFold

## 0.1 Helper Functions

In [2]:
def jupyter_settings():
    %matplotlib inline
    %pylab inline
    
    plt.style.use( 'bmh' )
    plt.rcParams['figure.figsize'] = [18, 9]
    plt.rcParams['font.size'] = 24

    display( HTML( '<style>.container { width:100% !important; }</style>') )
    pd.options.display.max_columns = None
    pd.options.display.max_rows = None
    pd.set_option( 'display.expand_frame_repr', False )
    pd.set_option('display.float_format', lambda x: '%.3f' % x)
    warnings.filterwarnings('ignore')
    sns.set()
    
jupyter_settings()

def cross_validation(model_name, model, x, y):
    
    x = x.to_numpy()
    y = y.to_numpy()
    
    
    balanced_accuracy = []
    precision = []
    recall = []
    f1 = []
    
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    
    for train_index, test_index in skf.split(x, y):
        x_train_cv, x_test_cv = x[train_index], x[test_index]
        y_train_cv, y_test_cv = y[train_index], y[test_index]
        
        
        model.fit(x_train_cv, y_train_cv)
        pred = model.predict(x_test_cv)

        balanced_accuracy.append(m.balanced_accuracy_score(y_test_cv, pred))
        precision.append(m.precision_score(y_test_cv, pred))
        recall.append(m.recall_score(y_test_cv, pred))
        f1.append(m.f1_score(y_test_cv, pred))

    
    acurracy_mean, acurracy_std = np.round(np.mean(balanced_accuracy), 2), np.round(np.std(balanced_accuracy),2)
    precision_mean, precision_std = np.round(np.mean(precision),2), np.round(np.std(precision),2)
    recall_mean, recall_std = np.round(np.mean(recall),2), np.round(np.std(recall),2)
    f1_mean, f1_std = np.round(np.mean(f1),2), np.round(np.std(f1),2)
    
    
    return pd.DataFrame({"Balanced Accuracy": "{} +/- {}".format(acurracy_mean, acurracy_std),
                        "Precision": "{} +/- {}".format(precision_mean, precision_std),
                        "Recall": "{} +/- {}".format(recall_mean, recall_std),
                        "F1": "{} +/- {}".format(f1_mean, f1_std)}, index=[model_name])

def ml_metrics(model_name, y_true, pred):
    
    accuracy = m.balanced_accuracy_score(y_true, pred)
    precision = m.precision_score(y_true, pred)
    recall = m.recall_score(y_true, pred)
    f1 = m.f1_score(y_true, pred)
    
    return pd.DataFrame({'Balanced Accuracy': np.round(accuracy, 2), 
                         'Precision': np.round(precision, 2), 
                         'Recall': np.round(recall, 2),
                         'F1': np.round(f1, 2)}, index=[model_name])


def frequency_encoding(df, column):
    encoder = df.groupby(column).size()/len(df)
    return encoder

Populating the interactive namespace from numpy and matplotlib


## 0.2 Loading Data

In [3]:
df_raw = pd.read_csv('../data/fraud_detection-dataset.csv', low_memory=False)

In [4]:
df_raw.drop(columns='Unnamed: 0', inplace=True)

In [5]:
df_raw.head()

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFlaggedFraud,isFraud
0,353,CASH_OUT,150540.16,C1389413404,9912.0,0.0,C819390946,29817.59,180357.75,0,0
1,282,CASH_OUT,66723.64,C958468196,0.0,0.0,C257205272,1136277.81,1203001.45,0,0
2,228,TRANSFER,1039375.01,C857481806,2328.0,0.0,C134214261,437583.33,1476958.34,0,0
3,36,PAYMENT,9178.61,C558963849,96237.62,87059.01,M635090135,0.0,0.0,0,0
4,48,PAYMENT,4527.24,C1644082954,51925.0,47397.76,M332145827,0.0,0.0,0,0


# 1.0 Data Description

In [6]:
df1 = df_raw.copy()

- **step** - mapeia uma unidade de tempo no mundo real. Neste caso, 1 etapa corresponde a 1 hora de tempo. Total de etapas 744 (simulação de 30 dias).

- **type** - CASH-IN, CASH-OUT, DEBIT, PAYMENT and TRANSFER.

- **amount** - valor da transação em moeda local.

- **nameOrig** - cliente que iniciou a transação

- **oldbalanceOrg** - saldo inicial antes da transação

- **newbalanceOrig** - saldo após a transação

- **nameDest** - cliente que é o destinatário da transação

- **oldbalanceDest** - destinatário do saldo inicial antes da transação. Observe que não há informações para clientes que começam com M (Merchants).

- **newbalanceDest** - novo destinatário do saldo após a transação. Observe que não há informações para clientes que começam com  M (Merchants).

- **isFraud** - São as transações feitas pelos agentes fraudulentos dentro da simulação. Neste conjunto de dados específico, o comportamento fraudulento dos agentes visa lucrar ao assumir o controle das contas dos clientes e tentar esvaziar os fundos transferindo para outra conta e retirando do sistema.

- **isFlaggedFraud** - O modelo de negócios visa controlar as transferências em massa de uma conta para outra e sinaliza tentativas ilegais. Uma tentativa ilegal neste conjunto de dados é uma tentativa de transferir mais de 200.000 em uma única transação.

## 1.1 Data Dimensions

In [7]:
df1.shape

(636262, 11)

## 1.2 Rename Columns

In [8]:
cols_new = ['step', 'type', 'amount', 'name_orig', 'old_balance_org', 'new_balance_orig',
       'name_dest', 'old_balance_dest', 'new_balance_dest', 'is_flagged_fraud',
       'is_fraud']

df1.columns = cols_new

## 1.3 Data Types

In [9]:
df1.dtypes

step                  int64
type                 object
amount              float64
name_orig            object
old_balance_org     float64
new_balance_orig    float64
name_dest            object
old_balance_dest    float64
new_balance_dest    float64
is_flagged_fraud      int64
is_fraud              int64
dtype: object

## 1.4 Check Na

In [10]:
df1.isna().sum()

step                0
type                0
amount              0
name_orig           0
old_balance_org     0
new_balance_orig    0
name_dest           0
old_balance_dest    0
new_balance_dest    0
is_flagged_fraud    0
is_fraud            0
dtype: int64

## 1.5 Fillout Na

There are not Nan values in the dataset

## 1.6 Change Types

## 1.7 Descriptive Statistical

### 1.7.1 Numerical Atributes

### 1.7.2 Categorical Atributes

# 2.0 feature Engineering

# 3.0 Variable Filtering

In [11]:
df3 = df1.copy()

In [12]:
df3.drop(columns=['name_orig', 'name_dest'], axis=1, inplace=True)

# 4.0 EDA

# 5.0 Data Preparation

In [13]:
df5 = df3.copy()

## 5.1 Split dataframe into train, test and validation

In [14]:
from sklearn.model_selection import train_test_split

In [15]:
X = df3.drop('is_fraud', axis=1)
y = df3['is_fraud'].copy()

In [16]:
X_train, X_temp, y_train, y_temp = train_test_split(X,y, train_size=0.8, random_state=42, stratify=y)

In [17]:
X_valid, X_test, y_valid, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp)

## 5.2 Rescaling

In [18]:
from sklearn.preprocessing import MinMaxScaler

In [19]:
mm = MinMaxScaler()

# amount
X_train['amount'] = mm.fit_transform(X_train[['amount']].values)
X_valid['amount'] = mm.fit_transform(X_valid[['amount']].values)

# old balance org
X_train['old_balance_org'] = mm.fit_transform(X_train[['old_balance_org']].values)
X_valid['old_balance_org'] = mm.fit_transform(X_valid[['old_balance_org']].values)

# new balance orig
X_train['new_balance_orig'] = mm.fit_transform(X_train[['new_balance_orig']].values)
X_valid['new_balance_orig'] = mm.fit_transform(X_valid[['new_balance_orig']].values)

# old balance dest
X_train['old_balance_dest'] = mm.fit_transform(X_train[['old_balance_dest']].values)
X_valid['old_balance_dest'] = mm.fit_transform(X_valid[['old_balance_dest']].values)

# new balance dest
X_train['new_balance_dest'] = mm.fit_transform(X_train[['new_balance_dest']].values)
X_valid['new_balance_dest'] = mm.fit_transform(X_valid[['new_balance_dest']].values)

# step
X_train['step'] = mm.fit_transform(X_train[['step']].values)
X_valid['step'] = mm.fit_transform(X_valid[['step']].values)

## 5.3 Encoding

In [20]:
# type - frequency encoding
fe_type_train = frequency_encoding(X_train, 'type')
fe_type_valid = frequency_encoding(X_valid, 'type')
X_train['type'] = X_train['type'].map(fe_type_train)
X_valid['type'] = X_valid['type'].map(fe_type_valid)

# 6.0 Feature Selection

# 7.0 Machine Learning Modeling

## 7.1 Baseline Model

In [21]:
from sklearn.dummy import DummyClassifier

In [22]:
dummy = DummyClassifier()
dummy.fit(X_train, y_train)
pred = dummy.predict(X_valid)

### Results

In [23]:
dummy_result = ml_metrics('dummy', y_valid, pred)
dummy_result

Unnamed: 0,Balanced Accuracy,Precision,Recall,F1
dummy,0.5,0.0,0.0,0.0


### Cross Validation

In [24]:
dummy_result_cv = cross_validation('Dummy_CV', DummyClassifier(), X_train, y_train)
dummy_result_cv

Unnamed: 0,Balanced Accuracy,Precision,Recall,F1
Dummy_CV,0.5 +/- 0.0,0.0 +/- 0.0,0.0 +/- 0.0,0.0 +/- 0.0


## 7.2 Logistic Regression

In [25]:
from sklearn.linear_model import LogisticRegression

In [26]:
lg = LogisticRegression(class_weight='balanced')
lg.fit(X_train, y_train)
pred = lg.predict(X_valid)

### Results

In [27]:
logistic_regression_result = ml_metrics('LogisticRegression', y_valid, pred)
logistic_regression_result

Unnamed: 0,Balanced Accuracy,Precision,Recall,F1
LogisticRegression,0.89,0.01,0.87,0.03


### Cross Validation

In [28]:
logistic_regression_result_cv = cross_validation('LogisticRegression_CV', LogisticRegression(class_weight='balanced'), X_train, y_train)
logistic_regression_result_cv

Unnamed: 0,Balanced Accuracy,Precision,Recall,F1
LogisticRegression_CV,0.85 +/- 0.01,0.01 +/- 0.0,0.78 +/- 0.01,0.02 +/- 0.0


## 7.3 KNN

In [49]:
from sklearn.neighbors import KNeighborsClassifier

In [50]:
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
pred = knn.predict(X_valid)

KeyboardInterrupt: 

### Results

In [None]:
knn_result = ml_metrics('KNN', y_valid, pred)
knn_result

### Cross Validation

In [None]:
knn_result_cv = cross_validation('KNN_CV', KNeighborsClassifier(), X_train, y_train)
knn_result_cv

## 7.4 ADA Boosting

In [45]:
from sklearn.ensemble import AdaBoostClassifier

In [46]:
ada = AdaBoostClassifier()
ada.fit(X_train, y_train)
pred = ada.predict(X_valid)

### Results

In [47]:
ada_result = ml_metrics('AdaBoost', y_valid, pred)
ada_result

Unnamed: 0,Balanced Accuracy,Precision,Recall,F1
AdaBoost,0.79,0.44,0.59,0.5


### Cross Validation

In [48]:
ada_result_cv = cross_validation('AdaBoost_CV', AdaBoostClassifier(), X_train, y_train)
ada_result_cv

Unnamed: 0,Balanced Accuracy,Precision,Recall,F1
AdaBoost_CV,0.76 +/- 0.01,0.96 +/- 0.02,0.52 +/- 0.02,0.68 +/- 0.02


## 7.5 LightGBM

In [33]:
from lightgbm import LGBMClassifier 

In [34]:
lgb = LGBMClassifier(objective='binary', class_weight='balanced')
lgb.fit(X_train, y_train)
pred = lgb.predict(X_valid)

### Results

In [35]:
lgbm_result = ml_metrics('LightGBM', y_valid, pred)
lgbm_result

Unnamed: 0,Balanced Accuracy,Precision,Recall,F1
LightGBM,0.98,0.15,0.98,0.26


### Cross Validation

In [36]:
lgbm_result_cv = cross_validation('LightGBM_CV', LGBMClassifier(objective='binary', class_weight='balanced'), X_train, y_train)
lgbm_result_cv

Unnamed: 0,Balanced Accuracy,Precision,Recall,F1
LightGBM_CV,0.96 +/- 0.0,0.55 +/- 0.01,0.93 +/- 0.01,0.69 +/- 0.01


## 7.6 Random Forest

In [37]:
from sklearn.ensemble import RandomForestClassifier

In [38]:
rf = RandomForestClassifier(class_weight='balanced')
rf.fit(X_train, y_train)
pred = rf.predict(X_valid)

### Results

In [39]:
rf_result = ml_metrics('Random Forest', y_valid, pred)
rf_result

Unnamed: 0,Balanced Accuracy,Precision,Recall,F1
Random Forest,0.87,0.79,0.74,0.77


### Cross Validation

In [40]:
rf_result_cv = cross_validation('Random Forest CV', RandomForestClassifier(class_weight='balanced'), X_train, y_train)
rf_result_cv

Unnamed: 0,Balanced Accuracy,Precision,Recall,F1
Random Forest CV,0.86 +/- 0.01,0.99 +/- 0.01,0.72 +/- 0.01,0.83 +/- 0.01


## 7.7 XGBoost

In [41]:
from xgboost import XGBClassifier

In [42]:
xgb = XGBClassifier(objective='binary:logistic', verbosity=0)
xgb.fit(X_train, y_train)
pred = xgb.predict(X_valid)

### Results

In [43]:
xgb_result = ml_metrics('XGBoost', y_valid, pred)
xgb_result

Unnamed: 0,Balanced Accuracy,Precision,Recall,F1
XGBoost,0.97,0.41,0.94,0.57


### Cross Validation

In [44]:
xgb_result_cv = cross_validation('XGBoost_CV', XGBClassifier(objective='binary:logistic', verbosity=0), X_train, y_train)
xgb_result_cv

Unnamed: 0,Balanced Accuracy,Precision,Recall,F1
XGBoost_CV,0.9 +/- 0.01,0.97 +/- 0.01,0.81 +/- 0.01,0.88 +/- 0.01


## 7.8 Results

In [51]:
df_results = pd.concat([dummy_result, logistic_regression_result, lgbm_result, ada_result, rf_result, xgb_result])
df_results.style.highlight_max(color='lightgreen', axis=0)

Unnamed: 0,Balanced Accuracy,Precision,Recall,F1
dummy,0.5,0.0,0.0,0.0
LogisticRegression,0.89,0.01,0.87,0.03
LightGBM,0.98,0.15,0.98,0.26
AdaBoost,0.79,0.44,0.59,0.5
Random Forest,0.87,0.79,0.74,0.77
XGBoost,0.97,0.41,0.94,0.57


### Cross Validation Results

In [52]:
df_results_cv = pd.concat([dummy_result_cv, logistic_regression_result_cv, lgbm_result_cv, ada_result_cv, rf_result_cv, xgb_result_cv])
df_results_cv

Unnamed: 0,Balanced Accuracy,Precision,Recall,F1
Dummy_CV,0.5 +/- 0.0,0.0 +/- 0.0,0.0 +/- 0.0,0.0 +/- 0.0
LogisticRegression_CV,0.85 +/- 0.01,0.01 +/- 0.0,0.78 +/- 0.01,0.02 +/- 0.0
LightGBM_CV,0.96 +/- 0.0,0.55 +/- 0.01,0.93 +/- 0.01,0.69 +/- 0.01
AdaBoost_CV,0.76 +/- 0.01,0.96 +/- 0.02,0.52 +/- 0.02,0.68 +/- 0.02
Random Forest CV,0.86 +/- 0.01,0.99 +/- 0.01,0.72 +/- 0.01,0.83 +/- 0.01
XGBoost_CV,0.9 +/- 0.01,0.97 +/- 0.01,0.81 +/- 0.01,0.88 +/- 0.01


# 8.0 Hyperparameter Fine Tuning

# 9.0 Conlclusions

# 10.0 Deploy