# Models
- Author: Francisco Martínez García

In this part of the project we will train several models on the training data and validate their predictions in order to select the best model for the test data. The results are discussed in the next notebook.

## Libraries

In [1]:
#Import the libraries needed
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt

import category_encoders as ce

from mlxtend.plotting import plot_confusion_matrix

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Lasso, LogisticRegression
from sklearn.linear_model import LogisticRegressionCV 
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.dummy import DummyClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix, roc_curve, auc, \
                            silhouette_score, recall_score, precision_score, make_scorer, \
                            roc_auc_score, f1_score, precision_recall_curve, fbeta_score,mean_squared_error


from catboost import CatBoostClassifier 
import lightgbm as lgb
import pickle
import warnings
warnings.filterwarnings('ignore')

#Import the functions file
import functions as fx



## Read data

In [3]:
#Read the training data
pd_fraud = pd.read_parquet('../data/training_data.parquet')

## Process the data

First, the preprocessor for the Pipeline is created. Then the training data is split into training and validation sets.

In [4]:
#Defining the steps in the numerical pipeline 
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

#Defining the steps in the categorical pipeline 
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

#Numerical features to pass down the numerical pipeline 
numeric_features = pd_fraud.select_dtypes(include=['int64', 'float64']).drop(['isFraud'], axis=1).columns
#Categorical features to pass down the categorical pipeline 
categorical_features = pd_fraud.select_dtypes(include=['object']).columns
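
As a rough illustration of what the two transformers above do to a single column, the sketch below re-implements median imputation, standard scaling, and one-hot encoding with unknown categories ignored, using only the standard library. The column values and category names are invented for the example, and the helper names are hypothetical; the real pipeline delegates all of this to scikit-learn.

```python
import statistics

def impute_median(column):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in column if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in column]

def standard_scale(column):
    """Centre on the mean and divide by the population standard deviation."""
    mean = statistics.mean(column)
    std = statistics.pstdev(column)
    return [(v - mean) / std for v in column]

def one_hot(column, categories):
    """Encode each value as a 0/1 vector; unseen categories become all zeros."""
    return [[1 if value == cat else 0 for cat in categories] for value in column]

# Hypothetical numeric column with one missing value.
amounts = [1.0, None, 3.0, 5.0]
scaled = standard_scale(impute_median(amounts))

# Hypothetical categorical column; "cheque" was not among the learned categories.
learned_categories = ["card", "cash"]
encoded = one_hot(["cash", "cheque"], learned_categories)

print(scaled)
print(encoded)
```

After scaling, the column has mean 0 and unit standard deviation, and the unseen category is encoded as a row of zeros instead of raising an error, which is the behaviour requested by `handle_unknown='ignore'`.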

In [5]:
#Create the preprocessor
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

In [6]:
#Save the preprocessor
with open ('../models/preprocessor.pickle','wb') as f:
    pickle.dump(preprocessor,f)

In [7]:
with open('../models/preprocessor.pickle', 'rb') as f:
    preprocessor = pickle.load(f)

In [8]:
#Split the training data into training and validation sets
X_train, X_validation, y_train, y_validation = train_test_split(pd_fraud, pd_fraud['isFraud'], 
                                                                test_size=0.15, 
                                                                random_state=1)

In [9]:
X_train = X_train.drop(['isFraud'], axis=1)
X_validation = X_validation.drop(['isFraud'], axis=1)

## Base model

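The first model is a dummy classifier that always predicts the most frequent class, so every transaction is labelled as non-fraud. It serves as the baseline that the other models must beat.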
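
With `strategy='most_frequent'`, the dummy classifier ignores the features and always returns the majority class of the training target. The following standard-library sketch mimics that behaviour on an invented, heavily imbalanced target; the function names are hypothetical.

```python
from collections import Counter

def fit_most_frequent(labels):
    """Return the label that occurs most often in the training target."""
    return Counter(labels).most_common(1)[0][0]

def predict_constant(label, n_samples):
    """Predict the same label for every sample, ignoring the features."""
    return [label] * n_samples

# Toy target: 1 marks fraud, 0 marks a legitimate transaction.
y_toy = [0] * 995 + [1] * 5

majority = fit_most_frequent(y_toy)
predictions = predict_constant(majority, len(y_toy))

accuracy = sum(p == t for p, t in zip(predictions, y_toy)) / len(y_toy)
fraud_recall = sum(p == 1 and t == 1 for p, t in zip(predictions, y_toy)) / sum(y_toy)

print(majority, accuracy, fraud_recall)
```

The accuracy is very high while the fraud recall is zero, which is why accuracy alone is a poor yardstick for this problem.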

In [10]:
model_base = Pipeline(steps=[
    ('preprocesador', preprocessor), 
    ('clasificador', DummyClassifier(strategy='most_frequent',random_state=1))])

In [11]:
#Train the model
model_base.fit(X_train, y_train)

In [12]:
#Save the model
with open('../models/base.pickle', 'wb') as f:
    pickle.dump(model_base, f)

In [13]:
#Load the model
with open('../models/base.pickle', 'rb') as f:
    model_base = pickle.load(f)

### Results

In [14]:
y_pred_base = model_base.predict(X_validation)
y_pred_proba_base = model_base.predict_proba(X_validation)
fx.evaluate_model(y_validation, y_pred_base, y_pred_proba_base)

ROC-AUC score of the model: 0.5
Accuracy of the model: 0.9987522749127784

Classification report: 
              precision    recall  f1-score   support

           0       1.00      1.00      1.00    125672
           1       0.00      0.00      0.00       157

    accuracy                           1.00    125829
   macro avg       0.50      0.50      0.50    125829
weighted avg       1.00      1.00      1.00    125829


Confusion matrix: 
[[125672      0]
 [   157      0]]

F2 Score: 
0.4998751028214034
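
The F2 score printed above looks surprisingly high for a model that never flags fraud. The numbers are consistent with a macro-averaged F-beta score (β = 2), that is, the unweighted mean of the per-class scores. This is an inference from the output, since `fx.evaluate_model` is not shown here. The sketch below recomputes it from the confusion matrix using only the standard library; `fbeta_from_counts` and `macro_f2` are illustrative names.

```python
def fbeta_from_counts(tp, fp, fn, beta=2.0):
    """F-beta for one class from its true positives, false positives and false negatives."""
    b2 = beta ** 2
    denominator = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denominator if denominator else 0.0

def macro_f2(confusion):
    """Unweighted mean of the per-class F2 scores of a 2x2 confusion matrix.

    Rows are true classes and columns are predicted classes, as printed above.
    """
    (tn, fp), (fn, tp) = confusion
    f2_fraud = fbeta_from_counts(tp, fp, fn)
    # For class 0 the roles swap: its "true positives" are the true negatives,
    # its false positives are the missed frauds, its false negatives the false alarms.
    f2_legit = fbeta_from_counts(tn, fn, fp)
    return (f2_fraud + f2_legit) / 2

baseline = [[125672, 0], [157, 0]]
print(macro_f2(baseline))
```

For the baseline matrix this gives roughly 0.49988: the score for class 0 is close to 1 and the score for the fraud class is 0, so their mean sits just below 0.5.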



### Threshold setting prediction

In [15]:
# keep probabilities for the positive outcome only
yhat_base = y_pred_proba_base[:, 1]
# calculate roc curves
fpr, tpr, thresholds = roc_curve(y_validation, yhat_base)

gmeans = np.sqrt(tpr * (1-fpr))
# locate the index of the largest g-mean
ix = np.argmax(gmeans)
print('Best Threshold=%f, G-Mean=%.3f' % (thresholds[ix], gmeans[ix]))

y_pred_new_threshold_base = (y_pred_proba_base[:,1]>thresholds[ix]).astype(int)
fx.evaluate_model(y_validation,y_pred_new_threshold_base,y_pred_proba_base)

Best Threshold=1.000000, G-Mean=0.000
ROC-AUC score of the model: 0.5
Accuracy of the model: 0.9987522749127784

Classification report: 
              precision    recall  f1-score   support

           0       1.00      1.00      1.00    125672
           1       0.00      0.00      0.00       157

    accuracy                           1.00    125829
   macro avg       0.50      0.50      0.50    125829
weighted avg       1.00      1.00      1.00    125829


Confusion matrix: 
[[125672      0]
 [   157      0]]

F2 Score: 
0.4998751028214034
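
The threshold cells in this notebook choose the cut-off that maximises the geometric mean of sensitivity and specificity, G-mean = sqrt(TPR · (1 − FPR)). A minimal standard-library sketch of that selection on invented scores follows; `best_gmean_threshold` is an illustrative helper, not part of `functions.py`, and it scans every distinct score rather than calling `roc_curve`.

```python
import math

def best_gmean_threshold(y_true, scores):
    """Return (threshold, g_mean) maximising sqrt(TPR * (1 - FPR)).

    A sample is flagged positive when its score is >= the threshold,
    which is the convention roc_curve uses for its thresholds.
    """
    positives = sum(y_true)
    negatives = len(y_true) - positives
    best = (None, -1.0)
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(s >= threshold and t == 1 for s, t in zip(scores, y_true))
        fp = sum(s >= threshold and t == 0 for s, t in zip(scores, y_true))
        g_mean = math.sqrt((tp / positives) * (1 - fp / negatives))
        if g_mean > best[1]:
            best = (threshold, g_mean)
    return best

# Invented labels and fraud probabilities.
y_toy = [0, 0, 0, 0, 1, 0, 1, 1]
p_toy = [0.01, 0.02, 0.05, 0.10, 0.20, 0.30, 0.60, 0.90]

threshold, g_mean = best_gmean_threshold(y_toy, p_toy)
print(threshold, round(g_mean, 3))
```

When every score is identical, as with the dummy classifier, no threshold separates the classes and the G-mean stays at 0, as in the output above. Note also that the thresholds are defined with `>=`, so applying the selected value with a strict `>` leaves out the samples whose score equals the threshold.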



## Lasso
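Lasso is a regression method whose L1 penalty performs variable selection by shrinking some coefficients to exactly zero. Note that the model below is a `LogisticRegression` with scikit-learn's default settings, which apply an L2 penalty; an L1 (lasso) penalty would require `penalty='l1'` together with a solver that supports it.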
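
The variable selection attributed to lasso comes from its L1 penalty, which can push coefficients exactly to zero, whereas an L2 penalty only shrinks them. The sketch below illustrates the difference with the shrinkage step associated with each penalty, applied to made-up coefficients. It is a didactic illustration, not the solver scikit-learn uses, and the function names are hypothetical.

```python
def shrink_l1(weight, strength):
    """Soft-thresholding: the shrinkage step induced by an L1 (lasso) penalty."""
    magnitude = abs(weight) - strength
    if magnitude <= 0:
        return 0.0
    return magnitude if weight > 0 else -magnitude

def shrink_l2(weight, strength):
    """Proportional shrinkage induced by an L2 (ridge) penalty."""
    return weight / (1.0 + strength)

# Hypothetical coefficients before regularisation.
weights = [2.0, -0.3, 0.05, -1.2]
strength = 0.5

l1_weights = [shrink_l1(w, strength) for w in weights]
l2_weights = [shrink_l2(w, strength) for w in weights]

print(l1_weights)  # the two small coefficients become exactly 0
print(l2_weights)  # every coefficient shrinks but stays non-zero
```

In scikit-learn, `LogisticRegression` applies an L2 penalty unless told otherwise; requesting `penalty='l1'` needs a solver such as `liblinear` or `saga`.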

In [16]:
model_lasso = Pipeline(steps=[
    ('preprocesador', preprocessor), 
    ('clasificador', LogisticRegression(random_state=1))])

In [17]:
#Train the model
model_lasso.fit(X_train, y_train)

In [18]:
#Save the model
with open('../models/model_lasso.pickle', 'wb') as f:
    pickle.dump(model_lasso, f)

In [19]:
#Load the model
with open('../models/model_lasso.pickle', 'rb') as f:
    model_lasso = pickle.load(f)

### Results

In [20]:
y_pred_lasso = model_lasso.predict(X_validation)
y_pred_proba_lasso = model_lasso.predict_proba(X_validation)
fx.evaluate_model(y_validation, y_pred_lasso, y_pred_proba_lasso)

ROC-AUC score of the model: 0.9577539428288299
Accuracy of the model: 0.99912579770959

Classification report: 
              precision    recall  f1-score   support

           0       1.00      1.00      1.00    125672
           1       0.96      0.31      0.47       157

    accuracy                           1.00    125829
   macro avg       0.98      0.66      0.74    125829
weighted avg       1.00      1.00      1.00    125829


Confusion matrix: 
[[125670      2]
 [   108     49]]

F2 Score: 
0.6803200829274987



### Threshold setting prediction

In [21]:
# keep probabilities for the positive outcome only
yhat_lasso = y_pred_proba_lasso[:, 1]
# calculate roc curves
fpr, tpr, thresholds = roc_curve(y_validation, yhat_lasso)

gmeans = np.sqrt(tpr * (1-fpr))
# locate the index of the largest g-mean
ix = np.argmax(gmeans)
print('Best Threshold=%f, G-Mean=%.3f' % (thresholds[ix], gmeans[ix]))

y_pred_new_threshold_lasso = (y_pred_proba_lasso[:,1]>thresholds[ix]).astype(int)
fx.evaluate_model(y_validation,y_pred_new_threshold_lasso,y_pred_proba_lasso)

Best Threshold=0.002079, G-Mean=0.889
ROC-AUC score of the model: 0.9577539428288299
Accuracy of the model: 0.9193747069435504

Classification report: 
              precision    recall  f1-score   support

           0       1.00      0.92      0.96    125672
           1       0.01      0.85      0.03       157

    accuracy                           0.92    125829
   macro avg       0.51      0.89      0.49    125829
weighted avg       1.00      0.92      0.96    125829


Confusion matrix: 
[[115550  10122]
 [    23    134]]

F2 Score: 
0.49801707175519894



## Random Forest
Random forest is a classification algorithm that trains many decision trees and combines their predictions to correct the flaws each tree has on its own.
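
A forest's predicted probability is the average of its trees' predictions, which is what smooths out the errors of individual trees. The sketch below imitates that averaging with hypothetical per-tree votes (1 = fraud). It assumes fully grown trees whose leaves are pure, so each tree contributes either 0 or 1.

```python
def forest_probability(tree_votes):
    """Average the per-tree votes to get the forest's fraud probability."""
    return sum(tree_votes) / len(tree_votes)

# Hypothetical forest of 100 trees: 37 vote fraud, 63 vote legitimate.
votes = [1] * 37 + [0] * 63

probability = forest_probability(votes)
label_default = int(probability > 0.5)     # a majority of the trees is required
label_lenient = int(probability >= 0.01)   # a single tree is enough

print(probability, label_default, label_lenient)
```

With 100 trees (the scikit-learn default in recent versions), probabilities of this kind move in steps of 0.01, which is consistent with the `Best Threshold=0.010000` reported for this model in the threshold section.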

In [22]:
model_rf = Pipeline(steps=[
    ('preprocesador', preprocessor), 
    
    ('clasificador', RandomForestClassifier(n_jobs=-1, random_state=0))])

In [23]:
#Train the model
model_rf.fit(X_train, y_train)

In [24]:
#Save the model
with open ('../models/random_forest.pickle','wb') as f:
    pickle.dump(model_rf,f)

In [25]:
#Load the model
with open('../models/random_forest.pickle', 'rb') as f:
    model_rf = pickle.load(f)

In [26]:
y_pred_rf = model_rf.predict(X_validation)
y_pred_proba_rf = model_rf.predict_proba(X_validation)

### Results

In [27]:
fx.evaluate_model(y_validation, y_pred_rf,y_pred_proba_rf)

ROC-AUC score of the model: 0.9607479869748893
Accuracy of the model: 0.9996900555515819

Classification report: 
              precision    recall  f1-score   support

           0       1.00      1.00      1.00    125672
           1       0.99      0.76      0.86       157

    accuracy                           1.00    125829
   macro avg       1.00      0.88      0.93    125829
weighted avg       1.00      1.00      1.00    125829


Confusion matrix: 
[[125671      1]
 [    38    119]]

F2 Score: 
0.8976938543627675



### Threshold setting prediction

In [28]:
# keep probabilities for the positive outcome only
yhat_rf = y_pred_proba_rf[:, 1]
# calculate roc curves
fpr, tpr, thresholds = roc_curve(y_validation, yhat_rf)

gmeans = np.sqrt(tpr * (1-fpr))
# locate the index of the largest g-mean
ix = np.argmax(gmeans)
print('Best Threshold=%f, G-Mean=%.3f' % (thresholds[ix], gmeans[ix]))

y_pred_new_threshold_rf = (y_pred_proba_rf[:,1]>thresholds[ix]).astype(int)
fx.evaluate_model(y_validation,y_pred_new_threshold_rf,y_pred_proba_rf)

Best Threshold=0.010000, G-Mean=0.952
ROC-AUC score of the model: 0.9607479869748893
Accuracy of the model: 0.9953905697414746

Classification report: 
              precision    recall  f1-score   support

           0       1.00      1.00      1.00    125672
           1       0.20      0.90      0.33       157

    accuracy                           1.00    125829
   macro avg       0.60      0.95      0.66    125829
weighted avg       1.00      1.00      1.00    125829


Confusion matrix: 
[[125107    565]
 [    15    142]]

F2 Score: 
0.7641057490914109



## General Linear Model (GLM)
Linear regression estimates the weights (coefficients) of a linear function of the features; the prediction is the resulting weighted sum.

In [29]:
model_glm = Pipeline(steps=[
    ('preprocesador', preprocessor), 
    
    ('clasificador', LinearRegression())])

In [30]:
#Train the model
model_glm.fit(X_train, y_train)

In [31]:
#Save the model
with open('../models/GLM.pickle', 'wb') as f:
    pickle.dump(model_glm, f)

In [32]:
#Load the model
with open('../models/GLM.pickle', 'rb') as f:
    model_glm = pickle.load(f)

### Results

In [33]:
y_pred_glm = model_glm.predict(X_validation)

mean_squared_error(y_validation, y_pred_glm)

0.0010485205698375088

Because linear regression outputs continuous values rather than class labels or probabilities, the metrics used for the other models cannot be computed directly, so only the MSE is reported. As a result, this model cannot be compared with the rest on the same terms.
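
To put that MSE in context, it helps to compare it with a trivial predictor. For 0/1 labels, always predicting 0 yields an MSE equal to the fraud rate. The sketch below computes that reference value from the validation class counts shown in the earlier classification reports (125672 legitimate, 157 fraud); the helper name is illustrative.

```python
def mean_squared_error_simple(y_true, y_pred):
    """Mean of the squared differences between labels and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Validation labels reconstructed from the class supports reported earlier.
y_val = [0] * 125672 + [1] * 157
always_zero = [0.0] * len(y_val)

baseline_mse = mean_squared_error_simple(y_val, always_zero)
print(round(baseline_mse, 7))
```

The result, about 0.00125, is only slightly above the 0.00105 obtained by the linear regression, which shows how little the MSE says about fraud detection on data this imbalanced.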

## Support Vector Machine (SVM)
The SVM is a supervised learning algorithm that finds the hyperplane separating the classes with the largest margin. The support vectors are the data points closest to the hyperplane; they determine its position and orientation.
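
For the linear case, the idea can be sketched as follows: a hyperplane w·x + b = 0 splits the space, the sign of w·x + b gives the class, and the points with the smallest distance to the hyperplane are the support vectors. The weights, bias and points below are invented, the function names are hypothetical, and the `SVC` used in this notebook relies on its default RBF kernel rather than a linear one.

```python
import math

def decision_value(w, b, x):
    """Signed score w.x + b; its sign is the predicted side of the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def distance_to_hyperplane(w, b, x):
    """Geometric distance from point x to the hyperplane w.x + b = 0."""
    return abs(decision_value(w, b, x)) / math.sqrt(sum(wi * wi for wi in w))

# Hypothetical separating hyperplane and two-dimensional points.
w, b = [1.0, 1.0], -3.0
points = [(0.0, 0.0), (1.0, 1.0), (2.5, 2.0), (4.0, 4.0)]

labels = [1 if decision_value(w, b, p) > 0 else 0 for p in points]
closest = min(points, key=lambda p: distance_to_hyperplane(w, b, p))

print(labels)
print(closest)
```

Only the points nearest the boundary constrain where it can lie; moving a distant point such as `(4.0, 4.0)` further away would not change the separating hyperplane.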

In [34]:
model_svm = Pipeline(steps=[
    ('preprocesador', preprocessor), 
    
    ('clasificador', SVC(random_state=1))])

In [35]:
#Train the model
model_svm.fit(X_train, y_train)

In [36]:
#Save the model
with open('../models/model_svm.pickle', 'wb') as f:
    pickle.dump(model_svm, f)

In [37]:
#Load the model
with open('../models/model_svm.pickle', 'rb') as f:
    model_svm = pickle.load(f)

In [38]:
y_pred_svm = model_svm.predict(X_validation)
# y_pred_proba_svm = model_svm.predict_proba(X_validation)

In [39]:
#As prediction with the SVM takes time, the predictions are saved too
with open('../models/pred_svm.pickle', 'wb') as f:
    pickle.dump(y_pred_svm, f)

In [40]:
with open('../models/pred_svm.pickle', 'rb') as f:
    y_pred_svm = pickle.load(f)

### Results

Because `SVC` is used with its default `probability=False`, predicted probabilities are not available, so the threshold setting prediction is not calculated.

In [41]:
fx.evaluate_model(y_validation, y_pred_svm)#,y_pred_proba_svm)

Accuracy of the model: 0.999038377480549

Classification report: 
              precision    recall  f1-score   support

           0       1.00      1.00      1.00    125672
           1       0.97      0.24      0.38       157

    accuracy                           1.00    125829
   macro avg       0.99      0.62      0.69    125829
weighted avg       1.00      1.00      1.00    125829


Confusion matrix: 
[[125671      1]
 [   120     37]]

F2 Score: 
0.6387902380190904



## Light Gradient-boosting Machine (LightGBM)
LightGBM is a gradient boosting framework based on decision trees, designed to increase training efficiency and reduce memory usage.
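
Gradient boosting, the technique behind LightGBM, fits in a few lines: each new tree is fitted to the errors left by the trees before it, and its prediction is added with a small learning rate. The sketch below does this for squared error with one-split trees (stumps) on invented one-dimensional data, using only the standard library. All names are hypothetical, and it leaves out everything that makes LightGBM efficient, such as histogram-based split finding and leaf-wise tree growth.

```python
def fit_stump(xs, residuals):
    """Find the split on x that best fits the residuals with two constants."""
    best = None
    for split in sorted(set(xs))[1:]:
        left = [r for x, r in zip(xs, residuals) if x < split]
        right = [r for x, r in zip(xs, residuals) if x >= split]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, split, left_mean, right_mean)
    _, split, left_mean, right_mean = best
    return lambda x: left_mean if x < split else right_mean

def boost(xs, ys, n_rounds=20, learning_rate=0.3):
    """Additive model: start from the mean, then add stumps fitted to residuals."""
    base = sum(ys) / len(ys)
    stumps = []
    predictions = [base] * len(xs)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x) for x, p in zip(xs, predictions)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

def mse(model, xs, ys):
    """Mean squared error of a fitted model on the given data."""
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Invented data: the target jumps from 0 to 1 for large x.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 0, 0, 1, 1, 1]

weak = boost(xs, ys, n_rounds=1)
strong = boost(xs, ys, n_rounds=20)
print(mse(weak, xs, ys), mse(strong, xs, ys))
```

The training error after 20 rounds is far lower than after a single round, because every round removes a fraction of the remaining residual.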

In [43]:
model_lgbm = Pipeline(steps=[
    ('preprocesador', preprocessor),
    ('clasificador', lgb.LGBMClassifier(random_state=0))])

In [44]:
#Train the model
model_lgbm.fit(X_train, y_train)

In [45]:
#Save the model
with open('../models/LightGBM.pickle', 'wb') as f:
    pickle.dump(model_lgbm, f)

In [46]:
#Load the model
with open('../models/LightGBM.pickle', 'rb') as f:
    model_lgbm = pickle.load(f)

### Results

In [47]:
y_pred_lgbm = model_lgbm.predict(X_validation)
y_pred_proba_lgbm = model_lgbm.predict_proba(X_validation)
fx.evaluate_model(y_validation, y_pred_lgbm, y_pred_proba_lgbm)

ROC-AUC score of the model: 0.6512083016227057
Accuracy of the model: 0.997679390283639

Classification report: 
              precision    recall  f1-score   support

           0       1.00      1.00      1.00    125672
           1       0.24      0.38      0.29       157

    accuracy                           1.00    125829
   macro avg       0.62      0.69      0.65    125829
weighted avg       1.00      1.00      1.00    125829


Confusion matrix: 
[[125477    195]
 [    97     60]]

F2 Score: 
0.6691774674668929



### Threshold setting prediction

In [48]:
# keep probabilities for the positive outcome only
yhat_lgbm = y_pred_proba_lgbm[:, 1]
# calculate roc curves
fpr, tpr, thresholds = roc_curve(y_validation, yhat_lgbm)

gmeans = np.sqrt(tpr * (1-fpr))
# locate the index of the largest g-mean
ix = np.argmax(gmeans)
print('Best Threshold=%f, G-Mean=%.3f' % (thresholds[ix], gmeans[ix]))

y_pred_new_threshold_lgbm = (y_pred_proba_lgbm[:,1]>thresholds[ix]).astype(int)
fx.evaluate_model(y_validation,y_pred_new_threshold_lgbm,y_pred_proba_lgbm)

Best Threshold=0.000000, G-Mean=0.651
ROC-AUC score of the model: 0.6512083016227057
Accuracy of the model: 0.9572197188247542

Classification report: 
              precision    recall  f1-score   support

           0       1.00      0.96      0.98    125672
           1       0.01      0.44      0.02       157

    accuracy                           0.96    125829
   macro avg       0.51      0.70      0.50    125829
weighted avg       1.00      0.96      0.98    125829


Confusion matrix: 
[[120377   5295]
 [    88     69]]

F2 Score: 
0.5117235545271103



## XGBoost
XGBoost ("Extreme Gradient Boosting") implements machine learning algorithms under the gradient boosting framework. It provides parallel tree boosting for classification and regression problems.

In [49]:
from xgboost import XGBClassifier
# from sklearn.model_selection import GridSearchCV

In [50]:
model_xgb = Pipeline(steps=[
    ('preprocesador', preprocessor),
    ('clasificador', XGBClassifier(random_state=0))])

In [51]:
#Train the model
model_xgb.fit(X_train, y_train)



In [52]:
#Save the model
with open('../models/XGBoost.pickle', 'wb') as f:
    pickle.dump(model_xgb, f)

In [53]:
#Load the model
with open('../models/XGBoost.pickle', 'rb') as f:
    model_xgb = pickle.load(f)

### Results

In [54]:
y_pred_xgb = model_xgb.predict(X_validation)
y_pred_proba_xgb = model_xgb.predict_proba(X_validation)
fx.evaluate_model(y_validation, y_pred_xgb, y_pred_proba_xgb)

ROC-AUC score of the model: 0.9965812834786176
Accuracy of the model: 0.9997695284870737

Classification report: 
              precision    recall  f1-score   support

           0       1.00      1.00      1.00    125672
           1       0.99      0.82      0.90       157

    accuracy                           1.00    125829
   macro avg       1.00      0.91      0.95    125829
weighted avg       1.00      1.00      1.00    125829


Confusion matrix: 
[[125671      1]
 [    28    129]]

F2 Score: 
0.9254362794085605



### Threshold setting prediction

In [55]:
# keep probabilities for the positive outcome only
yhat_xgb = y_pred_proba_xgb[:, 1]
# calculate roc curves
fpr, tpr, thresholds = roc_curve(y_validation, yhat_xgb)

gmeans = np.sqrt(tpr * (1-fpr))
# locate the index of the largest g-mean
ix = np.argmax(gmeans)
print('Best Threshold=%f, G-Mean=%.3f' % (thresholds[ix], gmeans[ix]))

y_pred_new_threshold_xgb = (y_pred_proba_xgb[:,1]>thresholds[ix]).astype(int)
fx.evaluate_model(y_validation,y_pred_new_threshold_xgb,y_pred_proba_xgb)

Best Threshold=0.000254, G-Mean=0.982
ROC-AUC score of the model: 0.9965812834786176
Accuracy of the model: 0.9833583673080133

Classification report: 
              precision    recall  f1-score   support

           0       1.00      0.98      0.99    125672
           1       0.07      0.97      0.13       157

    accuracy                           0.98    125829
   macro avg       0.53      0.98      0.56    125829
weighted avg       1.00      0.98      0.99    125829


Confusion matrix: 
[[123582   2090]
 [     4    153]]

F2 Score: 
0.6265512517215941



## AdaBoost
AdaBoost (Adaptive Boosting) is an ensemble learning algorithm that trains weak learners sequentially, giving more weight to the samples that earlier learners misclassified.
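
The weighting in AdaBoost applies to both the training samples and the weak learners. One round of the classic discrete AdaBoost update can be sketched as below, with invented labels and weak-learner predictions in {-1, +1}. scikit-learn's `AdaBoostClassifier` implements variants of this scheme, so treat it as an illustration of the principle rather than of the library's exact arithmetic; `adaboost_round` is a hypothetical name.

```python
import math

def adaboost_round(y_true, y_pred, weights):
    """One discrete AdaBoost step: learner weight and updated sample weights.

    Assumes the weights sum to 1 and the weighted error is strictly between 0 and 1.
    """
    error = sum(w for t, p, w in zip(y_true, y_pred, weights) if t != p)
    alpha = 0.5 * math.log((1 - error) / error)
    updated = [w * math.exp(-alpha * t * p) for t, p, w in zip(y_true, y_pred, weights)]
    total = sum(updated)
    return alpha, [w / total for w in updated]

# Invented labels and predictions of one weak learner (one mistake, at index 3).
y_true = [1, 1, -1, -1, 1]
y_pred = [1, 1, -1, 1, 1]
weights = [0.2] * 5

alpha, new_weights = adaboost_round(y_true, y_pred, weights)
print(round(alpha, 3), [round(w, 3) for w in new_weights])
```

After the update the single misclassified sample carries half of the total weight, so the next weak learner is pushed to get it right.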

In [56]:
model_ada = Pipeline(steps=[
    ('preprocesador', preprocessor),
    ('clasificador', AdaBoostClassifier(n_estimators=100, random_state=0))])

In [57]:
model_ada.fit(X_train, y_train)

In [58]:
#Save the model
with open('../models/AdaBoost.pickle', 'wb') as f:
    pickle.dump(model_ada, f)

In [59]:
#Load the model
with open('../models/AdaBoost.pickle', 'rb') as f:
    model_ada = pickle.load(f)

### Results

In [63]:
y_pred_ada = model_ada.predict(X_validation)
y_pred_proba_ada = model_ada.predict_proba(X_validation)
fx.evaluate_model(y_validation,y_pred_ada,y_pred_proba_ada)

ROC-AUC score of the model: 0.988396925897078
Accuracy of the model: 0.999475478625754

Classification report: 
              precision    recall  f1-score   support

           0       1.00      1.00      1.00    125672
           1       0.93      0.63      0.75       157

    accuracy                           1.00    125829
   macro avg       0.96      0.82      0.87    125829
weighted avg       1.00      1.00      1.00    125829


Confusion matrix: 
[[125664      8]
 [    58     99]]

F2 Score: 
0.8366630845778902



### Threshold setting prediction

In [64]:
# keep probabilities for the positive outcome only
yhat_ada = y_pred_proba_ada[:, 1]
# calculate roc curves
fpr, tpr, thresholds = roc_curve(y_validation, yhat_ada)

gmeans = np.sqrt(tpr * (1-fpr))
# locate the index of the largest g-mean
ix = np.argmax(gmeans)
print('Best Threshold=%f, G-Mean=%.3f' % (thresholds[ix], gmeans[ix]))

y_pred_new_threshold_ada = (y_pred_proba_ada[:,1]>thresholds[ix]).astype(int)
fx.evaluate_model(y_validation,y_pred_new_threshold_ada,y_pred_proba_ada)

Best Threshold=0.481327, G-Mean=0.955
ROC-AUC score of the model: 0.988396925897078
Accuracy of the model: 0.9600569026218121

Classification report: 
              precision    recall  f1-score   support

           0       1.00      0.96      0.98    125672
           1       0.03      0.94      0.06       157

    accuracy                           0.96    125829
   macro avg       0.51      0.95      0.52    125829
weighted avg       1.00      0.96      0.98    125829


Confusion matrix: 
[[120655   5017]
 [     9    148]]

F2 Score: 
0.5477661249642923

