## Model Training and Evaluation
- In this notebook, I trained four machine learning models (Logistic Regression, Decision Tree, Random Forest, and XGBoost) on two datasets, evaluated each on four metrics (accuracy, precision, recall, and F1 score), and logged the parameters and metrics to MLflow.
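
For reference, here is a minimal sketch of how the four metrics relate to confusion-matrix counts. The function name `classification_metrics` and the counts below are hypothetical, chosen only for illustration; the notebook itself presumably relies on the scikit-learn metric functions imported in the next cell.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts for an imbalanced problem (100 positives out of 10,000 rows).
metrics = classification_metrics(tp=90, fp=80, fn=10, tn=9820)
for name, value in metrics.items():
    print(f"The {name} is {value:.6f}")
```

With imbalanced classes like these, accuracy can stay very high while precision is much lower, which is why the runs below report all four metrics rather than accuracy alone.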

### Loading important libraries and scripts

In [1]:
import pandas as pd
import sys
import os
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score,precision_score,f1_score,recall_score
from sklearn.ensemble import RandomForestClassifier
import mlflow.sklearn
import mlflow
import warnings
warnings.filterwarnings('ignore')
os.chdir('..')
from scripts.model import train_and_evaluate

### Loading the datasets

In [2]:
fraud_label=pd.read_csv('data/fraud_transformed_label.csv')
fraud_onehot=pd.read_csv('data/fraud_transformed.csv')
credit_df=pd.read_csv('data/credit_transformed.csv')

In [4]:
fraud_onehot

Unnamed: 0.1,Unnamed: 0,purchase_value,age,class,transaction_frequency,total_purchase_value,transaction_velocity,dayofweek,month,day,...,country_United States of America,country_Uruguay,country_Uzbekistan,country_Vanuatu,country_Venezuela,country_Viet Nam,country_Virgin Islands (U.S.),country_Yemen,country_Zambia,country_Zimbabwe
0,0,-0.160204,0.679914,0,-0.261514,-0.249985,-0.163219,0.991020,-0.754946,0.308768,...,False,False,False,False,False,False,False,False,False,False
1,1,-1.142592,2.304476,0,-0.261514,-0.408448,-1.164096,-1.501259,-0.003243,-0.825780,...,True,False,False,False,False,False,False,False,False,False
2,2,-1.197169,2.304476,1,3.941861,1.035319,-1.219700,-0.005891,-1.882499,-1.619963,...,True,False,False,False,False,False,False,False,False,False
3,3,0.385567,0.911994,0,-0.261514,-0.161951,0.392823,-1.501259,-0.379095,-1.279599,...,False,False,False,False,False,False,False,False,False,False
4,4,0.112681,1.376155,0,-0.261514,-0.205968,0.114802,-0.504347,1.124310,-0.712325,...,True,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
151107,151107,0.330990,-0.596528,1,0.120611,0.110956,0.031396,1.489476,-1.130797,1.556770,...,True,False,False,False,False,False,False,False,False,False
151108,151108,-0.105627,-0.132367,0,-0.261514,-0.241182,-0.107615,-1.002803,-0.379095,1.216406,...,False,False,False,False,False,False,False,False,False,False
151109,151109,0.167258,-0.828608,0,-0.261514,-0.197165,0.170406,-0.504347,-0.379095,0.535677,...,False,False,False,False,False,False,False,False,False,False
151110,151110,0.494721,0.447833,0,0.120611,0.295829,0.615240,-1.501259,1.124310,-0.939235,...,True,False,False,False,False,False,False,False,False,False


In [4]:
credit_df

Unnamed: 0.1,Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0,-1.996823,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,0.244200,0
1,1,-1.996823,1.191857,0.266151,0.166480,0.448154,0.060018,-0.082361,-0.078803,0.085102,...,-0.225775,-0.638672,0.101288,-0.339846,0.167170,0.125895,-0.008983,0.014724,-0.342584,0
2,2,-1.996802,-1.358354,-1.340163,1.773209,0.379780,-0.503198,1.800499,0.791461,0.247676,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,1.158900,0
3,3,-1.996802,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,...,-0.108300,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,0.139886,0
4,4,-1.996781,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,...,-0.009431,0.798278,-0.137458,0.141267,-0.206010,0.502292,0.219422,0.215153,-0.073813,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
283721,284802,1.642235,-11.881118,10.071785,-9.834783,-2.066656,-5.364473,-2.606837,-4.918215,7.305334,...,0.213454,0.111864,1.014480,-0.509348,1.436807,0.250034,0.943651,0.823731,-0.350252,0
283722,284803,1.642257,-0.732789,-0.055080,2.035030,-0.738589,0.868229,1.058415,0.024330,0.294869,...,0.214205,0.924384,0.012463,-1.016226,-0.606624,-0.395255,0.068472,-0.053527,-0.254325,0
283723,284804,1.642278,1.919565,-0.301254,-3.249640,-0.557828,2.630515,3.031260,-0.296827,0.708417,...,0.232045,0.578229,-0.037501,0.640134,0.265745,-0.087371,0.004455,-0.026561,-0.082239,0
283724,284805,1.642278,-0.240440,0.530483,0.702510,0.689799,-0.377961,0.623708,-0.686180,0.679145,...,0.265245,0.800049,-0.163298,0.123205,-0.569159,0.546668,0.108821,0.104533,-0.313391,0


#### Training the Logistic Regression model with the Fraud dataset

In [4]:
train_and_evaluate(data='fraud_data',model='LogisticRegression')

2024/10/24 10:59:26 INFO mlflow.tracking.fluent: Experiment with name 'Fraud Detection with Fraud Dataset' does not exist. Creating a new experiment.


The following metrics are found when the fraud dataset is trained with LogisticRegression model
------------------------
The accuracy is 0.951648
------------------------
The precision is 0.530946
------------------------
The recall is 0.91276
------------------------
The f1 score is 0.671364
------------------------
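
As a quick sanity check on the output above, the F1 score is the harmonic mean of precision and recall, so it can be recomputed from the two printed values. This is only a consistency check on the reported numbers, not part of the original pipeline.

```python
# Values copied from the Logistic Regression output on the fraud dataset.
precision = 0.530946
recall = 0.91276
reported_f1 = 0.671364

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

# The inputs are rounded to six decimals, so compare with a small tolerance.
print(f"recomputed F1 = {f1:.6f}, reported F1 = {reported_f1:.6f}")
assert abs(f1 - reported_f1) < 1e-5
```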




#### Training the Decision Tree model with the Fraud dataset

In [5]:
train_and_evaluate(data='fraud_data',model='DecisionTree')

The following metrics are found when the fraud dataset is trained with DecisionTree model
------------------------
The accuracy is 0.956214
------------------------
The precision is 0.554185
------------------------
The recall is 0.957002
------------------------
The f1 score is 0.701907
------------------------




#### Training the Random Forest model with the Fraud dataset

In [6]:
train_and_evaluate(data='fraud_data',model='RandomForest')

The following metrics are found when the fraud dataset is trained with RandomForest model
------------------------
The accuracy is 0.957008
------------------------
The precision is 0.542803
------------------------
The recall is 0.990909
------------------------
The f1 score is 0.701394
------------------------




#### Training the XGBoost model with the Fraud dataset

In [7]:
train_and_evaluate(data='fraud_data',model='XGBoost')

The following metrics are found when the fraud dataset is trained with XGBoost model
------------------------
The accuracy is 0.956986
------------------------
The precision is 0.543277
------------------------
The recall is 0.989633
------------------------
The f1 score is 0.70147
------------------------




#### Training the Logistic Regression model with the Credit Card dataset

In [8]:
train_and_evaluate(data='creditdata',model='LogisticRegression')

2024/10/24 11:04:57 INFO mlflow.tracking.fluent: Experiment with name 'Fraud Detection with Credit Dataset' does not exist. Creating a new experiment.


The following metrics are found when the credit dataset is trained with LogisticRegression model
------------------------
The accuracy is 0.999137
------------------------
The precision is 0.511111
------------------------
The recall is 0.901961
------------------------
The f1 score is 0.652482
------------------------




#### Training the Decision Tree model with the Credit Card dataset

In [9]:
train_and_evaluate(data='creditdata',model='DecisionTree')

The following metrics are found when the credit dataset is trained with DecisionTree model
------------------------
The accuracy is 0.999418
------------------------
The precision is 0.722222
------------------------
The recall is 0.890411
------------------------
The f1 score is 0.797546
------------------------




#### Training the Random Forest model with the Credit Card dataset

In [10]:
train_and_evaluate(data='creditdata',model='RandomForest')

The following metrics are found when the credit dataset is trained with RandomForest model
------------------------
The accuracy is 0.999524
------------------------
The precision is 0.722222
------------------------
The recall is 0.970149
------------------------
The f1 score is 0.828025
------------------------




#### Training the XGBoost model with the Credit Card dataset

In [11]:
train_and_evaluate(data='creditdata',model='XGBoost')

The following metrics are found when the credit dataset is trained with XGBoost model
------------------------
The accuracy is 0.999507
------------------------
The precision is 0.733333
------------------------
The recall is 0.942857
------------------------
The f1 score is 0.825
------------------------
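
The eight runs above can be compared side by side. The sketch below copies the printed metrics into a plain list and picks the model with the highest F1 score per dataset. The variable names and the choice of F1 as the ranking criterion are assumptions made here for illustration; in practice the same comparison could be read from the MLflow experiment runs.

```python
# (dataset, model, accuracy, precision, recall, f1), copied from the outputs above.
results = [
    ("fraud", "LogisticRegression", 0.951648, 0.530946, 0.91276, 0.671364),
    ("fraud", "DecisionTree", 0.956214, 0.554185, 0.957002, 0.701907),
    ("fraud", "RandomForest", 0.957008, 0.542803, 0.990909, 0.701394),
    ("fraud", "XGBoost", 0.956986, 0.543277, 0.989633, 0.70147),
    ("credit", "LogisticRegression", 0.999137, 0.511111, 0.901961, 0.652482),
    ("credit", "DecisionTree", 0.999418, 0.722222, 0.890411, 0.797546),
    ("credit", "RandomForest", 0.999524, 0.722222, 0.970149, 0.828025),
    ("credit", "XGBoost", 0.999507, 0.733333, 0.942857, 0.825),
]

# Keep the (model, f1) pair with the highest F1 score for each dataset.
best_by_dataset = {}
for dataset, model, accuracy, precision, recall, f1 in results:
    current = best_by_dataset.get(dataset)
    if current is None or f1 > current[1]:
        best_by_dataset[dataset] = (model, f1)

for dataset, (model, f1) in best_by_dataset.items():
    print(f"{dataset}: best F1 is {f1:.6f} ({model})")
```

On the fraud dataset the three tree-based models are within about 0.0005 of each other on F1, so the ranking there should not be over-interpreted.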


