# Project 4 - Fraud Detection Ethereum - Scikit-Learn - Top 8 Models

- Modelling Ethereum fraud detection with **Scikit-Learn**

- Course Name: Applied Machine Learning
- Course Instructor: Sohail Tehranipour
- Student Name: Afshin Masoudi Ashtiani
- Project 4: Fraud Detection Ethereum
- Date: September 2024
- File (ipynb): 3/3

## Step 1 : Install Necessary Libraries

In [1]:
!pip install pandas numpy joblib
!pip install imbalanced-learn
!pip install lightgbm xgboost
!pip install catboost

Collecting catboost
  Downloading catboost-1.2.7-cp310-cp310-manylinux2014_x86_64.whl.metadata (1.2 kB)
Downloading catboost-1.2.7-cp310-cp310-manylinux2014_x86_64.whl (98.7 MB)
Installing collected packages: catboost
Successfully installed catboost-1.2.7


## Step 2 : Import Required Libraries

In [4]:
import pandas as pd

import warnings
warnings.filterwarnings('ignore')

from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive
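
The mount call above works only inside Google Colab, while the next step keeps a commented-out Windows path for local runs. One way to avoid editing paths by hand is to detect the environment first. The sketch below is an assumption rather than part of the original notebook: `resolve_data_root` is a hypothetical helper, and the local fallback (the current working directory) is a placeholder to adjust.

```python
from pathlib import Path


def resolve_data_root() -> Path:
    """Return the project folder on Colab, or a local fallback elsewhere."""
    try:
        # google.colab can only be imported inside a Colab runtime.
        from google.colab import drive
    except ImportError:
        # Hypothetical local fallback: use the current working directory.
        return Path.cwd()

    drive.mount('/content/drive', force_remount=True)
    return Path('/content/drive/My Drive/Applied Machine Learning/Project 4 : Fraud Detection Etherium')


data_root = resolve_data_root()
dataset_path = data_root / 'datasets' / 'cleaned_transaction_dataset.csv'
print(f'> Dataset path: {dataset_path}')
```

Building paths with `pathlib` also sidesteps the backslash escaping needed in the Windows variant.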


## Step 3 : Load Your Data and Prepare It

In [5]:
# Load the dataset
dataset_path = r'/content/drive/My Drive/Applied Machine Learning/Project 4 : Fraud Detection Etherium/datasets/cleaned_transaction_dataset.csv'
# dataset_path = r'C:\Users\Afshin\Desktop\10_Projects\Project_4_Fraud_Detection_Etherium\datasets\cleaned_transaction_dataset.csv'

df = pd.read_csv(dataset_path)
df.info()
df.head(5)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9288 entries, 0 to 9287
Data columns (total 17 columns):
 #   Column                                         Non-Null Count  Dtype  
---  ------                                         --------------  -----  
 0   FLAG                                           9288 non-null   int64  
 1   Avg min between sent tnx                       9288 non-null   float64
 2   Avg min between received tnx                   9288 non-null   float64
 3   Time difference between first and last (mins)  9288 non-null   float64
 4   Sent tnx                                       9288 non-null   int64  
 5   Received tnx                                   9288 non-null   int64  
 6   Number of created contracts                    9288 non-null   int64  
 7   Max value received                             9288 non-null   float64
 8   Avg value received                             9288 non-null   float64
 9   Avg value sent                                 9288 

Unnamed: 0,FLAG,Avg min between sent tnx,Avg min between received tnx,Time difference between first and last (mins),Sent tnx,Received tnx,Number of created contracts,Max value received,Avg value received,Avg value sent,Total either sent,Total either balance,ERC20 total either received,ERC20 total either sent,ERC20 total either sent contract,ERC20 unique sent address,ERC20 unique received token name
0,0,844.26,1093.71,704785.63,721,89,0,45.806785,6.589513,1.200681,865.691093,-279.224419,35588540.0,35603170.0,0.0,30.0,57.0
1,0,12709.07,2958.44,1218216.73,94,8,0,2.613269,0.385685,0.032844,3.087297,-0.001819,403.4283,2.260809,0.0,1.0,7.0
2,0,246194.54,2434.02,516729.3,2,10,0,1.165453,0.358906,1.794308,3.588616,0.000441,521.5121,0.0,0.0,0.0,8.0
3,0,10219.6,15785.09,397555.9,25,9,0,500.0,99.48884,70.001834,1750.045862,-854.646303,17111.05,11412.23,0.0,2.0,11.0
4,0,36.61,10707.77,382472.42,4598,20,1,12.802411,2.671095,0.022688,104.318883,-50.896986,162829.7,123539.9,0.0,4.0,27.0


- Display Columns and Values

In [6]:
def check_columns(df: pd.DataFrame) -> None:
    """Print the columns grouped by dtype, then the unique values of each column."""
    # Identify feature types
    numeric_features = df.select_dtypes(include=['number']).columns.tolist()
    boolean_features = df.select_dtypes(include=['bool']).columns.tolist()
    categorical_features = df.select_dtypes(include=['object']).columns.tolist()

    print('>> Columns : ')
    print(f'> Numeric Features     is {numeric_features}')
    print(f'> Boolean Features     is {boolean_features}')
    print(f'> Categorical Features is {categorical_features}')
    print('>> Values : ')
    for feature in df.columns:
        print(f'> {feature} : {df[feature].unique().tolist()}')

- Display data

In [7]:
print(">> Loaded Dataset <<")
check_columns(df)

>> Loaded Dataset <<
>> Columns : 
> Numeric Features     is ['FLAG', 'Avg min between sent tnx', 'Avg min between received tnx', 'Time difference between first and last (mins)', 'Sent tnx', 'Received tnx', 'Number of created contracts', 'Max value received', 'Avg value received', 'Avg value sent', 'Total either sent', 'Total either balance', 'ERC20 total either received', 'ERC20 total either sent', 'ERC20 total either sent contract', 'ERC20 unique sent address', 'ERC20 unique received token name']
> Boolean Features     is []
> Categorical Features is []
>> Values : 
> FLAG : [0, 1]
> Avg min between sent tnx : [844.26, 12709.07, 246194.54, 10219.6, 36.61, 9900.12, 69.46, 1497.39, 0.0, 2570.59, 32.45, 3716.41, 9520.7, 14106.66, 757.91, 3.13, 27681.45, 770.29, 163.78, 725.77, 91.14, 2477.34, 1797.42, 155.22, 14626.26, 3.33, 1128.56, 2152.95, 45.91, 1067.84, 1051.87, 2.12, 17441.04, 45561.42, 379.78, 1478.05, 164.22, 1790.44, 803.95, 18649.44, 65.85, 181.7, 23359.72, 20827.93, 4183.18, 

## Step 5 : Split the dataset into features and target

In [8]:
# Features and target variable
X = df.drop(columns='FLAG')
y = df['FLAG']
print(f"> Shape of features and target is {X.shape}, {y.shape}.")

> Shape of features and target is (9288, 16), (9288,).


## Step 6 : Split the dataset into train and test sets

In [9]:
from sklearn.model_selection import train_test_split

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123, stratify=y)
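
Passing `stratify=y` keeps the share of fraud cases the same in both parts of the split, which matters when the positive class is rare. A small sketch of how to check this, using synthetic labels as a stand-in (the 20% positive rate and the `*_demo` names are assumptions, not values from the dataset):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the target: roughly 20% positives (an assumption).
rng = np.random.default_rng(123)
y_demo = (rng.random(1000) < 0.2).astype(int)
X_demo = rng.normal(size=(1000, 3))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=123, stratify=y_demo
)

# With stratify set, the positive rate is nearly identical in both parts.
print(f'> Positive rate (all)  : {y_demo.mean():.3f}')
print(f'> Positive rate (train): {y_tr.mean():.3f}')
print(f'> Positive rate (test) : {y_te.mean():.3f}')
```

In the notebook itself, the same check would be `y_train.mean()` against `y_test.mean()`.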

## Step 7 : Preprocess the Data

- Create a scaler

In [10]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
print(f'> Shape of X_train samples after scaling the data is {X_train_scaled.shape}.')
print(f'> Shape of X_test samples after scaling the data is {X_test_scaled.shape}.')


> Shape of X_train samples after scaling the data is (7430, 16).
> Shape of X_test samples after scaling the data is (1858, 16).


- Save the preprocessor
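
The notebook export contains no code for this step. A minimal sketch of what saving the fitted scaler could look like, assuming `joblib` and a hypothetical file name `scaler.joblib`; a temporary directory stands in for the project's models folder, and a tiny array stands in for `X_train`.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# Stand-in training data (an assumption; use X_train in the notebook).
X_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
scaler_demo = StandardScaler().fit(X_demo)

# Hypothetical location; point this at the project's models folder instead.
scaler_path = Path(tempfile.mkdtemp()) / 'scaler.joblib'
joblib.dump(scaler_demo, scaler_path)

# Reload and confirm the restored scaler transforms data identically.
restored = joblib.load(scaler_path)
same = np.allclose(restored.transform(X_demo), scaler_demo.transform(X_demo))
print(f'> Scaler saved to {scaler_path}, round trip identical: {same}')
```

Persisting the scaler matters because any later prediction must apply the same scaling that the model saw during training.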

## Step 8 : Handle Imbalance using SMOTE

- Apply SMOTE to the scaled data

In [12]:
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=123)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train_scaled, y_train)
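# Note: this also resamples the test set, so the test metrics below are computed on
# balanced, partly synthetic data rather than on the original class distribution.
X_test_resampled, y_test_resampled = smote.fit_resample(X_test_scaled, y_test)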
print(f'> Shape of train features and target samples after resampling the data is {X_train_resampled.shape} and {y_train_resampled.shape}.')
print(f'> Shape of test features and target samples after resampling the data is {X_test_resampled.shape} and {y_test_resampled.shape}.')

> Shape of train features and target samples after resampling the data is (12210, 16) and (12210,).
> Shape of test features and target samples after resampling the data is (3054, 16) and (3054,).


## Step 9 : Hyperparameter Tuning using GridSearchCV

- Define a function using **GridSearchCV**

In [13]:
import time
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.metrics import (classification_report, accuracy_score, recall_score, f1_score, precision_score)

def calculate_metrics(y_true, y_pred, average_type='binary'):
    """Calculate and return accuracy, recall, F1, and precision scores."""
    acc = accuracy_score(y_true, y_pred)
    rec = recall_score(y_true, y_pred, average=average_type)
    f1 = f1_score(y_true, y_pred, average=average_type)
    prec = precision_score(y_true, y_pred, average=average_type)
    return acc, rec, f1, prec

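def tune_model(model, params, X_train, X_test, y_train, y_test, scoring='accuracy', average_type='binary') -> dict:
    """Grid-search `params` for `model` with 5-fold CV, then report train and test metrics.

    Returns a dict of scores, runtime, and the refitted best estimator, or an empty dict if fitting fails.
    """
    start_time = time.time()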

    # Define cross-validation
    cv = KFold(n_splits=5, shuffle=True, random_state=123)

    # Set up GridSearchCV
    grid_search = GridSearchCV(model, params, refit=True, scoring=scoring, cv=cv, verbose=2, n_jobs=-1)

    try:
        # Fit GridSearchCV
        grid_search.fit(X_train, y_train)
    except Exception as e:
        print(f"Error during model fitting: {e}")
        return {}

    # Extract best estimator, parameters, and score
    best_estimator = grid_search.best_estimator_
    best_params = grid_search.best_params_
    best_score = grid_search.best_score_

    # Make predictions
    y_train_pred = grid_search.predict(X_train)
    y_test_pred = grid_search.predict(X_test)

    # Calculate metrics
    train_metrics = calculate_metrics(y_train, y_train_pred, average_type)
    test_metrics = calculate_metrics(y_test, y_test_pred, average_type)

    # Classification report
    report_train = classification_report(y_train, y_train_pred)
    report_test = classification_report(y_test, y_test_pred)

    runtime = time.time() - start_time

    # Print results
    print(f">> Results from Grid Search {'--' * 10}")
    print(f"> Best Score        : {best_score * 100:.2f}%")
    print(f"> Accuracy  -> Train: {train_metrics[0] * 100:.2f}%, Test: {test_metrics[0] * 100:.2f}%")
    print(f"> Recall    -> Train: {train_metrics[1] * 100:.2f}%, Test: {test_metrics[1] * 100:.2f}%")
    print(f"> F1        -> Train: {train_metrics[2] * 100:.2f}%, Test: {test_metrics[2] * 100:.2f}%")
    print(f"> Precision -> Train: {train_metrics[3] * 100:.2f}%, Test: {test_metrics[3] * 100:.2f}%")
    print(f"> Runtime           : {runtime:.4f} seconds")
    print(f"> Best Parameters   : {best_params}")
    print(f"> Classification Report -> Train : \n{report_train}")
    print(f"> Classification Report -> Test  : \n{report_test}")

    return {
        'Best_Score': best_score,
        'Train_Accuracy': train_metrics[0],
        'Test_Accuracy': test_metrics[0],
        'Train_Recall': train_metrics[1],
        'Test_Recall': test_metrics[1],
        'Train_F1': train_metrics[2],
        'Test_F1': test_metrics[2],
        'Train_Precision': train_metrics[3],
        'Test_Precision': test_metrics[3],
        'Runtime': runtime,
        'Best_Estimator': best_estimator,
    }
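
With `average='binary'`, the recall, precision, and F1 values returned by `calculate_metrics` describe class 1 (fraud) only. A small worked example on toy labels (an assumption for illustration) shows how each score follows from the confusion-matrix counts and that the hand-computed values agree with scikit-learn:

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Toy labels for illustration: 1 = fraud, 0 = normal.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# Count the four confusion-matrix cells by hand.
pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)
fp = sum(t == 0 and p == 1 for t, p in pairs)
fn = sum(t == 1 and p == 0 for t, p in pairs)
tn = sum(t == 0 and p == 0 for t, p in pairs)

accuracy = (tp + tn) / len(y_true)   # share of all rows predicted correctly
recall = tp / (tp + fn)              # share of real fraud that was caught
precision = tp / (tp + fp)           # share of fraud alerts that were right
f1 = 2 * precision * recall / (precision + recall)

print(f'> TP={tp}, FP={fp}, FN={fn}, TN={tn}')
print(f'> accuracy={accuracy:.2f}, recall={recall:.2f}, precision={precision:.2f}, f1={f1:.2f}')

# Cross-check against scikit-learn's binary averages.
sk_scores = (
    accuracy_score(y_true, y_pred),
    recall_score(y_true, y_pred),
    precision_score(y_true, y_pred),
    f1_score(y_true, y_pred),
)
```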

- Define models and their respective parameter grids

In [14]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier, RandomForestClassifier, ExtraTreesClassifier)
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

# Set random state for reproducibility
RANDOM_STATE = 123

# Define models and their respective parameter grids
models_param_grids = {
    'Extra Trees Classifier': (ExtraTreesClassifier(random_state=RANDOM_STATE), {
        'n_estimators': [50, 100, 200],
        'max_depth': [None, 10, 20, 30],
        'max_features': ['auto', 'sqrt', 'log2'],  # 'auto' was removed in scikit-learn 1.3; on newer versions those candidates fail to fit
        'bootstrap': [True, False]
    }),

    'CatBoost Classifier': (CatBoostClassifier(verbose=0), {
        'iterations': [100, 200, 300],
        'learning_rate': [0.01, 0.1, 0.2],
        'depth': [4, 6, 8, 10],
        'l2_leaf_reg': [1, 3, 5],
    }),

    'LGBM Classifier': (LGBMClassifier(random_state=RANDOM_STATE), {
        'n_estimators': [100, 200, 300],
        'learning_rate': [0.01, 0.1, 0.2],
        'max_depth': [-1, 3, 5, 10],
        'num_leaves': [31, 50, 100],
        'subsample': [0.5, 0.7, 1.0],
        'colsample_bytree': [0.5, 0.7, 1.0],
    }),

    'XGBoost Classifier': (XGBClassifier(random_state=RANDOM_STATE), {
        'n_estimators': [100, 200, 300],
        'learning_rate': [0.01, 0.1, 0.2],
        'max_depth': [3, 5, 7, 10],
        'subsample': [0.5, 0.7, 1.0],
        'colsample_bytree': [0.5, 0.7, 1.0],
        'objective': ['binary:logistic']  # for binary classification
    }),

    'Random Forest Classifier': (RandomForestClassifier(random_state=RANDOM_STATE), {
        'n_estimators': [50, 100, 200],
        'max_features': ['auto', 'sqrt', 'log2'],  # 'auto' was removed in scikit-learn 1.3; on newer versions those candidates fail to fit
        'max_depth': [None, 10, 20, 30, 40, 50],
        'bootstrap': [True, False],
    }),

    'Ada Boost Classifier': (AdaBoostClassifier(random_state=RANDOM_STATE), {
        'n_estimators': [50, 100, 200, 300],
        'learning_rate': [0.01, 0.1, 0.5, 1.0],
        'algorithm': ['SAMME', 'SAMME.R']
    }),

    'Gradient Boosting Classifier': (GradientBoostingClassifier(random_state=RANDOM_STATE), {
        'n_estimators': [50, 100, 200],
        'learning_rate': [0.01, 0.1, 0.2, 0.3],
        'max_depth': [3, 5, 10],
        'subsample': [0.5, 0.8, 1.0],
        'max_features': [None, 'sqrt', 'log2']
    }),

    'SVM-Linear Classifier': (SVC(kernel='linear', random_state=RANDOM_STATE), {
        'C': [0.1, 1, 10, 100],
        # gamma and degree have no effect with a linear kernel; they only multiply the number of fits
        'gamma': ['scale', 'auto', 0.1, 1, 10],
        'degree': [2, 3, 4]
    }),

    'Logistic Regression': (LogisticRegression(random_state=RANDOM_STATE), {
        'C': [0.01, 0.1, 1, 10, 100],
        'penalty': ['l1', 'l2'],
        'solver': ['liblinear', 'saga'],
        'max_iter': [100, 200, 300],
        'class_weight': [None, 'balanced']
    }),
}
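
Grid search fits every combination in a grid once per fold, so the cost grows multiplicatively with each added parameter. A quick way to see this before starting a long run is to count the candidates with `ParameterGrid`. The sketch copies two of the grids above; the variable names are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Two of the grids from the cell above, copied for counting only.
extra_trees_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'max_features': ['auto', 'sqrt', 'log2'],
    'bootstrap': [True, False],
}
lgbm_grid = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [-1, 3, 5, 10],
    'num_leaves': [31, 50, 100],
    'subsample': [0.5, 0.7, 1.0],
    'colsample_bytree': [0.5, 0.7, 1.0],
}

N_SPLITS = 5  # matches KFold(n_splits=5) in tune_model
fit_counts = {}
for name, grid in [('Extra Trees', extra_trees_grid), ('LGBM', lgbm_grid)]:
    n_candidates = len(ParameterGrid(grid))
    fit_counts[name] = (n_candidates, n_candidates * N_SPLITS)
    print(f'> {name}: {n_candidates} candidates x {N_SPLITS} folds = {n_candidates * N_SPLITS} fits')
```

The Extra Trees count of 72 candidates and 360 fits matches the log line printed during tuning; the LGBM grid is more than ten times larger, which is in line with its longer runtime.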

- Tune each model

In [15]:
# Store results
results = []

# Tune each model
for model_name, (model, param_grid) in models_param_grids.items():
    print(f"\n>>>> Tuning {model_name} ......")
    new_row = {'Name' : model_name} | tune_model(model, param_grid, X_train_resampled, X_test_resampled, y_train_resampled, y_test_resampled)
    results.append(new_row)

results_df = pd.DataFrame(results)


>>>> Tuning Extra Trees Classifier ......
Fitting 5 folds for each of 72 candidates, totalling 360 fits
>> Results from Grid Search --------------------
> Best Score        : 98.44%
> Accuracy  -> Train: 99.99%, Test: 95.94%
> Recall    -> Train: 99.98%, Test: 93.78%
> F1        -> Train: 99.99%, Test: 95.85%
> Precision -> Train: 100.00%, Test: 98.02%
> Runtime           : 207.5114 seconds
> Best Parameters   : {'bootstrap': False, 'max_depth': None, 'max_features': 'sqrt', 'n_estimators': 100}
> Classification Report -> Train : 
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      6105
           1       1.00      1.00      1.00      6105

    accuracy                           1.00     12210
   macro avg       1.00      1.00      1.00     12210
weighted avg       1.00      1.00      1.00     12210

> Classification Report -> Test  : 
              precision    recall  f1-score   support

           0       0.94      0.98      0.96 

- Store the results in a DataFrame for analysis

In [16]:
# Sort the results in a DataFrame for analysis
sorted_results_df = results_df[results_df.Best_Score > 0.7].sort_values(by=['Best_Score'], ascending=[False])

best_score_index = results_df.Best_Score.idxmax()
test_acc_index = results_df.Test_Accuracy.idxmax()
test_rec_index = results_df.Test_Recall.idxmax()
test_f1_index = results_df.Test_F1.idxmax()
test_prec_index = results_df.Test_Precision.idxmax()

best_score_model = results_df.loc[best_score_index, 'Best_Estimator']
test_acc_model = results_df.loc[test_acc_index, 'Best_Estimator']
test_rec_model = results_df.loc[test_rec_index, 'Best_Estimator']
test_f1_model = results_df.loc[test_f1_index, 'Best_Estimator']
test_prec_model = results_df.loc[test_prec_index, 'Best_Estimator']

- Summary of Model Tuning Results

In [17]:
print(">> Summary of Model Tuning Results :")
print(f"> Maximum Best Score     Model : {results_df.loc[best_score_index, 'Name']}")
print(f"> Maximum Test Accuracy  Model : {results_df.loc[test_acc_index, 'Name']}")
print(f"> Maximum Test Recall    Model : {results_df.loc[test_rec_index, 'Name']}")
print(f"> Maximum Test F1        Model : {results_df.loc[test_f1_index, 'Name']}")
print(f"> Maximum Test Precision Model : {results_df.loc[test_prec_index, 'Name']}")
sorted_results_df

>> Summary of Model Tuning Results :
> Maximum Best Score     Model : LGBM Classifier
> Maximum Test Accuracy  Model : LGBM Classifier
> Maximum Test Recall    Model : LGBM Classifier
> Maximum Test F1        Model : LGBM Classifier
> Maximum Test Precision Model : LGBM Classifier


| | Name | Best_Score | Train_Accuracy | Test_Accuracy | Train_Recall | Test_Recall | Train_F1 | Test_F1 | Train_Precision | Test_Precision | Runtime | Best_Estimator |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | LGBM Classifier | 0.99009 | 1.0 | 0.978716 | 1.0 | 0.971185 | 1.0 | 0.978555 | 1.0 | 0.986037 | 2736.284101 | `LGBMClassifier(learning_rate=0.2, n_estimators...` |
| 1 | CatBoost Classifier | 0.98665 | 0.999509 | 0.97184 | 1.0 | 0.962017 | 0.999509 | 0.971561 | 0.999018 | 0.981296 | 2559.751323 | `<catboost.core.CatBoostClassifier object at 0x...` |
| 3 | XGBoost Classifier | 0.98665 | 1.0 | 0.97315 | 1.0 | 0.965291 | 1.0 | 0.972937 | 1.0 | 0.980705 | 949.869062 | `XGBClassifier(base_score=None, booster=None, c...` |
| 6 | Gradient Boosting Classifier | 0.98493 | 0.999918 | 0.974787 | 1.0 | 0.967911 | 0.999918 | 0.974613 | 0.999836 | 0.981408 | 6105.963592 | `([DecisionTreeRegressor(criterion='friedman_ms...` |
| 0 | Extra Trees Classifier | 0.984357 | 0.999918 | 0.959398 | 0.999836 | 0.937787 | 0.999918 | 0.958501 | 1.0 | 0.980151 | 207.511359 | `(ExtraTreeClassifier(random_state=843828734), ...` |
| 4 | Random Forest Classifier | 0.984029 | 0.999918 | 0.967583 | 1.0 | 0.954158 | 0.999918 | 0.967142 | 0.999836 | 0.980485 | 1027.697197 | `(DecisionTreeClassifier(max_depth=20, max_feat...` |
| 5 | Ada Boost Classifier | 0.9638 | 0.973055 | 0.953831 | 0.979361 | 0.956123 | 0.973224 | 0.953937 | 0.967163 | 0.95176 | 423.252669 | `(DecisionTreeClassifier(max_depth=1, random_st...` |
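
Every model except AdaBoost reaches near-perfect training scores, so the gap between training and test accuracy is a useful quick check for overfitting. The sketch below copies the two accuracy columns from the results table and ranks the models by that gap; the dictionary layout is only for illustration.

```python
# Train and test accuracy copied from the results table above.
accuracy = {
    'LGBM Classifier': (1.0, 0.978716),
    'CatBoost Classifier': (0.999509, 0.97184),
    'XGBoost Classifier': (1.0, 0.97315),
    'Gradient Boosting Classifier': (0.999918, 0.974787),
    'Extra Trees Classifier': (0.999918, 0.959398),
    'Random Forest Classifier': (0.999918, 0.967583),
    'Ada Boost Classifier': (0.973055, 0.953831),
}

# Gap = train accuracy minus test accuracy; smaller means less overfitting.
gaps = {name: train - test for name, (train, test) in accuracy.items()}

for name, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f'> {name:<30}: gap = {gap * 100:.2f} points')
```

AdaBoost shows the smallest gap (about 1.9 points) and Extra Trees the largest (about 4.1 points), while LGBM combines the highest test accuracy with one of the smaller gaps.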


- Save the tuned models

In [18]:
import pickle
import joblib
from typing import Any

def save_to_file(pkl: bool, path: str, model: Any) -> None:
    """Saves the specified model to a file using either pickle or joblib."""
    try:
        if pkl:
            with open(path, 'wb') as file:
                pickle.dump(model, file)
        else:
            joblib.dump(model, path)

        print(f"> Model saved to {path}")
    except Exception as err:
        print(f"> Error occurred while saving the model: {err}")

def load_from_file(pkl: bool, path: str) -> Any:
    """Loads and returns a model from a file using either pickle or joblib."""
    try:
        if pkl:
            with open(path, 'rb') as file:
                return pickle.load(file)
        else:
            return joblib.load(path)

    except Exception as err:
        print(f"> Error occurred while loading the model: {err}")
        return None

In [19]:
for index, row in sorted_results_df.iterrows():
    tuned_best_model_name = row.Name.replace(' ', '')
    tuned_best_model = row.Best_Estimator

    # model_path = f"C:\\Users\\Afshin\\Desktop\\10_Projects\\Project_4_Fraud_Detection_Etherium\\models\\{tuned_best_model_name}.joblib"
    model_path = f'/content/drive/My Drive/Applied Machine Learning/Project 4 : Fraud Detection Etherium/models/{tuned_best_model_name}.joblib'
    save_to_file(pkl=False, path=model_path, model=tuned_best_model)

    if tuned_best_model_name == 'XGBoostClassifier':
        # model_path = f"C:\\Users\\Afshin\\Desktop\\10_Projects\\Project_4_Fraud_Detection_Etherium\\models\\{tuned_best_model_name}.pkl"
        model_path = f'/content/drive/My Drive/Applied Machine Learning/Project 4 : Fraud Detection Etherium/models/{tuned_best_model_name}.pkl'

        # Note: save_model writes XGBoost's native model format, not a pickle, despite the .pkl extension
        tuned_best_model.save_model(model_path)
        print(f"> Model saved to {model_path}")

> Model saved to /content/drive/My Drive/Applied Machine Learning/Project 4 : Fraud Detection Etherium/models/LGBMClassifier.joblib
> Model saved to /content/drive/My Drive/Applied Machine Learning/Project 4 : Fraud Detection Etherium/models/CatBoostClassifier.joblib
> Model saved to /content/drive/My Drive/Applied Machine Learning/Project 4 : Fraud Detection Etherium/models/XGBoostClassifier.joblib
> Model saved to /content/drive/My Drive/Applied Machine Learning/Project 4 : Fraud Detection Etherium/models/XGBoostClassifier.pkl
> Model saved to /content/drive/My Drive/Applied Machine Learning/Project 4 : Fraud Detection Etherium/models/GradientBoostingClassifier.joblib
> Model saved to /content/drive/My Drive/Applied Machine Learning/Project 4 : Fraud Detection Etherium/models/ExtraTreesClassifier.joblib
> Model saved to /content/drive/My Drive/Applied Machine Learning/Project 4 : Fraud Detection Etherium/models/RandomForestClassifier.joblib
> Model saved to /content/drive/My Drive/Ap

## Step 10 : Make Predictions

- Load the tuned best model

In [20]:
model_path = r'/content/drive/My Drive/Applied Machine Learning/Project 4 : Fraud Detection Etherium/models/LGBMClassifier.joblib'
# model_path = 'C:\\Users\\Afshin\\Desktop\\10_Projects\\Project_4_Fraud_Detection_Etherium\\models\\LGBMClassifier.joblib'

# Load the model from the joblib file
loaded_tuned_best_model = load_from_file(pkl=False, path=model_path)
loaded_tuned_best_model

- Predict a random sample

In [22]:
sample = df.sample(1)
X_sample = sample.drop(['FLAG'], axis=1)
y_sample = sample.FLAG

X_sample_processed = scaler.transform(X_sample)
y_sample_pred = loaded_tuned_best_model.predict(X_sample_processed)

sample['Predicted FLAG'] = y_sample_pred[0]
print(f'> FLAG           : {y_sample.values[0]}')
print(f'> Predicted FLAG : {y_sample_pred[0]}')
sample

> FLAG           : 0
> Predicted FLAG : 0


Unnamed: 0,FLAG,Avg min between sent tnx,Avg min between received tnx,Time difference between first and last (mins),Sent tnx,Received tnx,Number of created contracts,Max value received,Avg value received,Avg value sent,Total either sent,Total either balance,ERC20 total either received,ERC20 total either sent,ERC20 total either sent contract,ERC20 unique sent address,ERC20 unique received token name,Predicted FLAG
2352,0,18.32,0.0,54.97,3,1,0,101.0,101.0,33.665988,100.997965,0.002035,0.0,0.0,0.0,0.0,0.0,0
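
`df.sample(1)` draws from the full dataset, so the row may be one the model already saw during training. A sketch of scoring a held-out row instead and reading the class probabilities with `predict_proba`; the synthetic data and the `LogisticRegression` model are stand-ins (assumptions) for the transaction data and the loaded LightGBM model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with 16 features, like the notebook's X.
X_demo, y_demo = make_classification(n_samples=500, n_features=16, random_state=123)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=123, stratify=y_demo
)

# Fit the scaler on training rows only, then train the stand-in model.
scaler_demo = StandardScaler().fit(X_tr)
model_demo = LogisticRegression(max_iter=1000).fit(scaler_demo.transform(X_tr), y_tr)

# Pick one held-out row; double brackets keep it 2-D for transform/predict.
row = X_te[[0]]
row_scaled = scaler_demo.transform(row)

pred = model_demo.predict(row_scaled)[0]
proba = model_demo.predict_proba(row_scaled)[0]  # [P(class 0), P(class 1)]

print(f'> FLAG              : {y_te[0]}')
print(f'> Predicted FLAG    : {pred}')
print(f'> Fraud probability : {proba[1]:.3f}')
```

A probability is often more useful than the hard label for fraud screening, since the alert threshold can then be chosen to trade recall against precision.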
