# SMOTE-Tomek for Ethereum Fraud Detection

In [1]:
using Imbalance
using MLJBalancing
using CSV
using DataFrames
using ScientificTypes
using CategoricalArrays
using MLJ
using Plots
using Random
using Impute

## Loading Data
In this example, we will consider the [Ethereum Fraud Detection Dataset](https://www.kaggle.com/datasets/vagifa/ethereum-frauddetection-dataset) found on Kaggle, where the objective is to predict whether an Ethereum transaction is fraudulent (the `FLAG` column) given some features of the transaction.

`CSV` lets us read the dataset once it has been downloaded:

In [21]:
df = CSV.read("../datasets/transactions.csv", DataFrame)
first(df, 5) |> pretty

Several columns carry no useful information and can be dropped, such as `Column1`, `Index` and probably `Address`. We also drop the two categorical features because SMOTE cannot handle them, and removing them leaves us with more options for the model.

In [22]:
df = df[:,
	Not([
		:Column1,
		:Index,
		:Address,
		Symbol(" ERC20 most sent token type"),
		Symbol(" ERC20_most_rec_token_type"),
	]),
] 
first(df, 5) |> pretty

If you scroll through the printed data frame, you will find that some columns also have `Missing` in their element type, meaning that they may contain missing values. We will use *linear interpolation*, *last observation carried forward* and *next observation carried backward* to fill in the missing values. This allows us to call `disallowmissing!(df)`, which leaves a data frame where `Missing` is not part of the element type of any column.

In [23]:
# Interpolate first, then let locf and nocb fill whatever is still missing
df = Impute.interp(df) |> Impute.locf() |> Impute.nocb()
disallowmissing!(df)
first(df, 5) |> pretty
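
To make the three imputation steps concrete, here is a minimal sketch of what each one does to a single column. The helpers `interp_sketch`, `locf_sketch` and `nocb_sketch` are hypothetical names written for illustration and are not part of `Impute`. The sketch assumes that interpolation only fills gaps that have a known value on both sides, which is why the other two steps follow it.

```julia
# Toy versions of the three imputation steps on one vector.
# Illustrative helpers only; this is not the Impute.jl implementation.

# Linear interpolation: fill gaps that are bounded by known values on both sides.
function interp_sketch(v::AbstractVector)
    out = Vector{Union{Missing, Float64}}(v)
    known = findall(!ismissing, out)
    for (a, b) in zip(known[1:end-1], known[2:end])
        for i in (a + 1):(b - 1)
            t = (i - a) / (b - a)
            out[i] = (1 - t) * out[a] + t * out[b]
        end
    end
    return out
end

# Last observation carried forward: fill a gap with the previous known value.
function locf_sketch(v::AbstractVector)
    out = copy(v)
    for i in 2:length(out)
        if ismissing(out[i]) && !ismissing(out[i - 1])
            out[i] = out[i - 1]
        end
    end
    return out
end

# Next observation carried backward: fill a gap with the next known value.
function nocb_sketch(v::AbstractVector)
    out = copy(v)
    for i in (length(out) - 1):-1:1
        if ismissing(out[i]) && !ismissing(out[i + 1])
            out[i] = out[i + 1]
        end
    end
    return out
end

# Hypothetical column with a gap at the start, in the middle and at the end
v = [missing, 1.0, missing, 3.0, missing]
filled = v |> interp_sketch |> locf_sketch |> nocb_sketch
```

In this toy column, interpolation can only fill the middle gap; the trailing gap is handled by the forward fill and the leading gap by the backward fill.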

## Coercing Data

Let's look at the schema first:

In [24]:
ScientificTypes.schema(df)

┌──────────────────────────────────────────────────────┬────────────┬─────────┐
│ names                                                │ scitypes   │ types   │
├──────────────────────────────────────────────────────┼────────────┼─────────┤
│ FLAG                                                 │ Count      │ Int64   │
│ Avg min between sent tnx                             │ Continuous │ Float64 │
│ Avg min between received tnx                         │ Continuous │ Float64 │
│ Time Diff between first and last (Mins)              │ Continuous │ Float64 │
│ Sent tnx                                             │ Count      │ Int64   │
│ Received Tnx                                         │ Count      │ Int64   │
│ Number of Created Contracts                          │ Count      │ Int64   │
│ Unique Received From Addresses                       │ Count      │ Int64   │
│ Unique Sent To Addresses                             │ Count      │ Int64   │
│ min value r

The `FLAG` target should be `Multiclass` rather than `Count`; the rest seems fine.

In [25]:
df = coerce(df, :FLAG => Multiclass)
ScientificTypes.schema(df)

┌──────────────────────────────────────────────────────┬───────────────┬────────
│ names                                                │ scitypes      │ types ⋯
├──────────────────────────────────────────────────────┼───────────────┼────────
│ FLAG                                                 │ Multiclass{2} │ Categ ⋯
│ Avg min between sent tnx                             │ Continuous    │ Float ⋯
│ Avg min between received tnx                         │ Continuous    │ Float ⋯
│ Time Diff between first and last (Mins)              │ Continuous    │ Float ⋯
│ Sent tnx                                             │ Count         │ Int64 ⋯
│ Received Tnx                                         │ Count         │ Int64 ⋯
│ Number of Created Contracts                          │ Count         │ Int64 ⋯
│ Unique Received From Addresses                       │ Count         │ Int64 ⋯
│ Unique Sent To Addresses                             │ Count         │ Int64 ⋯
│

## Unpacking and Splitting Data

Both `MLJ` and the pure functional interface of `Imbalance` assume that the observations table `X` and the target vector `y` are separate. We can accomplish that by using `unpack` from `MLJ`:

In [26]:
y, X = unpack(df, ==(:FLAG); rng=123);
first(X, 5) |> pretty

Splitting the data into train and test portions is also easy using `MLJ`'s `partition` function.

In [None]:
(X_train, X_test), (y_train, y_test) = partition(
	(X, y),
	0.8,
	multi = true,
	shuffle = true,
	stratify = y,
	rng = Random.Xoshiro(41)
)
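
The effect of `stratify = y` can be illustrated with a small sketch: each class is shuffled and split separately, so both parts keep roughly the same class proportions. The helper `stratified_split_sketch` and the toy labels are hypothetical and only show the idea; this is not how `partition` is implemented.

```julia
using Random

# Split the indices of `y` so that each class keeps roughly the same share
# in the train and test parts. Illustrative helper, not MLJ's `partition`.
function stratified_split_sketch(y::AbstractVector, fraction::Real; rng=Random.default_rng())
    train, test = Int[], Int[]
    for class in unique(y)
        idx = shuffle(rng, findall(==(class), y))
        n_train = round(Int, fraction * length(idx))
        append!(train, idx[1:n_train])
        append!(test, idx[n_train+1:end])
    end
    return shuffle(rng, train), shuffle(rng, test)
end

# Hypothetical labels with a 4:1 imbalance
y_toy = vcat(fill(0, 80), fill(1, 20))
train_idx, test_idx = stratified_split_sketch(y_toy, 0.8; rng=Random.Xoshiro(41))

# Share of class 1 in a set of row indices
share(idx) = count(==(1), y_toy[idx]) / length(idx)
train_share, test_share = share(train_idx), share(test_idx)
```

With these toy labels both parts contain 20% positives, whereas a plain shuffled split would only match that share on average.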

## Resampling



Before deciding to oversample, let's see how severe the imbalance is, if it exists at all. Ideally, you would also check whether the classification model is robust to this problem.

In [28]:
checkbalance(y)         # comes from Imbalance

This signals a potential class imbalance problem. Let's consider using SMOTE-Tomek to resample the data. The SMOTE-Tomek algorithm is nothing but `SMOTE` followed by `TomekUndersampler`. We can wrap both in a pipeline together with a classification model using `BalancedModel` from `MLJBalancing`. For the model, let's use a `RandomForestClassifier` from `DecisionTree.jl`.

In [None]:
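# `@load ... pkg=DecisionTree` also needs the MLJ interface package to be installed
import Pkg; Pkg.add(["DecisionTree", "MLJDecisionTreeInterface"])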
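
The two stages of SMOTE-Tomek described above can be sketched in a few lines of plain Julia. This is a simplified illustration under stated assumptions: observations are stored as columns, there are no duplicate points, and the helper names `smote_point_sketch` and `tomek_links_sketch` are hypothetical. It is not the `Imbalance` implementation, which also decides how many points to generate and which member of a link to remove.

```julia
using Random
using LinearAlgebra

# SMOTE idea: create a synthetic point on the segment between a minority point
# and one of its k nearest minority neighbours.
function smote_point_sketch(X_min::AbstractMatrix, i::Int, k::Int; rng=Random.default_rng())
    x = X_min[:, i]
    dists = [norm(x - X_min[:, j]) for j in axes(X_min, 2)]
    neighbours = sortperm(dists)[2:k+1]      # position 1 is the point itself
    x_nn = X_min[:, rand(rng, neighbours)]
    return x + rand(rng) * (x_nn - x)
end

# Tomek link: two points from different classes that are each other's nearest neighbour.
function tomek_links_sketch(X::AbstractMatrix, y::AbstractVector)
    n = size(X, 2)
    nearest(i) = argmin([j == i ? Inf : norm(X[:, i] - X[:, j]) for j in 1:n])
    nn = [nearest(i) for i in 1:n]
    return [(i, nn[i]) for i in 1:n if nn[nn[i]] == i && i < nn[i] && y[i] != y[nn[i]]]
end

# Hypothetical minority points (one per column); the synthetic point lies
# between point 1 and one of its two neighbours
X_min = [0.0 1.0 0.0;
         0.0 0.0 1.0]
synthetic = smote_point_sketch(X_min, 1, 2; rng=Random.Xoshiro(42))

# Hypothetical labelled points on a line; points 2 and 3 are mutual nearest
# neighbours with different labels, so they form a Tomek link
X_toy = [0.0 1.0 1.1 5.0;
         0.0 0.0 0.0 0.0]
y_toy = [0, 0, 1, 1]
links = tomek_links_sketch(X_toy, y_toy)
```

Oversampling adds minority points in the interior of the minority region, while removing points that take part in Tomek links cleans up the boundary between the classes.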

#### Construct the Resampling & Classification Models

In [29]:
oversampler = Imbalance.MLJ.SMOTE(ratios=Dict(1=>0.5), rng=Random.Xoshiro(42))
undersampler = Imbalance.MLJ.TomekUndersampler(min_ratios=Dict(0=>1.3), force_min_ratios=true)
RandomForestClassifier = @load RandomForestClassifier pkg=DecisionTree
model = RandomForestClassifier(n_trees=2, rng=Random.Xoshiro(42))

RandomForestClassifier(
  max_depth = -1, 
  min_samples_leaf = 1, 
  min_samples_split = 2, 
  min_purity_increase = 0.0, 
  n_subfeatures = -1, 
  n_trees = 2, 
  sampling_fraction = 0.7, 
  feature_importance = :impurity, 
  rng = Xoshiro(0xa379de7eeeb2a4e8, 0x953dccb6b532b3af, 0xf597b8ff8cfd652a, 0xccd7337c571680d1))

#### Form the Pipeline using `BalancedModel`

In [30]:
balanced_model = BalancedModel(model=model, balancer1=oversampler, balancer2=undersampler)

BalancedModelProbabilistic(
  model = RandomForestClassifier(
        max_depth = -1, 
        min_samples_leaf = 1, 
        min_samples_split = 2, 
        min_purity_increase = 0.0, 
        n_subfeatures = -1, 
        n_trees = 2, 
        sampling_fraction = 0.7, 
        feature_importance = :impurity, 
        rng = Xoshiro(0xa379de7eeeb2a4e8, 0x953dccb6b532b3af, 0xf597b8ff8cfd652a, 0xccd7337c571680d1)), 
  balancer1 = SMOTE(
        k = 5, 
        ratios = Dict(1 => 0.5), 
        rng = Xoshiro(0xa379de7eeeb2a4e8, 0x953dccb6b532b3af, 0xf597b8ff8cfd652a, 0xccd7337c571680d1), 
        try_preserve_type = true), 
  balancer2 = TomekUndersampler(
        min_ratios = Dict(0 => 1.3), 
        force_min_ratios = true, 
        rng = TaskLocalRNG(), 
        try_preserve_type = true))

Now we can treat `balanced_model` like any `MLJ` model.

#### Fit the `BalancedModel`

In [31]:
# Wrap the balanced model with the training data in a machine
mach_over = machine(balanced_model, X_train, y_train)

# Fit the machine (resampling is applied to the training data before the classifier is trained)
fit!(mach_over, verbosity=0)

trained Machine; does not cache data
  model: BalancedModelProbabilistic(model = RandomForestClassifier(max_depth = -1, …), …)
  args: 
    1:	Source @967 ⏎ Table{Union{AbstractVector{Continuous}, AbstractVector{Count}}}
    2:	Source @913 ⏎ AbstractVector{Multiclass{2}}


#### Validate the `BalancedModel`

In [32]:
cv = CV(nfolds=10)
evaluate!(mach_over, resampling=cv, measure=balanced_accuracy)

PerformanceEvaluation object with these fields:
  model, measure, operation, measurement, per_fold,
  per_observation, fitted_params_per_fold,
  report_per_fold, train_test_rows, resampling, repeats
Extract:
┌─────────────────────┬──────────────┬─────────────┬─────────┬──────────────────
│ measure             │ operation    │ measurement │ 1.96*SE │ per_fold        ⋯
├─────────────────────┼──────────────┼─────────────┼─────────┼──────────────────
│ BalancedAccuracy(   │ predict_mode │ 0.93        │ 0.00757 │ [0.927, 0.936,  ⋯
│   adjusted = false) │              │             │         │                 ⋯
└─────────────────────┴──────────────┴─────────────┴─────────┴──────────────────
                                                                1 column omitted
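
Balanced accuracy is used here instead of plain accuracy because it averages the recall of each class, so the majority class cannot dominate the score. A minimal sketch of that definition follows; `balanced_accuracy_sketch` and the toy labels are hypothetical and only illustrate the measure, the evaluation above uses MLJ's `balanced_accuracy`.

```julia
# Balanced accuracy as the unweighted mean of per-class recall.
# Illustrative helper only.
function balanced_accuracy_sketch(y_pred::AbstractVector, y_true::AbstractVector)
    recalls = map(unique(y_true)) do class
        mask = y_true .== class
        count(y_pred[mask] .== class) / count(mask)
    end
    return sum(recalls) / length(recalls)
end

# Hypothetical labels: 8 negatives, 2 positives, and a classifier that always predicts 0
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = zeros(Int, 10)

plain_accuracy = count(y_pred .== y_true) / length(y_true)
balanced = balanced_accuracy_sketch(y_pred, y_true)
```

On these toy labels the always-negative classifier reaches a plain accuracy of 0.8 but a balanced accuracy of only 0.5, which is what makes the measure suitable for imbalanced targets.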


#### Compare with `RandomForestClassifier` only

To see whether this represents any form of improvement, we fit and validate the original model by itself.

In [33]:
# Wrap the plain model with the data in a machine
mach = machine(model, X_train, y_train, scitype_check_level=0)
fit!(mach)

evaluate!(mach, resampling=cv, measure=balanced_accuracy)

PerformanceEvaluation object with these fields:
  model, measure, operation, measurement, per_fold,
  per_observation, fitted_params_per_fold,
  report_per_fold, train_test_rows, resampling, repeats
Extract:
┌─────────────────────┬──────────────┬─────────────┬─────────┬──────────────────
│ measure             │ operation    │ measurement │ 1.96*SE │ per_fold        ⋯
├─────────────────────┼──────────────┼─────────────┼─────────┼──────────────────
│ BalancedAccuracy(   │ predict_mode │ 0.908       │ 0.00932 │ [0.903, 0.898,  ⋯
│   adjusted = false) │              │             │         │                 ⋯
└─────────────────────┴──────────────┴─────────────┴─────────┴──────────────────
                                                                1 column omitted


Assuming normally distributed scores, the `95%` confidence interval for balanced accuracy was `90.8±0.9` without resampling and has become `93.0±0.8` with it (both in percent), which corresponds to a small improvement.
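
The comparison can be checked directly from the `measurement` and `1.96*SE` columns of the two tables. The sketch below only does that arithmetic; the helper `interval` is a hypothetical name, and because the tables show rounded values the interval bounds are approximate.

```julia
# Interval implied by an evaluation table row: measurement ± 1.96*SE
interval(m, half_width) = (m - half_width, m + half_width)

baseline  = interval(0.908, 0.00932)   # RandomForestClassifier alone
resampled = interval(0.930, 0.00757)   # SMOTE + TomekUndersampler + RandomForestClassifier

# The intervals are disjoint when the lower end of one lies above the upper end of the other
disjoint = resampled[1] > baseline[2]
```

With the reported values the two intervals do not overlap, which supports reading the difference as a real, if small, improvement.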
