In [1]:
using Imbalance
using CSV
using DataFrames
using ScientificTypes
using CategoricalArrays
using MLJ
using Plots
using Random

## Loading Data
In this example, we will consider the [Churn for Bank Customers](https://www.kaggle.com/datasets/mathchi/churn-for-bank-customers) dataset found on Kaggle, where the objective is to predict whether a customer is likely to leave a bank given financial and demographic features.

`CSV` lets us read the dataset easily once it has been downloaded:

In [2]:
df = CSV.read("../datasets/churn.csv", DataFrame)
first(df, 5) |> pretty

┌───────────┬────────────┬──────────┬─────────────┬───────────┬─────────┬───────┬────────┬────────────┬───────────────┬───────────┬────────────────┬─────────────────┬────────┐
│ RowNumber │ CustomerId │ Surname  │ CreditScore │ Geography │ Gender  │ Age   │ Tenure │ Balance    │ NumOfProducts │ HasCrCard │ IsActiveMember │ EstimatedSalary │ Exited │
│ Int64     │ Int64      │ String31 │ Int64       │ String7   │ String7 │ Int64 │ Int64  │ Float64    │ Int64         │ Int64     │ Int64          │ Float64         │ Int64  │
│ Count     │ Count      │ Textual  │ Count       │ Textual   │ Textual │ Count │ Count  │ Continuous │ Count         │ Count     │

Several columns carry no useful information and can be dropped, such as `RowNumber`, `CustomerId` and `Surname`. We also have to drop the categorical features (`Geography` and `Gender`) because SMOTE cannot handle them; variants such as SMOTE-NC can, and we will consider them in another tutorial.

In [3]:
df = df[:, Not([:RowNumber, :CustomerId, :Surname, 
           :Geography, :Gender])]

first(df, 5) |> pretty

┌─────────────┬───────┬────────┬────────────┬───────────────┬───────────┬────────────────┬─────────────────┬────────┐
│ CreditScore │ Age   │ Tenure │ Balance    │ NumOfProducts │ HasCrCard │ IsActiveMember │ EstimatedSalary │ Exited │
│ Int64       │ Int64 │ Int64  │ Float64    │ Int64         │ Int64     │ Int64          │ Float64         │ Int64  │
│ Count       │ Count │ Count  │ Continuous │ Count         │ Count     │ Count          │ Continuous      │ Count  │
├─────────────┼───────┼────────┼────────────┼───────────────┼───────────┼────────────────┼─────────────────┼────────┤
│ 619.0       │ 42.0  │ 2.0    │ 0.0        │ 1.0           │ 1.0       │ 1.0            │ 1.01349e5       │ 1.0    │
│ 608.0       │ 41.0  │ 1.0    │ 83807.9    │ 1.0         

Ideally, we would also remove the ordinal variables, because SMOTE treats them as continuous and the synthetic data it generates will take floating-point values that will not occur in future data. Some models may be robust to this, however, and the main purpose of this tutorial is to later compare SMOTE-NC with SMOTE, so we keep them.
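To see where the fractional values come from, recall that SMOTE creates each synthetic point by interpolating between a minority point and one of its nearest neighbours. The following is a minimal sketch of that single step in plain Julia, not the `Imbalance` implementation; the helper name `interpolate_point` and the two rows are made up for illustration.

```julia
using Random

# Hypothetical helper: one SMOTE-style interpolation step between a point
# and one of its neighbours, x_new = x + λ (x_neighbor - x) with λ in [0, 1)
function interpolate_point(x::AbstractVector, x_neighbor::AbstractVector, rng::AbstractRNG)
    λ = rand(rng)
    return x .+ λ .* (x_neighbor .- x)
end

# Two made-up minority rows with columns (Age, Tenure, NumOfProducts)
x          = [42.0, 2.0, 1.0]
x_neighbor = [45.0, 7.0, 3.0]

x_new = interpolate_point(x, x_neighbor, Xoshiro(42))
println(x_new)   # Tenure and NumOfProducts will generally not be whole numbers
```

Every coordinate of `x_new` lies between the corresponding coordinates of the two rows, so integer-valued columns such as `Tenure` generally receive non-integer values.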

## Coercing Data

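Let's coerce the features to `Continuous` and the target variable to `Multiclass`. Note that `CreditScore` is not listed in the call below, so it keeps its `Count` scitype.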

In [4]:
df = coerce(df, :Age=>Continuous,
                :Tenure=>Continuous,
                :Balance=>Continuous,
                :NumOfProducts=>Continuous,
                :HasCrCard=>Continuous,
                :IsActiveMember=>Continuous,
                :EstimatedSalary=>Continuous,
                :Exited=>Multiclass)

ScientificTypes.schema(df)

┌─────────────────┬───────────────┬─────────────────────────────────┐
│ names           │ scitypes      │ types                           │
├─────────────────┼───────────────┼─────────────────────────────────┤
│ CreditScore     │ Count         │ Int64                           │
│ Age             │ Continuous    │ Float64                         │
│ Tenure          │ Continuous    │ Float64                         │
│ Balance         │ Continuous    │ Float64                         │
│ NumOfProducts   │ Continuous    │ Float64                         │
│ HasCrCard       │ Continuous    │ Float64                         │
│ IsActiveMember  │ Continuous    │ Float64                         │
│ EstimatedSalary │ Continuous    │ Float64                         │
│ Exited          │ Multiclass{2} │ CategoricalValue{Int64, UInt32} │
└─────────────────┴───────────────┴─────────────────────────────────┘
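In the schema above, `CreditScore` is still `Count`. If the intent is for every count column to become continuous, `ScientificTypes` also supports coercing by scitype rather than by column name. The sketch below uses a small made-up table (`toy` is a hypothetical stand-in for the churn data, not the real dataset).

```julia
using DataFrames
using ScientificTypes

# Hypothetical toy table standing in for the churn data
toy = DataFrame(CreditScore = [619, 608, 502],
                Age         = [42, 41, 42],
                Balance     = [0.0, 83807.86, 159660.8])

# Coerce every column whose scitype is Count to Continuous in one call
toy_cont = coerce(toy, Count => Continuous)

ScientificTypes.schema(toy_cont)
```

With this form no integer column can be missed by accident, at the cost of being less explicit about which columns are affected.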


## Unpacking and Splitting Data

Both `MLJ` and the pure functional interface of `Imbalance` assume that the observations table `X` and the target vector `y` are separate. We can accomplish that by using `unpack` from `MLJ`:

In [5]:
y, X = unpack(df, ==(:Exited); rng=123);
first(X, 5) |> pretty

┌─────────────┬────────────┬────────────┬────────────┬───────────────┬────────────┬────────────────┬─────────────────┐
│ CreditScore │ Age        │ Tenure     │ Balance    │ NumOfProducts │ HasCrCard  │ IsActiveMember │ EstimatedSalary │
│ Int64       │ Float64    │ Float64    │ Float64    │ Float64       │ Float64    │ Float64        │ Float64         │
│ Count       │ Continuous │ Continuous │ Continuous │ Continuous    │ Continuous │ Continuous     │ Continuous      │
├─────────────┼────────────┼────────────┼────────────┼───────────────┼────────────┼────────────────┼─────────────────┤
│ 669.0       │ 31.0       │ 6.0        │ 1.13001e5  │ 1.0           │ 1.0        │ 0.0            │ 40467.8         │
│ 822.0       │ 37.0       │ 3.0        │ 105563.0   │ 1.0           │ 1.0    

Splitting the data into train and test portions is also easy using `MLJ`'s `partition` function.

In [6]:
train_inds, test_inds = partition(eachindex(y), 0.8, shuffle=true, rng=Random.Xoshiro(42))
X_train, X_test = X[train_inds, :], X[test_inds, :]
y_train, y_test = y[train_inds], y[test_inds]

(CategoricalValue{Int64, UInt32}[0, 1, 1, 0, 0, 0, 0, 0, 0, 0  …  0, 0, 0, 1, 0, 0, 0, 0, 1, 0], CategoricalValue{Int64, UInt32}[0, 0, 0, 0, 0, 1, 1, 0, 0, 0  …  0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

## Oversampling



Before deciding to oversample, let's see how severe the imbalance problem is, if it exists at all. Ideally, you would also check whether the classification model is robust to it.

In [7]:
checkbalance(y)         # comes from Imbalance

1: ▇▇▇▇▇▇▇▇▇▇▇▇▇ 2037 (25.6%) 
0: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 7963 (100.0%) 
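The percentages in this output appear to be expressed relative to the size of the majority class, which is why class `0` shows 100%. A minimal sketch of that arithmetic in plain Julia, using a made-up label vector with the same class counts:

```julia
# Made-up label vector with the same class counts as in the output above
y_demo = vcat(fill(0, 7963), fill(1, 2037))

# Count observations per class
counts = Dict(c => count(==(c), y_demo) for c in unique(y_demo))

# Express each class size as a percentage of the majority class size
majority = maximum(values(counts))
percentages = Dict(c => round(100 * n / majority, digits=1) for (c, n) in counts)

println(percentages)
```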


It looks like we have a class imbalance problem. Let's oversample with SMOTE and set the desired ratio so that the positive minority class ends up at 90% of the size of the majority class.

In [8]:
Xover, yover = smote(X, y; k=3, ratios=Dict(1=>0.9), rng=42)
checkbalance(yover)

1: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 7167 (90.0%) 
0: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 7963 (100.0%) 
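The new size of class `1` follows directly from the ratio. As a quick check of the numbers, here is the arithmetic in plain Julia; rounding to the nearest integer is an assumption that happens to reproduce the count reported above, not a statement about how `Imbalance` rounds internally.

```julia
n_majority = 7963      # class 0 count from the output above
n_minority = 2037      # class 1 count before oversampling
ratio      = 0.9       # value passed as ratios=Dict(1=>0.9)

# Target size of class 1 (rounding rule is an assumption, see above)
n_target = round(Int, ratio * n_majority)

# Number of synthetic points that have to be generated
n_synthetic = n_target - n_minority

println((n_target, n_synthetic))
```

So SMOTE has to generate roughly 5130 synthetic rows, more than twice the number of real minority rows.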


## Training the Model



Because we have scientific types set up, we can easily check which models can be trained on our data. This should guarantee that the model we choose won't throw an error due to types after feeding it the data.

In [9]:
models(matching(Xover, yover))

54-element Vector{NamedTuple{(:name, :package_name, :is_supervised, :abstract_type, :deep_properties, :docstring, :fit_data_scitype, :human_name, :hyperparameter_ranges, :hyperparameter_types, :hyperparameters, :implemented_methods, :inverse_transform_scitype, :is_pure_julia, :is_wrapper, :iteration_parameter, :load_path, :package_license, :package_url, :package_uuid, :predict_scitype, :prediction_type, :reporting_operations, :reports_feature_importances, :supports_class_weights, :supports_online, :supports_training_losses, :supports_weights, :transform_scitype, :input_scitype, :target_scitype, :output_scitype)}}:
 (name = AdaBoostClassifier, package_name = MLJScikitLearnInterface, ... )
 (name = AdaBoostStumpClassifier, package_name = DecisionTree, ... )
 (name = BaggingClassifier, package_name = MLJScikitLearnInterface, ... )
 (name = BayesianLDA, package_name = MLJScikitLearnInterface, ... )
 (name = BayesianLDA, package_name = MultivariateStats, ... )
 (name = BayesianQDA, package_

Let's go for a logistic classifier from `MLJLinearModels`:

In [10]:
import Pkg; Pkg.add("MLJLinearModels")

    Updating registry at `~/.julia/registries/General.toml`

┌ Error: Some registries failed to update:
│     — /Users/essam/.julia/registries/General.toml — failed to download from https://pkg.julialang.org/registry/23338594-aafe-5451-b93e-139f81909106/95646b6cd2d61c2d6784757067e14d5bcb846090. Exception: HTTP/2 200 (Operation too slow. Less than 1 bytes/sec transferred the last 20 seconds) while requesting https://pkg.julialang.org/registry/23338594-aafe-5451-b93e-139f81909106/95646b6cd2d61c2d6784757067e14d5bcb846090
└ @ Pkg.Registry /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-macmini-aarch64-4.0/build/default-macmini-aarch64-4-0/julialang/julia-release-1-dot-8/usr/share/julia/stdlib/v1.8/Pkg/src/Registry/Registry.jl:449
   Resolving package versions...

    Updating `~/Documents/GitHub/Imbalance.jl/Project.toml`
  [6ee0df7b] + MLJLinearModels v0.9.2
    Updating `~/Documents/GitHub/Imbalance.jl/Manifest.toml`

  [6a86dc24] + FiniteDiff v2.21.1
  [42fd0dbc] + IterativeSolvers v0.9.2
  [d3d80556] + LineSearches v7.2.0
  [7a12625a] + LinearMaps v3.11.0
  [6ee0df7b] + MLJLinearModels v0.9.2
  [d41bc354] + NLSolversBase v7.8.3
  [429524aa] + Optim v1.7.7
  [85a6dd25] + PositiveFactorizations v0.2.4
  [3cdcf5f2] + RecipesBase v1.3.4


### Before Oversampling

In [11]:
# 1. Load the model
LogisticClassifier = @load LogisticClassifier pkg=MLJLinearModels verbosity=0

# 2. Instantiate it
model = LogisticClassifier()

# 3. Wrap it with the data in a machine
mach = machine(model, X_train, y_train)

# 4. fit the machine learning model
fit!(mach, verbosity=0)

│ supports. Suppress this type check by specifying `scitype_check_level=0`.
│ 
│ Run `@doc MLJLinearModels.LogisticClassifier` to learn more about your model's requirements.
│ 
│ Commonly, but non exclusively, supervised models are constructed using the syntax
│ `machine(model, X, y)` or `machine(model, X, y, w)` while most other models are
│ constructed with `machine(model, X)`.  Here `X` are features, `y` a target, and `w`
│ sample or class weights.
│ 
│ In general, data in `machine(model, data...)` is expected to satisfy
│ 
│     scitype(data) <: MLJ.fit_data_scitype(model)
│ 
│ In the present case:
│ 
│ scitype(data) = Tuple{Table{Union{AbstractVector{Continuous}, AbstractVector{Count}}}, AbstractVector{Multiclass{2}}}
│ 
│ fit_data_scitype(model) = Tuple{Table{<:AbstractVector{<:Continuous}}, AbstractVector{<:Finite}}
└ @ MLJBase /Users/essam/.julia/packages/MLJBase/ByFwA/src/machines.jl:230


trained Machine; caches model-specific representations of data
  model: LogisticClassifier(lambda = 2.220446049250313e-16, …)
  args: 
    1:	Source @148 ⏎ Table{Union{AbstractVector{Continuous}, AbstractVector{Count}}}
    2:	Source @042 ⏎ AbstractVector{Multiclass{2}}


### After Oversampling

In [12]:
# 3. Wrap it with the data in a machine
mach_over = machine(model, Xover, yover)

# 4. fit the machine learning model
fit!(mach_over)

┌ Info: Training machine(LogisticClassifier(lambda = 2.220446049250313e-16, …), …).
└ @ MLJBase /Users/essam/.julia/packages/MLJBase/ByFwA/src/machines.jl:492
┌ Info: Solver: MLJLinearModels.LBFGS{Optim.Options{Float64, Nothing}, NamedTuple{(), Tuple{}}}
│   optim_options: Optim.Options{Float64, Nothing}
│   lbfgs_options: NamedTuple{(), Tuple{}} NamedTuple()
└ @ MLJLinearModels /Users/essam/.julia/packages/MLJLinearModels/zSQnL/src/mlj/interface.jl:72


trained Machine; caches model-specific representations of data
  model: LogisticClassifier(lambda = 2.220446049250313e-16, …)
  args: 
    1:	Source @525 ⏎ Table{AbstractVector{Continuous}}
    2:	Source @636 ⏎ AbstractVector{Multiclass{2}}


## Evaluating the Model



To evaluate the model, we will use the balanced accuracy metric, which accounts for all classes equally.
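Balanced accuracy is the unweighted mean of the per-class recall. The sketch below computes it from scratch in plain Julia to make the definition concrete; `balanced_accuracy_sketch` is a hypothetical helper written for illustration, not the `MLJ` implementation, and the labels are made up.

```julia
using Statistics

# Hypothetical helper: balanced accuracy as the unweighted mean of per-class recall
function balanced_accuracy_sketch(y_pred::AbstractVector, y_true::AbstractVector)
    recalls = Float64[]
    for c in unique(y_true)
        in_class = y_true .== c
        push!(recalls, count(y_pred[in_class] .== c) / count(in_class))
    end
    return mean(recalls)
end

# Made-up labels: 8 negatives, 2 positives
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

# A classifier that always predicts the majority class
y_majority = fill(0, 10)

println(balanced_accuracy_sketch(y_majority, y_true))   # 0.5, although plain accuracy is 0.8
```

A two-class model that always predicts the majority class therefore scores 0.5, which makes 0.5 the natural baseline for this metric.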

### Before Oversampling

In [13]:
y_pred = predict_mode(mach, X_test)                         

score = round(balanced_accuracy(y_pred, y_test), digits=2)

0.5

### After Oversampling

In [14]:
y_pred_over = predict_mode(mach_over, X_test)

score = round(balanced_accuracy(y_pred_over, y_test), digits=2)

0.66
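The cells above oversample the full `X, y`, so the synthetic points are interpolated from rows that also end up in `X_test`, which may make the score after oversampling somewhat optimistic. A common alternative is to oversample the training split only. The following is a minimal sketch of that pattern on made-up data, using the same `smote` keywords as above; it assumes `smote` accepts a plain matrix with observations as rows and an integer label vector.

```julia
using Imbalance
using Random

rng = Xoshiro(42)

# Made-up data: 200 rows, 3 continuous features, 1 in 5 rows in class 1
X_demo = rand(rng, 200, 3)
y_demo = [i % 5 == 0 ? 1 : 0 for i in 1:200]

# Hold out the last 40 rows for testing (no shuffling, to keep the sketch short)
train_rows, test_rows = 1:160, 161:200
X_tr, y_tr = X_demo[train_rows, :], y_demo[train_rows]

# Oversample the training split only; SMOTE never sees the test rows
X_tr_over, y_tr_over = smote(X_tr, y_tr; k=3, ratios=Dict(1=>0.9), rng=42)

checkbalance(y_tr_over)
```

The model would then be fitted on `X_tr_over, y_tr_over` and evaluated on the untouched held-out rows.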

