In [None]:
using Imbalance
using CSV
using DataFrames
using ScientificTypes
using CategoricalArrays
using MLJ
using Plots
using Random

## Loading Data
In this example, we will consider the [Churn for Bank Customers](https://www.kaggle.com/datasets/mathchi/churn-for-bank-customers) dataset found on Kaggle, where the objective is to predict whether a customer is likely to leave a bank given financial and demographic features.

We already considered this dataset using SMOTE. In this example, we see whether the results are any better using SMOTE-NC.

In [None]:
df = CSV.read("../datasets/churn.csv", DataFrame)
first(df, 5) |> pretty

┌───────────┬────────────┬──────────┬─────────────┬───────────┬─────────┬───────┬────────┬────────────┬───────────────┬───────────┬────────────────┬─────────────────┬────────┐
│ RowNumber │ CustomerId │ Surname  │ CreditScore │ Geography │ Gender  │ Age   │ Tenure │ Balance    │ NumOfProducts │ HasCrCard │ IsActiveMember │ EstimatedSalary │ Exited │
│ Int64     │ Int64      │ String31 │ Int64       │ String7   │ String7 │ Int64 │ Int64  │ Float64    │ Int64         │ Int64     │ Int64          │ Float64         │ Int64  │
│ Count     │ Count      │ Textual  │ Count       │ Textual   │ Textual │ Count │ Count  │ Continuous │ Count         │ Count     │

Let's drop the columns that carry no predictive information: `RowNumber`, `CustomerId` and `Surname`.

In [None]:
df = df[:, Not([:Surname, :RowNumber, :CustomerId])]

first(df, 5) |> pretty

┌─────────────┬───────────┬─────────┬───────┬────────┬────────────┬───────────────┬───────────┬────────────────┬─────────────────┬────────┐
│ CreditScore │ Geography │ Gender  │ Age   │ Tenure │ Balance    │ NumOfProducts │ HasCrCard │ IsActiveMember │ EstimatedSalary │ Exited │
│ Int64       │ String7   │ String7 │ Int64 │ Int64  │ Float64    │ Int64         │ Int64     │ Int64          │ Float64         │ Int64  │
│ Count       │ Textual   │ Textual │ Count │ Count  │ Continuous │ Count         │ Count     │ Count          │ Continuous      │ Count  │
├─────────────┼───────────┼─────────┼───────┼────────┼────────────┼───────────────┼───────────┼────────────────┼─────────────────┼────────┤
│ 619         

## Coercing Data

Let's coerce the nominal data to `Multiclass`, the ordinal data to `OrderedFactor` and the continuous data to `Continuous`.
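As a small illustration of what such a coercion does to a single column, here is a minimal sketch on a hypothetical toy vector (the name `tenure` and its values are made up for the example). It assumes `ScientificTypes` and `CategoricalArrays` are installed, as in the imports at the top of this notebook.

```julia
using ScientificTypes, CategoricalArrays

# Hypothetical integer column, standing in for something like `Tenure`.
tenure = [2, 1, 8, 1, 2]

# Plain integers are interpreted as counts by default.
scitype_before = elscitype(tenure)

# Coercing wraps the values in an ordered categorical array.
tenure_ord = coerce(tenure, OrderedFactor)
scitype_after = elscitype(tenure_ord)

# The levels are the distinct values, and their order is meaningful.
lvls = levels(tenure_ord)
```

The number in a scitype such as `OrderedFactor{3}` is the number of distinct levels, which is why a column with many distinct integer values ends up with a large level count after coercion.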

In [None]:
df = coerce(df, 
              :Geography => Multiclass, 
              :Gender=> Multiclass,
              :CreditScore => OrderedFactor,
              :Age => OrderedFactor,
              :Tenure => OrderedFactor,
              :Balance => Continuous,
              :NumOfProducts => OrderedFactor,
              :HasCrCard => Multiclass,
              :IsActiveMember => Multiclass,
              :EstimatedSalary => Continuous,
              :Exited => Multiclass
              )

ScientificTypes.schema(df)

┌─────────────────┬────────────────────┬───────────────────────────────────┐
│ names           │ scitypes           │ types                             │
├─────────────────┼────────────────────┼───────────────────────────────────┤
│ CreditScore     │ OrderedFactor{460} │ CategoricalValue{Int64, UInt32}   │
│ Geography       │ Multiclass{3}      │ CategoricalValue{String7, UInt32} │
│ Gender          │ Multiclass{2}      │ CategoricalValue{String7, UInt32} │
│ Age             │ OrderedFactor{70}  │ CategoricalValue{Int64, UInt32}   │
│ Tenure          │ OrderedFactor{11}  │ CategoricalValue{Int64, UInt32}   │
│ Balance         │ Continuous         │ Float64                           │
│ NumOfProducts   │ OrderedFactor{4}   │ CategoricalValue{Int64, UInt32}   │
│ HasCrCard       │ Multiclass{2}      │ CategoricalValue{Int64, UInt32}   │
│ IsActiveMember  │ Multiclass{2}      │ CategoricalValue{Int64, UInt32}   │
│ EstimatedSalary │ Continuous         │ Float64 

## Unpacking and Splitting Data

In [None]:
y, X = unpack(df, ==(:Exited); rng=123);
first(X, 5) |> pretty

┌─────────────────────────────────┬───────────────────────────────────┬───────────────────────────────────┬─────────────────────────────────┬─────────────────────────────────┬────────────┬─────────────────────────────────┬─────────────────────────────────┬─────────────────────────────────┬─────────────────┐
│ CreditScore                     │ Geography                         │ Gender                            │ Age                             │ Tenure                          │ Balance    │ NumOfProducts                   │ HasCrCard                       │ IsActiveMember                  │ EstimatedSalary │
│ CategoricalValue{Int64, UInt32} │ CategoricalValue{String7, UInt32} │ CategoricalValue{String7, UInt32} │ CategoricalValue{Int64, UInt32} │ CategoricalValue{Int64, UInt32} │ Float64    │ CategoricalValue{Int64, UInt32} │ Categorical

In [None]:
train_inds, test_inds = partition(eachindex(y), 0.8, shuffle=true, 
                                  rng=Random.Xoshiro(42))
X_train, X_test = X[train_inds, :], X[test_inds, :]
y_train, y_test = y[train_inds], y[test_inds]

(CategoricalValue{Int64, UInt32}[0, 1, 1, 0, 0, 0, 0, 0, 0, 0  …  0, 0, 0, 1, 0, 0, 0, 0, 1, 0], CategoricalValue{Int64, UInt32}[0, 0, 0, 0, 0, 1, 1, 0, 0, 0  …  0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

## Oversampling



Before deciding to oversample, let's see how severe the imbalance problem is, if it exists. Ideally, you would also check whether the classification model is robust to it.

In [None]:
checkbalance(y)

1: ▇▇▇▇▇▇▇▇▇▇▇▇▇ 2037 (25.6%) 
0: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 7963 (100.0%) 


Looks like we have a class imbalance problem. Let's oversample with SMOTE-NC and set the desired ratio so that the positive minority class reaches 90% of the size of the majority class.
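To make the ratio concrete, the following sketch works out how many synthetic rows a ratio of 0.9 implies for the class counts reported by `checkbalance(y)`. It uses plain arithmetic only. Rounding the target to the nearest integer is an assumption about how the target count is derived, not something taken from the library's source.

```julia
n_majority = 7963   # class 0 count reported by checkbalance(y)
n_minority = 2037   # class 1 count reported by checkbalance(y)
ratio = 0.9         # desired minority size relative to the majority

# Assumption: the target is the ratio times the majority count, rounded.
target_minority = round(Int, ratio * n_majority)   # 7167

# Rows that have to be synthesized, and the resulting table size.
n_synthetic = target_minority - n_minority         # 5130
n_total = n_majority + target_minority             # 15130
```

These figures agree with the 15130-row table and the 7167 minority observations shown in the outputs that follow.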

In [None]:
Xover, yover = smotenc(X, y; k=3, ratios=Dict(1=>0.9), rng=42)

(15130×10 DataFrame
   Row │ CreditScore  Geography  Gender  Age   Tenure  Balance         NumOfPr ⋯
       │ Cat…         Cat…       Cat…    Cat…  Cat…    Float64         Cat…    ⋯
───────┼────────────────────────────────────────────────────────────────────────
     1 │ 669          France     Female  31    6            1.13001e5  1       ⋯
     2 │ 822          France     Male    37    3       105563.0        1
     3 │ 423          France     Female  36    5        97665.6        1
     4 │ 623          France     Male    21    10           0.0        2
     5 │ 691          Germany    Female  37    7            1.23068e5  1       ⋯
     6 │ 628          France     Male    69    5            0.0        2
     7 │ 613          France     Female  24    7            1.40454e5  1
     8 │ 711          France     Male    34    8            0.0        2
  

In [None]:
checkbalance(yover)

1: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 7167 (90.0%) 
0: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 7963 (100.0%) 
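Note that the `smotenc` call above draws on the full table `X, y`, so the synthetic rows are interpolated from observations that also appear in `X_test`. One common way to keep the test split untouched is to oversample the training rows only. Below is a minimal sketch on a hypothetical toy table: the names `X_toy`, `y_toy` and the generated values are illustrative assumptions, not the churn data. In this notebook the same call would take `X_train, y_train` instead.

```julia
using Imbalance, DataFrames, CategoricalArrays, Random

# Hypothetical toy table: one continuous and one nominal feature,
# with 160 majority (0) and 40 minority (1) observations.
rng = Random.Xoshiro(0)
n = 200
X_toy = DataFrame(
    Balance   = 1e5 .* rand(rng, n),
    Geography = categorical(rand(rng, ["France", "Germany", "Spain"], n)),
)
y_toy = categorical(vcat(fill(0, 160), fill(1, 40)))

# Shuffled 80/20 split of the row indices.
perm = shuffle(rng, 1:n)
train, test = perm[1:160], perm[161:end]

# Oversample the training rows only; the test rows are never passed in.
Xover_toy, yover_toy = smotenc(X_toy[train, :], y_toy[train];
                               k=3, ratios=Dict(1 => 0.9), rng=42)

# Minority counts before and after oversampling the training split.
n_before = count(==(1), unwrap.(y_toy[train]))
n_after  = count(==(1), unwrap.(yover_toy))
```

With this arrangement, a model fitted on `Xover_toy, yover_toy` can be scored on `X_toy[test, :]` without any of the test rows having contributed to the synthetic data.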


## Training the Model



Let's find possible models

In [None]:
ms = models(matching(Xover, yover))

5-element Vector{NamedTuple{(:name, :package_name, :is_supervised, :abstract_type, :deep_properties, :docstring, :fit_data_scitype, :human_name, :hyperparameter_ranges, :hyperparameter_types, :hyperparameters, :implemented_methods, :inverse_transform_scitype, :is_pure_julia, :is_wrapper, :iteration_parameter, :load_path, :package_license, :package_url, :package_uuid, :predict_scitype, :prediction_type, :reporting_operations, :reports_feature_importances, :supports_class_weights, :supports_online, :supports_training_losses, :supports_weights, :transform_scitype, :input_scitype, :target_scitype, :output_scitype)}}:
 (name = CatBoostClassifier, package_name = CatBoost, ... )
 (name = ConstantClassifier, package_name = MLJModels, ... )
 (name = DecisionTreeClassifier, package_name = BetaML, ... )
 (name = DeterministicConstantClassifier, package_name = MLJModels, ... )
 (name = RandomForestClassifier, package_name = BetaML, ... )

Let's install BetaML, which provides the decision tree classifier listed above.

In [None]:
import Pkg; Pkg.add("BetaML")

    Updating registry at `~/.julia/registries/General.toml`


Let's go for a decision tree from BetaML. We can't go for logistic regression as we did in the SMOTE tutorial because it does not support categorical features.

### Before Oversampling

In [None]:
# 1. Load the model
DecisionTreeClassifier = @load DecisionTreeClassifier pkg=BetaML

# 2. Instantiate it
model = DecisionTreeClassifier(max_depth=4, rng=Random.Xoshiro(42))

# 3. Wrap it with the data in a machine
mach = machine(model, X_train, y_train)

# 4. fit the machine learning model
fit!(mach, verbosity=0)

### After Oversampling

In [None]:
# 3. Wrap it with the data in a machine
mach_over = machine(model, Xover, yover)

# 4. fit the machine learning model
fit!(mach_over)

## Evaluating the Model



To evaluate the model, we will use the balanced accuracy metric which equally accounts for all classes. 
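To see why this metric suits imbalanced data, here is a minimal sketch of the usual definition of balanced accuracy as the unweighted mean of per-class recall. The helper `balanced_accuracy_sketch` and the toy labels are hypothetical and written only for illustration; the notebook itself relies on MLJ's `balanced_accuracy`.

```julia
# Balanced accuracy as the unweighted mean of per-class recall.
function balanced_accuracy_sketch(y_pred, y_true)
    classes = unique(y_true)
    recalls = map(classes) do c
        in_class = y_true .== c
        # Fraction of class-c observations that were predicted as c.
        sum((y_pred .== c) .& in_class) / sum(in_class)
    end
    return sum(recalls) / length(recalls)
end

# Toy labels: 8 observations of class 0 and 2 of class 1.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

# A classifier that always predicts the majority class.
always_zero = zeros(Int, 10)

plain_accuracy = sum(always_zero .== y_true) / length(y_true)  # 0.8
balanced = balanced_accuracy_sketch(always_zero, y_true)       # 0.5
```

A model that ignores the minority class still reaches 80% plain accuracy here, while its balanced accuracy drops to 0.5, the same as guessing.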

### Before Oversampling

In [None]:
y_pred = predict_mode(mach, X_test)                         

score = round(balanced_accuracy(y_pred, y_test), digits=2)

### After Oversampling

In [None]:
y_pred_over = predict_mode(mach_over, X_test)

score = round(balanced_accuracy(y_pred_over, y_test), digits=2)

Although the results are better than what we obtained with SMOTE alone, the extra categorical features we took into account do not appear to matter much in this case. The difference can instead be attributed to using a decision tree.
