In [None]:
%%capture
%%shell
# This installs Julia 1.7
wget -O - https://raw.githubusercontent.com/JuliaAI/Imbalance.jl/dev/docs/src/examples/colab.sh | bash
# This should take around one minute to finish. Once it does, change the runtime to `Julia` by choosing `Runtime`
# from the toolbar, then `Change runtime type`. You can then delete this cell.

# Balanced Bagging for Cerebral Stroke Prediction

In [42]:
import Pkg;
Pkg.add(["Random", "CSV", "DataFrames", "MLJ", "Imbalance", "MLJBalancing", 
         "ScientificTypes","Impute", "StatsBase",  "Plots", "Measures", "HTTP"])

using Random
using CSV
using DataFrames
using MLJ
using Imbalance
using MLJBalancing
using StatsBase
using ScientificTypes
using Plots, Measures
using Impute
using HTTP: download

## Loading Data
In this example, we will consider the [Cerebral Stroke Prediction Dataset](https://www.kaggle.com/datasets/shashwatwork/cerebral-stroke-predictionimbalaced-dataset) found on Kaggle, with the objective of predicting whether a stroke has occurred given medical features about patients.

`CSV` gives us the ability to easily read the dataset after it's downloaded, as follows:

In [54]:
# Save the file into the current directory so that the relative path below resolves
download("https://raw.githubusercontent.com/JuliaAI/Imbalance.jl/dev/docs/src/examples/cerebral_ensemble/cerebral.csv", "./")
df = CSV.read("./cerebral.csv", DataFrame)

# Display the first 5 rows with DataFrames
first(df, 5) |> pretty

┌───────┬─────────┬────────────┬──────────────┬───────────────┬──────────────┬──────────────┬────────────────┬───────────────────┬────────────────────────────┬──────────────────────────┬────────┐
│ id    │ gender  │ age        │ hypertension │ heart_disease │ ever_married │ work_type    │ Residence_type │ avg_glucose_level │ bmi                        │ smoking_status           │ stroke │
│ Int64 │ String7 │ Float64    │ Int64        │ Int64         │ String3      │ String15     │ String7        │ Float64           │ Union{Missing, Float64}    │ Union{Missing, String15} │ Int64  │
│ Count │ Textual │ Continuous │ Count        │ Count         │ Textual      │ Textual      │ Textual        │ Continuous        ⋯

It's obvious that the `id` column is useless for predictions so we may as well drop it.

In [55]:
df = df[:, Not(:id)]
first(df, 5) |> pretty

┌─────────┬────────────┬──────────────┬───────────────┬──────────────┬──────────────┬────────────────┬───────────────────┬────────────────────────────┬──────────────────────────┬────────┐
│ gender  │ age        │ hypertension │ heart_disease │ ever_married │ work_type    │ Residence_type │ avg_glucose_level │ bmi                        │ smoking_status           │ stroke │
│ String7 │ Float64    │ Int64        │ Int64         │ String3      │ String15     │ String7        │ Float64           │ Union{Missing, Float64}    │ Union{Missing, String15} │ Int64  │
│ Textual │ Continuous │ Count        │ Count         │ Textual      │ Textual      │ Textual        │ Continuous        │ Union{Missing, Continuous} │ Union{Missi⋯

## Visualize the Data
Since this dataset is composed mostly of categorical features, a bar chart for each categorical column is a good way to visualize the data.

In [None]:
# Create a bar chart for each column with fewer than 20 distinct values
bar_charts = []
for col in names(df)
    counts = countmap(df[!, col])
    k, v = collect(keys(counts)), collect(values(counts))
    if length(k) < 20
        push!(bar_charts, bar(k, v, legend=false, title=col, color="turquoise3", xrotation=90, margin=6mm))
    end
end

# Combine bar charts into a grid layout with specified plot size
plot_res = plot(bar_charts..., layout=(3, 4),
                size=(1300, 500),
                dpi=200
                )
mkpath("./assets")   # make sure the output folder exists before saving
savefig(plot_res, "./assets/cerebral-charts.png")


![Cerebral Stroke Features Plots](./assets/cerebral-charts.png)

Our target here is the `stroke` variable; notice how imbalanced it is.

## Coercing Data
Typical models from `MLJ` assume that elements in each column of a table have some `scientific type` as defined by the [ScientificTypes.jl](https://juliaai.github.io/ScientificTypes.jl/dev/) package. It's often necessary to coerce the types found by default to the appropriate type.

In [39]:
ScientificTypes.schema(df)

┌───────────────────┬────────────────────────────┬──────────────────────────┐
│ names             │ scitypes                   │ types                    │
├───────────────────┼────────────────────────────┼──────────────────────────┤
│ gender            │ Textual                    │ String7                  │
│ age               │ Continuous                 │ Float64                  │
│ hypertension      │ Count                      │ Int64                    │
│ heart_disease     │ Count                      │ Int64                    │
│ ever_married      │ Textual                    │ String3                  │
│ work_type         │ Textual                    │ String15                 │
│ Residence_type    │ Textual                    │ String7                  │
│ avg_glucose_level │ Continuous                 │ Float64                  │
│ bmi               │ Union{Missing, Continuous} │ Union{Missing, Float64}  │
│ smoking_status    │ Union{Missing, 

For instance, here we need to coerce every column to `Multiclass`, since they are all nominal variables, except for `age`, `avg_glucose_level` and `bmi`, which we can treat as continuous.

In [139]:
df = coerce(df, :gender => Multiclass, :age => Continuous, :hypertension => Multiclass,
	:heart_disease => Multiclass, :ever_married => Multiclass, :work_type => Multiclass,
	:Residence_type => Multiclass, :avg_glucose_level => Continuous,
	:bmi => Continuous, :smoking_status => Multiclass, :stroke => Multiclass,
)
ScientificTypes.schema(df)

┌───────────────────┬───────────────┬────────────────────────────────────┐
│ names             │ scitypes      │ types                              │
├───────────────────┼───────────────┼────────────────────────────────────┤
│ gender            │ Multiclass{3} │ CategoricalValue{String7, UInt32}  │
│ age               │ Continuous    │ Float64                            │
│ hypertension      │ Multiclass{2} │ CategoricalValue{Int64, UInt32}    │
│ heart_disease     │ Multiclass{2} │ CategoricalValue{Int64, UInt32}    │
│ ever_married      │ Multiclass{2} │ CategoricalValue{String3, UInt32}  │
│ work_type         │ Multiclass{5} │ CategoricalValue{String15, UInt32} │
│ Residence_type    │ Multiclass{2} │ CategoricalValue{String7, UInt32}  │
│ avg_glucose_level │ Continuous    │ Float64                            │
│ bmi               │ Continuous    │ Float64                            │
│ smoking_status    │ Multiclass{3} │ CategoricalValue{String15, UInt32} 

As shown in the types, some columns have missing values. We will impute them using simple random sampling, since dropping their rows would mean losing a big chunk of the dataset.

In [61]:
df = Impute.srs(df); disallowmissing!(df)
first(df, 5) |> pretty

┌───────────────────────────────────┬────────────┬─────────────────────────────────┬─────────────────────────────────┬───────────────────────────────────┬────────────────────────────────────┬───────────────────────────────────┬───────────────────┬────────────┬────────────────────────────────────┬─────────────────────────────────┐
│ gender                            │ age        │ hypertension                    │ heart_disease                   │ ever_married                      │ work_type                          │ Residence_type                    │ avg_glucose_level │ bmi        │ smoking_status                     │ stroke                          │
│ CategoricalValue{String7, UInt32} │ Float64    │ CategoricalValue{Int64, UInt32} │ CategoricalValue{Int64, UInt32} │ CategoricalValue{String3, UInt32} │ CategoricalValue{String15, UInt32} │ ⋯
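
To make the idea behind simple random sampling imputation concrete, here is a minimal sketch using only the standard library. It is not how `Impute.srs` is implemented; `srs_fill` is a hypothetical helper and the `bmi` values are made up. Each missing entry is replaced by a value drawn at random from the observed entries of the same column, so the filled column keeps roughly the same distribution.

```julia
using Random

# Hypothetical helper (not part of Impute.jl): fill each missing entry with a
# value drawn uniformly at random from the observed entries of the same column.
function srs_fill(col::AbstractVector, rng::AbstractRNG)
    observed = collect(skipmissing(col))
    isempty(observed) && error("column has no observed values to sample from")
    return [ismissing(v) ? rand(rng, observed) : v for v in col]
end

rng = Random.Xoshiro(42)
bmi = [18.0, missing, 39.2, 17.6, missing, 35.9]   # toy values, not taken from the dataset
filled = srs_fill(bmi, rng)

println(filled)
```

Observed entries stay untouched; only the two `missing` slots receive sampled values.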

## Unpacking and Splitting Data

Both `MLJ` and the pure functional interface of `Imbalance` assume that the observations table `X` and target vector `y` are separate. We can accomplish that by using `unpack` from `MLJ`

In [62]:
y, X = unpack(df, ==(:stroke); rng=123);
first(X, 5) |> pretty

┌───────────────────────────────────┬────────────┬─────────────────────────────────┬─────────────────────────────────┬───────────────────────────────────┬────────────────────────────────────┬───────────────────────────────────┬───────────────────┬────────────┬────────────────────────────────────┐
│ gender                            │ age        │ hypertension                    │ heart_disease                   │ ever_married                      │ work_type                          │ Residence_type                    │ avg_glucose_level │ bmi        │ smoking_status                     │
│ CategoricalValue{String7, UInt32} │ Float64    │ CategoricalValue{Int64, UInt32} │ CategoricalValue{Int64, UInt32} │ CategoricalValue{String3, UInt32} │ CategoricalValue{String15, UInt32} │ CategoricalValue{String7, UInt32} │ Float64           │ ⋯

Splitting the data into train and test portions is also easy using `MLJ`'s `partition` function. Passing `stratify=y` keeps the class proportions of the original dataset in both splits, which is more representative of the real world.

In [None]:
(X_train, X_test), (y_train, y_test) = partition(
	(X, y),
	0.8,
	multi = true,
	shuffle = true,
	stratify = y,
	rng = Random.Xoshiro(42)
)
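
To see what stratification buys us, here is a small standard-library sketch of the idea. It is not `MLJ`'s `partition` implementation; `stratified_indices` is a hypothetical helper and the toy target is made up. Each class is shuffled and cut separately, so both parts end up with the same share of the rare class.

```julia
using Random

# Hypothetical helper: split row indices so each class keeps (roughly) the
# same share in both parts. Not MLJ's `partition`, just the idea behind it.
function stratified_indices(y::AbstractVector, fraction::Real, rng::AbstractRNG)
    train, test = Int[], Int[]
    for class in unique(y)
        rows = shuffle(rng, findall(==(class), y))   # rows of this class, in random order
        cut = round(Int, fraction * length(rows))
        append!(train, rows[1:cut])
        append!(test, rows[cut+1:end])
    end
    return train, test
end

# Toy target with the same kind of skew as `stroke`: 20 positives in 1000 rows.
y_toy = vcat(fill(1, 20), fill(0, 980))
train, test = stratified_indices(y_toy, 0.8, Random.Xoshiro(42))

# Share of the positive class among the given rows
share(rows) = count(==(1), y_toy[rows]) / length(rows)
println("train share: ", share(train), "  test share: ", share(test))
```

With a plain random split, a class this rare could easily be over- or under-represented in the test portion, which would make the evaluation less trustworthy.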

⚠️ Always split the data before oversampling. If your test data has oversampled observations then train-test contamination has occurred; novel observations will not come from the oversampling function.

## Oversampling



It was obvious from the bar charts that there is a severe imbalance problem. Let's look at that again.

In [64]:
checkbalance(y)         # comes from Imbalance

1: ▇ 783 (1.8%) 
0: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 42617 (100.0%) 


Indeed, the imbalance may be too severe for most models.

## Training the Model



Because we have the scientific types set up, we can easily check which models will be able to train on our data. This should guarantee that the model we choose won't throw an error due to types after feeding it the data.

In [12]:
ms = models(matching(X, y))

6-element Vector{NamedTuple{(:name, :package_name, :is_supervised, :abstract_type, :deep_properties, :docstring, :fit_data_scitype, :human_name, :hyperparameter_ranges, :hyperparameter_types, :hyperparameters, :implemented_methods, :inverse_transform_scitype, :is_pure_julia, :is_wrapper, :iteration_parameter, :load_path, :package_license, :package_url, :package_uuid, :predict_scitype, :prediction_type, :reporting_operations, :reports_feature_importances, :supports_class_weights, :supports_online, :supports_training_losses, :supports_weights, :transform_scitype, :input_scitype, :target_scitype, :output_scitype)}}:
 (name = CatBoostClassifier, package_name = CatBoost, ... )
 (name = ConstantClassifier, package_name = MLJModels, ... )
 (name = DecisionTreeClassifier, package_name = BetaML, ... )
 (name = DeterministicConstantClassifier, package_name = MLJModels, ... )
 (name = OneRuleClassifier, package_name = OneRule, ... )
 (name = RandomForestClassifier, package_name = BetaML, ... )

Let's go for a `DecisionTreeClassifier`

In [20]:
import Pkg; Pkg.add("BetaML")


#### Load and Construct

In [132]:
# 1. Load the model
DecisionTreeClassifier = @load DecisionTreeClassifier pkg=BetaML

# 2. Instantiate it
model = DecisionTreeClassifier(max_depth=4)

import BetaML ✔


┌ Info: For silent loading, specify `verbosity=0`. 
└ @ Main /Users/essam/.julia/packages/MLJModels/EkXIe/src/loading.jl:159


DecisionTreeClassifier(
  max_depth = 4, 
  min_gain = 0.0, 
  min_records = 2, 
  max_features = 0, 
  splitting_criterion = BetaML.Utils.gini, 
  rng = Random._GLOBAL_RNG())

#### Wrap in a machine and fit!

In [133]:
# 3. Wrap it with the data in a machine
mach = machine(model, X_train, y_train)

# 4. fit the machine learning model
fit!(mach, verbosity=0)

trained Machine; caches model-specific representations of data
  model: DecisionTreeClassifier(max_depth = 4, …)
  args: 
    1:	Source @245 ⏎ Table{Union{AbstractVector{Continuous}, AbstractVector{Multiclass{5}}, AbstractVector{Multiclass{2}}, AbstractVector{Multiclass{3}}}}
    2:	Source @251 ⏎ AbstractVector{Multiclass{2}}


#### Evaluate the Model

In [134]:
y_pred = MLJ.predict_mode(mach, X_test)                         

score = round(balanced_accuracy(y_pred, y_test), digits=2)

0.5
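
A balanced accuracy of 0.5 is worth unpacking. One way to land exactly there is to predict the majority class for every row, which still yields a very high plain accuracy. The sketch below uses made-up toy labels and a hand-written `balanced_acc` (a hypothetical helper, not `MLJ`'s `balanced_accuracy`) that averages the recall of each class.

```julia
# Balanced accuracy = mean of per-class recall.
function balanced_acc(y_pred::AbstractVector, y_true::AbstractVector)
    classes = unique(y_true)
    recalls = [count(y_pred[y_true .== c] .== c) / count(==(c), y_true) for c in classes]
    return sum(recalls) / length(recalls)
end

# Toy labels: 98 negatives and 2 positives (made up for illustration).
y_true = vcat(fill(0, 98), fill(1, 2))
always_negative = fill(0, 100)          # a "model" that never predicts a stroke

plain_accuracy = count(always_negative .== y_true) / length(y_true)
println("accuracy: ", plain_accuracy)
println("balanced accuracy: ", balanced_acc(always_negative, y_true))
```

This is why plain accuracy is a poor guide on data this skewed, and why we report balanced accuracy throughout.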

## Training BalancedBagging Model

The results suggest that the model is no better than random guessing. Let's see if this improves with a `BalancedBaggingClassifier`. This classifier trains `T` copies of the given `model` on `T` undersampled versions of the dataset, where each undersampled version has as many majority examples as there are minority examples.

This approach lets us work around the imbalance issue without throwing away most of the majority data. For instance, if we set `T = round(Int, 100/1.8)`, roughly the majority-to-minority ratio (which is the default), then on average each majority example is used in about one of the `T` bags.
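
The bag arithmetic can be checked with a short simulation. This is only a sketch of the sampling step, under the assumption that each bag draws its majority rows without replacement and independently of the other bags; it does not train any model and is not `MLJBalancing`'s code.

```julia
using Random

n_minority, n_majority = 783, 42617          # counts reported by `checkbalance` above
T = round(Int, n_majority / n_minority)      # number of bags, close to the imbalance ratio

rng = Random.Xoshiro(42)
times_used = zeros(Int, n_majority)
for _ in 1:T
    # Assumption: each bag draws as many majority rows as there are minority
    # rows, without replacement inside the bag.
    bag = randperm(rng, n_majority)[1:n_minority]
    times_used[bag] .+= 1
end

mean_uses = sum(times_used) / n_majority
covered = count(>(0), times_used) / n_majority
println("T = ", T, ", mean uses per majority row = ", round(mean_uses, digits=2),
        ", share used at least once = ", round(covered, digits=2))
```

Under these assumptions each majority row is used about once on average. Because the bags are drawn independently, some rows are drawn more than once and others never: the share drawn at least once is close to 1 − 1/e. A larger `T` raises that share at the cost of more training time.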

#### Load and Construct

In [135]:
bagging_model = BalancedBaggingClassifier(model=model, T=30, rng=Random.Xoshiro(42))

BalancedBaggingClassifier(
  model = DecisionTreeClassifier(
        max_depth = 4, 
        min_gain = 0.0, 
        min_records = 2, 
        max_features = 0, 
        splitting_criterion = BetaML.Utils.gini, 
        rng = Random._GLOBAL_RNG()), 
  T = 30, 
  rng = Xoshiro(0xa379de7eeeb2a4e8, 0x953dccb6b532b3af, 0xf597b8ff8cfd652a, 0xccd7337c571680d1))

#### Wrap in a machine and fit!

In [136]:
# 3. Wrap it with the data in a machine
mach_over = machine(bagging_model, X_train, y_train)

# 4. fit the machine learning model
fit!(mach_over, verbosity=0)

trained Machine; does not cache data
  model: BalancedBaggingClassifier(model = DecisionTreeClassifier(max_depth = 4, …), …)
  args: 
    1:	Source @005 ⏎ Table{Union{AbstractVector{Continuous}, AbstractVector{Multiclass{5}}, AbstractVector{Multiclass{2}}, AbstractVector{Multiclass{3}}}}
    2:	Source @531 ⏎ AbstractVector{Multiclass{2}}


#### Evaluate the Model

In [137]:
y_pred = MLJ.predict_mode(mach_over, X_test)                         

score = round(balanced_accuracy(y_pred, y_test), digits=2)

0.77

This is a dramatic improvement over what we had before. Let's confirm with cross-validation.

In [138]:
cv=CV(nfolds=10)
evaluate!(mach_over, resampling=cv, measure=balanced_accuracy, operation=predict_mode)

PerformanceEvaluation object with these fields:
  model, measure, operation, measurement, per_fold,
  per_observation, fitted_params_per_fold,
  report_per_fold, train_test_rows, resampling, repeats
Extract:
┌─────────────────────┬──────────────┬─────────────┬─────────┬──────────────────
│ measure             │ operation    │ measurement │ 1.96*SE │ per_fold        ⋯
├─────────────────────┼──────────────┼─────────────┼─────────┼──────────────────
│ BalancedAccuracy(   │ predict_mode │ 0.772       │ 0.0146  │ [0.738, 0.769,  ⋯
│   adjusted = false) │              │             │         │                 ⋯
└─────────────────────┴──────────────┴─────────────┴─────────┴──────────────────
                                                                1 column omitted


Assuming the per-fold scores are normally distributed, the `95%` confidence interval for the balanced accuracy is `77.2±1.5%`.
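
For reference, the interval follows directly from two numbers in the table above: the `measurement` column and the `1.96*SE` column. A minimal sketch of the arithmetic, using only those reported values:

```julia
measurement = 0.772      # mean balanced accuracy from the table above
half_width  = 0.0146     # the reported 1.96*SE column

lower = measurement - half_width
upper = measurement + half_width
println("95% CI: ", round(100 * lower, digits=1), "% to ", round(100 * upper, digits=1), "%")
```

The whole interval sits well above the 0.5 obtained by the single decision tree.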
