# Mutagenesis Example
The following example demonstrates learning to [predict the mutagenicity on Salmonella typhimurium](https://relational.fit.cvut.cz/dataset/Mutagenesis) (for your convenience, the dataset is stored in JSON format [in MLDatasets.jl](https://juliaml.github.io/MLDatasets.jl/stable/datasets/Mutagenesis/)).

We start by installing JsonGrinder and a few other packages we need for the example.
The Julia ecosystem follows a philosophy of many small, single-purpose, composable packages,
which may differ from e.g. Python, where we usually use fewer, larger packages.

In [1]:
using Pkg
pkg"add JsonGrinder#master MLDatasets Flux Mill Statistics"


This example is taken from the [CTUAvastLab/JsonGrinderExamples](https://github.com/CTUAvastLab/JsonGrinderExamples/blob/main/mutagenesis/tuned.jl)
and heavily commented for more clarity.

Here we load all the necessary libraries.

In [2]:
using JsonGrinder, Mill, Flux, MLDatasets, Statistics, Random

We fix the random seed to obtain the same results on every run, for pedagogical purposes.

In [3]:
Random.seed!(42)

Random.TaskLocalRNG()

We define the minibatch size.

In [4]:
BATCH_SIZE = 10

10

Here we load the training samples.

In [5]:
x_train, y_train = MLDatasets.Mutagenesis.traindata();

We create the schema of the training data, which is the first important step in using JsonGrinder.
This computes both the structure (also known as the JSON schema) and a histogram of occurrences of individual values in the training data.

In [6]:
sch = JsonGrinder.schema(x_train)

[Dict]  # updated = 100
  ├─── lumo: [Scalar - Float64], 98 unique values  # updated = 100
  ├─── inda: [Scalar - Int64], 1 unique values  # updated = 100
  ⋮
  └── atoms: [List]  # updated = 100
               └── [Dict]  # updated = 2529
                     ⋮

Then we use it to create the extractor, which converts JSONs to Mill structures.
The `suggestextractor` is executed below with default settings, but it allows heavy customization.

In [7]:
extractor = suggestextractor(sch)

[34mDict[39m
[34m  ├─── lumo: [39m[39mCategorical d = 99
[34m  ├─── inda: [39m[39mCategorical d = 2
[34m  ⋮[39m
[34m  └── atoms: [39m[31mArray of[39m
[34m             [39m[31m  └── [39m[32mDict[39m
[34m             [39m[31m      [39m[32m  ⋮[39m

# Create the model
We create a model that reflects the structure of the data.

In [8]:
encoder = reflectinmodel(sch, extractor)

ProductModel ↦ Dense(50 => 10)  # 2 arrays, 510 params, 2.070 KiB
  ├─── lumo: ArrayModel(Dense(99 => 10))  # 2 arrays, 1_000 params, 3.984 KiB
  ├─── inda: ArrayModel(Dense(2 => 10))  # 2 arrays, 30 params, 200 bytes
  ├─── logp: ArrayModel(Dense(63 => 10))  # 2 arrays, 640 params, 2.578 KiB
  ├─── ind1: ArrayModel(Dense(3 => 10))  # 2 arrays, 40 params, 240 bytes
  └── atoms: BagModel ↦ BagCount([SegmentedMean(10); SegmentedMax(10)]) ↦ Dense(21 => 10)  # 4 arrays, 240 params, 1.094 KiB
               └── ProductModel ↦ Dense(31 => 10)  # 2 arrays, 320 params, 1.328 KiB
                     ⋮

This allows us to create the model flexibly, without the need to hardcode individual layers.
Individual arguments of `reflectinmodel` are explained in [Mill.jl documentation](https://CTUAvastLab.github.io/Mill.jl/stable/manual/reflectin/#Model-Reflection).
But briefly: for every numeric array in the sample, the model creates a dense layer (with 10 output neurons in this example, as the printout above shows).
For every vector of observations (called a bag in Multiple Instance Learning terminology), it creates an aggregation function which takes the mean and maximum of the feature vectors and concatenates them.
The `fsm` keyword argument of `reflectinmodel` can specify that the last layer of the network should have 2 output neurons instead of the 10 used in the intermediate layers; we do not use it here.
Instead, we append a layer with 2 outputs, one per class, to the end of the network.

In [9]:
model = Dense(10, 2) ∘ encoder

Dense(10 => 2) ∘ ProductModel ↦ Dense(50 => 10)
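
The mean-and-maximum aggregation over a bag mentioned above can be sketched in plain Julia. This is only an illustration with made-up numbers, not Mill's implementation, and it leaves out the `BagCount` wrapper visible in the model printout.

```julia
using Statistics

# A bag of 3 instances, each a 2-dimensional feature vector stored as a column.
bag = Float32[1 2 3;
              6 5 4]

# Aggregate over instances (columns): mean and maximum of every feature.
bag_mean = vec(mean(bag, dims=2))
bag_max  = vec(maximum(bag, dims=2))

# Concatenate both aggregates into one fixed-size vector describing the bag.
bag_repr = vcat(bag_mean, bag_max)
```

The result has a fixed length (twice the feature dimension) regardless of how many instances the bag holds, which is what allows a dense layer to follow the aggregation.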

We convert the JSONs to Mill data samples. This is a two-class classification problem.
The extractor is callable, so we can broadcast it over the vector of samples to obtain a vector of structures with extracted features.

In [10]:
ds_train = extractor.(x_train)

100-element Vector{Mill.ProductNode{NamedTuple{(:lumo, :inda, :logp, :ind1, :atoms), Tuple{Mill.ArrayNode{Flux.OneHotArray{UInt32, 0x00000063, 1, 2, Vector{UInt32}}, Nothing}, Mill.ArrayNode{Flux.OneHotArray{UInt32, 0x00000002, 1, 2, Vector{UInt32}}, Nothing}, Mill.ArrayNode{Flux.OneHotArray{UInt32, 0x0000003f, 1, 2, Vector{UInt32}}, Nothing}, Mill.ArrayNode{Flux.OneHotArray{UInt32, 0x00000003, 1, 2, Vector{UInt32}}, Nothing}, Mill.BagNode{Mill.ProductNode{NamedTuple{(:element, :bonds, :charge, :atom_type), Tuple{Mill.ArrayNode{Flux.OneHotArray{UInt32, 0x00000007, 1, 2, Vector{UInt32}}, Nothing}, Mill.BagNode{Mill.ProductNode{NamedTuple{(:element, :bond_type, :charge, :atom_type), Tuple{Mill.ArrayNode{Flux.OneHotArray{UInt32, 0x00000007, 1, 2, Vector{UInt32}}, Nothing}, Mill.ArrayNode{Flux.OneHotArray{UInt32, 0x00000004, 1, 2, Vector{UInt32}}, Nothing}, Mill.ArrayNode{Matrix{Float32}, Nothing}, Mill.ArrayNode{Flux.OneHotArray{UInt32, 0x0000001d, 1, 2, Vector{UInt32}}, Nothing}}}, Nothi

# Train the model
Then we define a few handy functions and a loss function, which is logit binary cross-entropy in our case.
We add 1 to the labels because the labels are in {0,1}, while `Flux.onecold` of the model output returns indices in the {1,2} range.

In [11]:
loss(ds, y) = Flux.Losses.logitbinarycrossentropy(model(ds), Flux.onehotbatch(y .+ 1, 1:2))
accuracy(ds, y) = mean(Flux.onecold(model(ds)) .== y .+ 1)

accuracy (generic function with 1 method)
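
The label shift and the accuracy computation can be traced by hand on a toy example. The sketch below uses plain Julia instead of `Flux.onehotbatch` and `Flux.onecold`, and the labels and scores are made-up numbers, not outputs of the model above.

```julia
using Statistics

labels = [0, 1, 1, 0]     # labels in {0, 1}, as stored in the dataset
classes = labels .+ 1     # shifted to 1-based class indices in {1, 2}

# One-hot encoding: entry (k, i) is true when sample i belongs to class k.
onehot = [c == k for k in 1:2, c in classes]

# Hypothetical model scores, one column per sample (made-up numbers).
scores = [2.0 0.1 0.3 0.2;
          0.5 1.5 0.9 1.0]

# "One-cold" decoding: index of the largest score in each column.
predicted = [argmax(col) for col in eachcol(scores)]

# Accuracy is the fraction of samples whose predicted class matches the label.
acc = mean(predicted .== classes)
```

Here the last sample is predicted as class 2 although its label is class 1, so three of the four predictions are correct.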

We prepare the optimizer.

In [12]:
opt = AdaBelief()
ps = Flux.params(model)

Params([Float32[-0.026599297 0.31268165 … 0.015112174 0.041570194; 0.07317401 0.35801336 … -0.47258484 -0.5491943], Float32[0.0, 0.0], Float32[-0.021358307 0.011042915 … 0.012058655 -0.12661397; -0.15317094 -0.007966665 … -0.1895075 -0.22045441; … ; -0.21869266 -0.08064376 … 0.11733952 0.19284187; 0.012676428 -0.19030786 … -0.059899725 0.10005368], Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], Float32[0.48929775 -0.18842651; -0.21674016 0.0628332; … ; -0.14078076 0.2071043; -0.61794806 0.2987844], Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], Float32[-0.06632891 -0.2691281 … -0.13112901 0.119861856; 0.28235346 -0.2236816 … 0.07369621 0.05193251; … ; 0.2619118 0.09443781 … 0.00561584 0.0068649817; -0.21950115 -0.200245 … -0.01521892 0.17143756], Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], Float32[0.37894922 -0.6376908 -0.45976058; 0.13071936 0.40276232 -0.13780044; … ; -0.39515987 -0.17779931 -0.18430126; -0.57551384 0.17398533 0.23547177], Float

Lastly, we turn our training data into minibatches, and we can start training.

In [13]:
data_loader = Flux.Data.DataLoader((ds_train, y_train), batchsize=BATCH_SIZE, shuffle=true)

MLUtils.DataLoader{Tuple{Vector{Mill.ProductNode{NamedTuple{(:lumo, :inda, :logp, :ind1, :atoms), Tuple{Mill.ArrayNode{Flux.OneHotArray{UInt32, 0x00000063, 1, 2, Vector{UInt32}}, Nothing}, Mill.ArrayNode{Flux.OneHotArray{UInt32, 0x00000002, 1, 2, Vector{UInt32}}, Nothing}, Mill.ArrayNode{Flux.OneHotArray{UInt32, 0x0000003f, 1, 2, Vector{UInt32}}, Nothing}, Mill.ArrayNode{Flux.OneHotArray{UInt32, 0x00000003, 1, 2, Vector{UInt32}}, Nothing}, Mill.BagNode{Mill.ProductNode{NamedTuple{(:element, :bonds, :charge, :atom_type), Tuple{Mill.ArrayNode{Flux.OneHotArray{UInt32, 0x00000007, 1, 2, Vector{UInt32}}, Nothing}, Mill.BagNode{Mill.ProductNode{NamedTuple{(:element, :bond_type, :charge, :atom_type), Tuple{Mill.ArrayNode{Flux.OneHotArray{UInt32, 0x00000007, 1, 2, Vector{UInt32}}, Nothing}, Mill.ArrayNode{Flux.OneHotArray{UInt32, 0x00000004, 1, 2, Vector{UInt32}}, Nothing}, Mill.ArrayNode{Matrix{Float32}, Nothing}, Mill.ArrayNode{Flux.OneHotArray{UInt32, 0x0000001d, 1, 2, Vector{UInt32}}, Noth
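
What shuffled minibatching does can be sketched with the standard library alone. This is an illustration of the idea, not the `DataLoader` implementation; the sample count and the names `order` and `batches` are assumptions made for the example.

```julia
using Random

rng = MersenneTwister(42)          # fixed seed so the shuffle is repeatable
n_samples, batch_size = 25, 10     # made-up sizes for illustration

# Shuffle the sample indices, then cut them into consecutive chunks.
order = shuffle(rng, 1:n_samples)
batches = collect(Iterators.partition(order, batch_size))

# Every index appears in exactly one batch; the last batch may be smaller.
batch_lengths = length.(batches)
```

With 25 samples and a batch size of 10, this yields two full batches and one batch of 5.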

We can see the accuracy rising and reaching over 80% quite quickly.

In [14]:
Flux.@epochs 3 begin
    Flux.Optimise.train!(loss, ps, data_loader, opt)
    @show accuracy(ds_train, y_train)
end

[ Info: Epoch 1
accuracy(ds_train, y_train) = 0.82
[ Info: Epoch 2
accuracy(ds_train, y_train) = 0.83
[ Info: Epoch 3
accuracy(ds_train, y_train) = 0.84


# Classify test set
The last part is inference and evaluation on the test data.

In [15]:
x_test, y_test = MLDatasets.Mutagenesis.testdata();
ds_test = extractor.(x_test)

44-element Vector{Mill.ProductNode{NamedTuple{(:lumo, :inda, :logp, :ind1, :atoms), Tuple{Mill.ArrayNode{Flux.OneHotArray{UInt32, 0x00000063, 1, 2, Vector{UInt32}}, Nothing}, Mill.ArrayNode{Flux.OneHotArray{UInt32, 0x00000002, 1, 2, Vector{UInt32}}, Nothing}, Mill.ArrayNode{Flux.OneHotArray{UInt32, 0x0000003f, 1, 2, Vector{UInt32}}, Nothing}, Mill.ArrayNode{Flux.OneHotArray{UInt32, 0x00000003, 1, 2, Vector{UInt32}}, Nothing}, Mill.BagNode{Mill.ProductNode{NamedTuple{(:element, :bonds, :charge, :atom_type), Tuple{Mill.ArrayNode{Flux.OneHotArray{UInt32, 0x00000007, 1, 2, Vector{UInt32}}, Nothing}, Mill.BagNode{Mill.ProductNode{NamedTuple{(:element, :bond_type, :charge, :atom_type), Tuple{Mill.ArrayNode{Flux.OneHotArray{UInt32, 0x00000007, 1, 2, Vector{UInt32}}, Nothing}, Mill.ArrayNode{Flux.OneHotArray{UInt32, 0x00000004, 1, 2, Vector{UInt32}}, Nothing}, Mill.ArrayNode{Matrix{Float32}, Nothing}, Mill.ArrayNode{Flux.OneHotArray{UInt32, 0x0000001d, 1, 2, Vector{UInt32}}, Nothing}}}, Nothin

We see that the test set accuracy is also over 80%.

In [16]:
@show accuracy(ds_test, y_test)

probs = softmax(model(ds_test))
o = Flux.onecold(probs)
mean(o .== y_test .+ 1)

accuracy(ds_test, y_test) = 0.8863636363636364


0.8863636363636364

`o` contains the predicted classes for our test set.
As shown above, the accuracy on the test set is about 89%.

In [17]:
o

44-element Vector{Int64}:
 2
 1
 2
 1
 1
 2
 1
 1
 2
 2
 ⋮
 2
 2
 2
 2
 1
 2
 2
 1
 2

Ground-truth classes for the test set:

In [18]:
y_test .+ 1

44-element Vector{Int64}:
 2
 2
 2
 1
 2
 2
 1
 1
 2
 2
 ⋮
 2
 2
 2
 2
 2
 2
 2
 1
 2

Class probabilities for the test set:

In [19]:
probs

2×44 Matrix{Float32}:
 0.00161392  0.66709  0.0494638  0.989868   …  0.0492495  0.702054  0.00384
 0.998386    0.33291  0.950536   0.0101324     0.95075    0.297946  0.99616

We can look at individual samples. For instance, the second sample from the test set is

In [20]:
ds_test[2]

ProductNode  # 1 obs, 104 bytes
  ├─── lumo: ArrayNode(99×1 OneHotArray with Bool elements)  # 1 obs, 60 bytes
  ├─── inda: ArrayNode(2×1 OneHotArray with Bool elements)  # 1 obs, 60 bytes
  ├─── logp: ArrayNode(63×1 OneHotArray with Bool elements)  # 1 obs, 60 bytes
  ├─── ind1: ArrayNode(3×1 OneHotArray with Bool elements)  # 1 obs, 60 bytes
  └── atoms: BagNode  # 1 obs, 136 bytes
               └── ProductNode  # 24 obs, 64 bytes
                     ⋮

and the corresponding ground-truth class is

In [21]:
y_test[2] + 1

2

If you want to see the probability distribution, it can be obtained by applying `softmax` to the output of the network.

In [22]:
softmax(model(ds_test[2]))

2×1 Matrix{Float32}:
 0.66709
 0.33291003

So we can see that the model assigns a probability of more than 60% to the first class for this sample, whereas its ground-truth class shown above is 2, so this sample is misclassified.
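
For reference, the `softmax` function used above can be written out in a few lines of plain Julia. The sketch below is a standalone illustration on made-up logits; the name `softmax_sketch` is hypothetical and is not part of Flux.

```julia
# Numerically stable softmax for a single vector of logits.
function softmax_sketch(logits::AbstractVector{<:Real})
    shifted = logits .- maximum(logits)   # subtracting the max avoids overflow in exp
    e = exp.(shifted)
    return e ./ sum(e)
end

logits = [1.0, 0.0]          # made-up network outputs for two classes
p = softmax_sketch(logits)   # non-negative entries that sum to 1
```

Because the exponential is monotone, the class with the largest logit also gets the largest probability, so `Flux.onecold` returns the same class before and after `softmax`.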

---

*This notebook was generated using [Literate.jl](https://github.com/fredrikekre/Literate.jl).*