# Mutagenesis Example
The following example demonstrates learning to [predict the mutagenicity on Salmonella typhimurium](https://relational.fit.cvut.cz/dataset/Mutagenesis) (the dataset is stored in JSON format [in MLDatasets.jl](https://juliaml.github.io/MLDatasets.jl/stable/datasets/Mutagenesis/) for your convenience).

We start by installing JsonGrinder and a few other packages we need for the example.
The Julia ecosystem follows a philosophy of many small, single-purpose, composable packages,
which may differ from e.g. Python, where we usually use fewer, larger packages.

In [1]:
using Pkg
pkg"add JsonGrinder#master MLDatasets Flux Mill MLDataPattern Statistics"

Here we load all the necessary libraries.

In [2]:
using JsonGrinder, MLDatasets, Flux, Mill, MLDataPattern, Statistics

Here we load all samples.

In [3]:
train_x, train_y = MLDatasets.Mutagenesis.traindata();
test_x, test_y = MLDatasets.Mutagenesis.testdata();

We define some basic parameters for the construction and training of the neural network.
`minibatchsize` is self-explanatory, `iterations` is the number of gradient descent iterations,
and `neurons` is the number of neurons in the hidden layers of each part of the neural network.

In [4]:
minibatchsize = 100
iterations = 5_000
neurons = 20

20

We create the schema of the training data, which is the first important step in using JsonGrinder.
This computes both the structure (also known as the JSON schema) and a histogram of occurrences of individual values in the training data.
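To make the "histogram of occurrences" idea concrete, here is a minimal sketch in plain Julia. It is not how JsonGrinder implements its schema; the toy samples and the helper `value_counts` are hypothetical and only illustrate counting how often each value appears under each key.

```julia
# Hypothetical toy samples; not the real Mutagenesis data.
toy_jsons = [
    Dict("inda" => 0, "ind1" => 1),
    Dict("inda" => 0, "ind1" => 0),
    Dict("inda" => 0, "ind1" => 1),
]

# Count how often each value occurs under each key (hypothetical helper).
function value_counts(samples)
    counts = Dict{String, Dict{Any, Int}}()
    for sample in samples, (key, value) in sample
        per_key = get!(counts, key, Dict{Any, Int}())
        per_key[value] = get(per_key, value, 0) + 1
    end
    return counts
end

counts = value_counts(toy_jsons)
```

Here `counts["inda"]` holds a single distinct value seen three times, which is the kind of information reported as "1 unique values" in a schema printout.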

In [5]:
sch = JsonGrinder.schema(train_x)

[Dict] 	# updated = 100
  ├─── lumo: [Scalar - Float64], 98 unique values 	# updated = 100
  ├─── inda: [Scalar - Int64], 1 unique values 	# updated = 100
  ⋮
  └── atoms: [List] 	# updated = 100
               └── [Dict] 	# updated = 2529
                     ⋮

Then we use it to create the extractor, which converts JSONs to Mill structures.
`suggestextractor` is executed below with default settings, but it allows heavy customization.
We also prepare the list of classes. This classification problem has two classes, but we want to infer that from the labels.

In [6]:
extractor = suggestextractor(sch)
labelnames = unique(train_y)

2-element Vector{Int64}:
 1
 0

# Create the model
We create a model reflecting the structure of the data.

In [7]:
model = reflectinmodel(sch, extractor,
	layer -> Dense(layer, neurons, relu),
	bag -> SegmentedMeanMax(bag),
	fsm = Dict("" => layer -> Dense(layer, length(labelnames))),
)

ProductModel ↦ Dense(100, 2) 	# 2 arrays, 202 params, 888 bytes
  ├─── lumo: ArrayModel(Dense(99, 20, relu)) 	# 2 arrays, 2_000 params, 7.891 KiB
  ├─── inda: ArrayModel(Dense(2, 20, relu)) 	# 2 arrays, 60 params, 320 bytes
  ├─── logp: ArrayModel(Dense(63, 20, relu)) 	# 2 arrays, 1_280 params, 5.078 KiB
  ├─── ind1: ArrayModel(Dense(3, 20, relu)) 	# 2 arrays, 80 params, 400 bytes
  └── atoms: BagModel ↦ [SegmentedMean(20); SegmentedMax(20)] ↦ Dense(40, 20, relu) 	# 4 arrays, 860 params, 3.516 KiB
               └── ProductModel ↦ Dense(61, 20, relu) 	# 2 arrays, 1_240 params, 4.922 KiB
                     ⋮

This allows us to create the model flexibly, without the need to hardcode individual layers.
Individual arguments of `reflectinmodel` are explained in the [Mill.jl documentation](https://CTUAvastLab.github.io/Mill.jl/stable/manual/reflectin/#Model-Reflection). Briefly: for every numeric array in the sample, the model creates a dense layer with `neurons` neurons (20 in this example). For every vector of observations (called a bag in Multiple Instance Learning terminology), it creates an aggregation function that takes the mean and maximum of the feature vectors and concatenates them. The `fsm` keyword argument says that the last layer of the network should have `length(labelnames)` neurons (2 here), not 20 as in the intermediate layers.
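The mean-and-maximum aggregation over a bag can be sketched in plain Julia. This is only an illustration under simplifying assumptions: the toy `bag` matrix and the helper `meanmax` are hypothetical, and the sketch ignores any additional details of Mill's own `SegmentedMeanMax` implementation.

```julia
using Statistics

# Hypothetical bag: 3 instances (columns), each a 2-dimensional feature vector.
bag = Float32[1 2 6;
              4 0 2]

# Concatenate the per-feature mean and maximum over the instances.
meanmax(b) = vcat(mean(b, dims = 2), maximum(b, dims = 2))

aggregated = meanmax(bag)   # 4×1 matrix: means on top, maxima below
```

A bag with any number of instances is reduced to a vector of fixed length (twice the feature dimension), which is why the dense layer after the aggregation in the model above takes 40 inputs for 20-dimensional instance features.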

We convert the JSONs to Mill data samples.
The extractor is callable, so we can broadcast it over a vector of samples to obtain a vector of structures with extracted features.

In [8]:
train_data = extractor.(train_x)
test_data = extractor.(test_x)

44-element Vector{Mill.ProductNode{NamedTuple{(:lumo, :inda, :logp, :ind1, :atoms), Tuple{Mill.ArrayNode{Flux.OneHotArray{UInt32, 0x00000063, 1, 2, Vector{UInt32}}, Nothing}, Mill.ArrayNode{Flux.OneHotArray{UInt32, 0x00000002, 1, 2, Vector{UInt32}}, Nothing}, Mill.ArrayNode{Flux.OneHotArray{UInt32, 0x0000003f, 1, 2, Vector{UInt32}}, Nothing}, Mill.ArrayNode{Flux.OneHotArray{UInt32, 0x00000003, 1, 2, Vector{UInt32}}, Nothing}, Mill.BagNode{Mill.ProductNode{NamedTuple{(:element, :bonds, :charge, :atom_type), Tuple{Mill.ArrayNode{Flux.OneHotArray{UInt32, 0x00000007, 1, 2, Vector{UInt32}}, Nothing}, Mill.BagNode{Mill.ProductNode{NamedTuple{(:element, :bond_type, :charge, :atom_type), Tuple{Mill.ArrayNode{Flux.OneHotArray{UInt32, 0x00000007, 1, 2, Vector{UInt32}}, Nothing}, Mill.ArrayNode{Flux.OneHotArray{UInt32, 0x00000004, 1, 2, Vector{UInt32}}, Nothing}, Mill.ArrayNode{Matrix{Float32}, Nothing}, Mill.ArrayNode{Flux.OneHotArray{UInt32, 0x0000001d, 1, 2, Vector{UInt32}}, Nothing}}}, Nothin
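The pattern of a callable object broadcast over a vector of samples can be shown with a small, self-contained sketch. `ToyExtractor` and the toy samples are hypothetical and far simpler than JsonGrinder's real extractors; the sketch only shows how the dot syntax applies a callable to each sample.

```julia
# Hypothetical toy extractor; JsonGrinder's real extractors are far richer.
struct ToyExtractor
    fields::Vector{String}
end

# Make the struct callable: pull the listed fields out as a Float32 vector.
(e::ToyExtractor)(sample::AbstractDict) = Float32[sample[k] for k in e.fields]

toy_extractor = ToyExtractor(["lumo", "logp"])
toy_samples = [
    Dict("lumo" => -1.2, "logp" => 4.2),
    Dict("lumo" => -1.1, "logp" => 2.0),
]

# The dot applies the extractor to each sample separately.
toy_features = toy_extractor.(toy_samples)
```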

# Train the model
Then we define a few helper functions and a loss function, which is categorical cross-entropy in our case.

In [9]:
loss(x,y) = Flux.logitcrossentropy(model(x), Flux.onehotbatch(y, labelnames))
accuracy(x,y) = mean(labelnames[Flux.onecold(model(x))] .== y)
loss(xy::Tuple) = loss(xy...)

loss (generic function with 2 methods)
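For intuition, the cross-entropy computed from raw network outputs (logits) can be written out in plain Julia. This is a sketch, not Flux's implementation: the toy logits and the helpers `onehot`, `logsoftmax_cols` and `crossentropy_from_logits` are hypothetical names, and averaging over samples is assumed.

```julia
# Hypothetical logits for 2 classes (rows) and 3 samples (columns).
toy_logits = Float32[2 0 -1;
                     0 0  3]
toy_labels = [1, 0, 0]
toy_classes = [1, 0]      # row i of the logits corresponds to toy_classes[i]

# One-hot encode: entry (i, j) is true where label j equals class i.
onehot(labels, classes) = classes .== permutedims(labels)

# Numerically stable log-softmax over each column.
function logsoftmax_cols(z)
    shifted = z .- maximum(z, dims = 1)
    return shifted .- log.(sum(exp.(shifted), dims = 1))
end

# Mean over samples of the negative log-probability of the true class.
crossentropy_from_logits(z, y) = -sum(y .* logsoftmax_cols(z)) / size(z, 2)

toy_loss = crossentropy_from_logits(toy_logits, onehot(toy_labels, toy_classes))
```

Working on logits directly avoids computing `softmax` and then taking its logarithm, which is the usual reason to prefer a logit-based loss during training.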

We add a callback that prints the training and test accuracy during training.

In [10]:
cb = () -> begin
	train_acc = accuracy(train_data, train_y)
	test_acc = accuracy(test_data, test_y)
	println("accuracy: train = $train_acc, test = $test_acc")
end

#7 (generic function with 1 method)

Lastly, we turn our training data into minibatches and start training.

In [11]:
minibatches = RandomBatches((train_data, train_y), size = minibatchsize, count = iterations)
Flux.Optimise.train!(loss, Flux.params(model), minibatches, ADAM(), cb = Flux.throttle(cb, 2))

accuracy: train = 0.22, test = 0.11363636363636363
accuracy: train = 0.95, test = 0.8409090909090909
accuracy: train = 1.0, test = 0.8181818181818182
accuracy: train = 1.0, test = 0.7954545454545454
accuracy: train = 1.0, test = 0.7954545454545454
accuracy: train = 1.0, test = 0.8181818181818182
accuracy: train = 1.0, test = 0.7727272727272727
accuracy: train = 1.0, test = 0.7954545454545454
accuracy: train = 1.0, test = 0.7954545454545454
accuracy: train = 1.0, test = 0.8181818181818182
accuracy: train = 1.0, test = 0.7954545454545454
accuracy: train = 1.0, test = 0.7954545454545454
accuracy: train = 1.0, test = 0.7954545454545454
accuracy: train = 1.0, test = 0.7954545454545454
accuracy: train = 1.0, test = 0.7954545454545454
accuracy: train = 1.0, test = 0.7954545454545454
accuracy: train = 1.0, test = 0.7727272727272727
accuracy: train = 1.0, test = 0.7954545454545454
accuracy: train = 1.0, test = 0.7954545454545454
accuracy: train = 1.0, test = 0.7954545454545454
accuracy: train =

We can see the accuracy rising quickly, reaching over 98% on the training set and over 70% on the test set.
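The minibatches used for training above are random subsets of the training data. A rough sketch of the idea in plain Julia follows; it assumes uniform sampling with replacement, and the helper `random_batches` with its keyword names is hypothetical rather than MLDataPattern's actual code.

```julia
using Random

# Hypothetical helper: draw `nbatches` batches of `batchsize` observations each,
# sampling indices uniformly with replacement (an assumption of this sketch).
function random_batches(x, y; batchsize, nbatches, rng = Random.default_rng())
    batches = Vector{Tuple{typeof(x), typeof(y)}}()
    for _ in 1:nbatches
        idx = rand(rng, eachindex(x), batchsize)
        push!(batches, (x[idx], y[idx]))   # inputs and labels stay aligned
    end
    return batches
end

toy_xs = collect(1:10)
toy_ys = toy_xs .% 2
toy_batches = random_batches(toy_xs, toy_ys; batchsize = 4, nbatches = 3,
                             rng = MersenneTwister(0))
```

Each batch is a tuple of inputs and labels, which matches the `loss(xy::Tuple)` method defined earlier.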

# Classify test set
The last part is inference on the test data.

In [12]:
probs = softmax(model(test_data))
o = Flux.onecold(probs)
pred_classes = labelnames[o]
mean(pred_classes .== test_y)

0.7954545454545454

`pred_classes` contains the predictions for our test set.
We see that the accuracy on the test set is about 80% in this run.
The predicted classes for the test set are:

In [13]:
pred_classes

44-element Vector{Int64}:
 1
 1
 0
 0
 1
 1
 1
 0
 1
 1
 ⋮
 1
 1
 1
 1
 1
 1
 0
 1
 1

The ground truth classes for the test set are:

In [14]:
test_y

44-element Vector{Int64}:
 1
 1
 1
 0
 1
 1
 0
 0
 1
 1
 ⋮
 1
 1
 1
 1
 1
 1
 1
 0
 1

The predicted probabilities for the test set are:

In [15]:
probs

2×44 Matrix{Float32}:
 0.999977    0.998857    0.00595558  …  0.311291  0.979187   0.999737
 2.35166f-5  0.00114247  0.994044       0.688709  0.0208131  0.000262501

We can also look at individual samples. For instance, the second sample from the test set is

In [16]:
test_data[2]

ProductNode 	# 1 obs, 104 bytes
  ├─── lumo: ArrayNode(99×1 OneHotArray with Bool elements) 	# 1 obs, 60 bytes
  ├─── inda: ArrayNode(2×1 OneHotArray with Bool elements) 	# 1 obs, 60 bytes
  ├─── logp: ArrayNode(63×1 OneHotArray with Bool elements) 	# 1 obs, 60 bytes
  ├─── ind1: ArrayNode(3×1 OneHotArray with Bool elements) 	# 1 obs, 60 bytes
  └── atoms: BagNode 	# 1 obs, 136 bytes
               └── ProductNode 	# 24 obs, 64 bytes
                     ⋮

and the corresponding classification is

In [17]:
pred_classes[2]

1

If you want to see the probability distribution, it can be obtained by applying `softmax` to the output of the network.

In [18]:
softmax(model(test_data[2]))

2×1 Matrix{Float32}:
 0.9988575
 0.0011424734

So we can see that the probability that the given sample is mutagenic (class `1`) is almost 1.
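How `softmax` turns network outputs into probabilities, and how the predicted class is then picked, can be sketched without Flux. The logits below are made up for illustration, and `stable_softmax` is a hypothetical helper, not the function used above.

```julia
# Hypothetical logits for one sample with two classes.
sample_logits = Float32[4.0, -2.5]
class_names = [1, 0]     # same role as `labelnames`

# Numerically stable softmax: subtract the maximum before exponentiating.
function stable_softmax(z)
    e = exp.(z .- maximum(z))
    return e ./ sum(e)
end

p = stable_softmax(sample_logits)
predicted = class_names[argmax(p)]   # pick the class with the highest probability
```

Because `softmax` preserves the ordering of the logits, taking the `argmax` of the probabilities gives the same class as taking it on the raw outputs.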

---

*This notebook was generated using [Literate.jl](https://github.com/fredrikekre/Literate.jl).*