# Mutagenesis Example
The following example demonstrates learning to [predict mutagenicity on Salmonella typhimurium](https://relational.fit.cvut.cz/dataset/Mutagenesis) (the dataset is stored in JSON format [in MLDatasets.jl](https://juliaml.github.io/MLDatasets.jl/stable/datasets/Mutagenesis/) for your convenience).

Here we load all the necessary libraries.

In [1]:
using MLDatasets, JsonGrinder, Flux, Mill, MLDataPattern, Statistics, ChainRulesCore

Here we load all samples.

In [2]:
train_x, train_y = MLDatasets.Mutagenesis.traindata();
test_x, test_y = MLDatasets.Mutagenesis.testdata();

We define some basic parameters for the construction and training of the neural network.
The minibatch size is self-explanatory, `iterations` is the number of gradient descent iterations,
and `neurons` is the number of neurons in the hidden layers of each part of the neural network.

In [3]:
minibatchsize = 100
iterations = 5_000
neurons = 20

20

We create the schema of the training data, which is the first important step in using JsonGrinder.
This computes both the structure (also known as the JSON schema) and a histogram of occurrences of individual values in the training data.

In [4]:
sch = JsonGrinder.schema(train_x)
extractor = suggestextractor(sch)

[34mDict[39m
[34m  ├─── lumo: [39m[39mCategorical d = 99
[34m  ├─── inda: [39m[39mCategorical d = 2
[34m  ⋮[39m
[34m  └── atoms: [39m[31mArray of[39m
[34m             [39m[31m  └── [39m[32mDict[39m
[34m             [39m[31m      [39m[32m  ⋮[39m

The extractor suggested from the schema converts JSONs to Mill structures.
`suggestextractor` is called above with default settings, but it allows heavy customization.
Here we apply the extractor to the training and test samples.

In [5]:
train_data = extractor.(train_x)
test_data = extractor.(test_x)
labelnames = unique(train_y)

@show train_data[1]

train_data[1] = ProductNode


[34mProductNode[39m[90m 	# 1 obs, 104 bytes[39m
[34m  ├─── lumo: [39m[39mArrayNode(99×1 OneHotArray with Bool elements)[90m 	# 1 obs, 60 bytes[39m
[34m  ├─── inda: [39m[39mArrayNode(2×1 OneHotArray with Bool elements)[90m 	# 1 obs, 60 bytes[39m
[34m  ├─── logp: [39m[39mArrayNode(63×1 OneHotArray with Bool elements)[90m 	# 1 obs, 60 bytes[39m
[34m  ├─── ind1: [39m[39mArrayNode(3×1 OneHotArray with Bool elements)[90m 	# 1 obs, 60 bytes[39m
[34m  └── atoms: [39m[31mBagNode[39m[90m 	# 1 obs, 136 bytes[39m
[34m             [39m[31m  └── [39m[32mProductNode[39m[90m 	# 26 obs, 64 bytes[39m
[34m             [39m[31m      [39m[32m  ⋮[39m

# Create the model

In [6]:
model = reflectinmodel(sch, extractor,
	layer -> Dense(layer, neurons, relu),
	bag -> SegmentedMeanMax(bag),
	fsm = Dict("" => layer -> Dense(layer, length(labelnames))),
)

ProductModel ↦ ArrayModel(Dense(100, 2)) 	# 2 arrays, 202 params, 888 bytes
  ├─── lumo: ArrayModel(Dense(99, 20, relu)) 	# 2 arrays, 2_000 params, 7.891 KiB
  ├─── inda: ArrayModel(Dense(2, 20, relu)) 	# 2 arrays, 60 params, 320 bytes
  ├─── logp: ArrayModel(Dense(63, 20, relu)) 	# 2 arrays, 1_280 params, 5.078 KiB
  ├─── ind1: ArrayModel(Dense(3, 20, relu)) 	# 2 arrays, 80 params, 400 bytes
  └── atoms: BagModel ↦ [SegmentedMean(20); SegmentedMax(20)] ↦ ArrayModel(Dense(40, 20, relu)) 	# 4 arrays, 860 params, 3.516 KiB
               └── ProductModel ↦ ArrayModel(Dense(61, 20, relu)) 	# 2 arrays, 1_240 params, 4.922 KiB
                     ⋮

# Train the model
Let's define the loss and some helper functions.

In [7]:
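# cross-entropy between the model's logits and one-hot encoded labels
loss(x,y) = Flux.logitcrossentropy(inference(x), Flux.onehotbatch(y, labelnames))
# raw model output (logits) for a single Mill node or a vector of nodes
inference(x::AbstractMillNode) = model(x).data
inference(x::AbstractVector{<:AbstractMillNode}) = inference(reduce(catobs, x))
# fraction of samples whose predicted label matches the true label
accuracy(x,y) = mean(labelnames[Flux.onecold(inference(x))] .== y)
loss(xy::Tuple) = loss(xy...)
# concatenating samples into a batch does not need to be differentiated
@non_differentiable Base.reduce(catobs, x::AbstractVector{<:AbstractMillNode})
# callback printing train and test accuracy during training
cb = () -> begin
	train_acc = accuracy(train_data, train_y)
	test_acc = accuracy(test_data, test_y)
	println("accuracy: train = $train_acc, test = $test_acc")
end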

#9 (generic function with 1 method)
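
For readers unfamiliar with a cross-entropy loss computed directly from logits, the following is a minimal plain-Julia sketch of the quantity involved. The names `logsoftmax_cols` and `logit_crossentropy` are hypothetical and this is not Flux's implementation; it assumes one column per observation and averages over observations.

```julia
using Statistics

# Numerically stable log-softmax applied to each column (one column per observation).
function logsoftmax_cols(logits::AbstractMatrix)
    shifted = logits .- maximum(logits, dims = 1)
    return shifted .- log.(sum(exp.(shifted), dims = 1))
end

# Cross-entropy between logits and one-hot targets, averaged over observations.
logit_crossentropy(logits, targets) =
    mean(-sum(targets .* logsoftmax_cols(logits), dims = 1))

logits  = Float32[2 0; 0 2]    # toy logits: 2 classes × 2 observations
targets = Float32[1 0; 0 1]    # one-hot targets matching the larger logit
l = logit_crossentropy(logits, targets)
```

Working on logits rather than on probabilities avoids taking the logarithm of values that a separate softmax may have rounded to zero.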

Create the minibatches and train the model.

In [8]:
minibatches = RandomBatches((train_data, train_y), size = minibatchsize, count = iterations)
Flux.Optimise.train!(loss, Flux.params(model), minibatches, ADAM(), cb = Flux.throttle(cb, 2))

accuracy: train = 0.45, test = 0.29545454545454547
accuracy: train = 0.72, test = 0.6818181818181818
accuracy: train = 0.81, test = 0.8863636363636364
accuracy: train = 0.82, test = 0.8863636363636364
accuracy: train = 0.82, test = 0.8863636363636364
accuracy: train = 0.82, test = 0.8863636363636364
accuracy: train = 0.82, test = 0.8863636363636364
accuracy: train = 0.82, test = 0.8863636363636364
accuracy: train = 0.83, test = 0.8863636363636364
accuracy: train = 0.83, test = 0.8863636363636364
accuracy: train = 0.83, test = 0.8863636363636364
accuracy: train = 0.83, test = 0.8863636363636364
accuracy: train = 0.85, test = 0.8863636363636364
accuracy: train = 0.86, test = 0.8863636363636364
accuracy: train = 0.86, test = 0.8863636363636364
accuracy: train = 0.86, test = 0.8863636363636364
accuracy: train = 0.86, test = 0.8863636363636364
accuracy: train = 0.88, test = 0.8863636363636364
accuracy: train = 0.88, test = 0.8863636363636364
accuracy: train = 0.88, test = 0.8863636363636364
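
As a rough picture of what a stream of random minibatches looks like, here is a plain-Julia sketch that draws a fixed number of batches of random observation indices. The helper `random_batches` is hypothetical, and drawing indices with replacement is an assumption of this sketch; it is not meant to mirror the exact sampling behavior of `RandomBatches`.

```julia
using Random

# Draw `count` batches, each holding `batchsize` random observation indices
# from 1:n (sampled with replacement, an assumption of this sketch).
random_batches(rng, n; batchsize, count) =
    [rand(rng, 1:n, batchsize) for _ in 1:count]

rng = MersenneTwister(0)
xs = collect(1:10)            # toy observations
ys = xs .% 2                  # toy labels
batches = random_batches(rng, length(xs); batchsize = 4, count = 3)

# Pair the selected observations with their labels, one tuple per minibatch.
toy_minibatches = [(xs[idx], ys[idx]) for idx in batches]
```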

# Classify test set

In [9]:
probs = softmax(inference(test_data))
o = Flux.onecold(probs)
pred_classes = labelnames[o]
mean(pred_classes .== test_y)

0.8863636363636364

We see that the accuracy on the test set is around 89%.
Predicted classes for the test set:

In [10]:
pred_classes

44-element Vector{Int64}:
 1
 0
 1
 0
 0
 1
 0
 0
 1
 1
 ⋮
 1
 1
 1
 1
 0
 1
 1
 0
 1

Ground truth classes for the test set:

In [11]:
test_y

44-element Vector{Int64}:
 1
 1
 1
 0
 1
 1
 0
 0
 1
 1
 ⋮
 1
 1
 1
 1
 1
 1
 1
 0
 1

Probabilities for the test set:

In [12]:
probs

2×44 Matrix{Float32}:
 0.953016   0.382744  0.953175   0.253781  …  0.946358   0.446038  0.546188
 0.0469837  0.617256  0.0468252  0.746219     0.0536416  0.553962  0.453812
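
The classification step used above (softmax over logits, index of the largest probability, lookup of the label) can be sketched in plain Julia as follows. The helpers `softmax_cols` and `onecold_cols`, the label order, and the toy logits are all hypothetical; this is an illustration, not the Flux implementation.

```julia
using Statistics

# Column-wise softmax with the usual max-shift for numerical stability.
function softmax_cols(logits::AbstractMatrix)
    e = exp.(logits .- maximum(logits, dims = 1))
    return e ./ sum(e, dims = 1)
end

# Index of the most probable class in each column.
onecold_cols(probs::AbstractMatrix) = [argmax(col) for col in eachcol(probs)]

labels = [1, 0]                       # hypothetical label order (row 1 ↦ 1, row 2 ↦ 0)
logits = Float32[3 0 1; 0 2 1.5]      # toy logits: 2 classes × 3 observations
toy_probs = softmax_cols(logits)
toy_pred = labels[onecold_cols(toy_probs)]
toy_truth = [1, 0, 1]
toy_acc = mean(toy_pred .== toy_truth)
```

Because softmax preserves the ordering of the logits within a column, the predicted class is the same whether the argmax is taken over logits or over probabilities.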

---

*This notebook was generated using [Literate.jl](https://github.com/fredrikekre/Literate.jl).*