# Mutagenesis Example
The following example demonstrates learning to [predict mutagenicity on Salmonella typhimurium](https://relational.fit.cvut.cz/dataset/Mutagenesis). For convenience, the dataset is available in JSON format [in MLDatasets.jl](https://juliaml.github.io/MLDatasets.jl/stable/datasets/Mutagenesis/).

In [1]:
using MLDatasets, JsonGrinder, Flux, Mill, MLDataPattern, Statistics, ChainRulesCore

Start by loading all samples and setting the training hyperparameters.

In [2]:
train_x, train_y = MLDatasets.Mutagenesis.traindata();
test_x, test_y = MLDatasets.Mutagenesis.testdata();
minibatchsize = 100
iterations = 5_000
neurons = 20 		# neurons per layer

20

Create the schema and the extractor from the training data.

In [3]:
sch = JsonGrinder.schema(train_x)
extractor = suggestextractor(sch)

Dict
  ├─── lumo: Categorical d = 99
  ├─── inda: Categorical d = 2
  ⋮
  └── atoms: Array of
               └── Dict
                     ⋮

Convert the samples to Mill structures and collect the distinct label names.

In [4]:
train_data = extractor.(train_x)
test_data = extractor.(test_x)
labelnames = unique(train_y)

@show train_data[1]

train_data[1] = ProductNode

ProductNode 	# 1 obs, 104 bytes
  ├─── lumo: ArrayNode(99×1 OneHotArray with Bool elements) 	# 1 obs, 60 bytes
  ├─── inda: ArrayNode(2×1 OneHotArray with Bool elements) 	# 1 obs, 60 bytes
  ├─── logp: ArrayNode(63×1 OneHotArray with Bool elements) 	# 1 obs, 60 bytes
  ├─── ind1: ArrayNode(3×1 OneHotArray with Bool elements) 	# 1 obs, 60 bytes
  └── atoms: BagNode 	# 1 obs, 136 bytes
               └── ProductNode 	# 26 obs, 64 bytes
                     ⋮

# Create the model

In [5]:
model = reflectinmodel(sch, extractor,
	layer -> Dense(layer, neurons, relu),
	bag -> SegmentedMeanMax(bag),
	fsm = Dict("" => layer -> Dense(layer, length(labelnames))),
)

ProductModel ↦ ArrayModel(Dense(100, 2)) 	# 2 arrays, 202 params, 888 bytes
  ├─── lumo: ArrayModel(Dense(99, 20, relu)) 	# 2 arrays, 2_000 params, 7.891 KiB
  ├─── inda: ArrayModel(Dense(2, 20, relu)) 	# 2 arrays, 60 params, 320 bytes
  ├─── logp: ArrayModel(Dense(63, 20, relu)) 	# 2 arrays, 1_280 params, 5.078 KiB
  ├─── ind1: ArrayModel(Dense(3, 20, relu)) 	# 2 arrays, 80 params, 400 bytes
  └── atoms: BagModel ↦ [SegmentedMean(20); SegmentedMax(20)] ↦ ArrayModel(Dense(40, 20, relu)) 	# 4 arrays, 860 params, 3.516 KiB
               └── ProductModel ↦ ArrayModel(Dense(61, 20, relu)) 	# 2 arrays, 1_240 params, 4.922 KiB
                     ⋮

# Train the model
Let's define the loss and some helper functions.

In [6]:
loss(x,y) = Flux.logitcrossentropy(inference(x), Flux.onehotbatch(y, labelnames))
inference(x::AbstractMillNode) = model(x).data
inference(x::AbstractVector{<:AbstractMillNode}) = inference(reduce(catobs, x))
accuracy(x,y) = mean(labelnames[Flux.onecold(inference(x))] .== y)
loss(xy::Tuple) = loss(xy...)
@non_differentiable Base.reduce(catobs, x::AbstractVector{<:AbstractMillNode})
cb = () -> begin
	train_acc = accuracy(train_data, train_y)
	test_acc = accuracy(test_data, test_y)
	println("accuracy: train = $train_acc, test = $test_acc")
end

#9 (generic function with 1 method)
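
To make the roles of `Flux.onehotbatch`, `Flux.logitcrossentropy` and `Flux.onecold` in the cell above more concrete, here is a sketch that reproduces the same computations in plain Julia on made-up logits. The functions `onehot`, `logsoftmax_cols`, `xent` and `argmax_cols` are hypothetical stand-ins written for illustration, not Flux's implementations, and the numbers are toy values.

```julia
using Statistics

# Hypothetical stand-in for Flux.onehotbatch: one column per target.
onehot(y, labels) = Float32[l == yi for l in labels, yi in y]

# Column-wise log-softmax, shifted by the column maximum for numerical stability.
function logsoftmax_cols(z)
    shifted = z .- maximum(z, dims = 1)
    return shifted .- log.(sum(exp.(shifted), dims = 1))
end

# Hypothetical stand-in for Flux.logitcrossentropy: mean cross-entropy over columns.
xent(z, y1h) = mean(-sum(y1h .* logsoftmax_cols(z), dims = 1))

# Hypothetical stand-in for Flux.onecold: row index of the largest entry per column.
argmax_cols(z) = [argmax(col) for col in eachcol(z)]

labels = [1, 0]                       # plays the role of `labelnames`
logits = Float32[2.0 -1.0 0.5;        # one column per sample, one row per label
                 0.0  1.0 1.5]
y = [1, 0, 1]                         # true classes

l = xent(logits, onehot(y, labels))
acc = mean(labels[argmax_cols(logits)] .== y)
@show l acc
```

The third sample has its largest logit in the second row, so it is decoded as `labels[2]` and counted as an error.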

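Create the minibatches and train the model. The callback, throttled to run at most once every 2 seconds, prints the current training and test accuracy.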

In [7]:
minibatches = RandomBatches((train_data, train_y), size = minibatchsize, count = iterations)
Flux.Optimise.train!(loss, Flux.params(model), minibatches, ADAM(), cb = Flux.throttle(cb, 2))

accuracy: train = 0.42, test = 0.5227272727272727
accuracy: train = 0.62, test = 0.6818181818181818
accuracy: train = 0.78, test = 0.7727272727272727
accuracy: train = 0.84, test = 0.8636363636363636
accuracy: train = 0.85, test = 0.8636363636363636
accuracy: train = 0.85, test = 0.8863636363636364
accuracy: train = 0.85, test = 0.8863636363636364
accuracy: train = 0.84, test = 0.8863636363636364
accuracy: train = 0.85, test = 0.8863636363636364
accuracy: train = 0.85, test = 0.8863636363636364
accuracy: train = 0.85, test = 0.8863636363636364
accuracy: train = 0.85, test = 0.8863636363636364
accuracy: train = 0.85, test = 0.8863636363636364
accuracy: train = 0.85, test = 0.8863636363636364
accuracy: train = 0.86, test = 0.8863636363636364
accuracy: train = 0.86, test = 0.8863636363636364
accuracy: train = 0.86, test = 0.8863636363636364
accuracy: train = 0.87, test = 0.8863636363636364
accuracy: train = 0.88, test = 0.8863636363636364
accuracy: train = 0.87, test = 0.8863636363636364
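
The idea behind drawing `count` random minibatches of a fixed size from paired data can be sketched in plain Julia as follows. The helper `random_batches` is hypothetical and is not MLDataPattern's implementation. It assumes, for simplicity, that indices are drawn uniformly with replacement, and the exact sampling scheme of `RandomBatches` may differ.

```julia
using Random

# Hypothetical helper (not MLDataPattern's implementation): lazily generate
# `count` random batches of `batchsize` paired observations.
# Assumption: indices are drawn uniformly with replacement.
function random_batches(x::AbstractVector, y::AbstractVector;
                        batchsize::Int, count::Int, rng = Random.default_rng())
    @assert length(x) == length(y)
    return (let idx = rand(rng, eachindex(x), batchsize)
                (x[idx], y[idx])       # inputs and targets stay aligned
            end for _ in 1:count)
end

# Toy data in which each target is ten times its input.
xs = collect(1:10)
ys = xs .* 10

batches = collect(random_batches(xs, ys; batchsize = 4, count = 3,
                                 rng = MersenneTwister(0)))
@show length(batches)
```

Because the same index vector selects both inputs and targets, every batch keeps each sample paired with its own label.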


# Classify test set

In [8]:
probs = softmax(inference(test_data))
o = Flux.onecold(probs)
pred_classes = labelnames[o]
print(mean(pred_classes .== test_y))

0.8863636363636364

The accuracy on the test set is about 89% in this run.

In [9]:
# predicted classes for the test set
print(pred_classes)

[1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1]

Ground-truth classes for the test set:

In [10]:
print(test_y)

[1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1]
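
Comparing the two printed vectors element by element shows where the errors fall. The sketch below counts the confusion-matrix entries in plain Julia. The vectors are copied from the outputs printed above, and treating class 1 as the positive class is an assumption made for illustration.

```julia
# Values copied from the printed predictions and ground-truth labels above.
pred  = [1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1,
         1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1]
truth = [1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1,
         1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1]

# Assumption: class 1 is treated as the positive class.
tp = count((pred .== 1) .& (truth .== 1))   # true positives
tn = count((pred .== 0) .& (truth .== 0))   # true negatives
fp = count((pred .== 1) .& (truth .== 0))   # false positives
fn = count((pred .== 0) .& (truth .== 1))   # false negatives

println("TP = $tp, TN = $tn, FP = $fp, FN = $fn")
println("accuracy = ", (tp + tn) / length(truth))
```

On the values printed in this run, all five misclassified samples belong to class 1 and were predicted as class 0.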

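Predicted probabilities for the test set. Each column is one test sample, and the rows follow the order of `labelnames` (in this run the first row corresponds to class 1 and the second to class 0):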

In [11]:
print(probs)

Float32[0.9805322 0.2384057 0.9637127 0.19163147 0.28530887 0.9786587 0.23583542 0.2623012 0.9710335 0.96257037 0.975023 0.1800483 0.26517195 0.9785461 0.97219276 0.26467112 0.975127 0.18752997 0.16941528 0.26280475 0.090576045 0.8618955 0.97741437 0.86254597 0.37131914 0.97425836 0.75622874 0.17619236 0.93809325 0.9727401 0.9505506 0.21099679 0.08683983 0.25472873 0.96059674 0.9785876 0.96645653 0.95931476 0.8683027 0.23834394 0.9742777 0.9728862 0.31394815 0.98282164; 0.019467814 0.7615943 0.03628733 0.80836856 0.71469116 0.021341389 0.7641646 0.73769885 0.028966505 0.037429623 0.02497704 0.8199517 0.7348281 0.021453913 0.027807215 0.7353289 0.024873 0.8124701 0.8305847 0.73719525 0.9094239 0.13810456 0.022585655 0.13745403 0.6286808 0.02574167 0.24377124 0.8238077 0.061906718 0.02725992 0.04944938 0.7890032 0.91316015 0.74527127 0.039403267 0.02141239 0.03354348 0.04068529 0.13169727 0.76165605 0.025722267 0.027113833 0.68605185 0.017178303]

---

*This notebook was generated using [Literate.jl](https://github.com/fredrikekre/Literate.jl).*