## A gentle introduction to creation of neural networks reflexing structure of JSON documents 

This notebook serves as an introduction to Mill and JsonGrinder libraries. The former provides support for Multi-instance learning problems, their cascades, and their Cartesian product (see the paper... for theoretical explanation). The latter *JsonGrinder* simplifies processing of JSON documents. It allows to infer schema of JSON documents from which it suggests an extractor to convert JSON document to a structure in a *Mill*.*JsonGrinder* defines basic set of "extractors" converting values of keys to numeric representation (matrices) or to convert them to corresponding structures in *Mill*. Naturally, this set of extractors can be extended.

Below, the intended workflow is demonstrated on a simple problem of guessing type of a cuisine from a list of ingrediences. Note that the goal is not to achieve state of the art, but to demonstrate the workflow.

**Caution**
To decrease the computational load in OceanCode, we decrease the number samples (200), size of the minibatch (10), and size of the validation data (100). Of course these numbers are useless in practice, and therefore the resulting accuracy is poor. Using all samples (666920), setting minibatch size to 100, and leaving 1000 samples for validation / testing gives you accuracy 0.74 on validation data.

In [1]:
#nsamples, minisize, vsamples = 200, 10, 100 
nsamples, minisize, vsamples = 666920, 100, 1000

(666920, 100, 1000)

### Preparing the environment
Let's start by replicating the development environment and importing libraries

In [2]:
using Pkg
Pkg.activate("JsonGrinder.jl")
Pkg.instantiate()
using Revise, Flux, MLDataPattern, Mill, JsonGrinder, JSON, Statistics, Adapt

┌ Info: new environment will be placed at /Users/tpevny/Work/Julia/Pkg/JsonGrinder/examples/JsonGrinder.jl
└ @ Pkg.API /Users/osx/buildbot/slave/package_osx64/build/usr/share/julia/stdlib/v0.7/Pkg/src/API.jl:575


[32m[1m  Updating[22m[39m registry at `~/.julia/registries/General`
[32m[1m  Updating[22m[39m git-repo `https://github.com/JuliaRegistries/General.git`
[?25l[2K[?25h[32m[1m Resolving[22m[39m package versions...
loaded


┌ Info: Precompiling JsonGrinder [d201646e-a9c0-11e8-1063-23b139159713]
└ @ Base loading.jl:1187


### Preparing data
Data are stored in a format "json per line". This means that each sample is one JSON document stored in each line. These samples are loaded and parsed to an array. On the end, one sample is printed to show, how data looks like.

In [3]:
samples = open("recipes.json","r") do fid 
	Array{Dict}(JSON.parse(read(fid, String)))
end;
samples = samples[1:nsamples]
JSON.print(samples[1],2)

BoundsError: BoundsError: attempt to access 39774-element Array{Dict,1} at index [1:666920]

Unline XML or ProtoBuf, JSON documents do not have any schema. Threfore *JsonGrinder* attempts to infer the schema, which is then used to recommend the extractor.

In [4]:
schema = JsonGrinder.schema(samples)

[34m[Dict][39m
[34m  ├── [39m[39m    cuisine: [Scalar - String], 20 unique values, updated = 39774
[34m  ├── [39m[39m         id: [Scalar - Int64], 1000 unique values, updated = 39774
[34m  └── [39m[31mingredients: [List] (updated = 39774)[39m
[34m      [39m[31m  └── [39m[39m[Scalar - String], 1000 unique values, updated = 428275


From the schema, we can create extractor. 

ID is deleted from the schema (keys not in the schema are not reflected into extractor and hence not propagated into dataset).

In [5]:
delete!(schema.childs,"id");
extractor = JsonGrinder.suggestextractor(Float32,schema,20)

[34m  ├── [39m[39m    cuisine: String
[34m  └── [39m[39mingredients: Array of [39mString


Since cuisine is a class-label, the extractor needs to be split into two. `extract_data` will extract the sample and `extract_target` will extract the target.

In [6]:
extract_data = JsonGrinder.ExtractBranch(nothing,deepcopy(extractor.other));
extract_target = JsonGrinder.ExtractBranch(nothing,deepcopy(extractor.other));
delete!(extract_target.other,"ingredients");
delete!(extract_data.other,"cuisine");
extract_target.other["cuisine"] = JsonGrinder.ExtractCategorical(keys(schema.childs["cuisine"]));

Now, `extract_data` is a functor extracting samples and `extract_target` extract targets. Let's first demonstrate extractor of datas.

In [7]:
data = cat(map(extract_data, samples)...)

[34mBagNode with 39774 bag(s)[39m
[34m  └── [39m[39mArrayNode(1, 428275)


The extractor has returned structure containing 39774 samples (bags in MIL nomenclature). In total, 39774 samples contains 428275 instances.

Let's investigate the first samples.

In [8]:
show(data[1].data.data)

["romaine lettuce" "black olives" "grape tomatoes" "garlic" "pepper" "purple onion" "seasoning" "garbanzo beans" "feta cheese crumbles"]

We see that the first sample contain nine instances. But at the moment, they are stored as string, which is difficult to be processed by a machine. This representation might be advantageous, as it saves memory and it can be converted to matrix format just before processing. 

To represent list of ingrediences as vectors, we define function `sentence2ngrams`, which split each ingredient to a set of words and represent each word by a histogram of trigrams. To decrease the number of trigrams, their index is the remainder after division `(modulo)`.

The function is applied on data using `mapdata` function provided by the library.
Be aware that this step might be time consuming...

In [9]:
function sentence2ngrams(ss::Array{T,N}) where {T<:AbstractString,N}
	function f(s)
		x = JsonGrinder.string2ngrams(split(s),3,2057)
		Mill.BagNode(Mill.ArrayNode(x),[1:size(x,2)])
	end
	cat(map(f,ss)...)
end
sentence2ngrams(x) = x

data = Mill.mapdata(sentence2ngrams,data)

[34mBagNode with 39774 bag(s)[39m
[34m  └── [39m[31mBagNode with 428275 bag(s)[39m
[34m      [39m[31m  └── [39m[39mArrayNode(2057, 807760)


Notice that at this moment, the sample consists of two MIL problems. Firstly, each dish is described by a set of ingrediences. Secondly, each ingredience is described by a set of words `e.g. ["black","olives"]`. Finally, each word is represented as a vector of dimension 2057.

Since histograms are sparse, to save memory and improve computational efficiency, we convert the data to SparseMatrix. Notive that the shape of data has not change.

In [10]:
data = Mill.mapdata(i -> Mill.sparsify(Float32.(i),0.05),data)

[34mBagNode with 39774 bag(s)[39m
[34m  └── [39m[31mBagNode with 428275 bag(s)[39m
[34m      [39m[31m  └── [39m[39mArrayNode(2057, 807760)


Before constructing the neural network the number of classes the classifier should recognize. 

In [11]:
target = cat(map(extract_target, samples)...);
odim = size(target.data,1)

20

### Defining the model reflecting the structure of data

Since manually creating a model reflecting the structure can be tedious, Mill support a semi-automated procedure. The function `reflectinmodel` takes as an input data sample and function, which for a given input dimension provides a feed-forward network. In the example below, the function creates a FeedForward network with a single layer with twenty neurons and relu nonlinearinty. 

Layers are wrapped to Chain (from Flux library), such that the last layer can be easily extended by the LinearLayer with appropriate number of output neurons, corresponding to number of target classes.

The structure of the network (output [12]) corresponds to the  structure of input data (Output [10]). You can observe that each module dealing with multiple-instance data contains an aggregation layer with element-wise mean.

In [15]:
m,k = Mill.reflectinmodel(data[1], k -> Chain(Dense(k,80,relu)));
push!(m,Dense(k,odim));
m

[34mBagModel[39m
[34m  ├── [39m[31mBagModel[39m
[34m  │   [39m[31m  ├── [39m[39mChain(Dense(2057, 80, NNlib.relu))
[34m  │   [39m[31m  ├── [39m[39mAggregation((Mill._segmented_mean,))
[34m  │   [39m[31m  └── [39m[39mChain(Dense(80, 80, NNlib.relu))
[34m  ├── [39m[39mAggregation((Mill._segmented_mean,))
[34m  └── [39m[39mChain(Dense(80, 80, NNlib.relu), Dense(80, 20))


### Training the model
Mill library is compatible with MLDataPattern for manipulating with data (training / testing / minibatchsize preparation) and with Flux. Please, refer to thos two libraries for support.

Below, data are first split into training and validation sets. Then Adam optimizer for training the model is initialized, and after defining intermediate output the data model is trained.

In [16]:
valdata = data[1:vsamples],target[1:vsamples]
data, target = data[vsamples + 1:nobs(data)], target[vsamples + 1:nobs(target)]
opt = Flux.Optimise.ADAM(params(m))
loss = (x,y) -> Flux.logitcrossentropy(m(getobs(x)).data,getobs(y).data) 
cb = () -> println("accuracy = ",mean(Flux.onecold(Flux.data(m(valdata[1]).data)) .== Flux.onecold(valdata[2].data)))
Flux.Optimise.train!(loss, RandomBatches((data,target),10,10000), opt, cb = Flux.throttle(cb, 10))

accuracy = 0.076
accuracy = 0.32
accuracy = 0.517
accuracy = 0.541
accuracy = 0.57
accuracy = 0.575
accuracy = 0.609
accuracy = 0.591
accuracy = 0.614
accuracy = 0.635
accuracy = 0.639
accuracy = 0.64
accuracy = 0.66
accuracy = 0.667
accuracy = 0.685
accuracy = 0.658
accuracy = 0.674
accuracy = 0.678
accuracy = 0.672
accuracy = 0.66
accuracy = 0.684
accuracy = 0.678
accuracy = 0.675
accuracy = 0.691
accuracy = 0.702
accuracy = 0.695
accuracy = 0.697
accuracy = 0.699
accuracy = 0.712
accuracy = 0.709
accuracy = 0.71
accuracy = 0.708
accuracy = 0.722
accuracy = 0.699
accuracy = 0.71
accuracy = 0.71
accuracy = 0.713
accuracy = 0.707
accuracy = 0.708
accuracy = 0.718
accuracy = 0.719
accuracy = 0.721
accuracy = 0.705
accuracy = 0.718
accuracy = 0.717
accuracy = 0.701
accuracy = 0.713
accuracy = 0.707
accuracy = 0.708
accuracy = 0.727
accuracy = 0.725
accuracy = 0.715
accuracy = 0.721
accuracy = 0.721
accuracy = 0.717
accuracy = 0.719
accuracy = 0.715
accuracy = 0.726
accuracy = 0.718
accur

### Reporting accuracy on validation data

In [14]:
println("accuracy = ",mean(Flux.onecold(Flux.data(m(valdata[1]).data)) .== Flux.onecold(valdata[2].data)))

accuracy = 0.684
