## A gentle introduction to creating neural networks that reflect the structure of JSON documents

This notebook serves as an introduction to the Mill (https://github.com/pevnak/Mill.jl) and JsonGrinder (https://github.com/pevnak/JsonGrinder) libraries. The former provides support for multi-instance learning (MIL) problems, their cascades, and their Cartesian products (see the paper... for a theoretical explanation). The latter, *JsonGrinder*, simplifies the processing of JSON documents. It infers a schema from a set of JSON documents and, from that schema, suggests an extractor that converts a JSON document to a structure in *Mill*. *JsonGrinder* defines a basic set of "extractors" that convert the values of keys either to a numeric representation (matrices) or to the corresponding structures in *Mill*. Naturally, this set of extractors can be extended.

Below, the intended workflow is demonstrated on a simple problem: guessing the type of cuisine from a list of ingredients. Note that the goal is not to achieve state-of-the-art results, but to demonstrate the workflow.

Basic knowledge of Julia and the Flux library certainly helps in understanding the notebook.

Let's start by importing the libraries.

In [10]:
using Revise, Flux, MLDataPattern, Mill, JsonGrinder, FluxExtensions, JSON, Statistics, Adapt

Data are stored in a "JSON per line" format, meaning that each sample is one JSON document stored on its own line. These samples are loaded and parsed into an array. At the end, one sample is printed to show what the data look like.

In [2]:
samples = open("recipes.json","r") do fid 
	Array{Dict}(JSON.parse(read(fid, String)))
end;
JSON.print(samples[1],2)

{
  "id": 10259,
  "ingredients": [
    "romaine lettuce",
    "black olives",
    "grape tomatoes",
    "garlic",
    "pepper",
    "purple onion",
    "seasoning",
    "garbanzo beans",
    "feta cheese crumbles"
  ],
  "cuisine": "greek"
}


Unlike XML or ProtoBuf, JSON documents do not have any schema. Therefore, *JsonGrinder* attempts to infer the schema, which is then used to recommend the extractor.

In [3]:
schema = JsonGrinder.schema(samples)

[Dict]
  ├──     cuisine: [Scalar - String], 20 unique values, updated = 39774
  ├──          id: [Scalar - Int64], 1000 unique values, updated = 39774
  └── ingredients: [List] (updated = 39774)
        └── [Scalar - String], 1000 unique values, updated = 428275


From the schema, we can create an extractor.

The `id` key is deleted from the schema first (keys not in the schema are not reflected in the extractor and hence are not propagated to the dataset).

In [4]:
delete!(schema.childs,"id");
extractor = JsonGrinder.suggestextractor(Float32,schema,2000)

  ├──     cuisine: String
  └── ingredients: Array of String


Since cuisine is the class label, the extractor needs to be split into two: `extract_data` will extract the sample and `extract_target` will extract the target.

In [5]:
extract_data = JsonGrinder.ExtractBranch(nothing,deepcopy(extractor.other));
extract_target = JsonGrinder.ExtractBranch(nothing,deepcopy(extractor.other));
delete!(extract_target.other,"ingredients");
delete!(extract_data.other,"cuisine");
extract_target.other["cuisine"] = JsonGrinder.ExtractCategorical(keys(schema.childs["cuisine"]));
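
The target extractor turns each cuisine label into a numeric vector. The following is a minimal sketch of that idea in plain Julia, assuming a one-hot encoding over a fixed list of classes; `onehot_sketch`, `classes`, and `labels` are hypothetical names for illustration and this is not the implementation of `JsonGrinder.ExtractCategorical`.

```julia
# Hypothetical helper, not JsonGrinder.ExtractCategorical: one-hot encode `label`
# with respect to a fixed, ordered list of known `classes`.
onehot_sketch(label, classes) = Float32.(classes .== label)

classes = ["greek", "indian", "italian"]   # toy label set; the dataset has 20 cuisines
labels  = ["italian", "greek", "italian"]  # toy labels of three samples

# One column per sample, one row per class.
targets = reduce(hcat, [onehot_sketch(l, classes) for l in labels])
println(size(targets))
```

With 20 distinct cuisines, an encoding of this kind yields a matrix with 20 rows and one column per sample.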

Now, `extract_data` is a functor extracting samples and `extract_target` is a functor extracting targets. Let's first demonstrate the data extractor.

In [6]:
data = cat(map(extract_data, samples)...)

BagNode with 39774 bag(s)
  └── ArrayNode(1, 428275)


The extractor has returned a structure containing 39774 samples (bags in MIL nomenclature). In total, the 39774 samples contain 428275 instances.
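
The relation between bags and instances can be pictured with plain Julia arrays. The following is a minimal sketch, not Mill's actual `BagNode` implementation: the names `recipes`, `instances`, and `bag_ranges` are hypothetical, and it assumes that all instances are stored contiguously while each bag keeps the index range of its own instances.

```julia
# Hypothetical helper: compute, for every bag, the index range of its instances
# in the flat instance vector.
function bag_ranges(recipes)
    bags = UnitRange{Int}[]
    offset = 0
    for r in recipes
        push!(bags, (offset + 1):(offset + length(r)))
        offset += length(r)
    end
    bags
end

# Toy data: two recipes (bags) with a varying number of ingredients (instances).
recipes = [["romaine lettuce", "black olives", "garlic"], ["pepper", "seasoning"]]

instances = reduce(vcat, recipes)   # all instances in one flat vector
bags = bag_ranges(recipes)          # which slice of `instances` belongs to which bag

println(length(bags), " bags, ", length(instances), " instances")
println(instances[bags[2]])         # instances of the second bag
```

Under this layout, the number of bags and the number of instances are independent quantities, which is why the output above reports both.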

Let's investigate the first sample.

In [7]:
show(data[1].data.data)

["romaine lettuce" "black olives" "grape tomatoes" "garlic" "pepper" "purple onion" "seasoning" "garbanzo beans" "feta cheese crumbles"]

We see that the first sample contains nine instances. At the moment, they are stored as strings, which are difficult for a machine to process. This representation can nevertheless be advantageous, as it saves memory and can be converted to a matrix format just before processing.

To represent the list of ingredients as vectors, we define the function `sentence2ngrams`, which splits each ingredient into a set of words and represents each word by a histogram of trigrams. To decrease the number of distinct trigrams, the index of each trigram is its remainder after division (modulo), here by 2057.

The function is applied to the data using the `mapdata` function provided by the library.
Be aware that this step might be time-consuming.

In [8]:
function sentence2ngrams(ss::Array{T,N}) where {T<:AbstractString,N}
	function f(s)
		x = JsonGrinder.string2ngrams(split(s),3,2057)
		Mill.BagNode(Mill.ArrayNode(x),[1:size(x,2)])
	end
	cat(map(f,ss)...)
end
sentence2ngrams(x) = x

data = Mill.mapdata(sentence2ngrams,data)

BagNode with 39774 bag(s)
  └── BagNode with 428275 bag(s)
        └── ArrayNode(2057, 807760)


Notice that at this point, the sample consists of two nested MIL problems. First, each dish is described by a set of ingredients. Second, each ingredient is described by a set of words, e.g. `["black","olives"]`. Finally, each word is represented as a vector of dimension 2057.
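
To make the word-level representation concrete, here is a minimal sketch of a trigram histogram with modulo indexing in plain Julia. It is an illustration only: `trigram_histogram` is a hypothetical helper, it uses Julia's built-in `hash` to index trigrams, and `JsonGrinder.string2ngrams` may compute indices differently or pad the string.

```julia
# Hypothetical helper, not JsonGrinder.string2ngrams: histogram of the character
# trigrams of `word`, with each trigram hashed into one of `m` buckets.
function trigram_histogram(word::AbstractString, m::Integer)
    h = zeros(Float32, m)
    chars = collect(word)
    for i in 1:length(chars) - 2
        trigram = String(chars[i:i+2])
        idx = Int(hash(trigram) % UInt(m)) + 1   # remainder after division keeps idx in 1:m
        h[idx] += 1
    end
    h
end

# One ingredient becomes a bag of words; each word is a column of dimension m.
words = split("black olives")
x = reduce(hcat, [trigram_histogram(w, 2057) for w in words])
println(size(x))   # (2057, 2)
```

Different trigrams can collide in the same bucket; that is the price paid for a fixed, moderate dimension.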

Since the histograms are sparse, we convert the data to a sparse matrix to save memory and improve computational efficiency. Notice that the shape of the data has not changed.
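
The effect of sparse storage can be sketched with Julia's standard `SparseArrays` library. The helper `sparsify_sketch` below is hypothetical and is not `Mill.sparsify`; in particular, treating the second argument as a threshold on the fraction of nonzero entries is an assumption made for this illustration.

```julia
using SparseArrays

# Hypothetical helper, not Mill.sparsify: switch to sparse storage only when the
# fraction of nonzero entries is below `threshold` (an assumed meaning of 0.05).
function sparsify_sketch(x::AbstractMatrix, threshold::Real)
    density = count(!iszero, x) / length(x)
    density < threshold ? sparse(x) : x
end

x = zeros(Float32, 2057, 3)      # three toy word histograms
x[[10, 500, 2000], 1] .= 1f0     # a word with three distinct trigrams
s = sparsify_sketch(x, 0.05)

println(s isa SparseMatrixCSC, " ", size(s) == size(x), " ", nnz(s))
```

Only the nonzero entries are stored, while the matrix keeps its dimensions.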

In [9]:
data = Mill.mapdata(i -> Mill.sparsify(Float32.(i),0.05),data)

BagNode with 39774 bag(s)
  └── BagNode with 428275 bag(s)
        └── ArrayNode(2057, 807760)


Since manually creating a model reflecting the structure can be tedious, Mill supports a semi-automated procedure. The function `reflectinmodel` takes as input a data sample and a function which, for a given input dimension, provides a feed-forward network. In the example below, this function creates a feed-forward network with a single layer of twenty neurons and the relu nonlinearity. Note that the layer is wrapped in a `Chain` (from the Flux library), so that the last layer can easily be extended by a linear `Dense` layer with 20 output neurons, corresponding to the 20 target classes.
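
What a model of this shape computes for one level of bags can be sketched in plain Julia: a layer applied to every instance, a mean over the instances of each bag, and a layer applied to every bag. This is a rough illustration under stated assumptions, not Mill's implementation; all names ending in `_sketch` are hypothetical and the sizes are toy values.

```julia
using Statistics

relu_sketch(x) = max(x, zero(x))
# A dense layer applied to every column (instance or bag) of x.
dense_sketch(W, b, x) = relu_sketch.(W * x .+ b)
# Mean of the instance columns belonging to each bag; one output column per bag.
segmented_mean_sketch(x, bags) = reduce(hcat, [mean(x[:, b], dims = 2) for b in bags])

# Toy sizes: 5 instances of dimension 4, grouped into 2 bags.
x    = rand(Float32, 4, 5)
bags = [1:2, 3:5]
W1, b1 = randn(Float32, 3, 4), zeros(Float32, 3)   # instance-level layer
W2, b2 = randn(Float32, 2, 3), zeros(Float32, 2)   # bag-level layer

h = dense_sketch(W1, b1, x)           # 3×5: one column per instance
a = segmented_mean_sketch(h, bags)    # 3×2: one column per bag
y = dense_sketch(W2, b2, a)           # 2×2: one output column per bag
println(size(y))
```

For the nested data in this notebook, the same pattern would be applied twice: once to aggregate words into ingredients and once to aggregate ingredients into dishes.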

In [45]:
m,k = Mill.reflectinmodel(data[1], k -> Chain(Dense(k,20,relu)));
push!(m,Dense(k,20));
m

BagModel
  ├── BagModel
  │     ├── Chain(Dense(2057, 20, NNlib.relu))
  │     ├── Aggregation((Mill.segmented_mean,))
  │     └── Chain(Dense(20, 20, NNlib.relu))
  ├── Aggregation((Mill.segmented_mean,))
  └── Chain(Dense(20, 20, NNlib.relu), Dense(20, 20))


Let's extract the targets, which is straightforward.

In [46]:
target = cat(map(extract_target, samples)...)

ArrayNode(20, 39774)


Finally, we are ready to train the model. We reserve the first 1000 samples for validation; the rest will be used for training.

The Mill library is compatible with the Flux infrastructure for training and with the MLDataPattern library for managing training samples. Please refer to those two libraries for support.

In [47]:
opt = Flux.Optimise.ADAM(params(m))
loss = (x,y) -> Flux.logitcrossentropy(m(getobs(x)).data,getobs(y).data) 
valdata = data[1:1000],target[1:1000]
data, target = data[1001:nobs(data)], target[1001:nobs(target)]
cb = () -> println("accuracy = ",mean(Flux.onecold(Flux.data(m(valdata[1]).data)) .== Flux.onecold(valdata[2].data)))
Flux.Optimise.train!(loss, RandomBatches((data,target),100,10000), opt, cb = Flux.throttle(cb, 10))

accuracy = 0.092
accuracy = 0.636
accuracy = 0.657
accuracy = 0.676
accuracy = 0.688
accuracy = 0.701
accuracy = 0.712
accuracy = 0.73
accuracy = 0.722
accuracy = 0.73
accuracy = 0.725
accuracy = 0.736
accuracy = 0.738
accuracy = 0.745
accuracy = 0.733


Finally, we can report the accuracy on the validation data.

In [52]:
mean(Flux.onecold(Flux.data(m(valdata[1]).data)) .== Flux.onecold(valdata[2].data))

0.732
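
The accuracy expression above compares, for every sample, the index of the largest model output with the index of the true class. A minimal sketch of that computation in plain Julia follows; `onecold_sketch` is a hypothetical stand-in for `Flux.onecold`, and the numbers are toy values rather than results from this notebook.

```julia
using Statistics

# Hypothetical stand-in for Flux.onecold: index of the largest entry in each column.
onecold_sketch(y::AbstractMatrix) = [argmax(col) for col in eachcol(y)]

scores  = Float32[0.1 2.0 0.3; 0.9 -1.0 0.2]   # toy model outputs, one column per sample
targets = Float32[0 1 0; 1 0 1]                # toy one-hot targets

# Fraction of samples whose predicted class index equals the true class index.
accuracy = mean(onecold_sketch(scores) .== onecold_sketch(targets))
println(accuracy)
```

Here two of the three toy samples are classified correctly, so the accuracy is 2/3.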