# Recipe Ingredients Example
The following example demonstrates predicting a cuisine from a set of ingredients.

## A gentle introduction to creating neural networks reflecting the structure of JSON documents

This notebook serves as an introduction to the `Mill` and `JsonGrinder` libraries.
The former provides support for multi-instance learning problems, their cascades, and their Cartesian products ([see the paper](https://arxiv.org/abs/2105.09107) for a theoretical explanation).
The latter, `JsonGrinder`, simplifies processing of JSON documents. It infers a schema from JSON documents and, from that schema, suggests an extractor that converts a JSON document to a `Mill` structure.
`JsonGrinder` defines a basic set of "extractors" that convert values of keys either to a numeric representation (matrices) or to the corresponding structures in `Mill`. Naturally, this set of extractors can be extended.

Below, the intended workflow is demonstrated on a simple problem of guessing the type of cuisine from a list of ingredients.
The whole dataset and the problem description can be found [on this Kaggle page](https://www.kaggle.com/kaggle/recipe-ingredients-dataset/home).
Note that the goal is not to achieve state-of-the-art accuracy, but to demonstrate the workflow.

**Caution**

To keep the repository small, we store locally only a subset of the whole dataset (which has `39774` samples).
To decrease the computational load, we use only `5000` samples, a validation set of `100` samples, a minibatch size of `10`, and train for `20` iterations.
These settings are far too small for practical use, so the resulting accuracy is poor.
Using all `39774` samples, leaving `4774` of them for validation, setting the minibatch size to `1000`, and training for `1000` iterations gives an accuracy of 0.73 on the validation data.

In [1]:
n_samples, n_val, minibatchsize, iterations = 5_000, 100, 10, 20

(5000, 100, 10, 20)

The full-dataset settings mentioned in the caution above would be:

n_samples, n_val, minibatchsize, iterations = 39_774, 4_774, 1_000, 1_000

We start by installing JsonGrinder and a few other packages needed for the example.
The Julia ecosystem follows a philosophy of many small, single-purpose, composable packages,
which may differ from, e.g., Python, where fewer, larger packages are the norm.

In [2]:
using Pkg
pkg"add JsonGrinder#master Flux Mill MLDataPattern Statistics JSON"

    Updating git-repo `https://github.com/CTUAvastLab/JsonGrinder.jl.git`
   Resolving package versions...
  No Changes to `~/work/JsonGrinder.jl/JsonGrinder.jl/docs/Project.toml`
  No Changes to `~/work/JsonGrinder.jl/JsonGrinder.jl/docs/Manifest.toml`


Next, we import all the libraries we will need.

In [3]:
using JsonGrinder, Flux, Mill, MLDataPattern, Statistics, JSON

### Preparing data
After importing the libraries, we load all samples. We can afford this only for small datasets, but
for the sake of simplicity we keep the whole dataset in memory, while recognizing that this is usually not
feasible in real-world scenarios.
The code below reads the whole file and parses it into an array in which each element is one JSON document (one sample).
At the end, one sample is printed to show what the data look like.

In [4]:
data_file = "../../../data/recipes.json"
samples = open(data_file,"r") do fid
	Vector{Dict}(JSON.parse(read(fid, String)))
end
JSON.print(samples[1],2)

{
  "id": 10259,
  "ingredients": [
    "romaine lettuce",
    "black olives",
    "grape tomatoes",
    "garlic",
    "pepper",
    "purple onion",
    "seasoning",
    "garbanzo beans",
    "feta cheese crumbles"
  ],
  "cuisine": "greek"
}


Now we create a schema of the JSON documents.
Unlike XML or ProtoBuf, JSON documents do not have any schema by default.
Therefore, `JsonGrinder` attempts to infer the schema, which is then used to recommend the extractor.

In [5]:
sch = JsonGrinder.schema(samples[1:n_samples])

[Dict] 	# updated = 5000
  ├─────────── id: [Scalar - Int64], 5000 unique values 	# updated = 5000
  ├────── cuisine: [Scalar - String], 20 unique values 	# updated = 5000
  └── ingredients: [List] 	# updated = 5000
                     └── [Scalar - String], 3476 unique values 	# updated = 53299

The `id` key is deleted from the schema, because keys missing from the schema are not
reflected in the extractor and hence are not propagated into the dataset.

In [6]:
delete!(sch.childs,:id)

Dict{Symbol, Any} with 2 entries:
  :cuisine     => Entry
  :ingredients => ArrayEntry

From the schema, we can create the extractor.

In [7]:
extractor = suggestextractor(sch)

Dict
  ├────── cuisine: Categorical d = 21
  └── ingredients: Array of
                     └── Categorical d = 3477

Since `cuisine` is the class label we want to predict,
the extractor needs to be split in two.
`extract_data` will extract the sample and `extract_target` will extract the target.

In [8]:
extract_data = ExtractDict(deepcopy(extractor.dict))
extract_target = ExtractDict(deepcopy(extractor.dict))
delete!(extract_target.dict, :ingredients)
delete!(extract_data.dict, :cuisine)

Dict{Symbol, JsonGrinder.AbstractExtractor} with 1 entry:
  :ingredients => ExtractArray

Now, `extract_data` is a functor extracting samples and `extract_target` is a functor extracting targets. Let's apply both to the first sample; only the result of the last expression (the extracted target) is displayed.

In [9]:
extract_data(samples[1])
extract_target(samples[1])[:cuisine]

21×1 Mill.ArrayNode{Flux.OneHotArray{UInt32, 0x00000015, 1, 2, Vector{UInt32}}, Nothing}:
 ⋅
 ⋅
 ⋅
 ⋅
 ⋅
 ⋅
 1
 ⋅
 ⋅
 ⋅
 ⋮
 ⋅
 ⋅
 ⋅
 ⋅
 ⋅
 ⋅
 ⋅
 ⋅
 ⋅

Now we use these extractors to convert JSONs to Mill structures which behave as our datasets.

In [10]:
data = extract_data.(samples[1:n_samples])
data = reduce(catobs, data)

ProductNode 	# 5000 obs, 16 bytes
  └── ingredients: BagNode 	# 5000 obs, 78.188 KiB
                     └── ArrayNode(3477×53299 OneHotArray with Bool elements) 	# 53299 obs, 208.254 KiB

Now the `data` variable is a Mill structure containing `n_samples` observations (obs).
Each observation is one sample from the dataset.
The root of this tree-like structure is a `ProductNode` containing the `ingredients` key, which mirrors
the key name in the training data. Below it is a `BagNode` of an `ArrayNode` of a `OneHotArray`,
which is how the ingredients are represented. Each sample has a set of ingredients,
e.g. `["romaine lettuce", "black olives"]`, and each ingredient is encoded as a one-hot vector of dimension 3477 (for the 5000 samples; the dimension is larger if we use the whole dataset).
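The bag layout described here can be sketched in plain Julia without any packages. This is only an illustration of the idea (all instances stored side by side, plus one index range per sample); the names `recipes`, `instances`, and `bags` are introduced here and are not Mill types.

```julia
# Two toy samples, each a variable-length list of ingredients.
recipes = [["garlic", "pepper"], ["garlic", "feta", "olives"]]

# Store all instances (ingredients) of all samples in one flat list ...
instances = reduce(vcat, recipes)

# ... and remember which contiguous range belongs to which sample.
stops  = cumsum(length.(recipes))
starts = [1; stops[1:end-1] .+ 1]
bags   = [s:e for (s, e) in zip(starts, stops)]

# Ingredients of the second sample are recovered through its range.
second = instances[bags[2]]
```

With this layout the number of samples is `length(bags)`, while the number of instances is `length(instances)`, which corresponds to the 5000 obs versus 53299 obs reported in the output above.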
We do the same with the targets: the extractor converts them to a one-hot encoded matrix.

In [11]:
target = extract_target.(samples[1:n_samples])
target = reduce(catobs, target)[:cuisine].data

21×5000 OneHotMatrix(::Vector{UInt32}) with eltype Bool:
 ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  …  1  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅
 ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅     ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  1  ⋅
 ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅     ⋅  ⋅  ⋅  ⋅  1  ⋅  1  ⋅  ⋅  ⋅  ⋅  ⋅
 ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  1  ⋅     ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  1  ⋅  ⋅
 ⋅  ⋅  1  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅     ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅
 ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  …  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅
 1  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅     ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅
 ⋅  ⋅  ⋅  1  1  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅     ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅
 ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅     ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅
 ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  1  ⋅  1  1  ⋅  1     ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅
 ⋮              ⋮              ⋮        ⋱        ⋮              ⋮           
 ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅     ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅
 ⋅  ⋅  ⋅

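We see that the target is a `21×5000` one-hot matrix in the case of 5000 samples. The schema reports 20 unique cuisines, which are the prediction targets; the extractor uses 21 dimensions (`Categorical d = 21`), where the extra dimension is presumably reserved for values not seen in the schema.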
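The one-hot encoding of a categorical value can be sketched in plain Julia. The sketch assumes that one extra, last position is reserved for values outside the known vocabulary, which is one way to arrive at 21 rows for 20 observed cuisines; `onehot_with_unknown` is a hypothetical helper written for this illustration, not a JsonGrinder function.

```julia
# Hypothetical helper: one-hot encode `value` against `vocab`,
# using the last position for anything not present in `vocab`.
function onehot_with_unknown(value::AbstractString, vocab::Vector{String})
    idx = findfirst(==(value), vocab)
    v = falses(length(vocab) + 1)
    v[idx === nothing ? length(vocab) + 1 : idx] = true
    return v
end

vocab   = ["greek", "indian", "italian"]
known   = onehot_with_unknown("indian", vocab)  # second position is set
unknown = onehot_with_unknown("thai", vocab)    # last position is set
```

Stacking such vectors as columns, one per sample, gives a matrix shaped like the `target` above.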

#### Note about data representation
This approach, where we `catobs` all data into a single structure and then slice it to obtain minibatches, is usable only
for datasets that fit into memory. That rules it out for many real-world tasks, but it works well for experimenting with
small datasets. The other approach can be seen in the Mutagenesis example.
This approach is useful when you train for multiple epochs and all data fit into memory,
because you perform the `catobs` only once.

### Defining the model reflecting the structure of data

Since manually creating a model reflecting the structure can be tedious, Mill supports a semi-automated procedure.
The function `reflectinmodel` takes as input a data sample (or, as here, the schema together with the extractor) and a function which, for a given input dimension, provides a feed-forward network.
In the example below, the function creates a feed-forward network with a single fully connected layer with twenty neurons and a relu nonlinearity.
The structure of the network corresponds to the structure of the input data.
You can observe that each module dealing with multiple-instance data contains an aggregation layer with element-wise mean and maximum.

In [12]:
m = reflectinmodel(sch, extract_data,
	layer -> Dense(layer,20,relu),
	bag -> SegmentedMeanMax(bag),
	fsm = Dict("" => layer -> Dense(layer, size(target, 1))),
)

ProductModel ↦ ArrayModel(Dense(20, 21)) 	# 2 arrays, 441 params, 1.801 KiB
  └── ingredients: BagModel ↦ [SegmentedMean(20); SegmentedMax(20)] ↦ ArrayModel(Dense(40, 20, relu)) 	# 4 arrays, 860 params, 3.516 KiB
                     └── ArrayModel(Dense(3477, 20, relu)) 	# 2 arrays, 69_560 params, 271.797 KiB
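The aggregation by element-wise mean and maximum can be illustrated in plain Julia. This is a simplified sketch, not Mill's implementation: it assumes every bag is non-empty and uses the illustrative names `X`, `bags`, and `aggregate`.

```julia
using Statistics

# Each column of X is one instance embedding; `bags` lists the columns of each sample.
X = Float32[1 3 5 7 9;
            2 4 6 8 10]
bags = [1:2, 3:5]

# For every bag, stack the element-wise mean on top of the element-wise maximum.
aggregate(X, bags) =
    reduce(hcat, [vcat(mean(X[:, b], dims=2), maximum(X[:, b], dims=2)) for b in bags])

A = aggregate(X, bags)  # one column per sample, twice as many rows as X
```

The doubling of rows is why the layer following `[SegmentedMean(20); SegmentedMax(20)]` in the model above takes 40 inputs.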

### Training the model
The Mill library is compatible with MLDataPattern for manipulating data (training / testing split, minibatch preparation) and with Flux.
Please refer to these two libraries for support.
Below, data are first split into training and validation sets.
Then the Adam optimizer for training the model is initialized and the loss function is defined.
We also define a callback which periodically reports accuracy on the validation data during training.

In [13]:
valdata, valtarget = data[n_samples-n_val:n_samples], target[:,n_samples-n_val:n_samples]
traindata, traintarget = data[1:n_samples-n_val], target[:,1:n_samples-n_val]
opt = Flux.Optimise.ADAM()
loss(x, y) = Flux.logitcrossentropy(m(x).data, y)
loss(xy::Tuple) = loss(xy...)
cb = () -> println("accuracy = ",mean(Flux.onecold(m(valdata).data) .== Flux.onecold(valtarget)))

#9 (generic function with 1 method)
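The index arithmetic of the split can be checked in plain Julia. A range `a-n:a` contains `n + 1` elements, so the validation range as written holds 101 samples and shares one index with the training range. The sketch also shows a disjoint variant; `val_idx_disjoint` is a name introduced here for illustration.

```julia
n_samples, n_val = 5_000, 100

val_idx   = n_samples-n_val:n_samples   # validation range as written in the cell above
train_idx = 1:n_samples-n_val           # training range as written in the cell above

n_val_actual = length(val_idx)                # one more than n_val
shared       = intersect(train_idx, val_idx)  # the single index present in both ranges

# A disjoint alternative with exactly n_val validation samples.
val_idx_disjoint = n_samples-n_val+1:n_samples
```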

Here we compute the accuracy of the still untrained model on the training data.

In [14]:
mean(Flux.onecold(m(traindata).data) .== Flux.onecold(traintarget))

0.06897959183673469
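The accuracy expression compares, column by column, the index of the highest score with the index of the true class. A minimal plain-Julia sketch of the same computation follows; `onecold_idx` is a stand-in written for this illustration, not the Flux function.

```julia
using Statistics

# Index of the largest entry in each column (one column per sample).
onecold_idx(M) = [argmax(col) for col in eachcol(M)]

scores = [0.1 0.8 0.3;
          0.7 0.1 0.3;
          0.2 0.1 0.4]      # model outputs for 3 samples and 3 classes
labels = Bool[0 1 1;
              1 0 0;
              0 0 0]        # one-hot true classes

# Fraction of samples whose predicted class equals the true class.
accuracy = mean(onecold_idx(scores) .== onecold_idx(labels))
```

Here the first two samples are classified correctly and the third is not.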

Here we obtain the trainable parameters of the model.

In [15]:
ps = Flux.params(m)

Params([Float32[0.041041154 -0.01617369 … 0.0063797124 -0.017034061; 0.037732817 0.024376592 … -0.0028466952 -0.029412467; … ; -0.007948674 0.03527807 … -0.00081654266 0.0001483182; 0.03573877 -0.016277228 … 0.011694804 0.031276625], Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], Float32[0.2374293 0.08604232 … 0.20610823 0.038974207; 0.025234412 -0.051515754 … 0.08548961 -0.22369963; … ; -0.09654004 0.20820796 … -0.10044876 -0.13736232; 0.2993193 0.026034877 … -0.28702736 0.23350024], Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], Float32[-0.0450968 -0.063696 … -0.17814659 0.38099957; -0.24863204 -0.21428676 … 0.14623253 -0.19950746; … ; 0.0970263 0.

We use `MLDataPattern.RandomBatches` to make mini-batches from the training data.

In [16]:
minibatches = RandomBatches((traindata, traintarget), size = minibatchsize, count = iterations)

RandomBatches(::Tuple{Mill.ProductNode{NamedTuple{(:ingredients,), Tuple{Mill.BagNode{Mill.ArrayNode{Flux.OneHotArray{UInt32, 0x00000d95, 1, 2, Vector{UInt32}}, Nothing}, Mill.AlignedBags{Int64}, Nothing}}}, Nothing}, Flux.OneHotArray{UInt32, 0x00000015, 1, 2, Vector{UInt32}}}, 10, 20, (ObsDim.Undefined(), ObsDim.Last()))
 Iterator providing 20 batches of size 10

Now we compute the loss and perform a single step of gradient descent to check that everything works correctly.

In [17]:
loss(first(minibatches))
gs = gradient(() -> loss(first(minibatches)), ps)
Flux.Optimise.update!(opt, ps, gs)
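For intuition, the loss used above is a cross-entropy computed directly from unnormalized scores (logits). The following plain-Julia sketch shows the idea under the assumption that the per-sample losses are averaged; it is numerically naive, and `naive_logsoftmax` and `logit_ce` are illustrative names rather than Flux functions.

```julia
using Statistics

# Column-wise log-softmax (no protection against overflow; for illustration only).
naive_logsoftmax(z) = z .- log.(sum(exp.(z), dims=1))

# Cross-entropy between logits and one-hot targets, averaged over samples (columns).
logit_ce(logits, y) = mean(-sum(y .* naive_logsoftmax(logits), dims=1))

logits = zeros(Float32, 3, 2)   # uniform scores for 3 classes and 2 samples
y = Bool[1 0;
         0 1;
         0 0]

l = logit_ce(logits, y)   # uniform scores over 3 classes give log(3)
```

A model that outputs uniform scores over `k` classes has loss `log(k)`, which is a useful reference value when checking the first loss of an untrained network.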

In this step we finally train the classifier using the loss we have defined above.

In [18]:
Flux.Optimise.train!(loss, ps, minibatches, opt, cb = Flux.throttle(cb, 2))

accuracy = 0.0297029702970297


### Reporting accuracy on validation data
As the last step, we calculate accuracy on the training and validation data after the model has been trained. Only the value of the last expression (the validation accuracy) is displayed.

In [19]:
mean(Flux.onecold(m(traindata).data) .== Flux.onecold(traintarget))
mean(Flux.onecold(m(valdata).data) .== Flux.onecold(valtarget))

0.0297029702970297

This concludes our example of training a classifier to recognize cuisine based on ingredients.


---

*This notebook was generated using [Literate.jl](https://github.com/fredrikekre/Literate.jl).*