# Recipe Ingredients

The following simple example shows how to train a hierarchical model that predicts the
type of cuisine from the set of ingredients used in a recipe.

The full environment, the script and the data are accessible [here](https://github.com/CTUAvastLab/JsonGrinder.jl/tree/master/docs/src/examples/recipes).

We start by activating the environment and installing the required packages:

In [1]:
using Pkg
Pkg.activate(pwd())
Pkg.instantiate()
Pkg.status()

  Activating project at `~/work/JsonGrinder.jl/JsonGrinder.jl/docs/src/examples/recipes`
Status `~/work/JsonGrinder.jl/JsonGrinder.jl/docs/src/examples/recipes/Project.toml`
  [587475ba] Flux v0.14.20
  [0f8b85d8] JSON3 v1.14.0
  [d201646e] JsonGrinder v2.5.2
  [f1d291b0] MLUtils v0.4.4
  [1d0525e4] Mill v2.10.5
  [0b1bfda6] OneHotArrays v0.2.5


We recommend first reading the [Mutagenesis](https://github.com/CTUAvastLab/JsonGrinder.jl/tree/master/docs/src/examples/mutagenesis) example,
which introduces the core concepts. This example shows an application to another dataset and
integration with [`JSON3.jl`](https://github.com/quinnj/JSON3.jl).

We load all dependencies and fix the seed:

In [2]:
using JsonGrinder, Mill, Flux, OneHotArrays, JSON3, MLUtils, Statistics

using Random; Random.seed!(42);

The full dataset and the problem description can also be found on [Kaggle](https://www.kaggle.com/kaggle/recipe-ingredients-dataset/home), but for demonstration purposes we load only a small subset of it:

In [3]:
dataset = JSON3.read.(readlines("recipes.jsonl"));
shuffle!(dataset);
jss_train, jss_test = dataset[1:2000], dataset[2001:end];
jss_train[1]

JSON3.Object{Base.CodeUnits{UInt8, String}, Vector{UInt64}} with 3 entries:
  :id          => 7950
  :ingredients => ["large egg whites", "brown rice", "all-purpose flour", "larg…
  :cuisine     => "korean"

Labels are stored in the `"cuisine"` field:

In [4]:
y_train = getindex.(jss_train, "cuisine");
y_test = getindex.(jss_test, "cuisine");
y_train

2000-element Vector{String}:
 "korean"
 "chinese"
 "italian"
 "korean"
 "japanese"
 "italian"
 "korean"
 "mexican"
 "french"
 "chinese"
 ⋮
 "italian"
 "indian"
 "french"
 "mexican"
 "chinese"
 "chinese"
 "indian"
 "jamaican"
 "jamaican"

This example has more than two classes, so we also encode all training labels into one-hot vectors:

In [5]:
classes = unique(y_train)

20-element Vector{String}:
 "korean"
 "chinese"
 "italian"
 "japanese"
 "mexican"
 "french"
 "greek"
 "british"
 "indian"
 "thai"
 "southern_us"
 "russian"
 "moroccan"
 "vietnamese"
 "brazilian"
 "jamaican"
 "cajun_creole"
 "spanish"
 "irish"
 "filipino"

In [6]:
y_train_oh = onehotbatch(y_train, classes)

20×2000 OneHotMatrix(::Vector{UInt32}) with eltype Bool:
 1  ⋅  ⋅  1  ⋅  ⋅  1  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  …  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅
 ⋅  1  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  1  ⋅  ⋅  ⋅     ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  1  1  ⋅  ⋅  ⋅
 ⋅  ⋅  1  ⋅  ⋅  1  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  1     ⋅  ⋅  1  1  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅
 ⋅  ⋅  ⋅  ⋅  1  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅     ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅
 ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  1  ⋅  ⋅  ⋅  ⋅  ⋅     ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  1  ⋅  ⋅  ⋅  ⋅  ⋅
 ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  1  ⋅  ⋅  1  ⋅  …  ⋅  1  ⋅  ⋅  ⋅  1  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅
 ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  1  ⋅  ⋅     ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅
 ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅     ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅
 ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅     ⋅  ⋅  ⋅  ⋅  1  ⋅  ⋅  ⋅  ⋅  1  ⋅  ⋅
 ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅     ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅
 ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  …  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅
 ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅     ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅
 ⋮
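To make the encoding concrete, here is a minimal sketch in plain Julia of what a one-hot matrix represents and how it is decoded back to labels. The toy labels and the names `labels`, `toy_classes`, `onehot` and `decoded` are hypothetical, and this is not how `OneHotArrays` stores the matrix internally.

```julia
# Toy labels; illustrative values, not taken from the dataset.
labels = ["korean", "chinese", "italian", "korean"]
toy_classes = unique(labels)   # classes in order of first appearance

# One row per class, one column per sample; exactly one `true` per column.
onehot = [c == l for c in toy_classes, l in labels]

# Decoding: the row index of the maximum in each column selects the class.
decoded = [toy_classes[argmax(col)] for col in eachcol(onehot)]
```

Each column contains a single `true`, so decoding with `argmax` recovers the original labels.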

Now we create a schema:

In [7]:
sch = schema(jss_train)

DictEntry 2000x updated
  ├────── cuisine: LeafEntry (20 unique `String` values) 2000x updated
  ├─────────── id: LeafEntry (2000 unique `Real` values) 2000x updated
  ╰── ingredients: ArrayEntry 2000x updated
                     ╰── LeafEntry (2471 unique `String` values) 21632x update ⋯

The `schema` function accepts an optional first argument: a function that is applied to every element of
the input array before the schema is inferred. We could thus reduce the schema creation to a single command,
`schema(JSON3.read, readlines("recipes.jsonl"))`.
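The counts printed in the schema above can be read as simple statistics over the training documents: how many arrays were seen, how many strings they contain in total, and how many of those strings are distinct. A rough sketch of these three numbers in plain Julia, using hypothetical toy documents rather than the `JsonGrinder` internals:

```julia
# Toy documents (hypothetical), each a list of ingredient strings.
docs = [["salt", "rice"], ["salt", "egg", "flour"], ["rice"]]

n_arrays = length(docs)                              # arrays seen
n_leaves = sum(length, docs)                         # strings seen in total
n_unique = length(unique(Iterators.flatten(docs)))   # distinct strings
```

For the recipes, the corresponding numbers are 2000 arrays, 21632 strings, and 2471 unique strings.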

From the schema, we delete the `"cuisine"` key, which stores the label, and the `"id"` key,
which only identifies the sample and is not useful for training:

In [8]:
delete!(sch.children, :cuisine);
delete!(sch.children, :id);
sch

DictEntry 2000x updated
  ╰── ingredients: ArrayEntry 2000x updated
                     ╰── LeafEntry (2471 unique `String` values) 21632x update ⋯

We can see that only a single key `"ingredients"` is left. We can thus just take its content:

In [9]:
jss_train = getindex.(jss_train, "ingredients");
jss_test = getindex.(jss_test, "ingredients");
jss_train[1]

13-element JSON3.Array{String, Base.CodeUnits{UInt8, String}, SubArray{UInt64, 1, Vector{UInt64}, Tuple{UnitRange{Int64}}, true}}:
 "large egg whites"
 "brown rice"
 "all-purpose flour"
 "large eggs"
 "top sirloin steak"
 "garlic cloves"
 "low sodium soy sauce"
 "green onions"
 "rice vinegar"
 "canola oil"
 "sesame seeds"
 "sesame oil"
 "dark sesame oil"

We can now either take the only subtree of the original schema `sch`:

In [10]:
sch[:ingredients]

ArrayEntry 2000x updated
  ╰── LeafEntry (2471 unique `String` values) 21632x updated

Or we can infer it once again. This time, `jss_train` is not a `Vector` of `Dict`s but a `Vector` of `Vector`s:

In [11]:
sch = schema(jss_train)

ArrayEntry 2000x updated
  ╰── LeafEntry (2471 unique `String` values) 21632x updated

The next step is to create an extractor:

In [12]:
e = suggestextractor(sch)

ArrayExtractor
  ╰── NGramExtractor(n=3, b=256, m=2053)

If we have sufficient memory, we can extract all documents before training, as in the
[Mutagenesis](https://github.com/CTUAvastLab/JsonGrinder.jl/tree/master/docs/src/examples/mutagenesis) example:

In [13]:
extract(e, jss_train)

BagNode  2000 obs, 31.336 KiB
  ╰── ArrayNode(2053×21632 NGramMatrix with Int64 elements)  21632 obs, 586.46 ⋯

However, in this example we want to show how to extract online in the training loop.
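As an aside on the extractor suggested above: `NGramExtractor(n=3, b=256, m=2053)` represents each ingredient string by a vector of length `m`. The following is a simplified sketch of that idea in plain Julia, assuming that trigrams are taken over raw bytes, read as numbers in base `b`, and reduced modulo `m`. The function name `trigram_counts` is hypothetical, and the actual `Mill.jl` implementation may differ in details such as how string boundaries are handled.

```julia
# Simplified n-gram hashing: n bytes per n-gram, base b, modulo m.
function trigram_counts(s::AbstractString; n=3, b=256, m=2053)
    bytes = codeunits(s)
    counts = zeros(Int, m)
    for i in 1:length(bytes) - n + 1
        # Interpret the n bytes as digits of a number in base b.
        idx = foldl((acc, byte) -> acc * b + Int(byte), bytes[i:i+n-1]; init=0)
        counts[idx % m + 1] += 1   # +1 because Julia arrays are 1-based
    end
    return counts
end

v = trigram_counts("sesame oil")
```

Here `v` has 2053 entries and sums to 8, the number of byte trigrams in a 10-byte string. Different trigrams can collide in the same bucket, which is the price of a fixed-size representation.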

We continue with the model definition, making use of some of the
`Mill.reflectinmodel` features.

In [14]:
encoder = reflectinmodel(sch, e, d -> Dense(d, 40, relu), d -> SegmentedMeanMaxLSE(d) |> BagCount)
model = Dense(40, length(classes)) ∘ encoder

Dense(40 => 20) ∘ BagModel ↦ BagCount([SegmentedMean(40); SegmentedMax(40); SegmentedLSE(40)]) ↦ Dense(121 => 40, relu)
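The input dimension 121 of the last `Dense` layer inside the encoder follows from the aggregation: three aggregated vectors of length 40 plus one feature derived from the bag size. The sketch below reproduces this arithmetic in plain Julia for a single bag. It ignores the learnable parameters of the `Mill.jl` aggregation operators, and the exact form of the bag-size feature (`log(1 + n)` here) is an assumption.

```julia
using Statistics  # mean

# A bag of n instance embeddings, stored as columns of a d×n matrix.
d, n = 40, 13
bag = randn(d, n)

# Three element-wise aggregations over the instances of the bag.
agg_mean = vec(mean(bag; dims=2))
agg_max  = vec(maximum(bag; dims=2))
agg_lse  = vec(log.(sum(exp.(bag); dims=2)))

# One extra scalar derived from the bag size (the exact form is an assumption).
bag_count = log(1 + n)

features = vcat(agg_mean, agg_max, agg_lse, bag_count)
length(features)  # 3 * 40 + 1
```

The result has a fixed length regardless of how many ingredients a recipe contains, which is what lets a standard dense layer follow the aggregation.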

We define the components needed for training:

In [15]:
pred(m, x) = softmax(m(x))
opt_state = Flux.setup(Flux.Optimise.Adam(), model);
minibatch_iterator = Flux.DataLoader((jss_train, y_train_oh), batchsize=32, shuffle=true);
accuracy(p, y) = mean(onecold(p, classes) .== y)

accuracy (generic function with 1 method)
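Note that `pred` applies `softmax`, while the loss in the training loop works on raw model outputs (logits). The sketch below, in plain Julia with hypothetical helper names `softmax_vec` and `logit_crossentropy` and made-up logits, illustrates why both views are consistent: `softmax` does not change the position of the maximum, and the cross-entropy can be computed from logits directly.

```julia
# Numerically stable softmax of a vector of logits.
function softmax_vec(z)
    e = exp.(z .- maximum(z))
    return e ./ sum(e)
end

# Cross-entropy computed directly from logits: logsumexp(z) - z[target].
function logit_crossentropy(z, target::Integer)
    zmax = maximum(z)
    return zmax + log(sum(exp.(z .- zmax))) - z[target]
end

logits = [2.0, -1.0, 0.5]   # hypothetical model outputs for one sample
p = softmax_vec(logits)

# softmax is monotone, so the predicted class is the same either way.
same_class = argmax(p) == argmax(logits)

# Both routes to the loss agree up to floating-point error.
same_loss = logit_crossentropy(logits, 1) ≈ -log(p[1])
```

Because the predicted class is unchanged by `softmax`, accuracy can be computed from either the logits or the probabilities.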

And run the training:

In [16]:
for i in 1:20
    Flux.train!(model, minibatch_iterator, opt_state) do m, jss, y
        x = Flux.@ignore_derivatives extract(e, jss)
        Flux.Losses.logitcrossentropy(m(x), y)
    end
    @info "Epoch $i" accuracy=accuracy(pred(model, extract(e, jss_train)), y_train)
end

┌ Info: Epoch 1
└   accuracy = 0.3395
┌ Info: Epoch 2
└   accuracy = 0.506
┌ Info: Epoch 3
└   accuracy = 0.595
┌ Info: Epoch 4
└   accuracy = 0.662
┌ Info: Epoch 5
└   accuracy = 0.7115
┌ Info: Epoch 6
└   accuracy = 0.7645
┌ Info: Epoch 7
└   accuracy = 0.79
┌ Info: Epoch 8
└   accuracy = 0.8385
┌ Info: Epoch 9
└   accuracy = 0.863
┌ Info: Epoch 10
└   accuracy = 0.8865
┌ Info: Epoch 11
└   accuracy = 0.9195
┌ Info: Epoch 12
└   accuracy = 0.9265
┌ Info: Epoch 13
└   accuracy = 0.951
┌ Info: Epoch 14
└   accuracy = 0.9615
┌ Info: Epoch 15
└   accuracy = 0.969
┌ Info: Epoch 16
└   accuracy = 0.981
┌ Info: Epoch 17
└   accuracy = 0.985
┌ Info: Epoch 18
└   accuracy = 0.986
┌ Info: Epoch 19
└   accuracy = 0.9935
┌ Info: Epoch 20
└   accuracy = 0.9965
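Each epoch above iterates over shuffled minibatches of 32 recipes and extracts only the current batch. The batching itself can be sketched in plain Julia as follows. This is an illustration of the idea, not the `DataLoader` implementation, and it assumes that the final, smaller batch is kept.

```julia
using Random

rng = MersenneTwister(42)
n_samples, batchsize = 2000, 32

# Shuffle sample indices once per epoch, then cut them into consecutive chunks.
perm = shuffle(rng, 1:n_samples)
batches = collect(Iterators.partition(perm, batchsize))

n_batches = length(batches)      # 62 full batches plus one partial batch
n_last = length(last(batches))   # size of the final partial batch
```

With 2000 samples this gives 63 batches per epoch, the last one containing 16 samples, so only a small part of the data has to be held in extracted form at any time.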


Finally, let's measure the testing accuracy. The classifier is overfitted: training accuracy reached 0.9965 after 20 epochs, but accuracy on the held-out recipes is much lower:

In [17]:
accuracy(model(extract(e, jss_test)), y_test)

0.6

---

*This notebook was generated using [Literate.jl](https://github.com/fredrikekre/Literate.jl).*