# Recipe Ingredients Example
The following example demonstrates predicting the cuisine from a set of ingredients.
For simplicity, the repo contains only a small subset of the dataset; the whole dataset and the problem description can
be found [on this Kaggle page](https://www.kaggle.com/kaggle/recipe-ingredients-dataset/home).

In [1]:
using MLDatasets, JsonGrinder, Flux, Mill, MLDataPattern, Statistics, ChainRulesCore
using JSON

Start by loading all the samples.

In [2]:
data_file = "../../../data/recipes.json" #!src
samples = open(data_file,"r") do fid
	Vector{Dict}(JSON.parse(read(fid, String)))
end
JSON.print(samples[1],2)

{
  "id": 10259,
  "ingredients": [
    "romaine lettuce",
    "black olives",
    "grape tomatoes",
    "garlic",
    "pepper",
    "purple onion",
    "seasoning",
    "garbanzo beans",
    "feta cheese crumbles"
  ],
  "cuisine": "greek"
}
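
Each sample is a plain dictionary, so ordinary Julia code is enough for a quick look at the labels before building a schema. The sketch below counts recipes per cuisine on a few hand-written records; `toy_samples` and `count_by_key` are hypothetical names made up for illustration and are not part of JsonGrinder.

```julia
# Hypothetical stand-in for `samples`: a few hand-written recipes.
toy_samples = [
    Dict("id" => 1, "cuisine" => "greek",   "ingredients" => ["feta cheese crumbles", "garlic"]),
    Dict("id" => 2, "cuisine" => "italian", "ingredients" => ["garlic", "pepper"]),
    Dict("id" => 3, "cuisine" => "greek",   "ingredients" => ["black olives"]),
]

# Count how many samples share each value stored under `key`.
function count_by_key(samples, key)
    counts = Dict{String,Int}()
    for s in samples
        counts[s[key]] = get(counts, s[key], 0) + 1
    end
    return counts
end

cuisine_counts = count_by_key(toy_samples, "cuisine")
```

The same helper should work on the real `samples` vector, since the parsed JSON uses string keys.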


Create a schema of the JSON.

In [3]:
sch = JsonGrinder.schema(samples)

[Dict]  # updated = 39774
  ├─────────── id: [Scalar - Int64], 10000 unique values  # updated = 39774
  ├────── cuisine: [Scalar - String], 20 unique values  # updated = 39774
  └── ingredients: [List]  # updated = 39774
                     └── [Scalar - String], 6714 unique values  # updated = 428275

Create the extractor and split it into two: one for loading the targets and one for loading the data.
The `id` key is removed from the schema first so that it is not extracted, and `suggestextractor` is called with its default settings.

In [4]:
delete!(sch.childs,:id)

extractor = suggestextractor(sch)
extract_data = ExtractDict(deepcopy(extractor.dict))
extract_target = ExtractDict(deepcopy(extractor.dict))
delete!(extract_target.dict, :ingredients)
delete!(extract_data.dict, :cuisine)

extract_data(samples[1])
extract_target(samples[1])[:cuisine]

21×1 Mill.ArrayNode{Flux.OneHotArray{UInt32, 0x00000015, 1, 2, Vector{UInt32}}, Nothing}:
 ⋅
 ⋅
 ⋅
 ⋅
 ⋅
 ⋅
 1
 ⋅
 ⋅
 ⋅
 ⋮
 ⋅
 ⋅
 ⋅
 ⋅
 ⋅
 ⋅
 ⋅
 ⋅
 ⋅
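
The target has 21 rows although the schema reports 20 cuisines. A likely reading, assumed here, is that the categorical extractor reserves one extra slot for values that were not seen in the schema. The plain-Julia sketch below illustrates that kind of encoding; `onehot_with_unknown` and the three-item vocabulary are hypothetical, and this is not JsonGrinder's implementation.

```julia
# One-hot encode `value` against `vocabulary`, reserving the last slot
# for anything that is not in the vocabulary (assumed behaviour).
function onehot_with_unknown(value, vocabulary)
    encoded = falses(length(vocabulary) + 1)
    idx = findfirst(==(value), vocabulary)
    encoded[idx === nothing ? length(encoded) : idx] = true
    return encoded
end

vocabulary = ["greek", "indian", "italian"]   # hypothetical label set
known = onehot_with_unknown("indian", vocabulary)
unknown = onehot_with_unknown("martian", vocabulary)
```

With three known labels the encoded vector has four entries, just as 20 cuisines would give 21 rows under this assumption.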

Convert the JSONs to datasets.
It is advised to use all the samples; only the first 5,000 are used here to speed up the demonstration.

In [5]:
data = extract_data.(samples[1:5_000])
data = reduce(catobs, data)
target = extract_target.(samples[1:5_000])
target = reduce(catobs, target)[:cuisine].data

21×5000 OneHotMatrix(::Vector{UInt32}) with eltype Bool:
 ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  …  1  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅
 ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅     ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  1  ⋅
 ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅     ⋅  ⋅  ⋅  ⋅  1  ⋅  1  ⋅  ⋅  ⋅  ⋅  ⋅
 ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  1  ⋅     ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  1  ⋅  ⋅
 ⋅  ⋅  1  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅     ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅
 ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  …  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅
 1  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅     ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅
 ⋅  ⋅  ⋅  1  1  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅     ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅
 ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅     ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅
 ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  1  ⋅  1  1  ⋅  1     ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅
 ⋮              ⋮              ⋮        ⋱        ⋮              ⋮           
 ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅     ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅
 ⋅  ⋅  ⋅

Create the model according to the data.

In [6]:
m = reflectinmodel(sch, extract_data,
	layer -> Dense(layer,20,relu),
	bag -> SegmentedMeanMax(bag),
	fsm = Dict("" => layer -> Dense(layer, size(target, 1))),
)

@non_differentiable getobs(x::DataSubset{<:ProductNode})
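
The aggregation step is what lets the model handle recipes with different numbers of ingredients: each recipe is a bag of ingredient embeddings that gets reduced to one fixed-size vector. As a rough sketch of the idea behind a combined mean and max aggregation, and not Mill's actual implementation (which, for instance, also has to deal with empty bags), the plain-Julia code below pools the columns of each bag; `mean_max_pool` and the toy matrix are hypothetical.

```julia
using Statistics

# Pool every bag (a range of columns of `x`) into one column holding the
# feature-wise mean stacked on top of the feature-wise maximum.
function mean_max_pool(x::AbstractMatrix, bags)
    pooled = map(bags) do cols
        instances = x[:, cols]
        vcat(vec(mean(instances, dims = 2)), vec(maximum(instances, dims = 2)))
    end
    return reduce(hcat, pooled)
end

# Two features, five instances; the first bag owns columns 1:2, the second 3:5.
x = [1.0 3.0 0.0 2.0 4.0;
     2.0 2.0 5.0 1.0 0.0]
pooled = mean_max_pool(x, [1:2, 3:5])
```

The output has one column per bag and twice as many rows as the input has features, regardless of how many instances each bag contains.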

Set up the optimiser, the loss, and the validation split for training.

In [7]:
opt = Flux.Optimise.ADAM()
loss(x, y) = Flux.logitcrossentropy(m(x).data, y)
loss(x::DataSubset, y) = loss(getobs(x), y)
loss(xy::Tuple) = loss(xy...)
valdata = data[1:1_000], target[:, 1:1_000]
data, target = data[1_001:5_000], target[:, 1_001:5_000]

(ProductNode, Bool[0 0 … 0 0; 0 0 … 1 0; … ; 0 0 … 0 0; 0 0 … 0 0])
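
The loss applies a softmax cross-entropy to the raw model outputs (logits) and averages it over the observations. As an illustration of that computation in plain Julia, assuming one-hot targets stored column-wise, the sketch below uses a numerically stable log-softmax; `logsoftmax_cols` and `logit_crossentropy` are hypothetical helpers, not Flux functions.

```julia
using Statistics

# Column-wise log-softmax; subtracting the column maximum avoids overflow.
function logsoftmax_cols(logits::AbstractMatrix)
    shifted = logits .- maximum(logits, dims = 1)
    return shifted .- log.(sum(exp.(shifted), dims = 1))
end

# Mean over columns of the negative log-probability of the true class.
logit_crossentropy(logits, onehot) =
    mean(-sum(onehot .* logsoftmax_cols(logits), dims = 1))

logits = [2.0 0.0;
          0.0 0.0]
onehot = Bool[1 0;
              0 1]
value = logit_crossentropy(logits, onehot)
```

For a model that outputs identical logits for every class, this quantity equals the logarithm of the number of classes, which gives a handy reference point when reading early loss values.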

For less resource-hungry training, we use only part of the data for training, but it is advised to use all of it, as in the following line:
data, target = data[1001:nobs(data)], target[:,1001:size(target,2)]

In [8]:
cb = () -> println("accuracy = ",mean(Flux.onecold(m(valdata[1]).data) .== Flux.onecold(valdata[2])))
ps = Flux.params(m)
mean(Flux.onecold(m(data).data) .== Flux.onecold(target))

iterations = 20
minibatchsize = 4000
minibatches = RandomBatches((data, target), size = minibatchsize, count = iterations)

@info "testing the gradient"
loss(first(minibatches))
gs = gradient(() -> loss(first(minibatches)), ps)
Flux.Optimise.update!(opt, ps, gs)

[ Info: testing the gradient


Feel free to train for a longer period of time; this example trains for only 20 iterations so that it runs fast.

In [9]:
loss(first(minibatches))
gs = gradient(() -> loss(first(minibatches)), ps)
Flux.Optimise.train!(loss, ps, minibatches, opt, cb = Flux.throttle(cb, 2))

# calculate the accuracy on the training data
mean(Flux.onecold(m(data).data) .== Flux.onecold(target))

accuracy = 0.035
accuracy = 0.074


0.07125
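
The accuracy above compares the index of the largest model output in each column with the index of the true class. A minimal plain-Julia sketch of that comparison follows; `onecold_cols` and `accuracy` are hypothetical names that mimic the idea and are not the Flux API.

```julia
using Statistics

# Index of the largest entry in every column (the predicted class).
onecold_cols(scores::AbstractMatrix) = [argmax(col) for col in eachcol(scores)]

# Fraction of columns whose predicted class matches the true class.
accuracy(scores, onehot) = mean(onecold_cols(scores) .== onecold_cols(onehot))

# Three classes, three observations; the third prediction is wrong.
scores = [0.1 2.0 0.3;
          0.7 0.5 0.2;
          0.2 0.1 0.9]
labels = Bool[0 1 1;
              1 0 0;
              0 0 0]
acc = accuracy(scores, labels)
```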

---

*This notebook was generated using [Literate.jl](https://github.com/fredrikekre/Literate.jl).*