# Schema Examination
In this example we build a schema for documents with a complex structure and show how to filter it and perform transformations.

We start by installing JsonGrinder and a few other packages we need for the example.
The Julia ecosystem follows a philosophy of many small, single-purpose, composable packages,
which may differ from e.g. Python, where we usually use fewer, larger packages.

In [1]:
using Pkg
pkg"add JsonGrinder#master Flux Mill MLDataPattern JSON HierarchicalUtils StatsBase OrderedCollections"

using JsonGrinder, Flux, Mill, MLDataPattern, JSON, HierarchicalUtils, StatsBase, OrderedCollections
using JsonGrinder: DictEntry, Entry

data_dir = "../../../data/documents"

    Updating git-repo `https://github.com/CTUAvastLab/JsonGrinder.jl.git`
   Resolving package versions...
  No Changes to `~/work/JsonGrinder.jl/JsonGrinder.jl/docs/Project.toml`
  No Changes to `~/work/JsonGrinder.jl/JsonGrinder.jl/docs/Manifest.toml`


"../../../data/documents"

This is what one of the documents looks like:

In [2]:
open(JSON.parse, first(readdir(data_dir, join=true)))

Dict{String, Any} with 7 entries:
  "bib_entries" => Dict{String, Any}("BIBREF9"=>Dict{String, Any}("ref_id"=>"b9…
  "body_text"   => Any[Dict{String, Any}("ref_spans"=>Any[], "cite_spans"=>Any[…
  "back_matter" => Any[Dict{String, Any}("ref_spans"=>Any[], "cite_spans"=>Any[…
  "metadata"    => Dict{String, Any}("title"=>"", "authors"=>Any[Dict{String, A…
  "abstract"    => Any[Dict{String, Any}("ref_spans"=>Any[], "cite_spans"=>Any[…
  "ref_entries" => Dict{String, Any}("FIGREF0"=>Dict{String, Any}("latex"=>noth…
  "paper_id"    => "0000fcce604204b1b9d876dc073eb529eb5ce305"

We load all files in `data/documents`, parse them, and build a schema from them.

In [3]:
sch = JsonGrinder.schema(readdir(data_dir, join=true), x->open(JSON.parse, x))

[34m[Dict][39m[90m  # updated = 16[39m
[34m  ├───── metadata: [39m[31m[Dict][39m[90m  # updated = 16[39m
[34m  │                [39m[31m  ├── authors: [39m[32m[List][39m[90m  # updated = 16[39m
[34m  │                [39m[31m  │            [39m[32m  ╰── [39m[33m[Dict][39m[90m  # updated = 91[39m
[34m  │                [39m[31m  │            [39m[32m      [39m[33m  ├─────── middle: [39m[36m[List][39m[90m  # updated = [39m[90m⋯[39m
[34m  │                [39m[31m  │            [39m[32m      [39m[33m  │                [39m[36m  ┊[39m
[34m  │                [39m[31m  │            [39m[32m      [39m[33m  ├──────── first: [39m[39m[Scalar - String],  [90m⋯[39m
[34m  │                [39m[31m  │            [39m[32m      [39m[33m  ├─────── suffix: [39m[39m[Scalar - String],  [90m⋯[39m
[34m  │                [39m[31m  │            [39m[32m      [39m[33m  ├───────── last: [39m[39m[Scalar - String],  [90m⋯[39m
[3

The default printing method restricts depth and width of the printed schema.
We can see the whole schema using the `printtree` function from [HierarchicalUtils](https://github.com/CTUAvastLab/HierarchicalUtils.jl).
The `htrunc` and `vtrunc` kwargs set the maximum depth and the maximum number of children per node that will be rendered, respectively.

In [4]:
printtree(sch, htrunc=20, vtrunc=20)

[Dict]  # updated = 16
  ├───── metadata: [Dict]  # updated = 16
  │                  ├── authors: [List]  # updated = 16
  │                  │              ╰── [Dict]  # updated = 91
  │                  │                    ├─────── middle: [List]  # updated = 91
  │                  │                    │                  ╰── [Scalar - String], 12 unique values  # updated = 12
  │                  │                    ├──────── first: [Scalar - String], 86 unique values  # updated = 91
  │                  │                    ├─────── suffix: [Scalar - String], 1 unique values  # updated = 91
  │                  │                    ├───────── last: [Scalar - String], 79 unique values  # updated = 91
  │                  │                    ├──────── email: [Scalar - String], 4 unique values  # updated = 91
  │                  │                    ╰── affiliation: [Dict]  # updated = 91
  │                  │                                       ├─── laboratory: [Scalar - Stri

We let JsonGrinder suggest a default extractor based on the schema.

In [5]:
extractor = suggestextractor(sch)

└ @ JsonGrinder ~/work/JsonGrinder.jl/JsonGrinder.jl/src/schema/dict.jl:61

[34mDict[39m
[34m  ├───── metadata: [39m[31mDict[39m
[34m  │                [39m[31m  ├── authors: [39m[32mArray of[39m
[34m  │                [39m[31m  │            [39m[32m  ╰── [39m[33mDict[39m
[34m  │                [39m[31m  │            [39m[32m      [39m[33m  ├─────── middle: [39m[36mArray of[39m
[34m  │                [39m[31m  │            [39m[32m      [39m[33m  │                [39m[36m  ┊[39m
[34m  │                [39m[31m  │            [39m[32m      [39m[33m  ├──────── first: [39m[39mCategorical d = 87
[34m  │                [39m[31m  │            [39m[32m      [39m[33m  ├─────── suffix: [39m[39mCategorical d = 2
[34m  │                [39m[31m  │            [39m[32m      [39m[33m  ├───────── last: [39m[39mCategorical d = 80
[34m  │                [39m[31m  │            [39m[32m      [39m[33m  ├──────── email: [39m[39mCategorical d = 5
[34m  │                [39m[31m  │            [39m[32m   

This prints almost the whole extractor. Feel free to remove the `htrunc` and `vtrunc` kwargs if you want to
see all of it.

In [6]:
printtree(extractor, htrunc=20, vtrunc=20)

Dict
  ├───── metadata: Dict
  │                  ├── authors: Array of
  │                  │              ╰── Dict
  │                  │                    ├─────── middle: Array of
  │                  │                    │                  ╰── Categorical d = 13
  │                  │                    ├──────── first: Categorical d = 87
  │                  │                    ├─────── suffix: Categorical d = 2
  │                  │                    ├───────── last: Categorical d = 80
  │                  │                    ├──────── email: Categorical d = 5
  │                  │                    ╰── affiliation: Dict
  │                  │                                       ├─── laboratory: Categorical d = 6
  │                  │                                       ├───── location: Dict
  │                  │                                       │                  ├── settlement: Categorical d = 10
  │                  │                                       │ 

We see that some dictionaries have lots of keys, so let's examine the schema more closely.

Mill.jl treats dictionaries as [a Cartesian product of their embeddings](https://ctuavastlab.github.io/Mill.jl/stable/manual/nodes/#ProductNodes-and-ProductModels),
which makes sense when the number of keys is consistent and the keys themselves don't carry semantic meaning.
Looking at the schema, we can hypothesize that the many different keys that occur only rarely in the data do carry semantic information.

We want to find out how many unique keys the dictionaries in the schema have, so that we can handle such dictionaries differently and also train on the key names.
So let's take a look at the histogram of the number of children per dictionary.

The function [list_lens](https://ctuavastlab.github.io/Mill.jl/stable/api/utilities/#Mill.list_lens)
from [Mill.jl](https://github.com/CTUAvastLab/Mill.jl) lets us iterate over all nodes in our tree structure
in such a way that we know their position in the schema.
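
To make the idea of iterating over nodes together with their positions concrete, here is a minimal sketch in plain Julia. It does not use `list_lens` or lenses at all: `walk` is a hypothetical helper written only for this illustration, and it records each position as a tuple of keys and indices on a toy document.

```julia
# Hypothetical helper (not part of Mill.jl or JsonGrinder):
# recursively collect (path, node) pairs from nested Dicts and Vectors.
function walk(node, path=())
    out = Any[(path, node)]
    if node isa AbstractDict
        # Sort keys so the traversal order is deterministic.
        for k in sort!(collect(keys(node)))
            append!(out, walk(node[k], (path..., k)))
        end
    elseif node isa AbstractVector
        for (i, item) in enumerate(node)
            append!(out, walk(item, (path..., i)))
        end
    end
    return out
end

# Toy document, loosely shaped like the data in this example.
doc = Dict(
    "metadata" => Dict("authors" => [Dict("first" => "Ada", "last" => "Lovelace")]),
    "paper_id" => "abc",
)

# Keep only the positions of dictionaries, much like filtering on `DictEntry`.
dict_paths = [p for (p, n) in walk(doc) if n isa AbstractDict]
```

Here `dict_paths` contains the root path `()`, `("metadata",)` and `("metadata", "authors", 1)`. A lens plays the same role as these tuples, with the added benefit that it can be applied directly to the schema via `get`.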

In [7]:
StatsBase.countmap([length(get(sch, i).childs) for i in list_lens(sch) if get(sch, i) isa DictEntry]) |> sort

OrderedCollections.OrderedDict{Int64, Int64} with 12 entries:
  0   => 59
  1   => 43
  2   => 8
  3   => 9
  4   => 97
  5   => 1
  6   => 1
  7   => 1
  8   => 12
  9   => 91
  13  => 1
  103 => 1

We see that 1 dict has 103 unique children, 1 dict has 13 unique children,
91 dicts have 9 unique children, 59 dicts don't have any children etc.
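
One way to see why a single dictionary can end up with so many children is to look at what happens when keys are unioned over documents. The sketch below is an illustration in plain Julia on made-up data, not the way JsonGrinder builds its schema; `observed_keys` is a hypothetical helper.

```julia
# Toy documents: `meta` uses a fixed set of field names,
# `refs` uses identifiers as keys, so each document brings new ones.
docs = [
    Dict("meta" => Dict("title" => "A", "year" => 2020),
         "refs" => Dict("REF0" => "x", "REF1" => "y")),
    Dict("meta" => Dict("title" => "B", "year" => 2021),
         "refs" => Dict("REF2" => "z", "REF7" => "w")),
]

# Union of the keys observed under one top-level field across all documents.
observed_keys(docs, field) =
    sort!(collect(mapreduce(d -> Set(keys(d[field])), union, docs)))

meta_keys = observed_keys(docs, "meta")   # stays at two keys
refs_keys = observed_keys(docs, "refs")   # grows with every document
```

The number of keys under `meta` stays constant no matter how many documents are added, while the number under `refs` keeps growing, which is the pattern the histogram hints at for the largest dictionaries.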

We can take a more detailed look at Dicts with > 5 children.

The following code prints paths to all Dictionaries in the schema and number of their children if they have more than 5 children.
In total there is lots of diction

In [8]:
for i in list_lens(sch)
    e = get(sch, i)
    if e isa DictEntry && length(e.childs) > 5
        @info i length(e.childs)
    end
end

┌ Info: (@lens _)
└   length(e.childs) = 7
┌ Info: (@lens _.childs[:metadata].childs[:authors].items)
└   length(e.childs) = 6
┌ Info: (@lens _.childs[:ref_entries])
└   length(e.childs) = 13
┌ Info: (@lens _.childs[:bib_entries])
└   length(e.childs) = 103
┌ Info: (@lens _.childs[:bib_entries].childs[:BIBREF33])
└   length(e.childs) = 9
┌ Info: (@lens _.childs[:bib_entries].childs[:BIBREF43])
└   length(e.childs) = 9
┌ Info: (@lens _.childs[:bib_entries].childs[:BIBREF127])
└   length(e.childs) = 8
┌ Info: (@lens _.childs[:bib_entries].childs[:BIBREF67])
└   length(e.childs) = 9
┌ Info: (@lens _.childs[:bib_entries].childs[:BIBREF79])
└   length(e.childs) = 9
┌ Info: (@lens _.childs[:bib_entries].childs[:BIBREF153])
└   length(e.childs) = 8
┌ Info: (@lens _.childs[:bib_entries].childs[:BIBREF84])
└   length(e.childs) = 9
┌ Info: (@lens _.childs[:bib_entries].childs[:BIBREF2])
└   length(e.childs) = 9
┌ Info: (@lens _.childs[:bib_entries].childs[:BIBREF223])
└   length(e.childs) = 8
┌ 

The dictionaries with the most unique children are the following:
```
┌ Info: (@lens _.childs[:ref_entries])
└   length(e.childs) = 13
┌ Info: (@lens _.childs[:bib_entries])
└   length(e.childs) = 103
```
This is where the keys carry semantic meaning.
JsonGrinder contains the `ExtractKeyAsField` extractor, which treats
dictionaries with a large number of keys as an array of (key, value) pairs,
which leads to a more reasonable model.
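
The change of viewpoint can be sketched in plain Julia as follows. This only illustrates the data reshaping on a toy dictionary and says nothing about how `ExtractKeyAsField` is implemented; `as_records` is a hypothetical helper.

```julia
# A dictionary whose keys are identifiers rather than field names.
bib_entries = Dict(
    "BIBREF0" => Dict("title" => "Paper A", "year" => 2019),
    "BIBREF1" => Dict("title" => "Paper B", "year" => 2020),
)

# Reinterpret it as a list of records in which the key is an ordinary field.
# Keys are sorted only to make the order deterministic.
as_records(d) = [(key = k, value = d[k]) for k in sort!(collect(keys(d)))]

records = as_records(bib_entries)
first_key = records[1].key               # "BIBREF0"
second_year = records[2].value["year"]   # 2020
```

After the reshaping every record has the same two fields, so the number of distinct keys no longer affects the structure of the model, and the key itself becomes a value that can be learned from.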

The `key_as_field` setting has a default value, but we set it to 13 ourselves to cover
both cases we see in our data. This is done by creating a new extractor
like this:

In [9]:
extractor = suggestextractor(sch, (; key_as_field=13))

[ Info: [:ref_entries] seems to store values in keys, therefore node is treated as bag with keys as extra values.
[ Info: [:bib_entries] seems to store values in keys, therefore node is treated as bag with keys as extra values.
└ @ JsonGrinder ~/work/JsonGrinder.jl/JsonGrinder.jl/src/schema/dict.jl:61
└ @ JsonGrinder ~/work/JsonGrinder.jl/JsonGrinder.jl/src/schema/dict.jl:61


[34mDict[39m
[34m  ├───── metadata: [39m[31mDict[39m
[34m  │                [39m[31m  ├── authors: [39m[32mArray of[39m
[34m  │                [39m[31m  │            [39m[32m  ╰── [39m[33mDict[39m
[34m  │                [39m[31m  │            [39m[32m      [39m[33m  ├─────── middle: [39m[36mArray of[39m
[34m  │                [39m[31m  │            [39m[32m      [39m[33m  │                [39m[36m  ┊[39m
[34m  │                [39m[31m  │            [39m[32m      [39m[33m  ├──────── first: [39m[39mCategorical d = 87
[34m  │                [39m[31m  │            [39m[32m      [39m[33m  ├─────── suffix: [39m[39mCategorical d = 2
[34m  │                [39m[31m  │            [39m[32m      [39m[33m  ├───────── last: [39m[39mCategorical d = 80
[34m  │                [39m[31m  │            [39m[32m      [39m[33m  ├──────── email: [39m[39mCategorical d = 5
[34m  │                [39m[31m  │            [39m[32m   

When we look at a larger part of the extractor,

In [10]:
printtree(extractor, htrunc=20, vtrunc=20)

Dict
  ├───── metadata: Dict
  │                  ├── authors: Array of
  │                  │              ╰── Dict
  │                  │                    ├─────── middle: Array of
  │                  │                    │                  ╰── Categorical d = 13
  │                  │                    ├──────── first: Categorical d = 87
  │                  │                    ├─────── suffix: Categorical d = 2
  │                  │                    ├───────── last: Categorical d = 80
  │                  │                    ├──────── email: Categorical d = 5
  │                  │                    ╰── affiliation: Dict
  │                  │                                       ├─── laboratory: Categorical d = 6
  │                  │                                       ├───── location: Dict
  │                  │                                       │                  ├── settlement: Categorical d = 10
  │                  │                                       │ 

we see that the representation of `bib_entries` and `ref_entries` is
now more reasonable.

So we can say this extractor looks much better.
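
The choice of 13 can be checked against the child counts logged earlier. The sketch below copies a few of those counts by hand and applies a simple cut-off. The inclusive comparison `>=` is an assumption, inferred from the fact that `ref_entries` with exactly 13 children was reported as converted, and the path strings are informal labels rather than lenses.

```julia
# Child counts taken from the log output above (a hand-copied subset).
child_counts = [
    "(root)"               => 7,
    "metadata.authors[]"   => 6,
    "ref_entries"          => 13,
    "bib_entries"          => 103,
    "bib_entries.BIBREF33" => 9,
]

threshold = 13

# Assumption: a dictionary qualifies when its child count reaches the threshold.
selected = [path for (path, n) in child_counts if n >= threshold]
```

With this cut-off `selected` holds exactly `ref_entries` and `bib_entries`, while the individual bibliography entries with 8 or 9 children stay ordinary dictionaries.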

Still, some values are very sparse.
Let's print all parts of the schema where each value is observed only once.

In [11]:
for i in list_lens(sch)
    e = get(sch, i)
    if e isa Entry && maximum(values(e.counts)) == 1
        @info i
    end
end

[ Info: (@lens _.childs[:metadata].childs[:authors].items.childs[:middle].items)
[ Info: (@lens _.childs[:metadata].childs[:authors].items.childs[:affiliation].childs[:location].childs[:region])
[ Info: (@lens _.childs[:ref_entries].childs[:TABREF4].childs[:text])
[ Info: (@lens _.childs[:ref_entries].childs[:TABREF4].childs[:html])
[ Info: (@lens _.childs[:ref_entries].childs[:FIGREF3].childs[:text])
[ Info: (@lens _.childs[:ref_entries].childs[:FIGREF2].childs[:text])
[ Info: (@lens _.childs[:ref_entries].childs[:FIGREF4].childs[:text])
[ Info: (@lens _.childs[:ref_entries].childs[:TABREF3].childs[:text])
[ Info: (@lens _.childs[:ref_entries].childs[:TABREF3].childs[:html])
[ Info: (@lens _.childs[:ref_entries].childs[:TABREF5].childs[:text])
[ Info: (@lens _.childs[:ref_entries].childs[:TABREF5].childs[:html])
[ Info: (@lens _.childs[:ref_entries].childs[:TABREF6].childs[:type])
[ Info: (@lens _.childs[:ref_entries].childs[:TABREF6].childs[:text])
[ Info: (@lens _.childs[:ref_entrie

We can see lots of leaves under `bib_entries`, which is caused by the uniqueness of the keys there,
but apart from that, we can see other interesting fields:
```
[ Info: (@lens _.childs[:metadata].childs[:authors].items.childs[:middle].items)
[ Info: (@lens _.childs[:metadata].childs[:authors].items.childs[:last])
[ Info: (@lens _.childs[:metadata].childs[:authors].items.childs[:affiliation].childs[:location].childs[:region])
[ Info: (@lens _.childs[:paper_id])
[ Info: (@lens _.childs[:body_text].items.childs[:text])
[ Info: (@lens _.childs[:body_text].items.childs[:ref_spans].items.childs[:start])
[ Info: (@lens _.childs[:body_text].items.childs[:ref_spans].items.childs[:end])
[ Info: (@lens _.childs[:back_matter].items.childs[:text])
[ Info: (@lens _.childs[:back_matter].items.childs[:cite_spans].items.childs[:ref_id])
[ Info: (@lens _.childs[:back_matter].items.childs[:cite_spans].items.childs[:start])
[ Info: (@lens _.childs[:back_matter].items.childs[:cite_spans].items.childs[:text])
[ Info: (@lens _.childs[:back_matter].items.childs[:cite_spans].items.childs[:end])
[ Info: (@lens _.childs[:back_matter].items.childs[:ref_spans].items.childs[:start])
[ Info: (@lens _.childs[:back_matter].items.childs[:ref_spans].items.childs[:text])
[ Info: (@lens _.childs[:back_matter].items.childs[:ref_spans].items.childs[:end])
```
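
The criterion used in the loop, `maximum(values(e.counts)) == 1`, says that no value was seen more than once. A minimal sketch of the same test on plain vectors, with hypothetical helper names and made-up data, could look like this:

```julia
# Count how many times each value occurs (hypothetical helper).
function value_counts(xs)
    counts = Dict{eltype(xs), Int}()
    for x in xs
        counts[x] = get(counts, x, 0) + 1
    end
    return counts
end

# A field is "all unique" when the most frequent value occurs exactly once.
# `init=0` makes the check return false for an empty field instead of throwing.
all_unique(xs) = maximum(values(value_counts(xs)); init=0) == 1

paper_ids = ["a1", "b2", "c3"]   # every value differs, like an identifier
suffixes  = ["", "", "Jr."]      # the empty string repeats

ids_unique = all_unique(paper_ids)
suffixes_unique = all_unique(suffixes)
```

Fields that pass this test, such as identifiers or free text, give a categorical encoding nothing to generalize from, which is why they are candidates for removal.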

Let's remove some of them from the extractor so we don't train on them.
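
The removal relies on Base's `delete!`, which mutates the dictionary in place and returns that same dictionary. The toy example below uses a plain `Dict` with made-up values instead of an extractor to show both properties.

```julia
# A stand-in for the dictionary of child extractors (values are just labels here).
fields = Dict(
    :first  => "Categorical",
    :middle => "Array",
    :last   => "Categorical",
    :email  => "Categorical",
)

returned = delete!(fields, :last)   # returns the mutated dictionary itself
delete!(fields, :middle)
delete!(fields, :not_present)       # deleting a missing key is a no-op

remaining = sort!(collect(keys(fields)))
```

Because `delete!` returns the collection it modified, a notebook cell that ends with a `delete!` call displays the dictionary that was just changed.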

In [12]:
delete!(extractor.dict, :paper_id)
delete!(extractor.dict[:metadata].dict[:authors].item.dict, :last)
delete!(extractor.dict[:metadata].dict[:authors].item.dict, :middle)

Dict{Symbol, JsonGrinder.AbstractExtractor} with 4 entries:
  :first       => ExtractCategorical
  :suffix      => ExtractCategorical
  :email       => ExtractCategorical
  :affiliation => ExtractDict

Now the extractor looks even better!

In [13]:
printtree(extractor, htrunc=20, vtrunc=20)

Dict
  ├───── metadata: Dict
  │                  ├── authors: Array of
  │                  │              ╰── Dict
  │                  │                    ├──────── first: Categorical d = 87
  │                  │                    ├─────── suffix: Categorical d = 2
  │                  │                    ├──────── email: Categorical d = 5
  │                  │                    ╰── affiliation: Dict
  │                  │                                       ├─── laboratory: Categorical d = 6
  │                  │                                       ├───── location: Dict
  │                  │                                       │                  ├── settlement: Categorical d = 10
  │                  │                                       │                  ├──── addrLine: Categorical d = 5
  │                  │                                       │                  ├───── country: Categorical d = 8
  │                  │                                       │   

This concludes the example about examining a schema and modifying the extractor accordingly.

---

*This notebook was generated using [Literate.jl](https://github.com/fredrikekre/Literate.jl).*