# Schema Examination
In this example we build a schema of documents with a complex structure and show how to filter it and perform transformations.

We start by installing JsonGrinder and a few other packages we need for the example.
The Julia ecosystem follows a philosophy of many small, single-purpose, composable packages,
which may differ from e.g. Python, where we usually use fewer, larger packages.

In [1]:
using Pkg
pkg"add JsonGrinder#master Flux Mill#master MLDataPattern JSON HierarchicalUtils StatsBase"

using JsonGrinder, Flux, Mill, MLDataPattern, JSON, HierarchicalUtils, StatsBase
using JsonGrinder: DictEntry, Entry

    Updating git-repo `https://github.com/CTUAvastLab/JsonGrinder.jl.git`
    Updating git-repo `https://github.com/CTUAvastLab/Mill.jl.git`
   Resolving package versions...
  No Changes to `~/work/JsonGrinder.jl/JsonGrinder.jl/docs/Project.toml`
  No Changes to `~/work/JsonGrinder.jl/JsonGrinder.jl/docs/Manifest.toml`


We load the files in `data/documents`, parse them, and build the schema from them.

In [2]:
data_dir = "../../../data/documents"
sch = JsonGrinder.schema(readdir(data_dir, join=true), x->open(JSON.parse, x))

[Dict] 	# updated = 16
  ├───── metadata: [Dict] 	# updated = 16
  │                  ├── authors: [List] 	# updated = 16
  │                  │              ⋮
  │                  └──── title: [Scalar - String], 14 unique values 	# updated = 16
  ├── ref_entries: [Dict] 	# updated = 16
  │                  ├── TABREF4: [Dict] 	# updated = 3
  │                  │              ⋮
  │                  ├── FIGREF3: [Dict] 	# updated = 3
  │                  │              ⋮
  │                  ⋮
  │                  └── TABREF2: [Dict] 	# updated = 6
  │

The default printing method restricts depth and width of the printed schema.
We can see the whole schema using the `printtree` function from `HierarchicalUtils`.

In [3]:
printtree(sch)

[Dict] 	# updated = 16
  ├───── metadata: [Dict] 	# updated = 16
  │                  ├── authors: [List] 	# updated = 16
  │                  │              └── [Dict] 	# updated = 91
  │                  │                    ├─────── middle: [List] 	# updated = 91
  │                  │                    │                  └── [Scalar - String], 12 unique values 	# updated = 12
  │                  │                    ├──────── first: [Scalar - String], 86 unique values 	# updated = 91
  │                  │                    ├─────── suffix: [Scalar - String], 1 unique values 	# updated = 91
  │                  │                    ├───────── last: [Scalar - String], 79 unique values 	# updated = 91
  │                  │                    ├──────── email: [Scalar - String], 4 unique values 	# updated = 91
  │                  │                    └── affiliation: [Dict] 	# updated = 91
  │                  │                                       ├─── laboratory: [Scalar - Stri

This is what one of the documents looks like:

In [4]:
open(JSON.parse, first(readdir(data_dir, join=true)))

Dict{String, Any} with 7 entries:
  "bib_entries" => Dict{String, Any}("BIBREF9"=>Dict{String, Any}("ref_id"=>"b9…
  "body_text"   => Any[Dict{String, Any}("ref_spans"=>Any[], "cite_spans"=>Any[…
  "back_matter" => Any[Dict{String, Any}("ref_spans"=>Any[], "cite_spans"=>Any[…
  "metadata"    => Dict{String, Any}("title"=>"", "authors"=>Any[Dict{String, A…
  "abstract"    => Any[Dict{String, Any}("ref_spans"=>Any[], "cite_spans"=>Any[…
  "ref_entries" => Dict{String, Any}("FIGREF0"=>Dict{String, Any}("latex"=>noth…
  "paper_id"    => "0000fcce604204b1b9d876dc073eb529eb5ce305"

We let JsonGrinder suggest a default extractor based on the schema.

In [5]:
extractor = suggestextractor(sch)

└ @ JsonGrinder ~/work/JsonGrinder.jl/JsonGrinder.jl/src/schema/dict.jl:62

Dict
  ├───── metadata: Dict
  │                  ├── authors: Array of
  │                  │              ⋮
  │                  └──── title: Categorical d = 15
  ├── ref_entries: Dict
  │                  ├── TABREF5: Dict
  │                  │              ⋮
  │                  ├── TABREF4: Dict
  │                  │              ⋮
  │                  ⋮
  │                  └── TABREF2: Dict
  │                                 ⋮
  ⋮
  └───── abstract: Array of
                     └── Dict
                           ⋮

We show the whole extractor.

In [6]:
printtree(extractor)

Dict
  ├───── metadata: Dict
  │                  ├── authors: Array of
  │                  │              └── Dict
  │                  │                    ├─────── middle: Array of
  │                  │                    │                  └── Categorical d = 13
  │                  │                    ├──────── first: Categorical d = 87
  │                  │                    ├─────── suffix: Categorical d = 2
  │                  │                    ├───────── last: Categorical d = 80
  │                  │                    ├──────── email: Categorical d = 5
  │                  │                    └── affiliation: Dict
  │                  │                                       ├─── laboratory: Categorical d = 6
  │                  │                                       ├───── location: Dict
  │                  │                                       │                  ├── settlement: Categorical d = 10
  │                  │                                       │ 

We see that some dictionaries have a lot of keys, so let's examine the schema.
`list_lens` lets us iterate over all elements of the schema while keeping track of their position in it.
The following loop prints the number of children of every `DictEntry`.

In [7]:
for i in list_lens(sch)
    e = get(sch, i)
    if e isa DictEntry
        @info i length(e.childs)
    end
end

┌ Info: (@lens _)
└   length(e.childs) = 7
┌ Info: (@lens _.childs[:metadata])
└   length(e.childs) = 2
┌ Info: (@lens _.childs[:metadata].childs[:authors].items)
└   length(e.childs) = 6
┌ Info: (@lens _.childs[:metadata].childs[:authors].items.childs[:affiliation])
└   length(e.childs) = 3
┌ Info: (@lens _.childs[:metadata].childs[:authors].items.childs[:affiliation].childs[:location])
└   length(e.childs) = 5
┌ Info: (@lens _.childs[:ref_entries])
└   length(e.childs) = 13
┌ Info: (@lens _.childs[:ref_entries].childs[:TABREF4])
└   length(e.childs) = 3
┌ Info: (@lens _.childs[:ref_entries].childs[:FIGREF3])
└   length(e.childs) = 2
┌ Info: (@lens _.childs[:ref_entries].childs[:FIGREF2])
└   length(e.childs) = 2
┌ Info: (@lens _.childs[:ref_entries].childs[:FIGREF4])
└   length(e.childs) = 2
┌ Info: (@lens _.childs[:ref_entries].childs[:TABREF3])
└   length(e.childs) = 3
┌ Info: (@lens _.childs[:ref_entries].childs[:TABREF5])
└   length(e.childs) = 3
┌ Info: (@lens _.childs[:ref_entr

That's a lot of numbers, so let's look at a histogram instead.

In [8]:
length_hist = StatsBase.countmap([length(get(sch, i).childs) for i in list_lens(sch) if get(sch, i) isa DictEntry])

Dict{Int64, Int64} with 12 entries:
  5   => 1
  8   => 12
  1   => 43
  0   => 59
  6   => 1
  9   => 91
  3   => 9
  7   => 1
  103 => 1
  4   => 97
  13  => 1
  2   => 8
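
The largest dictionary sizes can also be read off the histogram programmatically. The following is a minimal sketch in plain Julia; it copies the histogram from the output above into a literal `Dict`, so it runs without JsonGrinder. The names `sizes`, `widest`, and `n_wide` are introduced here only for illustration.

```julia
# Histogram copied from the output above:
# number of children => number of DictEntry nodes with that many children.
length_hist = Dict(5 => 1, 8 => 12, 1 => 43, 0 => 59, 6 => 1, 9 => 91,
                   3 => 9, 7 => 1, 103 => 1, 4 => 97, 13 => 1, 2 => 8)

# Sort the observed child counts from largest to smallest.
sizes = sort(collect(keys(length_hist)), rev=true)

# The two widest dictionary sizes are the candidates for special treatment.
widest = sizes[1:2]
println(widest)

# Number of dictionaries with at least 13 children.
n_wide = sum(v for (k, v) in length_hist if k >= 13)
println(n_wide)
```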

We see that the highest lengths are 103 and 13, so let's set 13 as the `key_as_field` threshold.

In [9]:
extractor = suggestextractor(sch, (; key_as_field=13))

[ Info: [:ref_entries] seems to store values in keys, therefore node is treated as bag with keys as extra values.
[ Info: [:bib_entries] seems to store values in keys, therefore node is treated as bag with keys as extra values.
└ @ JsonGrinder ~/work/JsonGrinder.jl/JsonGrinder.jl/src/schema/dict.jl:62
└ @ JsonGrinder ~/work/JsonGrinder.jl/JsonGrinder.jl/src/schema/dict.jl:62


Dict
  ├───── metadata: Dict
  │                  ├── authors: Array of
  │                  │              ⋮
  │                  └──── title: Categorical d = 15
  ├── ref_entries: KeyAsField
  │                  ├── String
  │                  └── Dict
  │                        ⋮
  ⋮
  └───── abstract: Array of
                     └── Dict
                           ⋮

Let's show the new extractor.

In [10]:
printtree(extractor)

Dict
  ├───── metadata: Dict
  │                  ├── authors: Array of
  │                  │              └── Dict
  │                  │                    ├─────── middle: Array of
  │                  │                    │                  └── Categorical d = 13
  │                  │                    ├──────── first: Categorical d = 87
  │                  │                    ├─────── suffix: Categorical d = 2
  │                  │                    ├───────── last: Categorical d = 80
  │                  │                    ├──────── email: Categorical d = 5
  │                  │                    └── affiliation: Dict
  │                  │                                       ├─── laboratory: Categorical d = 6
  │                  │                                       ├───── location: Dict
  │                  │                                       │                  ├── settlement: Categorical d = 10
  │                  │                                       │ 

This extractor looks much better, but some values are still very sparse.
Let's print all parts of the schema where each value is observed only once.

In [11]:
for i in list_lens(sch)
    e = get(sch, i)
    if e isa Entry && maximum(values(e.counts)) == 1
        @info i
    end
end

[ Info: (@lens _.childs[:metadata].childs[:authors].items.childs[:middle].items)
[ Info: (@lens _.childs[:metadata].childs[:authors].items.childs[:affiliation].childs[:location].childs[:region])
[ Info: (@lens _.childs[:ref_entries].childs[:TABREF4].childs[:text])
[ Info: (@lens _.childs[:ref_entries].childs[:TABREF4].childs[:html])
[ Info: (@lens _.childs[:ref_entries].childs[:FIGREF3].childs[:text])
[ Info: (@lens _.childs[:ref_entries].childs[:FIGREF2].childs[:text])
[ Info: (@lens _.childs[:ref_entries].childs[:FIGREF4].childs[:text])
[ Info: (@lens _.childs[:ref_entries].childs[:TABREF3].childs[:text])
[ Info: (@lens _.childs[:ref_entries].childs[:TABREF3].childs[:html])
[ Info: (@lens _.childs[:ref_entries].childs[:TABREF5].childs[:text])
[ Info: (@lens _.childs[:ref_entries].childs[:TABREF5].childs[:html])
[ Info: (@lens _.childs[:ref_entries].childs[:TABREF6].childs[:type])
[ Info: (@lens _.childs[:ref_entries].childs[:TABREF6].childs[:text])
[ Info: (@lens _.childs[:ref_entrie

We can see lots of leaves under `bib_entries`, which is caused by the uniqueness of keys there.
Apart from that, we can see other interesting fields:
[ Info: (@lens _.childs[:metadata].childs[:authors].items.childs[:middle].items)
[ Info: (@lens _.childs[:metadata].childs[:authors].items.childs[:last])
[ Info: (@lens _.childs[:metadata].childs[:authors].items.childs[:affiliation].childs[:location].childs[:region])
[ Info: (@lens _.childs[:paper_id])
[ Info: (@lens _.childs[:body_text].items.childs[:text])
[ Info: (@lens _.childs[:body_text].items.childs[:ref_spans].items.childs[:start])
[ Info: (@lens _.childs[:body_text].items.childs[:ref_spans].items.childs[:end])
[ Info: (@lens _.childs[:back_matter].items.childs[:text])
[ Info: (@lens _.childs[:back_matter].items.childs[:cite_spans].items.childs[:ref_id])
[ Info: (@lens _.childs[:back_matter].items.childs[:cite_spans].items.childs[:start])
[ Info: (@lens _.childs[:back_matter].items.childs[:cite_spans].items.childs[:text])
[ Info: (@lens _.childs[:back_matter].items.childs[:cite_spans].items.childs[:end])
[ Info: (@lens _.childs[:back_matter].items.childs[:ref_spans].items.childs[:start])
[ Info: (@lens _.childs[:back_matter].items.childs[:ref_spans].items.childs[:text])
[ Info: (@lens _.childs[:back_matter].items.childs[:ref_spans].items.childs[:end])
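
The criterion behind this listing, `maximum(values(e.counts)) == 1`, can be tried out on plain dictionaries. The sketch below assumes that `counts` maps each observed value to the number of times it was seen; the sample values and the helper name `all_unique` are hypothetical.

```julia
# Hypothetical value counts for two leaves, in the spirit of `e.counts`.
paper_id_counts = Dict("0000fcce" => 1, "0001a2b3" => 1, "00029c4d" => 1)
suffix_counts   = Dict("" => 91)

# A leaf is "all unique" when no value was observed more than once.
all_unique(counts) = maximum(values(counts)) == 1

all_unique(paper_id_counts)  # true
all_unique(suffix_counts)    # false
```

A categorical encoding of an all-unique leaf cannot generalize to new documents, which is why such fields are candidates for removal or for a different extractor.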

Let's remove some of these fields from the extractor.

In [12]:
delete!(extractor.dict, :paper_id)
delete!(extractor.dict[:metadata].dict[:authors].item.dict, :last)
delete!(extractor.dict[:metadata].dict[:authors].item.dict, :middle)

Dict{Symbol, JsonGrinder.AbstractExtractor} with 4 entries:
  :first       => ExtractCategorical
  :suffix      => ExtractCategorical
  :email       => ExtractCategorical
  :affiliation => ExtractDict

We can also notice that some long texts are extracted as categorical variables, e.g.

In [13]:
extractor[:body_text].item[:text]
extractor[:body_text].item[:section]

String

Let's replace them manually with string extractors.
Note that we need to use `.dict`, as the `[]` accessor on `item` is just read-only syntactic sugar.

In [14]:
extractor[:body_text].item.dict[:text] = ExtractString()
extractor[:body_text].item.dict[:section] = ExtractString()

String

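This concludes the example of examining a schema and modifying the extractor accordingly.

As a closing aside, the last cell passes custom `scalar_extractors` to `suggestextractor`, so that string leaves get both a categorical and a string representation.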

In [15]:
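using JsonGrinder: is_intable, is_floatable, unify_types, extractscalar

# Prepend a rule for string entries to the default scalar extractors.
function string_multi_representation_scalar_extractor()
	vcat([
	# the rule applies to entries whose values are all strings
	(e -> unify_types(e) <: String,
		# represent each string both categorically (20 most frequent values) and as a string
		(e, uniontypes) -> MultipleRepresentation((
			ExtractCategorical(top_n_keys(e, 20), uniontypes),
			extractscalar(unify_types(e), e, uniontypes)
		)))
	], JsonGrinder.default_scalar_extractor())
end

# Values of entry `e` with the `n` highest counts, most frequent first.
top_n_keys(e::Entry, n::Int) = map(x->x[1], sort(e.counts |> collect, by=x->x[2], rev=true)[begin:min(n, end)])
suggestextractor(sch, (;
	scalar_extractors=string_multi_representation_scalar_extractor(),
	key_as_field=13,
	)
) |> printtree
# check that `paper_id` is recognized as a string entry
unify_types(sch[:paper_id]) <: String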

[ Info: [:ref_entries] seems to store values in keys, therefore node is treated as bag with keys as extra values.
[ Info: [:bib_entries] seems to store values in keys, therefore node is treated as bag with keys as extra values.
└ @ JsonGrinder ~/work/JsonGrinder.jl/JsonGrinder.jl/src/schema/dict.jl:62
└ @ JsonGrinder ~/work/JsonGrinder.jl/JsonGrinder.jl/src/schema/dict.jl:62
Dict
  ├───── metadata: Dict
  │                  ├── authors: Array of
  │                  │              └── Dict
  │                  │                    ├─────── middle: Array of
  │                  │                    │                  └── MultiRepresentation
  │                  │                    │                        ├── e1: Categorical d = 13
  │                  │                    │                        └── e2: String
  │                  │                    ├──────── first: MultiRepresentation
  │                  │                    │                  ├── e1: Categorical d = 21
  │      

true
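
The selection done by `top_n_keys` can be checked in isolation. The sketch below applies the same sort-and-truncate logic to a plain `Dict` of value => count, so it runs without JsonGrinder; the function name `top_n` and the sample counts are hypothetical.

```julia
# Same logic as `top_n_keys` above, but on a plain Dict of value => count.
top_n(counts::Dict, n::Int) =
    map(first, sort(collect(counts), by = last, rev = true)[begin:min(n, end)])

# Made-up counts of observed values.
counts = Dict("Wang" => 5, "Li" => 3, "Zhang" => 2, "Smith" => 1)

top_n(counts, 2)   # ["Wang", "Li"]
top_n(counts, 10)  # all four values, since `min(n, end)` caps the range
```

Capping the range with `min(n, end)` keeps the function safe for entries with fewer than `n` distinct values.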

---

*This notebook was generated using [Literate.jl](https://github.com/fredrikekre/Literate.jl).*