Add an inspect command that gathers metrics about a Jelly file #65
Ostrzyciel merged 6 commits into main
Context - Test coverage OK, AOT tested
Ostrzyciel
left a comment
- There is no newline at the end of the inspect output.
- Please add an empty line between the block about stream options and frames, for readability.
Ostrzyciel
left a comment
- Logical and physical types should be written not only by ID, but also by name, for readability – e.g., GRAPHS (3).
Ostrzyciel
left a comment
- Boolean options should be printed as true/false, not 0/1.
- The order of outputted keys seems random. Can you use an ordered structure so that it makes more sense?
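The formatting requests in these review comments (names next to IDs, booleans as true/false, a stable key order, a blank line between blocks, a trailing newline) can be sketched roughly as follows. This is only an illustration, not the PR's code: `InspectFormatSketch`, its methods, and the key names used in the usage note are hypothetical, and an insertion-ordered `ListMap` is just one way to get predictable key order.

```scala
import scala.collection.immutable.ListMap

// Hypothetical formatter; names and layout are assumptions, not the PR's MetricsPrinter.
object InspectFormatSketch {
  // Render an enum-like value as "NAME (id)", e.g. "GRAPHS (3)".
  def named(name: String, id: Int): String = s"$name ($id)"

  // ListMap preserves insertion order, so keys are printed in the order they were added.
  // Values kept as real Booleans print as true/false rather than 0/1.
  def block(title: String, entries: ListMap[String, Any]): String =
    (s"$title:" +: entries.toSeq.map { case (k, v) => s"  $k: $v" }).mkString("\n")

  // Separate blocks with an empty line and end the output with a newline.
  def document(blocks: Seq[String]): String = blocks.mkString("", "\n\n", "\n")
}
```

For example, `document(Seq(block("stream_options", ListMap[String, Any]("physical_type" -> named("GRAPHS", 3), "generalized" -> false)), block("frames", ListMap[String, Any]("count" -> 2))))` yields two blocks separated by a blank line, with keys in insertion order.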
Format after revisions for
And default aggregation
src/main/scala/eu/neverblink/jelly/cli/util/MetricsPrinter.scala
printer.frameInfo += metrics
try {
  val allRows = JellyUtil.iterateRdfStream(inputStream).toList
This requires you to allocate a data structure per frame and keep it in memory... making all of this a non-streaming algorithm. If you feed in a very long file, you're going to have OOMs.
Can you rewrite it so that it operates on iterators?
I rewrote the whole thing to work on iterators. The only exception is the last step of the --per-frame aggregation, where I write each frame's stats to the output inside a foreach. As I understand it, that should be fine because any materialized objects are discarded immediately, but please let me know if that is not the case.
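A minimal sketch of the difference being discussed, under stated assumptions: frames are modelled as `Seq[String]` and `Totals`, `FrameStats`, and `StreamingSketch` are hypothetical names, not the classes from this PR. The point is only that a fold or a foreach over an `Iterator` holds one frame at a time, whereas calling `.toList` first keeps every frame in memory.

```scala
// Hypothetical metric holders; the PR's real classes are not shown in this thread.
final case class FrameStats(index: Long, rows: Int)
final case class Totals(frames: Long, rows: Long)

object StreamingSketch {
  // Lazily map each frame to its stats; nothing is materialized here.
  def perFrame(frames: Iterator[Seq[String]]): Iterator[FrameStats] =
    frames.zipWithIndex.map { case (rows, i) => FrameStats(i.toLong, rows.size) }

  // Default aggregation: a running total, so memory use does not grow with the stream.
  def total(frames: Iterator[Seq[String]]): Totals =
    frames.foldLeft(Totals(0L, 0L)) { (acc, f) =>
      Totals(acc.frames + 1, acc.rows + f.size)
    }

  // Per-frame aggregation: write each stat as it is produced, then let it go.
  def printPerFrame(frames: Iterator[Seq[String]], out: java.io.PrintStream): Unit =
    perFrame(frames).foreach(s => out.println(s"frame ${s.index}: ${s.rows} rows"))
}
```

Because each `FrameStats` is written and dropped inside the `foreach`, only the current frame is reachable at any moment, which matches the reasoning in the reply above.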
Wooo, it works amazing :) I tested it on a stream with 14M frames (https://w3id.org/riverbench/datasets/officegraph/dev), and it didn't even break a sweat.
Issue: #39
--per-frame, --to