Skip to content

Add an inspect command that gathers metrics about a Jelly file#65

Merged
Ostrzyciel merged 6 commits intomainfrom
39-stats-command
Apr 6, 2025
Merged

Add an inspect command that gathers metrics about a Jelly file#65
Ostrzyciel merged 6 commits intomainfrom
39-stats-command

Conversation

@Karolina-Bogacka
Copy link
Collaborator

Issue: #39

  • By default aggregates the metrics based on the whole file, but can also return them --per-frame
  • Returns a valid Yaml as an output
  • Can return the metrics to a file with --to
  • Accepts an input stream as well as a file

@Karolina-Bogacka
Copy link
Collaborator Author

Context - Test coverage OK, AOT tested

Copy link
Member

@Ostrzyciel Ostrzyciel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • There is no newline at the end of the inspect output.
  • Please add an empty line between the block about stream options and frames, for readability.

Copy link
Member

@Ostrzyciel Ostrzyciel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Logical and physical types should be written not only by ID, but also by name, for readability – e.g., GRAPHS (3)

Copy link
Member

@Ostrzyciel Ostrzyciel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Boolean options should be printed as true/false, not 0/1.
  • The order of outputted keys seems random. Can you use an ordered structure so that it makes more sense?

@Karolina-Bogacka
Copy link
Collaborator Author

Format after revisions for --per-frame

stream_options: 
  stream_name: ""
  physical_type: PHYSICAL_STREAM_TYPE_TRIPLES (1)
  generalized_statements: true
  rdf_star: true
  max_name_table_size: 128
  max_prefix_table_size: 16
  max_datatype_table_size: 16
  logical_type: LOGICAL_STREAM_TYPE_FLAT_TRIPLES (1)
  version: 1

frames: 
  - frame_index: 0
    option_count: 1
    triple_count: 3
    quad_count: 0
    graph_start_count: 0
    graph_end_count: 0
    namespace_count: 0
    name_count: 3
    prefix_count: 3
    datatype_count: 0
    
  - frame_index: 1
    option_count: 1
    triple_count: 3
    quad_count: 0
    graph_start_count: 0
    graph_end_count: 0
    namespace_count: 0
    name_count: 3
    prefix_count: 3
    datatype_count: 0

@Karolina-Bogacka
Copy link
Collaborator Author

And default aggregation

stream_options: 
  stream_name: ""
  physical_type: PHYSICAL_STREAM_TYPE_TRIPLES (1)
  generalized_statements: true
  rdf_star: true
  max_name_table_size: 128
  max_prefix_table_size: 16
  max_datatype_table_size: 16
  logical_type: LOGICAL_STREAM_TYPE_FLAT_TRIPLES (1)
  version: 1

frames: 
  frame_count: 1
  option_count: 1
  triple_count: 3
  quad_count: 0
  graph_start_count: 0
  graph_end_count: 0
  namespace_count: 0
  name_count: 3
  prefix_count: 3
  datatype_count: 0

printer.frameInfo += metrics

try {
val allRows = JellyUtil.iterateRdfStream(inputStream).toList
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This requires you to allocate a data structure per frame and keep it in memory... making all of this a non-streaming algorithm. If you feed in a very long file, you're going to have OOMs.

Can you rewrite it so that it operates on iterators?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I rewrote the whole thing to work on iterators - only in the case of the last step of the --per-frame aggregation, I write each frame stat to the output inside of a foreach statement, which in my understanding should be fine because I'm immediately discarding any materialized objects, but please let me know if it is not so.

@Ostrzyciel
Copy link
Member

Wooo, it works amazing :) I tested it on a stream with 14M frames (https://w3id.org/riverbench/datasets/officegraph/dev), and it didn't even break a sweat.

@Ostrzyciel Ostrzyciel merged commit 35e97a2 into main Apr 6, 2025
6 checks passed
@Ostrzyciel Ostrzyciel deleted the 39-stats-command branch April 6, 2025 09:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants