
Flatten output #89

Closed

wants to merge 6 commits into dev from flatten_output

Conversation

bendichter
Contributor

Motivation

What was the reasoning behind this change? Please explain the changes briefly.

How to test the behavior?

Show here how to reproduce the new behavior (can be a bug fix or a new feature)

Checklist

  • Have you ensured the PR description clearly describes the problem and solutions?
  • Have you checked to ensure that there aren't other open or previously closed Pull Requests for the same change?
  • Is your code contribution compliant with black format? If not, simply pip install black and run black . inside your fork.
  • If this PR fixes an issue, is the first line of the PR description fix #XX where XX is the issue number?

@bendichter
Contributor Author

@CodyCBakerPhD

There is a usability issue with the current way to generate reports. What if the user has 100+ files? Maybe on the first 50 or so there are no critical errors, but there is a critical error on file 51? This would be difficult for the user to spot, because they'd probably scroll through the first few files and assume that they are all the same. To address this, we could display the results in a different order, e.g.:

0.  CRITICAL
------------
0.0.  check_data_orientation
  0.0.0.  test:/processing/behavior/Position/my_spatial_series - Data may be in the wrong orientation. Time should be in the first dimension, and is usually the longest dimension. Here, another dimension is longer.
0.1.  check_timestamps_match_first_dimension
  0.1.0.  test:/acquisition/test_time_series_3 - The length of the first dimension of data does not match the length of timestamps.
  0.1.1.  test2.nwb:/acquisition/test_time_series_3 - The length of the first dimension of data does not match the length of timestamps.

1.  BEST_PRACTICE_VIOLATION
---------------------------
1.0.  check_regular_timestamps
  1.0.0.  test:/acquisition/test_time_series_2 - TimeSeries appears to have a constant sampling rate. Consider specifying starting_time=1.2 and rate=2.0 instead of timestamps.

2.  BEST_PRACTICE_SUGGESTION
----------------------------
2.0.  check_small_dataset_compression
  2.0.0.  test:/acquisition/test_time_series_1 - data is not compressed. Consider enabling compression when writing a dataset.

This would gather critical errors across all files to the top, and would aggregate them by check.

In order to accomplish this, I reorganized the code to return a flat list of messages that includes "file", and delay the organization part, since these two different printing methods require different organization. Once that was done, it made sense to yield all messages, since Satra asked for that and it seemed like it might be a useful feature. Let's discuss in our meeting today.
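To make the idea concrete, here is a minimal sketch (field and function names are illustrative, not the actual nwbinspector API) of a flat message list that carries the file, with the importance-then-check grouping applied only at display time:

from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Message:  # hypothetical stand-in for the real InspectorMessage
    importance: int  # e.g. 2 = CRITICAL, 1 = BEST_PRACTICE_VIOLATION, 0 = BEST_PRACTICE_SUGGESTION
    check_name: str
    file: str
    location: str
    text: str


def organize_by_importance_then_check(messages):
    """Group a flat list of messages by importance level, then by check, across all files."""
    out = defaultdict(lambda: defaultdict(list))
    for message in messages:
        out[message.importance][message.check_name].append(message)
    # highest importance first, matching the report layout above
    return {importance: dict(out[importance]) for importance in sorted(out, reverse=True)}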

@CodyCBakerPhD
Collaborator

@bendichter Yeah, looking over this now. Mostly looks good; was there a reason the severity levels had to change?

@bendichter
Contributor Author

bendichter commented Mar 7, 2022

I wanted to be able to sort messages by the value of severity: sorted(messages, key=lambda x: x.severity.value)
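For context, plain Enum members are not orderable with "<", which is why the sort key has to go through .value. A tiny self-contained sketch (the numeric values here are made up):

from collections import namedtuple
from enum import Enum


class Severity(Enum):  # illustrative values only
    HIGH = 3
    LOW = 2
    NO_SEVERITY = 1


Message = namedtuple("Message", "text severity")
messages = [Message("a", Severity.NO_SEVERITY), Message("b", Severity.HIGH)]

# plain Enum members do not support "<" comparisons, so sort on .value
ordered = sorted(messages, key=lambda x: x.severity.value, reverse=True)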

@CodyCBakerPhD
Collaborator

Right, but the actual values for the severity comparison changed:

https://github.com/NeurodataWithoutBorders/nwbinspector/pull/89/files#diff-cfa07db9e698a249418609afea3dcd87686a8f03b6217ca57e75e6f50579d6c6L28-R29

This means a dataset compression below 1 GB would now get ranked below missing metadata, for instance.

@bendichter
Contributor Author

@CodyCBakerPhD should "low" be above "no severity"? I find that pretty counter-intuitive.

@CodyCBakerPhD
Collaborator

CodyCBakerPhD commented Mar 7, 2022

The way we've treated it thus far: most of the time, severity is NO_SEVERITY because there is no basis for comparison, as is the case with missing metadata.

Basically there are two intentions: (a) the act of specifying severity at all means you care about the ordering of this check relative to the average check that doesn't have ranked severity; (b) for two instances of the same check raised on the same exact file but with clearly different magnitudes and priorities for fixing them (e.g., a 15 GB dataset not compressed vs. a 0.8 GB dataset not compressed), the higher-severity one is ranked higher.

Point (b) makes the logical use of severity comparisons specific to an individual check, irrespective of the severities specified by other checks of the same importance level.

Whereas by setting the baseline between HIGH and LOW, when writing a new check we would have to think about where it fits into the global priority order of checks within that importance level, which becomes a more difficult task as the number of checks grows.
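A rough sketch of intention (b), with a made-up check and threshold (not the library's actual check_small_dataset_compression implementation): severity only ranks instances of the same check against each other.

def check_dataset_compression(dataset_size_gb, compressed):
    """Hypothetical check: severity scales with how large the uncompressed data is."""
    if compressed:
        return None
    severity = "HIGH" if dataset_size_gb >= 1.0 else "LOW"
    return dict(message="Consider enabling compression when writing a dataset.", severity=severity)


print(check_dataset_compression(15.0, compressed=False))  # severity: HIGH
print(check_dataset_compression(0.8, compressed=False))   # severity: LOW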

@bendichter
Contributor Author

@CodyCBakerPhD all severity values have numerical values, which means they are being compared to each other and there really isn't a "no severity." Maybe the best would be to have severity = low, medium (default) or high

@bendichter bendichter marked this pull request as ready for review March 7, 2022 18:46
@codecov-commenter

Codecov Report

Merging #89 (6c12489) into dev (47698e8) will decrease coverage by 6.39%.
The diff coverage is 73.33%.

Impacted file tree graph

@@            Coverage Diff             @@
##              dev      #89      +/-   ##
==========================================
- Coverage   94.31%   87.91%   -6.40%     
==========================================
  Files          12       12              
  Lines         422      513      +91     
==========================================
+ Hits          398      451      +53     
- Misses         24       62      +38     
Flag Coverage Δ
unittests 87.91% <73.33%> (-6.40%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
nwbinspector/nwbinspector.py 71.61% <59.79%> (-11.57%) ⬇️
nwbinspector/utils.py 96.15% <93.33%> (-3.85%) ⬇️
nwbinspector/checks/tables.py 100.00% <100.00%> (ø)
nwbinspector/inspector_tools.py 75.47% <100.00%> (-20.76%) ⬇️
nwbinspector/register_checks.py 98.86% <100.00%> (+0.04%) ⬆️

@CodyCBakerPhD
Collaborator

all severity values have numerical values, which means they are being compared to each other and there really isn't a "no severity." Maybe the best would be to have severity = low, medium (default) or high

That is indeed effectively how it is; the point of NO_SEVERITY as the default, however, is that the majority of checks do not specify any severity, and not specifying at all (the default) should still be the lowest case.

It's extra work during check development and review to have to think about whether the output of a check should be manually ranked lower than the average of all other checks of that importance, which makes it harder to add new checks, especially for really basic stuff like missing metadata. The only time severity is intended to be specified and thought about is when there's some quantitative basis for comparison, which is mostly the case for data-related checks.

If it really bothers you, the simplest thing would be to just remove NO_SEVERITY altogether and set LOW to be the default for everything.
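A minimal sketch of what that simplification could look like (the decorator stub and attribute names here are illustrative, not the merged code):

from enum import Enum


class Severity(Enum):  # NO_SEVERITY dropped; LOW becomes the implicit default
    HIGH = 2
    LOW = 1


def register_check(importance, severity=Severity.LOW):
    """Hypothetical decorator stub: checks that never think about severity just inherit the LOW default."""
    def decorator(check_function):
        check_function.importance = importance
        check_function.severity = severity
        return check_function
    return decorator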

@bendichter
Contributor Author

If it really bothers you, the simplest thing would be to just remove NO_SEVERITY altogether and set LOW to be the default for everything.

I'm fine with that

@CodyCBakerPhD CodyCBakerPhD marked this pull request as draft March 8, 2022 17:08
@CodyCBakerPhD
Collaborator

Putting this in draft while I work on splitting it into separate PRs for easier review and integration.

Comment on lines +38 to +49
def organize_messages_by_file(messages):
    files = sorted(set(message.file for message in messages))
    out = {}
    for file in files:
        out[file] = {}
        file_messages = list(filter(lambda x: x.file == file, messages))
        for importance in sorted(set(message.importance for message in file_messages), key=lambda x: -x.value):
            out[file][importance] = sorted(
                filter(lambda x: x.importance == importance, file_messages),
                key=lambda x: -x.severity.value,
            )
    return out
Collaborator

After testing this, one thing to note is that the time complexity of applying a filter that iterates over the whole list checking for a potentially small number of boolean matches is sub-optimal. Here's a self-contained comparison that can be applied to any directory with a decent number of files in it:

"""Simple script for testing performance of two iteration methods for NWBInspector."""
from time import time
from pathlib import Path

import numpy as np

files = [x.name for x in Path(".").iterdir()]
nfiles = len(files)
messages = [
    dict(msg="foo", file=files[np.random.randint(low=0, high=nfiles)])
    for _ in range(nfiles * 50)  # average of 50 messages per file; feel free to change
]

start = time()
out1 = {file: list(filter(lambda x: x["file"] == file, messages)) for file in files}
end = time()
time1 = end - start

start = time()
out2 = {file: list() for file in files}
for message in messages:
    out2[message["file"]] = message
end = time()
time2 = end - start

For my system at least, the binning method (proposed option 2) is 10x faster.

Collaborator

As such, I was going to refactor this type of functionality like

from typing import List

from natsort import natsorted

# assuming InspectorMessage and Importance live in register_checks; adjust the import to the package layout
from nwbinspector.register_checks import Importance, InspectorMessage


def organize_messages_by_file(messages: List[InspectorMessage]):
    """Order InspectorMessages by file name then importance."""
    files = natsorted(set(message.file for message in messages))
    importance_levels = [Importance.CRITICAL, Importance.BEST_PRACTICE_VIOLATION, Importance.BEST_PRACTICE_SUGGESTION]
    out = {file: {importance: list() for importance in importance_levels} for file in files}  # always add the small number of importance bins
    for message in messages:  # linear time, single pass through message list
        out[message.file][message.importance].append(message)
    for file in files:
        for importance in importance_levels:  # Worst case complexity: (number of files) x 3
            if out[file][importance]:
                out[file][importance] = sorted(out[file][importance], key=lambda x: -x.severity.value)
            else:
                out[file].pop(importance)
    return out

Basically, the more pre-sorting you can apply to the output structure ahead of actual iteration, the better.

@bendichter What do you think? Any other suggestions for improvement?

Contributor Author

@bendichter bendichter Mar 8, 2022

@CodyCBakerPhD how about this as a general solution

def hier_org(objects, levels):
    unique_attrs = sorted(set(getattr(object, levels[0]) for object in objects))
    if len(levels) > 1:
        return {attr: hier_org([obj for obj in objects if getattr(obj, levels[0]) == attr], levels[1:])
            for attr in unique_attrs}
    else:
        return {attr: [obj for obj in objects if getattr(obj, levels[0]) == attr] for attr in unique_attrs}
    

# test
from collections import namedtuple

Obj = namedtuple("Obj", "a b c")

objects = []
for a in range(3):
    for b in range(3):
        for c in range(3):
            objects.append(Obj(a=a,b=b,c=c))
            
hier_org(objects, ["a", "b", "c"])
{0: {0: {0: [Obj(a=0, b=0, c=0)],
   1: [Obj(a=0, b=0, c=1)],
   2: [Obj(a=0, b=0, c=2)]},
  1: {0: [Obj(a=0, b=1, c=0)],
   1: [Obj(a=0, b=1, c=1)],
   2: [Obj(a=0, b=1, c=2)]},
  2: {0: [Obj(a=0, b=2, c=0)],
   1: [Obj(a=0, b=2, c=1)],
   2: [Obj(a=0, b=2, c=2)]}},
 1: {0: {0: [Obj(a=1, b=0, c=0)],
   1: [Obj(a=1, b=0, c=1)],
   2: [Obj(a=1, b=0, c=2)]},
  1: {0: [Obj(a=1, b=1, c=0)],
   1: [Obj(a=1, b=1, c=1)],
   2: [Obj(a=1, b=1, c=2)]},
  2: {0: [Obj(a=1, b=2, c=0)],
   1: [Obj(a=1, b=2, c=1)],
   2: [Obj(a=1, b=2, c=2)]}},
 2: {0: {0: [Obj(a=2, b=0, c=0)],
   1: [Obj(a=2, b=0, c=1)],
   2: [Obj(a=2, b=0, c=2)]},
  1: {0: [Obj(a=2, b=1, c=0)],
   1: [Obj(a=2, b=1, c=1)],
   2: [Obj(a=2, b=1, c=2)]},
  2: {0: [Obj(a=2, b=2, c=0)],
   1: [Obj(a=2, b=2, c=1)],
   2: [Obj(a=2, b=2, c=2)]}}}

Collaborator

That looks like it should cover my next question if implemented correctly, which was to have a single organize_messages function that had an argument that could dynamically specify the order in which we want things organized (["file", "importance"], or ["importance", "neurodata_type", "importance"], etc.)

I'll play around with this a bit and get back to you!

Collaborator

@bendichter That code had some great notions of recurrence logic in it, but it still suffered from the same nested iteration as before due to the line

        return {attr: hier_org([obj for obj in objects if getattr(obj, levels[0]) == attr], levels[1:])
            for attr in unique_attrs}

which still iterates over every obj for each unique_attr to extract the subset corresponding to a particular attr. A time complexity analysis of this shows a worst-case time that scales with (number of objects) × (number of unique attribute values) at each level of the recursion, and which tends to be dominated by its two largest per-level terms.

I took this logic and made two versions seen on this commit on another branch: 21a30aa

First version 'fancy' is pretty much that same exact code but built on our objects +/- some Enum logic for sorting.

Second version 'efficient' is my binned approach that separates the dictionary construction from the data filtering, though it needs some helper functions for indexing arbitrarily nested dictionaries. This decoupling improves the time complexity to roughly a single linear pass through the message list plus a small per-bin overhead (on the order of the number of files times the number of importance levels).
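For illustration, the kind of helper that approach needs for indexing arbitrarily nested dictionaries might look like this (a sketch, not the code from commit 21a30aa):

def nested_append(out, keys, value):
    """Walk (and create) nested dicts along `keys`, then append `value` to the list at the innermost level."""
    inner = out
    for key in keys[:-1]:
        inner = inner.setdefault(key, dict())
    inner.setdefault(keys[-1], list()).append(value)


# usage sketch: bin each message exactly once, keyed by the requested levels, e.g.
#     for message in messages:
#         nested_append(out, [getattr(message, level) for level in levels], message)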

To test performance, I downloaded DANDISet 000004 and set up the following

from pathlib import Path

from nwbinspector import inspect_all  # assuming the top-level import; adjust to match the package layout

messages = list()
for folder in Path(".../000004").iterdir():
    messages.extend(inspect_all(path=folder))

which gives us a list of 611 InspectorMessages. Now

levels = ["file", "importance"]
 # just for testing, technically any order and combination of attributes for an InspectorMessage can be used

start = time()
res1 = fancy_organize_messages(messages=messages, levels=levels)
end = time()
time1 = end - start

start = time()
res2 = efficient_organize_messages(messages=messages, levels=levels)
end = time()
time2 = end - start

As before, the 'efficient' method is clocking much faster (about 6x on real data), so I'll proceed with adding tests and integrating that method.

Overall a very cool advancement that allows arbitrary user choice of output structure, if they want any.

Contributor Author

@bendichter bendichter Mar 8, 2022

How about this: I create an objects_by_attr dict so that the objects are only looped over once per recursion.

from collections import defaultdict


def hier_org(objects, levels, reverse=None):
    if reverse is None:
        reverse = [False] * len(levels)
    unique_attrs = sorted(set(getattr(obj, levels[0]) for obj in objects))
    if reverse[0]:
        unique_attrs = unique_attrs[::-1]
    # bin the objects by the current attribute so they are only looped over once per recursion
    objects_by_attr = defaultdict(list)
    for obj in objects:
        objects_by_attr[getattr(obj, levels[0])].append(obj)
    if len(levels) > 1:
        return {attr: hier_org(objects_by_attr[attr], levels[1:], reverse[1:]) for attr in unique_attrs}
    else:
        # keep the full list at the leaf; multiple objects can share the same attribute value
        return {attr: objects_by_attr[attr] for attr in unique_attrs}


# test
from collections import namedtuple

Obj = namedtuple("Obj", "a b c")

objects = []
for a in range(3):
    for b in range(3):
        for c in range(3):
            objects.append(Obj(a=a, b=b, c=c))

hier_org(objects, ["a", "b", "c"], [True, False, True])

@CodyCBakerPhD
Collaborator

All relevant code changes proposed by the PR have been merged along separate channels. Closing this now.

@CodyCBakerPhD CodyCBakerPhD deleted the flatten_output branch March 21, 2022 15:05