
Flatten output #89

Closed

wants to merge 6 commits into dev from flatten_output

Conversation

bendichter
Contributor

Motivation

What was the reasoning behind this change? Please explain the changes briefly.

How to test the behavior?

Show here how to reproduce the new behavior (can be a bug fix or a new feature)

Checklist

  • Have you ensured the PR description clearly describes the problem and solutions?
  • Have you checked to ensure that there aren't other open or previously closed Pull Requests for the same change?
  • Is your code contribution compliant with black format? If not, simply pip install black and run black . inside your fork.
  • If this PR fixes an issue, is the first line of the PR description fix #XX where XX is the issue number?

@bendichter
Contributor Author

@CodyCBakerPhD

There is a usability issue with the current way to generate reports. What if the user has 100+ files? Maybe on the first 50 or so there are no critical errors, but there is a critical error on file 51? This would be difficult for the user to spot, because they'd probably scroll through the first few files and assume that they are all the same. To address this, we could display the results in a different order, e.g.:

0.  CRITICAL
------------
0.0.  check_data_orientation
  0.0.0.  test:/processing/behavior/Position/my_spatial_series - Data may be in the wrong orientation. Time should be in the first dimension, and is usually the longest dimension. Here, another dimension is longer.
0.1.  check_timestamps_match_first_dimension
  0.1.0.  test:/acquisition/test_time_series_3 - The length of the first dimension of data does not match the length of timestamps.
  0.1.1.  test2.nwb:/acquisition/test_time_series_3 - The length of the first dimension of data does not match the length of timestamps.

1.  BEST_PRACTICE_VIOLATION
---------------------------
1.0.  check_regular_timestamps
  1.0.0.  test:/acquisition/test_time_series_2 - TimeSeries appears to have a constant sampling rate. Consider specifying starting_time=1.2 and rate=2.0 instead of timestamps.

2.  BEST_PRACTICE_SUGGESTION
----------------------------
2.0.  check_small_dataset_compression
  2.0.0.  test:/acquisition/test_time_series_1 - data is not compressed. Consider enabling compression when writing a dataset.

This would gather critical errors across all files to the top, and would aggregate them by check.

In order to accomplish this, I reorganized the code to return a flat list of messages that includes "file", and delay the organization part, since these two different printing methods require different organization. Once that was done, it made sense to yield all messages, since Satra asked for that and it seemed like it might be a useful feature. Let's discuss in our meeting today.
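To make the idea concrete, here is a minimal sketch (field and function names are illustrative, not the actual nwbinspector API) of a flat message list that carries the file, with the importance-then-check grouping applied only at display time:

from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Message:  # hypothetical stand-in for the real InspectorMessage
    importance: int  # e.g. 2 = CRITICAL, 1 = BEST_PRACTICE_VIOLATION, 0 = BEST_PRACTICE_SUGGESTION
    check_name: str
    file: str
    location: str
    text: str


def organize_by_importance_then_check(messages):
    """Group a flat list of messages by importance level, then by check, across all files."""
    out = defaultdict(lambda: defaultdict(list))
    for message in messages:
        out[message.importance][message.check_name].append(message)
    # highest importance first, matching the report layout above
    return {importance: dict(out[importance]) for importance in sorted(out, reverse=True)}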

@CodyCBakerPhD
Collaborator

@bendichter Yeah, looking over this now. Mostly looks good; was there a reason the severity levels had to change?

@bendichter
Contributor Author

bendichter commented Mar 7, 2022

I wanted to be able to sort messages by the value of severity: sorted(messages, key=lambda x: x.severity.value)
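For context, plain Enum members are not orderable with "<", which is why the sort key has to go through .value. A tiny self-contained sketch (the numeric values here are made up):

from collections import namedtuple
from enum import Enum


class Severity(Enum):  # illustrative values only
    HIGH = 3
    LOW = 2
    NO_SEVERITY = 1


Message = namedtuple("Message", "text severity")
messages = [Message("a", Severity.NO_SEVERITY), Message("b", Severity.HIGH)]

# plain Enum members do not support "<" comparisons, so sort on .value
ordered = sorted(messages, key=lambda x: x.severity.value, reverse=True)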

@CodyCBakerPhD
Collaborator

Right, but the actual values for the severity comparison changed:

https://github.com/NeurodataWithoutBorders/nwbinspector/pull/89/files#diff-cfa07db9e698a249418609afea3dcd87686a8f03b6217ca57e75e6f50579d6c6L28-R29

This means a dataset compression below 1 GB would now get ranked below missing metadata, for instance.

@bendichter
Contributor Author

@CodyCBakerPhD should "low" be above "no severity"? I find that pretty counter-intuitive.

@CodyCBakerPhD
Collaborator

CodyCBakerPhD commented Mar 7, 2022

The way we've treated it thus far: most of the time, severity is NO_SEVERITY because there is no basis for comparison, as is the case with missing metadata.

Basically there are two intentions: (a) the act of specifying severity at all means you care about the ordering of this check relative to the average check that doesn't have ranked severity; (b) for two instances of the same check raised on the same exact file but with clearly different magnitudes and priorities for fixing them (e.g., a 15 GB dataset not compressed vs. a 0.8 GB dataset not compressed), the higher-severity one is ranked higher.

Point (b) makes the logical use of severity comparisons specific to an individual check, irrespective of the severities specified by other checks of the same importance level.

Whereas by setting the baseline between HIGH and LOW, when writing a new check we would have to think about where it fits into the global priority order of checks within that importance level, which becomes a more difficult task as the number of checks grows.
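A rough sketch of intention (b), with a made-up check and threshold (not the library's actual check_small_dataset_compression implementation): severity only ranks instances of the same check against each other.

def check_dataset_compression(dataset_size_gb, compressed):
    """Hypothetical check: severity scales with how large the uncompressed data is."""
    if compressed:
        return None
    severity = "HIGH" if dataset_size_gb >= 1.0 else "LOW"
    return dict(message="Consider enabling compression when writing a dataset.", severity=severity)


print(check_dataset_compression(15.0, compressed=False))  # severity: HIGH
print(check_dataset_compression(0.8, compressed=False))   # severity: LOW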

@bendichter
Contributor Author

@CodyCBakerPhD all severity values have numerical values, which means they are being compared to each other and there really isn't a "no severity." Maybe the best would be to have severity = low, medium (default) or high

@bendichter bendichter marked this pull request as ready for review March 7, 2022 18:46
@codecov-commenter

Codecov Report

Merging #89 (6c12489) into dev (47698e8) will decrease coverage by 6.39%.
The diff coverage is 73.33%.

Impacted file tree graph

@@            Coverage Diff             @@
##              dev      #89      +/-   ##
==========================================
- Coverage   94.31%   87.91%   -6.40%     
==========================================
  Files          12       12              
  Lines         422      513      +91     
==========================================
+ Hits          398      451      +53     
- Misses         24       62      +38     
Flag Coverage Δ
unittests 87.91% <73.33%> (-6.40%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
nwbinspector/nwbinspector.py 71.61% <59.79%> (-11.57%) ⬇️
nwbinspector/utils.py 96.15% <93.33%> (-3.85%) ⬇️
nwbinspector/checks/tables.py 100.00% <100.00%> (ø)
nwbinspector/inspector_tools.py 75.47% <100.00%> (-20.76%) ⬇️
nwbinspector/register_checks.py 98.86% <100.00%> (+0.04%) ⬆️

@CodyCBakerPhD
Collaborator

all severity values have numerical values, which means they are being compared to each other and there really isn't a "no severity." Maybe the best would be to have severity = low, medium (default) or high

That is indeed effectively how it is; the point of NO_SEVERITY as the default, however, is that the majority of checks do not specify any severity, and not specifying at all (the default) should still be the lowest case.

It's extra work during check development and review to have to think about whether the output of a check should be manually ranked lower than the average of all other checks of that importance, which makes it harder to add new checks, especially for really basic stuff like missing metadata. The only time severity is intended to be specified and thought about is when there's some quantitative basis for comparison, which is mostly the case for data-related checks.

If it really bothers you, the simplest thing would be to just remove NO_SEVERITY altogether and set LOW to be the default for everything.
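A minimal sketch of what that simplification could look like (the decorator stub and attribute names here are illustrative, not the merged code):

from enum import Enum


class Severity(Enum):  # NO_SEVERITY dropped; LOW becomes the implicit default
    HIGH = 2
    LOW = 1


def register_check(importance, severity=Severity.LOW):
    """Hypothetical decorator stub: checks that never think about severity just inherit the LOW default."""
    def decorator(check_function):
        check_function.importance = importance
        check_function.severity = severity
        return check_function
    return decorator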

@bendichter
Contributor Author

If it really bothers you, the simplest thing would be to just remove NO_SEVERITY altogether and set LOW to be the default for everything.

I'm fine with that

@CodyCBakerPhD CodyCBakerPhD marked this pull request as draft March 8, 2022 17:08
@CodyCBakerPhD
Collaborator

Putting this in draft while I work on splitting it into separate PRs for easier review and integration.

Comment on lines +38 to +49
def organize_messages_by_file(messages):
    files = sorted(set(message.file for message in messages))
    out = {}
    for file in files:
        out[file] = {}
        file_messages = list(filter(lambda x: x.file == file, messages))
        for importance in sorted(set(message.importance for message in file_messages), key=lambda x: -x.value):
            out[file][importance] = sorted(
                filter(lambda x: x.importance == importance, file_messages),
                key=lambda x: -x.severity.value,
            )
    return out
Collaborator

After testing this, one thing to note is that the time complexity of applying a filter that iterates over the whole list checking for a potentially small number of boolean matches is sub-optimal. Here's a self-contained comparison that can be applied to any directory with a decent number of files in it:

"""Simple script for testing performance of two iteration methods for NWBInspector."""
from time import time
from pathlib import Path

import numpy as np

files = [x.name for x in Path(".").iterdir()]
nfiles = len(files)
messages = [
    dict(msg="foo", file=files[np.random.randint(low=0, high=nfiles)])
    for _ in range(nfiles * 50)  # average of 50 messages per file; feel free to change
]

start = time()
out1 = {file: list(filter(lambda x: x["file"] == file, messages)) for file in files}
end = time()
time1 = end - start

start = time()
out2 = {file: list() for file in files}
for message in messages:
    out2[message["file"]] = message
end = time()
time2 = end - start

For my system at least, the binning method (proposed option 2) is 10x faster.

Collaborator

As such, I was going to refactor this type of functionality like

from typing import List

from natsort import natsorted

# assuming InspectorMessage and Importance live in register_checks; adjust the import to the package layout
from nwbinspector.register_checks import Importance, InspectorMessage


def organize_messages_by_file(messages: List[InspectorMessage]):
    """Order InspectorMessages by file name then importance."""
    files = natsorted(set(message.file for message in messages))
    importance_levels = [Importance.CRITICAL, Importance.BEST_PRACTICE_VIOLATION, Importance.BEST_PRACTICE_SUGGESTION]
    out = {file: {importance: list() for importance in importance_levels} for file in files}  # always add the small number of importance bins
    for message in messages:  # linear time, single pass through message list
        out[message.file][message.importance].append(message)
    for file in files:
        for importance in importance_levels:  # Worst case complexity: (number of files) x 3
            if out[file][importance]:
                out[file][importance] = sorted(out[file][importance], key=lambda x: -x.severity.value)
            else:
                out[file].pop(importance)
    return out

Basically, the more pre-sorting you can apply to the output structure ahead of actual iteration, the better.

@bendichter What do you think? Any other suggestions for improvement?

Contributor Author

@bendichter bendichter Mar 8, 2022

@CodyCBakerPhD how about this as a general solution

def hier_org(objects, levels):
    unique_attrs = sorted(set(getattr(object, levels[0]) for object in objects))
    if len(levels) > 1:
        return {attr: hier_org([obj for obj in objects if getattr(obj, levels[0]) == attr], levels[1:])
            for attr in unique_attrs}
    else:
        return {attr: [obj for obj in objects if getattr(obj, levels[0]) == attr] for attr in unique_attrs}
    

# test
from collections import namedtuple

Obj = namedtuple("Obj", "a b c")

objects = []
for a in range(3):
    for b in range(3):
        for c in range(3):
            objects.append(Obj(a=a,b=b,c=c))
            
hier_org(objects, ["a", "b", "c"])
{0: {0: {0: [Obj(a=0, b=0, c=0)],
   1: [Obj(a=0, b=0, c=1)],
   2: [Obj(a=0, b=0, c=2)]},
  1: {0: [Obj(a=0, b=1, c=0)],
   1: [Obj(a=0, b=1, c=1)],
   2: [Obj(a=0, b=1, c=2)]},
  2: {0: [Obj(a=0, b=2, c=0)],
   1: [Obj(a=0, b=2, c=1)],
   2: [Obj(a=0, b=2, c=2)]}},
 1: {0: {0: [Obj(a=1, b=0, c=0)],
   1: [Obj(a=1, b=0, c=1)],
   2: [Obj(a=1, b=0, c=2)]},
  1: {0: [Obj(a=1, b=1, c=0)],
   1: [Obj(a=1, b=1, c=1)],
   2: [Obj(a=1, b=1, c=2)]},
  2: {0: [Obj(a=1, b=2, c=0)],
   1: [Obj(a=1, b=2, c=1)],
   2: [Obj(a=1, b=2, c=2)]}},
 2: {0: {0: [Obj(a=2, b=0, c=0)],
   1: [Obj(a=2, b=0, c=1)],
   2: [Obj(a=2, b=0, c=2)]},
  1: {0: [Obj(a=2, b=1, c=0)],
   1: [Obj(a=2, b=1, c=1)],
   2: [Obj(a=2, b=1, c=2)]},
  2: {0: [Obj(a=2, b=2, c=0)],
   1: [Obj(a=2, b=2, c=1)],
   2: [Obj(a=2, b=2, c=2)]}}}

Collaborator

That looks like it should cover my next question if implemented correctly, which was to have a single organize_messages function that had an argument that could dynamically specify the order in which we want things organized (["file", "importance"], or ["importance", "neurodata_type", "importance"], etc.)

I'll play around with this a bit and get back to you!

Collaborator

@bendichter That code had some great notions of recurrence logic in it, but it still suffered from the same nested iteration as before due to the line

        return {attr: hier_org([obj for obj in objects if getattr(obj, levels[0]) == attr], levels[1:])
            for attr in unique_attrs}

which still iterates over every obj for each unique_attr to extract the subset corresponding to a particular attr. A time complexity analysis of this shows a worst-case time that scales with (number of objects) × (number of unique attribute values) at each level of the recursion, and which tends to be dominated by its two largest per-level terms.

I took this logic and made two versions seen on this commit on another branch: 21a30aa

First version 'fancy' is pretty much that same exact code but built on our objects +/- some Enum logic for sorting.

Second version 'efficient' is my binned approach that separates the dictionary construction from the data filtering, though it needs some helper functions for indexing arbitrarily nested dictionaries. This decoupling improves the time complexity to roughly a single linear pass through the message list plus a small per-bin overhead (on the order of the number of files times the number of importance levels).
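For illustration, the kind of helper that approach needs for indexing arbitrarily nested dictionaries might look like this (a sketch, not the code from commit 21a30aa):

def nested_append(out, keys, value):
    """Walk (and create) nested dicts along `keys`, then append `value` to the list at the innermost level."""
    inner = out
    for key in keys[:-1]:
        inner = inner.setdefault(key, dict())
    inner.setdefault(keys[-1], list()).append(value)


# usage sketch: bin each message exactly once, keyed by the requested levels, e.g.
#     for message in messages:
#         nested_append(out, [getattr(message, level) for level in levels], message)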

To test performance, I downloaded DANDISet 000004 and set up the following

from pathlib import Path

from nwbinspector import inspect_all  # assuming the top-level import; adjust to match the package layout

messages = list()
for folder in Path(".../000004").iterdir():
    messages.extend(inspect_all(path=folder))

which gives us a list of 611 InspectorMessages. Now

levels = ["file", "importance"]
 # just for testing, technically any order and combination of attributes for an InspectorMessage can be used

start = time()
res1 = fancy_organize_messages(messages=messages, levels=levels)
end = time()
time1 = end - start

start = time()
res2 = efficient_organize_messages(messages=messages, levels=levels)
end = time()
time2 = end - start

As before, the 'efficient' method is clocking much faster (about 6x on real data), so I'll proceed with adding tests and integrating that method.

Overall a very cool advancement that allows arbitrary user choice of output structure, if they want any.

Contributor Author

@bendichter bendichter Mar 8, 2022

How about this: I create an objects_by_attr dict so that the objects are only looped over once per recursion.

from collections import defaultdict


def hier_org(objects, levels, reverse=None):
    if reverse is None:
        reverse = [False] * len(levels)
    unique_attrs = sorted(set(getattr(obj, levels[0]) for obj in objects))
    if reverse[0]:
        unique_attrs = unique_attrs[::-1]
    # bin the objects by the current attribute so they are only looped over once per recursion
    objects_by_attr = defaultdict(list)
    for obj in objects:
        objects_by_attr[getattr(obj, levels[0])].append(obj)
    if len(levels) > 1:
        return {attr: hier_org(objects_by_attr[attr], levels[1:], reverse[1:]) for attr in unique_attrs}
    else:
        # keep the full list at the leaf; multiple objects can share the same attribute value
        return {attr: objects_by_attr[attr] for attr in unique_attrs}


# test
from collections import namedtuple

Obj = namedtuple("Obj", "a b c")

objects = []
for a in range(3):
    for b in range(3):
        for c in range(3):
            objects.append(Obj(a=a, b=b, c=c))

hier_org(objects, ["a", "b", "c"], [True, False, True])

@CodyCBakerPhD
Collaborator

All relevant code changes proposed by the PR have been merged along separate channels. Closing this now.

@CodyCBakerPhD CodyCBakerPhD deleted the flatten_output branch March 21, 2022 15:05