Flatten output #89
Conversation
There is a usability issue with the current way to generate reports. What if the user has 100+ files? Maybe on the first 50 or so there are no critical errors, but there is a critical error on file 51? This would be difficult for the user to spot, because they'd probably scroll through the first few files and assume that they are all the same. To address this, we could display the results in a different order, e.g.:
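For illustration, a report ordered by importance first could look something like the sketch below; the check names and file names are purely made up, not output from this PR:

CRITICAL
    check_data_orientation
        sub-37.nwb: ...
        sub-51.nwb: ...
BEST PRACTICE VIOLATION
    check_missing_unit
        sub-02.nwb: ...
        sub-14.nwb: ...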
This would gather critical errors across all files to the top, and would aggregate them by check. To accomplish this, I reorganized the code to return a flat list of messages that includes "file", and deferred the organization step, since these two printing methods require different organization. Once that was done, it made sense to …
@bendichter Yeah, looking over this now. Mostly looks good; was there a reason the severity levels had to change?
I wanted to be able to sort messages by the value of severity:
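A minimal sketch of a severity scale whose members can be ordered by their numeric value; the member names and numbers here are assumptions for illustration, not necessarily the values used in this PR:

from collections import namedtuple
from enum import Enum

class Severity(Enum):
    # Hypothetical numeric values; the point is only that members can be compared via .value.
    LOW = 1
    MEDIUM = 2
    HIGH = 3

Message = namedtuple("Message", ["text", "severity"])  # illustrative stand-in, not the real InspectorMessage
messages = [Message("small uncompressed dataset", Severity.LOW), Message("15 GB uncompressed dataset", Severity.HIGH)]
# Highest severity first:
print(sorted(messages, key=lambda m: -m.severity.value))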
Right, but the actual values for the severity comparison changed, which means a dataset compression below 1 GB would now get ranked below missing metadata, for instance.
@CodyCBakerPhD should "low" be above "no severity"? I find that pretty counter-intuitive.
The way we've treated it thus far: most of the time, severity is not specified at all. Basically there are two intentions: (a) the act of specifying severity at all means you care about the ordering of this check relative to the average check that doesn't have ranked severity, and (b) for two of the same checks raised on the same exact file but with clearly different magnitude and priority in fixing them (e.g., a 15 GB dataset not compressed vs. a 0.8 GB dataset not compressed), the higher severity one is ranked higher. Point (b) makes the logical use of severity comparisons specific to an individual check, irrespective of the severities specified by other checks of the same importance level. Whereas if the baseline were set between HIGH and LOW, then when writing a new check we would have to think about where it fits into the global priority order of checks within that importance, which becomes a more difficult task as the number of checks grows.
@CodyCBakerPhD all severity levels have numerical values, which means they are being compared to each other, so there really isn't a "no severity." Maybe the best would be to have severity = low, medium (default), or high.
Codecov Report
@@ Coverage Diff @@
## dev #89 +/- ##
==========================================
- Coverage 94.31% 87.91% -6.40%
==========================================
Files 12 12
Lines 422 513 +91
==========================================
+ Hits 398 451 +53
- Misses 24 62 +38
That is indeed effectively how it is; the point of the default is that it's extra work during check development and review to have to think about whether the output of a check should be manually ranked lower than the average of all other checks of that importance, which makes it harder to add new checks, especially for really basic stuff like missing metadata. The only time severity actually comes into play is when a check explicitly sets it. If it really bothers you, the simplest thing is we can just remove it.
I'm fine with that.
Putting this in draft while I work on splitting it into separate PRs for easier review and integration.
def organize_messages_by_file(messages):
    """Group messages by file, then by importance (highest first), sorting each group by severity (highest first)."""
    files = sorted(set(message.file for message in messages))
    out = {}
    for file in files:
        out[file] = {}
        file_messages = list(filter(lambda x: x.file == file, messages))
        for importance in sorted(set(message.importance for message in file_messages), key=lambda x: -x.value):
            out[file][importance] = sorted(
                filter(lambda x: x.importance == importance, file_messages),
                key=lambda x: -x.severity.value,
            )
    return out
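As a toy usage sketch, assuming the function above is defined: the enums and the Msg namedtuple below are invented stand-ins for the real Importance/Severity/InspectorMessage types, just to show the resulting file -> importance -> [messages] shape:

from collections import namedtuple
from enum import Enum

class Importance(Enum):  # stand-in for the real Importance enum
    CRITICAL = 2
    BEST_PRACTICE_VIOLATION = 1

class Severity(Enum):  # stand-in for the real Severity enum
    LOW = 1
    HIGH = 3

Msg = namedtuple("Msg", ["file", "importance", "severity"])
msgs = [
    Msg("a.nwb", Importance.CRITICAL, Severity.LOW),
    Msg("a.nwb", Importance.CRITICAL, Severity.HIGH),
    Msg("b.nwb", Importance.BEST_PRACTICE_VIOLATION, Severity.LOW),
]
# Produces {"a.nwb": {CRITICAL: [HIGH message, LOW message]}, "b.nwb": {BEST_PRACTICE_VIOLATION: [...]}}
print(organize_messages_by_file(msgs))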
After testing this, one thing to note is that the time complexity of applying a filter that iterates the entire list checking for a potentially small number of boolean matches is sub-optimal. Here's a self-contained comparison that can be applied to any directory with a decent number of files in it:
"""Simple script for testing performance of two iteration methods for NWBInspector."""
from time import time
from pathlib import Path
import numpy as np
files = [x.name for x in Path(".").iterdir()]
nfiles = len(files)
messages = [
dict(msg="foo", file=files[np.random.randint(low=0, high=nfiles)])
for _ in range(nfiles * 50) # average of 50 messages per file; feel free to change
]
start = time()
out1 = {file: list(filter(lambda x: x["file"] == file, messages)) for file in files}
end = time()
time1 = end - start
start = time()
out2 = {file: list() for file in files}
for message in messages:
out2[message["file"]] = message
end = time()
time2 = end - start
For my system at least, the binning method (proposed option 2) is 10x faster.
As such, I was going to refactor this type of functionality like so:
def organize_messages_by_file(messages: List[InspectorMessage]):
    """Order InspectorMessages by file name then importance."""
    files = natsorted(set(message.file for message in messages))
    importance_levels = [Importance.CRITICAL, Importance.BEST_PRACTICE_VIOLATION, Importance.BEST_PRACTICE_SUGGESTION]
    out = {file: {importance: list() for importance in importance_levels} for file in files}  # always add the small number of importance bins
    for message in messages:  # linear time, single pass through message list
        out[message.file][message.importance].append(message)
    for file in files:
        for importance in importance_levels:  # Worst case complexity: (number of files) x 3
            if out[file][importance]:
                out[file][importance] = sorted(out[file][importance], key=lambda x: -x.severity.value)
            else:
                out[file].pop(importance)
    return out
Basically, the more pre-sorting you can apply to the output structure ahead of actual iteration, the better.
@bendichter What do you think? Any other suggestions for improvement?
@CodyCBakerPhD how about this as a general solution
def hier_org(objects, levels):
    unique_attrs = sorted(set(getattr(object, levels[0]) for object in objects))
    if len(levels) > 1:
        return {attr: hier_org([obj for obj in objects if getattr(obj, levels[0]) == attr], levels[1:])
                for attr in unique_attrs}
    else:
        return {attr: [obj for obj in objects if getattr(obj, levels[0]) == attr] for attr in unique_attrs}

# test
from collections import namedtuple

Obj = namedtuple("Obj", "a b c")

objects = []
for a in range(3):
    for b in range(3):
        for c in range(3):
            objects.append(Obj(a=a, b=b, c=c))

hier_org(objects, ["a", "b", "c"])
{0: {0: {0: [Obj(a=0, b=0, c=0)],
1: [Obj(a=0, b=0, c=1)],
2: [Obj(a=0, b=0, c=2)]},
1: {0: [Obj(a=0, b=1, c=0)],
1: [Obj(a=0, b=1, c=1)],
2: [Obj(a=0, b=1, c=2)]},
2: {0: [Obj(a=0, b=2, c=0)],
1: [Obj(a=0, b=2, c=1)],
2: [Obj(a=0, b=2, c=2)]}},
1: {0: {0: [Obj(a=1, b=0, c=0)],
1: [Obj(a=1, b=0, c=1)],
2: [Obj(a=1, b=0, c=2)]},
1: {0: [Obj(a=1, b=1, c=0)],
1: [Obj(a=1, b=1, c=1)],
2: [Obj(a=1, b=1, c=2)]},
2: {0: [Obj(a=1, b=2, c=0)],
1: [Obj(a=1, b=2, c=1)],
2: [Obj(a=1, b=2, c=2)]}},
2: {0: {0: [Obj(a=2, b=0, c=0)],
1: [Obj(a=2, b=0, c=1)],
2: [Obj(a=2, b=0, c=2)]},
1: {0: [Obj(a=2, b=1, c=0)],
1: [Obj(a=2, b=1, c=1)],
2: [Obj(a=2, b=1, c=2)]},
2: {0: [Obj(a=2, b=2, c=0)],
1: [Obj(a=2, b=2, c=1)],
2: [Obj(a=2, b=2, c=2)]}}}
That looks like it should cover my next question if implemented correctly, which was to have a single organize_messages function with an argument that could dynamically specify the order in which we want things organized (["file", "importance"], or ["importance", "neurodata_type", "importance"], etc.).
I'll play around with this a bit and get back to you!
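A minimal sketch of what that single entry point might look like, assuming it simply delegates to a recursive organizer like the hier_org above; the wrapper name and default are illustrative, not the actual implementation:

def organize_messages(messages, levels=None):
    """Organize messages hierarchically by the attribute names given in `levels`."""
    # Hypothetical wrapper around the recursive helper proposed above.
    if levels is None:
        levels = ["file", "importance"]
    return hier_org(messages, levels)

# e.g. organize_messages(messages, levels=["file", "importance"])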
@bendichter That code had some great notions of recursion logic in it, but it still suffered from the same nested iteration as before due to the line

return {attr: hier_org([obj for obj in objects if getattr(obj, levels[0]) == attr], levels[1:])
        for attr in unique_attrs}

which still iterates over every obj for each unique_attr to extract the subset that corresponds to a particular attr.
A time complexity analysis of this shows a worst-case time on the order of the number of objects times the number of unique attribute values at each level, which will tend to be the two largest values of the problem size (here, the number of messages and the number of files).
I took this logic and made two versions seen on this commit on another branch: 21a30aa
First version 'fancy' is pretty much that same exact code but built on our objects +/- some Enum logic for sorting.
Second version 'efficient' is my binned approach that separates the dictionary construction from the data filtering, though it needs some helper functions for indexing arbitrarily nested dictionaries. This decoupling approach improves the time complexity to roughly linear in the number of messages.
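Those helper functions aren't shown here; a minimal sketch of what indexing and assigning into an arbitrarily nested dict could look like, with hypothetical names and signatures rather than the actual code on that branch:

def get_nested(d, keys):
    # Walk down a nested dict following the sequence of keys.
    for key in keys:
        d = d[key]
    return d

def set_nested(d, keys, value):
    # Create intermediate dicts as needed, then assign the value at the final key.
    for key in keys[:-1]:
        d = d.setdefault(key, {})
    d[keys[-1]] = value

# Example:
out = {}
set_nested(out, ["a.nwb", "CRITICAL"], ["message1"])
print(get_nested(out, ["a.nwb", "CRITICAL"]))  # ['message1']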
To test performance, I downloaded DANDISet 000004 and set up the following:
messages = list()
for folder in Path(".../000004").iterdir():
    messages.extend(inspect_all(path=folder))
which gives us a list of 611 InspectorMessages. Now:
levels = ["file", "importance"]
# just for testing, technically any order and combination of attributes for an InspectorMessage can be used
start = time()
res1 = fancy_organize_messages(messages=messages, levels=levels)
end = time()
time1 = end - start
start = time()
res2 = efficient_organize_messages(messages=messages, levels=levels)
end = time()
time2 = end - start
As before, the 'efficient' method is clocking much faster (about 6x in the case of real data), so I'll proceed with adding tests and integrating that method.
Overall very cool advancement that allows arbitrary user choice of output structure, if they want any.
How about this: I create an objects_by_attr dict so that the objects are only looped over once per recursion:
from collections import defaultdict

def hier_org(objects, levels, reverse=None):
    if reverse is None:
        reverse = [False] * len(levels)
    unique_attrs = sorted(set(getattr(object, levels[0]) for object in objects))
    if reverse[0]:
        unique_attrs = unique_attrs[::-1]
    objects_by_attr = defaultdict(list)
    for obj in objects:  # single pass to bin objects by the current level's attribute
        objects_by_attr[getattr(obj, levels[0])].append(obj)
    if len(levels) > 1:
        return {attr: hier_org(objects_by_attr[attr], levels[1:], reverse[1:]) for attr in unique_attrs}
    else:
        return {attr: objects_by_attr[attr] for attr in unique_attrs}

# test
from collections import namedtuple

Obj = namedtuple("Obj", "a b c")

objects = []
for a in range(3):
    for b in range(3):
        for c in range(3):
            objects.append(Obj(a=a, b=b, c=c))

hier_org(objects, ["a", "b", "c"], [True, False, True])
All relevant code changes proposed by the PR have been merged along separate channels. Closing this now.
Motivation
What was the reasoning behind this change? Please explain the changes briefly.

How to test the behavior?

Checklist
- Is the code formatted with black? If not, simply pip install black and run black . inside your fork.
- Does the PR reference the issue with fix #XX, where XX is the issue number?