Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add config file option #86

Merged
merged 18 commits into from
Mar 6, 2022
Merged

add config file option #86

merged 18 commits into from
Mar 6, 2022

Conversation

bendichter
Copy link
Contributor

fix #37

restructure available_checks
add config json schema
add example config file
add configure_checks function and corresponding test

TODO: expose to command line

add config json schema
add example config file
add configure_checks function and corresponding test

TODO: expose to command line
@bendichter
Copy link
Contributor Author

@CodyCBakerPhD I started to build out the feature for custom configuration file that changes the severity of checks. In doing so, I simplified the available_checks dictionary from:

{Importance: {nd_type: [checks]}}

to

{Importance: [checks]}

the nd_type is pulled from the attribute of the check. This is a simplification without loss of function for the library, and it makes the config feature easier to implement.

@CodyCBakerPhD
Copy link
Collaborator

@bendichter 👍 Yup, looking over it now.

The main reason for pre-sorting by neurodata type was to save time complexity from having to manually inspect it during each step of the outer iteration, but I can re-analyze it again and see if there's an even better way to structure that.

@CodyCBakerPhD
Copy link
Collaborator

CodyCBakerPhD commented Mar 2, 2022

As I look at the primary iteration usage of ignore and select, would these make sense to include somehow as config options used to shorten the full available_checks? Or even possibly importance_threshold?

That way, we could move those nested if checks outside of the iteration and instead pass a shortened collection of checks into inspect_nwb, reducing the number of operations that get multiplied. And also making the input arguments to inspect_nwb much simpler by only expecting some pre-configured specification of checks to run on the NWBFile.

Extreme example: we only want to select a couple checks. Currently the code still iterates over the entire registry of check functions for each object in the NWBFile.

                    for nwbfile_object in nwbfile.objects.values():    <---- could have a large number of objects
                        for check_function in check_functions:    <---- could have a large number of functions
                            if issubclass(type(nwbfile_object), check_function.neurodata_type):
                                if ignore is not None and check_function.__name__ in ignore: <-- line gets hit every time
                                    continue
                                if select is not None and check_function.__name__ not in select:
                                    continue
                                output = check_function(nwbfile_object)    <---- only reaches here a couple of times

@CodyCBakerPhD CodyCBakerPhD added the category: enhancement improvements of code or code behavior label Mar 2, 2022
@CodyCBakerPhD
Copy link
Collaborator

Oh, I also think it would help the design of the configuration approach (and probably other things too...) if the initial registry availabe_checks was a dictionary whose keys are the check function names and whose values are the check functions themselves. An advantage of this would be the ability to take those string names from the JSON and directly index like available_checks[check_function_name].

With this approach, the default configuration would simply be the available_checks structured into what they are now (an OrderedDict of Importance with lists of check functions as values at each level), but any other config would follow the same structure +/- manipulation of importance levels, inclusion or exclusion of particular check names, etc.

What do you think?

Co-authored-by: Cody Baker <51133164+CodyCBakerPhD@users.noreply.github.com>
@bendichter
Copy link
Contributor Author

@CodyCBakerPhD yes, I think that could make things simpler

@bendichter
Copy link
Contributor Author

So the config logic would simply be something like:

checks = copy.copy(available_checks)
for importance, func_names in config.items():
    func = checks[func]
    func.importance = importance
    check[func] = func

@CodyCBakerPhD
Copy link
Collaborator

CodyCBakerPhD commented Mar 3, 2022

So the config logic would simply be something like:

Yeah, pretty much. Looks much cleaner.

@bendichter
Copy link
Contributor Author

@CodyCBakerPhD I went even further and just made available_checks a list of checks

@CodyCBakerPhD
Copy link
Collaborator

I went even further and just made available_checks a list of checks

@bendichter Yup, this all looks really good now.

@CodyCBakerPhD
Copy link
Collaborator

Also, I just remembered Dorota pointed out that it may be for the best to only allow elevation of Importance levels via the config to prevent people from being able to forcibly downgrade important things; do you agree with that?

Also also, Yarik mentioned having the config be able to specify SKIP type behavior on a per-NWBFile level when this config is meant to be run on a directory. Any ideas for how that could go into this schema?

@bendichter
Copy link
Contributor Author

Also also, Yarik mentioned having the config be able to specify SKIP type behavior on a per-NWBFile level when this config is meant to be run on a directory. Any ideas for how that could go into this schema?

I don't know what you mean

@CodyCBakerPhD
Copy link
Collaborator

CodyCBakerPhD commented Mar 3, 2022

From #37

It is needed since some of such ignores are "per dataset" and thus it should be possible to specify what to ignore in a specific dataset (or dandiset) in DANDI land, instead of relying on user to know what command line options to provide.

Unless he means to apply the SKIP rule over all files in that directory and not NWBFile-specific.

Example: /000003/file1.nwb has a check_data_orientation violation that is later confirmed to be OK so check_data_orientation is added to the SKIP field of the config used when running on the entire dataset 000003. But /000003/file2.nwb also had a check_data_orientation violation that was confirmed to be a user mistake.

@bendichter
Copy link
Contributor Author

@CodyCBakerPhD OK, I understand. Skipping checks for specific files. I'd say if you want to do that you can run your own script and do something like

for file in glob.glob(os.path.join(path, '*.nwb'):
    if file == 'bad_file.nwb':
        ignore = ["check_data_orientation"]
    else:
        ignore = None
    inspect_nwb(file, ignore=ignore)

Co-authored-by: Cody Baker <51133164+CodyCBakerPhD@users.noreply.github.com>
@bendichter bendichter marked this pull request as ready for review March 4, 2022 17:00
@CodyCBakerPhD
Copy link
Collaborator

@bendichter Two minor suggestions for schema-pertaining items.

Otherwise LGTM

@CodyCBakerPhD CodyCBakerPhD self-requested a review March 4, 2022 17:23
@codecov-commenter
Copy link

Codecov Report

Merging #86 (e311f69) into dev (47698e8) will decrease coverage by 0.75%.
The diff coverage is 90.36%.

Impacted file tree graph

@@            Coverage Diff             @@
##              dev      #86      +/-   ##
==========================================
- Coverage   94.31%   93.55%   -0.76%     
==========================================
  Files          12       12              
  Lines         422      481      +59     
==========================================
+ Hits          398      450      +52     
- Misses         24       31       +7     
Flag Coverage Δ
unittests 93.55% <90.36%> (-0.76%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
nwbinspector/nwbinspector.py 80.64% <80.00%> (-2.54%) ⬇️
nwbinspector/utils.py 96.15% <93.33%> (-3.85%) ⬇️
nwbinspector/checks/tables.py 100.00% <100.00%> (ø)
nwbinspector/register_checks.py 98.85% <100.00%> (+0.02%) ⬆️

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
category: enhancement improvements of code or code behavior
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add "config file" support
3 participants