
forcing the validation of datasets prior to launch limits the use of apps on "far" datasets #19

Closed
gkiar opened this issue Jan 6, 2018 · 12 comments


gkiar (Contributor) commented Jan 6, 2018

Hi team!

My use case is the following: I have several datasets, all BIDS-validated, stored "far away" (i.e. not where I wish to do processing). When I launch the BIDS Apps I wish to run on them, such as this one, the validator forces a crawl of the dataset, which means many queries and file accesses to my far-away dataset. This results in a) a huge amount of unnecessary data and metadata download (since I'd launch a job for a single participant on isolated nodes, and each would perform a full dataset validation), and b) latency while I wait for this download, crawl, and validation to take place.

My proposal is that if developers of BIDS Apps wish to have the validator run prior to their tool, it be enabled via a --validate option on the command line, so that the default behaviour is more cloud/distributed-computing friendly. I think this is also a fair redistribution of labour, since it puts the onus of dataset validation on data-storage platforms rather than on independent tasks (which could feasibly be running different versions of the validator, etc.).
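A minimal sketch of what this opt-in flag could look like, assuming an argparse-based Python app that shells out to the bids-validator CLI (the function names here are illustrative, not from any existing app):

```python
import argparse
import subprocess

def build_parser() -> argparse.ArgumentParser:
    """CLI for a hypothetical BIDS app where validation is opt-in."""
    parser = argparse.ArgumentParser(description="example BIDS app")
    parser.add_argument("bids_dir", help="root directory of the BIDS dataset")
    parser.add_argument(
        "--validate",
        action="store_true",
        help="crawl and validate the dataset before running (off by default)",
    )
    return parser

def maybe_validate(args: argparse.Namespace) -> None:
    """Only touch the (possibly remote) dataset if the user asked for it."""
    if not args.validate:
        return
    # bids-validator exits non-zero on an invalid dataset;
    # check=True makes that abort the app with CalledProcessError.
    subprocess.run(["bids-validator", args.bids_dir], check=True)
```

With this default, a job launched per participant on an isolated node never crawls the remote dataset unless --validate is passed explicitly.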

Happy to hear the opinions of others, or to explain in more detail exactly how my data is stored and accessed (I omitted these details since this issue applies, with varying impact, to other types of network or symbolic mounts; in my particular case it is prohibitive and I cannot run apps which use the validator).

Cheers,
G

chrisgorgo (Contributor) commented

This has been brought up and implemented in the context of the FreeSurfer app (see bids-apps/freesurfer#37). I think it would make sense to add it to other apps.

chrisgorgo (Contributor) commented

BTW, most BIDS Apps devs are not subscribed to notifications on this repository, but they are on https://groups.google.com/forum/#!forum/bids-apps-dev.

gkiar (Contributor, Author) commented Jan 8, 2018

Hey @chrisfilo, I don't think the solution you linked to in FreeSurfer is what I'm referring to. What I'm suggesting is that no validation occurs at task launch, rather than some minimal set. The reason for this is what I mentioned above: validator versions may differ, and it makes more sense to validate datasets on the data-management side than in every independent task.

I think my proposal is much simpler, more scalable, and very easy to implement (as shown in #20).

gkiar (Contributor, Author) commented Jan 8, 2018

Re the mailing list: there are too many emails and I can't keep up with them, but I'll send a note to the thread and encourage people to contribute here if they have feelings.

chrisgorgo (Contributor) commented

The FreeSurfer PR does optionally allow skipping validation entirely; see https://github.com/BIDS-Apps/freesurfer/pull/37/files#diff-04c6e90faac2675aa89e2176d2eec7d8R40.

It's just the other way around: validation is turned on by default.

I would not assume that there is a "data management side" in the context of BIDS and BIDS Apps. A lot of people will just use file systems. This is why I would be keener on having the validator enabled by default.

PS We all have feelings ;)

gkiar (Contributor, Author) commented Jan 8, 2018

Ah, thanks for pointing me to the line; I missed it amid all the other discussion in that issue about exclusions from the validation process! I'm writing a note to the email list now and linking this issue. I have no problem updating my PR to make validation the default behaviour, if that's what you and others prefer.

P.s. <3 your PS :)

raamana commented Jan 8, 2018

I vote to turn validation ON by default, and to implement features across onboarding (job launch) and processing platforms (clusters and the apps themselves) that let individual developers disable it if they choose.

I would like apps to fail fast when the dataset has issues (i.e. not even schedule or launch jobs that are expected to fail).

Perhaps there could be different levels of validation, e.g. fail on error, warn on error, or ignore everything. Alternatively, validation could return a code and the developer could take appropriate action (e.g. by catching exceptions).
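A rough sketch of what such levels could look like. The enum and the exit-code convention here are assumptions for illustration, not an existing validator API; the only thing assumed of the validator is that it exits non-zero on an invalid dataset:

```python
from enum import IntEnum

class Strictness(IntEnum):
    """Hypothetical validation levels, as discussed above."""
    IGNORE = 0  # never abort; run regardless of validation result
    WARN = 1    # report problems but continue
    STRICT = 2  # fail fast on any validator error

def should_abort(validator_exit_code: int, level: Strictness) -> bool:
    """Decide whether the app should abort, given the validator's
    exit code and the configured strictness level."""
    if level is Strictness.STRICT:
        return validator_exit_code != 0
    if level is Strictness.WARN and validator_exit_code != 0:
        print("warning: dataset failed validation, continuing anyway")
    return False
```

An app (or a job-submission wrapper) could then check should_abort() before scheduling any work, getting the fail-fast behaviour described above.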

raamana commented Jan 8, 2018

Another way to reduce repetitive validation of the dataset is to leave some sort of validation result and timestamp within the directory being validated. This would allow skipping the whole process if the latest validation happened within an acceptably recent period of time (e.g. a month) and nothing in that directory has changed since (e.g. the latest modification time in the directory is older than the stamp).
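A minimal sketch of that freshness check. The marker file name and the one-month window are illustrative assumptions, not part of any BIDS specification, and modification times are only a heuristic (they can be unreliable on some network mounts):

```python
import time
from pathlib import Path

MARKER = ".bids_validation_passed"  # hypothetical marker file name
MAX_AGE_SECONDS = 30 * 24 * 3600    # "acceptably recent": ~one month

def validation_is_fresh(dataset: Path) -> bool:
    """Reuse a previous validation if the marker exists, is recent
    enough, and nothing in the dataset changed after it was written."""
    marker = dataset / MARKER
    if not marker.exists():
        return False
    stamp = marker.stat().st_mtime
    if time.time() - stamp > MAX_AGE_SECONDS:
        return False
    # Newest modification time anywhere in the tree, excluding the marker.
    newest = max(
        (p.stat().st_mtime for p in dataset.rglob("*") if p != marker),
        default=0.0,
    )
    return newest <= stamp
```

An app would run the full validator only when validation_is_fresh() returns False, then rewrite the marker on success.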

gkiar (Contributor, Author) commented Jan 8, 2018

@raamana, in response to your first point ("implement features across on-boarding and processing platforms..."): I'm not sure exactly what you mean, but without knowing each app intimately we could not expect platforms to do this; a simple flag (like the one in #20) gives users this freedom regardless of their context.

I too like things that fail fast, which is why I would run validation before submitting jobs to a cluster, waiting in a queue, and being provisioned resources; this makes the validation the app would perform once it is on core unnecessary. I agree that some people may just run these jobs on their local systems or may not share my workflow, so I'm not proposing we remove the validator, simply that we enable "power users" (those with workflows similar to my own) to use the spec as well.

Regarding the latter points about levels of validation and reducing repetitive validation, I think those are only relevant in the context of a data-management platform, rather than computational tools which may be racing to create or access these results. Supporting these or other cool features in a data platform would be great.

P.s. @chrisfilo, I adopted the same flag name used in the FreeSurfer example. Thanks again for pointing it out. 😄

raamana commented Jan 9, 2018

Greg, I mean to implement --skip-bids-validation at different layers, from the very beginning right up to the actual computation within a BIDS app. We are both in agreement there, except that I would vote to have validation on by default. In my view, defaults for this ecosystem should be geared towards "vanilla users", not developers who can configure advanced features themselves.

It is possible I am missing some technical details of your use case in which no-validation-by-default is the only way to achieve it.

At a higher level, though, the whole issue would be solved if validation were reasonably fast, especially when repeated often. My suggestion to leave a timestamp and validation results might be worth considering, along with a way to force a rerun of validation regardless of how recently it last ran.

gkiar (Contributor, Author) commented Jan 9, 2018

@raamana I don't disagree: in #20 the validator is on by default, and I don't need it to default one way or the other, so I'm happy typing the extra 22 characters :)

Regarding your latter point about a timestamp, I think that would be a great discussion to bring up on the validator repo.

Remi-Gau (Contributor) commented Sep 1, 2023

Given that an opt-out validation strategy was added, I think this can be closed.

Remi-Gau closed this as completed Sep 1, 2023.