
forcing the validation of datasets prior to launch limits the use of apps on "far" datasets #19

Closed
gkiar opened this issue Jan 6, 2018 · 12 comments


gkiar (Contributor) commented Jan 6, 2018

Hi team!

My use case is the following: I have several datasets, all BIDS-validated, stored "far away" (i.e. not where I wish to do processing). When I launch the BIDS Apps I wish to run on them, such as this one, the validator forces a crawl of the dataset, which means many queries and file accesses to my far-away dataset. This results in a) a huge amount of unnecessary data and metadata download (since I'd launch a job for a single participant on isolated nodes, and each would perform a full dataset validation), and b) latency while I wait for this download, crawl, and validation to take place.

My proposal is that if developers of BIDS Apps wish to have the validator run prior to their tool, it be enabled via a --validate option on the command line, so that the default behaviour is more cloud/distributed-computing friendly. I think this is also a fair redistribution of labour, since it puts the onus of dataset validation on data-storage platforms rather than on independent tasks (which could feasibly be running different versions of the validator, etc.).
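A minimal sketch of what this opt-in flag could look like, assuming an argparse-based Python app that shells out to the bids-validator CLI (the function names here are illustrative, not from any existing app):

```python
import argparse
import subprocess

def build_parser() -> argparse.ArgumentParser:
    """CLI for a hypothetical BIDS app where validation is opt-in."""
    parser = argparse.ArgumentParser(description="example BIDS app")
    parser.add_argument("bids_dir", help="root directory of the BIDS dataset")
    parser.add_argument(
        "--validate",
        action="store_true",
        help="crawl and validate the dataset before running (off by default)",
    )
    return parser

def maybe_validate(args: argparse.Namespace) -> None:
    """Only touch the (possibly remote) dataset if the user asked for it."""
    if not args.validate:
        return
    # bids-validator exits non-zero on an invalid dataset;
    # check=True makes that abort the app with CalledProcessError.
    subprocess.run(["bids-validator", args.bids_dir], check=True)
```

With this default, a job launched per participant on an isolated node never crawls the remote dataset unless --validate is passed explicitly.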

Happy to hear the opinions of others, or to explain in more detail exactly how my data is stored and accessed (I omitted these details since this issue applies, with varying impact, to other types of network or symbolic mounts; in my particular case it is prohibitive and I cannot run apps which use the validator).

Cheers,
G

chrisgorgo (Contributor) commented

This has been brought up and implemented in the context of the FreeSurfer app (see bids-apps/freesurfer#37). I think it would make sense to add it to other apps.

chrisgorgo (Contributor) commented

BTW, most BIDS Apps devs are not subscribed to notifications on this repository, but they are on https://groups.google.com/forum/#!forum/bids-apps-dev.

gkiar (Contributor, Author) commented Jan 8, 2018

Hey @chrisfilo, I don't think the solution you linked to in FreeSurfer is what I'm referring to. What I'm suggesting is that no validation occurs at task launch, rather than some minimal set. The reason for this is what I mentioned above: validator versions may differ, and it makes more sense to validate datasets on the data-management side than in every independent task.

I think my proposal is much simpler, more scalable, and very easy to implement (as shown in #20).

gkiar (Contributor, Author) commented Jan 8, 2018

Re the mailing list: there are too many emails and I can't keep up with them, but I'll send a note to the thread and encourage people to contribute here if they have feelings.

chrisgorgo (Contributor) commented

The FreeSurfer PR does optionally allow skipping validation entirely; see https://github.com/BIDS-Apps/freesurfer/pull/37/files#diff-04c6e90faac2675aa89e2176d2eec7d8R40.

It's just the other way around: validation is turned on by default.

I would not assume that there is a "data management side" in the context of BIDS and BIDS Apps. A lot of people will just use file systems. This is why I would be keener on having the validator enabled by default.

PS We all have feelings ;)

gkiar (Contributor, Author) commented Jan 8, 2018

Ah, thanks for pointing me to the line; I missed it amid all the other discussion in that issue about exclusions from the validation process! I'm writing a note to the email list now and linking this issue. I have no problem updating my PR to make validation the default behaviour, if that's what you and others prefer.

P.s. <3 your PS :)

raamana commented Jan 8, 2018

I vote to turn validation ON by default, and to implement features across onboarding (job launch) and processing platforms (clusters and the apps themselves) that let individual developers disable it if they choose.

I would like apps to fail fast when the dataset has issues (i.e. not even schedule or launch jobs that are expected to fail).

Perhaps there could be different levels of validation, e.g. fail on error, warn on error, or ignore everything. Alternatively, validation could return a code and the developer could take appropriate action (e.g. by catching exceptions).
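A rough sketch of what such levels could look like. The enum and the exit-code convention here are assumptions for illustration, not an existing validator API; the only thing assumed of the validator is that it exits non-zero on an invalid dataset:

```python
from enum import IntEnum

class Strictness(IntEnum):
    """Hypothetical validation levels, as discussed above."""
    IGNORE = 0  # never abort; run regardless of validation result
    WARN = 1    # report problems but continue
    STRICT = 2  # fail fast on any validator error

def should_abort(validator_exit_code: int, level: Strictness) -> bool:
    """Decide whether the app should abort, given the validator's
    exit code and the configured strictness level."""
    if level is Strictness.STRICT:
        return validator_exit_code != 0
    if level is Strictness.WARN and validator_exit_code != 0:
        print("warning: dataset failed validation, continuing anyway")
    return False
```

An app (or a job-submission wrapper) could then check should_abort() before scheduling any work, getting the fail-fast behaviour described above.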

raamana commented Jan 8, 2018

Another way to reduce repetitive validation of the dataset is to leave some sort of validation result and timestamp within the directory being validated. This would allow skipping the whole process if the latest validation happened within an acceptably recent period of time (e.g. a month) and nothing in that directory has changed since (e.g. the latest modification time in the directory is older than the stamp).
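A minimal sketch of that freshness check. The marker file name and the one-month window are illustrative assumptions, not part of any BIDS specification, and modification times are only a heuristic (they can be unreliable on some network mounts):

```python
import time
from pathlib import Path

MARKER = ".bids_validation_passed"  # hypothetical marker file name
MAX_AGE_SECONDS = 30 * 24 * 3600    # "acceptably recent": ~one month

def validation_is_fresh(dataset: Path) -> bool:
    """Reuse a previous validation if the marker exists, is recent
    enough, and nothing in the dataset changed after it was written."""
    marker = dataset / MARKER
    if not marker.exists():
        return False
    stamp = marker.stat().st_mtime
    if time.time() - stamp > MAX_AGE_SECONDS:
        return False
    # Newest modification time anywhere in the tree, excluding the marker.
    newest = max(
        (p.stat().st_mtime for p in dataset.rglob("*") if p != marker),
        default=0.0,
    )
    return newest <= stamp
```

An app would run the full validator only when validation_is_fresh() returns False, then rewrite the marker on success.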

gkiar (Contributor, Author) commented Jan 8, 2018

@raamana, in response to your first point ("implement features across on-boarding and processing platforms..."): I'm not sure exactly what you mean, but without knowing each app intimately we could not expect platforms to do this; a simple flag (like the one in #20) gives users this freedom regardless of their context.

I too like things that fail fast, which is why I would run validation before submitting jobs to a cluster, waiting in a queue, and being provisioned resources; this makes the validation the app would perform once it is on core unnecessary. I agree that some people may just run these jobs on their local systems or may not share my workflow, so I'm not proposing we remove the validator, simply that we enable "power users" (those with workflows similar to my own) to use the spec as well.

Regarding the latter points about levels of validation and reducing repetitive validation, I think those are only relevant in the context of a data-management platform, rather than computational tools which may be racing to create or access these results. Supporting these or other cool features in a data platform would be great.

P.s. @chrisfilo, I adopted the same flag name used in the FreeSurfer example. Thanks again for pointing it out. 😄

raamana commented Jan 9, 2018

Greg, I mean to implement --skip-bids-validation at different layers, from the very beginning right up to the actual computation within a BIDS app. We are both in agreement there, except that I would vote to have validation on by default. In my view, defaults for this ecosystem should be geared towards "vanilla users", not developers who can configure advanced features themselves.

It is possible I am missing some technical details of your use case in which no-validation-by-default is the only way to achieve it.

At a higher level, though, the whole issue would be solved if validation were reasonably fast, especially when repeated often. My suggestion to leave a timestamp and validation results might be worth considering, along with a way to force a rerun of validation regardless of how recently it last ran.

gkiar (Contributor, Author) commented Jan 9, 2018

@raamana I don't disagree: in #20 the validator is on by default, and I don't need it to default one way or the other, so I'm happy typing the extra 22 characters :)

Regarding your latter point about a timestamp, I think that would be a great discussion to bring up on the validator repo.

Remi-Gau (Contributor) commented Sep 1, 2023

Given that an opt-out validation strategy was added, I think this can be closed.

Remi-Gau closed this as completed Sep 1, 2023.