forcing the validation of datasets prior to launch limits the use of apps on "far" datasets #19
Comments
This has been brought up and implemented in the context of the FreeSurfer app (see bids-apps/freesurfer#37). I think it would make sense to add it to other apps.
BTW, most BIDS Apps devs are not subscribed to notifications on this repository, but are on the mailing list at https://groups.google.com/forum/#!forum/bids-apps-dev
Hey @chrisfilo - I don't think the solution you linked to in FreeSurfer is what I'm referring to. What I'm suggesting is that no validation occurs at task launch, rather than some minimal set. The reason for this is what I mentioned above: validator versions may differ, and it makes more sense to validate datasets on the data-management side rather than in every independent task. I think my proposal is much simpler, more scalable, and very easy to implement (as shown in #20).
Re the mailing list: there are too many emails and I can't keep up with them, but I'll send a note to the thread and encourage people to contribute here if they have feelings.
The FreeSurfer PR does implement the option to skip validation entirely - see https://github.com/BIDS-Apps/freesurfer/pull/37/files#diff-04c6e90faac2675aa89e2176d2eec7d8R40. It's just the other way around: validation is turned on by default. I would not assume that there is a "data-management side" in the context of BIDS and BIDS Apps; a lot of people will just use file systems. This is why I would be keener on having the validator enabled by default. PS We all have feelings ;)
Ah, thanks for pointing me to the line - I missed it in all the other discussion on that issue about exclusions from the validation process! Writing a note to the email list now and linking this issue. I have no problem updating my PR so that the default behaviour is to validate, if that's what you and others prefer. P.s. <3 your PS :)
I vote to turn validation ON by default, and to implement features across on-boarding (while launching jobs) and processing platforms (clusters and the apps themselves) that let individual developers disable it if they choose to. I would like apps to fail fast when the dataset has issues (not even schedule or launch jobs when they are expected to fail). Perhaps there can be different levels of validation, e.g. fail on error, warn on error, or ignore everything. Or validation can return a code and the developer can take an appropriate action (catching exceptions).
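To illustrate the "levels of validation" idea, here is a minimal sketch of how an app could map the validator's exit status to different behaviours. The level names and the exception are illustrative assumptions, not part of any existing BIDS app or the bids-validator itself:

```python
# Hypothetical tiered handling of a validator's exit status.
# Level names ("error", "warn", "ignore") are assumptions for illustration.

def handle_validation(returncode, level="error"):
    """Decide what to do with the validator's exit status.

    level="error"  -> raise, so no jobs are scheduled on a bad dataset
    level="warn"   -> print a warning but continue
    level="ignore" -> continue silently
    """
    if returncode == 0 or level == "ignore":
        return True
    if level == "warn":
        print("WARNING: validator reported problems (exit %d)" % returncode)
        return True
    raise RuntimeError("Dataset failed BIDS validation (exit %d)" % returncode)
```

A caller that wants to fail fast keeps the default level and catches the exception before submitting anything to the queue.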
Another way to reduce repetitive validation of a dataset is to leave some sort of validation result and timestamp within the directory being validated. This would let us skip the whole process if the latest validation happened within an acceptably recent period (say, a month) and nothing in that directory has changed since (e.g. when the latest modification time in that directory is older than the timestamp).
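The timestamp-caching idea above could look roughly like this; the stamp file name and the one-month window are arbitrary choices for the sketch, not an agreed convention:

```python
# Sketch of cached validation: record a stamp file after a successful
# validation and skip re-validation if nothing in the dataset changed since.
import os
import time

STAMP = ".bids_validation_stamp"   # hypothetical file name
MAX_AGE = 30 * 24 * 3600           # "a month", in seconds

def needs_validation(dataset_dir):
    stamp = os.path.join(dataset_dir, STAMP)
    if not os.path.exists(stamp):
        return True
    stamp_time = os.path.getmtime(stamp)
    if time.time() - stamp_time > MAX_AGE:
        return True
    # Re-validate if any file was modified after the last validation.
    for root, _, files in os.walk(dataset_dir):
        for name in files:
            if name == STAMP:
                continue
            if os.path.getmtime(os.path.join(root, name)) > stamp_time:
                return True
    return False

def record_validation(dataset_dir):
    with open(os.path.join(dataset_dir, STAMP), "w") as f:
        f.write(str(time.time()))
```

Note that on a remote mount this still walks the tree to check mtimes, so it mainly saves the cost of the validation itself, not the crawl.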
@raamana, in response to your first point, "implement features across on-boarding and processing platforms...": without knowing the app intimately, we could not expect platforms to do this, but a simple flag (like the one in #20) gives users this freedom regardless of their context. I too like things that fail fast, which is why I would run validation before submitting jobs on a cluster, waiting in a queue, and being provisioned resources - rendering the validation the app would perform once it is running on a compute node unnecessary. I agree that some people may just run these jobs on their local systems or may not share my workflow, so I am not proposing we remove the validator, simply that we enable "power users" (those with workflows similar to my own) to use the spec as well. Regarding the latter points about levels of validation and reducing repetitive validation, I think those are only relevant in the context of a data-management platform rather than computational tools, which may be racing to create or access these results. Supporting these or other cool features in a data platform would be great. P.s. @chrisfilo I adopted the same flag name as was used in the FreeSurfer example - thanks again for pointing it out. 😄
Greg, I mean to implement --skip-bids-validation at different layers, from the very beginning right up to the actual computation within a BIDS app. We are in agreement there, except that I would vote to have validation on by default. In my view, defaults for this ecosystem should be geared towards "vanilla users", not developers who can configure advanced features themselves. It is possible I am missing some technical detail of your use case where no-validation-by-default is the only way to achieve it. At a higher level, though, the whole issue would be solved if validation were reasonably fast, especially when repeated often. My suggestion to leave a timestamp and validation results might be worth considering, along with an option to force a rerun of validation regardless of how recent the last one was.
Given that an opt-out validation strategy was added, I think this can be closed.
Hi team!
My use case is the following: I have several datasets, all BIDS validated, stored "far away" (i.e. not where I wish to do processing). When I launch the BIDS apps I wish to run on them, such as this one, the validator forces a crawl of the dataset, which means many queries and file accesses to my far-away dataset, resulting in a) a huge amount of unnecessary data and metadata download (as I'd launch a job for a single participant on isolated nodes, and each would perform a full dataset validation), and b) latency while I wait for this download, crawl, and validation to take place.
My proposal is that if developers of BIDS apps wish to have the validator run prior to their tool, it be added with a --validate option on the command line, so that the default behaviour is more cloud/distributed-computing friendly. I think this is also a fair redistribution of labour, since it puts the onus of dataset validation on data-storage platforms rather than on independent tasks (which could feasibly be running different versions of the validator, etc.). Happy to hear the opinions of others, or to explain in more detail exactly how my data is stored and accessed (I omitted these details here since this issue arises, though it varies in impact, for other types of network or symbolic mounts; in my particular case it is prohibitive and I cannot run apps which use the validator).
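A minimal sketch of what the proposed opt-in flag could look like in an app's entry point, using argparse. The parser layout and the validator invocation are illustrative assumptions, not the actual run script of any app:

```python
# Hypothetical BIDS App entry point with an opt-in --validate flag.
import argparse
import subprocess

def build_parser():
    parser = argparse.ArgumentParser(description="Example BIDS App entry point")
    parser.add_argument("bids_dir", help="root of the BIDS dataset")
    parser.add_argument("--validate", action="store_true",
                        help="run the bids-validator before processing "
                             "(off by default, so remote datasets are not crawled)")
    return parser

def main(argv=None):
    args = build_parser().parse_args(argv)
    if args.validate:
        # Only crawl the dataset when the user explicitly asks for it.
        subprocess.check_call(["bids-validator", args.bids_dir])
    # ... launch per-participant processing here ...
    return args
```

With this layout, a distributed launcher can validate once on the storage side and invoke each per-participant task without the flag.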
Cheers,
G