New accession specification and contributor naming guidelines -- DISCUSSION #84
Related to this: do we have an inventory anywhere of the 2-3 letter codes already in use in MassBank accession IDs @meier-rene? |
Yes, we have. I recently updated the list of contributors. https://github.com/MassBank/MassBank-data/blob/master/List_of_Contributors_Prefixes_and_Projects.md |
Brilliant, thank you! Looks like Luxembourg records should indeed start with L ;-) @MaliRemorker |
Yes, but the UFZ UN and UP assignments are missing now. @meier-rene, could you add them again, please? |
Done. |
Hi all. |
Hi Michael, However, RMassBank would allow only 3 digits for the metabolite number, because the last two digits are reserved for the ionisation and collision energy. This was a decision of the Eawag people during RMassBank development. I didn't like those last two digits from the beginning, but I never found the time to program a more flexible version of RMassBank that lets the user choose between their own numbering and this Eawag-constrained setting. At UFZ, we run experiments with many collision energies, so the number of records (not yet uploaded...) grows quickly, and I need to get rid of the last two digits after RMassBank processing. I post-process the files and insert new sequential numbers into the accessions with a small parsing script. In our case this makes sense, because our accessions are arbitrary anyway and not related to any internal ID. Best, |
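For readers wondering what such a renumbering step could look like, here is a minimal sketch. It is not the UFZ script: the function name `renumber_accessions`, the prefix `XYZ`, and the example IDs are hypothetical. It assumes the 8-character accession length mentioned in this thread and a contributor prefix followed only by digits.

```python
def renumber_accessions(accessions, prefix, total_length=8):
    """Map each old accession to a new sequential one (hypothetical helper).

    Assumption: an accession is `prefix` plus zero-padded digits,
    `total_length` characters in total.
    """
    width = total_length - len(prefix)
    if width <= 0:
        raise ValueError("prefix leaves no room for a number")

    # Sort so the mapping is reproducible regardless of input order.
    ordered = sorted(set(accessions))
    if len(ordered) > 10 ** width - 1:
        raise ValueError("too many records for the available digits")

    # Numbering starts at 1 and uses every digit after the prefix,
    # so no positions stay reserved for ionisation or collision energy.
    return {
        old: f"{prefix}{number:0{width}d}"
        for number, old in enumerate(ordered, start=1)
    }


if __name__ == "__main__":
    old_ids = ["XYZ00102", "XYZ00101", "XYZ00203"]
    for old, new in renumber_accessions(old_ids, "XYZ").items():
        print(old, "->", new)
```

With a three-letter prefix this frees all five remaining positions for the running number, instead of three for the compound and two for the acquisition settings.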
Yes, I was thinking the same way. We have three different LC methods for which we generate data:
Accordingly, I would use:
and then the combinations:
and so on... Tell me if you prefer the instrument on the first or last position. |
I would prefer the order instrument and then method ARP... |
I was thinking the same way. I can add it to the list and then make a PR. |
It isn't so hard to write another generator for the accession ID, I have to check but it's possibly even anticipated in the S4power. However, I am still deeply unhappy with the limitations to 8 digits from the MassBank format, and I would really wish to use a larger identifier space! Is there any way towards this? |
We do have the SPLASH as an identifier; however, that is not necessarily unique. So my suggested roadmap would be that non-semantic unique identifiers are generated. In the short run I would keep the ACCESSION format specs as-is. |
Hi, Best, |
Then what happens to the ACCESSION? Do we discard it entirely? I agree with the semantics aspect in principle, though I would extend that to the contributor letter code in the beginning! I would much rather have a CONTRIBUTOR (NAMESPACE?) tag separate from Accession, and Accession be any 8-letter alphanumeric that the Contributors choose by themselves. What do you suggest: how should contributors keep track of their own contributed spectra? |
I don't know, but maybe a first step would be to enlarge the allowed digits and the length of prefix? Just my impression... |
@sneumann @michaelwitting @meowcat @meier-rene, following up on our discussions yesterday and today: we agreed to migrate the accession to a more flexible format. The idea was an ID with a major and a minor part, with some flexibility to also encode semantics (if required, but not recommended). The prerequisites of the new specification are:
The tasks are:
|
This is related to issue MassBank/MassBank-web#11. I suggest going ahead with discussion here and then use issue 11 for the final implementation. |
Paging @schymane My suggestion
I also want to make something clear that is perhaps not so clear: the "semantics" in the accession ID for RMassBank records was never a part of the record specification and purely a user-sided decision (except that I made that decision for users of RMassBank). I fully agree that we do not want any semantics at all prescribed by the record specification. Variations
Another truly backwards-compatible option is:
Note: in line with MassBank/MassBank-web#303, I would suggest to have the actual regex (or multiple regex) defined in a file. |
A question to @sneumann: Do we want backwards compatibility (in the sense that old IDs are valid IDs under the new rules) or only the guarantee of no collisions? The second could be achieved by any code that has at least one underscore. |
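To illustrate the "no collisions" argument, here is a small sketch. The legacy pattern below is an assumption based on the "8-letter alphanumeric" description earlier in this thread, not the official record specification, and the example IDs are made up.

```python
import re

# Assumption: a legacy accession is exactly 8 characters, each an
# uppercase letter or a digit. The real rule may be stricter.
LEGACY_ACCESSION = re.compile(r"[A-Z0-9]{8}")


def is_legacy(accession):
    """Return True if the string has the assumed legacy shape."""
    return LEGACY_ACCESSION.fullmatch(accession) is not None


def cannot_collide_with_legacy(candidate):
    """An ID containing an underscore can never match the legacy pattern,
    because '_' is outside the legacy character set."""
    return "_" in candidate and not is_legacy(candidate)


if __name__ == "__main__":
    for accession in ["AB000001", "AB_000001", "ABC_0001"]:
        print(accession, is_legacy(accession), cannot_collide_with_legacy(accession))
```

Note that `ABC_0001` is also 8 characters long, yet it is still distinguishable from every old ID: the guarantee comes from the character set, not from the length.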
I totally agree that there should be no semantics in the ID. If users (like us) use their own semantics in the ID, they just have to adhere to the rules for a MassBank ID. |
We also discussed this internally a bit. We would like to come close to the structure of a DOI (which is: 10.ORGANISATION/ID), but we don't want to see the '/'. We would like to limit the allowed characters in the major and minor sections, so that we have fewer problems with escaping characters in different formats like URL, HTML, JSON, XML, SQL...
Here comes my suggestion: We translate our current dataset: We would need a resolver for the old accessions 😟 but I hope I can solve this with an algorithm. Thank you for all your suggestions! Any objections? |
I like the suggestion @tsufz sent me via eMail, e.g. |
Can we get away without |
... and I like actually several of @meowcat 's variants as possibilities ... that previous comment was rather to advocate sticking to |
I will trust @sneumann and @meier-rene on interoperability. Nevertheless, I would consider removing Second, do we need the tilde? Do we see any reason people would want to use it? It may still have special meaning in filesystems, no? Allowing two separators (i.e. "-" for the canonical and "_" for user purposes in the minor) is probably desirable, especially for people who want to do acrobatics in the minor part. |
I would be able to live with "-" and "_" if necessary, but don't really like "." and "~", as @meowcat said ... |
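On the escaping side of the separator question, a quick check with the Python standard library may help. This is only a sketch: the sample IDs are made up, and the tilde result assumes Python 3.7 or later, where `urllib.parse.quote` treats "~" as unreserved.

```python
import html
import json
from urllib.parse import quote

CANDIDATE_SEPARATORS = ["-", "_", ".", "~"]


def survives_unescaped(text):
    """Report whether `text` passes through common encoders unchanged."""
    return {
        # safe="" so that even "/" would be percent-encoded
        "url": quote(text, safe="") == text,
        "html": html.escape(text) == text,
        "json": json.dumps(text) == f'"{text}"',
    }


if __name__ == "__main__":
    for separator in CANDIDATE_SEPARATORS:
        sample = f"MAJOR{separator}MINOR"  # made-up ID shape
        print(separator, survives_unescaped(sample))
```

All four characters are unreserved in RFC 3986, so none of them needs percent-encoding in a URL. The objections to "." and "~" are therefore about readability and about their special meaning in filesystems and shells rather than about escaping.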
"." and "~" look really weird to me. I think "-" and "_" are sufficient. |
Hi, quick question, has a consensus been reached here? (I understand that there is not a guarantee someone else will make a colliding identifier in the future.) |
I actually typoed Should there be something like a metadata YAML per contributor on |
I think we agreed that we will have a |
This fits my plans nicely, since we will soon start to create some new libraries for the new non-targeted platform we would like to set up in my new lab. Generally, I like this very much, and it should give everyone enough freedom... |
Btw. Some while ago, the colleagues of GNPS asked us to introduce the USI tag as well. It is like |
I agree with the suggestion of Michele. Insiders may interpret |
@meowcat You suggested introducing a provenance file. I don't think it is possible to maintain a public collection of e-mail addresses etc. This is personal data, and given the data protection issues I would prefer not to be responsible for such a collection. We generate the provenance with our submission system. So far, we have avoided data protection issues, and I suggest we keep doing so in future. Nevertheless, in NFDI4Chem we are developing a new repository-style part of MassBank. For this repo, access options are indeed required, with priority given to SSO / AAI systems. |
Hi all, Regarding the |
Hi, now that this is going live: |
I think with the new accession name scheme this has been addressed. |
I would like to develop some guidelines for new contributors on how to name their accessions and how and when to create new directories. This has become urgent due to some email discussion about new contributors and, in particular, some new contributions, like #82.
There are different demands, for which we need to find a compromise.
As a maintainer of the whole project, I would like to keep the data compact and uncluttered. Some directories are desirable, but not too many. Technically, we only support one level of directories at the moment.
There are demands from contributors. They would like to separate their contributions by contributing group, but sometimes also by the specific project that supported the creation of these records. I expect that an entry in the COMMENT section does not suffice. For them it is most likely also a matter of public image. Sometimes this separation is not an issue, because there is just one contributing person/group at a particular institution. In other cases more "separation" or "distinguishability" is desired. You all know that everyone has to justify his/her projects somehow...
Possible solutions (technical view):
- The easiest way for contributors: allow subdirectories in the institution directories. This creates major headaches on my side, because it would mean a lot of adjustments to the codebase.
- Use a directory naming scheme like the one that is already in the current data and resulted in this discussion issue. Examples for the scheme could be RIKEN_IMS, RIKEN_NPDepo... This is an easy solution because it works right now. The only drawback is the increasing number of directories, which makes the project view a bit more confusing.
- Ease the requirements on accession naming. That is easy to implement, but might not provide sufficient "distinguishability".
Besides directory naming we also have the question of accession naming.