New accession specification and contributor naming guidelines -- DISCUSSION #84

Closed
meier-rene opened this issue Jun 28, 2019 · 37 comments

@meier-rene
Collaborator

I would like to develop some guidelines for new contributors on how to name their accessions and how and when to create new directories. This has become urgent due to some email discussions about new contributors and, in particular, some new contributions such as #82.

There are different demands for which we need to find some compromise.

As a maintainer of the whole project, I would like to keep the data compact and uncluttered. Some directories are desirable, but not too many. Technically, we only support one level of directories at the moment.

There are demands from contributors. They would like to separate their contributions by contributing group, but sometimes also by the specific project that supported the creation of the records. I expect that an entry in the COMMENT section does not suffice. For them it is most likely also a matter of public image. Sometimes this separation is not an issue, because there is just one contributing person/group at a particular institution. In other cases more "separation" or "distinguishability" is desired. You all know that everyone has to justify his/her projects somehow...

Possible solutions (technical view):
- The easiest way for contributors: allow subdirectories in the institution directories. This creates major headaches on my side, because it would mean a lot of adjustments to the codebase.
- Use a directory naming scheme like the one that is already in the current data and led to this discussion. Examples of the scheme are RIKEN_IMS, RIKEN_NPDepo... This is an easy solution because it works right now. The only drawback is the growing number of directories, which makes the project view a bit more confusing.
- Ease the requirements on accession naming. That is easy to implement but might not provide sufficient "distinguishability".

Besides directory naming we also have the question of accession naming.

@schymane
Member

Related to this: do we have an inventory anywhere of the 2-3 letter codes already in use in MassBank accession IDs, @meier-rene? @MaliRemorker

@meier-rene
Collaborator Author

Yes, we have. I recently updated the list of contributors. https://github.com/MassBank/MassBank-data/blob/master/List_of_Contributors_Prefixes_and_Projects.md

@schymane
Member

Brilliant, thank you! Looks like Luxembourg records should indeed start with L ;-) @MaliRemorker

@tsufz
Member

tsufz commented Aug 26, 2019

Yes, but the UFZ UN and UP assignments are missing now. @meier-rene, could you add them again, please?

@meier-rene
Collaborator Author

Done.

@michaelwitting
Contributor

Hi all.
We are currently collecting new data on our new Agilent 6560 and Sciex X500R using our SOP methods. The LC methods are the same as for our current data (BGC Munich, RP). I'm thinking a lot about prefixes etc... Currently we use RMassBank, and the shifts in the numbers are linked to the adducts and collision energies used. I would also like to keep the IDs of metabolites the same across the different instruments. Therefore the only way of differentiating between the instruments would be the prefix. Let me know your current opinions and ideas.

@tsufz
Member

tsufz commented Jul 17, 2020

Hi Michael,
This is a good question, and maybe it is time to continue the discussion on the accession. You could use a third alphanumeric character to denote the instrument (e.g. RPA for Agilent and RPS for Sciex). We could reserve the IDs for your lab. See https://github.com/MassBank/MassBank-web/blob/master/Documentation/MassBankRecordFormat.md#2.1.1. for details on the accession.

However, RMassBank would allow only 3 digits for the metabolite number, because the last two digits are reserved for the ionisation and collision energy. This was a decision of the Eawag people during RMassBank development. I didn't like those last two digits from the beginning, but I never found the time to program a more flexible version of RMassBank that lets the user choose between their own numbering and this Eawag-constrained setting.

At UFZ, we run experiments with many collision energies, so the number of records (not yet uploaded...) grows quickly, and I need to get rid of the last two digits after RMassBank processing. I post-process the files with a small parsing script that writes new sequential numbers into the accessions. In our case this makes sense, because our accessions are arbitrary anyway and not related to any internal ID.

Best,
Tobias
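
A minimal sketch of what such a renumbering step could look like. This is not the UFZ script, which is not shown in this thread; it assumes old-style accessions consist of 2-3 capital letters followed by digits, and the function name `renumber` and the example accessions are hypothetical.

```python
import re

# Assumed old-style layout: 2-3 capital letters followed by digits.
OLD_ACCESSION = re.compile(r"^([A-Z]{2,3})([0-9]+)$")


def renumber(accessions, start=1):
    """Map each accession to a new one carrying a sequential number.

    The prefix and the number of digits are preserved; only the digits change.
    """
    mapping = {}
    counter = start
    for acc in accessions:
        match = OLD_ACCESSION.match(acc)
        if match is None:
            raise ValueError(f"unexpected accession: {acc!r}")
        prefix, digits = match.groups()
        width = len(digits)
        if counter >= 10 ** width:
            raise ValueError("ran out of numbers for this digit width")
        mapping[acc] = f"{prefix}{counter:0{width}d}"
        counter += 1
    return mapping


# Hypothetical accessions whose last two digits encode mode and energy.
example = renumber(["UP000101", "UP000102", "UP000201"])
```

Keeping the old-to-new mapping around makes it possible to trace a renumbered record back to the RMassBank output it came from.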

@michaelwitting
Contributor

Yes, I was thinking the same way. We have three different LC methods for which we generate data:

RP = Reversed-phase metabolite profiling (water-ACN gradient)
HI = HILIC based metabolite profiling
LI = Reversed-phase lipid profiling (ACN-iPrOH gradient)

Accordingly, I would use:

A = Agilent
B = Bruker
S = Sciex

and then the combinations:

ARP or RPA = Reversed-phase metabolite profiling on Agilent 6560
AHI or HIA = HILIC based metabolite profiling on Agilent 6560
ALI or LIA = Reversed-phase lipid profiling on Agilent 6560

and so on...

Tell me if you prefer the instrument on the first or last position.
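
To see all combinations at once, the two code tables above can be crossed programmatically. This is only an illustrative sketch; the names `INSTRUMENTS`, `METHODS` and `build_prefixes` are hypothetical and not part of RMassBank or MassBank.

```python
from itertools import product

# Instrument and method codes as listed in the comment above.
INSTRUMENTS = {"A": "Agilent", "B": "Bruker", "S": "Sciex"}
METHODS = {
    "RP": "Reversed-phase metabolite profiling",
    "HI": "HILIC based metabolite profiling",
    "LI": "Reversed-phase lipid profiling",
}


def build_prefixes(instrument_first=True):
    """Return a dict mapping each 3-letter prefix to its description."""
    prefixes = {}
    for (i_code, i_name), (m_code, m_name) in product(
        INSTRUMENTS.items(), METHODS.items()
    ):
        code = i_code + m_code if instrument_first else m_code + i_code
        prefixes[code] = f"{m_name} on {i_name}"
    return prefixes


# Instrument-first variant (ARP, AHI, ALI, BRP, ...).
PREFIXES = build_prefixes()
```

Both orderings yield nine distinct prefixes, so the choice between them is purely a matter of readability.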

@tsufz
Member

tsufz commented Jul 17, 2020

I would prefer the order instrument and then method ARP...

@michaelwitting
Contributor

I was thinking the same way. I can add it to the list and then make a PR.

@meowcat
Contributor

meowcat commented Jul 22, 2020

It isn't so hard to write another generator for the accession ID, I have to check but it's possibly even anticipated in the S4power. However, I am still deeply unhappy with the limitations to 8 digits from the MassBank format, and I would really wish to use a larger identifier space! Is there any way towards this?

@sneumann
Member

We do have the SPLASH as identifier, however that is not necessarily unique
if you have nominal mass spectra with just one/few peak(s). In the long run
we also can't rely on client-generated unique identifiers, nor am I a friend
of any semantics in the bits/bytes in the identifier.

So my suggested roadmap would be that non-semantic unique identifiers are generated
by MassBank at the time-point when new spectra get merged into the dev branch.
The generated IDs should support namespaces and versioning just like DOIs,
to (potentially) allow a federated system (again) without collisions.

In the short run I would keep the ACCESSION format specs as-is.
Yours, Steffen

@tsufz
Member

tsufz commented Jul 28, 2020

Hi,
I support the suggestions of @sneumann; we have better things to do than maintaining lists of internal identifiers. I was never a friend of the semantics and I use a running number anyway. Unfortunately, I had to split positive and negative spectra by the first characters because of the increased number of collision energies used. I am looking forward to get rid of all ID problems by the automated system. Let's go for it!

Best,
Tobias

@meowcat
Contributor

meowcat commented Jul 28, 2020

> I am looking forward to get rid of all ID problems by the automated system.

Then what happens to the ACCESSION? Do we discard it entirely?

I agree with the semantics aspect in principle, though I would extend that to the contributor letter code in the beginning! I would much rather have a CONTRIBUTOR (NAMESPACE?) tag separate from Accession, and Accession be any 8-letter alphanumeric that the Contributors choose by themselves.

What do you suggest: how should contributors keep track of their own contributed spectra?

@michaelwitting
Contributor

I don't know, but maybe a first step would be to increase the allowed number of digits and the length of the prefix? Just my impression...
For example, we use the first 4 digits for our internal DB ID, and the last two encode ion mode, adduct and collision energy.

@tsufz
Member

tsufz commented May 4, 2021

@sneumann @michaelwitting @meowcat @meier-rene, about our discussions yesterday and today. We agreed to migrate the accession to a more flexible format. The idea was an ID with a major and minor part with some flexibility to code also semantics (if required, but not recommended). The prerequisites of the new specification are:

  1. Backwards compatibility with old accessions
  2. URL compatibility (no crazy symbols, comma, etc.)

The tasks are:

  1. Find a general scheme
  2. Write the specification and Regex
  3. Integrate the specification in the Records Format
  4. Code where necessary (MassBank, RMassBank, external sources such as Wikidata).
  5. Inform friends about change of scheme and software backends
  6. Roll out.

tsufz changed the title from "New contributor accession naming guidelines -- DISCUSSION" to "New accession specification and contributor naming guidelines -- DISCUSSION" on May 4, 2021
@tsufz
Member

tsufz commented May 4, 2021

This is related to issue MassBank/MassBank-web#11. I suggest going ahead with discussion here and then use issue 11 for the final implementation.

@meowcat
Contributor

meowcat commented May 4, 2021

Paging @schymane

My suggestion

  • [A-Z0-9]{1,16}_[A-Z0-9]{1,32}
  • The first block is an equivalent to the current contributor ID
  • The second block is completely free to choose for the contributor. This would directly allow any mapping schemes on the contributor side e.g. by @michaelwitting.
  • If we want it truly backwards-compatible without rewriting or forwarding old IDs: ([A-Z0-9]{1,16}_)?[A-Z0-9]{1,32} so all old records match the second block only.
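
A small sketch of how the two patterns above behave, assuming old-style accessions look like the ABC12345 example used further down in this comment. The helper `split_id` is hypothetical and only illustrates the optional major block.

```python
import re

# The two patterns from the suggestion above.
STRICT = re.compile(r"[A-Z0-9]{1,16}_[A-Z0-9]{1,32}")
COMPATIBLE = re.compile(r"([A-Z0-9]{1,16}_)?[A-Z0-9]{1,32}")


def split_id(identifier):
    """Return (major, minor); major is None for IDs without an underscore."""
    if COMPATIBLE.fullmatch(identifier) is None:
        raise ValueError(f"invalid identifier: {identifier!r}")
    # The pattern allows at most one underscore, so partitioning is unambiguous.
    major, separator, minor = identifier.rpartition("_")
    return (major if separator else None), minor


old_style = split_id("ABC12345")
new_style = split_id("ABC_12345")
```

Under the strict pattern an old ID such as ABC12345 is rejected, while under the compatible pattern it is accepted and treated as a bare minor part.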

I also want to make something clear that is perhaps not so clear: the "semantics" in the accession ID for RMassBank records was never a part of the record specification and purely a user-sided decision (except that I made that decision for users of RMassBank). I fully agree that we do not want any semantics at all prescribed by the record specification.

Variations

  • making the second part purely numeric
    • This is then not truly backwards-compatible "in spirit" except if we actually rewrite old IDs ABC12345 as ABC_12345.
    • (Of course, [A-Z0-9]{1,16}(_[0-9]{1,32})? is a backwards-compatible solution, but it is somewhat irking to me that it matches the major rather than the minor part ;) )
    • this disallows some userside mapping schemes
    • there is no true good reason for this, except that many other identifiers are also numeric. (Or is there a reason I don't know about?)
  • Allowing underscores in the second part: ([A-Z0-9]{1,16}_)?[A-Z0-9_]{1,32}. This makes even more mappings available, but might feel inelegant.
  • Adding a prefix, e.g. MSBNK_ or MBEU_, to make the records immediately recognizable, similar to MTBLS for MetaboLights.
    • Drawback: not truly backwards compatible, though we could of course write a forwarder.

Another truly backwards-compatible option is:

  • [A-Z_]{1,32}[0-9]{1,32} or [A-Z]{1,32}[0-9]{1,32}
  • Drawbacks: While a mapping like @michaelwitting's would be possible, the letter part would "pollute" the major space, unless we declare the major space delimited by the first underscore.

Note: in line with MassBank/MassBank-web#303, I would suggest to have the actual regex (or multiple regex) defined in a file.

@meowcat
Contributor

meowcat commented May 4, 2021

A question to @sneumann: Do we want backwards compatibility (in the sense that old IDs are valid IDs under the new rules) or only the guarantee of no collisions? The second could be achieved by any code that has at least one underscore.

@michaelwitting
Contributor

I totally agree that there should be no semantics in the ID. If users (like us) use their own semantics in the ID, they just have to adhere to the rules for a MassBank ID.

@meier-rene
Collaborator Author

meier-rene commented May 4, 2021

We also discussed this a bit internally. We would like to come close to the structure of a DOI (which is 10.ORGANISATION/ID), but we don't want to see the '/'.
We as an organisation would like to keep track of the major ID. Of course we can follow the suggestions and wishes of the contributor as closely as possible. The minor ID is free for the contributor and can follow a scheme or just be a consecutive number.

We would like to limit the allowed characters in the major and minor sections, so that we have fewer problems with escaping characters in different formats like URL, HTML, JSON, XML, SQL...
This limits our available character set to:

  • A–Z (not case sensitive)
  • 0–9
  • - . _ ~

We would like to limit major and minor to a reasonable length. There are limitations with respect to databases and indexing.

Here comes my suggestion:
[A-Z0-9._~]{1,32}-[A-Z0-9._~]{1,128}
major - minor

We translate our current dataset:
contributor-accession

We would need a resolver for the old accessions 😟 but I hope I can solve this with an algorithm.

Thank you for all your suggestions! Any objections?
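
One way the translation and the resolver for old accessions could be sketched, under the assumption that every existing record is known as a (contributor, old accession) pair. The functions `translate` and `build_resolver` and the example values are hypothetical; the real resolver is not described in this thread.

```python
import re
from urllib.parse import quote

# Pattern from the suggestion above: major - minor.
PROPOSED = re.compile(r"[A-Z0-9._~]{1,32}-[A-Z0-9._~]{1,128}")


def translate(contributor, old_accession):
    """Build a 'contributor-accession' ID and check it against the pattern."""
    new_id = f"{contributor.upper()}-{old_accession.upper()}"
    if PROPOSED.fullmatch(new_id) is None:
        raise ValueError(f"cannot translate {contributor!r}/{old_accession!r}")
    return new_id


def build_resolver(records):
    """Map each old accession to its translated ID.

    `records` is an iterable of (contributor, old_accession) pairs.
    """
    return {old: translate(contributor, old) for contributor, old in records}


# Hypothetical contributor directory and accessions.
resolver = build_resolver([("Example_Lab", "XY000001"), ("Example_Lab", "XY000002")])

# Letters, digits, '_' and '-' are left untouched by URL quoting.
url_safe = quote(resolver["XY000001"]) == resolver["XY000001"]
```

Because the separator is not part of either character class, an ID matching the pattern always splits into exactly one major and one minor part.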

@michaelwitting
Contributor

I like the suggestion @tsufz sent me via email, e.g. UFZ-WANA-MASSBANK_[0-9]{1,16}. The first part could identify the organization (UFZ), the second part the method, sublibrary etc...

@schymane
Member

schymane commented May 4, 2021

Can we get away without - . _ ~ in the identifier character set and use only one of them (e.g. -) as the separator? I fear that allowing all of them will create all sorts of interoperability issues ...
i.e.
[A-Z0-9]{1,32}-[A-Z0-9]{1,128}
major - minor

@schymane
Member

schymane commented May 4, 2021

... and I like actually several of @meowcat 's variants as possibilities ... that previous comment was rather to advocate sticking to A-Z0-9 and not expanding beyond except for a separator

@meowcat
Contributor

meowcat commented May 4, 2021

I will trust @sneumann and @meier-rene on interoperability.

Nevertheless, I would consider removing . at least from the minor part because the dot will usually be in the record filename. Also, I believe there are still scripts that struggle if directories contain dots, so do we need it in the major?

Second, do we need the tilde? Do we see any reason people would want to use it? It may still have special meaning in filesystems, no?

Allowing two separators (i.e. "-" for the canonical and "_" for user purposes in the minor) is probably desirable, especially for people who want to do acrobatics in the minor part.

@schymane
Member

schymane commented May 4, 2021

I would be able to live with "-" and "_" if necessary, but don't really like "." and "~", as @meowcat said ...

@michaelwitting
Contributor

"." and "~" look really weird to me. I think "-" and "_" are sufficient.

@meowcat
Contributor

meowcat commented Jun 4, 2021

Hi,

quick question, has a consensus been reached here?
I am beginning to think we should have a database-scale prefix such as MSBNK_ or so, as it would make it possible to find MassBank records without specifying that the code in question is a MassBank accession. Other repositories have one too, such as MTBLS, MSV, PXD. E.g., the tool ppx retrieves repository data and doesn't require specifying which repository it is from. (Not that ppx is specifically relevant for us; this is more of a general statement.) As a bad counterexample, Metabolomics Workbench apparently prefixes their accession codes with ST, such that their study accessions could just as well be traditional MassBank accessions.

(I understand that there is no guarantee that someone else won't create a colliding identifier in the future.)

@meowcat
Contributor

meowcat commented Jun 4, 2021

I actually typoed MSBNK_ instead of MSBNK- but maybe it wouldn't even be so bad to enforce this within the organization level (but I'm confused about this).

Should there be something like a metadata YAML per contributor on MassBank-data? Either that would be in the contributor's directory, or in one directory .contributors or such where there is one file per contributor prefix with name, email contact etc.

@meier-rene
Collaborator Author

I think we agreed that we will have a major - minor system. The major should stay under control of the consortium and the minor is free. We agreed on a limited character set A-Z 0-9 _.
I support the idea of a database-scale prefix. The advantages are already described and the additional effort is negligible.

@michaelwitting
Contributor

This fits my plans very well, since we will soon start to create some new libraries for the new non-targeted platform we would like to set up in my new lab.
To summarize, a new accession number could look like MSBNK-MPC-RPS0000001, where MSBNK is the database-scale prefix, MPC is the major part, e.g. identifying a lab (in our case Metabolomics and Proteomics Core), and RPS0000001 is the minor part, which everybody is free to choose as long as it uses only A-Z 0-9 _.
Any ideas on the maximum length of each part?

Generally, I like this very much and should give everyone enough freedom...
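
The three-part layout summarized above can be taken apart with a plain split, since the hyphen is not in the allowed character set of any part. A minimal sketch, with hypothetical names `Accession` and `parse_accession`, and with no length limits because those were still open at this point in the discussion:

```python
import re
from typing import NamedTuple

# Allowed characters per part as agreed above: A-Z, 0-9 and underscore.
PART = re.compile(r"[A-Z0-9_]+")


class Accession(NamedTuple):
    prefix: str  # database-scale prefix, e.g. MSBNK
    major: str   # controlled by the consortium, e.g. a lab
    minor: str   # free for the contributor


def parse_accession(text):
    """Split 'PREFIX-MAJOR-MINOR' into its three parts."""
    parts = text.split("-")
    if len(parts) != 3 or not all(PART.fullmatch(p) for p in parts):
        raise ValueError(f"not a prefix-major-minor accession: {text!r}")
    return Accession(*parts)


parsed = parse_accession("MSBNK-MPC-RPS0000001")
```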

@tsufz
Member

tsufz commented Jun 10, 2021

Btw., some time ago the colleagues from GNPS asked us to introduce the USI tag as well. It looks like mzspec:MASSBANK::accession:[A-Z]{1,3}[0-9]{0,6}. It could simply be added to the records for easier re-use in GNPS, without necessarily showing up in the MassBank database.
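
A short sketch of converting between an accession and the USI-style string quoted above. It follows only the layout given in this comment and has not been checked against the USI specification; `to_usi` and `from_usi` are hypothetical helper names and the accession is a made-up example.

```python
# Fixed leading part of the string, as quoted in the comment above.
USI_PREFIX = "mzspec:MASSBANK::accession:"


def to_usi(accession):
    """Prepend the MassBank USI prefix to an accession."""
    return USI_PREFIX + accession


def from_usi(usi):
    """Recover the accession from a MassBank USI-style string."""
    if not usi.startswith(USI_PREFIX):
        raise ValueError(f"not a MassBank USI: {usi!r}")
    return usi[len(USI_PREFIX):]


usi = to_usi("XY000001")  # hypothetical accession
```

Since the string is derived entirely from the accession, it could be generated on export rather than stored in each record.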

@tsufz
Member

tsufz commented Jun 10, 2021

I agree with the suggestion of Michele. Insiders may recognise [A-Z]{1,3}[0-9]{0,6} as MassBank records, but others won't. Consider how our records are represented in PubChem: in many cases they are originally MassBank records, but the source is given as MoNA. A tag marking them as MassBank would help to increase awareness of the origin of the data.

@tsufz
Member

tsufz commented Jun 10, 2021

@meowcat You suggested introducing a provenance file. I think, it is not possible to maintain a public collection of e-mails etc. This is personal data and I prefer not to be responsible for such collection regarding data protection issues. We generate the provenance with our submission system. So far, we avoided data protection issues and I suggest doing so in future. Nevertheless, in NFDI4Chem, we develop a new repository style part of MassBank. Indeed, for this repo, access options are required with the priority to use SSO / AAI systems.

@meowcat
Contributor

meowcat commented Nov 10, 2021

Hi all,
is there anything holding us back here? Can I do anything to see this implemented?
As I see it, the latest suggestion is MSBNK-[A-Z0-9_]{1,32}-[A-Z0-9_]{1,128}.

Regarding the MSBNK- prefix, it probably makes sense to use it specifically for records uploaded to massbank.eu (since it's supposed to identify records that are from MassBank, rather than records that are in MassBank format), whereas if someone runs an internal database, they could use something else.
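
A sketch of a validator for the pattern quoted in this comment, with the database-scale prefix made configurable so that an internal database could use its own. The function `accession_pattern` and the `LOCAL` prefix are hypothetical, and this is not the validation code used by MassBank.

```python
import re


def accession_pattern(prefix="MSBNK"):
    """Compile PREFIX-[A-Z0-9_]{1,32}-[A-Z0-9_]{1,128} for a given prefix."""
    return re.compile(
        rf"{re.escape(prefix)}-[A-Z0-9_]{{1,32}}-[A-Z0-9_]{{1,128}}"
    )


MASSBANK = accession_pattern()
INTERNAL = accession_pattern("LOCAL")  # hypothetical in-house prefix

is_public = MASSBANK.fullmatch("MSBNK-MPC-RPS0000001") is not None
is_internal = INTERNAL.fullmatch("LOCAL-MPC-RPS0000001") is not None
```

With this split, a record carrying an in-house prefix fails the public pattern, which matches the intent that MSBNK- marks records originating from massbank.eu.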

@meowcat
Contributor

meowcat commented Jul 25, 2022

Hi,

now that this is going live:
Is it intended that the records are now in the format MSBNK-[A-Za-z0-9_]{1,32}-[A-Z0-9_]{1,128} (note the lowercase letters in the contributor prefix)? Is this simply easier to handle (rather than uppercasing the contributor prefix)?

@meier-rene
Collaborator Author

I think with the new accession name scheme this has been addressed.
