Curating CH$NAME entries #81

Open
schymane opened this issue Jul 25, 2017 · 21 comments

@schymane
Member

Hi all,

We now have many records from many contributors that describe the same substance but carry different CH$NAME entries (and different combinations of CH$NAME entries). Since RMassBank starts from the Compound List, these CH$NAME fields are not even consistent between RMassBank records if the starting name differs; they also depend on when the compound data was retrieved, etc. How should we go about fixing this?
In terms of MassBank display, the title etc, the FIRST CH$NAME entry is critical. Ideally this should be the same for a unique compound across all contributors - but how do we choose which one is "right" and which CH$NAME entries to keep and which to discard? Which entry should be the FIRST CH$NAME entry?

Some random examples:
"Imidacloprid urea" and "1-[(6-chloropyridin-3-yl)methyl]imidazolidin-2-one"
"Imidacloprid-urea" and "1-[(6-Chloropyridin-3-yl)methyl]imidazolidin-2-one"
"Imidacloprid-urea" and "CHEMBL71188" and "1-[(6-chloropyridin-3-yl)methyl]imidazolidin-2-one"

"2-Isopropyl-6-methyl-pyrimidin-4-ol" and "6-methyl-2-propan-2-yl-1H-pyrimidin-4-one"
"2-Isopropyl-6-methyl-pyrimidin-4-ol" and "6-Methyl-2-propan-2-yl-1H-pyrimidin-4-one"
"Pyrimidinol" and "2-isopropyl-6-methyl-1H-pyrimidin-4-one"
"Pyrimidinol" and "2-Isopropyl-6-methyl-pyrimidin-4-ol" and "6-Methyl-2-propan-2-yl-1H-pyrimidin-4-one"

or the various possibilities (mixed and matched) for Lidocaine:
LID_235.1805_10.1
Lidocaine
2-(diethylamino)-N-(2,6-dimethylphenyl)acetamide
Lidocain
2-(Diethylamino)-N-(2,6-dimethylphenyl)acetamide
Lignocaine
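
To get a feel for how many of these variants differ only in capitalisation, spacing or hyphenation, a rough comparison key can help. The sketch below is only an illustration: `name_key` is a hypothetical helper, not part of RMassBank, and stripping hyphens is a crude heuristic that could in principle merge distinct systematic names.

```python
import re
from collections import defaultdict


def name_key(name: str) -> str:
    """Comparison key that ignores case, whitespace and hyphens (crude heuristic)."""
    return re.sub(r"[\s\-]+", "", name).casefold()


names = [
    "Imidacloprid urea",
    "Imidacloprid-urea",
    "1-[(6-chloropyridin-3-yl)methyl]imidazolidin-2-one",
    "1-[(6-Chloropyridin-3-yl)methyl]imidazolidin-2-one",
    "Lidocaine",
    "Lidocain",
]

# Collect the spellings that share one comparison key.
variants = defaultdict(set)
for name in names:
    variants[name_key(name)].add(name)

for key, spellings in variants.items():
    print(len(spellings), sorted(spellings))
```

Genuine synonyms such as Lidocain, Lidocaine and Lignocaine still end up under separate keys, so string normalisation alone does not settle which name should come first.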

One option would be to just take the first CH$NAME entry of whichever record is processed first for a given compound, but this is somewhat random, only partially reproducible, and in the case of LID_235.1805_10.1 would result in a very strange CH$NAME entry as the primary name. Another choice would be the "Preferred name" from the CompTox Dashboard. That will be fine for the curated MassBank.EU entries we have done, but it will not hold (or be possible) for all of MassBank or, necessarily, for new records until they are registered, and these names are also not always perfect. It would also remove the preferred "primary name" of the contributing institute (i.e. the entry from the compound list), which is something that some people use a lot.

Any thoughts? @meowcat @tsufz @ChemConnector

@meowcat can you remember why we chose only 3 CH$NAME entries? Was this for our sanity within RMassBank rather than because of a numerical restriction (I saw no restriction in the Record Specs)? Will it be a problem if I "curate" our records to have potentially MORE than 3 names? Will RMassBank be able to deal with this when we re-parse records (and if not, can we help it deal with it)? Or do I have to stick with 3 names?

The current search functionality (by name) seems to work for any entry in a CH$NAME field, so this should not be an issue. Will this remain so in the future?

Finally ... does anyone have a sensible idea how we could store and access these names (and related identifiers) and make this future proof - so we can ensure that new contributions are named consistently? Would we be able to access that within RMassBank to check not just the "infolist" entries locally but also to check those already ON MassBank? I.e. have a centralized "infolist"?

Thanks!

@tsufz
Member

tsufz commented Jul 25, 2017

Will discuss the topic with Martin, my opinion is too humble.

@tsufz
Member

tsufz commented Jul 25, 2017

However, the curation of a central list on MassBank should be easier in future. Once the new DB is available, we will have many opportunities to use it. I guess that after refactoring the DB, we will also work on the API? This could be the access point for RMassBank (and other software).

@tsufz
Member

tsufz commented Jul 25, 2017

We suggest a central mapping table in which a preferred name is set automatically and related to all existing names, with manual curation of mismatches, so that only the curated name is used. The original records stay untouched. Something like:

Preferred name; collection of names
AAA; aaa, AaA, aAa, aaA, AAa, Aaa
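
As a minimal sketch of how such a table could be queried, assuming it is loaded into a plain dictionary: `MAPPING`, `build_lookup` and `preferred_name` are hypothetical names for illustration, not an existing MassBank or RMassBank API.

```python
# Hypothetical mapping table: preferred name -> collection of names seen in records.
MAPPING = {
    "AAA": ["aaa", "AaA", "aAa", "aaA", "AAa", "Aaa"],
}


def build_lookup(mapping: dict) -> dict:
    """Invert the table so every known name points to its preferred name."""
    lookup = {}
    for preferred, variants in mapping.items():
        # The preferred name itself is also a valid lookup key.
        for name in [preferred, *variants]:
            previous = lookup.setdefault(name, preferred)
            if previous != preferred:
                # One name mapped to two preferred names: needs manual curation.
                raise ValueError(f"{name!r} maps to both {previous!r} and {preferred!r}")
    return lookup


def preferred_name(name: str, lookup: dict):
    """Return the preferred name, or None if the name is not in the table yet."""
    return lookup.get(name)


lookup = build_lookup(MAPPING)
print(preferred_name("aAa", lookup))
print(preferred_name("bbb", lookup))
```

A `None` result marks a name that is not yet in the table and would go to a curator; the record files themselves are never modified.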

@uchem-massbank

uchem-massbank commented Jul 25, 2017 via email

@tsufz
Member

tsufz commented Jul 26, 2017

Of course, we also need the structural identifier. The list should be collected automatically from the names available in all records. The preferred name could be retrieved from a reliable source, but needs final approval from curators. This is tedious but necessary work, in my opinion.

However, I am not known as a friend of neat ways around the (auto-)curation of records, which should also happen at the record-file level. The goal is to finally get rid of workarounds such as the mapping table.

We are back at the curation discussion in #25. I suggest starting strict curation of all records that are marked accordingly. All spectra except the Waters ones are marked with a CC-BY license, and hence it is possible to curate them. We already did so when injecting the SPLASH, and we should do the same with other things. It is annoying for users to get different names for one compound, wrong links, etc.

Curation will improve reliability in MassBank; starting with the names would be a great step forward.

And finally, the preferred name is part of the collection of names.

@ChemConnector

@schymane I think it's a good idea to look at the CompTox Chemistry Dashboard as a source of "Preferred Names". Certainly they will not always be perfectly matched for this purpose but in the vast majority of cases they will be appropriate. In arranging for mappings to the dashboard we could also coordinate around preferred name assignment. The associated synonyms are always available (for data that is public) so if a particular synonym was preferred over our assigned Preferred Name we could discuss. These designs are particularly subjective in nature after all. Looking forward to helping with this aspect of the project as required.

@tsufz
Member

tsufz commented Jul 26, 2017

@ChemConnector great, thanks for your help. This makes it easier to sort out lazy name tags and to improve things. @Treutler and @naperone, this should be considered in the development of the new DB structure in #9.

@schymane
Member Author

schymane commented Jul 26, 2017 via email

@m-arita

m-arita commented Jul 27, 2017 via email

@tsufz
Member

tsufz commented Feb 14, 2019

With reference to #156, I would like to come back to the discussion on curation of metadata. The issue raised by @schymane is a very good example of why curation of metadata is required, especially the harmonisation of the presented name. I guess it is quite annoying for people to scroll through a list with redundant entries because of different name presentations.

Best
Tobias

@schymane
Member Author

schymane commented Feb 14, 2019 via email

@tsufz
Member

tsufz commented Feb 14, 2019

Jupp, see my comment above (by 25 Jul 2017!)

@meier-rene
Contributor

To solve #156 it is not necessarily needed to finally decide about this issue, but it wouldn't hurt. My feeling of the discussion is that we should not substitute existing names with curated ones, but I don't see any problems in adding new names or adding new fields to the MassBank scheme.

For #156 we need a unique key for grouping, and I propose to use the InChI or a subset of it, such as the first block of the InChIKey, as already discussed above. This is reasonable now because we have added an InChI to every record for which at least a SMILES was available. Only a small fraction of records are left without InChI/SMILES, and for them we need to fall back to grouping by name.
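
A minimal sketch of that grouping rule, assuming each record has already been parsed into a dictionary: the field names, accessions and InChIKeys below are made up for illustration and are not taken from MassBank.

```python
from collections import defaultdict


def group_key(record: dict) -> tuple:
    """Prefer the first block of the InChIKey; fall back to the first name."""
    inchikey = record.get("inchikey")
    if inchikey:
        # The first block is the part of the InChIKey before the first hyphen.
        return ("inchikey_block1", inchikey.split("-")[0])
    return ("name", record["names"][0].casefold())


def group_records(records: list) -> dict:
    """Map each grouping key to the accessions of the records sharing it."""
    groups = defaultdict(list)
    for record in records:
        groups[group_key(record)].append(record["accession"])
    return dict(groups)


# Made-up accessions and InChIKeys, for illustration only.
records = [
    {"accession": "REC001", "names": ["Lidocaine"], "inchikey": "AAAAAAAAAAAAAA-UHFFFAOYSA-N"},
    {"accession": "REC002", "names": ["Lignocaine"], "inchikey": "AAAAAAAAAAAAAA-UHFFFAOYSA-N"},
    {"accession": "REC003", "names": ["Pyrimidinol"], "inchikey": None},
    {"accession": "REC004", "names": ["pyrimidinol"]},
]

groups = group_records(records)
for key, accessions in groups.items():
    print(key, accessions)
```

Tagging each key with its origin keeps name-based groups from ever colliding with InChIKey-based ones.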

The other question is how should the respective group of records be named. Easiest would be first in list. More reasonable maybe most occurring? And of course we can additionally provide a curated list with names for particular compounds, but I doubt that this can be complete. Maybe it could cover the most occurring compounds. Nevertheless, I wouldn't push this idea.

My suggestion: What about adding a uniform synonym for every compound as last CH$NAME field. This can be done by an algorithm (the source of the synonym is still an open question) and does not break existing MassBank format.
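
A rough sketch of the mechanical part, assuming records are plain text with one `CH$NAME: ...` line per name. `append_name` is a hypothetical helper; where the synonym comes from stays open, as noted above.

```python
def append_name(record_text: str, synonym: str) -> str:
    """Insert `synonym` as a new CH$NAME line after the last existing one."""
    prefix = "CH$NAME:"
    lines = record_text.splitlines()
    name_rows = [i for i, line in enumerate(lines) if line.startswith(prefix)]
    if not name_rows:
        raise ValueError("record has no CH$NAME field")

    existing = [lines[i][len(prefix):].strip() for i in name_rows]
    if synonym in existing:
        # Already present somewhere in the name list: leave the record unchanged.
        return record_text

    lines.insert(name_rows[-1] + 1, f"{prefix} {synonym}")
    trailing_newline = "\n" if record_text.endswith("\n") else ""
    return "\n".join(lines) + trailing_newline


# Abbreviated, made-up record for illustration only.
record = (
    "ACCESSION: REC001\n"
    "CH$NAME: Lidocain\n"
    "CH$NAME: Lignocaine\n"
    "CH$FORMULA: C14H22N2O\n"
)
print(append_name(record, "Lidocaine"))
```

All existing lines keep their order and content, so parsers that read only the first CH$NAME entries see the same record as before.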

@schymane
Member Author

> To solve #156 it is not necessarily needed to finally decide about this issue, but it wouldn't hurt. My feeling of the discussion is that we should not substitute existing names with curated ones, but I don't see any problems in adding new names or adding new fields to the MassBank scheme.

I agree here, we should not replace any information (unless it is blatantly wrong of course) but just add extra fields if we need, as you suggest below. I would not like to see 100s of synonym entries added to records.

> For #156 we need a unique key for grouping, and I propose to use the InChI or a subset of it, such as the first block of the InChIKey, as already discussed above. This is reasonable now because we have added an InChI to every record for which at least a SMILES was available. Only a small fraction of records are left without InChI/SMILES, and for them we need to fall back to grouping by name.

Again I agree - I think InChIKey or InChIKey first block would be the appropriate way to go, defaulting to grouping by name for those with missing entries. I am caught between grouping by first block or not; each has advantages and disadvantages.

> The other question is how should the respective group of records be named. Easiest would be first in list. More reasonable maybe most occurring? And of course we can additionally provide a curated list with names for particular compounds, but I doubt that this can be complete. Maybe it could cover the most occurring compounds. Nevertheless, I wouldn't push this idea.

Display name: First in list would be a random choice; the "most occurring" likely a better option that I would prefer and should be easy enough to manage?

> My suggestion: What about adding a uniform synonym for every compound as last CH$NAME field. This can be done by an algorithm (the source of the synonym is still an open question) and does not break existing MassBank format.

OK in principle, but what if the name is repeated? (Maybe not a problem, just aesthetically displeasing to see unnecessary repetition.) An alternative is to introduce a second field, CH$DISPNAME or CH$PREFNAME (also not ideal). My thoughts at least...

@sneumann
Member

CTS had a way to guess the "best" name, using some scoring. Not sure if they still do.

What happens to InChIKey grouping for Emma's tentatives, where we are not entirely certain about the structure?

@schymane
Member Author

schymane commented Feb 14, 2019 via email

@Treutler
Contributor

Regarding the display of groups of searched records, I vote for using the full InChI/InChIKey as the grouping criterion. Then we can safely display a compound name and a structure. If we group isomeric structures, both the compound name and the structure will be wrong for a subset of the respective group. For the selection of the displayed name, using the most common name or the shortest name are possible ways to go, but this is not really critical in my eyes.
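
To make the two naming options concrete, here is a small sketch, assuming the first CH$NAME of every record in a group has already been collected into a list. `display_name` and its tie-breaking rules are my own assumptions, not an agreed MassBank rule.

```python
from collections import Counter


def display_name(first_names: list, strategy: str = "most_common") -> str:
    """Pick one name for a group of records sharing the same full InChIKey."""
    if strategy == "shortest":
        # Ties are broken alphabetically so the result is reproducible.
        return min(first_names, key=lambda name: (len(name), name))
    counts = Counter(first_names)
    # Highest count first; ties broken by length, then alphabetically.
    return min(counts, key=lambda name: (-counts[name], len(name), name))


# First CH$NAME entries of a group, using the Lidocaine variants from this thread.
first_names = ["Lidocaine", "Lidocain", "Lidocaine", "LID_235.1805_10.1", "Lignocaine"]

print(display_name(first_names))
print(display_name(first_names, strategy="shortest"))
```

With these inputs the two strategies disagree: the most common name is "Lidocaine", while the shortest is the variant "Lidocain".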

@schymane
Member Author

schymane commented Feb 25, 2019 via email

@sneumann
Member

sneumann commented Apr 2, 2019

Hi, I would like to suggest DISPLAYNAME, as people might take 'preferred' personally, if their favorite is not the preferred one.
Yours Steffen

@schymane
Member Author

schymane commented Apr 2, 2019

I'd vote for CH$DISPLAY_NAME or similar as well!

@tsufz
Member

tsufz commented Apr 3, 2019

I agree with CH$DISPLAY_NAME or similar.
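
For readers of the record files, a lookup could then work roughly as sketched below. Note that CH$DISPLAY_NAME is only the field name proposed in this thread, not part of the current record format, and `display_name_of` is a hypothetical helper.

```python
def display_name_of(record_text: str):
    """Return CH$DISPLAY_NAME if present, otherwise the first CH$NAME (or None)."""
    first_name = None
    for line in record_text.splitlines():
        tag, separator, value = line.partition(": ")
        if not separator:
            continue  # not a "TAG: value" line
        if tag == "CH$DISPLAY_NAME":  # proposed field, assumed for illustration
            return value.strip()
        if tag == "CH$NAME" and first_name is None:
            first_name = value.strip()
    return first_name


# Abbreviated, made-up records for illustration only.
old_record = "ACCESSION: REC001\nCH$NAME: LID_235.1805_10.1\nCH$NAME: Lidocaine\n"
new_record = old_record + "CH$DISPLAY_NAME: Lidocaine\n"

print(display_name_of(old_record))
print(display_name_of(new_record))
```

Falling back to the first CH$NAME keeps existing records displayable without any change.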
