CASMI2016 records with compound/spectrum mismatch #9

Open
schymane opened this issue Apr 27, 2018 · 13 comments

@schymane
Member

A user reported that SM858902 and SM858951 contain spectral data from acetylsulfamethoxazole but are labeled diphenhydramine (thank you!). On closer inspection, we seem to have a mismatch between the ID and the precursor/peaks for 3 IDs (4 records) in a series, surrounded by records that look OK; the series is "broken" by missing IDs in the middle. We also need to find the cause in https://github.com/MassBank/RMassBank

This should not pass any form of validation. A screening of the entire CASMI2016 database would be extremely useful for debugging the cause and for determining which records, and how many, need fixing. Thank you in advance, @meier-rene, if you can :-)

From what I can see:
**this one looks OK.
ACCESSION: SM858203
RECORD_TITLE: Cetirizine; LC-ESI-QFT; MS2; CE: 35 NCE; R=35000; [M+H]+
CH$FORMULA: C21H25ClN2O3
CH$EXACT_MASS: 388.15537
MS$FOCUSED_ION: PRECURSOR_M/Z 389.1626
389.1626 C21H26ClN2O3+ 1 389.1626 -0.05

**this one looks OK.
ACCESSION: SM858353
RECORD_TITLE: 2-Hydroxycarbamazepine; LC-ESI-QFT; MS2; CE: 35 NCE; R=35000; [M-H]-
CH$FORMULA: C15H12N2O2
CH$EXACT_MASS: 252.08988
MS$FOCUSED_ION: PRECURSOR_M/Z 251.0826
251.0827 C15H11N2O2- 1 251.0826 0.4

[no records with IDs between 8583 and 8588]

** here something has gone wrong
ACCESSION: SM858801
RECORD_TITLE: Finasteride; LC-ESI-QFT; MS2; CE: 35 NCE; R=35000; [M+H]+
CH$FORMULA: C23H36N2O2
CH$EXACT_MASS: 372.27768
MS$FOCUSED_ION: PRECURSOR_M/Z 256.1696

** here something has gone wrong
ACCESSION: SM858902
RECORD_TITLE: Diphenhydramine; LC-ESI-QFT; MS2; CE: 35 NCE; R=35000; [M+H]+
CH$FORMULA: C17H21NO
CH$EXACT_MASS: 255.16231
MS$FOCUSED_ION: PRECURSOR_M/Z 296.07

** still wrong ... it's using the same (wrong) exact mass to get equivalent wrong precursor
ACCESSION: SM858951
RECORD_TITLE: Diphenhydramine; LC-ESI-QFT; MS2; CE: 35 NCE; R=35000; [M-H]-
CH$FORMULA: C17H21NO
CH$EXACT_MASS: 255.16231
MS$FOCUSED_ION: PRECURSOR_M/Z 294.0554

** still wrong:
ACCESSION: SM859002
RECORD_TITLE: Acetyl-sulfamethoxazole; LC-ESI-QFT; MS2; CE: 35 NCE; R=35000; [M+H]+
CH$FORMULA: C12H13N3O4S
CH$EXACT_MASS: 295.06268
MS$FOCUSED_ION: PRECURSOR_M/Z 325.1711
325.171 C20H22FN2O+ 1 325.1711 -0.17 <= we have F annotations!!!!!

[no 8591]

** and now everything seems OK again ...
ACCESSION: SM859203
RECORD_TITLE: Amitriptyline; LC-ESI-QFT; MS2; CE: 35 NCE; R=35000; [M+H]+
CH$FORMULA: C20H23N
CH$EXACT_MASS: 277.18305
MS$FOCUSED_ION: PRECURSOR_M/Z 278.1903
278.1904 C20H24N+ 1 278.1903 0.42

@schymane
Member Author

So, I just ran getMBRecordInfo (https://github.com/schymane/ReSOLUTION/) on the CASMI2016 directory from the OpenData SVN, extracting the precursor m/z and exact mass automatically. Checking the difference between the two flags exactly these 4 records, and only these, as deviating from the expected difference of about ±1.007:
SM858801, SM858902, SM858951, SM859002
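
For readers who want to repeat this check without ReSOLUTION, here is a minimal Python sketch of the same idea. It is not the getMBRecordInfo implementation; the function names `parse_masses` and `is_mismatch` and the 0.01 Da tolerance are assumptions made for illustration, and it only handles [M+H]+ and [M-H]- records, where the precursor should differ from the exact mass by roughly one proton mass. The record excerpts are the ones quoted earlier in this issue.

```python
PROTON_MASS = 1.007276  # Da; expected |precursor - exact mass| for [M+H]+ / [M-H]-

# Excerpts copied from the records quoted in this issue.
RECORDS = [
    "ACCESSION: SM858203\nCH$EXACT_MASS: 388.15537\nMS$FOCUSED_ION: PRECURSOR_M/Z 389.1626",
    "ACCESSION: SM858353\nCH$EXACT_MASS: 252.08988\nMS$FOCUSED_ION: PRECURSOR_M/Z 251.0826",
    "ACCESSION: SM858801\nCH$EXACT_MASS: 372.27768\nMS$FOCUSED_ION: PRECURSOR_M/Z 256.1696",
    "ACCESSION: SM858902\nCH$EXACT_MASS: 255.16231\nMS$FOCUSED_ION: PRECURSOR_M/Z 296.07",
    "ACCESSION: SM858951\nCH$EXACT_MASS: 255.16231\nMS$FOCUSED_ION: PRECURSOR_M/Z 294.0554",
    "ACCESSION: SM859002\nCH$EXACT_MASS: 295.06268\nMS$FOCUSED_ION: PRECURSOR_M/Z 325.1711",
    "ACCESSION: SM859203\nCH$EXACT_MASS: 277.18305\nMS$FOCUSED_ION: PRECURSOR_M/Z 278.1903",
]


def parse_masses(record_text):
    """Return (accession, exact_mass, precursor_mz) parsed from record text."""
    accession = exact_mass = precursor_mz = None
    for line in record_text.splitlines():
        line = line.strip()
        if line.startswith("ACCESSION:"):
            accession = line.split(":", 1)[1].strip()
        elif line.startswith("CH$EXACT_MASS:"):
            exact_mass = float(line.split(":", 1)[1])
        elif line.startswith("MS$FOCUSED_ION: PRECURSOR_M/Z"):
            precursor_mz = float(line.split()[-1])
    return accession, exact_mass, precursor_mz


def is_mismatch(exact_mass, precursor_mz, tol=0.01):
    """True if the precursor is not one proton mass away from the exact mass.

    Assumption: only singly charged [M+H]+ / [M-H]- records are checked.
    """
    return abs(abs(precursor_mz - exact_mass) - PROTON_MASS) > tol


flagged = []
for text in RECORDS:
    accession, exact, precursor = parse_masses(text)
    if is_mismatch(exact, precursor):
        flagged.append(accession)

print(flagged)
```

On the excerpts above, this flags the same four accessions listed in this comment and none of the surrounding records. Other adducts would need their own expected mass shift.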

@schymane
Member Author

Thanks to a diagnosis from Herbert Oberacher, the case is now clear (see issue online for case history):

SM858801 is diphenhydramine
SM858902 and SM858951 are Acetyl-sulfamethoxazole
SM859002 is citalopram
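
These assignments can be cross-checked against the precursor values quoted at the top of the issue. The Python sketch below is an illustration only: it computes monoisotopic masses from the formulas and compares the expected [M+H]+ or [M-H]- m/z with the recorded precursor. The helper name `exact_mass` is hypothetical, the atomic masses are standard monoisotopic values, and the citalopram formula C20H21FN2O is inferred from the C20H22FN2O+ annotation in SM859002 by removing one hydrogen.

```python
import re

# Standard monoisotopic masses (Da) for the elements needed here.
MONOISOTOPIC = {
    "C": 12.0, "H": 1.00782503, "N": 14.0030740,
    "O": 15.9949146, "F": 18.9984032, "S": 31.9720707,
}
PROTON_MASS = 1.007276  # Da


def exact_mass(formula):
    """Monoisotopic mass of a simple formula such as 'C17H21NO'."""
    total = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += MONOISOTOPIC[element] * (int(count) if count else 1)
    return total


# (accession, diagnosed formula, charge sign from RECORD_TITLE, recorded precursor m/z)
ASSIGNMENTS = [
    ("SM858801", "C17H21NO", +1, 256.1696),     # diphenhydramine, [M+H]+
    ("SM858902", "C12H13N3O4S", +1, 296.07),    # acetyl-sulfamethoxazole, [M+H]+
    ("SM858951", "C12H13N3O4S", -1, 294.0554),  # acetyl-sulfamethoxazole, [M-H]-
    ("SM859002", "C20H21FN2O", +1, 325.1711),   # citalopram (formula inferred), [M+H]+
]

errors = []
for accession, formula, sign, recorded in ASSIGNMENTS:
    expected = exact_mass(formula) + sign * PROTON_MASS
    errors.append((accession, recorded - expected))

for accession, err in errors:
    print(f"{accession}: recorded - expected = {err:+.4f} Da")
```

All four diagnosed compounds reproduce the recorded precursor to well within 0.001 Da, whereas the original label for SM858801 (finasteride, C23H36N2O2) would predict an [M+H]+ more than 100 Da away from the recorded 256.1696.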

So, how to update? If I update the compound information to match the spectra, we will have a mismatch between the internal IDs, the UFZ IDs and the MassBank accession numbers. However, if I change to the correct internal IDs, we will be changing accession numbers, and I think this is worse. If I hear nothing back, I will correct the compound information in these four records and send along updates when I get a chance.

@meier-rene @tsufz @meowcat

@meier-rene
Collaborator

Is deleting the incorrect records and adding new and correct records an option?

@schymane
Member Author

Well, the records need to be fixed, that is for sure. However, if I correct the processing error, we will end up with new accession numbers, and I am not sure this is the right way to fix it in this case. This is the compound list ... it is still inexplicable how this happened, as it should be more or less impossible given the way RMassBank works, but something certainly went wrong! According to the compound list, 8588 is certainly meant to be Finasteride, but the record ended up with the compound info of finasteride and the spectral data of diphenhydramine ... do you see the problem? If I now reprocess, the SM858801 record will turn into SM858901 and SM858902 will become SM859002 ...
I think it would be best to update the compound info and keep the current accession numbers; otherwise we are going to run into awful versioning problems.

[image: compound list]

@meowcat
Contributor

meowcat commented Mar 25, 2019

I understand the problem. Is it a reasonable option to upload the records under a tag that is not SM? That way the new records (say, SZ) would have the correct internal ID, and the old ones could be marked obsolete... Just an idea, not yet thought through.

@schymane
Member Author

schymane commented Mar 25, 2019 via email

@schymane
Member Author

OK, here goes with a complicated update to address the issues in the CASMI spectra. I suggest @meier-rene implements this on the MassBank-data side; I'll double-check to confirm once it is done and comment on the commit where necessary (@meier-rene, other ideas are welcome if you see an alternative). This has been double-checked with the data source (Martin Krauss).
Note for the record: NONE of these issues actually affected the CASMI contest. The cause was an inadvertent upload of files that had been extracted but were eliminated during quality control for the contest. But we need to fix the database now ;-)

https://massbank.eu/MassBank/RecordDisplay.jsp?id=SM872102
This is a spectrum of Exemestane (identical SPLASH), please update the compound information in SM872102 to match the compound information of Exemestane in this record:
https://massbank.eu/MassBank/RecordDisplay.jsp?id=SM873802&dsn=CASMI_2016

https://massbank.eu/MassBank/RecordDisplay.jsp?id=SM871901
This is a spectrum of Trenbolone (identical SPLASH), please update the compound information in SM871901 to match the compound information of Trenbolone in this record:
https://massbank.eu/MassBank/RecordDisplay.jsp?id=SM874601&dsn=CASMI_2016

https://massbank.eu/MassBank/RecordDisplay.jsp?id=SM840901
This should be simazine, please take the compound information from SM841901
The analytical information is correct.

https://massbank.eu/MassBank/RecordDisplay.jsp?id=SM841901
This should be Desethylterbutylazine, please take the compound information from SM840901.
The analytical information is correct.

The other ones we need to correct are indicated above, i.e.
SM858801 is diphenhydramine => please take compound information from the current
https://massbank.eu/MassBank/RecordDisplay.jsp?id=SM858902&dsn=CASMI_2016
The analytical information is correct.

SM858902 and SM858951 are Acetyl-sulfamethoxazole => please take compound information from the current https://massbank.eu/MassBank/RecordDisplay.jsp?id=SM859002&dsn=CASMI_2016
The analytical information is correct.

SM859002 is citalopram => please take compound information from an existing record, for instance:
https://massbank.eu/MassBank/RecordDisplay.jsp?id=EA290112&dsn=Eawag
The analytical information is correct.

By compound information I mean the CH$ entries, i.e.:
CH$NAME: Diphenhydramine
CH$NAME: 2-benzhydryloxy-N,N-dimethylethanamine
CH$COMPOUND_CLASS: N/A; Environmental Standard
CH$FORMULA: C17H21NO
CH$EXACT_MASS: 255.16231
CH$SMILES: CN(C)CCOC(c1ccccc1)c1ccccc1
CH$IUPAC: InChI=1S/C17H21NO/c1-18(2)13-14-19-17(15-9-5-3-6-10-15)16-11-7-4-8-12-16/h3-12,17H,13-14H2,1-2H3
CH$LINK: CAS 58-73-1
CH$LINK: CHEBI 4636
CH$LINK: KEGG D00300
CH$LINK: PUBCHEM CID:3100
CH$LINK: INCHIKEY ZZVUWRFHKOJYTH-UHFFFAOYSA-N
CH$LINK: CHEMSPIDER 2989
CH$LINK: COMPTOX DTXSID4022949
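
For whoever ends up applying these updates, the swap described above could be scripted roughly as follows. This is only a sketch under stated assumptions, not an existing MassBank or RMassBank tool: `swap_compound_info` is a hypothetical helper, it assumes the CH$ lines form one contiguous block in each record, and it leaves everything else (ACCESSION, analytical AC$/MS$ fields, peaks) untouched. It does not touch RECORD_TITLE, which also carries the compound name and would need updating separately.

```python
def swap_compound_info(target_text, donor_text):
    """Return target_text with its CH$ block replaced by the donor's CH$ lines.

    Assumptions: CH$ lines are contiguous in the target; all non-CH$ lines
    (accession, analytical information, peaks) are kept exactly as they are.
    """
    donor_ch = [line for line in donor_text.splitlines() if line.startswith("CH$")]
    out = []
    inserted = False
    for line in target_text.splitlines():
        if line.startswith("CH$"):
            if not inserted:
                out.extend(donor_ch)  # donor block goes where the old block started
                inserted = True
            continue  # drop the target's old CH$ lines
        out.append(line)
    return "\n".join(out)


# Shortened excerpts based on the records quoted in this issue.
target = "\n".join([
    "ACCESSION: SM858801",
    "CH$NAME: Finasteride",
    "CH$FORMULA: C23H36N2O2",
    "CH$EXACT_MASS: 372.27768",
    "MS$FOCUSED_ION: PRECURSOR_M/Z 256.1696",
])
donor = "\n".join([
    "ACCESSION: SM858902",
    "CH$NAME: Diphenhydramine",
    "CH$FORMULA: C17H21NO",
    "CH$EXACT_MASS: 255.16231",
    "MS$FOCUSED_ION: PRECURSOR_M/Z 296.07",
])

fixed = swap_compound_info(target, donor)
print(fixed)
```

Because SM840901/SM841901 exchange compound information with each other, and SM858902 both donates to SM858801 and receives from SM859002, both donor and target texts should be read from the current records before any file is written.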

@tsufz
Member

tsufz commented Nov 8, 2019

@schymane Who should curate this data?

@schymane
Member Author

schymane commented Nov 8, 2019

I hoped @meier-rene could do this, but if not, someone just needs to update the files; all the info is there ...

@tsufz
Member

tsufz commented Nov 9, 2019

@schymane Come on, you generated them, so why don't you curate them yourself? Or have they been copied from, for example, UFZ records?

@schymane
Member Author

schymane commented Nov 9, 2019

At one point Rene said he'd do things centrally. This one is tough and I see why he didn't update it, I'll do it when I have a chance but I currently don't have time. Likely during Biohackathon. If you get to it first I'll be overjoyed. If not I'll do it when I get the chance ..

@tsufz
Member

tsufz commented Nov 11, 2019

Okay, first come, first served.

@schymane
Member Author

So, the move to the dev branch after I had forked the MassBank-data repo has caused a lot of unexpected issues. @meier-rene is walking me through fixing this before we can change anything. I've had to delete the whole repo and hope that starting from scratch will fix things. Still cloning ...

schymane added a commit to schymane/MassBank-data that referenced this issue Nov 20, 2019
schymane added a commit to schymane/MassBank-data that referenced this issue Nov 20, 2019
schymane added a commit to schymane/MassBank-data that referenced this issue Nov 20, 2019
schymane added a commit to schymane/MassBank-data that referenced this issue Nov 20, 2019
schymane added a commit to schymane/MassBank-data that referenced this issue Nov 20, 2019
MassBank#9

(both pos and neg spectra for this compound)
schymane added a commit to schymane/MassBank-data that referenced this issue Nov 20, 2019