Multi-record MassBank format #110

Closed
sneumann opened this issue Apr 20, 2018 · 4 comments

@sneumann
Member

Hi, we currently specify one record per file. If the collection grows immensely,
this might hit file system limitations. As an exchange format, we could
specify and allow multi-record MassBank files.

Fun exercise:

git clone https://github.com/MassBank/MassBank-data.git
cd MassBank-data
find . -name "*.txt" | xargs cat | wc -l 
# 4.132.269

This might also be interesting in the light of https://github.com/HUPO-PSI/SpectralLibraryFormat

Yours,
Steffen

ACCESSION: EA000401
RECORD_TITLE: Metamitron-desamino; LC-ESI-ITFT; MS2; CE: 35%; R=7500; [M+H]+
DATE: 2014.01.14
AUTHORS: Stravs M, Schymanski E, Singer H, Department of Environmental Chemistry, Eawag
LICENSE: CC BY
COPYRIGHT: Copyright (C) 2012 Eawag, Duebendorf, Switzerland
COMMENT: CONFIDENCE standard compound
COMMENT: EAWAG_UCHEM_ID 4
CH$NAME: Metamitron-desamino
CH$NAME: 3-Methyl-6-phenyl-1,2,4-triazin-5-ol
CH$COMPOUND_CLASS: N/A; Environmental Standard
CH$FORMULA: C10H9N3O1
CH$EXACT_MASS: 187.0746
CH$SMILES: c(ccc1C(=NN=C2C)C(=O)N2)cc1
CH$IUPAC: InChI=1S/C10H9N3O/c1-7-11-10(14)9(13-12-7)8-5-3-2-4-6-8/h2-6H,1H3,(H,11,12,14)
CH$LINK: CAS 36993-94-9
CH$LINK: PUBCHEM CID:181502
CH$LINK: INCHIKEY OUSYWCQYMPDAEO-UHFFFAOYSA-N
CH$LINK: CHEMSPIDER 157884
AC$INSTRUMENT: LTQ Orbitrap XL Thermo Scientific
AC$INSTRUMENT_TYPE: LC-ESI-ITFT
AC$MASS_SPECTROMETRY: MS_TYPE MS2
AC$MASS_SPECTROMETRY: ION_MODE POSITIVE
AC$MASS_SPECTROMETRY: IONIZATION ESI
AC$MASS_SPECTROMETRY: FRAGMENTATION_MODE CID
AC$MASS_SPECTROMETRY: COLLISION_ENERGY 35 % (nominal)
AC$MASS_SPECTROMETRY: RESOLUTION 7500
AC$CHROMATOGRAPHY: COLUMN_NAME XBridge C18 3.5um, 2.1x50mm, Waters
AC$CHROMATOGRAPHY: FLOW_GRADIENT 90/10 at 0 min, 50/50 at 4 min, 5/95 at 17 min, 5/95 at 25 min, 90/10 at 25.1 min, 90/10 at 30 min
AC$CHROMATOGRAPHY: FLOW_RATE 200 ul/min
AC$CHROMATOGRAPHY: RETENTION_TIME 5.1 min
AC$CHROMATOGRAPHY: SOLVENT A water with 0.1% formic acid
AC$CHROMATOGRAPHY: SOLVENT B methanol with 0.1% formic acid
MS$FOCUSED_ION: BASE_PEAK 188.0824
MS$FOCUSED_ION: PRECURSOR_M/Z 188.0818
MS$FOCUSED_ION: PRECURSOR_TYPE [M+H]+
MS$DATA_PROCESSING: DEPROFILE Spline
MS$DATA_PROCESSING: RECALIBRATE loess on assigned fragments and MS1
MS$DATA_PROCESSING: REANALYZE Peaks with additional N2/O included
MS$DATA_PROCESSING: WHOLE RMassBank 1.3.1
PK$SPLASH: splash10-03di-0900000000-7ebbace7bb3df63350bc
PK$ANNOTATION: m/z tentative_formula formula_count mass error(ppm)
  77.0385 C6H5+ 1 77.0386 -0.73
...
  188.082 C10H10N3O+ 1 188.0818 0.97
PK$NUM_PEAK: 7
PK$PEAK: m/z int. rel.int.
  77.0385 63034.2 5
  85.0396 204249.9 17
...
  160.0871 11464205.7 999
  188.082 990072.7 86
//
ACCESSION: EA000402
RECORD_TITLE: Metamitron-desamino; LC-ESI-ITFT; MS2; CE: 15%; R=7500; [M+H]+
DATE: 2014.01.14
AUTHORS: Stravs M, Schymanski E, Singer H, Department of Environmental Chemistry, Eawag
LICENSE: CC BY
COPYRIGHT: Copyright (C) 2012 Eawag, Duebendorf, Switzerland
COMMENT: CONFIDENCE standard compound
COMMENT: EAWAG_UCHEM_ID 4
CH$NAME: Metamitron-desamino
CH$NAME: 3-Methyl-6-phenyl-1,2,4-triazin-5-ol
CH$COMPOUND_CLASS: N/A; Environmental Standard
CH$FORMULA: C10H9N3O1
CH$EXACT_MASS: 187.0746
CH$SMILES: c(ccc1C(=NN=C2C)C(=O)N2)cc1
CH$IUPAC: InChI=1S/C10H9N3O/c1-7-11-10(14)9(13-12-7)8-5-3-2-4-6-8/h2-6H,1H3,(H,11,12,14)
CH$LINK: CAS 36993-94-9
CH$LINK: PUBCHEM CID:181502
CH$LINK: INCHIKEY OUSYWCQYMPDAEO-UHFFFAOYSA-N
CH$LINK: CHEMSPIDER 157884
...
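
The `//` line that terminates each record in the example above is what would make plain concatenation usable as an exchange format. As a rough sketch only (the function name `split_records` is hypothetical and not part of any MassBank tooling, and it assumes every record ends with a line containing just `//` and carries exactly one `ACCESSION:` line), a reader could split such a file like this:

```python
from typing import Dict, Iterable, List


def split_records(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Split concatenated MassBank text into records keyed by ACCESSION.

    Assumption: each record is closed by a line containing only '//'.
    """
    records: Dict[str, List[str]] = {}
    current: List[str] = []
    for raw in lines:
        line = raw.rstrip("\n")
        current.append(line)
        if line.strip() == "//":
            # Record is complete: look up its accession number.
            accession = next(
                (l.split(":", 1)[1].strip()
                 for l in current if l.startswith("ACCESSION:")),
                None,
            )
            if accession is None:
                raise ValueError("record without an ACCESSION line")
            records[accession] = current
            current = []
    # Anything left over was never closed by '//'.
    if any(l.strip() for l in current):
        raise ValueError("trailing lines without a closing '//'")
    return records
```

Usage would be along the lines of `with open("records.txt") as fh: records = split_records(fh)`, where `records.txt` is a placeholder file name.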
@schymane
Member

very, very interesting idea and I had a lot of thoughts on this while reading this paper (I was lucky enough to get a sneak peek) which is now finally out :-)
https://onlinelibrary.wiley.com/doi/abs/10.1002/mrc.4737
[the SDF way is a neat idea but comes with a whole lot more space requirement and would require a whole lot of fundamental changes obviously, but the principle of what they achieved is very cool and I'd see a few ways we could/should start towards something similar for MS]

Quick comment: why repeat all the duplicate information? This could be done much more compactly by having e.g. all compound/measurement specific data at the top and only repeating spectrum-specific parameters. This would obviously have to be defined carefully (and has pros and cons).
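
To get a feeling for how much is actually duplicated, one could compare the records of a compound line by line. The sketch below is only a rough estimate under simple assumptions (exact line matches, records already split into lists of lines); `factor_shared_lines` is a hypothetical helper name and does not define a format:

```python
from typing import List, Tuple


def factor_shared_lines(records: List[List[str]]) -> Tuple[List[str], List[List[str]]]:
    """Return (shared, specific) for a group of records.

    shared   - lines that occur verbatim in every record (in first-record order)
    specific - per record, the lines that are not shared
    The '//' terminator is ignored, since it is structure rather than data.
    """
    bodies = [[l for l in rec if l.strip() != "//"] for rec in records]
    if not bodies:
        return [], []
    common = set(bodies[0]).intersection(*(set(b) for b in bodies[1:]))
    shared = [l for l in dict.fromkeys(bodies[0]) if l in common]
    specific = [[l for l in body if l not in common] for body in bodies]
    return shared, specific
```

Note that a real compact format could not simply move lines around like this: multi-line blocks such as `PK$PEAK` depend on their header line staying in place, which is one of the things that would have to be defined carefully.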

I could say a lot more ... many thoughts and conversations on this ....

@schymane
Member

@ChemConnector @meowcat I'm sure you also have thoughts :-) as will many others ..

@tsufz
Member

tsufz commented Apr 23, 2018

I like the idea of a merged record format. But I would not merge information from different records into a header. The problem is not the overall size of those spectral library files. There is enough space to store large files.

The file system limitation is the number of so-called inodes available for storing the files, right Steffen? For example, 1 024 000 000 text files of one byte each hold only about 1 GB of data (which is nothing today), but they also occupy 1 024 000 000 inodes, so the file system could run out of inodes. In NTFS (Windows), small files are stored directly in the MFT (master file table) and thus use fewer of them. But in Linux file systems they are not, IMHO.
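
How close a given machine is to that limit can be checked directly. A minimal sketch using Python's standard `os.statvfs` (Unix only; `inode_usage` is a hypothetical helper name), which reports the per-file entries mentioned above (inodes on Linux file systems):

```python
import os
from typing import Dict


def inode_usage(path: str = ".") -> Dict[str, int]:
    """Report inode counts for the file system that contains `path`.

    Some file systems allocate inodes dynamically and may report 0 here,
    so treat the numbers as indicative only.
    """
    st = os.statvfs(path)
    total = st.f_files   # total number of inodes
    free = st.f_ffree    # number of free inodes
    return {"total": total, "free": free, "used": total - free}


if __name__ == "__main__":
    print(inode_usage("."))
```

On the command line, `df -i` shows the same kind of information.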

Merging information from records leads to many problems with conventions, typos, etc. Vendors and users would also need dedicated programs that read the headers, search for the records, merge everything, and then write it to the internal library. Importing records one by one is much easier (see SDF, MSP, etc.).

@Treutler
Contributor

Currently, we see no point in introducing a multi-record format, because we are far away from file system limitations. Hence, closing.
