In [1]:
from matchms.importing import load_from_msp
from raims.split import inchikey_mapping, random_split, split_a, split_b, save_partitions_as_msp

### Download and load MoNA dataset

MoNA (MassBank of North America) is a freely available GC-MS dataset distributed under the GNU GPL license.


In [2]:
!mkdir --parents data/src
!wget -O data/src/MoNA-export-GC-MS_Spectra.zip https://mona.fiehnlab.ucdavis.edu/rest/downloads/retrieve/fac60e0e-6322-4596-8b03-c1dd211e6454
!unzip data/src/MoNA-export-GC-MS_Spectra.zip -d data/src

--2022-03-23 13:55:08--  https://mona.fiehnlab.ucdavis.edu/rest/downloads/retrieve/fac60e0e-6322-4596-8b03-c1dd211e6454
Resolving mona.fiehnlab.ucdavis.edu (mona.fiehnlab.ucdavis.edu)... 128.120.143.183
Connecting to mona.fiehnlab.ucdavis.edu (mona.fiehnlab.ucdavis.edu)|128.120.143.183|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/octet-stream]
Saving to: ‘data/src/MoNA-export-GC-MS_Spectra.zip’

data/src/MoNA-expor     [          <=>       ]   7.46M  2.36MB/s    in 3.2s    

2022-03-23 13:55:12 (2.36 MB/s) - ‘data/src/MoNA-export-GC-MS_Spectra.zip’ saved [7824749]

Archive:  data/src/MoNA-export-GC-MS_Spectra.zip
  inflating: data/src/MoNA-export-GC-MS_Spectra.msp  


**Note:** The download link above may no longer be valid by the time you run this notebook. If you encounter any problems, please contact us.

In [3]:
mona = list(load_from_msp('data/src/MoNA-export-GC-MS_Spectra.msp'))



#### MoNA random split

Perform a purely random split of the dataset into partitions of the specified fractional sizes.

In [4]:
mona_random_split = random_split(dataset=mona, partitions=[.8, .1, .1])
save_partitions_as_msp(folder='data/split/mona-random', filenames=['train.msp', 'test.msp', 'val.msp'], partitions=mona_random_split)
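
To make the idea of a fractional random split concrete, here is a minimal sketch on a plain Python list. It is not the `raims` implementation: the function name `random_split_sketch`, the `seed` parameter, and the rule that the last partition absorbs any rounding remainder are all assumptions made for illustration.

```python
import random


def random_split_sketch(items, fractions, seed=0):
    """Shuffle items and cut them into consecutive partitions of the given fractional sizes.

    Hypothetical helper for illustration only; not part of raims.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)

    partitions, start = [], 0
    for i, fraction in enumerate(fractions):
        if i == len(fractions) - 1:
            # Assumption: the last partition takes whatever is left,
            # so rounding never drops an item.
            end = len(shuffled)
        else:
            end = start + round(fraction * len(shuffled))
        partitions.append(shuffled[start:end])
        start = end
    return partitions


# Integers stand in for spectra here.
train, test, val = random_split_sketch(range(100), [.8, .1, .1])
```

Because the split ignores compound identity, spectra of the same compound can end up in different partitions.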

#### MoNA InChIKey mapping

Group the provided list of spectra into a dictionary of buckets, where each bucket holds the spectra that share the same InChIKey.

In [5]:
mona_mapping = inchikey_mapping(spectra=mona, key='inchikey')

Dropping 6 entries out of 18902 records due to missing InChIKey
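
The grouping step can be pictured with a small sketch in which plain dictionaries stand in for spectra. This is an assumption-laden illustration rather than the `raims` code: `group_by_key` is a hypothetical helper, and treating a missing or empty key as a reason to drop the record is inferred from the message above.

```python
from collections import defaultdict


def group_by_key(records, key="inchikey"):
    """Group records into buckets by a metadata key and count records lacking it.

    Hypothetical helper for illustration only; not part of raims.
    """
    buckets = defaultdict(list)
    dropped = 0
    for record in records:
        value = record.get(key)
        if not value:
            # Assumption: records without the key cannot be assigned to a bucket.
            dropped += 1
            continue
        buckets[value].append(record)
    return dict(buckets), dropped


# Toy records; the key values are placeholders, not real InChIKeys.
records = [
    {"name": "a", "inchikey": "KEY-1"},
    {"name": "b", "inchikey": "KEY-2"},
    {"name": "c", "inchikey": "KEY-1"},
    {"name": "d"},
]
mapping, dropped = group_by_key(records)
```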


#### MoNA split A

The first occurrence of each unique key (e.g. InChIKey) is put into the training partition. The remaining spectra are randomly split into the testing and validation partitions such that these two partitions do not share a single key.

In [6]:
mona_a_split = split_a(mapping=mona_mapping)
save_partitions_as_msp(folder='data/split/mona-a', filenames=['train.msp', 'test.msp', 'val.msp'], partitions=mona_a_split)
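
The following sketch shows one way the rule described above could be realized on a toy mapping. It is not the `raims` implementation: `split_a_sketch` is a hypothetical name, and sending the whole remainder of a bucket to either testing or validation with equal probability is an assumption chosen only because it keeps the two partitions key-disjoint.

```python
import random


def split_a_sketch(mapping, seed=0):
    """Put the first item of every bucket into train; split the rest between test and val.

    Hypothetical helper for illustration only; not part of raims.
    """
    rng = random.Random(seed)
    train, test, val = [], [], []
    for bucket in mapping.values():
        train.append(bucket[0])
        # Assumption: the remainder of a bucket goes to one side as a whole,
        # so test and val can never contain the same key.
        target = test if rng.random() < 0.5 else val
        target.extend(bucket[1:])
    return train, test, val


# Toy mapping: key -> list of spectrum labels.
mapping = {
    "KEY-1": ["s1", "s2", "s3"],
    "KEY-2": ["s4"],
    "KEY-3": ["s5", "s6"],
}
train, test, val = split_a_sketch(mapping)
```

With this scheme every key seen at test or validation time has also been seen during training, while the testing and validation partitions stay independent of each other.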

#### MoNA split B

Repeatedly take a random bucket (the collection of spectra sharing the same key) and put it into a random partition that is not yet full.

In [7]:
mona_b_split = split_b(mapping=mona_mapping, partitions=[.8, .1, .1])
save_partitions_as_msp(folder='data/split/mona-b', filenames=['train.msp', 'test.msp', 'val.msp'], partitions=mona_b_split)
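
As a rough illustration of the bucket-wise assignment, the sketch below distributes whole buckets over partitions until each reaches its target size. It is not the `raims` implementation: `split_b_sketch` is a hypothetical name, and both the definition of "full" (the partition has reached its fraction of all spectra) and the fallback when no partition is open are assumptions.

```python
import random


def split_b_sketch(mapping, fractions, seed=0):
    """Assign whole buckets to randomly chosen partitions that are not yet full.

    Hypothetical helper for illustration only; not part of raims.
    """
    rng = random.Random(seed)
    total = sum(len(bucket) for bucket in mapping.values())
    # Assumption: a partition is "full" once it holds its fraction of all items.
    capacities = [fraction * total for fraction in fractions]
    partitions = [[] for _ in fractions]

    buckets = list(mapping.values())
    rng.shuffle(buckets)
    for bucket in buckets:
        open_idx = [i for i, p in enumerate(partitions) if len(p) < capacities[i]]
        # Fallback for floating-point edge cases: consider every partition.
        target = rng.choice(open_idx or list(range(len(partitions))))
        partitions[target].extend(bucket)
    return partitions


# Toy mapping: each item is labelled with its own key so leakage is easy to check.
mapping = {f"KEY-{i}": [f"KEY-{i}"] * (1 + i % 3) for i in range(30)}
partitions = split_b_sketch(mapping, [.8, .1, .1])
```

Since buckets are never divided, no key appears in more than one partition, at the cost of partition sizes that only approximate the requested fractions.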