# 1.  Extracting Chemicals from HMDB

ViMMS operates based on the notion of `Chemicals`. A chemical object contains a formula (from which we can derive its m/z), chromatogram, retention time and other information such as possible intensities and fragmentation spectra.
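To make the "formula to m/z" step concrete, here is a minimal standalone sketch. It does not use the ViMMS `Formula` class; the helper names (`parse_formula`, `monoisotopic_mass`, `mz_m_plus_h`) and the small element table are assumptions made for illustration, and only the singly charged [M+H]+ adduct is handled.

```python
import re

# Monoisotopic masses (Da) of a few common elements; extend as needed.
MONOISOTOPIC_MASS = {
    "C": 12.0,
    "H": 1.00782503207,
    "N": 14.0030740048,
    "O": 15.99491461956,
    "P": 30.97376163,
    "S": 31.97207100,
}
PROTON_MASS = 1.00727646688


def parse_formula(formula):
    """Return {element: count} for a simple formula such as 'C6H12O6'."""
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    return counts


def monoisotopic_mass(formula):
    """Sum the monoisotopic masses of all atoms in the formula."""
    return sum(MONOISOTOPIC_MASS[el] * n for el, n in parse_formula(formula).items())


def mz_m_plus_h(formula):
    """m/z of the singly charged [M+H]+ ion (illustrative helper, not ViMMS API)."""
    return monoisotopic_mass(formula) + PROTON_MASS


glucose_mz = mz_m_plus_h("C6H12O6")
print(round(glucose_mz, 4))  # 181.0707
```

The parser only handles flat formulae without brackets or charges, which is enough for the kind of formula strings shown later in this notebook.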

Chemicals can be divided into two broad types: `KnownChemical`, for which we know the identity and therefore the formula, and `UnknownChemical`, which represents a chemical of unknown identity that still has chromatographic information and can be assigned a retention time, intensity and fragmentation spectra.

This notebook demonstrates how we can sample formulae of actual compounds from [HMDB](https://hmdb.ca/). Extracted formulae are converted into `KnownChemical` objects, which can be used as input to the simulator in ViMMS.

In [1]:
%matplotlib inline

In [2]:
%load_ext autoreload
%autoreload 2

In [3]:
from pathlib import Path
import seaborn as sns

In [4]:
import os
import sys
sys.path.append('../..')

In [5]:
from vimms.Common import download_file, extract_zip_file, set_log_level_debug, set_log_level_warning, \
    save_obj, load_obj, POSITIVE
from vimms.FeatureExtraction import extract_hmdb_metabolite
from vimms.ChemicalSamplers import DatabaseFormulaSampler
from vimms.Chemicals import ChemicalMixtureCreator

In [6]:
from vimms.MassSpec import IndependentMassSpectrometer
from vimms.Controller import TopNController
from vimms.Environment import Environment

### Download HMDB database

The cell below tries to load previously processed HMDB data from `hmdb_compounds.p`. If the file isn't found, it can be recreated by downloading the entire HMDB database as a zip file and extracting compounds from it (the commented-out lines). Because that is quite slow, the cell instead downloads a pre-processed copy.

In [7]:
compound_file = 'hmdb_compounds.p'
try:
    hmdb_compounds = load_obj(compound_file)
except FileNotFoundError:
    
    # download the entire HMDB metabolite database and extract chemicals from it
    # url = 'http://www.hmdb.ca/system/downloads/current/hmdb_metabolites.zip'
    # out_file = download_file(url)
    # compounds = extract_hmdb_metabolite(out_file, delete=True)
    # save_obj(compounds, compound_file)
    
    # above could be quite slow, so download a pre-processed result instead
    url = 'https://github.com/glasgowcompbio/vimms-data/raw/main/hmdb_compounds.p'
    download_file(url, compound_file)
    hmdb_compounds = load_obj(compound_file)

2022-02-24 14:48:02.011 | INFO     | vimms.Common:download_file:453 - Downloading hmdb_compounds.p


  0%|          | 0.00/8.38k [00:00<?, ?KB/s]
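The load-or-create pattern used in the cell above can be sketched with the standard library alone. This is not how `load_obj` and `save_obj` are implemented in ViMMS; `load_or_build` and `slow_build` are hypothetical names, and plain `pickle` is assumed as the on-disk format.

```python
import pickle
import tempfile
from pathlib import Path


def load_or_build(cache_path, build):
    """Load a pickled object from cache_path, or build it and cache the result."""
    cache_path = Path(cache_path)
    try:
        with cache_path.open("rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        obj = build()  # the slow step, e.g. downloading and parsing a database
        with cache_path.open("wb") as f:
            pickle.dump(obj, f)
        return obj


calls = []  # records how many times the slow step actually ran


def slow_build():
    """Stand-in for an expensive extraction step."""
    calls.append(1)
    return [("C6H12O6", "glucose")]


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "compounds.p"
    first = load_or_build(cache, slow_build)   # builds and writes the cache
    second = load_or_build(cache, slow_build)  # reads the cache, no rebuild
```

Catching only `FileNotFoundError`, as the notebook cell does, means a corrupt cache file still raises an error rather than silently triggering a new download.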

### Sample ViMMS chemicals from HMDB database

Create a database formula sampler that samples formulae from HMDB with m/z between 100 and 1000.

In [8]:
df = DatabaseFormulaSampler(hmdb_compounds, min_mz=100, max_mz=1000)
samples = df.sample(1000)

2022-02-24 14:48:08.777 | DEBUG    | vimms.ChemicalSamplers:sample:73 - 73822 unique formulas in filtered database
2022-02-24 14:48:08.777 | DEBUG    | vimms.ChemicalSamplers:sample:79 - Sampled formulas


`samples` is a list of tuples, where the first entry is a `Formula` object, while the second entry is a string of its name.

In [9]:
samples[0]

(C55H106O6, 'TG(a-21:0/10:0/21:0)[rac]')

In [10]:
type(samples[0][0]), type(samples[0][1])

(vimms.Common.Formula, str)
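Conceptually, the sampler filters the database to the requested range, keeps unique formulae, and then draws from what remains. The sketch below illustrates that idea on a toy list. It is an assumption about the general approach, not the `DatabaseFormulaSampler` implementation: the records, the `sample_formulas` helper and the use of neutral monoisotopic mass as a stand-in for m/z are all hypothetical, and whether ViMMS samples with or without replacement is not shown here.

```python
import random

# Hypothetical toy database of (formula, name, monoisotopic mass) records.
database = [
    ("C6H12O6", "glucose", 180.0634),
    ("C6H12O6", "fructose", 180.0634),
    ("C3H7NO2", "alanine", 89.0477),
    ("C9H11NO2", "phenylalanine", 165.0790),
    ("C27H46O", "cholesterol", 386.3549),
]


def sample_formulas(records, n, min_mz, max_mz, seed=None):
    """Keep one record per formula within [min_mz, max_mz], then draw n of them."""
    unique = {}
    for formula, name, mass in records:
        if min_mz <= mass <= max_mz and formula not in unique:
            unique[formula] = (formula, name)
    rng = random.Random(seed)
    # Draw without replacement from the filtered, de-duplicated records.
    return rng.sample(list(unique.values()), n)


picked = sample_formulas(database, n=2, min_mz=100, max_mz=1000, seed=0)
```

Each returned item is a `(formula, name)` pair, mirroring the shape of the `samples` list above, with a plain string in place of the `Formula` object.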

Using the `ChemicalMixtureCreator` class, we can turn the HMDB `Formula` objects sampled by `DatabaseFormulaSampler` into a dataset of `Chemical` objects in ViMMS. These can be used as input to simulation. As an example, 100 chemicals are generated below based on the HMDB formulae, initialised with fragmentation levels up to MS2.

Default parameters are used for RT, intensity, chromatogram and MS2 peak generation in `ChemicalMixtureCreator`. For more details of the different parameters that can be passed to `ChemicalMixtureCreator`, please refer to **03. Generating Sets of Chemicals with the ChemicalMixtureCreator class.ipynb**.

In [11]:
cm = ChemicalMixtureCreator(df)
dataset = cm.sample(100, 2) # sample 100 chemicals up to MS2

2022-02-24 14:48:14.017 | DEBUG    | vimms.ChemicalSamplers:sample:73 - 73822 unique formulas in filtered database
2022-02-24 14:48:14.017 | DEBUG    | vimms.ChemicalSamplers:sample:79 - Sampled formulas
2022-02-24 14:48:14.150 | DEBUG    | vimms.Chemicals:sample:324 - Sampled rt and intensity values and chromatograms


In [12]:
dataset[0]

KnownChemical - 'C46H76O13P2' rt=1499.63 max_intensity=767027.84
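The `rt` and `max_intensity` values printed above can be read as the apex of a chromatographic peak. A minimal sketch of how intensity might vary around that apex, assuming a Gaussian peak shape: the `gaussian_intensity` helper and the width `sigma=5.0` seconds are assumptions chosen for illustration and are not taken from ViMMS.

```python
import math


def gaussian_intensity(t, rt, max_intensity, sigma=5.0):
    """Intensity at time t (seconds) for a Gaussian peak with its apex at rt."""
    return max_intensity * math.exp(-0.5 * ((t - rt) / sigma) ** 2)


rt, max_intensity = 1499.63, 767027.84  # values printed for dataset[0]
apex = gaussian_intensity(rt, rt, max_intensity)            # at the apex
shoulder = gaussian_intensity(rt + 5.0, rt, max_intensity)  # one sigma later
```

At the apex the function returns `max_intensity` itself, and one `sigma` away it has dropped to about 61% of that value (a factor of `exp(-0.5)`).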

### Use in simulator

We can use the sampled chemicals to simulate various fragmentation strategies in ViMMS. Below we run them through a Top-N strategy.

First we set some parameters for the Top-N controller and its simulated environment.

In [13]:
rt_range = [(0, 1440)]
min_rt = rt_range[0][0]
max_rt = rt_range[0][1]

In [14]:
isolation_window = 1
N = 3
rt_tol = 15
mz_tol = 10
min_ms1_intensity = 1.75E5

Initialise the simulated mass spec and the Top-N controller.

In [15]:
mass_spec = IndependentMassSpectrometer(POSITIVE, dataset)
controller = TopNController(POSITIVE, N, isolation_window, mz_tol, rt_tol, min_ms1_intensity)
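To show what these parameters control, here is a generic sketch of Top-N precursor selection with dynamic exclusion. It is not the `TopNController` implementation. It assumes that `mz_tol` is expressed in ppm and `rt_tol` in seconds, and the helper names (`within_ppm`, `select_top_n`) and the example scan are hypothetical.

```python
def within_ppm(mz, ref_mz, tol_ppm):
    """True if mz lies within tol_ppm parts-per-million of ref_mz."""
    return abs(mz - ref_mz) <= ref_mz * tol_ppm / 1e6


def select_top_n(peaks, n, min_intensity, exclusion, now, mz_tol, rt_tol):
    """Pick up to n precursor m/z values from one MS1 scan.

    peaks: list of (mz, intensity) tuples from the scan.
    exclusion: list of (mz, time_fragmented) for recently fragmented ions.
    """
    chosen = []
    # Walk through the peaks from most to least intense.
    for mz, intensity in sorted(peaks, key=lambda p: p[1], reverse=True):
        if len(chosen) == n or intensity < min_intensity:
            break
        recently_fragmented = any(
            within_ppm(mz, ex_mz, mz_tol) and now - ex_time <= rt_tol
            for ex_mz, ex_time in exclusion
        )
        if not recently_fragmented:
            chosen.append(mz)
    return chosen


scan = [(181.0707, 9.0e5), (166.0863, 4.0e5), (387.3622, 2.5e5),
        (90.0550, 2.0e5), (250.1000, 1.0e5)]
excluded = [(166.0863, 95.0)]  # this ion was fragmented at t = 95 s

targets = select_top_n(scan, n=3, min_intensity=1.75e5, exclusion=excluded,
                       now=100.0, mz_tol=10, rt_tol=15)
print(targets)  # [181.0707, 387.3622, 90.055]
```

The second most intense peak is skipped because it was fragmented 5 seconds earlier, which is inside the 15 second exclusion window, so the next eligible peaks fill the remaining slots.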

Create an environment to run both the mass spec and controller. Set the log level to WARNING so we don't see too many messages when the environment is running.

In [16]:
set_log_level_warning()
env = Environment(mass_spec, controller, min_rt, max_rt, progress_bar=True)
env.run()

  0%|          | 0/1440 [00:00<?, ?it/s]

Write the resulting mzML file from the simulation to the location below. You can use TOPPView from OpenMS or another mzML viewer to inspect the results. Note that the output won't look very realistic, as the chromatograms for all chemicals have the same (Gaussian) shape and there is no noise or small peaks at all.

In [17]:
set_log_level_debug()
mzml_filename = 'hmdb_topn_controller.mzML'
out_dir = os.path.join(os.getcwd(), 'results')
env.write_mzML(out_dir, mzml_filename)

2022-02-24 14:48:37.644 | DEBUG    | vimms.Environment:write_mzML:177 - Writing mzML file to C:\Users\joewa\Work\git\vimms\demo\01. Data\results\hmdb_topn_controller.mzML
2022-02-24 14:48:47.694 | DEBUG    | vimms.Environment:write_mzML:181 - mzML file successfully written!
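For a quick check without a viewer, the number of MS1 and MS2 scans in an mzML file can be counted with the standard library, since mzML is XML. This is a simplified sketch that reads the whole file into memory and looks only at the PSI-MS "ms level" term (`MS:1000511`); the `count_ms_levels` helper is hypothetical, and the inline document below is a tiny hand-written stand-in rather than real simulator output.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

MZML_NS = {"mz": "http://psi.hupo.org/ms/mzml"}
MS_LEVEL_ACCESSION = "MS:1000511"  # PSI-MS controlled vocabulary term "ms level"


def count_ms_levels(source):
    """Count spectra per MS level in an mzML file path or binary file object."""
    counts = Counter()
    root = ET.parse(source).getroot()
    for spectrum in root.iterfind(".//mz:spectrum", MZML_NS):
        for param in spectrum.iterfind("mz:cvParam", MZML_NS):
            if param.get("accession") == MS_LEVEL_ACCESSION:
                counts[int(param.get("value"))] += 1
    return counts


# Tiny hand-written example with one MS1 scan and two MS2 scans.
sample = b"""<mzML xmlns="http://psi.hupo.org/ms/mzml">
  <run><spectrumList count="3">
    <spectrum index="0"><cvParam accession="MS:1000511" name="ms level" value="1"/></spectrum>
    <spectrum index="1"><cvParam accession="MS:1000511" name="ms level" value="2"/></spectrum>
    <spectrum index="2"><cvParam accession="MS:1000511" name="ms level" value="2"/></spectrum>
  </spectrumList></run>
</mzML>"""

levels = count_ms_levels(io.BytesIO(sample))
```

To apply it to the simulated run, pass the path of the written file, for example `count_ms_levels(os.path.join(out_dir, mzml_filename))`. For large files or access to the peak data itself, a dedicated mzML reader is a better fit.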
