**Microbiome Summer School 2017 - Mass Spectrometry Tutorial**

Welcome to this tutorial for **Plenary 9** of the **Microbiome Summer School 2017**. This tutorial concerns *Algorithms for Mass Spectrometry*.

This notebook contains working code and an example application of the algorithms covered in Plenary 9. A dataset of mass spectra will be corrected by the Virtual Lock Mass algorithm and subsequently aligned. A machine learning algorithm will then be applied to the data.

In [1]:
# This section contains some fundamental imports for the notebook.
import numpy as np

The following section will load the mass spectra data into memory.

This dataset is a set of 80 samples of red blood cell cultures. Their spectra were acquired by LDTD-ToF mass spectrometry on a Waters Synapt G2-Si instrument, in high-resolution mode, using a data-independent acquisition mode ($MS^e$).

Of these 80 samples, 40 are from red blood cell cultures infected by malaria. The other 40 samples are not infected. It is the objective of this tutorial to correct and align these spectra in order to classify them by machine learning.

The dataset is stored in the file *dataset.h5*, included with this tutorial. The [HDF5 format](https://en.wikipedia.org/wiki/Hierarchical_Data_Format) is a very efficient storage format for many types of datasets and numeric data.

The loading operation may take a few seconds to complete.

In [2]:
from tutorial_code.utils import load_spectra

In [3]:
datafile = "dataset.h5"
spectra = load_spectra(datafile)

At this point, the mass spectra are loaded in memory and ready for the next step.

The next steps will be to **correct** and align these spectra in order to render them more comparable for the machine learning analysis to follow.

First, the **Virtual Lock Mass** algorithm will be applied. 
The following command will import the corrector code.

In [4]:
from tutorial_code.virtual_lock_mass import VirtualLockMassCorrector

We must then create a corrector for the spectra.

The following command will create a corrector with a minimum peak intensity of 1000 and a maximum distance of 40 ppm.
These settings yield the most correction points for this dataset, and are therefore considered optimal.

In [5]:
corrector = VirtualLockMassCorrector(window_size=40, minimum_peak_intensity=1000)
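
To make the unit of the window concrete, here is a minimal sketch of how a distance in ppm (parts per million) between two m/z values can be computed. The helper names `ppm_difference` and `within_window` are hypothetical and are not part of `tutorial_code`. Treating the window as a maximum distance on either side of a reference peak is an assumption based on the description above.

```python
def ppm_difference(mz_a, mz_b):
    """Return the absolute difference between two m/z values, in ppm of mz_b."""
    return abs(mz_a - mz_b) / mz_b * 1e6


def within_window(mz_a, mz_b, window_ppm):
    """Check whether mz_a lies no more than window_ppm away from mz_b."""
    return ppm_difference(mz_a, mz_b) <= window_ppm


# A peak at m/z 500.01 is 20 ppm away from a reference peak at m/z 500.00.
print(round(ppm_difference(500.01, 500.00), 3))      # 20.0
print(within_window(500.01, 500.00, window_ppm=40))  # True
print(within_window(500.03, 500.00, window_ppm=40))  # False (60 ppm away)
```

Because the distance is relative, the same 40 ppm window is wider in absolute terms at high m/z than at low m/z.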

The corrector is then *trained* on the dataset in order to detect the VLM correction points.
This is done by using the *fit* function, with the dataset as a parameter.

In [6]:
corrector.fit(spectra)
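
The idea of a correction point can be illustrated with a toy sketch. It assumes that a correction point is a peak that is found exactly once, above the minimum intensity, in every spectrum within the ppm window. This is a simplified illustration and not the actual code of `VirtualLockMassCorrector`. The function `find_common_peaks` and the representation of a spectrum as a list of `(mz, intensity)` tuples are hypothetical.

```python
def find_common_peaks(spectra, window_ppm, minimum_intensity):
    """Return the average m/z of peaks that occur exactly once in every spectrum.

    Each spectrum is a list of (mz, intensity) tuples.
    """
    # Keep only the peaks that are intense enough to be considered.
    filtered = [
        [mz for mz, intensity in spectrum if intensity >= minimum_intensity]
        for spectrum in spectra
    ]
    points = []
    # Use the peaks of the first spectrum as candidates.
    for reference_mz in filtered[0]:
        tolerance = reference_mz * window_ppm * 1e-6
        matches = []
        for peaks in filtered:
            close = [mz for mz in peaks if abs(mz - reference_mz) <= tolerance]
            if len(close) != 1:
                break  # the peak is missing or ambiguous in this spectrum
            matches.append(close[0])
        else:
            # The loop did not break: the peak was found once in every spectrum.
            points.append(sum(matches) / len(matches))
    return points


toy_spectra = [
    [(100.000, 5000), (250.000, 3000), (400.000, 200)],
    [(100.002, 4800), (250.004, 2500), (400.001, 150)],
    [(100.001, 5100), (300.000, 2900), (400.002, 180)],
]
toy_points = find_common_peaks(toy_spectra, window_ppm=40, minimum_intensity=1000)
print(toy_points)
```

In this toy data, only the peak near m/z 100 qualifies: the peak near m/z 250 is absent from the third spectrum, and the peaks near m/z 400 are below the intensity threshold.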

Once the corrector has been trained, it can apply its correction to the spectra.
We simply use the *transform* function of the corrector on the dataset.
However, we must store the result in a new variable.

In [7]:
corrected_spectra = corrector.transform(spectra)
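
One plausible way to apply such a correction is sketched below. It assumes that each correction point yields a ratio between its reference m/z and its observed m/z, and that the ratio for any other peak is interpolated linearly between the two surrounding correction points. The function `correct_spectrum` and its arguments are hypothetical, and the real `transform` function may work differently. The sketch returns a new list and leaves its input untouched, which is why the result is stored in a new variable.

```python
def correct_spectrum(spectrum, observed_points, reference_points):
    """Return a new spectrum whose m/z values are rescaled toward the reference.

    spectrum: list of (mz, intensity) tuples.
    observed_points: sorted m/z of the correction points as seen in this spectrum.
    reference_points: m/z of the same correction points in the common reference.
    """
    ratios = [ref / obs for ref, obs in zip(reference_points, observed_points)]
    corrected = []
    for mz, intensity in spectrum:
        if mz <= observed_points[0]:
            ratio = ratios[0]   # before the first point: reuse its ratio
        elif mz >= observed_points[-1]:
            ratio = ratios[-1]  # after the last point: reuse its ratio
        else:
            # Find the pair of correction points that surrounds this peak
            # and interpolate the ratio linearly between them.
            for i in range(len(observed_points) - 1):
                left, right = observed_points[i], observed_points[i + 1]
                if left <= mz <= right:
                    weight = (mz - left) / (right - left)
                    ratio = ratios[i] + weight * (ratios[i + 1] - ratios[i])
                    break
        corrected.append((mz * ratio, intensity))
    return corrected


shifted = [(100.002, 10), (150.0, 5), (200.004, 8)]
fixed = correct_spectrum(
    shifted,
    observed_points=[100.002, 200.004],
    reference_points=[100.0, 200.0],
)
print(fixed)
```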

Now the spectra are corrected and larger shifts between samples should be removed.
We must still **align** the spectra together in order to remove small variations in m/z values.

The following command will import the aligner code.

In [8]:
from tutorial_code.alignment import Mass_Spectra_Aligner

As before, we must create an aligner.

The following command will create this aligner with a window size of 20 ppm.

In [9]:
aligner = Mass_Spectra_Aligner(window_size=20)

The aligner will then detect the alignment points by being *fitted* to the mass spectra.

In [10]:
aligner.fit(corrected_spectra)

Once the aligner is fitted, we have the alignment points.

The spectra will then be aligned by the *transform* function of the aligner.
Once again, the aligned spectra will need to be stored in a new variable.

In [11]:
aligned_spectra = aligner.transform(corrected_spectra)
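
As a rough illustration of what alignment achieves, the sketch below moves each peak onto the nearest alignment point when that point lies within the ppm window. This is an assumption about the general principle, not the code of `Mass_Spectra_Aligner`. The function `align_spectrum` is hypothetical, and leaving unmatched peaks unchanged is a choice made only for this example.

```python
def align_spectrum(spectrum, alignment_points, window_ppm):
    """Return a new spectrum with peaks moved onto nearby alignment points.

    spectrum: list of (mz, intensity) tuples.
    alignment_points: m/z values shared by the whole dataset.
    """
    aligned = []
    for mz, intensity in spectrum:
        # Pick the alignment point closest to this peak.
        nearest = min(alignment_points, key=lambda point: abs(point - mz))
        # Move the peak only if that point is within the ppm window.
        if abs(mz - nearest) / nearest * 1e6 <= window_ppm:
            mz = nearest
        aligned.append((mz, intensity))
    return aligned


toy_aligned = align_spectrum(
    [(100.001, 10), (150.0, 5), (200.003, 7)],
    alignment_points=[100.0, 200.0],
    window_ppm=20,
)
print(toy_aligned)  # [(100.0, 10), (150.0, 5), (200.0, 7)]
```

After this step, the same peak has exactly the same m/z value in every spectrum, so it can be matched across samples.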

The spectra are now aligned.
In terms of m/z values, the spectra are ready to be compared.

The spectra must now be converted into a format that machine learning algorithms can read.
This format is a data matrix, where each row represents a mass spectrum and each column represents a peak that is present in the dataset.

To make this conversion, import the *spectrum_to_matrix* function from the utilities and apply it to the aligned spectra.

In [12]:
from tutorial_code.utils import spectrum_to_matrix

data = spectrum_to_matrix(aligned_spectra)
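
The layout of such a data matrix can be sketched in plain Python as follows. The function `toy_spectra_to_matrix` is hypothetical and is not the tutorial's *spectrum_to_matrix*. It assumes that each spectrum is a list of `(mz, intensity)` tuples and that a peak absent from a spectrum is given an intensity of 0.

```python
def toy_spectra_to_matrix(spectra):
    """Build a matrix with one row per spectrum and one column per m/z value."""
    # The columns are all the distinct m/z values of the dataset, in sorted order.
    columns = sorted({mz for spectrum in spectra for mz, _ in spectrum})
    column_index = {mz: j for j, mz in enumerate(columns)}
    # Start from zeros, then fill in the intensity of each observed peak.
    matrix = [[0.0] * len(columns) for _ in spectra]
    for i, spectrum in enumerate(spectra):
        for mz, intensity in spectrum:
            matrix[i][column_index[mz]] = float(intensity)
    return matrix, columns


toy_matrix, toy_columns = toy_spectra_to_matrix([
    [(100.0, 10), (200.0, 7)],
    [(100.0, 12), (150.0, 5)],
])
print(toy_columns)  # [100.0, 150.0, 200.0]
print(toy_matrix)   # [[10.0, 0.0, 7.0], [12.0, 5.0, 0.0]]
```

This layout only works because the spectra were aligned first: without alignment, the same peak would appear under slightly different m/z values and be split across several columns.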

The tags of the spectra are then extracted with the *extract_tags* function from *tutorial_code.utils*.

In [13]:
from tutorial_code.utils import extract_tags

tags = extract_tags(aligned_spectra)