[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/MS-Quality-hub/pymzqc/blob/v1.0.0rc1/jupyter/colab/read_in_5_minutes.ipynb)

# Welcome to the 5-Minute mzQC interactive guides with Python!
This Python notebook guides you through your first steps in reading someone's mzQC file in Python.

First we will see how to open and read an `mzQC` file, then explore its content and dataframe integration, and finally visualise a metric interactively in Colab! (We assume this is not the first time you get your feet wet with Python; otherwise, plan for 25 minutes instead.)

## Setting the scene
First, we need to install the mzQC Python library. Outside of this notebook, see [here](https://github.com/MS-Quality-hub/pymzqc) for how to install it locally (spoiler: usually just `pip install pymzqc`).

In [2]:
#@title This will install the latest version of pymzqc
%pip install --no-deps git+https://github.com/MS-Quality-Hub/pymzqc.git  --quiet
%pip install fastobo pronto numpy pandas

  Preparing metadata (setup.py) ... done


Then, we bring pymzqc into our Python session by importing from its `mzqc` package. We'll also use a few other libraries.
For example, we will use `requests` to load some data from the web.


In [3]:
from mzqc import MZQCFile as qc
import pandas as pd
from io import StringIO
import requests
import plotly.express as px

# Acquire data
Next, we need to acquire some data.
For this notebook you have two choices.
1. [either](#local-file-cell-id) upload a file from your local disk
2. [or](#github-file-cell-id) load an example file from the mzQC GitHub repo

<a name="local-file-cell-id"></a>
## Option 1.
Select and upload a local file here!

In [4]:
#@title Upload `.mzQC` file here!
from google.colab import files
try:
  uploaded = files.upload()
except Exception:  # a bare `except:` would also swallow KeyboardInterrupt
  print("If that does not work, proceed with option 2.")

# maybe an alternative?
# from google.colab import drive
# drive.mount('/content/drive')

Saving first.mzQC to first.mzQC


And load it as `JsonSerialisable` from file:

In [8]:
#@title Provide the decoded content of the Colab-uploaded file, or a file object from `open` on a file path if you are using this notebook locally.
some_mzqc = qc.JsonSerialisable.FromJson(uploaded['first.mzQC'].decode())
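
The cell above hard-codes the file name `first.mzQC`. Since `files.upload()` returns a dictionary mapping each uploaded file name to its raw bytes, the first upload can be picked without knowing its name. The following is a minimal sketch of that idea; the helper `first_upload_as_text` and the shortened stand-in payload are hypothetical and only serve as illustration.

```python
# Stand-in for the dict returned by files.upload(): file name -> raw bytes.
# The payload is a hypothetical, heavily shortened document.
uploaded = {"first.mzQC": b'{"mzQC": {"description": "demo"}}'}


def first_upload_as_text(uploaded_files):
    """Return (file name, decoded text) of the first uploaded file."""
    if not uploaded_files:
        raise ValueError("No file was uploaded; try option 2 instead.")
    # dicts keep insertion order, so this is the first file that was uploaded
    name, raw = next(iter(uploaded_files.items()))
    return name, raw.decode("utf-8")


file_name, mzqc_text = first_upload_as_text(uploaded)
print(file_name)
```

In the notebook, `mzqc_text` would then take the place of `uploaded['first.mzQC'].decode()` in the call to `qc.JsonSerialisable.FromJson`.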

<a name="github-file-cell-id"></a>
## Option 2.
Load a file from GitHub

In [13]:
#@title Loading files from the web works only a little differently.
response = requests.get('https://raw.githubusercontent.com/HUPO-PSI/mzQC/main/specification_documents/examples/metabo-batches.mzQC')
some_mzqc = qc.JsonSerialisable.FromJson(response.text)
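
A download can fail quietly, for instance when the URL has moved and the server answers with an error page instead of the file. A quick sanity check on the text before parsing can make such cases easier to spot. The sketch below uses only the standard library and assumes that an mzQC document is a JSON object with a single top-level `mzQC` key; the helper name `looks_like_mzqc` is hypothetical and not part of pymzqc.

```python
import json


def looks_like_mzqc(text):
    """Return True if text is JSON with a top-level 'mzQC' object."""
    try:
        doc = json.loads(text)
    except json.JSONDecodeError:
        return False
    # assumption: the document root holds one 'mzQC' object
    return isinstance(doc, dict) and isinstance(doc.get("mzQC"), dict)


print(looks_like_mzqc('{"mzQC": {"version": "1.0.0"}}'))
print(looks_like_mzqc("404: Not Found"))
```

In the notebook, the check would be applied as `looks_like_mzqc(response.text)` before calling `FromJson`. It is only a shallow plausibility test and does not validate the file against the mzQC schema.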

# Inspect data
We can now go ahead and take a look at what the file we loaded has to offer:

In [14]:
print(some_mzqc.description)

This dataset is based on the analysis of polar extracts from a nucleotype-plasmotype combination study of Arabidopsis for 58 different genotypes. For details of the used plant material we refer to Flood (2015). Analysis of the polar, derivatized metabolites by GC-ToF-MS (Agilent 6890 GC coupled to a Leco Pegasus III MS) and processing of the data were done as described in Villafort Carvalho et al. (2015). Here, the number of metabolites (75) is much lower than in the other two data sets, partly because the focus was on the primary rather than the secondary metabolites. The number of samples was 240, with a percentage of non-detects of 16 %; the maximum fraction of non-detects in individual metabolites is 92 %. All metabolites were retained in the analysis. Four batches of 31-89 samples were employed, containing 2-6 QCs per batch, 14 in total.


JSON arrays deserialised by **pymzqc** (i.e. the `runQualities` and their `qualityMetrics`) can be used like Python lists:

In [15]:
for m in some_mzqc.runQualities[0].qualityMetrics:
    print(m.name)

Detected Compounds


You can traverse the hierarchy with standard python member access notation (`.`) and get to the bottom of things (like a metric `name` or `value`).

In [16]:
some_mzqc.runQualities[0].qualityMetrics[0].value


57
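
The member access above mirrors the nesting of the underlying JSON. As a rough illustration, the same path can be walked on plain dictionaries and lists with the standard `json` module. The tiny document below is a hypothetical, heavily shortened stand-in (a real mzQC file carries more fields, such as metadata and accessions); only the metric name and value are taken from the output above.

```python
import json

# Hypothetical, heavily shortened document for illustration only
tiny = json.loads("""
{"mzQC": {"runQualities": [
  {"qualityMetrics": [{"name": "Detected Compounds", "value": 57}]}
]}}
""")

# Same path as some_mzqc.runQualities[0].qualityMetrics[0], but on raw JSON
metric = tiny["mzQC"]["runQualities"][0]["qualityMetrics"][0]
print(metric["name"], metric["value"])
```

The pymzqc objects save you the key lookups and give you typed objects instead of bare dictionaries, but the shape of the data is the same.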

We can also get the table metrics directly as pandas dataframe! 🤯

In [17]:
from google.colab import data_table
data_table.enable_dataframe_formatter()

df = pd.DataFrame(some_mzqc.setQualities[0].qualityMetrics[0].value)
df

| | Run name | PCA Dimension 1 | PCA Dimension 2 | PCA Dimension 3 | PCA Dimension 4 | PCA Dimension 5 | Injection sequence number | Batch label |
|---|---|---|---|---|---|---|---|---|
| 0 | GCMS ToF sample 10 | -3.348963 | -2.341435 | -1.486755 | -0.276620 | -2.683632 | 13 | 4 |
| 1 | GCMS ToF sample 100 | 0.419126 | 2.055220 | -0.396590 | 1.780880 | -2.020238 | 16 | 7 |
| 2 | GCMS ToF sample 101 | 6.824155 | 1.514235 | 1.163668 | 0.173623 | -3.088806 | 17 | 7 |
| 3 | GCMS ToF sample 102 | -1.080886 | 2.994551 | 0.421837 | 0.013869 | -3.161503 | 18 | 7 |
| 4 | GCMS ToF sample 103 | -1.707045 | 6.070161 | 0.627950 | -0.628732 | -1.064512 | 19 | 7 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 221 | GCMS ToF sample 95 | 4.246444 | 1.841355 | 0.670941 | 0.938284 | -1.520605 | 11 | 7 |
| 222 | GCMS ToF sample 96 | 2.368072 | 1.001514 | 1.235160 | -0.091535 | -3.000758 | 12 | 7 |
| 223 | GCMS ToF sample 97 | 1.674060 | 0.450211 | -0.512383 | 2.100324 | -1.654505 | 13 | 7 |
| 224 | GCMS ToF sample 98 | -2.928426 | 4.175587 | -0.449582 | 0.333712 | -0.814875 | 14 | 7 |
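
Passing the metric value straight to `pd.DataFrame` works because a table metric is stored column by column. The sketch below assumes that layout (a mapping from column name to a list of equal length) and turns it into row records with the standard library only, which is roughly the view the dataframe gives you. The values are copied from the first two rows shown above; the variable names are illustrative.

```python
# Column-oriented table: column name -> list of values (one per run)
table = {
    "Run name": ["GCMS ToF sample 10", "GCMS ToF sample 100"],
    "PCA Dimension 1": [-3.348963, 0.419126],
    "PCA Dimension 2": [-2.341435, 2.055220],
}

# A well-formed table has columns of identical length
column_lengths = {len(column) for column in table.values()}
if len(column_lengths) != 1:
    raise ValueError("Table columns differ in length")

# Transpose the columns into one dict per row
rows = [dict(zip(table.keys(), values)) for values in zip(*table.values())]
for row in rows:
    print(row)
```

The length check is worth keeping in mind when reading files from other tools: `zip` silently truncates to the shortest column, so ragged tables are better caught up front.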


# Visualising our data
It is always good to look at your data visualised. The table metric above provides all we need to plot the first two PCA dimensions and add interactive labels.

In [18]:
df['Batch name'] = df['Batch label'].map(str)
fig = px.scatter(df, x="PCA Dimension 1", y="PCA Dimension 2", color="Batch name",
                 hover_name="Run name", hover_data=["Injection sequence number", "Batch label"])
fig.show()
