**Updated on:** 2022-10-23 23:10:05 CEST

This notebook cleans the feature table produced by a metabolomics experiment and then performs preliminary univariate and multivariate statistical analyses.

**Authors**: Abzer Kelminal (abzer.shah@uni-tuebingen.de), Francesco Russo (frru@ssi.dk), Filip Ottosson (faot@ssi.dk), Madaleine Ernst (maet@ssi.dk), Axel Walter (axel.walter@uni-tuebingen.de), Carolina Gonzalez (cgonzalez7@eafit.edu.co), Judith Boldt <br>
**Input file format**: .csv files or .txt files <br>
**Outputs**: .csv files, .pdf & .svg images  <br>
**Dependencies**: pandas numpy plotly pingouin kaleido scikit-learn scikit-bio ipyfilechooser nbformat

---
This notebook can be run in both Jupyter Notebook and Google Colab. To learn more about how to get Jupyter Notebook running with R code, please have a look at this document: [GitHub Link](https://github.com/Functional-Metabolomics-Lab/Jupyter-Notebook-Installation/blob/main/Anaconda%20with%20R%20kernel%20installation.pdf)

---
**Before starting to run this notebook with your own data, remember to save a copy of this notebook in your own Google Drive! Do so by clicking on File --> Save a copy in Drive. You can give whatever meaningful name to your notebook.** This file should be located in a new folder of your Google Drive named 'Colab Notebooks'. You can also download this notebook: File --> Download --> Download .ipynb.<br>

---
<b><font size=3> SPECIAL NOTE: Please read the comments before proceeding with the code and let us know if you run into any errors and if you think it could be commented better. We would highly appreciate your suggestions and comments!!</font> </b>

---

# **About the Data**

The files used in this tutorial are part of an interlab comparison study, where different laboratories around the world analysed the same environmental samples on their respective LC-MS/MS equipment. To simulate an algal bloom, standardized algae extracts (A) in marine dissolved organic matter (M) were prepared at different concentrations (450 (A45M); 150 (A15M); and 50 (A5M) ppm A). Samples were then shipped to different laboratories for untargeted LC-MS/MS metabolomics analysis. The data used for this notebook is from Lab 1 (Dorrestein Lab, University of California at San Diego, USA; data submitted by Allegra Aron allegra.aron@gmail.com ) <br><br>
(*To be edited*) In this tutorial, we are working with one of the datasets, which was acquired on a UHPLC system coupled to a Thermo Scientific Q Exactive HF Orbitrap LC-MS/MS mass spectrometer. MS/MS data were acquired in data-dependent acquisition (DDA) with fragmentation of the five most abundant ions in the spectrum per precursor scan. Data files were subsequently preprocessed using [MZmine3](http://mzmine.github.io/) and the [feature-based molecular networking workflow in GNPS](https://gnps.ucsd.edu/ProteoSAFe/status.jsp?task=d207c3a831264d61810ad69ac09b14e9).

# **About the different sections in the Notebook:**
### **1. Data-cleaning**

This involves cleaning the feature table, which contains all the features (metabolites, in our case) with their corresponding intensities. The data cleanup steps are: 1) Blank removal 2) Imputation 3) Normalisation. Each step will be discussed in detail later. Once the data is cleaned, we can use it for further statistical analyses.

### **2. Univariate statistical analysis**

Here, we will use univariate statistical methods, such as ANOVA, to investigate whether there are differences in the levels of individual features between different time points in the dataset.

### **3. Unsupervised multivariate analyses:**
#### **i. PCoA and PERMANOVA**
Here, we will perform a Principal Coordinate Analysis (PCoA), also known as metric or classical Multidimensional Scaling (metric MDS), to explore and visualize patterns in an untargeted mass spectrometry-based metabolomics dataset. We will then assess the statistical significance of the patterns and the dispersion of different sample types using permutational multivariate analysis of variance (PERMANOVA).

#### **ii. Cluster Analyses and Heatmaps**
We will also perform different cluster analyses to explore patterns in the data. This will help us to discover subgroups of samples or features that share a certain level of similarity. Clustering is an example of unsupervised learning where no labels are given to the learning algorithm which will try to find patterns/structures in the input data on its own. The goal of clustering is to find these hidden patterns.<br>

Some types of cluster analyses (e.g. hierarchical clustering) are often associated with heatmaps. Heatmaps are a visual representation of the data where columns are usually samples and rows are features (in our case, different metabolic features). The color scale of heatmaps indicates higher or lower intensity (for instance, blue is lower and red is higher intensity).<br>

There are a lot of good videos and resources out there explaining very well the principle behind clustering. Some good ones are the following:<br>
- Hierarchical clustering and heatmaps: https://www.youtube.com/watch?v=7xHsRkOdVwo<br>
- K-means clustering: https://www.youtube.com/watch?v=4b5d3muPQmA

# **Questions to be asked in the Statistical analysis sections**: <br>
**Univariate Statistical analysis:**
*   Are metabolite levels dependent on the dilution?
*   How does the affected metabolite change throughout the dilution series?
*   How large are the differences? 
---
**Unsupervised multivariate analyses: PCoA & PERMANOVA**
*   Can we monitor algal bloom by looking at metabolomic profiles of marine dissolved organic matter?
---
**Cluster analysis and Heatmaps**
- Can we monitor algal bloom by looking at metabolomic profiles of marine dissolved organic matter?
- Are we able to group/cluster together samples derived from different concentrations of algae extracts using metabolic profiles? <br>
- Which samples are the most similar? <br>
- Are there any patterns defining the groups/clusters? That is, which features cluster together? 

# **Package installation:**
Since we are running the notebook in the Colab environment, which runs completely in the cloud, we need to install the packages every time we run the notebook. Installing all these packages might take some time. If you are running the notebook directly in Jupyter Notebook, you need to install the packages only once.

In [1]:
# Install libraries that are not preinstalled
!pip install pandas numpy plotly scikit-learn scikit-bio pingouin kaleido ipyfilechooser nbformat





In [2]:
# importing necessary modules
import pandas as pd
import numpy as np
import os
import itertools
import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.preprocessing import StandardScaler
from scipy.spatial import distance
from sklearn.decomposition import PCA
import pingouin as pg
import skbio # Don't import on Windows!!
from ipyfilechooser import FileChooser
from ipywidgets import interact
import warnings

In [3]:
# Disable warnings for cleaner output, comment out for debugging
warnings.filterwarnings('ignore')

# **Setting a local working directory:**
### For Google Colab Users:
<p style='text-align: justify;'> <font color='red'>For Google Colab, it is not possible to access the files from your local computer as it is hosted on Google's cloud server. An easier workaround is to upload the necessary files into the Google colab session using the 'Files' icon on the left as shown in the image. The code in the next cell creates a new folder 'My_TestData' in the Colab space and sets the folder as working directory. Following the steps in the image, you can check in your Colab to see if the folder has been created. Once you see it, simply upload the files from your local PC to the folder 'My_TestData' and then continue running the rest of the script.</font> </p>

<p style='text-align: justify;'><b>SPECIAL NOTE: All the files uploaded to Google Colab would generally disappear after 12 hours. Similarly, all the outputs would be saved only in the Colab, so we need to download them into our local system at the end of our session.</b></p> 

[Go to section: Getting outputs from Colab](#colab_output) 

**Importing files into Google Colab environment:**
![Google-Colab Files Upload](https://github.com/abzer005/Images-for-Jupyter-Notebooks/blob/main/StepsAll.png?raw=true)

In [4]:
# Get (and create, if needed) the folder where the results will be saved
result_dir = input("Enter path to folder for your results (or leave empty to stay in this folder):\n")
if not result_dir:
    result_dir = "."
if not os.path.exists(result_dir):
    os.mkdir(result_dir)
print(f"Results folder is: {os.path.abspath(result_dir)}")

Results folder is: /home/a/dev/Statistical-analysis-of-non-targeted-LC-MSMS-data/Combined_Notebooks/results


**For users running the script directly in Jupyter Notebook instead of Google Colab**, please make sure to place all the input files in one folder before running the script. Then, to set the working directory, run the code below in a new cell. It will display an input box where you can enter the path of the folder on your local computer that contains your input files, and it will set that folder as your working directory.<br> For example: D:\User\Project\Test_Data

```
directory = input("Enter the path of the folder with input files:\n")
os.chdir(directory)
```



# **Input files needed for the Notebook:**
1) <b>Feature table:</b> An output of a metabolomics experiment, containing all the features or peaks (LC-MS/MS peaks here) with their corresponding intensities. The feature table used in the test data was obtained with MZmine3. (Filetype: .csv file) <br> 
2) <b>Metadata:</b> Created by the user to describe the files used to obtain the feature table (it can be a csv/txt/tsv file). The metadata columns should follow this format: the 1st column, filename, lists all the file names in the same order as the columns in the feature table; all other columns are named ATTRIBUTE_yourDesiredAttribute. <br>

Please have a look at the metadata used here for reference. Creating a metadata in the above-mentioned format is necessary for uploading the files in GNPS and to obtain a molecular network.
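
The format described above can be checked programmatically on the metadata table as it is read, before `filename` is turned into the index. The following is a minimal sketch and not part of the original workflow; the helper name `check_metadata_format` and the toy table are hypothetical.

```python
import pandas as pd

def check_metadata_format(md):
    """Return a list of format problems found in a metadata table (hypothetical helper)."""
    problems = []
    # the first column must hold the file names
    if md.columns[0] != "filename":
        problems.append(f"first column is '{md.columns[0]}', expected 'filename'")
    # every other column should be named ATTRIBUTE_<something>
    for col in md.columns[1:]:
        if not col.startswith("ATTRIBUTE_"):
            problems.append(f"column '{col}' does not start with 'ATTRIBUTE_'")
    return problems

# toy metadata table with one badly named column (illustrative values only)
toy_md = pd.DataFrame({
    "filename": ["sample_1.mzXML", "blank_1.mzXML"],
    "ATTRIBUTE_Sample_Type": ["Sample", "Blank"],
    "Batch": [1, 1],
})
problems = check_metadata_format(toy_md)
print(problems)
```

An empty list means the column names follow the expected convention; the check does not look at the cell values.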

## Reading the input data using URL:
Here, we can directly pull **example data** files from our Functional Metabolomics GitHub page.

In [5]:
#Reading the input data using URL 
ft_url = 'https://raw.githubusercontent.com/Functional-Metabolomics-Lab/Statistical-analysis-of-non-targeted-LC-MSMS-data/main/data/SD_BeachSurvey_GapFilled_quant.csv'
md_url = 'https://raw.githubusercontent.com/Functional-Metabolomics-Lab/Statistical-analysis-of-non-targeted-LC-MSMS-data/main/data/20221125_Metadata_SD_Beaches_with_injection_order.txt'

ft = pd.read_csv(ft_url)
md = pd.read_csv(md_url, sep = "\t").set_index("filename")

Specify your own feature quantification and metadata tables. If you do not, the example data will be used.

In [6]:
# feature quantification table file location
ft_file = ""
# meta data table file location
md_file = ""


# define separators for different input file formats
separators = {"csv": ",", "tsv": "\t", "txt": "\t"}

# read feature table
if ft_file:
    ft = pd.read_csv(ft_file, sep = separators[ft_file.split(".")[-1]])
else:
    print("Please select a feature file and rerun this cell.")
# read metadata table
if md_file:
    md = pd.read_csv(md_file, sep = separators[md_file.split(".")[-1]]).set_index("filename")
else:
    print("Please select a metavalue file and rerun this cell.")

Please select a feature file and rerun this cell.
Please select a metavalue file and rerun this cell.
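
The extension-based choice of separator used in the cell above can also be wrapped in a small function. This is only a sketch under the assumption that the file extension reliably reflects the delimiter; `read_table` is a hypothetical helper name, and the temporary file exists only to make the example self-contained.

```python
import os
import tempfile
import pandas as pd

# same extension-to-separator mapping as in the cell above
separators = {"csv": ",", "tsv": "\t", "txt": "\t"}

def read_table(path):
    """Read a delimited text file, choosing the separator from the file extension (hypothetical helper)."""
    extension = os.path.splitext(path)[1].lstrip(".").lower()
    if extension not in separators:
        raise ValueError(f"Unsupported file extension: '{extension}'")
    return pd.read_csv(path, sep=separators[extension])

# write a small tab-separated file and read it back
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_metadata.txt")
    with open(toy_path, "w") as handle:
        handle.write("filename\tATTRIBUTE_Batch\nsample_1.mzXML\t1\n")
    toy_table = read_table(toy_path)

print(toy_table.shape)
```

Raising an error for unknown extensions makes a wrong file type visible immediately instead of producing a table with a single merged column.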


Let's check if the data has been read correctly!!

In [7]:
print('Dimension: ',ft.shape) #gets the dimension (number of rows and columns) of ft
ft.head() # gets the first 5 rows of ft

Dimension:  (11217, 200)


Unnamed: 0,row ID,row m/z,row retention time,row ion mobility,row ion mobility unit,row CCS,correlation group ID,annotation network number,best ion,auto MS2 verify,...,SD_12-2017_15_b.mzXML Peak area,SD_12-2017_15_a.mzXML Peak area,SD_12-2017_27_a.mzXML Peak area,SD_12-2017_29_b.mzXML Peak area,SD_12-2017_21_a.mzXML Peak area,SD_12-2017_30_a.mzXML Peak area,SD_12-2017_28_b.mzXML Peak area,SD_12-2017_29_a.mzXML Peak area,SD_12-2017_28_a.mzXML Peak area,Unnamed: 199
0,92572,151.035101,13.363672,,,,,,,,...,0.0,0.0,21385.48,1138.271,1144.8115,12139.16,5394.689,5270.766,1007.839,
1,2513,151.035125,1.129901,,,,,,,,...,0.0,0.0,27123.893,0.0,0.0,0.0,0.0,0.0,0.0,
2,42,151.03514,0.550724,,,,212.0,,,,...,1150350.0,1103477.9,2638109.2,1446267.0,595216.5,1225695.2,1424855.0,1557217.0,1797692.0,
3,1870,151.035199,0.88678,,,,,,,,...,0.0,0.0,314371.84,0.0,0.0,0.0,0.0,0.0,0.0,
4,2127,151.096405,0.986017,,,,,,,,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,


In [8]:
print('Dimension: ',md.shape)
md.head()

Dimension:  (186, 13)


filename,ATTRIBUTE_Sample.Type,ATTRIBUTE_Batch,ATTRIBUTE_Month,ATTRIBUTE_Year,ATTRIBUTE_Sample_Location,ATTRIBUTE_Replicate,ATTRIBUTE_Spot,ATTRIBUTE_Latitude,ATTRIBUTE_Longitude,ATTRIBUTE_Sample_Area,ATTRIBUTE_Spot_Name,ATTRIBUTE_time_run,ATTRIBUTE_Injection_order
SD_10_2018_10_a.mzXML,Sample,3,Oct,2018,10,a,10,32.86261,-117.26042,SIO_La_Jolla_Shores,SIO_South_Pier,18/07/2020 18:19,145
SD_10_2018_10_b.mzXML,Sample,3,Oct,2018,10,b,10,32.86261,-117.26042,SIO_La_Jolla_Shores,SIO_South_Pier,18/07/2020 18:35,146
SD_10_2018_11_a.mzXML,Sample,3,Oct,2018,11,a,11,32.85601,-117.26253,SIO_La_Jolla_Shores,La_Jolla_Shores,18/07/2020 18:51,147
SD_10_2018_11_b.mzXML,Sample,3,Oct,2018,11,b,11,32.85601,-117.26253,SIO_La_Jolla_Shores,La_Jolla_Shores,18/07/2020 19:07,148
SD_10_2018_12_a.mzXML,Sample,3,Oct,2018,12,a,12,32.85161,-117.26965,La_Jolla_Cove,Cove,18/07/2020 19:23,149


## **Creating Functions:**
<p style='text-align: justify;'> Before getting into the data cleanup steps, we define a function that will be used later to summarize the data. Wrapping the code in a function means we do not have to repeat the same long block several times; we simply call the function by name. <font color="red">The following cell in this section will not produce any output here. </font> The output will be produced when we pass input variables to the function in the later sections. </p>

<p style='text-align: justify;'> Using the function inside_levels, we get an overview of the levels within each attribute of the metadata, the number of files per level, and the data type of each attribute.  <font color ="blue"> This function takes the metadata table as its input. </font></p>

In [9]:
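def inside_levels(df):
    """Summarise each metadata attribute: its levels, their counts and the data type."""
    levels = []
    types = []
    count = []
    for col in df.columns:
        # data type of the first entry (positional access, independent of the index labels)
        types.append(type(df[col].iloc[0]))
        # sorted unique values of the attribute, ignoring missing values
        levels.append(sorted(set(df[col].dropna())))
        # number of rows per level, in the same order as the levels
        tmp = df[col].value_counts()
        count.append([tmp[level] for level in levels[-1]])
    # one row per attribute, numbered from 1
    return pd.DataFrame({"ATTRIBUTES": df.columns, "LEVELS": levels, "COUNT":count, "TYPES": types}, index=range(1, len(levels)+1))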

First, let's have a look at the different conditions within each attribute of our metadata.

In [10]:
inside_levels(md)

Unnamed: 0,ATTRIBUTES,LEVELS,COUNT,TYPES
1,ATTRIBUTE_Sample.Type,"[Blank, Sample]","[6, 180]",<class 'str'>
2,ATTRIBUTE_Batch,"[1, 2, 3]","[62, 62, 62]",<class 'numpy.int64'>
3,ATTRIBUTE_Month,"[Dec, Jan, Oct]","[62, 62, 62]",<class 'str'>
4,ATTRIBUTE_Year,"[2017, 2018]","[62, 124]",<class 'numpy.int64'>
5,ATTRIBUTE_Sample_Location,"[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...","[6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...",<class 'numpy.int64'>
6,ATTRIBUTE_Replicate,"[a, b]","[93, 93]",<class 'str'>
7,ATTRIBUTE_Spot,"[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...","[6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...",<class 'numpy.int64'>
8,ATTRIBUTE_Latitude,"[32.75645, 32.75743, 32.75905, 32.76115, 32.76...","[6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...",<class 'numpy.float64'>
9,ATTRIBUTE_Longitude,"[-117.2872, -117.28664, -117.286, -117.28355, ...","[6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...",<class 'numpy.float64'>
10,ATTRIBUTE_Sample_Area,"[Blank, La_Jolla Reefs, La_Jolla_Cove, Mission...","[6, 36, 12, 36, 18, 12, 18, 48]",<class 'str'>


The above table is a summary of our metadata table. For example, the 1st row says that there are 2 different levels under the 'ATTRIBUTE_Sample.Type' category, namely Blank and Sample, and that the counts of these levels are 6 and 180, respectively.
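
To see where the LEVELS and COUNT entries of one row come from, the same information can be derived for a single column with `value_counts`. This is a small sketch on a made-up table; the file names and values are illustrative only.

```python
import pandas as pd

# toy metadata column, loosely modelled on ATTRIBUTE_Sample.Type
toy_md = pd.DataFrame(
    {"ATTRIBUTE_Sample.Type": ["Sample", "Blank", "Sample", "Sample"]},
    index=["s1.mzXML", "b1.mzXML", "s2.mzXML", "s3.mzXML"],
)

# value_counts gives the number of rows per level; sort_index orders the levels alphabetically
counts = toy_md["ATTRIBUTE_Sample.Type"].value_counts().sort_index()
levels = list(counts.index)
count = counts.tolist()
print(levels, count)
```

Here one file is a blank and three are samples, so the two lists pair up as Blank with 1 and Sample with 3.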

# **Arranging metadata and feature table in the same order:**

<p style='text-align: justify;'> In the next cells, we bring the feature table and metadata into a matching format, such that <font color ="green"> the row names of the metadata and the column names of the feature table are the same. </font> Both are the file names, and they need to match because, from now on, we will select columns of the feature table based on the metadata. This lets the user filter the data easily. You can also work with the feature table directly, without metadata, by getting your hands dirty with some coding!! But having metadata greatly improves the user experience. </p>

In [11]:
# structure of the original metadata file
md.head()

filename,ATTRIBUTE_Sample.Type,ATTRIBUTE_Batch,ATTRIBUTE_Month,ATTRIBUTE_Year,ATTRIBUTE_Sample_Location,ATTRIBUTE_Replicate,ATTRIBUTE_Spot,ATTRIBUTE_Latitude,ATTRIBUTE_Longitude,ATTRIBUTE_Sample_Area,ATTRIBUTE_Spot_Name,ATTRIBUTE_time_run,ATTRIBUTE_Injection_order
SD_10_2018_10_a.mzXML,Sample,3,Oct,2018,10,a,10,32.86261,-117.26042,SIO_La_Jolla_Shores,SIO_South_Pier,18/07/2020 18:19,145
SD_10_2018_10_b.mzXML,Sample,3,Oct,2018,10,b,10,32.86261,-117.26042,SIO_La_Jolla_Shores,SIO_South_Pier,18/07/2020 18:35,146
SD_10_2018_11_a.mzXML,Sample,3,Oct,2018,11,a,11,32.85601,-117.26253,SIO_La_Jolla_Shores,La_Jolla_Shores,18/07/2020 18:51,147
SD_10_2018_11_b.mzXML,Sample,3,Oct,2018,11,b,11,32.85601,-117.26253,SIO_La_Jolla_Shores,La_Jolla_Shores,18/07/2020 19:07,148
SD_10_2018_12_a.mzXML,Sample,3,Oct,2018,12,a,12,32.85161,-117.26965,La_Jolla_Cove,Cove,18/07/2020 19:23,149


In [12]:
new_md = md.copy() #storing the files under different names to preserve the original files
# remove the (front & tail) spaces, if any present, from the rownames of md
new_md.index = [name.strip() for name in md.index]
# for each col in new_md
# 1) removing the spaces (if any)
# 2) replace the spaces (in the middle) to underscore
# 3) converting them all to UPPERCASE
for col in new_md.columns:
    if new_md[col].dtype == str:
        new_md[col] = [item.strip().replace(" ", "_").upper() for item in new_md[col]]
print('Dimension: ',new_md.shape)
new_md.head()

Dimension:  (186, 13)


Unnamed: 0,ATTRIBUTE_Sample.Type,ATTRIBUTE_Batch,ATTRIBUTE_Month,ATTRIBUTE_Year,ATTRIBUTE_Sample_Location,ATTRIBUTE_Replicate,ATTRIBUTE_Spot,ATTRIBUTE_Latitude,ATTRIBUTE_Longitude,ATTRIBUTE_Sample_Area,ATTRIBUTE_Spot_Name,ATTRIBUTE_time_run,ATTRIBUTE_Injection_order
SD_10_2018_10_a.mzXML,Sample,3,Oct,2018,10,a,10,32.86261,-117.26042,SIO_La_Jolla_Shores,SIO_South_Pier,18/07/2020 18:19,145
SD_10_2018_10_b.mzXML,Sample,3,Oct,2018,10,b,10,32.86261,-117.26042,SIO_La_Jolla_Shores,SIO_South_Pier,18/07/2020 18:35,146
SD_10_2018_11_a.mzXML,Sample,3,Oct,2018,11,a,11,32.85601,-117.26253,SIO_La_Jolla_Shores,La_Jolla_Shores,18/07/2020 18:51,147
SD_10_2018_11_b.mzXML,Sample,3,Oct,2018,11,b,11,32.85601,-117.26253,SIO_La_Jolla_Shores,La_Jolla_Shores,18/07/2020 19:07,148
SD_10_2018_12_a.mzXML,Sample,3,Oct,2018,12,a,12,32.85161,-117.26965,La_Jolla_Cove,Cove,18/07/2020 19:23,149
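
Text columns in pandas are usually stored with the `object` dtype (or a dedicated string dtype) rather than the Python type `str`, so a comparison such as `dtype == str` may not match them. The following sketch shows one dtype-aware way to apply the three cleaning steps listed in the comments above; `clean_text_columns` is a hypothetical helper and the toy values are made up.

```python
import pandas as pd

def clean_text_columns(md):
    """Strip, underscore and uppercase all text columns of a metadata table (hypothetical helper)."""
    cleaned = md.copy()
    for col in cleaned.columns:
        # text columns are stored as 'object' or as a string dtype, not as the Python type str
        if pd.api.types.is_object_dtype(cleaned[col]) or pd.api.types.is_string_dtype(cleaned[col]):
            cleaned[col] = cleaned[col].str.strip().str.replace(" ", "_", regex=False).str.upper()
    return cleaned

# toy metadata with stray spaces in the text column
toy_md = pd.DataFrame(
    {"ATTRIBUTE_Sample_Area": [" La Jolla Cove", "Mission Bay "], "ATTRIBUTE_Batch": [1, 2]},
    index=["s1.mzXML", "s2.mzXML"],
)
cleaned_md = clean_text_columns(toy_md)
print(cleaned_md["ATTRIBUTE_Sample_Area"].tolist())
```

Note that uppercasing changes the level names (for example, `Blank` would become `BLANK`), so any later selection by level has to use the same spelling.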


In [13]:
# structure of the original feature file
ft.head()

Unnamed: 0,row ID,row m/z,row retention time,row ion mobility,row ion mobility unit,row CCS,correlation group ID,annotation network number,best ion,auto MS2 verify,...,SD_12-2017_15_b.mzXML Peak area,SD_12-2017_15_a.mzXML Peak area,SD_12-2017_27_a.mzXML Peak area,SD_12-2017_29_b.mzXML Peak area,SD_12-2017_21_a.mzXML Peak area,SD_12-2017_30_a.mzXML Peak area,SD_12-2017_28_b.mzXML Peak area,SD_12-2017_29_a.mzXML Peak area,SD_12-2017_28_a.mzXML Peak area,Unnamed: 199
0,92572,151.035101,13.363672,,,,,,,,...,0.0,0.0,21385.48,1138.271,1144.8115,12139.16,5394.689,5270.766,1007.839,
1,2513,151.035125,1.129901,,,,,,,,...,0.0,0.0,27123.893,0.0,0.0,0.0,0.0,0.0,0.0,
2,42,151.03514,0.550724,,,,212.0,,,,...,1150350.0,1103477.9,2638109.2,1446267.0,595216.5,1225695.2,1424855.0,1557217.0,1797692.0,
3,1870,151.035199,0.88678,,,,,,,,...,0.0,0.0,314371.84,0.0,0.0,0.0,0.0,0.0,0.0,
4,2127,151.096405,0.986017,,,,,,,,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,


In [14]:
new_ft = ft.copy() #storing the files under different names to preserve the original files
# changing the index in feature table to contain m/z and RT information
new_ft.index = [f"{id}_{round(mz, 3)}_{round(rt, 3)}" for id, mz, rt in zip(ft["row ID"], ft["row m/z"], ft["row retention time"])]
# drop all columns that are not mzML or mzXML file names
new_ft.drop(columns=[col for col in new_ft.columns if ".mz" not in col], inplace=True)
# remove " Peak area" from column names
new_ft.rename(columns={col: col.replace(" Peak area", "").strip() for col in new_ft.columns}, inplace=True)
print('Dimension: ',new_ft.shape)
new_ft.head()

Dimension:  (11217, 186)


Unnamed: 0,SD_01-2018_5_b.mzXML,SD_01-2018_7_b.mzXML,SD_01-2018_7_a.mzXML,SD_01-2018_3_b.mzXML,SD_01-2018_6_a.mzXML,SD_01-2018_8_a.mzXML,SD_01-2018_1_a.mzXML,SD_01-2018_2_b.mzXML,SD_01-2018_4_b.mzXML,SD_01-2018_2_a.mzXML,...,SD_12-2017_23_b.mzXML,SD_12-2017_15_b.mzXML,SD_12-2017_15_a.mzXML,SD_12-2017_27_a.mzXML,SD_12-2017_29_b.mzXML,SD_12-2017_21_a.mzXML,SD_12-2017_30_a.mzXML,SD_12-2017_28_b.mzXML,SD_12-2017_29_a.mzXML,SD_12-2017_28_a.mzXML
92572_151.035_13.364,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,21385.48,1138.271,1144.8115,12139.16,5394.689,5270.766,1007.839
2513_151.035_1.13,14900.481,4685.837,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,27123.893,0.0,0.0,0.0,0.0,0.0,0.0
42_151.035_0.551,2904683.0,3229114.0,2554332.0,2958271.0,2379195.0,3030414.0,2827136.0,2491917.5,2376710.0,2257513.8,...,903546.6,1150350.0,1103477.9,2638109.2,1446267.0,595216.5,1225695.2,1424855.0,1557217.0,1797692.0
1870_151.035_0.887,122673.88,61871.51,66445.54,29861.64,31279.658,72020.5,64247.684,71230.18,12425.48,36686.754,...,0.0,0.0,0.0,314371.84,0.0,0.0,0.0,0.0,0.0,0.0
2127_151.096_0.986,11242.242,0.0,4264.007,7410.396,0.0,8190.269,0.0,6547.927,4608.155,9321.855,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Checking the tables:

In [15]:
# check if new_ft column names and md row names are the same
if sorted(new_ft.columns) == sorted(new_md.index):
    print(f"All {len(new_ft.columns)} files are present in both new_md & new_ft.")
else:
    print("Not all files are present in both new_md & new_ft.\n")
    # print the md rows / ft column which are not in ft columns / md rows and remove them
    ft_cols_not_in_md = [col for col in new_ft.columns if col not in new_md.index]
    print(f"These {len(ft_cols_not_in_md)} columns of feature table are not present in metadata table and will be removed:\n{', '.join(ft_cols_not_in_md)}\n")
    new_ft.drop(columns=ft_cols_not_in_md, inplace=True)
    md_rows_not_in_ft = [row for row in new_md.index if row not in new_ft.columns]
    print(f"These {len(md_rows_not_in_ft)} rows of metadata table are not present in feature table and will be removed:\n{', '.join(md_rows_not_in_ft)}\n")
    new_md.drop(md_rows_not_in_ft, inplace=True)

All 186 files are present in both new_md & new_ft.


In [16]:
new_ft = new_ft.reindex(sorted(new_ft.columns), axis=1) #ordering the ft by its column names
new_md.sort_index(inplace=True) #ordering the md by its row names

In [17]:
# checking the dimensions of our new ft and md
print(f"The number of rows and columns in our original ft is: {ft.shape}")
print(f"The number of rows and columns in our new ft is: {new_ft.shape}")
print(f"The number of rows and columns in our original md is: {md.shape}")
print(f"The number of rows and columns in our new md is: {new_md.shape}\n")

The number of rows and columns in our original ft is: (11217, 200)
The number of rows and columns in our new ft is: (11217, 186)
The number of rows and columns in our original md is: (186, 13)
The number of rows and columns in our new md is: (186, 13)



Notice that the number of columns in the feature table is the same as the number of rows in our metadata. Now both our feature table and metadata are in the same order.

In [18]:
# checking if the files are in the same order
list(new_ft.columns) == list(new_md.index)

True
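
The matching and ordering steps above can also be expressed with an index intersection. This is a minimal sketch on toy tables, offered as an alternative formulation rather than the notebook's own method; all names prefixed with `toy_` or `aligned_` are hypothetical.

```python
import pandas as pd

# toy feature table (features x files) and metadata (files x attributes)
toy_ft = pd.DataFrame(
    [[10.0, 0.0, 5.0], [3.0, 8.0, 0.0]],
    index=["1_151.035_0.551", "2_151.096_0.986"],
    columns=["b.mzXML", "a.mzXML", "extra.mzXML"],
)
toy_md = pd.DataFrame(
    {"ATTRIBUTE_Sample.Type": ["Sample", "Blank", "Sample"]},
    index=["a.mzXML", "b.mzXML", "missing.mzXML"],
)

# keep only the files present in both tables, in sorted order
shared_files = sorted(toy_ft.columns.intersection(toy_md.index))
aligned_ft = toy_ft[shared_files]
aligned_md = toy_md.loc[shared_files]

print(list(aligned_ft.columns) == list(aligned_md.index))
```

Files that appear in only one of the two tables (`extra.mzXML` and `missing.mzXML` here) are dropped silently, so it is worth printing them first when working with real data.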

Let's check the files once again!!

In [19]:
print('Dimension: ',new_ft.shape)
new_ft.head()

Dimension:  (11217, 186)


Unnamed: 0,SD_01-2018_10_a.mzXML,SD_01-2018_10_b.mzXML,SD_01-2018_11_a.mzXML,SD_01-2018_11_b.mzXML,SD_01-2018_12_a.mzXML,SD_01-2018_12_b.mzXML,SD_01-2018_13_a.mzXML,SD_01-2018_13_b.mzXML,SD_01-2018_14_a.mzXML,SD_01-2018_14_b.mzXML,...,SD_12-2017_6_a.mzXML,SD_12-2017_6_b.mzXML,SD_12-2017_7_a.mzXML,SD_12-2017_7_b.mzXML,SD_12-2017_8_a.mzXML,SD_12-2017_8_b.mzXML,SD_12-2017_9_a.mzXML,SD_12-2017_9_b.mzXML,SD_12-2017_PPL_Bl_1.mzXML,SD_12-2017_PPL_Bl_2.mzXML
92572_151.035_13.364,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1560.679,0.0,0.0,3939.107,0.0,0.0,0.0
2513_151.035_1.13,0.0,156590.55,0.0,0.0,0.0,0.0,22862.7,0.0,29359.463,0.0,...,0.0,0.0,0.0,11498.99,0.0,0.0,0.0,0.0,0.0,0.0
42_151.035_0.551,2863941.0,3687233.2,2810288.0,2321774.2,3195918.0,2765738.8,4439634.0,3591492.5,2985472.0,3484729.0,...,1856001.5,1766485.0,1287448.5,1491507.0,1728245.0,1547097.4,1262373.0,1280963.1,4432.9683,6813.541
1870_151.035_0.887,201483.2,85594.53,23923.246,20954.787,81281.12,79683.164,140293.1,256066.56,249608.58,233550.1,...,16260.477,9554.87,73896.3,53041.84,8907.969,30851.541,0.0,0.0,0.0,0.0
2127_151.096_0.986,4317.684,14283.897,0.0,0.0,8685.125,0.0,7383.013,0.0,4742.709,4784.927,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [20]:
print('Dimension: ',new_md.shape)
new_md.head()

Dimension:  (186, 13)


Unnamed: 0,ATTRIBUTE_Sample.Type,ATTRIBUTE_Batch,ATTRIBUTE_Month,ATTRIBUTE_Year,ATTRIBUTE_Sample_Location,ATTRIBUTE_Replicate,ATTRIBUTE_Spot,ATTRIBUTE_Latitude,ATTRIBUTE_Longitude,ATTRIBUTE_Sample_Area,ATTRIBUTE_Spot_Name,ATTRIBUTE_time_run,ATTRIBUTE_Injection_order
SD_01-2018_10_a.mzXML,Sample,2,Jan,2018,10,a,10,32.86261,-117.26042,SIO_La_Jolla_Shores,SIO_South_Pier,16/01/2018 16:23,83
SD_01-2018_10_b.mzXML,Sample,2,Jan,2018,10,b,10,32.86261,-117.26042,SIO_La_Jolla_Shores,SIO_South_Pier,16/01/2018 16:39,84
SD_01-2018_11_a.mzXML,Sample,2,Jan,2018,11,a,11,32.85601,-117.26253,SIO_La_Jolla_Shores,La_Jolla_Shores,16/01/2018 16:55,85
SD_01-2018_11_b.mzXML,Sample,2,Jan,2018,11,b,11,32.85601,-117.26253,SIO_La_Jolla_Shores,La_Jolla_Shores,16/01/2018 17:10,86
SD_01-2018_12_a.mzXML,Sample,2,Jan,2018,12,a,12,32.85161,-117.26965,La_Jolla_Cove,Cove,16/01/2018 17:26,87


# Splitting the data into Blanks and Samples using Metadata:
<a id="data_split"></a>

The first cleanup step, blank removal, requires splitting the feature table into the columns measured from blanks and the columns measured from samples, using the metadata. More about blank removal in the next section.

In [21]:
inside_levels(new_md)

Unnamed: 0,ATTRIBUTES,LEVELS,COUNT,TYPES
1,ATTRIBUTE_Sample.Type,"[Blank, Sample]","[6, 180]",<class 'str'>
2,ATTRIBUTE_Batch,"[1, 2, 3]","[62, 62, 62]",<class 'numpy.int64'>
3,ATTRIBUTE_Month,"[Dec, Jan, Oct]","[62, 62, 62]",<class 'str'>
4,ATTRIBUTE_Year,"[2017, 2018]","[62, 124]",<class 'numpy.int64'>
5,ATTRIBUTE_Sample_Location,"[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...","[6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...",<class 'numpy.int64'>
6,ATTRIBUTE_Replicate,"[a, b]","[93, 93]",<class 'str'>
7,ATTRIBUTE_Spot,"[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...","[6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...",<class 'numpy.int64'>
8,ATTRIBUTE_Latitude,"[32.75645, 32.75743, 32.75905, 32.76115, 32.76...","[6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...",<class 'numpy.float64'>
9,ATTRIBUTE_Longitude,"[-117.2872, -117.28664, -117.286, -117.28355, ...","[6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...",<class 'numpy.float64'>
10,ATTRIBUTE_Sample_Area,"[Blank, La_Jolla Reefs, La_Jolla_Cove, Mission...","[6, 36, 12, 36, 18, 12, 18, 48]",<class 'str'>


In case we want to remove the files of a particular condition, e.g. ATTRIBUTE_Sample = "M", we can subset them out of our dataframe using the next cell. 

In [22]:
# subset_data = new_md[new_md['ATTRIBUTE_Sample']!='M']
# print('Dimension: ',subset_data.shape)
# inside_levels(subset_data)

Once we subset the data, we can further proceed to split the blanks from the sample in the cell below. If no subsetting is involved, you can simply split your metadata into blank and sample.

In [23]:
#If subset_data exists, it will take it as "data", else take new_md as "data"
if 'subset_data' in locals():
    data = subset_data
else:
    data = new_md
display(inside_levels(data))

condition = int(input("Enter the index number of the attribute to split sample and blank: "))
df = pd.DataFrame({"LEVELS": inside_levels(data).iloc[condition-1]["LEVELS"]})
df.index = [*range(1, len(df)+1)]
display(df)

# Among the shown levels of the attribute, select the one that marks the blanks
blank_id = int(input("Enter the index number of your BLANK: "))
print('Your chosen blank is: ', df['LEVELS'][blank_id])

#Splitting the data into blanks and samples based on the metadata
md_blank = data[data[inside_levels(data)['ATTRIBUTES'][condition]] == df['LEVELS'][blank_id]]
blank = new_ft[list(md_blank.index)]
md_samples = data[data[inside_levels(data)['ATTRIBUTES'][condition]] != df['LEVELS'][blank_id]]
samples = new_ft[list(md_samples.index)]

Unnamed: 0,ATTRIBUTES,LEVELS,COUNT,TYPES
1,ATTRIBUTE_Sample.Type,"[Blank, Sample]","[6, 180]",<class 'str'>
2,ATTRIBUTE_Batch,"[1, 2, 3]","[62, 62, 62]",<class 'numpy.int64'>
3,ATTRIBUTE_Month,"[Dec, Jan, Oct]","[62, 62, 62]",<class 'str'>
4,ATTRIBUTE_Year,"[2017, 2018]","[62, 124]",<class 'numpy.int64'>
5,ATTRIBUTE_Sample_Location,"[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...","[6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...",<class 'numpy.int64'>
6,ATTRIBUTE_Replicate,"[a, b]","[93, 93]",<class 'str'>
7,ATTRIBUTE_Spot,"[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...","[6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...",<class 'numpy.int64'>
8,ATTRIBUTE_Latitude,"[32.75645, 32.75743, 32.75905, 32.76115, 32.76...","[6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...",<class 'numpy.float64'>
9,ATTRIBUTE_Longitude,"[-117.2872, -117.28664, -117.286, -117.28355, ...","[6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...",<class 'numpy.float64'>
10,ATTRIBUTE_Sample_Area,"[Blank, La_Jolla Reefs, La_Jolla_Cove, Mission...","[6, 36, 12, 36, 18, 12, 18, 48]",<class 'str'>


Unnamed: 0,LEVELS
1,Blank
2,Sample


Your chosen blank is:  Blank
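
When the attribute and the blank level are already known, the same split can be written without interactive prompts by using a boolean mask. The sketch below assumes the attribute is called `ATTRIBUTE_Sample.Type` and the blank level `Blank`, as in the output above; the toy tables and the `toy_` names are made up for illustration.

```python
import pandas as pd

toy_md = pd.DataFrame(
    {"ATTRIBUTE_Sample.Type": ["Blank", "Sample", "Sample"]},
    index=["blank_1.mzXML", "s1.mzXML", "s2.mzXML"],
)
toy_ft = pd.DataFrame(
    [[100.0, 900.0, 1100.0], [50.0, 0.0, 20.0]],
    index=["feature_1", "feature_2"],
    columns=toy_md.index,
)

# boolean mask marking the metadata rows that are blanks
is_blank = toy_md["ATTRIBUTE_Sample.Type"] == "Blank"

# select the matching columns of the feature table
toy_blank = toy_ft.loc[:, toy_md.index[is_blank]]
toy_samples = toy_ft.loc[:, toy_md.index[~is_blank]]
print(toy_blank.shape, toy_samples.shape)
```

A hard-coded version like this is convenient for re-running the notebook unattended, while the interactive cell is easier for exploring an unfamiliar metadata table.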


In [24]:
# Display the chosen blank
print('Dimension: ',blank.shape)
blank.head()

Dimension:  (11217, 6)


Unnamed: 0,SD_01-2018_PPL_Bl_1.mzXML,SD_01-2018_PPL_Bl_2.mzXML,SD_10_2018_PPL_Blank_1.mzXML,SD_10_2018_PPL_Blank_2.mzXML,SD_12-2017_PPL_Bl_1.mzXML,SD_12-2017_PPL_Bl_2.mzXML
92572_151.035_13.364,0.0,0.0,0.0,0.0,0.0,0.0
2513_151.035_1.13,0.0,0.0,0.0,0.0,0.0,0.0
42_151.035_0.551,80114.62,21310.246,74143.17,105766.586,4432.9683,6813.541
1870_151.035_0.887,0.0,0.0,0.0,0.0,0.0,0.0
2127_151.096_0.986,23387.723,21032.016,0.0,115959.33,0.0,0.0


In [25]:
# Display the chosen samples
print('Dimension: ',samples.shape)
samples.head()

Dimension:  (11217, 180)


Unnamed: 0,SD_01-2018_10_a.mzXML,SD_01-2018_10_b.mzXML,SD_01-2018_11_a.mzXML,SD_01-2018_11_b.mzXML,SD_01-2018_12_a.mzXML,SD_01-2018_12_b.mzXML,SD_01-2018_13_a.mzXML,SD_01-2018_13_b.mzXML,SD_01-2018_14_a.mzXML,SD_01-2018_14_b.mzXML,...,SD_12-2017_5_a.mzXML,SD_12-2017_5_b.mzXML,SD_12-2017_6_a.mzXML,SD_12-2017_6_b.mzXML,SD_12-2017_7_a.mzXML,SD_12-2017_7_b.mzXML,SD_12-2017_8_a.mzXML,SD_12-2017_8_b.mzXML,SD_12-2017_9_a.mzXML,SD_12-2017_9_b.mzXML
92572_151.035_13.364,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1560.679,0.0,0.0,3939.107,0.0
2513_151.035_1.13,0.0,156590.55,0.0,0.0,0.0,0.0,22862.7,0.0,29359.463,0.0,...,0.0,0.0,0.0,0.0,0.0,11498.99,0.0,0.0,0.0,0.0
42_151.035_0.551,2863941.0,3687233.2,2810288.0,2321774.2,3195918.0,2765738.8,4439634.0,3591492.5,2985472.0,3484729.0,...,1354254.0,1318947.0,1856001.5,1766485.0,1287448.5,1491507.0,1728245.0,1547097.4,1262373.0,1280963.1
1870_151.035_0.887,201483.2,85594.53,23923.246,20954.787,81281.12,79683.164,140293.1,256066.56,249608.58,233550.1,...,6334.812,0.0,16260.477,9554.87,73896.3,53041.84,8907.969,30851.541,0.0,0.0
2127_151.096_0.986,4317.684,14283.897,0.0,0.0,8685.125,0.0,7383.013,0.0,4742.709,4784.927,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


**Now that we have our data ready, we can start with the cleanup steps!!**

# Step 1: Blank Removal

<p style='text-align: justify;'> In LC-MS/MS, blanks are solvent-only injections that are usually run from time to time to prevent carryover of the sample. The LC-MS/MS instrument also detects features in these blanks, and our goal here is to remove those features from our samples. Other kinds of blank signals can be removed in the same way, for example signals from the growth medium alone in a microbial growth experiment, or signals from the solvent used for extraction. It is therefore best practice to acquire mass spectra of these blanks in addition to your sample spectra. </p>

**How do we remove these blank features?** <br>
<p style='text-align: justify;'> Since the feature table is now split into control blanks and samples, we can compare the blanks with the samples to identify the background features coming from the blanks. A common filtering method is to use a cutoff to remove features whose signal in the biological samples is not sufficiently higher than in the blanks. </p>

The steps followed in the next few cells are:
1. <p style='text-align: justify;'> We calculate the mean intensity of every feature across the blank set and across the sample set. For n features, this gives n mean values for the blanks and n mean values for the samples. </p>
2. <p style='text-align: justify;'> Next, we calculate the ratio average_blank / average_sample for every feature (the code adds 1 to both means to avoid division by zero). This ratio tells us how large the blank signal of a feature is relative to its sample signal. If the ratio reaches or exceeds the cutoff (for example 30%, i.e. a cutoff of 0.3), we consider the feature as noise. </p>
3. <p style='text-align: justify;'> The result of this comparison is stored as a boolean value per feature: True = real feature (ratio < cutoff), False = noise or background signal (ratio >= cutoff). </p>
4. <p style='text-align: justify;'> We count the features that fail the condition ratio < cutoff, consider them 'noise or background features' and remove them. </p>
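
The steps above can be sketched on a tiny made-up table. The feature names, file names and intensities below are invented for illustration only and the variable names carry a `_toy` suffix so they do not clash with the notebook's own variables.

```python
import pandas as pd

# Toy feature tables: rows = features, columns = files (all values are made up)
blank_toy = pd.DataFrame({"blank_1": [100.0, 0.0, 50.0],
                          "blank_2": [300.0, 0.0, 150.0]},
                         index=["feat_A", "feat_B", "feat_C"])
samples_toy = pd.DataFrame({"sample_1": [250.0, 900.0, 4000.0],
                            "sample_2": [350.0, 1100.0, 6000.0]},
                           index=["feat_A", "feat_B", "feat_C"])
cutoff_toy = 0.3

# Step 1: mean intensity per feature in the blanks and in the samples
avg_blank_toy = blank_toy.mean(axis=1)
avg_samples_toy = samples_toy.mean(axis=1)

# Step 2: blank/sample ratio; adding 1 avoids division by zero
ratio_toy = (avg_blank_toy + 1) / (avg_samples_toy + 1)

# Steps 3 and 4: True = real feature (ratio below cutoff), then keep only those rows
keep_toy = ratio_toy < cutoff_toy
filtered_toy = samples_toy[keep_toy]

print(ratio_toy.round(3))
print(filtered_toy)
```

In this toy table, feat_A has a blank mean of 200 against a sample mean of 300, so it is treated as background, while the other two features are kept.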

**<font color='red'> The Cutoff used to obtain all the files in the MZmine Results folder is 0.3 </font>**

In [26]:
blank_removal = samples.copy()
if (input("Do you want to perform Blank Removal- Y/N: ").upper()=="Y"):
    
    # When cutoff is low, more noise (or background) detected; With higher cutoff, less background detected, thus more features observed
    cutoff = float(input("Enter Cutoff value between 0.1 & 1 (Ideal cutoff range: 0.1-0.3): ")) # (i.e. 10% - 100%). Ideal cutoff range: 0.1-0.3
    
    # Getting mean for every feature in blank and Samples
    avg_blank = blank.mean(axis=1, skipna=False) # skipna=False: NA/null values are not skipped, so a feature containing any NA gets an NA mean
    avg_samples = samples.mean(axis=1, skipna=False)

    # Getting the ratio of blank vs samples
    ratio_blank_samples = (avg_blank+1)/(avg_samples+1)

    # Create an array with boolean values: True (is a real feature, ratio<cutoff) / False (is a blank, background, noise feature, ratio>cutoff)
    is_real_feature = (ratio_blank_samples<cutoff)

    # Checking if there are any NA values present. Having NA values in the 4 variables will affect the final dataset to be created
    temp_NA_Count = pd.concat([avg_blank, avg_samples, ratio_blank_samples, is_real_feature], 
                            keys=['avg_blank', 'avg_samples', 'ratio_blank_samples', 'bg_bin'], axis = 1)
    
    print('No. of NA values in the following columns: ')
    display(pd.DataFrame(temp_NA_Count.isna().sum(), columns=['NA']))

    # Counting the background features (to be removed) and the real features (to be kept); sum(is_real_feature) is the number of features kept
    print(f"No. of Background or noise features: {len(samples)-sum(is_real_feature)}")
    print(f"No. of features after excluding noise: {sum(is_real_feature)}")

    blank_removal = samples[is_real_feature.values]
    # save to file
    blank_removal.to_csv(os.path.join(result_dir, "Blanks_Removed.csv"))

No. of NA values in the following columns: 


Unnamed: 0,NA
avg_blank,0
avg_samples,0
ratio_blank_samples,0
bg_bin,0


No. of Background or noise features: 2125
No. of features after excluding noise: 9092


In [27]:
print('Dimension: ',blank_removal.shape)
display(blank_removal.head())

Dimension:  (9092, 180)


Unnamed: 0,SD_01-2018_10_a.mzXML,SD_01-2018_10_b.mzXML,SD_01-2018_11_a.mzXML,SD_01-2018_11_b.mzXML,SD_01-2018_12_a.mzXML,SD_01-2018_12_b.mzXML,SD_01-2018_13_a.mzXML,SD_01-2018_13_b.mzXML,SD_01-2018_14_a.mzXML,SD_01-2018_14_b.mzXML,...,SD_12-2017_5_a.mzXML,SD_12-2017_5_b.mzXML,SD_12-2017_6_a.mzXML,SD_12-2017_6_b.mzXML,SD_12-2017_7_a.mzXML,SD_12-2017_7_b.mzXML,SD_12-2017_8_a.mzXML,SD_12-2017_8_b.mzXML,SD_12-2017_9_a.mzXML,SD_12-2017_9_b.mzXML
92572_151.035_13.364,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1560.679,0.0,0.0,3939.107,0.0
2513_151.035_1.13,0.0,156590.55,0.0,0.0,0.0,0.0,22862.7,0.0,29359.46,0.0,...,0.0,0.0,0.0,0.0,0.0,11498.99,0.0,0.0,0.0,0.0
42_151.035_0.551,2863941.0,3687233.2,2810288.0,2321774.2,3195918.0,2765738.8,4439634.5,3591492.0,2985472.0,3484729.0,...,1354254.0,1318947.0,1856002.0,1766485.0,1287448.0,1491507.0,1728245.0,1547097.4,1262373.0,1280963.1
1870_151.035_0.887,201483.2,85594.53,23923.246,20954.787,81281.12,79683.164,140293.06,256066.6,249608.6,233550.1,...,6334.812,0.0,16260.48,9554.87,73896.3,53041.84,8907.969,30851.541,0.0,0.0
1653_152.057_0.847,5206.803,5580.02,0.0,0.0,10935.638,10142.98,10469.594,10249.44,3827.583,2129.518,...,6396.476,4122.9,7767.999,11943.939,2908.759,5800.791,17471.045,15480.93,12566.81,15552.044


# Step 2: Imputation

<p style='text-align: justify;'> For several reasons, real-world datasets may contain missing values in the form of NA, NaN or 0. Even though the gap-filling step of MZmine fills in many missing values, some missing values or 0s still remain in our feature table. This can be problematic for statistical analysis. </p> 
<p style='text-align: justify;'> We cannot simply discard the rows or columns with missing values, as we would lose a large part of our valuable data. Instead, we can impute the missing values. Imputation replaces the missing values in the data with a meaningful, reasonable guess. There are several methods, such as: </p> 
  
1) Mean imputation (replacing the missing values in a column with the mean or average of the column)  
2) Replacing it with the most frequent value  
3) Several other machine learning imputation methods such as k-nearest neighbors algorithm(k-NN), Hidden Markov Model(HMM)

Here, we use the blank-removed feature table (blank_removal) and plot the frequency distribution of its intensities. The plot shows in which intensity ranges most of the values fall.
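
As a small illustration of how intensities are assigned to bins for such a frequency plot, the sketch below applies `np.digitize` with `right=True` to a few made-up intensities. The bin edges follow the pattern -1, 0, 1, 10, 100, ... up to 1e10; the names `bin_edges` and `toy_intensities` are only used in this example.

```python
import numpy as np

# Bin edges: -1, 0, 1, 10, then powers of ten up to 1e10
bin_edges = [-1, 0, 1, 10] + [10**a for a in range(2, 11)]

# A few made-up intensities
toy_intensities = np.array([0.0, 0.5, 5.0, 50.0, 100.0, 2.5e6])

# With right=True, index i means: bin_edges[i-1] < value <= bin_edges[i]
bin_index = np.digitize(toy_intensities, bin_edges, right=True)

for value, idx in zip(toy_intensities, bin_index):
    print(f"{value} falls into ({bin_edges[idx - 1]}, {bin_edges[idx]}]")
```

Because the intervals are closed on the right, a value of exactly 0 lands in the bin (-1, 0], which is why the zero bin can be read as the number of missing values.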

In [28]:
bins, bins_label, a = [-1, 0, 1, 10], ['-1','0', "1", "10"], 2

while a<=10:
    bins_label.append(np.format_float_scientific(10**a))
    bins.append(10**a)
    a+=1

freq_table = pd.DataFrame(bins_label)
frequency = pd.DataFrame(np.array(np.unique(np.digitize(blank_removal.to_numpy(), bins, right=True), return_counts=True)).T).set_index(0)
freq_table = pd.concat([freq_table,frequency], axis=1).fillna(0).drop(0)
freq_table.columns = ['intensity', 'Frequency']
freq_table['Log(Frequency)'] = np.log(freq_table['Frequency']+1)

# get the lowest intensity (that is not zero) as a cutoff LOD value
cutoff_LOD = round(blank_removal.replace(0, np.nan).min(numeric_only=True).min())

fig = px.bar(freq_table, x="intensity", y="Log(Frequency)", template="plotly_white",  width=600, height=400)

fig.update_traces(marker_color="#696880")
fig.update_layout(font={"color":"grey", "size":12, "family":"Sans"},
                  title={"text":"FEATURE INTENSITY - FREQUENCY PLOT", 'x':0.5, "font_color":"#3E3D53"})
fig.write_image(os.path.join(result_dir, "frequency_plot.svg"))
fig.show()

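The lowest non-zero intensity in the table (cutoff_LOD) is used as an estimate of the limit of detection. Each zero is replaced with a random number between zero and this minimum value.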
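
The same idea can be written in a vectorized way. The following is a minimal sketch on a made-up table, not the notebook's own implementation: it assumes integer draws starting at 1 (so that no imputed value is still 0) and uses an arbitrary seed only to make the example reproducible.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)  # arbitrary seed, only for reproducibility

# Toy feature table with zeros standing in for missing values (values are made up)
toy = pd.DataFrame({"s1": [0.0, 1500.0, 0.0],
                    "s2": [820.0, 0.0, 3000.0]},
                   index=["feat_A", "feat_B", "feat_C"])

# Smallest non-zero intensity, used here as the limit of detection (LOD)
lod_toy = int(round(toy.replace(0, np.nan).min().min()))

# One independent random integer in [1, lod_toy) for every cell
random_fill = pd.DataFrame(rng.integers(1, lod_toy, size=toy.shape),
                           index=toy.index, columns=toy.columns)

# Keep measured values, replace only the zeros with their random draw
imputed_toy = toy.mask(toy == 0, random_fill)
print(imputed_toy)
```

Every zero receives its own draw, so the imputed values do not form an artificial spike at a single intensity.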

In [29]:
imputed = blank_removal.copy()
if(input("Do you want to perform Imputation? - Y/N: ").upper()=="Y"):
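    # Replace every zero with its own random integer between 1 and cutoff_LOD - 1 (assumes cutoff_LOD > 1);
    # drawing from 1 upwards ensures that no imputed value is still 0
    imputed = imputed.apply(lambda x: [np.random.randint(1, cutoff_LOD) if v == 0 else v for v in x])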
    print('Dimension: ',imputed.shape)
    display(imputed)
    # save to file
    imputed.to_csv(os.path.join(result_dir, "Imputed_QuantTable.csv"))

Dimension:  (9092, 180)


Unnamed: 0,SD_01-2018_10_a.mzXML,SD_01-2018_10_b.mzXML,SD_01-2018_11_a.mzXML,SD_01-2018_11_b.mzXML,SD_01-2018_12_a.mzXML,SD_01-2018_12_b.mzXML,SD_01-2018_13_a.mzXML,SD_01-2018_13_b.mzXML,SD_01-2018_14_a.mzXML,SD_01-2018_14_b.mzXML,...,SD_12-2017_5_a.mzXML,SD_12-2017_5_b.mzXML,SD_12-2017_6_a.mzXML,SD_12-2017_6_b.mzXML,SD_12-2017_7_a.mzXML,SD_12-2017_7_b.mzXML,SD_12-2017_8_a.mzXML,SD_12-2017_8_b.mzXML,SD_12-2017_9_a.mzXML,SD_12-2017_9_b.mzXML
92572_151.035_13.364,4.980000e+02,873.000,150.000,385.000,63.000,626.000,231.000,2.880000e+02,5.100000e+02,2.540000e+02,...,1.540000e+02,684.0,1.880000e+02,232.000,4.900000e+02,1.560679e+03,657.000,61.000,3.939107e+03,304.000
2513_151.035_1.13,6.650000e+02,156590.550,673.000,459.000,373.000,63.000,22862.700,3.220000e+02,2.935946e+04,5.350000e+02,...,1.900000e+01,743.0,1.270000e+02,798.000,2.130000e+02,1.149899e+04,693.000,186.000,3.370000e+02,726.000
42_151.035_0.551,2.863941e+06,3687233.200,2810288.000,2321774.200,3195918.000,2765738.800,4439634.500,3.591492e+06,2.985472e+06,3.484729e+06,...,1.354254e+06,1318947.0,1.856002e+06,1766485.000,1.287448e+06,1.491507e+06,1728245.000,1547097.400,1.262373e+06,1280963.100
1870_151.035_0.887,2.014832e+05,85594.530,23923.246,20954.787,81281.120,79683.164,140293.060,2.560666e+05,2.496086e+05,2.335501e+05,...,6.334812e+03,349.0,1.626048e+04,9554.870,7.389630e+04,5.304184e+04,8907.969,30851.541,7.770000e+02,122.000
1653_152.057_0.847,5.206803e+03,5580.020,57.000,7.000,10935.638,10142.980,10469.594,1.024944e+04,3.827583e+03,2.129518e+03,...,6.396476e+03,4122.9,7.767999e+03,11943.939,2.908759e+03,5.800791e+03,17471.045,15480.930,1.256681e+04,15552.044
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
91167_1442.399_12.598,3.111652e+05,189971.030,386000.970,138754.100,477054.380,109891.700,201078.030,3.259739e+05,1.406355e+05,4.215174e+05,...,7.400000e+01,759.0,3.710000e+02,754.000,4.610000e+02,6.340000e+02,498.000,800.000,8.330000e+02,217.000
90242_1442.399_12.377,2.276051e+05,558666.940,307892.380,152410.160,112337.055,80392.050,160174.360,8.876620e+04,2.441697e+05,1.880748e+05,...,7.620000e+02,567.0,1.350000e+02,655.000,1.520000e+02,8.220000e+02,627.000,475.000,6.610000e+02,96.000
88493_1442.399_11.706,2.040702e+05,683280.750,307918.940,188791.100,184659.190,63493.695,144327.080,1.713194e+05,2.228125e+05,9.528194e+04,...,4.320000e+02,632.0,7.800000e+02,234.000,6.900000e+01,8.910000e+02,567.000,226.000,5.140000e+02,31.000
90600_1443.399_12.376,4.919888e+05,592310.400,308512.250,330795.970,103600.140,311601.620,172762.140,6.116949e+05,5.126081e+05,1.308232e+05,...,8.800000e+02,737.0,1.400000e+01,494.000,2.260000e+02,2.970000e+02,426.000,818.000,7.450000e+02,743.000


Too many missing values are problematic for statistical analyses. Here we count, for each feature, the number of missing values (after imputation, these are the values at or below cutoff_LOD) and display the counts in a histogram.

TODO move plot up before imputation

In [30]:
# check the number of missing values per feature in a histogram
n_zeros = imputed.T.apply(lambda x: sum(x<=cutoff_LOD))

fig = px.histogram(n_zeros, template="plotly_white",  
                   width=600, height=400)

fig.update_traces(marker_color="#696880")
fig.update_layout(font={"color":"grey", "size":12, "family":"Sans"},
                  title={"text":"MISSING VALUES PER FEATURE", 'x':0.5, "font_color":"#3E3D53"},
                  xaxis_title="number of missing values", yaxis_title="count", showlegend=False)
fig.write_image(os.path.join(result_dir, "number_of_missing_values_per_feature.svg"))
fig.show()

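# Step 3: Normalization
The following code performs sample-centric (column-wise) normalization: each intensity is divided by the sum of its column, so that the values of every sample add up to 1.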

In [31]:
normalized = imputed.copy()
if(input("Do you want to perform Normalization? - Y/N: ").upper()=="Y"):
    # Dividing each element of a particular column with its column sum
    normalized = normalized.apply(lambda x: x/np.sum(x), axis=0)
    
    # save to file
    normalized.to_csv(os.path.join(result_dir, "Normalised_Quant_table.csv"))
    
    print('Dimension: ', normalized.shape)
    display(normalized.head())
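
A quick way to check what column-wise normalization does is to run it on a tiny made-up table and confirm that every column sums to 1 afterwards. The table and the `_toy` names below are for illustration only.

```python
import pandas as pd

# Toy feature table: rows = features, columns = samples (values are made up)
toy = pd.DataFrame({"sample_1": [10.0, 30.0, 60.0],
                    "sample_2": [1.0, 1.0, 2.0]},
                   index=["feat_A", "feat_B", "feat_C"])

# Divide every column by its own sum (the division aligns on column names)
normalized_toy = toy / toy.sum(axis=0)

print(normalized_toy)
print(normalized_toy.sum(axis=0))  # each sample now sums to 1
```

After this step, intensities are expressed as fractions of the total signal of each sample, which removes differences in overall signal between samples.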

# Step 4: Transposing

In [39]:
# transposing the imputed table before scaling
transposed = imputed.T
print(f'Imputed feature table rows/columns: {transposed.shape}')
display(transposed.head(3))
# put the rows in the feature table and metadata in the same order
transposed.sort_index(inplace=True)
md_samples.sort_index(inplace=True)

if (md_samples.index == transposed.index).all():
    pass
else:
    print("WARNING: Sample names in feature and metadata table are NOT the same!")

transposed.to_csv(os.path.join(result_dir, "Imputed_QuantTable_transposed.csv"))

Imputed feature table rows/columns: (180, 9092)


Unnamed: 0,92572_151.035_13.364,2513_151.035_1.13,42_151.035_0.551,1870_151.035_0.887,1653_152.057_0.847,39_153.033_0.55,91313_153.138_12.628,5376_155.07_2.215,48546_155.07_6.351,8717_157.086_2.771,...,89389_1370.381_11.978,90659_1370.382_12.502,92155_1370.382_12.968,89872_1370.954_12.136,89496_1442.399_12.015,91167_1442.399_12.598,90242_1442.399_12.377,88493_1442.399_11.706,90600_1443.399_12.376,91938_1443.4_12.895
SD_01-2018_10_a.mzXML,498.0,665.0,2863941.0,201483.19,5206.8027,513062.94,25667.047,18414.531,238.0,5073.1,...,486.0,468.0,296.0,46.0,708503.75,311165.25,227605.08,204070.17,491988.75,447.0
SD_01-2018_10_b.mzXML,873.0,156590.55,3687233.2,85594.53,5580.02,634986.6,74335.484,9468.36,253.0,531.0,...,64.0,299.0,658.0,670.0,313226.5,189971.03,558666.94,683280.75,592310.4,128523.984
SD_01-2018_11_a.mzXML,150.0,673.0,2810288.0,23923.246,57.0,425097.75,96858.484,86009.2,516.0,608.0,...,664.0,550.0,41.0,543.0,401128.94,386000.97,307892.38,307918.94,308512.25,114226.63


# Step 5: Scaling
For statistical analysis, all features should be brought to a comparable scale across the complete dataframe. Here, each feature is centered to a mean of 0 and scaled to unit variance.
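
To make centering and scaling concrete, the sketch below reproduces the calculation with plain NumPy on made-up numbers: subtract the column mean and divide by the population standard deviation (`ddof=0`), which is the estimator scikit-learn's `StandardScaler` uses by default.

```python
import numpy as np

# Toy matrix: rows = samples, columns = features (values are made up)
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

# Center each feature to mean 0 and scale it to unit variance (population std, ddof=0)
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_scaled)
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```

Although the two toy features differ by two orders of magnitude, they are identical after scaling, so a high-intensity feature no longer dominates the analysis simply because of its scale.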

In [40]:
# scale filtered data
scaled = pd.DataFrame(StandardScaler().fit_transform(transposed), index=transposed.index, columns=transposed.columns)
scaled.to_csv(os.path.join(result_dir, "Imputed_Scaled_QuantTable.csv"))

# Merge feature table and metadata to one dataframe:
# "how=inner" performs an inner join (only the filenames that appear in both md_samples and scaled are kept)
data = pd.merge(md_samples, scaled, left_index=True, right_index=True, how="inner")
display(data.head())

Unnamed: 0,ATTRIBUTE_Sample.Type,ATTRIBUTE_Batch,ATTRIBUTE_Month,ATTRIBUTE_Year,ATTRIBUTE_Sample_Location,ATTRIBUTE_Replicate,ATTRIBUTE_Spot,ATTRIBUTE_Latitude,ATTRIBUTE_Longitude,ATTRIBUTE_Sample_Area,...,89389_1370.381_11.978,90659_1370.382_12.502,92155_1370.382_12.968,89872_1370.954_12.136,89496_1442.399_12.015,91167_1442.399_12.598,90242_1442.399_12.377,88493_1442.399_11.706,90600_1443.399_12.376,91938_1443.4_12.895
SD_01-2018_10_a.mzXML,Sample,2,Jan,2018,10,a,10,32.86261,-117.26042,SIO_La_Jolla_Shores,...,-0.168485,-0.188218,-0.183218,-0.553889,4.350676,2.029857,2.142129,2.071845,2.574823,-0.271868
SD_01-2018_10_b.mzXML,Sample,2,Jan,2018,10,b,10,32.86261,-117.26042,SIO_La_Jolla_Shores,...,-0.176582,-0.191055,-0.17451,-0.549257,1.724644,1.096265,5.771122,7.714252,3.172598,0.782817
SD_01-2018_11_a.mzXML,Sample,2,Jan,2018,11,a,11,32.85601,-117.26253,SIO_La_Jolla_Shores,...,-0.16507,-0.186842,-0.189353,-0.550199,2.308626,2.606336,3.022212,3.2946,1.481562,0.665082
SD_01-2018_11_b.mzXML,Sample,2,Jan,2018,11,b,11,32.85601,-117.26253,SIO_La_Jolla_Shores,...,-0.1757,-0.187429,-0.182785,-0.554089,2.363625,0.701728,1.317866,1.891943,1.614342,4.752438
SD_01-2018_12_a.mzXML,Sample,2,Jan,2018,12,a,12,32.85161,-117.26965,La_Jolla_Cove,...,-0.168025,-0.181655,-0.170926,-0.553674,1.921091,3.307744,0.878598,1.843293,0.260576,-0.270641


# Univariate:

**Run ANOVA** <br>

We now use the function anova from the pingouin library to run the ANOVA. Since one ANOVA is run for each metabolite feature, we run the analyses in a loop and collect the p-value and F-value of each feature in a dataframe called anova.<br>

The feature columns are taken from scaled.columns, so the metadata columns of the merged dataframe are not tested. <br>

The generator function gen_anova_data passes each feature column as the dependent variable (dv) to pg.anova, while the grouping variable (between), here the sampling area, stays constant.
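
To show what the F-value of a one-way ANOVA measures, here is a hand calculation on three made-up groups using only NumPy. It is a sketch of the statistic itself (variance between groups divided by variance within groups), not a replacement for pingouin, and it does not compute a p-value.

```python
import numpy as np

# Made-up intensities of one feature in three hypothetical groups
groups = {
    "site_1": np.array([1.0, 2.0, 3.0]),
    "site_2": np.array([2.0, 3.0, 4.0]),
    "site_3": np.array([5.0, 6.0, 7.0]),
}

values = np.concatenate(list(groups.values()))
grand_mean = values.mean()
k = len(groups)         # number of groups
n_total = values.size   # total number of observations

# Sum of squares between groups and within groups
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups.values())
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups.values())

# F = mean square between / mean square within
f_value = (ss_between / (k - 1)) / (ss_within / (n_total - k))
print(f_value)
```

A large F-value means that the group means differ much more than expected from the spread within the groups.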

In [43]:
# select an attribute to perform ANOVA
anova_attribute = 'ATTRIBUTE_Sample_Area'

In [44]:
def gen_anova_data(df, columns, groups_col):
    for col in columns:
        result = pg.anova(data=df, dv=col, between=groups_col, detailed=True).set_index('Source')
        p = result.loc[groups_col, 'p-unc']
        f = result.loc[groups_col, 'F']
        yield col, p, f

dtypes = [('metabolite', 'U100'), ('p', 'f'), ('F', 'f')]
anova = pd.DataFrame(np.fromiter(gen_anova_data(data, scaled.columns, anova_attribute), dtype=dtypes))
anova

Unnamed: 0,metabolite,p,F
0,92572_151.035_13.364,0.065142,2.022219
1,2513_151.035_1.13,0.761474,0.560284
2,42_151.035_0.551,0.041332,2.243144
3,1870_151.035_0.887,0.831935,0.467234
4,1653_152.057_0.847,0.317671,1.182722
...,...,...,...
9087,91167_1442.399_12.598,0.000025,5.600474
9088,90242_1442.399_12.377,0.000798,4.049425
9089,88493_1442.399_11.706,0.001004,3.947034
9090,90600_1443.399_12.376,0.001902,3.660465


The following is of interest:
*   Feature ID (column 'metabolite')
*   p-value for ANOVA
*   p-value after taking multiple tests into consideration
*   F-value
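
The Bonferroni correction used for the multiple-testing adjustment is simple enough to write out by hand: each p-value is multiplied by the number of tests and capped at 1. The p-values below are made up for illustration.

```python
import numpy as np

# Made-up raw p-values from four tests
p_values = np.array([0.001, 0.02, 0.04, 0.5])

# Bonferroni: multiply by the number of tests, cap at 1
p_bonf = np.minimum(p_values * p_values.size, 1.0)

# Significance is judged on the corrected p-values
significant = p_bonf < 0.05
print(p_bonf, significant)
```

With thousands of features tested, this correction is strict: only features with very small raw p-values remain significant.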

In [45]:
# add Bonferroni corrected p-values for multiple testing correction
if 'p_bonferroni' not in anova.columns:
    anova.insert(2, 'p_bonferroni', pg.multicomp(anova['p'], method='bonf')[1])
# add significance
if 'significant' not in anova.columns:
    anova.insert(3, 'significant', anova['p_bonferroni'] < 0.05)
# sort by p-value
anova.sort_values('p', inplace=True)
# save ANOVA table
anova.to_csv(os.path.join(result_dir, 'ANOVA_results.csv'))
anova

Unnamed: 0,metabolite,p,p_bonferroni,significant,F
2812,59188_312.231_7.625,9.347287e-30,8.498553e-26,True,39.008976
1394,33200_260.196_4.886,1.585797e-28,1.441807e-24,True,36.786724
2862,57080_314.247_7.36,3.025364e-24,2.750661e-20,True,29.585419
1082,21870_246.18_3.969,6.852618e-23,6.230400e-19,True,27.468212
1035,80910_243.174_10.41,1.639375e-21,1.490520e-17,True,25.388908
...,...,...,...,...,...
535,560_217.068_0.628,9.994826e-01,1.000000e+00,False,0.049964
8355,51908_729.432_6.783,9.997658e-01,1.000000e+00,False,0.038010
4873,19113_381.238_3.737,9.998580e-01,1.000000e+00,False,0.032024
4416,47359_365.219_6.284,9.998635e-01,1.000000e+00,False,0.031593


**Plot ANOVA results**

We will use plotly to visualize the results from the ANOVA, with log(F-values) on the x-axis and -log(p) on the y-axis. Features are colored by statistical significance after multiple test correction. Since the F- and p-values span several orders of magnitude, it is easier to plot their logarithms.

We can also display the names of some of the top features in the plot. The plot quickly becomes cluttered if too many names are displayed, so labelling only the top 4 features is a good starting point.

In [46]:
# first plot insignificant features
fig = px.scatter(x=anova[anova['significant'] == False]['F'].apply(np.log),
                y=anova[anova['significant'] == False]['p'].apply(lambda x: -np.log(x)),
                template='plotly_white', width=600, height=600)
fig.update_traces(marker_color="#696880")

# plot significant features
fig.add_scatter(x=anova[anova['significant']]['F'].apply(np.log),
                y=anova[anova['significant']]['p'].apply(lambda x: -np.log(x)),
                mode='markers+text',
                text=anova['metabolite'].iloc[:4],
                textposition='top left', textfont=dict(color='#ef553b', size=7), name='significant')

fig.update_layout(font={"color":"grey", "size":12, "family":"Sans"},
                  title={"text":"ANOVA - FEATURE SIGNIFICANCE", 'x':0.5, "font_color":"#3E3D53"},
                  xaxis_title="log(F)", yaxis_title="-log(p)", showlegend=False)

# save fig as pdf
fig.write_image(os.path.join(result_dir, "plot_ANOVA.pdf"), scale=3)

fig.show()

In [47]:
# boxplots with top 4 metabolites from ANOVA
for metabolite in anova.sort_values('p_bonferroni').iloc[:4, 0]:
    fig = px.box(data, x=anova_attribute, y=metabolite, color=anova_attribute)
    fig.update_layout(showlegend=False, title=metabolite, xaxis_title="", yaxis_title="intensity", template="plotly_white", width=500)
    display(fig)

**Tukey's post hoc test:**

Define functions to run Tukey's post hoc test and plot results


In [75]:
# functions to run Tukey's and plot results

def tukey_post_hoc_test(anova_attribute, contrasts, metabolites):
    """
    Perform pairwise Tukey test for all metabolites between contrast combinations.

    Args:
        anova_attribute: A string representing the attribute to use in ANOVA.
        contrasts: A list of tuples, where each tuple contains two strings representing the groups to compare.
        metabolites: A list of strings representing the metabolites to test.

    Returns:
        A pandas DataFrame containing the results of the pairwise Tukey test, including the contrast,
        metabolite, metabolite ID (the integer prefix of the feature name), difference between the means,
        p-value, Bonferroni corrected p-value, and significance (True or False).
    """

    # if a single metabolite gets passed make sure to put it in a list
    if isinstance(metabolites, str):
        metabolites = [metabolites]

    def gen_pairwise_tukey(df, contrasts, metabolites):
        """ Yield results for pairwise Tukey test for all metabolites between contrast combinations."""
        for metabolite in metabolites:
            for contrast in contrasts:
                # keep only the rows of the two groups in this contrast (use the df argument, not the global data)
                df_for_tukey = df.loc[df[anova_attribute].isin(list(contrast)), [metabolite, anova_attribute]]
                pairwise_tukey = pg.pairwise_tukey(df_for_tukey, dv=metabolite, between=anova_attribute)
                yield f'{contrast[0]}-{contrast[1]}', metabolite, int(metabolite.split('_')[0]), pairwise_tukey['diff'], pairwise_tukey['p-tukey']

    dtypes = [('contrast', 'U100'), ('stats_metabolite', 'U100'), ('stats_ID', 'i'), ('stats_diff', 'f'), ('stats_p', 'f')]
    tukey = pd.DataFrame(np.fromiter(gen_pairwise_tukey(data, contrasts, metabolites), dtype=dtypes))
    # add Bonferroni corrected p-values
    tukey.insert(5, 'stats_p_bonferroni', pg.multicomp(tukey['stats_p'], method='bonf')[1])
    # add significance
    tukey.insert(6, 'stats_significant', tukey['stats_p_bonferroni'] < 0.05)
    # sort by p-value
    tukey.sort_values('stats_p', inplace=True)

    # write output to csv file
    tukey.to_csv(os.path.join(result_dir, 'TukeyHSD_output.csv'))

    return tukey

def plot_tukey(df):

    # create figure
    fig = px.scatter(template='plotly_white', width=600, height=600)

    # plot insignificant values
    fig.add_trace(go.Scatter(x=df[df['stats_significant'] == False]['stats_diff'],
                            y=df[df['stats_significant'] == False]['stats_p'].apply(lambda x: -np.log(x)),
                            mode='markers', marker_color='#696880', name='insignificant'))

    # plot significant values
    fig.add_trace(go.Scatter(x=df[df['stats_significant']]['stats_diff'],
                            y=df[df['stats_significant']]['stats_p'].apply(lambda x: -np.log(x)),
                            mode='markers+text', text=df[df['stats_significant']]['stats_metabolite'].iloc[:4], textposition='top left', 
                            textfont=dict(color='#ef553b', size=8), marker_color='#ef553b', name='significant'))

    fig.update_layout(font={"color":"grey", "size":12, "family":"Sans"},
                    title={"text":"TUKEY", 'x':0.5, "font_color":"#3E3D53"},
                    xaxis_title="stats_diff", yaxis_title="-log(p)")

    # save image as pdf
    fig.write_image(os.path.join(result_dir, "TukeyHSD.pdf"), scale=3)

    display(fig)

For the most significant feature from ANOVA:

In [76]:
contrasts = list(itertools.combinations(set(data[anova_attribute]), 2)) # all possible combinations
tukey = tukey_post_hoc_test(anova_attribute, contrasts, anova['metabolite'].iloc[0])
display(tukey)

Unnamed: 0,contrast,stats_metabolite,stats_ID,stats_diff,stats_p,stats_p_bonferroni,stats_significant
13,Mission_Bay-Torrey_Pines,59188_312.231_7.625,59188,1.874688,1.660894e-13,3.487877e-12,True
14,Mission_Bay-La_Jolla Reefs,59188_312.231_7.625,59188,-1.905934,3.885869e-11,8.160326e-10,True
6,Mission_Beach-Mission_Bay,59188_312.231_7.625,59188,1.920364,9.798197e-07,2.057621e-05,True
12,Mission_Bay-SIO_La_Jolla_Shores,59188_312.231_7.625,59188,1.918814,9.964635e-07,2.092573e-05,True
1,La_Jolla_Cove-Mission_Bay,59188_312.231_7.625,59188,-1.921399,4.267755e-05,0.0008962286,True
11,Mission_Bay-Pacific_Beach,59188_312.231_7.625,59188,1.842156,8.077861e-05,0.001696351,True
17,Pacific_Beach-La_Jolla Reefs,59188_312.231_7.625,59188,-0.063778,0.07022031,1.0,False
7,Mission_Beach-Pacific_Beach,59188_312.231_7.625,59188,-0.078208,0.08641817,1.0,False
15,Pacific_Beach-SIO_La_Jolla_Shores,59188_312.231_7.625,59188,0.076658,0.09479887,1.0,False
2,La_Jolla_Cove-Pacific_Beach,59188_312.231_7.625,59188,-0.079243,0.15441,1.0,False


Here, every possible pair-wise group difference is explored. Since Mission Bay seemed to differ from other sampling sites the most, we could specifically look at the results from comparison between Mission Bay and another sampling site.
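
The number of pair-wise comparisons grows quickly with the number of groups. The sketch below lists all pairs for the seven sampling areas that appear in the metadata (blanks excluded) and then filters the pairs that involve Mission Bay. The variable names are only used in this example.

```python
import itertools

# The seven sampling areas from the metadata (blanks excluded)
areas = ["La_Jolla Reefs", "La_Jolla_Cove", "Mission_Bay", "Mission_Beach",
         "Pacific_Beach", "SIO_La_Jolla_Shores", "Torrey_Pines"]

# Every unordered pair of areas: 7 * 6 / 2 = 21 contrasts
pairs = list(itertools.combinations(areas, 2))
print(len(pairs))

# Only the contrasts that involve Mission Bay
mission_bay_pairs = [pair for pair in pairs if "Mission_Bay" in pair]
print(mission_bay_pairs)
```

Restricting the analysis to the contrasts of interest reduces the number of tests and therefore the size of the multiple-testing correction.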

In the example below, we look at the differences between Mission Bay and La Jolla Reefs.

In [77]:
contrasts = [('Mission_Bay', 'La_Jolla Reefs')]
tukey = tukey_post_hoc_test(anova_attribute, contrasts, anova[anova['significant']]['metabolite'])
display(tukey)
plot_tukey(tukey)

Unnamed: 0,contrast,stats_metabolite,stats_ID,stats_diff,stats_p,stats_p_bonferroni,stats_significant
0,Mission_Bay-La_Jolla Reefs,59188_312.231_7.625,59188,-1.905934,3.885869e-11,6.042527e-08,True
24,Mission_Bay-La_Jolla Reefs,60583_506.326_7.811,60583,-1.670566,2.245761e-10,3.492159e-07,True
1,Mission_Bay-La_Jolla Reefs,33200_260.196_4.886,33200,-1.834910,2.717742e-10,4.226088e-07,True
2,Mission_Bay-La_Jolla Reefs,57080_314.247_7.36,57080,-1.781612,3.213444e-09,4.996905e-06,True
15,Mission_Bay-La_Jolla Reefs,36504_214.191_5.227,36504,-1.722615,4.730149e-09,7.355381e-06,True
...,...,...,...,...,...,...,...
951,Mission_Bay-La_Jolla Reefs,76102_829.344_9.902,76102,-0.005464,9.289001e-01,1.000000e+00,False
1548,Mission_Bay-La_Jolla Reefs,84617_593.442_10.836,84617,0.000213,9.400620e-01,1.000000e+00,False
1193,Mission_Bay-La_Jolla Reefs,77315_521.384_10.015,77315,-0.000432,9.720494e-01,1.000000e+00,False
663,Mission_Bay-La_Jolla Reefs,78116_797.319_10.089,78116,0.000010,9.826921e-01,1.000000e+00,False


**T-test**

---

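A t-test is commonly used to compare exactly two groups. Here, the null hypothesis H0 states that there is no difference between the means of the two groups. Just as ANOVA uses the F-statistic, the t-test uses the T-statistic. For two independent groups with possibly unequal variances (Welch's form), it is calculated as:


$$\text{T-statistic} = \frac{\text{Mean}_{\text{group 1}} - \text{Mean}_{\text{group 2}}}{\sqrt{\text{SD}_{\text{group 1}}^2 / n_{\text{group 1}} + \text{SD}_{\text{group 2}}^2 / n_{\text{group 2}}}}$$

where n is the number of samples in the respective group.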


In our dataset, a heavy rainfall in January 2018 could have influenced the metabolome. We will investigate the effect of the rainfall using t-tests. The two groups are the samples collected in January 2018 ('Jan') and the samples from all other months ('not Jan').
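
For a two-group comparison, the T-statistic can be computed by hand as the difference of the group means divided by its standard error. The sketch below uses Welch's form (no equal-variance assumption) on made-up values; to our understanding, pingouin's `ttest` switches to this form automatically when the two groups differ in size, but treat that as an assumption and check the pingouin documentation.

```python
import numpy as np

# Made-up scaled intensities of one feature in the two groups
jan = np.array([2.0, 4.0, 6.0])
other = np.array([1.0, 2.0, 3.0])

# Standard error of the difference of means (sample variances, ddof=1)
se = np.sqrt(jan.var(ddof=1) / jan.size + other.var(ddof=1) / other.size)

# T-statistic: difference of the group means in units of its standard error
t_value = (jan.mean() - other.mean()) / se
print(t_value)
```

The sign of the T-statistic shows the direction of the difference: a positive value here means the feature is higher in the January group.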

In [None]:
ttest_attribute = 'ATTRIBUTE_Month'
target_group = 'Jan'

In [None]:
def gen_ttest_data(df, columns, ttest_attribute, target_group):
    ttest = []
    for col in columns:
        group1 = df[col][df[ttest_attribute]==target_group]
        group2 = df[col][df[ttest_attribute]!=target_group]
        result = pg.ttest(group1, group2)
        result['Metabolite'] = col   
    
        ttest.append(result)
    
    ttest = pd.concat(ttest).set_index('Metabolite')
        
    ttest.insert(8, 'p-bonf', pg.multicomp(ttest['p-val'], method='bonf')[1])
    # add significance
    ttest.insert(9, 'Significance', ttest['p-bonf'] < 0.05)

    return ttest

In [None]:
ttest = gen_ttest_data(data, scaled.columns, ttest_attribute, target_group)
ttest.head(5)

In [None]:
# Plot T-test

fig = px.scatter(x=ttest['T'],
                y=ttest['p-bonf'].apply(lambda x: -np.log(x)),
                template='plotly_white', width=600, height=600, 
                 color=ttest['Significance'].apply(lambda x: str(x)),
                color_discrete_sequence = ['#ef553b', '#696880'])

fig.update_layout(font={"color":"grey", "size":12, "family":"Sans"},
                  title={"text":"T-test - FEATURE SIGNIFICANCE", 'x':0.5, "font_color":"#3E3D53"},
                  xaxis_title="T", yaxis_title="-Log(p)", showlegend=False)

fig.show()

# PCoA and PERMANOVA:

Principal coordinates analysis (PCoA)

Principal coordinates analysis (PCoA) is a metric multidimensional scaling (MDS) method that attempts to represent sample dissimilarities in a low-dimensional space. It converts a distance matrix consisting of pair-wise distances (dissimilarities) across samples into a 2- or 3-D graph (Gower, 2005). Different distance metrics can be used to calculate dissimilarities among samples (e.g. Euclidean, Canberra, Minkowski). Performing a principal coordinates analysis using the Euclidean distance metric is the same as performing a principal components analysis (PCA). The selection of the most appropriate metric depends on the nature of your data and assumptions made by the metric.

Within the metabolomics field the Euclidean, Bray-Curtis, Jaccard or Canberra distances are most commonly used. The Jaccard distance is an unweighted metric (presence/absence) whereas Euclidean, Bray-Curtis and Canberra distances take into account relative abundances (weighted). Some metrics may be better suited for very sparse data (with many zeroes) than others. For example, the Euclidean distance metric is not recommended to be used for highly sparse data.
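
The statement that a PCoA on Euclidean distances gives the same result as a PCA can be checked numerically. The sketch below is a minimal NumPy implementation of classical PCoA (double centering of the squared distance matrix followed by an eigendecomposition) on random toy data; it is meant as an illustration, not as a substitute for scikit-bio or scikit-learn.

```python
import numpy as np

rng = np.random.default_rng(0)     # arbitrary seed for reproducibility
X = rng.normal(size=(6, 4))        # toy data: 6 samples x 4 features
Xc = X - X.mean(axis=0)            # center the features

# PCA scores via singular value decomposition of the centered data
U, S, _ = np.linalg.svd(Xc, full_matrices=False)
pca_scores = U[:, :2] * S[:2]

# PCoA: squared Euclidean distances -> double centering -> eigendecomposition
diff = X[:, None, :] - X[None, :, :]
D2 = (diff ** 2).sum(axis=2)
n = D2.shape[0]
J = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * J @ D2 @ J
eigvals, eigvecs = np.linalg.eigh(B)
order = np.argsort(eigvals)[::-1][:2]          # two largest eigenvalues
pcoa_scores = eigvecs[:, order] * np.sqrt(eigvals[order])

# The coordinates agree up to the sign of each axis
print(np.allclose(np.abs(pca_scores), np.abs(pcoa_scores)))
```

With any other distance metric (for example Bray-Curtis or Canberra), only the distance matrix changes, while the double centering and eigendecomposition stay the same.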

This video tutorial by StatQuest summarizes nicely the basic principles of PCoA: https://www.youtube.com/watch?v=GEn-_dAyYME

In [None]:
#calculating Principal components
n = 10
pca = PCA(n_components=n)
pca_df = pd.DataFrame(data = pca.fit_transform(scaled), columns = [f'PC{x}' for x in range(1, n+1)])
pca_df.index = md_samples.index
pca_df

In [None]:
# To get a scree plot showing the variance of each PC in percentage:
percent_variance = np.round(pca.explained_variance_ratio_* 100, decimals =2)

fig_bar = px.bar(x=pca_df.columns, y=percent_variance, template="plotly_white",  width=500, height=400)
fig_bar.update_traces(marker_color="#696880", width=0.5)
fig_bar.update_layout(font={"color":"grey", "size":12, "family":"Sans"},
                    title={"text":"PCA - VARIANCE", 'x':0.5, "font_color":"#3E3D53"},
                    xaxis_title="principal component", yaxis_title="variance (%)")
fig_bar.show()

TODO make the attribute colors work

In [None]:
@interact(attribute=sorted(md_samples.columns))
def pca_scatter_plot(attribute):
    title = 'PRINCIPAL COMPONENT ANALYSIS'

    df = pd.merge(pca_df[['PC1', 'PC2']], md_samples[attribute].apply(str), left_index=True, right_index=True)

    fig = px.scatter(df, x='PC1', y='PC2', template='plotly_white', width=600, height=400, color=attribute)

    fig.update_layout(font={"color":"grey", "size":12, "family":"Sans"},
                      title={"text":title, 'x':0.2, "font_color":"#3E3D53"},
                      xaxis_title=f'PC1 {round(pca.explained_variance_ratio_[0]*100, 1)}%',
                      yaxis_title=f'PC2 {round(pca.explained_variance_ratio_[1]*100, 1)}%')
    display(fig)

TODO fix the interact thing

In [None]:
matrices = ['canberra', 'chebyshev', 'correlation', 'cosine', 'euclidean', 'hamming', 'jaccard', 'matching', 'minkowski', 'seuclidean']
@interact(attribute=sorted(md_samples.columns), distance_matrix=matrices)
def pcoa(attribute, distance_matrix):
    # Create the distance matrix from the original data
    distance_matrix = skbio.stats.distance.DistanceMatrix(distance.squareform(distance.pdist(scaled.values, distance_matrix)))
    # perform PERMANOVA test
    permanova = skbio.stats.distance.permanova(distance_matrix, md_samples[attribute])
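    # R2 = SS_among / SS_total, recovered from the pseudo-F statistic using the
    # degrees of freedom (groups - 1) and (samples - groups)
    permanova['R2'] = 1 - 1 / (1 + permanova['test statistic'] * (permanova['number of groups'] - 1) / (permanova['sample size'] - permanova['number of groups']))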
    display(permanova)
    # perform PCoA
    pcoa = skbio.stats.ordination.pcoa(distance_matrix)
    df = pcoa.samples[['PC1', 'PC2']]
    df = df.set_index(md_samples.index)
    df = pd.merge(df[['PC1', 'PC2']], md_samples[attribute].apply(str), left_index=True, right_index=True)
    
    title = 'PRINCIPAL COORDINATE ANALYSIS'
    fig = px.scatter(df, x='PC1', y='PC2', template='plotly_white', width=600, height=400, color=attribute)

    fig.update_layout(font={"color":"grey", "size":12, "family":"Sans"},
                      title={"text":title, 'x':0.18, "font_color":"#3E3D53"},
                      xaxis_title=f'PC1 {round(pcoa.proportion_explained.iloc[0]*100, 1)}%',
                      yaxis_title=f'PC2 {round(pcoa.proportion_explained.iloc[1]*100, 1)}%')
    display(fig)
    
    # To get a scree plot showing the variance of each PC in percentage:
    percent_variance = np.round(pcoa.proportion_explained* 100, decimals =2)

    fig = px.bar(x=[f'PC{x}' for x in range(1, len(pcoa.proportion_explained)+1)], y=percent_variance, template="plotly_white",  width=500, height=400)
    fig.update_traces(marker_color="#696880", width=0.5)
    fig.update_layout(font={"color":"grey", "size":12, "family":"Sans"},
                      title={"text":"PCoA - VARIANCE", 'x':0.5, "font_color":"#3E3D53"},
                      xaxis_title="principal component", yaxis_title="variance (%)")
    display(fig)
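
The relation between the PERMANOVA pseudo-F statistic and R2 (the fraction of the total sum of squared distances explained by the grouping) can be verified on a toy example. This is a sketch in plain Python under stated assumptions: the points and group labels are made up, Euclidean distances are used, and no permutation test is run, so it yields no p-value. It only checks that R2 computed directly from the sums of squares matches the value recovered from pseudo-F and the degrees of freedom.

```python
from itertools import combinations
from math import dist, isclose

# Hypothetical toy data: 6 samples in 2 groups
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (4.0, 4.0), (5.0, 4.0), (4.0, 5.0)]
groups = ["A", "A", "A", "B", "B", "B"]

n = len(points)            # sample size
labels = sorted(set(groups))
k = len(labels)            # number of groups

def sum_sq_dist(indices):
    """Sum of squared pairwise distances among the given samples."""
    return sum(dist(points[i], points[j]) ** 2 for i, j in combinations(indices, 2))

ss_total = sum_sq_dist(range(n)) / n
ss_within = 0.0
for label in labels:
    members = [i for i, g in enumerate(groups) if g == label]
    ss_within += sum_sq_dist(members) / len(members)
ss_among = ss_total - ss_within

pseudo_f = (ss_among / (k - 1)) / (ss_within / (n - k))
r2_direct = ss_among / ss_total
r2_from_f = 1 - 1 / (1 + pseudo_f * (k - 1) / (n - k))

print(round(pseudo_f, 3), round(r2_direct, 3), isclose(r2_direct, r2_from_f))
```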


# Hierarchical Clustering Algorithm:

We are now ready to perform a cluster analysis. The concept behind hierarchical clustering is to repeatedly combine the two nearest clusters into a larger cluster.

The algorithm first calculates the distance between every pair of observation points and stores the distances in a matrix. It then proceeds as follows:
1. It puts every point in its own cluster;
2. It merges the closest pairs of points according to their distances;
3. It recomputes the distance between the new cluster and the old ones and stores them in a new distance matrix;
4. It repeats steps 2 and 3 until all the clusters are merged into one single cluster. <br>
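
The steps above can be sketched in a few lines of plain Python. This is a naive illustration, not the implementation used by SciPy or Plotly: the function `complete_linkage` and the toy points are hypothetical, complete linkage (distance between two clusters = largest distance between any of their members) is assumed, and the cluster distances are recomputed from the points at every iteration instead of being stored in an updated distance matrix.

```python
from itertools import combinations
from math import dist

def complete_linkage(points):
    """Return the merge history of naive complete-linkage clustering."""
    # Step 1: every point starts in its own cluster
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        # Steps 2 and 3: find the two clusters with the smallest
        # complete-linkage distance (largest member-to-member distance)
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            d = max(dist(points[i], points[j])
                    for i in clusters[a] for j in clusters[b])
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]    # b > a, so index a is still valid
    # Step 4: the loop ends once a single cluster remains
    return merges

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.5)]
merges = complete_linkage(points)
for left, right, d in merges:
    print(left, right, round(d, 2))
```

The merge history plays the same role as the linkage matrix returned by `scipy.cluster.hierarchy.linkage`: each entry records which two clusters were joined and at what distance, which is the information a dendrogram displays.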

In [None]:
fig = ff.create_dendrogram(scaled, labels=list(scaled.index))
fig.update_layout(width=700, height=500, template='plotly_white')

# save image as pdf
fig.write_image(os.path.join(result_dir, "Cluster_Dendrogram.pdf"), scale=3)
fig.show()

In [None]:
# SORT DATA TO CREATE HEATMAP

# Compute linkage matrices for hierarchical clustering.
# Rows of `scaled` are samples, so linkage(scaled) clusters the samples
# and linkage(scaled.T) clusters the features.
linkage_data_samples = linkage(scaled, method='complete', metric='euclidean')
linkage_data_ft = linkage(scaled.T, method='complete', metric='euclidean')

# dendrogram() returns a dictionary of the data structures used to render the dendrogram.
# We will use dict['leaves'], the leaf order from left to right.
cluster_samples = dendrogram(linkage_data_samples, no_plot=True)
cluster_ft = dendrogram(linkage_data_ft, no_plot=True)

# Create dataframe with sorted samples
ord_samp = scaled.copy()
ord_samp.reset_index(inplace=True)
ord_samp = ord_samp.reindex(cluster_samples['leaves'])
ord_samp.rename(columns={'index': 'Filename'}, inplace=True)
ord_samp.set_index('Filename', inplace=True)

# Create dataframe with sorted features
ord_ft = ord_samp.T.reset_index()
ord_ft = ord_ft.reindex(cluster_ft['leaves'])
ord_ft.rename(columns={'index': 'Feature'}, inplace=True)
ord_ft.set_index('Feature', inplace=True)

In [None]:
#Heatmap
fig = px.imshow(ord_ft,y=list(ord_ft.index), x=list(ord_ft.columns), text_auto=True, aspect="auto",
               color_continuous_scale='PuOr_r', range_color=[-3,3])

fig.update_layout(
    autosize=False,
    width=700,
    height=800)

fig.update_yaxes(visible=False)
fig.update_xaxes(tickangle = 35)

# save image as pdf
fig.write_image(os.path.join(result_dir, "Heatmap.pdf"), scale=3)

fig.show()