# Performing basic uni- and multivariate statistical analysis of untargeted metabolomics data

**Updated on:** 2022-03-20 23:10:05 CEST

In this Jupyter Notebook we perform basic explorative uni- and multivariate statistical analyses of an untargeted metabolomics data set, including data cleaning steps, normalization and batch correction.

**Authors**: Abzer Kelminal (abzer.shah@uni-tuebingen.de), Francesco Russo (frru@ssi.dk), Filip Ottosson (faot@ssi.dk), Madaleine Ernst (maet@ssi.dk), Axel Walter (axel.walter@uni-tuebingen.de), Carolina Gonzalez (cgonzalez7@eafit.edu.co), Efi Kontou, Judith Boldt <br>

**Input file format**: .csv files or .txt files <br>
**Outputs**: .csv files, .pdf & .svg images  <br>
**Dependencies**: pandas, numpy, scipy, plotly, scikit-learn, scikit-bio, pingouin, kaleido, ipyfilechooser, ipywidgets, nbformat

The session info at the end of this notebook gives info about the versions of all the packages used here.
</div>

---
#### This Notebook can be run with both Jupyter Notebook & Google Colab.
---
<b> Before running this notebook with your own data, remember to save a copy of it in your own Google Drive! Do so by clicking on File --> Save a copy in Drive. You can give your copy any meaningful name.
This file should be located in a new folder of your Google Drive named 'Colab Notebooks'. You can also download this notebook: File --> Download --> Download .ipynb.</b>

---
<b><font size=3> SPECIAL NOTE: Please read the comments before proceeding with the code and let us know if you run into any errors and if you think it could be commented better. We would highly appreciate your suggestions and comments!!</font> </b>

---

# <font color ='blue'> 1. Introduction </font>
<a id='intro'></a>

#### About the data
<p style='text-align: justify;'> The files used in this tutorial are part of a study published by <a href="https://doi.org/10.1016/j.chemosphere.2020.129450">Petras and coworkers (2021)</a>. The authors investigated the coastal environments in northern San Diego, USA, after a major rainfall event in winter 2017/2018 to observe the seawater chemotype. The dataset contains surface seawater samples collected (−10 cm) at 30 sites spaced approximately 300 meters apart and 50–100 meters offshore along the San Diego coastline from Torrey Pines State Beach to Mission Bay (Torrey Pines, La Jolla Shores, La Jolla Reefs, Pacific and Mission Beach) at 3 different time points: Dec 2017, Jan 2018 (after a major rainfall, which decreased the salinity of the water) and Oct 2018. <br>

<p style='text-align: justify;'> The study observed a large shift in the seawater's organic matter chemotype after the rainfall. Seawater samples collected at the same sites in October 2018 were not published in the original article, but are added to this tutorial to increase the sample size. The datasets used here can be found in the MassIVE repository: <a href="https://massive.ucsd.edu/ProteoSAFe/dataset.jsp?task=8a8139d9248b43e0b0fda17495387756">MSV000082312</a> and <a href="https://massive.ucsd.edu/ProteoSAFe/dataset.jsp?task=c8411b76f30a4f4ca5d3e42ec13998dc">MSV000085786</a>. The .mzml files were preprocessed using <a href="http://mzmine.github.io/">MZmine3</a> and the <a href="https://gnps.ucsd.edu/ProteoSAFe/status.jsp?task=cf6e14abf5604f47b28b467a513d3532">feature-based molecular networking workflow in GNPS</a>.</p>

<p style='text-align: justify;'> We initially clean the feature table by batch correction, blank removal, imputation, normalization, and scaling. Then, we perform univariate and multivariate statistical analyses including unsupervised learning methods (e.g., PCoA and clustering) and supervised analysis using XGBoost. The details of each step are discussed in their respective sections.</p>

---

# <font color ='blue'> 2. Preliminary setup for the notebook </font>
<a id='Section-2'></a>

## <font color ='darkblue'> 2.1 Package installation </font>
<a id = 'pkg_install'></a>
Before we start, we need to install the packages required for our analyses and load them into our session. Since different sections need different packages, we install the packages right before the respective sections to reduce the installation time.

In [69]:
# Install libraries that are not preinstalled
!pip install pandas numpy plotly scikit-learn scikit-bio pingouin kaleido ipyfilechooser nbformat





<font color="green"><b>TIP:</b> The # symbol marks a comment that describes what the code does. A line of code beginning with # is "commented out" and will not run. To run a commented-out line, remove the # and run the cell again.</font>

In [70]:
# importing necessary modules
import pandas as pd
import numpy as np
import glob
import os
import itertools
import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff
from scipy.cluster.hierarchy import dendrogram, linkage
from plotly.subplots import make_subplots
from sklearn.preprocessing import StandardScaler
from scipy.spatial import distance
from sklearn.decomposition import PCA
import pingouin as pg
import skbio # Don't import on Windows!!
from ipyfilechooser import FileChooser
from ipywidgets import interact
import warnings

In [71]:
# Disable warnings for cleaner output, comment out for debugging
warnings.filterwarnings('ignore')

## <font color ='darkblue'> 2.2 Setting a local working directory </font>
<a id = "set_dir"></a>

<p style='text-align: justify;'> When we set a folder (or directory) as the working directory, we can access the files within the folder by their names alone, without giving the entire file path every time we use them. All output files will also be saved in the working directory. So, before proceeding with the script, if you are using your own files with this notebook, please make sure to put all the necessary input files in one local folder and set it as your working directory. </p>

<div class="alert alert-block alert-warning">
<p style='text-align: justify;'> <b>NOTE:</b><br> When you run the next cell, it will display an output box where you can enter the path of the folder on your local computer that contains all your input files.<br> <b> For example: D:\User\Project\Test_Data.</b> 
This folder will be set as your working directory, and you can access all the files within it. <b> Whenever you see an output box asking for user input, please note that the script will not proceed without your input. Hence, make sure to run the notebook cell by cell instead of running all cells at once. </b> </p>
</div>

<p style='text-align: justify;'>In the Google Colab homepage &rarr; there are 3 icons on the upper left corner. Click on the 3 dots to see the contents of the notebook. To create a folder with your input files, click on the folder icon &rarr; Right-click anywhere on the empty space within the left section &rarr; Select 'new folder' &rarr; Copy the path and paste it in the output box of the next cell </p>

In [74]:
data_dir = input("Enter the path of the folder with input files:\n")
os.chdir(data_dir)
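
If the entered path contains a typo or is pasted with surrounding quotes, `os.chdir` fails with an error that can be hard to interpret. The following is a minimal sketch of a more defensive variant. The helper name `set_working_dir` is hypothetical and not part of this notebook.

```python
import os
from pathlib import Path

def set_working_dir(path_str):
    """Validate a folder path and make it the working directory (hypothetical helper)."""
    # Strip whitespace and quotes that are often copied along with a path
    folder = Path(path_str.strip().strip('"').strip("'")).expanduser()
    if not folder.is_dir():
        raise NotADirectoryError(f"'{folder}' is not an existing folder.")
    os.chdir(folder)
    return Path.cwd()

# Example usage with the current folder, which always exists
print(set_working_dir("."))
```

In the notebook, the function could be called with the string returned by `input(...)` instead of `"."`.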

## <font color ='darkblue'> 2.3 Loading in and exploring the input files </font>
<a id='load_ip'></a>

1) <b>Feature table:</b> A typical output file of an LC-MS/MS metabolomics experiment, containing all mass spectral features (or peaks) with their corresponding relative intensities across samples. The feature table we use in this tutorial was obtained from MZmine3. (Filetype: .csv file) <br> 

2) <b>Metadata:</b> An Excel file saved in .txt format that is created by the user, providing additional information about the samples (e.g. sample type, tissue type, species, timepoint of collection etc.) In this tutorial we are using the [metadata format recognized by GNPS workflows](https://ccms-ucsd.github.io/GNPSDocumentation/metadata/). The first column should be named 'filename' and all remaining column headers should be prefixed with ATTRIBUTE_: e.g. ATTRIBUTE_SampleType, ATTRIBUTE_timepoint etc. (Filetype: .txt file) <br>
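
The naming rules for the metadata table described above can be checked programmatically before any analysis. The sketch below is one possible way to do so; the function name `check_gnps_metadata` and the toy table are assumptions made for illustration. The check has to run on the table as read from file, that is, before `filename` is turned into the index.

```python
import pandas as pd

def check_gnps_metadata(md_table):
    """Return a list of problems found in the column names of a metadata table."""
    problems = []
    columns = list(md_table.columns)
    if not columns or columns[0] != "filename":
        problems.append("The first column must be named 'filename'.")
    for col in columns[1:]:
        if not col.startswith("ATTRIBUTE_"):
            problems.append(f"Column '{col}' lacks the 'ATTRIBUTE_' prefix.")
    return problems

# Toy metadata table (made-up values) with one badly named column
toy_md_raw = pd.DataFrame({
    "filename": ["sample_1.mzXML", "blank_1.mzXML"],
    "ATTRIBUTE_SampleType": ["Sample", "Blank"],
    "timepoint": ["Dec", "Dec"],
})
print(check_gnps_metadata(toy_md_raw))
```

An empty list means that the column names follow the expected format.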

Feature table and metadata used in this tutorial can be accessed at:
https://github.com/Functional-Metabolomics-Lab/Statistical-analysis-of-non-targeted-LC-MSMS-data/tree/main/data

In addition, if available, provide the molecular annotation files from SIRIUS, CANOPUS and GNPS. SIRIUS and CANOPUS perform molecular formula prediction and chemical class prediction, respectively.

<p style='text-align: justify;'> GNPS annotation files can be obtained by performing Feature-Based Molecular Networking (FBMN) analysis on the feature table (provided along with its corresponding metadata). The GNPS annotation files are obtained as a result of FBMN. To download your FBMN results locally: Go to your <b>MassIVE</b> or <b>GNPS</b> account &rarr; Jobs &rarr; Click on <b>Status</b> of your FBMN Workflow &rarr; Download Cytoscape Data &rarr; A folder will be downloaded. For Ex: "ProteoSAFe-FEATURE-BASED-MOLECULAR-NETWORKING-5877234d-download_cytoscape_data" &rarr; Unzip the folder. The annotated files needed are tsv files within the folders <b>Clusterinfo_summary</b> and <b>DB_analog_result</b> (if analog search is performed).</p>

<p style='text-align: justify;'> <b>To upload files into Google Colab &rarr; Right-click on the folder you created to 'upload' the necessary files from your computer into the cloud session</b></p>

[![More on FBMN](https://img.shields.io/badge/More%20on-FBMN-blueviolet)](https://ccms-ucsd.github.io/GNPSDocumentation/featurebasedmolecularnetworking/) 
[![More on GNPS](https://img.shields.io/badge/More%20on-GNPS-informational)](https://www.nature.com/articles/nbt.3597#Abs2) 
[![More on SIRIUS](https://img.shields.io/badge/More%20on-SIRIUS-blue)](https://boecker-lab.github.io/docs.sirius.github.io/)

### 2.3.1 Loading the data: Use one of the methods 
We can load the data files into the script either from the local working directory or from the web using url.

#### a. Loading files from a local folder
Please make sure to include all the necessary input files in the folder you have set as working directory

In [None]:
#List all files from the local directory:
filenames= os.listdir(data_dir)
filenames= sorted(filenames)
filenames

In [None]:
# feature quantification table file location
ft_file = filenames[3]
# meta data table file location
md_file = filenames[1]
# optional analog match file
an_file = filenames[2]

# define separators for different input file formats
separators = {"csv": ",", "tsv": "\t", "txt": "\t"}

# read feature table
if ft_file:
    ft = pd.read_csv(ft_file, sep = separators[ft_file.split(".")[-1]])
    ft.style.format("{:.5f}")
else:
    print("Please select a feature file and rerun this cell.")
# read metadata table
if md_file:
    md = pd.read_csv(md_file, sep = separators[md_file.split(".")[-1]]).set_index("filename")
else:
    print("Please select a metadata file and rerun this cell.")
# read optional annotation (analog match) table
if an_file:
    an = pd.read_csv(an_file, sep = separators[an_file.split(".")[-1]])
else:
    print("Please select an annotation file and rerun this cell.")
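
Selecting files by their position in the sorted listing (for example `filenames[3]`) gives the wrong file as soon as another file is added to the folder. As a sketch of an alternative, the hypothetical helpers below pick a file by a keyword in its name and look up the separator from the extension. The keywords and filenames in the example are assumptions made for illustration.

```python
import os

SEPARATORS = {"csv": ",", "tsv": "\t", "txt": "\t"}

def find_file(filenames, keyword):
    """Return the single filename containing `keyword` (case-insensitive)."""
    hits = [f for f in filenames if keyword.lower() in f.lower()]
    if len(hits) != 1:
        raise ValueError(f"Expected exactly one file matching '{keyword}', found {len(hits)}: {hits}")
    return hits[0]

def separator_for(filename):
    """Look up the column separator from the file extension."""
    extension = os.path.splitext(filename)[1].lstrip(".").lower()
    return SEPARATORS[extension]

# Example with made-up filenames
example_files = ["analog_result.tsv", "metadata.txt", "notes.md", "quant.csv"]
example_ft_name = find_file(example_files, "quant")
print(example_ft_name, repr(separator_for(example_ft_name)))
```

Raising an error when zero or several files match makes an ambiguous folder content visible immediately, instead of loading the wrong table.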

#### b. Loading files using URL
<p style='text-align: justify;'> Here, we directly pull the data files (feature table, metadata, analog result from FBMN) from our Functional Metabolomics GitHub page and load them into the notebook. In Google Colab, after you load the input files into the Colab space, you can right-click on a file, copy its file path and paste it in the next cell in place of the URL. </p>

In [75]:
#Reading the input data using URL 
ft_url = 'https://raw.githubusercontent.com/Functional-Metabolomics-Lab/Statistical-analysis-of-non-targeted-LC-MSMS-data/main/data/SD_BeachSurvey_GapFilled_quant.csv'
md_url = 'https://raw.githubusercontent.com/Functional-Metabolomics-Lab/Statistical-analysis-of-non-targeted-LC-MSMS-data/main/data/20221125_Metadata_SD_Beaches_with_injection_order.txt'
an_url = 'https://raw.githubusercontent.com/Functional-Metabolomics-Lab/Statistical-analysis-of-non-targeted-LC-MSMS-data/main/data/DB_analog_result_FBMN.tsv'
ft = pd.read_csv(ft_url)
md = pd.read_csv(md_url, sep = "\t").set_index("filename")
an = pd.read_csv(an_url, sep = "\t")

#### c. Loading files directly from GNPS

One can also load the files directly from the repositories [MassIVE](https://massive.ucsd.edu/ProteoSAFe/static/massive.jsp) or [GNPS](https://gnps.ucsd.edu/ProteoSAFe/static/gnps-splash.jsp). If one has performed FBMN on their feature table, the files (both, input and output files from FBMN) can be accessed by  providing the task ID in the next cell. Task ID can be found by:  Go to your <b>MassIVE</b> or <b>GNPS</b> account &rarr; Jobs &rarr; unique ID is provided for each job in  'Description' column.

<table>
<thead>
<tr><th>Description</th><th>User</th><th>Workflow</th><th>Workflow Version</th><th>Status</th><th>Protected</th><th>Create Time</th><th>Total Size</th><th>Site</th><th>Delete Task</th></tr>
</thead>
<tbody>
<tr><td><font color="red">ID given here</font></td><td>-</td><td>FBMN</td><td>-</td><td>-</td><td>-</td><td>-</td><td>-</td><td>GNPS</td><td>-</td></tr>
</tbody>
</table>

In [None]:
taskID = "5877234d0a22497eb5ecff7fd53faea5" # Enter the task ID here

In [None]:
# Build the download URLs from the task ID
ft_url = f'https://proteomics2.ucsd.edu/ProteoSAFe/DownloadResultFile?task={taskID}&file=quantification_table_reformatted/&block=main'
md_url = f'https://proteomics2.ucsd.edu/ProteoSAFe/DownloadResultFile?task={taskID}&file=metadata_merged/&block=main'
an_url = f'https://proteomics2.ucsd.edu/ProteoSAFe/DownloadResultFile?task={taskID}&file=DB_analogresult/&block=main'

##### NOTE
<blockquote>Make sure your metadata has enough columns (ATTRIBUTES) describing your data. The metadata given for FBMN might contain only few columns, however for downstream statistical analysis, one might need more attributes. In such cases, load the metadata file from a local folder</blockquote>

##### Reading the url from options b,c

In [None]:
ft = pd.read_csv(ft_url)
md = pd.read_csv(md_url, sep = "\t").set_index("filename")
an = pd.read_csv(an_url, sep = '\t')

### 2.3.2 Viewing the input files

Let's check if the data has been read correctly!

In [77]:
print('Dimension: ',ft.shape) #gets the dimension (number of rows and columns) of ft
ft.head() # gets the first 5 rows of ft

Dimension:  (11217, 200)


Unnamed: 0,row ID,row m/z,row retention time,row ion mobility,row ion mobility unit,row CCS,correlation group ID,annotation network number,best ion,auto MS2 verify,...,SD_12-2017_15_b.mzXML Peak area,SD_12-2017_15_a.mzXML Peak area,SD_12-2017_27_a.mzXML Peak area,SD_12-2017_29_b.mzXML Peak area,SD_12-2017_21_a.mzXML Peak area,SD_12-2017_30_a.mzXML Peak area,SD_12-2017_28_b.mzXML Peak area,SD_12-2017_29_a.mzXML Peak area,SD_12-2017_28_a.mzXML Peak area,Unnamed: 199
0,92572,151.035101,13.363672,,,,,,,,...,0.0,0.0,21385.48,1138.271,1144.8115,12139.16,5394.689,5270.766,1007.839,
1,2513,151.035125,1.129901,,,,,,,,...,0.0,0.0,27123.893,0.0,0.0,0.0,0.0,0.0,0.0,
2,42,151.03514,0.550724,,,,212.0,,,,...,1150350.0,1103477.9,2638109.2,1446267.0,595216.5,1225695.2,1424855.0,1557217.0,1797692.0,
3,1870,151.035199,0.88678,,,,,,,,...,0.0,0.0,314371.84,0.0,0.0,0.0,0.0,0.0,0.0,
4,2127,151.096405,0.986017,,,,,,,,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,


In [78]:
print('Dimension: ',md.shape)
md.head()

Dimension:  (186, 13)


filename,ATTRIBUTE_Sample.Type,ATTRIBUTE_Batch,ATTRIBUTE_Month,ATTRIBUTE_Year,ATTRIBUTE_Sample_Location,ATTRIBUTE_Replicate,ATTRIBUTE_Spot,ATTRIBUTE_Latitude,ATTRIBUTE_Longitude,ATTRIBUTE_Sample_Area,ATTRIBUTE_Spot_Name,ATTRIBUTE_time_run,ATTRIBUTE_Injection_order
SD_10_2018_10_a.mzXML,Sample,3,Oct,2018,10,a,10,32.86261,-117.26042,SIO_La_Jolla_Shores,SIO_South_Pier,18/07/2020 18:19,145
SD_10_2018_10_b.mzXML,Sample,3,Oct,2018,10,b,10,32.86261,-117.26042,SIO_La_Jolla_Shores,SIO_South_Pier,18/07/2020 18:35,146
SD_10_2018_11_a.mzXML,Sample,3,Oct,2018,11,a,11,32.85601,-117.26253,SIO_La_Jolla_Shores,La_Jolla_Shores,18/07/2020 18:51,147
SD_10_2018_11_b.mzXML,Sample,3,Oct,2018,11,b,11,32.85601,-117.26253,SIO_La_Jolla_Shores,La_Jolla_Shores,18/07/2020 19:07,148
SD_10_2018_12_a.mzXML,Sample,3,Oct,2018,12,a,12,32.85161,-117.26965,La_Jolla_Cove,Cove,18/07/2020 19:23,149


### 2.3.3 Exploring the metadata
<a id='explore_md'></a>

<p style='text-align: justify;'>Before starting with our analysis, we take a look at our metadata. For this purpose, we have created a function. A function is a collection of commands, which takes one or multiple input variables and creates a corresponding output. By creating functions, we avoid having to write big code chunks multiple times. Instead, we can call a sequence of code lines by their function name.</p>
    
<p style='text-align: justify;'><font color="red"> The following cell will not produce any outputs. </font> The outputs will only be produced when we call the function further downstream and give it an input variable. To explore our metadata we define a function called inside_levels. This function creates a summary table of our metadata, including data types and levels contained in each column.  <font color ="blue"> The input is a metadata table and the output consists of a summary dataframe. </font></p>

In [79]:
def inside_levels(df):
    # summarise each column (attribute) of df: its levels, their counts and the data type
    levels = []
    types = []
    count = []
    for col in df.columns:
        types.append(type(df[col].iloc[0])) # type of the first entry (positional access)
        levels.append(sorted(set(df[col].dropna()))) # sorted unique values, ignoring missing values
        tmp = df[col].value_counts()
        count.append([tmp[level] for level in levels[-1]]) # counts in the same order as the levels
    # one row per attribute, numbered from 1
    return pd.DataFrame({"ATTRIBUTES": df.columns, "LEVELS": levels, "COUNT":count, "TYPES": types}, index=range(1, len(levels)+1))
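
To see what such a summary looks like on a table small enough to verify by eye, the sketch below applies a compact variant to toy metadata. The name `summarise_levels` and all values are made up for illustration, and the data type column is omitted for brevity.

```python
import pandas as pd

# Toy metadata (made-up values) indexed by filename, like `md` in this notebook
toy_md_levels = pd.DataFrame(
    {
        "ATTRIBUTE_Sample.Type": ["Sample", "Sample", "Blank"],
        "ATTRIBUTE_Batch": [1, 2, 1],
    },
    index=["s1.mzXML", "s2.mzXML", "b1.mzXML"],
)

def summarise_levels(df):
    """Compact variant of the summary: levels and their counts per column."""
    rows = []
    for col in df.columns:
        # value_counts ignores missing values; sort_index puts the levels in sorted order
        counts = df[col].value_counts().sort_index()
        rows.append({"ATTRIBUTES": col, "LEVELS": counts.index.tolist(), "COUNT": counts.tolist()})
    return pd.DataFrame(rows, index=range(1, len(rows) + 1))

toy_summary = summarise_levels(toy_md_levels)
print(toy_summary)
```

Each row of the result describes one metadata column, and the n-th entry of `COUNT` belongs to the n-th entry of `LEVELS`.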

Let's have a look at our metadata with the inside_levels function defined above.

In [80]:
len(md.columns)

13

In [81]:
inside_levels(md)

Unnamed: 0,ATTRIBUTES,LEVELS,COUNT,TYPES
1,ATTRIBUTE_Sample.Type,"[Blank, Sample]","[6, 180]",<class 'str'>
2,ATTRIBUTE_Batch,"[1, 2, 3]","[62, 62, 62]",<class 'numpy.int64'>
3,ATTRIBUTE_Month,"[Dec, Jan, Oct]","[62, 62, 62]",<class 'str'>
4,ATTRIBUTE_Year,"[2017, 2018]","[62, 124]",<class 'numpy.int64'>
5,ATTRIBUTE_Sample_Location,"[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...","[6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...",<class 'numpy.int64'>
6,ATTRIBUTE_Replicate,"[a, b]","[93, 93]",<class 'str'>
7,ATTRIBUTE_Spot,"[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...","[6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...",<class 'numpy.int64'>
8,ATTRIBUTE_Latitude,"[32.75645, 32.75743, 32.75905, 32.76115, 32.76...","[6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...",<class 'numpy.float64'>
9,ATTRIBUTE_Longitude,"[-117.2872, -117.28664, -117.286, -117.28355, ...","[6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...",<class 'numpy.float64'>
10,ATTRIBUTE_Sample_Area,"[Blank, La_Jolla Reefs, La_Jolla_Cove, Mission...","[6, 36, 12, 36, 18, 12, 18, 48]",<class 'str'>


The above table is a summary of our metadata table. For example, the 1st row says that there are 2 different levels under the 'ATTRIBUTE_Sample.Type' category, namely Blank and Sample, and that their counts are 6 and 180, respectively.

## <font color ='darkblue'> 2.4 Merging annotations with feature table</font>
<a id="merge_ft"></a>

<div class="alert alert-block alert-info">
    
The first column in the feature table, <b>row ID</b>, appears under different column names in different files: <br>
* In the clusterinfo summary file of GNPS, it is given under <b>Cluster index</b> (the 'LibraryID' column has the spectral match annotations)
* In the DB analog result file, it is given as <b>#Scan#</b> (the 'Compound_Name' column has the annotation information)
* For SIRIUS and CANOPUS summary files, the row ID of the feature table is given in the column <b>id</b>. A typical id would be "3_ProjectName_MZmine3_SIRIUS_1_16", where the last number, 16, represents the row ID.
    
    </div>

<p style='text-align: justify;'> Here, we show how to merge ft and an (the analog file) based on these columns. You can use the same method to merge SIRIUS and CANOPUS annotations with your feature table as well. The merged table can be used later, when needed. Before merging two dataframes on certain columns, make sure that both columns have the same data type. Even if the values look the same, a merge between a numeric column and a text column can fail or find no matches. </p>
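
As a sketch of the point about data types, the toy example below casts a text key column to the type of the other key before a left join. The tables and compound names are made up; `validate` and `indicator` are standard `pandas.merge` options that help to spot duplicated keys and unmatched rows.

```python
import pandas as pd

# Toy feature table and annotation table (made-up values)
toy_ft = pd.DataFrame({"row ID": [1, 2, 3], "row m/z": [151.035, 180.063, 301.141]})
toy_an = pd.DataFrame({"#Scan#": ["2", "3"], "Compound_Name": ["compound A", "compound B"]})

# The key columns differ in type (integer vs. text), so cast one of them first
toy_an["#Scan#"] = toy_an["#Scan#"].astype(toy_ft["row ID"].dtype)

# Left join: keep every feature, add annotations where the keys match
toy_merged = pd.merge(
    toy_ft, toy_an,
    left_on="row ID", right_on="#Scan#",
    how="left",
    validate="one_to_one",  # raise an error if a key occurs more than once on either side
    indicator=True,         # adds a '_merge' column: 'both' or 'left_only'
)
print(toy_merged)
```

Features without an annotation keep their row and get missing values in the annotation columns. If one feature can have several annotation rows, `validate="one_to_many"` would be the appropriate setting.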

In [82]:
#checking if both columns are of similar class
ft["row ID"].dtype== an["#Scan#"].dtype

True

In [83]:
ft_an = pd.merge(ft, an, left_on="row ID", right_on="#Scan#", how="left") # left join: keep all features and match on the ID columns, not on the row index
ft_an

Unnamed: 0,row ID,row m/z,row retention time,row ion mobility,row ion mobility unit,row CCS,correlation group ID,annotation network number,best ion,auto MS2 verify,...,MoleculeExplorerDatasets,MoleculeExplorerFiles,InChIKey,InChIKey-Planar,superclass,class,subclass,npclassifier_superclass,npclassifier_class,npclassifier_pathway
0,92572,151.035101,13.363672,,,,,,,,...,,,,,,,,,,
1,2513,151.035125,1.129901,,,,,,,,...,,,,,,,,,,
2,42,151.035140,0.550724,,,,212.0,,,,...,92.0,6073.0,IIRDTKBZINWQAW-UHFFFAOYSA-N,IIRDTKBZINWQAW,Organic oxygen compounds,Organooxygen compounds,Ethers,Glycerolipids,Monoacylglycerols,Fatty acids
3,1870,151.035199,0.886780,,,,,,,,...,0.0,0.0,QQYQHZUMKAJTAA-AMRGJXDSSA-N,QQYQHZUMKAJTAA,Organic oxygen compounds,Organooxygen compounds,Alcohols and polyols,Diterpenoids,,Terpenoids
4,2127,151.096405,0.986017,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11212,92162,1444.398875,12.973346,,,,,,,,...,,,,,,,,,,
11213,88518,1444.398952,11.717666,,,,265.0,,,,...,,,,,,,,,,
11214,88057,1445.398112,11.541277,,,,,,,,...,,,,,,,,,,
11215,89348,1445.398187,11.987819,,,,36.0,,,,...,,,,,,,,,,


## <font color ='darkblue'> 2.5 Arranging metadata and feature table in the same order</font>
<a id="arr_input_files"></a>

<p style='text-align: justify;'> In the next cells, we bring feature table and metadata in the correct format such that the rownames of the metadata and column names of the feature table are the same. Filenames and the order of files need to correspond in both tables, as we will match metadata attributes to the feature table. In that way, both metadata and feature table, can easily be filtered. </p>

In [84]:
new_md = md.copy() #storing the files under different names to preserve the original files
# remove the (front & tail) spaces, if any present, from the rownames of md
new_md.index = [name.strip() for name in md.index]
# for each col in new_md
# 1) removing the spaces (if any)
# 2) replace the spaces (in the middle) to underscore
# 3) converting them all to UPPERCASE
for col in new_md.columns:
    if new_md[col].dtype == str:
        new_md[col] = [item.strip().replace(" ", "_").upper() for item in new_md[col]]
new_md= new_md.reindex(sorted(new_md.index), axis=0)
print('Dimension: ',new_md.shape)
new_md.head()

Dimension:  (186, 13)


Unnamed: 0,ATTRIBUTE_Sample.Type,ATTRIBUTE_Batch,ATTRIBUTE_Month,ATTRIBUTE_Year,ATTRIBUTE_Sample_Location,ATTRIBUTE_Replicate,ATTRIBUTE_Spot,ATTRIBUTE_Latitude,ATTRIBUTE_Longitude,ATTRIBUTE_Sample_Area,ATTRIBUTE_Spot_Name,ATTRIBUTE_time_run,ATTRIBUTE_Injection_order
SD_01-2018_10_a.mzXML,Sample,2,Jan,2018,10,a,10,32.86261,-117.26042,SIO_La_Jolla_Shores,SIO_South_Pier,16/01/2018 16:23,83
SD_01-2018_10_b.mzXML,Sample,2,Jan,2018,10,b,10,32.86261,-117.26042,SIO_La_Jolla_Shores,SIO_South_Pier,16/01/2018 16:39,84
SD_01-2018_11_a.mzXML,Sample,2,Jan,2018,11,a,11,32.85601,-117.26253,SIO_La_Jolla_Shores,La_Jolla_Shores,16/01/2018 16:55,85
SD_01-2018_11_b.mzXML,Sample,2,Jan,2018,11,b,11,32.85601,-117.26253,SIO_La_Jolla_Shores,La_Jolla_Shores,16/01/2018 17:10,86
SD_01-2018_12_a.mzXML,Sample,2,Jan,2018,12,a,12,32.85161,-117.26965,La_Jolla_Cove,Cove,16/01/2018 17:26,87
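
The cleaning steps named in the comments of the metadata cell (strip outer spaces, replace inner spaces with underscores, convert to uppercase) can also be written with pandas' vectorised string methods. The sketch below runs on a toy table with made-up values. Detecting text columns with `pd.api.types.is_string_dtype` is an assumption of this sketch and differs from the `dtype == str` comparison used in that cell.

```python
import pandas as pd

# Toy metadata with untidy names and values (made-up)
toy_md_untidy = pd.DataFrame(
    {"ATTRIBUTE_Sample_Area": [" La Jolla Cove", "Mission Bay "], "ATTRIBUTE_Batch": [1, 2]},
    index=[" s1.mzXML", "s2.mzXML "],
)

toy_md_clean = toy_md_untidy.copy()
# Remove leading and trailing spaces from the filenames in the index
toy_md_clean.index = toy_md_clean.index.str.strip()
for col in toy_md_clean.columns:
    # Only text columns are cleaned; numeric columns are left untouched
    if pd.api.types.is_string_dtype(toy_md_clean[col]):
        toy_md_clean[col] = (
            toy_md_clean[col].str.strip().str.replace(" ", "_", regex=False).str.upper()
        )
print(toy_md_clean)
```

Unlike a list comprehension that calls `item.strip()`, the `.str` methods pass missing values through instead of raising an error.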


In [85]:
new_ft = ft.copy() #storing the files under different names to preserve the original files
# changing the index in feature table to contain m/z and RT information
new_ft.index = [f"{id}_{round(mz, 3)}_{round(rt, 3)}" for id, mz, rt in zip(ft["row ID"], ft["row m/z"], ft["row retention time"])]
# drop all columns that are not mzML or mzXML file names
new_ft.drop(columns=[col for col in new_ft.columns if ".mz" not in col], inplace=True)
# remove " Peak area" from column names
new_ft.rename(columns={col: col.replace(" Peak area", "").strip() for col in new_ft.columns}, inplace=True)
# sort column names
new_ft= new_ft.reindex(sorted(new_ft.columns), axis=1)
print('Dimension: ',new_ft.shape)
new_ft.head()

Dimension:  (11217, 186)


Unnamed: 0,SD_01-2018_10_a.mzXML,SD_01-2018_10_b.mzXML,SD_01-2018_11_a.mzXML,SD_01-2018_11_b.mzXML,SD_01-2018_12_a.mzXML,SD_01-2018_12_b.mzXML,SD_01-2018_13_a.mzXML,SD_01-2018_13_b.mzXML,SD_01-2018_14_a.mzXML,SD_01-2018_14_b.mzXML,...,SD_12-2017_6_a.mzXML,SD_12-2017_6_b.mzXML,SD_12-2017_7_a.mzXML,SD_12-2017_7_b.mzXML,SD_12-2017_8_a.mzXML,SD_12-2017_8_b.mzXML,SD_12-2017_9_a.mzXML,SD_12-2017_9_b.mzXML,SD_12-2017_PPL_Bl_1.mzXML,SD_12-2017_PPL_Bl_2.mzXML
92572_151.035_13.364,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1560.679,0.0,0.0,3939.107,0.0,0.0,0.0
2513_151.035_1.13,0.0,156590.55,0.0,0.0,0.0,0.0,22862.7,0.0,29359.463,0.0,...,0.0,0.0,0.0,11498.99,0.0,0.0,0.0,0.0,0.0,0.0
42_151.035_0.551,2863941.0,3687233.2,2810288.0,2321774.2,3195918.0,2765738.8,4439634.0,3591492.5,2985472.0,3484729.0,...,1856001.5,1766485.0,1287448.5,1491507.0,1728245.0,1547097.4,1262373.0,1280963.1,4432.9683,6813.541
1870_151.035_0.887,201483.2,85594.53,23923.246,20954.787,81281.12,79683.164,140293.1,256066.56,249608.58,233550.1,...,16260.477,9554.87,73896.3,53041.84,8907.969,30851.541,0.0,0.0,0.0,0.0
2127_151.096_0.986,4317.684,14283.897,0.0,0.0,8685.125,0.0,7383.013,0.0,4742.709,4784.927,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


#### Test if filenames are identical
---
The output below should indicate that all files are present in both new_md & new_ft, and that metadata filenames and feature table column names are identical and in the same order. If the cell comparing the two lists returns False, some files are missing or named differently, and we can check which ones. If everything went well, the last cell should return an empty array.

In [86]:
# check if new_ft column names and md row names are the same
if sorted(new_ft.columns) == sorted(new_md.index):
    print(f"All {len(new_ft.columns)} files are present in both new_md & new_ft.")
else:
    print("Not all files are present in both new_md & new_ft.\n")
    # print the md rows / ft column which are not in ft columns / md rows and remove them
    ft_cols_not_in_md = [col for col in new_ft.columns if col not in new_md.index]
    print(f"These {len(ft_cols_not_in_md)} columns of feature table are not present in metadata table and will be removed:\n{', '.join(ft_cols_not_in_md)}\n")
    new_ft.drop(columns=ft_cols_not_in_md, inplace=True)
    md_rows_not_in_ft = [row for row in new_md.index if row not in new_ft.columns]
    print(f"These {len(md_rows_not_in_ft)} rows of metadata table are not present in feature table and will be removed:\n{', '.join(md_rows_not_in_ft)}\n")
    new_md.drop(md_rows_not_in_ft, inplace=True)

All 186 files are present in both new_md & new_ft.


In [87]:
list(new_ft.columns) == list(new_md.index) #if this returns False it means some files are missing

True

In [88]:
np.setdiff1d(list(new_ft.columns),list(new_md.index)) # if this is empty, no files should be missing or different between the metadata and the matrix

array([], dtype='<U34')

If the above cell returns some filenames, check the corresponding column names in the feature table for spelling mistakes, case-sensitive errors. Re-upload the files with correct metadata and rerun the above steps. 
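
When names do differ, it helps to see both directions of the mismatch and to flag pairs that differ only by case or stray spaces. The following is a minimal sketch of such a check; the helper `compare_names` and the filenames are hypothetical.

```python
def compare_names(ft_columns, md_index):
    """Report names unique to either table, plus pairs differing only by case or whitespace."""
    ft_set, md_set = set(ft_columns), set(md_index)
    only_ft = sorted(ft_set - md_set)
    only_md = sorted(md_set - ft_set)
    # Normalise the unmatched metadata names to spot near-identical names
    normalised_md = {name.strip().lower(): name for name in only_md}
    near_misses = [
        (name, normalised_md[name.strip().lower()])
        for name in only_ft
        if name.strip().lower() in normalised_md
    ]
    return only_ft, only_md, near_misses

# Example with made-up filenames
names_only_ft, names_only_md, names_near = compare_names(
    ["a.mzXML", "B.mzXML", "c.mzXML"],
    ["a.mzXML", "b.mzXML ", "d.mzXML"],
)
print(names_only_ft, names_only_md, names_near)
```

In the notebook, the function could be called as `compare_names(new_ft.columns, new_md.index)`.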

In [89]:
# print(new_ft.columns) # uncomment to check the column names of new_ft

In [90]:
#checking the dimensions of our new ft and md:
print("The number of rows and columns in our original ft is:", ft.shape,"\n")
print("The number of rows and columns in our new ft is:", new_ft.shape,"\n")
print("The number of rows and columns in our new md is:", new_md.shape)

The number of rows and columns in our original ft is: (11217, 200) 

The number of rows and columns in our new ft is: (11217, 186) 

The number of rows and columns in our new md is: (186, 13)


# <font color ='blue'> 3. Data-cleanup </font>
<a id ="data_cleanup"></a>

As a first step of our analysis, prior to data cleanup, let's have a look at the data using a simple Principal Coordinate Analysis (PCoA). You can also use Principal Component Analysis (PCA). [PCoA](#pcoa) is commonly used for environmental samples. To perform PCoA, we first transpose the feature table, [scale](#scaling) it and then calculate the distances using the Bray-Curtis metric. Scaling and PCoA are explained in their respective sections, so here we simply proceed with the following cells to perform PCoA.
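
For orientation, the computation behind PCoA can be sketched with NumPy and SciPy alone. This is a minimal illustration of classical PCoA on a Bray-Curtis distance matrix, not the implementation used in this notebook. The function name `pcoa_sketch` and the toy matrix are assumptions, and the scaling step is left out.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def pcoa_sketch(distance_matrix, n_axes=2):
    """Classical PCoA: coordinates and explained-variance ratios from a square distance matrix."""
    d = np.asarray(distance_matrix, dtype=float)
    n = d.shape[0]
    # Double-centre the squared distances (Gower centring)
    centring = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * centring @ (d ** 2) @ centring
    # Eigendecomposition of the symmetric matrix, largest eigenvalues first
    eigvals, eigvecs = np.linalg.eigh(b)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Bray-Curtis is not Euclidean, so negative eigenvalues can occur; they are set to zero here
    positive = np.clip(eigvals, 0, None)
    coords = eigvecs[:, :n_axes] * np.sqrt(positive[:n_axes])
    explained = positive[:n_axes] / positive.sum()
    return coords, explained

# Toy intensity matrix (made-up values): 4 samples in rows, 3 features in columns
toy_matrix = np.array([[10.0, 0.0, 5.0], [8.0, 1.0, 6.0], [0.0, 9.0, 2.0], [1.0, 7.0, 3.0]])
toy_bray_curtis = squareform(pdist(toy_matrix, metric="braycurtis"))
toy_coords, toy_explained = pcoa_sketch(toy_bray_curtis)
print(toy_coords.shape, toy_explained.round(3))
```

Samples have to be in rows for `pdist`, which is why the feature table is transposed before the distances are calculated.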

In [91]:
ft_t = new_ft.transpose() #transposing the ft
ft_t = ft_t.apply(pd.to_numeric, errors='coerce') #converting all values to numeric
list(new_ft.columns) == list(ft_t.index) #should return True now

True


In case we want to remove certain files of a particular condition, for ex: ATTRIBUTE_sample = "M", we can subset them out of our dataframe using the next cell. 

In [92]:
# subset_data = new_md[new_md['ATTRIBUTE_Sample']!='M']
# print('Dimension: ',subset_data.shape)
# inside_levels(subset_data)

Once we subset the data, we can further proceed to split the blanks from the sample in the cell below. If no subsetting is involved, you can simply split your metadata into blank and sample.

In [93]:
#If subset_data exists, it will take it as "data", else take new_md as "data"
if 'subset_data' in locals():
    data = subset_data
else:
    data = new_md
display(inside_levels(data))

condition = int(input("Enter the index number of the attribute to split sample and blank: "))
df = pd.DataFrame({"LEVELS": inside_levels(data).iloc[condition-1]["LEVELS"]})
df.index = [*range(1, len(df)+1)]
display(df)

#Among the shown levels of an attribute, select the ones to keep
blank_id = int(input("Enter the index number of your BLANK: "))
print('Your chosen blank is: ', df['LEVELS'][blank_id])

#Splitting the data into blanks and samples based on the metadata
md_blank = data[data[inside_levels(data)['ATTRIBUTES'][condition]] == df['LEVELS'][blank_id]]
blank = new_ft[list(md_blank.index)]
md_samples = data[data[inside_levels(data)['ATTRIBUTES'][condition]] != df['LEVELS'][blank_id]]
samples = new_ft[list(md_samples.index)]

Unnamed: 0,ATTRIBUTES,LEVELS,COUNT,TYPES
1,ATTRIBUTE_Sample.Type,"[Blank, Sample]","[6, 180]",<class 'str'>
2,ATTRIBUTE_Batch,"[1, 2, 3]","[62, 62, 62]",<class 'numpy.int64'>
3,ATTRIBUTE_Month,"[Dec, Jan, Oct]","[62, 62, 62]",<class 'str'>
4,ATTRIBUTE_Year,"[2017, 2018]","[62, 124]",<class 'numpy.int64'>
5,ATTRIBUTE_Sample_Location,"[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...","[6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...",<class 'numpy.int64'>
6,ATTRIBUTE_Replicate,"[a, b]","[93, 93]",<class 'str'>
7,ATTRIBUTE_Spot,"[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...","[6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...",<class 'numpy.int64'>
8,ATTRIBUTE_Latitude,"[32.75645, 32.75743, 32.75905, 32.76115, 32.76...","[6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...",<class 'numpy.float64'>
9,ATTRIBUTE_Longitude,"[-117.2872, -117.28664, -117.286, -117.28355, ...","[6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...",<class 'numpy.float64'>
10,ATTRIBUTE_Sample_Area,"[Blank, La_Jolla Reefs, La_Jolla_Cove, Mission...","[6, 36, 12, 36, 18, 12, 18, 48]",<class 'str'>


Enter the index number of the attribute to split sample and blank: 1


Unnamed: 0,LEVELS
1,Blank
2,Sample


Enter the index number of your BLANK: 1
Your chosen blank is:  Blank


In [94]:
# Display the chosen blank
print('Dimension: ',blank.shape)
blank.head()

Dimension:  (11217, 6)


Unnamed: 0,SD_01-2018_PPL_Bl_1.mzXML,SD_01-2018_PPL_Bl_2.mzXML,SD_10_2018_PPL_Blank_1.mzXML,SD_10_2018_PPL_Blank_2.mzXML,SD_12-2017_PPL_Bl_1.mzXML,SD_12-2017_PPL_Bl_2.mzXML
92572_151.035_13.364,0.0,0.0,0.0,0.0,0.0,0.0
2513_151.035_1.13,0.0,0.0,0.0,0.0,0.0,0.0
42_151.035_0.551,80114.62,21310.246,74143.17,105766.586,4432.9683,6813.541
1870_151.035_0.887,0.0,0.0,0.0,0.0,0.0,0.0
2127_151.096_0.986,23387.723,21032.016,0.0,115959.33,0.0,0.0


In [95]:
# Display the chosen samples
print('Dimension: ',samples.shape)
samples.head()

Dimension:  (11217, 180)


Unnamed: 0,SD_01-2018_10_a.mzXML,SD_01-2018_10_b.mzXML,SD_01-2018_11_a.mzXML,SD_01-2018_11_b.mzXML,SD_01-2018_12_a.mzXML,SD_01-2018_12_b.mzXML,SD_01-2018_13_a.mzXML,SD_01-2018_13_b.mzXML,SD_01-2018_14_a.mzXML,SD_01-2018_14_b.mzXML,...,SD_12-2017_5_a.mzXML,SD_12-2017_5_b.mzXML,SD_12-2017_6_a.mzXML,SD_12-2017_6_b.mzXML,SD_12-2017_7_a.mzXML,SD_12-2017_7_b.mzXML,SD_12-2017_8_a.mzXML,SD_12-2017_8_b.mzXML,SD_12-2017_9_a.mzXML,SD_12-2017_9_b.mzXML
92572_151.035_13.364,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1560.679,0.0,0.0,3939.107,0.0
2513_151.035_1.13,0.0,156590.55,0.0,0.0,0.0,0.0,22862.7,0.0,29359.463,0.0,...,0.0,0.0,0.0,0.0,0.0,11498.99,0.0,0.0,0.0,0.0
42_151.035_0.551,2863941.0,3687233.2,2810288.0,2321774.2,3195918.0,2765738.8,4439634.0,3591492.5,2985472.0,3484729.0,...,1354254.0,1318947.0,1856001.5,1766485.0,1287448.5,1491507.0,1728245.0,1547097.4,1262373.0,1280963.1
1870_151.035_0.887,201483.2,85594.53,23923.246,20954.787,81281.12,79683.164,140293.1,256066.56,249608.58,233550.1,...,6334.812,0.0,16260.477,9554.87,73896.3,53041.84,8907.969,30851.541,0.0,0.0
2127_151.096_0.986,4317.684,14283.897,0.0,0.0,8685.125,0.0,7383.013,0.0,4742.709,4784.927,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


**Now that we have our data ready, we can start with the cleanup steps!!**

<div class="alert alert-block alert-warning">
<b><font size=3> Skip the Batch correction section if you do not have multiple batches !! </font> </b> </div>

## <font color ='darkblue'> 3.1. Batch Correction (Optional) </font>
<a id="batch_corr"></a>

<p style='text-align: justify;'> A 'batch' is a group of samples processed and analyzed under the same experimental and instrumental conditions within the same short time period. In general, if we have more samples than the tray size allows, we might measure them in multiple batches or groups. When arranging samples in a batch for measurement, we should ensure biological diversity within the batch and, in addition to our samples of interest, include QCs, blanks, and controls (Wehrens et al., 2016). To merge data from different batches, we must look for batch effects, both between batches and within each batch, and correct for them. <b>However, before applying batch correction to a dataset, we should evaluate the severity of the batch effect. When it is small, it is best not to perform batch correction, as this may result in an incorrect estimation of the biological variance in the data. Instead, we should treat the statistical results with caution (Nygaard et al., 2016). For more details, please read the manuscript.</b></p>

<p style='text-align: justify;'> In this tutorial, the test dataset was used to evaluate the chemical impacts of a significant rain event that occurred in northern San Diego, California (USA) during the winter of 2017/2018. Although the metadata contains an "ATTRIBUTE_Batch" column, the 3 groups listed there are not considered batches because they were collected under distinct conditions. The "ATTRIBUTE_time_run" column clearly indicates that the seawater samples were collected and measured at different times, in Dec 2017, Jan 2018 (after rainfall), and Oct 2018, respectively. They were also collected 'before' and 'after' rainfall. Therefore, searching for inter-batch effects is not meaningful in our example dataset. As for intra-batch effects, the example dataset does not contain QCs, so we cannot correct for them.</p> 


Follow the notebook for Batch Correction: [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Functional-Metabolomics-Lab/Statistical-analysis-of-non-targeted-LC-MSMS-data/blob/main/Individual_Notebooks/R-Notebooks/Batch_Correction.ipynb)

## <font color ='darkblue'> 3.2 Blank removal </font>
<a id="norm"></a>


<p style='text-align: justify;'> In LC-MS/MS, we use solvent injections called blanks, which are usually injected from time to time to prevent carryover of the sample. The features coming from these blanks are also detected by the LC-MS/MS instrument. Our goal here is to remove these features from our samples. Other kinds of blanks whose signals can be removed include the growth medium alone in microbial growth experiments and the solvent used for extraction. It is therefore best practice to measure mass spectra of these blanks in addition to your sample spectra. </p>

**How do we remove these blank features?** </br> 
<p style='text-align: justify;'> Since the feature table is now split into blanks and samples, we can compare the two groups to identify the background features coming from the blanks. A common filtering method is to use a cutoff to remove features that are not sufficiently more abundant in the biological samples than in the blanks. </p>

The steps followed in the next few cells are:
1. <p style='text-align: justify;'> We calculate the mean intensity of every feature in the blank set and in the sample set. For n features, we therefore get n mean values per set. </p>
2. <p style='text-align: justify;'> Next, we calculate the ratio of average_blanks to average_samples for each feature. This blank/sample ratio tells us how much of a feature's intensity in the samples can be attributed to the blanks. If the ratio is not below the cutoff (for example 30%, i.e. a cutoff of 0.3), we consider the feature as noise. </p>
3. <p style='text-align: justify;'> The result of this comparison is stored per feature as a boolean flag: True = feature signal (ratio < cutoff), False = noise or background signal. </p>
4. <p style='text-align: justify;'> We count the features flagged as noise or background and remove them. </p>
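
As a toy illustration of the steps above, the sketch below applies the same blank/sample ratio to a made-up three-feature table. The feature names, intensities, and the variables `toy_blank` and `toy_samples` are hypothetical and not part of the tutorial dataset. The +1 pseudo-count mirrors the notebook cell further down and avoids division by zero.

```python
import pandas as pd

# Hypothetical toy tables: rows are features, columns are files
toy_blank = pd.DataFrame({"blank_1": [100.0, 0.0, 50.0],
                          "blank_2": [300.0, 0.0, 150.0]},
                         index=["feat_A", "feat_B", "feat_C"])
toy_samples = pd.DataFrame({"sample_1": [400.0, 900.0, 10000.0],
                            "sample_2": [600.0, 1100.0, 30000.0]},
                           index=["feat_A", "feat_B", "feat_C"])

cutoff = 0.3  # a feature is kept only if its blank/sample ratio is below this value

# Step 1: mean intensity per feature in blanks and in samples
avg_blank = toy_blank.mean(axis=1)
avg_samples = toy_samples.mean(axis=1)

# Step 2: ratio of mean blank to mean sample intensity (+1 avoids division by zero)
ratio = (avg_blank + 1) / (avg_samples + 1)

# Steps 3 and 4: flag real features and drop everything else
is_real_feature = ratio < cutoff
toy_cleaned = toy_samples[is_real_feature.values]

print(ratio.round(3))
print(toy_cleaned.index.tolist())
```

In this toy table only `feat_A` has a ratio above 0.3 (about 0.4), so it is treated as background and removed, while `feat_B` and `feat_C` are kept.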

**<font color='red'> The cutoff used to obtain all the files in the MZmine Results folder is 0.3 </font>**

In [None]:
#enter the directory for the results:
result_dir = input("Enter the path of the folder for all output files:\n")
os.chdir(result_dir)

In [97]:
blank_removal = samples.copy()
if (input("Do you want to perform Blank Removal- Y/N: ").upper()=="Y"):
    
    # When cutoff is low, more noise (or background) detected; With higher cutoff, less background detected, thus more features observed
    cutoff = float(input("Enter Cutoff value between 0.1 & 1 (Ideal cutoff range: 0.1-0.3): ")) # (i.e. 10% - 100%). Ideal cutoff range: 0.1-0.3
    
    # Getting mean for every feature in blank and Samples
    avg_blank = blank.mean(axis=1, skipna=False) # set skipna = False do not exclude NA/null values when computing the result.
    avg_samples = samples.mean(axis=1, skipna=False)

    # Getting the ratio of blank vs samples
    ratio_blank_samples = (avg_blank+1)/(avg_samples+1)

    # Create an array with boolean values: True (is a real feature, ratio<cutoff) / False (is a blank, background, noise feature, ratio>cutoff)
    is_real_feature = (ratio_blank_samples<cutoff)

    # Checking if there are any NA values present. Having NA values in the 4 variables will affect the final dataset to be created
    temp_NA_Count = pd.concat([avg_blank, avg_samples, ratio_blank_samples, is_real_feature], 
                            keys=['avg_blank', 'avg_samples', 'ratio_blank_samples', 'bg_bin'], axis = 1)
    
    print('No. of NA values in the following columns: ')
    display(pd.DataFrame(temp_NA_Count.isna().sum(), columns=['NA']))

    # Calculating the number of background features and features present (sum(is_real_feature) equals the number of features kept)
    print(f"No. of Background or noise features: {len(samples)-sum(is_real_feature)}")
    print(f"No. of features after excluding noise: {sum(is_real_feature)}")

    blank_removal = samples[is_real_feature.values]
    # save to file
    blank_removal.to_csv(os.path.join(result_dir, "Blanks_Removed.csv"))

Do you want to perform Blank Removal- Y/N: y
Enter Cutoff value between 0.1 & 1 (Ideal cutoff range: 0.1-0.3): 0.3
No. of NA values in the following columns: 


Unnamed: 0,NA
avg_blank,0
avg_samples,0
ratio_blank_samples,0
bg_bin,0


No. of Background or noise features: 2125
No. of features after excluding noise: 9092


In [98]:
print('Dimension: ',blank_removal.shape)
display(blank_removal.head())

Dimension:  (9092, 180)


Unnamed: 0,SD_01-2018_10_a.mzXML,SD_01-2018_10_b.mzXML,SD_01-2018_11_a.mzXML,SD_01-2018_11_b.mzXML,SD_01-2018_12_a.mzXML,SD_01-2018_12_b.mzXML,SD_01-2018_13_a.mzXML,SD_01-2018_13_b.mzXML,SD_01-2018_14_a.mzXML,SD_01-2018_14_b.mzXML,...,SD_12-2017_5_a.mzXML,SD_12-2017_5_b.mzXML,SD_12-2017_6_a.mzXML,SD_12-2017_6_b.mzXML,SD_12-2017_7_a.mzXML,SD_12-2017_7_b.mzXML,SD_12-2017_8_a.mzXML,SD_12-2017_8_b.mzXML,SD_12-2017_9_a.mzXML,SD_12-2017_9_b.mzXML
92572_151.035_13.364,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1560.679,0.0,0.0,3939.107,0.0
2513_151.035_1.13,0.0,156590.55,0.0,0.0,0.0,0.0,22862.7,0.0,29359.46,0.0,...,0.0,0.0,0.0,0.0,0.0,11498.99,0.0,0.0,0.0,0.0
42_151.035_0.551,2863941.0,3687233.2,2810288.0,2321774.2,3195918.0,2765738.8,4439634.5,3591492.0,2985472.0,3484729.0,...,1354254.0,1318947.0,1856002.0,1766485.0,1287448.0,1491507.0,1728245.0,1547097.4,1262373.0,1280963.1
1870_151.035_0.887,201483.2,85594.53,23923.246,20954.787,81281.12,79683.164,140293.06,256066.6,249608.6,233550.1,...,6334.812,0.0,16260.48,9554.87,73896.3,53041.84,8907.969,30851.541,0.0,0.0
1653_152.057_0.847,5206.803,5580.02,0.0,0.0,10935.638,10142.98,10469.594,10249.44,3827.583,2129.518,...,6396.476,4122.9,7767.999,11943.939,2908.759,5800.791,17471.045,15480.93,12566.81,15552.044


## <font color ='darkblue'> 3.3 Imputation </font>
<a id="norm"></a>

<p style='text-align: justify;'> For several reasons, real-world datasets may contain missing values in the form of NA, NaN or 0. Even though the gap-filling step of MZmine fills many of the missing values, we still end up with some missing values or 0s in our feature table. This can be problematic for statistical analysis. </p> 
<p style='text-align: justify;'> In order to have a better dataset, we cannot simply discard those rows or columns with missing values as we will lose a chunk of our valuable data. Instead we can try imputing those missing values. Imputation involves replacing the missing values in the data with a meaningful, reasonable guess. There are several methods, such as: </p> 
  
1) Mean imputation (replacing the missing values in a column with the mean or average of the column)  
2) Replacing it with the most frequent value  
3) Machine learning based imputation methods such as the k-nearest neighbors algorithm (k-NN) or Hidden Markov Models (HMM)

Here, we use the blank-removed feature table and plot the frequency distribution of its feature intensities. The plot shows in which intensity ranges most values fall.

In [99]:
bins, bins_label, a = [-1, 0, 1, 10], ['-1','0', "1", "10"], 2

while a<=10:
    bins_label.append(np.format_float_scientific(10**a))
    bins.append(10**a)
    a+=1

freq_table = pd.DataFrame(bins_label)
frequency = pd.DataFrame(np.array(np.unique(np.digitize(blank_removal.to_numpy(), bins, right=True), return_counts=True)).T).set_index(0)
freq_table = pd.concat([freq_table,frequency], axis=1).fillna(0).drop(0)
freq_table.columns = ['intensity', 'Frequency']
freq_table['Log(Frequency)'] = np.log(freq_table['Frequency']+1)

# get the lowest intensity (that is not zero) as a cutoff LOD value
cutoff_LOD = round(blank_removal.replace(0, np.nan).min(numeric_only=True).min())

fig = px.bar(freq_table, x="intensity", y="Log(Frequency)", template="plotly_white",  width=600, height=400)

fig.update_traces(marker_color="#696880")
fig.update_layout(font={"color":"grey", "size":12, "family":"Sans"},
                  title={"text":"FEATURE INTENSITY - FREQUENCY PLOT", 'x':0.5, "font_color":"#3E3D53"})
fig.write_image(os.path.join(result_dir, "frequency_plot.svg"))
fig.show()

A random number between this minimum value and zero will be used for imputation.
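
A minimal sketch of this idea on a hypothetical toy table is shown here. The table, the name `toy_lod`, and the fixed seed are assumptions made for illustration only. The sketch uses `np.random.default_rng` so that the result is reproducible, and it draws integers starting at 1 so that an imputed value can never be 0 again.

```python
import numpy as np
import pandas as pd

# Hypothetical toy feature table with zeros standing in for missing values
toy = pd.DataFrame({"sample_1": [0.0, 2500.0, 0.0],
                    "sample_2": [1200.0, 0.0, 980.0]},
                   index=["feat_A", "feat_B", "feat_C"])

# Smallest non-zero intensity, used as an approximate limit of detection (LOD)
toy_lod = round(toy.replace(0, np.nan).min().min())

# Draw one random integer in [1, toy_lod) per cell, then use it only where the value is 0
rng = np.random.default_rng(seed=42)  # fixed seed, assumed here for reproducibility
random_fill = pd.DataFrame(rng.integers(1, toy_lod, size=toy.shape),
                           index=toy.index, columns=toy.columns)
toy_imputed = toy.mask(toy == 0, random_fill)

print(toy_imputed)
```

Measured intensities stay untouched; only the former zeros are replaced by small random values below the lowest measured intensity.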

In [100]:
imputed = blank_removal.copy()
if(input("Do you want to perform Imputation? - Y/N: ").upper()=="Y"):
    #imputed.replace(0, np.random.randint(0, cutoff_LOD), inplace=True)
    # draw from 1 to cutoff_LOD-1 so that an imputed value is never 0 again (assumes cutoff_LOD > 1)
    imputed = imputed.apply(lambda x: [np.random.randint(1, cutoff_LOD) if v == 0 else v for v in x])
    print('Dimension: ',imputed.shape)
    display(imputed)
    # save to file
    imputed.to_csv(os.path.join(result_dir, f"Imputed_QuantTable.csv"))

Do you want to perform Imputation? - Y/N: y
Dimension:  (9092, 180)


Unnamed: 0,SD_01-2018_10_a.mzXML,SD_01-2018_10_b.mzXML,SD_01-2018_11_a.mzXML,SD_01-2018_11_b.mzXML,SD_01-2018_12_a.mzXML,SD_01-2018_12_b.mzXML,SD_01-2018_13_a.mzXML,SD_01-2018_13_b.mzXML,SD_01-2018_14_a.mzXML,SD_01-2018_14_b.mzXML,...,SD_12-2017_5_a.mzXML,SD_12-2017_5_b.mzXML,SD_12-2017_6_a.mzXML,SD_12-2017_6_b.mzXML,SD_12-2017_7_a.mzXML,SD_12-2017_7_b.mzXML,SD_12-2017_8_a.mzXML,SD_12-2017_8_b.mzXML,SD_12-2017_9_a.mzXML,SD_12-2017_9_b.mzXML
92572_151.035_13.364,5.200000e+02,134.000,469.000,802.000,565.000,413.000,179.000,4.720000e+02,4.110000e+02,7.690000e+02,...,5.350000e+02,891.0,3.380000e+02,472.000,8.600000e+02,1.560679e+03,41.000,603.000,3.939107e+03,205.000
2513_151.035_1.13,8.040000e+02,156590.550,245.000,757.000,2.000,29.000,22862.700,8.910000e+02,2.935946e+04,4.510000e+02,...,1.260000e+02,889.0,4.180000e+02,466.000,7.560000e+02,1.149899e+04,438.000,391.000,4.900000e+02,229.000
42_151.035_0.551,2.863941e+06,3687233.200,2810288.000,2321774.200,3195918.000,2765738.800,4439634.500,3.591492e+06,2.985472e+06,3.484729e+06,...,1.354254e+06,1318947.0,1.856002e+06,1766485.000,1.287448e+06,1.491507e+06,1728245.000,1547097.400,1.262373e+06,1280963.100
1870_151.035_0.887,2.014832e+05,85594.530,23923.246,20954.787,81281.120,79683.164,140293.060,2.560666e+05,2.496086e+05,2.335501e+05,...,6.334812e+03,486.0,1.626048e+04,9554.870,7.389630e+04,5.304184e+04,8907.969,30851.541,2.210000e+02,586.000
1653_152.057_0.847,5.206803e+03,5580.020,527.000,342.000,10935.638,10142.980,10469.594,1.024944e+04,3.827583e+03,2.129518e+03,...,6.396476e+03,4122.9,7.767999e+03,11943.939,2.908759e+03,5.800791e+03,17471.045,15480.930,1.256681e+04,15552.044
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
91167_1442.399_12.598,3.111652e+05,189971.030,386000.970,138754.100,477054.380,109891.700,201078.030,3.259739e+05,1.406355e+05,4.215174e+05,...,8.070000e+02,747.0,1.820000e+02,864.000,8.500000e+02,4.830000e+02,639.000,478.000,1.530000e+02,115.000
90242_1442.399_12.377,2.276051e+05,558666.940,307892.380,152410.160,112337.055,80392.050,160174.360,8.876620e+04,2.441697e+05,1.880748e+05,...,2.150000e+02,188.0,2.230000e+02,303.000,8.670000e+02,5.090000e+02,855.000,841.000,8.100000e+01,526.000
88493_1442.399_11.706,2.040702e+05,683280.750,307918.940,188791.100,184659.190,63493.695,144327.080,1.713194e+05,2.228125e+05,9.528194e+04,...,2.980000e+02,66.0,5.110000e+02,724.000,8.600000e+01,7.930000e+02,126.000,727.000,8.310000e+02,179.000
90600_1443.399_12.376,4.919888e+05,592310.400,308512.250,330795.970,103600.140,311601.620,172762.140,6.116949e+05,5.126081e+05,1.308232e+05,...,8.640000e+02,764.0,9.100000e+01,337.000,6.410000e+02,5.400000e+01,747.000,162.000,4.170000e+02,630.000


Too many missing values are problematic for statistical analyses. Here we count the missing values per feature (after imputation, these are the values at or below cutoff_LOD) and display the counts in a histogram.
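
The counting itself can be sketched on a hypothetical toy table. Note that this sketch counts zeros before imputation, whereas the notebook cell below counts values at or below cutoff_LOD after imputation. The table and variable names are made up for illustration.

```python
import pandas as pd

# Hypothetical toy feature table: rows are features, columns are samples
toy = pd.DataFrame({"s1": [0.0, 2500.0, 0.0],
                    "s2": [1200.0, 0.0, 0.0],
                    "s3": [900.0, 800.0, 0.0]},
                   index=["feat_A", "feat_B", "feat_C"])

# Number and fraction of zero (missing) values per feature (row)
n_missing = (toy == 0).sum(axis=1)
frac_missing = n_missing / toy.shape[1]

print(pd.DataFrame({"n_missing": n_missing, "frac_missing": frac_missing}))
```

A feature such as `feat_C`, which is missing in every sample, carries no information and would be a candidate for removal.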


In [101]:
# check the number of missing values per feature in a histogram
n_zeros = imputed.T.apply(lambda x: sum(x<=cutoff_LOD))

fig = px.histogram(n_zeros, template="plotly_white",  
                   width=600, height=400)

fig.update_traces(marker_color="#696880")
fig.update_layout(font={"color":"grey", "size":12, "family":"Sans"},
                  title={"text":"MISSING VALUES PER FEATURE", 'x':0.5, "font_color":"#3E3D53"},
                  xaxis_title="number of missing values", yaxis_title="count", showlegend=False)
fig.write_image(os.path.join(result_dir, "number_of_missing_values_per_feature.svg"))
fig.show()

## <font color ='darkblue'> 3.4 Normalization </font>
<a id="norm"></a>

The following code performs sample-centric (column-wise) normalisation:

In [102]:
normalized = imputed.copy()
if(input("Do you want to perform Normalization? - Y/N: ").upper()=="Y"):
    # Dividing each element of a particular column with its column sum
    normalized = normalized.apply(lambda x: x/np.sum(x), axis=0)
    
    # save to file
    normalized.to_csv(os.path.join(result_dir, "Normalised_Quant_table.csv"))
    
    print('Dimension: ', normalized.shape)
    display(normalized.head())

Do you want to perform Normalization? - Y/N: y
Dimension:  (9092, 180)


Unnamed: 0,SD_01-2018_10_a.mzXML,SD_01-2018_10_b.mzXML,SD_01-2018_11_a.mzXML,SD_01-2018_11_b.mzXML,SD_01-2018_12_a.mzXML,SD_01-2018_12_b.mzXML,SD_01-2018_13_a.mzXML,SD_01-2018_13_b.mzXML,SD_01-2018_14_a.mzXML,SD_01-2018_14_b.mzXML,...,SD_12-2017_5_a.mzXML,SD_12-2017_5_b.mzXML,SD_12-2017_6_a.mzXML,SD_12-2017_6_b.mzXML,SD_12-2017_7_a.mzXML,SD_12-2017_7_b.mzXML,SD_12-2017_8_a.mzXML,SD_12-2017_8_b.mzXML,SD_12-2017_9_a.mzXML,SD_12-2017_9_b.mzXML
92572_151.035_13.364,3.444975e-07,1.009382e-07,3.363454e-07,5.550733e-07,3.612357e-07,2.63783e-07,1.183178e-07,3.048915e-07,2.191244e-07,5.371341e-07,...,7.707809e-07,1.281412e-06,4.421695e-07,6.126857e-07,1.000499e-06,2e-06,4.104435e-08,6.034876e-07,4.220049e-06,2.215596e-07
2513_151.035_1.13,5.326462e-07,0.000117955,1.757028e-07,5.239283e-07,1.27871e-09,1.852229e-08,1.511209e-05,5.755474e-07,1.565298e-05,3.150162e-07,...,1.815297e-07,1.278536e-06,5.46825e-07,6.048973e-07,8.795088e-07,1.2e-05,4.384738e-07,3.913162e-07,5.249475e-07,2.474983e-07
42_151.035_0.551,0.001897347,0.002777483,0.00201541,0.001606926,0.002043327,0.001766477,0.002934569,0.002319948,0.001591703,0.002434027,...,0.00195109,0.001896874,0.00242801,0.002293009,0.001497781,0.0016,0.001730115,0.001548348,0.001352407,0.001384437
1870_151.035_0.887,0.0001334817,6.447582e-05,1.715666e-05,1.450305e-05,5.196751e-05,5.089362e-05,9.273279e-05,0.0001654079,0.0001330787,0.000163131,...,9.126641e-06,6.989521e-07,2.127186e-05,1.240282e-05,8.596884e-05,5.7e-05,8.917606e-06,3.087649e-05,2.36762e-07,6.333362e-07
1653_152.057_0.847,3.449482e-06,4.203263e-06,3.779404e-07,2.367021e-07,6.991757e-06,6.478319e-06,6.920333e-06,6.620692e-06,2.040674e-06,1.487434e-06,...,9.21548e-06,5.929444e-06,1.016205e-05,1.550398e-05,3.383967e-06,6e-06,1.748995e-05,1.549345e-05,1.34631e-05,1.680831e-05
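
To see what column-wise normalization does on a small scale, the sketch below divides each column of a hypothetical toy table by its column sum. This is often referred to as sum or total-ion-current (TIC) normalization. The table and variable names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy table: rows are features, columns are samples
toy = pd.DataFrame({"sample_1": [10.0, 30.0, 60.0],
                    "sample_2": [200.0, 200.0, 400.0]},
                   index=["feat_A", "feat_B", "feat_C"])

# Divide every column by its own sum, so that each sample adds up to 1
toy_normalized = toy.div(toy.sum(axis=0), axis=1)

print(toy_normalized)
print(toy_normalized.sum(axis=0))
```

After this step the values are relative abundances within each sample, which makes samples with different total signal comparable.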


## <font color ='darkblue'> 3.5 Scaling </font>
<a id="norm"></a>

For statistical analysis, the data should be made comparable across the complete dataframe by centering and scaling each feature.
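
As a rough sketch of what centering and scaling mean, the toy example below subtracts the mean of each feature and divides by its population standard deviation (ddof=0), which is the same operation that scikit-learn's `StandardScaler` applies by default. The table and names are hypothetical.

```python
import pandas as pd

# Hypothetical toy table in the transposed layout: rows are samples, columns are features
toy = pd.DataFrame({"feat_A": [1.0, 2.0, 3.0, 4.0],
                    "feat_B": [100.0, 300.0, 500.0, 700.0]},
                   index=["s1", "s2", "s3", "s4"])

# Center each feature on its mean and divide by its population standard deviation (ddof=0)
toy_scaled = (toy - toy.mean(axis=0)) / toy.std(axis=0, ddof=0)

print(toy_scaled.round(3))
print(toy_scaled.mean(axis=0).round(6))
print(toy_scaled.std(axis=0, ddof=0).round(6))
```

Although the two toy features differ by orders of magnitude, they end up on the same scale with mean 0 and standard deviation 1, so that no feature dominates the analysis only because of its intensity range.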

In [103]:
# transposing the imputed table before scaling
transposed = imputed.T
print(f'Imputed feature table rows/columns: {transposed.shape}')
display(transposed.head(3))
# put the rows in the feature table and metadata in the same order
transposed.sort_index(inplace=True)
md_samples.sort_index(inplace=True)

if (md_samples.index == transposed.index).all():
    pass
else:
    print("WARNING: Sample names in feature and metadata table are NOT the same!")

transposed.to_csv(os.path.join(result_dir, "Imputed_QuantTable_transposed.csv"))

Imputed feature table rows/columns: (180, 9092)


Unnamed: 0,92572_151.035_13.364,2513_151.035_1.13,42_151.035_0.551,1870_151.035_0.887,1653_152.057_0.847,39_153.033_0.55,91313_153.138_12.628,5376_155.07_2.215,48546_155.07_6.351,8717_157.086_2.771,...,89389_1370.381_11.978,90659_1370.382_12.502,92155_1370.382_12.968,89872_1370.954_12.136,89496_1442.399_12.015,91167_1442.399_12.598,90242_1442.399_12.377,88493_1442.399_11.706,90600_1443.399_12.376,91938_1443.4_12.895
SD_01-2018_10_a.mzXML,520.0,804.0,2863941.0,201483.19,5206.8027,513062.94,25667.047,18414.531,553.0,5073.1,...,790.0,282.0,348.0,294.0,708503.75,311165.25,227605.08,204070.17,491988.75,838.0
SD_01-2018_10_b.mzXML,134.0,156590.55,3687233.2,85594.53,5580.02,634986.6,74335.484,9468.36,71.0,882.0,...,155.0,210.0,490.0,126.0,313226.5,189971.03,558666.94,683280.75,592310.4,128523.984
SD_01-2018_11_a.mzXML,469.0,245.0,2810288.0,23923.246,527.0,425097.75,96858.484,86009.2,623.0,889.0,...,327.0,891.0,194.0,85.0,401128.94,386000.97,307892.38,307918.94,308512.25,114226.63


In [104]:
# scale filtered data
scaled = pd.DataFrame(StandardScaler().fit_transform(transposed), index=transposed.index, columns=transposed.columns)
scaled.to_csv(os.path.join(result_dir, "Imputed_Scaled_QuantTable.csv"))

# Merge feature table and metadata to one dataframe:
# "how=inner" performs an inner join (only the filenames that appear in md_samples and data are kept)
data = pd.merge(md_samples, scaled, left_index=True, right_index=True, how="inner")
display(data.head())

Unnamed: 0,ATTRIBUTE_Sample.Type,ATTRIBUTE_Batch,ATTRIBUTE_Month,ATTRIBUTE_Year,ATTRIBUTE_Sample_Location,ATTRIBUTE_Replicate,ATTRIBUTE_Spot,ATTRIBUTE_Latitude,ATTRIBUTE_Longitude,ATTRIBUTE_Sample_Area,...,89389_1370.381_11.978,90659_1370.382_12.502,92155_1370.382_12.968,89872_1370.954_12.136,89496_1442.399_12.015,91167_1442.399_12.598,90242_1442.399_12.377,88493_1442.399_11.706,90600_1443.399_12.376,91938_1443.4_12.895
SD_01-2018_10_a.mzXML,Sample,2,Jan,2018,10,a,10,32.86261,-117.26042,SIO_La_Jolla_Shores,...,-0.162225,-0.190591,-0.182257,-0.552037,4.350746,2.029883,2.142192,2.071779,2.574828,-0.269252
SD_01-2018_10_b.mzXML,Sample,2,Jan,2018,10,b,10,32.86261,-117.26042,SIO_La_Jolla_Shores,...,-0.174409,-0.191799,-0.178841,-0.553284,1.724593,1.096327,5.770871,7.714563,3.172589,0.782374
SD_01-2018_11_a.mzXML,Sample,2,Jan,2018,11,a,11,32.85601,-117.26253,SIO_La_Jolla_Shores,...,-0.171109,-0.18037,-0.185962,-0.553588,2.308602,2.606341,3.0222,3.294615,1.481595,0.664621
SD_01-2018_11_b.mzXML,Sample,2,Jan,2018,11,b,11,32.85601,-117.26253,SIO_La_Jolla_Shores,...,-0.170725,-0.192839,-0.178697,-0.551072,2.363603,0.701804,1.318001,1.891865,1.614371,4.752602
SD_01-2018_12_a.mzXML,Sample,2,Jan,2018,12,a,12,32.85161,-117.26965,La_Jolla_Cove,...,-0.176673,-0.183592,-0.175737,-0.552289,1.921049,3.307723,0.878771,1.843211,0.260638,-0.273041


# <font color ='blue'> 4. Univariate Analysis </font>
<a id="uni"></a>

<p style='text-align: justify;'>Univariate statistics involves analysing one variable (or one category) at a time in an attempt to describe the data. In univariate statistics, our null hypothesis H0 states that there is no relationship between different groups or categories. To test this hypothesis, we use statistical tests to either reject the null hypothesis (meaning there is a relationship between groups) or fail to reject it (meaning there is no evidence for a relationship). Below is a list of some parametric and non-parametric tests used for hypothesis testing. In general, parametric tests assume the data to be normally distributed, whereas non-parametric tests make no such assumption about the distribution of the data. </p>

<table>
    <thead>
        <tr><th><font size=3>Parametric Test</font></th>
            <th><font size=3>Non-Parametric test</font></th>
        </tr>
    </thead>
    <tbody>
        <tr><td><font size=3>Paired t-test</font></td>
            <td><font size=3>Wilcoxon signed-rank test</font></td></tr>
        <tr><td><font size=3>Unpaired t-test</font></td>
            <td><font size=3>Mann Whitney U-test</font></td></tr>
        <tr><td><font size=3>One-way ANOVA</font></td>
            <td><font size=3>Kruskal Wallis Test</font></td></tr>
    </tbody>
</table>

In the following section we will use univariate statistical analyses to investigate how the metabolome is influenced by:
*   Sampling site: We will compare seven different sampling areas and investigate if there is a gradual shift in metabolite levels along the coast. 
*   Heavy rainfall: We will compare the metabolite levels before and after a heavy rainfall in January 2018.

Once again, let's merge metadata and the scaled data to one dataframe.

In [105]:
Data = pd.merge(md_samples, scaled, left_index=True, right_index=True)
Data

Unnamed: 0,ATTRIBUTE_Sample.Type,ATTRIBUTE_Batch,ATTRIBUTE_Month,ATTRIBUTE_Year,ATTRIBUTE_Sample_Location,ATTRIBUTE_Replicate,ATTRIBUTE_Spot,ATTRIBUTE_Latitude,ATTRIBUTE_Longitude,ATTRIBUTE_Sample_Area,...,89389_1370.381_11.978,90659_1370.382_12.502,92155_1370.382_12.968,89872_1370.954_12.136,89496_1442.399_12.015,91167_1442.399_12.598,90242_1442.399_12.377,88493_1442.399_11.706,90600_1443.399_12.376,91938_1443.4_12.895
SD_01-2018_10_a.mzXML,Sample,2,Jan,2018,10,a,10,32.86261,-117.26042,SIO_La_Jolla_Shores,...,-0.162225,-0.190591,-0.182257,-0.552037,4.350746,2.029883,2.142192,2.071779,2.574828,-0.269252
SD_01-2018_10_b.mzXML,Sample,2,Jan,2018,10,b,10,32.86261,-117.26042,SIO_La_Jolla_Shores,...,-0.174409,-0.191799,-0.178841,-0.553284,1.724593,1.096327,5.770871,7.714563,3.172589,0.782374
SD_01-2018_11_a.mzXML,Sample,2,Jan,2018,11,a,11,32.85601,-117.26253,SIO_La_Jolla_Shores,...,-0.171109,-0.180370,-0.185962,-0.553588,2.308602,2.606341,3.022200,3.294615,1.481595,0.664621
SD_01-2018_11_b.mzXML,Sample,2,Jan,2018,11,b,11,32.85601,-117.26253,SIO_La_Jolla_Shores,...,-0.170725,-0.192839,-0.178697,-0.551072,2.363603,0.701804,1.318001,1.891865,1.614371,4.752602
SD_01-2018_12_a.mzXML,Sample,2,Jan,2018,12,a,12,32.85161,-117.26965,La_Jolla_Cove,...,-0.176673,-0.183592,-0.175737,-0.552289,1.921049,3.307723,0.878771,1.843211,0.260638,-0.273041
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
SD_12-2017_7_b.mzXML,Sample,1,Dec,2017,7,b,7,32.88456,-117.25879,Torrey_Pines,...,-0.173315,-0.183642,-0.170012,-0.550233,-0.353466,-0.363294,-0.346946,-0.321843,-0.356335,-0.273436
SD_12-2017_8_a.mzXML,Sample,1,Dec,2017,8,a,8,32.87627,-117.25570,Torrey_Pines,...,-0.168614,-0.190238,-0.169483,-0.552245,-0.354349,-0.362093,-0.343153,-0.329697,-0.352206,-0.269623
SD_12-2017_8_b.mzXML,Sample,1,Dec,2017,8,b,8,32.87627,-117.25570,Torrey_Pines,...,-0.168519,-0.185505,-0.177806,-0.548228,-0.352761,-0.363333,-0.343307,-0.322620,-0.355692,-0.275330
SD_12-2017_9_a.mzXML,Sample,1,Dec,2017,9,a,9,32.86989,-117.25836,SIO_La_Jolla_Shores,...,-0.162129,-0.193981,-0.188608,-0.553707,-0.355273,-0.365836,-0.351637,-0.321396,-0.354172,-0.274268


## <font color ='darkblue'> 4.1 Test for normality </font>
<a id="norm_test"></a>

In order to decide whether to go for parametric or non-parametric tests, we test for normality. Some common methods to test for normality are:
1. Visual representations like histogram, Q–Q Plot
2. Statistical tests such as Shapiro–Wilk test, Kolmogorov–Smirnov test

The null hypothesis (H0) of these statistical tests states that the data has a normal distribution. We fail to reject H0 if p > 0.05. <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6350423/">Read more about normality tests</a>

Let's start by inspecting a histogram of the first feature in the dataset:
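
A minimal sketch of such a check is given here, assuming SciPy is available. It uses synthetic, right-skewed values as a hypothetical stand-in for one feature, so the numbers are not results from the tutorial dataset.

```python
import numpy as np
from scipy import stats

# Hypothetical stand-in for one feature: 180 right-skewed intensities
rng = np.random.default_rng(seed=0)
feature = rng.lognormal(mean=0.0, sigma=1.0, size=180)

# Text version of a histogram: number of values per bin
counts, bin_edges = np.histogram(feature, bins=10)
print(counts)

# Shapiro-Wilk test: H0 states that the data are normally distributed
w_stat, p_value = stats.shapiro(feature)
print(f"W = {w_stat:.3f}, p = {p_value:.3g}")
print("Normality rejected" if p_value < 0.05 else "Normality not rejected")
```

In the notebook, the same two calls could be applied to the first feature by using `scaled.iloc[:, 0]` in place of `feature`.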

## <font color ='darkblue'> 4.2 ANOVA </font>
<a id="anova"></a>

<p style='text-align: justify;'> We can also perform the parametric test ANOVA (analysis of variance) on our data. Here, we test whether metabolite levels differ between sampling sites by comparing the seven sampling areas. We use the function 'pg.anova' (pingouin) to run the analysis. ANOVA compares the variance between groups with the variance within groups to see whether the group means differ from each other (variance = SD<sup>2</sup>). </p>
<p style='text-align: justify;'> 
<b>H0 = No differences among the group means</b>. Using the p-value, one can see whether the groups are statistically different from one another. When there is a significant difference, the F-ratio, another output of ANOVA, will be larger and H0 will be rejected. </p>

$$F\text{-statistic} = \frac{\text{between-group variance}}{\text{within-group variance}}$$
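
The ratio above can be worked through by hand on a small example. The sketch below uses three hypothetical groups of three values each; the numbers are made up and only serve to show how the between-group and within-group variances are obtained.

```python
import numpy as np

# Hypothetical intensities of one feature in three groups
groups = [np.array([1.0, 2.0, 3.0]),
          np.array([2.0, 3.0, 4.0]),
          np.array([5.0, 6.0, 7.0])]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)        # number of groups
n = all_values.size    # total number of observations

# Between-group variance: spread of the group means around the grand mean
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
ms_between = ss_between / (k - 1)

# Within-group variance: spread of the observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
ms_within = ss_within / (n - k)

f_statistic = ms_between / ms_within
print(f_statistic)
```

For these toy numbers the between-group variance is 13 and the within-group variance is 1, so the F-statistic works out to 13: the group means differ far more than the values within each group.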

In [106]:
# select an attribute to perform ANOVA
anova_attribute = 'ATTRIBUTE_Sample_Area'

In [107]:
def gen_anova_data(df, columns, groups_col):
    for col in columns:
        result = pg.anova(data=df, dv=col, between=groups_col, detailed=True).set_index('Source')
        p = result.loc[groups_col, 'p-unc']
        f = result.loc[groups_col, 'F']
        yield col, p, f

dtypes = [('metabolite', 'U100'), ('p', 'f'), ('F', 'f')]
anova = pd.DataFrame(np.fromiter(gen_anova_data(data, scaled.columns, anova_attribute), dtype=dtypes))
anova

Unnamed: 0,metabolite,p,F
0,92572_151.035_13.364,0.083717,1.897880
1,2513_151.035_1.13,0.763463,0.557745
2,42_151.035_0.551,0.041332,2.243144
3,1870_151.035_0.887,0.832150,0.466937
4,1653_152.057_0.847,0.315791,1.186243
...,...,...,...
9087,91167_1442.399_12.598,0.000025,5.598529
9088,90242_1442.399_12.377,0.000797,4.050007
9089,88493_1442.399_11.706,0.001017,3.941154
9090,90600_1443.399_12.376,0.001904,3.660210


The following is of interest:
*   Feature ID (column 'metabolite')
*   p-value for ANOVA
*   p-value after taking multiple tests into consideration
*   F-value
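
The notebook cell below uses the Bonferroni method to take multiple testing into account. As a rough sketch of what this correction does, each p-value is multiplied by the number of tests and capped at 1. The p-values here are hypothetical.

```python
import numpy as np

# Hypothetical uncorrected p-values from four tests
p_values = np.array([0.001, 0.02, 0.04, 0.5])

# Bonferroni: multiply each p-value by the number of tests and cap at 1
p_bonferroni = np.minimum(p_values * p_values.size, 1.0)
significant = p_bonferroni < 0.05

print(p_bonferroni)
print(significant)
```

Three of the four toy p-values are below 0.05 before correction, but only the smallest one remains significant afterwards. With thousands of features tested, as in this dataset, the correction becomes correspondingly stricter.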

In [108]:
# add Bonferroni corrected p-values for multiple testing correction
if 'p_bonferroni' not in anova.columns:
    anova.insert(2, 'p_bonferroni', pg.multicomp(anova['p'], method='bonf')[1])
# add significance
if 'significant' not in anova.columns:
    anova.insert(3, 'significant', anova['p_bonferroni'] < 0.05)
# sort by p-value
anova.sort_values('p', inplace=True)
# save ANOVA table
anova.to_csv(os.path.join(result_dir, 'ANOVA_results.csv'))
anova

Unnamed: 0,metabolite,p,p_bonferroni,significant,F
2812,59188_312.231_7.625,9.262929e-30,8.421854e-26,True,39.016209
1394,33200_260.196_4.886,1.620657e-28,1.473501e-24,True,36.769932
2862,57080_314.247_7.36,3.029371e-24,2.754304e-20,True,29.584505
1082,21870_246.18_3.969,7.050516e-23,6.410329e-19,True,27.449234
1035,80910_243.174_10.41,1.678681e-21,1.526256e-17,True,25.373669
...,...,...,...,...,...
535,560_217.068_0.628,9.994868e-01,1.000000e+00,False,0.049824
8355,51908_729.432_6.783,9.997691e-01,1.000000e+00,False,0.037827
4873,19113_381.238_3.737,9.998595e-01,1.000000e+00,False,0.031914
4416,47359_365.219_6.284,9.998650e-01,1.000000e+00,False,0.031476


**Plot ANOVA results**

We will use plotly to visualize results from the ANOVA, with log(F-values) on the x-axis and -log(p) on the y-axis. Features are colored by statistical significance after multiple test correction. Since there are large differences in the F- and p-values, it is easier to plot their log.

We can also display the names of some of the top features in the plot. This easily gets very cluttered if we decide to display too many names, so starting at the top 5 could be a good idea.

In [109]:
# Create Top Table
def topx_metabolite (data, met_col, sort_col, top):

    table_val = []
    data = data.sort_values(by=sort_col)
    for met in data[met_col].iloc[:top]:
        table_val.append(met.split('_'))

    array = np.array(table_val)
    transposed_array = array.T
    transposed_list_of_lists = transposed_array.tolist()
    transposed_list_of_lists.insert(0, ['ID', 'mz', 'rt', sort_col])
    sort_val = data[sort_col].iloc[:top].to_list()
    transposed_list_of_lists.append([f"{val:.3e}" for val in sort_val])
    
    return transposed_list_of_lists

In [110]:
top = 10

fig = make_subplots(
    rows=1, cols=2,
    shared_xaxes=True,
    horizontal_spacing=0.03,
    column_widths = [300,200],
    subplot_titles = ["ANOVA - Feature Significance", None],
    specs=[[{"type": "scatter"}, {"type": "table"}]])

# Split features by significance so that x and y always refer to the same rows
sig = anova[anova['significant']]
nonsig = anova[~anova['significant']]
top_sig = sig.sort_values('p').iloc[:top]

scatter1 = fig.add_trace(go.Scatter(x=np.log(nonsig['F']),
                                    y=-np.log(nonsig['p']),
                                    mode = 'markers', marker=dict(color="#ef553b"),
                                    name="Non-significant"),
                         row=1, 
                         col=1)

scatter2 = fig.add_trace(go.Scatter(x=np.log(sig['F']),
                                    y=-np.log(sig['p']),
                                    mode = 'markers', marker=dict(color="#696880"), 
                                    name="Significant"), 
                         row=1, 
                         col=1)

scatter3 = fig.add_trace(go.Scatter(x=np.log(top_sig['F']),
                                    y=-np.log(top_sig['p']),
                                    mode = 'markers', marker=dict(color="#EC7C1E",
                                                                  size = 8,
                                                                  line=dict(width=0.5,
                                                                            color='black')), 
                                    name=f"Most significant - (Top {top})"), 
                         row=1, 
                         col=1)

table = fig.add_trace(go.Table(header=dict(values=topx_metabolite(anova, 'metabolite', 'p', top)[0],
                                           font=dict(size=12, color='white'),
                                           align="left",
                                           fill_color = "#EC7C1E",line = dict(color='white', width=0.5)),
                               cells=dict(values=topx_metabolite(anova, 'metabolite', 'p', top)[1:],
                                          align = "left",
                                         fill_color='#F7ECD9')),
                      row=1, col=2)

fig.update_layout(template = "plotly_white",
                  xaxis_title="Log(F)",
                  yaxis_title="-Log(p)",
                  legend_title = "Significant",
                  height=700,
                  yaxis_color = "gray",
                  xaxis_color = "gray",
                 legend=dict(x=0.6, y=0.1))


In [111]:
# boxplots with top 4 metabolites from ANOVA
for metabolite in anova.sort_values('p_bonferroni').iloc[:4, 0]:
    fig = px.box(data, x=anova_attribute, y=metabolite, color=anova_attribute)
    fig.update_layout(showlegend=False, title=metabolite, xaxis_title="", yaxis_title="intensity", template="plotly_white", width=500)
    display(fig)

## <font color ='darkblue'> 4.3 Tukey's post-hoc test </font>
<a id ="tukey"></a>
As mentioned above, Tukey's test is a common post-hoc test after a one-way ANOVA. It also assumes that the data are normally distributed and homoscedastic (having equal variances). Once we know that there is a significant difference among the sampling sites, we can use Tukey's test to determine which features show statistically significant differences between two specific sampling sites.
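
To illustrate the idea on a small scale, the sketch below runs a one-way ANOVA followed by Tukey's test on synthetic data for a single feature. It uses `scipy.stats.tukey_hsd` (available in SciPy 1.8 and later) instead of the pingouin function used in this notebook, and the three sampling sites are made up for the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic intensities for one feature at three hypothetical sampling sites
site_a = rng.normal(loc=0.0, scale=1.0, size=30)
site_b = rng.normal(loc=0.1, scale=1.0, size=30)
site_c = rng.normal(loc=3.0, scale=1.0, size=30)

# The omnibus one-way ANOVA only tells us that some site differs
f_stat, p_anova = stats.f_oneway(site_a, site_b, site_c)

# Tukey's HSD then tests every pair of sites
res = stats.tukey_hsd(site_a, site_b, site_c)

# res.pvalue[i, j] is the adjusted p-value for site i versus site j
print(p_anova)
print(res.pvalue)
```

Only the pairs that involve `site_c` are constructed to differ strongly, so these are the comparisons we would expect to come out as significant.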


In [112]:
# functions to run Tukey's and plot results

def tukey_post_hoc_test(anova_attribute, contrasts, metabolites):
    """
    Perform pairwise Tukey test for all metabolites between contrast combinations.

    Args:
        anova_attribute: A string representing the attribute to use in ANOVA.
        contrasts: A list of tuples, where each tuple contains two strings representing the groups to compare.
        metabolites: A list of strings representing the metabolites to test.

    Returns:
        A pandas DataFrame containing the results of the pairwise Tukey test, including the contrast,
        metabolite, numeric metabolite ID, difference between the group means, p-value, Bonferroni
        corrected p-value, and significance (True or False).
    """

    # if a single metabolite gets passed make sure to put it in a list
    if isinstance(metabolites, str):
        metabolites = [metabolites]

    def gen_pairwise_tukey(df, contrasts, metabolites):
        """ Yield results for pairwise Tukey test for all metabolites between contrast combinations."""
        for metabolite in metabolites:
            for contrast in contrasts:
                # select only the samples that belong to the two groups of this contrast
                df_for_tukey = df.loc[df[anova_attribute].isin([contrast[0], contrast[1]]), [metabolite, anova_attribute]]
                pairwise_tukey = pg.pairwise_tukey(df_for_tukey, dv=metabolite, between=anova_attribute)
                # with two groups the result has a single row, so take scalar values
                yield f'{contrast[0]}-{contrast[1]}', metabolite, int(metabolite.split('_')[0]), pairwise_tukey['diff'].iloc[0], pairwise_tukey['p-tukey'].iloc[0]

    dtypes = [('contrast', 'U100'), ('stats_metabolite', 'U100'), ('stats_ID', 'i'), ('stats_diff', 'f'), ('stats_p', 'f')]
    tukey = pd.DataFrame(np.fromiter(gen_pairwise_tukey(data, contrasts, metabolites), dtype=dtypes))
    # add Bonferroni corrected p-values
    tukey.insert(5, 'stats_p_bonferroni', pg.multicomp(tukey['stats_p'], method='bonf')[1])
    # add significance
    tukey.insert(6, 'stats_significant', tukey['stats_p_bonferroni'] < 0.05)
    # sort by p-value
    tukey.sort_values('stats_p', inplace=True)

    # write output to csv file
    tukey.to_csv(os.path.join(result_dir, 'TukeyHSD_output.csv'))

    return tukey

def plot_tukey(df):

    # create figure
    fig = px.scatter(template='plotly_white', width=600, height=600)

    # plot insignificant values
    fig.add_trace(go.Scatter(x=df[df['stats_significant'] == False]['stats_diff'],
                            y=df[df['stats_significant'] == False]['stats_p'].apply(lambda x: -np.log(x)),
                            mode='markers', marker_color='#696880', name='insignificant'))

    # plot significant values and label the four with the lowest p-values
    sig = df[df['stats_significant']].sort_values('stats_p')
    labels = [name if i < 4 else '' for i, name in enumerate(sig['stats_metabolite'])]
    fig.add_trace(go.Scatter(x=sig['stats_diff'],
                            y=sig['stats_p'].apply(lambda x: -np.log(x)),
                            mode='markers+text', text=labels, textposition='top left', 
                            textfont=dict(color='#ef553b', size=8), marker_color='#ef553b', name='significant'))

    fig.update_layout(font={"color":"grey", "size":12, "family":"Sans"},
                    title={"text":"TUKEY", 'x':0.5, "font_color":"#3E3D53"},
                    xaxis_title="stats_diff", yaxis_title="-log(p)")

    # save image as pdf
    fig.write_image(os.path.join(result_dir, "TukeyHSD.pdf"), scale=3)

    display(fig)

For the most significant feature from ANOVA:

In [113]:
contrasts = list(itertools.combinations(set(data[anova_attribute]), 2)) # all possible combinations
tukey = tukey_post_hoc_test(anova_attribute, contrasts, anova['metabolite'].iloc[0])
display(tukey)

Unnamed: 0,contrast,stats_metabolite,stats_ID,stats_diff,stats_p,stats_p_bonferroni,stats_significant
11,Mission_Bay-Torrey_Pines,59188_312.231_7.625,59188,1.874555,1.660894e-13,3.487877e-12,True
12,Mission_Bay-La_Jolla Reefs,59188_312.231_7.625,59188,-1.906257,3.858347e-11,8.102529e-10,True
1,Mission_Beach-Mission_Bay,59188_312.231_7.625,59188,1.920306,9.794057e-07,2.056752e-05,True
6,SIO_La_Jolla_Shores-Mission_Bay,59188_312.231_7.625,59188,1.919578,9.877288e-07,2.07423e-05,True
13,Mission_Bay-La_Jolla_Cove,59188_312.231_7.625,59188,-1.9205,4.293869e-05,0.0009017125,True
14,Mission_Bay-Pacific_Beach,59188_312.231_7.625,59188,1.842428,8.056055e-05,0.001691771,True
19,La_Jolla Reefs-Pacific_Beach,59188_312.231_7.625,59188,-0.06383,0.07034818,1.0,False
5,Mission_Beach-Pacific_Beach,59188_312.231_7.625,59188,-0.077879,0.08778051,1.0,False
10,SIO_La_Jolla_Shores-Pacific_Beach,59188_312.231_7.625,59188,0.07715,0.09330773,1.0,False
20,La_Jolla_Cove-Pacific_Beach,59188_312.231_7.625,59188,-0.078072,0.1601359,1.0,False


Here, every possible pairwise group difference is explored. Since Mission Bay appears to differ most from the other sampling sites, we can look specifically at the comparison between Mission Bay and one other sampling site.
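
The number of pairwise contrasts grows quadratically with the number of groups: k groups give k(k-1)/2 pairs. The sketch below counts them with `itertools.combinations`. The seven site labels are copied from the contrast column of the table above, and we assume that they are the complete set of groups.

```python
import itertools

# Site labels as they appear in the contrast column of the table above
sites = ['Mission_Bay', 'Torrey_Pines', 'La_Jolla Reefs', 'Mission_Beach',
         'SIO_La_Jolla_Shores', 'La_Jolla_Cove', 'Pacific_Beach']

# Every unordered pair of sites: k * (k - 1) / 2 contrasts
contrasts = list(itertools.combinations(sites, 2))
n_contrasts = len(contrasts)
print(n_contrasts)

# Keep only the contrasts that involve one site of interest
mission_bay_contrasts = [c for c in contrasts if 'Mission_Bay' in c]
print(mission_bay_contrasts)
```

Seven sites give 21 contrasts, which agrees with the factor between `stats_p_bonferroni` and `stats_p` in the first row of the table.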

In the example below, we look at the differences between Mission Bay and La Jolla Reefs.

In [114]:
contrasts = [('Mission_Bay', 'La_Jolla Reefs')]
tukey = tukey_post_hoc_test(anova_attribute, contrasts, anova[anova['significant']]['metabolite'])
display(tukey)
plot_tukey(tukey)

Unnamed: 0,contrast,stats_metabolite,stats_ID,stats_diff,stats_p,stats_p_bonferroni,stats_significant
0,Mission_Bay-La_Jolla Reefs,59188_312.231_7.625,59188,-1.906257,3.858347e-11,5.992013e-08,True
24,Mission_Bay-La_Jolla Reefs,60583_506.326_7.811,60583,-1.670673,2.228846e-10,3.461398e-07,True
1,Mission_Bay-La_Jolla Reefs,33200_260.196_4.886,33200,-1.834872,2.730651e-10,4.240702e-07,True
2,Mission_Bay-La_Jolla Reefs,57080_314.247_7.36,57080,-1.781896,3.201267e-09,4.971568e-06,True
15,Mission_Bay-La_Jolla Reefs,36504_214.191_5.227,36504,-1.722631,4.731854e-09,7.348568e-06,True
...,...,...,...,...,...,...,...
405,Mission_Bay-La_Jolla Reefs,55430_449.201_7.163,55430,-0.004806,9.161726e-01,1.000000e+00,False
952,Mission_Bay-La_Jolla Reefs,76102_829.344_9.902,76102,-0.006151,9.201342e-01,1.000000e+00,False
1291,Mission_Bay-La_Jolla Reefs,2727_333.165_1.239,2727,0.002981,9.236637e-01,1.000000e+00,False
397,Mission_Bay-La_Jolla Reefs,75268_423.334_9.805,75268,0.001955,9.506227e-01,1.000000e+00,False


## <font color ='darkblue'> 4.4 T-tests </font>
<a id ="tukey"></a>
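A t-test is commonly used to compare exactly two groups. Here, the null hypothesis H0 states that there is no difference between the means of the two groups. Just as ANOVA uses the F-statistic, a t-test uses the T-statistic. For two independent groups that may differ in size and variance (Welch's t-test), it is defined as follows, where SD is the standard deviation and n the number of samples in a group:


$$\text{T-statistic} = \frac{\text{Mean}_{\text{group 1}} - \text{Mean}_{\text{group 2}}}{\sqrt{\text{SD}_{\text{group 1}}^2 / n_{\text{group 1}} + \text{SD}_{\text{group 2}}^2 / n_{\text{group 2}}}}$$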


In our dataset, a heavy rainfall in January 2018 could have influenced the metabolome. We will investigate the effect of the rainfall using t-tests. The two conditions are 'Jan' (January 2018) and 'not Jan' (all other samples).
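
When the two groups differ in size, as 'Jan' and 'not Jan' do, a Welch-type statistic that does not assume equal variances is a common choice. The sketch below computes it by hand on synthetic data and cross-checks the value against SciPy. It uses `scipy.stats.ttest_ind` rather than the pingouin call used in this notebook, and the group sizes and means are invented for the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Synthetic scaled intensities: a small 'Jan' group and a larger 'not Jan' group
jan = rng.normal(loc=1.0, scale=1.5, size=20)
not_jan = rng.normal(loc=0.0, scale=1.0, size=40)

# Welch's t-statistic: difference of means over the combined standard error
se = np.sqrt(jan.var(ddof=1) / jan.size + not_jan.var(ddof=1) / not_jan.size)
t_manual = (jan.mean() - not_jan.mean()) / se

# Cross-check against SciPy's Welch test (equal_var=False)
t_scipy, p_scipy = stats.ttest_ind(jan, not_jan, equal_var=False)
print(t_manual, t_scipy, p_scipy)
```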

In [115]:
ttest_attribute = 'ATTRIBUTE_Month'
target_group = 'Jan'

In [116]:
def gen_ttest_data(df, columns, ttest_attribute, target_group):
    ttest = []
    for col in columns:
        group1 = df[col][df[ttest_attribute]==target_group]
        group2 = df[col][df[ttest_attribute]!=target_group]
        result = pg.ttest(group1, group2)
        result['Metabolite'] = col   
    
        ttest.append(result)
    
    ttest = pd.concat(ttest).set_index('Metabolite')
        
    ttest.insert(8, 'p-bonf', pg.multicomp(ttest['p-val'], method='bonf')[1])
    # add significance
    ttest.insert(9, 'Significance', ttest['p-bonf'] < 0.05)

    return ttest

In [117]:
ttest = gen_ttest_data(data, scaled.columns, ttest_attribute, target_group)
ttest.head(5)

Unnamed: 0_level_0,T,dof,alternative,p-val,CI95%,cohen-d,BF10,power,p-bonf,Significance
Metabolite,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
92572_151.035_13.364,-3.642729,123.313055,two-sided,0.0003957032,"[-0.62, -0.18]",0.409476,67.786,0.730988,1.0,False
2513_151.035_1.13,1.78167,62.575782,two-sided,0.07965544,"[-0.05, 0.8]",0.382911,0.733,0.67318,1.0,False
42_151.035_0.551,3.033388,177.476052,two-sided,0.002781208,"[0.14, 0.65]",0.395982,11.17,0.702265,1.0,False
1870_151.035_0.887,1.51467,153.21789,two-sided,0.1319163,"[-0.05, 0.41]",0.178943,0.49,0.203064,1.0,False
1653_152.057_0.847,-8.369189,124.624626,two-sided,1.000038e-13,"[-1.07, -0.66]",0.942722,321300000000.0,1.0,9.092345e-10,True


In [118]:
# Plot T-test

fig = px.scatter(x=ttest['T'],
                y=ttest['p-bonf'].apply(lambda x: -np.log(x)),
                template='plotly_white', width=600, height=600, 
                 color=ttest['Significance'].apply(lambda x: str(x)),
                color_discrete_sequence = ['#696880', '#ef553b'])

fig.update_layout(font={"color":"grey", "size":12, "family":"Sans"},
                  title={"text":"T-test - FEATURE SIGNIFICANCE", 'x':0.5, "font_color":"#3E3D53"},
                  xaxis_title="T", yaxis_title="-Log(Bonferroni-corrected p)", showlegend=False)

fig.show()

## <font color ='darkblue'> 4.5 Kruskal-Wallis </font>
<a id="kr_wallis"></a>
The Kruskal-Wallis test is a non-parametric alternative to one-way ANOVA that does not assume normally distributed data. It compares the medians of multiple groups to see whether they differ statistically. The null hypothesis H0 states that there is no significant difference among the groups. Based on the p-value, we decide whether to reject H0. When H0 is rejected, we accept the alternative hypothesis H1, which states that at least one group differs statistically from the others.
<a href="https://statsandr.com/blog/kruskal-wallis-test-nonparametric-version-anova/#introduction">Read more about Kruskal-Wallis test</a>
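
As a minimal illustration on synthetic data, the sketch below applies the Kruskal-Wallis test to one right-skewed feature measured in three hypothetical groups. It uses `scipy.stats.kruskal` instead of the pingouin function used in this notebook. Because the test works on ranks, a monotonic transformation such as the logarithm leaves the H-statistic unchanged.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Synthetic, right-skewed intensities for one feature in three hypothetical groups
g1 = rng.lognormal(mean=0.0, sigma=1.0, size=25)
g2 = rng.lognormal(mean=0.0, sigma=1.0, size=25)
g3 = rng.lognormal(mean=2.0, sigma=1.0, size=25)

# Kruskal-Wallis compares the groups based on the ranks of the values
h_stat, p_kw = stats.kruskal(g1, g2, g3)

# Ranks are unchanged by a monotonic transform, and so is H
h_log, p_log = stats.kruskal(np.log(g1), np.log(g2), np.log(g3))
print(h_stat, p_kw)
print(h_log, p_log)
```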

Performing the Kruskal-Wallis test on each feature:

In [119]:
def gen_kruskal_wallis(df, columns, groups_col):
    for col in columns:
        result = pg.kruskal(data=df, dv=col, between=groups_col, detailed=True).set_index('Source')
        p = result.loc[groups_col, 'p-unc']
        h = result.loc[groups_col, 'H']
        yield col, p, h

dtypes = [('metabolite', 'U100'), ('KW_p', 'f'), ('KW_H', 'f')]
kruskal = pd.DataFrame(np.fromiter(gen_kruskal_wallis(data, scaled.columns, anova_attribute), dtype=dtypes))
kruskal

Unnamed: 0,metabolite,KW_p,KW_H
0,92572_151.035_13.364,0.330383,6.897998
1,2513_151.035_1.13,0.839777,2.747739
2,42_151.035_0.551,0.050233,12.578839
3,1870_151.035_0.887,0.023326,14.631376
4,1653_152.057_0.847,0.427213,5.964190
...,...,...,...
9087,91167_1442.399_12.598,0.034091,13.627062
9088,90242_1442.399_12.377,0.007109,17.670952
9089,88493_1442.399_11.706,0.437498,5.873600
9090,90600_1443.399_12.376,0.119100,10.134518


In [120]:
# add significance (based on the uncorrected Kruskal-Wallis p-value)
if 'KW_significant' not in kruskal.columns:
    kruskal.insert(3, 'KW_significant', kruskal['KW_p'] < 0.05)
# sort by p-value
kruskal.sort_values('KW_p', inplace=True)
# save Kruskal-Wallis table
kruskal.to_csv(os.path.join(result_dir, 'KRUSKAL-WALLIS_results.csv'))
kruskal

Unnamed: 0,metabolite,KW_p,KW_H,KW_significant
8785,91372_906.258_12.697,8.835076e-23,116.502892,True
8799,90743_908.258_12.555,3.730546e-22,113.520126,True
8793,91133_907.259_12.628,1.427622e-21,110.738586,True
8782,90597_906.258_12.403,4.608034e-20,103.525444,True
8792,90429_907.259_12.326,3.694662e-19,99.194534,True
...,...,...,...,...
4416,47359_365.219_6.284,9.996915e-01,0.253445,False
3329,72506_330.206_9.483,9.998732e-01,0.186863,False
8222,89741_695.498_12.187,9.999273e-01,0.154585,False
153,53070_181.122_6.859,9.999424e-01,0.142838,False


We can also compare the results of ANOVA and Kruskal-Wallis to see whether the two tests agree.
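
One simple way to summarize agreement is to cross-tabulate the significance calls of the two tests. The sketch below does this with pandas on six made-up features. The DataFrame `calls` and its column names are hypothetical and only loosely mirror the columns used in this notebook.

```python
import pandas as pd

# Hypothetical significance calls for six features from the two tests
calls = pd.DataFrame({
    'anova_significant': [True, True, False, False, True, False],
    'kw_significant':    [True, False, False, True, True, False],
}, index=[f'feature_{i}' for i in range(6)])

# Cross-tabulate the calls: the diagonal counts features where both tests agree
agreement = pd.crosstab(calls['anova_significant'], calls['kw_significant'])
n_agree = int((calls['anova_significant'] == calls['kw_significant']).sum())

print(agreement)
print(f'{n_agree} of {len(calls)} features agree')
```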



In [121]:
aov_kw = anova.set_index('metabolite').join(kruskal.set_index('metabolite'))

In [122]:
aov_kw['Significance'] = (aov_kw['significant'].astype(int)) + (aov_kw['KW_significant'].astype(int))

In [123]:
aov_kw

Unnamed: 0_level_0,p,p_bonferroni,significant,F,KW_p,KW_H,KW_significant,Significance
metabolite,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
59188_312.231_7.625,9.262929e-30,8.421854e-26,True,39.016209,1.297841e-12,67.551689,True,2
33200_260.196_4.886,1.620657e-28,1.473501e-24,True,36.769932,2.277154e-08,46.572220,True,2
57080_314.247_7.36,3.029371e-24,2.754304e-20,True,29.584505,2.390909e-11,61.351421,True,2
21870_246.18_3.969,7.050516e-23,6.410329e-19,True,27.449234,3.358942e-07,40.672680,True,2
80910_243.174_10.41,1.678681e-21,1.526256e-17,True,25.373669,5.794372e-16,83.820168,True,2
...,...,...,...,...,...,...,...,...
560_217.068_0.628,9.994868e-01,1.000000e+00,False,0.049824,8.488414e-01,2.671164,False,0
51908_729.432_6.783,9.997691e-01,1.000000e+00,False,0.037827,9.635441e-01,1.437397,False,0
19113_381.238_3.737,9.998595e-01,1.000000e+00,False,0.031914,8.887859e-01,2.312953,False,0
47359_365.219_6.284,9.998650e-01,1.000000e+00,False,0.031476,9.996915e-01,0.253445,False,0


In [124]:
# Plot KW and AOV

fig = px.scatter(x=aov_kw['F'],
                y=aov_kw['KW_H'],
                template='plotly_white', width=600, height=600, 
                 color=aov_kw['Significance'].apply(lambda x: str(x)),
                color_discrete_sequence = ['#032234', '#014F86', '#00B4D8'])

fig.update_layout(font={"color":"grey", "size":12, "family":"Sans"},
                  title={"text":"Correlation between ANOVA and Kruskal-Wallis statistics", 'x':0.5, "font_color":"#3E3D53"},
                  xaxis_title="ANOVA F-statistic", yaxis_title="Kruskal-Wallis H-statistic", showlegend=False)

fig.show()

Features with high values on both axes have a large test statistic in both ANOVA and Kruskal-Wallis, which means that both tests are likely to reject the null hypothesis for them. These features differ significantly among the groups.



## <font color ='darkblue'> 5.1 PCoA PermANOVA </font>
<a id="norm_test"></a>

Principal coordinates analysis (PCoA)

Principal coordinates analysis (PCoA) is a metric multidimensional scaling (MDS) method that attempts to represent sample dissimilarities in a low-dimensional space. It converts a distance matrix consisting of pair-wise distances (dissimilarities) across samples into a 2- or 3-D graph (Gower, 2005). Different distance metrics can be used to calculate dissimilarities among samples (e.g. Euclidean, Canberra, Minkowski). Performing a principal coordinates analysis using the Euclidean distance metric is the same as performing a principal components analysis (PCA). The selection of the most appropriate metric depends on the nature of your data and assumptions made by the metric.
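
The equivalence between PCoA on Euclidean distances and PCA can be checked numerically. The sketch below implements classical PCoA with NumPy (double-centering of the squared distance matrix followed by an eigendecomposition) and compares the coordinates with PCA scores from a singular value decomposition. The data are random and all variable names are chosen for this illustration only.

```python
import numpy as np
from scipy.spatial import distance

rng = np.random.default_rng(3)
X = rng.normal(size=(8, 5))            # 8 synthetic samples, 5 features
Xc = X - X.mean(axis=0)                # column-centred data

# PCoA (classical MDS): double-centre the squared distances, then eigendecompose
D = distance.squareform(distance.pdist(X, metric='euclidean'))
n = D.shape[0]
J = np.eye(n) - np.ones((n, n)) / n    # centring matrix
B = -0.5 * J @ (D ** 2) @ J
eigvals, eigvecs = np.linalg.eigh(B)
order = np.argsort(eigvals)[::-1]      # largest eigenvalues first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
pcoa_coords = eigvecs[:, :2] * np.sqrt(eigvals[:2])

# PCA scores from the singular value decomposition of the centred data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
pca_scores = U[:, :2] * S[:2]

# The two ordinations agree up to the sign of each axis
print(np.allclose(np.abs(pcoa_coords), np.abs(pca_scores)))
```

With any other distance metric the double-centred matrix no longer corresponds to the covariance structure of the data, and the two ordinations will generally differ.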

Within the metabolomics field the Euclidean, Bray-Curtis, Jaccard or Canberra distances are most commonly used. The Jaccard distance is an unweighted metric (presence/absence) whereas Euclidean, Bray-Curtis and Canberra distances take into account relative abundances (weighted). Some metrics may be better suited for very sparse data (with many zeroes) than others. For example, the Euclidean distance metric is not recommended to be used for highly sparse data.

This video tutorial by StatQuest summarizes nicely the basic principles of PCoA: https://www.youtube.com/watch?v=GEn-_dAyYME

In [None]:
#calculating Principal components
n = 10
pca = PCA(n_components=n)
pca_df = pd.DataFrame(data = pca.fit_transform(scaled), columns = [f'PC{x}' for x in range(1, n+1)])
pca_df.index = md_samples.index
pca_df

In [None]:
# To get a scree plot showing the variance of each PC in percentage:
percent_variance = np.round(pca.explained_variance_ratio_* 100, decimals =2)

fig_bar = px.bar(x=pca_df.columns, y=percent_variance, template="plotly_white",  width=500, height=400)
fig_bar.update_traces(marker_color="#696880", width=0.5)
fig_bar.update_layout(font={"color":"grey", "size":12, "family":"Sans"},
                    title={"text":"PCA - VARIANCE", 'x':0.5, "font_color":"#3E3D53"},
                    xaxis_title="principal component", yaxis_title="variance (%)")
fig_bar.show()

TODO make the attribute colors work

In [None]:
@interact(attribute=sorted(md_samples.columns))
def pca_scatter_plot(attribute):
    title = 'PRINCIPAL COMPONENT ANALYSIS'

    df = pd.merge(pca_df[['PC1', 'PC2']], md_samples[attribute].apply(str), left_index=True, right_index=True)

    fig = px.scatter(df, x='PC1', y='PC2', template='plotly_white', width=600, height=400, color=attribute)

    fig.update_layout(font={"color":"grey", "size":12, "family":"Sans"},
                      title={"text":title, 'x':0.2, "font_color":"#3E3D53"},
                      xaxis_title=f'PC1 {round(pca.explained_variance_ratio_[0]*100, 1)}%',
                      yaxis_title=f'PC2 {round(pca.explained_variance_ratio_[1]*100, 1)}%')
    display(fig)

TODO fix the interact thing

In [None]:
matrices = ['canberra', 'chebyshev', 'correlation', 'cosine', 'euclidean', 'hamming', 'jaccard', 'matching', 'minkowski', 'seuclidean']
@interact(attribute=sorted(md_samples.columns), distance_matrix=matrices)
def pcoa(attribute, distance_matrix):
    # Create the distance matrix from the original data
    distance_matrix = skbio.stats.distance.DistanceMatrix(distance.squareform(distance.pdist(scaled.values, distance_matrix)))
    # perform PERMANOVA test
    permanova = skbio.stats.distance.permanova(distance_matrix, md_samples[attribute])
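    # R2 = SS_between / SS_total, recovered from the pseudo-F statistic with
    # (groups - 1) and (samples - groups) degrees of freedom
    permanova['R2'] = 1 - 1 / (1 + permanova['test statistic'] * (permanova['number of groups'] - 1) / (permanova['sample size'] - permanova['number of groups']))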
    display(permanova)
    # perform PCoA
    pcoa = skbio.stats.ordination.pcoa(distance_matrix)
    df = pcoa.samples[['PC1', 'PC2']]
    df = df.set_index(md_samples.index)
    df = pd.merge(df[['PC1', 'PC2']], md_samples[attribute].apply(str), left_index=True, right_index=True)
    
    title = 'PRINCIPAL COORDINATE ANALYSIS'
    fig = px.scatter(df, x='PC1', y='PC2', template='plotly_white', width=600, height=400, color=attribute)

    fig.update_layout(font={"color":"grey", "size":12, "family":"Sans"},
                      title={"text":title, 'x':0.18, "font_color":"#3E3D53"},
                      xaxis_title=f'PC1 {round(pcoa.proportion_explained[0]*100, 1)}%',
                      yaxis_title=f'PC2 {round(pcoa.proportion_explained[1]*100, 1)}%')
    display(fig)
    
    # To get a scree plot showing the variance of each PC in percentage:
    percent_variance = np.round(pcoa.proportion_explained* 100, decimals =2)

    fig = px.bar(x=[f'PC{x}' for x in range(1, len(pcoa.proportion_explained)+1)], y=percent_variance, template="plotly_white",  width=500, height=400)
    fig.update_traces(marker_color="#696880", width=0.5)
    fig.update_layout(font={"color":"grey", "size":12, "family":"Sans"},
                      title={"text":"PCoA - VARIANCE", 'x':0.5, "font_color":"#3E3D53"},
                      xaxis_title="principal component", yaxis_title="variance (%)")
    display(fig)
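
The relation between the PERMANOVA test statistic and R2 can be checked on a small example. The sketch below computes the sums of squares directly from a squared Euclidean distance matrix with NumPy, assuming that the test statistic is the usual pseudo-F with (k - 1) and (N - k) degrees of freedom for k groups and N samples. The data and group labels are synthetic, and no permutation test is carried out.

```python
import numpy as np
from scipy.spatial import distance

rng = np.random.default_rng(4)

# Synthetic data: 3 hypothetical groups of 6 samples, 4 features each
groups = np.repeat([0, 1, 2], 6)
X = rng.normal(size=(18, 4)) + groups[:, None]   # shift each group's mean
D2 = distance.squareform(distance.pdist(X)) ** 2

n, k = len(groups), len(np.unique(groups))

# Total and within-group sums of squares from pairwise squared distances
# (each pair appears twice in the full matrix, hence the division by 2)
ss_total = D2.sum() / (2 * n)
ss_within = sum(
    D2[np.ix_(groups == g, groups == g)].sum() / (2 * (groups == g).sum())
    for g in np.unique(groups)
)
ss_between = ss_total - ss_within

pseudo_f = (ss_between / (k - 1)) / (ss_within / (n - k))
r2_direct = ss_between / ss_total
r2_from_f = 1 - 1 / (1 + pseudo_f * (k - 1) / (n - k))
print(pseudo_f, r2_direct, r2_from_f)
```

R2 is the fraction of the total variation in the distance matrix that is explained by the grouping, so it always lies between 0 and 1.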


# Hierarchical Clustering Algorithm:

We are now ready to perform a cluster analysis. The concept behind hierarchical clustering is to repeatedly combine the two nearest clusters into a larger cluster.

The algorithm first calculates the distance between every pair of observation points and stores the distances in a matrix. Then:
1. It puts every point in its own cluster;
2. It merges the closest pair of clusters according to their distance;
3. It recomputes the distances between the new cluster and the old ones and stores them in a new distance matrix;
4. It repeats steps 2 and 3 until all the clusters are merged into one single cluster. <br>
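
The steps above can be followed on a tiny example. The sketch below clusters five made-up 2-D points with `scipy.cluster.hierarchy.linkage`, using complete linkage and Euclidean distances. The points are chosen so that the merge order is easy to verify by eye.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Five synthetic 2-D points: two tight pairs and one outlier
points = np.array([[0.0, 0.0], [0.0, 1.0],
                   [5.0, 5.0], [5.0, 6.5],
                   [20.0, 20.0]])

# Each row of Z records one merge: [cluster_i, cluster_j, distance, size]
Z = linkage(points, method='complete', metric='euclidean')
print(Z)

# Cut the tree into three flat clusters
labels = fcluster(Z, t=3, criterion='maxclust')
print(labels)
```

Each row of `Z` corresponds to one pass through steps 2 and 3. The last row merges everything into a single cluster containing all five points.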

In [None]:
fig = ff.create_dendrogram(scaled, labels=list(scaled.index))
fig.update_layout(width=700, height=500, template='plotly_white')

# save image as pdf
fig.write_image(os.path.join(result_dir, "Cluster_Dendrogram.pdf"), scale=3)
fig.show()

In [None]:
# SORT DATA TO CREATE HEATMAP

# Compute linkage matrices from distances for hierarchical clustering
# (rows of `scaled` are samples, rows of `scaled.T` are features)
linkage_data_samples = linkage(scaled, method='complete', metric='euclidean')
linkage_data_ft = linkage(scaled.T, method='complete', metric='euclidean')

# Create a dictionary of data structures computed to render the dendrogram. 
# We will use dict['leaves']
cluster_samples = dendrogram(linkage_data_samples, no_plot=True)
cluster_ft = dendrogram(linkage_data_ft, no_plot=True)

# Create dataframe with sorted samples
ord_samp = scaled.copy()
ord_samp.reset_index(inplace=True)
ord_samp = ord_samp.reindex(cluster_samples['leaves'])
ord_samp.rename(columns={'index': 'Filename'}, inplace=True)
ord_samp.set_index('Filename', inplace=True)

# Create dataframe with sorted features
ord_ft = ord_samp.T.reset_index()
ord_ft = ord_ft.reindex(cluster_ft['leaves'])
ord_ft.rename(columns={'index': 'Feature'}, inplace=True)
ord_ft.set_index('Feature', inplace=True)

In [None]:
#Heatmap
fig = px.imshow(ord_ft,y=list(ord_ft.index), x=list(ord_ft.columns), text_auto=True, aspect="auto",
               color_continuous_scale='PuOr_r', range_color=[-3,3])

fig.update_layout(
    autosize=False,
    width=700,
    height=800)

fig.update_yaxes(visible=False)
fig.update_xaxes(tickangle = 35)

# save image as pdf
fig.write_image(os.path.join(result_dir, "Heatmap.pdf"), scale=3)

fig.show()