# Data Cleanup

Updated on: 2022-10-17 16:54:01 CEST

Authors: Abzer Kelminal (abzer.shah@uni-tuebingen.de), Axel Walter, Carolina Gonzalez <br>
Input file format: .csv files or .txt files <br>
Outputs: .csv files <br>
Dependencies: pandas, numpy, scipy, scikit-learn, scikit-bio, plotly, matplotlib, ipywidgets <br>

This notebook cleans the feature table produced by a metabolomics experiment, which lists every feature together with its intensity in each sample. The cleanup involves three steps: 1) blank removal, 2) imputation, and 3) normalisation. Each step is discussed in detail later.
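
As a rough orientation, the three steps could look like the following pandas sketch. The toy table, the 0.3 blank-to-sample cutoff, the minimum-value imputation, and the normalisation by total intensity are all illustrative assumptions, not necessarily the choices made later in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy feature table: rows = samples, columns = features
ft = pd.DataFrame(
    {"f1": [100.0, 120.0, 90.0, 80.0],
     "f2": [0.0, 50.0, 60.0, 1.0],
     "f3": [5.0, 0.0, 30.0, 0.0]},
    index=["blank_1", "sample_1", "sample_2", "blank_2"],
)
is_blank = ft.index.str.startswith("blank")

# 1) Blank removal: keep features whose mean blank intensity is below
#    an assumed cutoff (here 30 %) of the mean sample intensity
cutoff = 0.3
ratio = ft[is_blank].mean() / ft[~is_blank].mean()
samples = ft.loc[~is_blank, ratio < cutoff]

# 2) Imputation: replace zeros with the smallest non-zero value in the table
min_nonzero = samples[samples > 0].min().min()
imputed = samples.replace(0, min_nonzero)

# 3) Normalisation: divide each sample by its total intensity (row sum)
normalised = imputed.div(imputed.sum(axis=1), axis=0)
print(normalised)
```

In this toy example `f1` is dropped because it is almost as intense in the blanks as in the samples, and every remaining row sums to 1 after normalisation.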

This notebook runs in both Jupyter Notebook and Google Colab. Please read the comments before running the code. If you run into errors or think the comments could be clearer, let us know; we appreciate your suggestions.

# About the Test data:

The files used in this tutorial are part of an interlab comparison study, in which laboratories around the world analysed the same environmental samples on their respective LC-MS/MS instruments. To simulate an algal bloom, standardized algae extracts (A) in marine dissolved organic matter (M) were prepared at different concentrations (450 (A45M), 150 (A15M), and 50 (A5M) ppm A). Samples were then shipped to the participating laboratories for untargeted LC-MS/MS metabolomics analysis. In this tutorial we work with one of the datasets, which was acquired on a UHPLC system coupled to a Thermo Scientific Q Exactive HF Orbitrap LC-MS/MS mass spectrometer. MS/MS data were acquired in data-dependent acquisition (DDA) mode with fragmentation of the five most abundant ions in the spectrum per precursor scan. Data files were subsequently preprocessed using MZmine3 and the feature-based molecular networking workflow in GNPS.

# Package installation:

In [1]:
!pip install scipy==1.8.1
!pip install plotly scikit-bio



In [2]:
# installing necessary packages (omitted for now)

# importing necessary modules
import pandas as pd
import numpy as np
import os
import plotly.express as px
import matplotlib.pyplot as plt
import skbio

from scipy.spatial.distance import pdist, squareform
from scipy.spatial import distance
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
from sklearn.preprocessing import StandardScaler
from sklearn import preprocessing

from ipywidgets import interact

# PCA:

In [None]:
# open a feature dataframe with imputed and scaled values
imputed_s = pd.read_csv('imputed_s.tsv', sep='\t').set_index("Unnamed: 0")
# open the matching metadata file
md_samples = pd.read_csv('md_samples.tsv', sep='\t').set_index("Unnamed: 0")

In [7]:
merged_data = pd.merge(md_samples, imputed_s, left_index=True, right_index=True)
merged_data.head()

Unnamed: 0,ATTRIBUTE_Sample,ATTRIBUTE_Sample_Type,ATTRIBUTE_Time-Point,1458_150.128_3.194,1418_150.128_3.173,30_151.035_0.653,7137_151.075_6.002,6905_151.075_5.937,5684_151.075_5.554,4120_151.075_5.025,...,90_770.851_0.705,16880_774.609_12.932,16869_776.625_12.233,14833_782.541_9.242,16914_808.525_14.316,16922_810.54_14.323,16913_824.52_14.314,86_838.838_0.705,16935_852.551_14.321,80_906.826_0.705
DOM_Interlab-LCMS_Lab1_A15M_Pos_MS2_rep1.mzML,A15M,Sample,15.0,1.62898,-1.053124,-1.020978,1.348525,0.441509,1.651623,0.187444,...,-0.48505,-0.050817,-0.309721,-0.527895,-0.56542,0.433955,-0.570665,-0.174226,-0.549442,0.320186
DOM_Interlab-LCMS_Lab1_A15M_Pos_MS2_rep2.mzML,A15M,Sample,15.0,-0.543722,0.324533,-1.153552,1.356302,0.54507,1.140705,1.376215,...,-0.578428,-0.493004,3.315775,-0.570459,-0.56542,-0.623779,-0.570665,-1.231997,-0.028349,-1.252695
DOM_Interlab-LCMS_Lab1_A15M_Pos_MS2_rep3.mzML,A15M,Sample,15.0,-1.084394,0.720872,0.514123,1.525673,0.642921,1.010851,0.557715,...,-0.701652,-0.280701,-0.227047,-0.528371,-0.56542,0.372539,-0.570665,-0.614848,-0.052354,-1.206147
DOM_Interlab-LCMS_Lab1_A45M_Pos_MS2_rep1.mzML,A45M,Sample,45.0,-0.279449,0.086509,-1.573132,-1.084736,-1.303003,-1.453874,-1.704643,...,1.379181,2.195786,-0.309721,1.974547,-0.56542,2.214279,-0.570665,1.146465,-0.549442,1.326451
DOM_Interlab-LCMS_Lab1_A45M_Pos_MS2_rep2.mzML,A45M,Sample,45.0,-0.350608,0.512217,0.566929,-1.08555,-1.373272,-1.520887,-1.818895,...,0.50067,1.767888,-0.309721,1.663683,-0.56542,-0.609766,-0.570665,0.334642,2.668665,0.081857


## imputed_s was already scaled - do we need to do it again?
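
One way to answer this is to check whether every feature column already has a mean of approximately 0 and a standard deviation of approximately 1. The sketch below uses a small made-up table in place of `imputed_s`, and the helper name `is_standardised` is hypothetical. Note that `StandardScaler` uses the population standard deviation (`ddof=0`), whereas pandas `.std()` defaults to `ddof=1`.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for imputed_s (rows = samples, columns = features)
rng = np.random.default_rng(0)
raw = pd.DataFrame(rng.lognormal(size=(12, 4)), columns=list("abcd"))

def is_standardised(df, tol=1e-6):
    """True if every column has mean ~0 and population std (ddof=0) ~1."""
    means_ok = np.allclose(df.mean(), 0.0, atol=tol)
    stds_ok = np.allclose(df.std(ddof=0), 1.0, atol=tol)
    return means_ok and stds_ok

# Column-wise z-scoring: the same arithmetic as StandardScaler's defaults
scaled = (raw - raw.mean()) / raw.std(ddof=0)

print(is_standardised(raw))     # unscaled data
print(is_standardised(scaled))  # z-scored data
```

If the check passes for `imputed_s`, scaling again would leave the values essentially unchanged, so the commented-out `StandardScaler` line can stay disabled. If the table was scaled with `ddof=1` instead, the standard deviations will be slightly below 1 and the tolerance would need adjusting.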

In [8]:
#transformed data
# trans_data = StandardScaler().fit_transform(imputed_s)

#calculating Principal components
n = 5
pca = PCA(n_components=n)
pca_df = pd.DataFrame(data=pca.fit_transform(imputed_s),
                      columns=[f'PC{x}' for x in range(1, n+1)],
                      index=imputed_s.index)  # label the scores with the rows they were computed from
pca_df

Unnamed: 0,PC1,PC2,PC3,PC4,PC5
DOM_Interlab-LCMS_Lab1_A15M_Pos_MS2_rep1.mzML,-34.220333,-25.358305,14.866447,-1.984973,-20.153902
DOM_Interlab-LCMS_Lab1_A15M_Pos_MS2_rep2.mzML,-37.048456,-29.357718,14.396593,0.306494,14.121997
DOM_Interlab-LCMS_Lab1_A15M_Pos_MS2_rep3.mzML,-37.835252,-30.059551,14.299632,13.199799,7.13075
DOM_Interlab-LCMS_Lab1_A45M_Pos_MS2_rep1.mzML,66.660832,-9.184186,2.361366,-7.754506,-8.131249
DOM_Interlab-LCMS_Lab1_A45M_Pos_MS2_rep2.mzML,66.551429,-11.622051,2.71166,2.122439,6.69103
DOM_Interlab-LCMS_Lab1_A45M_Pos_MS2_rep3.mzML,66.38688,-9.201951,-0.113552,0.666145,2.589
DOM_Interlab-LCMS_Lab1_A5M_Pos_MS2_rep1.mzML,-5.704333,56.754609,41.816351,-3.125428,3.808121
DOM_Interlab-LCMS_Lab1_A5M_Pos_MS2_rep2.mzML,-8.498028,17.019133,-17.058609,20.457882,-10.517224
DOM_Interlab-LCMS_Lab1_A5M_Pos_MS2_rep3.mzML,-7.309008,20.122238,-22.113799,23.797493,6.095314
DOM_Interlab-LCMS_Lab1_M_Pos_MS2_rep1.mzML,-22.088166,6.479618,-12.931901,-10.868099,-15.126714


## No need to combine, since pca_df and md_samples are already in the same order?
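
Relying on row order is fragile: labelling PCA scores with the metadata index would silently mislabel samples if the two tables were ever sorted differently. A minimal sketch of a safer pattern, using made-up stand-ins for the feature table and the metadata, checks the indices and reorders the metadata explicitly:

```python
import pandas as pd

# Made-up stand-ins: same samples, deliberately different row order
features = pd.DataFrame({"f1": [1.0, 2.0, 3.0]}, index=["s1", "s2", "s3"])
metadata = pd.DataFrame({"group": ["B", "A", "C"]}, index=["s2", "s1", "s3"])

# Same set of sample names?
same_samples = set(features.index) == set(metadata.index)
# Same order as well?
same_order = features.index.equals(metadata.index)

# Reorder the metadata to follow the feature table
metadata_aligned = metadata.loc[features.index]
print(same_samples, same_order)
print(metadata_aligned)
```

Merging on the index, as `pca_scatter_plot` does with `pd.merge(..., left_index=True, right_index=True)`, matches rows by sample name and is therefore independent of row order.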

In [9]:
def pca_scatter_plot(attribute):
    title = 'PRINCIPAL COMPONENT ANALYSIS'
    
    df = pd.merge(pca_df[['PC1', 'PC2']], md_samples[attribute].apply(str), left_index=True, right_index=True)
    # display(df)
    fig = px.scatter(df, x='PC1', y='PC2', template='plotly_white', width=600, height=400, color=attribute)

    fig.update_layout(font={"color":"grey", "size":12, "family":"Sans"},
                      title={"text":title, 'x':0.2, "font_color":"#3E3D53"},
                      xaxis_title=f'PC1 {round(pca.explained_variance_ratio_[0]*100, 1)}%',
                      yaxis_title=f'PC2 {round(pca.explained_variance_ratio_[1]*100, 1)}%')
    fig.show()
    
    # To get a scree plot showing the variance of each PC in percentage:
    percent_variance = np.round(pca.explained_variance_ratio_* 100, decimals =2)

    fig = px.bar(x=pca_df.columns, y=percent_variance, template="plotly_white",  width=500, height=400)
    fig.update_traces(marker_color="#696880", width=0.5)
    fig.update_layout(font={"color":"grey", "size":12, "family":"Sans"},
                      title={"text":"PCA - VARIANCE", 'x':0.5, "font_color":"#3E3D53"},
                      xaxis_title="principal component", yaxis_title="variance (%)")
    fig.show()

interact(pca_scatter_plot, attribute=sorted(md_samples.columns))


## Principal Coordinate Analysis (PCoA)

In [12]:
def pcoa(attribute, distance_matrix):
    # Build the distance matrix from the feature table using the chosen metric
    # (the `distance_matrix` argument holds the metric name, e.g. 'canberra')
    dm = skbio.stats.distance.DistanceMatrix(distance.squareform(distance.pdist(imputed_s.values, distance_matrix)))
    # perform PERMANOVA test
    permanova = skbio.stats.distance.permanova(dm, md_samples['ATTRIBUTE_Sample'])
    # R2 = SS_between / SS_total, derived from pseudo-F with (groups - 1) and (samples - groups) degrees of freedom
    permanova['R2'] = 1 - 1 / (1 + permanova['test statistic'] * (permanova['number of groups'] - 1) / (permanova['sample size'] - permanova['number of groups']))
    display(permanova)
    # perform PCoA
    pcoa = skbio.stats.ordination.pcoa(dm)
    df = pcoa.samples[['PC1', 'PC2']]
    df = df.set_index(md_samples.index)
    df = pd.merge(df[['PC1', 'PC2']], md_samples[attribute].apply(str), left_index=True, right_index=True)
    
    title = 'PRINCIPAL COORDINATE ANALYSIS'
    fig = px.scatter(df, x='PC1', y='PC2', template='plotly_white', width=600, height=400, color=attribute)

    fig.update_layout(font={"color":"grey", "size":12, "family":"Sans"},
                      title={"text":title, 'x':0.18, "font_color":"#3E3D53"},
                      xaxis_title=f'PC1 {round(pcoa.proportion_explained[0]*100, 1)}%',
                      yaxis_title=f'PC2 {round(pcoa.proportion_explained[1]*100, 1)}%')
    fig.show()
    
    # To get a scree plot showing the variance of each PC in percentage:
    percent_variance = np.round(pcoa.proportion_explained* 100, decimals =2)

    fig = px.bar(x=[f'PC{x}' for x in range(1, len(pcoa.proportion_explained)+1)], y=percent_variance, template="plotly_white",  width=500, height=400)
    fig.update_traces(marker_color="#696880", width=0.5)
    fig.update_layout(font={"color":"grey", "size":12, "family":"Sans"},
                      title={"text":"PCoA - VARIANCE", 'x':0.5, "font_color":"#3E3D53"},
                      xaxis_title="principal coordinate", yaxis_title="variance (%)")
    fig.show()

matrices = ['canberra', 'chebyshev', 'correlation', 'cosine', 'euclidean', 'hamming', 'jaccard', 'matching', 'minkowski', 'seuclidean']
interact(pcoa, attribute=sorted(md_samples.columns), distance_matrix=matrices)
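
For orientation, classical PCoA can be sketched in a few lines of NumPy: square the distances, double-centre them, and take the eigenvectors scaled by the square roots of their eigenvalues. This is an illustrative sketch on random data, not the scikit-bio implementation, and the helper name `classical_pcoa` is hypothetical.

```python
import numpy as np

def classical_pcoa(dist):
    """Classical PCoA of a square, symmetric distance matrix (sketch)."""
    n = dist.shape[0]
    centring = np.eye(n) - np.ones((n, n)) / n
    # Gower's double-centred matrix
    b = -0.5 * centring @ (dist ** 2) @ centring
    eigvals, eigvecs = np.linalg.eigh(b)
    order = np.argsort(eigvals)[::-1]          # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    positive = eigvals > 1e-10                 # drop zero/negative axes
    coords = eigvecs[:, positive] * np.sqrt(eigvals[positive])
    explained = eigvals[positive] / eigvals[positive].sum()
    return coords, explained

# Random stand-in data: 6 samples, 3 features
rng = np.random.default_rng(1)
x = rng.normal(size=(6, 3))
diff = x[:, None, :] - x[None, :, :]
euclid = np.sqrt((diff ** 2).sum(axis=-1))

coords, explained = classical_pcoa(euclid)
print(coords.shape, explained.round(3))
```

With Euclidean distances the resulting coordinates reproduce the input distances and match PCA scores up to the sign of each axis, which is why the PCA plot and the Euclidean PCoA plot look alike. Other metrics, such as Canberra, can yield a different ordination.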


In [8]:
dm = skbio.stats.distance.DistanceMatrix(distance.squareform(distance.pdist(imputed_s.values, 'canberra')))

In [59]:
p = skbio.stats.distance.permanova(dm, md_samples['ATTRIBUTE_Sample'])
# R2 = SS_between / SS_total, derived from pseudo-F with (groups - 1) and (samples - groups) degrees of freedom
r2 = 1 - 1 / (1 + p['test statistic'] * (p['number of groups'] - 1) / (p['sample size'] - p['number of groups']))
p["R2"] = r2
print(p)

method name               PERMANOVA
test statistic name        pseudo-F
sample size                      12
number of groups                  4
test statistic               5.7346
p-value                       0.001
number of permutations          999
R2                         0.682588
Name: PERMANOVA results, dtype: object
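
In a one-way PERMANOVA, R2 is the share of the total sum of squared distances explained by the grouping, and pseudo-F is the ratio of the between-group to the within-group mean squares. The NumPy sketch below computes both directly from a made-up distance matrix and checks the conversion `R2 = 1 - 1 / (1 + F * (a - 1) / (N - a))`, where `a` is the number of groups and `N` the number of samples. The helper name `permanova_r2` is hypothetical, and the code is illustrative rather than the scikit-bio implementation.

```python
import numpy as np

def permanova_r2(dist, groups):
    """Return (pseudo_F, R2) from sums of squared distances (sketch)."""
    groups = np.asarray(groups)
    n = dist.shape[0]
    labels = np.unique(groups)
    a = len(labels)
    sq = dist ** 2
    # Total sum of squares: squared distances over all pairs, divided by N
    ss_total = sq[np.triu_indices(n, k=1)].sum() / n
    # Within-group sum of squares, one term per group
    ss_within = 0.0
    for label in labels:
        idx = np.flatnonzero(groups == label)
        sub = sq[np.ix_(idx, idx)]
        ss_within += sub[np.triu_indices(len(idx), k=1)].sum() / len(idx)
    ss_between = ss_total - ss_within
    pseudo_f = (ss_between / (a - 1)) / (ss_within / (n - a))
    return pseudo_f, ss_between / ss_total

# Made-up data: 3 groups of 4 samples with shifted means
rng = np.random.default_rng(2)
groups = np.repeat(["g1", "g2", "g3"], 4)
x = rng.normal(size=(12, 5)) + np.repeat([0.0, 1.0, 2.0], 4)[:, None]
diff = x[:, None, :] - x[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))

f_stat, r2 = permanova_r2(dist, groups)
a, n = 3, 12
r2_from_f = 1 - 1 / (1 + f_stat * (a - 1) / (n - a))
print(round(r2, 4), round(r2_from_f, 4))
```

Both printed values agree, which confirms that the degrees of freedom in the conversion are `a - 1` and `N - a`.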
