# **Metabolomics Data Visualisation Workflow for ANN-SS (MD-CG) [W/ SMOTE]**

This Google Colab notebook describes the metabolomics data analysis and visualisation workflow for a two-layer artificial neural network (ANN-SS) applied to a binary classification outcome. Layer 1 consists of multiple neurons (n = 2 to 6) with a sigmoidal activation function, and layer 2 (the output layer) consists of a single neuron with a sigmoidal activation function.

This computational workflow is described using a previously published LC-MS dataset by Sinclair et al. (2021). The study compared the metabolomic profiles across Parkinson's disease patients, characterised as medicated (MD; n=138) and drug-naive (DN; n=80), versus control (CG; n=56) using 8765 named metabolites. For the purpose of this computational workflow, only the MD vs CG samples were compared in a binary discriminant analysis. The deconvolved and annotated data from this study are deposited on MetaboLights (Study ID: MTBLS2266).

This computational workflow requires a dataset to be in, or converted to, a previously described standardised Excel file format (Mendez et al. 2019). This format uses the Tidy Data Framework (Wickham, 2014), where each row represents an observation (e.g. sample) and each column represents a variable (e.g. age or metabolite). Each Excel file (one per study) contains two sheets: a data sheet and a peak sheet. The data sheet contains the metabolite concentrations together with the metadata associated with each observation (requiring the inclusion of the columns: Idx, SampleID, and Class). The peak sheet contains the additional metadata that pertains to the metabolites in the data sheet (requiring the inclusion of the columns: Idx, Name, and Label). The standardisation of this format allows for the efficient re-use of this computational workflow.
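
The layout described above can be checked before running the workflow. The following is a minimal sketch rather than part of the CIMCB package: the function name `check_tidy_format` and the toy tables are hypothetical, and the check that every peak `Name` appears as a data sheet column is an assumption based on how the peak names are used to select columns in Section 3.2.

```python
import pandas as pd

# Required columns, as stated in the format description above
REQUIRED_DATA_COLS = {"Idx", "SampleID", "Class"}
REQUIRED_PEAK_COLS = {"Idx", "Name", "Label"}


def check_tidy_format(data_sheet, peak_sheet):
    """Hypothetical helper: return a list of problems (empty if the layout looks valid)."""
    problems = []
    missing_data = REQUIRED_DATA_COLS - set(data_sheet.columns)
    missing_peak = REQUIRED_PEAK_COLS - set(peak_sheet.columns)
    if missing_data:
        problems.append(f"Data sheet is missing columns: {sorted(missing_data)}")
    if missing_peak:
        problems.append(f"Peak sheet is missing columns: {sorted(missing_peak)}")
    # Assumption: each peak Name is also a column header in the data sheet
    if "Name" in peak_sheet.columns:
        absent = [n for n in peak_sheet["Name"] if n not in data_sheet.columns]
        if absent:
            problems.append(f"Peaks without a data column: {absent}")
    return problems


# Toy tables with made-up values
data_sheet = pd.DataFrame({"Idx": [1, 2],
                           "SampleID": ["S1", "S2"],
                           "Class": ["MD", "CG"],
                           "M1": [0.5, 1.2],
                           "M2": [3.1, 2.8]})
peak_sheet = pd.DataFrame({"Idx": [1, 2],
                           "Name": ["M1", "M2"],
                           "Label": ["Metabolite A", "Metabolite B"]})

problems_ok = check_tidy_format(data_sheet, peak_sheet)
problems_bad = check_tidy_format(data_sheet.drop(columns="Class"), peak_sheet)
print(problems_ok)
print(problems_bad)
```

An empty list indicates that both sheets carry the required columns; any message in the list points to the sheet that needs correcting.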

The steps included in this data analysis and visualisation workflow are:
1. Import Packages 
2. Load Data and Peak Sheet
3. Data Pre-processing
4. Split Data into Train and Test Set
5. Hyperparameter Optimisation
6. Build Model and Evaluate
7. Permutation Testing
8. Bootstrap Resampling of the Model

# Section 1 - Import Packages

Certain packages must be installed in the virtual environment before use.

For this computational workflow, the CIMCB package was installed using pip. CIMCB requires:

*   Python (>=3.5)
*   Bokeh (>=1.0.0)
*   Keras
*   NumPy (>=1.12)
*   SciPy
*   scikit-learn
*   Statsmodels
*   TensorFlow
*   tqdm



In [1]:
pip install cimcb

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting cimcb
  Downloading cimcb-1.1.0-py3-none-any.whl (150 kB)
     |████████████████████████████████| 150 kB 7.6 MB/s 
Installing collected packages: cimcb
Successfully installed cimcb-1.1.0


To use tools that extend beyond the basic functionalities of Python programming, packages must first be imported to enable their use in each Google Colab environment. Each package is a container of modules.

For this computational workflow, the following packages were used:


*   numpy: A fundamental package for scientific computing with Python, primarily used for the manipulation of arrays
*   pandas: A fundamental package for data analysis and manipulation
*   cimcb: A package for the statistical analysis of untargeted and targeted metabolomics data
*   matplotlib.pyplot: A package mainly used for interactive plots and simple cases of programmatic plot generation
*   seaborn: A package that provides a high-level interface for drawing attractive and informative statistical graphics
*   sklearn: A fundamental package containing tools for machine learning
  *   train_test_split: A method to split arrays into training and test subsets






In [31]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split

import cimcb as cb

print('All packages successfully loaded')

All packages successfully loaded


To ensure the reproducibility of the workflow, random seeds are set.

*   seed_split: Seed the generator using an integer value e.g. 42 (default = None; no seed set)

This seed keeps the division of the data reproducible when the train and test sets are generated.

*   seed_init: Seed the generator using an integer value e.g. 42 (default = None; no seed set)

This seed keeps the initial weights reproducible; they are drawn from a truncated normal distribution when the neural network is first compiled.

In [32]:
seed_split = 100
seed_init = 4
# seed_split = None
# seed_init = None
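
The effect of a fixed seed can be illustrated with a small sketch. The toy arrays below are made up for illustration; only `train_test_split` and its `random_state` argument come from the workflow itself.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples, 2 features (made-up values)
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 1] * 5)

# Splitting twice with the same seed gives the same partition
a_train, a_test, _, _ = train_test_split(X_demo, y_demo, test_size=1/3, random_state=100)
b_train, b_test, _, _ = train_test_split(X_demo, y_demo, test_size=1/3, random_state=100)

same_split = np.array_equal(a_train, b_train) and np.array_equal(a_test, b_test)
print(same_split)
```

With `random_state=None` the partition would generally differ between runs, which is why the seeds are fixed here.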

# Section 2 - Load Data & Peak Sheet

To upload the dataset to the Google Colab notebook environment, an upload widget was used.

In [None]:
from google.colab import files

uploaded = files.upload()

The helper function load_dataXL loads the two data sheets from the Excel file 'Data and Peak sheet_Test.xlsx'. Provided the dataset adheres to the standardised TidyData framework format, load_dataXL() outputs the data sheets from the uploaded Excel file as individual Pandas DataFrames.

In [None]:
# The path to the input file (Excel spreadsheet)
filename = 'Data and Peak sheet_Test.xlsx'

# Load Peak and Data tables into two variables
dataTable, peakTable = cb.utils.load_dataXL(filename, DataSheet='Data', PeakSheet='Peaks')

# Section 3 - Data Pre-processing

**Section 3.1 - Data Cleaning**

According to Broadhurst (2019), it is best practice to assess the quality of the data and refine the dataset by removing those metabolites that lack reproducible measurements. The QC-RSD and percentage of missing values have been calculated and are included in the peakTable DataFrame. Using those values, we remove all metabolomic features that do not meet the following criteria:

*   QC-RSD less than 20%
*   Fewer than 10% of values are missing



In [None]:
# Clean PeakTable
RSD = peakTable['QC_RSD']   
percMiss = peakTable['Perc_missing']  
peakTableClean = peakTable[(RSD < 20) & (percMiss < 10)]   
peakList = peakTableClean['Name']  

print("Number of peaks remaining: {}".format(len(peakTableClean)))
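
To make the two criteria concrete, the same boolean filter can be applied to a toy peak table. The values below are made up; only the column names `QC_RSD` and `Perc_missing` and the thresholds are taken from the cell above.

```python
import pandas as pd

# Toy peak table with made-up quality metrics
peak_demo = pd.DataFrame({
    "Name": ["M1", "M2", "M3", "M4"],
    "QC_RSD": [5.0, 25.0, 12.0, 19.9],
    "Perc_missing": [0.0, 2.0, 15.0, 9.5],
})

# A peak is kept only if it passes BOTH criteria
keep = (peak_demo["QC_RSD"] < 20) & (peak_demo["Perc_missing"] < 10)
peak_demo_clean = peak_demo[keep]
kept_names = peak_demo_clean["Name"].tolist()
print(kept_names)
```

Here M2 fails on QC-RSD and M3 fails on missing values, so only M1 and M4 remain.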

**Section 3.2 - Extract X and Y**

As previously mentioned, this workflow is performing binary classification of the classes MD vs CG. The X matrix of metabolite concentrations and Y vector of classification labels ("MD"=0 and "CG"=1) are extracted through the following steps:

1.   Create a subset of the dataTable called dataTable1, containing samples in the Class "MD" or "CG"
2.   Use the peakList variable to hold the names of the metabolites to be used 
3.   Extract all the applicable columns, using peakList, from dataTable1 and place in matrix X
4.   Set Y to the list of binary outcomes from the "Class" column from dataTable1







In [None]:
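# Extract PeakList
# Reduce data table to MD and CG class members only
# (.copy() avoids modifying a view of dataTable and the associated pandas warning)
dataTable1 = dataTable[(dataTable.Class == "MD") | (dataTable.Class == "CG")].copy()
pos_outcome = "CG"

# Encode the classes as binary labels: MD = 0, CG = 1
dataTable1['Class'] = [0 if x == 'MD' else 1 for x in dataTable1['Class']]

# Remove metadata columns that are not used by the model
to_drop = ['Idx', 'SampleID', 'SampleType']
dataTable1.drop(to_drop, axis=1, inplace=True)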

In [None]:
peaklist = peakTableClean['Name']          
X = dataTable1[peaklist]

merged = pd.concat([dataTable1['Class'], X], axis=1)

merged.reset_index()

**Section 3.3 - Transform and Scale Data**

MinMaxScaler() is used to scale all columns that contain values larger than 1 to the [0, 1] range. This makes the values more manageable when creating and evaluating the model.

In [None]:
from sklearn.preprocessing import MinMaxScaler

# Scale only columns that have values greater than 1
to_scale = [col for col in X.columns if X[col].max() > 1]
mms = MinMaxScaler()
scaled = mms.fit_transform(merged[to_scale])
scaled = pd.DataFrame(scaled, columns=to_scale)

# Replace original columns with scaled ones.
# The index is reset first so that the rows align with `scaled`; assigning into
# merged.reset_index()[col] would only modify a temporary copy.
merged = merged.reset_index(drop=True)
merged[to_scale] = scaled
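
Min-max scaling maps each value x in a column to (x - min) / (max - min). As a small worked example on made-up numbers, the manual formula can be compared with the scikit-learn result:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# One toy column with made-up concentrations
col = np.array([[2.0], [4.0], [10.0]])

# Manual min-max scaling: (x - min) / (max - min)
manual = (col - col.min()) / (col.max() - col.min())

# The same transformation using scikit-learn
from_sklearn = MinMaxScaler().fit_transform(col)

print(manual.ravel())
print(from_sklearn.ravel())
```

The smallest value maps to 0, the largest to 1, and 4.0 maps to (4 - 2) / (10 - 2) = 0.25.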

In [None]:
y1 = merged['Class']
merged1 = pd.concat([y1.reset_index()['Class'], scaled], axis=1)

display(merged1)

from sklearn.model_selection import train_test_split

X = merged1.drop('Class', axis=1)
y = merged1['Class']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

print(f'''% Positive class in Train = {np.round(y_train.value_counts(normalize=True)[1] * 100, 2)}
% Positive class in Test  = {np.round(y_test.value_counts(normalize=True)[1] * 100, 2)}''')

**Section 3.4 - Data Balancing with Missing Values Imputed**

The Synthetic Minority Oversampling Technique (SMOTE) is a statistical method that balances the number of samples per class in a dataset by oversampling the minority class. Missing values (NaN) are replaced with zero using the np.nan_to_num() function before SMOTE is applied.

In [None]:
from imblearn.over_sampling import SMOTE 

sm = SMOTE(random_state=42)

x = np.nan_to_num(X)
Y = np.nan_to_num(y)

X_sm, y_sm = sm.fit_resample(x, Y)

print(f'''Shape of X before SMOTE: {X.shape}
Shape of X after SMOTE: {X_sm.shape}''')

print('\nBalance of positive and negative classes (%):')
pd.Series(y_sm).value_counts(normalize=True) * 100
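
The idea behind SMOTE can be sketched in a few lines of NumPy. This is a simplified illustration and not the imbalanced-learn implementation: the function name `smote_like`, its parameters, and the toy points are hypothetical. Each synthetic sample is placed at a random position on the line segment between a minority sample and one of its nearest minority neighbours.

```python
import numpy as np


def smote_like(X_min, n_new, k=2, seed=0):
    """Hypothetical, simplified SMOTE-style oversampling of minority samples."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between minority samples
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)  # a sample is not its own neighbour
    neighbours = np.argsort(d, axis=1)[:, :k]  # indices of the k nearest neighbours
    synthetic = np.empty((n_new, X_min.shape[1]))
    for row in range(n_new):
        i = rng.integers(len(X_min))       # random minority sample
        j = neighbours[i, rng.integers(k)] # one of its k nearest neighbours
        lam = rng.random()                 # interpolation factor in [0, 1)
        synthetic[row] = X_min[i] + lam * (X_min[j] - X_min[i])
    return synthetic


# Four toy minority samples at the corners of the unit square
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_new = smote_like(X_min, n_new=5)
print(X_new)
```

Because every synthetic point is an interpolation between two existing minority samples, the new points stay within the region already occupied by that class.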

# Section 4 - Split Data into Train and Test Set

Using the train_test_split method, the balanced X and Y data are divided into train (2/3) and test (1/3) sets.

In [None]:
# Optional: Save Class Labels for Figure Legends
Class = merged1.Class

# Split Data into Train (2/3rd) and Test (1/3rd)
XTrain, XTest, YTrain, YTest = train_test_split(X_sm,
                                                y_sm,
                                                test_size=1/3,
                                                random_state=seed_split)

# Section 5 - Hyperparameter Optimisation

**Section 5.1 - k-fold Cross-Validation**

k-fold cross-validation (k=5) is carried out using the CIMCB helper function cb.cross_val.kfold(). This is applied to a set of ANN-SS models with the number of neurons ranging from 2 to 6, and the learning rate ranging from 0.01 to 0.05.

In [None]:
# Parameter Dictionary
lr = [0.01,0.02,0.03,0.04,0.05]
neurons = [2, 3, 4, 5, 6]

param_dict = dict(learning_rate=lr,
                  n_neurons=neurons,
                  epochs=400,
                  momentum=0.5,
                  decay=0,
                  loss='binary_crossentropy')

# Initialise
cv = cb.cross_val.kfold(model=cb.model.NN_SigmoidSigmoid,
                        X=XTrain,
                        Y=YTrain,
                        param_dict=param_dict,
                        folds=5,
                        n_mc=10)

# Run 
cv.run()
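
As a rough illustration of what this search covers, the sketch below enumerates the hyperparameter grid and shows how 5-fold cross-validation partitions a dataset. It uses scikit-learn's `KFold` on a made-up array; how `cb.cross_val.kfold` organises its folds internally is not shown here and should not be inferred from this sketch.

```python
from itertools import product

import numpy as np
from sklearn.model_selection import KFold

# The same grid values as in the parameter dictionary above
lr_grid = [0.01, 0.02, 0.03, 0.04, 0.05]
neuron_grid = [2, 3, 4, 5, 6]
grid = list(product(lr_grid, neuron_grid))  # every (learning rate, neurons) pair
print(len(grid))

# Toy dataset of 20 samples (values are irrelevant for the partitioning)
X_demo = np.zeros((20, 3))
kf = KFold(n_splits=5, shuffle=True, random_state=0)

# Each fold holds out a different fifth of the samples for validation
fold_sizes = [(len(train_idx), len(val_idx)) for train_idx, val_idx in kf.split(X_demo)]
print(fold_sizes)
```

Each of the 25 combinations is evaluated on every fold, so the cost of the search grows with both the grid size and the number of folds.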

**Section 5.2 - Plot R^2 and Q^2**

Six plots are used to display the R^2 and Q^2 statistics. From left to right, top to bottom:

1. Heatmap of R^2
2. Heatmap of Q^2
3. Heatmap of 1 - |R^2 - Q^2|
4. |R^2 - Q^2| vs. Q^2
5. R^2 & Q^2 vs. learning rate
6. R^2 & Q^2 vs. number of neurons

The AUC metric expresses the predictive ability of the model as the area under the ROC curve, reported as AUC(full) and AUC(cv). It is a non-parametric alternative to R^2 and Q^2.

In [None]:
cv.plot(metric='auc', ci=95)
cv.plot(metric='r2q2', ci=95)
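
For orientation, the statistics plotted above can be computed by hand on a toy example. The sketch assumes the standard coefficient of determination, 1 - SS_res / SS_tot, with R^2 computed from predictions of the model fitted on all the data and Q^2 from the held-out (cross-validated) predictions; the exact computation inside CIMCB is not reproduced here, and all numbers are made up.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


y_true = np.array([0, 0, 1, 1])
y_full = np.array([0.1, 0.2, 0.8, 0.9])  # made-up predictions from the full model
y_cv = np.array([0.3, 0.4, 0.6, 0.7])    # made-up held-out (CV) predictions

r2 = r_squared(y_true, y_full)       # R^2: fit to the training data
q2 = r_squared(y_true, y_cv)         # Q^2: predictive ability under CV
auc_cv = roc_auc_score(y_true, y_cv) # AUC depends only on the ranking of the scores
print(r2, q2, auc_cv)
```

In this example R^2 is 0.9 and Q^2 is 0.5, while the AUC of the CV predictions is 1.0 because the ranking of the two classes is still perfect. A large gap between R^2 and Q^2 is the pattern the |R^2 - Q^2| plots are designed to reveal.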

**Section 5.3 - Plot Latent Projections: Full and CV**

The .plot_projections() method displays an n x n grid of plots, where n is the number of neurons in the hidden layer being analysed. The types of plot are:

*   Score plots
*   Distribution plots
*   Receiver operating characteristic (ROC) curves

A score plot is produced for each combination of two neurons. Each score plot includes the full scores (presented as circles) and the CV scores (presented as crosses), coloured by group, together with the 95% confidence intervals, presented as a solid line for the full scores and a dashed line for the CV scores. The orthogonal line is displayed as a solid grey line, and the optimal line of separation as a dashed grey line.

For each neuron a distribution plot is produced. These display the full and CV scores for each group, using kernel density estimation to calculate each distribution.

As with the score plots, a ROC curve plot is produced for each combination of two neurons. The discrimination is determined by the optimal separation between the two specific neurons being interrogated. Each ROC plot comprises a curve for the full model (green), a curve for the CV model (yellow) with 95% confidence intervals, and the distribution line, presented as a dashed black line.

In [None]:
cv.plot_projections()

# Section 6 - Build Model and Evaluate

Using the optimal hyperparameter values identified in Section 5, an ANN-SS model is created and initialised. The model is then trained, where XTrain is the X matrix and YTrain is the Y vector, and tested, where XTest is the X matrix and YTest is the Y vector. Testing returns the predicted Y values, YPredTest.

The .evaluate() method uses the train and test set to evaluate the predictability of the model. Three plots are produced:

1.   Violin plot
2.   Distribution plot
3.   ROC curve

"The violin plots show the predicted score for the train and test (by group). The distribution plot shows the probability density function of the predicted scores for the train and test (by group). The ROC curve shows the ROC curve for the train (green) and test (yellow)."

In [None]:
# Build Model (the same ANN-SS model class used for cross-validation in Section 5)
model = cb.model.NN_SigmoidSigmoid(learning_rate=0.04,
                                   n_neurons=5,
                                   epochs=75,
                                   momentum=0.5,
                                   decay=0,
                                   loss='binary_crossentropy')
YPredTrain = model.train(XTrain, YTrain)
YPredTest = model.test(XTest)

# Put YTrain and YPredTrain in a List
EvalTrain = [YTrain, YPredTrain]

# Put YTest and YPredTest in a List
EvalTest = [YTest, YPredTest]


# Evaluate Model (include Test Dataset)
model.evaluate(testset=EvalTest)
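
The architecture of the ANN-SS model (a hidden layer of sigmoid neurons followed by a single sigmoid output neuron) can be written out as a forward pass in NumPy. This is an illustrative sketch only: the function names, the random weights, and the toy input are hypothetical, and no training is performed.

```python
import numpy as np


def sigmoid(z):
    """Logistic function, mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def ann_ss_forward(X, W1, b1, W2, b2):
    """Hypothetical forward pass: sigmoid hidden layer, then sigmoid output neuron."""
    hidden = sigmoid(X @ W1 + b1)       # layer 1: one column per hidden neuron
    output = sigmoid(hidden @ W2 + b2)  # layer 2: single output neuron
    return hidden, output.ravel()


rng = np.random.default_rng(4)
n_features, n_neurons = 10, 5  # toy sizes; 5 neurons matches the model above

X_demo = rng.random((3, n_features))  # 3 made-up samples
W1 = rng.normal(scale=0.1, size=(n_features, n_neurons))
b1 = np.zeros(n_neurons)
W2 = rng.normal(scale=0.1, size=(n_neurons, 1))
b2 = np.zeros(1)

hidden, y_pred = ann_ss_forward(X_demo, W1, b1, W2, b2)
print(hidden.shape, y_pred)
```

The output of the final sigmoid lies between 0 and 1 and can be read as the predicted score for the positive class, which is what the violin and distribution plots display by group.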

# Section 7 - Permutation Test

After the model has been trained, permutation testing can be performed to assess the reliability of the model. The .permutation_test() method randomises the X matrix whilst the Y vector remains fixed; the model is then trained and tested on the randomised data. This is repeated 100 times to build a distribution of results on randomised data against which the original model can be compared.

The method produces two plots:

*   R^2 & Q^2 against the correlation of permuted data against original data
*   Probability densities for R^2 & Q^2

For datasets that contain metabolomic features with no meaningful contribution towards the classification, we would expect R^2 & Q^2 values significantly lower than the values from a dataset that contains features with a meaningful contribution.

In [None]:
model.permutation_test(nperm=100)
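
The logic of a permutation test can be sketched without the neural network. In the example below a scikit-learn logistic regression stands in for the ANN-SS model purely to keep the sketch fast, the data are synthetic, and only the in-sample AUC is computed; the helper `fit_auc` is hypothetical and this is not the CIMCB implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic data: 30 samples per class, class 1 shifted so the classes separate well
y_demo = np.repeat([0, 1], 30)
X_demo = rng.normal(size=(60, 3))
X_demo[y_demo == 1] += 3.0


def fit_auc(X, y):
    """Hypothetical helper: fit a stand-in classifier and return its in-sample AUC."""
    clf = LogisticRegression().fit(X, y)
    return roc_auc_score(y, clf.predict_proba(X)[:, 1])


# Performance on the original data
auc_true = fit_auc(X_demo, y_demo)

# Performance after shuffling the rows of X while Y stays fixed
auc_null = np.array([
    fit_auc(X_demo[rng.permutation(len(X_demo))], y_demo)
    for _ in range(20)
])

print(auc_true, auc_null.mean())
```

When the features carry real information about the classes, the value on the original data sits well above the values obtained from the shuffled data, which is the pattern to look for in the two plots produced by .permutation_test().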

# Section 8 - Binary Cross-Entropy Loss

In [None]:
display(XTrain)

In [None]:
display(YTrain)

In [None]:
display(XTest)

In [None]:
display(YTest)

**BINARY CROSS-ENTROPY LOSS**

In [None]:
	
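# MLP trained on the metabolomics data with binary cross-entropy loss
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD
from matplotlib import pyplot

# define model: one hidden layer of 50 ReLU units and a single sigmoid output neuron
# (note: this reassigns `model`, replacing the CIMCB model built in Section 6)
# input_dim is taken from the data (2259 features in the original run)
model = Sequential()
model.add(Dense(50, input_dim=XTrain.shape[1], activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(1, activation='sigmoid'))
opt = SGD(learning_rate=0.04, momentum=0.5)  # `lr` is deprecated in favour of `learning_rate`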
model.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy'])

# fit model
history = model.fit(XTrain, YTrain, validation_data=(XTest, YTest), epochs=75, verbose=0)

# evaluate the model
_, train_acc = model.evaluate(XTrain, YTrain, verbose=0)
_, test_acc = model.evaluate(XTest, YTest, verbose=0)
print('Train: %.3f, Test: %.3f' % (train_acc, test_acc))

# plot loss during training
pyplot.subplot(211)
pyplot.title('Loss')
pyplot.plot(history.history['loss'], label='train')
pyplot.plot(history.history['val_loss'], label='test')
pyplot.legend()

# plot accuracy during training
pyplot.subplot(212)
pyplot.title('Accuracy')
pyplot.plot(history.history['accuracy'], label='train')
pyplot.plot(history.history['val_accuracy'], label='test')
pyplot.legend()
pyplot.show()
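
For reference, the binary cross-entropy loss tracked in the plots above is the mean of -[y log(p) + (1 - y) log(1 - p)] over all samples, where y is the true label and p the predicted probability of class 1. A minimal NumPy sketch follows; the function name and the clipping constant `eps` are illustrative choices, and the numbers are made up.

```python
import numpy as np


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    losses = y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob)
    return float(-np.mean(losses))


# Confident and correct predictions give a small loss
loss_good = binary_cross_entropy([0, 1], [0.1, 0.9])

# Confident but wrong predictions are penalised heavily
loss_bad = binary_cross_entropy([0, 1], [0.9, 0.1])

print(loss_good, loss_bad)
```

The first case gives -ln(0.9), roughly 0.105, and the second gives -ln(0.1), roughly 2.303, which shows why the loss curve is more sensitive to confident misclassifications than the accuracy curve.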