# Multivariate Analysis - PLS-DA

In this notebook we will perform a supervised *multivariate* PLS-DA analysis of the *C. elegans* dataset. We recommend completing the notebook *Multivariate Analysis - PCA* first.

The notebook is divided into the following steps:

1) Model fitting basics: Fit PLS-DA models to predict genotype from the metabolic profile data, using different types of scaling.

2) Model cross-validation and component selection: Describe model cross-validation, parameter selection and performance assessment, including permutation testing.

3) Model interpretation: Describe some of the available variable importance metrics for PLS-DA, and highlight which variables might be important for the discrimination. Compare the selected variables with the results of a univariate analysis (performed using the notebook **Univariate Analysis**).

## Code import

Import all the packages and configure notebook plotting mode.

In [1]:
# Import the required python packages including 
# the custom Chemometric Model objects
import numpy as np
from sklearn import preprocessing
import pandas as pds
import matplotlib.pyplot as plt
import warnings
from sklearn.exceptions import DataConversionWarning

from pyChemometrics.ChemometricsPLSDA import ChemometricsPLSDA
from pyChemometrics.ChemometricsScaler import ChemometricsScaler
from pyChemometrics.ChemometricsOrthogonalPLSDA import ChemometricsOrthogonalPLSDA

# Use to obtain same values as in the text
np.random.seed(350)

In [2]:
# Suppress the data conversion warnings to avoid repeated messages during CV
warnings.filterwarnings("ignore", category=DataConversionWarning) 

The next cell sets up the figure display mode. The *notebook* mode allows interactive plotting.

In [3]:
# Set the plot backend to support interactive plotting
%matplotlib notebook

## Data import

We will now import the NMR data and the metadata (Y variables).

X - NMR data matrix

Y - Matrix with the 2 metadata outcomes

ppm - Chemical shift axis for the $^{1}$H NMR data, in $\delta$ppm.

#### Metadata
Y1 - represents the genotype (1: wild-type, 2: *sod-2* mutants, in original Y data matrix)

Y2 - represents the age (1: younger L2 worms, 2: L4 worms, in original Y data matrix)

In [4]:
# Load the dataset
X = np.genfromtxt("./data/X_spectra.csv", delimiter=',', dtype=None)
Y = pds.read_csv("./data/worm_yvars.csv",delimiter=',',dtype=None, header=None)
ppm = np.loadtxt("./data/ppm.csv",delimiter=',')

# Use pandas Categorical type to generate the dummy encoding of the Y vector (0 and 1)
Y1 = pds.Categorical(Y.iloc[:, 0]).codes
Y2 = pds.Categorical(Y.iloc[:, 1]).codes

**Note**: To apply the analyses exemplified in this notebook to any other dataset, just modify the cell above to import the data matrices and vectors X and Y from any other source file.

The expected data types and formatting for **X** and **Y** are:

   **X**: Any data matrix with n rows (observations/samples) and p columns (variables/features). The matrix should be provided as a [numpy.ndarray](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html) object, with 2 dimensions, and with shape = (n, p). We recommend using the *numpy* function [numpy.genfromtxt](https://numpy.org/devdocs/reference/generated/numpy.genfromtxt.html) or the *pandas* [pandas.read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) function to read the data from a text file. When using the *pandas.read_csv* function, extract the data matrix as a *numpy.ndarray* from the pandas.DataFrame object using the `.values` attribute. 
```
X_DataFrame = pds.read_csv("./data/X_spectra.csv")
X = X_DataFrame.values
```
   
   **Y** vectors: Each **Y** vector should be a 1-dimensional [numpy.ndarray](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html) object, with a number and ordering of elements matching the rows in **X**. For continuous variables, any regular *numpy.ndarray* with a data type of `int` (integers only) or `float` can be used.
   ```
   Y_continuous = np.array([23.4, 24, 0.3, -1.23], dtype='float')
   ```
To encode binary class labels, a *numpy.ndarray* of dtype `int`, with 0 and 1 as labels (e.g., 0 = Control, 1 = Case) must be used. The way in which classes are encoded will affect the model interpretation: the class labeled as 1 is used as the "positive/case" class by the *pyChemometrics* objects.
   
   In the example above, we used the *pandas* [Categorical](https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html) datatype to handle the conversion of the original numerical values (1, 2) to the required (0, 1) labels. After converting a column to a `Categorical` datatype, the `.codes` attribute returns a vector with the same length as the original Y, but where each value is replaced by its integer (`int`) code. The correspondence between code and category can be inspected with the `categories` attribute. The order of the labels in `.codes` is the same as the order of the `categories` attribute (i.e. 0 is the first element in `categories`, 1 the second and so on).
   ```
   Y1 = pds.Categorical(Y.iloc[:, 0])
   Y1.codes # The numerical label
   Y1.categories # Original text or numerical description of the category
   ```
   [get_dummies](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html) is another helpful function to perform dummy (0-1) encoding of variables. 
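A quick check of the inputs can catch formatting problems before any model is fitted. The sketch below is only an illustration: `check_xy` is a hypothetical helper (it is not part of *pyChemometrics*) that verifies the shape requirements described above, and the toy arrays are made up. It also shows that `numpy.unique` with `return_inverse=True` produces 0/1 codes for two sorted labels, similar to what `Categorical(...).codes` returns.

```python
import numpy as np

def check_xy(X, y):
    """Hypothetical helper: validate the X/Y layout described in the text."""
    X = np.asarray(X)
    y = np.asarray(y)
    if X.ndim != 2:
        raise ValueError("X must be 2-dimensional with shape (n, p)")
    if y.ndim != 1:
        raise ValueError("y must be 1-dimensional")
    if X.shape[0] != y.shape[0]:
        raise ValueError("X and y must have the same number of observations")
    return X, y

# Toy data: 4 samples, 3 variables, original labels coded as 1 and 2
X_toy = np.arange(12, dtype=float).reshape(4, 3)
labels = np.array([1, 2, 2, 1])

# np.unique sorts the distinct labels; `codes` indexes into `categories`
categories, codes = np.unique(labels, return_inverse=True)
X_toy, codes = check_xy(X_toy, codes)

print(categories)  # original labels, sorted
print(codes)       # integer encoding of each sample
```

As with `Categorical`, the label that sorts first receives code 0, so it is worth confirming that code 1 corresponds to the intended "positive/case" class.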

Plot all the spectra in the dataset.


In [5]:
# Plot the spectra in the dataset
plt.figure()
plt.plot(ppm, X.T)
plt.title("X matrix of spectra")
plt.xlabel(r"$\delta$ppm")
plt.gca().invert_xaxis()
plt.ylabel("Intensity")
plt.show()


# PLS-DA modeling

## 1) Model fitting basics

In this section we will fit a PLS-DA model to classify *C. elegans* samples based on their genotype, and assess the metabolic differences between *sod-2* mutants and the parent wild-type (N2).

As an example, we start by fitting a PLS-DA model with 2 components and unit-variance (UV) scaling. The choice of the number of components will be addressed properly in the next section; the objective of this first section is to introduce the model syntax.

Similar to PCA, we start by choosing a scaling method for the X data matrix. The choice of scaling method will influence the results and interpretation.

In [6]:
# Select the scaling options: 

# Unit-Variance (UV) scaling:
scaling_object_uv = ChemometricsScaler(scale_power=1)

# Pareto scaling:
scaling_object_par = ChemometricsScaler(scale_power=1/2)

# Mean Centring:
scaling_object_mc = ChemometricsScaler(scale_power=0)
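The three options differ only in the exponent applied to the standard deviation of each column. The following is a minimal numpy sketch of that idea, assuming the usual definition (centre each column, then divide by its standard deviation raised to `scale_power`). The function `scale` and the simulated matrix are hypothetical, and the use of the sample standard deviation (`ddof=1`) is an assumption, not a statement about how `ChemometricsScaler` is implemented.

```python
import numpy as np

def scale(X, scale_power=1.0):
    """Centre each column and divide by its standard deviation ** scale_power."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=1)  # assumption: sample standard deviation
    return (X - mean) / std ** scale_power

# Simulated data: three variables with very different spreads
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=[1.0, 10.0, 100.0], size=(50, 3))

X_mc = scale(X_demo, 0)     # mean centring only: original spread is kept
X_par = scale(X_demo, 0.5)  # Pareto: spread shrinks to sqrt(original std)
X_uv = scale(X_demo, 1)     # unit variance: every column has std 1

print(X_mc.std(axis=0, ddof=1))
print(X_par.std(axis=0, ddof=1))
print(X_uv.std(axis=0, ddof=1))
```

With UV scaling every variable contributes equally to the model, including low-intensity noise regions, whereas mean centring lets the most intense signals dominate. Pareto scaling is a compromise between the two. Note that this sketch does not guard against zero-variance columns.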

For this example we will use Unit-Variance scaling (UV scaling), and start by fitting a PLS-DA model with 2 components.

In [7]:
# Create and fit PLS-DA model
pls_da = ChemometricsPLSDA(n_components=2, x_scaler=scaling_object_uv)
pls_da.fit(X, Y1)

PLS models perform dimensionality reduction in a manner similar to PCA. The main difference (besides the criterion by which the components are found) is that, in addition to the projections for the X matrix ($T$ scores), we also have projections for the Y matrix ($U$ scores).

Model visualization of PLS/PLS-DA models is typically performed by plotting the $T$ scores (X matrix scores). 
The score plot gives an overview of the relationships between samples, and their similarities and dissimilarities within the model space.

<br>

**Warning**: PLS-DA models can easily overfit, and the degree of separation or clustering of samples from distinct classes or Y outcome in the score plot is not a reliable measure of model validity. We recommend focusing on model validation before exploring the relationships in the scores plot. See the next section.

In [8]:
# Plot the scores
pls_da.plot_scores(color=Y1, discrete=True, label_outliers=True, plot_title=None)


<AxesSubplot:xlabel='T[1]', ylabel='T[2]'>

The *plot_scores* methods of the `ChemometricsPLS` and `ChemometricsPLSDA` objects share the same functionality as `ChemometricsPCA.plot_scores`. Score plot data points can be colored by the levels of a continuous or discrete covariate by using the `color` argument and setting the ```discrete``` argument to ```True``` or ```False``` accordingly. The index (row index of the data matrix **X**) of the outlying samples can be labeled with ```label_outliers=True```, and the plot title changed with the argument ```plot_title```.

The main directions associated with each component in the score plots can be interpreted in terms of the original X variables using the loading vector, just like in PCA. Each component has an associated loading vector $p$ and weight vector $w$.

In [9]:
# Plot the weights and loadings.
# w for weights, p for loadings,
# ws for X rotations (rotated version of w) 
pls_da.plot_model_parameters(parameter='p', component=1)


Besides the loading vectors, PLS models have another important set of parameters, the weight vectors. There is one weight vector ($w$) corresponding to the X matrix and another ($c$) to the Y variables.

The weight vector ($w$) relates the original X variables with the Y outcome we are predicting. These vectors (and metrics based on them, such as VIP) are important to assess the relationship between X and Y and which X variables are more associated with Y. This will be discussed in more detail later in this tutorial.

The larger the magnitude of the variable coefficient in the weight vector, the more "associated" that variable is with the response.

In [10]:
# Plot the weights and loadings.
# w for weights, p for loadings,
# ws for X rotations (rotated version of w) 
pls_da.plot_model_parameters(parameter='w', component=1)
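To make the roles of $w$, $c$, $T$, $U$ and $p$ concrete, here is a minimal numpy sketch of the first component of a PLS model with a single Y variable. It follows the standard textbook steps on simulated data. All variable names are hypothetical, scaling is reduced to mean centring, and it is not meant to reproduce the *pyChemometrics* implementation.

```python
import numpy as np

# Simulated data: 40 samples, 6 variables, binary outcome driven by the first two
rng = np.random.default_rng(1)
n, n_vars = 40, 6
X_sim = rng.normal(size=(n, n_vars))
signal = X_sim[:, 0] + 0.5 * X_sim[:, 1] + rng.normal(scale=0.3, size=n)
y_sim = (signal > 0).astype(float)

# Centre X and y (scaling omitted for brevity)
Xc = X_sim - X_sim.mean(axis=0)
yc = y_sim - y_sim.mean()

w = Xc.T @ yc
w /= np.linalg.norm(w)        # X weights (w), unit norm
t = Xc @ w                    # X scores (T)
c = (yc @ t) / (t @ t)        # Y weight (c), a scalar for a single y
u = yc / c                    # Y scores (U)
p_load = Xc.T @ t / (t @ t)   # X loadings (p)

print(np.round(w, 3))
print(np.round(p_load, 3))
```

In this sketch the weight vector is, by construction, the normalised vector of covariances between each X column and Y, which is why large absolute weights point to variables associated with the response. The loadings instead describe how each variable relates to the score $t$.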


## 2) Model Selection - Number of components

Selection of the number of components for a PLS model follows a very similar logic to the PCA case.
Since the goal is to predict the Y variable, the main criteria used are the $R^{2}Y$/$Q^{2}Y$ as opposed to $R^{2}X$/$Q^{2}X$.

Ideally, we want to select enough components to predict as much of the variation in Y as possible using the data in X, while avoiding overfitting. 

We apply a criterion similar to the one used with PCA: choose the number of components after which the $Q^{2}Y$ value reaches a plateau (an increase of less than 5% compared to the previous number of components).

In [11]:
pls_da.scree_plot(X, Y1, total_comps=10)


AUC measure stabilizes (increase of less than 5% of previous value or decrease) at component 3


(3, <AxesSubplot:xlabel='Number of components', ylabel='R2/Q2Y/AUC'>)
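The plateau rule can be written down in a few lines. The sketch below is one possible reading of the criterion, not the code used by `scree_plot`: `q2_score` computes $Q^{2} = 1 - PRESS/TSS$ from held-out predictions, and `select_n_components` returns the number of components after which the relative gain falls below the threshold. Both function names and the example values are hypothetical.

```python
import numpy as np

def q2_score(y_true, y_pred_cv):
    """Q2 = 1 - PRESS / TSS, computed from cross-validated (held-out) predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred_cv = np.asarray(y_pred_cv, dtype=float)
    press = np.sum((y_true - y_pred_cv) ** 2)
    tss = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - press / tss

def select_n_components(q2_values, threshold=0.05):
    """Number of components after which Q2 gains less than `threshold` (relative)."""
    q2_values = np.asarray(q2_values, dtype=float)
    for k in range(1, q2_values.size):
        relative_gain = (q2_values[k] - q2_values[k - 1]) / abs(q2_values[k - 1])
        if relative_gain < threshold:
            return k  # component k + 1 adds little, or makes things worse
    return q2_values.size

# Made-up Q2Y values for models with 1 to 5 components
q2_by_ncomp = [0.30, 0.50, 0.60, 0.61, 0.60]
n_selected = select_n_components(q2_by_ncomp)
print(n_selected)
```

The same function can be applied to any metric that should increase with model quality, for example the cross-validated AUC reported in the output above.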

Just like in the case of PCA, the $Q^{2}Y$ and other validation metrics obtained during K-Fold cross-validation are sensitive to row permutation of the X and Y matrices. Shuffling the rows and repeating the cross-validation steps multiple times is a more reliable way to select the number of components.

**Note**: Model cross-validation, especially the *repeated_cv* call in the next cell requires fitting the model multiple times, and can take a few minutes.

In [12]:
# Repeated cross_validation
rep_cv = pls_da.repeated_cv(X, Y1, repeats=5, total_comps=10)


### Outlier detection

The outlier detection measures available for PCA (Hotelling $T^{2}$ and DmodX) are also available for PLS/PLS-DA models. Outlier interpretation is also performed in the same way.

In [13]:
pls_da.plot_scores(label_outliers=True)
pls_da.outlier(X)


array([ 25,  36, 100, 106, 113, 117])
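For reference, the Hotelling $T^{2}$ statistic of a sample is the sum, over components, of its squared score divided by the variance of that score. The sketch below computes it with numpy on simulated scores. `hotelling_t2` is a hypothetical function and the data are made up. The critical limit, which is usually derived from an F distribution, is not shown, and the `outlier` method above may apply different rules.

```python
import numpy as np

def hotelling_t2(scores):
    """Hotelling T2 per sample from a (n_samples, n_components) score matrix."""
    scores = np.asarray(scores, dtype=float)
    centred = scores - scores.mean(axis=0)    # model scores of centred data have zero mean
    variances = centred.var(axis=0, ddof=1)   # variance of each component
    return np.sum(centred ** 2 / variances, axis=1)

rng = np.random.default_rng(7)
T_demo = rng.normal(size=(30, 2))   # simulated scores: 30 samples, 2 components
T_demo[7] = [8.0, -8.0]             # plant one extreme sample

t2 = hotelling_t2(T_demo)
ranked = np.argsort(t2)[::-1]       # samples ordered from most to least extreme
print(ranked[:3])
```

Ranking samples by $T^{2}$ gives a simple list of candidates to inspect. Whether a sample is actually excluded should still depend on the interpretation of its spectrum and metadata.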

The strongest outliers in this case are the 5 samples with the most negative scores on PLS component 2. These are the same samples identified as outliers during the preliminary PCA analysis. We will remove them before proceeding.

In [14]:
pca_outliers = np.array([36, 100, 106, 113, 117])
X = np.delete(X, pca_outliers, axis=0)
Y1 = np.delete(Y1, pca_outliers, axis=0)
Y2 = np.delete(Y2, pca_outliers, axis=0)

We now re-check the optimal number of components after exclusion of outliers.

In [15]:
pls_da.scree_plot(X, Y1, total_comps=10)


AUC measure stabilizes (increase of less than 5% of previous value or decrease) at component 2


(2, <AxesSubplot:xlabel='Number of components', ylabel='R2/Q2Y/AUC'>)

In [16]:
# Repeated cross_validation
rep_cv = pls_da.repeated_cv(X, Y1, repeats=5, total_comps=10)


Following the recommendations from cross-validation and repeated cross-validation, we select 4 as the final number of components.

### Refit the model
Refit the model without outliers and use the number of components selected.

In [17]:
# Refit the model with the selected number of components
pls_da = ChemometricsPLSDA(n_components=4, x_scaler=scaling_object_uv)
pls_da.fit(X, Y1)

In [18]:
pls_da.plot_scores(color=Y1, discrete=True)


<AxesSubplot:xlabel='T[1]', ylabel='T[2]'>

Although we used the $Q^{2}Y$ metric to perform model selection, this metric is easier to interpret for regression problems, and it is not straightforward to assess the performance of a classifier model using $Q^{2}Y$ or $R^{2}Y$ and similar goodness of fit metrics. The performance in a classification task is more effectively described by confusion matrices and related metrics, such as accuracy/balanced accuracy, f1, ROC curves and their respective area under the curve.

To obtain more reliable estimates we can calculate the cross-validation estimates of any of these metrics, including cross-validated ROC curves. This ROC curve was estimated using the left-out samples (the test sets) during cross-validation.

In [19]:
# Cross-validated ROC curve
pls_da.cross_validation(X, Y1)
pls_da.plot_cv_ROC()


Mean AUC: [0.96916195]


<AxesSubplot:xlabel='False Positive Rate (1 - Specificity)', ylabel='True Positive Rate (Sensitivity)'>
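The classification metrics mentioned above are straightforward to compute from held-out predictions. The following numpy sketch uses a small made-up example. The helper names are hypothetical, and the 0.5 decision threshold is an assumption chosen for illustration. The AUC is computed as the proportion of (positive, negative) pairs in which the positive sample receives the higher score, with ties counted as one half.

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    """Return (TN, FP, FN, TP) for binary labels coded as 0/1."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    return tn, fp, fn, tp

def roc_auc(y_true, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties = 0.5)."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))

# Made-up held-out predictions, e.g. collected from the test folds during CV
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])
y_pred = (y_score >= 0.5).astype(int)   # assumption: 0.5 decision threshold

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
accuracy = (tp + tn) / y_true.size
balanced_accuracy = (sensitivity + specificity) / 2
auc = roc_auc(y_true, y_score)
print(accuracy, balanced_accuracy, auc)
```

Unlike accuracy, the AUC does not depend on the decision threshold, which makes it convenient for comparing models when the classes are unbalanced.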

### Permutation Testing
A final and very important method for model validation is the permutation randomization test. In a permutation randomisation test, the model will be refitted and assessed multiple times, but each time with the Y randomly permuted to destroy any relationship between X & Y. This allows us to assess what sort of model we can get when there really is no relationship between the two data matrices, and calculate the likelihood of obtaining a model with predictive performance as good as the non-permuted model by chance alone.

During this test, the number of components, scaling, type of cross-validation employed, and any other modeling choice is kept constant. In each randomization, the model is refitted, and the AUC, $Q^{2}Y$ or any other validation metric is recorded. This enables the generation of permuted null distributions for any parameter, which can be used to obtain an empirical *p-value* for their significance.

**Note**: Running the permutation test with a large number of permutation randomizations (for example, 1000) is expected to take a considerable amount of time (approximately 30 minutes on a laptop).

In [20]:
permt = pls_da.permutation_test(X, Y1, 1000)
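The logic of the permutation test does not depend on the model used. The sketch below illustrates it with a deliberately simple statistic (the absolute Pearson correlation) standing in for the cross-validated AUC or $Q^{2}Y$ of a PLS-DA model. The function names and the simulated data are hypothetical. The empirical *p-value* counts the observed statistic as one of the permutations, so it can never be exactly zero.

```python
import numpy as np

def permutation_p_value(x, y, statistic, n_permutations=200, seed=0):
    """Empirical p-value of `statistic(x, y)` under random permutation of y."""
    rng = np.random.default_rng(seed)
    observed = statistic(x, y)
    null = np.array([statistic(x, rng.permutation(y))
                     for _ in range(n_permutations)])
    # +1 in numerator and denominator: the observed value counts as one permutation
    p_value = (np.sum(null >= observed) + 1) / (n_permutations + 1)
    return observed, null, p_value

def abs_correlation(x, y):
    """Stand-in for a model performance metric."""
    return abs(np.corrcoef(x, y)[0, 1])

rng = np.random.default_rng(42)
x_rel = rng.normal(size=50)
y_rel = x_rel + rng.normal(scale=0.1, size=50)   # strong relationship with x
y_unrel = rng.normal(size=50)                    # no relationship with x

obs_rel, null_rel, p_rel = permutation_p_value(x_rel, y_rel, abs_correlation)
obs_unrel, null_unrel, p_unrel = permutation_p_value(x_rel, y_unrel, abs_correlation)
print(p_rel, p_unrel)
```

The smallest attainable *p-value* is 1 / (number of permutations + 1), so the number of permutations sets the resolution of the test.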

#### Optional: Load pre-calculated results

In [21]:
# plot the results from the permutation test
pls_da.plot_permutation_test(permt, metric='AUC')
plt.xlabel('AUC')
plt.ylabel('Counts')
print("Permutation p-value for the AUC: {0}".format(permt[1]['AUC']))


Permutation p-value for the AUC: 0.0196078431372549


In [22]:
# plot the results from the permutation test
pls_da.plot_permutation_test(permt, metric='Q2Y')
plt.xlabel('Q2Y')
plt.ylabel('Counts')
print("Permutation p-value for the Q2Y: {0}".format(permt[1]['Q2Y']))


Permutation p-value for the Q2Y: 0.0196078431372549


The *p-value* obtained is < 0.05, so the model AUC and Q2Y values are significantly different from what is expected by chance alone at a level of $\alpha$ = 0.05.

## 3) Model interpretation and variable importance

The main parameters to assess in terms of variable importance for the prediction of Y from X are the weights ($w$), the VIP metric and regression coefficients.

The values in a weight vector vary between -1 (strong negative covariance) and 1 (strong covariance), with 0 meaning no association/covariance. The weight vector of the first component (which explains the most variation in Y) is the primary weight vector to analyze when interpreting the main variables of X associated with Y.

The variable importance for prediction (VIP) metric is a sum (weighted by the amount of variance of Y explained by each respective component) of the squared weight values. It provides a summary of the importance of a variable accounting for all weight vectors. VIPs are bounded between 0 (no effect) and infinity. Because it is calculated from the weights $w$, for PLS models with a single component these are directly proportional to the $w^{2}$. The VIP metric has the disadvantage of pooling together $w$ vectors from components which contribute a very small magnitude to the model's $R^{2}Y$.

The regression coefficients ($\beta$) have a similar interpretation as regression coefficients in a multivariate/multiple linear regression.
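As an illustration of how the VIP pools the weight vectors, the sketch below implements the definition commonly given in the chemometrics literature: the square root of the weighted sum of squared, normalised weights, scaled by the number of variables. With this convention the squared VIP of a single-component model is proportional to $w^{2}$, and the mean of the squared VIPs equals 1. The function `vip_scores` and the example values are hypothetical, and the *pyChemometrics* implementation may differ in its details.

```python
import numpy as np

def vip_scores(weights, ssy):
    """VIP from a (n_vars, n_comps) weight matrix and the Y sum of squares
    explained by each component (length n_comps)."""
    weights = np.asarray(weights, dtype=float)
    ssy = np.asarray(ssy, dtype=float)
    n_vars = weights.shape[0]
    w_norm = weights / np.linalg.norm(weights, axis=0)   # unit-norm columns
    return np.sqrt(n_vars * ((w_norm ** 2) @ ssy) / ssy.sum())

# Made-up weights for 4 variables and 2 components
W_demo = np.array([[0.8, 0.1],
                   [0.6, 0.2],
                   [0.0, 0.9],
                   [0.0, 0.3]])
ssy_demo = np.array([9.0, 1.0])   # component 1 explains most of Y

vip = vip_scores(W_demo, ssy_demo)
vip_single = vip_scores(W_demo[:, :1], ssy_demo[:1])
print(np.round(vip, 3))
print(np.round(vip_single, 3))
```

In this example the third variable has a large weight only on the second component, which explains little of Y, so its VIP stays well below that of the first two variables.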

In [23]:
pls_da.plot_model_parameters('w', component=1, sigma=2, cross_val=True, xaxis=ppm)
plt.gca().invert_xaxis()
plt.gca().set_xlabel('ppm')


Text(0.5, 0, 'ppm')

In [24]:
pls_da.plot_model_parameters('VIP', sigma=2, cross_val=True, xaxis=ppm)
plt.gca().invert_xaxis()
plt.gca().set_xlabel('ppm')



Text(0.5, 0, 'ppm')

In [25]:
pls_da.plot_model_parameters('beta', sigma=2, cross_val=True, xaxis=ppm)
plt.gca().invert_xaxis()
plt.gca().set_xlabel('ppm')


Text(0.5, 0, 'ppm')

Unfortunately, assessment of variable importance in PLS-DA/PLS multivariate models is not straightforward, given the multiple choice of parameters and their different interpretation, especially in models with more than 1 PLS component. To obtain a ranking of variables from the data matrix X associated with Y, we recommend starting with the weights $w$ of the first component, which contributes the most to $R^{2}Y$.

However, it must be mentioned that the weights of the first PLS component are equal to the normalized (so that the weight vector has norm equal to 1) vector of the univariate covariances estimated between each X column or variable, and the Y vector. This implies there is no advantage in using a PLS model and $w$ when compared to a series of univariate analyses for variable ranking and selection.

In [26]:
fig, ax = plt.subplots(1,2, figsize=(8, 5))
X_scaled = pls_da.x_scaler.transform(X)

cov_x_y = np.dot(Y1.T - Y1.mean(), X_scaled) / (Y1.shape[0]-1)
cov_x_y = cov_x_y/np.linalg.norm(cov_x_y)

ax[0].plot(cov_x_y, 'orange')
ax[1].plot(pls_da.weights_w[:, 0], 'green')
ax[0].set_xlabel('Normalised $Cov(X_{i}, Y)$')
ax[1].set_xlabel('$w$ for PLS component 1')
fig.show()


Another set of quantities which can be used to assess variable importance are the $\beta$ regression coefficients. However, as with other multivariate regression models, the final $\beta$ vector encodes information about the correlation structure of X and how it relates to Y, and the magnitude and sign of each $\beta$ coefficient express how to derive a "good" prediction of Y using X. Taking the magnitude of each $\beta$ and using it to rank variables can be misleading.

This does not necessarily mean that PLS should only be used as a predictive "black box" regressor/classifier and model interpretation avoided altogether. The strength of PLS for exploratory data analysis and interpretation resides in the latent variable projections. The scores $T$ or $U$ can be plotted and associated with other metadata variables, or even correlated or regressed against them, and the corresponding loading $p$ can be visualized to assess the signals which make up the latent variable signature.

For example, if we inspect the scores plot for components 2 and 3 it becomes apparent that although we have not added information about the Age covariate to the model, the PLS component number 3 seems to be associated with it. This hints that this component is accounting for some of the variability related with Age to improve the prediction. The loadings of this component can then be used to visualize which regions of the spectrum are correlated. 

**Note**: We recommend referring to loadings $p$ and not weights $w$ when interpreting latent variable signatures, especially in PLS components after the 1st component.


In [27]:
# Same model, but coloured by Age instead of Genotype
pls_da.plot_scores(comps=[1, 2], color=Y2, discrete=True)
pls_da.plot_model_parameters('p', component=2)


### Orthogonal PLS

The orthogonal PLS modeling technique can be used to assist interpretation of PLS latent variables.
After obtaining a reliable PLS model, we generate an Orthogonal PLS/PLS-DA model with the same number of components as the PLS model. In an orthogonal PLS model, the first component is called predictive, and the subsequent components "orthogonal" because they are uncorrelated to the response Y. Compared to the equivalent PLS model, Orthogonal PLS models move variation away from the loading vector $p$ of the first component and into subsequent components, which can aid in interpretation of the latent variables.

In [28]:
# Generate an Orthogonal PLS-DA version of the PLS-DA model fitted
orthogonal_pls_da = ChemometricsOrthogonalPLSDA(ncomps=5, xscaler=scaling_object_uv)
orthogonal_pls_da.fit(X, Y1)

The Orthogonal PLS model we just fitted has 1 predictive component and 4 orthogonal components. The predictive component encodes the information in X directly associated with Y. 
The orthogonal components can be investigated and associated with other known covariates, to assist in understanding the sources of variation that the PLS model/Orthogonal PLS model is "learning" from the data to improve the prediction of Y (measured by the $R^{2}Y$).
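The separation into predictive and orthogonal variation can be sketched with numpy for a single Y variable and one orthogonal component. The steps below follow the general idea of the published O-PLS algorithm on simulated data. All names are hypothetical, and this is an illustration under those assumptions, not the `ChemometricsOrthogonalPLSDA` implementation.

```python
import numpy as np

# Simulated data: variable 0 depends on y, variable 2 on a nuisance factor,
# variable 1 on both
rng = np.random.default_rng(3)
n, n_vars = 60, 5
y_bin = np.repeat([0.0, 1.0], n // 2)
nuisance = rng.normal(size=n)   # variation unrelated to y, e.g. another covariate
X_sim = rng.normal(scale=0.5, size=(n, n_vars))
X_sim[:, 0] += 2.0 * y_bin
X_sim[:, 1] += y_bin + 2.0 * nuisance
X_sim[:, 2] += 2.0 * nuisance

Xc = X_sim - X_sim.mean(axis=0)
yc = y_bin - y_bin.mean()

# Predictive weight, score and loading (as in single-component PLS)
w = Xc.T @ yc
w /= np.linalg.norm(w)
t = Xc @ w
p_load = Xc.T @ t / (t @ t)

# Orthogonal weight: the part of the loading that is not aligned with w
w_ortho = p_load - (w @ p_load) * w
w_ortho /= np.linalg.norm(w_ortho)
t_ortho = Xc @ w_ortho
p_ortho = Xc.T @ t_ortho / (t_ortho @ t_ortho)

# Remove the orthogonal variation from X before refitting the predictive part
X_filtered = Xc - np.outer(t_ortho, p_ortho)

print(np.round(np.corrcoef(t_ortho, nuisance)[0, 1], 3))
```

Because the orthogonal weight is built to be perpendicular to $w$, the resulting orthogonal score carries no covariance with Y. Plotting it against known covariates, as done in the cells that follow, helps to identify what that variation represents.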

In the following plot, we investigate the scores on the predictive ($T_{pred}$) and first orthogonal component ($T_{ortho[1]}$), coloured by genotype.


In [29]:
orthogonal_pls_da.plot_scores(color=Y1, orthogonal_component=1, discrete=True)


<AxesSubplot:xlabel='Tpred', ylabel='Tortho[1]'>

The analysis of orthogonal component 2 hints that age (Y2) contributes orthogonal variation to the data. Note: in the plot below, the data points are coloured by age (Y2).

In [30]:
orthogonal_pls_da.plot_scores(color=Y2, orthogonal_component=2, discrete=True, label_outliers=False)


<AxesSubplot:xlabel='Tpred', ylabel='Tortho[2]'>

The interpretation of the Orthogonal PLS score plots should be made using the predictive and orthogonal loading vectors ($p$) for all components. Only the weight vector $w$ for the predictive component should be evaluated.

In [31]:
orthogonal_pls_da.plot_model_parameters('p_pred', orthogonal_component = 1, xaxis=ppm)
plt.gca().invert_xaxis()
# 
orthogonal_pls_da.plot_model_parameters('p_ortho', orthogonal_component = 2, xaxis=ppm)
plt.gca().invert_xaxis()


## Permutation p-values for variable ranking

The permutation test we ran before is also useful to obtain permuted null distributions for most of the model parameters. These can be used to obtain empirical confidence intervals and potentially permutation *p-values* for hypothesis testing.

To illustrate this, the next cells generate histograms for the permuted distribution of the $w$ and $p$ for the first PLS component and regression coefficients for 2 randomly selected variables.
Notice the differences between the permuted null distributions of weights, loadings and regression coefficients. 

In [32]:
# Plot empirical null distributions for weights
plt.figure()
plt.hist(permt[0]['Weights_w'][:, 3000, 0], 100)
plt.title(r"Permuted null distribution for weights (w), component 1, {0} $\delta$ppm".format(ppm[3000]))
plt.show()

plt.figure()
plt.hist(permt[0]['Weights_w'][:, 10, 0], 100)
plt.title(r"Permuted null distribution for weights (w), component 1, {0} $\delta$ppm".format(ppm[10]))
plt.show()


In [33]:
# Plot empirical null distributions for loadings
# Notice how these are not unimodal and distributed around 0...
plt.figure()
plt.hist(permt[0]['Loadings_p'][:, 3000, 0], 100)
plt.title(r"Permuted null distribution for loadings (p), component 1, {0} $\delta$ppm".format(ppm[3000]))
plt.show()

plt.figure()
plt.hist(permt[0]['Loadings_p'][:, 10, 0], 100)
plt.title(r"Permuted null distribution for loadings (p), component 1, {0} $\delta$ppm".format(ppm[10]))
plt.show()


In [34]:
# Plot empirical null distributions for regression coefficients
plt.figure()
plt.hist(permt[0]["Beta"][:, 3000], 100)
plt.title(r"Permuted null distribution for $\beta$, {0} $\delta$ppm".format(ppm[3000]))
plt.show()

plt.figure()
plt.hist(permt[0]['Beta'][:, 10], 100)
plt.title(r"Permuted null distribution for $\beta$, {0} $\delta$ppm".format(ppm[10]))
plt.show()


Both the regression coefficients and weights have a null distribution centered around 0. Conversely, for the loadings, the center of the distribution is shifted. Loadings encode information about the variance and covariance (with the latent variable score) of each variable, and their magnitude is harder to interpret in terms of importance for prediction. The permutation performed in this manner does not change the correlation between variables in X, and therefore is not adequate to obtain permuted null distributions of the loading parameters.

We can now calculate empirical p-values for the regression coefficients...

In [35]:
# Always set *nperms* equal to the number of permutations used before
nperms = permt[0]['R2Y'].size
perm_indx = abs(permt[0]['Beta'].squeeze()) >= abs(pls_da.beta_coeffs.squeeze())
counts = np.sum(perm_indx, axis=0)
beta_pvals = (counts + 1) / (nperms + 1)

perm_indx_W = abs(permt[0]['Weights_w'][:, :, 0].squeeze()) >= abs(pls_da.weights_w[:, 0].squeeze())
counts = np.sum(perm_indx_W, axis=0)
w_pvals = (counts + 1) / (nperms + 1)

In [36]:
plt.figure()
plt.title(r"p-value distribution for the regression coefficients $\beta$ ")
z = plt.hist(beta_pvals, bins=100, alpha=0.8)
plt.axvline(x=0.05, color='r', linestyle='--')
plt.show()

plt.figure()
plt.title(r"p-value distribution for the weights corresponding to the first component")
z = plt.hist(w_pvals, bins=100, alpha=0.8)
plt.axvline(x=0.05, color='r', linestyle='--')
plt.show()


... and use the permutation test to obtain a list of statistically significant variables.

In [None]:
signif_bpls_idx = np.where(beta_pvals <= 0.05)[0]

print("Number of significant values: {0}".format(len(signif_bpls_idx)))

It is worth noting that a selection procedure of this kind is also a type of multiple testing, and it is recommended to apply false discovery rate or any other multiple testing correction to the *p-values* obtained in this manner. Also, formal inferential procedures to derive *p-values* and confidence intervals are not established for PLS models. Although *ad-hoc* solutions like a permutation test can be implemented as shown, some issues still remain - for example, the *p-value* distribution obtained for the regression coefficients is clearly non-uniform and care must be exercised when performing multiple testing correction or even interpreting the *p-values* obtained in this manner.
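As an example of such a correction, the Benjamini-Hochberg procedure converts a vector of *p-values* into false discovery rate adjusted values (often called *q-values*). The numpy sketch below is a plain implementation of the standard step-up procedure. The function name and the example *p-values* are hypothetical, and the caveats above about non-uniform permutation *p-values* still apply.

```python
import numpy as np

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the original order."""
    p = np.asarray(pvals, dtype=float)
    n = p.size
    order = np.argsort(p)
    # p-value * n / rank, for p-values sorted in increasing order
    ranked = p[order] * n / np.arange(1, n + 1)
    # enforce monotonicity, starting from the largest p-value
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(ranked, 0, 1)
    return adjusted

example_pvals = np.array([0.01, 0.04, 0.03, 0.005])
example_qvals = benjamini_hochberg(example_pvals)
print(example_qvals)

# Variables retained at a 5% false discovery rate
selected = np.where(example_qvals <= 0.05)[0]
print(selected)
```

The same function could be applied to a vector of permutation *p-values* before thresholding, in place of the uncorrected 0.05 cut-off.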

The latent variable projection and dimensionality reduction provided by PLS/PLS-DA can be very useful to visualize general trends in the data. However, interpreting which variables are important to the model and how they contribute to the explanation/separation between classes is not easy. We suggest complementing the inspection of multivariate model parameters with univariate analysis.

### Comparison between variables highlighted in a multivariate PLS-DA analysis and in a univariate analysis

The following cells should be run after completing the analyses described in the **Univariate Analysis** Jupyter Notebook.

Load the results of an equivalent univariate analysis of associations between metabolic signals and genotype.

In [None]:
#load the results of the univariate testing procedure
univ_gen = pds.read_csv('./data/UnivariateAnalysis_Genotype.csv')

# Select significant peaks from univariate analysis 
signif = np.where(univ_gen['genotype_q-value'] < 0.05)[0]

We then plot the overlap between the PLS-DA classifier for Genotype and the results of a univariate analysis against genotype.

In [None]:
# Variables significant for association with genotype in both the PLS analysis and linear regression
common_idx = np.intersect1d(signif_bpls_idx, signif)
# Variables significant only in PLS
pls_idx = np.setdiff1d(signif_bpls_idx, signif)
# Variables significant only for linear regression
reg_idx = np.setdiff1d(signif, signif_bpls_idx)

In [None]:
plt.figure()
plt.plot(ppm, X.mean(axis=0))
#plt.scatter(ppm[signif], X.mean(axis=0)[signif], c='red', s=30)
plt.scatter(ppm[reg_idx], X.mean(axis=0)[reg_idx], c='red', s=30)
plt.scatter(ppm[pls_idx], X.mean(axis=0)[pls_idx], c='orange', s=30)
plt.scatter(ppm[common_idx], X.mean(axis=0)[common_idx], c='green', s=30)
plt.gca().invert_xaxis()
# Legend entries follow the order in which the series were plotted above
plt.legend(['Mean Spectrum', 'Linear regression only', 'PLS only', 'Both'])
plt.show()