<div style="text-align: justify; padding:5px; background-color:rgb(252, 253, 255); border: 1px solid lightgrey; padding-left: 1em; padding-right: 1em;">
    <font color='red'>To begin: Click anywhere in this cell and press <kbd>Run</kbd> on the menu bar. This executes the current cell and then highlights the next cell. There are two types of cell: a <i>text cell</i> and a <i>code cell</i>. When you <kbd>Run</kbd> a text cell (<i>we are in a text cell now</i>), you advance to the next cell without executing any code. When you <kbd>Run</kbd> a code cell (<i>identified by <span style="font-family: courier; color:black; background-color:white;">In[ ]:</span> to the left of the cell</i>) you advance to the next cell after executing all the Python code within that cell. Any visual results produced by the code (text/figures) are reported directly below that cell. Press <kbd>Run</kbd> again. Repeat this process until the end of the notebook. <b>NOTE:</b> All the cells in this notebook can be automatically executed sequentially by clicking <kbd>Kernel</kbd><font color='black'>→</font><kbd>Restart and Run All</kbd>. Should anything crash, restart the Jupyter Kernel by clicking <kbd>Kernel</kbd><font color='black'>→</font><kbd>Restart</kbd>, and start again from the top.</font>
        
</div>



<div style="background-color:rgb(255, 250, 250); padding:5px; padding-left: 1em; padding-right: 1em;">
<img src="images/logo_text.png" width="200px" align="right">

<h1 id="tutorial2interactivemetabolomicsdataanalysisworkflow" style="text-align: justify">Tutorial 2: Interactive Metabolomics Data Analysis Workflow</h1>

<p style="text-align: justify"><br>
<br>
<br>
<br>
The functionality of this notebook is identical to Tutorial 1, but now the text cells have been expanded into a comprehensive interactive tutorial. As before, text cells provide the metabolomics context and describe the purpose of the code in the following code cell; however, this has now been simplified to avoid complete repetition of Tutorial 1. Additional coloured text boxes are now placed throughout the workflow to help novice users navigate and understand the interactive principles of a Jupyter Notebook.
<br><br></p>
</div>

<div style="background-color:rgb(255,210,210); padding:5px;  border: 10px solid rgb(255, 250, 250); padding-right: 1em;">
<img align="left" width="40" src="images/cog2.png">
<div style="padding-left:80px; text-align: justify">
<b style="text-align: justify">Red boxes (cog icon) provide suggestions for changing the functionality of the subsequent code cell by editing (or substituting) one or more lines of code.</b><br><br>
</div></div>

<div style="background-color:rgb(210,250,210); padding:5px;  border: 10px solid rgb(255, 250, 250); padding-right: 1em;">
<img align="left" width="40" src="images/mouse.png">
<div style="padding-left:80px; text-align: justify">
<b style="text-align: justify"> Green boxes (mouse icon) provide suggestions for interacting with the visual results generated by a code cell. For example, the first green box in the notebook describes how to sort and colour data in the embedded data tables.</b><br>
</div></div>

<div style="background-color:rgb(210,250,255); padding:5px;  border: 10px solid rgb(255, 250, 250); border-bottom: 10px solid rgb(255, 250, 250); padding-right: 1em;">
<img align="left" width="40" src="images/bulb.png">
<div style="padding-left:80px; text-align: justify">
<b style="text-align: justify">Blue boxes (lightbulb icon) provide further information about the theoretical reasoning behind a block of code or visualisation. This information is not essential to understand Jupyter notebooks but may be of general educational utility and interest to new metabolomics data scientists.</b><br>
</div></div>



<div style="background-color:rgb(255, 250, 250); padding:2px; padding-left: 1em; padding-right: 1em;">
    
<h2 id="1importpackagesmodules" style="text-align: justify">1. Import Packages/Modules</h2>

<p style="text-align: justify">The first code cell of this tutorial (below this text box) imports <a href="https://docs.python.org/3/tutorial/modules.html"><em>packages</em> and <em>modules</em></a> into the Jupyter environment. <em>Packages</em> and <em>modules</em> provide additional functions and tools that extend the basic functionality of the Python language.
<br></p>
</div>

<div style="background-color:rgb(210,250,255); padding:2px; border: 20px solid rgb(255, 250, 250); padding-right: 1em;">
<img align="left" width="80" src="images/bulb.png">
<div style="padding-left:80px; text-align: justify">

<ul>
<li style="text-align: justify">All the code embedded in this example notebook is written using the Python programming language (<a href="http://www.python.org">python.org</a>) and is based upon extensions of popular open source packages with high levels of support. 
    
<em>Note:</em> a tutorial on the Python programming language itself is beyond the scope of this notebook. For more information on using Python and Jupyter Notebooks please refer to the excellent 
<a href="https://mybinder.org/v2/gh/jakevdp/PythonDataScienceHandbook/master?filepath=notebooks%2FIndex.ipynb">Python Data Science Handbook (Jake VanderPlas, 2016)</a>, which is in itself a Jupyter Notebook deployed via <a href="https://mybinder.org">Binder</a>.</li>
</ul>
</div> </div>



In [None]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split

import cimcb_lite as cb

print('All packages successfully loaded')

<div style="background-color:rgb(255, 250, 250); padding:5px; padding-left: 1em; padding-right: 1em;">

<h2 id="2loaddataandpeaksheet" style="text-align: justify">2. Load Data and Peak sheet</h2>

<p style="text-align: justify">The code cell below loads the <em>Data</em> and <em>Peak</em> sheets from an Excel file, using the CIMCB helper function <code>load_dataXL()</code>. When this is complete, you should see confirmation that Peak (stored in the <code>Peak</code> worksheet in the Excel file) and Data (stored in the <code>Data</code> worksheet in the Excel file) tables have been loaded.<br></p>
</div>

<div style="background-color:rgb(255,210,210); padding:2px; border: 20px solid rgb(255, 250, 250); padding-right: 1em;">
<img align="left" width="80" src="images/cog2.png">
<div style="padding-left:80px; text-align: justify"> 

<ul>
<li style="text-align: justify">There is a second dataset included with this tutorial which has been converted to standardised <a href="https://en.wikipedia.org/wiki/Tidy_data">Tidy Data</a> format. These data were previously published by <a href="https://link.springer.com/article/10.1007%2Fs11306-016-1059-9">Gardlo et al. (2016)</a> in <i>Metabolomics</i>. 
    
Urine samples collected from newborns with perinatal asphyxia were analysed using a Dionex UltiMate 3000 RS system coupled to a triple quadrupole QTRAP 5500 tandem mass spectrometer. The deconvoluted and annotated file is deposited at the <a href="https://www.ebi.ac.uk/metabolights/">Metabolights</a> data repository (Project ID <a href="https://www.ebi.ac.uk/metabolights/MTBLS290">MTBLS290</a>). 

Please inspect the <a href="MTBLS290db.xlsx">Excel file</a> before using it in this tutorial. To change the data set to be loaded into the notebook, replace <code>filename = 'GastricCancer_NMR.xlsx'</code> with <code>filename = 'MTBLS290db.xlsx'</code>, and press <mark><kbd>Run</kbd></mark> on the menu bar.

<b>Note: if you change the name of the file in this code cell, you will also have to make changes to <a href=#5>Section 5</a> and <a href=#6>Section 6</a> (as indicated in the text cell above each) for the correct models to be built. It is probably best to come back to this exercise after finishing an initial walk-through of the complete tutorial using the default data set.</b></li>
</ul>
</div></div>

In [None]:
# The path to the input file (Excel spreadsheet)
filename = 'GastricCancer_NMR.xlsx'
#filename = 'MTBLS290db.xlsx'

# Load Peak and Data tables into two variables
dataTable, peakTable = cb.utils.load_dataXL(filename, DataSheet='Data', PeakSheet='Peak') 

<div style="background-color:rgb(255, 250, 250); padding:2px; padding-left: 1em; padding-right: 1em;">
<h3 id="21displaythedatatable" style="text-align: justify">2.1 Display the <code>Data</code> table</h3>

<p style="text-align: justify">The <code>dataTable</code> table can be displayed interactively so we can inspect and check the imported values. To do this, we use the <code>display()</code> function.
<br></p>
</div>

<div style="background-color:rgb(210,250,210); padding:2px; border: 20px solid rgb(255, 250, 250); padding-right: 1em;">
<img align="left" width="60" src="images/mouse.png">
<div style="padding-left:80px; text-align: justify">

<ul>
<li style="text-align: justify">Scroll up/down &amp; left/right using the scroll bars</li>
<li style="text-align: justify">Click on any column header to sort by that column (sort alternates between ascending and descending order)</li>
<li style="text-align: justify">Click on the left side of a header column for further options 
<ul>
<li style="text-align: justify">for column <b>Class</b> click on <i>'color by unique'</i></li>
<li style="text-align: justify">for column <b>SampleType</b> click on <i>'sort ascending'</i> to group all the <em>QC</em> samples together.</li></ul>
</li>
<li style="text-align: justify">Click on column header <b>index</b> to sort back into the original order.</li>
</ul>
</div></div>

In [None]:
display(dataTable) # View and check the dataTable 

<div style="background-color:rgb(255, 250, 250); padding:5px; padding-left: 1em; padding-right: 1em;">

<h3 id="22displaythepeaksheet" style="text-align: justify">2.2. Display the <code>Peak</code> sheet</h3>

<p style="text-align: justify">The <code>peakTable</code> table can be displayed interactively in the same way.
<br></p>
</div>

<div style="background-color:rgb(210,250,210); padding:2px; border: 20px solid rgb(255, 250, 250); padding-right: 1em;">
<img align="left" width="60" src="images/mouse.png">
<div style="padding-left:80px"> 

<ul>
<li style="text-align: justify">Click on the column header <strong>QC_RSD</strong> to sort the peaks by ascending value</li>
<li style="text-align: justify">Click on the left edge of the column header <strong>QC_RSD</strong> and select <em>'heatmap'</em></li>
<li style="text-align: justify">Scroll up/down to see how the "quality" of the peaks increases/decreases</li>
</ul>
</div></div>

In [None]:
display(peakTable) # View and check PeakTable

<div style="background-color:rgb(255, 250, 250); padding:10px;  padding-left: 1em; padding-right: 1em;">

<h2 id="3datacleaning" style="text-align: justify">3. Data Cleaning</h2>
</div>
<div style="background-color:rgb(255,210,210); padding:2px; border-top: 20px solid rgb(255, 250, 250); border-left: 20px solid rgb(255, 250, 250); border-right: 20px solid rgb(255, 250, 250); padding-right: 1em;">
<img align="left" width="60" src="images/cog2.png">
<div style="padding-left:80px"> 
    
<ul>
<li style="text-align: justify">Replace the code: <code>peakTableClean = peakTable[(rsd &lt; 20) &amp; (percMiss &lt; 10)]</code> with: <code>peakTableClean = peakTable[(rsd &lt; 10) &amp; (percMiss &lt; 5)]</code>. In doing this you will see the effect of making the data cleaning criteria more stringent. This will change the number of 'clean' metabolites.</li>
</ul>
</div></div>
<div style="background-color:rgb(210,250,255); padding:2px; border: 20px solid rgb(255, 250, 250); padding-right: 1em;">
<img align="left" width="40" src="images/bulb.png">
<div style="padding-left:80px; text-align: justify">    
<ul>
<li style="text-align: justify"><b>Note: Changing the number of clean metabolites will significantly change the outputs from all subsequent code cells.</b><br> So be sure to click on <mark><kbd>Cell</kbd></mark><font color='black'>→</font><mark><kbd>Run All Below</kbd></mark> then scroll down the notebook to see how changing this setting has changed all the cell outputs.</li>
</ul>
</div></div>

In [None]:
# Create a clean peak table 

rsd = peakTable['QC_RSD']  
percMiss = peakTable['Perc_missing']  
peakTableClean = peakTable[(rsd < 20) & (percMiss < 10)]   

print("Number of peaks remaining: {}".format(len(peakTableClean)))
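
<div style="background-color:rgb(255, 250, 250); padding:10px; padding-left: 1em; padding-right: 1em;">
<p style="text-align: justify">The effect of tightening the cleaning thresholds can be tried out on a small made-up table before changing the real code cell. The sketch below is illustrative only: the column names mirror the <code>Peak</code> sheet, but the values, the table <code>toy_peaks</code> and the helper <code>clean_peaks</code> are hypothetical and not part of the tutorial data or the CIMCB package.<br></p>
</div>

```python
import pandas as pd

# Toy peak table; column names mirror the tutorial, the values are made up.
toy_peaks = pd.DataFrame({
    "Name": ["M1", "M2", "M3", "M4", "M5"],
    "QC_RSD": [5.0, 12.0, 18.0, 25.0, 8.0],
    "Perc_missing": [0.0, 4.0, 12.0, 2.0, 7.0],
})


def clean_peaks(table, max_rsd, max_missing):
    """Keep rows whose QC_RSD and Perc_missing are both below the thresholds."""
    keep = (table["QC_RSD"] < max_rsd) & (table["Perc_missing"] < max_missing)
    return table[keep]


default_clean = clean_peaks(toy_peaks, max_rsd=20, max_missing=10)
strict_clean = clean_peaks(toy_peaks, max_rsd=10, max_missing=5)

print("Default thresholds keep:", list(default_clean["Name"]))
print("Strict thresholds keep: ", list(strict_clean["Name"]))
```

<div style="background-color:rgb(255, 250, 250); padding:10px; padding-left: 1em; padding-right: 1em;">
<p style="text-align: justify">Both conditions must hold for a peak to be kept, so a peak with an acceptable <code>QC_RSD</code> is still removed if too many of its values are missing.<br></p>
</div>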

<div style="background-color:rgb(255, 250, 250); padding:10px; padding-left: 1em; padding-right: 1em;">

<h2 id="4pcaqualityassesment" style="text-align: justify">4. PCA - Quality Assessment</h2>

<p style="text-align: justify">To provide a multivariate assessment of the quality of the cleaned data set, it is good practice to perform a simple <a href="https://en.wikipedia.org/wiki/Principal_component_analysis">Principal Component Analysis</a> (PCA) after suitable <a href="https://doi.org/10.1186/1471-2164-7-142">transformation &amp; scaling</a>. The PCA score plot is typically labelled by sample type (i.e. quality control (QC) or biological sample (Sample)). Data of high quality will have QCs that cluster tightly compared to the biological samples (<a href="https://link.springer.com/article/10.1007/s11306-018-1367-3">Broadhurst <em>et al.</em> 2018</a>).<br><br></p>

</div>

<div style="background-color:rgb(210,250,210); padding:2px;border-top: 20px solid rgb(255, 250, 250); border-left: 20px solid rgb(255, 250, 250); border-right: 20px solid rgb(255, 250, 250); padding-right: 1em;">
<img align="left" width="80" src="images/mouse.png">
<div style="padding-left:80px"> 

<ul>
<li style="text-align: justify">Hover over points in the PCA Score Plot to reveal corresponding sample information ('IDX' and 'SampleType'). </li>

<li style="text-align: justify">Hover over points in the PCA Loading Plot to reveal corresponding metabolite information ('Name','Label', and 'QC_RSD'). </li>

<li style="text-align: justify">In the menu at the top right corner of the figure click on the 'disk' icon to save the images.</li>

<li style="text-align: justify">In the menu at the top right corner of the figure click on the 'magnifying glass' icon to select a zoom area.</li>
</ul>

</div></div>
<div style="background-color:rgb(255,210,210); padding:2px; border-top: 20px solid rgb(255, 250, 250); border-left: 20px solid rgb(255, 250, 250); border-right: 20px solid rgb(255, 250, 250); padding-right: 1em;">
<img align="left" width="80" src="images/cog2.png">
<div style="padding-left:80px"> 


<ul>
<li style="text-align: justify">Replace the code: <code>Xscale = cb.utils.scale(Xlog, method='auto')</code> with: <code>Xscale = cb.utils.scale(Xlog, method='pareto')</code>. This will change the type of X column scaling.</li>

<li style="text-align: justify">In the PCA function call <code>cb.plot.pca</code> replace the code: <code>pcy=2</code> with: <code>pcy=3</code> to change the plot from (PC1 vs. PC2) to (PC1 vs. PC3)</li>

<li style="text-align: justify">Replace the code: <code>group_label=dataTable['SampleType']</code> with: <code>group_label=dataTable['Class']</code>. The PCA scores plot will now be grouped by the data in  column <code>Class</code> of the <code>dataTable</code>.</li>
</ul>
</div></div>

<div style="background-color:rgb(210,250,255); padding:2px; border: 20px solid rgb(255, 250, 250); padding-right: 1em;">
<img align="left" width="40" src="images/bulb.png">
<div style="padding-left:80px; text-align: justify">


<ul>
<li style="text-align: justify">There are five types of scaling supported by the function <code>cb.utils.scale</code>: <code>'auto'</code>, <code>'range'</code>, <code>'pareto'</code>, <code>'vast'</code>, and <code>'level'</code>. In the context of metabolomics these are comprehensively reviewed by <a href="https://dx.doi.org/10.1186%2F1471-2164-7-142">van den Berg <em>et al.</em> 2006</a>.</li>
</ul>
</div>
</div>

In [None]:
# Extract and scale the metabolite data from the dataTable 

peaklist = peakTableClean['Name']                   # Set peaklist to the metabolite names in the peakTableClean
X = dataTable[peaklist].values                      # Extract X matrix from dataTable using peaklist
Xlog = np.log10(X)                                  # Log scale (base-10)
Xscale = cb.utils.scale(Xlog, method='auto')        # methods include auto, range, pareto, vast, and level
Xknn = cb.utils.knnimpute(Xscale, k=3)              # missing value imputation (knn - 3 nearest neighbors)

print("Xknn: {} rows & {} columns".format(*Xknn.shape))

cb.plot.pca(Xknn,
            pcx=1,                                                  # pc for x-axis
            pcy=2,                                                  # pc for y-axis
            group_label=dataTable['SampleType'])                    # labels used to group points in the PCA scores plot
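
<div style="background-color:rgb(255, 250, 250); padding:10px; padding-left: 1em; padding-right: 1em;">
<p style="text-align: justify">To make the difference between the five scaling methods concrete, the sketch below implements them in plain NumPy using the definitions given by van den Berg <em>et al.</em> 2006. It is a minimal illustration, not the code behind <code>cb.utils.scale</code>: the function <code>scale_columns</code> and the matrix <code>X_scale_demo</code> are hypothetical, the sample standard deviation (<code>ddof=1</code>) is an assumption, and the sketch assumes there are no missing values.<br></p>
</div>

```python
import numpy as np


def scale_columns(X, method="auto"):
    """Centre each column, then divide by a method-specific factor."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=1)          # sample standard deviation (assumption)
    centred = X - mean
    if method == "auto":                 # unit variance
        return centred / std
    if method == "range":                # divide by (max - min)
        return centred / (X.max(axis=0) - X.min(axis=0))
    if method == "pareto":               # divide by sqrt(standard deviation)
        return centred / np.sqrt(std)
    if method == "vast":                 # autoscale, then weight by mean/std
        return (centred / std) * (mean / std)
    if method == "level":                # divide by the column mean
        return centred / mean
    raise ValueError("unknown method: {}".format(method))


# Two toy 'metabolites' with very different spreads
X_scale_demo = np.array([[1.0, 100.0],
                         [2.0, 300.0],
                         [3.0, 500.0]])

for method in ["auto", "range", "pareto", "vast", "level"]:
    scaled = scale_columns(X_scale_demo, method)
    print(method, np.round(scaled.std(axis=0, ddof=1), 3))
```

<div style="background-color:rgb(255, 250, 250); padding:10px; padding-left: 1em; padding-right: 1em;">
<p style="text-align: justify">The printed values are the column standard deviations after scaling. Autoscaling gives every column the same spread, whereas Pareto scaling leaves the high-variance column with a larger spread than the low-variance one.<br></p>
</div>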

<div style="background-color:rgb(255, 250, 250); padding:10px; padding-left: 1em; padding-right: 1em;">
<a id='5'></a>
<h2 id="5univariatestatisticsforcomparisonofgastriccancergcvshealthycontrolshe" style="text-align: justify">5. Univariate Statistics for comparison of Gastric Cancer (<code>GC</code>) vs Healthy Controls (<code>HE</code>)</h2>

<p style="text-align: justify">The data set uploaded into <code>dataTable</code> describes the <sup>1</sup>H-NMR urine metabolite profiles of individuals classified into three distinct groups: <code>GC</code> (gastric cancer), <code>BN</code> (benign), and <code>HE</code> (healthy). For this specific workflow we are interested in comparing only the differences in profiles between individuals classified as <code>GC</code> and <code>HE</code>.
<br><br></p>
</div>

<div style="background-color:rgb(210,250,210); padding:2px; border-top: 20px solid rgb(255, 250, 250); border-left: 20px solid rgb(255, 250, 250); border-right: 20px solid rgb(255, 250, 250); padding-right: 1em;">
<img align="left" width="80" src="images/mouse.png">
<div style="padding-left:80px">

<ul>
<li style="text-align: justify">Scroll up/down using the scroll bars.</li>
<li style="text-align: justify">Click on the column header to sort by that column (sort alternates between ascending and descending order).</li>
<li style="text-align: justify">Click on the left side of a header column for further options, e.g.:
<ul>
    <li> For column <b>TtestStat</b> click on <b>Data Bars</b>.</li>
    <li> For column <b>ShapiroPvalue</b> click on <b>Format -> exponential 5</b> (converts to scientific notation). </li>
    </ul></li>
</ul>
</div></div>


<div style="background-color:rgb(255,210,210); padding:2px; border-top: 20px solid rgb(255, 250, 250); border-left: 20px solid rgb(255, 250, 250); border-right: 20px solid rgb(255, 250, 250); padding-right: 1em;">
<img align="left" width="80" src="images/cog2.png">
<div style="padding-left:80px; text-align: justify">


<ul>
<li style="text-align: justify">For data set <strong><em>GastricCancer_NMR.xlsx</em></strong> replace the code: <code>dataTable[(dataTable.Class == "GC") | (dataTable.Class == "HE")]</code> with: <br> <code>dataTable[(dataTable.Class == "BN") | (dataTable.Class == "HE")]</code> and replace <code>pos_outcome = "GC"</code> with: <code>pos_outcome = "BN"</code>. This will allow you to perform a 2-class statistical comparison between the patients with benign tumors and healthy controls.<br></li>

<li style="text-align: justify"><strong>OR</strong> for data set <strong><em>MTBLS290db.xlsx</em></strong> replace the code: <code>dataTable[(dataTable.Class == "GC") | (dataTable.Class == "HE")]</code>  with: <code>dataTable[(dataTable.Class == "Patient") | (dataTable.Class == "Control")]</code> and replace <code>pos_outcome = "GC"</code> with: <code>pos_outcome = "Patient"</code>. You will now perform a 2-class statistical comparison between the unhealthy patients and healthy controls.<br></li>

<li style="text-align: justify">In the statistical function call <code>cb.utils.univariate_2class</code> replace the code: <code>parametric=True</code> with: <code>parametric=False</code> to change the statistical test to a non-parametric Wilcoxon rank-sum test.</li>
</ul>
</div></div>

<div style="background-color:rgb(210,250,255); padding:2px;  border: 20px solid rgb(255, 250, 250); padding-right: 1em;">
<img align="left" width="60" src="images/bulb.png">
<div style="padding-left:80px; text-align: justify">


<ul>
<li style="text-align: justify"><b>Note: Changing the outcome comparison will significantly affect the output of subsequent code cells.</b><br> So be sure to click on <mark><kbd>Cell</kbd></mark><font color='black'>→</font><mark><kbd>Run All Below</kbd></mark> then scroll down the notebook to see how changing this setting has changed all the cell outputs.</li>
</ul>
</div></div>

In [None]:
# Select subset of Data for statistical comparison
dataTable2 = dataTable[(dataTable.Class == "GC") | (dataTable.Class == "HE")]  # Reduce data table only to GC and HE class members
pos_outcome = "GC" 

# Calculate basic statistics and create a statistics table.
statsTable = cb.utils.univariate_2class(dataTable2,
                                        peakTableClean,
                                        group='Class',                # Column used to determine the groups
                                        posclass=pos_outcome,         # Value of posclass in the group column
                                        parametric=True)              # Set parametric = True or False

# View and check StatsTable
display(statsTable)
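
<div style="background-color:rgb(255, 250, 250); padding:10px; padding-left: 1em; padding-right: 1em;">
<p style="text-align: justify">The difference between a parametric and a non-parametric two-class comparison can be illustrated for a single metabolite with SciPy. This is a sketch under stated assumptions: the data are randomly generated, SciPy is assumed to be installed, and it is not claimed that <code>cb.utils.univariate_2class</code> calls these exact functions. The Mann-Whitney U test used here is equivalent to the Wilcoxon rank-sum test.<br></p>
</div>

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Made-up concentrations of one metabolite in two groups
positive_group = rng.normal(loc=12.0, scale=2.0, size=30)   # e.g. cases
negative_group = rng.normal(loc=10.0, scale=2.0, size=30)   # e.g. controls

# Parametric: Student's t-test (assumes normality and equal variances)
t_stat, t_p = stats.ttest_ind(positive_group, negative_group)

# Non-parametric: Mann-Whitney U test (rank based, no normality assumption)
u_stat, u_p = stats.mannwhitneyu(positive_group, negative_group,
                                 alternative="two-sided")

print("t-test:         statistic = {:.3f}, p = {:.4g}".format(t_stat, t_p))
print("Mann-Whitney U: statistic = {:.1f}, p = {:.4g}".format(u_stat, u_p))
```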

<a id='7'></a>
<div style="background-color:rgb(255, 250, 250); padding:10px; padding-left: 1em; padding-right: 1em;">
</div>

<div style="background-color:rgb(255,210,210); padding:2px;  border: 20px solid rgb(255, 250, 250); padding-right: 1em;">
<img align="left" width="50" src="images/cog2.png">
<div style="padding-left:80px">

<ul>
<li style="text-align: justify">Replace the filename <code>"stats.xlsx"</code> with <code>"my_stats.xlsx"</code><br></li>
<li style="text-align: justify">AND/OR replace <code>sheet_name='StatsTable'</code> with <code>sheet_name='myStatsTable'</code><br></li>
</ul>
</div></div>

In [None]:
# Save StatsTable to Excel
statsTable.to_excel("stats.xlsx", sheet_name='StatsTable', index=False)
print("done!")

<div style="background-color:rgb(255, 250, 250); padding:10px; padding-left: 1em; padding-right: 1em;">
<p><a id='6'></a></p>
    
<h2 id="6machinelearning">6. Machine Learning</h2>

<p style="text-align: justify">The remainder of this tutorial will describe the use of a 2-class <a href="https://en.wikipedia.org/wiki/Partial_least_squares_regression">Partial Least Squares</a>-<a href="https://doi.org/10.1002/cem.713">Discriminant Analysis</a> (PLS-DA) model to identify metabolites which, when combined in a <a href="https://en.wikipedia.org/wiki/Linear_equation">linear equation</a>, are able to classify unknown samples as either <code>GC</code> or <code>HE</code> with a measurable degree of certainty.</p>


<h3 id="61splittingdataintotrainingandtestsets" style="text-align: justify">6.1 Splitting data into Training and Test sets.</h3>
</div>

<div style="background-color:rgb(255,210,210); padding:2px; border-top: 20px solid rgb(255, 250, 250); border-left: 20px solid rgb(255, 250, 250); border-right: 20px solid rgb(255, 250, 250); padding-right: 1em;">
<img align="left" width="80" src="images/cog2.png">
<div style="padding-left:80px">


<ul>
<li style="text-align: justify">If you have changed the comparison groups in the default data to benign tumors (BN) vs. healthy controls (HE) then replace the code: <code>outcome == 'GC'</code> with: <code>outcome == 'BN'</code>.</li>

<li style="text-align: justify">For data set <strong><em>MTBLS290db.xlsx</em></strong> replace the code: <code>outcome == 'GC'</code> with: <code>outcome == 'Patient'</code>.</li>

<li style="text-align: justify">Replace the code: <code>test_size=0.25</code> with: <code>test_size=0.1</code> in the call to <code>train_test_split</code>. This will decrease the number of samples in the test set. How does this affect the results?</li>
</ul>
</div></div>

<div style="background-color:rgb(210,250,255); padding:2px;  border: 20px solid rgb(255, 250, 250); padding-right: 1em;">
<img align="left" width="80" src="images/bulb.png">
<div style="padding-left:80px; text-align: justify">


<ul>
<li style="text-align: justify"><b>Note: If you change any of the code in the following Machine Learning sections you will change the performance of all the subsequent code cells.</b><br> So be sure to click on <mark><kbd>Cell</kbd></mark><font color='black'>→</font><mark><kbd>Run All Below</kbd></mark> then scroll down the notebook to see how changing this setting has changed all the cell outputs.</li>
</ul>
</div></div>

In [None]:
# Create a binary Y vector for stratifying the samples
outcomes = dataTable2['Class']                                  # Column that corresponds to Y class (should be 2 groups)
Y = [1 if outcome == 'GC' else 0 for outcome in outcomes]       # Change Y into binary (GC = 1, HE = 0)  
Y = np.array(Y)                                                 # Convert the list of 0/1 values into a NumPy array

# Split dataTable2 and Y into train and test (with stratification)
dataTrain, dataTest, Ytrain, Ytest = train_test_split(dataTable2, Y, test_size=0.25, stratify=Y, random_state=10)

print("DataTrain = {} samples with {} positive cases.".format(len(Ytrain),sum(Ytrain)))
print("DataTest = {} samples with {} positive cases.".format(len(Ytest),sum(Ytest)))
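
<div style="background-color:rgb(255, 250, 250); padding:10px; padding-left: 1em; padding-right: 1em;">
<p style="text-align: justify">Stratification keeps the proportion of positive cases the same in the training and test sets. The sketch below shows this on made-up labels using the same <code>train_test_split</code> arguments as the code cell above. The arrays <code>y_split_demo</code> and <code>X_split_demo</code> are hypothetical placeholders, not tutorial data.<br></p>
</div>

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 40 samples, 16 positive (1) and 24 negative (0)
y_split_demo = np.array([1] * 16 + [0] * 24)
X_split_demo = np.arange(40).reshape(-1, 1)   # placeholder feature matrix

X_tr, X_te, y_tr, y_te = train_test_split(
    X_split_demo, y_split_demo,
    test_size=0.25, stratify=y_split_demo, random_state=10)

print("Overall positive fraction: {:.2f}".format(y_split_demo.mean()))
print("Train: {} samples, positive fraction {:.2f}".format(len(y_tr), y_tr.mean()))
print("Test:  {} samples, positive fraction {:.2f}".format(len(y_te), y_te.mean()))
```

<div style="background-color:rgb(255, 250, 250); padding:10px; padding-left: 1em; padding-right: 1em;">
<p style="text-align: justify">Changing <code>random_state</code> changes which samples end up in each set, but with <code>stratify</code> the class proportions stay the same.<br></p>
</div>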

<div style="background-color:rgb(255, 250, 250); padding:10px; padding-left: 1em; padding-right: 1em;">
  
<h3 id="62determineoptimalnumberofcomponentsforplsdamodel" style="text-align: justify">6.2. Determine optimal number of components for PLS-DA model</h3>

<p style="text-align: justify">In this section, we will perform 5-fold cross-validation using the training set we created above (<code>dataTrain</code>) to determine the optimal number of components to use in our PLS-DA model. First, we extract and scale the training data in <code>dataTrain</code> the same way as we did for PCA quality assessment in section 4 (log-transformation, scaling, and k-nearest-neighbour imputation of missing values).<br></p>
</div>

<div style="background-color:rgb(255,210,210); padding:2px;  border: 20px solid rgb(255, 250, 250);  padding-right: 1em;">
<img align="left" width="80" src="images/cog2.png">
<div style="padding-left:80px"> 


<ul>
<li style="text-align: justify">Replace the code: <code>cb.utils.scale(XTlog, method='auto')</code> with: <code>cb.utils.scale(XTlog, method='pareto')</code>. This will change the type of X column scaling.</li>

<li style="text-align: justify">Replace the code: <code>cb.utils.scale(XTlog, method='auto')</code> with: <code>cb.utils.scale(XT, method='auto')</code>. This change will ignore the log-transformed data (<code>XTlog</code>) and scale the raw <code>XT</code> data instead (thus missing out the log transformation step of the data preprocessing).</li>
</ul>
</div></div>

In [None]:
# Extract and scale the metabolite data from the dataTable
peaklist = peakTableClean['Name']                           # Set peaklist to the metabolite names in the peakTableClean
XT = dataTrain[peaklist]                                    # Extract X matrix from dataTrain using peaklist
XTlog = np.log10(XT)                                        # Log scale (base-10), as in Section 4
XTscale = cb.utils.scale(XTlog, method='auto')              # methods include auto, range, pareto, vast, and level
XTknn = cb.utils.knnimpute(XTscale, k=3)                    # missing value imputation (knn - 3 nearest neighbors)

<div style="background-color:rgb(255, 250, 250); padding:10px; padding-left: 1em; padding-right: 1em;">

<p style="text-align: justify">We use the <code>cb.cross_val.kfold()</code> helper function to carry out 5-fold cross-validation of a set of PLS-DA models configured with different numbers of latent variables.<br></p>
</div>

<div style="background-color:rgb(255,210,210); padding:2px;  border-top: 20px solid rgb(255, 250, 250); border-left: 20px solid rgb(255, 250, 250); border-right: 20px solid rgb(255, 250, 250);  padding-right: 1em;">
<img align="left" width="80" src="images/cog2.png">
<div style="padding-left:80px"> 


<ul>
<li style="text-align: justify">Replace the code: <code>param_dict={'n_components': [1,2,3,4,5,6]}</code> with: <code>param_dict={'n_components': [1,2,3,4,5,6,7,8,9,10]}</code>. This will extend the search so that PLS-DA models with 1 to 10 latent variables are built and compared.</li>

<li style="text-align: justify">Replace the code: <code>folds=5</code> with: <code>folds=10</code>. This will change the number of folds in the k-fold cross validation.</li>

<li style="text-align: justify">Replace the code: <code>bootnum=100</code> with: <code>bootnum=500</code>. This will change the number of bootstrap samples used to calculate the 95% confidence interval for the $R^2$ and $Q^2$ curves. This will drastically slow down the code execution.</li>
</ul>
</div></div>

<div style="background-color:rgb(230,250,255); padding:2px; border-top: 20px solid rgb(255, 250, 250); border-left: 20px solid rgb(255, 250, 250); border-right: 20px solid rgb(255, 250, 250);  padding-right: 1em;">
<img align="left" width="80" src="images/bulb.png">
<img align="right" width="150" src="images/R2Q2_ab.png">
<div style="padding-left:80px">

<ul>
<li style="text-align: justify; padding-right:200px">For more information on the PLS SIMPLS algorithm refer to: De Jong, S., 1993. <a href= "https://www.sciencedirect.com/science/article/abs/pii/016974399385002X">SIMPLS: an alternative approach to partial least squares regression. Chemometrics and Intelligent Laboratory Systems, 18: 251–263</a></li>
<li style="text-align: justify; padding-right:200px">Although it is common practice to assume the optimal number of components for the PLS-DA model is chosen when $Q^2$ is at its apex (A), this is incorrect. Overtraining starts as soon as $Q^2$ significantly deviates from the $R^2$ trajectory. If the distance between $R^2$ and $Q^2$ gets large (&gt;0.2, or the 95% CIs stop overlapping) then one has to assume that the model is already overtrained. The point at which the $Q^2$ value begins to diverge from the $R^2$ value is considered the point at which the optimal number of components has been reached without overfitting (B). The $R^2$ vs. $(R^2 - Q^2)$ plot is provided to aid decision making.</li>
</ul>
</div></div>

<div style="background-color:rgb(210,250,210); padding:2px; border: 20px solid rgb(255, 250, 250); padding-right: 1em;">
<img align="left" width="60" src="images/mouse.png">
<div style="padding-left:80px">

<ul>
<li style="text-align: justify">Hover over the green data points in each of the plots to view the corresponding $R^2$  and $Q^2$ values.</li>

<li style="text-align: justify">Click on a point in one of the green plots. Notice that the two plots are linked.</li>

<li style="text-align: justify">Use the menu bar at the top right of the figure to save, scroll and zoom.</li>
</ul>
</div></div>


In [None]:
# Initialise cross_val kfold (stratified) 
cv = cb.cross_val.kfold(model=cb.model.PLS_SIMPLS,                   # model; we are using the PLS_SIMPLS model
                        X=XTknn,                                 
                        Y=Ytrain,                               
                        param_dict={'n_components': [1,2,3,4,5,6]},  # The numbers of latent variables to search                
                        folds=5,                                     # folds; for the number of splits (k-fold)
                        bootnum=100)                                 # num bootstraps for the Confidence Intervals


cv.run()  # run the cross validation
cv.plot() # plot cross validation statistics


<div style="background-color:rgb(255, 250, 250); padding:10px; padding-left: 1em; padding-right: 1em;">
    
<h3 id="63trainandevaluateplsdamodel" style="text-align: justify">6.3 Train and evaluate PLS-DA model</h3>

<p style="text-align: justify">Now that we have determined the optimal number of components for this data set using k-fold cross validation, we create a new PLS-DA model with the requisite number of latent variables, train the model using <code>XTknn</code> and <code>Ytrain</code>, then evaluate its predictive ability.<br><br></p>
</div>

<div style="background-color:rgb(255,210,210); padding:2px; border-top: 20px solid rgb(255, 250, 250); border-left: 20px solid rgb(255, 250, 250); border-right: 20px solid rgb(255, 250, 250); padding-right: 1em;">
<img align="left" width="80" src="images/cog2.png">
<div style="padding-left:80px"> 


<ul>
<li style="text-align: justify">Replace the code: <code>n_components=2</code> with: <code>n_components=3</code>. This will increase the number of latent variables used in the PLS-DA model. Notice how this changes the apparent predictive ability of the model.</li>

<li style="text-align: justify">Replace the code: <code>cutoffscore=0.5</code> with: <code>cutoffscore=0.4</code>. This will change the decision boundary for the classifier and alter the resulting performance statistics.</li>
</ul>
</div></div>
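
<div style="background-color:rgb(255, 250, 250); padding:10px; padding-left: 1em; padding-right: 1em;">
<p style="text-align: justify">To see why moving the cutoff score changes the performance statistics, the following minimal sketch computes sensitivity and specificity at two cutoffs. The labels and scores are hypothetical, the helper <code>sens_spec</code> is ours, and we assume a sample is classified as positive when its score is greater than or equal to the cutoff; the exact convention used inside <code>modelPLS.evaluate()</code> may differ.</p>
</div>

```python
import numpy as np

# Hypothetical class labels (1 = case, 0 = control) and predicted scores
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
y_score = np.array([0.90, 0.70, 0.45, 0.42, 0.55, 0.41, 0.30, 0.10])


def sens_spec(y_true, y_score, cutoff):
    """Return (sensitivity, specificity) for a given cutoff score.

    Assumption: score >= cutoff is classified as positive.
    """
    y_pred = (y_score >= cutoff).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))  # true positives
    fn = np.sum((y_pred == 0) & (y_true == 1))  # false negatives
    tn = np.sum((y_pred == 0) & (y_true == 0))  # true negatives
    fp = np.sum((y_pred == 1) & (y_true == 0))  # false positives
    return float(tp / (tp + fn)), float(tn / (tn + fp))


for cutoff in (0.5, 0.4):
    sens, spec = sens_spec(y_true, y_score, cutoff)
    print(f"cutoff={cutoff}: sensitivity={sens:.2f}, specificity={spec:.2f}")
```

<div style="background-color:rgb(255, 250, 250); padding:10px; padding-left: 1em; padding-right: 1em;">
<p style="text-align: justify">In this toy example, lowering the cutoff trades specificity for sensitivity: more cases are detected, but more controls are misclassified.</p>
</div>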

<div style="background-color:rgb(210,250,210); padding:2px; border: 20px solid rgb(255, 250, 250); padding-right: 1em;">
<img align="left" width="40" src="images/mouse.png">
<div style="padding-left:80px">


<ul>
<li style="text-align: justify">Hover over the green data points in each of the plots to view extra information.</li>

<li style="text-align: justify">Use the menu bar at the right of the figures to save, scroll and zoom.<br></li>
</ul>

</div></div>

In [None]:
modelPLS = cb.model.PLS_SIMPLS(n_components=2)  # Initialise the model with n_components = 2

Ypred = modelPLS.train(XTknn, Ytrain)  # Train the model 

modelPLS.evaluate(cutoffscore=0.5)  # Evaluate the model

modelPLS.permutation_test(nperm=100)  # nperm is the number of permutations

<a id='6.4'></a>
<div style="background-color:rgb(255, 250, 250); padding:10px; padding-left: 1em; padding-right: 1em;">
    
<h3 id="64plotlatentvariableprojectionsforplsdamodel" style="text-align: justify">6.4. Plot latent variable projections for PLS-DA model</h3>

<p style="text-align: justify">The PLS model also provides a <code>.plot_projections()</code> method, so we can visually inspect characteristics of the fitted latent variables. This returns a grid of plots:</p>
</div>

<div style="background-color:rgb(210,250,255); padding:2px;  border: 20px solid rgb(255, 250, 250); padding-right: 1em;">
<img align="left" width="80" src="images/bulb.png">
<div style="padding-left:80px; text-align: justify">


<ul>
<li style="text-align: justify">These plots are useful for visualising the degree to which each model component (latent variable) contributes to the model's discriminative ability. In the gastric cancer example, each individual component does not perform well in isolation. It is only when they are combined that a good predictive ability is revealed. In the bottom left figure, the projection scores plot includes a solid diagonal line describing the direction of prediction and a dashed line describing the orthogonal variance. In the method <a href="https://doi.org/10.1002%2Fcem.695">orthogonal partial least squares</a> (O-PLS) this rotation is performed automatically to aid interpretation. However, these changes <a href="http://dx.doi.org/10.1016/j.trac.2009.08.006">only improve the interpretability, not the predictivity, of the PLS models</a> (see <a href="https://fiehnlab.ucdavis.edu/staff/kind/statistics/concepts/opls-plsda">Fiehnlab</a> for further discussion).</li>
</ul>
</div>
</div>

In [None]:
modelPLS.plot_projections(label=
                          dataTrain[['Idx','SampleID']], size=12) # size changes circle size

<a id='6.5'></a>
<div style="background-color:rgb(255, 250, 250); padding:10px; padding-left: 1em; padding-right: 1em;">
    
<h3 id="65plotfeatureimportancecoefficientplotandvipforplsdamodel" style="text-align: justify">6.5. Plot feature importance (Coefficient plot and VIP) for PLS-DA model</h3>

<p style="text-align: justify">Now that we have built a model and established that it represents meaningful features of the dataset, we determine the importance of specific peaks to the model's discriminatory power. To do this, in the cell below we use the PLS model's <code>plot_featureimportance()</code> method to render scatterplots of the <a href="https://doi.org/10.6084/m9.figshare.5696494.v3">PLS regression <em>coefficient</em> values</a> for each metabolite, and <a href="https://books.google.com.au/books?id=58qLBQAAQBAJ"><em>Variable Importance in Projection</em></a> (VIP) plots. The coefficient values provide information about the contribution of the peak to either a negative or positive classification for the sample, and peaks with VIP greater than unity (1) are considered to be "important" in the model.<br></p>
</div>

<div style="background-color:rgb(230,250,255); padding:2px; border-top: 20px solid rgb(255, 250, 250); border-left: 20px solid rgb(255, 250, 250); border-right: 20px solid rgb(255, 250, 250); padding-right: 1em;">
<img align="left" width="80" src="images/bulb.png">
<div style="padding-left:80px">

<ul>
<li style="text-align: justify">In statistics, the bootstrap procedure involves choosing random samples with replacement from a data set and calculating some statistic on those samples. The range of sample estimates you obtain enables you to establish the uncertainty of the quantity you are estimating. Sampling with replacement means that each observation in a sample is selected (and recorded) at random from the original dataset and then replaced, so it is possible for an observation to be selected multiple times. If the original data set contains N observations then each bootstrap sample contains N randomly selected observations. It has been shown that approximately 2/3 (about 63%) of the original observations are included in each bootstrap sample, with the remaining places taken by observations that are selected more than once. Here we use bootstrap resampling to calculate confidence intervals for the coefficients in the PLS-DA model using the <a href="https://doi.org/10.1002/9780470057339.vab028">'bootstrapping of observations'</a> method.</li>
</ul>
</div></div>
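
<div style="background-color:rgb(255, 250, 250); padding:10px; padding-left: 1em; padding-right: 1em;">
<p style="text-align: justify">The bootstrap idea described above can be illustrated with plain NumPy. This is a minimal sketch, not the <code>calc_bootci()</code> implementation: the data are simulated, the statistic is a simple mean rather than a PLS coefficient, and the interval uses the basic percentile method rather than the bias corrected and accelerated (BCa) method used by default in the next code cell.</p>
</div>

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data set with N observations (simulated, for illustration only)
n = 1000
data = rng.normal(loc=10.0, scale=2.0, size=n)

n_boot = 200
boot_means = np.empty(n_boot)
unique_fraction = np.empty(n_boot)

for b in range(n_boot):
    # Draw N indices at random WITH replacement
    idx = rng.integers(0, n, size=n)
    # Statistic of interest, calculated on the bootstrap sample
    boot_means[b] = data[idx].mean()
    # Fraction of the original observations present in this sample
    unique_fraction[b] = np.unique(idx).size / n

# Percentile 95% confidence interval for the mean
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])

print(f"Mean fraction of original data per bootstrap sample: {unique_fraction.mean():.3f}")
print(f"95% percentile CI for the mean: [{ci_low:.3f}, {ci_high:.3f}]")
```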

<div style="background-color:rgb(255,210,210); padding:2px; border: 20px solid rgb(255, 250, 250); padding-right: 1em;">
<img align="left" width="60" src="images/cog2.png">
<div style="padding-left:80px"> </p>

<ul>
<li style="text-align: justify">Replace the code: <code>type='bca'</code> with either <code>type='perc'</code> or <code>type='bc'</code> to change from the <a href="https://doi.org/10.2307%2F2289144">bias corrected and accelerated percentile method</a> to either the <a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.473.2742&amp;rep=rep1&amp;type=pdf"><em>percentile method</em> or the <em>bias corrected percentile method</em></a>.</li>

<li style="text-align: justify">Replace the code: <code>sort=False</code> with: <code>sort=True</code>. This will sort the metabolites in descending order of importance.</li>
</ul>

</div></div>

In [None]:
# Calculate the bootstrapped confidence intervals 
modelPLS.calc_bootci(type='bca', bootnum=200)                # decrease bootnum if this takes too long on your machine

# Plot the feature importance plots, and return a new Peaksheet 
peakSheet = modelPLS.plot_featureimportance(peakTableClean,
                                            peaklist,
                                            ylabel='Label',  # change ylabel to 'Name' 
                                            sort=False)      # change sort to True

<div style="background-color:rgb(255, 250, 250); padding:10px; padding-left: 1em; padding-right: 1em;">
    
<h3 id="66testmodelwithnewdatausingtestsetfromsection61" style="text-align: justify">6.6. Test model with new data (using test set from section 6.1)</h3>

<p style="text-align: justify">So far, we have trained and tested our PLS classifier on a single training dataset. This risks <em>overfitting</em>, as we could be optimising the performance of the model on this dataset such that it cannot <em>generalise</em>, in the sense that it may not perform as well on a dataset that it has not already seen. To see if the model can <em>generalise</em>, we must test our trained model using a new dataset that it has not already encountered. In section 6.1 we divided our original complete dataset into four components: <code>dataTrain</code>, <code>Ytrain</code>, <code>dataTest</code> and <code>Ytest</code>. Our trained model has not seen the <code>dataTest</code> and <code>Ytest</code> values that we have <em>held out</em>, so these can be used to evaluate model performance on new data.</p>
</div>

<div style="background-color:rgb(210,250,255); padding:2px;  border: 20px solid rgb(255, 250, 250); padding-right: 1em;">
<img align="left" width="60" src="images/bulb.png">
<div style="padding-left:80px; text-align: justify">


<ul>
<li style="text-align: justify">Note: It is important that the test data is transformed and scaled using the same parameters as the training data. If the training data is log transformed then the test data must also be log transformed, otherwise the test predictions will be inappropriate, and likely highly inaccurate. Equally, the scaling must be performed using the scaling factors derived from the training data (e.g. mean-centred to the training data mean, and normalised to the training data standard deviation).</li>
</ul>
</div>
</div>
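
<div style="background-color:rgb(255, 250, 250); padding:10px; padding-left: 1em; padding-right: 1em;">
<p style="text-align: justify">The principle in the note above can be shown with a small NumPy sketch. The arrays are hypothetical, and we assume unit-variance scaling with the sample standard deviation (<code>ddof=1</code>); the exact convention used by <code>cb.utils.scale</code> is not confirmed here. The key point is that <code>mu</code> and <code>sigma</code> are calculated from the training data only and then reused for the test data.</p>
</div>

```python
import numpy as np

# Hypothetical training and test matrices (rows = samples, columns = peaks)
X_train = np.array([[1.0, 10.0],
                    [2.0, 20.0],
                    [3.0, 30.0],
                    [4.0, 40.0]])
X_test = np.array([[2.5, 50.0],
                   [5.0, 25.0]])

# Scaling parameters are derived from the TRAINING data only
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0, ddof=1)  # assumption: sample standard deviation

# The same mu and sigma are applied to both data sets
X_train_scaled = (X_train - mu) / sigma
X_test_scaled = (X_test - mu) / sigma

print("Training mean:", mu)
print("Scaled test data:\n", X_test_scaled)
```

<div style="background-color:rgb(255, 250, 250); padding:10px; padding-left: 1em; padding-right: 1em;">
<p style="text-align: justify">Note that the scaled test data will generally not have a mean of exactly zero or a standard deviation of exactly one. That is expected, because the test data is positioned relative to the training distribution.</p>
</div>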

In [None]:
# Get mu and sigma from the training dataset to use for the Xtest scaling
mu, sigma  = cb.utils.scale(XTlog, return_mu_sigma=True) 

# Pull Xtest out of dataTest using peaklist ('Name' column in PeakTable)
peaklist = peakTableClean.Name 
XV = dataTest[peaklist].values

# Log transform, unit-scale and knn-impute missing values for Xtest
XVlog = np.log(XV)
XVscale  = cb.utils.scale(XVlog, method='auto', mu=mu, sigma=sigma) 
XVknn = cb.utils.knnimpute(XVscale, k=3)

<div style="background-color:rgb(255, 250, 250); padding:10px; padding-left: 1em; padding-right: 1em;">

<p style="text-align: justify">Now we predict a new set of response variables from <code>XVknn</code> as input, using our trained model and its <code>.test()</code> method, and then evaluate the performance of the prediction against the known values in <code>Ytest</code> using the <code>.evaluate()</code> method (as in section 6.3).<br></p>
</div>

<div style="background-color:rgb(210,250,255); padding:2px; border: 20px solid rgb(255, 250, 250); padding-right: 1em;">
<img align="left" width="40" src="images/bulb.png">
<div style="padding-left:80px; text-align: justify">


<ul>
<li style="text-align: justify">Note: Although the calculated bootstrap confidence intervals for prediction give an estimate of the uncertainty of prediction, the only way to definitively evaluate any model is with an independent test set, as shown in this plot.</li>
</ul>
</div>
</div>


In [None]:
# Calculate Ypredicted score using modelPLS.test
YVpred = modelPLS.test(XVknn)

# Evaluate YVpred against Ytest
evals = [Ytest, YVpred]    # alternative formats: (Ytest, YVpred) or np.array([Ytest, YVpred])
#modelPLS.evaluate(evals, specificity=0.9)
modelPLS.evaluate(evals, cutoffscore=0.5) 

<a id='6.7'></a>
<div style="background-color:rgb(255, 250, 250); padding:10px;  padding-left: 1em; padding-right: 1em;">
    
<h3 id="67exportresultstoexcel" style="text-align: justify">6.7. Export results to Excel</h3>

<p style="text-align: justify">Finally, we copy our model predictions into a table and save in a persistent Excel spreadsheet.<br></p>
</div>

<div style="background-color:rgb(255,210,210); padding:2px; border: 20px solid rgb(255, 250, 250); padding-right: 1em;">
<img align="left" width="50" src="images/cog2.png">
<div style="padding-left:80px">


<ul>
<li style="text-align: justify">Replace the filename <code>"modelPLS.xlsx"</code> with <code>"myModelPLS.xlsx"</code><br></li>

<li style="text-align: justify">AND/OR change <code>sheet_name='Datasheet'</code> / <code>sheet_name='Peaksheet'</code> as appropriate<br></li>
</ul>
</div></div>


In [None]:
# Save DataSheet as 'Idx', 'SampleID', and 'Class' from DataTest
dataSheet = dataTest[["Idx", "SampleID", "Class"]].copy() 

# Add 'Ypred' to Datasheet
dataSheet['Ypred'] = YVpred 
 
# Create an empty Excel workbook
writer = pd.ExcelWriter("modelPLS.xlsx")     # provide the filename for the Excel file

# Add each dataframe to the workbook in turn, as a separate worksheet
dataSheet.to_excel(writer, sheet_name='Datasheet', index=False)
peakSheet.to_excel(writer, sheet_name='Peaksheet', index=False)

# Write the Excel workbook to disk
writer.close()  # close() saves the file; writer.save() was removed in pandas 2.0

print("Done!")

<div style="background-color:rgb(255, 250, 250); padding:10px; padding-left: 1em; padding-right: 1em;">

<p>Congratulations! You have completed Tutorial 2.</p>

</div>