<div style="text-align: justify; padding:5px; background-color:rgb(252, 253, 255); border: 1px solid lightgrey; padding-left: 1em; padding-right: 1em;">
    <font color='red'>To begin: Click anywhere in this cell and press <kbd>Run</kbd> on the menu bar. This executes the current cell and then highlights the next cell. There are two types of cells: a <i>text cell</i> and a <i>code cell</i>. When you <kbd>Run</kbd> a text cell (<i>we are in a text cell now</i>), you advance to the next cell without executing any code. When you <kbd>Run</kbd> a code cell (<i>identified by <span style="font-family: courier; color:black; background-color:white;">In[ ]:</span> to the left of the cell</i>), you advance to the next cell after executing all the Python code within that cell. Any visual results produced by the code (text/figures) are reported directly below that cell. Press <kbd>Run</kbd> again. Repeat this process until the end of the notebook. <b>NOTE:</b> All the cells in this notebook can be automatically executed sequentially by clicking <kbd>Kernel</kbd><font color='black'>→</font><kbd>Restart and Run All</kbd>. Should anything crash, restart the Jupyter kernel by clicking <kbd>Kernel</kbd><font color='black'>→</font><kbd>Restart</kbd>, and start again from the top.</font>
        
</div>

<div style="text-align: justify; padding:5px; background-color:rgb(252, 253, 255); border: 1px solid lightgrey; padding-left: 1em; padding-right: 1em;">
<img src="https://github.com/CIMCB/MetabComparisonBinaryML/blob/master/cimcb_logo.png?raw=true" width="180px" align="right" style="padding: 20px">

<a id="introduction"></a>

<h1>Notebook 2: Quality Control-Robust Spline Correction (QC-RSC)</h1>

<br>
<br>
<br>

</div>

<div style="background-color:rgb(255, 250, 250); padding:5px; padding-left: 1em; padding-right: 1em;">
    
<a id="1"></a>
<h2 style="text-align: justify">1. Import Packages</h2>

<p  style="text-align: justify">[Enter Text Here]</p>

<ul>
<li style="text-align: justify"><a href="http://www.numpy.org/"><code>numpy</code></a>: A standard package primarily used for the manipulation of arrays</li>

<li style="text-align: justify"><a href="https://pandas.pydata.org/"><code>pandas</code></a>: A standard package primarily used for the manipulation of data tables</li>

<li style="text-align: justify"><a href="https://github.com/CIMCB/qcrsc"><code>qcrsc</code></a>: A library of helpful functions and tools provided by the authors</li>



</ul>

<br>

</div>

In [1]:
import numpy as np
import pandas as pd
import qcrsc   


print('All packages successfully loaded')

# Remove later
%load_ext autoreload
%autoreload 2

All packages successfully loaded


<div style="background-color:rgb(255, 250, 250); padding:5px; padding-left: 1em; padding-right: 1em;">

<a id="2"></a>
<h2 style="text-align: justify">2. Load Data & Peak Sheet</h2>

<p  style="text-align: justify">[Enter Text Here]</p>

<p  style="text-align: justify"><code>qcrsc.load_dataXL()</code> parameters:</p> 

<ul>
    <li><code>filename</code>: The name of the Excel file (.xlsx file)</li>
    <li><code>DataSheet</code>: The name of the data sheet in the file. Requires the columns Order, SampleType, and Batch.</li>
    <li><code>PeakSheet</code>: The name of the peak sheet in the file. Requires the columns Idx, Name, and Label.</li>
</ul>   
<br>

</div>

In [2]:
home = 'data/'
file = 'Dataset08a__SFPM_PQN_TIDYDATA.xlsx' 

DataTable, PeakTable = qcrsc.load_dataXL(home + file,'Data','Peak')

Loadings PeakFile: Peak
Loadings DataFile: Data
Data Table is suitable for use with QCRSC
TOTAL SAMPLES: 172 TOTAL PEAKS: 2488
Done!
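
<p style="text-align: justify">Before loading a new workbook, it can help to confirm that each sheet carries the required columns listed above. The following is a minimal sketch using <code>pandas</code> only; the helper <code>missing_columns</code> and the small tables are hypothetical illustrations and are not part of <code>qcrsc</code>, whose own checks may be stricter.</p>

```python
import pandas as pd

# Columns that the parameter list above names as required for each sheet.
REQUIRED_DATA_COLUMNS = ["Order", "SampleType", "Batch"]
REQUIRED_PEAK_COLUMNS = ["Idx", "Name", "Label"]


def missing_columns(table, required):
    """Return the required column names that are absent from `table`.

    Hypothetical helper for illustration; not a qcrsc function.
    """
    return [col for col in required if col not in table.columns]


# Tiny made-up tables standing in for the 'Data' and 'Peak' sheets.
data_sheet = pd.DataFrame({
    "Order": [1, 2, 3],
    "SampleType": ["QC", "Sample", "Sample"],
    "Batch": [1, 1, 1],
    "M1": [10.2, 11.5, 9.8],
})
peak_sheet = pd.DataFrame({
    "Idx": [1],
    "Name": ["M1"],
    "Label": ["Example peak"],
})

# An empty list means every required column is present.
print(missing_columns(data_sheet, REQUIRED_DATA_COLUMNS))

# Dropping a required column shows how a problem would be reported.
print(missing_columns(peak_sheet.drop(columns="Label"), REQUIRED_PEAK_COLUMNS))
```

<p style="text-align: justify">For a real workbook, the two tables could be read with <code>pd.read_excel(path, sheet_name=...)</code> and passed to the same helper.</p>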


<div style="background-color:rgb(255, 250, 250); padding:5px; padding-left: 1em; padding-right: 1em;">

<a id="3"></a>
<h2 style="text-align: justify">3. View Correction Per Peak</h2>

<p  style="text-align: justify">[Enter Text Here]</p>

<br>
<p  style="text-align: justify"><code>qcrsc.peak()</code> parameters:</p> 

<ul>
    <li><code>DataTable</code>: DataTable</li>
    <li><code>PeakTable</code>: PeakTable</li>
    <li><code>batch</code>: e.g. 1 or [1] or [1, 2, 3] or -1 for all</li>
    <li><code>peak</code>: e.g. 'M1' or 'R' for random</li>
    <li><code>gamma</code>: False or (min, max, step) (default (0.5, 5, 0.2))</li>
    <li><code>transform</code>: 'log' or False (default 'log')</li>
    <li><code>parametric</code>: True or False (default True)</li>
    <li><code>plot</code>: list e.g. ['Sample', 'QC', 'Blank'] or ['Sample', 'QC'] (default ['Sample', 'QC'])</li>
    <li><code>zeroflag</code>: True or False (default True)</li>
    <li><code>control_limit</code>: False or ('RSD', value) or ('D-ratio', value) (default False)</li>
</ul>
</div>

In [3]:

qcrsc.peak(DataTable, 
           PeakTable,
           batch='all',
           peak='M2479', 
           gamma=(0.5, 5, 0.2), 
           transform='log', 
           parametric=True,
           zero_remove=True, 
           plot=['QC', 'Sample'],
           control_limit={'RSD':30}, 
           colormap='Accent',
           fill_points=False,
           scale_x=1, 
           scale_y=1)
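
<p style="text-align: justify">The <code>gamma</code> parameter is given as a <code>(min, max, step)</code> tuple. As a rough sketch of what such a tuple describes, the snippet below expands it into an evenly spaced list of candidate values. This is an assumption made for illustration: how <code>qcrsc</code> actually builds and searches its grid (for example, whether the upper bound is included) is not stated in this notebook, and <code>gamma_grid</code> is a hypothetical helper.</p>

```python
import numpy as np


def gamma_grid(gamma_min, gamma_max, step):
    """Evenly spaced candidates from gamma_min upward, never exceeding gamma_max.

    Hypothetical helper; qcrsc may construct its search grid differently.
    """
    # Count whole steps that fit in the range; the small epsilon guards
    # against floating-point error when the range is an exact multiple.
    n_steps = int(np.floor((gamma_max - gamma_min) / step + 1e-9))
    return gamma_min + step * np.arange(n_steps + 1)


grid = gamma_grid(0.5, 5, 0.2)
print(len(grid))          # number of candidate values
print(np.round(grid, 2))  # rounded only for display
```

<p style="text-align: justify">A smaller step gives a finer search at the cost of more fits per peak, which is worth keeping in mind given the run times reported for the full correction.</p>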


<div style="background-color:rgb(255, 250, 250); padding:5px; padding-left: 1em; padding-right: 1em;">

<a id="4"></a>
<h2 style="text-align: justify">4. QC-RSC Correction</h2>

<br>
<p  style="text-align: justify"><code>qcrsc.qc_correction()</code> parameters:</p> 

<ul>
    <li><code>DataTable</code>: DataTable</li>
    <li><code>PeakTable</code>: PeakTable</li>
    <li><code>gamma</code>: False or (min, max, step) (default (0.5, 5, 0.2))</li>
    <li><code>transform</code>: 'log' or False (default 'log')</li>
    <li><code>zeroflag</code>: True or False (default True)</li>
    <li><code>remove_outliers</code>: True or False (default True)</li>
    <li><code>impute_missing</code>: True or False (default True)</li>
</ul>

<p  style="text-align: justify">Note: an asterisk (*) in a PeakTableX column name marks a non-parametric metric, e.g. RSD* is the non-parametric RSD.</p>
</div>

In [4]:
# Currently exporting both parametric and non-parametric metrics in PeakTableX
DataTableX, PeakTableX = qcrsc.qc_correction(DataTable,
                                             PeakTable,
                                             gamma=(0.5, 5, 0.2),
                                             transform='log',
                                             remove_outliers=False,
                                             impute_missing=False)

Number of Batches : 8


Batch 1: 100%|██████████| 2488/2488 [03:58<00:00, 10.44it/s]
Batch 2: 100%|██████████| 2488/2488 [00:03<00:00, 818.09it/s]
Batch 3: 100%|██████████| 2488/2488 [00:02<00:00, 889.27it/s]
Batch 4: 100%|██████████| 2488/2488 [02:35<00:00, 15.97it/s]
Batch 5: 100%|██████████| 2488/2488 [02:42<00:00, 15.32it/s]
Batch 6: 100%|██████████| 2488/2488 [02:41<00:00, 15.38it/s]
Batch 7: 100%|██████████| 2488/2488 [02:46<00:00, 14.95it/s]
Batch 8: 100%|██████████| 2488/2488 [00:03<00:00, 816.71it/s]


8 batches corrected and concatenated
Final data set: 172 samples and 2488 metabolites
Done!


In [5]:
# To do.. if no QCT -> remove columns
#      .. if no QCW and QCB (i.e. just QC) -> not necessary to have both columns

PeakTableX.columns

Index(['Idx', 'Name', 'Label', 'm/z', 'RSD_QCW', 'DRatio_QCW', 'RSD_QCB',
       'DRatio_QCB', 'RSD_QCT', 'DRatio_QCT',
       ...
       'B8_RSD*_QCW', 'B8_DRatio*_QCW', 'B8_RSD_QCB', 'B8_DRatio_QCB',
       'B8_RSD*_QCB', 'B8_DRatio*_QCB', 'B8_RSD_QCT', 'B8_DRatio_QCT',
       'B8_RSD*_QCT', 'B8_DRatio*_QCT'],
      dtype='object', length=122)
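
<p style="text-align: justify">The columns above report RSD and D-ratio in parametric and non-parametric (starred) form. The sketch below shows one common way to compute these metrics for a single peak, using the standard deviation and mean for the parametric versions and the median absolute deviation (MAD) and median for the non-parametric versions. These definitions are an assumption based on common usage in metabolomics quality control; the exact formulas used by <code>qcrsc</code> may differ, and all function names and values here are illustrative.</p>

```python
import numpy as np


def mad(x):
    """Median absolute deviation from the median."""
    x = np.asarray(x, dtype=float)
    return np.median(np.abs(x - np.median(x)))


def rsd(qc):
    """Parametric relative standard deviation of the QC samples, in percent."""
    qc = np.asarray(qc, dtype=float)
    return 100.0 * np.std(qc, ddof=1) / np.mean(qc)


def rsd_nonparametric(qc):
    """Non-parametric RSD: scaled MAD over the median, in percent."""
    # 1.4826 makes the MAD comparable to a standard deviation for normal data.
    return 100.0 * 1.4826 * mad(qc) / np.median(qc)


def d_ratio(qc, sample):
    """Parametric D-ratio: QC spread relative to sample spread, in percent."""
    return 100.0 * np.std(qc, ddof=1) / np.std(sample, ddof=1)


def d_ratio_nonparametric(qc, sample):
    """Non-parametric D-ratio: ratio of MADs, in percent."""
    return 100.0 * mad(qc) / mad(sample)


# Made-up intensities for one peak.
qc_values = [100, 102, 98, 101, 99]
sample_values = [80, 120, 95, 140, 60]

print(round(rsd(qc_values), 3), round(rsd_nonparametric(qc_values), 3))
print(round(d_ratio(qc_values, sample_values), 3),
      round(d_ratio_nonparametric(qc_values, sample_values), 3))
```

<p style="text-align: justify">Low values of both metrics indicate that the technical variation measured in the QC samples is small relative to the signal (RSD) and to the variation between study samples (D-ratio).</p>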

<div style="background-color:rgb(255, 250, 250); padding:5px; padding-left: 1em; padding-right: 1em;">

<a id="5"></a>
<h2 style="text-align: justify">5. PCA Plot</h2>

<p  style="text-align: justify">[Enter Text Here]</p>

<br>
<p  style="text-align: justify"><code>qcrsc.pca_plot()</code> parameters:</p> 

<ul>
    <li><code>DataTable</code>: Requires DataTable </li>
    <li><code>PeakTable</code>: Requires PeakTable </li>
    <li><code>pcx</code>: pc on x-axis e.g. 1 (default 1)</li>
    <li><code>pcy</code>: pc on y-axis e.g. 2 (default 2)</li>
    <li><code>project_qc</code>: True or False (default True) </li>
    <li><code>batch</code>: e.g. 1 or [1] (default 1)</li>
    <li><code>gamma</code>: False or (min, max, step) (default False)</li>
    <li><code>transform</code>: 'log' or False (default 'log')</li>
    <li><code>zeroflag</code>: True or False (default True)</li>
    <li><code>unitscale</code>: True or False (default False)</li>
    <li><code>knn</code>: e.g. 3 or 4 (default 3)</li> 
    <li><code>plot</code>: list e.g. ['Sample', 'QC', 'Blank'] or ['Sample', 'QC'] (default ['Sample', 'QC'])</li>
    <li><code>control_limit</code>: False or ('RSD', value) or ('D-ratio', value) (default False)</li>
    <li><code>plot_ellipse</code>: 'all', 'none', 'meanci', 'ci'</li>
</ul>
</div>

In [6]:
qcrsc.pca_plot(DataTableX, 
               PeakTableX, 
               pcx=1,
               pcy=2, 
               batch='all', 
               transform='log', 
               scale='unit',
               knn=3,
               plot=['QC', 'Sample'],
               control_limit={'RSD':20},
               plot_ellipse='all',
               plot_points=False,
               colormap = 'Accent',
               fill_points = False,
               scale_y = 1,
               scale_x = 1,
               alpha_ellipse=(0.1, 0.2))
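
<p style="text-align: justify">To make the <code>transform</code> and scaling options more concrete, the sketch below computes PCA scores from a small random matrix using <code>numpy</code> only: a log transform, mean-centring, optional unit-variance scaling, and a singular value decomposition. It is an illustration under stated assumptions, not the <code>qcrsc.pca_plot()</code> implementation. It omits the k-nearest-neighbour imputation and QC projection mentioned above, and <code>pca_scores</code> is a hypothetical helper.</p>

```python
import numpy as np


def pca_scores(X, n_components=2, log_transform=True, unit_scale=True):
    """Return PCA scores and explained-variance ratios for a samples x peaks matrix.

    Hypothetical helper for illustration; assumes no missing values and
    strictly positive intensities when log_transform is True.
    """
    X = np.asarray(X, dtype=float)
    if log_transform:
        X = np.log10(X)
    # Centre each peak (column) on zero.
    X = X - X.mean(axis=0)
    if unit_scale:
        # Give each peak unit variance so that intense peaks do not dominate.
        X = X / X.std(axis=0, ddof=1)
    # SVD of the preprocessed matrix: scores are U scaled by the singular values.
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]
    explained = (S ** 2) / np.sum(S ** 2)
    return scores, explained[:n_components]


# Made-up data: 20 samples, 6 peaks, log-normally distributed intensities.
rng = np.random.default_rng(0)
X = rng.lognormal(mean=2.0, sigma=0.5, size=(20, 6))

scores, explained = pca_scores(X)
print(scores.shape)  # one row per sample, one column per component
print(explained)     # fraction of variance captured by each component
```

<p style="text-align: justify">Plotting the first column of <code>scores</code> against the second gives the kind of PC1 versus PC2 view selected by <code>pcx=1</code> and <code>pcy=2</code>.</p>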

<div style="background-color:rgb(255, 250, 250); padding:5px; padding-left: 1em; padding-right: 1em;">

<a id="6"></a>
<h2 style="text-align: justify">6. Distribution Plot</h2>

<p  style="text-align: justify">[Enter Text Here]</p>

<br>
<p  style="text-align: justify"><code>qcrsc.dist_plot()</code> parameters:</p> 

<ul>
    <li><code>DataTable</code>: Requires DataTable </li>
    <li><code>PeakTable</code>: Requires PeakTable </li>
    <li><code>metric</code>: RSD or Dratio</li>
    <li><code>transform</code>: False or 'log'</li>
    <li><code>parametric</code>: True or False (default True) </li>
    <li><code>batch</code>: e.g. 1 or [1] (default 1)</li>
    <li><code>plot</code>: ["QC", "QCT"]</li>
    <li><code>colormap</code>: the name of a categorical colormap; see https://matplotlib.org/tutorials/colors/colormaps.html </li>
</ul>
</div>

In [7]:

qcrsc.dist_plot(DataTableX,
                PeakTableX, 
                parametric = True, 
                batch = 'all', 
                plot = 'all',
                colormap = 'Accent',
                scale_x = 1, 
                scale_y = 1,
                padding = 0.10,
                smooth = None,
                alpha = 0.05,
                legend= True)
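
<p style="text-align: justify">The metric columns in PeakTableX can also be inspected directly with <code>pandas</code>, for example to count how many peaks fall below a chosen RSD limit. The sketch below uses a small made-up table; the column name <code>RSD_QCW</code> is taken from the PeakTableX columns listed earlier, while the values and the limit of 20 are purely illustrative.</p>

```python
import pandas as pd

# Made-up stand-in for PeakTableX with a single metric column.
peak_table = pd.DataFrame({
    "Name": ["M1", "M2", "M3", "M4", "M5"],
    "RSD_QCW": [8.5, 14.2, 22.0, 35.1, 60.3],
})

# Illustrative control limit; choose a value suited to the study.
rsd_limit = 20

# Summary statistics describe the same distribution that a plot would show.
print(peak_table["RSD_QCW"].describe())

# Keep only the peaks whose RSD is below the limit.
within_limit = peak_table[peak_table["RSD_QCW"] < rsd_limit]
print(f"{len(within_limit)} of {len(peak_table)} peaks have RSD below {rsd_limit}")
```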


<div style="background-color:rgb(255, 250, 250); padding:5px; padding-left: 1em; padding-right: 1em;">

<a id="7"></a>
<h2 style="text-align: justify">7. Export DataTableX & PeakTableX</h2>

<p  style="text-align: justify"><code>qcrsc.export_dataXL()</code> parameters:</p> 

<ul>
    <li><code>filename</code>: The name of the Excel file (.xlsx file)</li>
    <li><code>DataTable</code>: DataTable</li>
    <li><code>PeakTable</code>: PeakTable</li>
    <li><code>data_sheet</code>: Name of created sheet (DataTable)</li>
    <li><code>peak_sheet</code>: Name of created sheet (PeakTable)</li>
    
</ul>   
<br>

</div>

In [8]:

qcrsc.export_dataXL(home + file, DataTableX, PeakTableX, data_sheet='DataTableX', peak_sheet='PeakTableX') 

Done.
