<div style="text-align: justify; padding:5px; background-color:rgb(252, 253, 255); border: 1px solid lightgrey; padding-left: 1em; padding-right: 1em;">
    <font color='red'>To begin: Click anywhere in this cell and press <kbd>Run</kbd> on the menu bar. This executes the current cell and then highlights the next cell. There are two types of cells: a <i>text cell</i> and a <i>code cell</i>. When you <kbd>Run</kbd> a text cell (<i>we are in a text cell now</i>), you advance to the next cell without executing any code. When you <kbd>Run</kbd> a code cell (<i>identified by <span style="font-family: courier; color:black; background-color:white;">In[ ]:</span> to the left of the cell</i>), you advance to the next cell after executing all the Python code within that cell. Any visual results produced by the code (text/figures) are reported directly below that cell. Press <kbd>Run</kbd> again. Repeat this process until the end of the notebook. <b>NOTE:</b> All the cells in this notebook can be executed sequentially by clicking <kbd>Kernel</kbd><font color='black'>→</font><kbd>Restart and Run All</kbd>. Should anything crash, restart the Jupyter kernel by clicking <kbd>Kernel</kbd><font color='black'>→</font><kbd>Restart</kbd>, and start again from the top.</font>
        
</div>

### 1. Import Packages

In [1]:
import numpy as np
import pandas as pd
import cimcb as cb

print('All packages successfully loaded')

%load_ext autoreload
%autoreload 2

Using TensorFlow backend.


All packages successfully loaded


### 2. Load Data & Peak Sheet

In [2]:
home = 'data/'
file = 'ST001047.xlsx'

DataTable,PeakTable = cb.utils.load_dataXL(home + file, DataSheet='Data', PeakSheet='Peak')

Loadings PeakFile: Peak
Loadings DataFile: Data
Data Table & Peak Table is suitable.
TOTAL SAMPLES: 140 TOTAL PEAKS: 149
Done!


### 3. Extract X & Y

In [3]:
# Clean PeakTable
RSD = PeakTable['QC_RSD']
PercMiss = PeakTable['Perc_missing']
PeakTableClean = PeakTable[(RSD < 20) & (PercMiss < 10)]

# Select Subset of Data
DataTable2 = DataTable[(DataTable.Class == "GC") | (DataTable.Class == "HE")]

# Create a Binary Y Vector 
Outcomes = DataTable2['Class']
Y = [1 if outcome == 'GC' else 0 for outcome in Outcomes]
Y = np.array(Y)

# Extract and Scale Metabolite Data 
peaklist = PeakTableClean['Name']
XT = DataTable2[peaklist]
XTlog = np.log(XT)
XTscale = cb.utils.scale(XTlog, method='auto')
XTknn = cb.utils.knnimpute(XTscale, k=3)
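
The cell above chains a log transform, scaling, and k-nearest-neighbour imputation. As a rough illustration of what each step does, the sketch below repeats the chain on a toy matrix using NumPy and scikit-learn's `KNNImputer`. It assumes that `method='auto'` means autoscaling (mean-centre each column and divide by its standard deviation); that is an assumption about `cb.utils.scale`, not something confirmed here. The toy values and the names `X_toy`, `X_log`, `X_scaled`, and `X_imputed` are invented for the example.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy intensity matrix (4 samples x 3 peaks) with one missing value; values are invented.
X_toy = np.array([[100.0, 2000.0, 30.0],
                  [120.0, np.nan, 28.0],
                  [ 90.0, 1800.0, 35.0],
                  [110.0, 2200.0, 31.0]])

# 1. Log transform to reduce the right skew typical of intensity data.
X_log = np.log(X_toy)

# 2. Autoscale each column while ignoring missing values (assumed meaning of method='auto'):
#    subtract the column mean, then divide by the column standard deviation.
col_mean = np.nanmean(X_log, axis=0)
col_std = np.nanstd(X_log, axis=0, ddof=1)
X_scaled = (X_log - col_mean) / col_std

# 3. Replace each missing value with the mean of that column over the k nearest samples.
X_imputed = KNNImputer(n_neighbors=3).fit_transform(X_scaled)

print(np.isnan(X_imputed).sum())  # number of missing values left after imputation
```

Scaling before imputation means the neighbour distances are not dominated by the peaks with the largest raw intensities.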

### 4. Hyperparameter Optimisation

In [None]:
# Parameter Dictionary
param_dict = {'n_components': [1, 2, 3, 4, 5, 6]}                   

# Initialise
cv = cb.cross_val.kfold(model=cb.model.PLS_SIMPLS,                      
                                X=XTknn,                                 
                                Y=Y,                               
                                param_dict=param_dict,                   
                                folds=5,
                                n_mc=100)                                

# Run and Plot
cv.run()  
cv.plot(metric='r2q2', ci=95)

In [68]:
# Build model and plot projections (k-fold with Monte Carlo repetitions)
# To do: Parallel

modelOptimise = cb.model.PLS_SIMPLS(n_components=2)
modelOptimise.train(XTknn, Y)
Ytest = modelOptimise.test(XTknn)

modelOptimise.plot_projections_kfold(label=DataTable2[['Idx','SampleID']],
                             size=12,
                             ci95=True,
                             scatterplot=True,
                             folds=5,
                             n_mc=100)


100%|██████████| 100/100 [00:00<00:00, 146.46it/s]


### 5. Build Model & Evaluate

In [65]:
# Build Model
model = cb.model.PLS_SIMPLS(n_components=3)
model.train(XTknn, Y)
model.test(XTknn)

# Evaluate Model 
model.evaluate(cutoffscore=0.5) 
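
Because the model regresses a 0/1 outcome, its predictions are continuous scores that must be thresholded before classification metrics can be computed. The sketch below shows one way a cutoff of 0.5 could be turned into sensitivity, specificity, and accuracy. The function `binary_metrics` and the toy scores are hypothetical, and treating scores at or above the cutoff as positive is an assumption rather than a statement about how `model.evaluate` works.

```python
import numpy as np

def binary_metrics(y_true, y_score, cutoff=0.5):
    """Threshold continuous scores and summarise the resulting confusion matrix."""
    y_true = np.asarray(y_true)
    # Assumption: scores at or above the cutoff are called positive.
    y_pred = (np.asarray(y_score) >= cutoff).astype(int)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    tn = int(np.sum((y_pred == 0) & (y_true == 0)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    return {
        "sensitivity": tp / (tp + fn),   # fraction of true positives recovered
        "specificity": tn / (tn + fp),   # fraction of true negatives recovered
        "accuracy": (tp + tn) / len(y_true),
    }

# Invented labels and scores: one positive falls below 0.5, one negative above it.
y_true = [1, 1, 1, 0, 0, 0]
y_score = [0.9, 0.6, 0.4, 0.55, 0.2, 0.1]
metrics = binary_metrics(y_true, y_score, cutoff=0.5)
print(metrics)
```

Raising the cutoff trades sensitivity for specificity, so the value is worth revisiting when the two classes are unbalanced.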

### 6. Visualise

In [66]:
# To do:
    # Parallel bootstrap resampling

# Calculate the bootstrapped confidence intervals 
model.calc_bootci(type='bc', bootnum=1000)                # decrease bootnum if this takes too long on your machine

Bootstrap Resample: 100%|██████████| 1000/1000 [00:01<00:00, 745.92it/s]
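
For readers unfamiliar with the technique, the sketch below computes a bias-corrected (BC) percentile bootstrap interval for a simple statistic. It assumes `type='bc'` refers to this bias-corrected method, and it resamples a 1-D array rather than refitting a model, so it illustrates the interval calculation only. The function name `bc_bootstrap_ci` and the random sample are hypothetical.

```python
import numpy as np
from statistics import NormalDist

def bc_bootstrap_ci(data, stat_fn, bootnum=1000, alpha=0.05, seed=0):
    """Bias-corrected percentile bootstrap confidence interval for stat_fn(data)."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    n = len(data)
    theta_hat = stat_fn(data)

    # Resample with replacement and recompute the statistic each time.
    boot = np.array([stat_fn(data[rng.integers(0, n, size=n)]) for _ in range(bootnum)])

    # Bias correction z0: where the original estimate sits in the bootstrap distribution.
    prop = float(np.clip(np.mean(boot < theta_hat), 1 / (bootnum + 1), bootnum / (bootnum + 1)))
    nd = NormalDist()
    z0 = nd.inv_cdf(prop)

    # Shift the nominal percentiles by 2 * z0 before reading off the interval.
    lo_p = nd.cdf(2 * z0 + nd.inv_cdf(alpha / 2))
    hi_p = nd.cdf(2 * z0 + nd.inv_cdf(1 - alpha / 2))
    return np.percentile(boot, [100 * lo_p, 100 * hi_p])

sample = np.random.default_rng(3).normal(loc=5.0, scale=1.0, size=200)
ci = bc_bootstrap_ci(sample, np.mean, bootnum=1000)
print(ci)
```

When the bootstrap distribution is centred on the original estimate, `z0` is close to zero and the result matches the plain percentile interval.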


In [67]:
# To do:
    # density figure: figure dimensions
    # weight alt: figure dimensions + ci95 + intersecting line
    
model.plot_projections(label=DataTable2[['Idx','SampleID']],
                       size=12,
                       scatterplot=False) 

In [None]:
# To do:
    # Plot density to check if there is flipping
    # Fix sorting

model.plot_loadings(PeakTable,
                    peaklist,
                    ylabel='Label',  # change ylabel to 'Name' 
                    sort=False)      # change sort to False

In [None]:
# To do:
    # Rename output in peakSheet_featureimportance for ANNs
    # Fix sorting

peakSheet_featureimportance = model.plot_featureimportance(PeakTable,
                                         peaklist,
                                         ylabel='Label',  # change ylabel to 'Name' 
                                         sort=True)      # change sort to False

### 7. Evaluate

In [None]:
model.booteval(XTknn, Y, bootnum=100)

In [None]:
# To do: Parallel

model.permutation_test(nperm=100) 

In [20]:
X = XTknn
meanX = np.mean(X, axis=0)
X0 = X - meanX

In [52]:
self = modelOptimise


In [47]:
from copy import deepcopy, copy
from sklearn.model_selection import StratifiedKFold
folds = 5

try:
    kmodel = deepcopy(self)  # Make a copy of the model
except TypeError:
    kmodel = copy(self)
# kmodel =
x_scores_cv = [None] * len(Y)

crossval_idx = StratifiedKFold(n_splits=folds, shuffle=True)
for train, test in crossval_idx.split(self.X, self.Y):
    print(test)
    X_train = self.X[train, :]
    Y_train = self.Y[train]
    X_test = self.X[test, :]
    kmodel.train(X_train, Y_train)
    kmodel.test(X_test)
    x_scores_cv_i = kmodel.model.x_scores_
    # Place each held-out sample's scores at its original index in x_scores_cv # Better way to do this
    for (idx, val) in zip(test, x_scores_cv_i):
        x_scores_cv[idx] = val.tolist()

[ 5  9 16 20 26 37 52 53 58 60 62 63 65 70 72 74 81]
[ 4  7 11 12 15 18 19 24 25 30 34 35 44 48 66 71 80]
[ 1  6 13 17 23 28 29 31 40 46 49 55 57 59 76 77 79]
[ 0  8 10 27 32 36 39 47 50 51 54 56 61 68 73 75]
[ 2  3 14 21 22 33 38 41 42 43 45 64 67 69 78 82]


In [30]:
self.model.x_scores_;

In [57]:
np.mean(self.a, axis=0)

array([[ 1.53288549e-01, -1.32215640e-01],
       [-2.67592100e-02,  8.47681679e-02],
       [ 1.16770931e-01,  2.12486240e-02],
       [ 1.36039708e-01,  1.84820059e-02],
       [-3.93873069e-02,  1.46037084e-01],
       [ 5.91131036e-02,  1.67274916e-03],
       [-1.56837902e-02, -4.30196658e-02],
       [-5.23082297e-02, -5.98336708e-02],
       [-9.90032236e-02, -1.63151838e-01],
       [ 6.55746524e-02,  1.40255014e-01],
       [-1.69135581e-01,  1.40856520e-01],
       [ 4.56138249e-02,  4.11776829e-02],
       [ 7.13424587e-02, -1.41781866e-01],
       [-1.66892313e-01,  1.35318677e-01],
       [ 3.36111832e-02, -2.02606260e-01],
       [ 1.17062390e-01,  4.27568234e-02],
       [ 4.12857261e-02, -2.94064185e-02],
       [ 1.85564862e-01,  3.01353141e-02],
       [-1.01031763e-01, -1.59308575e-01],
       [ 1.03770095e-01,  4.96286115e-02],
       [-4.84455489e-02, -1.63054420e-01],
       [-6.63534803e-02,  3.19829424e-02],
       [ 3.38171930e-03, -1.96281328e-01],
       [ 1.