<div style="background-color: #eee3d3">
<h1> 4-dimensionality_reduction.ipynb </h1>
</div>


---

### The purpose of this notebook is to reduce the dimensionality of our peak table:

One main challenge in metabolomics data analysis is dealing with high-dimensional data, e.g. a peak table of shape $(n, p)$ with $p > n$ (more features than samples). This can be a problem for downstream analysis, for example because models with more features than samples overfit easily.

To reduce the dimensionality, you can use:
- feature selection: find a subset of the input features
- feature extraction: project the high-dimensional space onto a space of fewer dimensions
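
The difference between the two approaches can be illustrated with a minimal sketch. It uses synthetic data in place of the peak table, and the variance threshold and number of components are arbitrary choices made for illustration only. `VarianceThreshold` stands in for feature selection and `PCA` for feature extraction.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import VarianceThreshold

# Synthetic stand-in for a peak table: 20 samples (rows) x 50 features (columns), so p > n.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 50))
X[:, :10] *= 5.0  # make the first 10 features much more variable than the rest

# Feature selection: keep a subset of the ORIGINAL columns (here, those with variance > 4).
selector = VarianceThreshold(threshold=4.0)
X_selected = selector.fit_transform(X)
kept_columns = selector.get_support(indices=True)

# Feature extraction: build NEW variables (principal components) from all columns.
pca = PCA(n_components=5)
X_extracted = pca.fit_transform(X)

print("selected:", X_selected.shape, "kept columns:", kept_columns)
print("extracted:", X_extracted.shape)
```

Selected features keep their original meaning (a given peak), whereas each extracted component is a combination of all peaks, which matters when the goal is to name potential biomarkers.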

_Hint: methods that can be tested (or not?) $\rightarrow$ Principal Component Analysis (PCA), Partial Least Squares (PLS), Canonical Correlation Analysis (CCA), autoencoders, ..._

_An autoencoder is a type of artificial neural network (deep learning). You can try this method at the very end of the project if you still have time and interest (__huge bonus if you manage to make it work, but only attempt it if you have time left: the main objective of this project is to find potential biomarkers__)._

---

Import a peak table that you previously imputed (so it has no missing values left) and processed (transformation and/or scaling and/or normalisation).

As before, think about quantitative, qualitative and graphical ways to present the outputs of the different methods!

---

### Very useful reference: https://cimcb.github.io/MetabWorkflowTutorial/Tutorial1.html

**See the few notes I put on the drive**

## 1) PCA testing (not done yet)

In [None]:
import os
import sys

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import cimcb_lite as cb

# Make the project's bin/ directory (one level above the notebook) importable
sys.path.append(os.path.join(os.path.dirname(os.getcwd()), 'bin'))

In [None]:
import normalisation_scaling_functions as nsf

In [None]:
# Uses Bayesian (MICE, BayesianRidge) imputation and mean normalisation (any other combination would work; this one was chosen arbitrarily)
path_peakTable = '/'.join(os.getcwd().split('/')[:-1]) + '/data/peakTable/original_peak_table/peakTable_HILIC_POS.csv'
peakTable_HILIC_POS = pd.read_csv(path_peakTable, sep=',', decimal='.', na_values='NA')
peakTable_HILIC_POS.head()

data = pd.read_csv('/'.join(os.getcwd().split('/')[:-1]) + '/data/peakTable/imputed_peak_tables/X_python_MICE_BayesianRidge.csv')
data_nm = nsf.normPeakTable(data, 'mean_normalisation', based='samples')
data_nm

In [None]:
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

In [None]:
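from sklearn.preprocessing import StandardScaler

# NOTE: this cell appears to be adapted from the CIMCB tutorial linked above.
# `peak` is not defined anywhere in this notebook, and `data` must contain a
# 'SampleType' column, so the cell will not run until both are provided.
# Extract X matrix
names = peak['Name']
x = data[names].values
x = np.log(x)                           # natural log; values must be strictly positive
x = StandardScaler().fit_transform(x)   # autoscale each feature (zero mean, unit variance)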

# Create and fit PCA
pca = PCA(n_components=2)
scores = pca.fit_transform(x)
label = data['SampleType']

# Split scores into sample and QC
Sample_scores = scores[label == 'Sample',:]
QC_scores = scores[label == 'QC',:]

# Plot Sample score and QC score
fig = plt.figure(figsize=(8,8))
h1 = plt.scatter(Sample_scores[:,0],Sample_scores[:,1],edgecolors='Black', facecolors='Green',s=100,alpha=0.5)
h2 = plt.scatter(QC_scores[:,0],QC_scores[:,1], edgecolors='Black', facecolors='Red',s=100,alpha=0.5)

# Add legend, labels, and title
plt.legend((h1,h2),('Sample','QC'),fontsize=15)
plt.xlabel('PC1', fontsize=15)
plt.ylabel('PC2', fontsize=15)
plt.title('Quality Control PCA plot',fontsize=20)

# Show plot
plt.show()
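
The log transform and `StandardScaler` step used in the cell above can be checked on a small example. This is a minimal sketch on synthetic, strictly positive intensities (the values and shape are made up for illustration); it shows that `StandardScaler` is column-wise centring followed by division by the population standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical intensities: 12 samples x 4 features, strictly positive.
rng = np.random.default_rng(2)
intensities = rng.lognormal(mean=10.0, sigma=1.0, size=(12, 4))

x_log = np.log(intensities)                        # natural log
x_scaled = StandardScaler().fit_transform(x_log)   # autoscaling, per column

# Same operation written by hand (StandardScaler uses ddof=0).
manual = (x_log - x_log.mean(axis=0)) / x_log.std(axis=0)

print(np.allclose(x_scaled, manual))
print(x_scaled.mean(axis=0).round(6), x_scaled.std(axis=0).round(6))
```

Without scaling, PCA is dominated by the features with the largest variance, which in a peak table are usually the most intense peaks.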

In [None]:
data_nm

###  Creating PCA object

In [None]:
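# data_nm already has samples in rows and features in columns, so no transpose is needed.
# Keep the first 5 principal components; the scores are computed in the next cell.
pca = PCA(n_components=5)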

In [None]:
scores = pca.fit_transform(data_nm)

In [None]:
scores

In [None]:
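# Columns of scores are 0-indexed: column 3 is PC4 and column 1 is PC2
fig = plt.figure(figsize=(8,8))
h = plt.scatter(scores[:,3],scores[:,1],edgecolors='Black', facecolors='Black',s=10,alpha=1)
plt.xlabel('PC4')
plt.ylabel('PC2')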
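
A quantitative complement to a scores scatter plot is the fraction of variance explained by each component, which also helps decide which pair of components is worth plotting. The sketch below uses a synthetic matrix (named `X_demo` here, a hypothetical stand-in for `data_nm`) built from two hidden sources of variation plus noise.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for data_nm (samples in rows, features in columns).
rng = np.random.default_rng(1)
latent = rng.normal(size=(30, 2))        # two hidden sources of variation
loadings = rng.normal(size=(2, 40))
X_demo = latent @ loadings + 0.1 * rng.normal(size=(30, 40))

pca_demo = PCA(n_components=5)
scores_demo = pca_demo.fit_transform(X_demo)

# Fraction of the total variance captured by each component, and the running total.
ratio = pca_demo.explained_variance_ratio_
cumulative = np.cumsum(ratio)
for i, (r, c) in enumerate(zip(ratio, cumulative), start=1):
    print(f"PC{i}: {r:.1%} of variance (cumulative {c:.1%})")
```

The same `explained_variance_ratio_` values can be added to axis labels or drawn as a scree plot.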

In [None]:
tumo_type=peakTable_HILIC_POS["TypTumo"]
tumo_type=tumo_type.fillna("Non-case")

In [None]:
groups=peakTable_HILIC_POS["Groups"]

### Separation into Incident and Non-case

In [None]:
scores_incedent = scores[groups == 'Incident',:]
scores_noncase = scores[groups == 'Non-case',:]
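
Splitting the scores with a boolean mask only gives the right rows if the labels are in the same row order as the matrix passed to PCA. A minimal sketch with made-up scores and labels (`scores_demo` and `groups_demo` are hypothetical names) shows the mechanics and a cheap length check.

```python
import numpy as np
import pandas as pd

# Hypothetical scores: 6 samples x 2 components, with one group label per sample.
scores_demo = np.arange(12, dtype=float).reshape(6, 2)
groups_demo = pd.Series(
    ["Incident", "Non-case", "Incident", "Non-case", "Non-case", "Incident"]
)

# One label per row of scores; this does not prove the order matches, only the length.
assert len(groups_demo) == scores_demo.shape[0]

mask = (groups_demo == "Incident").to_numpy()
incident = scores_demo[mask, :]
noncase = scores_demo[~mask, :]

print(incident.shape, noncase.shape)
```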

In [None]:
# Plot Incident scores and Non-case scores
fig = plt.figure(figsize=(8,8))
h1 = plt.scatter(scores_incedent[:,0],scores_incedent[:,1],edgecolors='Black', facecolors='Green',s=100)
h2 = plt.scatter(scores_noncase[:,0],scores_noncase[:,1], edgecolors='Black', facecolors='Blue',s=100)

# Add legend, labels, and title
plt.legend((h1,h2),('Incident','Non-case'),fontsize=15)
plt.xlabel('PC1', fontsize=15)
plt.ylabel('PC2', fontsize=15)
plt.title('PCA scores: Incident vs Non-case',fontsize=20)

# Show plot
plt.show()

### Non-case, HCC_Wide and HCC

In [None]:
# Extract the metabolite data (already imputed and normalised) as a NumPy array

X = data_nm.values                                   # X matrix: samples in rows, features in columns
#Xlog = np.log10(X)                                  # Log scale (base-10)
#Xscale = cb.utils.scale(Xlog, method='auto')        # methods include auto, range, pareto, vast, and level
#Xknn = cb.utils.knnimpute(Xscale, k=3)              # missing value imputation (knn - 3 nearest neighbors)

#print("Xknn: {} rows & {} columns".format(*Xknn.shape))

cb.plot.pca(X,
            pcx=1,                                   # pc for x-axis
            pcy=5,                                   # pc for y-axis
            group_label=tumo_type)                   # group label of each sample (tumour type, 'Non-case' when missing)

## 2) PLS (Partial Least Squares)

In [None]:
path_peakTable = '/'.join(os.getcwd().split('/')[:-1]) + '/data/peakTable/original_peak_table/peakTable_HILIC_POS.csv'
peakTable = pd.read_csv(path_peakTable, sep=',', decimal='.', na_values='NA')
first_cols = peakTable.iloc[:, ['variable' not in col for col in peakTable.columns]]
first_cols

In [None]:
full_data = pd.concat([first_cols, data_nm], axis=1)
full_data

In [None]:
# Create a binary Y vector for stratifying the samples
outcomes = full_data['TypTumo']                                              # Column that holds the outcome class
Y = [1 if outcome in ('HCC', 'HCC_Wide') else 0 for outcome in outcomes]     # Binary Y (HCC or HCC_Wide = 1, anything else = 0)
Y = np.array(Y)
Y

In [None]:
# Split full_data and Y into train and test (with stratification)
dataTrain, dataTest, Ytrain, Ytest = train_test_split(full_data, Y, test_size=0.25, stratify=Y,random_state=10)

print("DataTrain = {} samples with {} positive cases.".format(len(Ytrain),sum(Ytrain)))
print("DataTest = {} samples with {} positive cases.".format(len(Ytest),sum(Ytest)))
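
The effect of `stratify` can be seen on a toy example: the proportion of positive cases is preserved in both subsets. This is a sketch with made-up, imbalanced labels, not the project data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 20 positive cases out of 100 samples.
y_demo = np.array([1] * 20 + [0] * 80)
X_demo = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=10
)

# Both subsets keep the 20% positive rate of the full data set.
print("train positives:", y_tr.mean(), "test positives:", y_te.mean())
```

Without stratification, a small test set drawn from rare cases can end up with very few (or no) positives, which makes its performance estimates unreliable.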

In [None]:
# Extract and scale the metabolite data from dataTrain

peaklist = peakTable.columns[5:]                            # Metabolite names; assumes the first 5 columns of peakTable are metadata
XT = dataTrain[peaklist]                                    # Extract X matrix from dataTrain using peaklist
#XTlog = np.log(XT)                                          # Log scale (natural log; use np.log10 for base-10)
XTscale = cb.utils.scale(XT, method='auto')                 # methods include auto, pareto, vast, and level
XTknn = cb.utils.knnimpute(XTscale, k=3)                    # knn imputation (3 nearest neighbours); kept from the tutorial, the data are already imputed
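
When the test set is used later, it should be scaled with the mean and standard deviation learned on the training set, not with its own statistics. The sketch below illustrates this with scikit-learn's `StandardScaler` on synthetic data. It is a stand-in for autoscaling in general, not the `cb.utils.scale` function, whose exact standard deviation convention may differ.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic train and test matrices (samples in rows, features in columns).
rng = np.random.default_rng(3)
X_train_demo = rng.normal(loc=5.0, scale=2.0, size=(30, 6))
X_test_demo = rng.normal(loc=5.0, scale=2.0, size=(10, 6))

scaler = StandardScaler().fit(X_train_demo)      # learn mean and std from training data only
X_train_scaled = scaler.transform(X_train_demo)
X_test_scaled = scaler.transform(X_test_demo)    # reuse the training mean and std

print(X_train_scaled.mean(axis=0).round(6))      # zero by construction
print(X_test_scaled.mean(axis=0).round(2))       # close to zero, but not exactly
```

Scaling the test set with its own statistics would leak information about the test samples into the evaluation.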



In [None]:
# initialise cross_val kfold (stratified)
cv = cb.cross_val.kfold(model=cb.model.PLS_SIMPLS,                   # model; we are using the PLS_SIMPLS model
                        X=XTknn,                                 
                        Y=Ytrain,                               
                        param_dict={'n_components': [1,2,3,4,5,6]},  # The numbers of latent variables to search                
                        folds=5,                                     # folds; for the number of splits (k-fold)
                        bootnum=100)                                 # num bootstraps for the Confidence Intervals

# run the cross validation
cv.run()  

In [None]:
cv.plot()

In [None]:
# TODO: I do not know how to interpret these plots yet; this part is still unclear to me.