# Scan

This package is used to evaluate large-scale mass spectrometry experiments acquired with Dilute-and-Shoot Flow-Injection-Analysis Tandem Mass Spectrometry (DS-FIA-MS/MS). The functions in this module require the data and result tables produced by database and method development.

Main functions:
- Preprocessing
- Univariate analysis
- Multivariate analysis
- Cluster analysis
- Pathway analysis
- Network analysis

## Import packages

In [None]:
%load_ext autoreload
%autoreload 2

import sys
sys.path.append('..')

from supplementcode import scan

## Preprocessing
### Set parameter
This cell is used for operator input. Provide a raw path to the .xlsx result file from MultiQuant, based on a measurement batch acquired with DSFIApy method development. In addition, provide a raw path to the reports.xlsx file created with method development. Set a signal/noise threshold as a preliminary signal filter (e.g. 10) and the required technical precision as a relative standard deviation in percent (e.g. 20) for quality control. The workflow can be used iteratively: if the algorithm identifies specific metabolites, provide a list of their KEGG compound IDs to highlight them throughout the evaluation. Suitable plotting parameters can be determined iteratively by the operator.
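
The two quality filters can be pictured with a small sketch. It assumes that precision is the relative standard deviation (sample standard deviation divided by the mean, in percent) of replicate peak areas, and that a metabolite is kept when it meets both limits. The function names and the replicate values are hypothetical and are not part of `scan`.

```python
from statistics import mean, stdev

def relative_standard_deviation(values):
    """Return the RSD in percent (sample standard deviation / mean * 100)."""
    return 100.0 * stdev(values) / mean(values)

def passes_filters(areas, signal_noise, min_signal_noise=10, max_rsd=20):
    """Hypothetical check: keep a metabolite if S/N and precision are acceptable."""
    precise = relative_standard_deviation(areas) <= max_rsd
    return signal_noise >= min_signal_noise and precise

# Hypothetical replicate peak areas for two metabolites
stable = [100.0, 105.0, 95.0]   # RSD of 5 %
noisy = [100.0, 160.0, 40.0]    # RSD of 60 %

print(passes_filters(stable, signal_noise=25))
print(passes_filters(noisy, signal_noise=25))
```

Whether the package compares with `>=` or `>` at the limits is not stated here, so the boundary handling above is an assumption.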

In [None]:
# Results file
path_file = r'C:\...\trace_samples\trace_P5-P7.xlsx'
# Report file
path_report = r'C:\...\examples\Database\Development\Projects\cgb\cgb\reports.xlsx'
# Signal/Noise filter
pre_signal_noise = 10
# Relative standard deviation filter in % (e.g. 15, 20)
pre_precision = 20
# Metabolite focus list
pre_list_relevant = []

# Labelsize summary
pre_labelsize_identification = 12
# Figsize summary, identification
pre_figsize_identification = (14,6)
# Figsize summary, quantification
pre_figsize_quantification = (10,6)

### Calculation
This is the preprocessing function. It's used for initial formatting and preprocessing of result tables:
- Intra-batch correction
    - Quality control (QC) samples
    - Locally weighted regression and cubic spline fit
    - Normalization with QC
    
    
- Outlier detection
    
- Classification:
    - Qualitative interpretation and relative quantification
    - Inhouse, literature and prediction data used
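
The idea behind QC-based intra-batch correction can be sketched as follows. This is a simplified illustration only: it swaps the locally weighted regression and cubic spline fit for plain piecewise-linear interpolation between QC injections, and all names and values are hypothetical. Each sample is divided by the QC trend at its injection position and rescaled to the QC median.

```python
from statistics import median

def interpolate(x, xs, ys):
    """Piecewise-linear interpolation, clamped to the outer QC values."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    for i in range(len(xs) - 1):
        x0, x1 = xs[i], xs[i + 1]
        if x0 <= x <= x1:
            y0, y1 = ys[i], ys[i + 1]
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

def correct_drift(order, areas, is_qc):
    """Normalize areas to the QC trend; QC injections must be sorted by order."""
    qc_x = [o for o, q in zip(order, is_qc) if q]
    qc_y = [a for a, q in zip(areas, is_qc) if q]
    target = median(qc_y)
    return [a * target / interpolate(o, qc_x, qc_y) for o, a in zip(order, areas)]

# Hypothetical batch with a linear signal drift; every second injection is a QC
order = [1, 2, 3, 4, 5]
areas = [100.0, 90.0, 80.0, 70.0, 60.0]
is_qc = [True, False, True, False, True]

corrected = correct_drift(order, areas, is_qc)
print(corrected)
```

In this toy batch the drift is removed completely because samples and QCs follow the same linear trend; real data would need the smoother fits named in the list above.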

In [None]:
inp = scan.scan_preprocessing(
    path_file = path_file, 
    path_report = path_report,
    pre_signal_noise = pre_signal_noise, 
    pre_precision = pre_precision, 
    pre_list_relevant = pre_list_relevant,
    pre_labelsize_identification = pre_labelsize_identification, 
    pre_figsize_identification = pre_figsize_identification, 
    pre_figsize_quantification = pre_figsize_quantification
)

## Univariate analysis
### Set parameter
This cell is used for operator input. Provide the probability of error for univariate analysis and hypothesis tests (e.g. 0.05). Select whether an internal decision tree is used for automated hypothesis test selection. Depending on the experimental design, the samples are either dependent or independent of each other. Additionally, provide one of the correction methods (e.g. FDR, Holm-Bonferroni, Bonferroni) to correct for the multiple-comparison problem. The fold change is only used for plotting purposes, since the x-fold limit is arbitrary. Plotting parameters can be tested by the operator.
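
To show what a multiple-comparison correction does to a list of p-values, here is a minimal sketch of the Benjamini-Hochberg procedure. It assumes that the 'FDR' option refers to Benjamini-Hochberg, which this notebook does not state; the function name and the p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest p-value to keep the result monotone
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted

# Hypothetical raw p-values for four metabolites
raw = [0.01, 0.04, 0.03, 0.20]
adjusted = benjamini_hochberg(raw)
significant = [p <= 0.05 for p in adjusted]
print(adjusted)
print(significant)
```

Three of the four raw p-values are below 0.05, but only the smallest one stays below it after adjustment.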

In [None]:
# Probability of error
uv_alpha_univariate = 0.05
# Use univariate decision tree, else t-test (dependent, independent)
uv_decision_tree = False 
# Dependent (True) or independent (False) samples
uv_paired_samples = True 
# Multi-comparison correction; 'Holm-Bonferroni', 'Bonferroni', 'FDR'
uv_correction = 'FDR' 

# Fold change
uv_fold_change = 1
# Labelsize volcano plot
uv_labelsize_vulcano = 12
# Figsize volcano plot
uv_figsize_vulcano = (10,10)
# Use full names in volcano plot
uv_label_full_vulcano = True

### Calculation
This is the univariate analysis function. It's used to acquire fold changes and to conduct hypothesis tests. A non-parametric Kruskal-Wallis omnibus test of center is also conducted for use in further analysis.

When the decision tree is selected, it chooses among the following procedures:
- Test of normality:
    - Shapiro-Wilk


- Test of variance:
    - Bartlett (normally distributed)
    - Levene   (not normally distributed)
        
    
- Test of center:
    - t-test independent (normally distributed, equal variance)
    - Welch (normally distributed, unequal variances)
    - Wilcoxon rank-sum (not normally distributed, unpaired)
    - Wilcoxon signed-rank (not normally distributed, paired)
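
The branching described above can be written as a small selection function. This is a sketch of the logic only, not the package's implementation: it takes the outcomes of the normality and variance tests as booleans and returns the name of the test of center. The paired, normally distributed branch is not listed above; the sketch assumes the dependent t-test for it.

```python
def select_variance_test(normal):
    """Pick the test of variance from the normality result."""
    return "Bartlett" if normal else "Levene"

def select_center_test(normal, equal_variance, paired):
    """Pick the test of center from normality, variance and design."""
    if not normal:
        return "Wilcoxon signed-rank" if paired else "Wilcoxon rank-sum"
    if paired:
        # Assumption: this branch is not listed in the tree above
        return "t-test dependent"
    return "t-test independent" if equal_variance else "Welch"

print(select_variance_test(normal=True))
print(select_center_test(normal=True, equal_variance=False, paired=False))
print(select_center_test(normal=False, equal_variance=True, paired=True))
```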

In [None]:
inp = scan.scan_uv(
    inp = inp,
    uv_alpha_univariate = uv_alpha_univariate, 
    uv_fold_change = uv_fold_change,
    uv_decision_tree = uv_decision_tree, 
    uv_paired_samples = uv_paired_samples, 
    uv_correction = uv_correction, 
    uv_labelsize_vulcano = uv_labelsize_vulcano, 
    uv_figsize_vulcano = uv_figsize_vulcano, 
    uv_label_full_vulcano = uv_label_full_vulcano,    
)

## Multivariate analysis
### Set parameter
This cell is used for operator input. Provide the information for multivariate modelling in the form of scaling and cross-validation settings.

Feature scaling can be provided with the following scaling methods:
- Auto scaling (scaling = True, scaling_method = 'auto')
- Range scaling (scaling = True, scaling_method = 'range')
- Pareto scaling (scaling = True, scaling_method = 'pareto')
- Vast scaling (scaling = True, scaling_method = 'vast')
- Level scaling (scaling = True, scaling_method = 'level')
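
The five scaling methods differ only in the divisor applied to the mean-centered values. The sketch below uses the definitions commonly found in the metabolomics literature and the sample standard deviation; both are assumptions, since the package's exact formulas are not shown here. The function name is hypothetical.

```python
from statistics import mean, stdev

def scale_feature(values, method="auto"):
    """Mean-center one feature and divide by a method-specific factor."""
    m = mean(values)
    s = stdev(values)
    if method == "auto":
        divisor = s                          # unit variance
    elif method == "range":
        divisor = max(values) - min(values)  # biological range
    elif method == "pareto":
        divisor = s ** 0.5                   # square root of the standard deviation
    elif method == "vast":
        divisor = s * s / m                  # auto scaling times mean / sd
    elif method == "level":
        divisor = m                          # relative to the mean
    else:
        raise ValueError(f"unknown scaling method: {method}")
    return [(v - m) / divisor for v in values]

# Hypothetical peak areas of one metabolite across three samples
areas = [2.0, 4.0, 6.0]
for name in ("auto", "range", "pareto", "vast", "level"):
    print(name, scale_feature(areas, name))
```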

The following cross validation iterator combinations are possible:
- k-fold (cv_iterator = 'kfold', stratified = False, repeated = False)
- stratified k-fold (cv_iterator = 'kfold', stratified = True, repeated = False)
- repeated k-fold (cv_iterator = 'kfold', stratified = False, repeated = True)
- repeated stratified k-fold (cv_iterator = 'kfold', stratified = True, repeated = True)

Stratification is most likely necessary for small data sets.
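
What stratification means for fold assignment can be shown with a short sketch. It distributes the samples of each class round-robin over the folds, without shuffling, so every fold keeps roughly the class proportions of the full data set. The function name and labels are hypothetical; the package may rely on a library iterator instead.

```python
from collections import defaultdict

def stratified_folds(labels, k):
    """Assign sample indices to k folds so each class is spread evenly."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    position = 0
    for indices in by_class.values():
        for index in indices:
            folds[position % k].append(index)
            position += 1
    return folds

# Hypothetical small data set: four samples of class A, two of class B
labels = ["A", "A", "A", "A", "B", "B"]
folds = stratified_folds(labels, k=2)
print(folds)
```

With an unstratified split of such a small set, one fold could end up without any class B sample, which is why stratification matters most for small data sets.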

Plotting parameters can be tested by the operator.

In [None]:
# Scaling
mv_scaling = True
# Scaling method
mv_scaling_method = 'range'
# Cross-validation, iterator
mv_cv_iterator = 'kfold'
# Cross-validation, stratification
mv_cv_stratified = True
# Cross-validation, repetition
mv_cv_repeated = True
# Cross-validation, iterator number
mv_cv_kfold = 7
# Cross-validation, repetition number
mv_cv_repetition = 2

# Labelsize plots
mv_labelsize_mv = 12
# Figsize score plots
mv_figsize_score = (6,6)
# Figsize scree plots
mv_figsize_scree = (6,4)
# Figsize vip score plots
mv_figsize_vip = (4.5,8)
# Use full label in vip score plots
mv_label_full_vip = True
# Show top vips
mv_vip_number = 30

### Calculation
This is the multivariate analysis function. It provides sample and feature diagnostics for further analysis. 

The following models are provided:
- Principal component analysis (PCA)
- Partial least squares discriminant analysis (PLS-DA)

The PCA model is used as an unsupervised sample diagnostic for quality control evaluation. The supervised PLS-DA model identifies discriminant features under multi-collinearity and acts as a classifier. Hyperparameter optimization is conducted automatically by scree analysis with model validation parameters. Feature diagnostics are provided by beta coefficients and variable importance in projection (VIP) scores. Confidence intervals are bootstrapped. Cross-validation is conducted based on operator input. Because of the pipeline approach, data leakage (e.g. scaling the complete data set before splitting it into training and test sets) is avoided.
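
The leakage point can be illustrated with a minimal sketch: the scaling parameters are estimated on the training fold only and then applied unchanged to the test fold. The helper names and values are hypothetical, and auto scaling is used purely as an example.

```python
from statistics import mean, stdev

def fit_auto_scaler(train):
    """Estimate center and spread from the training fold only."""
    return mean(train), stdev(train)

def apply_scaler(values, params):
    """Scale any fold with parameters estimated elsewhere."""
    center, spread = params
    return [(v - center) / spread for v in values]

# Hypothetical split of one feature into a training and a test fold
train = [1.0, 2.0, 3.0]
test = [10.0]

params = fit_auto_scaler(train)            # the test fold is never seen here
scaled_train = apply_scaler(train, params)
scaled_test = apply_scaler(test, params)
print(scaled_train, scaled_test)
```

Fitting the scaler on `train + test` instead would let the test value shift the mean and standard deviation, so the model would indirectly see the held-out sample.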

Model validation is based on goodness-of-fit (R2X, R2Y) and goodness-of-prediction (Q2Y). Further validation procedures consist of the receiver operating characteristic (ROC, one-vs-all), the area under the ROC curve (AUROC) and permutation-based hypothesis tests for significance.
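
R2Y and Q2Y share the same formula, one minus the ratio of the residual to the total sum of squares, and differ only in where the predictions come from: fitted values for R2Y, cross-validated predictions for Q2Y. A small sketch under that standard definition, with hypothetical values and a hypothetical function name:

```python
def explained_fraction(observed, predicted):
    """Return 1 - (residual sum of squares / total sum of squares)."""
    center = sum(observed) / len(observed)
    total = sum((y - center) ** 2 for y in observed)
    residual = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    return 1.0 - residual / total

# Hypothetical dummy-coded class membership and two sets of predictions
y_true = [0.0, 0.0, 1.0, 1.0]
y_fitted = [0.1, 0.2, 0.8, 0.9]            # model fitted on all samples
y_cross_validated = [0.3, 0.4, 0.6, 0.7]   # predictions for held-out samples

r2y = explained_fraction(y_true, y_fitted)
q2y = explained_fraction(y_true, y_cross_validated)
print(r2y, q2y)
```

A Q2Y that is much lower than R2Y, as in this toy example, is the usual sign of overfitting.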

In [None]:
inp = scan.scan_mv(
    inp, 
    mv_scaling = mv_scaling, 
    mv_scaling_method = mv_scaling_method, 
    mv_cv_iterator = mv_cv_iterator, 
    mv_cv_stratified = mv_cv_stratified,
    mv_cv_repeated = mv_cv_repeated, 
    mv_cv_kfold = mv_cv_kfold, 
    mv_cv_repetition = mv_cv_repetition,
    mv_labelsize_mv = mv_labelsize_mv, 
    mv_figsize_score = mv_figsize_score, 
    mv_figsize_scree = mv_figsize_scree, 
    mv_figsize_vip = mv_figsize_vip, 
    mv_label_full_vip = mv_label_full_vip,
    mv_vip_number = mv_vip_number
)

## Cluster analysis
### Set parameter
This cell is used for operator input. Provide the information for unsupervised cluster analysis in the form of univariate and multivariate filter parameters.

Plotting parameters can be tested by the operator.

In [None]:
# Use Kruskal-Wallis filter
cluster_threshold_kruskal = True
# Use beta-coefficient filter
cluster_threshold_beta = True
# Use vip filter
cluster_threshold_vip = True
# Use vip filter for user list
cluster_threshold_vip_relevant = False

# Clustermap orientation; 'horizontal', 'vertical'
cluster_orientation = 'horizontal'
# Maximum number of vips to display
cluster_vip_top_number = 500
# Average cluster map
cluster_mean_area = True
# Labelsize cluster map
cluster_labelsize_cluster = 12
# Figsize cluster map
cluster_figsize_cluster = (10,4)

### Calculation
This is the cluster analysis function. It provides sample and feature diagnostics in an unsupervised hierarchical cluster analysis approach.

In [None]:
inp = scan.scan_cluster(
    inp, 
    cluster_threshold_kruskal = cluster_threshold_kruskal, 
    cluster_threshold_beta = cluster_threshold_beta, 
    cluster_threshold_vip = cluster_threshold_vip, 
    cluster_threshold_vip_relevant = cluster_threshold_vip_relevant,
    cluster_orientation = cluster_orientation,
    cluster_vip_top_number = cluster_vip_top_number, 
    cluster_mean_area = cluster_mean_area, 
    cluster_labelsize_cluster = cluster_labelsize_cluster, 
    cluster_figsize_cluster = cluster_figsize_cluster
)

## Pathway analysis
### Set parameter
This cell is used for operator input. Provide a raw path to the corresponding .xlsx organism file created with database. For hypothesis testing, provide the probability of error (e.g. 0.05) and the multi-comparison correction method. For pathway topology analysis, provide the pathway centrality measure (e.g. betweenness).

Plotting parameters can be tested by the operator.

In [None]:
# Organism file
path_org = r'C:\...\examples\Database\Database\Pathways\cgb.xlsx'
# Probability of error
pathway_alpha = 0.05
# Multi-comparison correction; 'Holm-Bonferroni', 'Bonferroni', 'FDR'
pathway_correction = 'FDR'
# Analyte selection; 'univariate', 'multivariate'
pathway_selection = 'multivariate'
# Topology analysis centrality measure; 'degree', 'betweenness', 'closeness', 'load' or 'harmonic'
pathway_measure = 'betweenness' 

# Labelsize plots
pathway_labelsize_pathway = 12
# Figsize plots
pathway_figsize_pathway = (6,6)
# Maximum number of pathways to display
pathway_number_pathways_top = 20

### Calculation
This is the pathway analysis function. The over-representation analysis (ORA) identifies pathways that changed significantly between conditions.
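
ORA is commonly computed with a one-sided hypergeometric test: given how many measured metabolites belong to a pathway, how likely is it to find at least the observed number of them among the selected metabolites by chance? The sketch below assumes this test, which the notebook does not state; the function name and counts are hypothetical.

```python
from math import comb

def ora_p_value(hits, selected, pathway_size, background):
    """Return P(X >= hits) for a hypergeometric draw without replacement."""
    total = comb(background, selected)
    upper = min(selected, pathway_size)
    favourable = sum(
        comb(pathway_size, k) * comb(background - pathway_size, selected - k)
        for k in range(hits, upper + 1)
    )
    return favourable / total

# Hypothetical counts: 10 measured metabolites, 5 of them in the pathway,
# 2 selected as changed, both of which belong to the pathway
p_value = ora_p_value(hits=2, selected=2, pathway_size=5, background=10)
print(p_value)
```

The resulting p-values, one per pathway, would then go through the chosen multi-comparison correction.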

With the metabolite set enrichment analysis (MSEA), additional metabolite areas are provided for pathway significance analysis.

Pathway topology analysis extends the ORA by modelling organism-specific pathways as networks. The corresponding network metrics serve as weights and as additional information for evaluating pathway significance.
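
As an illustration of a network metric, the sketch below computes degree centrality, the simplest of the selectable measures, for a tiny undirected pathway graph. The graph, the compound names and the function name are hypothetical, and treating the pathway as undirected is an assumption.

```python
def degree_centrality(edges):
    """Return the fraction of other nodes each node is connected to."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    n = len(neighbours)
    return {node: len(adjacent) / (n - 1) for node, adjacent in neighbours.items()}

# Hypothetical linear pathway: glucose -> pyruvate -> lactate
edges = [("glucose", "pyruvate"), ("pyruvate", "lactate")]
centrality = degree_centrality(edges)
print(centrality)
```

A metabolite with high centrality, such as the middle compound here, would contribute more weight to the pathway than a peripheral one.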

In [None]:
inp = scan.scan_pathway(
    inp,
    path_org = path_org,
    pathway_alpha = pathway_alpha, 
    pathway_correction = pathway_correction,
    pathway_selection = pathway_selection,
    pathway_measure = pathway_measure,
    pathway_labelsize_pathway = pathway_labelsize_pathway, 
    pathway_figsize_pathway = pathway_figsize_pathway, 
    pathway_number_pathways_top = pathway_number_pathways_top
)