xiRT needs two sets of parameters that are supplied via YAML files. The xiRT parameters contain the settings that define the network architecture and the learning tasks; with different or new types of chromatography, this is where the learning behavior is adjusted. The learning parameters define the training data (e.g. the FDR cutoff) and higher-level learning behavior, for instance loading pretrained models and cross-validation settings.
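Both parameter files are passed to xiRT on the command line. A minimal invocation sketch (flag names as used in the xiRT README; the input and output names here are placeholders):

```bash
xirt -i psms.csv -o results/ -x xirt_params.yaml -l learning_params.yaml
```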
The xiRT parameters can be divided into several categories that reflect either the individual layers of the network or higher-level settings. Since the input file structure is very dynamic, the xiRT configuration needs to be done with care. For example, the RT information in the input data is encoded in the predictions section; here the column names of the RT data need to be passed, and the learning options in the output section must be adapted accordingly. Each prediction task needs the parameters x-activation, x-column, x-dimension, x-loss, x-metrics and x-weight, where x is the RT column.
Here is an example YAML file with comments (from xiRT v1.0.32):
```yaml
LSTM:
  activation: tanh                   # activation function
  activity_regularization: l2        # regularization to use
  activityregularizer_value: 0.001   # lambda value
  bidirectional: true                # if the RNN cell should work bidirectionally
  kernel_regularization: l2          # kernel regularization method
  kernelregularizer_value: 0.001     # lambda value
  lstm_bn: true                      # use batch normalization
  nlayers: 1                         # number of layers
  type: GRU                          # RNN layer type to use: GRU, LSTM, CuDNNGRU or CuDNNLSTM
  units: 50                          # number of units in the RNN cell

dense:                  # parameters for the dense layers
  activation:           # activation function to use, one entry per layer
    - relu
    - relu
    - relu
  dense_bn:             # use batch normalization
    - true
    - true
    - true
  dropout:              # dropout rate
    - 0.1
    - 0.1
    - 0.1
  kernel_regularizer:   # regularizer for the kernel
    - l2
    - l2
    - l2
  neurons:              # number of neurons per layer
    - 300
    - 150
    - 75
  nlayers: 3            # number of layers; the per-layer lists above must match this number
  regularization:       # use regularization
    - true
    - true
    - true
  regularizer_value:    # lambda values
    - 0.001
    - 0.001
    - 0.001

embedding:              # parameters for the embedding layer
  length: 50            # embedding vector dimension

learning:               # learning phase parameters
  batch_size: 128       # observations to use per batch
  epochs: 75            # maximal epochs to train
  learningrate: 0.001   # initial learning rate
  verbose: 1            # verbose training information

output:                 # important learning parameters
  callback-path: data/results/callbacks/  # network architectures and weights are stored here
  # the following parameters need to be defined for each chromatography variable
  hsax-activation: sigmoid         # activation function; use linear for regression
  hsax-column: hsax_ordinal        # output column name
  hsax-dimension: 10               # equals the number of fractions
  hsax-loss: binary_crossentropy   # loss function; must be adapted for regression / classification
  hsax-metrics: mse                # metric to report
  hsax-weight: 50                  # weight to be used in the loss function
  rp-activation: linear
  rp-column: rp
  rp-dimension: 1
  rp-loss: mse
  rp-metrics: mse
  rp-weight: 1
  scx-activation: sigmoid
  scx-column: scx_ordinal
  scx-dimension: 9
  scx-loss: binary_crossentropy
  scx-metrics: mse
  scx-weight: 50

siamese:                    # parameters for the siamese part
  use: True                 # use a siamese network
  merge_type: add           # how to combine the individual networks after the siamese part
  single_predictions: True  # also use single-peptide predictions

callbacks:                  # callbacks to use
  check_point: True
  log_csv: True
  early_stopping: True
  early_stopping_patience: 15
  tensor_board: False
  progressbar: True
  reduce_lr: True
  reduce_lr_factor: 0.5
  reduce_lr_patience: 15

predictions:  # parameters that define how the input variables are treated
  # "continues" means that linear (regression) activation functions are used for the learning;
  # if this is desired, the parameters above must also be adapted (weight, loss, metric, etc.)
  continues:
    - rp
  # simply write fractions: [] if no fraction prediction is desired.
  # If (discrete) fraction numbers should be used for the learning, they need to be
  # indicated here. For fractions, either ordinal regression or classification can be
  # used (regression is possible too).
  fractions:
    - scx
    - hsax
```
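To support a new chromatography dimension, the output section gains a matching block of six parameters and the new column is registered in the predictions section. A sketch for a hypothetical HILIC fractionation with 12 fractions (all hilic names are placeholders, following the pattern above):

```yaml
output:
  hilic-activation: sigmoid        # fraction task; use linear for regression
  hilic-column: hilic_ordinal      # column in the input data holding the targets
  hilic-dimension: 12              # equals the number of fractions
  hilic-loss: binary_crossentropy
  hilic-metrics: mse
  hilic-weight: 50
predictions:
  fractions:
    - hilic
```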
Apart from the neural network architecture definitions, the encoding of the target variables is also configured in the YAML.
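For the fraction tasks, the combination of sigmoid activations, a binary_crossentropy loss and an output dimension equal to the number of fractions points to a cumulative (ordinal) target encoding. A minimal sketch of this idea, assuming fraction k of n is encoded as k ones followed by n-k zeros (the exact encoding is internal to xiRT):

```python
import numpy as np

def ordinal_encode(fraction, n_fractions):
    # hypothetical cumulative encoding: fraction 3 of 9 -> [1, 1, 1, 0, 0, 0, 0, 0, 0]
    return (np.arange(1, n_fractions + 1) <= fraction).astype(float)

print(ordinal_encode(3, 9))  # [1. 1. 1. 0. 0. 0. 0. 0. 0.]
```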
The learning parameters govern how the data are split into training and testing sets and how the learning itself is carried out.
Here is an example YAML file with comments (from xiRT v1.0.32):
```yaml
# preprocessing options:
# le: str, label encoder location; only needed for transfer learning or usage of pretrained models
# max_length: float, maximal length of sequences
# cl_residue: bool, if True crosslinked residues are encoded as Kcl, i.e. clK in modX format
preprocessing:
  le: None
  max_length: -1
  cl_residue: True

# fdr: float, FDR cutoff for peptide matches to be included in the training process
# ncv: int, number of CV folds to perform to avoid training/prediction on the same data
# mode: str, must be one of train, crossvalidation, predict
# train and transfer share the same options that are necessary to run xiRT; a brief rundown:
# augment: bool, if data augmentation should be performed
# sequence_type: str, must be linear, crosslink or pseudolinear; crosslink uses the siamese network
# pretrained_weights: "None" or str, location of neural network weights; only embedding/RNN weights
#   are loaded. Pretrained weights can be used with all modes, essentially resembling a transfer
#   learning set-up
# sample_frac: float in (0, 1), used for downsampling the input data (e.g. for learning curves);
#   usually left at 1 if all data should be used for training
# sample_state: int, random state to be used for shuffling the data; important for recreating results
# refit: bool, if True the classifier is refit on all data below the FDR cutoff to predict the
#   RTs for all peptide matches above the FDR cutoff; if False, the already trained CV classifier
#   with the lowest validation loss is chosen
train:
  fdr: 0.01
  ncv: 3
  mode: "crossvalidation"   # other modes are: train / crossvalidation / predict
  augment: False
  sequence_type: "crosslink"
  pretrained_weights: "None"
  test_frac: 0.10
  sample_frac: 1
  sample_state: 21
  refit: False
```
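As noted in the comments above, pretrained weights can be combined with any mode, which effectively gives a transfer learning set-up. As a sketch, only these keys would change relative to the example above (the weights path is a placeholder):

```yaml
train:
  mode: "train"
  pretrained_weights: "data/results/callbacks/weights.h5"  # placeholder path from an earlier run
```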
Generally, it is better to supply high-quality data than simply more data. Considerable drops in performance can sometimes be observed when input data at 5% FDR is used instead of 1%. However, there is no general rule of thumb, and this needs to be optimized per run / experiment.
Neural networks are very sensitive to their hyperparameters. To automate the daunting task of finding the right hyperparameters, two utilities ship with xiRT: 1) a convenience function that generates YAML files from a grid YAML file, and 2) a Snakemake workflow that runs xiRT with each parameter combination.
The grid is generated from all entries where a list of values is passed instead of a single value. This can quickly lead to an enormous search space, so step-wise optimization is sometimes the only viable option. The sketch below illustrates the expansion idea.
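The following Python sketch is not xiRT's actual implementation; it assumes a simplified flat grid in which keys are dot-separated parameter paths and list values span the grid:

```python
# Minimal sketch of a grid expansion over list-valued parameters (requires PyYAML).
import itertools
import yaml

def expand_grid(grid):
    """Yield one flat parameter dict per combination of the list-valued entries."""
    keys, value_lists = zip(*[(k, v if isinstance(v, list) else [v])
                              for k, v in grid.items()])
    for combo in itertools.product(*value_lists):
        yield dict(zip(keys, combo))

# hypothetical flat grid; dot-separated paths stand in for the nested YAML structure
grid = {
    "learning.learningrate": [0.01, 0.001, 0.0001],  # 3 values
    "LSTM.units": [25, 50],                          # x 2 values = 6 combinations
    "learning.batch_size": 128,                      # scalar, fixed in every file
}

# write one YAML file per parameter combination
for i, params in enumerate(expand_grid(grid)):
    with open(f"xirt_params_{i:03d}.yaml", "w") as fh:
        yaml.safe_dump(params, fh)
```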