### Load dependencies

In [1]:
import NN_Trainer
import numpy as np
import pandas as pd
from reaction_class import Reaction as rc
import os
import sys
from pathlib import Path
path = Path.cwd()
sys.path.append(str(path))  # sys.path entries should be strings, not Path objects



### Generating reaction presence dataframe

To prepare the training data, we need to determine which reactions are present in your training metabolic models. First we generate a list of all reactions found anywhere in the training data; these serve as the reaction keys. Then, for every draft training model, we record which of these reactions are present as a binary list of reaction presences. The result is a binary array with the reactions on one axis and the models in the training data on the other.

We will use the class we built, but you can use any module to load metabolic models or extract the reaction sets in another way. The key is to end up with a binary array of reaction presences. If you already have this, you can skip this step.
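To make the target format concrete, here is a minimal sketch with made-up data. The model names and reaction IDs are hypothetical, and the sets stand in for whatever your model loader returns. The keys are sorted here for readability, whereas the cell below keeps them in order of first appearance.

```python
import pandas as pd

# Hypothetical reaction sets for three draft models (IDs are made up)
reaction_sets = {
    "model_A": {"rxn00001", "rxn00002"},
    "model_B": {"rxn00002", "rxn00003"},
    "model_C": {"rxn00001"},
}

# Reaction keys: the sorted union of all reactions seen in any model
rxn_keys = sorted(set().union(*reaction_sets.values()))

# Rows are reactions, columns are models; 1 = present, 0 = absent
presence = pd.DataFrame(
    {
        model_id: [int(r in rs) for r in rxn_keys]
        for model_id, rs in reaction_sets.items()
    },
    index=rxn_keys,
)
print(presence)
```

Each column is one training example, and each row tells you in which models a given reaction occurs.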

In [2]:
# Path to the directory containing the training models
model_path = ''

# Output path for the training data (CSV file)
output_path = ''

# Model IDs of the draft models: file names without their last five characters (the extension)
paths = os.listdir(model_path)
model_ids = [filename[:-5] for filename in paths]
n_models = len(model_ids)

dic = {}  # model ID -> set of reactions in that model
rxn = []  # all reactions seen so far, in order of first appearance
for file_path, model_id in zip(paths, model_ids):
    print(model_id)
    model = rc(model=os.path.join(model_path, file_path))
    rs = set(model.reactions)
    dic[model_id] = rs

    # Extend the list of all possible reactions
    for i in rs:
        if i not in rxn:
            rxn.append(i)

n_reactions = len(rxn)

# Binary presence table: rows are reactions, columns are models
reaction_df = pd.DataFrame(index=rxn, columns=model_ids)
for key, value in dic.items():
    reaction_df[key] = [1 if i in value else 0 for i in rxn]

# Save the presence table as a CSV file
reaction_df.to_csv(output_path)



FileNotFoundError: [Errno 2] No such file or directory: ''

### Training the Neural Network

The easiest way to train the network is to provide a pandas DataFrame whose index holds the reaction keys and whose columns are the training examples (see above). You can also provide a numpy array with the reaction keys as a separate list. The function generates the training dataset automatically during training. You can change the number of times each training model is used (nuplo) and give a range of deletion percentages (min_for to max_for); the deletion percentage increases in equal-sized steps over the replicates. An optional parameter (del_p) weights the deletion of certain reactions. It is also possible to add false reactions (using min_con and max_con), but we do not currently use this, and it does not work with the masking of input reactions, because the mask does not differentiate between contamination and real reactions.
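The replicate scheme described above can be sketched in NumPy. This is an illustration under assumptions, not the NN_Trainer implementation: the helper name `make_training_inputs` and its `seed` argument are hypothetical, the input is assumed to be oriented as reactions by models, and contamination is left out.

```python
import numpy as np

def make_training_inputs(data, nuplo=30, min_for=0.05, max_for=0.55, del_p=None, seed=0):
    """Hypothetical helper: corrupt each model `nuplo` times by deleting reactions.

    data: binary array of shape (n_reactions, n_models).
    Returns (inputs, labels), each of shape (n_models * nuplo, n_reactions).
    """
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    n_reactions, n_models = data.shape
    # Deletion fractions rise in equal steps from min_for to max_for
    fractions = np.linspace(min_for, max_for, nuplo)
    # Optional per-reaction deletion weights (uniform if not given)
    weights = np.ones(n_reactions) if del_p is None else np.asarray(del_p, dtype=float)

    inputs, labels = [], []
    for m in range(n_models):
        full = data[:, m]
        present = np.flatnonzero(full)  # indices of reactions in this model
        for frac in fractions:
            n_del = int(round(frac * present.size))
            p = weights[present] / weights[present].sum()
            removed = rng.choice(present, size=n_del, replace=False, p=p)
            corrupted = full.copy()
            corrupted[removed] = 0
            inputs.append(corrupted)
            labels.append(full)  # the uncorrupted model is the target
    return np.array(inputs), np.array(labels)

# Toy data: 20 reactions, 2 models, every reaction present
toy = np.ones((20, 2), dtype=int)
x, y = make_training_inputs(toy, nuplo=3, min_for=0.1, max_for=0.5)
print(x.shape, x.sum(axis=1))
```

With nuplo=3 the deletion fractions are 0.1, 0.3 and 0.5, so each toy model yields replicates with 18, 14 and 10 of its 20 reactions left.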

You can provide labels (the full set of reactions) for the network to try to predict. If no labels are provided, the network will assume that your input (the data without deletions) is what it should predict.

You can rely on the default parameters to define the network, which we optimised for our use case. For optimal performance on different datasets, you might want to change the hyperparameters (dropout, batch size), the architecture (nnodes, nlayers) or the bias of the predicted classes. You can also disable the masking of input positions during loss calculation. Finally, you can set a validation split, which sets apart part of your input data during training and calculates scores on it afterwards to validate your network.
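One way to read the masking of input positions is that reactions already present in the input do not contribute to the loss, so the network is scored only on the positions it has to fill in. The NumPy sketch below illustrates that reading. It is an interpretation, not the loss used by NN_Trainer, and the function name `masked_bce` is hypothetical.

```python
import numpy as np

def masked_bce(y_true, y_pred, x_input, eps=1e-7):
    """Binary cross-entropy averaged only over positions where the input is 0."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    bce = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    mask = (x_input == 0).astype(float)  # 1 where the reaction was absent from the input
    return (bce * mask).sum() / max(mask.sum(), 1.0)

x_input = np.array([1, 0, 0, 1])          # corrupted input: the second reaction was deleted
y_true = np.array([1, 1, 0, 1])           # full reaction set
y_pred = np.array([0.2, 0.9, 0.1, 0.99])  # network output

# The poor prediction at the first position is ignored: it was given in the input
print(masked_bce(y_true, y_pred, x_input))
```

Only the second and third positions count here, so the network is rewarded for restoring the deleted reaction and for not adding an absent one.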

The function returns a class containing a TensorFlow object (the network), the list of reactions that correspond to the output nodes (the reaction keys) and the model type (ModelSEED, BiGG, etc.). If save=True, these are saved as a .h5 file.

Finally, you can set return_history=True to also return the training history for optimisation purposes.


In [1]:
"""
        PARAMETERS:
        ----------
        data: DataFrame or array, required
            binary array of reaction presences, for DataFrame index is used as rxn_keys
            otherwise rxn_keys should be provided
        modeltype : string, (currently) required
            The modeltype of the training data,
        rxn_keys: list, optional
            Can be used if data is not a pandas dataframe but a numpy array. Default is None
        labels:
            User can specify labels, by default input data is used as labels

        TRAINING PARAMETERS:
        -------
        nuplo: int
            create duplicates of input data
            default=30

        The omission and contamination rates will increase linearly from min to max,
        with stepsize determined by nuplo
        min_for, float
            minimum false omission rate, default = 0.05
        max_for, float
            maximum false omission rate, default = 0.55
        min_con, float
            minimum contamination introduced, currently not used, default = 0
        max_con, float
            maximum contamination introduced, currently not used, default = 0
        del_p, list
            list of probabilities of deletion for reactions
        con_p, list
            list of probabilities of introduction for reactions

        NETWORK PARAMETERS
        -------------
        nlayers: int, optional
            number of hidden layers (layers that are not input or output)
            default=1
        nnodes: int, optional
            number of nodes per layer,
            default=256
        nepochs: int, optional
            how often the network needs to loop over all the data
            default=10
        b_size: int, optional
            batch_size (number of training examples that are simultaneously evaluated)
            default=32,
        dropout: float, optional
            parameter for training that can reduce overfitting
            default = 0.1,
        bias_0: float, optional
            default = 0.3,
        maskI: boolean, optional
            Determines whether the input positions are masked during loss calculation
            default=True
        validation_split: float, optional
            Splits the input data in training and validation
            default = 0 (no split)

        SAVING PARAMETERS:

        save: boolean, optional
            Whether you want to save the network, default = False
        name: string, optional
            name of your network, default='noname'
        output_path: string,
            where the output is saved, default=''
        return_history: boolean, optional
            Whether to also return the training history

       Returns:
        -------------
        trainedNN
            NN class containing network, rxn_keys and modeltype
        history: History(), if return_history=True
            history of training, this can be used to look at the performance during training
    
    """
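If your data is a numpy array rather than a DataFrame, the docstring above asks for the reaction keys as a separate list. A minimal sketch of how the two pieces relate, using a toy table with hypothetical names and assuming the array keeps the same orientation as the DataFrame (reactions as rows):

```python
import numpy as np
import pandas as pd

# Toy presence table (hypothetical reaction and model names)
df = pd.DataFrame(
    [[1, 0], [1, 1], [0, 1]],
    index=["rxn00001", "rxn00002", "rxn00003"],
    columns=["model_A", "model_B"],
)

array = df.to_numpy()      # binary array, reactions x models
rxn_keys = list(df.index)  # reaction keys in the same row order

# With NN_Trainer importable, the call would then look like this
# (parameter names taken from the docstring above):
# network = NN_Trainer.train(data=array, rxn_keys=rxn_keys, modeltype='ModelSEED')
print(array.shape, rxn_keys)
```

The row order of the array and the order of rxn_keys must match, since the keys are what later label the output nodes.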

In [2]:
#Load in a small training sample
file_path = os.path.join(path.parent,'files', 'NN')
data = pd.read_csv(os.path.join(file_path, 'Sample_reaction_presence.csv'), index_col=0)

network = NN_Trainer.train(data=data, modeltype='ModelSEED',name='example',output_path=file_path, save=True)

Num GPUs Available:  0
using data as labels
dataset created
training on data with shape: (300, 2452) with 249710.0 reactions
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense (Dense)               (None, 256)               627968    
                                                                 
 dropout (Dropout)           (None, 256)               0         
                                                                 
 dense_1 (Dense)             (None, 2452)              630164    
                                                                 
Total params: 1,258,132
Trainable params: 1,258,132
Non-trainable params: 0
_________________________________________________________________
Train on 300 samples
Epoch 1/10




Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
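
The parameter counts in the model summary above can be checked by hand: a fully connected layer has one weight per input-output pair plus one bias per output node. A short sketch of that arithmetic, with a hypothetical helper `dense_params`:

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

n_reactions, n_nodes = 2452, 256
hidden = dense_params(n_reactions, n_nodes)  # layer "dense"
output = dense_params(n_nodes, n_reactions)  # layer "dense_1"
print(hidden, output, hidden + output)       # 627968 630164 1258132
```

The dropout layer adds no parameters, which matches the 0 in the summary. The 300 training samples are consistent with 10 models in the sample file, each used 30 times (the default nuplo), although that split is inferred from the output rather than stated.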
