Evodictor and User Manual

Overview of Evodictor

Evodictor is a software package for learning evolutionary patterns and predicting future gains and losses of given binary traits (e.g., gene presence/absence). Evodictor takes as input a phylogenetic tree and the presence/absence profiles of every trait for all extant and ancestral species in the tree, then predicts the gain/loss probability of a target trait from the trait repertoire of a species (e.g., the presence/absence of every gene in its genome). To make this prediction, Evodictor learns from past gain/loss events across diverse species which traits tend to be present or absent before the target trait is gained or lost. Evodictor was established in Konno and Iwasaki, Science Advances, 2023, where it was demonstrated to predict gene gain/loss evolution of bacterial metabolic systems.

Figure 1. Overview of Evodictor for gene gain/loss prediction.

Supported Environment

  1. Evodictor runs on Linux and macOS

Software Dependency

Required

  1. Python 3 (version 3.7.0 or later) with the biopython, scipy, numpy, imblearn, and scikit-learn modules
  • You can install these Python modules using conda

    conda install -c conda-forge biopython imbalanced-learn
    conda install -c anaconda scipy scikit-learn
    conda install -c conda-forge conda-forge::numpy
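
To confirm that the modules are visible to the interpreter you plan to use, a small check along these lines may help. It is only a sketch: the import names (Bio for biopython, sklearn for scikit-learn, imblearn for imbalanced-learn) are the usual ones for these packages, and the helper missing_modules is a hypothetical name, not part of Evodictor.

```python
import importlib.util
import sys

# Usual import names of the required modules: biopython is imported as "Bio",
# scikit-learn as "sklearn", and imbalanced-learn as "imblearn".
REQUIRED = ["Bio", "scipy", "numpy", "imblearn", "sklearn"]


def missing_modules(names):
    """Return the names that the import system cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    if sys.version_info < (3, 7):
        print("Python 3.7.0 or later is required")
    missing = missing_modules(REQUIRED)
    print("missing modules:", ", ".join(missing) if missing else "none")
```

Run it with the same python executable that will run Evodictor; any name it reports can then be installed with the conda commands above.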

Software installation

Each installation step takes less than about 1 minute

Installation of Evodictor

  1. Download Evodictor by

     git clone https://github.com/IwasakiLab/Evodictor.git
  2. Add the absolute path of the xxx/src directory to $PATH, where xxx is the directory into which Evodictor was cloned

  3. Make the files in xxx/src executable

    chmod u+x xxx/src/*
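
After these steps, the commands should be discoverable on $PATH. The following sketch checks this from Python using the standard library; find_executable is a hypothetical helper name, and the list of command names is taken from the aliases mentioned in this manual (evodictor, xygen, selevo, predevo), assuming each exists as a file in xxx/src.

```python
import shutil


def find_executable(name, path=None):
    """Return the full path of `name` if it is an executable on PATH, else None."""
    return shutil.which(name, path=path)


if __name__ == "__main__":
    for command in ("evodictor", "xygen", "selevo", "predevo"):
        location = find_executable(command)
        print(command, "->", location or "not found on PATH")
```

If a command is reported as not found, re-check that xxx/src was added to $PATH in the current shell session and that the chmod step was applied.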

Sample Codes

This repository contains an example input file in the examples directory so users can quickly try predicting gene gain/loss evolution using Evodictor step-by-step:

Step 1: Dataset Generation

Generate a machine-learning dataset for predicting gene gain of a target ortholog group (K00005 in this example). The inputs are a phylogenetic tree and the presence/absence profiles of every trait for all extant and ancestral species in the tree.

evodictor generate --target K00005 -X OG_node_state.txt -y OG_node_state.txt -t example.tree --predictor feature_OG.txt --gl gain > branch_X_y.txt

Or you can type "xygen" instead of "evodictor generate".

Input:

example.tree: A phylogenetic tree in Newick format.

OG_node_state.txt: The presence/absence profile of every ortholog group (OG) for every tip node (extant species) and every internal node (ancestor) of example.tree. Each row gives the state of one OG at one internal or tip node: the first, second, and third columns are the OG name, the node name, and the presence/absence state, respectively. The state is 0 (absent), 1 (present), or 0.5 (uncertain; ancestors only). Rows whose state is 0 can be omitted; any OG and node combination not listed in the file is treated as 0.

feature_OG.txt: Correspondence between OGs (e.g., K00001) and features (defined as groups of OGs; e.g., M00001). The input to the machine learning model in Evodictor is a vector in which each dimension corresponds to one feature and holds the number of OGs in that feature that are present.
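
To make the two input descriptions concrete, here is a minimal sketch of how a sparse state file and an OG-to-feature mapping could be turned into a feature vector for one node. It is not Evodictor's own code. The assumptions are: columns are whitespace-separated, only a state of exactly 1 counts as present (how Evodictor treats 0.5 is not stated here), and the feature mapping is passed in as a plain dictionary because the layout of feature_OG.txt is not described in this manual. The function names are hypothetical.

```python
from collections import defaultdict


def read_sparse_states(lines):
    """Parse rows of 'OG node state' into {node: {OG: state}}.

    Assumes whitespace-separated columns. OG/node pairs that never appear
    are simply missing from the result and are read back as 0 (absent).
    """
    states = defaultdict(dict)
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed rows
        og, node, state = fields[0], fields[1], float(fields[2])
        states[node][og] = state
    return states


def feature_counts(node_states, feature_to_ogs):
    """Count, per feature, the member OGs whose state is exactly 1.

    Assumption: uncertain states (0.5) are not counted as present.
    """
    return {
        feature: sum(1 for og in ogs if node_states.get(og, 0.0) == 1.0)
        for feature, ogs in feature_to_ogs.items()
    }


if __name__ == "__main__":
    rows = ["K00001 nodeA 1", "K00002 nodeA 0.5", "K00001 nodeB 1", "K00003 nodeB 1"]
    features = {"M00001": ["K00001", "K00002"], "M00002": ["K00003"]}
    states = read_sparse_states(rows)
    for node in ("nodeA", "nodeB"):
        print(node, feature_counts(states[node], features))
```

The node and OG names in the usage lines are made up for illustration; real files use the node names of example.tree.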

Output:

branch_X_y.txt: The machine-learning dataset, which can be used as an input file of evodictor predict. The first row is a header, and each following row corresponds to a branch in example.tree. The first, second, and third columns of every row give the node name of the parental species of the branch, the number of present traits of every feature in the parental species (separated by ;), and whether the predicted OG (K00005) was gained at the branch (1: gained; 0: not gained).
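
If you want to inspect the generated dataset yourself, a reader along the following lines may be enough. It is a sketch under stated assumptions: the three columns are taken to be tab-separated (the column delimiter is not specified in this manual), the feature counts are split on ";" as described above, and parse_dataset is a hypothetical helper, not an Evodictor function.

```python
def parse_dataset(lines, has_header=True):
    """Split dataset rows into node names, feature vectors, and 0/1 labels.

    Assumes tab-separated columns, with the second column holding the
    per-feature counts separated by ';'.
    """
    names, X, y = [], [], []
    rows = iter(lines)
    if has_header:
        next(rows, None)  # discard the header row
    for line in rows:
        line = line.rstrip("\n")
        if not line:
            continue
        node, features, label = line.split("\t")[:3]
        names.append(node)
        X.append([float(value) for value in features.split(";")])
        y.append(int(float(label)))
    return names, X, y


if __name__ == "__main__":
    with open("branch_X_y.txt") as handle:
        names, X, y = parse_dataset(handle)
    print(len(names), "branches,", sum(y), "with a gain of the target OG")
```

Counting the rows labelled 1 is a quick way to see how imbalanced the dataset is before training.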

Step 2: Feature Selection

Select the 20 most important input features, ranked by ANOVA F-value, for predicting gene gain of an OG (K00005).

evodictor select -i branch_X_y.txt --skip_header --o1 feature_importance.txt --o2 selection_result.txt --o3 branch_X_y.selected.txt -k 20

Or you can type "selevo" instead of "evodictor select".

Input:

branch_X_y.txt: The file generated in Step 1

Output:

feature_importance.txt : Importance (ANOVA F-value) of every feature

selection_result.20.txt : Binary values indicating whether each feature is among the top-20 important features (1: selected, 0: not selected)

branch_X_y.selected.20.txt : The machine-learning dataset, which contains only the selected top-20 important features and can be used as an input file of evodictor predict.
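
For readers unfamiliar with ANOVA F-value ranking, the idea can be sketched in plain Python as follows. This illustrates the general technique (one-way ANOVA F-statistic per feature, then keeping the k highest-scoring features); it is not taken from Evodictor's source, and the function names are hypothetical.

```python
def anova_f(values, labels):
    """One-way ANOVA F-value of a single feature across the label groups."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    n, k = len(values), len(groups)
    if k < 2 or n <= k:
        return 0.0  # F is undefined without at least two groups and spare samples
    grand_mean = sum(values) / n
    means = {label: sum(g) / len(g) for label, g in groups.items()}
    between = sum(len(g) * (means[label] - grand_mean) ** 2 for label, g in groups.items())
    within = sum((v - means[label]) ** 2 for label, g in groups.items() for v in g)
    if within == 0:
        return float("inf") if between > 0 else 0.0
    return (between / (k - 1)) / (within / (n - k))


def select_top_k(X, y, k):
    """Return (scores, mask) where mask marks the k highest-scoring features with 1."""
    n_features = len(X[0])
    scores = [anova_f([row[j] for row in X], y) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    chosen = set(ranked[:k])
    mask = [1 if j in chosen else 0 for j in range(n_features)]
    return scores, mask


if __name__ == "__main__":
    X = [[0, 5, 1], [1, 5, 1], [4, 5, 0], [5, 5, 1]]
    y = [0, 0, 1, 1]
    scores, mask = select_top_k(X, y, k=2)
    print(scores, mask)
```

A feature whose values differ strongly between branches with and without a gain gets a high F-value; a constant feature gets 0 and is never preferred.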

Step 3: Cross-validation

Conduct three-fold cross validation of gene gain prediction by logistic regression for an OG (K00005)

evodictor predict -i branch_X_y.selected.20.txt -c -k 3 -m LR --header > cross_validated_AUCs.txt

Input:

branch_X_y.selected.20.txt : The file generated in Step 2

Output:

cross_validated_AUCs.txt : List of the three AUCs (AUROCs) measured by three-fold cross validation
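
The two ingredients of this step, stratified folds and AUROC, can be sketched without any third-party library. The code below is an illustration of the general technique only: it assigns samples to folds deterministically (a real run would normally shuffle first, using a random seed), and it computes AUROC as the fraction of positive/negative pairs that are ranked correctly. The function names are hypothetical and the model-fitting step is omitted.

```python
def stratified_folds(labels, k):
    """Assign sample indices to k folds so each class is spread evenly."""
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        members = [i for i, label in enumerate(labels) if label == cls]
        for position, index in enumerate(members):
            folds[position % k].append(index)
    return folds


def auroc(labels, scores):
    """AUROC: probability that a positive scores above a negative (ties count 0.5)."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative sample")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    labels = [0, 0, 0, 1, 1, 1]
    print(stratified_folds(labels, 3))
    print(auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

In k-fold cross-validation each fold is held out once as test data while the model is trained on the remaining folds, which is why three AUCs are reported for -k 3.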

Step 4: Future gene gain prediction

Train a logistic regression model and predict the future gene gain probability of an OG (K00005) for every species. All features are used for model training and prediction in this example. To predict with only the selected features, change two of the input files: feature_OG.txt and branch_X_y.txt.

evodictor generate --target K00005 -X OG_node_state.txt -y OG_node_state.txt -t example.tree --predictor feature_OG.txt --gl gain --ex > extant_X.txt
evodictor predict -m LR --header -i branch_X_y.txt -t extant_X.txt > species_probability.txt

Input:

example.tree: The same input file as Step 1

OG_node_state.txt: The same input file as Step 1

feature_OG.txt: The same input file as Step 1

branch_X_y.txt: The file generated in Step 1

Output:

extant_X.txt : List of input feature vectors of extant species (i.e., tip nodes of example.tree). The first row is a header. The first and second columns in each of the following rows give an extant species name and the number of present traits for every feature in that species (separated by ;).

species_probability.txt : Predicted gene gain probability of K00005 for every extant species. The first and second columns in each row give an extant species name and the predicted gene gain probability.
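
A common follow-up is to list the species with the highest predicted probability. The sketch below does this for a two-column file like species_probability.txt. It assumes the columns are tab-separated, which this manual does not state, and it silently skips any row whose last column is not a number (such as a header, if one is present). top_candidates is a hypothetical helper name.

```python
def top_candidates(lines, n=5):
    """Return the n (species, probability) pairs with the highest probability."""
    records = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue
        try:
            records.append((fields[0], float(fields[-1])))
        except ValueError:
            continue  # non-numeric row, e.g. a header
    records.sort(key=lambda record: record[1], reverse=True)
    return records[:n]


if __name__ == "__main__":
    with open("species_probability.txt") as handle:
        for species, probability in top_candidates(handle, n=10):
            print(species, probability)
```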

Usage

evodictor generate / xygen

usage: evodictor generate [-h] [-v] [-p] [--target TARGET] [-X SPARSE_X] [-y SPARSE_Y]
             [-t TREE] [--predictor PREDICTOR] [--gl GL] [-m MODE] [--ex]

evodictor generate

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         Print evodictor version (default: False)
  -p, --print           Print all arguments (default: False)
  --target TARGET       [Required] Prediction target (eg. 'R00001')
  -X SPARSE_X, --sparse_X SPARSE_X
                        [Required] Sparse matrix file path for input features
                        X
  -y SPARSE_Y, --sparse_y SPARSE_Y
                        [Required] Sparse matrix file path for output y
  -t TREE, --tree TREE  [Required] Tree file path
  --predictor PREDICTOR
                        [Required] Predictor definition file path
  --gl GL               [Required] Specify 'gain' or 'loss'
  -m MODE, --mode MODE  Mode of dataset generator (default: 'define')
  --ex                  Print only X for extant species (default: False)
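
When the generator has to be run for many targets, it can be convenient to drive it from Python. The sketch below builds the argument list from the options shown in the help text above and redirects standard output to a file, mirroring the shell examples in the Sample Codes section. The helper names generate_command and run_to_file are hypothetical, and the sketch assumes evodictor is on $PATH.

```python
import subprocess


def generate_command(target, x_states, y_states, tree, predictor,
                     gl="gain", extant_only=False):
    """Build the argument list for `evodictor generate`."""
    if gl not in ("gain", "loss"):
        raise ValueError("gl must be 'gain' or 'loss'")
    command = [
        "evodictor", "generate",
        "--target", target,
        "-X", x_states,
        "-y", y_states,
        "-t", tree,
        "--predictor", predictor,
        "--gl", gl,
    ]
    if extant_only:
        command.append("--ex")  # print only X for extant species
    return command


def run_to_file(command, output_path):
    """Run a command and write its standard output to output_path."""
    with open(output_path, "w") as handle:
        subprocess.run(command, stdout=handle, check=True)


if __name__ == "__main__":
    command = generate_command(
        "K00005", "OG_node_state.txt", "OG_node_state.txt",
        "example.tree", "feature_OG.txt",
    )
    run_to_file(command, "branch_X_y.txt")
```

Passing the arguments as a list avoids shell quoting problems, and check=True raises an exception if the command exits with a non-zero status.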

evodictor select / selevo

usage: evodictor select [-h] [-v] [-p] [-i INPUT] [-m METHOD] [--scores SCORES]
              [--mask MASK] [--newXygen NEWXYGEN] [-n NORMALIZE] [-k K]
              [--skip_header] [--n_estimators N_ESTIMATORS]
              [--max_depth MAX_DEPTH] [--signed]

evodictor select

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         Print evodictor version
  -p, --print           Print all arguments
  -i INPUT, --input INPUT
                        [required] Input file path
  -m METHOD, --method METHOD
                        [required] Feature selection method (Permissive
                        values: 'ANOVA', 'RandomForest')
  --scores SCORES, --o1 SCORES
                        Output feature importance file path ('stdout' is also
                        acceptable, 'None' inactivates output) (default:
                        stdout)
  --mask MASK, --o2 MASK
                        Output selected parameters ('0': not selected, '1':
                        selected) (default: stdout)
  --newXygen NEWXYGEN, --o3 NEWXYGEN
                        Output a dataset file with selected features (default:
                        stdout)
  -n NORMALIZE, --normalize NORMALIZE
                        Conduct normalization (Permissive values: 'standard',
                        'minmax', 'skip') (default: 'standard')
  -k K                  Number of selected features (default: 5)
  --skip_header         Skip header row (default: False)
  --n_estimators N_ESTIMATORS
                        This option is active only when '-m RandomForest'.
                        Number of trees for random forest feature selection.
                        (default: 100)
  --max_depth MAX_DEPTH
                        This option is active only when '-m RandomForest'.
                        Maximum tree depth for random forest feature
                        selection. (default: 2)
  --signed              Calculate signed importance value (positive or
                        negative) (default: False)
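
The three values accepted by -n follow common conventions. As a rough guide, the sketch below shows the conventional definitions of standardization (zero mean, unit variance) and min-max scaling for a single feature column; whether Evodictor computes them in exactly this way is an assumption, and normalize_column is a hypothetical helper name.

```python
import math


def normalize_column(values, method="standard"):
    """Normalize one feature column with 'standard', 'minmax', or 'skip'."""
    if method == "skip":
        return list(values)
    if method == "standard":
        mean = sum(values) / len(values)
        # Population standard deviation (divide by N, not N - 1).
        std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
        return [(v - mean) / std if std > 0 else 0.0 for v in values]
    if method == "minmax":
        low, high = min(values), max(values)
        span = high - low
        return [(v - low) / span if span > 0 else 0.0 for v in values]
    raise ValueError("method must be 'standard', 'minmax', or 'skip'")


if __name__ == "__main__":
    column = [1, 2, 3]
    print(normalize_column(column, "standard"))
    print(normalize_column(column, "minmax"))
```

Constant columns are mapped to 0 here to avoid division by zero; that choice is part of the sketch, not a documented behavior of the tool.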

evodictor predict / predevo

usage: evodictor predict [-h] [-v] [-p] [-i INPUT] [-m MODEL] [-t TEST] [-n NORMALIZE]
               [--pointbiserialr] [-c] [--hv] [-k KFOLD] [-s SAMPLING]
               [--scoring SCORING] [--permutation PERMUTATION] [-r SEED]
               [--header] [--n_estimators N_ESTIMATORS]
               [--max_depth MAX_DEPTH] [--lr_penalty LR_PENALTY]
               [--lr_solver LR_SOLVER]

evodictor predict

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         Print evodictor version
  -p, --print           Print all arguments
  -i INPUT, --input INPUT
                        [Required] Input file path
  -m MODEL, --model MODEL
                        [Required] Prediction model (Permissive values: 'LR',
                        'RF')
  -t TEST, --test TEST  Test data file path; if this option was specified, the
                        file specified by -i is treated as a training dataset,
                        then conduct prediction for the test data file
                        specified by this option
  -n NORMALIZE, --normalize NORMALIZE
                        Conduct normalization (Permissive values: 'standard',
                        'minmax', 'skip') (default: 'standard')
  --pointbiserialr      Calculates a point biserial correlation coefficient
                        between each feature and y, and the associated p-value
                        (if True, other options will be ignored) (default:
                        False)
  -c, --cv              Conduct stratified cross validation
  --hv                  Conduct stratified hold-out validation (default:
                        False)
  -k KFOLD, --kfold KFOLD
                        K for k-fold stratified cross validation. This option
                        is valid only when -c is specified. (default: 0)
  -s SAMPLING, --sampling SAMPLING
                        Resampling the training dataset in cross validation.
                        Permissive values: 'none', 'under', 'over' (default:
                        'none')
  --scoring SCORING     Scoring of cross validation. Permissive values:
                        'roc_auc', 'roc_auc_pvalue', or 'roc_curve'. This
                        option is valid only when -c is specified. (default:
                        'roc_auc')
  --permutation PERMUTATION
                        Number of permutations for calculating p-value of AUC
                        in cross validation. This option is used only when '--
                        scoring roc_auc_pvalue' is specified. (default:
                        100000)
  -r SEED, --seed SEED  Random seed (default: 0)
  --header              Skip header row (default: False)
  --n_estimators N_ESTIMATORS
                        Number of trees for random forest feature selection.
                        This option is active only when '-m RF'. (default:
                        100)
  --max_depth MAX_DEPTH
                        Maximum tree depth for random forest feature
                        selection. This option is active only when '-m RF'.
                        (default: 2)
  --lr_penalty LR_PENALTY
                        Regularization for logistic regression. This option is
                        active only when '-m LR'. Permissive values: ‘l1’,
                        ‘l2’, ‘elasticnet’, ‘none’ (default: 'l2')
  --lr_solver LR_SOLVER
                        Solver for logistic regression. Permissive values:
                        ‘newton-cg’, ‘lbfgs’, ‘liblinear’, ‘sag’, ‘saga’
                        (default: 'liblinear')
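
Gain and loss events are usually rare, so the training data tend to be imbalanced, which is what the -s option addresses. The sketch below illustrates the general idea of random under- and over-sampling in plain Python. It is not Evodictor's implementation (the dependency list suggests the imblearn module is used for this, but that is an inference), and resample is a hypothetical helper name.

```python
import random


def resample(X, y, method="none", seed=0):
    """Balance classes by random undersampling ('under') or oversampling ('over')."""
    if method not in ("none", "under", "over"):
        raise ValueError("method must be 'none', 'under', or 'over'")
    if method == "none":
        return list(X), list(y)
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(X, y):
        by_class.setdefault(label, []).append(row)
    sizes = [len(rows) for rows in by_class.values()]
    target = min(sizes) if method == "under" else max(sizes)
    new_X, new_y = [], []
    for label, rows in by_class.items():
        if method == "under":
            picked = rng.sample(rows, target)  # drop rows from larger classes
        else:
            extra = [rng.choice(rows) for _ in range(target - len(rows))]
            picked = rows + extra  # duplicate rows of smaller classes
        new_X.extend(picked)
        new_y.extend([label] * len(picked))
    return new_X, new_y


if __name__ == "__main__":
    X = [[i] for i in range(6)]
    y = [0, 0, 0, 0, 1, 1]
    print(resample(X, y, "under"))
    print(resample(X, y, "over"))
```

Resampling should be applied to the training folds only, so that the held-out fold keeps the original class balance when the AUC is measured.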

How to cite Evodictor

Naoki Konno, and Wataru Iwasaki. 2023. “Machine Learning Enables Prediction of Metabolic System Evolution in Bacteria.” Science Advances 9 (2): eadc9130.

Contact

Naoki Konno (The University of Tokyo) konno-naoki555@g.ecc.u-tokyo.ac.jp
