Skip to content

OcMalde/PhyloCharMod_publ

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

68 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

PhyloCharMod

Scripts and data for publication & whole pipeline Docker image

Phylogenetic Prediction of Functional Sequence Modules

Introducing an original approach to characterizing functional motifs. This methodology encompasses:

  1. Detection of conserved sequence modules (using Partial Local Multiple Alignment)
  2. Phylogenetic inference of species/genes/modules/functions evolutionary histories
  3. Identification of co-appearances of modules and functions

The process accepts protein sequences and their associated annotations as input. It then returns the presence of conserved sequence modules, along with their associated annotations, across different ancestral genes.

Data recuperation

For a more detailed explanation of the methodology, refer to the following article [Dennler et al. 2023], or the following PhD thesis (only available in French).

Citation

Dennler, O., Coste, F., Blanquart, S., Belleannée, C., Théret, N., (2023). Phylogenetic inference of the emergence of sequence modules and protein-protein interactions in the ADAMTS-TSL family. PLOS Computational Biology.

Simple Usage

Docker Installation

For ease of use, considering the various software and dependencies required, we strongly recommend using our Docker image. After installing Docker, you can pull our Docker image using the following command:

docker pull ghcr.io/ocmalde/phylocharmod:0.1

Run the Docker Image

First, you need to make a container and connect interactivly to it:

docker run -it --entrypoint /bin/bash ghcr.io/ocmalde/phylocharmod:0.1 

Then you can simply execute the whole pipeline using:

python3 /phylocharmod/phylocharmod.py <sequences.fasta> <annotations.csv>

To quit it, simply type exit

Example with test files provided in the Docker image (you need to connect to the container first):

cd test_dir/ && python3 ../phylocharmod/phylocharmod.py 712buddy37seq.fasta leaf_Manual_712.csv

If you already made a container, you can obtain its <CONTAINER ID> using docker ps -a. Using the <CONTAINER ID>, you can connect to the existing container with:

docker start <CONTAINER ID> && docker attach <CONTAINER ID>

To use move file from/to the container, use:

docker cp <CONTAINER ID>:/path/in/container/ /path/in/local

Input

  1. <sequences.fasta>:

    Each sequence header must adhere to the format: >SeqID_taxid, where SeqID represents the unique sequence identifier and taxid is the NCBI species taxid (e.g., >NP031426.2_10090)

    Refer to this file for an example

    Please refrain from using special characters in the header (e.g., |,()`":;). Use only _ as a separator.

    These files can be generated using orthogroups and GFF files, all of which are included in the Docker image for the nine species. You only need to compile a file with a list of RefSeq of interest. For detailed instructions, please refer to To-build-a-sequence-dataset-based-on-orthogroups.

  2. <annotations.csv>:

    This file contains annotations associated with the different sequences

    Each line should be formatted as: SeqID,Annotation_1|Annotation_2. Here, SeqID is the unique sequence identifier. It's separated from the list of annotations by a comma (,), and individual annotations are separated by a pipe (|) (e.g., NP_620594.1,P00451_F8|P04275_VWF). Annotations must be more than one character in length.

    Refer to this file for an example

⚠️ Important Notice: Default Execution and Gene Tree Input. When running the analysis with the default settings, a default rooted tree will be generated. However, for optimal results, it is strongly recommended to infer a properly rooted gene tree prior to analysis and use it as the input for the --gene_tree option. The gene phylogenetic tree serves as a critical template for the entire analysis, thus it is essential that a reliable and accurately rooted gene tree is prepared and utilized.

Output

The main workflow output is the list of modules/annotations present/gained/lost at the different ancestral genes. This output is presented as a table in the file complete_functionChange_moduleChange.csv (Example here). It is strongly advised to also look at the the final gene tree gene_tree/gene.tree to visualise the annotated gene nodes. Plus, description of all modules are available in modules_segm_dir/ and enable to get module segments (sequences and positions) based on module names.

For an interactive visualisation of these data, various iTOL files are generated in visuReconc/ and compressed in visuReconc.zip for batch upload on iTOL.

For all details, all outputs and working files will be available such as (see this directory for an example)

output_dir
│ 
├── gene.fasta                                                    -----> The input fasta file
├── leaf_Manual.csv                                               -----> The input annotation file
├── gene.tree                                                     -----> The input gene tree file (if given as input)
├── t*m*M*_plma.dot                                               -----> The input plma file (if given as input)
│
├── acs_dir                                                       -----> All ancestral scenario reconstruction files from pastML
│   ├── results_pastml                                            -----> pastML outputs
│   │   ├── marginal_probabilities.character_*
│   │   ├── params.character_*
│   ├── leaf.csv                                                  -----> The input annotation file
│   ├── pastml.csv                                                -----> The input annotation formated as a {0,1,?} matrix for pastML use
│   ├── pastml_combined_ancestral_states.tab                      -----> pastML output regrouping all annotation ancestral states
│   ├── seadog_gene.tree                                          -----> The "final" gene tree after treefix correction / internal node labelling by seadog mDGS reconciliation / branch length computing by PhyML
│   └── seadog_sp_gene_event.csv                                  -----> Gene nodes event from Species - Gene reconciliation (e.g., Gene duplication, Speciation)
│
├── gene_tree
│   └── gene.tree                                                 -----> The "final" gene tree after treefix correction / internal node labelling by seadog mDGS reconciliation / branch length computing by PhyML
│
├── modules_segm_dir                                              -----> All modules decomposition files: paloma plma -> for all modules: module fasta -> PhyML tree -> treefix corrected tree
│   ├── modules_t*m*M*_plma
│   │   ├── B*.fasta                                              -----> A module fasta file
│   │   ├── B*.phylip                                             -----> A module in phylip format
│   │   ├── B*.phylip_phyml_stats.txt                             -----> PhyML output
│   │   ├── B*.phylip_phyml_tree.txt                              -----> PhyML output
│   │   ├── B*.tree                                               -----> PhyML output
│   │   ├── B*_treefix_dir.smap                                   -----> Gene - Module nodes mapping
│   │   ├── B*_treefix_dir_treeToFixPath.txt                      -----> Define treefix input
│   │   ├── B*_treefix_dir                                        -----> treefix working directory
│   │   │   ├── B*.fasta                                          -----> Module fasta file 
│   │   │   ├── B*.tree                                           -----> Module PhyML tree
│   │   │   ├── B*.treefix.tree                                   -----> Module treefix corrected tree - "final"
│   ├── gene.fasta                                                -----> The input fasta file
│   └── t*m*M*_plma.dot (or t*m*M*_plma.agraph)                   -----> The plma file (from input or computed by paloma)
│
├── modules_paths_modules_t*m*M*_plma.txt                         -----> Define list of module trees to use as seadog input
├── seadog.output                                                 -----> Seadog mDGS output file
├── seadog_gene.tree                                              -----> Gene tree from seadog mDGS, with internal gene labelled
├── seadog_sp_gene_event.csv (or specieGeneEvent_seadog.csv)      -----> Gene nodes event from Species - Gene reconciliation (e.g., Gene duplication, Speciation)
│
├── visuReconc                                                    -----> All itol visualisation files
│   ├── geneReconc_seadog.tree                                    -----> The "final" gene tree after treefix correction / internal node labelling by seadog mDGS reconciliation / branch length computing by PhyML
│   ├── itolAnnotPresence.txt                                     -----> Annotation presences in leaves as heatmap
│   ├── itolModPresence.txt                                       -----> Module presences in leaves as heatmap
│   ├── itolBarModulesNb.txt                                      -----> Module number in leaves as barplot
│   ├── itolDomains.txt                                           -----> Domain decompositions as domain mosaics
│   ├── itolGOt.txt                                               -----> Annotation presence as symbols
│   ├── itolModTransfer.txt                                       -----> Module transfers as arrows
│   ├── itolModulesPresent_G*_*_*_only_mod.txt                    -----> For each gene node * : all module segments in actual gene present at this * (ancestral) gene as domain mosaics (brown squares)
│   ├── itolModulesThatChanged_G*_*_*_only_mod.txt                -----> For each gene node * : all module segments in actual gene gained at this * (ancestral) gene as domain mosaics (green squares)
│   ├── itolModules.txt                                           -----> Module decompositions as domain mosaics
│   ├── itolPopup.txt                                             -----> Gene nodes popup with module and annotation decriptions
│   ├── itolSpGeneEvents.txt                                      -----> Gene - Species reconcilation events as symbols
│   ├── itol_modules_PieGainsLost.txt                             -----> Number of modules gained / lost at each gene nodes as a pie charts
│   └── itol_ppi_PieGainsLost.txt                                 -----> Number of modules gained / lost at each gene nodes as a pie charts
├── visuReconc.zip                                                -----> All itol visualisation files compressed for itol batch upload
│
├── species.tree                                                  -----> Species tree extracted from ncbi taxonomy
├── domains.csv                                                   -----> Known domains/motifs from scans
│
├── modulesCompo.csv                                              -----> Lists of modules presents at each gene nodes
├── modulesChange.csv                                             -----> Lists of modules gained / lost at each gene nodes
│
├── complete_functionChange_moduleChange.csv                      -----> Table of all annotations and modules gained / lost at each gene nodes (actual and ancestral)
├── ances_functionChange_moduleChange.csv                         -----> Table of all annotations and modules gained / lost at each gene nodes (only ancestral)
├── ances_modulesChange.csv                                       -----> Table of modules gained / lost at each gene nodes (only ancestral)
│
├── functionChange_moduleChange.csv                               -----> Table of modules gained / lost at each gene nodes where there is annotation changes (actual and ancestral)
├── functionChange_moduleChange_expand.csv                        -----> Table of modules gained / lost at each gene nodes where there is annotation changes (actual and ancestral) with all module segment details in leaf
└── onlyAnc_functionChange_moduleChange_expand.csv                -----> Table of modules gained / lost at each gene nodes where there is annotation changes (only ancestral) with all module segment details in leaf


Advanced Usage

To use with your own data :

python3 phylocharmod/phylocharmod.py

Pre-computed phylogenetic trees or/and paloma module decompositions can be use, as long as they respect the imposed header format. (see --help for details)

usage: phylocharmod.py [-h] [--output_directory OUTPUT_DIRECTORY] [--species_tree SPECIES_TREE] [--gene_tree GENE_TREE] [--plma_file PLMA_FILE] [--reconc_domains] multi_fasta_file leaf_functions_csv

positional arguments:
  multi_fasta_file      Multi fasta file, with specific formated header >RefSeq_taxid (ex : >XP_012810820.2_8364)
  leaf_functions_csv    csv file containing for each of our sequence, the list of his functions (ex : XP_012810820.2, P59509 | P999999)

optional arguments:
  -h, --help            show this help message and exit
  --output_directory OUTPUT_DIRECTORY
                        output directory name
  --species_tree SPECIES_TREE
                        Species tree to use as a support for the reconciliations (WARNING, must correspond to the taxid use in the other files !)
  --gene_tree GENE_TREE
                        Gene tree to use as a support for the pastML and DGS reconciliation inference (WARNING, must correspond to the sequences in the multi fasta file !)
  --plma_file PLMA_FILE
                        Paloma-2 output file (.agraph format, .dot, or .oplma format)
  --reconc_domains      Do a DGS reconciliation with known modules (pfam / prosite)

To replicate publication results :

cd data/min5_human_214_t10m1M20/

python3 /phylocharmod/integrate_3phylo.py seadogMD_214.output gene_tree_214/214.tree --pastml_tab acs_dir_seadogMD_214_gene/pastml_seadogMD_214_gene_leaf_Manual_214_combined_ancestral_states.tab --domains_csv domains_214.csv --itol

To build a sequence dataset based on orthogroups

You can use as input any fasta file with ortholog and paralog sequences, as long as their headers are formatted. But we propose a sequence dataset building based on orthogroups from the OrthoFinder tool. As a prerequisite, you will need to select a set of species (and one assembly per species) and to have:

  • The Orthgroup.tsv file, computed with OrthoFinder on the proteomes of the selected assemblies of the selected species (to do so, run orthofinder -f <directory with all assemblie proteomes in fasta>).
  • A directory containing the description in .gff of the selected assemblies (e.g., GCF_000002035.6.gff).
  • An assoc_taxid_spName.csv file with taxid and species name associations (e.g., 7955,Danio rerio).
  • An assoc_taxid_assembly.csv file with taxid and assembly name associations (e.g., 7955,GCF_000002035).
  • An my_protein.txt file with refseq of the proteins of interest to study (one refseq per line).

To build your own dataset, we propose a methodology for identifying and selecting the longest protein sequence to represent each gene from a list of protein sequences of interest, which may include multiple isoforms' RefSeq IDs for comprehensiveness. Initially, the process involves searching for (pre-computed) orthogroups that contain the provided RefSeq IDs and extracting all associated protein sequences, including orthologs, paralogs, and their various isoforms. Subsequently, each extracted protein sequence undergoes a search for its genomic locus. Proteins are then grouped by gene based on overlapping genomic loci on the same chromosome strand. For each gene (i.e., a group of protein sequences), the longest protein sequence is selected as the representative sequence. The final output consists of these representative sequences as fasta, which are saved to a specified path <orthogroup_dir>/isoforms_per_locus/longest_isoform.fasta, thus offering a comprehensive representation of the protein family of interest.

0. Cluster the proteomes in orthogroups with the OrthoFindersoftware.

If you are using the Docker image, pre-computed orthogroups and all associated file for 9 species (7955 - Danio rerio; 9606 - Homo sapiens; 9913 - Bos taurus; 7719 - Ciona intestinalis; 10090 - Mus musculus; 7227 - Drosophila melanogaster; 8364 - Xenopus tropicalis; 6239 - Caenorhabditis elegans; 9031 - Gallus gallus) are available in the image directory: /data_9sp

⚠️ The Orthofinder results used human proteom version: GCF_000001405.39: recent RefSeq may not be retrived (see S1 Table of the article for version of other species assemblies)

1. Select orthogroups with at least one protein of interest, with python3 phylocharmod/myOrthogroups_fasta.py.

usage: myOrthogroups_fasta.py [-h] [--download] orthogroups_file myProtein_file assocF_taxid_sp

positional arguments:
  orthogroups_file  csv file containing the orthogroups from an orthofinder analysis
  myProtein_file    text file containing the ncbi id of the our protein of interest
  assocF_taxid_sp   association file : taxid, specieName

optional arguments:
  -h, --help        show this help message and exit
  --download        download proteins sequences of all the proteins in our orthogroups of interest

With the --download argument, a fasta directory will be build, containing all sequences of the proteins in the selected orthogroups (i.e., the one containing the proteins of interest).

Example for a file <refseqID.txt> (simply one refseq ID by line) using pre-computed orthogroups for 9 species:

docker run -w $(pwd) -v $(pwd):$(pwd) --rm ghcr.io/ocmalde/phylocharmod:0.1 /phylocharmod/myOrthogroups_fasta.py /data_9sp/Orthogroups.tsv <refseqID.txt>  /data_9sp/assocF_taxid_spName.csv --download

2. Regroup the protein by gene, based on genomic annotation (gff) and keep only the longest isoform for each gene, with python3 phylocharmod/gff_regroup_iso_locus.py.

usage: gff_regroup_iso_locus.py [-h] [--fasta_directory FASTA_DIRECTORY] [--assoc_file ASSOC_FILE] [--gff_directory GFF_DIRECTORY]

optional arguments:
  -h, --help            show this help message and exit
  --fasta_directory FASTA_DIRECTORY
                        Directory containing proteins fasta selection files, name in the format selec_taxid.fasta
  --assoc_file ASSOC_FILE
                        .csv association file (taxid, db_name)
  --gff_directory GFF_DIRECTORY
                        Directory containing genome at the gff format (filename must contains db_name of the assoc_file)

Example for <orthogroup_dir> (computed on previous step):

docker run -w $(pwd) -v $(pwd):$(pwd) --rm ghcr.io/ocmalde/phylocharmod:0.1 python3 /phylocharmod/gff_regroup_iso_locus.py --fasta_directory <orthogroup_dir> --assoc_file /data_9sp/assocF_taxid_dbnt.csv --gff_directory /data_9sp/gff

The fasta file containing only the longest sequence by gene will be written in <orthogroup_dir>/isoforms_per_locus/longest_isoform.fasta

Standalone Modules

Modules

Specific modules of the pipeline can be executed independently (see their --help for all usage details).

For example, to compute a phylogenetic tree using Muscle/Trimal/PhyML/Treefix:

python3 /phylocharmod/gene_phylo.py <fasta_file> <species_tree>

To only compute the final intergration module:

python3 /phylocharmod/integrate_3phylo.py <seadogMD.output> <gene_tree.tree> --pastml_tab <pastml_seadogMD_combined_ancestral_states.tab> --domains_csv <domains.csv>

Softwares

All the different included softwares are usable using the Docker image. For example, paloma-2 can be used with:

docker start <CONTAINER ID> && docker exec <CONTAINER ID> /bin/bash -c ". ~/.bashrc && paloma-D --help && exit"

Or simply: paloma-D --help if connected to a container

Dependencies

All these programs are mandatory to run PhyloCharMod, and must be in ; /usr/local/bin/ If not, their path must be specified in the config file ; phylocharmod/config.txt

Softwares

Muscle (v3.8.31), one of the best-performing multiple alignment programs, Conda package

PhyML (v3.3.20190909), maximum likelihood phylogenetic inference for the gene and the module trees, Conda package

TreeFix (v1.1.10), Statistically Informed Gene Tree Error Correction Using Species Trees, Conda package

SEADOG-MD, for DGS-reconciliation.

PastML (v1.9.33), for ancestral characters inference, Pip package

paloma-2 (v0.1), for sequence segmentation.

About

Phylogenetic prediction of functional sequence modules

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Languages