GeneSeqToFamily: the Ensembl GeneTrees pipeline as a Galaxy workflow

Introduction

GeneSeqToFamily is an open-source Galaxy workflow based on the Ensembl GeneTrees pipeline. The Ensembl GeneTrees pipeline [1] infers the evolutionary history of gene families, represented as gene trees. It comprises clustering, multiple sequence alignment, and tree generation (using TreeBeST) to discover familial relationships.

Installation

To use this workflow, install the required tools (listed below) into Galaxy from the Galaxy ToolShed. Then install the workflow from the Galaxy ToolShed and import it into Galaxy.

List of required tools

The 3 workflows in this repository require Galaxy tools from the following ToolShed repositories:

Helper tools for data preparation:

Workflow inputs and steps

Inputs

GeneSeqToFamily requires the following inputs:

  • the coding sequences (CDS) in FASTA format (these can be generated with the GeneSeqToFamily preparation tool)
  • gene feature information in SQLite format (this can be generated with the GeneSeqToFamily preparation tool)
  • a species tree in Newick format (this can be generated with the ETE tool in Galaxy)
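As a quick sanity check on the species tree input, it can help to list the species named at the leaves of the Newick string. The following is a minimal sketch, not part of the workflow: the function name `newick_leaf_names` and the example species labels are hypothetical, and the regular expression assumes unquoted leaf labels.

```python
import re


def newick_leaf_names(newick):
    """Return the leaf labels of a Newick tree, in order of appearance.

    A leaf label is any unquoted name that directly follows "(" or ",".
    Internal node labels (which follow ")") and branch lengths (which
    follow ":") are ignored. Quoted labels are not handled.
    """
    return re.findall(r"[(,]\s*([^(),:;\s]+)", newick)


if __name__ == "__main__":
    # Hypothetical three-species tree with branch lengths.
    tree = "((homo_sapiens:0.1,pan_troglodytes:0.1):0.2,mus_musculus:0.3);"
    for name in newick_leaf_names(tree):
        print(name)
```

A tree consisting of a single leaf has no enclosing parentheses, so this sketch returns an empty list for it. A full Newick parser would be needed for quoted labels or comments.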

Steps

The pipeline is made up of 7 main steps:

  1. Translation of CDS to protein sequences
  2. All-vs-all BLASTP of protein sequences
  3. Cluster protein sequences using hcluster_sg and BLASTP scores
  4. Multiple sequence alignment (MSA) for each cluster using T-Coffee
  5. Generate gene trees from MSAs using TreeBeST
  6. Create an SQLite database from the MSAs, gene trees and gene feature information using Gene Alignment and Family Aggregator (GAFA)
  7. Visualise the GAFA dataset using Aequatus
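To illustrate what step 1 does, here is a minimal sketch of translating a CDS into a protein sequence. It is not the tool used by the workflow: the function name `translate_cds` is hypothetical, the standard genetic code is assumed, and codons containing ambiguous bases are simply reported as `X`.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(cds):
    """Translate a coding sequence into a protein sequence.

    Raises ValueError if the length is not a multiple of 3. Unknown
    codons (for example those containing N) become "X", and a single
    trailing stop codon is dropped.
    """
    seq = cds.upper().replace("U", "T")
    if len(seq) % 3 != 0:
        raise ValueError("CDS length is not a multiple of 3")
    protein = [CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq), 3)]
    if protein and protein[-1] == "*":
        protein.pop()
    return "".join(protein)


if __name__ == "__main__":
    print(translate_cds("ATGGCCTAA"))
```

Rejecting sequences whose length is not a multiple of 3 is a design choice of this sketch. It surfaces truncated or frame-shifted CDS early, before they reach the BLASTP and clustering steps.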

Helper tools:

We have developed various tools to help with data preparation for the workflow. These include tools for retrieving sequences and features from Ensembl using its REST API, and tools to parse Ensembl results into the formats required by the workflow. We also developed a tool to merge gene feature files and convert them from GFF3 (Generic Feature Format version 3) and/or JSON format to SQLite, which is then used to generate the Aequatus dataset.
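The idea behind converting gene features from GFF3 to SQLite can be sketched as follows. This is only an illustration: the `feature` table, its column names, and the function `load_gff3_into_sqlite` are hypothetical and do not reflect the schema produced by the actual helper tool. Only the `ID` and `Parent` attributes from the ninth GFF3 column are kept.

```python
import sqlite3


def load_gff3_into_sqlite(gff3_lines, conn):
    """Insert GFF3 feature rows into a hypothetical 'feature' table.

    Returns the number of rows inserted. Comment lines, blank lines and
    lines without exactly 9 tab-separated columns are skipped.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS feature ("
        "seqid TEXT, type TEXT, seq_start INTEGER, seq_end INTEGER, "
        "strand TEXT, feature_id TEXT, parent_id TEXT)"
    )
    rows = []
    for line in gff3_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9:
            continue
        # Column 9 holds semicolon-separated key=value attributes.
        attrs = dict(
            pair.split("=", 1) for pair in cols[8].split(";") if "=" in pair
        )
        rows.append(
            (cols[0], cols[2], int(cols[3]), int(cols[4]), cols[6],
             attrs.get("ID"), attrs.get("Parent"))
        )
    conn.executemany("INSERT INTO feature VALUES (?, ?, ?, ?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


if __name__ == "__main__":
    # Made-up example records, not real annotation data.
    example = [
        "##gff-version 3",
        "chr1\tsrc\tgene\t100\t900\t.\t+\t.\tID=gene1",
        "chr1\tsrc\tmRNA\t100\t900\t.\t+\t.\tID=tx1;Parent=gene1",
    ]
    connection = sqlite3.connect(":memory:")
    print(load_gff3_into_sqlite(example, connection))
```

Keeping the `Parent` attribute preserves the gene, transcript and exon hierarchy, so child features can later be retrieved for a given gene with a simple query on `parent_id`.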

Results

The resulting gene families can be visualised using the Aequatus.js interactive tool, which is developed as part of the Aequatus software [2].

The Aequatus.js plugin provides an interactive visual representation of the phylogenetic and structural relationships among the homologous genes, using a shared colour scheme for coding regions to represent homology in internal gene structure alongside their corresponding gene trees. It is also able to indicate insertions and deletions in homologous genes with respect to shared ancestors.

Citation information:

If you are using GeneSeqToFamily for any kind of research purpose, please cite the following paper:

Anil S. Thanki, Nicola Soranzo, Wilfried Haerty, Robert P. Davey (2018) GeneSeqToFamily: a Galaxy workflow to find gene families based on the Ensembl Compara GeneTrees pipeline. GigaScience 7(3), giy005, doi: 10.1093/gigascience/giy005

References

  1. Vilella AJ, Severin J, Ureta-Vidal A, Heng L, Durbin R, Birney E (2009) EnsemblCompara GeneTrees: Complete, duplication-aware phylogenetic trees in vertebrates. Genome Res. 19(2):327–335, doi: 10.1101/gr.073585.107
  2. Thanki AS, Ayling S, Herrero J, Davey RP (2016) Aequatus: An open-source homology browser. bioRxiv, doi: 10.1101/055632

Project contacts:

Copyright © 2016-2018 Earlham Institute, Norwich, UK