Skip to content

TheRincon/YAGA

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

30 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

YAGA is a tool that performs ABBA/BABA analysis for Orthofinder results.

From an Orthofinder_Results folder the "yaga" option will conduct an automated filtering of single copy orthologs for the species provided and then, paired with a KEGG, or GO file (or both), it will associate functional enrichment for the different species and characteristics.

The "abba" option will find the orthogroups, extract all of the CDS regions from the genes in those orthogroups and emit four taxa D-statistics for the orthogroup combinations.

While ABBA/BABA was not relavent for my master's thesis because a lack of introgression (fungi are not expected to hybridize and horizontal gene transfer is quite rare), I release this tool in the hope it will be useful to others. Instead, I was looking for identity between genes of organisms that have desired traits but within the structure of the ABBA/BABA four-taxa tree.

Requirements

Needs R, the statistical programming language. If you want to use the "abba" option, then install "evobiR" in R:

install.packages("evobiR")

Please also install mafft and bedtools with the "abba" option. If you have MacOS and Homebrew just use:

brew install mafft bedtools

Linux apt-get install should work as well.

The main functions are from python. Please have version 2.7+ Check the version with:

python -v

Python (2.7+) packages - ete2, argparse, and Biopython. They can be installed by pip. If you have not installed pip please follow insctructions from here.

Then

pip install biopython ete2 argparse

Functionality

The algorithm for the original ABBA/BABA-like analysis of YAGA is as follows:

First, the file paths of pertinent files are deduced from the Orthfinder_Results folder as of Orthofinder version 2.2.1. It will also work for up to version 2.2.6 but I ignored those because of this bug. [put bug issue here] We create two new directories "YAGA" in the main folder, and "Single_Copy_Gene_Trees" in the "Orthologues" directory. The "YAGA" folder contains output files directly related to the analysis, while "Single_Copy_Gene_Trees" conatins the Gene trees which only have 1 gene per species tested, and all species must be present, essentially analogous to BUSCOs but for this dataset only.

From the Orthogroups.csv file, we can gather the Orthogroups, the members of that orthogroup and the order in which they are specified in the file.

Next, the "target.json" is parsed to glean the info for which species are what. Typically ABBA looks like this:

             /\
            /  \
           /    \
          /\     \
         /  \     \
        /    \     \
       /\     \     \
      /  \     \     \
     /    \     \     \
    /      \     \     \
   /        \     \     \
  /          \     \     \
 Target  Nearest  Trait  Outgroup

The target and the nearest neighbor (by expected phylogeny) are the two closest for most of the trees. This is because they are specifically chosen to be close, and the gene sequence identity should follow this pattern. For genes that are necessary for a specific trait, a rock-inhabiting fungi for example, the "rock" genes of the "target" and "trait" species would be more similar indicating the tree had rearranged to BABA as shown below:

             /\
            /  \
           /    \
          /\     \
         /  \     \
        /    \     \
       /\     \     \
      /  \     \     \
     /    \     \     \
    /      \     \     \
   /        \     \     \
  /          \     \     \
Target    Trait  Nearest  Outgroup

If the signal was strong enough, for instance, out of 3000 single copy orthologs identifed, 30 genes were annotated for GO:0033961 out of 32 of that very same GO:0033961 present in the whole genome overall, then we might be able to say it was enriched, as the fisher test would look something like:

               | Genes in X | Genes NOT in X |
---------------|------------|----------------|
Genes in A     |    30      |         2      |
Genes NOT in A |    500     |       2468     |

From the species, when we have the "target" we go through all the trees and find the "sisters" to the target species. From the sisters (assuming the tree is resolved, MSA usually needs to be run with Orthofinder), we can then infer which orthogroups are closer if the sister appears.

So to reiterate, the GO term GO:0033961 was found 32 times in the whole genome and 30 of those instances were in the "enriched" set. 530 genes in total were in the "enriched" set. Notice I say genes, when I refer to a GO term, this is because if the interproscan file is included, YAGA gets an array of all the GO terms in the file (even duplicates) and then filters for unique per gene, so in effect it counts gene occurences linked to the GO term. Then it is correct to say genes, since the GO terms are deduplicated.

The OUTPUT file is then fed into R, through the Rscript and the fisher.test is run for each line like so:

> x <- matrix(c(30, 2, 100, 2868), byrow = TRUE, 2, 2)
> fisher.test(x)

Fisher's Exact Test for Count Data

data:  x
p-value < 2.2e-16
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
  18.65206 641.52001
sample estimates:
odds ratio 
  73.90576

This would be an example of enrichment.

True ABBA/BABA analysis

It might be useful to use the pipeline demonstrated in this tutorial. This does true ABBA/BABA, albeit with a custom ".geno" file. It also involves some preprocessing.

True ABBA/BABA would take SNPs and look at the D-Statistics. It would be easist to accept a GFF/GTF file and extract the DNA sequence from there. But, to make YAGA as accessible as possible, one can optionally use only a protein fasta and a refrence genome in fasta format. Exonerate can take the proteins and create alignments. (In progress)

Regardless of how the DNA is come by, the sequences can then be compared to see certain polymorphisms and evidence of introgression.

Running YAGA

python yaga.py abba --target /path/to/dir/YAGA/target.json

The only required arugment for the "abba" option is "--target" or "-t". This file contains the names of taxa in the analysis, or at least a substring, so the taxa can be recognized in other files. It also needs the gff tracks with structural annotation (read: CDS locations). These are then extracted with bedtools "getfasta" and then joined to a mature DNA which is then aligned with MAFFT to a combination of sequences from the other 3 members of the four-taxa orthogroup.

The aligned fastas are then read into R and processed by "evobiR", and the D-statistics returned for each single combination and the entire orthogroup average. Of course, this depends entirely on the combination

Likewise the "yaga" option needs the "--target" option and the GO file (interproscan output) and an optional KEGG annotation.

python yaga.py yaga --target /path/to/dir/YAGA/target.json -g /path/to/dir/protein_fasta.csv -k /path/to/dir/kegg.csv

In Progress

  • Be able to align MAUVE co-linear blocks
  • Get duplicates and missing/expanded ("yaga" option)
  • Implement "geno" file and genome wide analysis
  • Find average D-stats by gene present, not just overall combination

Tips

Orthofinder names the "species" according to the protein fasta file names, so the "target.json" will need to have values similar to the filenames. If not, just run Orthofinder again with renamed protein fasta files.

Tested on Orthofinder 2.2.1, other Orthofinder versions should work.

About

ABBA/BABA Analysis (D-Statistics)

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages