msemon edited this page Jun 23, 2017 · 20 revisions

Preamble

CAARS assembles and annotates transcripts in a single run. Assembly is guided by guide taxa (i.e., sister species, which can be highly divergent). The assembled transcripts are then inserted into user-provided multi-species alignments. Gene trees are subsequently inferred, and annotation is performed using the phylogenetic information in these trees.

Installing CAARS

CAARS has many dependencies (such as BLAST and Trinity). To avoid installing all of them on your machine, we suggest using Docker. Docker creates a local environment on your computer that contains every CAARS dependency, all packaged within the CAARS Docker image. (If you don't want to use Docker and prefer to install CAARS from source, follow these instructions.)

If you don't have Docker on your machine, you may get it here first. (Be aware that the installation procedure differs between Linux, macOS and Windows.)

Using Docker: the easy way (several seconds to a few minutes)

We will use the Docker image named carinerey/caars. This image will be run in a Docker container on your local machine (this container is a closed environment where CAARS and all its dependencies are already installed).

To exchange files with the Docker container, you first need to create a shared directory on your machine. This directory will be visible both from your machine and from inside the container.

# ON YOUR MACHINE

mkdir -p /home/crey/shared/     # any directory of your choice
cd /home/crey/shared/

We can now start the Docker container with the image carinerey/caars.

# ON YOUR MACHINE

export SHARED_DIR=$PWD      # $SHARED_DIR is the path shared between your machine and the Docker container
# start a container from the image carinerey/caars
# the first download of the image can take several minutes (around 2 GB)
# later starts should take only a few seconds
docker run -t -i -e LOCAL_USER_ID=`id -u $USER` -e SHARED_DIR=$SHARED_DIR -v $SHARED_DIR:$SHARED_DIR carinerey/caars bash

# "-e LOCAL_USER_ID=`id -u $USER`"    ensures that you have all permissions on files created in the Docker container
# "-e SHARED_DIR=$SHARED_DIR"         exports the variable $SHARED_DIR into the Docker container (we will use it later)

Please note that SHARED_DIR must contain an absolute path, as it does here. CAARS builds links with absolute paths, and these links will be broken if the directory tree is not the same on your machine and in the container.
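Because a relative path would silently break these links, it can be worth checking SHARED_DIR before launching the container. The snippet below is a minimal sketch, not part of CAARS: the function name is_absolute_path is hypothetical, and the fallback to $PWD simply mirrors the export command used in this guide.

```shell
# Hypothetical helper (not provided by CAARS): succeeds only if the
# argument starts with "/", i.e. is an absolute path.
is_absolute_path() {
    case "$1" in
        /*) return 0 ;;
        *)  return 1 ;;
    esac
}

# Fall back to the current directory if SHARED_DIR is not set yet.
SHARED_DIR="${SHARED_DIR:-$PWD}"

if is_absolute_path "$SHARED_DIR"; then
    echo "SHARED_DIR is absolute: $SHARED_DIR"
else
    echo "SHARED_DIR must be an absolute path, got: $SHARED_DIR" >&2
fi
```

Run it on your machine, in the same terminal, just before the docker run command.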

Great! You just entered the Docker container. You can see that the prompt has changed and now looks something like user_caars@e3921d1820bf:shared$. All commands that follow will be executed from the terminal in the Docker container. (To exit the Docker container, simply type exit.)
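To confirm that the LOCAL_USER_ID option had the intended effect, you can compare the owner of a file created in the shared directory with your own user ID. This is a rough sketch under stated assumptions: the helper names owner_uid and owned_by_me and the probe file name are hypothetical, and the check is meant to be run on your machine after a file has been created from inside the container.

```shell
# Hypothetical helper (not part of CAARS): print the numeric owner ID of a file.
# `ls -ldn` is used rather than `stat` because stat's options differ
# between Linux and macOS.
owner_uid() {
    ls -ldn "$1" | awk '{print $3}'
}

# Hypothetical helper: succeed only if the file belongs to the current user.
owned_by_me() {
    [ "$(owner_uid "$1")" = "$(id -u)" ]
}

# Illustrative probe file; create it first from inside the container,
# e.g. with: touch "$SHARED_DIR/caars_permission_probe.txt"
probe="${SHARED_DIR:-$PWD}/caars_permission_probe.txt"

if [ ! -e "$probe" ]; then
    echo "No probe file found at $probe"
elif owned_by_me "$probe"; then
    echo "OK: $probe is owned by you"
else
    echo "Warning: $probe is owned by UID $(owner_uid "$probe")" >&2
fi
```

If the warning appears, the container was probably started without the LOCAL_USER_ID option.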

(If you installed CAARS without Docker, you remain on your local machine and must just run export SHARED_DIR=$PWD .)

How to use CAARS?

See the Tutorial page for usage examples.

How does CAARS work?


Figure 1: CAARS overview.

Representation of the major steps of CAARS:

  • Steps 1-4 group the prerequisite computations:

    1. If no draft transcriptome is provided as input, RNA-Seq data are assembled de novo into coding sequences, which are then parsed to remove the 5' and 3' UTRs.
    2. Sequences from guide species are extracted from the input MSAs to form guide transcriptomes.
    3. Transcripts from the draft transcriptome are assigned to the corresponding gene families by best hit, using the guide transcriptomes as reference.
  • Steps 5-10 group the computations performed for each family:

    1. RNA-Seq reads are clustered and formatted into a database.
    2. Transcripts are assembled again with an assisted and iterative method (Apytram). At the first iteration, genes from the guide species and target transcripts from the draft transcriptome corresponding to this family are used as bait sequences to fish reads in the reads database. Mate reads are used to enlarge this batch of reads. Reads are then de novo assembled, and a new iteration can begin with the reconstructed sequences as baits.
    3. Coding sequences from both assemblies are added to the existing gene family alignments.
    4. A primary gene tree is obtained for the family.
    5. Redundancy is removed by merging sequences from the same species when appropriate. Then, sequences (from target species) with a low alignment score against their sister sequences (from guide or helper species) can be discarded (not shown). The gene tree is then recomputed to take potential changes into account.
    6. The species tree and the gene family tree are used jointly to infer a more likely tree (reconciled tree) placing gene losses and duplications along the gene family tree.

A more detailed description of the implementation can be found here.