# Using TransPi to Perform an "Annotation Only" Run

In the previous notebook, we ran the entire default TransPi workflow, generating a small transcriptome from a test data set.  While that is a valid exercise in carrying through the workflow, the output of the downstream steps (annotation and assessment) will be unrealistic, since the test set generates only a few hundred transcripts.  In contrast, a more complete estimate of a vertebrate transcriptome will contain tens to hundreds of thousands of transcripts.

In this notebook, we will start from an assembled transcriptome.  We will download and work with a more realistic example that was generated and submitted to the NCBI Transcriptome Shotgun Assembly archive.

Make sure that we start out in the working directory that we created and used in the basic assembly notebook.

In [None]:
import subprocess
workdir=subprocess.check_output("pwd").decode("utf-8").rstrip()
workdir
%cd $workdir/transpi_example/
%pwd

In order to run an annotation-only run, TransPi has some specific requirements.  First, we must create a directory named "onlyAnn" in the current directory.

In [None]:
!mkdir onlyAnn

Make a variable to hold the Google Cloud Storage bucket from which we will retrieve the example transcriptomes.

In [None]:
gbucket="gs://nigms-sandbox/nosi-inbremaine-storage"
gbucket

Now we will retrieve a few significantly larger assembled transcriptomes (originating from the [NCBI Transcriptome Shotgun Assembly archive](https://www.ncbi.nlm.nih.gov/genbank/tsa/)).

In [None]:
!gsutil cp $gbucket/example_transcriptomes/Oncorhynchus_mykiss_GGBN01.1.fsa_nt.gz ./
!gsutil cp $gbucket/example_transcriptomes/Pseudacris_regilla_GAEI01.1.fsa_nt.gz ./
!gsutil cp $gbucket/example_transcriptomes/Microcaecilia_dermatophaga_GFOE01.1.fsa_nt.gz ./

TransPi will process any transcriptome assembly that it finds in the "onlyAnn" directory, but the file names must end in either ".fa" or ".fasta".  The files that we downloaded are gzipped and do not have the right extension, so we will use gzip to decompress them, redirecting the output into a renamed file within onlyAnn.

In [None]:
!gzip -d -c ./Oncorhynchus_mykiss_GGBN01.1.fsa_nt.gz > onlyAnn/Oncorhynchus_mykiss_GGBN01.1.fa
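
The same decompress-and-rename step can also be done from Python, which may be convenient when handling several files. The following is a minimal sketch; the helper name `unpack_for_transpi` is hypothetical (it is not part of TransPi), and it assumes the downloaded files end in ".fsa_nt.gz" or at least ".gz".

```python
import gzip
import shutil
from pathlib import Path


def unpack_for_transpi(gz_path, out_dir="onlyAnn"):
    """Decompress a gzipped FASTA file into out_dir, renamed to end in '.fa'.

    Hypothetical helper for illustration; not part of TransPi.
    """
    gz_path = Path(gz_path)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Strip the TSA-style suffix (or a plain ".gz") before adding ".fa"
    name = gz_path.name
    for suffix in (".fsa_nt.gz", ".gz"):
        if name.endswith(suffix):
            name = name[: -len(suffix)]
            break

    target = out_dir / (name + ".fa")
    # Stream the decompressed bytes so large assemblies are not held in memory
    with gzip.open(gz_path, "rb") as src, open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target
```

For example, `unpack_for_transpi("Oncorhynchus_mykiss_GGBN01.1.fsa_nt.gz")` would write `onlyAnn/Oncorhynchus_mykiss_GGBN01.1.fa`, matching the file produced by the gzip command above.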

Count the sequences in this file (you should get 31,176).

In [None]:
!grep -c ">" onlyAnn/Oncorhynchus_mykiss_GGBN01.1.fa
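
Note that `grep -c ">"` counts every line that contains a ">" anywhere, which matches the number of sequences as long as ">" appears only once per header line. A slightly stricter count, sketched here in Python with a hypothetical helper name, only counts lines that begin with ">".

```python
def count_fasta_records(path):
    """Count FASTA records by counting lines that start with '>'."""
    count = 0
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                count += 1
    return count
```

Calling `count_fasta_records("onlyAnn/Oncorhynchus_mykiss_GGBN01.1.fa")` should agree with the grep count for a well-formed FASTA file.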

Now we can finally run TransPi with the option "--onlyAnn", which assumes that the transcriptome has already been generated and runs only the steps that annotate the transcripts.  Note that we choose to put the output into a new directory, so as not to overwrite results from our previous runs.

This run should take about 36 minutes, assuming an N1 high-memory instance with 16 processors and 104GB of memory.

In [None]:
!nextflow run ../TransPi/TransPi.nf -profile docker --onlyAnn --outdir results_onlyAnn -resume

As with the basic assembly example of the last workbook, the output will be arranged in a directory structure that is automatically created by nextflow.  Let's get a listing.

In [None]:
!ls -l results_onlyAnn

Let's take a look at the "RUN_INFO.txt" file to see what the parameters and programs associated with our analysis were.

In [None]:
!cat results_onlyAnn/RUN_INFO.txt

Note that while the "onlyAnn" run carries out the searches against Pfam and the BLAST analysis against known proteins, it does not carry out the BUSCO analysis.  We can run that ourselves, but in order to do so, we need to learn a little bit about running programs from containers.  

Container systems (and associated images) are one approach that simplifies the use of a broad set of programs, such as is commonly found in the wide field of computational biology. To put it concisely, most programs are not "stand-alone" but instead rely upon at least a few supporting libraries or auxiliary programs.  Since many analyses require multiple programs, installation of the necessary programs will also require installation of the supporting components, and critically *sometimes the supporting components of one program conflict with those of other programs.*  

Container systems ([Docker](https://www.docker.com/) and [Singularity](https://sylabs.io/singularity/) are the two best-known examples) address this by installing and encapsulating the program and all of its necessary supporting components in an image.  Each program is then executed in the context of its container, which is activated just long enough to run the program.

Because of the way that we ran the TransPi workflow in the previous notebook, our system will already have several container images installed.  We can now work directly with these images.

Start by getting a listing of the images that are currently loaded.  

In [None]:
!docker images

We want the docker image that contains the program (and all necessary infrastructure) for running the BUSCO analysis.  The name is in the first column, but we also need the version (the tag), which is in the second column.  So let's put those together, start the container, and ask it to run busco and just give us back the help message.

We will use the docker run command, and we will use the following options with it:
- -it, which means run interactively
- --rm, which means clean up after shutting down
- --volume /home:/home This is critical, because by default, a docker image can only see the file system inside of the container image.  We need to have it see our working directory, so we create a volume mapping.  For simplicity, we will just map the /home directory outside the container to the same address inside.  This will let us access and use all of the files that are below /home.
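
One practical consequence of the volume mapping is that any file passed to a program in the container must be given as an absolute path below the mapped directory. A minimal sketch of such a check follows; the function name `visible_in_container` is hypothetical, and the default root assumes the `/home:/home` mapping described above.

```python
import posixpath
from pathlib import PurePosixPath


def visible_in_container(path, mounted_root="/home"):
    """Return True if path lies at or below the host directory mapped into the container.

    Assumes the host directory is mapped to the same address inside the
    container (as with --volume /home:/home). Relative paths return False.
    """
    normalized = PurePosixPath(posixpath.normpath(path))
    try:
        normalized.relative_to(mounted_root)
        return True
    except ValueError:
        return False
```

With this mapping, a path such as `/home/jupyter/transpi_example/onlyAnn/Oncorhynchus_mykiss_GGBN01.1.fa` would be visible, while a file under `/tmp` would not.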

In [None]:
!docker run -it --rm --volume /home:/home quay.io/biocontainers/busco:4.1.4--py_2 busco --help
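
The image reference used above is the repository and tag columns of `docker images` joined by a colon. As a sketch of how that lookup could be automated, the code below asks docker for `repository:tag` strings through its `--format` option and filters them by name. The helper names are hypothetical, and `list_image_refs` assumes the `docker` command is available on the path.

```python
import subprocess


def list_image_refs():
    """Return 'repository:tag' strings for all locally available docker images."""
    out = subprocess.check_output(
        ["docker", "images", "--format", "{{.Repository}}:{{.Tag}}"],
        text=True,
    )
    return [line.strip() for line in out.splitlines() if line.strip()]


def pick_images(refs, keyword):
    """Keep only the references whose repository name contains keyword."""
    # rsplit on the last colon separates the tag from the repository
    return [ref for ref in refs if keyword in ref.rsplit(":", 1)[0]]
```

On a system where the TransPi workflow has been run, `pick_images(list_image_refs(), "busco")` would be expected to include `quay.io/biocontainers/busco:4.1.4--py_2`.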

Now that we have that, we will fill out a complete command and ask busco to analyze the same data as above (the trout dataset that is currently in the "onlyAnn" subdirectory), and put the output in a new directory here in our current working directory.  Here is the full command needed to make this run go.  There is a lot going on here:
- -i \$workdir/transpi_example/onlyAnn/Oncorhynchus_mykiss_GGBN01.1.fa => this tells the location and name of the file to be examined
- -l \$workdir/TransPi/DBs/busco_db/vertebrata_odb10 => this tells busco to use the vertebrata gene set (genes common to vertebrates) as the target
- -o GGBN01_busco_vertebrata => this tells busco to use this for the label of the output  
- --out_path \$workdir/transpi_example => this tells busco to put the output in the current directory
- -m tran => this tells busco that the input is transcripts (rather than protein or genomic data) 
- -c 14 => this tells busco to use 14 cpus
- --offline => this tells busco not to try to download updated versions of the target gene set
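
Since the same command is reused for each transcriptome with only the input file and output label changing, the options listed above can also be assembled programmatically. The following is a sketch only: the function name `busco_command` is hypothetical, and the example value `/home/jupyter` for the working directory is an assumption that should be replaced by your own `workdir`.

```python
import shlex


def busco_command(workdir, fasta_name, label, cpus=14,
                  image="quay.io/biocontainers/busco:4.1.4--py_2"):
    """Build the docker + busco argument list described in the option list above."""
    base = f"{workdir}/transpi_example"
    return [
        "docker", "run", "-it", "--rm", "--volume", "/home:/home", image,
        "busco",
        "-i", f"{base}/onlyAnn/{fasta_name}",                      # input transcripts
        "-l", f"{workdir}/TransPi/DBs/busco_db/vertebrata_odb10",  # target gene set
        "-o", label,                                               # output label
        "--out_path", base,                                        # where output goes
        "-m", "tran",                                              # transcript mode
        "-c", str(cpus),                                           # number of cpus
        "--offline",                                               # no downloads
    ]


# "/home/jupyter" is an assumed example value for workdir
cmd = busco_command("/home/jupyter",
                    "Oncorhynchus_mykiss_GGBN01.1.fa",
                    "GGBN01_busco_vertebrata")
print(shlex.join(cmd))
```

Keeping the arguments in a list makes it harder to lose the space between two options when the command is split across lines.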

In [None]:
!docker run -it --rm --volume /home:/home quay.io/biocontainers/busco:4.1.4--py_2 busco \
-i $workdir/transpi_example/onlyAnn/Oncorhynchus_mykiss_GGBN01.1.fa \
-l $workdir/TransPi/DBs/busco_db/vertebrata_odb10 -o GGBN01_busco_vertebrata \
--out_path $workdir/transpi_example -m tran -c 14 --offline

The analysis should take about 12 minutes.

In [None]:
!ls GGBN01_busco_vertebrata/

The output above (with almost no hits) illustrates one of the maxims of this kind of work:  ***Know what your data is before you start analyzing it.***

The sample in question, GGBN01, can be looked up in NCBI TSA, where it is explicitly described as targeting long non-coding (lnc) RNA sequences.  Part of the process was to filter out all probable protein-coding genes.  As such, most of the annotation tools will fail to find anything.  

So let's try instead with one of the other transcriptomes that we downloaded from TSA.

In [None]:
!rm onlyAnn/Oncorhynchus_mykiss_GGBN01.1.fa
!gzip -d -c ./Pseudacris_regilla_GAEI01.1.fsa_nt.gz > onlyAnn/Pseudacris_regilla_GAEI01.1.fa
!docker run -it --rm --volume /home:/home quay.io/biocontainers/busco:4.1.4--py_2 busco \
-i $workdir/transpi_example/onlyAnn/Pseudacris_regilla_GAEI01.1.fa \
-l $workdir/TransPi/DBs/busco_db/vertebrata_odb10 -o GAEI01_busco_vertebrata \
--out_path $workdir/transpi_example -m tran -c 14 --offline -f

This run should take about 20 minutes to complete.

Finally, for one more illustrative comparison, let's run the third analysis.

In [None]:
!rm onlyAnn/Pseudacris_regilla_GAEI01.1.fa
!gzip -d -c ./Microcaecilia_dermatophaga_GFOE01.1.fsa_nt.gz > onlyAnn/Microcaecilia_dermatophaga_GFOE01.1.fa
!docker run -it --rm --volume /home:/home quay.io/biocontainers/busco:4.1.4--py_2 busco \
-i $workdir/transpi_example/onlyAnn/Microcaecilia_dermatophaga_GFOE01.1.fa \
-l $workdir/TransPi/DBs/busco_db/vertebrata_odb10 -o GFOE01_busco_vertebrata \
--out_path $workdir/transpi_example -m tran -c 14 --offline -f

This run should take about 33 minutes to complete.
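
To compare the three runs side by side, the one-line score that BUSCO writes in its short summary can be pulled out of each result. The sketch below assumes that the summary contains a line of the form `C:..%[S:..%,D:..%],F:..%,M:..%,n:..` and that each output directory holds a file whose name starts with `short_summary`; both are assumptions about the BUSCO output layout and should be checked against your own results. The helper names are hypothetical.

```python
import glob
import re

# Pattern for the one-line BUSCO score (assumed format, see lead-in)
SUMMARY_RE = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)


def parse_busco_summary(text):
    """Return the C/S/D/F/M percentages and n as a dict, or None if not found."""
    match = SUMMARY_RE.search(text)
    if match is None:
        return None
    result = {key: float(match.group(key)) for key in "CSDFM"}
    result["n"] = int(match.group("n"))
    return result


def collect_summaries(pattern="*_busco_vertebrata/short_summary*.txt"):
    """Parse every summary file matching pattern (file naming is an assumption)."""
    scores = {}
    for path in sorted(glob.glob(pattern)):
        with open(path) as handle:
            scores[path] = parse_busco_summary(handle.read())
    return scores
```

Running `collect_summaries()` from the `transpi_example` directory would then return one entry per BUSCO run, which makes the contrast between the lncRNA-targeted trout assembly and the other two transcriptomes easy to see.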