## Performing a "standard" basic transcriptome assembly

In this notebook, we will set up and run a basic transcriptome assembly, using the analysis pipeline as defined by the TransPi Nextflow workflow.  The steps to be carried out are the following, and each is described in more detail in the Background material notebook.

- Sequence QC, removing adapters and low-quality sequence
- Sequence normalization, which reduces the reads that appear to be "overrepresented" (based on their k-mer content)
- Generation of multiple 1st-pass assemblies, using tools Trinity, TransAbyss, SOAP, rnaSpades, and Velvet/Oases
- Integration and reduction of the individual transcriptomes using Evidential Gene
- Assessment of the final transcriptome with rnaQuast and BUSCO
- Annotation of the final transcriptome using alignment to known proteins (using diamond/blast) and with Hmmer/Pfam, which assigns probable protein domains
- Generation of output reports

To get started, first change into the working directory that we created during setup.

In [None]:
import subprocess
workdir=subprocess.check_output("pwd").decode("utf-8").rstrip()
workdir
%cd $workdir/transpi_example/
%pwd

One final step will make sure that everything is set up correctly.  There should be a template version of the config file in the "parent directory" - it is set up with a dummy entry (**WORKDIR**) for the value of the working directory.  This is needed in order to tell the TransPi workflow where to find all of the resources needed for the run.

Below is a simple one-line perl script that will replace all instances of **WORKDIR** with the contents of the variable ***workdir*** that we created above.

In [None]:
!perl -ne "s:WORKDIR:$workdir:g; print" ../nextflow.template.config > ./nextflow.config
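
If Perl is not available, the same placeholder substitution can be sketched in Python. This is only an illustration of the idea: the function name `fill_template` and the two example config lines are hypothetical and are not taken from the actual template file.

```python
def fill_template(template_text: str, workdir: str, placeholder: str = "WORKDIR") -> str:
    """Replace every occurrence of the placeholder with the working directory."""
    return template_text.replace(placeholder, workdir)

# Hypothetical template content, for illustration only
# (the real nextflow.template.config will contain different lines).
example = "pipeline_dir = 'WORKDIR/TransPi'\nresults_dir = 'WORKDIR/results'\n"
filled = fill_template(example, "/home/user/transpi_example")
print(filled)
```

To apply this to the real files, you would read `../nextflow.template.config`, pass its text and `workdir` to the function, and write the result to `./nextflow.config`.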

Get a directory listing to make sure you see the nextflow.config file and the directory seq2, which holds our test sequences.

In [None]:
!ls

Get a listing of the sequence directory.  You should see seven pairs of gzipped fastq files (signified by the paired .fastq.gz naming).  Six of these are for individual samples, and the seventh set, labeled "joined", is a concatenation of all files.  Because of the way that TransPi works (as well as some of the programs that it uses), it's best to use a joined set of all sequences to make a unified transcriptome assembly.

In [None]:
!ls seq2/
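
Rather than checking the pairing by eye, a short helper can flag any file whose mate is missing. This is a minimal sketch that assumes the files end in `R1.fastq.gz` and `R2.fastq.gz` (as the read pattern used later in this notebook suggests); the function name and the example file names are hypothetical.

```python
def find_unpaired(filenames):
    """Return the R1/R2 file names whose mate is missing from the list."""
    names = set(filenames)
    unpaired = []
    for name in sorted(names):
        if name.endswith("R1.fastq.gz"):
            mate = name[: -len("R1.fastq.gz")] + "R2.fastq.gz"
        elif name.endswith("R2.fastq.gz"):
            mate = name[: -len("R2.fastq.gz")] + "R1.fastq.gz"
        else:
            continue  # not a read file following the assumed naming
        if mate not in names:
            unpaired.append(name)
    return unpaired

# Hypothetical file names, for illustration only
example = ["sampleA_R1.fastq.gz", "sampleA_R2.fastq.gz", "joined_R1.fastq.gz"]
print(find_unpaired(example))  # → ['joined_R1.fastq.gz']
```

On the real data, `find_unpaired(os.listdir("seq2"))` (after `import os`) should return an empty list if all seven pairs are complete.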

Now we are set to perform the assembly, using the sequences within directory seq2.  The specific sequences here are from Zebrafish, and they represent a selected subset of the sequences from the experiment of Hartig et al (<b>reference here</b>).  

The data were selected in order to create a reasonably large assembly (targeting a few hundred transcripts), while still allowing the result to be checked against the "known" transcripts and genes.

We will set only a small number of the options used in TransPi, focusing on the following:
- --reads, the input read files (note the formatting of the file names: the pattern is quoted and uses wildcards so that it matches both files of the pair)
- "-profile docker"  This is a key setting, as it allows all software to be run from docker container images, removing the need to install all programs locally
- "--outdir" the output directory.  Because it has no leading slash (/), a new directory of the indicated name will be created in the current directory.  All final output generated by the pipeline will end up there.
- --k, the size(s) of *k*-mers to be used in generation of the de Bruijn graphs.  (See the background file for a discussion of the role of *k*, and why it needs to be variable.)
- --maxReadLen, the maximum length of the reads.  (Since these files all come from one experiment, this represents the length of all sequences.)
- --all, this setting tells the workflow to run all steps, from pre-assembly QC, through assembly and refinement, and then finally the analysis and tabulation of annotations to the putative transcripts.
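
To make the role of *k* concrete, here is a minimal sketch of how a read is broken into overlapping *k*-mers, the words from which a de Bruijn graph is built. The read below is made up, and the function is illustrative only; it is not how any of the assemblers implement this step.

```python
def kmers(sequence: str, k: int):
    """Return all overlapping substrings of length k, in order of position."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

read = "ATGGCGTGCA"  # made-up 10-base read, for illustration only
for k in (3, 5):
    words = kmers(read, k)
    # A read of length L yields L - k + 1 k-mers
    print(k, len(words), words[:3])
```

A read of length L yields L - k + 1 *k*-mers, so larger values of *k* give fewer, more specific words per read. For a 50-base read, k = 17 gives 34 *k*-mers, while k = 43 gives only 8.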

Under the assumption of an n1-high-memory node with 16 processors and 104GB of RAM, this run should take approximately <b>45 minutes</b>.

As the workflow executes, the nextflow engine will generate a directory called "work" where it places all of the intermediate information, output, and such needed to carry out this work.

In [None]:
!nextflow run ../TransPi/TransPi.nf \
-profile docker --reads './seq2/joined*R[1,2].fastq.gz' \
--outdir transpi_results --k 17,25,43 --maxReadLen 50 --all 
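
The quoted `--reads` pattern in the command above selects only the "joined" pair. As a rough illustration of which names such a pattern accepts, the sketch below uses Python's `fnmatch` module as a stand-in; Nextflow performs its own pattern matching, which may differ in details, and the candidate file names here are hypothetical.

```python
from fnmatch import fnmatchcase

pattern = "joined*R[1,2].fastq.gz"

# Hypothetical file names, for illustration only
candidates = [
    "joined_R1.fastq.gz",
    "joined_R2.fastq.gz",
    "joined_R3.fastq.gz",
    "sampleA_R1.fastq.gz",
]

# Keep only the names that the pattern accepts
matched = [name for name in candidates if fnmatchcase(name, pattern)]
print(matched)  # → ['joined_R1.fastq.gz', 'joined_R2.fastq.gz']
```

The `*` matches any run of characters, and the bracket expression matches exactly one character from the set inside it, so the pattern picks up the R1 and R2 files while skipping the individual samples.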

The beauty and power of using a defined workflow in a management system (such as Nextflow) is that we not only get a defined set of steps that are carried out in the proper order, but we also get a well-structured and concise directory structure that holds all pertinent output (with naming specified in the command-line call of Nextflow and TransPi).

With the execution complete, let's look at what we have generated, first in the results directory.  We will add the -l argument for a "long listing".

In [None]:
!ls -l transpi_results
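
For a more compact overview than the long listing, the sketch below counts the files and total bytes under each subdirectory of the results. It is a minimal helper written for this notebook (the name `summarize_dir` is hypothetical) and makes no assumption about which subdirectories TransPi creates.

```python
import os

def summarize_dir(top: str):
    """Return {subdirectory: (file_count, total_bytes)} for each direct child directory of top."""
    summary = {}
    for entry in sorted(os.scandir(top), key=lambda e: e.name):
        if not entry.is_dir():
            continue  # skip plain files at the top level
        count, size = 0, 0
        for root, _dirs, files in os.walk(entry.path):
            for fname in files:
                count += 1
                # lstat does not follow symlinks, so broken links do not raise
                size += os.lstat(os.path.join(root, fname)).st_size
        summary[entry.name] = (count, size)
    return summary

# Only report if the results directory exists in the current directory
if os.path.isdir("transpi_results"):
    for name, (count, size) in summarize_dir("transpi_results").items():
        print(f"{name:30s} {count:6d} files {size / 1e6:10.1f} MB")
```

Because sizes are taken with `lstat`, any symbolic links are counted by the size of the link itself rather than the file they point to.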

## The remainder of this notebook will be an investigation/exploration of the results of the assembly and annotations
Use of an established and complex multi-step workflow such as we have given you here has the benefit of saving you a lot of manual effort in setting up and carrying out the individual steps yourself.  It also is highly reproducible, given the same input data and parameters.  

It does, however, generate a lot of output, and it is beyond the scope of this training exercise to go through all of it in detail.  We recommend that you download the complete results directory onto another machine or storage, so that you can view it at your convenience, and on a less expensive machine than you are using to run this tutorial.

At least two possible means can be used to make this copy.  
- If you have a Google Storage bucket available, you can use the gsutil command "gsutil -m cp -r \$workdir/transpi_results/ gs://YOUR-BUCKET-NAME-HERE"
- If you instead have an external machine that accepts ssh connections, then you can use the secure copy scp command "scp -r \$workdir/transpi_results/ YOUR_USERID@YOUR.MACHINE:"

***NOTE*** In both of the commands above, you will need to edit the all-caps parts to match your own bucket or machine.

## Examining the workflow/pipeline performance 
One of the benefits of using Nextflow and a well-defined pipeline/workflow is that when the run is completed, we get a high-level summary of the execution timeline and resources.  For any nextflow run, these will be stored in the output subdirectory *pipeline_info*.  Get a list of what is in there.

In [None]:
!ls transpi_results/pipeline_info

You can use the file browser at left to navigate into this directory.  Do so, and then open up these two files:
- transpi_timeline.html
- transpi_report.html

Each of them has dynamic elements, so you will likely need to press the "Trust HTML" button at the top of the page in order to see all that is present.

Within the timeline file, you can see a graphic representation of the total execution time of the entire workflow, along with when in the process each of the programs was active.  From this diagram, you can also infer the ***dependency*** ordering that is encoded into the TransPi workflow.  For example, none of the assembly runs started until the process labeled **"normalize reads"** was complete, because each of them is run on the normalized data, rather than the raw input.  Similarly, **"evigene"**, the program that integrates and refines the output of all of the assembly runs, doesn't start until all of the assembly processes are complete.  
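
The dependency ordering described above can be sketched with Python's standard `graphlib` module (Python 3.9 or later). The graph below is a deliberately simplified assumption based only on the description in this notebook; it is not TransPi's actual process graph, and the real workflow contains many more steps.

```python
from graphlib import TopologicalSorter

# The assemblers named at the top of this notebook
assemblers = ["Trinity", "TransAbyss", "SOAP", "rnaSpades", "Velvet/Oases"]

# Each key lists the processes that must finish before it can start
# (simplified, assumed graph for illustration only).
dependencies = {"normalize reads": set()}
for name in assemblers:
    dependencies[name] = {"normalize reads"}
dependencies["evigene"] = set(assemblers)

sorter = TopologicalSorter(dependencies)
sorter.prepare()

# Group processes into "waves" that could run at the same time
waves = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())
    waves.append(ready)
    sorter.done(*ready)

for number, wave in enumerate(waves, start=1):
    print(number, wave)
```

Each wave contains processes whose inputs are all available, which is why the assemblers can appear side by side in the timeline while **"evigene"** has to wait for all of them.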

Within the report file, you can get a view of the resources used and activities carried out by each process, including CPUs, RAM, Input/Output, and more.  

It is worth noting here that the results that are stored in our output directory (which we labeled transpi_results) are just a subset of all that was generated in this process.  The table at the bottom of the results 