# MDIBL Transcriptome Assembly Learning Module
# Notebook 3: Performing a "Standard" basic transcriptome assembly

## Overview

This notebook runs the `denovotranscript` Nextflow pipeline on Google Batch. `denovotranscript` is a bioinformatics pipeline for de novo transcriptome assembly of paired-end short reads from bulk RNA-seq. It takes a samplesheet and FASTQ files as input, then performs quality control (QC), trimming, assembly, redundancy reduction, pseudoalignment, and quantification. It outputs a transcriptome assembly FASTA file, a transcript abundance TSV file, and a MultiQC report with assembly quality and read QC metrics.

> <img src="../images/denovo-workflow.png" width="1500">

1. Read QC of raw reads (FastQC)

2. Adapter and quality trimming (fastp)

3. Read QC of trimmed reads (FastQC)

4. Remove rRNA or mitochondrial DNA (optional) (SortMeRNA)

5. Transcriptome assembly using any combination of the following:
    - Trinity with normalised reads (default=True)
    - Trinity with non-normalised reads
    - rnaSPAdes medium filtered transcripts outputted (default=True)
    - rnaSPAdes soft filtered transcripts outputted
    - rnaSPAdes hard filtered transcripts outputted

6. Redundancy reduction with Evidential Gene tr2aacds. A transcript to gene mapping is produced from Evidential Gene’s outputs using gawk.

7. Assembly completeness QC (BUSCO)

8. Other assembly quality metrics (rnaQUAST)

9. Transcriptome quality assessment with TransRate, including the use of reads for assembly evaluation. This step is not performed if profile is set to conda or mamba.

10. Pseudo-alignment and quantification (Salmon)

11. HTML report for raw reads, trimmed reads, BUSCO, and Salmon (MultiQC)


## Prerequisites

**1. Software/Environment:**

*   Jupyter Notebook in Google Cloud Vertex AI (Python kernel)
*   Google Cloud CLI (configured)
*   Nextflow (installed via `mamba`)
*   `jupytercards` (install via `pip`)

**2. Enabled APIs:**

*   Google Batch
*   Google Storage

## Learning Objectives



## Get Started 

### **Step 1:** Data

The data used for this notebook is stored in a public Google Cloud Storage bucket: `gs://nigms-sandbox/nosi-inbremaine-storage/resources`. While you *could* download the data to your local machine for more detailed inspection, as you did in submodule-2, it is not *required*.

First, check the listing of the `resources` directory. Make sure you see the items listed below:
```
DBs  bin  conf  seq2  trans
```

In [None]:
! gsutil ls gs://nigms-sandbox/nosi-inbremaine-storage/resources/

Now, check the listing of the sequence directory, `seq2`. You should see seven pairs of gzipped FASTQ files (signified by the paired `.fastq.gz` naming). Six of these are for individual samples, and the seventh set, labeled **joined**, is a concatenation of all files. Because of the way that `denovotranscript` works (as well as some of the programs that it uses), it's best to use a joined set of all sequences to make a unified transcriptome assembly.

In [None]:
! gsutil ls gs://nigms-sandbox/nosi-inbremaine-storage/resources/seq2/
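
The pipeline reads these paired files through a samplesheet. The sketch below shows one way paired FASTQ URIs could be grouped into samplesheet rows. It rests on several assumptions that are not confirmed by this notebook: the `_R1`/`_R2` file-name suffixes, the column names `sample,fastq_1,fastq_2`, and the `example-bucket` URIs. The helper `build_samplesheet` is hypothetical. Check `test_samplesheet_gcp.csv` for the layout the pipeline actually expects.

```python
import csv
import io
import re

# Assumed naming convention: <sample>_R1.fastq.gz / <sample>_R2.fastq.gz
PAIR_PATTERN = re.compile(r"^(?P<sample>.+)_R(?P<read>[12])\.fastq\.gz$")


def build_samplesheet(uris):
    """Group paired FASTQ URIs by sample name and return CSV text."""
    pairs = {}
    for uri in uris:
        name = uri.rsplit("/", 1)[-1]
        match = PAIR_PATTERN.match(name)
        if match is None:
            continue  # skip files that do not follow the assumed pattern
        pairs.setdefault(match["sample"], {})[match["read"]] = uri

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["sample", "fastq_1", "fastq_2"])  # assumed column names
    for sample in sorted(pairs):
        reads = pairs[sample]
        if "1" in reads and "2" in reads:  # keep complete pairs only
            writer.writerow([sample, reads["1"], reads["2"]])
    return out.getvalue()


# Placeholder URIs for illustration only
example = [
    "gs://example-bucket/seq2/joined_R1.fastq.gz",
    "gs://example-bucket/seq2/joined_R2.fastq.gz",
]
print(build_samplesheet(example))
```

Unpaired files are dropped rather than written with an empty column, since the pipeline is described as working on paired-end reads.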

### **Step 2:** Google Batch Setup

Google Batch will create the needed resources to run Nextflow in a serverless manner. 

#### Change the parameters as desired in the `gbatch` profile inside the `../denovotranscript/nextflow.config` file
 - Google Cloud project ID
 - Google Cloud region 
 - Nextflow work directory
 - Nextflow output directory
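
To make these four settings concrete, the sketch below renders what a Google Batch profile could look like. `process.executor = 'google-batch'`, `google.project`, `google.location`, and `workDir` are standard Nextflow settings. Using `params.outdir` for the output directory is an assumption, and the `gbatch` profile shipped with the pipeline may be organized differently. The helper `render_gbatch_profile` and all values are placeholders.

```python
from string import Template

# Illustrative template only; compare it with the real gbatch profile before editing.
PROFILE_TEMPLATE = Template("""\
profiles {
    gbatch {
        process.executor = 'google-batch'
        google.project   = '$project'
        google.location  = '$region'
        workDir          = '$work_dir'
        params.outdir    = '$out_dir'
    }
}
""")


def render_gbatch_profile(project, region, work_dir, out_dir):
    """Return profile text after checking that the work directory is in Cloud Storage."""
    if not work_dir.startswith("gs://"):
        raise ValueError(f"work_dir should be a gs:// URI, got {work_dir!r}")
    return PROFILE_TEMPLATE.substitute(
        project=project, region=region, work_dir=work_dir, out_dir=out_dir
    )


print(render_gbatch_profile(
    "my-project-id", "us-central1", "gs://my-bucket/work", "gs://my-bucket/results"
))
```

The `gs://` check reflects that Google Batch jobs exchange intermediate files through a Cloud Storage work directory.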

### **Step 3:** Install Nextflow

In [None]:
%%capture
! mamba create  -n nextflow -c bioconda nextflow -y
! mamba install -n nextflow ipykernel -y

In [None]:
# Add nextflow to the PATH
import os
os.environ['PATH'] = '/opt/conda/envs/nextflow/bin:' + os.environ['PATH']
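
Before launching the pipeline, it can help to confirm that the kernel actually sees the executable. This is a minimal check using the standard library. The helper name `tool_on_path` is hypothetical.

```python
import shutil


def tool_on_path(name):
    """Return the resolved location of an executable, or None if it is not on PATH."""
    return shutil.which(name)


location = tool_on_path("nextflow")
if location is None:
    print("nextflow not found; re-run the PATH cell above")
else:
    print(f"nextflow found at {location}")
```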

### **Step 4:** Run `denovotranscript`

In [None]:
! nextflow run ../denovotranscript/main.nf --input ../denovotranscript/test_samplesheet_gcp.csv -profile gbatch --run_mode full
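
If you prefer to launch the run from Python rather than with `!`, the command can be assembled as a list. The sketch below mirrors the command above and optionally appends Nextflow's `-resume` flag, which reuses cached results from an earlier, interrupted run. The helper `build_nextflow_command` is hypothetical.

```python
import shlex


def build_nextflow_command(main_nf, samplesheet, profile="gbatch",
                           run_mode="full", resume=False):
    """Assemble the nextflow invocation as an argument list."""
    cmd = [
        "nextflow", "run", main_nf,
        "--input", samplesheet,
        "-profile", profile,
        "--run_mode", run_mode,
    ]
    if resume:
        cmd.append("-resume")  # reuse cached task results where possible
    return cmd


cmd = build_nextflow_command(
    "../denovotranscript/main.nf",
    "../denovotranscript/test_samplesheet_gcp.csv",
    resume=True,
)
print(shlex.join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` to execute it. Printing it first makes it easy to confirm the arguments.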

The beauty and power of using a defined workflow in a management system (such as Nextflow) are that we not only get a defined set of steps that are carried out in the proper order, but we also get a well-structured and concise directory structure that holds all pertinent output.

---

With the execution complete, let's look at what we have generated, first in the results directory.

In [None]:
! gsutil ls gs://<YOUR-BUCKET-NAME>/<Your-Output-Directory>/

## Investigation and Exploration: Assembly and Annotation Results
The use of an established and complex multi-step workflow (such as the `denovotranscript` workflow that you just ran) has the benefit of saving you a lot of manual effort in setting up and carrying out the individual steps yourself. It is also highly reproducible, given the same input data and parameters.

It does, however, generate a lot of output, and it is beyond the scope of this training exercise to go through all of it in detail. We recommend that you download the complete results directory onto another machine or storage so that you can view it at your convenience, and on a less expensive machine than the one you are using to run this tutorial. *If you would like to proceed with the data in its current location, this also works; just bear in mind that it will cost roughly $0.72 per hour.*

<div class="alert alert-block alert-success">
    <i class="fa fa-hand-paper-o" aria-hidden="true"></i>
    <b>Note: </b>  To Download...
</div>

>Here are two possible options to access the results files outside of this expensive JupyterLab instance.  
>- If you have an external machine that accepts ssh connections, you can use the secure copy (`scp`) command: `!scp -r ./basicRun/output/ YOUR_USERID@YOUR.MACHINE:` (the trailing colon copies into your home directory on that machine).
>- If you have a Google Cloud Storage bucket available, you can use the gsutil command: `!gsutil -m cp -r ./basicRun/output gs://YOUR-BUCKET-NAME-HERE/transpi-basicrun-output` to place all of your results into that bucket. 
>    - From there you have two options: 
>         1. (Recommended) You could create a new (cheaper) Vertex AI instance (or use an old one) and copy the files down into that new directory using the following gsutil command: `!gsutil -m cp -r gs://YOUR-BUCKET-NAME-HERE/transpi-basicrun-output ./`
>         2. You could navigate to the bucket through the Google Cloud console and open the files through the links labeled `Authenticated URL`
>
>**In all of the commands above, you will need to edit the All-Caps part to match your own bucket or machine.**

<div class="alert alert-block alert-info">
    <i class="fa fa-lightbulb-o" aria-hidden="true"></i>
    <b>Tip: </b>
</div>

> - **After you have the output directory in its desired location, consider the information in the cell below as you explore the output.**
> - **If you are viewing the output in a different location, consider copying or taking a screenshot of the cell below.**
> - **Make sure that if you are viewing your output in a different location, you save your notebooks here, and then stop the VM instance, or it will keep costing money.**
> - **Upon completion of your exploration, return to this submodule to complete the checkpoint quiz.**

## Output Overview
*These sub-directories will be mentioned in the order of their execution within TransPi.*

<div class="alert alert-block alert-success">
    <i class="fa fa-hand-paper-o" aria-hidden="true"></i>
    <b>Note: </b>  HTML files
</div>

>**If you are viewing your output within a JupyterLab VM instance, for the `.html` files to work correctly, you will need to select `Trust HTML` at the top of the screen.** This is due to the dynamic elements within the files.

### FastQC
> FastQC takes the raw read files and runs a swift analysis of the read data. The two key output files are `joined_R1_fastqc.html` and `joined_R2_fastqc.html`, which provide a visual illustration of the read quality metrics. Note that FastQC does not modify the data for later steps; it only reports on data quality.

### Filter
> fastp is a bioinformatics tool that preprocesses the raw read data. It trims poor-quality reads, removes adapter sequences, and corrects errors detected within the reads. The `joined.fastp.html` file provides an overview of the processing done on both read files.

### Assemblies
> TransPi uses five different assembly tools. All of the assembly `.fa` files are placed within the assemblies directory. For all of the assemblers except Trinity, there are four `.fa` files: one for each of the three *k*-mer lengths plus a compilation of all three. Trinity does not have the option to customize the *k*-mer size. Instead, it runs at a default `k=25` and therefore produces only one assembly.

### EviGene
> At this point, we have a major overassembly of the transcriptome. We use a small piece of the EvidentialGene (EviGene) program known as tr2aacds, which takes all of the assemblies and crunches them into a single, unified transcriptome. Within the evigene directory, there are two files: `joined.combined.fa` contains all of the assemblies placed into the same file, and `joined.combined.okay.fa` is the combined transcriptome after EviGene has reduced it. Each header line carries key information about the sequence.
>> For example: `>SOAP.k17.C9429 58.0 evgclass=main,okay,match:SPADES.k43.NODE_313_length_1670_cov_12.047941_g161_i0,pct:100/100/.; aalen=392,75%,complete;`
>>
>> - This header indicates that this sequence was found in both the SOAP and SPADES assemblies.
>> - The `evgclass=main` means that this sequence is the primary transcript, and there are alternates identified.
>> - The `aalen=392` is the amino acid length of the sequence.
>> - The `complete` means that it is a complete reading frame.
>> - For more information on interpreting the headers from EviGene, reference the following [link](http://arthropods.eugenes.org/EvidentialGene/evigene/) in section 3.
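
The fields described above can be pulled out of a header programmatically. The sketch below is a minimal parser written against the single example header shown. It is not an official EviGene parser, other headers may carry fields it does not handle, and the function name `parse_evigene_header` is hypothetical.

```python
import re


def parse_evigene_header(header):
    """Extract the ID, class, match, amino acid length, and ORF status from a header."""
    seq_id = header.lstrip(">").split()[0]
    # Collect key=value groups; each value runs up to the next semicolon.
    fields = dict(re.findall(r"(\w+)=([^;]*)", header))
    evgclass = fields.get("evgclass", "").split(",")
    aalen = fields.get("aalen", "").split(",")
    match_id = next(
        (part[len("match:"):] for part in evgclass if part.startswith("match:")),
        None,
    )
    return {
        "id": seq_id,
        "class": evgclass[0] or None,
        "match": match_id,
        "aa_length": int(aalen[0]) if aalen[0].isdigit() else None,
        "orf_status": aalen[-1] if len(aalen) > 1 else None,
    }


header = (
    ">SOAP.k17.C9429 58.0 evgclass=main,okay,"
    "match:SPADES.k43.NODE_313_length_1670_cov_12.047941_g161_i0,"
    "pct:100/100/.; aalen=392,75%,complete;"
)
print(parse_evigene_header(header))
```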

### BUSCO
> BUSCO uses a database of known universal single-copy orthologs under a specific lineage (vertebrata in this case) and checks our assembled transcriptome for the sequences that it expects to find. BUSCO was run on both the TransPi assembly and the Trinity-only assembly. To view BUSCO's results, refer to the `short_summary.specific.vertebrata_odb10.joined.TransPi.bus4.txt` and `short_summary.specific.vertebrata_odb10.joined.Trinity.bus4.txt` files.
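
BUSCO short summaries contain a one-line score string of the form `C:..%[S:..%,D:..%],F:..%,M:..%,n:..` (complete, single-copy, duplicated, fragmented, missing, total). The sketch below extracts those numbers so that the two assemblies can be compared side by side. The sample text uses made-up values, not results from this run, and `parse_busco_summary` is a hypothetical helper.

```python
import re

BUSCO_LINE = re.compile(
    r"C:(?P<complete>[\d.]+)%\[S:(?P<single>[\d.]+)%,D:(?P<duplicated>[\d.]+)%\],"
    r"F:(?P<fragmented>[\d.]+)%,M:(?P<missing>[\d.]+)%,n:(?P<total>\d+)"
)


def parse_busco_summary(text):
    """Return the BUSCO percentages and the total ortholog count as a dict."""
    match = BUSCO_LINE.search(text)
    if match is None:
        raise ValueError("no BUSCO summary line found")
    scores = {key: float(value) for key, value in match.groupdict().items()}
    scores["total"] = int(scores["total"])
    return scores


# Illustrative values only
sample_text = "\tC:90.5%[S:80.0%,D:10.5%],F:4.5%,M:5.0%,n:1000\n"
print(parse_busco_summary(sample_text))
```

To use it on a real file, read the summary with `open(path).read()` and pass the text in.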

### Mapping 
> One way to verify the quality of the assembly is to map the original input reads to the assembly (using an alignment program called bowtie2). There are two output files, one for the TransPi assembly and one for the Trinity-only assembly. These files are named `log_joined.combined.okay.fa.txt` and `log_joined.Trinity.fa.txt`.
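
bowtie2 ends its alignment summary with a line such as `96.70% overall alignment rate`. Assuming the two log files contain that standard summary, a short helper can pull out the rate for a quick comparison. The sample log below is invented for illustration, and `overall_alignment_rate` is a hypothetical name.

```python
import re


def overall_alignment_rate(log_text):
    """Return the overall alignment rate as a float percentage, or None if absent."""
    match = re.search(r"([\d.]+)% overall alignment rate", log_text)
    return float(match.group(1)) if match else None


# Invented example in the style of a bowtie2 summary
sample_log = """\
1000 reads; of these:
  1000 (100.00%) were paired; of these:
96.70% overall alignment rate
"""
print(overall_alignment_rate(sample_log))
```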

### rnaQUAST
> rnaQUAST is another assembly assessment program. It provides statistics about the transcripts that have been produced. For a brief overview of the transcript statistics, refer to `joined_rnaQUAST.csv`.

### TransDecoder 
> TransDecoder is a program that "decodes" the transcripts. First, it identifies open reading frames (ORFs). It then predicts which of them are likely to be coding regions. For statistics regarding TransDecoder, refer to the `joined_transdecoder.stats` file.

### Trinotate
> Trinotate uses the information regarding likely coding regions produced by TransDecoder to make predictions about potential protein function. It does this by cross-referencing the assembled transcripts to various databases such as pfam and hmmer. These annotations can be viewed in the `joined.trinotate_annotation_report.xls` file.

### Report
> Within `report` is one file: `TransPi_Report_joined.html`. This is an HTML file that combines the results throughout TransPi into a series of visual tables and figures.
>> The sub-directories `stats` and `figures` are intermediary sub-directories that hold information to generate the report.

### pipeline_info
> One of the benefits of using Nextflow and a well-defined pipeline/workflow is that when the run is completed, we get a high-level summary of the execution timeline and resources. Two key files within this sub-directory are `transpi_timeline.html` and `transpi_report.html`. In the `transpi_timeline.html` file, you can see a graphic representation of the total execution time of the entire workflow, along with where in the process each of the programs was active. From this diagram, you can also infer the ***dependency*** ordering that is encoded into the TransPi workflow. For example, none of the assembly runs started until the process labeled **`normalize reads`** was complete, because each of these is run on the normalized data rather than the raw input. Similarly, **`evigene`**, the program that integrates and refines the output of all of the assembly runs, doesn't start until all of the assembly processes are complete. Within the `transpi_report.html` file, you can get a view of the resources used and activities carried out by each process, including CPUs, RAM, input/output, container used, and more.

### RUN-INFO.txt
> `RUN-INFO.txt` provides the specific details of the run such as where the directories are and the versions of the various programs used.

<div class="alert alert-block alert-warning">
    <i class="fa fa-pencil" aria-hidden="true"></i>
    <b>Checkpoint 1:</b>
</div>

*The green cards below are interactive. Spend some time to consider the question and click on the card to check your answer.*

In [None]:
# This is an install that you need to run once to allow the quizzes to be functional.
!pip install jupyterquiz
!pip install jupytercards

In [None]:
from jupytercards import display_flashcards
display_flashcards('../quiz-material/02-cp1-1.json')

In [None]:
display_flashcards('../quiz-material/02-cp1-2.json')

## Conclusion



## Clean Up

Shut down your instance if you are finished.