# MDIBL Transcriptome Assembly Learning Module
# Notebook 2: Performing a "Standard" basic transcriptome assembly

## Overview

In this notebook, we will set up and run a basic transcriptome assembly, using the analysis pipeline as defined by the TransPi Nextflow workflow. The steps to be carried out are the following, and each is described in more detail in the Background material notebook.

- Sequence Quality Control (QC): removing adapters and low-quality sequences.
- Sequence normalization: reducing the reads that appear to be "overrepresented" (based on their *k*-mer content).
- Generation of multiple 1st-pass assemblies using the following tools: Trinity, TransAbyss, SOAP, rnaSpades, and Velvet/Oases.
- Integration and reduction of the individual transcriptomes using EvidentialGene.
- Assessment of the final transcriptome with rnaQuast and BUSCO.
- Annotation of the final transcriptome using alignment to known proteins (using DIAMOND/BLAST) and assignment to probable protein domains (using HMMER/Pfam).
- Generation of output reports.

> <img src="../images/TransPiWorkflow.png" width="1500">
>
> **Figure 1:** TransPi workflow for a basic transcriptome assembly run.

## Learning Objectives

1. **Understanding the TransPi Workflow:** Learners will gain a conceptual understanding of the TransPi workflow, including its individual steps and their order.  This involves understanding the purpose of each stage (QC, normalization, assembly, integration, assessment, annotation, and reporting).

2. **Executing a Transcriptome Assembly:** Learners will learn how to run a transcriptome assembly using Nextflow and the TransPi pipeline, including setting necessary parameters (e.g., k-mer size, read length). They will learn how to interpret the command-line interface for executing Nextflow workflows.

3. **Interpreting Nextflow Output:** Learners will learn to navigate and understand the directory structure generated by the TransPi workflow.  This includes interpreting the output from various tools such as FastQC, FastP, Trinity, TransAbyss, SOAP, rnaSpades, Velvet/Oases, EvidentialGene, rnaQuast, BUSCO, DIAMOND/BLAST, HMMER/Pfam, and TransDecoder.  This involves understanding the different types of output files generated and how to extract relevant information from them (e.g., assembly statistics, annotation results).

4. **Assessing Transcriptome Quality:** Learners will understand how to assess the quality of a transcriptome assembly using metrics generated by rnaQuast and BUSCO.

5. **Interpreting Annotation Results:** Learners will learn to interpret the results of transcriptome annotation using tools like DIAMOND/BLAST and HMMER/Pfam, understanding what information they provide regarding protein function and domains.

6. **Utilizing Workflow Management Systems:** Learners will gain practical experience using Nextflow, a workflow management system, to execute a complex bioinformatics pipeline.  This includes understanding the benefits of using a defined workflow for reproducibility and efficiency.

7. **Working with Jupyter Notebooks:** The notebook itself provides a practical example of how to integrate command-line tools within a Jupyter Notebook environment.

## Prerequisites

* **Nextflow:** A workflow management system used to execute the TransPi pipeline. 
* **Docker:** Used for containerization of the various bioinformatics tools within the workflow.  This avoids the need for local installation of numerous packages.
* **TransPi:** The specific Nextflow pipeline for transcriptome assembly. The notebook assumes it's present in the `/home/jupyter` directory.
* **Bioinformatics Tools (within TransPi):** The workflow utilizes several bioinformatics tools. These are packaged within Docker containers, but the notebook expects that TransPi is configured correctly to access and use them:
    * FastQC: Sequence quality control.
    * FastP: Read preprocessing (trimming, adapter removal).
    * Trinity, TransAbyss, SOAPdenovo-Trans, rnaSpades, Velvet/Oases:  Transcriptome assemblers.
    * EvidentialGene: Transcriptome integration and reduction.
    * rnaQuast: Transcriptome assessment.
    * BUSCO: Assessment of completeness of the assembled transcriptome.
    * DIAMOND/BLAST: Protein alignment for annotation.
    * HMMER/Pfam: Protein domain assignment for annotation.
    * Bowtie2: Read mapping for assembly validation.
    * TransDecoder: ORF prediction and coding region identification.
    * Trinotate: Functional annotation of transcripts.

## Get Started 

**Step 1:** Make sure you are in the correct local working directory as in `01_prog_setup.ipynb`.
> It should be `/home/jupyter`.

In [None]:
%cd /home/jupyter

In [None]:
!pwd

**Step 3A:** First, check the listing of the `resources` directory. Make sure you see the items listed below:
```
DBs  bin  conf  seq2  trans
```

In [1]:
!ls ./resources

bin  conf  DBs	seq2  trans


**Step 3B:** Now, check the listing of the sequence directory, `seq2`. You should see seven pairs of gzipped FASTQ files (signified by the paired `.fastq.gz` naming). Six of these are for individual samples, and the seventh pair, labeled **joined**, is a concatenation of all of the sample files. Because of the way that TransPi works (as well as some of the programs that it uses), it's best to use a joined set of all sequences to make a unified transcriptome assembly.

In [2]:
!ls ./resources/seq2

joined_R1.fastq.gz   SL94882_R2.fastq.gz  SL94885_R1.fastq.gz
joined_R2.fastq.gz   SL94883_R1.fastq.gz  SL94885_R2.fastq.gz
SL94881_R1.fastq.gz  SL94883_R2.fastq.gz  SL94886_R1.fastq.gz
SL94881_R2.fastq.gz  SL94884_R1.fastq.gz  SL94886_R2.fastq.gz
SL94882_R1.fastq.gz  SL94884_R2.fastq.gz
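
As a quick sanity check, the pairing described in Step 3B can be verified programmatically. The following is a minimal sketch that works on the file names from the listing above; the helper name `pair_reads` is made up for this illustration and is not part of TransPi.

```python
import re
from collections import defaultdict

# File names copied from the listing of ./resources/seq2 shown above.
listing = """
joined_R1.fastq.gz   SL94882_R2.fastq.gz  SL94885_R1.fastq.gz
joined_R2.fastq.gz   SL94883_R1.fastq.gz  SL94885_R2.fastq.gz
SL94881_R1.fastq.gz  SL94883_R2.fastq.gz  SL94886_R1.fastq.gz
SL94881_R2.fastq.gz  SL94884_R1.fastq.gz  SL94886_R2.fastq.gz
SL94882_R1.fastq.gz  SL94884_R2.fastq.gz
""".split()


def pair_reads(names):
    """Group file names by sample and record which mates (R1/R2) are present."""
    pattern = re.compile(r"^(?P<sample>.+)_(?P<mate>R[12])\.fastq\.gz$")
    mates = defaultdict(set)
    for name in names:
        match = pattern.match(name)
        if match:
            mates[match["sample"]].add(match["mate"])
    return dict(mates)


pairs = pair_reads(listing)
# A sample is complete only when both the R1 and R2 files were found.
complete = sorted(s for s, m in pairs.items() if m == {"R1", "R2"})
print(len(complete), "complete pairs:", complete)
```

If a sample were missing one of its mates, it would be absent from `complete`, which is an easy thing to spot before launching a long run.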


**Step 4:** Now we are set to perform the assembly using the sequences within the directory `seq2/`.  
> The specific sequences here are from zebrafish, and they represent a selected subset of the sequences from the experiment of [Hartig et al](https://journals.biologists.com/bio/article-pdf/5/8/1134/1114440/bio020065.pdf).

The data were selected to create a reasonably large assembly (targeting a few hundred transcripts) that can still be checked against the "known" transcripts and genes.

We will set only a small number of the options used in TransPi, focusing on the following:
- `-profile docker`: This is a key setting, as it allows all software to be run from Docker container images, negating the need to install all programs locally (in other scenarios, there is the option to add more than one profile).
    - The profile names point to pre-defined groupings of settings within the `nextflow.config` file.
- `--k 17,25,43`: The size(s) of *k*-mers to be used in the generation of the de Bruijn graphs (see the background file for a discussion of the role of *k*, and why it needs to be variable).
- `--maxReadLen 50`: The maximum length of the reads (since these files all come from one experiment, this represents the length of all sequences).
- `--all`: This setting tells Nextflow to run all steps from pre-assembly QC, through assembly and refinement, and then finally the analysis and tabulation of annotations to the putative transcripts.
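
To make the role of each option concrete, here is a small sketch that assembles the four options listed above into a Nextflow argument list. It is an illustration only: the function name `build_transpi_command` and the script name `TransPi.nf` are placeholders assumed for this example, and a real run will typically need further options (such as where the reads are and where output should go) that are not covered in this list.

```python
import shlex


def build_transpi_command(script, kmers, max_read_len, profile="docker"):
    """Assemble a Nextflow argument list from the options described above."""
    return [
        "nextflow", "run", script,
        "-profile", profile,                       # single dash: a Nextflow option
        "--all",                                   # double dash: a pipeline parameter
        "--k", ",".join(str(k) for k in kmers),    # comma-separated k-mer sizes
        "--maxReadLen", str(max_read_len),
    ]


# "TransPi.nf" is a placeholder for wherever the pipeline script actually lives.
cmd = build_transpi_command("TransPi.nf", kmers=[17, 25, 43], max_read_len=50)
print(shlex.join(cmd))
```

Note the difference between the single-dash `-profile` (interpreted by Nextflow itself) and the double-dash options (passed through to the pipeline as parameters).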

Under the assumption of an n1-high-memory node with 16 processors and 104GB of RAM, this run should take approximately **58 minutes**.

As the workflow executes, the Nextflow engine will generate a directory called `work` where it places all of the intermediate information and output that is needed to carry out the work.

<div class="alert alert-block alert-info">
    <i class="fa fa-lightbulb-o" aria-hidden="true"></i>
    <b>Tip: </b> Run-Time Reminder
</div>


> <img src="../images/jupyterRuntime.png" width="500">
>
> Remember that you can tell if the cell is still running by referring to the contents inside the `[ ]:` that sits to the left of the code cell. Or you can check the top right of the screen for the circle.

In [3]:
%%capture
! mamba create  -n nextflow -c bioconda nextflow -y
! mamba install -n nextflow ipykernel -y

<div class="alert alert-block alert-danger">
    <i class="fa fa-exclamation-circle" aria-hidden="true"></i>
    <b>Alert: </b> Remember to change your kernel to <b>conda_nextflow</b> to run nextflow.
    </div>

In [1]:
%cd denovotranscript

/home/ec2-user/SageMaker/Transcriptome-Assembly-Refinement-and-Applications/AWS/denovotranscript


In [None]:
!nextflow run main.nf --input test_samplesheet.csv -profile awsbach,docker --run_mode full


[1m[38;5;232m[48;5;43m N E X T F L O W [0;2m  ~  [mversion 24.10.5[m
[K
Launching[35m `main.nf` [0;2m[[0;1;36mecstatic_tuckerman[0;2m] DSL2 - [36mrevision: [0;36me7227e7015[m
[K

-[2m----------------------------------------------------[0m-
                                        [0;32m,--.[0;30m/[0;32m,-.[0m
[0;34m        ___     __   __   __   ___     [0;32m/,-._.--~'[0m
[0;34m  |\ | |__  __ /  ` /  \ |__) |__         [0;33m}  {[0m
[0;34m  | \| |       \__, \__/ |  \ |___     [0;32m\`-._,-`-,[0m
                                        [0;32m`._,._,'[0m
[0;35m  nf-core/denovotranscript 1.2.0[0m
-[2m----------------------------------------------------[0m-
[1mInput/output options[0m
  [0;34minput              : [0;32mtest_samplesheet.csv[0m
  [0;34moutdir             : [0;32ms3://hadi-test-transcriptome/outdir_transcriptome/[0m

[1mBUSCO options[0m
  [0;34mbusco_lineage      : [0;32mvertebrata_odb10[0m

[1mGeneric options[0m
  [0;34mtr

The beauty and power of using a defined workflow in a management system (such as Nextflow) are that we not only get a defined set of steps that are carried out in the proper order, but we also get a well-structured and concise directory structure that holds all pertinent output (with naming specified in the command-line call of Nextflow and TransPi).

**Step 5:** With the execution complete, let's look at what we have generated, first in the results directory. We will add the `-l` argument for a "long listing".

In [2]:
!ls -l ./basicRun/output

ls: cannot access ./basicRun/output: No such file or directory


## Investigation and Exploration: Assembly and Annotation Results
The use of an established and complex multi-step workflow (such as the TransPi workflow that you just ran) has the benefit of saving you a lot of manual effort in setting up and carrying out the individual steps yourself. It also is highly reproducible, given the same input data and parameters.

It does, however, generate a lot of output, and it is beyond the scope of this training exercise to go through all of it in detail. We recommend that you download the complete results directory onto another machine or storage so that you can view it at your convenience, and on a less expensive machine than the one you are using to run this tutorial. *If you would like to proceed with the data in its current location, this also works; just bear in mind that it will cost roughly $0.72 per hour.*

<div class="alert alert-block alert-success">
    <i class="fa fa-hand-paper-o" aria-hidden="true"></i>
    <b>Note: </b>  To Download...
</div>

>Here are two possible options to access the results files outside of this expensive JupyterLab instance.  
>- If you have an external machine that accepts ssh connections, you can use the secure copy (`scp`) command: `!scp -r ./basicRun/output YOUR_USERID@YOUR.MACHINE:~/`
>- If you have a Google Cloud Storage bucket available, you can use the gsutil command: `!gsutil -m cp -r ./basicRun/output gs://YOUR-BUCKET-NAME-HERE/transpi-basicrun-output` to place all of your results into that bucket. 
>    - From there you have two options: 
>         1. (Recommended) You could create a new (cheaper) Vertex AI instance (or use an old one) and copy the files down into that new directory using the following gsutil command: `!gsutil -m cp -r gs://YOUR-BUCKET-NAME-HERE/transpi-basicrun-output ./`
>         2. You could navigate to the bucket through the Google Cloud console and open the files through the links labeled `Authenticated URL`
>
>**In all of the commands above, you will need to edit the All-Caps part to match your own bucket or machine.**

<div class="alert alert-block alert-info">
    <i class="fa fa-lightbulb-o" aria-hidden="true"></i>
    <b>Tip: </b>
</div>

> - **After you have the output directory in its desired location, consider the information in the cell below as you explore the output.**
> - **If you are viewing the output in a different location, consider copying or taking a screenshot of the cell below.**
> - **Make sure that if you are viewing your output in a different location, you save your notebooks here, and then stop the VM instance, or it will keep costing money.**
> - **Upon completion of your exploration, return to this submodule to complete the checkpoint quiz.**

## Output Overview
*These sub-directories will be mentioned in the order of their execution within TransPi.*

<div class="alert alert-block alert-success">
    <i class="fa fa-hand-paper-o" aria-hidden="true"></i>
    <b>Note: </b>  HTML files
</div>

>**If you are viewing your output within a JupyterLab VM instance, for the `.html` files to work correctly, you will need to select `Trust HTML` at the top of the screen.** This is due to the dynamic elements within the files.

### FastQC
> FastQC takes the raw read files and runs a quick analysis of the read data. The two key output files are `joined_R1_fastqc.html` and `joined_R2_fastqc.html`, which provide a visual illustration of the read quality metrics. It is important to note that FastQC does not modify the data for further steps; it only reports on data quality.

### Filter
> FastP is a bioinformatics tool that preprocesses the raw read data. It trims poor-quality reads, removes adapter sequences, and corrects errors noticed within the reads. The `joined.fastp.html` provides an overview of the processing done on both read files.
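
To give a feel for what "trimming poor-quality reads" means, here is a deliberately simplified sketch that trims low-quality bases from the 3' end of a single read. It is not FastP's actual algorithm (which is more sophisticated, for example in how it evaluates quality across the read); the function names, the threshold of 20, and the example read are all assumptions made for illustration. It assumes the common Phred+33 quality encoding.

```python
def phred_scores(quality, offset=33):
    """Convert a FASTQ quality string to numeric Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality]


def trim_3prime(sequence, quality, min_quality=20):
    """Drop trailing bases whose Phred score is below min_quality."""
    scores = phred_scores(quality)
    end = len(scores)
    # Walk backwards from the 3' end until a base meets the threshold.
    while end > 0 and scores[end - 1] < min_quality:
        end -= 1
    return sequence[:end], quality[:end]


# Made-up read: 'I' encodes Phred 40, '#' encodes 2, '!' encodes 0.
seq, qual = trim_3prime("ACGTACGTAC", "IIIIIIII#!")
print(seq, qual)
```

The two low-quality bases at the end are removed, while the rest of the read is left untouched.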

### Assemblies
> TransPi uses five different assembly tools. All of the assembly `.fa` files are placed within the assemblies directory. For every assembler except Trinity, there are four `.fa` files: one for each of the three *k*-mer lengths plus a compilation of all three. Trinity does not offer a choice of *k*-mer size; it runs at a fixed `k=25` and therefore produces only one assembly.
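
One way to see why the choice of *k* matters is to count how many *k*-mers a single read contributes: a read of length L contains L - k + 1 overlapping *k*-mers. The sketch below applies this to a made-up 50-base read (matching the `--maxReadLen 50` setting) for the three *k* values used in this run; the read sequence itself is invented for illustration.

```python
def kmers(read, k):
    """Return every overlapping k-mer in a read (there are len(read) - k + 1)."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


read = "ACGT" * 12 + "AC"   # made-up read, 50 bases long
for k in (17, 25, 43):
    print(f"k={k}: {len(kmers(read, k))} k-mers per read")
```

With 50-base reads, moving from `k=17` to `k=43` sharply reduces the number of *k*-mers each read provides, which is one reason the assemblers are run at several values of *k*.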

### EviGene
> At this point, we have a major overassembly of the transcriptome. We use a small piece of the EvidentialGene (EviGene) program known as tr2aacds, which takes all of the assemblies and reduces them to a single, unified transcriptome. Within the evigene directory, there are two files: `joined.combined.fa` contains all of the assemblies placed into the same file, and `joined.combined.okay.fa` is the combined transcriptome after EviGene has reduced it. Each header line carries key information about its sequence.
>> For example: `>SOAP.k17.C9429 58.0 evgclass=main,okay,match:SPADES.k43.NODE_313_length_1670_cov_12.047941_g161_i0,pct:100/100/.; aalen=392,75%,complete;`
>>
>> - This header indicates that this sequence was found in both the SOAP and SPADES assemblies.
>> - The `evgclass=main` means that this sequence is the primary transcript and that alternates were identified.
>> - The `aalen=392` is the amino acid length of the sequence.
>> - The `complete` means that it is a complete reading frame.
>> - For more information on interpreting the headers from EviGene, reference the following [link](http://arthropods.eugenes.org/EvidentialGene/evigene/) in section 3.
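
The fields discussed above can be pulled out of a header with a few regular expressions. This is a minimal sketch based only on the example header shown above, so it may not cover every variant that EviGene writes; the function name and dictionary keys are made up for this illustration, and the meaning of the percentage after `aalen` is an assumption noted in the code.

```python
import re

# The example header quoted above.
header = (">SOAP.k17.C9429 58.0 evgclass=main,okay,"
          "match:SPADES.k43.NODE_313_length_1670_cov_12.047941_g161_i0,"
          "pct:100/100/.; aalen=392,75%,complete;")


def parse_evigene_header(line):
    """Extract a few fields from an EviGene-style FASTA header line."""
    seq_id = line.lstrip(">").split()[0]
    info = {"id": seq_id, "assembler": seq_id.split(".")[0]}
    evgclass = re.search(r"evgclass=([^,;]+)", line)
    matched = re.search(r"match:([^,;]+)", line)
    aalen = re.search(r"aalen=(\d+),(\d+)%,([^;,\s]+)", line)
    if evgclass:
        info["evgclass"] = evgclass.group(1)
    if matched:
        info["match"] = matched.group(1)
    if aalen:
        info["aa_length"] = int(aalen.group(1))
        # Assumption: this percentage is the coding fraction of the transcript.
        info["cds_percent"] = int(aalen.group(2))
        info["orf_status"] = aalen.group(3)
    return info


info = parse_evigene_header(header)
for key, value in info.items():
    print(f"{key}: {value}")
```

Applied to every header in `joined.combined.okay.fa`, a function like this would let you tabulate, for example, how many retained transcripts came from each assembler.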

### BUSCO
> BUSCO uses a database of known universal single-copy orthologs for a specific lineage (vertebrata in this case) and checks our assembled transcriptome for the sequences it expects to find. BUSCO was run on both the TransPi assembly and the Trinity-only assembly. To view BUSCO's results, refer to the `short_summary.specific.vertebrata_odb10.joined.TransPi.bus4.txt` and `short_summary.specific.vertebrata_odb10.joined.Trinity.bus4.txt` files.
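
BUSCO short summaries include a compact one-line notation giving the percentages of Complete (C), Single-copy (S), Duplicated (D), Fragmented (F), and Missing (M) orthologs, plus the total number searched (n). The sketch below parses such a line so that the two assemblies can be compared side by side. The summary string here is made up for illustration and is not a result from this run; the function name is likewise hypothetical.

```python
import re

# Made-up line in BUSCO's one-line notation; the numbers are illustrative only.
summary = "C:89.0%[S:85.8%,D:3.2%],F:6.9%,M:4.1%,n:3354"

BUSCO_LINE = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)


def parse_busco_summary(line):
    """Return the BUSCO percentages and ortholog count from a summary line."""
    match = BUSCO_LINE.search(line)
    if match is None:
        raise ValueError("no BUSCO summary notation found in line")
    scores = {key: float(match[key]) for key in "CSDFM"}
    scores["n"] = int(match["n"])
    return scores


scores = parse_busco_summary(summary)
print(scores)
```

When reading the real files, the complete percentage (C) is usually the headline number, with a high duplicated percentage (D) being common in transcriptomes because of alternative isoforms.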

### Mapping 
> One way to verify the quality of the assembly is to map the original input reads back to the assembly (using an alignment program called Bowtie2). There are two output files, one for the TransPi assembly and one for the Trinity-only assembly. These files are named `log_joined.combined.okay.fa.txt` and `log_joined.Trinity.fa.txt`.
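
Bowtie2 ends its alignment summary with a line reporting the overall alignment rate, which is the quickest single number to compare between the two logs. The sketch below extracts it from log text; the excerpt is invented for illustration (it is not output from this run), and the function name is an assumption of this example.

```python
import re

# Invented excerpt in the style of a Bowtie2 summary; not a result from this run.
log_text = """10000 reads; of these:
  10000 (100.00%) were paired; of these:
92.50% overall alignment rate
"""


def overall_alignment_rate(text):
    """Return the overall alignment rate (as a percentage) or None if absent."""
    match = re.search(r"([\d.]+)% overall alignment rate", text)
    return float(match.group(1)) if match else None


rate = overall_alignment_rate(log_text)
print(f"Overall alignment rate: {rate}%")
```

To use this on the real logs, you would read each file's contents into a string first and pass that to the function.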

### rnaQUAST
> rnaQUAST is another assembly assessment program. It provides statistics about the transcripts that have been produced. For a brief overview of the transcript statistics, refer to `joined_rnaQUAST.csv`.

### TransDecoder 
> TransDecoder is a program that "decodes" the transcripts. First, it identifies open reading frames (ORFs). From there, it then will make predictions on what is likely to be coding regions. For statistics regarding TransDecoder, refer to the `joined_transdecoder.stats` file.
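
The first of those two steps, finding open reading frames, can be illustrated with a short sketch. This is a simplified stand-in, not TransDecoder's implementation: it scans only the three forward reading frames for an `ATG` followed in-frame by a stop codon, and it does no scoring of coding potential. The function name, the minimum length, and the example transcript are all made up for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(sequence, min_codons=3):
    """Return (start, end, frame) for ATG-to-stop ORFs on the forward strand."""
    sequence = sequence.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Step through the sequence one codon at a time in this frame.
        for pos in range(frame, len(sequence) - 2, 3):
            codon = sequence[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos
            elif start is not None and codon in STOP_CODONS:
                if (pos + 3 - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3, frame))
                start = None
    return orfs


# Made-up transcript: two flanking bases, then ATG GCT GAA TTT TAG, then two more.
transcript = "CCATGGCTGAATTTTAGGG"
orfs = find_orfs(transcript)
for start, end, frame in orfs:
    print(f"frame {frame}: {transcript[start:end]} ({(end - start) // 3} codons)")
```

A fuller treatment would also search the reverse complement and allow ORFs that run off the end of a transcript, which is where the "complete" versus partial distinction comes from.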

### Trinotate
> Trinotate uses the likely coding regions produced by TransDecoder to make predictions about potential protein function. It does this by comparing the assembled transcripts against reference resources such as the Pfam domain database (searched with HMMER). These annotations can be viewed in the `joined.trinotate_annotation_report.xls` file.

### Report
> Within `report` is one file: `TransPi_Report_joined.html`. This is an HTML file that combines the results throughout TransPi into a series of visual tables and figures.
>> The sub-directories `stats` and `figures` are intermediary sub-directories that hold information to generate the report.

### pipeline_info
> One of the benefits of using Nextflow and a well-defined pipeline/workflow is that when the run is completed, we get a high-level summary of the execution timeline and resources. Two key files within this sub-directory are `transpi_timeline.html` and `transpi_report.html`. In the `transpi_timeline.html` file, you can see a graphic representation of the total execution time of the entire workflow, along with where in the process each of the programs was active. From this diagram, you can also infer the ***dependency*** ordering that is encoded into the TransPi workflow. For example, none of the assembly runs started until the process labeled **`normalize reads`** was complete, because each of these is run on the normalized data rather than the raw input. Similarly, **`evigene`**, the program that integrates and refines the output of all of the assembly runs, doesn't start until all of the assembly processes are complete. Within the `transpi_report.html` file, you can get a view of the resources used and activities carried out by each process, including CPUs, RAM, input/output, container used, and more.
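
The dependency ordering described above can be modelled as a small directed graph and resolved with a topological sort. The sketch below is a toy version using Python's standard-library `graphlib` (Python 3.9 or later). The process names other than `normalize reads` and `evigene` are simplified labels chosen for this illustration, and Nextflow's own scheduling works differently in detail (it launches a process as soon as its inputs are available).

```python
from graphlib import TopologicalSorter

# Each process maps to the set of processes it must wait for.
# Only two assemblers are listed to keep the toy graph small.
dependencies = {
    "normalize reads": set(),
    "trinity": {"normalize reads"},
    "soap": {"normalize reads"},
    "evigene": {"trinity", "soap"},
}

# static_order() yields the processes in an order that respects every dependency.
order = list(TopologicalSorter(dependencies).static_order())
print(" -> ".join(order))
```

The two assemblers have no dependency on each other, so they may appear in either order, which mirrors how the timeline shows the assemblies running side by side.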

### RUN-INFO.txt
> `RUN-INFO.txt` provides the specific details of the run such as where the directories are and the versions of the various programs used.

<div class="alert alert-block alert-warning">
    <i class="fa fa-pencil" aria-hidden="true"></i>
    <b>Checkpoint 1:</b>
</div>

*The green cards below are interactive. Spend some time to consider the question and click on the card to check your answer.*

In [2]:
# This install only needs to be run once; it enables the quizzes.
!pip install jupyterquiz
!pip install jupytercards

Collecting jupyterquiz
  Downloading jupyterquiz-2.8.1-py2.py3-none-any.whl.metadata (13 kB)
Downloading jupyterquiz-2.8.1-py2.py3-none-any.whl (21 kB)
Installing collected packages: jupyterquiz
Successfully installed jupyterquiz-2.8.1
Collecting jupytercards
  Downloading jupytercards-3.0.5-py2.py3-none-any.whl.metadata (6.5 kB)
Downloading jupytercards-3.0.5-py2.py3-none-any.whl (15 kB)
Installing collected packages: jupytercards
Successfully installed jupytercards-3.0.5


In [4]:
from jupytercards import display_flashcards
display_flashcards('../quiz-material/02-cp1-1.json')

<IPython.core.display.Javascript object>

In [5]:
display_flashcards('../quiz-material/02-cp1-2.json')

<IPython.core.display.Javascript object>

## Conclusion

This Jupyter Notebook demonstrated a complete transcriptome assembly workflow using the TransPi Nextflow pipeline.  We successfully executed the pipeline, encompassing quality control, normalization, multiple assembly generation with Trinity, TransAbyss, SOAP, rnaSpades, and Velvet/Oases, integration via EvidentialGene, and subsequent assessment using rnaQuast and BUSCO.  The final assembly underwent annotation with DIAMOND/BLAST and HMMER/Pfam, culminating in comprehensive reports detailing the entire process and the resulting transcriptome characteristics.  The generated output, accessible in the `basicRun/output` directory, provides a rich dataset for further investigation and analysis, including detailed quality metrics, assembly statistics, and functional annotations.  This module provided a practical introduction to automated transcriptome assembly, highlighting the efficiency and reproducibility offered by integrated workflows like TransPi.  Further exploration of the detailed output is encouraged, and the subsequent notebook focuses on a more in-depth annotation analysis.

## Clean Up

Remember to proceed to the next notebook [`Submodule_03_annotation_only.ipynb`](Submodule_03_annotation_only.ipynb) or shut down your instance if you are finished.