# MDIBL Transcriptome Assembly Learning Module
# Notebook 2: Using "denovotranscript" to Perform an "Annotation Only" Run

## Overview

This Jupyter Notebook is a learning module on transcriptome annotation using the `denovotranscript` pipeline. It guides users through an "annotation only" run, which assumes a pre-assembled transcriptome. The notebook begins with an introductory video and an illustration of the annotation workflow. It then demonstrates downloading a rainbow trout transcriptome from an Amazon S3 bucket and counting its sequences. Users then set up AWS Batch for serverless Nextflow execution, either automatically via a CloudFormation template or manually. After installing Nextflow and switching the kernel, users run `denovotranscript` in annotation-only mode on the downloaded transcriptome and download the results from S3 to the local directory for inspection. The notebook then introduces Docker containers and guides users through running BUSCO within a container to assess transcriptome completeness using the vertebrata gene set. Finally, interactive quizzes prompt users to interpret BUSCO, GO, and TransDecoder results, emphasizing the importance of understanding data provenance. A second, user-driven BUSCO analysis on a different transcriptome is assigned as a final exercise, encouraging exploration of different lineages and critical evaluation of results.

## Learning Objectives

* **Understanding transcriptome annotation:** Learn the process of annotating a pre-assembled transcriptome.
* **Using `denovotranscript` for annotation:** Gain practical experience using the `denovotranscript` pipeline with the `annotation_only` run mode.
* **Working with AWS Batch:** Learn how to set up and utilize AWS Batch for running Nextflow pipelines in a serverless environment.
* **Understanding and using Docker containers:**  Become familiar with Docker containers and how to execute bioinformatics tools like BUSCO within them.
* **Assessing transcriptome completeness with BUSCO:** Learn how to use BUSCO to evaluate the completeness of a transcriptome assembly using different lineage datasets.
* **Interpreting BUSCO, GO, and TransDecoder results:** Develop skills in interpreting the output files generated by these tools and understanding their implications.
* **Understanding data provenance:** Appreciate the importance of considering the origin and processing of transcriptomic data before analysis.
* **Running BUSCO analysis independently:**  Apply learned concepts by independently constructing and executing BUSCO commands for different transcriptomes and lineages.
* **Critical evaluation of BUSCO results:** Learn to analyze BUSCO results critically, considering factors such as transcriptome quality, lineage selection, and biological explanations for observed patterns (e.g., duplicated or fragmented genes).

## Prerequisites

**1. Software/Environment:**

*   Jupyter Notebook (Python kernel)
*   AWS CLI (configured)
*   Nextflow (installed via `mamba`, switch kernel to `conda_nextflow`)
*   Docker (running, user permissions correct)
*   `jupytercards` (install via `pip`)

**2. Enabled APIs:**

*   AWS Batch
*   Amazon S3



## Get Started

In [None]:
#Run the command below to watch the video
from IPython.display import YouTubeVideo

YouTubeVideo('AGuUHmSobEA', width=800, height=400)

> <img src="../images/AnnotationProcess.png" width="800">
>
> **Figure 1:** Annotation workflow for a new, unannotated transcriptome. 

### **Step 1:** Count the sequences

<div class="alert alert-block alert-success">
    <i class="fa fa-hand-paper-o" aria-hidden="true"></i>
    <b>Note: </b>  The Transcriptome
</div>

> The transcriptome that we are using will be downloaded from a public Amazon S3 bucket into your local directory. It lives within the `resources` directory, in the sub-directory named `trans`, and is in the `.fa` (FASTA) file format.

<div class="alert alert-block alert-success">
    <i class="fa fa-hand-paper-o" aria-hidden="true"></i>
    <b>Note: </b>  aws s3
</div>

>`aws s3` is a tool that allows you to interact with Amazon S3 buckets through the command line.

In [None]:
! aws s3 cp --recursive s3://nigms-sandbox/nosi-inbremaine-storage/resources ./resources

> The next cell counts the sequences in the transcriptome. You should get a count of 31,176.

In [None]:
! grep -c ">" ./resources/trans/Oncorhynchus_mykiss_GGBN01.1.fa
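
The same count can be sketched in Python. This is a minimal illustration, not part of the pipeline: the helper name `count_fasta_records` is hypothetical, and it counts only lines that *start* with `>`, which is slightly stricter than `grep -c ">"` (that command counts any line containing `>`).

```python
from pathlib import Path


def count_fasta_records(fasta_path):
    """Count sequences in a FASTA file by counting header lines (those starting with '>')."""
    count = 0
    with open(fasta_path, "r") as handle:
        for line in handle:
            if line.startswith(">"):
                count += 1
    return count


# Path taken from the grep cell above; only counted if the download has been run.
fasta = Path("./resources/trans/Oncorhynchus_mykiss_GGBN01.1.fa")
if fasta.exists():
    print(count_fasta_records(fasta))
```

For a well-formed FASTA file the two approaches should agree, since `>` normally appears only at the start of header lines.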

<div class="alert alert-block alert-success">
    <i class="fa fa-hand-paper-o" aria-hidden="true"></i>
    <b>Note: </b>  What is <i>Oncorhynchus mykiss</i>?
</div>

> *Oncorhynchus mykiss* is commonly known as the **rainbow trout**. Here is what it looks like:
>
> <img src="../images/rainbowTrout.jpeg"  width="500" >
> 
>> Image Source: https://www.ndow.org/species/rainbow-trout/

### **Step 2:** AWS Batch Setup

Running Nextflow in a serverless manner on AWS Batch requires a set of permissions, roles, and resources. You can set these up manually or deploy them **automatically** with a stack template. The Launch Stack button below opens the CloudFormation "Create stack" page with the template for the required resources already linked.

To skip manual setup and deploy automatically, click the Launch Stack button below. For a walkthrough of the screens shown during automatic deployment, click [here](https://github.com/NIGMS/NIGMS-Sandbox/blob/main/docs/HowToLaunchAWSBatch.md). Deployment should take about 5 minutes, after which the resources will be ready for use.

[![Launch Stack](../images/LaunchStack.jpg)](https://console.aws.amazon.com/cloudformation/home?#/stacks/new?stackName=aws-batch-nigms&templateURL=https://nigms-sandbox.s3.us-east-1.amazonaws.com/cf-templates/AWSBatch_template.yaml)


If you do not yet have the required roles, policies, permissions, or compute environment and would like to set them up **manually**, follow the instructions [here](https://github.com/NIGMS/NIGMS-Sandbox/blob/main/docs/AWS-Batch-Setup.md) before beginning this tutorial.

#### Change the parameters as desired in the `aws` profile inside the `../denovotranscript/nextflow.config` file:
 - Name of your **AWS Batch Job Queue**
 - AWS region 
 - Nextflow work directory
 - Nextflow output directory
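
Before launching the pipeline, it can help to confirm that the job queue named in the config actually exists and is ready. The sketch below is an optional check and not part of the original workflow. It assumes a `boto3` Batch client is passed in, and the helper name `job_queue_is_usable`, the region, and the queue name are placeholders.

```python
def job_queue_is_usable(batch_client, queue_name):
    """Return True if the named AWS Batch job queue exists, is ENABLED, and is VALID."""
    response = batch_client.describe_job_queues(jobQueues=[queue_name])
    queues = response.get("jobQueues", [])
    if not queues:
        # No queue with that name is visible in this account and region.
        return False
    queue = queues[0]
    return queue.get("state") == "ENABLED" and queue.get("status") == "VALID"


# Example usage (requires boto3 and configured AWS credentials; values are placeholders):
# import boto3
# client = boto3.client("batch", region_name="us-east-1")
# print(job_queue_is_usable(client, "<YOUR-JOB-QUEUE-NAME>"))
```

Passing the client in as an argument keeps the function easy to try out with a stand-in object before pointing it at a real account.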

### **Step 3:** Install Nextflow

In [None]:
%%capture
! mamba create  -n nextflow -c bioconda nextflow -y
! mamba install -n nextflow ipykernel -y

<div class="alert alert-block alert-danger">
    <i class="fa fa-exclamation-circle" aria-hidden="true"></i>
    <b>Alert: </b> Remember to change your kernel to <b>conda_nextflow</b> to run nextflow.
</div>

### **Step 4:** Run `denovotranscript`

Now we can run `denovotranscript` with the `annotation_only` run mode, which assumes that the transcriptome has already been generated and runs only the annotation steps on the transcripts.

>This run should take about **5 minutes**

In [None]:
! nextflow run ../denovotranscript/main.nf --input ../denovotranscript/test_samplesheet_aws.csv -profile aws \
--run_mode annotation_only --transcript_fasta s3://nigms-sandbox/nosi-inbremaine-storage/resources/trans/Oncorhynchus_mykiss_GGBN01.1.fa

The output will be arranged in a directory structure in your Amazon S3 bucket. We will download it into our local directory:

In [None]:
! mkdir -p <Your-Output-Directory-annotation-only>
! aws s3 cp --recursive s3://<YOUR-BUCKET-NAME>/<Your-Output-Directory-annotation-only>/ ./<Your-Output-Directory-annotation-only>

In [None]:
! ls -l ./<Your-Output-Directory-annotation-only>

----

Let's take a look at the `RUN_INFO.txt` file to see what the parameters and programs associated with our analysis were.

In [None]:
! cat ./onlyAnnRun/output/RUN_INFO.txt

---

<div class="alert alert-block alert-success">
    <i class="fa fa-hand-paper-o" aria-hidden="true"></i>
    <b>Note: </b>  Containers
</div>

>Note that while the "annotation_only" run mode carries out the searches against Pfam and the BLAST analysis against known proteins, it does not carry out the BUSCO analysis. We can run BUSCO ourselves, but to do so we need to learn a little about running programs from containers.
>
>Container systems (and associated images) are one approach that simplifies the use of a broad set of programs, such as is commonly found in the wide field of computational biology. To put it concisely, most programs are not "stand-alone" but instead rely upon at least a few supporting libraries or auxiliary programs.  Since many analyses require multiple programs, installation of the necessary programs will also require installation of the supporting components, and critically *sometimes the supporting components of one program conflict with those of other programs.* 
>
>Container systems ([Docker](https://www.docker.com/) and [Singularity](https://sylabs.io/singularity/) are the two most well-known examples) address this by installing and encapsulating the program and all of its necessary supporting components in an image. Each program is then executed in the context of its container image, which is activated just long enough to run its program.
>
>Because of the way we ran the TransPi workflow in the previous notebook, our system will already have several container images installed. We can now work directly with these images.

### **Step 5:** Activate the BUSCO container
>We want the Docker image that contains the program (and all necessary infrastructure) for running the BUSCO analysis. In the output of `docker images` (run in the cell after next), the image name is in the first column and the version tag, which we also need, is in the second column. Let's put those together, activate the container, and ask it to run BUSCO and return only the help message.
>
>We will use the `docker run` command, and we will use the following options with it:
>- `-it`, which means run interactively
>- `--rm`, which means clean up after shutting down
>- `--volume /home:/home` This is critical because, by default, a Docker image can only see the file system inside of the container image. We need to have it see our working directory, so we create a volume mapping. For simplicity, we will just map the /home directory outside the container to the same address inside. This will let us access and use all of the files that are below `/home`.
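
Because only `/home` is mapped into the container, any input or output path passed to BUSCO must live under `/home`. The following is a minimal sketch of that check. The helper name `visible_in_container` and the example file name `data.fa` are hypothetical, and symbolic links are not resolved.

```python
import os


def visible_in_container(path, host_root="/home"):
    """Return True if `path` lies under `host_root`, the directory mapped with --volume."""
    absolute = os.path.abspath(path)
    root = os.path.abspath(host_root)
    # commonpath returns the longest shared parent directory of the two paths.
    return os.path.commonpath([absolute, root]) == root


print(visible_in_container("/home/ec2-user/SageMaker/data.fa"))  # under /home
print(visible_in_container("/tmp/data.fa"))                      # outside /home
```

A file outside the mapped directory would have to be copied under `/home`, or an additional `--volume` mapping added, before the container could read it.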

In [None]:
! docker run -it --rm --volume /home:/home quay.io/biocontainers/busco:5.4.3--pyhdfd78af_0 busco --help

Get a listing of the images that are currently loaded:

In [None]:
! docker images

### **Step 6:** Run BUSCO (in the container)
>Now we will fill out a complete command and ask BUSCO to analyze the same trout data that we just used above. Here is the full command needed to make this run go. A lot is going on here:
>
>- `-i /home/ec2-user/SageMaker/Transcriptome-Assembly-Refinement-and-Applications/AWS/resources/trans/Oncorhynchus_mykiss_GGBN01.1.fa`: this points to the location and name of the file to be examined.
>- `-l vertebrata_odb10`: this tells BUSCO to use the vertebrata gene set (genes common to vertebrates) as the target.
>- `-o GGBN01_busco_vertebrata`: this tells BUSCO to use this as the label for the output.
>- `--out_path /home/ec2-user/SageMaker/Transcriptome-Assembly-Refinement-and-Applications/AWS/buscoOutput`: this tells BUSCO where to put the output directory.
>- `-m tran`: this tells BUSCO that the inputs are transcripts (rather than protein or genomic data).
>- `-c $THREADS`: this tells BUSCO how many CPUs to use. `THREADS` is set in a cell below to one fewer than the number of CPUs on the machine.
>
> This should take about **an hour**
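
To see how the options above fit together, the command can be assembled piece by piece in Python. This is only an illustrative sketch: the helper name `busco_command` is hypothetical, and the image tag and paths are copied from the cells in this notebook. It prints the command rather than running it.

```python
import shlex

# Image name and tag as used in the docker cells of this notebook.
BUSCO_IMAGE = "quay.io/biocontainers/busco:5.4.3--pyhdfd78af_0"


def busco_command(fasta, lineage, label, out_path, threads, image=BUSCO_IMAGE):
    """Build the `docker run ... busco ...` argument list for a transcriptome run."""
    return [
        "docker", "run", "-it", "--rm", "--volume", "/home:/home", image,
        "busco",
        "-i", str(fasta),            # input transcriptome
        "-l", lineage,               # lineage dataset to search against
        "-o", label,                 # label for the output directory
        "--out_path", str(out_path),  # where the output directory is created
        "-m", "tran",                # inputs are transcripts
        "-c", str(threads),          # number of CPUs
    ]


base = "/home/ec2-user/SageMaker/Transcriptome-Assembly-Refinement-and-Applications/AWS"
cmd = busco_command(
    fasta=f"{base}/resources/trans/Oncorhynchus_mykiss_GGBN01.1.fa",
    lineage="vertebrata_odb10",
    label="GGBN01_busco_vertebrata",
    out_path=f"{base}/buscoOutput",
    threads=4,  # placeholder value
)
print(shlex.join(cmd))
```

Building the command as a list makes it easy to swap in a different input file or lineage without retyping the Docker options.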



In [None]:
#Run the command below to watch the video
from IPython.display import YouTubeVideo

YouTubeVideo('D95mFnIjRo4', width=800, height=400)

In [None]:
numthreads=!lscpu | grep '^CPU(s)'| awk '{print $2-1}'
THREADS = int(numthreads[0])
! echo $THREADS
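
As an alternative sketch, the same value can be computed in pure Python with `os.cpu_count()`, which avoids parsing `lscpu` output. The helper name `threads_to_use` is hypothetical. It reserves one CPU like the cell above, but never returns fewer than one.

```python
import os


def threads_to_use(reserve=1):
    """Return the number of CPUs to give BUSCO, leaving `reserve` free (minimum of 1)."""
    total = os.cpu_count() or 1  # cpu_count() can return None on unusual systems
    return max(1, total - reserve)


THREADS = threads_to_use()
print(THREADS)
```

The `max(1, ...)` guard matters on single-CPU machines, where subtracting one would otherwise ask BUSCO for zero CPUs.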

In [None]:
! docker run -it --rm --volume /home:/home quay.io/biocontainers/busco:5.4.3--pyhdfd78af_0 busco \
-i /home/ec2-user/SageMaker/Transcriptome-Assembly-Refinement-and-Applications/AWS/resources/trans/Oncorhynchus_mykiss_GGBN01.1.fa \
-l vertebrata_odb10 -o GGBN01_busco_vertebrata \
--out_path /home/ec2-user/SageMaker/Transcriptome-Assembly-Refinement-and-Applications/AWS/buscoOutput -m tran -c $THREADS

Look at the output: 

In [None]:
! ls ./buscoOutput/GGBN01_busco_vertebrata

<div class="alert alert-block alert-warning">
    <i class="fa fa-pencil" aria-hidden="true"></i>
    <b>Checkpoint 1:</b> Interpret The Results 
</div>

> Consider the following result files:
> - The BUSCO result `./buscoOutput/GGBN01_busco_vertebrata/short_summary.specific.vertebrata_odb10.GGBN01_busco_vertebrata.txt`
> - The GO stats result `./onlyAnnRun/output/stats/Oncorhynchus_mykiss_GGBN01.sum_GO.txt`
> - The TransDecoder stats result: `./onlyAnnRun/output/stats/Oncorhynchus_mykiss_GGBN01.sum_transdecoder.txt`
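
The BUSCO short summary file contains a one-line score string of the form `C:..%[S:..%,D:..%],F:..%,M:..%,n:..`. The sketch below pulls those numbers into a dictionary so they are easier to compare across runs. The helper name `parse_busco_summary` is hypothetical, and the percentages in the example string are invented for illustration, not real results.

```python
import re

# C = complete, S = single-copy, D = duplicated, F = fragmented, M = missing, n = total searched
SUMMARY_PATTERN = re.compile(
    r"C:(?P<complete>[\d.]+)%\[S:(?P<single>[\d.]+)%,D:(?P<duplicated>[\d.]+)%\],"
    r"F:(?P<fragmented>[\d.]+)%,M:(?P<missing>[\d.]+)%,n:(?P<total>\d+)"
)


def parse_busco_summary(text):
    """Extract BUSCO percentages from summary text; return None if no score line is found."""
    match = SUMMARY_PATTERN.search(text)
    if match is None:
        return None
    scores = {key: float(value) for key, value in match.groupdict().items() if key != "total"}
    scores["total"] = int(match.group("total"))
    return scores


# Invented numbers, for illustration only.
example = "\tC:12.5%[S:10.0%,D:2.5%],F:7.5%,M:80.0%,n:3354"
print(parse_busco_summary(example))
```

To apply it to a real run, read the short summary file listed above with `open(...).read()` and pass the text to `parse_busco_summary`.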

*The green cards below are interactive. Spend some time to consider the question and click on the card to check your answer.*

In [None]:
! pip install jupytercards

In [None]:
from jupytercards import display_flashcards
display_flashcards('../quiz-material/03-cp1-1.json')

> Now let's take a look at where the data came from. Consider the abstract of the [Al-Tobasei et al.](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4764514/) paper published from this data.
>
>>*The ENCODE project revealed that ~70% of the human genome is transcribed. While only 1–2% of the RNAs encode for proteins, the rest are non-coding RNAs. Long non-coding RNAs (lncRNAs) form a diverse class of non-coding RNAs that are longer than 200nt. Emerging evidence indicates that lncRNAs play critical roles in various cellular processes including regulation of gene expression. LncRNAs show low levels of gene expression and sequence conservation, which make their computational identification in genomes difficult. In this study, more than two billion Illumina sequence reads were mapped to the genome reference using the TopHat and Cufflinks software. Transcripts shorter than 200nt, with more than 83–100 amino acids ORF, or with significant homologies to the NCBI nr-protein database were removed. In addition, a computational pipeline was used to filter the remaining transcripts based on a protein-coding-score test. Depending on the filtering stringency conditions, between 31,195 and 54,503 lncRNAs were identified, with only 421 matching known lncRNAs in other species. A digital gene expression atlas revealed 2,935 tissue-specific and 3,269 ubiquitously-expressed lncRNAs. This study annotates the lncRNA rainbow trout genome and provides a valuable resource for functional genomics research in salmonids.*

In [None]:
display_flashcards('../quiz-material/03-cp1-2.json')

>**The key takeaway is to always be mindful of the data you are using before performing analysis on it.**

Now let's try with one of the other transcriptomes that we downloaded from the NCBI Transcriptome Shotgun Assembly archive.
> This should take about **an hour**

In [None]:
! docker run -it --rm --volume /home:/home quay.io/biocontainers/busco:5.4.3--pyhdfd78af_0 busco \
    -i /home/ec2-user/SageMaker/Transcriptome-Assembly-Refinement-and-Applications/AWS/resources/trans/Pseudacris_regilla_GAEI01.1.fa \
    -l vertebrata_odb10 -o GAEI01_busco_vertebrata \
    --out_path /home/ec2-user/SageMaker/Transcriptome-Assembly-Refinement-and-Applications/AWS/buscoOutput -m tran -c $THREADS

<div class="alert alert-block alert-warning">
    <i class="fa fa-pencil" aria-hidden="true"></i>
    <b>Checkpoint 2:</b> Your turn to run a BUSCO analysis 
</div>

>For this checkpoint, you will run another BUSCO analysis, but this time you will write your own execution command. For the transcriptome, you have two options:
>1. Within the directory that we have been using for the previous two BUSCO runs, `./resources/trans`, there is one more assembled transcriptome named `Microcaecilia_dermatophaga_GFOE01.1.fa`.
>2. Go onto the NCBI Transcriptome Shotgun Assembly archive, find your own complete, assembled transcriptome, and use that.
>    - If you download the file onto your local computer, there is an upload button (up arrow) in the top left of the Jupyter interface where you can upload the file.
>    - If the file you have uploaded is zipped, you will need to unzip it using the following commands: (make sure that the file name after the `>` has the `.fa` extension.)
>```python
>!gzip -d -c ./PATH/TO/FILE.fsa_nt.gz > ./PATH/TO/FILE.1.fa
>!rm ./PATH/TO/FILE.fsa_nt.gz
>```
> Additionally, consider trying a different lineage (`-l` selection). EZlab, the creators of BUSCO, have produced a large selection of lineages to choose from. Each one has a different set of genes that BUSCO looks for. If you decide to try a different lineage, choose one that your organism actually belongs to (e.g., don't choose the `primates_odb10` lineage for a bullfrog transcriptome).
>```python
># This will be a complete list of the available datasets
>!docker run -it --rm --volume /home:/home quay.io/biocontainers/busco:5.4.3--pyhdfd78af_0 busco --list-datasets
>```
> Feel free to reference the commands for the previous BUSCO runs and the help command we ran earlier if you are stuck. Additionally, feel free to check out the [BUSCO user guide](https://busco.ezlab.org/busco_userguide.html).
>
>After the run has completed, consider the following:
>1. How did BUSCO perform on this transcriptome? Does the transcriptome appear to be well assembled based on the provided lineage? If the results are not good, consider the possible reasons why. Is it more likely that the chosen transcriptome was of poor quality, that the lineage was poorly chosen, or something else entirely?
>2. What could be a logical biological reason the output says that there are duplicate copies of the same gene?
>3. What could be a possible reason for fragmented copies?
>4. Why is it that broader lineages such as metazoa have far fewer genes (954) that BUSCO looks for compared to more specific lineages such as mammalia which has far more genes (9226) that BUSCO looks for?
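
If you chose option 2 and your downloaded file is gzip-compressed, the decompression step can also be done in Python instead of with the `gzip` command. This is a minimal sketch using only the standard library. The helper name `gunzip_to` is hypothetical, and the paths in the usage comment are the same placeholders used above.

```python
import gzip
import shutil


def gunzip_to(src_gz, dest_fa):
    """Decompress a gzip file to `dest_fa`, streaming so large files are not loaded into memory."""
    with gzip.open(src_gz, "rb") as compressed, open(dest_fa, "wb") as plain:
        shutil.copyfileobj(compressed, plain)
    return dest_fa


# Example usage (placeholder paths; give the output the .fa extension):
# gunzip_to("./PATH/TO/FILE.fsa_nt.gz", "./PATH/TO/FILE.1.fa")
```

Unlike the shell version above, this sketch leaves the original `.gz` file in place. Delete it yourself once you have confirmed that the decompressed file looks right.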

In [None]:
# Put your BUSCO command here

## Conclusion

This notebook provided hands-on experience in transcriptome annotation using the `denovotranscript` pipeline in annotation-only mode, with AWS Batch for serverless execution and Docker containers for BUSCO analysis. Through a guided workflow, users learned to set up AWS Batch, run `denovotranscript` to annotate a rainbow trout transcriptome, assess transcriptome completeness with BUSCO, and critically interpret the results from the BUSCO, GO, and TransDecoder analyses. The notebook also emphasized the importance of understanding data provenance. It ended with an independent BUSCO analysis exercise that challenged users to apply these skills to different transcriptomes and lineages and to evaluate the outcomes critically.

## Clean Up

Remember to proceed to the next notebook [`Submodule_04_gls_assembly.ipynb`](Submodule_04_gls_assembly.ipynb) or shut down your instance if you are finished.