# MDIBL Transcriptome Assembly Learning Module
# Bonus notebook: Using TransPi on a new dataset

## Overview
In this notebook, we explore how to run this module with a new dataset. The submodules provide a solid framework for running a rigorous and scalable transcriptome assembly, but some adjustments are needed to run them with your own data. We walk through that process here so that you can take these notebooks to your research group and use them for your own analysis.

The data we are using here come from the SRA (Sequence Read Archive). In this example, we are using data from an experiment that compared RNA sequences in honeybees with and without viral infections. The BioProject ID is [PRJNA274674](https://www.ncbi.nlm.nih.gov/bioproject/PRJNA274674). This experiment includes 6 RNA-seq samples and 2 methylation-seq samples. We are only considering the RNA-seq data here. Additionally, we have subsampled the reads to about 2 million in total across all of the samples. In a real analysis this would not be a good idea, but to keep costs and runtimes low we use the down-sampled files in this demonstration. If you want to explore the full dataset, we recommend pulling the fastq files using the [STRIDES tutorial on SRA downloads](https://github.com/STRIDES/NIHCloudLabGCP/blob/main/notebooks/SRADownload/SRA-Download.ipynb). As with the original example in this module, we have concatenated all 6 files into one set of combined fastq files called joined_R{1,2}.fastq.gz. We have stored the subsampled fastq files in this module's cloud storage bucket.
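If you prepare your own combined files, the concatenation step can be sketched in Python as shown here. This is a minimal illustration, not the procedure used to build the files in the bucket: the function name `concatenate_gzip` and the file names in the usage note are hypothetical. It relies on the fact that gzip files appended end to end still form a valid gzip file.

```python
import shutil
from pathlib import Path


def concatenate_gzip(inputs, output):
    """Append gzip files byte-for-byte into `output`, in the order given.

    Concatenated gzip members form a valid gzip file, so nothing is
    decompressed or recompressed here.
    """
    with open(output, "wb") as merged:
        for path in inputs:
            with open(path, "rb") as part:
                shutil.copyfileobj(part, merged)
    return Path(output)
```

For example, `concatenate_gzip(sorted(Path("fastq").glob("*_R1.fastq.gz")), "joined_R1.fastq.gz")` would build the forward-read file from a hypothetical `fastq` directory. The R1 and R2 files need to be concatenated in the same sample order so that mates stay in sync.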

## Learning Objectives:

1. **Adapting a Nextflow workflow:**  The notebook demonstrates how to modify a Nextflow pipeline's configuration to point to a new dataset, highlighting the workflow's reusability and flexibility.  This involves understanding how to change input parameters within a configuration file.

2. **Data preparation and management:**  Users learn how to copy and manage sequencing data with `gsutil`. The data originate from the SRA (Sequence Read Archive), and a pre-downloaded, subsampled dataset is provided for convenience.  This includes understanding file organization and paths.

3. **Software installation and environment setup:** The notebook guides users through installing necessary software (Java, Mamba, sra-tools, perl modules, Nextflow) and setting up the computational environment. This emphasizes reproducibility and dependency management.

4. **Running a transcriptome assembly:**  The notebook shows how to execute the TransPi Nextflow pipeline with the new dataset, demonstrating the complete process from data input to assembly output.

## Prerequisites

* **Java:** The notebook installs the default JDK.
* **Miniforge:** Used for package management.
* **sra-tools, perl-dbd-sqlite, perl-dbi:** sra-tools for working with SRA data, plus the Perl database modules (DBI and its SQLite driver).
* **Nextflow:** A workflow management system.
* **Docker:** Either Docker pre-installed on the VM, or permissions to install and run Docker containers.
* **`gsutil`:** The Google Cloud Storage command-line tool.

## Get Started

Before we start any analysis, let's set up the environment just like we did in Submodule_01 and Submodule_02 where we move to the correct directory and install software.

In [None]:
%cd /home/jupyter

In [None]:
!pwd

In [None]:
#update java
! sudo apt update
! sudo apt-get install default-jdk -y
! java -version

In [None]:
# install Miniforge
! curl -L -O https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh
! bash Miniforge3-$(uname)-$(uname -m).sh -b -p $HOME/miniforge

In [None]:
# add Miniforge to your path
import os
os.environ["PATH"] += os.pathsep + os.environ["HOME"]+"/miniforge/bin"
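A quick way to confirm that the PATH change took effect is to look the tools up from Python. The sketch below is only an illustration: `add_to_path` and `missing_tools` are hypothetical helper names, and the sketch assumes Miniforge was installed to `$HOME/miniforge` as in the cell above. Unlike a plain `+=`, the helper does not add a duplicate entry if the cell is run twice.

```python
import os
import shutil


def add_to_path(directory):
    """Append `directory` to PATH for this process unless it is already listed."""
    entries = [e for e in os.environ.get("PATH", "").split(os.pathsep) if e]
    if directory not in entries:
        entries.append(directory)
        os.environ["PATH"] = os.pathsep.join(entries)


def missing_tools(tools):
    """Return the names from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


# Assumed install location, matching the -p $HOME/miniforge option used above.
miniforge_bin = os.path.join(os.path.expanduser("~"), "miniforge", "bin")
add_to_path(miniforge_bin)
print("Not found on PATH:", missing_tools(["conda", "mamba"]))
```

An empty list means both tools were found. The same check can be repeated later for `fasterq-dump` or other tools once they are installed.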

Use Miniforge's `mamba` to install `sra-tools`, `perl-dbd-sqlite`, and `perl-dbi` from the `bioconda` channel.

<details>
  <summary>Click for help</summary>

```
mamba install -c bioconda sra-tools perl-dbd-sqlite perl-dbi -y
```
    
</details>

In [None]:
! <YOUR COMMAND HERE>

In [None]:
#install Nextflow
! curl https://get.nextflow.io | bash
! chmod +x nextflow
! ./nextflow self-update

In [None]:
# Copy the software from gs://nigms-sandbox/nosi-inbremaine-storage/TransPi
! <YOUR COMMAND HERE>

<details>
  <summary>Click for help</summary>

```
gsutil -m cp -r gs://nigms-sandbox/nosi-inbremaine-storage/TransPi ./
```
    
</details>

In [None]:
# Copy the data from gs://nigms-sandbox/nosi-inbremaine-storage/resources
! <YOUR COMMAND HERE>

<details>
  <summary>Click for help</summary>

```
gsutil -m cp -r gs://nigms-sandbox/nosi-inbremaine-storage/resources ./
```
  
</details>

In [None]:
#Make the program executable
!chmod -R +x ./TransPi/bin

Let's have a look at what we've downloaded to make sure it's there.

In [None]:
! ls ./resources/seq2

You should see the joined fastq files alongside the others that we used in the previous submodules. Now let's adjust the workflow to run on them.
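When you bring your own data, it can help to confirm that every sample has both mates before starting a long run. The following is a minimal sketch, assuming files are named `<sample>_R1.fastq.gz` and `<sample>_R2.fastq.gz` as the joined files are. The helper names `find_read_pairs` and `incomplete_pairs` are hypothetical and are not part of TransPi.

```python
import re
from pathlib import Path

# Assumed naming scheme: <sample>_R1.fastq.gz / <sample>_R2.fastq.gz
MATE_PATTERN = re.compile(r"(.+)_R([12])\.fastq\.gz")


def find_read_pairs(directory):
    """Map each sample prefix to the mate files found for it."""
    pairs = {}
    for path in sorted(Path(directory).glob("*.fastq.gz")):
        match = MATE_PATTERN.fullmatch(path.name)
        if match:
            prefix, mate = match.groups()
            pairs.setdefault(prefix, {})["R" + mate] = path
    return pairs


def incomplete_pairs(pairs):
    """Return the sample prefixes that lack an R1 or an R2 file."""
    return sorted(p for p, mates in pairs.items() if set(mates) != {"R1", "R2"})


reads_dir = Path("./resources/seq2")
if reads_dir.is_dir():
    pairs = find_read_pairs(reads_dir)
    print("Samples found:", sorted(pairs))
    print("Missing a mate:", incomplete_pairs(pairs))
```

If the second list is not empty, fix the file names or re-copy the missing files before launching the pipeline.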

One of the great benefits of using a workflow manager like Nextflow is that it allows easy swapping of input samples without drastic changes to the code. In the true spirit of reproducible workflows, the only change necessary in order to run the joined samples is to adjust the `reads` line in the `nextflow.config` file `params` section to point to the new reads location. In the line below, write the updated reads path that you would add to the config file. 

In [None]:
# <Your path here>

<details>
  <summary>Click for help</summary>
    


```
// Directory for reads
reads="/home/jupyter/resources/seq2/joined*R[1,2].fastq.gz"
```
    
    
</details>
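If you would rather script the change than edit the file by hand, the substitution can be sketched in Python as follows. This is an illustration under stated assumptions: the helper `set_reads_path` is hypothetical, the `example_config` text is a made-up stand-in for the real `params` section, and the sketch assumes the config contains a single uncommented `reads="..."` line. It works on text only, so reading and writing the actual config file is left to you.

```python
import re

# Matches an uncommented line such as:    reads="/some/path/*_R[1,2].fastq.gz"
READS_LINE = re.compile(r'^([ \t]*reads[ \t]*=[ \t]*)(["\']).*?\2', flags=re.MULTILINE)


def set_reads_path(config_text, new_reads):
    """Return `config_text` with the first `reads=` value replaced by `new_reads`."""
    updated, count = READS_LINE.subn(
        lambda m: f'{m.group(1)}"{new_reads}"', config_text, count=1
    )
    if count == 0:
        raise ValueError("no uncommented 'reads=' line found")
    return updated


# Made-up stand-in for the relevant part of nextflow.config.
example_config = '''params {
    // Directory for reads
    reads="/path/to/old/reads/*_R[1,2].fastq.gz"
}
'''

print(set_reads_path(example_config, "/home/jupyter/resources/seq2/joined*R[1,2].fastq.gz"))
```

A function callback is used for the replacement so that characters in the new path are inserted literally rather than being interpreted as regex escapes.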


After this change, you should be able to run the same Nextflow command as you did in Submodule_02 and everything will progress automatically.

In [None]:
! NXF_VER=22.10.1 ./nextflow run \
    ./TransPi/TransPi.nf \
    -profile docker \
    --k 17,25,43 \
    --maxReadLen 50 \
    --all 

With the subsampled reads, the assembly should complete in about 2 hours using an n1-highmem-16 machine.
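For your own data, the k-mer list and read length will usually differ from the values above. The sketch below assembles the same command from Python so those values can be checked first. It is an illustration only: `build_transpi_command` is a hypothetical helper, and the check assumes that `--maxReadLen` is the length of your longest read, so that a k-mer larger than it cannot occur in any read.

```python
import os
import shlex


def build_transpi_command(kmers, max_read_len,
                          nextflow="./nextflow",
                          pipeline="./TransPi/TransPi.nf"):
    """Build the argument list for the full TransPi run used in this notebook."""
    too_long = [k for k in kmers if k > max_read_len]
    if too_long:
        raise ValueError(f"k-mers longer than the reads: {too_long}")
    return [
        nextflow, "run", pipeline,
        "-profile", "docker",
        "--k", ",".join(str(k) for k in kmers),
        "--maxReadLen", str(max_read_len),
        "--all",
    ]


command = build_transpi_command([17, 25, 43], 50)
# Pin the Nextflow version the same way the shell cell does.
env = dict(os.environ, NXF_VER="22.10.1")
print(shlex.join(command))
```

To launch the run from Python instead of a `!` cell, the list can be passed to `subprocess.run(command, env=env, check=True)`.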

## Conclusion

This notebook demonstrated the adaptability of the MDIBL Transcriptome Assembly Learning Module's TransPi pipeline by applying it to a new RNA-Seq dataset from a honeybee viral infection study (PRJNA274674).  While utilizing a subsampled dataset for demonstration purposes, the process highlighted the ease of integrating new data into the existing Nextflow workflow.  By simply modifying the `nextflow.config` file to specify the new reads' location, the pipeline executed seamlessly, showcasing its robustness and reproducibility.  This adaptability makes the module a valuable resource for researchers seeking to perform scalable and rigorous transcriptome assemblies on their own datasets, facilitating efficient and reproducible analyses within their research groups.  The successful execution underscores the power of workflow management systems like Nextflow for streamlining bioinformatics analyses.

## Clean Up

Shut down your instance if you are finished.