# MDIBL Transcriptome Assembly Learning Module
# Notebook 1: Setup

## Overview

This notebook is designed to configure your virtual machine (VM) to have the proper tools and data in place to run the transcriptome assembly training module.

## Learning Objectives

1. **Understand and utilize shell commands within Jupyter Notebooks:**  The notebook explicitly teaches the difference between `!` and `%` prefixes for executing shell commands, and how to navigate directories using `cd` and `pwd`.

2. **Set up the necessary software:** Students will install and configure essential tools including:
    * Java (a prerequisite for Nextflow).
    * Miniforge (a package manager for bioinformatics tools).
    *  `sra-tools`, `perl-dbd-sqlite`, and `perl-dbi` (specific bioinformatics packages).
    * Nextflow (a workflow management system).
    *  `aws s3` (the AWS CLI command for interacting with Amazon S3).

3. **Download and organize necessary data:** Students will download the TransPi transcriptome assembly software and its associated resources (databases, scripts, configuration files) from an Amazon S3 bucket. This includes understanding the directory structure and file organization.

4. **Manage file permissions:** Students will use the `chmod` command to set executable permissions for the necessary files and directories within the TransPi software.

5. **Navigate file paths:** The notebook provides examples and explanations for using relative file paths (e.g., `./`, `../`) within shell commands.

## Prerequisites

* **Operating System:** A Linux-based system is assumed (commands like `apt` and `uname` are used). The distribution is not specified, but the use of `apt` implies a Debian-based system.
* **Shell Access:**  The ability to execute shell commands from within the Jupyter Notebook environment (using `!` and `%`).
* **Java Development Kit (JDK):**  Required for Nextflow.
* **Miniforge:** A package manager for installing bioinformatics tools.
* **`aws s3`:** The AWS command-line tool for Amazon S3. This is crucial for downloading data from the S3 bucket.

## Get Started

<div class="alert alert-block alert-success">
    <i class="fa fa-hand-paper-o" aria-hidden="true"></i>
    <b>Note: </b>  ! and % in code cells
</div>

>You may notice that many of the lines in the code cells begin with one of these symbols: `!` or `%`. Both allow you (the user) to run shell commands in the code cells of a Jupyter notebook. They do, however, operate slightly differently:  
>- The `!` runs the command in a temporary subshell that exits as soon as the command finishes, so changes such as `cd` do not carry over to later commands.
>- The `%` runs a notebook "magic" command (such as `%cd`) inside the notebook's own process, so it has a lasting effect.

<div class="alert alert-block alert-warning">
    <i class="fa fa-pencil" aria-hidden="true"></i>
    <b>Example:</b> 
</div>

>Take this example code snippet: *Imagine that you are currently in the directory named* `original-directory`.
>```python
>!cd different-directory/
>```
>After this line executes, you will still be in the directory named `original-directory`.
>
>**Vs.**
>```python
>%cd different-directory/
>```
>After this line executes, you will now be in the directory `different-directory`.
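
The difference comes from where the command runs: `!` starts a separate shell process, so a `cd` there cannot move the notebook itself, while `%cd` changes the directory of the notebook's own process. The following is a minimal Python sketch of that behavior, assuming a POSIX `sh` is available. It uses a temporary directory as a stand-in for `different-directory`, and the function name `demo_cd` is purely illustrative.

```python
import os
import subprocess
import tempfile


def demo_cd():
    """Compare a `cd` in a child shell with a directory change in this process."""
    start = os.getcwd()
    with tempfile.TemporaryDirectory() as target:
        # Like `!cd target`: a child shell changes directory, then exits.
        subprocess.run(["sh", "-c", 'cd "$1"', "sh", target], check=True)
        after_shell = os.getcwd()  # this process has not moved

        # Like `%cd target`: the current process itself changes directory.
        os.chdir(target)
        moved = os.path.samefile(os.getcwd(), target)

        # Go back so the temporary directory can be cleaned up.
        os.chdir(start)
    return start, after_shell, moved


start, after_shell, moved = demo_cd()
print("unchanged after child shell:", after_shell == start)
print("changed after os.chdir:", moved)
```

Both lines should report `True`: the child shell's `cd` leaves the parent where it was, and only the in-process change takes effect.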

## Time to begin!

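**Step 1:** To start, make sure that you are in the right starting directory, using `cd` to move there and `pwd` to confirm.
> `pwd` prints the current working directory. The expected path depends on your environment: in the AWS SageMaker run recorded below, `/home/jupyter` does not exist and the working directory is the notebook's own folder under `/home/ec2-user/SageMaker`.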

In [1]:
%cd /home/

[Errno 2] No such file or directory: '/home/jupyter'
/home/ec2-user/SageMaker/Transcriptome-Assembly-Refinement-and-Applications/AWS


In [2]:
! pwd

/home/ec2-user/SageMaker/Transcriptome-Assembly-Refinement-and-Applications/AWS


**Step 2:** Now, update the system and install Java (which is needed for Nextflow to run).

In [None]:
! sudo apt update
! sudo apt-get install default-jdk -y
! java -version

**Step 3:** Install Miniforge (a package manager). It is used below to install the tools this module relies on, including the Perl libraries needed to work with the TransPi databases.

In [None]:
! curl -L -O https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh
! bash Miniforge3-$(uname)-$(uname -m).sh -b -p $HOME/miniforge
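
The `$(uname)` and `$(uname -m)` substitutions fill in the operating system and CPU architecture so that the matching installer is downloaded. As a rough illustration, the same file name can be composed in Python. The function name `miniforge_installer_url` is hypothetical, and the sketch assumes that `platform.system()` and `platform.machine()` report the same values as `uname` and `uname -m`, which is normally the case on Linux.

```python
import platform

BASE_URL = "https://github.com/conda-forge/miniforge/releases/latest/download"


def miniforge_installer_url(system=None, machine=None):
    """Build the installer URL the way the shell command above does."""
    system = system or platform.system()     # e.g. "Linux", as from `uname`
    machine = machine or platform.machine()  # e.g. "x86_64", as from `uname -m`
    return f"{BASE_URL}/Miniforge3-{system}-{machine}.sh"


# URL for the machine this code runs on:
print(miniforge_installer_url())

# URL for an explicitly chosen platform:
print(miniforge_installer_url("Linux", "x86_64"))
```

Printing the URL before downloading is a quick way to confirm that the installer matches the architecture of your instance.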

Next, add it to the path.

In [None]:
import os
os.environ["PATH"] += os.pathsep + os.environ["HOME"]+"/miniforge/bin"
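
Changing `os.environ` works here because every `!` command is started as a child process that inherits the notebook's environment. One caveat is that re-running the cell above appends the same directory again. A possible variant that skips duplicates is sketched below. The helper name `add_to_path` and the example paths are illustrative, not part of the module.

```python
import os


def add_to_path(directory, env=None):
    """Append `directory` to PATH unless it is already listed."""
    env = os.environ if env is None else env
    entries = [p for p in env.get("PATH", "").split(os.pathsep) if p]
    if directory not in entries:
        entries.append(directory)
    env["PATH"] = os.pathsep.join(entries)
    return env["PATH"]


# Demonstration on a plain dictionary so the real environment is untouched.
fake_env = {"PATH": "/usr/bin"}
add_to_path("/home/user/miniforge/bin", fake_env)
add_to_path("/home/user/miniforge/bin", fake_env)  # second call changes nothing
print(fake_env["PATH"])
```

To apply it to the real environment, call `add_to_path(os.path.join(os.environ["HOME"], "miniforge", "bin"))` without the second argument.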

Next, using Miniforge and bioconda, install the tools that will be used in this tutorial.

In [None]:
! mamba install -c bioconda sra-tools perl-dbd-sqlite perl-dbi -y
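
After the installation finishes, it can be useful to confirm that the new commands are actually visible on the `PATH`. The sketch below does this with the standard library. The helper name `find_missing` is hypothetical, and the command names are assumptions: `mamba` comes with Miniforge, while `prefetch` and `fasterq-dump` are two of the programs expected from `sra-tools`.

```python
import shutil


def find_missing(commands):
    """Return the commands that cannot be found on the current PATH."""
    return [cmd for cmd in commands if shutil.which(cmd) is None]


# Assumed command names; adjust the list to the tools you need.
missing = find_missing(["mamba", "prefetch", "fasterq-dump"])
print("Missing:", missing if missing else "none")
```

If anything is reported as missing, re-run the cell that adds Miniforge to the `PATH` before repeating the installation.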

**Step 4:** Now, install Nextflow, make it executable, and update it.

In [None]:
! curl https://get.nextflow.io | bash
! chmod +x nextflow
! ./nextflow self-update

**Step 5:** Time to get TransPi.
>The original version of TransPi is available on GitHub. However, we have made a variety of alterations to the program and will be using the updated version in the following modules.

In [6]:
! aws s3 cp --recursive s3://nigms-sandbox/nosi-inbremaine-storage/TransPi ./TransPi

download: s3://nigms-sandbox/nosi-inbremaine-storage/TransPi/Dockerfile to TransPi/Dockerfile
download: s3://nigms-sandbox/nosi-inbremaine-storage/TransPi/bin/busco_comparison.R to TransPi/bin/busco_comparison.R
download: s3://nigms-sandbox/nosi-inbremaine-storage/TransPi/bin/TransPi_Report_Ind.Rmd to TransPi/bin/TransPi_Report_Ind.Rmd
download: s3://nigms-sandbox/nosi-inbremaine-storage/TransPi/bin/custom_uniprot_hits.R to TransPi/bin/custom_uniprot_hits.R
download: s3://nigms-sandbox/nosi-inbremaine-storage/TransPi/LICENSE to TransPi/LICENSE
download: s3://nigms-sandbox/nosi-inbremaine-storage/TransPi/bin/GO_plots.R to TransPi/bin/GO_plots.R
download: s3://nigms-sandbox/nosi-inbremaine-storage/TransPi/bin/SOS_busco.py to TransPi/bin/SOS_busco.py
download: s3://nigms-sandbox/nosi-inbremaine-storage/TransPi/bin/get_busco_val.sh to TransPi/bin/get_busco_val.sh
download: s3://nigms-sandbox/nosi-inbremaine-storage/TransPi/docs/Makefile to TransPi/docs/Makefile
download: s3://nigms-sandbox

<div class="alert alert-block alert-success">
    <i class="fa fa-hand-paper-o" aria-hidden="true"></i>
    <b>Note: </b>  aws s3
</div>

>`aws s3` is a tool that allows you to interact with Amazon S3 through the command line.
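
With `--recursive`, every object under the source prefix is copied, and the part of its key after the prefix becomes its path under the destination, as the `download:` lines in the output show. The sketch below reproduces only that mapping and does not contact S3. The function name `local_path_for` is hypothetical.

```python
import posixpath


def local_path_for(key, prefix, dest):
    """Map an S3 object key under `prefix` to its path under `dest`."""
    prefix = prefix.rstrip("/") + "/"
    if not key.startswith(prefix):
        raise ValueError(f"{key!r} is not under {prefix!r}")
    relative = key[len(prefix):]  # part of the key after the prefix
    return posixpath.normpath(posixpath.join(dest, relative))


print(local_path_for(
    "nosi-inbremaine-storage/TransPi/bin/GO_plots.R",
    "nosi-inbremaine-storage/TransPi",
    "./TransPi",
))
# TransPi/bin/GO_plots.R
```

The printed path matches the destination shown for `GO_plots.R` in the recorded output above.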

**Step 6:** Now copy over all of the additional resources needed for TransPi to run. This may take a few minutes.
> Within the resources directory, 5 sub-directories are needed: `/bin`, `/conf`, `/DBs`, `/seq2`, and `/trans`. The first three are described here.
> - In the **`/bin`** directory, there is a set of programs that are called by various processes within the TransPi workflow. One example, `GO_plots.R`, is an R script that plots the gene ontology of the assembled transcriptome.
> - In the **`/conf`** directory, there are 3 files, but we will only be using `uni_tax.txt`, which contains the UniProt taxonomy codes.
> - In the **`/DBs`** directory, there are 3 sub-directories, each containing a database that TransPi needs:
>    - **`/hmmerdb`** contains the `Pfam_A.hmm` file, which is a database of protein families. It is used to annotate the assembled transcriptome, with matches scored by probabilities from Hidden Markov Models.
>    - **`/sqlite_db`** contains the necessary files and database to run DIAMOND, a program that swiftly aligns the assembled transcriptome to a database of known proteins.
>    - **`/uniprot_db`** contains a different database to run DIAMOND and to run TransDecoder, a program that identifies coding regions.

In [7]:
! aws s3 cp --recursive s3://nigms-sandbox/nosi-inbremaine-storage/resources ./resources

download: s3://nigms-sandbox/nosi-inbremaine-storage/resources/DBs/hmmerdb/.lastrun.txt to resources/DBs/hmmerdb/.lastrun.txt
download: s3://nigms-sandbox/nosi-inbremaine-storage/resources/DBs/sqlite_db/.lastrun.txt to resources/DBs/sqlite_db/.lastrun.txt
download: s3://nigms-sandbox/nosi-inbremaine-storage/resources/DBs/sqlite_db/Trinotate_build_scripts/.gitmodules to resources/DBs/sqlite_db/Trinotate_build_scripts/.gitmodules
download: s3://nigms-sandbox/nosi-inbremaine-storage/resources/DBs/sqlite_db/Trinotate_build_scripts/LICENSE.txt to resources/DBs/sqlite_db/Trinotate_build_scripts/LICENSE.txt
download: s3://nigms-sandbox/nosi-inbremaine-storage/resources/DBs/sqlite_db/Trinotate_build_scripts/Changelog.txt to resources/DBs/sqlite_db/Trinotate_build_scripts/Changelog.txt
download: s3://nigms-sandbox/nosi-inbremaine-storage/resources/DBs/sqlite_db/Trinotate_build_scripts/PerlLib/DelimParser.pm to resources/DBs/sqlite_db/Trinotate_build_scripts/PerlLib/DelimParser.pm
download: s3:/
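
Once the copy has finished, a quick check that the sub-directories named in Step 6 are present can catch an interrupted download early. This is a minimal sketch: the helper name `missing_dirs` is hypothetical, and the list only contains the directory names mentioned in the step above.

```python
from pathlib import Path

# Sub-directories named in Step 6, relative to the resources folder.
EXPECTED_DIRS = [
    "bin", "conf", "seq2", "trans",
    "DBs/hmmerdb", "DBs/sqlite_db", "DBs/uniprot_db",
]


def missing_dirs(resources_root, expected=EXPECTED_DIRS):
    """Return the expected sub-directories that do not exist under the root."""
    root = Path(resources_root)
    return [name for name in expected if not (root / name).is_dir()]


# An empty list means every expected directory was found.
print(missing_dirs("./resources"))
```

If the list is not empty, re-running the `aws s3 cp --recursive` cell is a reasonable first step.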

<div class="alert alert-block alert-success">
    <i class="fa fa-hand-paper-o" aria-hidden="true"></i>
    <b>Note: </b>  File Paths
</div>

>Consider the following file structure, and imagine that you are currently in the directory `toDo`: 
>
> <img src="../images/fileDemo.png" width="1200">
>
>- If you were to type `!ls ./`, it would return the contents of your current directory, so it would return `nextWeek`, `Today.txt`, `Tomorrow.txt`, `Yesterday.txt`.
>     - The `./` path points to your current directory.
>
>- If you were to type `!ls ../`, it would return the contents of the directory 1 layer up from your current directory, so it would return `coolPicturesOcean`, `shoppingList`, `toDo`.
>    - The `../` path points to the directory one layer up from the current directory.
>    - They can also be stacked so `../../` will take you two layers up.
>
>- If you were to type `!ls ./nextWeek/` it would return the contents of the `nextWeek` directory which is one layer down from the current directory, so it would return `manyThings.txt`.
>
>**This means that in the code cell above, the `resources` directory is copied from the Amazon S3 bucket into `./resources`, a directory inside your current directory.**
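
The path rules in the note above can be checked without touching the disk by letting Python resolve the paths as plain strings. In this sketch the helper name `resolve` and the location `/home/user/myFiles` are made up for illustration. Only the directory names `toDo` and `nextWeek` come from the example.

```python
import posixpath


def resolve(current_dir, relative_path):
    """Resolve a relative path against a current directory (no disk access)."""
    return posixpath.normpath(posixpath.join(current_dir, relative_path))


# Hypothetical location of the `toDo` directory from the figure.
here = "/home/user/myFiles/toDo"

print(resolve(here, "./"))           # /home/user/myFiles/toDo
print(resolve(here, "../"))          # /home/user/myFiles
print(resolve(here, "../../"))       # /home/user
print(resolve(here, "./nextWeek/"))  # /home/user/myFiles/toDo/nextWeek
```

Each `../` removes one directory from the end of the current path, which is why stacking them moves further up.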

**Step 7:** Make the contents of `./TransPi/bin` executable.

In [8]:
! chmod -R +x ./TransPi/bin

<div class="alert alert-block alert-success">
    <i class="fa fa-hand-paper-o" aria-hidden="true"></i>
    <b>Note: </b>  chmod
</div>

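>The `chmod` command changes the access permissions of files and directories.
>
>`chmod` is followed by a series of letters and symbols. In the case above this is `+x`, together with the `-R` option, which applies the change to a directory and everything inside it.
>- The first letter can be `u`, `g`, `o`, or `a`. If it is left out, as in `+x` above, the change applies to all users (subject to the system's umask).
>    - `u` stands for the user who owns the file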
>    - `g` stands for group
>    - `o` stands for other users
>    - `a` stands for all
>    
>    
>- Next can be either a `+` or a `-`.
>    - `+` grants access
>    - `-` revokes access
>
>
>- Next the type of permission is indicated (more than one can be there). The options are `r`, `w`, and `x`.
>    - `r` is read permission
>    - `w` is write permission
>    - `x` is execute permission
>
>
>- Finally, the file or directory is designated.
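
For comparison, the same kind of permission change can be made from Python with `os.chmod`. The sketch below adds execute permission for owner, group, and others, which corresponds to `chmod -R a+x`. With a typical umask, the `chmod -R +x` used in Step 7 has the same effect. The function names and the file `example.sh` are illustrative, and the demonstration runs on a temporary directory rather than on `./TransPi/bin`.

```python
import os
import stat
import tempfile


def add_execute(path):
    """Add execute permission for owner, group, and others (like `chmod a+x`)."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
    return stat.S_IMODE(os.stat(path).st_mode)


def add_execute_recursive(root):
    """Apply add_execute to root and everything below it (like `chmod -R a+x`)."""
    add_execute(root)
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            add_execute(os.path.join(dirpath, name))


with tempfile.TemporaryDirectory() as tmp:
    script = os.path.join(tmp, "example.sh")  # hypothetical script name
    with open(script, "w") as handle:
        handle.write("#!/bin/sh\necho hello\n")
    os.chmod(script, 0o644)                   # rw-r--r--
    add_execute_recursive(tmp)
    print(oct(stat.S_IMODE(os.stat(script).st_mode)))  # 0o755
```

The read and write bits are left as they were: only the three execute bits are switched on.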

<div class="alert alert-block alert-warning">
    <i class="fa fa-pencil" aria-hidden="true"></i>
    <b>Checkpoint 1:</b>
</div>

In [10]:
from jupyterquiz import display_quiz
display_quiz("../quiz-material/01-cp1.json", shuffle_questions = True)


## Conclusion

This notebook configured the virtual machine for the MDIBL Transcriptome Assembly Learning Module. We updated the system, installed the necessary software (Java, Miniforge, and Nextflow), and downloaded the TransPi program and its associated resources from an Amazon S3 bucket. The `chmod` command made the TransPi scripts executable. The VM is now ready for the next notebook, `Submodule_02_basic_assembly.ipynb`, which covers the transcriptome assembly process itself. The later modules depend on the steps in this notebook having completed without errors.

## Clean Up

Remember to proceed to the next notebook [`Submodule_02_basic_assembly.ipynb`](./Submodule_02_basic_assembly.ipynb) or shut down your instance if you are finished.