# Extended RNA-Seq Analysis Training Demo

## Overview

For simplicity and to save time, the short tutorial workflow uses truncated, partial run data from the Cushman et al. project.

This extended tutorial repeats the short tutorial with the full FASTQ files and adds some extra steps: how to download and prepare the transcriptome files used by Salmon, alternate ways to navigate the NCBI databases for annotation or reference files you might need, and how to combine the Salmon outputs into a single gene count file at the end.

Full FASTQ files can be rather large, so downloading, extracting, and analyzing them means this tutorial can take over 1 hour 45 minutes to run in full. This is part of the reason we provide both a short introductory tutorial and this longer, more complete tutorial for those interested.

If this is too lengthy, feel free to move on to the Snakemake tutorial or the DEG analysis tutorial. All the files used in the DEG tutorial were created using this extended tutorial workflow.

![RNA-Seq workflow](images/rnaseq-workflow.png)

## Learning Objectives

* **Install necessary bioinformatics tools:**  Learn to install and manage bioinformatics software using mamba.

* **Set up a project directory structure:** Organize files efficiently for RNA-Seq analysis.

* **Download RNA-Seq data from the SRA:** Utilize `prefetch` and `fasterq-dump` to download and convert SRA data to FASTQ format.  Learn to obtain accession numbers from NCBI databases (both manually and using BigQuery).

* **Quality control of raw reads:** Use `FastQC` to assess the quality of raw sequencing reads and `MultiQC` to generate a summary report.

* **Read trimming and adapter removal:** Employ `Trimmomatic` to improve read quality by removing adapter sequences and low-quality bases.

* **Transcriptome preparation:** Download and prepare a reference transcriptome using `entrez-direct` and `gffread`, including creating a decoy file for Salmon.

* **RNA-Seq read alignment and quantification:** Use `Salmon` to align reads to the transcriptome and quantify gene expression levels.

* **Gene expression analysis:**  Interpret Salmon output to identify highly expressed genes and analyze the expression of specific genes of interest.

* **Combine gene counts:** Merge individual sample quantification results into a single gene count table for downstream analysis (e.g., differential expression).

## Prerequisites

**APIs:**

* **gsutil:**  The Google Cloud Storage (GCS) tool `gsutil` is used extensively to download data from a Google Cloud Storage bucket.  This implicitly requires the Google Cloud Storage API to be enabled.

**Software and Dependencies:**

* **Trimmomatic:**  Used for quality trimming of raw sequencing reads.
* **FastQC:**  A quality control tool for high-throughput sequence data.
* **MultiQC:**  Aggregates results from multiple FastQC runs into a single report.
* **Salmon:** A tool for quantifying transcript abundances from RNA-Seq data.
* **Entrez Direct (EDirect):** NCBI's command-line tool for accessing the Entrez databases (used here to retrieve reference genome information).
* **gffread:**  Parses GFF/GTF annotation files and extracts information from them, such as to create a transcriptome reference file from genome and annotation files.
* **parallel-fastq-dump:**  A wrapper around the `fastq-dump` tool (part of SRA Toolkit) that speeds up conversion by splitting a run into chunks and processing them in parallel.
* **sra-tools:** NCBI's SRA Toolkit for downloading and processing data from the Sequence Read Archive.  Specifically, `prefetch` and `fasterq-dump` are used here.
* **pigz:** A parallel version of the gzip compression/decompression tool.  It's faster for larger files.
* **gsutil:** Google Cloud Storage (GCS) command line tool (used to download the accession list file).
* **BigQuery (optional):** Google's data warehouse service (used optionally to generate an accession list file, requires the BigQuery API to be enabled in your GCP Project and  proper authentication credentials in the environment).

### STEP 1: Install the tools

Using mamba and bioconda, install the tools that will be used in this tutorial.

In [None]:
! mamba install -y -c conda-forge -c bioconda trimmomatic fastqc multiqc salmon entrez-direct gffread parallel-fastq-dump sra-tools=3.0.5 pigz

In [None]:
! prefetch --version

### STEP 2: Setup Environment

Create a set of directories to store the reads, reference sequence files, and output files. Notice that we first remove the `data` directory to clean up files from Tutorial_1.


In [None]:
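# Each `!` line runs in its own subshell, so this `cd` does not change the notebook's working directory.
! cd $HOMEDIR
! echo $PWD
# -f avoids an error if the data directory does not exist yet
! rm -rf data/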
! mkdir -p data
! mkdir -p data/raw_fastq
! mkdir -p data/trimmed
! mkdir -p data/fastqc
! mkdir -p data/aligned
! mkdir -p data/reference
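
The same directory layout can also be created from Python. The following is a minimal sketch rather than part of the original workflow: the helper name `make_project_dirs` is hypothetical, and a temporary directory stands in for the project root so the example does not touch your real `data` folder.

```python
import tempfile
from pathlib import Path

# Sub-directories used in this tutorial (same names as the mkdir commands above).
SUBDIRS = ["raw_fastq", "trimmed", "fastqc", "aligned", "reference"]


def make_project_dirs(root):
    """Create root/data/<subdir> for every entry in SUBDIRS and return the paths."""
    created = []
    for name in SUBDIRS:
        path = Path(root) / "data" / name
        path.mkdir(parents=True, exist_ok=True)  # behaves like `mkdir -p`
        created.append(path)
    return created


# Demonstrate on a throwaway directory rather than the real project root.
demo_root = tempfile.mkdtemp()
demo_dirs = make_project_dirs(demo_root)
print([p.name for p in demo_dirs])
```

Passing your actual project directory as `root` would create the same folders as the shell commands.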

Set number of cores depending on your VM size

In [None]:
numthreads=!nproc
numthreadsint = int(numthreads[0])
%env CORES = $numthreadsint
#!echo ${CORES}
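
As an alternative, the core count can be set without calling `nproc`. This is a minimal sketch using only the Python standard library; it assumes that commands started with `!` inherit the notebook's environment variables, which is how `%env` makes `CORES` visible to them.

```python
import os

# os.cpu_count() can return None on unusual platforms, so fall back to 1.
cores = os.cpu_count() or 1

# Environment variables must be strings.
os.environ["CORES"] = str(cores)
print(f"Using {cores} core(s)")
```

Note that `nproc` honors CPU affinity limits while `os.cpu_count()` reports all CPUs on the machine, so the two values can differ in restricted containers.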

### STEP 3: Downloading relevant FASTQ files using SRA Tools

Next we will need to download the relevant fastq files.

Because these files can be large, the process of downloading and extracting fastq files can be quite lengthy.

The sequence data for this tutorial comes from work by Cushman et al., <em><a href='https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8191103/'>Increased whiB7 expression and antibiotic resistance in Mycobacterium chelonae carrying two prophages</a></em>.

We will download the sample runs from this project using SRA tools, pulling them from NCBI's SRA (Sequence Read Archive).

First, however, we need to find the accession numbers of the runs we want to download.


### STEP 3.1: Finding run accession numbers.

The SRA stores sequence data in terms of runs (SRR stands for Sequence Read Run). To download runs, we need the accession ID of each run we wish to download.

The Cushman et al. project contains 12 runs. To make it easier, these are the run IDs associated with this project:

+ SRR13349122
+ SRR13349123
+ SRR13349124
+ SRR13349125
+ SRR13349126
+ SRR13349127
+ SRR13349128
+ SRR13349129
+ SRR13349130
+ SRR13349131
+ SRR13349132
+ SRR13349133
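
Because the 12 run IDs listed above are consecutive, they can also be generated rather than typed. This is a small sketch; the one-accession-per-line layout is our assumption about how an accession list file is usually organized.

```python
# First and last run numbers, taken from the list above.
FIRST_RUN = 13349122
LAST_RUN = 13349133

# Build SRR13349122 ... SRR13349133 (range() excludes its end point, hence + 1).
accessions = [f"SRR{n}" for n in range(FIRST_RUN, LAST_RUN + 1)]

# One accession per line, ready to be written to a text file if desired.
accession_text = "\n".join(accessions) + "\n"
print(accession_text)
```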


In this case, all these runs belong to the same SRA study (SRP accession): SRP300216.

Sequencing runs can be searched for in the SRA database on the NCBI website, and article-specific run information can often be found in the supplementary section of the article.

For instance, here the authors posted a link to the GEO Series (GSE) accession for the sequence data, <a href='https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE164210'>GSE164210</a>. This leads to the corresponding Gene Expression Omnibus (GEO) page where, among other useful files and information, the relevant SRA database link can be found.

You can download a text file containing these accession numbers (STEP 3.1.1) and continue to STEP 3.2, or you can optionally use BigQuery to generate an accession list (STEP 3.1.2), following the instructions outlined in [this notebook](https://github.com/STRIDES/NIHCloudLabGCP/blob/main/notebooks/SRADownload/SRA-Download.ipynb).

### STEP 3.1.1: Download the accession list file with gsutil

In [None]:
! gsutil cp gs://nigms-sandbox/me-inbre-rnaseq-pipelinev2/accs.txt .

### STEP 3.1.2 (Optional): Generate the accession list file with BigQuery

In [None]:
# Import the BigQuery API client library
from google.cloud import bigquery

Now make sure you have enabled the BigQuery API. Search for BigQuery in the Google Cloud console, go to the BigQuery page, and click `Enable`.

In [None]:
# Designate the client for the API
client = bigquery.Client(location="US")
print("Client created using default project: {}".format(client.project))

The table we are working with has the following Id:

In [None]:
table_id = "nih-sra-datastore.sra.metadata"

It has the following column names:

In [None]:
table_ref = client.get_table(table_id)
column_names = [field.name for field in table_ref.schema]
column_names

The first column (`acc`) is the accession ID.

Now we will query BigQuery using the species name and a range of accession numbers associated with this particular study. Feel free to play around with the query to generate different variations of accession numbers!

In [None]:
query = """
#standardSQL
SELECT acc
FROM `nih-sra-datastore.sra.metadata`
WHERE organism = 'Mycobacteroides chelonae'
and acc LIKE '%SRR133491%'
ORDER BY acc
"""
query_job = client.query(
    query,
    # Location must match that of the dataset(s) referenced in the query.
    location="US",
)  # API request - starts the query

In [None]:
result=list(query_job)
result

In [None]:
with open('accs2.txt', 'w') as f:
    for acc in result:
        f.write(acc[0]+'\n')

In [None]:
cat accs2.txt

### STEP 3.2: Using the SRA-toolkit for a single sample.

Now use the Sequence Run accession ID to download the sequence data.

In [None]:
! prefetch SRR13349124 -O data/raw_fastq -f yes

Notice that the SRA stores sequence files in the SRA format.

Genomics workflows typically process data as compressed or uncompressed .fastq or .fasta files.

So before we move on, we need to convert the files from .sra to .fastq using the `fasterq-dump` tool.

We will also compress the FASTQ files to save space, producing fastq.gz files; the compression itself is done with `pigz` in a later step.

In [None]:
! fasterq-dump -f -O data/raw_fastq -e $CORES -m 4G data/raw_fastq/SRR13349124/SRR13349124.sra
# Only SRR13349124 has been downloaded so far, so we convert just that run here.

### STEP 3.3: Downloading multiple files using the SRA-toolkit.

One may, as in our case, wish to download multiple runs at once.

To aid in this, SRA-tools supports batch downloading.

We can download multiple SRA files with a single command by creating a list of the SRA IDs we wish to download and passing that list to the `prefetch` command with `--option-file`.

Note that it may take some time to download all the SRA files.

In [None]:
! prefetch -O data/raw_fastq/ --option-file accs.txt

### STEP 3.4: Converting Multiple SRA files to Fastq

We used fasterq-dump before to convert an SRA file to FASTQ. However, fasterq-dump does not have native batch support, so we will use a loop to convert each file in our list. The FASTQ files are then compressed to fastq.gz for downstream processing. This step should take about 30 minutes.

In [None]:
! for x in `cat accs.txt`; do fasterq-dump -f -O data/raw_fastq -e $CORES -m 4G data/raw_fastq/$x/$x.sra; done
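
The shell loop above can also be expressed in Python, which makes each command easy to inspect before anything is run. This is a sketch under stated assumptions: the helper name `fasterq_dump_command` and its default values are hypothetical, while the flags (`-f`, `-O`, `-e`, `-m`) and the path layout are copied from the command above. The example only builds and prints the commands; it does not execute them.

```python
import shlex


def fasterq_dump_command(accession, out_dir="data/raw_fastq", threads=4, mem="4G"):
    """Return the fasterq-dump argument list for one prefetched run."""
    # prefetch stores each run as <out_dir>/<accession>/<accession>.sra
    sra_path = f"{out_dir}/{accession}/{accession}.sra"
    return ["fasterq-dump", "-f", "-O", out_dir,
            "-e", str(threads), "-m", mem, sra_path]


# Two accessions from the run list, used here only for illustration.
accessions = ["SRR13349122", "SRR13349123"]
commands = [fasterq_dump_command(acc) for acc in accessions]

for cmd in commands:
    print(shlex.join(cmd))  # shows the equivalent shell command line
```

Each argument list could be passed to `subprocess.run(cmd, check=True)` to actually run the conversion, which would stop at the first failing run instead of continuing silently.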

Convert to fastq.gz

In [None]:
! time pigz data/raw_fastq/*.fastq

### STEP 4: Copy reference transcriptome files that will be used by Salmon using E-Direct

Salmon is a tool that aligns RNA-Seq reads to a transcriptome.

So we will need a transcriptome reference file.

To get one, we can search through the NCBI assembly database, find an assembly, and download transcriptome reference files from that assembly using FTP links.

For instance, we will use the <a href='https://www.ncbi.nlm.nih.gov/assembly/GCF_001632805.1'>ASM163280v1</a> RefSeq assembly, found by searching the NCBI Assembly database. The FTP links can be accessed through the website in various ways; one is to click the 'FTP directory for RefSeq assembly' link under the 'Access the data' section.

Alternatively, one can take the less common route and do this with NCBI's command-line tool suite, Entrez Direct (EDirect).

This is an intricate set of tools, with many ways to do any one thing.

Below is an example that uses an EDirect search query with a RefSeq identifier to obtain the relevant FTP directory, and then uses that directory to download the desired reference files.

In [None]:
#parse for the ftp link and download the genome reference fasta file

! esearch -db assembly -query GCF_001632805.1 | efetch -format docsum \
| xtract -pattern DocumentSummary -element FtpPath_RefSeq \
| awk -F"/" '{print "curl -o data/reference/"$NF"_genomic.fna.gz " $0"/"$NF"_genomic.fna.gz"}' \
| bash

#parse for the ftp link and download the GFF annotation file

! esearch -db assembly -query GCF_001632805.1 | efetch -format docsum \
| xtract -pattern DocumentSummary -element FtpPath_RefSeq \
| awk -F"/" '{print "curl -o data/reference/"$NF"_genomic.gff.gz " $0"/"$NF"_genomic.gff.gz"}' \
| bash

# parse for the ftp link and download the feature-table reference file 
# (for later use for merging readcounts with gene names in R code).

! esearch -db assembly -query GCF_001632805.1 | efetch -format docsum \
| xtract -pattern DocumentSummary -element FtpPath_RefSeq \
| awk -F"/" '{print "curl -o data/reference/"$NF"_feature_table.txt.gz " $0"/"$NF"_feature_table.txt.gz"}' \
| bash


#unzip the compressed reference files

! gzip -d data/reference/*.gz --force
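
The `awk` step in the commands above builds each download URL by taking the last component of the FTP directory path and appending a file suffix. The sketch below shows the same string logic in Python. The helper name `reference_urls` is hypothetical, and the example FTP path is an assumption based on the usual NCBI genomes layout; in the workflow the real path comes from the `FtpPath_RefSeq` field.

```python
# File suffixes downloaded in this step.
SUFFIXES = ("_genomic.fna.gz", "_genomic.gff.gz", "_feature_table.txt.gz")


def reference_urls(ftp_path, suffixes=SUFFIXES):
    """Build one download URL per suffix from an assembly FTP directory."""
    base = ftp_path.rstrip("/")
    prefix = base.rsplit("/", 1)[-1]  # last path component, like awk's $NF
    return [f"{base}/{prefix}{suffix}" for suffix in suffixes]


# Assumed example path for the assembly used in this tutorial.
example_path = ("ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/632/805/"
                "GCF_001632805.1_ASM163280v1")
urls = reference_urls(example_path)

for url in urls:
    print(url)
```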

Next we can use a tool called gffread to create a transcriptome reference file using the GFF annotation and genome files we downloaded.

In [None]:
! gffread -w data/reference/GCF_001632805.1_transcriptome_reference.fa -g data/reference/GCF_001632805.1_ASM163280v1_genomic.fna data/reference/GCF_001632805.1_ASM163280v1_genomic.gff

It is also recommended to append the full genome to the end of the transcriptome reference file in order to perform 'decoy-aware' mapping; more information about this can be found in the Salmon documentation.

To tell Salmon which sequences are decoys, we also create a 'decoy file' that lists the name of the full genome sequence included in our transcriptome reference file.

In [None]:
! cat data/reference/GCF_001632805.1_transcriptome_reference.fa <(echo) data/reference/GCF_001632805.1_ASM163280v1_genomic.fna > data/reference/GCF_001632805.1_transcriptome_reference_w_decoy.fa
! echo "NZ_CP007220.1" > data/reference/decoys.txt
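
The decoy name above is hard-coded, which works here because the genome contains a single sequence. For a genome with several sequences, the names can be read from the FASTA headers instead. The following is a minimal sketch: the function name `fasta_sequence_names` is hypothetical, and the example FASTA content is made up for illustration (only the identifier `NZ_CP007220.1` comes from this tutorial).

```python
def fasta_sequence_names(fasta_lines):
    """Return the identifier (text up to the first whitespace) of each FASTA record."""
    names = []
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:  # skip malformed headers that contain only '>'
                names.append(fields[0])
    return names


# Made-up genome content; the description and bases are placeholders.
example_genome = [
    ">NZ_CP007220.1 placeholder description",
    "ACGTACGTACGT",
    "TTGACCA",
]

decoy_names = fasta_sequence_names(example_genome)
print("\n".join(decoy_names))  # one decoy name per line, as in decoys.txt
```

To use it on the real genome, the lines could be read from the `.fna` file with `open(...)` and the result written to `data/reference/decoys.txt`.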

### STEP 5: Copy data file for Trimmomatic

One of Trimmomatic's functions is to trim sequencing-machine-specific adapter sequences. These are usually stored within the Trimmomatic installation directory in a folder called adapters.

Directories of packages within mamba installations can be confusing, so when using mamba with Trimmomatic, it may be easier to simply download or create a file with the relevant adapter sequences and store it in an easy-to-find directory.

In [None]:
! gsutil -m cp -r gs://nigms-sandbox/me-inbre-rnaseq-pipelinev2/config/TruSeq3-PE.fa .
! head TruSeq3-PE.fa 

### STEP 6: Run Trimmomatic
Trimmomatic will trim off any adapter sequences or low quality sequence it detects in the FASTQ files.

Using piping and our original list, we can queue up a batch run of Trimmomatic for all our files. Note that this (`xargs`) is a different way to run a loop compared with what we did before.

The below code may take approximately 35 minutes to run.

In [None]:
! cat accs.txt | xargs -I {} trimmomatic PE -threads $CORES 'data/raw_fastq/{}_1.fastq.gz' 'data/raw_fastq/{}_2.fastq.gz' 'data/trimmed/{}_1_trimmed.fastq.gz' 'data/trimmed/{}_1_trimmed_unpaired.fastq.gz' 'data/trimmed/{}_2_trimmed.fastq.gz' 'data/trimmed/{}_2_trimmed_unpaired.fastq.gz' ILLUMINACLIP:TruSeq3-PE.fa:2:30:10:2:keepBothReads LEADING:3 TRAILING:3 MINLEN:36
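
The `ILLUMINACLIP` step packs several settings into one colon-separated string. The sketch below splits the string used above into named fields to make it easier to read. The field names are our own labels and follow the parameter order given in the Trimmomatic manual as we understand it; the parser is illustrative and is not part of Trimmomatic.

```python
# Labels for the colon-separated ILLUMINACLIP values, in order (our naming).
ILLUMINACLIP_FIELDS = [
    "adapter_fasta",
    "seed_mismatches",
    "palindrome_clip_threshold",
    "simple_clip_threshold",
    "min_adapter_length",
    "keep_both_reads",
]


def parse_illuminaclip(step):
    """Split an ILLUMINACLIP step string into a dict of labeled raw values."""
    name, *values = step.split(":")
    if name != "ILLUMINACLIP":
        raise ValueError(f"not an ILLUMINACLIP step: {step!r}")
    return dict(zip(ILLUMINACLIP_FIELDS, values))


settings = parse_illuminaclip("ILLUMINACLIP:TruSeq3-PE.fa:2:30:10:2:keepBothReads")
for field, value in settings.items():
    print(f"{field}: {value}")
```

The values are kept as raw strings on purpose. The Trimmomatic documentation shows the last field as a True/False value, so it is worth confirming how your installed version interprets the literal `keepBothReads`.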

### STEP 7: Run FastQC
FastQC is an invaluable tool that allows you to evaluate whether there are problems with a set of reads. For example, it will provide a report of whether there is any bias in the sequence composition of the reads.

If you looked at the results of the trimming, you may have noticed that few reverse reads survived and that they were largely unpaired. This may be an artifact of the original sequencing process. This is okay: we can proceed from here using only the forward reads.

The below code may take around 10 minutes to run.

In [None]:
! cat accs.txt | xargs -I {} fastqc -t $CORES "data/trimmed/{}_1_trimmed.fastq.gz" -o data/fastqc/

### STEP 8: Run MultiQC
MultiQC reads in the FastQC reports and generates a compiled report for all the analyzed FASTQ files.

In [None]:
! multiqc -f data/fastqc/

### STEP 9: Index the Transcriptome so that Trimmed Reads Can Be Mapped Using Salmon

In [None]:
! salmon index -t data/reference/GCF_001632805.1_transcriptome_reference_w_decoy.fa -p $CORES -i data/reference/transcriptome_index --decoys data/reference/decoys.txt -k 31 --keepDuplicates

### STEP 10: Run Salmon to Map Reads to Transcripts and Quantify Expression Levels
Salmon aligns the trimmed reads to the reference transcriptome and generates the read counts per transcript. In this analysis, each gene has a single transcript.

In [None]:
! cat accs.txt | xargs -I {} salmon quant -i data/reference/transcriptome_index -l SR -r "data/trimmed/{}_1_trimmed.fastq.gz" -p $CORES --validateMappings -o "data/quants/{}_quant"

### STEP 11: Report the top 10 most highly expressed genes in the samples

Top 10 most highly expressed genes in each wild-type sample.


In [None]:
! head data/quants/SRR13349122_quant/quant.sf -n 1
! sort -nrk 4,4 data/quants/SRR13349122_quant/quant.sf | head -10
! sort -nrk 4,4 data/quants/SRR13349123_quant/quant.sf | head -10
! sort -nrk 4,4 data/quants/SRR13349124_quant/quant.sf | head -10
! sort -nrk 4,4 data/quants/SRR13349125_quant/quant.sf | head -10
! sort -nrk 4,4 data/quants/SRR13349126_quant/quant.sf | head -10
! sort -nrk 4,4 data/quants/SRR13349127_quant/quant.sf | head -10

Top 10 most highly expressed genes in the double lysogen samples.


In [None]:
! head data/quants/SRR13349122_quant/quant.sf -n 1
! sort -nrk 4,4 data/quants/SRR13349128_quant/quant.sf | head -10
! sort -nrk 4,4 data/quants/SRR13349129_quant/quant.sf | head -10
! sort -nrk 4,4 data/quants/SRR13349130_quant/quant.sf | head -10
! sort -nrk 4,4 data/quants/SRR13349131_quant/quant.sf | head -10
! sort -nrk 4,4 data/quants/SRR13349132_quant/quant.sf | head -10
! sort -nrk 4,4 data/quants/SRR13349133_quant/quant.sf | head -10
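
The `sort | head` commands above rank rows by the fourth column (`TPM`). The same ranking can be sketched in Python, which reads the header by name instead of by column position. The helper name `top_expressed` is hypothetical, and the example table contains made-up gene names and values purely for illustration.

```python
import csv
import io


def top_expressed(quant_text, n=10):
    """Return the n (Name, TPM) pairs with the highest TPM from quant.sf-style text."""
    reader = csv.DictReader(io.StringIO(quant_text), delimiter="\t")
    rows = sorted(reader, key=lambda row: float(row["TPM"]), reverse=True)
    return [(row["Name"], float(row["TPM"])) for row in rows[:n]]


# Made-up rows using the quant.sf column layout.
example_quant = (
    "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
    "gene-A\t900\t650.0\t120.5\t3000\n"
    "gene-B\t1200\t950.0\t15.0\t500\n"
    "gene-C\t300\t80.0\t980.2\t2500\n"
)

top = top_expressed(example_quant, n=2)
print(top)
```

For a real sample, the text could come from `open("data/quants/SRR13349122_quant/quant.sf").read()`.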

### STEP 12: Report the expression of a putative acyl-ACP desaturase (BB28_RS16545) that was downregulated in the double lysogen relative to wild-type
A putative acyl-ACP desaturase (BB28_RS16545) was reported to be downregulated in the double lysogen, as shown in the table of the top 20 upregulated and downregulated genes from the paper describing the study.
![Table of the top upregulated and downregulated genes from Cushman et al.](images/table-cushman.png)

Use `grep` to report the expression in the wild-type samples. The fields in the Salmon `quant.sf` file are as follows. The level of expression is reported in the Transcripts Per Million (`TPM`) and number of reads (`NumReads`) fields:  
`Name    Length  EffectiveLength TPM     NumReads`

In [None]:
! grep 'BB28_RS16545' data/quants/SRR13349122_quant/quant.sf
! grep 'BB28_RS16545' data/quants/SRR13349123_quant/quant.sf
! grep 'BB28_RS16545' data/quants/SRR13349124_quant/quant.sf
! grep 'BB28_RS16545' data/quants/SRR13349125_quant/quant.sf
! grep 'BB28_RS16545' data/quants/SRR13349126_quant/quant.sf
! grep 'BB28_RS16545' data/quants/SRR13349127_quant/quant.sf

Use `grep` to report the expression in the double lysogen samples. The fields in the Salmon `quant.sf` file are as follows. The level of expression is reported in the Transcripts Per Million (`TPM`) and number of reads (`NumReads`) fields:  
`Name    Length  EffectiveLength TPM     NumReads`

In [None]:
! grep 'BB28_RS16545' data/quants/SRR13349128_quant/quant.sf
! grep 'BB28_RS16545' data/quants/SRR13349129_quant/quant.sf
! grep 'BB28_RS16545' data/quants/SRR13349130_quant/quant.sf
! grep 'BB28_RS16545' data/quants/SRR13349131_quant/quant.sf
! grep 'BB28_RS16545' data/quants/SRR13349132_quant/quant.sf
! grep 'BB28_RS16545' data/quants/SRR13349133_quant/quant.sf
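
Once the TPM values for the gene have been collected from each sample, the two groups can be compared with a rough log2 fold change. This is only an illustrative sketch: the TPM values below are made up, the pseudocount of 1 is an arbitrary choice to avoid dividing by zero, and the helper name `log2_fold_change` is hypothetical. The sample grouping (first six runs wild-type, last six double lysogen) follows the cells above. A proper statistical comparison is the subject of the DEG analysis tutorial.

```python
import math
from statistics import mean

# Sample groups as used in the cells above.
WILD_TYPE = [f"SRR{n}" for n in range(13349122, 13349128)]
DOUBLE_LYSOGEN = [f"SRR{n}" for n in range(13349128, 13349134)]


def log2_fold_change(tpm_by_sample, treated, control, pseudocount=1.0):
    """log2 of (mean treated TPM + pseudocount) / (mean control TPM + pseudocount)."""
    treated_mean = mean(tpm_by_sample[s] for s in treated)
    control_mean = mean(tpm_by_sample[s] for s in control)
    return math.log2((treated_mean + pseudocount) / (control_mean + pseudocount))


# Made-up TPM values for one gene, NOT results from the study.
example_tpm = {sample: 63.0 for sample in WILD_TYPE}
example_tpm.update({sample: 15.0 for sample in DOUBLE_LYSOGEN})

lfc = log2_fold_change(example_tpm, DOUBLE_LYSOGEN, WILD_TYPE)
print(round(lfc, 2))  # a negative value means lower expression in the double lysogen
```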

### STEP 13: Combine Genecounts to a Single Genecount File
Commonly, the readcounts for each sample are combined into a single table, where the rows contain the gene ID, and the columns identify the sample.

In [None]:
## First, merge the Salmon files by number of reads.
! salmon quantmerge --column numreads --quants data/quants/*_quant -o data/quants/merged_quants.txt
## Optionally, we can rename the columns.
! sed -i "1s/.*/Name\tSRR13349122\tSRR13349123\tSRR13349124\tSRR13349125\tSRR13349126\tSRR13349127\tSRR13349128\tSRR13349129\tSRR13349130\tSRR13349131\tSRR13349132\tSRR13349133/" data/quants/merged_quants.txt

## For further formatting, it may be easier to merge later in our R code
## if we remove the gene- and rna- prefixes.
! sed -i "s/gene-//" data/quants/merged_quants.txt
! sed -i "s/rna-//" data/quants/merged_quants.txt

print("An example of a combined genecount outputfile.")
! head data/quants/merged_quants.txt
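
To make the structure of the merged table concrete, the sketch below builds the same kind of gene-by-sample table in plain Python. It is not a replacement for `salmon quantmerge`: the helper name `merge_counts` is hypothetical, the counts are made up, and `rna-EXAMPLE_RNA` is a placeholder name. Unlike the `sed` commands above, it strips `gene-` and `rna-` only when they appear at the start of a name.

```python
def merge_counts(counts_by_sample):
    """Build a table (list of rows) with one row per gene and one column per sample."""
    samples = sorted(counts_by_sample)
    genes = sorted({gene for counts in counts_by_sample.values() for gene in counts})
    rows = [["Name"] + samples]
    for gene in genes:
        name = gene
        for prefix in ("gene-", "rna-"):
            if name.startswith(prefix):
                name = name[len(prefix):]
        # Genes missing from a sample are reported with a count of 0.
        rows.append([name] + [str(counts_by_sample[s].get(gene, 0)) for s in samples])
    return rows


# Made-up read counts for two samples.
example_counts = {
    "SRR13349122": {"gene-BB28_RS16545": 10, "rna-EXAMPLE_RNA": 3},
    "SRR13349123": {"gene-BB28_RS16545": 12},
}

table = merge_counts(example_counts)
for row in table:
    print("\t".join(row))
```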

## <a name="workflow">Additional Workflows</a>

Now that you have read counts per gene, feel free to explore the R workflow which creates plots and analyses using these readcount files, or try other alternate workflows for creating read count files, such as using snakemake.


[Workflow One:](Tutorial_1.ipynb) A short introduction to downloading and mapping sequences to a transcriptome using Trimmomatic and Salmon. Here is a link to the YouTube video demonstrating the tutorial: <https://youtu.be/ChGfBR4do_Y>.

[Workflow One (Extended):](Tutorial_1B_Extended.ipynb) An extended version of workflow one. Once you have got your feet wet, you can retry workflow one with this extended version, which covers the entire dataset and adds detail such as using SRA tools to download sequences and running batches of FASTQ files through the pipeline. This workflow can take well over an hour to run.

[Workflow One (Using Snakemake):](Tutorial_2_Snakemake.ipynb) Using snakemake to run workflow one.

[Workflow Two (DEG Analysis):](Tutorial_3_DEG_Analysis.ipynb) Using DESeq2 and R to conduct clustering and differential gene expression analysis.


![RNA-Seq workflow](images/RNA-Seq_Notebook_Homepage.png)

## Conclusion

This extended RNA-Seq analysis tutorial walked through a complete workflow for processing RNA-Seq data, from raw SRA files to a combined gene count table suitable for differential gene expression analysis. We used `prefetch`, `fasterq-dump`, `Trimmomatic`, `FastQC`, `MultiQC`, `Salmon`, `entrez-direct`, and `gffread`, with an emphasis on organized data management and batch processing of multiple samples from the command line. The tutorial covered downloading data from the NCBI SRA database, using either a prepared accession list or, optionally, Google Cloud BigQuery to retrieve accessions programmatically. Together, these tools handled quality control, read trimming, transcriptome preparation, read mapping, and quantification, ending in a consolidated gene count file. Although time-consuming because it processes the full FASTQ datasets, this workflow provides a foundation for the subsequent Snakemake and DEG analysis tutorials, and the resulting gene count matrix is the input for the downstream differential expression analysis.

## Clean Up

Remember to move to the next notebook or shut down your instance if you are finished.