# Extended RNA-Seq Analysis Training Demo

## Overview

<div class="alert alert-block alert-danger"> <b>WARNING</b>: Full fastq files can be rather large, and so the downloading, extracting, and analysis of them means this tutorial can take almost <u>4 hours</u> to run the code fully using an <b>ml.m5.4xlarge</b> instance. </div>

This tutorial workflow repeats the short tutorial using the full dataset from the <i>Rochester JD et al.</i> project. All outputs used in the DEG tutorial were created with this extended, full-dataset workflow.

## STEP 1: Getting Started

<div class="alert alert-block alert-warning"> NOTE: This Jupyter Notebook was developed to run within a customized container on AWS with all software and tools pre-configured. If running without this customized container, you will need to install tools using the Miniforge environment setup instructions below before moving on to Step 2.</div>

### Without Container: Install Miniforge and Workflow Tools

Miniforge is a lightweight Conda distribution that offers a streamlined installation process and efficient package management. It provides access to a vast repository of packages.

Conda packages and environments are useful for several reasons. Conda packages contain metadata that lists the other programs a given package needs in order to run. When you install a package with Conda, those dependencies are installed automatically, so you do not have to install each one by hand. This makes installation quick and simple.

These packages are installed inside environments, which are simply folders within the local installation of Conda. This has several benefits. Local installation is easier for non-admin users who may not have access to all system directories. Each environment can hold specific software with specific versions, and it is easy to swap between environments. In addition, the environments themselves are portable, as each environment contains a manifest describing how to recreate it.

Miniforge is a minimal installer for Conda that is preconfigured to use the conda-forge channel and also provides Mamba. Mamba is an alternative to the native Conda package manager and works with existing Conda environments. It installs and updates Conda packages, which it gets from a ‘channel’, or repository, and is often used for reasons of speed.

Bioconda is a ‘channel’, or repository, that Conda and Mamba can download packages from. It is a repository of Conda packages related to biology. These packages are versions of popular biology software that are curated and uploaded by contributing users.

The following code performs these steps:
- Downloads Miniforge or Mambaforge (you can use either based on preference)
- Installs Miniforge (or Mambaforge) - no need to install conda since mamba will be available immediately
- Using miniforge and bioconda, installs the tools that will be used in this tutorial

<div class="alert alert-block alert-info">Tip: If using the Miniforge install, run the following code cells by removing the %%script false command. </div>

In [None]:
%%script false --no-raise-error
# Download Miniforge or Mambaforge (you can use either based on preference)
!curl -L -O https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh > /dev/null 2>&1

# Install Miniforge (or Mambaforge) - no need to install conda since mamba will be available immediately
!bash Miniforge3-$(uname)-$(uname -m).sh -b -u -p $HOME/miniforge > /dev/null
!date +"%T"

Next, using Mamba and Bioconda, install the tools that will be used in this tutorial.

In [None]:
%%script false --no-raise-error
# Update PATH to point to the Miniforge (or Mambaforge) bin files
import os
os.environ["PATH"] += os.pathsep + os.environ["HOME"]+"/miniforge/bin"

# Now we can use the 'mamba' command to install software (duplicate package names and flags removed)
!mamba install -y -c conda-forge -c bioconda trimmomatic fastqc multiqc sql-magic entrez-direct gffread parallel-fastq-dump sra-tools pyathena samtools star rsem subread pigz > /dev/null

---------------------------------------
## If running from a container, as noted above, start with <b> STEP 2 </b> below:
## STEP 2: Define Threads & Setup Directory Structures

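Create a set of directories in sra-data-athena to store the reads, reference sequence files, and output files. Note that the cell below only creates directories with `mkdir -p`; it does not remove an existing `data` directory, so delete any leftover files from Tutorial_1 first if you want a clean start.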

In [None]:
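# Note: each `!` command runs in its own subshell, so this `cd` does not change the notebook's working directory
!cd $HOMEDIR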
!echo $PWD
!mkdir -p data
!mkdir -p data/trunc_rawfastq
!mkdir -p data/trimmed
!mkdir -p data/fastqc
!mkdir -p data/fastqc_samples/
!mkdir -p data/reference
!mkdir -p data/raw_fastq
!mkdir -p data/aligned_bam
!mkdir -p data/rsem_reference/celegans_rsem_reference
!mkdir -p data/rsem_output
!mkdir -p data/reference/STAR_index
!mkdir -p data/multiqc_samples
!mkdir -p data/rsem_reference

Set the number of threads (`THREADS`) depending on your VM size.

In [None]:
import multiprocessing
import os  # needed for os.environ below; the earlier import sits in a skipped cell

num_cores = multiprocessing.cpu_count()
THREADS = max(1, num_cores - 1)

print("Number of threads:", THREADS)
os.environ["THREADS"] = str(THREADS)

## STEP 3: Downloading relevant FASTQ files using SRA Tools



Next we need to download the relevant FASTQ files. Because these files can be large, downloading and extracting them can take a long time. We will download the sample runs from this project with SRA Tools from NCBI's SRA (Sequence Read Archive). First, however, we need to find the associated accession numbers.

### STEP 3.1: Finding run accession numbers.


The SRA stores sequence data in terms of runs (SRR is the prefix for SRA run accessions). To download runs, we need the accession ID of each run we wish to download. To make it easier, these are the 6 run IDs from the Rochester JD et al. project that this tutorial uses:

- `SRR11550221`
- `SRR11550223`
- `SRR11550225`
- `SRR11550227`
- `SRR11550229`
- `SRR11550231`

In this case, all these runs belong to BioProject PRJNA625528. Sequencing runs can be searched for in the SRA database on the NCBI website, and article-specific run information can usually be found in the supplementary section of the article. Once the accession numbers are located, you can list them in a text file however you like. To make things easier, the cell below queries the BioProject with `esearch` and `efetch`, then writes the six run IDs listed above to `accs.txt`:

In [None]:
!esearch -db sra -query "PRJNA625528" | efetch -format runinfo | cut -d',' -f1 | tail -n +2 > all_accs.txt
!grep -E "SRR11550221|SRR11550223|SRR11550225|SRR11550227|SRR11550229|SRR11550231" all_accs.txt > accs.txt
!cat accs.txt
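The same filtering can be sketched in Python. This is a minimal illustration, assuming the runinfo output is CSV text whose run accessions sit in a column named `Run`; the function name `select_runs` and the sample input are hypothetical. Unlike a `grep` pattern, it matches whole IDs rather than substrings.

```python
import csv
import io

# Run IDs used in this tutorial (copied from the list above).
WANTED_RUNS = {
    "SRR11550221", "SRR11550223", "SRR11550225",
    "SRR11550227", "SRR11550229", "SRR11550231",
}

def select_runs(runinfo_csv, wanted):
    """Return the wanted run accessions found in the 'Run' column of runinfo CSV text."""
    reader = csv.DictReader(io.StringIO(runinfo_csv))
    return [row["Run"] for row in reader if row["Run"] in wanted]

# Illustrative input only: a cut-down table with a placeholder second column
# and one made-up run ID (SRR00000000) that should be filtered out.
example_runinfo = "Run,placeholder\nSRR11550221,0\nSRR00000000,0\nSRR11550223,0\n"
selected = select_runs(example_runinfo, WANTED_RUNS)
print(selected)
```

Writing `"\n".join(selected)` to `accs.txt` would give the same kind of list the shell commands produce.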

### STEP 3.2 Downloading multiple files using the SRA-toolkit.

The code uses `prefetch` to download multiple SRA files in parallel. It reads the list of SRR IDs from `accs.txt` and uses `xargs` to run `prefetch` for each ID. The `-O` option sets the output directory, and `-f yes` forces a download even if the file is already present. To speed up the download, the `-P $THREADS` option lets `xargs` run up to `$THREADS` downloads at the same time.

In [None]:
!cat accs.txt | xargs -P $THREADS -I {} prefetch {} -O data/raw_fastq -f yes

### STEP 3.3 Converting Multiple SRA files to Fastq

In this step, the SRA files are converted in parallel with `fastq-dump`. `xargs` reads each SRR ID from `accs.txt` and runs `fastq-dump` on the corresponding `.sra` file. As invoked here, without a read-splitting option such as `--split-files`, each SRR ID yields a single FASTQ file, which `--gzip` compresses into a `.fastq.gz` file to save space.

In [None]:
!cat accs.txt | xargs -P $THREADS -I {} fastq-dump --outdir data/raw_fastq/ --gzip data/raw_fastq/{}/{}.sra
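To see exactly which command `xargs` builds for each accession, the substitution can be sketched in Python as a dry run. The helper name `fastq_dump_command` is hypothetical; the arguments mirror the cell above, and nothing is executed here.

```python
import shlex

def fastq_dump_command(accession, raw_dir="data/raw_fastq"):
    """Build the fastq-dump argument list used above for one accession."""
    return [
        "fastq-dump",
        "--outdir", f"{raw_dir}/",
        "--gzip",
        f"{raw_dir}/{accession}/{accession}.sra",
    ]

accessions = ["SRR11550221", "SRR11550223"]
commands = [fastq_dump_command(acc) for acc in accessions]
for cmd in commands:
    # Dry run: print the command line instead of executing it.
    print(shlex.join(cmd))
```

To actually run a command, one option is `subprocess.run(cmd, check=True)`, which raises an error if the tool exits with a non-zero status.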

As before, it is good practice to store FASTQ files as `.fastq.gz` to save space, which the `--gzip` option above already takes care of. The downloaded `.sra` files are no longer needed after conversion, so the cell below deletes the per-run SRR subfolders to free disk space.

In [None]:
#find and delete all SRR subfolders in the raw_fastq directory
!find data/raw_fastq -type d -name 'SRR*' -exec rm -rf {} \;
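A rough Python equivalent of this cleanup is sketched below. It is demonstrated on a temporary directory rather than real data, the helper name `remove_srr_dirs` is hypothetical, and unlike `find` it only looks at the top level of the directory.

```python
import shutil
import tempfile
from pathlib import Path

def remove_srr_dirs(raw_dir):
    """Delete SRR* subdirectories directly under raw_dir and return their names."""
    removed = []
    for path in sorted(Path(raw_dir).glob("SRR*")):
        if path.is_dir():  # leave files such as SRR*.fastq.gz untouched
            shutil.rmtree(path)
            removed.append(path.name)
    return removed

# Demonstrate on a throwaway directory with placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp)
    (raw / "SRR11550221").mkdir()
    (raw / "SRR11550221" / "SRR11550221.sra").write_text("placeholder")
    (raw / "SRR11550221.fastq.gz").write_text("placeholder")
    removed = remove_srr_dirs(raw)
    remaining = sorted(p.name for p in raw.iterdir())

print(removed, remaining)
```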

### STEP 3.4 Download reference transcriptome files that will be used by STAR


This step downloads and prepares the reference data needed for your RNA-seq analysis. It retrieves three essential files:

- C elegans genome (celegans_genome.fa): This FASTA file contains the complete C elegans genome sequence, which will be used as the reference for aligning your RNA-seq reads.
- C elegans gene annotations (celegans_annotation.gtf): This GTF file provides information about the genes and transcripts in the C elegans genome, including their locations and structures. This data will be crucial for interpreting the aligned RNA-seq reads and understanding which genes are expressed in each sample.
- C elegans feature table (celegans_feature_table.txt): This table provides additional annotations for the C elegans genome features, potentially including information about gene functions and pathways. This table will be used later in the differential gene expression (DEG) analysis.

In [None]:
!wget https://nigms-sandbox.s3.us-east-1.amazonaws.com/bulk-scRNAseq/reference/celegans_genome.fa -O data/reference/celegans_genome.fa
!wget https://nigms-sandbox.s3.us-east-1.amazonaws.com/bulk-scRNAseq/reference/celegans_genomic.gtf -O data/reference/celegans_annotation.gtf
!wget https://nigms-sandbox.s3.us-east-1.amazonaws.com/bulk-scRNAseq/reference/celegans_feature_table.txt -O data/reference/celegans_feature_table.txt

### STEP 3.5: Copy data file for Trimmomatic

One of Trimmomatic's functions is to trim adapter sequences that are specific to the sequencing machine. These are usually stored within the Trimmomatic installation directory in a folder called adapters.

Directories of packages within Conda installations can be confusing, so when using Trimmomatic from Conda, it may be easier to simply download or create a file with the relevant adapter sequences and store it in an easy-to-find directory.

In [None]:
!wget https://nigms-sandbox.s3.us-east-1.amazonaws.com/bulk-scRNAseq/reference/TruSeq3-SE.fa -O data/trimmed/TruSeq3-SE.fa

## STEP 4: Run FastQC

FastQC is an invaluable tool that allows you to evaluate whether there are problems with a set of reads. For example, it will provide a report of whether there is any bias in the sequence composition of the reads. The below code may take a while to run. To make it run faster we can use threads to speed up the process.

In [None]:
!cat accs.txt | xargs -P $THREADS -I {} fastqc "data/raw_fastq/{}.fastq.gz" -o data/fastqc/

FastQC writes its results in HTML format, one report per FASTQ file, as shown below.

In [None]:
from IPython.display import IFrame
IFrame(src='data/fastqc/SRR11550221_fastqc.html', width=800, height=600)

Although it's best practice to look over the reports individually, tools like MultiQC let you quickly view a summary of the quality reports for all FASTQ files. For instance, the table below shows which checks passed, warned, or failed in each FastQC report. MultiQC creates other summaries as well.

In [None]:
!multiqc -f data/fastqc/

import pandas as pd
dframe = pd.read_csv("./multiqc_data/multiqc_fastqc.txt", sep='\t')
display(dframe)
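When there are many samples, it can help to tally the statuses per sample instead of reading the full table. The sketch below assumes a tab-separated table with a `Sample` column and status columns holding `pass`, `warn`, or `fail`; the column names and the two example rows are illustrative placeholders, not real results.

```python
import csv
import io
from collections import Counter

def count_statuses(tsv_text):
    """Map each sample name to a Counter of its pass/warn/fail values."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    summary = {}
    for row in reader:
        # Keep only cells that hold a FastQC-style status.
        statuses = [v for v in row.values() if v in {"pass", "warn", "fail"}]
        summary[row["Sample"]] = Counter(statuses)
    return summary

# Made-up example table (not taken from the tutorial's data).
example = (
    "Sample\tper_base_sequence_quality\tadapter_content\n"
    "sample_A\tpass\twarn\n"
    "sample_B\tpass\tfail\n"
)
summary = count_statuses(example)
for sample, counts in summary.items():
    print(sample, dict(counts))
```

With the real file, the text could be read with `open("multiqc_data/multiqc_fastqc.txt").read()` and passed to the same function.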

## STEP 5: Run Trimmomatic

Trimmomatic trims off any adapter sequences or low-quality bases it detects in the FASTQ files. Using piping and our original accession list, we can queue up a batch run of Trimmomatic for all our files. Note that here the files are processed one at a time, with the threads given to Trimmomatic itself rather than to `xargs`. The code below may take approximately 30 minutes to run.

In [None]:
!cat accs.txt | xargs -I {} \
trimmomatic SE -threads $THREADS \
data/raw_fastq/{}.fastq.gz data/trimmed/{}_trimmed.fastq \
ILLUMINACLIP:data/trimmed/TruSeq3-SE.fa:2:30:10:2:keepBothReads LEADING:3 TRAILING:3 MINLEN:36
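To make the `LEADING:3 TRAILING:3 MINLEN:36` steps concrete, here is a simplified sketch of their effect on the quality scores of a single read. It is an illustration only, not Trimmomatic's implementation: the function name `trim_read` is hypothetical, and adapter clipping (`ILLUMINACLIP`) is not modeled.

```python
def trim_read(qualities, leading=3, trailing=3, minlen=36):
    """Return (start, end) of the kept slice, or None if the read is dropped.

    qualities: list of per-base Phred quality scores for one read.
    """
    start, end = 0, len(qualities)
    # LEADING: remove bases from the start while quality is below the threshold.
    while start < end and qualities[start] < leading:
        start += 1
    # TRAILING: remove bases from the end while quality is below the threshold.
    while end > start and qualities[end - 1] < trailing:
        end -= 1
    # MINLEN: discard the read if it is too short after trimming.
    if end - start < minlen:
        return None
    return start, end

# Two low-quality bases at the start, one at the end, 40 good bases between.
kept = trim_read([2, 2] + [30] * 40 + [2])
# A 10-base read is shorter than MINLEN and is dropped.
dropped = trim_read([30] * 10)
print(kept, dropped)
```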

## STEP 6: Run FastQC
It's best practice to run FastQC after trimming. However, you may decide to run FastQC only once, before or after trimming.

We will proceed with only the forward reads because the Trimmomatic output showed very few 'orphaned' reads. That is to say, most forward and reverse reads were successfully paired together. Because we are only trying to map to a transcriptome, the forward reads alone (in this case, around 60 million base pairs) should be sufficient.

The below code may take around 15-20 minutes to run.

In [None]:
# Run FastQC
!cat accs.txt | xargs -P $THREADS -I {} fastqc data/trimmed/{}_trimmed.fastq -o data/fastqc_samples/

## STEP 7: Run MultiQC
MultiQC reads the FastQC reports and generates a compiled report for all the analyzed FASTQ files.

In [None]:
#!multiqc -f data/fastqc_samples/
!multiqc -f -o data/multiqc_samples/ data/fastqc_samples/

## STEP 8: Preparing the STAR-Compatible RSEM Reference

The cells below prepare the reference for RNA-seq analysis with STAR and RSEM. The first command, `rsem-prepare-reference`, builds a reference for the C. elegans genome from the provided GTF annotation file. The subsequent STAR genome generation step, however, raises a warning when `--genomeSAindexNbases` is left at 14, which is too large for this genome.

The warning indicates that a value of 14 for `--genomeSAindexNbases` is too large for the given genome size, which can lead to segmentation faults during mapping. To address this, the code sets `--genomeSAindexNbases` to 12, which is more appropriate for the genome size, so that STAR genome generation completes without errors.
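The value 12 is consistent with the scaling rule given in the STAR manual for small genomes, `min(14, log2(GenomeLength)/2 - 1)`. The sketch below applies that rule; the genome length of 100 million bases is an approximate figure for C. elegans used only for illustration, and the function name is hypothetical.

```python
import math

def suggested_sa_index_nbases(genome_length):
    """Apply the STAR manual's rule: min(14, log2(GenomeLength)/2 - 1)."""
    return min(14, int(math.log2(genome_length) / 2 - 1))

# Approximate C. elegans genome length (assumption for illustration).
approx_celegans_length = 100_000_000
print(suggested_sa_index_nbases(approx_celegans_length))  # prints 12
```

For an exact figure, the genome length can be computed by summing the sequence lengths in `celegans_genome.fa`.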

In [None]:
!rsem-prepare-reference --gtf data/reference/celegans_annotation.gtf --star -p $THREADS data/reference/celegans_genome.fa celegans_reference > /dev/null

In [None]:
!STAR --runThreadN $THREADS --runMode genomeGenerate --genomeDir . --genomeFastaFiles data/reference/celegans_genome.fa --sjdbGTFfile data/reference/celegans_annotation.gtf --sjdbOverhang 100 --outFileNamePrefix celegans_reference --genomeSAindexNbases 12 > /dev/null

## STEP 9: Automated RNA-seq Quantification with RSEM

This cell automates the quantification of gene expression from RNA-seq data using RSEM. It reads a list of SRR accession numbers from a text file, iterates over each accession, and executes the RSEM command to calculate gene expression levels. 

In [None]:
import os

# Read the list of SRR accessions, skipping blank lines
# (rsem-calculate-expression and STAR must be on your PATH)
with open('accs.txt', 'r') as f:
    srr_accessions = [line.strip() for line in f if line.strip()]
    
# Define the output directory
output_dir = "data/rsem_output"

for srr in srr_accessions:
    os.system(f"rsem-calculate-expression -p $THREADS --star "
              f"data/trimmed/{srr}_trimmed.fastq celegans_reference "
              f"{output_dir}/{srr}celegans > /dev/null")

## STEP 10: Report the top 10 most highly expressed genes in the samples

Top 10 most highly expressed genes in each sample listed in `accs.txt`.


In [None]:
import pandas as pd

# Path to RSEM results directory
rsem_results_dir = 'data/rsem_output'

# Loop through each file in accs.txt
for srr_id in open('accs.txt'):
    srr_id = srr_id.strip()  # Remove newline character
    rsem_result_file = f'{rsem_results_dir}/{srr_id}celegans.genes.results'

    # Load the RSEM results into a Pandas DataFrame
    df = pd.read_csv(rsem_result_file, sep='\t')

    # Sort the DataFrame by TPM values in descending order and get the top 10 genes
    top_10_genes = df.sort_values(by='TPM', ascending=False).head(10)

    # Print the top 10 genes with their TPM values
    print(f"Top 10 Genes by TPM for {srr_id}:")
    print(top_10_genes[['gene_id', 'TPM']])

## STEP 11: Report the expression of WBGene00004512 for each file

The cell below uses pandas to report the expression of the target gene in each sample. The fields in the RSEM `genes.results` file are as follows. The level of expression is reported in the Transcripts Per Million (`TPM`) and estimated read count (`expected_count`) fields:
- `gene_id`
- `transcript_id(s)`
- `length`
- `effective_length`
- `expected_count`
- `TPM`
- `FPKM`
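As a reminder of what TPM measures, the sketch below applies the standard definition: each gene's count is divided by its effective length, and the resulting rates are scaled to sum to one million. The numbers are made up for illustration, and RSEM estimates its values with its own statistical model, so this is not a reimplementation of RSEM.

```python
def tpm(expected_counts, effective_lengths):
    """Compute TPM values from per-gene counts and effective lengths."""
    # Reads per base for each gene; genes with zero effective length get 0.
    rates = [
        count / length if length > 0 else 0.0
        for count, length in zip(expected_counts, effective_lengths)
    ]
    total = sum(rates)
    if total == 0:
        return [0.0 for _ in rates]
    # Scale so that all values sum to one million.
    return [rate / total * 1e6 for rate in rates]

# Made-up example: three genes with different counts and lengths.
counts = [100, 200, 300]
lengths = [1000, 1000, 3000]
values = tpm(counts, lengths)
print([round(v) for v in values])
```

Note that the third gene has the most reads but the same TPM as the first, because its reads are spread over a transcript three times as long.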

In [None]:
import pandas as pd

# Path to RSEM results directory
rsem_results_dir = 'data/rsem_output'

# Target gene ID
target_gene = 'WBGene00004512'

# Loop through each file in accs.txt
for srr_id in open('accs.txt'):
    srr_id = srr_id.strip()  # Remove newline character
    rsem_result_file = f'{rsem_results_dir}/{srr_id}celegans.genes.results'

    # Load the RSEM results into a Pandas DataFrame
    df = pd.read_csv(rsem_result_file, sep='\t')

    # Filter for the target gene
    target_gene_data = df[df['gene_id'] == target_gene]

    # Print the target gene's TPM value for the SRR ID
    print(f"TPM for {target_gene} in {srr_id}: {target_gene_data['TPM'].values[0]}")

## STEP 12: Export Read counts to S3 Bucket


The code extracts the gene ID, expected count, and gene length from each RSEM output file and stores them as tab-separated files in an S3 bucket. This data will be accessible for further analysis in Tutorial 2 and Tutorial 3.

In [None]:
import os
import pandas as pd
import boto3

# Define the path to your RSEM output directory
rsem_output_path = "data/rsem_output"

# Define the S3 bucket and output path
s3_bucket = "sra-data-athena"
s3_output_path = "readcounts/"

# Initialize S3 client
s3_client = boto3.client('s3')

# Get a list of all .genes.results files in the directory
genes_files = [f for f in os.listdir(rsem_output_path) if f.endswith('celegans.genes.results')]

# Loop through each file to extract gene ID, expected counts, and gene length
for file in genes_files:
    file_path = os.path.join(rsem_output_path, file)
    
    # Read the .genes.results file
    rsem_data = pd.read_csv(file_path, sep="\t")

    # Check if the necessary columns exist
    if all(col in rsem_data.columns for col in ["gene_id", "expected_count", "length"]):
        # Create a new dataframe with required columns
        result_data = rsem_data[["gene_id", "expected_count", "length"]]
        result_data.columns = ["GeneID", "Count", "GeneLength"]

        # Define the output filename based on the input file name
        output_file_name = f"{os.path.splitext(file)[0]}.txt"
        s3_output_file_path = f"{s3_output_path}{output_file_name}"

        # Convert the DataFrame to a CSV string
        csv_buffer = result_data.to_csv(sep="\t", index=False)

        # Upload the result directly to S3
        s3_client.put_object(Bucket=s3_bucket, Key=s3_output_file_path, Body=csv_buffer)

    else:
        print(f"Warning: Required columns are missing in file: {file}")

# Optionally, print a message indicating completion
print("Extraction and file creation complete.")



## STEP 13: Save Merged Read Counts

This code combines multiple RSEM gene count files into a single, unified file, making it easier to analyze and visualize the gene expression data. The merged file is also uploaded to an S3 bucket to allow further analysis in other tutorials.
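Conceptually, the merge produces one row per gene and one column per sample. The sketch below shows that idea on made-up in-memory data; the function name `merge_counts` is hypothetical, and filling a gene missing from a sample with `0.0` is a choice made for this sketch, not documented behavior of `rsem-generate-data-matrix`.

```python
def merge_counts(per_sample):
    """Merge {sample: {gene_id: count}} into a header and rows of [gene_id, counts...]."""
    samples = list(per_sample)
    # Collect every gene ID seen in any sample, in sorted order.
    gene_ids = sorted(set().union(*(counts.keys() for counts in per_sample.values())))
    header = ["Name"] + samples
    rows = [
        [gene] + [per_sample[sample].get(gene, 0.0) for sample in samples]
        for gene in gene_ids
    ]
    return header, rows

# Made-up example with two samples and two genes.
example = {
    "sample_A": {"gene1": 10.0, "gene2": 0.0},
    "sample_B": {"gene1": 7.0, "gene2": 3.0},
}
header, rows = merge_counts(example)
print("\t".join(header))
for row in rows:
    print("\t".join(str(value) for value in row))
```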

In [None]:
# Ensure the RSEM quantification results directory exists
!mkdir -p data/rsem_output

# Merge RSEM results by gene counts (similar to Salmon's numreads merge)
!rsem-generate-data-matrix data/rsem_output/*celegans.genes.results > data/rsem_output/merged_gene_counts_celegans.txt

# Optionally, rename the columns based on the samples
# If you want to assign your GSM identifiers or any other custom names, edit the header.
!sed -i "1s/.*/Name\tGSM4478068\tGSM4478070\tGSM4478072\tGSM4478074\tGSM4478076\tGSM4478078/" data/rsem_output/merged_gene_counts_celegans.txt

# Remove any unnecessary prefixes like 'gene-' or 'rna-' for easier formatting
!sed -i "s/gene-//g" data/rsem_output/merged_gene_counts_celegans.txt
!sed -i "s/rna-//g" data/rsem_output/merged_gene_counts_celegans.txt

# Show a preview of the merged quantification file
!head data/rsem_output/merged_gene_counts_celegans.txt

import boto3

# Define the local file paths and the S3 destination.
# S3 bucket names cannot contain "/", so the "bulk-scRNAseq" folder belongs in the object keys.
file_path = "data/rsem_output/merged_gene_counts_celegans.txt"
feature_table_path = "data/reference/celegans_feature_table.txt"
bucket_name = "nigms-sandbox"
s3_key = "bulk-scRNAseq/readcounts/merged_gene_counts_celegans.txt"
s3_feature_table_key = "bulk-scRNAseq/reference/celegans_feature_table.txt"

# Initialize an S3 client
s3_client = boto3.client('s3')

# Upload the merged gene count file and the feature table
for local_path, key in [(file_path, s3_key), (feature_table_path, s3_feature_table_key)]:
    try:
        s3_client.upload_file(local_path, bucket_name, key)
        print(f"File {local_path} uploaded successfully to {bucket_name}/{key}")
    except Exception as e:
        print(f"Error uploading {local_path}: {e}")

## <a name="workflow">Additional Workflows</a>

Now that you have read counts per gene, feel free to explore the R workflow which creates plots and analyses using these readcount files, or try other alternate workflows for creating read count files, such as using snakemake.


[Workflow One:](Tutorial_1_subsampling_celegans.ipynb) A short introduction to downloading and mapping sequences to a C elegans genome using STAR and RSEM.


[Workflow Two (DEG Analysis):](Tutorial_2_DEG_Analysis_celegans.ipynb) Using DESeq2 and R to conduct clustering and differential gene expression analysis.

