# Extended RNA-Seq Analysis Training Demo

## Overview

For simplicity and to save time, the short tutorial workflow uses truncated, partial run data from the Cushman et al. project.

This tutorial repeats the short tutorial with the full FASTQ files and adds some extra steps, such as how to download and prepare the reference files used by STAR and RSEM, alternate ways to navigate the NCBI databases for annotation or reference files you might need, and how to combine the per-sample quantification outputs at the end into a single gene count file.

Full FASTQ files can be rather large, so downloading, extracting, and analyzing them means this tutorial can take over 1 hour 45 minutes to run in full. This is part of the reason we have both a short, easy introductory tutorial and this longer, more complete tutorial for those interested.

If this is too lengthy, feel free to move on to the Snakemake tutorial or the DEG analysis tutorial. All the files used in the DEG tutorial were created using this extended tutorial workflow.

![RNA-Seq workflow](images/rnaseq-workflow.png)

### STEP 1: Install Mambaforge

First install Mambaforge.


In [None]:
!curl -L -O https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-$(uname)-$(uname -m).sh
!bash Mambaforge-$(uname)-$(uname -m).sh -b -u -p $HOME/mambaforge
!date +"%T"

Next, using mambaforge and bioconda, install the tools that will be used in this tutorial.

In [None]:
#tell the computer where the mambaforge bin files are located
import os
os.environ["PATH"] += os.pathsep + os.environ["HOME"]+"/mambaforge/bin"

#now we can easily use 'mamba' command to install software 
!mamba install -y -c conda-forge -c bioconda trimmomatic fastqc multiqc salmon gsutil sql-magic entrez-direct gffread parallel-fastq-dump sra-tools pyathena samtools star rsem subread pigz
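
Before moving on, it can help to confirm that the installed executables are visible on `PATH`. The following is a minimal sketch using only the Python standard library; the list `REQUIRED_TOOLS` and the helper `find_missing_tools` are illustrative names, not part of the original notebook, and the list only covers commands that are called later in this tutorial.

```python
import shutil

# Executables called later in this tutorial (illustrative list, adjust as needed).
REQUIRED_TOOLS = [
    "esearch", "prefetch", "fasterq-dump", "parallel-fastq-dump", "pigz",
    "fastqc", "multiqc", "trimmomatic", "STAR",
    "rsem-prepare-reference", "rsem-calculate-expression",
]

def find_missing_tools(tools):
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]

missing = find_missing_tools(REQUIRED_TOOLS)
print("Missing tools:", missing if missing else "none")
```

If anything is reported as missing, re-check that the `mambaforge/bin` directory was appended to `PATH` in the cell above.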

### STEP 2: Setup Environment

Create a set of directories in `sra-data-athena` to store the reads, reference sequence files, and output files. Note that `mkdir -p` leaves an existing `data` directory untouched, so delete it first if you want to clean up files from Tutorial_1.

In [None]:
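#note: each '!' command runs in its own subshell, so '!cd' does not change
#the notebook's working directory (use the '%cd' magic for that)
!cd $HOMEDIR
!echo $PWD
!mkdir -p data
!mkdir -p data/raw_fastq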
!mkdir -p data/trimmed
!mkdir -p data/fastqc
!mkdir -p data/fastqc_samples/
!mkdir -p data/reference
!mkdir -p data/star_alignments
!mkdir -p data/quants

Set the number of threads (`THREADS`) based on your VM size.

In [None]:
numthreads=!lscpu | grep '^CPU(s)'| awk '{print $2-1}'

#python variable to hold the amount of threads your cpu has,
#useful for downstream tools like STAR, trimmomatic, etc
threads = int(numthreads[0])

#it's also good to have a shell version of the variable for commands that use piping,
#in jupyter, shell commands with piping sometimes cause python variables to not work reliably.
%env THREADS=$threads
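
The `lscpu` pipeline above subtracts one from the CPU count, which evaluates to 0 on a single-CPU machine. As a hedged alternative, the sketch below does the same thing in pure Python and never goes below one thread. The helper name `pick_thread_count` and its `reserve` parameter are assumptions made for illustration.

```python
import os

def pick_thread_count(reserve=1):
    """Return the CPU count minus `reserve`, but never less than 1."""
    total = os.cpu_count() or 1  # os.cpu_count() can return None
    return max(1, total - reserve)

threads = pick_thread_count()

# Export the value so that shell commands can read it as $THREADS.
os.environ["THREADS"] = str(threads)
print(f"Using {threads} thread(s)")
```

Setting `os.environ["THREADS"]` has the same effect for later `!` commands as the `%env` line in the cell above.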

### STEP 3: Downloading relevant FASTQ files using SRA Tools



### STEP 3.1: Finding run accession numbers.


In [None]:
!esearch -db sra -query "PRJNA892075" | efetch -format runinfo | cut -d',' -f1 | tail -n +2 > accs.txt
!cat accs.txt

In [None]:
from pyathena import connect
import pandas as pd

# Use the correct argument name: s3_staging_dir
conn = connect(s3_staging_dir='s3://sra-data-athena/', region_name='us-east-1')

In [None]:
import boto3

# Initialize the Glue client
glue_client = boto3.client('glue', region_name='us-east-1')

# Run the crawler
crawler_name = 'sra_crawler'  # Use your crawler's name
glue_client.start_crawler(Name=crawler_name)

print(f"Crawler {crawler_name} started.")

In [None]:
query = """
SELECT *
FROM AwsDataCatalog.srametadata.metadata
WHERE bioproject = 'PRJNA892075'
AND acc IN ('SRR21972729', 'SRR21972728', 'SRR21972725', 'SRR21972724')
"""
df = pd.read_sql(query, conn)
df


In [None]:
#write the SRR column to a text file
with open('accs.txt', 'w') as f:
    #one accession per line, with no padding whitespace
    accs = '\n'.join(df['acc'].astype(str)) + '\n'
    f.write(accs)
    
#print the text file
!cat accs.txt
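
Every later batch step reads `accs.txt` line by line, so a stray blank line or padded value can produce confusing "file not found" errors. The sketch below shows one way to sanity-check a list of accessions before using it. The helper `clean_accessions` is hypothetical, and the pattern assumes standard SRA run accessions (prefix `SRR`, `ERR`, or `DRR` followed by digits).

```python
import re

# SRA run accessions: SRR (NCBI), ERR (ENA), or DRR (DDBJ) followed by digits.
RUN_PATTERN = re.compile(r"^[SED]RR\d+$")

def clean_accessions(lines):
    """Strip whitespace, drop blanks and duplicates, and reject malformed IDs."""
    seen = set()
    cleaned = []
    for line in lines:
        acc = line.strip()
        if not acc or acc in seen:
            continue
        if not RUN_PATTERN.match(acc):
            raise ValueError(f"Not a run accession: {acc!r}")
        seen.add(acc)
        cleaned.append(acc)
    return cleaned

print(clean_accessions([" SRR21972729\n", "", "SRR21972728", "SRR21972729"]))
```

In the notebook this could be applied to `open('accs.txt')` before the download steps.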

### STEP 3.2: Using the SRA-toolkit for a single sample.

In [None]:
# Example usage for SRA download:
!prefetch SRR21972724 -O data/raw_fastq -f yes

In [None]:
!mamba install -c conda-forge pigz -y

#convert sra to fastq
!fasterq-dump data/raw_fastq/SRR21972724 -f -O data/raw_fastq/ -e $THREADS
#compress fastq to fastq.gz to save space
!pigz -p $THREADS data/raw_fastq/SRR21972724_1.fastq
!pigz -p $THREADS data/raw_fastq/SRR21972724_2.fastq

### STEP 3.3: Downloading multiple files using the SRA-toolkit.

In [None]:
!cat accs.txt | xargs -P $THREADS -I {} prefetch {} -O data/raw_fastq -f yes

### STEP 3.4: Converting Multiple SRA files to Fastq


In [None]:
#!for x in `cat accs.txt`; do fasterq-dump -f -O data/raw_fastq -e $THREADS -m 4G data/raw_fastq/$x/$x.sra; done

##example of how to alternatively do the above process with parallel-fastq-dump using piping
!cat accs.txt | xargs -I {} parallel-fastq-dump -O data/raw_fastq/ --tmpdir . --threads $THREADS --gzip --split-files --sra-id {}
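
Large downloads sometimes fail partway through, so it is worth checking that both mates exist for every accession before starting QC. This is a minimal sketch that assumes the `<accession>_1.fastq.gz` / `<accession>_2.fastq.gz` naming used in the following steps; `missing_fastq_pairs` is an illustrative helper, not part of the original notebook.

```python
from pathlib import Path

def missing_fastq_pairs(accessions, fastq_dir="data/raw_fastq"):
    """Return the expected paired-end FASTQ paths that do not exist on disk."""
    missing = []
    for acc in accessions:
        for mate in (1, 2):
            path = Path(fastq_dir) / f"{acc}_{mate}.fastq.gz"
            if not path.is_file():
                missing.append(str(path))
    return missing

# Example call with the accession used in STEP 3.2.
print(missing_fastq_pairs(["SRR21972724"]))
```

An empty list means every expected file is present.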

### STEP 4: Download the reference genome and annotation files that will be used by STAR


In [None]:
! wget ftp://ftp.ensembl.org/pub/release-104/fasta/mus_musculus/dna/Mus_musculus.GRCm39.dna.primary_assembly.fa.gz -O data/reference/mouse_genome.fa.gz
! wget ftp://ftp.ensembl.org/pub/release-104/gtf/mus_musculus/Mus_musculus.GRCm39.104.gtf.gz -O data/reference/mouse_annotation.gtf.gz
! wget -O data/reference/mouse_feature_table.txt.gz https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/635/GCF_000001635.27_GRCm39/GCF_000001635.27_GRCm39_feature_table.txt.gz

In [None]:
!gunzip -f data/reference/mouse_genome.fa.gz 
!gunzip -f data/reference/mouse_annotation.gtf.gz
!gunzip -f data/reference/mouse_feature_table.txt.gz

### STEP 5: Run FastQC

In [None]:
# Run fastqc for forward reads in parallel
!cat accs.txt | xargs -P $THREADS -I {} fastqc "data/raw_fastq/{}_1.fastq.gz" -o data/fastqc/

# Run fastqc for reverse reads in parallel
!cat accs.txt | xargs -P $THREADS -I {} fastqc "data/raw_fastq/{}_2.fastq.gz" -o data/fastqc/

In [None]:
from IPython.display import IFrame
IFrame(src='./data/fastqc/SRR21972724_1_fastqc.html', width=800, height=600)

In [None]:
!multiqc -f data/fastqc/

import pandas as pd
dframe = pd.read_csv("./multiqc_data/multiqc_fastqc.txt", sep='\t')
display(dframe)

### STEP 5.1: Merging our fastq files (Optional if there are multiple SRR per GSM)

In [None]:
from pyathena import connect
import pandas as pd

# Use the correct argument name: s3_staging_dir
conn = connect(s3_staging_dir='s3://sra-data-athena/', region_name='us-east-1')

query = """
SELECT *
FROM AwsDataCatalog.srametadata.metadata
WHERE bioproject = 'PRJNA1132229'
AND organism = 'Mus musculus'
"""
df = pd.read_sql(
    query, conn
)
df

In [None]:
#import os so we can easily pass strings to shell commands using 'subprocess'
import os
import subprocess

#now get the run accession ids and the sample ids from the created dataframe
runs = df['acc'].values
samples = list(set(df['sample_name'].values))

#sort them to be in numerical order
runs.sort()
samples.sort()
samples

In [None]:
#now iterate through the samples,
#because there are two SRRs to a sample (GSM),
#the SRR indices that correspond to GSM index i will be
#i*2 and i*2+1
for index, item in enumerate(samples):
    
    #concatenate the two SRRs
    os.system(f"cat data/raw_fastq/{runs[index*2]}_1.fastq data/raw_fastq/{runs[index*2+1]}_1.fastq > data/raw_fastq/{samples[index]}_1.fastq")
    #delete the previous fastq files to save space
    os.system(f"rm data/raw_fastq/{runs[index*2]}_1.fastq")
    os.system(f"rm data/raw_fastq/{runs[index*2+1]}_1.fastq")
    #zip the merged fastq file to save more space
    os.system(f"gzip data/raw_fastq/{samples[index]}_1.fastq")
    
    #repeat for reverse reads
    os.system(f"cat data/raw_fastq/{runs[index*2]}_2.fastq data/raw_fastq/{runs[index*2+1]}_2.fastq > data/raw_fastq/{samples[index]}_2.fastq")
    
    os.system(f"rm data/raw_fastq/{runs[index*2]}_2.fastq")
    os.system(f"rm data/raw_fastq/{runs[index*2+1]}_2.fastq")  
   
    #it's good practice to zip files to save space
    os.system(f"gzip data/raw_fastq/{samples[index]}_2.fastq")

In [None]:
#since our files will now be named by sample, not by SRR, we can write a new text file to use for downstream batch processes.
#we can use the DF we made in the previous cell.
with open('samples.txt', 'w') as f:
    df = df.sort_values(by='sample_name', ascending=True)
    samples = df['sample_name'].unique()
    samples = '\n'.join(map(str, samples))
    f.write(samples)
    
!cat samples.txt
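
The merge loop above pairs runs with samples by list position, which only works when every sample has exactly two runs and both sorted lists line up. A hedged alternative is to group runs by sample explicitly. The sketch below uses hypothetical run and sample IDs, and the helpers `group_runs_by_sample` and `merge_commands` are illustrative names; it only builds the `cat` commands and does not execute them.

```python
from collections import defaultdict

def group_runs_by_sample(records):
    """Map each sample name to the sorted list of its run accessions."""
    groups = defaultdict(list)
    for run, sample in records:
        groups[sample].append(run)
    return {sample: sorted(runs) for sample, runs in sorted(groups.items())}

def merge_commands(groups, fastq_dir="data/raw_fastq"):
    """Build one `cat` command per sample and mate; nothing is executed here."""
    commands = []
    for sample, runs in groups.items():
        for mate in (1, 2):
            inputs = " ".join(f"{fastq_dir}/{run}_{mate}.fastq" for run in runs)
            commands.append(f"cat {inputs} > {fastq_dir}/{sample}_{mate}.fastq")
    return commands

# Hypothetical (run, sample) pairs, for illustration only.
example = [("SRR0000002", "GSM_B"), ("SRR0000001", "GSM_A"), ("SRR0000003", "GSM_A")]
groups = group_runs_by_sample(example)
for command in merge_commands(groups):
    print(command)
```

With the metadata dataframe from the query above, the input could be built as `zip(df['acc'], df['sample_name'])`. This version also handles samples with one run or more than two runs.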

### STEP 5.2: Copy data file for Trimmomatic


In [None]:
!gsutil -m cp -r gs://nigms-sandbox/me-inbre-rnaseq-pipelinev2/config/TruSeq3-PE.fa .
!head TruSeq3-PE.fa 

### STEP 6: Run Trimmomatic

In [None]:
!cat accs.txt | xargs -I {} \
trimmomatic PE -threads $THREADS \
'data/raw_fastq/{}_1.fastq.gz' 'data/raw_fastq/{}_2.fastq.gz' \
'data/trimmed/{}_1_trimmed.fastq.gz' 'data/trimmed/{}_1_trimmed_unpaired.fastq.gz' \
'data/trimmed/{}_2_trimmed.fastq.gz' 'data/trimmed/{}_2_trimmed_unpaired.fastq.gz' \
ILLUMINACLIP:TruSeq3-PE.fa:2:30:10:2:keepBothReads LEADING:3 TRAILING:3 MINLEN:36
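
The colon-separated numbers in the `ILLUMINACLIP` step are easy to misread. As described in the Trimmomatic manual, they are, in order: the adapter FASTA file, seed mismatches, palindrome clip threshold, simple clip threshold, minimum adapter length, and the keep-both-reads setting. The sketch below rebuilds the string used above from named parameters; the function `illuminaclip_step` is a hypothetical helper for illustration.

```python
def illuminaclip_step(adapters, seed_mismatches=2, palindrome_threshold=30,
                      simple_threshold=10, min_adapter_length=2,
                      keep_both_reads=True):
    """Build a Trimmomatic ILLUMINACLIP step string from named parameters."""
    parts = ["ILLUMINACLIP", adapters, seed_mismatches, palindrome_threshold,
             simple_threshold, min_adapter_length]
    step = ":".join(str(part) for part in parts)
    if keep_both_reads:
        # Same trailing token as in the command above.
        step += ":keepBothReads"
    return step

print(illuminaclip_step("TruSeq3-PE.fa"))
```

The defaults here simply mirror the values in the command above; they are not a recommendation for other datasets.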

### STEP 7: Run FastQC
FastQC is an invaluable tool that allows you to evaluate whether there are problems with a set of reads. For example, it will provide a report of whether there is any bias in the sequence composition of the reads.

Because Jupyter is at its core a Python editor, we can use Python code and HTML support to display results in-line.

FastQC looks for different characteristics of quality in reads. It is very rare that every metric will pass. In many cases, the flags serve as warnings that should be interpreted in the context of the experiment. For instance, here, per base sequence content, sequence length distribution, sequence duplication levels, and overrepresented sequences all throw warnings. Per base sequence content routinely fails in the first 15 or so bases of RNA-seq reads due to biased fragmentation. In most of our samples, this is where we see the failure (20% or more difference between A/T or G/C), so it is not unexpected. The overrepresented sequences can be BLASTed to show that the majority of them are ribosomal RNA. Ribosomal RNA contamination is also common and will not be indexed later, so it is not a large concern. Other metrics look good.
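
To make the per base sequence content rule concrete, here is a small sketch that computes base percentages at one read position and applies the A/T and G/C difference check. It is an illustration of the rule, not FastQC's own code. The 20% failure threshold is the one mentioned above; the 10% warning threshold is taken from the FastQC documentation as we understand it, and both function names are hypothetical.

```python
def base_percentages(reads, position):
    """Percentage of A, C, G, T at one 0-based position across reads."""
    counts = {"A": 0, "C": 0, "G": 0, "T": 0}
    for read in reads:
        if position < len(read) and read[position] in counts:
            counts[read[position]] += 1
    total = sum(counts.values())
    if total == 0:
        return {base: 0.0 for base in counts}
    return {base: 100.0 * n / total for base, n in counts.items()}

def flag_position(percentages, warn=10.0, fail=20.0):
    """Flag a position by the larger of the |A-T| and |G-C| differences."""
    diff = max(abs(percentages["A"] - percentages["T"]),
               abs(percentages["G"] - percentages["C"]))
    if diff > fail:
        return "fail"
    if diff > warn:
        return "warn"
    return "pass"

# Toy reads, for illustration only.
reads = ["AC", "TG", "AG", "TC"]
for pos in range(2):
    pct = base_percentages(reads, pos)
    print(pos, pct, flag_position(pct))
```

A single failing position is enough for FastQC to flag the whole module, which is why the biased first bases of RNA-seq reads trigger it so often.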

In [None]:
# Run FastQC
!cat accs.txt | xargs -P $THREADS -I {} fastqc data/trimmed/{}_1_trimmed.fastq.gz data/trimmed/{}_2_trimmed.fastq.gz -o data/fastqc_samples/

### STEP 8: Run MultiQC
MultiQC reads in the FastQC reports and generates a compiled report for all the analyzed FASTQ files.

In [None]:
#!multiqc -f data/fastqc_samples/
!multiqc -f -o data/multiqc_samples/ data/fastqc_samples/

### STEP 9: Build the STAR genome index

In [None]:
import os
import subprocess
import pandas as pd

# Get the number of threads from the shell command
numthreads = !lscpu | grep '^CPU(s)' | awk '{print $2-1}'
threads = int(numthreads[0])

os.environ['THREADS'] = str(threads)

!/usr/bin/time -v STAR --runThreadN $THREADS --runMode genomeGenerate \
    --genomeDir data/reference/STAR_index \
    --genomeFastaFiles data/reference/mouse_genome.fa \
    --sjdbGTFfile data/reference/mouse_annotation.gtf \
    --sjdbOverhang 100 \
    --limitGenomeGenerateRAM 60000000000 \
    --genomeSAsparseD 2
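
The STAR manual describes the ideal `--sjdbOverhang` as the maximum read length minus 1, and notes that the default of 100 (used above) works well in most cases. If you want to check your own reads, the following sketch estimates the longest read from the first records of a FASTQ stream. Both helper names are hypothetical and the example records are made up.

```python
def max_read_length(fastq_lines, n_reads=10000):
    """Longest sequence among the first `n_reads` records of a FASTQ stream."""
    longest = 0
    for i, line in enumerate(fastq_lines):
        if i >= 4 * n_reads:
            break
        # Each FASTQ record is 4 lines; the sequence is the 2nd line.
        if i % 4 == 1:
            longest = max(longest, len(line.strip()))
    return longest

def suggested_sjdb_overhang(read_length):
    """Read length minus 1, as suggested in the STAR manual."""
    return max(1, read_length - 1)

# Made-up FASTQ records, for illustration only.
example = ["@read1", "ACGTACGTAC", "+", "IIIIIIIIII",
           "@read2", "ACGTACGT", "+", "IIIIIIII"]
length = max_read_length(example)
print(length, suggested_sjdb_overhang(length))
```

For a real file, pass an open text handle, for example `gzip.open('data/trimmed/SRR21972724_1_trimmed.fastq.gz', 'rt')`, instead of the example list.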

### STEP 10: Run STAR for Alignment, Prepare and Run RSEM for Quantification

In [None]:
# Create a directory to store STAR alignment output
!mkdir -p data/aligned_bam

# Align each sample
!cat accs.txt | xargs -I {} \
    STAR --runThreadN $THREADS \
      --genomeDir data/reference/STAR_index \
      --readFilesIn data/trimmed/{}_1_trimmed.fastq.gz data/trimmed/{}_2_trimmed.fastq.gz \
      --readFilesCommand zcat \
      --outFileNamePrefix data/aligned_bam/{}_ \
      --outSAMtype BAM SortedByCoordinate \
      --quantMode TranscriptomeSAM GeneCounts
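
Because `--quantMode GeneCounts` is set, STAR also writes a `<prefix>ReadsPerGene.out.tab` file per sample. According to the STAR manual, it has four tab-separated columns (gene ID, unstranded counts, counts for reads on the first-read strand, counts for reads on the second-read strand), and its first rows are summary lines starting with `N_`. The sketch below parses that layout; the function name and the example numbers are hypothetical.

```python
import csv
import io

def read_star_gene_counts(handle, column=1):
    """Split a ReadsPerGene.out.tab stream into summary rows and gene counts.

    column: 1 = unstranded, 2 = first-read strand, 3 = second-read strand.
    """
    summary, counts = {}, {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue
        name, value = row[0], int(row[column])
        if name.startswith("N_"):
            summary[name] = value
        else:
            counts[name] = value
    return summary, counts

# Made-up file contents, for illustration only.
example = io.StringIO(
    "N_unmapped\t10\t10\t10\n"
    "N_multimapping\t5\t5\t5\n"
    "N_noFeature\t3\t40\t4\n"
    "N_ambiguous\t1\t0\t1\n"
    "ENSMUSG00000064356\t120\t2\t118\n"
)
summary, counts = read_star_gene_counts(example, column=1)
print(summary)
print(counts)
```

Comparing `N_noFeature` across the three columns is a common way to guess the library strandedness: the column with the fewest unassigned reads usually matches the protocol.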

In [None]:
!mkdir -p data/rsem_reference/mouse_rsem_reference
!rsem-prepare-reference --gtf data/reference/mouse_annotation.gtf \
                        --star \
                        --star-path /sw/STAR \
                        data/reference/mouse_genome.fa \
                        data/rsem_reference/mouse_rsem_reference

In [None]:
# Create a directory to store RSEM quantification results
!mkdir -p data/rsem_output

# Run RSEM quantification for each sample listed in accs.txt
!cat accs.txt | xargs -I {} \
    rsem-calculate-expression --paired-end \
    --alignments \
    --bam \
    --star \
    -p $THREADS \
    data/aligned_bam/{}_Aligned.toTranscriptome.out.bam \
    data/rsem_reference/mouse_rsem_reference \
    data/rsem_output/{}

### STEP 11: Report the top 10 most highly expressed genes in the samples

Top 10 most highly expressed genes (by TPM) in each sample.


In [None]:
import pandas as pd

# Path to RSEM results directory
rsem_results_dir = 'data/rsem_output'

# Loop through each file in accs.txt
for srr_id in open('accs.txt'):
    srr_id = srr_id.strip()  # Remove newline character
    rsem_result_file = f'{rsem_results_dir}/{srr_id}.genes.results'

    # Load the RSEM results into a Pandas DataFrame
    df = pd.read_csv(rsem_result_file, sep='\t')

    # Sort the DataFrame by TPM values in descending order and get the top 10 genes
    top_10_genes = df.sort_values(by='TPM', ascending=False).head(10)

    # Print the top 10 genes with their TPM values
    print(f"Top 10 Genes by TPM for {srr_id}:")
    print(top_10_genes[['gene_id', 'TPM']])
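
The ranking above uses TPM (Transcripts Per Million). As a reminder of what that unit means, the sketch below applies the standard TPM definition: divide each count by its effective length, then rescale the resulting rates so they sum to one million. It is a worked illustration with made-up numbers, not RSEM's implementation.

```python
def tpm(counts, effective_lengths):
    """Convert read counts to TPM using per-gene effective lengths."""
    # Reads per base of effective length; zero-length genes get a rate of 0.
    rates = [count / length if length > 0 else 0.0
             for count, length in zip(counts, effective_lengths)]
    total = sum(rates)
    if total == 0:
        return [0.0] * len(rates)
    return [1e6 * rate / total for rate in rates]

# Two made-up genes with equal counts but different lengths.
values = tpm([100, 100], [1000, 2000])
print(values)
```

The shorter gene ends up with twice the TPM of the longer one, because the same number of reads over a shorter length implies a higher transcript abundance.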


### STEP 12: Report the expression of ENSMUSG00000064356 for each file

Use pandas to report the expression of the target gene in each sample. The gene-level RSEM output (`<sample>.genes.results`) has the following fields. The level of expression is reported in the Transcripts Per Million (`TPM`) and estimated read count (`expected_count`) fields:  
`gene_id transcript_id(s) length effective_length expected_count TPM FPKM`

In [None]:
import pandas as pd

# Path to RSEM results directory
rsem_results_dir = 'data/rsem_output'

# Target gene ID
target_gene = 'ENSMUSG00000064356'

# Loop through each file in accs.txt
for srr_id in open('accs.txt'):
    srr_id = srr_id.strip()  # Remove newline character
    rsem_result_file = f'{rsem_results_dir}/{srr_id}.genes.results'

    # Load the RSEM results into a Pandas DataFrame
    df = pd.read_csv(rsem_result_file, sep='\t')

    # Filter for the target gene
    target_gene_data = df[df['gene_id'] == target_gene]

    # Print the target gene's TPM value for the SRR ID
    print(f"TPM for {target_gene} in {srr_id}: {target_gene_data['TPM'].values[0]}")


### STEP 13: Combine Genecounts to a Single Genecount File


In [None]:
# Ensure the RSEM quantification results directory exists
!mkdir -p data/rsem_output

# Merge RSEM results by gene counts (similar to Salmon's numreads merge)
!rsem-generate-data-matrix data/rsem_output/*.genes.results > data/rsem_output/merged_gene_counts.txt

# Optionally, rename the columns based on the samples
# If you want to assign your GSM identifiers or any other custom names, edit the header.
!sed -i "1s/.*/Name\tGSM6658437\tGSM6658435\tGSM6658431\tGSM6658429/" data/rsem_output/merged_gene_counts.txt

# Remove any unnecessary prefixes like 'gene-' or 'rna-' for easier formatting
!sed -i "s/gene-//g" data/rsem_output/merged_gene_counts.txt
!sed -i "s/rna-//g" data/rsem_output/merged_gene_counts.txt

# Show a preview of the merged quantification file
!head data/rsem_output/merged_gene_counts.txt
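
Overwriting the header with `sed` relies on the column order matching the order of the hard-coded GSM names. As a hedged alternative, the sketch below builds the matrix in Python so that each column is labeled by the key it was loaded under. It reads the `gene_id` and `expected_count` columns of RSEM gene-level output; the helper names, sample labels, and file contents are made up for illustration.

```python
import csv
import io

def read_expected_counts(handle):
    """Read gene_id -> expected_count from an RSEM genes.results stream."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["gene_id"]: float(row["expected_count"]) for row in reader}

def build_count_matrix(per_sample):
    """Combine {sample: {gene: count}} into a header row and sorted gene rows."""
    samples = list(per_sample)
    genes = sorted(set().union(*per_sample.values()))
    header = ["Name"] + samples
    rows = [[gene] + [per_sample[s].get(gene, 0.0) for s in samples]
            for gene in genes]
    return header, rows

# Made-up, shortened file contents for two hypothetical samples.
file_a = io.StringIO("gene_id\texpected_count\tTPM\ngeneA\t10.00\t5.0\ngeneB\t0.00\t0.0\n")
file_b = io.StringIO("gene_id\texpected_count\tTPM\ngeneA\t7.50\t3.0\ngeneB\t2.00\t1.0\n")
per_sample = {
    "sample_1": read_expected_counts(file_a),
    "sample_2": read_expected_counts(file_b),
}
header, rows = build_count_matrix(per_sample)
print("\t".join(header))
for row in rows:
    print("\t".join(str(value) for value in row))
```

In the notebook, the dictionary could be filled by opening each `data/rsem_output/<accession>.genes.results` file and using the matching GSM identifier as the key.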

## <a name="workflow">Additional Workflows</a>

Now that you have read counts per gene, feel free to explore the R workflow which creates plots and analyses using these readcount files, or try other alternate workflows for creating read count files, such as using snakemake.


[Workflow One:](Tutorial_1.ipynb) A short introduction to downloading and mapping sequences to a transcriptome using Trimmomatic and Salmon. Here is a link to the YouTube video demonstrating the tutorial: <https://youtu.be/ChGfBR4do_Y>.

[Workflow One (Extended):](Tutorial_1B_Extended.ipynb) An extended version of workflow one. Once you have got your feet wet, you can retry workflow one with this extended version that covers the entire dataset, and includes elaboration such as using SRA tools for sequence downloading, and examples of running batches of fastq files through the pipeline. This workflow may take around an hour to run.

[Workflow One (Using Snakemake):](Tutorial_2_Snakemake.ipynb) Using snakemake to run workflow one.

[Workflow Two (DEG Analysis):](Tutorial_3_DEG_Analysis.ipynb) Using DESeq2 and R to conduct clustering and differential gene expression analysis.


![RNA-Seq workflow](images/RNA-Seq_Notebook_Homepage.png)