# Extended RNA-Seq Analysis Training Demo

## Overview

This tutorial workflow uses the full dataset from the Mittenbühler MJ et al. project.

It repeats the short tutorial with the full FASTQ files and adds some extra steps, such as accessing SRA metadata with Athena, which can be faster and more scalable than sra-tools for metadata queries.

Full FASTQ files can be rather large, so downloading, extracting, and analyzing them means this tutorial can take over 13 hours to run in full on an ml.m5.4xlarge instance.

All outputs used in the DEG tutorial were created using this extended full dataset tutorial workflow.

![RNA-Seq workflow](../images/rnaseq-workflow.png)

## STEP 1: Install Miniforge

First install Miniforge.

In [16]:
# Download Miniforge or Mambaforge (you can use either based on preference)
!curl -L -O https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh

# Install Miniforge (or Mambaforge) - no need to install conda since mamba will be available immediately
!bash Miniforge3-$(uname)-$(uname -m).sh -b -u -p $HOME/miniforge
!date +"%T"

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 86.0M  100 86.0M    0     0   145M      0 --:--:-- --:--:-- --:--:--  301M
PREFIX=/home/ec2-user/miniforge

Transaction

  Prefix: /home/ec2-user/miniforge/envs/_virtual_specs_checks

  All requested packages already installed

Dry run. Not executing the transaction.
Unpacking payload ...
Extracting _libgcc_mutex-0.1-conda_forge.tar.bz2
Extracting ca-certificates-2024.8.30-hbcca054_0.conda
Extracting ld_impl_linux-64-2.40-hf3520f5_7.conda
Extracting pybind11-abi-4-hd8ed1ab_3.tar.bz2
Extracting python_abi-3.12-5_cp312.conda
Extracting tzdata-2024a-h8827d51_1.conda
Extracting libgomp-14.1.0-h77fa898_1.conda
Extracting _openmp_mutex-4.5-2_gnu.tar.bz2
Extracting libgcc-14.1.0

Next, using mamba with the conda-forge and bioconda channels, install the tools that will be used in this tutorial.

In [17]:
# Update PATH to point to the Miniforge (or Mambaforge) bin files
import os
os.environ["PATH"] += os.pathsep + os.environ["HOME"]+"/miniforge/bin"

#now we can easily use 'mamba' command to install software 
!mamba install -y -c conda-forge -c bioconda trimmomatic fastqc multiqc sql-magic entrez-direct gffread parallel-fastq-dump sra-tools sql-magic pyathena samtools star rsem entrez-direct subread pigz -y


Looking for: ['trimmomatic', 'fastqc', 'multiqc', 'sql-magic', 'entrez-direct', 'gffread', 'parallel-fastq-dump', 'sra-tools', 'sql-magic', 'pyathena', 'samtools', 'star', 'rsem', 'entrez-direct', 'subread', 'pigz']

conda-forge/linux-64                                        Using cache
conda-forge/noarch                                          Using cache
bioconda/linux-64                                           Using cache
bioconda/noarch                                             Using cache
nvidia/linux-64                                             Using cache
nvidia/noarch                                               Using cache
pytorch/linux-64                                            Using cache
pytorch/noarch                                              Using cache
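Once the install finishes, it can be worth confirming that the tools are actually reachable before starting a run that takes hours. The following is a minimal sketch, not part of the original workflow: `find_missing` is a hypothetical helper, and the executable names in `EXPECTED_COMMANDS` are assumptions, since a conda package name does not always match the command it installs.

```python
import shutil

# Executable names are assumptions: a conda package name does not always match
# the command it installs (for example, sra-tools provides prefetch).
EXPECTED_COMMANDS = [
    "trimmomatic", "fastqc", "multiqc", "prefetch", "fasterq-dump",
    "parallel-fastq-dump", "samtools", "STAR", "rsem-calculate-expression",
    "esearch", "pigz",
]

def find_missing(commands):
    """Return the commands that cannot be found on the current PATH."""
    return [cmd for cmd in commands if shutil.which(cmd) is None]

missing = find_missing(EXPECTED_COMMANDS)
print("Missing tools:", ", ".join(missing) if missing else "none")
```

Because `shutil.which` reads the `PATH` of the running Python process, the check only sees Miniforge's `bin` directory after the `os.environ["PATH"]` update above has been executed.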

## STEP 2: Setup Environment

Create a set of directories in the working directory to store the reads, reference sequence files, and output files. Note that the cell below does not delete an existing `data` directory; if you want to clear files left over from Tutorial_1, remove that directory manually first.

In [18]:
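# "!cd" runs in its own subshell and does not change the notebook's working directory;
# use the %cd magic if you need to switch directories.
!echo $PWD
!mkdir -p data
!mkdir -p data/raw_fastq
!mkdir -p data/trunc_rawfastq
!mkdir -p data/trimmed
# FastQC does not create its output directory, so make both FastQC folders here
!mkdir -p data/fastqc
!mkdir -p data/fastqc_samples/
!mkdir -p data/reference
!mkdir -p data/aligned_bam
!mkdir -p data/rsem_reference/mouse_rsem_reference
!mkdir -p data/rsem_output
!mkdir -p data/reference/STAR_index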

/home/ec2-user/SageMaker


Set the number of threads (`THREADS`) according to your VM size.

In [19]:
import multiprocessing

num_cores = multiprocessing.cpu_count()
THREADS = max(1, num_cores - 1)

print("Number of threads:", THREADS)
os.environ["THREADS"] = str(THREADS)

Number of threads: 15


## STEP 3: Downloading relevant FASTQ files using SRA Tools



Next we will need to download the relevant fastq files.

Because these files can be large, the process of downloading and extracting fastq files can be quite lengthy.

We will download the sample runs for this project with SRA Tools from the NCBI Sequence Read Archive (SRA).

However, first we need to find the associated accession numbers in order to download.

### STEP 3.1: Finding run accession numbers.


The SRA stores sequence data as runs (SRR stands for Sequence Read Run). To download runs, we need the accession ID of each run we wish to download.

The Mittenbühler MJ et al. project contains 8 runs. To make it easier, these are the run IDs associated with this project:

- SRR21972730
- SRR21972729
- SRR21972728
- SRR21972727
- SRR21972725
- SRR21972724
- SRR21972723
- SRR21972726

In this case, all these runs belong to the Bioproject PRJNA892075.

Sequence run experiments can be searched for using the SRA database on the NCBI website; and article-specific sample run information can be found in the supplementary section of that article.

For instance, here the authors posted a link to the GEO Series (GSE) accession for the sequence data, GSE164210. This leads to the appropriate Gene Expression Omnibus page where, among other useful files and information, the relevant SRA database link can be found.

Once the accession numbers are located, one can make a text file containing the list of accession IDs however they like.
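As one way to do this by hand, the sketch below writes the run IDs listed above to a text file, one per line, after checking that each looks like a run accession. `write_accessions` is a hypothetical helper and the `SRR` plus digits pattern is an assumption about the IDs used in this project; the demo writes to a temporary directory so it does not overwrite an existing `accs.txt`.

```python
import re
import tempfile
from pathlib import Path

# Run accessions listed above for BioProject PRJNA892075
RUNS = [
    "SRR21972730", "SRR21972729", "SRR21972728", "SRR21972727",
    "SRR21972725", "SRR21972724", "SRR21972723", "SRR21972726",
]

# Assumed pattern for a run accession: "SRR" followed by digits
SRR_PATTERN = re.compile(r"SRR\d+")

def write_accessions(runs, path):
    """Write one accession per line and return how many were written."""
    bad = [run for run in runs if not SRR_PATTERN.fullmatch(run)]
    if bad:
        raise ValueError(f"Not valid SRR accessions: {bad}")
    Path(path).write_text("\n".join(runs) + "\n")
    return len(runs)

# Demo in a temporary directory; pass "accs.txt" instead to create the real file
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "accs.txt"
    n_written = write_accessions(RUNS, out)
    written = out.read_text().splitlines()

print(n_written, "accessions written")
```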

Once again, to make things easier, the cell below builds this text file (`accs.txt`) by querying the SRA for the BioProject with Entrez Direct (`esearch` and `efetch`):

In [20]:
!esearch -db sra -query "PRJNA892075" | efetch -format runinfo | cut -d',' -f1 | tail -n +2 > accs.txt
!cat accs.txt

SRR21972730
SRR21972729
SRR21972728
SRR21972727
SRR21972725
SRR21972724
SRR21972723
SRR21972726


### STEP 3.2 Finding run accession numbers using Athena (Optional)

Athena is a serverless query engine that allows you to analyze data stored in Amazon S3 using standard SQL. It offers several advantages over traditional SRA tools, making it a more efficient and scalable solution for large-scale RNA-seq data analysis.

Using Athena to access metadata is optional, but allows you to query large SRA metadata directly from AWS without needing to download and process files locally, making it faster and more scalable.

In [None]:
from pyathena import connect
import pandas as pd

# Use the correct argument name: s3_staging_dir
conn = connect(s3_staging_dir='s3://sra-data-athena/', region_name='us-east-1')

In [None]:
import boto3

# Initialize the Glue client
glue_client = boto3.client('glue', region_name='us-east-1')

# Run the crawler
crawler_name = 'sra_crawler'  # Use your crawler's name
glue_client.start_crawler(Name=crawler_name)

print(f"Crawler {crawler_name} started.")

In [None]:
query = """
SELECT *
FROM AwsDataCatalog.srametadata.metadata
WHERE bioproject = 'PRJNA892075'
"""
df = pd.read_sql(query, conn)
df
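If you query several projects, building the SQL string in a small function keeps the BioProject ID in one place and rejects malformed input before it reaches Athena. This is a sketch under stated assumptions: `bioproject_query` is a hypothetical helper, the accession pattern (`PRJ`, two letters, digits) is an assumption, and the table and column names are the ones used elsewhere in this tutorial.

```python
import re

# Table name taken from the queries in this tutorial
TABLE = "AwsDataCatalog.srametadata.metadata"

def bioproject_query(bioproject, columns=("acc", "sample_name")):
    """Build a SELECT statement for one BioProject, validating the inputs first."""
    # Assumed accession format, e.g. PRJNA892075
    if not re.fullmatch(r"PRJ[A-Z]{2}\d+", bioproject):
        raise ValueError(f"Unexpected BioProject accession: {bioproject!r}")
    for col in columns:
        if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", col):
            raise ValueError(f"Unexpected column name: {col!r}")
    return (
        f"SELECT {', '.join(columns)}\n"
        f"FROM {TABLE}\n"
        f"WHERE bioproject = '{bioproject}'"
    )

example_query = bioproject_query("PRJNA892075")
print(example_query)
```

The resulting string can be passed to `pd.read_sql(example_query, conn)` in the same way as the query above. Selecting only the columns you need keeps the returned table small.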


In [None]:
#write the SRR column to a text file, one accession per line
#(joining the values avoids the padding that to_string() can add)
with open('accs.txt', 'w') as f:
    f.write('\n'.join(df['acc'].astype(str)) + '\n')
    
#print the text file
!cat accs.txt

### STEP 3.3 Using the SRA-toolkit for a single sample.

The code below demonstrates how to download and preprocess a single SRA run using the prefetch and fasterq-dump commands. First, prefetch downloads the specified SRA file; then fasterq-dump converts it into paired-end FASTQ files. Finally, the FASTQ files are compressed with pigz to save space.

In [None]:
# Example usage for SRA download:
!prefetch SRR21972723 -O data/raw_fastq -f yes

In [None]:
#convert sra to fastq
!fasterq-dump data/raw_fastq/SRR21972723 -f -O data/raw_fastq/ -e $THREADS
#compress fastq to fastq.gz to save space
!pigz -p $THREADS data/raw_fastq/SRR21972723_1.fastq
!pigz -p $THREADS data/raw_fastq/SRR21972723_2.fastq
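The same three steps can also be driven from Python, which makes it easy to stop at the first failing command. The sketch below only builds and prints the commands; `single_run_commands` is a hypothetical helper, and the flags are copied from the cells above rather than taken from a wider reading of the SRA Toolkit options.

```python
import shlex

def single_run_commands(acc, outdir="data/raw_fastq", threads=4):
    """Return the prefetch, fasterq-dump and pigz commands for one run."""
    return [
        ["prefetch", acc, "-O", outdir, "-f", "yes"],
        ["fasterq-dump", f"{outdir}/{acc}", "-f", "-O", outdir, "-e", str(threads)],
        ["pigz", "-p", str(threads), f"{outdir}/{acc}_1.fastq"],
        ["pigz", "-p", str(threads), f"{outdir}/{acc}_2.fastq"],
    ]

# Dry run: print the commands instead of executing them
for cmd in single_run_commands("SRR21972723", threads=15):
    print(shlex.join(cmd))
```

To execute the steps, replace the `print` call with `subprocess.run(cmd, check=True)` (after `import subprocess`); `check=True` raises an error as soon as one command fails, so later steps do not run on incomplete files.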

### STEP 3.4 Downloading multiple files using the SRA-toolkit.

The code uses prefetch to download multiple SRA files in parallel. It reads the list of SRR IDs from accs.txt and uses xargs to run prefetch for each ID. The -O option sets the output directory, and -f yes forces the download even if a copy of the file already exists. To speed up the download, the -P $THREADS option tells xargs to run up to that many prefetch processes at once.

In [22]:
!cat accs.txt | xargs -P $THREADS -I {} prefetch {} -O data/raw_fastq -f yes

2024-10-06T15:51:31 prefetch.3.1.1: 1) Resolving 'SRR21972724'...
2024-10-06T15:51:31 prefetch.3.1.1: 1) Resolving 'SRR21972727'...
2024-10-06T15:51:31 prefetch.3.1.1: 1) Resolving 'SRR21972728'...
2024-10-06T15:51:31 prefetch.3.1.1: 1) Resolving 'SRR21972726'...
2024-10-06T15:51:31 prefetch.3.1.1: 1) Resolving 'SRR21972729'...
2024-10-06T15:51:31 prefetch.3.1.1: 1) Resolving 'SRR21972725'...
2024-10-06T15:51:31 prefetch.3.1.1: 1) Resolving 'SRR21972730'...
2024-10-06T15:51:31 prefetch.3.1.1: 1) Resolving 'SRR21972723'...
2024-10-06T15:51:31 prefetch.3.1.1: Current preference is set to retrieve SRA Normalized Format files with full base quality scores
2024-10-06T15:51:31 prefetch.3.1.1: Current preference is set to retrieve SRA Normalized Format files with full base quality scores
2024-10-06T15:51:31 prefetch.3.1.1: Current preference is set to retrieve SRA Normalized Format files with full base quality scores
2024-10-06T15:51:31 prefetch.3.1.1: Current preference is set to retrieve SR

### STEP 3.5 Converting Multiple SRA files to Fastq


In this step, the SRA files will be processed in parallel using parallel-fastq-dump. Each SRR ID from accs.txt will be read, and xargs will be used to execute parallel-fastq-dump for each SRA ID. This will result in the creation of two paired-end FASTQ files for each SRR ID, which will be compressed into a .gz file to save space.

In [32]:
#!for x in `cat accs.txt`; do fasterq-dump -f -O data/raw_fastq -e $THREADS -m 4G data/raw_fastq/$x/$x.sra; done

##example of how to alternatively do the above process with parallel-fastq-dump using piping
!cat accs.txt | xargs -I {} parallel-fastq-dump -O data/raw_fastq/ --tmpdir . --threads $THREADS --gzip --split-files --sra-id {}

2024-10-06 15:59:07,193 - SRR ids: ['SRR21972730']
2024-10-06 15:59:07,193 - extra args: ['--gzip', '--split-files']
2024-10-06 15:59:07,194 - tempdir: ./pfd_1_oz_37l
2024-10-06 15:59:07,194 - CMD: sra-stat --meta --quick SRR21972730
2024-10-06 15:59:08,063 - SRR21972730 spots: 52663760
2024-10-06 15:59:08,063 - blocks: [[1, 3510917], [3510918, 7021834], [7021835, 10532751], [10532752, 14043668], [14043669, 17554585], [17554586, 21065502], [21065503, 24576419], [24576420, 28087336], [28087337, 31598253], [31598254, 35109170], [35109171, 38620087], [38620088, 42131004], [42131005, 45641921], [45641922, 49152838], [49152839, 52663760]]
2024-10-06 15:59:08,063 - CMD: fastq-dump -N 1 -X 3510917 -O ./pfd_1_oz_37l/0 --gzip --split-files SRR21972730
2024-10-06 15:59:08,063 - CMD: fastq-dump -N 3510918 -X 7021834 -O ./pfd_1_oz_37l/1 --gzip --split-files SRR21972730
2024-10-06 15:59:08,064 - CMD: fastq-dump -N 7021835 -X 10532751 -O ./pfd_1_oz_37l/2 --gzip --split-files SRR21972730
2024-10-06 1
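Before deleting anything, it can help to confirm that every accession produced both read files. This is a minimal sketch, assuming the `{acc}_1.fastq.gz` and `{acc}_2.fastq.gz` naming used in this tutorial; `missing_fastq` is a hypothetical helper, and the demo uses a temporary directory with a placeholder file rather than real data.

```python
import tempfile
from pathlib import Path

def missing_fastq(accessions, fastq_dir, suffixes=("_1.fastq.gz", "_2.fastq.gz")):
    """Return the expected FASTQ file names that are absent or empty."""
    fastq_dir = Path(fastq_dir)
    missing = []
    for acc in accessions:
        for suffix in suffixes:
            path = fastq_dir / f"{acc}{suffix}"
            if not path.is_file() or path.stat().st_size == 0:
                missing.append(path.name)
    return missing

# Demo: only the forward-read file exists, so the reverse-read file is reported
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "SRR21972730_1.fastq.gz").write_bytes(b"placeholder")
    demo_missing = missing_fastq(["SRR21972730"], tmp)

print(demo_missing)
```

For the real data, read the accessions from `accs.txt` and pass `"data/raw_fastq"` as the directory; an empty list means every run has both files.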

As before, it is good practice to store .fastq files as .fastq.gz files to save space. The `--gzip` option used above already writes compressed output, so no separate compression step is needed here.

The now-redundant SRA files can also be deleted to save more space.

In [None]:
#find and delete all SRR subfolders in the raw_fastq directory
!find data/raw_fastq -type d -name 'SRR*' -exec rm -rf {} \;

### STEP 3.6 Download reference transcriptome files that will be used by STAR


This step downloads and prepares the reference data needed for your RNA-seq analysis. It retrieves three essential files:

- Mouse genome (Mus_musculus.GRCm39.dna.primary_assembly.fa.gz): This compressed FASTA file contains the complete mouse genome sequence, which will be used as the reference for aligning your RNA-seq reads.
- Mouse gene annotations (Mus_musculus.GRCm39.104.gtf.gz): This compressed GTF file describes the genes and transcripts in the mouse genome, including their locations and structures. It is crucial for interpreting the aligned RNA-seq reads and determining which genes are expressed in each sample.
- Mouse feature table (GCF_000001635.27_GRCm39_feature_table.txt.gz): This compressed table provides additional annotations for the mouse genome features, potentially including information about gene functions and pathways. It will be used later in the differential gene expression (DEG) analysis.

In [33]:
! wget ftp://ftp.ensembl.org/pub/release-104/fasta/mus_musculus/dna/Mus_musculus.GRCm39.dna.primary_assembly.fa.gz -O data/reference/mouse_genome.fa.gz
! wget ftp://ftp.ensembl.org/pub/release-104/gtf/mus_musculus/Mus_musculus.GRCm39.104.gtf.gz -O data/reference/mouse_annotation.gtf.gz
! wget -O data/reference/mouse_feature_table.txt.gz https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/635/GCF_000001635.27_GRCm39/GCF_000001635.27_GRCm39_feature_table.txt.gz

--2024-10-06 17:10:24--  ftp://ftp.ensembl.org/pub/release-104/fasta/mus_musculus/dna/Mus_musculus.GRCm39.dna.primary_assembly.fa.gz
           => ‘data/reference/mouse_genome.fa.gz’
Resolving ftp.ensembl.org (ftp.ensembl.org)... 193.62.193.169
Connecting to ftp.ensembl.org (ftp.ensembl.org)|193.62.193.169|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /pub/release-104/fasta/mus_musculus/dna ... done.
==> SIZE Mus_musculus.GRCm39.dna.primary_assembly.fa.gz ... 806418890
==> PASV ... done.    ==> RETR Mus_musculus.GRCm39.dna.primary_assembly.fa.gz ... done.
Length: 806418890 (769M) (unauthoritative)


2024-10-06 17:10:50 (32.1 MB/s) - ‘data/reference/mouse_genome.fa.gz’ saved [806418890]

--2024-10-06 17:10:50--  ftp://ftp.ensembl.org/pub/release-104/gtf/mus_musculus/Mus_musculus.GRCm39.104.gtf.gz
           => ‘data/reference/mouse_annotation.gtf.gz’
Resolving ftp.ensembl.org (ftp.ensembl.org)... 193.62

In [34]:
!gunzip -f data/reference/mouse_genome.fa.gz 
!gunzip -f data/reference/mouse_annotation.gtf.gz
!gunzip -f data/reference/mouse_feature_table.txt.gz
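A quick way to sanity-check the unzipped annotation is to count how many records of each feature type it contains. The sketch below relies only on the GTF layout (nine tab-separated columns, with the feature type in the third); `count_gtf_features` is a hypothetical helper and the example records are made up for illustration.

```python
import io
from collections import Counter

def count_gtf_features(handle):
    """Count GTF records by feature type (third tab-separated column)."""
    counts = Counter()
    for line in handle:
        # Skip header/comment lines and blank lines
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 9:
            counts[fields[2]] += 1
    return counts

# Made-up records, for illustration only
example = io.StringIO(
    "#!genome-build GRCm39\n"
    '1\texample\tgene\t100\t900\t.\t+\t.\tgene_id "GENE_A";\n'
    '1\texample\ttranscript\t100\t900\t.\t+\t.\tgene_id "GENE_A"; transcript_id "TX_A1";\n'
    '1\texample\texon\t100\t300\t.\t+\t.\tgene_id "GENE_A"; transcript_id "TX_A1";\n'
    '1\texample\texon\t600\t900\t.\t+\t.\tgene_id "GENE_A"; transcript_id "TX_A1";\n'
)
feature_counts = count_gtf_features(example)
print(dict(feature_counts))
```

To run it on the downloaded file, open `data/reference/mouse_annotation.gtf` and pass the file handle to `count_gtf_features`.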

### STEP 3.7: Copy data file for Trimmomatic

One of Trimmomatic's functions is to trim sequencing-machine-specific adapter sequences. These are usually stored in a folder called adapters within the Trimmomatic installation directory.

Directories of packages within conda installations can be confusing, so when using Trimmomatic through conda it may be easier to simply download or create a file with the relevant adapter sequences and store it in an easy-to-find directory.

In [35]:
!wget -P data/trimmed/ https://sra-data-athena.s3.amazonaws.com/reference/TruSeq3-PE.fa

download: s3://sra-data-athena/reference/TruSeq3-PE.fa to data/trimmed/TruSeq3-PE.fa
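If the download is not available, the adapter file can be created by hand. The sketch below writes a FASTA file with the two prefix sequences that Trimmomatic reports in its log in STEP 6. The record names are an assumption based on Trimmomatic's convention for palindrome-mode adapter pairs (names that start with `Prefix` and end in `/1` and `/2`); `write_adapter_fasta` is a hypothetical helper, and the demo writes to a temporary directory.

```python
import tempfile
from pathlib import Path

# Sequences as reported in the Trimmomatic log ("Using PrefixPair: ...").
# Record names are assumed to follow the Prefix.../1 and .../2 convention.
ADAPTERS = {
    "PrefixPE/1": "TACACTCTTTCCCTACACGACGCTCTTCCGATCT",
    "PrefixPE/2": "GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT",
}

def write_adapter_fasta(adapters, path):
    """Write the adapters as FASTA records, one name line and one sequence line each."""
    lines = []
    for name, sequence in adapters.items():
        lines.append(f">{name}")
        lines.append(sequence)
    Path(path).write_text("\n".join(lines) + "\n")

# Demo in a temporary directory; use "data/trimmed/TruSeq3-PE.fa" for the real file
with tempfile.TemporaryDirectory() as tmp:
    fasta_path = Path(tmp) / "TruSeq3-PE.fa"
    write_adapter_fasta(ADAPTERS, fasta_path)
    fasta_text = fasta_path.read_text()

print(fasta_text)
```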


## STEP 4: Run FastQC

FastQC is an invaluable tool that allows you to evaluate whether there are problems with a set of reads. For example, it will provide a report of whether there is any bias in the sequence composition of the reads.

The code below may take a while to run. To speed it up, `xargs -P` runs several FastQC processes in parallel.

In [None]:
# Run fastqc for forward reads in parallel
!cat accs.txt | xargs -P $THREADS -I {} fastqc "data/raw_fastq/{}_1.fastq.gz" -o data/fastqc/

# Run fastqc for reverse reads in parallel
!cat accs.txt | xargs -P $THREADS -I {} fastqc "data/raw_fastq/{}_2.fastq.gz" -o data/fastqc/

Fastqc will output the results in HTML format, as below, for all forward and reverse reads.

In [None]:
from IPython.display import IFrame
IFrame(src='./data/fastqc/SRR21972724_1_fastqc.html', width=800, height=600)

Although it's best practice to look over the reports individually, tools like MultiQC let you quickly view a summary of the quality reports for all FASTQ files.

For instance, the table below shows whether each FastQC module passed, raised a warning, or failed for each report. MultiQC creates other summaries as well.

In [None]:
!multiqc -f data/fastqc/

import pandas as pd
dframe = pd.read_csv("./multiqc_data/multiqc_fastqc.txt", sep='\t')
display(dframe)
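With many samples the full table is hard to scan, so it can help to list only the modules that did not pass. This is a minimal sketch using the standard `csv` module; it assumes the MultiQC table has a `Sample` column and one column per FastQC module holding `pass`, `warn`, or `fail`. `flagged_modules` is a hypothetical helper and the example table is made up.

```python
import csv
import io

def flagged_modules(handle, statuses=("warn", "fail")):
    """Map each sample to the columns whose value is one of the given statuses."""
    reader = csv.DictReader(handle, delimiter="\t")
    flagged = {}
    for row in reader:
        hits = {col: val for col, val in row.items() if val in statuses}
        if hits:
            flagged[row["Sample"]] = hits
    return flagged

# Made-up table, for illustration only
example = io.StringIO(
    "Sample\tper_base_sequence_quality\tadapter_content\n"
    "sample_A_1\tpass\tpass\n"
    "sample_A_2\tpass\twarn\n"
    "sample_B_1\tfail\twarn\n"
)
flagged = flagged_modules(example)
for sample, modules in flagged.items():
    print(sample, modules)
```

For the real report, open `./multiqc_data/multiqc_fastqc.txt` and pass the file handle; pass `statuses=("fail",)` to list failures only.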

## STEP 5: Merging our fastq files (optional; only needed when there are multiple SRRs per GSM)

If a project has multiple SRA runs (SRRs) per GSM sample, we can use Athena to look up the SRA metadata and merge the FASTQ files that belong to each sample. Merging simplifies the subsequent analysis steps and reduces the number of files to process, which can improve efficiency and reduce computational overhead.
This step was not needed for this study.

In [None]:
from pyathena import connect
import pandas as pd

# Use the correct argument name: s3_staging_dir
conn = connect(s3_staging_dir='s3://sra-data-athena/', region_name='us-east-1')

query = """
SELECT *
FROM AwsDataCatalog.srametadata.metadata
WHERE bioproject = 'PRJNAXXXXXXX' -- Change to the BioProject number
AND organism = 'Mus musculus'
"""
df = pd.read_sql(
    query, conn
)
df

In [None]:
#import os so we can easily pass strings to shell commands using 'subprocess'
import os
import subprocess

#now get the accession IDs (SRR) and sample IDs (GSM) from the created dataframe
runs = df['acc'].values
samples = list(set(df['sample_name'].values))

#sort them to be in numerical order
runs.sort()
samples.sort()
samples

In [None]:
#now iterate through the samples,
#because there are two SRRs per sample (GSM),
#the SRR indices corresponding to GSM index i will be
#i*2 and i*2+1
for index, item in enumerate(samples):
    
    #concatenate the two SRRs
    os.system(f"cat data/raw_fastq/{runs[index*2]}_1.fastq data/raw_fastq/{runs[index*2+1]}_1.fastq > data/raw_fastq/{samples[index]}_1.fastq")
    #delete the previous fastq files to save space
    os.system(f"rm data/raw_fastq/{runs[index*2]}_1.fastq")
    os.system(f"rm data/raw_fastq/{runs[index*2+1]}_1.fastq")
    #zip the merged fastq file to save more space
    os.system(f"gzip data/raw_fastq/{samples[index]}_1.fastq")
    
    #repeat for reverse reads
    os.system(f"cat data/raw_fastq/{runs[index*2]}_2.fastq data/raw_fastq/{runs[index*2+1]}_2.fastq > data/raw_fastq/{samples[index]}_2.fastq")
    
    os.system(f"rm data/raw_fastq/{runs[index*2]}_2.fastq")
    os.system(f"rm data/raw_fastq/{runs[index*2+1]}_2.fastq")  
   
    #it's good practice to zip files to save space
    os.system(f"gzip data/raw_fastq/{samples[index]}_2.fastq")

In [None]:
#since our files will now be named by sample, not by SRR, we can write a new text file to use for downstream batch processes.
#we can use the DF we made in the previous cell.
with open('samples.txt', 'w') as f:
    df = df.sort_values(by='sample_name', ascending=True)
    samples = df['sample_name'].unique()
    samples = '\n'.join(map(str, samples))
    f.write(samples)
    
!cat samples.txt
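The index arithmetic above assumes exactly two runs per sample and that sorting runs and samples keeps them aligned. A more defensive alternative is to group the runs by sample explicitly. The sketch below is illustrative only: `group_runs_by_sample` and `merge_command` are hypothetical helpers, the records are made up, and the `acc` and `sample_name` keys mirror the metadata columns used in this step.

```python
from collections import defaultdict

def group_runs_by_sample(records):
    """Map each sample name to the sorted list of run accessions that belong to it."""
    groups = defaultdict(list)
    for record in records:
        groups[record["sample_name"]].append(record["acc"])
    return {sample: sorted(runs) for sample, runs in sorted(groups.items())}

def merge_command(sample, runs, read, fastq_dir="data/raw_fastq"):
    """Build the shell command that concatenates one read direction for a sample."""
    inputs = " ".join(f"{fastq_dir}/{run}_{read}.fastq" for run in runs)
    return f"cat {inputs} > {fastq_dir}/{sample}_{read}.fastq"

# Made-up metadata records, for illustration only
records = [
    {"acc": "SRR0000004", "sample_name": "GSM_B"},
    {"acc": "SRR0000001", "sample_name": "GSM_A"},
    {"acc": "SRR0000003", "sample_name": "GSM_B"},
    {"acc": "SRR0000002", "sample_name": "GSM_A"},
]
groups = group_runs_by_sample(records)
for sample, runs in groups.items():
    print(merge_command(sample, runs, read=1))
```

With the dataframe from the query, `df[['acc', 'sample_name']].to_dict('records')` produces records in this shape, and the grouping works for any number of runs per sample.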

## STEP 6: Run Trimmomatic

Trimmomatic will trim off any adapter sequences or low quality sequence it detects in the FASTQ files.

Using piping and our original list, we can queue up a batch run of Trimmomatic for all our files. Note that this is a different way to run a loop from the one we used before.

The below code may take approximately 30 minutes to run.

In [36]:
!cat accs.txt | xargs -I {} \
trimmomatic PE -threads $THREADS \
'data/raw_fastq/{}_1.fastq.gz' 'data/raw_fastq/{}_2.fastq.gz' \
'data/trimmed/{}_1_trimmed.fastq' 'data/trimmed/{}_1_trimmed_unpaired.fastq' \
'data/trimmed/{}_2_trimmed.fastq' 'data/trimmed/{}_2_trimmed_unpaired.fastq' \
ILLUMINACLIP:data/trimmed/TruSeq3-PE.fa:2:30:10:2:keepBothReads LEADING:3 TRAILING:3 MINLEN:36

TrimmomaticPE: Started with arguments:
 -threads 15 data/raw_fastq/SRR21972730_1.fastq.gz data/raw_fastq/SRR21972730_2.fastq.gz data/trimmed/SRR21972730_1_trimmed.fastq data/trimmed/SRR21972730_1_trimmed_unpaired.fastq data/trimmed/SRR21972730_2_trimmed.fastq data/trimmed/SRR21972730_2_trimmed_unpaired.fastq ILLUMINACLIP:data/trimmed/TruSeq3-PE.fa:2:30:10:2:keepBothReads LEADING:3 TRAILING:3 MINLEN:36
Using PrefixPair: 'TACACTCTTTCCCTACACGACGCTCTTCCGATCT' and 'GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT'
ILLUMINACLIP: Using 1 prefix pairs, 0 forward/reverse sequences, 0 forward only sequences, 0 reverse only sequences
Quality encoding detected as phred33
Input Read Pairs: 52663760 Both Surviving: 36050371 (68.45%) Forward Only Surviving: 16604182 (31.53%) Reverse Only Surviving: 0 (0.00%) Dropped: 9207 (0.02%)
TrimmomaticPE: Completed successfully
TrimmomaticPE: Started with arguments:
 -threads 15 data/raw_fastq/SRR21972729_1.fastq.gz data/raw_fastq/SRR21972729_2.fastq.gz data/trimmed/SRR21972
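The summary line that Trimmomatic prints for each sample is worth checking, because it shows how many read pairs remain for alignment. A minimal sketch for parsing it follows; `parse_summary` is a hypothetical helper, and the regular expression assumes the exact wording of the log line shown above.

```python
import re

# Pattern written against the summary line shown in the log above
SUMMARY = re.compile(
    r"Input Read Pairs: (?P<input>\d+) "
    r"Both Surviving: (?P<both>\d+) \([\d.]+%\) "
    r"Forward Only Surviving: (?P<forward_only>\d+) \([\d.]+%\) "
    r"Reverse Only Surviving: (?P<reverse_only>\d+) \([\d.]+%\) "
    r"Dropped: (?P<dropped>\d+) \([\d.]+%\)"
)

def parse_summary(line):
    """Return the read-pair counts from a Trimmomatic summary line, or None."""
    match = SUMMARY.search(line)
    if match is None:
        return None
    return {key: int(value) for key, value in match.groupdict().items()}

# Summary line for SRR21972730, copied from the log above
line = (
    "Input Read Pairs: 52663760 Both Surviving: 36050371 (68.45%) "
    "Forward Only Surviving: 16604182 (31.53%) "
    "Reverse Only Surviving: 0 (0.00%) Dropped: 9207 (0.02%)"
)
stats = parse_summary(line)
pct_both = 100 * stats["both"] / stats["input"]
print(f"{pct_both:.2f}% of read pairs survived as pairs")
```

Only the `both` reads end up in the `*_trimmed.fastq` files used downstream; the forward-only and reverse-only reads go to the `*_unpaired.fastq` files.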

## STEP 7: Run FastQC
It's best practice to run FastQC after trimming. However, you may decide to run FastQC only once, before or after trimming.

We run FastQC on both the forward and reverse paired reads that survived trimming; these are the files used for alignment and quantification below. The unpaired ('orphaned') reads that Trimmomatic writes to the `*_unpaired.fastq` files are not used downstream. In the Trimmomatic log above, for example, 68.45% of the read pairs in SRR21972730 survived as pairs.

The below code may take around 15-20 minutes to run.

In [37]:
# Run FastQC
!cat accs.txt | xargs -P $THREADS -I {} fastqc data/trimmed/{}_1_trimmed.fastq data/trimmed/{}_2_trimmed.fastq -o data/fastqc_samples/

null
null
null
null
null
null
null
null
null
null
null
null
Started analysis of SRR21972723_1_trimmed.fastq
Started analysis of SRR21972727_1_trimmed.fastq
Started analysis of SRR21972729_1_trimmed.fastq
null
null
Started analysis of SRR21972726_1_trimmed.fastq
Started analysis of SRR21972725_1_trimmed.fastq
null
Started analysis of SRR21972730_1_trimmed.fastq
Started analysis of SRR21972728_1_trimmed.fastq
null
Started analysis of SRR21972724_1_trimmed.fastq
Approx 5% complete for SRR21972729_1_trimmed.fastq
Approx 5% complete for SRR21972730_1_trimmed.fastq
Approx 5% complete for SRR21972724_1_trimmed.fastq
Approx 5% complete for SRR21972725_1_trimmed.fastq
Approx 5% complete for SRR21972726_1_trimmed.fastq
Approx 5% complete for SRR21972723_1_trimmed.fastq
Approx 5% complete for SRR21972728_1_trimmed.fastq
Approx 5% complete for SRR21972727_1_trimmed.fastq
Approx 10% complete for SRR21972729_1_trimmed.fastq
Approx 10% complete for SRR21972730_1_trimmed.fastq
Approx 10% complete for 

## STEP 8: Run MultiQC
MultiQC reads in the FastQC reports and generates a compiled report for all the analyzed FASTQ files.

In [38]:
#!multiqc -f data/fastqc_samples/
!multiqc -f -o data/multiqc_samples/ data/fastqc_samples/


/// MultiQC 🔍 v1.25.1

       file_search | Search path: /home/ec2-user/SageMaker/data/fastqc_samples
         searching | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 32/32
            fastqc | Found 16 reports
     write_results | Data        : data/multiqc_samples/multiqc_data
     write_results | Report      : data/multiqc_samples/multiqc_report.html
           multiqc | MultiQC complete


## STEP 9: Preparing the STAR-Compatible RSEM Reference

This command prepares the reference files for RNA-Seq analysis with RSEM (RNA-Seq by Expectation-Maximization) and STAR (Spliced Transcripts Alignment to a Reference). It generates the files needed to quantify gene and isoform expression.

The rsem-prepare-reference command takes a GTF file with gene annotations (mouse_annotation.gtf) and a FASTA file with the reference genome sequence (mouse_genome.fa). It extracts the transcript sequences and writes the reference files with the prefix mouse_reference in the current working directory. The --star option also builds a STAR genome index so that STAR can be used for alignment. The -p $THREADS option sets the number of threads used for parallel processing, which speeds up the preparation.

In [39]:
# Preparing the reference genome
!rsem-prepare-reference --gtf data/reference/mouse_annotation.gtf --star -p $THREADS data/reference/mouse_genome.fa mouse_reference

rsem-extract-reference-transcripts mouse_reference 0 data/reference/mouse_annotation.gtf None 0 data/reference/mouse_genome.fa
Parsed 200000 lines
Parsed 400000 lines
Parsed 600000 lines
Parsed 800000 lines
Parsed 1000000 lines
Parsed 1200000 lines
Parsed 1400000 lines
Parsed 1600000 lines
Parsed 1800000 lines
Parsing gtf File is done!
data/reference/mouse_genome.fa is processed!
142434 transcripts are extracted and 0 transcripts are omitted.
Extracting sequences is done!
Group File is generated!
Transcript Information File is generated!
Chromosome List File is generated!
Extracted Sequences File is generated!

rsem-preref mouse_reference.transcripts.fa 1 mouse_reference
Refs.makeRefs finished!
Refs.saveRefs finished!
mouse_reference.idx.fa is generated!
mouse_reference.n2g.idx.fa is generated!

STAR  --runThreadN 15  --runMode genomeGenerate  --genomeDir .  --genomeFastaFiles data/reference/mouse_genome.fa  --sjdbGTFfile data/reference/mouse_annotation.gtf  --sjdbOverhang 100  --outFi

## STEP 10: Run STAR for Alignment, Prepare and Run RSEM for Quantification

This cell runs RNA-Seq gene expression quantification with RSEM and STAR. It reads SRR accession IDs from accs.txt and runs rsem-calculate-expression for each ID, using the paired-end trimmed FASTQ files from data/trimmed/ and the STAR-indexed RSEM reference (mouse_reference). Results are saved in data/rsem_output.

In [None]:
import os

# Ensure you've set the path to the RSEM binary
# Read the SRR accessions from the file
with open('accs.txt', 'r') as f:
    srr_accessions = [line.strip() for line in f.readlines()]

# Define the output directory
output_dir = "data/rsem_output"

# Loop through each SRR accession and run rsem-calculate-expression
for srr in srr_accessions:
    !rsem-calculate-expression -p $THREADS --paired-end --star \
    data/trimmed/{srr}_1_trimmed.fastq data/trimmed/{srr}_2_trimmed.fastq mouse_reference data/rsem_output/{srr}

STAR --genomeDir .  --outSAMunmapped Within  --outFilterType BySJout  --outSAMattributes NH HI AS NM MD  --outFilterMultimapNmax 20  --outFilterMismatchNmax 999  --outFilterMismatchNoverLmax 0.04  --alignIntronMin 20  --alignIntronMax 1000000  --alignMatesGapMax 1000000  --alignSJoverhangMin 8  --alignSJDBoverhangMin 1  --sjdbScore 1  --runThreadN 15  --genomeLoad NoSharedMemory  --outSAMtype BAM Unsorted  --quantMode TranscriptomeSAM  --outSAMheaderHD \@HD VN:1.4 SO:unsorted  --outFileNamePrefix data/rsem_output/SRR21972730.temp/SRR21972730  --readFilesIn data/trimmed/SRR21972730_1_trimmed.fastq data/trimmed/SRR21972730_2_trimmed.fastq
	/home/ec2-user/anaconda3/envs/tensorflow2_p310/bin/STAR-avx2 --genomeDir . --outSAMunmapped Within --outFilterType BySJout --outSAMattributes NH HI AS NM MD --outFilterMultimapNmax 20 --outFilterMismatchNmax 999 --outFilterMismatchNoverLmax 0.04 --alignIntronMin 20 --alignIntronMax 1000000 --alignMatesGapMax 1000000 --alignSJoverhangMin 8 --alignSJDBov

## STEP 11: Report the top 10 most highly expressed genes in the samples

Report the top 10 most highly expressed genes (by TPM) in each sample.


In [45]:
import pandas as pd

# Path to RSEM results directory
rsem_results_dir = 'data/rsem_output'

# Loop through each file in accs.txt
for srr_id in open('accs.txt'):
    srr_id = srr_id.strip()  # Remove newline character
    rsem_result_file = f'{rsem_results_dir}/{srr_id}.genes.results'

    # Load the RSEM results into a Pandas DataFrame
    df = pd.read_csv(rsem_result_file, sep='\t')

    # Sort the DataFrame by TPM values in descending order and get the top 10 genes
    top_10_genes = df.sort_values(by='TPM', ascending=False).head(10)

    # Print the top 10 genes with their TPM values
    print(f"Top 10 Genes by TPM for {srr_id}:")
    print(top_10_genes[['gene_id', 'TPM']])

Top 10 Genes by TPM for SRR21972730:
                  gene_id       TPM
17914  ENSMUSG00000064351  54274.56
17919  ENSMUSG00000064356  44835.11
17475  ENSMUSG00000062515  31017.87
37133  ENSMUSG00000102070  23336.21
10569  ENSMUSG00000037071  15546.30
17930  ENSMUSG00000064368  15275.92
17904  ENSMUSG00000064341  13164.45
35987  ENSMUSG00000100862  12272.57
17932  ENSMUSG00000064370  12100.55
36359  ENSMUSG00000101249  11552.95
Top 10 Genes by TPM for SRR21972729:
                  gene_id       TPM
17914  ENSMUSG00000064351  55799.15
17919  ENSMUSG00000064356  51797.76
17475  ENSMUSG00000062515  31240.68
37133  ENSMUSG00000102070  24800.54
17930  ENSMUSG00000064368  15971.53
10569  ENSMUSG00000037071  15678.96
17904  ENSMUSG00000064341  14656.16
35987  ENSMUSG00000100862  13933.80
17902  ENSMUSG00000064339  12825.33
17932  ENSMUSG00000064370  12382.64
Top 10 Genes by TPM for SRR21972728:
                  gene_id       TPM
17914  ENSMUSG00000064351  50658.47
17919  ENSMUSG00000064356

## STEP 12: Report the expression of ENSMUSG00000064351 for each file

The code below uses pandas to report the expression of this gene in each sample. The columns in the RSEM `genes.results` file are as follows. The level of expression is reported in the Transcripts Per Million (`TPM`) column, and the estimated number of reads in the `expected_count` column:  
`gene_id    transcript_id(s)    length    effective_length    expected_count    TPM    FPKM`
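To make the TPM column concrete, the sketch below applies the general definition of TPM: divide each count by its effective length, then scale the rates so they sum to one million. It illustrates the unit only; RSEM derives its gene-level values from transcript-level estimates, so this is not expected to reproduce the numbers in the file exactly. The `tpm` helper and the input values are hypothetical.

```python
def tpm(counts, effective_lengths):
    """Transcripts Per Million from counts and effective lengths (general definition)."""
    # Reads per base of effective length; zero-length features get a rate of 0
    rates = [
        count / length if length > 0 else 0.0
        for count, length in zip(counts, effective_lengths)
    ]
    total = sum(rates)
    if total == 0:
        return [0.0 for _ in rates]
    # Scale so that the values sum to one million
    return [rate / total * 1e6 for rate in rates]

# Made-up counts and effective lengths for three features
example_counts = [1000, 2000, 1000]
example_lengths = [1000, 2000, 500]
tpm_values = tpm(example_counts, example_lengths)
print(tpm_values)
```

In this example the second feature has twice as many reads as the first but is also twice as long, so both get the same TPM, while the short third feature gets the highest value.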

In [46]:
import pandas as pd

# Path to RSEM results directory
rsem_results_dir = 'data/rsem_output'

# Target gene ID
target_gene = 'ENSMUSG00000064351'

# Loop through each file in accs.txt
for srr_id in open('accs.txt'):
    srr_id = srr_id.strip()  # Remove newline character
    rsem_result_file = f'{rsem_results_dir}/{srr_id}.genes.results'

    # Load the RSEM results into a Pandas DataFrame
    df = pd.read_csv(rsem_result_file, sep='\t')

    # Filter for the target gene
    target_gene_data = df[df['gene_id'] == target_gene]

    # Print the target gene's TPM value for the SRR ID
    print(f"TPM for {target_gene} in {srr_id}: {target_gene_data['TPM'].values[0]}")

TPM for ENSMUSG00000064351 in SRR21972730: 54274.56
TPM for ENSMUSG00000064351 in SRR21972729: 55799.15
TPM for ENSMUSG00000064351 in SRR21972728: 50658.47
TPM for ENSMUSG00000064351 in SRR21972727: 44916.76
TPM for ENSMUSG00000064351 in SRR21972725: 59348.28
TPM for ENSMUSG00000064351 in SRR21972724: 61388.25
TPM for ENSMUSG00000064351 in SRR21972723: 52818.82
TPM for ENSMUSG00000064351 in SRR21972726: 57367.34


## STEP 13: Export Read counts to S3 Bucket


The code extracts gene-level data (gene ID, expected count, and gene length) from the RSEM output files and stores them as tab-separated files in an S3 bucket. This data will be accessible for further analysis in Tutorial 2 and Tutorial 3.

In [47]:
import os
import pandas as pd
import boto3

# Define the path to your RSEM output directory
rsem_output_path = "data/rsem_output"

# Define the S3 bucket and output path
s3_bucket = "sra-data-athena"
s3_output_path = "readcounts/"

# Initialize S3 client
s3_client = boto3.client('s3')

# Get a list of all .genes.results files in the directory
genes_files = [f for f in os.listdir(rsem_output_path) if f.endswith('.genes.results')]

# Loop through each file to extract gene ID, expected counts, and gene length
for file in genes_files:
    file_path = os.path.join(rsem_output_path, file)
    
    # Read the .genes.results file
    rsem_data = pd.read_csv(file_path, sep="\t")

    # Check if the necessary columns exist
    if all(col in rsem_data.columns for col in ["gene_id", "expected_count", "length"]):
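        # Create a new dataframe with the required columns, renamed for downstream use.
        # rename() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning.
        result_data = rsem_data[["gene_id", "expected_count", "length"]].rename(
            columns={"gene_id": "GeneID", "expected_count": "Count", "length": "GeneLength"}
        )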

        # Define the output filename based on the input file name
        output_file_name = f"{os.path.splitext(file)[0]}.txt"
        s3_output_file_path = f"{s3_output_path}{output_file_name}"

        # Convert the DataFrame to a CSV string
        csv_buffer = result_data.to_csv(sep="\t", index=False)

        # Upload the result directly to S3
        s3_client.put_object(Bucket=s3_bucket, Key=s3_output_file_path, Body=csv_buffer)

    else:
        print(f"Warning: Required columns are missing in file: {file}")

# Optionally, print a message indicating completion
print("Extraction and file creation complete.")



Extraction and file creation complete.
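
Because `os.path.splitext` removes only the final extension, each uploaded object keeps the `.genes` part of its name. The sketch below reproduces the naming logic of the loop in isolation, so the resulting S3 keys can be checked before anything is uploaded. The helper name `s3_key_for` is hypothetical.

```python
import os

def s3_key_for(file_name, prefix="readcounts/"):
    """Build the S3 key the way the export loop does: drop the last extension, add .txt."""
    return f"{prefix}{os.path.splitext(file_name)[0]}.txt"

# One of the RSEM output files from this run
print(s3_key_for("SRR21972730.genes.results"))
# prints: readcounts/SRR21972730.genes.txt
```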


## STEP 14: Save Merged Read Counts

This step combines the per-sample RSEM gene count files into a single matrix, which is easier to analyze and visualize. The merged file is also uploaded to the S3 bucket for use in the other tutorials.

In [48]:
# Ensure the RSEM quantification results directory exists
!mkdir -p data/rsem_output

# Merge RSEM results by gene counts (similar to Salmon's numreads merge)
!rsem-generate-data-matrix data/rsem_output/*.genes.results > data/rsem_output/merged_gene_counts.txt

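# Optionally, rename the columns based on the samples
# If you want to assign your GSM identifiers or any other custom names, edit the header.
# The names must follow the column order of the merged matrix, i.e. the order in which the shell expands *.genes.results.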
!sed -i "1s/.*/Name\tGSM6658439\tGSM6658438\tGSM6658435\tGSM6658441\tGSM6658433\tGSM6658431\tGSM6658429\tGSM6658427/" data/rsem_output/merged_gene_counts.txt

# Remove any unnecessary prefixes like 'gene-' or 'rna-' for easier formatting
!sed -i "s/gene-//g" data/rsem_output/merged_gene_counts.txt
!sed -i "s/rna-//g" data/rsem_output/merged_gene_counts.txt

# Show a preview of the merged quantification file
!head data/rsem_output/merged_gene_counts.txt

import boto3
import os

# Define the file path and S3 bucket details
file_path = "data/rsem_output/merged_gene_counts.txt"
bucket_name = "sra-data-athena"
s3_key = "readcounts/merged_gene_counts.txt"

# Initialize an S3 client
s3_client = boto3.client('s3')

# Upload the file to the specified S3 bucket
try:
    s3_client.upload_file(file_path, bucket_name, s3_key)
    print(f"File {file_path} uploaded successfully to {bucket_name}/{s3_key}")
except Exception as e:
    print(f"Error uploading file: {e}")

# Define the feature table path and its S3 key
feature_table_path = "data/reference/mouse_feature_table.txt"
s3_feature_table_path = "reference/mouse_feature_table.txt"

# Upload the feature table file (the merged gene count file was already uploaded above)
s3_client.upload_file(feature_table_path, bucket_name, s3_feature_table_path)

Name	GSM6658439	GSM6658438	GSM6658435	GSM6658441	GSM6658433	GSM6658431	GSM6658429	GSM6658427
"ENSMUSG00000000001"	4569.00	3083.00	3009.00	3806.00	6170.00	5161.00	3356.00	3267.00
"ENSMUSG00000000003"	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00
"ENSMUSG00000000028"	124.00	94.00	93.00	124.00	137.00	143.00	67.00	68.00
"ENSMUSG00000000031"	2243.00	1375.00	1828.00	3267.00	941.00	496.00	313.00	322.00
"ENSMUSG00000000037"	57.00	26.00	38.00	33.00	46.00	49.00	36.00	26.00
"ENSMUSG00000000049"	8.00	0.00	1.00	0.00	3.00	0.00	1.00	1.00
"ENSMUSG00000000056"	3981.00	2848.00	2834.00	3353.00	4591.00	4372.00	2811.00	3048.00
"ENSMUSG00000000058"	15626.00	10835.00	10112.00	15088.00	11439.00	10551.00	5909.00	5965.00
"ENSMUSG00000000078"	5817.00	3781.00	3761.00	4817.00	6031.00	5024.00	3076.00	3038.00
File data/rsem_output/merged_gene_counts.txt uploaded successfully to sra-data-athena/readcounts/merged_gene_counts.txt
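
Note that the gene IDs in the merged matrix are wrapped in double quotes, as the preview above shows. A minimal sketch of reading the matrix with the standard library `csv` module, whose default quote handling strips those quotes; the helper name `read_count_matrix` is hypothetical, and the inline text copies the first two sample columns of the preview.

```python
import csv
import io

def read_count_matrix(text):
    """Parse a tab-separated count matrix into (sample_names, {gene_id: [counts]})."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(reader)
    samples = header[1:]  # first column is the gene identifier
    counts = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        counts[row[0]] = [float(value) for value in row[1:]]
    return samples, counts

# First two sample columns of the preview above
preview = (
    'Name\tGSM6658439\tGSM6658438\n'
    '"ENSMUSG00000000001"\t4569.00\t3083.00\n'
    '"ENSMUSG00000000003"\t0.00\t0.00\n'
)

samples, counts = read_count_matrix(preview)
print(samples)
print(counts["ENSMUSG00000000001"])
```

When loading the file with other tools, check whether they strip the quotes as well, since gene IDs that keep them will not match unquoted IDs elsewhere.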


## STEP 15: Save RSEM reference and STAR index to S3 Bucket

In [50]:
!aws s3 cp data/rsem_reference s3://sra-data-athena/reference/rsem_reference/ --recursive

upload: data/rsem_reference/chrName.txt to s3://sra-data-athena/reference/rsem_reference/chrName.txt
upload: data/rsem_reference/Log.out to s3://sra-data-athena/reference/rsem_reference/Log.out
upload: data/rsem_reference/chrNameLength.txt to s3://sra-data-athena/reference/rsem_reference/chrNameLength.txt
upload: data/rsem_reference/chrStart.txt to s3://sra-data-athena/reference/rsem_reference/chrStart.txt
upload: data/rsem_reference/chrLength.txt to s3://sra-data-athena/reference/rsem_reference/chrLength.txt
upload: data/rsem_reference/exonInfo.tab to s3://sra-data-athena/reference/rsem_reference/exonInfo.tab
upload: data/rsem_reference/geneInfo.tab to s3://sra-data-athena/reference/rsem_reference/geneInfo.tab
upload: data/rsem_reference/mouse_reference.chrlist to s3://sra-data-athena/reference/rsem_reference/mouse_reference.chrlist
upload: data/rsem_reference/genomeParameters.txt to s3://sra-data-athena/reference/rsem_reference/genomeParameters.txt
upload: data/rsem_reference/exonGeT
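
As the upload log shows, the `--recursive` copy maps each file under the local directory to an object key under the destination prefix. For readers who would rather script this step in Python, here is a rough sketch of that mapping. The helper name `recursive_upload_plan` is hypothetical, and the sketch only builds the list of keys; it does not upload anything.

```python
import os
import tempfile

def recursive_upload_plan(local_dir, s3_prefix):
    """Return sorted (local_path, s3_key) pairs for every file under local_dir."""
    plan = []
    for root, _dirs, files in os.walk(local_dir):
        for name in files:
            local_path = os.path.join(root, name)
            relative = os.path.relpath(local_path, local_dir)
            # S3 keys always use forward slashes, whatever the local separator is
            key = s3_prefix + relative.replace(os.sep, "/")
            plan.append((local_path, key))
    return sorted(plan)

# Demonstrate with a throwaway directory holding two empty files
with tempfile.TemporaryDirectory() as tmp:
    for name in ("chrName.txt", "Log.out"):
        open(os.path.join(tmp, name), "w").close()
    keys = [key for _path, key in recursive_upload_plan(tmp, "reference/rsem_reference/")]

print(keys)
```

Each pair could then be passed to `s3_client.upload_file(local_path, bucket_name, key)`, the same call used in STEP 14.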

## <a name="workflow">Additional Workflows</a>

Now that you have read counts per gene, you can explore the R workflows, which create plots and analyses from these read count files, or try alternative workflows for generating read counts, such as Snakemake.


[Workflow One:](Tutorial_1_subsampling_mouse-miniforge.ipynb) A short introduction to downloading and mapping sequences to a mouse genome using STAR and RSEM.


[Workflow Two (DEG Analysis):](Tutorial_2_DEG_Analysis_mouse.ipynb) Using DESeq2 and R to conduct clustering and differential gene expression analysis.

[Workflow Three (Network Analysis):](Tutorial_3_NetAct.ipynb) Using NetAct and R to conduct transcription factor network analysis.
