# Bulk RNA-Seq Analysis Training Demo

## Overview

This short tutorial demonstrates how to run an RNA-Seq workflow using a subsampled Mus musculus dataset. Steps in the workflow include read trimming, read QC, read mapping, and counting mapped reads per gene to quantify gene expression. The workflow uses truncated, partial run data from the Mittenbühler MJ et al. project.

Running all of the code can take over 1 hour 30 minutes. This is part of the reason we offer both a short introductory tutorial and a longer, more complete tutorial for those interested.

![RNA-Seq workflow](../images/rnaseq-workflow.png)

## STEP 1: Install Miniforge

Miniforge is a lightweight Conda distribution that offers a streamlined installation process and efficient package management. It provides access to a vast repository of packages.

Conda packages and environments are useful for several reasons. Each Conda package carries metadata listing the other software it needs in order to run. When you install a package with Conda, those dependencies are installed automatically, so you do not have to install each one by hand. This makes installation quick and simple.

These packages are installed inside environments, which are simply folders within the local Conda installation. This has several benefits. Local installation is easier for non-admin users who may not have access to all system directories. Each environment can hold specific software at specific versions, and it is easy to switch between environments. In addition, environments are portable, because each one contains a manifest describing how to recreate it.

Miniforge installs Conda together with Mamba. Mamba is a Conda-compatible package manager: it installs and updates Conda packages, which it gets from a ‘channel’, or repository. It is an alternative to the native Conda package manager and is often used for reasons of speed.

Bioconda is a ‘channel’, or repository, that Mamba can download packages from. It is a repository of Conda packages related to biology. These packages are versions of popular biology software that are curated and uploaded by contributing users.

In [1]:
# Download Miniforge or Mambaforge (you can use either based on preference)
!curl -L -O https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh

# Install Miniforge (or Mambaforge) - no need to install conda since mamba will be available immediately
!bash Miniforge3-$(uname)-$(uname -m).sh -b -u -p $HOME/miniforge
!date +"%T"

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0 0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 86.0M  100 86.0M    0     0   175M      0 --:--:-- --:--:-- --:--:--  175M
PREFIX=/home/ec2-user/miniforge

Transaction

  Prefix: /home/ec2-user/miniforge/envs/_virtual_specs_checks

  All requested packages already installed

Dry run. Not executing the transaction.
Unpacking payload ...
Extracting _libgcc_mutex-0.1-conda_forge.tar.bz2
Extracting ca-certificates-2024.8.30-hbcca054_0.conda
Extracting ld_impl_linux-64-2.40-hf3520f5_7.conda
Extracting pybind11-abi-4-hd8ed1ab_3.tar.bz2
Extracting python_abi-3.12-5_cp312.conda
Extracting tzdata-2024a-h8827d51_1.conda
Extracting libgomp-14.1.0-h77fa898_1.conda
Extract

Next, use Mamba with the conda-forge and Bioconda channels to install the tools that will be used in this tutorial.

In [2]:
# Update PATH to point to the Miniforge (or Mambaforge) bin files
import os
os.environ["PATH"] += os.pathsep + os.environ["HOME"]+"/miniforge/bin"

# Now use mamba directly to install your software packages
!mamba install -y -c conda-forge -c bioconda trimmomatic fastqc multiqc sql-magic entrez-direct gffread parallel-fastq-dump sra-tools pyathena samtools star rsem subread pigz


Looking for: ['trimmomatic', 'fastqc', 'multiqc', 'sql-magic', 'entrez-direct', 'gffread', 'parallel-fastq-dump', 'sra-tools', 'pyathena', 'samtools', 'star', 'rsem', 'subread', 'pigz']

[?25l[2K[0G[+] 0.0s
[2K[1A[2K[0G[+] 0.1s
bioconda/linux-64 (check zst) [90m━━━╸[0m[33m━━━━━━━━━━━[0m   0.0 B @  ??.?MB/s Checking  0.1s[2K[1A[2K[0Gbioconda/linux-64 (check zst)                       Checked  0.2s
[?25h[?25l[2K[0G[+] 0.0s
[2K[1A[2K[0Gbioconda/noarch (check zst)                        Checked  0.0s
[?25l[2K[0G[+] 0.0s
[2K[1A[2K[0G[+] 0.1s
https://aws-ml-conda.s3.us-west-2.amazonaws.com/.. [33m━━━━━━━━━━━╸[0m[90m━━━[0m   0.0 B  0.1s[2K[1A[2K[0Ghttps://aws-ml-conda.s3.us-west-2.amazonaws.com/..  0.2s
[?25l[2K[0G[+] 0.0s
[2K[1A[2K[0Ghttps://aws-ml-conda.s3.us-west-2.amazonaws.com/.. Checked  0.1s
[?25l[2K[0G[+] 0.0s
[2K[1A[2K[0Gnvidia/linux-64 (check zst)                        Checked  0.0s
[?25l[2K[0G[+] 0.0s
[2K[1A[2K[0Gnvidia/n

## STEP 2: Set Up Environment

Create a set of directories to store the reads, reference sequence files, and output files.

In [3]:
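# Note: `!cd` runs in a subshell and does not change the notebook's working
# directory; use the `%cd` magic if you need to switch directories.
!echo $PWD
!mkdir -p data_sub
# Raw reads are downloaded to data_sub/raw_fastq in STEP 3.2
!mkdir -p data_sub/raw_fastq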
!mkdir -p data_sub/trimmed
!mkdir -p data_sub/fastqc_samples/
!mkdir -p data_sub/reference
!mkdir -p data_sub/aligned_bam
!mkdir -p data_sub/rsem_reference/mouse_rsem_reference
!mkdir -p data_sub/rsem_output
!mkdir -p data_sub/reference/STAR_index

/home/ec2-user/SageMaker


Set the number of threads based on the CPU cores available on the VM, leaving one core free.

Later tools such as Trimmomatic and STAR use this value.

In [4]:
import multiprocessing

num_cores = multiprocessing.cpu_count()
THREADS = max(1, num_cores - 1)

print("Number of threads:", THREADS)
os.environ["THREADS"] = str(THREADS)

Number of threads: 15


## STEP 3: Downloading relevant files



### STEP 3.1: Finding run accession numbers.


This code retrieves the SRR IDs associated with project PRJNA892075 from the NCBI SRA database and saves them to accs.txt, one ID per line.

In [5]:
!esearch -db sra -query "PRJNA892075" | efetch -format runinfo | cut -d',' -f1 | tail -n +2 > accs.txt
!cat accs.txt

SRR21972730
SRR21972729
SRR21972728
SRR21972727
SRR21972725
SRR21972724
SRR21972723
SRR21972726
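
The `cut`/`tail` pipeline above keeps the first CSV column and drops the header row. A minimal Python sketch of the same extraction follows, assuming the runinfo table is CSV with a header column named `Run`. The function name `run_accessions` and the two-column sample text are hypothetical; the sample reuses accessions shown above and is not real `efetch` output.

```python
import csv
import io

def run_accessions(runinfo_csv: str) -> list[str]:
    """Return the values of the 'Run' column from runinfo CSV text."""
    reader = csv.DictReader(io.StringIO(runinfo_csv))
    # Defensive: skip rows with an empty Run field or a stray repeated header line.
    return [row["Run"] for row in reader if row.get("Run") and row["Run"] != "Run"]

# Hypothetical two-column excerpt for illustration only.
sample = "Run,LibraryLayout\nSRR21972730,PAIRED\nSRR21972729,PAIRED\n"
print(run_accessions(sample))
```

Selecting the column by name rather than by position keeps the extraction working if the column order ever changes.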


### STEP 3.2: Download subsampled FASTQ files from S3 bucket

In order for this tutorial to run quickly, we will only analyze 1,000,000 reads from each of the eight samples in PRJNA892075. These subsampled files have been posted on a publicly accessible S3 bucket.

The code will connect to the S3 bucket and download the paired-end FASTQ files for each SRR.

In [6]:
#Load SRR IDs from accs.txt (if available)
with open('accs.txt', 'r') as f:
    accs = [line.strip() for line in f.readlines()]

In [7]:
for acc in accs:
    !wget -P data_sub/raw_fastq/ https://sra-data-athena.s3.amazonaws.com/fastqfiles/subsampled_{acc}_1.fastq
    !wget -P data_sub/raw_fastq/ https://sra-data-athena.s3.amazonaws.com/fastqfiles/subsampled_{acc}_2.fastq

--2024-10-07 07:39:57--  https://sra-data-athena.s3.amazonaws.com/fastqfiles/subsampled_SRR21972730_1.fastq
Resolving sra-data-athena.s3.amazonaws.com (sra-data-athena.s3.amazonaws.com)... 3.5.2.141, 52.216.220.177, 54.231.165.97, ...
Connecting to sra-data-athena.s3.amazonaws.com (sra-data-athena.s3.amazonaws.com)|3.5.2.141|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 376258608 (359M) [binary/octet-stream]
Saving to: ‘data_sub/raw_fastq/subsampled_SRR21972730_1.fastq.1’


2024-10-07 07:40:17 (18.1 MB/s) - ‘data_sub/raw_fastq/subsampled_SRR21972730_1.fastq.1’ saved [376258608/376258608]

--2024-10-07 07:40:17--  https://sra-data-athena.s3.amazonaws.com/fastqfiles/subsampled_SRR21972730_2.fastq
Resolving sra-data-athena.s3.amazonaws.com (sra-data-athena.s3.amazonaws.com)... 52.217.234.81, 52.216.249.140, 3.5.9.100, ...
Connecting to sra-data-athena.s3.amazonaws.com (sra-data-athena.s3.amazonaws.com)|52.217.234.81|:443... connected.
HTTP request sent, awaitin
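
Once the downloads finish, it can be worth confirming that each file pair holds the expected number of reads. The following is a minimal sketch, assuming uncompressed FASTQ files with exactly four lines per record; the helper names `count_fastq_reads` and `check_pair` are hypothetical.

```python
def count_fastq_reads(path) -> int:
    """Count reads in an uncompressed FASTQ file (assumes 4 lines per record)."""
    with open(path, "rb") as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4:
        raise ValueError(f"{path}: {n_lines} lines is not a multiple of 4")
    return n_lines // 4

def check_pair(fastq_1, fastq_2) -> int:
    """Return the shared read count of a paired-end file pair, or raise if they differ."""
    n1, n2 = count_fastq_reads(fastq_1), count_fastq_reads(fastq_2)
    if n1 != n2:
        raise ValueError(f"read counts differ: {n1} vs {n2}")
    return n1
```

For example, `check_pair('data_sub/raw_fastq/subsampled_SRR21972730_1.fastq', 'data_sub/raw_fastq/subsampled_SRR21972730_2.fastq')` should return 1000000 if the subsampled files match the description in this step.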

### STEP 3.3: Download reference genome and annotation files that will be used by STAR and RSEM

This step downloads and unzips reference files for the mouse genome and annotations needed by STAR. These reference files will be used by STAR to align RNA-seq reads during the analysis.

In [8]:
! wget ftp://ftp.ensembl.org/pub/release-104/fasta/mus_musculus/dna/Mus_musculus.GRCm39.dna.primary_assembly.fa.gz -O data_sub/reference/mouse_genome.fa.gz
! wget ftp://ftp.ensembl.org/pub/release-104/gtf/mus_musculus/Mus_musculus.GRCm39.104.gtf.gz -O data_sub/reference/mouse_annotation.gtf.gz
! wget -O data_sub/reference/mouse_feature_table.txt.gz https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/635/GCF_000001635.27_GRCm39/GCF_000001635.27_GRCm39_feature_table.txt.gz

--2024-10-07 07:45:06--  ftp://ftp.ensembl.org/pub/release-104/fasta/mus_musculus/dna/Mus_musculus.GRCm39.dna.primary_assembly.fa.gz
           => ‘data_sub/reference/mouse_genome.fa.gz’
Resolving ftp.ensembl.org (ftp.ensembl.org)... 193.62.193.169
Connecting to ftp.ensembl.org (ftp.ensembl.org)|193.62.193.169|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /pub/release-104/fasta/mus_musculus/dna ... done.
==> SIZE Mus_musculus.GRCm39.dna.primary_assembly.fa.gz ... 806418890
==> PASV ... done.    ==> RETR Mus_musculus.GRCm39.dna.primary_assembly.fa.gz ... done.
Length: 806418890 (769M) (unauthoritative)


2024-10-07 07:45:31 (32.4 MB/s) - ‘data_sub/reference/mouse_genome.fa.gz’ saved [806418890]

--2024-10-07 07:45:31--  ftp://ftp.ensembl.org/pub/release-104/gtf/mus_musculus/Mus_musculus.GRCm39.104.gtf.gz
           => ‘data_sub/reference/mouse_annotation.gtf.gz’
Resolving ftp.ensembl.org (ftp.ensembl.or

In [9]:
!gunzip -f data_sub/reference/mouse_genome.fa.gz 
!gunzip -f data_sub/reference/mouse_annotation.gtf.gz
!gunzip -f data_sub/reference/mouse_feature_table.txt.gz

### STEP 3.4: Copy data file for Trimmomatic


One of Trimmomatic's functions is to trim adapter sequences specific to the sequencing machine. These are usually stored within the Trimmomatic installation directory in a folder called adapters.

Directories of packages within Conda installations can be confusing, so when using Trimmomatic through Conda it may be easier to simply download or create a file with the relevant adapter sequences and store it in an easy-to-find directory.

In [10]:
!wget -P data_sub/trimmed/ https://sra-data-athena.s3.amazonaws.com/reference/TruSeq3-PE.fa

--2024-10-07 07:45:58--  https://sra-data-athena.s3.amazonaws.com/reference/TruSeq3-PE.fa
Resolving sra-data-athena.s3.amazonaws.com (sra-data-athena.s3.amazonaws.com)... 3.5.1.135, 52.216.214.137, 52.217.204.201, ...
Connecting to sra-data-athena.s3.amazonaws.com (sra-data-athena.s3.amazonaws.com)|3.5.1.135|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 95 [binary/octet-stream]
Saving to: ‘data_sub/trimmed/TruSeq3-PE.fa.1’


2024-10-07 07:45:58 (5.91 MB/s) - ‘data_sub/trimmed/TruSeq3-PE.fa.1’ saved [95/95]



## STEP 4: Run Trimmomatic

Trimmomatic will trim off any adapter sequences or low quality sequence it detects in the FASTQ files.

Trimmomatic takes an input, in this case a forward and reverse FASTQ file, and the user names the outputs. In this case the first two outputs are the ‘paired’ and ‘orphaned’ trimmed files for the forward reads. The second two are the ‘paired’ and ‘orphaned’ trimmed files for the reverse reads.

Paired means that both the forward and reverse read of a pair survived trimming. Orphaned, or unpaired, means one read of the pair was discarded during trimming, so only the forward or the reverse read survived.

Here we take only the files with paired reads, because including orphaned reads can complicate downstream analyses. Unless there is a significant number of them, or a specific reason to use them, it is generally easier to discard unpaired reads. Check the Trimmomatic log to see how many reads this removes: for SRR21972730 in the run shown in this notebook, 31.56% of pairs survived as forward-only. The later steps use both the forward and reverse paired files.

The last part of the command specifies how the trimming is done: `ILLUMINACLIP:TruSeq3-PE.fa:2:30:10:2:keepBothReads LEADING:3 TRAILING:3 MINLEN:36`

In greater detail:

‘ILLUMINACLIP:TruSeq3-PE.fa’ names the FASTA file of adapter sequences to cut from the reads.

‘2:30:10:2’ sets four thresholds, here at commonly recommended values.

‘2’ is the seed mismatch count: the maximum number of mismatches a 'seed' may have when aligning to a possible adapter.

‘30’ is the palindrome clip threshold. In palindrome mode the adapter-ligated forward and reverse reads are aligned against each other; if the alignment score exceeds this threshold, the adapter fragments are trimmed. By default the forward read is clipped and the reverse read is dropped, since it carries the same sequence information.

‘10’ is the simple clip threshold, used by the basic alternative to palindrome mode: each adapter is tested against each read and clipped if the match score exceeds the threshold (here, 10).

‘2’ is the minimum adapter fragment length in palindrome mode.

‘keepBothReads’ is the final ILLUMINACLIP field. The Trimmomatic manual describes it as a true/false setting, where `true` retains the reverse read after palindrome clipping instead of dropping it. If many pairs survive as forward-only, as in the log for this run, check that your Trimmomatic version accepts the literal `keepBothReads` used in the command.

‘LEADING:3’ trims bases from the start of a read. Bases are removed one at a time for as long as their PHRED quality score is below 3.

‘TRAILING:3’ does the same at the end of a read.

‘MINLEN:36’ discards a read if it is shorter than 36 bases after trimming.

More information about these parameters can be found in the Trimmomatic documentation.
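
To make the field order of the ILLUMINACLIP step concrete, here is a minimal sketch that splits the step string into named fields. The `IlluminaClip` tuple and `parse_illuminaclip` function are hypothetical helpers written for this illustration, not part of Trimmomatic. The sketch assumes the adapter path contains no colon, and it keeps the last field as raw text rather than guessing how Trimmomatic interprets it.

```python
from typing import NamedTuple, Optional

class IlluminaClip(NamedTuple):
    adapter_fasta: str
    seed_mismatches: int
    palindrome_clip_threshold: int
    simple_clip_threshold: int
    min_adapter_length: Optional[int]  # None when the field is omitted
    keep_both_reads: Optional[str]     # raw text of the last field, None when omitted

def parse_illuminaclip(step: str) -> IlluminaClip:
    """Split an ILLUMINACLIP step string into named fields."""
    name, *fields = step.split(":")
    if name != "ILLUMINACLIP" or not 4 <= len(fields) <= 6:
        raise ValueError(f"not an ILLUMINACLIP step: {step!r}")
    fasta, seed, palindrome, simple, *optional = fields
    min_len = int(optional[0]) if len(optional) >= 1 else None
    keep_both = optional[1] if len(optional) == 2 else None
    return IlluminaClip(fasta, int(seed), int(palindrome), int(simple), min_len, keep_both)

clip = parse_illuminaclip("ILLUMINACLIP:TruSeq3-PE.fa:2:30:10:2:keepBothReads")
print(clip)
```

Naming the fields this way makes it easier to see that the two `2` values in `2:30:10:2` are unrelated settings: the first is the seed mismatch count and the second is the minimum adapter length.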

In [11]:
!cat accs.txt | xargs -I {} \
trimmomatic PE -threads $THREADS \
'data_sub/raw_fastq/subsampled_{}_1.fastq' 'data_sub/raw_fastq/subsampled_{}_2.fastq' \
'data_sub/trimmed/subsampled_{}_1_trimmed.fastq' 'data_sub/trimmed/subsampled_{}_1_trimmed_unpaired.fastq' \
'data_sub/trimmed/subsampled_{}_2_trimmed.fastq' 'data_sub/trimmed/subsampled_{}_2_trimmed_unpaired.fastq' \
ILLUMINACLIP:data_sub/trimmed/TruSeq3-PE.fa:2:30:10:2:keepBothReads LEADING:3 TRAILING:3 MINLEN:36


TrimmomaticPE: Started with arguments:
 -threads 15 data_sub/raw_fastq/subsampled_SRR21972730_1.fastq data_sub/raw_fastq/subsampled_SRR21972730_2.fastq data_sub/trimmed/subsampled_SRR21972730_1_trimmed.fastq data_sub/trimmed/subsampled_SRR21972730_1_trimmed_unpaired.fastq data_sub/trimmed/subsampled_SRR21972730_2_trimmed.fastq data_sub/trimmed/subsampled_SRR21972730_2_trimmed_unpaired.fastq ILLUMINACLIP:data_sub/trimmed/TruSeq3-PE.fa:2:30:10:2:keepBothReads LEADING:3 TRAILING:3 MINLEN:36
Using PrefixPair: 'TACACTCTTTCCCTACACGACGCTCTTCCGATCT' and 'GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT'
ILLUMINACLIP: Using 1 prefix pairs, 0 forward/reverse sequences, 0 forward only sequences, 0 reverse only sequences
Quality encoding detected as phred33
Input Read Pairs: 1000000 Both Surviving: 684232 (68.42%) Forward Only Surviving: 315569 (31.56%) Reverse Only Surviving: 0 (0.00%) Dropped: 199 (0.02%)
TrimmomaticPE: Completed successfully
TrimmomaticPE: Started with arguments:
 -threads 15 data_sub/raw_fa
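
The `Input Read Pairs` line in the Trimmomatic log summarizes how many pairs survived. A minimal sketch for pulling those numbers out of a log line follows; the regular expression is written against the line format shown in the output above, and the function name `parse_trimmomatic_summary` is hypothetical.

```python
import re

# Pattern modeled on the paired-end summary line printed in the log above.
SUMMARY = re.compile(
    r"Input Read Pairs: (?P<input>\d+) "
    r"Both Surviving: (?P<both>\d+) \([\d.]+%\) "
    r"Forward Only Surviving: (?P<forward_only>\d+) \([\d.]+%\) "
    r"Reverse Only Surviving: (?P<reverse_only>\d+) \([\d.]+%\) "
    r"Dropped: (?P<dropped>\d+)"
)

def parse_trimmomatic_summary(line: str) -> dict:
    """Return the read-pair counts from a Trimmomatic PE summary line."""
    match = SUMMARY.search(line)
    if match is None:
        raise ValueError("no Trimmomatic PE summary found")
    return {key: int(value) for key, value in match.groupdict().items()}

# Summary line copied from the SRR21972730 log output above.
line = (
    "Input Read Pairs: 1000000 Both Surviving: 684232 (68.42%) "
    "Forward Only Surviving: 315569 (31.56%) "
    "Reverse Only Surviving: 0 (0.00%) Dropped: 199 (0.02%)"
)
stats = parse_trimmomatic_summary(line)
print(stats)
```

The four outcome counts add up to the input count, which is a quick consistency check when collecting these numbers across samples.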

## STEP 5: Run FastQC
FastQC is an invaluable tool for evaluating whether there are problems with a set of reads. For example, it reports whether there is any bias in the sequence composition of the reads.

Because Jupyter is at its core a Python editor, we can use Python code and HTML support to display results in-line.

FastQC checks several quality characteristics of the reads. It is very rare for every metric to pass. In many cases the flags serve as warnings that should be weighed against the context of the experiment. For instance, here per base sequence content, sequence length distribution, sequence duplication levels, and overrepresented sequences all raise warnings. Per base sequence content routinely fails in RNA-sequencing in roughly the first 15 bases due to biased fragmentation. In most of our samples this is where we see the failure (a difference of 20% or more between A and T or between G and C), so it is not unexpected. The overrepresented sequences can be BLASTed to show that the majority of them are ribosomal RNA. Ribosomal RNA contamination is also common and will not be indexed later, so it is not a large concern. Other metrics look good.

In [12]:
# Run FastQC
!cat accs.txt | xargs -P $THREADS -I {} fastqc data_sub/trimmed/subsampled_{}_1_trimmed.fastq data_sub/trimmed/subsampled_{}_2_trimmed.fastq -o data_sub/fastqc_samples/

null
null
null
null
null
null
null
null
null
Started analysis of subsampled_SRR21972724_1_trimmed.fastq
null
Started analysis of subsampled_SRR21972723_1_trimmed.fastq
Started analysis of subsampled_SRR21972726_1_trimmed.fastq
null
null
Started analysis of subsampled_SRR21972729_1_trimmed.fastq
null
null
Started analysis of subsampled_SRR21972727_1_trimmed.fastq
Started analysis of subsampled_SRR21972728_1_trimmed.fastq
null
Started analysis of subsampled_SRR21972725_1_trimmed.fastq
null
Started analysis of subsampled_SRR21972730_1_trimmed.fastq
Approx 5% complete for subsampled_SRR21972724_1_trimmed.fastq
Approx 5% complete for subsampled_SRR21972729_1_trimmed.fastq
Approx 5% complete for subsampled_SRR21972726_1_trimmed.fastq
Approx 5% complete for subsampled_SRR21972730_1_trimmed.fastq
Approx 5% complete for subsampled_SRR21972725_1_trimmed.fastq
Approx 5% complete for subsampled_SRR21972728_1_trimmed.fastq
Approx 5% complete for subsampled_SRR21972727_1_trimmed.fastq
Approx 5% comp

In [13]:
from IPython.display import IFrame
IFrame(src='./data_sub/fastqc_samples/subsampled_SRR21972726_1_trimmed_fastqc.html', width=800, height=600)
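
The PASS/WARN/FAIL flags discussed above can also be read programmatically rather than from the HTML report. The sketch below assumes FastQC wrote its default `.zip` archive next to each HTML file and that the archive contains a tab-separated `summary.txt` with one `status<TAB>module<TAB>filename` row per module; the function name `fastqc_flags` is hypothetical.

```python
import zipfile

def fastqc_flags(zip_path) -> dict:
    """Map each FastQC module name to its PASS/WARN/FAIL flag."""
    with zipfile.ZipFile(zip_path) as archive:
        # Assumes the archive holds one "<sample>_fastqc/summary.txt" member.
        member = next(n for n in archive.namelist() if n.endswith("/summary.txt"))
        text = archive.read(member).decode("utf-8")
    flags = {}
    for row in text.splitlines():
        if row.strip():
            status, module, _filename = row.split("\t")
            flags[module] = status
    return flags
```

With the paths used in this tutorial, a call might look like `fastqc_flags('data_sub/fastqc_samples/subsampled_SRR21972726_1_trimmed_fastqc.zip')`, assuming that archive exists alongside the HTML report displayed above.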

## STEP 6: Run MultiQC
MultiQC reads in the FastQC reports and generates a compiled report for all the analyzed FASTQ files.

Being able to use Python alongside bash also means we can seamlessly use popular Python packages, such as pandas, to interact with or view the files we create.

In [14]:
#!multiqc -f data_sub/fastqc_samples/
!multiqc -f -o data_sub/multiqc_samples/ data_sub/fastqc_samples/


[91m///[0m ]8;id=798493;https://multiqc.info\[1mMultiQC[0m]8;;\ 🔍 [2mv1.25.1[0m

[34m       file_search[0m | Search path: /home/ec2-user/SageMaker/data_sub/fastqc_samples
[2K         [34msearching[0m | [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [35m100%[0m [32m32/32[0m  p[0mm  
[?25h[34m            fastqc[0m | Found 16 reports
[34m     write_results[0m | Data        : data_sub/multiqc_samples/multiqc_data   (overwritten)
[34m     write_results[0m | Report      : data_sub/multiqc_samples/multiqc_report.html   (overwritten)
[34m           multiqc[0m | MultiQC complete


## STEP 7: Build STAR-Compatible RSEM Reference

This command prepares the reference genome and annotation files for RNA-Seq analysis using RSEM (RNA-Seq by Expectation-Maximization) and STAR (Spliced Transcripts Alignment to a Reference). `rsem-prepare-reference` extracts the transcript sequences defined by the GTF annotation from the genome FASTA and, because of the `--star` flag, also builds the STAR genome index. The output files are written to the current directory with the prefix `mouse_reference`. The full set of commands for setting up the reference for quantification with STAR and RSEM is shown in Tutorial 1B.

In [5]:
# Preparing the reference genome
!rsem-prepare-reference --gtf data_sub/reference/mouse_annotation.gtf --star -p $THREADS data_sub/reference/mouse_genome.fa mouse_reference

rsem-extract-reference-transcripts mouse_reference 0 data_sub/reference/mouse_annotation.gtf None 0 data_sub/reference/mouse_genome.fa
Parsed 200000 lines
Parsed 400000 lines
Parsed 600000 lines
Parsed 800000 lines
Parsed 1000000 lines
Parsed 1200000 lines
Parsed 1400000 lines
Parsed 1600000 lines
Parsed 1800000 lines
Parsing gtf File is done!
data_sub/reference/mouse_genome.fa is processed!
142434 transcripts are extracted and 0 transcripts are omitted.
Extracting sequences is done!
Group File is generated!
Transcript Information File is generated!
Chromosome List File is generated!
Extracted Sequences File is generated!

rsem-preref mouse_reference.transcripts.fa 1 mouse_reference
Refs.makeRefs finished!
Refs.saveRefs finished!
mouse_reference.idx.fa is generated!
mouse_reference.n2g.idx.fa is generated!

STAR  --runThreadN 15  --runMode genomeGenerate  --genomeDir .  --genomeFastaFiles data_sub/reference/mouse_genome.fa  --sjdbGTFfile data_sub/reference/mouse_annotation.gtf  --sjdbO

## STEP 8: RNA-Seq Expression Quantification with RSEM and STAR for Multiple Samples

This script quantifies gene expression for multiple RNA-Seq samples using RSEM, with STAR for alignment. It reads the SRR accession IDs from accs.txt and, for each accession, runs rsem-calculate-expression on the paired-end trimmed FASTQ files in data_sub/trimmed/ to quantify gene and isoform expression. The script uses the STAR-enabled RSEM reference (mouse_reference) and writes each sample's results to data_sub/rsem_output/, using the SRR ID as the file name prefix (for example, data_sub/rsem_output/SRR21972730.genes.results).

Alignment typically takes approximately 10 minutes per sample.

In [6]:
import os

# Read the SRR accessions from the file, skipping blank lines
with open('accs.txt', 'r') as f:
    srr_accessions = [line.strip() for line in f if line.strip()]

# Define the output directory and make sure it exists
output_dir = "data_sub/rsem_output"
os.makedirs(output_dir, exist_ok=True)

# Loop through each SRR accession and run rsem-calculate-expression;
# results are written with the prefix {output_dir}/{srr}
for srr in srr_accessions:
    !rsem-calculate-expression -p $THREADS --paired-end --star \
    data_sub/trimmed/subsampled_{srr}_1_trimmed.fastq data_sub/trimmed/subsampled_{srr}_2_trimmed.fastq mouse_reference {output_dir}/{srr}

STAR --genomeDir .  --outSAMunmapped Within  --outFilterType BySJout  --outSAMattributes NH HI AS NM MD  --outFilterMultimapNmax 20  --outFilterMismatchNmax 999  --outFilterMismatchNoverLmax 0.04  --alignIntronMin 20  --alignIntronMax 1000000  --alignMatesGapMax 1000000  --alignSJoverhangMin 8  --alignSJDBoverhangMin 1  --sjdbScore 1  --runThreadN 15  --genomeLoad NoSharedMemory  --outSAMtype BAM Unsorted  --quantMode TranscriptomeSAM  --outSAMheaderHD \@HD VN:1.4 SO:unsorted  --outFileNamePrefix data_sub/rsem_output/SRR21972730.temp/SRR21972730  --readFilesIn data_sub/trimmed/subsampled_SRR21972730_1_trimmed.fastq data_sub/trimmed/subsampled_SRR21972730_2_trimmed.fastq
	/home/ec2-user/anaconda3/envs/tensorflow2_p310/bin/STAR-avx2 --genomeDir . --outSAMunmapped Within --outFilterType BySJout --outSAMattributes NH HI AS NM MD --outFilterMultimapNmax 20 --outFilterMismatchNmax 999 --outFilterMismatchNoverLmax 0.04 --alignIntronMin 20 --alignIntronMax 1000000 --alignMatesGapMax 1000000 --
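
Because quantification runs for several minutes per sample, it can help to confirm that every sample produced a results file before moving on. A minimal sketch, assuming the `<accession>.genes.results` naming used by the next step; the function name `missing_rsem_results` is hypothetical.

```python
from pathlib import Path

def missing_rsem_results(accessions, output_dir="data_sub/rsem_output"):
    """Return the accessions whose '<accession>.genes.results' file is absent."""
    out = Path(output_dir)
    return [acc for acc in accessions if not (out / f"{acc}.genes.results").is_file()]
```

Calling `missing_rsem_results(srr_accessions)` after the loop returns an empty list when every sample finished; any accession it lists can be rerun on its own.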

## STEP 9: Report the top 10 most highly expressed genes in the samples

Top 10 most highly expressed genes, ranked by TPM, in each sample.


In [7]:
import pandas as pd

# Path to RSEM results directory
rsem_results_dir = 'data_sub/rsem_output'

# Loop through each file in accs.txt
for srr_id in open('accs.txt'):
    srr_id = srr_id.strip()  # Remove newline character
    rsem_result_file = f'{rsem_results_dir}/{srr_id}.genes.results'

    # Load the RSEM results into a Pandas DataFrame
    df = pd.read_csv(rsem_result_file, sep='\t')

    # Sort the DataFrame by TPM values in descending order and get the top 10 genes
    top_10_genes = df.sort_values(by='TPM', ascending=False).head(10)

    # Print the top 10 genes with their TPM values
    print(f"Top 10 Genes by TPM for {srr_id}:")
    print(top_10_genes[['gene_id', 'TPM']])

Top 10 Genes by TPM for SRR21972730:
                  gene_id       TPM
17914  ENSMUSG00000064351  53324.68
17919  ENSMUSG00000064356  41884.60
17475  ENSMUSG00000062515  30282.61
37133  ENSMUSG00000102070  21456.98
10569  ENSMUSG00000037071  14982.54
17930  ENSMUSG00000064368  14954.44
17904  ENSMUSG00000064341  12861.93
17932  ENSMUSG00000064370  11785.29
35987  ENSMUSG00000100862  11672.87
17917  ENSMUSG00000064354  11594.19
Top 10 Genes by TPM for SRR21972729:
                  gene_id       TPM
17919  ENSMUSG00000064356  54418.73
17914  ENSMUSG00000064351  54164.17
17475  ENSMUSG00000062515  30580.97
37133  ENSMUSG00000102070  26667.12
17930  ENSMUSG00000064368  15596.24
10569  ENSMUSG00000037071  15413.18
35987  ENSMUSG00000100862  14560.83
17904  ENSMUSG00000064341  14205.65
17902  ENSMUSG00000064339  12560.83
17932  ENSMUSG00000064370  12053.88
Top 10 Genes by TPM for SRR21972728:
                  gene_id       TPM
17914  ENSMUSG00000064351  49818.70
17919  ENSMUSG00000064356



## STEP 10: Report the expression of ENSMUSG00000100862 for each file

Use pandas to report the expression of the gene in each sample. The fields in the RSEM `genes.results` file are as follows. The level of expression is reported in the Transcripts Per Million (`TPM`) and expected read count (`expected_count`) fields:  
`gene_id    transcript_id(s)    length    effective_length    expected_count    TPM    FPKM`

In [10]:
import pandas as pd

# Path to RSEM results directory
rsem_results_dir = 'data_sub/rsem_output'

# Target gene ID
target_gene = 'ENSMUSG00000100862'

# Loop through each file in accs.txt
for srr_id in open('accs.txt'):
    srr_id = srr_id.strip()  # Remove newline character
    rsem_result_file = f'{rsem_results_dir}/{srr_id}.genes.results'

    # Load the RSEM results into a Pandas DataFrame
    df = pd.read_csv(rsem_result_file, sep='\t')

    # Filter for the target gene
    target_gene_data = df[df['gene_id'] == target_gene]

    # Print the target gene's TPM value for the SRR ID
    print(f"TPM for {target_gene} in {srr_id}: {target_gene_data['TPM'].values[0]}")

TPM for ENSMUSG00000100862 in SRR21972730: 11672.87
TPM for ENSMUSG00000100862 in SRR21972729: 14560.83
TPM for ENSMUSG00000100862 in SRR21972728: 10527.53
TPM for ENSMUSG00000100862 in SRR21972727: 12227.18
TPM for ENSMUSG00000100862 in SRR21972725: 19901.58
TPM for ENSMUSG00000100862 in SRR21972724: 17771.41
TPM for ENSMUSG00000100862 in SRR21972723: 17723.14
TPM for ENSMUSG00000100862 in SRR21972726: 20410.83
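
To make the TPM unit concrete, the sketch below applies the standard definition: divide each gene's read count by its effective length, then rescale so the values sum to one million. RSEM derives its reported TPM from its own model estimates, so this illustrates the unit rather than reproducing RSEM's code. The function name `tpm` and the toy numbers are made up for easy arithmetic and are not taken from the tutorial data.

```python
def tpm(counts, effective_lengths):
    """Convert per-gene read counts to transcripts per million (TPM)."""
    # Reads per base of effective length; a zero effective length gives rate 0.
    rates = [c / l if l > 0 else 0.0 for c, l in zip(counts, effective_lengths)]
    total = sum(rates)
    if total == 0:
        return [0.0] * len(rates)
    return [r / total * 1e6 for r in rates]

# Toy example: three genes, two with equal length and a 1:2 count ratio.
values = tpm([100, 200, 0], [1000.0, 1000.0, 500.0])
print(values)
```

Because TPM values always sum to one million within a sample, a value such as 11672.87 means the gene accounts for roughly 1.2% of the length-normalized signal in that sample.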


## <a name="workflow">Additional Workflows</a>

Now that you have read counts per gene, feel free to explore the R workflow, which creates plots and analyses from these read count files, or try alternative workflows for creating read count files, such as using Snakemake.


[Workflow One (Extended):](Tutorial_1B_Extended_mouse.ipynb) An extended version of workflow one. Once you have got your feet wet, you can retry workflow one with this extended version that covers the entire dataset.

[Workflow Two (DEG Analysis):](Tutorial_2_DEG_Analysis_mouse.ipynb) Using DESeq2 and R to conduct clustering and differential gene expression analysis.

[Workflow Three (Network Analysis):](Tutorial_3_NetAct.ipynb) Using NetAct and R to conduct transcription factor network analysis.
