# RNA-Seq Analysis Training Demo

## Overview

This short tutorial demonstrates how to run an RNA-Seq workflow using a mouse (Mus musculus) data set. Steps in the workflow include read trimming, read QC, read mapping, and counting mapped reads per gene to quantify gene expression.

![RNA-Seq workflow](images/rnaseq-workflow.png)

### STEP 1: Install Mambaforge.

First install Mambaforge. 

Mambaforge is a minimal installer that provides Conda together with the Mamba package manager. Conda is a tool that creates Conda ‘environments’ into which Conda packages can be installed.

Conda packages and environments are useful for several reasons. Each Conda package carries metadata listing the other software it needs in order to run. When you install a package with Conda, those dependencies are installed automatically, so you do not have to install each one by hand. This makes installation quick and simple.

These packages are installed inside environments, which are simply folders within the local Conda installation. This has several benefits. A local installation is easier for non-admin users who may not have access to all system directories. Each environment can hold specific software at specific versions, and it is easy to switch between environments. In addition, the environments themselves are portable, as each environment contains a manifest describing how to recreate it.

Mamba, the package manager that Mambaforge provides, works alongside Conda. It installs and updates Conda packages, which it gets from a ‘channel’, or repository. It is an alternative to the native Conda package manager and is often used because it is faster.

Bioconda is a ‘channel’, or repository, that the Mambaforge package manager can download packages from. It is a repository of Conda packages that are related to biology. These packages are versions of popular biology software that are curated and uploaded by contributing users. 


In [1]:
!df -h

Filesystem      Size  Used Avail Use% Mounted on
devtmpfs         16G     0   16G   0% /dev
tmpfs            16G     0   16G   0% /dev/shm
tmpfs            16G  608K   16G   1% /run
tmpfs            16G     0   16G   0% /sys/fs/cgroup
/dev/nvme0n1p1  135G   84G   52G  62% /
/dev/nvme1n1    492G   89G  383G  19% /home/ec2-user/SageMaker
tmpfs           3.1G     0  3.1G   0% /run/user/1002
tmpfs           3.1G     0  3.1G   0% /run/user/1000
tmpfs           3.1G     0  3.1G   0% /run/user/1001


In [5]:
# Download Mambaforge quietly
!curl -s -L -O https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-$(uname)-$(uname -m).sh > /dev/null 2>&1
# Install Mambaforge quietly
!bash Mambaforge-$(uname)-$(uname -m).sh -b -u -p $HOME/mambaforge > /dev/null 2>&1
# Display the current time
!date +"%T"

21:44:40


Next, using Mambaforge and Bioconda, install the tools that will be used in this tutorial.

In [6]:
#tell the computer where the mambaforge bin files are located
import os
os.environ["PATH"] += os.pathsep + os.environ["HOME"]+"/mambaforge/bin"

# Now the 'mamba' command can be used to install the software
!mamba install -y -q -c conda-forge -c bioconda trimmomatic fastqc multiqc salmon gsutil parallel-fastq-dump sra-tools star samtools subread rsem
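
Before moving on, it can help to confirm that the installed tools are actually discoverable on `PATH`. The following is a minimal sketch using only the Python standard library. The helper name `missing_tools` and the list of executable names are assumptions made for illustration, so adjust them to match your installation.

```python
import shutil

def missing_tools(tools):
    """Return the names in `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]

# Executable names assumed for the packages installed above.
expected = ["trimmomatic", "fastqc", "multiqc", "STAR", "samtools",
            "rsem-calculate-expression", "parallel-fastq-dump"]

not_found = missing_tools(expected)
if not_found:
    print("Not found on PATH:", ", ".join(not_found))
else:
    print("All expected tools were found.")
```

If a tool is reported as missing, check that the `mambaforge/bin` directory was appended to `PATH` in the same kernel session.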

### STEP 2: Setup Environment

Create a set of directories to store the reads, reference sequence files, and output files.


In [7]:
# Work from the notebook's current directory ('!cd' would only affect a temporary subshell)
!echo $PWD
!mkdir -p data
!mkdir -p data/trunc_rawfastq
!mkdir -p data/trimmed
!mkdir -p data/fastqc
!mkdir -p data/reference
!mkdir -p data/star_alignments
!mkdir -p data/quants

/home/ec2-user/SageMaker
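
The same directory layout can also be created from Python, which keeps the list of folders in one place. This is a minimal sketch; the helper name `make_project_dirs` is hypothetical, and the folder names are copied from the cell above.

```python
from pathlib import Path

def make_project_dirs(base, subdirs):
    """Create each subdirectory under `base`; existing directories are left alone."""
    created = []
    for name in subdirs:
        path = Path(base) / name
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created

# Folder names taken from the mkdir commands above.
SUBDIRS = ["trunc_rawfastq", "trimmed", "fastqc", "reference",
           "star_alignments", "quants"]

if __name__ == "__main__":
    for path in make_project_dirs("data", SUBDIRS):
        print(path)
```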


Specify the number of available threads based on the VM.

This is useful for later tools such as Trimmomatic and STAR.

In [8]:
numthreads=!lscpu | grep '^CPU(s)'| awk '{print $2-1}'
THREADS = int(numthreads[0])
print("The number of threads is ", THREADS)

The number of threads is  7
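
The `lscpu` pipeline above leaves one CPU free, which would yield 0 threads on a single-CPU machine. A pure-Python alternative is sketched below, assuming that `os.cpu_count()` reflects the CPUs available to the VM. The names `worker_threads` and `THREADS_SKETCH` are hypothetical and chosen so that the `THREADS` variable above is not overwritten.

```python
import os

def worker_threads(total_cpus, reserve=1):
    """Leave `reserve` CPUs free, but never return fewer than one thread."""
    return max(1, total_cpus - reserve)

# os.cpu_count() can return None, so fall back to a single CPU.
THREADS_SKETCH = worker_threads(os.cpu_count() or 1)
print("The number of threads is", THREADS_SKETCH)
```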


### STEP 3: Copy FASTQ Files
The reads are from a Mus musculus SRA run, taken from the following BioProject:
https://www.ncbi.nlm.nih.gov/bioproject/PRJNA1072813


In [9]:
# Download the FASTQ files from NCBI using `parallel-fastq-dump`, split into forward and reverse reads
! parallel-fastq-dump --sra-id SRR30723333 --threads $THREADS --outdir data/trunc_rawfastq --split-files --gzip

2024-09-24 21:45:36,219 - SRR ids: ['SRR30723333']
2024-09-24 21:45:36,219 - extra args: ['--split-files', '--gzip']
2024-09-24 21:45:36,220 - tempdir: /tmp/pfd_x175do_s
2024-09-24 21:45:36,220 - CMD: sra-stat --meta --quick SRR30723333
2024-09-24 21:45:37,298 - SRR30723333 spots: 26201305
2024-09-24 21:45:37,298 - blocks: [[1, 3743043], [3743044, 7486086], [7486087, 11229129], [11229130, 14972172], [14972173, 18715215], [18715216, 22458258], [22458259, 26201305]]
2024-09-24 21:45:37,298 - CMD: fastq-dump -N 1 -X 3743043 -O /tmp/pfd_x175do_s/0 --split-files --gzip SRR30723333
2024-09-24 21:45:37,299 - CMD: fastq-dump -N 3743044 -X 7486086 -O /tmp/pfd_x175do_s/1 --split-files --gzip SRR30723333
2024-09-24 21:45:37,300 - CMD: fastq-dump -N 7486087 -X 11229129 -O /tmp/pfd_x175do_s/2 --split-files --gzip SRR30723333
2024-09-24 21:45:37,300 - CMD: fastq-dump -N 11229130 -X 14972172 -O /tmp/pfd_x175do_s/3 --split-files --gzip SRR30723333
2024-09-24 21:45:37,301 - CMD: fastq-dump -N 14972173 
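
The `blocks` line in the log shows how the 26,201,305 spots are divided into one range per thread, each handled by its own `fastq-dump` call. The sketch below reproduces those boundaries with simple integer arithmetic. It is written to match the log above and is not taken from the tool's source code; the function name `split_blocks` is hypothetical.

```python
def split_blocks(total_spots, n_blocks):
    """Split spots 1..total_spots into n_blocks contiguous [start, end] ranges.

    Every block has total_spots // n_blocks spots, except the last one,
    which also takes the remainder.
    """
    size = total_spots // n_blocks
    blocks = []
    for i in range(n_blocks):
        start = i * size + 1
        end = (i + 1) * size if i < n_blocks - 1 else total_spots
        blocks.append([start, end])
    return blocks

for start, end in split_blocks(26201305, 7):
    print(f"fastq-dump -N {start} -X {end} ...")
```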

### STEP 4: Copy reference genome files that will be used by STAR
STAR is a splice-aware aligner that maps RNA-Seq reads to the reference genome and uses the gene annotation (GTF) to identify known splice junctions.

In [10]:
! wget ftp://ftp.ensembl.org/pub/release-104/fasta/mus_musculus/dna/Mus_musculus.GRCm39.dna.primary_assembly.fa.gz -O data/reference/mouse_genome.fa.gz
! wget ftp://ftp.ensembl.org/pub/release-104/gtf/mus_musculus/Mus_musculus.GRCm39.104.gtf.gz -O data/reference/mouse_annotation.gtf.gz
! wget -O data/reference/mouse_feature_table.txt.gz https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/635/GCF_000001635.27_GRCm39/GCF_000001635.27_GRCm39_feature_table.txt.gz

--2024-09-24 21:49:14--  ftp://ftp.ensembl.org/pub/release-104/fasta/mus_musculus/dna/Mus_musculus.GRCm39.dna.primary_assembly.fa.gz
           => ‘data/reference/mouse_genome.fa.gz’
Resolving ftp.ensembl.org (ftp.ensembl.org)... 193.62.193.169
Connecting to ftp.ensembl.org (ftp.ensembl.org)|193.62.193.169|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /pub/release-104/fasta/mus_musculus/dna ... done.
==> SIZE Mus_musculus.GRCm39.dna.primary_assembly.fa.gz ... 806418890
==> PASV ... done.    ==> RETR Mus_musculus.GRCm39.dna.primary_assembly.fa.gz ... done.
Length: 806418890 (769M) (unauthoritative)


2024-09-24 21:49:39 (32.7 MB/s) - ‘data/reference/mouse_genome.fa.gz’ saved [806418890]

--2024-09-24 21:49:39--  ftp://ftp.ensembl.org/pub/release-104/gtf/mus_musculus/Mus_musculus.GRCm39.104.gtf.gz
           => ‘data/reference/mouse_annotation.gtf.gz’
Resolving ftp.ensembl.org (ftp.ensembl.org)... 193.62

In [11]:
!gunzip -f data/reference/mouse_genome.fa.gz 
!gunzip -f data/reference/mouse_annotation.gtf.gz
!gunzip -f data/reference/mouse_feature_table.txt.gz


gzip: data/reference/mouse_feature_table.txt.gz: unexpected end of file
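
The `unexpected end of file` message above indicates that a download was truncated. One way to catch this earlier is to test each archive before running `gunzip`. The sketch below uses Python's standard `gzip` module; the helper name `gzip_is_complete` is hypothetical.

```python
import gzip
import zlib

def gzip_is_complete(path, chunk_size=1 << 20):
    """Return True if `path` decompresses to the end without error."""
    try:
        with gzip.open(path, "rb") as handle:
            # Read in chunks so large genomes are not held in memory.
            while handle.read(chunk_size):
                pass
    except (EOFError, OSError, zlib.error):
        # Truncated, corrupt, or missing files all end up here.
        return False
    return True
```

For example, `gzip_is_complete("data/reference/mouse_feature_table.txt.gz")` returning `False` would suggest repeating the `wget` command for that file before decompressing it.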


### STEP 5: Copy data file for Trimmomatic

In [12]:
! gsutil -m cp -r gs://rnaseq-myco-bucket/reference/TruSeq3-PE.fa .

Copying gs://rnaseq-myco-bucket/reference/TruSeq3-PE.fa...
/ [1/1 files][   95.0 B/   95.0 B] 100% Done                                    
Operation completed over 1 objects/95.0 B.                                       


### STEP 6: Run Trimmomatic
Trimmomatic will trim off any adapter sequences or low quality sequence it detects in the FASTQ files.

Trimmomatic takes the input files, in this case a forward and a reverse FASTQ file, and the user names the outputs. The first two outputs are the ‘paired’ and ‘orphaned’ trimmed files for the forward reads. The second two are the ‘paired’ and ‘orphaned’ trimmed files for the reverse reads.

Paired in this case means that both the forward and reverse read of a pair survived trimming. Orphaned, or unpaired, typically means one of the reads was discarded as a result of trimming, and only the forward or reverse read survived.

Here we take only the files with paired reads, as there are typically only a few orphaned reads, and including orphaned reads can complicate downstream analyses. Unless there is a significant number of them, or a specific reason to use them, it is generally easier to discard unpaired reads. In the interest of simplicity and speed, the FastQC step below examines only the forward read file, while the alignment step uses both the forward and reverse paired files.

The last part of the command specifies how the trimming is done:
`ILLUMINACLIP:TruSeq3-PE.fa:2:30:10:2:keepBothReads LEADING:3 TRAILING:3 MINLEN:36`

In greater detail:

‘ILLUMINACLIP:TruSeq3-PE.fa’ names the FASTA file of adapter sequences that should be cut from the reads.

‘2:30:10:2’ sets four thresholds, here at commonly recommended values.

‘2’ is the seed mismatch count. This is the maximum number of mismatches a 'seed' may have when aligning to a possible adapter.

‘30’ is the palindrome clip threshold. This is the alignment score that adapter-ligated forward and reverse reads must reach against each other. If the score exceeds this threshold, the adapter fragments are trimmed. By default the forward read is clipped and the reverse read dropped; the trailing ‘keepBothReads’ field is intended to retain the reverse read as well.

‘10’ is the simple clip threshold, used by the basic alternative to palindrome searching. Each adapter is tested against the read and, if the match scores above the threshold (in this case, 10), the adapter is clipped.

‘2’ is the minimum adapter fragment length in palindrome mode.

‘LEADING:3’ trims bases from the start of a read. Bases are removed one at a time for as long as they remain below a Phred score of 3.

‘TRAILING:3’ does the same as above, but from the end of a read.

‘MINLEN:36’ discards any read that is shorter than 36 bases after trimming.

More information about these parameters can be found in the Trimmomatic documentation.
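
To make the `LEADING`, `TRAILING`, and `MINLEN` steps concrete, here is a simplified sketch of how they act on a single read. It is an illustration of the rules described above, not Trimmomatic's own implementation, and it does not model `ILLUMINACLIP`. The function name `trim_read` and the example read are made up.

```python
def trim_read(seq, quals, leading=3, trailing=3, minlen=36):
    """Apply LEADING, TRAILING and MINLEN style trimming to one read.

    `quals` holds one Phred score per base. Returns the trimmed
    (sequence, qualities) pair, or None if the read is too short.
    """
    start, end = 0, len(seq)
    # LEADING: drop bases from the start while quality is below the threshold.
    while start < end and quals[start] < leading:
        start += 1
    # TRAILING: the same idea, working inwards from the end.
    while end > start and quals[end - 1] < trailing:
        end -= 1
    # MINLEN: discard reads that became too short.
    if end - start < minlen:
        return None
    return seq[start:end], quals[start:end]

# Made-up 8-base read with two low-quality bases at each end.
print(trim_read("ACGTACGT", [2, 2, 30, 30, 30, 30, 2, 2], minlen=4))
```

With the tutorial's `MINLEN:36`, the same short example read would be discarded, because only four bases remain after trimming.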

In [13]:
! trimmomatic PE -threads $THREADS data/trunc_rawfastq/SRR30723333_1.fastq.gz data/trunc_rawfastq/SRR30723333_2.fastq.gz \
data/trimmed/SRR30723333_1_trimmed.fastq.gz data/trimmed/SRR30723333_1_trimmed_unpaired.fastq.gz \
data/trimmed/SRR30723333_2_trimmed.fastq.gz data/trimmed/SRR30723333_2_trimmed_unpaired.fastq.gz \
ILLUMINACLIP:TruSeq3-PE.fa:2:30:10:2:keepBothReads LEADING:3 TRAILING:3 MINLEN:36

TrimmomaticPE: Started with arguments:
 -threads 7 data/trunc_rawfastq/SRR30723333_1.fastq.gz data/trunc_rawfastq/SRR30723333_2.fastq.gz data/trimmed/SRR30723333_1_trimmed.fastq.gz data/trimmed/SRR30723333_1_trimmed_unpaired.fastq.gz data/trimmed/SRR30723333_2_trimmed.fastq.gz data/trimmed/SRR30723333_2_trimmed_unpaired.fastq.gz ILLUMINACLIP:TruSeq3-PE.fa:2:30:10:2:keepBothReads LEADING:3 TRAILING:3 MINLEN:36
Using PrefixPair: 'TACACTCTTTCCCTACACGACGCTCTTCCGATCT' and 'GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT'
ILLUMINACLIP: Using 1 prefix pairs, 0 forward/reverse sequences, 0 forward only sequences, 0 reverse only sequences
Quality encoding detected as phred33
Input Read Pairs: 26201305 Both Surviving: 26201305 (100.00%) Forward Only Surviving: 0 (0.00%) Reverse Only Surviving: 0 (0.00%) Dropped: 0 (0.00%)
TrimmomaticPE: Completed successfully
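
When many samples are processed, it is convenient to pull the survival counts out of the Trimmomatic summary line automatically. The sketch below parses the line format shown in the output above with a regular expression. The pattern and the function name `parse_trimmomatic_summary` are assumptions based on that one line.

```python
import re

SUMMARY_PATTERN = re.compile(
    r"Input Read Pairs: (?P<input>\d+) "
    r"Both Surviving: (?P<both>\d+) .*?"
    r"Forward Only Surviving: (?P<forward_only>\d+) .*?"
    r"Reverse Only Surviving: (?P<reverse_only>\d+) .*?"
    r"Dropped: (?P<dropped>\d+)"
)

def parse_trimmomatic_summary(line):
    """Return the read-pair counts from a paired-end summary line as integers."""
    match = SUMMARY_PATTERN.search(line)
    if match is None:
        raise ValueError("not a Trimmomatic paired-end summary line")
    return {key: int(value) for key, value in match.groupdict().items()}

# Summary line copied from the output above.
line = ("Input Read Pairs: 26201305 Both Surviving: 26201305 (100.00%) "
        "Forward Only Surviving: 0 (0.00%) Reverse Only Surviving: 0 (0.00%) "
        "Dropped: 0 (0.00%)")
counts = parse_trimmomatic_summary(line)
print(f"{100 * counts['both'] / counts['input']:.2f}% of pairs survived")
```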


### STEP 7: Run FastQC
FastQC is an invaluable tool that allows you to evaluate whether there are problems with a set of reads. For example, it will provide a report of whether there is any bias in the sequence composition of the reads.

Because Jupyter is at its core a Python environment, we can use Python code and HTML support to display results in-line.

FastQC looks for different characteristics of quality in reads. It is very rare that every metric will pass. In many cases, the results serve as warnings, which should be interpreted in the context of the experiment. In this run (see the MultiQC table in STEP 8), per base sequence content, sequence length distribution, and overrepresented sequences raise warnings, while per base sequence quality, per tile sequence quality, and sequence duplication levels fail. Per base sequence content is routinely flagged in RNA-Seq in roughly the first 15 bases because of biased priming and fragmentation (a failure means a difference of 20% or more between A and T, or between G and C), so this is not unexpected. Overrepresented sequences can be checked with BLAST; in RNA-Seq they are often ribosomal RNA, a common contaminant that is not usually a large concern. The failed per base and per tile sequence quality modules should be inspected in the report before relying on downstream results.

In [14]:
! fastqc -o data/fastqc data/trimmed/SRR30723333_1_trimmed.fastq.gz

from IPython.display import IFrame
IFrame(src='./data/fastqc/SRR30723333_1_trimmed_fastqc.html', width=800, height=600)

application/gzip
Started analysis of SRR30723333_1_trimmed.fastq.gz
Approx 5% complete for SRR30723333_1_trimmed.fastq.gz
Approx 10% complete for SRR30723333_1_trimmed.fastq.gz
Approx 15% complete for SRR30723333_1_trimmed.fastq.gz
Approx 20% complete for SRR30723333_1_trimmed.fastq.gz
Approx 25% complete for SRR30723333_1_trimmed.fastq.gz
Approx 30% complete for SRR30723333_1_trimmed.fastq.gz
Approx 35% complete for SRR30723333_1_trimmed.fastq.gz
Approx 40% complete for SRR30723333_1_trimmed.fastq.gz
Approx 45% complete for SRR30723333_1_trimmed.fastq.gz
Approx 50% complete for SRR30723333_1_trimmed.fastq.gz
Approx 55% complete for SRR30723333_1_trimmed.fastq.gz
Approx 60% complete for SRR30723333_1_trimmed.fastq.gz
Approx 65% complete for SRR30723333_1_trimmed.fastq.gz
Approx 70% complete for SRR30723333_1_trimmed.fastq.gz
Approx 75% complete for SRR30723333_1_trimmed.fastq.gz
Approx 80% complete for SRR30723333_1_trimmed.fastq.gz
Approx 85% complete for SRR30723333_1_trimmed.fastq.g
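
The per base sequence content check described above can be illustrated with a few lines of Python. The sketch below flags read positions where the percentages of A and T, or of G and C, differ by more than 10% (warning) or 20% (failure), following the thresholds in the FastQC documentation. It is a simplified illustration on toy reads, not FastQC's implementation, and the function name `base_content_flags` is hypothetical.

```python
from collections import Counter

def base_content_flags(reads, warn=10.0, fail=20.0):
    """Flag read positions where %A vs %T or %G vs %C diverge."""
    flags = []
    length = max(len(read) for read in reads)
    for pos in range(length):
        # Count the bases seen at this position across all reads.
        counts = Counter(read[pos] for read in reads if pos < len(read))
        total = sum(counts.values())
        pct = {base: 100.0 * counts[base] / total for base in "ACGT"}
        diff = max(abs(pct["A"] - pct["T"]), abs(pct["G"] - pct["C"]))
        if diff > fail:
            flags.append((pos, "fail"))
        elif diff > warn:
            flags.append((pos, "warn"))
    return flags

# Toy reads: position 0 is always 'A', position 1 is perfectly balanced.
print(base_content_flags(["AA", "AT", "AG", "AC"]))
```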

### STEP 8: Run MultiQC
MultiQC reads in the FastQC reports and generates a compiled report for all the analyzed FASTQ files.

Being able to use Python with bash also means we can seamlessly use popular Python packages, such as pandas, to interact with or view the files we create.

In [15]:
! multiqc -f data/fastqc

import pandas as pd
dframe = pd.read_csv("./multiqc_data/multiqc_fastqc.txt", sep='\t')
display(dframe)


/// MultiQC 🔍 v1.25

       file_search | Search path: /home/ec2-user/SageMaker/data/fastqc
         searching | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 2/2
            fastqc | Found 1 reports
     write_results | Data        : multiqc_data   (overwritten)
     write_results | Report      : multiqc_report.html   (overwritten)
           multiqc | MultiQC complete


Unnamed: 0,Sample,Filename,File type,Encoding,Total Sequences,Total Bases,Sequences flagged as poor quality,Sequence length,%GC,total_deduplicated_percentage,...,per_base_sequence_quality,per_tile_sequence_quality,per_sequence_quality_scores,per_base_sequence_content,per_sequence_gc_content,per_base_n_content,sequence_length_distribution,sequence_duplication_levels,overrepresented_sequences,adapter_content
0,SRR30723333_1,SRR30723333_1_trimmed.fastq.gz,Conventional base calls,Sanger / Illumina 1.9,26201305.0,1.3 Gbp,0.0,50-51,50.0,49.980388,...,fail,fail,pass,warn,pass,pass,warn,fail,warn,pass
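
With many columns, it is easier to group the FastQC modules by status than to read the row by eye. The sketch below does this for a plain dictionary, with the status values copied from the table above. The function name `modules_by_status` is hypothetical.

```python
def modules_by_status(row):
    """Group FastQC module names by their pass/warn/fail status."""
    grouped = {"pass": [], "warn": [], "fail": []}
    for name, value in row.items():
        # Non-status columns (sample name, counts, ...) are skipped.
        if value in grouped:
            grouped[value].append(name)
    return grouped

# Status columns copied from the MultiQC table above.
row = {
    "per_base_sequence_quality": "fail",
    "per_tile_sequence_quality": "fail",
    "per_sequence_quality_scores": "pass",
    "per_base_sequence_content": "warn",
    "per_sequence_gc_content": "pass",
    "per_base_n_content": "pass",
    "sequence_length_distribution": "warn",
    "sequence_duplication_levels": "fail",
    "overrepresented_sequences": "warn",
    "adapter_content": "pass",
}
status = modules_by_status(row)
print("fail:", status["fail"])
print("warn:", status["warn"])
```

In the notebook, the same function could be applied to the DataFrame loaded above, for example with `modules_by_status(dframe.iloc[0].to_dict())`.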


### STEP 9: Generate STAR Genome Index

In [35]:
import os
import subprocess

# THREADS was computed in STEP 2 and is reused here.
os.makedirs("data/reference/STAR_index", exist_ok=True)

# Build the STAR genome index; /usr/bin/time -v reports peak memory and run time.
star_index_cmd = [
    "/usr/bin/time", "-v",
    "STAR", "--runThreadN", str(THREADS),
    "--runMode", "genomeGenerate",
    "--genomeDir", "data/reference/STAR_index",
    "--genomeFastaFiles", "data/reference/mouse_genome.fa",
    "--sjdbGTFfile", "data/reference/mouse_annotation.gtf",
    "--sjdbOverhang", "100",
    "--limitGenomeGenerateRAM", "25000000000",
    "--genomeSAsparseD", "2",
]
result = subprocess.run(star_index_cmd, check=True, capture_output=True, text=True)
print(result.stdout)
print(result.stderr)
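
Two of the numeric settings in the index command are easier to reason about with a little arithmetic. The STAR manual describes the ideal `--sjdbOverhang` as the maximum read length minus 1 (while noting that the default of 100 works well in most cases), and `--limitGenomeGenerateRAM` is given in bytes. The helper names below are hypothetical, and the read length of 51 is taken from the `Sequence length` column of the MultiQC table in STEP 8.

```python
def sjdb_overhang(max_read_length):
    """Ideal --sjdbOverhang according to the STAR manual: read length - 1."""
    return max_read_length - 1

def bytes_to_gib(n_bytes):
    """Convert a byte count to gibibytes (2**30 bytes)."""
    return n_bytes / 2**30

print("sjdbOverhang for 51 bp reads:", sjdb_overhang(51))
print(f"RAM limit: {bytes_to_gib(25_000_000_000):.2f} GiB")
```

The second figure gives a rough idea of how much memory the VM needs to have free for index generation with the settings used in this step.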

### STEP 10: Run STAR for Alignment, Prepare and Run RSEM for Quantification

In [20]:
# Create directories for STAR alignment output and RSEM output
!mkdir -p data/aligned_bam
!mkdir -p data/rsem_output

# Run STAR alignment for SRR30723333
!STAR --runThreadN $THREADS \
      --genomeDir data/reference/STAR_index \
      --readFilesIn data/trimmed/SRR30723333_1_trimmed.fastq.gz data/trimmed/SRR30723333_2_trimmed.fastq.gz \
      --readFilesCommand zcat \
      --outFileNamePrefix data/aligned_bam/SRR30723333_ \
      --outSAMtype BAM SortedByCoordinate \
      --quantMode TranscriptomeSAM GeneCounts

	/home/ec2-user/anaconda3/envs/python3/bin/STAR-avx2 --runThreadN 7 --genomeDir data/reference/STAR_index --readFilesIn data/trimmed/SRR30723333_1_trimmed.fastq.gz data/trimmed/SRR30723333_2_trimmed.fastq.gz --readFilesCommand zcat --outFileNamePrefix data/aligned_bam/SRR30723333_ --outSAMtype BAM SortedByCoordinate --quantMode TranscriptomeSAM GeneCounts
	STAR version: 2.7.11b   compiled: 2024-07-03T14:39:20+0000 :/opt/conda/conda-bld/star_1720017372352/work/source
Sep 24 23:32:32 ..... started STAR run
Sep 24 23:32:32 ..... loading genome
Sep 24 23:41:29 ..... started mapping
Sep 24 23:49:03 ..... finished mapping
Sep 24 23:49:04 ..... started sorting BAM
Sep 24 23:49:51 ..... finished successfully
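
Because the alignment was run with `--quantMode GeneCounts`, STAR also writes a `ReadsPerGene.out.tab` file next to the BAM file. Its first four rows are summary counters (`N_unmapped`, `N_multimapping`, `N_noFeature`, `N_ambiguous`), followed by one row per gene with counts for unstranded, read-1-strand, and read-2-strand protocols. The sketch below parses that layout. The function name is hypothetical and the example rows are made up for illustration.

```python
def read_star_gene_counts(lines, column=1):
    """Parse the content of a STAR ReadsPerGene.out.tab file.

    column 1 = unstranded, 2 = read 1 strand, 3 = read 2 strand.
    Returns (summary, counts); summary holds the N_* rows.
    """
    summary, counts = {}, {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        target = summary if fields[0].startswith("N_") else counts
        target[fields[0]] = int(fields[column])
    return summary, counts

# Made-up rows in the ReadsPerGene.out.tab layout.
example = [
    "N_unmapped\t10\t10\t10\n",
    "N_multimapping\t5\t5\t5\n",
    "N_noFeature\t3\t40\t4\n",
    "N_ambiguous\t1\t0\t1\n",
    "geneA\t100\t2\t98\n",
    "geneB\t50\t1\t49\n",
]
summary, counts = read_star_gene_counts(example)
print(summary)
print(counts)
```

With the output prefix used above, the file would be `data/aligned_bam/SRR30723333_ReadsPerGene.out.tab`, and it could be passed in with `open(path)`.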


In [21]:
!mkdir -p data/rsem_reference/mouse_rsem_reference
!rsem-prepare-reference --gtf data/reference/mouse_annotation.gtf \
                        --star \
                        --star-path /sw/STAR \
                        data/reference/mouse_genome.fa \
                        data/rsem_reference/mouse_rsem_reference

rsem-extract-reference-transcripts data/rsem_reference/mouse_rsem_reference 0 data/reference/mouse_annotation.gtf None 0 data/reference/mouse_genome.fa
Parsed 200000 lines
Parsed 400000 lines
Parsed 600000 lines
Parsed 800000 lines
Parsed 1000000 lines
Parsed 1200000 lines
Parsed 1400000 lines
Parsed 1600000 lines
Parsed 1800000 lines
Parsing gtf File is done!
data/reference/mouse_genome.fa is processed!
142434 transcripts are extracted and 0 transcripts are omitted.
Extracting sequences is done!
Group File is generated!
Transcript Information File is generated!
Chromosome List File is generated!
Extracted Sequences File is generated!

rsem-preref data/rsem_reference/mouse_rsem_reference.transcripts.fa 1 data/rsem_reference/mouse_rsem_reference
Refs.makeRefs finished!
Refs.saveRefs finished!
data/rsem_reference/mouse_rsem_reference.idx.fa is generated!
data/rsem_reference/mouse_rsem_reference.n2g.idx.fa is generated!

/sw/STAR/STAR  --runThreadN 1  --runMode genomeGenerate  --genomeDir

In [22]:
!rsem-calculate-expression --paired-end \
                           --alignments \
                           --bam \
                           --star \
                           -p $THREADS \
                           data/aligned_bam/SRR30723333_Aligned.toTranscriptome.out.bam \
                           data/rsem_reference/mouse_rsem_reference \
                           data/rsem_output/SRR30723333

rsem-parse-alignments data/rsem_reference/mouse_rsem_reference data/rsem_output/SRR30723333.temp/SRR30723333 data/rsem_output/SRR30723333.stat/SRR30723333 data/aligned_bam/SRR30723333_Aligned.toTranscriptome.out.bam 3 -tag XM
Parsed 1000000 entries
Parsed 2000000 entries
Parsed 3000000 entries
Parsed 4000000 entries
Parsed 5000000 entries
Parsed 6000000 entries
Parsed 7000000 entries
Parsed 8000000 entries
Parsed 9000000 entries
Parsed 10000000 entries
Parsed 11000000 entries
Parsed 12000000 entries
Parsed 13000000 entries
Parsed 14000000 entries
Parsed 15000000 entries
Parsed 16000000 entries
Parsed 17000000 entries
Parsed 18000000 entries
Parsed 19000000 entries
Parsed 20000000 entries
Parsed 21000000 entries
Parsed 22000000 entries
Parsed 23000000 entries
Parsed 24000000 entries
Parsed 25000000 entries
Parsed 26000000 entries
Parsed 27000000 entries
Parsed 28000000 entries
Parsed 29000000 entries
Parsed 30000000 entries
Parsed 31000000 entries
Parsed 32000000 entries
Parsed 33000000

### STEP 11: Report the top 10 most highly expressed genes in the samples.


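The top 10 most highly expressed genes in sample SRR30723333 are reported below. RSEM writes gene-level results to `SRR30723333.genes.results`. The level of expression is reported in the Transcripts Per Million (`TPM`) and estimated read count (`expected_count`) fields:  
`gene_id    transcript_id(s)    length    effective_length    expected_count    TPM    FPKM`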


In [32]:
import pandas as pd

# Path to RSEM results file for SRR30723333
rsem_result_file = 'data/rsem_output/SRR30723333.genes.results'

# Load the RSEM results into a Pandas DataFrame
df = pd.read_csv(rsem_result_file, sep='\t')

# Sort the DataFrame by TPM values in descending order and get the top 10 genes
top_10_genes = df.sort_values(by='TPM', ascending=False).head(10)

# Print the top 10 genes with their TPM values
print("Top 10 Genes by TPM for SRR30723333:")
print(top_10_genes[['gene_id', 'TPM']])

Top 10 Genes by TPM for SRR30723333:
                  gene_id       TPM
512    ENSMUSG00000002985  38956.97
17919  ENSMUSG00000064356  20611.35
17914  ENSMUSG00000064351  15772.65
35987  ENSMUSG00000100862  15386.48
37133  ENSMUSG00000102070  14513.38
14848  ENSMUSG00000050708  14481.05
36228  ENSMUSG00000101111  11555.59
1497   ENSMUSG00000014542  10865.87
14609  ENSMUSG00000049775   9460.69
10487  ENSMUSG00000036887   7131.72


### STEP 12: Report the expression of ENSMUSG00000101111
An acyl-transferase was reported to be downregulated in the double lysogen, as shown in the table of the top 20 upregulated and downregulated genes from the paper describing the study.
![RNA-Seq workflow](images/table-cushman.png)

Filter the RSEM results loaded in STEP 11 to report the expression of this gene in sample SRR30723333. The level of expression is reported in the Transcripts Per Million (`TPM`) and estimated read count (`expected_count`) fields.

In [33]:
# Filter for the specific gene of interest
gene_id = 'ENSMUSG00000101111'
gene_data = df[df['gene_id'] == gene_id]

# Print the results for the gene
if not gene_data.empty:
    print(f"Expression data for {gene_id}:")
    # RSEM reports estimated read counts in the 'expected_count' column
    print(gene_data[['gene_id', 'TPM', 'expected_count']])
else:
    print(f"No results found for {gene_id} in {rsem_result_file}.")

Expression data for ENSMUSG00000101111:
                  gene_id       TPM  expected_count
36228  ENSMUSG00000101111  11555.59       100285.94



## <a name="workflow">Additional Workflows</a>

Now that you have read counts per gene, feel free to explore the R workflow, which creates plots and analyses from these read count files, or try alternative workflows for creating read count files, such as Snakemake.


[Workflow One:](Tutorial_1.ipynb) A short introduction to downloading and mapping sequences to a transcriptome using Trimmomatic and Salmon. Here is a link to the YouTube video demonstrating the tutorial: <https://youtu.be/ChGfBR4do_Y>.

[Workflow One (Extended):](Tutorial_1B_Extended.ipynb) An extended version of workflow one. Once you have got your feet wet, you can retry workflow one with this extended version that covers the entire dataset, and includes elaboration such as using SRA tools for sequence downloading, and examples of running batches of fastq files through the pipeline. This workflow may take around an hour to run.

[Workflow One (Using Snakemake):](Tutorial_2_Snakemake.ipynb) Using snakemake to run workflow one.

[Workflow Two (DEG Analysis):](Tutorial_3_DEG_Analysis.ipynb) Using DESeq2 and R to conduct clustering and differential gene expression analysis.


![RNA-Seq workflow](images/RNA-Seq_Notebook_Homepage.png)