# RNA-Seq Analysis Training Demo on Azure

## Overview

This short tutorial demonstrates how to run an RNA-Seq workflow using a prokaryotic data set. Steps in the workflow include read trimming, read QC, read mapping, and counting mapped reads per gene to quantitate gene expression.

![RNA-Seq workflow](https://github.com/STRIDES/NIHCloudLabAzure/blob/azurenotebooks-1/tutorials/notebooks/rnaseq-myco-tutorial-main/images/rnaseq-workflow.png)

### STEP 1: Setup Environment

Note that within Jupyter you can run a bash command either by prefixing the command with `!` or by adding the `%%bash` cell magic as the first line of the cell.

For example:\
%%bash\
example command

or\
!example command
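
One practical consequence of both forms is that the command runs in its own shell process, so a `cd` issued there does not carry over to later cells. The sketch below imitates this with Python's `subprocess` module; it is only an analogy for what `!cd / && pwd` does, not how Jupyter is implemented internally. To change the notebook's own working directory, use the `%cd` magic instead.

```python
import os
import subprocess

start_dir = os.getcwd()

# Roughly what a single `!cd / && pwd` line does: a child shell changes
# directory, prints it, and then exits.
result = subprocess.run(
    "cd / && pwd", shell=True, capture_output=True, text=True, check=True
)
print("child shell was in:", result.stdout.strip())

# The notebook (parent) process has not moved.
print("notebook is still in:", os.getcwd())
```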

The first step is to install Mambaforge, which provides mamba, a newer and faster alternative to the conda package manager. Normally we would add mamba to our PATH after installation, but PATH handling can be unreliable in hosted notebook environments, so we will instead assign the `bin/mamba` path to a variable and use that variable for package installation.

In [1]:
!curl -L -O https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-$(uname)-$(uname -m).sh
!bash Mambaforge-$(uname)-$(uname -m).sh -b -p $HOME/mambaforge

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 90.5M  100 90.5M    0     0   120M      0 --:--:-- --:--:-- --:--:--  168M
PREFIX=/home/azureuser/mambaforge
Unpacking payload ...
Extracting python-3.10.6-h582c2e5_0_cpython.tar.bz2
Extracting _libgcc_mutex-0.1-conda_forge.tar.bz2
Extracting ca-certificates-2022.6.15-ha878542_0.tar.bz2
Extracting ld_impl_linux-64-2.36.1-hea4e1c9_2.tar.bz2
Extracting libstdcxx-ng-12.1.0-ha89aaad_16.tar.bz2
Extracting pybind11-abi-4-hd8ed1ab_3.tar.bz2
Extracting tzdata-2022c-h191b570_0.tar.bz2
Extracting libgomp-12.1.0-h8d9b700_16.tar.bz2
Extracting _openmp_mutex-4.5-2_gnu.tar.bz2
Extracting libgcc-ng-12.1.0-h8d9b700_16.tar.bz2
Extracting bzip2-1.0.8-h7f98852_4.tar.bz2
Extracting c-ares-1.

In [2]:
mamba='/home/azureuser/mambaforge/bin/mamba'
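
The hard-coded `/home/azureuser/...` path only works for that user name. As a minimal sketch, the path can be derived instead, assuming the installer was run with `-p $HOME/mambaforge` as in the cell above. The helper name `find_mamba` is ours for illustration and is not part of mamba or conda.

```python
import shutil
from pathlib import Path


def find_mamba(prefix=None):
    """Return a path to the mamba executable as a string.

    Looks on PATH first, then falls back to <prefix>/bin/mamba, where
    prefix defaults to $HOME/mambaforge (the -p value used above).
    """
    on_path = shutil.which("mamba")
    if on_path:
        return on_path
    base = Path(prefix) if prefix else Path.home() / "mambaforge"
    return str(base / "bin" / "mamba")


mamba = find_mamba()
print(mamba)
```

Because the result is an ordinary Python string, `!$mamba ...` in later cells works the same way as with the hard-coded assignment.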

In [3]:
!$mamba info --envs


                  __    __    __    __
                 /  \  /  \  /  \  /  \
                /    \/    \/    \/    \
███████████████/  /██/  /██/  /██/  /████████████████████████
              /  / \   / \   / \   / \  \____
             /  /   \_/   \_/   \_/   \    o \__,
            / _/                       \_____/  `
            |/
        ███╗   ███╗ █████╗ ███╗   ███╗██████╗  █████╗
        ████╗ ████║██╔══██╗████╗ ████║██╔══██╗██╔══██╗
        ██╔████╔██║███████║██╔████╔██║██████╔╝███████║
        ██║╚██╔╝██║██╔══██║██║╚██╔╝██║██╔══██╗██╔══██║
        ██║ ╚═╝ ██║██║  ██║██║ ╚═╝ ██║██████╔╝██║  ██║
        ╚═╝     ╚═╝╚═╝  ╚═╝╚═╝     ╚═╝╚═════╝ ╚═╝  ╚═╝

        mamba (0.25.0) supported by @QuantStack

        GitHub:  https://github.com/mamba-org/mamba
        Twitter: https://twitter.com/QuantStack

█████████████████████████████████████████████████████████████

# conda environments:
#
base                     /home/azureuser/mambaforge



Next, we will install the necessary packages into the current environment. If you want to create a different environment within the notebook instance, follow [these instructions](https://github.com/aws/studio-lab-examples/blob/main/custom-environments/custom_environment.ipynb).

In [4]:
!$mamba install -c conda-forge -c bioconda -c defaults -y sra-tools  pigz=2.6 pbzip2=1.1 trimmomatic=0.36 fastqc=0.11.9 multiqc=1.10.1 salmon=1.5.1 



Looking for: ['sra-tools', 'pigz=2.6', 'pbzip2=1.1', 'trimmomatic=0.36', 'fastqc=0.11.9', 'multiqc=1.10.1', 'sal

Create a set of directories to store the reads, reference sequence files, and output files.


In [5]:
%%bash
mkdir -p data
mkdir -p data/raw_fastq
mkdir -p data/trimmed
mkdir -p data/fastqc
mkdir -p data/aligned
mkdir -p data/reference

### STEP 2: Copy FASTQ Files
So that this tutorial runs quickly, we will analyze only 50,000 reads from one sample in each of the two sample groups, rather than all the reads from all six samples. These files have been posted in an Azure Blob Storage container that we made publicly accessible.

In [3]:
!curl https://storeshare.blob.core.windows.net/publicdata/testsample/RNAseq/raw_fastq/SRR13349122_1.fastq --output data/raw_fastq/SRR13349122_1.fastq
!curl https://storeshare.blob.core.windows.net/publicdata/testsample/RNAseq/raw_fastq/SRR13349122_2.fastq --output data/raw_fastq/SRR13349122_2.fastq
!curl https://storeshare.blob.core.windows.net/publicdata/testsample/RNAseq/raw_fastq/SRR13349128_1.fastq --output data/raw_fastq/SRR13349128_1.fastq
!curl https://storeshare.blob.core.windows.net/publicdata/testsample/RNAseq/raw_fastq/SRR13349128_2.fastq --output data/raw_fastq/SRR13349128_2.fastq

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 8452k  100 8452k    0     0  9709k      0 --:--:-- --:--:-- --:--:-- 9704k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 8452k  100 8452k    0     0  8444k      0  0:00:01  0:00:01 --:--:-- 8452k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 8452k  100 8452k    0     0  11.0M      0 --:--:-- --:--:-- --:--:-- 11.0M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 8452k  100 8452k    0     0  8928k      0 --:--:-- --:--:-- --:--:-- 8925k
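
The four download commands differ only in the sample accession and the mate number. The sketch below generates them from those two values; the function name `fastq_downloads` is hypothetical, and the block only prints the `curl` commands rather than running them. The `--fail` flag is an addition on our part so that an HTTP error does not silently save an error page as a FASTQ file.

```python
BASE_URL = (
    "https://storeshare.blob.core.windows.net/publicdata/"
    "testsample/RNAseq/raw_fastq"
)
SAMPLES = ["SRR13349122", "SRR13349128"]


def fastq_downloads(samples=SAMPLES, base_url=BASE_URL, dest_dir="data/raw_fastq"):
    """Yield (url, destination) pairs for both mates of each sample."""
    for sample in samples:
        for mate in (1, 2):
            name = f"{sample}_{mate}.fastq"
            yield f"{base_url}/{name}", f"{dest_dir}/{name}"


for url, dest in fastq_downloads():
    # Print the command instead of running it, so the cell is safe offline.
    print(f"curl --fail {url} --output {dest}")
```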


### STEP 3: Copy reference transcriptome files that will be used by Salmon
Salmon is a tool that aligns RNA-Seq reads to a set of transcripts rather than the entire genome.

In [1]:
!curl https://storeshare.blob.core.windows.net/publicdata/testsample/RNAseq/reference/M_chelonae_transcripts.fasta --output data/reference/M_chelonae_transcripts.fasta
!curl https://storeshare.blob.core.windows.net/publicdata/testsample/RNAseq/reference/decoys.txt --output data/reference/decoys.txt

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 9599k  100 9599k    0     0   9.9M      0 --:--:-- --:--:-- --:--:--  9.9M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    14  100    14    0     0     34      0 --:--:-- --:--:-- --:--:--    34


### STEP 4: Copy data file for Trimmomatic

In [3]:
!curl https://storeshare.blob.core.windows.net/publicdata/testsample/RNAseq/TruSeq3-PE.fa --output TruSeq3-PE.fa

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    95  100    95    0     0    368      0 --:--:-- --:--:-- --:--:--   368


### STEP 5: Run Trimmomatic
Trimmomatic will trim off any adapter sequences or low quality sequence it detects in the FASTQ files.
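
The paired-end command used later in this step is long, so here is a sketch that assembles it piece by piece. The function name and the relative jar path are assumptions for illustration; the trimming steps are the ones used in this tutorial. The outputs follow the order given in Trimmomatic's usage message: paired and unpaired files for read 1, then paired and unpaired files for read 2.

```python
def trimmomatic_pe_command(
    sample,
    jar="Trimmomatic-0.39/dist/jar/trimmomatic-0.39.jar",  # assumed location
    adapters="TruSeq3-PE.fa",
    threads=2,
):
    """Build the Trimmomatic paired-end argument list for one sample."""
    trimmed = f"trimmed/{sample}"
    return [
        "java", "-jar", jar, "PE", "-threads", str(threads),
        # Inputs: forward and reverse mates.
        f"raw_fastq/{sample}_1.fastq",
        f"raw_fastq/{sample}_2.fastq",
        # Outputs in usage order: 1P, 1U, 2P, 2U.
        f"{trimmed}_1_trimmed.fastq",
        f"{trimmed}_1_trimmed_unpaired.fastq",
        f"{trimmed}_2_trimmed.fastq",
        f"{trimmed}_2_trimmed_unpaired.fastq",
        # Adapter clipping: seed mismatches 2, palindrome clip threshold 30,
        # simple clip threshold 10, minimum adapter length 2.
        f"ILLUMINACLIP:{adapters}:2:30:10:2:keepBothReads",
        "LEADING:3",   # trim leading bases with quality below 3
        "TRAILING:3",  # trim trailing bases with quality below 3
        "MINLEN:36",   # drop reads shorter than 36 bases after trimming
    ]


for sample in ("SRR13349122", "SRR13349128"):
    print(" ".join(trimmomatic_pe_command(sample)))
```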

### Install Trimmomatic

In [12]:
!wget https://github.com/usadellab/Trimmomatic/archive/refs/tags/v0.39.zip

--2022-10-25 20:06:35--  https://github.com/usadellab/Trimmomatic/archive/refs/tags/v0.39.zip
Resolving github.com (github.com)... 140.82.112.4
Connecting to github.com (github.com)|140.82.112.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://codeload.github.com/usadellab/Trimmomatic/zip/refs/tags/v0.39 [following]
--2022-10-25 20:06:35--  https://codeload.github.com/usadellab/Trimmomatic/zip/refs/tags/v0.39
Resolving codeload.github.com (codeload.github.com)... 140.82.114.9
Connecting to codeload.github.com (codeload.github.com)|140.82.114.9|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/zip]
Saving to: ‘v0.39.zip’

v0.39.zip               [ <=>                ] 106.71K  --.-KB/s    in 0.02s   

2022-10-25 20:06:36 (4.28 MB/s) - ‘v0.39.zip’ saved [109271]



In [13]:
!unzip v0.39.zip

Archive:  v0.39.zip
e77bb89c70f60cb139393a22e89af47c5c1c69e8
   creating: Trimmomatic-0.39/
  inflating: Trimmomatic-0.39/MANIFEST.MF  
 extracting: Trimmomatic-0.39/README.md  
   creating: Trimmomatic-0.39/adapters/
  inflating: Trimmomatic-0.39/adapters/NexteraPE-PE.fa  
  inflating: Trimmomatic-0.39/adapters/TruSeq2-PE.fa  
  inflating: Trimmomatic-0.39/adapters/TruSeq2-SE.fa  
  inflating: Trimmomatic-0.39/adapters/TruSeq3-PE-2.fa  
  inflating: Trimmomatic-0.39/adapters/TruSeq3-PE.fa  
  inflating: Trimmomatic-0.39/adapters/TruSeq3-SE.fa  
  inflating: Trimmomatic-0.39/build.xml  
   creating: Trimmomatic-0.39/distSrc/
  inflating: Trimmomatic-0.39/distSrc/LICENSE  
   creating: Trimmomatic-0.39/lib/
  inflating: Trimmomatic-0.39/lib/jbzip2-0.9.jar  
   creating: Trimmomatic-0.39/src/
   creating: Trimmomatic-0.39/src/org/
   creating: Trimmomatic-0.39/src/org/usadellab/
   creating: Trimmomatic-0.39/src/org/usadellab/trimmomatic/
  inflating: Trimmomatic-0.39/src/org/usadellab/t

### Install 'ant'

In [9]:
!sudo apt-get install -y ant

Reading package lists... Done
Building dependency tree       
Reading state information... Done
ant is already the newest version (1.10.7-1).
The following packages were automatically installed and are no longer required:
  cmake-data cuda-command-line-tools-11-3 cuda-compiler-11-3 cuda-cudart-11-3
  cuda-cudart-dev-11-3 cuda-cuobjdump-11-3 cuda-cupti-11-3 cuda-cupti-dev-11-3
  cuda-cuxxfilt-11-3 cuda-documentation-11-3 cuda-driver-dev-11-3
  cuda-gdb-11-3 cuda-libraries-11-3 cuda-libraries-dev-11-3 cuda-memcheck-11-3
  cuda-nsight-11-3 cuda-nsight-compute-11-3 cuda-nsight-systems-11-3
  cuda-nvcc-11-3 cuda-nvdisasm-11-3 cuda-nvml-dev-11-3 cuda-nvprof-11-3
  cuda-nvprune-11-3 cuda-nvrtc-11-3 cuda-nvrtc-dev-11-3 cuda-nvtx-11-3
  cuda-nvvp-11-3 cuda-samples-11-3 cuda-sanitizer-11-3 cuda-thrust-11-3
  cuda-toolkit-11-3 cuda-toolkit-11-3-config-common
  cuda-toolkit-11-config-common cuda-toolkit-config-common cuda-tools-11-3
  cuda-visual-tools-11-3 libcublas-11-3 libcublas-dev-11-3 libcuf

In [None]:
!cd Trimmomatic-0.39 && sudo ant

In [19]:
!java -jar Trimmomatic-0.39/dist/jar/trimmomatic-0.39.jar

Usage: 
       PE [-version] [-threads <threads>] [-phred33|-phred64] [-trimlog <trimLogFile>] [-summary <statsSummaryFile>] [-quiet] [-validatePairs] [-basein <inputBase> | <inputFile1> <inputFile2>] [-baseout <outputBase> | <outputFile1P> <outputFile1U> <outputFile2P> <outputFile2U>] <trimmer1>...
   or: 
       SE [-version] [-threads <threads>] [-phred33|-phred64] [-trimlog <trimLogFile>] [-summary <statsSummaryFile>] [-quiet] <inputFile> <outputFile> <trimmer1>...
   or: 
       -version


In [None]:
cd /tutorials/notebooks/rnaseq-myco-tutorial-main/

In [None]:
cd data/

In [37]:
!java -jar /tutorials/notebooks/rnaseq-myco-tutorial-main/Trimmomatic-0.39/dist/jar/trimmomatic-0.39.jar PE -threads 2 raw_fastq/SRR13349122_1.fastq raw_fastq/SRR13349122_2.fastq trimmed/SRR13349122_1_trimmed.fastq trimmed/SRR13349122_1_trimmed_unpaired.fastq trimmed/SRR13349122_2_trimmed.fastq trimmed/SRR13349122_2_trimmed_unpaired.fastq ILLUMINACLIP:TruSeq3-PE.fa:2:30:10:2:keepBothReads LEADING:3 TRAILING:3 MINLEN:36
!java -jar /tutorials/notebooks/rnaseq-myco-tutorial-main/Trimmomatic-0.39/dist/jar/trimmomatic-0.39.jar PE -threads 2 raw_fastq/SRR13349128_1.fastq raw_fastq/SRR13349128_2.fastq trimmed/SRR13349128_1_trimmed.fastq trimmed/SRR13349128_1_trimmed_unpaired.fastq trimmed/SRR13349128_2_trimmed.fastq trimmed/SRR13349128_2_trimmed_unpaired.fastq ILLUMINACLIP:TruSeq3-PE.fa:2:30:10:2:keepBothReads LEADING:3 TRAILING:3 MINLEN:36

TrimmomaticPE: Started with arguments:
 -threads 2 raw_fastq/SRR13349122_1.fastq raw_fastq/SRR13349122_2.fastq trimmed/SRR13349122_1_trimmed.fastq trimmed/SRR13349122_2_trimmed.fastq trimmed/SRR13349122_1_trimmed_unpaired.fastq trimmed/SRR13349122_2_trimmed_unpaired.fastq ILLUMINACLIP:TruSeq3-PE.fa:2:30:10:2:keepBothReads LEADING:3 TRAILING:3 MINLEN:36
Using PrefixPair: 'TACACTCTTTCCCTACACGACGCTCTTCCGATCT' and 'GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT'
ILLUMINACLIP: Using 1 prefix pairs, 0 forward/reverse sequences, 0 forward only sequences, 0 reverse only sequences
Quality encoding detected as phred33
Input Read Pairs: 50000 Both Surviving: 49870 (99.74%) Forward Only Surviving: 130 (0.26%) Reverse Only Surviving: 0 (0.00%) Dropped: 0 (0.00%)
TrimmomaticPE: Completed successfully
TrimmomaticPE: Started with arguments:
 -threads 2 raw_fastq/SRR13349128_1.fastq raw_fastq/SRR13349128_2.fastq trimmed/SRR13349128_1_trimmed.fastq trimmed/SRR13349128_2_trimmed.fastq trimmed/SRR13349128_1_trimmed_u
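
To check the trimming results programmatically rather than by eye, the `Input Read Pairs: ...` summary line can be parsed. This is a sketch that assumes the paired-end summary format shown in the output above; the function name is ours, and the example line is copied from that output.

```python
import re

SUMMARY_RE = re.compile(
    r"Input Read Pairs: (?P<input>\d+) "
    r"Both Surviving: (?P<both>\d+) \([\d.]+%\) "
    r"Forward Only Surviving: (?P<forward_only>\d+) \([\d.]+%\) "
    r"Reverse Only Surviving: (?P<reverse_only>\d+) \([\d.]+%\) "
    r"Dropped: (?P<dropped>\d+) \([\d.]+%\)"
)


def parse_trimmomatic_summary(log_text):
    """Return the read-pair counts from a Trimmomatic PE log as integers."""
    match = SUMMARY_RE.search(log_text)
    if match is None:
        raise ValueError("no Trimmomatic paired-end summary line found")
    return {key: int(value) for key, value in match.groupdict().items()}


line = (
    "Input Read Pairs: 50000 Both Surviving: 49870 (99.74%) "
    "Forward Only Surviving: 130 (0.26%) "
    "Reverse Only Surviving: 0 (0.00%) Dropped: 0 (0.00%)"
)
counts = parse_trimmomatic_summary(line)
print(counts)

# Sanity check: the four outcome categories account for every input pair.
outcomes = ("both", "forward_only", "reverse_only", "dropped")
assert sum(counts[k] for k in outcomes) == counts["input"]
```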

### STEP 6: Run FastQC
FastQC is an invaluable tool that allows you to evaluate whether there are problems with a set of reads. For example, it will provide a report of whether there is any bias in the sequence composition of the reads.

Once FastQC is done running, look at the outputs in data/fastqc. What can you say about the quality of the two samples we are looking at here? 

### Install FastQC

In [None]:
!conda install -c bioconda -y fastqc

In [1]:
%%bash
fastqc -o data/fastqc data/trimmed/SRR13349122_1_trimmed.fastq
fastqc -o data/fastqc data/trimmed/SRR13349128_1_trimmed.fastq

Started analysis of SRR13349122_1_trimmed.fastq
Approx 5% complete for SRR13349122_1_trimmed.fastq
Approx 10% complete for SRR13349122_1_trimmed.fastq
Approx 15% complete for SRR13349122_1_trimmed.fastq
Approx 20% complete for SRR13349122_1_trimmed.fastq
Approx 25% complete for SRR13349122_1_trimmed.fastq
Approx 30% complete for SRR13349122_1_trimmed.fastq
Approx 35% complete for SRR13349122_1_trimmed.fastq
Approx 40% complete for SRR13349122_1_trimmed.fastq
Approx 45% complete for SRR13349122_1_trimmed.fastq
Approx 50% complete for SRR13349122_1_trimmed.fastq
Approx 55% complete for SRR13349122_1_trimmed.fastq
Approx 60% complete for SRR13349122_1_trimmed.fastq
Approx 65% complete for SRR13349122_1_trimmed.fastq
Approx 70% complete for SRR13349122_1_trimmed.fastq
Approx 75% complete for SRR13349122_1_trimmed.fastq
Approx 80% complete for SRR13349122_1_trimmed.fastq
Approx 85% complete for SRR13349122_1_trimmed.fastq
Approx 90% complete for SRR13349122_1_trimmed.fastq
Approx 95% comple

Analysis complete for SRR13349122_1_trimmed.fastq


Started analysis of SRR13349128_1_trimmed.fastq
Approx 5% complete for SRR13349128_1_trimmed.fastq
Approx 10% complete for SRR13349128_1_trimmed.fastq
Approx 15% complete for SRR13349128_1_trimmed.fastq
Approx 20% complete for SRR13349128_1_trimmed.fastq
Approx 25% complete for SRR13349128_1_trimmed.fastq
Approx 30% complete for SRR13349128_1_trimmed.fastq
Approx 35% complete for SRR13349128_1_trimmed.fastq
Approx 40% complete for SRR13349128_1_trimmed.fastq
Approx 45% complete for SRR13349128_1_trimmed.fastq
Approx 50% complete for SRR13349128_1_trimmed.fastq
Approx 55% complete for SRR13349128_1_trimmed.fastq
Approx 60% complete for SRR13349128_1_trimmed.fastq
Approx 65% complete for SRR13349128_1_trimmed.fastq
Approx 70% complete for SRR13349128_1_trimmed.fastq
Approx 75% complete for SRR13349128_1_trimmed.fastq
Approx 80% complete for SRR13349128_1_trimmed.fastq
Approx 85% complete for SRR13349128_1_trimmed.fastq
Approx 90% complete for SRR13349128_1_trimmed.fastq
Approx 95% comple

Analysis complete for SRR13349128_1_trimmed.fastq
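
Besides the HTML report, FastQC writes a `<name>_fastqc.zip` archive for each input file, and that archive contains a tab-separated `summary.txt` listing a PASS, WARN, or FAIL status per module. The sketch below lists the modules that did not pass; the function names are ours, and the loop simply prints nothing if `data/fastqc` holds no archives yet.

```python
import zipfile
from pathlib import Path


def parse_fastqc_summary(text):
    """Map each FastQC module name to its PASS/WARN/FAIL status."""
    statuses = {}
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) >= 2:
            status, module = parts[0], parts[1]
            statuses[module] = status
    return statuses


def flagged_modules(zip_path):
    """Return the modules marked WARN or FAIL in one FastQC archive."""
    with zipfile.ZipFile(zip_path) as archive:
        # summary.txt sits inside the archive's single top-level folder.
        member = next(n for n in archive.namelist() if n.endswith("/summary.txt"))
        text = archive.read(member).decode("utf-8")
    return {m: s for m, s in parse_fastqc_summary(text).items() if s != "PASS"}


for report in sorted(Path("data/fastqc").glob("*_fastqc.zip")):
    print(report.name, flagged_modules(report))
```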


### STEP 7: Run MultiQC
MultiQC reads in the FastQC reports and generates a compiled report for all the analyzed FASTQ files.

Just as with FastQC, we can look at the MultiQC results once it finishes; they are written to data/multiqc_data.

### Install MultiQC

In [None]:
!conda install -c bioconda -c conda-forge -y multiqc

In [None]:
!multiqc -f data/fastqc
!mv multiqc_data/ data/

### STEP 8: Index the Transcriptome so that Trimmed Reads Can Be Mapped Using Salmon

### Install Salmon

In [None]:
!conda install -c bioconda -y salmon

In [3]:
!salmon index -t data/reference/M_chelonae_transcripts.fasta -p 8 -i data/reference/transcriptome_index --decoys data/reference/decoys.txt -k 31 --keepDuplicates

Version Info: ### PLEASE UPGRADE SALMON ###
### A newer version of salmon with bug fixes is available. ####
###
The newest version, available at https://github.com/COMBINE-lab/salmon/releases
contains new features, improvements, and bug fixes; please upgrade at your
earliest convenience.
###
Sign up for the salmon mailing list to hear about new versions, features and updates at:
https://oceangenomics.com/subscribe
###
index ["data/reference/transcriptome_index"] did not previously exist  . . . creating it
[2022-10-25 20:55:18.902] [jLog] [info] building index
[2022-10-25 20:55:19.014] [jointLog] [info] [Step 1 of 4] : counting k-mers
Elapsed time: 0.51723s

[2022-10-25 20:55:19.669] [jointLog] [info] Replaced 0 non-ATCG nucleotides
[2022-10-25 20:55:19.669] [jointLog] [info] Clipped poly-A tails from 0 transcripts
[2022-10-25 20:55:19.835] [jointLog] [info] Building rank-select dictionary and saving to disk
[2022-10-25 20:55:19.849] [jo

### STEP 9: Run Salmon to Map Reads to Transcripts and Quantify Expression Levels
Salmon aligns the trimmed reads to the reference transcriptome and generates the read counts per transcript. In this analysis, each gene has a single transcript.
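
The two `salmon quant` calls in this step differ only in the sample accession, so they can be generated from a list of samples. The helper name below is hypothetical; the flags are the ones this tutorial uses. `-l SR` declares a single-end, stranded library with reads from the reverse strand, and only the read 1 file of each sample is quantified.

```python
def salmon_quant_command(sample, index="data/reference/transcriptome_index", threads=8):
    """Build the salmon quant argument list for one sample's trimmed read 1 file."""
    return [
        "salmon", "quant",
        "-i", index,
        "-l", "SR",  # library type used in this tutorial
        "-r", f"data/trimmed/{sample}_1_trimmed.fastq",
        "-p", str(threads),
        "--validateMappings",
        "-o", f"data/quants/{sample}_quant",
    ]


for sample in ("SRR13349122", "SRR13349128"):
    print(" ".join(salmon_quant_command(sample)))
```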

In [4]:
%%bash
salmon quant -i data/reference/transcriptome_index -l SR -r data/trimmed/SRR13349122_1_trimmed.fastq -p 8 --validateMappings -o data/quants/SRR13349122_quant
salmon quant -i data/reference/transcriptome_index -l SR -r data/trimmed/SRR13349128_1_trimmed.fastq -p 8 --validateMappings -o data/quants/SRR13349128_quant

Version Info: ### PLEASE UPGRADE SALMON ###
### A newer version of salmon with bug fixes is available. ####
###
The newest version, available at https://github.com/COMBINE-lab/salmon/releases
contains new features, improvements, and bug fixes; please upgrade at your
earliest convenience.
###
Sign up for the salmon mailing list to hear about new versions, features and updates at:
https://oceangenomics.com/subscribe
###
### salmon (mapping-based) v0.14.1
### [ program ] => salmon 
### [ command ] => quant 
### [ index ] => { data/reference/transcriptome_index }
### [ libType ] => { SR }
### [ unmatedReads ] => { data/trimmed/SRR13349122_1_trimmed.fastq }
### [ threads ] => { 8 }
### [ validateMappings ] => { }
### [ output ] => { data/quants/SRR13349122_quant }
Logs will be written to data/quants/SRR13349122_quant/logs
[2022-10-25 20:55:43.645] [jointLog] [info] Fragment incompatibility prior below threshold.  Incompatible fragments will be ignored.
[2022-10-25 20:55:43.645] [jointLog] [

### STEP 10: Report the top 10 most highly expressed genes in the samples

Top 10 most highly expressed genes in the wild-type sample.


In [6]:
!sort -nrk 4,4 data/quants/SRR13349122_quant/quant.sf | head -10

BB28_RS23830	213	11.625	46100.292029	5.000
BB28_RS02220	204	10.377	30985.260419	3.000
BB28_RS05530	180	7.996	26808.995984	2.000
BB28_RS18945	222	13.150	24450.866462	3.000
BB28_RS11370	195	9.348	22931.570883	2.000
BB28_RS18745	300	52.326	20483.046547	10.000
BB28_RS12480	207	10.766	19910.602420	2.000
BB28_RS20695	231	15.032	14260.286436	2.000
BB28_RS19155	282	37.744	14198.094026	5.000
BB28_RS09440	300	52.326	12289.827928	6.000
sort: write failed: 'standard output': Broken pipe
sort: write error


Top 10 most highly expressed genes in the double lysogen sample.


In [7]:
!sort -nrk 4,4 data/quants/SRR13349128_quant/quant.sf | head -10

BB28_RS18025	177	7.769	43112.896691	2.000
BB28_RS02220	204	10.377	32275.338708	2.000
BB28_RS13585	243	18.264	27506.503978	3.000
BB28_RS01170	225	13.734	24387.215145	2.000
BB28_RS20695	231	15.032	22281.025005	2.000
BB28_RS19045	183	8.236	20333.140719	1.000
BB28_RS18745	300	52.326	19202.276519	6.000
BB28_RS04995	192	9.045	18514.928346	1.000
BB28_RS14885	195	9.348	17914.748997	1.000
BB28_RS19155	282	37.744	17747.081839	4.000
sort: write failed: 'standard output': Broken pipe
sort: write error
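
The `sort: write failed ... Broken pipe` messages are harmless: `head` exits after printing ten lines, and `sort` then loses the reader for the rest of its output. As an alternative sketch, the same ranking can be done in Python, which also handles the `quant.sf` header line explicitly. The function name is ours, and the three example rows are copied from the wild-type output above.

```python
import csv
import io


def top_transcripts(quant_text, n=10, column="TPM"):
    """Return the n (name, value) pairs with the highest value in `column`."""
    reader = csv.DictReader(io.StringIO(quant_text), delimiter="\t")
    rows = sorted(reader, key=lambda row: float(row[column]), reverse=True)
    return [(row["Name"], float(row[column])) for row in rows[:n]]


# Header plus three rows from the wild-type sample, deliberately out of order.
example = (
    "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
    "BB28_RS02220\t204\t10.377\t30985.260419\t3.000\n"
    "BB28_RS23830\t213\t11.625\t46100.292029\t5.000\n"
    "BB28_RS05530\t180\t7.996\t26808.995984\t2.000\n"
)
for name, tpm in top_transcripts(example, n=2):
    print(name, tpm)
```

To run it on a real result, pass in the contents of the file, for example `open("data/quants/SRR13349122_quant/quant.sf").read()`.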


### STEP 11: Report the expression of a putative acyl-ACP desaturase (BB28_RS16545) that was downregulated in the double lysogen relative to wild-type
This putative acyl-ACP desaturase was reported to be downregulated in the double lysogen, as shown in the table of the top 20 upregulated and downregulated genes in the paper describing the study.

Use `grep` to report the expression in the wild-type sample. The fields in the Salmon `quant.sf` file are as follows. The level of expression is reported in the Transcripts Per Million (`TPM`) and number of reads (`NumReads`) fields:  
`Name    Length  EffectiveLength TPM     NumReads`

In [8]:
!grep 'BB28_RS16545' data/quants/SRR13349122_quant/quant.sf

BB28_RS16545	987	738.000	580.913806	4.000


Use `grep` to report the expression in the double lysogen sample. The `quant.sf` fields are the same as above:  
`Name    Length  EffectiveLength TPM     NumReads`

In [9]:
!grep 'BB28_RS16545' data/quants/SRR13349128_quant/quant.sf

BB28_RS16545	987	738.000	226.912606	1.000
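
To put a number on the difference, the two rows can be parsed into named fields and the TPM values compared as a log2 ratio. This is a sketch with a helper name of our own. With only 4 and 1 mapped reads in the 50,000-read subsample, it illustrates the arithmetic and is not evidence of differential expression.

```python
import math

FIELDS = ("Name", "Length", "EffectiveLength", "TPM", "NumReads")


def parse_quant_line(line):
    """Split one tab-separated quant.sf row into a dict keyed by field name."""
    record = dict(zip(FIELDS, line.rstrip("\n").split("\t")))
    for field in FIELDS[1:]:
        record[field] = float(record[field])
    return record


# Rows copied from the two grep outputs above.
wild_type = parse_quant_line("BB28_RS16545\t987\t738.000\t580.913806\t4.000")
double_lysogen = parse_quant_line("BB28_RS16545\t987\t738.000\t226.912606\t1.000")

# A negative value means lower expression in the double lysogen.
log2_fold_change = math.log2(double_lysogen["TPM"] / wild_type["TPM"])
print(f"log2(TPM double lysogen / TPM wild-type) = {log2_fold_change:.2f}")
```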


### That's it! 