# Extended RNA-Seq Analysis Training Demo

## Overview

To save time, the short tutorial workflow uses truncated, partial run data from the Cushman et al. project.

This extended tutorial repeats the short tutorial with the full fastq files and adds some extra steps: how to download and prepare the transcriptome files used by salmon, alternate ways to navigate the NCBI databases for annotation or reference files you might need, and how to combine the salmon outputs into a single genecount file at the end.

Full fastq files can be rather large, so downloading, extracting, and analyzing them means this tutorial can take over 1 hour 45 minutes to run in full. This is part of the reason we provide both a short introductory tutorial and this longer, more complete one for those interested.

If this is too lengthy, feel free to move on to the snakemake tutorial or the DEG analysis tutorial. All the files used in the DEG tutorial were created using this extended tutorial workflow.

![RNA-Seq workflow](images/rnaseq-workflow.png)

### STEP 1: Install Mambaforge

First install Mambaforge.


In [1]:
!curl -L -O https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-$(uname)-$(uname -m).sh
!bash Mambaforge-$(uname)-$(uname -m).sh -b -u -p $HOME/mambaforge
!date +"%T"

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 86.8M  100 86.8M    0     0  97.5M      0 --:--:-- --:--:-- --:--:-- 97.5M
PREFIX=/home/ec2-user/mambaforge

Transaction

  Prefix: /home/ec2-user/mambaforge/envs/_virtual_specs_checks

  All requested packages already installed

Dry run. Not executing the transaction.
Unpacking payload ...
Extracting _libgcc_mutex-0.1-conda_forge.tar.bz2
Extracting ca-certificates-2024.8.30-hbcca054_0.conda
Extracting ld_impl_linux-64-2.40-hf3520f5_7.conda
Extracting pybind11-abi-4-hd8ed1ab_3.tar.bz2
Extracting python_abi-3.12-5_cp312.conda
Extracting tzdata-2024a-h8827d51_1.conda
Extracting libgomp-14.1.0-h77fa898_1.conda
Extracting _openmp_mutex-4.5-2_gnu.tar.bz2
Extracting libgcc-14.1

Next, using mambaforge and bioconda, install the tools that will be used in this tutorial.

In [2]:
#tell the computer where the mambaforge bin files are located
import os
os.environ["PATH"] += os.pathsep + os.environ["HOME"]+"/mambaforge/bin"

#now we can easily use 'mamba' command to install software 
!mamba install -y -c conda-forge -c bioconda trimmomatic fastqc multiqc salmon gsutil sql-magic entrez-direct gffread parallel-fastq-dump sra-tools sql-magic pyathena -y


Looking for: ['trimmomatic', 'fastqc', 'multiqc', 'salmon', 'gsutil', 'sql-magic', 'entrez-direct', 'gffread', 'parallel-fastq-dump', 'sra-tools', 'sql-magic', 'pyathena']

### STEP 2: Setup Environment

Create a set of directories to store the reads, reference sequence files, and output files. Note that we first remove the `data` directory to clean up files from Tutorial_1.


In [3]:
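# '! cd' runs in its own subshell and does not change the notebook's working directory,
# so the directories below are created relative to the directory printed here
! echo $PWD
# -f avoids an error if data/ does not exist yet
! rm -rf data/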
! mkdir -p data
! mkdir -p data/raw_fastq
! mkdir -p data/trimmed
! mkdir -p data/fastqc
! mkdir -p data/aligned
! mkdir -p data/reference
! mkdir -p data/fastqc_samples
! mkdir -p data/multiqc_samples
! mkdir -p data data/fasterqdump/raw_fastq data/prefetch_fasterqdump/raw_fastq

/home/ec2-user/SageMaker/rnaseq-myco-notebook


Set the number of threads depending on your VM size.

In [4]:
numthreads=!lscpu | grep '^CPU(s)'| awk '{print $2-1}'

#python variable holding the number of threads to use (CPU count minus one),
#useful for downstream tools like salmon, trimmomatic, etc
threads = int(numthreads[0])

#it's also good to have a shell version of the variable for commands that use piping;
#in jupyter, python variables sometimes fail to expand in shell commands that contain pipes
%env THREADS=$threads

env: THREADS=1
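
The same thread count can also be derived in pure Python, which avoids parsing `lscpu` output. The following is a minimal sketch, assuming the convention above of leaving one CPU free. `usable_threads` is a hypothetical helper name, and the floor of 1 is an added guard (not part of the original cell) so that a single-CPU machine does not end up with 0 threads.

```python
import os

def usable_threads(reserve: int = 1) -> int:
    """Return the CPU count minus `reserve`, but never less than 1."""
    total = os.cpu_count() or 1  # os.cpu_count() can return None
    return max(1, total - reserve)

threads = usable_threads()
# Export for shell commands started from this Python process
os.environ["THREADS"] = str(threads)
print(threads)
```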


### STEP 3: Downloading relevant FASTQ files using SRA Tools

Next we will need to download the relevant fastq files.

Because these files can be large, the process of downloading and extracting fastq files can be quite lengthy.

The sequence data for this tutorial comes from work by Cushman et al., <em><a href='https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8191103/'>Increased whiB7 expression and antibiotic resistance in Mycobacterium chelonae carrying two prophages</a></em>.

We will download the sample runs from this project using SRA tools, which retrieves them from the NCBI's SRA (Sequence Read Archive).

However, we first need to find the associated accession numbers.


### STEP 3.1: Finding run accession numbers.

The SRA stores sequence data in terms of runs (SRR stands for Sequence Read Run). To download runs, we need the accession ID for each run we wish to download.

The Cushman et al. project contains 12 runs. To make it easier, these are the run IDs associated with this project:

+ SRR13349122
+ SRR13349123
+ SRR13349124
+ SRR13349125
+ SRR13349126
+ SRR13349127
+ SRR13349128
+ SRR13349129
+ SRR13349130
+ SRR13349131
+ SRR13349132
+ SRR13349133


In this case, all these runs belong to the SRP (Sequence Run Project): SRP300216.

Sequence run experiments can be searched for using the SRA database on the NCBI website; and article-specific sample run information can be found in the supplementary section of that article.

For instance, here the authors posted a link to the sequence data under the GEO Series (GSE) accession <a href='https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE164210'>GSE164210</a>. This leads to the appropriate 'Gene Expression Omnibus' page where, among other useful files and information, the relevant SRA database link can be found.

Once the accession numbers are located, one can make a text file containing the list of accession IDs however they like.

Once again, to make things easier, we have made a .txt file with these IDs that you can simply download here:

In [5]:
!gsutil cp gs://rnaseq-myco-bucket/reference/accs.txt .
!cat accs.txt

Copying gs://rnaseq-myco-bucket/reference/accs.txt...
/ [1 files][  144.0 B/  144.0 B]                                                
Operation completed over 1 objects/144.0 B.                                      
SRR13349122
SRR13349123
SRR13349124
SRR13349125
SRR13349126
SRR13349127
SRR13349128
SRR13349129
SRR13349130
SRR13349131
SRR13349132
SRR13349133
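
Because the twelve run IDs in this project happen to be consecutive, the list file could also be generated rather than typed or downloaded. This is a minimal sketch under that assumption; `write_accession_list` is a hypothetical helper, and it writes to a separate file name (`accs_generated.txt`) so the downloaded `accs.txt` is left untouched.

```python
from pathlib import Path

def write_accession_list(path, first=13349122, count=12, prefix="SRR"):
    """Write `count` consecutive run accessions, one per line, and return them."""
    accessions = [f"{prefix}{first + i}" for i in range(count)]
    Path(path).write_text("\n".join(accessions) + "\n")
    return accessions

accessions = write_accession_list("accs_generated.txt")
print(accessions[0], accessions[-1], len(accessions))
```

For projects whose run IDs are not consecutive, a plain Python list of IDs written the same way would serve the same purpose.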


You can also use BigQuery to generate an accession list following the instructions outlined in [this notebook](https://github.com/STRIDES/NIHCloudLabGCP/blob/main/tutorials/notebooks/SRADownload/SRA-Download.ipynb).

### STEP 3.1.2 (Optional): Generate the accession list file with BigQuery
This step uses Python. We will use the BigQuery API. 

We will create a client using the default project.

Then we will query BigQuery using the species name and a range of accession numbers associated with this particular study. 

Feel free to play around with the query to generate different variations of accession numbers!

Please note that if you get errors, you should make sure you have this API enabled. Navigate back to the Google Cloud Platform dashboard, search for 'BigQuery' using the search bar at the top, and click `Enable` on the BigQuery page.

aws sts get-caller-identity
### STEP 3.2: Using the SRA-toolkit for a single sample.

Sequence run accession IDs can be used to download sequence data, using the 'prefetch' tool of the SRA-toolkit.


In [6]:
from pyathena import connect
import pandas as pd

# Use the correct argument name: s3_staging_dir
conn = connect(s3_staging_dir='s3://sra-data-athena/', region_name='us-east-1')

In [7]:
import boto3

# Initialize the Glue client
glue_client = boto3.client('glue', region_name='us-east-1')

# Run the crawler
crawler_name = 'sra_crawler'  # Use your crawler's name
glue_client.start_crawler(Name=crawler_name)

print(f"Crawler {crawler_name} started.")


Crawler sra_crawler started.


In [8]:
query = """
SELECT *
FROM AwsDataCatalog.srametadata.metadata
WHERE organism = 'Mycobacteroides chelonae' 
AND acc LIKE '%SRR133491%'
"""
df = pd.read_sql(
    query, conn
)
df



Unnamed: 0,acc,assay_type,center_name,consent,experiment,sample_name,instrument,librarylayout,libraryselection,librarysource,...,geo_loc_name_sam,ena_first_public_run,ena_last_update_run,sample_name_sam,datastore_filetype,datastore_provider,datastore_region,attributes,jattr,run_file_version
0,SRR13349130,RNA-Seq,GEO,public,SRX9775150,GSM5004092,Illumina HiSeq 2500,PAIRED,cDNA,TRANSCRIPTOMIC,...,,,,,"[fastq, run.zq, sra]","[gs, ncbi, s3]","[gs.us-east1, ncbi.public, s3.us-east-1]","[{k=geo_accession_exp, v=GSM5004092}, {k=bases...","{""geo_accession_exp"": [""GSM5004092""], ""bases"":...",1
1,SRR13349129,RNA-Seq,GEO,public,SRX9775149,GSM5004091,Illumina HiSeq 2500,PAIRED,cDNA,TRANSCRIPTOMIC,...,,,,,"[fastq, run.zq, sra]","[gs, ncbi, s3]","[gs.us-east1, ncbi.public, s3.us-east-1]","[{k=geo_accession_exp, v=GSM5004091}, {k=bases...","{""geo_accession_exp"": [""GSM5004091""], ""bases"":...",1
2,SRR13349125,RNA-Seq,GEO,public,SRX9775147,GSM5004089,Illumina HiSeq 2500,PAIRED,cDNA,TRANSCRIPTOMIC,...,,,,,"[fastq, run.zq, sra]","[gs, ncbi, s3]","[gs.us-east1, ncbi.public, s3.us-east-1]","[{k=geo_accession_exp, v=GSM5004089}, {k=bases...","{""geo_accession_exp"": [""GSM5004089""], ""bases"":...",1
3,SRR13349124,RNA-Seq,GEO,public,SRX9775147,GSM5004089,Illumina HiSeq 2500,PAIRED,cDNA,TRANSCRIPTOMIC,...,,,,,"[fastq, run.zq, sra]","[gs, ncbi, s3]","[gs.us-east1, ncbi.public, s3.us-east-1]","[{k=geo_accession_exp, v=GSM5004089}, {k=bases...","{""geo_accession_exp"": [""GSM5004089""], ""bases"":...",1
4,SRR13349127,RNA-Seq,GEO,public,SRX9775148,GSM5004090,Illumina HiSeq 2500,PAIRED,cDNA,TRANSCRIPTOMIC,...,,,,,"[fastq, run.zq, sra]","[gs, ncbi, s3]","[gs.us-east1, ncbi.public, s3.us-east-1]","[{k=geo_accession_exp, v=GSM5004090}, {k=bases...","{""geo_accession_exp"": [""GSM5004090""], ""bases"":...",1
5,SRR13349122,RNA-Seq,GEO,public,SRX9775146,GSM5004088,Illumina HiSeq 2500,PAIRED,cDNA,TRANSCRIPTOMIC,...,,,,,"[fastq, run.zq, sra]","[gs, ncbi, s3]","[gs.us-east1, ncbi.public, s3.us-east-1]","[{k=geo_accession_exp, v=GSM5004088}, {k=bases...","{""geo_accession_exp"": [""GSM5004088""], ""bases"":...",1
6,SRR13349133,RNA-Seq,GEO,public,SRX9775151,GSM5004093,Illumina HiSeq 2500,PAIRED,cDNA,TRANSCRIPTOMIC,...,,,,,"[fastq, run.zq, sra]","[gs, ncbi, s3]","[gs.us-east1, ncbi.public, s3.us-east-1]","[{k=geo_accession_exp, v=GSM5004093}, {k=bases...","{""geo_accession_exp"": [""GSM5004093""], ""bases"":...",1
7,SRR13349132,RNA-Seq,GEO,public,SRX9775151,GSM5004093,Illumina HiSeq 2500,PAIRED,cDNA,TRANSCRIPTOMIC,...,,,,,"[fastq, run.zq, sra]","[gs, ncbi, s3]","[gs.us-east1, ncbi.public, s3.us-east-1]","[{k=geo_accession_exp, v=GSM5004093}, {k=bases...","{""geo_accession_exp"": [""GSM5004093""], ""bases"":...",1
8,SRR13349126,RNA-Seq,GEO,public,SRX9775148,GSM5004090,Illumina HiSeq 2500,PAIRED,cDNA,TRANSCRIPTOMIC,...,,,,,"[fastq, run.zq, sra]","[gs, ncbi, s3]","[gs.us-east1, ncbi.public, s3.us-east-1]","[{k=geo_accession_exp, v=GSM5004090}, {k=bases...","{""geo_accession_exp"": [""GSM5004090""], ""bases"":...",1
9,SRR13349131,RNA-Seq,GEO,public,SRX9775150,GSM5004092,Illumina HiSeq 2500,PAIRED,cDNA,TRANSCRIPTOMIC,...,,,,,"[fastq, run.zq, sra]","[gs, ncbi, s3]","[gs.us-east1, ncbi.public, s3.us-east-1]","[{k=geo_accession_exp, v=GSM5004092}, {k=bases...","{""geo_accession_exp"": [""GSM5004092""], ""bases"":...",1


In [9]:
#write the SRR column to a text file
with open('accs.txt', 'w') as f:
    accs = df['acc'].to_string(header=False, index=False)
    f.write(accs)
    
#print the text file
!cat accs.txt


SRR13349130
SRR13349129
SRR13349125
SRR13349124
SRR13349127
SRR13349122
SRR13349133
SRR13349132
SRR13349126
SRR13349131
SRR13349123
SRR13349128

In [None]:
#the 'prefetch' command downloads an SRA file.
!prefetch SRR13349123 -O data/raw_fastq -f yes

2024-09-19T04:57:18 prefetch.3.1.1: 1) Resolving 'SRR13349123'...
2024-09-19T04:57:18 prefetch.3.1.1: Current preference is set to retrieve SRA Normalized Format files with full base quality scores
2024-09-19T04:57:18 prefetch.3.1.1: 1) Downloading 'SRR13349123'...
2024-09-19T04:57:18 prefetch.3.1.1:  SRA Normalized Format file is being retrieved
2024-09-19T04:57:18 prefetch.3.1.1:  Downloading via HTTPS...


Here we can see the command for downloading a single SRA file using the accession ID 'SRR13349123'.

Notice that the SRA archives sequence files in the SRA format.
Genome workflows typically process data as zipped or unzipped .fastq or .fasta files,
so before we move on, we need to convert the files from .sra to .fastq.

There are multiple ways to do this.

The SRA Toolkit includes fastq-dump and fasterq-dump, both of which convert SRA to FASTQ.

If you use fasterq-dump, it's recommended to zip your fastq files after they are created.

There is also a tool called 'parallel-fastq-dump', which supports zipping the fastq files automatically into fastq.gz files.

The below code may take approximately 15 minutes to run.

In [9]:
#convert sra to fastq
!fasterq-dump data/raw_fastq/SRR13349123 -f -O data/raw_fastq/
#compress fastq to fastq.gz to save space
!gzip data/raw_fastq/SRR13349123_1.fastq
!gzip data/raw_fastq/SRR13349123_2.fastq

spots read      : 11,165,256
reads read      : 22,330,512
reads written   : 22,330,512
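
Compression can also be done from Python with the standard library, which is convenient when the step is part of a larger Python loop. The sketch below is illustrative only: `gzip_file` is a hypothetical helper, and the demo file contains a single made-up read rather than real data.

```python
import gzip
import shutil
from pathlib import Path

def gzip_file(path, keep_original=False):
    """Compress `path` to `<path>.gz`, streaming so large FASTQ files are not loaded into memory."""
    src = Path(path)
    dst = src.with_name(src.name + ".gz")
    with open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    if not keep_original:
        src.unlink()  # mirror the gzip command, which removes the uncompressed file
    return dst

# Demo on a tiny made-up FASTQ record (not real data)
demo = Path("demo_reads.fastq")
demo.write_text("@read1\nACGT\n+\nIIII\n")
compressed = gzip_file(demo)
print(compressed)
```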


### STEP 3.3 Downloading multiple files using the SRA-toolkit.

Often, as in our case, one wants to download multiple runs at once.

To aid in this, SRA-tools supports batch downloading. This is why we created the text file earlier.

We can download multiple SRA files with a single line of code by passing our list of SRA IDs to the prefetch command via `--option-file`. Note that it may take some time to download all the SRA files.

In [27]:
!prefetch --option-file accs.txt -O data/raw_fastq -f yes

2024-09-18T05:22:37 prefetch.3.1.1: 1) Resolving 'SRR13349128'...
2024-09-18T05:22:37 prefetch.3.1.1: Current preference is set to retrieve SRA Normalized Format files with full base quality scores
2024-09-18T05:22:37 prefetch.3.1.1: 1) Downloading 'SRR13349128'...
2024-09-18T05:22:37 prefetch.3.1.1:  SRA Normalized Format file is being retrieved
2024-09-18T05:22:37 prefetch.3.1.1:  Downloading via HTTPS...
2024-09-18T05:23:07 prefetch.3.1.1:  HTTPS download succeed
2024-09-18T05:23:09 prefetch.3.1.1:  'SRR13349128' is valid: 788638616 bytes were streamed from 788633419
2024-09-18T05:23:09 prefetch.3.1.1: 1) 'SRR13349128' was downloaded successfully
2024-09-18T05:23:09 prefetch.3.1.1: 'SRR13349128' has 0 dependencies
2024-09-18T05:23:09 prefetch.3.1.1: 2) Resolving 'SRR13349122'...
2024-09-18T05:23:09 prefetch.3.1.1: 2) Downloading 'SRR13349122'...
2024-09-18T05:23:09 prefetch.3.1.1:  SRA Normalized Format file is being retrieved
2024-09-18T05:23:09 prefetch.3.1.1:  Downloading via HTT

### STEP 3.4 Converting Multiple SRA files to Fastq

Fasterq-dump does not natively support batch converting of files.

There are several ways we can get around this.

For instance, one could use loops, or utilize piping.

The below code uses a 'for' loop to iterate through all the accession IDs in our accs.txt file.

It also includes several flags:

+ -e for CPU threads
+ -m for maximum memory usage
+ -f to force overwrite
+ -O for the output directory

This process should take about 35 minutes or so.

In [69]:
!for x in `cat accs.txt`; do fasterq-dump -f -O data/raw_fastq -e $THREADS -m 4G data/raw_fastq/$x/$x.sra; done

##example of how to alternatively do the above process with parallel-fastq-dump using piping
#!cat accs.txt | xargs -I {} parallel-fastq-dump -O data/raw_fastq/ --tmpdir . --threads $THREADS --gzip --split-files --sra-id {}

fasterq-dump quit with error code 3
fasterq-dump quit with error code 3
fasterq-dump quit with error code 3
fasterq-dump quit with error code 3
fasterq-dump quit with error code 3
fasterq-dump quit with error code 3
fasterq-dump quit with error code 3
fasterq-dump quit with error code 3
fasterq-dump quit with error code 3
fasterq-dump quit with error code 3
fasterq-dump quit with error code 3
fasterq-dump quit with error code 3
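
The shell loop above can also be written in Python, which makes it easier to inspect each command before running it. This is a sketch, not the tutorial's own method: `fasterq_dump_cmd` is a hypothetical helper that only assembles the same flags described above (`-f`, `-O`, `-e`, `-m`) and assumes the `<outdir>/<accession>/<accession>.sra` layout used in the loop. It prints the commands as a dry run instead of executing them.

```python
import shlex

def fasterq_dump_cmd(accession, outdir="data/raw_fastq", threads=1, mem="4G"):
    """Build the fasterq-dump argument list for one prefetched run."""
    sra_path = f"{outdir}/{accession}/{accession}.sra"
    return ["fasterq-dump", "-f", "-O", outdir, "-e", str(threads), "-m", mem, sra_path]

accessions = ["SRR13349122", "SRR13349123"]  # in practice, read these from accs.txt
for acc in accessions:
    cmd = fasterq_dump_cmd(acc, threads=1)
    print(shlex.join(cmd))  # dry run: print the command instead of executing it
    # To execute instead: import subprocess; subprocess.run(cmd, check=True)
```

Passing `check=True` when executing makes a failed conversion raise an exception immediately, rather than letting the loop continue silently.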


As before, it is good practice to turn .fastq files into .fastq.gz files to save space.

In our case, we will need to concatenate the fastq files later on, so we will zip them after that step.

The now-redundant SRA files can also be deleted to save more space.

In [68]:
#find and delete all SRR subfolders in the raw_fastq directory
!find data/raw_fastq -type d -name 'SRR*' -exec rm -rf {} \;
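
For readers who prefer to stay in Python, the same cleanup can be sketched with `pathlib` and `shutil`. `remove_run_dirs` is a hypothetical helper. Unlike `find`, it only looks at the top level of the given directory, and it skips regular files so that the converted FASTQ files are kept. The demo runs in a temporary directory rather than on the real data.

```python
import shutil
import tempfile
from pathlib import Path

def remove_run_dirs(raw_dir, pattern="SRR*"):
    """Delete sub-directories of `raw_dir` matching `pattern`; return their names."""
    removed = []
    for entry in sorted(Path(raw_dir).glob(pattern)):
        if entry.is_dir():  # leave FASTQ files such as SRR..._1.fastq untouched
            shutil.rmtree(entry)
            removed.append(entry.name)
    return removed

# Demo in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "SRR13349122").mkdir()
    (Path(tmp) / "SRR13349122_1.fastq").write_text("")
    removed = remove_run_dirs(tmp)
    remaining = sorted(p.name for p in Path(tmp).iterdir())
print(removed, remaining)
```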

### STEP 4: Copy reference transcriptome files that will be used by Salmon using E-Direct

Salmon is a tool that aligns RNA-Seq reads to a transcriptome.

So we will need a transcriptome reference file.

To get one, we can search through the NCBI assembly database, find an assembly, and download transcriptome reference files from that assembly using FTP links.

For instance, we will use the <a href='https://www.ncbi.nlm.nih.gov/assembly/GCF_001632805.1'>ASM163280v1</a> RefSeq assembly, found by searching through the NCBI assembly database. The FTP links can be accessed through the website in various ways; one way is to click the 'FTP directory for RefSeq assembly' link, found under the 'Access the data' section.

Alternatively, if one were inclined, one could take the less common route and perform this through the NCBI command line tool suite called 'Entrez Direct' (EDirect).

This is an intricate and complicated set of tools, with many ways to do any one thing.

Below is an example of using an EDirect search query with a RefSeq identifier to obtain the relevant FTP directory, and then using that to download the desired reference files.

In [30]:
#parse for the ftp link and download the genome reference fasta file

!esearch -db assembly -query GCF_001632805.1 | efetch -format docsum \
| xtract -pattern DocumentSummary -element FtpPath_RefSeq \
| awk -F"/" '{print "curl -o data/reference/"$NF"_genomic.fna.gz " $0"/"$NF"_genomic.fna.gz"}' \
| bash

#parse for the ftp link and download the gff annotation file

!esearch -db assembly -query GCF_001632805.1 | efetch -format docsum \
| xtract -pattern DocumentSummary -element FtpPath_RefSeq \
| awk -F"/" '{print "curl -o data/reference/"$NF"_genomic.gff.gz " $0"/"$NF"_genomic.gff.gz"}' \
| bash

# parse for the ftp link and download the feature-table reference file 
# (for later use for merging readcounts with gene names in R code).

!esearch -db assembly -query GCF_001632805.1 | efetch -format docsum \
| xtract -pattern DocumentSummary -element FtpPath_RefSeq \
| awk -F"/" '{print "curl -o data/reference/"$NF"_feature_table.txt.gz " $0"/"$NF"_feature_table.txt.gz"}' \
| bash


#unzip the compressed reference files

!gzip -d data/reference/*.gz --force

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1436k  100 1436k    0     0  1399k      0  0:00:01  0:00:01 --:--:-- 1400k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  397k  100  397k    0     0   909k      0 --:--:-- --:--:-- --:--:--  910k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  225k  100  225k    0     0   329k      0 --:--:-- --:--:-- --:--:--  329k
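
The `awk` step in the cell above is terse, so here is a sketch of the same string manipulation in Python. `refseq_file_urls` is a hypothetical helper, and the example directory is an assumption: it is written in the form the EDirect query is expected to return for this assembly, not queried here.

```python
def refseq_file_urls(ftp_path, suffixes=("_genomic.fna.gz", "_genomic.gff.gz", "_feature_table.txt.gz")):
    """Mirror the awk step: append '<last path component><suffix>' to the FTP directory."""
    ftp_path = ftp_path.rstrip("/")
    prefix = ftp_path.rsplit("/", 1)[-1]  # equivalent of awk's $NF with -F"/"
    return [f"{ftp_path}/{prefix}{suffix}" for suffix in suffixes]

# Example directory in the assumed form (not retrieved from NCBI in this sketch)
example = "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/632/805/GCF_001632805.1_ASM163280v1"
urls = refseq_file_urls(example)
for url in urls:
    print(url)
```

The key point is that every file in the assembly directory is named after the directory itself, so the last path component doubles as the file name prefix.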


Next we can use a tool called gffread to create a transcriptome reference file using the gff annotation and genome files we downloaded.

In [31]:
!gffread -w data/reference/GCF_001632805.1_transcriptome_reference.fa -g data/reference/GCF_001632805.1_ASM163280v1_genomic.fna data/reference/GCF_001632805.1_ASM163280v1_genomic.gff

It is also recommended to include the full genome at the end of the transcriptome reference file, for the purpose of performing a 'decoy-aware' mapping, more information about which can be found in the Salmon documentation.

To alert salmon to the presence of the genome, we will also create a 'decoy file' that lists the name of the full genome sequence appended to our transcriptome reference file.

In [32]:
!cat data/reference/GCF_001632805.1_transcriptome_reference.fa > data/reference/GCF_001632805.1_transcriptome_reference_w_decoy.fa
!echo >> data/reference/GCF_001632805.1_transcriptome_reference_w_decoy.fa
!cat data/reference/GCF_001632805.1_ASM163280v1_genomic.fna >> data/reference/GCF_001632805.1_transcriptome_reference_w_decoy.fa
!echo "NZ_CP007220.1" > data/reference/decoys.txt
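
Hard-coding the sequence name works here because the decoy file contains a single entry. For a genome with several sequences (for example, a chromosome plus plasmids), the names could instead be read from the FASTA headers. This is a minimal sketch: `fasta_sequence_names` is a hypothetical helper, and the two-line FASTA is made up for illustration, with only the header name taken from the cell above.

```python
def fasta_sequence_names(lines):
    """Return the name of each FASTA record: the header text up to the first whitespace."""
    names = []
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                names.append(fields[0])
    return names

# Made-up genome FASTA; the description after the name is illustrative only
demo_genome = [
    ">NZ_CP007220.1 example chromosome description\n",
    "ACGTACGT\n",
]
decoys = fasta_sequence_names(demo_genome)
print("\n".join(decoys))
```

To apply it to the real file, one could pass an open file handle for the genome `.fna` file in place of `demo_genome` and write the result to `data/reference/decoys.txt`.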

### STEP 5: Run FastQC
FastQC is an invaluable tool that allows you to evaluate whether there are problems with a set of reads. For example, it will provide a report of whether there is any bias in the sequence composition of the reads.

The below code may take around 25 minutes to run. 

In [33]:
#run fastqc for forward reads
!cat accs.txt | xargs -I {} fastqc "data/raw_fastq/{}_1.fastq" -o data/fastqc/
#run fastqc for reverse reads
!cat accs.txt | xargs -I {} fastqc "data/raw_fastq/{}_2.fastq" -o data/fastqc/

null
Started analysis of SRR13349128_1.fastq
Approx 5% complete for SRR13349128_1.fastq
Approx 10% complete for SRR13349128_1.fastq
Approx 15% complete for SRR13349128_1.fastq
Approx 20% complete for SRR13349128_1.fastq
Approx 25% complete for SRR13349128_1.fastq
Approx 30% complete for SRR13349128_1.fastq
Approx 35% complete for SRR13349128_1.fastq
Approx 40% complete for SRR13349128_1.fastq
Approx 45% complete for SRR13349128_1.fastq
Approx 50% complete for SRR13349128_1.fastq
Approx 55% complete for SRR13349128_1.fastq
Approx 60% complete for SRR13349128_1.fastq
Approx 65% complete for SRR13349128_1.fastq
Approx 70% complete for SRR13349128_1.fastq
Approx 75% complete for SRR13349128_1.fastq
Approx 80% complete for SRR13349128_1.fastq
Approx 85% complete for SRR13349128_1.fastq
Approx 90% complete for SRR13349128_1.fastq
Approx 95% complete for SRR13349128_1.fastq
Analysis complete for SRR13349128_1.fastq
null
Started analysis of SRR13349122_1.fastq
Approx 5% complete for SRR1334912

FastQC will output the results in HTML format, as below, for all forward and reverse reads.


In [34]:
from IPython.display import IFrame
IFrame(src='./data/fastqc/SRR13349126_1_fastqc.html', width=800, height=600)

Although it's best practice to look over them individually, tools like multiqc allow one to quickly view a summary of the quality reports for all the fastq files.

For instance, the table below shows which checks passed, raised warnings, or failed in each fastqc report. MultiQC creates other summaries as well.

In [35]:
!multiqc -f data/fastqc/

import pandas as pd
dframe = pd.read_csv("./multiqc_data/multiqc_fastqc.txt", sep='\t')
display(dframe)


/// MultiQC 🔍 v1.25

       file_search | Search path: /home/ec2-user/SageMaker/rnaseq-myco-notebook/data/fastqc
         searching | 100% 48/48
            fastqc | Found 24 reports
     write_results | Data        : multiqc_data   (overwritten)
     write_results | Report      : multiqc_report.html   (overwritten)
           multiqc | MultiQC complete


Unnamed: 0,Sample,Filename,File type,Encoding,Total Sequences,Total Bases,Sequences flagged as poor quality,Sequence length,%GC,total_deduplicated_percentage,...,basic_statistics,per_base_sequence_quality,per_sequence_quality_scores,per_base_sequence_content,per_sequence_gc_content,per_base_n_content,sequence_length_distribution,sequence_duplication_levels,overrepresented_sequences,adapter_content
0,SRR13349122_1,SRR13349122_1.fastq,Conventional base calls,Sanger / Illumina 1.9,10827590.0,552.2 Mbp,0.0,51.0,55.0,5.503069,...,pass,pass,pass,fail,pass,pass,pass,fail,fail,pass
1,SRR13349122_2,SRR13349122_2.fastq,Conventional base calls,Sanger / Illumina 1.9,10827590.0,552.2 Mbp,0.0,51.0,56.0,9.335313,...,pass,pass,pass,fail,pass,pass,pass,fail,fail,pass
2,SRR13349123_1,SRR13349123_1.fastq,Conventional base calls,Sanger / Illumina 1.9,11165256.0,569.4 Mbp,0.0,51.0,56.0,5.097632,...,pass,pass,pass,fail,pass,pass,pass,fail,fail,pass
3,SRR13349123_2,SRR13349123_2.fastq,Conventional base calls,Sanger / Illumina 1.9,11165256.0,569.4 Mbp,0.0,51.0,56.0,9.070512,...,pass,pass,pass,fail,warn,pass,pass,fail,fail,pass
4,SRR13349124_1,SRR13349124_1.fastq,Conventional base calls,Sanger / Illumina 1.9,10727273.0,547 Mbp,0.0,51.0,55.0,5.313842,...,pass,pass,pass,fail,pass,pass,pass,fail,fail,pass
5,SRR13349124_2,SRR13349124_2.fastq,Conventional base calls,Sanger / Illumina 1.9,10727273.0,547 Mbp,0.0,51.0,56.0,9.139844,...,pass,pass,pass,fail,pass,pass,pass,fail,fail,pass
6,SRR13349125_1,SRR13349125_1.fastq,Conventional base calls,Sanger / Illumina 1.9,10992686.0,560.6 Mbp,0.0,51.0,55.0,4.918485,...,pass,pass,pass,fail,pass,pass,pass,fail,fail,pass
7,SRR13349125_2,SRR13349125_2.fastq,Conventional base calls,Sanger / Illumina 1.9,10992686.0,560.6 Mbp,0.0,51.0,56.0,8.541477,...,pass,pass,pass,fail,warn,pass,pass,fail,fail,pass
8,SRR13349126_1,SRR13349126_1.fastq,Conventional base calls,Sanger / Illumina 1.9,12267497.0,625.6 Mbp,0.0,51.0,55.0,4.804107,...,pass,pass,pass,fail,pass,pass,pass,fail,fail,pass
9,SRR13349126_2,SRR13349126_2.fastq,Conventional base calls,Sanger / Illumina 1.9,12267497.0,625.6 Mbp,0.0,51.0,55.0,9.482849,...,pass,pass,pass,fail,warn,pass,pass,fail,fail,pass
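
With many samples it can help to tally the statuses per FastQC module instead of reading the table row by row. The sketch below uses only the standard library. `status_counts` is a hypothetical helper, and the inline text is a two-row, three-column excerpt of the table above rather than the full `multiqc_fastqc.txt` file.

```python
import csv
import io
from collections import Counter

def status_counts(tsv_text, columns):
    """Count pass/warn/fail values per FastQC module column in MultiQC's tab-separated output."""
    counts = {col: Counter() for col in columns}
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        for col in columns:
            counts[col][row[col]] += 1
    return counts

# Two rows excerpted from the table above (only three columns kept)
sample_tsv = (
    "Sample\tper_base_sequence_content\tper_sequence_gc_content\n"
    "SRR13349123_1\tfail\tpass\n"
    "SRR13349123_2\tfail\twarn\n"
)
summary = status_counts(sample_tsv, ["per_base_sequence_content", "per_sequence_gc_content"])
for module, tally in summary.items():
    print(module, dict(tally))
```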


### STEP 5.1 Merging our fastq files

The following step may not be necessary -- it depends on the study.

In our case, if we look at our SRA files:

https://trace.ncbi.nlm.nih.gov/Traces/study/?acc=SRP300216&o=acc_s%3Aa

We will notice that, although there are 12 SRA files coming from 12 SRR runs, there are actually only 6 samples in total.

In such a case it is possible that, for instance, multiple lanes in a flowcell were used for the same sample.

In our analysis we will be comparing at the sample level. So we would like to merge the fastq files that, although created as separate files by the sequencer, actually came from the same sample.

It is generally easier to do this merging after an initial fastqc report, as that makes it easier to pinpoint errors that may be lane specific.

Combining two FASTQ files is a straightforward process. Remember how FASTQ files are formatted: they are simply a list of reads. Consequently, we can 'concatenate', or add one fastq file to the bottom of another, to create a merged fastq. Note that the headers in a merged fastq file may now contain different lane information; for our downstream processes this is acceptable. Remember that if your fastq files are zipped, you will have to unzip them first.

In [36]:
#example of how to concatenate two of our fastq files from the same experiment.
!cat data/raw_fastq/SRR13349122_1.fastq data/raw_fastq/SRR13349123_1.fastq > data/raw_fastq/GSM5004088_1.fastq
!cat data/raw_fastq/SRR13349122_2.fastq data/raw_fastq/SRR13349123_2.fastq > data/raw_fastq/GSM5004088_2.fastq
#notice we concat the forward read with a forward read, and reverse with reverse.
#also note here we are naming it after the GSM, which is the sample experiment ID.
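
The same concatenation can be sketched in Python, with a basic sanity check added. `concat_fastq` is a hypothetical helper. The check assumes each read occupies exactly four lines (header, sequence, `+`, qualities), which is an assumption of this sketch and would not hold for FASTQ files with wrapped sequences. The demo uses two made-up single-read files in a temporary directory.

```python
import shutil
import tempfile
from pathlib import Path

def concat_fastq(inputs, output):
    """Append each input FASTQ to `output` in order and return the number of reads written."""
    with open(output, "wb") as merged:
        for path in inputs:
            with open(path, "rb") as part:
                shutil.copyfileobj(part, merged)
    with open(output, "rb") as merged:
        n_lines = sum(1 for _ in merged)
    if n_lines % 4:
        raise ValueError(f"{output}: {n_lines} lines is not a multiple of 4")
    return n_lines // 4

# Demo with two made-up single-read files
with tempfile.TemporaryDirectory() as tmp:
    a, b, out = (Path(tmp) / name for name in ("a.fastq", "b.fastq", "merged.fastq"))
    a.write_text("@r1\nACGT\n+\nIIII\n")
    b.write_text("@r2\nTTGA\n+\nIIII\n")
    n_reads = concat_fastq([a, b], out)
    merged_text = out.read_text()
print(n_reads)
```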

We could manually do the above for all 12 of our runs, and it wouldn't be much work.

If you are comfortable doing so, that is the best process to use.

As always though, there are ways to automate things. For instance, we could reuse our query code from 3.1.2 to obtain a list of the SRX IDs, and take advantage of Jupyter's ability to run Python to write a loop that iterates over that list and concatenates the files.

Specific to our case, each sample contains two paired-end SRRs. 
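
One way to pair runs with samples without relying on their sort order is to group the run accessions by sample ID. This is a sketch under stated assumptions: `group_runs_by_sample` is a hypothetical helper, and the four `(run, sample)` pairs are copied from the metadata table shown earlier in this tutorial.

```python
from collections import defaultdict

def group_runs_by_sample(pairs):
    """Map each sample ID to the sorted list of run accessions that belong to it."""
    groups = defaultdict(list)
    for run, sample in pairs:
        groups[sample].append(run)
    return {sample: sorted(runs) for sample, runs in sorted(groups.items())}

# (run, sample) pairs copied from the metadata table in this tutorial
pairs = [
    ("SRR13349122", "GSM5004088"), ("SRR13349123", "GSM5004088"),
    ("SRR13349124", "GSM5004089"), ("SRR13349125", "GSM5004089"),
]
grouped = group_runs_by_sample(pairs)
for sample, runs in grouped.items():
    print(sample, runs)
```

With the metadata dataframe, the pairs could be produced with `zip(df['acc'], df['sample_name'])`. Grouping this way also copes with samples that have more or fewer than two runs.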

Note: running this step will remove the previous unmerged fastq files in order to save space.

This will take about one hour.

In [37]:
from pyathena import connect
import pandas as pd

# Use the correct argument name: s3_staging_dir
conn = connect(s3_staging_dir='s3://sra-data-athena/', region_name='us-east-1')

query = """
SELECT *
FROM AwsDataCatalog.srametadata.metadata
WHERE organism = 'Mycobacteroides chelonae' 
AND acc LIKE '%SRR133491%'
"""
df = pd.read_sql(
    query, conn
)
df




Unnamed: 0,acc,assay_type,center_name,consent,experiment,sample_name,instrument,librarylayout,libraryselection,librarysource,...,geo_loc_name_sam,ena_first_public_run,ena_last_update_run,sample_name_sam,datastore_filetype,datastore_provider,datastore_region,attributes,jattr,run_file_version
0,SRR13349128,RNA-Seq,GEO,public,SRX9775149,GSM5004091,Illumina HiSeq 2500,PAIRED,cDNA,TRANSCRIPTOMIC,...,,,,,"[fastq, run.zq, sra]","[gs, ncbi, s3]","[gs.us-east1, ncbi.public, s3.us-east-1]","[{k=geo_accession_exp, v=GSM5004091}, {k=bases...","{""geo_accession_exp"": [""GSM5004091""], ""bases"":...",1
1,SRR13349127,RNA-Seq,GEO,public,SRX9775148,GSM5004090,Illumina HiSeq 2500,PAIRED,cDNA,TRANSCRIPTOMIC,...,,,,,"[fastq, run.zq, sra]","[gs, ncbi, s3]","[gs.us-east1, ncbi.public, s3.us-east-1]","[{k=geo_accession_exp, v=GSM5004090}, {k=bases...","{""geo_accession_exp"": [""GSM5004090""], ""bases"":...",1
2,SRR13349129,RNA-Seq,GEO,public,SRX9775149,GSM5004091,Illumina HiSeq 2500,PAIRED,cDNA,TRANSCRIPTOMIC,...,,,,,"[fastq, run.zq, sra]","[gs, ncbi, s3]","[gs.us-east1, ncbi.public, s3.us-east-1]","[{k=geo_accession_exp, v=GSM5004091}, {k=bases...","{""geo_accession_exp"": [""GSM5004091""], ""bases"":...",1
3,SRR13349122,RNA-Seq,GEO,public,SRX9775146,GSM5004088,Illumina HiSeq 2500,PAIRED,cDNA,TRANSCRIPTOMIC,...,,,,,"[fastq, run.zq, sra]","[gs, ncbi, s3]","[gs.us-east1, ncbi.public, s3.us-east-1]","[{k=geo_accession_exp, v=GSM5004088}, {k=bases...","{""geo_accession_exp"": [""GSM5004088""], ""bases"":...",1
4,SRR13349131,RNA-Seq,GEO,public,SRX9775150,GSM5004092,Illumina HiSeq 2500,PAIRED,cDNA,TRANSCRIPTOMIC,...,,,,,"[fastq, run.zq, sra]","[gs, ncbi, s3]","[gs.us-east1, ncbi.public, s3.us-east-1]","[{k=geo_accession_exp, v=GSM5004092}, {k=bases...","{""geo_accession_exp"": [""GSM5004092""], ""bases"":...",1
5,SRR13349123,RNA-Seq,GEO,public,SRX9775146,GSM5004088,Illumina HiSeq 2500,PAIRED,cDNA,TRANSCRIPTOMIC,...,,,,,"[fastq, run.zq, sra]","[gs, ncbi, s3]","[gs.us-east1, ncbi.public, s3.us-east-1]","[{k=geo_accession_exp, v=GSM5004088}, {k=bases...","{""geo_accession_exp"": [""GSM5004088""], ""bases"":...",1
6,SRR13349132,RNA-Seq,GEO,public,SRX9775151,GSM5004093,Illumina HiSeq 2500,PAIRED,cDNA,TRANSCRIPTOMIC,...,,,,,"[fastq, run.zq, sra]","[gs, ncbi, s3]","[gs.us-east1, ncbi.public, s3.us-east-1]","[{k=geo_accession_exp, v=GSM5004093}, {k=bases...","{""geo_accession_exp"": [""GSM5004093""], ""bases"":...",1
7,SRR13349125,RNA-Seq,GEO,public,SRX9775147,GSM5004089,Illumina HiSeq 2500,PAIRED,cDNA,TRANSCRIPTOMIC,...,,,,,"[fastq, run.zq, sra]","[gs, ncbi, s3]","[gs.us-east1, ncbi.public, s3.us-east-1]","[{k=geo_accession_exp, v=GSM5004089}, {k=bases...","{""geo_accession_exp"": [""GSM5004089""], ""bases"":...",1
8,SRR13349130,RNA-Seq,GEO,public,SRX9775150,GSM5004092,Illumina HiSeq 2500,PAIRED,cDNA,TRANSCRIPTOMIC,...,,,,,"[fastq, run.zq, sra]","[gs, ncbi, s3]","[gs.us-east1, ncbi.public, s3.us-east-1]","[{k=geo_accession_exp, v=GSM5004092}, {k=bases...","{""geo_accession_exp"": [""GSM5004092""], ""bases"":...",1
9,SRR13349133,RNA-Seq,GEO,public,SRX9775151,GSM5004093,Illumina HiSeq 2500,PAIRED,cDNA,TRANSCRIPTOMIC,...,,,,,"[fastq, run.zq, sra]","[gs, ncbi, s3]","[gs.us-east1, ncbi.public, s3.us-east-1]","[{k=geo_accession_exp, v=GSM5004093}, {k=bases...","{""geo_accession_exp"": [""GSM5004093""], ""bases"":...",1


In [None]:
import subprocess

# Get the run accessions (SRR) and sample IDs (GSM) from the dataframe created above.
# sorted() returns new lists, so the dataframe itself is left untouched.
runs = sorted(df['acc'])
samples = sorted(set(df['sample_name']))

# NOTE: the pairing below assumes that every sample has exactly two runs and that
# sorting the SRR and GSM accessions puts them in the same order.
for index, sample in enumerate(samples):
    try:
        print(f"Processing sample {sample}...")

        # Handle the forward (_1) and reverse (_2) reads in the same way
        for mate in ("1", "2"):
            run_files = [f"data/raw_fastq/{runs[index*2]}_{mate}.fastq",
                         f"data/raw_fastq/{runs[index*2+1]}_{mate}.fastq"]
            merged = f"data/raw_fastq/{sample}_{mate}.fastq"

            # Concatenate the two SRR files. A ">" inside an argument list is not
            # treated as a redirect, so cat's output is sent to the file via stdout.
            with open(merged, "wb") as out:
                subprocess.run(["cat", *run_files], stdout=out, check=True)

            # Delete the original per-run fastq files
            subprocess.run(["rm", *run_files], check=True)

            # Compress the merged fastq file
            subprocess.run(["gzip", merged], check=True)

        print(f"Processing complete for sample {sample}.\n")

    except subprocess.CalledProcessError as e:
        print(f"Error processing sample {sample}: {e}")
        # Skip to the next sample
        continue

print("All samples processed.")
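
The pairing above relies on index arithmetic over two sorted lists. A more defensive variant, sketched below, groups the run accessions by sample first, so it would still work if a sample had one or three runs. The `records` list is a hand-typed stand-in for the `acc` and `sample_name` columns of `df` (only a few of the rows shown above), and `group_runs_by_sample` is a hypothetical helper name, not part of the original workflow.

```python
from collections import defaultdict

def group_runs_by_sample(records):
    """Map each sample ID to the sorted list of run accessions that belong to it."""
    grouped = defaultdict(list)
    for run, sample in records:
        grouped[sample].append(run)
    return {sample: sorted(runs) for sample, runs in sorted(grouped.items())}

# Stand-in for zip(df['acc'], df['sample_name']); a subset of the rows shown above
records = [
    ("SRR13349127", "GSM5004090"),
    ("SRR13349122", "GSM5004088"),
    ("SRR13349131", "GSM5004092"),
    ("SRR13349123", "GSM5004088"),
    ("SRR13349130", "GSM5004092"),
]

runs_by_sample = group_runs_by_sample(records)
for sample, sample_runs in runs_by_sample.items():
    # Forward-read files that would be concatenated for this sample
    forward = [f"data/raw_fastq/{run}_1.fastq" for run in sample_runs]
    print(sample, "<-", " + ".join(forward))
```

In the notebook, `records` could be replaced with `zip(df['acc'], df['sample_name'])`, and the per-sample file lists passed to the concatenation step.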

In [39]:
# Since our files are now named by sample rather than by SRR, we can write a new text file to use for downstream batch processes.
# We can reuse the dataframe (df) created in an earlier cell.
with open('samples.txt', 'w') as f:
    df = df.sort_values(by='sample_name', ascending=True)
    samples = df['sample_name'].unique()
    samples = '\n'.join(map(str, samples))
    f.write(samples)
    
!cat samples.txt

GSM5004088
GSM5004089
GSM5004090
GSM5004091
GSM5004092
GSM5004093

### STEP 5.3: Copy data file for Trimmomatic

One of Trimmomatic's functions is to trim adapter sequences that are specific to the sequencing machine. These are usually stored within the Trimmomatic installation directory, in a folder called adapters.

Package directories within conda installations can be hard to locate, so when using Trimmomatic through conda it may be easier to download or create a file with the relevant adapter sequences and store it in an easy-to-find directory.

In [64]:
!gsutil -m cp -r gs://nigms-sandbox/me-inbre-rnaseq-pipelinev2/config/TruSeq3-PE.fa .
!head TruSeq3-PE.fa 

Copying gs://nigms-sandbox/me-inbre-rnaseq-pipelinev2/config/TruSeq3-PE.fa...
/ [1/1 files][   95.0 B/   95.0 B] 100% Done                                    
Operation completed over 1 objects/95.0 B.                                       
>PrefixPE/1
TACACTCTTTCCCTACACGACGCTCTTCCGATCT
>PrefixPE/2
GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT



### STEP 6: Run Trimmomatic
Trimmomatic will trim off any adapter sequences or low-quality sequence it detects in the FASTQ files.

By piping our sample list into `xargs`, we can queue up a batch run of Trimmomatic for all our files. Note that this is a different way to run a loop from the one we used before.
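
To see what `xargs -I {}` does with each line of `samples.txt`, the sketch below builds the same Trimmomatic argument list in Python and prints it without running anything. The helper name `trimmomatic_command` and the variable `example_samples` are hypothetical; the paths and trimming options are copied from the command used in this step.

```python
import shlex

def trimmomatic_command(sample, threads=1, adapters="TruSeq3-PE.fa"):
    """Build the paired-end Trimmomatic argument list for one sample ID."""
    raw, trimmed = "data/raw_fastq", "data/trimmed"
    return [
        "trimmomatic", "PE", "-threads", str(threads),
        # Inputs: forward and reverse reads
        f"{raw}/{sample}_1.fastq.gz", f"{raw}/{sample}_2.fastq.gz",
        # Outputs: paired and unpaired reads for each direction
        f"{trimmed}/{sample}_1_trimmed.fastq.gz",
        f"{trimmed}/{sample}_1_trimmed_unpaired.fastq.gz",
        f"{trimmed}/{sample}_2_trimmed.fastq.gz",
        f"{trimmed}/{sample}_2_trimmed_unpaired.fastq.gz",
        # Trimming steps, as in the shell command
        f"ILLUMINACLIP:{adapters}:2:30:10:2:keepBothReads",
        "LEADING:3", "TRAILING:3", "MINLEN:36",
    ]

example_samples = ["GSM5004088", "GSM5004089"]
commands = [trimmomatic_command(s, threads=4) for s in example_samples]
for cmd in commands:
    # shlex.join shows the command as it would be typed in a shell
    print(shlex.join(cmd))
```

Each list could be passed to `subprocess.run(cmd, check=True)` to run the trimming from Python instead of through `xargs`.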

The code below may take approximately 35 minutes to run.

In [65]:
!cat samples.txt | xargs -I {} trimmomatic PE -threads $THREADS 'data/raw_fastq/{}_1.fastq.gz' 'data/raw_fastq/{}_2.fastq.gz' 'data/trimmed/{}_1_trimmed.fastq.gz' 'data/trimmed/{}_1_trimmed_unpaired.fastq.gz' 'data/trimmed/{}_2_trimmed.fastq.gz' 'data/trimmed/{}_2_trimmed_unpaired.fastq.gz' ILLUMINACLIP:TruSeq3-PE.fa:2:30:10:2:keepBothReads LEADING:3 TRAILING:3 MINLEN:36

TrimmomaticPE: Started with arguments:
 -threads 1 data/raw_fastq/GSM5004088_1.fastq.gz data/raw_fastq/GSM5004088_2.fastq.gz data/trimmed/GSM5004088_1_trimmed.fastq.gz data/trimmed/GSM5004088_1_trimmed_unpaired.fastq.gz data/trimmed/GSM5004088_2_trimmed.fastq.gz data/trimmed/GSM5004088_2_trimmed_unpaired.fastq.gz ILLUMINACLIP:TruSeq3-PE.fa:2:30:10:2:keepBothReads LEADING:3 TRAILING:3 MINLEN:36
Using PrefixPair: 'TACACTCTTTCCCTACACGACGCTCTTCCGATCT' and 'GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT'
ILLUMINACLIP: Using 1 prefix pairs, 0 forward/reverse sequences, 0 forward only sequences, 0 reverse only sequences
Error: Unable to detect quality encoding
TrimmomaticPE: Started with arguments:
 -threads 1 data/raw_fastq/GSM5004089_1.fastq.gz data/raw_fastq/GSM5004089_2.fastq.gz data/trimmed/GSM5004089_1_trimmed.fastq.gz data/trimmed/GSM5004089_1_trimmed_unpaired.fastq.gz data/trimmed/GSM5004089_2_trimmed.fastq.gz data/trimmed/GSM5004089_2_trimmed_unpaired.fastq.gz ILLUMINACLIP:TruSeq3-PE.fa:2:30:10:
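
The `Unable to detect quality encoding` error in the output above is commonly a sign that an input file is empty, missing, or not valid FASTQ, so it can be worth checking the merged files before trimming. The sketch below is one minimal way to do that with the standard library only; `first_fastq_record_ok` is a hypothetical helper, and the demonstration files are created in a temporary directory rather than taken from the tutorial data.

```python
import gzip
import os
import tempfile

def first_fastq_record_ok(path):
    """Return True if the gzipped file starts with one well-formed FASTQ record."""
    try:
        with gzip.open(path, "rt") as handle:
            header, seq, plus, qual = (handle.readline().rstrip("\n") for _ in range(4))
    except (OSError, EOFError):
        # Missing file, unreadable file, or data that is not valid gzip
        return False
    return (header.startswith("@") and plus.startswith("+")
            and len(seq) > 0 and len(seq) == len(qual))

with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good_1.fastq.gz")
    empty = os.path.join(tmp, "empty_1.fastq.gz")
    with gzip.open(good, "wt") as handle:
        handle.write("@read1\nACGT\n+\nIIII\n")
    open(empty, "wb").close()  # zero-byte file, as a failed merge might leave behind

    good_ok = first_fastq_record_ok(good)
    empty_ok = first_fastq_record_ok(empty)
    missing_ok = first_fastq_record_ok(os.path.join(tmp, "missing_1.fastq.gz"))

print(good_ok, empty_ok, missing_ok)
```

In the notebook, the same function could be called on `data/raw_fastq/{sample}_1.fastq.gz` and `data/raw_fastq/{sample}_2.fastq.gz` for every sample ID in `samples.txt`.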

### STEP 7 (Optional): Run FastQC

It's best practice to run FastQC after trimming. However, you may decide to run FastQC only once, before or after trimming.

We will proceed with only the forward reads. Judging from the Trimmomatic output, there were very few 'orphaned' reads; that is, most forward and reverse reads were successfully paired. Because we are only mapping to a transcriptome, the forward reads alone, which in this case are around 50 base pairs long, should be sufficient.

The code below may take around 10 minutes to run.

In [49]:
!cat samples.txt | xargs -I {} fastqc "data/trimmed/{}_1_trimmed.fastq.gz" -o data/fastqc_samples/

Skipping 'data/trimmed/GSM5004088_1_trimmed.fastq.gz' which didn't exist, or couldn't be read
Skipping 'data/trimmed/GSM5004089_1_trimmed.fastq.gz' which didn't exist, or couldn't be read
Skipping 'data/trimmed/GSM5004090_1_trimmed.fastq.gz' which didn't exist, or couldn't be read
Skipping 'data/trimmed/GSM5004091_1_trimmed.fastq.gz' which didn't exist, or couldn't be read
Skipping 'data/trimmed/GSM5004092_1_trimmed.fastq.gz' which didn't exist, or couldn't be read
Skipping 'data/trimmed/GSM5004093_1_trimmed.fastq.gz' which didn't exist, or couldn't be read


### STEP 8: Run MultiQC
MultiQC reads in the FastQC reports and generates a compiled report for all the analyzed FASTQ files.

In [50]:
#!multiqc -f data/fastqc_samples/
!multiqc -f -o data/multiqc_samples/ data/fastqc_samples/


/// MultiQC 🔍 v1.25

       file_search | Search path: /home/ec2-user/SageMaker/rnaseq-myco-notebook/data/fastqc_samples
         searching | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% 0/0
           multiqc | No analysis results found. Cleaning up…


### STEP 9: Index the Transcriptome so that Trimmed Reads Can Be Mapped Using Salmon

To build a Salmon index, we must specify the reference transcriptome and the folder in which the index will be created.

Note that `-i` here does not mean input; it names the folder where the index will be created.

In [58]:
!salmon index -t data/reference/GCF_001632805.1_transcriptome_reference_w_decoy.fa -p $THREADS -i data/reference/transcriptome_index --decoys data/reference/decoys.txt -k 31 --keepDuplicates

Version Server Response: Not Found
[2024-09-18 06:16:37.320] [jLog] [info] building index
out : data/reference/transcriptome_index
[2024-09-18 06:16:37.320] [puff::index::jointLog] [info] Running fixFasta

[Step 1 of 4] : counting k-mers

[2024-09-18 06:16:37.544] [puff::index::jointLog] [info] Replaced 0 non-ATCG nucleotides
[2024-09-18 06:16:37.544] [puff::index::jointLog] [info] Clipped poly-A tails from 0 transcripts
wrote 4919 cleaned references
[2024-09-18 06:16:37.564] [puff::index::jointLog] [info] Filter size not provided; estimating from number of distinct k-mers
[2024-09-18 06:16:37.663] [puff::index::jointLog] [info] ntHll estimated 4966944 distinct k-mers, setting filter size to 2^27
Threads = 1
Vertex length = 31
Hash functions = 5
Filter size = 134217728
Capacity = 2
Files: 
data/reference/transcriptome_index/ref_k31_fixed.fa
--------------------------------------------------------------------------------
Round 0, 0:

### STEP 10: Run Salmon to Map Reads to Transcripts and Quantify Expression Levels
Salmon aligns the trimmed reads to the reference transcriptome and generates the read counts per transcript. In this analysis, each gene has a single transcript.

In [59]:
!cat samples.txt | xargs -I {} salmon quant -i data/reference/transcriptome_index -l SR -r "data/trimmed/{}_1_trimmed.fastq.gz" -p 8 --validateMappings -o "data/quants/{}_quant" > dump.txt

Version Server Response: Not Found
### salmon (selective-alignment-based) v1.10.3
### [ program ] => salmon 
### [ command ] => quant 
### [ index ] => { data/reference/transcriptome_index }
### [ libType ] => { SR }
### [ unmatedReads ] => { data/trimmed/GSM5004088_1_trimmed.fastq.gz }
### [ threads ] => { 8 }
### [ validateMappings ] => { }
### [ output ] => { data/quants/GSM5004088_quant }
Logs will be written to data/quants/GSM5004088_quant/logs
[2024-09-18 06:16:58.535] [jointLog] [info] setting maxHashResizeThreads to 8
[2024-09-18 06:16:58.535] [jointLog] [info] Fragment incompatibility prior below threshold.  Incompatible fragments will be ignored.
[2024-09-18 06:16:58.535] [jointLog] [info] Usage of --validateMappings implies use of minScoreFraction. Since not explicitly specified, it is being set to 0.65
[2024-09-18 06:16:58.535] [jointLog] [info] Setting consensusSlack to selective-alignment default of 0.35.
[2024-09-18 06:16:58.5

### STEP 11: Report the top 10 most highly expressed genes in the samples

Top 10 most highly expressed genes in each wild-type sample.


In [60]:
!head -n 1 data/quants/GSM5004088_quant/quant.sf
!sort -nrk 4,4 data/quants/GSM5004088_quant/quant.sf | head -10
!sort -nrk 4,4 data/quants/GSM5004089_quant/quant.sf | head -10
!sort -nrk 4,4 data/quants/GSM5004090_quant/quant.sf | head -10


head: cannot open ‘data/quants/SRR13349122_quant/quant.sf’ for reading: No such file or directory
sort: cannot read: data/quants/GSM5004088_quant/quant.sf: No such file or directory
sort: cannot read: data/quants/GSM5004089_quant/quant.sf: No such file or directory
sort: cannot read: data/quants/GSM5004090_quant/quant.sf: No such file or directory


Top 10 most highly expressed genes in the double lysogen samples.


In [61]:
!head -n 1 data/quants/GSM5004091_quant/quant.sf
!sort -nrk 4,4 data/quants/GSM5004091_quant/quant.sf | head -10
!sort -nrk 4,4 data/quants/GSM5004092_quant/quant.sf | head -10
!sort -nrk 4,4 data/quants/GSM5004093_quant/quant.sf | head -10

head: cannot open ‘data/quants/SRR13349122_quant/quant.sf’ for reading: No such file or directory
sort: cannot read: data/quants/GSM5004091_quant/quant.sf: No such file or directory
sort: cannot read: data/quants/GSM5004092_quant/quant.sf: No such file or directory
sort: cannot read: data/quants/GSM5004093_quant/quant.sf: No such file or directory


### STEP 12: Report the expression of a putative acyl-ACP desaturase (BB28_RS16545) that was downregulated in the double lysogen relative to wild-type
A putative acyl-ACP desaturase was reported to be downregulated in the double lysogen, as shown in the table of the top 20 upregulated and downregulated genes from the paper describing the study.
![Table of the top upregulated and downregulated genes from the paper](images/table-cushman.png)

Use `grep` to report the expression in the wild-type samples. The fields in the Salmon `quant.sf` file are as follows. The level of expression is reported in the Transcripts Per Million (`TPM`) and number of reads (`NumReads`) fields:  
`Name    Length  EffectiveLength TPM     NumReads`

In [62]:
!grep 'BB28_RS16545' data/quants/GSM5004088_quant/quant.sf
!grep 'BB28_RS16545' data/quants/GSM5004089_quant/quant.sf
!grep 'BB28_RS16545' data/quants/GSM5004090_quant/quant.sf

grep: data/quants/GSM5004088_quant/quant.sf: No such file or directory
grep: data/quants/GSM5004089_quant/quant.sf: No such file or directory
grep: data/quants/GSM5004090_quant/quant.sf: No such file or directory


Use `grep` to report the expression in the double lysogen samples. The fields in the Salmon `quant.sf` file are as follows. The level of expression is reported in the Transcripts Per Million (`TPM`) and number of reads (`NumReads`) fields:  
`Name    Length  EffectiveLength TPM     NumReads`

In [63]:
!grep 'BB28_RS16545' data/quants/GSM5004091_quant/quant.sf
!grep 'BB28_RS16545' data/quants/GSM5004092_quant/quant.sf
!grep 'BB28_RS16545' data/quants/GSM5004093_quant/quant.sf

grep: data/quants/GSM5004091_quant/quant.sf: No such file or directory
grep: data/quants/GSM5004092_quant/quant.sf: No such file or directory
grep: data/quants/GSM5004093_quant/quant.sf: No such file or directory
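
The `sort` and `grep` lookups from the last two steps can also be sketched in Python, which makes the column names explicit. The rows in `QUANT_TEXT` follow the `quant.sf` layout described above, but they are invented for illustration: the `gene-EXAMPLE_*` names and all numbers are not real results, and the `gene-` prefix on BB28_RS16545 is an assumption about how names appear in the file. The helper names are hypothetical.

```python
import csv
import io

# Invented example rows in the quant.sf layout; the numbers are not real results
QUANT_TEXT = (
    "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
    "gene-EXAMPLE_A\t1200\t1050.0\t15.5\t300\n"
    "gene-BB28_RS16545\t900\t750.0\t2.25\t40\n"
    "gene-EXAMPLE_B\t600\t450.0\t120.0\t1000\n"
)

NUMERIC_COLUMNS = ("Length", "EffectiveLength", "TPM", "NumReads")

def read_quant(handle):
    """Parse quant.sf rows into dicts, converting the numeric columns to float."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        for column in NUMERIC_COLUMNS:
            row[column] = float(row[column])
        rows.append(row)
    return rows

def top_by_tpm(rows, n=10):
    """Return the n rows with the highest TPM, highest first."""
    return sorted(rows, key=lambda r: r["TPM"], reverse=True)[:n]

def find_gene(rows, gene_id):
    """Return the first row whose name contains gene_id (like grep), or None."""
    return next((r for r in rows if gene_id in r["Name"]), None)

rows = read_quant(io.StringIO(QUANT_TEXT))
top = top_by_tpm(rows, n=2)
hit = find_gene(rows, "BB28_RS16545")
print([r["Name"] for r in top])
print(hit["Name"], hit["TPM"], hit["NumReads"])
```

With real data, `read_quant` could be given an open file handle such as `open("data/quants/GSM5004088_quant/quant.sf")` instead of the in-memory example.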


### STEP 13: Combine Genecounts to a Single Genecount File
Commonly, the read counts for each sample are combined into a single table, where each row is a gene ID and each column is a sample.

As before, this can be done in many ways. Salmon's `quantmerge` command outputs such a table.

However, `quantmerge` names each column after its quant directory (for example, `GSM5004088_quant`), whereas read count tables usually use the plain sample IDs as column headers.

You could rename those headers in different ways, for instance with a spreadsheet editor, or with a shell command such as `sed`, used below to replace the first line of the table with the relevant tab-separated headers.

You could also use `sed` to remove the 'gene-' or 'rna-' prefixes from the table.

In [57]:
## First, merge the Salmon outputs by number of reads.
!salmon quantmerge --column numreads --quants data/quants/*_quant -o data/quants/merged_quants.txt
## Optionally, rename the columns to the plain sample IDs.
!sed -i "1s/.*/Name\tGSM5004088\tGSM5004089\tGSM5004090\tGSM5004091\tGSM5004092\tGSM5004093/" data/quants/merged_quants.txt

## For the later merge in our R code, it is easier
## if we remove the gene- and rna- prefixes.
!sed -i "s/gene-//" data/quants/merged_quants.txt
!sed -i "s/rna-//" data/quants/merged_quants.txt

print("An example of a combined genecount outputfile.")
!head data/quants/merged_quants.txt

Version Server Response: Not Found
[2024-09-18 06:15:56.885] [mergeLog] [info] samples: [ data/quants/GSM5004088_quant, data/quants/GSM5004089_quant, data/quants/GSM5004090_quant, data/quants/GSM5004091_quant, data/quants/GSM5004092_quant, data/quants/GSM5004093_quant ]
[2024-09-18 06:15:56.885] [mergeLog] [info] sample names : [ GSM5004088_quant, GSM5004089_quant, GSM5004090_quant, GSM5004091_quant, GSM5004092_quant, GSM5004093_quant ]
[2024-09-18 06:15:56.885] [mergeLog] [info] output column : NUMREADS
[2024-09-18 06:15:56.885] [mergeLog] [info] output file : data/quants/merged_quants.txt
[2024-09-18 06:15:56.885] [mergeLog] [critical] The sample directory data/quants/GSM5004088_quant either doesn't exist, or doesn't contain a quant.sf file
sed: can't read data/quants/merged_quants.txt: No such file or directory
sed: can't read data/quants/merged_quants.txt: No such file or directory
sed: can't read data/quants/merged_quants.txt: 
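
For readers who prefer to build the combined table in Python, the sketch below shows one possible way to do it: a header row of sample IDs, one row per gene, and the `gene-` and `rna-` prefixes removed. It is an illustration only, not a replacement for `salmon quantmerge`; the helper names are hypothetical and the counts in `counts_by_sample` are invented. Unlike the `sed` commands, `strip_prefix` only removes a prefix at the start of the name.

```python
import csv
import io

def strip_prefix(name, prefixes=("gene-", "rna-")):
    """Remove a leading 'gene-' or 'rna-' from a feature name, if present."""
    for prefix in prefixes:
        if name.startswith(prefix):
            return name[len(prefix):]
    return name

def merge_counts(counts_by_sample):
    """Build a table: header of sample IDs, then one row per gene (missing = 0)."""
    samples = sorted(counts_by_sample)
    genes = sorted({gene for counts in counts_by_sample.values() for gene in counts})
    table = [["Name", *samples]]
    for gene in genes:
        row = [strip_prefix(gene)]
        row.extend(counts_by_sample[sample].get(gene, 0) for sample in samples)
        table.append(row)
    return table

# Invented counts for two samples; real values would come from each quant.sf
counts_by_sample = {
    "GSM5004088": {"gene-EXAMPLE_A": 300, "rna-EXAMPLE_B": 12},
    "GSM5004091": {"gene-EXAMPLE_A": 150},
}

table = merge_counts(counts_by_sample)
buffer = io.StringIO()
csv.writer(buffer, delimiter="\t", lineterminator="\n").writerows(table)
print(buffer.getvalue())
```

Writing to a real file instead of `buffer` would give a tab-separated table in the same shape as the renamed `merged_quants.txt`.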

## <a name="workflow">Additional Workflows</a>

Now that you have read counts per gene, feel free to explore the R workflow, which creates plots and analyses from these read count files, or try alternative workflows for creating read count files, such as Snakemake.


[Workflow One:](Tutorial_1.ipynb) A short introduction to downloading and mapping sequences to a transcriptome using Trimmomatic and Salmon. Here is a link to the YouTube video demonstrating the tutorial: <https://youtu.be/ChGfBR4do_Y>.

[Workflow One (Extended):](Tutorial_1B_Extended.ipynb) An extended version of workflow one. Once you have got your feet wet, you can retry workflow one with this extended version that covers the entire dataset, and includes elaboration such as using SRA tools for sequence downloading, and examples of running batches of fastq files through the pipeline. This workflow may take around an hour to run.

[Workflow One (Using Snakemake):](Tutorial_2_Snakemake.ipynb) Using snakemake to run workflow one.

[Workflow Two (DEG Analysis):](Tutorial_3_DEG_Analysis.ipynb) Using DESeq2 and R to conduct clustering and differential gene expression analysis.


![RNA-Seq workflow](images/RNA-Seq_Notebook_Homepage.png)