# Extended RNA-Seq Analysis Training Demo

## Overview

In this tutorial, we will repeat the Tutorial 1B extended submodule with a different dataset. Using the provided dataset is a great way to get introduced to the module and learn its basic functionality, but to use it with your own data or to understand the module at a deeper level, it is helpful to practice adapting it to a new dataset. We will guide you through that here, providing basic guidance on the workflow and letting you write the commands yourself. If you get stuck, you can always view our hints and suggestions by clicking on the dropdown arrow under the command cells.

The initial dataset used a prokaryotic sample for simplicity and compute efficiency. This time, we will increase the complexity a bit by using a eukaryotic sample. We have selected data from a time series experiment on Plasmodium falciparum, the parasite responsible for malaria. The experiment did RNA-seq analysis on P. falciparum cells at 25 time points after erythrocyte invasion. The data is taken from Kucharski M, Tripathi J, Nayak S, Zhu L et al. A comprehensive RNA handling and transcriptomics guide for high-throughput processing of Plasmodium blood-stage samples. Malar J 2020 Oct 9;19(1):363. The sequence data is available from SRA with the accession number [SRP261441](https://www.ncbi.nlm.nih.gov/sra/?term=SRP261441). To keep things simple, we are not using all time points, and have selected time points 1, 13, and 25 as a proxy for early, mid, and late infection. Feel free to add or remove samples from your analysis to see how the results differ. The workflow structure remains the same as the original submodule 1B, with the diagram shown below.

![RNA-Seq workflow](images/rnaseq-workflow.png)

### STEP 1: Install Mambaforge and then install the required tools using bioconda.

First install Mambaforge.


In [None]:
<YOUR COMMAND HERE>

<details>
  <summary>Click for help</summary>
    
  Make sure you include the `!` in front of your command! 
    
  ```  
curl -L -O https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-$(uname)-$(uname -m).sh
bash Mambaforge-$(uname)-$(uname -m).sh -b -u -p $HOME/mambaforge
date +"%T"
  ```

</details>




Next, add it to your `PATH`.

In [29]:
<YOUR COMMAND HERE>


<details>
  <summary>Click for help</summary>
    
  This cell contains Python code, so it does not need a `!` in front.
    
  ```  
#add to your path
import os
os.environ["PATH"] += os.pathsep + os.environ["HOME"]+"/mambaforge/bin"  
  ```

</details>




Next, using mambaforge and bioconda, install the tools that will be used in this tutorial.

In [None]:
<YOUR COMMAND HERE>

<details>
  <summary>Click for help</summary>
    
  Make sure you include the `!` in front of your command! 
    
  ```  
  mamba install -y -c conda-forge -c bioconda trimmomatic fastqc multiqc salmon entrez-direct gffread parallel-fastq-dump sra-tools pigz
  ```

</details>
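
Before moving on, it can help to confirm that the installed tools are visible on your `PATH`. The following is a minimal Python sketch using the standard library's `shutil.which`. The helper name `missing_tools` is hypothetical, and the list of executable names is an assumption about what the packages above provide (for example, `entrez-direct` providing `esearch`, and `sra-tools` providing `prefetch` and `fasterq-dump`).

```python
import shutil

# Executable names assumed to be provided by the packages installed above.
EXPECTED_TOOLS = [
    "trimmomatic", "fastqc", "multiqc", "salmon", "esearch",
    "gffread", "prefetch", "fasterq-dump", "pigz",
]

def missing_tools(tools):
    """Return the tools from `tools` that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]

missing = missing_tools(EXPECTED_TOOLS)
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All expected tools were found on PATH.")
```

If anything is reported as missing, re-check the `PATH` cell above before re-running the install.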

### STEP 2: Setup Environment

The directory structure will be the same as in Tutorial 1B. Create that below.

In [None]:
<YOUR COMMAND HERE>

<details>
  <summary>Click for help</summary>
    
  Make sure you include the `!` in front of your command! 
    
  ```  
 cd $HOMEDIR
 mkdir -p data
 mkdir -p data/raw_fastq
 mkdir -p data/trimmed
 mkdir -p data/fastqc
 mkdir -p data/aligned
 mkdir -p data/reference
  ```

</details>
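
The same directory layout can also be created from Python. This is a minimal sketch, assuming you want the folders created relative to a base directory of your choice; the helper name `make_project_dirs` is hypothetical.

```python
from pathlib import Path

# Sub-directories used in this tutorial, all placed under data/.
SUBDIRS = ["raw_fastq", "trimmed", "fastqc", "aligned", "reference"]

def make_project_dirs(base="."):
    """Create data/<subdir> folders under `base` and return the created paths."""
    created = []
    for name in SUBDIRS:
        path = Path(base) / "data" / name
        path.mkdir(parents=True, exist_ok=True)  # same effect as `mkdir -p`
        created.append(path)
    return created
```

Calling `make_project_dirs()` with no argument creates the folders in the current working directory, and running it twice is harmless.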

Set the number of cores depending on your VM size.

In [None]:
<YOUR COMMAND HERE>

<details>
  <summary>Click for help</summary>
    
  Make sure you include the `!` in front of your command! 
    
  ```  
numthreads=!nproc
numthreadsint = int(numthreads[0])
%env CORES = $numthreadsint
#!echo ${CORES}
  ```

</details>

### STEP 3: Downloading relevant FASTQ files using SRA Tools

Next we will need to download the relevant fastq files.

Because these files can be large, the process of downloading and extracting fastq files can be quite lengthy.

The sequence data for this tutorial comes from the time series experiment by Kucharski et al. described in the overview, <em>A comprehensive RNA handling and transcriptomics guide for high-throughput processing of Plasmodium blood-stage samples</em>.

We will be downloading the sample runs from this project using SRA tools, downloading from the NCBI's SRA (Sequence Read Archive).

However, first we need to find the associated accession numbers in order to download.


### STEP 3.1: Finding run accession numbers.

The SRA stores sequence data in terms of runs (SRR stands for Sequence Read Run). To download runs, we will need the accession ID for each run we wish to download. 

This tutorial uses three runs from the Kucharski et al. project, covering the selected time points. To make it easier, these are the run IDs:

+ SRR11784387
+ SRR11784398
+ SRR11784410


All of these runs belong to the SRA study (SRP) SRP261441.

Sequence run experiments can be searched for using the SRA database on the NCBI website, and article-specific sample run information can be found in the supplementary section of that article.

For instance, for the Cushman et al. study used in Tutorial 1B, the authors posted a link to the sequence data GSE (GEO Series) accession, <a href='https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE164210'>GSE164210</a>. This leads to the appropriate 'Gene Expression Omnibus' page where, among other useful files and information, the relevant SRA database link can be found. 

### STEP 3.1.2 (Optional): Generate the accession list file with BigQuery

In [6]:
# Import the BigQuery API
from google.cloud import bigquery
import pandas

Now make sure you have enabled the BigQuery API. Search for BigQuery, go to the BigQuery page, and click `Enable`.

In [7]:
# Designate the client for the API
client = bigquery.Client(location="US")
print("Client creating using default project: {}".format(client.project))

Client creating using default project: cit-oconnellka-9999


Now we will query BigQuery using the species name and a range of accession numbers associated with this particular study. Feel free to play around with the query to generate different variations of accession numbers!

In [8]:
query = """
#standardSQL
SELECT *
FROM `nih-sra-datastore.sra.metadata`
WHERE organism = 'Plasmodium falciparum 3D7'
and acc IN ('SRR11784387','SRR11784398','SRR11784410')
"""
query_job = client.query(
    query,
    # Location must match that of the dataset(s) referenced in the query.
    location="US",
)  # API request - starts the query

df = query_job.to_dataframe()

In [9]:
df

Unnamed: 0,acc,assay_type,center_name,consent,experiment,sample_name,instrument,librarylayout,libraryselection,librarysource,...,geo_loc_name_sam,ena_first_public_run,ena_last_update_run,sample_name_sam,datastore_filetype,datastore_provider,datastore_region,attributes,run_file_version,jattr
0,SRR11784387,RNA-Seq,GEO,public,SRX8336810,GSM4551251,Illumina HiSeq 4000,PAIRED,cDNA,TRANSCRIPTOMIC,...,[],[],[],[],"[fastq, sra, run.zq]","[ncbi, s3, gs]","[s3.us-east-1, gs.US, ncbi.public]","[{'k': 'geo_accession_exp', 'v': 'GSM4551251'}...",1,"{""geo_accession_exp"": [""GSM4551251""], ""bases"":..."
1,SRR11784410,RNA-Seq,GEO,public,SRX8336833,GSM4551274,Illumina HiSeq 4000,PAIRED,cDNA,TRANSCRIPTOMIC,...,[],[],[],[],"[sra, run.zq, fastq]","[gs, s3, ncbi]","[gs.US, s3.us-east-1, ncbi.public]","[{'k': 'geo_accession_exp', 'v': 'GSM4551274'}...",1,"{""geo_accession_exp"": [""GSM4551274""], ""bases"":..."
2,SRR11784398,RNA-Seq,GEO,public,SRX8336821,GSM4551262,Illumina HiSeq 4000,PAIRED,cDNA,TRANSCRIPTOMIC,...,[],[],[],[],"[run.zq, sra, fastq]","[s3, ncbi, gs]","[ncbi.public, s3.us-east-1, gs.US]","[{'k': 'geo_accession_exp', 'v': 'GSM4551262'}...",1,"{""geo_accession_exp"": [""GSM4551262""], ""bases"":..."
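
If you want to try different sets of runs, it can be convenient to build the query text from a Python list rather than editing the string by hand. The sketch below is one way to do that. `build_sra_query` is a hypothetical helper, and the accession pattern it checks (`SRR`, `ERR`, or `DRR` followed by digits) is our assumption about what valid run IDs look like.

```python
import re

# Assumed shape of a run accession: SRR/ERR/DRR followed by digits.
RUN_PATTERN = re.compile(r"^[SED]RR\d+$")

def build_sra_query(organism, accessions):
    """Return a standard-SQL query selecting the given runs for one organism."""
    bad = [acc for acc in accessions if not RUN_PATTERN.match(acc)]
    if bad:
        raise ValueError(f"Unexpected accession format: {bad}")
    if "'" in organism:
        raise ValueError("Organism name must not contain a single quote")
    acc_list = ",".join(f"'{acc}'" for acc in accessions)
    return (
        "SELECT *\n"
        "FROM `nih-sra-datastore.sra.metadata`\n"
        f"WHERE organism = '{organism}'\n"
        f"AND acc IN ({acc_list})"
    )

query = build_sra_query(
    "Plasmodium falciparum 3D7",
    ["SRR11784387", "SRR11784398", "SRR11784410"],
)
print(query)
```

The resulting string can be passed to `client.query(...)` in the same way as the hand-written query.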


In [10]:
with open('accs.txt', 'w') as f:
    # Write one accession per line, avoiding any alignment padding from to_string()
    f.write("\n".join(df['acc']) + "\n")

In [11]:
cat accs.txt

SRR11784387
SRR11784410
SRR11784398
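
Later steps read `accs.txt` one accession per line, so stray whitespace or blank lines can cause confusing failures. As a hedged sketch, the hypothetical helper below parses the text of such a file and returns the cleaned IDs. It operates on a string so it can be tried without touching any files.

```python
def parse_accessions(text):
    """Return the non-empty, whitespace-stripped lines of an accession list."""
    return [line.strip() for line in text.splitlines() if line.strip()]

# Example input with a leading space and a blank line, to show the clean-up.
example = " SRR11784387\nSRR11784410\n\nSRR11784398\n"
print(parse_accessions(example))
```

To check your own file, you could call `parse_accessions(open("accs.txt").read())`.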

### STEP 3.2 Downloading multiple files using the SRA-toolkit.

One may, as in our case, wish to download multiple runs at once.

To aid in this, SRA-tools supports batch downloading.

We can download multiple SRA files with a single command by creating a list of the SRA IDs we wish to download and passing that list to the SRA-toolkit `prefetch` command. Note that it may take some time to download all the SRA files.

In [25]:
!prefetch -O data/raw_fastq/ --option-file accs.txt


2023-12-13T19:54:46 prefetch.3.0.9: Current preference is set to retrieve SRA Normalized Format files with full base quality scores.
2023-12-13T19:54:46 prefetch.3.0.9: 1) Downloading 'SRR11784410'...
2023-12-13T19:54:46 prefetch.3.0.9: SRA Normalized Format file is being retrieved, if this is different from your preference, it may be due to current file availability.
2023-12-13T19:54:46 prefetch.3.0.9:  Downloading via HTTPS...
2023-12-13T19:56:45 prefetch.3.0.9:  HTTPS download succeed
2023-12-13T19:56:53 prefetch.3.0.9:  'SRR11784410' is valid
2023-12-13T19:56:53 prefetch.3.0.9: 1) 'SRR11784410' was downloaded successfully
2023-12-13T19:56:53 prefetch.3.0.9: 'SRR11784410' has 0 unresolved dependencies

2023-12-13T19:56:53 prefetch.3.0.9: Current preference is set to retrieve SRA Normalized Format files with full base quality scores.
2023-12-13T19:56:53 prefetch.3.0.9: 2) Downloading 'SRR11784398'...
2023-12-13T19:56:53 prefetch.3.0.9: SRA Normalized Format file is being retrieved, 

### STEP 3.3 Converting Multiple SRA files to Fastq

We used fasterq-dump before to convert SRA files to fastq. However, fasterq-dump does not have native batch compatibility, so, as before, we will use a loop to convert each file in our list. Afterwards, we will compress the output to fastq.gz for downstream processing. This step should take about 30 minutes.

In [26]:
!for x in `cat accs.txt`; do fasterq-dump -f -O data/raw_fastq -e $CORES -m 4G data/raw_fastq/$x/$x.sra; done

spots read      : 26,704,146
reads read      : 53,408,292
reads written   : 53,408,292
spots read      : 17,580,775
reads read      : 35,161,550
reads written   : 35,161,550
spots read      : 13,008,763
reads read      : 26,017,526
reads written   : 26,017,526


Convert to fastq.gz

In [2]:
!pigz data/raw_fastq/*.fastq

### STEP 4: Copy reference transcriptome files that will be used by Salmon using E-Direct

We will need a refseq assembly here. Try searching through [NCBI genome database](https://www.ncbi.nlm.nih.gov/datasets/genome) for the appropriate Plasmodium falciparum assembly.

Salmon is a tool that aligns RNA-Seq reads to a transcriptome.

So we will need a transcriptome reference file.

To get one, we can search through the NCBI assembly database, find an assembly, and download transcriptome reference files from that assembly using FTP links.

For instance, we will use the `GCF_000002765.6` RefSeq assembly for Plasmodium falciparum 3D7, found by searching through the NCBI assembly database. The FTP links can be accessed through the website in various ways; one way is to click the 'FTP directory for RefSeq assembly' link, found under the 'Access the data' section.

Alternatively, if one were inclined, one could take the less common route and perform this through the NCBI command line tool suite called 'Entrez Direct' (EDirect).

This is an intricate and complicated set of tools, with many ways to do any one thing.

Below is an example of using an EDirect search query with a RefSeq identifier to obtain the relevant FTP directory, and then using that to download the desired reference files.

In [12]:
#parse for the ftp link and download the genome reference fasta file

!esearch -db assembly -query GCF_000002765.6 | efetch -format docsum \
| xtract -pattern DocumentSummary -element FtpPath_RefSeq \
| awk -F"/" '{print "curl -o data/reference/"$NF"_genomic.fna.gz " $0"/"$NF"_genomic.fna.gz"}' \
| bash

#parse for the ftp link and download the GFF annotation file

!esearch -db assembly -query GCF_000002765.6 | efetch -format docsum \
| xtract -pattern DocumentSummary -element FtpPath_RefSeq \
| awk -F"/" '{print "curl -o data/reference/"$NF"_genomic.gff.gz " $0"/"$NF"_genomic.gff.gz"}' \
| bash

# parse for the ftp link and download the feature-table reference file 
# (for later use for merging readcounts with gene names in R code).

!esearch -db assembly -query GCF_000002765.6 | efetch -format docsum \
| xtract -pattern DocumentSummary -element FtpPath_RefSeq \
| awk -F"/" '{print "curl -o data/reference/"$NF"_feature_table.txt.gz " $0"/"$NF"_feature_table.txt.gz"}' \
| bash


#unzip the compressed reference files

!gzip -d data/reference/*.gz --force

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 6504k  100 6504k    0     0  7443k      0 --:--:-- --:--:-- --:--:-- 7442k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  995k  100  995k    0     0  1490k      0 --:--:-- --:--:-- --:--:-- 1492k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  320k  100  320k    0     0   382k      0 --:--:-- --:--:-- --:--:--  383k
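
The `awk` step in the commands above builds each download URL by appending the last path component of the RefSeq FTP directory, plus a file suffix, to the directory itself. A minimal Python sketch of the same string manipulation follows. `refseq_file_url` is a hypothetical helper, and the example directory is an assumption that follows NCBI's usual layout rather than output captured from `esearch`.

```python
def refseq_file_url(ftp_dir, suffix):
    """Build <ftp_dir>/<basename><suffix>, mirroring the awk command."""
    ftp_dir = ftp_dir.rstrip("/")
    basename = ftp_dir.rsplit("/", 1)[-1]  # same as awk's $NF with -F"/"
    return f"{ftp_dir}/{basename}{suffix}"

# Assumed example directory, for illustration only.
example_dir = (
    "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/002/765/"
    "GCF_000002765.6_GCA_000002765"
)
for suffix in ("_genomic.fna.gz", "_genomic.gff.gz", "_feature_table.txt.gz"):
    print(refseq_file_url(example_dir, suffix))
```

This also explains the downloaded file names: each one starts with the assembly directory name followed by the suffix.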


Next we can use a tool called gffread to create a transcriptome reference file using the GFF annotation and genome files we downloaded.

In [13]:
!gffread -w data/reference/GCF_000002765.6_transcriptome_reference.fa -g data/reference/GCF_000002765.6_GCA_000002765_genomic.fna data/reference/GCF_000002765.6_GCA_000002765_genomic.gff


FASTA index file data/reference/GCF_000002765.6_GCA_000002765_genomic.fna.fai created.


It is also recommended to include the full genome at the end of the transcriptome reference file, for the purpose of performing a 'decoy-aware' mapping; more information can be found in the Salmon documentation.

To tell Salmon which sequences are decoys, we will also create a 'decoy file' that lists the names of the genome sequences appended to our transcriptome reference file.

In [14]:
!cat data/reference/GCF_000002765.6_transcriptome_reference.fa <(echo) data/reference/GCF_000002765.6_GCA_000002765_genomic.fna > data/reference/GCF_000002765.6_transcriptome_reference_w_decoy.fa

# keep only the sequence ID (the first field of each header line) for the decoy list
!grep "^>" data/reference/GCF_000002765.6_GCA_000002765_genomic.fna | cut -d " " -f 1 | sed 's/>//g' > data/reference/decoys.txt
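
Each line of the decoy file should be a sequence name only, which we take here to be the first whitespace-delimited token of the FASTA header without the leading `>`. The sketch below shows that extraction in Python on a small made-up FASTA string; the helper name `decoy_names` and the example sequences are hypothetical.

```python
def decoy_names(fasta_text):
    """Return the sequence IDs (first token of each header line) in a FASTA string."""
    names = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:  # skip malformed headers that contain only ">"
                names.append(tokens[0])
    return names

# Made-up FASTA content, for illustration only.
example_fasta = (
    ">chrA Example organism chromosome A\n"
    "ACGTACGT\n"
    ">chrB Example organism chromosome B\n"
    "TTGGCCAA\n"
)
print("\n".join(decoy_names(example_fasta)))
```

Comparing the output of a helper like this with `decoys.txt` is a quick way to spot trailing spaces or leftover description text in the decoy names.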

### STEP 5: Copy data file for Trimmomatic

One of Trimmomatic's functions is to trim sequencing-machine-specific adapter sequences. These are usually within the Trimmomatic installation directory in a folder called adapters.

Directories of packages within conda installations can be confusing, so when using conda with Trimmomatic, it may be easier to simply download or create a file with the relevant adapter sequences and store it in an easy-to-find directory.

In [15]:
! gsutil -m cp -r gs://nigms-sandbox/me-inbre-rnaseq-pipelinev2/config/TruSeq3-PE.fa .
!head TruSeq3-PE.fa 

Copying gs://nigms-sandbox/me-inbre-rnaseq-pipelinev2/config/TruSeq3-PE.fa...
/ [1/1 files][   95.0 B/   95.0 B] 100% Done                                    
Operation completed over 1 objects/95.0 B.                                       
>PrefixPE/1
TACACTCTTTCCCTACACGACGCTCTTCCGATCT
>PrefixPE/2
GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT



### STEP 6: Run Trimmomatic
Trimmomatic will trim off any adapter sequences or low quality sequence it detects in the FASTQ files.

Using piping and our original list, it is possible to queue up a batch run of Trimmomatic for all our files. Note that this is a different way to run a loop compared with what we did before.

The below code may take approximately 35 minutes to run.

In [28]:
!cat accs.txt | xargs -I {} trimmomatic PE -threads $CORES 'data/raw_fastq/{}_1.fastq.gz' 'data/raw_fastq/{}_2.fastq.gz' 'data/trimmed/{}_1_trimmed.fastq.gz' 'data/trimmed/{}_1_trimmed_unpaired.fastq.gz' 'data/trimmed/{}_2_trimmed.fastq.gz' 'data/trimmed/{}_2_trimmed_unpaired.fastq.gz' ILLUMINACLIP:TruSeq3-PE.fa:2:30:10:2:keepBothReads LEADING:3 TRAILING:3 MINLEN:36

TrimmomaticPE: Started with arguments:
 -threads 16 data/raw_fastq/SRR11784387_1.fastq.gz data/raw_fastq/SRR11784387_2.fastq.gz data/trimmed/SRR11784387_1_trimmed.fastq.gz data/trimmed/SRR11784387_1_trimmed_unpaired.fastq.gz data/trimmed/SRR11784387_2_trimmed.fastq.gz data/trimmed/SRR11784387_2_trimmed_unpaired.fastq.gz ILLUMINACLIP:TruSeq3-PE.fa:2:30:10:2:keepBothReads LEADING:3 TRAILING:3 MINLEN:36
Using PrefixPair: 'TACACTCTTTCCCTACACGACGCTCTTCCGATCT' and 'GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT'
ILLUMINACLIP: Using 1 prefix pairs, 0 forward/reverse sequences, 0 forward only sequences, 0 reverse only sequences
Quality encoding detected as phred33
Input Read Pairs: 13008763 Both Surviving: 12877289 (98.99%) Forward Only Surviving: 131474 (1.01%) Reverse Only Surviving: 0 (0.00%) Dropped: 0 (0.00%)
TrimmomaticPE: Completed successfully
TrimmomaticPE: Started with arguments:
 -threads 16 data/raw_fastq/SRR11784410_1.fastq.gz data/raw_fastq/SRR11784410_2.fastq.gz data/trimmed/SRR11784410_1_t
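
The summary line that Trimmomatic prints for each sample is worth checking, since it shows how many read pairs survived trimming. As a rough sketch, the hypothetical helper below parses such a line with a regular expression; the example line is copied from the output for the first sample, and the pattern is an assumption based on that one line.

```python
import re

# Pattern assumed from the TrimmomaticPE summary line shown in the output.
SUMMARY_RE = re.compile(
    r"Input Read Pairs: (\d+) Both Surviving: (\d+) .*?"
    r"Forward Only Surviving: (\d+) .*?"
    r"Reverse Only Surviving: (\d+) .*?"
    r"Dropped: (\d+)"
)

def parse_trimmomatic_summary(line):
    """Extract read-pair counts from a TrimmomaticPE summary line."""
    match = SUMMARY_RE.search(line)
    if match is None:
        raise ValueError("Not a TrimmomaticPE summary line")
    keys = ["input", "both", "forward_only", "reverse_only", "dropped"]
    return dict(zip(keys, map(int, match.groups())))

line = (
    "Input Read Pairs: 13008763 Both Surviving: 12877289 (98.99%) "
    "Forward Only Surviving: 131474 (1.01%) "
    "Reverse Only Surviving: 0 (0.00%) Dropped: 0 (0.00%)"
)
counts = parse_trimmomatic_summary(line)
print(counts)
print(f"{100 * counts['both'] / counts['input']:.2f}% of pairs kept")
```

The four outcome counts should add up to the number of input pairs, which is a simple sanity check on the parsed values.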

### STEP 7: Run FastQC
FastQC is an invaluable tool that allows you to evaluate whether there are problems with a set of reads. For example, it will provide a report of whether there is any bias in the sequence composition of the reads.

If you look at the results of the trimming, you may notice that the sequences in the reverse reads were few and largely unpaired. This may be an artifact of the original sequencing process. This is okay; we can proceed from here using only the forward reads.

The below code may take around 10 minutes to run.

In [30]:
!cat accs.txt | xargs -I {} fastqc -t $CORES "data/trimmed/{}_1_trimmed.fastq.gz" -o data/fastqc/

application/gzip
Started analysis of SRR11784387_1_trimmed.fastq.gz
Approx 5% complete for SRR11784387_1_trimmed.fastq.gz
Approx 10% complete for SRR11784387_1_trimmed.fastq.gz
Approx 15% complete for SRR11784387_1_trimmed.fastq.gz
Approx 20% complete for SRR11784387_1_trimmed.fastq.gz
Approx 25% complete for SRR11784387_1_trimmed.fastq.gz
Approx 30% complete for SRR11784387_1_trimmed.fastq.gz
Approx 35% complete for SRR11784387_1_trimmed.fastq.gz
Approx 40% complete for SRR11784387_1_trimmed.fastq.gz
Approx 45% complete for SRR11784387_1_trimmed.fastq.gz
Approx 50% complete for SRR11784387_1_trimmed.fastq.gz
Approx 55% complete for SRR11784387_1_trimmed.fastq.gz
Approx 60% complete for SRR11784387_1_trimmed.fastq.gz
Approx 65% complete for SRR11784387_1_trimmed.fastq.gz
Approx 70% complete for SRR11784387_1_trimmed.fastq.gz
Approx 75% complete for SRR11784387_1_trimmed.fastq.gz
Approx 80% complete for SRR11784387_1_trimmed.fastq.gz
Approx 85% complete for SRR11784387_1_trimmed.fastq.g

### STEP 8: Run MultiQC
MultiQC reads in the FastQC reports and generates a compiled report for all the analyzed FASTQ files.

In [31]:
!multiqc -f data/fastqc/


  /// MultiQC 🔍 | v1.17

|           multiqc | Search path : /home/jupyter/RNA-Seq-Differential-Expression-Analysis/data/fastqc
|         searching | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 12/12
|            fastqc | Found 6 reports
|           multiqc | Report      : multiqc_report.html
|           multiqc | Data        : multiqc_data   (overwritten)
Traceback (most recent call last):
  File "/opt/conda/bin/multiqc", line 10, in <module>
    sys.exit(run_multiqc())
  File "/opt/conda/lib/python3.7/site-packages/multiqc/__main__.py", line 23, in run_multiqc
    multiqc.run_cli(prog_name="multiqc")
  File "/opt/conda/lib/python3.7/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/rich_click/rich_command.py", line 126, in main


### STEP 9: Index the Transcriptome so that Trimmed Reads Can Be Mapped Using Salmon

In [32]:
!salmon index -t data/reference/GCF_000002765.6_transcriptome_reference_w_decoy.fa -p $CORES -i data/reference/transcriptome_index --decoys data/reference/decoys.txt -k 31 --keepDuplicates

Version Server Response: Not Found
[2023-12-15 20:56:42.421] [jLog] [info] building index
out : data/reference/transcriptome_index
[2023-12-15 20:56:42.421] [puff::index::jointLog] [info] Running fixFasta

[Step 1 of 4] : counting k-mers

[2023-12-15 20:56:43.146] [puff::index::jointLog] [info] Replaced 0 non-ATCG nucleotides
[2023-12-15 20:56:43.146] [puff::index::jointLog] [info] Clipped poly-A tails from 1 transcripts
wrote 5543 cleaned references
[2023-12-15 20:56:43.191] [puff::index::jointLog] [info] Filter size not provided; estimating from number of distinct k-mers
[2023-12-15 20:56:43.531] [puff::index::jointLog] [info] ntHll estimated 21261541 distinct k-mers, setting filter size to 2^29
Threads = 16
Vertex length = 31
Hash functions = 5
Filter size = 536870912
Capacity = 2
Files: 
data/reference/transcriptome_index/ref_k31_fixed.fa
--------------------------------------------------------------------------------
Round 0, 

### STEP 10: Run Salmon to Map Reads to Transcripts and Quantify Expression Levels
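Salmon aligns the trimmed reads to the reference transcriptome and generates the read counts per transcript. Keep in mind that, unlike in the prokaryotic dataset from Tutorial 1B, a eukaryotic gene can have more than one transcript, so these counts are per transcript rather than per gene.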

In [33]:
!cat accs.txt | xargs -I {} salmon quant -i data/reference/transcriptome_index -l SR -r "data/trimmed/{}_1_trimmed.fastq.gz" -p $CORES --validateMappings -o "data/quants/{}_quant"

Version Server Response: Not Found
### salmon (selective-alignment-based) v1.10.2
### [ program ] => salmon 
### [ command ] => quant 
### [ index ] => { data/reference/transcriptome_index }
### [ libType ] => { SR }
### [ unmatedReads ] => { data/trimmed/SRR11784387_1_trimmed.fastq.gz }
### [ threads ] => { 16 }
### [ validateMappings ] => { }
### [ output ] => { data/quants/SRR11784387_quant }
Logs will be written to data/quants/SRR11784387_quant/logs
[2023-12-15 20:57:04.541] [jointLog] [info] setting maxHashResizeThreads to 16
[2023-12-15 20:57:04.541] [jointLog] [info] Fragment incompatibility prior below threshold.  Incompatible fragments will be ignored.
[2023-12-15 20:57:04.541] [jointLog] [info] Usage of --validateMappings implies use of minScoreFraction. Since not explicitly specified, it is being set to 0.65
[2023-12-15 20:57:04.541] [jointLog] [info] Setting consensusSlack to selective-alignment default of 0.35.
[2023-12-15 20:57

### STEP 11: Report the top 10 most highly expressed genes in the samples

Top 10 most highly expressed genes in each sample.


In [26]:
!head -n 1 data/quants/SRR11784387_quant/quant.sf
!sort -nrk 4,4 data/quants/SRR11784398_quant/quant.sf | head -10
!sort -nrk 4,4 data/quants/SRR11784410_quant/quant.sf | head -10
!sort -nrk 4,4 data/quants/SRR11784387_quant/quant.sf | head -10

Name	Length	EffectiveLength	TPM	NumReads
rna-XM_001347701.1	312	62.441	62647.794533	1999.000
rna-XM_001348736.3	1014	764.000	45909.700034	17924.000
rna-XM_961070.1	399	149.000	27514.437041	2095.000
rna-XM_001349953.1	951	701.000	23354.077683	8366.000
rna-XR_002273103.1	159	5.631	17897.187939	51.500
rna-XR_002273083.1	159	5.631	17897.187939	51.500
rna-XR_002273104.1	199	8.782	17591.080307	78.940
rna-XM_002808709.1	165	5.972	16383.649843	50.000
rna-XR_002273085.2	3788	3538.000	16106.393946	29120.127
rna-XM_001347370.1	696	446.000	15602.341142	3556.000
sort: write failed: 'standard output': Broken pipe
sort: write error
rna-XM_001347680.1	285	38.947	159170.317502	3528.000
rna-XM_001349503.1	321	71.167	98883.838238	4005.000
rna-XM_002808697.2	918	668.000	35730.020232	13583.324
rna-XM_001348563.3	1110	860.000	26653.195632	13045.000
rna-XM_001347701.1	312	62.441	24116.517322	857.000
rna-XR_002273085.2	3788	3538.000	22670.427542	45647.173
rna-XM_001347268.1	324	74.118	18847.259208	795.000
rna
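
The same ranking can be done in Python, which makes it easier to reuse the result later. The sketch below is one possible approach using only the standard library; `top_by_tpm` is a hypothetical helper, and the example rows are a few lines copied from the output, deliberately placed out of order.

```python
import csv
import io

def top_by_tpm(quant_text, n=10):
    """Return the n (Name, TPM) pairs with the highest TPM from quant.sf text."""
    reader = csv.DictReader(io.StringIO(quant_text), delimiter="\t")
    rows = [(row["Name"], float(row["TPM"])) for row in reader]
    return sorted(rows, key=lambda pair: pair[1], reverse=True)[:n]

# A few rows copied from the output, deliberately out of order.
example = (
    "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
    "rna-XM_961070.1\t399\t149.000\t27514.437041\t2095.000\n"
    "rna-XM_001347701.1\t312\t62.441\t62647.794533\t1999.000\n"
    "rna-XM_001348736.3\t1014\t764.000\t45909.700034\t17924.000\n"
)
for name, tpm in top_by_tpm(example, n=2):
    print(name, tpm)
```

Unlike the `sort | head` pipeline, this reads the header row explicitly and does not produce the harmless "Broken pipe" message.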

Tutorial 1B repeats this report for a second group of samples (the double lysogen). This dataset has no second group, so instead the cell below loops over `accs.txt` to report the top 10 for every run in your list, which is convenient if you add more time points.


In [None]:
!head -n 1 data/quants/SRR11784387_quant/quant.sf
!for x in `cat accs.txt`; do echo "$x"; sort -nrk 4,4 data/quants/"$x"_quant/quant.sf | head -10; done

### STEP 12: Report the expression of a transcript of interest across the time points
In Tutorial 1B, this step looked up a putative acyl-ACP desaturase (BB28_RS16545) that was reported as downregulated in the study behind that dataset. That identifier does not occur in the P. falciparum reference used here, so pick a transcript from the top 10 lists above instead. The example below uses `rna-XM_001347701.1`, which appears near the top in more than one sample.

Use `grep` to report the expression in each sample. The fields in the Salmon `quant.sf` file are as follows. The level of expression is reported in the Transcripts Per Million (`TPM`) and number of reads (`NumReads`) fields:  
`Name    Length  EffectiveLength TPM     NumReads`

In [None]:
!grep 'XM_001347701.1' data/quants/SRR11784387_quant/quant.sf
!grep 'XM_001347701.1' data/quants/SRR11784398_quant/quant.sf
!grep 'XM_001347701.1' data/quants/SRR11784410_quant/quant.sf

### STEP 13: Combine Genecounts to a Single Genecount File
Commonly, the readcounts for each sample are combined into a single table, where the rows contain the gene ID, and the columns identify the sample.

In [None]:
##first merge salmon files by number of reads.
!salmon quantmerge --column numreads --quants data/quants/*_quant -o data/quants/merged_quants.txt
##optionally we can rename the columns (the order follows the sorted directory names)
!sed -i "1s/.*/Name\tSRR11784387\tSRR11784398\tSRR11784410/" data/quants/merged_quants.txt

##for further formatting, it may be easier in our r-code to later merge
##if we remove the gene- and rna- prefix
!sed -i "s/gene-//" data/quants/merged_quants.txt
!sed -i "s/rna-//" data/quants/merged_quants.txt

print("An example of a combined genecount outputfile.")
!head data/quants/merged_quants.txt
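
To make the shape of the merged table concrete, here is a small Python sketch that combines the `NumReads` column from several `quant.sf` texts. It is only an illustration of the resulting layout, not a description of how `salmon quantmerge` works internally; `merge_numreads` is a hypothetical helper and the toy values are made up.

```python
import csv
import io

def merge_numreads(quants):
    """Combine NumReads from several quant.sf texts into one table.

    `quants` maps sample name -> quant.sf content. Returns a header row
    followed by one row per transcript, with missing values left empty.
    """
    samples = list(quants)
    table = {}
    for sample, text in quants.items():
        for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
            table.setdefault(row["Name"], {})[sample] = row["NumReads"]
    merged = [["Name"] + samples]
    for name in sorted(table):
        merged.append([name] + [table[name].get(s, "") for s in samples])
    return merged

# Made-up toy inputs, for illustration only.
HEADER = "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
toy = {
    "sampleA": HEADER + "t1\t100\t50.0\t10.0\t5.000\nt2\t200\t150.0\t20.0\t7.000\n",
    "sampleB": HEADER + "t1\t100\t50.0\t11.0\t6.000\n",
}
for row in merge_numreads(toy):
    print("\t".join(row))
```

Each row is one transcript and each column after `Name` is one sample, which is the layout the downstream R workflow expects to merge with gene names.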

## <a name="workflow">Additional Workflows</a>

Now that you have read counts per gene, feel free to explore the R workflow which creates plots and analyses using these readcount files, or try other alternate workflows for creating read count files, such as using snakemake.


[Workflow One:](Tutorial_1.ipynb) A short introduction to downloading and mapping sequences to a transcriptome using Trimmomatic and Salmon. Here is a link to the YouTube video demonstrating the tutorial: <https://youtu.be/ChGfBR4do_Y>.

[Workflow One (Extended):](Tutorial_1B_Extended.ipynb) An extended version of workflow one. Once you have got your feet wet, you can retry workflow one with this extended version that covers the entire dataset, and includes elaboration such as using SRA tools for sequence downloading, and examples of running batches of fastq files through the pipeline. This workflow may take around an hour to run.

[Workflow One (Using Snakemake):](Tutorial_2_Snakemake.ipynb) Using snakemake to run workflow one.

[Workflow Two (DEG Analysis):](Tutorial_3_DEG_Analysis.ipynb) Using DESeq2 and R to conduct clustering and differential gene expression analysis.


![RNA-Seq workflow](images/RNA-Seq_Notebook_Homepage.png)