# Salmon pipeline

The pipeline presented in this notebook will require the following:

1. __salmon__ for alignment-free transcript abundance quantification
2. __MultiQC__ to generate a QC report and identify low-quality samples to exclude from downstream analysis
3. __tximport__ to aggregate transcript counts and produce gene-level count matrices and normalizing offsets
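
Before running the cells below, it can help to confirm that the command-line tools are installed. The following is a minimal sketch, assuming the executables are called `salmon`, `multiqc`, and `Rscript` on the `PATH` (tximport is an R package, so only the R launcher is checked). `find_missing_tools` is a hypothetical helper, not part of the notebook's `tools` package.

```python
import shutil

def find_missing_tools(tools, which=shutil.which):
    """Return the names in `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]

# Assumed executable names; adjust them to match the local installation.
REQUIRED_TOOLS = ['salmon', 'multiqc', 'Rscript']

missing = find_missing_tools(REQUIRED_TOOLS)
if missing:
    print('Missing tools:', ', '.join(missing))
else:
    print('All required tools found')
```

The `which` parameter only exists so the lookup can be replaced in tests; in normal use the default `shutil.which` is enough.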

In [1]:
# Import required libraries
import os, IPython, re
import pandas as pd
# Local helper modules from the tools/ folder (progress bar, download and command utilities)
import tools.progressbar as pg
import tools.utilities as utils

In [2]:
#Select sound for audio alerts
sound_file = 'tools/test.wav'

## Quantify transcripts using salmon

* Download reference transcripts

In [3]:
# Create the folder that stores the reference transcriptome, the index and the quantification output
QUANTDIR = 'quant'
os.makedirs(QUANTDIR, exist_ok=True)  # no error if the directory already exists

In [4]:
# Download the transcript file
transcript_file = os.path.join(os.getcwd(),QUANTDIR,'gencode.v30.transcripts.fa.gz')
utils.download_ftp('ftp.ebi.ac.uk', 'pub/databases/gencode/Gencode_human/release_30/gencode.v30.transcripts.fa.gz', transcript_file)

/Users/Jb_Macbook/Documents/GitHub/RNAseq/pipelines/quant/gencode.v30.transcripts.fa.gz already downloaded
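
`utils.download_ftp` comes from this repository's `tools` package and its implementation is not shown here. The following is a minimal sketch of what such a helper might look like, assuming anonymous FTP access. The `.part` temporary suffix, the `ftp_factory` parameter, and the return value are illustrative choices, not the notebook's actual code.

```python
import ftplib
import os

def download_ftp(host, remote_path, local_path, ftp_factory=ftplib.FTP):
    """Download remote_path from an anonymous FTP server unless local_path exists.

    Returns True if a download happened, False if the file was already present.
    """
    if os.path.exists(local_path):
        print(local_path, 'already downloaded')
        return False
    # Write to a temporary name first so an interrupted download
    # is never mistaken for a complete file on the next run.
    tmp_path = local_path + '.part'
    with ftp_factory(host) as ftp, open(tmp_path, 'wb') as out:
        ftp.login()  # anonymous login
        ftp.retrbinary(f'RETR {remote_path}', out.write)
    os.replace(tmp_path, local_path)
    return True
```

Downloading to a temporary file matters here because the existence check is the only thing that decides whether the download is skipped.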


* Create the salmon index (see the [salmon documentation](https://salmon.readthedocs.io/en/latest/salmon.html))

In [5]:
transcript_index = os.path.join(QUANTDIR,'salmon_gencode.v30_quasi_index')
if not os.path.exists(transcript_index):
    ## Create salmon index (make sure to have sufficient disk space and memory)
    utils.run_command(f'salmon index -t {transcript_file} -i {transcript_index} -k 31 --perfectHash')
    # Sound alert when done; display() is needed because this call is not the last expression of the cell
    IPython.display.display(IPython.display.Audio(sound_file, autoplay=True))
else:
    print(transcript_index, 'already created')

quant/salmon_gencode.v30_quasi_index already created
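
`utils.run_command` is another helper from the `tools` package whose code is not shown. The following is a minimal sketch of a comparable wrapper around `subprocess`, assuming the command is a single string without shell features such as pipes or redirection. The error handling and the returned text are assumptions made for illustration.

```python
import shlex
import subprocess

def run_command(command):
    """Run a command line, echo its output as it arrives, and raise on failure."""
    process = subprocess.Popen(
        shlex.split(command),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr so log messages are shown too
        text=True,
    )
    lines = []
    for line in process.stdout:
        print(line, end='')
        lines.append(line)
    return_code = process.wait()
    if return_code != 0:
        # Stop the pipeline instead of silently continuing after a failed step.
        raise subprocess.CalledProcessError(return_code, command)
    return ''.join(lines)
```

Raising on a non-zero exit code means a failed `salmon index` call stops the cell rather than leaving a partial index that later cells would treat as complete.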


* Quantify transcripts

In [6]:
########## alignment-free transcript abundance quantification ##########
FASTQDIR = 'data/fastq'
OUTDIR = os.path.join(QUANTDIR,'salmon_output')
single_samples = [f.split('.')[0] for f in os.listdir(FASTQDIR) if f.endswith('.sra.fastq')]
paired_samples = [f.split('.')[0] for f in os.listdir(FASTQDIR) if f.endswith('.sra_1.fastq')]

## Quantify single-end sequencing reads
for sample in pg.log_progress(single_samples):
    fastq = os.path.join(FASTQDIR,sample+'.sra.fastq')
    output = os.path.join(OUTDIR,sample)
    if not os.path.exists(os.path.join(output,'quant.sf')):
        print('Processing sample',sample)
        utils.run_command(f'salmon quant --index {transcript_index} --libType A -r {fastq} --threads 4 --validateMappings --output {output}')


## Quantify paired-end sequencing reads
for sample in pg.log_progress(paired_samples):
    fastq1 = os.path.join(FASTQDIR,sample+'.sra_1.fastq')
    fastq2 = os.path.join(FASTQDIR,sample+'.sra_2.fastq')
    output = os.path.join(OUTDIR,sample)
    if not os.path.exists(os.path.join(output,'quant.sf')):
        print('Processing sample',sample)
        utils.run_command(f'salmon quant --index {transcript_index} --libType A -1 {fastq1} -2 {fastq2} --threads 4 --validateMappings --output {output}')

print('Transcript abundance quantification done')
#sound alert when done
IPython.display.Audio(sound_file, autoplay=True)

VBox(children=(HTML(value=''), IntProgress(value=0, max=14)))

Processing sample SRR6231083
### salmon (mapping-based) v0.13.1
### [ program ] => salmon 
### [ command ] => quant 
### [ index ] => { quant/salmon_gencode.v30_quasi_index }
### [ libType ] => { A }
### [ unmatedReads ] => { data/fastq/SRR6231083.sra.fastq }
### [ threads ] => { 4 }
### [ validateMappings ] => { }
### [ output ] => { quant/salmon_output/SRR6231083 }
Version Info: This is the most recent version of salmon.
Logs will be written to quant/salmon_output/SRR6231083/logs
[2019-05-13 20:20:30.273] [jointLog] [info] Fragment incompatibility prior below threshold.  Incompatible fragments will be ignored.
[2019-05-13 20:20:30.273] [jointLog] [info] Usage of --validateMappings implies use of minScoreFraction. Since not explicitly specified, it is being set to 0.65
[2019-05-13 20:20:30.273] [jointLog] [info] Usage of --validateMappings, without --hardFilter implies use of range factorization. rangeFactorizationBins is being set to 4
[2019-05-13 20:20:30.273] [jointLog] [info] Usag

VBox(children=(HTML(value=''), IntProgress(value=0, max=0)))

Transcript abundance quantification done
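
The paired-end loop assumes that every `.sra_1.fastq` file has a matching `.sra_2.fastq` file. The following sketch shows one way to check this before launching salmon. `classify_fastq` is a hypothetical helper and the sample IDs are made up for the example.

```python
def classify_fastq(filenames):
    """Split FASTQ file names into single-end, paired-end and incomplete sample IDs."""
    names = set(filenames)
    single = sorted(f[:-len('.sra.fastq')] for f in names if f.endswith('.sra.fastq'))
    paired, incomplete = [], []
    for f in sorted(names):
        if f.endswith('.sra_1.fastq'):
            sample = f[:-len('.sra_1.fastq')]
            mate = sample + '.sra_2.fastq'
            # A sample is only usable as paired-end if its second mate file exists.
            (paired if mate in names else incomplete).append(sample)
    return single, paired, incomplete

single, paired, incomplete = classify_fastq([
    'SRR0000001.sra.fastq',    # hypothetical single-end sample
    'SRR0000002.sra_1.fastq',  # hypothetical paired-end sample
    'SRR0000002.sra_2.fastq',
    'SRR0000003.sra_1.fastq',  # second mate file is missing
])
print(single, paired, incomplete)
# ['SRR0000001'] ['SRR0000002'] ['SRR0000003']
```

In the notebook, the same function could be called with `os.listdir(FASTQDIR)` to list samples whose second mate file is missing.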


## QC
* Check the quality of the raw reads (FASTQ files) and the mapping rate reported by salmon to decide which samples need to be excluded

In [7]:
# Examine the QC report generated by MultiQC to evaluate the quality of the data
!multiqc . -f --outdir data
IPython.display.IFrame('data/multiqc_report.html', width=800, height=350)

  configs = yaml.load(f)
  sp = yaml.load(f)
[INFO   ]         multiqc : This is MultiQC v1.7
[INFO   ]         multiqc : Template    : default
[INFO   ]         multiqc : Searching '.'
Searching 230 files..  [####################################]  100%
[INFO   ]          salmon : Found 14 meta reports
[INFO   ]          salmon : Found 14 fragment length distributions
[INFO   ]        kallisto : Found 1 reports
[INFO   ]          fastqc : Found 13 reports
[INFO   ]         multiqc : Compressing plot data
[INFO   ]         multiqc : Report      : data/multiqc_report.html
[INFO   ]         multiqc : Data        : data/multiqc_data
[INFO   ]         multiqc : MultiQC complete


There was no apparent adapter contamination, and base quality was good throughout the reads for all retained samples, so we decided not to perform adapter or quality trimming.
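
The mapping rate can also be read programmatically instead of from the HTML report. The following is a minimal sketch, assuming each sample folder contains the `aux_info/meta_info.json` file that salmon writes, with a `percent_mapped` field. The 50% cutoff is an arbitrary placeholder rather than a recommendation, and both function names are hypothetical.

```python
import json
import os

def mapping_rates(outdir):
    """Return {sample: percent_mapped} read from each sample's meta_info.json."""
    rates = {}
    for sample in sorted(os.listdir(outdir)):
        meta_path = os.path.join(outdir, sample, 'aux_info', 'meta_info.json')
        if os.path.isfile(meta_path):
            with open(meta_path) as handle:
                rates[sample] = json.load(handle)['percent_mapped']
    return rates

def low_mapping_samples(rates, threshold=50.0):
    """Return the samples whose mapping rate is below `threshold` percent."""
    return sorted(sample for sample, rate in rates.items() if rate < threshold)

# Example usage with the notebook's output folder:
# rates = mapping_rates(os.path.join('quant', 'salmon_output'))
# print(low_mapping_samples(rates))
```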

## Pre-process quantification results for downstream analysis
* Keep only transcript name in the quant files and trim other identifiers:

In [8]:
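def trim_ids(file):
    """Keep only the transcript ID in each row by dropping the '|'-delimited fields."""
    temp = file + '.tmp'
    # 'w' overwrites a temp file left behind by an interrupted run ('x' would raise)
    with open(file, 'r') as f, open(temp, 'w') as t:
        for line in f:
            t.write(re.sub(r'\|.*\|', '', line))
    # Replace the original file with the trimmed copy in a single step
    os.replace(temp, file)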

In [9]:
for sample in [s for s in os.listdir(OUTDIR) if s.startswith('SRR')]:
    file = os.path.join(OUTDIR,sample,'quant.sf')
    trim_ids(file)
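
To make the effect of the regular expression concrete, the sketch below applies it to a single row. The transcript name follows the GENCODE header layout, but the numeric columns are made up for illustration.

```python
import re

# Illustrative quant.sf row; the numeric columns are invented for the example.
line = ('ENST00000456328.2|ENSG00000223972.5|OTTHUMG00000000961.2|'
        'OTTHUMT00000362751.1|DDX11L1-202|DDX11L1|1657|processed_transcript|'
        '\t1657\t1466.0\t0.0\t0.0\n')

# The greedy pattern removes everything from the first '|' to the last '|'.
trimmed = re.sub(r'\|.*\|', '', line)
print(repr(trimmed))
# 'ENST00000456328.2\t1657\t1466.0\t0.0\t0.0\n'
```

Lines without a `|` are returned unchanged, so running the trimming step a second time on the same file is harmless.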

## Aggregate transcript counts using tximport
* This step is performed in the R script 'tx2gene_salmon.R'

In [10]:
#Download gtf_file
gtf_file = os.path.join(QUANTDIR,'gencode.v30.annotation.gtf.gz')
utils.download_ftp('ftp.ebi.ac.uk', '/pub/databases/gencode/Gencode_human/release_30/gencode.v30.annotation.gtf.gz', gtf_file)

*** Downloading quant/gencode.v30.annotation.gtf.gz (size 40122079 MB) ***
[████████████████████████████████████████████████████████████████████████████████████████████████████] 100 %



In [12]:
utils.run_command(f'Rscript tx2gene_salmon.R data/ {gtf_file} {QUANTDIR}')

Setting WORKDIR to: /Users/Jb_Macbook/Documents/GitHub/RNAseq/pipelines/data/ 
Loading required package: lmtest
Loading required package: zoo

Attaching package: ‘zoo’

The following objects are masked from ‘package:base’:

    as.Date, as.Date.numeric

Parsed with column specification:
cols(
  TXNAME = col_character(),
  GENEID = col_character()
)
# A tibble: 6 x 2
  TXNAME            GENEID           
  <chr>             <chr>            
1 ENST00000456328.2 ENSG00000223972.5
2 ENST00000450305.2 ENSG00000223972.5
3 ENST00000473358.1 ENSG00000243485.5
4 ENST00000469289.1 ENSG00000243485.5
5 ENST00000607096.1 ENSG00000284332.1
6 ENST00000606857.1 ENSG00000268020.3
reading in files with read_tsv
1 2 3 4 5 6 7 8 9 10 11 12 13 
summarizing abundance
summarizing counts
summarizing length
reading in files with read_tsv
1 2 3 4 5 6 7 8 9 10 11 12 13 
summarizing abundance
summarizing counts
summarizing length
reading in files with read_tsv
1 2 3 4 5 6 7 8 9 10
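
The aggregation itself happens in R, but the idea can be illustrated in pandas. The following is a simplified sketch of gene-level summarization for one sample, using toy values and hypothetical IDs: counts and abundances are summed per gene, and the gene length is the abundance-weighted mean of the transcript lengths. It does not reproduce everything tximport does (for example, its handling of genes with zero abundance).

```python
import pandas as pd

# Toy transcript-level table with hypothetical IDs and values.
tx = pd.DataFrame({
    'TXNAME': ['tx1', 'tx2', 'tx3'],
    'counts': [10.0, 30.0, 5.0],
    'abundance': [1.0, 3.0, 2.0],       # TPM
    'length': [1000.0, 2000.0, 500.0],  # effective length
})
tx2gene = pd.DataFrame({
    'TXNAME': ['tx1', 'tx2', 'tx3'],
    'GENEID': ['geneA', 'geneA', 'geneB'],
})

merged = tx.merge(tx2gene, on='TXNAME')
merged['weighted_length'] = merged['abundance'] * merged['length']

# Sum counts and abundances per gene, then derive the weighted mean length.
gene = merged.groupby('GENEID')[['counts', 'abundance', 'weighted_length']].sum()
gene['length'] = gene['weighted_length'] / gene['abundance']
gene = gene.drop(columns='weighted_length')
print(gene)
```

For `geneA` this gives 40 counts, a TPM of 4, and a length of (1 × 1000 + 3 × 2000) / 4 = 1750.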

Check the results and export the gene-level abundances (TPM), counts, and lengths as separate matrices.

In [12]:
## Load the expression matrix
txi = pd.read_csv(os.path.join(os.getcwd(),'data', 'txi.csv'), index_col=0)
print(txi.shape)
txi.head(3)

(58434, 43)


Unnamed: 0,abundance.SRR6231076,abundance.SRR6231077,abundance.SRR6231078,abundance.SRR6231079,abundance.SRR6231080,abundance.SRR6231081,abundance.SRR6231082,abundance.SRR6231083,abundance.SRR6231084,abundance.SRR6231085,...,length.SRR6231081,length.SRR6231082,length.SRR6231083,length.SRR6231084,length.SRR6231085,length.SRR6231086,length.SRR6231087,length.SRR6231088,length.SRR6231089,countsFromAbundance
ENSG00000000003.14,0.904747,1.728894,2.916382,1.672736,1.783212,0.839226,2.240565,0.740904,2.093233,0.184065,...,2267.286693,1631.755105,2116.087129,2106.547427,3547.0,2008.672212,3547.0,3547.0,1375.863712,no
ENSG00000000005.6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,624.5,624.5,624.5,624.5,624.5,624.5,624.5,624.5,624.5,no
ENSG00000000419.12,49.58387,42.782853,70.89121,45.358392,61.097149,49.674284,42.417455,49.326152,43.877174,53.120665,...,740.18289,659.593369,689.528982,752.280808,704.72891,675.620844,657.309811,634.535054,655.377462,no


Here we add a column with trimmed Ensembl IDs (the version suffix `.XX` is removed) and check for possible duplicates.

In [13]:
txi['ENSEMBL'] = [ID.split('.')[0] for ID in txi.index]
txi[txi.duplicated(subset='ENSEMBL', keep=False)].head(6)

Unnamed: 0,abundance.SRR6231076,abundance.SRR6231077,abundance.SRR6231078,abundance.SRR6231079,abundance.SRR6231080,abundance.SRR6231081,abundance.SRR6231082,abundance.SRR6231083,abundance.SRR6231084,abundance.SRR6231085,...,length.SRR6231082,length.SRR6231083,length.SRR6231084,length.SRR6231085,length.SRR6231086,length.SRR6231087,length.SRR6231088,length.SRR6231089,countsFromAbundance,ENSEMBL


No duplicates
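
For reference, the sketch below shows what the same check returns when stripping the version does create duplicates. The gene IDs are hypothetical and chosen only to trigger the case.

```python
import pandas as pd

# Toy index with hypothetical versioned IDs; the first two share a base ID.
toy = pd.DataFrame(
    {'abundance.S1': [1.0, 2.0, 3.0]},
    index=['ENSG00000000001.1', 'ENSG00000000001.2', 'ENSG00000000002.5'],
)
toy['ENSEMBL'] = [gene_id.split('.')[0] for gene_id in toy.index]

# keep=False marks every member of a duplicated group, not just the repeats.
duplicates = toy[toy.duplicated(subset='ENSEMBL', keep=False)]
print(duplicates.index.tolist())
# ['ENSG00000000001.1', 'ENSG00000000001.2']
```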

In [14]:
#Drop duplicates
#txi = txi.drop_duplicates(subset='ENSEMBL', keep='first').reset_index(drop=True).set_index('ENSEMBL')
#print(txi.shape)
#txi.head(3)

In [15]:
# TPM corresponds to the abundance calculated by salmon/tximport
TPM = txi[[col for col in txi.columns if col.startswith("abundance.")]]
#remove prefix "abundance."
TPM.columns = [col.split('.')[1] for col in TPM.columns]
print(TPM.shape)
TPM.to_csv(os.path.join(os.getcwd(),'data', 'TPM.csv'))
TPM.head(3)

(58434, 14)


Unnamed: 0,SRR6231076,SRR6231077,SRR6231078,SRR6231079,SRR6231080,SRR6231081,SRR6231082,SRR6231083,SRR6231084,SRR6231085,SRR6231086,SRR6231087,SRR6231088,SRR6231089
ENSG00000000003.14,0.904747,1.728894,2.916382,1.672736,1.783212,0.839226,2.240565,0.740904,2.093233,0.184065,1.386044,0.135562,0.098742,2.971101
ENSG00000000005.6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ENSG00000000419.12,49.58387,42.782853,70.89121,45.358392,61.097149,49.674284,42.417455,49.326152,43.877174,53.120665,50.326895,34.555001,50.109094,69.003028


In [16]:
counts = txi[[col for col in txi.columns if col.startswith("counts.")]]
#remove prefix "counts."
counts.columns = [col.split('.')[1] for col in counts.columns]
print(counts.shape)
counts.to_csv(os.path.join(os.getcwd(),'data', 'counts.csv'))
counts.head(3)

(58434, 14)


Unnamed: 0,SRR6231076,SRR6231077,SRR6231078,SRR6231079,SRR6231080,SRR6231081,SRR6231082,SRR6231083,SRR6231084,SRR6231085,SRR6231086,SRR6231087,SRR6231088,SRR6231089
ENSG00000000003.14,31.812,60.686,121.274,34.846,69.902,25.15,66.644,22.725,48.893,10.447,49.374,9.59,7.226,80.811
ENSG00000000005.6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ENSG00000000419.12,574.999,565.0,1041.999,516.999,789.0,486.0,510.001,493.0,366.001,598.999,603.0,453.0,656.0,894.0


In [17]:
lengths = txi[[col for col in txi.columns if col.startswith("length.")]]
#remove prefix "length."
lengths.columns = [col.split('.')[1] for col in lengths.columns]
print(lengths.shape)
lengths.to_csv(os.path.join(os.getcwd(),'data', 'lengths.csv'))
lengths.head(3)

(58434, 14)


Unnamed: 0,SRR6231076,SRR6231077,SRR6231078,SRR6231079,SRR6231080,SRR6231081,SRR6231082,SRR6231083,SRR6231084,SRR6231085,SRR6231086,SRR6231087,SRR6231088,SRR6231089
ENSG00000000003.14,2076.907223,1807.147185,2014.507826,1320.160816,2139.434153,2267.286693,1631.755105,2116.087129,2106.547427,3547.0,2008.672212,3547.0,3547.0,1375.863712
ENSG00000000005.6,624.5,624.5,624.5,624.5,624.5,624.5,624.5,624.5,624.5,624.5,624.5,624.5,624.5,624.5
ENSG00000000419.12,684.9782,679.919878,712.065908,722.332479,704.799894,740.18289,659.593369,689.528982,752.280808,704.72891,675.620844,657.309811,634.535054,655.377462
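
The three export cells follow the same pattern, so they could be folded into one function. The following is a minimal sketch on a toy table. `split_by_prefix` is a hypothetical helper and the values are made up.

```python
import pandas as pd

def split_by_prefix(table, prefix):
    """Return the columns of `table` that start with `prefix`, with the prefix removed."""
    selected = table.loc[:, [c for c in table.columns if c.startswith(prefix)]].copy()
    selected.columns = [c[len(prefix):] for c in selected.columns]
    return selected

toy = pd.DataFrame({
    'abundance.S1': [1.0], 'abundance.S2': [2.0],
    'counts.S1': [10.0], 'counts.S2': [20.0],
    'length.S1': [100.0], 'length.S2': [200.0],
    'countsFromAbundance': ['no'],
}, index=['geneA'])

tpm = split_by_prefix(toy, 'abundance.')
counts = split_by_prefix(toy, 'counts.')
lengths = split_by_prefix(toy, 'length.')
print(counts.columns.tolist())
# ['S1', 'S2']
```

The trailing dot in the prefix matters: `'counts.'` excludes the `countsFromAbundance` column, which `'counts'` alone would also match.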


For use with limma-voom, export the counts calculated with the `lengthScaledTPM` method. From the [tximport vignette](https://bioc.ism.ac.jp/packages/3.4/bioc/vignettes/tximport/inst/doc/tximport.html): "limma-voom does not use the offset matrix stored in y$offset, so we recommend using the scaled counts generated from abundances, either 'scaledTPM' or 'lengthScaledTPM' "
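
As a rough illustration of what `lengthScaledTPM` means, the sketch below reproduces the scaling on a toy matrix: each gene's TPM is multiplied by its length averaged over samples, and each sample column is then rescaled to its original total count. This is a NumPy approximation of the calculation on made-up numbers, not the tximport code itself.

```python
import numpy as np

def length_scaled_tpm(abundance, length, counts):
    """Scale TPM by mean gene length, then match each sample's original library size."""
    abundance = np.asarray(abundance, dtype=float)
    length = np.asarray(length, dtype=float)
    counts = np.asarray(counts, dtype=float)
    # Weight each gene's TPM by its length averaged across samples (rows = genes).
    scaled = abundance * length.mean(axis=1, keepdims=True)
    # Rescale every column so the sample keeps its original total count.
    return scaled * (counts.sum(axis=0) / scaled.sum(axis=0))

# Toy data: 2 genes (rows) x 2 samples (columns).
abundance = [[1.0, 2.0], [3.0, 2.0]]
length = [[100.0, 300.0], [1000.0, 1000.0]]
counts = [[10.0, 40.0], [90.0, 60.0]]

scaled_counts = length_scaled_tpm(abundance, length, counts)
print(scaled_counts.sum(axis=0))  # column totals match those of `counts`
```

Because the length factor is the same for all samples of a gene, differences in transcript usage between samples no longer enter through a per-sample offset, which is why this form suits tools that ignore offsets.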

In [18]:
## Load the expression matrix
txi_lengthScaledTPM = pd.read_csv(os.path.join(os.getcwd(),'data', 'txi_lengthScaledTPM.csv'), index_col=0)
print(txi_lengthScaledTPM.shape)

(58434, 43)


In [19]:
# Add trimmed ENSEMBL IDs and drop duplicates
txi_lengthScaledTPM['ENSEMBL'] = [ID.split('.')[0] for ID in txi_lengthScaledTPM.index]
txi_lengthScaledTPM = txi_lengthScaledTPM.drop_duplicates(subset='ENSEMBL', keep='first').reset_index(drop=True).set_index('ENSEMBL')
print(txi_lengthScaledTPM.shape)
txi_lengthScaledTPM.head(3)

(58434, 43)


ENSEMBL,abundance.SRR6231076,abundance.SRR6231077,abundance.SRR6231078,abundance.SRR6231079,abundance.SRR6231080,abundance.SRR6231081,abundance.SRR6231082,abundance.SRR6231083,abundance.SRR6231084,abundance.SRR6231085,...,length.SRR6231081,length.SRR6231082,length.SRR6231083,length.SRR6231084,length.SRR6231085,length.SRR6231086,length.SRR6231087,length.SRR6231088,length.SRR6231089,countsFromAbundance
ENSG00000000003,0.904747,1.728894,2.916382,1.672736,1.783212,0.839226,2.240565,0.740904,2.093233,0.184065,...,2267.286693,1631.755105,2116.087129,2106.547427,3547.0,2008.672212,3547.0,3547.0,1375.863712,lengthScaledTPM
ENSG00000000005,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,624.5,624.5,624.5,624.5,624.5,624.5,624.5,624.5,624.5,lengthScaledTPM
ENSG00000000419,49.58387,42.782853,70.89121,45.358392,61.097149,49.674284,42.417455,49.326152,43.877174,53.120665,...,740.18289,659.593369,689.528982,752.280808,704.72891,675.620844,657.309811,634.535054,655.377462,lengthScaledTPM


In [20]:
counts_lengthScaledTPM = txi_lengthScaledTPM[[col for col in txi_lengthScaledTPM.columns if col.startswith("counts.")]]
#remove prefix "counts."
counts_lengthScaledTPM.columns = [col.split('.')[1] for col in counts_lengthScaledTPM.columns]
print(counts_lengthScaledTPM.shape)
counts_lengthScaledTPM.to_csv(os.path.join(os.getcwd(),'data', 'counts_lengthScaledTPM.csv'))
counts_lengthScaledTPM.head(3)

(58434, 14)


ENSEMBL,SRR6231076,SRR6231077,SRR6231078,SRR6231079,SRR6231080,SRR6231081,SRR6231082,SRR6231083,SRR6231084,SRR6231085,SRR6231086,SRR6231087,SRR6231088,SRR6231089
ENSG00000000003,33.7848,73.529145,130.315007,57.585408,72.726868,24.408716,89.253164,23.65048,50.90397,6.355986,52.170935,5.709193,4.373745,128.361278
ENSG00000000005,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ENSG00000000419,568.489779,558.66174,972.591453,479.436096,765.069502,443.593996,518.798506,483.440794,327.612726,563.200818,581.620598,446.823223,681.48476,915.319682


After successfully completing the steps above, you can start analyzing the processed gene expression matrices.

## References
---

1. Patro R, Duggal G, Love MI, Irizarry RA, Kingsford C. (2017) **Salmon provides fast and bias-aware quantification of transcript expression.** _Nature Methods_, 14, 417-419. PMID:[28263959](https://www.ncbi.nlm.nih.gov/pubmed/28263959)
2. Ewels P, Magnusson M, Lundin S, Käller M. (2016) **MultiQC: summarize analysis results for multiple tools and samples in a single report.** _Bioinformatics_, 32(19), 3047-3048. https://multiqc.info/
3. Soneson C, Love MI, Robinson MD. (2015) **Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences.** _F1000Research_, 4, 1521. http://dx.doi.org/10.12688/f1000research.7563.1