# Kallisto pipeline

The pipeline presented in this notebook will require the following:

1. __kallisto__ for alignment-free transcript abundance quantification
2. __multiQC__ to generate QC report and check if there are low quality samples to exclude from downstream analysis
3. __tximport__ to aggregate transcript counts and produce gene-level count matrices and normalizing offsets

In [1]:
#Imports required libraries
import os, IPython, re
import pandas as pd
pd.set_option('display.max_columns', 500)
import numpy as np
from functools import reduce
#Other imports 
import tools.progressbar as pg
import tools.utilities as utils

In [2]:
#Select sound for audio alerts
sound_file = 'tools/test.wav'

## Quantify transcripts using kallisto
* Download reference transcripts

In [3]:
# Create folders to store the reference genome
QUANTDIR= 'quant'
try:
    os.makedirs(QUANTDIR)
except FileExistsError:
    # directory already exists
    pass

In [4]:
#dowload transcript file
transcript_file = os.path.join(os.getcwd(),QUANTDIR,'gencode.v30.transcripts.fa.gz')
utils.download_ftp('ftp.ebi.ac.uk', 'pub/databases/gencode/Gencode_human/release_30/gencode.v30.transcripts.fa.gz', transcript_file)

/Users/jeagor/Documents/GitHub/RNAseq/pipelines/quant/gencode.v30.transcripts.fa.gz already downloaded


* Create Kallisto index https://pachterlab.github.io/kallisto/starting

In [5]:
transcript_index= os.path.join(os.getcwd(),QUANTDIR,'kallisto_gencode.v30_quasi_index')
if not os.path.exists(transcript_index):
    ## Create salmon index (make sure to have sufficient disk space and memory)
    utils.run_command(f'kallisto index -i {transcript_index} {transcript_file}')
    #sound alert when done
    IPython.display.Audio(sound_file, autoplay=True)
else:
    print(transcript_index, 'already created')

/Users/jeagor/Documents/GitHub/RNAseq/pipelines/quant/kallisto_gencode.v30_quasi_index already created


* quantify transcripts

In [6]:
#For SINGLE END libraries, you will need to know the average and standard deviation of the FRAGMENT length. These cannot be infered.
#If you do not have this information, use Salmon.
length = 100
std = 30

In [7]:
########## alignment-free transcript abundance quantification ##########
FASTQDIR = 'data/fastq'
OUTDIR = os.path.join(os.getcwd(),QUANTDIR,'kallisto_output')
single_samples = [f.split('.')[0] for f in os.listdir(FASTQDIR) if f.endswith('.sra.fastq.gz')]
paired_samples = [f.split('.')[0] for f in os.listdir(FASTQDIR) if f.endswith('.sra_1.fastq.gz')]

In [7]:
## Align and assemble single-end sequencing reads
for sample in pg.log_progress(single_samples):
    fastq = os.path.join(FASTQDIR,sample+'.sra.fastq.gz')
    output = os.path.join(OUTDIR,sample)
    if not os.path.exists(output):
        os.makedirs(output)
        print('Processing sample',sample)
        #If your reads are single end only you can run kallisto by specifying the --single flag,
        #however you must supply the length and standard deviation of the fragment length (not the read length).
        utils.run_command(f'kallisto quant -i {transcript_index} -o {output} -t {utils.N_CPU} --single -l {length} -s {std} {fastq}')


## Align and assemble paired-end sequencing reads
for sample in pg.log_progress(paired_samples):
    fastq1 = os.path.join(FASTQDIR,sample+'.sra_1.fastq.gz')
    fastq2 = os.path.join(FASTQDIR,sample+'.sra_2.fastq.gz')
    output = os.path.join(OUTDIR,sample)
    if not os.path.exists(output):
        os.makedirs(output)
        print('Processing sample',sample)
        utils.run_command(f'kallisto quant -i {transcript_index} -o {output} -t {utils.N_CPU} {fastq1} {fastq2}')

print('Transcript abundance quantification done')
#sound alert when done
IPython.display.Audio(sound_file, autoplay=True)

VBox(children=(HTML(value=''), IntProgress(value=0, max=14)))

Processing sample SRR6231076

[quant] fragment length distribution is truncated gaussian with mean = 100, sd = 30
[index] k-mer length: 31
[index] number of targets: 208,621
[index] number of k-mers: 130,783,978
[index] number of equivalence classes: 873,012
[quant] running in single-end mode
[quant] will process file 1: data/fastq/SRR6231076.sra.fastq.gz
[quant] finding pseudoalignments for the reads ... done
[quant] processed 7,877,339 reads, 6,549,068 reads pseudoaligned
[   em] quantifying the abundances ... done
[   em] the Expectation-Maximization algorithm ran for 1,077 rounds

Processing sample SRR6231077

[quant] fragment length distribution is truncated gaussian with mean = 100, sd = 30
[index] k-mer length: 31
[index] number of targets: 208,621
[index] number of k-mers: 130,783,978
[index] number of equivalence classes: 873,012
[quant] running in single-end mode
[quant] will process file 1: data/fastq/SRR6231077.sra.fastq.gz
[quant] finding pseudoalignments for the reads ...

VBox(children=(HTML(value=''), IntProgress(value=0, max=0)))

Transcript abundance quantification done


## QC
* Check the quality of the raw reads (fastq files) and %mapping (salmon) to check which samples need to be excluded

In [8]:
#We can examine the QC report generated by multiQC to evaluate the quality of the data
!multiqc . -f --outdir data
IPython.display.IFrame('data/multiqc_report.html', width=800, height=350)

  configs = yaml.load(f)
  sp = yaml.load(f)
[INFO   ]         multiqc : This is MultiQC v1.7
[INFO   ]         multiqc : Template    : default
[INFO   ]         multiqc : Searching '.'
[?25lSearching 307 files..  [####################################]  100%          [?25h
[INFO   ]          salmon : Found 14 meta reports
[INFO   ]          salmon : Found 14 fragment length distributions
[INFO   ]        kallisto : Found 14 reports
[INFO   ]          fastqc : Found 14 reports
[INFO   ]         multiqc : Compressing plot data
[INFO   ]         multiqc : Report      : data/multiqc_report.html
[INFO   ]         multiqc : Data        : data/multiqc_data
[INFO   ]         multiqc : MultiQC complete


Here there did not seem to be adaptor contamination and base quality was good throughout the reads for all retained samples so we decided not to perform adaptor and quality trimmimg. 

## Aggregate transcripts TPM and counts by gene

### Import and merge all quant files from kallisto's output

In [8]:
%%time
quants = []
namelist = None
tx_number = None

#import all quant.sf files
samples = sorted([s for s in os.listdir(OUTDIR) if s.startswith('SRR')])
for sample in samples:
    file = os.path.join(OUTDIR,sample,'abundance.tsv')
    quant_df = pd.read_csv(file, sep=r'\t', engine='python')
    quant_df.rename(columns= {'est_counts':'counts', 'target_id':'IDs'}, inplace= True)
    #check all files have the same list of transcripts
    if namelist is None:
        namelist = quant_df['IDs'].values
    if not all(quant_df['IDs'].values == namelist):
        print(sample, 'has different transcript list')
    if tx_number is None:
        tx_number = len(quant_df)
    if len(quant_df) != tx_number:
        print(sample, 'has different transcript number')

    #drop length column as it won't be used
    quant_df.drop(['length', 'eff_length'], axis= 1, inplace= True)
    #add sample name to columns to facilitate merging
    quant_df.columns = [f'{col}_{sample}' if col != 'IDs' else col for col in quant_df.columns]
    quants += [quant_df]

CPU times: user 16.6 s, sys: 616 ms, total: 17.2 s
Wall time: 17.3 s


In [9]:
%%time
#merge all quant files
merged = reduce(lambda left,right: pd.merge(left,right, on='IDs', how='outer'), quants)
#sort columns alphabetically
merged = merged.reindex(['IDs'] + sorted(merged.columns[1:], key=lambda x: x.lower()), axis=1)
print(merged.shape)
merged.head()

(208621, 29)
CPU times: user 2.43 s, sys: 462 ms, total: 2.89 s
Wall time: 2.89 s


In [10]:
#split all IDs in Name
IDs = merged.IDs.str.split(pat='|', expand=True).drop([6,8], axis=1)
IDs.columns = ['Tx', 'Gene', 'ID2', 'ID3', 'ID4', 'ID5', 'Type']
IDs.head()

Unnamed: 0,Tx,Gene,ID2,ID3,ID4,ID5,Type
0,ENST00000456328.2,ENSG00000223972.5,OTTHUMG00000000961.2,OTTHUMT00000362751.1,DDX11L1-202,DDX11L1,processed_transcript
1,ENST00000450305.2,ENSG00000223972.5,OTTHUMG00000000961.2,OTTHUMT00000002844.2,DDX11L1-201,DDX11L1,transcribed_unprocessed_pseudogene
2,ENST00000488147.1,ENSG00000227232.5,OTTHUMG00000000958.1,OTTHUMT00000002839.1,WASH7P-201,WASH7P,unprocessed_pseudogene
3,ENST00000619216.1,ENSG00000278267.1,-,-,MIR6859-1-201,MIR6859-1,miRNA
4,ENST00000473358.1,ENSG00000243485.5,OTTHUMG00000000959.2,OTTHUMT00000002840.1,MIR1302-2HG-202,MIR1302-2HG,lincRNA


In [11]:
tx_quant = IDs.merge(merged, left_index=True, right_index=True)
tx_quant.head()

Unnamed: 0,Tx,Gene,ID2,ID3,ID4,ID5,Type,IDs,counts_SRR6231076,counts_SRR6231077,counts_SRR6231078,counts_SRR6231079,counts_SRR6231080,counts_SRR6231081,counts_SRR6231082,counts_SRR6231083,counts_SRR6231084,counts_SRR6231085,counts_SRR6231086,counts_SRR6231087,counts_SRR6231088,counts_SRR6231089,tpm_SRR6231076,tpm_SRR6231077,tpm_SRR6231078,tpm_SRR6231079,tpm_SRR6231080,tpm_SRR6231081,tpm_SRR6231082,tpm_SRR6231083,tpm_SRR6231084,tpm_SRR6231085,tpm_SRR6231086,tpm_SRR6231087,tpm_SRR6231088,tpm_SRR6231089
0,ENST00000456328.2,ENSG00000223972.5,OTTHUMG00000000961.2,OTTHUMT00000362751.1,DDX11L1-202,DDX11L1,processed_transcript,ENST00000456328.2|ENSG00000223972.5|OTTHUMG000...,0.0,0.0,0.0,4.96552,0.0,6.03969,1.08636e-07,38.5047,5.20728,0.0,0.0,10.3783,0.0,0.009436,0.0,0.0,0.0,0.552004,0.0,0.811906,1.07629e-08,4.68295,0.833679,0.0,0.0,0.934056,0.0,0.000866
1,ENST00000450305.2,ENSG00000223972.5,OTTHUMG00000000961.2,OTTHUMT00000002844.2,DDX11L1-201,DDX11L1,transcribed_unprocessed_pseudogene,ENST00000450305.2|ENSG00000223972.5|OTTHUMG000...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,ENST00000488147.1,ENSG00000227232.5,OTTHUMG00000000958.1,OTTHUMT00000002839.1,WASH7P-201,WASH7P,unprocessed_pseudogene,ENST00000488147.1|ENSG00000227232.5|OTTHUMG000...,0.0,22.4311,0.0,10.2553,0.0,0.0,0.0,5.87389,6.26873,2.31459,1.01631,0.0,0.0,2.5949,0.0,2.58461,0.0,1.41871,0.0,0.0,0.0,0.88899,1.24892,0.32187,0.130272,0.0,0.0,0.296507
3,ENST00000619216.1,ENSG00000278267.1,-,-,MIR6859-1-201,MIR6859-1,miRNA,ENST00000619216.1|ENSG00000278267.1|-|-|MIR685...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,ENST00000473358.1,ENSG00000243485.5,OTTHUMG00000000959.2,OTTHUMT00000002840.1,MIR1302-2HG-202,MIR1302-2HG,lincRNA,ENST00000473358.1|ENSG00000243485.5|OTTHUMG000...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Aggregate TPM and counts by gene

In [12]:
TPM_cols = [col for col in tx_quant.columns if col.startswith('TPM')]
counts_cols = [col for col in tx_quant.columns if col.startswith('counts')]
agg = tx_quant.loc[:,['Tx','Gene']+TPM_cols+counts_cols].groupby('Gene').sum()
agg.head()

Unnamed: 0_level_0,counts_SRR6231076,counts_SRR6231077,counts_SRR6231078,counts_SRR6231079,counts_SRR6231080,counts_SRR6231081,counts_SRR6231082,counts_SRR6231083,counts_SRR6231084,counts_SRR6231085,counts_SRR6231086,counts_SRR6231087,counts_SRR6231088,counts_SRR6231089
Gene,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
ENSG00000000003.14,32.77868,55.85452,119.70139,29.5717,69.2353,17.36088,65.00494,18.8484,47.40966,3.21189,45.19161,5.52263,5.95529,78.4476
ENSG00000000005.6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ENSG00000000419.12,588.9996,582.9998,1077.9997,529.0004,803.0,493.000186,524.0004,503.0005,370.9997,603.99977,609.0,459.0001,672.9995,905.0001
ENSG00000000457.14,288.547205,333.1185,34.526202,247.39485,224.7059,348.5801,344.524367,118.7598,166.2008,291.477,322.7931,263.784709,266.7482,320.5316
ENSG00000000460.17,3.20626,75.2651,95.51374,129.515608,75.963721,72.83391,42.17847,56.80857,35.071425,111.77762,42.27115,66.20022,120.72019,115.744723


In [13]:
TPM = agg.loc[:,TPM_cols]
TPM.columns = [col.split('_')[1] for col in TPM.columns]
TPM.head().to_csv(os.path.join(QUANTDIR,'kallisto_TPM.csv'))

In [14]:
counts = agg.loc[:,counts_cols]
counts.columns = [col.split('_')[1] for col in counts.columns]
counts.to_csv(os.path.join(QUANTDIR,'kallisto_counts.csv'))

After you completed successfully the above steps, you can start to analyze the processed gene expression matrix

## References
---

1. Nicolas L Bray, Harold Pimentel, Páll Melsted and Lior Pachter, **Near-optimal probabilistic RNA-seq quantification.** (2016)_Nature Biotechnology_ 34, 525–527 doi:10.1038/nbt.3519
2. Ewels **multiQC** https://multiqc.info/