# Analyse kallisto output using [sleuth](https://pachterlab.github.io/sleuth/about)

## Introduction

This analysis is based on the [walkthroughs](https://pachterlab.github.io/sleuth/walkthroughs) from the Pachter lab. We examine in detail how to analyze an RNA-Seq dataset to obtain gene-level and transcript-level differential expression results that are consistent with each other.

## Preliminaries

Requirements:

* `cowplot` for making prettier plots and plots with grids. Available in CRAN: `install.packages('cowplot')`.
* `gridExtra`. Available in CRAN: `install.packages('gridExtra')`.

In [1]:
#load the requisite packages:
suppressMessages({
  library('cowplot')
  library('sleuth')
})

In [2]:
#set the path to quant file
QUANTDIR <- file.path('quant', 'kallisto_output')

## Parsing metadata
A sleuth analysis depends on a metadata file that describes the experimental design: the sample names, conditions and covariates. The metadata file is external to sleuth and must be prepared prior to analysis. A metadata file should have been downloaded along with the kallisto quantifications. The first step in a sleuth analysis is to load this file. You might need to change the path passed to `read.table` below so that it points to the `SraRunTable.txt` in the location where you downloaded the kallisto dataset. We then select the relevant columns of the metadata.

In [3]:
metadata <- read.table(file.path('data','SraRunTable.txt'), sep='\t', header=TRUE, comment.char = '', stringsAsFactors= FALSE)
metadata <- dplyr::select(metadata, c('Run', 'treatment'))
head(metadata, n = 20)

Run,treatment
<chr>,<chr>
SRR8914928,Unstimulated
SRR8914929,Unstimulated
SRR8914930,Unstimulated
SRR8914931,T(anti-CD3/CD28 beads only)
SRR8914932,T(anti-CD3/CD28 beads only)
SRR8914933,T(anti-CD3/CD28 beads only)
SRR8914934,"C(cytokines, IL-12/-15/-18/TL1A)"
SRR8914935,"C(cytokines, IL-12/-15/-18/TL1A)"
SRR8914936,"C(cytokines, IL-12/-15/-18/TL1A)"
SRR8914937,TC(beads+cytokines)


In [4]:
#select the groups that you want to compare
metadata <- metadata[metadata[, 'treatment'] %in% c('Unstimulated', 'TC(beads+cytokines)'),]
head(metadata, n = 20)

,Run,treatment
,<chr>,<chr>
1,SRR8914928,Unstimulated
2,SRR8914929,Unstimulated
3,SRR8914930,Unstimulated
10,SRR8914937,TC(beads+cytokines)
11,SRR8914938,TC(beads+cytokines)
12,SRR8914939,TC(beads+cytokines)


This table describes the experimental design. We now add the paths of the kallisto output files to it, using the SRA run names listed under Run to identify the folders that hold the corresponding kallisto quantifications:

In [5]:
metadata <- dplyr::mutate(metadata,
  path = file.path(QUANTDIR, Run, 'abundance.h5'))
head(metadata)

Run,treatment,path
<chr>,<chr>,<chr>
SRR8914928,Unstimulated,quant/kallisto_output/SRR8914928/abundance.h5
SRR8914929,Unstimulated,quant/kallisto_output/SRR8914929/abundance.h5
SRR8914930,Unstimulated,quant/kallisto_output/SRR8914930/abundance.h5
SRR8914937,TC(beads+cytokines),quant/kallisto_output/SRR8914937/abundance.h5
SRR8914938,TC(beads+cytokines),quant/kallisto_output/SRR8914938/abundance.h5
SRR8914939,TC(beads+cytokines),quant/kallisto_output/SRR8914939/abundance.h5


It is important to spot check the metadata file again to make sure that the kallisto runs correspond to the accession numbers in the table, so that each row is associated with the correct sample.
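Part of this spot check can be automated. The following is a minimal sketch, assuming only that every row of the table carries a `path` column as built above; the helper name `missing_quant_files` and the toy run names are hypothetical, and the demonstration uses a temporary directory rather than the real `quant` folder.

```r
# Hypothetical helper: return the rows of a metadata table whose 'path'
# entry does not point at an existing file.
missing_quant_files <- function(meta, path_col = "path") {
  meta[!file.exists(meta[[path_col]]), , drop = FALSE]
}

# Toy demonstration in a temporary directory (not the real quant/ folder).
quant_dir <- file.path(tempdir(), "kallisto_output_demo")
runs <- c("SRR_A", "SRR_B")  # made-up run names

# Create an (empty) abundance file for the first run only.
dir.create(file.path(quant_dir, "SRR_A"), recursive = TRUE, showWarnings = FALSE)
invisible(file.create(file.path(quant_dir, "SRR_A", "abundance.h5")))

demo <- data.frame(
  Run = runs,
  path = file.path(quant_dir, runs, "abundance.h5"),
  stringsAsFactors = FALSE
)

# Rows listed here would need attention before running sleuth.
missing <- missing_quant_files(demo)
print(missing$Run)
```

A check like this only confirms that the files exist. It cannot tell whether a folder holds the quantification of the right sample, so a manual look at the table is still worthwhile.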

We rename the ‘Run’ column to ‘sample’. ‘sample’ and ‘path’ are the two column names that sleuth needs to find the sample name and the path of the kallisto quantifications.

In [6]:
metadata <- dplyr::rename(metadata, sample = Run)
head(metadata)

sample,treatment,path
<chr>,<chr>,<chr>
SRR8914928,Unstimulated,quant/kallisto_output/SRR8914928/abundance.h5
SRR8914929,Unstimulated,quant/kallisto_output/SRR8914929/abundance.h5
SRR8914930,Unstimulated,quant/kallisto_output/SRR8914930/abundance.h5
SRR8914937,TC(beads+cytokines),quant/kallisto_output/SRR8914937/abundance.h5
SRR8914938,TC(beads+cytokines),quant/kallisto_output/SRR8914938/abundance.h5
SRR8914939,TC(beads+cytokines),quant/kallisto_output/SRR8914939/abundance.h5


## Associating transcripts to genes
The sample quantifications performed by kallisto have produced transcript abundance and count estimates. The steps above tell sleuth where to find them, but sleuth does not yet “know” about genes. To perform gene-level analysis, sleuth needs a transcript-to-gene annotation. Here it is imported from the ttg file (`ttg_kallisto.csv` in the code below) created in the RNA_seq pipeline:



In [7]:
ttg <- read.table(file.path('data', 'ttg_kallisto.csv'), sep=',', header=TRUE, stringsAsFactors= FALSE)
head(ttg)

target_id,ens_gene,ext_gene
<chr>,<chr>,<chr>
ENST00000456328.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000362751.1|DDX11L1-202|DDX11L1|1657|processed_transcript|,ENSG00000223972.5,DDX11L1
ENST00000450305.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000002844.2|DDX11L1-201|DDX11L1|632|transcribed_unprocessed_pseudogene|,ENSG00000223972.5,DDX11L1
ENST00000488147.1|ENSG00000227232.5|OTTHUMG00000000958.1|OTTHUMT00000002839.1|WASH7P-201|WASH7P|1351|unprocessed_pseudogene|,ENSG00000227232.5,WASH7P
ENST00000619216.1|ENSG00000278267.1|-|-|MIR6859-1-201|MIR6859-1|68|miRNA|,ENSG00000278267.1,MIR6859-1
ENST00000473358.1|ENSG00000243485.5|OTTHUMG00000000959.2|OTTHUMT00000002840.1|MIR1302-2HG-202|MIR1302-2HG|712|lincRNA|,ENSG00000243485.5,MIR1302-2HG
ENST00000469289.1|ENSG00000243485.5|OTTHUMG00000000959.2|OTTHUMT00000002841.2|MIR1302-2HG-201|MIR1302-2HG|535|lincRNA|,ENSG00000243485.5,MIR1302-2HG


The resulting table contains Ensembl gene IDs (‘ens_gene’), gene symbols (‘ext_gene’) and the associated transcripts (‘target_id’). Note that the gene-transcript mapping must be compatible with the transcriptome used with kallisto. In other words, to use Ensembl transcript-gene associations, kallisto must have been run using the Ensembl transcriptome, so that the ‘target_id’ values match the kallisto target names exactly.
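In the rows shown above, each `target_id` is a pipe-delimited string whose second field is the gene ID and whose sixth field is the gene symbol. As an illustration only (the pipeline that produced the ttg file may well work differently), a table of the same shape can be derived from such identifiers in base R. The helper name `parse_target_ids` is hypothetical, and the field positions are an assumption read off the example rows.

```r
# Hypothetical helper: split pipe-delimited target IDs into a
# transcript-to-gene table. Field positions (2 = gene ID, 6 = gene symbol)
# are assumed from the example rows and may differ for other annotations.
parse_target_ids <- function(ids) {
  fields <- strsplit(ids, "|", fixed = TRUE)
  data.frame(
    target_id = ids,
    ens_gene = vapply(fields, `[`, character(1), 2),
    ext_gene = vapply(fields, `[`, character(1), 6),
    stringsAsFactors = FALSE
  )
}

# Two identifiers copied from the table above.
ids <- c(
  "ENST00000456328.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000362751.1|DDX11L1-202|DDX11L1|1657|processed_transcript|",
  "ENST00000619216.1|ENSG00000278267.1|-|-|MIR6859-1-201|MIR6859-1|68|miRNA|"
)
ttg_demo <- parse_target_ids(ids)
print(ttg_demo[, c("ens_gene", "ext_gene")])

# Compatibility check: every quantified target should appear in the mapping.
quantified_ids <- ids  # stand-in for the target names in the kallisto output
unmapped <- setdiff(quantified_ids, ttg_demo$target_id)
print(length(unmapped))
```

An empty `unmapped` vector is what one would hope to see. Any identifier listed there has no gene assigned and cannot take part in gene-level aggregation.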

## Preparing the analysis
The next step is to build a sleuth object. The sleuth object contains the specification of the experimental design, a map describing the grouping of transcripts into genes (or other groups), and a number of user-specified parameters. In the example that follows, `metadata` is the experimental design and `target_mapping` is set to `ttg`, the transcript-to-gene table constructed previously. Furthermore, we provide an `aggregation_column`, the name of the column in the `target_mapping` table that is used to aggregate the transcripts. When both `target_mapping` and `aggregation_column` are provided, sleuth automatically runs in gene mode, returning gene differential expression results obtained by aggregating transcript p-values.

In [8]:
so <- sleuth_prep(metadata, target_mapping = ttg,
  aggregation_column = 'ens_gene', extra_bootstrap_summary = TRUE)

reading in kallisto results
dropping unused factor levels
......
normalizing est_counts
59371 targets passed the filter
normalizing tpm
merging in metadata
summarizing bootstraps



## The analysis
Then the full model is fit with

In [9]:
so <- sleuth_fit(so, ~treatment, 'full')

fitting measurement error models
shrinkage estimation
3 NA values were found during variance shrinkage estimation due to mean observation values outside of the range used for the LOESS fit.
The LOESS fit will be repeated using exact computation of the fitted surface to extrapolate the missing values.
These are the target ids with NA values: ENST00000488905.1|ENSG00000094631.19|OTTHUMG00000034496.7|OTTHUMT00000318645.1|HDAC6-230|HDAC6|492|retained_intron|, ENST00000561579.6|ENSG00000159593.15|OTTHUMG00000137513.8|OTTHUMT00000421048.4|NAE1-205|NAE1|728|protein_coding|, ENST00000566745.1|ENSG00000260898.5|OTTHUMG00000173100.1|OTTHUMT00000422137.1|ADPGK-AS1-202|ADPGK-AS1|765|antisense|
computing variance of betas


What this has accomplished is to “smooth” the raw kallisto abundance estimates for each sample using a linear model with a parameter that represents the experimental condition. To test for transcripts that are differentially expressed between the conditions, sleuth performs a second fit to a “reduced” model that presumes abundances are equal in the two conditions. sleuth then identifies differentially expressed transcripts as those with a significantly better fit under the “full” model.

The “reduced” model is fit with

In [10]:
so <- sleuth_fit(so, ~1, 'reduced')

fitting measurement error models
shrinkage estimation
1 NA values were found during variance shrinkage estimation due to mean observation values outside of the range used for the LOESS fit.
The LOESS fit will be repeated using exact computation of the fitted surface to extrapolate the missing values.
These are the target ids with NA values: ENST00000566745.1|ENSG00000260898.5|OTTHUMG00000173100.1|OTTHUMT00000422137.1|ADPGK-AS1-202|ADPGK-AS1|765|antisense|
computing variance of betas


and the test is performed with

In [11]:
#The likelihood ratio test (lrt) is performed with
so <- sleuth_lrt(so, 'reduced', 'full')

In general, sleuth can utilize the likelihood ratio test with any pair of models that are nested, and other walkthroughs illustrate the power of such a framework for accounting for batch effects and more complex experimental designs.
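The idea of comparing nested models can be illustrated outside sleuth. The sketch below is a toy example with made-up values for a single transcript, fitted with ordinary `lm`. sleuth's own models additionally account for measurement error and variance shrinkage, so this shows the principle rather than sleuth's computation.

```r
# Toy data: made-up log-abundances for one transcript in two conditions.
treatment <- factor(rep(c("Unstimulated", "TC"), each = 3))
y <- c(5.1, 4.9, 5.0, 7.9, 8.2, 8.0)

# Full model: abundance depends on the condition.
full <- lm(y ~ treatment)
# Reduced model: a single mean for all samples (nested in the full model).
reduced <- lm(y ~ 1)

# Likelihood ratio statistic and the difference in the number of parameters.
lr_stat <- as.numeric(2 * (logLik(full) - logLik(reduced)))
df <- attr(logLik(full), "df") - attr(logLik(reduced), "df")

# Under the reduced model the statistic is approximately chi-squared.
pval <- pchisq(lr_stat, df = df, lower.tail = FALSE)
print(c(statistic = lr_stat, df = df, pval = pval))
```

The difference in parameter count is 1 here, which matches the `degrees_free` value of 1 that sleuth reports for the two-condition comparison in this analysis.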

The models that have been fit can always be examined with the models() function.

In [12]:
models(so)

[  full  ]
formula:  ~treatment 
data modeled:  obs_counts 
transform sync'ed:  TRUE 
coefficients:
	(Intercept)
 	treatmentUnstimulated
[  reduced  ]
formula:  ~1 
data modeled:  obs_counts 
transform sync'ed:  TRUE 
coefficients:
	(Intercept)
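
The coefficient name `treatmentUnstimulated` in the output above indicates that `TC(beads+cytokines)` acts as the reference level, which follows from R ordering factor levels alphabetically by default. A minimal sketch on a toy data frame (not the sleuth object) shows this behaviour with base R's `model.matrix`, and how `relevel` changes the reference. Whether to relevel the metadata before calling `sleuth_prep` is a choice the walkthrough does not discuss.

```r
# Toy design with the two treatment labels used in this analysis.
design <- data.frame(
  treatment = c(rep("Unstimulated", 3), rep("TC(beads+cytokines)", 3)),
  stringsAsFactors = FALSE
)

# Default: levels are sorted alphabetically, so 'TC(beads+cytokines)'
# becomes the reference and the coefficient is named after 'Unstimulated'.
default_cols <- colnames(model.matrix(~treatment, data = design))
print(default_cols)

# Making 'Unstimulated' the reference flips which level the coefficient names.
design$treatment <- relevel(factor(design$treatment), ref = "Unstimulated")
releveled_cols <- colnames(model.matrix(~treatment, data = design))
print(releveled_cols)
```

The choice of reference does not affect the likelihood ratio test between the two models. It only determines which condition the fitted coefficient is expressed relative to.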


## Obtaining gene-level differential expression results
When ‘sleuth_results’ is run on this object, sleuth aggregates the p-values of the transcripts belonging to each gene to make a gene-level determination and report gene differential expression.
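
To our understanding, sleuth performs this aggregation with the Lancaster method, a weighted generalization of Fisher's method, using weights derived from transcript expression. Treat that as an assumption rather than a description of sleuth's internals. The sketch below implements the method in base R on made-up p-values; the function name `lancaster_pvalue` is hypothetical.

```r
# Hypothetical helper: combine p-values with the Lancaster method.
# Each p-value is mapped to a gamma quantile with shape weight/2 and scale 2;
# the sum is compared against a gamma with shape sum(weights)/2.
lancaster_pvalue <- function(pvals, weights) {
  stopifnot(length(pvals) == length(weights), all(weights > 0))
  stat <- sum(qgamma(pvals, shape = weights / 2, scale = 2, lower.tail = FALSE))
  pgamma(stat, shape = sum(weights) / 2, scale = 2, lower.tail = FALSE)
}

# Made-up transcript p-values for one gene.
tx_pvals <- c(0.01, 0.20, 0.03)

# With all weights equal to 2 the method reduces to Fisher's method.
lanc <- lancaster_pvalue(tx_pvals, weights = rep(2, length(tx_pvals)))
fisher <- pchisq(-2 * sum(log(tx_pvals)), df = 2 * length(tx_pvals),
                 lower.tail = FALSE)
print(c(lancaster = lanc, fisher = fisher))

# Unequal weights let some transcripts contribute more (weights are made up).
weighted <- lancaster_pvalue(tx_pvals, weights = c(10, 1, 5))
print(weighted)
```

A gene with a single transcript keeps that transcript's p-value unchanged, whatever its weight.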

In [13]:
sleuth_table_gene <- sleuth_results(so, 'reduced:full', 'lrt', show_all = FALSE)
sleuth_table_gene <- dplyr::filter(sleuth_table_gene, qval <= 0.05)
#The most significantly differential genes are
head(sleuth_table_gene, 20)

target_id,ext_gene,num_aggregated_transcripts,sum_mean_obs_counts,pval,qval
<chr>,<chr>,<int>,<dbl>,<dbl>,<dbl>
ENSG00000140105.18,WARS,31,83.27013,6.6740330000000005e-52,1.107422e-47
ENSG00000081059.20,TCF7,17,47.69766,4.867035e-40,4.0379349999999995e-36
ENSG00000182199.11,SHMT2,19,59.60593,3.453029e-37,1.909871e-33
ENSG00000100453.13,GZMB,9,35.24479,8.196842e-34,3.4002549999999996e-30
ENSG00000010810.17,FYN,22,89.76911,2.391908e-33,7.937783999999999e-30
ENSG00000078304.19,PPP2R5C,24,93.51351,2.9279070000000003e-31,8.097126e-28
ENSG00000169045.17,HNRNPH1,30,124.59301,1.1173689999999999e-30,2.648644e-27
ENSG00000184640.18,SEPT9,22,72.74082,3.620723e-30,7.509831e-27
ENSG00000183010.16,PYCR1,9,22.44236,6.204286e-30,1.134772e-26
ENSG00000132341.12,RAN,11,56.45409,6.838859e-30,1.134772e-26


The ‘num_aggregated_transcripts’ column lists the number of transcripts used to make the gene determination. ‘pval’ displays the p-value for the gene. ‘qval’ displays the Benjamini-Hochberg-adjusted false discovery rate for the gene.
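
For readers unfamiliar with the Benjamini-Hochberg adjustment, base R's `p.adjust` performs it directly. The p-values below are made up for illustration; the filtering step mirrors the `qval <= 0.05` threshold used above.

```r
# Made-up p-values for five genes.
pvals <- c(0.001, 0.01, 0.02, 0.03, 0.5)

# Benjamini-Hochberg adjustment: each sorted p-value is scaled by n / rank,
# then made monotone from the largest p-value downwards.
qvals <- p.adjust(pvals, method = "BH")
print(qvals)

# Keep the genes that pass a 5% false discovery rate threshold.
significant <- which(qvals <= 0.05)
print(significant)
```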

## Obtaining consistent transcript-level differential expression results
Because gene results are built on transcript results, the gene and transcript results are entirely consistent and compatible with each other. To view the transcript results that led to the gene results above, one merely runs sleuth_results again, this time setting the flag ‘pval_aggregate’ to FALSE.

In [14]:
sleuth_table_tx <- sleuth_results(so, 'reduced:full', 'lrt', show_all = FALSE, pval_aggregate = FALSE)
sleuth_table_tx <- dplyr::filter(sleuth_table_tx, qval <= 0.05)
head(sleuth_table_tx, 20)

target_id,ens_gene,ext_gene,pval,qval,test_stat,rss,degrees_free,mean_obs,var_obs,tech_var,sigma_sq,smooth_sigma_sq,final_sigma_sq
<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
ENST00000253063.4|ENSG00000130766.5|OTTHUMG00000003532.2|OTTHUMT00000009840.2|SESN2-201|SESN2|3462|protein_coding|,ENSG00000130766.5,SESN2,3.109966e-09,2.023716e-05,35.11375,11.377268,1,7.744285,2.275454,0.0007887173,2.274665,0.114599,2.274665
ENST00000360110.9|ENSG00000072274.13|OTTHUMG00000155714.3|OTTHUMT00000341346.2|TFRC-201|TFRC|5012|protein_coding|,ENSG00000072274.13,TFRC,1.711042e-09,2.023716e-05,36.27777,19.051851,1,8.165266,3.81037,0.0010891066,3.809281,0.1169632,3.809281
ENST00000374517.6|ENSG00000136810.13|OTTHUMG00000020480.2|OTTHUMT00000053614.2|TXN-202|TXN|737|protein_coding|,ENSG00000136810.13,TXN,2.928016e-09,2.023716e-05,35.23116,13.393763,1,7.809075,2.678753,0.0008575954,2.677895,0.1147166,2.677895
ENST00000376509.4|ENSG00000102096.9|OTTHUMG00000024132.2|OTTHUMT00000060805.1|PIM2-201|PIM2|2075|protein_coding|,ENSG00000102096.9,PIM2,2.092917e-09,2.023716e-05,35.88521,14.101121,1,8.751628,2.820224,0.0003598689,2.819864,0.1269611,2.819864
ENST00000379959.7|ENSG00000134460.17|OTTHUMG00000017616.4|OTTHUMT00000046627.1|IL2RA-203|IL2RA|3176|protein_coding|,ENSG00000134460.17,IL2RA,3.408594e-09,2.023716e-05,34.93522,49.495861,1,8.245347,9.899172,0.003950745,9.895222,0.1178502,9.895222
ENST00000380956.9|ENSG00000137265.15|OTTHUMG00000016294.5|OTTHUMT00000043638.3|IRF4-201|IRF4|5314|protein_coding|,ENSG00000137265.15,IRF4,2.933747e-09,2.023716e-05,35.22735,42.622878,1,7.696539,8.524576,0.0029354681,8.52164,0.1145687,8.52164
ENST00000389266.8|ENSG00000106105.14|OTTHUMG00000152769.2|OTTHUMT00000327735.2|GARS-201|GARS|2437|protein_coding|,ENSG00000106105.14,GARS,2.241307e-09,2.023716e-05,35.75175,15.205941,1,8.880037,3.041188,0.0003338515,3.040854,0.1303138,3.040854
ENST00000394053.7|ENSG00000065911.12|OTTHUMG00000129814.5|OTTHUMT00000252045.3|MTHFD2-201|MTHFD2|4403|protein_coding|,ENSG00000065911.12,MTHFD2,7.186466e-10,2.023716e-05,37.96935,27.775012,1,7.452825,5.555002,0.0044991653,5.550503,0.11516,5.550503
ENST00000418386.2|ENSG00000226979.8|OTTHUMG00000031135.2|OTTHUMT00000076237.2|LTA-213|LTA|1422|protein_coding|,ENSG00000226979.8,LTA,2.686411e-09,2.023716e-05,35.39888,36.640844,1,8.616529,7.328169,0.0014446939,7.326724,0.1239097,7.326724
ENST00000470592.5|ENSG00000065911.12|OTTHUMG00000129814.5|OTTHUMT00000351281.1|MTHFD2-205|MTHFD2|1396|nonsense_mediated_decay|,ENSG00000065911.12,MTHFD2,1.921975e-09,2.023716e-05,36.05123,33.93434,1,7.540757,6.786868,0.0158822959,6.770986,0.1148028,6.770986


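The transcript p-values listed in sleuth_table_tx are the ones that were aggregated to obtain the gene p-values in sleuth_table_gene. Note that the top-ranked transcripts need not belong to the top-ranked genes: in the output shown here, the most significant gene is WARS, whereas the first transcript rows displayed (which all share the same q-value) come from genes such as SESN2, TFRC and MTHFD2.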
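
To look at the transcripts behind the significant genes, the two result tables can be linked through the gene ID. This is a base R sketch on toy data frames with made-up identifiers and q-values. Only the column names `target_id`, `ens_gene` and `qval` are taken from the tables above.

```r
# Toy transcript-level table (made-up IDs and q-values).
tx <- data.frame(
  target_id = c("tx1", "tx2", "tx3", "tx4"),
  ens_gene = c("geneA", "geneA", "geneB", "geneC"),
  qval = c(0.001, 0.20, 0.03, 0.60),
  stringsAsFactors = FALSE
)

# Toy gene-level table: in gene mode, target_id holds the gene ID.
genes <- data.frame(
  target_id = c("geneA", "geneB", "geneC"),
  qval = c(0.002, 0.03, 0.60),
  stringsAsFactors = FALSE
)

# Transcripts that belong to the significant genes.
sig_genes <- genes$target_id[genes$qval <= 0.05]
tx_of_sig <- tx[tx$ens_gene %in% sig_genes, ]

# Number of individually significant transcripts per significant gene.
n_sig_tx <- tapply(tx_of_sig$qval <= 0.05, tx_of_sig$ens_gene, sum)
print(n_sig_tx)
```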

## Visualizing the results
One can visualize the results within sleuth's R Shiny app by calling:

In [None]:
sleuth_live(so)

Loading required package: shiny

Listening on http://127.0.0.1:42427
“Error in : object 'sigma_sq' not found”

This opens a new browser window that runs the R Shiny app. One can visualize the transcript dynamics that resulted in these gene differential results under ‘analyses’ -> ‘gene view’. Entering the Ensembl gene ID and selecting ‘ens_gene’ from the ‘genes from’ dropdown displays each transcript corresponding to that gene. ‘analyses’ -> ‘test table’ provides the same results as the tables returned by sleuth_results above. As previously mentioned, because our gene results are based on the transcript results, there is no need to visualize gene abundances separately. Instead, one can use the transcript abundances as the evidence for the gene-level differential expression.