TSSr: an R/Bioconductor package for comprehensive analyses of transcription start sit (TSS) data

Zhaolian Lu, Keenan Berry, Zhenbin Hu, Yu Zhan, Tae-Hyuk Ahn, Zhenguo Lin

How to cite TSSr?

TSSr was published in NAR Genomics and Bioinformatics in 2021, please cite:

Lu, Z., Berry, K., Hu, Z., Zhan, Y., Ahn, T., & Lin, Z. (2021). TSSr: an R package for comprehensive analyses of TSS sequencing data. NAR Genomics and Bioinformatics, 3.

1. Intruduction

Documentation is also available on GitHub Pages: https://github.com/Linlab-slu/TSSr

TSSr is an R package that provides rich functions for mapping transcription start sites (TSSs) and characterizations of structures and activities of core promoters based on various types of sequencing data generated by high-throughput techniques that sequence the very 5’end of RNA transcripts (TSS sequencing), such as cap analysis of gene expression (CAGE),nAnT-iCAGE, NanoCAGE, transcript leader sequencing (TL-seq), transcript isoform sequencing (TIF-seq), TSS-seq, RAMPAGE, single-cell tagged reverse transcription (STRT), global nuclear run-on cap (GRO-cap) and STRIPE-seq.

The main fucntions of TSSr include identifications of TSSs, normalization and filtration of TSS signals, clustering of TSS to infer core promoters, quantifications of core promoter activities and shape, assigning TSS cluster to genes, gene differential expression, core promoter shift. TSSr uses multiple formats of files as input, such as Binary Sequence Alignment Mao (BAM) files (single-ended or paired-ended), Browser Extension Data (bed) files, BigWig files, or TSS tables. TSSr generates various types of TSS or core promoter track files which can be visualized in the UCSC Genome Browser or Integrative Genomics Viewer (IGV). TSSr also exports various raw and processed result tables and visualization plots. Multiple cores are supported on Linux or Mac OS platforms.

2. Pre-requisites:

R version

Download R (>4.1.0) version from CRAN.
- Windows: https://cran.r-project.org/bin/windows/base/
- Mac OS X: https://cran.r-project.org/bin/macosx/
- Linux: https://cran.r-project.org/bin/linux/

R packages

Rsamtools
GenomicRanges
GenomicFeature
Gviz
rtracklayer
DESeq2

BSgenome

install the packages by using the following R command:

install.packages("devtools")
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("Rsamtools")
BiocManager::install("GenomicRanges")
BiocManager::install("GenomicFeatures")
BiocManager::install("Gviz")
BiocManager::install("rtracklayer")
BiocManager::install("DESeq2")
BiocManager::install("BSgenome")

data.table
stringr
- install the packages by using the following R command:
```
install.packages("data.table")
install.packages("stringr")
```

Or you can install this packages by conda easily:

build the conda environment manually

conda create -n tssr
conda activate tssr
conda install r-base=4.1.0
conda install -c bioconda -c conda-forge bioconductor-rsamtools bioconductor-gviz \
bioconductor-genomicranges bioconductor-genomicfeatures bioconductor-rtracklayer \
bioconductor-deseq2 bioconductor-bsgenome r-data.table r-stringr  r-devtools pandoc

If you running into some errors when trying install all the packages in one line, please install the packages one by one manually.

build the conda environment from .yml file

You can also install the dependency packages from yml file

wget -c https://raw.githubusercontent.com/Linlab-slu/TSSr/master/tssr.yml
conda create -n tssr -f tssr.yml

3. Installing TSSr Package:

To install the TSSr package all the prerequisites above need to be installed. After confirming those packages are installed, you can install the development version directly from GitHub using devtools:

    devtools::install_github("Linlab-slu/TSSr", build_vignettes = TRUE)

4. Input data for TSSr

Input data files

TSSr accepts two types of input data: read alignment files or TSS tables. The read alignment files in compressed binary alignment map (BAM) format are required if users intend to identify TSSs from mapped sequencing data. BAM files can be derived from mapping of either paired-end or single-end TSS sequencing reads. A TSS table is a matrix of TSS data generated by TSSr or other bioinformatics tools. A TSS table can be a tab-delimited text file, a BigWig (bw) binary type file, or browser extensible data (bed) type file. A tab-delimited TSS table contains chromosome ID, genomic coordinates, strand information, and raw or normalized read counts of each sample. An example TSS table can be downlaoded from http://zlinlab.org/TSSr/ALL.samples.TSS.raw.txt.

Reference genome

The reference genome stored as a BSgenome data package must be provided for identification of TSSs from bam files. For available BSgenome genome objects, please check (https://bioconductor.org/packages/BSgenome). In case the BSgenome object of a species is unavailable in the BSgenome package (not in the list returned by BSgenome::available.genomes() function), a custom BSgenome object has to be built and installed before running this package. (See the vignette “How to forge a BSgenome data package” within the BSgenome package for instructions). A genome annotation file (GTF or GFF file) is required for assigning TSS clusters to genes or other genomic features by using the annotateCluster function of TSSr. The chromosome names in BSgenome, genome annotation file as well as the reference genome used for mapping sequencing reads must be consistent.

S4 object

TSSr uses S4 object to store all input files and arguments and generate downstream analysis results.

5. Usage:

Launch TSSr
```
  library(TSSr)
```
Example TSSr object. Users may explore the structure of a TSSr object and TSSr fucntions using provided example TSSr object "exampleTSSr". Skip this step if the user plan to generate a new TSSr object from bam or TSS table files.
```
  data(exampleTSSr)
  myTSSr <- exampleTSSr 
```
Generating a new TSSr object from BAM or TSS table files. To generate a new TSSr object, several arguments of the constructer "new" function must be specified. The first step is to provide the path and file names of input files for argument "inputFiles". Four example BAM files (S01.sorted.bam, S02.sorted.bam, S03.sorted.bam, S04.sorted.bam) can be downloaded from http://www.zlinlab.org/TSSr.html. These BAM files were generated using nAnT-iCAGE from a budding yeast Saccharomyces cerevisiae (Lu and Lin 2019). The nAnT-iCAGE reads were mapped to the reference genome of S. cerevisiae (R64-2-1) using HISAT2 (Kim, Langmead et al. 2015). To reduced sample file size, each BAM file only includes sequencing reads mapped to two chromosomes (Chr I and Chr II). A TSS table genereated from the four BAM files can be downloaded from http://zlinlab.org/TSSr/ALL.samples.TSS.raw.txt, which can also be used as a input file. Assuming the bam or TSS files are saved to the current working directory:
```
  inputFiles <- c("S01.sorted.bam", "S02.sorted.bam", "S03.sorted.bam", "S04.sorted.bam") # if use a TSS table: inputFiles <- "ALL.samples.TSS.raw.txt"
  inputFilesType <- "bam" # set “inputFilesType” as “bamPairedEnd” for paired-end BAM files, and as "TSStable" if the input file is a TSS table 
```
Install and launch required BSgenome object. The corresponding BSgenome object of S. cerevisiae “BSgenome.Scerevisiae.UCSC.sacCer3” is available from BSgenome Bioconductor package (https://bioconductor.org/packages/BSgenome).
```
  if (!requireNamespace("BSgenome.Scerevisiae.UCSC.sacCer3", quietly = TRUE))
  BiocManager::install("BSgenome.Scerevisiae.UCSC.sacCer3")
  library(BSgenome.Scerevisiae.UCSC.sacCer3) 
```
Provide path and file name of genome annotation file. The example genome annotation of S. cerevisiae (R64-2-1) was originally obtained from the Saccharomyces Genome Database. It ban be downlaoded from http://zlinlab.org/TSSr/saccharomyces_cerevisiae_R64-2-1.gff. Assuming the gff file is saved to the current working directory:
```
  refSource <- "saccharomyces_cerevisiae_R64-2-1.gff" 
```

Creating a new TSSr object using the constructer "new" function. In addition to the input data file "inputFiles" and file type "inputFilesType", BSgenome object name, and genome annotaion file "refSource", several other arguments need to be specified, including "sampleLabels", "sampleLabelsMerged", and "mergeIndex" and "organismName". In the example data, sample lables "SL01, SL02, SL03 and SL04" were used for "S01.sorted.bam, S02.sorted.bam, S03.sorted.bam, S04.sorted.bam", respectively (sampleLabels = c("SL01","SL02","SL03","SL04")). If a TSS table is used as input data, the sample lable of each sample is its column name in the TSS table. "sampleLabelsMerged" will be used if the user need to merge TSS signals from multiple samples (e.g., biological replicates) into a single dataset. In this case, SL01 and SL02 are two biological replicates obtained from S. cerevisiae cells grown in rich medium ("control"), while SL03 and SL04 were obtained by treating S. cerevisiae with α factor ("treat"). Thus, sampleLabelsMerged = c("control","treat") and mergeIndex = c(1,1,2,2), which means SL01 and SL02 will be merged as group 1 "control", and SL03 and SL04 will be merged as group 2 "treat". "organismName" is the species name of the samples, which will be used to annotate TSS clusters by "annotateCluster".

  myTSSr <- new("TSSr", genomeName = "BSgenome.Scerevisiae.UCSC.sacCer3"
            ,inputFiles = inputFiles
            ,inputFilesType = inputFilesType
            ,sampleLabels = c("SL01","SL02","SL03","SL04")
            ,sampleLabelsMerged = c("control","treat")
            ,mergeIndex = c(1,1,2,2)
            ,refSource = refSource
            ,organismName = "Saccharomyces cerevisiae")

To display the available slots in the created TSSr object by typing the TSSr object name. Most slots in the newly created TSSr object are empty at this point since no analysis has been conducted.:

   myTSSr
    
  # An object of class "TSSr"
  # Slot "genomeName":
  # [1] "BSgenome.Scerevisiae.UCSC.sacCer3"
  # 
  # Slot "inputFiles":
  # [1] "S01.sorted.bam" "S02.sorted.bam" "S03.sorted.bam" "S04.sorted.bam"
  # 
  # Slot "inputFilesType":
  # [1] "bam"
  # 
  # Slot "sampleLabels":
  # [1] "SL01" "SL02" "SL03" "SL04"
  # 
  # Slot "sampleLabelsMerged":
  # [1] "control" "treat"  
  # 
  # Slot "librarySizes":
  # numeric(0)
  # 
  # Slot "TSSrawMatrix":
  # data frame with 0 columns and 0 rows
  # 
  # Slot "mergeIndex":
  # [1] 1 1 2 2
  # 
  # Slot "TSSprocessedMatrix":
  # data frame with 0 columns and 0 rows
  # 
  # Slot "tagClusters":
  # list()
  # 
  # Slot "consensusClusters":
  # list()
  # 
  # Slot "clusterShape":
  # list()
  # 
  # Slot "refSource":
  # [1] "saccharomyces_cerevisiae_R64-2-1.gff"
  # 
  # Slot "refTable":
  # data frame with 0 columns and 0 rows
  # 
  # Slot "organismName":
  # [1] "Saccharomyces cerevisiae"
  # 
  # Slot "assignedClusters":
  # list()
  # 
  # Slot "unassignedClusters":
  # list()
  # 
  # Slot "filteredClusters":
  # list()
  # 
  # Slot "DEtables":
  # list()
  # 
  # Slot "TAGtables":
  # list()
  # 
  # Slot "PromoterShift":
  # list()

TSS calling from bam files or retrieving TSS data from TSS table using "getTSS" function. The "getTSS" function identifies genomic coordinates of all TSSs and calculates read counts supporting each TSS from bam file and return values to the slot of TSSrawMatrix. Before TSS calling, TSSr removes reads that are below certain sequencing quality and mapping quality. The default threshold for Phred quality score is 10, and mapping quality (MAPQ score) is 20. Users may change these arguments by setting different values for “sequencingQualityThreshold” and “mappingQualityThreshold” when running the “getTSS” function. If a mapped TSS sequencing read starts with a G that is a mismatch to the reference genome, the uncoded 5’ end G is likely the m7G cap, and thus the uncoded G will be removed from TSS calling by "getTSS". If a matched G at the 5’end of a tag is considered as an added cap, TSSr treats the 5’end of reads with matched G as genome-coded G, and the first G is not removed when calling TSS. This strategy is based on a stronge preference of PyPu dinucleotide at the [-1, +1] sites. This strategy also makes TSSr suitable for calling TSSs from 5’end sequencing reads that are not based on cap capture techniques.

Considering that soft-clipping during mapping of sequencing reads to reference genome could also exclude non-matching 5’end bases other than unmatched Gs, which introduces false positive TSSs. Therefore, we’d recommend not to use soft-clipping for sequencing data. In the case that soft-clipped bam files are used, the users should set "softclippingAllowed = TRUE" for getTSS function.

    getTSS(myTSSr)

TSS calling from bam files or retrieving TSS data from TSS table

    myTSSr@TSSrawMatrix

    #           chr    pos strand SL01 SL02 SL03 SL04
    #      1:  chrI   1561      +    0    0    0    1
    #      2:  chrI   5759      +    1    0    0    0
    #      3:  chrI   5765      +    1    0    0    0
    #      4:  chrI   5773      +    1    0    0    0
    #      5:  chrI   5925      +    0    1    0    0
    #     ---                                        
    # 163199: chrII 810860      -    0    0    0    2
    # 163200: chrII 810963      -    0    1    0    0
    # 163201: chrII 811112      -    0    0    1    0
    # 163202: chrII 811370      -    1    0    0    0
    # 163203: chrII 812101      -    1    0    0    0

TSSrawMatrix contains genomic coordinates and read counts of each called TSS. TSSrawMatrix can be exported by "exportTSStable" function to a TSS table "ALL.samples.TSS.raw.txt" in the working directory. The TSS table can be used as input file for future downstram analyses using TSSr or other tools. It is advised to export unmerged TSSrawMatrix to an TSS table to avoid repeated TSS calling step.
```
  exportTSStable(myTSSr, data = "raw", merged = "FALSE")
```
The "plotCorrelation" function is used to calculate the pairwise correlation coefficients and plot pairwise scatter plots of TSS counts. A subset of samples can also be specified to display the pairwise correlations. Three correlation methods are supported: “pearson”, “kendall”, or “spearman”. The default setting generates scatter plots for all samples (samples = "all"). For plotting a subset of samples, users may provide a list of sample labels for "sample", e.g., samples = c("SL01","SL02","SL03").
```
  plotCorrelation(myTSSr, samples = "all")
```

TSS includes "plotPCA" function to plot results principle component analysis (PCA) of all samples to obtain an overview of the TSS data in a low-dimensional subspace and for quality assessment. plotPCA will make a biplot which visualizes both how samples relate to each other in terms of the first principal component (PC1) and the second principal component (PC2) and simultaneously show how each variable contributes to each principal component.
```
 plotTssPCA(myTSSr, TSS.threshold=10)
```

Merging samples (biological replicates). Users can merge multiple samples (e.g., biological replicates) into previously defined groups with mergeSamples function. The "mergeIndex" argument directs which samples will be merged and how the final dataset will be ordered accordingly. The merged read counts and genomic coordinates are stored in the TSSprocessedMatrix slot.
```
  mergeSamples(myTSSr)
  
  myTSSr@TSSprocessedMatrix
  
  #          chr    pos strand control treat
  #     1:  chrI   1561      +       0     1
  #     2:  chrI   5759      +       1     0
  #     3:  chrI   5765      +       1     0
  #     4:  chrI   5773      +       1     0
  #     5:  chrI   5925      +       1     0
  #    ---                                  
  #163199: chrII 810860      -       0     2
  #163200: chrII 810963      -       1     0
  #163201: chrII 811112      -       0     1
  #163202: chrII 811370      -       1     0
  #163203: chrII 812101      -       1     0
```
To return library sizes (number of total read counts) of merged samples in TSSr object in the specified order (note: order is specified in mergeSample function):
```
  myTSSr@librarySizes
  # control   treat 
  # 3221609 5131202
```
Library sizes are usually different among samples. To provide between-sample comparability, the raw read counts of each TSS need to be scaled as tags per million mapped reads (TPM) with normalizeTSS function. TSSr also provides two options to exclude TSSs with a low support: "poisson" and "tpm". If you intend to remove TSSs with low support using the “poisson” option, you should skip the "normalizeTSS" because the "poisson" method (method = "poisson") use the raw read count data to calculate the probability of observing k numbers of reads supporting each TSS based on the sequencing depth of the sample per the Poisson distribution. Only TSSs with a significantly larger number of supporting reads than expected (default threshold p < 0.01) are considered as qualified TSSs. Non-significant TSSs are thus filtered by TSSr. TSSr will normalize the raw read counts as TPM after filtering by the "poisson" method.
```
  filterTSS(myTSSr, method = "poisson")
```
If you use the "normalizeTSS" function, you may use the “filterTSS” option to remove TSSs with low support from mapped reads based on TMP value (method = "TPM"), and any TSS that has a lower TPM value than user-defined threshold "tpmLow" will be removed (default TPM threshold = 0.1).
```
  normalizeTSS(myTSSr)
  filterTSS(myTSSr, method = "TPM")
  myTSSr@TSSprocessedMatrix

  
  #           chr    pos strand  control    treat
  #      1:  chrI   1561      + 0.000000 0.194886
  #      2:  chrI   5759      + 0.310404 0.000000
  #      3:  chrI   5765      + 0.310404 0.000000
  #      4:  chrI   5773      + 0.310404 0.000000
  #      5:  chrI   5925      + 0.310404 0.000000
  #     ---                                      
  # 163199: chrII 810860      - 0.000000 0.389772
  # 163200: chrII 810963      - 0.310404 0.000000
  # 163201: chrII 811112      - 0.000000 0.194886
  # 163202: chrII 811370      - 0.310404 0.000000
  # 163203: chrII 812101      - 0.310404 0.000000
```
Simialr to the raw read counts, the normalized and filtered (processed) TSS matrix can be exported to delimited text file with "exportTSStable" function or bedGraph/BigWig files with "exportTSStoBedgraph" fucntion. bedGraph/BigWig files can be visualized in the UCSC Genome Browser or Integrative Genomics Viewer (IGV) or Genome Browser at YeasTSS.org for selected yeast species.
```
  exportTSStable(myTSSr, data = "processed") 
  exportTSStoBedgraph(myTSSr, data = "processed", format = "bedGraph") 
  exportTSStoBedgraph(myTSSr, data = "processed", format = "BigWig")
```

Clustering TSSs to infer core promoters with “clusterTSS” function.

The “clusterTSS” function was designed to group neighboring TSSs into distinct TSS clusters (TCs), representing putative core promoters. It implements a TSS clustering algorithm based on peaking identification, namely “peakclu” (peak clustering) (Lu and Lin, 2021). Briefly, peakclu applies a sliding window approach (default window size = 100 bp with step size = 1) to scan TSS signals from the 5′ end of both strands of each chromosome. In each window, the TSS with the highest TPM value was identified as the peak. The surrounding TSSs are grouped with the peak into a TC. The clustering process of a TC terminates if a TSS is ≥ n bp (default n = 25) away from the nearest upstream TSS. In addition to setting a minimal allowed distance between peaks, TSSr offers another option to set maximal allowed extension distance between neighboring TSSs around peaks, which enables users to define the boundaries between neighboring core promoters. clusterTSS calculates inter-quantile width of a core promoter based on the cumulative distribution of TSS signals within the promoter. The positions of the 10th to 90th quantiles of TSS signals, which include at least 80% transcription initiation signals within a cluster, were defined as the 5’ and 3’ boundaries of the core promoter. The clustering step might be slow especially when the number of TSSs is in millions. Using multicores is highly recommended.

  clusterTSS(myTSSr, method = "peakclu",peakDistance=100,extensionDistance=30
         ,localThreshold = 0.02, clusterThreshold = 1
         ,useMultiCore=FALSE, numCores=NULL)
         
  myTSSr@tagClusters
  
  # $control
  #       cluster   chr  start    end strand dominant_tss       tags tags.dominant_tss  q_0.1  q_0.9 interquantile_width
  #    1:       1  chrI   6530   6564      +         6548   9.312118          3.724847   6530   6562                  33
  #    2:       2  chrI   7166   7189      +         7169   1.241616          0.620808   7166   7189                  24
  #    3:       3  chrI   8084   8087      +         8084   1.241616          0.620808   8084   8087                   4
  #    4:       4  chrI   9325   9411      +         9327   4.966464          1.241616   9327   9391                  65
  #    5:       5  chrI   9442   9479      +         9442   3.414443          1.862423   9442   9468                  27
  #   ---                                                                                                               
  # 3303:    3303 chrII 804855 804975      -       804895 260.739280         71.703301 804868 804899                  32
  # 3304:    3304 chrII 807129 807179      -       807165  16.761812          6.518482 807161 807168                   8
  # 3305:    3305 chrII 808679 808723      -       808723   1.862424          0.931212 808679 808723                  45
  # 3306:    3306 chrII 809353 809377      -       809377   1.862424          1.552020 809353 809377                  25
  # 3307:    3307 chrII 809707 809748      -       809707   4.035251          2.172827 809707 809724                  18
  
  # $treat
  #       cluster   chr  start    end strand dominant_tss       tags tags.dominant_tss  q_0.1  q_0.9 interquantile_width
  #    1:       1  chrI   6521   6564      +         6559   5.067037          1.753975   6530   6559                  30
  #    2:       2  chrI   7096   7118      +         7100   1.948860          0.584658   7096   7114                  19
  #    3:       3  chrI   8061   8087      +         8061   1.948860          0.584658   8061   8087                  27
  #    4:       4  chrI   9327   9476      +         9359  11.498278          1.753975   9327   9452                 126
  #    5:       5  chrI  11259  11343      +        11329  59.245376         21.827244  11288  11329                  42
  #   ---                                                                                                               
  # 3244:    3244 chrII 804855 804952      -       804895 155.908888         54.568111 804872 804901                  30
  # 3245:    3245 chrII 807118 807179      -       807165  13.642026          5.456811 807153 807165                  13
  # 3246:    3246 chrII 808655 808691      -       808679   1.753974          0.779544 808655 808691                  37
  # 3247:    3247 chrII 809377 809389      -       809377   1.169316          0.779544 809377 809389                  13
  # 3248:    3248 chrII 809707 809710      -       809707   1.364203          0.974431 809707 809710                   4

The results of TSS clustering can be exported to delimited text file with "exportClustersTable" function or bedGraph files with "exportClustersToBed" fucntion.

  exportClustersTable(myTSSr, data = "tagClusters")
  exportClustersToBed(myTSSr, data = "tagClusters")

Aggregating consensus TSS clusters

TSSr infers a set of consensus core promoters using the “consensusCluster” function to assign the same ID for TCs belong to the same core promoter, which allows subsequent comparative studies across samples. TCs from different samples are considered to belong to the same consensus core promoter if the distance of their dominant TSSs is smaller than a user-defined distance (default = 50 bp). Similarly to clusterTSS function, consensusCluster function also returns genomic coordinates, sum of TSS tags, dominate TSS coordinate, a lower (q0.1) and an upper (q0.9) quantile coordinates, and interquantile widths for each consensus cluster in each sample.
```
  consensusCluster(myTSSr, dis = 50, useMultiCore = FALSE, numCores = NULL)
```
Similarly, the detailed information of consensus clusters can be exported to delimited text files with "exportClustersTable" function or bedGraph files with "exportClustersToBed" fucntion.
```
  exportClustersTable(myTSSr, data = "consensusClusters")
  exportClustersToBed(myTSSr, data = "consensusClusters")	
```
Quantification of core promoter shape

Core promoter shape reflects the distribution of TSS signals within a core promoter. TSSr provides three different options, inter-quantile width, promoter shape score (PSS) and shape index (SI), to quantify core promoter shape. Inter-quantile width refers to the distance between the locations of the 10th percentile to the 90th percentile TSS signals within a TSS cluster. Thus, it measures the width of a core promoter, but lacks the information of distribution patterns of TSS signals within a core promoter. Inter-quantile width could be significantly affected by different clustering methods. plotInterQuantile function plots interquantile width of each sample.
```
  plotInterQuantile(myTSSr,samples = "all",tagsThreshold = 1)
```

Promoter shape score (PSS) integrates both inter quantile width and the observed probabilities of tags at every TSSs within a cluster (Lu and Lin 2019). PSS can be calculated using using shapeCluster function with method set as “PSS”. The smaller value represents the sharper core promoter. PSS is 0 representing singletons. SI is determined by the probabilities of tags at every TSSs within one cluster (Hoskins, Landolin et al. 2011). SI is also calculated using shapeCluster function with method set as “SI”. The greater value represents the sharper core promoter. The SI = 2 represents singletons.

    shapeCluster(myTSSr,clusters = "consensusClusters", method = "PSS",useMultiCore= FALSE, numCores = NULL)

    myTSSr@clusterShape
    
    # $control
    #       cluster   chr  start    end strand dominant_tss       tags tags.dominant_tss  q_0.1  q_0.9 interquantile_width shape.score
    #    1:       1  chrI   6530   6564      +         6548   9.312118          3.724847   6530   6562                  33    9.646916
    #    2:       3  chrI   7166   7189      +         7169   1.241616          0.620808   7166   7189                  24    6.877444
    #    3:       4  chrI   8084   8087      +         8084   1.241616          0.620808   8084   8087                   4    2.000000
    #    4:       5  chrI   9325   9411      +         9327   4.966464          1.241616   9327   9391                  65   17.767262
    #    5:       6  chrI   9442   9479      +         9442   3.414443          1.862423   9442   9468                  27    7.469695
    #   ---                                                                                                                           
    # 3302:    4285 chrII 804855 804975      -       804895 260.739280         71.703301 804868 804899                  32   14.764972
    # 3303:    4286 chrII 807129 807179      -       807165  16.761812          6.518482 807161 807168                   8    5.251317
    # 3304:    4287 chrII 808679 808723      -       808723   1.862424          0.931212 808679 808723                  45    9.844044
    # 3305:    4288 chrII 809353 809377      -       809377   1.862424          1.552020 809353 809377                  25    3.018611
    # 3306:    4289 chrII 809707 809748      -       809707   4.035251          2.172827 809707 809724                  18    7.425270
    # 
    # $treat
    #       cluster   chr  start    end strand dominant_tss       tags tags.dominant_tss  q_0.1  q_0.9 interquantile_width shape.score
    #    1:       1  chrI   6521   6564      +         6559   5.067037          1.753975   6530   6559                  30   13.123285
    #    2:       2  chrI   7096   7118      +         7100   1.948860          0.584658   7096   7114                  19    9.333375
    #    3:       4  chrI   8061   8087      +         8061   1.948860          0.584658   8061   8087                  27    9.012708
    #    4:       5  chrI   9327   9476      +         9359  11.498278          1.753975   9327   9452                 126   25.664108
    #    5:       7  chrI  11259  11343      +        11329  59.245376         21.827244  11288  11329                  42   15.020408
    #   ---                                                                                                                           
    # 3240:    4285 chrII 804855 804952      -       804895 155.908888         54.568111 804872 804901                  30   13.439327
    # 3241:    4286 chrII 807118 807179      -       807165  13.642026          5.456811 807153 807165                  13    7.308196
    # 3242:    4287 chrII 808655 808691      -       808679   1.753974          0.779544 808655 808691                  37   10.725295
    # 3243:    4288 chrII 809377 809389      -       809377   1.169316          0.779544 809377 809389                  13    3.398098
    # 3244:    4289 chrII 809707 809710      -       809707   1.364203          0.974431 809707 809710                   4    1.726241

The distributions of core promoter shape (PSS or SI) can be illustrated by a histograms with plotShape function. Please be noted that there is only one shapeCluster slot, so either PSS or SI can be saved in the shapeCluster slot depending on which method argument was used for shapeCluster.

    plotShape(myTSSr)

The complete list of PSS can be exported to delimited text file with "exportShapeTable" function.

exportShapeTable(myTSSr)

Annotation (Assigning TCs to genes)

Assigning TCs to downstream genes as their core promoters is required for annotation of the 5’ boundaries of genomic features. This process is also a prerequisite for further interrogations of regulated transcription initiation at the gene level. TSSr offers the “annotateCluster” function to assign TCs to their downstream genes. The assignment of a TC to a gene is based on the distance between the position of the dominant TSS of a TC and the annotated 5’ends of coding sequences (CDS) or transcripts. The default maximum distance between the dominant TSS and CDS is 1000 bp (“upstream = 1000”). If a TC overlaps with the CDS of an upstream gene, the dominant TSS of the TC must be within 500 bp to the 3’end of the overlapping CDS by default (“upstreamOverlap = 500”) to be eligible to assign to its downstream gene. If the 5’ends of annotated transcripts, instead of CDS, are used for TC assignment, users should set “annotationType” as “transcript,” and the default distance parameter is 500 bp. Because the genomes size and the number of introns vary substantially among organisms, it is necessary to apply customized criteria for TC assignment for different organisms. Users are advised to adjust the assignment criteria for core promoter assignment in TSSr. By default, only TCs with ≥ 0.02 TPM are used for the annotation process. To reduce transcriptional or technical noise of small clusters downstream a strong cluster, the filterCluster argument was set as "filterClusterThreshold = 0.02", indicating that any TC with TPM value < 0.02 TPM will be excluded from assigning to genes.

  annotateCluster(myTSSr,clusters = "consensusClusters",filterCluster = TRUE,
            filterClusterThreshold = 0.02, annotationType = "genes"
            ,upstream=1000, upstreamOverlap = 500, downstream = 0)
            
  myTSSr@assignedClusters
  
  # $control
  #      cluster   chr  start    end strand dominant_tss       tags tags.dominant_tss  q_0.1  q_0.9 interquantile_width      gene inCoding
  #   1:       5  chrI   9325   9411      +         9327   4.966464          1.241616   9327   9391                  65   YAL066W     <NA>
  #   2:       6  chrI   9442   9479      +         9442   3.414443          1.862423   9442   9468                  27   YAL066W     <NA>
  #   3:       7  chrI  11253  11343      +        11329  40.662913         17.693022  11275  11329                  55 YAL064W-B     <NA>
  #   4:      33  chrI  31026  31151      +        31108  48.112608         17.072215  31108  31145                  38   YAL062W     <NA>
  #   5:      34  chrI  31187  31263      +        31242 320.026426        122.609541  31212  31242                  31   YAL062W     <NA>
  #  ---                                                                                                                                  
  # 856:    4279 chrII 801003 801075      -       801023  11.174543          4.656059 801018 801061                  44 YBR296C-A     <NA>
  # 857:    4284 chrII 804563 804593      -       804592   1.552020          0.931212 804563 804593                  31   YBR298C     <NA>
  # 858:    4285 chrII 804855 804975      -       804895 260.739280         71.703301 804868 804899                  32   YBR298C     <NA>
  # 859:    4288 chrII 809353 809377      -       809377   1.862424          1.552020 809353 809377                  25   YBR300C     <NA>
  # 860:    4289 chrII 809707 809748      -       809707   4.035251          2.172827 809707 809724                  18   YBR300C     <NA>
  # 
  # $treat
  #      cluster   chr  start    end strand dominant_tss       tags tags.dominant_tss  q_0.1  q_0.9 interquantile_width      gene inCoding
  #   1:       5  chrI   9327   9476      +         9359  11.498278          1.753975   9327   9452                 126   YAL066W     <NA>
  #   2:       7  chrI  11259  11343      +        11329  59.245376         21.827244  11288  11329                  42 YAL064W-B     <NA>
  #   3:       8  chrI  11907  11934      +        11919   2.338633          0.974431  11916  11922                   7 YAL064W-B     <NA>
  #   4:      19  chrI  21237  21248      +        21246   1.948861          1.169317  21240  21246                   7   YAL064W     <NA>
  #   5:      34  chrI  31087  31263      +        31242 117.711208         34.299955  31119  31242                 124   YAL062W     <NA>
  #  ---                                                                                                                                  
  # 810:    4271 chrII 799104 799125      -       799125   1.169316          0.584658 799104 799125                  22   YBR296C     <NA>
  # 811:    4279 chrII 801017 801061      -       801023   6.821011          2.338633 801023 801058                  36 YBR296C-A     <NA>
  # 812:    4285 chrII 804855 804952      -       804895 155.908888         54.568111 804872 804901                  30   YBR298C     <NA>
  # 813:    4288 chrII 809377 809389      -       809377   1.169316          0.779544 809377 809389                  13   YBR300C     <NA>
  # 814:    4289 chrII 809707 809710      -       809707   1.364203          0.974431 809707 809710                   4   YBR300C     <NA>

Instead of visualizing TSSs and core promoters in the UCSC Genome Browser or IGV, plotTSS function is able to generate publish ready figures when list of interested genes are provided and plotting region is specified.

The results of TC annotation can be exported to delimited text files with "exportClustersTable" function.

  exportClustersTable(myTSSr, data = "assigned")

Analysis of enhancers

TSS data allow for the robust identification of enhancers by transcription of enhancer RNAs (eRNAs). Active enhancers produce bidirectional transcription of capped eRNAs, resulting in two diverging tag clusters by at most 400 bp. TSSr can identify this bidirectional cluster pairs and calculate a sample-set wide directionality score D for each locus (Andersson et al., 2014). D = (F-R)/(F+R), where F is the aggregated normalized tag counts in forward strandm and R is the aggregated normalized tag counts in reverse strand. Putative enhancers were then filtered with |D| < 0.8.
```
  callEnhancer(myTSSr, flanking = 400)
```
The results of putative enhancers can be exported to delimited text files with "exportEnhancerTable" function.
```
  exportEnhancerTable(myTSSr)
```
Differential expression analysis

The number of tags at each TSS reflects the number of transcripts initiated at the TSS. Thus, TSS data can be used for expression profiling. With specified sample pairs for comparison, deGene function counts raw tags of each consensus clusters and utilizes the DESeq2 package (Love, Huber et al. 2014) for differential expression analysis.
```
  deGene(myTSSr,comparePairs=list(c("control","treat")), pval = 0.01,useMultiCore=FALSE, numCores=NULL)
```
Differential expression analysis results can be visualized by plotDE function which generates a volcano plots. Names of genes differential expressed between the compared pairs are displayed on the dots when the withGeneName argument is set as TRUE.
```
  plotDE(myTSSr, withGeneName = "TRUE",xlim=c(-2.5, 2.5),ylim=c(0,10))
```
Differential expression analysis results can also be exported to a text file with exportDETable function.
```
  exportDETable(myTSSr, data = "sig")
```

Core promoter shifts

One gene might have multiple core promoters which can be used differently in different samples. TSSr implements degree of shift (Ds) algorithm (Lu and Lin 2019) to quantify the degree of promoter shift across different samples.

  shiftPromoter(myTSSr,comparePairs=list(c("control","treat")), pval = 0.01)
          
  myTSSr@PromoterShift
  
  # $control_VS_treat
  #          gene                 Ds                  pval          padj
  #  1:   YBL017C   16.5567506300711                     0  0.000000e+00
  #  2:   YBR006W  -26.3346456736974                     0  0.000000e+00
  #  3:   YBR023C   18.2077046044299                     0  0.000000e+00
  #  4:   YBR039W  -27.6613679948891                     0  0.000000e+00
  #  5:   YBR121C  -32.7177356221814                     0  0.000000e+00
  #  ---                                                                                                                                  
  # 70:   YBR158W   7.23779193641262   0.00110612692207393  4.424508e-03
  # 71:   YBR140C  -6.96937554342683   0.00119456280847422  4.710952e-03
  # 72:   YBR157C  0.867544334037256   0.00122552060110212  4.765913e-03
  # 73:   YBR143C  -1.42890967687923   0.00163014425826738  6.252608e-03
  # 74:   YBL060W    -1.471483572077   0.00197980632089202  7.491159e-03
  # 75:   YBR053C   2.26685314089749   0.00259863406742416  9.701567e-03
  #          gene                 Ds                  pval          padj

Below is an example of core promoter shift in genes YBR168W and YBL067C. The two major core promoters are differently used in control and treat samples.

  plotTSS(myTSSr,samples=c("control","treat"),tssData = "processed"
  ,clusters = "assigned"
  ,clusterThreshold = 0.02 
  ,genelist=c("YBR168W","YBL067C")
  ,up.dis =500,down.dis = 100,yFixed=TRUE)

Results of core promoter shift analysis can also be exported to a text file with exportShiftTable function.

    exportShiftTable(myTSSr)

7. Contact:

Zhenguo Lin, PhD

Department of Biology, Saint Louis University, USA.

Email: zhenguo.lin@slu.edu

8. Reference:

Andersson, R., Gebhard, C., Miguel-Escalada, I., Hoof, I., Bornholdt, J., Boyd, M., Chen, Y., Zhao, X., Schmidl, C., Suzuki, T., Ntini, E., Arner, E., Valen, E., Li, K., Schwarzfischer, L., Glatz, D., Raithel, J., Lilje, B., Rapin, N., Bagger, F. O., … Sandelin, A. (2014). "An atlas of active enhancers across human cell types and tissues". Nature 507(7493), 455–461.

Arribere, J. A. and W. V. Gilbert (2013). "Roles for transcript leaders in translation and mRNA decay revealed by transcript leader sequencing." Genome Res 23(6): 977-987.

Cumbie, J. S., M. G. Ivanchenko and M. Megraw (2015). "NanoCAGE-XL and CapFilter: an approach to genome wide identification of high confidence transcription start sites." BMC Genomics 16: 597.

Cvetesic, N., H. G. Leitch, M. Borkowska, F. Muller, P. Carninci, P. Hajkova and B. Lenhard (2018). "SLIC-CAGE: high-resolution transcription start site mapping using nanogram-levels of total RNA." Genome Res 28(12): 1943-1956.

Kruesi, W. S., L. J. Core, C. T. Waters, J. T. Lis and B. J. Meyer (2013). "Condensin controls recruitment of RNA polymerase II to achieve nematode X-chromosome dosage compensation." Elife 2: e00808.

Mahat, D. B., H. Kwak, G. T. Booth, I. H. Jonkers, C. G. Danko, R. K. Patel, C. T. Waters, K. Munson, L. J. Core and J. T. Lis (2016). "Base-pair-resolution genome-wide mapping of active RNA polymerases using precision nuclear run-on (PRO-seq)." Nat Protoc 11(8): 1455-1476.

Malabat, C., F. Feuerbach, L. Ma, C. Saveanu and A. Jacquier (2015). "Quality control of transcription start site selection by nonsense-mediated-mRNA decay." Elife 4.

Murata, M., H. Nishiyori-Sueki, M. Kojima-Ishiyama, P. Carninci, Y. Hayashizaki and M. Itoh (2014). "Detecting expressed genes using CAGE." Methods Mol Biol 1164: 67-85.

Pelechano, V., W. Wei and L. M. Steinmetz (2013). "Extensive transcriptional heterogeneity revealed by isoform profiling." Nature 497(7447): 127-131.

Takahashi, H., T. Lassmann, M. Murata and P. Carninci (2012). "5' end-centered expression profiling using cap-analysis gene expression and next-generation sequencing." Nat Protoc 7(3): 542-561.

Name		Name	Last commit message	Last commit date
Latest commit History 300 Commits
Meta		Meta
R		R
data		data
doc		doc
docs		docs
inst		inst
man		man
vignettes		vignettes
.Rbuildignore		.Rbuildignore
.gitignore		.gitignore
DESCRIPTION		DESCRIPTION
LICENSE		LICENSE
NAMESPACE		NAMESPACE
NEWS		NEWS
README.md		README.md
_pkgdown.yml		_pkgdown.yml
tssr.yml		tssr.yml

License

Linlab-slu/TSSr

Folders and files

Latest commit

History

Repository files navigation