Skip to content

Full-length linear and circular transcript isoform reconstruction and quantification

License

Notifications You must be signed in to change notification settings

Christina-hshi/psirc

Repository files navigation

psirc

Psirc (pseudo-alignment identification of circular RNAs) is for back-splicing junction detection, full-length linear and circular transcript isoform inference and quantification from RNA-seq data.

The whole psirc pipeline has two main parts: 1. Detecting back-splicing junctions (BSJs) and inferring full-length isoforms (FLIs); and 2. Quantification of FLIs (both linear and circular FLIs at the same time). Each part works well stand-alone, but it is recommended to use them together.

psirc pipeline

If you use psirc in your study, please cite:
Ken Hung-On Yu*, Christina Huan Shi*, Bo Wang, Savio Ho-Chit Chow, Grace Tin-Yun Chung, Ke-En Tan, Yat-Yuen Lim, Anna Chi-Man Tsang, Kwok-Wai Lo, Kevin Y. Yip. Quantifying full-length circular RNAs in cancer. Genome Research 31.12 (2021): 2340-2353. Available from: https://genome.cshlp.org/content/31/12/2340.short

Table of Contents

External libraries

  • zlib
  • HDF5 C libraries

Installation

The first part of psirc was implemented with Perl script, which can be run directly. The second part of psirc was implemented with C and C++.

    git clone https://github.com/Christina-hshi/psirc.git
    cd psirc
    cd psirc-quant
    #you may need to compile htslib under "ext/htslib" by following the README there ("make install" is optional)
    mkdir release
    cd release
    cmake ..
    make psirc-quant
    #the psirc-quant program can be found at "src/psirc-quant"
    make install (optional)

Requirements

Synopsis of requirements

The psirc pipeline has two parts. The first part is the psirc script (for detecting BSJs and inferring FLIs), currently psirc_v1.0.pl which is a Perl script that fully automates the production of the BSJ detection and FLI inference outputs from the above requirements. The second part is psirc-quant which quantifies the abundances of FLIs based upon RNA-seq data.

Input: paired-end RNA-seq reads
The RNA-seq reads should be sequenced from a library preparation strategy that retains circular RNAs (circRNAs), such as ribosomal RNA (rRNA) depletion or exome-capture RNA-seq. We only accept paired-end reads as single-end reads have inherent read density biases and false positive alignments, making them not recommended for circRNA detection2. The paired-end reads need to be in FASTQ format and can be gzipped.

custom_transcriptome_fa
A FASTA file that contains all annotated transcript sequences of a reference sequence, with each sequence having a custom header (indicating various positional, name, and strand information). Ready-to-use custom_transcriptome_fas are available for human and for Epstein-barr virus (EBV) since these were used in our study, although it is very easy to generate one for any well-annotated transcriptome by following the Generation of custom_transcriptome_fa section. The one we generated for human is called "gencode.v29.annotation.custom_transcriptome.fa" and will be used in the General usages section.

Forked kallisto
A forked version of kallisto v0.43.1, which was modified to allow multi-threading. This is the last version which outputs a SAM formatted pseudo-alignment to stdout, allowing the processing of the pseudo-alignment as it is being generated simultaneously, and is the reason why this version is used. The forked kallisto executable is available for Linux or Mac, or can be compiled from the source code. "kallisto" in the General usages section refers to this version of kallisto.

General usages

Index the custom_transcriptome_fa (need to be performed once only):

perl psirc_v1.0.pl -i gencode.v29.annotation.custom_transcriptome.fa kallisto

Produce both BSJ detection and FLI inference outputs in a single run (recommended):

perl psirc_v1.0.pl -f -o output_directory gencode.v29.annotation.custom_transcriptome.fa kallisto R1.fastq R2.fastq

Produce BSJ detection output only:

perl psirc_v1.0.pl -o output_directory gencode.v29.annotation.custom_transcriptome.fa kallisto R1.fastq R2.fastq

Produce FLI inference output from the result of BSJ detection output only:

perl psirc_v1.0.pl -s output_directory gencode.v29.annotation.custom_transcriptome.fa kallisto R1.fastq R2.fastq

where gencode.v29.annotation.custom_transcriptome.fa is the custom_transcriptome_fa, kallisto is the forked kallisto, output_directory is the user-specified directory to place the outputs, and R1.fastq R2.fastq are the input paired-end RNA-seq reads.

Index the inferred FLI

We require the header lines of the circular transcripts in fasta format should end with "\tC" to let the program know that they are circular transcripts. And header lines of linear transcripts should not end with "\tC". The outputs produced from psirc_v1.0.pl already meet this requirement, but the outputs produced by other FLI inference tools may not.

psirc-quant index [arguments] <FLI sequences>

Required argument:
  -i, --index=STRING          Filename for the index to be constructed

Optional argument:
  -k, --kmer-size=INT         k-mer (odd) length (default: 31, max value: 31)
      --make-unique           Replace repeated target names with unique names

Quantify FLI:

psirc-quant quant [arguments] R1.fastq R2.fastq

Required arguments:
  -i, --index=STRING            Filename for the index to be used for
                                quantification
  -o, --output-dir=STRING       Directory to write output to

Optional arguments:
      --bias                    Perform sequence based bias correction
  -b, --bootstrap-samples=INT   Number of bootstrap samples (default: 0)
      --seed=INT                Seed for the bootstrap sampling (default: 42)
      --plaintext               Output plaintext instead of HDF5
      --fusion                  Search for fusions for Pizzly
      --single                  Quantify single-end reads
      --single-overhang         Include reads where unobserved rest of fragment is
                                predicted to lie outside a transcript
      --fr-stranded             Strand specific reads, first read forward
      --rf-stranded             Strand specific reads, first read reverse
  -l, --fragment-length=DOUBLE  Estimated average fragment length
  -s, --sd=DOUBLE               Estimated standard deviation of fragment length
  -x, --min-fragment-length     Minimum length of a valid fragment
  -X, --max-fragment-length     Maximum length of a valid fragment
                                (default: -l, -s values are estimated from paired
                                 end data, but are required when using --single)
  -t, --threads=INT             Number of threads to use (default: 1)
      --pseudobam               Save pseudoalignments to transcriptome to BAM file
      --genomebam               Project pseudoalignments to genome sorted BAM file
  -g, --gtf                     GTF file for transcriptome information
                                (required for --genomebam)
  -c, --chromosomes             Tab separated file with chrosome names and lengths
                                (optional for --genomebam, but recommended)

Please note that the <min-fragment-length> and <max-fragment-length> options are crucial when identifying fragments supporting back-splicing junctions. We suggest you first try <fragment-length> +|- 3*<sd> as the min. and max. fragment length.

Synopsis of outputs

BSJ detection

BSJ transcript list (candidate_circ_junctions.bed)
A list of all the detected BSJ loci and their supporting transcripts (including which exons are back-spliced) in BED format. The first three columns (chr, start, end) are the BSJ loci. The fourth column begins with the BSJ supporting read count, then a ":" separator, followed by the BSJ transcripts of the loci. The read count is the sum of the reads crossing the BSJ and the last-first exons reads of the BSJ transcript. To indicate how many reads are in each category, the fifth column is a value between 0 and 1, crossing BSJ reads / (crossing BSJ reads + last-first exons reads), with 1 indicating all reads are crossing BSJ reads, and 0 indicating all reads are last-first exons reads.

BSJ transcript sequences (candidate_circ_junctions.fa)
A FASTA file containing the sequences of each detected BSJ transcript.

BSJ transcript supporting reads SAM file (candidate_circ_supporting_reads.sam)
All detected BSJ supporting reads mapped to their BSJ transcripts in SAM format. These are pseudo-alignments directly outputted from kallisto. This information is useful for certain analyses, such as visualizing the supporting reads mapped to the BSJ transcript on a genome browser.

FLI inference

FLI list (full_length_isoforms.tsv)
A list of all inferred full-length linear and circular isoforms. For circular isoforms, the information in the BSJ transcript list file is included again. Isoforms with alternative splicing have their alternatively spliced exon numbers and supporting read counts indicated. Isoforms with exactly the same sequence have their names merged with a "," separator.

FLI sequences (full_length_isoforms.fa)
A FASTA file containing the sequences of each inferred FLI. Circular isoforms have an additional tab delimiter preceding a "C" character at the end of the FASTA header line. This is the input file to psirc full-length circular and linear isoform quantification, psirc-quant.

FLI alternative forward-splicing junctions supporting reads SAM file (full_length_isoforms_alt_fsj_supporting_reads.sam) All detected alternatively spliced reads mapped to their full-length isoforms in SAM format. These are again pseudo-alignments directly outputted from kallisto.

FLI quantification

Abundances of FLIs (abundance.tsv and abundance.h5) Estimated abundances of FLIs in both tab delimited format("abundance.tsv") and HDF5 format("abundance.h5").

Generation of custom_transcriptome_fa

custom_transcriptome_fas are generated through a provided Perl script, create_custom_transcriptome_fa.pl. The only two information required by create_custom_transcriptome_fa.pl are the reference genome FASTA file and its annotation GTF file. The reference genome FASTA file is one fasta file containing all the interested reference sequences of the genome (e.g. for human, chrs 1-22, M, X, and Y, each under a separate header), and the annotation GTF file needs to include at least the ‘exon’ entries of the annotations. The necessary fields that need to be present in each ‘exon’ entry are seqname, start, end, strand, and "gene_name" and "exon_number" in attributes. For GTF files downloadable from official sources (Ensembl, GENCODE, etc.), the exon entries and the necessary fields are already all present.

Details of the FLI sequences output

The FASTA headers in the FLI sequences file contain the names (ID) of each isoform. The ID begins with the transcript name. The ID of each circular transcript isoform ends with the exon connection information in the form of "_EwBx{_yiAzi}*", where w and x indicate the 3' and 5' exons that define the BSJ, and each pair of yi and zi defines an alternative splicing junction that involve non-adjacent exons, and there can be zero, one or more of them.

There are hence four possible types of FLIs in this file: linear, circular, linear alternative, and circular alternative. These can be distinguished simply based on the exon connection information described above. Linear isoforms have no exon connection information at all, circular isoforms only have the "_EwBx" part, linear alternative isoforms only have the "{_yiAzi}*" part, and circular alternative isoforms have the entire exon connection information.

Details of the psirc-quant output

Estimated abundances of FLIs are output in both tab delimited format("abundance.tsv") and HDF5 format("abundance.h5"). Other useful information collected during running of the program can be found in file "run_info.json". psirc-quant quantifies linear and circular FLIs together by maximizing the likelihood of generating the observed sequencing reads from FLIs.

References

  1. Bray NL, Pimentel H, Melsted P, Pachter L. Near-optimal probabilistic RNA-seq quantification. Nat Biotechnol. 2016;34(5):525-7.
  2. Szabo L, Salzman J. Detecting circular RNAs: bioinformatic and experimental challenges. Nat Rev Genet. 2016;17(11):679-692.

License

First part of the psirc pipeline for BSJ detection and full-length isoform inference is distributed under the MIT license. Second part of the psirc pipeline for full-length isoform quantification is developed based upon kallisto, and distributed under the BSD 2-Clause license.

About

Full-length linear and circular transcript isoform reconstruction and quantification

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published