Shao-Group/TERRACE

Introduction

TERRACE is a circRNA assembler for paired-end RNA-seq data.

Installation

TERRACE can be installed via conda. To build it from source code instead, please follow the instructions in INSTALL.

Usage

The usage of TERRACE is:

terrace -i <input.bam> -o <output.gtf> -fa <reference-genome.fa> --read_length <length-of-paired-end-reads> -r [reference_annotation.gtf] -fe [feature_file] [options]

input.bam is the read alignment file generated by an RNA-seq aligner (for example, STAR or HISAT2). Make sure that it is sorted; otherwise sort it with samtools:

samtools sort input.bam > input.sort.bam
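Whether a BAM file still needs sorting can usually be read from its header. The following is a minimal sketch, not part of TERRACE: it assumes that "sorted" means coordinate-sorted (the default order produced by `samtools sort`) and that the aligner recorded the sort order in the `@HD` header line. The function names `is_coordinate_sorted` and `sort_if_needed` are hypothetical helpers introduced for illustration.

```shell
#!/usr/bin/env bash
# Sketch: decide whether a BAM file needs sorting by inspecting its @HD header line.
# Assumption: "sorted" means coordinate-sorted, the default of `samtools sort`.

# Reads SAM header text on stdin; succeeds only if @HD declares SO:coordinate.
is_coordinate_sorted() {
    grep -q -E '^@HD.*[[:space:]]SO:coordinate([[:space:]]|$)'
}

# Hypothetical helper: prints the path of a coordinate-sorted BAM,
# sorting into <name>.sort.bam only when the header does not already declare it.
sort_if_needed() {
    local bam="$1"
    local sorted="${bam%.bam}.sort.bam"
    if samtools view -H "$bam" | is_coordinate_sorted; then
        echo "$bam"
    else
        samtools sort -o "$sorted" "$bam" && echo "$sorted"
    fi
}
```

For example, `sort_if_needed input.bam` prints `input.bam` if the header already declares coordinate order, and otherwise writes and prints `input.sort.bam`. A header without an `SO` field is treated as unsorted, which errs on the side of sorting again.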

The reconstructed circular transcripts are written to output.gtf in GTF format. Detailed documentation of the GTF format is available from Ensembl.

reference-genome.fa is the reference genome file in FASTA format. The GENCODE GRCh37 or GRCh38 genome is recommended.

length-of-paired-end-reads is the length of the reads used to produce the alignment file.

reference_annotation.gtf is the annotation file in GTF format. This parameter is optional.

feature_file is a CSV file generated by TERRACE that contains the circRNA features used by a machine learning model to assign confidence scores. This parameter is optional. For detailed usage of this file, see the Scoring section below.

TERRACE supports the following parameters. Please refer to the additional explanations below the table.

| Parameter | Default Value | Description |
| --- | --- | --- |
| --help | | print usage of TERRACE and exit |
| --version | | print version of TERRACE and exit |
| --preview | | show the inferred library_type and exit |
| --library_type | empty | chosen from {empty, unstranded, first, second} |

Providing --library_type is highly recommended. The values unstranded, first, and second correspond to fr-unstranded, fr-firststrand, and fr-secondstrand, as used in standard Illumina sequencing libraries. If none of them is given (the default, empty), TERRACE will try to infer the library_type by itself. Note that this inference is based on the XS tag stored in the input BAM file. If the input BAM file does not contain the XS tag, it is essential to provide the library_type to TERRACE. You can use --preview to see the inferred library_type.
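To see whether an alignment file carries the XS tag at all, the records can be scanned directly. The sketch below is an illustration, not a TERRACE feature. It assumes the tag in question is the strand attribute `XS:A:+` or `XS:A:-` (the form written by HISAT2, and by STAR when run with `--outSAMstrandField intronMotif`), as opposed to the integer `XS:i` tag that some DNA aligners use for a different purpose. `count_xs_tags` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: count alignment records that carry an XS:A strand tag.
# Assumption: the XS tag used for library-type inference is the XS:A:+/- strand attribute.

# Reads SAM text on stdin; prints the number of records with an XS:A tag.
count_xs_tags() {
    awk -F '\t' '
        /^@/ { next }                       # skip header lines
        {
            # optional tags start at column 12 of a SAM record
            for (i = 12; i <= NF; i++)
                if ($i ~ /^XS:A:[+-]$/) { n++; break }
        }
        END { print n + 0 }                 # print 0 when nothing matched
    '
}
```

A typical call would be `samtools view input.bam | head -n 100000 | count_xs_tags`. Only spliced alignments usually receive this tag, so a small nonzero count is expected; a count of zero suggests that --library_type should be given explicitly.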

Running TERRACE on a small example

A small example of input data example-input.bam is available in the example directory.

Assuming TERRACE has been installed following the steps in the Installation section, the executable file terrace is located at src/terrace.

Use the following commands to enter the example directory and run TERRACE with example-input.bam as input:

cd ./example
../src/terrace -i example-input.bam -o example-output.gtf --read_length 150

An output file named example-output.gtf will appear in the example directory. The output file stores the reconstructed circular transcripts assembled by TERRACE in GTF format.

Scoring

By default, the score field of the output.gtf generated by TERRACE contains abundance values. We provide a pre-trained Random Forest model that generates more reliable scores (between 0 and 1) and writes them into the score field of the GTF file. After the scores are integrated, a user-defined threshold can be provided to generate a supplementary precise.gtf file that contains only the circRNAs with scores above the given threshold. Please refer to RF-scoring/README for details on score generation, score integration, and the precise.gtf file.
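The thresholding idea can be illustrated with a few lines of awk. This is only a sketch of the concept and is not the script shipped in RF-scoring; use the tooling described in RF-scoring/README for real analyses. It assumes that the score sits in column 6 (the standard GTF score column), that each circRNA starts with a `transcript` record, and that the exon records of a circRNA directly follow that record. `filter_gtf_by_score` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: keep circRNAs whose score exceeds a threshold.
# Assumptions: score is GTF column 6; exon records directly follow their transcript record.

# Usage: filter_gtf_by_score <threshold> < scored.gtf > filtered.gtf
filter_gtf_by_score() {
    local threshold="$1"
    awk -F '\t' -v t="$threshold" '
        /^#/ { next }                        # drop comment lines
        # a transcript record decides whether its whole block is kept
        $3 == "transcript" { keep = ($6 != "." && $6 + 0 > t + 0) }
        keep { print }
    '
}
```

For example, `filter_gtf_by_score 0.5 < example-output.gtf` prints only the circRNAs whose score is strictly greater than 0.5, matching the "above the given threshold" wording. Records with a missing score (`.`) are dropped.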

To use the scoring functionality, TERRACE needs to be run so that it generates a feature file, as follows.

cd ./example
../src/terrace -i example-input.bam -o example-output.gtf --read_length 150 -fe feature_file

An output file named example-output.gtf and a feature file named feature_file will appear in the example directory. The output file stores the reconstructed circular transcripts assembled by TERRACE in GTF format. The feature file stores the features of the output circRNAs needed for score generation.