Last Updated: 06/02/2019
2019.06.02 updated to version 2.7. Fixed GMAP option bug + added distance to closest annotated start/end for the gene (not ref isoform) and filtering afterwards.
2019.05.02 updated to version 2.6. Added deSALT aligner support using
2019.03.18 minor typo fixed for version 2.5. updated doc for
2019.03.10 updated to version 2.5. Fixed
sqanti_filter2.py missing fusion category also using polyA_motif as part of filtering.
2019.02.27 updated to version 2.4. Added polyA motif finding.
2019.02.27 updated to version 2.3.
junction_category fixed to check for (ss5,ss3) pairs in provided GTF.
2019.02.26 updated to version 2.2. added support for CAGE peak (FANTOM5) and Intropolis junction BED.
2018.10.15 updated to version 1.1. modified use of SAM to GFF with added
Private repo for Liz's modified SQANTI. The original SQANTI is by Ana Conesa lab.
Until the SQANTI authors have finalized their agreement with Liz on how to integrate Liz's changes, the script names are modified to reflect this as a temporary working version.
sqanti_qc.py is named currently
- Python (2.7)
- R (>= 3.4.0)
- R packages (for
sqanti_qc2.py): ggplot2, scales, reshape, gridExtra, grid, dplyr
Installing Python dependencies
I recommend using Anaconda which makes installing all the Python packages much easier. If you already have Anaconda installed because you use Iso-Seq3 or cDNA_Cupcake, you can activate your current environment and directly go to step (4).
(1) Here's the generic Anaconda installation for Linux environment. Currently only Linux environment supported.
bash ~/Downloads/Anaconda2-5.2.0-Linux-x86_64.sh export PATH=$HOME/anaconda5.2/bin:$PATH conda -V conda update conda
(2) Create a virutal environment. I will call it
y to agree to the interactive questions.
conda create -n anaCogent5.2 python=2.7 anaconda source activate anaCogent5.2
(3) Once you have activated the virtualenv, you should see your prompt changing to something like this:
(4) Install additional required libraries:
conda install -n anaCogent5.2 -c bioconda pysam conda install -n anaCogent5.2 psutil conda install -n anaCogent5.2 biopython conda install -n anaCogent5.2 -c http://conda.anaconda.org/cgat bx-python conda install -n anaCogent5.2 -c bioconda bcbiogff
If you don't already have cDNA_Cupcake installed, you can do that now:
$ git clone https://github.com/Magdoll/cDNA_Cupcake.git $ cd cDNA_Cupcake $ python setup.py build $ python setup.py install
No installation for SQANTI2 itself is required. The scripts can be run directly.
Activate the Anaconda environment. Make sure minimap2 works. Add
$ source activate anaCogent5.2 (anaCogent5.2)-bash-4.1$ export PYTHONPATH=$PYTHONPATH:<path_to>/cDNA_Cupcake/sequence/ (anaCogent5.2)-bash-4.1$ minimap2 --version 2.15-r905
Input to SQANTI QC
- Iso-Seq output. Preferably already mapped to the genome and collapsed to unique transcripts. (FASTA/FASTQ)
- Reference annotation in GTF format. For example GENCODE or CHESS.
- Reference genome, in FASTA format. For example hg38. Make sure your annotation GTF is based on the correct ref genome version!
Intropolis Junction BED file. I've provided a version of Intropolis for hg38 genome, modified into STAR junction format.
polyA motif list. A ranked list of polyA motifs to find upstream of the 3' end site. See human polyA list for an example.
Running SQANTI QC
The script usage is:
python sqanti_qc2.py [-t cpus] [--skipORF] [-c shortread_STAR_junction_out] [--cage_peak CAGE_PEAK_BED] [--polyA_motif_list POLYA_LIST] [--aligner_choice=minimap2,deSALT] <input_fasta> <annotation_gtf> <genome_fasta>
If you don't feel like running the ORF prediction part, use
--skipORF. Just know that all your transcripts will be annotated as non-coding.
If you have short read data, you can run STAR to get the junction file (usually called
SJ.out.tab, see STAR manual) and supply it to SQANTI2.
--aligner_choice=minimap2, the minimap2 parameter used currently is:
minimap2 -ax splice --secondary=no -C5 -O6,24 -B4 -uf
--aligner_choice=deSALT, the deSALT parameter used currently is:
deSALT aln -x ccs.
You can look at the
sqanti_qc2.py for the full command format.
python sqanti_qc2.py -t 30 example/touse.rep.fasta gencode.v29.annotation.gtf hg38_noalt.fa \ --cage_peak hg38.cage_peak_phase1and2combined_coord.bed \ --coverage Public_Intronpolis/intropolis.v1.hg19_with_liftover_to_hg38.tsv.modified
If you have multiple bed files, you can use file patterns:
python sqanti_qc2.py -t 30 example/touse.rep.fasta gencode.v29.annotation.gtf hg38_noalt.fa \ --cage_peak hg38.cage_peak_phase1and2combined_coord.bed \ --coverage "JunctionBeds/samples.*.junctions.bed"
SQANTI QC output
You can look at the example subfolder for a sample output. The PDF file shows all the figures drawn using R script SQANTI_report2.R, taking the
_junctions.txt as the two input. If you know R well, you are free to modify the R script to add new figures! I will be constantly adding new figures as well.
Detailed explanation of
Filtering Isoforms using SQANTI
I've made a lightweight filtering script based on SQANTI2 output that filters for two things: (a) intra-priming and (b) short read junction support.
The script usage is:
python sqanti_filter2.py <classification_txt> <input_fasta> <input_sam> [-a INTRAPRIMING] [-c MIN_COV] [-m MAX_DIST_TO_KNOWN_END]
-a determines the fraction of genomic 'A's above which the isoform will be filtered. The default is
-m sets the maximum distance to an annotated 3' end (the
diff_to_gene_TTS field in classification output) to offset the intrapriming rule.
-c is the filter for the minimum short read junction support (looking at the
min_cov field in
.classification.txt), and can only be used if you have short read data.
python sqanti_filter2.py touse.rep_classification.txt \ touse.rep.renamed_corrected.fasta \ touse.rep.renamed_corrected.sam
The current filtering rules are as follow:
- If a transcript is FSM, ISM, or NIC, then it is kept unless the 3' end is unreliable.
- If a transcript is NNC, genic, genomic, then it is kept only if all of below are true: (1) 3' end is reliable (2) does not have a junction that is labeled as RTSwitching (3) all junctions are either canonical or has short read coverage above
A transcript 3' end is unreliable if both of the following are true: the genomic 'A' content is above the
-a threshold and the distance to the known 3' end exceeeds the
SQANTI2 Output Explanation
SQANTI/SQANTI2 categorizes each isoform by finding the best matching reference transcript in the following order:
FSM (Full Splice Match): meaning the reference and query isoform have the same number of exons and each internal junction agree. The exact 5' start and 3' end can differ by any amount.
ISM (Incomplete Splice Match): the query isoform has 5' exons than the reference, but each internal junction agree. The exact 5' start and 3' end can differ by any amount.
NIC (Novel In Catalog): the query isoform does not have a FSM or ISM match, but is using a combination of known donor/acceptor sites.
NNC (Novel Not in Catalog): the query isoform does not have a FSM or ISM match, and has at least one donor or acceptor site that is not annotated.
Antisense: the query isoform does not have overlap a same-strand reference gene but is anti-sense to an annotated gene.
Genic Intron: the query isoform is completely contained within an annotated intron.
Genic Genomic: the query isoform overlaps with introns and exons.
Intergenic: the query isoform is in the intergenic region.
Classification Output Explanation
.classification.txt has the following fields:
isoform: the isoform ID. Usually in
length: isoform length.
exons: number of exons.
structural_category: one of the categories ["full-splice_match", "incomplete-splice_match", "novel_in_catalog", "novel_not_in_catalog", "genic", "antisense", "fusion", "intergenic", "genic_intron"]
associated_gene: the reference gene name.
associated_transcript: the reference transcript name.
ref_length: reference transcript length.
ref_exons: reference transcript number of exons.
diff_to_TSS: distance of query isoform 5' start to reference transcript start end. Negative value means query starts downstream of reference.
diff_to_TTS: distance of query isoform 3' end to reference annotated end site. Negative value means query ends upstream of reference.
diff_to_gene_TSS: distance of query isoform 5' start to the closest start end of any transcripts of the matching gene. This field is different from
diff_to_TSSsince it's looking at all annotated starts of a gene. Negative value means query starts downstream of reference.
diff_to_gene_TTS: distance of query isoform 3' end to the closest end of any transcripts of the matching gene. Negative value means query ends upstream of reference.
subcategory: A/B/C. Ignore for now.
RTS_stage: TRUE if one of the junctions could be a RT switching artifact.
all_canonical: TRUE if all junctions have canonical splice sites.
min_sample_cov: sample with minimum coverage.
min_cov: minimum junction coverage based on short read STAR junction output file. NA if no short read given.
min_cov_pos: the junction that had the fewest coverage. NA if no short read data given.
sd_cov: standard deviation of junction coverage counts from short read data. NA if no short read data given.
FL: currently always NA. I will add back this information later.
n_indels: total number of indels based on alignment.
n_indels_junc: number of junctions in this isoform that have alignment indels near the junction site (indicating potentially unreliable junctions).
bite: TRUE if all junctions match reference junctions completely.
iso_exp: currently always NA. I will add back this information later.
gene_exp: currently always NA. I will add back this information later.
ratio_exp: currently always NA. I will add back this information later.
FSM_class: ignore this field for now.
ORF_length: predicted ORF length.
CDS_length: predicted CDS length.
CDS_start: CDS start.
CDS_end: CDS end.
perc_A_downstreamTTS: percent of genomic "A"s in the downstream 20 bp window. If this number if high (say > 0.8), the 3' end site of this isoform is probably not reliable.
dist_peak: distance to closest TSS based on CAGE Peak data. Negative means upstream of TSS and positive means downstream of TSS. Strand-specific. SQANTI2 only searches for nearby CAGE Peaks within 10000 bp of the PacBio transcript start site. Will be
NAif none are found within 10000 bp.
within_peak: TRUE if the PacBio transcript start site is within a CAGE Peak.
--polyA_motif_listis given, shows the top ranking polyA motif found within 50 bp upstream of end.
--polyA_motif_listis given, shows the location of the last base of the hexamer. Position 0 is the putative poly(A) site. This distance is hence always negative because it is upstream.
Junction Output Explanation
.junctions.txt file shows every junction for every PB isoform. NOTE because of this the same junction might appear multiple times if they are shared by multiple PB isoforms.
isoform: Isoform ID
junction_number: The i-th junction of the isoform
genomic_start_coord: Start of the junction (1-based), note that if on - strand, this would be the junction acceptor site instead.
genomic_end_coord: End of the junction (1-based), note that if on - strand, this would be the junction donor site instead.
transcript_coord: Currently not implemented. Ignore.
knownif the (donor-acceptor) combination is annotated in the GTF file,
novelotherwise. Note that it is possible to have a
noveljunction even though both the donor and acceptor site are known, since the combination might be novel.
knownif the junction start site is annotated. If on - strand, this is actually the donor site.
knownif the junction end site is annotated. If on - strand, this is actually the acceptor site.
diff_to_Ref_start_site: distance to closest annotated junction start site. If on - strand, this is actually the donor site.
diff_to_Ref_end_site: distance to closest annotated junction end site. If on - strand, this is actually the acceptor site.
bite_junction: TRUE if either or both the junction start/end site matches annotation.
splice_site: Splice motif.
RTS_junction: TRUE if junction is predicted to a template switching artifact.
indel_near_junct: TRUE if there is alignment indel error near the junction site, indicating potential junction incorrectness.
--coverage(short read junction coverage info) is provided, shows the number of samples (cov files) that have short read that support this junction.
total_coverage: Total number of short read support from all samples that cover this junction.