Brian Haas edited this page Jul 2, 2020 · 7 revisions

CTAT-Splicing Detection and Annotation of Aberrant Splicing Isoforms in Cancer

The CTAT-Splicing module is part of the Trinity Cancer Transcriptome Analysis Toolkit and operates to identify candidate cancer splicing aberrations.

Certain introns are more likely to be relevant to cancer biology, representing cancer-specific isoforms that may result from alternative splicing or stem from intra-gene genomic deletions. For example, EGFRvIII, EGFRvIVa, and EGFRvIVb are known oncogenic isoforms of the EGFR gene that are often found in glioblastomas; they result from intra-gene deletions of exons, which are then observed as skipped in the expressed isoforms. Another well-known example is the deletion of exon 14 in the MET gene, frequently found in lung cancers.

Other splicing patterns that are relevant to cancer biology are evident from comparing large transcriptome data sets of tumor and normal tissues, such as from the TCGA and GTEx projects, respectively. In our initial analysis of these data, we've identified ~24k introns that are enriched for splicing in tumor tissues as compared to normal tissues. The current version of our CTAT splicing database is available here.

Our 'cancer introns' are currently defined as:

  • enriched in TCGA samples as compared to GTEx samples (Fisher's Exact test, p < 0.05)
  • found in fewer than 5 GTEx normal samples
  • supported by at least 5 RNA-seq read alignments
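
As a rough illustration of how the first two criteria could be checked for a single intron, here is a minimal Python sketch. It is not the code used to build the CTAT splicing database. The one-sided form of the test, the pooling of samples into a single 2x2 table, and all function and parameter names are assumptions made for this example, and the sample totals in the usage lines are made up. The third criterion (at least 5 supporting reads) is assumed to have been applied already when counting the samples that contain the intron.

```python
from math import comb


def fisher_exact_greater(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Rows: tumor / normal samples. Columns: intron found / not found.
    Returns P(X >= a) under the hypergeometric null with fixed margins.
    """
    row1 = a + b            # total tumor samples
    col1 = a + c            # total samples containing the intron
    n = a + b + c + d       # all samples
    upper = min(row1, col1)
    tail = sum(comb(row1, k) * comb(n - row1, col1 - k) for k in range(a, upper + 1))
    return tail / comb(n, col1)


def is_candidate_cancer_intron(tcga_with: int, tcga_total: int,
                               gtex_with: int, gtex_total: int,
                               alpha: float = 0.05,
                               max_gtex_samples: int = 5) -> bool:
    """Apply the enrichment and GTEx-sample-count criteria (hypothetical helper)."""
    p_value = fisher_exact_greater(tcga_with, tcga_total - tcga_with,
                                   gtex_with, gtex_total - gtex_with)
    # Enriched in tumor samples AND seen in fewer than 5 normal samples.
    return p_value < alpha and gtex_with < max_gtex_samples


if __name__ == "__main__":
    # Illustrative numbers only: 28 of 169 tumor samples vs 0 of 1000 normal samples.
    print(is_candidate_cancer_intron(28, 169, 0, 1000))
    # Same tumor counts, but the intron is also seen in 10 normal samples.
    print(is_candidate_cancer_intron(28, 169, 10, 1000))
```

The exact test is written with `math.comb` only so the sketch has no third-party dependencies; a statistics library would be the more usual choice in practice.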

Given an RNA-Seq data set and STAR genome alignments, the CTAT-Splicing module identifies and annotates the occurrence of any of these introns and builds an interactive IGV-based report for navigating the evidence. An example igv-reports report highlighting the EGFRvIII oncogenic splicing variant is below:

[Image: example CTAT-Splicing IGV report highlighting EGFRvIII]

Right-click and download this example ctat-splicing.igv.html report to explore in your browser.

The CTAT-Splicing module is compatible and easily integrated with the CTAT-Mutations and CTAT-Fusions pipelines as part of the Trinity CTAT software ecosystem.

Installing CTAT-Splicing

Software Required

The CTAT-Splicing software can be obtained from the CTAT-Splicing Releases area.

Download the software and compile it by running 'make' in the software base directory.

Some additional python3 modules may be needed:

%  pip3 install igv-reports==1.0.1 requests

Alternatively, we have Docker and Singularity images available.

Data Resources Required

CTAT-Splicing is compatible with the CTAT genome libraries distributed for use with the CTAT-Fusion and CTAT-Mutations modules. The available options are listed at https://data.broadinstitute.org/Trinity/CTAT_RESOURCE_LIB/. Choose one; below we refer to it as 'CTAT_resource_lib.tar.gz'. The 'plug-n-play' libs are just that: download and unpack (tar -zxvf filename.tar.gz). If you need to use the source CTAT genome lib for some reason, instructions on building it are available, but it's easier to just download and use the larger plug-n-play library.

Once you have the CTAT genome lib installed, you can integrate the CTAT-Splicing data resource supplement. Download either the GRCh37 or GRCh38 'cancer_introns.*.tsv.gz' file, whichever matches the CTAT genome library being used.

Install this 'cancer_introns.*.tsv.gz' file into your CTAT genome lib using the following script included with the CTAT-Splicing software:

% ${CTAT_SPLICING_BASEDIR}/prep_genome_lib/ctat-splicing-lib-integration.py \
            --cancer_introns_tsv cancer_introns.*.tsv.gz \
            --genome_lib_dir /path/to/your/CTAT_genome_lib_build_dir

Once the above completes, you will be ready to run CTAT-Splicing.

Running CTAT-Splicing

CTAT-Splicing reports are based on the following outputs from STAR RNA-seq alignments against the human reference genome, as organized in the CTAT genome lib:

  • SJ.out.tab : introns identified and reported by STAR as supported by RNA-seq read alignments
  • Chimeric.out.junction : (optional) chimeric read alignments reported by STAR
  • Aligned.out.sorted.bam : (optional) the aligned reads in BAM format, required for visualization. Ideally the file is coordinate-sorted; if not, it will be coordinate-sorted as part of the CTAT-Splicing run.
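
To make the first input more concrete, the sketch below parses one SJ.out.tab line into the fields that later appear in the '.introns' output. The column layout follows the STAR manual (chromosome, intron start, intron end, strand code, motif, annotation flag, unique reads, multi-mapped reads, max overhang). That CTAT-Splicing builds its intron key as chrom:start-end directly from STAR's coordinates is an assumption here, as are the `Junction` type and function name; the motif, annotation, and overhang values in the example line are made up.

```python
from typing import NamedTuple

# STAR encodes strand in column 4 as 0 (undefined), 1 (+), or 2 (-).
STAR_STRAND = {"0": ".", "1": "+", "2": "-"}


class Junction(NamedTuple):
    intron: str        # chrom:start-end, 1-based first and last intron bases
    strand: str
    uniq_mapped: int   # column 7: uniquely mapped reads crossing the junction
    multi_mapped: int  # column 8: multi-mapped reads crossing the junction


def parse_sj_line(line: str) -> Junction:
    """Parse a single tab-delimited SJ.out.tab line (hypothetical helper)."""
    fields = line.rstrip("\n").split("\t")
    chrom, start, end, strand_code = fields[0:4]
    return Junction(
        intron=f"{chrom}:{start}-{end}",
        strand=STAR_STRAND[strand_code],
        uniq_mapped=int(fields[6]),
        multi_mapped=int(fields[7]),
    )


if __name__ == "__main__":
    example = "chr7\t55019366\t55142285\t1\t1\t1\t180\t0\t75\n"
    print(parse_sj_line(example))
```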

A typical full-featured invocation of CTAT-Splicing would be:

${CTAT_SPLICING_BASEDIR}/STAR_to_cancer_introns.py \
       --SJ_tab_file  SJ.out.tab \
       --chimJ_file Chimeric.out.junction \
       --vis \
       --bam_file Aligned.out.sorted.bam \
       --output_prefix my.samplename \
       --sample_name  my.samplename
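
When driving many samples from a script, it can help to assemble this command programmatically and include the optional inputs only when they exist. The following is a sketch under a few assumptions: the helper name and its parameters are hypothetical, it uses only the flags shown above, and it assumes that the script accepts a run without --chimJ_file or --vis/--bam_file because those inputs are described as optional.

```python
import os
import shlex
from typing import List, Optional


def build_ctat_splicing_cmd(sj_tab: str,
                            output_prefix: str,
                            sample_name: str,
                            chimj: Optional[str] = None,
                            bam: Optional[str] = None,
                            basedir: Optional[str] = None) -> List[str]:
    """Assemble the STAR_to_cancer_introns.py argument list (hypothetical helper)."""
    if basedir is None:
        basedir = os.environ.get("CTAT_SPLICING_BASEDIR", ".")
    cmd = [os.path.join(basedir, "STAR_to_cancer_introns.py"),
           "--SJ_tab_file", sj_tab]
    if chimj:
        cmd += ["--chimJ_file", chimj]
    if bam:
        # The BAM file is needed for visualization, so --vis is added only with it.
        cmd += ["--vis", "--bam_file", bam]
    cmd += ["--output_prefix", output_prefix, "--sample_name", sample_name]
    return cmd


if __name__ == "__main__":
    cmd = build_ctat_splicing_cmd("SJ.out.tab", "my.samplename", "my.samplename",
                                  chimj="Chimeric.out.junction",
                                  bam="Aligned.out.sorted.bam")
    print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` to launch the run.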

CTAT-Splicing outputs

Outputs consist of the following:

  • ${output_prefix}.introns : introns reported by STAR and annotated according to genes containing corresponding splice junctions. Formatted like so:
intron                 strand   genes                uniq_mapped  multi_mapped
chr7:55019366-55142285  +       EGFR^ENSG00000146648.14 180     0
chr7:55142438-55143304  +       EGFR^ENSG00000146648.14 171     0
chr7:55143489-55146605  +       EGFR^ENSG00000146648.14 228     0
...
  • ${output_prefix}.cancer.introns : the list of candidate 'cancer introns' - those introns found to be enriched in cancer transcriptome samples. This file is formatted like so:
intron                    strand  genes                    uniq_mapped  multi_mapped  TCGA_sample_counts                               GTEx_sample_counts  variant_name
chr7:55200414-55202516    +       EGFR^ENSG00000146648.14  114          0             GBM:1:0.59,LGG:1:0.19                            NA                  EGFRvIVb
chr7:55200414-55205255    +       EGFR^ENSG00000146648.14  114          0             GBM:6:3.55,LGG:2:0.38                            NA                  EGFRvIVa
chr7:116771655-116774880  +       MET^ENSG00000105976.13   114          0             LUAD:12:2.14,LUSC:2:0.37,LGG:1:0.19              NA                  METx14del
chr7:55019366-55155829    +       EGFR^ENSG00000146648.14  74           0             GBM:28:16.57,LGG:9:1.73,STAD:1:0.25,HNSC:1:0.18  NA                  EGFRvIII
...

The cancer introns are annotated according to the types of tumor or normal samples in which they were found. Each entry in the sample-count columns has the form type:number_of_samples:percent_of_samples.

For example, the cancer intron EGFRvIII was found in 28 TCGA glioblastoma (GBM) samples, which corresponds to 16.57% of those samples. Each occurrence requires that the intron met the requirements listed at the top of this page (minimum of 5 supporting read alignments, etc.).
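
For downstream analysis, the sample-count annotation can be split into structured records. This is a minimal sketch based only on the example rows shown above; the `SampleCount` type and function name are hypothetical, and treating 'NA' as "no samples" is an assumption.

```python
from typing import List, NamedTuple


class SampleCount(NamedTuple):
    sample_type: str    # e.g. a TCGA cohort code such as GBM
    num_samples: int    # number of samples containing the intron
    pct_samples: float  # percent of that sample type containing the intron


def parse_sample_counts(field: str) -> List[SampleCount]:
    """Parse a value like 'GBM:28:16.57,LGG:9:1.73' into records."""
    if field in ("", "NA"):
        return []
    records = []
    for entry in field.split(","):
        # Split from the right so only the last two ':' separate the numbers.
        sample_type, num, pct = entry.rsplit(":", 2)
        records.append(SampleCount(sample_type, int(num), float(pct)))
    return records


if __name__ == "__main__":
    for rec in parse_sample_counts("GBM:28:16.57,LGG:9:1.73,STAD:1:0.25,HNSC:1:0.18"):
        print(rec.sample_type, rec.num_samples, rec.pct_samples)
```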

Entries reported here must meet the minimum number of supporting read alignments (--min_total_reads, default 5).

  • ${output_prefix}.cancer.introns.prelim : contains all candidate cancer introns with at least 1 supporting read alignment. The 'cancer.introns' file above is this list filtered by --min_total_reads.

  • ${output_prefix}.ctat-splicing.igv.html : a self-contained interactive IGV-report in html format based on the cancer.introns report.

Additional .bed and .bam outputs are provided; they are used to generate the ctat-splicing.igv.html file above, and can also be loaded into desktop IGV or another genome browser for further exploration.
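
The relationship between the 'cancer.introns.prelim' and 'cancer.introns' files can be sketched as a simple read-count filter, which is handy for re-filtering the preliminary file at a different threshold without rerunning the pipeline. Two assumptions are made here and are not confirmed by this page: that the files are tab-delimited with the header row shown above, and that 'total reads' means uniq_mapped plus multi_mapped. The function name and the extra rows in the demo are made up.

```python
import csv
import io
from typing import Dict, Iterable, Iterator


def filter_by_total_reads(rows: Iterable[Dict[str, str]],
                          min_total_reads: int = 5) -> Iterator[Dict[str, str]]:
    """Yield rows whose read support meets the threshold (hypothetical helper).

    ASSUMPTION: total reads = uniq_mapped + multi_mapped.
    """
    for row in rows:
        total = int(row["uniq_mapped"]) + int(row["multi_mapped"])
        if total >= min_total_reads:
            yield row


if __name__ == "__main__":
    # Tiny in-memory stand-in for a .cancer.introns.prelim file (values are made up
    # except for the first row, which mirrors the EGFRvIII example above).
    prelim = io.StringIO(
        "intron\tstrand\tgenes\tuniq_mapped\tmulti_mapped\n"
        "chr7:55019366-55155829\t+\tEGFR^ENSG00000146648.14\t74\t0\n"
        "chrX:1000-2000\t+\tGENE_A\t3\t1\n"
        "chrX:3000-4000\t-\tGENE_B\t2\t3\n"
    )
    for kept in filter_by_total_reads(csv.DictReader(prelim, delimiter="\t")):
        print(kept["intron"], kept["uniq_mapped"], kept["multi_mapped"])
```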

Questions, Comments, Tech Support?

Contact us on our Google group: https://groups.google.com/forum/#!forum/trinity_ctat_users
