Brian Haas edited this page Jul 23, 2023 · 12 revisions

ctat-genome-lib-builder

The CTAT Genome Lib is a resource collection used by the Trinity Cancer Transcriptome Analysis Toolkit (CTAT). The ctat-genome-lib-builder system prepares a target genome and annotation set for use with Trinity CTAT tools, including fusion transcript detection and cancer mutation discovery. The build process creates a 'CTAT genome resource library'. The inputs required for building a human genome resource library, as well as pre-compiled CTAT human genome resource libs, are available at https://data.broadinstitute.org/Trinity/CTAT_RESOURCE_LIB/. See below if you want to create your own CTAT genome resource lib for human or for other organisms or genome targets.

Installing a CTAT Genome Lib Data Resource

The required data resources are available at https://data.broadinstitute.org/Trinity/CTAT_RESOURCE_LIB/ and include the human genome, Gencode annotations, and coding annotations in GTF format. Also included are precomputed BLAST+ results from an all-vs-all search of the transcript sequences, Pfam domains identified in human protein sequences, and human cancer fusion annotations, which we compile from multiple sources (see https://github.com/FusionAnnotator/CTAT_HumanFusionLib/releases).

Alignment utilities used by Trinity CTAT include STAR and GMAP. Be sure to have each installed and available for use via your PATH setting.
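Before unpacking anything, it can help to confirm that the aligners are actually visible on your PATH. The following is a minimal sketch, not part of CTAT: the helper name `check_tools` is hypothetical, and the executable names `STAR` and `gmap` are assumptions about how these packages are installed on your system.

```shell
# check_tools: hypothetical helper (not shipped with CTAT).
# Reports, for each named executable, whether it can be found on PATH.
# Returns 0 if all were found, 1 if any is missing.
check_tools() {
    local missing=0
    local tool
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool -> $(command -v "$tool")"
        else
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

Assuming the usual executable names, you would call it as `check_tools STAR gmap` and proceed only if it returns successfully.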

Then, unpack the data resources like so:

 tar xvf CTAT_resource_lib.tar.gz

Download a pre-compiled CTAT genome lib, if possible. The download is larger and takes longer, but it includes all processed data and saves you from having to run the time- and compute-intensive build process below.

If you download a 'data source' build, then you need to execute the genome lib build process like so:

 %  cd CTAT_resource_lib/


 ## For Human:
 %  ${ctat-genome-lib-builder-basedir}/prep_genome_lib.pl \
                     --genome_fa genome.primary.fa \
                     --gtf gencode.*.annotation.gtf \
                     --fusion_annot_lib CTAT_HumanFusionLib.dat.gz \
                     --annot_filter_rule AnnotFilterRule.pm \
                     --pfam_db current \
                     --dfam_db human \
                     --human_gencode_filter 

 ## For Mouse:
 %  ${ctat-genome-lib-builder-basedir}/prep_genome_lib.pl \
                     --genome_fa genome.primary.fa \
                     --gtf gencode.*.annotation.gtf \
                     --pfam_db current \
                     --dfam_db mouse

Note: replace 'genome.primary.fa' with the name of the primary genome assembly file in this release. The name differs depending on the source lib. For GRCh38, it would be 'GRCh38.primary_assembly.genome.fa'.
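The `--gtf gencode.*.annotation.gtf` argument in the commands above relies on shell globbing, so it only behaves as intended when exactly one file in the directory matches. A minimal sketch for checking that before launching a long build follows; `resolve_one` is a hypothetical helper, not part of the builder, and it assumes file names without spaces.

```shell
# resolve_one: hypothetical helper (not shipped with CTAT).
# Prints the single file matching a glob pattern; fails if zero or
# several files match. Assumes matching file names contain no whitespace.
resolve_one() {
    local pattern="$1"
    # Intentionally unquoted so that the shell expands the glob.
    set -- $pattern
    if [ "$#" -ne 1 ] || [ ! -e "$1" ]; then
        echo "expected exactly one match for '$pattern', got: $*" >&2
        return 1
    fi
    printf '%s\n' "$1"
}
```

For example, `gtf=$(resolve_one 'gencode.*.annotation.gtf') || exit 1` stores the one matching file name in `gtf`, which can then be passed to `--gtf`.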

If you're building a human CTAT genome lib, include the '--human_gencode_filter' parameter, which will construct and incorporate immunoglobulin super-loci to facilitate identification of IGH and IGL fusions. Also, be sure to specify '--dfam_db human'.

A list of modifications to the human reference annotation and genome sequence for use with STAR-Fusion is provided.

Once the build is complete, refer to the above resource directory (specifically, the ctat_genome_lib_build_dir/ subdirectory) via the '--genome_lib_dir' parameter of the CTAT utility to be executed, or set the environment variable 'CTAT_GENOME_LIB' to that path so that it is auto-recognized by CTAT tools (where indicated as available).
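Setting the environment variable could look something like the sketch below. The helper name `set_ctat_genome_lib` is hypothetical (it is not a CTAT command), and the path in the usage note is a placeholder for wherever your build actually lives.

```shell
# set_ctat_genome_lib: hypothetical helper (not shipped with CTAT).
# Exports CTAT_GENOME_LIB as an absolute path, but only if the given
# directory exists, so a mistyped path is caught immediately.
set_ctat_genome_lib() {
    local lib_dir="$1"
    if [ ! -d "$lib_dir" ]; then
        echo "not a directory: $lib_dir" >&2
        return 1
    fi
    # Store an absolute path so the setting survives later 'cd' commands.
    CTAT_GENOME_LIB="$(cd "$lib_dir" >/dev/null && pwd)"
    export CTAT_GENOME_LIB
}
```

For example, `set_ctat_genome_lib /path/to/CTAT_resource_lib/ctat_genome_lib_build_dir` (placeholder path). Adding the equivalent `export` line to your shell startup file makes the setting persistent across sessions.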

Note that the above builder has a number of additional software requirements, including BLAST and HMMER, among others. Using our Docker or Singularity images for this step is the easiest and preferred route. For example, if you have Singularity installed, you can use the Singularity image we provide on our release downloads page and run it like so:

# For Human
% singularity exec -e star-fusion.simg \
   /usr/local/src/STAR-Fusion/ctat-genome-lib-builder/prep_genome_lib.pl \
      --genome_fa genome.primary.fa \
      --gtf gencode.*.annotation.gtf \
      --fusion_annot_lib fusion_lib.*.dat.gz \
      --annot_filter_rule AnnotFilterRule.pm \
      --pfam_db current \
      --dfam_db human \
      --human_gencode_filter 

# For Mouse
% singularity exec -e star-fusion.simg \
   /usr/local/src/STAR-Fusion/ctat-genome-lib-builder/prep_genome_lib.pl \
      --genome_fa genome.primary.fa \
      --gtf gencode.*.annotation.gtf \
      --pfam_db current \
      --dfam_db mouse 

Building a Custom Genome Resource Library for Fusion Detection

DIY human or mouse ctat genome libs

If you'd like to build a new Gencode-based CTAT genome library, for example because a new genome annotation set is available or because you need to use an older annotation set, you can do so by downloading a different Gencode GTF file for human or mouse. You can use the primary genome file that we provide in our source CTAT genome libs (the source libs, not the plug-n-play ones) with your annotation GTF file, and then run the prep script:

 # For Human
 %  ${ctat-genome-lib-builder-basedir}/prep_genome_lib.pl \
                         --genome_fa ref_genome.fa \
                         --gtf your_new_annotation.gtf \
                         --fusion_annot_lib CTAT_HumanFusionLib.dat.gz \
                         --annot_filter_rule AnnotFilterRule.pm \
                         --pfam_db current \
                         --dfam_db human \
                         --human_gencode_filter 

 # For Mouse
 %  ${ctat-genome-lib-builder-basedir}/prep_genome_lib.pl \
                         --genome_fa ref_genome.fa \
                         --gtf your_new_annotation.gtf \
                         --pfam_db current \
                         --dfam_db mouse

For Human, grab the AnnotFilterRule.pm and the CTAT_HumanFusionLib.dat.gz files from the most current CTAT source lib. They're identical in the GRCh37 and GRCh38 source libs, so take them from either.

DIY other than human or mouse...

If you're building a database for a target that is neither human nor mouse, provide the full Dfam.hmm database file as the parameter to --dfam_db. Otherwise, the command looks like the mouse build:

  ## For targets other than human or mouse:
  %  ${ctat-genome-lib-builder-basedir}/prep_genome_lib.pl \
                         --genome_fa ref_genome.fa \
                         --gtf your_annotation.gtf \
                         --pfam_db current \
                         --dfam_db /path/to/Dfam.hmm  

Some users have reported problems when using the full Dfam.hmm library. You might use an organism-specific one instead, choosing one that is a sufficiently close match to your target organism.

User support

Contact us on our Google group: https://groups.google.com/forum/#!forum/trinity_ctat_users