Build reference files required for genomic analysis from a gzipped fasta file and a gff file
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Type Name Latest commit message Commit time
Failed to load latest commit information.

Reference data builder

This is a Hive/LSF pipeline to build reference files required for genomic analysis from fasta and gtf files. It exists because we expect to deal with many types of genomic and epigenomic data across multiple species. We need to ensure that a standard set of reference files are available in a consistent manner across all species.

The data is organised in a directory structure like this

  • root output directory
    • species name
      • assembly name
        • annotation
          • annotation name
            • indexes
              • program name
        • genome_fasta
        • genome_index
        • indexes
          • program name

For assemblies, the pipeline requires a fasta file. The pipeline provides

  • a gzipped copy of the fasta file
  • a fai file (for use with samtools faidx)
  • a dict file for the fasta (from picard)
  • a list of chromosomes and their sizes (for use with various tools, e.g. bedToBigBed)
  • indexes for bwa, bowtie, bowtie2 and bismark
  • TODO mappability files and stats

For gene annotations, the pipeline requires a GTF file. Each annotation must be based on an assembly. The pipeline provides.

  • a gzipped copy of the gtf file
  • ref_flat and rRNA interval files (for use with picard)
  • indexes for STAR and RSEM.


You will need an installation of ensembl-hive (2.2), and a compute cluster using LSF. It should be possible to use an alternative system by updating the resource_classes listed in Bio::RefBuild::PipeConfig::RefBuilderConf. Hive needs a mysql database to track the pipeline.

The pipeline requires several external programs. The version used in development is listed:

In addition, the pipeline expects to run under linux and makes use of standard linux tools:

  • cp
  • cut
  • find
  • gunzip
  • rm
  • sed
  • tee

Download the reference_data_builder code, install the perl dependencies listed in the cpanfile and add the lib/ dir to your PERL5LIB environment variable.

Setup the pipeline

Use hive's init pipeline to create a pipeline database: Bio::RefBuild::PipeConfig::RefBuilderConf \
  -host <dbhost> -port <dbport> -user <dbuser> -password <dbpassword> \
  -bedtools /path/to/bedtools \
  -bowtie1_dir /path/to//bowtie-1 \
  -bowtie2_dir /path/to/bowtie2-2 \
  -bwa /path/to/bwa-0.7.5a/bwa \
  -java /path/to/java \
  -picard /path/to/picard.jar \
  -samtools /path/to/samtools \
  -bgzip /path/to/tabix/bgzip \
  -star /path/to/STAR/bin/Linux_x86_64/STAR \
  -gtfToGenePred /path/to/gtfToGenePred \
  -rsem_dir /path/to/rsem \
  -bismark_dir /path/to/bismark \
  -bedGraphToBigWig /path/to/bedGraphToBigWig \
  -wiggletools /path/to/wiggletools \
  -cram_seq_cache_populate_script /path/to/seq\_cache\ \
  -lsf_queue_name a_queue_name \
  -cram_cache_root /path/to/cram/cache/root
  -output_root /path/to/output_dir

You may also specfiy -lsf_std_param to add additional LSF parameters for all jobs. For example, we use -lsf_std_param '-R"select[lustre]"' to ensure that all nodes can see the required storage area. The CRAM reference cache is explained on the HTSlib site. Set cram_cache_root to the directory used in the REF_CACHE environment variable.

The output of this will include the database URL required for later steps, e.g. mysql://dbuser:dbpassword@dbhost:dbport/myuser_ref_builder


You can choose to run everything at once, or run each major step individually. The major steps are:

  1. Create assembly specific resources
  2. Create one or more annotation specific resources
  3. Create mappability tracks for a range of kmer/read lengths
  4. Create a manifest file describing all files produced

Assembly specific resources

Seed the pipeline with the species and assembly names, the path to the gzipped fasta file, and a source URL of the fasta file (this will be used in the sequence dictionary). The fasta file should have the chromosomes in the order you wish to use for your output. The description lines can contain as much information as you wish, but the sequence name (e.g. chr1) should be the first thing after the > character, and should be separated from any additional information by a space (i.e. '>1 dna:chromosome' is OK, '>1**|**dna:chromosome' is not). -url mysql://dbuser:dbpassword@dbhost:dbport/myuser_ref_builder \
  -logic_name start_assembly -input_id \
  "{assembly_name => 'galgal4', species_name => 'Gallus gallus', fasta_uri => '', fasta_file => 'Gallus_gallus.Galgal4.dna.toplevel.fa.gz', }"

Run the pipeline with hive's beekeeper. This will take some time, it is best to run it under GNU Screen or tmux: -url mysql://dbuser:dbpassword@dbhost:dbport/myuser_ref_builder -loop

Gene set specific resources

Once the assembly has been processed, you can add gene set annotations to it. These should be in the form of a gzipped gtf file. The annotation pipeline relies on the output of the assembly steps, and should be based on the same assembly. -url mysql://dbuser:dbpassword@dbhost:dbport/myuser_ref_builder \
  -logic_name start_assembly -input_id \
  "{assembly_name => 'galgal4', species_name => 'Gallus gallus', gtf_file => 'Gallus_gallus.Galgal4.80.gtf.gz', annotation_name => 'e80', }" -url mysql://dbuser:dbpassword@dbhost:dbport/myuser_ref_builder -loop

The pipeline has been most thoroughly tested with Ensembl GTF files. We are also testing it with RefSeq annotation for goat, converted to GTF using the script scripts/ For simplicities sake, we reduce the GTF to just exons with gene_id and transcript_id attributes for use with RSEM and STAR, since these are the only features they can use.

Write manifest

It can be useful to have a manifest file, listing the size and a checksum value for each file. -url mysql://dbuser:dbpassword@dbhost:dbport/myuser_ref_builder \
  -logic_name start_manifest -input_id \
  "{assembly_name => 'galgal4', species_name => 'Gallus gallus',}" -url mysql://dbuser:dbpassword@dbhost:dbport/myuser_ref_builder -loop

Run everything at once

For ease of use, you can seed all steps in one go: -url mysql://dbuser:dbpassword@dbhost:dbport/myuser_ref_builder \
  -logic_name start_all -input_id \
  "{assembly_name=>'Galgal4',fasta_uri=>'',fasta_file=>'Gallus_gallus.Galgal4.dna.toplevel.fa.gz',species_name=>'Gallus gallus',kmer_sizes=>'50,100,150,200',annotation_name=>'e80',gtf_file=>'Gallus_gallus.Galgal4.80.gtf.gz'}" -url mysql://dbuser:dbpassword@dbhost:dbport/myuser_ref_builder -loop