

This repo provides tools for:

  1. Mappability file generation
  2. Signal track production

Mappability file generation

Create mappability tracks for a reference genome, at specific kmer lengths. Mappability is based on unique, exact matches. It is inspired by the mappability pipeline created by Anshul Kundaje for align2rawsignal.
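
As a rough illustration of what "unique, exact matches" means, the sketch below marks the start positions in a short sequence whose kmer occurs exactly once. It is a toy, not the pipeline's method (the pipeline aligns kmers with bowtie): the helper name unique_kmer_starts is hypothetical, and the sketch assumes a single sequence and looks at the forward strand only.

```shell
# Toy illustration only: print the 1-based start positions whose kmer occurs
# exactly once in the given sequence (forward strand only).
# unique_kmer_starts is a hypothetical helper, not part of this repo.
unique_kmer_starts() {
    # $1 = sequence, $2 = kmer length
    awk -v seq="$1" -v k="$2" 'BEGIN {
        n = length(seq) - k + 1
        # First pass: count every kmer in the sequence.
        for (i = 1; i <= n; i++) count[substr(seq, i, k)]++
        # Second pass: report start positions of kmers seen exactly once.
        for (i = 1; i <= n; i++)
            if (count[substr(seq, i, k)] == 1) print i
    }'
}

unique_kmer_starts "ACGTACGTTT" 4
```

In this example the kmer ACGT starts at positions 1 and 5, so those two positions are not uniquely mappable at length 4 and the function prints 2, 3, 4, 6 and 7.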


Requirements

  1. ensembl-hive, tested with version 2.2
  2. A database for use with hive (developed with MySQL 5.5.27, but should work with anything hive supports)
  3. A cluster workload management system supported by hive. This pipeline was developed with Platform LSF 8.0.1.
  4. Perl (tested with 5.12.5)
  5. bowtie, tested with version 1.1.1. The pipeline is not compatible with bowtie2.
  6. samtools, tested with version 1.2
  7. bedtools, tested with version 2.22.0
  8. bedGraphToBigWig from the Kent src tree
  9. Additional perl module dependencies: please see cpanfile in the root of the repository. If you have cpanm, you can install them with cpanm --installdeps .
  10. A reference genome, split into fasta files. One chromosome/contig per file works well, as each file is loaded into memory by default. The eHive pipeline is tested to work with chromosomes up to the size of human chr1.
  11. A bowtie index file for the reference genome
  12. A chrom sizes file for use with bedGraphToBigWig: a 2-column, tab-separated text file listing the reference sequence names and lengths.
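
If you do not already have a chrom sizes file, one way to produce it is to measure the sequences in your fasta files directly. The sketch below is one possible approach, not something shipped with this repo: fasta_chrom_sizes is a hypothetical helper name, and it assumes the sequence name is the first whitespace-delimited word of each header line. Check that the names it prints match the names in your bowtie index.

```shell
# Print "name<TAB>length" for every sequence in the given fasta files.
# fasta_chrom_sizes is a hypothetical helper name, not part of this repo.
fasta_chrom_sizes() {
    awk '
        /^>/ {
            # A new header: emit the previous sequence, if any.
            if (name != "") print name "\t" len
            # Assumption: the name is the header text up to the first whitespace.
            name = substr($1, 2)
            len = 0
            next
        }
        # Any other line is sequence; add its length to the running total.
        { len += length($0) }
        END { if (name != "") print name "\t" len }
    ' "$@"
}

# Typical use (paths are placeholders):
#   fasta_chrom_sizes /path/to/fasta/*.fa > /path/to/chrom_list
```

The same two columns can also be taken from the first two fields of the index file written by samtools faidx.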


  1. Get the code: git clone
  2. Add the library to your PERL5LIB: export PERL5LIB=$PERL5LIB:genome-signal-tracks/lib
  3. Get all software and data listed in Requirements


  1. First, initialise the hive pipeline and tell it where to find the software it depends upon:
ensembl-hive/scripts/ Bio::GenomeSignalTracks::PipeConfig::MappabilityConf \
	-hive_driver mysql \
	-host db_host \
	-port db_port \
	-user db_user \
	-password db_password \
	-bedtools /path/to/bedtools \
	-samtools /path/to/samtools \
	-bowtie /path/to/bowtie \
	-bedGraphToBigWig /path/to/bedGraphToBigWig \
	-fasta_suffix fa

The paths to bedtools, samtools and bowtie are required. The fasta suffix is used when searching for fasta files and defaults to fa (i.e. it will look for files matching *.fa).

Running successfully will provide you with a database URL to use in subsequent steps.

  2. Secondly, tell hive what to work on:
ensembl-hive/scripts/ \
	-url mysql://username:password@host:port/dbname \
	-logic_name start \
	-input_id "{fasta_dir => '/path/to/fasta/', output_dir => '/path/to/output', kmer_sizes => '35,42,90..100', index_dir => '/path/to/bowtie_index', index_name => 'name_of_index', chrom_list => '/path/to/chrom_list'}"
  • fasta_dir should be a directory containing fasta files (matching the fasta suffix supplied in step 1).
  • output_dir is the directory where output files will be created. Working directories for each kmer length will also be created here.
  • kmer_sizes is a list of kmer lengths to use. These should match the read lengths you expect to work with. This can include lists of values or ranges, so '35,45..50' would cause the pipeline to run for kmer lengths 35, 45, 46, 47, 48, 49 and 50.
  • index_dir should be the directory containing the bowtie index.
  • index_name should be the base name of the bowtie index, located in index_dir.
  • chrom_list should be a tab-separated file listing the sequences in the reference genome and their lengths.
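
To check which kmer lengths a kmer_sizes string covers before seeding the pipeline, the list-and-range syntax described above can be expanded with a few lines of shell. This is only a sketch of the syntax as documented here: expand_kmer_sizes is a hypothetical helper, and the pipeline's own parsing may differ in edge cases.

```shell
# Expand a kmer_sizes string such as '35,45..50' into one length per line.
# expand_kmer_sizes is a hypothetical helper; the pipeline's own parser may differ.
expand_kmer_sizes() {
    printf '%s\n' "$1" | awk -F',' '{
        for (i = 1; i <= NF; i++) {
            # A range is written start..end; a single value contains no "..".
            if (split($i, r, "[.][.]") == 2)
                for (k = r[1] + 0; k <= r[2] + 0; k++) print k
            else
                print $i + 0
        }
    }'
}

expand_kmer_sizes '35,45..50'
```

For '35,45..50' this prints 35, 45, 46, 47, 48, 49 and 50, one per line.
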
  3. Run hive. The controller script (beekeeper) may run for many hours, and should be run under GNU screen or similar.
screen -url mysql://username:password@host:port/dbname -loop
  4. Once the pipeline has finished, you should have bam, bedGraph and bigWig files (one per kmer size) in the output_dir. Output is named based on the index_name and kmer size.

You can repeat steps 2 and 3 for as many reference genomes and kmer lengths as you require. The pipeline cleans up intermediate files as it goes along.

Signal track production



  1. An alignment in BAM format. It must be coordinate sorted.
  2. A mappability track for the reference genome and read length of interest
  3. WiggleTools, tested with version 1.
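
Because the alignment must be coordinate sorted, it can be worth checking the BAM header before starting. The sketch below inspects a SAM header for the SO:coordinate tag on the @HD line. The helper name is_coordinate_sorted is hypothetical, and the check only reports what the header declares, which assumes the tool that wrote the file set the tag correctly.

```shell
# Succeed (exit 0) if the SAM header read from stdin declares coordinate
# sorting, i.e. the @HD line carries the tag SO:coordinate.
# is_coordinate_sorted is a hypothetical helper name, not part of this repo.
is_coordinate_sorted() {
    awk -F'\t' '
        $1 == "@HD" {
            for (i = 2; i <= NF; i++)
                if ($i == "SO:coordinate") found = 1
        }
        END { exit(found ? 0 : 1) }
    '
}

# Typical use (requires samtools and a real BAM file):
#   samtools view -H alignment.bam | is_coordinate_sorted && echo "coordinate sorted"
```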


The FAANG Data Coordination Centre has received funding from the European Union’s Horizon 2020 research and innovation program under Grant Agreement Nos. 815668, 817923 and 817998, and also from the Biotechnology and Biological Sciences Research Council under Grant Agreement No. BB/N019563/1.
