bamliquidator
Overview
Download
Usage
Developer Getting Started Check List
### Overview

* bamliquidator is a suite of tools for efficiently analyzing the density of short DNA sequence read alignments in the BAM file format
* the read counts across many genomes are grouped, normalized, graphed in interactive html files, and summarized
* for an interactive graph example, see this [summary](http://jdimatteo.github.io/Meta-Analysis/summary.html) and this [breakdown for a single chromosome](http://jdimatteo.github.io/Meta-Analysis/chr20.html)
* a whole genome can be processed and analyzed in less than 20 seconds (on modern hardware)
* a BAM file is a binary sequence alignment map -- see [SAMtools](http://samtools.sourceforge.net/) for more info
* the read counts and summaries are stored in HDF5 format, where they can be read efficiently via Python [PyTables](http://www.pytables.org) or the [HDF5 C APIs](http://www.hdfgroup.org/HDF5/)
* see [here](https://github.com/BradnerLab/pipeline/blob/hot-spots/bamliquidator_internal/hot_spot_csv.py) for a Python script example using the summary to show the hot spots and cold spots in a genome (TODO -- update to use master branch after re-integrated)
* the HDF5 files can be viewed directly with the cross platform tool [HDFView](http://www.hdfgroup.org/products/java/hdf-java-html/hdfview/)
* there is also a simple command line utility that counts the number of reads in a specified portion of a chromosome and outputs the count to the console

### Download

* the latest release can be downloaded here (TODO)
* install prerequisites on Ubuntu 13.10: `sudo apt-get install libhdf5-7 libtcmalloc-minimal4 libc++1`
* you can also [build from source yourself](#Developer)

### Usage

#### bamliquidator_batch.py

```
$ bamliquidator_batch.py --help
usage: bamliquidator_batch.py [-h] [--output_directory OUTPUT_DIRECTORY]
                              [--bin_counts_file BIN_COUNTS_FILE]
                              [--bin_size BIN_SIZE]
                              ucsc_chrom_sizes bam_file_path

Count the number of base pair reads in each bin of each chromosome in the bam
file(s) at the given directory, and then normalize, plot, and summarize the
counts in the output directory. For additional help, please see
https://github.com/BradnerLab/pipeline/wiki

positional arguments:
  ucsc_chrom_sizes      Tab delimited text file with the first column naming
                        the chromosome (e.g. chr1), the third column naming
                        the genome type (e.g. mm8), and the fifth column
                        naming the number of base pairs in the reference
                        chromosome.
  bam_file_path         The directory to recursively search for .bam files for
                        counting. Every .bam file must have a corresponding
                        .bai file at the same location. To count just a single
                        file, provide the .bam file path instead of a
                        directory. The parent directory of each .bam file is
                        interpreted as the cell type (e.g. mm1s might be an
                        appropriate directory name). The .bam file name is
                        also required to contain the genome type so that the
                        corresponding entries in the ucsc_chrom_sizes file can
                        be used. If your .bam files are not in this directory
                        format, please consider creating a directory of sym
                        links to your actual .bam and .bai files. If the .bam
                        file already has 1 or more reads in the HDF5 counts
                        file, then the .bam file is skipped.

optional arguments:
  -h, --help            show this help message and exit
  --output_directory OUTPUT_DIRECTORY
                        Directory to create and output the h5 and/or html
                        files to (aborts if already exists). Default is
                        "./output".
  --bin_counts_file BIN_COUNTS_FILE
                        HDF5 counts file from a prior run to be appended to.
                        If unspecified, defaults to creating a new file
                        "bin_counts.h5" in the output directory.
  --bin_size BIN_SIZE   Number of base pairs in each bin -- the smaller the
                        bin size the longer the runtime and the larger the
                        data files (default is 100000).
```
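
The `ucsc_chrom_sizes` layout described in the help text (chromosome in column 1, genome type in column 3, base pair count in column 5) can be illustrated with a short Python sketch. This is not code from bamliquidator: the function names `read_chrom_sizes` and `bin_count` are hypothetical, the sample rows are made up, the contents of columns 2 and 4 are unknown and shown as `-`, and counting a trailing partial bin as one bin is an assumption.

```python
from collections import defaultdict


def read_chrom_sizes(lines):
    """Parse tab-delimited rows into {genome_type: {chromosome: length}}.

    Uses only the columns named in the help text: column 1 (chromosome),
    column 3 (genome type), and column 5 (number of base pairs).
    """
    sizes = defaultdict(dict)
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) < 5:
            raise ValueError("expected at least 5 tab-delimited columns: %r" % line)
        chromosome, genome_type, length = fields[0], fields[2], int(fields[4])
        sizes[genome_type][chromosome] = length
    return dict(sizes)


def bin_count(chromosome_length, bin_size=100000):
    """Number of bins needed to cover a chromosome.

    Assumption: a final partial bin counts as a bin (ceiling division).
    """
    return (chromosome_length + bin_size - 1) // bin_size


if __name__ == "__main__":
    # Made-up rows for illustration only; columns 2 and 4 are placeholders.
    sample = [
        "chr1\t-\tmm8\t-\t250000\n",
        "chr2\t-\tmm8\t-\t200000\n",
    ]
    for genome_type, chromosomes in read_chrom_sizes(sample).items():
        for chromosome, length in sorted(chromosomes.items()):
            print(genome_type, chromosome, length, bin_count(length))
```

With the default bin size of 100000, a smaller `--bin_size` increases the bin count proportionally, which is consistent with the help text's note about longer runtimes and larger data files.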
#### bamliquidator
bamliquidator is run from the command line with required positional arguments:

    $ bamliquidator
    [ bamliquidator ] output to stdout
    1. bam file (.bai file has to be at same location)
    2. chromosome
    3. start
    4. stop
    5. strand +/-, use dot (.) for both strands
    6. number of summary points
    7. extension length

Example counting the number of reads on both strands from base pair 100 to 200 on chromosome 1 (inclusive):

    $ bamliquidator 04032013_D1L57ACXX_4.TTAGGC.hg18.bwt.sorted.bam chr1 100 200 . 1 0
    120
    $

(TODO: add examples with summary points > 1, and explain what extension length does)
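
For scripting, the seven positional arguments can be assembled in Python. The sketch below is illustrative only: `bamliquidator_args` and `count_reads` are hypothetical helper names, and `count_reads` assumes that with one summary point the tool prints a single integer to stdout, as in the example above. The output format for more than one summary point is not documented here, so the sketch does not try to parse it.

```python
import subprocess

VALID_STRANDS = {"+", "-", "."}  # "." means both strands


def bamliquidator_args(bam_path, chromosome, start, stop, strand=".",
                       summary_points=1, extension=0,
                       executable="bamliquidator"):
    """Build the argument list in the documented positional order."""
    if strand not in VALID_STRANDS:
        raise ValueError("strand must be '+', '-', or '.'")
    if start > stop:
        raise ValueError("start must not be greater than stop")
    return [executable, bam_path, chromosome, str(start), str(stop),
            strand, str(summary_points), str(extension)]


def count_reads(bam_path, chromosome, start, stop, strand="."):
    """Run bamliquidator with one summary point and return the count.

    Assumption: stdout holds a single integer in this case.
    """
    args = bamliquidator_args(bam_path, chromosome, start, stop, strand)
    result = subprocess.run(args, check=True, capture_output=True, text=True)
    return int(result.stdout.strip())
```

Passing the arguments as a list rather than a shell string avoids quoting problems with unusual file names.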
<a name="Developer"/>
### Developer Getting Started Check List
#### Dependencies: SAMtools, HDF5, boost, C++11 (clang/libc++), tcmalloc, PyTables (version 3 or later), Bokeh, NumPy
* Ubuntu 13.10 or later
    * `sudo apt-get install git libbam-dev libhdf5-serial-dev libboost-dev clang-3.4 libc++-dev libgoogle-perftools-dev`
* Ubuntu 12.04 LTS
    * for clang see steps at https://github.com/BradnerLab/pipeline/issues/4#issuecomment-31207506
    * for libc++ see steps at https://github.com/BradnerLab/pipeline/issues/4#issuecomment-33296709
    * `sudo apt-get install samtools libboost-all-dev libgoogle-perftools-dev`
* Mac OS X (10.8 or later)
    * install Xcode (5 or later) and the command line utilities (TODO: link) for clang and libc++
    * install and use homebrew (TODO: link) for the rest of the dependencies
        * `brew tap homebrew/science`
        * `brew install samtools boost hdf5 google-perftools`
    * TODO: test these steps
* TODO: document installing PyTables, Bokeh, NumPy, and anything else needed

Clone the repository, build, and run each binary without arguments to print its usage:

    $ git clone git@github.com:BradnerLab/pipeline.git
    $ cd pipeline/bamliquidator_internal
    $ make
    $ ./bamliquidator_batch
    usage: ./bamliquidator_batch cell_type bin_size ucsc_chrom_size_path bam_file_path hdf5_file
    e.g. ./bamliquidator_batch mm1s 100000 /grail/annotations/ucsc_chromSize.txt
    /ifs/labs/bradner/bam/hg18/mm1s/04032013_D1L57ACXX_4.TTAGGC.hg18.bwt.sorted.bam
    note that this application is intended to be run from bamliquidator_batch.py -- see
    https://github.com/BradnerLab/pipeline/wiki for more information
    $ ../bamliquidator
    [ bamliquidator ] output to stdout
    1. bam file (.bai file has to be at same location)
    2. chromosome
    3. start
    4. stop
    5. strand +/-, use dot (.) for both strands
    6. number of summary points
    7. extension length
    $
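
To show how the `bamliquidator_batch` usage string relates to the directory convention in the Usage section (the parent directory of each .bam file names the cell type), here is a rough Python sketch. It is not the actual `bamliquidator_batch.py` implementation: `find_bam_files` and `batch_args` are hypothetical names, and the sketch does not check for .bai files because their exact naming is not specified here.

```python
import os


def find_bam_files(root):
    """Recursively collect (cell_type, bam_path) pairs under root.

    The cell type is taken to be the name of the .bam file's parent directory.
    """
    found = []
    for directory, _subdirs, files in os.walk(root):
        for name in files:
            if name.endswith(".bam"):
                cell_type = os.path.basename(os.path.abspath(directory))
                found.append((cell_type, os.path.join(directory, name)))
    return sorted(found)


def batch_args(cell_type, bin_size, chrom_sizes_path, bam_path, hdf5_path,
               executable="./bamliquidator_batch"):
    """Build the argument list in the order given by the usage string:
    cell_type bin_size ucsc_chrom_size_path bam_file_path hdf5_file
    """
    return [executable, cell_type, str(bin_size),
            chrom_sizes_path, bam_path, hdf5_path]
```

A driver could loop over `find_bam_files(...)` and pass each result of `batch_args(...)` to `subprocess.run`; in normal use, `bamliquidator_batch.py` already handles this.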