
The BUS format specification

The BUS format is a binary format for storing intermediate results from single-cell RNA-Seq datasets. This repository details the specification of the format.

The motivation for the BUS format and example usage of the format and BUStools are described in:

P Melsted, V Ntranos, L Pachter, The Barcode, UMI, Set format and BUStools, bioRxiv 2018 pp: 472571.

Tools

BUS generation

BUS file manipulation

BUS parsing and processing

Format specification

A BUS file is a binary file consisting of a single header followed by zero or more BUS records. The header consists of the following fields, in order:

| Field name | Description | Type | Value |
| --- | --- | --- | --- |
| magic | fixed magic string | char[4] | `BUS\0` |
| version | BUS format version | uint32_t | |
| bc_len | Barcode length [1-32] | uint32_t | |
| umi_len | UMI length [1-32] | uint32_t | |
| tlen | Length of plain text header | uint32_t | |
| text | Plain text header | char[tlen] | |
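
As an illustration of the header layout, here is a minimal Python sketch that reads the fields in the order listed. It is not part of the specification: the function name `read_bus_header` is hypothetical, and little-endian byte order with no padding between fields is an assumption, since the specification does not state either.

```python
import io
import struct

# Fixed-size part of the header: magic, version, bc_len, umi_len, tlen.
# ASSUMPTION: little-endian, no padding ("<"); the spec does not state byte order.
HEADER_FIXED = struct.Struct("<4sIIII")


def read_bus_header(stream):
    """Read a BUS header from a binary stream and return its fields as a dict."""
    raw = stream.read(HEADER_FIXED.size)
    if len(raw) != HEADER_FIXED.size:
        raise ValueError("truncated BUS header")
    magic, version, bc_len, umi_len, tlen = HEADER_FIXED.unpack(raw)
    if magic != b"BUS\0":
        raise ValueError(f"bad magic: {magic!r}")
    # The spec restricts both lengths to the range [1, 32].
    if not (1 <= bc_len <= 32 and 1 <= umi_len <= 32):
        raise ValueError("barcode and UMI lengths must be in [1, 32]")
    text = stream.read(tlen)
    if len(text) != tlen:
        raise ValueError("truncated plain text header")
    return {
        "version": version,
        "bc_len": bc_len,
        "umi_len": umi_len,
        "text": text.decode("ascii", errors="replace"),
    }


# Usage: build a small header in memory and parse it back.
demo = HEADER_FIXED.pack(b"BUS\0", 1, 16, 10, 5) + b"hello"
header = read_bus_header(io.BytesIO(demo))
```

After the function returns, the stream is positioned at the first BUS record, because records are stored directly after the header.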

BUS records are stored directly after the header in the following format:

| Field name | Description | Type |
| --- | --- | --- |
| barcode | 2-bit encoded barcode | uint64_t |
| umi | 2-bit encoded UMI | uint64_t |
| ec | equivalence class | int32_t |
| count | fragment count | uint32_t |
| flags | flags | uint32_t |
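
The record layout and the 2-bit encoding can be sketched in Python as follows. Several details here are assumptions that this specification does not state: little-endian byte order, the base codes A=0, C=1, G=2, T=3, and the first base occupying the most significant used bits. The listed fields add up to 28 bytes; whether an implementation pads each record (for example to 32 bytes for alignment) is also not stated here, so the hypothetical `iter_records` helper takes the record stride as a parameter.

```python
import struct

# barcode (uint64), umi (uint64), ec (int32), count (uint32), flags (uint32).
# ASSUMPTION: little-endian and unpadded, giving 28 bytes per record.
RECORD = struct.Struct("<QQiII")

# ASSUMPTION: 2-bit codes A=0, C=1, G=2, T=3.
BASES = "ACGT"


def encode_2bit(seq):
    """Pack a nucleotide string into an integer, two bits per base."""
    value = 0
    for base in seq:
        value = (value << 2) | BASES.index(base)
    return value


def decode_2bit(value, length):
    """Unpack `length` bases; the first base sits in the highest used bits (assumed)."""
    return "".join(
        BASES[(value >> (2 * (length - 1 - i))) & 3] for i in range(length)
    )


def iter_records(data, bc_len, umi_len, record_size=RECORD.size):
    """Yield (barcode, umi, ec, count, flags) from raw record bytes.

    `record_size` is the stride between records; pass a larger value
    if the file pads each record.
    """
    for offset in range(0, len(data) - record_size + 1, record_size):
        bc, umi, ec, count, flags = RECORD.unpack_from(data, offset)
        yield decode_2bit(bc, bc_len), decode_2bit(umi, umi_len), ec, count, flags


# Usage: pack one record in memory and read it back.
raw = RECORD.pack(encode_2bit("ACGT"), encode_2bit("TTA"), 7, 2, 0)
records = list(iter_records(raw, bc_len=4, umi_len=3))
```

Because each base takes two bits, a 64-bit integer holds at most 32 bases, which matches the [1-32] limit on `bc_len` and `umi_len` in the header.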