This utility converts genome alignments in SAM/BAM format to transcript compatibility counts (TCCs) as output by transcriptome pseudoalignment methods such as kallisto. bam2tcc has been used/tested on Mac OS X and Linux.
Binaries may eventually be provided. For now, bam2tcc must be compiled from
the source code. To compile bam2tcc
you will need a c++14 compatible compiler such as gcc 5.0 or Clang 3.4 or later.
CMake for compiling and SeqAn for SAM/BAM I/O. To download SeqAn, follow instructions here.
For compiling bam2tcc
, it is enough to download the library package and unzip to a directory.
First clone the repository:
$ git clone https://github.com/laureneliu/bam2tcc ~/clone/path
Next create a directory called build
and change directory into it:
$ mkdir build && cd build
Run CMake (cmake ..
) from the build
directory followed by make
.
The resulting executable is build/src/bam2tcc
.
Depending on where/how you installed SeqAn, you may need to append some extra
options, -DCMAKE_PREFIX_PATH and -DCMAKE_INCLUDE_PATH
so that CMake can find
the relevant SeqAn folder. See here. If you downloaded the SeqAn library it should be run with
cmake -DCMAKE_PREFIX_PATH=download-path/seqan-library-2.4.0/share/cmake -DSEQAN_INCLUDE_PATH=download-path/seqan-library-2.4.0/include ..
Given a set of RNA-seq reads contained in some FASTQ files:
- Use a genome alignment program such as HISAT2 or STAR to align the reads. Your output should be a SAM or BAM file.
- Use samtools to order the output by genomic coordinate if necessary. The
command is
samtools sort -T [tempPrefix] -@ [nthreads] -o [outfile] infile
. Here for more information. - Use this program to read the SAM/BAM file and output the appropriate TCC matrix, contained in .ec, .tsv, and .cells files.
Usage: /path/bam2tcc/build/src/bam2tcc [options]* -g <GFF> -S <SAM/BA> [-o <output>]
<GFF>
is the file of annotated sequences to use. Note that it must be in the
Ensembl GFF format or
it cannot be properly parsed. If there are multiple GFFs you wish to use, you may
input them as a comma-separated list of values. Entries should be grouped by
chromosome, then by transcript. The entry of type "transcript" must come before
the entries of type "exon" for one given transcript. Exons should be in order of
increasing genomic start coordinate if the transcript is on the forward strand,
and in order of decreasing genomic start coordinate if the transcript is on the
reverse strand. The program will check the GFF for this format before using it.
GFFs downloaded from Ensembl should be in the correct format.
<SAM/BAM>
is the SAM/BAM file output by the genome aligner. It must be in
the specified format. Either
of the two example aligners listed above (HISAT2, STAR) should give a
correctly-formated file. As with the GTFs, this option accepts a comma-separated
list of values. Alignments should be sorted by chromosome, then by genomic start
coordinate. Samtools provides a way to sort alignments in this way.
<output>
is the name/directory of your output files. Appropriate file
extensions will be added to the name your provide. Default is matrix.ec,
matrix.tsv, and matrix.cells. See below for brief description of what these
files look like.
-
-U Indicate that the reads are unpaired. Only necessary for running on kallisto --genomebam output.
-
-k Indicate that all input SAM/BAM files are kallisto genomebam files. Required if input file is in the BAM format, or if it does not specify the @PG:ID as kallisto in the header.
-
-R Indicate that all input SAM/BAM files are RapMap
lightweight
files. Required if input file is in the BAM format, or if it does not speciy the @PG:ID as RapMap in the header. -
-t, --transcriptome The transcriptome file. Only useful with kallisto. kallisto's output zero-indexes the transcripts. Using this option will give the output the same indexing. If there are more transcripts in the GTF than in the transcriptome, the program will continue where kallisto leaves off.
-
-e, --ec A kallisto (or other) equivalence class (ec) file. Order the TCCs in the same order as this ec file. Useful for direct comparison of program output with kallisto pseudo output. However, it might be more useful to look at kallisto quant output.
-
-p, --threads The number of threads to use. Defaults to 1.
-
--full-matrix Output a full matrix instead of the default sparse matrix. See below for more information on these formats.
-
-u, --unmatched Also output a SAM file containing all reads that didn't align to any transcripts. This excludes those reads that didn't align anywhere on the genome. Currently outputs a bad header.
-
--check-gff Only check GFF format.
There exists a preprocessor macro, READ_DIST, that gives the option to output the names of those reads that map to some transcript (i.e., whose TCCs are nonempty). The macros is in Mapper.hpp in the src folder. kallisto --genomebam gives information on how specific reads map, so it is possible to determine on which reads kallisto and genomic aligners differ. In the kallisto-generated file, all reads where at least one alignment does not have the "unmapped flag" (0x04) are considered to have mapped.
This section will only describe the ec, tsv, and cell files and add some restrictions to SAM/BAM formatting. For more information on the GFF and SAM/BAM file formats, you should look at their documentation.
This file contains two tab-separated columns. The first column is the zero-indexed row number (the line number). The second contains the equivalence class. From the paper on kallisto:
An equivalence class for a read is a multi-set of transcripts associated with the read; ideally it represents the transcripts a read could have originated from and provides a sufficient statistic for quantification.
This is a list of the SAM/BAM files used. They will be listed in the order that they were input.
This file may be in either the sparse matrix format (default) or full matrix format. You may change the output format using the `--full-matrix`` option.
This file contains n + 1 tab-separated columns, where n is the number of SAM/BAM files you input. The first column contains the zero-indexed row number (the line number). The (i, j) th entry, where j >= 1, is the number of reads file j (one-index the SAM/BAM files) that had the equivalence class of row i described in the ec file. A short example:
0 0 0 1
1 2 0 1
2 0 1 1
3 3 0 0
This tsv files says that three reads in the first SAM/BAM file aligned to the equivalence class described by row three. Say the ec file says this:
0 0
1 1
2 2
3 1,3
Then, three reads in the first SAM/BAM file aligned to equivalence class 1,3.
Note that you can look at the .cells file to see which SAM/BAM file this was.
This file contains three tab-separated columns. Take a full matrix. The first column of this file gives the zero-indexed row number of an entry. The second gives the -1-indexed column number (i.e. it one-indexes, but does not count the first column containing row numbers). The third is the entry itself. It only shows those entries that are nonzero. Take again the sample tsv file from above:
0 0 0 1
1 2 0 1
2 0 1 1
3 3 0 0
The sparse matrix looks like this:
1 0 2
3 0 3
2 1 1
0 3 1
1 3 1
2 3 1
For paired-end reads, the program will only count as "correct" a pair of alignments with flags: 0x01, 0x02, 0x10 for one alignment and 0x20 for the other, and 0x40 for one alignment and 0x80 for the other.
Also, all paired reads must have at least two alignments, one for each end,
regardless of whether it maps. HISAT2 will give a SAM file readable by the
program, but STAR requires the additional option --outSAMunmapped Within KeepPairs
. Consult the manual
for more information about the available options.
There exists another executable, /path/bam2tcc/build/src/debug
. It provides various useful
functions, but does not go into detail on what they do. Also, any mistakes
in input format of either the files or of the commandline arguments will result
in a potentially generic error (e.g. a segfault). However, none of the functions
are particularly complicated so a brief look at the source code should be all
the explanation necessary. Run the executable with no arguments for brief and
utterly uninformative descriptions of all functions and their arguments.
Of particular use are:
-
"sparse" (
-x
), which turns a full matrix from ONE INPUT FILE into a sparse matrix -
"Unmapped cat" (
-o
), which takes in an input SAM (not a BAM), presumably the file containing unmapped reads output bybam2tcc --unmapped
, and determines which reads were not properly aligned to the genome, and therefore did not map to any transcripts, and which reads did properly align somewhere in the genome, but in an area which did not correspond to any transcripts in the transcriptome.sameQName
should be 1 if all segments of a read are named the same way (as opposed to being named, for example, read/1 and read/2 to indicate that the first is the first mate, and the second the second mate).genomebam
should be 1 if the input SAM alignments were originally generated by kallisto --genomebam. -
"EQ to TCC" (
-t
), which takes as input the eq_classes.txt file generated by Salmon --dumpEQ and outputs TCCs in kallisto's format.