Bam file to count table

Buys de Barbanson edited this page Apr 5, 2020 · 4 revisions

bamToCountTable.py is a flexible script that converts a BAM file into a table in which the rows are samples and the columns are features. The script can extract information from tags and from any other read attribute exposed by the pysam API, such as is_paired, is_duplicate, reference_name, reference_start, reference_end, is_qcfail, is_read1 and mapping_quality.

The -sampleTags argument defines which tags/attributes in the BAM file are used to determine the sample of a read. If multiple tags define the sample, use a comma without a space to separate the tags/attributes.

The -joinedFeatureTags argument defines a group of tags/attributes in the BAM file to be used as columns of the output table. By default, the occurrences of each tag combination are counted; to sum over values instead, use the -byValue argument.

The -featureTags argument can be used instead of the -joinedFeatureTags argument and counts every supplied feature as a separate feature / column.
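To make the difference between joined and separate features concrete, here is a minimal sketch in plain Python. It is not the script's implementation: the reads are stand-in dictionaries rather than pysam reads, and the helper names `count_joined` and `count_separate` are hypothetical. It assumes that joined features produce one column per observed combination of values, while separate features produce one column per tag value.

```python
from collections import Counter

# Stand-in reads: each dict maps a tag or attribute name to its value.
# (Assumption: real reads would come from a BAM file via pysam.)
reads = [
    {"SM": "cell_1", "reference_name": "chr1", "is_read1": True},
    {"SM": "cell_1", "reference_name": "chr1", "is_read1": False},
    {"SM": "cell_1", "reference_name": "chr2", "is_read1": True},
    {"SM": "cell_2", "reference_name": "chr1", "is_read1": True},
]


def count_joined(reads, sample_tags, feature_tags):
    """Count one column per observed combination of feature values."""
    counts = Counter()
    for read in reads:
        sample = tuple(read[t] for t in sample_tags)
        feature = tuple(read[t] for t in feature_tags)
        counts[(sample, feature)] += 1
    return counts


def count_separate(reads, sample_tags, feature_tags):
    """Count every feature tag on its own, one column per (tag, value)."""
    counts = Counter()
    for read in reads:
        sample = tuple(read[t] for t in sample_tags)
        for t in feature_tags:
            counts[(sample, (t, read[t]))] += 1
    return counts


# Tags are given as comma-separated strings without spaces.
sample_tags = "SM".split(",")
feature_tags = "reference_name,is_read1".split(",")

joined = count_joined(reads, sample_tags, feature_tags)
separate = count_separate(reads, sample_tags, feature_tags)

for key, value in sorted(joined.items(), key=str):
    print("joined", key, value)
for key, value in sorted(separate.items(), key=str):
    print("separate", key, value)
```

In the joined table, cell_1 has distinct columns for (chr1, True) and (chr1, False); in the separate table the two chr1 reads of cell_1 fall into the same reference_name column.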

By default, the output of the script is a CSV file. When a filename ending in .pickle.gz is supplied, the output is written as a gzipped pickle file instead.
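Either output can be loaded for downstream analysis. The sketch below assumes that the pickle holds a pandas DataFrame and that the first CSV column contains the row labels; neither is stated on this page, so verify against your own output. The helper name `load_count_table` is hypothetical, and tables with multi-level labels may need extra `header` or `index_col` arguments.

```python
from pathlib import Path

import pandas as pd


def load_count_table(path):
    """Load a count table written as CSV or as a gzipped pickle."""
    path = Path(path)
    if path.name.endswith(".pickle.gz"):
        # pandas infers gzip compression from the .gz extension.
        return pd.read_pickle(path)
    # Assumption: the first column holds the row labels.
    return pd.read_csv(path, index_col=0)
```

Usage would then be along the lines of `table = load_count_table("reads_per_sample.csv")`.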

Examples:

Number of reads per sample (-sampleTags SM) per chromosome (-joinedFeatureTags reference_name):

bamToCountTable.py test.bam -sampleTags SM -joinedFeatureTags reference_name -o reads_per_sample.csv

Number of molecules (--dedup) per sample (-sampleTags SM) per chromosome (-joinedFeatureTags reference_name):

bamToCountTable.py test.bam -sampleTags SM -joinedFeatureTags reference_name --dedup -o molecules_per_sample.csv
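The difference between counting reads and counting molecules can be sketched as follows. This assumes that deduplication means skipping reads flagged as duplicates (the is_duplicate attribute), which this page does not state explicitly; the reads are stand-in tuples and the function name `count` is hypothetical.

```python
from collections import Counter

# Stand-in reads: (sample, contig, is_duplicate)
reads = [
    ("cell_1", "chr1", False),
    ("cell_1", "chr1", True),  # duplicate of the read above
    ("cell_1", "chr2", False),
]


def count(reads, dedup=False):
    """Count reads per (sample, contig), optionally skipping duplicates."""
    counts = Counter()
    for sample, contig, is_duplicate in reads:
        if dedup and is_duplicate:
            continue  # keep a single read per molecule
        counts[(sample, contig)] += 1
    return counts


print(count(reads))              # read counts
print(count(reads, dedup=True))  # molecule counts
```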

Bin molecules in 250 kb bins (-bin 250_000), where the bin is determined by the DS tag (-binTag DS) and counted molecules have a mapping quality of at least 20 (-minMQ 20):

bamToCountTable.py test.bam -joinedFeatureTags reference_name -minMQ 20 --dedup -binTag DS -bin 250_000 -o molecules_binned_250k.csv
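The arithmetic behind fixed-size binning is a floor division. The sketch below assumes 0-based coordinates and half-open bins; how the script labels its bins is not documented here, so the exact boundaries may differ. The function name `bin_for` is hypothetical.

```python
def bin_for(position, bin_size=250_000):
    """Return the (start, end) of the fixed bin containing a position."""
    start = (position // bin_size) * bin_size
    return start, start + bin_size


print(bin_for(612_345))  # (500000, 750000)
```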

Bin molecules in 250 kb bins (-bin 250_000) with a sliding window of 50 kb (-sliding 50_000), where the location is determined by the start position of the read (-binTag reference_start); ignore reads with alternative hits (--filterXA) and reads with a mapping quality below 60 (-minMQ 60):

bamToCountTable.py test.bam -joinedFeatureTags reference_name -minMQ 60 --filterXA --dedup -binTag reference_start -bin 250_000 -sliding 50_000 -o molecules_binned_250k_sliding_50k.csv
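With overlapping windows, a single position belongs to several bins. The sketch below assumes that the sliding value is the step between window starts, that windows start at multiples of that step, and that windows are half-open; these are illustrative assumptions, not the script's documented behavior. The function name `sliding_bins` is hypothetical.

```python
def sliding_bins(position, bin_size=250_000, step=50_000):
    """Return all (start, end) windows of width bin_size that contain position."""
    # First window start that still covers the position, clamped at 0.
    first = max(0, ((position - bin_size) // step + 1) * step)
    return [(start, start + bin_size) for start in range(first, position + 1, step)]


for window in sliding_bins(260_000):
    print(window)
```

Under these assumptions a read starting at 260,000 is counted in five windows, the first starting at 50,000 and the last at 250,000.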

Count the number of mapped reads per contig per cell, distributing multi-mapping reads (--divideMultimapping):

bamToCountTable.py test.bam -joinedFeatureTags reference_name --divideMultimapping -o counts.csv
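One common way to distribute multi-mapping reads is to let each alignment of a read with n hits contribute 1/n, so that every read still adds up to a total count of one. The sketch below illustrates that idea with stand-in records; how the script itself determines the number of hits is not described on this page, so treat this as an assumption.

```python
from collections import defaultdict

# Stand-in alignment records: (cell, contig, number of hits for the read)
alignments = [
    ("cell_1", "chr1", 1),  # uniquely mapped read
    ("cell_1", "chr1", 2),  # first hit of a read with two hits
    ("cell_1", "chr2", 2),  # second hit of the same read
]

counts = defaultdict(float)
for cell, contig, n_hits in alignments:
    # Each alignment contributes a fraction of its read.
    counts[(cell, contig)] += 1 / n_hits

print(dict(counts))  # {('cell_1', 'chr1'): 1.5, ('cell_1', 'chr2'): 0.5}
```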