Parallelizing operations on SAM/BAM files

SAM/BAM files are typically large, so operations on them are time-consuming. This project provides tools to parallelize operations on SAM/BAM files. The workflow is as follows:

Split the SAM/BAM file into n chunks
Perform the operation on each chunk in a dedicated process and save the resulting SAM/BAM chunk
Merge the results back into a single SAM/BAM file
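The three steps above can be sketched roughly as follows. This is only an illustration of the split/process/merge pattern, not the package's actual implementation: the helper names (chunk_sizes, run_on_chunks, merge_command) are hypothetical, and how parallelbam really splits the file is not shown here. The merge step assumes the standard `samtools merge` command.

```python
import shlex
from concurrent.futures import ProcessPoolExecutor


def chunk_sizes(n_reads, n_chunks):
    """Split n_reads into n_chunks sizes that differ by at most one read."""
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    base, extra = divmod(n_reads, n_chunks)
    return [base + 1 if i < extra else base for i in range(n_chunks)]


def run_on_chunks(operation, chunk_pairs, n_processes):
    """Call operation(input_chunk, output_chunk) for every pair in worker processes.

    operation must be a module-level function so it can be sent to the workers.
    """
    with ProcessPoolExecutor(max_workers=n_processes) as pool:
        futures = [pool.submit(operation, src, dst) for src, dst in chunk_pairs]
        for future in futures:
            future.result()  # re-raises any error from the worker


def merge_command(output_bam, chunk_bams):
    """Build a `samtools merge` call that combines the processed chunks."""
    return ["samtools", "merge", "-f", output_bam, *chunk_bams]


# Hypothetical file names, for illustration only
print(chunk_sizes(10, 3))
print(shlex.join(merge_command("merged.bam", ["chunk_0.bam", "chunk_1.bam"])))
```

Only the helpers that need no BAM file are exercised here; run_on_chunks would be called with real chunk paths once the split has been done.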

Depends on:

Samtools, which can be installed with conda
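Since the package relies on Samtools being available, it can help to check that the executable is on the PATH before starting a long job. A minimal sketch, assuming the executable is named samtools; the require_tool helper is hypothetical and not part of parallelbam:

```python
import shutil


def require_tool(name):
    """Return the full path of an executable, or raise if it is not on PATH."""
    path = shutil.which(name)
    if path is None:
        raise RuntimeError(
            f"{name} not found on PATH; install it first (for example with conda)"
        )
    return path


if __name__ == "__main__":
    print(require_tool("samtools"))
```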

Installation

pip install parallelbam

or

Clone the project with git
cd into the cloned project directory
sudo python3 setup.py install

It is better to install within an environment, such as a conda environment, to avoid path conflicts with the included bash scripts.

Usage

There is one main function, named parallelizeBAMoperation. This function takes the following mandatory arguments:

path to the original bam file (should be sorted)
a callable function that performs the operation on each bam file chunk

The callable function must accept the following as its first two arguments, in this order:

path to the input bam file
path to the resulting output bam file
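As an illustration of this contract, here is a sketch of a callable that keeps only alignments above a mapping-quality threshold. The function names and the threshold of 30 are assumptions chosen for the example; the only thing taken from the package is that the first two arguments are the input and output bam paths. It uses the standard `samtools view` options -b (BAM output), -q (minimum MAPQ) and -o (output file).

```python
import subprocess


def mapq_filter_command(input_bam, output_bam, min_mapq=30):
    """Build the samtools call that writes alignments with MAPQ >= min_mapq."""
    return [
        "samtools", "view",
        "-b",                  # write BAM output
        "-q", str(min_mapq),   # skip alignments below this mapping quality
        "-o", output_bam,
        input_bam,
    ]


def filter_by_mapq(input_bam, output_bam, min_mapq=30):
    """Callable with (input_bam, output_bam) as its first two arguments."""
    subprocess.run(mapq_filter_command(input_bam, output_bam, min_mapq), check=True)
```

A function like filter_by_mapq could then be passed where the copy function is used in the example further down. It is defined at module level so that it can be sent to worker processes.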

TODO

The current way of including bash scripts in the package works but seems awkward. Perhaps running the bash code directly through subprocess would be simpler.
Some installations raise a permission error when calling splitBAM.sh. Can the script be made executable during installation?
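One possible direction for the permission issue, sketched under assumptions: add the execute bits to the script from Python, either at install time or just before the script is first called. The make_executable helper is hypothetical, and where splitBAM.sh ends up after installation is not stated here, so the path would have to be resolved by the package itself. This applies to POSIX systems.

```python
import os
import stat


def make_executable(path):
    """Add execute permission for user, group and others, keeping other bits."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
```

An alternative that avoids the permission bit altogether would be to invoke the script through the interpreter, for example by passing ["bash", script_path, ...] to subprocess.run.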

from parallelbam.parallelbam import parallelizeBAMoperation, getNumberOfReads

As an example, let's create a function that simply copies a bam file to another location (it does nothing to the bam file itself). Passing this function to parallelizeBAMoperation splits the BAM file into chunks and then merges them back into a single BAM, which should be identical to the first one. We will split the BAM file into 8 chunks, and our dummy function will be called in a separate process for each chunk.

import shutil

def foo(input_bam, output_bam):
    # Dummy operation: copy the chunk unchanged
    shutil.copyfile(input_bam, output_bam)


parallelizeBAMoperation('sample.bam', foo,
                        output_dir=None,
                        n_processes=8)

To check that the processed bam file, after merging the 8 chunks, contains the same number of reads as the original, we can call getNumberOfReads.

getNumberOfReads('sample.bam')

11825588

getNumberOfReads('processed.bam')

11825588
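For an independent cross-check, the same comparison can be made with Samtools directly. The sketch below assumes `samtools view -c`, which counts alignment records; whether getNumberOfReads counts in exactly the same way is not stated here, so the two numbers are only expected to agree for the same file under that assumption. The helper names are hypothetical.

```python
import subprocess


def parse_count(samtools_output):
    """Convert the text printed by `samtools view -c` into an integer."""
    return int(samtools_output.strip())


def count_alignments(bam_path):
    """Count alignment records in a BAM file with `samtools view -c`."""
    result = subprocess.run(
        ["samtools", "view", "-c", bam_path],
        capture_output=True, text=True, check=True,
    )
    return parse_count(result.stdout)


def same_alignment_count(original_bam, processed_bam):
    """Return True when both files contain the same number of records."""
    return count_alignments(original_bam) == count_alignments(processed_bam)
```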
