# Parallelizing operations on SAM/BAM files

SAM/BAM files are typically large, so operations on them are time-consuming. This project provides tools to parallelize operations on SAM/BAM files. The workflow is as follows:

1. Split the SAM/BAM file into _n_ chunks.
2. Perform the operation on each chunk in a dedicated process and save the resulting SAM/BAM chunk.
3. Merge the results back into a single SAM/BAM file.
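
The three steps above can be sketched in plain Python on SAM text lines. This is only an illustration of the split/apply/merge pattern, not the project's implementation (which relies on Samtools and works on files). The helper names `split_sam_lines`, `drop_unmapped`, `merge_sam_chunks` and `run_chunked` are hypothetical, and the `mapper` argument defaults to the serial built-in `map`, so a process pool's `map` would have to be passed in for real parallelism.

```python
def split_sam_lines(lines, n_chunks):
    """Step 1: split SAM lines into at most n_chunks chunks, each carrying the full header."""
    header = [line for line in lines if line.startswith("@")]
    records = [line for line in lines if not line.startswith("@")]
    if not records:
        return [header]
    # Ceiling division gives the number of records per chunk
    size = -(-len(records) // n_chunks)
    return [header + records[i:i + size] for i in range(0, len(records), size)]


def drop_unmapped(chunk):
    """Example operation (hypothetical): drop reads whose FLAG has bit 0x4 (unmapped) set."""
    kept = []
    for line in chunk:
        # FLAG is the second tab-separated SAM field; header lines are kept as-is
        if line.startswith("@") or not (int(line.split("\t")[1]) & 0x4):
            kept.append(line)
    return kept


def merge_sam_chunks(chunks):
    """Step 3: keep the header once, then append the records of every chunk in order."""
    merged = [line for line in chunks[0] if line.startswith("@")] if chunks else []
    for chunk in chunks:
        merged.extend(line for line in chunk if not line.startswith("@"))
    return merged


def run_chunked(lines, operation, n_chunks, mapper=map):
    """Split, apply `operation` to each chunk (step 2), and merge the results."""
    chunks = split_sam_lines(lines, n_chunks)
    processed = list(mapper(operation, chunks))
    return merge_sam_chunks(processed)


# Tiny made-up SAM content: two header lines and three reads (r2 is unmapped)
sam = [
    "@HD\tVN:1.6\tSO:coordinate",
    "@SQ\tSN:chr1\tLN:1000",
    "r1\t0\tchr1\t10\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "r3\t16\tchr1\t30\t60\t4M\t*\t0\t0\tACGT\tIIII",
]
result = run_chunked(sam, drop_unmapped, n_chunks=2)
```

Passing the `map` method of a `multiprocessing.Pool` as `mapper` would run each chunk in its own worker process, provided `operation` is a picklable top-level function.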

Depends on:

1. Samtools

# Installation

1. Clone the project with Git.
2. `cd` into the cloned project directory.
3. Run ```sudo python3 setup.py install```.

# Usage

There is one main function, ```parallelizeBAMoperation```. It takes two mandatory arguments:

1. the path to the original BAM file (which should be sorted)
2. a callable that performs the operation on each BAM file chunk

The callable must accept, as its first two arguments, (i) the path to the input BAM file and (ii) the path to the resulting output BAM file, in this order.
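
As an illustration of the required signature, the sketch below defines a callable that keeps only reads above a minimum mapping quality by calling `samtools view` through `subprocess`. The function names and the `min_mapq` parameter are hypothetical, and the code assumes `samtools` is available on the `PATH`. `functools.partial` binds the extra keyword so that the resulting callable still receives the input and output paths as its first two arguments.

```python
import subprocess
from functools import partial


def build_filter_command(input_bam, output_bam, min_mapq=30):
    """Build a `samtools view` command writing reads with MAPQ >= min_mapq as BAM."""
    return ["samtools", "view", "-b", "-q", str(min_mapq),
            "-o", str(output_bam), str(input_bam)]


def filter_by_mapq(input_bam, output_bam, min_mapq=30):
    """Hypothetical per-chunk operation: input path first, output path second."""
    # check=True raises CalledProcessError if samtools exits with an error
    subprocess.run(build_filter_command(input_bam, output_bam, min_mapq), check=True)


# Bind the extra parameter; the result is still called as f(input_bam, output_bam)
filter_q20 = partial(filter_by_mapq, min_mapq=20)
```

A callable such as `filter_q20` could then be passed as the second argument of `parallelizeBAMoperation`, assuming the library only relies on the two leading positional arguments.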

# Note

Preparing a BAM file for parallel processing takes a while, so it is not worth it when the operation itself is quick. For example, preparing a typical BAM file for parallelization (in 8 processes) can take almost a minute.
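
A rough way to judge whether parallelization pays off is to compare the serial run time with the preparation overhead plus the per-chunk time. The sketch below assumes an idealized linear speedup and a fixed overhead of 60 seconds (taken from the "almost a minute" figure above). Both are simplifying assumptions, and the function names are hypothetical.

```python
def estimated_parallel_seconds(serial_seconds, n_processes, overhead_seconds=60.0):
    """Estimated wall time: fixed preparation overhead plus an ideal 1/n share of the work."""
    return overhead_seconds + serial_seconds / n_processes


def break_even_seconds(n_processes, overhead_seconds=60.0):
    """Serial run time T at which T == overhead + T / n, i.e. T = overhead * n / (n - 1)."""
    if n_processes < 2:
        raise ValueError("at least 2 processes are needed for any speedup")
    return overhead_seconds * n_processes / (n_processes - 1)


def worth_parallelizing(serial_seconds, n_processes, overhead_seconds=60.0):
    """True if the estimated parallel time beats the serial time."""
    parallel = estimated_parallel_seconds(serial_seconds, n_processes, overhead_seconds)
    return parallel < serial_seconds
```

Under these assumptions, with 8 processes an operation would need to take more than about 69 seconds serially before parallelization helps.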

# Example

```python
import shutil
from parallelbam.parallelbam import parallelizeBAMoperation, getNumberOfReads


def foo(input_bam, output_bam):
    # Trivial operation: copy each input chunk to its output path unchanged
    shutil.copyfile(input_bam, output_bam)


parallelizeBAMoperation('parallelbam/test/toy_sample.bam',
                        foo, output_path=None,
                        n_processes=4)

# Number of reads in the original file
getNumberOfReads('parallelbam/test/toy_sample.bam')
# Out: 1000

# Number of reads in the processed file
getNumberOfReads('parallelbam/test/processed.bam')
# Out: 1000
```