# Filtering SAM/BAM files by percent identity or percent of matched sequence

Tools to filter alignments in SAM/BAM files by percent identity or percent of matched sequence. 

Percent identity is computed as:

$$PI = 100 \frac{N_m}{N_m + N_i}$$

where $N_m$ is the number of matches and $N_i$ is the number of mismatches.

Percent of matched sequence is computed as:

$$PM = 100 \frac{N_m}{L}$$

where $L$ is the length of the query sequence.
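As a quick illustration of the two formulas, here is a minimal sketch that evaluates them from plain counts. The function names `percent_identity` and `percent_matched` are hypothetical and are not part of the `filtersam` API. How `filtersam` itself derives $N_m$ and $N_i$ from an alignment is not covered here.

```python
def percent_identity(n_matches: int, n_mismatches: int) -> float:
    """PI = 100 * N_m / (N_m + N_i). Hypothetical helper, not filtersam's API."""
    aligned = n_matches + n_mismatches
    if aligned == 0:
        raise ValueError("at least one match or mismatch is required")
    return 100 * n_matches / aligned


def percent_matched(n_matches: int, query_length: int) -> float:
    """PM = 100 * N_m / L. Hypothetical helper, not filtersam's API."""
    if query_length <= 0:
        raise ValueError("query_length must be positive")
    return 100 * n_matches / query_length


# Example: 95 matches and 5 mismatches on a query of length 120.
print(percent_identity(95, 5))    # 95.0
print(percent_matched(95, 120))   # about 79.17
```

In this example the alignment would pass an identity cutoff of 95 only if the cutoff is inclusive, and its percent of matched sequence is lower than its percent identity because part of the query is not counted as matched.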


# Installation

```pip3 install filtersam```

# TODO

# Usage

In [1]:
from filtersam.filtersam import filterSAMbyIdentity
from parallelbam.parallelbam import getNumberOfReads

In [3]:
filterSAMbyIdentity('ERS491274.bam', output_path='processed_np.bam', identity_cutoff=95)

After filtering, the processed BAM file should contain fewer alignments than the original:

In [11]:
getNumberOfReads('ERS491274.bam')

1113119

In [5]:
getNumberOfReads('processed_np.bam')

11384

# Parallelizing filtersam

Filtering large BAM files can take a while. However, ```filtersam``` can be parallelized with an additional Python package: [parallelbam](https://pypi.org/project/parallelbam/). Effectively, ```parallelbam``` splits a large BAM file into chunks and calls ```filtersam``` on each chunk in a dedicated process.

Let's try this out.

In [10]:
from parallelbam.parallelbam import parallelizeBAMoperation, getNumberOfReads
parallelizeBAMoperation('ERS491274.bam', filterSAMbyIdentity, [95], n_processes=8, output_dir=None)

In [4]:
getNumberOfReads('processed.bam')

11384

Calling ```getNumberOfReads``` on the processed BAM file, obtained by merging the 8 chunks, confirms that it contains the same number of reads (11384) as the file filtered in a single process.
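The split, filter and merge pattern described in this section can be sketched on a toy list of percent identities. This is not how ```parallelbam``` is implemented. All names below (`split_into_chunks`, `filter_chunk`, `filter_in_chunks`, `map_fn`) are hypothetical, the inclusive cutoff is an assumption, and the chunks are processed sequentially with the built-in `map` to keep the example self-contained. A process pool's `map` method could be passed as `map_fn` instead.

```python
from functools import partial
from typing import Callable, List, Sequence


def split_into_chunks(records: Sequence[float], n_chunks: int) -> List[Sequence[float]]:
    """Split records into n_chunks contiguous pieces of near-equal size."""
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    size, extra = divmod(len(records), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks receive one additional record.
        end = start + size + (1 if i < extra else 0)
        chunks.append(records[start:end])
        start = end
    return chunks


def filter_chunk(chunk: Sequence[float], cutoff: float) -> List[float]:
    """Keep identities at or above the cutoff (inclusive boundary is an assumption)."""
    return [identity for identity in chunk if identity >= cutoff]


def filter_in_chunks(records: Sequence[float], cutoff: float, n_chunks: int,
                     map_fn: Callable = map) -> List[float]:
    """Filter each chunk independently, then merge the results in order."""
    chunks = split_into_chunks(records, n_chunks)
    filtered = map_fn(partial(filter_chunk, cutoff=cutoff), chunks)
    return [identity for chunk in filtered for identity in chunk]


identities = [99.0, 80.0, 96.5, 95.0, 70.2, 97.1, 50.0]
single = filter_in_chunks(identities, cutoff=95, n_chunks=1)
chunked = filter_in_chunks(identities, cutoff=95, n_chunks=3)
print(len(single) == len(chunked))  # True
```

Because each chunk is filtered independently and the results are concatenated in order, the chunked run keeps exactly the same records as the single run, which mirrors the read-count check above.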