
Hard/soft-mask or exclude/extract FASTA sub-sequences


Koohoko/figleaf_fasta

 
 


Modifications in this fork:

  • Relaxed the constraint on the hard-mask letter; "?" can now be used.
  • Fixed bugs that occurred with FASTA files containing more than one sequence when using --task='exclude' or inverse_mask=False.


figleaf_fasta applies hard/soft masking to a FASTA file or excludes/extracts sub-sequences from a FASTA file.

  • hard_mask: replace sequence with Ns or Xs
  • soft_mask: convert sequence to lowercase
  • exclude: exclude sub-sequences and concatenate non-excluded remainder
  • extract: extract and concatenate sub-sequences
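The four tasks can be illustrated on a single sequence string. The following is a minimal sketch written for this README, not figleaf_fasta's actual implementation: the function name `apply_task` is hypothetical, ranges are assumed to be 0-indexed and end-exclusive, and extracted or retained bases are assumed to be concatenated in sequence order.

```python
def apply_task(seq, ranges, task="hard_mask", hard_mask_letter="N"):
    """Illustrative version of the four tasks on one sequence string.

    `ranges` is a list of (start, end) tuples, 0-indexed and end-exclusive.
    Hypothetical helper; not part of the figleaf_fasta package.
    """
    # Flag every position that falls inside at least one range.
    in_range = [False] * len(seq)
    for start, end in ranges:
        for i in range(max(start, 0), min(end, len(seq))):
            in_range[i] = True

    pairs = list(zip(seq, in_range))
    if task == "hard_mask":
        return "".join(hard_mask_letter if hit else base for base, hit in pairs)
    if task == "soft_mask":
        return "".join(base.lower() if hit else base for base, hit in pairs)
    if task == "exclude":
        return "".join(base for base, hit in pairs if not hit)
    if task == "extract":
        return "".join(base for base, hit in pairs if hit)
    raise ValueError(f"unknown task: {task}")


seq = "ACGTACGTAC"
ranges = [(2, 4), (7, 9)]
for task in ("hard_mask", "soft_mask", "exclude", "extract"):
    print(task, apply_task(seq, ranges, task))
```

With these inputs, positions 2-3 and 7-8 are affected: hard masking replaces them, soft masking lowercases them, exclusion drops them, and extraction keeps only them.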

Other tools for handling FASTA files (e.g. bedtools maskfasta, bedtools getfasta, pybedtools) require sequence names, matching the FASTA header names, to be specified in addition to range information; specifying sequence names allows different masking operations to be applied to different records in a multi-FASTA file.

figleaf_fasta is a simple, lightweight tool that takes as input a (multi-)FASTA file and the start and end positions of ranges; masking, exclusion or extraction is applied to every sequence in the (multi-)FASTA file, regardless of FASTA header names. This is useful when the same masking should be applied to all FASTA files or to all records of a multi-FASTA file. A common use case is handling reference-aligned (same-length) consensus FASTAs.
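The header-independent behaviour described above can be sketched in a few lines of plain Python. This is an illustration only, under the assumption of 0-indexed, end-exclusive ranges; `read_fasta` and `hard_mask` are hypothetical helpers, not functions exported by figleaf_fasta.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def hard_mask(seq, ranges, letter="N"):
    """Replace every position inside a 0-indexed, end-exclusive range."""
    masked = list(seq)
    for start, end in ranges:
        for i in range(max(start, 0), min(end, len(masked))):
            masked[i] = letter
    return "".join(masked)


fasta_text = """>sample_A
ACGTACGT
>sample_B
TTGCAAGT
"""
ranges = [(0, 2), (6, 8)]

# The same ranges are applied to every record; headers are never consulted.
masked_records = [(h, hard_mask(s, ranges)) for h, s in read_fasta(fasta_text)]
for header, seq in masked_records:
    print(f">{header}\n{seq}")
```

Because the ranges are not tied to a sequence name, both records are masked at the same coordinates, which is the situation that arises with same-length, reference-aligned consensus sequences.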

Installation

From PyPI

pip3 install figleaf_fasta

From GitHub repository

git clone https://github.com/AlexOrlek/figleaf_fasta.git
cd figleaf_fasta
pip3 install .

Options and usage

figleaf_fasta can be run from a Linux command line as follows:
figleaf [arguments...]

figleaf_fasta can be used within a Python script as follows:
from figleaf_fasta.figleaf import figleaf
figleaf([arguments...])

Running figleaf -h on the command line produces a summary of the command-line options:

usage: figleaf [-h] -fi FASTA_INPUT -r RANGES_PATH -fo FASTA_OUTPUT [--task TASK] [--hard_mask_letter HARD_MASK_LETTER] [--inverse_mask]

figleaf_fasta: apply hard/soft mask to FASTA file or exclude/extract sub-sequences

optional arguments:
  -h, --help            show this help message and exit

Input:
  -fi FASTA_INPUT, --fasta_input FASTA_INPUT
                        Filepath to input fasta file to be masked (required)
  -r RANGES_PATH, --ranges_path RANGES_PATH
                        Two-column tsv file with rows containing 0-indexed end-exclusive ranges to be masked/excluded/extracted (required)

Output:
  -fo FASTA_OUTPUT, --fasta_output FASTA_OUTPUT
                        Filepath for masked output fasta file (required)

Task:
  --task TASK           "hard_mask","soft_mask","exclude","extract" (default: hard_mask)

Mask:
  --hard_mask_letter HARD_MASK_LETTER
                        Letter to represent hard_mask regions (N or X) (default: N)
  --inverse_mask        If flag is provided, all except mask ranges will be masked

The same arguments are required when using the figleaf function within a Python script, except that the start and end positions can be provided either as a filepath (ranges_path) or as a Python list (ranges_list).
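As a sketch of preparing input for the command-line interface, the snippet below writes a two-column TSV ranges file and assembles a command using only the flags listed in the help text. The filenames (`ranges.tsv`, `input.fasta`, `masked.fasta`) and the coordinates are placeholders, and the file is assumed to have no header row; check the files in the example/ directory for the exact expected layout.

```python
import csv
import shlex

# Hypothetical 0-indexed, end-exclusive ranges (start, end).
ranges = [(0, 50), (200, 250)]

# Write one tab-separated "start<TAB>end" row per range (no header row assumed).
with open("ranges.tsv", "w", newline="") as handle:
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerows(ranges)

# Assemble the command from flags shown in the help output; it is printed, not run.
command = [
    "figleaf",
    "-fi", "input.fasta",
    "-r", "ranges.tsv",
    "-fo", "masked.fasta",
    "--task", "soft_mask",
]
print(shlex.join(command))
```

The printed command can be pasted into a shell, or passed to `subprocess.run(command, check=True)` once figleaf_fasta is installed and the input FASTA exists.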

Example

To generate example output in the example/ directory, run:
python figleaf_fasta.py or bash figleaf_fasta.sh

License

MIT License
