Versatile FASTA/FASTQ sequence file analysis and modification tool

BioInf-Wuerzburg/SeqFilter

Usage:
      SeqFilter <fa/fq>
      SeqFilter [filter] <fa/fq> --out <fa/fq>
      cat <fa/fq> | SeqFilter [filter] --out - | ...

Options:
    -o|--out <FILE/-> [OFF]
    -c|--stdout
        Write output to FILE. Use -c (or "--out -") for STDOUT.

    --stats <FILE> [-]
        Print stats to file. Default STDOUT or STDERR if STDOUT is in use.

    --ids <FILE/-/IDLIST>
        File of sequence IDs or literal list of IDs to be reported. Reads
        comma-, whitespace- and newline-separated lists. Leading '>' or '@'
        characters are ignored.

          SeqFilter fasta.fa --ids "seq35,seq49"  # list
          SeqFilter fasta.fa --ids ids.list       # file

    --ids-pattern <FILE/-/PATTERNLIST>
        Match a perl PATTERN, or the path to a file containing multiple
        PATTERNs, one per line, against sequence ids. Keep matching
        sequences.

          SeqFilter seqs.fq --ids-patt 'comp2_c13_seq.*|comp2_c88_seq.*'

        Extended usage: add capture groups to perl P(A)TT(ER)N using "()".
        Splits matched sequences to different output files based on capture.
        Use sprintf conversions (%s, %02d, ...) in --out to define a
        template for the output file names.

          # split seqs by library (LIB1, LIB2, LIB14)
          SeqFilter multilib.fq --ids-pattern 'LIB(\d+)' --out multilib_%02d.fq
          # creates multilib_01.fq, multilib_02.fq, multilib_14.fq

        NOTE: Perl needs to open a filehandle for every split file. This can
        slow things down considerably if you split into more than 1000
        different files and occurrences of the patterns are randomly mixed
        in the source file.
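
        The mapping from capture groups to output file names can be
        sketched in Python as follows. This is only an illustration of the
        idea, not SeqFilter's implementation: the function name
        split_target, the ids and the pattern 'LIB(\d+)' are hypothetical,
        and Python's "%" formatting stands in for perl's sprintf.

```python
import re


def split_target(seq_id, pattern, template):
    """Return the output file name for seq_id, or None if it does not match.

    Hypothetical helper: captures from `pattern` are filled into the
    printf-style `template`, similar to what the help text describes.
    """
    match = re.search(pattern, seq_id)
    if match is None:
        return None
    # Convert purely numeric captures to int so that %d conversions work.
    groups = tuple(
        int(g) if g is not None and g.isdecimal() else g
        for g in match.groups()
    )
    return template % groups


if __name__ == "__main__":
    for seq_id in ("LIB1_read7", "LIB2_read3", "LIB14_read9"):
        print(seq_id, "->", split_target(seq_id, r"LIB(\d+)", "multilib_%02d.fq"))
```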

    --ids-exclude
        Invert the behaviour of --ids and --ids-pattern: exclude matched
        sequences and keep unmatched ones.

    --ids-rename <PATTERN>
        Provide a perl substitution pattern as string. The pattern is
        applied to every id. Use global "$COUNT" to access the output
        sequence counter, $PICARD to access the picard number.

          SeqFilter library_1.fq --ids-rename='s/.*\//sprintf("MYLIB_%05d\/%d", $COUNT, $PICARD)/e'
          # creates ids:
          # @MYLIB_00001/1
          # @MYLIB_00002/1 ...

    --desc-replace
    --desc-append
        Replace or append to the description of sequences. Without an
        argument, the description is removed.

    -l|--min-length <INT>
        Minimum sequence length.

    -L|--max-length <INT>
        Maximum length for sequences to be retrieved. Default off.

    -A|--fasta
    -Q|--fastq <CHAR>
        Convert FASTQ/FASTA to FASTA/FASTQ, respectively.

    -w|--line-width [80 for FASTA]
        FASTA only. Set 0 for single line. Ignored for FASTQ.

    --rc|--rev-comp <FILE/-/LIST>
        File of sequence IDs or list of IDs to be transformed, no argument
        to transform all sequences. Formatting follows the same rules as
        --ids files.
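
        For reference, reverse-complementing a nucleotide sequence can be
        sketched in Python as below. This is a minimal illustration that
        only handles A, C, G, T and N; how SeqFilter treats other IUPAC
        codes is not stated here, and the function name is hypothetical.

```python
# Translation table for the complement of A, C, G, T and N (both cases).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of seq; other characters pass through."""
    return seq.translate(_COMPLEMENT)[::-1]


if __name__ == "__main__":
    print(reverse_complement("AACGTn"))
```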

    --careful [OFF]
        FASTQ only. Check every FASTQ record to have a valid format. '@/+'
        at the start of id lines, sequence and quality of identical length,
        phred within boundaries. Slows down the parsing.
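
        The checks listed above can be sketched in Python roughly as
        follows. The function name, the default offset of 33 and the upper
        phred bound of 93 are assumptions made for this illustration; the
        boundaries SeqFilter actually enforces are not given here.

```python
def check_fastq_record(lines, offset=33, max_phred=93):
    """Return a list of problems found in a 4-line FASTQ record.

    An empty list means the record passed all checks. `offset` and
    `max_phred` are illustrative assumptions, not SeqFilter defaults.
    """
    if len(lines) != 4:
        return ["record does not have exactly 4 lines"]
    header, seq, plus, qual = lines
    problems = []
    if not header.startswith("@"):
        problems.append("id line does not start with '@'")
    if not plus.startswith("+"):
        problems.append("separator line does not start with '+'")
    if len(seq) != len(qual):
        problems.append("sequence and quality differ in length")
    if any(not 0 <= ord(c) - offset <= max_phred for c in qual):
        problems.append("phred value out of bounds")
    return problems


if __name__ == "__main__":
    print(check_fastq_record(["@read1", "ACGT", "+", "IIII"]))
    print(check_fastq_record(["read1", "ACGT", "+", "III"]))
```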

    --lower-case
    --upper-case
        Convert output sequence to lower/upper case.

    --iupac-to-N
        Convert non [ATGCNatgcn] characters to N.

    --phred-offset [auto]
        FASTQ only. Specify Phred offset for quality scores. Default
        auto-detect.

    --phred-transform
        FASTQ only. Transform phreds from the input offset to the specified
        "--phred-transform" offset, usually 33 to 64 or vice versa.
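
        Changing the offset amounts to shifting every quality character
        by the difference between the two offsets. A minimal Python
        sketch, with a hypothetical function name and no range checking:

```python
def transform_phred(qual, src_offset=33, dst_offset=64):
    """Re-encode a FASTQ quality string from src_offset to dst_offset."""
    shift = dst_offset - src_offset
    return "".join(chr(ord(c) + shift) for c in qual)


if __name__ == "__main__":
    # '!' is phred 0 and 'I' is phred 40 at offset 33.
    print(transform_phred("!I"))
```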

    --phred-mask
        FASTQ only. At least two values separated by ",", e.g. "0,10" to mask
        all nucleotides with phred below 10 with an "N". Optional additional
        values control:

          minimum length of unmasked regions
          minimum length of masked regions
          bps to ignore at the ends of masked regions (shorten masked regions to their
            core)
          a ratio that determines whether to mask/unmask terminal regions that are
            smaller than the minimum unmasked region

        NOTE: Requires "--phred-offset".
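
        The basic two-value case can be sketched in Python as below. The
        sketch only covers per-position masking below a threshold; the
        optional region-length and ratio values are not modelled, and the
        function name and default offset of 33 are assumptions.

```python
def mask_low_quality(seq, qual, min_phred=10, offset=33):
    """Replace every base whose phred is below min_phred with 'N'."""
    return "".join(
        base if ord(q) - offset >= min_phred else "N"
        for base, q in zip(seq, qual)
    )


if __name__ == "__main__":
    # '#' is phred 2 and '+' is phred 10 at offset 33.
    print(mask_low_quality("ACGT", "I#I+"))
```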

    --trim-window <INT1>,[<INT2>],[<INT3>]
        FASTQ only. Trim sequences to quality >= SOFT,HARD,SIZE in a sliding
        window, default 10. The sliding window allows positions below the
        SOFT cutoff, provided the window mean is higher than SOFT.
        Qualities below HARD, default 0, always terminate a stretch. It is
        ensured that a) positions with quality below the cutoff only occur
        within the remaining sequence, not at its start/end, and b) windows
        never overlap each other.

    --trim-lcs <INT,INT,INT>
        FASTQ only. Three values separated by ",", e.g. "30,40,50" to grep
        all stretches of quality >= 30 and minimum length 50 from the
        sequences. Faster than "--trim-window" yet breaks sequences even on
        a single low quality position.

        NOTE: "--trim-lcs" and "--trim-window" can be combined, e.g.

          --trim-lcs 5,40,100 --trim-window 10

        will generate sequences with qualities of at least 5 at every
        position and a window mean of 10.
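
        The idea of extracting stretches in which every position reaches
        a minimum quality can be sketched in Python as follows. This is a
        simplified illustration with only two parameters (minimum phred
        and minimum length); it does not reproduce the three-value
        argument of "--trim-lcs", and the function name and offset of 33
        are assumptions.

```python
def high_quality_stretches(seq, qual, min_phred, min_length, offset=33):
    """Return all maximal stretches where every phred is >= min_phred.

    Stretches shorter than min_length are discarded. Assumes seq and
    qual have the same length.
    """
    stretches = []
    start = None  # start index of the stretch currently being extended
    for i, q in enumerate(qual):
        good = ord(q) - offset >= min_phred
        if good and start is None:
            start = i
        elif not good and start is not None:
            if i - start >= min_length:
                stretches.append(seq[start:i])
            start = None
    # Close a stretch that runs to the end of the read.
    if start is not None and len(qual) - start >= min_length:
        stretches.append(seq[start:])
    return stretches


if __name__ == "__main__":
    print(high_quality_stretches("ACGTACGTAC", "III#IIII#I", 30, 3))
```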

    --substr <FILE/-/LIST>
        Pathname to a FILE containing information for subsequence
        extraction/modification. The format is TSV; by default, lines of
        the format ID FROM TO are expected. Lines prefixed with '#' are
        treated as comments and ignored. If --substr-perl-style is set, the
        lines must start with the ID of the read, followed by the substr
        values OFFSET,LENGTH,REPLACESEQ,REPLACEQUAL. The parameters are
        then used as in perl's builtin "substr" function: an OFFSET alone
        is sufficient; a positive value counts from the start of the
        sequence, a negative one from the end; without LENGTH, the
        sequence is returned from OFFSET to its end. REPLACEMENTS are
        introduced at the OFFSET position: if LENGTH is 0, it is a simple
        insertion, otherwise a part is deleted first and the REPLACEMENT is
        then inserted. Substring extraction is performed prior to any other
        trimming. To trim all reads, use '*' instead of the read id. This
        command is performed prior to any individual substr command.

          FROM TO:
            # extract sequence from pos 10 to pos 50
            read1 10 50

          OFFSET [LENGTH [REPLACEMENT]]
            # trim read1 head and tail by 10
            read1   10
            # extract from read2 250 nts starting at pos 15
            read2   15   250
            # replace 1 nt at offset 3 by an "N" with qual "!" (for FASTQ)
            read3   3   1   N   !
            # trim from all reads 5nts at the beginning and the end.
            *   5
            *   -5
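
        The OFFSET/LENGTH/REPLACEMENT semantics of perl's builtin substr,
        as described above, can be sketched in Python like this. The
        sketch mirrors the perl builtin only; how SeqFilter applies it to
        reads (and to the quality string) is not shown, and the function
        name is hypothetical.

```python
def perl_substr(seq, offset, length=None, replacement=None):
    """Emulate perl's substr(EXPR, OFFSET, LENGTH, REPLACEMENT) on a string.

    Without a replacement, the selected substring is returned. With a
    replacement, the modified full string is returned instead.
    """
    n = len(seq)
    # A negative offset counts from the end of the string.
    start = offset if offset >= 0 else max(n + offset, 0)
    start = min(start, n)
    if length is None:
        end = n  # no LENGTH: everything up to the end
    elif length >= 0:
        end = min(start + length, n)
    else:
        end = max(n + length, start)  # negative LENGTH leaves chars at the end
    if replacement is None:
        return seq[start:end]
    return seq[:start] + replacement + seq[end:]


if __name__ == "__main__":
    print(perl_substr("ACGTACGT", 2))          # offset only
    print(perl_substr("ACGTACGT", 2, 3))       # offset and length
    print(perl_substr("ACGTACGT", 3, 1, "N"))  # replace one position
    print(perl_substr("ACGT", 2, 0, "NN"))     # length 0: plain insertion
```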

    --substr-perl-style
        By default, substr information is read according to the format FROM
        TO. Set this flag to switch to the perl substr()-like style
        "OFFSET [LENGTH [REPLACEMENT]]".

    -N|--Nx <INT,INT...>
        Report Nx value (N50, N90...). Default "50,90".
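
        For orientation, the commonly used definition of Nx is the length
        of the shortest sequence in the smallest set of longest sequences
        that together cover at least x% of all bases. A Python sketch of
        that definition follows; whether SeqFilter uses exactly this
        convention is an assumption, and the function name is
        hypothetical.

```python
def nx(lengths, x=50):
    """Return the Nx value (e.g. N50 for x=50) of a list of sequence lengths."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer comparison avoids floating point rounding issues.
        if running * 100 >= total * x:
            return length
    return 0  # empty input


if __name__ == "__main__":
    lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
    print("N50:", nx(lengths, 50))
    print("N90:", nx(lengths, 90))
```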

    -C|--base-composition <BASE(S),BASE(S),BASE(S),...>
        Report relative amount of given bases. Takes a "," separated list;
        each element of the list can be one or more bases (cumulative).

          --base-composition=GC,N        # combined GC and N content
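
        What "one or more bases (cumulative)" means per list element can
        be sketched in Python as below: each element yields the fraction
        of positions that are any of its bases. Case-insensitive counting
        and the function name are assumptions of this sketch.

```python
def base_composition(seq, elements):
    """Return {element: fraction of seq made up of any base in element}."""
    seq = seq.upper()
    total = len(seq)
    result = {}
    for element in elements:
        bases = set(element.upper())
        count = sum(1 for base in seq if base in bases)
        result[element] = count / total if total else 0.0
    return result


if __name__ == "__main__":
    print(base_composition("GGCCAATN", ["GC", "N"]))
```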

    -H|--histogram
        Plot distribution of bases by length as ASCII plot. Uses linear
        scale for data sets with difference in order of magnitude < 2, log
        scale otherwise.

    --[no]-smart-labels
        Toggle shortening filepaths to shortest unique labels.

    -p|--progress
        Display progress bars (equivalent to '--verbose 2').

    -q|--quiet
        Omit all verbose messages. The same as --verbose=0. Supersedes
        --verbose settings.

    --verbose <INT>
        Set the verbosity level. Default is 2, which outputs statistics and
        progress. Set 1 for statistics only or 0 for no verbose output.

    -h|--help
        Display this help

    -V|--version
        Display current version