Understanding Filtering: Learn to harness FAT's DSL to Power Filtering Schemes

Matthew Mosior edited this page Aug 23, 2021 · 20 revisions

Introduction

Filtering Analysis Tool (FAT) uses its own domain-specific language (DSL) to define, process, and apply the filtering scheme specified in the configuration YAML to the input tab-delimited (TSV) file.

Operations

The DSL FAT uses to apply filtering encompasses several built-in operators.

Arithmetic Operators

The following arithmetic operators are supported by the DSL:

  • + : Addition
  • - : Subtraction
  • / : Division
  • | : Null (do nothing)

A multiplication operator is slated to be added in the next release.
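
As an illustration only, the arithmetic operators listed above can be modeled as a small lookup table. The Python sketch below is not FAT's implementation: the name `apply_arithmetic` is hypothetical, and treating `|` as "return the first value unchanged" is an assumption made for this example.

```python
import operator

# Hypothetical model of the DSL's arithmetic operators (not FAT's own code).
ARITHMETIC = {
    "+": operator.add,
    "-": operator.sub,
    "/": operator.truediv,
}

def apply_arithmetic(op, x, y=None):
    """Combine x and y with op; '|' (null) is assumed to leave x untouched."""
    if op == "|":
        return x
    if op not in ARITHMETIC:
        raise ValueError(f"unsupported arithmetic operator: {op!r}")
    if y is None:
        raise ValueError(f"operator {op!r} needs two values")
    return ARITHMETIC[op](x, y)
```

Under these assumptions, `apply_arithmetic("+", 32, 54)` adds the two pieces of a two-piece value, while `apply_arithmetic("|", 5)` passes a one-piece value through unchanged.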

Relational Operators

The following relational operators are supported by the DSL:

  • >= : Greater than or equal to
  • <= : Less than or equal to
  • == : Equal to
  • /= : Not equal to
  • =~ : Regex

Strict less than and greater than operators are slated to be added in the next release.
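
To make the semantics of the relational operators concrete, here is a minimal Python sketch. It is not taken from FAT: the name `compare` is hypothetical, and the assumption that `=~` succeeds when the pattern matches anywhere in the value is ours.

```python
import operator
import re

# Hypothetical model of the DSL's relational operators (not FAT's own code).
RELATIONAL = {
    ">=": operator.ge,
    "<=": operator.le,
    "==": operator.eq,
    "/=": operator.ne,
    # Assumption: '=~' succeeds when the pattern matches anywhere in the value.
    "=~": lambda value, pattern: re.search(pattern, value) is not None,
}

def compare(op, value, target):
    """Return True if `value op target` holds for a supported operator."""
    if op not in RELATIONAL:
        raise ValueError(f"unsupported relational operator: {op!r}")
    return RELATIONAL[op](value, target)
```

Because strict `<` and `>` are not yet supported, this sketch rejects them with a `ValueError` rather than accepting them silently.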

Supported Types of Data

Data in the following structures are supported by the DSL:

  • x,y : This structure should be used for two-piece data (e.g. 32,54)
  • y,x : This structure should be used for two-piece data, but will treat the second element first. (e.g. 54,32)
  • x,_ : This structure should be used for two-piece data, but will ignore the second element. (e.g. 32,54 -> 32)
  • _,y : This structure should be used for two-piece data, but will ignore the first element. (e.g. 32,54 -> 54)
  • x : This structure should be used for one-piece data (e.g. 2.241 or missense)

Other structures can be added and supported; please create an issue to request one.
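
The structures above can be read as rules for which pieces of a cell are kept, and in what order. The Python sketch below illustrates one such reading. It is not FAT's parser: the name `select_elements` is hypothetical, and the assumptions are that pieces are separated by a comma and that `y,x` simply returns the second piece before the first.

```python
def select_elements(cell, column_type):
    """Return the pieces of `cell` selected by `column_type`, in processing order."""
    if column_type == "x":
        # One-piece data is passed through as-is.
        return (cell,)
    parts = cell.split(",")
    if len(parts) != 2:
        raise ValueError(f"expected two comma-separated pieces, got {cell!r}")
    first, second = parts
    layouts = {
        "x,y": (first, second),  # both pieces, in order
        "y,x": (second, first),  # both pieces, second one first (assumed reading)
        "x,_": (first,),         # ignore the second piece
        "_,y": (second,),        # ignore the first piece
    }
    if column_type not in layouts:
        raise ValueError(f"unsupported column type: {column_type!r}")
    return layouts[column_type]
```

Values are returned as strings here; converting them to numbers is left to whichever comparison is applied afterwards.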

Defining Filtering Schemes in the Configuration YAML

The configuration YAML provides a modular, extensible way to define the filtering schemes and to customize the resulting XLSX file.

The filtering schemes are defined under the filters key, as a list in which each entry is an associative array describing one filter.

For example,

filters:
   - filtering_type: 'BINARY'
     filtering_column: 'tumor_exome_day0_var_count'
     filtering_column_type: 'x'
     filtering_operator: '|'
     filtering_string:
       bfs_numeric_operator: '>='
       bfs_numeric_number: '5'
   - filtering_type: 'BINARY'
     filtering_column: 'normal_exome_day0_VAF'
     filtering_column_type: 'x'
     filtering_operator: '|'
     filtering_string:
       bfs_numeric_operator: '<='
       bfs_numeric_number: '4.99'
   - filtering_type: 'BINARY'
     filtering_column: 'Capture_Val_Status'
     filtering_column_type: 'x'
     filtering_operator: '|'
     filtering_string:
       bfs_string_operator: '=='
       bfs_string_literal:
         - 'PASS'
   - ...

Each filters entry is made up of the following fields:

  • filtering_type : BINARY or TRINARY
    • BINARY filters are used to calculate the overall filtering status of each row; TRINARY filters are not.
    • BINARY filters are pass/fail; TRINARY filters split values into three categories.
    • By default, BINARY filters are rendered so that passing values appear in cells with a green background and failing values appear in cells with a red background (the colors can be changed).
  • filtering_column : ColumnName
    • ColumnName must be the name of a column/field that exists within the input tab-delimited file.
  • filtering_column_type : ColumnType
    • ColumnType must be one of the following:
      • x
      • x,y
  • filtering_operator : FilteringOperator
    • FilteringOperator must be one of the following:
      • +
      • -
      • /
      • |
  • filtering_string : FilteringString
    • FilteringString will have one of the following sets of associative key/value pairs:
      • bfs_string : This type of filtering string should be used for string-based comparisons:
        • ==
        • /=
        • =~
      • bfs_numeric : This type of filtering string should be used for arithmetic comparisons:
        • ==
        • >=
        • <=
      • tfs_string : This type of filtering string should be used for string-based comparisons:
        • ==
        • /=
        • =~
      • tfs_numeric : This type of filtering string should be used for arithmetic comparisons:
        • ==
        • >=
        • <=

The only difference between a bfs and tfs filtering_string is that bfs is used to denote a binary filtering string and tfs is used to denote a trinary filtering string.
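
To show how the fields of a BINARY entry fit together, the following Python sketch evaluates entries (as they would look once the YAML is loaded into dictionaries) against a single row. It is an illustration under stated assumptions, not FAT's implementation: the names `passes_binary` and `row_status` are hypothetical, only one-piece `x` columns with the `|` operator are handled, a string filter is assumed to compare the value against every literal in `bfs_string_literal`, and a row is assumed to pass overall only when every BINARY filter passes.

```python
import operator
import re

# Numeric comparisons listed for bfs_numeric filtering strings.
NUMERIC_OPS = {"==": operator.eq, ">=": operator.ge, "<=": operator.le}

def passes_binary(entry, row):
    """Return True if `row` passes one BINARY entry (one-piece 'x' columns only)."""
    value = row[entry["filtering_column"]]
    spec = entry["filtering_string"]
    if "bfs_numeric_operator" in spec:
        compare = NUMERIC_OPS[spec["bfs_numeric_operator"]]
        return compare(float(value), float(spec["bfs_numeric_number"]))
    op = spec["bfs_string_operator"]
    literals = spec["bfs_string_literal"]
    if op == "==":
        return value in literals      # assumption: any literal may match
    if op == "/=":
        return value not in literals  # assumption: no literal may match
    if op == "=~":
        return any(re.search(pattern, value) for pattern in literals)
    raise ValueError(f"unsupported string operator: {op!r}")

def row_status(filters, row):
    """Assumed overall status: every BINARY filter must pass; TRINARY is ignored."""
    binary = [f for f in filters if f["filtering_type"] == "BINARY"]
    return all(passes_binary(f, row) for f in binary)

# Two of the entries from the example configuration, as loaded dictionaries.
FILTERS = [
    {"filtering_type": "BINARY", "filtering_column": "tumor_exome_day0_var_count",
     "filtering_column_type": "x", "filtering_operator": "|",
     "filtering_string": {"bfs_numeric_operator": ">=", "bfs_numeric_number": "5"}},
    {"filtering_type": "BINARY", "filtering_column": "Capture_Val_Status",
     "filtering_column_type": "x", "filtering_operator": "|",
     "filtering_string": {"bfs_string_operator": "==", "bfs_string_literal": ["PASS"]}},
]

# A made-up input row, used only for illustration.
ROW = {"tumor_exome_day0_var_count": "7", "Capture_Val_Status": "PASS"}
print(row_status(FILTERS, ROW))
```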

The filtering string must correspond to the filtering_type: a BINARY filter takes a bfs filtering string, and a TRINARY filter takes a tfs filtering string.

Example:

- filtering_type: 'BINARY'
  ...
  ...
  ...
  filtering_string:
    bfs_numeric_operator: '>='
    bfs_numeric_number: '5'
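
The correspondence between filtering_type and the filtering string can be checked mechanically before running a filter. The Python sketch below is one hypothetical way to do so on an entry loaded into a dictionary. The name `filtering_string_matches_type` is ours, and the check relies only on the `bfs`/`tfs` key prefixes described on this page.

```python
# Key prefix expected for each filtering_type, per the bfs/tfs naming convention.
EXPECTED_PREFIX = {"BINARY": "bfs_", "TRINARY": "tfs_"}

def filtering_string_matches_type(entry):
    """Return True if every filtering_string key carries the prefix for its type."""
    prefix = EXPECTED_PREFIX[entry["filtering_type"]]
    return all(key.startswith(prefix) for key in entry["filtering_string"])
```

An entry declared as TRINARY but written with bfs keys would fail this check.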

Keep in mind that a column/field in the tab-delimited file can only be subject to a single filter.
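
A configuration can be checked against this one-filter-per-column rule before it is used. The Python sketch below is a hypothetical pre-flight check, not part of FAT; `duplicated_columns` is a name chosen for this example, and the entries are assumed to be already loaded into dictionaries.

```python
from collections import Counter

def duplicated_columns(filters):
    """Return the sorted names of columns targeted by more than one filter."""
    counts = Counter(entry["filtering_column"] for entry in filters)
    return sorted(name for name, n in counts.items() if n > 1)
```

An empty result means every column is filtered at most once.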