A Haskell script that performs hierarchical filtering on a tab-delimited file across groups of lines based on user-defined hierarchical filtering string.

Matthew-Mosior/Representative-Sample-Chooser

Representative-Sample-Chooser: A Hierarchical Filtering Tool

Introduction

Representative-Sample-Chooser (RSC) is a computational tool for hierarchically filtering tab-delimited files.
This Haskell script takes an identifier field string, a hierarchical filter string, and a TSV file, and captures the best record for each identifier.

RSC outputs a single, filtered tsv file on successful exit, or an error message/log depending on the issue (more on this later).

Prerequisites

rsc.hs assumes that you have the GHC compiler installed, along with the packages that rsc.hs imports. The easiest way to get these is to download the Haskell Platform.

Installing required packages

To install the additional packages rsc.hs requires, run the following command. It assumes that cabal, a package manager and build system for Haskell, is installed on your system (it comes with the Haskell Platform).

$ cabal install [packagename]

Required packages

  • Control.DeepSeq
  • Data.ByteString
  • Data.ByteString.Char8
  • Data.ByteString.Lazy
  • Data.Char
  • Data.Functor
  • Data.List
  • Data.List.Split
  • Data.Ord
  • System.Console.GetOpt
  • System.Process
  • System.Environment
  • System.Exit
  • System.IO
  • System.IO.Temp
  • Text.PrettyPrint.Boxes
  • Text.Regex.TDFA

Input

RSC requires three inputs:

  1. Identifier Field String - This string names the field (column) whose values identify each group of records. One best record is chosen per identifier.

The Identifier Field String has the following structure:
;[Identifier Field String];
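
As a rough illustration of how records might be grouped by the identifier field, here is a minimal sketch. It is not taken from rsc.hs: the function names are hypothetical, rows are assumed to be already split into fields, and every row is assumed to have at least as many fields as the header.

```haskell
import Data.Function (on)
import Data.List (elemIndex, groupBy, sortOn)

-- Strip the ';' delimiters from an Identifier Field String.
identifierField :: String -> String
identifierField = filter (/= ';')

-- Group data rows by the identifier column named in the header row.
-- Returns Nothing when the header has no such column.
-- Assumption: every row has at least as many fields as the header.
groupByIdentifier :: String -> [String] -> [[String]] -> Maybe [[[String]]]
groupByIdentifier idString header rows = do
  col <- elemIndex (identifierField idString) header
  pure (groupBy ((==) `on` (!! col)) (sortOn (!! col) rows))
```

Because sortOn is stable, rows that share an identifier keep their original relative order inside each group.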

  2. Hierarchical Filter String - This string defines a hierarchy of fields on which to filter. Within each field, a hierarchy of values defines how the filtering occurs (see the examples below).

The Hierarchical Filter String has the following structure:
;[FIELDNAME1]:[FIELDVALUE1]>=[FIELDVALUE2]>=...>=[FIELDVALUEN];[FIELDNAME2]:[FIELDVALUE1]>=[FIELDVALUE2]>=...>=[FIELDVALUEN];

The order in which the user provides the field names and their corresponding field values is important: the first field name specified is the first field used for comparison, and so on. The order of these fields in the input TSV file is irrelevant.
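
The structure above can be parsed with a few lines of Haskell. The following is a minimal sketch using only base, not the parser used by rsc.hs. All type and function names are hypothetical, and it assumes that numeric fields are written as FLOAT.>, FLOAT.<, INT.> or INT.< (only FLOAT.> appears in this README's examples).

```haskell
import Data.List (stripPrefix)

-- How candidates are compared within one field (illustrative names).
data NumType = FloatField | IntField deriving (Show, Eq)
data Pick = Maximum | Minimum deriving (Show, Eq)

data FieldSpec
  = Ranked [String]        -- explicit values, best first (v1>=v2>=...)
  | Numeric NumType Pick   -- numeric field, largest or smallest wins
  deriving (Show, Eq)

-- Split a string on every occurrence of a non-empty separator.
splitOn :: String -> String -> [String]
splitOn sep = go
  where
    go s = case breakOn s of
      (chunk, Nothing)   -> [chunk]
      (chunk, Just rest) -> chunk : go rest
    breakOn [] = ([], Nothing)
    breakOn s@(c:cs)
      | Just rest <- stripPrefix sep s = ([], Just rest)
      | otherwise = let (chunk, r) = breakOn cs in (c : chunk, r)

-- Interpret the part after the ':' of one field entry.
-- Assumption: the keyword and key character are joined by a period.
parseSpec :: String -> FieldSpec
parseSpec "FLOAT.>" = Numeric FloatField Maximum
parseSpec "FLOAT.<" = Numeric FloatField Minimum
parseSpec "INT.>"   = Numeric IntField Maximum
parseSpec "INT.<"   = Numeric IntField Minimum
parseSpec values    = Ranked (splitOn ">=" values)

-- Parse a whole Hierarchical Filter String, preserving field order.
parseFilter :: String -> [(String, FieldSpec)]
parseFilter str =
  [ (name, parseSpec (drop 1 rest))
  | entry <- splitOn ";" str
  , not (null entry)
  , let (name, rest) = break (== ':') entry
  ]
```

The result is an ordered list, so the first element is the first field used for comparison, matching the ordering rule described above.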

The following keywords can be used to identify fields which are numeric in nature (float or int):

  • Float -> FLOAT
  • Int -> INT

The following key characters are combined with the above keywords, separated by a period (for example, FLOAT.>), to define the type of comparison:

  • Maximum -> >
  • Minimum -> <

If the user has not specified all possible values for a given field, the program will exit early and write a file named rsc_ERROR.log, which lists every field with missing values and what those missing values are, using the following columns:
Name_of_Field_(Column) Values_Found_in_Input_TSV_Not_Found_in_Hierarchical_Filter_String Values_Found_in_Hierarchical_Filter_String_Not_Found_in_Input_TSV
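
The two value columns of this log amount to a pair of set differences. Here is a minimal sketch of that check for a single field. The function name is hypothetical and this is not the code in rsc.hs.

```haskell
import Data.List (nub, (\\))

-- For one field, report
--   (values found in the TSV but not in the filter string,
--    values found in the filter string but not in the TSV).
missingValues :: [String] -> [String] -> ([String], [String])
missingValues tsvValues filterValues =
  (observed \\ declared, declared \\ observed)
  where
    observed = nub tsvValues     -- distinct values seen in the column
    declared = nub filterValues  -- distinct values listed by the user
```

For example, missingValues ["T1","T2","T4","T1"] ["T1","T2","T3"] evaluates to (["T4"],["T3"]).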

  3. TSV file - This is the tab-delimited file on which the hierarchical filtering will occur.

Usage

rsc.hs is easy to use.

You can call it using the runghc command provided by the GHC compiler as such:
$ runghc rsc.hs -o name_of_filtered_file.tsv ";UPN_clinical;" ";Clinical_T_sequenced:T1>=T2>=T3>=T4>=T5>=T6>=T7>=T9>=Tn>=NA;model_group_reagent:combined_exome_capture>=exome>=capture_v2>=capture;common_name:relapse flow sorted>=relapse>=tumor>=normal;mean_depth:FLOAT.>;" ../path/to/input/file.tsv

For maximum performance, please compile and run the source code as follows:
$ ghc -O2 -o RSC rsc.hs
$ ./RSC -o name_of_filtered_file.tsv ";UPN_clinical;" ";Clinical_T_sequenced:T1>=T2>=T3>=T4>=T5>=T6>=T7>=T9>=Tn>=NA;model_group_reagent:combined_exome_capture>=exome>=capture_v2>=capture;common_name:relapse flow sorted>=relapse>=tumor>=normal;mean_depth:FLOAT.>;" ../path/to/input/file.tsv

Arguments

RSC accepts a few different command-line arguments:

Representative Sample Chooser, Copyright (c) 2020 Matthew Mosior.
Usage: rsc [-vV?o] [Identifier Field String] [Hierarchical Filter String] [TSV file]

  -v          --verbose             Output on stderr.
  -V, -?      --version             Show version number.
  -o OUTFILE  --outputfile=OUTFILE  The output file to which the results will be printed.
              --nonexhaustive       First sample will be returned for identifiers with
                                    non-exhaustive hierarchical filtering values.
              --help                Print this help message.

The -v (verbose) option provides a full error message.
The -V (version) option shows the version of rsc in use.
The -o (outputfile) option specifies the file to which the filtered lines will be printed.
The --nonexhaustive option prints the first record for every identifier for which non-exhaustive filtering occurred.
Finally, the --help option outputs the help message seen above.

Some Examples

The following examples will help illustrate the way the hierarchical filtering algorithm chooses a best record for each given identifier.

The following two examples assume the following inputs:

Identifier Field String: ;Sample_Group_ID;
Hierarchical Filter String: ;Time_point:T1>=T2>=T3;Type_of_data:complex>=simple>=NA;Data_depth:FLOAT.>;

For simplicity's sake, each of the following examples illustrates the hierarchical filtering on a single identifier.

Example 1:

This example will illustrate a scenario where a single record is returned (user-defined hierarchical filter determined it was the best record for said identifier).

The hierarchical filtering starts on the most important field as described by the Hierarchical Filter String, Time_point.

Line  Sample_Group_ID  Time_point  Type_of_data  Data_depth
1     200ABC           T1 (tied)   simple        100.19
2     200ABC           T1 (tied)   complex       65.32
3     200ABC           T1 (tied)   complex       106.78

There is a three-way tie between all three lines due to the values in the Time_point field, so the filtering moves on to the next most important field as described by the Hierarchical Filter String, Type_of_data, and all three lines are still being compared.

Line  Sample_Group_ID  Time_point  Type_of_data    Data_depth
1     200ABC           T1 (tied)   simple          100.19
2     200ABC           T1 (tied)   complex (tied)  65.32
3     200ABC           T1 (tied)   complex (tied)  106.78

There is a two-way tie between lines 2 and 3 due to the values in the Type_of_data field, so the filtering then moves on to the next most important field as described by the Hierarchical Filter String, Data_depth, and is restricted to just lines 2 and 3.

Line  Sample_Group_ID  Time_point  Type_of_data    Data_depth
1     200ABC           T1 (tied)   simple          100.19
2     200ABC           T1 (tied)   complex (tied)  65.32
3     200ABC           T1 (tied)   complex (tied)  106.78 (wins)

Because Data_depth is defined as FLOAT.> in the Hierarchical Filter String, the largest float value in that field wins the comparison between lines 2 and 3.

So line 3 is the chosen record for this identifier.
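
The narrowing walked through in Example 1 can be sketched as a fold over the fields of the filter string. This is a minimal illustration under stated assumptions, not the algorithm as implemented in rsc.hs: the names Record, Rule, score, keepBest and chooseBest are hypothetical, every record is assumed to contain each filtered field, and numeric values are assumed to parse as floats.

```haskell
import Data.List (elemIndex)
import Data.Maybe (fromMaybe)

-- (column name, value) pairs for one TSV line.
type Record = [(String, String)]

data Rule
  = Ranked [String]  -- explicit values, best first
  | MaxFloat         -- FLOAT.> : largest value wins
  | MinFloat         -- FLOAT.< : smallest value wins

-- Score a record on one field; a lower score is better.
-- Values missing from a Ranked list score worse than any listed value.
score :: String -> Rule -> Record -> Double
score field rule rec =
  case rule of
    Ranked vs -> fromIntegral (fromMaybe (length vs) (elemIndex value vs))
    MaxFloat  -> negate (read value)
    MinFloat  -> read value
  where
    value = fromMaybe "" (lookup field rec)

-- Keep only the records that tie for the best score on one field.
keepBest :: String -> Rule -> [Record] -> [Record]
keepBest _ _ [] = []
keepBest field rule recs = [r | (s, r) <- scored, s == best]
  where
    scored = [(score field rule r, r) | r <- recs]
    best   = minimum (map fst scored)

-- Apply the fields in order; a single survivor is the best record.
chooseBest :: [(String, Rule)] -> [Record] -> Maybe Record
chooseBest rules recs =
  case foldl (\cands (f, rule) -> keepBest f rule cands) recs rules of
    [winner] -> Just winner
    _        -> Nothing

-- The filter string and the three lines from Example 1.
exampleRules :: [(String, Rule)]
exampleRules =
  [ ("Time_point", Ranked ["T1", "T2", "T3"])
  , ("Type_of_data", Ranked ["complex", "simple", "NA"])
  , ("Data_depth", MaxFloat)
  ]

example1 :: [Record]
example1 =
  [ [("Sample_Group_ID", "200ABC"), ("Time_point", "T1"), ("Type_of_data", "simple"),  ("Data_depth", "100.19")]
  , [("Sample_Group_ID", "200ABC"), ("Time_point", "T1"), ("Type_of_data", "complex"), ("Data_depth", "65.32")]
  , [("Sample_Group_ID", "200ABC"), ("Time_point", "T1"), ("Type_of_data", "complex"), ("Data_depth", "106.78")]
  ]
```

Evaluating chooseBest exampleRules example1 returns the third record, the one with Data_depth 106.78, mirroring the walkthrough above. When more than one record survives every field, the sketch returns Nothing.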

Example 2:

This example will illustrate a scenario where no record is returned (user-defined hierarchical filter could not determine a best record for said identifier).

The hierarchical filtering starts on the most important field as described by the Hierarchical Filter String, Time_point.

Line  Sample_Group_ID  Time_point  Type_of_data  Data_depth
1     200ABC           T1 (tied)   complex       101.10
2     200ABC           T1 (tied)   complex       101.10
3     200ABC           T1 (tied)   complex       101.10

There is a three-way tie between all three lines due to the values in the Time_point field, so the filtering moves on to the next most important field as described by the Hierarchical Filter String, Type_of_data, and all three lines are still being compared.

Line  Sample_Group_ID  Time_point  Type_of_data    Data_depth
1     200ABC           T1 (tied)   complex (tied)  101.10
2     200ABC           T1 (tied)   complex (tied)  101.10
3     200ABC           T1 (tied)   complex (tied)  101.10

There is a three-way tie between all three lines due to the values in the Type_of_data field, so the filtering moves on to the next most important field as described by the Hierarchical Filter String, Data_depth, and all three lines are still being compared.

Line  Sample_Group_ID  Time_point  Type_of_data    Data_depth
1     200ABC           T1 (tied)   complex (tied)  101.10 (tied)
2     200ABC           T1 (tied)   complex (tied)  101.10 (tied)
3     200ABC           T1 (tied)   complex (tied)  101.10 (tied)

There is a three-way tie between all three lines due to the values in the Data_depth field, so there is no best record for this identifier.

By default, in this scenario, no record will be returned for this identifier.

In this scenario, the --nonexhaustive option (optional) will grab the first record for this identifier and return it.

The user could also provide additional field(s) and corresponding values in the Hierarchical Filter String in the hope of breaking the remaining ties.
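
The default and --nonexhaustive outcomes described above can be summarised in a small function. This is a sketch only: the name resolve is hypothetical, the Bool stands in for the --nonexhaustive flag, and it assumes that the "first record" is the first record of the identifier's group in input order.

```haskell
import Data.Maybe (listToMaybe)

-- Decide what to return for one identifier.
--   nonexhaustive: stands in for the --nonexhaustive flag
--   group:         all records for the identifier, in input order
--   survivors:     records still tied after every field was compared
resolve :: Bool -> [a] -> [a] -> Maybe a
resolve nonexhaustive group survivors =
  case survivors of
    [winner] -> Just winner             -- a single best record exists
    _ | nonexhaustive -> listToMaybe group  -- fall back to the first record
      | otherwise     -> Nothing        -- default: return no record
```

With the three tied lines of Example 2, resolve False returns Nothing and resolve True returns line 1.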

Docker

A Docker image containing all the software needed to run RSC is available: matthewmosior/representativesamplechooser:final

Credits

Documentation was added April 2020.
Author : Matthew Mosior
