hepcat72/vcfSampleCompare

vcfSampleCompare.pl

WHAT IS THIS:

This script answers the question "What's different?" between my samples. It does this by sorting and filtering the variant records of a VCF file (containing data for 2 or more samples) based on the differences in the variant data between samples. The end result is a file containing the variants that show the biggest difference at the top and variants with no or little difference at the bottom or filtered out. Think of it as the genetic variant analog of a differential gene expression analysis. It solves the problem of finding "what's different" (for example) between wildtype and mutant samples.

Note, this tool does not perform a statistical analysis. It is only meant to highlight whether differences exist between sample groups of interest. It is not intended to be used as "proof" that differences between samples are "real" or biologically relevant.

Degree of "difference" between sample groups is determined by either genotype calls or the difference in the observation ratios (i.e. allelic frequencies "AO/DP") of a particular variant state (e.g. the number of times an 'A' is observed for a variant position over the number of reads that mapped over that position) between each sample group. Run with --help --extended for more details on "degree of difference".

The variant state used to determine the difference between sample groups is the one leading to the greatest difference. If there is a tie between reference (REF) and an alternate (ALT) variant states when computing the difference, the alternate state is the default selected variant state.
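The tie-breaking rule above can be sketched as follows. This is a minimal illustration, not the script's Perl implementation: the function name `pick_state` is hypothetical, and the choice of the lowest-numbered ALT state when several ALT states tie is an assumption that the documentation does not confirm.

```python
def pick_state(scores):
    """Return the index of the variant state with the highest score.

    scores[0] belongs to the REF state (state 0); scores[1:] belong to
    the ALT states (1, 2, ...). On a tie between REF and an ALT state,
    the ALT state is preferred.
    """
    best = max(scores)
    tied = [i for i, score in enumerate(scores) if score == best]
    alt_states = [i for i in tied if i != 0]
    # Assumption: among tied ALT states, take the lowest-numbered one.
    return alt_states[0] if alt_states else 0


print(pick_state([0.8, 0.8, 0.1]))  # REF and ALT 1 tie, so ALT 1 is chosen
```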

If multiple pairs of sample groups are provided, the pair used to represent the difference for a variant row is the one leading to the greatest degree of difference in genotype calls or observation ratios between sample groups.

If sample groups are not specified, the pair of sample groups leading to the greatest difference is greedily generated.

The sort is intended to bring the variants with the greatest degree of difference in genotype calls or allelic frequencies between sample groups to the top.

See --help --extended for more details.

DETAILS

This script works with VCF files generated by freeBayes (for SNP and small nucleotide variants) and svTyper (for structural variants). However, it will also work with any VCF data produced by other tools as long as the output includes GT and/or (AO and RO), and DP tags in the FORMAT data column.

Each row in a VCF file will be assumed to represent a variant (or variant position). In the context of this script, there are two ways to look at differences among the samples: genotype calls and the ratio of observations of a particular variant out of the total observations. We'll refer to this as either "observation ratios" or "allelic frequencies" throughout this documentation.

SORTING

Sorting is based on genotype call scores (BEST_GT_SCORE), observation ratio scores (BEST_OR_SCORE), and read depth scores (BEST_DP_SCORE). These score columns will always be between 0 and 1*.

When in --nogenotype mode, sorting ignores the BEST_GT_SCORE values.

* When --gap-measure edge is supplied, BEST_OR_SCORE can be negative (between -1 and 0) if the observation ratios of the 2 groups overlap/mix, i.e. there is no gap separating all observation ratios of one group from those of the other. A score of -1 is applied when all values in one group are bounded by the range of values in the other group.

SORTING METRICS

All 3 scores on which sorting is based (BEST_DP_SCORE, BEST_GT_SCORE, and BEST_OR_SCORE) by default, are weighted in a 100:10:1 ratio, respectively. The weighted sum of the scores is used for a descending sort. To alter the weightings, see the --*-sort-weight options in the advanced usage (by running vcfSampleCompare.pl --extended 2).
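As a rough illustration of the default 100:10:1 weighting, the sketch below computes the weighted sum for a few made-up rows and sorts them in descending order. The row values, the `id` field, and the function name `sort_key` are invented for the example; only the column names and default weights come from the text above.

```python
# Default sort weights described above (DP : GT : OR = 100 : 10 : 1).
WEIGHTS = {"BEST_DP_SCORE": 100, "BEST_GT_SCORE": 10, "BEST_OR_SCORE": 1}


def sort_key(row, weights=WEIGHTS):
    """Weighted sum of the 3 score columns for one variant row."""
    return sum(weight * row[name] for name, weight in weights.items())


# Hypothetical rows, for illustration only.
rows = [
    {"id": "v1", "BEST_DP_SCORE": 1.0, "BEST_GT_SCORE": 0, "BEST_OR_SCORE": 0.9},
    {"id": "v2", "BEST_DP_SCORE": 1.0, "BEST_GT_SCORE": 1, "BEST_OR_SCORE": 0.2},
    {"id": "v3", "BEST_DP_SCORE": 0.5, "BEST_GT_SCORE": 1, "BEST_OR_SCORE": 1.0},
]

ranked = sorted(rows, key=sort_key, reverse=True)
print([row["id"] for row in ranked])
```

With these weights, a row with full read depth outranks a row with half the adequate depth even when the latter has perfect genotype and observation ratio scores.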

The degree of difference between sample groups is based on the degree of difference of the genotype calls (BEST_GT_SCORE) and/or observation ratios (BEST_OR_SCORE) between the sample groups reported. In both cases, BEST_GT_SCORE and BEST_OR_SCORE are intended to reflect the maximum "degree of difference" between sample groups.

However, with regard to BEST_GT_SCORE, by default, scores are treated in a binary fashion. The genotype calls of a variant either fully discriminate the 2 sample groups (a score of 1) or not (a score of 0). See the advanced usage (by running vcfSampleCompare.pl --extended --extended) to allow partially discriminating variants using the --minimum-gt-score option. A score of 0 can mean either that the genotype calls did not fully separate the sample groups or that none of the samples had a genotype call. Despite the common score, cases of no data in all samples will be sorted to the bottom of the output.

The scores under the BEST_OR_SCORE column can be calculated in 1 of 2 ways (see --gap-measure): either by treating the collection of observation ratios for a single variant state (e.g. a SNP value of "C") as 2 ranges of separate or overlapping ratios (all between 0 and 1), or by comparing the average observation ratios of the 2 groups. Each variant state is considered independently and the one resulting in the highest BEST_OR_SCORE is the one that is used to represent the row.

Consider SNPs for example. A SNP can have 1 of 4 values (A, T, G, or C). One of those is the reference state (RO), which is referred to as state 0. The other three states are 1, 2, and 3. For each state, the observation ratios are computed as (RO or AO)/DP for all of the samples. The state resulting in the highest score (i.e. the largest gap between the 2 ranges of observation ratios) is what is reported under the BEST_OR_SCORE column (see -u). If there is a tie between RO and one of the AO values, the AO state (1, 2, or 3) is preferred. For example, if state 1 is the state used to calculate the BEST_OR_SCORE, the ratio for each sample of AO(1)/DP is computed and those values for each group are used to compute the difference between the groups.
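The 2 gap measures can be sketched as below. This is an illustrative reading, not the script's own code: the function names are hypothetical, and the treatment of partial overlap (the negative length of the overlapping range) is an assumption chosen because it reproduces the 2 worked examples in the --gap-measure usage text.

```python
def mean_gap(group1, group2):
    """'mean' measure: absolute difference of the group means (0 to 1)."""
    mean1 = sum(group1) / len(group1)
    mean2 = sum(group2) / len(group2)
    return abs(mean1 - mean2)


def edge_gap(group1, group2):
    """'edge' measure: distance between the nearest edges of the 2 ranges.

    Positive when the ranges are separated, negative when they overlap,
    and -1 when one range contains the other.
    """
    lo1, hi1 = min(group1), max(group1)
    lo2, hi2 = min(group2), max(group2)
    if (lo1 <= lo2 and hi2 <= hi1) or (lo2 <= lo1 and hi1 <= hi2):
        return -1.0
    # Assumption: partial overlap is scored as minus the overlap length.
    return max(lo2 - hi1, lo1 - hi2)


# Example 1 from the usage text: separated ranges.
print(round(edge_gap([0.1, 0.2, 0.3], [0.7, 0.8, 0.9]), 6))
print(round(mean_gap([0.1, 0.2, 0.3], [0.7, 0.8, 0.9]), 6))
# Example 2 from the usage text: partially overlapping ranges.
print(round(edge_gap([0.1, 0.5, 0.7], [0.3, 0.7, 0.9]), 6))
print(round(mean_gap([0.1, 0.5, 0.7], [0.3, 0.7, 0.9]), 6))
```

The results are rounded only to hide floating-point noise in the printed values.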

The BEST_DP_SCORE values are calculated as a degree of adequate depth (see --adequate-depth), where each sample/variant whose DP is at or above the adequate-depth gets a score of 1 and anything less is a value between 0 and 1.
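The depth score formula given in the -x usage text (the smaller of the 2 group averages, with each sample's depth capped at the adequate depth) can be sketched like this. The function names are hypothetical; the default of 20 matches the documented --adequate-depth default.

```python
def group_depth_score(depths, adequate_depth=20):
    """Average of per-sample depths, each capped at the adequate depth,
    expressed as a fraction of the adequate depth (0 to 1)."""
    capped = [min(depth, adequate_depth) for depth in depths]
    return sum(capped) / (adequate_depth * len(depths))


def pair_depth_score(depths1, depths2, adequate_depth=20):
    """Depth score of a pair of sample groups: the lower group score."""
    return min(
        group_depth_score(depths1, adequate_depth),
        group_depth_score(depths2, adequate_depth),
    )


# Group 1 is fully covered; group 2 has one sample at half the adequate depth.
print(pair_depth_score([30, 20, 25], [10, 20, 40]))
```

In this example the pair score is limited by group 2, whose capped depths (10, 20, 20) average to 50/60 of the adequate depth.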

If run in --nogenotype mode, the values in the BEST_OR_SCORE column of the output (see -u) reflect the only measure of degree of difference in the observation ratios (a.k.a. "allelic frequencies").

SAMPLE GROUP CONSTRUCTION AND ITS EFFECT ON SORTING

Sample group membership is what determines the BEST_GT_SCORE and BEST_OR_SCORE, which are used for sorting variant rows. Thus, how groups are defined affects the sort of the variants. The intention is to end up with variants at the top of the file that differ the most between sample groups of interest. If you are primarily interested in the difference between pre-defined groups, setting those groups will cause the greatest differences between them to occur at the top of the results.

The composition of the sample groups can be user-defined or dynamically generated and depends on the --sample-group/-s, --min-group-size/-d, and --grow/--nogrow parameters. Sample groups are constructed dynamically by default. Since the BEST_[GT/OR]_SCORE column values are used for sorting and it is the GT, AO, RO, and DP values of the members of the sample groups that determines those values, parameters which affect the sample groupings will affect the sort. The default behavior is to create 2 sample groups dynamically (unless --sample-group/-s is provided).

USER-DEFINED SAMPLE GROUPS

If you are interested in a comparison between 2 specific sets of samples, you can define them on the command line using 2 instances of the --sample-group parameter. For example, if you have 3 wildtype replicates (wt1, wt2, and wt3) and 3 mutant replicates (mut1, mut2, and mut3), you can define them on the command-line like this:

vcfSampleCompare.pl --sample-group 'wt1 wt2 wt3' --sample-group 'mut1 mut2 mut3' input.vcf

The values used must match the vcf file headers for those samples (appearing after the 'FORMAT' column header). Each list is space-delimited and each group must be wrapped in quotes.

You can have multiple sample group pairs. Each pair must occur in tandem (e.g. -s 'pair1/group1...' -s 'pair1/group2...' -s 'pair2/group1...' -s 'pair2/group2...'). When given multiple pairs of sample groups to compare, only the pair resulting in the best score (i.e. biggest difference in BEST_GT_SCORE or BEST_OR_SCORE) will be reported. In the event of a tie, the first pair defined on the command line will be the one reported.
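The tandem pairing and the tie rule can be illustrated with a short sketch. It covers only the case of an even number of groups (the single-group and no-group cases are handled differently, as described in the usage for -s), and both function names are hypothetical.

```python
def pair_up(groups):
    """Pair sample groups in the order given: (1st, 2nd), (3rd, 4th), ..."""
    if len(groups) % 2:
        raise ValueError("this sketch expects an even number of sample groups")
    return [(groups[i], groups[i + 1]) for i in range(0, len(groups), 2)]


def best_pair(pair_scores):
    """Return the 1-based number of the best-scoring pair.

    list.index returns the first match, so ties go to the pair that was
    defined first on the command line.
    """
    return pair_scores.index(max(pair_scores)) + 1


pairs = pair_up([["wt1", "wt2"], ["mut1", "mut2"], ["wt1", "wt2"], ["mut3", "mut4"]])
print(pairs)
print(best_pair([0.4, 0.9, 0.9]))  # pairs 2 and 3 tie; pair 2 is reported
```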

By default, all samples in each group are used to calculate the score for the pair of groups.

Alternatively, you may be interested in variants where at least N samples of group 1 differ (by genotype call or by at least the --separation-gap) from at least M samples of group 2. In that case, use the --min-group-size/-d parameter to supply the minimum number of members for each group (N and M for groups 1 and 2 respectively). Note, either N or M must represent at least half the members of its group in order to avoid the introduction of noise.

Each group is sorted by the overall genotype call abundance of all samples or by their observation ratios ((AO or RO)/DP). 2 attempts are made to create each pair of groups. In the first attempt, the first member of group 1 is the lowest in its sorted list and the first member of group 2 is the highest in its sorted list. The second attempt is the reverse: highest from group 1 and lowest from group 2.

If --grow is true (which is the default), vcfSampleCompare will try to add other samples from the defined groups (beyond the supplied N and M values), 1 member per iteration, in a greedy fashion. The group selected to grow at each iteration, using the next member from the end of the sorted list it is being constructed from, is the one resulting in the highest separation gap. Groups stop growing when all members of the group have been added, when adding a sample would introduce a genotype call that is common between the groups, or when adding a sample would cause the separation gap to drop below the --separation-gap threshold. For example:

vcfSampleCompare.pl --sample-group 'wt1 wt2 wt3' --min-group-size 3 --sample-group 'mut1 mut2 mut3' --min-group-size 1 --separation-gap 0.6 --grow --nogenotype input.vcf

In this example, the separation score will be based on all wildtype samples and at least 1 mutant sample, though up to 3 mutant samples will be included in the mutant group as long as the resulting genotype calls between the groups differ or the gap between the observation ratios of the 2 groups is greater than or equal to 0.6.

If --nogrow is supplied, group sizes will reflect the --min-group-size. Groups will not grow beyond the member samples defined using --sample-group.

Note, when multiple pairs of groups are submitted on the command line, each variant/row may be composed of a different pair of groups. The selected pair is intended to represent the groupings that have greatest difference for that variant. If there are multiple pairs of sample groups that are of interest, it is recommended to run vcfSampleCompare.pl multiple times, one for each pairing. Supplying multiple pairs of groups is intended to be used when you do not know which samples have variants of interest.

The minimum possible score for user-defined sample groups is either -1 for scoring method 'edge' or 0 for scoring method 'mean'. With the 'edge' scoring method, negative numbers indicate the degree of overlap/mix of the observation ratios. A score of -1 means that one group's range of observation ratios contains the other.

DYNAMIC CREATION OF SAMPLE GROUPS

If no --sample-group(s) or --min-group-size(s) are supplied, 2 sample groups are constructed in the following manner: samples are sorted by their overall genotype call abundance or observation ratios and the sample at each end of the sorted list seeds 2 initial groups: one from the beginning of the list and one from the end (even if all genotype calls or observation ratios are the same). Unless --nogrow is supplied (or until each group's --min-group-size is reached), the next sample (from either end of the remaining sorted list) that results in the largest separation gap (e.g. the difference between the averages of the observation ratios in the 2 groups) or continues to not produce common genotype calls between the 2 groups is added to its end-group. Samples continue to be added to the groups, one by one, until the next sample added would cause a common genotype call between the groups or the difference in their observation ratios would go below the --separation-gap threshold (or in the case of --nogrow: until both groups' --min-group-size is reached).
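The greedy procedure above can be sketched for the --nogenotype case with the mean gap measure. This is a simplified illustration under stated assumptions, not the script's algorithm: it ignores --min-group-size, read depth, and genotype calls, the function names are hypothetical, and the sample ratios are invented.

```python
def mean(values):
    return sum(values) / len(values)


def grow_groups(ratios, separation_gap=0.7):
    """Greedily split samples into a low group and a high group.

    ratios maps sample name -> observation ratio for one variant state.
    Requires at least 2 samples.
    """
    ordered = sorted(ratios, key=ratios.get)
    low, high = [ordered[0]], [ordered[-1]]  # seed from each end of the list
    remaining = ordered[1:-1]

    def gap(low_names, high_names):
        return mean([ratios[s] for s in high_names]) - mean(
            [ratios[s] for s in low_names]
        )

    while remaining:
        grow_low = gap(low + [remaining[0]], high)
        grow_high = gap(low, high + [remaining[-1]])
        # Stop when even the better addition would fall below the threshold.
        if max(grow_low, grow_high) < separation_gap:
            break
        if grow_low >= grow_high:
            low.append(remaining.pop(0))
        else:
            high.append(remaining.pop())
    return low, high


# Hypothetical observation ratios for one variant state.
ratios = {"wt1": 0.0, "wt2": 0.02, "wt3": 0.04,
          "mut1": 0.9, "mut2": 1.0, "mut3": 0.5}
low, high = grow_groups(ratios, separation_gap=0.85)
print(low, high)
```

With a threshold of 0.85, the intermediate sample mut3 is left out of both groups because adding it to either side would pull the gap between the group means below the threshold.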

Up to 2 --min-group-size's can be supplied, but must not sum to more than the total number of samples. The default --min-group-size is 1 when --sample-group is not provided.

Each grouping is independent for each variant (i.e. each row in the VCF file). Thus, when groups are dynamically created, each variant/row may be composed of different groups. The selected groups are intended to represent the binary sample division that represents the greatest difference for that variant between 2 sample groups. If there are more than 2 types of samples (e.g. multiple treatments and a control), dynamic creation of sample groups can miss significant differences and it is recommended to supply sample groupings in multiple runs of vcfSampleCompare.pl.

The minimum possible score when sample groups are dynamically created is 0, since the groups are created from a list sorted by either the genotype call or observation ratios (thus the difference in those values will always be positive).

FILTERING

There are 2 options that can be used to filter variants that do not contain differences between the sample groups (--separation-gap and --min-group-size) and the usage of those parameters for filtering depends on the genotype mode (--genotype/--nogenotype).

In --genotype mode, you can use --min-group-size as a reporting threshold. There must be this many members in a sample group with genotype calls that differ from all members of the other group. If the sample groups share a common genotype call, the row will not be printed.

In --nogenotype mode the threshold is the combination of --separation-gap and --min-group-size. If the difference in the observation ratios between the minimum-size sample groups is less than the separation gap threshold, the row will not be printed.
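The 2 filtering rules can be summarized in a small sketch. It assumes the minimum-size groups have already been chosen and their genotype calls or separation gap computed; the function names are hypothetical, and treating a group without any calls as failing the filter is an assumption.

```python
def passes_genotype_filter(calls1, calls2):
    """--genotype mode: keep the row only if the 2 minimum-size groups
    have no genotype call in common."""
    if not calls1 or not calls2:
        return False  # assumption: a group without calls fails the filter
    return not (set(calls1) & set(calls2))


def passes_ratio_filter(gap, separation_gap=0.7):
    """--nogenotype mode: keep the row only if the gap between the
    minimum-size groups reaches the --separation-gap threshold."""
    return gap >= separation_gap


print(passes_genotype_filter(["0/0", "0/0"], ["1/1", "0/1"]))
print(passes_genotype_filter(["0/0", "0/1"], ["0/1"]))
print(passes_ratio_filter(0.6))
```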

Note: Minimum group sizes are dynamically created regardless of whether there exists a sample grouping that would pass the filter. When --nofilter is supplied, sample groups that fail the filter will have a BEST_GT_SCORE or BEST_OR_SCORE of 0 and the sample grouping and size will be meaningless.

EXAMPLE

To sort based on the degree of difference between specific groups of samples, those groups can be defined on the command line using --sample-group/-s. You can specify a minimum number of samples in the groups to differ. So for example, say you have 3 wildtype (WT) replicates and you would like to see differences that all 3 WT samples have with any one of a set of 10 mutant samples. You would do that on the command line using the sample names:

vcfSampleCompare.pl -s "wt1 wt2 wt3" -d 3 -s "m1 m2 m3 m4 m5 m6 m7 m8 m9 m10" -d 1 input.vcf --gap-measure mean

The largest difference that the mean observation ratio of the WT samples has with 1 of the mutant samples will be at the top of the results.

INSTALL

cd into the vcfSampleCompare directory and run the following:

perl Makefile.PL
make
sudo make install

USAGE (SHORT)

vcfSampleCompare.pl -i <file*...>... [OPTIONS]

* <file*...>...       VCF input file.
  -o <sfx>            VCF outfile suffix (appended to -i).
  -u <sfx>            [STDOUT] Summary outfile suffix (appended to -i).
  -s <str ...>...     [any^] A group of sample names for difference comparisons.
                      ^ See --extended usage.
  -d <int,...>...     [all*] Minimum number of samples to use in a group to
                      determine difference with its partner.
  -a <flt>            [0.7] Minimum observation ratio difference [0-1].
  --no-g              Do not use genotype calls for sorting/filtering.
  --no-f              Do not filter variant rows.
  --no-w              Do not add samples to sample groups beyond the --min-
                      group-size.
  -l <int>            [4] Minimum read depth (DP).
  -x <int>            [20] Adequate read depth (DP).
  --help              Print general info and file formats.
  --extended [<cnt>]  Print detailed usage.

USAGE (LONG)

vcfSampleCompare.pl -i <file*...>... [OPTIONS]
vcfSampleCompare.pl [OPTIONS] < input_file

* <file*...>...             VCF input file generated either by freeBayes or
  -i <file*...>...          svTyper.
  < <file>
  -i <stub> < <file>        See --help for file format.

  -o <sfx>                  Outfile suffix appended to file names supplied to [-
                            i].  See --help for file format.

  -u <sfx>                  [STDOUT] Summary outfile suffix (appended to -i).
                            This file can contain a row for each variant row in
                            the input VCF file, and include the sorting value
                            and pairs of sample groups that were identified as
                            different.  This file may optionally be a filtered
                            version of the rows in the input VCF file (-i).

                            See --help for file format.

                            Mutually exclusive with [--outfile] (both options
                            specify an outfile name in different ways for the
                            same output).

  -s <str ...>...           [any^] This option must be supplied an even number
  --sample-group            of times (or once* or 0 times**).  Each pair of
    <str ...>...            samples groups, in order, is compared to determine
                            the maximum difference between the groups.  For
                            example, if you have 3 wildtype samples and 4 mutant
                            samples, you can define these 2 groups using -s 's1
                            s2 s3' -s 's4 s5 s6 s7' (where 's1', 's2', and 's3'
                            are the wildtype samples and 's4', 's5', 's6', and
                            's7' are mutant samples).  (All sample names must
                            match the sample names in the VCF column headers
                            row.)  The differences in variant states between
                            these groups of samples will be used to sort the
                            variants/rows of the VCF file.  See --extended
                            --help for a description of how degree of difference
                            is calculated.

                            ^ If no sample groups are supplied, a default pair
                            of samples that are the most different on any
                            particular row will be chosen.
                            * If only one group is defined, the second group is
                            assumed to be the remainder of the samples.
                            ** If no groups are defined, groups are dynamically
                            determined for each variant/row.  See --help
                            --extended for details.

  -d <int,...>...           [all*] Each sample group defined by -s is
  --min-group-size          accompanied by a (minimum) number of samples in that
    <int,...>...            group with which to compute the maximum difference
                            against its partner group.  Each instance of -s
                            should have a -d value supplied.  The order of the -
                            d values should correspond to the order of the -s
                            sample groups they apply to.  When -s is supplied,
                            the default for each group is the group size, but a
                            smaller number can be specified.  The purpose is best
                            shown by example.  If you have 5 mutant samples and
                            3 replicate wildtype samples, you may want to find
                            variants where 1 or more mutants differ from all 3
                            wildtype samples, thus -d for the mutant group would
                            be '1' and -d for the wildtype group would be '3'.
                            In order to produce meaningful results, one group in
                            each pair of groups must get a value that is larger
                            than half the group size.  See --help --extended for
                            details on how this affects variant sorting, sample
                            group growing, and filtering.

                            When -s is supplied, setting -d to 0 will
                            automatically set the minimum group size to the
                            number of samples in the corresponding -s sample
                            group.

                            * If -s is not supplied, -d defaults to 1 for each
                            of the 2 dynamically created sample groups.

  -a <flt>                  [0.7] The difference between observation ratios
  --separation-gap <flt>    (i.e. allelic frequencies) of 2 sample groups
  --minimum-or-score <flt>  (defined by -s and -d) for a given variant state
                            (e.g. a SNP value of "A"), as calculated using the
                            --gap-measure method, must be at least this value in
                            order for a variant to be retained (see
                            --filter|--nofilter) or for a sample to be added to
                            a sample group (see --grow|--nogrow).  The
                            separation gap (or difference between observation
                            ratios) can be calculated for every variant state
                            (e.g. SNP values of "A", "T", "G", or "C" have
                            separate ratios for each sample).  The state which
                            produces the largest gap is the one that is used to
                            filter variants and add samples to sample groups.
                            Note that if only 1 or 0 samples have data, it will
                            be filtered regardless of this threshold.  Use
                            --nofilter to retain cases of too little data.  See
                            --help --extended for more details.

  -m <mean,edge>            [mean] Method to measure the gap between the
  --gap-measure <mean,      observation ratios (i.e. "allelic frequencies") of 2
    edge>                   sample groups.

                            Using each sample group's mean observation ratio to
                            measure the gap between sample groups is done by
                            taking the absolute difference of the mean
                            observation ratio of group 1 versus group 2,
                            resulting in a value between 0 and 1.

                            Using the (nearest or most overlapping) edge method
                            results in a number between -1 and 1, where values
                            between 0 and 1 represent the difference between the
                            closest observation ratios (when there is no
                            overlap) and values between -1 and 0 represent the
                            maximum degree of overlap of the range of
                            observation ratios.  Note, if the range of one group
                            contains the range of another, the score is -1.

                            Example 1 (observation ratios [AO/DP] for SNP state
                            "G" for 2 sample groups of size 3):

                            Group1 ratios: 0.1, 0.2, 0.3
                            Group2 ratios: 0.7, 0.8, 0.9
                            Edge Score: 0.4
                            Mean score: 0.6

                            Example 2 (observation ratios [AO/DP] for SNP state
                            "G" for 2 sample groups of size 3):

                            Group1 ratios: 0.1, 0.5, 0.7
                            Group2 ratios: 0.3, 0.7, 0.9
                            Edge Score: -0.4
                            Mean score: 0.2

  --no-g                    Do not use the genotype call (i.e. the 'GT' value in
  --nogenotype              the FORMAT string) for sorting & filtering rows, or
                            creating/growing sample groups.  Instead, use
                            observation ratios.  See --nogrow and --nofilter.
                            See --help --extended for details.

  --no-f                    Variants/rows are filtered based on the
  --nofilter                characteristics of the best sample group pair, such
                            as the size of the sample groups (see --min-group-
                            size and --grow), the depth of read coverage (see
                            --minimum-depth), and either their genotype score or
                            observation ratio score (see --separation-gap).
                            Supply this option to not filter variant rows whose
                            best sample group pair does not meet these
                            thresholds.  See --help --extended for more details.

  --no-w                    If the --min-group-size is less than the number of
  --nogrow                  samples in a group defined by --sample-group, by
                            default the script will keep adding samples to the
                            groups (from their remaining members) as long as (if
                            --nogenotype is supplied) the comparison of the
                            observation ratios (see --gap-measure) between the
                            groups is greater than -a or (if --genotype is
                            supplied) the genotype call is the same as that of
                            the current members or is different from all partner
                            group genotypes.  Note, this may lower the sort
                            order of a variant/row when --nogenotype is
                            supplied.

  -l <int>                  [4] Minimum read depth (DP).  Samples whose DP value
  --minimum-depth <int>     is below this threshold will not be added to sample
                            groups (see -d) when --grow is true.  Note however
                            that user-defined sample groups (see -s) still
                            include samples with depths below the minimum depth,
                            but variants/rows for which all samples in either
                            sample group have DP scores below a minimum score
                            based on this threshold, will be omitted from the
                            results (when --filter is true).  The minimum DP
                            score is calculated in the following manner:

                            S_m = D_m / D_a

                            where:

                            S_m = Minimum depth score
                            D_m = Minimum depth
                            D_a = Adequate depth

                            If you wish to only down-weight low-depth samples
                            and include them in dynamically generated group
                            pairs, use -x.

  -x <int>                  [20] Adequate read depth (DP).  This is the depth at
  --adequate-depth <int>    which all samples with greater depth are treated as
                            equally capable at yielding confident results.
                            Individual samples whose DP value is below this
                            threshold will have down-weighted depth scores
                            (between 0 and 1).  Samples at or above this depth
                            will have a depth score of 1.  Depth scores for
                            pairs of sample groups are reported under
                            BEST_DP_SCORE and PAIR_DP_SCORE (see -u).  The depth
                            score calculated for a pair of sample groups is
                            calculated in the following manner:

                            S = MIN(SUM[i=1..n](D_i > D_a ? D_a : D_i) / (D_a *
                            n),
                            SUM[i=1..m](D_i > D_a ? D_a : D_i) / (D_a * m))

                            where:

                            S   = Depth score
                            n   = Number of samples in group 1
                            m   = Number of samples in group 2
                            i   = Sample number
                            D_i = Depth of sample i
                            D_a = Adequate depth

                            If you wish to filter low-depth samples, use -l.
    ...

INPUT FORMAT (-i)

A VCF file is a plain text, tab-delimited file. The format is generally described here: http://bit.ly/2sulKcZ and described in detail here: http://bit.ly/2gKP5bN

However, the important parts that this script relies on are:

  1. The column header line (in particular - looking for the FORMAT and sample name columns).
  2. The colon-delimited codes in the FORMAT column values, specifically (for SNP data produced by freeBayes and Structural Variant data produced by SVTyper) AO (the number of reads supporting the variant), RO (the number of reads supporting the reference), and DP (the number of reads that map at or over the variant position).
  3. The colon-delimited values in the sample columns that correspond to the positions defined in the FORMAT column.
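To make the 3 points above concrete, the sketch below pulls the FORMAT keys and per-sample values out of one VCF record and computes the observation ratios (RO/DP for state 0, AO/DP for each ALT state). It is an illustration only: the function names and the example record are invented, and it assumes that AO holds a comma-separated count per ALT allele, as in freeBayes output.

```python
def parse_samples(header_line, record_line):
    """Map each sample name to a dict of FORMAT key -> value string."""
    columns = header_line.lstrip("#").rstrip("\n").split("\t")
    fields = record_line.rstrip("\n").split("\t")
    format_idx = columns.index("FORMAT")
    keys = fields[format_idx].split(":")
    samples = {}
    for name, value in zip(columns[format_idx + 1:], fields[format_idx + 1:]):
        samples[name] = dict(zip(keys, value.split(":")))
    return samples


def observation_ratios(sample):
    """Return [RO/DP, AO_1/DP, ...] or None when the data is unusable."""
    try:
        depth = int(sample["DP"])
        counts = [int(sample["RO"])] + [int(a) for a in sample["AO"].split(",")]
    except (KeyError, ValueError):
        return None  # missing tag or a '.' placeholder
    if depth <= 0:
        return None
    return [count / depth for count in counts]


# Hypothetical header and record lines.
header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\twt1\tmut1"
record = "chr1\t100\t.\tA\tG\t50\t.\t.\tGT:DP:RO:AO\t0/0:20:20:0\t1/1:10:1:9"

samples = parse_samples(header, record)
for name, data in samples.items():
    print(name, data["GT"], observation_ratios(data))
```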

The file may otherwise be a standard VCF file containing header lines preceded by '##'. Empty lines are OK and will be printed regardless of parameters supplied to this script. Note, the --header and --no-header flags of this script do not refer to the VCF file's header, but rather the run info header of this script.

OUTPUT FORMAT: (-o)

The output file is the same format as the input VCF files, except sorted differently and possibly filtered.

OUTPUT FORMAT: (--outfile, -u)

Tab-delimited file of variants that are sorted by, and optionally filtered on, the degree of difference between the sample groups in a pair. Sorting and filtering are based on the values in the BEST_GT_SCORE and BEST_OR_SCORE columns. The columns of the file are:

  • CHROM - Chromosome - The chromosome on which the variant is located.
  • POS - Position - The position starting from 1 where the variant is located.
  • REF - Reference Value - The value the reference has in the variant position.
  • ALT - Alternate Value(s) - The value(s) observed in the samples in the variant position.
  • BEST_PAIR - Best Sample Group Pair Number - The sample group pair's number (numbered from left to right, as they were supplied on the command line) for the pair that resulted in the biggest difference in variant states between the sample groups. If -s was not used to pre-define sample groups, this value will always be 1, though the pair of sample groups is selected independently for each row.
  • BEST_GT_SCORE - Best Genotype Score - The value the primary sorting criterion is based on, which is the maximum PAIR_GT_SCORE.
  • BEST_OR_SCORE - Best Observation Ratio Score - The value the secondary sorting criterion is based on, which is the maximum PAIR_OR_SCORE.
  • BEST_DP_SCORE - Best Depth Score - The read depth score of the best sample group pair. The read depth score of the best pair is based on the lower average read depth of the 2 sample groups in the pair. See the usage for -x and -l to see how the score is calculated.
  • PAIR_NUM - Pair Number - A colon-delimited list of numbers indicating the pair of sample groups the sort and filtering is based on.
  • PAIR_GT_SCORE - Pair Genotype Score - A colon-delimited list of each sample group pair's maximum GT score.
  • PAIR_OR_SCORE - Pair Observation Ratio Score - A colon-delimited list of each sample group pair's maximum observation ratio score.
  • PAIR_DP_SCORE - Pair Read Depth Score - A colon-delimited list of each sample group pair's read depth score. The read depth score for each pair is based on the lower average read depth of the 2 sample groups in the pair. See the usage for -x and -l to see how the score is calculated.
  • STATES_USED_GT - Genotype Calls Used - The genotype calls used to calculate the BEST_GT_SCORE. There will be at least 2 genotype calls separated by a semicolon. If there are multiple genotype calls present in 1 sample group, they will be delimited with a "+".
  • STATE_USED_OR - Variant State Used to Calculate Observation Ratios - The state used can be 0 (the value of the variant at the variant position in the reference), or a number indicating which of the ALT observations produced the BEST_OR_SCORE. E.g. if the state '0' was used to compute the scores (indicating the REF state), then (by default) the average observation ratio of one group will be close to N/N (i.e. the same as the reference) and that of the other group will be close to 0/N (i.e. different from the reference).
  • GROUP1_SAMPLES - Sample Group 1 Members - A comma-delimited list of sample names belonging to group 1. Lists of group 1 samples from multiple pairs of sample groups will be colon-delimited. E.g. For 2 pairs, the value might be: "s1,s2:s6,s7".
  • GROUP1_GTS - Sample Group 1 Genotype Calls - A list of comma-delimited genotype calls. Lists of group 1 genotype calls from multiple pairs of sample groups will be colon-delimited.
  • GROUP1_ORS - Sample Group 1 Observation Ratios - A list of comma-delimited observation ratios based on the variant state in STATE_USED_OR. Lists of group 1 observation ratios from multiple pairs of sample groups will be colon-delimited.
  • GROUP2_SAMPLES - Sample Group 2 Members - A comma-delimited list of sample names belonging to group 2. Lists of group 2 samples from multiple pairs of sample groups will be colon-delimited. E.g. For 2 pairs, the value might be: "s1,s2:s6,s7".
  • GROUP2_GTS - Sample Group 2 Genotype Calls - A list of comma-delimited genotype calls. Lists of group 2 genotype calls from multiple pairs of sample groups will be colon-delimited.
  • GROUP2_ORS - Sample Group 2 Observation Ratios - A list of comma-delimited observation ratios based on the variant state in STATE_USED_OR. Lists of group 2 observation ratios from multiple pairs of sample groups will be colon-delimited.
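
The nested delimiters described in the column list (colons between sample group pairs, commas within a group, and semicolons and "+" in STATES_USED_GT) can be unpacked as in the Python sketch below. The helper names split_pairs and split_states_used_gt are hypothetical and are not part of vcfSampleCompare.pl. The sketch also assumes STATES_USED_GT describes a single pair, as it does in the example output.

```python
def split_pairs(value):
    """Split a colon-delimited, per-pair field into lists of comma-delimited items."""
    return [pair.split(",") for pair in value.split(":")]


def split_states_used_gt(value):
    """Split STATES_USED_GT into one list of genotype calls per sample group.

    Groups are separated by ';' and multiple calls within a group by '+'.
    """
    return [group.split("+") for group in value.split(";")]


# "s1,s2:s6,s7" is the GROUP1_SAMPLES example given for 2 pairs.
print(split_pairs("s1,s2:s6,s7"))    # [['s1', 's2'], ['s6', 's7']]
print(split_states_used_gt("1;0"))   # [['1'], ['0']]
```

The same split_pairs helper applies to the GROUP1_GTS, GROUP1_ORS, GROUP2_GTS, and GROUP2_ORS columns, since they use the same colon and comma delimiters.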

Example:

#CHROM	POS	ID	REF	ALT	BEST_PAIR	BEST_GT_SCORE	BEST_OR_SCORE	BEST_DP_SCORE	PAIR_NUM	PAIR_GT_SCORE	PAIR_OR_SCORE	PAIR_DP_SCORE	STATES_USED_GT	STATE_USED_OR	GROUP1_SAMPLES	GROUP1_GTS	GROUP1_ORS	GROUP2_SAMPLES	GROUP2_GTS	GROUP2_ORS
Chromosome	6610	.	C	G	1	1	1	408	1	1	1	408	1;0	1	sample1	1	254/254	sample2,sample3	0,0	0/407,0/564
Chromosome	10723	.	C	G	1	1	1	111	1	1	1	111	1;0	1	sample1	1	39/39	sample2,sample3	0,0	0/78,0/216
Chromosome	10843	.	T	C	1	1	1	33	1	1	1	33	1;0	1	sample1	1	8/8	sample2,sample3	0,0	0/25,0/67
Chromosome	10855	.	T	G	1	1	1	24	1	1	1	24	1;0	1	sample1	1	9/9	sample2,sample3	0,0	0/17,0/47
Chromosome	10866	.	TCCTG	CCCTA	1	1	1	21	1	1	1	21	1;0	1	sample1	1	6/6	sample2,sample3	0,0	0/15,0/42
Chromosome	10876	.	C	G	1	1	1	20	1	1	1	20	1;0	1	sample1	1	8/8	sample2,sample3	0,0	0/13,0/38
Chromosome	10888	.	G	C	1	1	1	19	1	1	1	19	1;0	1	sample1	1	8/8	sample2,sample3	0,0	0/11,0/37
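
To show how the columns above can be consumed downstream, the Python sketch below reads the first two example rows and averages the observation ratios of each group. It assumes a single sample group pair per row (no colons in the ORS columns). It also compares the absolute difference between the two group means with BEST_OR_SCORE; that relationship holds for these example rows but is only an assumption here, so see --help --extended for how the score is actually defined. The function names are hypothetical.

```python
import csv
import io

HEADER = ["#CHROM", "POS", "ID", "REF", "ALT", "BEST_PAIR", "BEST_GT_SCORE",
          "BEST_OR_SCORE", "BEST_DP_SCORE", "PAIR_NUM", "PAIR_GT_SCORE",
          "PAIR_OR_SCORE", "PAIR_DP_SCORE", "STATES_USED_GT", "STATE_USED_OR",
          "GROUP1_SAMPLES", "GROUP1_GTS", "GROUP1_ORS",
          "GROUP2_SAMPLES", "GROUP2_GTS", "GROUP2_ORS"]

# The first two rows of the example output above.
ROWS = [
    ["Chromosome", "6610", ".", "C", "G", "1", "1", "1", "408", "1", "1", "1",
     "408", "1;0", "1", "sample1", "1", "254/254",
     "sample2,sample3", "0,0", "0/407,0/564"],
    ["Chromosome", "10723", ".", "C", "G", "1", "1", "1", "111", "1", "1", "1",
     "111", "1;0", "1", "sample1", "1", "39/39",
     "sample2,sample3", "0,0", "0/78,0/216"],
]
example = "\n".join("\t".join(fields) for fields in [HEADER] + ROWS) + "\n"


def mean_ratio(ors_field):
    """Average the comma-delimited 'observed/depth' ratios of one sample group."""
    ratios = []
    for item in ors_field.split(","):
        observed, depth = item.split("/")
        # Treat a zero depth as a ratio of 0 (an assumption made for this sketch).
        ratios.append(float(observed) / float(depth) if float(depth) else 0.0)
    return sum(ratios) / len(ratios)


def group_mean_differences(text):
    """Return (POS, |mean group 1 ratio - mean group 2 ratio|) for each row."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [(row["POS"],
             abs(mean_ratio(row["GROUP1_ORS"]) - mean_ratio(row["GROUP2_ORS"])))
            for row in reader]


for pos, difference in group_mean_differences(example):
    print(pos, difference)
```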
