This page describes basic and advanced usage of GAT.
A list of all command-line options is available via:
gat-run.py --help
The gat tool is controlled via the gat-run.py script. This script requires the following input:
- A set of intervals S with segments of interest to test.
- A set of intervals A with annotations to test against.
- A set of intervals W describing a workspace.
GAT requires bed formatted files. In its simplest form, GAT is then run as:
gat-run.py \
    --segment-file=segments.bed.gz \
    --workspace-file=workspace.bed.gz \
    --annotation-file=annotations.bed.gz
The script recognizes gzip compressed files by the suffix .gz.
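When gat-run.py is driven from another script, the invocation above can be assembled programmatically. The following is a minimal sketch; the helper name build_gat_command is hypothetical, and only the three options shown above are used.

```python
import shlex


def build_gat_command(segments, workspaces, annotations, script="gat-run.py"):
    """Assemble a gat-run.py argument list from lists of bed file names.

    build_gat_command is a hypothetical helper, not part of gat.
    """
    cmd = [script]
    cmd += [f"--segment-file={name}" for name in segments]
    cmd += [f"--workspace-file={name}" for name in workspaces]
    cmd += [f"--annotation-file={name}" for name in annotations]
    return cmd


cmd = build_gat_command(
    ["segments.bed.gz"], ["workspace.bed.gz"], ["annotations.bed.gz"]
)
# Show the command as it would be typed in a shell.
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) on a system where gat is installed.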
The principal output is a tab-separated table of pairwise comparisons between each segments of interest track and each annotations track. The table is written to stdout unless the option --stdout is given with a filename to which output should be redirected.
The main columns in the table are:
- track: the segments of interest track
- annotation: the annotations track
- observed: the observed count
- expected: the expected count based on the sampled segments
- CI95low: the value at the 5% percentile of samples
- CI95high: the value at the 95% percentile of samples
- stddev: the standard deviation of samples
- fold: the fold enrichment, given by the ratio observed / expected
- l2fold: log2 of the fold enrichment value
- pvalue: the p-value of enrichment/depletion
- qvalue: a multiple-testing corrected p-value. See multiple testing correction.
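To make the relationship between the columns concrete, the sketch below derives expected, CI95low, CI95high, stddev, fold and l2fold from one observed count and a list of sampled counts. It is an illustration only: the nearest-rank percentile rule and the use of the population standard deviation are assumptions, and gat's own computation may differ (for example in how it handles an expected count of zero).

```python
import math
import statistics


def summarize(observed, samples):
    """Derive table-style summary values from an observed count and sampled counts."""
    ordered = sorted(samples)
    n = len(ordered)
    expected = statistics.mean(ordered)

    def percentile(q):
        # Nearest-rank percentile; the interpolation rule is an assumption.
        rank = max(1, math.ceil(q / 100 * n))
        return ordered[rank - 1]

    # Fails with ZeroDivisionError if expected is 0; no pseudocount is added here.
    fold = observed / expected
    return {
        "observed": observed,
        "expected": expected,
        "CI95low": percentile(5),
        "CI95high": percentile(95),
        "stddev": statistics.pstdev(ordered),
        "fold": fold,
        "l2fold": math.log2(fold),
    }


row = summarize(observed=40, samples=[10, 20, 20, 30])
print(row)
```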
Additionally, there are the following columns:
- track_nsegments: number of segments in track in segments of interest
- track_size: number of residues covered by track in segments of interest within the workspace
- track_density: fraction of residues covered by track in segments of interest within the workspace
- annotation_nsegments: number of segments in track in annotations
- annotation_size: number of residues covered by track in annotations within the workspace
- annotation_density: fraction of residues covered by track in annotations within the workspace
- overlap_nsegments: number of segments overlapping between segments of interest and annotations
- overlap_size: number of nucleotides overlapping between segments of interest and annotations
- overlap_density: fraction of residues overlapping between segments of interest and annotations within the workspace
- percent_overlap_nsegments_track: percentage of segments in segments of interest overlapping annotations
- percent_overlap_size_track: percentage of nucleotides in segments of interest overlapping annotations
- percent_overlap_nsegments_annotation: percentage of segments in annotations overlapping segments of interest
- percent_overlap_size_annotation: percentage of nucleotides in annotations overlapping segments of interest
- description: additional description of track (requires --descriptions to be set).
Further output files such as auxiliary summary statistics go to files named according to --filename-output-pattern. The argument to --filename-output-pattern should contain one %s placeholder, which is then substituted with section names.
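The placeholder substitution described above behaves like ordinary %-formatting. A small sketch follows; the pattern and the section name "summary" are made up for illustration, and the actual section names are chosen by gat.

```python
def output_filename(pattern, section):
    """Substitute a section name into a pattern holding exactly one %s placeholder."""
    if pattern.count("%s") != 1:
        raise ValueError("pattern must contain exactly one %s placeholder")
    return pattern % section


# Hypothetical pattern and section name.
print(output_filename("gat_output.%s.tsv.gz", "summary"))
```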
Count here denotes the measure of association and defaults to number of overlapping nucleotides.
All of the options --segment-file, --workspace-file, --annotation-file can be used several times on the command line. What happens with multiple files depends on the file type:
- Multiple --segment-file entries are added to the list of segments of interest to test with.
- Multiple --annotation-file entries are added to the list of annotations to test against.
- Multiple --workspace-file entries are intersected to create a single workspace.
Generally, gat will test m segments of interest lists against n annotations lists in all m * n combinations.
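The m * n pairing can be pictured as a Cartesian product of track names. In this sketch the annotation track names are hypothetical; each resulting pair corresponds to one row of the output table.

```python
from itertools import product

segment_tracks = ["segmentset1", "segmentset2"]  # m = 2
annotation_tracks = ["genes", "enhancers", "repeats"]  # n = 3, hypothetical names

# Every segment track is tested against every annotation track.
comparisons = list(product(segment_tracks, annotation_tracks))
for track, annotation in comparisons:
    print(track, annotation)
```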
Within a bed formatted file, different tracks can be separated using a UCSC formatted track line, such as this:
track name="segmentset1"
chr1 23 100
chr3 50 2000
track name="segmentset2"
chr1 1000 2000
chr3 4000 5000
or alternatively, using the fourth column in a bed formatted file:
chr1 23 100 segmentset1
chr3 50 2000 segmentset1
chr1 1000 2000 segmentset2
chr3 4000 5000 segmentset2
The latter takes precedence. The option --ignore-segment-tracks forces gat to ignore the fourth column and consider all intervals to be from a single interval set.
Note
Be careful with bed files in which each interval gets a unique identifier. Gat will interpret each interval as a separate segment set to read. This is usually not intended and causes gat to require a very large amount of memory (see the option --ignore-segment-tracks).
By default, tracks cannot be split over multiple files. The option --enable-split-tracks permits this.
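The two ways of labelling tracks described above can be sketched with a small parser. This is not gat's reader: the function name read_tracks, the "default" track name, and the choice to let ignore_segment_tracks also override track lines are assumptions made for illustration.

```python
import shlex
from collections import defaultdict


def read_tracks(lines, ignore_segment_tracks=False, default_track="default"):
    """Group bed intervals into named tracks.

    A name in the fourth column takes precedence over the enclosing
    UCSC track line. With ignore_segment_tracks=True, all intervals are
    placed in a single set (assumed here to cover track lines as well).
    """
    tracks = defaultdict(list)
    current = default_track
    for line in lines:
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue
        if fields[0] == "track":
            # UCSC track line, e.g.: track name="segmentset1"
            options = dict(
                item.split("=", 1) for item in shlex.split(line)[1:] if "=" in item
            )
            current = options.get("name", default_track)
            continue
        contig, start, end = fields[0], int(fields[1]), int(fields[2])
        if ignore_segment_tracks:
            name = default_track
        elif len(fields) >= 4:
            name = fields[3]
        else:
            name = current
        tracks[name].append((contig, start, end))
    return dict(tracks)


by_line = read_tracks([
    'track name="segmentset1"', "chr1 23 100", "chr3 50 2000",
    'track name="segmentset2"', "chr1 1000 2000", "chr3 4000 5000",
])
by_column = read_tracks([
    "chr1 23 100 segmentset1", "chr3 50 2000 segmentset1",
    "chr1 1000 2000 segmentset2", "chr3 4000 5000 segmentset2",
])
print(by_line == by_column)
```

With a file in which every interval carries a unique name in the fourth column, this parser would create one track per interval, which is the situation the note above warns about.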
Isochores are genomic segments with common properties that are potentially correlated with the segments of interest and the annotations, but where the correlation itself is not of interest. For example, consider a ChIP-Seq experiment that tests whether ChIP-Seq intervals are close to genes. G+C rich regions in the genome are gene rich, while at the same time there is possibly a nucleotide composition bias in the ChIP-Seq protocol depleting A+T rich sequence. An association between genes and ChIP-Seq intervals might simply be due to the G+C effect. Using isochores can control for this effect to some extent.
Isochores split the workspace into smaller workspaces of similar properties, so-called isochore workspaces. Simulations are performed for each isochore workspace separately. At the end, the results for all isochore workspaces are aggregated.
In order to add isochores, use the --isochore-file command line option.
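Splitting a workspace by isochores amounts to intersecting the workspace intervals with labelled isochore intervals. The sketch below works on a single contig with half-open (start, end) coordinates; the labels "low_gc" and "high_gc" and the function name are hypothetical, and gat's internal representation is likely different.

```python
def split_workspace(workspace, isochores):
    """Intersect workspace intervals with labelled isochore intervals.

    workspace: list of (start, end) intervals on one contig
    isochores: list of (start, end, label) intervals on the same contig
    Returns a dict mapping each label to its isochore workspace.
    """
    result = {}
    for w_start, w_end in workspace:
        for i_start, i_end, label in isochores:
            start, end = max(w_start, i_start), min(w_end, i_end)
            if start < end:
                result.setdefault(label, []).append((start, end))
    return result


parts = split_workspace(
    workspace=[(0, 100), (200, 300)],
    isochores=[(0, 40, "low_gc"), (40, 250, "high_gc")],
)
print(parts)
```

Sampling would then be run within each labelled part, and the per-part counts combined afterwards.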
Counters describe the measure of association that is tested. Counters are selected with the command line option --counter. Available counters are:
- nucleotide-overlap: number of bases overlapping [default]
- segment-overlap: number of intervals in the segments of interest overlapping annotations. A single base-pair overlap is sufficient.
- segment-mid-overlap: number of intervals in the segments of interest overlapping annotations at their midpoint.
- annotations-overlap: number of intervals in the annotations overlapping segments of interest. A single base-pair overlap is sufficient.
- segment-mid-overlap: number of intervals in the annotations overlapping segments of interest at their midpoint.
Multiple counters can be given. If only one counter is provided, the output will be to stdout. Otherwise, a separate output file will be created for each counter. The filename can be controlled with the --output-table-pattern option.
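The counters differ only in what is counted. The following sketch implements three of them on a single contig with half-open (start, end) intervals. It assumes that intervals within each list do not overlap one another and that the midpoint is the integer mean of start and end; neither detail is confirmed by the text above.

```python
def nucleotide_overlap(segments, annotations):
    """Number of bases shared between segments and annotations."""
    total = 0
    for s_start, s_end in segments:
        for a_start, a_end in annotations:
            total += max(0, min(s_end, a_end) - max(s_start, a_start))
    return total


def segment_overlap(segments, annotations):
    """Number of segments sharing at least one base with any annotation."""
    return sum(
        any(min(s_end, a_end) > max(s_start, a_start) for a_start, a_end in annotations)
        for s_start, s_end in segments
    )


def segment_mid_overlap(segments, annotations):
    """Number of segments whose midpoint lies inside an annotation."""
    count = 0
    for s_start, s_end in segments:
        mid = (s_start + s_end) // 2  # midpoint definition is an assumption
        if any(a_start <= mid < a_end for a_start, a_end in annotations):
            count += 1
    return count


segments = [(0, 10), (20, 30), (50, 60)]
annotations = [(5, 25), (58, 70)]
print(nucleotide_overlap(segments, annotations))
print(segment_overlap(segments, annotations))
print(segment_mid_overlap(segments, annotations))
# Swapping the arguments counts annotations instead of segments.
print(segment_overlap(annotations, segments))
```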
By default, gat returns the empirical p-value based on the sampling procedure. The minimum p-value is 1 / number of samples.
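The lower bound follows directly from how an empirical p-value is formed. The sketch below computes a one-sided p-value for enrichment and floors it at 1 / number of samples; how gat treats ties and depletion is not stated here, so those choices are assumptions.

```python
def empirical_pvalue(observed, samples):
    """One-sided empirical p-value for enrichment, floored at 1 / len(samples)."""
    n = len(samples)
    # Count samples at least as extreme as the observed value.
    n_extreme = sum(1 for s in samples if s >= observed)
    return max(n_extreme, 1) / n


samples = list(range(1000))  # stand-in for 1000 sampled counts
print(empirical_pvalue(990, samples))
print(empirical_pvalue(2000, samples))  # cannot go below 1 / 1000
```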
Sometimes the lower bound on p-values causes methods that estimate the FDR to fail, as the distribution of p-values is atypical. In order to estimate lower p-values, the number of samples needs to be increased. Unfortunately, the run-time of gat is directly proportional to the number of samples.
A solution is to set the option --pvalue-method to --pvalue-method=norm. In that case, p-values are estimated by fitting a normal distribution to the samples. Small p-values are obtained by extrapolating from this fit.
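The idea behind the normal approximation can be sketched as follows: estimate mean and standard deviation from the samples, then read the tail probability off the fitted normal distribution. This is an illustration of the principle for the enrichment tail only, not gat's implementation.

```python
import math
import statistics


def normal_pvalue(observed, samples):
    """Upper-tail p-value from a normal distribution fitted to the samples."""
    mean = statistics.mean(samples)
    stddev = statistics.stdev(samples)
    z = (observed - mean) / stddev
    # Survival function of the standard normal, 1 - Phi(z).
    return 0.5 * math.erfc(z / math.sqrt(2))


samples = [8, 10, 12]
print(normal_pvalue(10, samples))
# An extreme observation yields a p-value far below 1 / len(samples).
print(normal_pvalue(30, samples))
```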
gat provides several methods for controlling the false discovery rate. The default is to use the Benjamini-Hochberg procedure. Different methods can be chosen with the --qvalue-method option.
--qvalue-method=storey uses the procedure by Storey et al. (2002) to compute a q-value for each pairwise comparison. The implementation is functionally equivalent to the qvalue package implemented in R.
Other options are equivalent to the methods as implemented in the R function p.adjust.
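For reference, the Benjamini-Hochberg adjustment named above as the default can be written in a few lines. This is a generic sketch of the standard procedure (matching the "BH" method of p.adjust), not code taken from gat.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


qvalues = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print(qvalues)
```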
gat can save and retrieve samples from a cache with the option --cache=cache_filename. If cache_filename does not exist, samples will be saved to the cache after computation. If cache_filename already exists, samples will be retrieved from the cache instead of being re-computed. Using cached samples is useful when trying different counters (see counters).
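The cache behaviour described above follows a common load-or-compute pattern, sketched below. The use of pickle and the function name are assumptions for illustration; the on-disk format of gat's cache is not described here.

```python
import os
import pickle


def load_or_compute_samples(cache_filename, compute):
    """Return cached samples if the cache exists, else compute and store them.

    compute is any zero-argument callable producing the samples.
    """
    if os.path.exists(cache_filename):
        with open(cache_filename, "rb") as handle:
            return pickle.load(handle)
    samples = compute()
    with open(cache_filename, "wb") as handle:
        pickle.dump(samples, handle)
    return samples
```

A second call with the same cache_filename returns the stored samples without invoking compute again.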
If the option --counts-file is given, gat will skip the sampling and counting step completely and read observed counts from --count-file=counts_filename.
GAT can make use of several CPUs/cores if available. Use the --num-threads=# option to specify how many CPUs/cores GAT will use. The default, --num-threads=0, means that GAT will not use any multiprocessing.
A variety of options govern the output of intermediate results by gat. These options usually accept patterns that represent filenames with a %s as a wild card character. The wild card is replaced with various keys. Note that the amount of data output can be substantial.
- --output-counts-pattern: output counts. One file is created for each counter. Counts output files are required for gat-compare.
- --output-plots-pattern: create plots (requires matplotlib). One plot is created for each annotation, showing the distribution of expected counts and the observed count. The distribution of p-values and q-values is also output.
- --output-samples-pattern: output bed formatted files with individual samples.
The gat-compare tool can be used to test if the fold changes found in two or more different gat experiments are significantly different from each other. This tool requires the output files with counts created using the --output-counts-pattern option.
For example, to test whether fold changes are significantly different between two cell lines, execute:
gat-run.py --segments=CD4.bed.gz <...> \
    --output-counts-pattern=CD4.%s.overlap.counts.tsv.gz
gat-run.py --segments=CD14.bed.gz <...> \
    --output-counts-pattern=CD14.%s.overlap.counts.tsv.gz
gat-compare.py CD4.nucleotide-overlap.counts.tsv.gz CD14.nucleotide-overlap.counts.tsv.gz
Plot gat results.
Perform a GREAT analysis:
gat-great.py
--segment-file=segments.bed.gz
--workspace-file=workspace.bed.gz
--annotation-file=annotations.bed.gz