Skip to content

Latest commit

 

History

History
248 lines (181 loc) · 16.9 KB

asmc.md

File metadata and controls

248 lines (181 loc) · 16.9 KB

ASMC

This page describes compiling and running the ASMC program described in this paper:

  • P. Palamara, J. Terhorst, Y. Song, A. Price. High-throughput inference of pairwise coalescence times identifies signals of selection and enriched disease heritability. Nature Genetics, 2018.

A page with data and annotations from the paper can be found here.

Summary (TL;DR)

The Ascertained Sequentially Markovian Coalescent is a method to efficiently estimate pairwise coalescence times along the genome. It can be run using SNP array or whole-genome sequencing (WGS) data.

To run ASMC, download it here, and follow the quickstart guide. Compile the ASMC_exe target to build the ASMC executable. To compute pairwise coalescence times for the following files containing SNP array data for 150 phased diploid samples:

  • ASMC_data/examples/asmc/exampleFile.n300.array.hap.gz
  • ASMC_data/examples/asmc/exampleFile.n300.array.samples
  • ASMC_data/examples/asmc/exampleFile.n300.array.map.gz

(see here for file formats), you can run the following ASMC command:

./build/ASMC_exe \
        --decodingQuantFile ASMC_data/decoding_quantities/30-100-2000_CEU.decodingQuantities.gz \
        --inFileRoot ASMC_data/examples/asmc/exampleFile.n300.array \
        --posteriorSums

This will generate ASMC_data/examples/asmc/exampleFile.n300.array.1-1.sumOverPairs.gz, which contains a matrix of size SxD, where S is the number of sites in the data, and D is the number of discrete coalescence time intervals defined in ASMC_data/discretizations/30-100-2000_CEU.disc. The [i,j]-th entry of this matrix contains the sum of posterior coalescence probabilities for all samples at SNP i and time j.

Running ASMC

Once you have computed or downloaded the decoding quantities corresponding to your demographic model, time discretization, and SNP allele frequencies, you can analyze SNP or WGS data using the ASMC program:

./build/ASMC_exe \
        --decodingQuantFile ASMC_data/decoding_quantities/30-100-2000_CEU.decodingQuantities.gz \
        --inFileRoot ASMC_data/examples/asmc/exampleFile.n300.array \
        --majorMinorPosteriorSums

For WGS data, add the --mode sequence option:

./build/ASMC_exe \
        --decodingQuantFile ASMC_data/decoding_quantities/30-100-2000_CEU.decodingQuantities.gz \
        --inFileRoot ASMC_data/examples/asmc/exampleFile.n300 \
        --majorMinorPosteriorSums \
        --mode sequence

Using the --majorMinorPosteriorSums flag, ASMC will output the sum of posterior coalescence probabilities for all analyzed pairs of individuals. These will be written in ASMC_data/examples/asmc/exampleFile.n300.array.{00,01,11}.sumOverPairs.gz for the SNP array example, and ASMC_data/examples/asmc/exampleFile.n300.{00,01,11}.sumOverPairs.gz for the WGS example.

If you are decoding a large number of samples, you can break down the computation in several jobs using the --jobs int and --jobInd int flags, which take the total number of jobs to be performed and the current job index as arguments. If you don't specify a name for the output files, a job index will be automatically added to the default output path.

The full set of command line options for ASMC is as follows:

Mandatory:
  --inFileRoot arg                Prefix of hap|haps|hap.gz|haps.gz, sample|samples, map|map.gz file
  --decodingQuantFile arg         Decoding quantities file

Choose one of:
  --posteriorSums                 Output file for sum of posterior distribution
                                  over pairs.
  --majorMinorPosteriorSums       Output file for sum of posterior distribution 
                                  over pairs, partitioned by major/minor alleles.                                

Optional:
  --outFileRoot arg               Output file for sum of posterior distribution
                                  over pairs (default: --hapsFileRoot argument)
  --mapFile arg                   Optional override for map|map.gz file, if not in inFileRoot
  --jobs int (=0)                 Number of jobs being done in parallel
  --jobInd int (=0)               Job index (1..jobs)
  --mode string (=array)          Decoding mode. Choose from {sequence, array}.
  --compress (=false)             Compress emission to binary (no CSFS)
  --useAncestral (=false)         Assume ancestral alleles are coded as 1 in
                                  input (will assume 1 = minor otherwise)
  --skipCSFSdistance float (=0.0) Genetic distance (in cM) between two CSFS emissions

In addition to the arguments described above, ASMC options include:

  • --compress is a shorthand for --skipCSFSdistance Infinity (see below).
  • --useAncestral can be used to specify that a 1 in the data specifies an ancestral allele. This will cause the CSFS to be used without folding. This is mostly not needed.
  • --skipCSFSdistance float, which takes a floating point argument, specifies the minimum distance for a CSFS emission to be used. The default is 0.0 (always use CSFS). Setting --skipCSFSdistance Infinity (which is the same as --compress) leads to never using the CSFS (i.e. the classic PSMC emission if decoding WGS data, or a binary emission which controls for ascertainment if decoding SNP array data).

Input/output file formats

You may want to look at files in ASMC_data/ for examples of the file formats described below.

Phased haplotypes in Oxford haps/sample format (*.hap/hap.gz, samples)

File format specification.

These files are provided in input to ASMC and optionally ASMCprepareDecoding. The file format explained here. These files are output by phasing programs like Eagle and Shapeit.

Genetic map (*.map/map.gz)

The genetic map provided in input to ASMC is in Plink map format, in which each line has four columns with format "Chromosome SNPName GeneticPosition PhysicalPosition". Genetic positions are in centimorgans, physical positions are in bp. The map can be optionally compressed using gzip.

Demographic history (*.demo)

The demographic history provided in input to ASMCprepareDecoding represents a piece-wise constant history of past effective population sizes, with format

TimeStart   PopulationSize

Where TimeStart is the first generation where the population has size PopulationSize. Note that population size is haploid, and that the demographic model is usually built assuming a specific mutation rate, which is passed as an argument to the ASMCprepareDecoding program. The first line should contain generation 0. You can obtain this model using e.g. PSMC/MSMC/SMC++. If your model is not piecewise constant, you will need to approximate it as piecewise constant. The last provided interval is assumed to last until time=Infinity (and is usually remote enough to have negligible effects on the results).

The demographic models used with ASMC can be found here and were inferred using smc++ in the following paper:

Spence, J.P. and Song, Y.S. Inference and analysis of population-specific fine-scale recombination maps across 26 diverse human populations. Science Advances, Vol. 5, No. 10, eaaw9206 (2019), [doi].

They correspond to these population sizes, but rescaled to assume mutation rate of 1.65e-8.

Time discretization (*.disc)

The list of discrete time intervals provided in input to ASMC contains a single number per line, representing time measured in (continuous) generations, and starting at generation 0.0. For instance, the list ASMC_data/discretizations/30-100-2000_CEU.disc contains time intervals:

0.0
30.0
60.0
90.0
120.0
150.0
180.0
210.0
240.0
270.0
300.0
330.0
360.0
... <lines omitted>
79855.6
96263.0
124311.7

The intervals defined in this file are: {0.0-30.0, 30.0-60.0, ..., 96263.0-124311.7, 124311.7-Infinity}.

Decoding quantities (*.decodingQuantities.gz)

The *.decodingQuantities.gz file is generated by ASMCprepareDecoding and input into ASMC. It is used to perform efficient inference of pairwise coalescence times. There is no need to understand the content of this file.

Note: the CEU.demo demographic model and the decoding quantities for CEU+UKBB previously provided in this repository and this repository were mistakenly encoded as diploid rather than haploid. The file CEU.demo and CEU+UKBB decoding quantities here have now been fixed. They were generated using v2.2.1 of the PrepareDecoding tool, which also provides a simpler interface for computing decoding quantities as well as support for additional demographic models. Using these new decoding quantities with v1.2 of ASMC will tend to produce more recent estimates for TMRCAs compared to the decoding quantities distributed with v1.0 and v1.1. This should not have a substantial impact on most downstream analyses.

Time discretization intervals (*.intervalsInfo)

The *.intervalsInfo file is generated by the ASMCprepareDecoding and input into ASMC. It contains some useful information about the time discretization and the demographic model. It contains a number of lines corresponding to the number of discrete time intervals used in the analysis. Each line has format:

IntervalStart   ExpectedCoalescenceTime IntervalEnd

IntervalStart and IntervalEnd represent the start/end of each discrete time interval, ExpectedCoalescenceTime is the expected coalescence time for a pair of individuals who have been inferred to coalesce within this time interval, and depends on the demographic model.

Sum of pairwise posterior coalescence probabilities *.sumOverPairs.gz

The output of the ASMC analysis is written in *.{00,01,11}.sumOverPairs.gz files. Each file contains a matrix of size SxD, where S is the number of sites in the data, and D is the number of discrete time intervals used in the analysis. The [i,j]-th entry of each matrix contains the sum of posterior coalescence probabilities for all samples at SNP i and discrete coalescence time j. The output breaks down coalescence events of samples carrying different alleles at each site, using the {00,01,11} suffixes. Specifically:

  • The i-th row of the matrix in *.00.sumOverPairs.gz contains the sum of posterior probabilities for all pairs of samples that are homozygous 0 at site i.
  • The i-th row of the matrix in *.01.sumOverPairs.gz contains the sum of posterior probabilities for all pairs of samples that heterozygous at site i.
  • The i-th row of the matrix in *.11.sumOverPairs.gz contains the sum of posterior probabilities for all pairs of samples that are homozygous 1 at site i.

Tools, scripts, and analyses

These are some useful tools and scripts to be used with ASMC. They have not yet been relocated to this repository, but are standalone tools that are available at the indicated locations.

Tool to merge output of parallel ASMC jobs

The folder MERGE_POSTERIORS contains the ASMCmergePosteriorSums.jar program, which may be used to merge the output of different ASMC jobs. You can type

java -jar TOOLS/MERGE_POSTERIORS/ASMCmergePosteriorSums.jar  -h

for a list of command line options. Also see the merge.sh file. This tool assumes the decoding has been done using --majorMinorPosteriorSums. You may normalize the output so that the posterior sums to 1 for each site using the --norm flag. If you used the --posteriorSums flag and you want to simply normalize the output, you can run:

zcat path/to/output/exampleFile.n100.array.merged.sumOverPairs.gz | \
    awk '{ c=0; for (i=1; i<=NF; i++) c+=$i; for (i=1; i<=NF; i++) $i/=c; print; }' | \
    gzip -c -v - > path/to/output/exampleFile.n100.array.merged.norm.sumOverPairs.gz

Script to visualize average coalescence density in a region

The folder PLOT_POSTERIORS contains the plotPosteriorHeatMap.py program, which can be used to visualize the coalescence posterior density in specific regions (e.g. figures 3.b, 3.c, and S7 in the ASMC paper). For an example, see the data page, where the UKBB_posteriors files can be downloaded to generate the figures from the paper.

Density of recent coalescence (DRC) statistic

The DRC statistic can be obtained by summing entries of the sumOverPairs matrices that correspond to the desired time interval, and averaging across SNPs within a window. Please refer to the paper for details.

A few notes on analyzing the DRC statistic: it is important to remember that using a Gamma distribution as a null model for the DRC statistic is an approximation (see paper), and further work is needed to derive an exact null model. In particular the DRC cannot be larger than 1, so approximate p-values obtained under a Gamma model for very large DRC values will be artificially small. We found the Gamma approximation to be reasonable under some conditions, such as a large recent effective population size, which makes the DRC small in neutral regions, but this approximation will not work well in all populations (e.g. isolated populations) or for large time intervals. If you decide to use this approach to detect candidate regions for selection, we recommend carefully checking that the Gamma approximation is reasonable. Because particularly large DRC values may lead to artificially small p-values, you may also want to use the Gamma approximation only to determine an approximate significance threshold, then report raw DRC values larger than this threshold, rather than reporting approximate p-values.

Finally, we also suggest that you check that candidate regions detected this way do not overlap problematic regions of the genome, such as regions of very high/low recombination rate or LD, as well as large structural variants, such as inversions, which may affect recombination.

Precomputed decoding quantities

This repository contains various data files related to ASMC that users might find helpful. This directory contains two sets of decoding quantities that have been precomputed using the PrepareDecoding tool.

They are built using the European demographic model CEU.demo, SNP allele frequencies from the UK Biobank in UKBB.frq, and the time discretizations that can be found in this directory. The discretizations contain several short time intervals in the recent generations, followed by intervals calculated from the mutation age distribution from generation 2,000 on. This enables getting more fine-grained information for recent generation, though note that smaller time intervals will contain fewer coalescent events on average.