Prepare Decoding file formats

Input file formats

Demographic history (*.demo)

The demographic history provided as input to Prepare Decoding is a piecewise-constant history of past effective population sizes, one interval per line, in the format:

TimeStart   PopulationSize

where TimeStart is the first generation at which the population has size PopulationSize. Population sizes are haploid. The demographic model is usually built assuming a specific mutation rate, which is passed as an argument to the ASMCprepareDecoding program. The first line should start at generation 0. You can obtain such a model using, e.g., PSMC, MSMC, or SMC++. If your model is not piecewise constant, you will need to approximate it as piecewise constant. The last interval provided is assumed to extend to infinity (it is usually remote enough to have a negligible effect on the results).
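As a minimal sketch of how this format can be read (not part of ASMC itself), the following Python parses `*.demo` content and looks up the population size at a given generation. The function names `parse_demo` and `population_size_at` are hypothetical, the example values are made up, and the check that TimeStart values increase is an assumption rather than a documented requirement.

```python
from bisect import bisect_right


def parse_demo(text):
    """Parse *.demo content into parallel lists (start times, haploid sizes)."""
    times, sizes = [], []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        times.append(float(fields[0]))
        sizes.append(float(fields[1]))
    if not times or times[0] != 0.0:
        raise ValueError("the first line should start at generation 0")
    # Assumption: TimeStart values are strictly increasing.
    if any(later <= earlier for earlier, later in zip(times, times[1:])):
        raise ValueError("TimeStart values must be strictly increasing")
    return times, sizes


def population_size_at(times, sizes, t):
    """Return the size of the interval containing generation t (t >= 0)."""
    if t < 0:
        raise ValueError("t must be non-negative")
    # The last interval extends to infinity, so any t >= times[-1] maps to it.
    return sizes[bisect_right(times, t) - 1]


# Made-up values, for illustration only.
example_demo = """\
0.0     20000.0
100.0   15000.0
2000.0  10000.0
"""
times, sizes = parse_demo(example_demo)
```

Because TimeStart is the first generation of each interval, a lookup at exactly generation 100 in this example returns the size of the second interval.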

The demographic models used with ASMC can be found here and were inferred using SMC++ in the following paper:

Spence, J.P. and Song, Y.S. Inference and analysis of population-specific fine-scale recombination maps across 26 diverse human populations. Science Advances, Vol. 5, No. 10, eaaw9206 (2019), [doi].

They correspond to these population sizes, but rescaled to assume a mutation rate of 1.65e-8.

Time discretization (*.disc)

The list of discrete time intervals provided as input to Prepare Decoding contains one number per line, representing time measured in (continuous) generations and starting at generation 0.0. For instance, the file 30-100-2000_CEU.disc contains the following interval boundaries:

0.0
30.0
60.0
90.0
... <lines omitted>
79855.6
96263.0
124311.7

The intervals defined in this file are: {0.0-30.0, 30.0-60.0, ..., 96263.0-124311.7, 124311.7-Infinity}.
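The mapping from boundaries to intervals can be sketched in Python as follows. This is an illustration only: `parse_disc` is a hypothetical helper, not an ASMC function, and the input reuses just the first four boundaries listed above.

```python
import math


def parse_disc(text):
    """Turn *.disc content (one boundary per line) into (start, end) intervals."""
    points = [float(line) for line in text.splitlines() if line.strip()]
    if not points or points[0] != 0.0:
        raise ValueError("the discretization should start at generation 0.0")
    # Each boundary opens an interval that ends at the next boundary;
    # the final interval is open-ended and runs to infinity.
    return list(zip(points, points[1:] + [math.inf]))


intervals = parse_disc("0.0\n30.0\n60.0\n90.0\n")
for start, end in intervals:
    print(f"{start}-{end}")
```

A file with n boundaries therefore defines n intervals, the last of which has no finite end.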

Frequencies (*.frq)

A file containing SNP frequency data in PLINK format. These frequencies should reflect the allele frequency spectrum of the data you plan to analyze with ASMC. The file contains a header row followed by one data row per variant, in the following form:

 CHR           SNP   A1   A2          MAF    NCHROBS
   1     rs3131972    A    G       0.1684     302964
   1    rs12184325    T    C      0.03716     304088

where:

  • CHR Chromosome code
  • SNP Variant identifier
  • A1 Allele 1 (usually minor)
  • A2 Allele 2 (usually major)
  • MAF Allele 1 frequency
  • NCHROBS Number of allele observations

Prepare Decoding expects a file with a header row and minor allele frequencies in column 5.
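A minimal sketch of reading such a file in Python is shown below, using the two example rows from above. `parse_frq` is a hypothetical helper, not part of ASMC or PLINK, and it assumes whitespace-separated columns with numeric values in columns 5 and 6 (missing values are not handled).

```python
def parse_frq(text):
    """Parse *.frq content into a list of per-variant dictionaries."""
    lines = [line for line in text.splitlines() if line.strip()]
    records = []
    for row in lines[1:]:  # the first non-blank line is the header row
        chrom, snp, a1, a2, maf, nchrobs = row.split()[:6]
        records.append({
            "chr": chrom,
            "snp": snp,
            "a1": a1,
            "a2": a2,
            "maf": float(maf),        # column 5: allele 1 frequency
            "nchrobs": int(nchrobs),  # column 6: number of allele observations
        })
    return records


example_frq = """\
 CHR           SNP   A1   A2          MAF    NCHROBS
   1     rs3131972    A    G       0.1684     302964
   1    rs12184325    T    C      0.03716     304088
"""
records = parse_frq(example_frq)
mafs = [record["maf"] for record in records]
```

Splitting on arbitrary whitespace handles the right-aligned, padded columns shown in the example.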

CSFS (*.csfs)

A file containing CSFS information. A pre-calculated CSFS file can be passed to Prepare Decoding; if none is provided, the CSFS will be calculated at runtime. Once calculated, the CSFS can be saved to a file for reuse. There is no need to understand the content of this file.

Output file formats

The following files can be output by Prepare Decoding. See the relevant sections of the API documentation for how to save these files:

Decoding quantities (*.decodingQuantities.gz)

The *.decodingQuantities.gz file is generated by Prepare Decoding and is an input into ASMC. It is used to perform efficient inference of pairwise coalescence times. There is no need to understand the content of this file.

Time discretization intervals (*.intervalsInfo)

The *.intervalsInfo file is generated by Prepare Decoding and is an input into ASMC. It contains useful information about the time discretization and the demographic model, with one line per discrete time interval used in the analysis. Each line has the format:

IntervalStart   ExpectedCoalescenceTime IntervalEnd

The values IntervalStart and IntervalEnd are the start and end of each discrete time interval. ExpectedCoalescenceTime is the expected coalescence time for a pair of individuals inferred to coalesce within that interval; it depends on the demographic model.
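The three-column layout can be read with a short Python sketch like the one below. `parse_intervals_info` is a hypothetical helper, the example values are invented for illustration, and the sanity check assumes the expected coalescence time lies within its interval. How the end of the final, open-ended interval is written in a real file is not specified here, so the example uses only finite intervals.

```python
def parse_intervals_info(text):
    """Parse *.intervalsInfo content into (start, expected, end) tuples."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        start, expected, end = (float(value) for value in fields[:3])
        # Assumed sanity check: the expected coalescence time of a pair that
        # coalesces within an interval should fall inside that interval.
        if not start <= expected <= end:
            raise ValueError(f"expected time outside interval: {line!r}")
        rows.append((start, expected, end))
    return rows


# Invented values, for illustration only.
example_info = """\
0.0   14.8  30.0
30.0  44.7  60.0
"""
rows = parse_intervals_info(example_info)
```

Note the column order: the expected coalescence time sits between the interval start and end, not after them.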

CSFS (output) (*.csfs)

See above.

Time discretization (output) (*.disc)

See above.

Demographic history (output) (*.demo)

See above.