---
title: Data classes
---

The following table describes the specialized objects used to store data in population genetics packages. Conversion between all types is possible.

Anyone developing a package for population genetic analysis is encouraged to use or build upon these data structures. If a new data structure is needed, please provide a conversion method to one or more of the classes listed below.
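As one illustration of this recommendation, here is a minimal sketch of a conversion method from a new class to `DNAbin`. The class name `myseqs` and its constructor `new_myseqs()` are hypothetical and exist only for this example; `as.DNAbin()` is the S3 generic from ape. The sketch assumes sequences are held as plain character strings.

```r
library(ape)

# Hypothetical new class (not part of any package): a named list of DNA
# strings tagged with the S3 class "myseqs".
new_myseqs <- function(x) {
  stopifnot(is.character(x), !is.null(names(x)))
  structure(as.list(x), class = "myseqs")
}

# Conversion method: S3 dispatch on ape's as.DNAbin() generic.
as.DNAbin.myseqs <- function(x, ...) {
  # Split every string into single lower-case characters, the form that
  # ape's list and character methods for as.DNAbin() work with.
  chars <- lapply(unclass(x), function(s) strsplit(tolower(s), "")[[1]])
  as.DNAbin(chars)
}

seqs <- new_myseqs(c(a = "ACGT", b = "ACGA", c = "TCGA"))
dna  <- as.DNAbin(seqs)   # a DNAbin list with one element per sequence
```

Once the data are in a `DNAbin` object, the existing converters to the other classes become available without further work in the new package.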

| Class {type} (package) | Strengths | Weaknesses |
|---|---|---|
| `DNAbin` {S3} (ape) | stores all sets of sequences (aligned or not)<br>uses matrices (aligned) or lists, so the usual R commands (`names`, `rownames`, `[`, `[[`, `$`) can be used<br>many `as.DNAbin` methods in ape (incl. from Bioconductor)<br>efficient functions in ape (`dist.dna`, `seg.sites`, `base.freq`, `read.FASTA`) and in pegas (`haplotype`) | less compact than 2-bit coding (but by a factor of 4 at most) |
| `loci` {S3} (pegas) | low memory usage<br>all levels of ploidy and any number of alleles<br>genotypes can be phased<br>any kind of individual data can be associated in the data frame<br>efficient to compute genotype and allele frequencies | not really appropriate for some analyses (e.g., multivariate analyses)<br>needs to improve the treatment of NAs (especially when data are read with `read.vcf()`) |
| `genind` {S4} (adegenet) | stores allelic counts; ideal for multivariate analyses<br>additional slots for individual data<br>additional slot for population strata<br>all levels of ploidy | requires more memory<br>less efficient to compute frequencies |
| `genpop` {S4} (adegenet) | equivalent to `genind` at group level; ideal for multivariate analysis | requires more memory |
| `genlight` {S4} (adegenet) | stores binary SNPs using bit-level coding; very memory efficient<br>additional slots for individual data and population strata<br>all levels of ploidy | more computationally intensive to handle; fewer functionalities<br>assumes bi-allelic loci |
| `genclone` {S4} (poppr) | inherits `genind` object; gains all advantages<br>stores multilocus genotype/lineage definitions (`@mlg` slot) for clonal populations | all the same weaknesses plus slightly more memory |
| `snpclone` {S4} (poppr) | inherits `genlight` object; gains all advantages<br>stores multilocus genotype/lineage definitions (`@mlg` slot) for clonal populations | all the same weaknesses plus slightly more memory |
| `genambig` {S4} (polysat) | stores microsatellite data with ambiguous ploidy<br>exports to `genpop` objects | does not handle any other data type<br>cannot easily be transferred to any other object |
| `phyDat` {S3} (phangorn) | very general, inspired by R's `data.frame`, `factor` and contrasts; can contain any discrete data type (nucleotides, amino acids and codons have some more support)<br>can be converted to and from `DNAbin` objects (`as.DNAbin` / `as.phyDat`)<br>a few generic functions work on it (`c`, `unique`, `subset`), plus utility functions `baseFreq`, `allSitePattern`, etc.<br>"efficient" maximum likelihood, maximum parsimony and distance functions in phangorn | designed with phylogenetic analysis in mind; requires alignments, where all sequences have the same length<br>data are not necessarily very memory efficient (stored as integer + contrast matrix), but only unique site patterns and their weights (as double) are stored |
| `gtype` {S3} (strataG) | a simple R list containing a matrix where the first column is a stratification scheme and the columns afterward are either haplotypes or diploid loci; for haploid data, the `gtype` object can also contain a list of DNA sequences<br>can be converted to `data.frame` or `matrix` with the appropriate `as.` functions<br>has manipulation functions such as `subset` (selects certain strata and/or loci), `merge` (combines multiple gtypes) and `summary`<br>can create input files for Genepop, STRUCTURE, fastsimcoal, Arlequin, MEGA, and PHASE | can likely be made more efficient in terms of storage and preprocessing for other analytical routines in the package |
| `multiDNA` {S4} (apex) | stores multiple `DNAbin` objects from ape | |
| `multiPhyDat` {S4} (apex) | stores multiple `phyDat` objects from phangorn | |
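
To make the conversions between classes concrete, the following is a minimal sketch of a few of them, assuming ape, phangorn, adegenet and pegas are installed. It uses the example data sets `woodmouse` (shipped with ape) and `nancycats` (shipped with adegenet). It illustrates only a handful of the possible paths and is not an exhaustive or recommended workflow.

```r
library(ape)       # DNAbin
library(phangorn)  # phyDat
library(adegenet)  # genind, genpop
library(pegas)     # loci

# Sequence data: DNAbin <-> phyDat
data(woodmouse)                 # example DNAbin alignment from ape
pd   <- as.phyDat(woodmouse)    # DNAbin -> phyDat
back <- as.DNAbin(pd)           # phyDat -> DNAbin

# Genotype data: genind -> loci -> genind, and genind -> genpop
data(nancycats)                               # example genind object from adegenet
lc <- genind2loci(nancycats)                  # genind -> loci
gi <- loci2genind(lc)                         # loci -> genind
gp <- genind2genpop(nancycats, quiet = TRUE)  # genind -> genpop (group level)
```

Which class to convert to depends on the analysis: for example, `loci` for computing genotype and allele frequencies, and `genind` or `genpop` for multivariate analyses, as summarized in the table above.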