This sub-directory holds work on imputation. As usual, this research is done with population genetic data in mind.
The data consist of a single genotype data set. Variables are variant-count features ranging between 0 and 2; samples are designed to derive from a semi-consistent population network. "Semi-consistent" here indicates that certain observations have variable probability density functions (pdfs) and that the characteristics of the structure vary (for example, cluster distance may change).
VCF files are generated using the Genome Simulator tool of the first Tools repository link.
- replicated here for the specific data sets used in the notebook.
Window-based analysis constructs data sets of distance data with which to predict the position of a missing observation in an incomplete data set.
An aside on the accuracy of PCA inverse transformation.
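As a rough illustration of what this aside measures, the sketch below projects a genotype matrix onto its leading principal components, maps it back, and reports the reconstruction error. It is a minimal example under stated assumptions, not the notebook's code: the function name `pca_roundtrip_error`, the SVD-based PCA, and the random toy matrix are all hypothetical stand-ins.

```python
import numpy as np

def pca_roundtrip_error(genotypes, n_components):
    """Project onto the top principal components, invert, and return the RMSE."""
    X = np.asarray(genotypes, dtype=float)
    mean = X.mean(axis=0)
    Xc = X - mean
    # Rows of vt are the principal axes, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    axes = vt[:n_components]
    scores = Xc @ axes.T              # forward transform into PCA space
    restored = scores @ axes + mean   # inverse transform back to feature space
    return float(np.sqrt(np.mean((X - restored) ** 2)))

# Hypothetical toy data: 40 samples x 30 variants, counts in {0, 1, 2}.
rng = np.random.default_rng(0)
toy = rng.integers(0, 3, size=(40, 30))
errors = {k: pca_roundtrip_error(toy, k) for k in (2, 10, 30)}
```

The error shrinks as more components are retained and falls to numerical zero once all axes are kept, so the number of components sets how much of the original genotype signal survives the round trip.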
Dimensionality reduction and maximum likelihood cluster classification. Used for statistics and imputation.
i. Haplotype imputation
Based on the method described in section I. Additions include: composite likelihood; control for distance; exclusion of observations carrying missing or heterozygous calls in local distance calculations.
data requirement: haplotype, phased, or nearly homozygous data.
validation: benchmark included.
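The exclusion step listed among the additions can be sketched as follows. This is an illustrative example only: the names `is_usable` and `window_distances`, the use of `-1` for a missing call, and the mismatch-proportion distance are assumptions, since the encoding and distance metric are not specified here.

```python
HOMOZYGOUS = (0, 2)  # calls kept; 1 (heterozygous) and -1 (assumed missing code) are not

def is_usable(calls):
    """True if the observation has only homozygous, non-missing calls in the window."""
    return bool(calls) and all(call in HOMOZYGOUS for call in calls)

def window_distances(window):
    """Pairwise mismatch proportions among usable observations.

    window: dict mapping observation name -> list of calls within one local window.
    """
    kept = {name: calls for name, calls in window.items() if is_usable(calls)}
    names = sorted(kept)
    dists = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            mismatches = sum(x != y for x, y in zip(kept[a], kept[b]))
            dists[(a, b)] = mismatches / len(kept[a])
    return dists

# Hypothetical window of four observations over four variants.
window = {
    "s1": [0, 2, 2, 0],
    "s2": [0, 2, 0, 0],
    "s3": [0, 1, 2, 0],   # heterozygous call: excluded
    "s4": [2, -1, 2, 0],  # missing call: excluded
}
dists = window_distances(window)
```

Dropping whole observations rather than individual sites keeps every remaining distance computed over the same set of variants, which is one reason the data requirement favours haplotype, phased, or nearly homozygous input.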
ii. Cluster distance and imputation
Application of the cluster search and inference pipeline to 3000 Rice Genomes data. Focus on Japonica and cBasmati variation. Distance inference is now performed within 1 Mb of the focal target.
iii. Targeted Ne estimation at local windows