GitHub - SantosJGND/Imputation: VCF, dimensionality reduction, distances, KDE

Imputation repository.

This sub-directory holds work on imputation. As usual, this research is done with population genetic data in mind.

The data consists of a single genotype data set. Variables are variant count features ranging between 0 and 2; samples are designed to derive from a semi-consistent population network. "Semi-consistent" is used here to indicate that certain observations have variable probability density functions (pdfs), and that the characteristics of the structure vary (for example, cluster distance may change).
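To make the data layout concrete, here is a minimal sketch of a genotype matrix with counts between 0 and 2, drawn for two populations. It is not the Genome Simulator tool used in this repository; the function name `simulate_genotypes`, the allele frequencies, and the binomial sampling model are assumptions made for illustration only.

```python
import numpy as np

def simulate_genotypes(n_per_pop, freqs, seed=0):
    """Draw diploid genotypes (0, 1 or 2 alternate alleles) per population.

    freqs has shape (n_pops, n_variants) and holds hypothetical allele
    frequencies. Returns the genotype matrix and a population label per row.
    """
    rng = np.random.default_rng(seed)
    freqs = np.asarray(freqs, dtype=float)
    blocks, labels = [], []
    for pop, p in enumerate(freqs):
        # Each genotype is the sum of two Bernoulli(p) allele draws.
        blocks.append(rng.binomial(2, p, size=(n_per_pop, p.shape[0])))
        labels.extend([pop] * n_per_pop)
    return np.vstack(blocks), np.array(labels)

# Two populations, four variants, five samples each (illustrative values).
freqs = np.array([[0.1, 0.9, 0.5, 0.2],
                  [0.8, 0.2, 0.5, 0.7]])
genotypes, labels = simulate_genotypes(5, freqs)
```

Rows are samples and columns are variants, which is the orientation assumed by the other sketches in this README.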

Data generation

VCF files are generated using the Genome Simulator tool from the Tools repository.

The generation steps are replicated here for the specific data sets used: notebook.

I. Distances / Dimensionality reduction.

Window-based analysis constructs data sets of distance data, which are then used to predict the position of a missing observation in an incomplete data set.
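A minimal sketch of the window step, assuming non-overlapping windows of a fixed number of variants and plain Euclidean distances between samples. The notebook may use different windows, a dimensionality reduction step before the distance calculation, or another metric; `window_distances` and `window_size` are hypothetical names.

```python
import numpy as np

def window_distances(genotypes, window_size):
    """Pairwise sample distances computed separately in each window.

    genotypes: array of shape (n_samples, n_variants).
    Returns an array of shape (n_windows, n_samples, n_samples).
    """
    genotypes = np.asarray(genotypes, dtype=float)
    n_variants = genotypes.shape[1]
    matrices = []
    for start in range(0, n_variants - window_size + 1, window_size):
        block = genotypes[:, start:start + window_size]
        # Broadcast to get all sample-by-sample differences in this window.
        diff = block[:, None, :] - block[None, :, :]
        matrices.append(np.sqrt((diff ** 2).sum(axis=2)))
    return np.array(matrices)

example = np.array([[0, 0, 2, 2],
                    [2, 2, 2, 2],
                    [0, 0, 0, 0]])
dists = window_distances(example, window_size=2)
```

Each window yields one distance matrix. Windows in which an observation is complete can then serve as predictors for the window in which it is missing.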

notebook

II. PCA inverse transformation.

An aside on the accuracy of PCA inverse transformation.
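One way to look at this accuracy, sketched here with scikit-learn's `PCA`, is to project the data, map it back with `inverse_transform`, and measure the mean squared difference from the original matrix. The random data and the helper name `reconstruction_error` are assumptions for illustration; the notebook may measure accuracy differently.

```python
import numpy as np
from sklearn.decomposition import PCA

def reconstruction_error(data, n_components):
    """Mean squared error after a PCA round trip with n_components axes."""
    pca = PCA(n_components=n_components)
    projected = pca.fit_transform(data)
    restored = pca.inverse_transform(projected)
    return float(np.mean((data - restored) ** 2))

# Illustrative genotype-like matrix: 30 samples, 8 variants.
rng = np.random.default_rng(0)
data = rng.binomial(2, 0.3, size=(30, 8)).astype(float)
errors = [reconstruction_error(data, k) for k in range(1, 9)]
```

The error shrinks as components are added and is numerically zero once all components are kept, so any loss comes from the axes that were dropped.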

notebook

III. Cluster search.

Dimensionality reduction and maximum likelihood cluster classification. Used for statistics and imputation.
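A minimal sketch of the idea, assuming PCA for the reduction and one Gaussian kernel density estimate (KDE) per reference cluster: a query sample is assigned to the cluster under which its coordinates have the highest log-likelihood. The function `classify_by_kde`, the bandwidth, and the simulated frequencies are hypothetical and not taken from the notebook.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KernelDensity

def classify_by_kde(features, labels, queries, bandwidth=0.5):
    """Assign each query to the cluster with the highest KDE log-likelihood."""
    groups = np.unique(labels)
    log_lik = np.column_stack([
        KernelDensity(kernel="gaussian", bandwidth=bandwidth)
        .fit(features[labels == g])
        .score_samples(queries)
        for g in groups
    ])
    return groups[np.argmax(log_lik, axis=1)], log_lik

# Two well-separated populations, 20 samples each, 20 variants.
rng = np.random.default_rng(1)
freqs = np.array([[0.05] * 20, [0.95] * 20])
genotypes = np.vstack([rng.binomial(2, p, size=(20, 20)) for p in freqs])
labels = np.repeat([0, 1], 20)

coords = PCA(n_components=2).fit_transform(genotypes)
# Even rows act as references, odd rows as queries.
predicted, log_lik = classify_by_kde(coords[::2], labels[::2], coords[1::2])
```

The per-cluster log-likelihoods in `log_lik` can be kept alongside the hard assignment, which is useful when a sample sits between clusters.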

notebook

Application to rice data.

i. Haplotype imputation

Based on the method described in section I. Additions include: composite likelihood; control for distance; exclusion of observations carrying missing or heterozygous calls in local distance calculations.

data requirement: haplotype, phased, or nearly homozygous data.

validation: benchmark included.
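The exclusion rule above can be sketched as a masked distance: sites where either sample is missing or heterozygous are skipped before comparing two samples. The imputation step below is a simple nearest-donor vote, swapped in for illustration; it does not implement the composite likelihood or the distance control. The missing-call code `-1` and the names `masked_distance` and `impute_call` are assumptions.

```python
import numpy as np

MISSING = -1  # assumed missing-call code for this sketch

def masked_distance(a, b):
    """Mismatch rate over sites where both samples are called and homozygous."""
    usable = np.isin(a, (0, 2)) & np.isin(b, (0, 2))
    if not usable.any():
        return np.inf
    return float(np.mean(a[usable] != b[usable]))

def impute_call(genotypes, sample, site, k=1):
    """Fill genotypes[sample, site] by majority vote of the k nearest donors."""
    flanking = np.delete(np.arange(genotypes.shape[1]), site)
    # Donors must carry a homozygous call at the target site.
    donors = [i for i in range(genotypes.shape[0])
              if i != sample and genotypes[i, site] in (0, 2)]
    if not donors:
        return MISSING
    dists = [masked_distance(genotypes[sample, flanking], genotypes[i, flanking])
             for i in donors]
    nearest = [donors[j] for j in np.argsort(dists)[:k]]
    values, counts = np.unique(genotypes[nearest, site], return_counts=True)
    return int(values[np.argmax(counts)])

calls = np.array([
    [0, 0, MISSING, 0, 0],   # focal sample, missing at site 2
    [0, 0, 0, 0, 1],         # heterozygous call at site 4 is ignored
    [2, 2, 2, 2, 2],
    [2, MISSING, 2, 2, 2],
])
imputed = impute_call(calls, sample=0, site=2)
```

In this example the second sample is the nearest donor once its heterozygous site is masked, so the missing call is filled with its allele.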

notebook

ii. cluster distance and imputation

Application of the cluster search and inference pipeline to 3000 Rice Genomes data. Focus on Japonica and cBasmati variation. Distance inference is now performed within 1 Mb of the focal target.
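Restricting inference to the neighbourhood of a focal target amounts to selecting variants by physical position. A minimal sketch, assuming "within 1 Mb" means at most 1,000,000 bp on either side of the focal position; the helper `variants_near` and the example positions are hypothetical.

```python
import numpy as np

def variants_near(positions, focal, radius=1_000_000):
    """Indices of variants whose position lies within radius bp of focal."""
    positions = np.asarray(positions)
    return np.flatnonzero(np.abs(positions - focal) <= radius)

positions = [100, 500_000, 1_200_000, 2_500_000]
nearby = variants_near(positions, focal=1_000_000)
```

The returned indices can be used to subset the genotype columns before distances are computed.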

notebook

iii. targeted Ne estimation at local windows

notebook

Folders and files

.ipynb_checkpoints
figures
impute_tools
structure_tools
synth_tools
.gitignore
CoalSim_ClusterSearch.ipynb
INV_transform.ipynb
Impute_I_distances.ipynb
LICENSE.md
README.md
Reconstruct_trees.ipynb
Theta_ClusterSearch.ipynb
likelihood_play_deprecated.ipynb
phasing.ipynb
prepare_vcfs.ipynb
rice_ClusterSearch.ipynb
rice_impute.ipynb
