Phase imputation for missing data on a linkage map

preparing data for use in random forest

The goal of this program is to leverage the new linkage mapping technique we are employing to create a high-quality input dataset for QTL analysis and the various statistical modelling techniques I will employ. These methods do not tolerate missing genotypes, so missing data must be imputed. Because the data are placed on a linkage map, we have more information to work with than programs such as fastPHASE, which base their imputations on LD and haplotype clustering.

Markers placed in the same cluster lend themselves to building a consensus haplotype for that cluster. Consider the following hypothetical zero recombination cluster, ZRC_1. Each marker has missing genotypes and so lacks the full phase vector needed for modelling. Because these markers all sit in the same cluster (the same cM location) on the linkage map, their genotypes can be merged to create a full phase vector for the ZRC.

ZRC_1

Member markers:

M1 H-HAHAAAAA-HAHH-
M2 HAHAH-AAAAHHAHHH
M3 HAHAHAAAAAHH-HHH

Consensus phase:

ZRC_1  HAHAHAAAAAHHAHHH
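The merge shown above can be sketched as a column-wise majority vote. This is only an illustration, not the code in map_phase_impute.py: the function name `consensus_phase`, the use of `-` for a missing genotype, and the use of `None` in place of NaN are all assumptions made for the example.

```python
from collections import Counter

MISSING = "-"  # assumed symbol for a missing genotype, as in the example above


def consensus_phase(markers):
    """Column-wise majority vote across the markers of one cluster.

    Returns one entry per individual: the winning phase, or None where
    every marker is missing or the two most common phases are tied.
    """
    consensus = []
    for column in zip(*markers):
        counts = Counter(call for call in column if call != MISSING)
        ranked = counts.most_common(2)
        if not ranked:
            consensus.append(None)  # all markers missing for this individual
        elif len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
            consensus.append(None)  # tie between the two phases
        else:
            consensus.append(ranked[0][0])
    return consensus


zrc_1 = [
    "H-HAHAAAAA-HAHH-",  # M1
    "HAHAH-AAAAHHAHHH",  # M2
    "HAHAHAAAAAHH-HHH",  # M3
]
result = consensus_phase(zrc_1)
print("".join(call if call is not None else MISSING for call in result))
# prints HAHAHAAAAAHHAHHH
```

Entries left as `None` are the ambiguous positions that a second pass over neighbouring clusters would have to resolve.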

General idea of this algorithm

1. Make a consensus phase for each cluster in the map. If a cluster is ambiguous for an individual (i.e. all markers have missing data for that column, or the two phases are equal in number), a NaN is returned and imputed in step 2.
2. For the remaining missing data, fill in the phase of the clusters above and below if they are equal. If the clusters above and below have non-matching phases (or one is also NaN), count the number of matches with the above and below clusters and impute the phase from the closer-matching cluster.
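A minimal sketch of the second pass, under stated assumptions: each cluster is a list with one call per individual, `None` stands in for NaN, and the hypothetical helpers `count_matches` and `impute_from_flanks` are invented for illustration. Leaving a position unfilled when the match counts tie, or when the closer-matching cluster is itself missing there, is a conservative choice made here and may differ from what map_phase_impute.py does.

```python
def count_matches(row_a, row_b):
    """Number of individuals where both clusters are called and agree."""
    return sum(
        1
        for a, b in zip(row_a, row_b)
        if a is not None and b is not None and a == b
    )


def impute_from_flanks(above, current, below):
    """Fill None entries in `current` using the adjacent clusters."""
    up_score = count_matches(current, above)
    down_score = count_matches(current, below)
    filled = list(current)
    for i, call in enumerate(current):
        if call is not None:
            continue
        up, down = above[i], below[i]
        if up is not None and up == down:
            filled[i] = up  # both flanks agree
        elif up_score > down_score and up is not None:
            filled[i] = up  # cluster above matches more closely
        elif down_score > up_score and down is not None:
            filled[i] = down  # cluster below matches more closely
        # otherwise the position stays None (assumed conservative default)
    return filled


above = list("HAHA")
current = ["H", None, "H", None]
below = list("HAAH")
print(impute_from_flanks(above, current, below))
# prints ['H', 'A', 'H', 'A']
```

In this example the second individual is filled because both flanks agree on A, and the fourth is filled from the cluster above because it matches the current cluster at two positions versus one for the cluster below.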

I've tried to consider fringe cases thoroughly, such as when:

1. There is disagreement between markers for an individual's phase in a cluster
   - the mode is taken, and ties are handled as well
2. All markers in a cluster lack a genotype for a given individual
   - look at the clusters upstream and downstream to see if their phases match
   - if they do not, poll the other individuals for matches up and down
3. There is only one marker in the cluster
   - look at the adjacent clusters in the manner of case 2
4. The flanking phases are both NaN
   - look one cluster further in either direction (2 away)
   - At this point you should be very careful, as there may be more missing data than can be dealt with properly. For that reason I have not coded this to impute from clusters 3 or more locations away (at that point you may just be making things up).
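The bounded search for flanking phases can be sketched as follows. This is an illustration only: `flanking_calls` and its `max_distance` parameter are hypothetical names, `None` stands in for NaN, and the default of 2 reflects the limit described above rather than a value read from the actual script.

```python
def flanking_calls(clusters, index, individual, max_distance=2):
    """Find the nearest called phases above and below one cluster.

    `clusters` is a list of clusters in map order, each a list with one
    call per individual. The search widens one cluster at a time and
    gives up beyond `max_distance`, returning (None, None, None).
    """
    for distance in range(1, max_distance + 1):
        up_idx = index - distance
        down_idx = index + distance
        up = clusters[up_idx][individual] if up_idx >= 0 else None
        down = clusters[down_idx][individual] if down_idx < len(clusters) else None
        if up is not None or down is not None:
            return up, down, distance
    return None, None, None


# One individual, five clusters: the immediate flanks of cluster 2 are missing.
clusters = [["H"], [None], [None], [None], ["H"]]
print(flanking_calls(clusters, index=2, individual=0))
# prints ('H', 'H', 2)
```

Capping the search keeps an imputed value tied to evidence that is close on the map; when nothing is found within the limit the position is simply left missing.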

notes:

The example_data folder has data in the format needed to run phase imputation.

The support_functions folder contains scripts to help you move a linkage map into the format necessary to run phase imputation, and shell scripts to execute phase imputation for numerous linkage groups at once.
