

vobencha edited this page Dec 6, 2012 · 18 revisions


This page describes VCF_probabilityBasedSnpEncoding, a Bioconductor mentored project.



The aim of this project is to modify the existing MatrixToSnpMatrix() function in the VariantAnnotation package to compute probability-based SNP encodings. Currently the function converts genotypes to a SnpMatrix object but does not account for genotype probabilities.

Project attributes and estimates:

  • Difficulty: Advanced
  • Length: 12 weeks
  • Skills needed: R programming, S4 generics and method creation, statistics
  • Deliverables: See summary below.
  • Mentors: Valerie Obenchain and Vince Carey

Summary of Deliverables

  • genotypeToSnpMatrix generic; methods for CollapsedVCF and matrix
  • probabilityToSnpMatrix function
  • Unit tests
  • Man page
  • Update vignette
  • Deprecate MatrixToSnpMatrix()

Implementation Details

1. genotypeToSnpMatrix generic and methods:

Create generic and methods for CollapsedVCF and matrix.

Generic signature

setGeneric("genotypeToSnpMatrix", function(x, ...) standardGeneric("genotypeToSnpMatrix") )

Convert genotypes to a SnpMatrix with or without probability information. The CollapsedVCF method will have an argument 'uncertain' with a default of FALSE. When 'uncertain=TRUE', probability information will be taken from the GP or GL field, if available. The return value is a SnpMatrix with dimensions [sample,snp].
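A minimal sketch of the generic with an illustrative matrix method follows. It uses base R only. The method body, the GT-string lookup table and the raw-matrix return value are assumptions made for illustration. A real method would return a SnpMatrix from snpStats, handle 'uncertain', and cope with multi-allelic sites.

```r
library(methods)

# Generic signature as given above.
setGeneric("genotypeToSnpMatrix",
           function(x, ...) standardGeneric("genotypeToSnpMatrix"))

# Illustrative matrix method (an assumption, not the final implementation).
# 'x' is a character matrix of VCF GT strings with dimensions [snp, sample].
setMethod("genotypeToSnpMatrix", "matrix", function(x, ...) {
    # SnpMatrix-style codes: 0 = missing, 1 = hom ref, 2 = het, 3 = hom alt
    map <- c("0/0" = 1L, "0|0" = 1L,
             "0/1" = 2L, "1/0" = 2L, "0|1" = 2L, "1|0" = 2L,
             "1/1" = 3L, "1|1" = 3L)
    codes <- map[as.vector(x)]       # unmatched strings such as "./." give NA
    codes[is.na(codes)] <- 0L        # treat anything unmatched as missing
    res <- matrix(as.raw(codes), nrow = nrow(x), ncol = ncol(x),
                  dimnames = dimnames(x))
    t(res)                           # transpose to [sample, snp]
})

# Usage with a toy genotype matrix (two SNPs, two samples).
gt <- matrix(c("0/0", "0/1", "1/1", "./."), nrow = 2,
             dimnames = list(c("rs1", "rs2"), c("s1", "s2")))
sm <- genotypeToSnpMatrix(gt)
```

Returning a plain raw matrix keeps the sketch free of package dependencies. The transpose reflects the [sample,snp] orientation described above.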

2. probabilityToSnpMatrix function:

Create a function where the input is a numeric probability matrix [snp,3] and the output is a SnpMatrix with dimensions [1,snp] with probability encoding. This function may be used internally when genotypeToSnpMatrix(..., uncertain=TRUE) is called, and by users who want to convert the output of globalProbability() to a SnpMatrix.

snpStats:::prob2g(), which takes probabilities and returns a SnpMatrix object, may be useful here.
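Before any encoding takes place, the [snp,3] input probably needs validating. The sketch below shows one way to do that in base R. The helper name checkProbabilityMatrix, the tolerance, and the choice to treat rows containing NA as missing are all assumptions. The encoding step itself is left to snpStats and is not shown.

```r
# Hypothetical helper: validate a [snp, 3] genotype probability matrix.
# Returns a logical vector flagging the SNPs with complete probabilities.
checkProbabilityMatrix <- function(p, tol = 1e-6) {
    if (!is.matrix(p) || !is.numeric(p))
        stop("'p' must be a numeric matrix")
    if (ncol(p) != 3L)
        stop("'p' must have 3 columns, one per genotype")
    ok <- stats::complete.cases(p)            # rows with NA count as missing
    complete <- p[ok, , drop = FALSE]
    if (any(complete < 0 | complete > 1))
        stop("probabilities must lie in [0, 1]")
    if (any(abs(rowSums(complete) - 1) > tol))
        stop("each row of probabilities must sum to 1")
    ok
}

# Usage: two valid SNPs and one with missing probabilities.
p <- rbind(c(1, 0, 0), c(0.25, 0.5, 0.25), c(NA, NA, NA))
usable <- checkProbabilityMatrix(p)
```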

3. Unit tests

Create a new unit test file, 'test_probability-methods', in VariantAnnotation/inst/unitTests/.

Concepts to test:

  • The ALT column in a CollapsedVCF can be a DNAStringSetList or CompressedCharacterList (case of structural variants). The method should check the ALT column to see if we are dealing with structural variants. If they are structural, emit a warning and don't convert to SnpMatrix. You can use VariantAnnotation/inst/extdata/structural.vcf for testing.

  • Check dimensions, missing data and valid values (i.e., 0, 1, 2, 3) before vs. after conversion to SnpMatrix.

  • The function should operate on a CollapsedVCF with no samples. Emit a warning (not an error) and return an empty SnpMatrix.

  • Spot check a few results in the SnpMatrix. Use the original genotype and computed frequencies/probabilities etc.
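Several of the checks listed above can be sketched in base R as follows. The toyConvert() function is a hypothetical stand-in that exists only so the checks can run. In the package, the same assertions would target genotypeToSnpMatrix() and use the package's unit testing framework.

```r
# Hypothetical stand-in for the converter: [snp, sample] in, [sample, snp] out.
toyConvert <- function(gt) {
    if (ncol(gt) == 0L) {
        warning("no samples found; returning empty result")
        return(matrix(raw(0), nrow = 0L, ncol = nrow(gt)))
    }
    codes <- c("0/0" = 1L, "0/1" = 2L, "1/1" = 3L)[as.vector(gt)]
    codes[is.na(codes)] <- 0L
    t(matrix(as.raw(codes), nrow = nrow(gt), dimnames = dimnames(gt)))
}

# Dimension, valid-value and missing-data checks.
gt <- matrix(c("0/0", "0/1", "1/1", "./."), nrow = 2)
sm <- toyConvert(gt)
stopifnot(identical(dim(sm), rev(dim(gt))))            # [sample, snp]
stopifnot(all(as.integer(sm) %in% 0:3))                # only valid codes
stopifnot(sum(sm == as.raw(0)) == sum(gt == "./."))    # missing preserved

# No samples: expect a warning (not an error) and an empty result.
empty <- matrix(character(0), nrow = 2, ncol = 0)
warned <- FALSE
res <- withCallingHandlers(
    toyConvert(empty),
    warning = function(w) {
        warned <<- TRUE
        invokeRestart("muffleWarning")
    })
stopifnot(warned, nrow(res) == 0L)
```

Using withCallingHandlers() lets the test confirm that a warning was raised while still capturing the return value, which tryCatch() would discard.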

4. Create man page, update vignette

The vignette section 'Other Operations' briefly covers MatrixToSnpMatrix(). I'd like to revamp this section. Feel free to do what you'd like here in demonstrating the new function. We can get a small sample of new data from 1000 Genomes or elsewhere if it is useful for the man page, unit tests and vignette.

5. Deprecate MatrixToSnpMatrix()

Once genotypeToSnpMatrix() is functional we will deprecate MatrixToSnpMatrix(). Both the code file and man page need deprecation messages.
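On the code side, base R's .Deprecated() is one way to emit the message. The sketch below is only an outline: the argument list is simplified, and the genotypeToSnpMatrix() defined here is a placeholder so the example runs on its own.

```r
# Placeholder for the new function, defined only so the example is runnable.
genotypeToSnpMatrix <- function(x, ...) x

# Old entry point: warn, then forward to the replacement.
# The (x, ...) signature is a simplification, not the real argument list.
MatrixToSnpMatrix <- function(x, ...) {
    .Deprecated("genotypeToSnpMatrix")
    genotypeToSnpMatrix(x, ...)
}

# Usage: the call still works but signals a deprecation warning.
out <- suppressWarnings(MatrixToSnpMatrix(1:3))
```

Forwarding to the new function keeps existing user code working during the deprecation period.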


Related packages and functions

  • GGtools: specifically vcf2sm()
  • snpStats: specifically g2post(), post2g()
  • SNPRelate: specifically snpgdsVCF2GDS()
  • GWASTools

Save for a later date:

globalProbability generic and methods:

Create generic and methods for CollapsedVCF and matrix.

Generic signature

setGeneric("globalProbability", function(x, ...) standardGeneric("globalProbability") )

Compute global probabilities and return a numeric matrix with dimensions [snp,3]. The methods will have an argument 'exact' with a default of FALSE. When 'exact=TRUE' the exact numeric values rather than the SnpMatrix encoding will be returned.

Global probabilities may be computed using (M^2, 2Mm, m^2) where M and m are the median major and minor allele frequencies over all SNPs. The goal of the related VCF project, 'VCF_alleleFrequency', is to create a function that computes the major/minor allele frequency of the genotypes in geno(vcf)$GT (i.e., compute M and m). If this function is done before you get too far we can use it here. If not, we can integrate it later.
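The (M^2, 2Mm, m^2) calculation can be sketched in base R as below. The function name globalProbabilitySketch and its input, a vector of per-SNP minor allele frequencies, are assumptions. The sketch also assumes biallelic SNPs, so that M = 1 - m, and it ignores the 'exact' argument.

```r
# 'maf' is a hypothetical numeric vector of per-SNP minor allele frequencies.
globalProbabilitySketch <- function(maf) {
    m <- median(maf, na.rm = TRUE)   # median minor allele frequency
    M <- 1 - m                       # median major allele frequency (biallelic)
    p <- c(M^2, 2 * M * m, m^2)      # the three genotype probabilities
    # The same three probabilities are repeated for every SNP: [snp, 3].
    matrix(p, nrow = length(maf), ncol = 3L, byrow = TRUE,
           dimnames = list(names(maf), c("MM", "Mm", "mm")))
}

# Usage with three toy SNPs.
maf <- c(rs1 = 0.1, rs2 = 0.2, rs3 = 0.3)
gp <- globalProbabilitySketch(maf)
```

Because M and m are medians over all SNPs, every row of the result is identical and sums to 1.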
