Overview

This page describes VCF_alleleFrequency, a Bioconductor mentored project.

VCF_alleleFrequency

The aim of this project is to provide a function that computes allele frequency from the genotype data in the geno() slot of a VCF class object.

Project attributes and estimates:

  • Difficulty: Easy
  • Length: 2 weeks
  • Skills needed: R programming, familiarity with S4 classes
  • Deliverables: Implement, test and document snpSummary,CollapsedVCF-method
  • Mentor: Valerie Obenchain
  • Mentee: Chris Wallace

Summary of Deliverables

  • snpSummary generic
  • snpSummary,CollapsedVCF method
  • Man page
  • Unit tests

Implementation Details

Ultimately we want a function that computes allele frequency, minor allele frequency, genotype frequency and Hardy-Weinberg estimates from the genotype data in a VCF class object. These measures have many steps in common, so once one is computed, much of the rest comes along for free. This project, computing allele frequency, is a first step in that direction.
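To illustrate how much these measures share, the sketch below derives all four from a single set of genotype counts at one variant. It is an illustration only: summarizeCounts() and its argument names are hypothetical, not part of VariantAnnotation, and the Hardy-Weinberg part assumes the textbook 1-degree-of-freedom chi-square test, which may not match what snpStats reports.

```r
## Hypothetical helper (not a VariantAnnotation function).
## nRef, nHet, nAlt: counts of 0/0, 0/1 and 1/1 genotypes at one variant.
summarizeCounts <- function(nRef, nHet, nAlt) {
    n <- nRef + nHet + nAlt
    ## Each homozygote carries two copies of an allele, each heterozygote
    ## one, out of 2n alleles in total.
    refFreq <- (2 * nRef + nHet) / (2 * n)
    altFreq <- 1 - refFreq
    observed <- c(nRef, nHet, nAlt)
    ## Expected genotype counts under Hardy-Weinberg equilibrium.
    expected <- n * c(refFreq^2, 2 * refFreq * altFreq, altFreq^2)
    ## The statistic is undefined with no samples or a monomorphic site.
    if (n == 0 || refFreq == 0 || refFreq == 1) {
        chisq <- NA_real_
    } else {
        chisq <- sum((observed - expected)^2 / expected)
    }
    list(alleleFreq = c(ref = refFreq, alt = altFreq),
         maf = min(refFreq, altFreq),
         genotypeFreq = observed / n,
         hweChisq = chisq,
         hwePvalue = pchisq(chisq, df = 1, lower.tail = FALSE))
}

res <- summarizeCounts(nRef = 25, nHet = 50, nAlt = 25)
```

The allele counts are computed once and reused for every other quantity, which is why the later measures cost little once allele frequency is in place.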

1. Generic and method

This function will eventually compute more than just allele frequency, so I think a name like snpSummary() is more appropriate. The generic should have the following signature:

setGeneric("snpSummary", function(x, ...) standardGeneric("snpSummary") )

and the method:

setMethod("snpSummary", "CollapsedVCF",
    function(x, ...) {
        ## preprocess, compute, etc.
    })

The generic should be added to AllGenerics.R and the method will go in methods-CollapsedVCF-class.R.

The VCF class structure has recently been reorganized. VCF is now VIRTUAL and the two concrete subclasses are CollapsedVCF and ExpandedVCF. The CollapsedVCF is equivalent to the old VCF class.
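The effect of a virtual parent with two concrete subclasses can be seen in a small, self-contained sketch. The class and generic names here (ToyVCF, ToyCollapsedVCF, ToyExpandedVCF, toySummary) are hypothetical stand-ins so the example runs without VariantAnnotation; the real classes have many more slots.

```r
library(methods)

## Toy stand-ins for the real classes; all names are hypothetical.
setClass("ToyVCF", representation("VIRTUAL", gt = "matrix"))
setClass("ToyCollapsedVCF", contains = "ToyVCF")
setClass("ToyExpandedVCF", contains = "ToyVCF")

setGeneric("toySummary", function(x, ...) standardGeneric("toySummary"))

## Method defined for one concrete subclass only.
setMethod("toySummary", "ToyCollapsedVCF",
    function(x, ...) {
        dim(x@gt)
    })

collapsed <- new("ToyCollapsedVCF", gt = matrix("0/1", nrow = 2, ncol = 3))
dims <- toySummary(collapsed)

## The virtual class cannot be instantiated, and the sibling subclass
## does not inherit a method written for ToyCollapsedVCF.
virtualFails <- inherits(try(new("ToyVCF"), silent = TRUE), "try-error")
siblingFails <- inherits(
    try(toySummary(new("ToyExpandedVCF", gt = matrix("0/1"))), silent = TRUE),
    "try-error")
```

A method written for the collapsed class alone therefore leaves the expanded class unsupported until a separate method is added for it or for the virtual parent.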

2. Compute allele frequency

General Steps:

  • Confirm the VCF file holds some SNPs and not only indels or structural variants.
  • Handle the cases listed on the man page for MatrixToSnpMatrix() (e.g., only diploid calls are handled, variants with >1 ALT allele are set to NA, etc.). See ?MatrixToSnpMatrix for details.
  • allele frequency = ((heterozygous/2) + homozygous) / nsamples, where heterozygous is the number of samples with genotype '1|0' or '0|1' and homozygous is the number of samples with genotype '1|1'. Genotype data are a character matrix in geno(vcf)$GT.
  • For now we won't distinguish between phased and unphased so genotypes with either '|' or '/' should be included.
  • Return a numeric vector with variant names.
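The counting step above can be sketched in base R on a plain character matrix shaped like geno(vcf)$GT. This is a minimal sketch, not the VariantAnnotation implementation: altAlleleFrequency() is a hypothetical name, and it assumes that nsamples means the number of samples with a recognised diploid biallelic call, so missing or multi-allelic calls are simply left out of both numerator and denominator.

```r
## Hypothetical helper, not the VariantAnnotation implementation.
## gt: character matrix of genotype calls, variants in rows, samples in columns.
altAlleleFrequency <- function(gt) {
    calls <- gt
    ## Treat phased ('|') and unphased ('/') calls alike; assigning into
    ## calls[] keeps the dimensions and dimnames of the input.
    calls[] <- gsub("|", "/", gt, fixed = TRUE)
    calls[is.na(calls)] <- "./."
    het <- calls == "0/1" | calls == "1/0"
    hom <- calls == "1/1"
    ref <- calls == "0/0"
    ## Assumption: anything else (missing, haploid, '1/2', ...) is not counted.
    nCalled <- rowSums(het | hom | ref)
    freq <- (rowSums(het) / 2 + rowSums(hom)) / nCalled
    freq[nCalled == 0] <- NA_real_
    freq
}

gt <- matrix(c("0|1", "1/1", "0/0", "0/0",
               "1|1", "1|0", "./.", "0/0"),
             nrow = 2, byrow = TRUE,
             dimnames = list(c("rs1", "rs2"), paste0("sample", 1:4)))
freq <- altAlleleFrequency(gt)
```

For rs1 this gives (1/2 + 1) / 4 = 0.375, and for rs2, where one call is missing, (1/2 + 1) / 3 = 0.5. The result is a numeric vector named by variant, taken from the row names of the input.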

3. Add man page

Add a snpSummary.Rd man page to VariantAnnotation/man.

4. Add unit tests

For now put the unit tests in inst/unitTests/test_VCF-class.R. We will move them to test_probability-methods.R when Stephanie adds it in.

Concepts to test:

  • Check the dimensions of the output against the input.

  • Check that when allele.freq.a0 == 1.0, hwe.Z and hwe.p.value are NA.

  • The ALT column in a CollapsedVCF can be a DNAStringSetList or, for structural variants, a CompressedCharacterList. The method should check the ALT column to see whether we are dealing with structural variants. If so, it should emit a warning and return an empty matrix. You can use VariantAnnotation/inst/extdata/structural.vcf for testing.

  • Calling snpSummary() on a CollapsedVCF with no samples should emit a warning (not an error) and return an empty matrix with the appropriate columns.

  • For the common output variables, check the consistency of snpSummary() with that of col.summary() on a SnpMatrix.
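Two of the checks above (output dimensions and the no-samples warning) might look like the sketch below. It is written against toySnpSummary(), a hypothetical stand-in that works on a plain genotype matrix, and uses stopifnot() to stay dependency-free; the real tests would target snpSummary() and use the framework already in place in inst/unitTests. The column name allele.freq.a1 and the zero-row shape of the empty result are assumptions made for the example.

```r
## Hypothetical stand-in for snpSummary(); operates on a plain genotype matrix.
toySnpSummary <- function(gt) {
    cols <- c("allele.freq.a0", "allele.freq.a1")
    if (ncol(gt) == 0L) {
        warning("no samples; returning empty matrix")
        return(matrix(numeric(0), nrow = 0, ncol = length(cols),
                      dimnames = list(NULL, cols)))
    }
    calls <- gt
    calls[] <- gsub("|", "/", gt, fixed = TRUE)
    a1 <- (rowSums(calls == "0/1" | calls == "1/0") / 2 +
           rowSums(calls == "1/1")) / ncol(gt)
    cbind(allele.freq.a0 = 1 - a1, allele.freq.a1 = a1)
}

gt <- matrix(c("0/1", "1|1", "0/0", "0/0"), nrow = 2, byrow = TRUE,
             dimnames = list(c("rs1", "rs2"), c("s1", "s2")))

## One output row per input variant.
res <- toySnpSummary(gt)
stopifnot(nrow(res) == nrow(gt))

## Zero samples: a warning (not an error) and an empty matrix
## with the usual columns.
warned <- FALSE
empty <- withCallingHandlers(
    toySnpSummary(gt[, 0, drop = FALSE]),
    warning = function(w) {
        warned <<- TRUE
        invokeRestart("muffleWarning")
    })
stopifnot(warned, nrow(empty) == 0L,
          identical(colnames(empty), colnames(res)))
```

Catching the warning with withCallingHandlers() lets the call finish and return its value, so the same test can inspect both the warning and the shape of the result.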

References

  • MatrixToSnpMatrix() in VariantAnnotation
  • snpStats, specifically col.summary()
  • GWASTools