Skip to content
Krishna Veeramah edited this page Jan 5, 2016 · 12 revisions

Input file: lcMLkin requires as input a single vcf file containing biallelic genotypes and genotype likelihoods for all individuals that are being investigated.

Genotypes are indicated by the GT field and should consist of only two possible alleles, the reference allele (‘0’) and an alternate allele (‘1’). We will try and expand the software to allow two alternate alleles (which would be indicated by ‘1’ and ‘2’) in the near future.

Likelihoods can be provided in the vcf in three formats, with the particular format indicated by the –l flag at the command line. The three formats are raw likelihoods (-l raw), log10-scaled likelihoods (-l log)—both indicated by the GL field in the vcf—and Phred-scaled likelihoods (-l phred), indicated by the PL field. Only the 3 likelihoods compatible with the 2 possible SNP variants should be used, rather than likelihoods for all 10 possible genotypes (i.e. ref/ref, ref/alt, alt/alt).

Generating Genotype likelihoods: There are numerous methods for generating genotype likelihoods. Probably the most popular is that implemented in GATK using either [Unified Genotyper] (https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_genotyper_UnifiedGenotyper.php) or [Haplotype Caller] (https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_haplotypecaller_HaplotypeCaller.php). Ultimately what algorithm to use to generate likelihoods in a vcf file is the choice of the user.

However, regardless of which algorithm is used, it is important that sites are not excluded from the resulting vcf (i.e. set to missing) due to low genotype quality. The only reason that we would suggest sites are set to missing is because there are 0 reads at a site that meet some mapping quality threshold (<15). lcMLkin explicitly uses the information at low quality sites (of which there will be many for low coverage data), and excluding them may in particular bias against evidence for true heterozygote calls.

We have provided software, SNPbam2vcf.py that will produce the appropriate genotype likelihood data when given a list of bams and target SNPs (in this case raw likelihoods). The bams do not have to be pre-filtered for the target SNPs, they simply need to be indexed.

If using another program such as GATK, it is important when performing the SNP calling to change the appropriate default parameters such -stand_call_conf and -stand_emit_conf, which have default values of a phred quality of 30.

Choosing SNPs: The model of lcMLkin assumes that each site examined is independent. Therefore given a possible list of SNPs (for example in a vcf), these should be filtered such that all SNPs considered are a particular distance apart. What this distance should be will depend on the recombination rate of your species of interest. Minor violations of SNP independence should not greatly impact the results, and we recommend that ~10,000-100,000 SNPs are used for a human-sized genome (3Gb).

In addition, the SNPs that are chosen should have a relatively high minor allele frequency. SNPs were there are only one or two alternative alleles are not going to be particular informative about identity-by-descent for a large sample. At least greater than 5% is recommended.

A larger VCF containing genome-scale data can be filtered to a smaller number of SNP using existing tools. If the sites that will be examined are already known, we recommend using the intersect function from BEDtools. With a list of SNP locations given in a bed file (in zero-based co-ordinates), a command line may be something like this:

bedtools intersect –a input.vcf –b SNPs.bed –header > output.vcf

If the user wants to identify appropriate SNPs directly from the larger VCF, then [vcftools] (http://vcftools.sourceforge.net/) provides this functionality. To thin SNPs so they are at least 100K apart and have a minor allele frequency of 5%, the command line would be something like this:

vcftools --vcf input.vcf --thin 100000 --remove-indels --maf 0.05 --recode --recode-INFO-all --out output

Dealing with high numbers of related individuals: lcMLkin requires the estimation of allele frequencies as part of its model, which by default it does by assuming all individuals considered are unrelated. When only a small proportion of the pairs individuals being considered are related to some degree, estimation of allele frequencies under this assumption should not be a major issue. However, if there are a large number of related individuals there may be a greater degree of bias introduced. When this is the case, users can provide a list of founders in a text file (one per line) using the –u flag, and only those individuals will be used to estimate allele frequencies.

Of course, if founders are already known then this negates the purpose of the software. Our recommendation therefore is to first run lcMLkin assuming all individuals are unrelated. From this, pairs of related individuals up to some threshold can be identified (for example pi_hat > 0.05), even if the exact kinship coefficient cannot yet be determined with high reliability. Given the discovery of these pairs, lcMLkin can then be rerun using only one of each pair (or whatever combination of individuals provides a set of relatively unrelated individuals) to obtain more reliable estimates of kinship coefficients.

Clone this wiki locally