LiverpoolHarry/TagCNV

TagCNV

TagCNV is a Java package that takes a list of CNV and a VCF file of phased SNP for a single chromosome and extracts r2 values from the VCF file for SNP in CNV regions. The programme first creates haplotypes across each CNV based on either r2 (default) or D' linkage between alleles. Allele pairs with r2 (or D') above a threshold linkage value are joined into haplotypes. D' gives much larger haplotypes, and consequently there are more haplotype alleles at each locus.

Having identified the haplotypes, TagCNV calculates r2 for the correlation between each haplotype allele and fractional copy number. This is the strategy used by other publications (to be checked). Each sample is coded 0, 1 or 2 depending on the number of alleles of a given haplotype it has, and this is correlated with the copy number for the sample using linear modelling (lm command) in R. The Java programme creates an R script and then calls Rcommand to run it.
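As a rough illustration of the statistic, the r2 of a simple linear regression of copy number on haplotype allele count is the squared Pearson correlation between the two. The sketch below computes it directly in Java. It is not TagCNV's own code (TagCNV delegates this step to lm in R); the class name HaplotypeCnvR2 and the sample values are made up for the example.

```java
// Hypothetical helper, not part of TagCNV: squared Pearson correlation,
// which equals the r2 of a simple linear regression of y on x.
final class HaplotypeCnvR2 {

    /** r2 between haplotype allele counts (0, 1 or 2) and fractional copy number. */
    static double rSquared(int[] hapCounts, double[] copyNumber) {
        if (hapCounts.length != copyNumber.length || hapCounts.length < 2) {
            throw new IllegalArgumentException("need two equal-length arrays with at least 2 samples");
        }
        int n = hapCounts.length;
        double meanX = 0, meanY = 0;
        for (int i = 0; i < n; i++) {
            meanX += hapCounts[i];
            meanY += copyNumber[i];
        }
        meanX /= n;
        meanY /= n;

        double sxx = 0, syy = 0, sxy = 0;
        for (int i = 0; i < n; i++) {
            double dx = hapCounts[i] - meanX;
            double dy = copyNumber[i] - meanY;
            sxx += dx * dx;
            syy += dy * dy;
            sxy += dx * dy;
        }
        if (sxx == 0 || syy == 0) {
            return 0.0; // no variation in one variable: report no association
        }
        return (sxy * sxy) / (sxx * syy);
    }

    public static void main(String[] args) {
        int[] counts = {0, 1, 2, 1, 0};          // made-up haplotype allele counts per sample
        double[] cn = {2.0, 3.1, 3.9, 2.9, 2.1}; // made-up fractional copy numbers
        System.out.printf("r2 = %.3f%n", rSquared(counts, cn));
    }
}
```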

Output

Two files are generated: a list of loci with evidence for association between haplotypes and CNV (out.txt), and a list of the same loci with the positions of the SNP in the haplotypes (out.HapPositions.txt). Additional output is available if cleanUp=false.

To run the programme:

TagCNV.jar is in the dist directory:

java -jar TagCNV.jar minld=0.8 vcf=Chr22.snp.vcf.gz cnv=Chr22.cnv.vcf out=test flanks=10000 samples=sampleList.txt LMplots=true cleanUp=false

The example files Chr22.snp.vcf.gz and Chr22.cnv.vcf are included (Chr22.cnv.vcf must be gunzipped before use).

java -jar path/TagCNV.jar

Options should be entered as key=value pairs, with the key and value separated by = and no spaces

Eg: java -jar TagCNV.jar minld=0.8 vcf=Chr1.example.vcf.gz cnv=Chr1.example_cnv.vcf out=test flanks=10000 samples=sampleList.txt LMplots=true cleanUp=false
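The key=value convention can be read with a few lines of Java. This is only a sketch of the idea, not the parser TagCNV actually uses; the class KeyValueArgs and its methods are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of parsing key=value options; not TagCNV's own parser.
final class KeyValueArgs {

    /** Splits each argument at the first '=' and rejects anything that is not key=value. */
    static Map<String, String> parse(String[] args) {
        Map<String, String> options = new LinkedHashMap<>();
        for (String arg : args) {
            int eq = arg.indexOf('=');
            if (eq <= 0 || eq == arg.length() - 1) {
                throw new IllegalArgumentException("Expected key=value with no spaces, got: " + arg);
            }
            options.put(arg.substring(0, eq), arg.substring(eq + 1));
        }
        return options;
    }

    /** Fails early if any required key is missing. */
    static void requireKeys(Map<String, String> options, String... keys) {
        for (String key : keys) {
            if (!options.containsKey(key)) {
                throw new IllegalArgumentException("Missing required option: " + key);
            }
        }
    }

    public static void main(String[] args) {
        Map<String, String> options = parse(args);
        requireKeys(options, "cnv", "minld", "vcf", "out");
        // flanks defaults to 0 when it is not supplied
        int flanks = Integer.parseInt(options.getOrDefault("flanks", "0"));
        System.out.println(options + " flanks=" + flanks);
    }
}
```

Splitting at the first = only means a value may itself contain an = character, which can matter for unusual file names.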

WARNING: The programme is designed to process a single chromosome at a time

cnv: Required; VCF file with fractional values for copy number; eg as output by GenomeStrip. It should NOT be bgzipped

minld: Required; Threshold ld value for building haplotypes

vcf: Required; String; vcf file of PHASED SNP to be used. It MUST be bgzipped

out: Required; Root name for output files

ld: Optional; Tab delimited file of r2 linkage values generated with vcftools, eg: vcftools --gzvcf Chr20.vcf.gz --bed Chr20.list.bed --hap-r2 --ld-window-bp 50000 --out Chr20-ld50000; where the bed file is a list of CNV loci. This will produce a file with the columns CHR POS1 POS2 N_CHR R^2 D Dprime. Only the R^2 column will be used; D and Dprime will be ignored unless the parameter dprime=true is used. If ld is not set then vcftools will be used to generate the file
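To make the column usage concrete, the sketch below reads one line of such a file and applies the threshold. It assumes the seven tab-delimited columns listed above, in that order, with a header line starting with CHR. The class LdLine is a hypothetical name and this is not TagCNV's own reader. Values that do not parse as numbers are treated as missing.

```java
// Hypothetical sketch, not TagCNV's own reader. Assumed column order:
// CHR POS1 POS2 N_CHR R^2 D Dprime (0-based indices 0..6).
final class LdLine {
    private static final int R2_COLUMN = 4;
    private static final int DPRIME_COLUMN = 6;

    /** Returns R^2 by default, or Dprime when useDprime is true; NaN if not numeric. */
    static double linkage(String line, boolean useDprime) {
        String[] cols = line.split("\t");
        int idx = useDprime ? DPRIME_COLUMN : R2_COLUMN;
        if (cols.length <= idx) {
            throw new IllegalArgumentException("too few columns: " + line);
        }
        try {
            return Double.parseDouble(cols[idx]);
        } catch (NumberFormatException e) {
            return Double.NaN; // non-numeric entry: treat as missing
        }
    }

    /** True when the pair's linkage value is above the minld threshold. */
    static boolean isLinked(String line, boolean useDprime, double minld) {
        if (line.startsWith("CHR\t")) {
            return false; // header line
        }
        return linkage(line, useDprime) > minld; // NaN never passes
    }

    public static void main(String[] args) {
        String line = "20\t100\t250\t400\t0.91\t0.12\t0.98"; // made-up values
        System.out.println(isLinked(line, false, 0.8));
    }
}
```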

dprime: Optional; Boolean; if dprime=true then dprime will be used for ld calculations instead of the default r squared

flanks: Optional; Integer; length of flanks (in bp) either side of the CNV locus to test for haplotypes associated with CNV; default 0

samples: Optional; String; file with list of subset of samples from vcf file to be included in the analysis

LMplots: Optional; Boolean; default false; generate plots of the Linear Regression of Copy Number on haplotype genotype count. There will be a lot of these, so perhaps best used on short runs on interesting CNV and for diagnostics

cleanUp: Optional; Boolean; default true; if cleanUp=false then intermediate files will not be deleted at the end of the run. This is useful for debugging and visualising the algorithm

How Haplotypes are constructed

Many loci have high r2 with multiple other loci. The programme groups these into haplotypes as follows. For each locus with a high r2 with at least one other locus, it extracts the set of all loci with which it has high r2. It then tests each set of loci within a CNV against all the others, and if one set is a subset of another it removes the smaller set. However, if two sets have some but not all loci in common, it leaves them as two separate haplotypes. In initial tests most loci fall into discrete sets of haplotypes.

Eg: For chromosome 20 of the TrypanoGEN data there were a total of 18341 loci within CNV, of which 6873 had r squared > 0.8. These formed 2597 haplotypes with a mean size of 3.4 loci. The haplotypes had 8826 loci in total, 1953 more than the 6873 linked loci, so 1953 loci appeared in more than one haplotype.
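The grouping rule can be sketched in Java as follows. This is an illustration under stated assumptions, not the code in the distribution: the class HaplotypeSets is a hypothetical name, loci are represented by their positions, the input is the list of locus pairs that passed the threshold, and each locus is assumed to be a member of its own set.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch of the subset-removal rule; not TagCNV's own code.
final class HaplotypeSets {

    /** Each pair is {pos1, pos2} for two loci whose linkage is above the threshold. */
    static List<Set<Integer>> build(List<int[]> highLdPairs) {
        // One candidate set per locus: the locus plus every locus it is linked to.
        Map<Integer, Set<Integer>> byLocus = new TreeMap<>();
        for (int[] pair : highLdPairs) {
            byLocus.computeIfAbsent(pair[0], HaplotypeSets::seed).add(pair[1]);
            byLocus.computeIfAbsent(pair[1], HaplotypeSets::seed).add(pair[0]);
        }
        // Identical sets collapse to one candidate.
        Set<Set<Integer>> candidates = new LinkedHashSet<>(byLocus.values());

        // Drop any candidate that is a proper subset of a larger candidate.
        List<Set<Integer>> haplotypes = new ArrayList<>();
        for (Set<Integer> candidate : candidates) {
            boolean contained = false;
            for (Set<Integer> other : candidates) {
                if (other.size() > candidate.size() && other.containsAll(candidate)) {
                    contained = true;
                    break;
                }
            }
            if (!contained) {
                haplotypes.add(candidate);
            }
        }
        return haplotypes;
    }

    private static Set<Integer> seed(Integer locus) {
        Set<Integer> set = new TreeSet<>();
        set.add(locus);
        return set;
    }

    public static void main(String[] args) {
        // Chain 1-2, 2-3, 3-4: two partially overlapping haplotypes remain.
        List<int[]> pairs = Arrays.asList(new int[]{1, 2}, new int[]{2, 3}, new int[]{3, 4});
        System.out.println(build(pairs)); // prints [[1, 2, 3], [2, 3, 4]]
    }
}
```

In the chain example the sets {1, 2} and {3, 4} are removed as subsets, while {1, 2, 3} and {2, 3, 4} share loci 2 and 3 but neither contains the other, so both are kept. This is how a locus can appear in more than one haplotype.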

Note: The input CNV VCF file is assumed to have the following structure, as generated by GenomeStrip (with 1 indexed columns):

  1. Start pos of cnv in column 2

  2. End pos of cnv as first field of INFO column in format END=1234567

  3. Data fields: separator “:”; field 2 is the integer copy number; field 3 is the fractional copy number
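A minimal Java sketch of reading one record under these assumptions is given below. It is not the parser in the distribution; the class CnvVcfRecord is a hypothetical name. It additionally relies on the standard VCF layout, in which INFO is column 8 and the per-sample data columns start at column 10, and it treats a missing or "." fractional value as NaN. The FORMAT keys in the example line are illustrative only.

```java
// Hypothetical sketch of reading one CNV record; not TagCNV's own parser.
final class CnvVcfRecord {
    final int start;                     // column 2 (1 indexed)
    final int end;                       // END= value, first field of INFO
    final double[] fractionalCopyNumber; // field 3 of each sample column

    private CnvVcfRecord(int start, int end, double[] fractionalCopyNumber) {
        this.start = start;
        this.end = end;
        this.fractionalCopyNumber = fractionalCopyNumber;
    }

    static CnvVcfRecord parse(String line) {
        String[] cols = line.split("\t");
        if (cols.length < 10) {
            throw new IllegalArgumentException("expected at least 10 tab-separated columns");
        }
        int start = Integer.parseInt(cols[1]);

        String firstInfo = cols[7].split(";")[0];
        if (!firstInfo.startsWith("END=")) {
            throw new IllegalArgumentException("first INFO field is not END=: " + firstInfo);
        }
        int end = Integer.parseInt(firstInfo.substring("END=".length()));

        double[] fractional = new double[cols.length - 9];
        for (int i = 9; i < cols.length; i++) {
            String[] fields = cols[i].split(":");
            boolean present = fields.length >= 3 && !fields[2].equals(".");
            fractional[i - 9] = present ? Double.parseDouble(fields[2]) : Double.NaN;
        }
        return new CnvVcfRecord(start, end, fractional);
    }

    public static void main(String[] args) {
        // Made-up record; the FORMAT keys are illustrative only.
        String line = "22\t1000\tCNV_1\tN\t<CNV>\t.\tPASS\tEND=5000;X=1\tGT:CN:CNF\t.:2:2.04\t.:3:2.97";
        CnvVcfRecord record = parse(line);
        System.out.println(record.start + "-" + record.end + " samples=" + record.fractionalCopyNumber.length);
    }
}
```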

Source files

Java class files and a NetBeans project are also included with the distribution.

About

Tag Copy Number Variations (CNV) with SNP haplotypes
