# Match markers in two VCF files

Applications such as haplotyping and imputation often need to match the markers in a target VCF file to those in a reference panel. Matching can be by marker ID, if the two data sets are on different builds, or by chromosome position, if the two data sets are on the same build.
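The difference between the two matching criteria can be illustrated with a toy example. This is a minimal sketch that does not use VCFTools: the marker records, IDs, positions, and the helper `shared` are all made up for illustration.

```julia
# Toy marker records (hypothetical values, not taken from the example files).
ref = [(id = "rs1", pos = 100), (id = "rs2", pos = 250), (id = "rs3", pos = 400)]
# A target set whose coordinates partly disagree with the reference.
tgt = [(id = "rs1", pos = 130), (id = "rs3", pos = 400), (id = "rs9", pos = 250)]

# Hypothetical helper: sorted values of field `key` present in both marker sets.
function shared(key::Symbol, a, b)
    return sort(collect(intersect(Set(m[key] for m in a), Set(m[key] for m in b))))
end

by_id  = shared(:id, ref, tgt)   # markers paired by ID
by_pos = shared(:pos, ref, tgt)  # markers paired by position
println("matched by ID:       ", by_id)
println("matched by position: ", by_pos)
```

In this toy data the two criteria pair different markers: `rs1` matches by ID even though its position differs, while the target marker at position 250 matches the reference by position only.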

## Example files

We need example VCF files for demonstration. You can manually download the target VCF file from this [link](http://faculty.washington.edu/browning/beagle/test.08Jun17.d8b.vcf.gz) (877KB) and put it in the current working directory. Or, within Julia,

In [1]:
isfile("test.08Jun17.d8b.vcf.gz") || download("http://faculty.washington.edu/browning/beagle/test.08Jun17.d8b.vcf.gz", 
    joinpath(pwd(), "test.08Jun17.d8b.vcf.gz"))
stat("test.08Jun17.d8b.vcf.gz")

StatStruct(mode=0o100644, size=876514)

We can manually download the reference panel for chromosome 22 (1000 Genomes Project) from this [link](http://bochet.gcc.biostat.washington.edu/beagle/1000_Genomes_phase3_v5a/b37.vcf/chr22.1kg.phase3.v5a.vcf.gz) (135.4MB) and put it in the current working directory. Or, within Julia,

In [2]:
isfile("chr22.1kg.phase3.v5a.vcf.gz") || download("http://bochet.gcc.biostat.washington.edu/beagle/1000_Genomes_phase3_v5a/b37.vcf/chr22.1kg.phase3.v5a.vcf.gz", 
    joinpath(pwd(), "chr22.1kg.phase3.v5a.vcf.gz"))
stat("chr22.1kg.phase3.v5a.vcf.gz")

StatStruct(mode=0o100644, size=135429468)

There are 424,147 markers and 2,504 samples (or 5,008 haplotypes) in the reference panel.

In [3]:
using VCFTools

nrecords("chr22.1kg.phase3.v5a.vcf.gz"), nsamples("chr22.1kg.phase3.v5a.vcf.gz")

(424147, 2504)

First 10 lines of the reference panel VCF:

In [4]:
fh = openvcf("chr22.1kg.phase3.v5a.vcf.gz", "r")
for l in 1:10
    println(readline(fh))
end
close(fh)

##fileformat=VCFv4.1
##filedate=20151210
##source="simplfy-vcf (r1211)"
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	HG00096	HG00097	HG00099	HG00100	HG00101	HG00102	HG00103	HG00105	HG00106	HG00107	HG00108	HG00109	HG00110	HG00111	HG00112	HG00113	HG00114	HG00115	HG00116	HG00117	HG00118	HG00119	HG00120	HG00121	HG00122	HG00123	HG00125	HG00126	HG00127	HG00128	HG00129	HG00130	HG00131	HG00132	HG00133	HG00136	HG00137	HG00138	HG00139	HG00140	HG00141	HG00142	HG00143	HG00145	HG00146	HG00148	HG00149	HG00150	HG00151	HG00154	HG00155	HG00157	HG00158	HG00159	HG00160	HG00171	HG00173	HG00174	HG00176	HG00177	HG00178	HG00179	HG00180	HG00181	HG00182	HG00183	HG00185	HG00186	HG00187	HG00188	HG00189	HG00190	HG00231	HG00232	HG00233	HG00234	HG00235	HG00236	HG00237	HG00238	HG00239	HG00240	HG00242	HG00243	HG00244	HG00245	HG00246	HG00250	HG00251	HG00252	HG00253	HG00254	HG00255	HG00256	HG00257	HG00258	HG00259	HG00260	HG00261	HG00262	HG00263	HG00264	HG002

## Match markers according to ID

The `conformgt_by_id` function matches the markers with a GT field in two VCF files according to marker ID.

The command below instructs `conformgt_by_id` to

1. Match the markers in the `tgtvcf` file to those in the `refvcf` file by ID, from position 20000086 to 20099941 on chromosome 22.
2. Adjust the target VCF markers so that chromosome strand and allele order match the reference VCF file.
3. Test whether the allele frequencies are equal between the target and reference markers. The last argument specifies the significance level. Setting it to `0==false` effectively disables this test. Setting it to `1` effectively rejects all tests, so no matched markers are output.
4. Write the matched VCF records into the files `outfile.tgt.vcf.gz` and `outfile.ref.vcf.gz`, where `outfile` is the third argument (`outvcf` here). Both files contain only the GT field.
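The strand and allele-order adjustment in the list above can be pictured with a small sketch. This is not the VCFTools implementation; the helpers `complement`, `flipstrand`, and `classify` are hypothetical names, and the sketch only handles alleles written with the bases A, C, G, and T.

```julia
# Complement of one base on the opposite strand; other characters pass through.
complement(c::Char) = c == 'A' ? 'T' : c == 'T' ? 'A' : c == 'C' ? 'G' : c == 'G' ? 'C' : c

# Reverse complement of an allele string.
flipstrand(allele::AbstractString) = reverse(map(complement, allele))

# Classify how a target (REF, ALT) pair relates to the reference (REF, ALT) pair.
function classify(refpair, tgtpair)
    r, a = tgtpair
    if (r, a) == refpair
        return :same             # nothing to adjust
    elseif (a, r) == refpair
        return :swapped          # REF and ALT are exchanged
    elseif (flipstrand(r), flipstrand(a)) == refpair
        return :flipped          # reported on the opposite strand
    elseif (flipstrand(a), flipstrand(r)) == refpair
        return :flipped_swapped  # opposite strand and exchanged
    else
        return :mismatch         # alleles cannot be reconciled
    end
end

println(classify(("A", "G"), ("T", "C")))
```

Note that A/T and C/G markers are ambiguous under this scheme, because swapping the alleles and flipping the strand give the same result. The sketch reports such cases as `:swapped` simply because that branch is checked first.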

In [5]:
tgtvcf = "test.08Jun17.d8b.vcf.gz"
refvcf = "chr22.1kg.phase3.v5a.vcf.gz"
outvcf = "conformgt.matched"
@time conformgt_by_id(refvcf, tgtvcf, outvcf, "22", 20000086:20099941, false)

┌ Info: Scan IDs in reference panel
└ @ VCFTools /Users/biona001/.julia/dev/VCFTools/src/conformgt.jl:42
Progress: 100%|█████████████████████████████████████████| Time: 0:00:25
┌ Info: Match target IDs to reference IDs
└ @ VCFTools /Users/biona001/.julia/dev/VCFTools/src/conformgt.jl:75
Progress: 100%|█████████████████████████████████████████| Time: 0:00:25
┌ Info: 823 records are matched
└ @ VCFTools /Users/biona001/.julia/dev/VCFTools/src/conformgt.jl:124


 71.045000 seconds (697.25 M allocations: 65.092 GiB, 6.85% gc time)


823

In [6]:
stat("conformgt.matched.tgt.vcf.gz")

StatStruct(mode=0o100644, size=47996)

In [7]:
stat("conformgt.matched.ref.vcf.gz")

StatStruct(mode=0o100644, size=292444)

The first 5 markers in the matched target file are

In [8]:
fh = openvcf("conformgt.matched.tgt.vcf.gz", "r")
for l in 1:35
    println(readline(fh))
end
close(fh)

##fileformat=VCFv4.1
##INFO=<ID=LDAF,Number=1,Type=Float,Description="MLE Allele Frequency Accounting for LD">
##INFO=<ID=AVGPOST,Number=1,Type=Float,Description="Average posterior probability from MaCH/Thunder">
##INFO=<ID=RSQ,Number=1,Type=Float,Description="Genotype imputation quality from MaCH/Thunder">
##INFO=<ID=ERATE,Number=1,Type=Float,Description="Per-marker Mutation rate from MaCH/Thunder">
##INFO=<ID=THETA,Number=1,Type=Float,Description="Per-marker Transition rate from MaCH/Thunder">
##INFO=<ID=CIEND,Number=2,Type=Integer,Description="Confidence interval around END for imprecise variants">
##INFO=<ID=CIPOS,Number=2,Type=Integer,Description="Confidence interval around POS for imprecise variants">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">
##INFO=<ID=HOMLEN,Number=.,Type=Integer,Description="Length of base pair identical micro-homology at event breakpoints">
##INFO=<ID=HOMSEQ,Number=.,Type=String,Description="Seque

The first 5 markers in the matched reference file are

In [9]:
fh = openvcf("conformgt.matched.ref.vcf.gz", "r")
for l in 1:10
    println(readline(fh))
end
close(fh)

##fileformat=VCFv4.1
##filedate=20151210
##source="simplfy-vcf (r1211)"
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	HG00096	HG00097	HG00099	HG00100	HG00101	HG00102	HG00103	HG00105	HG00106	HG00107	HG00108	HG00109	HG00110	HG00111	HG00112	HG00113	HG00114	HG00115	HG00116	HG00117	HG00118	HG00119	HG00120	HG00121	HG00122	HG00123	HG00125	HG00126	HG00127	HG00128	HG00129	HG00130	HG00131	HG00132	HG00133	HG00136	HG00137	HG00138	HG00139	HG00140	HG00141	HG00142	HG00143	HG00145	HG00146	HG00148	HG00149	HG00150	HG00151	HG00154	HG00155	HG00157	HG00158	HG00159	HG00160	HG00171	HG00173	HG00174	HG00176	HG00177	HG00178	HG00179	HG00180	HG00181	HG00182	HG00183	HG00185	HG00186	HG00187	HG00188	HG00189	HG00190	HG00231	HG00232	HG00233	HG00234	HG00235	HG00236	HG00237	HG00238	HG00239	HG00240	HG00242	HG00243	HG00244	HG00245	HG00246	HG00250	HG00251	HG00252	HG00253	HG00254	HG00255	HG00256	HG00257	HG00258	HG00259	HG00260	HG00261	HG00262	HG00263	HG00264	HG002

Not only do the IDs of the two sets of markers match; the REF alleles match as well.

If we turn on allele frequency testing at significance level 0.05, fewer matched markers are reported:

In [10]:
@time conformgt_by_id(refvcf, tgtvcf, outvcf, "22", 20000086:20099941, 0.05)

┌ Info: Scan IDs in reference panel
└ @ VCFTools /Users/biona001/.julia/dev/VCFTools/src/conformgt.jl:42
Progress: 100%|█████████████████████████████████████████| Time: 0:00:24
┌ Info: Match target IDs to reference IDs
└ @ VCFTools /Users/biona001/.julia/dev/VCFTools/src/conformgt.jl:75
Progress: 100%|█████████████████████████████████████████| Time: 0:00:25


 64.095969 seconds (691.00 M allocations: 64.993 GiB, 7.29% gc time)


┌ Info: 488 records are matched
└ @ VCFTools /Users/biona001/.julia/dev/VCFTools/src/conformgt.jl:124


488
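The drop from 823 to 488 matched records comes from markers whose allele frequencies differ between the two files at the 0.05 level. The sketch below shows one way such a comparison could work. It assumes a Pearson chi-square test on a 2×2 table of allele counts, which may not be the test VCFTools uses. The function `chisq2x2` and the counts are hypothetical, and instead of a p-value the sketch compares the statistic with 3.841, the 0.05 critical value of the chi-square distribution with one degree of freedom.

```julia
# Pearson chi-square statistic for a 2x2 table of allele counts.
# alt1 of n1 alleles are ALT in the first data set, alt2 of n2 in the second.
function chisq2x2(alt1::Integer, n1::Integer, alt2::Integer, n2::Integer)
    a, b = float(alt1), float(n1 - alt1)
    c, d = float(alt2), float(n2 - alt2)
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    denom == 0 && return 0.0   # degenerate table: no evidence of a difference
    return (n1 + n2) * (a * d - b * c)^2 / denom
end

critical = 3.841   # chi-square(1) critical value at significance level 0.05

equal_freq = chisq2x2(50, 100, 50, 100)   # identical frequencies
diff_freq  = chisq2x2(10, 100, 50, 100)   # 0.10 versus 0.50

println("equal frequencies rejected?     ", equal_freq > critical)
println("different frequencies rejected? ", diff_freq > critical)
```

Under this assumption, a marker whose statistic exceeds the critical value would be excluded from the matched output.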

## Match markers according to position

The `conformgt_by_pos` function matches the markers with a GT field in two VCF files according to marker position. This is applicable when the target and reference data are on the same build.

The command below instructs `conformgt_by_pos` to

1. Match the markers in the `tgtvcf` file to those in the `refvcf` file by position, from position 20000086 to 20099941 on chromosome 22.
2. Adjust the target VCF markers so that chromosome strand and allele order match the reference VCF file.
3. Test whether the allele frequencies are equal between the target and reference markers. The last argument specifies the significance level. Setting it to `0==false` effectively disables this test. Setting it to `1` effectively rejects all tests, so no matched markers are output.
4. Write the matched VCF records into the files `outfile.tgt.vcf.gz` and `outfile.ref.vcf.gz`, where `outfile` is the third argument (`outvcf` here). Both files contain only the GT field.
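Position-based matching within a chromosome and position window can be sketched on toy data as follows. This is an illustration only, not the VCFTools implementation: the records and the helper `match_by_pos` are hypothetical.

```julia
# Toy records (hypothetical values); "." stands for a missing ID.
refmarkers = [(chrom = "22", pos = 20000086, id = "rsA"),
              (chrom = "22", pos = 20000500, id = "rsB"),
              (chrom = "22", pos = 20200000, id = "rsC")]
tgtmarkers = [(chrom = "22", pos = 20000500, id = "."),
              (chrom = "22", pos = 20200000, id = "."),
              (chrom = "21", pos = 20000086, id = ".")]

# Return (reference index, target index) pairs for markers that lie on `chrom`,
# fall inside `window`, and share the same position.
function match_by_pos(ref, tgt, chrom, window)
    inwindow(m) = m.chrom == chrom && m.pos in window
    # Index the reference markers inside the window by position.
    refindex = Dict(m.pos => i for (i, m) in enumerate(ref) if inwindow(m))
    return [(refindex[m.pos], j) for (j, m) in enumerate(tgt)
            if inwindow(m) && haskey(refindex, m.pos)]
end

matched = match_by_pos(refmarkers, tgtmarkers, "22", 20000086:20099941)
println(matched)
```

Only the second reference marker and the first target marker are paired here. The other target records either fall outside the window or sit on a different chromosome.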

In [11]:
tgtvcf = "test.08Jun17.d8b.vcf.gz"
refvcf = "chr22.1kg.phase3.v5a.vcf.gz"
outvcf = "conformgt.matched"
@time conformgt_by_pos(refvcf, tgtvcf, outvcf, "22", 20000086:20099941, false)

┌ Info: Match target POS to reference POS
└ @ VCFTools /Users/biona001/.julia/dev/VCFTools/src/conformgt.jl:172
Progress: 100%|█████████████████████████████████████████| Time: 0:00:25


 39.203119 seconds (355.95 M allocations: 33.467 GiB, 6.01% gc time)


┌ Info: 833 records are matched
└ @ VCFTools /Users/biona001/.julia/dev/VCFTools/src/conformgt.jl:239


833

The first 5 markers in the matched target file are

In [12]:
fh = openvcf("conformgt.matched.tgt.vcf.gz", "r")
for l in 1:35
    println(readline(fh))
end
close(fh)

##fileformat=VCFv4.1
##INFO=<ID=LDAF,Number=1,Type=Float,Description="MLE Allele Frequency Accounting for LD">
##INFO=<ID=AVGPOST,Number=1,Type=Float,Description="Average posterior probability from MaCH/Thunder">
##INFO=<ID=RSQ,Number=1,Type=Float,Description="Genotype imputation quality from MaCH/Thunder">
##INFO=<ID=ERATE,Number=1,Type=Float,Description="Per-marker Mutation rate from MaCH/Thunder">
##INFO=<ID=THETA,Number=1,Type=Float,Description="Per-marker Transition rate from MaCH/Thunder">
##INFO=<ID=CIEND,Number=2,Type=Integer,Description="Confidence interval around END for imprecise variants">
##INFO=<ID=CIPOS,Number=2,Type=Integer,Description="Confidence interval around POS for imprecise variants">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">
##INFO=<ID=HOMLEN,Number=.,Type=Integer,Description="Length of base pair identical micro-homology at event breakpoints">
##INFO=<ID=HOMSEQ,Number=.,Type=String,Description="Seque

The first 5 markers in the matched reference file are

In [13]:
fh = openvcf("conformgt.matched.ref.vcf.gz", "r")
for l in 1:10
    println(readline(fh))
end
close(fh)

##fileformat=VCFv4.1
##filedate=20151210
##source="simplfy-vcf (r1211)"
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	HG00096	HG00097	HG00099	HG00100	HG00101	HG00102	HG00103	HG00105	HG00106	HG00107	HG00108	HG00109	HG00110	HG00111	HG00112	HG00113	HG00114	HG00115	HG00116	HG00117	HG00118	HG00119	HG00120	HG00121	HG00122	HG00123	HG00125	HG00126	HG00127	HG00128	HG00129	HG00130	HG00131	HG00132	HG00133	HG00136	HG00137	HG00138	HG00139	HG00140	HG00141	HG00142	HG00143	HG00145	HG00146	HG00148	HG00149	HG00150	HG00151	HG00154	HG00155	HG00157	HG00158	HG00159	HG00160	HG00171	HG00173	HG00174	HG00176	HG00177	HG00178	HG00179	HG00180	HG00181	HG00182	HG00183	HG00185	HG00186	HG00187	HG00188	HG00189	HG00190	HG00231	HG00232	HG00233	HG00234	HG00235	HG00236	HG00237	HG00238	HG00239	HG00240	HG00242	HG00243	HG00244	HG00245	HG00246	HG00250	HG00251	HG00252	HG00253	HG00254	HG00255	HG00256	HG00257	HG00258	HG00259	HG00260	HG00261	HG00262	HG00263	HG00264	HG002

Not only do the chromosome and positions of the two sets of markers match; the REF alleles match as well.

If we turn on allele frequency testing at significance level 0.05, fewer matched markers are reported:

In [14]:
@time conformgt_by_pos(refvcf, tgtvcf, outvcf, "22", 20000086:20099941, 0.05)

┌ Info: Match target POS to reference POS
└ @ VCFTools /Users/biona001/.julia/dev/VCFTools/src/conformgt.jl:172
Progress: 100%|█████████████████████████████████████████| Time: 0:00:25


 39.299402 seconds (360.50 M allocations: 33.846 GiB, 6.04% gc time)


┌ Info: 493 records are matched
└ @ VCFTools /Users/biona001/.julia/dev/VCFTools/src/conformgt.jl:239


493