Not working with VCFs that have multiple chromosomes. #155

bambrozio · 2020-03-18T19:37:28Z

I have the 1000 genome phase 3 VCF's concatenated in a single VCF.
I can successfully read it with pyvcf, thus the file is consistent and healthy.
When I try to downsample it by performing:

$ gunzip -c ../ALL.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz | java -jar downsamplevcf.jar -N 1 -n 10000 > All.Phase3.10kSNPs.vcf
 ...
[INFO][DownSampleVcf]Count: 81,268,620 Elapsed: 57 minutes(90.29%) Remains: 6 minutes(9.71%) Last: 22:51,138,531
[INFO][DownSampleVcf]. Completed. N=81,271,745. That took:57 minutes

My result VCF contains only the chromosome 22.

I expected to get all the chromosomes, but with up to 10k SNPs in each of them. Am I using it wrongly, or is this a bug?

The text was updated successfully, but these errors were encountered:

lindenb · 2020-03-18T19:45:24Z

Hi,
yeah that's not the best tool for this as there is a buffer of -n 10000 variants and each time a new variant is read, its is replaced randomly in the buffer. So if there is more than 10000 variants for chr22, then it will be filled with chr22.

I agree the current doc is not clear about it.

You could try to use: http://lindenb.github.io/jvarkit/VCFShuffle.html and then pipe it into downsamplevcf.

bambrozio · 2020-03-18T21:29:21Z

Sorry, could you please provide a command line example?
Actually I wanted to downsample to 100k SNPs, do you think this is possible?
I'm giving a try on each SNP, generating a new VCF file for each of them. After that, I can concat the results. Do you think this will work? It's in progress here...

lindenb · 2020-03-18T22:24:27Z

with GNU tools:


gunzip -c  input.vcf.gz | grep '^#' > out.vcf

gunzip -c  input.vcf.gz | awk -F '\t' 'BEGIN{srand();} /^[^#]/ {printf("%d\t%s\n",int(rand()*100000),$0);}' | sort -T . -t $'\t' -k1,1n | head -n 1000 | cut -f 2- | sort -T . -t $'\t' -k1,1 -k2,2n -k4,4 >> out.vcf

bambrozio · 2020-03-19T19:53:31Z

I've given a try in a smaller VCF (405M), and I got two significant different outputs.
Using the AWK, I got an output VCF of 9.7M, while using the Jar, I got an output of 97M... Any idea why of the difference?

Here's the commands I utilised:

Using the AWK:

$ gunzip -c ~/Documents/1kGp3/1kgP3.bgz/ALL.chr10.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.subset.vcf.bgz | grep '^#' > chr10.vcf
$ gunzip -c ~/Documents/1kGp3/1kgP3.bgz/ALL.chr10.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.s
ubset.vcf.bgz | awk -F '\t' 'BEGIN{srand();} /^[^#]/ {printf("%d\t%s\n",int(rand()*100000),$0);}' | sort -T . -t $'\t' -k1,1n | head -n 1000 | cut -f 2- | sort -T . -t 
$'\t' -k1,1 -k2,2n -k4,4 >> chr10.vcf

Using the JAR:

gunzip -c ~/Documents/1kGp3/1kgP3.bgz/ALL.chr10.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.subset.vcf.bgz | java -jar downsamplevcf.jar -N 1 -n 10000 > chr10UsingJar.vcf

input:

$ ls -lah ~/Documents/1kGp3/1kgP3.bgz/ALL.chr10.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.subset.vcf.bgz 
-rw-r--r--  1 bambrozi  staff   405M 18 Mar 22:47 /Users/bambrozi/Documents/1kGp3/1kgP3.bgz/ALL.chr10.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.subset.vcf.bgz

Outputs:

$ ls -lath chr10*
-rw-r--r--  1 bambrozi  staff    97M 19 Mar 19:26 chr10UsingJar.vcf
-rw-r--r--  1 bambrozi  staff   9.7M 19 Mar 19:15 chr10.vcf

bambrozio · 2020-03-19T19:56:20Z

Do you think if I use the JAR over each chromosome (instead of the VCF that consolidates all them) I will get a reliable output?
If so, I can concat the output VCFs later (after the jar sampling)

lindenb · 2020-03-19T20:00:25Z

here you're taking 1000 variants. n | head -n 1000 | cut -f 2- | sort -T . -t

and here 10000 : -jar downsamplevcf.jar -N 1 -n 10000

Do you think if I use the JAR over each chromosome (instead of the VCF that consolidates all them) I will get a reliable output?

no, again, you shoud use the awk script or my tool vcfshuffle + vcfhead. It could be something like.

gunzip -c in.vcf.bgz | java -jar vcfshuflle.jar | java -jar vcfhead.jar -n 10000 > out.vcf

bambrozio · 2020-03-20T08:57:01Z

Hi, thanks for the help!
Still not working though:

With the AWK, I got an error in the sort command sort: No such file or directory:

$ echo START: `date` \
> && gunzip -c  ../1kgP3.bgz/ALL.chr10.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.bgz | grep '^#' > ALL.phase3.biallelic-only-10kSNP.vcf \
> && gunzip -c  ../1kgP3.bgz/ALL.chr10.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.bgz | awk -F '\t' 'BEGIN{srand();} /^[^#]/ {printf("%d\t%s\n",int(rand()*100000),$0);}' | sort -T . -t $'\t' -k1,1n | head -n 10000 | cut -f 2- | sort -T . -t $'\t' -k1,1 -k2,2n -k4,4 >> ALL.phase3.biallelic-only-10kSNP.vcf \
> echo START: `date`
START: Fri 20 Mar 2020 09:21:09 GMT
sort: No such file or directory

With the vcfshuffler.jar, I got:

$ gunzip -c ../1kgP3.bgz/ALL.chr10.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.bgz | java -jar vcfshuflle.jar | java -jar vcfhead.jar -n 100000 > ALL.phase3.biallelic-only-100kSNP.vcf 
Error: Unable to access jarfile vcfshuflle.jar
[SEVERE][VcfHead]Your input file has a malformed header: We never saw the required CHROM header line (starting with one #) for the input VCF file
htsjdk.tribble.TribbleException$InvalidHeader: Your input file has a malformed header: We never saw the required CHROM header line (starting with one #) for the input V
CF file
        at htsjdk.variant.vcf.VCFCodec.readActualHeader(VCFCodec.java:119)
        at htsjdk.variant.vcf.VCFIteratorBuilder$VCFReaderIterator.<init>(VCFIteratorBuilder.java:177)
        at htsjdk.variant.vcf.VCFIteratorBuilder.open(VCFIteratorBuilder.java:97)
        at com.github.lindenb.jvarkit.util.vcf.VCFUtils.createVCFIteratorFromInputStream(VCFUtils.java:288)
        at com.github.lindenb.jvarkit.util.vcf.VCFUtils.createVCFIteratorStdin(VCFUtils.java:335)
        at com.github.lindenb.jvarkit.util.vcf.VCFUtils.createVCFIterator(VCFUtils.java:312)
        at com.github.lindenb.jvarkit.util.jcommander.Launcher.openVCFIterator(Launcher.java:515)
        at com.github.lindenb.jvarkit.tools.misc.VcfHead.doWork(VcfHead.java:110)
        at com.github.lindenb.jvarkit.util.jcommander.Launcher.instanceMain(Launcher.java:777)
        at com.github.lindenb.jvarkit.util.jcommander.Launcher.instanceMainWithExit(Launcher.java:940)
        at com.github.lindenb.jvarkit.tools.misc.VcfHead.main(VcfHead.java:156)
[INFO][Launcher]vcfhead Exited with failure (-1)

I've tried with: bgz, gz and plain vcf.

I can read the VCF's normally using, for example, hail, pyvcf, plink... Thus, I'm assuming the VCF is healthy.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Not working with VCFs that have multiple chromosomes. #155

Not working with VCFs that have multiple chromosomes. #155

bambrozio commented Mar 18, 2020

lindenb commented Mar 18, 2020

bambrozio commented Mar 18, 2020

lindenb commented Mar 18, 2020

bambrozio commented Mar 19, 2020 •

edited

bambrozio commented Mar 19, 2020

lindenb commented Mar 19, 2020

bambrozio commented Mar 20, 2020 •

edited

Not working with VCFs that have multiple chromosomes. #155

Not working with VCFs that have multiple chromosomes. #155

Comments

bambrozio commented Mar 18, 2020

lindenb commented Mar 18, 2020

bambrozio commented Mar 18, 2020

lindenb commented Mar 18, 2020

bambrozio commented Mar 19, 2020 • edited

bambrozio commented Mar 19, 2020

lindenb commented Mar 19, 2020

bambrozio commented Mar 20, 2020 • edited

bambrozio commented Mar 19, 2020 •

edited

bambrozio commented Mar 20, 2020 •

edited