Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Not working with VCFs that have multiple chromosomes. #155

Open
bambrozio opened this issue Mar 18, 2020 · 7 comments
Open

Not working with VCFs that have multiple chromosomes. #155

bambrozio opened this issue Mar 18, 2020 · 7 comments

Comments

@bambrozio
Copy link

I have the 1000 genome phase 3 VCF's concatenated in a single VCF.
I can successfully read it with pyvcf, thus the file is consistent and healthy.
When I try to downsample it by performing:

$ gunzip -c ../ALL.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz | java -jar downsamplevcf.jar -N 1 -n 10000 > All.Phase3.10kSNPs.vcf
 ...
[INFO][DownSampleVcf]Count: 81,268,620 Elapsed: 57 minutes(90.29%) Remains: 6 minutes(9.71%) Last: 22:51,138,531
[INFO][DownSampleVcf]. Completed. N=81,271,745. That took:57 minutes

My result VCF contains only the chromosome 22.

I expected to get all the chromosomes, but with up to 10k SNPs in each of them. Am I using it wrongly, or is this a bug?

@lindenb
Copy link
Owner

lindenb commented Mar 18, 2020

Hi,
yeah that's not the best tool for this as there is a buffer of -n 10000 variants and each time a new variant is read, its is replaced randomly in the buffer. So if there is more than 10000 variants for chr22, then it will be filled with chr22.

I agree the current doc is not clear about it.

You could try to use: http://lindenb.github.io/jvarkit/VCFShuffle.html and then pipe it into downsamplevcf.

@bambrozio
Copy link
Author

Sorry, could you please provide a command line example?
Actually I wanted to downsample to 100k SNPs, do you think this is possible?
I'm giving a try on each SNP, generating a new VCF file for each of them. After that, I can concat the results. Do you think this will work? It's in progress here...

@lindenb
Copy link
Owner

lindenb commented Mar 18, 2020

with GNU tools:


gunzip -c  input.vcf.gz | grep '^#' > out.vcf

gunzip -c  input.vcf.gz | awk -F '\t' 'BEGIN{srand();} /^[^#]/ {printf("%d\t%s\n",int(rand()*100000),$0);}' | sort -T . -t $'\t' -k1,1n | head -n 1000 | cut -f 2- | sort -T . -t $'\t' -k1,1 -k2,2n -k4,4 >> out.vcf

@bambrozio
Copy link
Author

bambrozio commented Mar 19, 2020

I've given a try in a smaller VCF (405M), and I got two significant different outputs.
Using the AWK, I got an output VCF of 9.7M, while using the Jar, I got an output of 97M... Any idea why of the difference?

Here's the commands I utilised:

Using the AWK:

$ gunzip -c ~/Documents/1kGp3/1kgP3.bgz/ALL.chr10.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.subset.vcf.bgz | grep '^#' > chr10.vcf
$ gunzip -c ~/Documents/1kGp3/1kgP3.bgz/ALL.chr10.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.s
ubset.vcf.bgz | awk -F '\t' 'BEGIN{srand();} /^[^#]/ {printf("%d\t%s\n",int(rand()*100000),$0);}' | sort -T . -t $'\t' -k1,1n | head -n 1000 | cut -f 2- | sort -T . -t 
$'\t' -k1,1 -k2,2n -k4,4 >> chr10.vcf

Using the JAR:

gunzip -c ~/Documents/1kGp3/1kgP3.bgz/ALL.chr10.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.subset.vcf.bgz | java -jar downsamplevcf.jar -N 1 -n 10000 > chr10UsingJar.vcf

input:

$ ls -lah ~/Documents/1kGp3/1kgP3.bgz/ALL.chr10.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.subset.vcf.bgz 
-rw-r--r--  1 bambrozi  staff   405M 18 Mar 22:47 /Users/bambrozi/Documents/1kGp3/1kgP3.bgz/ALL.chr10.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.subset.vcf.bgz

Outputs:

$ ls -lath chr10*
-rw-r--r--  1 bambrozi  staff    97M 19 Mar 19:26 chr10UsingJar.vcf
-rw-r--r--  1 bambrozi  staff   9.7M 19 Mar 19:15 chr10.vcf

@bambrozio
Copy link
Author

Do you think if I use the JAR over each chromosome (instead of the VCF that consolidates all them) I will get a reliable output?
If so, I can concat the output VCFs later (after the jar sampling)

@lindenb
Copy link
Owner

lindenb commented Mar 19, 2020

here you're taking 1000 variants. n | head -n 1000 | cut -f 2- | sort -T . -t

and here 10000 : -jar downsamplevcf.jar -N 1 -n 10000

Do you think if I use the JAR over each chromosome (instead of the VCF that consolidates all them) I will get a reliable output?

no, again, you shoud use the awk script or my tool vcfshuffle + vcfhead. It could be something like.

gunzip -c in.vcf.bgz | java -jar vcfshuflle.jar | java -jar vcfhead.jar -n 10000 > out.vcf

@bambrozio
Copy link
Author

bambrozio commented Mar 20, 2020

Hi, thanks for the help!
Still not working though:

With the AWK, I got an error in the sort command sort: No such file or directory:

$ echo START: `date` \
> && gunzip -c  ../1kgP3.bgz/ALL.chr10.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.bgz | grep '^#' > ALL.phase3.biallelic-only-10kSNP.vcf \
> && gunzip -c  ../1kgP3.bgz/ALL.chr10.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.bgz | awk -F '\t' 'BEGIN{srand();} /^[^#]/ {printf("%d\t%s\n",int(rand()*100000),$0);}' | sort -T . -t $'\t' -k1,1n | head -n 10000 | cut -f 2- | sort -T . -t $'\t' -k1,1 -k2,2n -k4,4 >> ALL.phase3.biallelic-only-10kSNP.vcf \
> echo START: `date`
START: Fri 20 Mar 2020 09:21:09 GMT
sort: No such file or directory

With the vcfshuffler.jar, I got:

$ gunzip -c ../1kgP3.bgz/ALL.chr10.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.bgz | java -jar vcfshuflle.jar | java -jar vcfhead.jar -n 100000 > ALL.phase3.biallelic-only-100kSNP.vcf 
Error: Unable to access jarfile vcfshuflle.jar
[SEVERE][VcfHead]Your input file has a malformed header: We never saw the required CHROM header line (starting with one #) for the input VCF file
htsjdk.tribble.TribbleException$InvalidHeader: Your input file has a malformed header: We never saw the required CHROM header line (starting with one #) for the input V
CF file
        at htsjdk.variant.vcf.VCFCodec.readActualHeader(VCFCodec.java:119)
        at htsjdk.variant.vcf.VCFIteratorBuilder$VCFReaderIterator.<init>(VCFIteratorBuilder.java:177)
        at htsjdk.variant.vcf.VCFIteratorBuilder.open(VCFIteratorBuilder.java:97)
        at com.github.lindenb.jvarkit.util.vcf.VCFUtils.createVCFIteratorFromInputStream(VCFUtils.java:288)
        at com.github.lindenb.jvarkit.util.vcf.VCFUtils.createVCFIteratorStdin(VCFUtils.java:335)
        at com.github.lindenb.jvarkit.util.vcf.VCFUtils.createVCFIterator(VCFUtils.java:312)
        at com.github.lindenb.jvarkit.util.jcommander.Launcher.openVCFIterator(Launcher.java:515)
        at com.github.lindenb.jvarkit.tools.misc.VcfHead.doWork(VcfHead.java:110)
        at com.github.lindenb.jvarkit.util.jcommander.Launcher.instanceMain(Launcher.java:777)
        at com.github.lindenb.jvarkit.util.jcommander.Launcher.instanceMainWithExit(Launcher.java:940)
        at com.github.lindenb.jvarkit.tools.misc.VcfHead.main(VcfHead.java:156)
[INFO][Launcher]vcfhead Exited with failure (-1)

I've tried with: bgz, gz and plain vcf.

I can read the VCF's normally using, for example, hail, pyvcf, plink... Thus, I'm assuming the VCF is healthy.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants