confirm numbers for one of the Illumina BaseSpace Hiseq V4 reference datasets #5

Closed
avilella opened this issue Dec 3, 2015 · 5 comments

Comments

@avilella

avilella commented Dec 3, 2015

I downloaded one of the VCF files for NA12877 from a public dataset in BaseSpace (see screenshot).
https://basespace.illumina.com/analyses/18540672?preview=False&projectId=16095081
I ran it through hap.py and got the values below:

i=NA12877-rep1_S1.genome.vcf.gz && sudo docker run -it -v $PWD:/data/input -v $GROUP/PlatinumGenomes/:/data/PlatinumGenomes pkrusche/hap.py /opt/hap.py/bin/hap.py /data/PlatinumGenomes/hg19/8.0.1/NA12877/NA12877.vcf.gz /data/input/$i -o /data/test
HAPPY
Benchmarking Summary:
                 TRUTH.TOTAL  QUERY.TOTAL  METRIC.Recall  METRIC.Precision  METRIC.Frac_NA
Locations.INDEL       591048       622986       0.833987          0.781263               0
Locations.SNP        3592710      3226918       0.889341          0.989879               0

Are these numbers correct? I was expecting SNP recall to be on the order of 0.98 and SNP precision to be around 0.998.
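To make the gap concrete, here is a minimal sketch that parses the summary table pasted above and measures how far the SNP metrics fall from the expected values. The thresholds are only the expectations quoted in this comment, not official targets, and `parse_summary` is a hypothetical helper, not part of hap.py.

```python
# Sketch: parse a hap.py "Benchmarking Summary" table as pasted above and
# compare SNP recall/precision against the values the poster expected.

SUMMARY = """\
                 TRUTH.TOTAL  QUERY.TOTAL  METRIC.Recall  METRIC.Precision  METRIC.Frac_NA
Locations.INDEL       591048       622986       0.833987          0.781263               0
Locations.SNP        3592710      3226918       0.889341          0.989879               0
"""


def parse_summary(text):
    """Return {row_label: {column: float}} from a whitespace-aligned table."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    columns = lines[0].split()  # the header row has no label column
    rows = {}
    for line in lines[1:]:
        label, *values = line.split()
        rows[label] = dict(zip(columns, map(float, values)))
    return rows


# Expectations quoted in the comment (SNP only); not official targets.
EXPECTED_SNP = {"METRIC.Recall": 0.98, "METRIC.Precision": 0.998}

table = parse_summary(SUMMARY)
shortfall = {
    metric: round(expected - table["Locations.SNP"][metric], 6)
    for metric, expected in EXPECTED_SNP.items()
}
print(shortfall)
```

On these numbers, SNP recall falls short of the expectation by about 0.09 and SNP precision by about 0.008.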

@avilella
Author

avilella commented Dec 3, 2015

screen shot 2015-12-03 at 10 25 46

@pkrusche
Contributor

pkrusche commented Dec 3, 2015

Normally, PG (Platinum Genomes) should come with confident call regions in a BED file; these should be passed to hap.py using the -f switch. This will most likely not increase the recall, though. I'm not sure why it would be low (I don't have access to this project and don't know which workflow was run).
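As a rough illustration of the -f suggestion, the sketch below assembles the hap.py argument list from the original command with a confident-regions file added. The BED path is an assumption: the file name follows the ConfidentRegions.bed.gz mentioned elsewhere in this thread, but its location inside the Platinum Genomes directory is hypothetical. `happy_command` is an illustrative helper, not part of hap.py.

```python
import shlex


def happy_command(truth_vcf, query_vcf, out_prefix, confident_bed=None):
    """Assemble the hap.py argument list; -f is added only when a BED is given."""
    cmd = ["/opt/hap.py/bin/hap.py", truth_vcf, query_vcf, "-o", out_prefix]
    if confident_bed is not None:
        cmd += ["-f", confident_bed]
    return cmd


# ASSUMPTION: the BED path below is hypothetical; point it at wherever the
# confident regions file actually sits in your Platinum Genomes download.
cmd = happy_command(
    "/data/PlatinumGenomes/hg19/8.0.1/NA12877/NA12877.vcf.gz",
    "/data/input/NA12877-rep1_S1.genome.vcf.gz",
    "/data/test",
    confident_bed="/data/PlatinumGenomes/hg19/8.0.1/NA12877/ConfidentRegions.bed.gz",
)

# Print a shell-safe version that can be pasted after the docker run prefix.
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The printed line would replace the `/opt/hap.py/bin/hap.py ...` portion of the docker command in the first comment.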

An easy way to see if the hap.py numbers are ok would probably be to run the VCF through VCAT on BaseSpace:

https://basespace.illumina.com/apps/1800799/Variant-Calling-Assessment-Tool?preferredversion

The most recent version of this uses hap.py 0.2.x -- the numbers might differ slightly, but shouldn't be too different.

@avilella
Author

avilella commented Dec 3, 2015

I see, thanks for the info. There seems to be a discrepancy between versions, presumably due to a combination of differences in the hap.py code and the version of the PG vcfs.

I took a vcf from a Public dataset in BaseSpace that has been run against VCAT, and got the numbers shown below.

https://basespace.illumina.com/analyses/23453440?preview=False&projectId=20407389
screen shot 2015-12-03 at 14 27 04

The VCAT report on BaseSpace shows higher numbers than the ones below, and was run against PG v7.0. I believe I am running it against PG v8.0.1.

The SNV recall in the numbers below is fractionally lower than in the VCAT run against PG v7.0, but for indels it drops from 0.8173 to 0.7805.

i=NA12878-NeoPrep-R3-75ng-Merged_S1.genome.vcf.gz && sudo docker run -it -v $PWD:/data/input -v $GROUP/PlatinumGenomes/:/data/PlatinumGenomes pkrusche/hap.py /opt/hap.py/bin/hap.py /data/PlatinumGenomes/hg19/8.0.1/NA12878/NA12878.vcf.gz /data/input/$i -o /data/test
Benchmarking Summary:
                 TRUTH.TOTAL  QUERY.TOTAL  METRIC.Recall  METRIC.Precision  METRIC.Frac_NA
Locations.INDEL       587126       579470       0.787844          0.780513               0
Locations.SNP        3565408      3134373       0.871888          0.991324               0

As long as these are versioning differences, I will carry on as is, as a happy hap.py user...

@pkrusche
Contributor

pkrusche commented Dec 3, 2015

The numbers will be different between PG7 and 8.0.1. Also, VCAT uses -f ConfidentRegions.bed.gz, which affects precision and Frac_NA.
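A toy sketch of why the -f switch moves precision and Frac_NA, under stated assumptions. It assumes precision = TP / (TP + FP) and Frac_NA = UNK / QUERY.TOTAL, and that unmatched query calls outside the confident regions are counted as unknown (UNK) rather than as false positives. That is my reading of hap.py's metric definitions, so check the documentation for your version. The counts are made up and do not come from the runs in this thread.

```python
def query_metrics(tp, fp, unk):
    """Precision and Frac_NA from query-side counts.

    ASSUMPTION: precision = TP / (TP + FP) and Frac_NA = UNK / (TP + FP + UNK),
    based on a reading of hap.py's metric definitions; verify for your version.
    """
    total = tp + fp + unk
    return {"precision": tp / (tp + fp), "frac_na": unk / total}


# Hypothetical counts, not taken from the runs in this thread.
tp, unmatched_inside, unmatched_outside = 900, 20, 80

# Without -f: every unmatched query call counts as a false positive.
without_f = query_metrics(tp, fp=unmatched_inside + unmatched_outside, unk=0)

# With -f: unmatched calls outside the confident regions become "unknown".
with_f = query_metrics(tp, fp=unmatched_inside, unk=unmatched_outside)

print(without_f)
print(with_f)
```

In this toy case precision rises and Frac_NA becomes non-zero once the confident regions are applied, while the truth-side counts that drive recall are left out of the sketch entirely.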

@pkrusche
Contributor

pkrusche commented Dec 4, 2015

The numbers seem close enough to me considering the differences in PG version and command line. Closing.

@pkrusche pkrusche closed this as completed Dec 4, 2015
pkrusche pushed a commit that referenced this issue Feb 16, 2016