confirm numbers for one of the Illumina BaseSpace HiSeq V4 reference datasets #5

avilella · 2015-12-03T10:28:35Z

I downloaded one of the VCF files for NA12877 from a Public Dataset in BaseSpace (see screenshot).
https://basespace.illumina.com/analyses/18540672?preview=False&projectId=16095081
I ran it through hap.py and got the values below:

i=NA12877-rep1_S1.genome.vcf.gz && sudo docker run -it -v $PWD:/data/input -v $GROUP/PlatinumGenomes/:/data/PlatinumGenomes pkrusche/hap.py /opt/hap.py/bin/hap.py /data/PlatinumGenomes/hg19/8.0.1/NA12877/NA12877.vcf.gz /data/input/$i -o /data/test
HAPPY
Benchmarking Summary:
                 TRUTH.TOTAL  QUERY.TOTAL  METRIC.Recall  METRIC.Precision  METRIC.Frac_NA
Locations.INDEL       591048       622986       0.833987          0.781263               0
Locations.SNP        3592710      3226918       0.889341          0.989879               0

Are these numbers correct? I was expecting SNP Recall to be on the order of 0.98 and SNP Precision to be around 0.998.

avilella · 2015-12-03T10:28:45Z

pkrusche · 2015-12-03T10:51:58Z

Normally, PG should come with confident call regions in a BED file; these should be passed to hap.py using the -f switch. This will most likely not increase the recall, though. I'm not sure why it would be this low (I don't have access to this project and don't know what workflow was run).
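To make the -f suggestion concrete, here is a minimal sketch that only assembles the hap.py argument list and prints it, without running anything. The helper name `build_happy_args` and the location of `ConfidentRegions.bed.gz` inside the container are assumptions made for illustration; the VCF paths and the `-o` prefix are taken from the command in the original post.

```python
from typing import List, Optional


def build_happy_args(truth_vcf: str, query_vcf: str, out_prefix: str,
                     confident_bed: Optional[str] = None) -> List[str]:
    """Assemble a hap.py argument list. Nothing is executed here."""
    args = ["/opt/hap.py/bin/hap.py", truth_vcf, query_vcf, "-o", out_prefix]
    if confident_bed is not None:
        # -f passes the confident call regions (BED) to hap.py
        args += ["-f", confident_bed]
    return args


args = build_happy_args(
    "/data/PlatinumGenomes/hg19/8.0.1/NA12877/NA12877.vcf.gz",
    "/data/input/NA12877-rep1_S1.genome.vcf.gz",
    "/data/test",
    # Hypothetical path: check where the BED file actually sits in your PG download.
    confident_bed="/data/PlatinumGenomes/hg19/8.0.1/NA12877/ConfidentRegions.bed.gz",
)
print(" ".join(args))
```

The printed line can be appended to the same `docker run ... pkrusche/hap.py` prefix used in the original command.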

An easy way to see if the hap.py numbers are ok would probably be to run the VCF through VCAT on BaseSpace:

https://basespace.illumina.com/apps/1800799/Variant-Calling-Assessment-Tool?preferredversion

The most recent version of this uses hap.py 0.2.x -- the numbers might differ slightly, but shouldn't be too different.

avilella · 2015-12-03T14:26:34Z

I see, thanks for the info. There seems to be a discrepancy between versions, presumably due to a combination of differences in the hap.py code and the version of the PG vcfs.

I took a vcf from a Public dataset in BaseSpace that has been run against VCAT, and got the numbers shown below.

https://basespace.illumina.com/analyses/23453440?preview=False&projectId=20407389

The VCAT report on BaseSpace shows higher numbers than the ones below, and was run against PG v7.0. I believe I am running against PG v8.0.1.

Compared with the VCAT report (PG v7.0), the SNV Recall is fractionally lower for the numbers below, but for Indels it drops from 0.8173 to 0.7805.

i=NA12878-NeoPrep-R3-75ng-Merged_S1.genome.vcf.gz && sudo docker run -it -v $PWD:/data/input -v $GROUP/PlatinumGenomes/:/data/PlatinumGenomes pkrusche/hap.py /opt/hap.py/bin/hap.py /data/PlatinumGenomes/hg19/8.0.1/NA12878/NA12878.vcf.gz /data/input/$i -o /data/test
Benchmarking Summary:
                 TRUTH.TOTAL  QUERY.TOTAL  METRIC.Recall  METRIC.Precision  METRIC.Frac_NA
Locations.INDEL       587126       579470       0.787844          0.780513               0
Locations.SNP        3565408      3134373       0.871888          0.991324               0
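As a quick sanity check on a summary like the one above, the table can be parsed and the approximate number of matched calls recovered from each side. This is only a sketch: it assumes Recall is matched truth calls over TRUTH.TOTAL, and Precision is matched query calls over the query calls that were not counted as NA. The function name `parse_summary` is made up for this example.

```python
SUMMARY = """\
                 TRUTH.TOTAL  QUERY.TOTAL  METRIC.Recall  METRIC.Precision  METRIC.Frac_NA
Locations.INDEL       587126       579470       0.787844          0.780513               0
Locations.SNP        3565408      3134373       0.871888          0.991324               0
"""


def parse_summary(text):
    """Turn the whitespace-aligned summary into {row name: {column: value}}."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    rows = {}
    for ln in lines[1:]:
        name, *values = ln.split()
        rows[name] = dict(zip(header, (float(v) for v in values)))
    return rows


rows = parse_summary(SUMMARY)
implied = {}
for name, row in rows.items():
    # Assumed relation: Recall = matched truth calls / TRUTH.TOTAL
    truth_matched = row["METRIC.Recall"] * row["TRUTH.TOTAL"]
    # Assumed relation: Precision = matched query calls / (QUERY.TOTAL minus NA calls)
    query_matched = (row["METRIC.Precision"] * row["QUERY.TOTAL"]
                     * (1 - row["METRIC.Frac_NA"]))
    implied[name] = (truth_matched, query_matched)
    print(f"{name}: ~{truth_matched:.0f} truth calls matched, "
          f"~{query_matched:.0f} query calls matched")
```

If the two implied counts for a row are far apart, that is a hint that the columns mean something different in the hap.py version being used, or that the table was copied incorrectly.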

As long as these are versioning differences, I will carry on as is, as a happy hap.py user...

pkrusche · 2015-12-03T14:43:04Z

The numbers will be different between PG7 and 8.0.1. Also, VCAT uses -f ConfidentRegions.bed.gz, which affects precision and Frac_NA.
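A small worked example of why -f moves precision and Frac_NA: as far as I understand the hap.py metrics, query calls outside the confident regions are counted as unknown (NA) rather than as false positives, precision is TP / (TP + FP), and Frac_NA is the unknown share of all query calls. The counts below are invented purely for illustration and are not taken from either dataset.

```python
def precision_and_frac_na(tp, fp, unk=0):
    """Precision over assessed calls, plus the share of unassessed (NA) query calls."""
    precision = tp / (tp + fp)
    frac_na = unk / (tp + fp + unk)
    return precision, frac_na


# Hypothetical counts, without -f: every unmatched query call is a false positive.
without_f = precision_and_frac_na(tp=900, fp=100)

# Same hypothetical calls with -f: 40 of the unmatched calls fall outside the
# confident regions, so they are counted as unknown instead of false positive.
with_f = precision_and_frac_na(tp=900, fp=60, unk=40)

print("without -f:", without_f)
print("with -f:   ", with_f)
```

Recall is computed on the truth side, so in this simplified picture it is the precision and Frac_NA columns that shift when the BED file is supplied.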

pkrusche · 2015-12-04T09:10:33Z

The numbers seem close enough to me considering the differences in PG version and command line. Closing.

HAP-157 and HAP-162

pkrusche closed this as completed Dec 4, 2015

pkrusche pushed a commit that referenced this issue Feb 16, 2016

Merge pull request #5 from Bioinformatics/HAP-157

8c09e83

HAP-157 and HAP-162

m891115891117 mentioned this issue Jan 31, 2018

pre.py header contig INFO issue #36

Closed
