benchmarks and speedups #56

Open
darked89 opened this issue Jul 2, 2021 · 8 comments
Labels: enhancement (New feature or request)

darked89 commented Jul 2, 2021

Hello,

I have completed a TSV-to-VCF transformation of one FinnGen GWAS summary file as a test case.
gwas2vcf was run in a Singularity container with:

  • finngen_R5_AB1_AMOEBIASIS.tsv (16380388 lines)
  • dbSNP 155
  • GRCh38 fasta
  • Singularity 3.7.0
  • Intel(R) Xeon(R) Gold 6146 CPU @ 3.20GHz

Both the genomic FASTA and the dbSNP VCF had chromosome IDs in the same 1-22,X,Y,MT format and were indexed.

The output VCF format:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  amoeb01
1       108391  rs1274919517    A       G       .       PASS    .       ES:SE:LP:ID     0.3653:1.9411:0.0702236:rs1274919517

This was executed in 582.99 min (user time 577.51 min, sys time 5.38 min).

My questions:

  1. Is this in line with the running times you observed, or is 10 hrs/file a fluke (i.e. too slow)?
  2. Would it be possible to speed up the processing?

Thank you

Darek Kedra

mcgml commented Jul 2, 2021

Hi @darked89

Thanks for the feedback. 10 hours does seem slow; it typically takes 2-4 hours for densely imputed data (~10M variants). How many variants are you mapping? The process is very I/O intensive, so fast storage will vastly improve performance.

Thanks
Matt

darked89 commented Jul 2, 2021

Dear Matt,

The TSV input has 16M+ rows/positions. Compressing it with bgzip and indexing with tabix unfortunately did not improve the time needed to process it:

time ./run_amoeb_finn_whole_genome_dbsnp155_MT_gz-input.sh > error.amoeb.02.log 2>&1
real    588m40.557s

Using py-spy (substitute the real PID of the python3 process running gwas2vcf):

py-spy top --pid 123456

I found that the program spends >95% of its time executing update_dbsnp (gwas.py:94). No idea whether, e.g., a newer pysam (not that it is an easy change to make) would improve the speed.

Best,

DK

mcgml commented Jul 5, 2021

Thanks @darked89. I will look at implementing cyvcf2 for reading/writing VCF files, which is around 7x faster in the published example.

In the meantime, you could speed up the conversion by splitting your GWAS into chunks and running each chunk concurrently.
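
Something along these lines should work (a rough sketch only; the main.py entry point, the file paths and the per-chromosome split are illustrative, and the gwas2vcf arguments are adapted from the commands later in this thread):

# Rough sketch: split the GWAS TSV by chromosome and run gwas2vcf on each
# chunk concurrently. Adjust paths, the chromosome column index and the
# worker count to your setup; wrap the command in your singularity exec
# call if you run the tool from a container.
import subprocess
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

GWAS_TSV = Path("finngen_R5_AB1_AMOEBIASIS.tsv")
CHUNK_DIR = Path("chunks")
CHUNK_DIR.mkdir(exist_ok=True)

def split_by_chrom(tsv_path):
    """Write one TSV per chromosome, assuming the chromosome is column 0."""
    handles = {}
    with open(tsv_path) as fin:
        header = fin.readline()
        for line in fin:
            chrom = line.split("\t", 1)[0]
            if chrom not in handles:
                handles[chrom] = open(CHUNK_DIR / f"{chrom}.tsv", "w")
                handles[chrom].write(header)
            handles[chrom].write(line)
    for fh in handles.values():
        fh.close()
    return sorted(handles)

def run_chunk(chrom):
    """Run gwas2vcf on a single per-chromosome chunk."""
    out = CHUNK_DIR / f"{chrom}.vcf"
    subprocess.run(
        ["python", "main.py",
         "--data", str(CHUNK_DIR / f"{chrom}.tsv"),
         "--json", "finngen.json",
         "--id", "amoeb01",
         "--ref", "hs38p13.ens_pa.fa",
         "--dbsnp", "dbSNP_155.GRCh38.names_fixed.vcf.gz",
         "--out", str(out)],
        check=True,
    )
    return out

if __name__ == "__main__":
    chroms = split_by_chrom(GWAS_TSV)
    with ProcessPoolExecutor(max_workers=8) as pool:  # tune to your CPU/storage
        for vcf in pool.map(run_chunk, chroms):
            print("done:", vcf)

The per-chromosome VCFs would then need to be concatenated (and the duplicate headers reconciled) into a single file afterwards.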

mcgml self-assigned this Jul 5, 2021
mcgml added the enhancement label Jul 5, 2021

darked89 commented Jul 5, 2021

Dear Matt,

For a large set of summary stats, e.g. from FinnGen, a quick hack to try is to reduce the size of the dbSNP VCF by creating a customized mini-dbSNP VCF that contains only the dbSNP entries specific to the given biobank. I have to recheck that the results obtained this way are identical to the ones obtained using the whole dbSNP.
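
Roughly, the idea is something like this (a sketch, not the exact code I used; file names and the TSV column layout (chromosome in column 0, position in column 1) are illustrative):

# Keep only dbSNP records at positions that actually occur in the biobank
# summary stats, writing them to a much smaller "mini-dbSNP" VCF.
import pysam

DBSNP = "dbSNP_155.GRCh38.names_fixed.vcf.gz"   # bgzipped + tabix-indexed
GWAS_TSV = "finngen_R5_AB1_AMOEBIASIS.tsv"
OUT = "dbsnp155_finngen_only_all_positions.vcf"

positions = set()
with open(GWAS_TSV) as fin:
    next(fin)  # skip the header line
    for line in fin:
        chrom, pos = line.split("\t", 2)[:2]
        positions.add((chrom, int(pos)))

dbsnp = pysam.VariantFile(DBSNP)
out = pysam.VariantFile(OUT, "w", header=dbsnp.header)

# fetch() uses the .tbi index, so only the queried blocks are read.
# (An indel overlapping adjacent positions may be written twice; dedupe if needed.)
for chrom, pos in sorted(positions):
    for rec in dbsnp.fetch(chrom, pos - 1, pos):
        out.write(rec)
out.close()
# bgzip + tabix the output afterwards so gwas2vcf can use it as --dbsnp.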

Can you think of any reason they might differ?

Best,

DK

mcgml commented Jul 5, 2021

Hi @darked89

It should give the same results. I doubt the performance would improve, though, since the SNP lookup uses tabix rather than reading the whole dbSNP VCF.
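
For reference, the lookup is an indexed region query, roughly along these lines (an illustrative pysam sketch, not the exact gwas2vcf code):

# Illustrative per-position lookup against the tabix-indexed dbSNP VCF.
# Each query only touches the compressed blocks covering that region.
import pysam

dbsnp = pysam.VariantFile("dbSNP_155.GRCh38.names_fixed.vcf.gz")

def rsid_for(chrom, pos, ref, alt):
    """Return the dbSNP ID matching chrom/pos/ref/alt, if any (pos is 1-based)."""
    for rec in dbsnp.fetch(chrom, pos - 1, pos):
        if rec.ref == ref and alt in (rec.alts or ()):
            return rec.id
    return None

print(rsid_for("1", 108391, "A", "G"))  # the example variant from the first post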

Do you observe a performance improvement?

Thanks
Matt

darked89 commented Jul 5, 2021

Hi Matt,

I used just a small subset of the original FinnGen data (chromosome 22).

# VCF sizes
 1.4G   dbsnp155_finngen_only_all_positions.vcf.gz
  25G   dbSNP_155.GRCh38.names_fixed.vcf.gz

# time
'dbsnp': '/genome/dbsnp155_finngen_only_all_positions.vcf.gz': 109.69 secs
'dbsnp': '/genome/dbSNP_155.GRCh38.names_fixed.vcf.gz': 517.38 secs

Maybe there are some delays in accessing the drive in our setup, but it looks like the size of the VCF does affect how fast one can query the indexed file, at least in some environments.
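
If anyone wants to reproduce this comparison, a minimal timing sketch (the paths are the ones from my commands above; the TSV column layout, chromosome in column 0 and position in column 1, is an assumption):

# Time the same set of indexed per-position lookups against the mini-dbSNP
# and the full dbSNP VCF.
import time
import pysam

def load_positions(tsv_path, limit=50_000):
    """Read up to `limit` (chrom, pos) pairs from the GWAS TSV."""
    positions = []
    with open(tsv_path) as fin:
        next(fin)  # skip the header line
        for line in fin:
            chrom, pos = line.split("\t", 2)[:2]
            positions.append((chrom, int(pos)))
            if len(positions) >= limit:
                break
    return positions

def time_lookups(vcf_path, positions):
    vcf = pysam.VariantFile(vcf_path)
    start = time.perf_counter()
    hits = 0
    for chrom, pos in positions:
        hits += sum(1 for _ in vcf.fetch(chrom, pos - 1, pos))
    return time.perf_counter() - start, hits

positions = load_positions("/data/22_chrom_finngen_R5_D3_SARCOIDOSIS.tsv")
for path in ("/genome/dbsnp155_finngen_only_all_positions.vcf.gz",
             "/genome/dbSNP_155.GRCh38.names_fixed.vcf.gz"):
    elapsed, hits = time_lookups(path, positions)
    print(f"{path}: {elapsed:.2f} s for {len(positions)} queries ({hits} records)")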

Hope it helps,

DK

mcgml commented Jul 5, 2021

Thanks @darked89! That's a huge difference in performance. I will investigate.

If you would like to try it, I created a new branch, cyvcf2, which uses cyvcf2 for the dbSNP lookups. The package is written in Cython, so it should offer improved query speeds.
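
Roughly, the same per-position lookup with cyvcf2 looks like this (an illustrative sketch; the branch may structure it differently):

# cyvcf2 wraps htslib, so region queries also go through the tabix index.
from cyvcf2 import VCF

dbsnp = VCF("dbSNP_155.GRCh38.names_fixed.vcf.gz")

def rsid_for(chrom, pos, ref, alt):
    """Return the dbSNP ID matching chrom/pos/ref/alt, if any (pos is 1-based)."""
    for rec in dbsnp(f"{chrom}:{pos}-{pos}"):
        if rec.REF == ref and alt in rec.ALT:
            return rec.ID
    return None

print(rsid_for("1", 108391, "A", "G"))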

Thanks
Matt

darked89 commented Jul 5, 2021

Hello,

I was unsure whether some silly snafu was somehow giving me a corrupted or incomplete result VCF really, really fast, so I computed md5sums on the non-header portions of the outputs (whole dbSNP vs FinnGen subset of dbSNP):

pigz --stdout --decompress 20210705_chr22_dbsnp-subset_sarco.vcf.gz | rg -v '^#' > 20210705_chr22_dbsnp-subset_sarco.vcf.no_header
pigz --stdout --decompress 20210705_chr22_dbsnp-whole_sarco.vcf.gz | rg -v '^#' > 20210705_chr22_dbsnp-whole_sarco.vcf.no_header

md5sum 20210705_chr22_dbsnp-*no_header
94dc99a7ec9265ec8df907ad30f9c39b  20210705_chr22_dbsnp-subset_sarco.vcf.no_header
94dc99a7ec9265ec8df907ad30f9c39b  20210705_chr22_dbsnp-whole_sarco.vcf.no_header

and the headers have different commands:

##Gwas2VCF_command=--data /data/22_chrom_finngen_R5_D3_SARCOIDOSIS.tsv --json /data/finngen.json --id sarco01 --ref /genome/hs38p13.ens_pa.fa --dbsnp /genome/dbsnp155_finngen_only_all_positions.vcf.gz --out /data/20210705_chr22_dbsnp-subset_sarco.vcf --cohort_controls 215712 --cohort_cases 2046; 1.3.1
##file_date=2021-07-05T11:46:55.360262

##Gwas2VCF_command=--data /data/22_chrom_finngen_R5_D3_SARCOIDOSIS.tsv --json /data/finngen.json --id sarco01 --ref /genome/hs38p13.ens_pa.fa --dbsnp /genome/dbSNP_155.GRCh38.names_fixed.vcf.gz --out /data/20210705_chr22_dbsnp-whole_sarco.vcf --cohort_controls 215712 --cohort_cases 2046; 1.3.1
##file_date=2021-07-05T11:58:52.434064

So it does not look like some late-night error produced results that are too good to be true. Or so I hope... ;)

DK
