benchmarks and speedups #56

Open
darked89 opened this issue Jul 2, 2021 · 8 comments
Labels: enhancement (New feature or request)

darked89 commented Jul 2, 2021

Hello,

I have completed a TSV-to-VCF transformation of one FinnGen GWAS summary file as a test case.
gwas2vcf was run in a Singularity container with:

  • finngen_R5_AB1_AMOEBIASIS.tsv (16380388 lines)
  • dbSNP 155
  • GRCh38 fasta
  • Singularity 3.7.0
  • Intel(R) Xeon(R) Gold 6146 CPU @ 3.20GHz

Both the genomic FASTA and the dbSNP VCF had chromosome IDs in the same 1-22,X,Y,MT format and were indexed.

The output VCF format:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  amoeb01
1       108391  rs1274919517    A       G       .       PASS    .       ES:SE:LP:ID     0.3653:1.9411:0.0702236:rs1274919517

This was executed in 582.99 min (user time 577.51 min, sys time 5.38 min).

My questions:

  1. Is this in line with the running times you observed, or is 10 hrs/file a fluke (i.e. too slow)?
  2. Would it be possible to speed up the processing?

Thank you

Darek Kedra

mcgml commented Jul 2, 2021

Hi @darked89

Thanks for the feedback. 10 hours does seem slow; it typically takes 2-4 hours for densely imputed data (~10M variants). How many variants are you mapping? The process is very I/O intensive, so fast storage will vastly improve performance.

Thanks
Matt

darked89 commented Jul 2, 2021

Dear Matt,

The TSV input has 16M+ rows/positions. Compressing it with bgzip and indexing with tabix unfortunately did not improve the time needed to process it:

time ./run_amoeb_finn_whole_genome_dbsnp155_MT_gz-input.sh > error.amoeb.02.log 2>&1
real    588m40.557s

Using py-spy (substitute the real PID of the python3 process running gwas2vcf):

py-spy top --pid 123456

I found that the program spends >95% of its time executing update_dbsnp (gwas.py:94). No idea whether, e.g., a newer pysam (not that it is an easy change to make) would improve the speed.

Best,

DK

mcgml commented Jul 5, 2021

Thanks @darked89. I will look at implementing cyvcf2 for reading/writing VCF files, which is around 7x faster in the published example.

In the meantime, you could speed up the conversion by splitting your GWAS into chunks and running each chunk concurrently.
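
Something along these lines should work (a rough sketch only; the main.py entry point, the file paths and the per-chromosome split are illustrative, and the gwas2vcf arguments are adapted from the commands later in this thread):

# Rough sketch: split the GWAS TSV by chromosome and run gwas2vcf on each
# chunk concurrently. Adjust paths, the chromosome column index and the
# worker count to your setup; wrap the command in your singularity exec
# call if you run the tool from a container.
import subprocess
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

GWAS_TSV = Path("finngen_R5_AB1_AMOEBIASIS.tsv")
CHUNK_DIR = Path("chunks")
CHUNK_DIR.mkdir(exist_ok=True)

def split_by_chrom(tsv_path):
    """Write one TSV per chromosome, assuming the chromosome is column 0."""
    handles = {}
    with open(tsv_path) as fin:
        header = fin.readline()
        for line in fin:
            chrom = line.split("\t", 1)[0]
            if chrom not in handles:
                handles[chrom] = open(CHUNK_DIR / f"{chrom}.tsv", "w")
                handles[chrom].write(header)
            handles[chrom].write(line)
    for fh in handles.values():
        fh.close()
    return sorted(handles)

def run_chunk(chrom):
    """Run gwas2vcf on a single per-chromosome chunk."""
    out = CHUNK_DIR / f"{chrom}.vcf"
    subprocess.run(
        ["python", "main.py",
         "--data", str(CHUNK_DIR / f"{chrom}.tsv"),
         "--json", "finngen.json",
         "--id", "amoeb01",
         "--ref", "hs38p13.ens_pa.fa",
         "--dbsnp", "dbSNP_155.GRCh38.names_fixed.vcf.gz",
         "--out", str(out)],
        check=True,
    )
    return out

if __name__ == "__main__":
    chroms = split_by_chrom(GWAS_TSV)
    with ProcessPoolExecutor(max_workers=8) as pool:  # tune to your CPU/storage
        for vcf in pool.map(run_chunk, chroms):
            print("done:", vcf)

The per-chromosome VCFs would then need to be concatenated (and the duplicate headers reconciled) into a single file afterwards.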

mcgml self-assigned this Jul 5, 2021
mcgml added the enhancement label Jul 5, 2021

darked89 commented Jul 5, 2021

Dear Matt,

For a large set of summary stats, e.g. from FinnGen, a quick hack to try is to reduce the size of the dbSNP VCF by creating a customized mini-dbSNP VCF that contains only the dbSNP entries specific to the given biobank. I have to recheck that the results obtained this way are identical to the ones obtained using the whole dbSNP.
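
Roughly, the idea is something like this (a sketch, not the exact code I used; file names and the TSV column layout (chromosome in column 0, position in column 1) are illustrative):

# Keep only dbSNP records at positions that actually occur in the biobank
# summary stats, writing them to a much smaller "mini-dbSNP" VCF.
import pysam

DBSNP = "dbSNP_155.GRCh38.names_fixed.vcf.gz"   # bgzipped + tabix-indexed
GWAS_TSV = "finngen_R5_AB1_AMOEBIASIS.tsv"
OUT = "dbsnp155_finngen_only_all_positions.vcf"

positions = set()
with open(GWAS_TSV) as fin:
    next(fin)  # skip the header line
    for line in fin:
        chrom, pos = line.split("\t", 2)[:2]
        positions.add((chrom, int(pos)))

dbsnp = pysam.VariantFile(DBSNP)
out = pysam.VariantFile(OUT, "w", header=dbsnp.header)

# fetch() uses the .tbi index, so only the queried blocks are read.
# (An indel overlapping adjacent positions may be written twice; dedupe if needed.)
for chrom, pos in sorted(positions):
    for rec in dbsnp.fetch(chrom, pos - 1, pos):
        out.write(rec)
out.close()
# bgzip + tabix the output afterwards so gwas2vcf can use it as --dbsnp.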

Can you think of any reason they might differ?

Best,

DK

mcgml commented Jul 5, 2021

Hi @darked89

It should give the same results. I doubt the performance would improve, though, since the SNP lookup uses tabix rather than reading the whole dbSNP VCF.
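
For reference, the lookup is an indexed region query, roughly along these lines (an illustrative pysam sketch, not the exact gwas2vcf code):

# Illustrative per-position lookup against the tabix-indexed dbSNP VCF.
# Each query only touches the compressed blocks covering that region.
import pysam

dbsnp = pysam.VariantFile("dbSNP_155.GRCh38.names_fixed.vcf.gz")

def rsid_for(chrom, pos, ref, alt):
    """Return the dbSNP ID matching chrom/pos/ref/alt, if any (pos is 1-based)."""
    for rec in dbsnp.fetch(chrom, pos - 1, pos):
        if rec.ref == ref and alt in (rec.alts or ()):
            return rec.id
    return None

print(rsid_for("1", 108391, "A", "G"))  # the example variant from the first post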

Do you observe a performance improvement?

Thanks
Matt

darked89 commented Jul 5, 2021

Hi Matt,

I used just a small subset of the original FinnGen data (chromosome 22).

# VCF sizes
 1.4G   dbsnp155_finngen_only_all_positions.vcf.gz
  25G   dbSNP_155.GRCh38.names_fixed.vcf.gz

# time
'dbsnp': '/genome/dbsnp155_finngen_only_all_positions.vcf.gz': 109.69 secs
'dbsnp': '/genome/dbSNP_155.GRCh38.names_fixed.vcf.gz': 517.38 secs

Maybe there are some delays in accessing the drive in our setup, but it looks like the size of the VCF does affect how fast one can query the indexed file, at least in some environments.
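
If anyone wants to reproduce this comparison, a minimal timing sketch (the paths are the ones from my commands above; the TSV column layout, chromosome in column 0 and position in column 1, is an assumption):

# Time the same set of indexed per-position lookups against the mini-dbSNP
# and the full dbSNP VCF.
import time
import pysam

def load_positions(tsv_path, limit=50_000):
    """Read up to `limit` (chrom, pos) pairs from the GWAS TSV."""
    positions = []
    with open(tsv_path) as fin:
        next(fin)  # skip the header line
        for line in fin:
            chrom, pos = line.split("\t", 2)[:2]
            positions.append((chrom, int(pos)))
            if len(positions) >= limit:
                break
    return positions

def time_lookups(vcf_path, positions):
    vcf = pysam.VariantFile(vcf_path)
    start = time.perf_counter()
    hits = 0
    for chrom, pos in positions:
        hits += sum(1 for _ in vcf.fetch(chrom, pos - 1, pos))
    return time.perf_counter() - start, hits

positions = load_positions("/data/22_chrom_finngen_R5_D3_SARCOIDOSIS.tsv")
for path in ("/genome/dbsnp155_finngen_only_all_positions.vcf.gz",
             "/genome/dbSNP_155.GRCh38.names_fixed.vcf.gz"):
    elapsed, hits = time_lookups(path, positions)
    print(f"{path}: {elapsed:.2f} s for {len(positions)} queries ({hits} records)")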

Hope it helps,

DK

mcgml commented Jul 5, 2021

Thanks @darked89! That's a huge difference in performance. I will investigate.

If you would like to try it, I created a new branch, cyvcf2, which uses cyvcf2 for the dbSNP lookups. The package is written in Cython, so it should offer improved query speeds.
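
Roughly, the same per-position lookup with cyvcf2 looks like this (an illustrative sketch; the branch may structure it differently):

# cyvcf2 wraps htslib, so region queries also go through the tabix index.
from cyvcf2 import VCF

dbsnp = VCF("dbSNP_155.GRCh38.names_fixed.vcf.gz")

def rsid_for(chrom, pos, ref, alt):
    """Return the dbSNP ID matching chrom/pos/ref/alt, if any (pos is 1-based)."""
    for rec in dbsnp(f"{chrom}:{pos}-{pos}"):
        if rec.REF == ref and alt in rec.ALT:
            return rec.ID
    return None

print(rsid_for("1", 108391, "A", "G"))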

Thanks
Matt

darked89 commented Jul 5, 2021

Hello,

I was unsure whether some silly snafu was somehow giving me a corrupted or incomplete result VCF really, really fast, so I computed md5sums on the non-header portions of the outputs (whole dbSNP vs FinnGen subset of dbSNP):

pigz --stdout --decompress 20210705_chr22_dbsnp-subset_sarco.vcf.gz | rg -v '^#' > 20210705_chr22_dbsnp-subset_sarco.vcf.no_header
pigz --stdout --decompress 20210705_chr22_dbsnp-whole_sarco.vcf.gz | rg -v '^#' > 20210705_chr22_dbsnp-whole_sarco.vcf.no_header

md5sum 20210705_chr22_dbsnp-*no_header
94dc99a7ec9265ec8df907ad30f9c39b  20210705_chr22_dbsnp-subset_sarco.vcf.no_header
94dc99a7ec9265ec8df907ad30f9c39b  20210705_chr22_dbsnp-whole_sarco.vcf.no_header

and the headers have different commands:

##Gwas2VCF_command=--data /data/22_chrom_finngen_R5_D3_SARCOIDOSIS.tsv --json /data/finngen.json --id sarco01 --ref /genome/hs38p13.ens_pa.fa --dbsnp /genome/dbsnp155_finngen_only_all_positions.vcf.gz --out /data/20210705_chr22_dbsnp-subset_sarco.vcf --cohort_controls 215712 --cohort_cases 2046; 1.3.1
##file_date=2021-07-05T11:46:55.360262

##Gwas2VCF_command=--data /data/22_chrom_finngen_R5_D3_SARCOIDOSIS.tsv --json /data/finngen.json --id sarco01 --ref /genome/hs38p13.ens_pa.fa --dbsnp /genome/dbSNP_155.GRCh38.names_fixed.vcf.gz --out /data/20210705_chr22_dbsnp-whole_sarco.vcf --cohort_controls 215712 --cohort_cases 2046; 1.3.1
##file_date=2021-07-05T11:58:52.434064

So it does not look like some late-night error produced results that are too good to be true. Or so I hope... ;)

DK
