Multiprocessing #17

Closed
4 tasks done
HadrienG opened this issue Jul 19, 2017 · 5 comments

HadrienG commented Jul 19, 2017

Issue to track the progress on the Roadmap item "Add multiprocessing support"

  • multiprocessing vs joblib
  • parallel read generation
  • parallel fast model generation
  • 0.6.0 release 🚀 🚀 🚀
@HadrienG HadrienG added this to the 1.0.0 milestone Jul 19, 2017
@HadrienG HadrienG self-assigned this Jul 19, 2017
@HadrienG HadrienG changed the title from "Mutliprocessing" to "Multiprocessing" Sep 21, 2017

HadrienG commented Sep 22, 2017

Some benchmarks will be added to this comment:

1 million reads, lognormal, 8 genomes

| cores | version | run 1 | run 2 | run 3 | comments |
|-------|---------|-------|-------|-------|----------|
| 1 | 0.4.1 | 27m25.261s | 29m40.034s | 24m51.049s | generator |
| 1 | b019c56 | 31m4.006s | 29m6.666s | 32m37.646s | list |
| 2 | 6abe929 | 42m30.107s | 41m30.599s | 39m25.503s | joblib, naive |
| 4 | 6abe929 | 47m31.557s | 52m14.546s | 75m53.245s | joblib, naive |
| 4 | 39f744c | 28m19.137s | 29m20.592s | 28m26.219s | joblib, big jobs |
| 2 | 0ee9afc | 14m2.147s | 13m17.612s | 13m15.538s | joblib, sep. output |
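
To compare these rows more easily, the timings can be converted to seconds and averaged. The snippet below is only a convenience sketch: the helper names `to_seconds` and `mean_seconds` are made up for illustration, and it assumes the three timing columns are repeated runs of the same command.

```python
import re
from statistics import mean
from typing import List


def to_seconds(timing: str) -> float:
    """Convert a `time`-style string such as '27m25.261s' to seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", timing)
    if match is None:
        raise ValueError(f"unrecognised timing: {timing!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


def mean_seconds(timings: List[str]) -> float:
    """Average a list of timing strings, in seconds."""
    return mean(to_seconds(t) for t in timings)


# Lognormal timings copied from the benchmark rows above,
# keyed by (cores, version).
runs = {
    (1, "0.4.1"): ["27m25.261s", "29m40.034s", "24m51.049s"],
    (1, "b019c56"): ["31m4.006s", "29m6.666s", "32m37.646s"],
    (2, "6abe929"): ["42m30.107s", "41m30.599s", "39m25.503s"],
    (4, "6abe929"): ["47m31.557s", "52m14.546s", "75m53.245s"],
    (4, "39f744c"): ["28m19.137s", "29m20.592s", "28m26.219s"],
    (2, "0ee9afc"): ["14m2.147s", "13m17.612s", "13m15.538s"],
}

# Use the single-core 0.4.1 release as the reference point.
baseline = mean_seconds(runs[(1, "0.4.1")])

for (cores, version), timings in runs.items():
    average = mean_seconds(timings)
    print(
        f"{cores} core(s), {version}: "
        f"{average / 60:.1f} min, {baseline / average:.2f}x vs 0.4.1"
    )
```

On these numbers the 2-core 0ee9afc runs average roughly twice the speed of the 0.4.1 baseline, while the naive 4-core 6abe929 runs take more than twice as long as it.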

1 million reads, uniform, 8 genomes

| cores | version | run 1 | run 2 | run 3 | comments |
|-------|---------|-------|-------|-------|----------|
| 1 | 0.4.1 | 26m11.288s | 25m21.018s | 25m53.223s | generator |
| 1 | b019c56 | 32m28.583s | 28m2.355s | 30m22.856s | list |
| 2 | 6abe929 | 41m33.731s | 37m8.619s | 35m26.381s | joblib, naive |
| 4 | 6abe929 | 66m12.818s | 52m39.893s | 110m5.414s | joblib, naive |
| 4 | 39f744c | 26m46.326s | 26m37.626s | 27m8.810s | joblib, big jobs |
| 2 | 0ee9afc | 13m20.792s | 13m39.435s | 13m50.261s | joblib, sep. output |


HadrienG commented Sep 28, 2017

As the 6abe929 benchmarks show, naively wrapping the generate_reads function in a Parallel block is not the solution: with 2 and 4 cores it is slower than the single-core runs, and 4 cores is slower than 2.


HadrienG commented Oct 9, 2017

39f744c brought back 0.4.1 speeds by handing each job a larger batch of reads to generate instead of letting joblib dispatch them one at a time.

Collecting the resulting reads from a record still took a while; 0ee9afc solves this by making each CPU write to a different temporary file.
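
The two changes described above (bigger batches per job, one temporary file per worker) can be sketched roughly as follows. This is not the code from 39f744c or 0ee9afc: it uses the standard library's `concurrent.futures` instead of joblib so that it runs without extra dependencies, the names `split_reads`, `write_batch`, `concatenate` and `generate` are hypothetical, the worker writes placeholder records rather than simulated reads, and the final concatenation step is an assumption about how the per-worker files get combined.

```python
import os
import shutil
import tempfile
from concurrent.futures import ProcessPoolExecutor
from itertools import repeat
from typing import List


def split_reads(n_reads: int, n_jobs: int) -> List[int]:
    """Split n_reads into n_jobs batch sizes that differ by at most one."""
    base, extra = divmod(n_reads, n_jobs)
    return [base + (1 if i < extra else 0) for i in range(n_jobs)]


def write_batch(job_id: int, batch_size: int, out_dir: str) -> str:
    """Hypothetical worker: write one whole batch to its own temporary file."""
    path = os.path.join(out_dir, f"job_{job_id}.tmp")
    with open(path, "w") as handle:
        for i in range(batch_size):
            # Placeholder FASTQ record; a real worker would simulate a read.
            handle.write(f"@read_{job_id}_{i}\nACGT\n+\nIIII\n")
    return path


def concatenate(paths: List[str], destination: str) -> None:
    """Append every per-worker file to destination, then delete it."""
    with open(destination, "wb") as out:
        for path in paths:
            with open(path, "rb") as part:
                shutil.copyfileobj(part, out)
            os.remove(path)


def generate(n_reads: int, n_jobs: int, destination: str) -> None:
    """Run one large batch per worker instead of one task per read."""
    batches = split_reads(n_reads, n_jobs)
    with tempfile.TemporaryDirectory() as tmp_dir:
        with ProcessPoolExecutor(max_workers=n_jobs) as pool:
            # Only the file paths travel back to the parent process,
            # not the reads themselves.
            paths = list(
                pool.map(write_batch, range(n_jobs), batches, repeat(tmp_dir))
            )
        concatenate(paths, destination)
```

A call such as `generate(1_000_000, 2, "reads.fastq")` should sit under an `if __name__ == "__main__":` guard so that worker processes can import the module safely. The point of the design is that each worker returns only a short path, so the parent process never has to collect a large list of reads.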

This was referenced Oct 10, 2017

HadrienG commented Oct 16, 2017

More benchmarks! This time for iss model

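| cores | version | run 1 | run 2 | run 3 | comments |
|-------|---------|-------|-------|-------|----------|
| 1 | 0.5.1 | 97m47.692s | 85m39.346s | 92m33.240s | |
| 1 | 375d1e9 | 19m25.276s | 19m54.093s | 18m6.574s | subsample bam to 1M reads |
| 1 | 95b4f16 | 19m43.851s | 18m39.208s | 19m46.282s | more arrays, less lists |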

Version 0.5.1 is very slow. An approximate breakdown of the time is:

  • reading bam file: 20 minutes
  • model insert size: 1 minute
  • model base quality: 70 minutes
  • writing to file: 1 minute

Parallelization of the model module is difficult. Subsampling to 1M reads greatly speeds up the process (from around 90 minutes to under 20 in the runs above) and seems to be an acceptable tradeoff between speed and accuracy.
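
One generic way to draw a fixed-size subsample from a stream of reads without loading them all is reservoir sampling. The sketch below is an illustration only: how 375d1e9 actually subsamples the bam file is not shown in this thread, `reservoir_sample` is a hypothetical helper, and it works on any Python iterable rather than on bam records specifically.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(
    records: Iterable[T], k: int, seed: Optional[int] = None
) -> List[T]:
    """Return up to k records chosen uniformly at random in a single pass."""
    rng = random.Random(seed)
    reservoir: List[T] = []
    for index, record in enumerate(records):
        if index < k:
            # Fill the reservoir with the first k records.
            reservoir.append(record)
        else:
            # Keep the new record with probability k / (index + 1),
            # overwriting a uniformly chosen existing slot.
            slot = rng.randint(0, index)
            if slot < k:
                reservoir[slot] = record
    return reservoir


# Toy usage: keep 5 "reads" out of 1000.
subset = reservoir_sample((f"read_{i}" for i in range(1000)), k=5, seed=42)
print(len(subset))
```

Memory use is bounded by `k`, so a 1M-read sample costs the same whatever the size of the input file.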

@HadrienG

closed with 96ecefd
