Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimization in TheoreticalSpectrumGenerator and MSSpectrum + Fixing multithreaded SimpleSearchEngine #4709

Merged
merged 29 commits into from Jun 8, 2020

Conversation

akriese
Copy link
Contributor

@akriese akriese commented May 11, 2020

While profiling the SimpleSearchEngine, we have found several points to optimize:

  1. Too much time consumed by String construction and concatenation (metadata).
  2. Sorting of the generated peak spectrum not working with partially sorted chunks of peaks
  3. Lots of copying happening in MSSpectrum::select()
  4. If losses shall be added, EmpiricalFormula::toString() takes up a huge amount of runtime.

This is how we optimized the above points:

  1. String construction was pulled outside of loops, if possible. emplace_back() was used for temporary Strings.
  2. As many peaks are added in sorted order to the spectrum, we can remember those chunks of peaks to sort them faster. MSSpectrum::sortByPositionPresorted() implements this with an idea similar to mergesort. For now, the new function is only used in TheoreticalSpectrumGenerator::getSpectrum(), as the chunk information has be added manually to a vector. Changes to the code were made to make sure that some chunks are actually pushed into the spectrum in sorted order (e.g. in addAbundantImmoniumIons_()).
    Also, a test for the new sort function was added to the MSSpectrum_test.
  3. While working on the new sort function it became apparent, that select() takes up a big amount of time to move around the meta value arrays. By using moves and reserves we lowered that amount.
  4. EmpiricalFormula::toString() spends a lot of time accessing maps. We cut that time 1. by caching already computed formula-strings in addLosses_faster_() in a map and 2. by changing the EmpiricalFormula::operator<() to make a search in sets and maps of EmpiricalFormula faster.

Results:
Depending on the parameters for the SimpleSearchEngine, the speed-up lies between 7% and 20%.

More information on the progress of this pull request can be seen here

@timosachsenberg
Copy link
Contributor

@enetz what do you think about these changes? I think you also made some attempts to optimize TSG

@akriese
Copy link
Contributor Author

akriese commented May 25, 2020

I've taken another look into the threading problem. There was an omp atomic missing, but fixing this didn't solve the whole problem.

@akriese
Copy link
Contributor Author

akriese commented May 26, 2020

I've found out, that the sorting of the best n hits for each scan are sorted using the AnnotatedHit_::hasBetterScore(). Problem here is, that the sorting only considers the score of the compared hits. If two sequences have the same score for one scan (which happens fairly often), there is no rule on how to order them. That made the SimpleSearchEngine have different results when using multiple threads.
I changed the hasBetterScore()-function to compare more members of the AnnotatedHit_-struct, so that two equally scoring hits are sorted in a specified way.

Also, I had to build in a sorting of the peptide_ids-vector in postProcessHits_(), to make sure, the idXML-File always writes the hits in the same order (as using threads "shuffles" peptide_ids).

The results with these new changes differ slightly compared to what the SimpleSearchEngine has put out (using one thread) before the changes were made. But at least the results are always the same now indepent on how many threads are being used.
I'd appreciate, if someone could give me a feedback on this. Only if approved, I would change the tests (if needed), which may take some more time.

@akriese akriese changed the title Optimization in TheoreticalSpectrumGenerator and MSSpectrum Optimization in TheoreticalSpectrumGenerator and MSSpectrum + Fixing multithreaded SimpleSearchEngine May 26, 2020
@@ -143,5 +147,8 @@ namespace OpenMS
double pre_int_;
double pre_int_H2O_;
double pre_int_NH3_;

// formula.toString() is extremely expensive, so we use a member map to remember what String belongs to which formula
//mutable std::map<EmpiricalFormula, String> formula_str_cache_;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about this comment line?

@timosachsenberg timosachsenberg merged commit 71a9ddb into OpenMS:develop Jun 8, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants