Performance improvements in compartment report mapping handling.#156
Conversation
| report->getGIDs().end()); | ||
| // With 8 threads there's some speed up, but very small. We stick with 6 | ||
| // to avoid using more than 1 socket. | ||
| #pragma omp parallel for num_threads(6) |
There was a problem hiding this comment.
what happens if for some reason you have not 6 threads? is omp using less automatically or overcommits here? I guess this observation or benchmark is valid for one specific machine of ours. At least I would mention that.
There was a problem hiding this comment.
In the end it's a memory-bound operation (I don't see any computation), hence the speed-up is quite small indeed. I guess the change to indices.resize is probably enough, unless VTune really says threading makes sense here :)
There was a problem hiding this comment.
According to msdn, num_threads is the maximum number of threads that will execute the parallel region, unless dynamic adjustment is disabled, in which case the result of overcommiting is implementation defined (it's even allowed to abort :(). This clause also overrides the effect of OMP_NUM_THREADS
Something I've observed for sure, is that due to the NUMA architecture, going to another socket becomes slower (even more than x2) and unpredictable.
I didn't use VTune because I did it remotely and I haven't spend time to learn how to use the command line tools, but now you're an "expert", we can look into it tomorrow together if you want. The only profiling I was using was measuring the time of the whole operation and "sampling" with the debugger.
Without threading it takes x3 times more.
| @@ -734,8 +744,6 @@ bool CompartmentReportBinary::_parseMapping() | |||
| sectionOffsets[k.first] = k.second.first; | |||
There was a problem hiding this comment.
const auto& and better name for 'k'?
| std::sort(sectionsMapping.begin(), sectionsMapping.end()); | ||
|
|
||
| // now convert the maps into the desired mapping format | ||
| uint64_ts& sectionOffsets = offsetMapping[idx]; |
There was a problem hiding this comment.
why do you prefer explicit type over & here?
There was a problem hiding this comment.
I didn't touch this code.
|
|
||
| if (previous != LB_UNDEFINED_UINT16) | ||
| sectionsMapping[previous].second = count; | ||
| sectionsMapping[sectionsMapping.size() - 2].second.second = |
There was a problem hiding this comment.
what happens for 1-section reports? ;)
There was a problem hiding this comment.
Then this code is never executed. This is already the case of soma reports and the tests pass :)
Note that this line would only be executed if previous = undefined and current != previous. In that case the push back in line 722 has already been executed twice. You could argue that 730 is actually wrong if the cell has no compartments at all, but if that was the case, that would mean there's something reaaaly broken somewhere else.
| @@ -706,23 +716,23 @@ bool CompartmentReportBinary::_parseMapping() | |||
| const uint64_t frameIndex = | |||
| j + ((info.dataOffset - _header.dataBlockOffset) / | |||
| sizeof(float)); | |||
There was a problem hiding this comment.
is sizeof(float) machine-dependent?
There was a problem hiding this comment.
It is, but since all the machines we use are IEEE 745 I guess it's impossible to find something different from 4. In any case, I prefer keeping sizeof(float) for consistency with the rest of code style.
There was a problem hiding this comment.
You do realize, of course, that using sizeof(float) for dealing with locally allocated memory and for something created externally is a different story.
If the spec says it's 4 bytes for measurement in the report I would rather see the constant '4'.
But I'm not an expert on machine-dependent float type size to insist on that.
| perCellOffsets[idx] = info.accumCompartments; | ||
|
|
||
| for (int32_t j = 0; j < info.numCompartments; ++j) | ||
| { |
There was a problem hiding this comment.
static_cast for current = value?
Why is it downcasted to uint16 from float?
There was a problem hiding this comment.
I think uin16_t came before, inherited from the first code used to parse h5 reports. Why the binary report uses floats as the datatype for the mapping it a mystery to me. This can be changed if needed, but not in this PR.
There was a problem hiding this comment.
I think the float is used so that "index" is of the same size as "frame" and thus is read with same code (although I do not justify this).
The thing is, by downcasting to uint16 you are risking an overflow, as you mentioned before.
But if the "original" type is float, why restricting to uint16?
There was a problem hiding this comment.
Maybe using the same datatype makes sense in reporting lib... It you're really concerned about it, please create a ticket so we don't forget. I'll add an assert for the moment, because changing uint16_t to uint32_t impacts the public API and the wrapping, not just this file.
| @@ -706,23 +716,23 @@ bool CompartmentReportBinary::_parseMapping() | |||
| const uint64_t frameIndex = | |||
| j + ((info.dataOffset - _header.dataBlockOffset) / | |||
There was a problem hiding this comment.
I would probably add an assert that info.dataOffset >= _header.dataBlockOffset, unless it is guaranteed somewhere else.
There was a problem hiding this comment.
It isn't. An assert will still crash in release builds, don't you prefer an exception?
There was a problem hiding this comment.
logically, it's an assert since it should not happen, right?
There was a problem hiding this comment.
Yes, my point is that to avoid crashing the Python interpreter it's better an exception and maybe you prefer that.
There was a problem hiding this comment.
As long as there is a paranoid check that does not affect the performance, I have no preference here.
If you think logic_error is better, it's fine with me.
There was a problem hiding this comment.
I added a check some lines up to make this function return an error. That will throw a runtime_error. It's just an if in not the heaviest loop at loop and I haven't noticed any significant difference in performance.
In my opinion this is not a logic_error because it's not the users' fault (as opposed to writing e.g. sqrt(-1)), it could be due to a corrupted file.
| #pragma omp parallel for num_threads(6) | ||
| for (size_t idx = 0; idx < cells.size(); ++idx) | ||
| { | ||
| CellInfo& info = cells[idx]; |
|
|
||
| // < sectionID, < frameIndex, numCompartments > > | ||
| typedef std::map<uint16_t, std::pair<uint64_t, uint16_t>> | ||
| typedef std::vector<std::pair<uint16_t, std::pair<uint64_t, uint16_t>>> |
There was a problem hiding this comment.
is it really worth a typedef here?
| @@ -653,6 +663,10 @@ bool CompartmentReportBinary::_parseMapping() | |||
|
|
|||
| if (_header.byteswap) | |||
There was a problem hiding this comment.
just out of curiosity: is there a way to combine get<> with byteswap if needed?
Could help to avoid funny errors if new get<>s are added for some reason.
There was a problem hiding this comment.
Maybe, but get is a free function and whether bytes need swapping is a member variable inside the header, so the change you request requires either making get a member of add an argument, and none of the them is a nice change.
| indices.resize(indicesCount); | ||
|
|
||
| for (auto gid : report->getGIDs()) | ||
| std::vector<uint32_t> gidList(report->getGIDs().begin(), |
There was a problem hiding this comment.
why this copy is needed and why is it not const? does it have anything to do with threading?
There was a problem hiding this comment.
It could be const, but the copy is needed because the openmp loop needs something that is random accessible and report->getGIDs() returns a std::set.
| report->getGIDs().end()); | ||
| // With 8 threads there's some speed up, but very small. We stick with 6 | ||
| // to avoid using more than 1 socket. | ||
| #pragma omp parallel for num_threads(6) |
There was a problem hiding this comment.
this shows no effect for me (and vtune)
There was a problem hiding this comment.
...so I suggest to remove it
| for (const CellInfo& info : cells) | ||
| // With 8 threads there's some speed up, but very small. We stick with 6 | ||
| // to avoid using more than 1 socket. | ||
| #pragma omp parallel for num_threads(6) |
There was a problem hiding this comment.
add schedule(dynamic). does 800ms to 600ms for me
|
Retest this please |
No description provided.