Skip to content

Performance improvements in compartment report mapping handling.#156

Merged
hernando merged 2 commits into
masterfrom
mapping_perf
Jun 6, 2017
Merged

Performance improvements in compartment report mapping handling.#156
hernando merged 2 commits into
masterfrom
mapping_perf

Conversation

@hernando
Copy link
Copy Markdown
Contributor

No description provided.

Comment thread brain/detail/compartmentReport.h Outdated
report->getGIDs().end());
// With 8 threads there's some speed up, but very small. We stick with 6
// to avoid using more than 1 socket.
#pragma omp parallel for num_threads(6)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what happens if for some reason you have not 6 threads? is omp using less automatically or overcommits here? I guess this observation or benchmark is valid for one specific machine of ours. At least I would mention that.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the end it's a memory-bound operation (I don't see any computation), hence the speed-up is quite small indeed. I guess the change to indices.resize is probably enough, unless VTune really says threading makes sense here :)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to msdn, num_threads is the maximum number of threads that will execute the parallel region, unless dynamic adjustment is disabled, in which case the result of overcommiting is implementation defined (it's even allowed to abort :(). This clause also overrides the effect of OMP_NUM_THREADS
Something I've observed for sure, is that due to the NUMA architecture, going to another socket becomes slower (even more than x2) and unpredictable.

I didn't use VTune because I did it remotely and I haven't spend time to learn how to use the command line tools, but now you're an "expert", we can look into it tomorrow together if you want. The only profiling I was using was measuring the time of the whole operation and "sampling" with the debugger.
Without threading it takes x3 times more.

@@ -734,8 +744,6 @@ bool CompartmentReportBinary::_parseMapping()
sectionOffsets[k.first] = k.second.first;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const auto& and better name for 'k'?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

std::sort(sectionsMapping.begin(), sectionsMapping.end());

// now convert the maps into the desired mapping format
uint64_ts& sectionOffsets = offsetMapping[idx];
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do you prefer explicit type over & here?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't touch this code.


if (previous != LB_UNDEFINED_UINT16)
sectionsMapping[previous].second = count;
sectionsMapping[sectionsMapping.size() - 2].second.second =
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what happens for 1-section reports? ;)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then this code is never executed. This is already the case of soma reports and the tests pass :)
Note that this line would only be executed if previous = undefined and current != previous. In that case the push back in line 722 has already been executed twice. You could argue that 730 is actually wrong if the cell has no compartments at all, but if that was the case, that would mean there's something reaaaly broken somewhere else.

@@ -706,23 +716,23 @@ bool CompartmentReportBinary::_parseMapping()
const uint64_t frameIndex =
j + ((info.dataOffset - _header.dataBlockOffset) /
sizeof(float));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is sizeof(float) machine-dependent?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is, but since all the machines we use are IEEE 745 I guess it's impossible to find something different from 4. In any case, I prefer keeping sizeof(float) for consistency with the rest of code style.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You do realize, of course, that using sizeof(float) for dealing with locally allocated memory and for something created externally is a different story.
If the spec says it's 4 bytes for measurement in the report I would rather see the constant '4'.
But I'm not an expert on machine-dependent float type size to insist on that.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's true

perCellOffsets[idx] = info.accumCompartments;

for (int32_t j = 0; j < info.numCompartments; ++j)
{
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

static_cast for current = value?
Why is it downcasted to uint16 from float?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think uin16_t came before, inherited from the first code used to parse h5 reports. Why the binary report uses floats as the datatype for the mapping it a mystery to me. This can be changed if needed, but not in this PR.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the float is used so that "index" is of the same size as "frame" and thus is read with same code (although I do not justify this).
The thing is, by downcasting to uint16 you are risking an overflow, as you mentioned before.
But if the "original" type is float, why restricting to uint16?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe using the same datatype makes sense in reporting lib... It you're really concerned about it, please create a ticket so we don't forget. I'll add an assert for the moment, because changing uint16_t to uint32_t impacts the public API and the wrapping, not just this file.

@@ -706,23 +716,23 @@ bool CompartmentReportBinary::_parseMapping()
const uint64_t frameIndex =
j + ((info.dataOffset - _header.dataBlockOffset) /
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would probably add an assert that info.dataOffset >= _header.dataBlockOffset, unless it is guaranteed somewhere else.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It isn't. An assert will still crash in release builds, don't you prefer an exception?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logically, it's an assert since it should not happen, right?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, my point is that to avoid crashing the Python interpreter it's better an exception and maybe you prefer that.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As long as there is a paranoid check that does not affect the performance, I have no preference here.
If you think logic_error is better, it's fine with me.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a check some lines up to make this function return an error. That will throw a runtime_error. It's just an if in not the heaviest loop at loop and I haven't noticed any significant difference in performance.
In my opinion this is not a logic_error because it's not the users' fault (as opposed to writing e.g. sqrt(-1)), it could be due to a corrupted file.

#pragma omp parallel for num_threads(6)
for (size_t idx = 0; idx < cells.size(); ++idx)
{
CellInfo& info = cells[idx];
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


// < sectionID, < frameIndex, numCompartments > >
typedef std::map<uint16_t, std::pair<uint64_t, uint16_t>>
typedef std::vector<std::pair<uint16_t, std::pair<uint64_t, uint16_t>>>
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it really worth a typedef here?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not really

@@ -653,6 +663,10 @@ bool CompartmentReportBinary::_parseMapping()

if (_header.byteswap)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just out of curiosity: is there a way to combine get<> with byteswap if needed?
Could help to avoid funny errors if new get<>s are added for some reason.

Copy link
Copy Markdown
Contributor Author

@hernando hernando Jun 1, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe, but get is a free function and whether bytes need swapping is a member variable inside the header, so the change you request requires either making get a member of add an argument, and none of the them is a nice change.

Comment thread brain/detail/compartmentReport.h Outdated
indices.resize(indicesCount);

for (auto gid : report->getGIDs())
std::vector<uint32_t> gidList(report->getGIDs().begin(),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why this copy is needed and why is it not const? does it have anything to do with threading?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could be const, but the copy is needed because the openmp loop needs something that is random accessible and report->getGIDs() returns a std::set.

hernando pushed a commit that referenced this pull request Jun 1, 2017
hernando pushed a commit that referenced this pull request Jun 1, 2017
Comment thread brain/detail/compartmentReport.h Outdated
report->getGIDs().end());
// With 8 threads there's some speed up, but very small. We stick with 6
// to avoid using more than 1 socket.
#pragma omp parallel for num_threads(6)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this shows no effect for me (and vtune)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...so I suggest to remove it

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

for (const CellInfo& info : cells)
// With 8 threads there's some speed up, but very small. We stick with 6
// to avoid using more than 1 socket.
#pragma omp parallel for num_threads(6)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add schedule(dynamic). does 800ms to 600ms for me

@hernando
Copy link
Copy Markdown
Contributor Author

hernando commented Jun 6, 2017

Retest this please

@hernando hernando merged commit b5d0fd7 into master Jun 6, 2017
@hernando hernando deleted the mapping_perf branch June 6, 2017 11:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants