
SO list offsets are wrong/counterintuitive #108

Closed
bwvdnbro opened this issue Jul 8, 2021 · 6 comments
Labels
bug Something isn't working

Comments

bwvdnbro commented Jul 8, 2021

Describe the bug
The offsets stored in the SO list output seem inconsistent with the SO sizes in the same file. My (possibly wrong) expectation is that the particle IDs belonging to SO i are stored in pIDs[offset[i] : offset[i]+size[i]], but that is not the case.

To Reproduce
The problem is best illustrated with the following Python snippet, which can be applied to any HDF5 SO list output:

import sys

import h5py

SOfile = h5py.File("<RANDOM SO_LIST FILE.hdf5>", "r")
ofs = SOfile["Offset"][:]        # claimed offset of each SO into the particle ID list
siz = SOfile["SO_size"][:]       # number of particles in each SO
pIDs = SOfile["Particle_IDs"][:]
# The last offset plus the last size should equal the total number of particle IDs
if not ofs[-1] + siz[-1] == pIDs.shape[0]:
  print("Wrong final size ({0}=/={1})!".format(ofs[-1] + siz[-1], pIDs.shape[0]))
# Each offset should equal the previous offset plus the previous size
for i in range(1, ofs.shape[0]):
  if not ofs[i-1] + siz[i-1] == ofs[i]:
    print("Wrong offset ({0} {1}, {2} {3})!".format(ofs[i-1], siz[i-1], ofs[i], siz[i]))
    sys.exit(1)

This produces two error messages: the first because the final offset is not separated from the end of the particle ID list by the size of the final SO, and the second because an offset does not match the previous offset plus the previous size.

Expected behavior
ofs[i] == ofs[i-1]+siz[i-1]
Put differently, I would expect ofs to be equivalent to

import numpy

# i.e. the exclusive prefix sum of the SO sizes
ofs = numpy.cumsum(siz)
ofs[1:] = ofs[:-1]
ofs[0] = 0

Log files
Not applicable.

Environment (please complete the following information):
Irrelevant for this problem.

Additional context
I think the problem is situated in io.cxx:1450, where the value of SOpids[0] seems to be (incorrectly) omitted from the loop.

rtobar commented Jul 9, 2021

Thanks @bwvdnbro for recording the issue here.

tl;dr: you are almost right. Yes, there is an element being skipped when building offset from SOpids, but the missing index is 1, not 0. We can quickly patch this now, or you can wait a bit longer for a more complete solution.

The longer story: as I mentioned in Slack, I had actually been looking at this code recently in the context of trying to reduce the maximum memory usage (see #71). The code has some extra complexity because the SOpids and SOtypes arrays-of-vectors are originally allocated in GetSOMasses with N + 1 elements:

vector<Int_t> *SOpartlist = new vector<Int_t>[ngroup+1];
vector<int> *SOparttypelist = NULL;
#if defined(GASON) || defined(STARON) || defined(BHON) || defined(HIGHRES)
SOparttypelist=new vector<int>[ngroup+1];
#endif

Then they are 1-indexed:

for (i=1;i<=ngroup;i++)

SOpartlist[i].resize(llindex);
#if defined(GASON) || defined(STARON) || defined(BHON) || defined(HIGHRES)
SOparttypelist[i].resize(llindex);
#endif
for (j=0;j<llindex;j++) SOpartlist[i][j]=SOpids[indices[j]];
#if defined(GASON) || defined(STARON) || defined(BHON) || defined(HIGHRES)
for (j=0;j<llindex;j++) SOparttypelist[i][j]=typeparts[indices[j]];
#endif

The same pattern repeats in GetInclusiveMasses.

Skipping element 0 is then fine, because it doesn't contain anything. What is being skipped, though, is the size of element 1 in the loop populating offset:

for (Int_t i=2;i<=ngroups;i++) offset[i]=offset[i-1]+SOpids[i].size();

In that line the indexing should happen with SOpids[i - 1].
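
In other words, the corrected line would look roughly like this (a sketch keeping the surrounding 1-based indexing, not necessarily the final code):

// offset[1] stays 0; each later offset adds the size of the *previous* group
for (Int_t i=2;i<=ngroups;i++) offset[i]=offset[i-1]+SOpids[i-1].size();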

As part of the changes that I've been working on for #71 I've also changed the indexing strategy for SOpids and SOtypes to be 0-based, which in turn makes their manipulation much clearer in other places. My changes have actually also fixed this problem, although I hadn't noticed it before you raised this issue. If you are in a hurry we can quickly try a patch now, but if you're happy to wait a bit longer for the more complete solution, a fix for this problem will be included when I merge my changes.
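
For illustration only, here is a minimal sketch of what the offset calculation can look like once SOpids is a 0-based vector-of-vectors (make_offsets is a hypothetical helper, long stands in for VELOCIraptor's Int_t, and the actual branch code may well differ):

#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// The offsets are just the exclusive prefix sum of the per-group sizes.
std::vector<std::size_t> make_offsets(const std::vector<std::vector<long>> &SOpids)
{
    std::vector<std::size_t> sizes(SOpids.size());
    std::transform(SOpids.begin(), SOpids.end(), sizes.begin(),
                   [](const std::vector<long> &ids) { return ids.size(); });
    std::vector<std::size_t> offsets(sizes.size());
    std::exclusive_scan(sizes.begin(), sizes.end(), offsets.begin(), std::size_t{0});
    return offsets;
}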

bwvdnbro commented Jul 9, 2021

Sorry, I meant element 1 and not 0 (my mind is just not used to thinking about 1 as the first element).

There is absolutely no hurry, I am currently using the numpy.cumsum workaround in my script.

rtobar added the bug label Jul 9, 2021
rtobar added a commit that referenced this issue Jul 12, 2021
Spherical overdensity information (Particle IDs and types) are collected
by two different routines: GetInclusiveMasses and GetSOMasses. Both
routines use the following technique: first, for "ngroup" groups, a C
array-of-vectors (AOV) of "ngroup + 1" elements was allocated via "new"
(which was eventually delete[]'ed). Vectors for each group are filled
independently as groups are processed; the loop over groups is
1-indexed, and indexing into these AOVs happens using the
same iteration variable, meaning that their 0 element is skipped.
Finally, these AOVs are passed to WriteSOCatalog for writing.
WriteSOCatalog is aware of the 1-based indexing, and additionally it
flattens the AOVs into a single vector (thus duplicating their memory
requirement) to finally write the data into the output HDF5 file.

This commit originally aimed to reduce the memory overhead of the final
writing of data into the HDF5 file (see #71). The main change required
to achieve this is to perform the flattening of data at separate times,
such that particle IDs and types are not flattened at the same time,
but one after the other, with memory from the first flattening operation
being released before the second flattening happens, thus reducing the
maximum memory requirement of the application. This goal was achieved.
However, while performing these changes two things became clear:
firstly, that using a vector-of-vectors (VOV) was a better interface
than an AOV (due to automatic memory management), and secondly that the
1-based indexing of the original AOVs introduced much complexity in the
code. The scope of these changes was then broadened to cover these two
extra changes, and therefore this commit considerably grew in size. In
particular the 0-indexing of the VOVs allowed us to more easily use more
std algorithms that clarify the intent in certain places of the code.

There are other minor changes that have also been included in this
commit, mostly to reduce variable scopes, reduce code duplication, and
such. Assertions have also been sprinkled here and there to add further
assurance that the code is working as expected.

As an unintended side-effect, this commit also fixed the
wrongly-calculated Offset dataset, which was off by one index in the
original values. This problem was reported in #108, and seems to have
always been there.
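
To make the memory argument concrete, here is a minimal sketch of the one-dataset-at-a-time flattening described above (write_dataset is a hypothetical stand-in for the HDF5 calls in WriteSOCatalog, and "Particle_types" is an assumed dataset name):

#include <cstdio>
#include <vector>

// Concatenate all per-group vectors into one contiguous vector.
template <typename T>
std::vector<T> flatten(const std::vector<std::vector<T>> &groups)
{
    std::vector<T> flat;
    for (const auto &g : groups)
        flat.insert(flat.end(), g.begin(), g.end());
    return flat;
}

// Hypothetical stand-in for the real HDF5 write.
void write_dataset(const char *name, const std::vector<long long> &data)
{
    std::printf("writing %s (%zu entries)\n", name, data.size());
}

void write_so_particles(const std::vector<std::vector<long long>> &SOpids,
                        const std::vector<std::vector<long long>> &SOtypes)
{
    {
        auto flat = flatten(SOpids);
        write_dataset("Particle_IDs", flat);
    } // the flattened IDs are released here...
    {
        auto flat = flatten(SOtypes);          // ...before the types are flattened,
        write_dataset("Particle_types", flat); // so only one copy is alive at a time
    }
}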
rtobar commented Jul 13, 2021

@bwvdnbro please have a look at the changes in the issue-71 branch, which should now yield a correct Offset dataset. Please let me know if you now see the correct values (and that nothing else breaks); after confirmation I'll merge this to master.

@bwvdnbro

Hi @rtobar, I will take a look at this tomorrow. I have never actually run VELOCIraptor before, so it might take a while; so far I have only been looking at output provided by someone else.

@bwvdnbro

@rtobar, I can now confirm that the changes in the issue-71 branch fix the offsets and do not affect any of the other quantities we look at.

rtobar commented Jul 28, 2021

Branch merged now, thanks for testing.
