
std::bad_alloc in WriteSOCatalog #71

Open
rtobar opened this issue Mar 4, 2021 · 4 comments
Labels
enhancement New feature or request


rtobar commented Mar 4, 2021

VR is crashing in WriteSOCatalog with the following message:

[0000] [1445.514] [ info] io.cxx:1292 Saving SO particle lists to halos-2.catalog_SOlist.0
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
/var/slurm/slurmd/job2998665/slurm_script: line 4: 16915 Aborted             builds/53/stf -i /cosma/home/dc-borr1/c7dataspace/XL_wave_1/runs_correct_dt/Run_0/eagle_0000 -C /cosma/home/dc-borr1/c7dataspace/XL_wave_1/runs_correct_dt/vrconfig_3dfof_subhalos_SO_hydro.cfg -I 2 -o halos-2

This feels like a bug introduced with #57, but it will require a bit of investigation.

rtobar added the bug (Something isn't working) label on Mar 4, 2021

rtobar commented Mar 26, 2021

I went back to this issue yesterday and today. The crash is caused by a copy operation of many PIDs from an array of vectors into a single vector, which is then used for writing the resulting data to disk. The same applies to an array of types:

VELOCIraptor-STF/src/io.cxx, lines 1506 to 1525 at 64de17b:

if (nSOids>0) {
    idval.resize(nSOids,0);
    nSOids=0;
    for (Int_t i=1;i<=ngroups;i++) {
        for (Int_t j=0;j<SOpids[i].size();j++) {
            idval[nSOids++]=SOpids[i][j];
        }
        SOpids[i].clear();
    }
#if defined(GASON) || defined(STARON) || defined(BHON)
    typeval.resize(nSOids,0);
    nSOids=0;
    for (Int_t i=1;i<=ngroups;i++) {
        for (Int_t j=0;j<SOtypes[i].size();j++) {
            typeval[nSOids++]=SOtypes[i][j];
        }
        SOtypes[i].clear();
    }
#endif
}

It just so happens that the original array of vectors holds quite a big number of IDs (in our case nSOids is 2718973906), and therefore these copies contribute greatly to bringing the memory requirements of the application beyond what's available on the machine:

### At the beginning of GetSOMasses there's 65 GB of resident memory used:
[0000] [1148.534] [ info] substructureproperties.cxx:3234 Memory report at substructureproperties.cxx:3234 ... Resident: 64.978 [GiB] [...]

### Just before WriteSOCatalog is called, 100 GB are being used:
[0000] [1402.079] [ info] substructureproperties.cxx:3770 Memory report at substructureproperties.cxx:3770 [...] Resident: 99.044 [GiB] [...]

With nSOids == 2718973906 the two copies require ~20 GB of memory on a machine that usually offers ~120 GB of free memory to processes; hence we end up with a crash.
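The quoted figure can be sanity-checked with simple arithmetic. A minimal sketch, assuming a fixed element width for the flat copies (an assumption; the actual widths of the PID and type values depend on the build configuration):

```cpp
#include <cstddef>
#include <cstdint>

// GiB needed for one flat copy of n elements of elem_bytes each.
// The element widths below are illustrative, not read from the VR code.
constexpr double copy_gib(std::uint64_t n, std::size_t elem_bytes) {
    return static_cast<double>(n) * elem_bytes / (1024.0 * 1024.0 * 1024.0);
}

// copy_gib(2718973906ULL, 8) is roughly 20.3 GiB per flat copy with 8-byte
// elements; copy_gib(2718973906ULL, 4) is roughly 10.1 GiB with 4-byte ones.
```

Either way, the flat copies duplicate data that the array-of-vectors already holds, which is what pushes the process over the memory limit.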

This has been a known problem for a while though:

VELOCIraptor-STF/src/io.cxx, lines 1213 to 1214 at 64de17b:

///\todo optimisation memory wise can be implemented by not creating an array
///to store all ids and then copying info from the array of vectors into it.

I'll relabel this as an improvement rather than a bug.

rtobar added the enhancement (New feature or request) label and removed the bug (Something isn't working) label on Mar 26, 2021

rtobar commented Mar 26, 2021

In the original code, each of the vectors in the array-of-vectors was emptied after its contents were copied (but it wasn't shrunk via shrink_to_fit), and the full block of memory needed for the new 1D array was requested up front.

A possible, non-invasive improvement for this situation is to resize the 1D array multiple times as we iterate over the original array-of-vectors, while effectively releasing the memory held by each of those individual vectors by calling shrink_to_fit. This will result in more memory allocations, but should keep the memory requirements in check without having to touch more of the code.
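The idea above can be sketched as follows (names are illustrative, not the actual VELOCIraptor code): grow the flat vector group by group and free each source vector as soon as it has been copied, so peak memory stays close to the size of the data itself rather than twice that.

```cpp
#include <utility>
#include <vector>

// Flatten an array-of-vectors incrementally, releasing each source vector
// right after its contents have been copied into the flat vector.
template <typename T>
std::vector<T> flatten_incrementally(std::vector<std::vector<T>> &groups) {
    std::vector<T> flat;
    for (auto &g : groups) {
        flat.insert(flat.end(), g.begin(), g.end());
        // clear() alone keeps the capacity around; swapping with an empty
        // vector (equivalent in effect to clear() + shrink_to_fit) actually
        // releases the group's memory.
        std::vector<T>().swap(g);
    }
    return flat;
}
```

The trade-off is exactly as described: the flat vector may reallocate several times as it grows, but at no point do both a full-size copy and the full array-of-vectors coexist.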


rtobar commented Jul 5, 2021

Two other ideas that could be quickly implemented:

  • The flattening of ID and type values into their respective vectors doesn't need to happen at the same time. At the moment both are first flattened, and then sequentially written. Instead we could flatten one, write it, clear that memory, then do the same with the other.
  • The new vector of types could be declared with a one-byte integer type (i.e., char instead of int). This way the flat copy of the types array would be a quarter of what it is now.
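Both bullet points can be sketched with a single helper (hypothetical names, not the actual WriteSOCatalog code): a flatten function whose flat element type we choose, so the types can be flattened into 1-byte values, and which is called once per dataset so the ID copy can be freed before the type copy is made.

```cpp
#include <cstdint>
#include <vector>

// Copy an array-of-vectors of T into one flat vector of Flat; picking
// Flat = std::uint8_t for the particle types gives 1 byte per particle
// instead of 4.
template <typename Flat, typename T>
std::vector<Flat> flatten(const std::vector<std::vector<T>> &groups) {
    std::vector<Flat> flat;
    for (const auto &g : groups)
        flat.insert(flat.end(), g.begin(), g.end());
    return flat;
}

// Intended use: flatten and write the IDs first, let that vector go out of
// scope (freeing its memory), and only then flatten the types:
//
//   { auto ids = flatten<long long>(SOpids);       /* write ids */ }
//   { auto types = flatten<std::uint8_t>(SOtypes); /* write types */ }
```

This way at most one flat copy exists at a time, and the larger of the two (the IDs) is never resident together with the type copy.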

rtobar added a commit that referenced this issue Jul 12, 2021
Spherical overdensity information (Particle IDs and types) are collected
by two different routines: GetInclusiveMasses and GetSOMasses. Both
routines use the following technique: first, for "ngroup" groups, a C
array-of-vectors (AOV) of "ngroup + 1" elements was allocated via "new"
(which was eventually delete[]'ed). Vectors for each group are filled
independently as groups are processed; the loop over groups is
1-indexed, and indexing into these AOVs happens using the
same iteration variable, meaning that their 0 element is skipped.
Finally, these AOVs are passed to WriteSOCatalog for writing.
WriteSOCatalog is aware of the 1-based indexing, and additionally it
flattens the AOVs into a single vector (thus duplicating their memory
requirement) to finally write the data into the output HDF5 file.

This commit originally aimed to reduce the memory overhead of the final
writing of data into the HDF5 file (see #71). The main change required
to achieve this is to perform the flattening of data at separate times,
such that particle IDs and types are not flattened at the same time,
but one after the other, with memory from the first flattening operation
being released before the second flattening happens, thus reducing the
maximum memory requirement of the application. This goal was achieved.
However, while performing these changes two things became clear:
firstly, that using a vector-of-vectors (VOV) was a better interface
than an AOV (due to automatic memory management), and secondly that the
1-based indexing of the original AOVs introduced much complexity in the
code. The scope of these changes was then broadened to cover these two
extra changes, and therefore this commit grew considerably in size. In
particular, the 0-indexing of the VOVs allowed us to more easily use std
algorithms that clarify the intent in certain places of the code.

There are other minor changes that have also been included in this
commit, mostly to reduce variable scopes, reduce code duplication, and
such. Assertions have also been sprinkled here and there to add further
assurance that the code is working as expected.

As an unintended side-effect, this commit also fixed the
wrongly-calculated Offset dataset, which was off by one index in the
original values. This problem was reported in #108, and seems to have
always been there.
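The VOV + 0-indexing style the commit describes lends itself to standard algorithms. An illustrative example (not the actual diff): counting the total number of collected IDs across all groups without a hand-written double loop.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Total number of IDs held by a 0-indexed vector-of-vectors.
std::size_t total_ids(const std::vector<std::vector<long long>> &vov) {
    return std::accumulate(vov.begin(), vov.end(), std::size_t{0},
        [](std::size_t acc, const std::vector<long long> &v) {
            return acc + v.size();
        });
}
```

With the old 1-indexed array-of-vectors, the same computation would have had to skip element 0 explicitly, which is exactly the kind of complexity the commit removes.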

rtobar commented Jul 28, 2021

The first bullet point mentioned in the last comment has now been merged into the master branch, as its implementation also fixed the problem described in #108. With this change the maximum RSS has decreased, but extra copies are still required.

Although this is progress, a better solution is of course to avoid the copies altogether, which should be possible if PIDs and types were collected into a single vector from the start. This could be investigated in the future.
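The single-vector idea could look roughly like this (a hypothetical sketch, not existing VR code): append every PID to one contiguous vector as groups are processed and record where each group starts, so writing needs no flattening step at all.

```cpp
#include <cstddef>
#include <vector>

// Collect SO particle IDs for all groups into one contiguous vector,
// remembering per-group start offsets instead of keeping one vector
// per group. Writing can then use pids directly.
struct SOCollector {
    std::vector<long long> pids;     // all particle IDs, back to back
    std::vector<std::size_t> offset; // offset[i] = start of group i in pids

    void start_group() { offset.push_back(pids.size()); }
    void add(long long pid) { pids.push_back(pid); }
};
```

The offsets double as the Offset dataset that the catalog already writes, so no per-group sizes need to be reconstructed at write time.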
