
std::bad_alloc in WriteSOCatalog #71

Open
rtobar opened this issue Mar 4, 2021 · 4 comments
Labels
enhancement New feature or request


rtobar commented Mar 4, 2021

VR is crashing in WriteSOCatalog with the following message:

[0000] [1445.514] [ info] io.cxx:1292 Saving SO particle lists to halos-2.catalog_SOlist.0
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
/var/slurm/slurmd/job2998665/slurm_script: line 4: 16915 Aborted             builds/53/stf -i /cosma/home/dc-borr1/c7dataspace/XL_wave_1/runs_correct_dt/Run_0/eagle_0000 -C /cosma/home/dc-borr1/c7dataspace/XL_wave_1/runs_correct_dt/vrconfig_3dfof_subhalos_SO_hydro.cfg -I 2 -o halos-2

This feels like a bug introduced with #57, but it will require a bit of investigation.

rtobar added the bug (Something isn't working) label on Mar 4, 2021

rtobar commented Mar 26, 2021

I went back to this issue yesterday and today. The crash is caused by a copy operation of many PIDs from an array of vectors into a single vector, which is then used for writing the resulting data to disk. The same applies to an array of types:

VELOCIraptor-STF/src/io.cxx, lines 1506 to 1525 at 64de17b:

if (nSOids>0) {
    idval.resize(nSOids,0);
    nSOids=0;
    for (Int_t i=1;i<=ngroups;i++) {
        for (Int_t j=0;j<SOpids[i].size();j++) {
            idval[nSOids++]=SOpids[i][j];
        }
        SOpids[i].clear();
    }
#if defined(GASON) || defined(STARON) || defined(BHON)
    typeval.resize(nSOids,0);
    nSOids=0;
    for (Int_t i=1;i<=ngroups;i++) {
        for (Int_t j=0;j<SOtypes[i].size();j++) {
            typeval[nSOids++]=SOtypes[i][j];
        }
        SOtypes[i].clear();
    }
#endif
}

It just so happens that the original array of vectors holds quite a big number of IDs (in our case nSOids is 2718973906), and therefore these copies contribute greatly to bringing the memory requirements of the application beyond what's available on the machine:

### At the beginning of GetSOMasses there's 65 GB of resident memory used:
[0000] [1148.534] [ info] substructureproperties.cxx:3234 Memory report at substructureproperties.cxx:3234 ... Resident: 64.978 [GiB] [...]

### Just before WriteSOCatalog is called, 100 GB are being used:
[0000] [1402.079] [ info] substructureproperties.cxx:3770 Memory report at substructureproperties.cxx:3770 [...] Resident: 99.044 [GiB] [...]

With nSOids == 2718973906 the two copies require ~20 GB of memory on a machine that usually offers ~120 GB of free memory to processes; hence we end up with a crash.
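The quoted figure can be sanity-checked with simple arithmetic. A minimal sketch, assuming a fixed element width for the flat copies (an assumption; the actual widths of the PID and type values depend on the build configuration):

```cpp
#include <cstddef>
#include <cstdint>

// GiB needed for one flat copy of n elements of elem_bytes each.
// The element widths below are illustrative, not read from the VR code.
constexpr double copy_gib(std::uint64_t n, std::size_t elem_bytes) {
    return static_cast<double>(n) * elem_bytes / (1024.0 * 1024.0 * 1024.0);
}

// copy_gib(2718973906ULL, 8) is roughly 20.3 GiB per flat copy with 8-byte
// elements; copy_gib(2718973906ULL, 4) is roughly 10.1 GiB with 4-byte ones.
```

Either way, the flat copies duplicate data that the array-of-vectors already holds, which is what pushes the process over the memory limit.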

This has been a known problem for a while though:

VELOCIraptor-STF/src/io.cxx, lines 1213 to 1214 at 64de17b:

///\todo optimisation memory wise can be implemented by not creating an array
///to store all ids and then copying info from the array of vectors into it.

I'll relabel this as an improvement rather than a bug.

rtobar added the enhancement (New feature or request) label and removed the bug (Something isn't working) label on Mar 26, 2021

rtobar commented Mar 26, 2021

In the original code, each of the vectors in the array-of-vectors was emptied after its contents were copied (but it wasn't shrunk via shrink_to_fit), and the full block of memory needed for the new 1D array was requested up front.

A possible, non-invasive improvement for this situation is to resize the 1D array multiple times as we iterate over the original array-of-vectors, while effectively releasing the memory held by each of those individual vectors by calling shrink_to_fit. This will result in more memory allocations, but should keep the memory requirements in check without having to touch more of the code.
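The idea above can be sketched as follows (names are illustrative, not the actual VELOCIraptor code): grow the flat vector group by group and free each source vector as soon as it has been copied, so peak memory stays close to the size of the data itself rather than twice that.

```cpp
#include <utility>
#include <vector>

// Flatten an array-of-vectors incrementally, releasing each source vector
// right after its contents have been copied into the flat vector.
template <typename T>
std::vector<T> flatten_incrementally(std::vector<std::vector<T>> &groups) {
    std::vector<T> flat;
    for (auto &g : groups) {
        flat.insert(flat.end(), g.begin(), g.end());
        // clear() alone keeps the capacity around; swapping with an empty
        // vector (equivalent in effect to clear() + shrink_to_fit) actually
        // releases the group's memory.
        std::vector<T>().swap(g);
    }
    return flat;
}
```

The trade-off is exactly as described: the flat vector may reallocate several times as it grows, but at no point do both a full-size copy and the full array-of-vectors coexist.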


rtobar commented Jul 5, 2021

Two other ideas that could be quickly implemented:

  • The flattening of ID and type values into their respective vectors doesn't need to happen at the same time. At the moment both are first flattened, and then sequentially written. Instead we could flatten one, write it, clear that memory, then do the same with the other.
  • The new vector of types could be declared with a one-byte integer type (i.e., char instead of int). This way the flat copy of the types array would be a quarter of what it is now.
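Both bullet points can be sketched with a single helper (hypothetical names, not the actual WriteSOCatalog code): a flatten function whose flat element type we choose, so the types can be flattened into 1-byte values, and which is called once per dataset so the ID copy can be freed before the type copy is made.

```cpp
#include <cstdint>
#include <vector>

// Copy an array-of-vectors of T into one flat vector of Flat; picking
// Flat = std::uint8_t for the particle types gives 1 byte per particle
// instead of 4.
template <typename Flat, typename T>
std::vector<Flat> flatten(const std::vector<std::vector<T>> &groups) {
    std::vector<Flat> flat;
    for (const auto &g : groups)
        flat.insert(flat.end(), g.begin(), g.end());
    return flat;
}

// Intended use: flatten and write the IDs first, let that vector go out of
// scope (freeing its memory), and only then flatten the types:
//
//   { auto ids = flatten<long long>(SOpids);       /* write ids */ }
//   { auto types = flatten<std::uint8_t>(SOtypes); /* write types */ }
```

This way at most one flat copy exists at a time, and the larger of the two (the IDs) is never resident together with the type copy.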

rtobar added a commit that referenced this issue Jul 12, 2021
Spherical overdensity information (Particle IDs and types) are collected
by two different routines: GetInclusiveMasses and GetSOMasses. Both
routines use the following technique: first, for "ngroup" groups, a C
array-of-vectors (AOV) of "ngroup + 1" elements was allocated via "new"
(which was eventually delete[]'ed). Vectors for each group are filled
independently as groups are processed; the loop over groups is
1-indexed, and indexing into these AOVs happens using the
same iteration variable, meaning that their 0 element is skipped.
Finally, these AOVs are passed to WriteSOCatalog for writing.
WriteSOCatalog is aware of the 1-based indexing, and additionally it
flattens the AOVs into a single vector (thus duplicating their memory
requirement) to finally write the data into the output HDF5 file.

This commit originally aimed to reduce the memory overhead of the final
writing of data into the HDF5 file (see #71). The main change required
to achieve this is to perform the flattening of data at separate times,
such that particle IDs and types are not flattened at the same time,
but one after the other, with memory from the first flattening operation
being released before the second flattening happens, thus reducing the
maximum memory requirement of the application. This goal was achieved.
However, while performing these changes two things became clear:
firstly, that using a vector-of-vectors (VOV) was a better interface
than an AOV (due to automatic memory management), and secondly that the
1-based indexing of the original AOVs introduced much complexity in the
code. The scope of these changes was then broadened to cover these two
extra changes, and therefore this commit grew considerably in size. In
particular, the 0-indexing of the VOVs allowed us to more easily use std
algorithms that clarify the intent in certain places of the code.

There are other minor changes that have also been included in this
commit, mostly to reduce variable scopes, reduce code duplication, and
such. Assertions have also been sprinkled here and there to add further
assurance that the code is working as expected.

As an unintended side-effect, this commit also fixed the
wrongly-calculated Offset dataset, which was off by one index in the
original values. This problem was reported in #108, and seems to have
always been there.
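The VOV + 0-indexing style the commit describes lends itself to standard algorithms. An illustrative example (not the actual diff): counting the total number of collected IDs across all groups without a hand-written double loop.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Total number of IDs held by a 0-indexed vector-of-vectors.
std::size_t total_ids(const std::vector<std::vector<long long>> &vov) {
    return std::accumulate(vov.begin(), vov.end(), std::size_t{0},
        [](std::size_t acc, const std::vector<long long> &v) {
            return acc + v.size();
        });
}
```

With the old 1-indexed array-of-vectors, the same computation would have had to skip element 0 explicitly, which is exactly the kind of complexity the commit removes.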

rtobar commented Jul 28, 2021

The first bullet point mentioned in the last comment has now been merged into the master branch, as its implementation also fixed the problem described in #108. With this change the maximum RSS has decreased, but extra copies are still required.

Although this is progress, a better solution is of course to avoid the copies altogether, which should be possible if PIDs and types were collected into a single vector from the start. This could be investigated in the future.
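The single-vector idea could look roughly like this (a hypothetical sketch, not existing VR code): append every PID to one contiguous vector as groups are processed and record where each group starts, so writing needs no flattening step at all.

```cpp
#include <cstddef>
#include <vector>

// Collect SO particle IDs for all groups into one contiguous vector,
// remembering per-group start offsets instead of keeping one vector
// per group. Writing can then use pids directly.
struct SOCollector {
    std::vector<long long> pids;     // all particle IDs, back to back
    std::vector<std::size_t> offset; // offset[i] = start of group i in pids

    void start_group() { offset.push_back(pids.size()); }
    void add(long long pid) { pids.push_back(pid); }
};
```

The offsets double as the Offset dataset that the catalog already writes, so no per-group sizes need to be reconstructed at write time.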
