Krowkee integration #2

bwpriest · 2022-03-21T19:26:21Z

No description provided.

steiltre · 2022-03-22T23:13:02Z

src/embed_ygm.cpp

+                << ", " << world.routing_protocol() << ", " << range_size
+                << ", " << vertex_count << ", "
+                << local_edge_count * world.size() << ", "
+                << compaction_threshold << ", " << promotion_threshold;


Can you print the seed after the promotion threshold to match the headers?

Oops! fixed.

bwpriest · 2022-03-23T19:30:11Z

Does the rmat_edge_generator have a large memory footprint? I am implementing an RMAT version of the embedding test and it is throwing OOM errors, even at significantly lower local_edge_count scales than the corresponding uniform benchmark.

For comparison, I tested embedding a graph with 2^26 vertices and 2^21 * 36 edges in the uniform case. However, the RMAT version appears to top out at 2^26 vertices and 2^14 * 36 edges. Adding more edges causes OOM errors. I am confident that the issue is not in krowkee due to the aforementioned uniform test. krowkee should be using the same amount of memory in each case, as it scales only with vertex_count.

In both cases I am storing the edges after generation in a std::vector<std::pair<std::uint64_t, std::uint64_t>> buffer, so it is not simply a matter of there being too many edges either.
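As a rough sanity check on the claim that the edge buffer is not the problem, the arithmetic below estimates the buffer size for the two edge counts quoted above. It assumes each edge is stored as a (source, destination) pair of std::uint64_t and ignores allocator and capacity overhead; `edge_buffer_bytes` is a hypothetical helper, not part of ygm-bench.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Hypothetical helper: bytes needed to hold `edge_count` edges, assuming each
// edge is a (source, destination) pair of 64-bit vertex IDs.
constexpr std::uint64_t edge_buffer_bytes(std::uint64_t edge_count) {
  return edge_count * sizeof(std::pair<std::uint64_t, std::uint64_t>);
}

// Edge counts quoted in the comment above.
constexpr std::uint64_t uniform_edges = (std::uint64_t{1} << 21) * 36;
constexpr std::uint64_t rmat_edges = (std::uint64_t{1} << 14) * 36;
```

Under these assumptions the working uniform case holds about 1.125 GiB of edges and the failing RMAT case about 9 MiB, however those counts are divided across ranks, so buffer size alone would not explain the OOM.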

Any ideas?

steiltre · 2022-03-23T20:48:34Z

The RMAT generator shouldn't be using much memory. It's essentially a handful of doubles, ints, and bools. I have been able to use the RMAT for experiments that amount to 2^26 vertices per compute node and 10M edges per MPI rank.

Are you using the rmat_edge_generator or the distributed_rmat_edge_generator? The distributed version handles making sure each rank's random number generator is given a unique seed and takes a global edge count. Giving the non-distributed version a global edge count would generate more edges than expected, which could be causing your OOM.
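To make the global versus local distinction concrete, here is a minimal sketch of how a global edge count might be divided among ranks, and what happens when every rank is instead handed the full global count. `local_share`, `total_with_split`, and `total_without_split` are hypothetical helpers; the thread does not show how ygm actually divides the count.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: this rank's share of a global edge count, spreading the
// remainder over the lowest-numbered ranks. Not taken from ygm.
std::uint64_t local_share(std::uint64_t global_edges, int rank, int world_size) {
  const auto size = static_cast<std::uint64_t>(world_size);
  const auto r = static_cast<std::uint64_t>(rank);
  return global_edges / size + (r < global_edges % size ? 1 : 0);
}

// Total edges generated when each rank generates only its own share.
std::uint64_t total_with_split(std::uint64_t global_edges, int world_size) {
  std::uint64_t total = 0;
  for (int rank = 0; rank < world_size; ++rank) {
    total += local_share(global_edges, rank, world_size);
  }
  return total;
}

// Total edges generated when every rank is mistakenly given the global count.
std::uint64_t total_without_split(std::uint64_t global_edges, int world_size) {
  return global_edges * static_cast<std::uint64_t>(world_size);
}
```

With 36 ranks, passing a global count to a per-rank generator would produce 36 times as many edges as intended, which is the failure mode suggested above.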

bwpriest · 2022-03-23T22:12:59Z

I am using the distributed_rmat_edge_generator; that was a typo on my part in the comment. I've been picking the experiment apart, and there is definitely something that I am not understanding.

It appears that the distributed_rmat_edge_generator returns the same edges to each rank. I instrumented histo_rmat_ygm to check the global number of nonzeros as well as the maximum and minimum vertex indices on each rank. The global number of nonzeros equals the number of unique local vertices on each rank, which would be extremely unlikely if the ranks were generating different edges. Moreover, the minimum and maximum vertex IDs agree across ranks, which reinforces my theory.

I am still not sure why this would cause OOM issues when interfacing with krowkee, but in any case it looks like a bug.
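The kind of check described in this comment can be sketched in a single process by treating each rank's edge list as one entry in a vector. This is only a stand-in: the real instrumentation in histo_rmat_ygm would gather these values across MPI ranks, and `edge_summary`, `summarize`, and `all_ranks_match` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <utility>
#include <vector>

using edge = std::pair<std::uint64_t, std::uint64_t>;

// Per-rank statistics: distinct vertex count plus smallest and largest vertex ID.
struct edge_summary {
  std::size_t unique_vertices;
  std::uint64_t min_vertex;
  std::uint64_t max_vertex;
};

bool operator==(const edge_summary& x, const edge_summary& y) {
  return x.unique_vertices == y.unique_vertices &&
         x.min_vertex == y.min_vertex && x.max_vertex == y.max_vertex;
}

// Summarize one rank's edge list. Assumes `edges` is non-empty.
edge_summary summarize(const std::vector<edge>& edges) {
  std::set<std::uint64_t> vertices;
  for (const auto& e : edges) {
    vertices.insert(e.first);
    vertices.insert(e.second);
  }
  return {vertices.size(), *vertices.begin(), *vertices.rbegin()};
}

// True when every rank's summary matches the first rank's, which is suspicious
// for generators that are supposed to be independently seeded.
bool all_ranks_match(const std::vector<std::vector<edge>>& per_rank_edges) {
  if (per_rank_edges.empty()) return true;
  const edge_summary first = summarize(per_rank_edges.front());
  return std::all_of(per_rank_edges.begin(), per_rank_edges.end(),
                     [&first](const std::vector<edge>& edges) {
                       return summarize(edges) == first;
                     });
}
```

Matching summaries do not prove the streams are identical, but for large random edge lists an exact match on all three values across every rank is strong evidence of a shared seed.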

bwpriest · 2022-03-23T22:29:39Z

I've illustrated the issue with changes in PR #3

steiltre · 2022-03-23T22:29:54Z

Shoot. You're right. I was using trial as my seed for the distributed_rmat_edge_generator, but the seed given to the local rmat_edge_generator is determined by multiplying this value by world.rank()+1.

In the case that trial==0, this would cause all ranks to generate the same edges.
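A minimal sketch of the seed derivation described here, assuming the per-rank seed is exactly seed * (world.rank() + 1). `rank_seed` and `distinct_rank_seeds` are hypothetical helpers written for illustration, not code from ygm.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>

// Per-rank seed as described in the comment above: seed * (rank + 1).
std::uint64_t rank_seed(std::uint64_t seed, int rank) {
  return seed * (static_cast<std::uint64_t>(rank) + 1);
}

// Number of distinct per-rank seeds across a world of `world_size` ranks.
std::size_t distinct_rank_seeds(std::uint64_t seed, int world_size) {
  std::set<std::uint64_t> seeds;
  for (int rank = 0; rank < world_size; ++rank) {
    seeds.insert(rank_seed(seed, rank));
  }
  return seeds.size();
}
```

With a seed of 0 every rank ends up with seed 0, so all ranks draw the same edge stream; any nonzero seed gives each rank a different value.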

bwpriest · 2022-03-23T22:39:57Z

So is the fix as simple as changing seed * (world.rank() + 1) to (seed + 1) * (world.rank() + 1) in `distributed_rmat_edge_generator`?

steiltre · 2022-03-23T22:50:08Z

Yeah. I just played with the code from PR #3. In the first trial the local and global stats agree, but in the second they do not.

Adding a bigger number is 'better' so each trial ends up with more distinct edges.
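One reading of "adding a bigger number" is to replace the + 1 in (seed + 1) * (world.rank() + 1) with a larger offset. The sketch below, using hypothetical helpers `offset_rank_seed` and `distinct_seeds`, counts how many distinct per-rank seeds are produced across all trials for a given offset.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>

// Per-rank seed with an additive offset: (trial + offset) * (rank + 1).
std::uint64_t offset_rank_seed(std::uint64_t trial, int rank, std::uint64_t offset) {
  return (trial + offset) * (static_cast<std::uint64_t>(rank) + 1);
}

// Distinct seeds seen over `trials` trials and `world_size` ranks.
std::size_t distinct_seeds(std::uint64_t offset, std::uint64_t trials, int world_size) {
  std::set<std::uint64_t> seeds;
  for (std::uint64_t trial = 0; trial < trials; ++trial) {
    for (int rank = 0; rank < world_size; ++rank) {
      seeds.insert(offset_rank_seed(trial, rank, offset));
    }
  }
  return seeds.size();
}
```

With an offset of 1, two trials on four ranks yield only 6 distinct seeds out of 8, because for example trial 0 on rank 1 and trial 1 on rank 0 both receive seed 2. An offset of 1000 gives 8 distinct seeds for the same configuration, though collisions can still appear for large enough trial or rank counts.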

Thanks for pointing out this bug.

bwpriest · 2022-03-23T23:00:38Z

There is suddenly a new compile issue relating to the recent commit to the feature/routing branch.

/g/g13/priest2/workspace/krowkee/repos/ygm-bench/src/histo_rmat_ygm.cpp: In function 'int main(int, char**)':
/g/g13/priest2/workspace/krowkee/repos/ygm-bench/src/histo_rmat_ygm.cpp:137:52: error: no match for 'operator-' (operand types are 'const ygm::detail::stats_tracker' and 'ygm::detail::stats_tracker')
  137 |     auto experiment_stats = world.stats_snapshot() - begin_stats;
      |                             ~~~~~~~~~~~~~~~~~~~~~~ ^ ~~~~~~~~~~~
      |                                                 |    |
      |                                                 |    ygm::detail::stats_tracker
      |                                                 const ygm::detail::stats_tracker

I commented out the stats collection lines to get histo_rmat_ygm to compile. This compile error can probably be resolved with a few more operator overloads.
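The error reports a const left operand and a non-const right operand. One common way to support that combination is a free operator- taking both arguments by const reference. The sketch below uses a hypothetical `counter_snapshot` struct as a stand-in, since the members of ygm::detail::stats_tracker are not shown in this thread.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a stats object; the real stats_tracker members
// are not shown in this thread.
struct counter_snapshot {
  std::uint64_t messages_sent;
  std::uint64_t bytes_sent;
};

// Free function taking both operands by const reference, so it accepts a
// const temporary on the left and a non-const lvalue on the right.
counter_snapshot operator-(const counter_snapshot& lhs, const counter_snapshot& rhs) {
  return {lhs.messages_sent - rhs.messages_sent, lhs.bytes_sent - rhs.bytes_sent};
}

// Mimics an accessor that returns a const object, like the left operand in
// the compiler error above.
const counter_snapshot current_snapshot() { return {10, 4096}; }
```

A member operator- declared without a trailing const qualifier would be rejected for a const left operand, which is one plausible cause of an error like this; the actual cause in the feature/routing branch is not confirmed here.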

However, even when I comment out the world.stats_snapshot() lines, histo_rmat_ygm now segfaults at runtime. I am not sure why. I've updated PR #3 with the current version that compiles but segfaults.

I'm going to have to step away from this for now, but I don't think that I'll be able to get the rmat krowkee test working until we've resolved it. Let me know if I can be of help.

steiltre · 2022-03-23T23:04:36Z

Sorry about that. I'm changing some stuff to get more stats of interest for looking at YGM performance. I'll get everything in a coherent state.

bwpriest · 2022-03-31T23:52:51Z

I went ahead and moved all of the krowkee test chassis into a header, so that the only difference between the uniform and rmat tests is the functor that generates the edge list. However, the RMAT test is still breaking with OOM errors, though now only when the RMAT distribution is skewed. The benchmark runs correctly when all parameters are set to 0.25.

As a consequence, we have to conclude that there is a problem somewhere in the dsk.async_update() call. I'm not sure if it is on the krowkee or ygm side, but I am inclined to believe it is on the krowkee side. More testing is clearly required.
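The header-based design described in this comment can be sketched without ygm or krowkee: a chassis function templated on an edge-generating functor, with uniform and RMAT generators plugged in. All names here (`uniform_edges`, `rmat_edges`, `max_out_degree`) are hypothetical, the RMAT functor is a textbook recursive-quadrant generator rather than ygm's rmat_edge_generator, and the "experiment" is just a degree count. It illustrates why setting all parameters to 0.25 behaves like the uniform case while skewed parameters concentrate updates on a few vertices; it does not diagnose the OOM.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <utility>
#include <vector>

using edge = std::pair<std::uint64_t, std::uint64_t>;

// Hypothetical uniform generator: endpoints drawn uniformly from [0, 2^scale).
struct uniform_edges {
  int scale;
  std::vector<edge> operator()(std::size_t count, std::uint64_t seed) const {
    std::mt19937_64 rng(seed);
    std::uniform_int_distribution<std::uint64_t> vertex(
        0, (std::uint64_t{1} << scale) - 1);
    std::vector<edge> edges;
    edges.reserve(count);
    for (std::size_t i = 0; i < count; ++i) {
      const std::uint64_t src = vertex(rng);
      const std::uint64_t dst = vertex(rng);
      edges.emplace_back(src, dst);
    }
    return edges;
  }
};

// Hypothetical RMAT generator: one quadrant choice per bit of the vertex IDs.
// Quadrant probabilities are a, b, c and d = 1 - a - b - c.
struct rmat_edges {
  int scale;
  double a, b, c;
  std::vector<edge> operator()(std::size_t count, std::uint64_t seed) const {
    std::mt19937_64 rng(seed);
    std::uniform_real_distribution<double> unit(0.0, 1.0);
    std::vector<edge> edges;
    edges.reserve(count);
    for (std::size_t i = 0; i < count; ++i) {
      std::uint64_t src = 0, dst = 0;
      for (int bit = 0; bit < scale; ++bit) {
        const double p = unit(rng);
        const bool lower = p >= a + b;  // quadrant c or d sets the source bit
        const bool right = (p >= a && p < a + b) || p >= a + b + c;  // b or d
        src = (src << 1) | (lower ? 1u : 0u);
        dst = (dst << 1) | (right ? 1u : 0u);
      }
      edges.emplace_back(src, dst);
    }
    return edges;
  }
};

// Chassis templated on the generator: everything except edge generation is
// shared. Returns the largest number of edges leaving any single vertex.
template <typename EdgeGenerator>
std::size_t max_out_degree(const EdgeGenerator& generate, int scale,
                           std::size_t count, std::uint64_t seed) {
  std::vector<std::size_t> degree(std::size_t{1} << scale, 0);
  for (const auto& e : generate(count, seed)) ++degree[e.first];
  return *std::max_element(degree.begin(), degree.end());
}
```

Calling `max_out_degree` with `rmat_edges{10, 0.25, 0.25, 0.25}` gives a maximum degree comparable to `uniform_edges{10}`, while a skewed choice such as `rmat_edges{10, 0.57, 0.19, 0.19}` sends a much larger share of the edges to a handful of low-numbered vertices.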

bwpriest · 2022-10-31T17:23:20Z

I finally revisited this, and concluded that there was an error in the way that I was handling distributed_rmat_edge_generator. I rolled my own edge generation class using rmat_edge_generator (and added a public interface to rmat_edge_generator::generate_edge()). I also collapsed src/embed_ygm.cpp and src/embed_ygm_rmat.cpp into a single file. I believe that everything should work now. @steiltre

bwpriest · 2022-10-31T17:42:45Z

No, I am still getting OOM errors. I am still unsure why.

bwpriest · 2022-12-02T00:08:40Z

> No, I am still getting OOM errors. I am still unsure why.

@steiltre I fixed this issue. It was a simple (and incredibly stupid) error on my part. The embed_ygm and embed_rmat_ygm workflows appear to work just fine now. I cannot run the whole benchmark chassis myself, because some of the directories are hard-coded to locations where I do not have permissions. However, if you feel like checking over my script additions (scripts/run_embed_ygm.sh and scripts/run_embed_rmat_ygm.sh) and checking that everything works when you run the whole benchmark script, I would appreciate it. Please let me know if there is anything else that I can do to help.

bwpriest added 2 commits March 21, 2022 12:22

Added krowkee as a dependency and creates simplistic embedding benchmark experiment.

f5ffb91

integrated embedding experiment into ygm-bench workflow.

0d5771e

bwpriest mentioned this pull request Mar 21, 2022

Should krowkee dependency and benchmark scripts be optional? #1

Closed

steiltre reviewed Mar 22, 2022

Added seed to printout

e333a0b

Moved sketch creation inside of the experiment loop of embed_ygm so that each iteration starts fresh.

7eece4d

steiltre and others added 9 commits March 30, 2022 09:59

Updates embed_ygm.cpp to use new stats tracking for YGM (#1)

be5df9b

Merge branch 'master' into feature/krowkee

c431892

Incorporated upstream changes into embed_ygm and changed edge count parameter from log to exact scale.

334e43f

Moved some cli boilerplate into a separate header file for easier maintenance.

f910235

broke most of krowkee test chassis into a header whose main function is templated upon a functor providing a vector of edge updates.

55054d3

...

10e3a1c

Added world as a parameter to the krowkee edge_generator functor interface.

694fdec

Moved krowkee test main chassis into header, templated on edge generator functor.

9ab0853

Added rmat version of krowkee test. Currently fixing distributed_rmat_generator to be very uniform, as skewed distributions cause out-of-memory errors for unknown reasons.

7e0e4f2

bwpriest added 3 commits April 4, 2022 12:31

cleaned up some comments

877f340

removed src/embed_rmat_ygm.cpp

4297a01

Folded rmat and uniform edge distributions into single executable. Appear to have solved the OOM errors.

e6a9c62

Fixed the monumentally stupid bug that was causing the OOM issues on large graphs.

e1331cf

steiltre merged commit 155e443 into LLNL:master Mar 31, 2023

bwpriest deleted the feature/krowkee branch May 10, 2023 18:15
