Krowkee integration #2

Merged
merged 17 commits into from
Mar 31, 2023


bwpriest
Member

No description provided.

<< ", " << world.routing_protocol() << ", " << range_size
<< ", " << vertex_count << ", "
<< local_edge_count * world.size() << ", "
<< compaction_threshold << ", " << promotion_threshold;
Collaborator


Can you print the seed after the promotion threshold to match the headers?

Member Author


Oops! fixed.

@bwpriest
Member Author

Does the rmat_edge_generator have a large memory footprint? I am implementing an RMAT version of the embedding test and it is throwing OOM errors, even at significantly lower local_edge_count scales than the corresponding uniform benchmark.

For comparison, I tested embedding a graph with 2^26 vertices and 2^21 * 36 edges in the uniform case. However, the RMAT version appears to top out at 2^26 vertices and 2^14 * 36 edges. Adding more edges causes OOM errors. I am confident that the issue is not in krowkee due to the aforementioned uniform test. krowkee should be using the same amount of memory in each case, as it scales only with vertex_count.

In both cases I am storing the edges after generation in a std::vector<std::pair<std::uint64_t, std::uint64_t>> buffer, so it is not simply a matter of there being too many edges either.

Any ideas?
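
For a sense of scale, here is a rough sketch of the buffer arithmetic. It assumes the buffer holds pairs of 64-bit vertex IDs and that the edge counts quoted above are the number of edges held in a single buffer (the thread does not say whether they are per rank or global). `edge_buffer_bytes` is a hypothetical helper, not part of ygm-bench.

```cpp
#include <cstdint>
#include <utility>

// One edge as a pair of 64-bit vertex IDs (assumed layout).
using edge_t = std::pair<std::uint64_t, std::uint64_t>;

// Hypothetical helper: bytes needed to store `edge_count` edges contiguously.
constexpr std::uint64_t edge_buffer_bytes(std::uint64_t edge_count) {
  return edge_count * sizeof(edge_t);
}

// Edge counts quoted in the comment above.
constexpr std::uint64_t uniform_edges = (std::uint64_t{1} << 21) * 36;  // 75,497,472
constexpr std::uint64_t rmat_edges    = (std::uint64_t{1} << 14) * 36;  // 589,824
```

Where `edge_t` occupies 16 bytes, `edge_buffer_bytes(uniform_edges)` is roughly 1.1 GiB and `edge_buffer_bytes(rmat_edges)` is 9 MiB, which is consistent with the point that edge storage alone does not explain the failure.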

@steiltre
Collaborator

The RMAT generator shouldn't be using much memory. It's essentially a handful of doubles, ints, and bools. I have been able to use the RMAT generator for experiments that amount to 2^26 vertices per compute node and 10M edges per MPI rank.

Are you using the rmat_edge_generator or the distributed_rmat_edge_generator? The distributed version handles making sure each rank's random number generator is given a unique seed and takes a global edge count. Giving the non-distributed version a global edge count would generate more edges than expected, which could be causing your OOM.
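
To illustrate the difference, here is a minimal sketch of how a global edge count might be divided among ranks. `local_edge_share` is a hypothetical helper; the actual split used inside distributed_rmat_edge_generator may differ.

```cpp
#include <cstdint>

// Hypothetical helper: number of edges rank `rank` should generate so that
// all `world_size` ranks together generate exactly `global_edges`.
inline std::uint64_t local_edge_share(std::uint64_t global_edges, int rank,
                                      int world_size) {
  const std::uint64_t size = static_cast<std::uint64_t>(world_size);
  const std::uint64_t base = global_edges / size;
  const std::uint64_t remainder = global_edges % size;
  // The first `remainder` ranks take one extra edge each.
  return base + (static_cast<std::uint64_t>(rank) < remainder ? 1 : 0);
}

// Total edges generated if every rank is mistakenly handed the global count.
inline std::uint64_t total_if_unsplit(std::uint64_t global_edges,
                                      int world_size) {
  return global_edges * static_cast<std::uint64_t>(world_size);
}
```

Passing the global count straight to a per-rank generator corresponds to `total_if_unsplit`, which overshoots the intended total by a factor of the world size.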

@bwpriest
Member Author

I am using the distributed_rmat_edge_generator; that was a typo on my part in the comment. I've been picking the experiment apart, and there is definitely something that I am not understanding.

It appears that the distributed_rmat_edge_generator returns the same edges to each rank. I instrumented histo_rmat_ygm to check the global number of nonzeros as well as the maximum and minimum vertex indices on each rank. The global number of nonzeros equals the number of unique local vertices on each rank, which is extremely unlikely. Moreover, the minimum and maximum vertex IDs agree across ranks, which reinforces my theory.

I am still not sure why this would cause OOM issues when interfacing with krowkee, but in any case it looks like a bug.

@bwpriest
Member Author

I've illustrated the issue with changes in PR #3

@steiltre
Collaborator

Shoot. You're right. I was using trial as my seed for the distributed_rmat_edge_generator, but the seed given to the local rmat_edge_generator is determined by multiplying this value by world.rank()+1.

In the case that trial==0, this would cause all ranks to generate the same edges.
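
A minimal sketch of the failure mode, using the seed derivation described above. `derive_rank_seed` and `first_draw` are hypothetical names, and `std::mt19937_64` is only a stand-in for whatever random number generator rmat_edge_generator uses internally.

```cpp
#include <cstdint>
#include <random>

// Seed derivation as described in the thread: user seed times (rank + 1).
inline std::uint64_t derive_rank_seed(std::uint64_t seed, int rank) {
  return seed * (static_cast<std::uint64_t>(rank) + 1);
}

// First value drawn by a generator seeded for the given rank.
// std::mt19937_64 is an assumed stand-in RNG.
inline std::uint64_t first_draw(std::uint64_t seed, int rank) {
  std::mt19937_64 rng(derive_rank_seed(seed, rank));
  return rng();
}
```

With `seed == 0` the product is 0 for every rank, so every rank seeds its generator identically and draws the same sequence. Any nonzero seed yields a different product on each rank.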

@bwpriest
Member Author

So is the fix as simple as changing seed * (world.rank() + 1) to (seed + 1) * (world.rank() + 1) in distributed_rmat_edge_generator?

@steiltre
Collaborator

steiltre commented Mar 23, 2022

Yeah. I just played with the code from PR #3. In the first trial the local and global stats agree, but in the second they do not.

Adding a bigger number is 'better' so each trial ends up with more distinct edges.

Thanks for pointing out this bug.
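
One way to read the remark about adding a bigger number: with (seed + 1) * (rank + 1), different (trial, rank) pairs can still land on the same product, and a larger offset makes that less likely. The sketch below counts distinct derived seeds under that reading. `distinct_seeds` and its `offset` parameter are hypothetical and not part of ygm-bench.

```cpp
#include <cstddef>
#include <cstdint>
#include <set>

// Count the distinct values of (trial + offset) * (rank + 1) over all
// trials in [0, trials) and ranks in [0, ranks).
inline std::size_t distinct_seeds(std::uint64_t offset, int trials,
                                  int ranks) {
  std::set<std::uint64_t> seen;
  for (int t = 0; t < trials; ++t) {
    for (int r = 0; r < ranks; ++r) {
      seen.insert((static_cast<std::uint64_t>(t) + offset) *
                  (static_cast<std::uint64_t>(r) + 1));
    }
  }
  return seen.size();
}
```

For two trials on two ranks, an offset of 1 gives the seeds 1, 2, 2, 4, so trial 0 on rank 1 repeats the stream of trial 1 on rank 0. An offset of 1000 gives 1000, 2000, 1001, 2002, which are all distinct.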

@bwpriest
Member Author

There is suddenly a new compile issue relating to the recent commit to the feature/routing branch.

/g/g13/priest2/workspace/krowkee/repos/ygm-bench/src/histo_rmat_ygm.cpp: In function 'int main(int, char**)':
/g/g13/priest2/workspace/krowkee/repos/ygm-bench/src/histo_rmat_ygm.cpp:137:52: error: no match for 'operator-' (operand types are 'const ygm::detail::stats_tracker' and 'ygm::detail::stats_tracker')
  137 |     auto experiment_stats = world.stats_snapshot() - begin_stats;
      |                             ~~~~~~~~~~~~~~~~~~~~~~ ^ ~~~~~~~~~~~
      |                                                 |    |
      |                                                 |    ygm::detail::stats_tracker
      |                                                 const ygm::detail::stats_tracker

I commented out the stats collection lines to get histo_rmat_ygm to compile. This compile error can probably be resolved with a few more operator overloads.
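
A minimal sketch of the kind of overload that would accept a const left operand. `stats_sketch` and its fields are hypothetical stand-ins; the real ygm::detail::stats_tracker has its own members and may need a different definition.

```cpp
#include <cstdint>

// Hypothetical stand-in for a stats snapshot; field names are assumptions.
struct stats_sketch {
  std::uint64_t messages_sent;
  std::uint64_t bytes_sent;
};

// Non-member operator- taking both operands by const reference, so it
// matches whether the operands are const or non-const, lvalue or temporary.
inline stats_sketch operator-(const stats_sketch& lhs,
                              const stats_sketch& rhs) {
  stats_sketch diff{lhs.messages_sent - rhs.messages_sent,
                    lhs.bytes_sent - rhs.bytes_sent};
  return diff;
}
```

With an overload of this shape, an expression like `snapshot() - begin_stats` compiles when `snapshot()` returns a const object and `begin_stats` is non-const.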

However, even with the world.stats_snapshot() lines commented out, histo_rmat_ygm now segfaults at runtime. I am unsure as to why. I've updated PR #3 with the current version that compiles but segfaults.

I'm going to have to step away from this for now, but I don't think that I'll be able to get the rmat krowkee test working until we've resolved it. Let me know if I can be of help.

@steiltre
Collaborator

Sorry about that. I'm changing some stuff to get more stats of interest for looking at YGM performance. I'll get everything in a coherent state.

@bwpriest
Member Author

I went ahead and moved all of the krowkee test chassis into a header, so that the only difference between the uniform and rmat tests is the functor that generates the edge list. However, the RMAT test is still breaking with OOM errors. Now, however, it only breaks when the RMAT distribution is skewed. The benchmark runs correctly when all parameters are set to 0.25.

As a consequence, we have to conclude that there is a problem somewhere in the dsk.async_update() call. I'm not sure if it is on the krowkee or ygm side, but I am inclined to believe it is on the krowkee side. More testing is clearly required.
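
For context on the skew mentioned above, here is a minimal sketch of textbook R-MAT edge sampling. It is not the ygm-bench rmat_edge_generator; `rmat_sketch` and its members are hypothetical, and `std::mt19937_64` is an assumed RNG.

```cpp
#include <cstdint>
#include <random>
#include <utility>

// Hypothetical stand-alone R-MAT sampler over 2^scale vertices.
// a, b, c are quadrant probabilities; d is implicitly 1 - a - b - c.
struct rmat_sketch {
  std::uint32_t scale;
  double a, b, c;
  std::mt19937_64 rng;

  std::pair<std::uint64_t, std::uint64_t> generate_edge() {
    std::uniform_real_distribution<double> unif(0.0, 1.0);
    std::uint64_t src = 0, dst = 0;
    // Each level picks one quadrant of the adjacency matrix, which fixes
    // one bit of the source ID and one bit of the destination ID.
    for (std::uint32_t level = 0; level < scale; ++level) {
      const double r = unif(rng);
      src <<= 1;
      dst <<= 1;
      if (r < a) {
        // top-left quadrant: both bits stay 0
      } else if (r < a + b) {
        dst |= 1;  // top-right quadrant
      } else if (r < a + b + c) {
        src |= 1;  // bottom-left quadrant
      } else {
        src |= 1;  // bottom-right quadrant
        dst |= 1;
      }
    }
    return {src, dst};
  }
};
```

With all four probabilities at 0.25 every bit is a fair coin flip, so endpoints are uniform over the vertex range. As `a` grows, edges concentrate on low vertex IDs, and in the limit `a == 1` every edge is (0, 0).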

@bwpriest
Member Author

bwpriest commented Oct 31, 2022

I finally revisited this, and concluded that there was an error in the way that I was handling distributed_rmat_edge_generator. I rolled my own edge generation class using rmat_edge_generator (and added a public interface to rmat_edge_generator::generate_edge()). I also collapsed src/embed_ygm.cpp and src/embed_ygm_rmat.cpp into a single file. I believe that everything should work now. @steiltre

@bwpriest
Member Author

No, I am still getting OOM errors. I am still unsure why.

@bwpriest
Member Author

bwpriest commented Dec 2, 2022

> No, I am still getting OOM errors. I am still unsure why.

@steiltre I fixed this issue. It was a simple (and incredibly stupid) error on my part. The embed_ygm and embed_rmat_ygm workflows appear to work just fine now. I cannot run the whole benchmark chassis myself, because some of the directories are hard-coded to locations where I do not have permissions. However, if you feel like checking over my script additions (scripts/run_embed_ygm.sh and scripts/run_embed_rmat_ygm.sh) and checking that everything works when you run the whole benchmark script, I would appreciate it. Please let me know if there is anything else that I can do to help.

@steiltre steiltre merged commit 155e443 into LLNL:master Mar 31, 2023
@bwpriest bwpriest deleted the feature/krowkee branch May 10, 2023 18:15