Add test case illustrating slow planning #16
Closed
We have found an issue where certain conditions in a flow can cause the planning phase to take a very long time. It is related to flows that use a lot of CoGroups, but there may be other conditions that trigger it. The bulk of the time is spent in calls to methods on ElementGraphs and ElementGraph that find the shortest path between pairs of nodes. These calls are made by HadoopPlanner, LocalPlanner, and FlowPlanner. As we add more CoGroups, it looks like we make many more calls to KShortestPaths.getPaths, and each call takes much longer.
We have a flow that we are building dynamically for a client, based on user configuration. They have many tables (about 50, I think), and unfortunately the flow as it's currently configured results in a lot of joins between those tables. In some cases the same two tables are joined together several times using different key fields. We started the job almost 48 hours ago, and it is still in the planning phase. We can probably modify our code that translates user configuration into a Flow so that we reduce the number of joins, but it still seems like it is taking far longer than it should to build this flow.
This test case should illustrate the problem. I'm just creating n source taps, n sink taps, and n pipes. For each i > 1, pipe i is joined onto pipe i - 1. I timed the calls to connect() for various values of n. Those timings are listed below. Note that the flow we're actually trying to run is more complex than this one, as it isn't simply chaining together n tables. It seems to actually scale worse than this contrived example.
I have an idea for a change that might speed it up, but I'm not sure if it would be acceptable. If you can get away with just one of the shortest paths instead of all the shortest paths between a pair of nodes, you should be able to use JGraphT's FloydWarshallShortestPaths class. That uses a dynamic programming algorithm to find a shortest path between every pair of n nodes in O(n^3) time and stores them in a table, so each later lookup is cheap. I have experimented with this locally, and at least for the flow we're having trouble with, it seems to speed things up dramatically. I was able to run the Floyd-Warshall algorithm on the graph for this flow in less than a minute. I have been running an unpatched Cascading 2.2 job on the same flow, and it has been sitting in the planning phase for almost 48 hours now.
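To make the idea concrete, here is a minimal plain-Java sketch of Floyd-Warshall with path reconstruction. It is not the JGraphT class or anything from Cascading: the class name `AllPairsShortestPaths`, the integer node ids, and the unit edge weights are all assumptions made for illustration. It shows the trade-off described above: one O(n^3) pass fills a table, and after that a single shortest path per pair can be read back without another search.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration class; not part of Cascading or JGraphT.
public class AllPairsShortestPaths {
    // Large sentinel for "unreachable"; halved so INF + INF does not overflow.
    static final int INF = Integer.MAX_VALUE / 2;

    private final int[][] dist; // dist[i][j] = length of a shortest path i -> j
    private final int[][] next; // next[i][j] = node after i on that path, or -1

    // Nodes are 0..n-1; each edge is {from, to} with an assumed weight of 1.
    public AllPairsShortestPaths(int n, int[][] edges) {
        dist = new int[n][n];
        next = new int[n][n];
        for (int i = 0; i < n; i++) {
            Arrays.fill(dist[i], INF);
            Arrays.fill(next[i], -1);
            dist[i][i] = 0;
            next[i][i] = i;
        }
        for (int[] e : edges) {
            if (e[0] == e[1]) continue; // ignore self-loops
            dist[e[0]][e[1]] = 1;
            next[e[0]][e[1]] = e[1];
        }
        // Dynamic programming step: allow node k as an intermediate hop.
        for (int k = 0; k < n; k++) {
            for (int i = 0; i < n; i++) {
                for (int j = 0; j < n; j++) {
                    if (dist[i][k] + dist[k][j] < dist[i][j]) {
                        dist[i][j] = dist[i][k] + dist[k][j];
                        next[i][j] = next[i][k];
                    }
                }
            }
        }
    }

    // Returns the path length in edges, or INF if 'to' is unreachable.
    public int distance(int from, int to) {
        return dist[from][to];
    }

    // Returns ONE shortest path as a node list, or an empty list if none exists.
    public List<Integer> path(int from, int to) {
        List<Integer> result = new ArrayList<>();
        if (next[from][to] == -1) return result;
        result.add(from);
        while (from != to) {
            from = next[from][to];
            result.add(from);
        }
        return result;
    }

    public static void main(String[] args) {
        // Diamond-shaped graph: two equally short routes from 0 to 3, then 3 -> 4.
        int[][] edges = {{0, 1}, {0, 2}, {1, 3}, {2, 3}, {3, 4}};
        AllPairsShortestPaths sp = new AllPairsShortestPaths(5, edges);
        System.out.println(sp.path(0, 4)); // prints [0, 1, 3, 4]
    }
}
```

Note that the diamond has two shortest routes from node 0 to node 4, and the table keeps only one of them. That is exactly the open question: this approach only helps if the planner can work with a single shortest path per pair rather than all of them.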