lineage-overview - 598 Network read timeout error (spline 0.5.3) #738
The case is: job X produces data D. When producing D, X reads D-1, which is created by X-1. X is executed hourly and has a very big graph inside.
I just confirmed that a single graph for a single job X displays just fine. After an hour passes we've got X and X+1 on the same graph; it's slower but still works. After several hours (and jobs) the single graph is too big. Either there's some endless loop on the backend, or it just times out because of the huge amount of data being returned in a single request.
I bet it's the latter case - huge data. Although the query is recursive, it cannot loop infinitely: on one hand the recursion depth is restricted by a parameter, and on the other hand the query only traverses backward in time, never forward, from any given point. That is under the assumption that all nodes' clocks are synced, however. I cannot tell off the top of my head whether an actual loop is possible if, for some job, the write event is reported to happen earlier than some of the read events of the same job. (Sounds worth trying to simulate.) But anyway, in practice it should not loop infinitely, simply because of restriction no. 1.
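For illustration, a minimal arangosh sketch of a traversal guarded by both restrictions. The collection and edge names (`progress`, `depends`) and the `timestamp` field are assumptions made up for the example, not Spline's actual schema:

```js
// arangosh sketch - a traversal bounded both by depth and by time.
// "progress" and "depends" are hypothetical collection/edge names.
const maxDepth = 10; // restriction no. 1: recursion depth is capped

db._query(`
  FOR event IN progress
    FILTER event._key == @eventKey
    FOR v, e IN 1..@maxDepth OUTBOUND event depends
      // restriction no. 2: only walk backward in time, never forward
      FILTER v.timestamp <= event.timestamp
      RETURN DISTINCT v
`, { eventKey: "someEventKey", maxDepth });
```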
Can you elaborate on that one please? Do you mean that the execution plan of the same job grows from run to run?
Each single execution plan is the same, but the graph/visualisation grows. We've got data source D. The first run Xi reads and writes D, so the graph would be Xi <-> D. Next we've got run Xi+1; it also reads and writes D, but D already has lineage from Xi, so we end up with Xi <-> D <-> Xi+1. With Xi+2 the graph/visualisation gets yet another addition, and so on. When Xi has a big plan internally plus many other dependencies, the amount of data returned by lineage-overview becomes huge.
Would changing the app name of Xi, Xi+1, Xi+2 help anything, if they all had the same name X but different app IDs? Would Spline in that case just overwrite the previous lineage/plan?
The app name, like the app ID, is completely irrelevant for lineage. Every run of every job is recorded completely independently. The global (end-to-end) lineage picture is created by traversing individual execution plans with respect to two things - the time and the data source. It doesn't matter if I read from and write to the very same file or table, because the two events (read and write), logically happening at different instants of time, represent two different snapshots of the data. Such a situation is completely fine for Spline.
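To illustrate with made-up event shapes (illustrative only, not Spline's actual stored documents):

```js
// two events on the same data source are two different snapshots in time
// (illustrative shapes only - not Spline's actual document schema)
const readEvent  = { dataSource: "hdfs://data/D", kind: "read",  timestamp: 1000 };
const writeEvent = { dataSource: "hdfs://data/D", kind: "write", timestamp: 1060 };
// lineage links the write back to the earlier read (1060 > 1000), never the
// reverse, so a job that reads and writes the same path creates no cycle
```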
For performance troubleshooting it's important to distinguish between performance degradation caused by a single execution plan being too big, and degradation caused by the number of writes to a single data source being too large. In the first case the UI would have difficulty displaying the detailed lineage, while the global lineage overview picture would still render OK. In the latter case (a large number of writes) the situation is the opposite - the bigger lineage overview might time out, but if you request a detailed lineage directly by URL you'd get a normally working page.
Degradation happens after multiple executions (multiple subsequent writes by subsequent jobs of the same type). The first lineage overview is fast. Further ones (after some more writes) become slower and slower, up to the point where it always times out. The job is run hourly, so our daily retention does not help.
I see, so the overview query gets slower. Hm... I wonder if it's fixable by just tweaking some indices.
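For example, a sketch of what such a tweak could look like in arangosh; the collection and field names here are assumptions, so check the actual Spline schema first:

```js
// arangosh sketch: a persistent index on the event timestamp field could
// speed up the backward-in-time filter. "progress" and "timestamp" are
// assumptions, not Spline's verified schema.
db.progress.ensureIndex({ type: "persistent", fields: ["timestamp"] });
```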
I just ran this query but it returns an empty result :(
That query returns very fast. I think something different is the problem. My friend found that reducing the max depth in the query makes it very fast.
Try the full one (the one that returns the graph overview for the UI).
basically
Here's the source code for that custom function - https://github.com/AbsaOSS/spline/blob/release/0.5.3/persistence/src/main/resources/AQLFunctions/event_lineage_overview.js |
I could only get this profile (probably not very helpful). Deeper depths would not return in a reasonable waiting time (a human timeout occurred ;) )
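For what it's worth, such profiles can also be collected from arangosh; the query text below is just a placeholder for the actual overview AQL, and `maxDepth` is assumed to be exposed as a bind parameter:

```js
// arangosh sketch: profile the overview query at increasing depths to see
// whether each extra level multiplies the runtime by a constant factor.
const overviewQuery = "RETURN @maxDepth"; // placeholder - paste the real AQL here
for (const depth of [4, 5, 6, 7]) {
  print(`--- depth ${depth} ---`);
  db._profileQuery(overviewQuery, { maxDepth: depth });
}
```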
44 seconds?? That's a lot :)
Is the difference between 6 and 7 much more noticeable than between 5 and 6? I'm trying to understand whether every recursion cycle takes a similar time, or whether there is a job/event on which the query takes significantly longer... Can you share a screenshot of the graph that still works? Or at least the JSON returned from the REST request?
Depth 7 - 44 seconds.
It's some kind of exponential explosion :)
Don't tell me that :)
So now I feel that I need to go home. It seems we've got plenty of work for the coming days :))
But... well... it's graph traversal after all; the time complexity is expected to be exponential. The problem is that it's too steep, right?
I think it's a specific scenario:
I think you can safely narrow it to this scenario; the other dependencies are rather normal. 1st run (Ja, Jb, Jc are 3 exec plans of job J):
2nd run:
3rd run:
I won't try a 4th run, but that's the explosion caused by subsequent job runs :) So it's not a single-dependency graph pattern. It's a single-job pattern which explodes when the job is run multiple times. That job is run hourly, so after 6-7 hours it's not possible to display such a graph at all.
I also think that it's not a bug as such - this dependency pattern behaves this way and we cannot change it. Things which can be done here for sure:
++
Three questions:
Arango 3.6.4. Yes, I've seen the depth limit. The problem is not the depth for typical graphs; in this case the graph explodes very wide before it reaches the depth limit of 10. Note that the detailed lineages are huge in my case - maybe it's connected? A huge graph with huge detailed lineages.
It shouldn't matter; the overview query doesn't touch the detailed lineage nodes.
And this, #720? The schema is quite big (maybe 100 fields in nested structs).
As long as it makes it into the DB, it's OK. The overview query only depends on the high-level structure of the data pipeline in terms of reads, writes and data sources.
So maybe it's the additional data sources. As you can see on the screenshots, each of those "explosive" nodes has 10 other input sources.
It's rather the number of non-terminal sources read by every execution. That's what makes the base of the exponent. I was able to flood my local setup with a 4^10 kind of query.
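A quick back-of-the-envelope check of that base-and-exponent intuition (plain arithmetic, not a Spline measurement):

```js
// naive path count for an exponential traversal:
// base  = non-terminal sources read by every execution
// depth = traversal depth limit
const paths = (base, depth) => Math.pow(base, depth);
print(paths(4, 10)); // 1048576 paths - the 4^10 flood mentioned above
print(paths(10, 7)); // 10000000 paths - 10 inputs per node at depth 7
```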
@wajda is it possible to apply it directly on Arango? It would be much easier in my environment than preparing and deploying a custom build.
You need to replace a UDF function. It is possible to do it directly, but hardly easier.
Sending a request with Postman is 1000x easier and faster than preparing custom artifacts and deploying them in a Puppet environment :) I'll try to set it up and test tomorrow. As I understand it, this changes the query only, so it's not required to re-send lineages?
That's right. No Spline re-deploy or data change is required.
I changed the caching method a bit and saved even more CPU. My test page contains about 600 nodes and 6000 edges, making the graph insane. I believe the major bottleneck can be considered solved at this point.
Great. I'm still gathering lineage data on our side; not sure if I'll manage to test it today.
@wajda what should the function name be in the HTTP POST request? I tried
Honestly, I have never registered UDFs directly :( Need to try it first.
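For reference, ArangoDB exposes UDF management both over REST (POST /_api/aqlfunction with a JSON body containing name, code and isDeterministic - the Postman route) and via arangosh. A sketch of the arangosh route; the SPLINE:: function name below is a guess, so list the registered functions first:

```js
// arangosh sketch for replacing the UDF in place. The function name is an
// assumption - list what is actually registered before overwriting anything.
const aqlfunctions = require("@arangodb/aql/functions");

// 1. see which functions Spline registered (names only)
aqlfunctions.toArray().forEach(f => print(f.name));

// 2. re-register under the exact same name with the patched code
//    (placeholder string - paste the contents of event_lineage_overview.js)
const newCode = "function () { /* patched UDF body */ }";
aqlfunctions.register("SPLINE::EVENT_LINEAGE_OVERVIEW", newCode, true);
```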
OK, so the first case: Spline could not display the graph. The number of jobs is not big, but the jobs are very big inside (their execution plans), and some jobs may have a "past dependency" (`time + 1` depends on the same graph from `time`). The lineage-overview XHR returned 598 Network read timeout. On the server side I've found that arangodb was using all available CPU for that request. Stats from the queries (the `appends` query is extremely slow).