
lineage-overview - 598 Network read timeout error (spline 0.5.3) #738

Closed
DaimonPl opened this issue Jun 22, 2020 · 65 comments

Comments

@DaimonPl
Contributor

OK, so the first case: Spline could not display the graph. The number of jobs is not big, but the jobs are very big inside (their execution plans), and some jobs may have a "past dependency" (time + 1 depends on the same graph from time).

lineage-overview XHR returned 598 Network read timeout

On the server side I've found that ArangoDB was using all available CPU for that request.

Stats from queries (the appends query is extremely slow):

RETURN {
    "operations"    :   LENGTH(operation),
    "dataSource"    :   LENGTH(dataSource),
    "exec-plans"    :   LENGTH(executionPlan),
    "exec-events"   :   LENGTH(progress),
    "appends"       :   LENGTH(operation[* FILTER CURRENT._type == "Write" AND CURRENT.append]),
    "unique-apps"   :   LENGTH(UNIQUE(executionPlan[*].extra.appName)),
    "top-io-per-ds" :   (
                            LET pairs = (
                                FOR ds IN dataSource
                                    FOR op IN 1 INBOUND ds writesTo, readsFrom
                                        COLLECT t = op._type == "Read" ? "reads"
                                                        : op.append ? "appends"
                                                        : "overwites",
                                                dsId = ds._key WITH COUNT INTO c
                                        SORT c DESC
                                        RETURN [t, c]
                            )
                            FOR p IN pairs
                                COLLECT t = p[0] into g
                                RETURN [t, g[* LIMIT 20].p[1]]
                        ),
    "top-longest-observed-writes-seqs" : (
        FOR p IN progress
            LET cnt = LENGTH(SPLINE::OBSERVED_WRITES_BY_READ(p))
            FILTER cnt > 0
            SORT cnt DESC
            LIMIT 20
            RETURN cnt
    )
}
[
  {
    "operations": 50184,
    "dataSource": 207,
    "exec-plans": 539,
    "exec-events": 539,
    "unique-apps": 57,
    "top-io-per-ds": [
      [
        "overwites",
        [
          163,
          36,
          12,
          12,
          12,
          12,
          11,
          11,
          7,
          7,
          7,
          7,
          7,
          7,
          7,
          7,
          7,
          7,
          7,
          7
        ]
      ],
      [
        "reads",
        [
          708,
          374,
          187,
          187,
          164,
          164,
          141,
          111,
          106,
          106,
          106,
          96,
          82,
          48,
          48,
          48,
          48,
          48,
          30,
          24
        ]
      ]
    ],
    "top-longest-observed-writes-seqs": [
      10,
      10,
      10,
      10,
      10,
      10,
      10,
      10,
      10,
      10,
      10,
      10,
      10,
      10,
      10,
      10,
      10,
      10,
      10,
      10
    ]
  }
]
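
(For scale: 50184 operations across 539 execution plans is about 93 operations per plan on average, which is what I mean by the jobs being very big inside.)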
@DaimonPl DaimonPl added the bug label Jun 22, 2020
@DaimonPl DaimonPl changed the title lineage-overview - 598 Network read timeout error lineage-overview - 598 Network read timeout error (spline 0.5.3) Jun 22, 2020
@DaimonPl
Contributor Author

The case is:

Job X produces data D. When producing D, X reads D-1, which is created by X-1. X is executed hourly and has a very big graph inside.

@DaimonPl
Contributor Author

DaimonPl commented Jun 22, 2020

I just confirmed that a single graph for a single run of job X displays just fine. After an hour passes we've got X and X+1 on the same graph; it's slower but still works. After several hours (and jobs) the single graph is too big. Either there's some endless loop on the backend, or it just times out because of the huge amount of data being returned in a single request.

@wajda
Contributor

wajda commented Jun 22, 2020

I bet it's the latter case - huge data. Although the query is recursive, it cannot loop infinitely: on the one hand the recursion depth is restricted by a parameter, and on the other hand the query only traverses backward in time, never forward, from any given point. That is under the assumption that all nodes' clocks are synced, however. I cannot tell off the top of my head whether an actual loop is possible when, for some job, the write event is reported to happen earlier than some of the read events of the same job. (Sounds worth trying to simulate.) But anyway, in practice it should not loop infinitely simply because of restriction no. 1.

@wajda
Contributor

wajda commented Jun 22, 2020

After several hours (and jobs) single graph is too big

Can you elaborate on that one please? Do you mean that an execution plan of the same job grows from run to run?
You said that job Xi reads from Di-1, which is produced by Xi-1. This should not affect the size of Xi's execution plan.

@DaimonPl
Contributor Author

After several hours (and jobs) single graph is too big

Can you elaborate on that one please? Do you mean that an execution plan of the same job grows from run to run?
You said that job Xi reads from Di-1, which is produced by Xi-1. This should not affect the size of Xi's execution plan.

Each single execution plan is the same, but the graph/visualisation grows:

We've got data source D. The first run Xi reads and writes D, so the graph would be Xi <-> D. Next we've got run Xi+1; it also reads and writes D, but D already has lineage from Xi, so we end up with Xi <-> D <-> Xi+1. With Xi+2 the graph/visualisation gains yet another addition, and so on.

When Xi has a big plan internally and many other dependencies, the amount of data returned in lineage-overview becomes huge.

@DaimonPl
Contributor Author

Would changing the app name of Xi, Xi+1, Xi+2 help anything, if they all had the same name X but different app IDs? Would Spline in that case just overwrite the previous lineage/plan?

@wajda
Contributor

wajda commented Jun 22, 2020

The app name, likewise the app ID, is completely irrelevant for lineage. Every run of every job is recorded completely independently. The global (end-to-end) lineage picture is created by traversing individual execution plans with respect to two things: the time and the data source. It doesn't matter if I read from and write to the very same file or table, because the two events (read and write), logically happening at different instants of time, represent two different snapshots of the data. Such a situation is completely fine for Spline.

@wajda
Contributor

wajda commented Jun 22, 2020

For performance troubleshooting it's important to distinguish between degradation caused by a single execution plan being too big, and degradation caused by the number of writes to a single data source being too large. In the first case the UI would have difficulty displaying the detailed lineage, while the global lineage overview would still render OK. In the latter case (a large number of writes) the situation is the opposite: the bigger lineage overview might time out, but if you request a detailed lineage directly by URL you'd get a normally working page.

@DaimonPl
Contributor Author

Degradation happens after multiple executions (multiple subsequent writes by subsequent jobs of the same type).

The first lineage overview is fast. Further ones (after some more writes) become slower and slower, up to the point where it always times out.

The job is run hourly, so our daily retention does not help.

@wajda
Contributor

wajda commented Jun 22, 2020

I see, so the overview query gets slower. Hm... I wonder if it's fixable by just tweaking some indices.
Can you help me profile the query? I'll send you a few portions of it, so we can identify the culprit.

@wajda
Contributor

wajda commented Jun 22, 2020


LET startEvent = FIRST(FOR p IN progress FILTER p._key == @eventId RETURN p)

LET readTime = startEvent.timestamp
LET observedWrites = (
    FOR rds IN 2 OUTBOUND startEvent progressOf, depends
        LET maybeObservedOverwrite = SLICE(
            (FOR wo IN 1 INBOUND rds writesTo
                FILTER !wo.append
                FOR e IN 2 INBOUND wo executes, progressOf
                    FILTER e.timestamp < readTime
                    SORT e.timestamp DESC
                    LIMIT 1
                    RETURN e
            ), 0, 1)
        LET observedAppends = (
            FOR wo IN 1 INBOUND rds writesTo
                FILTER wo.append
                FOR e IN 2 INBOUND wo executes, progressOf
                    FILTER e.timestamp > maybeObservedOverwrite[0].timestamp
                       AND e.timestamp < readTime
                    SORT e.timestamp ASC
                    RETURN e
            )
        LET allObservedEvents = APPEND(maybeObservedOverwrite, observedAppends)
        FOR e IN allObservedEvents RETURN e
)


LET exec = FIRST(FOR ex IN 1 OUTBOUND startEvent progressOf RETURN ex)
LET affectedDsEdge = FIRST(FOR v, e IN 1 OUTBOUND exec affects RETURN e)
LET rdsWithInEdges = (FOR ds, e IN 1 OUTBOUND exec depends RETURN [ds, e])
LET readSources = rdsWithInEdges[*][0]
LET readDsEdges = rdsWithInEdges[*][1]

LET vertices = (
    FOR vert IN APPEND(readSources, exec)
        LET vertType = SPLIT(vert._id, '/')[0]
        RETURN vertType == "dataSource"
            ? {
                "_id": vert._key,
                "_class": "za.co.absa.spline.consumer.service.model.DataSourceNode",
                "name": vert.uri
            }
            : {
                "_id": vert._key,
                "_class": "za.co.absa.spline.consumer.service.model.ExecutionNode",
                "name": vert.extra.appName
            }
)

LET edges = (
    FOR edge IN APPEND(readDsEdges, affectedDsEdge)
        LET edgeType = SPLIT(edge._id, '/')[0]
        LET exKey = SPLIT(edge._from, '/')[1]
        LET dsKey = SPLIT(edge._to, '/')[1]
        RETURN {
            "source": edgeType == "depends" ? dsKey : exKey,
            "target": edgeType == "affects" ? dsKey : exKey
        }
)

RETURN {vertices, edges, observedWrites}

@DaimonPl
Contributor Author

I just ran this query but it returns an empty result :(

@wajda
Contributor

wajda commented Jun 22, 2020

Take the value for @eventId from the Spline UI URL that fails to load.
(screenshot)

Then run both "Profile" and "Explain" actions in the ArangoDB UI

(screenshot)

@DaimonPl
Contributor Author

Query String (2309 chars, cacheable: false):
 LET startEvent = FIRST(FOR p IN progress FILTER p._key == 
 "fefcb74e-ea5c-40f7-bed2-ba34dff1b6f1:kbq4zyo2" RETURN p)
 LET readTime = startEvent.timestamp
 LET observedWrites = (
     FOR rds IN 2 OUTBOUND startEvent progressOf, depends
         LET maybeObservedOverwrite = SLICE(
             (FOR wo IN 1 INBOUND rds writesTo
                 FILTER !wo.append
                 FOR e IN 2 INBOUND wo executes, progressOf
                     FILTER e.timestamp < readTime
                     SORT e.timestamp DESC
                     LIMIT 1
                     RETURN e
             ), 0, 1)
         LET observedAppends = (
             FOR wo IN 1 INBOUND rds writesTo
                 FILTER wo.append
                 FOR e IN 2 INBOUND wo executes, progressOf
                     FILTER e.timestamp > maybeObservedOverwrite[0].timestamp
                        AND e.timestamp < readTime
                     SORT e.timestamp ASC
                     RETURN e
             )
         LET allObservedEvents = APP...

Execution plan:
 Id   NodeType            Calls   Items   Runtime [s]   Comment
  1   SingletonNode           1       1       0.00000   * ROOT
  7   SubqueryNode            1       1       0.00007     - LET #1 = ...   /* const subquery */
  2   SingletonNode           1       1       0.00000       * ROOT
 78   IndexNode               1       1       0.00005         - FOR p IN progress   /* primary index scan */    
 76   LimitNode               1       1       0.00000           - LIMIT 0, 1
  6   ReturnNode              1       1       0.00001           - RETURN p
  8   CalculationNode         1       1       0.00001     - LET startEvent = FIRST(#1)   /* simple expression */
  9   CalculationNode         1       1       0.00001     - LET readTime = startEvent.`timestamp`   /* attribute expression */
 85   SubqueryStartNode       1       2       0.00000     - LET observedWrites = ( /* subquery begin */
 11   TraversalNode           1      12       0.00021       - FOR rds  /* vertex */ IN 2..2  /* min..maxPathDepth */ OUTBOUND startEvent /* startnode */  progressOf, OUTBOUND depends
 23   SubqueryNode            1      12       0.00450         - LET #8 = ...   /* subquery */
 12   SingletonNode          11      11       0.00001           * ROOT
 13   TraversalNode          11      82       0.00175             - FOR wo  /* vertex */ IN 1..1  /* min..maxPathDepth */ INBOUND rds /* startnode */  writesTo
 14   CalculationNode        11      82       0.00007               - LET #59 = ! wo.`append`   /* simple expression */
 15   FilterNode             11      82       0.00022               - FILTER #59
 16   TraversalNode          11      82       0.00205               - FOR e  /* vertex */ IN 2..2  /* min..maxPathDepth */ INBOUND wo /* startnode */  executes, INBOUND progressOf
 17   CalculationNode        11      82       0.00007                 - LET #63 = (e.`timestamp` < readTime)   /* simple expression */
 18   FilterNode             11      61       0.00007                 - FILTER #63
 19   CalculationNode        11      61       0.00004                 - LET #65 = e.`timestamp`   /* attribute expression */
 20   SortNode               11       4       0.00004                 - SORT #65 DESC   /* sorting strategy: standard */
 21   LimitNode              11       4       0.00003                 - LIMIT 0, 1
 22   ReturnNode             11       4       0.00003                 - RETURN e
 24   CalculationNode         1      12       0.00002         - LET maybeObservedOverwrite = SLICE(#8, 0, 1)   /* simple expression */
 87   SubqueryStartNode       1      23       0.00014         - LET observedAppends = ( /* subquery begin */
 26   TraversalNode           1      94       0.00135           - FOR wo  /* vertex */ IN 1..1  /* min..maxPathDepth */ INBOUND rds /* startnode */  writesTo
 27   CalculationNode         1      94       0.00003             - LET #69 = wo.`append`   /* attribute expression */
 28   FilterNode              1      12       0.00020             - FILTER #69
 29   TraversalNode           1      12       0.00059             - FOR e  /* vertex */ IN 2..2  /* min..maxPathDepth */ INBOUND wo /* startnode */  executes, INBOUND progressOf
 30   CalculationNode         1      12       0.00001               - LET #73 = ((e.`timestamp` > maybeObservedOverwrite[0].`timestamp`) && (e.`timestamp` < readTime))   /* simple expression */
 31   FilterNode              1      12       0.00026               - FILTER #73
 32   CalculationNode         1      12       0.00001               - LET #75 = e.`timestamp`   /* attribute expression */
 33   SortNode                1      12       0.00022               - SORT #75 ASC   /* sorting strategy: standard */
 88   SubqueryEndNode         1      12       0.00020         - RETURN  e ) /* subquery end */
 36   CalculationNode         1      12       0.00002         - LET allObservedEvents = APPEND(maybeObservedOverwrite, observedAppends)   /* simple expression */
 37   EnumerateListNode       1       5       0.00023         - FOR e IN allObservedEvents   /* list iteration */
 86   SubqueryEndNode         1       1       0.00001     - RETURN  e ) /* subquery end */
 43   SubqueryNode            1       1       0.00051     - LET #22 = ...   /* subquery */
 40   SingletonNode           1       1       0.00000       * ROOT
 41   TraversalNode           1       1       0.00048         - FOR ex  /* vertex */ IN 1..1  /* min..maxPathDepth */ OUTBOUND startEvent /* startnode */  progressOf
 75   LimitNode               1       1       0.00000           - LIMIT 0, 1
 42   ReturnNode              1       1       0.00001           - RETURN ex
 44   CalculationNode         1       1       0.00018     - LET exec = FIRST(#22)   /* simple expression */
 48   SubqueryNode            1       1       0.00009     - LET #27 = ...   /* subquery */
 45   SingletonNode           1       1       0.00000       * ROOT
 46   TraversalNode           1       1       0.00005         - FOR v  /* vertex */, e  /* edge */ IN 1..1  /* min..maxPathDepth */ OUTBOUND exec /* startnode */  affects
 77   LimitNode               1       1       0.00000           - LIMIT 0, 1
 47   ReturnNode              1       1       0.00000           - RETURN e
 83   SubqueryStartNode       1       2       0.00020     - LET rdsWithInEdges = ( /* subquery begin */
 51   TraversalNode           1      12       0.01350       - FOR ds  /* vertex */, e  /* edge */ IN 1..1  /* min..maxPathDepth */ OUTBOUND exec /* startnode */  depends
 52   CalculationNode         1      12       0.00004         - LET #83 = [ ds, e ]   /* simple expression */
 84   SubqueryEndNode         1       1       0.00030     - RETURN  #83 ) /* subquery end */
 55   CalculationNode         1       1       0.00002     - LET readSources = rdsWithInEdges[*][0]   /* simple expression */
 81   SubqueryStartNode       1       2       0.00030     - LET vertices = ( /* subquery begin */
 58   CalculationNode         1       2       0.00054       - LET #85 = APPEND(readSources, exec)   /* simple expression */
 59   EnumerateListNode       1      13       0.00040       - FOR vert IN #85   /* list iteration */
 61   CalculationNode         1      13       0.00071         - LET #87 = ((SPLIT(vert.`_id`, "/")[0] == "dataSource") ? { "_id" : vert.`_key`, "_class" : "za.co.absa.spline.consumer.service.model.DataSourceNode", "name" : vert.`uri` } : { "_id" : vert.`_key`, "_class" : "za.co.absa.spline.consumer.service.model.ExecutionNode", "name" : vert.`extra`.`appName` })   /* simple expression */
 82   SubqueryEndNode         1       1       0.00002     - RETURN  #87 ) /* subquery end */
 49   CalculationNode         1       1       0.00001     - LET affectedDsEdge = FIRST(#27)   /* simple expression */
 56   CalculationNode         1       1       0.00002     - LET readDsEdges = rdsWithInEdges[*][1]   /* simple expression */
 79   SubqueryStartNode       1       2       0.00001     - LET edges = ( /* subquery begin */
 65   CalculationNode         1       2       0.00002       - LET #89 = APPEND(readDsEdges, affectedDsEdge)   /* simple expression */
 66   EnumerateListNode       1      13       0.00046       - FOR edge IN #89   /* list iteration */
 67   CalculationNode         1      13       0.00005         - LET edgeType = SPLIT(edge.`_id`, "/")[0]   /* simple expression */
 68   CalculationNode         1      13       0.00004         - LET exKey = SPLIT(edge.`_from`, "/")[1]   /* simple expression */
 69   CalculationNode         1      13       0.00004         - LET dsKey = SPLIT(edge.`_to`, "/")[1]   /* simple expression */
 70   CalculationNode         1      13       0.00003         - LET #91 = { "source" : ((edgeType == "depends") ? dsKey : exKey), "target" : ((edgeType == "affects") ? dsKey : exKey) }   /* simple expression */
 80   SubqueryEndNode         1       1       0.00001     - RETURN  #91 ) /* subquery end */
 73   CalculationNode         1       1       0.00001     - LET #93 = { "vertices" : vertices, "edges" : edges, "observedWrites" : observedWrites }   /* simple expression */
 74   ReturnNode              1       1       0.00000     - RETURN #93

Indexes used:
 By   Name      Type      Collection   Unique   Sparse   Selectivity   Fields        Ranges
 78   primary   primary   progress     true     false       100.00 %   [ `_key` ]    (p.`_key` == "fefcb74e-ea5c-40f7-bed2-ba34dff1b6f1:kbq4zyo2")
 11   edge      edge      depends      false    false        18.88 %   [ `_from` ]   base OUTBOUND
 11   edge      edge      progressOf   false    false       100.00 %   [ `_from` ]   base OUTBOUND
 13   edge      edge      writesTo     false    false        28.70 %   [ `_to` ]     base INBOUND
 16   edge      edge      executes     false    false       100.00 %   [ `_to` ]     base INBOUND
 16   edge      edge      progressOf   false    false       100.00 %   [ `_to` ]     base INBOUND
 26   edge      edge      writesTo     false    false        28.70 %   [ `_to` ]     base INBOUND
 29   edge      edge      executes     false    false       100.00 %   [ `_to` ]     base INBOUND
 29   edge      edge      progressOf   false    false       100.00 %   [ `_to` ]     base INBOUND
 41   edge      edge      progressOf   false    false       100.00 %   [ `_from` ]   base OUTBOUND
 46   edge      edge      affects      false    false       100.00 %   [ `_from` ]   base OUTBOUND
 51   edge      edge      depends      false    false        18.88 %   [ `_from` ]   base OUTBOUND

Traversals on graphs:
 Id  Depth  Vertex collections  Edge collections      Options                                  Filter / Prune Conditions
 11  2..2                       progressOf, depends   uniqueVertices: none, uniqueEdges: path                           
 13  1..1                       writesTo              uniqueVertices: none, uniqueEdges: path                           
 16  2..2                       executes, progressOf  uniqueVertices: none, uniqueEdges: path                           
 26  1..1                       writesTo              uniqueVertices: none, uniqueEdges: path                           
 29  2..2                       executes, progressOf  uniqueVertices: none, uniqueEdges: path                           
 41  1..1                       progressOf            uniqueVertices: none, uniqueEdges: path                           
 46  1..1                       affects               uniqueVertices: none, uniqueEdges: path                           
 51  1..1                       depends               uniqueVertices: none, uniqueEdges: path                           

Optimization rules applied:
 Id   RuleName
  1   move-calculations-up
  2   move-filters-up
  3   remove-unnecessary-calculations
  4   optimize-subqueries
  5   move-calculations-up-2
  6   move-filters-up-2
  7   use-indexes
  8   remove-filter-covered-by-index
  9   remove-unnecessary-calculations-2
 10   move-calculations-down
 11   splice-subqueries

Query Statistics:
 Writes Exec   Writes Ign   Scan Full   Scan Index   Filtered   Exec Time [s]
           0            0           0          271        103         0.02873

Query Profile:
 Query Stage           Duration [s]
 initializing               0.00001
 parsing                    0.00029
 optimizing ast             0.00006
 loading collections        0.00001
 instantiating plan         0.00023
 optimizing plan            0.00147
 executing                  0.02565
 finalizing                 0.00101

@DaimonPl
Contributor Author

That query returns very fast, so I think the problem is something different.

A colleague found that reducing the max depth in the query makes it very fast: maxDepth >= 7 returns the error, maxDepth <= 6 works fine (the smaller, the faster).

@wajda
Contributor

wajda commented Jun 22, 2020

Try the full one (the one that returns the graph overview for the UI):

LET startEvent = FIRST(FOR p IN progress FILTER p._key == @eventId RETURN p)
RETURN SPLINE::EVENT_LINEAGE_OVERVIEW(startEvent, 10)
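
(The second argument is the max recursion depth.)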

@wajda
Contributor

wajda commented Jun 22, 2020

Basically, SPLINE::EVENT_LINEAGE_OVERVIEW implements that recursion and executes the other queries (the ones I sent you before) against the different event IDs it finds along the path. So there might be some event ID for which one of those queries takes particularly long to run. It would be cool to find that event ID and the exact subquery.

@DaimonPl
Contributor Author

I could only get this profile (probably not very helpful). Deeper depths would not return in a reasonable waiting time (human timeout occurred ;))

Query String (168 chars, cacheable: false):
 LET startEvent = FIRST(FOR p IN progress FILTER p._key == 
 "fefcb74e-ea5c-40f7-bed2-ba34dff1b6f1:kbq4zyo2" RETURN p)
 RETURN SPLINE::EVENT_LINEAGE_OVERVIEW(startEvent, 7)

Execution plan:
 Id   NodeType          Calls   Items   Runtime [s]   Comment
  1   SingletonNode         1       1       0.00000   * ROOT
  7   SubqueryNode          1       1       0.00009     - LET #1 = ...   /* const subquery */
  2   SingletonNode         1       1       0.00000       * ROOT
 12   IndexNode             1       1       0.00007         - FOR p IN progress   /* primary index scan */    
 11   LimitNode             1       1       0.00000           - LIMIT 0, 1
  6   ReturnNode            1       1       0.00000           - RETURN p
  9   CalculationNode       1       1      44.42263     - LET #6 = SPLINE::EVENT_LINEAGE_OVERVIEW(FIRST(#1), 7)   /* user-defined function */   /* simple expression */
 10   ReturnNode            1       1       0.00001     - RETURN #6

Indexes used:
 By   Name      Type      Collection   Unique   Sparse   Selectivity   Fields       Ranges
 12   primary   primary   progress     true     false       100.00 %   [ `_key` ]   (p.`_key` == "fefcb74e-ea5c-40f7-bed2-ba34dff1b6f1:kbq4zyo2")

Optimization rules applied:
 Id   RuleName
  1   remove-unnecessary-calculations
  2   optimize-subqueries
  3   use-indexes
  4   remove-filter-covered-by-index
  5   remove-unnecessary-calculations-2

Query Statistics:
 Writes Exec   Writes Ign   Scan Full   Scan Index   Filtered   Exec Time [s]
           0            0           0            1          0        44.42360

Query Profile:
 Query Stage           Duration [s]
 initializing               0.00000
 parsing                    0.00007
 optimizing ast             0.00001
 loading collections        0.00001
 instantiating plan         0.00003
 optimizing plan            0.00018
 executing                 44.42276
 finalizing                 0.00051

@wajda
Contributor

wajda commented Jun 22, 2020

44 seconds?? That's a lot :)

maxDepth>=7 returns error.

Is the difference between 6 and 7 much more noticeable than between 5 and 6? I'm trying to understand whether every recursion cycle takes a similar time, or whether there is a job/event on which the query takes significantly longer...

Can you share the screenshot of the graph that still works? Or at least the returned JSON from the REST request?

@DaimonPl
Contributor Author

Depth times:

7 - 44 seconds
6 - 12 seconds
5 - 2.3 seconds
4 - 0.5 seconds
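
So each extra level multiplies the runtime by roughly 4-5x (44/12 ≈ 3.7, 12/2.3 ≈ 5.2, 2.3/0.5 = 4.6), which looks exponential.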

@DaimonPl
Contributor Author

DaimonPl commented Jun 22, 2020

Here's the graph from the first job run (3 writes):

(screenshot)

Here's the graph from the second job run (time +1 hour, 3 new writes and 3 previous writes):

(screenshot)

I think the third run is the first to fail.

@DaimonPl
Contributor Author

It's some kind of exponential explosion :)

@wajda
Contributor

wajda commented Jun 22, 2020

Don't tell me that :)

@wajda
Contributor

wajda commented Jun 22, 2020

So now I feel that I need to go home. It seems we've got plenty of work for the coming days :))

@wajda
Contributor

wajda commented Jun 22, 2020

But... well... it's graph traversal after all; the time complexity is expected to be exponential. The problem is that it's too steep, right?

@DaimonPl
Contributor Author

DaimonPl commented Jun 22, 2020

I think it's a specific scenario:

  • the job writes 3 partitions to a table (a single Hive datasource URI)
  • when calculating those writes it reads the same table as one of the sources (the same URI as the writes)
  • it is calculated hourly
  • the first job run displays 3 exec plans (3 writes)
  • as a result, the second job run displays 3*3 exec plans
  • the third job run would probably display 3*3*3, and so on
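
In other words, run k displays roughly 3^k exec plans (3, 9, 27, ...), so with hourly runs the overview blows up after just a few hours.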

@DaimonPl
Contributor Author

DaimonPl commented Jun 29, 2020

I think you can safely narrow it down to this scenario; the other dependencies are rather normal.

1st run (Ja, Jb, Jc are 3 exec plans of job J):

T -> Ja -> T
T -> Jb -> T
T -> Jc -> T

2nd run:

`T` -> `Ja` \
`T` -> `Jb`  -> `T` -> `J+1a` -> `T`
`T` -> `Jc` /

`T` -> `Ja` \
`T` -> `Jb`  -> `T` -> `J+1b` -> `T`
`T` -> `Jc` /

`T` -> `Ja` \
`T` -> `Jb`  -> `T` -> `J+1c` -> `T`
`T` -> `Jc` /

3rd run:

`T` -> `Ja` \
`T` -> `Jb`  -> `T` -> `J+1a` -> `T` \
`T` -> `Jc` /                         \
                                       \
`T` -> `Ja` \                           \
`T` -> `Jb`  -> `T` -> `J+1b` -> `T` -----> `J+2a` -> `T`
`T` -> `Jc` /                           /
                                       /
`T` -> `Ja` \                         /
`T` -> `Jb`  -> `T` -> `J+1c` -> `T` /
`T` -> `Jc` /

`T` -> `Ja` \
`T` -> `Jb`  -> `T` -> `J+1a` -> `T` \
`T` -> `Jc` /                         \
                                       \
`T` -> `Ja` \                           \
`T` -> `Jb`  -> `T` -> `J+1b` -> `T` -----> `J+2b` -> `T`
`T` -> `Jc` /                           /
                                       /
`T` -> `Ja` \                         /
`T` -> `Jb`  -> `T` -> `J+1c` -> `T` /
`T` -> `Jc` /

`T` -> `Ja` \
`T` -> `Jb`  -> `T` -> `J+1a` -> `T` \
`T` -> `Jc` /                         \
                                       \
`T` -> `Ja` \                           \
`T` -> `Jb`  -> `T` -> `J+1b` -> `T` -----> `J+2c` -> `T`
`T` -> `Jc` /                           /
                                       /
`T` -> `Ja` \                         /
`T` -> `Jb`  -> `T` -> `J+1c` -> `T` /
`T` -> `Jc` /

I won't try the 4th run, but that's the explosion caused by subsequent job runs :)

So it's not a single dependency-graph pattern. It's a single-job pattern which explodes when the job is run multiple times. That job is run hourly, so after 6-7 hours it's not possible to display such a graph at all.

@DaimonPl
Contributor Author

I also think that it's not a bug per se - this dependency pattern behaves this way and we cannot change it. Things which can be done here for sure:

  • reporting only the first run instead of each run - that's what we did, and it's enough for lineage purposes
  • configuring the default depth (that's what I was asking about already)
  • stopping graph traversal after some node/edge limit - though I don't know if this would be possible with Arango

@wajda
Contributor

wajda commented Jun 29, 2020

++

  • memoize the recursive function to avoid repeated path traversals (see the sketch below)
  • offload some computation from the UDF to AQL to avoid the overhead of converting data structures to/from V8, potentially hit some indices, and/or execute it on sharded data nodes instead of the coordinator
  • pre-link the nodes with additional edges to shift some computation from request time to the background, and avoid repeating it
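
To illustrate the memoization point, a minimal sketch in plain JavaScript (lineageOfEvent and the cache shape are illustrative; the real UDF is more involved):

function memoizedOverview(lineageOfEvent) {
    var cache = {};
    return function overview(eventId, depth) {
        var key = eventId + ":" + depth;
        if (!(key in cache)) {
            // Compute once per (event, depth). When the traversal visits the
            // same event again - which happens a lot in the "explosive"
            // pattern above - reuse the stored partial graph instead of
            // re-traversing the whole subtree.
            cache[key] = lineageOfEvent(eventId, depth, overview);
        }
        return cache[key];
    };
}

Since each T -> Ja/Jb/Jc subtree would then be computed once instead of once per path, the cost drops from exponential to roughly linear in the number of distinct events.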

@wajda
Contributor

wajda commented Jun 30, 2020

Three questions:

  • which exact ArangoDB version are you using?
  • which Spline Admin CLI version did you use to create/upgrade the Spline db?
  • which Spline UI version are you using?

@wajda
Contributor

wajda commented Jun 30, 2020

I tried a 30-level nested graph and the UI responded in less than a second:

(screenshot: depth limit popup)

Have you seen this popup on your UI?

@DaimonPl
Contributor Author

DaimonPl commented Jun 30, 2020

Arango 3.6.4
Spline 0.5.3

Yes, I've seen the depth limit. The problem is not the depth for typical graphs; in this case the graph explodes very wide before it reaches the depth limit of 10.

Note that the detailed lineages are huge in my case; maybe it's connected? A huge graph with huge detailed lineages.

@wajda
Contributor

wajda commented Jun 30, 2020

It shouldn't matter; the overview query doesn't touch the detailed lineage nodes.

@DaimonPl
Contributor Author

And what about #720? The schema is quite big (maybe 100 fields in nested structs).

@wajda
Contributor

wajda commented Jun 30, 2020

As long as it makes it into the DB, it's OK. The overview query only depends on the high-level structure of the data pipeline in terms of reads, writes and data sources.

@DaimonPl
Contributor Author

So maybe it's the additional data sources. As you can see on the screenshots, each of those "explosive" nodes has 10 other input sources.

@wajda
Contributor

wajda commented Jun 30, 2020

It's rather the number of non-terminal sources read by every execution. This is what makes the base of the exponent. I was able to flood my local setup with a 4^10 kind of query.

@wajda
Contributor

wajda commented Jul 1, 2020

@DaimonPl can you please test this commit? fbb6ae5
It's enough to build it and run java -jar admin-cli.jar db-upgrade ...

In my setup it decreased the query time from 50 seconds to 95 milliseconds :)

@DaimonPl
Contributor Author

DaimonPl commented Jul 1, 2020

@wajda is it possible to apply it directly on Arango? It would be much easier in my environment than preparing and deploying a custom build.

@wajda
Contributor

wajda commented Jul 1, 2020

You need to replace a UDF. It is possible to do it directly, but it's hardly easier: you need to call the ArangoDB REST API or use arangosh.
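
Off the top of my head, something along these lines in arangosh should do it (a sketch; the function body is a placeholder, take the actual one from the commit):

// arangosh, connected to the Spline database:
//   arangosh --server.database spline
var aqlfunctions = require("@arangodb/aql/functions");

// Registering under an existing name replaces the old definition.
aqlfunctions.register(
    "SPLINE::EVENT_LINEAGE_OVERVIEW",
    function (startEvent, maxDepth) {
        // ...paste the updated UDF body from the commit here...
    },
    true  // deterministic
);

Over plain REST it should be a POST to the /_api/aqlfunction endpoint with the name and code in the JSON body; mind the database prefix in the URL (i.e. /_db/spline/_api/aqlfunction).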

@DaimonPl
Contributor Author

DaimonPl commented Jul 1, 2020

Sending a request with Postman is 1000x easier and faster than preparing custom artifacts and deploying them on a Puppet environment :)

I'll try to set it up and test tomorrow.

As I understand it changes the query only, so it's not required to re-send lineages?

@wajda
Contributor

wajda commented Jul 1, 2020

That's right. No Spline re-deploy or data change is required.

wajda added a commit that referenced this issue Jul 2, 2020
@wajda
Contributor

wajda commented Jul 2, 2020

Changed the caching method a bit and saved even more CPU:
40b360b
Now it copes quite OK with my somewhat extreme case: every job reads from 4 primary sources (each of which contains 10 appends from the previous executions of the same jobs) and overwrites one of them. It additionally reads 1 to 13 secondary sources at random. So overall, the average vertex degree in the graph is about 50-ish, and the depth is 20.

My test page contains about 600 nodes and 6000 edges, making the graph insane.
The server responded in about 9 seconds, and then the majority of the time (about 1 minute) was spent by Chrome rendering the beast.

(screenshot)

I believe that the major bottleneck can be considered solved at this point.

@DaimonPl
Contributor Author

DaimonPl commented Jul 2, 2020

Great. I'm still gathering lineage data on our side; not sure if I'll manage to test it today.

@wajda wajda removed the help wanted label Jul 2, 2020
@DaimonPl
Contributor Author

DaimonPl commented Jul 2, 2020

@wajda what should the function name be in the HTTP POST request? I tried SPLINE::EVENT_LINEAGE_OVERVIEW but it looks like it created a new function instead of replacing the previous one?

@wajda
Contributor

wajda commented Jul 2, 2020

Honestly, I've never registered UDFs directly :( Need to try it first.

@DaimonPl
Contributor Author

DaimonPl commented Jul 2, 2020

OK, I managed to do it; I was missing /_db/spline/ in the URI.

With your changes, the graph which was previously failing now displays "fine" ;)

(screenshot)

@DaimonPl
Contributor Author

DaimonPl commented Jul 2, 2020

The next iteration as well:

(screenshot)

wajda added a commit that referenced this issue Jul 3, 2020
* spline #738 lineage-overview - add memoization to Arango UDF

* spline #738 lineage-overview - replace sub-function memoization with the recursive function memoization
@wajda wajda closed this as completed Aug 20, 2020