
lineage-overview - 598 Network read timeout error (spline 0.5.3) #738

Closed
DaimonPl opened this issue Jun 22, 2020 · 65 comments

Comments

@DaimonPl
Contributor

OK, so the first case: Spline could not display the graph. The number of jobs is not big, but the jobs are very big inside (their execution plans), and some jobs may have a "past dependency" (time + 1 depends on the same graph from time).

lineage-overview XHR returned 598 Network read timeout

On the server side I've found that ArangoDB was using all available CPU for that request.

Stats from queries (the appends query is extremely slow):

RETURN {
    "operations"    :   LENGTH(operation),
    "dataSource"    :   LENGTH(dataSource),
    "exec-plans"    :   LENGTH(executionPlan),
    "exec-events"   :   LENGTH(progress),
    "appends"       :   LENGTH(operation[* FILTER CURRENT._type == "Write" AND CURRENT.append]),
    "unique-apps"   :   LENGTH(UNIQUE(executionPlan[*].extra.appName)),
    "top-io-per-ds" :   (
                            LET pairs = (
                                FOR ds IN dataSource
                                    FOR op IN 1 INBOUND ds writesTo, readsFrom
                                        COLLECT t = op._type == "Read" ? "reads"
                                                        : op.append ? "appends"
                                                        : "overwites",
                                                dsId = ds._key WITH COUNT INTO c
                                        SORT c DESC
                                        RETURN [t, c]
                            )
                            FOR p IN pairs
                                COLLECT t = p[0] into g
                                RETURN [t, g[* LIMIT 20].p[1]]
                        ),
    "top-longest-observed-writes-seqs" : (
        FOR p IN progress
            LET cnt = LENGTH(SPLINE::OBSERVED_WRITES_BY_READ(p))
            FILTER cnt > 0
            SORT cnt DESC
            LIMIT 20
            RETURN cnt
    )
}
[
  {
    "operations": 50184,
    "dataSource": 207,
    "exec-plans": 539,
    "exec-events": 539,
    "unique-apps": 57,
    "top-io-per-ds": [
      [
        "overwites",
        [
          163,
          36,
          12,
          12,
          12,
          12,
          11,
          11,
          7,
          7,
          7,
          7,
          7,
          7,
          7,
          7,
          7,
          7,
          7,
          7
        ]
      ],
      [
        "reads",
        [
          708,
          374,
          187,
          187,
          164,
          164,
          141,
          111,
          106,
          106,
          106,
          96,
          82,
          48,
          48,
          48,
          48,
          48,
          30,
          24
        ]
      ]
    ],
    "top-longest-observed-writes-seqs": [
      10,
      10,
      10,
      10,
      10,
      10,
      10,
      10,
      10,
      10,
      10,
      10,
      10,
      10,
      10,
      10,
      10,
      10,
      10,
      10
    ]
  }
]
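
(For scale: 50184 operations across 539 execution plans is about 93 operations per plan on average, which is what I mean by the jobs being very big inside.)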
@DaimonPl DaimonPl added the bug label Jun 22, 2020
@DaimonPl DaimonPl changed the title lineage-overview - 598 Network read timeout error lineage-overview - 598 Network read timeout error (spline 0.5.3) Jun 22, 2020
@DaimonPl
Contributor Author

The case is:

Job X produces data D. When producing D, X reads D-1, which is created by X-1. X is executed hourly and has a very big graph inside.

@DaimonPl
Contributor Author

DaimonPl commented Jun 22, 2020

I just confirmed that a single graph for a single run of job X displays just fine. After an hour passes we've got X and X+1 on the same graph; it's slower but still works. After several hours (and jobs) the single graph is too big. Either there's some endless loop on the backend, or it just times out because of the huge amount of data being returned in a single request.

@wajda
Contributor

wajda commented Jun 22, 2020

I bet it's the latter case - huge data. Although the query is recursive, it cannot loop infinitely: on the one hand the recursion depth is restricted by a parameter, and on the other hand the query only traverses backward in time, never forward, from any given point. That is under the assumption that all nodes' clocks are synced, however. I cannot tell off the top of my head whether an actual loop is possible when, for some job, the write event is reported to happen earlier than some of the read events of the same job. (Sounds worth trying to simulate.) But anyway, in practice it should not loop infinitely simply because of restriction no. 1.

@wajda
Contributor

wajda commented Jun 22, 2020

After several hours (and jobs) single graph is too big

Can you elaborate on that one please? Do you mean that an execution plan of the same job grows from run to run?
You said that job Xi reads from Di-1, which is produced by Xi-1. This should not affect the size of Xi's execution plan.

@DaimonPl
Contributor Author

After several hours (and jobs) single graph is too big

Can you elaborate on that one please? Do you mean that an execution plan of the same job grows from run to run?
You said that job Xi reads from Di-1, which is produced by Xi-1. This should not affect the size of Xi's execution plan.

Each single execution plan is the same, but the graph/visualisation grows:

We've got data source D. The first run Xi reads and writes D, so the graph would be Xi <-> D. Next we've got run Xi+1; it also reads and writes D, but D already has lineage from Xi, so we end up with Xi <-> D <-> Xi+1. With Xi+2 the graph/visualisation gains yet another addition, and so on.

When Xi has a big plan internally and many other dependencies, the amount of data returned in lineage-overview becomes huge.

@DaimonPl
Contributor Author

Would changing the app name of Xi, Xi+1, Xi+2 help anything, if they all had the same name X but different app IDs? Would Spline in that case just overwrite the previous lineage/plan?

@wajda
Contributor

wajda commented Jun 22, 2020

The app name, likewise the app ID, is completely irrelevant for lineage. Every run of every job is recorded completely independently. The global (end-to-end) lineage picture is created by traversing individual execution plans with respect to two things: the time and the data source. It doesn't matter if I read from and write to the very same file or table, because the two events (read and write), logically happening at different instants of time, represent two different snapshots of the data. Such a situation is completely fine for Spline.

@wajda
Contributor

wajda commented Jun 22, 2020

For performance troubleshooting it's important to distinguish between degradation caused by a single execution plan being too big, and degradation caused by the number of writes to a single data source being too large. In the first case the UI would have difficulty displaying the detailed lineage, while the global lineage overview would still render OK. In the latter case (a large number of writes) the situation is the opposite: the bigger lineage overview might time out, but if you request a detailed lineage directly by URL you'd get a normally working page.

@DaimonPl
Contributor Author

Degradation happens after multiple executions (multiple subsequent writes by subsequent jobs of the same type).

The first lineage overview is fast. Further ones (after some more writes) become slower and slower, up to the point where it always times out.

The job is run hourly, so our daily retention does not help.

@wajda
Contributor

wajda commented Jun 22, 2020

I see, so the overview query gets slower. Hm... I wonder if it's fixable by just tweaking some indices.
Can you help me profile the query? I'll send you a few portions of it, so we can identify the culprit.

@wajda
Contributor

wajda commented Jun 22, 2020


LET startEvent = FIRST(FOR p IN progress FILTER p._key == @eventId RETURN p)

LET readTime = startEvent.timestamp
LET observedWrites = (
    FOR rds IN 2 OUTBOUND startEvent progressOf, depends
        LET maybeObservedOverwrite = SLICE(
            (FOR wo IN 1 INBOUND rds writesTo
                FILTER !wo.append
                FOR e IN 2 INBOUND wo executes, progressOf
                    FILTER e.timestamp < readTime
                    SORT e.timestamp DESC
                    LIMIT 1
                    RETURN e
            ), 0, 1)
        LET observedAppends = (
            FOR wo IN 1 INBOUND rds writesTo
                FILTER wo.append
                FOR e IN 2 INBOUND wo executes, progressOf
                    FILTER e.timestamp > maybeObservedOverwrite[0].timestamp
                       AND e.timestamp < readTime
                    SORT e.timestamp ASC
                    RETURN e
            )
        LET allObservedEvents = APPEND(maybeObservedOverwrite, observedAppends)
        FOR e IN allObservedEvents RETURN e
)


LET exec = FIRST(FOR ex IN 1 OUTBOUND startEvent progressOf RETURN ex)
LET affectedDsEdge = FIRST(FOR v, e IN 1 OUTBOUND exec affects RETURN e)
LET rdsWithInEdges = (FOR ds, e IN 1 OUTBOUND exec depends RETURN [ds, e])
LET readSources = rdsWithInEdges[*][0]
LET readDsEdges = rdsWithInEdges[*][1]

LET vertices = (
    FOR vert IN APPEND(readSources, exec)
        LET vertType = SPLIT(vert._id, '/')[0]
        RETURN vertType == "dataSource"
            ? {
                "_id": vert._key,
                "_class": "za.co.absa.spline.consumer.service.model.DataSourceNode",
                "name": vert.uri
            }
            : {
                "_id": vert._key,
                "_class": "za.co.absa.spline.consumer.service.model.ExecutionNode",
                "name": vert.extra.appName
            }
)

LET edges = (
    FOR edge IN APPEND(readDsEdges, affectedDsEdge)
        LET edgeType = SPLIT(edge._id, '/')[0]
        LET exKey = SPLIT(edge._from, '/')[1]
        LET dsKey = SPLIT(edge._to, '/')[1]
        RETURN {
            "source": edgeType == "depends" ? dsKey : exKey,
            "target": edgeType == "affects" ? dsKey : exKey
        }
)

RETURN {vertices, edges, observedWrites}

@DaimonPl
Contributor Author

I just ran this query but it returns an empty result :(

@wajda
Contributor

wajda commented Jun 22, 2020

Take the value for @eventId from the Spline UI URL that fails to load.
(screenshot)

Then run both "Profile" and "Explain" actions in the ArangoDB UI

(screenshot)

@DaimonPl
Contributor Author

Query String (2309 chars, cacheable: false):
 LET startEvent = FIRST(FOR p IN progress FILTER p._key == 
 "fefcb74e-ea5c-40f7-bed2-ba34dff1b6f1:kbq4zyo2" RETURN p)
 LET readTime = startEvent.timestamp
 LET observedWrites = (
     FOR rds IN 2 OUTBOUND startEvent progressOf, depends
         LET maybeObservedOverwrite = SLICE(
             (FOR wo IN 1 INBOUND rds writesTo
                 FILTER !wo.append
                 FOR e IN 2 INBOUND wo executes, progressOf
                     FILTER e.timestamp < readTime
                     SORT e.timestamp DESC
                     LIMIT 1
                     RETURN e
             ), 0, 1)
         LET observedAppends = (
             FOR wo IN 1 INBOUND rds writesTo
                 FILTER wo.append
                 FOR e IN 2 INBOUND wo executes, progressOf
                     FILTER e.timestamp > maybeObservedOverwrite[0].timestamp
                        AND e.timestamp < readTime
                     SORT e.timestamp ASC
                     RETURN e
             )
         LET allObservedEvents = APP...

Execution plan:
 Id   NodeType            Calls   Items   Runtime [s]   Comment
  1   SingletonNode           1       1       0.00000   * ROOT
  7   SubqueryNode            1       1       0.00007     - LET #1 = ...   /* const subquery */
  2   SingletonNode           1       1       0.00000       * ROOT
 78   IndexNode               1       1       0.00005         - FOR p IN progress   /* primary index scan */    
 76   LimitNode               1       1       0.00000           - LIMIT 0, 1
  6   ReturnNode              1       1       0.00001           - RETURN p
  8   CalculationNode         1       1       0.00001     - LET startEvent = FIRST(#1)   /* simple expression */
  9   CalculationNode         1       1       0.00001     - LET readTime = startEvent.`timestamp`   /* attribute expression */
 85   SubqueryStartNode       1       2       0.00000     - LET observedWrites = ( /* subquery begin */
 11   TraversalNode           1      12       0.00021       - FOR rds  /* vertex */ IN 2..2  /* min..maxPathDepth */ OUTBOUND startEvent /* startnode */  progressOf, OUTBOUND depends
 23   SubqueryNode            1      12       0.00450         - LET #8 = ...   /* subquery */
 12   SingletonNode          11      11       0.00001           * ROOT
 13   TraversalNode          11      82       0.00175             - FOR wo  /* vertex */ IN 1..1  /* min..maxPathDepth */ INBOUND rds /* startnode */  writesTo
 14   CalculationNode        11      82       0.00007               - LET #59 = ! wo.`append`   /* simple expression */
 15   FilterNode             11      82       0.00022               - FILTER #59
 16   TraversalNode          11      82       0.00205               - FOR e  /* vertex */ IN 2..2  /* min..maxPathDepth */ INBOUND wo /* startnode */  executes, INBOUND progressOf
 17   CalculationNode        11      82       0.00007                 - LET #63 = (e.`timestamp` < readTime)   /* simple expression */
 18   FilterNode             11      61       0.00007                 - FILTER #63
 19   CalculationNode        11      61       0.00004                 - LET #65 = e.`timestamp`   /* attribute expression */
 20   SortNode               11       4       0.00004                 - SORT #65 DESC   /* sorting strategy: standard */
 21   LimitNode              11       4       0.00003                 - LIMIT 0, 1
 22   ReturnNode             11       4       0.00003                 - RETURN e
 24   CalculationNode         1      12       0.00002         - LET maybeObservedOverwrite = SLICE(#8, 0, 1)   /* simple expression */
 87   SubqueryStartNode       1      23       0.00014         - LET observedAppends = ( /* subquery begin */
 26   TraversalNode           1      94       0.00135           - FOR wo  /* vertex */ IN 1..1  /* min..maxPathDepth */ INBOUND rds /* startnode */  writesTo
 27   CalculationNode         1      94       0.00003             - LET #69 = wo.`append`   /* attribute expression */
 28   FilterNode              1      12       0.00020             - FILTER #69
 29   TraversalNode           1      12       0.00059             - FOR e  /* vertex */ IN 2..2  /* min..maxPathDepth */ INBOUND wo /* startnode */  executes, INBOUND progressOf
 30   CalculationNode         1      12       0.00001               - LET #73 = ((e.`timestamp` > maybeObservedOverwrite[0].`timestamp`) && (e.`timestamp` < readTime))   /* simple expression */
 31   FilterNode              1      12       0.00026               - FILTER #73
 32   CalculationNode         1      12       0.00001               - LET #75 = e.`timestamp`   /* attribute expression */
 33   SortNode                1      12       0.00022               - SORT #75 ASC   /* sorting strategy: standard */
 88   SubqueryEndNode         1      12       0.00020         - RETURN  e ) /* subquery end */
 36   CalculationNode         1      12       0.00002         - LET allObservedEvents = APPEND(maybeObservedOverwrite, observedAppends)   /* simple expression */
 37   EnumerateListNode       1       5       0.00023         - FOR e IN allObservedEvents   /* list iteration */
 86   SubqueryEndNode         1       1       0.00001     - RETURN  e ) /* subquery end */
 43   SubqueryNode            1       1       0.00051     - LET #22 = ...   /* subquery */
 40   SingletonNode           1       1       0.00000       * ROOT
 41   TraversalNode           1       1       0.00048         - FOR ex  /* vertex */ IN 1..1  /* min..maxPathDepth */ OUTBOUND startEvent /* startnode */  progressOf
 75   LimitNode               1       1       0.00000           - LIMIT 0, 1
 42   ReturnNode              1       1       0.00001           - RETURN ex
 44   CalculationNode         1       1       0.00018     - LET exec = FIRST(#22)   /* simple expression */
 48   SubqueryNode            1       1       0.00009     - LET #27 = ...   /* subquery */
 45   SingletonNode           1       1       0.00000       * ROOT
 46   TraversalNode           1       1       0.00005         - FOR v  /* vertex */, e  /* edge */ IN 1..1  /* min..maxPathDepth */ OUTBOUND exec /* startnode */  affects
 77   LimitNode               1       1       0.00000           - LIMIT 0, 1
 47   ReturnNode              1       1       0.00000           - RETURN e
 83   SubqueryStartNode       1       2       0.00020     - LET rdsWithInEdges = ( /* subquery begin */
 51   TraversalNode           1      12       0.01350       - FOR ds  /* vertex */, e  /* edge */ IN 1..1  /* min..maxPathDepth */ OUTBOUND exec /* startnode */  depends
 52   CalculationNode         1      12       0.00004         - LET #83 = [ ds, e ]   /* simple expression */
 84   SubqueryEndNode         1       1       0.00030     - RETURN  #83 ) /* subquery end */
 55   CalculationNode         1       1       0.00002     - LET readSources = rdsWithInEdges[*][0]   /* simple expression */
 81   SubqueryStartNode       1       2       0.00030     - LET vertices = ( /* subquery begin */
 58   CalculationNode         1       2       0.00054       - LET #85 = APPEND(readSources, exec)   /* simple expression */
 59   EnumerateListNode       1      13       0.00040       - FOR vert IN #85   /* list iteration */
 61   CalculationNode         1      13       0.00071         - LET #87 = ((SPLIT(vert.`_id`, "/")[0] == "dataSource") ? { "_id" : vert.`_key`, "_class" : "za.co.absa.spline.consumer.service.model.DataSourceNode", "name" : vert.`uri` } : { "_id" : vert.`_key`, "_class" : "za.co.absa.spline.consumer.service.model.ExecutionNode", "name" : vert.`extra`.`appName` })   /* simple expression */
 82   SubqueryEndNode         1       1       0.00002     - RETURN  #87 ) /* subquery end */
 49   CalculationNode         1       1       0.00001     - LET affectedDsEdge = FIRST(#27)   /* simple expression */
 56   CalculationNode         1       1       0.00002     - LET readDsEdges = rdsWithInEdges[*][1]   /* simple expression */
 79   SubqueryStartNode       1       2       0.00001     - LET edges = ( /* subquery begin */
 65   CalculationNode         1       2       0.00002       - LET #89 = APPEND(readDsEdges, affectedDsEdge)   /* simple expression */
 66   EnumerateListNode       1      13       0.00046       - FOR edge IN #89   /* list iteration */
 67   CalculationNode         1      13       0.00005         - LET edgeType = SPLIT(edge.`_id`, "/")[0]   /* simple expression */
 68   CalculationNode         1      13       0.00004         - LET exKey = SPLIT(edge.`_from`, "/")[1]   /* simple expression */
 69   CalculationNode         1      13       0.00004         - LET dsKey = SPLIT(edge.`_to`, "/")[1]   /* simple expression */
 70   CalculationNode         1      13       0.00003         - LET #91 = { "source" : ((edgeType == "depends") ? dsKey : exKey), "target" : ((edgeType == "affects") ? dsKey : exKey) }   /* simple expression */
 80   SubqueryEndNode         1       1       0.00001     - RETURN  #91 ) /* subquery end */
 73   CalculationNode         1       1       0.00001     - LET #93 = { "vertices" : vertices, "edges" : edges, "observedWrites" : observedWrites }   /* simple expression */
 74   ReturnNode              1       1       0.00000     - RETURN #93

Indexes used:
 By   Name      Type      Collection   Unique   Sparse   Selectivity   Fields        Ranges
 78   primary   primary   progress     true     false       100.00 %   [ `_key` ]    (p.`_key` == "fefcb74e-ea5c-40f7-bed2-ba34dff1b6f1:kbq4zyo2")
 11   edge      edge      depends      false    false        18.88 %   [ `_from` ]   base OUTBOUND
 11   edge      edge      progressOf   false    false       100.00 %   [ `_from` ]   base OUTBOUND
 13   edge      edge      writesTo     false    false        28.70 %   [ `_to` ]     base INBOUND
 16   edge      edge      executes     false    false       100.00 %   [ `_to` ]     base INBOUND
 16   edge      edge      progressOf   false    false       100.00 %   [ `_to` ]     base INBOUND
 26   edge      edge      writesTo     false    false        28.70 %   [ `_to` ]     base INBOUND
 29   edge      edge      executes     false    false       100.00 %   [ `_to` ]     base INBOUND
 29   edge      edge      progressOf   false    false       100.00 %   [ `_to` ]     base INBOUND
 41   edge      edge      progressOf   false    false       100.00 %   [ `_from` ]   base OUTBOUND
 46   edge      edge      affects      false    false       100.00 %   [ `_from` ]   base OUTBOUND
 51   edge      edge      depends      false    false        18.88 %   [ `_from` ]   base OUTBOUND

Traversals on graphs:
 Id  Depth  Vertex collections  Edge collections      Options                                  Filter / Prune Conditions
 11  2..2                       progressOf, depends   uniqueVertices: none, uniqueEdges: path                           
 13  1..1                       writesTo              uniqueVertices: none, uniqueEdges: path                           
 16  2..2                       executes, progressOf  uniqueVertices: none, uniqueEdges: path                           
 26  1..1                       writesTo              uniqueVertices: none, uniqueEdges: path                           
 29  2..2                       executes, progressOf  uniqueVertices: none, uniqueEdges: path                           
 41  1..1                       progressOf            uniqueVertices: none, uniqueEdges: path                           
 46  1..1                       affects               uniqueVertices: none, uniqueEdges: path                           
 51  1..1                       depends               uniqueVertices: none, uniqueEdges: path                           

Optimization rules applied:
 Id   RuleName
  1   move-calculations-up
  2   move-filters-up
  3   remove-unnecessary-calculations
  4   optimize-subqueries
  5   move-calculations-up-2
  6   move-filters-up-2
  7   use-indexes
  8   remove-filter-covered-by-index
  9   remove-unnecessary-calculations-2
 10   move-calculations-down
 11   splice-subqueries

Query Statistics:
 Writes Exec   Writes Ign   Scan Full   Scan Index   Filtered   Exec Time [s]
           0            0           0          271        103         0.02873

Query Profile:
 Query Stage           Duration [s]
 initializing               0.00001
 parsing                    0.00029
 optimizing ast             0.00006
 loading collections        0.00001
 instantiating plan         0.00023
 optimizing plan            0.00147
 executing                  0.02565
 finalizing                 0.00101

@DaimonPl
Contributor Author

That query returns very fast, so I think the problem is something different.

A colleague found that reducing the max depth in the query makes it very fast: maxDepth >= 7 returns the error, maxDepth <= 6 works fine (the smaller, the faster).

@wajda
Contributor

wajda commented Jun 22, 2020

Try the full one (the one that returns the graph overview for the UI):

LET startEvent = FIRST(FOR p IN progress FILTER p._key == @eventId RETURN p)
RETURN SPLINE::EVENT_LINEAGE_OVERVIEW(startEvent, 10)
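
(The second argument is the max recursion depth.)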

@wajda
Contributor

wajda commented Jun 22, 2020

Basically, SPLINE::EVENT_LINEAGE_OVERVIEW implements that recursion and executes the other queries (the ones I sent you before) against the different event IDs it finds along the path. So there might be some event ID for which one of those queries takes particularly long to run. It would be cool to find that event ID and the exact subquery.

@DaimonPl
Contributor Author

I could only get this profile (probably not very helpful). Deeper depths would not return in a reasonable waiting time (human timeout occurred ;))

Query String (168 chars, cacheable: false):
 LET startEvent = FIRST(FOR p IN progress FILTER p._key == 
 "fefcb74e-ea5c-40f7-bed2-ba34dff1b6f1:kbq4zyo2" RETURN p)
 RETURN SPLINE::EVENT_LINEAGE_OVERVIEW(startEvent, 7)

Execution plan:
 Id   NodeType          Calls   Items   Runtime [s]   Comment
  1   SingletonNode         1       1       0.00000   * ROOT
  7   SubqueryNode          1       1       0.00009     - LET #1 = ...   /* const subquery */
  2   SingletonNode         1       1       0.00000       * ROOT
 12   IndexNode             1       1       0.00007         - FOR p IN progress   /* primary index scan */    
 11   LimitNode             1       1       0.00000           - LIMIT 0, 1
  6   ReturnNode            1       1       0.00000           - RETURN p
  9   CalculationNode       1       1      44.42263     - LET #6 = SPLINE::EVENT_LINEAGE_OVERVIEW(FIRST(#1), 7)   /* user-defined function */   /* simple expression */
 10   ReturnNode            1       1       0.00001     - RETURN #6

Indexes used:
 By   Name      Type      Collection   Unique   Sparse   Selectivity   Fields       Ranges
 12   primary   primary   progress     true     false       100.00 %   [ `_key` ]   (p.`_key` == "fefcb74e-ea5c-40f7-bed2-ba34dff1b6f1:kbq4zyo2")

Optimization rules applied:
 Id   RuleName
  1   remove-unnecessary-calculations
  2   optimize-subqueries
  3   use-indexes
  4   remove-filter-covered-by-index
  5   remove-unnecessary-calculations-2

Query Statistics:
 Writes Exec   Writes Ign   Scan Full   Scan Index   Filtered   Exec Time [s]
           0            0           0            1          0        44.42360

Query Profile:
 Query Stage           Duration [s]
 initializing               0.00000
 parsing                    0.00007
 optimizing ast             0.00001
 loading collections        0.00001
 instantiating plan         0.00003
 optimizing plan            0.00018
 executing                 44.42276
 finalizing                 0.00051

@wajda
Contributor

wajda commented Jun 22, 2020

44 seconds?? That's a lot :)

maxDepth>=7 returns error.

Is the difference between 6 and 7 much more noticeable than between 5 and 6? I'm trying to understand whether every recursion cycle takes a similar time, or whether there is a job/event on which the query takes significantly longer...

Can you share the screenshot of the graph that still works? Or at least the returned JSON from the REST request?

@DaimonPl
Contributor Author

Depth times:

7 - 44 seconds
6 - 12 seconds
5 - 2.3 seconds
4 - 0.5 seconds
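
So each extra level multiplies the runtime by roughly 4-5x (44/12 ≈ 3.7, 12/2.3 ≈ 5.2, 2.3/0.5 = 4.6), which looks exponential.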

@DaimonPl
Contributor Author

DaimonPl commented Jun 22, 2020

Here's the graph from the first job run (3 writes):

(screenshot)

Here's the graph from the second job run (time +1 hour, 3 new writes and 3 previous writes):

(screenshot)

I think the third run is the first to fail.

@DaimonPl
Contributor Author

It's some kind of exponential explosion :)

@wajda
Contributor

wajda commented Jun 22, 2020

Don't tell me that :)

@wajda
Contributor

wajda commented Jun 22, 2020

So now I feel that I need to go home. It seems we've got plenty of work for the coming days :))

@wajda
Contributor

wajda commented Jun 22, 2020

But... well... it's graph traversal after all; the time complexity is expected to be exponential. The problem is that it's too steep, right?

@DaimonPl
Contributor Author

DaimonPl commented Jun 22, 2020

I think it's a specific scenario:

  • the job writes 3 partitions to a table (a single Hive datasource URI)
  • when calculating those writes it reads the same table as one of the sources (the same URI as the writes)
  • it is calculated hourly
  • the first job run displays 3 exec plans (3 writes)
  • as a result, the second job run displays 3*3 exec plans
  • the third job run would probably display 3*3*3, and so on
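
In other words, run k displays roughly 3^k exec plans (3, 9, 27, ...), so with hourly runs the overview blows up after just a few hours.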

@DaimonPl
Contributor Author

DaimonPl commented Jun 29, 2020

I think you can safely narrow it down to this scenario; the other dependencies are rather normal.

1st run (Ja, Jb, Jc are 3 exec plans of job J):

T -> Ja -> T
T -> Jb -> T
T -> Jc -> T

2nd run:

`T` -> `Ja` \
`T` -> `Jb`  -> `T` -> `J+1a` -> `T`
`T` -> `Jc` /

`T` -> `Ja` \
`T` -> `Jb`  -> `T` -> `J+1b` -> `T`
`T` -> `Jc` /

`T` -> `Ja` \
`T` -> `Jb`  -> `T` -> `J+1c` -> `T`
`T` -> `Jc` /

3rd run:

`T` -> `Ja` \
`T` -> `Jb`  -> `T` -> `J+1a` -> `T` \
`T` -> `Jc` /                         \
                                       \
`T` -> `Ja` \                           \
`T` -> `Jb`  -> `T` -> `J+1b` -> `T` -----> `J+2a` -> `T`
`T` -> `Jc` /                           /
                                       /
`T` -> `Ja` \                         /
`T` -> `Jb`  -> `T` -> `J+1c` -> `T` /
`T` -> `Jc` /

`T` -> `Ja` \
`T` -> `Jb`  -> `T` -> `J+1a` -> `T` \
`T` -> `Jc` /                         \
                                       \
`T` -> `Ja` \                           \
`T` -> `Jb`  -> `T` -> `J+1b` -> `T` -----> `J+2b` -> `T`
`T` -> `Jc` /                           /
                                       /
`T` -> `Ja` \                         /
`T` -> `Jb`  -> `T` -> `J+1c` -> `T` /
`T` -> `Jc` /

`T` -> `Ja` \
`T` -> `Jb`  -> `T` -> `J+1a` -> `T` \
`T` -> `Jc` /                         \
                                       \
`T` -> `Ja` \                           \
`T` -> `Jb`  -> `T` -> `J+1b` -> `T` -----> `J+2c` -> `T`
`T` -> `Jc` /                           /
                                       /
`T` -> `Ja` \                         /
`T` -> `Jb`  -> `T` -> `J+1c` -> `T` /
`T` -> `Jc` /

I won't try the 4th run, but that's the explosion caused by subsequent job runs :)

So it's not a single dependency-graph pattern. It's a single-job pattern which explodes when the job is run multiple times. That job is run hourly, so after 6-7 hours it's not possible to display such a graph at all.

@DaimonPl
Contributor Author

I also think that it's not a bug per se - this dependency pattern behaves this way and we cannot change it. Things which can be done here for sure:

  • reporting only the first run instead of each run - that's what we did, and it's enough for lineage purposes
  • configuring the default depth (that's what I was asking about already)
  • stopping graph traversal after some node/edge limit - though I don't know if this would be possible with Arango

@wajda
Contributor

wajda commented Jun 29, 2020

++

  • memoize the recursive function to avoid repeated path traversals (see the sketch below)
  • offload some computation from the UDF to AQL to avoid the overhead of converting data structures to/from V8, potentially hit some indices, and/or execute it on sharded data nodes instead of the coordinator
  • pre-link the nodes with additional edges to shift some computation from request time to the background, and avoid repeating it
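
To illustrate the memoization point, a minimal sketch in plain JavaScript (lineageOfEvent and the cache shape are illustrative; the real UDF is more involved):

function memoizedOverview(lineageOfEvent) {
    var cache = {};
    return function overview(eventId, depth) {
        var key = eventId + ":" + depth;
        if (!(key in cache)) {
            // Compute once per (event, depth). When the traversal visits the
            // same event again - which happens a lot in the "explosive"
            // pattern above - reuse the stored partial graph instead of
            // re-traversing the whole subtree.
            cache[key] = lineageOfEvent(eventId, depth, overview);
        }
        return cache[key];
    };
}

Since each T -> Ja/Jb/Jc subtree would then be computed once instead of once per path, the cost drops from exponential to roughly linear in the number of distinct events.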

@wajda
Contributor

wajda commented Jun 30, 2020

Three questions:

  • which exact ArangoDB version are you using?
  • which Spline Admin CLI version did you use to create/upgrade the Spline db?
  • which Spline UI version are you using?

@wajda
Contributor

wajda commented Jun 30, 2020

I tried a 30-level nested graph and the UI responded in less than a second:

(screenshot: depth limit popup)

Have you seen this popup on your UI?

@DaimonPl
Contributor Author

DaimonPl commented Jun 30, 2020

Arango 3.6.4
Spline 0.5.3

Yes, I've seen the depth limit. The problem is not the depth for typical graphs; in this case the graph explodes very wide before it reaches the depth limit of 10.

Note that the detailed lineages are huge in my case; maybe it's connected? A huge graph with huge detailed lineages.

@wajda
Contributor

wajda commented Jun 30, 2020

It shouldn't matter; the overview query doesn't touch the detailed lineage nodes.

@DaimonPl
Contributor Author

And what about #720? The schema is quite big (maybe 100 fields in nested structs).

@wajda
Contributor

wajda commented Jun 30, 2020

As long as it makes it into the DB, it's OK. The overview query only depends on the high-level structure of the data pipeline in terms of reads, writes and data sources.

@DaimonPl
Contributor Author

So maybe it's the additional data sources. As you can see on the screenshots, each of those "explosive" nodes has 10 other input sources.

@wajda
Contributor

wajda commented Jun 30, 2020

It's rather the number of non-terminal sources read by every execution. This is what makes the base of the exponent. I was able to flood my local setup with a 4^10 kind of query.

@wajda
Contributor

wajda commented Jul 1, 2020

@DaimonPl can you please test this commit? fbb6ae5
It's enough to build it and run java -jar admin-cli.jar db-upgrade ...

In my setup it decreased the query time from 50 seconds to 95 milliseconds :)

@DaimonPl
Contributor Author

DaimonPl commented Jul 1, 2020

@wajda is it possible to apply it directly on Arango? It would be much easier in my environment than preparing and deploying a custom build.

@wajda
Contributor

wajda commented Jul 1, 2020

You need to replace a UDF. It is possible to do it directly, but it's hardly easier: you need to call the ArangoDB REST API or use arangosh.
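
Off the top of my head, something along these lines in arangosh should do it (a sketch; the function body is a placeholder, take the actual one from the commit):

// arangosh, connected to the Spline database:
//   arangosh --server.database spline
var aqlfunctions = require("@arangodb/aql/functions");

// Registering under an existing name replaces the old definition.
aqlfunctions.register(
    "SPLINE::EVENT_LINEAGE_OVERVIEW",
    function (startEvent, maxDepth) {
        // ...paste the updated UDF body from the commit here...
    },
    true  // deterministic
);

Over plain REST it should be a POST to the /_api/aqlfunction endpoint with the name and code in the JSON body; mind the database prefix in the URL (i.e. /_db/spline/_api/aqlfunction).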

@DaimonPl
Contributor Author

DaimonPl commented Jul 1, 2020

Sending a request with Postman is 1000x easier and faster than preparing custom artifacts and deploying them on a Puppet environment :)

I'll try to set it up and test tomorrow.

As I understand it changes the query only, so it's not required to re-send lineages?

@wajda
Contributor

wajda commented Jul 1, 2020

That's right. No Spline re-deploy or data change is required.

wajda added a commit that referenced this issue Jul 2, 2020
@wajda
Contributor

wajda commented Jul 2, 2020

Changed the caching method a bit and saved even more CPU:
40b360b
Now it copes quite OK with my somewhat extreme case: every job reads from 4 primary sources (each of which contains 10 appends from the previous executions of the same jobs) and overwrites one of them. It additionally reads 1 to 13 secondary sources at random. So overall, the average vertex degree in the graph is about 50-ish, and the depth is 20.

My test page contains about 600 nodes and 6000 edges, making the graph insane.
The server responded in about 9 seconds, and then the majority of the time (about 1 minute) was spent by Chrome rendering the beast.

(screenshot)

I believe that the major bottleneck can be considered solved at this point.

@DaimonPl
Contributor Author

DaimonPl commented Jul 2, 2020

Great. I'm still gathering lineage data on our side; not sure if I'll manage to test it today.

@wajda wajda removed the help wanted label Jul 2, 2020
@DaimonPl
Contributor Author

DaimonPl commented Jul 2, 2020

@wajda what should the function name be in the HTTP POST request? I tried SPLINE::EVENT_LINEAGE_OVERVIEW but it looks like it created a new function instead of replacing the previous one?

@wajda
Contributor

wajda commented Jul 2, 2020

Honestly, I've never registered UDFs directly :( Need to try it first.

@DaimonPl
Contributor Author

DaimonPl commented Jul 2, 2020

OK, I managed to do it; I was missing /_db/spline/ in the URI.

With your changes, the graph which was previously failing now displays "fine" ;)

(screenshot)

@DaimonPl
Contributor Author

DaimonPl commented Jul 2, 2020

The next iteration as well:

(screenshot)

wajda added a commit that referenced this issue Jul 3, 2020
* spline #738 lineage-overview - add memoization to Arango UDF

* spline #738 lineage-overview - replace sub-function memoization with the recursive function memoization
@wajda wajda closed this as completed Aug 20, 2020