New process graph structure #160

Closed · m-mohr opened this issue Dec 11, 2018 · 17 comments

@m-mohr (Member) commented Dec 11, 2018

All assignees, it would be very helpful if you could review this issue carefully. Thanks!

Updated on 14/12/2018 with a new approach as the old approach was very hard to parse/execute.

Proposal for a new process graph structure

Over the last few days we found that the current process graph definition has several issues (see below). All of the issues discussed in the VITO sprint are solved by the new graph-based approach described here. This proposal differs a bit from the one discussed at VITO, but it keeps the promising core idea of actually making the process graph a real graph-like structure. It radically changes the process graph structure, but gives us a lot more flexibility. In the following I'll provide an example:

Algorithm

Assume we would like to execute the following (not very meaningful) algorithm:

Workflow

(Figure not reproduced: process chain in blue, callbacks in yellow.) Process descriptions are available here: http://processes.openeo.org/sprint

JavaScript client code example

The process graph could be generated by the following client code:

var b = new ProcessGraphBuilder();
var collection = b.process("get_collection", {name: "Sentinel-1"});
// filter_temporal
var dateFilter1 = b.process("filter_temporal", {data: collection, from: "2017-01-01", to: "2017-01-31"});
var dateFilter2 = b.process("filter_temporal", {data: collection, from: "2018-01-01", to: "2018-01-31"});
var merge = b.process("merge_collections", {data1: dateFilter1, data2: dateFilter2});
b.process("export", {data: merge, format: 'png'});
// minimum time
var minTime = b.process("reduce", {
	data: merge,
	dimension: "temporal",
	reducer: (builder, params) => builder.process("min", {data: params.dimension_data, dimension: params.dimension})
});
var bandFilter = b.process("filter_bands", {data: minTime, bands: ["nir", "red"]});
// NDVI (manually)
var ndvi = b.process("reduce", {
	data: bandFilter,
	dimension: "spectral",
	reducer: (builder, params) => {
		var result = builder.process("divide", {
			x: builder.process("substract", {data: params.dimension_data}),
			y: builder.process("sum", {data: params.dimension_data})
		});
		builder.process('output', {data: result});
		return result;
	}
})
// Export
var result = b.process("export", {data: ndvi, format: 'png'});
// Generate JSON
var createdProcessGraph = b.generate(result);

Process graph

This translates into the following JSON encoding for the process graph:

{
    "export1": {
        "arguments": {
            "data": {
                "from_node": "mergec1"
            },
            "format": "png"
        },
        "process_id": "export"
    },
    "export2": {
        "arguments": {
            "data": {
                "from_node": "reduce2"
            },
            "format": "png"
        },
        "process_id": "export",
        "result": true
    },
    "filter1": {
        "arguments": {
            "data": {
                "from_node": "getcol1"
            },
            "from": "2017-01-01",
            "to": "2017-01-31"
        },
        "process_id": "filter_temporal"
    },
    "filter2": {
        "arguments": {
            "data": {
                "from_node": "getcol1"
            },
            "from": "2018-01-01",
            "to": "2018-01-31"
        },
        "process_id": "filter_temporal"
    },
    "filter3": {
        "arguments": {
            "bands": [
                "nir",
                "red"
            ],
            "data": {
                "from_node": "reduce1"
            }
        },
        "process_id": "filter_bands"
    },
    "getcol1": {
        "arguments": {
            "name": "Sentinel-1"
        },
        "process_id": "get_collection"
    },
    "mergec1": {
        "arguments": {
            "data1": {
                "from_node": "filter1"
            },
            "data2": {
                "from_node": "filter2"
            }
        },
        "process_id": "merge_collections"
    },
    "reduce1": {
        "arguments": {
            "data": {
                "from_node": "mergec1"
            },
            "dimension": "temporal",
            "reducer": {
                "callback": {
                    "min1": {
                        "arguments": {
                            "data": {
                                "from_argument": "dimension_data"
                            },
                            "dimension": {
                                "from_argument": "dimension"
                            }
                        },
                        "process_id": "min",
                        "result": true
                    }
                }
            }
        },
        "process_id": "reduce"
    },
    "reduce2": {
        "arguments": {
            "data": {
                "from_node": "filter3"
            },
            "dimension": "spectral",
            "reducer": {
                "callback": {
                    "divide1": {
                        "arguments": {
                            "x": {
                                "from_node": "substr1"
                            },
                            "y": {
                                "from_node": "sum1"
                            }
                        },
                        "process_id": "divide",
                        "result": true
                    },
                    "output1": {
                        "arguments": {
                            "data": {
                                "from_node": "divide1"
                            }
                        },
                        "process_id": "output"
                    },
                    "substr1": {
                        "arguments": {
                            "data": {
                                "from_argument": "dimension_data"
                            }
                        },
                        "process_id": "substract"
                    },
                    "sum1": {
                        "arguments": {
                            "data": {
                                "from_argument": "dimension_data"
                            }
                        },
                        "process_id": "sum"
                    }
                }
            }
        },
        "process_id": "reduce"
    }
}

Processes

Each process in the graph is assigned a unique graph id (e.g. export2), and an argument can take its data from another process by referencing that process's graph id in the from_node property. Arguments can either receive the result of the referenced node implicitly, or receive "internal" data from the calling process, such as a single element (element) in filter, or the dimension data (dim_data) and the dimension (dimension) in reduce. The parameters that are passed to callbacks are specified in the JSON schema of the process parameters. For examples see the following two processes, especially the schemas for reducer and expression:

{
    "id": "reduce",
    "description": "reduce dimensions - lower dimensionality, same resolution.",
    "parameters": {
        "data": {
            "schema": {
                "type": "object",
                "format": "image-collection"
            },
            "required": true
        },
        "reducer": {
            "schema": {
                "type": "object",
                "format": "callback",
                "properties": {
                    "dim_data": {
                        "description": "An image-collection with exactly one dimension to reduce.",
                        "type": "object",
                        "format": "image-collection"
                    },
                    "dimension": {
                        "description": "The dimension string",
                        "type": "string"
                    }
                }
            },
            "required": true
        },
        "dimension": {
            "schema": {
                "type": "string"
            },
            "required": true
        }
    },
    "returns": {
        "schema": {
            "type": "object",
            "format": "image-collection"
        }
    },
    "summary": "reduce dimensions - lower dimensionality, same resolution.",
    "categories": [
        "core",
        "reducer"
    ]
}
{
    "id": "filter",
    "description": "Example: Filter by `instrument_mode` from STAC properties.",
    "parameters": {
        "data": {
            "schema": {
                "oneOf": [
                    {
                        "type": "object",
                        "format": "image-collection"
                    },
                    {
                        "type": "object",
                        "format": "vector-collection"
                    }
                ]
            },
            "required": true
        },
        "expression": {
            "schema": {
                "type": "object",
                "format": "callback",
                "properties": {
                    "element": {
                        "description": "Each element in the collection is passed to the callback.",
                        "type": "object",
                        "format": "collection-element"
                    }
                }
            },
            "required": true
        },
        "dimension": {
            "schema": {
                "type": "string"
            },
            "required": true
        }
    },
    "returns": {
        "schema": {
            "oneOf": [
                {
                    "type": "object",
                    "format": "image-collection"
                },
                {
                    "type": "object",
                    "format": "vector-collection"
                }
            ]
        }
    },
    "summary": "Filtering \/ Selecting data based on a logical expression",
    "categories": [
        "core",
        "filter"
    ]
}

(Remark: Having the callback parameters in the JSON schema may be a problem for validation, so we may need to move them one level up and define them directly in the process parameters, but we'll find that out in the coming months. It doesn't change anything major about the approach discussed here.)

Callbacks are process graphs of their own and are set using the callback property.

As multiple end nodes are possible (see the example above), it can be important for web services or stored process graphs to have exactly one end node whose result can be referenced and used. Therefore, exactly one node needs to have the result flag set to true. Each callback has a result node as well, just like the "main" process graph.

Processing the process graph

I made implementations in JS for both generating (client-side) and parsing/executing (server-side) process graphs. Both solutions work and can be found here: https://github.com/Open-EO/openeo-js-client/tree/new-processgraph-builder

As a back-end, you go through all nodes/processes in the list and record for each node which nodes it passes data to and which nodes it expects data from. In a second iteration, the back-end can find all start nodes for processing by checking for zero dependencies (i.e. node.expectsFrom.length === 0 in JS).

You can then execute the start nodes (in parallel, if possible) and pass their results on to the dependent nodes identified beforehand. A node that depends on multiple inputs must only be executed once its last dependency has finished.

Please be aware that the result node (result set to true) is not necessarily the last node that is executed. The author of the process graph may choose to mark a non-end node as the result node!
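
To make the traversal concrete, here is a minimal sketch in JS (hypothetical helper code, not the reference implementation linked above; it assumes synchronous process implementations passed in as a processes map keyed by process_id, and omits callbacks/from_argument for brevity):

function executeGraph(graph, processes) {
	// Pass 1: record for each node which nodes it expects data from.
	var dependsOn = {};
	for (var id in graph) {
		dependsOn[id] = [];
		var args = graph[id].arguments || {};
		for (var name in args) {
			var arg = args[name];
			if (arg && typeof arg === 'object' && arg.from_node) {
				dependsOn[id].push(arg.from_node);
			}
		}
	}
	// Pass 2: repeatedly execute all nodes whose dependencies have finished.
	// Start nodes are simply the nodes with zero dependencies.
	var finished = {};
	var pending = Object.keys(dependsOn);
	while (pending.length > 0) {
		var ready = pending.filter(id => dependsOn[id].every(d => d in finished));
		if (ready.length === 0) {
			throw new Error("Cycle or unresolved from_node reference");
		}
		ready.forEach(id => {
			// Resolve from_node references to the already computed results.
			var resolved = {};
			var args = graph[id].arguments || {};
			for (var name in args) {
				var arg = args[name];
				resolved[name] = (arg && typeof arg === 'object' && arg.from_node) ? finished[arg.from_node] : arg;
			}
			finished[id] = processes[graph[id].process_id](resolved);
		});
		pending = pending.filter(id => !(id in finished));
	}
	// The overall result comes from the node flagged with "result": true,
	// which is not necessarily the last node executed.
	var resultId = Object.keys(graph).find(id => graph[id].result === true);
	return finished[resultId];
}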

Issues (as discussed in the VITO sprint)

As noted above, during the VITO sprint we found several issues with the previous process graph definition (see below). More complex process graphs don't work with the previous JSON encoding because it was basically a tree, so, for instance, branching from one imported collection into several "parallel" processing chains doesn't work well. In the end, that led to limitations such as not being able to pass two bands into a multiply(x, y) method.

  1. How to integrate the "new" concept of callbacks into the process graph? Currently not possible as validation/execution would fail (parameters missing, wrong order of execution).
    • Special handling in validation?
    • Introduce a new type of process object, e.g. by replacing process_id by callback_id? Would not get executed directly and would be validated differently.
    • Passing the callback process name as string doesn't work as some callbacks need further parameters to be specified. Adding them to the calling process would make validation very complex.
  2. Passing parameters to callbacks:
    • Passing by order => doesn't work as JSON objects don't have an order.
    • Passing by name => All parameter names need to be in sync, could potentially lead to conflicts.
    • Introduce a variable? JSON pointer (=> too complex)? Special type of already specified process graph variables?
  3. How to branch? => JSON is a tree, we need a graph (as the name suggests).
  4. Sometimes it is helpful to declare variables, e.g. for easier band arithmetic (reference a band from a collection) or getting properties from a collection (see property). How to solve? We could add get and set for a first draft.
    • Allow running a sequential list of process graphs? For instance, the first for declaring variables and the second for the algorithm?
    • Or add a "declarations" property to jobs etc. that holds a list of process graphs which are executed first, with their results stored in variables, before the process_graph itself is executed? This would make external validation harder, as it is no longer one single piece that users can insert, for instance, into the openEO Hub.

@mkadunc (Member) commented Dec 12, 2018

A couple of comments to the above:

  • binding between the callback and the outer process node (reduce or filter) is specified twice in JSON (e.g. reduce1.arguments.reducer.call_process == min1 and min1.arguments.data.from_process == reduce1); this makes the dependency cyclic (it's not clear which way it should be traversed) and prevents reusing the same callback as part of many filters and/or reducers. I understand that these are two different connections and can be non-trivial, but IMO it makes the situation a bit hard to read
  • the fact that many nodes in the callback can (need to) reference the outer context makes things look more complex than they really are
  • the two-way connection makes it unclear which node is calling the shots and controlling the flow, and it's quite hard to reason about (e.g. using your intuition) in an asynchronous setting
  • the naming in JSON confuses process (type of operation) with process node (instance of operation inside the graph, with all its connections) - I suggest we use some other term instead of from_process when referencing specific nodes within the graph

@m-mohr (Member, Author) commented Dec 12, 2018

Thanks, @mkadunc!

  • binding between the callback and the outer process node (reduce or filter) is specified twice in JSON (e.g. reduce1.arguments.reducer.call_process == min1 and min1.arguments.data.from_process == reduce1); this makes the dependency cyclic (it's not clear which way it should be traversed) and prevents reusing the same callback as part of many filters and/or reducers. I understand that these are two different connections and can be non-trivial, but IMO it makes the situation a bit hard to read
  • the fact that many nodes in the callback can (need to) reference the outer context makes things look more complex than they really are
  • the two-way connection makes it unclear which node is calling the shots and controlling the flow, and it's quite hard to reason about (e.g. using your intuition) in an asynchronous setting

If I understand you correctly, all three points are basically saying that we should remove the from_process in the "receiving" callback processes, right? This is actually a really good idea. I removed the from_process from the callback processes which receive data (re-review above). It really looks simpler, and reusing callbacks sounds very useful, too.

  • the naming in JSON confuses process (type of operation) with process node (instance of operation inside the graph, with all its connections) - I suggest we use some other term instead of from_process when referencing specific nodes within the graph

We have basically three "magic" keywords at the moment: from_process (expect_from in the VITO proposal), call_process and receive. I'm fine with renaming them. We could replace "process" with "node", for example, but I'm open to other suggestions, too. receive is also a bit confusing, as from_process is basically also a "receive". What would you suggest? It just needs to minimize the possibility of conflicts with other potentially passed JSON objects (e.g. just id would probably collide with GeoJSON).

m-mohr changed the title from "Issues with the process graph" to "New process graph structure" on Dec 12, 2018

@m-mohr (Member, Author) commented Dec 14, 2018

Updated on 14/12/2018 with a new approach as the old approach was very hard to parse/execute. It's still a graph, but the callbacks are handled differently. "Reference" implementations in JS for building and parsing/executing are available in the JS client repository for now. This will be implemented as described here in openEO API v0.4 if no other comments are received. Awaiting your feedback, thanks.

@mkadunc As you brought up re-using callbacks: if you re-use a callback in a client, the processes get duplicated in the graph, as each callback is simply generated as a child process graph of the calling method (see the sketch below). If this gets too messy, the user has to store the process graph and reference it. The issue is that the builder doesn't know whether a callback function is used once or twice.
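
For illustration, with the builder from the initial post (b being the ProcessGraphBuilder and a a hypothetical input node):

// The same JS callback function passed to two reduce calls:
var minReducer = (builder, params) => builder.process("min", {data: params.dimension_data, dimension: params.dimension});
var r1 = b.process("reduce", {data: a, dimension: "temporal", reducer: minReducer});
var r2 = b.process("reduce", {data: a, dimension: "spectral", reducer: minReducer});
// b.generate(...) emits two independent copies of the "min" callback sub-graph,
// one inside each reduce node, because the builder can't tell that the function was re-used.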

@mkadunc (Member) commented Dec 18, 2018

Possible typo: divide1 and output1 in the second callback both look like something that is collected by the outer reduce2 function.

@mkadunc (Member) commented Dec 18, 2018

In case it's useful, a compact pseudo-representation of the graph:

getcol1 = get_collection(name: "Sentinel-1")
return 
	merge_collections( //mergec1
		data1: getcol1.filter_temporal(from: "2017-01-01", to: "2017-01-31"), 
		data2: getcol1.filter_temporal(from: "2018-01-01", to: "2018-01-31")
	)
	.export(format: "png") //export1
	.reduce(dimension: "temporal", reducer: new callback({ //reduce1
		return args["dimension_data"].min(dimension: args["dimension"]) //min1
	}))
	.filter_bands(bands: ["nir", "red"]) 	//filter3
	.reduce(dimension: "spectral", reducer: new callback({ 	//reduce2
	    temp = args["dimension_data"]
		return divide1 = divide(x: temp.subtract(), y: temp.sum())
		output1 = divide1.output()
	}))
	.export(format: "png") //export2

(you get something more JSON-like if you substitute ... obj.process ... with ... process(data: obj, ...))

@m-mohr (Member, Author) commented Dec 18, 2018

Thanks, @mkadunc .

Possible typo: divide1 and output1 in the second callback both look like something that is collected by the outer reduce2 function.

divide1 is "collected" by reduce2, which is communicated via the result flag, but output1 is not really collected; it is just a (stupid?) example, which basically sends every result of the divide function to the WebSocket for monitoring purposes. Did I miss anything here?

(you get something more JSON-like if you substitute ... obj.process ... with ... process(data: obj, ...)

Sure, my example above is just based on a not-very-elegant JS process graph builder written as a proof of concept. In the long run you'd generate something that feels more "native", i.e. a class with methods generated from the process definitions returned by the connected back-end. Still, you need to pass the data, which you omitted. You can do some "magic" there to pass it automatically, but that may fail in some cases. For example, your first call to export could be a problem, as it returns a boolean instead of a collection, so the following reducer would probably fail.

@mkadunc (Member) commented Dec 18, 2018

Did I miss anything here?

Nope, I missed the WebSocket output thing...

You can do some "magic" there to pass it automatically, but that may fail in some cases.

Don't worry - I wasn't trying to define a new encoding, just rewrote the JSON into some pseudo-language to get a better overview of the process graph.

the user has to store the process graph and reference it

Right now we have magic strings coming into the callback process, which limits reuse of stored (sub)processes - i.e. the callback process needs all its arguments to have identical names in all contexts where it's executed.

I agree about the need to copy graphs - as it is now, the callback sub-process is parameterized with magic strings (free symbols), which might get resolved/bound to different things in different situations.

We could improve this by defining callbacks as pure functions (subprocesses) that:

  • are executed in their own scope (i.e. cannot access anything outside their own definition)
  • get all of their parameters bound whenever they are used
  • return the result of a single process node (we could have void processes that don't return, though)
    e.g.:
callback1 = new callback({
    temp = args["pixelPair"]
    return divide1 = divide(x: temp.subtract(), y: temp.sum())
})

.reduce(dimension: "spectral", reducer: {
    process: callback1,
    binding: {
        "pixelPair" : "dimension_data" 
    }
})
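
In the JSON encoding, the reducer argument might then look something like this (a purely hypothetical fragment mirroring the pseudo-code above, assuming callback1 can be referenced as a stored process):

"reducer": {
    "callback": "callback1",
    "binding": {
        "pixelPair": "dimension_data"
    }
}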

In the example above, callback1 becomes its own process, very much like pre-defined processes such as divide or UDFs.

The callback can be called from anywhere, as it doesn't depend on any data values being available outside of its definition.

In the example, the parameter of callback, "pixelPair", is defined implicitly. Maybe there's value in defining function parameters explicitly, in which case the callback constructor could get a list of all parameters that the callback takes (similar to UDF, I guess).

Another nice feature is that "dimension_data" - the magic internal variable that only exists within the implementation of 'reduce' - is only used once, and only inside the arguments of the 'reduce' call.

There are a couple of nice benefits of having callbacks be equivalent to pre-defined processes and/or UDFs:

  • you can use a UDF directly in callback role, without special constructs
  • you can use callbacks in many places
  • callbacks are actually parameterized processes - saving their definitions would allow nesting user-defined processes and building more complex processing chains from basic building blocks
  • you can use system-defined processes in callbacks directly, e.g.: (data.reduce(reducer: { process: min, binding: { "data": "dimension_data" } }))
  • you can execute a callback outside of any iterator, e.g. on literal values

@m-mohr (Member, Author) commented Dec 18, 2018

Right now we have magic strings coming into the callback process, which limits reuse of stored (sub)processes - i.e. the callback process needs all its arguments to have identical names in all contexts where it's executed.

No, this is not the way stored process graphs work as of now. Process graphs by themselves don't have "unfilled"/flexible parameters, as that would lead to a non-validating process graph, which would therefore be rejected. To fill such flexible parameters, we introduced process graph variables in API v0.3: https://open-eo.github.io/openeo-api/v/0.4.0/processgraphs/#variables . So you'd need to define these variables and set them in the get_processgraph process.
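
For reference, a process graph variable in the linked documentation is an object used in place of an argument value, roughly of this shape (illustrative only; see the link for the normative definition):

"from": {
    "variable_id": "start_date",
    "type": "string",
    "default": "2017-01-01"
}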

Nevertheless, it is true that from_argument and the process graph variables do similar things, so we may want to consider merging these concepts. I'll need to think a bit more about it tomorrow.

We could improve this by defining callbacks as pure functions (subprocesses)

I think this is already the case. On the client side, a callback in the JS builder is a native JS function, which can be re-used. In the process graph, callbacks are still duplicated. To me it's not completely clear whether you are speaking about the process graph or the client implementation here?

There are a couple of nice benefits of having callbacks be equivalent to pre-defined processes and/pr UDFs

We have the same idea here. That's also how I'd like it to be, and I think it already is, but it may need some fine-tuning. For example, I removed the binding from the last proposal, but it may still be useful, and it seems it should be added again. That could be an optional binding property that sits on the same level as the callback property and defines a mapping; if not present, it just maps/passes through. But I am not sure yet whether that works, as we don't know the parameter names of the callback, do we? @mkadunc

@mkadunc (Member) commented Dec 21, 2018

To me it's not completely clear whether you are speaking about the process graph or the client implementation here?

Both - I mostly want to get as much clarity as possible in terms of the conceptual domain model (of processes, nodes and variables) and the conceptual model of the expression syntax that we're building.

Things such as:

  • what is the scope of symbols;
  • where does the binding between values and symbols happen;
  • which variables can be read from where;
  • which variables can be updated from where (if at all; having everything immutable makes sense...);
  • how are values passed around the graph;
  • how does each part of the graph execution get triggered (and how many times).

After re-reading your examples in light of your comments I agree that your current proposal is a good way to go forward, and maybe find opportunities for consolidation of concepts later.

Regarding the binding property - I suggest that we start by talking about a single canonical full-form encoding — either the callback's parameter names are defined by the calling function, or there's always a binding property that defines the mapping. Then we can handle any optional short-hand and/or syntactic sugar in terms of what it would look like in that canonical form (e.g. what would be the equivalent value of the binding property that would get the same result as not specifying it at all).
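
For instance (hypothetical syntax), the binding value equivalent to not specifying the property at all would be the identity mapping over the calling process's declared callback parameters, e.g. for reduce:

"binding": {
    "dim_data": "dim_data",
    "dimension": "dimension"
}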

jdries added a commit to Open-EO/openeo-python-client that referenced this issue Jan 2, 2019
jdries added a commit to Open-EO/openeo-python-driver that referenced this issue Jan 3, 2019
jdries added a commit to Open-EO/openeo-python-driver that referenced this issue Jan 16, 2019
jdries added a commit to Open-EO/openeo-python-client that referenced this issue Jan 16, 2019
jdries added a commit to Open-EO/openeo-geotrellis-extensions that referenced this issue Jan 16, 2019

@mkadunc (Member) commented Feb 1, 2019

Another thought about binding: I realized that, for me, some of the confusion around free parameters came from the fact that I expect, from other languages, that the order of parameters defines the interface of required lambda expressions (and other types of pure functions) — in openEO we opted for named parameters and insignificant order. I'm not arguing for changing this, just trying to work out how this decision influences callbacks.

TL;DR: I don't think we need binding - our choice of significant parameter names means that the higher-order functions (reduce, map, foreach etc.) will define the parameter names of the callback function, which can be used in anonymous (inline or in-place) callbacks or, in the case of explicit callback functions, should be identical to the callback's own declared parameters.

P.S.: If we support anonymous/inline callback, binding can be quite easily achieved manually by wrapping the function call in an inline expression:

map("array": myArrayA, "callback": { myDifferentCb("itm": item, "idx": index) }).


longer version

Maybe best to demonstrate with a couple of examples - start with Java / JavaScript / TypeScript where order defines meaning of params. The callback is a function that takes some number of parameters of a certain type in order (names are there just for convenience):

  // parameter names "item" and "index" are irrelevant
  function map<T, U>(array: Array<T>, callbackfn: (item: T, index: number) => U): U[] {
    const ret: U[] = [];
    for (let i = 0; i < array.length; i++) {
        const element = array[i];
        ret.push(callbackfn(element, i)); // name is irrelevant; 'element' is bound to the first param of callbackfn
    }
    return ret;
  }

  const myArray: number[] = [5, 4, 2];
  const changedArray: string[] = map(myArray,
    // the names p1 and p2 are local to the callback function
    // we explicitly bind names to positions here, then use the names in the callback expression
    (p1, p2): string => String(p1) + "_" + p2
  );

An extreme version of parameter-name-is-not-important is LISP's lambda expressions — or Mathematica's "pure functions" — which don't even bother with parameter names and allow one to write the callback with just the parameter position in the input:

    (* callback is a Function that will be called with 2 parameters; their names are irrelevant *)
    MapIndexed[callback_Function, array_List] := Table[callback[array[[idx]], idx], {idx, Length[array]}];

    myArray = {5, 4, 2};
    changedArray = MapIndexed[
        (* the following line is the complete definition of the callback - '#1' and '#2' are parameter references *)
        (* '&' indicates that this is a pure function *)
        (* no explicit binding *)
        (ToString[#1] <> "_" <> ToString[#2])&
        , myArray
    ]

In openEO, or another language with significant parameter names (and insignificant order), the interface of the callback function prescribes the names — this eliminates the need for new local names in the definition of the callback, and is close to LISP's or Mathematica's pure functions. The example here uses some fictional notation for parameterized types, and an anonymous declaration of the callback function:

    // callback will be evaluated with parameters "item" and "index"; their order is not important
    function map<T,U>(array: Array<T>, callback: Function<index:number, item:T>) {
        let ret = [];
        for (let i = 0; i < array.length; i++) {
            ret = array_push(
                "array": ret, 
                "element": callback("item": get_element("array":array, "index":i),
                                             "index": i)
            );
        }
        return ret;
    }
    
    myArrayB = map("array": myArrayA,
        "callback":
        // start anonymous function block;
        // signature of the function is inferred from 
        // the definition of 'map' function
        {
            return toString(item) + "_" + index;
        }
    );

This is equivalent to the following version with explicit declaration of the callback function:

    // parameter names are important! order is not
    function myCb(index: number, item: number) {
        return toString(item) + "_" + index;
    }
    myArrayB = map("array": myArrayA, "callback": myCb);

@m-mohr (Member, Author) commented Feb 1, 2019

Thanks Miha, I really appreciate all your thoughts. By the way, the decision for named parameters and insignificant order comes directly from the decision to use JSON objects, with their restrictions.

I'm not sure whether we are speaking about the same thing regarding bindings. It seems your example is client or back-end code? As you can see in the JS client code example in the initial post, there are no bindings in it, so I came to the same conclusion. What I was speaking about later was the process graph. I still think we need "bindings" there. Or, let's not call them bindings; they just state where the data for each parameter comes from. And I think we still need that, otherwise it is not clear what the data flow is. The names used in the process graph are specified in the process definition. See dim_data and dimension in the reduce example above. So in the end these things are nothing a user really needs to care about (depending on how smart the client implementation is, of course). So maybe we are already in the same boat?

Regarding your other questions:

what is the scope of symbols;
where does the binding between values and symbols happen;
which variables can be read from where;
which variables can be updated form where (if at all; having everything immutable makes sense...);

There are no real variables; the data is passed directly between processes. We can't really control the client-side, though.

how are values passed around the graph;

Using from_node and from_argument (in callbacks). A parameter expects data from another process using from_node. from_argument specifies which of the callback parameters to use in a process executed within a callback.
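
Both forms appear verbatim in the graph from the initial post. A parameter fed by another node's result:

"data": { "from_node": "mergec1" }

and, inside a callback, a parameter fed by the calling process (here: reduce):

"data": { "from_argument": "dimension_data" }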

how does each part of the graph execution get triggered (and how many times).

For callbacks this depends on the input data / context. For nodes in a set of processes, each gets triggered once.

@mkadunc (Member) commented Feb 5, 2019

I'm not sure whether we are speaking about the same thing regarding bindings. It seems your example is client or back-end code?

It's an idealized client code. Or, if you want, a programming-language-like representation of the process graph.

As you can see in the JS client code example in the initial post, there are no bindings in it, so I came to the same conclusion. What I was speaking about later was the process graph. I still think we need "bindings" there.

We might not - in the same way as the client code for callback directly references properties of the params object, which are known implicitly because the outer function declares what they will be, the process graph has direct references to the same thing via from_argument.

Or, let's not call them bindings; they just state where the data for each parameter comes from. And I think we still need that, otherwise it is not clear what the data flow is. The names used in the process graph are specified in the process definition. See dim_data and dimension in the reduce example above. So in the end these things are nothing a user really needs to care about (depending on how smart the client implementation is, of course). So maybe we are already in the same boat?

I think we are on the same page. I'm not trying to push for any particular change, just asking for clarification.

There are no real variables; the data is passed directly between processes. We can't really control the client-side, though.

  • from_node and from_argument make me think we are dereferencing some values from some variables (you call them nodes and arguments, but that's just semantics);

  • the names that we use to reference nodes and arguments are what I'd call symbols;

  • scope of variables would be a description of places inside the process graph where a from_node with that variable (node/argument) name can resolve to that variable's value - e.g.:

  1. If I create a node called myCube on the root level of my process graph, can I use from_node('myCube') inside a callback?
    myCube = collection("Sentinel-2_L1C");
    myCube.apply({
        return param.value/(myCube.w*myCube.h); // Can I use node from outside in here?
    });
  2. Can I have myCube defined again inside the callback, if I already have it on the root level?
    myCube = collection("Sentinel-2_L1C");
    apply_dimensions(myCube, {
        myCube = sort(param.dim_items); //can I declare another node with the same name
        return cumSum(myCube);
    });
  3. If I have a callback nested within a callback, can I access the outer callback's arguments from the inner callback's declaration?
    myCube = collection("Sentinel-2_L1C");
    apply_dimensions(myCube, {
        transformed_param = apply(dim_items, {
            return param.value + super.param.dim_items.length; // How can I use param from outside?
        });
        return transformed_param;
    });

@m-mohr (Member, Author) commented Feb 6, 2019

It's an idealized client code. Or, if you want, a programming-language-like representation of the process graph.

Okay, I wasn't speaking about client code here, as it is hard to control what the "external" languages support, i.e. whether they are strict about scoping (Java) or not so much (JavaScript with var).

An example:

var collection = b.process("get_collection", {name: "Sentinel-1"});
var dateFilter = b.process("filter_temporal", {data: collection, from: "2017-01-01", to: "2017-01-31"});
b.process("export", {data: merge, format: 'png'});
var minTime = b.process("reduce", {
	data: merge,
	dimension: "temporal",
	reducer: (builder, params) => builder.process("min", {data: params.dimension_data, dimension: params.dimension})
});

In the reducer callback, JavaScript would technically allow me to use the dateFilter variable because the scope of dateFilter is not restricted. Nevertheless, this would currently fail when generating a process graph, as the current architecture only allows strict scoping: you are only allowed to use whatever is passed in params (and builder). This wouldn't be a problem in Java, as the scoping is strict. Unfortunately, we can't really control this (or process graphs would get pretty messy to build, parse and execute).

We might not - in the same way as the client code for callback directly references properties of the params object, which are known implicitly because the outer function declares what they will be, the process graph has direct references to the same thing via from_argument.

Probably a misunderstanding. I was speaking about the process graph; there is no params object there. params is an implementation detail of the JS process graph builder. So, yes, "bindings" via from_node and from_argument in the process graph are probably required, but no bindings in the JS client are needed.

  • from_node and from_argument make me think like we are dereferencing some values from some variables (you call them nodes and arguments, but that's just semantics);

from_node references a return value from another process; from_argument references an argument that is passed to a callback, e.g. a single value in apply from the loop that iterates over all the values in the cube.

  • scope of variables would be a description of places inside the process graph where a from_node with that variable (node/argument) name can resolve to that variable's value - e.g.:
  1. If I create a node called myCube on the root level of my process graph, can I use from_node('myCube') inside a callback?
    myCube = collection("Sentinel-2_L1C");
    myCube.apply({
        return param.value/(myCube.w*myCube.h); // Can I use node from outside in here?
    });

No, that's not allowed. Strict scoping, see example above.

  2. Can I have myCube defined again inside the callback, if I already have it on the root level?
    myCube = collection("Sentinel-2_L1C");
    apply_dimensions(myCube, {
        myCube = sort(param.dim_items); //can I declare another node with the same name
        return cumSum(myCube);
    });

Yes, due to the strict scoping. (And thanks for the good examples, this will help me write better documentation.)

  3. If I have a callback nested within a callback, can I access the outer callback's arguments from the inner callback's declaration?
    myCube = collection("Sentinel-2_L1C");
    apply_dimensions(myCube, {
        transformed_param = apply(dim_items, {
            return param.value + super.param.dim_items.length; // How can I use param from outside?
        });
        return transformed_param;
    });

No, again, strict scoping. You can only access what is passed by the process that calls the callback, and what is passed there is defined in the process description. In my first comment that would be dim_data and dimension in reduce, and element in filter.

Hope this clarifies things. I'll try to explain it better when porting this to the API documentation. Do you think the approach is reasonable, or should we switch to less strict scoping? At the moment I'm convinced that strict scoping makes things easier and less prone to errors, but do you have any concerns? Of course, we need to ensure that all relevant information required in a callback is passed to the callback. @mkadunc

@mkadunc (Member) commented Feb 6, 2019

Thanks. This clears things up, and I fully agree that strict scope (i.e. no access into parent or child contexts) is the best approach to take right now.

The only drawback I can think of ATM is the inability to pass custom values to code inside a callback — right now we pass only those arguments that the outer function (e.g. reduce) declares in the signature of its callback (e.g. dim_data and dimension). If I wanted to use some other variable in the callback (such as a result of a previous computation), I would somehow need to embed its values inside the array.

The following is possible in Javascript, but not in openEO:

   const foo = [1,2,3];
   const bar = 7;
   const fifu = map(foo, params => params.dim_data + bar);

If I wanted to do this in openEO, i'd need to do something analogous to this:

   //add a new single-element axis
   const newcube = append_dimension(foo, "variable", "foo");

   //add a new element to the axis, with all values initialized to bar
   const foobar = extend_dimension(newcube, "variable", "bar", bar);
   
   const fifu = reduce(foobar, "variable", params => params.dim_data[0] + params.dim_data[1]);

If we were to solve this problem, I'd consider doing it explicitly, e.g. with a parameter on the definition of the custom callback that one could populate manually in order to provide the "context" to the callback, e.g. like this:

   const fifu = apply_dimensions(cube: foo, dimensions: ALL, callback :
      {
         // explicitly specify which local data will be 'injected' into the callback
         context: {boo: bar},
         // use only params and context in the callback expression (no access to data outside)
         expression: (params, context) => params.dim_data + context.boo
      }
   );

@mkadunc (Member) commented Feb 6, 2019

Not sure if my example above is representative of what openEO use-cases require, so maybe supporting a way of passing context into callbacks is not yet necessary.

@m-mohr (Member, Author) commented Feb 6, 2019

Yes, this could be useful, but I'm not sure whether anybody needs it yet. I don't see a disadvantage to introducing a context mechanism/parameter, except that it requires a bit of additional documentation. I opened a separate issue for this: Open-EO/openeo-processes#25. It's a process specification issue, not so much an issue of the process graph itself.

@m-mohr (Member, Author) commented Feb 11, 2019

Mostly incorporated into the API documentation. It would be great if someone could check https://open-eo.github.io/openeo-api/v/0.4.0/processgraphs/ and the OpenAPI specification.
