[PROPOSAL] [RFC] Streaming action output and other intermediate results #2175
Multiple people have asked for this in our community, so I'm bringing this proposal back to life (I'm actually moving it from a private repo to a public one so other community members can participate as well).
Keep in mind that this proposal is almost 11 months old and a lot has changed since then. The biggest change which can affect this proposal is that we now have a /stream endpoint and a mechanism for streaming data.
I will explain the differences between my originally proposed approach and utilizing /stream for this in a new comment. In short, each approach has its own advantages and disadvantages, but I personally think the original proposal is still better since it provides intermediate results (if we go with /stream, the user will only be able to receive new output / data, but with my proposal, the API endpoint will return everything which has been buffered so far).
Currently, if you run a long-running action or an action which consists of multiple steps (an action chain or a workflow), you only know whether the action is scheduled, running, or finished, but you have no visibility into what is going on with the running action until it completes (either succeeds or fails).
Good examples of long-running actions include building and compiling software, running tests, etc.
For this purpose, define: an action is "short" if it runs for less than 5 seconds, and "long" otherwise.
The goal is to increase the visibility into running actions and stream output of the running actions and other intermediate results. Additionally, we want to support better integration with Mistral.
With "output" I'm referring to the action's stdout and stderr. With "other intermediate results" I'm referring to structured results produced by the individual steps of an action chain or workflow.
I separated those two because I think they are two different things. Output is opaque to us and we have no idea how it looks, so we should treat it as such. This means streaming it as soon as it is available. For performance and convenience reasons we should probably still buffer it. For line-delimited stuff we could simply send a line as soon as it's available.
And as far as "other intermediate results" go - here we know the structure and what is an atomic unit which means we can stream those things as a whole as new line delimited JSON (one JSON serialized object per line).
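The "one JSON serialized object per line" framing could look like this minimal sketch (the field names are hypothetical, not part of the proposal):

```python
import json

def emit_intermediate_results(results):
    """Serialize each intermediate result as one JSON object per line
    (newline-delimited JSON), so a consumer can parse record by record."""
    for result in results:
        # Compact separators and no embedded newlines keep each record on one line
        yield json.dumps(result, separators=(",", ":")) + "\n"

lines = list(emit_intermediate_results([
    {"task": "build", "status": "succeeded"},
    {"task": "test", "status": "running"},
]))
```

Since each line is a self-contained JSON document, a client can start parsing results before the whole stream has finished.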
Inside the runners, we should intercept the action output and write it to a temporary document in a special collection. To avoid MongoDB lock contention issues, we should buffer the output a bit and not write it character by character, since that is expensive (it's been a while since I looked into this, so I need to research it again; maybe newer versions of MongoDB finally have more granular write locking). To avoid lock contention, we might also want to use a separate database.
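To illustrate the buffering idea, here is a rough sketch that flushes collected output once a size or time threshold is crossed; `flush_cb` is a hypothetical stand-in for the actual MongoDB write:

```python
import time

class OutputBuffer:
    """Buffer action output and flush it in chunks instead of writing
    character by character (a sketch, not an existing runner API)."""

    def __init__(self, flush_cb, max_bytes=4096, max_interval=1.0):
        self._flush_cb = flush_cb          # e.g. a write to the temporary collection
        self._max_bytes = max_bytes        # flush once this much output accumulates
        self._max_interval = max_interval  # ... or once this many seconds elapse
        self._chunks = []
        self._size = 0
        self._last_flush = time.monotonic()

    def write(self, data):
        self._chunks.append(data)
        self._size += len(data)
        if (self._size >= self._max_bytes or
                time.monotonic() - self._last_flush >= self._max_interval):
            self.flush()

    def flush(self):
        if self._chunks:
            self._flush_cb("".join(self._chunks))
            self._chunks = []
            self._size = 0
        self._last_flush = time.monotonic()

written = []
buf = OutputBuffer(written.append, max_bytes=10)
buf.write("hello ")   # below the threshold, stays buffered
buf.write("world!")   # crosses the 10-byte threshold, triggers a flush
buf.flush()           # final flush on action completion (no-op here)
```

Tuning `max_bytes` / `max_interval` trades write amplification against latency of the streamed output.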
API user (client, web ui) should be able to consume this output with a special task specific streaming endpoint - e.g.
Internally, this streaming endpoint would read old data from the temporary collection and new data from the message bus.
As noted above, this document would be temporary so if a user visits this URL after the action has completed, we would simply read the whole output from the action execution collection.
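The replay-then-follow behavior described above could be sketched as a generator; `buffered_chunks` and `live_queue` are hypothetical stand-ins for the temporary collection and the message bus consumer:

```python
import queue

def stream_execution_output(buffered_chunks, live_queue, sentinel=None):
    """Replay output buffered so far, then follow live chunks until the
    execution completes. A real implementation would need to subscribe to
    the message bus *before* replaying, to avoid dropping chunks that
    arrive during the replay."""
    # 1. Everything written to the temporary collection before the client connected
    for chunk in buffered_chunks:
        yield chunk
    # 2. New chunks as they arrive, until a completion sentinel
    while True:
        chunk = live_queue.get()
        if chunk is sentinel:
            break
        yield chunk

q = queue.Queue()
for item in ("new-1\n", "new-2\n", None):   # None marks execution completion
    q.put(item)
chunks = list(stream_execution_output(["old-1\n", "old-2\n"], q))
```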
@enykeev proposed using server-sent events and the same endpoint as we use for streaming action execution and other objects.
This would potentially work as well, but I like a separate endpoint more since it's more granular and we can avoid the server-sent events metadata overhead. On top of that,
A lot of programming languages buffer output (stdout and stderr) by default. This could become a minor problem and annoyance for non-Python runner actions.
I personally don't think this is a big issue. We consider Python runner actions as first-class citizens, so if a user is worried about the output buffering and not receiving output as soon as it's available, they can rewrite an action using Python runner.
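For illustration, a Python runner action can side-step interpreter-level buffering by flushing explicitly (a sketch, not an existing runner API):

```python
import sys

def log_progress(message):
    """Write a progress line and flush immediately, instead of waiting
    for stdout's block buffer to fill (stdout is typically block-buffered
    when it is not attached to a TTY)."""
    sys.stdout.write(message + "\n")
    sys.stdout.flush()
    # Equivalently, on Python 3: print(message, flush=True)
```

For non-Python actions, running the interpreter unbuffered (e.g. `python -u` or the `PYTHONUNBUFFERED` environment variable for Python subprocesses) achieves a similar effect.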
I would need to check whether we can actually consume a raw stream in browser. If not, we would have to add some metadata in form of eventsource format (potentially, with additional query parameter like
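For reference, the eventsource (server-sent events) framing adds a few bytes of metadata per message; a minimal sketch of the wire format:

```python
def to_sse_event(data, event=None):
    """Frame a payload as a server-sent event: optional 'event:' line,
    one 'data:' line per payload line, terminated by a blank line."""
    lines = []
    if event:
        lines.append("event: %s" % event)
    for part in data.split("\n"):
        lines.append("data: %s" % part)
    return "\n".join(lines) + "\n\n"

to_sse_event("hello", event="stdout")
# -> "event: stdout\ndata: hello\n\n"
```

This is the overhead the raw-stream endpoint would avoid; in exchange, the eventsource format is natively consumable by the browser's `EventSource` API.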
Overall +1 to the proposal. I think we need this now.
The details matter in terms of how this really fits into our result storage strategy in general. I hope we can speculate about what our long-term plan for result storage is, and also what historical result storage looks like, so that we can potentially make all of this work well together. (This is likely a pipe dream.)
I suspect we will quickly move to a filters-on-streams approach. I translate the requirement as: the client pretty much defines what data it is interested in when connecting to the streams endpoint, and the client only receives those results. This is a great idea in theory, until it actually becomes extremely expensive to compute the filters on the server.
Also, often it might be better to take the http://graphql.org/ approach rather than build ad-hoc APIs.
As far as the technical / implementation and storage side goes - I was planning to create a new table / collection which would store the output (stdout and stderr) which has been collected and buffered so far.
Every time a client would hit
Once the execution has completed, we would simply delete the entry in this table / collection, and if the user requests the same endpoint, we would simply read from the action executions collection.
It's also worth pointing out that this collection will be quite write-heavy; that's why we should at least start with a separate collection, or maybe even a separate MongoDB database (this way it's easier for us to just move this single table to a different MongoDB server if performance issues show up).
Having said that, yes, it would be better if we could use a more appropriate database for this use case (write-heavy, something like Cassandra should work great). When we start to work on migrating to a different datastore, we could start with a simple migration like this - start with only migrating output data to a new datastore.
I actually have started to think that result should not be stored in the main
Yeah, I agree. Storing results somewhere else is a good idea since in a lot of cases, we just need the execution metadata and not the result itself and vice versa.
But yeah, even though it's slightly related, let's move this task / improvement to a different issue (maybe it's also a good candidate for an initial DB migration task :)).
The problem here is that the client connects to the stream once and it is a one-way channel. We could track which execution a specific client got last, but that's a huge mess since we would need to store the state, and there's a whole bunch of situations where it might go wrong. Having a separate streaming endpoint for every execution really is a better way, unless a client wants to receive result data for dozens of executions at once, and I can't figure out a use case for that at the moment.