
[PROPOSAL] [RFC] Streaming action output and other intermediate results #2175

Closed
Kami opened this Issue Nov 5, 2015 · 14 comments

@Kami
Member

Kami commented Nov 5, 2015

Multiple people have asked for this in our community so I'm bringing this proposal back to life (I'm actually moving it from a private repo to a public one so other community members can participate as well).

Keep in mind that this proposal is almost 11 months old and a lot has changed since then. The biggest change which could affect this proposal is that we now have a /stream endpoint and a mechanism for streaming data.

I will explain the differences between my original proposed approach and utilizing /stream for that in a new comment. In short, each approach has its own advantages and disadvantages, but I personally think the original proposal is still better since it provides intermediate results (if we go with /stream, the user will only be able to receive new output / data, but with my proposal, the API endpoint will return everything which has been buffered so far).

Problem

Currently, if you run a long-running action or an action which consists of multiple steps (an action chain or a workflow), you only know whether the action is scheduled, running or finished, but you have no visibility into what is going on with the running action until it completes (either succeeds or fails).

Good examples of long-running actions include building and compiling code, running tests, etc.

Use cases

  1. Long-running shell or Python actions (replacing Jenkins) - the action is a build which runs for a minute or more and produces long output; the user wants to see chunks of the output as they arrive, rather than wait for full completion.
  2. Long-running workflows - ActionChain, and especially Mistral. The "workflow" action execution should show workflow progress and the tasks as they change state (new tasks scheduled, running, complete...).
  3. Short tasks - majority of the cases - don't bother to do intermediate output, wait till complete.

For this purpose, define: short < 5 sec < Long.

Goals

The goal is to increase the visibility into running actions and to stream the output of running actions and other intermediate results. Additionally, we want to support better integration with Mistral.

With "output" I'm referring to the action's stdout and stderr. With "other intermediate results" I'm referring to ActionExecution and potentially other intermediate objects (@enykeev is already working on the second part).

I separated those two because I think they are two different things. Output is opaque to us and we have no idea how it looks, so we should treat it as such. This means streaming it as soon as it is available. For performance and convenience reasons we should probably still buffer it. For line-delimited output we could simply send a line as soon as it's available.

And as far as "other intermediate results" go - here we know the structure and what is an atomic unit which means we can stream those things as a whole as new line delimited JSON (one JSON serialized object per line).
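Since the structure of those intermediate objects is known, the newline-delimited JSON framing described above is trivial to produce and consume. A minimal sketch (the function names are illustrative, not StackStorm APIs):

```python
import json


def emit_intermediate_result(stream, obj):
    # One complete JSON-serialized object per line: the newline is the
    # record delimiter, so consumers never need to parse a partial object.
    stream.write(json.dumps(obj) + "\n")
    stream.flush()


def read_intermediate_results(text):
    # Consumers split on newlines and parse each line independently.
    return [json.loads(line) for line in text.splitlines() if line.strip()]
```

The flush after every line matters here: it is what makes each object visible to the consumer as soon as it is emitted.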

Implementation Proposal

Inside the runners, we should intercept the action output and write it to a temporary document in a special collection. To avoid MongoDB lock contention issues, we should buffer the output a bit and not write it character by character, since this is expensive (dunno, it's been a while, I need to research again; maybe newer versions of MongoDB finally have more granular write locking). To avoid lock contention, we might also want to use a separate database.
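The buffering idea can be sketched roughly like this; `store_append` stands in for whatever MongoDB append the runner would actually perform, and the size and interval thresholds are made-up defaults, not values from this proposal:

```python
import time


class BufferedOutputWriter:
    """Sketch: accumulate action output in memory and flush it to the
    datastore in chunks, instead of writing character by character."""

    def __init__(self, store_append, max_bytes=4096, max_interval=1.0):
        self._store_append = store_append  # hypothetical datastore hook
        self._max_bytes = max_bytes
        self._max_interval = max_interval
        self._buffer = []
        self._buffered_bytes = 0
        self._last_flush = time.time()

    def write(self, chunk):
        self._buffer.append(chunk)
        self._buffered_bytes += len(chunk)
        # Flush when enough data has accumulated or too much time passed,
        # so output still appears promptly for slow producers.
        if (self._buffered_bytes >= self._max_bytes or
                time.time() - self._last_flush >= self._max_interval):
            self.flush()

    def flush(self):
        if self._buffer:
            self._store_append("".join(self._buffer))
            self._buffer = []
            self._buffered_bytes = 0
        self._last_flush = time.time()
```

The time-based threshold is the piece that keeps buffering from defeating the whole purpose: without it, a quiet action could sit on partial output indefinitely.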

API user (client, web ui) should be able to consume this output with a special task specific streaming endpoint - e.g. /v1/actionexecutions/<id>/output.

Internally, this streaming endpoint would consume old data from the temporary collection and new data from the message bus.

As noted above, this document would be temporary so if a user visits this URL after the action has completed, we would simply read the whole output from the action execution collection.
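The read path described above (buffered history first, then live data, with a fallback to the final execution record once the action has completed) can be sketched roughly as a generator; `temp_store`, `bus_subscribe`, and the status values are illustrative stand-ins, not real StackStorm APIs:

```python
def stream_execution_output(execution, temp_store, bus_subscribe):
    """Hypothetical sketch of the /output endpoint's read path."""
    if execution["status"] in ("succeeded", "failed"):
        # Execution finished: the temporary document is gone, so read the
        # whole output from the action execution collection instead.
        yield execution["result"].get("stdout", "")
        return
    # 1. Everything buffered so far -- this is the "intermediate result"
    #    property the proposal wants, which a pure /stream feed lacks.
    for chunk in temp_store.read_all(execution["id"]):
        yield chunk
    # 2. New data as it arrives on the message bus.
    for chunk in bus_subscribe(execution["id"]):
        yield chunk
```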

@enykeev proposed using server-sent events and the same endpoint as we use for streaming action execution and other objects.

This would potentially work as well, but I like a separate endpoint more since it's more granular and we can avoid the server-sent events metadata overhead. On top of that, the /output endpoint would include no metadata or markers, which makes 3rd party integration easier (simply read data from the connection when available and display it).
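That "no metadata, no markers" property means a third-party client needs nothing more than a read loop. A hedged sketch, where `response` is any file-like HTTP response body (the function name is made up for illustration):

```python
def consume_raw_output(response, on_chunk, chunk_size=1024):
    """Read raw output from an open connection and hand each chunk to a
    callback as soon as it arrives -- no framing to parse, no metadata
    to strip."""
    while True:
        chunk = response.read(chunk_size)
        if not chunk:  # empty read: the server closed the stream
            break
        on_chunk(chunk)
```

With an HTTP client that exposes the raw body as a file-like object, `on_chunk` could simply print to the terminal to mimic `tail -f` behavior.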

Caveats

A lot of programming languages buffer output (stdout and stderr) by default. This could become a minor problem and annoyance for non-Python runner actions.

I personally don't think this is a big issue. We consider Python runner actions first-class citizens, so if a user is worried about output buffering and not receiving output as soon as it's available, they can rewrite the action using the Python runner.
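For Python runner actions, the usual workaround for the buffering caveat is simply to flush after each progress line (or to run the interpreter unbuffered via `python -u`). An illustrative helper:

```python
import sys


def report_progress(message):
    # Most runtimes block-buffer stdout when it is not a TTY, so output
    # only appears when the buffer fills or the process exits. Flushing
    # explicitly makes each line available to the runner immediately.
    sys.stdout.write(message + "\n")
    sys.stdout.flush()
```

In Python 3, `print(message, flush=True)` achieves the same thing.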

@enykeev

Member

enykeev commented Nov 5, 2015

This would potentially work as well, but I like a separate endpoint more since it's more granular and we can avoid server-sent events metadata overhead.

I would need to check whether we can actually consume a raw stream in the browser. If not, we would have to add some metadata in the form of the eventsource format (potentially with an additional query parameter like ?raw=false). But you're right, this should be a separate endpoint, since otherwise we risk flooding our clients with traffic they don't actually need at the moment, and it would be very hard to filter it server-side.
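For reference, the eventsource (Server-Sent Events) framing mentioned here wraps each chunk in `data:` lines terminated by a blank line. A small sketch of what a hypothetical `?raw=false` mode might emit around the raw output chunks:

```python
def to_sse_event(data, event=None):
    """Wrap a chunk of text in Server-Sent Events framing: an optional
    'event:' line, one 'data:' line per embedded newline, and a blank
    line terminating the event."""
    lines = []
    if event:
        lines.append("event: %s" % event)
    for part in data.split("\n"):
        lines.append("data: %s" % part)
    return "\n".join(lines) + "\n\n"
```

This is exactly the metadata overhead the raw /output endpoint avoids, but it is what a browser's `EventSource` API knows how to parse.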

@manasdk

Collaborator

manasdk commented Nov 5, 2015

Overall +1 to the proposal. I think we need this now.

The details matter in terms of how this really fits into our result storage strategy in general. I hope we can speculate about what our long-term plan for result storage is, and also what historical result storage looks like, so that we can potentially make all this work well together. (This is likely a pipe dream.)

If not, we would have to add some metadata in the form of the eventsource format

I suspect we will quickly move into a filters-on-streams approach. I translate the requirement as: the client pretty much defines what data it is interested in when connecting to the streams endpoint, and only receives those results. This is a great idea in theory until it actually becomes extremely expensive to compute filters on the server.

Also, often it might be better to take the http://graphql.org/ approach rather than build ad-hoc APIs.

@Kami

Member

Kami commented Nov 5, 2015

As far as technical / implementation and storage side goes - I was planning to create a new table / collection which would store output (stdout and stderr) which has been collected and buffered so far.

Every time a client hits /v1/actionexecutions/<id>/output or similar, we would return the raw data from this table which has been processed so far (there would be no processing, so it's a simple read).

Once the execution has completed, we would simply delete the entry in this table / collection, and if the user requests the same endpoint, we would simply read from the action executions collection.

It's also worth pointing out that this collection will be quite write-heavy; that's why we should at least start with a separate collection, or maybe even a separate MongoDB database (this way it's easier for us to just move this single table to a different MongoDB server if performance issues show up).

Having said that, yes, it would be better if we could use a more appropriate database for this use case (write-heavy, something like Cassandra should work great). When we start to work on migrating to a different datastore, we could start with a simple migration like this - start with only migrating output data to a new datastore.

@manasdk

Collaborator

manasdk commented Nov 5, 2015

Once the execution has completed, we would simply delete an entry in this table / collection and if user requests the same endpoint, we would simply read from action executions collection.

I have actually started to think that the result should not be stored in the main actionexecution table. We also should not be passing around these massive result objects through StackStorm. Only when the result is required should it be read from the back-end - this applies to usage in rules, workflow engines, etc. I suspect that is easier said than done. Anyway, since I do not want to derail this conversation any further, I think we can move forward with the approach you describe above for the time being.

@Kami

Member

Kami commented Nov 5, 2015

Yeah, I agree. Storing results somewhere else is a good idea since in a lot of cases, we just need the execution metadata and not the result itself and vice versa.

But yeah, even though it's slightly related, let's move this task / improvement to a different issue (maybe it's also a good candidate for the initial DB migration task :)).

@enykeev

Member

enykeev commented Nov 6, 2015

I suspect we will quickly move into a filters-on-streams approach. I translate the requirement as: the client pretty much defines what data it is interested in when connecting to the streams endpoint, and only receives those results.

The problem here is that the client connects to the stream once and it is a one-way channel. We can track which execution a specific client got last, but that's a huge mess, since we would need to store the state and there's a whole bunch of situations where it might go wrong. Having a separate streaming endpoint for every execution really is a better way, unless the client wants to receive result data on dozens of executions at once. I can't figure out a use case for that at the moment.

@nmaludy

Contributor

nmaludy commented Mar 16, 2017

+1

1 similar comment
@silverbp


silverbp commented Mar 20, 2017

+1

@pietervogelaar


pietervogelaar commented May 10, 2017

+1

1 similar comment
@dominikmueller


dominikmueller commented Jul 18, 2017

+1

@koenbud


koenbud commented Jul 24, 2017

+1 for python streaming/realtime stdout

@Kami Kami added this to the 2.4.0 milestone Aug 8, 2017

@Kami Kami referenced this issue Aug 8, 2017

Merged

Real-time streaming output for Python runner actions #3657

13 of 14 tasks complete
@armab

Member

armab commented Aug 8, 2017

Related discussion: StackStorm/discussions#230 by @enykeev

@LindsayHill LindsayHill modified the milestones: 2.5.0, 2.4.0 Aug 24, 2017

@humblearner humblearner modified the milestones: 2.5.0, 2.6.0 Oct 24, 2017

@LindsayHill

Contributor

LindsayHill commented Oct 26, 2017

First release of this in v2.5.0, published today!! Maybe we can close this issue at last?

@Kami

Member

Kami commented Apr 5, 2018

Closing since this functionality is now available in StackStorm >= 2.5.0.

@Kami Kami closed this Apr 5, 2018
