Switch from Progrock to OpenTelemetry #6835

Merged
merged 104 commits into dagger:main from otel-tui on Apr 3, 2024

Conversation

@vito (Contributor) commented Mar 6, 2024

Out with Progrock, in with OpenTelemetry

Now that OpenTelemetry has support for streaming logs, we are able to switch from Progrock and embrace an open standard and ecosystem instead.

This change should be mostly invisible to the user, unless they want to integrate with OpenTelemetry. In that case, they can set the standard OTEL_ environment variables client-side (e.g. OTEL_EXPORTER_OTLP_ENDPOINT), and all telemetry will be sent to the configured OTLP exporter, without requiring any engine-side configuration.

How does it work?

  1. We use a custom OpenTelemetry pipeline that is able to send "live" spans to a configured exporter; this is how the TUI (and soon Cloud) is able to show you what's happening rather than what happened. (See the sketch after this list.)
  2. To get telemetry data out of the engine and to the TUI, the engine implements an OpenTelemetry publish/subscribe pattern. The CLI subscribes to its own trace, and the engine publishes telemetry to the subscriber.
  3. We originally needed this pub/sub just so the TUI can show engine activity, but since we already had to do the work anyway, we also re-process data received from the pub/sub and send finished spans to your configured exporter. This way you only need to configure OTEL_ on the client side!
  4. DagQL instrumentation now creates an Otel span for each field resolution and adds the base64-encoded callpbv1.Call as a span attribute. Note that this is a single Call and not a full DAG, which cuts the data sent over the wire by roughly 10x (it varies), but implies the consumer needs to keep track of all Calls that it sees in order to grok inter-Call references. (Also illustrated in the sketch below.)
  5. Dagger Cloud will become an OTLP endpoint that is capable of receiving "live" spans.
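
A minimal sketch of items 1 and 4, using the standard Go OpenTelemetry SDK interfaces. The package, function names, and the attribute key below are illustrative assumptions, not the PR's actual code; the real live-span pipeline lives in telemetry/inflight and also handles batching, draining, and keepalives.

    package livetrace

    import (
        "context"
        "encoding/base64"

        "go.opentelemetry.io/otel"
        "go.opentelemetry.io/otel/attribute"
        sdktrace "go.opentelemetry.io/otel/sdk/trace"
    )

    // LiveProcessor exports spans when they start (still in flight) as well as
    // when they end, so a consumer like the TUI can render what's happening
    // rather than what happened.
    type LiveProcessor struct {
        Exporter sdktrace.SpanExporter
    }

    var _ sdktrace.SpanProcessor = (*LiveProcessor)(nil)

    func (p *LiveProcessor) OnStart(ctx context.Context, s sdktrace.ReadWriteSpan) {
        // An in-flight span has no end time yet; consumers treat it as "running".
        _ = p.Exporter.ExportSpans(ctx, []sdktrace.ReadOnlySpan{s})
    }

    func (p *LiveProcessor) OnEnd(s sdktrace.ReadOnlySpan) {
        _ = p.Exporter.ExportSpans(context.Background(), []sdktrace.ReadOnlySpan{s})
    }

    func (p *LiveProcessor) Shutdown(ctx context.Context) error { return p.Exporter.Shutdown(ctx) }

    func (p *LiveProcessor) ForceFlush(ctx context.Context) error { return nil }

    // startCallSpan: one span per DagQL field resolution, with the encoded
    // callpbv1.Call for that single field attached as a base64 span attribute.
    func startCallSpan(ctx context.Context, field string, encodedCall []byte) (context.Context, func()) {
        ctx, span := otel.Tracer("dagql").Start(ctx, field)
        span.SetAttributes(attribute.String(
            "dagger.io/dag.call", // assumed attribute key; the real one may differ
            base64.StdEncoding.EncodeToString(encodedCall),
        ))
        return ctx, func() { span.End() }
    }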

That covers tracing. What about logging?

  1. The Go OpenTelemetry SDK does not yet implement support for logging, so we have a package of our own: basically copy-pasta of the tracing stack with changes to support logging, which is already supported everywhere else in the ecosystem. This can be removed once the Logs SDK prototype open-telemetry/opentelemetry-go#4955 lands.
  2. Same as above, we have a pub/sub stack for logs, too. There's no "live" vs "complete" distinction for logs though.
  3. Our shim detects the current trace and span, which Buildkit already propagates for us, and pipes the command's stdout/stderr to io.Writers that send the logs to OpenTelemetry (sketched after this list). This should maybe be upstreamed to Buildkit once 4955 lands.
  4. DagQL instrumentation automatically configures ioctx.Stdout and ioctx.Stderr to send to the resolver's span.
  5. To replace Progrock's "global logging" feature, we can now send logs with a special annotation that we can detect client-side to show them globally, while still maintaining a relationship to an originating span. This part isn't fully baked yet, but seems viable.
  6. Dagger Cloud will become an OTLP endpoint that is capable of receiving logs for spans.
  7. Engine logs can also be sent to Dagger Cloud as before, but I need to resurrect this code and confirm it works.
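
A rough sketch of the shim-side wiring from items 3 and 4. LogEmitter and every name below are stand-ins for the PR's own sdklog-based plumbing, not its actual API; the point is just that stdout/stderr become io.Writers whose writes turn into log data tied to the span carried by the context.

    package shimlogs

    import (
        "context"
        "os/exec"
    )

    // LogEmitter stands in for whatever the telemetry stack exposes for logs.
    type LogEmitter interface {
        Emit(ctx context.Context, stream string, line []byte)
    }

    // otelWriter forwards everything written to it as log data associated with
    // the span carried by ctx.
    type otelWriter struct {
        ctx     context.Context
        stream  string // "stdout" or "stderr"
        emitter LogEmitter
    }

    func (w *otelWriter) Write(p []byte) (int, error) {
        w.emitter.Emit(w.ctx, w.stream, p)
        return len(p), nil
    }

    // runWithTelemetry shows how the shim might wire a command's output up.
    func runWithTelemetry(ctx context.Context, emitter LogEmitter, name string, args ...string) error {
        cmd := exec.CommandContext(ctx, name, args...)
        cmd.Stdout = &otelWriter{ctx: ctx, stream: "stdout", emitter: emitter}
        cmd.Stderr = &otelWriter{ctx: ctx, stream: "stderr", emitter: emitter}
        return cmd.Run()
    }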

Breaking changes

  • The old TUIs have been removed: both the dagger run DAG TUI and the ancient interactive tree-view TUI. These were just not worth translating to the new world, so this was finally the time to GC them. Apologies to the handful of users who still used the tree-view style UI.
  • The --focus flag is gone.
  • The pipeline() APIs are now no-ops.

New features / improvements

  • The CLI now supports curl style verbosity flags.
    • -v keeps things from disappearing
    • -vv reveals internal/encapsulated spans
    • -vvv reveals spans that evaluated very quickly
  • dagger watch allows you to see all activity across the engine, which is pretty neat. I've already used it for troubleshooting integration tests. It works by just subscribing to "everything" via the pub/sub service mentioned above.
  • A span can be given a special attribute that tells the UI to collapse its children by default: telemetry.Encapsulate(). (See the sketch after this list.)
  • We now proactively shut down services on client close.
  • We should no longer have to worry about missing the 'last few bits' of telemetry. The pub/sub system now waits for active spans to complete before detaching. So hopefully that means no more "runaway builds" or services left running forever.
  • The TUI now supports a Ctrl+\ keybind that SIGQUITs itself, so if there's a hang for whatever reason we can more easily get a stack dump.
  • The Engine now uses the same OpenTelemetry stack as the rest of Dagger, so hopefully we don't have to worry too much about dependency hell. (Knock on wood.)
  • All the labels that used to be sent to Cloud as part of Pipelines are now associated to the CLI's OpenTelemetry resource instead.
  • User-land code can directly hook in to the broader trace; we automatically configure OTEL_ as appropriate in the shim.
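
To illustrate the encapsulation hint: in plain OpenTelemetry terms it boils down to a span attribute the frontend knows to look for. The attribute key and helper name below are assumptions for illustration; telemetry.Encapsulate() is the real API named above.

    package uihints

    import (
        "context"

        "go.opentelemetry.io/otel"
        "go.opentelemetry.io/otel/attribute"
        "go.opentelemetry.io/otel/trace"
    )

    // StartEncapsulated starts a span whose children the UI should collapse by
    // default (they can still be revealed with -vv).
    func StartEncapsulated(ctx context.Context, name string) (context.Context, trace.Span) {
        return otel.Tracer("dagger").Start(ctx, name,
            trace.WithAttributes(attribute.Bool("dagger.io/ui.encapsulate", true)), // assumed key
        )
    }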

Regressions

  • We no longer use Buildkit's progress stream, so we are at the mercy of their Otel integration, which is not as rich.
    • We no longer show CACHED status for Buildkit cache hits. :( Instead you'll see separate spans like cache request: ... and load cache: ....
    • We no longer show image pull/extract progress, so no more fancy 2d progress bars.
    • We no longer send the LLB op protobuf in the telemetry.
    • All of these could probably be addressed with upstream PRs.

Notable tweaks

  • DagQL no longer marks module functions as 'impure', which was causing parallel calls to never be deduped, leading to duplicate spans.
    • Technically we already cache module functions within a session, and the DagQL server cache has the same lifetime as the session, so this should not be a behavior change, but we'll want to change this if/when we take a swing at module function caching and/or publishing recipes.
  • dagger terminal no longer does the insane vterm emulation thing, which was originally done so we could support showing progress alongside your active terminal session.
    • Now you should just use dagger watch in a separate terminal. Implementing a terminal emulator is not our core competency. It was fun, but also endless toil. The new implementation is much simpler.
  • OpenTelemetry is very dependent on ctx propagation, and context.Background() breaks that, so now we use the new context.WithoutCancel() API, which is nearly always what we wanted instead, since it keeps all the juicy values but just drops the cancel-propagation aspect. (See the sketch after this list.)
  • We now initialize the TUI frontend super early on - literally before we even initialize telemetry. This is great for various reasons, but the trade-off is we need to do custom barebones flag parsing to handle -v/--debug and other global flags. We can't defer that to the usual flag parsing since that's tied to module loading and stuff (for e.g. dagger call).
  • We currently have our own hokey OpenTelemetry Go SDK for logging under telemetry/env and telemetry/sdklog which should be nuked as soon as the official SDK lands (Logs SDK prototype open-telemetry/opentelemetry-go#4955).
  • Deleted various stale commands under cmd/ that weren't worth maintaining.
    • otel-collector (this is just native now)
    • dagger-graph (this was coupled to Progrock/Buildkit progress stream)
    • upload-journal (replaced by OTLP)
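
A tiny example of the context.Background() to context.WithoutCancel() swap mentioned in the list above. context.WithoutCancel is a real Go 1.21 API; the helper name is just for illustration.

    package ctxutil

    import "context"

    // detach keeps the parent's values (span context, loggers, etc.) but stops
    // propagating cancelation to whatever runs under the returned context.
    func detach(ctx context.Context) context.Context {
        // Before: ctx = context.Background() // drops values AND cancelation
        // After: drop only the cancel propagation, keep everything else.
        return context.WithoutCancel(ctx)
    }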

* All Progrock plumbing is now gone, though we may want to bring it back
  for compatibility. Removing it was a useful exercise to find the many
  places where we're relying on it.
* TUI now supports -v, -vv, -vvv (configurable verbosity). Global flags
  like --debug, -v, --silent, etc. are processed anywhere in the command
  string and respected.
* CLI forwards engine traces and logs to configured exporters, no need
  to configure engine-side (we already need this flow for the TUI)
* "Live" spans are emitted to TUI and cloud, filtered out before sending
  to a traditional (non-Live) otel stack
* Engine supports pub/sub for traces and logs, can be exposed in the
  future as a GraphQL subscription
* Refactor context.Background usage to context.WithoutCancel. We usually
  don't want a total reset, since that drops the span context and any
  other telemetry related things (loggers etc). Go 1.21 added
  context.WithoutCancel which is more precise.
* engine: don't include source in slogs. Added this prospectively and it
  doesn't seem worth the noise.
* idtui: DB can record multiple traces, polish
  * multi traces is mostly for dagviz, so i can run it with a single DB
  * add 'passthrough' UI flag which tells the UI to ignore a span and
    descend into its children
  * add 'ignore' UI flag, to be used sparingly for things whose
    signal:noise ratio is irredeemably low (e.g. 'id' calls)
  * make loadFooFromID calls passthrough
  * make Buildkit gRPC calls passthrough
* Global Progrock logs are theoretically replaced with
  tracing.GlobalLogger, but it has yet to be integrated into anything.
* Module functions are pure after all. They're already cached
  per-session, so this makes DagQL reflect that, avoiding duplicate
  Buildkit work that would be deduped at the Buildkit layer. Cleans up
  the telemetry since previously you'd see duplicate queries.

* TODO: ensure draining is airtight
* TODO: global logging to TUI
* TODO: batch forwarded engine spans instead of emitting them "live"
* TODO: fix dagger terminal

Signed-off-by: Alex Suraci <alex@dagger.io>
vito added 13 commits March 6, 2024 18:29
Signed-off-by: Alex Suraci <alex@dagger.io>
previously we would cancel all subscribers for a trace whenever a
client/server went away. but for modules/nesting this meant the inner
call would cancel the whole trace early.

* TODO: looks like services still don't drain completely?

Signed-off-by: Alex Suraci <alex@dagger.io>
Signed-off-by: Alex Suraci <alex@dagger.io>
Signed-off-by: Alex Suraci <alex@dagger.io>
Signed-off-by: Alex Suraci <alex@dagger.io>
previously service spans would be left incomplete on exit. now we'll
detach from them on shutdown, which will only stop the service if we're
the last depender on it. end result _should_ be that services are always
completed through telemetry, but I've seen maybe 2 in 50 runs still
leave it running. still troubleshooting, but without this change there
is no hope at all.

fixes dagger#6493

Signed-off-by: Alex Suraci <alex@dagger.io>
Honestly not 100% confirmed, but seems right. I think the final solution
might be to get traces/logs out without going through a session in the
first place.

Signed-off-by: Alex Suraci <alex@dagger.io>
seeing a panic in ExportSpans/UploadTraces, this should help avoid
bringing whole server down - I think - or at least give us hope.

Signed-off-by: Alex Suraci <alex@dagger.io>
Signed-off-by: Alex Suraci <alex@dagger.io>
Signed-off-by: Alex Suraci <alex@dagger.io>
Signed-off-by: Alex Suraci <alex@dagger.io>
fetching the logs/traces over a session is really annoying with draining
because the session itself gets closed before things can be fully
flushed.

Signed-off-by: Alex Suraci <alex@dagger.io>
Signed-off-by: Alex Suraci <alex@dagger.io>
vito added 2 commits March 8, 2024 23:00
Signed-off-by: Alex Suraci <alex@dagger.io>
Signed-off-by: Alex Suraci <alex@dagger.io>
vito added 10 commits March 9, 2024 00:48
More than a 10x efficiency increase. Frontend still super easy to
implement.

Test:

  # in ~/src/bass
  $ with-dev dagger call -m ./ --src https://github.com/vito/bass unit --packages ./pkg/cli stdout --debug &> out
  $ rg measuring out | cut -d= -f2 | xargs | tr ' ' '+' | sed -e 's/0m//g' -e 's/[^0-9\+]//g' | cat -v | bc

Before:

  8524838 (~8.1 MiB)

After:

  727039 (~0.7 MiB)

Signed-off-by: Alex Suraci <alex@dagger.io>
Signed-off-by: Alex Suraci <alex@dagger.io>
kinda hacky, but it makes sense that we need to handle this, cause
loadFooFromID or generally anything can take an ID that's never been
seen by the server before, and the loadFooFromID span will come first.

Signed-off-by: Alex Suraci <alex@dagger.io>
Signed-off-by: Alex Suraci <alex@dagger.io>
Signed-off-by: Alex Suraci <alex@dagger.io>
Signed-off-by: Alex Suraci <alex@dagger.io>
Signed-off-by: Alex Suraci <alex@dagger.io>
Signed-off-by: Alex Suraci <alex@dagger.io>
Signed-off-by: Alex Suraci <alex@dagger.io>
Signed-off-by: Alex Suraci <alex@dagger.io>
@vito (Contributor, Author) commented Mar 29, 2024

Update: ok, Go SDK support for custom spans is implemented: 31b303b

But honestly it's not a great reference for other SDK authors; it's a bit hairy for not great reasons. I'd check out the Python SDK (or the Elixir SDK 🎉 - thanks @wingyplus!) instead. You can totally reference the 'live span' code though, that part's just copy-pasta from telemetry/inflight/ and shows how to implement it.

@sipsma @jedevc WDYT of refactoring the Go SDK codegen <=> Go SDK runtime <=> Dagger engine code sharing?¹ Rough idea is to change the runtime to work via replace dagger.io/dagger => ./internal/dagger, the same way the engine works. That way we can share code more easily without worrying about import paths. (Motivation: #6835 (comment))

That would result in the easiest central place to share code being the Go SDK, which kind of tracks I guess, since we're already incentivized to keep its dependencies minimal, as we would with any code shared between all three.

Would save that for a future PR, this one's already way too big.

Also @sipsma I think all the feedback is addressed now, modulo recent changes. I didn't find time to double-check the Buildkit memory thing so I just made it send to io.Discard and tweaked the comment. But the flag parsing situation is way less gross now.

🙏

Footnotes

  1. Having a hard time figuring out what to call these, lemme know if there are better terms. By "Go SDK codegen" I mean dagger.io/dagger, by "Go SDK runtime" I mean the module codegen'd package, and by the Dagger engine I mean... the Dagger engine.

Signed-off-by: Alex Suraci <alex@dagger.io>
no more exec /runtime!

Signed-off-by: Alex Suraci <alex@dagger.io>
worth refactoring, but not now™

Signed-off-by: Alex Suraci <alex@dagger.io>
@wingyplus (Contributor)
I found an issue while working on #6962. I created a file at sdk/elixir/test.exs with the snippet below

Application.ensure_all_started(:inets)

Mix.install([{:dagger, path: "."}])

defmodule Test do
  def run(dag \\ Dagger.connect!()) do
    dag
    |> Dagger.Client.container()
    |> Dagger.Container.from("alpine")
    |> Dagger.Container.with_exec(~w"echo hello")
    |> Dagger.Sync.sync()
  after
    Dagger.close(dag)
  end
end

Test.run

The script cannot compile because I accidentally introduced a compilation error in the SDK. When running it with with-dev dagger run elixir test.exs I got this error

wingyplus@WINGYMOMO:~/src/github.com/dagger/dagger-elixir-otel/sdk/elixir$ with-dev dagger run elixir test.exs
==> dagger
Compiling 2 files (.ex)

== Compilation error in file lib/dagger/core/graphql_client/httpc.ex ==
** (SyntaxError) invalid syntax found on lib/dagger/core/graphql_client/httpc.ex:23:7:
    error: syntax error before: '['
    │
 23 │       []
    │       ^
    │
    └─ lib/dagger/core/graphql_client/httpc.ex:23:7
    (elixir 1.16.2) lib/kernel/parallel_compiler.ex:428: anonymous fn/5 in Kernel.ParallelCompiler.spawn_workers/8
could not compile dependency :dagger, "mix compile" failed. Errors may have been logged above. You can recompile this dependency with "mix deps.compile dagger --force", update it with "mix deps.update dagger" or clean it with "mix deps.clean dagger"
21:42:58 WRN failed to get repo HEAD err="reference not found"
exit status 1
wingyplus@WINGYMOMO:~/src/github.com/dagger/dagger-elixir-otel/sdk/elixir$

Let's ignore the Elixir error for now. The issue is that when the error occurred, my shell display broke for some reason. :(

vito and others added 2 commits April 2, 2024 11:28
Signed-off-by: Alex Suraci <alex@dagger.io>
Signed-off-by: Helder Correia <174525+helderco@users.noreply.github.com>
@vito (Contributor, Author) commented Apr 2, 2024

Thanks for the approval. 🙏

Integration tests were failing for the Python SDK, related to the new opentelemetry dependency, so I've just reverted those for now so we can get this PR merged first. Each SDK can just be a separate PR.

Will merge once checks are green.

looks like there's more to figure out with module dependencies? either
way, don't want this to block the current PR, they can be re-introduced
in another PR like the other SDKs

Revert "Pin python requirements"

This reverts commit b40c411.

Revert "Add Python support"

This reverts commit 08aa92c.

Signed-off-by: Alex Suraci <alex@dagger.io>
@vito (Contributor, Author) commented Apr 2, 2024

Update: some Python related tests flaked due to a concurrent map write here:

https://github.com/dagger/dagger/blob/main/sdk/python/runtime/discovery.go#L250

This is unrelated, but I've piled a quick fix onto my PR (just added a mutex), since it failed twice in a row.
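
For reference, the quick fix amounts to guarding the shared map with a mutex, roughly like this; the type and field names are illustrative, not the actual ones in discovery.go.

    package discovery

    import "sync"

    type moduleFiles struct {
        mu    sync.Mutex
        files map[string][]byte
    }

    // put is safe to call from concurrent goroutines; without the mutex, the
    // concurrent map write panics the process (which is what flaked the tests).
    func (m *moduleFiles) put(name string, data []byte) {
        m.mu.Lock()
        defer m.mu.Unlock()
        if m.files == nil {
            m.files = map[string][]byte{}
        }
        m.files[name] = data
    }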

@vito vito merged commit 6467198 into dagger:main Apr 3, 2024
43 checks passed
@vito vito deleted the otel-tui branch April 5, 2024 16:11
vikram-dagger pushed a commit to vikram-dagger/dagger that referenced this pull request May 3, 2024
* progrock -> otel

* All Progrock plumbing is now gone, though we may want to bring it back
  for compatibility. Removing it was a useful exercise to find the many
  places where we're relying on it.
* TUI now supports -v, -vv, -vvv (configurable verbosity). Global flags
  like --debug, -v, --silent, etc. are processed anywhere in the command
  string and respected.
* CLI forwards engine traces and logs to configured exporters, no need
  to configure engine-side (we already need this flow for the TUI)
* "Live" spans are emitted to TUI and cloud, filtered out before sending
  to a traditional (non-Live) otel stack
* Engine supports pub/sub for traces and logs, can be exposed in the
  future as a GraphQL subscription
* Refactor context.Background usage to context.WithoutCancel. We usually
  don't want a total reset, since that drops the span context and any
  other telemetry related things (loggers etc). Go 1.21 added
  context.WithoutCancel which is more precise.
* engine: don't include source in slogs. Added this prospectively and it
  doesn't seem worth the noise.
* idtui: DB can record multiple traces, polish
  * multi traces is mostly for dagviz, so i can run it with a single DB
  * add 'passthrough' UI flag which tells the UI to ignore a span and
    descend into its children
  * add 'ignore' UI flag, to be used sparingly for things whose
    signal:noise ratio is irredeemably low (e.g. 'id' calls)
  * make loadFooFromID calls passthrough
  * make Buildkit gRPC calls passthrough
* Global Progrock logs are theoretically replaced with
  tracing.GlobalLogger, but it has yet to be integrated into anything.
* Module functions are pure after all. They're already cached
  per-session, so this makes DagQL reflect that, avoiding duplicate
  Buildkit work that would be deduped at the Buildkit layer. Cleans up
  the telemetry since previously you'd see duplicate queries.

* TODO: ensure draining is airtight
* TODO: global logging to TUI
* TODO: batch forwarded engine spans instead of emitting them "live"
* TODO: fix dagger terminal

Signed-off-by: Alex Suraci <alex@dagger.io>

* fix log draining, again, ish

previously we would cancel all subscribers for a trace whenever a
client/server went away. but for modules/nesting this meant the inner
call would cancel the whole trace early.

* TODO: looks like services still don't drain completely?

Signed-off-by: Alex Suraci <alex@dagger.io>

* don't set up logs if not configured

Signed-off-by: Alex Suraci <alex@dagger.io>

* respect configured level

Signed-off-by: Alex Suraci <alex@dagger.io>

* clean up shim early tracing remnants

Signed-off-by: Alex Suraci <alex@dagger.io>

* synchronously detach services on main client exit

previously service spans would be left incomplete on exit. now we'll
detach from them on shutdown, which will only stop the service if we're
the last depender on it. end result _should_ be that services are always
completed through telemetry, but I've seen maybe 2 in 50 runs still
leave it running. still troubleshooting, but without this change there
is no hope at all.

fixes dagger#6493

Signed-off-by: Alex Suraci <alex@dagger.io>

* flush telemetry before closing server clients

Honestly not 100% confirmed, but seems right. I think the final solution
might be to get traces/logs out without going through a session in the
first place.

Signed-off-by: Alex Suraci <alex@dagger.io>

* switch from errgroup to conc for panic handling

seeing a panic in ExportSpans/UploadTraces, this should help avoid
bringing whole server down - I think - or at least give us hope.

Signed-off-by: Alex Suraci <alex@dagger.io>

* nest 'starting session' beneath 'connect'

Signed-off-by: Alex Suraci <alex@dagger.io>

* send logs out from engine to log exporter too

Signed-off-by: Alex Suraci <alex@dagger.io>

* bump midterm

Signed-off-by: Alex Suraci <alex@dagger.io>

* switch to server-side telemetry pub/sub

fetching the logs/traces over a session is really annoying with draining
because the session itself gets closed before things can be fully
flushed.

Signed-off-by: Alex Suraci <alex@dagger.io>

* show newer traces first

Signed-off-by: Alex Suraci <alex@dagger.io>

* cleanup

Signed-off-by: Alex Suraci <alex@dagger.io>

* send individual Calls over telemetry instead of IDs

More than a 10x efficiency increase. Frontend still super easy to
implement.

Test:

  # in ~/src/bass
  $ with-dev dagger call -m ./ --src https://github.com/vito/bass unit --packages ./pkg/cli stdout --debug &> out
  $ rg measuring out | cut -d= -f2 | xargs | tr ' ' '+' | sed -e 's/0m//g' -e 's/[^0-9\+]//g' | cat -v | bc

Before:

  8524838 (~8.1 MiB)

After:

  727039 (~0.7 MiB)

Signed-off-by: Alex Suraci <alex@dagger.io>

* idtui Base was correct in returning bool

Signed-off-by: Alex Suraci <alex@dagger.io>

* handle case where calls haven't been seen yet

kinda hacky, but it makes sense that we need to handle this, cause
loadFooFromID or generally anything can take an ID that's never been
seen by the server before, and the loadFooFromID span will come first.

Signed-off-by: Alex Suraci <alex@dagger.io>

* idtui: add space between progress and primary output

Signed-off-by: Alex Suraci <alex@dagger.io>

* swap -vvv and -vv, -vv now breaks encapsulation

Signed-off-by: Alex Suraci <alex@dagger.io>

* cleanups

Signed-off-by: Alex Suraci <alex@dagger.io>

* tidy mage

Signed-off-by: Alex Suraci <alex@dagger.io>

* tidy

Signed-off-by: Alex Suraci <alex@dagger.io>

* loosen go.mod constraints

Signed-off-by: Alex Suraci <alex@dagger.io>

* revive labels tests

Signed-off-by: Alex Suraci <alex@dagger.io>

* fix cachemap tests

Signed-off-by: Alex Suraci <alex@dagger.io>

* nuclear option: wait for all spans to complete

Rather than closing the telemetry connection and hoping the timing works
out, we keep track of which traces have active spans and wait for that
count to reach 0.

A bit more complicated but not seeing a simpler solution really. Without
this we can't ensure that the client sees the very outermost spans
complete.

Signed-off-by: Alex Suraci <alex@dagger.io>

* pass-through all gRPC stuff

hasn't really been useful, it's available in the full trace for devs, or
we can add a verbosity level.

Signed-off-by: Alex Suraci <alex@dagger.io>

* dagviz: tweaks to support visualizing a live trace

Signed-off-by: Alex Suraci <alex@dagger.io>

* better 'docker tag' parsing

Signed-off-by: Alex Suraci <alex@dagger.io>

* fixup docker tag check

Signed-off-by: Alex Suraci <alex@dagger.io>

* pass auth headers to OTLP logs too

Signed-off-by: Alex Suraci <alex@dagger.io>

* fix stdio not making it out of gateway containers

Signed-off-by: Alex Suraci <alex@dagger.io>

* fix terminal support

Signed-off-by: Alex Suraci <alex@dagger.io>

* drain immediately when interrupted

otherwise we can get stuck waiting for child spans of a nested process
that got kill -9'd. not perfect but better than hanging on Ctrl+C which
is already an emergent situation where you're not likely that interested
in any remaining data if you already had a reason to interrupt.

in Cloud we'll clean up any orphaned spans based on keepalives anyway.

Signed-off-by: Alex Suraci <alex@dagger.io>

* fix unintentionally HTTP-ifying gRPC otlp endpoint

Signed-off-by: Alex Suraci <alex@dagger.io>

* give up retrying connection if outer ctx canceled

Signed-off-by: Alex Suraci <alex@dagger.io>

* initiate draining only when main client goes away

Signed-off-by: Alex Suraci <alex@dagger.io>

* appease linter

Signed-off-by: Alex Suraci <alex@dagger.io>

* remove unnecessary wait

we don't need to try synchronizing here now that we just generically
wait for all spans to complete

Signed-off-by: Alex Suraci <alex@dagger.io>

* fix panic if no telemetry

Signed-off-by: Alex Suraci <alex@dagger.io>

* remove debug log

Signed-off-by: Alex Suraci <alex@dagger.io>

* print final progress tree in plain mode

no substitute for live console streaming, but easier to implement for
now, and probably easier to read in CI. probably needs more work, but
might get some tests passing.

Signed-off-by: Alex Suraci <alex@dagger.io>

* fix Windows build

Signed-off-by: Alex Suraci <alex@dagger.io>

* propagate spans through dagger-in-dagger

Signed-off-by: Alex Suraci <alex@dagger.io>

* retry connecting to telemetry

Signed-off-by: Alex Suraci <alex@dagger.io>

* propagate span context through dagger run

Signed-off-by: Alex Suraci <alex@dagger.io>

* install default labels as otel resource attrs

Signed-off-by: Alex Suraci <alex@dagger.io>

* tidy

Signed-off-by: Alex Suraci <alex@dagger.io>

* remove pipeline tests

these are expected to fail now

Signed-off-by: Alex Suraci <alex@dagger.io>

* fail root span when command fails

Signed-off-by: Alex Suraci <alex@dagger.io>

* Container.import: add span for streaming image

Signed-off-by: Alex Suraci <alex@dagger.io>

* idtui: break encapsulation in case of errors

Signed-off-by: Alex Suraci <alex@dagger.io>

* fix schema-level logging not exporting

caught by TestDaggerUp/random

Signed-off-by: Alex Suraci <alex@dagger.io>

* update TestDaggerRun assertion

Signed-off-by: Alex Suraci <alex@dagger.io>

* fix test not syncing on progress completion

Signed-off-by: Alex Suraci <alex@dagger.io>

* add verbose debug log

Signed-off-by: Alex Suraci <alex@dagger.io>

* respect $DAGGER_CLOUD_URL and $DAGGER_CLOUD_TOKEN

promoting these from _EXPERIMENTAL along the way, which has already been
done for _TOKEN, don't really see a strong reason to keep the
_EXPERIMENTAL prefix, but low conviction

Signed-off-by: Alex Suraci <alex@dagger.io>

* port 'processor: support span keepalive'

originally aluzzardi/otel-in-flight@2fc011f

Signed-off-by: Alex Suraci <alex@dagger.io>

* add 'watch' command

really helps with troubleshooting hanging tests!

Signed-off-by: Alex Suraci <alex@dagger.io>

* set a reasonable window size in plain mode

otherwise the terminals resize a ton of times when a long string is
printed, absolutely tanking performance. would be nice if that were
fast, but no time for that now.

Signed-off-by: Alex Suraci <alex@dagger.io>

* manually revert container.import change

i thought this wouldn't break it, but ... ?

Signed-off-by: Alex Suraci <alex@dagger.io>

* fix race

Signed-off-by: Alex Suraci <alex@dagger.io>

* mark watch command experimental

Signed-off-by: Alex Suraci <alex@dagger.io>

* fixup lock, more logging

Signed-off-by: Alex Suraci <alex@dagger.io>

* tidy

Signed-off-by: Alex Suraci <alex@dagger.io>

* fix data race in tests

Signed-off-by: Alex Suraci <alex@dagger.io>

* fix java SDK hang once again

really not sure what's writing to stderr even with --silent but this is
just too brittle. redirect stderr to /dev/null instead.

Signed-off-by: Alex Suraci <alex@dagger.io>

* retire dagger.io/ui.primary, use root span instead

fixes Views test; frontend must have been getting confused because there
were multiple "primary" spans

Signed-off-by: Alex Suraci <alex@dagger.io>

* take 2: just manually mark the 'primary' span

Signed-off-by: Alex Suraci <alex@dagger.io>

* merge tracing and telemetry packages

Signed-off-by: Alex Suraci <alex@dagger.io>

* cleanups

Signed-off-by: Alex Suraci <alex@dagger.io>

* roll back sync detach change

this was no longer needed with the change to wait for spans to finish,
not worth the review-time distraction

Signed-off-by: Alex Suraci <alex@dagger.io>

* cleanups

Signed-off-by: Alex Suraci <alex@dagger.io>

* update comment

Signed-off-by: Alex Suraci <alex@dagger.io>

* remove dead code

Signed-off-by: Alex Suraci <alex@dagger.io>

* default primary span to root span

Signed-off-by: Alex Suraci <alex@dagger.io>

* remove unused module arg

Signed-off-by: Alex Suraci <alex@dagger.io>

* send engine traces/logs to cloud

Signed-off-by: Alex Suraci <alex@dagger.io>

* implement sub metrics pub/sub

Some clients presume this service is supported by the OTLP endpoint. So
we can just have a stub implementation for now.

Signed-off-by: Alex Suraci <alex@dagger.io>

* sdk/go runtime: implement otel propagation

TODO: set up otel for you

Signed-off-by: Alex Suraci <alex@dagger.io>

* tidy

Signed-off-by: Alex Suraci <alex@dagger.io>

* add scary comment

Signed-off-by: Alex Suraci <alex@dagger.io>

* batch events that are sent from the engine

Previously we were just sending each individual update to the configured
exporters, which was very expensive and would even slow down the TUI.

When I originally tried to send it to span processors, nothing would be
sent out; turns out that was because the transform.Spans call we were
using didn't set the `Sampled` trace flag.

Now we forward engine traces and logs to all configured processors,
so their individual batching settings should be respected.

Signed-off-by: Alex Suraci <alex@dagger.io>

* fix spans being deduped within single batch

* fix detection for in flight spans; we need to check EndTime <
  StartTime since sometimes we end up with a 1754 timestamp
* when a span is already present in a batch, update it in-place rather
  than dropping it on the floor

Signed-off-by: Alex Suraci <alex@dagger.io>

* Add Python support

Signed-off-by: Helder Correia <174525+helderco@users.noreply.github.com>

* shim: proxy otel to 127.0.0.1:0

more universally compatible than unix://

Signed-off-by: Alex Suraci <alex@dagger.io>

* remove unnecessary fn

Signed-off-by: Alex Suraci <alex@dagger.io>

* attributes: add passthrough, bikeshed + document

also start cleaning up "tasks" cruft nonsense, these can just be plain
old attributes on a single span i think

Signed-off-by: Alex Suraci <alex@dagger.io>

* fix janky flag parsing

parse global flags in two passes, ensuring the same flags are installed
in both cases, and capturing the values before installing them into the
real flag set, since that clobbers the values

Signed-off-by: Alex Suraci <alex@dagger.io>

* discard Buildkit progress

...just in case it gets buffered in memory forever otherwise

Signed-off-by: Alex Suraci <alex@dagger.io>

* sdk/go: somewhat gross support for opentelemetry

had to copy-paste a lot of the telemetry code into sdk/go/. would love
to just move everything there so it can be shared between the shim, the
Go runtime, and the engine, however it is currently a huge PITA to share
code between all three, because of the way codegen works. saving that
for another day. maybe tomorrow.

Signed-off-by: Alex Suraci <alex@dagger.io>

* send logs to function call span, not exec /runtime

Signed-off-by: Alex Suraci <alex@dagger.io>

* tui: respect dagger.io/ui.mask

no more exec /runtime!

Signed-off-by: Alex Suraci <alex@dagger.io>

* silence linter

worth refactoring, but not now™

Signed-off-by: Alex Suraci <alex@dagger.io>

* ignore --help when parsing global flags

Signed-off-by: Alex Suraci <alex@dagger.io>

* Pin python requirements

Signed-off-by: Helder Correia <174525+helderco@users.noreply.github.com>

* revert Python SDK changes for now

looks like there's more to figure out with module dependencies? either
way, don't want this to block the current PR, they can be re-introduced
in another PR like the other SDKs

Revert "Pin python requirements"

This reverts commit b40c411.

Revert "Add Python support"

This reverts commit 08aa92c.

Signed-off-by: Alex Suraci <alex@dagger.io>

* fix race conditions in python SDK runtime

Signed-off-by: Alex Suraci <alex@dagger.io>

---------

Signed-off-by: Alex Suraci <alex@dagger.io>
Signed-off-by: Helder Correia <174525+helderco@users.noreply.github.com>
Co-authored-by: Helder Correia <174525+helderco@users.noreply.github.com>