Switch from Progrock to OpenTelemetry #6835

Merged
merged 104 commits into dagger:main from otel-tui on Apr 3, 2024

Conversation

@vito (Contributor) commented Mar 6, 2024

Out with Progrock, in with OpenTelemetry

Now that OpenTelemetry has support for streaming logs, we are able to switch from Progrock and embrace an open standard and ecosystem instead.

This change should be mostly invisible to the user, unless they want to integrate with OpenTelemetry. In that case, they can set the standard OTEL_ environment variables client-side (e.g. OTEL_EXPORTER_OTLP_ENDPOINT), and all telemetry will be sent to the configured OTLP exporter, without requiring any engine-side configuration.

How does it work?

  1. We use a custom OpenTelemetry pipeline that is able to send "live" spans to a configured exporter; this is how the TUI (and soon Cloud) is able to show you what's happening rather than what happened. (See the sketch after this list.)
  2. To get telemetry data out of the engine and to the TUI, the engine implements an OpenTelemetry publish/subscribe pattern. The CLI subscribes to its own trace, and the engine publishes telemetry to the subscriber.
  3. We originally needed this pub/sub just so the TUI can show engine activity, but since we already had to do the work anyway, we also re-process data received from the pub/sub and send finished spans to your configured exporter. This way you only need to configure OTEL_ on the client side!
  4. DagQL instrumentation now creates an Otel span for each field resolution and adds the base64-encoded callpbv1.Call as a span attribute. Note that this is a single Call and not a full DAG, which cuts the data sent over the wire by roughly 10x (it varies), but implies the consumer needs to keep track of all Calls that it sees in order to grok inter-Call references. (Also illustrated in the sketch below.)
  5. Dagger Cloud will become an OTLP endpoint that is capable of receiving "live" spans.
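
A minimal sketch of items 1 and 4, using the standard Go OpenTelemetry SDK interfaces. The package, function names, and the attribute key below are illustrative assumptions, not the PR's actual code; the real live-span pipeline lives in telemetry/inflight and also handles batching, draining, and keepalives.

    package livetrace

    import (
        "context"
        "encoding/base64"

        "go.opentelemetry.io/otel"
        "go.opentelemetry.io/otel/attribute"
        sdktrace "go.opentelemetry.io/otel/sdk/trace"
    )

    // LiveProcessor exports spans when they start (still in flight) as well as
    // when they end, so a consumer like the TUI can render what's happening
    // rather than what happened.
    type LiveProcessor struct {
        Exporter sdktrace.SpanExporter
    }

    var _ sdktrace.SpanProcessor = (*LiveProcessor)(nil)

    func (p *LiveProcessor) OnStart(ctx context.Context, s sdktrace.ReadWriteSpan) {
        // An in-flight span has no end time yet; consumers treat it as "running".
        _ = p.Exporter.ExportSpans(ctx, []sdktrace.ReadOnlySpan{s})
    }

    func (p *LiveProcessor) OnEnd(s sdktrace.ReadOnlySpan) {
        _ = p.Exporter.ExportSpans(context.Background(), []sdktrace.ReadOnlySpan{s})
    }

    func (p *LiveProcessor) Shutdown(ctx context.Context) error { return p.Exporter.Shutdown(ctx) }

    func (p *LiveProcessor) ForceFlush(ctx context.Context) error { return nil }

    // startCallSpan: one span per DagQL field resolution, with the encoded
    // callpbv1.Call for that single field attached as a base64 span attribute.
    func startCallSpan(ctx context.Context, field string, encodedCall []byte) (context.Context, func()) {
        ctx, span := otel.Tracer("dagql").Start(ctx, field)
        span.SetAttributes(attribute.String(
            "dagger.io/dag.call", // assumed attribute key; the real one may differ
            base64.StdEncoding.EncodeToString(encodedCall),
        ))
        return ctx, func() { span.End() }
    }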

That covers tracing. What about logging?

  1. The Go OpenTelemetry SDK does not yet implement support for logging, so we have a package of our own: basically copy-pasta of the tracing stack with changes to support logging, which is already supported everywhere else in the ecosystem. This can be removed once the Logs SDK prototype open-telemetry/opentelemetry-go#4955 lands.
  2. Same as above, we have a pub/sub stack for logs, too. There's no "live" vs "complete" distinction for logs though.
  3. Our shim detects the current trace and span, which Buildkit already propagates for us, and pipes the command's stdout/stderr to io.Writers that send the logs to OpenTelemetry (sketched after this list). This should maybe be upstreamed to Buildkit once 4955 lands.
  4. DagQL instrumentation automatically configures ioctx.Stdout and ioctx.Stderr to send to the resolver's span.
  5. To replace Progrock's "global logging" feature, we can now send logs with a special annotation that we can detect client-side to show them globally, while still maintaining a relationship to an originating span. This part isn't fully baked yet, but seems viable.
  6. Dagger Cloud will become an OTLP endpoint that is capable of receiving logs for spans.
  7. Engine logs can also be sent to Dagger Cloud as before, but I need to resurrect this code and confirm it works.
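
A rough sketch of the shim-side wiring from items 3 and 4. LogEmitter and every name below are stand-ins for the PR's own sdklog-based plumbing, not its actual API; the point is just that stdout/stderr become io.Writers whose writes turn into log data tied to the span carried by the context.

    package shimlogs

    import (
        "context"
        "os/exec"
    )

    // LogEmitter stands in for whatever the telemetry stack exposes for logs.
    type LogEmitter interface {
        Emit(ctx context.Context, stream string, line []byte)
    }

    // otelWriter forwards everything written to it as log data associated with
    // the span carried by ctx.
    type otelWriter struct {
        ctx     context.Context
        stream  string // "stdout" or "stderr"
        emitter LogEmitter
    }

    func (w *otelWriter) Write(p []byte) (int, error) {
        w.emitter.Emit(w.ctx, w.stream, p)
        return len(p), nil
    }

    // runWithTelemetry shows how the shim might wire a command's output up.
    func runWithTelemetry(ctx context.Context, emitter LogEmitter, name string, args ...string) error {
        cmd := exec.CommandContext(ctx, name, args...)
        cmd.Stdout = &otelWriter{ctx: ctx, stream: "stdout", emitter: emitter}
        cmd.Stderr = &otelWriter{ctx: ctx, stream: "stderr", emitter: emitter}
        return cmd.Run()
    }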

Breaking changes

  • The old TUIs have been removed: both the dagger run DAG TUI and the ancient interactive tree-view TUI. These were just not worth translating to the new world, so this was finally the time to GC them. Apologies to the handful of users who still used the tree-view style UI.
  • The --focus flag is gone.
  • The pipeline() APIs are now no-ops.

New features / improvements

  • The CLI now supports curl style verbosity flags.
    • -v keeps things from disappearing
    • -vv reveals internal/encapsulated spans
    • -vvv reveals spans that evaluated very quickly
  • dagger watch allows you to see all activity across the engine, which is pretty neat. I've already used it for troubleshooting integration tests. It works by just subscribing to "everything" via the pub/sub service mentioned above.
  • A span can be given a special attribute that tells the UI to collapse its children by default: telemetry.Encapsulate(). (See the sketch after this list.)
  • We now proactively shut down services on client close.
  • We should no longer have to worry about missing the 'last few bits' of telemetry. The pub/sub system now waits for active spans to complete before detaching. So hopefully that means no more "runaway builds" or services left running forever.
  • The TUI now supports a Ctrl+\ keybind that SIGQUITs itself, so if there's a hang for whatever reason we can more easily get a stack dump.
  • The Engine now uses the same OpenTelemetry stack as the rest of Dagger, so hopefully we don't have to worry too much about dependency hell. (Knock on wood.)
  • All the labels that used to be sent to Cloud as part of Pipelines are now associated to the CLI's OpenTelemetry resource instead.
  • User-land code can directly hook in to the broader trace; we automatically configure OTEL_ as appropriate in the shim.
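
To illustrate the encapsulation hint: in plain OpenTelemetry terms it boils down to a span attribute the frontend knows to look for. The attribute key and helper name below are assumptions for illustration; telemetry.Encapsulate() is the real API named above.

    package uihints

    import (
        "context"

        "go.opentelemetry.io/otel"
        "go.opentelemetry.io/otel/attribute"
        "go.opentelemetry.io/otel/trace"
    )

    // StartEncapsulated starts a span whose children the UI should collapse by
    // default (they can still be revealed with -vv).
    func StartEncapsulated(ctx context.Context, name string) (context.Context, trace.Span) {
        return otel.Tracer("dagger").Start(ctx, name,
            trace.WithAttributes(attribute.Bool("dagger.io/ui.encapsulate", true)), // assumed key
        )
    }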

Regressions

  • We no longer use Buildkit's progress stream, so we are at the mercy of their Otel integration, which is not as rich.
    • We no longer show CACHED status for Buildkit cache hits. :( Instead you'll see separate spans like cache request: ... and load cache: ....
    • We no longer show image pull/extract progress, so no more fancy 2d progress bars.
    • We no longer send the LLB op protobuf in the telemetry.
    • All of these could probably be addressed with upstream PRs.

Notable tweaks

  • DagQL no longer marks module functions as 'impure', which was causing parallel calls to never be deduped, leading to duplicate spans.
    • Technically we already cache module functions within a session, and the DagQL server cache has the same lifetime as the session, so this should not be a behavior change, but we'll want to change this if/when we take a swing at module function caching and/or publishing recipes.
  • dagger terminal no longer does the insane vterm emulation thing, which was originally done so we could support showing progress alongside your active terminal session.
    • Now you should just use dagger watch in a separate terminal. Implementing a terminal emulator is not our core competency. It was fun, but also endless toil. The new implementation is much simpler.
  • OpenTelemetry is very dependent on ctx propagation, and context.Background() breaks that, so now we use the new context.WithoutCancel() API, which is nearly always what we wanted instead, since it keeps all the juicy values but just drops the cancel-propagation aspect. (See the sketch after this list.)
  • We now initialize the TUI frontend super early on - literally before we even initialize telemetry. This is great for various reasons, but the trade-off is we need to do custom barebones flag parsing to handle -v/--debug and other global flags. We can't defer that to the usual flag parsing since that's tied to module loading and stuff (for e.g. dagger call).
  • We currently have our own hokey OpenTelemetry Go SDK for logging under telemetry/env and telemetry/sdklog which should be nuked as soon as the official SDK lands (Logs SDK prototype open-telemetry/opentelemetry-go#4955).
  • Deleted various stale commands under cmd/ that weren't worth maintaining.
    • otel-collector (this is just native now)
    • dagger-graph (this was coupled to Progrock/Buildkit progress stream)
    • upload-journal (replaced by OTLP)
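
A tiny example of the context.Background() to context.WithoutCancel() swap mentioned in the list above. context.WithoutCancel is a real Go 1.21 API; the helper name is just for illustration.

    package ctxutil

    import "context"

    // detach keeps the parent's values (span context, loggers, etc.) but stops
    // propagating cancelation to whatever runs under the returned context.
    func detach(ctx context.Context) context.Context {
        // Before: ctx = context.Background() // drops values AND cancelation
        // After: drop only the cancel propagation, keep everything else.
        return context.WithoutCancel(ctx)
    }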

* All Progrock plumbing is now gone, though we may want to bring it back
  for compatibility. Removing it was a useful exercise to find the many
  places where we're relying on it.
* TUI now supports -v, -vv, -vvv (configurable verbosity). Global flags
  like --debug, -v, --silent, etc. are processed anywhere in the command
  string and respected.
* CLI forwards engine traces and logs to configured exporters, no need
  to configure engine-side (we already need this flow for the TUI)
* "Live" spans are emitted to TUI and cloud, filtered out before sending
  to a traditional (non-Live) otel stack
* Engine supports pub/sub for traces and logs, can be exposed in the
  future as a GraphQL subscription
* Refactor context.Background usage to context.WithoutCancel. We usually
  don't want a total reset, since that drops the span context and any
  other telemetry related things (loggers etc). Go 1.21 added
  context.WithoutCancel which is more precise.
* engine: don't include source in slogs. Added this prospectively and it
  doesn't seem worth the noise.
* idtui: DB can record multiple traces, polish
  * multi traces is mostly for dagviz, so i can run it with a single DB
  * add 'passthrough' UI flag which tells the UI to ignore a span and
    descend into its children
  * add 'ignore' UI flag, to be used sparingly for things whose
    signal:noise ratio is irredeemably low (e.g. 'id' calls)
  * make loadFooFromID calls passthrough
  * make Buildkit gRPC calls passthrough
* Global Progrock logs are theoretically replaced with
  tracing.GlobalLogger, but it has yet to be integrated into anything.
* Module functions are pure after all. They're already cached
  per-session, so this makes DagQL reflect that, avoiding duplicate
  Buildkit work that would be deduped at the Buildkit layer. Cleans up
  the telemetry since previously you'd see duplicate queries.

* TODO: ensure draining is airtight
* TODO: global logging to TUI
* TODO: batch forwarded engine spans instead of emitting them "live"
* TODO: fix dagger terminal

Signed-off-by: Alex Suraci <alex@dagger.io>
vito added 13 commits March 6, 2024 18:29
Signed-off-by: Alex Suraci <alex@dagger.io>
previously we would cancel all subscribers for a trace whenever a
client/server went away. but for modules/nesting this meant the inner
call would cancel the whole trace early.

* TODO: looks like services still don't drain completely?

Signed-off-by: Alex Suraci <alex@dagger.io>
Signed-off-by: Alex Suraci <alex@dagger.io>
Signed-off-by: Alex Suraci <alex@dagger.io>
Signed-off-by: Alex Suraci <alex@dagger.io>
previously service spans would be left incomplete on exit. now we'll
detach from them on shutdown, which will only stop the service if we're
the last depender on it. end result _should_ be that services are always
completed through telemetry, but I've seen maybe 2 in 50 runs still
leave it running. still troubleshooting, but without this change there
is no hope at all.

fixes dagger#6493

Signed-off-by: Alex Suraci <alex@dagger.io>
Honestly not 100% confirmed, but seems right. I think the final solution
might be to get traces/logs out without going through a session in the
first place.

Signed-off-by: Alex Suraci <alex@dagger.io>
seeing a panic in ExportSpans/UploadTraces, this should help avoid
bringing whole server down - I think - or at least give us hope.

Signed-off-by: Alex Suraci <alex@dagger.io>
Signed-off-by: Alex Suraci <alex@dagger.io>
Signed-off-by: Alex Suraci <alex@dagger.io>
Signed-off-by: Alex Suraci <alex@dagger.io>
fetching the logs/traces over a session is really annoying with draining
because the session itself gets closed before things can be fully
flushed.

Signed-off-by: Alex Suraci <alex@dagger.io>
Signed-off-by: Alex Suraci <alex@dagger.io>
vito added 2 commits March 8, 2024 23:00
Signed-off-by: Alex Suraci <alex@dagger.io>
Signed-off-by: Alex Suraci <alex@dagger.io>
vito added 10 commits March 9, 2024 00:48
More than a 10x efficiency increase. Frontend still super easy to
implement.

Test:

  # in ~/src/bass
  $ with-dev dagger call -m ./ --src https://github.com/vito/bass unit --packages ./pkg/cli stdout --debug &> out
  $ rg measuring out | cut -d= -f2 | xargs | tr ' ' '+' | sed -e 's/0m//g' -e 's/[^0-9\+]//g' | cat -v | bc

Before:

  8524838 (~8.1 MiB)

After:

  727039 (~0.7 MiB)

Signed-off-by: Alex Suraci <alex@dagger.io>
Signed-off-by: Alex Suraci <alex@dagger.io>
kinda hacky, but it makes sense that we need to handle this, cause
loadFooFromID or generally anything can take an ID that's never been
seen by the server before, and the loadFooFromID span will come first.

Signed-off-by: Alex Suraci <alex@dagger.io>
Signed-off-by: Alex Suraci <alex@dagger.io>
Signed-off-by: Alex Suraci <alex@dagger.io>
Signed-off-by: Alex Suraci <alex@dagger.io>
Signed-off-by: Alex Suraci <alex@dagger.io>
Signed-off-by: Alex Suraci <alex@dagger.io>
Signed-off-by: Alex Suraci <alex@dagger.io>
Signed-off-by: Alex Suraci <alex@dagger.io>
@vito (Contributor, Author) commented Mar 29, 2024

Update: ok, Go SDK support for custom spans is implemented: 31b303b

But honestly it's not a great reference for other SDK authors; it's a bit hairy for not great reasons. I'd check out the Python SDK (or the Elixir SDK 🎉 - thanks @wingyplus!) instead. You can totally reference the 'live span' code though, that part's just copy-pasta from telemetry/inflight/ and shows how to implement it.

@sipsma @jedevc WDYT of refactoring the Go SDK codegen <=> Go SDK runtime <=> Dagger engine code sharing?¹ Rough idea is to change the runtime to work via replace dagger.io/dagger => ./internal/dagger, the same way the engine works. That way we can share code more easily without worrying about import paths. (Motivation: #6835 (comment))

That would result in the easiest central place to share code being the Go SDK, which kind of tracks I guess, since we're already incentivized to keep its dependencies minimal, as we would with any code shared between all three.

Would save that for a future PR, this one's already way too big.

Also @sipsma I think all the feedback is addressed now, modulo recent changes. I didn't find time to double-check the Buildkit memory thing so I just made it send to io.Discard and tweaked the comment. But the flag parsing situation is way less gross now.

🙏

Footnotes

  1. Having a hard time figuring out what to call these, lemme know if there are better terms. By "Go SDK codegen" I mean dagger.io/dagger, by "Go SDK runtime" I mean the module codegen'd package, and by the Dagger engine I mean... the Dagger engine.

Signed-off-by: Alex Suraci <alex@dagger.io>
no more exec /runtime!

Signed-off-by: Alex Suraci <alex@dagger.io>
worth refactoring, but not now™

Signed-off-by: Alex Suraci <alex@dagger.io>
@wingyplus (Contributor)
I found an issue while working on #6962. I created a file at sdk/elixir/test.exs with the snippet below

Application.ensure_all_started(:inets)

Mix.install([{:dagger, path: "."}])

defmodule Test do
  def run(dag \\ Dagger.connect!()) do
    dag
    |> Dagger.Client.container()
    |> Dagger.Container.from("alpine")
    |> Dagger.Container.with_exec(~w"echo hello")
    |> Dagger.Sync.sync()
  after
    Dagger.close(dag)
  end
end

Test.run

The script cannot compile because I accidentally introduced a compilation error in the SDK. When running it with with-dev dagger run elixir test.exs I got this error

wingyplus@WINGYMOMO:~/src/github.com/dagger/dagger-elixir-otel/sdk/elixir$ with-dev dagger run elixir test.exs
==> dagger
Compiling 2 files (.ex)

== Compilation error in file lib/dagger/core/graphql_client/httpc.ex ==
** (SyntaxError) invalid syntax found on lib/dagger/core/graphql_client/httpc.ex:23:7:
    error: syntax error before: '['
    │
 23 │       []
    │       ^
    │
    └─ lib/dagger/core/graphql_client/httpc.ex:23:7
    (elixir 1.16.2) lib/kernel/parallel_compiler.ex:428: anonymous fn/5 in Kernel.ParallelCompiler.spawn_workers/8
could not compile dependency :dagger, "mix compile" failed. Errors may have been logged above. You can recompile this dependency with "mix deps.compile dagger --force", update it with "mix deps.update dagger" or clean it with "mix deps.clean dagger"
21:42:58 WRN failed to get repo HEAD err="reference not found"
exit status 1
wingyplus@WINGYMOMO:~/src/github.com/dagger/dagger-elixir-otel/sdk/elixir$

Let's ignore the Elixir error for now. The issue is that when the error occurred, my shell display broke for some reason. :(

vito and others added 2 commits April 2, 2024 11:28
Signed-off-by: Alex Suraci <alex@dagger.io>
Signed-off-by: Helder Correia <174525+helderco@users.noreply.github.com>
@vito (Contributor, Author) commented Apr 2, 2024

Thanks for the approval. 🙏

Integration tests were failing for the Python SDK, related to the new opentelemetry dependency, so I've just reverted those for now so we can get this PR merged first. Each SDK can just be a separate PR.

Will merge once checks are green.

looks like there's more to figure out with module dependencies? either
way, don't want this to block the current PR, they can be re-introduced
in another PR like the other SDKs

Revert "Pin python requirements"

This reverts commit b40c411.

Revert "Add Python support"

This reverts commit 08aa92c.

Signed-off-by: Alex Suraci <alex@dagger.io>
@vito (Contributor, Author) commented Apr 2, 2024

Update: some Python related tests flaked due to a concurrent map write here:

https://github.com/dagger/dagger/blob/main/sdk/python/runtime/discovery.go#L250

This is unrelated, but I've piled a quick fix onto my PR (just added a mutex), since it failed twice in a row.
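
For reference, the quick fix amounts to guarding the shared map with a mutex, roughly like this; the type and field names are illustrative, not the actual ones in discovery.go.

    package discovery

    import "sync"

    type moduleFiles struct {
        mu    sync.Mutex
        files map[string][]byte
    }

    // put is safe to call from concurrent goroutines; without the mutex, the
    // concurrent map write panics the process (which is what flaked the tests).
    func (m *moduleFiles) put(name string, data []byte) {
        m.mu.Lock()
        defer m.mu.Unlock()
        if m.files == nil {
            m.files = map[string][]byte{}
        }
        m.files[name] = data
    }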

@vito vito merged commit 6467198 into dagger:main Apr 3, 2024
43 checks passed
@vito vito deleted the otel-tui branch April 5, 2024 16:11
vikram-dagger pushed a commit to vikram-dagger/dagger that referenced this pull request May 3, 2024
* progrock -> otel

* All Progrock plumbing is now gone, though we may want to bring it back
  for compatibility. Removing it was a useful exercise to find the many
  places where we're relying on it.
* TUI now supports -v, -vv, -vvv (configurable verbosity). Global flags
  like --debug, -v, --silent, etc. are processed anywhere in the command
  string and respected.
* CLI forwards engine traces and logs to configured exporters, no need
  to configure engine-side (we already need this flow for the TUI)
* "Live" spans are emitted to TUI and cloud, filtered out before sending
  to a traditional (non-Live) otel stack
* Engine supports pub/sub for traces and logs, can be exposed in the
  future as a GraphQL subscription
* Refactor context.Background usage to context.WithoutCancel. We usually
  don't want a total reset, since that drops the span context and any
  other telemetry related things (loggers etc). Go 1.21 added
  context.WithoutCancel which is more precise.
* engine: don't include source in slogs. Added this prospectively and it
  doesn't seem worth the noise.
* idtui: DB can record multiple traces, polish
  * multi traces is mostly for dagviz, so i can run it with a single DB
  * add 'passthrough' UI flag which tells the UI to ignore a span and
    descend into its children
  * add 'ignore' UI flag, to be used sparingly for things whose
    signal:noise ratio is irredeemably low (e.g. 'id' calls)
  * make loadFooFromID calls passthrough
  * make Buildkit gRPC calls passthrough
* Global Progrock logs are theoretically replaced with
  tracing.GlobalLogger, but it has yet to be integrated into anything.
* Module functions are pure after all. They're already cached
  per-session, so this makes DagQL reflect that, avoiding duplicate
  Buildkit work that would be deduped at the Buildkit layer. Cleans up
  the telemetry since previously you'd see duplicate queries.

* TODO: ensure draining is airtight
* TODO: global logging to TUI
* TODO: batch forwarded engine spans instead of emitting them "live"
* TODO: fix dagger terminal

Signed-off-by: Alex Suraci <alex@dagger.io>

* fix log draining, again, ish

previously we would cancel all subscribers for a trace whenever a
client/server went away. but for modules/nesting this meant the inner
call would cancel the whole trace early.

* TODO: looks like services still don't drain completely?

Signed-off-by: Alex Suraci <alex@dagger.io>

* don't set up logs if not configured

Signed-off-by: Alex Suraci <alex@dagger.io>

* respect configured level

Signed-off-by: Alex Suraci <alex@dagger.io>

* clean up shim early tracing remnants

Signed-off-by: Alex Suraci <alex@dagger.io>

* synchronously detach services on main client exit

previously service spans would be left incomplete on exit. now we'll
detach from them on shutdown, which will only stop the service if we're
the last depender on it. end result _should_ be that services are always
completed through telemetry, but I've seen maybe 2 in 50 runs still
leave it running. still troubleshooting, but without this change there
is no hope at all.

fixes dagger#6493

Signed-off-by: Alex Suraci <alex@dagger.io>

* flush telemetry before closing server clients

Honestly not 100% confirmed, but seems right. I think the final solution
might be to get traces/logs out without going through a session in the
first place.

Signed-off-by: Alex Suraci <alex@dagger.io>

* switch from errgroup to conc for panic handling

seeing a panic in ExportSpans/UploadTraces, this should help avoid
bringing whole server down - I think - or at least give us hope.

Signed-off-by: Alex Suraci <alex@dagger.io>

* nest 'starting session' beneath 'connect'

Signed-off-by: Alex Suraci <alex@dagger.io>

* send logs out from engine to log exporter too

Signed-off-by: Alex Suraci <alex@dagger.io>

* bump midterm

Signed-off-by: Alex Suraci <alex@dagger.io>

* switch to server-side telemetry pub/sub

fetching the logs/traces over a session is really annoying with draining
because the session itself gets closed before things can be fully
flushed.

Signed-off-by: Alex Suraci <alex@dagger.io>

* show newer traces first

Signed-off-by: Alex Suraci <alex@dagger.io>

* cleanup

Signed-off-by: Alex Suraci <alex@dagger.io>

* send individual Calls over telemetry instead of IDs

More than a 10x efficiency increase. Frontend still super easy to
implement.

Test:

  # in ~/src/bass
  $ with-dev dagger call -m ./ --src https://github.com/vito/bass unit --packages ./pkg/cli stdout --debug &> out
  $ rg measuring out | cut -d= -f2 | xargs | tr ' ' '+' | sed -e 's/0m//g' -e 's/[^0-9\+]//g' | cat -v | bc

Before:

  8524838 (~8.1 MiB)

After:

  727039 (~0.7 MiB)

Signed-off-by: Alex Suraci <alex@dagger.io>

* idtui Base was correct in returning bool

Signed-off-by: Alex Suraci <alex@dagger.io>

* handle case where calls haven't been seen yet

kinda hacky, but it makes sense that we need to handle this, cause
loadFooFromID or generally anything can take an ID that's never been
seen by the server before, and the loadFooFromID span will come first.

Signed-off-by: Alex Suraci <alex@dagger.io>

* idtui: add space between progress and primary output

Signed-off-by: Alex Suraci <alex@dagger.io>

* swap -vvv and -vv, -vv now breaks encapsulation

Signed-off-by: Alex Suraci <alex@dagger.io>

* cleanups

Signed-off-by: Alex Suraci <alex@dagger.io>

* tidy mage

Signed-off-by: Alex Suraci <alex@dagger.io>

* tidy

Signed-off-by: Alex Suraci <alex@dagger.io>

* loosen go.mod constraints

Signed-off-by: Alex Suraci <alex@dagger.io>

* revive labels tests

Signed-off-by: Alex Suraci <alex@dagger.io>

* fix cachemap tests

Signed-off-by: Alex Suraci <alex@dagger.io>

* nuclear option: wait for all spans to complete

Rather than closing the telemetry connection and hoping the timing works
out, we keep track of which traces have active spans and wait for that
count to reach 0.

A bit more complicated but not seeing a simpler solution really. Without
this we can't ensure that the client sees the very outermost spans
complete.

Signed-off-by: Alex Suraci <alex@dagger.io>

* pass-through all gRPC stuff

hasn't really been useful, it's available in the full trace for devs, or
we can add a verbosity level.

Signed-off-by: Alex Suraci <alex@dagger.io>

* dagviz: tweaks to support visualizing a live trace

Signed-off-by: Alex Suraci <alex@dagger.io>

* better 'docker tag' parsing

Signed-off-by: Alex Suraci <alex@dagger.io>

* fixup docker tag check

Signed-off-by: Alex Suraci <alex@dagger.io>

* pass auth headers to OTLP logs too

Signed-off-by: Alex Suraci <alex@dagger.io>

* fix stdio not making it out of gateway containers

Signed-off-by: Alex Suraci <alex@dagger.io>

* fix terminal support

Signed-off-by: Alex Suraci <alex@dagger.io>

* drain immediately when interrupted

otherwise we can get stuck waiting for child spans of a nested process
that got kill -9'd. not perfect but better than hanging on Ctrl+C which
is already an emergent situation where you're not likely that interested
in any remaining data if you already had a reason to interrupt.

in Cloud we'll clean up any orphaned spans based on keepalives anyway.

Signed-off-by: Alex Suraci <alex@dagger.io>

* fix unintentionally HTTP-ifying gRPC otlp endpoint

Signed-off-by: Alex Suraci <alex@dagger.io>

* give up retrying connection if outer ctx canceled

Signed-off-by: Alex Suraci <alex@dagger.io>

* initiate draining only when main client goes away

Signed-off-by: Alex Suraci <alex@dagger.io>

* appease linter

Signed-off-by: Alex Suraci <alex@dagger.io>

* remove unnecessary wait

we don't need to try synchronizing here now that we just generically
wait for all spans to complete

Signed-off-by: Alex Suraci <alex@dagger.io>

* fix panic if no telemetry

Signed-off-by: Alex Suraci <alex@dagger.io>

* remove debug log

Signed-off-by: Alex Suraci <alex@dagger.io>

* print final progress tree in plain mode

no substitute for live console streaming, but easier to implement for
now, and probably easier to read in CI. probably needs more work, but
might get some tests passing.

Signed-off-by: Alex Suraci <alex@dagger.io>

* fix Windows build

Signed-off-by: Alex Suraci <alex@dagger.io>

* propagate spans through dagger-in-dagger

Signed-off-by: Alex Suraci <alex@dagger.io>

* retry connecting to telemetry

Signed-off-by: Alex Suraci <alex@dagger.io>

* propagate span context through dagger run

Signed-off-by: Alex Suraci <alex@dagger.io>

* install default labels as otel resource attrs

Signed-off-by: Alex Suraci <alex@dagger.io>

* tidy

Signed-off-by: Alex Suraci <alex@dagger.io>

* remove pipeline tests

these are expected to fail now

Signed-off-by: Alex Suraci <alex@dagger.io>

* fail root span when command fails

Signed-off-by: Alex Suraci <alex@dagger.io>

* Container.import: add span for streaming image

Signed-off-by: Alex Suraci <alex@dagger.io>

* idtui: break encapsulation in case of errors

Signed-off-by: Alex Suraci <alex@dagger.io>

* fix schema-level logging not exporting

caught by TestDaggerUp/random

Signed-off-by: Alex Suraci <alex@dagger.io>

* update TestDaggerRun assertion

Signed-off-by: Alex Suraci <alex@dagger.io>

* fix test not syncing on progress completion

Signed-off-by: Alex Suraci <alex@dagger.io>

* add verbose debug log

Signed-off-by: Alex Suraci <alex@dagger.io>

* respect $DAGGER_CLOUD_URL and $DAGGER_CLOUD_TOKEN

promoting these from _EXPERIMENTAL along the way, which has already been
done for _TOKEN, don't really see a strong reason to keep the
_EXPERIMENTAL prefix, but low conviction

Signed-off-by: Alex Suraci <alex@dagger.io>

* port 'processor: support span keepalive'

originally aluzzardi/otel-in-flight@2fc011f

Signed-off-by: Alex Suraci <alex@dagger.io>

* add 'watch' command

really helps with troubleshooting hanging tests!

Signed-off-by: Alex Suraci <alex@dagger.io>

* set a reasonable window size in plain mode

otherwise the terminals resize a ton of times when a long string is
printed, absolutely tanking performance. would be nice if that were
fast, but no time for that now.

Signed-off-by: Alex Suraci <alex@dagger.io>

* manually revert container.import change

i thought this wouldn't break it, but ... ?

Signed-off-by: Alex Suraci <alex@dagger.io>

* fix race

Signed-off-by: Alex Suraci <alex@dagger.io>

* mark watch command experimental

Signed-off-by: Alex Suraci <alex@dagger.io>

* fixup lock, more logging

Signed-off-by: Alex Suraci <alex@dagger.io>

* tidy

Signed-off-by: Alex Suraci <alex@dagger.io>

* fix data race in tests

Signed-off-by: Alex Suraci <alex@dagger.io>

* fix java SDK hang once again

really not sure what's writing to stderr even with --silent but this is
just too brittle. redirect stderr to /dev/null instead.

Signed-off-by: Alex Suraci <alex@dagger.io>

* retire dagger.io/ui.primary, use root span instead

fixes Views test; frontend must have been getting confused because there
were multiple "primary" spans

Signed-off-by: Alex Suraci <alex@dagger.io>

* take 2: just manually mark the 'primary' span

Signed-off-by: Alex Suraci <alex@dagger.io>

* merge tracing and telemetry packages

Signed-off-by: Alex Suraci <alex@dagger.io>

* cleanups

Signed-off-by: Alex Suraci <alex@dagger.io>

* roll back sync detach change

this was no longer needed with the change to wait for spans to finish,
not worth the review-time distraction

Signed-off-by: Alex Suraci <alex@dagger.io>

* cleanups

Signed-off-by: Alex Suraci <alex@dagger.io>

* update comment

Signed-off-by: Alex Suraci <alex@dagger.io>

* remove dead code

Signed-off-by: Alex Suraci <alex@dagger.io>

* default primary span to root span

Signed-off-by: Alex Suraci <alex@dagger.io>

* remove unused module arg

Signed-off-by: Alex Suraci <alex@dagger.io>

* send engine traces/logs to cloud

Signed-off-by: Alex Suraci <alex@dagger.io>

* implement sub metrics pub/sub

Some clients presume this service is supported by the OTLP endpoint. So
we can just have a stub implementation for now.

Signed-off-by: Alex Suraci <alex@dagger.io>

* sdk/go runtime: implement otel propagation

TODO: set up otel for you

Signed-off-by: Alex Suraci <alex@dagger.io>

* tidy

Signed-off-by: Alex Suraci <alex@dagger.io>

* add scary comment

Signed-off-by: Alex Suraci <alex@dagger.io>

* batch events that are sent from the engine

Previously we were just sending each individual update to the configured
exporters, which was very expensive and would even slow down the TUI.

When I originally tried to send it to span processors, nothing would be
sent out; turns out that was because the transform.Spans call we were
using didn't set the `Sampled` trace flag.

Now we forward engine traces and logs to all configured processors,
so their individual batching settings should be respected.

Signed-off-by: Alex Suraci <alex@dagger.io>

* fix spans being deduped within single batch

* fix detection for in flight spans; we need to check EndTime <
  StartTime since sometimes we end up with a 1754 timestamp
* when a span is already present in a batch, update it in-place rather
  than dropping it on the floor

Signed-off-by: Alex Suraci <alex@dagger.io>

* Add Python support

Signed-off-by: Helder Correia <174525+helderco@users.noreply.github.com>

* shim: proxy otel to 127.0.0.1:0

more universally compatible than unix://

Signed-off-by: Alex Suraci <alex@dagger.io>

* remove unnecessary fn

Signed-off-by: Alex Suraci <alex@dagger.io>

* attributes: add passthrough, bikeshed + document

also start cleaning up "tasks" cruft nonsense, these can just be plain
old attributes on a single span i think

Signed-off-by: Alex Suraci <alex@dagger.io>

* fix janky flag parsing

parse global flags in two passes, ensuring the same flags are installed
in both cases, and capturing the values before installing them into the
real flag set, since that clobbers the values

Signed-off-by: Alex Suraci <alex@dagger.io>

* discard Buildkit progress

...just in case it gets buffered in memory forever otherwise

Signed-off-by: Alex Suraci <alex@dagger.io>

* sdk/go: somewhat gross support for opentelemetry

had to copy-paste a lot of the telemetry code into sdk/go/. would love
to just move everything there so it can be shared between the shim, the
Go runtime, and the engine, however it is currently a huge PITA to share
code between all three, because of the way codegen works. saving that
for another day. maybe tomorrow.

Signed-off-by: Alex Suraci <alex@dagger.io>

* send logs to function call span, not exec /runtime

Signed-off-by: Alex Suraci <alex@dagger.io>

* tui: respect dagger.io/ui.mask

no more exec /runtime!

Signed-off-by: Alex Suraci <alex@dagger.io>

* silence linter

worth refactoring, but not now™

Signed-off-by: Alex Suraci <alex@dagger.io>

* ignore --help when parsing global flags

Signed-off-by: Alex Suraci <alex@dagger.io>

* Pin python requirements

Signed-off-by: Helder Correia <174525+helderco@users.noreply.github.com>

* revert Python SDK changes for now

looks like there's more to figure out with module dependencies? either
way, don't want this to block the current PR, they can be re-introduced
in another PR like the other SDKs

Revert "Pin python requirements"

This reverts commit b40c411.

Revert "Add Python support"

This reverts commit 08aa92c.

Signed-off-by: Alex Suraci <alex@dagger.io>

* fix race conditions in python SDK runtime

Signed-off-by: Alex Suraci <alex@dagger.io>

---------

Signed-off-by: Alex Suraci <alex@dagger.io>
Signed-off-by: Helder Correia <174525+helderco@users.noreply.github.com>
Co-authored-by: Helder Correia <174525+helderco@users.noreply.github.com>