HTTP Client Release 5.1.3: Fix silent stream crashes with proper OTP application architecture#67
Merged
dcrockwell merged 3 commits intomainfrom Mar 17, 2026
Merged
HTTP Client Release 5.1.3: Fix silent stream crashes with proper OTP application architecture#67dcrockwell merged 3 commits intomainfrom
dcrockwell merged 3 commits intomainfrom
Conversation
## Why This Change Was Made - In production, three independent streaming requests hung for exactly 900,001ms (the monitor timeout) with zero crash reports in CloudWatch. The root cause: the `dream_http_client_ref_mapping` ETS table was owned by whichever short-lived process first called `ensure_ref_mapping_table()`. When that process exited, the ETS table was destroyed. Any concurrent stream process that tried to look up ref mappings crashed with `badarg` inside `decode_stream_message_for_selector/1` — which runs inside the selector's `receive` block, so the crash killed the stream process silently. No `on_stream_error` callback fired. The stream just vanished. - A secondary race condition existed: two concurrent calls to `ensure_ref_mapping_table()` could both see `undefined` from `ets:info/1`, and the second `ets:new/2` would crash with `badarg`. ## What Was Changed - `ensure_ref_mapping_table()` now spawns a dedicated `table_holder_loop/0` process and transfers ETS table ownership via `ets:give_away/3`. The holder process runs an infinite receive loop, so the table outlives any individual stream process. - `ets:new` is wrapped in `try/catch error:badarg` to handle the race where another process creates the table between our `ets:info` check and our `ets:new` call. - All 7 bare ETS access functions (`lookup_ref_by_string`, `lookup_string_by_ref`, `store_ref_mapping`, `remove_ref_mapping`, `maybe_store_stream_zlib`, `maybe_decompress_stream_chunk`, `cleanup_stream_zlib`) now have `try/catch error:badarg` guards with safe fallbacks. - `get_or_create_string_id` recovers from table destruction by re-creating the table on `badarg`. - 9 regression tests added: 4 deterministic ETS-level tests (holder ownership, table survives creator exit, mappings persist, concurrent creation) and 5 integration tests through `start_stream()` (streams from expired callers, concurrent fast+slow streams, sequential streams). - Version bumped to 5.1.3, CHANGELOG and release notes updated. ## Note to Future Engineer - The `table_holder_loop/0` process intentionally leaks when tests delete and recreate the table — it just sits in `receive` doing nothing forever. This is harmless in both production (table is created once) and tests (processes are cleaned up when the VM exits). Don't add a `stop` message handler unless you enjoy debugging "who killed my ETS table" for the second time. - The try/catch on every ETS access is belt-and-suspenders: the holder process should make table destruction impossible, but the Erlang VM has a long history of teaching developers that "should" and "will" are different words.
## Why This Change Was Made - The previous fix (holder process + try/catch) was architecturally wrong. The unsupervised bare `spawn` for table ownership could die silently, and the try/catch on every ETS access masked real errors — returning garbled compressed data or silently losing mappings. Two layers of band-aids compensating for each other is not how the BEAM works. - The project's own BEAM rules explicitly state: "Never spawn unsupervised processes for anything that matters" and "Never try/catch your way out of a process crash." ## What Was Changed - `dream_http_client` is now a proper OTP application. Added `dream_http_client_app.erl` (application behaviour) and `dream_http_client_sup.erl` (supervisor). Both ETS tables are created in `start/2`, owned by the application master process for the entire application lifetime. - `gleam.toml` has `[erlang] application_start_module` set. - Removed `ensure_ref_mapping_table/0`, `table_holder_loop/0`, and all `try/catch error:badarg` guards from the 7 ETS access functions in `dream_httpc_shim.erl`. ETS operations are now direct calls. - Removed lazy table creation from `client.gleam`: `ensure_ets_tables`, `ensure_recorder_table`, `ensure_ref_mapping_table_wrapper`, `ets_table_exists`, and `ets_new` are all gone. - Rewrote regression tests for the new architecture. Removed tests that delete/recreate the table (architecture makes that unnecessary). Kept all integration tests for concurrent streams. - Updated CHANGELOG and release notes to reflect the architectural change. ## Note to Future Engineer - This follows the Ranch pattern (ranch_app.erl) — create ETS tables in the OTP application's start/2 callback. If you're wondering "why not just spawn a holder process?" — we tried that. It was the previous commit. Read it and weep. - The tables are `public` and `named_table`, accessible from any process. The application master owns them. If they don't exist, the application isn't started. That's a crash-loudly situation, not a try-catch-and-hope situation.
Fix silent stream crashes caused by ETS table ownership
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
This release fixes a critical production bug where concurrent HTTP streams silently died when the process that created the shared ETS table exited.
dream_http_clientis now a proper OTP application — its ETS tables are owned by the application master process for the entire application lifetime.Why
In production, three independent streaming requests hung for exactly 900,001ms (the monitor timeout) with zero crash reports in CloudWatch. The streams just vanished — no
on_stream_errorcallback fired, no log entries.The root cause was ETS table ownership semantics: the
dream_http_client_ref_mappingtable was owned by whichever short-lived process first created it. When that process exited, Erlang destroyed the table. Every concurrent stream process then crashed withbadarginside the selector'sreceiveblock, killing the stream silently.What
dream_http_clientis now a proper OTP application following the Ranch pattern (the same approach used by Cowboy's transport layer for itsranch_serverETS table):dream_http_client_app.erl— OTP application behaviour that creates both ETS tables (dream_http_client_ref_mappinganddream_http_client_stream_recorders) instart/2dream_http_client_sup.erl— minimal supervisor required by the application callbackgleam.toml—[erlang] application_start_module = "dream_http_client_app"The tables are owned by the application master process, which lives for the entire application lifetime. No process can destroy them by exiting.
How
The initial fix attempt used an unsupervised
spawn+ets:give_awayholder process withtry/catch error:badargon every ETS access. This was architecturally wrong — an unsupervised process that could die silently, with try/catch masking real errors (returning garbled compressed data, silently losing mappings).The correct fix follows the BEAM's built-in application lifecycle: create the tables in the OTP application's
start/2callback, where they're owned by the application master. This eliminates the holder process, all try/catch blocks, race conditions on creation, and lazy table initialization. ETS operations are now direct calls — if the table doesn't exist, the process crashes loudly (correct behavior, means the application isn't started).All 192 tests pass, including 7 new regression tests that verify the OTP ownership model and reproduce the exact concurrent-streams-from-expired-callers bug scenario.
No API changes.