fix: ReplicationConnection init timeout #1430

Merged
edgurgel merged 2 commits into main from fix/postgrex-replication-init-timeout on Jun 23, 2025

Conversation

edgurgel
Member

What kind of change does this PR introduce?

Add a GenServer wrapper whose initialisation timeout we can control.

What is the current behavior?

We think that Postgrex.ReplicationConnection with sync_connect: true can block indefinitely.

Here is one of the partitioned supervisors blocking on Postgrex.ReplicationConnection.start_link:

case Postgrex.ReplicationConnection.start_link(__MODULE__, attrs, connection_opts) do

iex> Process.info(pid, :current_stacktrace)
{:current_stacktrace,
 [
   {:proc_lib, :sync_start, 2, [file: ~c"proc_lib.erl", line: 434]},
   {Realtime.Tenants.ReplicationConnection, :start_link, 1,
    [file: ~c"lib/realtime/tenants/replication_connection.ex", line: 124]},
   {DynamicSupervisor, :start_child, 3,
    [file: ~c"lib/dynamic_supervisor.ex", line: 795]},
   {DynamicSupervisor, :handle_start_child, 2,
    [file: ~c"lib/dynamic_supervisor.ex", line: 781]},
   {:gen_server, :try_handle_call, 4, [file: ~c"gen_server.erl", line: 2381]},
   {:gen_server, :handle_msg, 6, [file: ~c"gen_server.erl", line: 2410]},
   {:proc_lib, :init_p_do_apply, 3, [file: ~c"proc_lib.erl", line: 329]}
 ]}

This supervisor was stuck for a very long time.

Notice that Realtime.Tenants.ReplicationConnection.init/1 does not do any lengthy work that could block Postgrex.ReplicationConnection.init/1:

def init(%__MODULE__{tenant_id: tenant_id, monitored_pid: monitored_pid} = state) do
  Logger.metadata(external_id: tenant_id, project: tenant_id)
  Process.monitor(monitored_pid)
  state = %{state | table: "messages", schema: "realtime"}

  state = %{
    state
    | publication_name: publication_name(state),
      replication_slot_name: replication_slot_name(state)
  }

  Logger.info("Initializing connection with the status: #{inspect(state, pretty: true)}")
  {:ok, state}
end

The ideal solution would be to change Postgrex.ReplicationConnection.start_link to accept a new timeout option for the init step: https://github.com/elixir-ecto/postgrex/blob/257daa773a7558d574df3aa3b558664275787ff8/lib/postgrex/replication_connection.ex#L356-L366

What is the new behavior?

Startup can no longer block indefinitely: the GenServer wrapper eventually times out during init, which causes the stuck processes to stop.
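
A minimal sketch of the idea, not the code from this PR: `InitTimeoutWrapper`, `start_fun`, and `init_timeout` are hypothetical names chosen for illustration. It relies only on the documented `:timeout` option of `GenServer.start_link/3`, which bounds how long `init/1` may run. In the real change, the blocking call inside `init/1` would presumably be `Postgrex.ReplicationConnection.start_link/3`.

```elixir
defmodule InitTimeoutWrapper do
  # Hypothetical wrapper: runs a possibly blocking start function inside
  # init/1 so that the caller's wait is bounded by `init_timeout`.
  use GenServer

  # `start_fun` is a zero-arity function expected to return {:ok, pid}
  # or {:error, reason}, like a typical start_link call.
  def start_link(start_fun, init_timeout) when is_function(start_fun, 0) do
    # If init/1 has not returned within `init_timeout` milliseconds, the
    # wrapper process is killed and the caller receives {:error, :timeout}.
    GenServer.start_link(__MODULE__, start_fun, timeout: init_timeout)
  end

  @impl true
  def init(start_fun) do
    case start_fun.() do
      # Keep the started pid as the wrapper's state.
      {:ok, pid} -> {:ok, pid}
      # Propagate a start failure as the wrapper's stop reason.
      {:error, reason} -> {:stop, reason}
    end
  end
end

# Stand-in for a start call that never returns.
stuck = fn -> Process.sleep(:infinity) end

# The caller gets control back after roughly 100 ms instead of hanging.
IO.inspect(InitTimeoutWrapper.start_link(stuck, 100))
# prints {:error, :timeout}
```

Because the wrapped process is started with a link from inside the wrapper's `init/1`, killing the wrapper on timeout also sends an exit signal to that linked process. Whether the linked process terminates immediately depends on how it handles exit signals, which this sketch does not cover.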

edgurgel changed the title from "Fix/postgrex replication init timeout" to "fix: ReplicationConnection init timeout" on Jun 23, 2025
coveralls commented Jun 23, 2025

Coverage Status

coverage: 84.89% (+0.4%) from 84.505%
when pulling b186a32 on fix/postgrex-replication-init-timeout
into 44c0543 on main.

edgurgel merged commit 5b349c8 into main on Jun 23, 2025
5 of 9 checks passed
edgurgel deleted the fix/postgrex-replication-init-timeout branch on June 23, 2025 04:27
kiwicopple
Member

🎉 This PR is included in version 2.37.4 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀
