Postgresql 10.5: ERROR: no data left in message #183

Closed
ovv opened this issue Aug 13, 2018 · 32 comments

Comments

@ovv

ovv commented Aug 13, 2018

After upgrading to PostgreSQL 10.5 (Debian 10.5-1.pgdg90+1), replication starts failing with ERROR: no data left in message.

It doesn't happen straight after the update, but only once conflict resolution is needed (set to apply_remote in our case).

@tvarsis

tvarsis commented Aug 14, 2018

I'm getting the same. Critical bug indeed since all replication is crashing.

@ball-hayden

This appears to be an issue if the node is running PostgreSQL 10.5.

I'm running a 10.5 node and 10.3 subscriber and seeing this issue.
I'm not seeing this issue with a 10.3 node and a 10.5 subscriber.

Unfortunately, I'm not seeing anything else interesting in the logs that might be helpful.

@DrakoRod

Yep, I have the same problem on PostgreSQL 10.5. Both nodes run the same version, and they show:
ERROR: no data left in message

The OS is CentOS 7:
PostgreSQL 10.5 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-28), 64-bit

@abs0

abs0 commented Aug 16, 2018

Likewise - source PostgreSQL 9.5 on Ubuntu, targets PostgreSQL 10.5 (previously working on 10.4) on Ubuntu.

@tvarsis

tvarsis commented Aug 17, 2018

We ended up moving to Postgres native replication and away from pglogical. I highly recommend it if you don't use the advanced features of pglogical and are in a hurry to get this working. It is super simple to set up and works great on 10.5 as well. My guess is that it will also keep working better across upgrades, since it is a native module and part of the regular test/release cycle.

The only thing we had to adjust was getting rid of some truncate statements in nightly jobs and change them to delete statements instead, since truncate is not supported in native replication until Postgres 11.

If you are unsure whether your current pglogical replication will work with native replication, you can read more about the limitations and differences here:

https://blog.2ndquadrant.com/pglogical-logical-replication-postgresql-10/

@danielsabinasz

Same problem here with PostgreSQL 9.6 and pglogical 2.2.0

@abs0

abs0 commented Aug 20, 2018

@tvarsis - unfortunately switching to Postgres native replication is not an option for us as we have a primary server on 9.5. Maybe when we have that one migrated to 10 :)

Actually - does anyone know if it would be feasible to set up pglogical without conflict resolution to avoid this issue?

@2ndquadrant-ci

2ndquadrant-ci commented Aug 20, 2018 via email

@ringerc
Contributor

ringerc commented Aug 21, 2018

You can work around this by recompiling pglogical against 10.5.

@ovv
Author

ovv commented Aug 21, 2018

Awesome, thanks. Is there any plan to update the version available in the 2ndQuadrant Debian repository?

@ringerc
Contributor

ringerc commented Aug 21, 2018

Yes, more to come. We're preparing a hackers post and some updates.

@mjevans

mjevans commented Aug 21, 2018

I'm at least also seeing this same error message after restarting Debian PG 9.6 servers (the pglogical source is from the 2ndQuadrant repository for Debian stretch).

Package version 2.2.0-1.stretch+1, maybe recent security updates have broken something?

@ringerc
Contributor

ringerc commented Aug 22, 2018

It's an issue with the latest point release. You must ensure your logical decoding output plugins (pglogical, bdr, etc) are built with the same PostgreSQL point release as the running PostgreSQL. If you're running a plugin built on 10.4 on 10.5, it'll crash. Similarly, if you run a plugin built on 10.5 on 10.4, that'll crash too. This affects all the point releases, not just 10.x.

I'll link the hackers post with details soon.

Cc @mjevans

@greigwise

Is there any additional word on this? I upgraded postgres to 9.6.10 from 9.6.9 and it has broken my logical replication using version 2.2.0. I tried to make pglogical from the source against the newer version of postgres which ended up reverting me to pglogical version 2.0.0 somehow, but got it installed and I'm still failing in the same way. This is a critical production system here, so any advice would be appreciated.

@martinmarques
Contributor

martinmarques commented Aug 27, 2018 via email

@martinmarques
Contributor

martinmarques commented Aug 27, 2018 via email

@df7cb

df7cb commented Aug 27, 2018

@ringerc: I'm very interested in the details of this ABI change, because this sort of issue is the reason Debian is usually very conservative about updating packages to new versions in the stable release. PostgreSQL has a blanket exception there, and new upstream versions are simply waved through because the PG project has a history of being careful about not breaking things. Do you have any pointers?

@df7cb

df7cb commented Aug 27, 2018

Oh btw, recompiling pglogical 2.2.0 does not fix the breakage (as per the regression tests) on Debian.

@greigwise

Well, we just got the latest version (2.2.0-3) installed and we're seeing the same error. Is it possible postgres needs to be restarted?

@martinmarques
Contributor

martinmarques commented Aug 27, 2018 via email

@martinmarques
Contributor

martinmarques commented Aug 27, 2018 via email

@df7cb

df7cb commented Aug 27, 2018 via email

@alvherre
Contributor

hi @ChristophBerg did you look in the tmp_check/log directory? The useful output from prove should be there.

@ringerc
Contributor

ringerc commented Aug 28, 2018

So we're looking at https://dl.2ndquadrant.com/default/release/browse/apt/pool/main/p/pglogical/ per https://dl.2ndquadrant.com ; there are -3 builds there like postgresql-10-pglogical_2.2.0-3.xenial+1_amd64.deb , which has

$ dpkg-deb -I ~/Downloads/postgresql-10-pglogical_2.2.0-3.xenial+1_amd64.deb
...
 Depends: libc6 (>= 2.4), libpq5 (>= 9.1~), postgresql-10 (>= 10.5)
...

and was built against Pg 10.5.

Or rather, the 9.5 equivalent pkg.
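
For anyone who wants to check a downloaded .deb the same way, here is a minimal Python sketch that pulls the minimum versions out of a Depends line like the one above. The function name minimum_versions is made up for this example, and it only understands simple ">=" constraints (no "|" alternatives, no other relations), so treat it as an illustration rather than a real dpkg parser.

```python
import re


def minimum_versions(depends_line):
    """Map package name -> minimum version from a Debian 'Depends:' line.

    Simplified on purpose: only '>=' constraints are collected;
    alternatives ('|') and other relations are ignored.
    """
    # Drop the 'Depends' field name, keep everything after the first colon.
    _, _, body = depends_line.partition(":")
    result = {}
    for entry in body.split(","):
        match = re.match(r"\s*(\S+)\s*\(>=\s*([^)\s]+)\)", entry)
        if match:
            result[match.group(1)] = match.group(2)
    return result


# The Depends line quoted from the dpkg-deb output in this comment.
line = " Depends: libc6 (>= 2.4), libpq5 (>= 9.1~), postgresql-10 (>= 10.5)"
print(minimum_versions(line)["postgresql-10"])
```

If the postgresql-10 entry reports 10.5 or later, the package was produced after the packaging change discussed in this thread.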

@ChristophBerg Are you using PGDG postgres, or debian postgres? 2ndQ-packaged pglogical or Debian packaged pglogical?

(apologies if this should be obvious from context, struggling for time)

@raiviskrumins

I have installed latest pglogical:

root@hostname:~# dpkg -l postgresql-10
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                         Version             Architecture        Description
+++-============================-===================-===================-=============================================================
ii  postgresql-10                10.5-1.pgdg16.04+1  amd64               object-relational SQL database, version 10 server
root@hostname:~# dpkg -l postgresql-10-pglogical
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                         Version             Architecture        Description
+++-============================-===================-===================-=============================================================
ii  postgresql-10-pglogical      2.2.0-3.xenial+1    amd64               pglogical plugin for PostgreSQL 10

Still I'm getting these on standby:

2018-08-28 07:38:50.778 UTC [10443] [unknown]@database LOG:  starting apply for subscription subscription
2018-08-28 07:38:50.791 UTC [10443] [unknown]@database ERROR:  no data left in message
2018-08-28 07:38:50.791 UTC [10443] [unknown]@database LOG:  apply worker [10443] at slot 1 generation 16 exiting with error
2018-08-28 07:38:50.792 UTC [10179] LOG:  worker process: pglogical apply 16384:2875150205 (PID 10443) exited with exit code 1

and these on master:

2018-08-28 07:38:50.786 UTC,"pglogical","database",4212,"192.168.0.100:44908",5b84fc0a.1074,1,"idle",2018-08-28 07:38:50 UTC,12/0,0,LOG,00000,"starting logical decoding for slot ""pgl_database_provider_subscription""","streaming transactions committing after 6C5/C36401A8, reading WAL from 6C5/C3630C78",,,,,,,,"subscription"
2018-08-28 07:38:50.786 UTC,"pglogical","database",4212,"192.168.0.100:44908",5b84fc0a.1074,2,"idle",2018-08-28 07:38:50 UTC,12/0,0,LOG,00000,"logical decoding found initial starting point at 6C5/C3630C78","Waiting for transactions (approximately 1) older than 611497951 to end.",,,,,,,,"subscription"
2018-08-28 07:38:50.787 UTC,"pglogical","database",4212,"192.168.0.100:44908",5b84fc0a.1074,3,"idle",2018-08-28 07:38:50 UTC,12/0,0,LOG,00000,"logical decoding found consistent point at 6C5/C3631808","There are no running transactions.",,,,,,,,"subscription"
2018-08-28 07:38:50.792 UTC,"pglogical","database",4212,"192.168.0.100:44908",5b84fc0a.1074,4,"idle",2018-08-28 07:38:50 UTC,12/0,0,LOG,08006,"could not receive data from client: Connection reset by peer",,,,,,,,,"subscription"
2018-08-28 07:38:50.793 UTC,"pglogical","database",4212,"192.168.0.100:44908",5b84fc0a.1074,5,"idle",2018-08-28 07:38:50 UTC,12/0,0,LOG,08P01,"unexpected EOF on standby connection",,,,,,,,,"subscription"
2018-08-28 07:38:50.793 UTC,"pglogical","database",4212,"192.168.0.100:44908",5b84fc0a.1074,6,"idle",2018-08-28 07:38:50 UTC,,0,LOG,00000,"disconnection: session time: 0:00:00.011 user=pglogical database=database host=192.168.0.100 port=44908",,,,,,,,,"subscription"

@martinmarques
Contributor

martinmarques commented Aug 28, 2018 via email

@kkoppel

kkoppel commented Aug 28, 2018

I had the same errors when trying to set up replication from PostgreSQL v9.6.10 to v10.5. After installing the new pglogical packages from https://dl.2ndquadrant.com/default/release/browse/apt/pool/main/p/pglogical/ for both PostgreSQL versions and restarting both clusters, replication started to work again.

I'm using PostgreSQL packages from the PGDG repository on Debian Stretch and installed the "stretch" versions of the pglogical packages.

@greigwise

So, after updating the packages and restarting postgres on both sides, it worked for me also. pglogical is loaded via shared_preload_libraries, so you have to restart postgres after an update. lol

@raiviskrumins

@greigwise Yes, it worked for me as well. Thank you!

@ringerc
Contributor

ringerc commented Aug 31, 2018

The backstory here is that commit f49a80c48 on PostgreSQL master accidentally broke the binary compatibility of the layout of struct ReorderBufferTXN as part of fixing a couple of bugs. Since it was a bug fix, it was backported. The ABI change didn't get noticed, so the change landed in releases 10.5, 9.6.10, 9.5.14, 9.4.19 and 9.3.24, breaking the ABI for logical decoding output plugins.
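
To illustrate why a layout change like this bites a separately compiled plugin, here is a small Python ctypes sketch. The two structs and all their field names are hypothetical and are NOT the real ReorderBufferTXN definition; the point is only that a member inserted in the middle of a struct shifts everything after it, so code compiled against the old layout reads the wrong bytes.

```python
import ctypes


# Hypothetical layout a plugin was compiled against (not the real struct).
class TxnBefore(ctypes.Structure):
    _fields_ = [
        ("xid", ctypes.c_uint32),
        ("has_catalog_changes", ctypes.c_bool),
        ("first_lsn", ctypes.c_uint64),
    ]


# Hypothetical layout after a fix inserts a member in the middle.
class TxnAfter(ctypes.Structure):
    _fields_ = [
        ("xid", ctypes.c_uint32),
        ("has_catalog_changes", ctypes.c_bool),
        ("new_member", ctypes.c_uint64),  # inserted member shifts what follows
        ("first_lsn", ctypes.c_uint64),
    ]


# The "server" fills in the struct using the new layout.
server_txn = TxnAfter(xid=42, has_catalog_changes=True,
                      new_member=0xAAAA, first_lsn=0x1234)

# The "plugin" interprets the same memory using the old layout.
raw = bytes(server_txn)
plugin_view = TxnBefore.from_buffer_copy(raw[:ctypes.sizeof(TxnBefore)])

print("first_lsn offset before/after:",
      TxnBefore.first_lsn.offset, TxnAfter.first_lsn.offset)
print("plugin reads first_lsn as:", hex(plugin_view.first_lsn))
```

The plugin-side view still finds xid where it expects it, but its first_lsn now lands on the inserted member, which is the kind of silent misread that can surface later as a confusing error or a crash.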

There's discussion in the PostgreSQL infrastructure team about whether ABI checking is feasible to add to the buildfarm, and there will soon be some discussion on pgsql-hackers about how to avoid this in future too. PostgreSQL tries extremely hard to keep patch releases backward compatible and very safe to update to, so changes will be made to stop it happening again.

This means the issue also affects any other logical decoding output plugin, such as wal2json. It does not affect pgoutput or test_decoding, since they're built as part of PostgreSQL itself.

We addressed this for pglogical by updating the packaging to add a new dependency on the post-ABI-break minor release. So we ensure we only build against that release or later and we only install against that release or later releases. It forces people to update, but they should anyway, and it's a lot safer than runtime attempts to compensate for struct layout changes.
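
As a rough sketch of what that packaging rule amounts to, the Python below checks whether a server version is at or after the ABI break on its branch, and whether a plugin build and a running server are on the same side of it. The table and function names are made up for this example. The minor numbers are the releases shipped together with 10.5 (9.5.14 on the 9.5 branch), and only the branches listed in the table are handled.

```python
# First minor release per branch with the changed struct layout.
# Assumption for this sketch: the set of releases shipped alongside 10.5.
ABI_BREAK_MINOR = {"10": 5, "9.6": 10, "9.5": 14, "9.4": 19}


def split_version(version):
    """Split '10.5' into ('10', 5) and '9.6.10' into ('9.6', 10)."""
    parts = [int(p) for p in version.split(".")]
    if parts[0] >= 10:
        # From PostgreSQL 10 on, the branch is the first number alone.
        return str(parts[0]), parts[1]
    return "{}.{}".format(parts[0], parts[1]), parts[2]


def at_or_after_break(version):
    """True if this release already has the new layout.

    Raises KeyError for branches missing from ABI_BREAK_MINOR.
    """
    branch, minor = split_version(version)
    return minor >= ABI_BREAK_MINOR[branch]


def layouts_match(built_against, running_on):
    """True if plugin build and server agree on branch and struct layout."""
    same_branch = split_version(built_against)[0] == split_version(running_on)[0]
    return same_branch and (
        at_or_after_break(built_against) == at_or_after_break(running_on)
    )


print(layouts_match("10.4", "10.5"))
print(layouts_match("10.5", "10.5"))
```

This only models the one layout change discussed here. It says nothing about other incompatibilities, which is why a hard package dependency is the safer tool in practice.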

@ringerc ringerc closed this as completed Aug 31, 2018
@df7cb

df7cb commented Sep 2, 2018

@ringerc: PGDG packages, this is the apt.postgresql.org buildd.
@alvherre: tmp_check/log has this:

psql:t/basic.sql:21: ERROR:  42883: function pg_current_xlog_location() does not exist
LINE 1: SELECT pg_xlog_wait_remote_apply(pg_current_xlog_location(),...

(I'll leave the debugging to @mnencia, he knows the packaging of this package much better than I do.)

@ringerc
Contributor

ringerc commented Sep 2, 2018

@ChristophBerg That looks like it's not properly handling the renaming of pg_current_xlog_location to pg_current_wal_lsn in Pg10. Likely unrelated.

This issue was closed.