
Slow initial data sync problems #325

Open
joshuabaird opened this issue Jul 6, 2021 · 7 comments

@joshuabaird

I am attempting to use pglogical on AWS RDS to migrate a 9.6 instance (pglogical 2.2.2) to 13.3 (pglogical 2.3.3). The provider database (tables, indexes, etc.) is around 2TB. We're seeing what we believe to be a very slow initial data sync between the two databases: in ~19 hours, only ~95GB has been synced. We don't see any obvious bottlenecks here.

I/O on the provider is pretty low:

[image: provider I/O graph]

Write I/O on the subscriber has been much higher, at a steady 40-50MB/sec. This is interesting, because 40-50MB/sec sustained over 19 hours works out to roughly 3TB written, so I would expect the subscriber to have much more than 95GB loaded by now. What am I missing?

[image: subscriber write I/O graph]

CPU usage on the provider is very low, and somewhat high on the subscriber, but it's not pegged:

[image: provider and subscriber CPU utilization graphs]

Both the subscriber and provider are m6g.large instances, which appear to be capable of MUCH more I/O.

Relevant Postgres (parameter group) settings:

max_wal_senders: 20
max_wal_size: 2048
max_replication_slots: 10
max_worker_processes: 8

Is this performance expected? Is there anything I can do to speed it up?
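
In case it helps anyone looking at the same thing, these are the kinds of queries that can show where the initial copy is at. The pglogical calls are per its 2.x documentation and the rest is core PostgreSQL; this is only a sketch, and the subscription name would of course be whatever yours is called:

-- Effective settings on either instance (pg_settings reports the unit, too)
SELECT name, setting, unit, source
FROM pg_settings
WHERE name IN ('max_wal_size', 'max_wal_senders',
               'max_replication_slots', 'max_worker_processes');

-- Subscriber: overall subscription state and per-table sync progress
SELECT * FROM pglogical.show_subscription_status();
SELECT * FROM pglogical.local_sync_status;

-- Either side: is the bulk COPY actually running, and is it waiting on anything?
SELECT pid, state, wait_event_type, wait_event, left(query, 80) AS query
FROM pg_stat_activity
WHERE query ILIKE 'copy %';

-- Provider: walsender state (column names differ slightly between 9.6 and 13)
SELECT * FROM pg_stat_replication;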

@luss

luss commented Jul 6, 2021 via email

@joshuabaird

Hi @luss! We met in NYC a few years ago at an AWS event when the RDS Postgres Customer Advisory Board was established.

Yes - we are rehearsing the upgrade now. This isn't being done in production. No, we haven't engaged RDS support yet, but we will.

Could you clarify the snapshot workflow that you're referring to? I think you're describing something like this:

  • Restore a snapshot of the provider (PG9) database to a new PG9 instance
  • Upgrade the new PG9 instance to PG13 (using the "click" method via the AWS console/API/etc)
  • Configure pglogical replication between the provider (PG9) and subscriber (now PG13) databases

At this point, are you saying that pglogical will sync any changes that have occurred since the snapshot was originally taken?
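
For concreteness, I assume the subscription in that workflow would be created with the initial copy disabled, something along these lines (synchronize_structure and synchronize_data are documented pglogical parameters; the DSN and names below are just placeholders):

SELECT pglogical.create_subscription(
    subscription_name := 'upgrade_sub',
    provider_dsn := 'host=provider-host dbname=mydb user=replication_user',
    synchronize_structure := false,   -- schema already present from the snapshot
    synchronize_data := false         -- data already present from the snapshot
);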

@luss

luss commented Jul 6, 2021 via email

@joshuabaird

In my testing, it seems like pglogical isn't aware of where it needs to begin replicating when you pre-seed data (using the snapshot method referenced above). It looks like pglogical is trying to replicate data from the provider that already exists on the subscriber (as a result of the seed):

2021-07-06 20:42:35 UTC:10.180.21.178(49174):postgres@mydb:[2834]:LOG:  statement: COPY "public"."core_test" ("id","name","parent_id","low_price","mid_price","high_price","order","in_style_survey","display_name") FROM stdin
2021-07-06 20:42:36 UTC:10.180.21.178(49174):postgres@mydb:[2834]:ERROR:  duplicate key value violates unique constraint "core_test_pkey"
2021-07-06 20:42:36 UTC:10.180.21.178(49174):postgres@mydb:[2834]:DETAIL:  Key (id)=(90) already exists.

Am I missing something? Is it actually possible to pre-seed the subscriber with data and then ask pglogical to pick up replication in the right place?

@sparkacus

@joshuabaird I have a similar issue with Google SQL - I am trying to migrate a 2TB database and the initial sync is incredibly slow. I too am seeing slow network throughput of around 2MB/s between two medium-spec Google SQL instances.

I know I haven't provided in-depth specs, etc., but I just wondered if you resolved the issue or took the alternative approach, i.e. pre-seeding?

@joshuabaird

joshuabaird commented Jan 11, 2022

I overcame some of my issues by taking a slightly different approach:

  • After importing the schema into the subscriber, drop all of the foreign key constraints. This lets you sync each table manually, which has a few benefits: it's typically faster, and it's easier/quicker to recover from unexpected errors because you don't have to resync every table if you run into problems
  • Tweak your PG config on provider+subscriber. Specifically, look into tuning max_wal_size, checkpoint_timeout and synchronous_commit. Excessive checkpoints are very expensive.
  • Create a new subscription with synchronize_data := 'false'
  • Manually resync each table: select * from pglogical.alter_subscription_resynchronize_table('subscription_name', 'table_name');

Once you have synced all tables individually, you can add the foreign keys back to the tables. You may also try dropping indexes on the subscriber and re-creating them once all tables are in sync.
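
Putting the steps above together, a rough SQL sketch (pglogical 2.x; subscription and table names are placeholders, and the first query only generates the DROP statements for review rather than running them):

-- Subscriber: generate DROP statements for every foreign key
-- (save the matching ADD CONSTRAINT definitions first so they can be replayed later)
SELECT format('ALTER TABLE %s DROP CONSTRAINT %I;',
              conrelid::regclass, conname)
FROM pg_constraint
WHERE contype = 'f';

-- With the subscription created using synchronize_data := 'false' (step 3 above),
-- resync one table at a time and watch its per-table state
SELECT pglogical.alter_subscription_resynchronize_table('upgrade_sub', 'public.core_test');
SELECT * FROM pglogical.local_sync_status;   -- sync_status 'r' should mean the table is done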

Also - make sure your instances are not I/O bound. We have found that some tables (for some unknown reason) are very I/O intensive and slow to sync, while others are fast.

Good luck!

@mmidzik

mmidzik commented Sep 27, 2022

Appreciate all the useful info here! I'm also hitting this issue where the initial sync is incredibly slow, and was hoping to pre-seed the database with a snapshot or direct import to start.

I found some useful posts on starting from a snapshot on RDS, but I'm unfortunately on Google SQL and haven't found an obvious way to get the LSN, and I'm wary of data loss from this approach. I was hoping that by starting with a snapshot I'd be able to simply restart replication and have it begin in the correct place, but alas, I ended up on this thread.

@sparkacus - were you able to successfully pre-seed from a snapshot on CloudSQL?

@joshuabaird if you don't mind sharing, what max_wal_size etc tuning proved most successful for you? I've been attempting to tune these parameters, but am still seeing very slow throughput + high memory utilization at initialization.
