Slow initial data sync problems #325
A 2 TB migration should be rehearsed before it is executed in production so you know how long it is going to take. It's hard to guess at what the RDS bottlenecks are. Have you tried asking AWS for production support?
Have you considered taking a snapshot first and bulk loading it in parallel? Then you can catch up the PG13 snapshot with the incremental transactions that have occurred while the initialization was taking place. In my experience, planning and rehearsal are the way to avoid the dilemma you are in. Right now it looks like you aren't going to know for sure for several weeks whether your first "trial" will work (it likely will not).
--Luss
(Former Product Manager for Postgres at AWS)
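For timing the rehearsal, it helps to watch the copy from both sides. A sketch of the usual status queries (column sets vary a bit across pglogical and Postgres versions):

```sql
-- On the subscriber: subscription state
-- ("initializing" during the base copy, "replicating" once caught up)
SELECT subscription_name, status
FROM pglogical.show_subscription_status();

-- On the 9.6 provider: walsender progress
-- (on PG10 and later the columns are named sent_lsn / replay_lsn instead)
SELECT application_name, state, sent_location, replay_location
FROM pg_stat_replication;
```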
On Tue, Jul 6, 2021 at 9:29 AM Josh Baird ***@***.***> wrote:
I am attempting to use pglogical on AWS RDS to migrate a 9.6 instance
(pglogical 2.2.2) to 13.3 (pglogical 2.3.3). The provider database size
(tables, indexes, etc) is around 2TB. We're seeing what we believe to be a
very slow initial data sync between the two databases. In ~19 hours, only
~95GB has been synced between the two databases. We don't see any obvious
bottlenecks here.
I/O on the provider is pretty low:
![Provider I/O](https://user-images.githubusercontent.com/6271918/124607416-d3a25500-de3b-11eb-928d-8a30894baf14.png)
Write I/O on the subscriber has been much higher at a steady 40-50MB/sec.
This is interesting, because with a sustained write I/O of 40-50MB/sec, I
would expect that the subscriber would have much more than 95GB loaded in
19 hours. What am I missing?
![Subscriber write I/O](https://user-images.githubusercontent.com/6271918/124607564-f92f5e80-de3b-11eb-8898-f38855143d87.png)
CPU usage on the provider is very low, and somewhat high on the
subscriber, but it's not pegged:
![CPU usage](https://user-images.githubusercontent.com/6271918/124607669-1106e280-de3c-11eb-9f5d-d2f46eae8cd9.png)
The instance types for both the subscriber and provider are m6g.large
which appear to be capable of MUCH more I/O.
Relevant Postgres (parameter group) settings:
```
max_wal_senders: 20
max_wal_size: 2048
max_replication_slots: 10
max_worker_processes: 8
```
Is this performance expected? Is there anything I can do to speed it up?
Hi @luss! We met in NYC a few years ago at an AWS event when the RDS Postgres Customer Advisory Board was established. Yes - we are rehearsing the upgrade now. This isn't being done in production. No, we haven't engaged RDS support yet, but we will. Could you clarify the snapshot workflow that you're referring to? I think you're describing something like this:

- Restore a snapshot of the provider (PG9) database to a new PG9 instance
- Upgrade the new PG9 instance to PG13 (using the "click" method via the AWS console/API/etc)
- Configure pglogical replication between the provider (PG9) and subscriber (now PG13) databases

At this point, are you saying that pglogical will sync any changes that have occurred since the snapshot was originally taken?
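If this restore-then-catch-up flow is the intent, the relevant knob is `create_subscription`'s `synchronize_data` argument: with the subscriber pre-seeded from the snapshot, the initial copy is skipped and only the change stream is applied. A sketch, run on the subscriber (DSN and names are placeholders; note that the stream starts from when the replication slot is created, not from the snapshot's point in time):

```sql
-- On the pre-seeded PG13 subscriber: skip schema sync and the initial
-- COPY, stream changes only. DSN and names below are placeholders.
SELECT pglogical.create_subscription(
    subscription_name     := 'upgrade_sub',
    provider_dsn          := 'host=provider.example dbname=app user=repl',
    synchronize_structure := false,
    synchronize_data      := false
);
```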
That's funny... I knew your name was familiar, but I wasn't sure from where.
On Tue, Jul 6, 2021 at 11:29 AM Josh Baird ***@***.***> wrote:
Hi @luss <https://github.com/luss>! We met in NYC a few years ago at an
AWS event when the RDS Postgres Customer Advisory Board was established.
Yes - we are rehearsing the upgrade now. This isn't being done in
production. No, we haven't engaged RDS support yet, but we will.
Could you clarify the snapshot-workflow that you're referring to? I think
you're describing something like this:
- Restore a snapshot of the provider (PG9) database to a new PG9
instance
- Upgrade the new PG9 instance to PG13 (using the "click" method via
the AWS console/API/etc)
- Configure pglogical replication between the provider (PG9) and
subscriber (now PG13) databases
At this point, are you saying that pglogical will sync any changes that
have occurred since the snapshot was originally taken?
In my testing, it seems like pglogical isn't aware of where it needs to begin replicating when you pre-seed data (using the snapshot method referenced above). It looks like pglogical is trying to replicate data from the provider that already exists on the subscriber (as a result of the seed):
Am I missing something? Is it actually possible to pre-seed the subscriber with data and then ask pglogical to pick up replication in the right place?
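One technique used with core logical replication, which I have not verified against pglogical on RDS, is to advance the subscription's replication origin to the LSN the snapshot corresponds to, so the apply worker skips WAL the seed already contains. A hypothetical sketch (the LSN must come from the snapshot process, shown only as a placeholder, and `pg_replication_origin_advance` may require privileges RDS does not grant):

```sql
-- Run on the subscriber with the subscription disabled.
-- 1) Find the origin name pglogical created for the subscription:
SELECT roname FROM pg_replication_origin;

-- 2) Advance it to the provider LSN matching the snapshot
--    (placeholder origin name and LSN):
SELECT pg_replication_origin_advance('<origin_name_from_step_1>',
                                     '0/1234ABCD');
```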
@joshuabaird I have a similar issue with Google SQL - I am trying to migrate a 2TB database and the initial sync is incredibly slow. I too am seeing slow network throughput of around 2MB/s between two medium-spec Google SQL instances. I know I haven't provided in-depth specs, but I just wondered if you resolved the issue or took the alternative approach, i.e. pre-seed?
I overcame some of my issues by taking a slightly different approach:
Once you have synced all tables individually, you can add the foreign keys back to the tables. You may also try dropping indexes on the subscriber and re-creating them once all tables are in sync. Also, make sure your instances are not I/O bound. We have found that some tables (for some unknown reason) are very I/O intensive and slow, while others are fast. Good luck!
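Dropping constraints and indexes on the subscriber before the copy and restoring them afterwards can be sketched like this (table and index names are made up; `CREATE INDEX CONCURRENTLY` avoids blocking the apply worker while it rebuilds):

```sql
-- Before the initial sync, on the subscriber (hypothetical names):
ALTER TABLE orders DROP CONSTRAINT orders_customer_id_fkey;
DROP INDEX orders_created_at_idx;

-- After all tables report caught up, restore them:
ALTER TABLE orders
    ADD CONSTRAINT orders_customer_id_fkey
    FOREIGN KEY (customer_id) REFERENCES customers (id);
CREATE INDEX CONCURRENTLY orders_created_at_idx ON orders (created_at);
```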
Appreciate all the useful info here! I'm also hitting this issue where initial sync is incredibly slow and was hoping to pre-seed the database with a snapshot or direct import to start. I found some useful posts on starting from a snapshot on RDS, but I'm unfortunately on Google SQL and haven't found an obvious way to get the LSN / am wary of data loss from this approach. I was hoping that starting with a snapshot I'd be able to simply restart replication and have it begin in the correct place, but alas I ended up on this thread. @sparkacus - were you able to successfully pre-seed from a snapshot on CloudSQL? @joshuabaird if you don't mind sharing, what