
jewel: rgw: multisite: sync status reports master is on a different period #13175

Merged
smithfarm merged 3 commits into ceph:jewel from wip-18684-jewel on Feb 1, 2017

Conversation

smithfarm (Contributor)

This ensures that we get the current period, in contrast to the admin log, which gets the master's earliest period.

Fixes: http://tracker.ceph.com/issues/18064
Signed-off-by: Abhishek Lekshmanan <abhishek@suse.com>
(cherry picked from commit 4ca18df)

This is needed for rgw admin's sync status; otherwise we always end up publishing that we're behind, since we are always checking against the master's first period to sync from.

Signed-off-by: Abhishek Lekshmanan <abhishek@suse.com>
(cherry picked from commit 063c949)

Also make the sync output look similar to the output of data sync.

Signed-off-by: Abhishek Lekshmanan <abhishek@suse.com>
(cherry picked from commit cc306c5)
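
For context, the warning these commits address appears in the metadata sync section of "radosgw-admin sync status" on a non-master zone. A minimal sketch of how to observe it (the period IDs below are placeholders, not real values):

# Run on a non-master zone. Before this fix, the status check compared against
# the master's first period rather than its current one, so the warning below
# showed up even when metadata was fully caught up:
radosgw-admin sync status
#   metadata sync syncing
#                 full sync: 0/64 shards
#                 master is on a different period: master_period=<master-id> local_period=<local-id>
#                 incremental sync: 64/64 shards
#                 metadata is caught up with master
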
@smithfarm smithfarm self-assigned this Jan 29, 2017
@smithfarm smithfarm added this to the jewel milestone Jan 29, 2017
@smithfarm smithfarm changed the title jewel: multisite: sync status reports master is on a different period jewel: rgw: multisite: sync status reports master is on a different period Jan 29, 2017
@smithfarm smithfarm merged commit f46c125 into ceph:jewel Feb 1, 2017
@smithfarm smithfarm deleted the wip-18684-jewel branch February 1, 2017 22:55
@smithfarm (Contributor, Author) commented:

(11:46:45 AM) smithfarm: owasserm: thanks. For jewel integration rgw, then, what it comes down to is verifying that these 6 valgrind failures are all libtcmalloc-related: http://pulpito.front.sepia.ceph.com/smithfarm-2017-01-31_12:35:14-rgw-wip-jewel-backports-distro-basic-smithi/
(11:46:58 AM) smithfarm: owasserm: I will do that now
(11:47:05 AM) owasserm: smithfarm, thanks
(11:47:33 AM) smithfarm: owasserm: and assuming they are tcmalloc related, you said I can directly merge all the rgw PRs? Or do you want me to ask you for review in the PRs first?
(11:47:53 AM) owasserm: smithfarm, yes you can merge them
(11:48:19 AM) smithfarm: ok, will merge and do at least one or two more rgw runs before passing 10.2.6 to QE

@dbiazus commented Mar 12, 2017:

After running 10.2.6, we are still reproducing the same issue when changing a secondary zone to the master zone.

Steps to reproduce:

On the secondary cluster:
radosgw-admin zone modify --rgw-zone={zone-name} --master --default
radosgw-admin period update --commit
systemctl restart ceph-radosgw@*

At this point the secondary cluster is working correctly as the master.

After the failed (old master) cluster comes back up:

radosgw-admin period pull --url={url-to-master-zone-gateway} --access-key={access-key} --secret={secret}
radosgw-admin zone modify --rgw-zone={zone-name} --master --default
radosgw-admin period update --commit
systemctl restart ceph-radosgw@*

Now we remove the master flag from the secondary cluster:

radosgw-admin zone modify --rgw-zone={zone-name} --master=false
radosgw-admin period update --commit
systemctl restart ceph-radosgw@*

Then, running "radosgw-admin sync status" on the secondary cluster, we get:

master is on a different period: master_period=0c31136b-f2bd-402a-a65d-ed03a1956683 local_period=abd599c5-36d4-492e-a2f5-2a10eb2b6a93
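
As a quick cross-check (a sketch only; compare the IDs by hand), the committed period on each cluster can be read directly:

# On the secondary cluster:
radosgw-admin period get-current
# On the restored (old master) cluster:
radosgw-admin period get-current
# If both clusters return the same "current_period", they have committed the
# same period and the warning above is a reporting problem; if they differ,
# the latest period commit has not reached one of them.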

Thanks!

@smithfarm (Contributor, Author) commented:

@dbiazus Are you saying that http://tracker.ceph.com/issues/18064 is still reproducible in Jewel 10.2.6?

@dbiazus commented Mar 12, 2017:

Yes, I'm able to reproduce the same behaviour in Jewel 10.2.6, even on a fresh install.

@smithfarm (Contributor, Author) commented:

@theanalyst Ping - see the comments above.

@dbiazus commented Apr 10, 2017:

I could also reproduce this in Kraken:
ceph version 11.2.0 (f223e27)


          realm 26184571-bb6f-4b7b-8c71-a0d3d7750090 (am)
      zonegroup c1eb737c-d9b0-4850-a951-e1f4f223ebec (us)
           zone 5f945856-f27a-4a33-80c9-0d7f226a4acc (ca-central-2)
  metadata sync syncing
                full sync: 0/64 shards
                master is on a different period: master_period=69ab3325-f996-4fcb-a10e-2ce501c5b4ee local_period=ec726bd1-ca84-4db4-9a09-f53dc52499cc
                metadata is caught up with master
                incremental sync: 64/64 shards
      data sync source: c066f644-1c72-4b1f-8a6a-f6aff1237c09 (ca-central-1)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is caught up with source

Best Regards

@theanalyst (Member) commented:

@dbiazus thanks, will check it out.

@cbodley (Contributor) commented Apr 10, 2017:

@dbiazus I'm guessing that your issues are related to http://tracker.ceph.com/issues/18639, which tracks some problems with metadata sync across master changes. If that's the case, then radosgw-admin sync status is correctly reporting that it's on an old period, and it's not due to this bug in radosgw-admin.

To verify, run the non-master gateway with --debug-rgw=20, wait a few minutes, then search for the last occurrence of "RGWMetaSyncCR on" in its log. If it says "RGWMetaSyncCR on period=X, next=Y", then you're experiencing the issues in http://tracker.ceph.com/issues/18639. If it says "RGWMetaSyncCR on current period=X", then you are actually reproducing this radosgw-admin bug in http://tracker.ceph.com/issues/18064.
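
A rough sketch of that check, assuming a systemd-managed gateway that logs to the default location under /var/log/ceph (adjust the instance name and log path to your deployment):

# 1. Enable verbose rgw logging on the non-master gateway, e.g. by adding
#    "debug rgw = 20" to its ceph.conf section, then restart it:
systemctl restart ceph-radosgw@*
# 2. Wait a few minutes, then look at the last RGWMetaSyncCR line in its log:
grep 'RGWMetaSyncCR on' /var/log/ceph/ceph-client.rgw.*.log | tail -n 1
#    "RGWMetaSyncCR on period=X, next=Y"  -> the metadata sync issues in tracker 18639
#    "RGWMetaSyncCR on current period=X"  -> the radosgw-admin bug in tracker 18064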

@dbiazus commented Apr 28, 2017:

Hey @cbodley, I'm a little confused here:

1. Running "radosgw-admin sync status" on the non-master zone, I got:
          realm c3278efc-56dc-4d1f-b8f2-0693400dddda (am)
      zonegroup 57ae1d4c-e653-4179-90bc-fdda7cde7baa (us)
           zone 97a1eea8-9c9e-4a91-a40d-cf0a7b659d06 (stage-ca-central-2)
  metadata sync syncing
                full sync: 0/64 shards
                master is on a different period: master_period=36420dbb-f33b-475a-99cb-3ce0f41cbda7 local_period=6fdf4373-9873-4da1-9403-10f4f0578461
                incremental sync: 64/64 shards
                metadata is caught up with master
      data sync source: 47cd13dc-629b-409d-9b20-746a529e273b (stage-ca-central-1)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is caught up with source
2. However, when I run "radosgw-admin period get-current", the current period is:
{
    "current_period": "36420dbb-f33b-475a-99cb-3ce0f41cbda7"
}
3. And the last occurrence of RGWMetaSyncCR tells me that the current period is actually "6fdf4373-9873-4da1-9403-10f4f0578461":
2017-04-28 16:56:45.748164 7f6e17fff700 20 cr:s=0x7f6e18039140:op=0x7f6e180387b0:26RGWReadSyncStatusCoroutine: operate()
2017-04-28 16:56:45.748167 7f6e17fff700 20 run: stack=0x7f6e18039140 is done
2017-04-28 16:56:45.748176 7f6e17fff700 20 rgw meta sync: run_sync(): sync
2017-04-28 16:56:45.748241 7f6e17fff700 20 cr:s=0x7f6e1800b090:op=0x7f6e180387b0:13RGWMetaSyncCR: operate()
2017-04-28 16:56:45.748259 7f6e17fff700 10 rgw meta sync: RGWMetaSyncCR on current period=6fdf4373-9873-4da1-9403-10f4f0578461

Any idea?

Thanks!
