
rgw: fix versioned bucket data sync fail when upload is busy #12357

Merged
merged 1 commit into from Jan 18, 2017

Conversation


@dongbula dongbula commented Dec 7, 2016

http://tracker.ceph.com/issues/18208

In a multisite cluster, if you upload a file to the master zone twice in a short time, the sync of the first version to the slave zone fails; this fixes it.

@idealguo

idealguo commented Dec 8, 2016

Please add a Signed-off-by line and a link to the tracker issue.

@cbodley cbodley self-assigned this Dec 13, 2016
@cbodley
Contributor

cbodley commented Dec 13, 2016

@dongbula thanks for taking this on. i added a test for it at #12474, and verified that your fix works as intended

i do still see an issue with this fix though. the squash_map is important here because these sync entries are processed in parallel. so if there's an OP_WRITE entry followed by OP_DEL on the same object, we can't guarantee that the OP_WRITE will sync before the OP_DEL - if the order gets switched, then this zone would end up with that object still existing, while it no longer exists in the source zone

so we use the squash_map to make sure that we only apply the most recent entry for any given object. but by ignoring the squash_map for versioned entries, we wouldn't know to squash a LINK_OLH entry followed by an UNLINK_INSTANCE on the same object instance, and we could end up in the same situation

i think the ideal solution would involve extending the squash_map key to include the object instance along with the object name, so we can track each version separately. something like this:

-  map<string, pair<real_time, RGWModifyOp> > squash_map;
+  map<pair<string, string>, pair<real_time, RGWModifyOp> > squash_map;

then below:

-        auto& squash_entry = squash_map[e.object];
+        auto& squash_entry = squash_map[make_pair(e.object, e.instance)];
         if (squash_entry.first == e.timestamp &&
-            e.op == CLS_RGW_OP_DEL) {
+            (e.op == CLS_RGW_OP_DEL || e.op == CLS_RGW_OP_UNLINK_INSTANCE)) {
           squash_entry.second = e.op;
         } else if (squash_entry.first < e.timestamp) {
           squash_entry = make_pair<>(e.timestamp, e.op);
         }

does that make sense?
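
for illustration, here's a minimal standalone sketch of that squash logic - BiLogEntry, Op, build_squash_map and the uint64_t timestamps are made-up stand-ins for rgw_bi_log_entry, RGWModifyOp and ceph::real_time, not the real types:

#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// stand-in for RGWModifyOp
enum class Op { Write, Del, LinkOlh, UnlinkInstance };

// stand-in for rgw_bi_log_entry
struct BiLogEntry {
  std::string object;
  std::string instance;   // empty for non-versioned buckets
  uint64_t timestamp;     // stand-in for ceph::real_time
  Op op;
};

// key on (object, instance) so each version is squashed separately; on a
// timestamp tie, let a DEL / UNLINK_INSTANCE override
std::map<std::pair<std::string, std::string>, std::pair<uint64_t, Op>>
build_squash_map(const std::vector<BiLogEntry>& log) {
  std::map<std::pair<std::string, std::string>, std::pair<uint64_t, Op>> squash_map;
  for (const auto& e : log) {
    auto& squash_entry = squash_map[std::make_pair(e.object, e.instance)];
    if (squash_entry.first == e.timestamp &&
        (e.op == Op::Del || e.op == Op::UnlinkInstance)) {
      squash_entry.second = e.op;
    } else if (squash_entry.first < e.timestamp) {
      squash_entry = std::make_pair(e.timestamp, e.op);
    }
  }
  return squash_map;
}

only the most recent op per (object, instance) survives, which is what keeps a LINK_OLH and a later UNLINK_INSTANCE on the same instance from being applied out of order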

@dongbula
Author

dongbula commented Dec 14, 2016

@cbodley Hi,
I understand the idea that

use the squash_map to make sure that we only apply the most recent entry for any given object

, but I still feel puzzled about this code:

       if (squash_entry.first == e.timestamp &&
              e.op == CLS_RGW_OP_DEL) {
            squash_entry.second = e.op;
         }

Does it mean that OP_WRITE and OP_DEL entries may have the same timestamp?

Considering that the instances of OP_WRITE entries for one object are usually different (just like write operations for different objects in the non-versioned case), I modified the code to make the squash_map only apply to non-versioned buckets. That was my original idea; maybe it's not enough.

@cbodley
Contributor

cbodley commented Dec 14, 2016

Does it mean that OP_WRITE and OP_DEL entries may have the same timestamp?

good point - these are high-resolution timestamps, so that shouldn't be possible (right @yehudasa?). it should also be safe to assume that the bilog entries are already ordered by timestamp, so we could probably replace that whole block with:

-        auto& squash_entry = squash_map[e.object];
+        auto& squash_entry = squash_map[make_pair(e.object, e.instance)];
-        if (squash_entry.first == e.timestamp &&
-            e.op == CLS_RGW_OP_DEL) {
-          squash_entry.second = e.op;
-        } else if (squash_entry.first < e.timestamp) {
-          squash_entry = make_pair<>(e.timestamp, e.op);
-        }
+        squash_entry = make_pair(e.timestamp, e.op);

I modified the code to make the squash_map only apply to non-versioned buckets. That was my original idea; maybe it's not enough.

we still need to squash entries for versioned buckets, or we'll run into ordering problems between LINK_OLH and UNLINK_INSTANCE. the squash_map is what prevents them from running in parallel

(http://tracker.ceph.com/issues/15542 has a bit more about the bug that was fixed by adding the squash_map)

@yehudasa
Member

@cbodley iirc for OP_DEL we're storing the removed object's mtime, not when it was removed

@cbodley
Contributor

cbodley commented Dec 14, 2016

@yehudasa maybe you're thinking about the obj_tombstone_cache in RGWRados? these are the timestamps in rgw_bi_log_entry that we're using for the squash_map to resolve the create/delete races in bucket sync that @theanalyst had found. we're discussing whether this check is necessary:

         if (squash_entry.first == e.timestamp &&
              e.op == CLS_RGW_OP_DEL) {
            squash_entry.second = e.op;
         }

the timestamps were converted from utime_t to ceph::real_time, so it's possible that sync is reading old entries that were encoded with 1-second resolution - so the timestamps could compare equal. but my point is that the bilog is ordered, so i think that the squash_map should always take the most recent entry
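
as a rough illustration of that 1-second case (just an assumption-level sketch, not ceph code): if an old entry only kept whole seconds, a write and a delete of the same object within one second decode to equal timestamps, so "newest wins" alone can't order them:

#include <chrono>
#include <iostream>

int main() {
  using namespace std::chrono;
  // two ops 300ms apart, as they would be stamped with full resolution
  auto write_ts = system_clock::time_point{seconds{1481700000} + milliseconds{100}};
  auto del_ts   = system_clock::time_point{seconds{1481700000} + milliseconds{400}};

  // entries encoded at 1-second resolution only keep whole seconds,
  // so both truncate to the same value
  auto old_write = time_point_cast<seconds>(write_ts);
  auto old_del   = time_point_cast<seconds>(del_ts);

  std::cout << (old_write == old_del) << '\n';   // prints 1: equal after truncation,
                                                 // hence the explicit tiebreak on op
}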

@cbodley
Contributor

cbodley commented Dec 14, 2016

@dongbula i updated the test in #12474 to exercise this squash logic. you can run it from your cmake build directory with:
$ nosetests --nocapture ../src/test/rgw/test_multi.py:test_versioned_object_incremental_sync

@yehudasa
Member

@cbodley no, I'm really talking about the bucket index log (also verified it now). The timestamp entries that we keep for the delete operations are the object's mtime. Note that although the bilog is ordered, entries can arrive out of order, as you might have multiple rgws feeding into it, so we need to check timestamps.

@yehudasa
Member

@cbodley ahrm.. ah, reading the cls/rgw code again, committed operations will be written to the bi log in order. That being said, I think we should still check the timestamp; bugs may happen.

@cbodley
Contributor

cbodley commented Dec 14, 2016

@yehudasa okay, thanks! so it looks like the first diff i posted, that also checks for OP_UNLINK_INSTANCE in that comparison, would be the way to go

@dongbula
Author

dongbula commented Dec 15, 2016

@cbodley @yehudasa thanks
Well, squashing entries seems to be necessary whether the bucket is versioned or not. So the comparison for OP_UNLINK_INSTANCE/OP_DEL and LINK_OLH/OP_ADD entries with the same timestamp is not necessary, is it?

I still have a question about:

the timestamps were converted to from utime_t to ceph::real_time, so it's possible that sync is reading old entries that were encoded with 1-second resolution - so the timestamps could compare equal.

When would an entry be encoded with 1-second resolution?

I modified the code to squash all entries, and replaced CLS_RGW_OP_DEL with CLS_RGW_OP_LINK_OLH so that versioned uploads sync normally. I think that is necessary.

@cbodley
Contributor

cbodley commented Dec 15, 2016

When would an entry be encoded with 1-second resolution?

the type of this timestamp was changed for jewel i think? so any entries that were written before that upgrade would be decoded in seconds

I modified the code to squash all entries, and replaced CLS_RGW_OP_DEL with CLS_RGW_OP_LINK_OLH so that versioned uploads sync normally. I think that is necessary.

looks good, except i'm not sure why you took the OP_DEL part out of that comparison?

@dongbula
Author

@cbodley ah, I see, the comparison for OP_DEL is there for compatibility with old entries. I will add that.

@cbodley
Contributor

cbodley commented Dec 15, 2016

some extra notes from testing..

uploading an object to a versioned bucket results in two bilog entries with the same timestamp, OP_ADD and OP_LINK_OLH. so in this case, the squash_map needs to overwrite the OP_ADD with OP_LINK_OLH

deleting an object instance results in OP_UNLINK_INSTANCE and OP_DEL, both with a matching timestamp (but different from the ADD/LINK_OLH). so the existing logic for if (squash_entry.first == e.timestamp && e.op == CLS_RGW_OP_DEL) covers this case

in both cases, when timestamps match, we want to apply the latest bilog entry. even in non-versioned buckets with low-resolution timestamps, if there are entries for OP_ADD, OP_DEL, OP_ADD, we want to apply the last OP_ADD. i think that something like this would work best:

-        auto& squash_entry = squash_map[e.object];
+        auto& squash_entry = squash_map[make_pair(e.object, e.instance)];
-        if (squash_entry.first == e.timestamp &&
-            e.op == CLS_RGW_OP_DEL) {
-          squash_entry.second = e.op;
-        } else if (squash_entry.first < e.timestamp) {
+        if (squash_entry.first <= e.timestamp) {
           squash_entry = make_pair<>(e.timestamp, e.op);
         }
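
for illustration, a tiny standalone sketch of how that <= rule plays out for the versioned-upload case above, assuming (as discussed) that committed bilog entries are read in order - Entry and Op are simplified stand-ins, not the real rgw types:

#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

enum class Op { Add, Del, LinkOlh, UnlinkInstance };   // stand-in for RGWModifyOp

struct Entry { std::string object, instance; uint64_t ts; Op op; };

int main() {
  // a versioned upload writes OP_ADD and OP_LINK_OLH with the same timestamp
  std::vector<Entry> log = {
    {"obj", "v1", 100, Op::Add},
    {"obj", "v1", 100, Op::LinkOlh},   // same timestamp as the Add
  };

  std::map<std::pair<std::string, std::string>, std::pair<uint64_t, Op>> squash_map;
  for (const auto& e : log) {
    auto& squash_entry = squash_map[std::make_pair(e.object, e.instance)];
    if (squash_entry.first <= e.ts) {  // the later log entry wins on a tie
      squash_entry = std::make_pair(e.ts, e.op);
    }
  }

  // the later LINK_OLH overwrote the ADD, as intended
  assert(squash_map.at({"obj", "v1"}).second == Op::LinkOlh);
}

with the previous ==/OP_DEL-only check, the LINK_OLH at the same timestamp would not have overwritten the ADD, which is the versioned-upload case called out above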

@dongbula
Author

dongbula commented Dec 16, 2016

good, that's concise. one more question:

Note that although bilog is ordered, entries can arrive out of order as you might have multiple rgws feeding into it, so need to check timestamps.

Is it safe to assume that OP_DEL comes after OP_ADD in the log when OP_ADD and OP_DEL have the same timestamp encoded with 1-second resolution?

@cbodley
Contributor

cbodley commented Dec 16, 2016

Is it safe to assume that OP_DEL comes after OP_ADD in the log when OP_ADD and OP_DEL have the same timestamp encoded with 1-second resolution?

if the DEL entry comes later in the log, yes

@cbodley
Contributor

cbodley commented Dec 19, 2016

looks good to me 👍 it's passing test_multi.py, including the new test_versioned_object_incremental_sync from #12474

could you please squash this into a single commit so we can merge?

@dongbula
Author

ok, I have squashed the commits

@cbodley
Contributor

cbodley commented Dec 20, 2016

@dongbula thank you. could you add the line Fixes: http://tracker.ceph.com/issues/18208 please?

Fixes: http://tracker.ceph.com/issues/18208

Signed-off-by: lvshuhua <lvshuhua@cmss.chinamobile.com>
@dongbula
Author

okay, I have added it

@cbodley cbodley merged commit 37ff492 into ceph:master Jan 18, 2017