rgw: fix versioned bucket data sync fail when upload is busy #12357
Conversation
Signed-off-by and tracker
@dongbula thanks for taking this on. i added a test for it at #12474, and verified that your fix works as intended. i do still see an issue with this fix, though: the squash_map is keyed on the object name alone, so we use the same squash entry for every instance of a versioned object. i think the ideal solution would involve extending the squash_map key to include the instance:

```diff
- map<string, pair<real_time, RGWModifyOp> > squash_map;
+ map<pair<string, string>, pair<real_time, RGWModifyOp> > squash_map;
```

then below:

```diff
- auto& squash_entry = squash_map[e.object];
+ auto& squash_entry = squash_map[make_pair(e.object, e.instance)];
  if (squash_entry.first == e.timestamp &&
-     e.op == CLS_RGW_OP_DEL) {
+     (e.op == CLS_RGW_OP_DEL || e.op == CLS_RGW_OP_UNLINK_INSTANCE)) {
    squash_entry.second = e.op;
  } else if (squash_entry.first < e.timestamp) {
    squash_entry = make_pair<>(e.timestamp, e.op);
  }
```

does that make sense?
@cbodley Hi, thanks for the suggestion, but I still feel puzzled about this code:

```cpp
if (squash_entry.first == e.timestamp &&
    e.op == CLS_RGW_OP_DEL) {
```

Does it mean that two entries for the same object can carry the same timestamp? Considering that the instances of a versioned object each produce their own bilog entries, could their timestamps collide?
good point - these are high-resolution timestamps, so that shouldn't be possible (right @yehudasa?). it should also be safe to assume that the bilog entries are already ordered by timestamp, so we could probably replace that whole block with:

```diff
- auto& squash_entry = squash_map[e.object];
+ auto& squash_entry = squash_map[make_pair(e.object, e.instance)];
- if (squash_entry.first == e.timestamp &&
-     e.op == CLS_RGW_OP_DEL) {
-   squash_entry.second = e.op;
- } else if (squash_entry.first < e.timestamp) {
-   squash_entry = make_pair<>(e.timestamp, e.op);
- }
+ squash_entry = make_pair(e.timestamp, e.op);
```

we still need to squash entries for versioned buckets, or we'll run into ordering problems between the entries for the same object (http://tracker.ceph.com/issues/15542 has a bit more about the bug that was fixed by adding the squash_map).
@cbodley iirc for OP_DEL we're storing the removed object's mtime, not when it was removed
@yehudasa maybe you're thinking about the data log? this squashing is over the bucket index log:

```cpp
if (squash_entry.first == e.timestamp &&
    e.op == CLS_RGW_OP_DEL) {
  squash_entry.second = e.op;
}
```

the timestamps were converted to a high-resolution type from one with 1-second resolution, so new entries shouldn't compare equal.
@cbodley no, I'm really talking about the bucket index log (also verified it now). The timestamps that we keep for the delete operations are the object's mtime. Note that although the bilog is ordered, entries can arrive out of order, as you might have multiple rgws feeding into it, so we need to check timestamps.
@cbodley ahrm.. ah, reading the cls/rgw code again, committed operations will be written to the bilog in order. That being said, I think we should still check timestamps, since bugs may happen.
@yehudasa okay, thanks! so it looks like the first diff i posted, the one that also checks for CLS_RGW_OP_UNLINK_INSTANCE, is still the way to go
@cbodley @yehudasa thanks. I still have a question: when will an entry be encoded with 1-second resolution? I modified the code to squash all entries, and dropped the e.op == CLS_RGW_OP_DEL comparison.
the type of this timestamp was changed for jewel i think? so any entries that were written before that upgrade would be decoded in seconds

looks good, except i'm not sure why you took the e.op == CLS_RGW_OP_DEL comparison out
@cbodley ah, i see, the comparison for OP_DEL is for compatibility with old entries. I will add it back.
some extra notes from testing.. uploading an object to a versioned bucket results in two bilog entries with the same timestamp, and deleting an object instance results in entries that can share a timestamp as well. in both cases, when timestamps match, we want to apply the latest bilog entry. even in non-versioned buckets with low-resolution timestamps, if there are entries for the same object within the same second we should keep the last one, so:

```diff
- auto& squash_entry = squash_map[e.object];
+ auto& squash_entry = squash_map[make_pair(e.object, e.instance)];
- if (squash_entry.first == e.timestamp &&
-     e.op == CLS_RGW_OP_DEL) {
-   squash_entry.second = e.op;
- } else if (squash_entry.first < e.timestamp) {
+ if (squash_entry.first <= e.timestamp) {
    squash_entry = make_pair<>(e.timestamp, e.op);
  }
```
good, that's concise. one more question:
Is it safe to assume that the OP_DEL entry comes after the OP_ADD entry when both share the same timestamp encoded with 1-second resolution?
if the DEL entry comes later in the log, yes
looks good to me 👍 it's passing test_multi.py, including the new test. could you please squash this into a single commit so we can merge?
ok, I have squashed the commits
@dongbula thank you. could you add these lines to the commit message?

Fixes: http://tracker.ceph.com/issues/18208
Signed-off-by: lvshuhua <lvshuhua@cmss.chinamobile.com>
okay, I have added it
http://tracker.ceph.com/issues/18208

In a multisite cluster, if you upload a file to the master zone twice within a short time, the sync of the first version to the slave zone fails; this fixes it.