
osd: add a pg _fastinfo attribute to reduce per-io metadata updates #11213

Merged
8 commits merged into ceph:master from wip-osd-fastinfo on Oct 19, 2016

Conversation

@liewegas (Member) commented Sep 23, 2016

For most IO operations we only update a handful of fields in the pg_info_t structure. However, the full struct, when encoded, is on the order of 800 bytes.

This adds a new attribute, _fastinfo, which contains only the most commonly updated fields. When present, the fastinfo fields should be overlaid on top of the full info struct contained in the existing info attr. If a field outside of the "fast" set is updated, we clear the fastinfo attribute and update the full info attr.
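
For illustration, a minimal sketch of the overlay idea; the structs and field set here are simplified stand-ins, not the exact contents of pg_info_t or pg_fast_info_t:

```cpp
#include <cstdint>

// Simplified stand-ins; the real structs carry many more fields.
struct eversion_t { uint64_t epoch = 0, version = 0; };

// Full per-PG metadata (~800 bytes encoded); only a few fields shown.
struct pg_info_t {
  eversion_t last_update;
  eversion_t last_complete;
  uint64_t   last_user_version = 0;
  // ... many rarely changing fields: history, scrub stamps, full stats, ...
};

// The small subset that changes on nearly every client write.
struct pg_fast_info_t {
  eversion_t last_update;
  eversion_t last_complete;
  uint64_t   last_user_version = 0;

  // Capture the fast fields from the current full info.
  void populate_from(const pg_info_t& info) {
    last_update       = info.last_update;
    last_complete     = info.last_complete;
    last_user_version = info.last_user_version;
  }

  // Overlay the fast fields on top of an info read from the 'info' attr.
  void apply_to(pg_info_t* info) const {
    info->last_update       = last_update;
    info->last_complete     = last_complete;
    info->last_user_version = last_user_version;
  }
};
```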

This takes the approach of speculatively trying to do a fast diff at the lowest level helper function and then comparing whether the result is correct. That comparison is probably the slow part. A much harder and more fragile approach would be to make all the call sites that update pg info (and set dirty_info = true) indicate whether they are updating a fast or non-fast field. I'm not sure that's really practical, so I think this is the way to go (at least for now).
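
Roughly, the speculative check might look like this (a sketch using the simplified structs above, plus an assumed field-wise operator== on pg_info_t; the real check lives in the info-writing helper):

```cpp
bool operator==(const pg_info_t& a, const pg_info_t& b);  // assumed: full field-wise compare

// Returns true (and fills *fast_out) if the current change can be expressed
// as a fast-info overlay on top of what was last written to disk.
bool fast_info_covers_update(const pg_info_t& info,
                             const pg_info_t& last_written_info,
                             pg_fast_info_t* fast_out)
{
  pg_fast_info_t fast;
  fast.populate_from(info);

  // Overlay the fast fields onto a copy of the last persisted info and see
  // whether that reproduces the current info exactly.  This comparison is
  // the price paid for not annotating every info-updating call site.
  pg_info_t reconstructed = last_written_info;
  fast.apply_to(&reconstructed);

  if (reconstructed == info) {
    *fast_out = fast;
    return true;
  }
  return false;
}
```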

pg_fast_info_t encodes to 178 bytes, vs 847 for pg_info_t. That could probably be improved given the high correlation between the last_* fields.
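
For example (purely illustrative, not part of this change), the correlated versions could be delta-encoded so that the common last_complete == last_update case costs only a couple of bytes:

```cpp
#include <cstdint>
#include <string>

// Minimal LEB128-style varint encoder, for illustration only.
void put_varint(std::string* out, uint64_t v) {
  do {
    uint8_t byte = v & 0x7f;
    v >>= 7;
    if (v)
      byte |= 0x80;
    out->push_back(static_cast<char>(byte));
  } while (v);
}

// last_complete trails last_update only while the PG has missing objects,
// so encoding it as a (usually zero) delta is much smaller than encoding
// two full eversions.  Uses the pg_fast_info_t sketched earlier.
void encode_versions(const pg_fast_info_t& f, std::string* out) {
  put_varint(out, f.last_update.epoch);
  put_varint(out, f.last_update.version);
  put_varint(out, f.last_update.epoch - f.last_complete.epoch);
  put_varint(out, f.last_update.version - f.last_complete.version);
}
```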

@liewegas (Member Author):

@markhpc @athanatos

@athanatos (Contributor):

@liewegas Hmm. If you are right that the check is sufficiently cheap, then this looks pretty reasonable. However, I'm not entirely convinced that explicitly adding a dirty_small_info (or better name!) would really be that bad. It seems to me that all client writes update the same set of fields (see PG::append_log). Those'll be the ones we want to include in the smaller structure. So, it seems like PG::append_log would just set dirty_small_info and for all other users, dirty_info would imply dirty_small_info (don't care about efficiency in those cases).
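
Something like this, perhaps (a sketch of that alternative, not what this PR implements; the flag and method names are made up):

```cpp
// Explicit dirty flags instead of the speculative diff-and-compare.
// Uses the simplified pg_info_t sketched above.
struct PGLike {
  pg_info_t info;
  bool dirty_info = false;       // something outside the fast set changed
  bool dirty_fast_info = false;  // only the commonly written fields changed

  void append_log(/* log entries, delta stats, ... */) {
    // Client writes all touch the same small set of fields.
    dirty_fast_info = true;
  }

  void on_other_info_update() {
    // Everyone else takes the slow path; efficiency doesn't matter there.
    dirty_info = true;
  }

  void prepare_write_info(/* transaction */) {
    if (dirty_info) {
      // rewrite the full _info attr and clear _fastinfo
    } else if (dirty_fast_info) {
      // write only the small _fastinfo attr
    }
    dirty_info = dirty_fast_info = false;
  }
};
```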

@athanatos (Contributor):

@liewegas Conveniently, neither ECMsgTypes.h:ECSubWrite nor MOSDRepOp includes the whole pg_info structure, so you can just read out of those what we are updating!

@liewegas (Member Author):

That'll need some funky conditional in append_log to decide if the delta_stats can be covered by fast info or not. And we'd need to be really careful that the append_log callers aren't doing other info updates and relying on the dirty_info = true in append_log to cover them. Hmm.

@liewegas (Member Author):

_prepare_write_info goes from .02% to .10% in my perf profile. I think that's okay.

@liewegas changed the title from "RFC: osd: add a pg _fastinfo attribute to reduce per-io metadata updates" to "osd: add a pg _fastinfo attribute to reduce per-io metadata updates" on Sep 26, 2016
@markhpc (Member) commented Sep 28, 2016

No real additional comments, other than that either solution seems like a step up from just blindly varint-encoding everything. Given Sage's perf results, this seems pretty reasonable.

@shun-s (Contributor) commented Sep 28, 2016

@liewegas
Hi, could you tell me more about how you figured out that "_prepare_write_info goes from .02% to .10% in perf profile"? I want to learn how to evaluate the performance improvement. Many thanks.

@liewegas (Member Author):

I did

sudo perf record --call-graph dwarf -F99 -p $(pidof -s ceph-osd) sleep 60

and then looked at the result with

sudo perf report


@liewegas modified the milestone (kraken) on Oct 4, 2016
@somnathr:

@liewegas My log is flooding with "fastinfo failed" errors; here is one snippet:

2016-10-04 17:35:50.221450 7fb52ab9c700 -1 _prepare_write_info fastinfo failed, info:
{
"pgid": "1.2b0",
"last_update": "18'41",
"last_complete": "18'41",
"log_tail": "0'0",
"last_user_version": 41,
"last_backfill": "MAX",
"last_backfill_bitwise": 0,
"purged_snaps": "[]",
"history": {
"epoch_created": 15,
"last_epoch_started": 16,
"last_epoch_clean": 16,
"last_epoch_split": 0,
"last_epoch_marked_full": 0,
"same_up_since": 15,
"same_interval_since": 15,
"same_primary_since": 15,
"last_scrub": "0'0",
"last_scrub_stamp": "2016-10-04 17:34:54.968066",
"last_deep_scrub": "0'0",
"last_deep_scrub_stamp": "2016-10-04 17:34:54.968066",
"last_clean_scrub_stamp": "2016-10-04 17:34:54.968066"
},
"stats": {
"version": "18'40",
"reported_seq": "15",
"reported_epoch": "18",
"state": "active+clean",
"last_fresh": "2016-10-04 17:35:42.281977",
"last_change": "2016-10-04 17:34:56.004604",
"last_active": "2016-10-04 17:35:42.281977",
"last_peered": "2016-10-04 17:35:42.281977",
"last_clean": "2016-10-04 17:35:42.281977",
"last_became_active": "2016-10-04 17:34:56.002579",
"last_became_peered": "2016-10-04 17:34:56.002579",
"last_unstale": "2016-10-04 17:35:42.281977",
"last_undegraded": "2016-10-04 17:35:42.281977",
"last_fullsized": "2016-10-04 17:35:42.281977",
"mapping_epoch": 15,
"log_start": "0'0",
"ondisk_log_start": "0'0",
"created": 15,
"last_epoch_clean": 16,
"parent": "0.0",
"parent_split_bits": 0,
"last_scrub": "0'0",
"last_scrub_stamp": "2016-10-04 17:34:54.968066",
"last_deep_scrub": "0'0",
"last_deep_scrub_stamp": "2016-10-04 17:34:54.968066",
"last_clean_scrub_stamp": "2016-10-04 17:34:54.968066",
"log_size": 40,
"ondisk_log_size": 40,
"stats_invalid": false,
"dirty_stats_invalid": false,
"omap_stats_invalid": false,
"hitset_stats_invalid": false,
"hitset_bytes_stats_invalid": false,
"pin_stats_invalid": false,
"stat_sum": {
"num_bytes": 42991616,
"num_objects": 11,
"num_object_clones": 0,
"num_object_copies": 20,
"num_objects_missing_on_primary": 0,
"num_objects_missing": 0,
"num_objects_degraded": 0,
"num_objects_misplaced": 0,
"num_objects_unfound": 0,
"num_objects_dirty": 11,
"num_whiteouts": 0,
"num_read": 0,
"num_read_kb": 0,
"num_write": 82,
"num_write_kb": 41984,
"num_scrub_errors": 0,
"num_shallow_scrub_errors": 0,
"num_deep_scrub_errors": 0,
"num_objects_recovered": 0,
"num_bytes_recovered": 0,
"num_keys_recovered": 0,
"num_objects_omap": 0,
"num_objects_hit_set_archive": 0,
"num_bytes_hit_set_archive": 0,
"num_flush": 0,
"num_flush_kb": 0,
"num_evict": 0,
"num_evict_kb": 0,
"num_promote": 0,
"num_flush_mode_high": 0,
"num_flush_mode_low": 0,
"num_evict_mode_some": 0,
"num_evict_mode_full": 0,
"num_objects_pinned": 0
},
"up": [
11,
0
],
"acting": [
11,
0
],
"blocked_by": [],
"up_primary": 11,
"acting_primary": 11
},
"empty": 0,
"dne": 0,
"incomplete": 0,
"last_epoch_started": 16,
"hit_set_history": {
"current_last_update": "0'0",
"history": []
}
}

last_written_info:
{
"pgid": "1.2b0",
"last_update": "18'41",
"last_complete": "18'41",
"log_tail": "0'0",
"last_user_version": 41,
"last_backfill": "MAX",
"last_backfill_bitwise": 0,
"purged_snaps": "[]",
"history": {
"epoch_created": 15,
"last_epoch_started": 16,
"last_epoch_clean": 16,
"last_epoch_split": 0,
"last_epoch_marked_full": 0,
"same_up_since": 15,
"same_interval_since": 15,
"same_primary_since": 15,
"last_scrub": "0'0",
"last_scrub_stamp": "2016-10-04 17:34:54.968066",
"last_deep_scrub": "0'0",
"last_deep_scrub_stamp": "2016-10-04 17:34:54.968066",
"last_clean_scrub_stamp": "2016-10-04 17:34:54.968066"
},
"stats": {
"version": "18'40",
"reported_seq": "15",
"reported_epoch": "18",
"state": "active+clean",
"last_fresh": "2016-10-04 17:35:42.281977",
"last_change": "2016-10-04 17:34:56.004604",
"last_active": "2016-10-04 17:35:42.281977",
"last_active": "2016-10-04 17:35:42.281977",
"last_peered": "2016-10-04 17:35:42.281977",
"last_clean": "2016-10-04 17:35:42.281977",
"last_became_active": "2016-10-04 17:34:56.002579",
"last_became_peered": "2016-10-04 17:34:56.002579",
"last_unstale": "2016-10-04 17:35:42.281977",
"last_undegraded": "2016-10-04 17:35:42.281977",
"last_fullsized": "2016-10-04 17:35:42.281977",
"mapping_epoch": 15,
"log_start": "0'0",
"ondisk_log_start": "0'0",
"created": 15,
"last_epoch_clean": 16,
"parent": "0.0",
"parent_split_bits": 0,
"last_scrub": "0'0",
"last_scrub_stamp": "2016-10-04 17:34:54.968066",
"last_deep_scrub": "0'0",
"last_deep_scrub_stamp": "2016-10-04 17:34:54.968066",
"last_clean_scrub_stamp": "2016-10-04 17:34:54.968066",
"log_size": 40,
"ondisk_log_size": 40,
"stats_invalid": false,
"dirty_stats_invalid": false,
"omap_stats_invalid": false,
"hitset_stats_invalid": false,
"hitset_bytes_stats_invalid": false,
"pin_stats_invalid": false,
"stat_sum": {
"num_bytes": 42991616,
"num_objects": 11,
"num_object_clones": 0,
"num_object_copies": 20,
"num_objects_missing_on_primary": 0,
"num_objects_missing": 0,
"num_objects_degraded": 0,
"num_objects_misplaced": 0,
"num_objects_unfound": 0,
"num_objects_dirty": 10,
"num_whiteouts": 0,
"num_read": 0,
"num_read_kb": 0,
"num_write": 82,
"num_write_kb": 41984,
"num_scrub_errors": 0,
"num_shallow_scrub_errors": 0,
"num_deep_scrub_errors": 0,
"num_objects_recovered": 0,
"num_bytes_recovered": 0,
"num_keys_recovered": 0,
"num_objects_omap": 0,
"num_objects_hit_set_archive": 0,
"num_bytes_hit_set_archive": 0,
"num_flush": 0,
"num_flush_kb": 0,
"num_evict": 0,
"num_evict_kb": 0,
"num_promote": 0,
"num_flush_mode_high": 0,
"num_flush_mode_low": 0,
"num_evict_mode_some": 0,
"num_evict_mode_full": 0,
"num_objects_pinned": 0
},
"up": [
11,
0
],
"acting": [
11,
0
],
"blocked_by": [],
"up_primary": 11,
"acting_primary": 11
},
"empty": 0,
"dne": 0,
"incomplete": 0,
"last_epoch_started": 16,
"hit_set_history": {
"current_last_update": "0'0",
"history": []
}
}

But I am seeing pg_info is 178 bytes now; what's wrong?

@liewegas (Member Author):

Oops, I missed the num_objects_dirty field. Updated; can you retest please?

@somnathr:

will do..

@somnathr:

@liewegas Working fine, no more log flooding; fastinfo size is now 186 bytes.

@somnathr commented Oct 5, 2016

@liewegas Also, in some cases I am seeing _info of 855 bytes and _fastinfo of 0 bytes:

2016-10-04 20:29:58.656150 7f49eeda2700 30 submit_transaction Rocksdb transaction:
Put( Prefix = M key = 0x0000000000000873'.0000000018.00000000000000000085' Value size = 182)
Put( Prefix = M key = 0x0000000000000873'._fastinfo' Value size = 0)
Put( Prefix = M key = 0x0000000000000873'._info' Value size = 855)

Seems to be a bug..

@liewegas (Member Author) commented Oct 5, 2016

The 0-byte _fastinfo happens when fields outside of the fast info struct are updated. It's normal for that to happen sometimes; we just want to make it rare enough that most IO operations take the fastinfo path.
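
A sketch of that decision, with made-up helper names (the real logic is in _prepare_write_info), showing why the transaction above carries an empty _fastinfo value next to the full _info:

```cpp
#include <map>
#include <string>

// Placeholder encoders for the simplified structs sketched earlier.
std::string encode_fast(const pg_fast_info_t& f);
std::string encode_full(const pg_info_t& i);

void queue_info_keys(bool fast_path_ok,
                     const pg_info_t& info,
                     const pg_fast_info_t& fast,
                     std::map<std::string, std::string>* keys)
{
  if (fast_path_ok) {
    // Common case: only fast fields changed -> one small key (~180 bytes).
    (*keys)["_fastinfo"] = encode_fast(fast);
  } else {
    // A field outside the fast set changed: rewrite the full info
    // (~850 bytes) and overwrite _fastinfo with an empty value so that a
    // stale fast overlay can never be applied on top of the new info.
    (*keys)["_info"] = encode_full(info);
    (*keys)["_fastinfo"] = std::string();
  }
}
```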

@liewegas force-pushed the wip-osd-fastinfo branch 2 times, most recently from 208968a to 93c1ca6, on October 5, 2016 13:35
@somnathr commented Oct 6, 2016

@liewegas Can't we just skip writing the key when the value is 0 bytes, since it causes an extra write in the kv store?
Yes, it is not frequent, at least in my test.

@somnathr commented Oct 6, 2016

@liewegas Can we merge it?

@liewegas (Member Author) commented Oct 6, 2016

It needs to go through QA first.

@liewegas (Member Author) commented Oct 6, 2016 via email

@markhpc (Member) commented Oct 7, 2016

Not QA, but I've successfully been running this through performance tests without issue. I think Somnath is checking to see if this is related to his performance issues with longer running tests though.

@somnathr commented Oct 7, 2016

@markhpc @liewegas So stupid of me; the latency increase I was seeing was because of a QD (queue depth) increase in my script. I had modified the script to measure read performance and forgot to revert it back. I will update the community accordingly.
Checking for long-run stable performance now. Sorry for the noise.

@liewegas force-pushed the wip-osd-fastinfo branch 3 times, most recently from b549907 to 6821ee7, on October 17, 2016 02:19
Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
We'll use this shortly.

Signed-off-by: Sage Weil <sage@redhat.com>
For most IO operations we only update a handful of fields
in the pg_info_t structure.  However, the full struct,
when encoded, is on the order of 800 bytes.

This adds a new attribute, _fastinfo, which contains only
the most commonly updated fields.  When present, the
fastinfo fields should be overlaid on top of the full
info struct contained in the existing info attr.  If
a field outside of the "fast" set is updated, we clear
the fastinfo attribute and update the full info attr.

Signed-off-by: Sage Weil <sage@redhat.com>
This way we don't have to bother deleting the fastinfo
attr, which will reduce the amount of work the
underlying store has to do.

Signed-off-by: Sage Weil <sage@redhat.com>
All compat pg formats can handle past interval removal
and mark-complete.

Signed-off-by: Sage Weil <sage@redhat.com>
Note info updates (of any kind), fastinfo updates,
and biginfo updates.

Signed-off-by: Sage Weil <sage@redhat.com>
@liewegas merged commit 20ee727 into ceph:master on Oct 19, 2016
@liewegas deleted the wip-osd-fastinfo branch on October 19, 2016 02:40