New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
mon: restructure prime_pg_temp around a full pg mapping calculated on multiple CPUs #13207
Conversation
i am still reading this. |
retest this please |
c83e936
to
94982cf
Compare
2c35725
to
636027b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will continue the review tmr.
@@ -459,6 +459,7 @@ set(libcommon_files | |||
msg/msg_types.cc | |||
common/hobject.cc | |||
osd/OSDMap.cc | |||
osd/OSDMapMapping.cc |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
might want to squash this commit into the previous one.
src/osd/OSDMap.h
Outdated
void _dump(); | ||
|
||
public: | ||
void get(pg_t pgid, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could pass by the reference.
src/osd/OSDMapMapping.cc
Outdated
} | ||
} | ||
|
||
void OSDMapMapping::_dump() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could override the operator<<()
in case we want to debug it using the dout
facility.
src/mon/OSDMonitor.cc
Outdated
for (auto osd : osds) { | ||
const vector<pg_t>& pgs = mapping.get_osd_acting_pgs(osd); | ||
dout(20) << __func__ << " osd." << osd << " " << pgs << dendl; | ||
for (auto pgid : pgs) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const auto&
as pg_t
is not a primitive type.
src/mon/OSDMonitor.cc
Outdated
ceph::unordered_map<pg_t, pg_stat_t>::iterator pp) | ||
void OSDMonitor::prime_pg_temp( | ||
OSDMap& next, | ||
pg_t pgid) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could pass by const pg_t&
.
src/mon/OSDMonitor.cc
Outdated
<< g_conf->mon_osd_prime_pg_temp_max_time | ||
<< " seconds, stopping" | ||
<< dendl; | ||
abort = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could just return
here.
src/mon/OSDMonitor.cc
Outdated
<< g_conf->mon_osd_prime_pg_temp_max_time | ||
<< " seconds, stopping" | ||
<< dendl; | ||
abort = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could just return
here.
src/mon/OSDMonitor.cc
Outdated
} | ||
} | ||
return num; | ||
pending_inc.new_pg_temp[pgid] = cur_acting; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
better off using
pending_inc.new_pg_temp[pgid] = std::move(cur_acting);
so less cpu cycles spent on copy ctor.
src/mon/OSDMonitor.cc
Outdated
<< dendl; | ||
pending_inc.new_pg_temp[pgid] = cur_acting; | ||
pending_inc.new_pg_temp[pgid] = acting; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
might want to take this chance to use std::move(acting)
52d0b1f
to
95fd496
Compare
simplified the bool abort thing and squashed the first 2 commits. leaving pg_t args as pass-by-value since it's only a 16 byte struct? |
retest this please |
95fd496
to
4d06121
Compare
src/osd/OSDMapMapping.cc
Outdated
|
||
void ParallelPGMapper::Job::finish_one() | ||
{ | ||
Context *fin = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit, use nullptr
for more consistency.
src/osd/OSDMapMapping.h
Outdated
protected: | ||
CephContext *cct; | ||
|
||
struct item { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/item/Item/
src/mon/OSDMonitor.cc
Outdated
break; | ||
} | ||
} | ||
ParallelPGMapper::Job *job = new PrimeTempJob(next, this); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so we can allocate job
on stack?
src/mon/OSDMonitor.cc
Outdated
break; | ||
utime_t stop = ceph_clock_now(); | ||
stop += g_conf->mon_osd_prime_pg_temp_max_time; | ||
int chunk = 1000; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could mark chunk
a const
or constexpr
.
@@ -111,10 +112,14 @@ struct failure_info_t { | |||
|
|||
class OSDMonitor : public PaxosService { | |||
CephContext *cct; | |||
|
|||
ParallelPGMapper mapper; ///< for background pg work | |||
unique_ptr<OSDMapMapping> mapping; ///< pg <-> osd mappings |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@liewegas i don't understand why we want to use a pointer of OSDMapMapping
instead of a plain OSDMapMapping
? the same applies to mapping_job
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
because they are optional, and because the classes weren't necessarily written with re-use in mind.
src/mon/PGMap.cc
Outdated
} | ||
|
||
if (check_all) { | ||
for (auto p : pg_map->pg_stat) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
might want to use const auto& p
, as pair<pg_t,pg_stat_t>
is a relatively large struct.
@@ -67,8 +67,8 @@ void PGMonitor::on_restart() | |||
void PGMonitor::on_active() | |||
{ | |||
if (mon->is_leader()) { | |||
check_all_pgs = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, need_check_down_pgs
was misleading before.
src/osd/OSDMap.h
Outdated
class CephContext; | ||
class CrushWrapper; | ||
class OSDMap; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is not necessary any more, as OSDMapMapping
is declared in its own header file.
src/osd/OSDMapMapping.cc
Outdated
pools.emplace(p.first, PoolMapping(p.second.get_size(), | ||
p.second.get_pg_num())); | ||
} | ||
while (q != pools.end()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit, just put pools.erase(q , pools.end());
src/osd/OSDMapMapping.cc
Outdated
osdmap.pg_to_up_acting_osds( | ||
pg_t(ps, pool), | ||
&up, &up_primary, &acting, &acting_primary); | ||
i->second.set(ps, up, up_primary, acting, acting_primary); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could use std::move(up)
and std::move(acting)
for better performance.
updated; will squash once reviewed! |
lgtm. |
This is a precalculated table of all pg mappings. Signed-off-by: Sage Weil <sage@redhat.com>
This is temporarily; we'll put this in an async, parallelized thread shortly. Signed-off-by: Sage Weil <sage@redhat.com>
Simplify the code and remove most of the dependency on PGMap. We still need PGMap for the creating_pgs to ignore; that can be dropped later. Signed-off-by: Sage Weil <sage@redhat.com>
Calculate a mapping in parallel over a workqueue + threadpool. Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
…t finish Start the background mapping job when we go active. If it has finished by the time we need to prime, use it. If not, skip prime_pg_temp. Signed-off-by: Sage Weil <sage@redhat.com>
The reverse mapping from up osds to pgs is not used; disable it for now. Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
…nyway If we have enough noteworthy OSDs it is likely we will touch all or most PGs anyway, at which point it is faster and simpler to check all pgs. Signed-off-by: Sage Weil <sage@redhat.com>
Make Job a parent class that can be specialized for any per-PG batch work. Signed-off-by: Sage Weil <sage@redhat.com>
Do the prime_pg_temp work in parallel using the mapper. Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Two big changes here: 1) Move the updating into PGMapUpdater along with the other related checks. Cleaner! 2) We use the current OSDMap to mark things stale. Before, we had a weird two stage process: first we would apply the OSDMap to the PGMap, and then the *next* time the OSDMap updated we would look for stale PGs. This was silly. Instead, we mark the PGs stale when we process the OSDMap that made them stale. We have a be slightly careful because the PGMap stats primaries might be a bit old and, for example, might have a smaller max_osd than what we have now. This change is motivated right now because the new prime_pg_temp code is smarter about a failed OSD. While previously failing an OSD would install a pg_temp mapping to the old and new OSDs, even though the old OSD is now down, the new code does not. And previously failing both OSDs for a PG would always result in a subsequent OSDMap update removing those bad pg_temp values, giving check_down_pgs() a change to mark the PGs stale, while the new code has not subsequent OSDMap update, which would leave the PG unstale, causing the dump_stuck.py test to fail. Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
4f57916
to
3893432
Compare
retest this please. |
This is removing prime_pg_temps dependency on pg_map, with teh small exception of
checking for creating_pgs (which is being removed in a separate PR).
It also moves processing for calculating PG mappings and checking for needed pg_temp
values to a parallel work queue that runs on multiple cores.
Both work efforts (pg mapping calcs and prime_pg_temp) are time boxed with a graceful fallback.