Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mon: restructure prime_pg_temp around a full pg mapping calculated on multiple CPUs #13207

Merged
merged 17 commits into from Feb 18, 2017

Conversation

liewegas
Copy link
Member

This is removing prime_pg_temps dependency on pg_map, with teh small exception of
checking for creating_pgs (which is being removed in a separate PR).

It also moves processing for calculating PG mappings and checking for needed pg_temp
values to a parallel work queue that runs on multiple cores.

Both work efforts (pg mapping calcs and prime_pg_temp) are time boxed with a graceful fallback.

@tchaikov
Copy link
Contributor

tchaikov commented Feb 3, 2017

i am still reading this.

@liewegas
Copy link
Member Author

liewegas commented Feb 3, 2017

retest this please

Copy link
Contributor

@tchaikov tchaikov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will continue the review tmr.

@@ -459,6 +459,7 @@ set(libcommon_files
msg/msg_types.cc
common/hobject.cc
osd/OSDMap.cc
osd/OSDMapMapping.cc
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

might want to squash this commit into the previous one.

src/osd/OSDMap.h Outdated
void _dump();

public:
void get(pg_t pgid,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could pass by the reference.

}
}

void OSDMapMapping::_dump()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could override the operator<<() in case we want to debug it using the dout facility.

for (auto osd : osds) {
const vector<pg_t>& pgs = mapping.get_osd_acting_pgs(osd);
dout(20) << __func__ << " osd." << osd << " " << pgs << dendl;
for (auto pgid : pgs) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const auto& as pg_t is not a primitive type.

ceph::unordered_map<pg_t, pg_stat_t>::iterator pp)
void OSDMonitor::prime_pg_temp(
OSDMap& next,
pg_t pgid)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could pass by const pg_t&.

<< g_conf->mon_osd_prime_pg_temp_max_time
<< " seconds, stopping"
<< dendl;
abort = true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could just return here.

<< g_conf->mon_osd_prime_pg_temp_max_time
<< " seconds, stopping"
<< dendl;
abort = true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could just return here.

}
}
return num;
pending_inc.new_pg_temp[pgid] = cur_acting;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better off using

pending_inc.new_pg_temp[pgid] = std::move(cur_acting);

so less cpu cycles spent on copy ctor.

<< dendl;
pending_inc.new_pg_temp[pgid] = cur_acting;
pending_inc.new_pg_temp[pgid] = acting;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

might want to take this chance to use std::move(acting)

@liewegas liewegas force-pushed the wip-mapping branch 2 times, most recently from 52d0b1f to 95fd496 Compare February 14, 2017 15:26
@liewegas
Copy link
Member Author

simplified the bool abort thing and squashed the first 2 commits.

leaving pg_t args as pass-by-value since it's only a 16 byte struct?

@liewegas
Copy link
Member Author

retest this please


void ParallelPGMapper::Job::finish_one()
{
Context *fin = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit, use nullptr for more consistency.

protected:
CephContext *cct;

struct item {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/item/Item/

break;
}
}
ParallelPGMapper::Job *job = new PrimeTempJob(next, this);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so we can allocate job on stack?

break;
utime_t stop = ceph_clock_now();
stop += g_conf->mon_osd_prime_pg_temp_max_time;
int chunk = 1000;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could mark chunk a const or constexpr.

@@ -111,10 +112,14 @@ struct failure_info_t {

class OSDMonitor : public PaxosService {
CephContext *cct;

ParallelPGMapper mapper; ///< for background pg work
unique_ptr<OSDMapMapping> mapping; ///< pg <-> osd mappings
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@liewegas i don't understand why we want to use a pointer of OSDMapMapping instead of a plain OSDMapMapping? the same applies to mapping_job.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because they are optional, and because the classes weren't necessarily written with re-use in mind.

src/mon/PGMap.cc Outdated
}

if (check_all) {
for (auto p : pg_map->pg_stat) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

might want to use const auto& p, as pair<pg_t,pg_stat_t> is a relatively large struct.

@@ -67,8 +67,8 @@ void PGMonitor::on_restart()
void PGMonitor::on_active()
{
if (mon->is_leader()) {
check_all_pgs = true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, need_check_down_pgs was misleading before.

src/osd/OSDMap.h Outdated
class CephContext;
class CrushWrapper;
class OSDMap;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not necessary any more, as OSDMapMapping is declared in its own header file.

pools.emplace(p.first, PoolMapping(p.second.get_size(),
p.second.get_pg_num()));
}
while (q != pools.end()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit, just put pools.erase(q , pools.end());

osdmap.pg_to_up_acting_osds(
pg_t(ps, pool),
&up, &up_primary, &acting, &acting_primary);
i->second.set(ps, up, up_primary, acting, acting_primary);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could use std::move(up) and std::move(acting) for better performance.

@liewegas
Copy link
Member Author

updated; will squash once reviewed!

@tchaikov
Copy link
Contributor

lgtm.

This is a precalculated table of all pg mappings.

Signed-off-by: Sage Weil <sage@redhat.com>
This is temporarily; we'll put this in an async, parallelized thread
shortly.

Signed-off-by: Sage Weil <sage@redhat.com>
Simplify the code and remove most of the dependency on PGMap.  We still
need PGMap for the creating_pgs to ignore; that can be dropped later.

Signed-off-by: Sage Weil <sage@redhat.com>
Calculate a mapping in parallel over a workqueue + threadpool.

Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
…t finish

Start the background mapping job when we go active.  If it has finished
by the time we need to prime, use it.  If not, skip prime_pg_temp.

Signed-off-by: Sage Weil <sage@redhat.com>
The reverse mapping from up osds to pgs is not used; disable it for now.

Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
…nyway

If we have enough noteworthy OSDs it is likely we will touch all or most
PGs anyway, at which point it is faster and simpler to check all pgs.

Signed-off-by: Sage Weil <sage@redhat.com>
Make Job a parent class that can be specialized for any per-PG batch work.

Signed-off-by: Sage Weil <sage@redhat.com>
Do the prime_pg_temp work in parallel using the mapper.

Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Two big changes here:

1) Move the updating into PGMapUpdater along with the other related
checks.  Cleaner!

2) We use the current OSDMap to mark things stale.  Before, we had
a weird two stage process: first we would apply the OSDMap to the
PGMap, and then the *next* time the OSDMap updated we would look
for stale PGs.  This was silly.  Instead, we mark the PGs stale
when we process the OSDMap that made them stale.

We have a be slightly careful because the PGMap stats primaries
might be a bit old and, for example, might have a smaller max_osd
than what we have now.

This change is motivated right now because the new prime_pg_temp
code is smarter about a failed OSD.  While previously failing
an OSD would install a pg_temp mapping to the old and new OSDs,
even though the old OSD is now down, the new code does not.  And
previously failing both OSDs for a PG would always result in a
subsequent OSDMap update removing those bad pg_temp values, giving
check_down_pgs() a change to mark the PGs stale, while the new code
has not subsequent OSDMap update, which would leave the PG unstale,
causing the dump_stuck.py test to fail.

Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
@tchaikov
Copy link
Contributor

retest this please.

@liewegas liewegas merged commit c1c1f2e into ceph:master Feb 18, 2017
@liewegas liewegas deleted the wip-mapping branch February 18, 2017 18:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants