Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ceph-disk: support bluestore #7218

Merged
merged 26 commits into from Feb 5, 2016
Merged

ceph-disk: support bluestore #7218

merged 26 commits into from Feb 5, 2016

Conversation

ghost
Copy link

@ghost ghost commented Jan 13, 2016

No description provided.

@ghost ghost added feature core labels Jan 13, 2016
@xiexingguo
Copy link
Member

@dachary Could you do me a favor of explaining me what DNM represents? Sorry for my poor english:-)

@theanalyst
Copy link
Member

@xiexingguo It means Do Not Merge, it is set generally when you have some work in progress code and want to receive early feedback

@xiexingguo
Copy link
Member

@theanalyst 👍 Got it. Thank you so very much.

@ghost
Copy link
Author

ghost commented Jan 20, 2016

cw ceph-qa-suite --openrc ovh-gra1-openrc.sh --simultaneous-jobs 9 --verbose --teuthology-git-url http://github.com/dachary/teuthology --teuthology-branch openstack --ceph-qa-suite-git-url http://github.com/ceph/ceph-qa-suite --suite-branch master --ceph-git-url http://github.com/dachary/ceph --ceph wip-13942-ceph-disk --suite ceph-disk --upload

@ghost
Copy link
Author

ghost commented Jan 22, 2016

cw ceph-qa-suite --openrc ovh-gra1-openrc.sh --simultaneous-jobs 9 --verbose --teuthology-git-url http://github.com/dachary/teuthology --teuthology-branch openstack --ceph-qa-suite-git-url http://github.com/ceph/ceph-qa-suite --suite-branch master --ceph-git-url http://github.com/dachary/ceph --ceph wip-13942-ceph-disk --suite ceph-disk --upload

@ghost ghost assigned liewegas Jan 24, 2016
@ghost
Copy link
Author

ghost commented Jan 24, 2016

@liewegas the refactor is complete (to the extent necessary for ceph-disk prepare). It took a few iterations but I like what I have now.

The bluestore support is a draft and I stumbled into a problem: the fsid is assumed to be a sign that --mkfs has run. I worked that around with "bluestore: do not fail on a prepared device" but it has other undesirable side effects. How should we deal with this ?

@ghost ghost changed the title DNM: ceph disk for bluestore ceph disk for bluestore Feb 1, 2016
@ghost
Copy link
Author

ghost commented Feb 1, 2016

with a rebase on the latest master, the bluestore integration tests pass. running them one last time to be sure. Ready for your review @liewegas ;-)

@ghost
Copy link
Author

ghost commented Feb 1, 2016

cw ceph-qa-suite --verbose --teuthology-git-url http://github.com/dachary/teuthology --teuthology-branch openstack --ceph-qa-suite-git-url http://github.com/ceph/ceph-qa-suite --suite-branch master --ceph-git-url http://github.com/dachary/ceph --ceph wip-13942-ceph-disk --suite ceph-disk --upload

@liewegas
Copy link
Member

liewegas commented Feb 1, 2016

This is awesome! Just three comments:

  • let's use the 'type' file to identify existing osds
  • maybe rename the 'default' in method names to 'filestore' for clarity? and then have a small number of sites that indicate filestore is the default (e.g., if type file isn't missing).
  • I think we should get rid of the weird journal wanted/needed thing. This was originally intended to expose the backend requirements in a generic way, but it wasn't expressive enough for bluestore, and keyvaluestore is now removed anyway. Maybe enough to simplify away the PrepareJournal?

@ghost
Copy link
Author

ghost commented Feb 1, 2016

@liewegas I'll address your comments, thanks for the quick review :-). There are zillions things I'd like to simplify now but I tried to keep to the minimum. Unless you think it's important I'd rather keep the want/need for this iteration. I'm semi-worried getting rid of it will create a few additional minor issues that will add delays.

@liewegas
Copy link
Member

liewegas commented Feb 1, 2016 via email

@ghost
Copy link
Author

ghost commented Feb 1, 2016

@liewegas when you suggest to use type to identify the OSD, I assume it's in the context of activate where the initial proposal was to use the "osd objectstore" configure value instead. I'll change that to use the type instead, it's indeed more sensible.

@liewegas
Copy link
Member

liewegas commented Feb 1, 2016 via email

@ghost
Copy link
Author

ghost commented Feb 2, 2016

@liewegas I grouped the changes mainly in 0c10294 except for the s/Default/Filestore/ which was done by amending "ceph-disk: refactor prepare". It passes the ceph-disk suite as shown below.

cw ceph-qa-suite --verbose --teuthology-git-url http://github.com/dachary/teuthology --teuthology-branch openstack --ceph-qa-suite-git-url http://github.com/ceph/ceph-qa-suite --suite-branch master --ceph-git-url http://github.com/dachary/ceph --ceph wip-13942-ceph-disk --suite ceph-disk --upload

@liewegas
Copy link
Member

liewegas commented Feb 2, 2016

One small comment on the bluestore change. Otherwise, I think this is ready for merge!

@ghost
Copy link
Author

ghost commented Feb 2, 2016

@liewegas fixed at 56f0809 and tests are running.

cw ceph-qa-suite --verbose --teuthology-git-url http://github.com/dachary/teuthology --teuthology-branch openstack --ceph-qa-suite-git-url http://github.com/ceph/ceph-qa-suite --suite-branch master --ceph-git-url http://github.com/dachary/ceph --ceph wip-13942-ceph-disk --suite ceph-disk --upload

@ghost
Copy link
Author

ghost commented Feb 3, 2016

@liewegas tests passed

@liewegas
Copy link
Member

liewegas commented Feb 3, 2016 via email

liewegas and others added 7 commits February 4, 2016 14:36
Some backends don't use it, almost all clusters use the default
path, and there'll always be a symlink.

Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Loic Dachary <loic@dachary.org>
Signed-off-by: Loic Dachary <loic@dachary.org>
Signed-off-by: Loic Dachary <loic@dachary.org>
Signed-off-by: Loic Dachary <loic@dachary.org>
Signed-off-by: Loic Dachary <loic@dachary.org>
Signed-off-by: Loic Dachary <loic@dachary.org>
Signed-off-by: Loic Dachary <loic@dachary.org>
When make all runs in the ceph-detect-init module, it does a "setup.py
build" which is not used. Replace it with a python setup.py install in a
virtualenv so that tests can add the virtualenv/bin to their PATH and
call ceph-detect-init from sources as they would if it was installed.

Part of run-tox.sh is moved to tools/setup-virtualenv.sh so that it can
be re-used by ceph-disk and other python modules.

Signed-off-by: Loic Dachary <loic@dachary.org>
Because ceph-disk unit tests were not run as part of make check, part of
the most recent changes broke them. This is a batch fix to sanitize the
situation. Since it is now run with make check, that won't happen again.

Signed-off-by: Loic Dachary <loic@dachary.org>
Refactor the test / virtualenv setup in the same way it was done for
ceph-detect-init.

All shell tests use ceph-helpers.sh which is modified to add ceph-disk /
ceph-detect-init virtualenv/bin to the PATH to ensure the source version
is used even if ceph is installed.

See "ceph-detect-init: make all must setup.py install"

Signed-off-by: Loic Dachary <loic@dachary.org>
Signed-off-by: Loic Dachary <loic@dachary.org>
To help with the prepare refactor.

Signed-off-by: Loic Dachary <loic@dachary.org>
Because some variables are global in ceph-disk, tests that modify them
interact with each other in non-predictable ways. This will go away
eventually but requires a significant refactor. Workaround by running
one py.test per test file.

Signed-off-by: Loic Dachary <loic@dachary.org>
Remove unused argument.

Signed-off-by: Loic Dachary <loic@dachary.org>
Signed-off-by: Loic Dachary <loic@dachary.org>
The helper function no longer has journal hardcoded.

Signed-off-by: Loic Dachary <loic@dachary.org>
The logic / code path is only modified to the extent necessary for the
refactor.

The Prepare class roughly replaces the prepare_main function but also
handles the prepare subcommand argument parsing. It creates the data and
journal objects and delegate the actual work to them via the prepare()
method.

The Prepare class assumes that preparing an OSD consists on the
following phases:

  * optionally prepare auxiliary devices, such as the journal
  * prepare a data directory or device
  * populate the data directory with fsid etc. and optionally
    symbolic links to the auxiliary devices

The PrepareDefault class is derived from Prepare and implements the
current model where there only is one auxiliary device, the journal.

The PrepareJournal class implements the *journal* functions
and is based on a generic class, PrepareSpace which handles the
allocation of an auxiliary device. The only journal specific feature is
left to the PrepareJournal class: querying the OSD to figure out if
a journal is wanted or not.

The OSD data directory is prepared via the PrepareData class. It creates
a file system if necessary (i.e. if a device) and populate the data
directory. Further preparation is then delegated to the auxiliary
devices (i.e. adding a symlink to the device for a journal).

There was some code paths related dmcrypt / multipath devices in
the prepare functions, although it is orthogonal. A class tree for
Devices was created to isolate that.

Although that was the primary reason for adding a new class tree, two
other aspects have also been moved there: ptypes and partition creation.
The ptypes are organized into a data structure with a few helpers in
the hope it will be easier to maintain. All references to the *_UUID
variables have been updated.

The creation of a partition is delegated to sgdisk and a wrapper helps
reduce the code redundancy.

The ptype of a given partition depends on the type of the device (is it
dmcrypt'ed or a multipath device ?). It is best implemented by
derivation so the prepare function does not need to be concerned about
how the ptype of a partition is determined.

Many functions could be refactored into a Device class and its
derivatives, but that was not done to minimize the size of the refactor.

  Device knows how to create a partition and figure out the ptype tobe
  DevicePartition a regular device partition
  DevicePartitionMultipath a partition of a multipath device
  DevicePartitionCrypt base class for luks/plain dmcrypt, can map/unmap
  DevicePartitionCryptPlain knows how to setup dmcrypt plain
  DevicePartitionCryptLuks knows how to setup dmcrypt plain

The CryptHelpers class is introduced to factorize the code snippets that
were duplicated in various places but that do not really belong
because they are convenience wrappers to figure out:

   * if dmrypt should be used
   * the keysize
   * the dmcrypt type (plain or luks)

Signed-off-by: Loic Dachary <loic@dachary.org>
Only support the block file for now. It is handled the same as the
journal, only with a different name (block) and it's own set of ptypes
depending on multipath or dmcrypt.

Signed-off-by: Loic Dachary <loic@dachary.org>
Only support the block file for now. The refactoring consist of
replacing main_activate_journal with main_activate_space and a name
argument (block, journal). More work will be needed to support multiple
auxiliary devices (block.wal etc). But the goal is to minimize the
change because this commit is part of a series of commits focusing on
refactoring prepare, not the entire ceph-disk codebase.

Signed-off-by: Loic Dachary <loic@dachary.org>
Copy paste the journal code and s/journal/block/

More work will be needed to support multiple auxiliary
devices (block.wal etc). But the goal is to minimize the change because
this commit is part of a series of commits focusing on refactoring
prepare, not the entire ceph-disk codebase.

Signed-off-by: Loic Dachary <loic@dachary.org>
The objectstore journal and the bluestore block auxiliary device are
handled in the same way. Each occurrence of journal in the code is
replaced with a variable.

A few helpers are added to the Ptype class to factorize the most common
lookups but the code logic is unmodified with one exception: the
more_osd_info previously added a journal_uuid entry regarless. If there
was no journal_uuid file, it would be None. It is changed to only add
the {block,journal}_uuid entry if the corresponding file exist.

Signed-off-by: Loic Dachary <loic@dachary.org>
It is straightforward because it entirely relies on information
collected by ceph-disk list which has full support for bluestore.
It loops on all possible auxiliary devices (as found in Spaces.NAMES)
and does the associated deactivate / destruction which is merely about
handling dmcrypt map / unmap.

Signed-off-by: Loic Dachary <loic@dachary.org>
Signed-off-by: Loic Dachary <loic@dachary.org>
http://tracker.ceph.com/issues/13422 made it so ceph-osd won't start
unless the pidfile can be created successfully. The default location
being the current directory, ceph-osd must explicitly be told to write
in a directory where it has write permissions.

Signed-off-by: Loic Dachary <loic@dachary.org>
The type file in the OSD bluestore data exists and contains the
bluestore string. ceph-disk activate should use it instead of
the "osd objectstore" configuration value. It is better in case the
configuration file changes between prepare and activate.

The fsid file cannot be used by bluestore to signify that ceph-osd
--mkfs has completed successfully because it is pre-populated by
ceph-disk. Introduce the mkfs_done file, dedicated to this, instead of
overloading an existing file.

Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Loic Dachary <loic@dachary.org>
@ghost
Copy link
Author

ghost commented Feb 4, 2016

Proposed merge message:

ceph-disk: add support for bluestore

When preparing a device or a directory, the flag --bluestore can be set to select the bluestore experimental object store. It is recommended to also set the following configuration variables: osd objectstore=bluestore, enable experimental unrecoverable data corrupting features=* . The bluestore partition is set with a type specific to bluestore and new udev rules were added to run ceph-disk activate at boot time.

By default two partitions are created for a bluestore device, in the same way a filestore is split between a data and a journal partition. The bluestore data partition is small and contains administrative information. The bluestore block partition uses the rest of the disk and contains the OSD data.

Multipath devices, dmcrypt (plain or luks) are also supported for bluestore.

@ghost
Copy link
Author

ghost commented Feb 4, 2016

cw ceph-qa-suite --verbose --teuthology-git-url http://github.com/dachary/teuthology --teuthology-branch openstack --ceph-qa-suite-git-url http://github.com/ceph/ceph-qa-suite --suite-branch master --ceph-git-url http://github.com/dachary/ceph --ceph wip-13942-ceph-disk --suite ceph-disk --upload

@liewegas
Copy link
Member

liewegas commented Feb 4, 2016 via email

@ghost ghost added the needs-qa label Feb 4, 2016
@ghost
Copy link
Author

ghost commented Feb 4, 2016

@liewegas ceph-disk passes but I think it deserves a rados suite to be sure.

@liewegas
Copy link
Member

liewegas commented Feb 4, 2016

k, i'll get it in the next round. Thanks!

liewegas added a commit that referenced this pull request Feb 5, 2016
ceph-disk: support bluestore

Reviewed-by: Sage Weil <sage@redhat.com>
@liewegas liewegas merged commit adfc653 into ceph:master Feb 5, 2016
@ghost ghost changed the title ceph disk for bluestore ceph-disk: support bluestore Mar 8, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants