Nathan Cutler edited this page Sep 8, 2018 · 16 revisions

This documentation is for DeepSea 0.7. Documentation for 0.6 is here

Intro

A brief summary of DeepSea, why it exists and how it is organized follows.

What is DeepSea?

DeepSea is a collection of Salt states, runners and modules for deploying and managing Ceph. For those new to Salt and Ceph, that may not mean much. The point is that DeepSea is not a separate application trying to reinvent the wheel. This project is more of a namespace that lives within Salt to accomplish all things related to Ceph.

The traditional method for deploying Ceph is ceph-deploy. As the page says, ceph-deploy has low overhead. However, that overhead is passed on to the administrator. For a distributed storage system, a configuration management/automation framework is essential. Salt is fast, allows manual execution of commands on remote systems and provides many components for automating complex configurations.

For more information see Salt and Ceph.

Purpose

DeepSea tries to be as flexible as necessary, with the goal of making the difficult possible. Those new to both Salt and Ceph should be able to have a working Ceph cluster without too much effort. That effort should instead go toward understanding Ceph and deciding whether to customize Salt. The default configuration of DeepSea accommodates those who are not system administrators.

Not everyone agrees. Some feel strongly that package updates, and in particular kernel updates, should be left to the administrator. Others feel that SSL should be mandatory. These groups of users are supported, but their setups require additional configuration steps and are not the default.

Salt

For those completely unfamiliar with Salt, think of Salt as a collection of modularized shell scripts with colorful output. Another noteworthy behavior is that Salt is asynchronous in the common case. In other words, running many Salt commands is similar to backgrounding a process, although the command line client will wait for a response. Some admins find this unsettling at first. One solution is to use the deepsea cli.

Salt has standard locations and some naming conventions. The configuration data for your Salt cluster is kept in /srv/pillar. The files representing the various tasks are called state files or sls files. These are kept in /srv/salt.

Two other important locations are /srv/module/runners and /srv/salt/_modules. The former holds Python scripts known as runners, which run in a particular context on the Salt master. The latter holds user-defined modules; these are also Python scripts, but their return values are important, and they run only on the minions. The minion is the daemon or agent that carries out tasks from the master.
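As a concrete illustration, a state file is plain YAML that maps an ID to a state module call. The file name and package below are hypothetical, not part of DeepSea:

```yaml
# /srv/salt/example.sls -- hypothetical minimal state file
# Ensure the vim package is installed on the targeted minions
install_vim:
  pkg.installed:
    - name: vim
```

It would be applied with salt '&lt;target&gt;' state.apply example.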

Ceph Configuration

For Ceph newcomers, imagine the following challenge: given a random assortment of hardware with various disk drives and solid state devices, create the optimal Ceph cluster. Considering that other requirements and workloads can change the configuration considerably even with the same hardware, this task is intimidating. Collecting and entering the data manually is prohibitive and error prone, and the individual devices can number in the hundreds. With pragmatic deadlines, manual management of anything beyond the simplest cluster is not possible.

DeepSea does not promise an optimal configuration out of the box either. The goal is a working system and then incremental or significant reconfigurations until your ideal Ceph cluster is found.

For all this to happen, much must be automated. DeepSea takes care of the tedious collection of device names for every disk, allows a minion to take on one role or several, and runs a dozen prerequisite checks to prevent the misery of landing in dead-end configurations.

The other aspect is that one size does not fit all. As much as DeepSea tries to predict popular configurations, many sites will need to customize something. This philosophy permeates the entire project. Every single Salt state and orchestration can be altered, customized or disabled. No administrator should have to fight the framework. If some part does too much or not enough, DeepSea supports custom modifications that survive upgrades. If some part does the wrong thing entirely, disable it. That is completely acceptable.

Ceph Lifecycle

An explanation of the fresh installation and day to day management of Ceph.

Stages

The purpose of DeepSea is to save the administrator time and to allow complex operations on a Ceph cluster to be performed confidently. This idea has driven a few design choices. Before presenting those choices, some observations are necessary.

All software has configuration. Sometimes the default is sufficient. This is not the case with Ceph: Ceph is flexible almost to a fault. Reducing this complexity would force administrators into preconceived configurations. Several of the existing Ceph installation solutions create a demonstration cluster of three nodes, but the most interesting features of Ceph require more.

The steps necessary to provision the servers, collect the configuration, configure and deploy Ceph are mostly the same. However, this does not address managing the separate functions. For day to day operations, the ability to trivially add hardware to a given function and remove it gracefully is a requirement.

With these observations in mind, DeepSea addresses them with the following strategy:

Collect each set of tasks into a stage. Each stage is a Salt orchestration that is idempotent. Each of the individual tasks can be executed independently if needed. DeepSea currently has six stages, described below.

These stages can always be run sequentially. With some familiarity, subsets can be run for certain operations. For example, adding a storage node requires running Stages 0-3, migrating a role from one minion to another is Stages 2-5 and removing a storage node requires Stage 2 followed by Stage 5.
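As a sketch, adding a storage node is just the familiar stage commands run in order (the same commands used in the installation sections below):

```
# Adding a storage node: Stages 0 through 3, after updating policy.cfg
salt-run state.orch ceph.stage.0
salt-run state.orch ceph.stage.1
salt-run state.orch ceph.stage.2
salt-run state.orch ceph.stage.3
```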

Mandatory Configuration

Within the cycle of stages above, two configuration files are required: deepsea_minions.sls and policy.cfg. The former instructs DeepSea to use either all Salt minions or only a subset. The latter handles the assignment of cluster, roles and profiles to those minions.

Organization

In an effort to separate namespaces, DeepSea uses /srv/pillar/ceph and /srv/salt/ceph. The discovery stage stores the collected configuration data in subdirectories under /srv/pillar/ceph/proposals. The configure stage aggregates this data according to the wishes of the admin and stores the result under /srv/pillar/ceph/stack. The data is now available for Salt commands and processes.

The Salt commands use the files stored in the various subdirectories in /srv/salt/ceph. Although all the files have an sls extension, the formats differ. To prevent confusion, all sls files in a subdirectory are of one kind. For example, /srv/salt/ceph/stage contains orchestration files that are executed by the salt-run state.orchestrate command. Another example is /srv/salt/ceph/admin. These files are executed by the salt target state.apply command. Most subdirectories follow the latter example.

Ceph Installation

Installing a Ceph cluster and modifying one are no different in DeepSea: both use the same stages.

Prerequisites

The prerequisites are

  • a working Salt cluster

  • DeepSea installed

  • access to Ceph Jewel repositories or later

  • blank drives on storage nodes

Manual Installation

Edit /srv/pillar/ceph/deepsea_minions.sls. Set to use all minions:

deepsea_minions: '*'
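To manage only a subset of minions instead, a Salt compound match can be used. The grain-based form below is a common pattern but an assumption about your environment; verify it against your DeepSea version:

```yaml
# /srv/pillar/ceph/deepsea_minions.sls
# Only minions carrying a 'deepsea' grain are managed by DeepSea
deepsea_minions: 'G@deepsea:*'
```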

Run each of these commands and investigate any errors. Common issues include repositories that are not correctly configured on all minions.

salt '*' test.ping
salt-run state.orch ceph.stage.0
salt-run state.orch ceph.stage.1

The Stage 0 command will take the longest. DeepSea updates the Salt master and then the remaining minions in parallel. Once Stage 1 has completed, create a policy.cfg. The simplest example is in the man page. Otherwise, try the example in /usr/share/doc/packages/deepsea/examples/policy.cfg-rolebased. Edit or copy the file to /srv/pillar/ceph/proposals/policy.cfg. In either case, change the minion names to match your existing minions.
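For orientation, a role-based policy.cfg generally follows the pattern below. The minion globs (admin*, mon*, data*) are placeholders for your own hostnames, and the exact lines may differ by DeepSea version; treat this as a sketch, not the shipped example:

```
# Hypothetical role-based policy.cfg
cluster-ceph/cluster/*.sls
role-master/cluster/admin*.sls
role-admin/cluster/*.sls
role-mon/cluster/mon*.sls
role-mgr/cluster/mon*.sls
role-storage/cluster/data*.sls
profile-default/cluster/*.sls
profile-default/stack/default/ceph/minions/*.yml
config/stack/default/global.yml
config/stack/default/ceph/cluster.yml
```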

salt-run state.orch ceph.stage.2
salt '*' pillar.items

Note the role assignments, and that storage nodes have a data structure for the OSDs. If a role assignment is missing or a storage node lacks its OSD structure, try either of

salt-run push.proposal
salt-run -l debug push.proposal

Warnings are produced if the policy.cfg has lines that match nothing. The debug output is verbose, but it shows all source and destination files for the Salt configuration.

salt-run state.orch ceph.stage.3
ceph -s

Stage 3 will take some time initially since Ceph is installed. The time needed to set up a storage node is relative to its size. All storage nodes are set up in parallel, but the individual disks are created sequentially: a thirty-drive server takes much longer than two virtual disks in a VM.

salt-run state.orch ceph.stage.4
salt-run state.orch ceph.stage.5

Stage 4 will run every service, but if no roles are assigned then nothing is done. All services have their own orchestration that can be run directly. These are

salt-run state.orch ceph.stage.iscsi
salt-run state.orch ceph.stage.cephfs
salt-run state.orch ceph.stage.radosgw
salt-run state.orch ceph.stage.ganesha
salt-run state.orch ceph.stage.openattic

Stage 5 will normally take less than a minute. The notable exception is when a storage node is decommissioned: the OSDs gracefully empty before completing their removal. Depending on cluster activity, available network bandwidth and the number of PGs to migrate, this operation can take considerably longer.

Custom Profiles

The most common change to any configuration is creating hardware profiles. Stage 1 will create /srv/pillar/ceph/proposals/profile-default. The default configuration will attempt a 1:5 ratio (i.e. one SSD/NVMe to five HDD/SSD) for multiple devices. Otherwise, all devices will be treated as independent OSDs.

To create custom hardware profiles, see salt-run proposal.help. Edit the policy.cfg by removing or commenting out the profile-default lines and adding your custom name.

Alternate Installations

One of the downsides of the current orchestrations is the unnerving silence during the process. For first-time users, there is no visible difference between Salt working and Salt hanging. Either of these sets of commands gives feedback by watching the Salt event bus while the orchestration is running. Choose your preference:

deepsea salt-run state.orch ceph.stage.0
deepsea salt-run state.orch ceph.stage.1
deepsea salt-run state.orch ceph.stage.2
deepsea salt-run state.orch ceph.stage.3
deepsea salt-run state.orch ceph.stage.4
deepsea salt-run state.orch ceph.stage.5
deepsea stage run ceph.stage.0
deepsea stage run ceph.stage.1
deepsea stage run ceph.stage.2
deepsea stage run ceph.stage.3
deepsea stage run ceph.stage.4
deepsea stage run ceph.stage.5

Automated Installation

The installation can be automated by using the Salt reactor. For virtual environments or consistent hardware environments, this configuration will allow the creation of a Ceph cluster with the specified behavior.

The prerequisites are the same as the manual installation. The policy.cfg must be created beforehand and placed in /srv/pillar/ceph/proposals. Any custom configurations may also be placed in /srv/pillar/ceph/stack in their appropriate files before starting the stages.

The default reactor configuration will only run Stages 0 and 1. This allows testing of the reactor without waiting for subsequent stages to complete.

When the first salt-minion starts, Stage 0 will begin. A lock prevents multiple instances. When all minions complete Stage 0, Stage 1 will begin.

When satisfied with the operation, change the last line in the reactor.conf from

- /srv/salt/ceph/reactor/discovery.sls

to

- /srv/salt/ceph/reactor/all_stages.sls

Caution

Experimentation with the reactor is known to cause frustration. Salt cannot perform dependency checks based on reactor events. Putting your Salt master into a death spiral is a real risk.

Purging

When experimenting and learning about Ceph, starting over is sometimes the best route. To reset DeepSea to the end of Stage 1 but with your policy.cfg left intact, run the following commands:

salt-run disengage.safety
salt-run state.orch ceph.purge

The safety serves to prevent the accidental destruction of a cluster and will engage automatically after one minute.

Reinstallation

When a role is removed from a minion, the objective is to undo all changes related to that role. For most roles, this is simple. An exception relates to package dependencies. When a package is uninstalled, the dependencies are not.

With regard to storage nodes, a removed OSD will appear as a blank drive. The related tasks overwrite the beginning of the filesystems and remove backup partitions, in addition to wiping the partition tables.

Disk drives previously configured by other methods, such as ceph-deploy, may still contain partitions. DeepSea will not automatically destroy these. Currently, the administrator must reclaim these drives.

Replacing an OSD

When replacing the physical drive of an OSD, the first step is to remove the OSD from the minion. Run

# salt-run replace.osd ID

The ID will remain in the Ceph CRUSH map, marked as destroyed.

After replacing the physical drive, modify the configuration of the minion. The two methods are manual and automated.

Manual configuration

Find the renamed YAML file for the minion. For example, the file for the minion data1.ceph is /srv/pillar/ceph/stack/default/ceph/minions/data1.ceph.yml-replace. Copy the file back to its original name and replace the old device with the new device name. Consider using salt 'minion' osd.report to identify the device that has been removed.

For instance, if the data1.ceph.yml file contains

ceph:
  storage:
    osds:
      /dev/disk/by-id/cciss-360022480cb238bb6367d4f1ad304353d:
        format: bluestore
      /dev/disk/by-id/cciss-3600508b1001c93595b70bd0fb700ad38:
        format: bluestore
      /dev/disk/by-id/cciss-360022480542201aae7398ff096e6d06f:
        format: bluestore

replace the old device entry with the new

ceph:
  storage:
    osds:
      /dev/disk/by-id/cciss-360022480cb238bb6367d4f1ad304353d:
        format: bluestore
      /dev/disk/by-id/cciss-3600508b1001c7c24c537bdec8f3a698f:
        format: bluestore
      /dev/disk/by-id/cciss-360022480542201aae7398ff096e6d06f:
        format: bluestore

Automated configuration

While the default profile for Stage 1 may work for the simplest setups, this stage can be customized. See default-nvme.sls. Create your custom Stage 1 and set stage_discovery: your_custom_name in the global.yml file.

Run Stage 1 to generate the new configuration file.

# salt-run state.orch ceph.stage.1

After the manual or automated configuration is complete, run Stage 2 to update the Salt configuration

# salt-run state.orch ceph.stage.2

To deploy the just-replaced OSD, run

# salt-run state.orch ceph.stage.3

Customizations

Default settings and individual steps can be overridden.

Inspecting the Configuration

To view the current Salt configuration, run

# salt '*' pillar.items

The output for a single minion will be similar to the following:

    ----------
    available_roles:
        - storage
        - admin
        - mon
        - mds
        - mgr
        - igw
        - openattic
        - rgw
        - ganesha
        - client-cephfs
        - client-radosgw
        - client-iscsi
        - client-nfs
        - master
    benchmark:
        ----------
        default-collection:
            simple.yml
        extra_mount_opts:
            nocrc
        job-file-directory:
            /run/ceph_bench_jobs
        log-file-directory:
            /var/log/ceph_bench_logs
        work-directory:
            /run/ceph_bench
    cluster:
        ceph
    cluster_network:
        172.16.12.0/24
    deepsea_minions:
        *
    fsid:
        5539d6b3-a30f-3631-af9a-080ade7ebaee
    master_minion:
        admin.ceph
    public_network:
        172.16.11.0/24
    roles:
        - master
        - openattic
    time_init:
        ntp
    time_server:
        admin.ceph

These settings are the defaults created by Stage 1, Stage 2 and the policy.cfg. The README in /srv/pillar/ceph describes the directory structure.

Understanding Pathnames and Arguments

The astute may have already noticed that the orchestration arguments match the directory pathnames. For instance, salt-run state.orch ceph.stage.0 is executing the contents of /srv/salt/ceph/stage/0. The Salt base /srv/salt/ is prepended to the argument after converting the dots to slashes. This is true for regular Salt commands. For instance, salt '*' state.apply ceph.sync is found in /srv/salt/ceph/sync.
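The naming convention can be sketched in a few lines of Python. This is only an illustration of the dot-to-slash mapping, not Salt's actual resolver, which also handles init.sls files and top files:

```python
# Illustrative sketch: map a state.orch/state.apply argument to its
# location under the Salt base /srv/salt/
def state_to_path(arg, base="/srv/salt"):
    # Dots become slashes and the Salt base is prepended
    return base + "/" + arg.replace(".", "/")

print(state_to_path("ceph.stage.0"))  # /srv/salt/ceph/stage/0
print(state_to_path("ceph.sync"))     # /srv/salt/ceph/sync
```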

Now this is only true for the salt-run state.orch and salt state.apply commands. Any other salt-run or salt commands come from DeepSea or Salt. For the full list, see here

Overriding Default Settings

If any data is incorrect for your environment, override it. For instance, if the guessed cluster network is 10.0.1.0/24, but the preferred cluster network is 172.16.22.0/24, do the following:

  • Edit the file /srv/pillar/ceph/stack/ceph/cluster.yml

  • Add cluster_network: 172.16.22.0/24

  • Save the file
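The resulting override file is a plain YAML fragment using the value from the steps above:

```yaml
# /srv/pillar/ceph/stack/ceph/cluster.yml
cluster_network: 172.16.22.0/24
```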

To verify the change, run

# salt '*' saltutil.pillar_refresh
# salt '*' pillar.items

This can be repeated with any configuration data. For examples, examine any of the files under /srv/pillar/ceph/stack/default.

Overriding Default Steps

Many of the steps and stages have alternate defaults. All have a default.sls. An alternate default state file has the prefix default-. For example, the /srv/salt/ceph/stage/1/default-notimeout.sls will wait forever for the minions to be ready before continuing. This is necessary in some virtual or cloud environments.

To select this alternate default,

  • Edit the file /srv/pillar/ceph/stack/ceph/global.yml

  • Add stage_discovery: default-notimeout

  • Save the file

To verify the change, run

# salt '*' saltutil.pillar_refresh
# salt '*' pillar.items

Note that the name of the variable to override is always defined in the init.sls. In this example, /srv/salt/ceph/stage/1/init.sls contains

include:
  - .{{ salt['pillar.get']('stage_discovery', 'default') }}
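The pillar.get call behaves like a dictionary lookup with a fallback. The small Python stand-in below is not Salt's API, only the lookup semantics; it shows why an unset pillar falls back to default.sls:

```python
# Stand-in for salt['pillar.get'](key, fallback): return the pillar value
# when set, otherwise the fallback -- which is how default.sls is chosen
def pillar_get(pillar, key, fallback):
    return pillar.get(key, fallback)

# No override set: the include resolves to .default
print(pillar_get({}, "stage_discovery", "default"))
# Override set in global.yml: the include resolves to .default-notimeout
print(pillar_get({"stage_discovery": "default-notimeout"}, "stage_discovery", "default"))
```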