<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE chapter
[
<!ENTITY % entities SYSTEM "entity-decl.ent">
%entities;
]>
<chapter version="5.0" xml:id="cha.admin"
xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink">
<info>
<title>Cluster Management</title>
<dm:docmanager xmlns:dm="urn:x-suse:ns:docmanager">
<dm:bugtracker/>
<dm:translation>yes</dm:translation>
</dm:docmanager>
</info>
<!-- FIXME, mnapp 04/09/18 fill in these sections
<sect1 xml:id="sec.admin.concepts">
<title>Concepts</title>
</sect1>
-->
<sect1 xml:id="sec.admin.kubernetes.install-kubectl">
<title>Interacting With &kube;</title>
<para>
&kube; requires the use of <literal>kubectl</literal> for many tasks.
You can perform most of these actions while logged in to an SSH session on
the master node of your &productname; cluster. <literal>kubectl</literal>
is a pre-installed component of &productname;.
</para>
<para>
The proxy functionality requires <literal>kubectl</literal> to be installed
on your local machine to act as a proxy between the local workstation and the
remote cluster.
</para>
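  <para>
    For example, once <literal>kubectl</literal> is configured on your local
    workstation, you can start a proxy to the remote cluster. The address and
    port shown below are the <literal>kubectl</literal> defaults and serve
    only as an illustration:
  </para>
<screen>&prompt.user;<command>kubectl proxy</command>
Starting to serve on 127.0.0.1:8001</screen>
  <para>
    While the proxy is running, the &kube; API is reachable on the local
    machine under <literal>http://127.0.0.1:8001</literal>.
  </para>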
<important>
    <title>&sle; Desktop 12 SP3 / 15.0 - Installation from PackageHub</title>
<para>
The use of PackageHub is <link xlink:href="https://packagehub.suse.com/support/">exempt from commercial support</link>.
</para>
<para>
If you are using &sle; 12 SP3 or 15.0, you must
<link xlink:href="https://www.suse.com/documentation/sled-15/book_quickstarts/data/sec_modules_installing.html">enable the PackageHub Extension</link>.
</para>
<para>
The instructions are identical for both versions.
</para>
</important>
<tip>
<title>Installing <command>kubectl</command> on Non-SUSE OS or Old Release</title>
<para>
If you are using an operating system other than the current &sle; 12 SP3/15.0
or &opensuse; Tumbleweed/Leap please consult the
<link xlink:href="https://kubernetes.io/docs/tasks/tools/install-kubectl/">
installation instructions</link> from the &kube; project.
</para>
</tip>
<tip>
<title>The KUBECONFIG Variable</title>
<para>
&kubectl; uses an environment variable named <varname>KUBECONFIG</varname>
to locate your &kubeconfig; file. If this variable is not specified, it
defaults to <filename>$HOME/.kube/config</filename>. To use a different
location, run
</para>
<screen>&prompt.user;<command>export KUBECONFIG=<replaceable>/PATH/TO/KUBE/CONFIG/FILE</replaceable></command></screen>
</tip>
<procedure>
<title>Install the <literal>kubectl</literal> package</title>
<step>
<para>
Install the <filename>kubectl</filename> package:
</para>
<screen>&prompt.sudo;<command>zypper in kubectl</command></screen>
</step>
<step>
<para>
      To use <literal>kubectl</literal> to connect to the cluster from a local machine, you must first perform the steps in <xref linkend="sec.admin.security.auth.kubeconfig" /> against the &kube; master node. Download the <filename>.kubeconfig</filename> file from &dashboard; and place it in <filename>~/.kube/config</filename>.
</para>
<informalfigure>
<mediaobject>
<imageobject role="fo">
<imagedata fileref="velum_status.png" width="100%"/>
</imageobject>
<imageobject role="html">
<imagedata fileref="velum_status.png" width="100%"/>
</imageobject>
</mediaobject>
</informalfigure>
</step>
<step>
<para>
Verify that <literal>kubectl</literal> was installed and is configured correctly:
</para>
<screen>&prompt.user;<command>kubectl get nodes</command>
NAME              STATUS    ROLES     AGE       VERSION
caasp3-master     Ready     master    1d        v1.9.8
caasp3-worker-1   Ready     &lt;none&gt;    1d        v1.9.8
caasp3-worker-2   Ready     &lt;none&gt;    1d        v1.9.8
caasp3-worker-3   Ready     &lt;none&gt;    1d        v1.9.8
caasp3-worker-4   Ready     &lt;none&gt;    1d        v1.9.8</screen>
<para>
You should see the list of nodes known to &productname;.
</para>
</step>
</procedure>
</sect1>
<sect1 xml:id="sec.admin.salt">
<title>Interacting with Salt</title>
<para>
You can run commands across all nodes in the cluster by running them via
<literal>salt</literal>.
</para>
<para>
Log in to the admin node and run:
</para>
<screen>&prompt.user;<command>docker exec -it $(docker ps -q -f name=salt-master) \
salt -P 'roles:(admin|kube-master|kube-minion)' \
cmd.run "<replaceable>df -h</replaceable>"</command>
</screen>
<para>
This command tells <literal>docker</literal> to find the
<literal>salt-master</literal> container and execute the command on all nodes
that match the roles <literal>admin</literal>, <literal>kube-master</literal>,
and <literal>kube-minion</literal> (which is all nodes).
</para>
<para>
Replace the example <command>df -h</command> with a command of your choice.
The output will be produced in your current terminal session.
</para>
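  <para>
    You can also restrict a command to a subset of nodes by adjusting the
    target expression. The following sketch runs <command>uptime</command> on
    the worker nodes only; <command>uptime</command> is merely a placeholder
    for a command of your choice:
  </para>
<screen>&prompt.user;<command>docker exec -it $(docker ps -q -f name=salt-master) \
salt -G 'roles:kube-minion' cmd.run "<replaceable>uptime</replaceable>"</command>
</screen>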
<sect2 xml:id="sec.admin.salt.worker_threads">
<title>Adjusting The Number Of Salt Worker Threads</title>
<para>
      It will sometimes be necessary to resize the &kube; cluster to adjust for
      workloads or other factors. Salt will run into problems if the number of
      nodes grows too large without the number of available Salt worker
      threads being adjusted accordingly.
</para>
<para>
For the correct value, refer to
<xref linkend="sec.deploy.requirements.system.cluster.salt_cluster_size"/>.
</para>
<procedure>
<title>Adjust The Salt Worker Count</title>
<step>
<para>
Log in to your admin node via SSH.
</para>
</step>
<step>
<para>
Run the following command to adjust the configured number of workers
(here: <literal>20</literal>).
</para>
        <screen>&prompt.root.admin;<command>echo "worker_threads: <replaceable>20</replaceable>" > /etc/salt/salt-master-custom.conf</command>
</screen>
</step>
<step>
<para>
Find the ID of the &smaster; container.
</para>
        <screen>&prompt.root.admin;<command>saltid=$(docker ps -q -f name=salt-master)</command>
</screen>
</step>
<step>
<para>
And restart the &smaster;.
</para>
<screen>&prompt.root.admin;<command>docker kill $saltid</command>
</screen>
</step>
</procedure>
<para>
Now, Salt will restart and adjust the number of workers in the cluster.
</para>
</sect2>
</sect1>
<sect1 xml:id="sec.admin.nodes">
<title>Node Management</title>
<para>
After you complete the deployment and you bootstrap the cluster, you may
need to perform additional changes to the cluster. By using &dashboard; you
can add additional nodes to the cluster. You can also delete some nodes, but
in that case make sure that you do not break the cluster.
</para>
<sect2 xml:id="sec.admin.nodes.add">
<title>Adding Nodes</title>
<para>
      You may need to add additional &worker_node;s to your cluster. The
      following steps guide you through that procedure:
</para>
<procedure>
<title>Adding Nodes to Existing Cluster</title>
<step>
<para>
Prepare the node as described in
<xref linkend="sec.deploy.nodes.worker_install"/>
</para>
</step>
<step>
<para>
Open &dashboard; in your browser and login.
</para>
</step>
<step>
<para>
You should see the newly added node as a node to be accepted in
<guimenu>Pending Nodes</guimenu>. Click on <guimenu>Accept Node</guimenu>.
</para>
<informalfigure>
<mediaobject>
<imageobject role="fo">
<imagedata fileref="velum_pending_nodes.png" format="PNG" width="100%"/>
</imageobject>
<imageobject role="html">
<imagedata fileref="velum_pending_nodes.png" width="100%" format="png"
/>
</imageobject>
</mediaobject>
</informalfigure>
</step>
<step>
<para>
          In the <guimenu>Summary</guimenu> you can see a <guimenu>New</guimenu>
          button that appears next to <guimenu>New nodes</guimenu>. Click the
          <guimenu>New</guimenu> button.
</para>
<informalfigure>
<mediaobject>
<imageobject role="fo">
<imagedata fileref="velum_unassigned_nodes.png" width="100%"
format="png"/>
</imageobject>
<imageobject role="html">
<imagedata fileref="velum_unassigned_nodes.png" width="100%"
format="png"/>
</imageobject>
</mediaobject>
</informalfigure>
</step>
<step>
<para>
Select the node to be added and click <guimenu>Add nodes</guimenu>.
</para>
</step>
<step>
<para>
The node has been added to your cluster.
</para>
</step>
</procedure>
<sect3 xml:id="sec.admin.nodes.create_autoyast_profile">
<title>The <command>create_autoyast_profile</command> Command</title>
<para>
The <command>create_autoyast_profile</command> command creates an autoyast
profile for fully automatic installation of &productname;. You can use the
following options when invoking the command:
</para>
<variablelist>
<varlistentry>
<term><literal>-o|--output</literal>
</term>
<listitem>
<para>
Specify to which file the command should save the created profile.
</para>
<screen>&prompt.root;<command>create_autoyast_profile -o <replaceable>FILENAME</replaceable></command></screen>
</listitem>
</varlistentry>
<varlistentry>
<term><literal>--salt-master</literal>
</term>
<listitem>
<para>
Specify the host name of the &smaster;.
</para>
<screen>&prompt.root;<command>create_autoyast_profile --salt-master <replaceable>SALTMASTER</replaceable></command></screen>
</listitem>
</varlistentry>
<varlistentry>
<term><literal>--smt-url</literal>
</term>
<listitem>
<para>
Specify the URL of the SMT server.
</para>
            <screen>&prompt.root;<command>create_autoyast_profile --smt-url <replaceable>SMT_URL</replaceable></command></screen>
</listitem>
</varlistentry>
<varlistentry>
<term><literal>--regcode</literal>
</term>
<listitem>
<para>
Specify the registration code for &productname;.
</para>
            <screen>&prompt.root;<command>create_autoyast_profile --regcode <replaceable>REGISTRATION_CODE</replaceable></command></screen>
</listitem>
</varlistentry>
<varlistentry>
<term><literal>--reg-email</literal>
</term>
<listitem>
<para>
Specify an e-mail address for registration.
</para>
            <screen>&prompt.root;<command>create_autoyast_profile --reg-email <replaceable>E-MAIL_ADDRESS</replaceable></command></screen>
</listitem>
</varlistentry>
</variablelist>
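      <para>
        The options described above can be combined in a single invocation.
        The values below are placeholders, not working defaults:
      </para>
<screen>&prompt.root;<command>create_autoyast_profile -o <replaceable>FILENAME</replaceable> \
--salt-master <replaceable>SALTMASTER</replaceable> \
--smt-url <replaceable>SMT_URL</replaceable> \
--regcode <replaceable>REGISTRATION_CODE</replaceable> \
--reg-email <replaceable>E-MAIL_ADDRESS</replaceable></command></screen>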
</sect3>
</sect2>
<sect2 xml:id="sec.admin.nodes.remove">
<title>Removing Nodes</title>
<warning>
<para>
        If you attempt to remove more nodes than are required for the minimum
        cluster size (3 nodes: 1 master, 2 workers), &dashboard; will display a
        warning. Your cluster will be dysfunctional until you restore the
        minimum number of nodes.
</para>
</warning>
<note>
<para>
        As each node in the cluster also runs an instance of
        <literal>etcd</literal>, &productname; has to ensure that removing
        several nodes does not break the <literal>etcd</literal> cluster. If,
        for example, you have three nodes in the <literal>etcd</literal>
        cluster and you delete two of them, &productname; deletes one node,
        recovers the cluster, and only if the recovery is successful allows the
        next node to be removed. If a node runs just an
        <literal>etcd-proxy</literal>, nothing special has to be done, because
        deleting any number of <literal>etcd-proxy</literal> instances cannot
        break the <literal>etcd</literal>
        cluster.
</para>
</note>
<note>
<para>
If you have only one master node configured, &dashboard; will not allow you
to remove it. You must first add a second master node as a replacement.
</para>
</note>
<procedure>
<step>
<para>
          Log in to &dashboard; on your &productname; Admin node.
Then, click <guimenu>Remove</guimenu> next to the node you wish to remove.
A dialog will ask you to confirm the removal.
</para>
<informalfigure>
<mediaobject>
<imageobject>
<imagedata fileref="velum_status.png" format="PNG" width="100%"/>
</imageobject>
</mediaobject>
</informalfigure>
</step>
<step>
<para>
The cluster will then attempt to remove the node in a controlled manner.
          Progress is indicated by a spinning icon and the words <literal>Pending removal</literal>
          in the location where the <guimenu>Remove</guimenu> button was displayed before.
</para>
<informalfigure>
<mediaobject>
<imageobject>
<imagedata fileref="velum_pending_removal.png" format="PNG" width="100%"/>
</imageobject>
</mediaobject>
</informalfigure>
<para>
This should conclude the regular removal process. If the node is successfully
removed, it will disappear from the list after a few moments.
</para>
</step>
<step>
<para>
          In some cases nodes cannot be removed in a controlled manner and must
          be forced out of the cluster. A typical scenario is a machine
          instance that was removed externally or has lost connectivity. In
          such cases, the removal will fail. You then get the option to
          <guimenu>Force remove</guimenu>. A dialog will ask you to confirm the
          removal.
</para>
<informalfigure>
<mediaobject>
<imageobject>
<imagedata fileref="velum_failed_removal.png" format="PNG" width="100%"/>
</imageobject>
</mediaobject>
</informalfigure>
<para>
Additionally, a large warning dialog will ask you to confirm the forced
removal. Click <guimenu>Proceed with forcible removal</guimenu> if you
are sure you wish to force the node out of the cluster.
</para>
<informalfigure>
<mediaobject>
<imageobject>
<imagedata fileref="velum_force_removal.png" format="PNG" width="100%"/>
</imageobject>
</mediaobject>
</informalfigure>
</step>
</procedure>
</sect2>
<sect2 xml:id="sec.admin.nodes.remove.unassigned">
<!-- FIXME mnapp 2018-07-03, replace terminology and screenshots once
bsc#1100113 has been resolved -->
    <title>Removing Unassigned Nodes</title>
<para>You might run into the situation where you have (accidentally) added
new nodes to a cluster but did not wish to bootstrap them. They are now
registered against the cluster and show up in "Unassigned nodes".
You might have already configured the machine to register with another cluster
and want to clean out this entry from the "Unassigned Nodes" view.
You must perform the following steps:
</para>
<procedure>
<step>
<para>
          Find the "Unassigned nodes" line in the overview and click <guimenu>(new)</guimenu>
          next to the count. You will be shown the "Unassigned Nodes" view
          where all the unassigned nodes are listed. Make sure that you first assign
          roles to all nodes that you wish to keep and proceed with bootstrapping them.
          Once the list only shows the nodes you are sure you want to remove, copy
          the ID of the node you wish to drop.
</para>
<informalfigure>
<mediaobject>
<imageobject>
<imagedata fileref="velum_unassigned_nodes.png" format="PNG" width="100%"/>
</imageobject>
</mediaobject>
</informalfigure>
</step>
<step>
<para>
          Log in to the Admin node of your cluster via SSH.
</para>
</step>
<step>
<para>
Run the following command and replace <replaceable>$ID_FROM_UNASSIGNED_QUEUE</replaceable>
with the node ID that you copied from the "Unassigned nodes" view in &dashboard;.
</para>
<warning>
<para>
            Make absolutely sure that the node ID you have copied belongs to the
            node you wish to drop. This command is <emphasis>irreversible</emphasis>
            and will remove the specified node from the cluster without confirmation.
</para>
</warning>
<screen>&prompt.root;<command>docker exec -it $(docker ps | grep "velum-dashboard" | awk '{print $1}') \
entrypoint.sh bundle exec rails runner 'puts Minion.find_by(minion_id: "<replaceable>$ID_FROM_UNASSIGNED_QUEUE</replaceable>").destroy'</command>
</screen>
</step>
</procedure>
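    <para>
      If you are unsure about the ID, you can list the IDs of all registered
      nodes before destroying one. This sketch assumes the
      <literal>Minion</literal> model used in the command above and standard
      ActiveRecord behavior:
    </para>
<screen>&prompt.root;<command>docker exec -it $(docker ps | grep "velum-dashboard" | awk '{print $1}') \
entrypoint.sh bundle exec rails runner 'puts Minion.pluck(:minion_id)'</command>
</screen>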
</sect2>
</sect1>
<sect1 xml:id="sec.admin.nodes.graceful_shutdown">
<title>Graceful Shutdown and Startup</title>
<sect2 xml:id="sec.admin.nodes.graceful_shutdown.overview">
<title>Overview</title>
<para>
      &kube;, being a self-healing solution, tries to keep all pods and
      services available. In general, this is one of its core features and
      desired functions. But it is important to take this into account if
      you are doing a complete shutdown of the infrastructure.
</para>
<para>
There are two ways of shutting down the whole cluster: Shut down
and start all nodes at once or restart them sequentially in
segments. In both cases, &productname; expects that IP addresses do
not change after the restart, even when using dynamic IP addresses.
</para>
<para>
When restarting segments of nodes, it is possible to avoid
downtime.
</para>
<note>
<title>Deviating from Shutdown and Startup Procedures</title>
<para>
The procedures described in this section are recommended to
reduce logged errors. However, it is possible to not follow this
order as long as all nodes are stopped in a graceful way.
</para>
</note>
</sect2>
<sect2 xml:id="sec.admin.nodes.graceful_shutdown.nodes">
<title>Node Types</title>
<para>
For shutting down and starting nodes, three different types of nodes
are relevant:
</para>
<itemizedlist>
<listitem>
<para>
The &admin_node; contains state and needs to be shut down in a graceful
way to ensure that all state has been synced to disk in a clean way.
</para>
</listitem>
<listitem>
<para>
Nodes with <literal>etcd</literal> contain state and also need to be shut
down in a graceful way. They will usually be a subset of the master nodes.
But it can happen that some workers run <literal>etcd</literal> members.
</para>
</listitem>
<listitem>
<para>
The rest (masters and workers not running <literal>etcd</literal>
members): These nodes contain local state possibly created by
applications running on top of the cluster. They need to be
shut down in a graceful way too, when possible.
</para>
</listitem>
</itemizedlist>
</sect2>
<sect2 xml:id="sec.admin.nodes.graceful_shutdown.complete">
<title>Complete Shutdown</title>
<sect3 xml:id="sec.admin.nodes.graceful_shutdown.complete.shutdown">
<title>Shutting Down</title>
<para>
All commands are executed on the admin node.
</para>
<procedure>
<step>
<para>
Disable scheduling on the whole cluster. This will avoid
&kube; rescheduling jobs while you are shutting down nodes.
</para>
<screen>&prompt.root.admin;<command>kubectl get nodes -o name | xargs -I{} kubectl cordon {}</command></screen>
</step>
<step>
<para>
Gracefully shut down all worker nodes.
</para>
<screen>&prompt.root.admin;<command>docker exec -it $(docker ps -q -f name=salt-master) \
salt --async -G 'roles:kube-minion' cmd.run 'systemctl poweroff'</command></screen>
</step>
<step>
<para>
Gracefully shut down all master nodes.
</para>
<screen>&prompt.root.admin;<command>docker exec -it $(docker ps -q -f name=salt-master) \
salt --async -G 'roles:kube-master' cmd.run 'systemctl poweroff'</command></screen>
</step>
<step>
<para>
Shut down the &admin_node;:
</para>
<screen>&prompt.root.admin;<command>systemctl poweroff</command></screen>
</step>
</procedure>
</sect3>
<sect3 xml:id="sec.admin.nodes.graceful_shutdown.complete.startup">
<title>Starting Up</title>
<note>
<title><literal>kubectl</literal> Needs Master Nodes To Function</title>
<para>
<command>kubectl</command> requires use of the &kube; API hosted on the
master nodes. Therefore, until at least some of the master nodes have
started successfully, you will see error messages of the type
<literal>HTTP 503</literal>.
</para>
<screen>Error from server (InternalError): an error on the server
("&lt;html&gt;&lt;body&gt;&lt;h1&gt;503 Service Unavailable&lt;/h1&gt;\nNo server is available
to handle this request.\n&lt;/body&gt;&lt;/html&gt;") has prevented the request
from succeeding (get nodes)</screen>
</note>
<procedure>
<step>
<para>
Start the &admin_node; up. All commands are executed on the
&admin_node;.
</para>
</step>
<step>
<para>
          Once the admin node is up, start the master nodes. Keep checking
the status of the master nodes. Continue as soon as all master nodes are
<literal>Ready</literal>.
</para>
<screen>&prompt.root.admin;<command>kubectl get nodes</command>
NAME       STATUS                        ROLES     AGE       VERSION
master-0   Ready,SchedulingDisabled      master    21h       v1.9.8
master-1   Ready,SchedulingDisabled      master    21h       v1.9.8
master-2   Ready,SchedulingDisabled      master    21h       v1.9.8
worker-0   NotReady,SchedulingDisabled   &lt;none&gt;    21h       v1.9.8
worker-1   NotReady,SchedulingDisabled   &lt;none&gt;    21h       v1.9.8
worker-2   NotReady,SchedulingDisabled   &lt;none&gt;    21h       v1.9.8
worker-3   NotReady,SchedulingDisabled   &lt;none&gt;    21h       v1.9.8
worker-4   NotReady,SchedulingDisabled   &lt;none&gt;    21h       v1.9.8</screen>
</step>
<step>
<para>
Continue by starting all the worker nodes. Keep checking the
status of the nodes. Continue when all nodes are <literal>Ready</literal>.
</para>
<screen>&prompt.root.admin;<command>kubectl get nodes</command>
NAME       STATUS                     ROLES     AGE       VERSION
master-0   Ready,SchedulingDisabled   master    21h       v1.9.8
master-1   Ready,SchedulingDisabled   master    21h       v1.9.8
master-2   Ready,SchedulingDisabled   master    21h       v1.9.8
worker-0   Ready,SchedulingDisabled   &lt;none&gt;    21h       v1.9.8
worker-1   Ready,SchedulingDisabled   &lt;none&gt;    21h       v1.9.8
worker-2   Ready,SchedulingDisabled   &lt;none&gt;    21h       v1.9.8
worker-3   Ready,SchedulingDisabled   &lt;none&gt;    21h       v1.9.8
worker-4   Ready,SchedulingDisabled   &lt;none&gt;    21h       v1.9.8</screen>
</step>
<step>
<para>
Uncordon all nodes so they can receive new workloads:
</para>
<screen>&prompt.root.admin;<command>kubectl get nodes -o name | xargs -I{} kubectl uncordon {}</command></screen>
</step>
</procedure>
</sect3>
</sect2>
<sect2 xml:id="sec.admin.nodes.graceful_shutdown.segmented">
<title>Segmented Reboots</title>
<para>
A sequential reboot of cluster segments is a way to completely
avoid the downtime of services or at least reduce it as much as
possible. However, downtime of services occurs if:
</para>
<itemizedlist>
<listitem>
<para>
All pods of a service are forced on one node
</para>
</listitem>
<listitem>
<para>
A pod has only one replica
</para>
</listitem>
</itemizedlist>
<sect3 xml:id="sec.admin.nodes.graceful_shutdown.segmented.worker">
<title>Rebooting Worker Nodes</title>
<para>
The number of worker nodes to reboot at once depends on the number
of total worker nodes and their labels.
</para>
<para>
For example: If there are 5 worker nodes with 2 of them having the label
<literal>diskType: ssd</literal>, then the two nodes with SSDs must not be
shut down at the same time.
</para>
<para>
The size of segments for simultaneous reboots depends on the
topology of the cluster and the workload. We recommend to use
small segment sizes. This makes it less likely that all nodes
running replicas of the same pod are shut down at the same time.
</para>
<para>
        During this migration period, the worker nodes need to be able
        to reach the master nodes at all times. This includes master nodes
        that have already been rebooted as well as those that have not.
</para>
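      <para>
        Before rebooting a worker node, you can move its workloads away in a
        controlled fashion with <command>kubectl drain</command>.
        <replaceable>NODE</replaceable> is a placeholder for the node name as
        shown by <command>kubectl get nodes</command>:
      </para>
<screen>&prompt.root.admin;<command>kubectl drain <replaceable>NODE</replaceable> --ignore-daemonsets</command></screen>
      <para>
        After the node has rebooted and reports <literal>Ready</literal> again,
        re-enable scheduling on it:
      </para>
<screen>&prompt.root.admin;<command>kubectl uncordon <replaceable>NODE</replaceable></command></screen>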
</sect3>
<sect3>
<title>Rebooting Master Nodes</title>
<para>
Master nodes should not run user workloads. This means that the
decision to batch the reboots of master nodes depends on whether
you want to keep control of the cluster while the reboot is
taking place.
</para>
<para>
If all the master nodes disappear at the same time, the worker
nodes continue serving the services they are running. No further operation
will take place on the worker nodes, since they cannot contact an
<literal>apiserver</literal> to discover new workloads or perform any other
operations.
</para>
<para>
It is safe to choose batches as desired. Rebooting one by one is
the safest, two by two is generally safe too. For larger batches
than two, certain cluster services, for example
<literal>dex</literal>, could be completely shut down.
</para>
</sect3>
</sect2>
<sect2 xml:id="sec.admin.nodes.graceful_shutdown.etcd">
<title>Behavior of <literal>etcd</literal></title>
<para>
<literal>etcd</literal> is a distributed key-value store. Some
nodes on the cluster run <literal>etcd</literal> members that
sync with other peers in order to provide a fault-tolerant storage
that &kube; uses for persistence.
</para>
<para>
<literal>etcd</literal> is the central component where &kube; reads and
writes in order to have global knowledge about the cluster status
and desired state.
</para>
<para>
      It is important to note that <literal>etcd</literal> automatically
      recovers from temporary failures such as machine reboots.
</para>
<para>
      <literal>etcd</literal> knows how many peers form the
<literal>etcd</literal> cluster; based on this information the
<literal>etcd</literal> cluster can be in three different states:
healthy, degraded or unavailable.
</para>
<variablelist>
<varlistentry>
<term>Healthy</term>
<listitem>
<para>
All <literal>etcd</literal> members are working as expected.
</para>
</listitem>
</varlistentry>
<varlistentry>
<term>Degraded</term>
<listitem>
<para>
            Some <literal>etcd</literal> members are not working as
            expected, but a majority of them still are. This means the
            cluster is still working, because it has quorum.
</para>
</listitem>
</varlistentry>
<varlistentry>
<term>Unavailable</term>
<listitem>
<para>
There is no working majority of peers. The cluster is not
available and cannot be used because the quorum is lost.
</para>
</listitem>
</varlistentry>
</variablelist>
<para>
Whether <literal>etcd</literal> is available or not depends on how many
<literal>etcd</literal> members are available/not available at a given
moment. It is important to differentiate between transient and
permanent failures. Transient failures happen when a member is
temporarily not available, for example when a machine running one
<literal>etcd</literal> member is rebooting. Permanent failures
happen when a member was irrevocably lost, for example a machine
hard disk failure. The <literal>etcd</literal> cluster can tolerate
up to (N - 1) / 2 permanent failures, where N is the number of
<literal>etcd</literal> members; a subset of masters and possibly
workers. The number of etcd nodes must always maintain
<literal>Majority</literal> quorum.
</para>
<para>
      <literal>Majority</literal> means that the number of available etcd
      cluster members must always be greater than the number of unavailable
      members. If, for example, you have only <literal>1</literal> or
      <literal>2</literal> etcd members, the cluster has a fault tolerance of
      <literal>0</literal>, because not a single member can fail without the
      cluster losing <literal>Majority</literal>.
</para>
<para>
      If you have <literal>6</literal> nodes, a maximum of <literal>2</literal>
      nodes can become faulty while the cluster remains in a degraded but
      working state. If <literal>3</literal> or more nodes fail, there is no
      longer a majority of working nodes, and the cluster becomes unavailable.
</para>
<para>
For example: The fault tolerance of a cluster with <literal>7</literal>
nodes is <literal>3</literal>, because you need at least <literal>4</literal>
active nodes to maintain majority.
</para>
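    <para>
      The fault tolerance for a given member count can be computed directly as
      (N - 1) / 2 using integer division, as illustrated by this small shell
      loop (a standalone illustration, not part of &productname;):
    </para>
<screen>&prompt.user;<command>for n in 1 2 3 5 6 7; do echo "$n members: fault tolerance $(( (n - 1) / 2 ))"; done</command>
1 members: fault tolerance 0
2 members: fault tolerance 0
3 members: fault tolerance 1
5 members: fault tolerance 2
6 members: fault tolerance 2
7 members: fault tolerance 3</screen>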
<para>
When (N - 1) / 2 or fewer permanent failures happen in a given
<literal>etcd</literal> cluster, the cluster still has a quorum. It
is then possible to remove the faulty members and add new ones. The
new members will synchronize with the existing ones. This does not
require an explicit backup/restore procedure, as it is normal
<literal>etcd</literal> operation.
</para>
<para>
When more than (N - 1) / 2 permanent failures happen in a given
<literal>etcd</literal> cluster, the quorum is lost irrevocably.
That means that there is no way to recover from that situation,
because it is no longer possible to remove faulty members or add
new members. In this case, it is necessary to start a new
<literal>etcd</literal> cluster from a backup, and grow it.
</para>
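    <para>
      To inspect the current health of the <literal>etcd</literal> cluster, you
      can query a member directly. The exact invocation depends on the
      <literal>etcd</literal> version and the certificate paths in your
      deployment; the following is a sketch for the v3 API with placeholder
      certificate locations:
    </para>
<screen>&prompt.root;<command>ETCDCTL_API=3 etcdctl \
--cacert <replaceable>/PATH/TO/CA.crt</replaceable> \
--cert <replaceable>/PATH/TO/CLIENT.crt</replaceable> \
--key <replaceable>/PATH/TO/CLIENT.key</replaceable> \
endpoint health</command></screen>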
</sect2>
</sect1>
<sect1 xml:id="sec.admin.scale_cluster">
<title>Scaling the Cluster</title>
<para>
    The default maximum number of nodes in a cluster is 40. The Salt
    Master configuration needs to be adjusted to handle installation and
    updating of a larger cluster:
</para>
<table>
<title>Node Count and Salt Worker Threads</title>
<tgroup cols="2">
<thead>
<row>
<entry>
<para>
Nodes
</para>
</entry>
<entry>
<para>
Salt Worker Threads
</para>
</entry>
</row>
</thead>
<tbody>
<row>
<entry>
<para>
>40
</para>
</entry>
<entry>
<para>
20
</para>
</entry>
</row>
<row>
<entry>
<para>
>60
</para>
</entry>
<entry>
<para>
30
</para>
</entry>
</row>
<row>
<entry>
<para>
>75
</para>
</entry>
<entry>
<para>
40
</para>
</entry>
</row>
<row>
<entry>
<para>
>85
</para>
</entry>
<entry>
<para>
50
</para>
</entry>
</row>
<row>
<entry>
<para>
>95
</para>
</entry>
<entry>
<para>
60
</para>
</entry>
</row>
</tbody>
</tgroup>
</table>
<para>
To change the variable in the &smaster; configuration, run the
following on the &admin_node;:
</para>
<screen>&prompt.root;<command>echo "worker_threads: 20" > /etc/caasp/salt-master-custom.conf</command>
&prompt.root;<command>docker restart $(docker ps | grep salt-master | awk '{print $1}')</command></screen>
<para>
&smaster; will be automatically restarted by kubelet.
</para>
<para>
    If bootstrapping fails, you can check whether the number of Salt
    worker threads is configured too low:
</para>
  <screen>&prompt.root;<command>docker logs $(docker ps | grep salt-master | \
awk '{print $1}') 2>&amp;1 | grep -i worker_threads</command></screen>
</sect1>
<sect1 xml:id="sec.admin.velum.registry">
<title>Configuring Remote Container Registry</title>
<para>
    A remote registry allows your cluster to pull container images from a
    registry other than the default one. This is commonly used in cases where
    a &productname; cluster is not allowed to have direct access to the
    internet. You can create a local registry with the images that you will
    need and add the information for that registry here. If the registry uses
    a self-signed certificate, the certificate can be added here to establish
    trust between &kube; and the registry.
</para>
<para>
By default, the &suse; container registry is configured as the only remote
registry and has the name <literal>SUSE</literal>.
</para>
<informalfigure>
<mediaobject>
<imageobject role="fo">
<imagedata fileref="velum_settings_registry_overview.png" width="100%"/>
</imageobject>
<imageobject role="html">
<imagedata fileref="velum_settings_registry_overview.png" width="100%"/>
</imageobject>
</mediaobject>
</informalfigure>
<sect2 xml:id="sec.admin.velum.registry.add">
<title>Adding A Remote Registry</title>
<procedure>
<step>
<para>
Log in to &dashboard; and navigate to
<guimenu>Settings → Remote Registries</guimenu>.
</para>
</step>
<step>
<para>
Click on <guimenu>Add Remote Registry</guimenu> to add a new remote
registry configuration.
</para>
<informalfigure>
<mediaobject>
<imageobject role="fo">
<imagedata fileref="velum_settings_remote_registry.png" width="100%"/>
</imageobject>
<imageobject role="html">
<imagedata fileref="velum_settings_remote_registry.png" width="100%"/>
</imageobject>
</mediaobject>
</informalfigure>
</step>
<step>
<para>
Fill in the options for the new registry.
</para>
<informalfigure>
<mediaobject>
<imageobject role="fo">
<imagedata fileref="velum_settings_new_registry.png" width="100%"/>
</imageobject>
<imageobject role="html">
<imagedata fileref="velum_settings_new_registry.png" width="100%"/>
</imageobject>
</mediaobject>
</informalfigure>
<variablelist>
<varlistentry>
<term>Name</term>
<listitem>
<para>
Define a name for the registry.
</para>
</listitem>
</varlistentry>
<varlistentry>
<term>URL</term>
<listitem>