<?xml version="1.0"?>
<!DOCTYPE section [
<!ENTITY % entities SYSTEM "entity-decl.ent"> %entities;
]>
<section xmlns="http://docbook.org/ns/docbook" xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink" xml:id="storage-alarmdefinitions">
<title>Storage Alarms</title>
<para>
These alarms appear under the Storage section of the &productname; &opscon;.
</para>
<section>
<title>SERVICE: OBJECT-STORAGE</title>
<informaltable>
<?dbhtml table-width="99%" ?>
<tgroup cols="2">
<colspec colname="c1" colnum="1" colwidth="1*"/>
<colspec colname="c2" colnum="2" colwidth="2*"/>
<thead>
<row>
<entry>Alarm Information</entry>
<entry>Mitigation Tasks</entry>
</row>
</thead>
<tbody valign="top">
<row>
<entry>
<para>
<emphasis role="bold">Name: swiftlm-scan monitor</emphasis>
</para>
<para>
<emphasis role="bold">Description:</emphasis> Alarms if
<literal>swiftlm-scan</literal> cannot execute a monitoring task.
</para>
<para>
<emphasis role="bold">Likely cause:</emphasis> The
<literal>swiftlm-scan</literal> program is used to monitor and measure
a number of metrics. If it is unable to monitor or measure something,
it raises this alarm.
</para>
</entry>
<entry>
<para>
Click on the alarm to examine the <literal>Details</literal> field and
look for a <literal>msg</literal> field. The text may explain the
problem. To confirm this, you can also log in to the host specified
by the <literal>hostname</literal> dimension and then run this
command:
</para>
<screen>sudo swiftlm-scan | python -mjson.tool</screen>
<para>
The <literal>msg</literal> field is contained in the
<literal>value_meta</literal> item.
</para>
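<para>
For example, to narrow the output to entries that carry a message, you can
filter for the <literal>msg</literal> key (the grep filter is just one
illustrative approach):
</para>
<screen>sudo swiftlm-scan | python -mjson.tool | grep -B 2 -A 2 '"msg"'</screen>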
</entry>
</row>
<row>
<entry>
<para>
<emphasis role="bold">Name: &o_objstore; account replicator last</emphasis>
completed in 12 hours
</para>
<para>
<emphasis role="bold">Description:</emphasis> Alarms if an
<literal>account-replicator</literal> process did not complete a
replication cycle within the last 12 hours.
</para>
<para>
<emphasis role="bold">Likely cause:</emphasis> This can indicate that
the <literal>account-replication</literal> process is stuck.
</para>
</entry>
<entry>
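<para>
SSH to the affected host and restart the process (this mirrors the
mitigation for the container and object replicators below; the service
name follows the same pattern):
</para>
<screen>sudo systemctl restart swift-account-replicator</screen>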
<para>
Another possible cause is that a file system is corrupt. Look for
signs of this in these logs on the affected node:
</para>
<screen>/var/log/swift/swift.log
/var/log/kern.log</screen>
<para>
The file system may need to be wiped; contact &serviceteam; for advice
on the best way to do that. You can then reformat the file
system with these steps:
</para>
<procedure>
<step>
<para>
Log in to the &clm;.
</para>
</step>
<step>
<para>
Run the &o_objstore; deploy playbook against the affected node, which will
format the wiped file system:
</para>
<screen>&prompt.ardana;cd ~/scratch/ansible/next/ardana/ansible/
&prompt.ardana;ansible-playbook -i hosts/verb_hosts swift-deploy.yml \
--limit <hostname></screen>
</step>
</procedure>
</entry>
</row>
<row>
<entry>
<para>
<emphasis role="bold">Name: &o_objstore; container replicator last</emphasis>
completed in 12 hours
</para>
<para>
<emphasis role="bold">Description:</emphasis> Alarms if a
<literal>container-replicator</literal> process did not complete a
replication cycle within the last 12 hours.
</para>
<para>
<emphasis role="bold">Likely cause:</emphasis> This can indicate that
the container-replication process is stuck.
</para>
</entry>
<entry>
<para>
SSH to the affected host and restart the process with this command:
</para>
<screen>sudo systemctl restart swift-container-replicator</screen>
<para>
Another possible cause is that a file system is corrupt. Look for
signs of this in these logs on the affected node:
</para>
<screen>/var/log/swift/swift.log
/var/log/kern.log</screen>
<para>
The file system may need to be wiped; contact &serviceteam; for advice
on the best way to do that. You can then reformat the file
system with these steps:
</para>
<procedure>
<step>
<para>
Log in to the &clm;.
</para>
</step>
<step>
<para>
Run the &o_objstore; deploy playbook against the affected node, which will
format the wiped file system:
</para>
<screen>&prompt.ardana;cd ~/scratch/ansible/next/ardana/ansible/
&prompt.ardana;ansible-playbook -i hosts/verb_hosts swift-deploy.yml \
--limit <hostname></screen>
</step>
</procedure>
</entry>
</row>
<row>
<entry>
<para>
<emphasis role="bold">Name: &o_objstore; object replicator last</emphasis>
completed in 24 hours
</para>
<para>
<emphasis role="bold">Description:</emphasis> Alarms if an
<literal>object-replicator</literal> process did not complete a
replication cycle within the last 24 hours.
</para>
<para>
<emphasis role="bold">Likely cause:</emphasis> This can indicate that
the object-replication process is stuck.
</para>
</entry>
<entry>
<para>
SSH to the affected host and restart the process with this command:
</para>
<screen>sudo systemctl restart swift-object-replicator</screen>
<para>
Another possible cause is that a file system is corrupt. Look for
signs of this in these logs on the affected node:
</para>
<screen>/var/log/swift/swift.log
/var/log/kern.log</screen>
<para>
The file system may need to be wiped; contact &serviceteam; for advice
on the best way to do that. You can then reformat the file
system with these steps:
</para>
<procedure>
<step>
<para>
Log in to the &clm;.
</para>
</step>
<step>
<para>
Run the &o_objstore; deploy playbook against the affected node, which will
format the wiped file system:
</para>
<screen><?dbsuse-fo font-size="0.70em"?>
&prompt.ardana;cd ~/scratch/ansible/next/ardana/ansible/
&prompt.ardana;ansible-playbook -i hosts/verb_hosts swift-deploy.yml \
--limit <hostname></screen>
</step>
</procedure>
</entry>
</row>
<row>
<entry>
<para>
<emphasis role="bold">Name: &o_objstore; configuration file</emphasis>
ownership
</para>
<para>
<emphasis role="bold">Description:</emphasis> Alarms if
files/directories in <literal>/etc/swift</literal> are not owned by
&o_objstore;.
</para>
<para>
<emphasis role="bold">Likely cause:</emphasis> For files in
<literal>/etc/swift</literal>, somebody may have manually edited or
created a file.
</para>
</entry>
<entry>
<para>
For files in <literal>/etc/swift</literal>, use this command to change
the file ownership:
</para>
<screen><?dbsuse-fo font-size="0.70em"?>&prompt.ardana;sudo chown swift.swift /etc/swift/ /etc/swift/*</screen>
</entry>
</row>
<row>
<entry>
<para>
<emphasis role="bold">Name: &o_objstore; data filesystem ownership</emphasis>
</para>
<para>
<emphasis role="bold">Description:</emphasis> Alarms if files or
directories in <literal>/srv/node</literal> are not owned by &o_objstore;.
</para>
<para>
<emphasis role="bold">Likely cause:</emphasis> For directories in
<literal>/srv/node/*</literal>, it may happen that the root partition
was reimaged or reinstalled and the UID assigned to the &o_objstore; user
changed. The directories and files would then not be owned by the UID
assigned to the &o_objstore; user.
</para>
</entry>
<entry>
<para>
For directories and files in <filename>/srv/node/*</filename>, compare
the swift UID of this system and other systems and the UID of the owner
of <filename>/srv/node/*</filename>. If possible, make the UID of the
&o_objstore; user match the directories or files. Otherwise, change the
ownership of all files and directories under the
<filename>/srv/node</filename> path using a similar <command>chown
swift.swift</command> command as above.
</para>
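<para>
For example, to compare the UID of the &o_objstore; user with the owner of
the data directories on the affected node (a sketch):
</para>
<screen>id -u swift
stat -c '%u %U %n' /srv/node/*</screen>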
</entry>
</row>
<row>
<entry>
<para>
<emphasis role="bold">Name: Drive URE errors detected</emphasis>
</para>
<para>
<emphasis role="bold">Description:</emphasis> Alarms if
<literal>swift-drive-audit</literal> reports an unrecoverable read
error on a drive used by the &o_objstore; service.
</para>
<para>
<emphasis role="bold">Likely cause:</emphasis> An unrecoverable read
error occurred when &o_objstore; attempted to access a directory.
</para>
</entry>
<entry>
<para>
The UREs reported only apply to file system metadata (that is,
directory structures). For UREs in object files, the &o_objstore; system
automatically deletes the file and replicates a fresh copy from one of
the other replicas.
</para>
<para>
UREs are a normal occurrence on large disk drives. A URE does not mean
that the drive has failed. However, if you get regular UREs on a specific
drive, then this may indicate that the drive has indeed failed and
should be replaced.
</para>
<para>
You can use standard XFS repair actions to correct the UREs in the file
system.
</para>
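<para>
For example, a minimal repair sequence (the file system must be unmounted
before running <command>xfs_repair</command>; the device name is
illustrative):
</para>
<screen>sudo umount /dev/sd<drive_name>
sudo xfs_repair /dev/sd<drive_name></screen>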
<para>
If the XFS repair fails, you should wipe the GPT table as follows
(where <drive_name> is replaced by the actual drive name):
</para>
<screen>&prompt.ardana;sudo dd if=/dev/zero of=/dev/sd<drive_name> \
bs=$((1024*1024)) count=1</screen>
<para>
Then follow the steps below, which will reformat the drive, remount it,
and restart &o_objstore; services on the affected node.
</para>
<procedure>
<step>
<para>
Log in to the &clm;.
</para>
</step>
<step>
<para>
Run the &o_objstore; reconfigure playbook, specifying the affected node:
</para>
<screen>&prompt.ardana;cd ~/scratch/ansible/next/ardana/ansible/
&prompt.ardana;ansible-playbook -i hosts/verb_hosts _swift-configure.yml \
--limit <hostname></screen>
</step>
</procedure>
<para>
It is safe to reformat drives containing &o_objstore; data because &o_objstore;
maintains other copies of the data (usually, &o_objstore; is configured to
have three replicas of all data).
</para>
</entry>
</row>
<row>
<entry>
<para>
<emphasis role="bold">Name: &o_objstore; service</emphasis>
</para>
<para>
<emphasis role="bold">Description:</emphasis> Alarms if a &o_objstore;
process, specified by the <literal>component</literal> field, is not
running.
</para>
<para>
<emphasis role="bold">Likely cause:</emphasis> A daemon specified by
the <literal>component</literal> dimension on the host specified by the
<literal>hostname</literal> dimension has stopped running.
</para>
</entry>
<entry>
<para>
Examine the <filename>/var/log/swift/swift.log</filename> file for
possible error messages related to the &o_objstore; process. The process in
question is listed in the alarm dimensions in the
<literal>component</literal> dimension.
</para>
<para>
Restart &o_objstore; processes by running the
<filename>swift-start.yml</filename> playbook, with these steps:
</para>
<procedure>
<step>
<para>
Log in to the &clm;.
</para>
</step>
<step>
<para>
Run the &o_objstore; start playbook against the affected host:
</para>
<screen>&prompt.ardana;cd ~/scratch/ansible/next/ardana/ansible/
&prompt.ardana;ansible-playbook -i hosts/verb_hosts swift-start.yml \
--limit <hostname></screen>
</step>
</procedure>
</entry>
</row>
<row>
<entry>
<para>
<emphasis role="bold">Name: &o_objstore; filesystem mount point</emphasis>
status
</para>
<para>
<emphasis role="bold">Description:</emphasis> Alarms if a file
system/drive used by &o_objstore; is not correctly mounted.
</para>
<para>
<emphasis role="bold">Likely cause:</emphasis> The device specified by
the <literal>device</literal> dimension is not correctly mounted at the
mountpoint specified by the <literal>mount</literal> dimension.
</para>
<para>
The most probable cause is that the drive has failed or that it had a
temporary failure during the boot process and remained unmounted.
</para>
<para>
Another possible cause is file system corruption that prevents the
device from being mounted.
</para>
</entry>
<entry>
<para>
Reboot the node and see if the file system remains unmounted.
</para>
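<para>
After the reboot, you can confirm whether the device was mounted (use the
mount point from the alarm's <literal>mount</literal> dimension):
</para>
<screen>mount | grep <mount_point></screen>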
<para>
If the file system is corrupt, see the process used for the "Drive URE
errors" alarm to wipe and reformat the drive.
</para>
</entry>
</row>
<row>
<entry>
<para>
<emphasis role="bold">Name: &o_objstore; uptime-monitor status</emphasis>
</para>
<para>
<emphasis role="bold">Description:</emphasis> Alarms if the
swiftlm-uptime-monitor has errors using &o_ident; (<literal>keystone-get-token</literal>),
&o_objstore; (<literal>rest-api</literal>) or &o_objstore;'s healthcheck.
</para>
<para>
<emphasis role="bold">Likely cause:</emphasis> The
swiftlm-uptime-monitor cannot get a token from &o_ident; or cannot get a
successful response from the &o_objstore; Object-Storage API.
</para>
</entry>
<entry>
<para>
Check that the &o_ident; service is running:
</para>
<procedure>
<step>
<para>
Log in to the &clm;.
</para>
</step>
<step>
<para>
Check the status of the &o_ident; service:
</para>
<screen>&prompt.ardana;cd ~/scratch/ansible/next/ardana/ansible/
&prompt.ardana;ansible-playbook -i hosts/verb_hosts keystone-status.yml</screen>
</step>
<step>
<para>
If it is not running, start the service:
</para>
<screen>&prompt.ardana;cd ~/scratch/ansible/next/ardana/ansible/
&prompt.ardana;ansible-playbook -i hosts/verb_hosts keystone-start.yml</screen>
</step>
<step>
<para>
Contact the support team if you need further assistance
troubleshooting the &o_ident; service.
</para>
</step>
</procedure>
<para>
Check that &o_objstore; is running:
</para>
<procedure>
<step>
<para>
Log in to the &clm;.
</para>
</step>
<step>
<para>
Check the status of the &o_objstore; service:
</para>
<screen>&prompt.ardana;cd ~/scratch/ansible/next/ardana/ansible/
&prompt.ardana;ansible-playbook -i hosts/verb_hosts swift-status.yml</screen>
</step>
<step>
<para>
If it is not running, start the service:
</para>
<screen>&prompt.ardana;cd ~/scratch/ansible/next/ardana/ansible/
&prompt.ardana;ansible-playbook -i hosts/verb_hosts swift-start.yml</screen>
</step>
</procedure>
<para>
Restart the swiftlm-uptime-monitor as follows:
</para>
<procedure>
<step>
<para>
Log in to the first server running the swift-proxy-server service. Use
the playbook below to determine which host this is:
</para>
<screen>&prompt.ardana;cd ~/scratch/ansible/next/ardana/ansible/
&prompt.ardana;ansible-playbook -i hosts/verb_hosts swift-status.yml \
--limit SWF-PRX[0]</screen>
</step>
<step>
<para>
Restart the swiftlm-uptime-monitor with this command:
</para>
<screen><?dbsuse-fo font-size="0.70em"?>&prompt.ardana;sudo systemctl restart swiftlm-uptime-monitor</screen>
</step>
</procedure>
</entry>
</row>
<row>
<entry>
<para>
<emphasis role="bold">Name: &o_objstore; &o_ident; server connect</emphasis>
</para>
<para>
<emphasis role="bold">Description:</emphasis> Alarms if a socket cannot
be opened to the &o_ident; service (used for token validation).
</para>
<para>
<emphasis role="bold">Likely cause:</emphasis> The Identity service
(&o_ident;) server may be down. Another possible cause is that the
network between the host reporting the problem and the &o_ident; server
or the <literal>haproxy</literal> process is not forwarding requests to
&o_ident;.
</para>
</entry>
<entry>
<para>
The <literal>URL</literal> dimension contains the virtual
IP address. Use cURL or a similar program to confirm whether a
connection can be made to the virtual IP address. Check that
<literal>haproxy</literal> is running. Check that the &o_ident; service
is working.
</para>
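<para>
For example (the port shown is the conventional &o_ident; public API port
and is illustrative; substitute the address from the
<literal>URL</literal> dimension):
</para>
<screen>curl -i https://<vip>:5000/v3
sudo systemctl status haproxy</screen>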
</entry>
</row>
<row>
<entry>
<para>
<emphasis role="bold">Name: &o_objstore; service listening on ip</emphasis>
and port
</para>
<para>
<emphasis role="bold">Description:</emphasis> Alarms when a &o_objstore;
service is not listening on the correct IP address or port.
</para>
<para>
<emphasis role="bold">Likely cause:</emphasis> The &o_objstore; service may be
down.
</para>
</entry>
<entry>
<para>
Verify the status of the &o_objstore; service on the affected host, as
specified by the <literal>hostname</literal> dimension.
</para>
<procedure>
<step>
<para>
Log in to the &clm;.
</para>
</step>
<step>
<para>
Run the &o_objstore; status playbook to confirm status:
</para>
<screen>&prompt.ardana;cd ~/scratch/ansible/next/ardana/ansible/
&prompt.ardana;ansible-playbook -i hosts/verb_hosts swift-status.yml \
--limit <hostname></screen>
</step>
</procedure>
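<para>
You can also check locally on the affected host which sockets the
&o_objstore; processes are listening on (a sketch):
</para>
<screen>sudo ss -ltnp | grep swift</screen>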
<para>
If you find an issue, you can stop and restart the &o_objstore; service
with these steps:
</para>
<procedure>
<step>
<para>
Log in to the &clm;.
</para>
</step>
<step>
<para>
Stop the &o_objstore; service on the affected host:
</para>
<screen>&prompt.ardana;cd ~/scratch/ansible/next/ardana/ansible/
&prompt.ardana;ansible-playbook -i hosts/verb_hosts swift-stop.yml \
--limit <hostname></screen>
</step>
<step>
<para>
Restart the &o_objstore; service on the affected host:
</para>
<screen>&prompt.ardana;cd ~/scratch/ansible/next/ardana/ansible/
&prompt.ardana;ansible-playbook -i hosts/verb_hosts swift-start.yml \
--limit <hostname></screen>
</step>
</procedure>
</entry>
</row>
<row>
<entry>
<para>
<emphasis role="bold">Name: &o_objstore; rings checksum</emphasis>
</para>
<para>
<emphasis role="bold">Description:</emphasis> Alarms if the &o_objstore; rings
checksums do not match on all hosts.
</para>
<para>
<emphasis role="bold">Likely cause:</emphasis> The &o_objstore; ring files
must be the same on every node. The files are located in
<filename>/etc/swift/*.ring.gz</filename>.
</para>
<para>
If you have just changed any of the rings and you are still deploying
the change, it is normal for this alarm to trigger.
</para>
</entry>
<entry>
<para>
If you have just changed any of your &o_objstore; rings, wait until the
changes complete; the alarm will then likely clear on its own. If it
does not, continue with these steps.
</para>
<para>
Use <command>sudo swift-recon --md5</command> to find which node has
outdated rings.
</para>
<para>
Run the <filename>swift-reconfigure.yml</filename> playbook, using the
steps below. This deploys the same set of rings to every node.
</para>
<procedure>
<step>
<para>
Log in to the &clm;.
</para>
</step>
<step>
<para>
Run the &o_objstore; reconfigure playbook:
</para>
<screen><?dbsuse-fo font-size="0.70em"?>&prompt.ardana;cd ~/scratch/ansible/next/ardana/ansible/
&prompt.ardana;ansible-playbook -i hosts/verb_hosts swift-reconfigure.yml</screen>
</step>
</procedure>
</entry>
</row>
<row>
<entry>
<para>
<emphasis role="bold">Name: &o_objstore; memcached server connect</emphasis>
</para>
<para>
<emphasis role="bold">Description:</emphasis> Alarms if a socket cannot
be opened to the specified memcached server.
</para>
<para>
<emphasis role="bold">Likely cause:</emphasis> The server may be down,
or the memcached daemon running on the server may have stopped.
</para>
</entry>
<entry>
<para>
If the server is down, restart it.
</para>
<para>
If memcached has stopped, you can restart it by using the
<filename>memcached-start.yml</filename> playbook, using the steps
below. If this fails, rebooting the node will restart the process.
</para>
<procedure>
<step>
<para>
Log in to the &clm;.
</para>
</step>
<step>
<para>
Run the memcached start playbook against the affected host:
</para>
<screen>&prompt.ardana;cd ~/scratch/ansible/next/ardana/ansible/
&prompt.ardana;ansible-playbook -i hosts/verb_hosts memcached-start.yml \
--limit <hostname></screen>
</step>
</procedure>
<para>
If the server is running and memcached is running, there may be a
network problem blocking port 11211.
</para>
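<para>
A quick connectivity check from another node (the hostname is a
placeholder):
</para>
<screen>nc -vz <hostname> 11211</screen>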
<para>
If you see sporadic alarms on different servers, the system may be
running out of resources. Contact &serviceteam; for advice.
</para>
</entry>
</row>
<row>
<entry>
<para>
<emphasis role="bold">Name: &o_objstore; individual disk usage
exceeds 80%</emphasis>
</para>
<para>
<emphasis role="bold">Description:</emphasis> Alarms when a disk drive
used by &o_objstore; exceeds 80% utilization.
</para>
<para>
<emphasis role="bold">Likely cause:</emphasis> Generally, all disk
drives fill at roughly the same rate. If an individual disk drive
fills faster than the others, it can indicate a problem with
the replication process.
</para>
</entry>
<entry>
<para>
If many or most of your disk drives are 80% full, you need to add more
nodes to your system or delete existing objects.
</para>
<para>
If one disk drive is noticeably (more than 30%) more utilized than the
average of other disk drives, check that &o_objstore; processes are working on
the server (use the steps below) and also look for alarms related to
the host. Otherwise continue to monitor the situation.
</para>
<procedure>
<step>
<para>
Log in to the &clm;.
</para>
</step>
<step>
<para>
Run the &o_objstore; status playbook:
</para>
<screen>&prompt.ardana;cd ~/scratch/ansible/next/ardana/ansible/
&prompt.ardana;ansible-playbook -i hosts/verb_hosts swift-status.yml</screen>
</step>
</procedure>
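<para>
To compare per-drive utilization on the affected node, a simple check is
(assuming the drives are mounted under <filename>/srv/node</filename>):
</para>
<screen>df -h /srv/node/*</screen>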
</entry>
</row>
<row>
<entry>
<para>
<emphasis role="bold">Name: &o_objstore; individual disk usage exceeds
90%</emphasis>
</para>
<para>
<emphasis role="bold">Description:</emphasis> Alarms when a disk drive
used by &o_objstore; exceeds 90% utilization.
</para>
<para>
<emphasis role="bold">Likely cause:</emphasis> Generally, all disk
drives fill at roughly the same rate. If an individual disk drive
fills faster than the others, it can indicate a problem with
the replication process.
</para>
</entry>
<entry>
<para>
If one disk drive is noticeably (more than 30%) more utilized than the
average of other disk drives, check that &o_objstore; processes are working on
the server, using these steps:
</para>
<procedure>
<step>
<para>
Log in to the &clm;.
</para>
</step>
<step>
<para>
Run the &o_objstore; status playbook:
</para>
<screen>&prompt.ardana;cd ~/scratch/ansible/next/ardana/ansible/
&prompt.ardana;ansible-playbook -i hosts/verb_hosts swift-status.yml</screen>
</step>
</procedure>
<para>
Also look for alarms related to the host. An individual disk drive
filling up faster than the others can indicate a problem with the
replication process.
</para>
<para>
Restart &o_objstore; on that host using the <literal>--limit</literal>
argument to target the host:
</para>
<procedure>
<step>
<para>
Log in to the &clm;.
</para>
</step>
<step>
<para>
Stop the &o_objstore; service:
</para>
<screen>&prompt.ardana;cd ~/scratch/ansible/next/ardana/ansible/
&prompt.ardana;ansible-playbook -i hosts/verb_hosts swift-stop.yml \
--limit <hostname></screen>
</step>
<step>
<para>
Start the &o_objstore; service back up:
</para>
<screen>&prompt.ardana;cd ~/scratch/ansible/next/ardana/ansible/
&prompt.ardana;ansible-playbook -i hosts/verb_hosts swift-start.yml \
--limit <hostname></screen>
</step>
</procedure>
<para>
If the utilization does not return to values similar to those of the
other disk drives, you can reformat the disk drive. Only do this if the
average utilization of all disk drives is less than 80%. To format a
disk drive, contact &serviceteam; for instructions.
</para>
</entry>
</row>
<row>
<entry>
<para>
<emphasis role="bold">Name: &o_objstore; total disk usage exceeds
80%</emphasis>
</para>
<para>
<emphasis role="bold">Description:</emphasis> Alarms when the average
disk utilization of &o_objstore; disk drives exceeds 80% utilization.
</para>
<para>
<emphasis role="bold">Likely cause:</emphasis> The number and size of
objects in your system are beginning to fill the available disk space.
Account and container storage is included in disk utilization. However,
it generally consumes only 1-2% of the space that objects do, so object
storage is the dominant consumer of disk space.
</para>
</entry>
<entry>
<para>
You need to add more nodes to your system or delete existing objects to
remain under 80% utilization.
</para>
<para>
If you delete a project/account, the objects in that account are not
removed until a week later by the <literal>account-reaper</literal>
process, so this is not a good way of quickly freeing up space.
</para>
</entry>
</row>
<row>
<entry>
<para>
<emphasis role="bold">Name: &o_objstore; total disk usage exceeds
90%</emphasis>
</para>
<para>
<emphasis role="bold">Description:</emphasis> Alarms when the average
disk utilization of &o_objstore; disk drives exceeds 90% utilization.
</para>
<para>
<emphasis role="bold">Likely cause:</emphasis> The number and size of
objects in your system are beginning to fill the available disk space.
Account and container storage is included in disk utilization. However,
it generally consumes only 1-2% of the space that objects do, so object
storage is the dominant consumer of disk space.
</para>
</entry>
<entry>
<para>
If your disk drives are 90% full, you must immediately stop all
applications that put new objects into the system. At that point you
can either delete objects or add more servers.
</para>
<para>
Using the steps below, set <literal>fallocate_reserve</literal>
to a value higher than the space currently available on the disk
drives. This prevents more objects from being created.
</para>
<procedure>
<step>
<para>
Log in to the &clm;.
</para>
</step>
<step>
<para>
Edit the configuration files below and change the value for
<literal>fallocate_reserve</literal> to a value higher than the
currently available space on the disk drives:
</para>
<screen>~/openstack/my_cloud/config/swift/account-server.conf.j2
~/openstack/my_cloud/config/swift/container-server.conf.j2
~/openstack/my_cloud/config/swift/object-server.conf.j2</screen>
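<para>
For example, the relevant line in each file might look like this (the
value is in bytes and purely illustrative):
</para>
<screen>fallocate_reserve = 10737418240</screen>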
</step>
<step>
<para>
Commit the changes to git:
</para>
<screen><?dbsuse-fo font-size="0.70em"?>&prompt.ardana;git add -A
&prompt.ardana;git commit -a -m "changing &o_objstore; fallocate_reserve value"</screen>
</step>
<step>
<para>
Run the configuration processor:
</para>
<screen><?dbsuse-fo font-size="0.70em"?>&prompt.ardana;cd ~/openstack/ardana/ansible
&prompt.ardana;ansible-playbook -i hosts/localhost config-processor-run.yml</screen>
</step>
<step>
<para>
Update your deployment directory:
</para>
<screen><?dbsuse-fo font-size="0.70em"?>&prompt.ardana;cd ~/openstack/ardana/ansible
&prompt.ardana;ansible-playbook -i hosts/localhost ready-deployment.yml</screen>
</step>
<step>
<para>
Run the &o_objstore; reconfigure playbook to deploy the change:
</para>
<screen><?dbsuse-fo font-size="0.70em"?>&prompt.ardana;cd ~/scratch/ansible/next/ardana/ansible/
&prompt.ardana;ansible-playbook -i hosts/verb_hosts swift-reconfigure.yml</screen>
</step>
</procedure>
<para>
If you allow your file systems to become full, you will be unable to
delete objects or add more nodes to the system. This is because the
system needs some free space to handle the replication process when
adding nodes. With no free space, the replication process cannot work.
</para>
</entry>
</row>
<row>
<entry>
<para>
<emphasis role="bold">Name: &o_objstore; service per-minute
availability</emphasis>
</para>
<para>
<emphasis role="bold">Description:</emphasis> Alarms if the &o_objstore;
service was reported as unavailable during the previous minute.
</para>
<para>
<emphasis role="bold">Likely cause:</emphasis> The
<literal>swiftlm-uptime-monitor</literal> service runs on the first
proxy server. It monitors the &o_objstore; endpoint and reports latency data.
If the endpoint stops responding, the monitor generates this alarm.
</para>
</entry>
<entry>
<para>
There are many reasons why the endpoint may stop responding. Check the
following (example commands appear after this list):
</para>
<itemizedlist>
<listitem>
<para>
Is <literal>haproxy</literal> running on the control nodes?
</para>
</listitem>
<listitem>
<para>
Is <literal>swift-proxy-server</literal> running on the &o_objstore; proxy
servers?
</para>
</listitem>
</itemizedlist>
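<para>
A quick way to check both (run on the relevant nodes; the service names
mirror those used elsewhere in this section):
</para>
<screen>sudo systemctl status haproxy
sudo systemctl status swift-proxy-server</screen>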
</entry>
</row>
<row>
<entry>
<para>
<emphasis role="bold">Name: &o_objstore; rsync connect</emphasis>
</para>
<para>
<emphasis role="bold">Description:</emphasis> Alarms if a socket cannot
be opened to the specified rsync server.
</para>
<para>
<emphasis role="bold">Likely cause:</emphasis> The rsync daemon on the
specified node cannot be contacted. The most probable cause is that the
node is down. The rsync service might also have been stopped on the
node.
</para>
</entry>
<entry>
<para>
Reboot the server if it is down.
</para>
<para>
Attempt to restart rsync with this command:
</para>
<screen>sudo systemctl restart rsync.service</screen>
</entry>
</row>
<row>
<entry>
<para>
<emphasis role="bold">Name: &o_objstore; smart array controller
status</emphasis>
</para>
<para>
<emphasis role="bold">Description:</emphasis> Alarms if there is a
failure in the Smart Array.
</para>
<para>
<emphasis role="bold">Likely cause:</emphasis> The Smart Array or Smart
HBA controller has a fault, a component of the controller (such as a
battery) has failed, or caching is disabled.
</para>
<para>
The HPE Smart Storage Administrator (HPE SSA) CLI component must be
installed for SSACLI status to be reported. HPE-specific binaries that
are not based on open source are distributed directly by, and supported
by, HPE. To download and install the SSACLI utility, refer to:
<link
xlink:href="https://support.hpe.com/hpsc/swd/public/detail?swItemId=MTX_3d16386b418a443388c18da82f"/>
</para>
</entry>
<entry>
<para>