bug #1046525

Bringing openstack monitoring content from Mirantis blog to official documentation. Change-Id: I45fb95eae7877ac8e856d89b6fcff876090ab1b5
openstack · Oct 22, 2012 · fc11bbe · fc11bbe
1 parent 2427ea5
commit fc11bbe
Showing 1 changed file with 119 additions and 0 deletions.
diff --git a/doc/src/docbkx/common/monitoring.xml b/doc/src/docbkx/common/monitoring.xml
@@ -0,0 +1,119 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<chapter xml:id="ops-diagnostic-troubleshooting" xmlns="http://docbook.org/ns/docbook"
+    xmlns:xi="http://www.w3.org/2001/XInclude" xmlns:xlink="http://www.w3.org/1999/xlink"
+    version="5.0">
+    <title>Monitoring</title>
+
+    <section xml:id="monitoring-aspects">
+    <title>Different aspects of cloud monitoring</title>
+    <para>In cloud environments, we can identify three distinct areas for monitoring:
+    1. Cloud hardware and services: These are different hardware and software pieces  of the cluster
+       running on bare metal, including hypervisors and storage and controller nodes. The problem is
+       well-known and a number of tools exist to deal with it. Some of most popular are Nagios,
+       Ganglia, Cacti, and Zabbix.
+    2. User’s cloud ecosystem: This is everything that makes up a user’s cloud account. In case of
+       OpenStack it is instances, persistent volumes, floating IPs, security groups, etc. For all
+       these components, the user needs reliable and clear information on their status. This info
+       generally should come from the internals of the cloud software.
+    3. Performance of cloud resources: This is the performance of tenants’ cloud infrastructures
+       running on top of a given OpenStack installation. This specifically boils down to determining
+       what prevents tenants’ resources from functioning properly and how these problematic resources
+       affect other cloud resources.</para>
+    </section>
+
+    <section xml:id="monitoring-userresources">
+    <title>Monitoring the status of user's resources</title>
+
+    <para>When it comes to users, their primary expectation is usually about consistent feedback from
+    OpenStack about their instance states. There is nothing more annoying than having an instance
+    reported as “ACTIVE” by the dashboard, even though it’s been gone for several minutes. OpenStack
+    tries to prevent such discrepancies by running several checks in a cron-like manner (they are
+    called “periodic tasks” in OpenStack code) on each compute node. These checks are simply methods
+    of the ComputeManager class located in compute/manager.py or directly in drivers for different
+    hypervisors (some of these checks are available for certain hypervisors only),
+    e.g. _check_instance_build_time, _cleanup_running_deleted_instances, _sync_power_states.
+    </para>
+    </section>
+
+    <section xml:id="monitoring-cloudresources">
+    <title>Monitoring performance of cloud resources</title>
+
+    <para>While monitoring farms of physical servers is a standard task even on a large scale,
+    monitoring virtual infrastructure (“cloud scale”) is much more daunting. Cloud introduces a lot
+    of dynamic resources, which can behave unpredictably and move between different hardware
+    components. So it is usually quite hard to tell which of the thousands of VMs has problems
+    (without even having root access to it) and how the problem affects other resources. Standards
+    like sFlow try to tackle this problem, by providing efficient monitoring for a high volume of
+    events (based on probing) and determining relationships between different cloud resources.
+    </para>
+    <para>sFlow is worked on by a consortium of mainstream network device vendors, including
+    ExtremeNetworks, HP, Hitachi, etc. Since it’s embedded in their devices, it provides a consistent
+    way to monitor traffic across different networks. However, from the standpoint of a number of
+    open source projects, It’s also built into OpenvSwitch virtual switch (which
+    is a more robust alternative to a Linux bridge).
+    </para>
+    <para>To provide end-to-end packet flow analysis, sFlow agents need to be deployed on all devices
+    across the network. The agents simply collect sample data on network devices (including packet
+    headers and forwarding/routing table state), and send them to the sFlow Collector. The collector
+    gathers data samples from all the sFlow agents and produces meaningful statistics out of them,
+    including traffic characteristics and network topology (how packets traverse the network between
+    two endpoints).</para>
+    <para>While sFlow itself is defined as a standard to monitor networks, it also comes with a “host
+    sFlow agent.” Per the website:
+            "The Host sFlow agent exports physical and virtual server performance metrics using
+            the sFlow protocol. The agent provides scalable, multi-vendor, multi-OS performance
+            monitoring with minimal impact on the systems being monitored."
+    </para>
+    <para>sFlow agents are available for mainstream hypervisors, including Xen, KVM/libvirt, and
+    Hyper-V (VMWare to be added soon) and can be installed on a number of operating systems (FreeBSD,
+    Linux, Solaris, Windows) to monitor applications running on them. For the IaaS clouds based on
+    these hypervisors, it means that it’s now possible to sample different metrics of an instance
+    (including I/O, CPU, RAM, interrupts/sec etc.) without even logging into it. To make it even
+    better, one can combine the “network” and “host” parts of sFlow data to provide a complex
+    monitoring solution
+    </para>
+    <para>With the advent of Quantum in the Folsom release, the virtual network device moved from
+    Linux bridge to OpenvSwitch. If we add KVM or Xen to the mix, we will have sFlow as an applicable
+    framework to monitor instances themselves and their virtual network topologies as well.
+    There are a number of  sFlow collectors available. The most widely used seem to be Ganglia and
+    sFlowTrend, which are free. While Ganglia is focused mainly on monitoring the performance of
+    clustered hosts or instance pools, sFlowTrend seems to be more robust, adding network metrics
+    and topologies on top of that.
+    </para>
+    </section>
+
+    <section xml:id="monitoring-openstackservices">
+    <title>Monitoring Openstack services</title>
+
+    <para>The table below shows which Nagios checks can be used to monitor different
+    openstack services.
+    ===================== =========================
+    Service               Nagios check
+    ===================== =========================
+    Database              check_mysql/check_pgsql
+    --------------------- -------------------------
+    RabbitMQ              <link xlink:href='https://github.com/jamesc/nagios-plugins-rabbitmq'>nagios rabbitmq plugin</link>
+    --------------------- -------------------------
+    libvirt               check_libvirt
+    --------------------- -------------------------
+    dnsmasq               check_dhcp
+    --------------------- -------------------------
+    nova-api              check_http
+    --------------------- -------------------------
+    nova-scheduler        check_procs
+    --------------------- -------------------------
+    nova-compute          check_procs
+    --------------------- -------------------------
+    nova-network          check_procs
+    --------------------- -------------------------
+    keystone-api          check_http
+    --------------------- -------------------------
+    glance-api            check_http
+    --------------------- -------------------------
+    glance-registry       check_http
+    --------------------- -------------------------
+    server availability   check_ping
+    ===================== =========================
+    </para>
+    </section>
+</chapter>