Skip to content

Commit

Permalink
bug #1046525
Browse files Browse the repository at this point in the history
Bringing openstack monitoring content from Mirantis
blog to official documentation.

Change-Id: I45fb95eae7877ac8e856d89b6fcff876090ab1b5
  • Loading branch information
Piotr Siwczak committed Oct 22, 2012
1 parent 2427ea5 commit fc11bbe
Showing 1 changed file with 119 additions and 0 deletions.
119 changes: 119 additions & 0 deletions doc/src/docbkx/common/monitoring.xml
Original file line number Diff line number Diff line change
@@ -0,0 +1,119 @@
<?xml version="1.0" encoding="UTF-8"?>
<chapter xml:id="ops-diagnostic-troubleshooting" xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude" xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0">
<title>Monitoring</title>

<section xml:id="monitoring-aspects">
<title>Different aspects of cloud monitoring</title>
<para>In cloud environments, we can identify three distinct areas for monitoring:
1. Cloud hardware and services: These are different hardware and software pieces of the cluster
running on bare metal, including hypervisors and storage and controller nodes. The problem is
well-known and a number of tools exist to deal with it. Some of most popular are Nagios,
Ganglia, Cacti, and Zabbix.
2. User’s cloud ecosystem: This is everything that makes up a user’s cloud account. In case of
OpenStack it is instances, persistent volumes, floating IPs, security groups, etc. For all
these components, the user needs reliable and clear information on their status. This info
generally should come from the internals of the cloud software.
3. Performance of cloud resources: This is the performance of tenants’ cloud infrastructures
running on top of a given OpenStack installation. This specifically boils down to determining
what prevents tenants’ resources from functioning properly and how these problematic resources
affect other cloud resources.</para>
</section>

<section xml:id="monitoring-userresources">
<title>Monitoring the status of user's resources</title>

<para>When it comes to users, their primary expectation is usually about consistent feedback from
OpenStack about their instance states. There is nothing more annoying than having an instance
reported as “ACTIVE” by the dashboard, even though it’s been gone for several minutes. OpenStack
tries to prevent such discrepancies by running several checks in a cron-like manner (they are
called “periodic tasks” in OpenStack code) on each compute node. These checks are simply methods
of the ComputeManager class located in compute/manager.py or directly in drivers for different
hypervisors (some of these checks are available for certain hypervisors only),
e.g. _check_instance_build_time, _cleanup_running_deleted_instances, _sync_power_states.
</para>
</section>

<section xml:id="monitoring-cloudresources">
<title>Monitoring performance of cloud resources</title>

<para>While monitoring farms of physical servers is a standard task even on a large scale,
monitoring virtual infrastructure (“cloud scale”) is much more daunting. Cloud introduces a lot
of dynamic resources, which can behave unpredictably and move between different hardware
components. So it is usually quite hard to tell which of the thousands of VMs has problems
(without even having root access to it) and how the problem affects other resources. Standards
like sFlow try to tackle this problem, by providing efficient monitoring for a high volume of
events (based on probing) and determining relationships between different cloud resources.
</para>
<para>sFlow is worked on by a consortium of mainstream network device vendors, including
ExtremeNetworks, HP, Hitachi, etc. Since it’s embedded in their devices, it provides a consistent
way to monitor traffic across different networks. However, from the standpoint of a number of
open source projects, It’s also built into OpenvSwitch virtual switch (which
is a more robust alternative to a Linux bridge).
</para>
<para>To provide end-to-end packet flow analysis, sFlow agents need to be deployed on all devices
across the network. The agents simply collect sample data on network devices (including packet
headers and forwarding/routing table state), and send them to the sFlow Collector. The collector
gathers data samples from all the sFlow agents and produces meaningful statistics out of them,
including traffic characteristics and network topology (how packets traverse the network between
two endpoints).</para>
<para>While sFlow itself is defined as a standard to monitor networks, it also comes with a “host
sFlow agent.” Per the website:
"The Host sFlow agent exports physical and virtual server performance metrics using
the sFlow protocol. The agent provides scalable, multi-vendor, multi-OS performance
monitoring with minimal impact on the systems being monitored."
</para>
<para>sFlow agents are available for mainstream hypervisors, including Xen, KVM/libvirt, and
Hyper-V (VMWare to be added soon) and can be installed on a number of operating systems (FreeBSD,
Linux, Solaris, Windows) to monitor applications running on them. For the IaaS clouds based on
these hypervisors, it means that it’s now possible to sample different metrics of an instance
(including I/O, CPU, RAM, interrupts/sec etc.) without even logging into it. To make it even
better, one can combine the “network” and “host” parts of sFlow data to provide a complex
monitoring solution
</para>
<para>With the advent of Quantum in the Folsom release, the virtual network device moved from
Linux bridge to OpenvSwitch. If we add KVM or Xen to the mix, we will have sFlow as an applicable
framework to monitor instances themselves and their virtual network topologies as well.
There are a number of sFlow collectors available. The most widely used seem to be Ganglia and
sFlowTrend, which are free. While Ganglia is focused mainly on monitoring the performance of
clustered hosts or instance pools, sFlowTrend seems to be more robust, adding network metrics
and topologies on top of that.
</para>
</section>

<section xml:id="monitoring-openstackservices">
<title>Monitoring Openstack services</title>

<para>The table below shows which Nagios checks can be used to monitor different
openstack services.
===================== =========================
Service Nagios check
===================== =========================
Database check_mysql/check_pgsql
--------------------- -------------------------
RabbitMQ <link xlink:href='https://github.com/jamesc/nagios-plugins-rabbitmq'>nagios rabbitmq plugin</link>
--------------------- -------------------------
libvirt check_libvirt
--------------------- -------------------------
dnsmasq check_dhcp
--------------------- -------------------------
nova-api check_http
--------------------- -------------------------
nova-scheduler check_procs
--------------------- -------------------------
nova-compute check_procs
--------------------- -------------------------
nova-network check_procs
--------------------- -------------------------
keystone-api check_http
--------------------- -------------------------
glance-api check_http
--------------------- -------------------------
glance-registry check_http
--------------------- -------------------------
server availability check_ping
===================== =========================
</para>
</section>
</chapter>

0 comments on commit fc11bbe

Please sign in to comment.