Commit
Bringing openstack monitoring content from Mirantis blog to official documentation. Change-Id: I45fb95eae7877ac8e856d89b6fcff876090ab1b5
Piotr Siwczak committed Oct 22, 2012
1 parent 2427ea5
commit fc11bbe
Showing 1 changed file with 119 additions and 0 deletions.
@@ -0,0 +1,119 @@ | ||
<?xml version="1.0" encoding="UTF-8"?>
<chapter xml:id="ops-diagnostic-troubleshooting" xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude" xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0">
<title>Monitoring</title>

<section xml:id="monitoring-aspects">
<title>Different aspects of cloud monitoring</title>
<para>In cloud environments, we can identify three distinct areas for monitoring:</para>
<orderedlist>
<listitem><para>Cloud hardware and services: These are the hardware and software pieces of the
cluster running on bare metal, including hypervisors, storage nodes, and controller nodes. The
problem is well known and a number of tools exist to deal with it. Some of the most popular are
Nagios, Ganglia, Cacti, and Zabbix.</para></listitem>
<listitem><para>User’s cloud ecosystem: This is everything that makes up a user’s cloud account.
In the case of OpenStack, this means instances, persistent volumes, floating IPs, security
groups, and so on. For all these components, the user needs reliable and clear information on
their status. This information should generally come from the internals of the cloud
software.</para></listitem>
<listitem><para>Performance of cloud resources: This is the performance of tenants’ cloud
infrastructures running on top of a given OpenStack installation. It boils down to determining
what prevents tenants’ resources from functioning properly and how these problematic resources
affect other cloud resources.</para></listitem>
</orderedlist>
</section>

<section xml:id="monitoring-userresources">
<title>Monitoring the status of users' resources</title>

<para>Users primarily expect consistent feedback from OpenStack about the state of their
instances. Few things are more annoying than an instance that the dashboard reports as “ACTIVE”
even though it has been gone for several minutes. OpenStack tries to prevent such discrepancies
by running several checks in a cron-like manner (called “periodic tasks” in the OpenStack code)
on each compute node. These checks are methods of the ComputeManager class in
compute/manager.py, or are implemented directly in the drivers for individual hypervisors (some
checks are available for certain hypervisors only). Examples include
_check_instance_build_time, _cleanup_running_deleted_instances, and _sync_power_states.
</para>
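<para>The idea behind a check such as _sync_power_states can be sketched as a reconciliation
loop: compare what the database believes with what the hypervisor reports, and correct the
record when the two disagree. The sketch below is a simplified illustration only, not the nova
implementation. The InstanceRecord class, the state names, and the sync_power_states function
are all hypothetical names introduced for this example.</para>

```python
from dataclasses import dataclass

# Hypothetical state names for this sketch; real nova defines its own constants.
RUNNING = "RUNNING"
SHUTDOWN = "SHUTDOWN"
NOSTATE = "NOSTATE"


@dataclass
class InstanceRecord:
    """What the cloud database believes about one instance (illustrative)."""
    uuid: str
    power_state: str


def sync_power_states(db_records, hypervisor_states):
    """Reconcile database records with the states reported by the hypervisor.

    db_records: list of InstanceRecord, updated in place.
    hypervisor_states: dict mapping instance uuid to its observed state.
    Returns the uuids of the records that were corrected.
    """
    corrected = []
    for record in db_records:
        # An instance the hypervisor no longer knows about has no state.
        observed = hypervisor_states.get(record.uuid, NOSTATE)
        if observed != record.power_state:
            record.power_state = observed
            corrected.append(record.uuid)
    return corrected
```

<para>Run periodically on each compute node, a loop of this kind keeps a vanished instance from
being reported as “ACTIVE” for longer than one check interval.</para>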
</section>

<section xml:id="monitoring-cloudresources">
<title>Monitoring performance of cloud resources</title>

<para>While monitoring farms of physical servers is a standard task even at large scale,
monitoring virtual infrastructure (“cloud scale”) is much more daunting. A cloud introduces many
dynamic resources, which can behave unpredictably and move between different hardware
components. It is therefore usually hard to tell which of the thousands of VMs has a problem
(often without root access to it) and how that problem affects other resources. Standards
like sFlow try to tackle this by providing efficient, sampling-based monitoring for a high
volume of events and by determining relationships between different cloud resources.
</para>
<para>sFlow is developed by a consortium of mainstream network device vendors, including
Extreme Networks, HP, and Hitachi. Because it is embedded in their devices, it provides a
consistent way to monitor traffic across different networks. It is also relevant to open source
projects: sFlow is built into the Open vSwitch virtual switch (a more robust alternative to a
Linux bridge).
</para>
<para>To provide end-to-end packet flow analysis, sFlow agents need to be deployed on all devices
across the network. The agents collect sample data on network devices (including packet
headers and forwarding/routing table state) and send it to the sFlow collector. The collector
gathers data samples from all the sFlow agents and produces meaningful statistics from them,
including traffic characteristics and network topology (how packets traverse the network between
two endpoints).</para>
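<para>The collector side of this can be illustrated with a small sketch. Because agents report
only one packet in every N, a collector scales each sample by the sampling rate to estimate the
real traffic per flow. The code below is a minimal illustration under stated assumptions: the
record keys (src, dst, frame_length, sampling_rate) and the function name estimate_flow_bytes
are hypothetical, and real collectors decode binary sFlow datagrams rather than dictionaries.</para>

```python
from collections import defaultdict


def estimate_flow_bytes(samples):
    """Estimate per-flow byte totals from sampled packet records.

    samples: iterable of dicts with hypothetical keys
      "src", "dst"     endpoint addresses taken from the packet header
      "frame_length"   size of the sampled packet in bytes
      "sampling_rate"  N, meaning the agent sampled 1 in N packets
    Returns a dict mapping (src, dst) to the estimated number of bytes.
    """
    totals = defaultdict(int)
    for sample in samples:
        flow = (sample["src"], sample["dst"])
        # Each sampled packet stands in for roughly N packets of its flow.
        totals[flow] += sample["frame_length"] * sample["sampling_rate"]
    return dict(totals)
```

<para>The estimate becomes more accurate as more samples arrive for a flow, which is why
sampling suits high-volume traffic better than short-lived, low-volume flows.</para>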
<para>While sFlow itself is defined as a standard for monitoring networks, it also comes with a
“Host sFlow agent”. According to the project website:
"The Host sFlow agent exports physical and virtual server performance metrics using
the sFlow protocol. The agent provides scalable, multi-vendor, multi-OS performance
monitoring with minimal impact on the systems being monitored."
</para>
<para>sFlow agents are available for mainstream hypervisors, including Xen, KVM/libvirt, and
Hyper-V (VMware to be added soon), and can be installed on a number of operating systems
(FreeBSD, Linux, Solaris, Windows) to monitor applications running on them. For IaaS clouds
based on these hypervisors, this means it is possible to sample different metrics of an instance
(including I/O, CPU, RAM, and interrupts per second) without logging into it. Better still, the
“network” and “host” parts of sFlow data can be combined to provide a comprehensive monitoring
solution.
</para>
<para>With the advent of Quantum in the Folsom release, the virtual network device moved from
Linux bridge to Open vSwitch. Combined with KVM or Xen, this makes sFlow an applicable
framework for monitoring both the instances themselves and their virtual network topologies.
</para>
<para>A number of sFlow collectors are available. The most widely used seem to be Ganglia and
sFlowTrend, both of which are free. While Ganglia focuses mainly on monitoring the performance
of clustered hosts or instance pools, sFlowTrend seems to be more robust, adding network metrics
and topologies on top of that.
</para>
</section>

<section xml:id="monitoring-openstackservices">
<title>Monitoring OpenStack services</title>

<para>The table below shows which Nagios checks can be used to monitor different
OpenStack services.</para>
<informaltable rules="all">
  <tgroup cols="2">
    <thead>
      <row><entry>Service</entry><entry>Nagios check</entry></row>
    </thead>
    <tbody>
      <row><entry>Database</entry><entry>check_mysql/check_pgsql</entry></row>
      <row><entry>RabbitMQ</entry><entry><link xlink:href='https://github.com/jamesc/nagios-plugins-rabbitmq'>nagios rabbitmq plugin</link></entry></row>
      <row><entry>libvirt</entry><entry>check_libvirt</entry></row>
      <row><entry>dnsmasq</entry><entry>check_dhcp</entry></row>
      <row><entry>nova-api</entry><entry>check_http</entry></row>
      <row><entry>nova-scheduler</entry><entry>check_procs</entry></row>
      <row><entry>nova-compute</entry><entry>check_procs</entry></row>
      <row><entry>nova-network</entry><entry>check_procs</entry></row>
      <row><entry>keystone-api</entry><entry>check_http</entry></row>
      <row><entry>glance-api</entry><entry>check_http</entry></row>
      <row><entry>glance-registry</entry><entry>check_http</entry></row>
      <row><entry>server availability</entry><entry>check_ping</entry></row>
    </tbody>
  </tgroup>
</informaltable>
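<para>All of these checks follow the same Nagios plugin convention: print one status line and
exit with 0 (OK), 1 (WARNING), 2 (CRITICAL), or 3 (UNKNOWN). The sketch below illustrates that
convention for an HTTP-style API check. It is not the check_http implementation. The function
name evaluate_api_check and the default thresholds are assumptions made for this example.</para>

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate_api_check(status_code, response_seconds,
                       warn_seconds=1.0, crit_seconds=5.0):
    """Map one HTTP probe result to a Nagios exit code and status line.

    status_code is None when the request could not be completed.
    The thresholds are illustrative defaults, not values from check_http.
    """
    if status_code is None:
        return CRITICAL, "CRITICAL - no response from API"
    if status_code >= 500:
        return CRITICAL, "CRITICAL - HTTP %d" % status_code
    detail = "HTTP %d in %.2fs" % (status_code, response_seconds)
    if response_seconds >= crit_seconds:
        return CRITICAL, "CRITICAL - " + detail
    if response_seconds >= warn_seconds:
        return WARNING, "WARNING - " + detail
    return OK, "OK - " + detail
```

<para>A wrapper script would perform the actual request, print the returned status line, and
pass the returned code to sys.exit so that Nagios can interpret the result.</para>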
</section>
</chapter>