
Introduction of a central agent for archipel #650

Merged
merged 1 commit into from

4 participants

@nicolasochem

Archipel is a mostly decentralized system but some common features of orchestrators require centralized tracking of the vms and hypervisors, namely high availability, billing and more. This PR introduces some elements of centralization.

It has been tested only in a simulated environment using nested virtualization. The test code is added in a new folder, ArchipelTest. It is a fully integrated test environment for the archipel agent. For now it covers only the central agent functionality.

Testing is required on a real hypervisor environment. Some UI work is also required.
Documentation draft :
https://gist.github.com/nicolasochem/4956799

PR details:

1/ The Central Agent

Manages the central db. Code is in archipel-central-agent

Its files are completely separate from archipel-agent's; the conf file, for example, is different.

The schema of this central db consists of a "hypervisor" table and a "vm" table. It replaces the previously existing vmparking database. The vmparking functionality is unchanged for now.
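
For reference, here is the layout of the two tables, as created by the "create table" statements visible in the diff further down this page; a minimal sqlite3 sketch (the file name matches the default in archipel-central-agent.conf):

import sqlite3

# Central db layout, mirroring the create-table statements from the diff.
db = sqlite3.connect("central_db.sqlite3", check_same_thread=False)
db.execute("create table if not exists vms (uuid text unique on conflict replace, "
           "parker string, creation_date date, domain string, hypervisor string)")
db.execute("create table if not exists hypervisors (jid text unique on conflict replace, "
           "last_seen date, status string, stat1 int, stat2 int, stat3 int)")
db.commit()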

Uses PubSub for keepalive events and more (e.g. a platform request will add an indication of which stats are required).
There can be several central agent instances on the same pubsub, but there is one and only one active central agent at a time. When becoming the central agent (or starting), it re-queries all entities, hypervisors and vms, to rebuild its database from scratch.
Several central agent instances need to share the same centraldb.sqlite3 through shared storage.
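
The takeover logic can be seen in the on_xmpp_loop_tick diff further down this page; condensed, each standby instance in "auto" mode promotes itself only after not hearing a keepalive for the timeout plus a random extra delay. A sketch of that check, using the names from the diff:

import datetime

ARCHIPEL_CENTRAL_AGENT_TIMEOUT = 10  # seconds, as defined in centralagent.py

def should_become_central_agent(self):
    # Wait the timeout plus a random fraction of it, so that two standby
    # instances do not promote themselves at the same time.
    central_agent_timeout = ARCHIPEL_CENTRAL_AGENT_TIMEOUT * (1 + self.random_wait)
    elapsed = (datetime.datetime.now() - self.last_keepalive_heard).seconds
    return (self.central_agent_mode == "auto"
            and not self.is_central_agent
            and elapsed > central_agent_timeout)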

HOOK on IQ to read/write in the central database from the plugins

Checks the keepalive for each hypervisor to update the vm and hypervisor status in the central database.

2/ The Centraldb plugin

It is the connecting point to the central db for all hypervisors. Code is in archipel-agent-centraldb.

Every hypervisor's centraldb plugin subscribes to the central agent pubsub, whose purpose is to periodically advertise the central agent jid.
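
The keepalive node is /archipel/centralagentkeepalive (see centralagent.py in the diff). The real plugin goes through archipelcore's pubsub helpers; purely as an illustration, the subscription boils down to a standard XEP-0060 subscribe, something like:

import xmpp

ARCHIPEL_KEEPALIVE_PUBSUB = "/archipel/centralagentkeepalive"

def subscribe_to_keepalive(xmppclient, own_jid, pubsub_service):
    # Subscribe to the node on which the active central agent periodically
    # publishes its keepalive (and therefore its jid).
    iq = xmpp.Iq(typ="set", to=pubsub_service)
    pubsub = iq.addChild("pubsub", namespace="http://jabber.org/protocol/pubsub")
    pubsub.addChild("subscribe", attrs={"node": ARCHIPEL_KEEPALIVE_PUBSUB, "jid": own_jid})
    xmppclient.send(iq)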

For virtual machine entities: registers on HOOK define/initialize to update the central database, and on terminate to remove the vm from the central db (only if not parked).

For hypervisor entities: registers on HOOK xmpp_authenticate to subscribe to pubsub events. From there, for each keepalive event received:

starting or force-update : update central db with all vm and hypervisor entities
keepalive : update last seen or stats if needed

Provides plenty of hooks to do every CRUD operation on the central db. These commands translate into iqs for read/write operations, sent to the central agent.
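
For illustration, a write looks roughly like the register_hypervisors / update_hypervisors iqs visible in the logs at the end of this page. A sketch of how a plugin could build one with xmpppy (the exact child layout is modelled on the read_vms_started_elsewhere stanza in those logs, so treat it as an approximation):

import xmpp

def build_central_db_write_iq(central_agent_jid, own_jid, entries, action):
    # entries is a list of dicts; each dict becomes one <entry> of key/value
    # items that the central agent feeds to an sqlite "executemany".
    iq = xmpp.Iq(typ="set", to=central_agent_jid)
    query = iq.addChild("query", namespace="archipel:centralagent")
    archipel = query.addChild("archipel", attrs={"action": action})
    event = archipel.addChild("event", attrs={"jid": own_jid})
    for entry in entries:
        entry_node = event.addChild("entry")
        for key, value in entry.items():
            entry_node.addChild("item", attrs={"key": key, "value": value})
    return iq

In the logs, register_hypervisors is committed with jid, last_seen, status and stat1-3 entries, so that is the kind of dict that would be passed here.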

An exit proc is added: the function archipel_exit_proc is executed when the agent is interrupted. It removes the hypervisor from the central db before archipel stops.
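
A minimal sketch of that shape (the plugin call name used here, unregister_hypervisor(), is a placeholder for whatever the centraldb plugin actually exposes):

import signal as sig

def install_exit_proc(centraldb_plugin):
    def archipel_exit_proc(signum, frame):
        # Remove this hypervisor from the central db before archipel stops.
        centraldb_plugin.unregister_hypervisor()  # placeholder name
        raise SystemExit(0)
    sig.signal(sig.SIGINT, archipel_exit_proc)
    sig.signal(sig.SIGTERM, archipel_exit_proc)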

3/ Platform Request plugin for central agent

code is in archipel-platformrequest-defaultcomputingunit and archipel-agent-hypervisor-platformrequest

Here is how it works: you send an iq to the central agent:

<iq xmlns="jabber:client" to="archipel-hyp-3.archipel.priv@archipel-test.archipel.priv/archipel-hyp-3.archipel.priv" type="get"><query xmlns="archipel:centralagent:platform"><archipel action="request" limit="10" /></query></iq>

It returns the scores of the top hypervisors:

<iq xmlns="jabber:client" to="professeur@archipel-test.archipel.priv/professeur" from="archipel-hyp-3.archipel.priv@archipel-test.archipel.priv/archipel-hyp-3.archipel.priv" id="professeur-0.0474851987786-137" type="result"><query xmlns="archipel:centralagent:platform" /><hypervisor jid="archipel-hyp-2.archipel.priv@archipel-test.archipel.priv/archipel-hyp-2.archipel.priv" score="0.001431875" /><hypervisor jid="archipel-hyp-1.archipel.priv@archipel-test.archipel.priv/archipel-hyp-1.archipel.priv" score="0.003019" /></iq>

The score is based on
1. number of vms running
2. amount of free ram on the hypervisor

The free ram is written to the central db every time the central agent sends a keepalive. This may be a big load on the central db, but we'll see how it performs in the real world.
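
The actual formula lives in the default computing unit egg and is not shown in this thread; purely as an illustration, a scoring pass over the central db could look like the following (this assumes free ram ends up in one of the stat columns of the hypervisors table, and db is an sqlite3 connection to the central db):

# Hypothetical scoring sketch; the real computation may weigh things differently.
score_statement = ("select jid, "
                   "(select count(*) from vms where vms.hypervisor = hypervisors.jid), "
                   "stat1 "  # assumed to hold free ram
                   "from hypervisors where status = 'Online'")
scores = []
for jid, vm_count, free_ram in db.execute(score_statement):
    # Fewer vms on the hypervisor and more free ram give a lower (better) score.
    scores.append((jid, (vm_count + 1.0) / max(free_ram, 1)))
scores.sort(key=lambda pair: pair[1])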

I respected the architecture implemented previously: there is a dummy score computing file which is extended by the python egg "default computing unit". It is possible to build alternate computing methods in a different egg.

The computing unit tells hypervisors which statistics to report, as the central agent keepalive message advertises them (currently free ram, but it could be anything else in the hypervisor_health db).
The client sends the request. It's not implemented yet (maybe a "new vm wizard").

4/ VMParking plugin to archipel agent adapted to centraldb

Parks the vm in the central db so it doesn't exist anymore on the hypervisor side.

Previously it was using a shared sqlite file. All sqlite operations were in-band.

This has been adapted to work with read/write iqs. As a result, some operations are asynchronous.

Everything is backwards compatible. But the updated client should query the centraldb directly instead of passing through the hypervisor. Eventually, only park and unpark operations should remain.

5/ Test suite

Code is in ArchipelTest

This should be moved to another repository. I will remove it before merge. For now it's more convenient for me.

All key functionalities of central agent are tested. There is a README file.

Here is the current list of tests. Each test queries the central db sqlite3 file directly and checks that its state is consistent. It also parses the logs for ERRORs (see the sketch after the list).

Start test 1 : Hypervisors come online.
Start test 2 : Subscribe to hypervisor.
Start test 3 : Hypervisors are in central database.
Start test 4 : Create undefined VMs.
Start test 5 : Delete undefined VM by sending stanza to hypervisor.
Start test 6 : Delete undefined VM by sending stanza to vm.
Start test 7 : Define vm.
Start test 8 : Delete defined VM by sending stanza to hypervisor.
Start test 9 : Delete defined VM by sending stanza to vm.
Start test 10 : Create vms directly in parking.
Start test 11 : Unpark one vm in each hypervisor.
Start test 12 : Park vms using the hypervisor park command.
Start test 13 : Park vms using the vm park command.
Start test 14 : Unpark and start multiple vms in both hypervisors at the same time.
Start test 15 : List parked vms.
Start test 16 : Destroy and park multiple vms in both hypervisors at the same time.
Start test 17 : Graceful shutdown of the hypervisor, checks status is 'off'.
Start test 18 : Ungraceful shutdown of the hypervisor, checks status is 'unreachable'.
Start test 19 : Restart hypervisor with vms, checks that vm xmpp entities are instantiated.
Start test 20 : When central agent is off, restart hypervisor, check that vm xmpp entities are instantiated.
Start test 21 : Hypervisor switches off and on, finds out one of its vms has been started somewhere else, and deletes it locally.
Start test 22 : Live migration of offline hypervisor.
Start test 23 : Test score computing iq.
Start test 24 : Xml update in parking, should pass.
Start test 25 : Faulty xml update in parking, should fail.
Start test 26 : Delete all vms from parking.
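
A minimal sketch of the per-test check described above (the helper name and the expected-count argument are hypothetical; the real harness lives in ArchipelTest):

import sqlite3

def check_central_db_and_logs(db_path, log_paths, expected_vm_count):
    # Query the central db directly and check its state matches the scenario.
    db = sqlite3.connect(db_path)
    vm_count = db.execute("select count(*) from vms").fetchone()[0]
    assert vm_count == expected_vm_count, "unexpected vm count: %d" % vm_count
    # Fail if any agent logged an ERROR during the scenario.
    for path in log_paths:
        for line in open(path):
            assert "ERROR" not in line, "error found in %s: %s" % (path, line)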

@primalmotion

First, this is great work!

Now, as discussed offline, this needs some small architectural design updates.

  • to me, letting hypervisors become central agents is confusing and will prevent showing special tab modules in the UI. We need to have separate TNArchipelEntities with a distinct VCard type, running in their own process, completely separated from the agents.

  • The core of the central agent needs to maintain a generic database of all existing entities and their status, coupled with a ping mechanism to keep the status of entities coherent.

  • It should provide an XMPP API, an internal API, and HOOKS to allow agents and internal modules to communicate with it (see the graph below).

  • It should do nothing other than maintain the central platform db state.

  • It should load modules like parking, HA, platform stats and whatever we can imagine (and I have a lot of imagination :) )

[centralagent architecture diagram]

@nicolasochem

1. In my current implementation, the APIs of the central agent provide a grammar to execute sql statements with several parameters (a sort of ACP version of the "executemany" sqlite3 command). Is there a need to make such a specific grammar, considering that the only entity which can add and remove vms and hypervisors from the central database is the hypervisor itself?

2. Is there a need for a list of unavailable hypervisors in the central db since we already have one in the roster? Would that not be a duplication? Then we would have to make sure these 2 lists (central db and roster) are consistent with each other.
ArchipelAgent/archipel-agent/install/bin/runarchipel
@@ -302,33 +304,54 @@ def main(config):
@type config: ConfigParser
@param config: the configuration
"""
- jid = xmpp.JID(config.get("HYPERVISOR", "hypervisor_xmpp_jid"))
- password = config.get("HYPERVISOR", "hypervisor_xmpp_password")
- database = config.get("HYPERVISOR", "hypervisor_database_path")
- name = config.get("HYPERVISOR", "hypervisor_name")
-
+ entity_type = config.get("GLOBAL","entity_type")
+ if entity_type and entity_type=="centralagent":
+ is_central_agent = True
@primalmotion Owner

tabs problem here

...Agent/archipel-agent/archipel/archipelCentralAgent.py
@@ -0,0 +1,162 @@
+# -*- coding: utf-8 -*-
@primalmotion Owner

Why is this in Archipel Agent?

This should be a completely separate project with its own init script, own conf file, own installer, etc.

...Agent/archipel-agent/archipel/archipelCentralAgent.py
((37 lines not shown))
+
+from archipelcore.archipelAvatarControllableEntity import TNAvatarControllableEntity
+from archipelcore.archipelEntity import TNArchipelEntity
+from archipelcore.archipelHookableEntity import TNHookableEntity
+from archipelcore.archipelTaggableEntity import TNTaggableEntity
+from archipelcore.utils import build_error_iq, build_error_message
+
+#from archipelLibvirtEntity import ARCHIPEL_NS_LIBVIRT_GENERIC_ERROR
+#from archipelVirtualMachine import TNArchipelVirtualMachine
+#import archipelLibvirtEntity
+
+# XMPP shows
+ARCHIPEL_XMPP_SHOW_ONLINE = "Online"
+
+
+class TNArchipelCentralAgent (TNArchipelEntity, TNHookableEntity, TNTaggableEntity, TNAvatarControllableEntity):
@primalmotion Owner

I'm not sure it needs to inherit from TNTaggableEntity and TNAvatarControllableEntity

...Agent/archipel-agent/archipel/archipelCentralAgent.py
((97 lines not shown))
+ def update_presence(self, origin=None, user_info=None, parameters=None):
+ """
+ Set the presence of the hypervisor.
+ @type origin: L{TNArchipelEntity}
+ @param origin: the origin of the hook
+ @type user_info: object
+ @param user_info: random user info
+ @type parameters: object
+ @param parameters: runtime arguments
+ """
+ status = "%s" % ARCHIPEL_XMPP_SHOW_ONLINE
+ self.change_presence(self.xmppstatusshow, status)
+
+ ### Overrides
+
+ def set_custom_vcard_information(self, vCard):
@primalmotion Owner

I'm not sure we need this level of configuration of the central agent's VCARD. It should only have a name, based on its JID node and its central-agent type.

...Agent/archipel-agent/archipel/archipelCentralAgent.py
@@ -0,0 +1,162 @@
+# -*- coding: utf-8 -*-
+#
+# archipelHypervisor.py
@primalmotion Owner

:)

...el-central-agent/archipelcentralagent/centralagent.py
((29 lines not shown))
+from archipelcore.utils import build_error_iq
+
+ARCHIPEL_CENTRAL_AGENT_KEEPALIVE = 4 #seconds
+ARCHIPEL_CENTRAL_AGENT_TIMEOUT = 10 #seconds
+ARCHIPEL_CENTRAL_HYP_PING_FREQUENCY = 30 #ticks
+ARCHIPEL_CENTRAL_HYP_PING_TIMEOUT = 60 #seconds
+
+# this pubsub is subscribed by all hypervisors and carries the keepalive messages
+# for the central agent
+ARCHIPEL_KEEPALIVE_PUBSUB = "/archipel/centralagentkeepalive"
+
+ARCHIPEL_NS_CENTRALAGENT = "archipel:centralagent"
+
+ARCHIPEL_ERROR_CODE_CENTRALAGENT = 123
+
+class TNCentralAgent (TNArchipelPlugin):
@primalmotion Owner

I think this should not be a plugin. This class contains all the core features of the central agent, and to me, it should be in TNArchipelCentralAgent.

...el-central-agent/archipelcentralagent/centralagent.py
((59 lines not shown))
+
+ if self.entity.__class__.__name__ == "TNArchipelVirtualMachine":
+ self.entity.register_hook("HOOK_VM_DEFINE", method=self.hook_vm_event)
+ self.entity.register_hook("HOOK_VM_INITIALIZE", method=self.hook_vm_event)
+ self.entity.register_hook("HOOK_VM_TERMINATE", method=self.hook_vm_terminate)
+
+ self.is_standalone = False
+ self.central_agent_jid_val = None
+ if self.entity.__class__.__name__ == "TNArchipelCentralAgent":
+ self.is_standalone = True
+
+ self.xmpp_authenticated=False
+ self.is_central_agent = False
+ self.salt = random.random()
+ self.random_wait = random.random()
+ self.database = sqlite3.connect(self.configuration.get("VMPARKING", "database"), check_same_thread=False)
@primalmotion Owner

As this DB is not only for VMParking, we should rename it to something else.

...el-central-agent/archipelcentralagent/centralagent.py
((131 lines not shown))
+ raise xmpp.protocol.NodeProcessed
+
+ ### Pubsub management
+
+ def hypervisor_hook_xmpp_authenticated(self, origin=None, user_info=None, arguments=None):
+ """
+ Triggered when we are authenticated. Initializes everything.
+ @type origin: L{TNArchipelEnity}
+ @param origin: the origin of the hook
+ @type user_info: object
+ @param user_info: random user information
+ @type arguments: object
+ @param arguments: runtime argument
+ """
+
+ self.xmpp_authenticated=True
@primalmotion Owner

style

...el-central-agent/archipelcentralagent/centralagent.py
((229 lines not shown))
+ vm_table=[]
+ for vm,vmprops in self.entity.virtualmachines.iteritems():
+ vm_table.append({"uuid":vmprops.uuid,"parker":None,"creation_date":None,"domain":vmprops.definition,"hypervisor":self.entity.jid})
+ if len(vm_table)>=1:
+ self.commit_to_db("insert into vms values(:uuid, :parker, :creation_date, :domain, :hypervisor)",vm_table)
+ self.commit_to_db("insert into hypervisors values(:jid)",[{"jid":self.entity.jid}])
+
+ def handle_central_database_event(self,iq):
+ """
+ Called when the central agent receives a database update.
+ @type iq: xmpp.Iq
+ @param event: received Iq
+ """
+ try:
+ reply = iq.buildReply("result")
+ self.entity.log.debug("CENTRALAGENT: I've got a central database event : %s"%iq)
@primalmotion Owner

style

...el-central-agent/archipelcentralagent/centralagent.py
((256 lines not shown))
+ self.database.commit()
+ else:
+ raise Exception("CENTRALAGENT: we are not central agent")
+ except Exception as ex:
+ reply = build_error_iq(self, ex, iq, ARCHIPEL_ERROR_CODE_CENTRALAGENT)
+ return reply
+
+
+ ### Ping functionality
+
+ def spread_pings(self):
+ """
+ We spread out the hypervisors to be pinged evenly across every tick of the cycle.
+ This way, there is no burst of pings every beginning of cycle.
+ """
+ hyp_list=self.get_all_hypervisors()
@primalmotion Owner

Style again. I'll stop commenting on that, but please use the coding guidelines :)

@primalmotion

1. In my current implementation, the APIs of the central agent provide a grammar to execute sql statements with several parameters (a sort of ACP version of the "executemany" sqlite3 command). Is there a need to make such a specific grammar, considering that the only entity which can add and remove vms and hypervisors from the central database is the hypervisor itself?

Well, I really don't like the idea of giving full database access from the API. Please make some proxies for different functionalities. The code will be more maintainable, we can control a lot of things, and the API stays clear.

Is there a need for a list of unavailable hypervisors in the central db since we already have one in the roster? Would that not be a duplication? Then we would have to make sure these 2 lists (central db and roster) are consistent with each other.

I see that when an entity is not responding to ping, you remove it. And I also think we should maintain the list of unavailable hypervisors. But instead of having a second list, why not add some columns in the DB to mark the last_seen date: if null, it means that the hypervisor is alive; if there is a date, then we can have a threaded job that periodically clears unavailable entities after a while (like a few days, configurable).
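
As an illustration of the suggestion only (this is not what the PR implements), the periodic job could be as simple as:

import datetime
import threading

def clear_stale_hypervisors(db, max_age_days=3):
    # Drop hypervisors whose last_seen date is older than the configured threshold.
    cutoff = datetime.datetime.now() - datetime.timedelta(days=max_age_days)
    db.execute("delete from hypervisors where last_seen is not null and last_seen < ?",
               (cutoff.strftime("%Y-%m-%d %H:%M:%S"),))
    db.commit()
    # Re-arm periodically, as the threaded job mentioned above.
    threading.Timer(3600, clear_stale_hypervisors, args=(db, max_age_days)).start()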

@nicolasochem

Thanks for the review. Another question: you say that parking should be a plugin of centralagent. But a lot of code for the parking plugin has to run on every hypervisor (for example, the parking acps go to each hypervisor when you park or unpark vms from the ui).

Similarly, some "central agent" code is run on every hypervisor. How shall I proceed? I think leaving it as it is today in my branch is best.

@primalmotion

we should have two parking modules. One for the central agent, which actually does the job, and one for the hypervisor, very simple, that simply supports park, unpark, delete, edit, but just forwards the request from the user to the central agent.

It's ok to add different layers. The more it is layered, the more maintainable and future-ready it is, especially for this kind of asynchronous real-time program :)

@Nowaker

Great work @nicolasochem! Archipel indeed needs some central agent to keep track of the whole data center.

Can you guys explain what "vm parking" is? This feature doesn't work in current Archipel, so I have no idea what it is about.

@nicolasochem

@primalmotion , I am making the central agent a totally separate entity with its own scripts. I am creating the files:

  • archipel-central-agent-initinstall
  • archipel-central-agent.conf
  • /etc/init.d/archipel-central-agent

I have a "generic" runarchipel which works for both.

My concern is that I have 2 identical "runarchipel" files in 2 locations in the repo (archipel-agent/install/bin and archipel-central-agent/install/bin). Can I move this file to archipel-core?

(what I describe is not committed yet)

@primalmotion

Awesome.

We can move it to archipel-core, but if that's the case, it needs to be completely generic, in the sense that it could start any program based on TNArchipelEntity.

The simplest way, I think, would be to keep runarchipel as it is now, and have runcentralagent or something like that do the job. My concern about keeping the same initial "binary" is that you won't be able to start a central agent and a standard agent on the same machine.

@nicolasochem

Some more remarks:

  • about making more detailed apis for central agent operations: in the current implementation, all write operations to the central db go through xmpp, but the read operations are done directly with the sqlite3 module. Therefore I won't implement all the "get" commands in your diagram above.
  • about an available/unavailable flag for hypervisors: I will implement 3 columns: hypervisor jid (like today), last_seen date, and "status", where status can be "available", "off" and "unreachable", to distinguish between hypervisors which have been switched off properly and improperly (crash, unplugging, connectivity issues, etc.)

Let me know if you have comments on that.

@primalmotion

about making more detailed apis for central agent operations: in the current implementation, all write operations to the central db go through xmpp, but the read operations are done directly with the sqlite3 module. Therefore I won't implement all the "get" commands in your diagram above.

Even if we don't have any usage for it now, we could use them later. For now you could just implement the methods and raise a Not Implemented exception. We'll see if users complain about it.

about an available/unavailable flag for hypervisors: I will implement 3 columns: hypervisor jid (like today), last_seen date, and "status", where status can be "available", "off" and "unreachable", to distinguish between hypervisors which have been switched off properly and improperly (crash, unplugging, connectivity issues, etc.)

Seems perfect

@nicolasochem

@primalmotion ok for review

@primalmotion

please sync with the latest master. It'll be easier for me to review. Thanks!

@nicolasochem

documentation up-to-date with latest changes
https://gist.github.com/nicolasochem/4956799

@primalmotion

Thanks, I'll try to review this as fast as I can, but it's a big one :)

@primalmotion
Owner

please rebase, I'll have some time to test and review this week normally :)

@nicolasochem

@primalmotion : I have a working implementation of platform request.

Here is how it works: you send an iq to the central agent:

<iq xmlns="jabber:client" to="archipel-hyp-3.archipel.priv@archipel-test.archipel.priv/archipel-hyp-3.archipel.priv" type="get"><query xmlns="archipel:centralagent:platform"><archipel action="request" limit="10" /></query></iq>

It returns the scores of the top hypervisors:

<iq xmlns="jabber:client" to="professeur@archipel-test.archipel.priv/professeur" from="archipel-hyp-3.archipel.priv@archipel-test.archipel.priv/archipel-hyp-3.archipel.priv" id="professeur-0.0474851987786-137" type="result"><query xmlns="archipel:centralagent:platform" /><hypervisor jid="archipel-hyp-2.archipel.priv@archipel-test.archipel.priv/archipel-hyp-2.archipel.priv" score="0.001431875" /><hypervisor jid="archipel-hyp-1.archipel.priv@archipel-test.archipel.priv/archipel-hyp-1.archipel.priv" score="0.003019" /></iq>

The score is based on
1. number of vms running
2. amount of free ram on the hypervisor

The free ram is written to the central db every time the central agent sends a keepalive. This may be a big load on the central db, but we'll see how it performs in the real world.

I respected the architecture that you implemented: there is a dummy score computing file which is extended by the python egg "default computing unit". It is possible to build alternate computing methods in a different egg.

The computing unit tells hypervisors which statistics to report, as the central agent keepalive message advertises them (currently free ram, but it could be anything else in the hypervisor_health db).

@CyrilPeponnet

Maybe it could be even better to check the load average of the hosting machine. I mean:

Check if a server has enough RAM / storage to host a new VM -> compute a hosting capability score
Check the load average of hosts -> compute a load state score

Ex: any host with load average > 80% will not be chosen even if it has plenty of RAM/storage.

The number of vms is irrelevant (from my point of view); I can have several tiny VMs or 2 big ones with the same load average per host.

@nicolasochem

I never really paid attention to the loadavg (actually I just read about what it is now) and it could sure be added to the calculation. I think it is only interesting if your hypervisor is doing something other than running virtual machines.

The number of vms is still interesting to "spread" the vms across hypervisors, even when they are off.

In my implementation you can add any statistic you like (loadavg: 1 min, 5 min or 15 min?) and it will be stored in the central db; then you can modify the sqlite statement to take it into account, or you can even build another score calculator altogether.

In the end it is just a suggestion (the end user chooses where to start the vm; possibly in a "new vm wizard" scenario, the hypervisor with the best score will just be the default choice).

@CyrilPeponnet

It makes sense, and as I am thinking we have to deal with different kinds of hypervisors (xen, KVM and why not LXC), we should let the user choose the best environment that fits his needs. Anyway, this part is only a detail for now; I need to focus on your amazing work around the central agent. I will certainly come back with questions about it :) I have to take a deep dive into your code for now. Thanks.

...l-agent-centraldb/archipelagentcentraldb/centraldb.py
((51 lines not shown))
+ @type entry_point_group: string
+ @param entry_point_group: the group name of plugin entry_point
+ """
+ TNArchipelPlugin.__init__(self, configuration=configuration, entity=entity, entry_point_group=entry_point_group)
+ if self.entity.__class__.__name__ == "TNArchipelHypervisor":
+ self.entity.register_hook("HOOK_ARCHIPELENTITY_XMPP_AUTHENTICATED", method=self.hypervisor_hook_xmpp_authenticated)
+
+ if self.entity.__class__.__name__ == "TNArchipelVirtualMachine":
+ self.entity.register_hook("HOOK_VM_DEFINE", method=self.hook_vm_event)
+ self.entity.register_hook("HOOK_VM_INITIALIZE", method=self.hook_vm_event)
+ self.entity.register_hook("HOOK_VM_TERMINATE", method=self.hook_vm_terminate)
+
+ self.central_agent_jid_val = None
+
+ self.xmpp_authenticated = False
+ self.database = sqlite3.connect(self.configuration.get("CENTRALDB", "database"), check_same_thread=False)
@CyrilPeponnet Owner

Not sure I understand the purpose of a db on the agent side. It seems no longer used since you split the central agent from the agent. Or did I miss something :)

Thanks for having a look. "centraldb" is the agent's module to read and write to central db.
For writing, it sends iqs to the central agent. But it reads the central db directly by connecting to the sqlite.
I don't know whether it is a good idea but I found that sqlite supports concurrent reads quite well - for concurrent writes it's a different story.

@CyrilPeponnet Owner

So the central agent must have the database on shared storage?

@CyrilPeponnet CyrilPeponnet commented on the diff
...gent/archipel-agent/archipel/archipelLibvirtEntity.py
@@ -24,6 +24,7 @@
import time
import random
import sys
+import traceback
@CyrilPeponnet Owner

Is it really required, or just for debugging purposes when setting up the central agent?

I left it here (and in other parts of the code). When we run archipel in "debug" mode, tracebacks appear in the log. Otherwise you just have the error msg without any line number. So I ended up adding this line permanently. It's useful for everyone.

...ent/archipel-agent/install/etc/archipel/archipel.conf
((8 lines not shown))
#
-[VMPARKING]
-
-# path for shared parking database file
-database = %(archipel_folder_data)s/shared_parking.sqlite3
+[CENTRALDB]
+# location of the central agent database. Must be readable by all hypervisors and central agents.
+database = %(archipel_folder_data)s/central_db.sqlite3
@CyrilPeponnet Owner

Not sure I understand why we need a centraldb on each hypervisor.

Same answer as above.

...-agent/install/bin/archipel-central-agent-initinstall
((51 lines not shown))
+ sys.exit(0)
+
+ warn("Ok. you have been warned...")
+ msg("cleaning old existing files")
+ msg("cleaning init script from %s/etc/init.d/archipel-central-agent" % prefix)
+ os.system("rm -rf %s/etc/init.d/archipel-central-agent" % prefix)
+ msg("Cleaning configuration file from %s/etc/archipel" % prefix)
+ os.system("rm -rf %s/etc/archipel" % prefix)
+ msg("Cleaning log files %s/var/log/archipel" % prefix)
+ os.system("rm -rf %s/var/log/archipel" % prefix)
+ success("Previous installation cleaned")
+ except Exception as ex:
+ error(str(ex))
+
+
+def install_init_script(prefix, init_script):
@CyrilPeponnet Owner

You should merge systemd support here (check the archipel-initinstall).

@CyrilPeponnet CyrilPeponnet commented on the diff
...pelAgent/archipel-core/archipelcore/archipelEntity.py
@@ -1269,7 +1271,7 @@ def loop(self):
else:
self.log.error("LOOP EXCEPTION : Disconnected from server. Trying to reconnect in 5 seconds.")
t, v, tr = sys.exc_info()
- self.log.error("TRACEBACK: %s" % traceback.format_exception(t, v, tr))
+ self.log.error("TRACEBACK: %s" % "\n".join(traceback.format_exception(t, v, tr)))
@CyrilPeponnet Owner

Regarding my previous remark on tracebacks: they could be useful for easier debugging. But you should only log them in debug state (log.debug). And it could be smart to load the module only in debug state (to minimise the footprint).

That traceback exists on master - I just split it correctly.
What you suggest could be part of another PR.

@CyrilPeponnet Owner

Ok my bad forget about it.

@CyrilPeponnet

To be sure, here are some questions : (please correct me)

  • CentralDB is an agent plugin hooking events and filling the central-agent db through xmpp

  • Vmparking becomes a central-agent plugin; the agent vmparking plugin will talk to the central-agent (fuzzy for me here)
    (note: in this case does vmparking need a shared storage between hypervisors?)

  • PlatformRequest becomes a central-agent plugin listening for events.

If you can sum up how it works (maybe with a diagram) it could help :).

Thanks.

@nicolasochem

CentralDB is an agent plugin hooking events and filling the central-agent db through xmpp

Yes

Vmparking becomes a central-agent plugin; the agent vmparking plugin will talk to the central-agent (fuzzy for me here)

Yes, at least in the current form, so that I do not break any existing functionality of vmparking.
However, I never liked the concept of vmparking in its current form. I think some client work is needed and we need to introduce the concepts of "unattached vms" and "orphan vms", which today are grouped into the concept of "vmparking".

(note: in this case does vmparking need a shared storage between hypervisors?)

not any more

PlatformRequest becomes a central-agent plugin listening for events.

platform request does not listen to events; it just makes a complicated query to the central db, returning a list of likely hypervisor candidates for vm start.

@CyrilPeponnet

Hi Nicolas,

I definitely think that the central agent must be independent of any shared folder between hypervisors and central-agent. You can use shared storage (a block device like iscsi/aoe/fc) as a storage backend, or you can even use non-shared storage just to have a failover for a frontend web server, for example (active/passive).

Let me try suggesting something:

If there is no centraldb instance, vmparking uses a local database as before.

If there is a centraldb instance, vmparking uses the central-agent database.

It seems you focus your amazing work on a sort of HA (if a hypervisor dies, reload the vm on another node).

From my point of view, the central agent should be a central database to store/get things, and the HA or other things (like platformrequest) should be added as plugins to this central agent.

@primalmotion What is your advice / point of view about this ?

Thanks both :)

@nicolasochem

I will remove the requirement of sharing central db among hypervisors.

The parking db was always shared between hypervisors. That's how it's meant to be.

There is no such thing as automated reloading of a vm on another hypervisor when one dies. You just have such orphan vms appearing in the parking.

HA is possible to implement based on this. It probably should be decoupled as a plugin as you say.

@primalmotion

I definitely think that the central agent must be independent of any shared folder between hypervisors and central-agent

Does this mean that for now, in the current implementation, the central agent and hypervisors have/can access the same sqlite file? In that case I also think this should not be.

For the parking, previously it didn't necessarily force users to have a shared file. Each hypervisor can have its own parking sqlite file, even if in that case the use of parking is disputably useless :)
I'm also not a big fan of the two modes. If we move the feature to the central agent (where it should be, in my opinion), then let's drop the current per-hypervisor mode. The parking will become a sort of HA storage pool later, and that's great.

...l-agent-vmparking/archipelagentvmparking/vmparking.py
@@ -506,7 +479,7 @@ def iq_list(self, iq):
nodes = []
for parked_vm in parked_vms:
vm_node = xmpp.Node("virtualmachine", attrs=parked_vm["info"])
- if parked_vm["domain"].getTag('description'):
+ if parked_vm["domain"] and parked_vm["domain"].getTag('description'):
@CyrilPeponnet Owner

This test doesn't seem really useful. Yeah, I know it's nothing, but I came across it so... :)

Nope, it checks that "domain" is not null, which is the case when the vm is undefined.

@CyrilPeponnet Owner

Sure, but if parked_vm["domain"] is null there is no tag 'description' available, so it's useless IMO as you're processing with an AND combination :).

Hit me if I'm wrong :p

@CyrilPeponnet

One more thing, to understand the central parked feature.

As I see it, all vms seem to be centrally parked in the central db. A non-running centrally parked vm will appear in the parking, ready to be unparked on another hypervisor.

It seems I'm starting to better understand how you built this. Need more time to get it :)

A full step-by-step walkthrough would be great to fully understand your approach.

@nicolasochem

The parking mode has been fully moved to central agent. No more "per-hypervisor parking"

As I see it, all vms seem to be centrally parked in the central db. A non-running centrally parked vm will appear in the parking, ready to be unparked on another hypervisor.

If it's not running but still defined, it will not appear in parking. It has to be undefined (aka parked) or the hypervisor has to be off.

...elAgent/archipel-agent/archipel/archipelHypervisor.py
((25 lines not shown))
+ if not did_start_elsewhere:
+ self.log.debug("Now starting vm %s" % string_jid)
+ vm_thread = self.create_threaded_vm(jid, password, name, self.vcard_infos)
+ self.virtualmachines[vm_thread.jid.getNode()] = vm_thread.get_instance()
+ vm_thread.start()
+ self.perform_hooks("HOOK_HYPERVISOR_VM_WOKE_UP", vm_thread.get_instance())
+
+ # remove from local db and libvirt vms that exist somewhere else
+ c.executemany("delete from virtualmachines where jid=:jid" , vms_started_elsewhere)
+ self.database.commit()
+ for vm in vms_started_elsewhere:
+ vm_uuid = xmpp.JID(vm["jid"]).getNode()
+ libvirt_vm = self.libvirt_connection.lookupByUUIDString(vm_uuid)
+ if libvirt_vm.info()[0] in [1, 2, 3]:
+ try:
+ libvirt_vm.destroy()
@CyrilPeponnet Owner

As the vm doesn't belong to the current hypervisor anymore, maybe you can use the free function to remove the vm from libvirt but also from the hypervisor database, and remove the associated xmpp container (I guess it's recreated on the other side by the hypervisor holding the started-elsewhere VM).

the problem is that, to use the "free" function, you need the vm to be instantiated. In that case, I don't want to instantiate it. You're right that it's recreated on the other side.

...elAgent/archipel-agent/archipel/archipelHypervisor.py
((9 lines not shown))
jid = xmpp.JID(string_jid)
jid.setResource(self.jid.getNode().lower())
- vm_thread = self.create_threaded_vm(jid, password, name, self.vcard_infos)
- self.virtualmachines[vm_thread.jid.getNode()] = vm_thread.get_instance()
- vm_thread.start()
- self.perform_hooks("HOOK_HYPERVISOR_VM_WOKE_UP", vm_thread.get_instance())
+ # check in central agent db if vm did not start somewhere else
+ did_start_elsewhere = False
+ if self.get_plugin("centraldb"):
@CyrilPeponnet Owner

Can we use a HOOK for this instead of hardcoding a plugin behavior in the main agent?

I was thinking about that too. Will try today.

In the end I did not use a hook, as the behaviour is more complex: if there is a central agent, delay creation of threaded vms until after the first keepalive; otherwise do it straight away.

@CyrilPeponnet

Hi Nicolas, I tried to sum up your work on this PR. Here it is; please correct me if I'm wrong somewhere. You can find some remarks/questions below.

I can see 4 parts in this PR :

1/ The Central Agent (mainly a remote database)
2/ The Centraldb plugin for archipel-agent (to read/write in the remote database)
3/ The Platform Request plugin for central-agent (to reply to platformrequest IQ).
4/ The VMParking plugin for archipel-agent (adapted to centraldb)

Let me detail how I see them working (to check we all understood how it works).

1/ The Central Agent :

  • Use a conf file to get some default values (1)
  • Use PubSub for keepalive events and more (e.g. platform request will add some attrs to ask if enabled)
  • On XMPP_Auth: one and only one active central agent at a time; when becoming the central agent (or starting) it will fetch all the entities for updated data. (2)
  • HOOK on IQ to read/write in the central database from the plugins
  • Check the keepalive for each hypervisor to update the vm and hypervisor status in the central database.

2/ The Centraldb plugin :

  • For virtual machine entities: register on HOOK define/initialize to update the central database and on terminate to remove from the central db (only if not parked).

  • For hypervisor entities: register on HOOK xmpp_authenticate to subscribe to pubsub events. From there, for each keepalive event received:

    • starting or force-update : update central db with all vm and hypervisor entities
    • keepalive : update last seen or stats if needed

3/ Platform Request plugin for central agent

  • Ask for hypervisor stats through the central-agent and compute scoring for the platform request IQ (3)

4/ VMParking plugin to archipel agent adapted to centraldb

  • Park the vm in the central db so it doesn't exist anymore on the hypervisor side (2)

(1) In this conf file, there are a lot of things you can remove, I think. What happens with the JID if the central agent is running on the same host as a hypervisor with archipel-agent?

(2) What happens if I have two central agents, and the one with parked vms dies? The second central agent will start and send a start event to rebuild the centraldb from the hypervisors, but I will lose the definitions for the parked vms stored in the dead central-agent! (as they don't exist anymore on the hypervisor).

(3) Who is sending the request?

One last thing, but a big one: you should also take care of the capabilities of the hypervisor. I mean, if I have a heterogeneous hypervisor park (e.g. xen, kvm and lxc), it shouldn't be possible to unpark an LXC-definition vm on a KVM hypervisor. Another way could be to check in the vmparking plugin the definition of the vm and whether it can run on the hypervisor (if not, don't show it).

Maybe it would be better for review to split up your work:

  • Central db / central agent on one side
  • plugins on the other side

Thanks for your patience :) and your comments.

@nicolasochem

Hi Cyril, thanks for taking the time to write this down. I will use it in the main description of the PR in a bit.

The new version of the PR no longer requires shared storage for the central db. All read/write operations go through IQs. That is quite a big change, as a few operations needed to become 2-way and asynchronous (like unpark and list).

For now, list is still broken. @primalmotion, I tried to do a closure as you recommended, but to no avail: the vmparking_list iq still returns a 501 error, service unavailable. Any chance you could check out my code and point me to what I'm doing wrong (in vmparking.py, function iq_list)? The hypervisor is supposed to send an iq to the central agent and, when it receives the result, answer the iq from the client.

Here are answers to your 3 points:
(1) central agent and agent are totally separated, all files are different, including the conf files.
(2) 2 central agents should have a shared storage. It's a must.
(3) the client sends the request. It's to be implemented (maybe a "new vm wizard"). Ask @primalmotion what he has in mind, I just adapted his code.

Hypervisor capabilities: you're right but today there is no check in vmparking functionality, so it should be done outside of this PR.

I'll be offline for 4 days. I encourage you to install and run my code.

@nicolasochem

@primalmotion: I found the solution to my closure problem.
I had to raise a "NodeProcessed" immediately...
Thanks anyway :)
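
For anyone following along, a rough sketch of the pattern being described (the hypervisor relays the client's iq to the central agent and only answers once the reply comes back; central_agent_jid() and the reply building are placeholders, the real code is in vmparking.py):

import xmpp

def iq_list(self, iq):
    def on_central_reply(conn, resp):
        # Called when the central agent answers; only now can we reply to the client.
        reply = iq.buildReply("result")
        # ... copy the parked vm nodes from resp into reply ...
        self.entity.xmppclient.send(reply)

    central_iq = xmpp.Iq(typ="set", to=self.central_agent_jid())  # placeholder
    central_iq.addChild("query", namespace="archipel:centralagent")
    self.entity.xmppclient.SendAndCallForResponse(central_iq, on_central_reply)
    # Tell the dispatcher right away that the stanza is handled; without this it
    # falls through to the default handler and the client gets the 501 mentioned above.
    raise xmpp.protocol.NodeProcessed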

@nicolasochem

@primalmotion @CyrilPeponnet the main description of pull request is updated.

All read/write operations are going through xmpp, so the sqlite3 file can live on non-shared storage.

The code is in a good enough state for review. Please, consider merging.

@primalmotion primalmotion commented on the diff
...nt/archipel-central-agent/install/bin/runcentralagent
((17 lines not shown))
+#
+# You should have received a copy of the GNU Affero General Public License
+# along with this program. If not, see <http://www.gnu.org/licenses/>.
+
+import optparse
+import os
+import sys
+import socket
+import signal as sig
+
+# Import and check essential modules
+try:
+ from archipelcore.scriptutils import error, msg, success
+except ImportError as ex:
+ print "FATAL: you need to install archipel-core"
+ sys.exit(ARCHIPEL_INIT_ERROR_NO_MODULE)
@primalmotion Owner

All those variables are not imported here.

@primalmotion primalmotion commented on the diff
...nt/archipel-central-agent/install/bin/runcentralagent
((50 lines not shown))
+ @param config: the configuration
+ """
+ jid = xmpp.JID(config.get("CENTRALAGENT", "central_agent_xmpp_jid"))
+ # Set the resource
+ jid.setResource(socket.gethostname())
+
+
+ # Create the archipel main xmpp entity instance
+ password = config.get("CENTRALAGENT", "central_agent_xmpp_password")
+ centralagent = TNArchipelCentralAgent (jid, password, config)
+
+ # Try to connect to XMPP
+ try:
+ centralagent.connect()
+ except Exception as ex:
+ error("Cannot connect using JID %s. Initialization aborted: %s" % (jid, str(ex)), code=ARCHIPEL_INIT_ERROR_CONNECTION)
@primalmotion Owner

ARCHIPEL_INIT_ERROR_CONNECTION: this one too is not imported.

@primalmotion

Do we need to define migration_uri etc in archipel-central-agent.conf?

@primalmotion

It seems that centralagent is not accepting subscription requests. It should be able to, according to the permissions.

...gent/install/etc/archipel/archipel-central-agent.conf
((23 lines not shown))
+#
+# General configuration. You should just need to edit these values
+#
+[DEFAULT]
+
+# the default XMPP server to user
+xmpp_server = PARAM_XMPP_SERVER
+
+# archipel's data folder
+archipel_folder_lib = /var/lib/archipel/
+
+# this UUID will be used to identify the hypervisor
+# internally. It MUST be different foreach one over
+# your platform. You can generate a new one using
+# uuidgen command
+archipel_general_uuid = PARAM_UUID
@primalmotion Owner

Do we need this?

...gent/install/etc/archipel/archipel-central-agent.conf
((27 lines not shown))
+
+# the default XMPP server to user
+xmpp_server = PARAM_XMPP_SERVER
+
+# archipel's data folder
+archipel_folder_lib = /var/lib/archipel/
+
+# this UUID will be used to identify the hypervisor
+# internally. It MUST be different foreach one over
+# your platform. You can generate a new one using
+# uuidgen command
+archipel_general_uuid = PARAM_UUID
+
+# the base working folder, where virtual machine related
+# stuff will be stored
+archipel_folder_data = /vm/
@primalmotion Owner

And that?

...gent/install/etc/archipel/archipel-central-agent.conf
((97 lines not shown))
+# minimal log level. it can be in order:
+# - debug
+# - info
+# - warning
+# - error
+# - critical
+logging_level = debug
+
+# max life time of a log node in the pubsub
+log_pubsub_item_expire = 3600
+
+# max number of stored log in the pubsub log node
+log_pubsub_max_items = 1000
+
+# the path of file to store logs
+logging_file_path = /var/log/archipel/archipel.log
@primalmotion Owner

This one should be different by default

...chipel-central-agent/archipel/archipelCentralAgent.py
((700 lines not shown))
+ """
+ self.database.execute("create table if not exists vms (uuid text unique on conflict replace, parker string, creation_date date, domain string, hypervisor string)")
+ self.database.execute("create table if not exists hypervisors (jid text unique on conflict replace, last_seen date, status string, stat1 int, stat2 int, stat3 int)")
+ #By default on startup, put everything in the parking. Hypervisors will announce their vms.
+ self.database.execute("update vms set hypervisor='None';")
+ self.database.commit()
+
+ ### Event loop
+
+ def on_xmpp_loop_tick(self):
+ if self.xmpp_authenticated:
+ if not self.is_central_agent and self.central_agent_mode=="auto":
+ # before becoming a central agent, we wait for timeout period plus a random amount of time
+ # to avoid race conditions
+ central_agent_timeout = ARCHIPEL_CENTRAL_AGENT_TIMEOUT*(1+self.random_wait)
+ if (datetime.datetime.now()-self.last_keepalive_heard).seconds>central_agent_timeout:
@primalmotion Owner

This continuously fails, saying that self.last_keepalive_heard is not defined. I guess it's a race condition, but it's happening 100% of the time now.

@primalmotion

An error is raised on connect stuff because of permissionsCenter. I guess you don't have any permissions DB, so it fails. It's not a critical error, but it's dirty

@primalmotion

There is a name collision in the module names: the module in archipel-agent is the same as in archipel-central-agent (which is messing up the python imports).

You should rename archipel-central-agent/archipel into archipel-central-agent/archipelcentral.

You also need to change provides=["archipel"] to provides=["archipelcentral"] in setup.py.

@primalmotion

Also, there are a lot of styling issues.

Please be sure to make all your code conform to the coding guidelines:
https://github.com/ArchipelProject/Archipel/wiki/Developer%3A-Coding-guidelines

Note that there is always a space before and after an operator (never a-b or a<0, but a - b and a < 0).

Please skip some lines to lighten up your code. For instance, it is usually a good idea to skip a line before and after a control statement.

Please fix all my remarks, I'll give it another try after.

Thanks!

@nicolasochem

@primalmotion thanks. Please describe your methodology. Which OS did you use? Did it already have an archipel hypervisor? Did you do the developer install? Or anything else? Thanks

@primalmotion

Which OS did you use?

CentOS 6.4

Did it already have an archipel hypervisor?

Yeah

Did you do the developer install?

Of course

@nicolasochem

@primalmotion: for permissions, I think that the central agent will never need the permission infrastructure, so it's better to fix ArchipelEntity not to expect a permission database when, e.g., "self.permissions_in_use" is set to False. Is that ok with you?

@primalmotion
@nicolasochem

@primalmotion @CyrilPeponnet : ready for review.

I could understand/reproduce all your remarks except the one about self.last_keepalive_heard not being defined. I edited the code to define this variable earlier on, though. If you still see the issue, please give more info.

@nicolasochem

@primalmotion @CyrilPeponnet please consider reviewing my feature thanks :)

@primalmotion

I will give it another try this week

@primalmotion

It seems to work better but I'm having some difficulties with my second hypervisor.

What I have.

archipel.com: archipel + central agent
archipel2.com: archipel

The conf is correct to point to the same central agent.

This is the startup log of central agent (no agent started)

INFO    ::2013-04-22 13:52:11::utils.py:71::TNArchipelCentralAgent.perform_hooks (centralagent@archipel.com/hypervisor1)::HOOK: going to run methods for hook HOOK_ARCHIPELENTITY_XMPP_AUTHENTICATED
DEBUG   ::2013-04-22 13:52:11::utils.py:69::TNArchipelCentralAgent.perform_hooks (centralagent@archipel.com/hypervisor1)::HOOK: performing method recover_pubsubs registered in hook with name HOOK_ARCHIPELENTITY_XMPP_AUTHENTICATED and user_info: None (oneshot: False)
DEBUG   ::2013-04-22 13:52:11::utils.py:69::TNArchipelCentralAgent.recover_pubsubs (centralagent@archipel.com/hypervisor1)::Here is the final admin list: {'STATIC_admin@archipel.com': 'admin@archipel.com'}
DEBUG   ::2013-04-22 13:52:11::utils.py:69::TNArchipelCentralAgent.perform_hooks (centralagent@archipel.com/hypervisor1)::HOOK: performing method hook_xmpp_authenticated registered in hook with name HOOK_ARCHIPELENTITY_XMPP_AUTHENTICATED and user_info: None (oneshot: False)
INFO    ::2013-04-22 13:52:11::utils.py:71::TNArchipelCentralAgent.change_presence (centralagent@archipel.com/hypervisor1)::status change: Online show:None
DEBUG   ::2013-04-22 13:52:11::utils.py:69::TNArchipelCentralAgent.hook_xmpp_authenticated (centralagent@archipel.com/hypervisor1)::CENTRALAGENT: Mode auto
DEBUG   ::2013-04-22 13:52:11::utils.py:69::TNArchipelCentralAgent.presence_callback (centralagent@archipel.com/hypervisor1)::PRESENCE : I just set change presence. The result is <presence xmlns="jabber:client" to="centralagent@archipel.com/hypervisor1" from="centralagent@archipel.com/hypervisor1" id="centralagent-15"><status>Online</status></presence>
INFO    ::2013-04-22 13:52:11::utils.py:71::TNArchipelCentralAgent.hook_xmpp_authenticated (centralagent@archipel.com/hypervisor1)::CENTRALAGENT: entity centralagent@archipel.com/hypervisor1 is now subscribed to events from node /archipel/centralagentkeepalive
INFO    ::2013-04-22 13:52:11::utils.py:71::TNArchipelCentralAgent.change_presence (centralagent@archipel.com/hypervisor1)::status change: Standby show:away
DEBUG   ::2013-04-22 13:52:11::utils.py:69::TNArchipelCentralAgent.presence_callback (centralagent@archipel.com/hypervisor1)::PRESENCE : I just set change presence. The result is <presence xmlns="jabber:client" to="centralagent@archipel.com/hypervisor1" from="centralagent@archipel.com/hypervisor1" id="centralagent-19"><show>away</show><status>Standby</status></presence>
INFO    ::2013-04-22 13:52:29::utils.py:71::TNArchipelCentralAgent.on_xmpp_loop_tick (centralagent@archipel.com/hypervisor1)::CENTRALAGENT: has not detected any central agent for the last 17.230557687 seconds, becoming central agent.
DEBUG   ::2013-04-22 13:52:29::utils.py:69::TNArchipelCentralAgent.become_central_agent (centralagent@archipel.com/hypervisor1)::CENTRALAGENT: initial keepalive sent
INFO    ::2013-04-22 13:52:29::utils.py:71::TNArchipelCentralAgent.change_presence (centralagent@archipel.com/hypervisor1)::status change: Active show:
DEBUG   ::2013-04-22 13:52:29::utils.py:69::TNArchipelCentralAgent.presence_callback (centralagent@archipel.com/hypervisor1)::PRESENCE : I just set change presence. The result is <presence xmlns="jabber:client" to="centralagent@archipel.com/hypervisor1" from="centralagent@archipel.com/hypervisor1" id="centralagent-21"><status>Active</status></presence>
DEBUG   ::2013-04-22 13:52:29::utils.py:69::TNArchipelCentralAgent.handle_central_keepalive_event (centralagent@archipel.com/hypervisor1)::CENTRALAGENT: Keepalive heard.
DEBUG   ::2013-04-22 13:52:35::utils.py:69::TNArchipelCentralAgent.handle_central_keepalive_event (centralagent@archipel.com/hypervisor1)::CENTRALAGENT: Keepalive heard.
DEBUG   ::2013-04-22 13:52:41::utils.py:69::TNArchipelCentralAgent.handle_central_keepalive_event (centralagent@archipel.com/hypervisor1)::CENTRALAGENT: Keepalive heard.
DEBUG   ::2013-04-22 13:52:47::utils.py:69::TNArchipelCentralAgent.handle_central_keepalive_event (centralagent@archipel.com/hypervisor1)::CENTRALAGENT: Keepalive heard.

Nothing can be found in the central_db.sqlite3.

Now I start archipel on archipel.com.

That's what I get in the central agent log

DEBUG   ::2013-04-22 13:54:35::utils.py:69::TNArchipelCentralAgent.handle_central_keepalive_event (centralagent@archipel.com/hypervisor1)::CENTRALAGENT: Keepalive heard.
INFO    ::2013-04-22 13:54:35::utils.py:71::TNArchipelCentralAgent.check_acp (centralagent@archipel.com/hypervisor1)::acp received: from: hypervisor@archipel.com/hypervisor1, type: set, namespace: archipel:centralagent, action: update_hypervisors
DEBUG   ::2013-04-22 13:54:35::utils.py:69::TNArchipelCentralAgent.db_commit (centralagent@archipel.com/hypervisor1)::CENTRALAGENT: commit 'update hypervisors set last_seen=:last_seen where jid=:jid' with entries [{u'jid': u'hypervisor@archipel.com/hypervisor1', u'last_seen': u'2013-04-22 13:54:35.897240'}]
DEBUG   ::2013-04-22 13:54:35::utils.py:69::TNArchipelCentralAgent.process_iq_for_centralagent (centralagent@archipel.com/hypervisor1)::CENTRALAGENT: we got a reply for this iq <iq xmlns="jabber:client" to="hypervisor@archipel.com/hypervisor1" from="centralagent@archipel.com/hypervisor1" id="hypervisor-28" type="result"><query xmlns="archipel:centralagent" /></iq>
DEBUG   ::2013-04-22 13:54:35::utils.py:69::TNArchipelCentralAgent.process_iq_for_centralagent (centralagent@archipel.com/hypervisor1)::CENTRALAGENT: reply sent
INFO    ::2013-04-22 13:54:35::utils.py:71::TNArchipelCentralAgent.check_acp (centralagent@archipel.com/hypervisor1)::acp received: from: hypervisor@archipel.com/hypervisor1, type: set, namespace: archipel:centralagent, action: read_vms_started_elsewhere
DEBUG   ::2013-04-22 13:54:35::utils.py:69::TNArchipelCentralAgent.iq_read_vms_started_elsewhere (centralagent@archipel.com/hypervisor1)::CENTRALAGENT: iq_read_vms_started_elsewhere : iq : <iq xmlns="jabber:client" to="centralagent@archipel.com/hypervisor1" from="hypervisor@archipel.com/hypervisor1" id="hypervisor-29" type="set"><query xmlns="archipel:centralagent"><archipel action="read_vms_started_elsewhere"><event jid="hypervisor@archipel.com/hypervisor1"><entry><item key="uuid" value="457e8966-ab76-11e2-871b-525400ac42d0" /></entry></event></archipel></query></iq>, entries : [{u'uuid': u'457e8966-ab76-11e2-871b-525400ac42d0'}]
DEBUG   ::2013-04-22 13:54:35::utils.py:69::TNArchipelCentralAgent.read_vms_started_elsewhere (centralagent@archipel.com/hypervisor1)::CENTRALAGENT: read_vms_started_elsewhere uuids :[u'457e8966-ab76-11e2-871b-525400ac42d0'] 
DEBUG   ::2013-04-22 13:54:35::utils.py:69::TNArchipelCentralAgent.read_vms_started_elsewhere (centralagent@archipel.com/hypervisor1)::CENTRALAGENT: read statement select vms.uuid from vms join hypervisors on hypervisors.jid=vms.hypervisor where (vms.uuid='457e8966-ab76-11e2-871b-525400ac42d0') and hypervisors.jid != 'hypervisor@archipel.com/hypervisor1' and hypervisors.status='Online'
DEBUG   ::2013-04-22 13:54:35::utils.py:69::TNArchipelCentralAgent.read_vms_started_elsewhere (centralagent@archipel.com/hypervisor1)::CENTRALAGENT: return of read statement : []
DEBUG   ::2013-04-22 13:54:35::utils.py:69::TNArchipelCentralAgent.process_iq_for_centralagent (centralagent@archipel.com/hypervisor1)::CENTRALAGENT: we got a reply for this iq <iq xmlns="jabber:client" to="hypervisor@archipel.com/hypervisor1" from="centralagent@archipel.com/hypervisor1" id="hypervisor-29" type="result"><query xmlns="archipel:centralagent" /></iq>
DEBUG   ::2013-04-22 13:54:35::utils.py:69::TNArchipelCentralAgent.process_iq_for_centralagent (centralagent@archipel.com/hypervisor1)::CENTRALAGENT: reply sent
INFO    ::2013-04-22 13:54:35::utils.py:71::TNArchipelCentralAgent.check_acp (centralagent@archipel.com/hypervisor1)::acp received: from: hypervisor@archipel.com/hypervisor1, type: set, namespace: archipel:centralagent, action: register_hypervisors
DEBUG   ::2013-04-22 13:54:35::utils.py:69::TNArchipelCentralAgent.db_commit (centralagent@archipel.com/hypervisor1)::CENTRALAGENT: commit 'insert into hypervisors values(:jid, :last_seen, :status, :stat1, :stat2, :stat3)' with entries [{u'status': u'Online', u'jid': u'hypervisor@archipel.com/hypervisor1', u'stat3': u'0', u'stat2': u'0', u'stat1': u'0', u'last_seen': u'2013-04-22 13:54:35.906396'}]
DEBUG   ::2013-04-22 13:54:35::utils.py:69::TNArchipelCentralAgent.process_iq_for_centralagent (centralagent@archipel.com/hypervisor1)::CENTRALAGENT: we got a reply for this iq <iq xmlns="jabber:client" to="hypervisor@archipel.com/hypervisor1" from="centralagent@archipel.com/hypervisor1" id="hypervisor-30" type="result"><query xmlns="archipel:centralagent" /></iq>
DEBUG   ::2013-04-22 13:54:35::utils.py:69::TNArchipelCentralAgent.process_iq_for_centralagent (centralagent@archipel.com/hypervisor1)::CENTRALAGENT: reply sent
INFO    ::2013-04-22 13:54:36::utils.py:71::TNArchipelCentralAgent.check_acp (centralagent@archipel.com/hypervisor1)::acp received: from: 457e8966-ab76-11e2-871b-525400ac42d0@archipel.com/hypervisor, type: set, namespace: archipel:centralagent, action: register_vms
DEBUG   ::2013-04-22 13:54:36::utils.py:69::TNArchipelCentralAgent.db_commit (centralagent@archipel.com/hypervisor1)::CENTRALAGENT: commit 'insert into vms values(:uuid, :parker, :creation_date, :domain, :hypervisor)' with entries [{u'hypervisor': u'hypervisor@archipel.com/hypervisor1', u'domain': u'<domain xmlns="http://www.gajim.org/xmlns/undeclared" type="kvm">   <name>Strackea</name>   <uuid>457e8966-ab76-11e2-871b-525400ac42d0</uuid>   <description>457e8966-ab76-11e2-871b-525400ac42d0@archipel.com::::t4bv9J3za4SPckHYGheVdUcWDlBYt0wG</description>   <metadata>     <nuage xmlns="http://www.nuagenetworks.net/2013/Vm/Metadata">       <user name="Antoine Mercadal" />       <enterprise name="Archipel Corp." />       <application name="default" />     </nuage>   </metadata>   <memory unit="KiB">1048576</memory>   <currentMemory unit="KiB">1048576</currentMemory>   <vcpu placement="static">1</vcpu>   <os>     <type machine="rhel6.3.0" arch="x86_64">hvm</type>     <boot dev="hd" />   </os>   <features>     <acpi />     <apic />   </features>   <clock offset="utc" />   <on_poweroff>destroy</on_poweroff>   <on_reboot>restart</on_reboot>   <on_crash>restart</on_crash>   <devices>     <emulator>/usr/libexec/qemu-kvm</emulator>     <controller index="0" type="usb">       <address slot="0x01" bus="0x00" domain="0x0000" type="pci" function="0x2" />     </controller>     <input bus="usb" type="tablet" />     <input bus="ps2" type="mouse" />     <graphics autoport="yes" keymap="en-us" type="vnc" port="-1" />     <video>       <model type="cirrus" vram="9216" heads="1" />       <address slot="0x02" bus="0x00" domain="0x0000" type="pci" function="0x0" />     </video>     <memballoon model="virtio">       <address slot="0x03" bus="0x00" domain="0x0000" type="pci" function="0x0" />     </memballoon>   </devices> </domain>', u'parker': u'None', u'uuid': u'457e8966-ab76-11e2-871b-525400ac42d0', u'creation_date': u'None'}]
DEBUG   ::2013-04-22 13:54:36::utils.py:69::TNArchipelCentralAgent.process_iq_for_centralagent (centralagent@archipel.com/hypervisor1)::CENTRALAGENT: we got a reply for this iq <iq xmlns="jabber:client" to="457e8966-ab76-11e2-871b-525400ac42d0@archipel.com/hypervisor" from="centralagent@archipel.com/hypervisor1" id="457e8966-ab76-11e2-871b-525400ac42d0-49" type="result"><query xmlns="archipel:centralagent" /></iq>
DEBUG   ::2013-04-22 13:54:36::utils.py:69::TNArchipelCentralAgent.process_iq_for_centralagent (centralagent@archipel.com/hypervisor1)::CENTRALAGENT: reply sent
DEBUG   ::2013-04-22 13:54:42::utils.py:69::TNArchipelCentralAgent.handle_central_keepalive_event (centralagent@archipel.com/hypervisor1)::CENTRALAGENT: Keepalive heard.
INFO    ::2013-04-22 13:54:42::utils.py:71::TNArchipelCentralAgent.check_acp (centralagent@archipel.com/hypervisor1)::acp received: from: hypervisor@archipel.com/hypervisor1, type: set, namespace: archipel:centralagent, action: update_hypervisors
DEBUG   ::2013-04-22 13:54:42::utils.py:69::TNArchipelCentralAgent.db_commit (centralagent@archipel.com/hypervisor1)::CENTRALAGENT: commit 'update hypervisors set stat1=:stat1, last_seen=:last_seen where jid=:jid' with entries [{u'jid': u'hypervisor@archipel.com/hypervisor1', u'stat1': u'9204960', u'last_seen': u'2013-04-22 13:54:42.570162'}]
DEBUG   ::2013-04-22 13:54:42::utils.py:69::TNArchipelCentralAgent.process_iq_for_centralagent (centralagent@archipel.com/hypervisor1)::CENTRALAGENT: we got a reply for this iq <iq xmlns="jabber:client" to="hypervisor@archipel.com/hypervisor1" from="centralagent@archipel.com/hypervisor1" id="hypervisor-54" type="result"><query xmlns="archipel:centralagent" /></iq>
DEBUG   ::2013-04-22 13:54:42::utils.py:69::TNArchipelCentralAgent.process_iq_for_centralagent (centralagent@archipel.com/hypervisor1)::CENTRALAGENT: reply sent
DEBUG   ::2013-04-22 13:54:48::utils.py:69::TNArchipelCentralAgent.handle_central_keepalive_event (centralagent@archipel.com/hypervisor1)::CENTRALAGENT: Keepalive heard.
INFO    ::2013-04-22 13:54:48::utils.py:71::TNArchipelCentralAgent.check_acp (centralagent@archipel.com/hypervisor1)::acp received: from: hypervisor@archipel.com/hypervisor1, type: set, namespace: archipel:centralagent, action: update_hypervisors
DEBUG   ::2013-04-22 13:54:48::utils.py:69::TNArchipelCentralAgent.db_commit (centralagent@archipel.com/hypervisor1)::CENTRALAGENT: commit 'update hypervisors set stat1=:stat1, last_seen=:last_seen where jid=:jid' with entries [{u'jid': u'hypervisor@archipel.com/hypervisor1', u'stat1': u'9200320', u'last_seen': u'2013-04-22 13:54:48.655301'}]
DEBUG   ::2013-04-22 13:54:48::utils.py:69::TNArchipelCentralAgent.process_iq_for_centralagent (centralagent@archipel.com/hypervisor1)::CENTRALAGENT: we got a reply for this iq <iq xmlns="jabber:client" to="hypervisor@archipel.com/hypervisor1" from="centralagent@archipel.com/hypervisor1" id="hypervisor-58" type="result"><query xmlns="archipel:centralagent" /></iq>
DEBUG   ::2013-04-22 13:54:48::utils.py:69::TNArchipelCentralAgent.process_iq_for_centralagent (centralagent@archipel.com/hypervisor1)::CENTRALAGENT: reply sent

And in the central_db.sqlite file I can see my first hypervisor marked as Online (if I stop the agent, it is marked as Off, which is good).
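For reference, this is roughly how I inspect it, a minimal sketch assuming the hypervisors table layout visible in the db_commit lines above and an example database path:

import sqlite3

# Minimal sketch: dump hypervisor liveness from the central database.
# Assumes the schema shown in the db_commit log lines above
# (hypervisors(jid, last_seen, status, stat1, stat2, stat3));
# the database path below is only an example.
conn = sqlite3.connect("/var/lib/archipel/centraldb.sqlite3")
for jid, status, last_seen in conn.execute(
        "select jid, status, last_seen from hypervisors"):
    print("%s -> %s (last seen %s)" % (jid, status, last_seen))
conn.close()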

Now, I start the agent on archipel2.com


DEBUG   ::2013-04-22 13:56:32::utils.py:69::TNArchipelCentralAgent.check_hyps (centralagent@archipel.com/hypervisor1)::CENTRALAGENT: now checking all hypervisors are alive
DEBUG   ::2013-04-22 13:56:32::utils.py:69::TNArchipelCentralAgent.handle_central_keepalive_event (centralagent@archipel.com/hypervisor1)::CENTRALAGENT: Keepalive heard.
INFO    ::2013-04-22 13:56:32::utils.py:71::TNArchipelCentralAgent.check_acp (centralagent@archipel.com/hypervisor1)::acp received: from: hypervisor@archipel.com/hypervisor1, type: set, namespace: archipel:centralagent, action: update_hypervisors
DEBUG   ::2013-04-22 13:56:32::utils.py:69::TNArchipelCentralAgent.db_commit (centralagent@archipel.com/hypervisor1)::CENTRALAGENT: commit 'update hypervisors set stat1=:stat1, last_seen=:last_seen where jid=:jid' with entries [{u'jid': u'hypervisor@archipel.com/hypervisor1', u'stat1': u'9203060', u'last_seen': u'2013-04-22 13:56:32.088171'}]
DEBUG   ::2013-04-22 13:56:32::utils.py:69::TNArchipelCentralAgent.process_iq_for_centralagent (centralagent@archipel.com/hypervisor1)::CENTRALAGENT: we got a reply for this iq <iq xmlns="jabber:client" to="hypervisor@archipel.com/hypervisor1" from="centralagent@archipel.com/hypervisor1" id="hypervisor-126" type="result"><query xmlns="archipel:centralagent" /></iq>
DEBUG   ::2013-04-22 13:56:32::utils.py:69::TNArchipelCentralAgent.process_iq_for_centralagent (centralagent@archipel.com/hypervisor1)::CENTRALAGENT: reply sent
INFO    ::2013-04-22 13:56:32::utils.py:71::TNArchipelCentralAgent.check_acp (centralagent@archipel.com/hypervisor1)::acp received: from: hypervisor2@archipel.com/hypervisor2, type: set, namespace: archipel:centralagent, action: update_hypervisors
DEBUG   ::2013-04-22 13:56:32::utils.py:69::TNArchipelCentralAgent.db_commit (centralagent@archipel.com/hypervisor1)::CENTRALAGENT: commit 'update hypervisors set last_seen=:last_seen where jid=:jid' with entries [{u'jid': u'hypervisor2@archipel.com/hypervisor2', u'last_seen': u'2013-04-22 13:55:09.176706'}]
DEBUG   ::2013-04-22 13:56:32::utils.py:69::TNArchipelCentralAgent.process_iq_for_centralagent (centralagent@archipel.com/hypervisor1)::CENTRALAGENT: we got a reply for this iq <iq xmlns="jabber:client" to="hypervisor2@archipel.com/hypervisor2" from="centralagent@archipel.com/hypervisor1" id="hypervisor2-28" type="result"><query xmlns="archipel:centralagent" /></iq>
DEBUG   ::2013-04-22 13:56:32::utils.py:69::TNArchipelCentralAgent.process_iq_for_centralagent (centralagent@archipel.com/hypervisor1)::CENTRALAGENT: reply sent
INFO    ::2013-04-22 13:56:32::utils.py:71::TNArchipelCentralAgent.check_acp (centralagent@archipel.com/hypervisor1)::acp received: from: hypervisor2@archipel.com/hypervisor2, type: set, namespace: archipel:centralagent, action: register_hypervisors
DEBUG   ::2013-04-22 13:56:32::utils.py:69::TNArchipelCentralAgent.db_commit (centralagent@archipel.com/hypervisor1)::CENTRALAGENT: commit 'insert into hypervisors values(:jid, :last_seen, :status, :stat1, :stat2, :stat3)' with entries [{u'status': u'Online', u'jid': u'hypervisor2@archipel.com/hypervisor2', u'stat3': u'0', u'stat2': u'0', u'stat1': u'0', u'last_seen': u'2013-04-22 13:55:09.204565'}]
DEBUG   ::2013-04-22 13:56:32::utils.py:69::TNArchipelCentralAgent.process_iq_for_centralagent (centralagent@archipel.com/hypervisor1)::CENTRALAGENT: we got a reply for this iq <iq xmlns="jabber:client" to="hypervisor2@archipel.com/hypervisor2" from="centralagent@archipel.com/hypervisor1" id="hypervisor2-31" type="result"><query xmlns="archipel:centralagent" /></iq>
DEBUG   ::2013-04-22 13:56:32::utils.py:69::TNArchipelCentralAgent.process_iq_for_centralagent (centralagent@archipel.com/hypervisor1)::CENTRALAGENT: reply sent

The second hypervisor shows up in the sqlite file marked as online.

But after a random number of pings (1 to 5), the second hypervisor becomes unreachable:


DEBUG   ::2013-04-22 13:56:57::utils.py:69::TNArchipelCentralAgent.process_iq_for_centralagent (centralagent@archipel.com/hypervisor1)::CENTRALAGENT: reply sent
DEBUG   ::2013-04-22 13:57:03::utils.py:69::TNArchipelCentralAgent.check_hyps (centralagent@archipel.com/hypervisor1)::CENTRALAGENT: now checking all hypervisors are alive
INFO    ::2013-04-22 13:57:03::utils.py:71::TNArchipelCentralAgent.check_hyps (centralagent@archipel.com/hypervisor1)::CENTRALAGENT: hyp hypervisor2@archipel.com/hypervisor2 timed out
DEBUG   ::2013-04-22 13:57:03::utils.py:69::TNArchipelCentralAgent.db_commit (centralagent@archipel.com/hypervisor1)::CENTRALAGENT: commit 'update hypervisors set status=:status where jid=:jid' with entries [{'status': 'Unreachable', 'jid': u'hypervisor2@archipel.com/hypervisor2'}]
INFO    ::2013-04-22 13:57:03::utils.py:71::TNArchipelCentralAgent.check_acp (centralagent@archipel.com/hypervisor1)::acp received: from: hypervisor@archipel.com/hypervisor1, type: set, namespace: archipel:centralagent, action: update_hypervisors
DEBUG   ::2013-04-22 13:57:03::utils.py:69::TNArchipelCentralAgent.db_commit (centralagent@archipel.com/hypervisor1)::CENTRALAGENT: commit 'update hypervisors set stat1=:stat1, last_seen=:last_seen where jid=:jid' with entries [{u'jid': u'hypervisor@archipel.com/hypervisor1', u'stat1': u'9202928', u'last_seen': u'2013-04-22 13:57:03.401345'}]
DEBUG   ::2013-04-22 13:57:03::utils.py:69::TNArchipelCentralAgent.process_iq_for_centralagent (centralagent@archipel.com/hypervisor1)::CENTRALAGENT: we got a reply for this iq <iq xmlns="jabber:client" to="hypervisor@archipel.com/hypervisor1" from="centralagent@archipel.com/hypervisor1" id="hypervisor-146" type="result"><query xmlns="archipel:centralagent" /></iq>
DEBUG   ::2013-04-22 13:57:03::utils.py:69::TNArchipelCentralAgent.process_iq_for_centralagent (centralagent@archipel.com/hypervisor1)::CENTRALAGENT: reply sent
INFO    ::2013-04-22 13:57:03::utils.py:71::TNArchipelCentralAgent.check_acp (centralagent@archipel.com/hypervisor1)::acp received: from: hypervisor2@archipel.com/hypervisor2, type: set, namespace: archipel:centralagent, action: update_hypervisors
DEBUG   ::2013-04-22 13:57:03::utils.py:69::TNArchipelCentralAgent.db_commit (centralagent@archipel.com/hypervisor1)::CENTRALAGENT: commit 'update hypervisors set stat1=:stat1, last_seen=:last_seen where jid=:jid' with entries [{u'jid': u'hypervisor2@archipel.com/hypervisor2', u'stat1': u'2164664', u'last_seen': u'2013-04-22 13:55:40.487746'}]
DEBUG   ::2013-04-22 13:57:03::utils.py:69::TNArchipelCentralAgent.process_iq_for_centralagent (centralagent@archipel.com/hypervisor1)::CENTRALAGENT: we got a reply for this iq <iq xmlns="jabber:client" to="hypervisor2@archipel.com/hypervisor2" from="centralagent@archipel.com/hypervisor1" id="hypervisor2-51" type="result"><query xmlns="archipel:centralagent" /></iq>
DEBUG   ::2013-04-22 13:57:03::utils.py:69::TNArchipelCentralAgent.process_iq_for_centralagent (centralagent@archipel.com/hypervisor1)::CENTRALAGENT: reply sent
DEBUG   ::2013-04-22 13:57:03::utils.py:69::TNArchipelCentralAgent.handle_central_keepalive_event (centralagent@archipel.com/hypervisor1)::CENTRALAGENT: Keepalive heard.
DEBUG   ::2013-04-22 13:57:09::utils.py:69::TNArchipelCentralAgent.handle_central_keepalive_event (centralagent@archipel.com/hypervisor1)::CENTRALAGENT: Keepalive heard.
INFO    ::2013-04-22 13:57:09::utils.py:71::TNArchipelCentralAgent.check_acp (centralagent@archipel.com/hypervisor1)::acp received: from: hypervisor@archipel.com/hypervisor1, type: set, namespace: archipel:centralagent, action: update_hypervisors
DEBUG   ::2013-04-22 13:57:09::utils.py:69::TNArchipelCentralAgent.db_commit (centralagent@archipel.com/hypervisor1)::CENTRALAGENT: commit 'update hypervisors set stat1=:stat1, last_seen=:last_seen where jid=:jid' with entries [{u'jid': u'hypervisor@archipel.com/hypervisor1', u'stat1': u'9202900', u'last_seen': u'2013-04-22 13:57:09.554171'}]
DEBUG   ::2013-04-22 13:57:09::utils.py:69::TNArchipelCentralAgent.process_iq_for_centralagent (centralagent@archipel.com/hypervisor1)::CENTRALAGENT: we got a reply for this iq <iq xmlns="jabber:client" to="hypervisor@archipel.com/hypervisor1" from="centralagent@archipel.com/hypervisor1" id="hypervisor-150" type="result"><query xmlns="archipel:centralagent" /></iq>
DEBUG   ::2013-04-22 13:57:09::utils.py:69::TNArchipelCentralAgent.process_iq_for_centralagent (centralagent@archipel.com/hypervisor1)::CENTRALAGENT: reply sent
INFO    ::2013-04-22 13:57:09::utils.py:71::TNArchipelCentralAgent.check_acp (centralagent@archipel.com/hypervisor1)::acp received: from: hypervisor2@archipel.com/hypervisor2, type: set, namespace: archipel:centralagent, action: update_hypervisors
DEBUG   ::2013-04-22 13:57:09::utils.py:69::TNArchipelCentralAgent.db_commit (centralagent@archipel.com/hypervisor1)::CENTRALAGENT: commit 'update hypervisors set stat1=:stat1, last_seen=:last_seen where jid=:jid' with entries [{u'jid': u'hypervisor2@archipel.com/hypervisor2', u'stat1': u'2164664', u'last_seen': u'2013-04-22 13:55:46.641004'}]
DEBUG   ::2013-04-22 13:57:09::utils.py:69::TNArchipelCentralAgent.process_iq_for_centralagent (centralagent@archipel.com/hypervisor1)::CENTRALAGENT: we got a reply for this iq <iq xmlns="jabber:client" to="hypervisor2@archipel.com/hypervisor2" from="centralagent@archipel.com/hypervisor1" id="hypervisor2-55" type="result"><query xmlns="archipel:centralagent" /></iq>
DEBUG   ::2013-04-22 13:57:09::utils.py:69::TNArchipelCentralAgent.process_iq_for_centralagent (centralagent@archipel.com/hypervisor1)::CENTRALAGENT: reply sent
DEBUG   ::2013-04-22 13:57:15::utils.py:69::TNArchipelCentralAgent.handle_central_keepalive_event (centralagent@archipel.com/hypervisor1)::CENTRALAGENT: Keepalive heard.
INFO    ::2013-04-22 13:57:15::utils.py:71::TNArchipelCentralAgent.check_acp (centralagent@archipel.com/hypervisor1)::acp received: from: hypervisor@archipel.com/hypervisor1, type: set, namespace: archipel:centralagent, action: update_hypervisors
DEBUG   ::2013-04-22 13:57:15::utils.py:69::TNArchipelCentralAgent.db_commit (centralagent@archipel.com/hypervisor1)::CENTRALAGENT: commit 'update hypervisors set stat1=:stat1, last_seen=:last_seen where jid=:jid' with entries [{u'jid': u'hypervisor@archipel.com/hypervisor1', u'stat1': u'9203104', u'last_seen': u'2013-04-22 13:57:15.701124'}]
DEBUG   ::2013-04-22 13:57:15::utils.py:69::TNArchipelCentralAgent.process_iq_for_centralagent (centralagent@archipel.com/hypervisor1)::CENTRALAGENT: we got a reply for this iq <iq xmlns="jabber:client" to="hypervisor@archipel.com/hypervisor1" from="centralagent@archipel.com/hypervisor1" id="hypervisor-154" type="result"><query xmlns="archipel:centralagent" /></iq>
DEBUG   ::2013-04-22 13:57:15::utils.py:69::TNArchipelCentralAgent.process_iq_for_centralagent (centralagent@archipel.com/hypervisor1)::CENTRALAGENT: reply sent
INFO    ::2013-04-22 13:57:15::utils.py:71::TNArchipelCentralAgent.check_acp (centralagent@archipel.com/hypervisor1)::acp received: from: hypervisor2@archipel.com/hypervisor2, type: set, namespace: archipel:centralagent, action: update_hypervisors
DEBUG   ::2013-04-22 13:57:15::utils.py:69::TNArchipelCentralAgent.db_commit (centralagent@archipel.com/hypervisor1)::CENTRALAGENT: commit 'update hypervisors set stat1=:stat1, last_seen=:last_seen where jid=:jid' with entries [{u'jid': u'hypervisor2@archipel.com/hypervisor2', u'stat1': u'2164076', u'last_seen': u'2013-04-22 13:55:52.788468'}]
DEBUG   ::2013-04-22 13:57:15::utils.py:69::TNArchipelCentralAgent.process_iq_for_centralagent (centralagent@archipel.com/hypervisor1)::CENTRALAGENT: we got a reply for this iq <iq xmlns="jabber:client" to="hypervisor2@archipel.com/hypervisor2" from="centralagent@archipel.com/hypervisor1" id="hypervisor2-59" type="result"><query xmlns="archipel:centralagent" /></iq>
DEBUG   ::2013-04-22 13:57:15::utils.py:69::TNArchipelCentralAgent.process_iq_for_centralagent (centralagent@archipel.com/hypervisor1)::CENTRALAGENT: reply sent
DEBUG   ::2013-04-22 13:57:21::utils.py:69::TNArchipelCentralAgent.handle_central_keepalive_event (centralagent@archipel.com/hypervisor1)::CENTRALAGENT: Keepalive heard.
INFO    ::2013-04-22 13:57:21::utils.py:71::TNArchipelCentralAgent.check_acp (centralagent@archipel.com/hypervisor1)::acp received: from: hypervisor@archipel.com/hypervisor1, type: set, namespace: archipel:centralagent, action: update_hypervisors
DEBUG   ::2013-04-22 13:57:21::utils.py:69::TNArchipelCentralAgent.db_commit (centralagent@archipel.com/hypervisor1)::CENTRALAGENT: commit 'update hypervisors set stat1=:stat1, last_seen=:last_seen where jid=:jid' with entries [{u'jid': u'hypervisor@archipel.com/hypervisor1', u'stat1': u'9202996', u'last_seen': u'2013-04-22 13:57:21.836630'}]
DEBUG   ::2013-04-22 13:57:21::utils.py:69::TNArchipelCentralAgent.process_iq_for_centralagent (centralagent@archipel.com/hypervisor1)::CENTRALAGENT: we got a reply for this iq <iq xmlns="jabber:client" to="hypervisor@archipel.com/hypervisor1" from="centralagent@archipel.com/hypervisor1" id="hypervisor-158" type="result"><query xmlns="archipel:centralagent" /></iq>
DEBUG   ::2013-04-22 13:57:21::utils.py:69::TNArchipelCentralAgent.process_iq_for_centralagent (centralagent@archipel.com/hypervisor1)::CENTRALAGENT: reply sent
INFO    ::2013-04-22 13:57:21::utils.py:71::TNArchipelCentralAgent.check_acp (centralagent@archipel.com/hypervisor1)::acp received: from: hypervisor2@archipel.com/hypervisor2, type: set, namespace: archipel:centralagent, action: update_hypervisors
DEBUG   ::2013-04-22 13:57:21::utils.py:69::TNArchipelCentralAgent.db_commit (centralagent@archipel.com/hypervisor1)::CENTRALAGENT: commit 'update hypervisors set stat1=:stat1, last_seen=:last_seen where jid=:jid' with entries [{u'jid': u'hypervisor2@archipel.com/hypervisor2', u'stat1': u'2164080', u'last_seen': u'2013-04-22 13:55:58.923276'}]
DEBUG   ::2013-04-22 13:57:21::utils.py:69::TNArchipelCentralAgent.process_iq_for_centralagent (centralagent@archipel.com/hypervisor1)::CENTRALAGENT: we got a reply for this iq <iq xmlns="jabber:client" to="hypervisor2@archipel.com/hypervisor2" from="centralagent@archipel.com/hypervisor1" id="hypervisor2-63" type="result"><query xmlns="archipel:centralagent" /></iq>
DEBUG   ::2013-04-22 13:57:21::utils.py:69::TNArchipelCentralAgent.process_iq_for_centralagent (centralagent@archipel.com/hypervisor1)::CENTRALAGENT: reply sent
DEBUG   ::2013-04-22 13:57:27::utils.py:69::TNArchipelCentralAgent.handle_central_keepalive_event (centralagent@archipel.com/hypervisor1)::CENTRALAGENT: Keepalive heard.
INFO    ::2013-04-22 13:57:27::utils.py:71::TNArchipelCentralAgent.check_acp (centralagent@archipel.com/hypervisor1)::acp received: from: hypervisor@archipel.com/hypervisor1, type: set, namespace: archipel:centralagent, action: update_hypervisors
DEBUG   ::2013-04-22 13:57:27::utils.py:69::TNArchipelCentralAgent.db_commit (centralagent@archipel.com/hypervisor1)::CENTRALAGENT: commit 'update hypervisors set stat1=:stat1, last_seen=:last_seen where jid=:jid' with entries [{u'jid': u'hypervisor@archipel.com/hypervisor1', u'stat1': u'9202752', u'last_seen': u'2013-04-22 13:57:27.969442'}]
DEBUG   ::2013-04-22 13:57:28::utils.py:69::TNArchipelCentralAgent.process_iq_for_centralagent (centralagent@archipel.com/hypervisor1)::CENTRALAGENT: we got a reply for this iq <iq xmlns="jabber:client" to="hypervisor@archipel.com/hypervisor1" from="centralagent@archipel.com/hypervisor1" id="hypervisor-162" type="result"><query xmlns="archipel:centralagent" /></iq>
DEBUG   ::2013-04-22 13:57:28::utils.py:69::TNArchipelCentralAgent.process_iq_for_centralagent (centralagent@archipel.com/hypervisor1)::CENTRALAGENT: reply sent
INFO    ::2013-04-22 13:57:28::utils.py:71::TNArchipelCentralAgent.check_acp (centralagent@archipel.com/hypervisor1)::acp received: from: hypervisor2@archipel.com/hypervisor2, type: set, namespace: archipel:centralagent, action: update_hypervisors
DEBUG   ::2013-04-22 13:57:28::utils.py:69::TNArchipelCentralAgent.db_commit (centralagent@archipel.com/hypervisor1)::CENTRALAGENT: commit 'update hypervisors set stat1=:stat1, last_seen=:last_seen where jid=:jid' with entries [{u'jid': u'hypervisor2@archipel.com/hypervisor2', u'stat1': u'2164080', u'last_seen': u'2013-04-22 13:56:05.057057'}]
DEBUG   ::2013-04-22 13:57:28::utils.py:69::TNArchipelCentralAgent.process_iq_for_centralagent (centralagent@archipel.com/hypervisor1)::CENTRALAGENT: we got a reply for this iq <iq xmlns="jabber:client" to="hypervisor2@archipel.com/hypervisor2" from="centralagent@archipel.com/hypervisor1" id="hypervisor2-67" type="result"><query xmlns="archipel:centralagent" /></iq>
DEBUG   ::2013-04-22 13:57:28::utils.py:69::TNArchipelCentralAgent.process_iq_for_centralagent (centralagent@archipel.com/hypervisor1)::CENTRALAGENT: reply sent

And that's game over until I restart the agent: it comes back online for a few seconds, then goes unreachable again. There is no error at all. The only thing I noticed is that there is a 3-minute difference between the two hypervisors' clocks.

@primalmotion

If the problem comes from a time issue, then the implementation is not good enough. I don't want anything to rely on absolute time. If this is the issue, then the central agent should use a date seed, and only use the relative difference between two pings. If it's not due to that, then I have no idea what's going on :)
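In other words, the liveness check would look roughly like this, a sketch only (names and the timeout value are illustrative, not the actual implementation):

import time

PING_TIMEOUT = 30.0   # example value: seconds without a ping before flagging a hypervisor
last_ping = {}        # jid -> time.time() at which the central agent last heard that jid

def on_keepalive_reply(jid):
    # Stamp the ping with the central agent's *own* clock; the hypervisor's
    # clock is never read, so drift between the two machines cannot matter.
    last_ping[jid] = time.time()

def check_hyps():
    # Periodic pass, analogous to check_hyps in the log above: only the
    # relative difference between "now" and the last local stamp is used.
    now = time.time()
    for jid, seen in last_ping.items():
        status = "Unreachable" if now - seen > PING_TIMEOUT else "Online"
        # the real agent would commit this status to the hypervisors table here
        print("%s -> %s" % (jid, status))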

@primalmotion

I also get this error when I try to unpark a VM from hypervisor1 to hypervisor2:

ERROR   ::2013-04-22 14:02:49::utils.py:163::<archipelagentvmparking.vmparking.TNVMParking instance at 0x2b465f0>._on_centralagent_reply: exception raised is: 'syntax error: line 1, column 0' triggered by stanza :
<iq xmlns="jabber:client" to="hypervisor2@archipel.com/hypervisor2" from="admin@archipel.com/ArchipelController" id="82866" type="get"><query xmlns="archipel:hypervisor:vmparking"><archipel action="list" /></query></iq>
DEBUG   ::2013-04-22 14:02:49::utils.py:165::Traceback (most recent call last):

  File "/usr/local/src/Archipel/ArchipelAgent/archipel-agent-vmparking/archipelagentvmparking/vmparking.py", line 245, in _on_centralagent_reply
    parked_vms.append({"info": {"uuid": vm["uuid"], "parker": vm["parker"], "date": vm["creation_date"]}, "domain": xmpp.simplexml.NodeBuilder(vm["domain"]).getDom()})

  File "/usr/lib/python2.6/site-packages/xmpppy-0.5.0rc1-py2.6.egg/xmpp/simplexml.py", line 366, in __init__
    self._parser.Parse(data,1)

ExpatError: syntax error: line 1, column 0

This ends up with the VM unparked, but still listed in the parking.
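For reference, expat's "syntax error: line 1, column 0" means the string handed to NodeBuilder does not start with an XML tag at all, for instance an empty value or a stringified None like the u'None' fields visible in the register_vms commit earlier. A defensive guard along these lines (a sketch only; the helper name is made up) would at least identify the offending row:

from xmpp import simplexml

def build_parked_entry(vm):
    # Sketch of a guard around the parsing that raised the ExpatError above;
    # assumes vm rows come back as dicts, as in vmparking.py.
    domain_xml = vm.get("domain")
    if not domain_xml or not domain_xml.lstrip().startswith("<"):
        raise ValueError("vm %s has no usable domain XML in the central db: %r"
                         % (vm.get("uuid"), domain_xml))
    return {"info": {"uuid": vm["uuid"],
                     "parker": vm["parker"],
                     "date": vm["creation_date"]},
            "domain": simplexml.NodeBuilder(domain_xml).getDom()}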

@nicolasochem

@primalmotion: unfortunately I cannot reproduce your "syntax error" problem.
About your other observation (time drift causing hypervisors to be marked unavailable), I made sure that all hypervisors use the central agent's time (rough sketch below). Please check whether this solves it.
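The idea, roughly (attribute and helper names here are illustrative, not the actual ones): the keepalive event published on the pubsub carries the central agent's own timestamp, and the hypervisor plugin echoes that value back as last_seen instead of reading its local clock.

import datetime
import xmpp

def build_keepalive_event():
    # Central agent side: stamp the pubsub keepalive with our own clock.
    now = datetime.datetime.now().isoformat(" ")
    return xmpp.Node("event", attrs={"type": "keepalive", "central_agent_time": now})

def last_seen_from_keepalive(event_node):
    # Hypervisor side: reuse the central agent's timestamp for last_seen,
    # so the value committed to the hypervisors table never depends on
    # the hypervisor's local clock.
    return event_node.getAttr("central_agent_time")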

Thanks

@primalmotion primalmotion merged commit 51eb33b into from
@nicolasochem nicolasochem deleted the branch