Description: self assembling fabric of ruby daemons
-
Mapper unable to send requests from Merb (mongrel adapter)
2 comments. Created 4 months ago by highandwild
I have started the mapper in config/init.rb, but Nanite.request() simply returns false without any error message. The agent does not receive the request either.
But it works fine if I test it with nanite-mapper!
Here's the code: http://gist.github.com/147641
ruby - 1.8.7
mongrel - 1.1.5
nanite - 0.4.1 (edge)
amqp - 0.6.0
rabbitmq - 1.5.5
Arch - x64
OS - CentOS 5.1 (VirtualBox)
Comments
-
load average / status function not updated with heartbeat messages when using Redis
1 comment. Created 3 months ago by joshwilsdon
When using Redis, the nanite- key gets set with the load_average when the node registers, but it is not updated after that by heartbeat messages. The intention seems to be that the heartbeat/ping sends the load average so that it ends up in Redis. This is not happening because the code in cluster.rb (handle_ping) is doing:
if nanite = nanites[ping.identity]
  nanite[:status] = ping.status
but nanites[ping.identity] returns an anonymous Hash, so updating it here does nothing. As such, this value is never sent to Redis. I have confirmed that hacking an update_status function into the Nanite::State class (which just updates the nanite- key in Redis) and then calling it in handle_ping as:
nanites.update_status(ping.identity, ping.status)
causes the value in Redis to be updated at every heartbeat. Was it the intended behavior for this function (handle_ping) to update Redis? The comment seems to indicate that is the case.
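The effect described above can be reproduced without Redis. The following is a minimal sketch, assuming a hypothetical in-memory stand-in for the Redis-backed state (FakeState and its methods are illustrative names, not Nanite's actual classes). The lookup builds a fresh Hash on every call, so mutating the result never reaches the store, while an explicit write-back does.

```ruby
# Hypothetical stand-in for a Redis-backed state object. Not Nanite's real code.
class FakeState
  def initialize
    @store = {} # plays the role of Redis: identity => serialized status
  end

  def register(identity, status)
    @store[identity] = status.to_s
  end

  # Each lookup builds a brand-new Hash from the stored value,
  # so the caller gets a copy rather than a live view of the store.
  def [](identity)
    raw = @store[identity]
    raw && { :status => raw.to_f }
  end

  # Explicit write-back, analogous to the update_status helper in the report.
  def update_status(identity, status)
    @store[identity] = status.to_s if @store.key?(identity)
  end
end

nanites = FakeState.new
nanites.register("agent-1", 0.10)

# Pattern from handle_ping: this mutates the temporary copy only.
if nanite = nanites["agent-1"]
  nanite[:status] = 0.75
end
after_mutation = nanites["agent-1"][:status]

# The explicit write-back reaches the store.
nanites.update_status("agent-1", 0.75)
after_update = nanites["agent-1"][:status]

puts "after in-place mutation: #{after_mutation}"
puts "after update_status:     #{after_update}"
```

Whether the real State class copies on lookup in exactly this way is an assumption based on the description in this report.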
Comments
-
Supposing a cli-mapper sends a push to a client-agent:
push('/client_agent/foo', "hi")
and this client-agent sends a push (or request) to a main-agent:
push('/main_agent/foo', ["bar"])
the main agent receives 2 requests instead of one. If I have, for instance, 8 thins running, the main agent receives 10 exactly equal requests.
It looks like the requests are being multiplied by each mapper online at that time.
Comments
I'm guessing you're not using Redis, correct? If you aren't, all mappers will get the agent's request, and therefore all of them will forward it to an agent of their own. When using Redis only one mapper will get the request and forward it. It's probably a matter of discussion whether this is a bug or a feature. I guess it wouldn't hurt to enable exclusiveness on the request queue when not using Redis as well, but there may be situations where not all mappers know the appropriate agents yet to handle that request. That again wouldn't hurt too much when using an offline queue, since that'd eventually lead to the right agent getting the message.
Hi Matt, thanks for the quick reply.
Yup, I'm not using Redis. Hm, lots of options:
Use Redis, because this will really be a bug here, heh...
Try to play with a Tokyo adapter (already got it running on the server). Redis is a key/value store, right?
Or, if it's not too much to ask, could you give me some directions on how to disable this "feature" and rely on the offline queue?
Redis is key/value, that's correct. I don't have a solution for it per se. A quick fix would be to look into cluster.rb. In setup_request_queue it's using an exclusive queue when using Redis. You could try always using that, i.e. remove the shared_state? check. The offline queue is a simple command line switch (--offline-failsafe) for the nanite-agent and a parameter for the mapper class (:offline_failsafe => true). It's just an additional sanity check which you should do anyway when you're relying on your messages being delivered.
The worst case scenario should be avoidable when using the offline queue and only having one mapper pick up the message from the request queue. Let me know if that works. Maybe it's worth considering getting that fix in.
Oops, now I realize I was private messaging Matt. Sorry man.
Same issue with Redis enabled. =/
UPDATE: I've tried editing setup_request_queue in every way I could imagine, same result.
Btw, got 2 specs failing too, something about the ProxyMapper instance not being erased...
That's a bit odd. Even though we had to fix some issues with Redis and internal timeouts, that solved it for us: since then only one mapper gets the request and forwards it. I'd need some more log output from the mapper logs to get a better look at what's going on.
Gosh, I'm embarrassed now. There was a mapper running I didn't see (w/o Redis).
It's working fine with Redis. Really sorry, need to sleep, nanite is giving me some insomnia... (and it feels great).
Thank you Matt, I owe you a (or some) beer. Just let me know when you come to Brazil.
Just to confirm: after removing the -#{identity} and the exclusive option of the amq.queue request, it works. Only one request, and without Redis.
I'll be happy to work on a patch to make this an "option", if anyone is interested or no better solution comes to light. In the meantime, I will keep a fork to make it easy to install on my servers.
Thanks Matt, thanks all you nanite devs, you guys rock!
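The duplication discussed in this thread can be illustrated without a broker. The toy model below is only a sketch under stated assumptions: ToyExchange and the queue names are hypothetical, and the model captures just one idea, that every queue bound to an exchange receives its own copy of a published message. When each mapper binds a queue named after its identity, every mapper sees the request and forwards it. When all mappers consume from one shared queue, only one of them does.

```ruby
# Toy model of an exchange: every bound queue gets a copy of each message.
# Hypothetical illustration, not the amqp gem's API.
class ToyExchange
  def initialize
    @queues = Hash.new { |hash, name| hash[name] = [] }
  end

  # Binding the same name twice returns the same underlying queue.
  def bind(queue_name)
    @queues[queue_name]
  end

  def publish(message)
    @queues.each_value { |queue| queue << message }
  end
end

# Returns how many mappers end up forwarding a single published request.
def forwarded_count(mapper_ids, shared)
  exchange = ToyExchange.new
  queues = mapper_ids.map do |id|
    exchange.bind(shared ? "request" : "request-#{id}")
  end
  exchange.publish("push /main_agent/foo")

  forwarded = 0
  queues.each { |queue| forwarded += 1 if queue.shift } # each mapper tries to consume
  forwarded
end

mappers = %w[m1 m2 m3]
per_mapper_queues = forwarded_count(mappers, false)
one_shared_queue  = forwarded_count(mappers, true)

puts "per-mapper queues: forwarded #{per_mapper_queues} times"
puts "one shared queue:  forwarded #{one_shared_queue} time"
```

With three mappers the first case forwards the request three times and the second forwards it once, which matches the behavior reported after dropping the identity suffix from the queue name.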
I've added an option to the mapper init. Well, it works; I will try it in production this weekend.
http://github.com/nofxx/nanite/commit/7804058cf297088f063cf5d1d2695c8b15ab71a0
Gonna write/fix the specs soon.. heh, sorry about the emacs whitespace cleanup too.
Wow, finally, looks like it's working now! ;)
It's only calling it once, and offline_failsafe ensures that some agent will find the request. All good.
Was having a weird problem with some actors that use ActiveRecord: they just stopped advertising their methods. The problem was I didn't know about single-threaded... working fine now.
I've added those stones I found on my path to the wiki. Again, thanks!
Good to be on nanite hehe..
Just an update: albeit working flawlessly for weeks, on deploy something strange happens (sometimes) with the Rails mappers, if I'm not wrong:
heartbeat-19018d26cdb64d27e25c55d007e73ebb 8149
heartbeat-25941f2a7e262e05e05c2349f08ff468 8153
....
Heartbeats start to accumulate until god restarts RabbitMQ, then everything gets back to normal... heh, weird.
But heartbeat is about to be gone, right? heh
-
I ran into a problem when using nanite. I think it may be caused by my gem environment.
Can you tell me which gem versions you are running?
One problem I met is this.
After I start up the agent and mapper using the simple-agent example from the source code, the agent-side log stops at:
[Fri, 06 Nov 2009 17:07:01 +0800] INFO: SEND [register] d72993a0de1aed09f42dd291e5046d3d, services: /simple/echo, /simple/time, /simple/gems, /simple/yielding, /simple/delayed, tags:
And on the mapper side the log stops at:
[Fri, 06 Nov 2009 17:07:11 +0800] INFO: [setup] starting mapper
After a while the mapper goes down with this error:
/opt/local/lib/ruby/gems/1.8/gems/eventmachine-0.12.8/lib/eventmachine.rb:811:in `connect_server': no connection (RuntimeError)
from /opt/local/lib/ruby/gems/1.8/gems/eventmachine-0.12.8/lib/eventmachine.rb:811:in `reconnect'
from /opt/local/lib/ruby/gems/1.8/gems/amqp-0.6.0/lib/amqp/client.rb:172:in `reconnect'
from /opt/local/lib/ruby/gems/1.8/gems/amqp-0.6.0/lib/amqp/client.rb:85:in `call'
from /opt/local/lib/ruby/gems/1.8/gems/amqp-0.6.0/lib/amqp/client.rb:85:in `unbind'
from /opt/local/lib/ruby/gems/1.8/gems/eventmachine-0.12.8/lib/eventmachine.rb:995:in `call'
from /opt/local/lib/ruby/gems/1.8/gems/eventmachine-0.12.8/lib/eventmachine.rb:995:in `run_deferred_callbacks'
from /opt/local/lib/ruby/gems/1.8/gems/eventmachine-0.12.8/lib/eventmachine.rb:995:in `times'
from /opt/local/lib/ruby/gems/1.8/gems/eventmachine-0.12.8/lib/eventmachine.rb:995:in `run_deferred_callbacks'
from /opt/local/lib/ruby/gems/1.8/gems/eventmachine-0.12.8/lib/eventmachine.rb:242:in `run_machine'
from /opt/local/lib/ruby/gems/1.8/gems/eventmachine-0.12.8/lib/eventmachine.rb:242:in `run'
from simpleagent/cli.rb:21
My environment is EventMachine 0.12.8, amqp 0.6.0 and rabbitmq 1.7.1.
I wondered whether it was a rabbitmq permission issue, so I re-installed rabbitmq and ran "rabbitconf.rb", but the problem is still there.
Here is the log when I run "rabbitconf.rb":
Setting permissions for user "mapper" in vhost "/nanite" ...
...done.
Setting permissions for user "nanite" in vhost "/nanite" ...
...done.
Listing users ...
guest
mapper
nanite
...done.
Listing vhosts ...
/
/nanite
...done.
Listing permissions in vhost "/nanite" ...
mapper . . .
nanite . . .
...done.
Comments
Can you be a bit more specific? Do you mean the installed dependencies of Nanite? Do you get a specific error message at some point while trying it out?
If so, you need either EventMachine 0.12.8 and the amqp gem < 0.6.5 or EventMachine 0.12.10 and the amqp gem = 0.6.5.
-
I have been struggling for several days with a problem in my agents. Randomly, they will stall and use 100% of the CPU. strace reveals the agents are just context switching and doing nothing:
--- SIGVTALRM (Virtual timer expired) @ 0 (0) --- rt_sigreturn(0) = 40001616
--- SIGVTALRM (Virtual timer expired) @ 0 (0) --- rt_sigreturn(0) = 0
--- SIGVTALRM (Virtual timer expired) @ 0 (0) --- rt_sigreturn(0) = 0
--- SIGVTALRM (Virtual timer expired) @ 0 (0) --- rt_sigreturn(0) = 1
--- SIGVTALRM (Virtual timer expired) @ 0 (0) --- rt_sigreturn(0) = 1
--- SIGVTALRM (Virtual timer expired) @ 0 (0) --- rt_sigreturn(0) = 0
--- SIGVTALRM (Virtual timer expired) @ 0 (0) --- rt_sigreturn(0) = 101
--- SIGVTALRM (Virtual timer expired) @ 0 (0) --- rt_sigreturn(0) = 0
I have tried everything: modified agents to use epoll rather than select, tried Ruby Enterprise Edition and ruby1.9 (they remove the syscalls in strace, but agents still lock). I cannot discern a pattern or reason the agents lock specifically, meaning the job they lock on isn't consistent ASIDE from happening during a job that utilizes net/http to pull down some images and stitch them together.
I thought it might be an issue with calling sleep() inside the agents, but that didn't solve anything. I really have no idea where to go from here.
Pastie to my agent code: http://pastie.org/702881
Pastie to image fetch/stitch code: http://pastie.org/702895
On the plus side, I'll be able to give you a quick modification to nanite that causes it to use epoll, which dropped my CPU utilization a hair while performing a large amount of jobs! Any ideas on where to start even looking from here would be appreciated, otherwise I am going to just start commenting out code until something changes (the worst way to debug!).
Comments
I wish I had any idea where to start. What's the number of messages you're seeing? Please also run the agents with debug log mode so you can at least see what the last of their activities is. I'd like to get Nanite a lot more bullet-proof in that regard.
Also, what EventMachine and AMQP version are you using?
Whoops, forgot to mention the particulars:
AMQP 0.6.5
EM 0.12.10 (same happens with 0.12.8)
It happens when I push through a group of 200 or so jobs, with prefetch set at 1, so only one job is on the agent at a time. It doesn't seem to stall on any particular piece of code that I can discern. Additionally, strace shows absolutely 0 activity outside of the SIGVTALRM signals, even though the CPU is pegged, which is beyond me. From my understanding, this means all threads have finished. This is why I think there is something odd going on with Nanite/AMQP/EM, because it's inconsistent and they think work is done before it is done.
I have been reading up on ltrace and more detailed strace use, so I'll have a bit more to go on here soon, I think.
-
I'm working on a test for our Nanite agent, but the requests initiated by Nanite.request are all async, which means I get nothing back from my agent before the test finishes.
Can I initiate a sync request? Thank you
Comments
-
Two questions
When a nanite agent times out, is there a way to detect it in code? Also, sometimes the agents come back alive after a long-running task. We want to strike a balance between number of agents and timeouts. Any suggestions/ideas?
Also, is there a built-in way to gracefully stop agents? Something like nanite-agent --token "test" stop --force-after 60 seconds
I did read an email from mattmatt a while ago. Thoughts?
Comments
There are callbacks available for timeout, register and unregister in the mapper; the RDoc for Mapper#start has all the info.
Planned feature for me, since we need something similar. I'd like to get some Unix-y stuff like Unicorn has into Nanite so that you can gracefully shutdown and bring up new agents in the meantime.
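A rough sketch of how the timeout, register and unregister callbacks mentioned above might be used to track agents. It assumes the mapper accepts a :callbacks hash keyed by event name and calls each entry with the agent identity and the mapper; the option name and block arguments are assumptions to verify against the RDoc for Mapper#start. To keep the example runnable without Nanite, the callbacks are invoked by hand.

```ruby
# Collect lifecycle events so other code (monitoring, scaling decisions) can inspect them.
events = []

# Assumed shape: one callable per event, called with the agent identity and the mapper.
callbacks = {
  :register   => lambda { |identity, mapper| events << [:register, identity] },
  :unregister => lambda { |identity, mapper| events << [:unregister, identity] },
  :timeout    => lambda { |identity, mapper| events << [:timeout, identity] }
}

# With Nanite loaded, the hash would be handed to the mapper at startup,
# presumably along the lines of:
#   Nanite.start_mapper(:callbacks => callbacks)   # assumption, check the RDoc
# Here the callbacks are triggered manually to show the flow.
callbacks[:register].call("nanite-agent-1", nil)
callbacks[:timeout].call("nanite-agent-1", nil)

timed_out = events.select { |kind, _| kind == :timeout }.map { |_, identity| identity }
puts "timed out agents: #{timed_out.inspect}"
```

Since the callbacks live in the mapper process, anything outside the mapper would need the events published somewhere it can read them, for example a log or a shared store.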
Regarding the first issue, is this functionality available outside of the mapper? That would be ideal. How does nanite-admin get hold of it?
Just curious, how do you deal with creating new agents to scale? Timeouts impact that decision for us big time.
-
nanite won't start up in ruby1.9 without some code changes. nanite uses FileUtils, which does not seem to be available by default in 1.9. I have sent a pull request with a fix for the issue but never got a response (probably with good reason). In any case, with the change nanite runs flawlessly on 1.9. We have it running on several servers without any issues once that change is made.
Comments
I'll check it out. I thought I had caught most of the Ruby 1.9 issues, but I'll look into your fork.
mattmatt: i fixed this on mine, it's 1.9 getting confused on where FileUtils is. It thinks it's a Nanite class. Putting in a "require 'fileutils'" inside nanite.rb solved the issue for me. There is probably a better place for the require, but I didn't feel like digging through and finding all recurrences of FileUtils inside Nanite, and confining the require to those.
pmamediagroup
Wed Nov 18 12:59:32 -0800 2009
i think it's global to 1.9? irb doesn't recognize fileutils either until it's required, at least on our system. we put ours in nanite.rb too
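For reference, the fix discussed in this thread amounts to loading the standard library explicitly before the FileUtils constant is first used. A minimal sketch, independent of Nanite (the directory names are made up for illustration):

```ruby
require 'fileutils' # load explicitly; do not rely on another library having required it
require 'tmpdir'

created = nil
Dir.mktmpdir do |base|
  dir = File.join(base, "nanite-example", "log") # hypothetical path
  FileUtils.mkdir_p(dir)                         # would raise NameError without the require
  created = File.directory?(dir)
end

puts "directory created: #{created}"
```

Placing the require at the top of the file that first references FileUtils keeps the dependency visible next to its use.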

Update: I have no idea how (perhaps because I downgraded to ruby 1.8.6) but it seems to finally queue the job. Now the problem is that it just queues it and it never reaches the agent! (I'm running it inside a merb console.)
Nanite.mapper.job_warden.inspect shows this:
And yes, I am getting a heartbeat from the agents. Also, Nanite.mapper.cluster.nanites lists the nanites!
What am I doing wrong?
Tested it with sinatra (mongrel), and it works perfectly. Also upgraded to amqp edge. Still zero luck with merb :( I also ran merb in debug mode but can't find anything amiss...