DRB error : connection timeout #13958

Closed

fvillain opened this issue Feb 17, 2017 · 15 comments

Comments

@fvillain

Hi,

We got a DRb error and the appliance doesn't start properly. I got the following logs:

[----] E, [2017-01-12T06:44:15.394235 #32307:73797c] ERROR -- : EMS [] as [AKIAJAK6YKET7IZL6TBA] ID [150807] PID [32307] GUID [6e0e011e-d8bc-11e6-94b7-06dc150d810d] Error heartbeating to MiqServer because DRb::DRbConnError: Connection reset by peer Worker exiting.
[----] I, [2017-01-12T06:44:15.479441 #32244:73797c]  INFO -- : MIQ(ManageIQ::Providers::Amazon::CloudManager::RefreshWorker#log_status) [Refresh Worker for Cloud/Infrastructure Providers: AWS Singapore] Worker ID [150800], PID [32244], GUID [6dd2de40-d8bc-11e6-94b7-06dc150d810d], Last Heartbeat [2017-01-12 11:44:12 UTC], Process Info: Memory Usage [311087104], Memory Size [650801152], Proportional Set Size: [213718000], Memory % [2.03], CPU Time [137.0], CPU % [0.06], Priority [27]
[----] E, [2017-01-12T06:44:15.479840 #32244:73797c] ERROR -- : EMS [] as [AKIAJAK****] ID [150800] PID [32244] GUID [6dd2de40-d8bc-11e6-94b7-06dc150d810d] Error heartbeating to MiqServer because DRb::DRbConnError: Connection reset by peer Worker exiting.
[----] I, [2017-01-12T06:44:15.510840 #32253:73797c]  INFO -- : MIQ(ManageIQ::Providers::Amazon::CloudManager::RefreshWorker#log_status) [Refresh Worker for Cloud/Infrastructure Providers: AWS Sao Paulo] Worker ID [150801], PID [32253], GUID [6dd7f54c-d8bc-11e6-94b7-06dc150d810d], Last Heartbeat [2017-01-12 11:44:12 UTC], Process Info: Memory Usage [311148544], Memory Size [651853824], Proportional Set Size: [213737000], Memory % [2.03], CPU Time [136.0], CPU % [0.06], Priority [27]
[----] E, [2017-01-12T06:44:15.511207 #32253:73797c] ERROR -- : EMS [] as [AKIAJAK****] ID [150801] PID [32253] GUID [6dd7f54c-d8bc-11e6-94b7-06dc150d810d] Error heartbeating to MiqServer because DRb::DRbConnError: Connection reset by peer Worker exiting.

@jrafanie looked it up, and it looks like the server process was failing when trying to sync_workers for one of the worker classes, possibly for the Cinder/Swift providers. For some reason, calling authentications on the provider returns nil instead of an empty array, even though it's a Rails relation. It looks like a bug.

/var/www/miq/vmdb/app/models/mixins/authentication_mixin.rb:26:in `authentication_userid_passwords': private method `select' called for nil:NilClass (NoMethodError)
	from /var/www/miq/vmdb/app/models/mixins/authentication_mixin.rb:356:in `available_authentications'
	from /var/www/miq/vmdb/app/models/mixins/authentication_mixin.rb:189:in `authentication_type'
	from /var/www/miq/vmdb/app/models/mixins/authentication_mixin.rb:344:in `authentication_best_fit'
	from /var/www/miq/vmdb/app/models/mixins/authentication_mixin.rb:99:in `authentication_status_ok?'
	from /var/www/miq/vmdb/app/models/mixins/per_ems_worker_mixin.rb:21:in `select'
	from /var/www/miq/vmdb/app/models/mixins/per_ems_worker_mixin.rb:21:in `all_valid_ems_in_zone'
	from /var/www/miq/vmdb/app/models/mixins/per_ems_worker_mixin.rb:26:in `desired_queue_names'
	from /var/www/miq/vmdb/app/models/mixins/per_ems_worker_mixin.rb:32:in `sync_workers'
	from /var/www/miq/vmdb/app/models/miq_server/worker_management/monitor.rb:52:in `block in sync_workers'
	from /var/www/miq/vmdb/app/models/miq_server/worker_management/monitor.rb:50:in `each'
	from /var/www/miq/vmdb/app/models/miq_server/worker_management/monitor.rb:50:in `sync_workers'
	from /var/www/miq/vmdb/app/models/miq_server.rb:158:in `start'
	from /var/www/miq/vmdb/app/models/miq_server.rb:249:in `start'
	from /var/www/miq/vmdb/lib/workers/evm_server.rb:65:in `start'
	from /var/www/miq/vmdb/lib/workers/evm_server.rb:92:in `start'
	from /var/www/miq/vmdb/lib/workers/bin/evm_server.rb:4:in `<main>'
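For anyone trying to reproduce the crash outside the appliance, the failure boils down to calling Enumerable-style methods on a nil relation. A minimal sketch (hypothetical class, not the real mixin):

class BrokenManager
  # Stands in for a storage manager whose authentications delegation
  # returns nil because parent_manager is missing.
  def authentications
    nil
  end

  def authentication_userid_passwords
    # nil.select blows up exactly like the backtrace above
    authentications.select { |a| a.userid }
  end
end

BrokenManager.new.authentication_userid_passwords
# => NoMethodError: private method `select' called for nil:NilClass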

You can see the full discussion and details here: http://talk.manageiq.org/t/drb-error-connection-timeout/2025

Thank you!

@fvillain
Author

Forgot to say: it happened on the euwe-1 version, if that's useful.

@jrafanie
Member

Thanks for reporting this @fvillain! @blomquisg @Ladas Can you take a look? I'm not quite sure where to start understanding why the authentications relation is returning nil instead of an empty AR relation. Thanks.

@Ladas
Contributor

Ladas commented Feb 17, 2017

@jrafanie In cloud, the other managers delegate authentication to the CloudManager. So if the .parent_manager association is missing, .authentications will return nil.
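Roughly the pattern in question (a simplified sketch; the exact options in the ManageIQ models are an assumption based on the behaviour described):

require "active_support/core_ext/module/delegation"

# Sketch of the storage manager delegation: with :allow_nil, a missing
# parent_manager makes the delegated call quietly return nil instead of raising.
class StorageManagerSketch
  attr_accessor :parent_manager
  delegate :authentications, :to => :parent_manager, :allow_nil => true
end

StorageManagerSketch.new.authentications
# => nil, which later crashes callers that expect an (empty) AR relation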

@jrafanie
Member

Good find @Ladas. I thought we had ensure_managers to prevent that from happening. I'd imagine we shouldn't delegate to something that might not be set, so we either prevent it from happening via something like ensure_managers, or we need a hack proxy or something that does the right thing.
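For illustration, the "something that does the right thing" option could look like this (just a sketch of the idea, not an actual patch; Authentication.none assumes a Rails-style null relation is acceptable here):

# Illustration only: fall back to an empty relation instead of nil when the
# parent manager is missing, so callers like authentication_userid_passwords
# can keep iterating over the result.
def authentications
  parent_manager ? parent_manager.authentications : Authentication.none
end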

@jrafanie
Member

@Ladas I opened #13976 so the server doesn't die when a worker class's sync_workers blows up. I'll leave this issue open so we can fix the various sync_workers blowing up in the first place.

jrafanie added a commit to jrafanie/manageiq that referenced this issue Feb 17, 2017
Related to ManageIQ#13958

In the above issue, if ManageIQ::Providers::StorageManager::CinderManager::EventCatcher.sync_workers
raises an exception, the server process exits fatally and all workers exit with
`Error heartbeating to MiqServer because DRb::DRbConnError: Connection reset by peer Worker exiting.`

We now rescue any exceptions here, log them, and move on to the other worker
classes.
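The change is roughly of this shape (a sketch of the idea behind #13976, not the actual diff; the worker_class_names lookup and the _log helpers are stand-ins):

# Rescue per worker class so a failing sync_workers is logged and skipped
# instead of killing the whole server process.
def sync_workers
  result = {}
  worker_class_names.each do |class_name|   # stand-in for the real list of worker classes
    begin
      result[class_name] = class_name.constantize.sync_workers
    rescue => err
      _log.error("Failed to sync_workers for class: #{class_name}")
      _log.error(err.message)
    end
  end
  result
end
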
@Ladas
Contributor

Ladas commented Feb 20, 2017

@jrafanie ensure_managers had a side effect that could actually be causing this. When deleting the managers, a running refresh would re-add managers without the parent manager. So now ensure_managers runs only on create.
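i.e. the callback now looks roughly like this (a sketch; the surrounding model code is omitted):

# Before: ran on every validation, so a refresh that re-saved the record
# could re-add child managers without their parent_manager.
#   before_validation :ensure_managers
#
# After: only run when the manager is first created.
before_validation :ensure_managers, :on => :create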

@carbonin
Member

This issue also seems to be causing this BZ https://bugzilla.redhat.com/show_bug.cgi?id=1417171

In this case we were calling sync_workers from the monitor code rather than at startup. This had the effect of preventing the server from starting workers such as the UI worker.

After @jrafanie's change I see the following in the logs:

[----] E, [2017-02-20T14:05:52.927795 #3012:481138] ERROR -- : MIQ(MiqServer#sync_workers) Failed to sync_workers for class: ManageIQ::Providers::StorageManager::SwiftManager::RefreshWorker
[----] E, [2017-02-20T14:05:52.928865 #3012:481138] ERROR -- : [NoMethodError]: private method `select' called for nil:NilClass  Method:[rescue in block in sync_workers]
[----] E, [2017-02-20T14:05:52.929022 #3012:481138] ERROR -- : /var/www/miq/vmdb/app/models/mixins/authentication_mixin.rb:26:in `authentication_userid_passwords'
/var/www/miq/vmdb/app/models/mixins/authentication_mixin.rb:356:in `available_authentications'
/var/www/miq/vmdb/app/models/mixins/authentication_mixin.rb:189:in `authentication_type'
/var/www/miq/vmdb/app/models/mixins/authentication_mixin.rb:344:in `authentication_best_fit'
/var/www/miq/vmdb/app/models/mixins/authentication_mixin.rb:99:in `authentication_status_ok?'
/var/www/miq/vmdb/app/models/mixins/per_ems_worker_mixin.rb:21:in `select'
/var/www/miq/vmdb/app/models/mixins/per_ems_worker_mixin.rb:21:in `all_valid_ems_in_zone'
/var/www/miq/vmdb/app/models/mixins/per_ems_worker_mixin.rb:26:in `desired_queue_names'
/var/www/miq/vmdb/app/models/mixins/per_ems_worker_mixin.rb:32:in `sync_workers'
/var/www/miq/vmdb/app/models/miq_server/worker_management/monitor.rb:53:in `block in sync_workers'
/var/www/miq/vmdb/app/models/miq_server/worker_management/monitor.rb:50:in `each'
/var/www/miq/vmdb/app/models/miq_server/worker_management/monitor.rb:50:in `sync_workers'
/var/www/miq/vmdb/app/models/miq_server/worker_management/monitor.rb:22:in `monitor_workers'
/var/www/miq/vmdb/app/models/miq_server.rb:346:in `block in monitor'
/var/www/miq/vmdb/gems/pending/util/extensions/miq-benchmark.rb:11:in `realtime_store'
/var/www/miq/vmdb/gems/pending/util/extensions/miq-benchmark.rb:30:in `realtime_block'
/var/www/miq/vmdb/app/models/miq_server.rb:346:in `monitor'
/var/www/miq/vmdb/app/models/miq_server.rb:368:in `block (2 levels) in monitor_loop'
/var/www/miq/vmdb/gems/pending/util/extensions/miq-benchmark.rb:11:in `realtime_store'
/var/www/miq/vmdb/gems/pending/util/extensions/miq-benchmark.rb:30:in `realtime_block'
/var/www/miq/vmdb/app/models/miq_server.rb:368:in `block in monitor_loop'
/var/www/miq/vmdb/app/models/miq_server.rb:367:in `loop'
/var/www/miq/vmdb/app/models/miq_server.rb:367:in `monitor_loop'
/var/www/miq/vmdb/app/models/miq_server.rb:250:in `start'
/var/www/miq/vmdb/lib/workers/evm_server.rb:65:in `start'
/var/www/miq/vmdb/lib/workers/evm_server.rb:92:in `start'
/var/www/miq/vmdb/lib/workers/bin/evm_server.rb:4:in `<main>'

@blomquisg assigned Ladas and unassigned blomquisg Mar 6, 2017
@Ladas
Contributor

Ladas commented Mar 7, 2017

The ensure_managers that was creating managers without a parent manager was fixed here:
#12878

The general fix for the delegation issues is here:
#12884

but a bug in Rails prevents that from being finished.

Although after #12878 we should not be seeing managers without a parent manager, so any idea why this is still happening?

@Ladas
Contributor

Ladas commented Mar 7, 2017

@durandom it seems like Ansible still does before_validation: https://github.com/Ladas/manageiq/blob/2835c365b3f180cd36911a5bd4346c8ef7d11ff3/app/models/manageiq/providers/ansible_tower/provider_mixin.rb#L7

@carbonin @jrafanie can you check which providers are causing this failure?

@jrafanie
Member

jrafanie commented Mar 7, 2017

@Ladas it was reported on Swift and Cinder providers here. I believe @fvillain also had Swift/Cinder providers in the description.

Note, @fvillain, if you use the master branch, we no longer have a fatal error in the server process; instead, the failing worker class will log a message like this in evm.log:
"Failed to sync_workers for class: #{class_name}"

See: #13976

@durandom
Member

durandom commented Mar 8, 2017

@Ladas
Contributor

Ladas commented Mar 10, 2017

@jrafanie right, so I am pretty sure that #12878 should fix this issue. Then the storage managers should be deleted when you delete a cloud manager.

@fvillain
Author

@jrafanie @Ladas: I confirm we have Cinder/Swift providers.

@jrafanie We now run on the stable release of MIQ (euwe-1). Will this fix be backported to euwe-1, or only shipped with the next stable release?

@jrafanie
Member

Yes @fvillain, the next tag of euwe will contain this fix.

It was backported to euwe as part of #12878 here

@carbonin
Member

Looks like this was fixed in #12878
