DRB error : connection timeout #13958

Closed

fvillain opened this issue Feb 17, 2017 · 15 comments

Comments

@fvillain

Hi,

We got a DRb error and the appliance doesn't start properly. I got the following logs:

[----] E, [2017-01-12T06:44:15.394235 #32307:73797c] ERROR -- : EMS [] as [AKIAJAK6YKET7IZL6TBA] ID [150807] PID [32307] GUID [6e0e011e-d8bc-11e6-94b7-06dc150d810d] Error heartbeating to MiqServer because DRb::DRbConnError: Connection reset by peer Worker exiting.
[----] I, [2017-01-12T06:44:15.479441 #32244:73797c]  INFO -- : MIQ(ManageIQ::Providers::Amazon::CloudManager::RefreshWorker#log_status) [Refresh Worker for Cloud/Infrastructure Providers: AWS Singapore] Worker ID [150800], PID [32244], GUID [6dd2de40-d8bc-11e6-94b7-06dc150d810d], Last Heartbeat [2017-01-12 11:44:12 UTC], Process Info: Memory Usage [311087104], Memory Size [650801152], Proportional Set Size: [213718000], Memory % [2.03], CPU Time [137.0], CPU % [0.06], Priority [27]
[----] E, [2017-01-12T06:44:15.479840 #32244:73797c] ERROR -- : EMS [] as [AKIAJAK****] ID [150800] PID [32244] GUID [6dd2de40-d8bc-11e6-94b7-06dc150d810d] Error heartbeating to MiqServer because DRb::DRbConnError: Connection reset by peer Worker exiting.
[----] I, [2017-01-12T06:44:15.510840 #32253:73797c]  INFO -- : MIQ(ManageIQ::Providers::Amazon::CloudManager::RefreshWorker#log_status) [Refresh Worker for Cloud/Infrastructure Providers: AWS Sao Paulo] Worker ID [150801], PID [32253], GUID [6dd7f54c-d8bc-11e6-94b7-06dc150d810d], Last Heartbeat [2017-01-12 11:44:12 UTC], Process Info: Memory Usage [311148544], Memory Size [651853824], Proportional Set Size: [213737000], Memory % [2.03], CPU Time [136.0], CPU % [0.06], Priority [27]
[----] E, [2017-01-12T06:44:15.511207 #32253:73797c] ERROR -- : EMS [] as [AKIAJAK****] ID [150801] PID [32253] GUID [6dd7f54c-d8bc-11e6-94b7-06dc150d810d] Error heartbeating to MiqServer because DRb::DRbConnError: Connection reset by peer Worker exiting.

@jrafanie looked it up, and it looks like the server process was failing when trying to sync_workers for one of the worker classes, possibly for the Cinder/Swift providers. For some reason, calling authentications on the provider returns nil instead of an empty array, even though it's a Rails relation. It looks like a bug.

/var/www/miq/vmdb/app/models/mixins/authentication_mixin.rb:26:in `authentication_userid_passwords': private method `select' called for nil:NilClass (NoMethodError)
	from /var/www/miq/vmdb/app/models/mixins/authentication_mixin.rb:356:in `available_authentications'
	from /var/www/miq/vmdb/app/models/mixins/authentication_mixin.rb:189:in `authentication_type'
	from /var/www/miq/vmdb/app/models/mixins/authentication_mixin.rb:344:in `authentication_best_fit'
	from /var/www/miq/vmdb/app/models/mixins/authentication_mixin.rb:99:in `authentication_status_ok?'
	from /var/www/miq/vmdb/app/models/mixins/per_ems_worker_mixin.rb:21:in `select'
	from /var/www/miq/vmdb/app/models/mixins/per_ems_worker_mixin.rb:21:in `all_valid_ems_in_zone'
	from /var/www/miq/vmdb/app/models/mixins/per_ems_worker_mixin.rb:26:in `desired_queue_names'
	from /var/www/miq/vmdb/app/models/mixins/per_ems_worker_mixin.rb:32:in `sync_workers'
	from /var/www/miq/vmdb/app/models/miq_server/worker_management/monitor.rb:52:in `block in sync_workers'
	from /var/www/miq/vmdb/app/models/miq_server/worker_management/monitor.rb:50:in `each'
	from /var/www/miq/vmdb/app/models/miq_server/worker_management/monitor.rb:50:in `sync_workers'
	from /var/www/miq/vmdb/app/models/miq_server.rb:158:in `start'
	from /var/www/miq/vmdb/app/models/miq_server.rb:249:in `start'
	from /var/www/miq/vmdb/lib/workers/evm_server.rb:65:in `start'
	from /var/www/miq/vmdb/lib/workers/evm_server.rb:92:in `start'
	from /var/www/miq/vmdb/lib/workers/bin/evm_server.rb:4:in `<main>'
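For anyone trying to reproduce the crash outside the appliance, the failure boils down to calling Enumerable-style methods on a nil relation. A minimal sketch (hypothetical class, not the real mixin):

class BrokenManager
  # Stands in for a storage manager whose authentications delegation
  # returns nil because parent_manager is missing.
  def authentications
    nil
  end

  def authentication_userid_passwords
    # nil.select blows up exactly like the backtrace above
    authentications.select { |a| a.userid }
  end
end

BrokenManager.new.authentication_userid_passwords
# => NoMethodError: private method `select' called for nil:NilClass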

You can see the full discussion and details here: http://talk.manageiq.org/t/drb-error-connection-timeout/2025

Thank you!

@fvillain
Author

Forgot to say: it happened on the euwe-1 version, if that's useful.

@jrafanie
Member

Thanks for reporting this @fvillain! @blomquisg @Ladas Can you take a look? I'm not quite sure where to start understanding why the authentications relation is returning nil instead of an empty AR relation. Thanks.

@Ladas
Contributor

Ladas commented Feb 17, 2017

@jrafanie In cloud, the other managers delegate authentication to the CloudManager. So if the .parent_manager association is missing, .authentications will return nil.
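Roughly the pattern in question (a simplified sketch; the exact options in the ManageIQ models are an assumption based on the behaviour described):

require "active_support/core_ext/module/delegation"

# Sketch of the storage manager delegation: with :allow_nil, a missing
# parent_manager makes the delegated call quietly return nil instead of raising.
class StorageManagerSketch
  attr_accessor :parent_manager
  delegate :authentications, :to => :parent_manager, :allow_nil => true
end

StorageManagerSketch.new.authentications
# => nil, which later crashes callers that expect an (empty) AR relation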

@jrafanie
Member

Good find @Ladas. I thought we had ensure_managers to prevent that from happening. I'd imagine we shouldn't delegate to something that might not be set, so we either prevent it from happening via something like ensure_managers, or we need a hack proxy or something that does the right thing.
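For illustration, the "something that does the right thing" option could look like this (just a sketch of the idea, not an actual patch; Authentication.none assumes a Rails-style null relation is acceptable here):

# Illustration only: fall back to an empty relation instead of nil when the
# parent manager is missing, so callers like authentication_userid_passwords
# can keep iterating over the result.
def authentications
  parent_manager ? parent_manager.authentications : Authentication.none
end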

@jrafanie
Member

@Ladas I opened #13976 so the server doesn't die when a worker class's sync_workers blows up. I'll leave this issue open so we can fix the various sync_workers blowing up in the first place.

jrafanie added a commit to jrafanie/manageiq that referenced this issue Feb 17, 2017
Related to ManageIQ#13958

In the above issue, if ManageIQ::Providers::StorageManager::CinderManager::EventCatcher.sync_workers
raises an exception, the server process exits fatally and all workers exit with
`Error heartbeating to MiqServer because DRb::DRbConnError: Connection reset by peer Worker exiting.`

We now rescue any exceptions here, log them, and move on to the other worker
classes.
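The change is roughly of this shape (a sketch of the idea behind #13976, not the actual diff; the worker_class_names lookup and the _log helpers are stand-ins):

# Rescue per worker class so a failing sync_workers is logged and skipped
# instead of killing the whole server process.
def sync_workers
  result = {}
  worker_class_names.each do |class_name|   # stand-in for the real list of worker classes
    begin
      result[class_name] = class_name.constantize.sync_workers
    rescue => err
      _log.error("Failed to sync_workers for class: #{class_name}")
      _log.error(err.message)
    end
  end
  result
end
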
@Ladas
Contributor

Ladas commented Feb 20, 2017

@jrafanie ensure_managers had a side effect that could actually be causing this. When deleting the managers, a running refresh would re-add managers without the parent manager. So now ensure_managers runs only on create.
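i.e. the callback now looks roughly like this (a sketch; the surrounding model code is omitted):

# Before: ran on every validation, so a refresh that re-saved the record
# could re-add child managers without their parent_manager.
#   before_validation :ensure_managers
#
# After: only run when the manager is first created.
before_validation :ensure_managers, :on => :create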

@carbonin
Member

This issue also seems to be causing this BZ https://bugzilla.redhat.com/show_bug.cgi?id=1417171

In this case we were calling sync_workers from the monitor code rather than at startup. This had the effect of preventing the server from starting workers such as the UI worker.

After @jrafanie's change I see the following in the logs:

[----] E, [2017-02-20T14:05:52.927795 #3012:481138] ERROR -- : MIQ(MiqServer#sync_workers) Failed to sync_workers for class: ManageIQ::Providers::StorageManager::SwiftManager::RefreshWorker
[----] E, [2017-02-20T14:05:52.928865 #3012:481138] ERROR -- : [NoMethodError]: private method `select' called for nil:NilClass  Method:[rescue in block in sync_workers]
[----] E, [2017-02-20T14:05:52.929022 #3012:481138] ERROR -- : /var/www/miq/vmdb/app/models/mixins/authentication_mixin.rb:26:in `authentication_userid_passwords'
/var/www/miq/vmdb/app/models/mixins/authentication_mixin.rb:356:in `available_authentications'
/var/www/miq/vmdb/app/models/mixins/authentication_mixin.rb:189:in `authentication_type'
/var/www/miq/vmdb/app/models/mixins/authentication_mixin.rb:344:in `authentication_best_fit'
/var/www/miq/vmdb/app/models/mixins/authentication_mixin.rb:99:in `authentication_status_ok?'
/var/www/miq/vmdb/app/models/mixins/per_ems_worker_mixin.rb:21:in `select'
/var/www/miq/vmdb/app/models/mixins/per_ems_worker_mixin.rb:21:in `all_valid_ems_in_zone'
/var/www/miq/vmdb/app/models/mixins/per_ems_worker_mixin.rb:26:in `desired_queue_names'
/var/www/miq/vmdb/app/models/mixins/per_ems_worker_mixin.rb:32:in `sync_workers'
/var/www/miq/vmdb/app/models/miq_server/worker_management/monitor.rb:53:in `block in sync_workers'
/var/www/miq/vmdb/app/models/miq_server/worker_management/monitor.rb:50:in `each'
/var/www/miq/vmdb/app/models/miq_server/worker_management/monitor.rb:50:in `sync_workers'
/var/www/miq/vmdb/app/models/miq_server/worker_management/monitor.rb:22:in `monitor_workers'
/var/www/miq/vmdb/app/models/miq_server.rb:346:in `block in monitor'
/var/www/miq/vmdb/gems/pending/util/extensions/miq-benchmark.rb:11:in `realtime_store'
/var/www/miq/vmdb/gems/pending/util/extensions/miq-benchmark.rb:30:in `realtime_block'
/var/www/miq/vmdb/app/models/miq_server.rb:346:in `monitor'
/var/www/miq/vmdb/app/models/miq_server.rb:368:in `block (2 levels) in monitor_loop'
/var/www/miq/vmdb/gems/pending/util/extensions/miq-benchmark.rb:11:in `realtime_store'
/var/www/miq/vmdb/gems/pending/util/extensions/miq-benchmark.rb:30:in `realtime_block'
/var/www/miq/vmdb/app/models/miq_server.rb:368:in `block in monitor_loop'
/var/www/miq/vmdb/app/models/miq_server.rb:367:in `loop'
/var/www/miq/vmdb/app/models/miq_server.rb:367:in `monitor_loop'
/var/www/miq/vmdb/app/models/miq_server.rb:250:in `start'
/var/www/miq/vmdb/lib/workers/evm_server.rb:65:in `start'
/var/www/miq/vmdb/lib/workers/evm_server.rb:92:in `start'
/var/www/miq/vmdb/lib/workers/bin/evm_server.rb:4:in `<main>'

@blomquisg assigned Ladas and unassigned blomquisg Mar 6, 2017
@Ladas
Contributor

Ladas commented Mar 7, 2017

The ensure_managers that was creating managers without a parent manager was fixed here:
#12878

The general fix for the delegation issues is here:
#12884

but a bug in Rails prevents that from being finished.

Although after #12878 we should not be seeing managers without a parent manager, so any idea why this is still happening?

@Ladas
Contributor

Ladas commented Mar 7, 2017

@durandom it seems like Ansible still does before_validation: https://github.com/Ladas/manageiq/blob/2835c365b3f180cd36911a5bd4346c8ef7d11ff3/app/models/manageiq/providers/ansible_tower/provider_mixin.rb#L7

@carbonin @jrafanie can you check which providers are causing this failure?

@jrafanie
Member

jrafanie commented Mar 7, 2017

@Ladas it was reported on Swift and Cinder providers here. I believe @fvillain also had Swift/Cinder providers in the description.

Note, @fvillain, if you use the master branch, we no longer have a fatal error in the server process; instead, the failing worker class will log a message like this in evm.log:
"Failed to sync_workers for class: #{class_name}"

See: #13976

@durandom
Member

durandom commented Mar 8, 2017

@Ladas
Contributor

Ladas commented Mar 10, 2017

@jrafanie right, so I am pretty sure that #12878 should fix this issue. Then the storage managers should be deleted when you delete a cloud manager.

@fvillain
Author

@jrafanie @Ladas: I confirm we have Cinder/Swift providers.

@jrafanie We now run on the stable release of MIQ (euwe-1). Will this fix be backported to euwe-1, or only shipped with the next stable release?

@jrafanie
Member

Yes @fvillain, the next tag of euwe will contain this fix.

It was backported to euwe as part of #12878 here

@carbonin
Member

Looks like this was fixed in #12878
