[UNTESTED] fix selection of cinder-volume node #1225

Closed
wants to merge 1 commit into from

Conversation

aspiers
Contributor

@aspiers aspiers commented Sep 9, 2016

PR #1195 introduced breakage in the Jenkins openstack-mkcloud job, because it can produce values of cinder_volume such as "d52-54-77-77-77-03.ve1.cloud.suse.de\nd52-54-77-77-77-04.ve1.cloud.suse.de", i.e. multiple hostnames delimited by "\n" rather than by commas, as is correctly done for the other barclamps above.

Instead, we assign the value to an array, splitting on whitespace, so that the first host is picked correctly.
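
A minimal sketch of the idea, not the actual patch: the array and result variable names are assumptions made for illustration, and the value is the example quoted above.

```bash
#!/bin/bash
# Example value from the description: two hostnames separated by a newline.
cinder_volume=$'d52-54-77-77-77-03.ve1.cloud.suse.de\nd52-54-77-77-77-04.ve1.cloud.suse.de'

# Unquoted expansion is split on $IFS (space, tab and newline by default),
# so each hostname becomes its own array element.
# cinder_volume_nodes is a hypothetical name.
cinder_volume_nodes=( $cinder_volume )

# Keep only the first host (first_cinder_volume is also hypothetical).
first_cinder_volume=${cinder_volume_nodes[0]}
echo "$first_cinder_volume"
```

Because the split is on any whitespace, this picks the first host whether the list is delimited by newlines or by spaces.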

Here's an example failure:

W, [2016-09-08T20:36:56.103076 #8898:0x00000005f37410]  WARN -- Could not recover Chef Crowbar Node on load d52-54-77-77-77-03.ve1.cloud.suse.de
d52-54-77-77-77-04.ve1.cloud.suse.de: #<URI::InvalidURIError: bad URI(is not URI?): http://localhost:4000/nodes/d52-54-77-77-77-03.ve1.cloud.suse.de
d52-54-77-77-77-04.ve1.cloud.suse.de>
I, [2016-09-08T20:36:56.103681 #8898:0x00000005f37410]  INFO -- Completed 500 Internal Server Error in 85ms (ActiveRecord: 8.0ms)
F, [2016-09-08T20:36:56.104726 #8898:0x00000005f37410] FATAL --
NoMethodError (undefined method `[]' for nil:NilClass):
  app/models/service_object.rb:744:in `block (2 levels) in violates_exclude_platform_constraint?'
  app/models/service_object.rb:743:in `each'
  app/models/service_object.rb:743:in `any?'
  app/models/service_object.rb:743:in `block in violates_exclude_platform_constraint?'
  app/models/service_object.rb:739:in `each'
  app/models/service_object.rb:739:in `violates_exclude_platform_constraint?'
  app/models/service_object.rb:805:in `block in validate_proposal_constraints'
  app/models/service_object.rb:778:in `each'
  app/models/service_object.rb:778:in `validate_proposal_constraints'
  app/models/service_object.rb:669:in `validate_proposal_after_save'
  app/models/cinder_service.rb:204:in `validate_proposal_after_save'
  app/models/service_object.rb:547:in `save_proposal!'
  app/models/service_object.rb:871:in `_proposal_update'
  app/models/service_object.rb:526:in `proposal_edit'
  app/controllers/barclamp_controller.rb:563:in `proposal_update'

The problem occurs in this line in
ServiceObject#violates_exclude_platform_constraint?:

    node = NodeObject.find_node_by_name(element)

Chef::Node.load raises the URI::InvalidURIError exception which gets caught, turned into a warning, and then nil is returned for the node, causing the NoMethodError soon after.

@aspiers
Contributor Author

aspiers commented Sep 9, 2016

@nicolasbock
Contributor

+1

@rsalevsky
Contributor

Could this be the reason for this mkcloud failure? https://ci.suse.de/job/openstack-mkcloud/33567/console

@nicolasbock
Contributor

@rsalevsky Yes, I think so, good catch!

@aspiers
Contributor Author

aspiers commented Sep 12, 2016

@rsalevsky It's definitely the reason for that failure.

@rhafer
Contributor

rhafer commented Sep 13, 2016

+1

@rhafer rhafer removed their assignment Sep 13, 2016
@aplanas
Contributor

aplanas commented Sep 13, 2016

+1

@aspiers
Contributor Author

aspiers commented Sep 13, 2016

Someone (maybe me) needs to verify that this really does the right thing before merging.

@ellisab
Contributor

ellisab commented Sep 13, 2016

Without this fix, HA deployment for GM5+up and GM6+up always fails with:

qa_crowbarsetup.sh: line 3032: [: too many arguments
Failed to talk to service proposal edit: 500: {"status":"500","error":"Internal Server Error"}
Error: 'crowbar cinder proposal --file=/root/cinder.default.proposal edit default' failed with exit code: 1

With the fix, the mkcloud run completes successfully, including the tempest smoke test run. A side effect is that it exposes other small problems, such as:

Starting proposal nova(default) at: Tue Sep 13 10:12:43 UTC 2016
qa_crowbarsetup.sh: line 2866: [: too many arguments

Starting proposal ceilometer(default) at: Tue Sep 13 10:20:33 UTC 2016
qa_crowbarsetup.sh: line 2923: [: too many arguments

@aspiers
Contributor Author

aspiers commented Sep 13, 2016

@ellisab Thanks for the useful info! Can you share the logs from an mkcloud run where those problems with nova and ceilometer occur?

The gating check for this PR passed, but it ran without hacloud enabled, so it did not test this change. I have manually triggered a rebuild with hacloud=1.

@aspiers
Contributor Author

aspiers commented Sep 13, 2016

I would say

qa_crowbarsetup.sh: line 2866: [: too many arguments

is not a small problem, it's a potentially bad bug.

@aspiers
Contributor Author

aspiers commented Sep 13, 2016

The logs from @ellisab didn't have debug enabled, so although it very much looks like this PR is working, I can't verify that 100%. However, the gating run I triggered is providing the required info.

@aspiers
Contributor Author

aspiers commented Sep 13, 2016

The gate failed, but I think I might have messed up the build parameters.

@rsalevsky
Contributor

@aspiers Has there been any progress? This is really creating blockers for C5 and C6.

@aspiers
Contributor Author

aspiers commented Sep 17, 2016

@rsalevsky @rhafer The gate appears to be failing because get_unclustered_sles12plus_nodes is returning no nodes. This function was introduced by #1195 (the same one which created the delimiter-related breakage explained above), but I'm having trouble understanding #1195, so I think it makes more sense to hand this to @rhafer who wrote it :-)

@rhafer
Contributor

rhafer commented Sep 20, 2016

@aspiers Thanks. I am currently trying to come up with a more complete fix for this. (AFAICS the same or similar issues are there for ceilometer and nova).

@rhafer
Contributor

rhafer commented Sep 20, 2016

I've created #1255 which is rather similar to this, but also tries to avoid similar issues for ceilometer and nova.

The

qa_crowbarsetup.sh: line 2866: [: too many arguments

error was, by the way, caused by missing quoting in the if [ -z $something ] tests.
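
A small sketch of why the unquoted form breaks, assuming the variable holds a multi-host value like the one discussed earlier in this thread. The value and the status variable names are illustrative assumptions, and the exact wording of the shell's error message depends on how many words the value expands to.

```bash
# Hypothetical multi-word value: two hostnames separated by a newline.
something='d52-54-77-77-77-03.ve1.cloud.suse.de
d52-54-77-77-77-04.ve1.cloud.suse.de'

# Unquoted: word splitting hands `[` several arguments after -z,
# which is a usage error (exit status greater than 1).
[ -z $something ] 2>/dev/null && unquoted_status=0 || unquoted_status=$?

# Quoted: `[` sees exactly one argument after -z, so the test is well-formed;
# the status is 1 because the string is not empty.
[ -z "$something" ] && quoted_status=0 || quoted_status=$?

echo "unquoted: $unquoted_status, quoted: $quoted_status"
```

Quoting the expansion keeps the test well-formed no matter how many hosts the variable contains.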

@rhafer
Contributor

rhafer commented Sep 20, 2016

@rsalevsky @rhafer The gate appears to be failing because get_unclustered_sles12plus_nodes is returning no nodes.

It's returning no nodes because there were no unclustered SLE12-SP2 nodes left in the deployment. It was a 4-node setup using HA and ceph. As ceph currently needs two SP1 nodes, there's nothing left to deploy cinder-volume, ceilometer-agent and nova-compute to. So the fix would be to either use at least 5 nodes or deploy with want_ceph=0.

@aspiers
Contributor Author

aspiers commented Sep 20, 2016

@rhafer Thanks a lot for working on this! I guess we can close this in favour of #1255 then.

@aspiers aspiers closed this Sep 20, 2016
6 participants