
Fixes #25415 - import Hypervisor facts from Candlepin #7821

Merged
merged 1 commit into from
Jan 9, 2019

Conversation

@evgeni (Member) commented Nov 8, 2018

No description provided.

@theforeman-bot

Issues: #25415

```
@@ -35,6 +35,15 @@ def get_all(uuids)
  consumers
end

# workaround for https://bugzilla.redhat.com/1647724
```
@evgeni (Member Author):

This might have a heavy performance impact, as it queries each hypervisor directly. Do we have tests that cover this?

Contributor:

The number of hypervisors won't be too big, I think, and it also runs asynchronously... We should ask QE for an explicit test if we're in doubt, but this should be good until Candlepin provides the data on the /consumers endpoint.

@evgeni (Member Author):

We're currently batching the requests 75 at a time, so if we have 15k hypervisors (and users do), the old method makes 200 HTTPS requests, while the new one makes 15000. I'll run a few tests on my local box to get some before/after numbers.
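As a rough sketch of the request-count arithmetic above (the helper names here are illustrative, not code from the PR): batching 75 UUIDs per request versus one request per hypervisor.

```ruby
# Batched fetch: 75 consumer UUIDs per HTTPS request.
BATCH_SIZE = 75

def batched_requests(uuid_count, batch_size = BATCH_SIZE)
  (uuid_count.to_f / batch_size).ceil
end

# Workaround path: one request per hypervisor UUID.
def per_uuid_requests(uuid_count)
  uuid_count
end

puts batched_requests(15_000)   # => 200
puts per_uuid_requests(15_000)  # => 15000
```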

@evgeni (Member Author):

with 500 hypervisors, 50 guests each

before patch: 0(run)+110(finalize) sec new, 0(run)+20(finalize) sec update
after patch: 180(run)+200(finalize) sec new, 75(run)+175(finalize) sec update

I'll see how we can speed that up :)

@evgeni (Member Author):

@lzap @jlsherrill if you have ideas, please :)

Contributor:

My idea is to immediately add telemetry to this file to get some real-world numbers; then we can talk about poking the codebase. It's super easy; the very same thing is done in this plugin: https://github.com/theforeman/foreman_discovery/pull/408/files#diff-45af6b2f1c078550eba223b206b16cb4

Member:

@evgeni yeah, we used to query each host individually and recently changed to not do that to speed things up. I'm not surprised this slows it down quite a bit.

If we are okay with facts being imported eventually (but not necessarily immediately), one option would be to use our 'event_queue' to import them. This would involve just throwing the host id on the queue, and then it would get processed in the background and would eventually be imported.
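As a hedged illustration of the event-queue idea (all names here are stand-ins, not Katello's actual event_queue API): the checkin path only enqueues host ids, and a background worker performs the slow fact import later.

```ruby
# Stand-in for an event queue: checkin enqueues, a worker imports later.
host_facts_queue = Queue.new
imported = []

# Background worker: drains the queue; each id would trigger a fact import.
worker = Thread.new do
  while (host_id = host_facts_queue.pop)
    imported << host_id  # placeholder for the real fact import
  end
end

# virt-who checkin path: cheap, just enqueue the ids and return.
[101, 102, 103].each { |id| host_facts_queue << id }

host_facts_queue << nil  # sentinel to stop the worker in this sketch
worker.join
imported                 # => [101, 102, 103]
```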

@evgeni evgeni force-pushed the issue25415 branch 2 times, most recently from e2ba5e7 to 3720367 Compare November 8, 2018 15:05
@evgeni (Member Author) commented Nov 8, 2018

[test katello]

@ares (Contributor) commented Nov 9, 2018

I tested with local libvirt and got a duplicate host. Otherwise it works great.

More details: I already had server ibm-x3655-03...com, on which I run foreman+katello+libvirt. I configured virt-who on the same machine and restarted the virt-who service. A new host `virt-who-ibm-x3655-03...com-1` appeared, and I see the facts correctly set. But given the history of duplicated hosts, I think we should try to map it correctly to existing hosts, as we always have the Foreman host available in the DB after installation.

@ares (Contributor) commented Nov 9, 2018

Sorry, I got confused; this only adds fact fetching (which works), the duplication issue was already there.

@evgeni evgeni force-pushed the issue25415 branch 2 times, most recently from 49cd832 to b2863a7 Compare November 16, 2018 11:42
@evgeni (Member Author) commented Nov 16, 2018

Not sure why that one test is failing now :/

@evgeni (Member Author) commented Nov 23, 2018

So, I played a bit more with this. And the slowdown does not come from this PR at all.

It's from bd456f150c00ab782b38f663af9cd6e3880c9a7e, where we started to correctly load data from Candlepin. Before that commit, HypervisorsUpdate.update_subscription_facet was mostly a NOOP, as @candlepin_attributes.key?(uuid) was always false.

```ruby
def update_subscription_facet(uuid, host)
  host.subscription_facet ||= host.build_subscription_facet(uuid: uuid)
  if @candlepin_attributes.key?(uuid)
    host.subscription_facet.candlepin_consumer.consumer_attributes = @candlepin_attributes[uuid]
    host.subscription_facet.import_database_attributes
    host.subscription_facet.save!
    host.subscription_facet.update_subscription_status(@candlepin_attributes[uuid].try(:[], :entitlementStatus))
  end
  host.save!
end
```

So, to revisit my numbers:

without facts

  • load_resources: 3-5 sec
  • update_subscription_facet: 3 min
  • update_facts: not run

with facts

  • load_resources: 20-25 sec
  • update_subscription_facet: 3 min
  • update_facts: 45 sec

With that said, I would say I'm mostly happy with the performance, as I knew it would have an impact.

Still need to figure out why that one test fails.

@evgeni (Member Author) commented Nov 26, 2018

The test failure is due to https://projects.theforeman.org/issues/25546 and thus unrelated.

@evgeni evgeni force-pushed the issue25415 branch 4 times, most recently from 8d9a8a0 to c7795ae Compare November 29, 2018 19:34
@evgeni (Member Author) commented Nov 29, 2018

[test katello]

@evgeni (Member Author) commented Nov 29, 2018

Hah. 💚 tests

@jlsherrill if you could have another look, that'd be awesome :)

@evgeni (Member Author) left a review comment:

inline comment

```ruby
def get_all_with_facts(uuids)
  consumers = []
  uuids.each do |uuid|
    consumers << get(uuid)
```
@evgeni (Member Author):

This probably can be better written as:

```ruby
uuids.collect { |uuid| get(uuid) }
```
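For illustration (with a lambda standing in for the Candlepin `get(uuid)` call, which the real code would make over HTTPS), the two forms build the same array:

```ruby
get = ->(uuid) { { 'uuid' => uuid } }  # stand-in for the Candlepin get(uuid)
uuids = %w[aaa bbb ccc]

# Accumulator style, as in the diff above.
consumers = []
uuids.each { |uuid| consumers << get.call(uuid) }

# Equivalent, more idiomatic collect/map style.
collected = uuids.collect { |uuid| get.call(uuid) }

consumers == collected  # => true
```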

```ruby
  @hosts.each do |uuid, host|
    update_subscription_facet(uuid, host)
  end
end
```
Member:

Out of curiosity, why did you move this to the run phase and put it in a transaction?

@evgeni (Member Author):

The FactImporter needs to run outside of a transaction, so I moved the code to the run phase. But I also wanted the rest of the code to run in a transaction, so I wrapped it in one. Should I add a comment with this explanation?

@jlsherrill (Member):

Testing this against master, I saw these timings with ~450 hypervisors:

initial load: 172s -> 266s
secondary runs: 80s -> 141s

This seems like a lot to me, although it seems to conflict with your findings? I can send you some user-provided JSON with these 450 hypervisors if you want to try it yourself.

If this performance decrease is accurate, I'd suggest one of a few things:

  1. We push to get https://bugzilla.redhat.com/1647724 fixed sooner (although that would only account for part of the performance decrease).
  2. We make this optional, so those with a large number of hypervisors can opt out and choose performance over this functionality.
  3. We delegate this to the event queue, meaning these hypervisor facts are imported asynchronously in the background.

@evgeni (Member Author) commented Dec 12, 2018

Just re-ran this on a fresh VM with 500 hypervisors (2 guests each):

```
without patch:
new:    Actions::Katello::Host::HypervisorsUpdate (success) [ 204.57s / 204.57s ]
update: Actions::Katello::Host::HypervisorsUpdate (success) [ 90.58s / 90.58s ]

with patch:
new:    Actions::Katello::Host::HypervisorsUpdate (success) [ 325.19s / 325.19s ]
update: Actions::Katello::Host::HypervisorsUpdate (success) [ 180.20s / 180.20s ]
```

So not far from what you've seen (and I'd say also not too different from the previous numbers, which were in the 60-90 sec increase ballpark, even though the "new data" increase is more like 120 sec here).

  • Fixing https://bugzilla.redhat.com/1647724 would "only" make the data collection faster, saving us roughly 20-30 seconds on each run (the collection is identical for new vs. updated data).
  • This task already runs asynchronously to the virt-who checkin, so we're not blocking anyone. Would switching to the event queue have any further benefits?
  • Making this configurable (though I'd prefer it on by default) sounds good.

@jlsherrill (Member):

@evgeni I think my bigger concern was around DB locking and increasing that time (although the transaction is only increased by however long it takes to fetch the facts, not store them).

I wonder if we should just move the entire thing outside of a transaction? Since it's written to be idempotent (and can be re-run at any time), I think that would be better? Curious about your thoughts.

@evgeni (Member Author) commented Dec 13, 2018

> @evgeni i think my bigger concern was around db locking and increasing that time. (although the transaction is only be increased by however long it takes to fetch the facts, but not store them).

Yeah, I can see this being a concern. We could break it up into two transactions: one for load_resources, one for update_subscription_facet?

> I wonder if we should just move the entire thing outside of a transaction? Since its written to be idempotent (and can be re-run at any time), i think that would be better? Curious your thoughts.

I didn't dare touch this aspect yet. One of the "hidden" "gems" of this task is that load_resources will actually also create new Host resources if they were missing, and I think this should not happen outside a transaction. Updating the SubscriptionFacet and the Facts is probably fine outside, but again, I only have a very high-level understanding of how this all works and what we might break.

@jlsherrill (Member):

> Yeah, I can see this being a concern. We could break that up into two transactions? One for load_resources, one for update_subscription_facet?

Yeah, I think that makes sense!

> I didn't dare to touch this aspect yet.

Fair enough, I may file an issue and try to tackle this soon in some manner.

@evgeni (Member Author) commented Dec 17, 2018

@jlsherrill updated with two transactions
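Roughly, the split discussed above could look like this sketch (the class, the `transaction` helper, and the logged step names are stand-ins mirroring the discussion, not the PR's exact code): resource loading and facet updates each get their own transaction, with the FactImporter running in between, outside any transaction.

```ruby
class HypervisorsUpdateSketch
  attr_reader :log

  def initialize
    @log = []
  end

  # Stand-in for ActiveRecord::Base.transaction so the sketch runs anywhere;
  # it just records begin/commit around the yielded work.
  def transaction
    @log << :begin
    yield
    @log << :commit
  end

  def run
    transaction { @log << :load_resources }  # may create Host records
    @log << :import_facts                    # FactImporter: outside any txn
    transaction { @log << :update_subscription_facets }
  end
end

sketch = HypervisorsUpdateSketch.new
sketch.run
sketch.log
# => [:begin, :load_resources, :commit, :import_facts,
#     :begin, :update_subscription_facets, :commit]
```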

@jlsherrill (Member):

[test katello]

@johnpmitsch (Contributor) commented Jan 8, 2019

@evgeni I was able to get facts importing from a virt-who hypervisor with this PR 👍

```ruby
irb(main):021:0> Host.find(8).facts
=> {"hypervisor::type"=>"QEMU", "cpu::cpu_socket(s)"=>"1", "hypervisor::version"=>"2010001", "_timestamp"=>"2019-01-08 21:03:53 +0000", "hypervisor"=>nil, "cpu"=>nil}
```

But when I try to get the facts from the API, it still returns null. I couldn't find out why this happens only for the virt-who hypervisors vs. other content hosts. Let me know if you see the same; it would be helpful for this bug.

```
[vagrant@coffee foreman{develop}]$ curl -g -k -u admin:changeme -H "Content-Type: application/json" https://coffee.jomitsch.example.com/api/v2/hosts/8 | jq '.facts'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  3162    0  3162    0     0  10903      0 --:--:-- --:--:-- --:--:-- 10941
null
```

Also, should hypervisor have a value for the facts hash?

@evgeni (Member Author) commented Jan 9, 2019

@johnpmitsch I've seen that; it seems to me that it returns fine on /hosts/:host_id/facts but not in /hosts/:host_id, and I have no idea why (the templates don't look like they would exclude it or anything).

And no, hypervisor does not have a value; it's a dummy fact to allow subfacts :)

@johnpmitsch (Contributor) left a comment:

This worked well for me and the code looks good. The issue from my other comment seems valid, but it is pre-existing and not caused by this PR, so I think we can fix it separately.

I haven't tested or evaluated any of the scaling performance; I'll leave it to others who have been in those conversations to give their approval.

@beav (Contributor) commented Jan 9, 2019

We discussed this PR during grooming. The performance impact is OK.

thanks @evgeni !

@beav beav merged commit 81530a0 into Katello:master Jan 9, 2019
@evgeni (Member Author) commented Jan 10, 2019

💚 thanks everyone for reviewing, helping, discussing! 🎉
