-
Notifications
You must be signed in to change notification settings - Fork 289
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixes #32334 - Subscriptions and Pools can cross organizations #9295
Conversation
Issues: #32334 |
0eff22f
to
e27179a
Compare
def determine_subscription(product_id: nil, source_stack_id: nil, organization:) | ||
if source_stack_id | ||
self.stacking_subscription(organization, source_stack_id) | ||
# isn't it an error if we have a sourceStackID but no stacking subscription? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an open question
@@ -93,24 +123,13 @@ def provider?(organization) | |||
|
|||
# rubocop:disable Metrics/MethodLength,Metrics/AbcSize | |||
# rubocop:disable Metrics/CyclomaticComplexity | |||
# rubocop:disable Metrics/PerceivedComplexity | |||
def import_data(index_hosts_and_activation_keys = false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method was written in a very confusing way, especially with the lack of validations around org and subscription ids, non-nil constraints, and tests to demonstrate the desired behavior. It gives the impression that a Pool can change organizations or Subscriptions and I don't think that's ever valid.
@@ -51,29 +51,34 @@ def status | |||
end | |||
|
|||
def run_event(event) | |||
@logger.debug("event_queue_event: type=#{event.event_type}, object_id=#{event.object_id}") | |||
Katello::Logging.time("katello event handled") do |data| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No changes here except being wrapped in Logging.time
reindex_purpose_status | ||
when /owner_content_access_mode\.modified/ | ||
message_handler.handle_content_access_mode_modified | ||
Katello::Logging.time("candlepin event handled", logger: @logger) do |data| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nothing changed here except a wrap in Logging.time
if pool_json["sourceStackId"] | ||
subscription = Pool.stacking_subscription(pool_json['owner']['key'], pool_json["sourceStackId"]) | ||
else | ||
subscription = ::Katello::Subscription.find_by(:cp_id => pool_json["productId"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the culprit responsible for assigning the given pool to a subscription in a different org. the productId is not unique to an org so the first one it finds is the lucky winner.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i remember soooo many bugs in the past with this same root cause!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
iirc pre-candlepin 2.0 products were shared across owners, but as part of 2.0 they no longer were, so this is probably from around before then
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes perfect sense!
fdc3654
to
9ecc8cd
Compare
@jlsherrill someone on my team will do the formal review but if you could take a look at the changes w.r.t the db constraints and the modified pool import to work with those I'd really appreciate - don't want to run into something you might have some history on or could otherwise foresee :) |
fail ArgumentError, "pool id cannot contain ?" if pool_id["?"] | ||
begin | ||
pool_json = self.get(path(pool_id), self.default_headers).body | ||
rescue RestClient::ResourceNotFound |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I pick this downstream I'll bring in the new CandlepinNotRunning error introduced in #9270 to make sure this error is raised accurately
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not do it here? Are upstream and downstream diverged?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The referenced PR was merged 6 days ago so it hasn't reached any downstream release
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I understand your question now! My point was not that we would rescue from CandlepinNotRunning
here. This code already behaves correctly because that error + supporting code is in master. The result is that if candlepin is down, CandlepinNotRunning is raised, whereas before it would raise ResourceNotFound
which is ambigious. Is candlepin down or is the pool gone? Having the CandlepinNotRunning
error + supporting code is necessary to include downstream because in the event that candlepin were down we could end up deleting all the pools with the new event. We can trust ResourceNotFound to be raised and caught here only when the pool is actually gone
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense! Maybe there should be a test that fails when CandlepinNotRunning
is not properly in place, then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe but I don't think this PR is the place for it
organization: org | ||
) | ||
|
||
fail("No subscription found") unless subscription |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what would the result be in this exception being thrown? Would it be better to log it and continue? (i don't actually know the answer, just curious what the implications are)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Somewhat unnecessary but does allow to fail faster. The new validations catch the problem and import_data should raise an error when it calls save!
if I remove this line.
But now I'm wondering now if first_or_create
should be first_or_create!
before going thru import_data
at all. I think that's the way to go here given the new validations
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sounds good!
end | ||
|
||
imported_count = invalid_pools.count - orphaned_pools.count | ||
logger.info("Corrected #{imported_count} invalid pools") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wonder if it makes sense to add info on the pools that got corrected instead of just the number?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the final summary - each individual pool that is modified is logged above (see logger statements)
codewise lgtm . Will play with it tomorrow. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Testing wise it works good on both the internal reproducer and my devel box. APJ after other comments are addressed.
Before Patch:
Katello::Resources::Candlepin::Pool: 404 Not Found {"displayMessage":"Pool with id 2c93015274789c210174bc9cb5bb6956 could not be found.","requestUuid":"7bc6cea5-9d69-4b04-8f1f-6059827579b2"} (GET /candlepin/pools/2c93015274789c210174bc9cb5bb6956)
With Patch:
db migration
[root@server ~]# foreman-rake db:migrate --trace
** Invoke db:migrate (first_time)
** Invoke db:load_config (first_time)
** Invoke environment (first_time)
** Execute environment
** Execute db:load_config
** Invoke plugin:refresh_migrations (first_time)
** Invoke environment
** Execute plugin:refresh_migrations
** Execute db:migrate
== 20210331180353 KatelloPoolOrganizationIdNotNullable: migrating =============
-- change_column(:katello_pools, :organization_id, :integer, {:null=>false})
-> 0.0356s
-- change_column(:katello_pools, :subscription_id, :integer, {:null=>false})
-> 0.0016s
-- add_index(:katello_pools, :organization_id)
-> 0.0309s
== 20210331180353 KatelloPoolOrganizationIdNotNullable: migrated (15.9788s) ===
** Invoke db:_dump (first_time)
** Execute db:_dump
better logging and pools being deleted:
2021-04-21T14:30:37 [W|app|fef28609] Sending pool delete event for missing candlepin pool cp_id=2c9301527725a5dc017747d676c43dbe
2021-04-21T14:30:39 [I|app|47838418] Deleted pool 96135
2021-04-21T14:30:39 [I|app|47838418] katello event handled success=true type=delete_pool object_id=96135 expired=false rescheduled=false duration=125.16
Rake script
[root@server~]# foreman-rake katello:upgrades:'4.1':fix_invalid_pools
I, [2021-04-21T14:42:36.215332 #2222] INFO -- : Pool id=95138 cp_id=2c93015276f936140176f98bcba60f77 has been reimported
I, [2021-04-21T14:42:36.990414 #2222] INFO -- : Pool id=95142 cp_id=2c93015276f936140176fa05fb8b11f2 has been reimported
I, [2021-04-21T14:42:37.323883 #2222] INFO -- : Pool id=95135 cp_id=2c93015276f936140176f97118f20c05 has been reimported
I, [2021-04-21T14:42:37.323942 #2222] INFO -- : Corrected 3 invalid pools
I, [2021-04-21T14:42:37.323957 #2222] INFO -- : Removed 0 orphaned pools
- From production.log
2021-04-21T14:42:34 [I|app|] Imported Katello::Subscription cp_id=RC1257407 destroyed=false success=true duration=23.75
2021-04-21T14:42:34 [I|app|] Imported Katello::Subscription cp_id=RC1257407 destroyed=false success=true duration=20.36
2021-04-21T14:42:36 [I|aud|] Katello::Pool (95138) update event on host_ids 705313, 797400, 1163766, 1163770, 1201273, 1233083, 1253656, 1253730, 1258244, 1258245, 1258246, 1261849, 1261851, 1261860, 1263344, 1263348, 1263351, 1263357, 1263370, 1263380, 1263390, 1264661, 1264671, 1264672, 1264679, 1264706, 1265786, 1265937, 1265942, 1265946, 1265951, 1265959, 1265965, 1266260, 1266263, 1266403, 1266497, 1266499, 1266533, 1266838, 1266841, 1266844, 1266845, 1266852, 1267005, 1267017, 1267030, 1267036, 1267045, 1267049, 1267059, 1267063, 1267090, 1261849, 1261851, 1261860, 1267005, 1267063, 1263344, 1263348, 1263351, 1263357, 1263370, 1263380, 1263390, 1265786, 1265937, 1265946, 1265942, 1265951, 1265959, 1265965, 1267017, 1267030, 1266260, 1266263, 1267036, 1266403, 1267045, 1266497, 1266499, 1267049, 1266533, 1264661, 1264672, 1264671, 1267059, 1264679, 1267090, 1264706, 1253656, 1266838, 1266841, 1266844, 1266845, 1258246, 1266852, 705313, 1258244, 1253730, 1201273, 1163770, 1258245, 1163766, 797400, 1233083
2021-04-21T14:42:36 [I|aud|] Katello::Pool (95142) update event on host_ids 1279235, 1272201, 1279265, 1279316, 1279319, 1277038, 1282134, 1279936, 1279943, 1279983, 1272425, 1292617, 1280518, 1280588, 1277401, 1277402, 1277462, 1293096, 1293125, 1277614, 1277617, 1282626, 1282654, 1282657, 1281062, 1281064, 1281089, 1281101, 1282738, 1282756, 1293514, 1281152, 1272926, 1281210, 1281254, 1282760, 1282804, 1282807, 1282812, 1282820, 1282830, 1282837, 1282843, 1283021, 1283028, 1293807, 1298982, 1298983, 1277971, 1277972, 1293869, 1278596, 1283829, 1278899, 1284119, 1274083, 1279066, 1284133, 1284172, 1288274, 1284265, 1284363, 1284386, 1284388, 1284412, 1284542, 1284552, 1274554, 1284648, 1288886, 1288895, 1288898, 1288907, 1288911, 1288914, 1288919, 1288921, 1288953, 1289009, 1289028, 1289054, 1289208, 1289211, 1289254, 1289262, 1289263, 1289290, 1289291, 1289307, 1289316, 1289462, 1289498, 1289503, 1289511, 1289517, 1275279, 1289709, 1289793, 1289795, 1289799, 1289806, 1289841, 1289854, 1289866, 1289928, 1289930, 1290180, 1290182, 1290208, 1290229, 1290231, 1290248, 1290287, 1290334, 1290367, 1285583, 1275782, 1290406, 1290414, 973473, 1290436, 1290453, 1290457,
Issue resolved:
[root@server ~]# hammer subscription list
------|----------------------------------|----------------------------------------------------------|------------------------------------------------------------|----------|---------|---------|---------------------|---------------------|----------|---------
ID | UUID | NAME | TYPE
| CONTRACT | ACCOUNT | SUPPORT | START DATE | END DATE | QUANTITY | CONSUMED
------|----------------------------------|----------------------------------------------------------|------------------------------------------------------------|----------|---------|---------|---------------------|---------------------|----------|---------
96151 | 2c9301527725a5dc017749137cb36a46 | Red Hat Satellite - Add-Ons for Providers | Guests of 1763971-3115738-p | 11703931 | 719762 | Premium | 2021/01/01 05:00:00 | 2022/01/17 04:59:59 | 4000 | 0
The latest commit (fb6b876) suppresses an error raised by the Candlepin message handler when creating products (custom, manifest import) as a result of the subscription not being created yet. It would work before due to the lack of validations. Now we can just log what happened for debug purposes as it's likely not an error (see comment in the commit). This creates a log pattern like this and shows that the first attempt fails and the second succeeds (notice the tight timings that
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great @jturel
I didn't test anything but left some code comments below.. take them with any grains of salt you find necessary 😉
org = Organization.find_by(label: json['owner']['key']) | ||
fail("Organization with label #{json['owner']['key']} wasn't found while importing Candlepin pool") unless org |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if their db has orgs with labels that don't match the name? And if we should do any testing around that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for thinking at that level here. That should never be the case because org label cannot change - and the candlepin Owner should always match because of that. Checked their db - everything matches as I'd expect!
rescue ActiveRecord::RecordInvalid | ||
# if we hit this block it's likely that the pool's subscription, product are being created | ||
# as a result of manifest import/refresh or custom product creation | ||
Rails.logger.warn("Unable to import pool. It will likely be created by another process.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What would happen if it's not created by another process?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then that is an error and will be raised accordingly by product creation, manifest import, and so on. I can't think of a way that those actions would even raise that error because everything is happening synchronously - so it signifies a serious problem. At this point in the code there's nothing we can do but log it (and avoid unnecessary stack trace)
invalid_pools = Katello::Pool.all.select(&:invalid?) | ||
|
||
# Make sure Subscriptions are up to date for any org that has invalid pools | ||
org_ids = invalid_pools.map(&:organization_id).uniq.compact |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SQL is always faster than Ruby :)
invalid_pools = Katello::Pool.all.select(&:invalid?) | |
# Make sure Subscriptions are up to date for any org that has invalid pools | |
org_ids = invalid_pools.map(&:organization_id).uniq.compact | |
invalid_pools = Katello::Pool.all.select(&:invalid?).distinct | |
# Make sure Subscriptions are up to date for any org that has invalid pools | |
org_ids = invalid_pools.map(&:organization_id).compact |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That won't work here since the .select returns an Array 'cause the validations happen in Ruby land and not the db
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😢
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hopefully you 'd not encounter too many invalid pools over time. Not a biggie imo
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Plus this rake task is a one-off as part of the upgrade process. No more invalid pools will be created after this point, and eventually this task will go away
Co-authored-by: Jeremy Lenz <jlenz@redhat.com>
Tested on my devel box with creating a custom product and get the message you stated: 10:03:36 rails.1 | 2021-04-23T10:03:36 [I|app|f253b875] import candlepin pool cp_id=4028dd0078ff077b0178ff0b073a0003 success=false duration=151.27
10:03:36 rails.1 | 2021-04-23T10:03:36 [W|app|f253b875] Unable to import pool. It will likely be created by another process.
10:03:36 rails.1 | 2021-04-23T10:03:36 [I|kat|f253b875] candlepin event handled success=true subject=pool.created entity_id=4028dd0078ff077b0178ff0b073a0003 duration=153.59
10:03:36 rails.1 | 2021-04-23T10:03:36 [D|kat|8a237a8a] Resource GET request: /candlepin/owners/Default_Organization/products/866264587212/?
10:03:36 rails.1 | 2021-04-23T10:03:36 [D|kat|8a237a8a] Headers: {"accept":"application/json","accept-language":"en","content-type":"application/json","X-Correlation-ID":"8a237a8a-8396-420a-ab8b-a692144b5d86","cp-user":"foreman_admin"}
10:03:36 rails.1 | 2021-04-23T10:03:36 [D|kat|8a237a8a] Body: {}
10:03:36 rails.1 | /home/vagrant/katello/app/lib/katello/http_resource.rb:85: warning: URI.escape is obsolete
10:03:36 rails.1 | 2021-04-23T10:03:36 [D|kat|8a237a8a] Candlepin request 8370d781-48d1-463d-bf51-b95be2a45994 returned with code 200
10:03:36 rails.1 | 2021-04-23T10:03:36 [D|kat|8a237a8a] Processing response: 200 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ACK worked well for my tests.
A customer faces a problem where they are unable to load the subscriptions page / api / hammer subscription list because an error is raised about a candlepin pool being missing (404 error). The error is raised because the subscriptions API shows information from Candlepin in the response as not all data is stored in Katello. Ordinarily, Pools are removed from Katello via actions like manifest refresh, import, subscription expiration, or events via the CandlepinEventListener and our own EventQueue. The root cause of why these removed subscriptions are hanging around is unknown. Can't repro the problem from their backup. Could be a race condition of some kind, or perhaps Candlepin isn't sending us an event where it should be. The logging in this area is poor so it's impossible to make a determination.
While looking through their db I found some concerning things:
The point is that we can't make a real assessment when there are data integrity problems. This PR aims to correct these data integrity problems, improve logging, cleanup invalid pools, and handle automatic cleanup of missing pools that crop up in the future with an eye toward identifying the root of the problem. It is achieved in this way:
More logging
Candlepin and Katello events are no longer logged at debug level, but rather info, and in a more consistent pattern and with more information to be able to track which entities action was taken on
New validations
Katello::Pool now validates that organization_id and subscription_id are not null and that the organization_id of the Pool and Subscription are identical so that they cannot span orgs. The db migration also deletes Pools with a nil organization_id outright.
New event queue event
Since the root cause is unknown it's feasible that there could be more Pools hanging around in the future which didn't get cleaned up. To prevent the API from raising errors, the ResourceNotFound error is intercepted and the pool is deleted asynchronously via the event queue, and that is logged. This prevents the API from raising errors. The side effect is that the 'missing' pool is still shown in the response, but it will be gone in the next request due to being cleaned up by the event.
Data setup:
::Katello::Pool.create(cp_id: 'BOGUS', organization: Organization.first, subscription: Organization.first.subscriptions.first)
Cleanup upgrade rake task
This rake task will be run at upgrade to remove orphaned pools: those which don't map to a valid org, subscription, or candlepin pool and corrects any org mismatch between the Pool < -> Subscription association. It will report the actions taken to clean up:
How to test this PR