Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes #32334 - Subscriptions and Pools can cross organizations #9295

Merged
merged 6 commits into from
Apr 23, 2021

Conversation

jturel
Copy link
Member

@jturel jturel commented Apr 16, 2021

A customer faces a problem where they are unable to load the subscriptions page / api / hammer subscription list because an error is raised about a candlepin pool being missing (404 error). The error is raised because the subscriptions API shows information from Candlepin in the response as not all data is stored in Katello. Ordinarily, Pools are removed from Katello via actions like manifest refresh, import, subscription expiration, or events via the CandlepinEventListener and our own EventQueue. The root cause of why these removed subscriptions are hanging around is unknown. Can't repro the problem from their backup. Could be a race condition of some kind, or perhaps Candlepin isn't sending us an event where it should be. The logging in this area is poor so it's impossible to make a determination.

While looking through their db I found some concerning things:

  • there are 11k+ pool without an organization ID. These aren't causing a problem but why are they there?
  • there are Pools and Subscriptions associated to one another which span organizations - this is not good. For example. deleting an org could remove Pools from another.

The point is that we can't make a real assessment when there are data integrity problems. This PR aims to correct these data integrity problems, improve logging, cleanup invalid pools, and handle automatic cleanup of missing pools that crop up in the future with an eye toward identifying the root of the problem. It is achieved in this way:

More logging

Candlepin and Katello events are no longer logged at debug level, but rather info, and in a more consistent pattern and with more information to be able to track which entities action was taken on

=> candlepin event handled success=true subject=pool.created entity_id=4028fa7178d190b80178d60f5e0e006d duration=950.26
=> katello event handled success=true type=import_pool object_id=19903 expired=false rescheduled=false duration=5.63

New validations

Katello::Pool now validates that organization_id and subscription_id are not null and that the organization_id of the Pool and Subscription are identical so that they cannot span orgs. The db migration also deletes Pools with a nil organization_id outright.

New event queue event

Since the root cause is unknown it's feasible that there could be more Pools hanging around in the future which didn't get cleaned up. To prevent the API from raising errors, the ResourceNotFound error is intercepted and the pool is deleted asynchronously via the event queue, and that is logged. This prevents the API from raising errors. The side effect is that the 'missing' pool is still shown in the response, but it will be gone in the next request due to being cleaned up by the event.

Data setup: ::Katello::Pool.create(cp_id: 'BOGUS', organization: Organization.first, subscription: Organization.first.subscriptions.first)

Cleanup upgrade rake task

This rake task will be run at upgrade to remove orphaned pools: those which don't map to a valid org, subscription, or candlepin pool and corrects any org mismatch between the Pool < -> Subscription association. It will report the actions taken to clean up:

I, [2021-04-15T11:42:06.688617 #21648]  INFO -- : Corrected 0 invalid pools
I, [2021-04-15T11:42:06.688660 #21648]  INFO -- : Removed 0 orphaned pools

How to test this PR

  • Does the added logging give enough information to extract meaningful information from the handled events?
  • Do the added db constraints and validations hold up in terms of workflow? ie are there actually situations where we NEED subscription_id and organization_id to be nil?
  • Verify that the cleanup script works. You can see how the tests for the rake task sets up invalid pools. Use the Rails console to achieve the same, run the rake task, and see the output - your faked pools should get updated or deleted accordingly.
  • Regression test import of pools mainly thru manifest import / refresh with and without SCA for good measure, creating subscriptions and reimport_subscriptions rake task

@theforeman-bot
Copy link

Issues: #32334

def determine_subscription(product_id: nil, source_stack_id: nil, organization:)
if source_stack_id
self.stacking_subscription(organization, source_stack_id)
# isn't it an error if we have a sourceStackID but no stacking subscription?
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an open question

@@ -93,24 +123,13 @@ def provider?(organization)

# rubocop:disable Metrics/MethodLength,Metrics/AbcSize
# rubocop:disable Metrics/CyclomaticComplexity
# rubocop:disable Metrics/PerceivedComplexity
def import_data(index_hosts_and_activation_keys = false)
Copy link
Member Author

@jturel jturel Apr 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method was written in a very confusing way, especially with the lack of validations around org and subscription ids, non-nil constraints, and tests to demonstrate the desired behavior. It gives the impression that a Pool can change organizations or Subscriptions and I don't think that's ever valid.

@@ -51,29 +51,34 @@ def status
end

def run_event(event)
@logger.debug("event_queue_event: type=#{event.event_type}, object_id=#{event.object_id}")
Katello::Logging.time("katello event handled") do |data|
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No changes here except being wrapped in Logging.time

reindex_purpose_status
when /owner_content_access_mode\.modified/
message_handler.handle_content_access_mode_modified
Katello::Logging.time("candlepin event handled", logger: @logger) do |data|
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing changed here except a wrap in Logging.time

if pool_json["sourceStackId"]
subscription = Pool.stacking_subscription(pool_json['owner']['key'], pool_json["sourceStackId"])
else
subscription = ::Katello::Subscription.find_by(:cp_id => pool_json["productId"])
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the culprit responsible for assigning the given pool to a subscription in a different org. the productId is not unique to an org so the first one it finds is the lucky winner.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i remember soooo many bugs in the past with this same root cause!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

iirc pre-candlepin 2.0 products were shared across owners, but as part of 2.0 they no longer were, so this is probably from around before then

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes perfect sense!

@jturel
Copy link
Member Author

jturel commented Apr 19, 2021

@jlsherrill someone on my team will do the formal review but if you could take a look at the changes w.r.t the db constraints and the modified pool import to work with those I'd really appreciate - don't want to run into something you might have some history on or could otherwise foresee :)

fail ArgumentError, "pool id cannot contain ?" if pool_id["?"]
begin
pool_json = self.get(path(pool_id), self.default_headers).body
rescue RestClient::ResourceNotFound
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I pick this downstream I'll bring in the new CandlepinNotRunning error introduced in #9270 to make sure this error is raised accurately

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not do it here? Are upstream and downstream diverged?

Copy link
Member Author

@jturel jturel Apr 22, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The referenced PR was merged 6 days ago so it hasn't reached any downstream release

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I understand your question now! My point was not that we would rescue from CandlepinNotRunning here. This code already behaves correctly because that error + supporting code is in master. The result is that if candlepin is down, CandlepinNotRunning is raised, whereas before it would raise ResourceNotFound which is ambigious. Is candlepin down or is the pool gone? Having the CandlepinNotRunning error + supporting code is necessary to include downstream because in the event that candlepin were down we could end up deleting all the pools with the new event. We can trust ResourceNotFound to be raised and caught here only when the pool is actually gone

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense! Maybe there should be a test that fails when CandlepinNotRunning is not properly in place, then?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe but I don't think this PR is the place for it

organization: org
)

fail("No subscription found") unless subscription
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what would the result be in this exception being thrown? Would it be better to log it and continue? (i don't actually know the answer, just curious what the implications are)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Somewhat unnecessary but does allow to fail faster. The new validations catch the problem and import_data should raise an error when it calls save! if I remove this line.

But now I'm wondering now if first_or_create should be first_or_create! before going thru import_data at all. I think that's the way to go here given the new validations

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good!

end

imported_count = invalid_pools.count - orphaned_pools.count
logger.info("Corrected #{imported_count} invalid pools")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wonder if it makes sense to add info on the pools that got corrected instead of just the number?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the final summary - each individual pool that is modified is logged above (see logger statements)

@parthaa
Copy link
Contributor

parthaa commented Apr 22, 2021

codewise lgtm . Will play with it tomorrow.

Copy link
Member

@chris1984 chris1984 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Testing wise it works good on both the internal reproducer and my devel box. APJ after other comments are addressed.

Before Patch:

Katello::Resources::Candlepin::Pool: 404 Not Found {"displayMessage":"Pool with id 2c93015274789c210174bc9cb5bb6956 could not be found.","requestUuid":"7bc6cea5-9d69-4b04-8f1f-6059827579b2"} (GET /candlepin/pools/2c93015274789c210174bc9cb5bb6956)

With Patch:

db migration

[root@server ~]# foreman-rake db:migrate --trace
** Invoke db:migrate (first_time)
** Invoke db:load_config (first_time)
** Invoke environment (first_time)
** Execute environment
** Execute db:load_config
** Invoke plugin:refresh_migrations (first_time)
** Invoke environment
** Execute plugin:refresh_migrations
** Execute db:migrate
== 20210331180353 KatelloPoolOrganizationIdNotNullable: migrating =============
-- change_column(:katello_pools, :organization_id, :integer, {:null=>false})
   -> 0.0356s
-- change_column(:katello_pools, :subscription_id, :integer, {:null=>false})
   -> 0.0016s
-- add_index(:katello_pools, :organization_id)
   -> 0.0309s
== 20210331180353 KatelloPoolOrganizationIdNotNullable: migrated (15.9788s) ===

** Invoke db:_dump (first_time)
** Execute db:_dump

better logging and pools being deleted:

2021-04-21T14:30:37 [W|app|fef28609] Sending pool delete event for missing candlepin pool cp_id=2c9301527725a5dc017747d676c43dbe
2021-04-21T14:30:39 [I|app|47838418] Deleted pool 96135
2021-04-21T14:30:39 [I|app|47838418] katello event handled success=true type=delete_pool object_id=96135 expired=false rescheduled=false duration=125.16

Rake script

[root@server~]# foreman-rake katello:upgrades:'4.1':fix_invalid_pools
I, [2021-04-21T14:42:36.215332 #2222]  INFO -- : Pool id=95138 cp_id=2c93015276f936140176f98bcba60f77 has been reimported
I, [2021-04-21T14:42:36.990414 #2222]  INFO -- : Pool id=95142 cp_id=2c93015276f936140176fa05fb8b11f2 has been reimported
I, [2021-04-21T14:42:37.323883 #2222]  INFO -- : Pool id=95135 cp_id=2c93015276f936140176f97118f20c05 has been reimported
I, [2021-04-21T14:42:37.323942 #2222]  INFO -- : Corrected 3 invalid pools
I, [2021-04-21T14:42:37.323957 #2222]  INFO -- : Removed 0 orphaned pools
  • From production.log
2021-04-21T14:42:34 [I|app|] Imported Katello::Subscription cp_id=RC1257407 destroyed=false success=true duration=23.75
2021-04-21T14:42:34 [I|app|] Imported Katello::Subscription cp_id=RC1257407 destroyed=false success=true duration=20.36
2021-04-21T14:42:36 [I|aud|] Katello::Pool (95138) update event on host_ids 705313, 797400, 1163766, 1163770, 1201273, 1233083, 1253656, 1253730, 1258244, 1258245, 1258246, 1261849, 1261851, 1261860, 1263344, 1263348, 1263351, 1263357, 1263370, 1263380, 1263390, 1264661, 1264671, 1264672, 1264679, 1264706, 1265786, 1265937, 1265942, 1265946, 1265951, 1265959, 1265965, 1266260, 1266263, 1266403, 1266497, 1266499, 1266533, 1266838, 1266841, 1266844, 1266845, 1266852, 1267005, 1267017, 1267030, 1267036, 1267045, 1267049, 1267059, 1267063, 1267090, 1261849, 1261851, 1261860, 1267005, 1267063, 1263344, 1263348, 1263351, 1263357, 1263370, 1263380, 1263390, 1265786, 1265937, 1265946, 1265942, 1265951, 1265959, 1265965, 1267017, 1267030, 1266260, 1266263, 1267036, 1266403, 1267045, 1266497, 1266499, 1267049, 1266533, 1264661, 1264672, 1264671, 1267059, 1264679, 1267090, 1264706, 1253656, 1266838, 1266841, 1266844, 1266845, 1258246, 1266852, 705313, 1258244, 1253730, 1201273, 1163770, 1258245, 1163766, 797400, 1233083
2021-04-21T14:42:36 [I|aud|] Katello::Pool (95142) update event on host_ids 1279235, 1272201, 1279265, 1279316, 1279319, 1277038, 1282134, 1279936, 1279943, 1279983, 1272425, 1292617, 1280518, 1280588, 1277401, 1277402, 1277462, 1293096, 1293125, 1277614, 1277617, 1282626, 1282654, 1282657, 1281062, 1281064, 1281089, 1281101, 1282738, 1282756, 1293514, 1281152, 1272926, 1281210, 1281254, 1282760, 1282804, 1282807, 1282812, 1282820, 1282830, 1282837, 1282843, 1283021, 1283028, 1293807, 1298982, 1298983, 1277971, 1277972, 1293869, 1278596, 1283829, 1278899, 1284119, 1274083, 1279066, 1284133, 1284172, 1288274, 1284265, 1284363, 1284386, 1284388, 1284412, 1284542, 1284552, 1274554, 1284648, 1288886, 1288895, 1288898, 1288907, 1288911, 1288914, 1288919, 1288921, 1288953, 1289009, 1289028, 1289054, 1289208, 1289211, 1289254, 1289262, 1289263, 1289290, 1289291, 1289307, 1289316, 1289462, 1289498, 1289503, 1289511, 1289517, 1275279, 1289709, 1289793, 1289795, 1289799, 1289806, 1289841, 1289854, 1289866, 1289928, 1289930, 1290180, 1290182, 1290208, 1290229, 1290231, 1290248, 1290287, 1290334, 1290367, 1285583, 1275782, 1290406, 1290414, 973473, 1290436, 1290453, 1290457, 

Issue resolved:

[root@server ~]# hammer subscription list
------|----------------------------------|----------------------------------------------------------|------------------------------------------------------------|----------|---------|---------|---------------------|---------------------|----------|---------
ID    | UUID                             | NAME                                                     | TYPE
                                         | CONTRACT | ACCOUNT | SUPPORT | START DATE          | END DATE            | QUANTITY | CONSUMED
------|----------------------------------|----------------------------------------------------------|------------------------------------------------------------|----------|---------|---------|---------------------|---------------------|----------|---------
96151 | 2c9301527725a5dc017749137cb36a46 | Red Hat Satellite - Add-Ons for Providers                | Guests of 1763971-3115738-p                                | 11703931 | 719762  | Premium | 2021/01/01 05:00:00 | 2022/01/17 04:59:59 | 4000     | 0

@jturel
Copy link
Member Author

jturel commented Apr 22, 2021

The latest commit (fb6b876) suppresses an error raised by the Candlepin message handler when creating products (custom, manifest import) as a result of the subscription not being created yet. It would work before due to the lack of validations. Now we can just log what happened for debug purposes as it's likely not an error (see comment in the commit).

This creates a log pattern like this and shows that the first attempt fails and the second succeeds (notice the tight timings that active_record_retry does not help us with here because it doesn't sleep between retries)

15:59:34 rails.1   | 2021-04-22T15:59:34 [I|app|ae2a4b95] import candlepin pool cp_id=4028fa7178f64ce20178fa4ed2941090 success=false duration=1260.96
15:59:34 rails.1   | 2021-04-22T15:59:34 [W|app|ae2a4b95] Unable to import pool. It will likely be created by another process.
15:59:34 rails.1   | 2021-04-22T15:59:34 [I|kat|ae2a4b95] candlepin event handled success=true subject=pool.created entity_id=4028fa7178f64ce20178fa4ed2941090 duration=1262.05

Copy link
Member

@jeremylenz jeremylenz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great @jturel

I didn't test anything but left some code comments below.. take them with any grains of salt you find necessary 😉

Comment on lines +50 to +51
org = Organization.find_by(label: json['owner']['key'])
fail("Organization with label #{json['owner']['key']} wasn't found while importing Candlepin pool") unless org
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if their db has orgs with labels that don't match the name? And if we should do any testing around that?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for thinking at that level here. That should never be the case because org label cannot change - and the candlepin Owner should always match because of that. Checked their db - everything matches as I'd expect!

app/models/katello/glue/candlepin/pool.rb Show resolved Hide resolved
app/models/katello/pool.rb Outdated Show resolved Hide resolved
rescue ActiveRecord::RecordInvalid
# if we hit this block it's likely that the pool's subscription, product are being created
# as a result of manifest import/refresh or custom product creation
Rails.logger.warn("Unable to import pool. It will likely be created by another process.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would happen if it's not created by another process?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then that is an error and will be raised accordingly by product creation, manifest import, and so on. I can't think of a way that those actions would even raise that error because everything is happening synchronously - so it signifies a serious problem. At this point in the code there's nothing we can do but log it (and avoid unnecessary stack trace)

Comment on lines +7 to +10
invalid_pools = Katello::Pool.all.select(&:invalid?)

# Make sure Subscriptions are up to date for any org that has invalid pools
org_ids = invalid_pools.map(&:organization_id).uniq.compact
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SQL is always faster than Ruby :)

Suggested change
invalid_pools = Katello::Pool.all.select(&:invalid?)
# Make sure Subscriptions are up to date for any org that has invalid pools
org_ids = invalid_pools.map(&:organization_id).uniq.compact
invalid_pools = Katello::Pool.all.select(&:invalid?).distinct
# Make sure Subscriptions are up to date for any org that has invalid pools
org_ids = invalid_pools.map(&:organization_id).compact

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That won't work here since the .select returns an Array 'cause the validations happen in Ruby land and not the db

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😢

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hopefully you 'd not encounter too many invalid pools over time. Not a biggie imo

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Plus this rake task is a one-off as part of the upgrade process. No more invalid pools will be created after this point, and eventually this task will go away

Co-authored-by: Jeremy Lenz <jlenz@redhat.com>
@chris1984
Copy link
Member

The latest commit (fb6b876) suppresses an error raised by the Candlepin message handler when creating products (custom, manifest import) as a result of the subscription not being created yet. It would work before due to the lack of validations. Now we can just log what happened for debug purposes as it's likely not an error (see comment in the commit).

This creates a log pattern like this and shows that the first attempt fails and the second succeeds (notice the tight timings that active_record_retry does not help us with here because it doesn't sleep between retries)

15:59:34 rails.1   | 2021-04-22T15:59:34 [I|app|ae2a4b95] import candlepin pool cp_id=4028fa7178f64ce20178fa4ed2941090 success=false duration=1260.96
15:59:34 rails.1   | 2021-04-22T15:59:34 [W|app|ae2a4b95] Unable to import pool. It will likely be created by another process.
15:59:34 rails.1   | 2021-04-22T15:59:34 [I|kat|ae2a4b95] candlepin event handled success=true subject=pool.created entity_id=4028fa7178f64ce20178fa4ed2941090 duration=1262.05

Tested on my devel box with creating a custom product and get the message you stated:

10:03:36 rails.1   | 2021-04-23T10:03:36 [I|app|f253b875] import candlepin pool cp_id=4028dd0078ff077b0178ff0b073a0003 success=false duration=151.27
10:03:36 rails.1   | 2021-04-23T10:03:36 [W|app|f253b875] Unable to import pool. It will likely be created by another process.
10:03:36 rails.1   | 2021-04-23T10:03:36 [I|kat|f253b875] candlepin event handled success=true subject=pool.created entity_id=4028dd0078ff077b0178ff0b073a0003 duration=153.59
10:03:36 rails.1   | 2021-04-23T10:03:36 [D|kat|8a237a8a] Resource GET request: /candlepin/owners/Default_Organization/products/866264587212/?
10:03:36 rails.1   | 2021-04-23T10:03:36 [D|kat|8a237a8a] Headers: {"accept":"application/json","accept-language":"en","content-type":"application/json","X-Correlation-ID":"8a237a8a-8396-420a-ab8b-a692144b5d86","cp-user":"foreman_admin"}
10:03:36 rails.1   | 2021-04-23T10:03:36 [D|kat|8a237a8a] Body: {}
10:03:36 rails.1   | /home/vagrant/katello/app/lib/katello/http_resource.rb:85: warning: URI.escape is obsolete
10:03:36 rails.1   | 2021-04-23T10:03:36 [D|kat|8a237a8a] Candlepin request 8370d781-48d1-463d-bf51-b95be2a45994 returned with code 200
10:03:36 rails.1   | 2021-04-23T10:03:36 [D|kat|8a237a8a] Processing response: 200

Copy link
Contributor

@parthaa parthaa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK worked well for my tests.

@jturel jturel merged commit df86940 into Katello:master Apr 23, 2021
@jturel jturel deleted the pool_subscriptions branch April 23, 2021 21:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants