-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[DR-268] Resources can be fetched when locked #1470
[DR-268] Resources can be fetched when locked #1470
Conversation
This field will be included on resource retrieval, summary retrieval, and enumeration. At this point, we only intend on exposing flight IDs which have obtained exclusive locks on resources: flight IDs which have obtained shared locks on resources are still internal-only.
Expose it from a getter in the dataset object and use it for conversions to model objects used in API responses.
In any flight where we attempt a modification of the dataset, we try to obtain a lock on it: if an exclusive lock is present, the flight will fail and the modification will not be allowed to proceed. This changeset allows datasets to be retrieved in read-only operations even when exclusively locked. Removed now-unneeded code and updated tests. Refactored DatasetsApiControllerTest to JUnit 5 with test slicing. Drive-by: refactored out DatasetService method for checking that callers can read datasets, and instead call the IamService within DatasetsApiController. This code followed a pattern set in SnapshotService but was overly complex for datasets (a user may be able to read a snapshot because they have direct access via Sam and/or indirect access via a RAS passport... dataset access is only handled by Sam).
In any flight where we attempt a modification of the snapshot, we should obtain a lock on it: if an exclusive lock is present, the flight will fail and the modification will not be allowed to proceed. This changeset allows snapshots to be retrieved in read-only operations even when exclusively locked. Removed now-unneeded code and updated tests. Improvement: added new enum LockOperation to allow for more descriptive error messages to users when their jobs fail on trying to lock a resource (in an upcoming commit, I will refactor DatasetDao to use this). Drive-bys: addressed Intellij warnings where they were a small lift.
This allows for more descriptive error messages to users when their jobs fail on trying to lock a dataset. Drive-bys: addressed Intellij warnings where they were a small lift (ex. removing unused userReq param).
Otherwise, we return a resource summary object that could incorrectly indicate that it's exclusively locked. This was exposed by broken connected tests.
…toModel The previous commit exposed a latent bug: inconsistency in how a dataset source model was constructed, which caused several connected tests to fail due to mismatched values. In the process of investigating these test failures, I found that SnapshotFileLookupConnectedTest could have several tests removed as long as we increased SnapshotDao unit test coverage. I made the same test expansion for DatasetDao unit tests. A nice side effect of this test refactor is that we got to remove two configuration faults previously used for testing only.
Awhile back, snapshot creation would obtain an exclusive lock on its source dataset: this was overkill and caused problems for reasonable concurrent operations performed on the same dataset. It was removed in full. However, snapshot creation should obtain a shared (non-exclusive) lock on its source dataset to guard against the dataset being deleted out from under it, which would result in an orphaned snapshot (we don't allow dataset deletion to proceed if the dataset has snapshots).
@@ -437,54 +432,6 @@ public void testProjectDeleteAfterSnapshotDelete() throws Exception { | |||
googleResourceManagerService.getProject(googleProjectId).getLifecycleState()); | |||
} | |||
|
|||
@Ignore("Remove ignore after DR-1770 is addressed") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I originally went to refactor this test so that the faults weren't needed. But I saw that the referenced ticket was closed as "won't due" as the issue hadn't reoccurred in some time, so I deleted the test:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you expand on this further?
- Did we not see that error again because starting ignoring the test? I'm either confused now or I was confused when I wrote that comment in that ticket.
- I think the comment indicating that we should delete the "Ignore" tag, not delete the test altogether. I don't know that we yet have an alternate test covering this scenario.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for probing this!
The reason the test issue had not reoccurred is because we added the @Ignore
annotation. We have not been running this test for several years without any ill effects, so one could argue that it may not have been valuable to us to begin with.
What was helpful in our post-standup conversation was to think through the question: "what behavior are we trying to verify with this test?". That lead me to the initial PR and ticket.
The initial intent was to make sure that snapshot deletion takes out an exclusive lock on the snapshot, and that concurrent attempts at snapshot deletion would fail as a result.
A more complete, precise, stable, and speedy way to test this is through a chain of unit tests:
- Verify that
SnapshotDeleteFlight
addsLockSnapshotStep
to its manifest in the way we expect - Verify that
LockSnapshotStep
behaves as we'd expect for all possible input conditions (including if the snapshot is already locked, has already been deleted, etc). - Verify that the underlying DAO methods to lock / unlock snapshots behave as we'd expect
We already had unit tests to verify DAO behavior. I supplemented tests to include unit tests for flight construction and step behavior. For symmetry, I also wrote unit tests for UnlockSnapshotStep
, even though it's not called in this flight.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's great! Thanks for looking into this and writing tests that are more aligned with our current test strategy. I'm glad this allowed for removal of the fault insertion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to do this for DatasetConnectedTest.testOverlappingDeletes
too? (This doesn't have to happen in this PR though)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea, thanks for flagging that! I'll take a look.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm glad you flagged this :)
I was able to remove DatasetConnectedTest.testOverlappingDeletes
by adding a unit test for DatasetDeleteFlight
construction. We already had unit tests to verify DAO behavior and step behavior.
But the fun part came when moving to remove the faults referenced in that connected test. I found them referenced in DatasetLockConnectedTest
, in tests that should have been failing with my code changes… but I realized that this test suite was not annotated and thus its tests weren't running with our connected tests as expected. That's been the case for the last 2 years.
jade-data-repo/src/test/java/bio/terra/service/dataset/DatasetLockConnectedTest.java
Lines 53 to 54 in cd8e294
public class DatasetLockConnectedTest { |
So I went through those connected tests, verified the behavior with additional flight construction unit tests where necessary, and got to remove 9 faults and clean up everywhere they were called in production code!
src/test/java/bio/terra/service/snapshot/SnapshotFileLookupConnectedTest.java
Show resolved
Hide resolved
src/main/java/bio/terra/service/snapshot/flight/create/SnapshotCreateFlight.java
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm so excited for this change 🥳
Thanks for the deleted commit messages, they were very helpful, especially in such a large PR!
@@ -437,54 +432,6 @@ public void testProjectDeleteAfterSnapshotDelete() throws Exception { | |||
googleResourceManagerService.getProject(googleProjectId).getLifecycleState()); | |||
} | |||
|
|||
@Ignore("Remove ignore after DR-1770 is addressed") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you expand on this further?
- Did we not see that error again because starting ignoring the test? I'm either confused now or I was confused when I wrote that comment in that ticket.
- I think the comment indicating that we should delete the "Ignore" tag, not delete the test altogether. I don't know that we yet have an alternate test covering this scenario.
…apshot In a previous PR, I removed a flaky ignored connected test that ran two concurrent snapshot deletes. Talking with @snf2ye raised the million-dollar question: 'What exactly where we trying to test with that flaky test?' The answer: we wanted to make sure that snapshot deletion takes out an exclusive lock on the snapshot, and that concurrent attempts at snapshot deletion would fail as a result. A more complete, precise, stable, and speedy way to test this is through a chain of unit tests: - Verify that SnapshotDeleteFlight adds LockSnapshotStep to its manifest in the way we expect - Verify that LockSnapshotStep behaves as we'd expect for all possible input conditions (including if the snapshot is already locked, has already been deleted, etc). - Verify that the underlying DAO methods to lock / unlock snapshots behave as we'd expect We already had unit tests to verify DAO behavior. I supplemented tests to include unit tests for flight construction and step behavior. For symmetry, I also wrote unit tests for UnlockSnapshotStep, even though it's not called in this flight. Future similar opportunities in 'test engoodification' include similar tests for dataset locking / unlocking steps, as well as flight construction in other places to verify locking behavior.
…eAfterDatasetDelete This set-up is already performed before the test and triggered a spotbugs error
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great! I'm so excited for this change 🎉 I had a comment about updating one of the tests, but I would not consider it a blocker here.
src/main/java/bio/terra/service/dataset/flight/LockDatasetStep.java
Outdated
Show resolved
Hide resolved
List<DatasetProject> datasetProjects; | ||
try { | ||
datasetProjects = | ||
objectMapper.readValue(rs.getString("dataset_sources"), new TypeReference<>() {}); | ||
} catch (JsonProcessingException e) { | ||
throw new CorruptMetadataException("Invalid dataset sources for snapshot"); | ||
} | ||
return new SnapshotProject() | ||
.id(rs.getObject("id", UUID.class)) | ||
.name(rs.getString("name")) | ||
.profileId(rs.getObject("profile_id", UUID.class)) | ||
.dataProject(rs.getString("google_project_id")) | ||
.cloudPlatform(CloudPlatform.fromValue(rs.getString("cloud_platform"))) | ||
.sourceDatasetProjects(datasetProjects); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason to prefer this form over putting the return
inside the try
? Moving the return
up lets you get rid of the separate declaration for datasetProjects
.
try {
List<DatasetProject> datasetProjects =
objectMapper.readValue(rs.getString("dataset_sources"), new TypeReference<>() {});
return new SnapshotProject()
.id(rs.getObject("id", UUID.class))
.name(rs.getString("name"))
.profileId(rs.getObject("profile_id", UUID.class))
.dataProject(rs.getString("google_project_id"))
.cloudPlatform(CloudPlatform.fromValue(rs.getString("cloud_platform")))
.sourceDatasetProjects(datasetProjects);
} catch (JsonProcessingException e) {
throw new CorruptMetadataException("Invalid dataset sources for snapshot");
}
Also I would expect e
to be included in the rethrown exception, is there a reason to omit it?
src/test/java/bio/terra/app/controller/DatasetsApiControllerTest.java
Outdated
Show resolved
Hide resolved
…ests Added DatasetDeleteFlightTest -- we already had unit test coverage for step behavior, DAO behavior. When moving to remove the faults referenced in that connected test, I found that they were used in DatasetLockConnectedTest too. But that test suite was not annotated, so its tests hadn't been running for 2 years. I went through those tests and expanded flight construction unit tests to make up the coverage. I was then able to remove 9 faults, clean up everywhere they were referenced in production code, and remove DatasetLockConnectedTest suite in full.
Including: - Enum fields follow Java static final naming convention - Consolidate calls to DatasetDao: some places we previously fetched a dataset to then convert it to a model, but we can get this in a single call now - Use Optional.orElseThrow to concisely return optional content, or throw if not present
https://broadworkbench.atlassian.net/browse/DR-268
Prior behavior
When resources are exclusively locked, they cannot be retrieved or modified.
Background
One problem with the prior behavior is that it has been shown to cause user panic; if a user's resource was exclusively locked and they tried to retrieve it, they can reasonably think that their resource and its data could be lost.
What we really want to protect against is the possible modification of exclusively-locked resources, which we already get in resource modification flights and their attempts to obtain resource locks (either exclusive or shared). If the resource in question already has an exclusive lock, the flight will fail and the modification will not be performed.
This leads into a user request from Nate: a user had inadvertently deleted critical TDR resources, and it would have been nice to have an extra layer of protection against such deletion. Our existing locking mechanism has the potential to help support this use case, but we need to expose it first.
New behavior
When resources are exclusively locked...
Testing and Quality
My developer environment reflects these latest changes. I've established an exclusive lock on this dataset via manual DB modification, and can retrieve it but cannot modify it.
An attempt to generate a snapshot from it also displays new improved error messaging, giving the user more information about the cause of the failure (we previously would only say "failed to lock the
<resource>
"):In the process of taking on this work, I fixed several latent bugs, added necessary shared dataset locking to snapshot creation, and followed up on learnings from a previous test improvement PR to make further test improvements (i.e. refactored a controller test to JUnit 5 with test slicing, removed faults in production code previously required for testing, and tested functionality via unit tests rather than connected tests). More details can be found in PR comments.
Future Work