Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DR-268] Resources can be fetched when locked #1470

Merged
merged 19 commits into from
Jun 26, 2023

Conversation

okotsopoulos
Copy link
Contributor

@okotsopoulos okotsopoulos commented Jun 6, 2023

https://broadworkbench.atlassian.net/browse/DR-268

Prior behavior

When resources are exclusively locked, they cannot be retrieved or modified.

Background

One problem with the prior behavior is that it has been shown to cause user panic; if a user's resource was exclusively locked and they tried to retrieve it, they can reasonably think that their resource and its data could be lost.

What we really want to protect against is the possible modification of exclusively-locked resources, which we already get in resource modification flights and their attempts to obtain resource locks (either exclusive or shared). If the resource in question already has an exclusive lock, the flight will fail and the modification will not be performed.

This leads into a user request from Nate: a user had inadvertently deleted critical TDR resources, and it would have been nice to have an extra layer of protection against such deletion. Our existing locking mechanism has the potential to help support this use case, but we need to expose it first.

New behavior

When resources are exclusively locked...

  • they can be retrieved but cannot be modified.
  • the locking flight ID is returned in resource retrieval, summary retrieval, and enumeration.
    • Note: I only exposed the exclusive lock, but datasets also can have shared locks. The presence of a shared lock will prevent an exclusive lock from being taken out on the resource. In the future, we might like to expose this information in our API responses as well, in case a shared lock is stuck and needs to be cleared out.

Testing and Quality

My developer environment reflects these latest changes. I've established an exclusive lock on this dataset via manual DB modification, and can retrieve it but cannot modify it.

curl -X 'GET' \
  'https://jade-ok.datarepo-dev.broadinstitute.org/api/repository/v1/datasets/786725ff-5822-4e51-9e71-50793c561d32?include=NONE' \
  -H 'accept: application/json' \
  -H 'Authorization: Bearer <token>'

{
  "id": "786725ff-5822-4e51-9e71-50793c561d32",
  "name": "person_sample",
  "description": "Dataset with person, sample tables",
  …
  "resourceLocks": {
    "exclusive": "DR-268-exclusive-lock"
  }
}

An attempt to generate a snapshot from it also displays new improved error messaging, giving the user more information about the cause of the failure (we previously would only say "failed to lock the <resource>"):

Screenshot 2023-06-07 at 7 41 10 PM

In the process of taking on this work, I fixed several latent bugs, added necessary shared dataset locking to snapshot creation, and followed up on learnings from a previous test improvement PR to make further test improvements (i.e. refactored a controller test to JUnit 5 with test slicing, removed faults in production code previously required for testing, and tested functionality via unit tests rather than connected tests). More details can be found in PR comments.

Future Work

  • Handling in TDR UI for exclusively-locked resources (to be spiked)
  • Expose shared locks in resources fetched
  • Allow TDR Admins (and perhaps eventually data stewards) to remove locks from API
  • Allow data stewards to establish locks on resources to prevent their modification (to be spiked)

This field will be included on resource retrieval, summary retrieval, and enumeration.
At this point, we only intend on exposing flight IDs which have obtained exclusive locks on resources:
flight IDs which have obtained shared locks on resources are still internal-only.
Expose it from a getter in the dataset object and use it for conversions to model objects used in API responses.
In any flight where we attempt a modification of the dataset, we try to obtain a lock on it: if an exclusive lock is present, the flight will fail and the modification will not be allowed to proceed.
This changeset allows datasets to be retrieved in read-only operations even when exclusively locked.
Removed now-unneeded code and updated tests.
Refactored DatasetsApiControllerTest to JUnit 5 with test slicing.

Drive-by: refactored out DatasetService method for checking that callers can read datasets, and instead call the IamService within DatasetsApiController.  This code followed a pattern set in SnapshotService but was overly complex for datasets (a user may be able to read a snapshot because they have direct access via Sam and/or indirect access via a RAS passport... dataset access is only handled by Sam).
In any flight where we attempt a modification of the snapshot, we should obtain a lock on it: if an exclusive lock is present, the flight will fail and the modification will not be allowed to proceed.
This changeset allows snapshots to be retrieved in read-only operations even when exclusively locked.
Removed now-unneeded code and updated tests.

Improvement: added new enum LockOperation to allow for more descriptive error messages to users when their jobs fail on trying to lock a resource (in an upcoming commit, I will refactor DatasetDao to use this).
Drive-bys: addressed Intellij warnings where they were a small lift.
This allows for more descriptive error messages to users when their jobs fail on trying to lock a dataset.
Drive-bys: addressed Intellij warnings where they were a small lift (ex. removing unused userReq param).
Otherwise, we return a resource summary object that could incorrectly indicate that it's exclusively locked.
This was exposed by broken connected tests.
…toModel

The previous commit exposed a latent bug: inconsistency in how a dataset source model was constructed, which caused several connected tests to fail due to mismatched values.
In the process of investigating these test failures, I found that SnapshotFileLookupConnectedTest could have several tests removed as long as we increased SnapshotDao unit test coverage.
I made the same test expansion for DatasetDao unit tests.
A nice side effect of this test refactor is that we got to remove two configuration faults previously used for testing only.
Awhile back, snapshot creation would obtain an exclusive lock on its source dataset:
this was overkill and caused problems for reasonable concurrent operations performed on the same dataset.
It was removed in full.
However, snapshot creation should obtain a shared (non-exclusive) lock on its source dataset to guard against the dataset being deleted out from under it, which would result in an orphaned snapshot (we don't allow dataset deletion to proceed if the dataset has snapshots).
@@ -437,54 +432,6 @@ public void testProjectDeleteAfterSnapshotDelete() throws Exception {
googleResourceManagerService.getProject(googleProjectId).getLifecycleState());
}

@Ignore("Remove ignore after DR-1770 is addressed")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I originally went to refactor this test so that the faults weren't needed. But I saw that the referenced ticket was closed as "won't due" as the issue hadn't reoccurred in some time, so I deleted the test:

https://broadworkbench.atlassian.net/browse/DR-1770

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you expand on this further?

  • Did we not see that error again because starting ignoring the test? I'm either confused now or I was confused when I wrote that comment in that ticket.
  • I think the comment indicating that we should delete the "Ignore" tag, not delete the test altogether. I don't know that we yet have an alternate test covering this scenario.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for probing this!

The reason the test issue had not reoccurred is because we added the @Ignore annotation. We have not been running this test for several years without any ill effects, so one could argue that it may not have been valuable to us to begin with.

What was helpful in our post-standup conversation was to think through the question: "what behavior are we trying to verify with this test?". That lead me to the initial PR and ticket.

The initial intent was to make sure that snapshot deletion takes out an exclusive lock on the snapshot, and that concurrent attempts at snapshot deletion would fail as a result.

A more complete, precise, stable, and speedy way to test this is through a chain of unit tests:

  • Verify that SnapshotDeleteFlight adds LockSnapshotStep to its manifest in the way we expect
  • Verify that LockSnapshotStep behaves as we'd expect for all possible input conditions (including if the snapshot is already locked, has already been deleted, etc).
  • Verify that the underlying DAO methods to lock / unlock snapshots behave as we'd expect

We already had unit tests to verify DAO behavior. I supplemented tests to include unit tests for flight construction and step behavior. For symmetry, I also wrote unit tests for UnlockSnapshotStep, even though it's not called in this flight.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's great! Thanks for looking into this and writing tests that are more aligned with our current test strategy. I'm glad this allowed for removal of the fault insertion.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to do this for DatasetConnectedTest.testOverlappingDeletes too? (This doesn't have to happen in this PR though)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, thanks for flagging that! I'll take a look.

Copy link
Contributor Author

@okotsopoulos okotsopoulos Jun 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm glad you flagged this :)

I was able to remove DatasetConnectedTest.testOverlappingDeletes by adding a unit test for DatasetDeleteFlight construction. We already had unit tests to verify DAO behavior and step behavior.

But the fun part came when moving to remove the faults referenced in that connected test. I found them referenced in DatasetLockConnectedTest, in tests that should have been failing with my code changes… but I realized that this test suite was not annotated and thus its tests weren't running with our connected tests as expected. That's been the case for the last 2 years.

public class DatasetLockConnectedTest {

So I went through those connected tests, verified the behavior with additional flight construction unit tests where necessary, and got to remove 9 faults and clean up everywhere they were called in production code!

@okotsopoulos okotsopoulos marked this pull request as ready for review June 8, 2023 15:30
Copy link
Contributor

@snf2ye snf2ye left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm so excited for this change 🥳
Thanks for the deleted commit messages, they were very helpful, especially in such a large PR!

src/main/resources/api/data-repository-openapi.yaml Outdated Show resolved Hide resolved
@@ -437,54 +432,6 @@ public void testProjectDeleteAfterSnapshotDelete() throws Exception {
googleResourceManagerService.getProject(googleProjectId).getLifecycleState());
}

@Ignore("Remove ignore after DR-1770 is addressed")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you expand on this further?

  • Did we not see that error again because starting ignoring the test? I'm either confused now or I was confused when I wrote that comment in that ticket.
  • I think the comment indicating that we should delete the "Ignore" tag, not delete the test altogether. I don't know that we yet have an alternate test covering this scenario.

…apshot

In a previous PR, I removed a flaky ignored connected test that ran two concurrent snapshot deletes.
Talking with @snf2ye raised the million-dollar question: 'What exactly where we trying to test with that flaky test?'

The answer: we wanted to make sure that snapshot deletion takes out an exclusive lock on the snapshot, and that concurrent attempts at snapshot deletion would fail as a result.
A more complete, precise, stable, and speedy way to test this is through a chain of unit tests:
- Verify that SnapshotDeleteFlight adds LockSnapshotStep to its manifest in the way we expect
- Verify that LockSnapshotStep behaves as we'd expect for all possible input conditions (including if the snapshot is already locked, has already been deleted, etc).
- Verify that the underlying DAO methods to lock / unlock snapshots behave as we'd expect

We already had unit tests to verify DAO behavior.  I supplemented tests to include unit tests for flight construction and step behavior.  For symmetry, I also wrote unit tests for UnlockSnapshotStep, even though it's not called in this flight.

Future similar opportunities in 'test engoodification' include similar tests for dataset locking / unlocking steps, as well as flight construction in other places to verify locking behavior.
Copy link
Contributor

@samanehsan samanehsan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great! I'm so excited for this change 🎉 I had a comment about updating one of the tests, but I would not consider it a blocker here.

Comment on lines 440 to 453
List<DatasetProject> datasetProjects;
try {
datasetProjects =
objectMapper.readValue(rs.getString("dataset_sources"), new TypeReference<>() {});
} catch (JsonProcessingException e) {
throw new CorruptMetadataException("Invalid dataset sources for snapshot");
}
return new SnapshotProject()
.id(rs.getObject("id", UUID.class))
.name(rs.getString("name"))
.profileId(rs.getObject("profile_id", UUID.class))
.dataProject(rs.getString("google_project_id"))
.cloudPlatform(CloudPlatform.fromValue(rs.getString("cloud_platform")))
.sourceDatasetProjects(datasetProjects);
Copy link
Member

@pshapiro4broad pshapiro4broad Jun 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason to prefer this form over putting the return inside the try? Moving the return up lets you get rid of the separate declaration for datasetProjects.

            try {
              List<DatasetProject> datasetProjects =
                  objectMapper.readValue(rs.getString("dataset_sources"), new TypeReference<>() {});
              return new SnapshotProject()
                  .id(rs.getObject("id", UUID.class))
                  .name(rs.getString("name"))
                  .profileId(rs.getObject("profile_id", UUID.class))
                  .dataProject(rs.getString("google_project_id"))
                  .cloudPlatform(CloudPlatform.fromValue(rs.getString("cloud_platform")))
                  .sourceDatasetProjects(datasetProjects);
            } catch (JsonProcessingException e) {
              throw new CorruptMetadataException("Invalid dataset sources for snapshot");
            }

Also I would expect e to be included in the rethrown exception, is there a reason to omit it?

…ests

Added DatasetDeleteFlightTest -- we already had unit test coverage for step behavior, DAO behavior.
When moving to remove the faults referenced in that connected test, I found that they were used in DatasetLockConnectedTest too.
But that test suite was not annotated, so its tests hadn't been running for 2 years.
I went through those tests and expanded flight construction unit tests to make up the coverage.
I was then able to remove 9 faults, clean up everywhere they were referenced in production code, and remove DatasetLockConnectedTest suite in full.
Including:
- Enum fields follow Java static final naming convention
- Consolidate calls to DatasetDao: some places we previously fetched a dataset to then convert it to a model, but we can get this in a single call now
- Use Optional.orElseThrow to concisely return optional content, or throw if not present
@okotsopoulos okotsopoulos merged commit 1fa7250 into develop Jun 26, 2023
8 checks passed
@okotsopoulos okotsopoulos deleted the okotsopo/DR-268-expose-locked-resources branch June 26, 2023 22:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants