[DR-268] Resources can be fetched when locked #1470

okotsopoulos · 2023-06-06T01:18:18Z

https://broadworkbench.atlassian.net/browse/DR-268

Prior behavior

When resources are exclusively locked, they cannot be retrieved or modified.

Background

One problem with the prior behavior is that it has been shown to cause user panic: if a user's resource is exclusively locked and they try to retrieve it, they can reasonably conclude that their resource and its data may have been lost.

What we really want to protect against is the possible modification of exclusively-locked resources, which we already get in resource modification flights and their attempts to obtain resource locks (either exclusive or shared). If the resource in question already has an exclusive lock, the flight will fail and the modification will not be performed.

This leads into a user request from Nate: a user had inadvertently deleted critical TDR resources, and it would have been nice to have an extra layer of protection against such deletion. Our existing locking mechanism has the potential to help support this use case, but we need to expose it first.

New behavior

When resources are exclusively locked...

- they can be retrieved but cannot be modified.
- the locking flight ID is returned in resource retrieval, summary retrieval, and enumeration.
  - Note: I only exposed the exclusive lock, but datasets also can have shared locks. The presence of a shared lock will prevent an exclusive lock from being taken out on the resource. In the future, we might like to expose this information in our API responses as well, in case a shared lock is stuck and needs to be cleared out.
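
The locking rules described above can be sketched as a small in-memory model. This is only an illustration, not TDR's DAO code: the class and method names are hypothetical, and treating a repeated exclusive lock by the same flight as a no-op is an assumption.

```java
import java.util.HashSet;
import java.util.Optional;
import java.util.Set;

/** Hypothetical in-memory model of one resource's locks; not TDR's actual DAO code. */
class ResourceLockSketch {
  private String exclusiveLock; // flight ID holding the exclusive lock, or null
  private final Set<String> sharedLocks = new HashSet<>(); // flight IDs holding shared locks

  /** An exclusive lock requires that no other flight holds any lock on the resource. */
  synchronized void lockExclusive(String flightId) {
    if (exclusiveLock != null && !exclusiveLock.equals(flightId)) {
      throw new IllegalStateException("Exclusively locked by flight " + exclusiveLock);
    }
    if (!sharedLocks.isEmpty()) {
      throw new IllegalStateException("Shared locks held by flights " + sharedLocks);
    }
    exclusiveLock = flightId; // assumption: re-locking by the same flight is a no-op
  }

  /** A shared lock only requires that no exclusive lock is held. */
  synchronized void lockShared(String flightId) {
    if (exclusiveLock != null) {
      throw new IllegalStateException("Exclusively locked by flight " + exclusiveLock);
    }
    sharedLocks.add(flightId);
  }

  synchronized void unlockExclusive(String flightId) {
    if (flightId.equals(exclusiveLock)) {
      exclusiveLock = null;
    }
  }

  synchronized void unlockShared(String flightId) {
    sharedLocks.remove(flightId);
  }

  /** Reads never check the lock; they only report the exclusive lock holder, if any. */
  synchronized Optional<String> exclusiveLockHolder() {
    return Optional.ofNullable(exclusiveLock);
  }
}
```

In this model a read-only caller would use only exclusiveLockHolder(), which mirrors how the exclusive lock is surfaced in API responses while the shared locks stay internal.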

Testing and Quality

My developer environment reflects these latest changes. I've established an exclusive lock on this dataset via manual DB modification, and can retrieve it but cannot modify it.

curl -X 'GET' \
  'https://jade-ok.datarepo-dev.broadinstitute.org/api/repository/v1/datasets/786725ff-5822-4e51-9e71-50793c561d32?include=NONE' \
  -H 'accept: application/json' \
  -H 'Authorization: Bearer <token>'

{
  "id": "786725ff-5822-4e51-9e71-50793c561d32",
  "name": "person_sample",
  "description": "Dataset with person, sample tables",
  …
  "resourceLocks": {
    "exclusive": "DR-268-exclusive-lock"
  }
}

An attempt to generate a snapshot from it also displays new, improved error messaging, giving the user more information about the cause of the failure (we previously would only say "failed to lock the <resource>").

In the process of taking on this work, I fixed several latent bugs, added necessary shared dataset locking to snapshot creation, and followed up on learnings from a previous test improvement PR to make further test improvements (i.e. refactored a controller test to JUnit 5 with test slicing, removed faults in production code previously required for testing, and tested functionality via unit tests rather than connected tests). More details can be found in PR comments.

Future Work

- Handling in TDR UI for exclusively-locked resources (to be spiked)
- Expose shared locks in resources fetched
- Allow TDR Admins (and perhaps eventually data stewards) to remove locks from API
  - When a lock is stuck, presently TDR Admins need to modify database records directly
- Allow data stewards to establish locks on resources to prevent their modification (to be spiked)

This field will be included on resource retrieval, summary retrieval, and enumeration. At this point, we only intend to expose flight IDs that have obtained exclusive locks on resources; flight IDs that have obtained shared locks on resources are still internal-only.

Expose it from a getter in the dataset object and use it for conversions to model objects used in API responses.

In any flight where we attempt a modification of the dataset, we try to obtain a lock on it: if an exclusive lock is present, the flight will fail and the modification will not be allowed to proceed. This changeset allows datasets to be retrieved in read-only operations even when exclusively locked. Removed now-unneeded code and updated tests. Refactored DatasetsApiControllerTest to JUnit 5 with test slicing. Drive-by: refactored out DatasetService method for checking that callers can read datasets, and instead call the IamService within DatasetsApiController. This code followed a pattern set in SnapshotService but was overly complex for datasets (a user may be able to read a snapshot because they have direct access via Sam and/or indirect access via a RAS passport... dataset access is only handled by Sam).

In any flight where we attempt a modification of the snapshot, we should obtain a lock on it: if an exclusive lock is present, the flight will fail and the modification will not be allowed to proceed. This changeset allows snapshots to be retrieved in read-only operations even when exclusively locked. Removed now-unneeded code and updated tests. Improvement: added new enum LockOperation to allow for more descriptive error messages to users when their jobs fail on trying to lock a resource (in an upcoming commit, I will refactor DatasetDao to use this). Drive-bys: addressed Intellij warnings where they were a small lift.
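
As a rough illustration of how an operation enum can drive more descriptive lock failure messages, here is a minimal sketch. The contents of the real LockOperation are not shown in this PR page, so the constant names, the description strings, and the failureMessage helper below are all hypothetical.

```java
/** Hypothetical sketch; the real bio.terra.common.LockOperation may look quite different. */
enum LockOperationSketch {
  LOCK_EXCLUSIVE("take out an exclusive lock on"),
  LOCK_SHARED("take out a shared lock on"),
  UNLOCK_EXCLUSIVE("release the exclusive lock on"),
  UNLOCK_SHARED("release a shared lock on");

  private final String description;

  LockOperationSketch(String description) {
    this.description = description;
  }

  /** Builds a user-facing message naming the operation, the resource, and any lock holder. */
  String failureMessage(String resourceType, String resourceId, String heldByFlightId) {
    String message = String.format("Failed to %s %s %s", description, resourceType, resourceId);
    return heldByFlightId == null
        ? message
        : message + ": exclusively locked by flight " + heldByFlightId;
  }
}
```

Sharing one enum between the dataset and snapshot DAOs keeps the wording of these messages consistent across resource types.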

This allows for more descriptive error messages to users when their jobs fail on trying to lock a dataset. Drive-bys: addressed Intellij warnings where they were a small lift (ex. removing unused userReq param).

Otherwise, we return a resource summary object that could incorrectly indicate that it's exclusively locked. This was exposed by broken connected tests.

…toModel The previous commit exposed a latent bug: inconsistency in how a dataset source model was constructed, which caused several connected tests to fail due to mismatched values. In the process of investigating these test failures, I found that SnapshotFileLookupConnectedTest could have several tests removed as long as we increased SnapshotDao unit test coverage. I made the same test expansion for DatasetDao unit tests. A nice side effect of this test refactor is that we got to remove two configuration faults previously used for testing only.

A while back, snapshot creation would obtain an exclusive lock on its source dataset: this was overkill and caused problems for reasonable concurrent operations performed on the same dataset. It was removed in full. However, snapshot creation should obtain a shared (non-exclusive) lock on its source dataset to guard against the dataset being deleted out from under it, which would result in an orphaned snapshot (we don't allow dataset deletion to proceed if the dataset has snapshots).

okotsopoulos · 2023-06-08T14:03:00Z

src/test/java/bio/terra/service/snapshot/SnapshotConnectedTest.java

@@ -437,54 +432,6 @@ public void testProjectDeleteAfterSnapshotDelete() throws Exception {
        googleResourceManagerService.getProject(googleProjectId).getLifecycleState());
  }

-  @Ignore("Remove ignore after DR-1770 is addressed")


I originally went to refactor this test so that the faults weren't needed. But I saw that the referenced ticket was closed as "won't do" as the issue hadn't reoccurred in some time, so I deleted the test:

https://broadworkbench.atlassian.net/browse/DR-1770

Can you expand on this further?

Did we not see that error again because we started ignoring the test? I'm either confused now or I was confused when I wrote that comment in that ticket.

I think the comment was indicating that we should delete the "Ignore" tag, not delete the test altogether. I don't know that we yet have an alternate test covering this scenario.

Thank you for probing this!

The reason the test issue had not reoccurred is that we added the @Ignore annotation. We have gone several years without running this test and have seen no ill effects, so one could argue that it may not have been valuable to us to begin with.

What was helpful in our post-standup conversation was to think through the question: "what behavior are we trying to verify with this test?". That led me to the initial PR and ticket.

The initial intent was to make sure that snapshot deletion takes out an exclusive lock on the snapshot, and that concurrent attempts at snapshot deletion would fail as a result.

A more complete, precise, stable, and speedy way to test this is through a chain of unit tests:

Verify that SnapshotDeleteFlight adds LockSnapshotStep to its manifest in the way we expect

Verify that LockSnapshotStep behaves as we'd expect for all possible input conditions (including if the snapshot is already locked, has already been deleted, etc).

Verify that the underlying DAO methods to lock / unlock snapshots behave as we'd expect

We already had unit tests to verify DAO behavior. I supplemented tests to include unit tests for flight construction and step behavior. For symmetry, I also wrote unit tests for UnlockSnapshotStep, even though it's not called in this flight.
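
A flight construction check of the kind described above might look roughly like the following. This is a sketch with hypothetical stand-in types (the names ending in Sketch are invented here) rather than the real flight and step classes, and it uses plain assertions instead of a test framework to stay self-contained.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical stand-ins for flight and step types; not the real Stairway or TDR classes.
interface StepSketch {}

class LockSnapshotStepSketch implements StepSketch {}

class DeleteSnapshotMetadataStepSketch implements StepSketch {}

// Exists for symmetry but is deliberately not part of the delete flight below.
class UnlockSnapshotStepSketch implements StepSketch {}

/** A flight is modeled as an ordered list of steps assembled in its constructor. */
class SnapshotDeleteFlightSketch {
  private final List<StepSketch> steps = new ArrayList<>();

  SnapshotDeleteFlightSketch() {
    steps.add(new LockSnapshotStepSketch()); // take the exclusive lock first
    steps.add(new DeleteSnapshotMetadataStepSketch()); // then delete; no explicit unlock step
  }

  List<String> stepNames() {
    return steps.stream().map(s -> s.getClass().getSimpleName()).collect(Collectors.toList());
  }
}

class SnapshotDeleteFlightSketchCheck {
  /** Flight construction check: the lock step must come first in the manifest. */
  static void verifyLocksBeforeDeleting() {
    List<String> names = new SnapshotDeleteFlightSketch().stepNames();
    if (!names.get(0).equals("LockSnapshotStepSketch")) {
      throw new AssertionError("Expected the lock step first, got " + names);
    }
    if (names.contains("UnlockSnapshotStepSketch")) {
      throw new AssertionError("Did not expect an unlock step in " + names);
    }
  }
}
```

Because the check only inspects the list of steps, it needs no database or cloud resources, which is what makes it faster and more stable than running two concurrent deletes.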

That's great! Thanks for looking into this and writing tests that are more aligned with our current test strategy. I'm glad this allowed for removal of the fault insertion.

Would it make sense to do this for DatasetConnectedTest.testOverlappingDeletes too? (This doesn't have to happen in this PR though)

Good idea, thanks for flagging that! I'll take a look.

I'm glad you flagged this :)

I was able to remove DatasetConnectedTest.testOverlappingDeletes by adding a unit test for DatasetDeleteFlight construction. We already had unit tests to verify DAO behavior and step behavior.

But the fun part came when moving to remove the faults referenced in that connected test. I found them referenced in DatasetLockConnectedTest, in tests that should have been failing with my code changes… but I realized that this test suite was not annotated and thus its tests weren't running with our connected tests as expected. That's been the case for the last 2 years.

jade-data-repo/src/test/java/bio/terra/service/dataset/DatasetLockConnectedTest.java

Lines 53 to 54 in cd8e294

public class DatasetLockConnectedTest {

So I went through those connected tests, verified the behavior with additional flight construction unit tests where necessary, and got to remove 9 faults and clean up everywhere they were called in production code!

src/test/java/bio/terra/service/snapshot/SnapshotFileLookupConnectedTest.java

src/main/java/bio/terra/service/snapshot/flight/create/SnapshotCreateFlight.java

src/main/resources/api/data-repository-openapi.yaml

snf2ye

I'm so excited for this change 🥳
Thanks for the detailed commit messages, they were very helpful, especially in such a large PR!

src/main/resources/api/data-repository-openapi.yaml

snf2ye · 2023-06-09T01:53:22Z

src/main/java/bio/terra/service/dataset/DatasetDao.java

@snf2ye

…apshot In a previous PR, I removed a flaky ignored connected test that ran two concurrent snapshot deletes. Talking with @snf2ye raised the million-dollar question: 'What exactly were we trying to test with that flaky test?' The answer: we wanted to make sure that snapshot deletion takes out an exclusive lock on the snapshot, and that concurrent attempts at snapshot deletion would fail as a result.

A more complete, precise, stable, and speedy way to test this is through a chain of unit tests:
- Verify that SnapshotDeleteFlight adds LockSnapshotStep to its manifest in the way we expect
- Verify that LockSnapshotStep behaves as we'd expect for all possible input conditions (including if the snapshot is already locked, has already been deleted, etc).
- Verify that the underlying DAO methods to lock / unlock snapshots behave as we'd expect

We already had unit tests to verify DAO behavior. I supplemented tests to include unit tests for flight construction and step behavior. For symmetry, I also wrote unit tests for UnlockSnapshotStep, even though it's not called in this flight.

Future similar opportunities in 'test engoodification' include similar tests for dataset locking / unlocking steps, as well as flight construction in other places to verify locking behavior.

…eAfterDatasetDelete This set-up is already performed before the test and triggered a spotbugs error

src/main/java/bio/terra/common/LockOperation.java

samanehsan

This looks great! I'm so excited for this change 🎉 I had a comment about updating one of the tests, but I would not consider it a blocker here.

src/main/java/bio/terra/service/dataset/DatasetJsonConversion.java

src/main/java/bio/terra/service/dataset/DatasetService.java

src/main/java/bio/terra/service/dataset/flight/LockDatasetStep.java

pshapiro4broad · 2023-06-14T17:31:07Z

src/main/java/bio/terra/service/snapshot/SnapshotDao.java

+            List<DatasetProject> datasetProjects;
+            try {
+              datasetProjects =
+                  objectMapper.readValue(rs.getString("dataset_sources"), new TypeReference<>() {});
+            } catch (JsonProcessingException e) {
+              throw new CorruptMetadataException("Invalid dataset sources for snapshot");
+            }
+            return new SnapshotProject()
+                .id(rs.getObject("id", UUID.class))
+                .name(rs.getString("name"))
+                .profileId(rs.getObject("profile_id", UUID.class))
+                .dataProject(rs.getString("google_project_id"))
+                .cloudPlatform(CloudPlatform.fromValue(rs.getString("cloud_platform")))
+                .sourceDatasetProjects(datasetProjects);


Is there a reason to prefer this form over putting the return inside the try? Moving the return up lets you get rid of the separate declaration for datasetProjects.

            try {
              List<DatasetProject> datasetProjects =
                  objectMapper.readValue(rs.getString("dataset_sources"), new TypeReference<>() {});
              return new SnapshotProject()
                  .id(rs.getObject("id", UUID.class))
                  .name(rs.getString("name"))
                  .profileId(rs.getObject("profile_id", UUID.class))
                  .dataProject(rs.getString("google_project_id"))
                  .cloudPlatform(CloudPlatform.fromValue(rs.getString("cloud_platform")))
                  .sourceDatasetProjects(datasetProjects);
            } catch (JsonProcessingException e) {
              throw new CorruptMetadataException("Invalid dataset sources for snapshot");
            }

Also I would expect e to be included in the rethrown exception, is there a reason to omit it?
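
To illustrate the point about keeping the original exception as the cause, here is a minimal sketch. It assumes the real CorruptMetadataException offers a (message, cause) constructor, which this page does not confirm; the exception class and parser below are hypothetical stand-ins, and a number parse stands in for the JSON parse so the example needs no third-party library.

```java
/** Hypothetical stand-in; assumes the real exception has a (message, cause) constructor. */
class CorruptMetadataExceptionSketch extends RuntimeException {
  CorruptMetadataExceptionSketch(String message, Throwable cause) {
    super(message, cause);
  }
}

class DatasetSourcesParserSketch {
  /** Parses a comma-separated list of integers, standing in for the JSON parse in the diff. */
  static int[] parse(String raw) {
    try {
      String[] parts = raw.split(",");
      int[] values = new int[parts.length];
      for (int i = 0; i < parts.length; i++) {
        values[i] = Integer.parseInt(parts[i].trim());
      }
      // Returning inside the try removes the need for a separately declared variable.
      return values;
    } catch (NumberFormatException e) {
      // Passing e along preserves the underlying parse failure in the stack trace.
      throw new CorruptMetadataExceptionSketch("Invalid dataset sources for snapshot", e);
    }
  }
}
```

With the cause attached, a log of the rethrown exception shows which value failed to parse rather than only the generic message.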

src/main/java/bio/terra/service/snapshot/SnapshotDao.java

src/main/java/bio/terra/service/snapshot/SnapshotService.java

src/test/java/bio/terra/app/controller/DatasetsApiControllerTest.java

…ests Added DatasetDeleteFlightTest -- we already had unit test coverage for step behavior, DAO behavior. When moving to remove the faults referenced in that connected test, I found that they were used in DatasetLockConnectedTest too. But that test suite was not annotated, so its tests hadn't been running for 2 years. I went through those tests and expanded flight construction unit tests to make up the coverage. I was then able to remove 9 faults, clean up everywhere they were referenced in production code, and remove DatasetLockConnectedTest suite in full.

Including:
- Enum fields follow Java static final naming convention
- Consolidate calls to DatasetDao: some places we previously fetched a dataset to then convert it to a model, but we can get this in a single call now
- Use Optional.orElseThrow to concisely return optional content, or throw if not present
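
The Optional.orElseThrow pattern mentioned above can be sketched as follows. The lookup class, its backing map, and the exception type are hypothetical and stand in for whatever DAO and not-found exception the real code uses.

```java
import java.util.Map;
import java.util.Optional;
import java.util.UUID;

/** Hypothetical lookup used only to illustrate Optional.orElseThrow. */
class DatasetLookupSketch {
  private final Map<UUID, String> namesById;

  DatasetLookupSketch(Map<UUID, String> namesById) {
    this.namesById = namesById;
  }

  Optional<String> findName(UUID id) {
    return Optional.ofNullable(namesById.get(id));
  }

  /** Returns the name if present, or throws with a descriptive message if not. */
  String getName(UUID id) {
    return findName(id)
        .orElseThrow(() -> new IllegalArgumentException("Dataset not found: " + id));
  }
}
```

This replaces an explicit isPresent() check followed by get() with a single expression.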

okotsopoulos added 11 commits June 5, 2023 19:44

Add lockingJobId to dataset summary object

6eb05c7

Expose it from a getter in the dataset object and use it for conversions to model objects used in API responses.

Spotless and spotbugs

252d94a

Add lockingJobId to snapshot, snapshot summary objects

2e38c5b

Refactor DatasetDao to used shared LockOperation enum

6c81e12

This allows for more descriptive error messages to users when their jobs fail on trying to lock a dataset. Drive-bys: addressed Intellij warnings where they were a small lift (ex. removing unused userReq param).

Resources written as job responses should be done after unlock

9f23c7b

Otherwise, we return a resource summary object that could incorrectly indicate that it's exclusively locked. This was exposed by broken connected tests.

Spotbugs

5e5c556

okotsopoulos commented Jun 8, 2023

src/test/java/bio/terra/service/snapshot/SnapshotFileLookupConnectedTest.java

okotsopoulos commented Jun 8, 2023

src/main/java/bio/terra/service/snapshot/flight/create/SnapshotCreateFlight.java

okotsopoulos marked this pull request as ready for review June 8, 2023 15:30

okotsopoulos requested review from snf2ye, nmalfroy and samanehsan as code owners June 8, 2023 15:30

okotsopoulos commented Jun 8, 2023

src/main/resources/api/data-repository-openapi.yaml

snf2ye reviewed Jun 9, 2023

snf2ye approved these changes Jun 9, 2023

okotsopoulos added 4 commits June 12, 2023 16:46

Add unit tests to verify resource locking in snapshot creation flight

f31b7b6

Return dataset exclusive lock in encapsulating ResourceLocks structure

219d445

Return snapshot exclusive lock in encapsulating ResourceLocks structure

c5c8e38

Remove duplicate test set-up in DatasetConnectedTest.testProjectDelet…

1370a8d

…eAfterDatasetDelete This set-up is already performed before the test and triggered a spotbugs error