[datalake] Add DirectoryClient and FileClient #610

roeap · 2022-01-13T19:51:19Z

In this PR I hope to finish the migration of the existing operations and add some related ones.

There are some slight changes to current patterns, in the hope of addressing some of what I gathered from discussions in this repo as well as some thoughts collected while implementing prior changes.

From what I understand, C# often serves as the reference implementation for SDKs. Following the C# SDK, I introduced a DirectoryClient and a FileClient rather than having file operations live on the FileSystemClient. I am not sure if other crates do this as well, but since the file and directory operations all work against the same REST API route, I opted for one operation that supports all options for the specific route, and for using its builder for multiple operations on multiple clients. The main motivation is to avoid significant code duplication and reduce maintenance effort.

roeap · 2022-01-13T19:54:46Z

@ctaggart @rylev @thovoll - I made some structural changes to the crate and am hoping to get some early feedback on whether that approach makes sense. There is still a bunch of work left in the PR, but the relevant aspects can be seen in the files sdk/storage_datalake/src/operations/path_put.rs, sdk/storage_datalake/src/clients/directory_client.rs, and sdk/storage_datalake/src/request_options.rs.

Any guidance if this is a desired direction is greatly appreciated.

roeap · 2022-01-14T07:43:03Z

One thing that came to my mind is the question of how the client should behave for rename operations. Specifically, the client can no longer be used to operate on the file that has been renamed; a new client needs to be created. We do create a client for the target location inside the rename call. So my question is: should a file or directory client update its internal path to keep tracking the renamed location, so that it could be re-used after renaming?

To be honest, I am not even sure what I would expect as a user - on one hand I specified a path and changing that could be surprising; on the other hand, if I think of this as a physical file, I would feel it is still the same file, just with a different name.

rylev

Looking good - perhaps you could talk more about the motivation behind PathClient? I'm not sure I fully understand why that approach is preferable.

As for the question of file renaming: without thinking too deeply about it, my first impression is that the client should update its internal path, since logically the file in question is the same, just with a different name.

sdk/storage_datalake/examples/data_lake_04_directory.rs

roeap · 2022-01-14T10:03:43Z

> perhaps you could talk more about the motivation behind PathClient? I'm not sure I fully understand why that approach is preferable.

The main motivation is that operations against files and directories are essentially the same in terms of the REST API request. However - taking C# as a guideline - we wanted to separate files and directories conceptually. So the goal is to avoid code duplication, since the requests mainly differ only in the resource query parameter.
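To illustrate the idea of one builder shared by several clients, here is a minimal sketch. The `PathClient` trait, its methods, and `PutPathBuilder::request_target` are hypothetical names chosen for this example and are not taken from the crate; the sketch only shows how a builder that is generic over the client can vary the `resource` query parameter while everything else stays shared.

```rust
/// Hypothetical trait: anything that addresses a path inside a file system.
pub trait PathClient {
    /// Value of the `resource` query parameter ("file" or "directory").
    fn resource(&self) -> &'static str;
    /// Path of the file or directory inside the file system.
    fn path(&self) -> &str;
}

pub struct FileClient {
    pub path: String,
}

pub struct DirectoryClient {
    pub path: String,
}

impl PathClient for FileClient {
    fn resource(&self) -> &'static str {
        "file"
    }
    fn path(&self) -> &str {
        &self.path
    }
}

impl PathClient for DirectoryClient {
    fn resource(&self) -> &'static str {
        "directory"
    }
    fn path(&self) -> &str {
        &self.path
    }
}

/// One builder for the "put path" route, generic over the client type.
pub struct PutPathBuilder<'a, C: PathClient> {
    client: &'a C,
    file_system: &'a str,
}

impl<'a, C: PathClient> PutPathBuilder<'a, C> {
    pub fn new(client: &'a C, file_system: &'a str) -> Self {
        Self { client, file_system }
    }

    /// Builds the request target; only `resource` depends on the client type.
    pub fn request_target(&self) -> String {
        format!(
            "/{}/{}?resource={}",
            self.file_system,
            self.client.path(),
            self.client.resource()
        )
    }
}
```

With this shape, `PutPathBuilder::new(&file_client, "fs")` and `PutPathBuilder::new(&directory_client, "fs")` share all request-building code, and a new option on the route only has to be added in one place.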

roeap · 2022-01-14T10:19:43Z

Another question that came up is how to present the results for delete operations. These operations return a continuation token since there seems to be a max number of items to delete in a single request - as such should the delete operation return a stream much like the list ops? From the tech side that seems clear, but I am not sure from the user perspective. On the file system no continuation is needed, so delete would look different on file systems and file / directory.

We do return the token in the response, but right now we treat it like other "one-off" operations.

Update: The same actually applies to rename operations - specifically on directories. Given that, my impression is that "missing" continuation could potentially lead to severe disruptions in certain scenarios and the client should make that explicit. My proposal would then be to implement both into_future and into_stream on the builder.

Alternatively we could separate builders after all and have the operations on directories only return builders with into_stream.

thovoll · 2022-01-14T15:26:09Z

> One thing that came to my mind is the question of how the client should behave for rename operations. Specifically, the client can no longer be used to operate on the file that has been renamed; a new client needs to be created. We do create a client for the target location inside the rename call. So my question is: should a file or directory client update its internal path to keep tracking the renamed location, so that it could be re-used after renaming?
>
> To be honest, I am not even sure what I would expect as a user - on one hand I specified a path and changing that could be surprising; on the other hand, if I think of this as a physical file, I would feel it is still the same file, just with a different name.

The .Net and Java SDKs return a new client when renaming and don't update the internal state of the "old" client, which has the problem you mentioned (and more) but it's a pattern we will have to follow unless there's a Rust specific reason that we can't. Also, updating the internal state isn't that great either.
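As a rough sketch of that pattern (not the crate's actual code), a rename that borrows the source client and returns a fresh client for the destination could look like the following. The `FileClient` fields and the `rename` signature are assumptions made for illustration, and the network call is omitted.

```rust
/// Hypothetical, simplified client that only tracks where it points.
#[derive(Debug, Clone, PartialEq)]
pub struct FileClient {
    file_system: String,
    path: String,
}

impl FileClient {
    pub fn new(file_system: impl Into<String>, path: impl Into<String>) -> Self {
        Self {
            file_system: file_system.into(),
            path: path.into(),
        }
    }

    pub fn path(&self) -> &str {
        &self.path
    }

    /// Takes `&self`, so the source client keeps its original path,
    /// and returns a new client that points at the destination.
    pub fn rename(&self, destination: impl Into<String>) -> FileClient {
        // A real implementation would send the rename request here.
        FileClient {
            file_system: self.file_system.clone(),
            path: destination.into(),
        }
    }
}
```

After `let renamed = old.rename("b.txt");`, `old.path()` still returns the original path, which mirrors the caveat discussed above. A Rust-specific alternative would be a signature that takes `self` by value, so the stale client cannot be used after the rename; that is only an option to consider, not something the other SDKs do.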

thovoll · 2022-01-14T15:33:33Z

> Another question that came up is how to present the results for delete operations. These operations return a continuation token since there seems to be a max number of items to delete in a single request - as such should the delete operation return a stream much like the list ops? From the tech side that seems clear, but I am not sure from the user perspective. On the file system no continuation is needed, so delete would look different on file systems and file / directory.
>
> We do return the token in the response, but right now we treat it like other "one-off" operations.
>
> Update: The same actually applies to rename operations - specifically on directories. Given that, my impression is that "missing" continuation could potentially lead to severe disruptions in certain scenarios and the client should make that explicit. My proposal would then be to implement both into_future and into_stream on the builder.
>
> Alternatively we could separate builders after all and have the operations on directories only return builders with into_stream.

This is an interesting and confusing case. With ADLS Gen2 storage accounts (meaning that hierarchical namespaces are enabled) file and directory deletes and renames are atomic and don't need continuations. However, the ADLS Gen2 REST API can also be used with a storage account that has hierarchical namespaces DISABLED (which means it's not an ADLS Gen2 storage account). This is where continuations come into the picture.

The .Net SDK simply returns the continuation as a header and lets the SDK user deal with it if needed. I think we should do the same here and not return a stream, since using the ADLS Gen2 API with a non-ADLS Gen2 storage account is not the main use case.
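For users who do hit the continuation case, handling it on the caller side could look roughly like this sketch. `DeleteResponse` and `delete_to_completion` are hypothetical names, and the request is modelled as a synchronous closure to keep the example self-contained; the real operation would be async and return the token as part of its response.

```rust
/// Hypothetical response type: `continuation` is `None` once the service
/// reports that the operation is complete.
pub struct DeleteResponse {
    pub continuation: Option<String>,
}

/// Calls `delete_page` repeatedly, passing the last continuation token,
/// until no token is returned. Returns the number of requests sent.
pub fn delete_to_completion<F>(mut delete_page: F) -> usize
where
    F: FnMut(Option<&str>) -> DeleteResponse,
{
    let mut continuation: Option<String> = None;
    let mut requests = 0;
    loop {
        let response = delete_page(continuation.as_deref());
        requests += 1;
        match response.continuation {
            // More work left: remember the token for the next request.
            Some(token) => continuation = Some(token),
            // No token: the delete has finished.
            None => return requests,
        }
    }
}
```

On an account with hierarchical namespaces enabled the loop would run exactly once, so the common case pays no extra cost for this handling.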

thovoll · 2022-01-14T15:38:09Z

> The main motivation is that operations against files and directories are essentially the same in terms of the REST API request. However - taking C# as a guideline - we wanted to separate files and directories conceptually. So the goal is to avoid code duplication, since the requests mainly differ only in the resource query parameter.

Interestingly, the .Net SDK seems to use a generated Path client under the covers. We could eventually look at using the generated Path client here in Rust as well, but I would leave that to another potential future PR.

thovoll · 2022-01-16T13:59:12Z

Nit: Probably should use file_system everywhere instead of filesystem (e.g. examples/filesystem.rs). The REST API does call it Filesystem but the .Net and Java SDKs call it FileSystem.

thovoll · 2022-01-16T14:32:25Z

@roeap this looks really good, thanks for all the work!

sdk/storage_datalake/src/clients/file_client.rs

sdk/storage_datalake/src/clients/directory_client.rs

roeap · 2022-01-17T11:39:42Z

Thanks for the feedback @thovoll - I updated the parameter names and made the APIs a little bit more generic.

roeap · 2022-01-17T12:05:41Z

@rylev - we seem to see intermittent test failures and I am trying to figure out why. Essentially, most of the time the tests in REPLAY mode work, until they don't :D. I am trying to figure out the root cause, and whether it is even related to the mock framework.

Do you have any suggestions where to look here?

thovoll · 2022-01-17T14:04:22Z

Resolves: #490

thovoll · 2022-01-17T14:19:30Z

> @rylev - we seem to see intermittent test failures and I am trying to figure out why. Essentially, most of the time the tests in REPLAY mode work, until they don't :D. I am trying to figure out the root cause, and whether it is even related to the mock framework.
>
> Do you have any suggestions where to look here?

What is the error?

roeap · 2022-01-17T14:30:22Z

You can look at the latest failed run, just before the latest successful one. The only change between them is re-recording the transactions. Also, in the previous change, between the last success and the new failure, the changes "should" not have had any effect. It was just a bit of renaming and making parameter types a bit more permissive (String --> `into`).

thovoll · 2022-01-21T15:24:31Z

@roeap I can reproduce the intermittent failure on my machine. I added some println!s and was able to see what's going on.

Transaction 3 (PATCH) sometimes fails because the comma-separated items within the value of the "x-ms-properties" header are ordered differently:

Failure case:

transaction.name = datalake_file_system
transaction.number = 3
actual_uri = '"/azurerustsdk-datalake-file-system?resource=filesystem"'
expected_uri = '"/azurerustsdk-datalake-file-system?resource=filesystem"'
request.method() = 'PATCH'
expected_request.method() = 'PATCH'
actual_headers.len() = 3
expected_headers.len() = 3
actual_headers = '[("x-ms-properties", "ModifiedBy=SW90YQ==,AddedVia=QXp1cmUgU0RLIGZvciBSdXN0"), ("content-length", "0"), ("x-ms-version", "2019-12-12")]'
expected_headers = '[("x-ms-properties", "AddedVia=QXp1cmUgU0RLIGZvciBSdXN0,ModifiedBy=SW90YQ=="), ("x-ms-version", "2019-12-12"), ("content-length", "0")]'

Success case:

transaction.name = datalake_file_system
transaction.number = 3
actual_uri = '"/azurerustsdk-datalake-file-system?resource=filesystem"'
expected_uri = '"/azurerustsdk-datalake-file-system?resource=filesystem"'
request.method() = 'PATCH'
expected_request.method() = 'PATCH'
actual_headers.len() = 3
expected_headers.len() = 3
actual_headers = '[("x-ms-properties", "AddedVia=QXp1cmUgU0RLIGZvciBSdXN0,ModifiedBy=SW90YQ=="), ("content-length", "0"), ("x-ms-version", "2019-12-12")]'
expected_headers = '[("content-length", "0"), ("x-ms-properties", "AddedVia=QXp1cmUgU0RLIGZvciBSdXN0,ModifiedBy=SW90YQ=="), ("x-ms-version", "2019-12-12")]'

When it does fail, it gets retried and on each retry another instance of the "x-ms-version" header is added, causing each retry to fail because the header count doesn't match:

----------------------------------------
transaction.name = datalake_file_system
transaction.number = 3
actual_uri = '"/azurerustsdk-datalake-file-system?resource=filesystem"'
expected_uri = '"/azurerustsdk-datalake-file-system?resource=filesystem"'
request.method() = 'PATCH'
expected_request.method() = 'PATCH'
actual_headers.len() = 4
expected_headers.len() = 3
actual_headers = '[("x-ms-properties", "ModifiedBy=SW90YQ==,AddedVia=QXp1cmUgU0RLIGZvciBSdXN0"), ("content-length", "0"), ("x-ms-version", "2019-12-12"), ("x-ms-version", "2019-12-12")]'
expected_headers = '[("x-ms-properties", "AddedVia=QXp1cmUgU0RLIGZvciBSdXN0,ModifiedBy=SW90YQ=="), ("x-ms-version", "2019-12-12"), ("content-length", "0")]'
----------------------------------------
transaction.name = datalake_file_system
transaction.number = 3
actual_uri = '"/azurerustsdk-datalake-file-system?resource=filesystem"'
expected_uri = '"/azurerustsdk-datalake-file-system?resource=filesystem"'
request.method() = 'PATCH'
expected_request.method() = 'PATCH'
actual_headers.len() = 5
expected_headers.len() = 3
actual_headers = '[("x-ms-properties", "ModifiedBy=SW90YQ==,AddedVia=QXp1cmUgU0RLIGZvciBSdXN0"), ("content-length", "0"), ("x-ms-version", "2019-12-12"), ("x-ms-version", "2019-12-12"), ("x-ms-version", "2019-12-12")]'
expected_headers = '[("content-length", "0"), ("x-ms-properties", "AddedVia=QXp1cmUgU0RLIGZvciBSdXN0,ModifiedBy=SW90YQ=="), ("x-ms-version", "2019-12-12")]'
----------------------------------------
transaction.name = datalake_file_system
transaction.number = 3
actual_uri = '"/azurerustsdk-datalake-file-system?resource=filesystem"'
expected_uri = '"/azurerustsdk-datalake-file-system?resource=filesystem"'
request.method() = 'PATCH'
expected_request.method() = 'PATCH'
actual_headers.len() = 6
expected_headers.len() = 3
actual_headers = '[("x-ms-properties", "ModifiedBy=SW90YQ==,AddedVia=QXp1cmUgU0RLIGZvciBSdXN0"), ("content-length", "0"), ("x-ms-version", "2019-12-12"), ("x-ms-version", "2019-12-12"), ("x-ms-version", "2019-12-12"), ("x-ms-version", "2019-12-12")]'
expected_headers = '[("x-ms-properties", "AddedVia=QXp1cmUgU0RLIGZvciBSdXN0,ModifiedBy=SW90YQ=="), ("x-ms-version", "2019-12-12"), ("content-length", "0")]'
----------------------------------------
transaction.name = datalake_file_system
transaction.number = 3
actual_uri = '"/azurerustsdk-datalake-file-system?resource=filesystem"'
expected_uri = '"/azurerustsdk-datalake-file-system?resource=filesystem"'
request.method() = 'PATCH'
expected_request.method() = 'PATCH'
actual_headers.len() = 7
expected_headers.len() = 3
actual_headers = '[("x-ms-properties", "ModifiedBy=SW90YQ==,AddedVia=QXp1cmUgU0RLIGZvciBSdXN0"), ("content-length", "0"), ("x-ms-version", "2019-12-12"), ("x-ms-version", "2019-12-12"), ("x-ms-version", "2019-12-12"), ("x-ms-version", "2019-12-12"), ("x-ms-version", "2019-12-12")]'
expected_headers = '[("x-ms-properties", "AddedVia=QXp1cmUgU0RLIGZvciBSdXN0,ModifiedBy=SW90YQ=="), ("content-length", "0"), ("x-ms-version", "2019-12-12")]'
Error: CoreError(Policy(MismatchedRequestHeadersCount(7, 3)))
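The growing header count in these logs is what one would expect if a header is appended, rather than replaced, every time the request is re-sent. The sketch below is not the SDK's code; `append_header`, `set_header`, and `header_count_after` are hypothetical helpers that only contrast the two behaviours over five attempts.

```rust
/// Headers kept as a list, so the same name can appear more than once.
pub type Headers = Vec<(String, String)>;

/// Appends unconditionally: running this once per attempt grows the list.
pub fn append_header(headers: &mut Headers, name: &str, value: &str) {
    headers.push((name.to_string(), value.to_string()));
}

/// Replaces any entry with the same name, so repeated calls are idempotent.
pub fn set_header(headers: &mut Headers, name: &str, value: &str) {
    headers.retain(|(existing, _)| !existing.eq_ignore_ascii_case(name));
    headers.push((name.to_string(), value.to_string()));
}

/// Applies `add_version` once per attempt and returns the final header count.
pub fn header_count_after(
    attempts: usize,
    add_version: fn(&mut Headers, &str, &str),
) -> usize {
    let mut headers: Headers = vec![
        (
            "x-ms-properties".to_string(),
            "AddedVia=QXp1cmUgU0RLIGZvciBSdXN0,ModifiedBy=SW90YQ==".to_string(),
        ),
        ("content-length".to_string(), "0".to_string()),
    ];
    for _ in 0..attempts {
        add_version(&mut headers, "x-ms-version", "2019-12-12");
    }
    headers.len()
}
```

With `append_header`, five attempts end with 7 headers, matching the `MismatchedRequestHeadersCount(7, 3)` error above; with `set_header` the count stays at 3 no matter how often the request is retried.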

rylev · 2022-01-21T16:11:52Z

Thanks for the investigation @thovoll. It looks like the root cause is that Properties are stored in a HashMap, which does not guarantee a consistent iteration order. We can instead use a BTreeMap, which iterates over its keys in sorted order. This would fix the issue.
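A small sketch of why the BTreeMap helps: because iteration follows key order, the serialized header value no longer depends on insertion order or on the hasher's random seed. The `properties_header` function is a hypothetical stand-in for however the crate builds the "x-ms-properties" value, and the values are assumed to be base64-encoded already.

```rust
use std::collections::BTreeMap;

/// Joins properties into a `key=value,key=value` header value.
/// A BTreeMap iterates in sorted key order, so the output is deterministic.
pub fn properties_header(properties: &BTreeMap<String, String>) -> String {
    properties
        .iter()
        .map(|(key, value)| format!("{}={}", key, value))
        .collect::<Vec<_>>()
        .join(",")
}
```

Inserting "ModifiedBy" before "AddedVia", or the other way around, yields the same string, which is what the recorded transactions need in order to match on replay.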

roeap · 2022-01-21T17:42:06Z

Thanks @thovoll @rylev for figuring this out ... I updated the implementation of Properties accordingly. At least locally, I could not reproduce the error after several attempts ... so looking good so far :)

thovoll · 2022-01-21T18:32:37Z

sdk/storage_datalake/src/properties.rs

@@ -29,7 +29,7 @@ impl Properties {
        self.0.insert(k.into(), v.into())
    }

-    pub fn hash_map(&self) -> &HashMap<Cow<'static, str>, Cow<'static, str>> {
+    pub fn hash_map(&self) -> &BTreeMap<Cow<'static, str>, Cow<'static, str>> {


Should this method be renamed? Should this method even exist?

roeap added 10 commits January 13, 2022 20:02

migrate file system operations

97cefd4

example fixes

44d0ff2

update constuctors

2b4ee52

add date and version headers in requests

63758f1

add resource type to fs requests

16da69b

remove debug and trace statememnts

6ebadf6

make constructor infallible again

73b8073

put path operation

cf09ca4

rename reuqst options

1736f0b

allow directly initializing clients

81607c9

roeap added 7 commits January 13, 2022 20:56

Merge branch 'main' into storage-pipelines

4f51598

make put operation generic over client

151cc03

migrate file create

c99e3f6

file / directory delete

4d68e75

cleanup

5a317a5

remove warnings

7720109

make properties required param

a39b042

rylev reviewed Jan 14, 2022

sdk/storage_datalake/examples/data_lake_04_directory.rs (outdated, resolved)

add head operation

4584d74

roeap added 2 commits January 14, 2022 11:08

remove prints

a120d4c

rename example files

2cd2220

prepare separating tests

6e652e4

return destination client on rename operations

7f95426

roeap added 9 commits January 14, 2022 22:35

remove unused import

afcbd85

update test recordings

4446f3f

require credentials only when recording test

da943db

add test_e2e feature again

d12283c

more consistent naming

622afb1

list paths

23d998f

split tests

270c536

add file read operations

c2a7db5

add test for reading file

b823f07

thovoll approved these changes Jan 16, 2022

roeap added 2 commits January 17, 2022 12:33

more flexible api parameters

0c7d067

parameter naming

e9b9047

re-record transactions

294eec8

thovoll mentioned this pull request Jan 17, 2022

[Tracking Issue] Create/Convert ADLS Gen2 (data_lake) using Pipeline Architecture #496

Closed

34 tasks

thovoll requested a review from rylev January 20, 2022 14:07

roeap added 2 commits January 21, 2022 18:38

properties as BTreeMao

ebdaea6

format json files

9534d93

replace with direct accesor on Properties

febfd7b

thovoll merged commit 1512a35 into Azure:main Jan 21, 2022

thovoll reviewed Jan 21, 2022
