Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[datalake] Add DirectoryClient and FileClient #610

Merged
merged 46 commits into from Jan 21, 2022

Conversation

roeap
Copy link
Contributor

@roeap roeap commented Jan 13, 2022

In this PR I hope to finish migration of existing operations and add some more related ones.

There are some slight changes to current patterns in the hopes to address some of what I gathered from discussions in this repo as well as some thoughts collected while implementing prior changes.

From what I understand in many ways C# often serves as reference implementation for SDKs. As over there I introduced a DirectoryClient and FileClient rather than having file operations live on the FileSystemClient. Not sure if other crates do this as well, but since the file and directory operations all work against the same same REST Api route, I opted for having one operation that supports all options for the specific route and using the builder for multiple operations on multiple clients. Main motivation is to avoid the significant code duplication and reduce maintenance effort.

@roeap
Copy link
Contributor Author

roeap commented Jan 13, 2022

@ctaggart @rylev @thovoll - I made some somewhat structural changes to the crate and am hoping to get some early feedback if that approach makes sense. There is still a bunch of work in the PR, but the relevant aspects can be seen in the files 'sdk/storage_datalake/src/operations/path_put.rs', sdk/storage_datalake/src/clients/directory_client.rs, and sdk/storage_datalake/src/request_options.rs.

Any guidance if this is a desired direction is greatly appreciated.

@roeap
Copy link
Contributor Author

roeap commented Jan 14, 2022

One thing that came to my mind is the question how the client should behave for rename operations. Specifically the client can no longer be used to operate on the file that has been renamed, but a new client needs to be created. We do create a client for the target location inside the rename call. SO my question is should a file or directory client update its internal path to still track the renamed location. In that case it could be re-used after renaming?

To be honest I am not even sure what I would expect as a user - on one hand i specified a path and changing that could be surprising, on the other hand if I think of this as a physical file I would feel it still is the same file, just with a different name.

Copy link
Contributor

@rylev rylev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good - perhaps you could talk more about the motivation behind PathClient? I'm not sure I fully understand why that approach is preferable.

As for the question of file renaming. Without thinking too deeply about it, my first impression would be that the client would update its internal path since logically the file in question is the same just with a different name.

sdk/storage_datalake/examples/data_lake_04_directory.rs Outdated Show resolved Hide resolved
@roeap
Copy link
Contributor Author

roeap commented Jan 14, 2022

perhaps you could talk more about the motivation behind PathClient? I'm not sure I fully understand why that approach is preferable.

The main motivation is that operations against files and directories are essentially the same in terms of the REST API request. However - taking C# as guideline - we wanted to separate Files and Directories conceptually. So the main motivation is to avoid code duplication since the request mainly just differ in the resource query parameter.

@roeap
Copy link
Contributor Author

roeap commented Jan 14, 2022

Another question that came up is how to present the results for delete operations. These operations return a continuation token since there seems to be a max number of items to delete in a single request - as such should the delete operation return a stream much like the list ops? From the tech side that seems clear, but I am not sure from the user perspective. On the file system no continuation is needed, so delete would look different on file systems and file / directory.

We do return the token in the response, but right now we treat it like other "one-off" operations.

Update: The same actually applies to rename operations - specifically on directories. Given that, my impression is that "missing" continuation could potentially lead to severe disruptions in certain scenarios and the client should make that explicit. My proposal would then be to implement both into_future and into_stream on the builder.

Alternatively we could separate builders after all and have the operations on directories only return builders with into_stream.

@thovoll
Copy link
Contributor

thovoll commented Jan 14, 2022

One thing that came to my mind is the question how the client should behave for rename operations. Specifically the client can no longer be used to operate on the file that has been renamed, but a new client needs to be created. We do create a client for the target location inside the rename call. SO my question is should a file or directory client update its internal path to still track the renamed location. In that case it could be re-used after renaming?

To be honest I am not even sure what I would expect as a user - on one hand i specified a path and changing that could be surprising, on the other hand if I think of this as a physical file I would feel it still is the same file, just with a different name.

The .Net and Java SDKs return a new client when renaming and don't update the internal state of the "old" client, which has the problem you mentioned (and more) but it's a pattern we will have to follow unless there's a Rust specific reason that we can't. Also, updating the internal state isn't that great either.

@thovoll
Copy link
Contributor

thovoll commented Jan 14, 2022

Another question that came up is how to present the results for delete operations. These operations return a continuation token since there seems to be a max number of items to delete in a single request - as such should the delete operation return a stream much like the list ops? From the tech side that seems clear, but I am not sure from the user perspective. On the file system no continuation is needed, so delete would look different on file systems and file / directory.

We do return the token in the response, but right now we treat it like other "one-off" operations.

Update: The same actually applies to rename operations - specifically on directories. Given that, my impression is that "missing" continuation could potentially lead to severe disruptions in certain scenarios and the client should make that explicit. My proposal would then be to implement both into_future and into_stream on the builder.

Alternatively we could separate builders after all and have the operations on directories only return builders with into_stream.

This is an interesting and confusing case. With ADLS Gen2 storage accounts (meaning that hierarchical namespaces are enabled) file and directory deletes and renames are atomic and don't need continuations. However, the ADLS Gen2 REST API can also be used with a storage account that has hierarchical namespaces DISABLED (which means it's not an ADLS Gen2 storage account). This is where continuations come into the picture.

The .Net SDK simply returns the continuation as a header and lets the SDK user deal with it if needed. I think we should do the same here and not return a stream, since using the ADLS Gen2 API with a non-ADLS Gen2 storage account is not the main use case.

@thovoll
Copy link
Contributor

thovoll commented Jan 14, 2022

in motivation is that operations against files and directories are essentially the same in terms of the REST API request. However - taking C# as guideline - we wanted to separate Files and Directories conceptually. So the main motivation is to avoid code duplication since the request mainly just differ in the resource query parameter.

Interestingly, the .Net SDK seems to use a generated Path client under the covers. We could eventually look at using the generated Path client here in Rust as well, but I would leave that to another potential future PR.

@thovoll
Copy link
Contributor

thovoll commented Jan 16, 2022

Nit: Probably should use file_system everywhere instead of filesystem (e.g. examples/filesystem.rs). The REST API does call it Filesystem but the .Net and Java SDKs call it FileSystem.

@thovoll
Copy link
Contributor

thovoll commented Jan 16, 2022

@roeap this looks really good, thanks for all the work!

sdk/storage_datalake/src/clients/file_client.rs Outdated Show resolved Hide resolved
sdk/storage_datalake/src/clients/directory_client.rs Outdated Show resolved Hide resolved
sdk/storage_datalake/src/clients/file_client.rs Outdated Show resolved Hide resolved
sdk/storage_datalake/src/clients/directory_client.rs Outdated Show resolved Hide resolved
@roeap
Copy link
Contributor Author

roeap commented Jan 17, 2022

Thanks for the feedback @thovoll - i updated parameter names and made the APIs a little bit more generic.

@roeap
Copy link
Contributor Author

roeap commented Jan 17, 2022

@rylev - we seem to see intermittent test failures and I am trying to figure out why. Essentially most of the time test with REPLAY work, until they don't :D. Trying to figure out the root cause, and if it is even related to the mock framework.

Do you have any suggestions where to look here?

@thovoll
Copy link
Contributor

thovoll commented Jan 17, 2022

Resolves: #490

@thovoll
Copy link
Contributor

thovoll commented Jan 17, 2022

@rylev - we seem to see intermittent test failures and I am trying to figure out why. Essentially most of the time test with REPLAY work, until they don't :D. Trying to figure out the root cause, and if it is even related to the mock framework.

Do you have any suggestions where to look here?

What is the error?

@roeap
Copy link
Contributor Author

roeap commented Jan 17, 2022

you can look at the latest failed run, just before the latest successful one. the only change is re-recording transactions. also in the previos change between last success and new failure the changes "should" not have had any effect. It was just a bit of renaming and making parameter types a bit more permissive. String --> ´into´

@thovoll
Copy link
Contributor

thovoll commented Jan 21, 2022

@roeap I can reproduce the intermittent failure on my machine. I added some println!s and was able to see what's going on.

Transaction 3 (PATCH) sometimes fails because the semicolon-separated items within the value of the "x-ms-properties" header are ordered differently:

Failure case:

transaction.name = datalake_file_system
transaction.number = 3
actual_uri = '"/azurerustsdk-datalake-file-system?resource=filesystem"'
expected_uri = '"/azurerustsdk-datalake-file-system?resource=filesystem"'
request.method() = 'PATCH'
expected_request.method() = 'PATCH'
actual_headers.len() = 3
expected_headers.len() = 3
actual_headers = '[("x-ms-properties", "ModifiedBy=SW90YQ==,AddedVia=QXp1cmUgU0RLIGZvciBSdXN0"), ("content-length", "0"), ("x-ms-version", "2019-12-12")]'
expected_headers = '[("x-ms-properties", "AddedVia=QXp1cmUgU0RLIGZvciBSdXN0,ModifiedBy=SW90YQ=="), ("x-ms-version", "2019-12-12"), ("content-length", "0")]'

Success case:

transaction.name = datalake_file_system
transaction.number = 3
actual_uri = '"/azurerustsdk-datalake-file-system?resource=filesystem"'
expected_uri = '"/azurerustsdk-datalake-file-system?resource=filesystem"'
request.method() = 'PATCH'
expected_request.method() = 'PATCH'
actual_headers.len() = 3
expected_headers.len() = 3
actual_headers = '[("x-ms-properties", "AddedVia=QXp1cmUgU0RLIGZvciBSdXN0,ModifiedBy=SW90YQ=="), ("content-length", "0"), ("x-ms-version", "2019-12-12")]'
expected_headers = '[("content-length", "0"), ("x-ms-properties", "AddedVia=QXp1cmUgU0RLIGZvciBSdXN0,ModifiedBy=SW90YQ=="), ("x-ms-version", "2019-12-12")]'

When it does fail, it gets retried and on each retry another instance of the "x-ms-version" header is added, causing each retry to fail because the header count doesn't match:

----------------------------------------
transaction.name = datalake_file_system
transaction.number = 3
actual_uri = '"/azurerustsdk-datalake-file-system?resource=filesystem"'
expected_uri = '"/azurerustsdk-datalake-file-system?resource=filesystem"'
request.method() = 'PATCH'
expected_request.method() = 'PATCH'
actual_headers.len() = 4
expected_headers.len() = 3
actual_headers = '[("x-ms-properties", "ModifiedBy=SW90YQ==,AddedVia=QXp1cmUgU0RLIGZvciBSdXN0"), ("content-length", "0"), ("x-ms-version", "2019-12-12"), ("x-ms-version", "2019-12-12")]'
expected_headers = '[("x-ms-properties", "AddedVia=QXp1cmUgU0RLIGZvciBSdXN0,ModifiedBy=SW90YQ=="), ("x-ms-version", "2019-12-12"), ("content-length", "0")]'
----------------------------------------
transaction.name = datalake_file_system
transaction.number = 3
actual_uri = '"/azurerustsdk-datalake-file-system?resource=filesystem"'
expected_uri = '"/azurerustsdk-datalake-file-system?resource=filesystem"'
request.method() = 'PATCH'
expected_request.method() = 'PATCH'
actual_headers.len() = 5
expected_headers.len() = 3
actual_headers = '[("x-ms-properties", "ModifiedBy=SW90YQ==,AddedVia=QXp1cmUgU0RLIGZvciBSdXN0"), ("content-length", "0"), ("x-ms-version", "2019-12-12"), ("x-ms-version", "2019-12-12"), ("x-ms-version", "2019-12-12")]'
expected_headers = '[("content-length", "0"), ("x-ms-properties", "AddedVia=QXp1cmUgU0RLIGZvciBSdXN0,ModifiedBy=SW90YQ=="), ("x-ms-version", "2019-12-12")]'
----------------------------------------
transaction.name = datalake_file_system
transaction.number = 3
actual_uri = '"/azurerustsdk-datalake-file-system?resource=filesystem"'
expected_uri = '"/azurerustsdk-datalake-file-system?resource=filesystem"'
request.method() = 'PATCH'
expected_request.method() = 'PATCH'
actual_headers.len() = 6
expected_headers.len() = 3
actual_headers = '[("x-ms-properties", "ModifiedBy=SW90YQ==,AddedVia=QXp1cmUgU0RLIGZvciBSdXN0"), ("content-length", "0"), ("x-ms-version", "2019-12-12"), ("x-ms-version", "2019-12-12"), ("x-ms-version", "2019-12-12"), ("x-ms-version", "2019-12-12")]'
expected_headers = '[("x-ms-properties", "AddedVia=QXp1cmUgU0RLIGZvciBSdXN0,ModifiedBy=SW90YQ=="), ("x-ms-version", "2019-12-12"), ("content-length", "0")]'
----------------------------------------
transaction.name = datalake_file_system
transaction.number = 3
actual_uri = '"/azurerustsdk-datalake-file-system?resource=filesystem"'
expected_uri = '"/azurerustsdk-datalake-file-system?resource=filesystem"'
request.method() = 'PATCH'
expected_request.method() = 'PATCH'
actual_headers.len() = 7
expected_headers.len() = 3
actual_headers = '[("x-ms-properties", "ModifiedBy=SW90YQ==,AddedVia=QXp1cmUgU0RLIGZvciBSdXN0"), ("content-length", "0"), ("x-ms-version", "2019-12-12"), ("x-ms-version", "2019-12-12"), ("x-ms-version", "2019-12-12"), ("x-ms-version", "2019-12-12"), ("x-ms-version", "2019-12-12")]'
expected_headers = '[("x-ms-properties", "AddedVia=QXp1cmUgU0RLIGZvciBSdXN0,ModifiedBy=SW90YQ=="), ("content-length", "0"), ("x-ms-version", "2019-12-12")]'
Error: CoreError(Policy(MismatchedRequestHeadersCount(7, 3)))

@rylev
Copy link
Contributor

rylev commented Jan 21, 2022

Thanks for the investigation @thovoll. It looks like the root cause is that Properties are stored in a HashMap which does not maintain consistent order. We can instead use a BTreeMap which keeps its keys in sorted order. This would fix the issue.

@roeap
Copy link
Contributor Author

roeap commented Jan 21, 2022

Thanks @thovoll @rylev for figuring this out ... I updated the implementation of properties accordingly. At leat locally I could not generate the error again after several trials... so looking good so far :)

@thovoll thovoll merged commit 1512a35 into Azure:main Jan 21, 2022
@@ -29,7 +29,7 @@ impl Properties {
self.0.insert(k.into(), v.into())
}

pub fn hash_map(&self) -> &HashMap<Cow<'static, str>, Cow<'static, str>> {
pub fn hash_map(&self) -> &BTreeMap<Cow<'static, str>, Cow<'static, str>> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this method be renamed? Should this method even exist?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants