New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[datalake] Add DirectoryClient and FileClient #610
Conversation
@ctaggart @rylev @thovoll - I made some somewhat structural changes to the crate and am hoping to get some early feedback if that approach makes sense. There is still a bunch of work in the PR, but the relevant aspects can be seen in the files 'sdk/storage_datalake/src/operations/path_put.rs', Any guidance if this is a desired direction is greatly appreciated. |
One thing that came to my mind is the question how the client should behave for rename operations. Specifically the client can no longer be used to operate on the file that has been renamed, but a new client needs to be created. We do create a client for the target location inside the rename call. SO my question is should a file or directory client update its internal path to still track the renamed location. In that case it could be re-used after renaming? To be honest I am not even sure what I would expect as a user - on one hand i specified a path and changing that could be surprising, on the other hand if I think of this as a physical file I would feel it still is the same file, just with a different name. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good - perhaps you could talk more about the motivation behind PathClient
? I'm not sure I fully understand why that approach is preferable.
As for the question of file renaming. Without thinking too deeply about it, my first impression would be that the client would update its internal path since logically the file in question is the same just with a different name.
The main motivation is that operations against files and directories are essentially the same in terms of the REST API request. However - taking C# as guideline - we wanted to separate Files and Directories conceptually. So the main motivation is to avoid code duplication since the request mainly just differ in the resource query parameter. |
Another question that came up is how to present the results for delete operations. These operations return a continuation token since there seems to be a max number of items to delete in a single request - as such should the delete operation return a stream much like the list ops? From the tech side that seems clear, but I am not sure from the user perspective. On the file system no continuation is needed, so delete would look different on file systems and file / directory. We do return the token in the response, but right now we treat it like other "one-off" operations. Update: The same actually applies to rename operations - specifically on directories. Given that, my impression is that "missing" continuation could potentially lead to severe disruptions in certain scenarios and the client should make that explicit. My proposal would then be to implement both Alternatively we could separate builders after all and have the operations on directories only return builders with |
The .Net and Java SDKs return a new client when renaming and don't update the internal state of the "old" client, which has the problem you mentioned (and more) but it's a pattern we will have to follow unless there's a Rust specific reason that we can't. Also, updating the internal state isn't that great either. |
This is an interesting and confusing case. With ADLS Gen2 storage accounts (meaning that hierarchical namespaces are enabled) file and directory deletes and renames are atomic and don't need continuations. However, the ADLS Gen2 REST API can also be used with a storage account that has hierarchical namespaces DISABLED (which means it's not an ADLS Gen2 storage account). This is where continuations come into the picture. The .Net SDK simply returns the continuation as a header and lets the SDK user deal with it if needed. I think we should do the same here and not return a stream, since using the ADLS Gen2 API with a non-ADLS Gen2 storage account is not the main use case. |
Interestingly, the .Net SDK seems to use a generated Path client under the covers. We could eventually look at using the generated Path client here in Rust as well, but I would leave that to another potential future PR. |
Nit: Probably should use file_system everywhere instead of filesystem (e.g. |
@roeap this looks really good, thanks for all the work! |
Thanks for the feedback @thovoll - i updated parameter names and made the APIs a little bit more generic. |
@rylev - we seem to see intermittent test failures and I am trying to figure out why. Essentially most of the time test with Do you have any suggestions where to look here? |
Resolves: #490 |
What is the error? |
you can look at the latest failed run, just before the latest successful one. the only change is re-recording transactions. also in the previos change between last success and new failure the changes "should" not have had any effect. It was just a bit of renaming and making parameter types a bit more permissive. |
@roeap I can reproduce the intermittent failure on my machine. I added some println!s and was able to see what's going on. Transaction 3 (PATCH) sometimes fails because the semicolon-separated items within the value of the "x-ms-properties" header are ordered differently: Failure case:
Success case:
When it does fail, it gets retried and on each retry another instance of the "x-ms-version" header is added, causing each retry to fail because the header count doesn't match:
|
Thanks for the investigation @thovoll. It looks like the root cause is that |
@@ -29,7 +29,7 @@ impl Properties { | |||
self.0.insert(k.into(), v.into()) | |||
} | |||
|
|||
pub fn hash_map(&self) -> &HashMap<Cow<'static, str>, Cow<'static, str>> { | |||
pub fn hash_map(&self) -> &BTreeMap<Cow<'static, str>, Cow<'static, str>> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this method be renamed? Should this method even exist?
In this PR I hope to finish migration of existing operations and add some more related ones.
There are some slight changes to current patterns in the hopes to address some of what I gathered from discussions in this repo as well as some thoughts collected while implementing prior changes.
From what I understand in many ways C# often serves as reference implementation for SDKs. As over there I introduced a
DirectoryClient
andFileClient
rather than having file operations live on theFileSystemClient
. Not sure if other crates do this as well, but since the file and directory operations all work against the same same REST Api route, I opted for having one operation that supports all options for the specific route and using the builder for multiple operations on multiple clients. Main motivation is to avoid the significant code duplication and reduce maintenance effort.