[BUG] Multiple createPathFile internal operations causing uploadFIle api to convert finite size file to zero bytes file in ADLS gen2 #40235

Open
fivetran-arunsuri opened this issue May 17, 2024 · 10 comments
Labels
Client: This issue points to a problem in the data-plane of the library.
customer-reported: Issues that are reported by GitHub users external to the Azure organization.
needs-team-attention: This issue needs attention from Azure service team or SDK team.
question: The issue doesn't require a change to the product in order to be resolved. Most issues start as that.
Storage: Storage Service (Queues, Blobs, Files)

Comments

@fivetran-arunsuri

fivetran-arunsuri commented May 17, 2024

Describe the bug
The following code uploads files to an Azure Data Lake Storage container (the retrier retries the upload in case of failure). Internally it involves the CreatePathFile, AppendFile, and FlushFile REST API operations. We enabled diagnostic logs on the container, and in some cases we saw the operations performed in the following order:
[image: screenshot of the diagnostic log operations]
Multiple createPathFile operations are performed here for the same file. Ideally this should not happen: if createPathFile is triggered after the file already has data, it turns the file into a zero-byte file.

We also noticed that the LastModified time in the logs for the append file operation always shows Monday, 01-Jan-01 00:00:00 GMT, which looks incorrect to me. Can you please help with this?
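
The truncation concern can be illustrated with a small in-memory model. This is only a sketch, not the service's implementation: `PathStoreModel` and its methods are hypothetical names, and it assumes just what this report states, namely that a create with overwrite replaces an existing file with an empty one and that appended data becomes the file content once flushed.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory stand-in for a file system; not part of the Azure SDK.
public class PathStoreModel {
    private final Map<String, byte[]> committed = new HashMap<>();
    private final Map<String, ByteArrayOutputStream> pending = new HashMap<>();

    // Models a create with overwrite: any existing content is replaced by an empty file.
    public void createPath(String path) {
        committed.put(path, new byte[0]);
        pending.put(path, new ByteArrayOutputStream());
    }

    // Models an append: data is staged but not yet part of the file content.
    public void append(String path, byte[] data) {
        pending.get(path).write(data, 0, data.length);
    }

    // Models a flush: staged data becomes the file content.
    public void flush(String path) {
        committed.put(path, pending.get(path).toByteArray());
    }

    public int length(String path) {
        return committed.get(path).length;
    }

    public static void main(String[] args) {
        PathStoreModel store = new PathStoreModel();
        store.createPath("dir/data.csv");
        store.append("dir/data.csv", "a,b,c\n".getBytes(StandardCharsets.UTF_8));
        store.flush("dir/data.csv");
        System.out.println("after upload: " + store.length("dir/data.csv") + " bytes");

        // A late create on the same path, with nothing appended or flushed afterwards.
        store.createPath("dir/data.csv");
        System.out.println("after extra create: " + store.length("dir/data.csv") + " bytes");
    }
}
```

Running `main` prints `after upload: 6 bytes` followed by `after extra create: 0 bytes`, which is the zero-byte outcome described above.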

Exception or Stack Trace

To Reproduce

Code Snippet
DataLakeDirectoryClient directoryClient = fileSystemClient.getDirectoryClient(adlsFolderPath);
RETRIER.get(
    () -> {
        DataLakeFileClient fileClient = directoryClient.createFile(fileToUpload.getName(), true);
        fileClient.uploadFromFile(fileToUpload.getPath(), true);
        return fileClient.getFileName();
    },
    MAX_ATTEMPTS);

Expected behavior
It should upload the file with its actual, non-zero size.

Setup (please complete the following information):
"com.azure:azure-storage-file-datalake:12.18.1",
"com.azure:azure-storage-common:12.24.1",

github-actions bot added the Client, customer-reported, needs-team-attention, question, and Storage labels May 17, 2024

Thank you for your feedback. Tagging and routing to the team member best able to assist.

@alzimmermsft
Member

alzimmermsft commented May 17, 2024

Hi @fivetran-arunsuri, thanks for reporting this issue. Taking a quick look at the sample provided:

DataLakeDirectoryClient directoryClient = fileSystemClient.getDirectoryClient(adlsFolderPath);
RETRIER.get(() -> {
    DataLakeFileClient fileClient = directoryClient.createFile(fileToUpload.getName(), true);
    fileClient.uploadFromFile(fileToUpload.getPath(), true);
    return fileClient.getFileName();
}, MAX_ATTEMPTS);

Calling both

directoryClient.createFile(fileToUpload.getName(), true);
and
fileClient.uploadFromFile(fileToUpload.getPath(), true);

will each result in a CreateFilePath REST operation, because both DataLakeDirectoryClient.createFile and DataLakeFileClient.uploadFromFile create a DataLake file. Changing to DataLakeDirectoryClient.getFileClient will remove one of the first two CreateFilePath REST operations you're seeing.
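
To make the difference concrete, here is a rough model of the operations each call pattern would issue, using the operation names from the report. It is a sketch under the assumptions stated in this thread (createFile issues a create, uploadFromFile issues create, append, and flush, getFileClient issues nothing). `CreateCallTrace`, `DirectoryModel`, and `FileModel` are hypothetical stand-ins that only mimic the shape of the SDK calls, and a real upload may issue more than one append.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the REST operations per client call; not the Azure SDK.
public class CreateCallTrace {
    static class FileModel {
        private final List<String> ops;

        FileModel(List<String> ops) {
            this.ops = ops;
        }

        // Per the thread: upload creates the file, appends the data, then flushes.
        void uploadFromFile() {
            ops.add("CreatePathFile");
            ops.add("AppendFile");
            ops.add("FlushFile");
        }
    }

    static class DirectoryModel {
        final List<String> ops = new ArrayList<>();

        // Mirrors createFile: issues a create and returns a file handle.
        FileModel createFile() {
            ops.add("CreatePathFile");
            return new FileModel(ops);
        }

        // Mirrors getFileClient: returns a handle without issuing any operation.
        FileModel getFileClient() {
            return new FileModel(ops);
        }
    }

    public static List<String> traceWithCreateFile() {
        DirectoryModel dir = new DirectoryModel();
        dir.createFile().uploadFromFile();
        return dir.ops;
    }

    public static List<String> traceWithGetFileClient() {
        DirectoryModel dir = new DirectoryModel();
        dir.getFileClient().uploadFromFile();
        return dir.ops;
    }

    public static void main(String[] args) {
        System.out.println("createFile + uploadFromFile:    " + traceWithCreateFile());
        System.out.println("getFileClient + uploadFromFile: " + traceWithGetFileClient());
    }
}
```

In the reported snippet, the corresponding change would be to obtain the client with `directoryClient.getFileClient(fileToUpload.getName())` and keep the `uploadFromFile` call as it is.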

I'm still looking into why there is a CreateFilePath REST operation after uploading and why the LastModified value is Monday, 01-Jan-01 00:00:00 GMT. Is there any code after this which is making additional DataLake calls?

alzimmermsft added the needs-author-feedback label (More information is needed from author to address the issue.) and removed the needs-team-attention label May 20, 2024

Hi @fivetran-arunsuri. Thank you for opening this issue and giving us the opportunity to assist. To help our team better understand your issue and the details of your scenario please provide a response to the question asked above or the information requested above. This will help us more accurately address your issue.

@fivetran-arunsuri
Author

fivetran-arunsuri commented May 20, 2024

Hey @alzimmermsft, I understand the fix you suggested to remove one of the two createPathFile operations, but I would like to know why there is a CreateFilePath REST operation after uploading and why the LastModified value is Monday, 01-Jan-01 00:00:00 GMT. No, we are not making any upload-related calls after this.

github-actions bot added the needs-team-attention label and removed the needs-author-feedback label May 20, 2024
@fivetran-arunsuri
Author

@alzimmermsft, any update on this?

@alzimmermsft
Member

alzimmermsft commented May 23, 2024

> Hey @alzimmermsft, I understand the fix you suggested to remove one of the two createPathFile operations, but I would like to know why there is a CreateFilePath REST operation after uploading and why the LastModified value is Monday, 01-Jan-01 00:00:00 GMT. No, we are not making any upload-related calls after this.

I still haven't found a root cause for this: the uploadFromFile code flow calls create first, then runs all the append operations needed to upload the file, and concludes with a flush to finalize the uploaded data.
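
As a rough illustration of that ordering, the sketch below lists the operations for a file of a given size split into fixed-size blocks. `UploadPlan`, `plan`, and the block size are hypothetical and chosen only for illustration. The SDK may pick block sizes differently and issue appends in parallel, so this shows only the expected shape: one create, then appends at increasing positions, then one flush at the total length.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the create -> append(s) -> flush ordering; not SDK code.
public class UploadPlan {
    public static List<String> plan(long fileSize, long blockSize) {
        if (fileSize < 0 || blockSize <= 0) {
            throw new IllegalArgumentException("fileSize must be >= 0 and blockSize must be > 0");
        }
        List<String> ops = new ArrayList<>();
        ops.add("create");
        // Each append carries the position in the file at which its data starts.
        for (long position = 0; position < fileSize; position += blockSize) {
            long length = Math.min(blockSize, fileSize - position);
            ops.add("append position=" + position + " length=" + length);
        }
        // The flush position is the total length of the data appended so far.
        ops.add("flush position=" + fileSize);
        return ops;
    }

    public static void main(String[] args) {
        for (String op : plan(10, 4)) {
            System.out.println(op);
        }
    }
}
```

For a 10-byte file and 4-byte blocks, this prints `create`, three appends at positions 0, 4, and 8, and `flush position=10`. Under this flow, a create appearing after the flush would have to come from another call, for example a second attempt.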

Does this happen every time, or intermittently during the application run?

@fivetran-arunsuri
Author

It does not happen every time. It is intermittent, but it does keep recurring.

@alzimmermsft
Member

If possible, could you produce HTTP logs when the scenario is hit?

https://github.com/Azure/azure-sdk-for-java/tree/main/sdk/core/azure-core#http-request-and-response-logging

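import com.azure.core.http.policy.HttpLogDetailLevel;
import com.azure.storage.file.datalake.DataLakeDirectoryClient;
import com.azure.storage.file.datalake.DataLakeServiceClient;
import com.azure.storage.file.datalake.DataLakeServiceClientBuilder;

DataLakeServiceClient serviceClient = new DataLakeServiceClientBuilder()
    // add credential information here
    // endpoint here
    .httpLogOptions(DataLakeServiceClientBuilder.getDefaultHttpLogOptions()
        .setLogLevel(HttpLogDetailLevel.HEADERS))
    .buildClient();

DataLakeDirectoryClient directoryClient = serviceClient.getFileSystemClient(fileSystemName).getDirectoryClient(adlsFolderPath);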

// Retrier code here

This will help determine whether the empty file creation correlates with a retry, which would be a bug in this case.
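
One way to make that correlation easier is to have the retrier record each attempt with a timestamp that can be matched against the HTTP logs. The sketch below is a hypothetical retrier written for illustration. The RETRIER used in the report isn't shown in this thread, so `LoggingRetrier`, its `get` signature, and its log format are all assumptions.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical retrier that records every attempt; not the RETRIER from the report.
public class LoggingRetrier {
    private final List<String> attemptLog = new ArrayList<>();

    public <T> T get(Supplier<T> action, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                T result = action.get();
                attemptLog.add(Instant.now() + " attempt " + attempt + " succeeded");
                return result;
            } catch (RuntimeException e) {
                last = e;
                // The timestamp lets a failed attempt be matched against HTTP log entries.
                attemptLog.add(Instant.now() + " attempt " + attempt + " failed: " + e.getMessage());
            }
        }
        throw last;
    }

    public List<String> attemptLog() {
        return attemptLog;
    }

    public static void main(String[] args) {
        LoggingRetrier retrier = new LoggingRetrier();
        int[] calls = {0};
        // Simulated upload that fails on the first call and succeeds on the second.
        String name = retrier.get(() -> {
            if (calls[0]++ == 0) {
                throw new IllegalStateException("simulated upload failure");
            }
            return "data.csv";
        }, 3);
        System.out.println("uploaded " + name);
        retrier.attemptLog().forEach(System.out::println);
    }
}
```

If a second attempt shows up in the attempt log around the time of the extra create in the diagnostic logs, the zero-byte file would most likely come from a retry whose create succeeded but whose append or flush did not complete.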

I still can't reproduce the Last-Modified time issue: I've tried creating empty files, files created with a single append, and files created with multiple appends before flushing, and haven't hit this case yet. FYI @seanmcc-msft, in case you know anything about this edge case.

@fivetran-arunsuri
Author

@alzimmermsft, where will these logs be produced in this case? If we create the service client like this, will they be part of the client logs?
