
[FEATURE REQUEST] DataLakeFileClient read from InputStream #19612

Closed
ppalaga opened this issue Mar 4, 2021 · 10 comments
Labels
Client (This issue points to a problem in the data-plane of the library.) · customer-reported (Issues that are reported by GitHub users external to the Azure organization.) · feature-request (This issue requires a new behavior in the product in order to be resolved.) · Storage (Storage Service: Queues, Blobs, Files)

Comments

@ppalaga

ppalaga commented Mar 4, 2021

My intuitive expectation is that (1) DataLakeFileClient.openQueryInputStream("SELECT * from BlobStorage") is equivalent to (2) DataLakeFileClient.read(OutputStream). However, the result of (1) always has a \n character appended at the end, which (2) does not.

Steps to reproduce:

git clone git@github.com:ppalaga/azure-sdk-issue-19612.git
cd azure-sdk-issue-19612
mvn test

Expected: both tests pass
Actual:

[ERROR] Failures: 
[ERROR]   ReproduceIssue19612Test.openQueryInputStream:77 
expected: [72, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100]
but was : [72, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100, 10]
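
For clarity, here is a minimal sketch of the comparison. The endpoint, SAS token, and file names are illustrative placeholders, and the file is assumed to contain the ASCII bytes of "Hello world":

```java
import com.azure.storage.file.datalake.DataLakeFileClient;
import com.azure.storage.file.datalake.DataLakePathClientBuilder;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;

public class Repro19612 {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details; substitute your own.
        DataLakeFileClient client = new DataLakePathClientBuilder()
            .endpoint("https://<account>.dfs.core.windows.net")
            .sasToken("<sas-token>")
            .fileSystemName("my-filesystem")
            .pathName("hello.txt")
            .buildFileClient();

        // (2) read(OutputStream): returns the file bytes verbatim.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        client.read(out);

        // (1) openQueryInputStream: the query result comes back with a
        // trailing '\n' record separator appended.
        byte[] queried;
        try (InputStream in = client.openQueryInputStream("SELECT * from BlobStorage")) {
            queried = in.readAllBytes();
        }

        System.out.println(out.toByteArray().length); // 11 ("Hello world")
        System.out.println(queried.length);           // 12 (extra '\n')
    }
}
```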

Perhaps the behavior is not an error; if so, I wonder where it is documented?

@ghost added the needs-triage, customer-reported, and question labels on Mar 4, 2021
ppalaga added a commit to ppalaga/azure-sdk-issue-19612 that referenced this issue Mar 4, 2021
@gapra-msft
Member

Hi @ppalaga, thank you for reporting this issue.
There are a few reasons why you are seeing an extra \n appended to the end.
Query should only be used on structured blobs/files (files in CSV or JSON format).
I believe the service defaults to setting the record separator to \n, as that is a common standard in JSON/CSV files. Unfortunately, it looks like the documentation around the defaults isn't very clear. We will work on improving some of that messaging.

@ppalaga
Author

ppalaga commented Mar 4, 2021

Thanks for the explanation, @gapra-msft. I agree this can be solved by improving the documentation.

While I see that there are other endpoints of the API that allow getting the file without the newline appended, DataLakeFileClient.openQueryInputStream() seems to be the only one returning an InputStream. An InputStream allows processing the data efficiently without needing to store it in memory or on disk. I wonder if there is a way to configure the request so that the record separator is omitted? I have not found any hint in the API docs.

@gapra-msft
Member

@ppalaga Ah, I see the problem now.

We do not currently support reading Data Lake files from an InputStream (and, correspondingly, writing to them via an OutputStream), but it is something that is on our radar, and I'd be happy to tag this as a feature request.

That being said, as a temporary workaround, we do currently support reading from an InputStream using our blobs library. You can instantiate a BlobClient that points to the Data Lake file (just replace the dfs in the endpoint with blob) and use the openInputStream() API to read the file; this will work as expected.
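
For illustration, a minimal sketch of that workaround. The endpoint, SAS token, and names are placeholders; note the blob (not dfs) endpoint, and that a Data Lake file system maps to a blob container:

```java
import com.azure.storage.blob.BlobClient;
import com.azure.storage.blob.BlobClientBuilder;
import com.azure.storage.blob.specialized.BlobInputStream;

// Point a BlobClient at the same file through the blob endpoint.
BlobClient blobClient = new BlobClientBuilder()
    .endpoint("https://<account>.blob.core.windows.net")
    .sasToken("<sas-token>")
    .containerName("my-filesystem")  // the Data Lake file system
    .blobName("hello.txt")           // the Data Lake file path
    .buildClient();

try (BlobInputStream in = blobClient.openInputStream()) {
    byte[] data = in.readAllBytes(); // exact file contents, no trailing '\n'
}
```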

@ppalaga
Author

ppalaga commented Mar 4, 2021

We do not currently support reading Data Lake files from an InputStream (and, correspondingly, writing to them via an OutputStream), but it is something that is on our radar, and I'd be happy to tag this as a feature request.

That would be nice, thanks!

That being said, as a temporary workaround, we do currently support reading from an InputStream using our blobs library. You can instantiate a BlobClient that points to the Data Lake file (just replace the dfs in the endpoint with blob) and use the openInputStream() API to read the file; this will work as expected.

What a trick! Thanks!

@gapra-msft gapra-msft changed the title [BUG] Superfluous newline in the result of DataLakeFileClient.openQueryInputStream("SELECT * from BlobStorage") [FEATURE REQUEST] DataLakeFileClient read from InputStream Mar 4, 2021
@gapra-msft
Member

Dev notes for the feature request:
The InputStream will just wrap the existing BlobInputStream implementation.

The OutputStream will call into the DataLakeFileClient buffered upload, using a similar pattern to BlockBlobOutputStream.
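
As a purely hypothetical illustration of the first note (this is not the SDK's actual implementation; the class and method are invented for the sketch, and credential wiring is omitted):

```java
import java.io.InputStream;
import com.azure.storage.blob.BlobClient;
import com.azure.storage.blob.BlobClientBuilder;
import com.azure.storage.file.datalake.DataLakeFileClient;

// Hypothetical sketch: a Data Lake file is also addressable through the
// blob endpoint, so an InputStream can delegate to the existing
// BlobInputStream implementation by rewriting dfs -> blob in the URL.
public final class DataLakeInputStreams {
    public static InputStream openInputStream(DataLakeFileClient fileClient) {
        BlobClient blobClient = new BlobClientBuilder()
            .endpoint(fileClient.getFileUrl().replaceFirst("\\.dfs\\.", ".blob."))
            .buildClient();
        return blobClient.openInputStream();
    }
}
```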

@joshfree added the Client, feature-request, and Storage labels on Mar 4, 2021
@ghost removed the needs-triage label on Mar 4, 2021
@joshfree added the needs-triage label and removed the question label on Mar 4, 2021
@joshfree removed the needs-triage label on Mar 4, 2021
@omarsmak
Contributor

omarsmak commented Mar 5, 2021

Hey @gapra-msft,
Related to this issue, I was experimenting with the openQueryInputStreamWithResponse API to override the record separator through the input/output serialization, as mentioned here. Even when explicitly specifying \0 for the record separator, Azure ignores it and still appends \n, though it works with any other character except \0. Is this intentional?
As I understood from your comment above, Azure defaults to \n when no separator is specified; however, since I explicitly specified \0, I'd expect it to use that, unless I am missing something? (A sketch of what I tried is below.)
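
For reference, a minimal sketch of that attempt, assuming an existing DataLakeFileClient named fileClient; the query and serialization values mirror the description above:

```java
import com.azure.storage.file.datalake.models.FileQueryDelimitedSerialization;
import com.azure.storage.file.datalake.options.FileQueryOptions;
import java.io.InputStream;

// Override the record separator via the serialization settings.
// '\0' is the value the service ignores; any other character works.
FileQueryDelimitedSerialization serialization = new FileQueryDelimitedSerialization()
        .setRecordSeparator('\0')
        .setColumnSeparator(',')
        .setEscapeChar('\0')
        .setFieldQuote('\0')
        .setHeadersPresent(false);

FileQueryOptions options = new FileQueryOptions("SELECT * from BlobStorage")
        .setInputSerialization(serialization)
        .setOutputSerialization(serialization);

InputStream result = fileClient
        .openQueryInputStreamWithResponse(options)
        .getValue();
```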

@gapra-msft
Member

Hi @omarsmak, thank you for posting your question. I am following up with my team to figure out which parameters are invalid for the query method. I believe \0 is one such character for the record separator.

@omarsmak
Contributor

Hi @omarsmak, thank you for posting your question. I am following up with my team to figure out which parameters are invalid for the query method. I believe \0 is one such character for the record separator.

Thanks for the feedback. If \0 is indeed one of the characters that are ignored, may I know the reasoning behind this decision? In my understanding, if I explicitly set such a parameter, it means I am fully aware of the consequences, and I would expect the service to behave according to the parameters I set.

@gapra-msft
Member

gapra-msft commented Mar 31, 2021

Hi @omarsmak
I followed up with my team and it looks like any combination of characters should be valid as long as they are unique.

It looks like there are two issues at play here.

  1. The Java SDK should be sending the null character when you specify it; I will create a bug to track this issue for the Java SDK: [BUG] Storage blob/datalake query APIs do not send null character when specified by customer #20294
  2. The service should accept the null character but currently throws. I have an internal thread going with my team to figure out why this is happening.

azure-sdk pushed a commit to azure-sdk/azure-sdk-for-java that referenced this issue Jun 28, 2022
Add "Pending" to LabProperties status (Azure#19612)

Add "Pending" to LabProperties status, an extensible enum
@alzimmermsft
Member

Closing as #21322 added openInputStream to DataLakeFileClient.
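
For anyone landing here later, usage of the new API looks roughly like this, assuming a recent azure-storage-file-datalake version and an existing DataLakeFileClient named fileClient:

```java
import com.azure.storage.file.datalake.models.DataLakeFileOpenInputStreamResult;
import java.io.InputStream;

// Stream the file directly from the Data Lake client; no query is
// involved, so no record separator is appended.
DataLakeFileOpenInputStreamResult result = fileClient.openInputStream();
try (InputStream in = result.getInputStream()) {
    byte[] data = in.readAllBytes();
}
```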

@github-actions github-actions bot locked and limited conversation to collaborators Apr 13, 2023