
Bug in UTF-8 decoding while reading only part of a (text) blob #34065

@Indy2222

Description

  • Package Name: azure-storage-blob
  • Package Version: 12.19.0
  • Operating System: Linux
  • Python Version: 3.10

Describe the bug

When reading only part of a text blob with UTF-8 decoding enabled, the SDK sometimes raises UnicodeDecodeError with the error message "'utf-8' codec can't decode byte 0xe2 in position ...: unexpected end of data". See the code example below.

downloader = (
    client
    .get_blob_client(some_txt_blob)
    .download_blob(timeout=30, encoding='utf-8-sig')
)
# The line below sometimes raises UnicodeDecodeError
downloader.read(10 * 1024 * 1024)

I think that this line

assumes that the accumulated data (variable data) always starts and ends on a UTF-8 character boundary. Depending on the blob content (e.g. presence of non-ASCII characters) and the value passed to the read() method, the assumption may be untrue.
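The underlying failure can be reproduced without the SDK. The following is a minimal sketch using only the Python standard library; the sample string and the cut position are made up for illustration. It truncates a UTF-8 byte string in the middle of a multi-byte character and then decodes it, which is what happens when a byte-limited read ends inside a character.

```python
# Standalone illustration, no Azure SDK involved.
# The sample text is hypothetical; "€" is encoded as three bytes: 0xE2 0x82 0xAC.
text = "price: €10"
data = text.encode("utf-8")

# Cut the buffer right after the first byte of "€", as a byte-limited
# read might do when its limit falls inside a multi-byte character.
cut = data.index(b"\xe2") + 1
partial = data[:cut]

try:
    partial.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    error = exc

# Prints: 'utf-8' codec can't decode byte 0xe2 in position 7: unexpected end of data
print(error)
```

Whether the error appears therefore depends only on where the byte limit happens to fall relative to the non-ASCII characters in the blob.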

Expected behavior

Ideally, when text decoding is enabled, the API should accept the number of characters to read instead of the number of bytes. Alternatively, it could fetch a few extra bytes when needed to complete the last character. At a minimum, combining a size-limited read with text decoding should not be allowed: the API should either raise an explicit, documented exception or not offer the combination at all.
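To make the "few extra bytes" idea concrete, here is a rough sketch of how a buffer could be split at the last complete character, so that the incomplete tail is held back until more bytes arrive. The helper name `split_at_utf8_boundary` is hypothetical and is not part of the SDK; the sketch assumes the input is otherwise valid UTF-8.

```python
from typing import Tuple


def split_at_utf8_boundary(data: bytes) -> Tuple[bytes, bytes]:
    """Split data into (complete characters, incomplete trailing bytes).

    Hypothetical helper, assuming `data` is valid UTF-8 apart from a
    possibly truncated final character.
    """
    # A UTF-8 character is at most 4 bytes long, so only the last
    # 4 bytes need to be inspected.
    for back in range(1, min(4, len(data)) + 1):
        byte = data[-back]
        if byte & 0b1100_0000 == 0b1000_0000:
            continue  # continuation byte: keep looking for the lead byte
        # Found an ASCII byte or a lead byte; work out the character length.
        if byte & 0b1000_0000 == 0:
            needed = 1
        elif byte & 0b1110_0000 == 0b1100_0000:
            needed = 2
        elif byte & 0b1111_0000 == 0b1110_0000:
            needed = 3
        else:
            needed = 4
        if needed > back:
            # The last character is truncated: hold its bytes back.
            return data[:-back], data[-back:]
        return data, b""
    return data, b""


complete, pending = split_at_utf8_boundary("a€".encode("utf-8")[:-1])
print(complete.decode("utf-8"), pending)
```

The held-back bytes would be prepended to the next chunk before decoding, so no character is ever decoded from a partial byte sequence.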

Additional context

We need to gradually download and parse a large CSV file/blob. Loading the whole file into memory and parsing it all at once is not feasible.
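A possible workaround for this use case, sketched under assumptions: download the blob as raw bytes (no `encoding` argument) and decode incrementally on the client side with the standard library's incremental decoder, which buffers an incomplete trailing character until the next chunk arrives. The function name `decode_chunks` is hypothetical. The chunk source is left abstract here; in practice it could be an iterator of byte chunks from the downloader, but that wiring is an assumption and is not shown.

```python
import codecs
from typing import Iterable, Iterator


def decode_chunks(chunks: Iterable[bytes], encoding: str = "utf-8-sig") -> Iterator[str]:
    """Yield decoded text for each byte chunk (hypothetical helper).

    Bytes of a character split across two chunks are kept inside the
    decoder and emitted once the character is complete.
    """
    decoder = codecs.getincrementaldecoder(encoding)()
    for chunk in chunks:
        text = decoder.decode(chunk)
        if text:
            yield text
    # final=True raises UnicodeDecodeError if the stream ends mid-character.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail


# Simulate a download that splits the BOM and "€" across chunk boundaries.
raw = "a€b".encode("utf-8-sig")
two_byte_chunks = [raw[i:i + 2] for i in range(0, len(raw), 2)]
decoded = "".join(decode_chunks(two_byte_chunks))
print(decoded)
```

The decoded pieces can then be fed to a CSV parser piece by piece, although splitting the text into lines across piece boundaries still has to be handled separately.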

Metadata


Labels

  • Client: This issue points to a problem in the data-plane of the library.
  • Service Attention: Workflow: This issue is responsible by Azure service team.
  • Storage: Storage Service (Queues, Blobs, Files)
  • bug: This issue requires a change to an existing behavior in the product in order to be resolved.
  • customer-reported: Issues that are reported by GitHub users external to the Azure organization.
  • needs-team-attention: Workflow: This issue needs attention from Azure service team or SDK team
