Bug in UTF-8 decoding while reading only part of a (text) blob

- **Package Name**: azure-storage-blob
- **Package Version**: 12.19.0
- **Operating System**: Linux
- **Python Version**: 3.10

**Describe the bug**

When downloading only part of a UTF-8 encoded text blob with text decoding enabled, the SDK sometimes raises `UnicodeDecodeError` with the message `'utf-8' codec can't decode byte 0xe2 in position ...: unexpected end of data`. See the code example below.

```python
downloader = (
    client
    .get_blob_client(some_txt_blob)
    .download_blob(timeout=30, encoding='utf-8-sig')
)
# The line below sometimes raises UnicodeDecodeError
downloader.read(10 * 1024 * 1024)
```

I think that this line https://github.com/Azure/azure-sdk-for-python/blob/5ee1d7add8963c82a5a8d5bf80636e21dabda681/sdk/storage/azure-storage-blob/azure/storage/blob/_download.py#L643 assumes that the accumulated data (variable `data`) always starts and ends on a UTF-8 character boundary. Depending on the blob content (e.g. the presence of non-ASCII characters) and the size passed to `read()`, that assumption may not hold.
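To illustrate the suspected cause without involving the SDK, here is a minimal sketch in plain Python. The helper `decode_prefix` and the sample text are hypothetical; they only mimic decoding a byte range that stops in the middle of a multi-byte character.

```python
# Nine ASCII bytes followed by U+2019, which UTF-8 encodes as 0xE2 0x80 0x99.
text = "a" * 9 + "\u2019"
data = text.encode("utf-8")


def decode_prefix(raw: bytes, size: int) -> str:
    """Decode only the first `size` bytes, as a byte-limited read would."""
    return raw[:size].decode("utf-8")


try:
    # Byte 10 is 0xE2, the first byte of a three-byte sequence.
    decode_prefix(data, 10)
    error = None
except UnicodeDecodeError as exc:
    error = exc

print(error)
```

Cutting at 12 bytes instead, which is the full length of `data`, decodes cleanly, so whether the failure appears depends on where the byte limit falls relative to the blob content.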

**Expected behavior**

Ideally, when text decoding is enabled, the API should accept the number of characters to download instead of the number of bytes. Alternatively, it could fetch a few extra bytes when the requested range ends in the middle of a character. At a minimum, combining a partial read with text decoding should fail predictably: the API should either reject the combination or raise an explicit exception, and the behavior should be documented.

**Additional context**

We need to gradually download and parse a large CSV file/blob. Loading the whole file into memory and parsing it all at once is not feasible.
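One possible workaround on the caller's side, sketched here under assumptions, is to download raw bytes (no `encoding` argument) and decode them with an incremental decoder from the standard library, which buffers an incomplete trailing character until the next chunk arrives. The function `decode_chunks` is a hypothetical helper; it accepts any iterable of `bytes`. I assume the downloader's `chunks()` iterator could serve as such a source, but that is not verified here.

```python
import codecs
from typing import Iterable, Iterator


def decode_chunks(
    byte_chunks: Iterable[bytes], encoding: str = "utf-8-sig"
) -> Iterator[str]:
    """Yield decoded text for byte chunks that may split characters."""
    decoder = codecs.getincrementaldecoder(encoding)()
    for chunk in byte_chunks:
        # Bytes of an incomplete character are kept inside the decoder
        # and emitted once the remaining bytes arrive.
        text = decoder.decode(chunk)
        if text:
            yield text
    # Flush; raises UnicodeDecodeError if the data ends mid-character.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail


# Worst case: a BOM-prefixed payload delivered one byte at a time.
payload = "\ufeffn\u00e4me,\u20ac\n1,2\n".encode("utf-8")
one_byte_chunks = [payload[i:i + 1] for i in range(len(payload))]
decoded = "".join(decode_chunks(one_byte_chunks))
```

The decoded pieces can then be split into lines and passed to `csv.reader`, so the whole blob never has to be held in memory.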

Bug in UTF-8 decoding while reading only part of a (text) blob #34065