
[azure-storage-blob] Blob Upload never finishes if a generator is passed #17418

Closed
mth-cbc opened this issue Mar 18, 2021 · 3 comments · Fixed by #17551
Assignees
Labels
bug This issue requires a change to an existing behavior in the product in order to be resolved. Client This issue points to a problem in the data-plane of the library. customer-reported Issues that are reported by GitHub users external to the Azure organization. Service Attention This issue is responsible by Azure service team. Storage Storage Service (Queues, Blobs, Files)

Comments


mth-cbc commented Mar 18, 2021

  • Package Name: azure-storage-blob
  • Package Version: 12.6.0
  • Operating System: Ubuntu 18.04 in WSL on Windows 10 64-Bit
  • Python Version: 3.8.0

Describe the bug
I want to pass a generator of the form:

def data_generator():
    # do something
    yield data

to the function upload_blob() to avoid the need to hold all of my (potentially huge) data in memory.

The upload, however, never finishes even after the generator is exhausted.

To Reproduce
Steps to reproduce the behavior:

  1. Get a BlobClient
  2. Implement a generator which in total produces more data than your chunk size
  3. Call blob_client.upload_blob(data=<generator>)
  4. Observe that the operation never finishes

Here is a minimal script with which I could trigger the behaviour every time:

from azure.storage.blob import BlobClient


# Size of raw_string combined with "range" in `test_generator` needs
# to be bigger than the chunk size (otherwise everything is put in one
# go, instead of a "stream" case)
raw_string = "a" * 1024 * 1024 * 3


def test_generator():
    for it in range(0, 2):
        print(it)
        yield raw_string
    print("Exhausted generator")


test: BlobClient = BlobClient.from_connection_string(
    conn_str="<conn_str>",
    container_name="ttp",
    blob_name="generator_test.txt",
)

test.upload_blob(data=test_generator(), overwrite=True)
print("Finished")

The corresponding output is this:

0
1
Exhausted generator

Even after waiting several minutes the program does not finish.

Expected behavior
The upload should succeed normally and the blob should contain all the data produced by the generator.
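Until the leftover handling in the SDK is fixed, one possible workaround is to re-batch the generator into fixed-size byte chunks yourself before handing the data over (for example, staging each chunk manually with BlobClient.stage_block and committing with commit_block_list). A minimal sketch of such a re-batching helper, assuming the helper name and chunk size are illustrative and not part of the SDK:

```python
def rebatch(iterable, chunk_size):
    """Regroup an iterable of str/bytes pieces into bytes chunks of
    exactly chunk_size; the final chunk may be shorter."""
    buffer = b""
    for piece in iterable:
        if isinstance(piece, str):
            piece = piece.encode("utf-8")
        buffer += piece
        # Emit full chunks as soon as the buffer is large enough.
        while len(buffer) >= chunk_size:
            yield buffer[:chunk_size]
            buffer = buffer[chunk_size:]
    if buffer:
        yield buffer


# Example: 100 short pieces regrouped into 1024-byte chunks.
chunks = list(rebatch(["some mock data"] * 100, 1024))
```

Because the re-batching happens entirely on the caller's side, memory use stays bounded by one chunk, which preserves the streaming property the generator was meant to provide.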

Additional context
I wasn't completely sure whether the upload_blob function was supposed to work the way I expected, so I investigated a bit further and may have found the underlying problem: the IterStreamer class does not behave as I would expect. Here is a script that visualises my observations:

import six


# Copy of the IterStreamer Class from
# https://github.com/Azure/azure-sdk-for-python/blob/af066bb6990279e5868407918d9cd82ad159e7e7/sdk/storage/azure-storage-blob/azure/storage/blob/_shared/uploads.py#L503
class IterStreamer(object):
    """
    File-like streaming iterator.
    """

    def __init__(self, generator, encoding="UTF-8"):
        self.generator = generator
        self.iterator = iter(generator)
        self.leftover = b""
        self.encoding = encoding

    def __len__(self):
        return self.generator.__len__()

    def __iter__(self):
        return self.iterator

    def seekable(self):
        return False

    def __next__(self):
        return next(self.iterator)

    next = __next__  # Python 2 compatibility.

    def tell(self, *args, **kwargs):
        raise NotImplementedError("Data generator does not support tell.")

    def seek(self, *args, **kwargs):
        raise NotImplementedError("Data generator is unseekable.")

    def read(self, size):
        data = self.leftover
        # self.leftover = b""  # <-- My "bugfix"
        count = len(self.leftover)
        try:
            while count < size:
                chunk = self.__next__()
                if isinstance(chunk, six.text_type):
                    chunk = chunk.encode(self.encoding)
                data += chunk
                count += len(chunk)
        except StopIteration:
            pass

        if count > size:
            self.leftover = data[size:]

        return data[:size]


test_data = ["some mock data, content does not matter"] * 100
test_stream = IterStreamer(test_data)

read_size = 1024
loop_counter = 0
# Logic closely resembling the code in the function `get_chunk_streams`
# in the class `_ChunkUploader`:
# https://github.com/Azure/azure-sdk-for-python/blob/af066bb6990279e5868407918d9cd82ad159e7e7/sdk/storage/azure-storage-blob/azure/storage/blob/_shared/uploads.py#L159
while True:
    data = b""

    temp = test_stream.read(read_size)
    if not isinstance(temp, six.binary_type):
        raise TypeError("Blob data should be of type bytes.")
    data += temp or b""

    # Debug Print
    print(data)

    # We have read an empty string and so are at the end
    # of the buffer or we have read a full chunk.
    if temp == b"":  # or len(data) == self.chunk_size:
        break

    # Stop an unending loop
    loop_counter += 1
    if loop_counter > 30:
        print("Unending Loop")
        break

The output of the "unfixed" version:

b'some mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock '
b'data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, cont'
b'ent does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does n'
b'ot mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not matter'
b'ot matter'
b'ot matter'
b'ot matter'
b'ot matter'
b'ot matter'
b'ot matter'
b'ot matter'
b'ot matter'
b'ot matter'
b'ot matter'
b'ot matter'
b'ot matter'
b'ot matter'
b'ot matter'
b'ot matter'
b'ot matter'
b'ot matter'
b'ot matter'
b'ot matter'
b'ot matter'
b'ot matter'
b'ot matter'
b'ot matter'
b'ot matter'
b'ot matter'
b'ot matter'
b'ot matter'
Unending Loop

If you uncomment the "bugfix" line, the output changes to:

b'some mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock '
b'data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, cont'
b'ent does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does n'
b'ot mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not mattersome mock data, content does not matter'
b''
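The same read() logic with the leftover buffer cleared at the start of each call can also be exercised on its own. This standalone sketch (a re-implementation for illustration, not the SDK's actual code) shows that the stream now terminates with b'' once the iterator is exhausted:

```python
class FixedIterStreamer:
    """Minimal file-like wrapper over an iterator of str/bytes pieces,
    with the leftover buffer cleared on every read() call."""

    def __init__(self, generator, encoding="UTF-8"):
        self.iterator = iter(generator)
        self.leftover = b""
        self.encoding = encoding

    def read(self, size):
        data = self.leftover
        count = len(self.leftover)
        self.leftover = b""  # crucial: never re-serve the old leftover
        try:
            while count < size:
                chunk = next(self.iterator)
                if isinstance(chunk, str):
                    chunk = chunk.encode(self.encoding)
                data += chunk
                count += len(chunk)
        except StopIteration:
            pass
        if count > size:
            self.leftover = data[size:]
        return data[:size]


stream = FixedIterStreamer(["some mock data, content does not matter"] * 100)
reads = []
while True:
    temp = stream.read(1024)
    reads.append(temp)
    if temp == b"":  # empty read signals end of stream
        break
```

With the leftover cleared up front, the final short chunk is returned exactly once and the next read() yields b"", so the chunk-upload loop terminates.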
@ghost ghost added needs-triage This is a new issue that needs to be triaged to the appropriate team. customer-reported Issues that are reported by GitHub users external to the Azure organization. question The issue doesn't require a change to the product in order to be resolved. Most issues start as that labels Mar 18, 2021
@mth-cbc mth-cbc changed the title Blob Upload never finishes if a generator is passed [azure-storage-blob] Blob Upload never finishes if a generator is passed Mar 18, 2021
@catalinaperalta catalinaperalta added bug This issue requires a change to an existing behavior in the product in order to be resolved. Client This issue points to a problem in the data-plane of the library. Storage Storage Service (Queues, Blobs, Files) and removed question The issue doesn't require a change to the product in order to be resolved. Most issues start as that labels Mar 18, 2021
@ghost ghost removed the needs-triage This is a new issue that needs to be triaged to the appropriate team. label Mar 18, 2021
catalinaperalta (Member) commented:

Thanks for reaching out @mth-cbc! We'll investigate ASAP.
Adding @xiafu-msft who can comment more on this issue.

@catalinaperalta catalinaperalta added the Service Attention This issue is responsible by Azure service team. label Mar 18, 2021

ghost commented Mar 18, 2021

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @xgithubtriage.

@tasherif-msft tasherif-msft self-assigned this Mar 19, 2021
tasherif-msft (Contributor) commented:

Hi @mth-cbc, great catch! And great work on your workaround there :)
I will adjust the code slightly so that we account for all edge cases and put out a PR!

Sorry for the inconvenience.

openapi-sdkautomation bot pushed a commit to AzureSDKAutomation/azure-sdk-for-python that referenced this issue Feb 15, 2022
@github-actions github-actions bot locked and limited conversation to collaborators Apr 12, 2023