Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use GCS blob interface #729

Closed
wants to merge 7 commits into from
Closed

Conversation

cadnce
Copy link
Contributor

@cadnce cadnce commented Oct 1, 2022

Fixes #599 - Swap to using GCS native blob open under the hood.

This should reduce the amount of custom code to maintain, I have tried to keep the interfaces identical so there is no API breaking changes. Though this does mean there is still lots of code that can be trimmed down.

I think it might be worth re-thinking the test coverage and if the test suites like FakeAuthorizedSessionTest are still valid/useful.

What do you think? @petedannemann

Swap to using GCS native blob open under the hood.
Reduces code maintenence overhead.
@petedannemann
Copy link
Contributor

petedannemann commented Oct 1, 2022

First off, @cadnce this is great and thank you for your efforts.

Something I am wondering is if we need these Reader and Writer classes at all now. Do you think there is a solution that just uses the Blob.open function and nothing more? It seems like almost all of our Reader and Writer methods just defer to the the raw reader returned by Blob.open. If this change breaks the smart-open API in some small way like removing the infrequently used terminate method I think that is fine. We can just do a major version bump.

I agree most of these tests are probably useless now, especially if we are able to accomplish what I stated above.

@petedannemann
Copy link
Contributor

petedannemann commented Oct 1, 2022

Could we do something like this? We would need to do a bit of refactoring for handling kwargs but this should eliminate a lot more of our code

def open(
        bucket_id,
        blob_id,
        mode,
        client=None,  # type: google.cloud.storage.Client
        blob_properties=None,
        blob_open_kwargs=None,
        ):
    if client is None:
        client = google.cloud.storage.Client()

    bucket = client.bucket(bucket_id)

    if mode in (constants.READ_BINARY, 'r', 'rt'):
        blob = client.bucket.get_blob(blob_id)
        if blob is None:
            raise google.cloud.exceptions.NotFound('blob %s not found in %s' % (blob_id, bucket_id))
    elif mode in (constants.WRITE_BINARY, 'w', 'wt'):
        blob = client.bucket.blob(blob_id)
        if blob_properties:
            for k, v in blob_properties.items():
                setattr(blob, k, v)
    else:
        raise NotImplementedError('GCS support for mode %r not implemented' % mode)

    return blob.open(mode, **blob_open_kwargs)

@cadnce
Copy link
Contributor Author

cadnce commented Oct 1, 2022

Cool!

If you’re not opposed to the idea of losing compatibility I’ll swap it to something much simpler like you suggest.

Ironically that’s where I started out before I spent ages ensuring it all matched - I guess I should have just asked

@petedannemann
Copy link
Contributor

Considering those changes would reduce the maintenance of the GCS module drastically I think it more than warrants a breaking change / major version bump. We could just redirect most limitations to the underlying python-storage repo. I don't think any meaningful breaking changes will really occur regardless though since the blob.open is fairly similar to our implementation here

@cadnce
Copy link
Contributor Author

cadnce commented Oct 2, 2022

👍

I’ll fix this up/replace it next week.

@cadnce cadnce marked this pull request as draft October 2, 2022 00:30
Breaking changes:
* Removed gcs.Reader/gcs.Writer classes
* No Reader/Writer.terminate()
* The buffer size can no-longer be controlled independently of chunk_size
* calling close twice on a gcs file object will now throw an exception
@cadnce
Copy link
Contributor Author

cadnce commented Oct 10, 2022

Hmm, so this has caused me a little more thinking than I was expecting and would be keen to get input on a few things

  1. Testing!
    GCS doesnt have a moto esque library, hence you've written your own. Cool! But since we're now deferring lots to the google-storage library i'm not sure many of the tests actually test much except the Mock/Fakes. I'd be keen for your opinion, but In my head the main things to test are:
  • That the smart_open interface works with gcs ( The existing OpenTest)
  • Using a real reader/writer works (the existing integration tests I suppose)
  • That we set the correct properties on the blobs (TODO)
  • That we passthrough the correct open arguments (TODO)
  • Error handling for blob or bucket not existing ( test_nonexisting_bucket and test_read_nonexisting_key )
  1. The google-storage blob module handles the case close() being called twice differently to the way that the previous reader did. i.e. in google storage it raises an exception if you close it twice https://github.com/googleapis/python-storage/blob/main/google/cloud/storage/fileio.py#L426 wheras most other file type objects ignore this. I've raised a question on the upstream, but I'm not entirely familiar the underlying memory handling that is referred to in the failing test and if it is still relevant in python 3.x? - https://github.com/RaRe-Technologies/smart_open/blob/develop/smart_open/tests/test_gcs.py#L831-L832

  2. Should i be trying to handle the kwarg arguments that are just getting passed through to the blob open or blob properties? It looks like it would be possible to validate against the valid upload/download kwargs from the google library but that seems like i'm just adding bloat for the sake of it as thats the first thing the google library checks.

@cadnce
Copy link
Contributor Author

cadnce commented Oct 10, 2022

If you need to maintain the functionality in 2, we could do something super disgusting and monkeypatch the google BlobWriter object so that it allows close to be called multiple times. But that seems like a recipe for disaster in the future!

smart_open/tests/test_gcs.py Outdated Show resolved Hide resolved
@petedannemann
Copy link
Contributor

petedannemann commented Oct 10, 2022

Hmm, so this has caused me a little more thinking than I was expecting and would be keen to get input on a few things

  1. Testing!
    GCS doesnt have a moto esque library, hence you've written your own. Cool! But since we're now deferring lots to the google-storage library i'm not sure many of the tests actually test much except the Mock/Fakes. I'd be keen for your opinion, but In my head the main things to test are:
  • That the smart_open interface works with gcs ( The existing OpenTest)
  • Using a real reader/writer works (the existing integration tests I suppose)
  • That we set the correct properties on the blobs (TODO)
  • That we passthrough the correct open arguments (TODO)
  • Error handling for blob or bucket not existing ( test_nonexisting_bucket and test_read_nonexisting_key )
  1. The google-storage blob module handles the case close() being called twice differently to the way that the previous reader did. i.e. in google storage it raises an exception if you close it twice https://github.com/googleapis/python-storage/blob/main/google/cloud/storage/fileio.py#L426 wheras most other file type objects ignore this. I've raised a question on the upstream, but I'm not entirely familiar the underlying memory handling that is referred to in the failing test and if it is still relevant in python 3.x? - https://github.com/RaRe-Technologies/smart_open/blob/develop/smart_open/tests/test_gcs.py#L831-L832
  2. Should i be trying to handle the kwarg arguments that are just getting passed through to the blob open or blob properties? It looks like it would be possible to validate against the valid upload/download kwargs from the google library but that seems like i'm just adding bloat for the sake of it as thats the first thing the google library checks.
  1. That all sounds right to me. The majority of those unit tests no longer serve their purpose
  2. Oh, that is unfortunate about the double close. I'm not really sure if that requires a monkeypatch, I would probably be ok with the new behavior as long as we document these changes and possibly provide advice on how to migrate to the new version of smart-open. I'm pretty sure I copied that failing test from another test and I wouldn't worry about that. We can delete it.
  3. Let's defer to validity of those to the google library

@petedannemann
Copy link
Contributor

@mpenkov @piskvorky what is smart-open's stance on breaking API changes within the library?

The opinion I have been expressing so far is that they are ok if we perform a major version bump and the benefits are worth it. In this case, we defer almost all maintenance of the gcs integration to the google-cloud-storage library so the benefits seem worthwhile.

I think so far the breaking changes we have identified are:

  1. Removal of the terminate method, which is likely very infrequently used
  2. close can no longer be called more than once

@cadnce
Copy link
Contributor Author

cadnce commented Oct 10, 2022

On the breaking changes front also

  • the parameters to gcs.open have changed (no more control of buffer_size)
  • no-longer has a gcs.Reader or gcs.Writer class

If we’re reasonably happy with the concept of it being a breaking change, I’ll have a look at ensuring those test cases are covered off (and probably strip out a few that are no-longer testing the right things)

@cadnce
Copy link
Contributor Author

cadnce commented Oct 10, 2022

Btw thanks for your input @petedannemann always hard knowing the right way to contribute

smart_open/gcs.py Outdated Show resolved Hide resolved
@cadnce
Copy link
Contributor Author

cadnce commented Oct 13, 2022

Ok, so of the changes that were breakign originally in this PR:

  1. teminate() previously this method removed an incomplete upload. However now the google-python-storage library under the hood uses ResumableUploads the behaviour with these is differnt. "Only a completed resumable upload appears in your bucket and, if applicable, replaces an existing object with the same name". Though i've added a terminate() feature for backwards compatibility which is no-op.

  2. The double close - this was something I wasnt happy with so I raised a ticket on the google library and that will hopefully be included soon - BlobWriter close() behaviour googleapis/python-storage#884

  3. Removal of buffer_size (I added this back in and now warn about using it) 👍

  4. gcs.Reader gcs.Writer I added them back with the same interfaces they now just act as a passthrough 👍

Oh, and I updated the test coverage to cover the things I thought we were missing.

Which means its just a case of waiting for the next release of the google-cloud-storage library and the interface will remain identical.

@cadnce
Copy link
Contributor Author

cadnce commented Oct 13, 2022

Its definitely not scientific, but I ran the integration/performance tests

use-gcs-open

-------------------------------------------------------------------------------------- benchmark: 7 tests -------------------------------------------------------------------------------------
Name (time in s)                        Min               Max              Mean            StdDev            Median               IQR            Outliers     OPS            Rounds  Iterations
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_gcs_performance_small_reads     2.6681 (1.0)      4.1263 (1.0)      3.2295 (1.0)      0.7510 (7.96)     2.7039 (1.0)      1.3380 (9.16)          1;0  0.3096 (1.0)           5           1
test_gcs_readwrite_text              4.9159 (1.84)     5.8216 (1.41)     5.2937 (1.64)     0.3342 (3.54)     5.2788 (1.95)     0.3598 (2.46)          2;0  0.1889 (0.61)          5           1
test_gcs_readwrite_binary_gzip       5.0102 (1.88)     5.2816 (1.28)     5.1368 (1.59)     0.1090 (1.16)     5.0950 (1.88)     0.1656 (1.13)          2;0  0.1947 (0.63)          5           1
test_gcs_readwrite_binary            5.0349 (1.89)     5.4729 (1.33)     5.1708 (1.60)     0.1791 (1.90)     5.1131 (1.89)     0.2120 (1.45)          1;0  0.1934 (0.62)          5           1
test_gcs_performance_gz              5.0807 (1.90)     6.0795 (1.47)     5.4019 (1.67)     0.4065 (4.31)     5.3494 (1.98)     0.4864 (3.33)          1;0  0.1851 (0.60)          5           1
test_gcs_readwrite_text_gzip         5.0951 (1.91)     5.3273 (1.29)     5.2043 (1.61)     0.0943 (1.0)      5.2325 (1.94)     0.1460 (1.0)           2;0  0.1921 (0.62)          5           1
test_gcs_performance                 5.6067 (2.10)     6.4920 (1.57)     5.9715 (1.85)     0.3289 (3.49)     5.9022 (2.18)     0.3725 (2.55)          2;0  0.1675 (0.54)          5           1
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

develop

-------------------------------------------------------------------------------------- benchmark: 7 tests -------------------------------------------------------------------------------------
Name (time in s)                        Min               Max              Mean            StdDev            Median               IQR            Outliers     OPS            Rounds  Iterations
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_gcs_performance_small_reads     4.8196 (1.0)      4.9168 (1.0)      4.8616 (1.0)      0.0484 (1.0)      4.8348 (1.0)      0.0893 (1.0)           2;0  0.2057 (1.0)           5           1
test_gcs_readwrite_binary            4.8830 (1.01)     5.7310 (1.17)     5.1998 (1.07)     0.3588 (7.41)     5.0778 (1.05)     0.5689 (6.37)          1;0  0.1923 (0.93)          5           1
test_gcs_readwrite_text_gzip         4.9546 (1.03)     5.6261 (1.14)     5.1468 (1.06)     0.2742 (5.66)     5.0464 (1.04)     0.2550 (2.85)          1;1  0.1943 (0.94)          5           1
test_gcs_performance_gz              5.0127 (1.04)     5.2060 (1.06)     5.1066 (1.05)     0.0762 (1.57)     5.1100 (1.06)     0.1191 (1.33)          2;0  0.1958 (0.95)          5           1
test_gcs_readwrite_binary_gzip       5.3487 (1.11)     5.8019 (1.18)     5.5354 (1.14)     0.1659 (3.43)     5.5166 (1.14)     0.1612 (1.80)          2;0  0.1807 (0.88)          5           1
test_gcs_readwrite_text              5.4528 (1.13)     6.0964 (1.24)     5.8176 (1.20)     0.2517 (5.20)     5.7647 (1.19)     0.3452 (3.86)          2;0  0.1719 (0.84)          5           1
test_gcs_performance                 5.7934 (1.20)     7.7706 (1.58)     6.4189 (1.32)     0.7946 (16.41)    6.3065 (1.30)     0.8402 (9.41)          1;0  0.1558 (0.76)          5           1
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

The results pretty much being too similar to compare on my connection

@cadnce cadnce marked this pull request as ready for review October 13, 2022 05:47
@cadnce cadnce changed the title [WIP] Fixes #599 - use gcs blob interface. Fixes #599 - use gcs blob interface. Oct 13, 2022
@mpenkov
Copy link
Collaborator

mpenkov commented Oct 16, 2022

Do we need to author any documentation that helps people adjust to the breaking changes?

@cadnce
Copy link
Contributor Author

cadnce commented Oct 16, 2022

If we wait for the google-python-storage library version 2.6.0 googleapis/python-storage#838

I think there will be no breaking changes 😁

Though, I’ll add an update to the documentation about the new parameter passthrough.

the slightly unfortunate thing is it will break all the rest of the open PR’s which modify the gcs module, but they will all be easier to add in the future.

@ddelange
Copy link
Contributor

ddelange commented Oct 16, 2022

fyi another PR waiting for gcs 2.6.0: #727

would be awesome if that PR can be closed in favour of this one (just that acl cant be specified using setattr)

edit: yes this PR would cover it (source):

blob_open_kwargs={'predefined_acl': 'publicRead'}

if it's not too much to ask: could you copy-paste the corresponding test from my PR? its tiny :)

@cadnce
Copy link
Contributor Author

cadnce commented Oct 16, 2022

@ddelange Yup, it should be covered by the blob_open_kwargs (why I made this pr instead of adding all the small gcs functionality additions)

I’ll add that as a use case in the docs update to explain how to use it?

I was trying to trim functionality out of the fakes (and removed the test you updated) because I was deferring a lot of that to the google library. But could add that in if you felt it was adding a lot?

@ddelange
Copy link
Contributor

awesome, docs sounds good! 💥

yeah, all the fakes. still a mystery to me why there's no testing framework for gcs like moto for boto, or a local dev server for testing like azurite for azure blob...

@ddelange
Copy link
Contributor

nvm, spoke too soon https://github.com/fsouza/fake-gcs-server

@ddelange
Copy link
Contributor

here an example of switching to a dockerized azurite server for tests, was a quite satisfying refactor as you just point to fake server and run real assertions on the server's responses

ddelange added a commit to ddelange/pypicloud that referenced this pull request Oct 17, 2022
@ddelange
Copy link
Contributor

nvm, spoke too soon https://github.com/fsouza/fake-gcs-server

here an example of switching to a dockerized azurite server for tests, was a quite satisfying refactor as you just point to fake server and run real assertions on the server's responses

ah, that felt good again ♻️ ddelange/pypicloud#4

thanks a lot for the efforts here, looking very forward to smart_open 7.0.0! 🚀

@ddelange
Copy link
Contributor

is this PR ready for review?

@cadnce
Copy link
Contributor Author

cadnce commented Oct 31, 2022

Sorry about the ambiguity.

We’re waiting on the release of the google cloud storage python module. That said, the code for this will remain the same except for the dependency version will be bumped. So in theory you can review now and just assume the double close issue is not an issue?

Copy link
Contributor

@ddelange ddelange left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

awesome stuff!

try:
setattr(g_blob, k, v)
except AttributeError:
logger.warn(f'Unable to set property {k} on blob')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from my side there's no need for a try-except: if the user passes unsupported properties, it's OK to fail hard

if hasattr(_blob, 'terminate'):
raise RuntimeWarning(
'Unexpected incompatibility between dependency and google-cloud-storage dependency.'
'Things may not work as expected'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'may not' implies the user can try anyway (warning vs error)

could this error instead state what the user did wrong, and how to fix it?

is it related to minimum gcs lib versions? would it be an option to limit the gcs dependency in setup.py to only allow the versions that work with this new implementation? cc @mpenkov

except AttributeError:
logger.warn(f'Unable to set property {k} on blob')

_blob = g_blob.open('wb', **blob_open_kwargs)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

was mode 'w' supported before? if yes, maybe a release note? if no, plz ignore :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually, I think it's all good. from the PR diff: elif mode in (constants.WRITE_BINARY, 'w', 'wt')

feel free to resolve :)

@ddelange
Copy link
Contributor

ddelange commented Nov 8, 2022

gcs 2.6.0 was released 💥

@cadnce
Copy link
Contributor Author

cadnce commented Nov 8, 2022

Awesome! I’ll make the changes requested and bump the version and test this evening.

@piskvorky piskvorky changed the title Fixes #599 - use gcs blob interface. Use GCS blob interface Nov 19, 2022
@piskvorky
Copy link
Owner

@cadnce could we squeeze this into the next release, v6.3.0? We're planning to release soon (~in a week or so).

@piskvorky piskvorky added this to the 6.3.0 milestone Nov 19, 2022
@mpenkov
Copy link
Collaborator

mpenkov commented Dec 5, 2022

@cadnce Do you think you could finish this in the next couple of days? We're thinking of releasing 6.3.0 soon.

If not, then no rush, we can include this in the next release.

Please let me know.

@ddelange
Copy link
Contributor

ddelange commented Dec 5, 2022

Hi 👋 Looking at @cadnce's contribution graph, I think he hasn't been active since. I took the liberty to add a commit which I think addresses the outstanding comments, see #744.

@mpenkov
Copy link
Collaborator

mpenkov commented Dec 9, 2022

Closing in favor of #744 due to inactivity.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Refactor Google Cloud Storage to use blob.open
5 participants