New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add url signing as an alternative to forbid direct access #524
base: master
Are you sure you want to change the base?
Conversation
Definitely an interesting idea. It becomes somewhat tricky in having to explicitly consider which storage backend each dataset is on - maybe fsspec should have something general for this kind of thing. |
The other handy part (for S3) is that the signed urls act as the role of the signer, not the caller. |
@martindurant ok, what do you see as the path forward here? Do you want me to open a PR into fsspec? |
Yes, I think a "sign" method in AbstractFileSystem and/or a "sign" function with takes any URL/kwargs and instantiates the backend would be a good addition to fsspec. It would return NotImplemented in many cases (perhaps all cases except gcs and s3, which are of course in separate repos). I notice that another assumption in the code here, is that the URL is in arg "urlpath", which is common but probably not universal. Furthermore, kwargs (storage_options) are not passed to the signing client, and the process would combine poorly with (fsspec) caching. Food for thought. |
Hmm, I'm not certain if the caching is fixable without a bunch of work, since we'd need to destroy the cache or expire the url once the auth changes (or any other external event that we want as a trigger). |
@martindurant one thing I don't grok yet is: from my reading of the source the |
Right, in the previous case of allowing the client to open the data directly, it's enough to pass all of the arguments to the client. Since we do have a loose convention to use "urlpath" and "storage_options", we could instantiate the filesystem directly on the server, for signing purposes.
I'm not sure either. For the simple case of no expiration or time-based expiration, it's not too bad, but the signed URL won't be usable as a way to know if the target has been updated. Similarly, it might be complicated to know how to mitigate expired signed links (I guess this is always an error and we can just surface it). |
I've added sign to the base class and GCP/AWS providers. xref: fsspec/filesystem_spec#376 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple of thoughts
response['args'] = entry._entry._create_open_args(user_parameters)[1] | ||
response_args = response['args'] | ||
fs, *_ = get_fs_token_paths(response_args['urlpath'], | ||
storage_options=response_args.get('storage_options', {})) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The source storage options should be rewritten to whatever it takes to be able to read the HTTP URL, perhaps just {}
fs, *_ = get_fs_token_paths(response_args['urlpath'], | ||
storage_options=response_args.get('storage_options', {})) | ||
try: | ||
response_args['urlpath'] = fs.sign(response_args['urlpath'], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should use the output paths from above, since this might have expanded globs into multiple concrete files
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i.e., this can be a list of signed URLs
The idea here is that instead of having forbid direct access load the data on the server and send it over to the client, only for the client need to load it again we send over a signed URL.
These URLs have a limited access time, meaning that we don't leak permissions to the data in case of a credentialed server.
This PR isn't complete but I wanted to open this up so we could discuss how best to place the moving pieces into intake.