Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add url signing as an alternative to forbid direct access #524

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

CJ-Wright
Copy link
Contributor

The idea here is that instead of having forbid direct access load the data on the server and send it over to the client, only for the client need to load it again we send over a signed URL.
These URLs have a limited access time, meaning that we don't leak permissions to the data in case of a credentialed server.

This PR isn't complete but I wanted to open this up so we could discuss how best to place the moving pieces into intake.

@CJ-Wright
Copy link
Contributor Author

ping @martindurant @mariusvniekerk

@martindurant
Copy link
Member

Definitely an interesting idea. It becomes somewhat tricky in having to explicitly consider which storage backend each dataset is on - maybe fsspec should have something general for this kind of thing.
Generally, the intake server has been somewhat neglected, glad to see some interest here.

@mariusvniekerk
Copy link

The other handy part (for S3) is that the signed urls act as the role of the signer, not the caller.

@CJ-Wright
Copy link
Contributor Author

@martindurant ok, what do you see as the path forward here? Do you want me to open a PR into fsspec?

@martindurant
Copy link
Member

Yes, I think a "sign" method in AbstractFileSystem and/or a "sign" function with takes any URL/kwargs and instantiates the backend would be a good addition to fsspec. It would return NotImplemented in many cases (perhaps all cases except gcs and s3, which are of course in separate repos).

I notice that another assumption in the code here, is that the URL is in arg "urlpath", which is common but probably not universal. Furthermore, kwargs (storage_options) are not passed to the signing client, and the process would combine poorly with (fsspec) caching. Food for thought.

@CJ-Wright
Copy link
Contributor Author

Hmm, I'm not certain if the caching is fixable without a bunch of work, since we'd need to destroy the cache or expire the url once the auth changes (or any other external event that we want as a trigger).

@CJ-Wright
Copy link
Contributor Author

@martindurant one thing I don't grok yet is: from my reading of the source the action == 'open' clause doesn't actually init the filesystem. Even _create_open_args doesn't know anything about the filesystem that I can tell. So even if I made the changes to fsspec, how do they get propagated up to the server?

@martindurant
Copy link
Member

martindurant commented Aug 14, 2020

Right, in the previous case of allowing the client to open the data directly, it's enough to pass all of the arguments to the client. Since we do have a loose convention to use "urlpath" and "storage_options", we could instantiate the filesystem directly on the server, for signing purposes.

I'm not certain if the caching is fixable without a bunch of work

I'm not sure either. For the simple case of no expiration or time-based expiration, it's not too bad, but the signed URL won't be usable as a way to know if the target has been updated. Similarly, it might be complicated to know how to mitigate expired signed links (I guess this is always an error and we can just surface it).

@CJ-Wright
Copy link
Contributor Author

CJ-Wright commented Aug 14, 2020

I've added sign to the base class and GCP/AWS providers.

xref: fsspec/filesystem_spec#376
fsspec/gcsfs#277
xref: fsspec/s3fs#345

Copy link
Member

@martindurant martindurant left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of thoughts

response['args'] = entry._entry._create_open_args(user_parameters)[1]
response_args = response['args']
fs, *_ = get_fs_token_paths(response_args['urlpath'],
storage_options=response_args.get('storage_options', {}))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The source storage options should be rewritten to whatever it takes to be able to read the HTTP URL, perhaps just {}

fs, *_ = get_fs_token_paths(response_args['urlpath'],
storage_options=response_args.get('storage_options', {}))
try:
response_args['urlpath'] = fs.sign(response_args['urlpath'],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should use the output paths from above, since this might have expanded globs into multiple concrete files

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i.e., this can be a list of signed URLs

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants