
Accessing Google storage by Spark from outside Google cloud #48

Closed
amarouni opened this issue Mar 29, 2017 · 10 comments

@amarouni

Hi guys,

We're trying to access Google Storage from within a Spark on YARN job (writing to gs://...) on a cluster that resides outside Google Cloud.

We have set up the correct service account and credentials but are still facing some issues:

The spark.hadoop.google.cloud.auth.service.account.keyfile property points to the credentials file on the Spark driver, but the Spark workers (running on different servers) still try to access that same file path, which doesn't exist there. We got it to work by placing the credentials file at the exact same location on both the driver and the workers, but this is not practical and was only meant as a temporary workaround.

Is there any delegation token mechanism by which the driver authenticates with Google Cloud and then sends the token to the workers, so that they don't need to have the same credential key at the exact same path?
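
For reference, one approach we are considering (just a rough, untested sketch on our side, not something we found in the docs) is to ship the keyfile with the job itself via spark-submit --files and point the connector at the copy that Spark places in each container's working directory:

spark-submit \
  --files /local/path/to/service-account.json \
  --conf spark.hadoop.google.cloud.auth.service.account.enable=true \
  --conf spark.hadoop.google.cloud.auth.service.account.json.keyfile=service-account.json \
  ...

We haven't verified yet whether the connector accepts a relative keyfile path resolved against the container working directory.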

We also tried uploading the credential file (p12 or json) to the workers and setting:
spark.executorEnv.GOOGLE_APPLICATION_CREDENTIALS
or
spark.executor.extraJavaOptions

to the file path (different from the driver's file path), but we're getting:

java.io.IOException: Error getting access token from metadata server at: http://metadata/computeMetadata/v1/instance/service-accounts/default/token
	at com.google.cloud.hadoop.util.CredentialFactory.getCredentialFromMetadataServiceAccount(CredentialFactory.java:87)
	at com.google.cloud.hadoop.util.CredentialConfiguration.getCredential(CredentialConfiguration.java:68)
	at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.configure(GoogleHadoopFileSystemBase.java:1319)
	at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.initialize(GoogleHadoopFileSystemBase.java:549)
	at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.initialize(GoogleHadoopFileSystemBase.java:512)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2696)
	at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2733)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2715)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:382)
Caused by: java.net.UnknownHostException: metadata
	at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:184)
	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
	at java.net.Socket.connect(Socket.java:589)
	at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
	at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
	at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
	at sun.net.www.http.HttpClient.<init>(HttpClient.java:211)
	at sun.net.www.http.HttpClient.New(HttpClient.java:308)
	at sun.net.www.http.HttpClient.New(HttpClient.java:326)
	at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1169)
	at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1105)
	at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:999)
	at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:933)
	at com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:93)
	at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:965)
	at com.google.api.client.googleapis.compute.ComputeCredential.executeRefreshToken(ComputeCredential.java:87)
	at com.google.api.client.auth.oauth2.Credential.refreshToken(Credential.java:489)
	at com.google.cloud.hadoop.util.CredentialFactory.getCredentialFromMetadataServiceAccount(CredentialFactory.java:85)
	... 14 more

Is there any documentation for this use case that we missed?

Thanks,

@dennishuo
Contributor

Unfortunately not; the GCS connector integrates strictly through the Hadoop FileSystem interfaces, which don't have any clear notion of masters/workers or any way to broadcast metadata to be used by all workers. Anything that could be implemented would end up fairly specific to a particular stack, e.g. relying on YARN, Spark, HDFS or ZooKeeper to do some kind of keyfile distribution.

Was it impractical because you need to specify different credentials per job or something like that?

If it's difficult to continuously sync keyfile directories across your workers, one approach that would make this easier is to use an NFS mount shared across all your nodes to hold the keyfiles.
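
For example, if the keyfile lives on a mount that's visible at the same path on every node, the connector configuration on each node could simply be (property names as used elsewhere in this thread; the mount path below is just a placeholder):

google.cloud.auth.service.account.enable=true
google.cloud.auth.service.account.json.keyfile=/mnt/shared/keys/service-account.json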

@amarouni
Author

amarouni commented May 9, 2017

@dennishuo It was really impractical when submitting Spark jobs from a Windows client to a Linux cluster: the Spark driver was running on Windows while the Spark cluster was hosted on Linux machines, so it was impossible to use the same credentials path on both Windows and Linux.

@edwardcapriolo

through Hadoop FileSystem interfaces, which doesn't have any clear notion of masters/workers or any way to broadcast metadata to be used by all workers.

Hadoop offers a distributed cache https://hadoop.apache.org/docs/r2.6.3/api/org/apache/hadoop/filecache/DistributedCache.html
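
Roughly, the keyfile could be shipped to every task like this (just a sketch, not tested against the GCS connector; it uses the newer Job.addCacheFile API that replaced the deprecated DistributedCache class, and the paths are placeholders):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SubmitWithGcsKeyfile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Point the connector at the symlink the cache creates in each
    // task's working directory (the '#gcs-key.json' fragment below).
    conf.set("google.cloud.auth.service.account.enable", "true");
    conf.set("google.cloud.auth.service.account.json.keyfile", "gcs-key.json");

    Job job = Job.getInstance(conf, "job-with-gcs-keyfile");
    // Ship the keyfile from HDFS to every task via the distributed cache.
    job.addCacheFile(new URI("hdfs:///keys/gcs-key.json#gcs-key.json"));
    // ... set mapper, reducer, input and output paths as usual ...
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

One caveat: the client submitting the job would still need its own local keyfile for any gs:// paths it resolves at submit time.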

@chemikadze
Contributor

@dennishuo We have a very similar problem and are solving it in the suggested manner -- making the keyfiles available as local files on the workers. However, this approach has a few problems as of now:

  1. making it work for all types of jobs is quite fragile, as hadoop-core, beeline, pig and spark require different approaches to inject those properties
  2. it introduces another level of indirection in a multi-user environment, where the key file used does not correspond to the principal
  3. keys may be stolen by unauthorised principals if the DefaultContainerExecutor is used

While 3) seems to be resolvable by using the LinuxContainerExecutor, 1) and 2) do not seem to have an easy solution at this time. I'm thinking about a mechanism that would provide different sets of auth properties depending on the principal name. Something like this:

fs.gs.principals.names=myuser,otheruser
fs.gs.principals.props.myuser.google.cloud.auth.service.account.enable=true
fs.gs.principals.props.myuser.google.cloud.auth.service.account.json.keyfile=/var/run/keys/myuser.json
fs.gs.principals.props.otheruser.google.cloud.auth.service.account.enable=true
fs.gs.principals.props.otheruser.fs.gs.project.id=other-google-project-name
fs.gs.principals.props.otheruser.google.cloud.auth.service.account.json.keyfile=/var/run/keys/otheruser.json

Do you believe this type of functionality could become part of the upstream driver?

@krishnabigdata

@dennishuo We have a very similar problem: I wanted to set up a Dataproc cluster for multiple users. The Compute Engine VMs use default or custom service account credentials to connect to the storage buckets, and those credentials have no relation to the user principals who submit the jobs (at least I couldn't find an option to control this). This makes the Dataproc cluster insecure and creates the problem mentioned by @chemikadze: it introduces another level of indirection in a multi-user environment, where the key file used does not correspond to the principal.

Is there any workaround or solution available?

@chemikadze
Contributor

@krishnabigdata In my case, we've solved that indirection by implementing a wrapper around the GCS Hadoop driver which maps users to keys according to a configured mapping: users are mapped to groups, and each group is mapped to a particular "group" service account.
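
To give an idea of the shape of that wrapper (only a rough illustration, not our actual code; the class name and the fs.gs.wrapper.* mapping property are made up), it resolves the current Hadoop user, looks up the keyfile configured for that user, injects the standard connector properties, and then delegates to the stock GCS filesystem:

import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;
import com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem;

// Registered via fs.gs.impl in core-site.xml in place of the stock class.
public class PerUserGoogleHadoopFileSystem extends GoogleHadoopFileSystem {
  @Override
  public void initialize(URI path, Configuration conf) throws IOException {
    String user = UserGroupInformation.getCurrentUser().getShortUserName();
    // Hypothetical mapping property, e.g.
    //   fs.gs.wrapper.keyfile.<user>=/var/run/keys/<group>.json
    String keyfile = conf.get("fs.gs.wrapper.keyfile." + user);
    if (keyfile != null) {
      conf.set("google.cloud.auth.service.account.enable", "true");
      conf.set("google.cloud.auth.service.account.json.keyfile", keyfile);
    }
    super.initialize(path, conf);
  }
}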

@krishnabigdata

@chemikadze Thanks for your reply. In my case we are submitting the job using gcloud dataproc jobs submit hadoop, because my thought was to control access to the Dataproc cluster using IAM roles. However, during job submission the user principals are not forwarded to the Hadoop cluster, and gcloud doesn't perform any access validation on the storage buckets on the client side; the job is always executed as the root user. May I know how you map users to their service accounts? Do you have a solution for this case?

@krishnabigdata

krishnabigdata commented Oct 30, 2018

@dennishuo @chemikadze @medb

All we need is for a Hadoop MapReduce job submitted by a user via gcloud dataproc jobs submit hadoop to be able to use only the storage buckets or folders that the user has access to.

Current:
gcloud dataproc jobs (IAM - user principal) -> Dataproc Cluster (IAM - user principal) -> (SA Default/custom) -> Storage Bucket

Currently, any user who can submit jobs to the Dataproc cluster can use any storage bucket that the service account has access to.

Required:
gcloud dataproc jobs (IAM - user principal) -> Dataproc Cluster (IAM - user principal) -> (IAM - user principal) -> Storage Bucket

Instead, a user who can submit jobs to the Dataproc cluster should only be able to use the storage buckets that their own user account has access to.

So far I couldn't find a way to do this. Can you please help me with it?

Is there any workaround or solution available to this problem?

@medb
Contributor

medb commented Aug 21, 2019

@krishnabigdata you can use the GCP Token Broker in conjunction with Kerberos to secure a Dataproc cluster for the multi-user use case, with per-user GCS authentication.

@medb medb closed this as completed Aug 21, 2019
@Sudip-Pandit

Hi @medb, do you have any ideas on connecting PySpark running on-prem to a GCP bucket, so that I can pull the bucket data on-prem?
