
Use access token for the Authorisation #146

Merged · 5 commits · Apr 10, 2020

Conversation

@mayurdb (Contributor) commented Mar 24, 2020

With this change, the user can pass an access token to Spark via the conf spark.gcs.user.accessToken, and it will be used to authorize access to BigQuery resources.
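
For reference, a minimal sketch of how the token could be supplied when building the session (the conf key is the one added by this PR; the app name and environment variable are illustrative):

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical setup: the token is obtained out of band and read from an
// environment variable here purely for illustration.
val spark = SparkSession.builder()
  .appName("bigquery-access-token-example")
  .config("spark.gcs.user.accessToken", sys.env("GCP_ACCESS_TOKEN"))
  .getOrCreate()
```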

This change does not yet support refreshing the access token on expiry, which would become an issue for long-running Spark applications.

I am currently thinking of this approach to solve it:

  1. The user passes the refresh token, client ID, and client secret to Spark along with the access token if she wants the access token to auto-refresh.
  2. When the BigQueryRelation gets created on the driver, generate the access token if it is not already available.
  3. Broadcast the generated access token so that each executor has access to it.
  4. Run a daemon thread on the driver that asynchronously refreshes the access token before expiry, so that each executor always has a valid access token. (I will have to research how to update the value of an already broadcast conf; see the sketch below.)

Let me know if this sounds workable or if there is a simpler approach.
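
A rough sketch of the refresh loop in step 4, assuming google-auth-library-java is on the classpath; startTokenRefresher and the publish callback are hypothetical names, and the re-broadcast mechanics are exactly the open question noted in point 4:

```scala
import java.util.concurrent.{Executors, TimeUnit}
import com.google.auth.oauth2.UserCredentials

// Hypothetical helper: refreshes the user's access token on a daemon thread
// and hands each fresh token to a publish callback (e.g. a re-broadcast).
def startTokenRefresher(clientId: String,
                        clientSecret: String,
                        refreshToken: String)
                       (publish: String => Unit): Unit = {
  val credentials = UserCredentials.newBuilder()
    .setClientId(clientId)
    .setClientSecret(clientSecret)
    .setRefreshToken(refreshToken)
    .build()

  val scheduler = Executors.newSingleThreadScheduledExecutor { (r: Runnable) =>
    val t = new Thread(r, "access-token-refresher")
    t.setDaemon(true) // do not keep the driver JVM alive
    t
  }

  // OAuth access tokens typically expire after an hour; refresh a bit early
  // so executors never see a stale token.
  scheduler.scheduleAtFixedRate(() => {
    val token = credentials.refreshAccessToken().getTokenValue
    publish(token)
  }, 0, 55, TimeUnit.MINUTES)
}
```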

@googlebot commented

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here with @googlebot I signed it! and we'll verify it.



@mayurdb (Contributor, Author) commented Mar 24, 2020 via email

@googlebot commented

CLAs look good, thanks!


@davidrabinowitz self-requested a review on March 26, 2020
@davidrabinowitz (Member) commented

/gcbrun

@davidrabinowitz (Member) commented on the diff in SparkBigQueryOptions:

```diff
@@ -77,6 +79,8 @@ object SparkBigQueryOptions {
  val DefaultFormat: FormatOptions = FormatOptions.parquet()
  private val PermittedIntermediateFormats = Set(FormatOptions.orc(), FormatOptions.parquet())

  val GcsAccessTokenConfig = "spark.gcs.user.accessToken"
```

Please change the name to gcpAccessToken

@mayurdb (Contributor, Author) commented Mar 28, 2020

@davidrabinowitz can you please also reply to the comment above?

@davidrabinowitz (Member) commented

I was hoping to get the initial AccessToken support ready, as we plan to release the next version of the connector next week (March 30/31).

In the long run, something along the lines you have outlined sounds like a good plan. Some of the work I've started for implementing DataSource v2 will probably make it easier. I suggest that we merge this PR as is (just changing the option name), and open another issue where we can continue the discussion.

@davidrabinowitz (Member) commented

/gcbrun

@apostaremczak commented

Hey, are you planning on merging this feature and including it in the next release? I'd like to use it as well :)

@mayurdb (Contributor, Author) commented Apr 10, 2020

@davidrabinowitz I have addressed the comments and tested the changes on my end:

```scala
// Set the access token for a user who has read permission on the dataset.
spark.sql("SET spark.gcs.user.accessToken=<access_token>").collect

// Verify the user has access.
val df = spark.read.format("bigquery").option("table", "mayur.customTable").load()
df.collect()
res6: Array[org.apache.spark.sql.Row] = Array([75,75,key75,dataPoint75], [76,76,key76,dataPoint76], [77,77,key77,dataPoint77], ...

// Set the access token for a user who does not have read permission on the dataset.
spark.sql("SET spark.gcs.user.accessToken=<no_access_user_access_token>").collect

// Verify that this user does not have access.
val df = spark.read.format("bigquery").option("table", "mayur.customTable").load()
df.collect()
com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryException: Access Denied: Table spark-cutomer-project-1:mayur.partitionedTable: User does not have bigquery.tables.get permission for table spark-cutomer-project-1:mayur.partitionedTable.

// Grant this user access via the GCP console and verify that the user can now read the dataset.
val df = spark.read.format("bigquery").option("table", "mayur.customTable").load()
df.collect()
res27: Array[org.apache.spark.sql.Row] = Array([75,75,key75,dataPoint75], [76,76,key76,dataPoint76], [77,77,key77,dataPoint77], [78,78,key78,dataPoint78],....
```

I have verified this for writes as well. Is anything else pending from my side?

@davidrabinowitz (Member) commented

Hi @mayurdb

Can you please do the following:

  • Rename the parameter from spark.gcs.user.accessToken to gcpAccessToken, to keep the naming consistent? Also, the token is used for all GCP authentication, not just GCS.
  • Rebase the PR onto master? Currently I cannot merge it due to conflicts.

I plan to release a new version next week, and I'd be happy to include this PR in it.
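
For illustration, the renamed option could then be passed per read, e.g. (a hedged sketch; the table name is a placeholder):

```scala
val df = spark.read.format("bigquery")
  .option("gcpAccessToken", "<access_token>") // the renamed parameter
  .option("table", "dataset.table")
  .load()
```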

@davidrabinowitz (Member) commented

/gcbrun

@mayurdb (Contributor, Author) commented Apr 10, 2020

@davidrabinowitz I have changed the conf name to spark.gcpAccessToken, which follows the Spark convention for naming confs. Let me know if you would like me to change it to just gcpAccessToken :)

I have also rebased the branch to master!

@davidrabinowitz (Member) commented

@mayurdb Yes, I'd appreciate it if you could change it to just gcpAccessToken, to keep consistency with the rest of the parameters. I know the Spark convention is to start with spark., but at the moment backwards compatibility is important and I don't want to break existing applications relying on the connector.

@mayurdb (Contributor, Author) commented Apr 10, 2020

@davidrabinowitz done

@davidrabinowitz (Member) commented

/gcbrun

@davidrabinowitz merged commit 2101239 into GoogleCloudDataproc:master on Apr 10, 2020
@mayurdb (Contributor, Author) commented Apr 15, 2020

@davidrabinowitz I just realized that Spark ignores confs that are not prefixed with spark.*, so users might not be able to pass just gcpAccessToken.
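
To illustrate the concern (hedged: spark-submit warns about and drops --conf properties that lack the spark. prefix, while per-read DataSource options are handed to the connector directly and are not subject to that rule):

```scala
// A bare conf key without the "spark." prefix is dropped by spark-submit
// with a warning about a non-Spark config property:
//   spark-submit --conf gcpAccessToken=<access_token> ...
//
// Passing the token as a DataSource option avoids the prefix rule entirely:
val df = spark.read.format("bigquery")
  .option("gcpAccessToken", "<access_token>")
  .option("table", "dataset.table")
  .load()
```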
