
Use access token for the Authorisation #146

Merged · 5 commits · Apr 10, 2020

Conversation

@mayurdb (Contributor) commented Mar 24, 2020

With this change, the user can pass an access token to Spark via the conf spark.gcs.user.accessToken, and it will be used to authorize access to BigQuery resources.
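
For reference, a minimal sketch of how the token could be supplied when building the session (the conf key is the one added by this PR; the app name and environment variable are illustrative):

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical setup: the token is obtained out of band and read from an
// environment variable here purely for illustration.
val spark = SparkSession.builder()
  .appName("bigquery-access-token-example")
  .config("spark.gcs.user.accessToken", sys.env("GCP_ACCESS_TOKEN"))
  .getOrCreate()
```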

This change does not yet support refreshing the access token on expiry, which would become an issue for long-running Spark applications.

I am currently thinking of this approach to solve it:

  1. The user passes the refresh token, client ID, and client secret to Spark along with the access token if she wants the access token to auto-refresh.
  2. When the BigQueryRelation gets created on the driver, generate the access token if it is not already available.
  3. Broadcast the generated access token so that each executor has access to it.
  4. Run a daemon thread on the driver that asynchronously refreshes the access token before expiry, so that each executor always has a valid access token. (I will have to research how to update the value of an already broadcast conf; see the sketch below.)

Let me know if this sounds workable or if there is a simpler approach.
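
A rough sketch of the refresh loop in step 4, assuming google-auth-library-java is on the classpath; startTokenRefresher and the publish callback are hypothetical names, and the re-broadcast mechanics are exactly the open question noted in point 4:

```scala
import java.util.concurrent.{Executors, TimeUnit}
import com.google.auth.oauth2.UserCredentials

// Hypothetical helper: refreshes the user's access token on a daemon thread
// and hands each fresh token to a publish callback (e.g. a re-broadcast).
def startTokenRefresher(clientId: String,
                        clientSecret: String,
                        refreshToken: String)
                       (publish: String => Unit): Unit = {
  val credentials = UserCredentials.newBuilder()
    .setClientId(clientId)
    .setClientSecret(clientSecret)
    .setRefreshToken(refreshToken)
    .build()

  val scheduler = Executors.newSingleThreadScheduledExecutor { (r: Runnable) =>
    val t = new Thread(r, "access-token-refresher")
    t.setDaemon(true) // do not keep the driver JVM alive
    t
  }

  // OAuth access tokens typically expire after an hour; refresh a bit early
  // so executors never see a stale token.
  scheduler.scheduleAtFixedRate(() => {
    val token = credentials.refreshAccessToken().getTokenValue
    publish(token)
  }, 0, 55, TimeUnit.MINUTES)
}
```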

@googlebot commented

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here with @googlebot I signed it! and we'll verify it.



@mayurdb (Contributor, Author) commented Mar 24, 2020 via email

@googlebot commented

CLAs look good, thanks!


@davidrabinowitz self-requested a review on March 26, 2020
@davidrabinowitz (Member) commented

/gcbrun

@davidrabinowitz (Member) commented on the diff in SparkBigQueryOptions:

```diff
@@ -77,6 +79,8 @@ object SparkBigQueryOptions {
  val DefaultFormat: FormatOptions = FormatOptions.parquet()
  private val PermittedIntermediateFormats = Set(FormatOptions.orc(), FormatOptions.parquet())

  val GcsAccessTokenConfig = "spark.gcs.user.accessToken"
```

Please change the name to gcpAccessToken

@mayurdb (Contributor, Author) commented Mar 28, 2020

@davidrabinowitz can you please also reply to the comment above?

@davidrabinowitz (Member) commented

I was hoping to get the initial AccessToken support ready, as we plan to release the next version of the connector next week (March 30/31).

In the long run, something along the lines you have outlined sounds like a good plan. Some of the work I've started for implementing DataSource v2 will probably make it easier. I suggest that we merge this PR as is (just changing the option name), and open another issue where we can continue the discussion.

@davidrabinowitz (Member) commented

/gcbrun

@apostaremczak commented

Hey, are you planning on merging this feature and including it in the next release? I'd like to use it as well :)

@mayurdb (Contributor, Author) commented Apr 10, 2020

@davidrabinowitz I have addressed the comments and tested the changes on my end:

```scala
// Set the access token for a user who has read permission on the dataset.
spark.sql("SET spark.gcs.user.accessToken=<access_token>").collect

// Verify the user has access.
val df = spark.read.format("bigquery").option("table", "mayur.customTable").load()
df.collect()
res6: Array[org.apache.spark.sql.Row] = Array([75,75,key75,dataPoint75], [76,76,key76,dataPoint76], [77,77,key77,dataPoint77], ...

// Set the access token for a user who does not have read permission on the dataset.
spark.sql("SET spark.gcs.user.accessToken=<no_access_user_access_token>").collect

// Verify that this user does not have access.
val df = spark.read.format("bigquery").option("table", "mayur.customTable").load()
df.collect()
com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryException: Access Denied: Table spark-cutomer-project-1:mayur.partitionedTable: User does not have bigquery.tables.get permission for table spark-cutomer-project-1:mayur.partitionedTable.

// Grant this user access via the GCP console and verify that the user can now read the dataset.
val df = spark.read.format("bigquery").option("table", "mayur.customTable").load()
df.collect()
res27: Array[org.apache.spark.sql.Row] = Array([75,75,key75,dataPoint75], [76,76,key76,dataPoint76], [77,77,key77,dataPoint77], [78,78,key78,dataPoint78],....
```

I have verified this for writes as well. Is anything else pending from my side?

@davidrabinowitz (Member) commented

Hi @mayurdb

Can you please do the following:

  • Rename the parameter from spark.gcs.user.accessToken to gcpAccessToken, to keep the naming consistent? Also, the token is used for all GCP authentication, not just GCS.
  • Rebase the PR onto master? Currently I cannot merge it due to conflicts.

I plan to release a new version next week, and I'd be happy to include this PR in it.
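
For illustration, the renamed option could then be passed per read, e.g. (a hedged sketch; the table name is a placeholder):

```scala
val df = spark.read.format("bigquery")
  .option("gcpAccessToken", "<access_token>") // the renamed parameter
  .option("table", "dataset.table")
  .load()
```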

@davidrabinowitz (Member) commented

/gcbrun

@mayurdb (Contributor, Author) commented Apr 10, 2020

@davidrabinowitz I have changed the conf name to spark.gcpAccessToken, which follows the Spark convention for naming confs. Let me know if you would like me to change it to just gcpAccessToken :)

I have also rebased the branch to master!

@davidrabinowitz (Member) commented

@mayurdb Yes, I'd appreciate it if you could change it to just gcpAccessToken, to keep consistency with the rest of the parameters. I know the Spark convention is to start with spark., but at the moment backwards compatibility is important and I don't want to break existing applications relying on the connector.

@mayurdb (Contributor, Author) commented Apr 10, 2020

@davidrabinowitz done

@davidrabinowitz (Member) commented

/gcbrun

@davidrabinowitz merged commit 2101239 into GoogleCloudDataproc:master on Apr 10, 2020
@mayurdb (Contributor, Author) commented Apr 15, 2020

@davidrabinowitz I just realized that Spark ignores confs that are not prefixed with spark.*, so users might not be able to pass just gcpAccessToken.
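
To illustrate the concern (hedged: spark-submit warns about and drops --conf properties that lack the spark. prefix, while per-read DataSource options are handed to the connector directly and are not subject to that rule):

```scala
// A bare conf key without the "spark." prefix is dropped by spark-submit
// with a warning about a non-Spark config property:
//   spark-submit --conf gcpAccessToken=<access_token> ...
//
// Passing the token as a DataSource option avoids the prefix rule entirely:
val df = spark.read.format("bigquery")
  .option("gcpAccessToken", "<access_token>")
  .option("table", "dataset.table")
  .load()
```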
