Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Can we have endpoint_override as an attribute if not yet? #169

Open
aoyh opened this issue Apr 18, 2022 · 23 comments
Open

Can we have endpoint_override as an attribute if not yet? #169

aoyh opened this issue Apr 18, 2022 · 23 comments
Assignees
Labels
enhancement New feature or request

Comments

@aoyh
Copy link

aoyh commented Apr 18, 2022

Create an attribute named endpoint_override in dbConnect

Example:

library(RAthena)
library(DBI)
con <- dbConnect(RAthena::athena(),
               profile_name = "rathena",
               work_group = 'mygroup',
               endpoint_override = 'url.aws.com')
@DyfanJones DyfanJones added the enhancement New feature or request label Apr 18, 2022
@DyfanJones
Copy link
Owner

Hi @aoyh, sounds like a good enhancement :) i will start working on an implementation tomorrow :)

@aoyh
Copy link
Author

aoyh commented Apr 19, 2022

Thanks a lot for your prompt reply! With that enhancement, I don't have to switch to library of other language.

Another 2 good reasons:

  • EndpointOverride is usually a must for security reason in some projects.
  • EndpointOverride is supported in AthenaJDBCConnector.

@DyfanJones
Copy link
Owner

Hi @aoyh,

I have done some initial work on this feature. Please have a go and let me know if it meets the requirements 😄

remotes::install_github("dyfanjones/rathena", ref="sdk-extra-parameters")

RAthena should be able to support these extra parameters for boto3.Session and client.

Session

  • botocore_session

client

  • config
  • api_version
  • use_ssl
  • verify
  • endpoint_url

Note: endpoint_url is used instead of endpoint_override as boto3 uses that parameter name instead (https://boto3.amazonaws.com/v1/documentation/api/latest/reference/core/session.html#boto3.session.Session.client).

So you should be able to do the following:

library(RAthena)
library(DBI)
con <- dbConnect(RAthena::athena(),
               profile_name = "rathena",
               work_group = 'mygroup',
               endpoint_url = 'url.aws.com')

@DyfanJones DyfanJones self-assigned this Apr 19, 2022
@aoyh
Copy link
Author

aoyh commented Apr 19, 2022

@DyfanJones So quick! Thanks. Will check and let you know.

@DyfanJones
Copy link
Owner

Hmmm I think my initial implementation won't work as expected 🤔

I forgot each AWS service uses it's own endpoint. Currently RAthena uses 3 services: Athena, Glue (for catalog) and S3 (for getting the data). So would we need to override each endpoint? 🤔

It looks like JDBC driver just overrides the Athena endpoint:

Default endpoint:
EndpointOverride=athena. [Region].amazonaws.com:443;

VPC endpoint:
EndpointOverride=[vpce-specific- url].athena.[Region].vpce.amazonaws.com:443;

This is a similar issue when using pythena pandas cursor:

from pyathena import connect
from pyathena.pandas.cursor import PandasCursor

cursor = connect(
    profile_name = "default",
    s3_staging_dir="s3://made-up",
    endpoint_url = "https://athena.eu-west-1.amazonaws.com",
    cursor_class=PandasCursor).cursor()

df = cursor.execute("select * from sampledb.elb_logs limit 10").as_pandas()
OperationalError: An error occurred () when calling the GetObject operation: 

To resolve this we could give options for to user in which AWS service do they want to override for example

endpoints = list(Athena = "my.amazing.endpoint")

Possibly allowing strings that only affect Athena's endpoint 🤔

@DyfanJones
Copy link
Owner

DyfanJones commented Apr 20, 2022

@aoyh Second attempt :P In this implementation you can override each aws service endpoint. To do so you will need to provide a named list of the services' endpoint you want to override. Also if you provide your endpoint only (as a character) then aws athena endpoint will be overridden. I think this gives alot of flexiblity to users when wanting to use custom endpoints :)

Please have a go and let me know if this meets requirement.

P.s. does the new documentation make sense? If not please let me know 😄

library(DBI)

con1 = dbConnect(
  RAthena::athena(), 
  endpoint_override = "https://athena.eu-west-1.amazonaws.com"
)

dbGetInfo(con1)
#> $endpoint_override
#> $endpoint_override$athena
#> [1] "https://athena.eu-west-1.amazonaws.com"
#> 
#> 
#> $region_name
#> [1] "eu-west-1"
#> 
#> $keyboard_interrupt
#> [1] TRUE
#> 
#> $timezone
#> [1] "UTC"
#> 
#> $expiration
#> NULL
#> 
#> $kms_key
#> NULL
#> 
#> $encryption_option
#> NULL
#> 
#> $poll_interval
#> NULL
#> 
#> $work_group
#> [1] "primary"
#> 
#> $dbms.name
#> [1] "default"
#> 
#> $s3_staging
#> [1] "s3://dummy"
#> 
#> $profile_name
#> NULL
#> 
#> $boto3
#> [1] "1.21.35"
#> 
#> $RAthena
#> [1] "2.5.1.9000"

con2 = dbConnect(
  RAthena::athena(), 
  endpoint_override = list(athena = "https://athena.eu-west-1.amazonaws.com")
)

dbGetInfo(con2)
#> $endpoint_override
#> $endpoint_override$athena
#> [1] "https://athena.eu-west-1.amazonaws.com"
#> 
#> 
#> $region_name
#> [1] "eu-west-1"
#> 
#> $keyboard_interrupt
#> [1] TRUE
#> 
#> $timezone
#> [1] "UTC"
#> 
#> $expiration
#> NULL
#> 
#> $kms_key
#> NULL
#> 
#> $encryption_option
#> NULL
#> 
#> $poll_interval
#> NULL
#> 
#> $work_group
#> [1] "primary"
#> 
#> $dbms.name
#> [1] "default"
#> 
#> $s3_staging
#> [1] "s3://dummy"
#> 
#> $profile_name
#> NULL
#> 
#> $boto3
#> [1] "1.21.35"
#> 
#> $RAthena
#> [1] "2.5.1.9000"


con3 = dbConnect(
  RAthena::athena(), 
  endpoint_override = list(
    Athena = "https://athena.eu-west-1.amazonaws.com",
    s3 = "https://s3.eu-west-1.amazonaws.com"
  )
)

dbGetInfo(con3)
#> $endpoint_override
#> $endpoint_override$athena
#> [1] "https://athena.eu-west-1.amazonaws.com"
#> 
#> $endpoint_override$s3
#> [1] "https://s3.eu-west-1.amazonaws.com"
#> 
#> 
#> $region_name
#> [1] "eu-west-1"
#> 
#> $keyboard_interrupt
#> [1] TRUE
#> 
#> $timezone
#> [1] "UTC"
#> 
#> $expiration
#> NULL
#> 
#> $kms_key
#> NULL
#> 
#> $encryption_option
#> NULL
#> 
#> $poll_interval
#> NULL
#> 
#> $work_group
#> [1] "primary"
#> 
#> $dbms.name
#> [1] "default"
#> 
#> $s3_staging
#> [1] "s3://dummy"
#> 
#> $profile_name
#> NULL
#> 
#> $boto3
#> [1] "1.21.35"
#> 
#> $RAthena
#> [1] "2.5.1.9000"

Created on 2022-04-20 by the reprex package (v2.0.1)

@aoyh
Copy link
Author

aoyh commented Apr 21, 2022

@DyfanJones Thanks for the detailed explanation!

Since my endpoint looks like VPC endpoint: EndpointOverride=[vpce-specific- url].athena.[Region].vpce.amazonaws.com that you mentioned, then I tried the method below:

con1 = dbConnect(
  RAthena::athena(), 
  endpoint_override = "https://athena.eu-west-1.amazonaws.com"
)

dbListTables(con1) works fine, but `dbGetQuery(con1, '.SELECT ..') encountered some error:

An error occurred (AccessDeniedException) when calling the StartQueryExecution operation: User: arn:aws:iam:: .... is not authorized to perform: athena:StartQueryExecution on resource:

Any clue?

@DyfanJones
Copy link
Owner

DyfanJones commented Apr 22, 2022

dbListTables calls AWS Glue service to get the Catalog quicker than AWS Athena :) as you haven't overwritten AWS Glue endpoint that should work as normal :)

It looks like your IAM role doesn't have permission to StartQueryExecution on AWS Athena using that endpoint. Can you double check your IAM roles to ensure you are able to :)

My guess is you don't have permission to run AWS Athena in the eu-west-1 region. If you change the region to the one you use then I should expect it to work :) However on the off chance it doesn't, here the AWS Athena command called by RAthena:

The S3 and Glue services should be ok as you haven't overridden their endpoints :)

@aoyh
Copy link
Author

aoyh commented Apr 22, 2022

Additional info:
When using the same set of credentials, S3 location, endpoint_override url in pyathenajdbc library in Python, the query command works fine.
Is there something that I missed when using RAthena?

library(RAthena)
library(DBI)
con <- dbConnect(RAthena::athena(),
               aws_access_key_id='...', 
              aws_secret_access_key='...',
              s3_staging_dir='s3://...',
              region_name='...',
              work_group = 'mygroup',
              endpoint_override = 'url.aws.com')

dbGetQuery(con, '.SELECT ..') 

I use R a lot more, thus don't mind spending more time figuring it out how to make RAthena work here.

Many thanks!

@DyfanJones
Copy link
Owner

Sorry I might of missed somehting. Are you getting the error in R when using this endpoint "https://athena.eu-west-1.amazonaws.com"? But it works fine in pyathenajdbc? Or are you using your vpc endpoint instead? :)

@aoyh
Copy link
Author

aoyh commented Apr 26, 2022

Yes, I get the error in R when using this endpoint "https://athena.eu-west-1.amazonaws.com/". But it works fine in pyathenajdbc.
Good chance that we can make it work~ Thanks!

@DyfanJones
Copy link
Owner

Have you tried using pyathena? pyathena and RAthena both use boto3 to make the connection to AWS Athena so it should be more comparable. Let me know your results :)

@DyfanJones
Copy link
Owner

@aoyh did it work for pyathena? or did you get a similar error to RAthena?

@DyfanJones
Copy link
Owner

@aoyh any update on this?

@aoyh
Copy link
Author

aoyh commented May 11, 2022

Hi @DyfanJones Sorry for my late reply. I spent a few days trying to get pyathena in the same environment. Till now I was not able to do it.
Error:

pyathena.error.DatabaseError: An error occurred (AccessDeniedException) when calling the StartQueryExecution operation: User: arn:aws:iam::...  is not authorized to perform: athena:StartQueryExecution on resource:

However, another python package pyathenajdbc seems to be working there.

from pyathenajdbc import connect
conn = connect(S3OutputLocation=self.S3OutputLocation,
                       Schema=self.Schema,
                       EndpointOverride=self.EndpointOverride,
                       AwsRegion=self.AwsRegion,
                       Catalog=self.Catalog
                       )

Note that the EndpointOverride parameter from pyathenajdbc works fine.

Thanks!

@DyfanJones
Copy link
Owner

Thanks for doing that investigation @aoyh. From the looks of it pyathena and RAthena suffer the same issue regarding IAM role not having enough permissions for you, which is good in one way. RAthena is passing the endpoint url correctly to boto3.

As pyathenajdbc is using the jdbc driver. That is managed by the simba jdbc driver. Without going into the jdbc driver and finding out what is different I am not a 100% sure why boto3 isn't giving you the same results 🤔 It suggesting that your IAM role isn't suited which is confusing.

@aoyh
Copy link
Author

aoyh commented May 11, 2022

No worries @DyfanJones . I will use a workaround by starting with pyathenajdbc python package via reticulate R package and then do the rest of data wrangling in R.

Thank you all the same!

@DyfanJones
Copy link
Owner

@aoyh there is an R package that uses the jdbc driver AWR.Athena. that might be able to help you out, instead of having to use pyathenajdbc to do the same thing :)

@aoyh
Copy link
Author

aoyh commented May 17, 2022

Thanks @DyfanJones for timely tip. I was just pondering on the idea that a similar R package may exist. And you just lighten it!
Will try AWR.Athena out!

@DyfanJones
Copy link
Owner

I will do an initial release of the endpoint_override feature. In the meantime i will have to open up the athena jdbc to see what is the difference. From my understanding i am passing the endpoint correctly however I could be mistaken.

@DyfanJones
Copy link
Owner

@aoyh I have done some tweaking to the implementation of the endpoint_override, plus I have added some unit tests to check if the endpoint is getting correctly passed to boto3.client. Please have a go and see if I have cracked it :)

# Enable repository from dyfanjones
options(repos = c(
    ropensci = 'https://dyfanjones.r-universe.dev',
    CRAN = "https://cloud.r-project.org"))

# Download and install RAthena in R
install.packages('RAthena')

Many thanks for the testing you have done for me so far

@DyfanJones
Copy link
Owner

General notes for completeness ....

It looks like awswrangler passes endpoint_url to boto3.client, similar to the method RAthena and pyathena do. Which is reassuring that RAthena is doing it correctly.

Here is the method awswrangle uses when creating the boto3.clients.

@apply_configs
def client(
    service_name: str,
    session: Optional[boto3.Session] = None,
    botocore_config: Optional[botocore.config.Config] = None,
    verify: Optional[Union[str, bool]] = None,
) -> boto3.client:
    """Create a valid boto3.client."""
    endpoint_url: Optional[str] = _get_endpoint_url(service_name=service_name)
    return ensure_session(session=session).client(
        service_name=service_name,
        endpoint_url=endpoint_url,
        use_ssl=True,
        verify=verify,
        config=default_botocore_config() if botocore_config is None else botocore_config,
    )

https://github.com/awslabs/aws-data-wrangler/blob/ef0f83ac3a2e85b0279f88259c4443520f6f3a6f/awswrangler/_utils.py#L105-L120

Note: The endpoint needs to be in the samge region i.e. region: eu-west-1, and endpoint: https://athena.eu-west-1.amazonaws.com/ . If the region doesn't match the region the you are at risk of the following aws error: Credential should be scoped to a valid region.

@DyfanJones
Copy link
Owner

RAthena v-2.6.0 has been released on the cran. Let me know if your still having the endpoint_override issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants