
dbConnect() with RAthena takes longer than expected #168

Closed
juhoautio opened this issue Mar 25, 2022 · 16 comments
Labels: documentation, enhancement, question

@juhoautio

Issue Description

dbConnect() with RAthena takes longer than expected when connecting to my Athena.

Example:

library(RAthena)
library(DBI)
library(tictoc)

drv <- RAthena::athena()

tic("dbConnect")
conn <- dbConnect(drv,
                  profile_name = profile,
                  s3_staging_dir = s3_staging_dir)
toc()

# dbConnect: 428.593 sec elapsed

Compare that against RJDBC approach when connecting to the same Athena target:

library(rJava)
library(RJDBC)
library(DBI)
library(tictoc)

drv <- JDBC(driverClass = "com.simba.athena.jdbc.Driver",
            "/Users/juhoautio/athena/AthenaJDBC42_2.0.9.jar",
            identifier.quote = "'")

tic("dbConnect")
conn <- jdbcConnection <- dbConnect(
    drv,
    'jdbc:awsathena://athena.us-east-1.amazonaws.com:443/',
    S3OutputLocation=s3_staging_dir,
    AwsRegion=region,
    AwsCredentialsProviderClass='com.simba.athena.amazonaws.auth.profile.ProfileCredentialsProvider',
    AwsCredentialsProviderArguments=profile)
toc()

# dbConnect: 1.039 sec elapsed

I'm just getting started with R, but I used the RStudio debugger to see where RAthena's dbConnect spends its time. I suspect it may be related to calling glue$get_tables in a loop:

RAthena/R/utils.R

Lines 354 to 370 in abd3552

# get list of tables from glue catalog
get_table_list <- function(glue, schema) {
  token <- character(1)
  table_list <- list()
  while (!is.null(token)) {
    retry_api_call(response <- glue$get_tables(DatabaseName = schema, NextToken = token))
    table_list <- c(
      table_list,
      lapply(response[["TableList"]],
             function(x) {
               list(DatabaseName = x[["DatabaseName"]],
                    Name = x[["Name"]],
                    TableType = x[["TableType"]])
             })
    )
    token <- response[["NextToken"]]
  }
  return(table_list)
}

My Glue Catalog (which Athena uses) contains a large number of schemas and tables. Does RAthena list all of them before returning the connection? Could that explain the slowness? How could I debug this to provide actual proof?
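One way to gather that proof is base R's sampling profiler. This is a sketch: in practice you would wrap the real dbConnect() call, but a stand-in function is used here so the snippet runs without AWS credentials.

```r
# Profile where time is spent. Wrap the real dbConnect(...) call in
# Rprof()/Rprof(NULL); a local stand-in is used here for illustration.
prof_file <- tempfile(fileext = ".out")

slow_connect <- function() {            # stand-in for dbConnect(...)
  for (i in 1:5e6) invisible(sqrt(i))   # busy work the profiler can sample
}

Rprof(prof_file)
slow_connect()
Rprof(NULL)

# Functions ranked by total time; with RAthena you would look for
# glue$get_tables near the top of this table.
summaryRprof(prof_file)$by.total
```

If get_table_list and glue$get_tables dominate the by.total ranking, that would confirm where the time goes.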

Once the connection has been obtained, performance is good – actually it seems that fetching query results (millions of rows) is faster with RAthena based connection than the RJDBC, so I would prefer to use RAthena, if I can just solve the problem of slow connection initialization.

@DyfanJones (Owner)

Hi @juhoautio,
That is correct, RAthena connects to RStudio connection tab and lists all all schemas and tables in the AWS Glue Catalog. That is the reason for it taking longer than a standard RJDBC connection. However you can toggle the RStudio connection tab off to speed up the initial connection.

library(DBI)

conn <- dbConnect(
    RAthena::athena(),
    profile_name = "my_athena_profile",
    rstudio_conn_tab  = FALSE
)

Documentation RAthena::dbConnect

I hope this helps :)

@DyfanJones (Owner)

Side note: if you are pulling large amounts of data back into R, I would recommend looking at the unload method (https://dyfanjones.github.io/RAthena/articles/aws_athena_unload.html). It makes AWS Athena return its results in parquet format, which R then picks up using the arrow package. This gives some nice performance benefits 😄

@DyfanJones DyfanJones added the question Further information is requested label Mar 25, 2022
@juhoautio (Author)

juhoautio commented Mar 26, 2022

Wow, thanks a lot! That solved the slow dbConnect for me and is already enough to make RAthena the best alternative that I'm aware of!

rstudio_conn_tab doc & default value

Regarding the documentation for dbConnect, I even looked at it hoping to find an option like "skip listing schemas/tables", but I didn't realize that rstudio_conn_tab could be the culprit.

The doc says:

rstudio_conn_tab
Optional to get AWS Athena Schema and display it in RStudio's Connections Tab. Default set to TRUE.

I would tend to think FALSE is a better default. Isn't it? Or how is dbConnect generally used? Is there some other way to query with RAthena?

If the default won't be changed, would it be ok to include more info in the doc, basically mentioning what you wrote ("RAthena populates the RStudio Connections tab and lists all schemas and tables in the AWS Glue Catalog") before the connection is returned, and that having this enabled adds to the time taken by dbConnect?

However you can toggle the RStudio connection tab off to speed up the initial connection.

Asynchronous connection tab population?

By the way, does the RStudio Connections tab support being populated asynchronously, so that the connection could be obtained and used immediately while the tab is gradually updated as the Glue Catalog is being iterated? It could be natural to show the list of schemas first, and then the tables as they are returned from listing. This assumes that the RStudio Connections tab can display visual cues when something is still loading and not complete yet.

The unload optimization

This could deserve its own issue, but commenting here anyway.

I would recommend looking at the unload method

That sounded very promising, but unfortunately I got empty results when trying to use it.

I tried:

library(tibble)
# it seems that arrow is required to use unload=T optimization with RAthena
library(arrow)

my_result <- as_tibble(dbGetQuery(conn, my_query, unload=T))

And it executed faster than without unload=T, but the result in my_result was empty.

In output I also saw this:

additional arguments ignored in warning()
Warning: AccessDenied (HTTP 403). Access Denied

Which is probably irrelevant because I see the same also without unload=T and in that case my_result is not empty – in my current case it's ~8M obs.

Could it be related to as_tibble somehow not being compatible with the unload option – I don't know. Again, please excuse my very limited knowledge of R at the moment.

Btw library(arrow) is not mentioned in the unload article, maybe it should be(?).

@DyfanJones (Owner)

DyfanJones commented Mar 26, 2022

The default for rstudio_conn_tab is set to TRUE to align with the behaviour of other DBI connections, i.e. odbc and RPostgres, and the option was added to help with connection speed. Happy to update the documentation if it is confusing.

Currently I haven't figured out a method to do asynchronous connection tab population. But I am always open to PR to resolve this 😄

If you want tibbles returned instead of data.table you can use the following option in RAthena_options

RAthena_options("vroom", unload=T)

my_result <- dbGetQuery(conn, my_query)

vroom refers to the file parser package vroom and returns tibble data frame format.

library(arrow) shouldn't be required as long as arrow is installed on the machine. However, I should mention in https://dyfanjones.github.io/RAthena/articles/aws_athena_unload.html that it needs to be installed.

Interesting that you got an AccessDenied (HTTP 403). Access Denied. The unload method does a list_objects_v2 on the AWS S3 prefix to get all the parquet files. If the error persists, you might not have list_objects_v2 access on the AWS S3 bucket that AWS Athena is outputting to.

@DyfanJones DyfanJones added the documentation Improvements or additions to documentation label Mar 26, 2022
@juhoautio (Author)

The default for rstudio_conn_tab is set to TRUE to align with other DBI connections behaviour i.e. odbc and RProstgres, and the option was added to help with connection speed. Happy to update the documentation if it is confusing .

Right, I wouldn't say that the doc is confusing, but it can be helpful to extend it a bit. If it's possible to cover this info I would find it helpful:

  • If the value is TRUE, RAthena populates the RStudio Connections tab and lists all schemas and tables in the AWS Glue Catalog before dbConnect returns.
  • The list operation may take a long time if the AWS Glue Catalog has many objects.

Currently I haven't figured out a method to do asynchronous connection tab population.

👍 Thanks for answering though, indeed I was just wondering if RStudio might support that, but apparently not.

If you want tibbles returned instead of data.table you can use the following option in RAthena_options

I tried this, but as I suspected it didn't help with the 0 rows returned issue. I could open a new issue/question about it if you don't mind? But as said, I get Warning: AccessDenied (HTTP 403). Access Denied also when I use dbGetQuery without unload=T and in that case I get results after all. Is it possible to get more details (maybe some debug logging) to see which operation fails with 403?

Many thanks!

@DyfanJones (Owner)

DyfanJones commented Mar 28, 2022

I think I know what is causing the Warning: AccessDenied (HTTP 403). Access Denied. I believe your IAM role doesn't have permission to clear down the AWS Athena S3 bucket. To prevent RAthena attempting this you can use caching (https://dyfanjones.github.io/RAthena/articles/aws_athena_query_caching.html).

library(DBI)
library(RAthena)

RAthena_options(
    cache_size = 10
)

con <- dbConnect(athena())

df <- dbGetQuery(con, "select * from your_table")

Let me know if you keep getting Warning: AccessDenied (HTTP 403). Access Denied.

Sadly I am not sure how to get a more informative error message, as it is coming from the AWS S3 API (https://aws.amazon.com/premiumsupport/knowledge-center/s3-troubleshoot-403/). It would probably need to be raised with Python boto3 to get a more informative error message 🤔.

@juhoautio (Author)

I think I know what is causing the Warning: AccessDenied (HTTP 403). Access Denied. I believe your IAM role doesn't have permission to clear down the AWS Athena S3 bucket.

Makes sense: the role that I'm using for this doesn't have delete permission.

To prevent RAthena attempting this you can use caching

RAthena_options(
    cache_size = 10
)

This did the trick, it seems – no more warning 🥳

However it didn't work when I tried it first in my existing R session. When I started a new session from scratch, then it worked. Is that expected? Why wouldn't it apply within an existing R session?

Sadly I am not sure how to get a more informative error message as it is coming from AWS S3 api

Looks like you were able to deduce that the only possible line of code with that kind of warning was S3 deletion. That solves this case, but as a mere user I would not have been able to guess that.

When I asked about the possibility of including more logging output, I meant including at least information on which line of RAthena code the error happened and what the context was. In this case, it could have included the S3 bucket and path of the object that it was attempting to delete. Additionally, I was wondering if it's a common pattern in R to let the user configure logging verbosity.
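As a generic base-R pattern (not an RAthena feature), withCallingHandlers can attach that kind of context: it lets you inspect the call that raised a warning before execution continues. A small self-contained sketch with a simulated warning:

```r
# Stand-in for a call that warns internally (e.g. a failed S3 delete).
delete_results <- function() {
  warning("AccessDenied (HTTP 403). Access Denied")
  invisible(TRUE)
}

result <- withCallingHandlers(
  delete_results(),
  warning = function(w) {
    # Report the call that raised the warning along with its message,
    # then muffle the warning so it is not re-raised.
    message("warning from ", deparse(conditionCall(w)), ": ",
            conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)
```

Wrapping a dbGetQuery call this way would at least reveal which internal call produced the 403, even without library-level logging.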

Regarding deletion of query results from S3, I wouldn't have expected the library to do that. Do you know if the Athena JDBC driver does something like that as well? This makes me wonder: could RAthena even try to delete the target data if a user executes a CTAS-type query 🤔


As another side note, there seems to be a bug where "Data scanned" is sometimes printed incorrectly. For example, this happened:

library(DBI)
library(RAthena)
library(tictoc)

RAthena_options(
  cache_size = 10
)

con <- dbConnect(athena(),
                 profile_name = "my_profile",
                 rstudio_conn_tab  = FALSE,
                 s3_staging_dir = "s3://aws-athena-query-results-12345-us-east-1/")

tic("dbGetQuery 1")
df <- dbGetQuery(con, "SELECT COUNT(*) FROM my_table WHERE processdate > 20211122")
# INFO: (Data scanned: -43839744 Bytes)
toc()
# dbGetQuery 1: 39.114 sec elapsed

tic("dbGetQuery 2")
df <- dbGetQuery(con, "SELECT COUNT(*) FROM my_table WHERE processdate > 20211122")
# INFO: (Data scanned: -43839744 Bytes)
toc()
# dbGetQuery 2: 0.684 sec elapsed

When I don't enable caching, I can sometimes notice the same error in "Data scanned" printing. For example:

library(DBI)
library(RAthena)
library(tictoc)

con <- dbConnect(athena(),
                 profile_name = "my_profile",
                 rstudio_conn_tab  = FALSE,
                 s3_staging_dir = "s3://aws-athena-query-results-12345-us-east-1/")

tic("dbGetQuery 1")
df <- dbGetQuery(con, "SELECT COUNT(*) FROM my_table WHERE processdate > 20211122")
# INFO: (Data scanned: 96.57 MB)
toc()
# dbGetQuery 1: 23.786 sec elapsed

tic("dbGetQuery 2")
df <- dbGetQuery(con, "SELECT COUNT(*) FROM my_table WHERE processdate > 20211122")
# INFO: (Data scanned: -2732010 Bytes)
toc()
# dbGetQuery 2: 19.156 sec elapsed

# Warning messages:
# 1: An error occurred (AccessDenied) when calling the DeleteObject operation: Access Denied 
# 2: An error occurred (AccessDenied) when calling the DeleteObject operation: Access Denied 

But it doesn't happen most of the time. I suppose it's not related to caching.

@juhoautio (Author)

One more thing: there's the known caching weakness that the latest data is not always fetched.

So to force query execution I would need to always run RAthena_options(clear_cache = T) before my dbGetQuery, when I want to achieve both:

  • Skip s3 deletion (and avoid AccessDenied error)
  • Always run the query instead of caching

It would seem more natural if I could set something like RAthena_options(skip_clear_s3 = T) instead of enabling caching to achieve the same thing as a side effect.
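Until such an option exists, the workaround combining the two goals could look like this (a sketch based on the caching article; assumes an existing `con` and `my_query`):

```r
library(RAthena)

# Enable caching (side effect: RAthena skips the S3 clear-down) but also
# clear the cache, so the query is always re-executed rather than served
# from a cached query id.
RAthena_options(cache_size = 10, clear_cache = TRUE)

df <- dbGetQuery(con, my_query)
```

Setting both in one RAthena_options call avoids accidentally resetting other options to their defaults between calls.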

@DyfanJones (Owner)

How often is your backend data updating? The AccessDenied error is raised to R as a warning so that it doesn't affect the overall query; it is just a bit annoying when a user doesn't have delete access to the AWS Athena output S3 bucket.

On the other hand, I am happy to add a skip_clear_s3 option to RAthena_options, plus some extra documentation is required to help prevent any confusion :)

@DyfanJones DyfanJones added the enhancement New feature or request label Mar 28, 2022
@DyfanJones DyfanJones self-assigned this Mar 28, 2022
@juhoautio (Author)

How often is your backend data updating?

It depends. I think what makes this especially tough is that it's common to keep an R session running for a long time. For someone who is not so familiar with RAthena, it's not obvious how to clear the cache, or even that caching is happening at all.

The AccessDenied error is raised to R as a warning so that it doesn't affect the overall query, just abit annoying when user doesn't have delete access to the AWS Athena output S3 bucket.

Sure, that is good to know. It would be comforting, though, if the error message were more informative, e.g. if it said that the query results could not be deleted because of a missing permission. Then I could at least be confident that the query result itself can be trusted to be correct.

On the other hand I am happy to add a skip_clear_s3 option into the RAthena_options, plus I some extra documentation is required to help/prevent any confusion :)

Right, I think that would be ideal in my setup (and for some people in the same organization with similar IAM permissions). But I'm not sure it's worth implementing yet. I'm going to experiment with the caching and see whether it is ever really a problem that the query is not re-run every time. In fact, that can be more helpful than harmful :)

But if it comes to that, I could even make a PR myself to add skip_clear_s3, if I end up realizing that it would be needed. But it's nice to hear that you're at least open to including it in RAthena.

If I could make a wish, I would choose improving the error logging before adding new conf options :)

This issue can be closed I think. Thanks a lot for all the help!

@DyfanJones (Owner)

Re-opening this ticket so that the suggested documentation plus the feature can be added.

@DyfanJones DyfanJones reopened this Apr 19, 2022
@DyfanJones (Owner)

Ah sorry @juhoautio, I missed a question you asked. The issue you highlighted with the information message "Data scanned ..." is an issue with the reticulate package. It returns the data size as an integer; however, the value is too large for the integer type, so it errors out and returns a weird output:

INFO: (Data scanned: -43839744 Bytes)

🤔 I wonder if i can change the returning output. I will have a little look at this as well :)

@DyfanJones (Owner)

I think I have an initial implementation for resolving the Data scanned issue.

remotes::install_github("DyfanJones/RAthena", ref = "delay_convert")
library(DBI)

con = dbConnect(RAthena::athena(), rstudio_conn_tab = FALSE)

res = dbExecute(con,
    "select * from my_big_table"
)

Check on query Statistics

RAthena::dbStatistics(res)

#> $EngineExecutionTimeInMillis    678844
#> $DataScannedInBytes     3816675840
#> $TotalExecutionTimeInMillis    679019
#> $QueryQueueTimeInMillis    143
#> $QueryPlanningTimeInMillis    94
#> $ServiceProcessingTimeInMillis   32

Normally base R's 32-bit integers would fail on as.integer(3816675840); however, by switching the data type to numeric it seems fine.

RAthena:::data_scanned(res@info$Statistics$DataScannedInBytes)

'3.55 GB'
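The 32-bit limit is easy to see in base R itself (a self-contained illustration, not RAthena code; the negative values reported earlier presumably stem from the same out-of-range conversion):

```r
# Base R integers are 32-bit: anything above .Machine$integer.max
# (2147483647) cannot be represented as an integer.
big <- 3816675840

as_int <- suppressWarnings(as.integer(big))  # coerces to NA: out of range
as_num <- as.numeric(big)                    # doubles hold this value exactly

print(.Machine$integer.max)           # 2147483647
print(is.na(as_int))                  # TRUE
print(as_num > .Machine$integer.max)  # TRUE
```

This is why storing DataScannedInBytes as numeric (double) rather than integer sidesteps the overflow.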

Just need to run the unit tests to make sure the package still behaves as expected :D

@DyfanJones (Owner)

Hi @juhoautio the latest dev version now supports skipping clearing AWS S3 resource. Plus it should fix the data scan message bug you mentioned earlier.

To install latest dev version:

# Enable universe(s) by dyfanjones
options(repos = c(
  dyfanjones = 'https://dyfanjones.r-universe.dev',
  CRAN = 'https://cloud.r-project.org')
)
# Install dev version packages
install.packages('RAthena')
library(DBI)
library(RAthena)

RAthena_options(
    clear_s3_resource = FALSE # prevent AWS S3 resource being cleared down in dbClearResult
)

con <- dbConnect(
    RAthena::athena(),
    rstudio_conn_tab = FALSE # Skip RAthena populating RStudio's connection tab
)

@juhoautio (Author)

Thank you. I'm going to take it into use when the next release comes.

@DyfanJones (Owner)

RAthena v2.6.0 has been released on CRAN. If you have any future suggestions/features, please feel free to raise them :)
