Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dbplyr's tbl() function queries Athena for the fields. Querying Glue would be more efficient. #64

Closed
OssiLehtinen opened this issue Feb 6, 2020 · 19 comments
Assignees
Labels
bug Something isn't working enhancement New feature or request

Comments

@OssiLehtinen
Copy link
Contributor

Issue Description

When 'connecting' to a table with dplyr's tbl()-function, a query is sent to Athena with a 'WHERE 0 == 1' clause for getting the column names. This query is generated by db_query_fields.DBIconnection.

The thing is, querying Athena can be slow at times and a much faster response could be gotten from Glue.

This, however works only if a direct connection to a table is made, like tbl(con, in_schema("schema", "table)). If one has a subquery in tbl() (e.g., tbl(con, sql("select * from table where b=c)), Glue cannot help here.

To handle this, one could define a method, such as:

db_query_fields.AthenaConnection <- function(con, sql, ...) {

  # Test if we have a subquery or a direct table definition
  is_direct <- grepl('^"?[a-å]+"?\\.?"?[a-å]+"?$', trimws(tolower(sql)))
  
  if(is_direct) { # If a direct definiton, get the fields from Glue
  
    if (!dbIsValid(con)) {stop("Connection already closed.", call. = FALSE)}
    
    if (grepl("\\.", sql)) {
      dbms.name <- gsub("\\..*", "" , sql)
      Table <- gsub(".*\\.", "" , sql)
    } else {
      dbms.name <- conn@info$dbms.name
      Table <- sql}
    
    tryCatch(
      output <- conn@ptr$glue$get_table(DatabaseName = dbms.name,
                                        Name = Table)$Table$StorageDescriptor$Columns)
    sapply(output, function(y) y$Name)
  } else { # If a subquery, query Athena for the fields
    sql <- sql_select(con, sql("*"), sql_subquery(con, sql), where = sql("0 = 1"))
    qry <- dbSendQuery(con, sql)
    on.exit(dbClearResult(qry))
    
    res <- dbFetch(qry, 0)
    names(res)
  }
}

So basically test if we have a direct table definition or a subquery, and query Glue or Athena accordingly.

The weakest part of this would be the first regex for trying to see if we have a direct table def. The good thing is, that if the regex match returns FALSE, we revert to dplyr's default behaviour.

What do you think?

p.s. Have been trying noctua as opposed to RAthena for a few days and really seems to work as a drop in replacement. Really like the native 'all R' aspect of it!

@DyfanJones
Copy link
Owner

@OssiLehtinen This looks really good, I will review the code to double check, but I am keen to get it in. I am happy you like noctua as well, I am planning to keep both RAthena and noctua up to date and at the same level. So hopefully it will feel seamless between using one or the other.

@DyfanJones
Copy link
Owner

Speed test:

library(DBI)
library(dplyr)

con <- dbConnect(noctua::athena())

system.time(tbl(con, "iris")) -> t1 # new method
system.time(tbl(con, sql("select * from iris"))) -> t2 # to replicate old method

# new method
user  system elapsed 
0.082   0.012   0.288 

# old method
user  system elapsed 
0.993   0.138   3.660 

The speed increase is really good and makes it alot more user interactive.

@DyfanJones
Copy link
Owner

Due to the speed increase I think the documentation will have to be updated to advise users to use the new method as much as possible if they can

@DyfanJones
Copy link
Owner

@OssiLehtinen coming across cran check:

checking R files for non-ASCII characters ... WARNING
Found the following file with non-ASCII characters:
dplyr_integration.R
Portable packages must use only ASCII characters in their R code,
except perhaps in comments.
Use \uxxxx escapes for other characters.

I believe it relates to:
grepl('^"?[a-å]+"?\\.?"?[a-å]+"?$', trimws(tolower(sql)))

Do you have a possible solution for this?

@DyfanJones
Copy link
Owner

@OssiLehtinen don't worry I have modified the code not to look if it is a sub query but to check if the class is "ident" or not:

db_query_fields.AthenaConnection <- function(con, sql, ...) {
  
  # check if sql is dbplyr ident
  is_ident <- inherits(sql, "ident")
  
  if(is_ident) { # If ident, get the fields from Glue
    
    if (grepl("\\.", sql)) {
      dbms.name <- gsub("\\..*", "" , sql)
      Table <- gsub(".*\\.", "" , sql)
    } else {
      dbms.name <- con@info$dbms.name
      Table <- sql}
    
    tryCatch(
      output <- con@ptr$glue$get_table(DatabaseName = dbms.name,
                                       Name = Table)$Table$StorageDescriptor$Columns)
    
    sapply(output, function(y) y$Name)
  } else { # If a subquery, query Athena for the fields
    # return dplyr methods
    sql_select <- pkg_method("sql_select", "dplyr")
    sql_subquery <- pkg_method("sql_subquery", "dplyr")
    dplyr_sql <- pkg_method("sql", "dplyr")
    
    sql <- sql_select(con, dplyr_sql("*"), sql_subquery(con, sql), where = dplyr_sql("0 = 1"))
    qry <- dbSendQuery(con, sql)
    on.exit(dbClearResult(qry))
    
    res <- dbFetch(qry, 0)
    names(res)
  }
}

This will give the same results :) so all good

@OssiLehtinen
Copy link
Contributor Author

The speed up is great!

And yes, the 'å's are surely the culprit for the cran error. I included those in the range from purely Scandinavic reasoning, but now that I think of it, that regex is not optimal in any case:
One could limit the letter range to a-z, but the bigger problem are numbers, underscores etc. I'm on my mobile at the moment so can't test it right now, but [a-z1-9_] could do the trick? Regex is really not my forte, so have to test it still.

Sorry for the rushed version, will look into it more once at the keyboard tomorrow. Fortunately a failure to catch a valid table name results in using the dplyr default method, but of course it would be best to use the faster one when evever possible.

@OssiLehtinen
Copy link
Contributor Author

This statement seems to do the trick in my tests (the \p{L} should match any unicode character):

is_direct <- grepl('^"?[\\p{L}0-9_]+"?\\.?"?[\\p{L}0-9_]+"?$', trimws(tolower(sql)), perl = T)

https://docs.aws.amazon.com/athena/latest/ug/tables-databases-columns-names.html

@OssiLehtinen
Copy link
Contributor Author

Ah, you seem to have found a more elegant solution with inherits, right? That's great, as the regex solution is pretty kludgey.

@OssiLehtinen
Copy link
Contributor Author

OssiLehtinen commented Feb 7, 2020

Just to confirm, the inherits solution works well in my tests too.

@OssiLehtinen
Copy link
Contributor Author

Haha, I somehow managed to miss your latest message about the inherits solution earlier and noticed the solution only after looking at your pull request (and after posting about my regex-improvement etc.). Well, at least I got some practice on regex...

@OssiLehtinen
Copy link
Contributor Author

OssiLehtinen commented Feb 7, 2020

One more glitch came up!

With the current version, partition names will be missed from the list of names returned.

The following snippet will also fetch the partitionnames if they exist:

  if(is_ident) { # If a direct definiton, get the fields from Glue
    message("direct")
    if (!dbIsValid(con)) {stop("Connection already closed.", call. = FALSE)}
    
    if (grepl("\\.", sql)) {
      dbms.name <- gsub("\\..*", "" , sql)
      Table <- gsub(".*\\.", "" , sql)
    } else {
      dbms.name <- conn@info$dbms.name
      Table <- sql}
    
    tryCatch(
      table_definition <- con@ptr$glue$get_table(DatabaseName = dbms.name,
                                        Name = Table)$Table)
      
      columns <- sapply(table_definition$StorageDescriptor$Columns, function(y) y$Name)
      
      partitions <- NULL
      if(length(table_definition$PartitionKeys) > 0) partitions <- sapply(table_definition$PartitionKeys, function(y) y$Name)
    
      c(columns, partitions)
    
  }

@OssiLehtinen
Copy link
Contributor Author

Btw, should something similar be done in dbListFields also?

@DyfanJones
Copy link
Owner

The partition is for the db_query_fields?

@DyfanJones
Copy link
Owner

ah I see what you mean, good spot

@DyfanJones
Copy link
Owner

i am guessing you are using the if statement due to the nature of what is returned in sapply from the partition call. To over come this vapply can be used instead with expected values to be returned:

vapply(output$PartitionKeys,function(y) y$Name, FUN.VALUE = character(1))

This will make the following:

db_query_fields.AthenaConnection <- function(con, sql, ...) {
  
  # check if sql is dbplyr ident
  is_ident <- inherits(sql, "ident")
  
  if(is_ident) { # If ident, get the fields from Glue
    
    if (grepl("\\.", sql)) {
      dbms.name <- gsub("\\..*", "" , sql)
      Table <- gsub(".*\\.", "" , sql)
    } else {
      dbms.name <- con@info$dbms.name
      Table <- sql}
    
    tryCatch(
      output <- con@ptr$glue$get_table(DatabaseName = dbms.name,
                                       Name = Table)$Table)
    
    col_names = vapply(output$StorageDescriptor$Columns, function(y) y$Name, FUN.VALUE = character(1))
    partitions = vapply(output$PartitionKeys,function(y) y$Name, FUN.VALUE = character(1))
    
    c(col_names, partitions)
    
  } else { # If a subquery, query Athena for the fields
    # return dplyr methods
    sql_select <- pkg_method("sql_select", "dplyr")
    sql_subquery <- pkg_method("sql_subquery", "dplyr")
    dplyr_sql <- pkg_method("sql", "dplyr")
    
    sql <- sql_select(con, dplyr_sql("*"), sql_subquery(con, sql), where = dplyr_sql("0 = 1"))
    qry <- dbSendQuery(con, sql)
    on.exit(dbClearResult(qry))
    
    res <- dbFetch(qry, 0)
    names(res)
  }
}

Yes dbListFields should have this as currently it will be missing partition colnames. Which is a bug

@DyfanJones
Copy link
Owner

This issue persisted in the RStudio connection tab view. PR #65 fixes this issue.

@DyfanJones DyfanJones added the bug Something isn't working label Feb 7, 2020
DyfanJones pushed a commit that referenced this issue Feb 8, 2020
dplyr tbl improved performance speed #64
@DyfanJones
Copy link
Owner

PR #65 passed unit tests. If this issue persists please re-open or open another one

@geotheory
Copy link

Where does pkg_method come from?

@DyfanJones
Copy link
Owner

Where does pkg_method come from?

pkg_method function is created:

noctua/R/utils.R

Lines 168 to 175 in 1a4b000

# get parent pkg function and method
pkg_method <- function(fun, pkg) {
if (!requireNamespace(pkg, quietly = TRUE)) {
stop(fun,' requires the ', pkg,' package, please install it first and try again',
call. = F)}
fun_name <- utils::getFromNamespace(fun, pkg)
return(fun_name)
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants