dbplyr's tbl() function queries Athena for the fields. Querying Glue would be more efficient. #64
Comments
@OssiLehtinen This looks really good, I will review the code to double check, but I am keen to get it in. I am happy you like |
Speed test:
The speed increase is really good and makes working with tables a lot more interactive. |
Given the speed increase, I think the documentation will have to be updated to advise users to use the new method whenever possible. |
@OssiLehtinen coming across cran check:
I believe it relates to: Do you have a possible solution for this? |
@OssiLehtinen don't worry I have modified the code not to look if it is a sub query but to check if the class is "ident" or not:
This will give the same results :) so all good |
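For readers following along, a minimal sketch of the class check described above. The two objects are hand-built stand-ins for what dbplyr's ident() and sql() produce (so the example needs no packages), and is_direct_table is a hypothetical helper name, not a function from noctua or dbplyr.

```r
# Stand-ins for dbplyr objects, built by hand (assumption: a direct table
# reference carries the class "ident", a raw subquery carries the class "sql").
direct   <- structure("schema.table", class = c("ident", "character"))
subquery <- structure("select * from table where b = c",
                      class = c("sql", "character"))

# Hypothetical helper: TRUE when the input is a direct table reference.
is_direct_table <- function(x) inherits(x, "ident")

is_direct_table(direct)    # TRUE
is_direct_table(subquery)  # FALSE
```

Checking the class avoids guessing from the text of the query, so non-ASCII table names need no special handling.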
The speed up is great! And yes, the 'å's are surely the culprit for the cran error. I included those in the range from purely Scandinavian reasoning, but now that I think of it, that regex is not optimal in any case: Sorry for the rushed version, will look into it more once at the keyboard tomorrow. Fortunately, a failure to catch a valid table name results in using the dplyr default method, but of course it would be best to use the faster one whenever possible. |
This statement seems to do the trick in my tests (the \p{L} should match any Unicode letter):
https://docs.aws.amazon.com/athena/latest/ug/tables-databases-columns-names.html |
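As an illustration only (the statement posted in the thread is not reproduced here), a regex of this kind might look like the sketch below. The pattern and the helper name is_table_name are assumptions: it accepts "table" or "schema.table" made of Unicode letters, digits and underscores. Note that \p{L} needs perl = TRUE in R, and writing non-ASCII letters as \u escapes keeps the source file ASCII-only.

```r
# Hypothetical pattern: one or two dot-separated name parts, each made of
# Unicode letters (\p{L}), digits or underscores. Not the pattern from the thread.
is_table_name <- function(x) {
  grepl("^[\\p{L}0-9_]+(\\.[\\p{L}0-9_]+)?$", x, perl = TRUE)
}

is_table_name("schema.table")                     # direct reference
is_table_name("select * from table where b = c")  # subquery: spaces and '*'
is_table_name("schema.t\u00e5ble")                # non-ASCII letter via escape
```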
Ah, you seem to have found a more elegant solution with inherits, right? That's great, as the regex solution is pretty kludgey. |
Just to confirm, the |
Haha, I somehow managed to miss your latest message about the inherits solution earlier and noticed the solution only after looking at your pull request (and after posting about my regex-improvement etc.). Well, at least I got some practice on regex... |
One more glitch came up! With the current version, partition names are missing from the list of names returned. The following snippet also fetches the partition names if they exist:
if (is_ident) { # If a direct definition, get the fields from Glue
  message("direct")
  if (!dbIsValid(con)) {stop("Connection already closed.", call. = FALSE)}
  # Split "schema.table" into its parts; otherwise use the connection's default database
  if (grepl("\\.", sql)) {
    dbms.name <- gsub("\\..*", "", sql)
    Table <- gsub(".*\\.", "", sql)
  } else {
    dbms.name <- con@info$dbms.name
    Table <- sql
  }
  tryCatch(
    table_definition <- con@ptr$glue$get_table(DatabaseName = dbms.name,
                                               Name = Table)$Table)
  columns <- sapply(table_definition$StorageDescriptor$Columns, function(y) y$Name)
  # Partition keys are stored separately from the regular columns
  partitions <- NULL
  if (length(table_definition$PartitionKeys) > 0) partitions <- sapply(table_definition$PartitionKeys, function(y) y$Name)
  c(columns, partitions)
} |
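To make the shape of the data concrete, here is a small self-contained sketch that runs the same extraction on a mock table definition. The mock list and the helper name field_names are assumptions for illustration; only the StorageDescriptor$Columns and PartitionKeys fields come from the snippet above. It uses vapply() instead of sapply(), one possible way to avoid the length check, since vapply() returns an empty character vector when there are no partition keys.

```r
# Mock of the list shape used above (table and column names are made up).
table_definition <- list(
  StorageDescriptor = list(
    Columns = list(list(Name = "id", Type = "bigint"),
                   list(Name = "value", Type = "double"))
  ),
  PartitionKeys = list(list(Name = "year", Type = "string"))
)

# Hypothetical helper: regular column names followed by partition names.
field_names <- function(def) {
  get_name <- function(y) y$Name
  columns    <- vapply(def$StorageDescriptor$Columns, get_name, character(1))
  partitions <- vapply(def$PartitionKeys, get_name, character(1))
  c(columns, partitions)
}

field_names(table_definition)  # "id" "value" "year"
```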
Btw, should something similar be done in dbListFields also? |
The partition is for the |
ah I see what you mean, good spot |
I am guessing you are using the if statement due to the nature of what is returned in
This will make the following:
Yes |
This issue persisted in the RStudio connection tab view. PR #65 fixes this issue. |
dplyr tbl improved performance speed #64
PR #65 passed unit tests. If this issue persists please re-open or open another one |
Where does |
Lines 168 to 175 in 1a4b000
|
Issue Description
When 'connecting' to a table with dplyr's tbl()-function, a query is sent to Athena with a 'WHERE 0 = 1' clause for getting the column names. This query is generated by db_query_fields.DBIConnection.
The thing is, querying Athena can be slow at times, and a much faster response could be obtained from Glue.
This, however, works only if a direct connection to a table is made, like tbl(con, in_schema("schema", "table")). If one has a subquery in tbl(), e.g. tbl(con, sql("select * from table where b=c")), Glue cannot help.
To handle this, one could define a method, such as:
So basically test if we have a direct table definition or a subquery, and query Glue or Athena accordingly.
The weakest part of this would be the first regex, which tries to detect whether we have a direct table definition. The good thing is that if the regex match returns FALSE, we revert to dplyr's default behaviour.
What do you think?
p.s. Have been trying noctua as opposed to RAthena for a few days and really seems to work as a drop in replacement. Really like the native 'all R' aspect of it!