Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dbplyr summarise translation to SQL #153

Closed
ellmanj opened this issue Aug 2, 2021 · 13 comments
Closed

dbplyr summarise translation to SQL #153

ellmanj opened this issue Aug 2, 2021 · 13 comments
Assignees
Labels
bug Something isn't working

Comments

@ellmanj
Copy link

ellmanj commented Aug 2, 2021

Hello again! I've found what I think is another issue around translating to SQL.

When I try to use the summarise function to get the median value for a column, I get the following error.

> query <- dplyr::tbl(db_connection, "table_name") %>% summarise(median_value=median(column_name))
> query %>% collect()
Error: InvalidRequestException (HTTP 400). line 1:42: mismatched input '('. Expecting: 'BY'

Here is the SQL being executed:

> query %>% show_query()
<SQL>
SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY "column_name") AS "median_value"
FROM "table_name"

So I believe the issue is that PERCENTILE_CONT is not a valid Presto function. Perhaps we could use one of the approx_percentile functions here?

@DyfanJones DyfanJones added the bug Something isn't working label Aug 2, 2021
@DyfanJones
Copy link
Owner

Hi @ellmanj thanks for highlighting this issue. I will check this out :)

@DyfanJones DyfanJones self-assigned this Aug 2, 2021
@DyfanJones
Copy link
Owner

So percentile_cont is coming from dbplyr::base_win, dbplyr::base_agg and dbplyr::base_no_win. Will replace these going forward.

@DyfanJones
Copy link
Owner

Branch sql_tran_median seems to have fixed the issue:

remotes::install_github("DyfanJones/noctua", ref = "sql_tran_median")
library(DBI)
library(dplyr)

con <- dbConnect(noctua::athena())

# Create iris data set in AWS Athena
dbWriteTable(con, "iris", iris)

tbl(con, "iris") %>%
  summarise(lower_percentile=quantile(petal_length,0.25),
            median = median(petal_length),
            upper_percentile = quantile(petal_length,0.75))

# Source:   lazy query [?? x 3]
# Database: Athena 0.1.11 [default@eu-west-1/default]
  lower_percentile median upper_percentile
             <dbl>  <dbl>            <dbl>
1             1.6    4.4              5.1

tbl(con, "iris") %>%
  group_by(species) %>%
  summarise(lower_percentile=quantile(petal_length,0.25),
            median = median(petal_length),
            upper_percentile = quantile(petal_length,0.75))

# Source:   lazy query [?? x 4]
# Database: Athena 0.1.11 [default@eu-west-1/default]
  species    lower_percentile median upper_percentile
  <chr>                 <dbl>  <dbl>            <dbl>
1 virginica               5.1    5.6              5.9
2 setosa                  1.4    1.5              1.6
3 versicolor              4      4.4              4.6

@ellmanj Please let me know if this fixes the issue.

@ellmanj
Copy link
Author

ellmanj commented Aug 3, 2021

thanks @DyfanJones! I tried it out. It works if I don't supply the na.rm parameter to the median function. But in my current project in some places we do supply that parameter.

Example:

> query <- dplyr::tbl(db_connection, "person") %>% summarise(median_value=median(age, na.rm = TRUE))
> query %>% collect()
Error in median(age, na.rm = TRUE) : unused argument (na.rm = TRUE)

Any thoughts?

@DyfanJones
Copy link
Owner

@ellmanj Ah thanks, I have that behaviour missing. I will add it so that it mimics dbplyr

@DyfanJones
Copy link
Owner

@ellmanj add to parameter na.rm to median sql_translate_env`. Please have a go.

library(DBI)
library(dplyr)

con <- dbConnect(noctua::athena())

# Create iris data set in AWS Athena
dbWriteTable(con, "iris", iris)

tbl(con, "iris") %>%
  summarise(lower_percentile=quantile(petal_length,0.25),
            median = median(petal_length),
            upper_percentile = quantile(petal_length,0.75))

# Source:   lazy query [?? x 3]
# Database: Athena 0.1.11 [default@eu-west-1/default]
lower_percentile median upper_percentile
<dbl>  <dbl>            <dbl>
  1             1.6    4.4              5.1
# Warning message:
# Missing values are always removed in SQL.
# Use `median(x, na.rm = TRUE)` to silence this warning
# This warning is displayed only once per session. 

tbl(con, "iris") %>%
  group_by(species) %>%
  summarise(lower_percentile=quantile(petal_length,0.25),
            median = median(petal_length, na.rm = TRUE),
            upper_percentile = quantile(petal_length,0.75))

# Source:   lazy query [?? x 4]
# Database: Athena 0.1.11 [default@eu-west-1/default]
species    lower_percentile median upper_percentile
<chr>                 <dbl>  <dbl>            <dbl>
1 virginica               5.1    5.6              5.9
2 setosa                  1.4    1.5              1.6
3 versicolor              4      4.4              4.6

@ellmanj
Copy link
Author

ellmanj commented Aug 3, 2021

@DyfanJones that worked! Thanks for the quick turnaround on this.

@DyfanJones
Copy link
Owner

@ellmanj PR #154 has been merged to master. Will release to the cran roughly on the 27/08 due to cran policy of monthly releases.

@DyfanJones
Copy link
Owner

Will push these changes to cran tomorrow. Sorry around the delay I had a short holiday so these changes got delayed abit

@ellmanj
Copy link
Author

ellmanj commented Sep 9, 2021 via email

@DyfanJones
Copy link
Owner

Sorry the release was delayed by #156 and how to best implement it. I will update RAthena and continue with the release.

@ellmanj
Copy link
Author

ellmanj commented Sep 22, 2021 via email

@DyfanJones
Copy link
Owner

noctua v-2.2.0 has now been released to the cran, thanks for your wait :D

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants