Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Paginated glue responses not covered #137

Closed
OssiLehtinen opened this issue Jan 29, 2021 · 24 comments
Closed

Paginated glue responses not covered #137

OssiLehtinen opened this issue Jan 29, 2021 · 24 comments
Assignees
Labels
bug Something isn't working

Comments

@OssiLehtinen
Copy link
Contributor

Hi Dyfan, one more for the week:

when noctua queries glue for a list of databases with get_databases(), the list is truncated in case the result gets paginated. Same with get_tables().

Not sure what is the underlying reason for this, but in one of our environments the first call to glue returns only the first 23 databases. For one reason or another setting the MaxResults parameter doesn't help here.

One fix would be to implement a separate paginated reader function and replace the native paws calls with that, right? Perhaps some more elegant implementation could be come up with.

get_all_databases <- function(glue) {
  cresults <- glue$get_databases()
  cdbs <- sapply(cresults$DatabaseList,function(x) x$Name)
  
  citers <- 0
  while(length(cresults$NextToken) > 0 & citers < 1000) {
    
    cresults <- glue$get_databases(NextToken = cresults$NextToken)
    cdbs <- c(cdbs, sapply(cresults$DatabaseList,function(x) x$Name))
    citers <- citers + 1
  }
  
  cdbs
}

(I tend to get nervous with these while loops so I added the max iteration count there...)

@DyfanJones
Copy link
Owner

Sorry @OssiLehtinen I am not a 100% sure what is the issue. Can you get me an example of what is happening? Is the issue not all databases and tables aren't being returned. Or table names are getting truncated?

@OssiLehtinen
Copy link
Contributor Author

Ah sorry for being unclear. The issue is that all databases and tables are note returned. For example, the connection panel in Rstudio shows only the first 23 databases, even though many more exist in the catalog.

@DyfanJones
Copy link
Owner

Ah I see :) i will have a look into it. Strange it is only displaying that top 23 data bases, if i remember correctly the glue call should return a 1000 🤔.

But you are right it does need the Next token as it won't return everything.

Just thinking out loud (don't know if it is true so don't quote me 😆) is there a restriction on the rstudio connection tab? As in the number of databases and tables it will show

@OssiLehtinen
Copy link
Contributor Author

Yeah it is pretty strange, and I don't quite understand why we're getting just the first 23 databases (even if we explicitly set MaxResults = 1000L in the paws function call). 23 is a pretty strange number of results to return.

It shouldn't be an issue with rstudio as the issue is there also with a direct call to glue$get_databases() and in an R session in a terminal.

BTW, boto3 has the same issue, so it doesn't seem to be a specific bug with paws either. Well, fo of course it is in principle a bug with both as the MaxResults parameter isn't really respected. aws-cli returns all the databases, however, but maybe there is some built in pagination handling.

p.s. I guess someone might have 1000+ databases (not saying that anyone should :D) and then pagination would be needed even if things were working correctly...

@DyfanJones
Copy link
Owner

Interesting, i guess i need to make 24 temporary databases in my athena just to see this happening :D hahaha

@OssiLehtinen
Copy link
Contributor Author

Haha, maybe you would see it, maybe not. The strange thing is that this happens only in one of our environments/accounts. One specific feature there is that we have quite a lot of content shared via lake formation to that account, but the 'breaking point' of the database list doesn't seem to correlate with anything.

@DyfanJones
Copy link
Owner

does this happen in dbGetTables and dbListTables. I think i will need to update these methods accordingly

@OssiLehtinen
Copy link
Contributor Author

Yes, the same issue there. Must be everywhere where glue$get_databases() is used.

@DyfanJones
Copy link
Owner

Interesting, and does this only happen when you are working in that environment?

@OssiLehtinen
Copy link
Contributor Author

Yes, only in there.

@DyfanJones
Copy link
Owner

hmmm interesting. Are the permissions to aws glue the same in that environment compared to your others?

@DyfanJones
Copy link
Owner

Or is thats glue catalogue sync up correctly with your others?

@OssiLehtinen
Copy link
Contributor Author

Well there certainly could be something weird going on especially with the permissions granted through lake formation, but haven't spotted anything relevant so far. And when we do another get_databases() query with NextToken we get results for the rest of the tables, so no fundamental problem with permissions in my opinion.

I will let you know if we make some discoveries about what might cause this.

@DyfanJones
Copy link
Owner

Well next token needs to be added to these functions. Hopefully this will fix it going forward :) very interesting problem as i haven't come across it before :)

@DyfanJones DyfanJones self-assigned this Jan 29, 2021
@DyfanJones DyfanJones added the bug Something isn't working label Jan 29, 2021
@OssiLehtinen
Copy link
Contributor Author

Perhaps a helper like this could be handy?:

glue <- paws::glue()

get_paginated_from_paws <- function(paws_fun, token_field = "NextToken", content_node = NULL, ...) {
  
  cargs <- as.list(match.call())
  cargs[[1]] <- NULL
  cargs[["paws_fun"]] <- NULL
  cresults <- do.call("paws_fun", cargs)
  
  if(is.null(content_node)) content_node <- names(cresults)[1]
  
  ccontent <- cresults[[content_node]]
                 
  citers <- 0
  while(length(cresults[[token_field]]) > 0 & citers < 100) {
    
    ccargs <- cargs
    ccargs[[token_field]] <- cresults[[token_field]]
    
    cresults <- do.call("paws_fun", ccargs)
    ccontent <- c(ccontent, cresults[[content_node]])
    citers <- citers + 1
  }
  
  ccontent
}


get_paginated_from_paws(glue$get_databases)
sapply(get_paginated_from_paws(glue$get_databases), function(x) x$Name)

get_paginated_from_paws(glue$get_table, DatabaseName = "mydatabase", Name = "mytable")

The idea would be that you can give it different functions such as glue$get_databases or glue$get_table as an argument and the function will handle the looping. Some assumptions are made about the structure of the query results. Shouldn't make much overhead if NextToken is empty after the first call.

@DyfanJones
Copy link
Owner

@OssiLehtinen thanks :) I will have a look to see how this works in practise :D

@DyfanJones
Copy link
Owner

@OssiLehtinen are you available to test if branch aws-table-list fixes the issue you are coming across in your environment?

# Installation method for branch: aws-table-list
remotes::install_github("dyfanjones/noctua", ref="aws-table-list")

This works in my environment but I would like to make sure it is fixed in yours :)

@DyfanJones
Copy link
Owner

Thanks for this :) I have created 2 helper functions for the glue query as I wanted to format the returning response from AWS. This is to help with reducing the number of loops required to parse the named lists, and help with performance :)

Perhaps a helper like this could be handy?:

glue <- paws::glue()

get_paginated_from_paws <- function(paws_fun, token_field = "NextToken", content_node = NULL, ...) {
  
  cargs <- as.list(match.call())
  cargs[[1]] <- NULL
  cargs[["paws_fun"]] <- NULL
  cresults <- do.call("paws_fun", cargs)
  
  if(is.null(content_node)) content_node <- names(cresults)[1]
  
  ccontent <- cresults[[content_node]]
                 
  citers <- 0
  while(length(cresults[[token_field]]) > 0 & citers < 100) {
    
    ccargs <- cargs
    ccargs[[token_field]] <- cresults[[token_field]]
    
    cresults <- do.call("paws_fun", ccargs)
    ccontent <- c(ccontent, cresults[[content_node]])
    citers <- citers + 1
  }
  
  ccontent
}


get_paginated_from_paws(glue$get_databases)
sapply(get_paginated_from_paws(glue$get_databases), function(x) x$Name)

get_paginated_from_paws(glue$get_table, DatabaseName = "mydatabase", Name = "mytable")

The idea would be that you can give it different functions such as glue$get_databases or glue$get_table as an argument and the function will handle the looping. Some assumptions are made about the structure of the query results. Shouldn't make much overhead if NextToken is empty after the first call.

@OssiLehtinen
Copy link
Contributor Author

Everything seems to work correctly now. The connections tab shows all the databases and also dbListTables and dbGetTables show the full list now. So, great work again, thanks!

p.s. the rstudio_conn_tab = F option really comes handy, as reading the whole glue catalog is even slower now (obviously it takes more time to actually read in the complete catalog :) )

@DyfanJones
Copy link
Owner

You must have a fairly large glue catalogue for it to be fairly slow :D . From the R side of things it should be nice and fast when parsing the lists 😄 (Well as fast as I know how currently 😋)

🤔 How is dbGetTables or dbListTables? Do they run slow-ish?

@OssiLehtinen
Copy link
Contributor Author

Yeah, it's some 1200 tables or so at the moment. But I think using Lake Formation is the final nail in the coffin (quite a few of those tables are shared from other accounts on this specific account, where things are especially slow).

I totally agree the slowness shouldn't be an R side issue. I think I will raise an issue with AWS about it.

@OssiLehtinen
Copy link
Contributor Author

Oh, dbListTables and dbGetTables are equally slow. Also 'select * from information_schema.tables'.

@DyfanJones
Copy link
Owner

@OssiLehtinen Thanks for giving an update on this 😄

DyfanJones added a commit that referenced this issue Feb 2, 2021
…ion (#137)

Increased test coverage through the use of wrapper functions for raising errors.
@DyfanJones
Copy link
Owner

PR #139 fixes this issue 😄

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants