Querying large data sets not working #54

Closed
nbahnson opened this issue May 27, 2020 · 12 comments

nbahnson commented May 27, 2020

Hello,

Thanks for working on this package! It's really great to be able to analyze Salesforce data in R, and this package makes that process a lot smoother. I am currently trying to use salesforcer to query ~1 million rows.

When I use sf_query with the REST API, any query above 2000 rows hangs indefinitely. I was able to fix the hanging by updating this part of the sf_query_rest function:

if (!is.null(next_records_url)) {
    next_records <- sf_query_rest(next_records_url = next_records_url, 
      control = control, verbose = verbose, ...)
    resultset <- bind_rows(resultset, next_records)
  }

to be

if (!response_parsed$done) {
    next_records <- sf_query_rest(next_records_url = next_records_url, 
                                  control = control, verbose = verbose, ...)
    resultset <- bind_rows(resultset, next_records)
  }

With that change it now works for queries over 2000 rows. I was able to use it to pull my full data set correctly (though it certainly takes a while). Does this fix make sense, or might it cause unforeseen issues?

When I try to use the Bulk API ("Bulk 1.0") via sf_query_bulk, it only returns about 1/6 of the total data - I'm not sure what's happening there. I also get errors if I try to use "Bulk 2.0": it is mentioned in the documentation, but when I try to pass it as a function argument I get an error telling me it's not a valid argument. Apologies if I am simply using the function incorrectly.

Do you know why Bulk 2.0 isn't working, or why the normal bulk query is not returning the full data set?

@StevenMMortimer (Owner)

@nbahnson Thanks for the update and for troubleshooting a bit yourself. It's interesting that you found things work well if you check !response_parsed$done. Have you checked what the value of next_records_url is in that case? I'm wondering what URL is being used if !is.null(next_records_url) does not work but !response_parsed$done does. I would think that implies the URL is NULL, which should restart pulling query data from the original query and start returning duplicate rows.

salesforcer/R/query.R

Lines 130 to 139 in 2954c4b

if(!response_parsed$done){
  next_records_url <- response_parsed$nextRecordsUrl
}
# check whether it has next record
if(!is.null(next_records_url)){
  next_records <- sf_query_rest(next_records_url = next_records_url, control = control,
                                verbose = verbose, ...)
  resultset <- bind_rows(resultset, next_records)
}

Regarding the Bulk API, I would highly recommend it over the REST API for queries that return anything above 10K records or so. I'm surprised you're not able to get it to work, so two suggestions:

  1. I'd recommend using sf_query(..., api_type = "Bulk 1.0") instead of calling sf_query_bulk() directly, just in case there are any less user-friendly elements or bugs in that function, which is more of a helper utility called by sf_query() (see the sketch after this list).
  2. I'd recommend running your query from R and then monitoring it in Salesforce's Bulk API dashboard to see what it reports there as well. Your Bulk API dashboard is available at the URL: https://{{your.instance}}.salesforce.com/750. Feel free to let me know how it goes.
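
For reference, a minimal usage sketch of suggestion 1 (the object and fields are placeholders, and this assumes you have already authenticated):

library(salesforcer)
sf_auth()  # authenticate first

# run the same SOQL through the Bulk 1.0 API instead of REST
dat <- sf_query("SELECT Id, Name FROM Account", api_type = "Bulk 1.0")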


nbahnson commented Jun 5, 2020

@StevenMMortimer thanks for the response!

As for the REST API:
When I change the code back to the package code, next_records_url just keeps repeating the last URL. For example, if I query 4500 rows with the REST API, these are the next_records_url values:
"/services/data/v46.0/query/%leavingblankincasethisisidentifying%-2000"
and then infinite repeats of
"/services/data/v46.0/query/%leavingblankincasethisisidentifying%-4000"
and it never exits the loop, thereby returning duplicate rows as you said.

It seems this is because when response_parsed$done is TRUE, next_records_url isn't reset to what should be its final NULL value, so the loop can never be closed. When I trace through the logic like this, it seems the loop can never exit if more than 2000 rows are queried with the REST API. Am I missing something? Can this be corrected in the package?
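
Something like the following would let the recursion terminate (just a sketch using the same variable names as the package code, not a tested patch):

if (!response_parsed$done) {
  next_records_url <- response_parsed$nextRecordsUrl
} else {
  next_records_url <- NULL  # reset so the recursion stops on the last page
}
if (!is.null(next_records_url)) {
  next_records <- sf_query_rest(next_records_url = next_records_url,
                                control = control, verbose = verbose, ...)
  resultset <- bind_rows(resultset, next_records)
}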

As for the Bulk API:

  1. I get the same results using sf_query(..., api_type = "Bulk 1.0"), which again returns only ~115k of ~715k rows (a count I can verify by looking at Salesforce directly or by using my modified REST API call).
  2. I looked at the Bulk API dashboard (thanks for that tip - I didn't know about it). It says that my query processed ~715k records, and when I download the results through the dashboard I see the ~715k rows I was expecting. This seems to imply that the R Bulk API code is not downloading the full results. Do you have any idea why that might be?

Is there some other method in this package that I could use to correctly download a large query? I would try Bulk 2.0, but as I mentioned earlier it is not available as an option when I actually call sf_query (I get an error saying it's not a valid option). For now I am pulling my data from another source, but I would love to be able to query Salesforce directly from R.

@StevenMMortimer (Owner)

@nbahnson Thanks for the update! Sorry to hear that things aren't working :(

It may be that the issue is related to the size of the data returned by the SOQL query, which may or may not track the number of records. The limits are noted HERE. Also, someone previously mentioned that the types of fields being pulled will affect the results as well: #35 (comment). I never really noticed or appreciated that, but maybe that is why it hangs at that specific number?

As far as the Bulk API goes, I'm trying to think through the best way to debug it. One option is running the individual internal functions and checking the output - specifically, a routine like the one below, checking that every single batch is being tracked and only pulled once it is "Complete":

my_query <- "SELECT .... FROM ...."

# 1. create a bulk job for the 'query' operation on the target object
job_info <- sf_create_job_bulk(operation = 'query', object = '..........')

# 2. submit the SOQL as a batch within that job
query_info <- sf_submit_query_bulk(job_id = job_info$id, soql = my_query)

# 3. check the batch details and grab the result id once the batch is complete
result <- sf_batch_details_bulk(job_id = query_info$jobId,
                                batch_id = query_info$id)

# 4. download the recordset for that result id
recordset <- sf_query_result_bulk(job_id = query_info$jobId,
                                  batch_id = query_info$id,
                                  result_id = result$result)

# 5. close the job when finished
sf_close_job_bulk(job_info$id)

Really appreciate the help to debug!


nbahnson commented Jun 5, 2020

@StevenMMortimer

In your first comment about the specific number - is that in regards to the REST API? Are you talking about the specific number of 2000? Apologies for my loose language: that number comes up because it is the number of rows returned per URL. When you use the REST API for anything over 2000 rows, you have to make subsequent calls and then stitch the results together, and that recursion is where the function breaks.

That's why the recursive call is there in the first place: to get around the 2000-row limit of the REST API, consistent with the documentation you linked (apologies if you already know this and I am misunderstanding your response). Again, I was able to fix the recursion issue just by changing the conditional logic, so I think that is the only problem here. I have a working local version of the function, which makes me believe it is unrelated to any nuanced SOQL quirks. It is just a logic error that sends the function into an infinite loop whenever it has to make a recursive call, which again only happens when more than 2000 rows are queried with the REST API.

The REST API issue seems pretty clear - I'll definitely work on debugging the Bulk API. I appreciate the help! I'd love to get these fixes into the package so I don't have to have my own versions floating around my projects.

StevenMMortimer self-assigned this Jun 8, 2020
StevenMMortimer added this to the 0.2.0 milestone Jun 8, 2020
StevenMMortimer added the bug label (Unintended behavior that should be corrected) Jun 8, 2020

nbahnson commented Jun 9, 2020

Hey @StevenMMortimer

I was able to debug the Bulk API as well!

The issue is in the sf_query_result_bulk function (any line # references below are relative to that function).

The bulk jobs complete fine, but that function only returns part of the data. My data is returned as content type "text/csv", so it follows that logic fork at line 18. I deconstructed the function and found that this line, at line 14:

response_text <- content(httr_response, as = "text", encoding = "UTF-8")

is what returns only ~1/6 of the data, as I described. I thought this was odd, since the httr_response actually had the full data set in it, but the content function was only returning part of it. If I instead parse the httr_response using content(recordset, type = "text/csv"), I get the full data set back.

From my testing, it looks like httr_response$headers$`content-type` can be passed directly into the content function, as follows:

content_type <- httr_response$headers$`content-type`  # replaces line 14
res <- content(httr_response, type = content_type)

This would simplify the code by making the conditional fork on content type unnecessary (starting at line 15).
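
As a side note on why this works (my understanding of httr, not anything from the package docs): content() dispatches a registered parser based on the MIME type, so "text/csv" is handed to readr and parsed in one pass. A tiny standalone illustration, with a hypothetical URL:

library(httr)

resp <- GET("https://example.com/data.csv")  # any endpoint returning CSV

content_type <- resp$headers$`content-type`
parsed <- content(resp, type = content_type)  # text/csv is parsed by readr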


nbahnson commented Jun 9, 2020

So I now have a working version of the Bulk API code as well. Let me know if any of my comments need clarifying! Thanks for taking a look at this.

@StevenMMortimer (Owner)

@nbahnson Wow! That's awesome. Much appreciated. I'm going to release the next version on CRAN within a couple days, so this will be shipped with it! Thanks again!


nbahnson commented Jun 9, 2020

@StevenMMortimer Glad I could help, and thank you for jumping on this so quickly!

I'm trying to get my coworkers to use R so I'm happy to help where I can :)

Let me know if I can help test anything!


mjvtnm commented Jun 11, 2020

The changes made today helped a bit - at least the recursion now stops. The problem is that the sub-call returns parsed data (after type.convert), which is then combined with the raw data, so if there is, say, a date variable, bind_rows fails. A quick fix seems to be setting guess_types = FALSE. Maybe that should be the default for the recursive calls (line 227 etc.)? The first call would do the type conversions eventually anyway.
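
For example, this standalone snippet (illustrative only, not package code) reproduces the failure mode:

library(dplyr)

page1 <- tibble(CreatedDate = "2020-06-11")           # raw character from the first call
page2 <- tibble(CreatedDate = as.Date("2020-06-11"))  # parsed by the recursive sub-call

bind_rows(page1, page2)  # fails: <character> and <date> columns can't be combined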


StevenMMortimer commented Jun 11, 2020

Thanks, @mjvtnm! I noticed this when the build failed last night and figured I would tackle it in the morning. You can see the new flow here that casts as character:

salesforcer/R/query.R

Lines 123 to 131 in 00f4a5a

response_parsed <- content(httr_response, "text", encoding = "UTF-8")
response_parsed <- fromJSON(response_parsed, flatten = TRUE)
if(length(response_parsed$records) > 0){
  resultset <- response_parsed$records %>%
    select(-matches("^attributes\\.")) %>%
    select(-matches("\\.attributes\\.")) %>%
    as_tibble() %>%
    mutate_all(as.character)

In this case, I'm converting all columns to character before binding, then converting them back if requested by the user. This ensures that the bind operation will not fail because of incompatible types.

And in the last iteration, I am guessing the types if requested by the user:

salesforcer/R/query.R

Lines 147 to 150 in 00f4a5a

# cast the data in the final iteration if requested
if(is.null(next_records_url) & (nrow(resultset) > 0) & (guess_types)){
  resultset <- resultset %>%
    type_convert(col_types = cols(.default = col_guess()))

NOTE: Unfortunately, fromJSON already does some type conversion internally. I cannot think of a fix right now that would ensure the raw text returned from the API call is not reformatted by fromJSON. This is only a problem when a user needs something like a really long integer kept as a string, as noted in #40. But the current strategy seems to have addressed that specific case for now, so I'll continue to monitor the issues while I think about a better solution.
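
For anyone following along, here is a standalone illustration of the cast-then-convert strategy (dplyr/readr only, not the package's exact code):

library(dplyr)
library(readr)

# two "pages" of results, all columns kept as character so binding cannot fail
pages <- list(
  tibble(Id = c("001", "002"), CreatedDate = c("2020-06-01", "2020-06-02")),
  tibble(Id = "003", CreatedDate = "2020-06-03")
)

resultset <- bind_rows(pages)

# guess column types once, after the final page has been bound
resultset <- type_convert(resultset, col_types = cols(.default = col_guess()))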

StevenMMortimer modified the milestones: 0.2.0, 0.1.4 Jun 11, 2020
@StevenMMortimer (Owner)

@nbahnson @mjvtnm This issue should be fixed in the latest release of the package (0.1.4) that is now on CRAN. Please let me know if there are still problems and we can work through them. If there are no lingering issues or I don't hear back in a couple weeks, then I'll close this issue.

Thanks again for your assistance and your patience! 👍

@nbahnson (Author)

I've been able to use it a bit over the last week - it seems to work well! Thanks for jumping on this and getting a fix out so fast :)
