Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix #166 and issue w/ row-binding results when some elements are empty #167

Closed
wants to merge 17 commits into from

Conversation

elipousson
Copy link
Contributor

@elipousson elipousson commented Mar 11, 2024

Checklist

  • update NEWS.md
  • documentation updated with devtools::document()
  • devtools::check() passes locally

Changes

I'll update this description with more detail later this evening.

  • Fix Improve handling of filter_geom inputs with "MULTIPOLYGON" geometry #166 (and add tests for prepare_spatial_filter()
  • Fix regression in arc_read() created by arc_select() now always returning OBJECTID (new issue sometime in last month?)
  • Fix handling of empty data frames in results data frame list by collect_layer()
  • Additional refactoring to improve consistency of input checks and add helper functions

Issues that this closes

See changes. I'd add more GH issues but my existing reprex is gnarly and I don't know if/when I'll have time to pull together a good one. The arc_read() regression is evident in the existing errored tests – it maybe should get fixed in {arcgisutils} if you can tell what the root source of the problem is.

Follow up tasks

Update README with correct information on handling of filter_geom. That change could easily be added to this PR along with the updates to NEWS.

`do.call(rbind, res)` failed when some but not all elements of `res` are empty (issue w/ combining valid CRS with NA values)

This refactor adds `rbind_results` and `rbind_res_list` helper functions for improved handling of results lists.

Refactoring also includes:

- Passing call and error_call to arcgisutils functions
- Add set_page_size helper function to consolidate related input checks
- Expose error_call arg in count_results
- Expose on_error and progress parameters for `httr2::req_perform_parallel` (including setup for arcgislayers.progress option)
R/arc-select.R Outdated
Comment on lines 532 to 540
rbind_res_list <- function(x, error_call = rlang::caller_env()) {
# TODO: enhance this with additional suggested packages
if (rlang::is_installed("vctrs")) {
x <- vctrs::vec_rbind(!!!x, error_call = error_call)
} else {
x <- do.call(rbind.data.frame, x)
}
x
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be worth extracting this into arcgisutils
https://github.com/R-ArcGIS/arcpbf/blob/main/R/post-process.R#L109-L121

squish_dfs <- function(x) {
  if (rlang::is_installed("collapse", version = "2.0.0")) {
    x <- collapse::rowbind(x)
  } else if (rlang::is_installed("data.table")) {
    x <- data.table::rbindlist(x)
    data.table::setDF(x)
  } else if (rlang::is_installed("dplyr")) {
    x <- dplyr::bind_rows(x)
  } else {
    x <- do.call(rbind.data.frame, x)
  }
  x
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the speed boost for including data.table or collapse worth the added depencies? I'm mostly agnostic but didn't want to add a bunch of new dependencies without some clear confirmation.

FWIW – I've also been avoiding dplyr::bind_rows() ever since purrr::map_dfr() was superseded (and cited the "confusing semantics with edge cases" for bind_rows the updated description).

if (is.null(page_size)) {
return(max_records)
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to ensure that the type is numeric. Because this might error or return NA which is not checked later

Suggested change
if (!is.numeric(page_size)) {
cli::cli_abort("{.arg page_size} must be a numeric scalar not {.obj_type_friendly {class(page_size)}}", call = error_call)
}

R/arc-select.R Outdated

if (page_size > max_records) {
cli::cli_abort(
"{.arg page_size} ({page_size}) cannot excede layer's {.field maxRecordCount} property ({max_records})",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmmmm. I agree with this check. But the presence of a default value for max_records might mean that if we are I am lazy we might report the wrong number (25000) instead of the layers value.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default value should not be used as long as the layer has a maxRecordsCount value but I dropped it regardless.

Comment on lines +507 to +513
empty_res <- vapply(res, rlang::is_empty, FALSE)

if (all(empty_res)) {
cli::cli_alert_info("No features returned from query")
return(data.frame())
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have a repro for this scenario? We should have a test for it. Because I feel like I fixed a related issue with parse_esri_json() not too long ago! R-ArcGIS/arcgisutils#21

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll see if I can get it to trigger. The existing code checked res after the row-binding was complete to see it it was NULL and then returned an empty data frame - this should catch the same situation. Ideally, it would have no features but the structure of the data frame with columns and CRS would be identical to the request if it had been successful.

@JosiahParry
Copy link
Collaborator

Thank you!! I've provided some initial thoughts. I'm not all here mentally right now as I'm at our Esri DevSummit to talk about this. :) I'll give it a more thorough review next week.

Regarding the MULTIPOLYGON situation, I might bother the lead dev for the ArcGIS API for Python while I'm here. There may be something else that we could be doing to support it more simply. One option I have in my head is that we can take the sfc and st_cast(filter_geom, "POLYGON") and then run the arc_select() function for each of the polygons and rbinding them. This would make the function recursive though 😨 . Alternatively, we can error and provide a very thorough set of examples in the @details doc that illustrate different techniques for handling it. Because if we can use multipolygons too well, finding a panacea might be a bit tough

Also:

- Correct typo in vctrs::vec_rbind call: error_call -> .error_call
- Add validate_n_feats helper to consolidate cli_abort calls
`arc_select()` is now returning OBJECTID column in every query (even when fields is supplied)
Partial match errored
- Work w/ NULL n_feats values
- Ensure numeric value for page_size and max_records
- Remove default value for max_records
@elipousson
Copy link
Contributor Author

@JosiahParry no rush on my account, I ran into both of these bugs with a project today and figured I'd just get it fixed and open the PR while I had it on my mind. Hope the conference goes well!

Curious to hear what your colleague recommends but I would strongly recommend against having the filter_geom error on MULTIPOLYGON inputs since so sf::st_read() promotes POLYGON geometry up to MULTIPOLYGON by default when reading from a file. The original solution actually worked OK if all the geometries had some overlap. So we could add that back in as a test and use the covex hull method only if it doesn't meet that condition.

- Add `match_fields` helper to improve checks for fields (check length 1 values)
- Move collapsing fields to validate_params
- Avoid returning Object ID field if not selected
- Restore sf class after dropped by vctrs::vec_rbind
- Simplify update_params internals
Also

- avoid duplicate call to validate_params w/in count_results
- disable call to rbind_res_list (issue w/ dropping sf class wasn't fixed by last)
Changes to arc_select drop object ID as expected
Also guard against attempting to drop names from empty data frame
@JosiahParry
Copy link
Collaborator

@elipousson I have time in the next few days to review this. Do you mind trying to write out what the objective of this PR is? It's a fairly large PR which makes it a bit tough to review!

@elipousson
Copy link
Contributor Author

It is largely still the same as in the initial description – I ran into a handful of small bugs and found more when I went to fix them. Here are the issues addressed by this PR:

  • collect_layer() errored due to the responses including sf objects and empty data frames
  • arc_select() handles filter_geom MULTIPOLYGON inputs in ways that did not work as I originally expected (not sure the improved way is perfect either)
  • arc_read() suffered a regression due to changes in arc_select() that always returned an Object ID column (even when it wasn't included in fields)
  • arc_read() had an existing issue where it was not possible to select a single column named alias

There is also some additional minor refactoring with no change in features that I addressed along the way including adding the rbind_results, rbind_res_list, and match_fields helper functions.

@JosiahParry
Copy link
Collaborator

JosiahParry commented Mar 21, 2024

I'm having a fair bit of trouble reviewing this PR because it's so large and broad. This PR is a bit too sweeping to merge at this moment. I think we should break this into smaller more bite-sized pieces that are a bit easier and clearer to tackle. Right now it's all a bit too entwined. I can try and piece apart what the objectives are and then we can start smaller?

Some of the things that this PR does that I think can be broken down into smaller PRs. I'll work on creating issues.

@elipousson
Copy link
Contributor Author

Thanks for creating all those issues! Totally understand the need to split it up. I need to get a better handle on branching so I can keep developing on my fork and throw things into branches as appropriate.

As is, these mega-PRs are just me fixing things on main as I work through an issue which can be impractical for collaboration. Did you want me to close this PR after I get the more focused ones opened? Or rename this one after pulling out anything beyond the initial scope?

@JosiahParry
Copy link
Collaborator

JosiahParry commented Mar 24, 2024

Let's keep this open for now or convert to draft. I think it would be nice to crib from this for each of the PRs.

I think first of all, creating a utility row-binding function would be most immediately useful (selfishly as I'm working on arcgeocode).

What I think would be easiest / most impactful in order:

  1. Row-binding utility (squish_df helper arcgisutils#38 (comment))
    • including handling empty results
  2. FeatureLayer's return fields when none provided #175
  3. Support use of field alias values as column names or label attributes by arc_read() #169
  4. Consistent type checking—i've become a fan of the standalone type checks
  5. Improve handling of filter_geom inputs with "MULTIPOLYGON" geometry #166

@elipousson
Copy link
Contributor Author

@JosiahParry I have not forgotten about this – just been busy with work. I was able to pull a new branch off from main so I can use that to implement the changes and open a new PR (one at a time unless the changes cover completely separate files) for each of those 5 items.

@JosiahParry
Copy link
Collaborator

@elipousson no worries! I'm a bit distracted as well. I've done #1 in https://github.com/R-ArcGIS/arcgisutils/pull/42/files

@elipousson
Copy link
Contributor Author

Closing this one out since all of the changes are either merged or included with #186

@elipousson elipousson closed this Jun 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve handling of filter_geom inputs with "MULTIPOLYGON" geometry
2 participants