Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Misleading errors when reading sparse data without a record_id column #471

Open
pbchase opened this issue Feb 21, 2023 · 13 comments
Open

Misleading errors when reading sparse data without a record_id column #471

pbchase opened this issue Feb 21, 2023 · 13 comments

Comments

@pbchase
Copy link
Contributor

pbchase commented Feb 21, 2023

I am seeing odd behavior from redcap_read when reading a single form of data. This is with REDCap 1.1.0, R 4.2.2, REDCap 12.4.2. If I read a sparsely-filled form--a form that does not have a row of data for every person described by the record_id column--I get this confusing error message:

> result <- REDCapR::redcap_read(
+   batch_size = 1000,
+   redcap_uri = source_credentials$redcap_uri,
+   token = source_credentials$token,
+   forms = c("next_visit_form")
+ )
The data dictionary describing 9,457 fields was read from REDCap in 1.3 seconds.  The http status code was 200.
3,515 records and 2 columns were read from REDCap in 0.8 seconds.  The http status code was 200.                                                                                                        
Starting to read 772 records  at 2023-02-21 10:51:53.
Reading batch 1 of 1, with subjects 110001 through A17-001 (ie, 772 unique subject records).
3,515 records and 5 columns were read from REDCap in 0.9 seconds.  The http status code was 200.                                                                                                        

── Column specification ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
cols(
  visitdate_ideal = col_date(format = ""),
  visitdate_earliest = col_date(format = ""),
  visitdate_latest = col_date(format = ""),
  visitdate_scheduled = col_date(format = ""),
  next_visit_form_complete = col_double()
)

Error in REDCapR::redcap_read(batch_size = 1000, redcap_uri = source_credentials$redcap_uri,  : 
  There are 772 subject(s) that are missing rows in the returned dataset. REDCap's PHP code is likely trying to process too much text in one bite.

Common solutions this problem are:
  - specifying only the records you need (w/ `records`)
  - specifying only the fields you need (w/ `fields`)
  - specifying only the forms you need (w/ `forms`)
  - specifying a subset w/ `filter_logic`
  - reduce `batch_size`

The missing ids are:
110001,110002,110003,110004,110005,110006,110007,110008,110009,110010,110011,110012,110013,110014,110015,110016,110017,110018,110019,110020,110021,110022,110023,110024,110025,110026,110027,110028,110029,110030,110031,110032,110033,110034,110036,110037,110038,110039,110040,110041,110042,110043,110044,110045,110046,110047,110048,110049,110050,110051,110052,110053,110054,110055,110056,110057,110058,110059,110060,110061,110062,110063,110064,110065,110066,110067,110068,110069,110070,110071,110072,110073,110074,110075,110076,110077,110078,110079,110080,110081,

All of the suggestions are misleading. The real fix is to add the record_id column (ptid in my case) e.g.,

> result <- REDCapR::redcap_read(
+   batch_size = 1000,
+   redcap_uri = source_credentials$redcap_uri,
+   token = source_credentials$token,
+   fields = c("ptid"),
+   forms = c("next_visit_form")
+ )
The data dictionary describing 9,457 fields was read from REDCap in 1.2 seconds.  The http status code was 200.
3,516 records and 2 columns were read from REDCap in 0.6 seconds.  The http status code was 200.                                                                                                        
Starting to read 773 records  at 2023-02-21 11:08:12.
Reading batch 1 of 1, with subjects 110001 through A17-001 (ie, 773 unique subject records).
3,516 records and 7 columns were read from REDCap in 0.8 seconds.  The http status code was 200.                                                                                                        

── Column specification ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
cols(
  ptid = col_character(),
  redcap_event_name = col_character(),
  visitdate_ideal = col_date(format = ""),
  visitdate_earliest = col_date(format = ""),
  visitdate_latest = col_date(format = ""),
  visitdate_scheduled = col_date(format = ""),
  next_visit_form_complete = col_double()
)

Whereas I say the error message is misleading, I'm not sure adjusting it is the fix. I feel like we used to always get the record_id column even if we never asked for it. That makes sense. Reading data without its primary key is silly unless you are anonymizing data. That hardly seems like a good default. Did that change? Was it supposed to?

@wibeasley
Copy link
Member

Did that change? Was it supposed to?

Yes, I think you're right, but I'm not seeing anything pop when I quickly skim the release notes since v8). I feel it happened about the same time that the record_id would be hashed if the user didn't have PHI rights.

There's always a chance that I'm wrong and REDCap is returning the field but REDCapR is not. I can look deeper if it matters.

Either way, I'm leaning in favor of letting the user decide to omit the variable (even if it's more frequently an accident than intentional).

Do you have a strong preference to always include it? Are you still in favor of just adding another bullet in the error message quoted above --it sounded like that was your first reaction.

@pbchase
Copy link
Contributor Author

pbchase commented Feb 22, 2023

I would honor what the API does. If it omits the record_id, when the caller does not name it, honor that.

As for what action to take, I think you should replace the hard stop with a warning. To the warning text, add a bullet that says something like, "...or perhaps the fields and forms you requested are sparse" and also change the hard stop to a warning. If REDCapR's assessment of the situation could be wrong, the hard stop causes problems with a non-obvious fix. I found the hard stop annoying more than anything. I knew very well the form I was querying was sparse and REDCapR wouldn't let me run my query. If I could have seen the result, I could assess that sparseness, decide if the error messages were accurate or not, note that the record_id column was missing from data, add it, and even address the "problem".

@wibeasley
Copy link
Member

wibeasley commented Apr 23, 2023

I think the real fix is to unambiguously determine the name of record_id if it's been renamed. I've posted this feature/question on Community: https://redcap.vanderbilt.edu/community/post.php?id=203859.

This would prevent the guessing of the record_id field, which in this scenario will always fail because it is not requested/returned.

REDCapR/R/redcap-read.R

Lines 448 to 449 in 18b9663

unique_ids_actual <- sort(unique(ds_stacked[[id_position]]))
ids_missing_rows <- setdiff(unique_ids, unique_ids_actual)

@simX
Copy link

simX commented Jul 5, 2023

FWIW, I just migrated to a new Mac where I did a fresh install of R, RStudio, and REDCapR, and I am seeing this exact issue, whereas it's not happening on my older Mac (presumably because I haven't updated the REDCapR package). I troubleshooted myself and came to the same conclusion, that you need to include the name of the record_id field and include it in the list of fields you request, otherwise you get the misleading error posted in the initial issue. I do also have one project where record_id has been renamed to something else, so that threw me for a loop for a bit.

I have no real preference on whether the record_id field should always be included, though I don't think it hurts to always include it since it's the de facto id field for records in that project. But the user of REDCapR should not be required to request this field to get any data at all, as I don't think that's clear from the documentation (which makes sense since this is a recent change in behavior).

@simX
Copy link

simX commented Jul 5, 2023

FYI I just checked on my other computers: this issue does not occur with REDCapR package version 1.0.0. It does occur with version 1.1.0. I used getNamespaceVersion("REDCapR") to figure out the currently loaded package version.

@pbchase
Copy link
Contributor Author

pbchase commented Jul 6, 2023

I think the real fix is to unambiguously determine the name of record_id if it's been renamed. I've posted this feature/question on Community: https://redcap.vanderbilt.edu/community/post.php?id=203859.

This would prevent the guessing of the record_id field, which in this scenario will always fail because it is not requested/returned.

@wibeasley, the record_id column is always the first row in the metadata. Don't you always query the metadata? If so, you would have the name of the record_id field name before you queried for data.

If that feels too hacky, you could write your own method to read the record_id column from the metadata, then reference that method to get the record_id column name. Then you have the easy option to revise the method when a better, REDCap-official way is released.

@simX
Copy link

simX commented Jul 6, 2023

Is it worth trying to figure out why version 1.0.0 of the REDCapR package still works in the previous fashion, without requiring you to include the record_id field name in the list of fields to query?

@wibeasley
Copy link
Member

I'll have time tomorrow to look at this closely. I remember starting on this issue at least. The id_position parameter was my solution until I REDCap supports something like https://redcap.vanderbilt.edu/community/post.php?id=203859. @pbchase, does this address what you're suggesting

https://ouhscbbmc.github.io/REDCapR/reference/redcap_read.html

image

@simX, in the meantime, can you try the version on GitHub and see if it does what you want? Install with remotes::install_github("OuhscBbmc/REDCapR").

@pbchase
Copy link
Contributor Author

pbchase commented Jul 6, 2023

@pbchase, does this address what you're suggesting

https://ouhscbbmc.github.io/REDCapR/reference/redcap_read.html

@wibeasley, yes, the id_position parameter will work fine.

@wibeasley
Copy link
Member

@simX, I think this has been fixed for a while with the main version on GitHub. Here's how the two read functions operate. Is this a problem for your scenario?

uri      <- "https://bbmc.ouhsc.edu/redcap/api/"
token    <- "22C3FF1C8B08899FB6F86D91D874A159" # pid 3181

REDCapR::redcap_read_oneshot(
  redcap_uri  = uri, 
  token       = token,
  forms       = "blood_pressure",
  verbose     = FALSE 
)$data
#> # A tibble: 6 × 3
#>     sbp   dbp blood_pressure_complete
#>   <dbl> <dbl>                   <dbl>
#> 1   1.1  11.1                       2
#> 2   1.2  11.2                       2
#> 3   1.3  11.3                       2
#> 4   2.1  22.1                       2
#> 5   2.2  22.2                       2
#> 6   2.3  22.3                       2

REDCapR::redcap_read(
  redcap_uri  = uri, 
  token       = token,
  forms       = "blood_pressure",
  verbose     = FALSE 
)$data
#> # A tibble: 8 × 6
#>   record_id redcap_repeat_instrument redcap_repeat_instance   sbp   dbp
#>       <dbl> <chr>                                     <dbl> <dbl> <dbl>
#> 1         1 <NA>                                         NA  NA    NA  
#> 2         1 blood_pressure                                1   1.1  11.1
#> 3         1 blood_pressure                                2   1.2  11.2
#> 4         1 blood_pressure                                3   1.3  11.3
#> 5         2 <NA>                                         NA  NA    NA  
#> 6         2 blood_pressure                                1   2.1  22.1
#> 7         2 blood_pressure                                2   2.2  22.2
#> 8         2 blood_pressure                                3   2.3  22.3
#> # ℹ 1 more variable: blood_pressure_complete <dbl>

Created on 2023-07-09 with reprex v2.0.2

@simX
Copy link

simX commented Jul 9, 2023

I’ll take a look in a couple days with the version from GitHub. It is possible it’s already fixed, I didn’t realize there were further changes beyond version 1.1.0.

@simX
Copy link

simX commented Jul 11, 2023

Looks like the GitHub version works properly; it doesn't require including the record_id field in the list of fields to retrieve anymore.

I installed the remotes package from CRAN using RStudio's [Tools --> Install Packages...] menu item, and then installed the GitHub version of REDCapR using remotes::install_github("OuhscBbmc/REDCapR"). I did have to quit and restart RStudio and not save the workspace data (presumably because I would need to unload version 1.1.0 and load the GitHub version), but all works fine now. Thanks for the assistance!

wibeasley added a commit that referenced this issue Jul 11, 2023
@wibeasley
Copy link
Member

wibeasley commented Jul 11, 2023

@simX, sweet. Glad to hear it's working. (In the future, I think you need to restart just R within RStudio, not restart all of RStudio.)

@pbchase, I think the modified behavior (since Feb) satisfies your first two posts in this issue. But I'd like to hear it from you (that I'm not overlooking something) before I close the issue.

In short, REDCapR::redcap_read_oneshot() doesn't add record_id but REDCapR::redcap_read() does add it (it's unavoidable, because there is so much stitching and so many meta-ish commands). But neither throw an error when it's not requested --as confirmed by the four test functions in the commit immediately above.

*crap, the message from commit ab4c58a should be "verify record_id IS returned by redcap_read".

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants