handling labels & checkboxes #51

Closed
wibeasley opened this issue Oct 2, 2014 · 9 comments
@wibeasley (Member)

This bug is a corner case with low impact. If the batch size is really tiny and you request 'label' instead of 'raw', there's an initially strange pattern of NA values. See the screenshot below.

It's because batches with no values in the whole column are given NA in the empty cells (or possibly the columns aren't even output by the API; I'll have to check). However, batches with at least one value in the column are given a zero-width character in the empty cells.

   returned_object2 <- redcap_read(
      redcap_uri = uri, token = token, raw_or_label = "label",
      export_data_access_groups = "true", verbose = TRUE, batch_size = 2
   )

[screenshot: batched 'label' export showing the NA pattern]

@wibeasley wibeasley added the bug label Oct 2, 2014
@wibeasley wibeasley self-assigned this Oct 2, 2014
nutterb commented Oct 2, 2014

This is definitely an issue of what REDCap returns. It's being exacerbated by the batching, and I disagree with your description of it as low impact: this could be a major headache when doing a large number of batches. I think the immediate remedy is to convert any NAs in checkbox fields to empty strings.

> x <- httr::POST(url=URL,  # export raw values
+                 body=list(token=TOKEN,
+                           content='record',
+                           format='csv'))
> (x <- read.csv(textConnection(as.character(x))))
  id color___r color___g color___b color___p thisform_complete
1  1         1         0         0         0                 0
2  2         0         0         1         0                 0
3  3         0         1         1         0                 0
4  4         0         0         0         0                 0
> 
> y <- httr::POST(url=URL, #export labeled values
+                 body=list(token=TOKEN,
+                           content='record',
+                           format='csv',
+                           rawOrLabel='label'))
> (y <- read.csv(textConnection(as.character(y))))
  id color___r color___g color___b color___p thisform_complete
1  1       Red                            NA        Incomplete
2  2                          Blue        NA        Incomplete
3  3               Green      Blue        NA        Incomplete
4  4                                      NA        Incomplete
> levels(y$color___r)
[1] ""    "Red"
>

Notice that when I export 'raw' I get 0s, but when I export 'label' I get NAs in the column for purple. The column for purple should be a series of empty strings.

Now watch what happens when I take out record 1.

> y <- httr::POST(url=URL,
+                 body=list(token=TOKEN,
+                           content='record',
+                           format='csv',
+                           rawOrLabel='label',
+                           records='2,3,4'))
> (y <- read.csv(textConnection(as.character(y))))
  id color___r color___g color___b color___p thisform_complete
1  2        NA                Blue        NA        Incomplete
2  3        NA     Green      Blue        NA        Incomplete
3  4        NA                            NA        Incomplete

What version of REDCap are you using? This should probably be brought up on the Google Group. I think it's reasonable to ask if this is the expected or desired behavior.

I'm including my data dictionary and data set here if you want to reproduce my results.

Data dictionary:

"Variable / Field Name","Form Name","Section Header","Field Type","Field Label","Choices, Calculations, OR Slider Labels","Field Note","Text Validation Type OR Show Slider Number","Text Validation Min","Text Validation Max",Identifier?,"Branching Logic (Show field only if...)","Required Field?","Custom Alignment","Question Number (surveys only)","Matrix Group Name","Matrix Ranking?"
id,thisform,,text,"Subject ID",,,,,,,,,,,,
color,thisform,,checkbox,Color,"r, Red | g, Green | b, Blue | p, Purple",,,,,,,,,,,

Data set:

id,redcap_data_access_group,color___r,color___g,color___b,color___p,thisform_complete
"1","",1,0,0,0,0
"2","",0,0,1,0,0
"3","",0,1,1,0,0
"4","",0,0,0,0,0

@wibeasley (Member Author)

Tell me what you think, @nutterb. I'm guessing the inconsistency is on R's part, not on REDCap's. I liked how your scenario isolated the problem well, so I set it up at https://github.com/OuhscBbmc/REDCapR/tree/dev/inst/test_data/project_color_boxes.

I made a slight modification to your code to peek at the CSV before R converts it into a data.frame. It's tough to see with all the escaping, but I think REDCap is returning what it's supposed to.

Furthermore, looking at the classes in each data.frame, the first one correctly converts color___r to character, while the empty one converts it to logical:

> sapply(ds_all, class)
         id   color___r   color___g   color___b   color___p 
  "integer" "character" "character" "character"   "logical" 
> sapply(ds_some, class)
         id   color___r   color___g   color___b   color___p 
  "integer"   "logical" "character" "character"   "logical" 
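The class guessing can be reproduced without REDCap at all; a minimal sketch (the column names are just illustrative, not the project's):

```r
# A column whose cells are all blank is guessed as logical NA,
# while a column with at least one non-blank value stays character.
ds_empty <- read.csv(text = "id,color___p\n1,\n2,\n",
                     stringsAsFactors = FALSE)
ds_mixed <- read.csv(text = "id,color___r\n1,Red\n2,\n",
                     stringsAsFactors = FALSE)

class(ds_empty$color___p)  # "logical"
class(ds_mixed$color___r)  # "character"
```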

The reason I think it's a corner case is that I never use labels, and the batches will rarely be so small. But I'd still like to handle it correctly, and not rely on obscurity. I'm working on a larger strategy that uses the metadata to format the returned data.frame, and I'm hoping that approach will solve this problem. Any thoughts?

(I'll start a new issue later, but sometime I want to discuss creating some projects that we can both test against, to reduce some of the duplication of work. I'm happy to host them on our box.)


nutterb commented Oct 2, 2014

It turns out you're right. I didn't look at the raw API output. It appears that read.table's na.strings handling treats the blank fields of a completely empty column as missing values. In other words,

"study_id", "color"
1, "" 

is the same thing as

"study_id", "color"
1, 

Unless there is at least one value in color that is non-empty. And I don't think there's any way around this. This is something you'll have to handle in the post-processing.

But if you move toward using the metadata to format the data frame, you'll get around this easily. This is actually how Horner started out. I only ever download the raw data from the API. This has given me the flexibility to apply either the "Checked/Unchecked" formatting you see from the manual download, or the label download like what we're aiming for here.

Before you set up any mutual project (and I'd be very interested), check your use policies. I asked about making some test databases to use in the examples of redcapAPI, but was told no because they didn't want to give anyone who wasn't affiliated with the institution access to the servers. Apparently that was a confidentiality issue even though there is no access to patient data.


nutterb commented Oct 2, 2014

So I guess there are two ways you could post-process this. Instead of converting all of the batches to data frames and rbinding them, you could strip the first line from each batch's string (as long as you keep the first line of the first batch). Then you could paste all of the strings together and convert to a data frame in one go. This might actually be faster than juggling multiple data frames.
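That first approach might look roughly like this; `batches` and the helper name are assumptions for illustration, not REDCapR code:

```r
# Hypothetical sketch: each element of `batches` is the raw CSV text
# returned for one batch, each with its own header line.
combine_batches <- function(batches) {
  # Keep the first batch intact; drop the header line from the rest.
  rest <- vapply(
    batches[-1],
    function(txt) sub("^[^\n]*\n", "", txt),
    character(1)
  )
  csv <- paste(c(batches[1], rest), collapse = "")
  read.csv(text = csv, stringsAsFactors = FALSE)
}

batches <- c("id,color___r\n1,Red\n2,\n", "id,color___r\n3,\n4,Red\n")
ds <- combine_batches(batches)  # one 4-row data.frame, typed in one go
```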

Or you can add a couple of extra lines to what you already have. After the data frame is constructed, something like this (a frustrating mix of code and pseudo-code):

if (raw_or_label == 'label') {
  check <- grep("___", names(d), value = TRUE)  # checkbox columns
  d[, check] <- lapply(d[, check, drop = FALSE],
                       function(x) {
                         lbl <- union('', levels(x))  # keep '' as a legal level
                         x <- as.character(x)
                         x[is.na(x)] <- ''
                         factor(x, lbl)
                       })
}

It's probably easier to do this at the stage you're at than to do the text parsing. But it looks pretty ugly.

@wibeasley (Member Author)

Thanks for suggesting the different approaches. There are many little combinations of scenarios I'm worried about, such as (a) some columns being dropped if the values are all missing, or (b) a single-column data.frame being implicitly converted to a vector. I'll add your suggestions to the couple I was considering, and share my thoughts/troubles.

I was leaning towards declaring the classes in the read.csv() call. I've had success with the colClasses parameter when importing someone else's data into our databases (hence the inconsistent naming conventions within the file).

# Read the CSV and set the column types
classes <- c(
  "AnswerID" = "integer"
  , "SiteID" = "integer"
  , "FamilyID" = "integer"
  , "PersonID" = "integer"
  , "Formdate" = "character"
  , "SSN" = "character" #Force as a character (not integer) so leading zeros aren't dropped.
  , "final_graduation_date" = "integer"
  , "Deleted" = "integer"
)

#colnames(ds) paste(colnames(ds), collapse="\",\"")
ds <- read.csv(uriExtract, stringsAsFactors=FALSE, na.strings="", colClasses=classes, comment.char="")

But the success I've had with this approach has been with huge datasets (60k+ rows), whereas this API situation needs to work with dinky little batches that are potentially sparsely populated.
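One way to sidestep the sparse-batch guessing entirely is to read every column as character and cast afterward; a minimal sketch (the column names are hypothetical):

```r
# A single class in colClasses is recycled across all columns, so even
# an all-empty batch cannot be guessed as logical.
csv <- "id,color___r,color___p\n1,Red,\n2,,\n"
ds <- read.csv(text = csv, colClasses = "character",
               stringsAsFactors = FALSE)

sapply(ds, class)  # every column is "character"

# Cast known-numeric columns afterward.
ds$id <- as.integer(ds$id)
```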


nutterb commented Oct 2, 2014

(a) if some columns are dropped if the values are all missing
I'm having a hard time imagining when this might happen. Is there a certain context you think it might occur?

or (b) if a single column data.frame gets implicitly converted to a vector
I'm guessing you mean in my lapply statement? I amended the code to prevent this. I've been bitten by this one a number of times before, and still never seem to remember the drop argument. Is there another context you're worried about?

Using colClasses is a fairly brilliant idea. If that's the route you go, I would recommend doing the text processing before joining the batches; that way you minimize the risk of a column becoming all NAs. Even if they were all NAs, I'm not sure it would kill the call (though I'd have to play with it). This would also be blazing fast.

But, since I'm a pessimist at heart, I have to mention the one potential problem I see. What do you intend to do with text fields that are numeric but not validated as numbers? When researchers don't contact me when they do their database design, I get a lot of unvalidated fields (drives me nuts). Would you leave those for R to decide, or would you force those to character? Just something to think about.

I don't think the size of the database will be an issue. In small databases, I think you just won't notice the difference in speed.

@wibeasley (Member Author)

(a) I'm having a hard time imagining when this might happen. Is there a certain context you think it might occur?

Nothing specific on my mind. Just cautious about R or REDCap dropping empty things. Kinda like how we have to specify useNA = "always" when calling table(). Or even calling table() on a character variable instead of a factor, when not all levels are present.

(b) Is there another context you're worried about?

That's funny. Until you modified the code block by adding the drop parameter, it hadn't occurred to me that your lapply call could have suffered that problem. But yeah, that kind of thing is what's on my mind.

I don't think the size of the database will be an issue. In small databases, I think you just won't notice the difference in speed.

I wasn't concerned with performance. I brought up the small batches because they'll have less information for R to make the correct guesses about what the correct class/type of variable should be.

@wibeasley (Member Author)

What do you intend to do with text fields that are numeric but not validated as numbers?

I don't think there's anything we can do, short of implementing some sort of machine learning algorithm, do you? I need to look at what you and Horner have done, but I imagined relying only on the field_type and text_validation_type_or_show_slider_number metadata fields.


...but nuts, as we're discussing this, I realize that those validations aren't bulletproof, right? They're not like real database data types. REDCap validation rules can be added after some cells have been entered; the old values are grandfathered in and don't have to comply with the new rules. Coercing to a variable type won't be so forgiving in R.


Tentatively I'm thinking of a two-pass approach. In the first pass (which uses colClasses), use field_type to categorize each variable as either boolean or string. In the second pass, after the data.frame is already established, use text_validation_type_or_show_slider_number to see if any of those string variables can safely be cast to an integer, numeric, or date class. (And maybe be more granular with the Date/DateTime/POSIXct fields.)

But keep them as strings if they can't be cast without errors. And maybe provide a parameter so the user can specify whether the routine should (a) throw an error, or (b) keep the variable as a string when REDCap's validation class didn't work.
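A sketch of that second pass, assuming the data.frame starts out all-character; the function name and the validation labels are hypothetical, not REDCap's exact vocabulary:

```r
# Hypothetical second pass: try to cast a character column per its
# validation type; fall back to the original strings if coercion fails.
cast_column <- function(x, validation) {
  casted <- switch(
    validation,
    integer  = suppressWarnings(as.integer(x)),
    number   = suppressWarnings(as.numeric(x)),
    date_ymd = as.Date(x, format = "%Y-%m-%d"),
    x  # no recognized validation: leave as character
  )
  # Grandfathered values that violate the rule become NA on coercion;
  # in that case keep the whole column as strings instead.
  if (any(is.na(casted) & !is.na(x) & x != "")) x else casted
}

cast_column(c("1", "2", ""), "integer")   # casts cleanly; "" becomes NA
cast_column(c("1", "oops"), "integer")    # "oops" fails; stays character
```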

How does that sound? And could that logic be reused for your package, if it's something you even want to consider including?

@wibeasley (Member Author)

This was fixed a while ago (I think when we switched to readr in issue #127), and I neglected to close it.
