handling labels & checkboxes #51
This is definitely an issue of what REDCap returns. It's being exacerbated by the batching, and I disagree with your description of it as low impact: this could be a major headache when doing a large number of batches. I think the immediate remedy is to convert any NAs in checkbox fields to empty characters.

> x <- httr::POST(url=URL, # export raw values
+ body=list(token=TOKEN,
+ content='record',
+ format='csv'))
> (x <- read.csv(textConnection(as.character(x))))
id color___r color___g color___b color___p thisform_complete
1 1 1 0 0 0 0
2 2 0 0 1 0 0
3 3 0 1 1 0 0
4 4 0 0 0 0 0
>
> y <- httr::POST(url=URL, #export labeled values
+ body=list(token=TOKEN,
+ content='record',
+ format='csv',
+ rawOrLabel='label'))
> (y <- read.csv(textConnection(as.character(y))))
id color___r color___g color___b color___p thisform_complete
1 1 Red NA Incomplete
2 2 Blue NA Incomplete
3 3 Green Blue NA Incomplete
4 4 NA Incomplete
> levels(y$color___r)
[1] "" "Red"
Notice that when I export 'raw' I get 0s, but when I export 'label' I get NAs in the column for purple. The purple column should be a series of empty strings. Now watch what happens when I take out record 1.

> y <- httr::POST(url=URL,
+ body=list(token=TOKEN,
+ content='record',
+ format='csv',
+ rawOrLabel='label',
+ records='2,3,4'))
> (y <- read.csv(textConnection(as.character(y))))
id color___r color___g color___b color___p thisform_complete
1 2 NA Blue NA Incomplete
2 3 NA Green Blue NA Incomplete
3 4 NA NA Incomplete

What version of REDCap are you using? This should probably be brought up on the Google Group; I think it's reasonable to ask whether this is the expected or desired behavior. I'm including my data dictionary and data set here if you want to reproduce my results.

"Variable / Field Name","Form Name","Section Header","Field Type","Field Label","Choices, Calculations, OR Slider Labels","Field Note","Text Validation Type OR Show Slider Number","Text Validation Min","Text Validation Max",Identifier?,"Branching Logic (Show field only if...)","Required Field?","Custom Alignment","Question Number (surveys only)","Matrix Group Name","Matrix Ranking?"
id,thisform,,text,"Subject ID",,,,,,,,,,,,
color,thisform,,checkbox,Color,"r, Red | g, Green | b, Blue | p, Purple",,,,,,,,,,,

id,redcap_data_access_group,color___r,color___g,color___b,color___p,thisform_complete
"1","",1,0,0,0,0
"2","",0,0,1,0,0
"3","",0,1,1,0,0
"4","",0,0,0,0,0 |
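As a side note on the parsing mechanics at play, here is a standalone illustration (assumed inline CSVs, not REDCapR code) of how read.csv infers column types: a column that is empty in every row is parsed as logical NA, while a column with at least one labeled value is parsed as character, with "" in the empty cells.

```r
# All-empty column: read.csv infers logical, so every cell becomes NA.
all_empty <- read.csv(textConnection("id,color___p\n1,\n2,\n3,"))

# Column with one labeled value: inferred as character, blanks become "".
some_vals <- read.csv(textConnection("id,color___b\n1,\n2,Blue\n3,"),
                      stringsAsFactors = FALSE)

class(all_empty$color___p)   # "logical"
class(some_vals$color___b)   # "character"
```

This is the same asymmetry visible in the label export above: the purple column (no values anywhere) comes back as NA, while the other checkbox columns come back with empty strings.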
Tell me what you think, @nutterb. I'm guessing the inconsistency is on R's part, not on REDCap's. I liked how your scenario isolated the problem, so I set it up at https://github.com/OuhscBbmc/REDCapR/tree/dev/inst/test_data/project_color_boxes. I made a slight modification to your code, peeking at the csv before R converts it into a data.frame. It's tough to see with all the escaping, but I think REDCap is returning what it's supposed to. Furthermore, looking at the classes in each data.frame, the first one converts correctly:

> sapply(ds_all, class)
id color___r color___g color___b color___p
"integer" "character" "character" "character" "logical"
> sapply(ds_some, class)
id color___r color___g color___b color___p
"integer" "logical" "character" "character" "logical"

The reason I think it's a corner case is that I never use labels, and the batches will rarely be so small. But I'd still like to handle it correctly and not rely on obscurity. I'm working on a larger strategy that uses the metadata to format the returned data.frame, and I'm hoping that approach will solve this problem. Any thoughts?

(I'll start a new issue later, but sometime I want to discuss creating some projects we can both test against, to reduce duplication of work. I'm happy to host them on our box.)
It turns out you're right. I didn't look at the raw API output. It appears that

"study_id", "color"
1, ""

is the same thing as

"study_id", "color"
1,

unless there is at least one value in the column. But if you move toward using the metadata to format the data frame, you'll get around this easily. This is actually how Horner started out. I only ever download the raw data from the API. This has given me the flexibility to apply either the "Checked/Unchecked" formatting you see from the manual download, or the label download like what we're aiming for here. Before you set up any mutual project (and I'd be very interested), check your use policies. I asked about making some test databases to use in the examples of the package.
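That equivalence can be checked directly (assumed minimal CSVs here, not actual API output): read.csv parses the quoted and unquoted empty cell identically when the column holds no other values, because quotes are stripped before type conversion.

```r
# Quoted empty cell vs. unquoted empty cell: both parse to a single NA
# in a column with no other values.
a <- read.csv(textConnection('study_id,color\n1,""'))
b <- read.csv(textConnection('study_id,color\n1,'))

identical(a, b)   # TRUE
```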
So I guess there are two ways you could post-process this. You could do the text parsing on the raw response, instead of converting all of the batches to data frames first. Or you can add a couple of extra lines to what you already have:

if (raw_or_label=='label'){
  check <- grep("___", names(d), value=TRUE)   # grep(), not grepl(): grepl() has no 'value' argument
  d[, check] <- lapply(d[, check, drop=FALSE],
                       function(x) {
                         lbl <- levels(x)      # NULL when an all-NA column arrived as logical
                         x <- as.character(x)
                         x[is.na(x)] <- ''
                         factor(x, levels=union('', lbl))  # keep '' as a legal level
                       })
}

It's probably easier to do this at the stage you're at than to do the text parsing, but it looks pretty ugly.
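The snippet above assumes factor columns; under stringsAsFactors = FALSE the same blanking can be done on character columns. A runnable sketch on a hypothetical post-rbind data.frame `d` (made-up values, not REDCapR output):

```r
# Hypothetical label-formatted batch result: color___p was empty in every
# row, so it arrived as logical NA; color___g arrived as character.
d <- data.frame(id        = 2:4,
                color___g = c(NA, "Green", NA),
                color___p = c(NA, NA, NA),
                stringsAsFactors = FALSE)

# Blank out the NAs in every checkbox column (names containing "___").
check <- grep("___", names(d), value = TRUE)
d[check] <- lapply(d[check], function(x) {
  x <- as.character(x)   # all-NA columns come through as logical
  x[is.na(x)] <- ""
  x
})

d$color___p   # "" "" "" instead of NA NA NA
```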
Thanks for suggesting different approaches. There are many little combinations of scenarios I'm worried about, such as (a) some columns being dropped when the values are all missing, or (b) a single-column data.frame being implicitly converted to a vector. I'll add your suggestions to the couple I was considering, and share my thoughts/troubles. I was leaning towards declaring the classes when the csv is read:

# Read the CSV and set the column types
classes <- c(
"AnswerID" = "integer"
, "SiteID" = "integer"
, "FamilyID" = "integer"
, "PersonID" = "integer"
, "Formdate" = "character"
, "SSN" = "character" #Force as a character (not integer) so leading zeros aren't dropped.
, "final_graduation_date" = "integer"
, "Deleted" = "integer"
)
#colnames(ds) paste(colnames(ds), collapse="\",\"")
ds <- read.csv(uriExtract, stringsAsFactors=FALSE, na.strings="", colClasses=classes, comment.char="")

But the success I've had with this approach has been with huge datasets (60k+ rows), whereas this API situation needs to work with dinky little batches that are potentially sparsely populated.
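The SSN comment above is the key motivation for colClasses. A small stand-in for uriExtract (the values below are made up) shows the effect: with the column forced to character, the leading zeros survive, where integer coercion would drop them.

```r
# Inline stand-in for the real extract file; SSN values are fabricated.
csv_text <- "AnswerID,SSN\n1,012345678\n2,098765432"

ds <- read.csv(textConnection(csv_text),
               stringsAsFactors = FALSE,
               na.strings       = "",
               colClasses       = c(AnswerID = "integer", SSN = "character"))

ds$SSN   # "012345678" "098765432" -- leading zeros preserved
```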
(a) if some columns are dropped if the values are all missing, or (b) if a single column data.frame gets implicitly converted to a vector

Using colClasses is a fairly brilliant idea. If that's the route you go, I would recommend doing the text processing before joining the batches. That way you minimize the risk of a column becoming all NAs. Even if they were all NAs, I'm not sure it would kill the call (though I'd have to play with it). This would also be blazing fast. But, since I'm a pessimist at heart, I have to mention the one potential problem I see: what do you intend to do with text fields that are numeric but not validated as numbers? When researchers don't contact me during their database design, I get a lot of unvalidated fields (drives me nuts). Would you leave those for R to decide, or would you force them to character? Just something to think about. I don't think the size of the database will be an issue; in small databases you just won't notice the difference in speed.
Nothing specific on my mind. Just cautious about R or REDCap dropping empty things. Kinda like how we have to specify drop=FALSE.
That's funny. Until you modified the code block by adding the
I wasn't concerned with performance. I brought up the small batches because they'll have less information for R to make correct guesses about what the class/type of each variable should be.
I don't think there's anything we can do, short of implementing some sort of machine-learning algorithm, do you? I need to look at what you and Horner have done, but I imagined relying only on the metadata. ...but nuts, as we're discussing this, I realize that those validations aren't bulletproof, right? They're not like real database data types. REDCap validation rules can be added after some cells have been entered; the old values are grandfathered in and don't have to comply with the new rules. Coercing to a variable type won't be so forgiving in R.

Tentatively I'm thinking of a two-pass approach: in the first pass, cast each column to the type implied by its validation, but keep it as a string if it can't be cast without errors. And maybe provide a parameter so the user can specify whether the routine should (a) throw an error, or (b) keep the column as a string when REDCap's validation class didn't work. How does that sound? And could that logic be reused for your package, if it's something you'd even consider including?
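The fallback half of that two-pass idea can be sketched as follows; the function name and behavior are my assumptions, not existing REDCapR code. Try the numeric cast implied by the metadata, and keep the column as character if any grandfathered value would be corrupted to NA.

```r
# Cast to numeric if every non-missing value survives; otherwise keep the
# original character vector (grandfathered values stay intact).
cast_or_keep <- function(x) {
  candidate <- suppressWarnings(as.numeric(x))
  failed    <- is.na(candidate) & !is.na(x)   # values the cast destroyed
  if (any(failed)) x else candidate
}

cast_or_keep(c("1.5", "2", "3"))      # numeric: 1.5 2.0 3.0
cast_or_keep(c("1.5", "2-ish", "3"))  # unchanged character vector
```

Swapping `if (any(failed)) x` for `if (any(failed)) stop(...)` would give the strict variant behind the proposed (a)/(b) parameter.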
This was fixed a while ago (I think when we switched to readr in issue #127), and I neglected to close it.
This bug is a corner case, and low impact. If the batch size is really tiny and you request 'label' instead of 'raw', there's an initially strange pattern of NA values; see the screenshot below. It's because batches with no values in the whole column are given NA in the empty cells (or possibly those columns aren't even output by the API; I'll have to check), while batches with at least one value in the column are given a zero-width character in the empty cells.
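The pattern the issue describes can be reproduced with two assumed mini-batches (not actual API calls): the batch whose column is empty everywhere parses as logical NA, the batch with one value parses as character with "", and rbind-ing them yields the mixed NA/"" column.

```r
# Batch 1: color___b empty in every row -> parsed as logical NA.
batch1 <- read.csv(textConnection("id,color___b\n1,\n2,"))

# Batch 2: one checked value -> parsed as character, blanks become "".
batch2 <- read.csv(textConnection("id,color___b\n3,Blue\n4,"),
                   stringsAsFactors = FALSE)

combined <- rbind(batch1, batch2)
combined$color___b   # NA NA "Blue" ""
```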