get_file directly into environment with user-specified file format #35

Closed
1 task done
kuriwaki opened this issue Dec 16, 2019 · 7 comments · Fixed by #66
Labels
data-download Functions that are about downloading, not uploading, data

Comments

@kuriwaki
Member

kuriwaki commented Dec 16, 2019

What the issue is about:

  • a suggested code or documentation change, improvement to the code, or feature request

Issue: I think most users who want to get data from the R dataverse package want to start working with the data in their R environment right away. However, get_file only returns raw binary output which is not usable on its own.

Proposal: The help page shows how to write the class raw object into a temp file and read it back in. The proposed feature is to add an optional argument in get_file or make a function that does this write-in / read-in-again process automatically. Users will enter a function that will be used to read in the tempfile. An example function that does this is below.

How does this sound?

# hide my key

library(dataverse)

# function ----

# @param file to be passed on to get_file
# @param dataset to be passed on to get_file
# @param read_function If supplied a function object, this will write the 
#   raw file to a tempfile and read it back in with the supplied function. This
#   is useful when you want to start working with the data right away in the R
#   environment
# @param ... further arguments to be passed on to get_file
get_file_addon <- function(file,
                           dataset = NULL,
                           read_function = NULL,
                           ...) {
  
  raw_file <- get_file(file, dataset, ...)
  
  # default of get_file: return the raw vector unchanged
  if (is.null(read_function))
    return(raw_file)
  
  # save to a tempfile that keeps the original extension, then read it in
  # with the supplied function; "[A-Za-z0-9]" avoids the punctuation that
  # the range "[A-z]" also matches
  tmp <- tempfile(fileext = stringr::str_extract(file, "\\.[A-Za-z0-9]+$"))
  on.exit(unlink(tmp), add = TRUE)
  writeBin(raw_file, tmp)
  read_function(tmp)
}

# read in two non-tab ingested files ----
cces_dta <- get_file_addon(file = "cumulative_2006_2018.dta", 
                           dataset = "10.7910/DVN/II2DB6",
                           read_function = haven::read_dta)
cces_rds <- get_file_addon(file = "cumulative_2006_2018.Rds", 
                           dataset = "10.7910/DVN/II2DB6",
                           read_function = readr::read_rds)
class(cces_dta)
#> [1] "tbl_df"     "tbl"        "data.frame"
class(cces_rds)
#> [1] "tbl_df"     "tbl"        "data.frame"
dim(cces_dta)
#> [1] 452755     73
dim(cces_rds)
#> [1] 452755     73

Created on 2019-12-16 by the reprex package (v0.3.0)

@wibeasley
Contributor

@kuriwaki,

  1. I like this idea. I agree that it's a step that is reasonably automated and will remove a (small) barrier encountered in almost all use cases.

  2. I'm wondering if it's best to offer the data.frame conversion only for ingested datasets? I'm guessing that's the majority of what most users would consider converting to a data.frame. It also relieves us of the responsibility of guessing correctly for ambiguous files (like a csv with a 'txt' extension, or a csv file that's actually separated with semicolons). I'd rather rely on Dataverse's own ingesting logic. They'll do a better job initially, and they're more likely to be better about maintaining that logic over time.

  3. But I'm happy to be convinced otherwise. If the package does assume this responsibility, maybe the mime can help with that decision logic.

  4. If only ingested datasets are returned as data.frames, I'm guessing it makes sense to use only the available rds, and not to convert the tab to rds, for three reasons.

    1. it's less for us to develop & maintain

    2. the rds is smaller, and therefore should travel the internet faster than the plain-text tab file.

    3. our csv-to-rds process may repair column names differently than the Dataverse ingestion process. For example, the tab file has a subject id variable. Some parsing procedures repair that name automatically (e.g., subject_id, subject id) and some don't. Therefore the user code and documentation might not use the same variable name, depending on how the csv was converted to an rds.

    @pdurbin, if we go this route, I might need help identifying the ingestion code that creates the rds. The R package should probably mimic the ingestion process as closely as possible. My search isn't turning up anything I recognize as this part.

What are your thoughts? Would this restriction (ie, only ingested datasets are returned as data.frames) be too limiting?
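
To make the column-name concern in point 4.3 concrete, here is a minimal base R sketch. It assumes a made-up two-column tab file and uses read.delim only as an example of a parser; it does not show what the Dataverse ingestion process itself produces.

```r
# A tab-delimited file whose header contains a space, as in the
# "subject id" example. The contents are made up for illustration.
tab_text <- "subject id\tscore\n1\t10\n2\t12\n"

# Base R repairs non-syntactic names by default (check.names = TRUE).
repaired <- read.delim(text = tab_text)

# With check.names = FALSE the header is kept exactly as written.
verbatim <- read.delim(text = tab_text, check.names = FALSE)

names(repaired)
#> [1] "subject.id" "score"
names(verbatim)
#> [1] "subject id" "score"
```

Code written against one of these data.frames would fail on the other, which is the mismatch between user code and documentation described above.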

@pdurbin
Member

pdurbin commented Dec 19, 2019

4. the ingestion code that creates the rds

Well, here's a lead: "Confirming what Phil said - if the original ingested file was Stata (.dta) or SPSS (.sav or *.por), we use R package "foreign" to directly convert that saved original file to an .RData dataframe. For all the other supported formats, the dataframe is generated by R from the tab-delimited file and the variable metadata in the database." -- https://groups.google.com/d/msg/dataverse-community/QDRnM6ztbt8/AYynuwocBAAJ

Let me dig a bit.


Update. I'm pretty sure this R code is called: https://github.com/IQSS/dataverse/blob/v4.18.1/src/main/java/edu/harvard/iq/dataverse/rserve/scripts/dataverse_r_functions.R

From this Java code: https://github.com/IQSS/dataverse/blob/v4.18.1/src/main/java/edu/harvard/iq/dataverse/rserve/RemoteDataFrameService.java#L125

@wibeasley
Contributor

@pdurbin, that helped a lot

@kuriwaki, this shows how inexperienced I still am with Dataverse. I didn't realize they really meant "RData", instead of "Rds".

So unless Dataverse also offers Rds files soon, I totally support your proposal.

In addition, what do you think about a function that always returns a data.frame for an ingested tab file? In that case, it never passes through the rds stage. Something like readr::read_delim() converts the plain-text to a tibble, and returns the tibble to the caller. Isn't this the most frequent use case? I really don't know --do/would many people use an R package to download a Stata/Spss/Whatever file?


For those who don't know, RData saves the equivalent of an environment/workspace --not necessarily a single rectangular dataset. When it's restored, all the variables saved by the developer populate the client's environment. The user is forced to (at least initially) use the old names. Besides the naming complication, multiple variables can be contained in one file, which can lead to more confusion.

Excerpt from Efficient R programming

(RData) is the most widely used. It uses the save function which takes any number of R objects and writes them to a file, which must be specified by the file = argument. save is like save.image, which saves all the objects currently loaded in R.

The second method is slightly less used but we recommend it. Apart from being slightly more concise for saving single R objects, the readRDS function is more flexible: as shown in the subsequent line, the resulting object can be assigned to any name. In this case we called it df_co2_rds (which we show to be identical to df_co2, loaded with the load command) but we could have called it anything or simply printed it to the console.

Using saveRDS is good practice because it forces you to specify object names. If you use save without care, you could forget the names of the objects you saved and accidentally overwrite objects that already existed.
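
The difference can be seen in a short base R sketch. The data.frame `scores` and its values are made up for illustration; the only functions used are save, load, saveRDS, and readRDS.

```r
rdata_path <- tempfile(fileext = ".RData")
rds_path   <- tempfile(fileext = ".rds")

# Illustrative object; the name and values are made up.
scores <- data.frame(id = 1:3, score = c(10, 12, 9))
save(scores, file = rdata_path)
saveRDS(scores, file = rds_path)

# load() restores objects under the names they were saved with and
# returns those names. A separate environment keeps the workspace clean.
target <- new.env()
restored_names <- load(rdata_path, envir = target)

# readRDS() returns the object itself, so the caller picks the name.
my_scores <- readRDS(rds_path)

restored_names
#> [1] "scores"
identical(target$scores, my_scores)
#> [1] TRUE

unlink(c(rdata_path, rds_path))
```

Loading into a fresh environment, as done here, is one way to avoid overwriting existing objects when an RData file has to be used.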

@kuriwaki
Member Author

Thank you.

My intention with the read_function argument (with no default provided) is to leave it up to the user to discern what function could be used with the data. Sometimes, several commands should work fine (e.g. foreign::read.dta vs. haven::read_dta, or readr::read_delim vs. read.delim); more often, only certain functions will work.

As for ingested datasets: my sense is that get_file will always return the original, not the ingested format. For example, constructionData.tab, used as an example in the get_file help page, is a Stata dta ingested into a tab, but get_file returns a raw file that can only reasonably be read in with read.dta/read_dta.

Re:

would many people use an R package to download a Stata/Spss/Whatever file?

  • If the replication file in the dataset comes from Stata/SPSS and the person analyzing the replication is an R user, then the R user will have no choice but to read the Stata/SPSS file into R. Even if the ingestion works, sometimes important metadata (like variable and value labels) are stripped off in the ingest.

@pdurbin
Member

pdurbin commented Feb 25, 2020

So unless Dataverse also offers Rds files soon

This just in. A request for RDS support in Dataverse from @reikoch at IQSS/dataverse#6678

@wibeasley @kuriwaki please feel free to comment on that issue! You both know way more about R than I do! 😄

@kuriwaki kuriwaki added the data-download Functions that are about downloading, not uploading, data label Dec 3, 2020
kuriwaki added a commit that referenced this issue Dec 27, 2020
@kuriwaki
Member Author

kuriwaki commented Jan 2, 2021

This functionality is now called get_dataframe_* in #66.

I reread this conversation after implementing that PR. Re the above comment (#35 (comment)) by @wibeasley:

  • Re your bullet point 2, I think no, we want get_dataframe_* to be able to read in datafiles that are not ingested (e.g. nlsw88_rds-export.rds). Then it'll just be up to the user to specify the correct function
  • I don't quite understand "not to convert the tab to rds" in point 4. Either we use read_tsv to read the ingested version of the ingested file (original = FALSE), OR we ask the user to find the appropriate function to read the original version of the ingested or non-ingested file (original = TRUE). We never need to use Download Options > RData Format for any of this.
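
The two branches in the second bullet can be sketched in base R as follows. This is not the code from #66: `read_downloaded` is a hypothetical helper, the raw vectors stand in for what a download would return, and utils::read.delim stands in for read_tsv so that the sketch runs without extra packages.

```r
# Hypothetical helper, for illustration only.
read_downloaded <- function(raw_file, original = FALSE, .f = NULL) {
  # Original formats vary, so the user must say how to read them.
  if (original && is.null(.f)) {
    stop("`.f` must be supplied when original = TRUE", call. = FALSE)
  }
  # The ingested version is always tab-delimited, so a default reader works.
  if (is.null(.f)) {
    .f <- function(path) utils::read.delim(path, check.names = FALSE)
  }
  tmp <- tempfile()
  on.exit(unlink(tmp), add = TRUE)
  writeBin(raw_file, tmp)
  .f(tmp)
}

# Ingested version: tab-delimited bytes, read with the default reader.
tab_bytes <- charToRaw("id\tscore\n1\t10\n2\t12\n")
ingested <- read_downloaded(tab_bytes)

# Original version: bytes of an .rds file, read with a user-chosen function.
rds_path <- tempfile(fileext = ".rds")
saveRDS(data.frame(id = 1:2, score = c(10, 12)), rds_path)
rds_bytes <- readBin(rds_path, what = "raw", n = file.size(rds_path))
unlink(rds_path)
original_df <- read_downloaded(rds_bytes, original = TRUE, .f = readRDS)
```

Neither branch touches the RData download option, which matches the point that it is never needed here.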
