Duplicate file description #32

Closed
1 of 3 tasks
adam3smith opened this issue Dec 5, 2019 · 3 comments · Fixed by #39

Comments

@adam3smith
Contributor

adam3smith commented Dec 5, 2019

Please specify whether your issue is about:

  • a possible bug
  • a question about package functionality
  • a suggested code or documentation change, improvement to the code, or feature request

The "description" field for files is repeated, resulting in a duplicate data.frame column name, which causes all sorts of issues. I'm not sure whether this is a problem with the API or the R package, but figured I'd start here. CC @pdurbin

## load package
library("dataverse")


## code goes here
Sys.setenv("DATAVERSE_SERVER" = "dataverse.harvard.edu")

obrien_files <- get_dataset("doi:10.7910/DVN/WOT075")[['files']]
colnames(obrien_files)

 [1] "description"         "label"               "restricted"         
 [4] "version"             "datasetVersionId"    "categories"         
 [7] "id"                  "persistentId"        "pidURL"             
[10] "filename"            "contentType"         "filesize"           
[13] "description"         "storageIdentifier"   "rootDataFileId"     
[16] "md5"                 "checksum"            "creationDate"       
[19] "originalFileFormat"  "originalFormatLabel" "originalFileSize"   
[22] "UNF"                 "tabularTags"
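
To illustrate why the repeated name is troublesome, here is a minimal sketch in base R. The `files` data frame and its values are made up for illustration and only mimic the shape of the column names above; the `make.unique()` renaming is one possible user-side workaround, not something the package does.

```r
# Toy stand-in for the files table (hypothetical values, not real API output).
# check.names = FALSE keeps the duplicate name instead of repairing it.
files <- data.frame(
  description = c("outer a", "outer b"),
  label       = c("f1.tab", "f2.tab"),
  description = c("inner a", "inner b"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Two columns share one name.
has_dup <- any(duplicated(names(files)))

# `$` silently returns only the first match, so the second column
# cannot be reached by name.
first_match <- files$description

# Possible workaround: rename duplicates so every column is addressable.
names(files) <- make.unique(names(files), sep = "_")
renamed <- names(files)
```

After the rename, the second column is available as `files$description_1`.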

## session info for your system
sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_0.8.0.1   dataverse_0.2.1

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.1        rstudioapi_0.10   xml2_1.2.0        magrittr_1.5     
 [5] tidyselect_0.2.5  R6_2.4.0          rlang_0.3.4       httr_1.4.1       
 [9] tools_3.4.3       pkgbuild_1.0.2    cli_1.1.0         withr_2.1.2      
[13] remotes_2.1.0     assertthat_0.2.1  rprojroot_1.3-2   tibble_2.1.1     
[17] crayon_1.3.4      processx_3.3.0    purrr_0.3.2       callr_3.1.1      
[21] ps_1.3.0          curl_3.3          glue_1.3.1        pillar_1.4.2     
[25] compiler_3.4.3    backports_1.1.4   prettyunits_1.0.2 jsonlite_1.6     
[29] pkgconfig_2.0.2  
@pdurbin
Member

pdurbin commented Dec 5, 2019

If anything, it's probably a bug or at least a weirdness in the Dataverse API, which shows "description" twice. Here's a screenshot from https://dataverse.harvard.edu/api/datasets/export?exporter=dataverse_json&persistentId=doi%3A10.7910/DVN/WOT075

[Screenshot: Screen Shot 2019-12-04 at 10 49 56 PM]

@adam3smith I'd encourage you to create an issue at https://github.com/IQSS/dataverse/issues but I'd be afraid that if we delete one of the "description" fields from the Dataverse API that an integration would break. It's probably better to think of this as a wart in the Dataverse API, something to fix in v2 or whatever. 😄

kuriwaki added a commit to kuriwaki/dataverse-client-r that referenced this issue Dec 16, 2019
@kuriwaki
Member

The columns also get duplicated when binding here (both have the description column name).

out$files <- cbind(out$files, file_df)
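
A minimal sketch of that mechanism, using two made-up data frames (`file_metadata` and `file_df` here are hypothetical stand-ins with invented values): `cbind()` on data frames does not repair names, so a column present in both inputs appears twice. Dropping the earlier copy with `duplicated(..., fromLast = TRUE)` is one way to deduplicate; it is an assumption for illustration, not necessarily how the fork's commit does it.

```r
# Hypothetical stand-ins for the two pieces that get bound together.
file_metadata <- data.frame(
  description = c("d1", "d2"),
  label       = c("f1.tab", "f2.tab"),
  stringsAsFactors = FALSE
)
file_df <- data.frame(
  id          = c(1L, 2L),
  description = c("d1", "d2"),
  stringsAsFactors = FALSE
)

# cbind() keeps both "description" columns.
combined <- cbind(file_metadata, file_df)
dup_before <- any(duplicated(names(combined)))

# Keep only the last occurrence of each column name.
keep <- !duplicated(names(combined), fromLast = TRUE)
deduped <- combined[keep]
dup_after <- any(duplicated(names(deduped)))
```

Use `duplicated(names(combined))` without `fromLast` to keep the first occurrence instead.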

In my fork (kuriwaki@49fd9e5), I've removed the duplicate and it works:

Sys.setenv("DATAVERSE_KEY" = "<your-api-token>")
Sys.setenv("DATAVERSE_SERVER" = "dataverse.harvard.edu")

library(tibble)
library(dataverse) # devtools::install_github("kuriwaki/dataverse-client-r")


# description about each dataset
obrien_files <- get_dataset("doi:10.7910/DVN/WOT075")[['files']]
any(duplicated(colnames(obrien_files)))
#> [1] FALSE

# non-duplicated column names make conversion to a tibble possible
as_tibble(obrien_files)
#> # A tibble: 6 x 22
#>   label restricted version datasetVersionId categories     id persistentId
#>   <chr> <lgl>        <int>            <int> <list>      <int> <chr>       
#> 1 Geog… FALSE            1           178559 <chr [1]>  3.64e6 doi:10.7910…
#> 2 Land… FALSE            1           178559 <chr [1]>  3.64e6 doi:10.7910…
#> 3 Land… FALSE            1           178559 <chr [1]>  3.64e6 doi:10.7910…
#> 4 Prop… FALSE            1           178559 <chr [1]>  3.64e6 doi:10.7910…
#> 5 Road… FALSE            1           178559 <chr [1]>  3.64e6 doi:10.7910…
#> 6 Road… FALSE            1           178559 <chr [1]>  3.64e6 doi:10.7910…
#> # … with 16 more variables: pidURL <chr>, filename <chr>, contentType <chr>,
#> #   filesize <int>, description <chr>, storageIdentifier <chr>,
#> #   rootDataFileId <int>, md5 <chr>, checksum$type <chr>, $value <chr>,
#> #   creationDate <chr>, originalFileFormat <chr>, originalFormatLabel <chr>,
#> #   originalFileSize <int>, UNF <chr>, tabularTags <list>

Created on 2019-12-16 by the reprex package (v0.3.0)

kuriwaki linked a pull request on Dec 28, 2020 that will close this issue
@kuriwaki
Member

Duplicate column was manually removed after the fact in PR #39, in commit kuriwaki@49fd9e5

library("dataverse")

## code goes here
Sys.setenv("DATAVERSE_SERVER" = "dataverse.harvard.edu")

obrien_files <- get_dataset("doi:10.7910/DVN/WOT075")[['files']]
colnames(obrien_files)
#>  [1] "label"               "restricted"          "version"            
#>  [4] "datasetVersionId"    "categories"          "id"                 
#>  [7] "persistentId"        "pidURL"              "filename"           
#> [10] "contentType"         "filesize"            "description"        
#> [13] "storageIdentifier"   "rootDataFileId"      "md5"                
#> [16] "checksum"            "creationDate"        "originalFileFormat" 
#> [19] "originalFormatLabel" "originalFileSize"    "originalFileName"   
#> [22] "UNF"                 "tabularTags"

any(duplicated(colnames(obrien_files)))
#> [1] FALSE

Created on 2020-12-28 by the reprex package (v0.3.0)
