404 errors in vignette - get_file() #33

Closed
wibeasley opened this issue Dec 7, 2019 · 17 comments · Fixed by #39 or #66
Labels
bug; data-download (Functions that are about downloading, not uploading, data)

Comments

@wibeasley
Contributor

wibeasley commented Dec 7, 2019

You guys might start regretting inviting me to be a maintainer. I'm having trouble reproducing the vignettes, even easy parts like retrieving plain-text R & CSVs.

Part 1: out of the box

remotes::install_github("iqss/dataverse-client-r")
#> Skipping install of 'dataverse' from a github remote, the SHA1 (bac89f46) has not changed since last install.
#>   Use `force = TRUE` to force installation
library("dataverse")
Sys.setenv("DATAVERSE_KEY" = "examplekey12345")
Sys.setenv("DATAVERSE_SERVER" = "dataverse.harvard.edu")
get_dataset("doi:10.7910/DVN/ARKOTI")
# Dataset (75170): 
# Version: 1.0, RELEASED
# Release Date: 2015-07-07T02:57:02Z
# License: CC0
# 22 Files:
# label version      id                  contentType
# 1                  alpl2013.tab       2 2692294    text/tab-separated-values
# 2                   BPchap7.tab       2 2692295    text/tab-separated-values
# 3                   chapter01.R       2 2692202 text/plain; charset=US-ASCII
# ...
# 16             drugCoverage.csv       1 2692233 text/plain; charset=US-ASCII
# ...

# Retrieve files by ID
object.size(dataverse::get_file(2692294)) # tab works
#> 211040 bytes
object.size(dataverse::get_file(2692295)) # tab works
#> 61336 bytes
object.size(dataverse::get_file(2692210)) # R fails
#> Error in dataverse::get_file(2692210): Not Found (HTTP 404).
object.size(dataverse::get_file(2692233)) # csv fails
#> Error in dataverse::get_file(2692233): Not Found (HTTP 404).

# Retrieve files by name & doi
object.size(get_file("alpl2013.tab"     , "doi:10.7910/DVN/ARKOTI")) # tab works
#> 211040 bytes
object.size(get_file("BPchap7.tab"      , "doi:10.7910/DVN/ARKOTI")) # tab works
#> 61336 bytes
object.size(get_file("chapter01.R"      , "doi:10.7910/DVN/ARKOTI")) # R fails
#> Error in get_file("chapter01.R", "doi:10.7910/DVN/ARKOTI"): Not Found (HTTP 404).
object.size(get_file("drugCoverage.csv" , "doi:10.7910/DVN/ARKOTI")) # csv fails
#> Error in get_file("drugCoverage.csv", "doi:10.7910/DVN/ARKOTI"): Not Found (HTTP 404).


# Taken straight from https://cran.r-project.org/web/packages/dataverse/vignettes/C-retrieval.html
code3 <- get_file("chapter03.R", "doi:10.7910/DVN/ARKOTI")
#> Error in get_file("chapter03.R", "doi:10.7910/DVN/ARKOTI"): Not Found (HTTP 404).

Created on 2019-12-06 by the reprex package (v0.3.0)

Part 2: digging.

Using debug(dataverse::get_file), the error-throwing line is in get_file():

r <- httr::GET(u, httr::add_headers(`X-Dataverse-key` = key), query = query, ...)

To make things a tad more direct, I called dataverse::get_file(2692233). The two relevant parameters to httr::GET() are

Browse[2]> query
$format
[1] "original"

Browse[2]> u
[1] "https://dataverse.harvard.edu/api/access/datafile/2692233"

The r value returned is

Response [https://dataverse.harvard.edu/api/access/datafile/2692233?format=original]
  Date: 2019-12-07 05:13
  Status: 404
  Content-Type: application/json
  Size: 201 B

That u value is fine when pasted into Chrome. I saw several Dataverse discussions about a trailing /. When I added that, the response appears good.

Browse[2]> u2 <- paste0("https://dataverse.harvard.edu/api/access/datafile/2692233", "/")
Browse[2]> httr::GET(u2, httr::add_headers(`X-Dataverse-key` = key), ... )
Response [https://dvn-cloud.s3.amazonaws.com/10.7910/DVN/ARKOTI/14e66408488-c678717f7c4d?response-content-disposition=attachment%3B%20filename%2A%3DUTF-8%27%27drugCoverage.csv&response-content-type=text%2Fplain%3B%20charset%3DUS-ASCII&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Date=20191207T051632Z&X-Amz-SignedHeaders=host&X-Amz-Expires=60&X-Amz-Credential=AKIAIEJ3NV7UYCSRJC7A%2F20191207%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Signature=c1b13a7d3ea2a53c1c1e70c18a762ae0e4ae14eb41fae7d79c71fce26a9b354f]
  Date: 2019-12-07 05:16
  Status: 200
  Content-Type: text/plain; charset=US-ASCII
  Size: 4.06 kB
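Note that the trailing-slash call above was also sent without `query`, so two things changed at once. A minimal sketch for separating the two effects outside the debugger follows. `datafile_url_variants()` and `compare_status()` are hypothetical helpers, not part of the dataverse package; the second one needs httr and network access, and its results depend on the server.

```r
# Hypothetical helper: build the request variants discussed above for one
# file id, without sending anything.
datafile_url_variants <- function(file_id, server = "dataverse.harvard.edu") {
  base <- paste0("https://", server, "/api/access/datafile/", file_id)
  c(
    with_format    = paste0(base, "?format=original"), # what get_file() requested
    no_query       = base,                              # same URL without the query
    trailing_slash = paste0(base, "/")                  # variant tried in the debugger
  )
}

# Hypothetical helper: send each variant and keep only the HTTP status code.
# Requires httr and network access.
compare_status <- function(urls, key = Sys.getenv("DATAVERSE_KEY")) {
  vapply(
    urls,
    function(u) {
      r <- httr::GET(u, httr::add_headers(`X-Dataverse-key` = key))
      as.integer(httr::status_code(r))
    },
    integer(1)
  )
}

urls <- datafile_url_variants(2692233)
print(urls)
# compare_status(urls)  # uncomment to run against the live server
```

If `no_query` and `trailing_slash` return the same status while `with_format` differs, the query string rather than the slash would be the likelier culprit.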

Part 3: Questions

  1. I assume this error is fairly new. Some change with Dataverse? If not, maybe it's related to a change in curl that was released 4 days ago?

  2. Why are csv & R files affected, but not tab files? As I step through a tab file (e.g., dataverse::get_file(2692294)), it appears the exact same lines are executed. And that u value doesn't have a trailing slash (https://dataverse.harvard.edu/api/access/datafile/2692294). I see two differences: (a) the content type and (b) this one doesn't go through AWS/S3.

    Response [https://dataverse.harvard.edu/api/access/datafile/2692294?format=original]
    Date: 2019-12-07 05:35
    Status: 200
    Content-Type: application/x-stata; name="alpl2013.dta"
    Size: 211 kB
    <BINARY BODY>

    This is probably related to @EdJeeOnGitHub's recent issue #31 ("More verbose httr error message for get_file()"). Notice he mentions problems with certain file formats.

  3. Is this related at all to dataverse#3130 ("Dataverse URL - Page Not Found (404 Error) w/ Trailing Forward Slash"), dataverse#2559 ('Dataverse URL - Validation Error for Bad URL with "."'), or dataverse#4196 ("Error message for wrong/missing dataverse is not clear")?
    As you can see, my knowledge of the web side of this is limited; I don't understand those issues that well.

devtools::session_info()
  • Session info ---------------------------------------------------------------------------
    setting value
    version R version 3.6.1 Patched (2019-08-12 r76979)
    os Windows 10 x64
    system x86_64, mingw32
    ui RStudio
    language (EN)
    collate English_United States.1252
    ctype English_United States.1252
    tz America/Chicago
    date 2019-12-06

  • Packages -------------------------------------------------------------------------------
    package * version date lib source
    assertthat 0.2.1 2019-03-21 [1] CRAN (R 3.6.0)
    backports 1.1.5 2019-10-02 [1] CRAN (R 3.6.1)
    callr 3.3.2 2019-09-22 [1] CRAN (R 3.6.1)
    cli 1.1.0 2019-03-19 [1] CRAN (R 3.6.0)
    clipr 0.7.0 2019-07-23 [1] CRAN (R 3.6.1)
    crayon 1.3.4 2017-09-16 [1] CRAN (R 3.6.0)
    curl 4.3 2019-12-02 [1] CRAN (R 3.6.1)
    dataverse * 0.2.1 2019-12-07 [1] Github (bac89f4)
    desc 1.2.0 2018-05-01 [1] CRAN (R 3.6.0)
    devtools 2.2.1 2019-09-24 [1] CRAN (R 3.6.1)
    digest 0.6.23 2019-11-23 [1] CRAN (R 3.6.1)
    ellipsis 0.3.0 2019-09-20 [1] CRAN (R 3.6.1)
    evaluate 0.14 2019-05-28 [1] CRAN (R 3.6.0)
    fs 1.3.1 2019-05-06 [1] CRAN (R 3.6.0)
    glue 1.3.1 2019-03-12 [1] CRAN (R 3.6.0)
    htmltools 0.4.0 2019-10-04 [1] CRAN (R 3.6.1)
    httr 1.4.1 2019-08-05 [1] CRAN (R 3.6.1)
    jsonlite 1.6 2018-12-07 [1] CRAN (R 3.6.0)
    knitr 1.26 2019-11-12 [1] CRAN (R 3.6.1)
    magrittr 1.5 2014-11-22 [1] CRAN (R 3.6.0)
    memoise 1.1.0 2017-04-21 [1] CRAN (R 3.6.0)
    packrat 0.5.0 2018-11-14 [1] CRAN (R 3.6.0)
    pkgbuild 1.0.6 2019-10-09 [1] CRAN (R 3.6.1)
    pkgload 1.0.2 2018-10-29 [1] CRAN (R 3.6.0)
    prettyunits 1.0.2 2015-07-13 [1] CRAN (R 3.6.0)
    processx 3.4.1 2019-07-18 [1] CRAN (R 3.6.1)
    ps 1.3.0 2018-12-21 [1] CRAN (R 3.6.0)
    R6 2.4.1 2019-11-12 [1] CRAN (R 3.6.1)
    Rcpp 1.0.3 2019-11-08 [1] CRAN (R 3.6.1)
    remotes 2.1.0 2019-06-24 [1] CRAN (R 3.6.0)
    reprex 0.3.0 2019-05-16 [1] CRAN (R 3.6.0)
    rlang 0.4.2 2019-11-23 [1] CRAN (R 3.6.1)
    rmarkdown 1.18 2019-11-27 [1] CRAN (R 3.6.1)
    rprojroot 1.3-2 2018-01-03 [1] CRAN (R 3.6.0)
    rstudioapi 0.10 2019-03-19 [1] CRAN (R 3.6.0)
    sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 3.6.0)
    testthat 2.3.1 2019-12-01 [1] CRAN (R 3.6.1)
    usethis 1.5.1 2019-07-04 [1] CRAN (R 3.6.1)
    whisker 0.4 2019-08-28 [1] CRAN (R 3.6.1)
    withr 2.1.2 2018-03-15 [1] CRAN (R 3.6.0)
    xfun 0.11 2019-11-12 [1] CRAN (R 3.6.1)
    xml2 1.2.2 2019-08-09 [1] CRAN (R 3.6.1)

[Image: screenshot of Postman]
@EdJeeOnGitHub
Contributor

EdJeeOnGitHub commented Dec 7, 2019

Whilst I was looking around, @wibeasley, I noticed that line 104 in get_file.R never gets called because length(query) is always nonzero, so the if condition is always true. However, the query = query argument in the if branch is often what causes the problem - the else {} block would often work, but it is never run.

The relevant code is copied below.

if (length(query)) {
  r <- httr::GET(u, httr::add_headers("X-Dataverse-key" = key), query = query, ...)
} else {
  r <- httr::GET(u, httr::add_headers("X-Dataverse-key" = key), ...)
}
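A small base-R sketch of why the else branch is unreachable: R treats any nonzero length as TRUE in an `if` condition, and a one-element list has length 1. `branch_taken()` is a hypothetical name used only for illustration.

```r
# Hypothetical helper: report which branch of `if (length(query))` would run.
branch_taken <- function(query) {
  if (length(query)) "with query" else "without query"
}

branch_taken(list(format = "original"))  # "with query"
branch_taken(list())                     # "without query"
branch_taken(NULL)                       # "without query"
```

Only an empty list or NULL reaches the second branch, so a query that always contains `format` never does.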

@pdurbin
Member

pdurbin commented Dec 7, 2019

You guys might start regretting inviting me to be a maintainer.

Never! Are you coming to the 2020 Dataverse Community Meeting? 😄

  1. I assume this error is fairly new. Some change with Dataverse? If not, maybe it's related to a change in curl that was released 4 days ago?

My money is on a change in Dataverse, not curl. 😄

I'll try to dig in more on this during the work week. Have a good weekend!

@pdurbin
Member

pdurbin commented Dec 9, 2019

2. Why are csv & R files affected, but not tab files? As I step through a tab file

Is this related to the fact that passing "format=original" only works for tabular files? Please see IQSS/dataverse#6408
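If that is the cause, one possible approach is to send `format=original` only for ingested tabular files. The sketch below is an illustration under assumptions, not the package's actual fix: `build_access_query()` is a hypothetical helper, and it assumes ingested tabular files are the ones listed with content type "text/tab-separated-values" in the get_dataset() output.

```r
# Hypothetical helper: choose the query from a file's content type.
# Assumption: ingested tabular files are reported as
# "text/tab-separated-values" in the dataset's file listing.
build_access_query <- function(content_type) {
  if (identical(content_type, "text/tab-separated-values")) {
    list(format = "original")  # ask for the originally uploaded file
  } else {
    NULL                       # no query: request the file as stored
  }
}

build_access_query("text/tab-separated-values")
build_access_query("text/plain; charset=US-ASCII")
```

As far as I know, `query = NULL` is the default in httr::modify_url(), so passing the NULL result to httr::GET() should behave the same as omitting the argument.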

@wibeasley to be honest, I'm a little lost in this issue, probably because I'm not much of an R hacker. Please keep the questions coming. Please let me know how I can help. 😄

@kuriwaki
Member

kuriwaki commented Dec 9, 2019

I'd appreciate a fix or workaround for this. I currently cannot read non-ingested datasets, nor ingested Stata datasets that originate from Stata v14+ files. Here are three examples from the CCES, where the first one works but the other two do not.

library(dataverse)
# hide my key



# tab files in CCES 2017 (Stata v12 dataset) WORKS
# https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/3STEZY
cc17 <- get_file("Common Content Data.tab", "doi:10.7910/DVN/3STEZY")
writeBin(cc17,  "Common Content Data.dta")
cc17_dta <- foreign::read.dta("Common Content Data.dta")
cc17_dta <- haven::read_dta("Common Content Data.dta")

# tab files in CCES 2018 (Stata v14+ dataset) DOES NOT WORK
# possibly because of Stata version issue
# https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/ZSBZ7K
cc18 <- get_file("cces18_common_vv.tab", "doi:10.7910/DVN/ZSBZ7K")
writeBin(cc18,  "cces18_common_vv.dta")
cc18_dta <- foreign::read.dta("cces18_common_vv.dta")
#> Error in foreign::read.dta("cces18_common_vv.dta"): not a Stata version 5-12 .dta file
cc18_dta <- haven::read_dta("cces18_common_vv.dta")
#> Error in df_parse_dta_file(spec, encoding, cols_skip, n_max, skip, name_repair = .name_repair): 
#> Failed to parse /private/var/folders/gy/sd6ddp895s7dyqbdh2432fwm0000gn/T/
#> RtmpoYWXRS/reprex14f97d9c5c8b/cces18_common_vv.dta:
#> This version of the file format is not supported.

# Cumulative common content dta, not tabulated, DOES NOT WORK
# https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/II2DB6
ccc_d <- get_file("cumulative_2006_2018.dta", "doi:10.7910/DVN/II2DB6")
#> Error in get_file("cumulative_2006_2018.dta", "doi:10.7910/DVN/II2DB6"): 
#> Not Found (HTTP 404).
ccc_r <- get_file("cumulative_2006_2018.Rds", "doi:10.7910/DVN/II2DB6")
#> Error in get_file("cumulative_2006_2018.Rds", "doi:10.7910/DVN/II2DB6"): 
#> Not Found (HTTP 404).

Created on 2019-12-09 by the reprex package (v0.3.0)

@pdurbin
Member

pdurbin commented Dec 9, 2019

Error in foreign::read.dta("cces18_common_vv.dta"): not a Stata version 5-12 .dta file

It looks like @kuriwaki has opened #34 about this.


kuriwaki added a commit to kuriwaki/dataverse-client-r that referenced this issue Dec 11, 2019
@kuriwaki
Member

Thanks. It's good to know the data is there. As was originally pointed out, when I remove the query argument in httr::GET everything goes through fine. (Testable at devtools::install_github("kuriwaki/dataverse-client-r")).

Since dput(query) gives only this length-1 object, list(format = "original"), at least in my case, is there any need to keep that argument at all?

kuriwaki added a commit to kuriwaki/dataverse-client-r that referenced this issue Dec 12, 2019
@kuriwaki
Member

@wibeasley: the changes I've made in the fork seem to fix get_file so that it reads any single-file object. However, I'm not familiar enough with the Dataverse API or httr to assess whether it is stable or whether what I'm doing is recommended. If I submitted a PR, would you (or @pdurbin) be able to review / discuss it?

For example, here are results from @EdJeeOnGitHub's #31:

library(dataverse) # devtools::install_github("kuriwaki/dataverse-client-r")

dv_files <- get_dataset("doi:10.7910/DVN/JGLOZF")$files

# for each file in data
for (f in 1:nrow(dv_files)) {
  data_bytes <- as.integer(object.size(dataverse::get_file(dv_files$id[f])))
  data_mb <- round(measurements::conv_unit(data_bytes, "byte", "MB"), 3)
  
  metadata_mb <- round(measurements::conv_unit(dv_files$filesize[f], "byte", "MB"), 3)
  
  print(glue::glue("{dv_files$filename[f]}, {metadata_mb} MB in metadata, {data_mb} MB when downloaded"))
}
#> finalusingindices_anon.tab, 11.251 MB in metadata, 11.263 MB when downloaded
#> ReadMe with Codebook.docx, 0.036 MB in metadata, 0.036 MB when downloaded
#> The Hunger Project Dataverse Files.zip, 1.98 MB in metadata, 1.98 MB when downloaded
#> THPawareness_HH_anon.tab, 0.585 MB in metadata, 0.586 MB when downloaded

Created on 2019-12-16 by the reprex package (v0.3.0)

@wibeasley
Contributor Author

@kuriwaki the examples I posted initially are now working. Thanks so much for figuring it out and fixing it. I made one small addition in the commit referenced above. Basically, it catches the case where the file is already specified as a number/ID. Was that your intention, or am I misunderstanding something?

wibeasley added a commit that referenced this issue Jan 2, 2020
also correct `create_dataverse()`

ref #33
wibeasley added a commit that referenced this issue Jan 2, 2020

kuriwaki added the data-download label (Functions that are about downloading, not uploading, data) Dec 3, 2020
kuriwaki linked a pull request Dec 28, 2020 that will close this issue
kuriwaki added the bug label Dec 28, 2020
kuriwaki pinned this issue Dec 28, 2020
kuriwaki unpinned this issue Dec 28, 2020
kuriwaki changed the title from "404 errors in vignette" to "404 errors in vignette - get_file()" Dec 28, 2020
kuriwaki added a commit that referenced this issue Dec 28, 2020
…bular data in their original form. Otherwise, do not specify `format`