404 errors in vignette - get_file() #33

wibeasley · 2019-12-07T05:53:40Z

You guys might start regretting inviting me to be a maintainer. I'm having trouble reproducing the vignettes, even easy parts like retrieving plain-text R & CSVs.

Part 1: out of the box

remotes::install_github("iqss/dataverse-client-r")
#> Skipping install of 'dataverse' from a github remote, the SHA1 (bac89f46) has not changed since last install.
#>   Use `force = TRUE` to force installation
library("dataverse")
Sys.setenv("DATAVERSE_KEY" = "examplekey12345")
Sys.setenv("DATAVERSE_SERVER" = "dataverse.harvard.edu")
get_dataset("doi:10.7910/DVN/ARKOTI")
# Dataset (75170): 
# Version: 1.0, RELEASED
# Release Date: 2015-07-07T02:57:02Z
# License: CC0
# 22 Files:
# label version      id                  contentType
# 1                  alpl2013.tab       2 2692294    text/tab-separated-values
# 2                   BPchap7.tab       2 2692295    text/tab-separated-values
# 3                   chapter01.R       2 2692202 text/plain; charset=US-ASCII
# ...
# 16             drugCoverage.csv       1 2692233 text/plain; charset=US-ASCII
# ...

# Retrieve files by ID
object.size(dataverse::get_file(2692294)) # tab works
#> 211040 bytes
object.size(dataverse::get_file(2692295)) # tab works
#> 61336 bytes
object.size(dataverse::get_file(2692210)) # R fails
#> Error in dataverse::get_file(2692210): Not Found (HTTP 404).
object.size(dataverse::get_file(2692233)) # csv fails
#> Error in dataverse::get_file(2692233): Not Found (HTTP 404).

# Retrieve files by name & doi
object.size(get_file("alpl2013.tab"     , "doi:10.7910/DVN/ARKOTI")) # tab works
#> 211040 bytes
object.size(get_file("BPchap7.tab"      , "doi:10.7910/DVN/ARKOTI")) # tab works
#> 61336 bytes
object.size(get_file("chapter01.R"      , "doi:10.7910/DVN/ARKOTI")) # R fails
#> Error in get_file("chapter01.R", "doi:10.7910/DVN/ARKOTI"): Not Found (HTTP 404).
object.size(get_file("drugCoverage.csv" , "doi:10.7910/DVN/ARKOTI")) # csv fails
#> Error in get_file("drugCoverage.csv", "doi:10.7910/DVN/ARKOTI"): Not Found (HTTP 404).


# Taken straight from https://cran.r-project.org/web/packages/dataverse/vignettes/C-retrieval.html
code3 <- get_file("chapter03.R", "doi:10.7910/DVN/ARKOTI")
#> Error in get_file("chapter03.R", "doi:10.7910/DVN/ARKOTI"): Not Found (HTTP 404).

Created on 2019-12-06 by the reprex package (v0.3.0)

Part 2: digging.

Using debug(dataverse::get_file), the error-throwing line is in get_file():

r <- httr::GET(u, httr::add_headers(`X-Dataverse-key` = key), query = query, ...)

To make things a tad more direct, I called dataverse::get_file(2692233). The two relevant parameters to httr::GET() are

Browse[2]> query
$format
[1] "original"

Browse[2]> u
[1] "https://dataverse.harvard.edu/api/access/datafile/2692233"

The r value returned is

Response [https://dataverse.harvard.edu/api/access/datafile/2692233?format=original]
  Date: 2019-12-07 05:13
  Status: 404
  Content-Type: application/json
  Size: 201 B

That u value is fine when pasted into Chrome. I saw several Dataverse discussions about a trailing /. When I added that, the response appears good.

Browse[2]> u2 <- paste0("https://dataverse.harvard.edu/api/access/datafile/2692233", "/")
Browse[2]> httr::GET(u2, httr::add_headers(`X-Dataverse-key` = key), ... )
Response [https://dvn-cloud.s3.amazonaws.com/10.7910/DVN/ARKOTI/14e66408488-c678717f7c4d?response-content-disposition=attachment%3B%20filename%2A%3DUTF-8%27%27drugCoverage.csv&response-content-type=text%2Fplain%3B%20charset%3DUS-ASCII&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Date=20191207T051632Z&X-Amz-SignedHeaders=host&X-Amz-Expires=60&X-Amz-Credential=AKIAIEJ3NV7UYCSRJC7A%2F20191207%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Signature=c1b13a7d3ea2a53c1c1e70c18a762ae0e4ae14eb41fae7d79c71fce26a9b354f]
  Date: 2019-12-07 05:16
  Status: 200
  Content-Type: text/plain; charset=US-ASCII
  Size: 4.06 kB
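The two calls in the debugging session above differ in two ways: `u2` has a trailing slash, and the second `httr::GET()` is made without `query`. A minimal sketch for testing each change separately follows. The helper `access_url()` is hypothetical (it is not part of the dataverse package), and the network calls are left commented out because they need the httr package and a reachable server.

```r
# Hypothetical helper: build a Data Access API URL for a file id.
access_url <- function(server, file_id, trailing_slash = FALSE) {
  u <- paste0("https://", server, "/api/access/datafile/", file_id)
  if (trailing_slash) paste0(u, "/") else u
}

u  <- access_url("dataverse.harvard.edu", 2692233)
u2 <- access_url("dataverse.harvard.edu", 2692233, trailing_slash = TRUE)
print(u)
print(u2)

# Not run here (requires network access and httr):
# r_query    <- httr::GET(u,  query = list(format = "original"))
# r_no_query <- httr::GET(u)
# r_slash    <- httr::GET(u2, query = list(format = "original"))
# httr::status_code(r_query)
# httr::status_code(r_no_query)
# httr::status_code(r_slash)
# httr::content(r_query, as = "text")  # the 404 body is JSON and may explain the failure
```

Comparing the three status codes should indicate whether the slash or the `format` query is what changes the outcome.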

Part 3: Questions

1. I assume this error is fairly new. Some change with Dataverse? If not, maybe it's related to a change in curl that was released 4 days ago?
2. Why are csv & R files affected, but not tab files? As I step through a tab file (e.g., dataverse::get_file(2692294)), it appears the exact same lines are executed. And that u value doesn't have a trailing slash (https://dataverse.harvard.edu/api/access/datafile/2692294). I see two differences: (a) the content type and (b) this one doesn't go through AWS/S3.
```
Response [https://dataverse.harvard.edu/api/access/datafile/2692294?format=original]
  Date: 2019-12-07 05:35
  Status: 200
  Content-Type: application/x-stata; name="alpl2013.dta"
  Size: 211 kB
<BINARY BODY>
```
3. This is probably related to @EdJeeOnGitHub's recent issue, "More verbose httr error message for get_file()" (#31). Notice he mentions problems with certain file formats.
4. Is this related at all to "Dataverse URL - Page Not Found (404 Error) w/ Trailing Forward Slash" (dataverse#3130), "Dataverse URL - Validation Error for Bad URL with "."" (dataverse#2559), or "Error message for wrong/missing dataverse is not clear" (dataverse#4196)?

You can see that my knowledge of the web side of this is limited; I don't understand those issues that well.

devtools::session_info()

Session info ---------------------------------------------------------------------------
setting value
version R version 3.6.1 Patched (2019-08-12 r76979)
os Windows 10 x64
system x86_64, mingw32
ui RStudio
language (EN)
collate English_United States.1252
ctype English_United States.1252
tz America/Chicago
date 2019-12-06
Packages -------------------------------------------------------------------------------
package * version date lib source
assertthat 0.2.1 2019-03-21 [1] CRAN (R 3.6.0)
backports 1.1.5 2019-10-02 [1] CRAN (R 3.6.1)
callr 3.3.2 2019-09-22 [1] CRAN (R 3.6.1)
cli 1.1.0 2019-03-19 [1] CRAN (R 3.6.0)
clipr 0.7.0 2019-07-23 [1] CRAN (R 3.6.1)
crayon 1.3.4 2017-09-16 [1] CRAN (R 3.6.0)
curl 4.3 2019-12-02 [1] CRAN (R 3.6.1)
dataverse * 0.2.1 2019-12-07 [1] Github (bac89f4)
desc 1.2.0 2018-05-01 [1] CRAN (R 3.6.0)
devtools 2.2.1 2019-09-24 [1] CRAN (R 3.6.1)
digest 0.6.23 2019-11-23 [1] CRAN (R 3.6.1)
ellipsis 0.3.0 2019-09-20 [1] CRAN (R 3.6.1)
evaluate 0.14 2019-05-28 [1] CRAN (R 3.6.0)
fs 1.3.1 2019-05-06 [1] CRAN (R 3.6.0)
glue 1.3.1 2019-03-12 [1] CRAN (R 3.6.0)
htmltools 0.4.0 2019-10-04 [1] CRAN (R 3.6.1)
httr 1.4.1 2019-08-05 [1] CRAN (R 3.6.1)
jsonlite 1.6 2018-12-07 [1] CRAN (R 3.6.0)
knitr 1.26 2019-11-12 [1] CRAN (R 3.6.1)
magrittr 1.5 2014-11-22 [1] CRAN (R 3.6.0)
memoise 1.1.0 2017-04-21 [1] CRAN (R 3.6.0)
packrat 0.5.0 2018-11-14 [1] CRAN (R 3.6.0)
pkgbuild 1.0.6 2019-10-09 [1] CRAN (R 3.6.1)
pkgload 1.0.2 2018-10-29 [1] CRAN (R 3.6.0)
prettyunits 1.0.2 2015-07-13 [1] CRAN (R 3.6.0)
processx 3.4.1 2019-07-18 [1] CRAN (R 3.6.1)
ps 1.3.0 2018-12-21 [1] CRAN (R 3.6.0)
R6 2.4.1 2019-11-12 [1] CRAN (R 3.6.1)
Rcpp 1.0.3 2019-11-08 [1] CRAN (R 3.6.1)
remotes 2.1.0 2019-06-24 [1] CRAN (R 3.6.0)
reprex 0.3.0 2019-05-16 [1] CRAN (R 3.6.0)
rlang 0.4.2 2019-11-23 [1] CRAN (R 3.6.1)
rmarkdown 1.18 2019-11-27 [1] CRAN (R 3.6.1)
rprojroot 1.3-2 2018-01-03 [1] CRAN (R 3.6.0)
rstudioapi 0.10 2019-03-19 [1] CRAN (R 3.6.0)
sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 3.6.0)
testthat 2.3.1 2019-12-01 [1] CRAN (R 3.6.1)
usethis 1.5.1 2019-07-04 [1] CRAN (R 3.6.1)
whisker 0.4 2019-08-28 [1] CRAN (R 3.6.1)
withr 2.1.2 2018-03-15 [1] CRAN (R 3.6.0)
xfun 0.11 2019-11-12 [1] CRAN (R 3.6.1)
xml2 1.2.2 2019-08-09 [1] CRAN (R 3.6.1)

[Image: screenshot of Postman]


EdJeeOnGitHub · 2019-12-07T14:32:27Z

While I was looking around, @wibeasley, I noticed that line 104 in get_file.R never gets called because length(query) is always at least 1. However, the query = query argument in the if branch often causes the problems; the else {} block would often work, but it is never run.

The relevant code is copied below.

if (length(query)) {
  r <- httr::GET(u, httr::add_headers("X-Dataverse-key" = key), query = query, ...)
} else {
  r <- httr::GET(u, httr::add_headers("X-Dataverse-key" = key), ...)
}
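A small base-R illustration of why the `else` branch is unreachable, assuming `query` always holds the single `format` element reported later in this thread. The strings stand in for the two `httr::GET()` calls.

```r
# A one-element list has length 1, which if() treats as TRUE.
query <- list(format = "original")
length(query)   # 1

branch <- if (length(query)) "GET with query" else "GET without query"
branch          # "GET with query"

# Only an empty list would reach the else branch.
empty_query <- list()
empty_branch <- if (length(empty_query)) "GET with query" else "GET without query"
empty_branch    # "GET without query"
```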

pdurbin · 2019-12-07T22:55:03Z

You guys might start regretting inviting me to be a maintainer.

Never! Are you coming to the 2020 Dataverse Community Meeting? 😄

I assume this error is fairly new. Some change with Dataverse? If not, maybe it's related to a change in curl that was released 4 days ago?

My money is on a change in Dataverse, not curl. 😄

I'll try to dig in more on this during the work week. Have a good weekend!

pdurbin · 2019-12-09T21:24:32Z

2. Why are csv & R files affected, but not tab files? As I step through a tab file

Is this related to the fact that passing "format=original" only works for tabular files? Please see IQSS/dataverse#6408

@wibeasley to be honest, I'm a little lost in this issue, probably because I'm not much of an R hacker. Please keep the questions coming. Please let me know how I can help. 😄

kuriwaki · 2019-12-09T23:22:04Z

I'd appreciate a fix or workaround for this. I currently cannot read non-ingested datasets, nor ingested Stata datasets that originate from Stata v14+ files. Here are three examples from the CCES; the first one works but the other two do not.

library(dataverse)
# hide my key



# tab files in CCES 2017 (Stata v12 dataset) WORKS
# https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/3STEZY
cc17 <- get_file("Common Content Data.tab", "doi:10.7910/DVN/3STEZY")
writeBin(cc17,  "Common Content Data.dta")
cc17_dta <- foreign::read.dta("Common Content Data.dta")
cc17_dta <- haven::read_dta("Common Content Data.dta")

# tab files in CCES 2018 (Stata v14+ dataset) DOES NOT WORK
# possibly because of Stata version issue
# https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/ZSBZ7K
cc18 <- get_file("cces18_common_vv.tab", "doi:10.7910/DVN/ZSBZ7K")
writeBin(cc18,  "cces18_common_vv.dta")
cc18_dta <- foreign::read.dta("cces18_common_vv.dta")
#> Error in foreign::read.dta("cces18_common_vv.dta"): not a Stata version 5-12 .dta file
cc18_dta <- haven::read_dta("cces18_common_vv.dta")
#> Error in df_parse_dta_file(spec, encoding, cols_skip, n_max, skip, name_repair = .name_repair): 
#> Failed to parse /private/var/folders/gy/sd6ddp895s7dyqbdh2432fwm0000gn/T/
#> RtmpoYWXRS/reprex14f97d9c5c8b/cces18_common_vv.dta:
#> This version of the file format is not supported.

# Cumulative common content dta, not tabulated, DOES NOT WORK
# https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/II2DB6
ccc_d <- get_file("cumulative_2006_2018.dta", "doi:10.7910/DVN/II2DB6")
#> Error in get_file("cumulative_2006_2018.dta", "doi:10.7910/DVN/II2DB6"): 
#> Not Found (HTTP 404).
ccc_r <- get_file("cumulative_2006_2018.Rds", "doi:10.7910/DVN/II2DB6")
#> Error in get_file("cumulative_2006_2018.Rds", "doi:10.7910/DVN/II2DB6"): 
#> Not Found (HTTP 404).

Created on 2019-12-09 by the reprex package (v0.3.0)

pdurbin · 2019-12-09T23:40:10Z

Error in foreign::read.dta("cces18_common_vv.dta"): not a Stata version 5-12 .dta file

It looks like @kuriwaki has opened #34 about this.


kuriwaki · 2019-12-11T16:57:26Z

Thanks. It's good to know the data is there. As was originally pointed out, when I remove the query argument in httr::GET everything goes through fine. (Testable at devtools::install_github("kuriwaki/dataverse-client-r")).

Since dput(query) gives only the length-1 object list(format = "original"), at least in my case, is there any need to keep that argument at all?
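One possible answer, sketched under assumptions: keep `format` only when the target looks like an ingested tabular file, and send no query otherwise. The helper `build_query()` is hypothetical and is not the package's code, and the file-extension check is only a rough stand-in for knowing whether Dataverse actually ingested the file.

```r
# Hypothetical helper: return a query list only for files that look tabular.
build_query <- function(filename, format = "original") {
  looks_tabular <- grepl("\\.tab$", filename, ignore.case = TRUE)
  if (looks_tabular) list(format = format) else list()
}

q_tab <- build_query("alpl2013.tab")
q_csv <- build_query("drugCoverage.csv")

length(q_tab)  # 1, so a check like if (length(query)) would send format=original
length(q_csv)  # 0, so the request would be made without a query
```

With a helper like this, the existing `if (length(query))` check in get_file() would become meaningful, because the empty list would reach the branch that omits `query`.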

kuriwaki · 2019-12-16T19:12:07Z

@wibeasley: the changes I've made in the fork seem to fix get_file so that it can read any single-file object. However, I'm not familiar enough with the Dataverse API or httr to assess whether it is stable or whether what I'm doing is recommended. If I submitted a PR, would you (or @pdurbin) be able to review and discuss it?

For example, here are the results for the dataset from @EdJeeOnGitHub's #31:

library(dataverse) # devtools::install_github("kuriwaki/dataverse-client-r")

dv_files <- get_dataset("doi:10.7910/DVN/JGLOZF")$files

# for each file in data
for (f in 1:nrow(dv_files)) {
  data_bytes <- as.integer(object.size(dataverse::get_file(dv_files$id[f])))
  data_mb <- round(measurements::conv_unit(data_bytes, "byte", "MB"), 3)
  
  metadata_mb <- round(measurements::conv_unit(dv_files$filesize[f], "byte", "MB"), 3)
  
  print(glue::glue("{dv_files$filename[f]}, {metadata_mb} MB in metadata, {data_mb} MB when downloaded"))
}
#> finalusingindices_anon.tab, 11.251 MB in metadata, 11.263 MB when downloaded
#> ReadMe with Codebook.docx, 0.036 MB in metadata, 0.036 MB when downloaded
#> The Hunger Project Dataverse Files.zip, 1.98 MB in metadata, 1.98 MB when downloaded
#> THPawareness_HH_anon.tab, 0.585 MB in metadata, 0.586 MB when downloaded

Created on 2019-12-16 by the reprex package (v0.3.0)


wibeasley · 2020-01-01T23:54:04Z

@kuriwaki the examples I posted initially are now working. Thanks so much for figuring it out and fixing. I made one small addition in the commit referenced above. Basically, it catches the case if the file is already specified as a number/id. Was that your intention, or am I misunderstanding something?
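To make the "already specified as a number/id" case concrete, here is a minimal sketch of such a check. The helper `is_file_id()` is hypothetical and is not the code from the referenced commit. It assumes a single `file` value, and treating a digits-only string as an id is an assumption of this sketch.

```r
# Hypothetical helper: TRUE when `file` already looks like a Dataverse file id
# rather than a file name. Intended for a single value, not a vector.
is_file_id <- function(file) {
  is.numeric(file) || (is.character(file) && grepl("^[0-9]+$", file))
}

is_file_id(2692233)        # TRUE
is_file_id("2692233")      # TRUE
is_file_id("chapter01.R")  # FALSE
```

A caller could use a check like this to skip the name-to-id lookup against the dataset when an id is passed directly.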


kuriwaki mentioned this issue Dec 10, 2019

Reading Stata suggestion - change foreign to haven? #34

Closed


kuriwaki added a commit to kuriwaki/dataverse-client-r that referenced this issue Dec 11, 2019

remove query option for format != "bundle" & length == 1, as workaround for IQSS#33

27560a2

kuriwaki added a commit to kuriwaki/dataverse-client-r that referenced this issue Dec 12, 2019

backtrack; use query if the target looks like a .tab file for IQSS#33

4d45330

This was referenced Jan 1, 2020

package scaffolding #38

Closed

addressing 404 errors in vignette #39

Merged

wibeasley added a commit that referenced this issue Jan 1, 2020

catch case if file is already specified as a number

add1b62

cc: @kuriwaki ref #33

wibeasley added a commit that referenced this issue Jan 2, 2020

Clearly state dvn isn't on CRAN any more

e8a9126

also correct `create_dataverse()` ref #33

wibeasley added a commit that referenced this issue Jan 2, 2020

more sympathetic if archived scripts don't run automatically

d69422d

run in a new environment ref #33


andybega mentioned this issue Jan 6, 2020

404 error at download attempt andybega/icews#51

Open


kuriwaki added the data-download label (Functions that are about downloading, not uploading, data) Dec 3, 2020

kuriwaki linked a pull request Dec 28, 2020 that will close this issue

addressing 404 errors in vignette #39

Merged

kuriwaki added the bug label Dec 28, 2020

kuriwaki pinned this issue Dec 28, 2020

kuriwaki unpinned this issue Dec 28, 2020

kuriwaki changed the title ~~404 errors in vignette~~ 404 errors in vignette - get_file() Dec 28, 2020

kuriwaki added a commit that referenced this issue Dec 28, 2020

Simpler condition for #33 - only use format for getting ingested tabular data in their original form. Otherwise, do not specify `format`

22427e0

kuriwaki mentioned this issue Dec 30, 2020

Reorganize get_file functions and add a get_dataframe functions #66

Merged

wibeasley closed this as completed in #66 Jan 17, 2021

pdurbin mentioned this issue Feb 10, 2023

404, not found when providing format=original in Data Access API IQSS/dataverse#6408

Open
