
read.socrata() fails on some datasets due to missing response headers from server #118

Closed
pschmied opened this issue Nov 16, 2016 · 31 comments

@pschmied
Contributor

Per @chrismetcalf 's request, I'm posting this here and tagging him to take a look.

I've found that some datasets are returned with a set of response headers that cause read.socrata() to error out. I was led to believe the issue has to do with the type of backend (new vs. old) on the Socrata side.

See below for full details, but the issue seems to be that the response header x-soda2-types isn't getting returned. RSocrata tries to extract the column types from the missing response header, and fails.

library(RSocrata)

sessionInfo()
## R version 3.3.2 (2016-10-31)
## Platform: x86_64-apple-darwin16.1.0 (64-bit)
## Running under: macOS Sierra 10.12.1

## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     

## other attached packages:
## [1] RSocrata_1.7.1-24

## loaded via a namespace (and not attached):
## [1] compiler_3.3.2 httr_1.2.1     plyr_1.8.4     R6_2.2.0       Rcpp_0.12.7   
## [6] jsonlite_1.1   mime_0.5

## Example earthquake dataset referenced in RSocrata docs
ex_url <- "https://soda.demo.socrata.com/resource/4334-bgaj.csv"

## This is an example dataset that I believe to be using Socrata's new back end.
nbe_url <- "https://odn.data.socrata.com/resource/pvug-y23y.csv"


## Loads as expected
ex_df <- read.socrata(ex_url)


## Errors out: "Error: Argument 'txt' must be a JSON string, URL or file."
nbe_df <- read.socrata(nbe_url)


traceback()
## 5: base::stop(..., call. = FALSE)
## 4: stop("Argument 'txt' must be a JSON string, URL or file.")
## 3: jsonlite::fromJSON(response$headers[["x-soda2-types"]])
## 2: getSodaTypes(response)
## 1: read.socrata(nbe_url)
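The failure mode in the traceback can be reproduced without any network call: indexing a headers list for a missing entry returns NULL, and jsonlite::fromJSON() rejects NULL with exactly this error. A minimal sketch (the `headers` list below is a stand-in for the real response object, not RSocrata code):

```r
library(jsonlite)

## Stand-in for the NBE response: a headers list with no "x-soda2-types" entry
headers <- list("content-type" = "text/csv")

types <- headers[["x-soda2-types"]]  # NULL, since the header is absent

## fromJSON(NULL) throws the error seen in the traceback above
msg <- tryCatch(fromJSON(types), error = function(e) conditionMessage(e))
print(msg)
```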


## Get http responses for comparison
ex_resp <- httr::GET(ex_url)
nbe_resp <- httr::GET(nbe_url)


## Earthquake dataset response headers include `x-soda2-types`
names(ex_resp$headers)
##  [1] "server"                      "date"                       
##  [3] "content-type"                "transfer-encoding"          
##  [5] "connection"                  "x-socrata-requestid"        
##  [7] "access-control-allow-origin" "last-modified"              
##  [9] "x-soda2-warning"             "x-soda2-fields"             
## [11] "x-soda2-types"               "x-soda2-legacy-types"       
## [13] "age"                         "x-socrata-region"           
## [15] "content-encoding"           

## NBE-backed entities dataset does not
names(nbe_resp$headers)
##  [1] "server"                          "date"                           
##  [3] "content-type"                    "transfer-encoding"              
##  [5] "connection"                      "x-socrata-requestid"            
##  [7] "access-control-allow-origin"     "x-soda2-data-out-of-date"       
##  [9] "x-soda2-truth-last-modified"     "x-soda2-secondary-last-modified"
## [11] "last-modified"                   "age"                            
## [13] "x-socrata-region"                "content-encoding"     
@tomschenkjr
Contributor

Thanks @pschmied. I can replicate this locally, and it appears to be a valid bug.

@chrismetcalf - do you think pvug-y23y will always have the same number of rows? I want to create a unit test for SoDA 2 support and am curious whether I can always check that this data set has 68,087 rows.

@tomschenkjr
Contributor

@Chicago/rsocrata-devs - I've created a new branch issue118 that includes a unit test for this particular issue. The version is bumped to 1.7.2-1. On Monday we can discuss whether this is a potential hot fix or should be wrapped into our 1.8.0 release, since it deals with the new API. Our documentation may need to be more deliberate about which APIs we have specifically tested.

Looping-in @levyj to help navigate the NBE lingo.

@nicklucius - can you look at this code?

@pschmied
Contributor Author

@tomschenkjr My hunch is that pvug-y23y WILL change. Maybe we could upload a classic, open, very unlikely to change dataset such as iris as a reference. I joined Socrata a few days ago, so I can talk to folks here to see if that would be appropriate.

Before modifying the RSocrata package, we might see if @chrismetcalf can verify that indeed the correct set of headers are being returned from the Socrata side. If NBE-backed datasets are supposed to be returned with type information, then that would suggest a fix on our end.

@tomschenkjr
Contributor

@pschmied - first, congrats on the new job! 🍾

I agree on having some test data sets that we can use for unit testing and suggested it in #45. We've been using the Earthquake example--and it's been fine--but having one for different versions of SoDA would be helpful.

We'll hold off on work for now (unassigning @levyj and @nicklucius for now).

@nicklucius
Contributor

@tomschenkjr - I just saw your last message, but was about to post this:

I agree that the code below causes an error because the response header x-soda2-types is not getting returned, as @pschmied reported above.

nbe_url <- "https://odn.data.socrata.com/resource/pvug-y23y.csv"
nbe_df <- read.socrata(nbe_url)

read.socrata() uses the data types found in x-soda2-types to trigger posixify() to run on date/time columns and no_deniro() to run on money columns.

As a fix, for any URL where x-soda2-types turns out to be missing, I could redirect the getSodaTypes() function to find the data types by changing the URL from /resource/ to /views/. In this example, we can get the data types for https://odn.data.socrata.com/resource/pvug-y23y.csv as follows:

  • Change the URL to https://odn.data.socrata.com/views/pvug-y23y to get these results:
{
  "id" : "pvug-y23y",
  "name" : "Entities",
  "averageRating" : 0,
  "createdAt" : 1473714769,
  "displayType" : "table",
  "downloadCount" : 11,
  "hideFromCatalog" : false,
  "hideFromDataJson" : false,
  "indexUpdatedAt" : 1474044590,
  "newBackend" : true,
  "numberOfComments" : 0,
  "oid" : 23753334,
  "provenance" : "official",
  "publicationAppendEnabled" : true,
  "publicationDate" : 1473714855,
  "publicationGroup" : 13035076,
  "publicationStage" : "published",
  "rowClass" : "",
  "rowIdentifierColumnId" : 269859293,
  "rowsUpdatedAt" : 1474481590,
  "rowsUpdatedBy" : "7krv-k7t3",
  "tableId" : 13035076,
  "totalTimesRated" : 0,
  "viewCount" : 50,
  "viewLastModified" : 1474044581,
  "viewType" : "tabular",
  "columns" : [ {
    "id" : 269859293,
    "name" : "id",
    "dataTypeName" : "text",
    "fieldName" : "id",
    "position" : 1,
    "renderTypeName" : "text",
    "tableColumnId" : 38329919,
    "width" : 186,
    "format" : { }
  }, {
    "id" : 269859296,
    "name" : "name",
    "dataTypeName" : "text",
    "fieldName" : "name",
    "position" : 2,
    "renderTypeName" : "text",
    "tableColumnId" : 38329922,
    "width" : 443,
    "format" : { }
  }, {
    "id" : 269859294,
    "name" : "type",
    "dataTypeName" : "text",
    "fieldName" : "type",
    "position" : 3,
    "renderTypeName" : "text",
    "tableColumnId" : 38329920,
    "width" : 183,
    "format" : { }
  }, {
    "id" : 269859295,
    "name" : "rank",
    "dataTypeName" : "number",
    "fieldName" : "rank",
    "position" : 4,
    "renderTypeName" : "number",
    "tableColumnId" : 38329921,
    "width" : 147,
    "format" : { }
  } ],
  • Use the dataTypeName in place of x-soda2-types
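A sketch of that second step, parsing an inline excerpt of the /views/ metadata shown above (the excerpt stands in for the full HTTP response, which would come from a GET on the /views/ URL):

```r
library(jsonlite)

## Inline excerpt of the /views/pvug-y23y metadata shown above;
## in practice this JSON would be fetched over HTTP.
views_json <- '{
  "id": "pvug-y23y",
  "columns": [
    {"fieldName": "id",   "dataTypeName": "text"},
    {"fieldName": "name", "dataTypeName": "text"},
    {"fieldName": "rank", "dataTypeName": "number"}
  ]
}'

meta <- fromJSON(views_json)

## fromJSON simplifies "columns" to a data frame, so dataTypeName is a
## character vector usable in place of the x-soda2-types header
meta$columns$dataTypeName  # c("text", "text", "number")
```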

@pschmied
Contributor Author

Thank you, @tomschenkjr ! I'm very excited--I've been a beneficiary of Socrata-backed open-data portals for some time now.

As for @nicklucius 's approach, that seems very reasonable. Sadly, I'm too new on the job to know what the better long-term solution is. That's where I'll rely on Chris, who'll have a much better sense of where the API is and where it's likely to go ;-)

@chrismetcalf

Catching up...

@chrismetcalf - do you think pvug-y23y will always have the same number of rows? Want to create a unit test for SoDA 2 support and curious if I can always check this data set has 68,087 rows.

I wouldn't do that, since the dataset is likely to change over time. However, you could always create a unit test that passes a known $limit and set an assertion on that.

@tomschenkjr @levyj In general, would it be good to have a couple of datasets that I maintain in a "known state" for creating test suites against? I could easily set up a couple of datasets somewhere that you could use to test different conditions.

For soda-ruby, I've stubbed out the actual HTTP calls, but it might be good to have some integration tests too.

@tomschenkjr
Contributor

@chrismetcalf - 10-4 on not relying on row counts. It'd be very helpful to have static data sets for testing with known conditions.

For the underlying issue: is the NBE backend returning the correct SoDA headers in this case?


@chrismetcalf

It does appear that the X-SODA2-Fields and X-SODA2-Types headers are omitted when you select the .csv output type. I'm not sure why they'd be gone in that case, so I'll create a ticket with our Engineering team to see what's up.

@tomschenkjr tomschenkjr self-assigned this Dec 6, 2016
@tomschenkjr
Contributor

Hey @chrismetcalf - just wanted to follow up on this to see if this was an error in the return headers or if we shouldn't be expecting those X-SODA2- entries anymore.

@chrismetcalf

@tomschenkjr I just bumped the JIRA ticket with our engineering team to see if I can get an update.

Internal ticket (you won't be able to open that link): https://socrata.atlassian.net/browse/EN-12140

@tomschenkjr
Contributor

tomschenkjr commented Dec 6, 2016 via email

@tomschenkjr
Contributor

@chrismetcalf - just following up on this.

@chrismetcalf

@tomschenkjr I just checked in on the ticket and it's still in the queue. I bumped the priority, but if you file a ticket with Support and ask for it to be linked to EN-12140, that'll probably get it more attention.

@tomschenkjr
Contributor

tomschenkjr commented Jan 10, 2017 via email

@chrismetcalf

Thanks @tomschenkjr

@tomschenkjr
Contributor

Received feedback from the @socrata support team: it appears the missing X-SODA2- headers were an error. Closing this issue since it's not related to RSocrata directly, and the engineering team is fixing the API issue.

Thanks for reporting!

@chrismetcalf

@tomschenkjr FYI, just got the notification that the header fix for those datasets just hit staging, so a fix for the bug on our end should be out soon! Great seeing you guys yesterday!

@tomschenkjr
Contributor

This has been resolved by the engineering team. Just let me know if you see it again.

@hrecht

hrecht commented Feb 17, 2017

Hi, I'm getting this same error on two datasets, https://data.healthcare.gov/dataset/2016-QHP-Landscape-Individual-Market-Medical/v7sn-c66v/data and https://data.healthcare.gov/dataset/2017-QHP-Landscape-Individual-Market-Medical/enpz-m4q6
I've tested a few other datasets from the same site and those read without errors.

> myurl <- "https://data.healthcare.gov/dataset/2017-QHP-Landscape-Individual-Market-Medical/enpz-m4q6"
> df <- read.socrata(myurl)
Error: Argument 'txt' must be a JSON string, URL or file.
> traceback()
5: base::stop(..., call. = FALSE)
4: stop("Argument 'txt' must be a JSON string, URL or file.")
3: jsonlite::fromJSON(response$headers[["x-soda2-types"]])
2: getSodaTypes(response)
1: read.socrata(myurl)

I also tried reading from the csv and json resource urls for these datasets with no change in error message.

@tomschenkjr
Contributor

Thanks, @hrecht. This seems to be a @socrata server-side issue. I will ping them to let them know the issue still exists in other data sets--I am not sure of the underlying cause.

If it persists or is being encountered at a high rate, we may need to release a bug fix that skips this validation, or at least throws a warning rather than a fatal error.

@hrecht

hrecht commented Feb 19, 2017

Thanks @tomschenkjr! Understood that it's a @socrata issue rather than RSocrata issue, hopefully they'll be able to fix.

@chrismetcalf

@tomschenkjr @hrecht I'm following up with the team internally and we'll get back to you once we know more.

@tomschenkjr How reliant are you on the X-SODA2-Fields header in RSocrata? In this case, it looks like we might be deliberately omitting the *-Fields header, since that dataset includes so many columns that the column names alone exceed NGinx's 4K limit on total header size. See the X-SODA2-Warning: X-SODA2-Fields, X-SODA2-Types, and X-SODA2-Legacy-Types are deprecated and have been suppressed for being too large header. We might need a different mechanism.

@tomschenkjr
Contributor

@chrismetcalf - we use it to retrieve the data types being used. The most important use is identifying dates and converting them into a proper date column in R. Less important, we also use it to identify money fields so R handles them as float/numeric fields instead of strings.

If this were deprecated, we could fall back on guessing the field types.

@tomschenkjr
Contributor

Reopening for now since we are revisiting a workaround for this.

@tomschenkjr tomschenkjr reopened this Mar 7, 2017
@chrismetcalf

Following up on my comment above: we are in fact conditionally omitting the X-SODA2-* headers when they would push the total header size over the 4K limit. So I wouldn't rely on them when building a general-purpose library. I'm going to clarify that on dev.socrata.com so hopefully nobody else gets hung up on it: https://github.com/socrata/dev.socrata.com/issues/677

Now we need another, more reliable way to get you the column types. There are of course unsupported ways of getting this, but I'd really rather have you on something real, or pursue the guessing option.

How often are people hitting this issue? It should only happen for very wide datasets or datasets with very long column names. Another workaround would be to specify a $select to limit the number of columns returned.
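A sketch of that $select workaround, building the request URL by hand (the column names `plan_id` and `premium` are illustrative, not taken from the actual dataset):

```r
## Hypothetical column names; substitute the columns you actually need.
## A narrower column list keeps the x-soda2-* headers (when present)
## under the 4K total-header-size limit.
base_url  <- "https://data.healthcare.gov/resource/enpz-m4q6.csv"
query_url <- paste0(base_url, "?$select=plan_id,premium&$limit=1000")
query_url
```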

@PriyaDoIT PriyaDoIT added this to the 1.7.2 milestone Mar 15, 2017
@nicklucius
Contributor

Our development team sketched out how RSocrata will handle this: if the X-SODA2-* headers are not present, read.socrata() will no longer fail. Instead, it will skip posixify() and no_deniro() and return date and currency fields as character, with a warning. That way, users still get the data they are seeking.

I'll get started on the development for this fix.
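A minimal sketch of what that fallback could look like inside getSodaTypes() (the function body here is illustrative, not the actual RSocrata implementation; the fake response object stands in for an httr response):

```r
## Sketch only: warn and return NULL when the header is missing, so the
## caller can skip posixify()/no_deniro() and leave columns as character.
getSodaTypes <- function(response) {
  raw <- response$headers[["x-soda2-types"]]
  if (is.null(raw)) {
    warning("x-soda2-types header is missing; date and money fields ",
            "will be returned as character")
    return(NULL)
  }
  jsonlite::fromJSON(raw)
}

## Fake response object with no x-soda2-types header
fake_response <- list(headers = list("content-type" = "text/csv"))
types <- withCallingHandlers(
  getSodaTypes(fake_response),
  warning = function(w) invokeRestart("muffleWarning")
)
is.null(types)  # TRUE
```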

nicklucius added a commit that referenced this issue Mar 16, 2017
@nicklucius
Contributor

@tomschenkjr - Just noting this to help explain my changes: I rebased the issue118 branch since it was a few months behind dev. Then after the fix was completed in issue118, I merged it directly to dev according to the new Git Flow.

@tomschenkjr tomschenkjr removed their assignment Mar 16, 2017
@tomschenkjr
Contributor

@nicklucius - sounds good. I'm handing the issue over to you for development. Good night, good luck 🖖

@geneorama
Member

The test for this started failing recently.

The section of code:

test_that("Warn instead of fail if X-SODA2-* headers are missing", {
  expect_warning(dfCsv <- read.socrata("https://data.healthcare.gov/resource/enx3-h2qp.csv?$limit=1000"),
                info="https://github.com/Chicago/RSocrata/issues/118")
  expect_warning(dfJson <- read.socrata("https://data.healthcare.gov/resource/enx3-h2qp.json?$limit=1000"),
                info="https://github.com/Chicago/RSocrata/issues/118")
  expect_silent(df <- read.socrata("https://odn.data.socrata.com/resource/pvug-y23y.csv"))
  expect_silent(df <- read.socrata("https://odn.data.socrata.com/resource/pvug-y23y.json"))
  expect_equal("data.frame", class(dfCsv), label="class", info="https://github.com/Chicago/RSocrata/issues/118")
  expect_equal("data.frame", class(dfJson), label="class", info="https://github.com/Chicago/RSocrata/issues/118")
  expect_equal(150, ncol(dfCsv), label="columns", info="https://github.com/Chicago/RSocrata/issues/118")
  expect_equal(140, ncol(dfJson), label="columns", info="https://github.com/Chicago/RSocrata/issues/118")
})

The actual failing test message:

>   expect_equal(150, ncol(dfCsv), label="columns", info="https://github.com/Chicago/RSocrata/issues/118")
Error: columns not equal to ncol(dfCsv).
1/1 mismatches
[1] 150 - 146 == 4
https://github.com/Chicago/RSocrata/issues/118

I thought this might be useful, but it didn't help me:

> setdiff(colnames(dfJson), colnames(dfCsv))
[1] "url"   "url.1" "url.2" "url.3"
> setdiff(colnames(dfCsv), colnames(dfJson))
 [1] "network_url"                                          
 [2] "plan_brochure_url"                                    
 [3] "summary_of_benefits_url"                              
 [4] "drug_formulary_url"                                   
 [5] "adult_dental"                                         
 [6] "premium_scenarios"                                    
 [7] "standard_plan_cost_sharing"                           
 [8] "X_73_percent_actuarial_value_silver_plan_cost_sharing"
 [9] "X_87_percent_actuarial_value_silver_plan_cost_sharing"
[10] "X_94_percent_actuarial_value_silver_plan_cost_sharing"

It looks like the URL columns differ in name only, but the other six columns are missing from the JSON. I'm not sure whether this is related to this issue or something else.

@nicklucius
Contributor

@geneorama - changing the expected value to 146 lines up the test expectation with the incoming data and removes the error. If there were a problem related to issue #118, an error would still be thrown. Therefore, I think the number of columns returned by Socrata has changed, and that is what caused the test to fail. As such, nothing I'm seeing suggests anything related to #118.
