read.socrata hanging on JSON format #96

Closed
kevinsmgov opened this issue Aug 29, 2016 · 9 comments

@kevinsmgov

When I attempt to use read.socrata with a JSON-format URL, the process hangs.
(example url https://data.smgov.net/resource/xx64-wi4x.json?$select=incident_number,incident_date,call_type,received_time,cleared_time,census_tract_2010_geoid&$where=incident_date=%272016-08-21%27)

Debugging through the process appears to show the problem in the getContentAsDataFrame function. When testing the JSON response for the end of a paged sequence:

if(httr::content(response, as = 'text') == "[ ]") # empty json?
(line 196)

but the string value I'm seeing at that point is

"[]\n"

So it's not matching and looping forever. Perhaps Socrata is using a different JSON serializer now than when this logic was originally written. (or, I may be using the package incorrectly - let me know if this appears to be the case)

Some possible suggestions:

  1. update the string (assuming that this does represent a change in the Socrata system and not something that is occurring because of some unique aspect of my query)
  2. grepl the value to test for whitespace variations in the empty JSON string
  3. deserialize the content to an r variable and test for an empty deserialization result before converting to data.frame (this would probably be the most robust solution as it should already be insusceptible to whitespace variations).
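
Suggestions 2 and 3 could be sketched roughly as follows. Both function names are hypothetical helpers for illustration, not part of RSocrata, and the second assumes the jsonlite package is installed; neither is necessarily how the package fixed the check.

```r
# Hypothetical helper (suggestion 2): whitespace-tolerant test for an
# empty JSON array, so "[ ]", "[]" and "[]\n" are all treated as empty.
is_empty_json_text <- function(body_text) {
  grepl("^[[:space:]]*\\[[[:space:]]*\\][[:space:]]*$", body_text)
}

# Hypothetical helper (suggestion 3): parse the body first and test the
# parsed result. Assumes jsonlite is available; an empty JSON array
# parses to a zero-length list.
is_empty_json_parsed <- function(body_text) {
  length(jsonlite::fromJSON(body_text)) == 0
}

# The exact-match test from the report fails on the observed body,
# which is why the paging loop never sees the end of the sequence.
observed_body <- "[]\n"
exact_match <- observed_body == "[ ]"
tolerant_match <- is_empty_json_text(observed_body)
```

The parsed version does not depend on how the serializer pads the empty array, at the cost of parsing every page once more unless the parsed value is reused for the data.frame conversion.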
@geneorama
Member

@kevinsmgov Thanks for the nice example and suggestions. I was going to implement the third suggestion (I agree with your assessment). However, I'm unable to reproduce the exact error.
For me this code:

library(RSocrata)
read.socrata("https://data.smgov.net/resource/xx64-wi4x.json?$select=incident_number,incident_date,call_type,received_time,cleared_time,census_tract_2010_geoid&$where=incident_date=%272016-08-21%27")

results in this error:

Error in rbind(deparse.level, ...) : 
  numbers of columns of arguments do not match

I just updated all my packages to see if I was missing something that you might have. Here's my sessionInfo():

R version 3.3.1 (2016-06-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.1 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] RSocrata_1.7.0-14

loaded via a namespace (and not attached):
[1] httr_1.2.1   R6_2.1.3     tools_3.3.1  curl_1.2     jsonlite_1.0 mime_0.5  

Is yours similar?

@kevinsmgov
Author

Sorry, I mixed up some debugging information. The issue with the
Error in rbind(deparse.level, ...) :
numbers of columns of arguments do not match
error is from the last field (census_tract_2010_geoid). If you remove that, you'll get a working example.

I'm still trying to figure out why that field is generating that error.

@kevinsmgov
Author

OK, the issue with that field (census_tract_2010_geoid) is that it is sometimes null. When Socrata serializes JSON, it leaves out null fields. When the rbind occurs at line 270, it errors out because some rows have a different number of values.
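
To illustrate the mismatch, here is a minimal sketch in base R using made-up rows (the values are invented for illustration, not taken from the dataset). It pads each row with NA for any field the JSON omitted before binding. This is one possible workaround under those assumptions, not what RSocrata does.

```r
# Two parsed JSON records; the second lacks the null field entirely,
# mimicking how the serializer drops null values. Values are invented.
rows <- list(
  list(incident_number = "A1", census_tract_2010_geoid = "0001"),
  list(incident_number = "A2")
)

# Union of all field names seen across the rows.
all_names <- unique(unlist(lapply(rows, names)))

# Pad each row with NA for absent fields, then convert to a 1-row data.frame
# with columns in a consistent order.
padded <- lapply(rows, function(row) {
  absent <- setdiff(all_names, names(row))
  row[absent] <- NA
  as.data.frame(row[all_names], stringsAsFactors = FALSE)
})

# All rows now have the same columns, so rbind no longer fails.
df <- do.call(rbind, padded)
```

This only handles flat records; nested JSON fields would need flattening first.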

There's probably not an easy answer for you here. In our SODA.NET library, we require the user to provide a target model to query into (we don't try to build a model from just their query result).

So, for JSON users, we'll need to restrict our queries to make sure that no null values are returned (e.g. https://data.smgov.net/resource/xx64-wi4x.json?$select=incident_number,incident_date,call_type,received_time,cleared_time,census_tract_2010_geoid&$where=incident_date=%272016-08-27%27%20and%20census_tract_2010_geoid%20is%20not%20null)

Additionally, you might enhance your CSV version to accept SoQL in the URL. That format guarantees a full tabular output regardless of null values.

@geneorama
Member

The "uneven row length in json downloads" problem is an old one. It's documented in #19 and came up again in #33. That's a tough one to fix, which is why it's still outstanding. Part of the complication is that the dataset columns have different names depending on whether they're CSV or JSON. Also, because of nesting, the JSON columns don't map 1:1 to CSV columns, so it's not easy to map them using the metadata.
However, this newline / empty element issue is new. I hope to fix it later this week.

@tomschenkjr
Contributor

Thanks, this was very helpful in narrowing down the source of the bug. Marking as duplicate and closing this issue so the discussion can be consolidated into #19.

@geneorama
Member

@tomschenkjr Sorry for the confusion. A lot of our dialogue was addressing an error that @kevinsmgov accidentally introduced at the last minute (the JSON uneven row error).

The problem of the infinite loop still occurs with his modified url:
dat <- read.socrata("https://data.smgov.net/resource/xx64-wi4x.json?$select=incident_number,incident_date,call_type,received_time,cleared_time,census_tract_2010_geoid&$where=incident_date=%272016-08-27%27%20and%20census_tract_2010_geoid%20is%20not%20null")

I fixed this and I'm creating a pull request. All tests pass. I don't know how to add a test for infinite loops (I'm sure there's a way, but I wanted to get something pushed before I head out for the night).

@geneorama
Member

Added test for the url above, pull request is updated.
I also updated the DESCRIPTION and NEWS.md.

@geneorama geneorama reopened this Sep 1, 2016
tomschenkjr added a commit that referenced this issue Sep 8, 2016
changed test for empty json in getContentAsDataFrame(). Closes #96
@tomschenkjr
Contributor

@geneorama - I've deleted the branch from GitHub since the fix was merged into dev

@nicklucius
Contributor

This is the same underlying issue as #19 and was closed by #102.
