read.socrata hanging on JSON format #96

kevinsmgov · 2016-08-29T18:48:18Z

When I attempt to use read.socrata with a JSON format, the process hangs.
(example url https://data.smgov.net/resource/xx64-wi4x.json?$select=incident_number,incident_date,call_type,received_time,cleared_time,census_tract_2010_geoid&$where=incident_date=%272016-08-21%27)

Debugging through the process appears to show the problem in the getContentAsDataFrame function. When testing the JSON response for the end of a paged sequence:

if(httr::content(response, as = 'text') == "[ ]") # empty json?
(line 196)

but the string value I'm seeing at that point is

"[]\n"

So it's not matching and looping forever. Perhaps Socrata is using a different JSON serializer now than when this logic was originally written. (or, I may be using the package incorrectly - let me know if this appears to be the case)

Some possible suggestions:

1. Update the string (assuming that this does represent a change in the Socrata system and not something that is occurring because of some unique aspect of my query).
2. grepl the value to test for whitespace variations in the empty JSON string.
3. Deserialize the content to an R variable and test for an empty deserialization result before converting to data.frame (this would probably be the most robust solution, as it should already be immune to whitespace variations).
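As a rough illustration of the suggestions above, here is a minimal sketch of the whitespace-tolerant check and the parse-then-test check. The helper names is_empty_json_text and is_empty_json_parsed are hypothetical and are not part of RSocrata; the sketch assumes the jsonlite package is installed and operates on a plain string rather than a live httr response.

```r
library(jsonlite)

# Body reported in this issue; the existing check compares against "[ ]".
body <- "[]\n"

# The exact comparison is sensitive to whitespace, so this does not match.
exact_match <- body == "[ ]"

# Hypothetical helper: accept any whitespace around and inside the brackets.
is_empty_json_text <- function(txt) {
  grepl("^[[:space:]]*\\[[[:space:]]*\\][[:space:]]*$", txt)
}

# Hypothetical helper: parse first, then test for an empty result.
# This does not depend on how the server formats the empty array.
is_empty_json_parsed <- function(txt) {
  length(jsonlite::fromJSON(txt)) == 0
}
```

Either helper could stand in for the string comparison when deciding whether the last page has been reached; the parsed variant costs one extra parse but makes no assumption about the serializer's formatting.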


geneorama · 2016-08-30T16:17:53Z

@kevinsmgov Thanks for the nice example and suggestions. I was going to implement the third suggestion (I agree with your assessment). However, I'm unable to reproduce the exact error.
For me this code:

library(RSocrata)
read.socrata("https://data.smgov.net/resource/xx64-wi4x.json?$select=incident_number,incident_date,call_type,received_time,cleared_time,census_tract_2010_geoid&$where=incident_date=%272016-08-21%27")

results in this error:

Error in rbind(deparse.level, ...) : 
  numbers of columns of arguments do not match

I just updated all my packages to see if I was missing something that you might have. Here's my sessionInfo():

R version 3.3.1 (2016-06-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.1 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] RSocrata_1.7.0-14

loaded via a namespace (and not attached):
[1] httr_1.2.1   R6_2.1.3     tools_3.3.1  curl_1.2     jsonlite_1.0 mime_0.5

Is yours similar?

kevinsmgov · 2016-08-30T16:33:36Z

Sorry, I mixed up some debugging information. The issue with the
Error in rbind(deparse.level, ...) :
numbers of columns of arguments do not match
error is from the last field (census_tract_2010_geoid). If you remove that, you'll get a working example.

I'm still trying to figure out why that field is generating that error.

kevinsmgov · 2016-08-30T17:00:15Z

OK, the issue with that field (census_tract_2010_geoid) is that it is sometimes null. When Socrata serializes JSON, it leaves out null values. When the rbind occurs at line 270, it errors out because some rows have a different number of values.
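To make the failure mode concrete, here is a small sketch with made-up records in which the second one omits census_tract_2010_geoid, as Socrata does for nulls. It pads every record to the union of field names before binding. The helper pad_row and the sample values are hypothetical; this is not how RSocrata handles the problem, only one possible way to even out the rows, and it assumes jsonlite is installed and that all values arrive as strings.

```r
library(jsonlite)

# Made-up sample: the second record has no census_tract_2010_geoid key.
txt <- '[
  {"incident_number": "A1", "census_tract_2010_geoid": "06037701801"},
  {"incident_number": "A2"}
]'

# Keep the records as a list of named lists so missing keys stay visible.
rows <- jsonlite::fromJSON(txt, simplifyVector = FALSE)

# Union of all field names seen in any record.
all_names <- unique(unlist(lapply(rows, names)))

# Hypothetical helper: start from an all-NA row, then fill in present fields.
pad_row <- function(row, cols) {
  out <- setNames(as.list(rep(NA_character_, length(cols))), cols)
  out[names(row)] <- row
  as.data.frame(out, stringsAsFactors = FALSE)
}

# Every padded row now has the same columns, so rbind succeeds.
df <- do.call(rbind, lapply(rows, pad_row, cols = all_names))
```

This only addresses flat records; nested JSON fields would need further handling.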

There's probably not an easy answer for you here. In our SODA.NET library, we require the user to provide a target model to query into (we don't try to build a model from just their query result).

So, for JSON users, we'll need to restrict our queries to make sure that no null values are returned (e.g. https://data.smgov.net/resource/xx64-wi4x.json?$select=incident_number,incident_date,call_type,received_time,cleared_time,census_tract_2010_geoid&$where=incident_date=%272016-08-27%27%20and%20census_tract_2010_geoid%20is%20not%20null)

Additionally, you might enhance your CSV version to accept SoQL in the URL. That format guarantees a full tabular output regardless of null values.

geneorama · 2016-08-31T16:29:19Z

The "uneven row length in json downloads" problem is an old one; it's documented in #19 and came up again in #33. That's a tough one to fix, which is why it's still outstanding. Part of the complication is that the dataset columns have different names depending on whether they're CSV or JSON. Also, because of nesting, the JSON columns don't map 1:1 to CSV columns, so it's not easy to map using the metadata.
However, this newline / empty element issue is new. I hope to fix it later this week.

tomschenkjr · 2016-09-01T00:23:46Z

Thanks, this was very helpful in narrowing down the source of the bug. Marking as duplicate and closing this issue so the discussion can be consolidated into #19.

geneorama · 2016-09-01T00:57:50Z

@tomschenkjr sorry for the confusion, a lot of our dialogue was addressing an error that @kevinsmgov accidentally introduced at the last minute (the JSON uneven row error).

The problem of the infinite loop still occurs with his modified url:
dat <- read.socrata("https://data.smgov.net/resource/xx64-wi4x.json?$select=incident_number,incident_date,call_type,received_time,cleared_time,census_tract_2010_geoid&$where=incident_date=%272016-08-27%27%20and%20census_tract_2010_geoid%20is%20not%20null")

I fixed this and I'm creating a pull request. All tests pass. I don't know how to add a test for infinite loops (I'm sure there's a way, but I wanted to get something pushed before I head out for the night).
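On the question of testing for an infinite loop, one possible approach is to run the call under an elapsed-time limit and treat a timeout as a failure. The sketch below uses base R's setTimeLimit; the helper finishes_within is hypothetical, and it assumes the timeout error message contains the English phrase "time limit". It is not the test that was added to the package.

```r
# Hypothetical helper: TRUE if expr completes within `seconds`, FALSE if the
# elapsed-time limit is hit. Errors other than the timeout are re-raised.
finishes_within <- function(expr, seconds) {
  setTimeLimit(elapsed = seconds, transient = TRUE)
  # Always clear the limit again, whether or not expr finished.
  on.exit(setTimeLimit(cpu = Inf, elapsed = Inf, transient = FALSE))
  tryCatch({
    expr  # lazy evaluation: the argument is evaluated here, under the limit
    TRUE
  }, error = function(e) {
    # Assumption: the timeout message is in English and mentions "time limit".
    if (grepl("time limit", conditionMessage(e), fixed = TRUE)) {
      FALSE
    } else {
      stop(e)
    }
  })
}

# Stand-ins for a hanging call and a well-behaved call.
slow_result <- finishes_within(Sys.sleep(5), seconds = 0.2)
fast_result <- finishes_within(sum(1:10), seconds = 5)
```

In a test suite, the read.socrata call on the URL above could be wrapped the same way, with a limit generous enough to allow for network latency.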

geneorama · 2016-09-01T17:52:15Z

Added test for the url above, pull request is updated.
I also updated the DESCRIPTION and NEWS.md.

changed test for empty json in getContentAsDataFrame(). Closes #96

tomschenkjr · 2016-09-10T20:05:21Z

@geneorama - I've deleted the branch from GitHub since the fix was merged into dev

nicklucius · 2016-10-06T19:31:04Z

This is the same underlying issue as #19 and was closed by #102.

tomschenkjr added the bug label Aug 29, 2016

tomschenkjr added the duplicate label Sep 1, 2016

tomschenkjr closed this as completed Sep 1, 2016

geneorama removed the duplicate label Sep 1, 2016

geneorama reopened this Sep 1, 2016

tomschenkjr added a commit that referenced this issue Sep 8, 2016

Merge pull request #97 from Chicago/iss96

c299b5c

changed test for empty json in getContentAsDataFrame(). Closes #96

tomschenkjr added the on dev branch label Sep 10, 2016

tomschenkjr modified the milestone: v1.7.1 Oct 4, 2016

PriyaDoIT assigned nicklucius Oct 4, 2016

tomschenkjr closed this as completed Oct 11, 2016

tomschenkjr removed the on dev branch label Oct 11, 2016
