
Add function to recursively export all Socrata data to local directory #126

Open
tomschenkjr opened this issue Mar 5, 2017 · 25 comments


@tomschenkjr tomschenkjr commented Mar 5, 2017

The ls.socrata() function supports listing all data on a data portal, while read.socrata() downloads individual datasets. Thus, the two can be combined in an export.socrata() function that downloads all of the files, neatly compresses them, and places them in a single directory.

This sort of functionality can be used by people wishing to archive data portals or to help migrate from one platform to another. The function should focus on saving the data to local or cloud-based storage (e.g., S3) and should avoid loading all of the data into memory.

Solving #124 will allow an easier integration between ls.socrata() and read.socrata(), so RSocrata v1.7.2-7 or above will be required.

  • Output compressed CSVs into a directory named after the domain itself (e.g., data.cityofchicago.org/)
  • Timestamp each individual file (files should be individually timestamped since an entire portal export may take a while)
  • Ensure that GeoJSON files are exported, compressed (I don't think #43 needs to be completed first)
  • Ensure Shapefiles, KML/KMZ, and other "attach"-based files are also exported, compressed
  • Allow users to supply a token in export.socrata() to avoid throttling
  • Allow users to choose between CSV or JSON outputs. (CSVs are data.frame friendly, but JSON downloads faster from Socrata)
  • Allow users to choose between compression or uncompressed output
  • Support output to cloud services such as S3
  • Write unit test

An initial alpha is on this gist. This is now on the issue126 branch. Feedback is encouraged as I've done limited testing at this point.
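To make the shape of the function concrete, here is a hedged sketch of what export.socrata() might look like, based only on the behavior described above (ls.socrata() to list every dataset, then one download per distribution). Everything except ls.socrata() itself -- helper names, the catalog column access, the timestamp format -- is illustrative and not taken from the issue126 branch.

```r
# Build an output path: a directory named after the domain, a file named
# after the dataset identifier plus a per-file timestamp.
build_export_path <- function(domain, id, ext, ts = Sys.time()) {
  dir <- gsub("^https?://|/$", "", domain)     # e.g. "data.cityofchicago.org"
  stamp <- format(ts, "%Y-%m-%d_%H%M%S")
  file.path(dir, paste0(id, "_", stamp, ".", ext))
}

# Illustrative loop; app_token is reserved for throttling avoidance.
export_portal <- function(url, app_token = NULL) {
  catalog <- RSocrata::ls.socrata(url)
  for (i in seq_len(nrow(catalog))) {
    dist <- catalog$distribution[[i]]          # assumed list-column layout
    csv_url <- dist$downloadURL[dist$mediaType == "text/csv"][1]
    if (is.null(csv_url) || is.na(csv_url)) next  # non-CSV handled separately
    dest <- build_export_path(url, basename(catalog$identifier[i]), "csv")
    dir.create(dirname(dest), showWarnings = FALSE, recursive = TRUE)
    utils::download.file(csv_url, destfile = dest, quiet = TRUE)
  }
}
```

The path helper is the part most worth pinning down early, since filename collisions and OS filename limits are both called out later in this thread.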

@tomschenkjr tomschenkjr self-assigned this Mar 5, 2017

@joshmwolff joshmwolff commented Mar 7, 2017

Really glad to see this enhancement request. I've used this function combination before and it works well, so combining them makes sense. An interesting extension of this enhancement (and one perhaps not vulnerable to throttling) would be to enable RSocrata users to export a list of the column names associated with each dataset, rather than the datasets themselves. That would allow users to investigate which datasets share which fields.


@tomschenkjr tomschenkjr commented Mar 7, 2017

Thanks, @joshmwolff. It's an interesting idea, and I can see it being generally useful. But, as I think about it, would that be most useful if it retained all of the column names in memory instead of writing them to disk?


@joshmwolff joshmwolff commented Mar 7, 2017

@tomschenkjr: I think you're right: it would be inefficient to use the export.socrata() function to get a list of all column names for all datasets within a domain if that means downloading all datasets into memory instead of writing them to disk. That said, I'm not sure I'd necessarily want to download all datasets to disk if I were only interested in keeping column names and ultimately combining those names into a single data table. I see, however, that Socrata's Discovery API will in fact return column names for the datasets in a particular domain. Can RSocrata hit the Discovery API? That might enable the package to capture the column names without also having to download the column data itself.


@tomschenkjr tomschenkjr commented Mar 7, 2017

@joshmwolff - right now we're not using the Discovery API, but it's something we're planning for an upcoming release.

In either case, I think read.socrata() could be modified to only extract the column names and not the data. Whether that's through the same SoDA API calls or through the Discovery/Catalog API is probably more a technical/optimization question.

Would you mind opening a new issue on this? I think it's a worthwhile conversation to track as a separate feature than this one.


@tomschenkjr tomschenkjr commented Mar 15, 2017

Outlining some thoughts on unit tests for this function:

  • Count the number of files written to the directory and compare it to the count of data sets listed in ls.socrata().
  • Compare the file names written to disk to the names expected to be written (may need to be careful of certain OSes having filename length limitations).
  • Ensure the directory created aligns to the domain name being exported.
@tomschenkjr tomschenkjr modified the milestone: 1.8.0 Mar 15, 2017

@joshmwolff joshmwolff commented Mar 27, 2017

FWIW, I've been playing with Socrata's "Discovery API" and it works well as a means of creating a small dataframe of dataset names and other metadata. The following worked for me:

library(jsonlite)
API <- 'http://api.us.socrata.com/api/catalog/v1?only=datasets&domains=data.cambridgema.gov'
RawMetadata <- fromJSON(txt=API)
Metadata <- RawMetadata[['results']]
Metadata <- Metadata$resource

This avoids having to write any files to disk, as you're just storing a single small dataframe in memory (in Cambridge's case, 88 rows by 17 columns).


@tomschenkjr tomschenkjr commented Mar 28, 2017

Looks interesting, and we're definitely interested in incorporating the Discovery API (i.e., #114).

Would this fit in line with #128?


@joshmwolff joshmwolff commented Mar 30, 2017

Whoops. You're right: my comment belongs in #128 rather than here. Feel free to ignore.


@tomschenkjr tomschenkjr commented May 5, 2017

I've pushed the branch to the repo: issue126.

/cc @nicklucius @geneorama

nicklucius added a commit that referenced this issue May 6, 2017

@nicklucius nicklucius commented May 6, 2017

I've pushed my first stab at downloading non-tabular data files. Here's how it works: if the first download URL available is not a CSV, the httr::GET() contents are written straight to a file and read.socrata() is not called. The original filename and proper file extension are appended to the filename.

For geographical data with multiple download choices, it looks like KML is listed first, so that is what is saved.

Still to do:

  • compress the file - I had issues with this
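The non-tabular path described above can be sketched as follows. This is a hedged reading of the approach, not the branch code; the helper names (`url_ext`, `download_raw`) are illustrative. The only real APIs used are `httr::GET()`, `httr::content()`, and base `writeBin()`.

```r
# Derive a file extension from a URL, dropping any query string
# (e.g. ".../map.kml?accessType=DOWNLOAD" -> "kml").
url_ext <- function(url) {
  tools::file_ext(sub("\\?.*$", "", url))
}

# When the distribution is not a CSV, write the raw response body
# straight to disk instead of calling read.socrata().
download_raw <- function(url, dest_prefix) {
  response <- httr::GET(url)
  httr::stop_for_status(response)
  filename <- paste0(dest_prefix, ".", url_ext(url))
  writeBin(httr::content(response, as = "raw"), filename)
  filename
}
```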

@tomschenkjr tomschenkjr commented May 6, 2017

@nicklucius - that's great. I'll check off export for Shapefile/KML/KMZ for now. Looks like this will also work for GeoJSON, but I'll play with that for a bit to see how it works.

The compression is surprisingly tricky.


@tomschenkjr tomschenkjr commented May 6, 2017

Scratch the GeoJSON remark; that's certainly taken care of. I'll check that off, too.


@tomschenkjr tomschenkjr commented May 6, 2017

Getting an error when testing with the bulk files. Appears to be an error with the file name structure. Looking into it.

> export.socrata("https://data.cityofchicago.org/")
 Error in file(con, "wb") : cannot open the connection 
3. file(con, "wb") 
2. writeBin(response$content, filename) 
1. export.socrata("https://data.cityofchicago.org/") 
In addition: Warning message:
In file(con, "wb") :
  cannot open file 'data.cityofchicago.org/qixn-wjxu_2017-05-06_133501."Street Sweeping - 2017 -
Map.kml"': Invalid argument

@tomschenkjr tomschenkjr commented May 6, 2017

Ok - I've fixed the above error and also added the app_token parameter.

Later today, I can move this over to the dev branch to be part of the nightly build.


@tomschenkjr tomschenkjr commented May 6, 2017

As for unit testing, one option is to set up a valid but fake data.json file that is just a subset of data, placed at a file location we can control. Those files can point to actual data from the Chicago data portal.

This is a bit of work, so I don't necessarily like it, but it is one option.


@tomschenkjr tomschenkjr commented May 7, 2017

It's now on the dev branch.

@chrismetcalf chrismetcalf commented May 8, 2017

Hey guys, been on paternity leave (👶🍼💩) so I haven't had my head in GitHub enough, but @tomschenkjr was kind enough to alert me via email to what is going on.

Big question that'll help this be performant on Socrata portals - for the actual data dumping, are you using the export links (/api/views/$id/rows.$format) or the SODA2 APIs (/resource/$id.$format)?

The former will allow you to download the dataset export in one big file, and takes better advantage of caching where available. It should be faster for you.
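For readers less familiar with Socrata, the two URL shapes being contrasted can be spelled out for a hypothetical dataset id (the id below is illustrative, reusing the four-by-four from the traceback earlier in this thread):

```r
id <- "qixn-wjxu"  # illustrative four-by-four dataset id

# Bulk export link: one big cached file per dataset.
export_url <- sprintf("https://data.cityofchicago.org/api/views/%s/rows.csv", id)

# SODA2 resource endpoint: paged API access to the same data.
soda_url <- sprintf("https://data.cityofchicago.org/resource/%s.csv", id)
```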


@nicklucius nicklucius commented May 10, 2017

@chrismetcalf - Congrats! I can understand how GitHub might not be the first thing on your mind right now.

The export.socrata() function uses the export links that look like /api/views/$id/rows.$format to download data. Is this the faster method? If so, I wonder if that is something to keep in mind for read.socrata() development.

@chrismetcalf chrismetcalf commented May 12, 2017

Yes, @nicklucius, it looks like you're all good! The only other recommendation I would make would be to watch for ETag headers if you've got local caching, but if you're using the export links you're already taking advantage of our bulk export and server-side caching.


@geneorama geneorama commented May 15, 2017

Documenting two ideas from the analytics meeting today:

  1. We need to save off the ls.socrata() results so that we have the information about the data sets in the future.
  2. It would be good to separate the listing and downloading steps into their own functions for easier unit testing.

@James-SR James-SR commented Mar 21, 2018

Not sure if this error message should be a concern?

export.socrata("https://opendata.cheshireeast.gov.uk/")
Warning messages:
1: In no_deniro(result[[columnName]]) : NAs introduced by coercion
2: In no_deniro(result[[columnName]]) : NAs introduced by coercion
3: In no_deniro(result[[columnName]]) : NAs introduced by coercion

It appears to have extracted some files, but I'm not sure it is the complete set of public files.


@tomschenkjr tomschenkjr commented Mar 21, 2018

@James-SR - thanks. It's still a beta feature, so it's good to see use cases that produce warnings. As we develop it more, we will look at these warnings and handle them more elegantly.

The easiest way to check is whether the number of exported documents matches the number of entries at https://opendata.cheshireeast.gov.uk/data.json
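One way to run that completeness check, reusing the jsonlite approach shown earlier in this thread (the helper name is illustrative; the URL and directory name come from this exchange):

```r
# Count the datasets a portal's data.json advertises. Accepts either a URL
# or a JSON string, since jsonlite::fromJSON() handles both.
count_catalog_datasets <- function(data_json) {
  catalog <- jsonlite::fromJSON(data_json)
  nrow(catalog$dataset)
}

# expected <- count_catalog_datasets("https://opendata.cheshireeast.gov.uk/data.json")
# actual   <- length(list.files("opendata.cheshireeast.gov.uk"))
# expected == actual
```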


@tomschenkjr tomschenkjr commented Oct 28, 2018

@James-SR - The latest build of export.socrata has been pushed to the issue126 branch. I tested the Cheshire East portal and did not receive the error. This could be because we fixed it, or because the offending data set or element was changed on the portal itself.

Feel free to test this again and hope it is still useful.


@tomschenkjr tomschenkjr commented Oct 30, 2018

There is a quirk with some data listed in Socrata that I wanted to document. Socrata supports several HTML-based "non-data" formats in the data.json file, which we use to list all available data through ls.socrata(). First, Socrata has "Stories", which let users create HTML sites (example). In my recent commit, I've opted to simply skip these files.

Second, Socrata also supports "external data". Sometimes, these external data are links to HTML webpages while other times they link to actual data. For instance, this dataset is linked to a web page not hosted by Socrata.

Our function simply does not handle this well because HTML websites do not have a content disposition. I've made a change to skip links that do not have a content disposition, which ignores these kinds of sites. Sometimes external data is actually data (e.g., CSV), so those should still be downloaded. Other content, e.g., HTML, will be skipped.

The downside of these approaches is that they will ignore some information, which could cause confusion. The former scenario is easy to detect because the data.json does not list a distributionUrl for Stories. However, the latter case is more complicated because external data always displays a distributionUrl, even when it points to an HTML-based site.

This also makes it more difficult to write unit tests. For instance, this approach means the data.json file does not correspond one-for-one to the data actually downloaded. There isn't an indicator in data.json for what is ultimately downloaded.

I will think of ways to resolve these issues.


@tomschenkjr tomschenkjr commented Jan 5, 2020

I've resolved the issue by having the function follow these rules:

  • Download if it is a CSV (mediaType is text/csv)
  • Ignore if there is no downloadUrl (usually because there simply isn't data)
  • Ignore if it's an HTML file (mediaType is text/html)
  • Otherwise, download everything through a normal GET command.
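The rules above are simple enough to express as a small pure function, which also makes them unit-testable in isolation (per @geneorama's suggestion to separate steps). This is a sketch; the function and the action labels ("skip", "csv", "get") are illustrative, not names from the branch.

```r
# Decide how to handle one distribution, following the rules listed above.
choose_action <- function(mediaType, downloadURL) {
  if (is.null(downloadURL) || is.na(downloadURL)) return("skip")  # no data at all
  if (identical(mediaType, "text/html")) return("skip")           # Stories, external web pages
  if (identical(mediaType, "text/csv"))  return("csv")            # export as CSV
  "get"                                                           # raw GET for everything else
}
```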
tomschenkjr pushed a commit to tomschenkjr/RSocrata that referenced this issue Jan 5, 2020
  * Ignores HTML files (e.g., Socrata Pages)
  * Ignores occasions when there isn't any data
  * Will download (uncompressed) PDF, Word, Excel, PowerPoint, and plain text attachments.