Add function to recursively export all Socrata data to local directory #126

Open
tomschenkjr opened this Issue Mar 5, 2017 · 22 comments

tomschenkjr (Member) commented Mar 5, 2017

The ls.socrata() function supports listing all data on a data portal, while read.socrata() downloads individual datasets. Thus, the two can be combined under export.socrata() to download all of the files, neatly compress them, and place them in a single directory.

This sort of functionality can be used by people wishing to archive data portals or to help migrate from one platform to another. The function should focus on saving the data to local or cloud-based storage (e.g., S3) and should avoid loading all of the data into memory.

Solving #124 will allow an easier integration between ls.socrata() and read.socrata(), so RSocrata v1.7.2-7 or above will be required.

  • Output compressed CSVs into a directory named after the domain itself (e.g., data.cityofchicago.org/)
  • Timestamp each individual file (files should be individually timestamped since an entire portal export may take a while)
  • Ensure GeoJSON files are exported and compressed (I don't think #43 needs to be completed first)
  • Ensure Shapefiles, KML/KMZ, and other "attach"-based files are also exported and compressed
  • Allow users to supply a token in export.socrata() to avoid throttling
  • Allow users to choose between CSV and JSON outputs (CSVs are data.frame friendly, but JSON downloads faster from Socrata)
  • Allow users to choose between compressed and uncompressed output
  • Support output to cloud services such as S3
  • Write unit tests

An initial alpha is on this gist. This is now on the issue126 branch. Feedback is encouraged as I've done limited testing at this point.
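
For illustration, a minimal sketch of the core loop under discussion (this is not the gist or branch code; the use of the data.json distribution and identifier columns and the output naming scheme are assumptions):

library(RSocrata)

export.socrata <- function(url) {
  domain <- gsub("^https?://|/+$", "", url)        # e.g. "data.cityofchicago.org"
  dir.create(domain, showWarnings = FALSE)
  catalog <- ls.socrata(url)                       # data.json listing of every dataset
  for (i in seq_len(nrow(catalog))) {
    downloadUrl <- catalog$distribution[[i]]$downloadURL[1]  # assumed: first distribution is the CSV export
    dat <- read.socrata(downloadUrl)
    stamp <- format(Sys.time(), "%Y-%m-%d_%H%M%S") # each file gets its own timestamp
    path <- file.path(domain, paste0(basename(catalog$identifier[i]), "_", stamp, ".csv.gz"))
    con <- gzfile(path, "w")                       # compressed CSV output
    write.csv(dat, con, row.names = FALSE)
    close(con)
  }
}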

tomschenkjr self-assigned this Mar 5, 2017

joshmwolff commented Mar 7, 2017

Really glad to see this enhancement request. I've used this function combination before and it works well, so I'm glad to see that you're thinking of combining them. An interesting extension of this enhancement (and one perhaps not vulnerable to throttling) would be to enable RSocrata users to export a list of column names associated with each dataset (rather than the datasets themselves). That would allow users to investigate which datasets share which fields.

tomschenkjr (Member) commented Mar 7, 2017

Thanks, @joshmwolff. It's an interesting idea and I can see it being generally useful. But, as I think about it, would it be most useful to retain all of the columns in memory instead of writing them to disk?

joshmwolff commented Mar 7, 2017

@tomschenkjr: I think you're right: it would be inefficient to use the export.socrata() function to get a list of all column names for all datasets within a domain if that means downloading all datasets into memory instead of writing them to disk. That said, I'm not sure I'd necessarily want to download all datasets to disk if I were only interested in keeping column names and ultimately combining those names into a single data table. I see, however, that Socrata's Discovery API will in fact return dataset column names for datasets in a particular domain. Can RSocrata hit the Discovery API? That might enable the package to capture the column names without having to download the column data itself.

tomschenkjr (Member) commented Mar 7, 2017

@joshmwolff - right now we're not using the Discovery API, but it's something we're planning for an upcoming release.

In either case, I think read.socrata() could be modified to extract only the column names and not the data. Whether that's done through the same SoDA API calls or through the Discovery/Catalog API is probably more of a technical/optimization question.

Would you mind opening a new issue on this? I think it's a worthwhile conversation to track as a separate feature from this one.

tomschenkjr (Member) commented Mar 15, 2017

Outlining some thoughts on unit tests for this function:

  • Count the number of files written to the directory and compare it to the count of data sets listed in ls.socrata().
  • Compare the file names written to disk to the names expected to be written (may need to be careful of certain OSes having filename length limitations).
  • Ensure the directory created aligns to the domain name being exported.
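
A hedged sketch of how the first and third checks above might look with testthat; exporting a full portal inside a test would be slow, so in practice a small test portal or a mocked data.json would be needed (the portal URL below is only a placeholder):

library(testthat)
library(RSocrata)

test_that("export.socrata writes one file per dataset into a domain-named directory", {
  url <- "https://data.cityofchicago.org"                 # placeholder test portal
  catalog <- ls.socrata(url)
  export.socrata(url)
  outdir <- "data.cityofchicago.org"
  expect_true(dir.exists(outdir))                         # directory named after the domain
  expect_equal(length(list.files(outdir)), nrow(catalog)) # one exported file per listed dataset
})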

tomschenkjr modified the milestone: 1.8.0 Mar 15, 2017

joshmwolff commented Mar 27, 2017

FWIW, I've been playing with Socrata's "Discovery API" and it works well as a means for creating a small dataframe of dataset names and other metadata. The following worked for me:

library(jsonlite)
API <- 'http://api.us.socrata.com/api/catalog/v1?only=datasets&domains=data.cambridgema.gov'
RawMetadata <- fromJSON(txt=API)
Metadata <- RawMetadata[['results']]
Metadata <- Metadata$resource

This avoids having to write any files to disk, as you're just storing a single small dataframe in memory (in Cambridge's case, 88 rows by 17 columns).
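
If the goal from earlier in the thread is just column names, the same result object appears to carry them as well; the columns_name and name fields below are assumptions based on the current Discovery API response shape:

# One character vector of column names per dataset, named by dataset title
ColumnNames <- setNames(Metadata$columns_name, Metadata$name)
head(ColumnNames, 2)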

tomschenkjr (Member) commented Mar 28, 2017

This looks interesting, and we're definitely interested in incorporating the Discovery API (i.e., #114).

Would this fit in line with #128?

joshmwolff commented Mar 30, 2017

Whoops. You're right: my comment belongs in #128 rather than here. Feel free to ignore.

tomschenkjr (Member) commented May 5, 2017

I've pushed the branch to the repo: issue126.

/cc @nicklucius @geneorama

nicklucius added a commit that referenced this issue May 6, 2017

nicklucius (Contributor) commented May 6, 2017

I've pushed my first stab at downloading non-tabular data files. Here's how it works: if the first download URL available is not a CSV, the httr::GET() contents are written straight to a file and read.socrata() is not called. The original filename and the proper file extension are appended to the filename.

For geographical data with multiple download choices, it looks like KML is first so that is what is being saved.

Still to do:

  • compress the file - I had issues with this
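
For illustration, a rough sketch of that non-CSV path (the helper name and the way the extension is derived are assumptions, not the actual branch code):

library(httr)

# If the first downloadURL is not a CSV, stream the response straight to disk
# rather than parsing it with read.socrata().
save_raw_export <- function(downloadUrl, outdir) {            # hypothetical helper
  response <- GET(downloadUrl)
  stop_for_status(response)
  ext <- tools::file_ext(sub("\\?.*$", "", downloadUrl))      # e.g. "kml"
  stamp <- format(Sys.time(), "%Y-%m-%d_%H%M%S")
  filename <- file.path(outdir, paste0(stamp, ".", ext))
  writeBin(content(response, as = "raw"), filename)           # raw bytes, no parsing
  filename
}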
tomschenkjr (Member) commented May 6, 2017

@nicklucius - that's great. I'll check off export for Shapefile/KML/KMZ for now. It looks like this will also work for geojson, but I'll play with that for a bit to see how it works.

The compression is surprisingly tricky.

tomschenkjr (Member) commented May 6, 2017

Scratch the geojson remark, that's certainly taken care of. I'll check that off, too.

tomschenkjr (Member) commented May 6, 2017

Getting an error when testing with the bulk files. Appears to be an error with the file name structure. Looking into it.

> export.socrata("https://data.cityofchicago.org/")
Error in file(con, "wb") : cannot open the connection
3. file(con, "wb")
2. writeBin(response$content, filename)
1. export.socrata("https://data.cityofchicago.org/")
In addition: Warning message:
In file(con, "wb") :
  cannot open file 'data.cityofchicago.org/qixn-wjxu_2017-05-06_133501."Street Sweeping - 2017 - Map.kml"': Invalid argument
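
The failure appears to come from the embedded quotation marks in the constructed filename. One plausible fix is to strip characters that are invalid in Windows filenames before calling writeBin(); this is only a sketch, not necessarily the fix applied on the branch:

# Replace characters that Windows does not allow in filenames: < > : " / \ | ? *
sanitize_filename <- function(x) gsub('[<>:"/\\\\|?*]', "_", x)

sanitize_filename('"Street Sweeping - 2017 - Map.kml"')
# [1] "_Street Sweeping - 2017 - Map.kml_"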
tomschenkjr (Member) commented May 6, 2017

Ok - I've fixed the above error and also added the app_token parameter.

Later today, I can move this over to the dev branch to be part of the nightly build.
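
Usage would presumably mirror read.socrata()'s existing app_token argument; whether export.socrata() exposes it under the same name is an assumption here:

# Pass a Socrata application token to avoid throttling on large exports
export.socrata("https://data.cityofchicago.org/", app_token = "YOUR_APP_TOKEN")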

tomschenkjr (Member) commented May 6, 2017

As for unit testing, one option is to set up a valid but fake data.json file that is just a subset of data and is placed at a file location we can control. Those files can point to actual data from the Chicago data portal.

This is a bit of work, so I don't necessarily like it, but it is one option.

tomschenkjr (Member) commented May 7, 2017

It's now on the dev branch.

chrismetcalf commented May 8, 2017

Hey guys, been on paternity leave (👶🍼💩) so I haven't had my head in GitHub enough, but @tomschenkjr was kind enough to alert me via email to what is going on.

Big question that'll help this be performant on Socrata portals - for the actual data dumping, are you using the export links (/api/views/$id/rows.$format) or the SODA2 APIs (/resource/$id.$format)?

The former will allow you to download the dataset export in one big file, and takes better advantage of caching where available. It should be faster for you.

nicklucius (Contributor) commented May 10, 2017

@chrismetcalf - Congrats! I can understand how GitHub might not be the first thing on your mind right now.

The export.socrata() function uses the export links that look like /api/views/$id/rows.$format to download data. Is this the faster method? If so, I wonder if that is something to keep in mind for read.socrata() development.

chrismetcalf commented May 12, 2017

Yes, @nicklucius, it looks like you're all good! The only other recommendation I would make would be to watch for eTag headers if you've got local caching, but if you're using the export links you're already taking advantage of our bulk export and server-side caching.
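
For reference, a minimal sketch of an ETag-based conditional request with httr, in case local caching is added later (the export link below is only an example):

library(httr)

url <- "https://data.cityofchicago.org/api/views/qixn-wjxu/rows.kml"  # example export link
first <- GET(url)
etag <- headers(first)[["etag"]]                       # cache validator returned by the server

# Ask the server whether the cached copy is still current
second <- GET(url, add_headers(`If-None-Match` = etag))
if (status_code(second) == 304L) message("Not modified; reuse the locally cached file")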

geneorama (Member) commented May 15, 2017

Documenting two ideas from the analytics meeting today:

  1. We need to save off the ls.socrata() results so that we have the information about the data sets in the future.
  2. I think it would be good to separate the two steps into their own functions for easier unit testing.
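
On the first point, a minimal sketch of saving the ls.socrata() catalog alongside the exported files (the output location and file format are assumptions):

catalog <- ls.socrata("https://data.cityofchicago.org/")
stamp <- format(Sys.time(), "%Y-%m-%d_%H%M%S")
saveRDS(catalog, file.path("data.cityofchicago.org", paste0("catalog_", stamp, ".rds")))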
James-SR commented Mar 21, 2018

Not sure if these warning messages should be a concern?

export.socrata("https://opendata.cheshireeast.gov.uk/")
Warning messages:
1: In no_deniro(result[[columnName]]) : NAs introduced by coercion
2: In no_deniro(result[[columnName]]) : NAs introduced by coercion
3: In no_deniro(result[[columnName]]) : NAs introduced by coercion

It appears to have extracted some files, but I'm not sure it is the complete set of public files.

tomschenkjr (Member) commented Mar 21, 2018

@James-SR - thanks. It's still a beta feature, so it's good to see some use cases that produce warnings. When we develop it more, we will look at these warnings and handle them more elegantly.

The easiest way to check is to see whether the number of exported documents is the same as the number of entries at https://opendata.cheshireeast.gov.uk/data.json
