First some boilerplate for extracting the metadata we start with, taken from the previous notebook.

In [1]:
import json

with open("../auth/nyc-open-data.json", "r") as f:
    nyc_auth = json.load(f)
    
import pysocrata
nyc_datasets = pysocrata.get_datasets(**nyc_auth)

nyc_datasets = [d for d in nyc_datasets if d['resource']['type'] != 'story']

# Map Socrata's resource types onto the vocabulary used in these notebooks.
nyc_types = [d['resource']['type'] for d in nyc_datasets]
vocab_map = {'dataset': 'table', 'href': 'link', 'map': 'geospatial dataset', 'file': 'blob'}
nyc_types = [vocab_map[t] for t in nyc_types]

nyc_endpoints = [d['resource']['id'] for d in nyc_datasets]

Here are the data formats that we're working with:

In [2]:
import pandas as pd

pd.Series(nyc_types).value_counts()

table                 1132
link                   182
geospatial dataset     165
blob                    98
dtype: int64

Tables are easiest: we have access to a nice client, `sodapy`, which wraps the Socrata API features designed for accessing their contents. So let's start by playing around with `sodapy` and validating that we can get what we want.

In [3]:
import numpy as np
table_indices = np.nonzero([t == 'table' for t in nyc_types])
table_endpoints = np.array(nyc_endpoints)[table_indices]

In [5]:
nyc_datasets[0]

{'classification': {'categories': [],
  'domain_category': 'Business',
  'domain_metadata': [{'key': 'Update_Automation', 'value': 'Yes'},
   {'key': 'Update_Update-Frequency', 'value': 'As needed'},
   {'key': 'Dataset-Information_Agency',
    'value': 'Department of Information Technology & Telecommunications (DoITT)'}],
  'domain_tags': ['webpage',
   'registration',
   '.nyc',
   'domain',
   'internet',
   'web',
   'site',
   'website',
   'page'],
  'tags': []},
 'link': 'https://data.cityofnewyork.us/Business/-nyc-Domain-Registrations/9cw8-7heb',
 'metadata': {'domain': 'data.cityofnewyork.us'},
 'permalink': 'https://data.cityofnewyork.us/d/9cw8-7heb',
 'resource': {'attribution': 'Department of Information Technology & Telecommunications (DoITT)',
  'columns_description': ['', '', ''],
  'columns_field_name': ['nexus_category',
   'domain_registration_date',
   'domain_name'],
  'columns_name': ['Nexus Category',
   'Domain Registration Date ',
   'Domain Name '],
  'createdA

The metadata, shown above, does not show the size of the dataset as it would be when downloaded.

The most straightforward way of getting this information would be to send a `HEAD` request and read the `Content-Length` returned in the header. But the server doesn't accept `HEAD` requests...

In [8]:
ex_table_endpoint = table_endpoints[0]

In [9]:
import requests

In [15]:
ex_head = requests.head('https://data.cityofnewyork.us/api/views/szkz-syh6/rows.csv?accessType=DOWNLOAD')

In [16]:
ex_head.headers

{'Cache-Control': 'private, no-cache, must-revalidate', 'X-Socrata-Region': 'aws-us-east-1-fedramp-prod', 'Age': '0', 'X-Error-Message': 'HEAD is not supported', 'Connection': 'keep-alive', 'Access-Control-Allow-Origin': '*', 'X-Socrata-RequestId': '12dk7s9gwfgvp9pir1dyt0dab', 'Server': 'nginx', 'X-Error-Code': 'invalid_request', 'Date': 'Sat, 21 Jan 2017 23:12:38 GMT'}

...and `GET` responses are `gzip`-compressed and sent with chunked transfer encoding, so they carry no `Content-Length` header to read the size from, either.

In [13]:
ex_head_2 = requests.get('https://data.cityofnewyork.us/api/views/szkz-syh6/rows.csv?accessType=DOWNLOAD')

In [14]:
ex_head_2.headers

{'Cache-Control': 'public, must-revalidate, max-age=21600', 'Transfer-Encoding': 'chunked', 'X-Socrata-RequestId': 'c0p415wydc7skze85vz8py1tt', 'Last-Modified': 'Sat, 21 Jan 2017 14:33:47 UTC', 'X-Socrata-Region': 'aws-us-east-1-fedramp-prod', 'Content-Type': 'text/csv; charset=utf-8', 'Content-disposition': 'attachment; filename=Prequalified_Firms.csv', 'Content-Encoding': 'gzip', 'Age': '0', 'Connection': 'keep-alive', 'Access-Control-Allow-Origin': '*', 'Server': 'nginx', 'Date': 'Sat, 21 Jan 2017 23:12:12 GMT'}
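
The header checks above boil down to one question: does the response declare its size? Here is a minimal sketch of that check. `declared_size` is a hypothetical helper (not part of `requests`), and it accepts any mapping of header names to values, such as `response.headers`.

```python
def declared_size(headers):
    """Return the byte size a response declares in its headers, or None.

    Hypothetical helper for illustration. `headers` is any mapping of
    header names to string values.
    """
    # Header names are case-insensitive, so normalise them before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}

    # A chunked response is sent piece by piece with no total length up front.
    if 'chunked' in lowered.get('transfer-encoding', '').lower():
        return None

    length = lowered.get('content-length', '').strip()
    # Note: when Content-Encoding is gzip, this would be the compressed size,
    # not the size of the file once it is written to disk.
    return int(length) if length.isdigit() else None
```

Applied to the `GET` headers shown above, this returns `None`, because the response is sent with `Transfer-Encoding: chunked`.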

Is this also true of the three other data types?

In [18]:
# Geospatial
# https://nycopendata.socrata.com/Transportation/Subway-Entrances/drex-xx56
requests.head('https://nycopendata.socrata.com/api/geospatial/drex-xx56?method=export&format=Shapefile').headers

{'Cache-Control': 'private, no-cache, must-revalidate', 'X-Socrata-Region': 'aws-us-east-1-fedramp-prod', 'Age': '0', 'X-Error-Message': 'HEAD not allowed', 'Connection': 'keep-alive', 'Access-Control-Allow-Origin': '*', 'X-Socrata-RequestId': '59apfqsvykxwgqn7gaws6dsue', 'Server': 'nginx', 'X-Error-Code': 'method_not_allowed', 'Date': 'Sat, 21 Jan 2017 23:20:37 GMT'}

In [19]:
requests.get('https://nycopendata.socrata.com/api/geospatial/drex-xx56?method=export&format=Shapefile').headers

{'Transfer-Encoding': 'chunked', 'X-Socrata-Region': 'aws-us-east-1-fedramp-prod', 'Age': '0', 'Connection': 'keep-alive', 'Access-Control-Allow-Origin': '*', 'X-Socrata-RequestId': '4dp1j7urabazn7h7xn63u53xj', 'Server': 'nginx', 'Content-Disposition': 'attachment; filename="Subway Entrances.zip"', 'Content-Type': 'application/zip', 'Date': 'Sat, 21 Jan 2017 23:20:44 GMT'}

In [20]:
# Blob
# https://data.cityofnewyork.us/dataset/Broadband-Data-Dig-Datasets/ft4n-yqee
requests.head('https://data.cityofnewyork.us/api/file_data/3d0f7600-f88a-4a11-8ad9-707c785caa08?filename=Broadband%20Data%20Dig%20-%20Datasets.zip').headers

{'Cache-Control': 'private, no-cache, must-revalidate', 'X-Socrata-Region': 'aws-us-east-1-fedramp-prod', 'Age': '0', 'X-Error-Message': 'HEAD is not supported', 'Connection': 'keep-alive', 'Access-Control-Allow-Origin': '*', 'X-Socrata-RequestId': '4bzuzuuuyvwkvlf0i3o8fy5ak', 'Server': 'nginx', 'X-Error-Code': 'invalid_request', 'Date': 'Sat, 21 Jan 2017 23:23:46 GMT'}

In [22]:
# requests.get('https://data.cityofnewyork.us/api/file_data/3d0f7600-f88a-4a11-8ad9-707c785caa08?filename=Broadband%20Data%20Dig%20-%20Datasets.zip').headers

OK, so there's no way to get the size of a file until it has been completely downloaded: `HEAD` requests are rejected, and the `GET` responses we inspected are sent with `Transfer-Encoding: chunked` and no `Content-Length` header.
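
If the size is only knowable after a full download, one fallback is to stream the response and count bytes as they arrive, without holding the whole file in memory. The following is a rough sketch under that assumption. `count_bytes` and `download_size` are hypothetical names, and the chunk size and timeout are arbitrary choices.

```python
def count_bytes(chunks):
    """Sum the lengths of an iterable of byte chunks."""
    total = 0
    for chunk in chunks:
        total += len(chunk)
    return total


def download_size(url, chunk_size=64 * 1024, timeout=60):
    """Stream `url` and return how many bytes the body contains.

    requests decodes gzip transparently in iter_content, so this counts
    the decompressed size, i.e. roughly what would land on disk.
    """
    import requests  # imported here so count_bytes stays dependency-free

    with requests.get(url, stream=True, timeout=timeout) as response:
        response.raise_for_status()
        return count_bytes(response.iter_content(chunk_size=chunk_size))
```

This still transfers every byte, so it only saves memory, not bandwidth or time.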

We can get the number of rows and columns for tables through the API, but not for datasets of the other types. Geospatial datasets should at least be relatively straightforward to read in. That leaves links and blobs. Let's start with links.
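
Before turning to links, here is a rough sketch of the table case. The column count is read from the `columns_field_name` list in the catalog metadata shown earlier. The row count is requested with a SoQL `count(*)` query through `sodapy`. The helper names are hypothetical, and the exact key under which the count comes back is an assumption, so the parser takes the single value regardless of its name.

```python
def column_count(dataset):
    """Number of columns, read from catalog metadata we already hold."""
    return len(dataset['resource']['columns_field_name'])


def parse_row_count(rows):
    """Extract the integer from a one-row, one-field count(*) response."""
    # Assumption: the response is a list holding a single dict with a
    # single value. Unpacking raises ValueError if that does not hold.
    (record,) = rows
    (value,) = record.values()
    return int(value)


def row_count(endpoint, domain='data.cityofnewyork.us', app_token=None):
    """Ask the Socrata API how many rows a table endpoint has."""
    from sodapy import Socrata  # imported lazily; only needed for live calls

    client = Socrata(domain, app_token)
    try:
        return parse_row_count(client.get(endpoint, select='count(*)'))
    finally:
        client.close()
```

For example, `column_count(nyc_datasets[0])` gives 3 for the `.nyc` domain registrations record printed above, and `row_count(table_endpoints[0])` would issue one small query per table rather than downloading it.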

In [23]:
link_indices = np.nonzero([t == 'link' for t in nyc_types])
link_endpoints = np.array(nyc_endpoints)[link_indices]

In [24]:
link_endpoints

array(['79me-a7rs', 'fx7a-24mf', 'psde-rqze', 'dte3-kvx7', '9p9k-tusd',
       'hfa3-euj3', 'pnij-y7y6', 'yupw-u2ax', '9dux-uz3w', 'd9fg-z42k',
       'c5dk-m6ea', 'qk6i-zcht', 'tnru-abg2', 'ware-id4f', 'v7f4-yzyg',
       'dnjp-mkjx', 'j8nm-zs7q', 'ezds-sqp6', 'qiwj-i2jk', 'xswq-wnv9',
       'wha9-m3tq', 'qf28-yqqv', 'spax-mybh', '8d5p-rji6', 'pnru-8qsf',
       'bqbs-iwyn', 'f7ta-5e24', '824w-7c8u', '2p3a-y7d4', 'krxp-x4za',
       '64vf-hxyb', 'rn6h-i66u', 'umu5-zyd3', 'pf9y-ef2p', 'hxay-3qcw',
       '9p99-55bh', 'pr5n-ucgi', 'ivb7-t7a7', 'pqb7-6q2k', 'n7nh-rhic',
       's65f-sqe8', '8k4a-z83b', 'nx9f-wn3a', 'eweh-h793', 'm7f5-x3k4',
       's8jv-f44n', 'ie6s-t87j', 'vdkk-sqws', 'aumr-wgtk', 'egch-abu9',
       '5crx-5ivw', 'ud5r-z5ws', 'tpe4-3w5y', '4v4n-gnh2', 'p84r-8kqf',
       '3gx8-vrcy', 'hz79-96hi', 'sngu-yqq8', 'xi5z-cgq7', 't22b-cmty',
       'p94q-8hxh', 'mpmk-b5ed', 'vsnr-94wk', 'quix-kfbk', '9jqw-r2a4',
       'sah3-jw2y', 'epfh-qbp5', 'hc9t-g6wa', 'vghm-gmwr', 'fbqm

In [29]:
nyc_datasets[link_indices[0][0]]

{'classification': {'categories': [],
  'domain_category': 'Recreation',
  'domain_metadata': [{'key': 'Update_Automation', 'value': 'No'},
   {'key': 'Update_Update-Frequency', 'value': 'As needed'},
   {'key': 'Dataset-Information_Agency',
    'value': 'Department of Parks and Recreation (DPR)'}],
  'domain_tags': ['parks and recreation', 'recreation', 'parks', 'dpr'],
  'tags': []},
 'link': 'https://data.cityofnewyork.us/Recreation/Directory-of-Parks/79me-a7rs',
 'metadata': {'domain': 'data.cityofnewyork.us'},
 'permalink': 'https://data.cityofnewyork.us/d/79me-a7rs',
 'resource': {'attribution': 'Department of Parks and Recreation (DPR)',
  'columns_description': [],
  'columns_field_name': [],
  'columns_name': [],
  'createdAt': '2016-10-06T01:18:57.000Z',
  'description': '',
  'download_count': 0,
  'id': '79me-a7rs',
  'name': 'Directory of Parks',
  'page_views': {'page_views_last_month': 2,
   'page_views_last_month_log': 1.5849625007211563,
   'page_views_last_week': 1,
 

TODO: Figure out how to work with links and blobs.