Packaging multiple networks #51

Closed
duncandewhurst opened this issue Aug 18, 2022 · 9 comments

@duncandewhurst
Collaborator

From the data stewardship, publication formats and access methods consultation document:

The standard should provide a standardised bulk download format for packaging multiple networks.

When designing the format, we'll need to consider streaming. See open-contracting/standard#1084 for a related discussion. We'll also need to consider packaging for the CSV and GeoJSON formats.

@duncandewhurst duncandewhurst added the Schema, CSV format and GeoJSON format labels Aug 18, 2022
@duncandewhurst duncandewhurst added this to the Alpha milestone Aug 18, 2022
@duncandewhurst
Collaborator Author

#13 has some examples of networks with large numbers of links that might require streaming.

@duncandewhurst
Collaborator Author

For GeoJSON, I don't think we need to worry about packaging. As long as the network identifier is included in each feature's properties, a single GeoJSON file could contain nodes or links from multiple networks.

@duncandewhurst
Collaborator Author

duncandewhurst commented Aug 30, 2022

On reflection, I think a bigger issue is API design and streaming support for links and nodes, since each dataset is likely to contain a small number of potentially large networks, rather than a large number of small networks.

Edit: See #75 for a proposal on streaming/paginating individual networks.

@duncandewhurst
Collaborator Author

@kindly, @lgs85 it would be great to get your thoughts on the proposal below.

On reflection, I think a bigger issue is API design and streaming support for links and nodes, since each dataset is likely to contain a small number of potentially large networks, rather than a large number of small networks.

Whilst this is true, the standard still needs to specify how to package multiple networks, to avoid a situation in which publishers mint a variety of packaging formats, which would make authoring tools that consume OFDS data difficult.

Proposal

Based on the discussion in open-contracting/standard#1084, offer two packaging formats each for the JSON and GeoJSON publication formats:

  • A small file and API response format for files that are small enough to fit into memory or are published via API.
  • A bulk download format for files that are too large to fit into memory.

The approach to packaging multiple networks in CSV format will depend on the tool chosen in #14.

JSON

Small files and API responses

A top-level JSON object with an array of Network objects in .networks and, for data published via API, a pages object based on the pagination approach from OCDS. Note that it is named pages to avoid a clash with links.

{
  "networks": [
    {...},
    {...}
  ],
  "pages": {
    "next": "",
    "prev": ""
  }
}

The preferred approach is to publish embedded nodes and links. For networks that are too large to return in a single API response, .relatedResources should be used to provide links to separate endpoints for nodes and links, which must return a top-level JSON object with a nodes or a links array, respectively:

{
  "nodes": [
    {...},
    {...},
    {...}
  ],
  "pages": {
    "next": "",
    "prev": ""
  }
}

{
  "links": [
    {...},
    {...},
    {...}
  ],
  "pages": {
    "next": "",
    "prev": ""
  }
}
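
As a rough illustration of how a consumer might walk these paginated responses, here is a minimal sketch. It is not part of the proposal: the `iter_items` helper, the `fetch` callable and the example URLs are all hypothetical, and it assumes that an empty or missing `pages.next` marks the last page.

```python
from typing import Callable, Iterator


def iter_items(first_url: str, key: str, fetch: Callable[[str], dict]) -> Iterator[dict]:
    """Yield every item under `key` ("networks", "nodes" or "links"),
    following pages.next from one response to the next."""
    url = first_url
    while url:
        page = fetch(url)
        yield from page.get(key, [])
        # Assumption: an empty or absent "next" means there are no more pages.
        url = page.get("pages", {}).get("next") or None


# Hypothetical in-memory stand-in for an HTTP API, so the sketch runs offline.
FAKE_API = {
    "https://example.com/nodes?page=1": {
        "nodes": [{"id": "n1"}, {"id": "n2"}],
        "pages": {"next": "https://example.com/nodes?page=2", "prev": ""},
    },
    "https://example.com/nodes?page=2": {
        "nodes": [{"id": "n3"}],
        "pages": {"next": "", "prev": "https://example.com/nodes?page=1"},
    },
}

node_ids = [
    node["id"]
    for node in iter_items("https://example.com/nodes?page=1", "nodes", FAKE_API.__getitem__)
]
```

In real use, `fetch` would wrap an HTTP client call that returns the decoded JSON body. The same helper works for `networks` and `links` because all three responses share the `pages` object.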

Bulk downloads

A JSON Lines file with one network per line:

{...}
{...}
{...}

The preferred approach is to publish embedded nodes and links. If an individual network is too large to load into memory, .relatedResources should be used to provide links to separate bulk downloads for nodes and links, which must be formatted as JSON Lines files with one node or link per line, respectively.
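
To show why JSON Lines suits streaming, here is a minimal sketch of a reader that holds only one network in memory at a time. The `iter_json_lines` name and the sample records are illustrative assumptions, not part of the standard.

```python
import io
import json
from typing import IO, Iterator


def iter_json_lines(stream: IO[str]) -> Iterator[dict]:
    """Yield one parsed JSON object per non-blank line of the stream."""
    for line in stream:
        line = line.strip()
        if line:  # tolerate blank lines, e.g. a trailing newline
            yield json.loads(line)


# Illustrative bulk download with one (heavily abbreviated) network per line.
sample = io.StringIO(
    '{"id": "net-a", "nodes": [], "links": []}\n'
    '{"id": "net-b", "nodes": [{"id": "n1"}], "links": []}\n'
)

network_ids = [network["id"] for network in iter_json_lines(sample)]
```

The same reader would apply to the separate node and link bulk downloads, since they are also one JSON object per line.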

GeoJSON

Small files and API responses

Publish separate files/endpoints for nodes and links, each structured as a top-level FeatureCollection object according to the GeoJSON transformation specification. Each file may contain features from multiple networks. The network each feature relates to is identified by its .properties.network.id.

For data published via API, add a top-level pages object based on the pagination approach from OCDS:

{
  "type": "FeatureCollection",
  "features": [
    {...},
    {...}
  ],
  "pages": {
    "next": "",
    "prev": ""
  }
}
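
Because a single file may mix features from several networks, a consumer would typically regroup them. A minimal sketch, assuming every feature carries `.properties.network.id` as described above (the `group_by_network` helper and the sample features are hypothetical):

```python
from collections import defaultdict


def group_by_network(feature_collection: dict) -> dict:
    """Group GeoJSON features by the network identified in their properties."""
    groups = defaultdict(list)
    for feature in feature_collection.get("features", []):
        network_id = feature["properties"]["network"]["id"]
        groups[network_id].append(feature)
    return dict(groups)


# Illustrative node features from two networks in one FeatureCollection.
collection = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "geometry": {"type": "Point", "coordinates": [0.0, 0.0]},
         "properties": {"id": "n1", "network": {"id": "net-a"}}},
        {"type": "Feature", "geometry": {"type": "Point", "coordinates": [1.0, 1.0]},
         "properties": {"id": "n2", "network": {"id": "net-b"}}},
        {"type": "Feature", "geometry": {"type": "Point", "coordinates": [2.0, 2.0]},
         "properties": {"id": "n3", "network": {"id": "net-a"}}},
    ],
}

grouped = group_by_network(collection)
```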

Bulk downloads

Separate newline-delimited GeoJSON files for nodes and links, with one feature per line structured according to the GeoJSON transformation specification:

{...}
{...}
{...}
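
For illustration, a publisher could derive the bulk format from an existing FeatureCollection by writing each feature on its own line. This is only a sketch: the `to_ndgeojson` helper and the sample link features are hypothetical.

```python
import json


def to_ndgeojson(feature_collection: dict) -> str:
    """Serialise each feature as compact JSON on its own line."""
    return "".join(
        json.dumps(feature, separators=(",", ":")) + "\n"
        for feature in feature_collection.get("features", [])
    )


# Illustrative link features from two different networks.
links = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature",
         "geometry": {"type": "LineString", "coordinates": [[0.0, 0.0], [1.0, 1.0]]},
         "properties": {"id": "l1", "network": {"id": "net-a"}}},
        {"type": "Feature",
         "geometry": {"type": "LineString", "coordinates": [[1.0, 1.0], [2.0, 2.0]]},
         "properties": {"id": "l2", "network": {"id": "net-b"}}},
    ],
}

ndgeojson_text = to_ndgeojson(links)
```

Compact separators keep each feature on a single physical line, which is what lets consumers process the file one feature at a time.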

Other approaches considered

JSON

Small files and API responses

Do not support packaging multiple networks. Instead, publish networks one at a time, i.e. publish a JSON file for each network containing a top-level Network object. For data published via API, use .relatedResources (see #75) to provide links to the next and previous networks in the series:

{
  "relatedResources": [
    {
      "href": "",
      "rel": "next"
    },
    {
      "href": "",
      "rel": "prev"
    }
  ]
}
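
Under this alternative, a consumer would look up the `next` entry on each network to find the following one. A minimal sketch of that lookup, where the `related_href` helper and the example URL are hypothetical and an empty `href` is assumed to mean "no such resource":

```python
from typing import Optional


def related_href(network: dict, rel: str) -> Optional[str]:
    """Return the href of the first related resource with the given rel, if any."""
    for resource in network.get("relatedResources", []):
        if resource.get("rel") == rel and resource.get("href"):
            return resource["href"]
    return None


# Illustrative network that has a successor but no predecessor.
network = {
    "id": "net-a",
    "relatedResources": [
        {"href": "https://example.com/networks/net-b", "rel": "next"},
        {"href": "", "rel": "prev"},
    ],
}

next_url = related_href(network, "next")
prev_url = related_href(network, "prev")
```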

As in the proposal, the preferred approach is to publish embedded nodes and links. For networks that are too large to return in a single API response, .relatedResources should be used to provide links to separate endpoints for nodes and links, which must return a top-level JSON object with a nodes or a links array, respectively.

Pros:

  • Simpler for publishers that only need to publish one network

Cons:

  • Greater number of API calls required to get all the data
  • Inconsistency between the format of the data returned by endpoints for networks and endpoints for nodes or links.

Bulk downloads

A ZIP or GZIP file containing a JSON file for each network.

As in the proposal, the preferred approach is to publish embedded nodes and links. For networks that are too large to load into memory, .relatedResources should be used to provide links to separate bulk downloads for nodes and links, which must be formatted as JSON Lines files with one node or link per line, respectively. The reason for choosing JSON Lines over ZIP/GZIP for bulk downloads of nodes and links is that networks can contain upwards of 100,000 links so the ZIP/GZIP file could expand into upwards of 100,000 files.

Cons:

  • Inconsistency between the format of the bulk download for networks and the format of the bulk downloads for links and nodes.

@duncandewhurst duncandewhurst changed the title from "Packaging format" to "Packaging multiple networks" Sep 9, 2022
@kindly

kindly commented Sep 9, 2022

This looks fine to me. The pages approach in the GeoJSON format looks odd and may confuse geo users who expect to get all the data in one go; if they do not, they may not be able to traverse the links. However, I see no real harm in it.

@duncandewhurst
Collaborator Author

Thanks, @kindly. The pages key in the GeoJSON format is only for data that needs paginating. If the data is small enough, it can be served whole. We can make that clear in the documentation and guidance.

In case it's of interest, ArcGIS uses pagination to serve GeoJSON data (example). If the data is greater than one page, no link to the next page is provided. Instead properties.exceededTransferLimit is set to True and the user needs to construct the next link using the resultOffset URL parameter and (presumably) some knowledge of what the transfer limit is / how many pages are returned per query.
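
To make the contrast with an explicit `pages.next` link concrete, here is a rough sketch of the URL construction that the ArcGIS approach leaves to the client. It assumes the page size is known in advance (the `page_size` argument), and the `next_page_url` helper and example URL are hypothetical.

```python
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit


def next_page_url(current_url: str, response: dict, page_size: int) -> Optional[str]:
    """Build the next request URL by advancing resultOffset, or return None
    when the response does not report an exceeded transfer limit."""
    if not response.get("properties", {}).get("exceededTransferLimit"):
        return None
    parts = urlsplit(current_url)
    query = parse_qs(parts.query, keep_blank_values=True)
    # Assumption: a missing resultOffset means the first page (offset 0).
    offset = int(query.get("resultOffset", ["0"])[0])
    query["resultOffset"] = [str(offset + page_size)]
    return urlunsplit(parts._replace(query=urlencode(query, doseq=True)))


more = next_page_url(
    "https://example.com/query?f=geojson&resultOffset=100",
    {"type": "FeatureCollection", "features": [], "properties": {"exceededTransferLimit": True}},
    page_size=100,
)
done = next_page_url(
    "https://example.com/query?f=geojson&resultOffset=200",
    {"type": "FeatureCollection", "features": []},
    page_size=100,
)
```

With a `pages` object, none of this client-side bookkeeping is needed: the server states the next URL directly.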

@lgs85
Contributor

lgs85 commented Sep 12, 2022

I think that this approach looks fine. As discussed, it'll be important to make very clear in the guidance that this is unlikely to be used in the majority of cases.

@duncandewhurst
Collaborator Author

The reference documentation has been updated to reflect the proposal in this issue: https://open-fibre-data-standard.readthedocs.io/en/latest/reference/publication_formats.html

This issue will remain open against the beta milestone to gather feedback from the alpha consultation.

@duncandewhurst duncandewhurst changed the milestone from Alpha to Beta Sep 14, 2022
@duncandewhurst
Collaborator Author

We've not heard any further feedback on this issue so I'm going to close it for now.
