
Service architecture


Architecture Overview

The NYU Spatial Data Repository (SDR) is composed of several interlinking web services:

  • An instance of GeoBlacklight (the codebase that this repo specifically tracks), which serves as the discovery interface
  • An instance of Apache Solr, the search engine that powers GeoBlacklight
  • A SQL database for storing GeoBlacklight user, search, and bookmark data
  • A separate PostGIS enabled SQL database for storing vector geometry (one of the two main sources of GIS data in the SDR)
  • Two instances of GeoServer (one for public GIS data, the other for restricted), which power the WMS and WFS web-services used in GeoBlacklight
  • A multi-purpose metadata and record management server, which runs GeoCombine (this server is not a critical part of the architecture)
  • An institutional repository –– we use NYU's Faculty Digital Archive –– to store "archival" copies of all data layers, and also to provide UUIDs that we can use to refer to layers across all of the different services
  • A source of metadata records –– we are using OpenGeoMetadata as a place to grab records from other institutions, and as a place to store authoritative copies of our own

The Faculty Digital Archive, in particular, provides the Handle.net identifiers we use as UUIDs for records, and stores preservation copies of datasets along with any accompanying codebooks or documentation packets.

A blog series that expands upon NYU's implementation of GeoBlacklight, and which touches on some of these same points, can be found here.


In-depth profile of services and their current implementations

GeoBlacklight

The SDR currently relies on GeoBlacklight, with some local modifications. The Rails application runs on an Ubuntu 14.04 server hosted on Amazon Web Services EC2.

Phusion Passenger, running behind Apache, serves as the Rails application server. HTTPS is forced (a requirement of the OmniAuth integration). A helpful set of instructions on implementing Passenger with Apache on Ubuntu 14.04 (albeit from a different cloud provider) can be found here.

Rails environment variables are managed using Figs. Make sure to add gem 'figs' to your Gemfile.
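
As a minimal sketch of what that looks like, assuming Figs (like Figaro, from which it derives) exposes configured values through ENV; the variable names below are hypothetical, and the exact configuration-file location and loading behavior should follow the Figs README:

    # Gemfile -- as noted above
    gem 'figs'

    # Elsewhere in the application, configured values are read from ENV.
    # SOLR_URL and GEOSERVER_URL are hypothetical examples, not the actual
    # variable names used in the SDR.
    solr_url      = ENV['SOLR_URL']
    geoserver_url = ENV['GEOSERVER_URL']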

The application performs some simple caching of dataset downloads from GeoServer. Cached copies of these datasets are stored in a directory specified in config/settings.yml; if GeoServer's WFS service has trouble generating a WGS84 Shapefile, KMZ, or GeoJSON document, you can place those files in the cache directory yourself and bypass GeoServer's download generation.
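
A rough sketch of that cache-first logic, under stated assumptions: Settings.download_cache_dir stands in for the directory configured in config/settings.yml, the helper name is hypothetical, and the WFS URL construction is illustrative rather than the actual implementation.

    require 'net/http'
    require 'uri'
    require 'fileutils'

    # Hypothetical helper: serve a hand-supplied or previously cached file if
    # present; otherwise ask GeoServer's WFS to generate it and cache the result.
    def cached_download(layer_slug, format: 'SHAPE-ZIP')
      cached = File.join(Settings.download_cache_dir, "#{layer_slug}-#{format}")
      return File.binread(cached) if File.exist?(cached)

      uri = URI("#{ENV['GEOSERVER_URL']}/wfs")
      uri.query = URI.encode_www_form(
        service: 'WFS', version: '2.0.0', request: 'GetFeature',
        typeNames: layer_slug, outputFormat: format, srsName: 'EPSG:4326'
      )
      body = Net::HTTP.get(uri)
      FileUtils.mkdir_p(Settings.download_cache_dir)
      File.binwrite(cached, body)
      body
    end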

Solr

Since Solr has no user-access restrictions by default, access to it is highly restricted in our deployment. GeoBlacklight communicates with it directly via the RSolr client; at no point does a user connect directly to a Solr core.

Solr is deployed on EC2 and is firewalled such that it can only communicate with the Rails server and a server designated for handling metadata ingest.

We are maintaining both production and development Solr cores, so that we can preview new records before going live with them.
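
A minimal sketch of the RSolr connection described above; the core names and Solr URL are placeholders, since the real endpoints are internal and firewalled:

    require 'rsolr'

    # Placeholder URLs for the two cores; GeoBlacklight itself talks to Solr
    # through Blacklight/RSolr, so this only illustrates the client.
    SOLR_CORES = {
      'production'  => 'http://solr.example.edu:8983/solr/sdr-production',
      'development' => 'http://solr.example.edu:8983/solr/sdr-development'
    }.freeze

    solr = RSolr.connect(url: SOLR_CORES[ENV.fetch('RAILS_ENV', 'development')])
    response = solr.get('select', params: { q: 'dct_provenance_s:NYU', rows: 5 })
    puts response['response']['numFound']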

SQL Database (for Rails)

We have chosen MySQL as the backend database for GeoBlacklight when the application runs in production or development mode.

Our instance of MySQL runs on Amazon RDS, and contains three databases (one for production, one for development, one for staging). SQLite can be used for the test environment.
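
Expressed as a sketch (in the application this lives in config/database.yml; hostnames, credentials, and database names here are placeholders):

    require 'active_record'

    env = ENV.fetch('RAILS_ENV', 'development')

    if env == 'test'
      # SQLite is sufficient for the test environment
      ActiveRecord::Base.establish_connection(adapter: 'sqlite3', database: 'db/test.sqlite3')
    else
      # production, development, and staging all live on the Amazon RDS MySQL instance
      ActiveRecord::Base.establish_connection(
        adapter:  'mysql2',
        host:     ENV['RDS_HOSTNAME'],
        database: "geoblacklight_#{env}",
        username: ENV['RDS_USERNAME'],
        password: ENV['RDS_PASSWORD']
      )
    end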

PostGIS SQL Database

This database also runs on RDS, but it uses PostgreSQL. The PostGIS extensions can be added to PostgreSQL by following these instructions.

Read-only user accounts are provided to both instances of GeoServer so that they can pull geometries from the database. I have also experimented with directly connecting to PostGIS from a desktop GIS client (like QGIS) –– also with a read-only user account –– but I seem to get better results from connecting via GeoServer WFS.
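
A sketch of the database-side setup implied here, using the pg gem; connection details, the role name, and password handling are placeholders, and for RDS the linked instructions are the authoritative reference:

    require 'pg'

    conn = PG.connect(host: ENV['POSTGIS_HOST'], dbname: 'vector_layers',
                      user: ENV['POSTGIS_ADMIN'], password: ENV['POSTGIS_ADMIN_PW'])

    # Enable the PostGIS extension on the database holding the vector layers
    conn.exec('CREATE EXTENSION IF NOT EXISTS postgis;')

    # A read-only role of the sort handed to the two GeoServer instances
    conn.exec("CREATE ROLE geoserver_ro LOGIN PASSWORD '#{ENV['GEOSERVER_RO_PW']}';")
    conn.exec('GRANT CONNECT ON DATABASE vector_layers TO geoserver_ro;')
    conn.exec('GRANT USAGE ON SCHEMA public TO geoserver_ro;')
    conn.exec('GRANT SELECT ON ALL TABLES IN SCHEMA public TO geoserver_ro;')
    conn.exec('ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO geoserver_ro;')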

We are using our PostGIS database to store only vector geometries, even though recent versions of PostGIS do support raster layers. One database contains all vector layers, regardless of whether they are Public or Restricted (that distinction is only enforced at the GeoServer level). Accordingly, we use Amazon security groups to heavily restrict traffic to and from this database. Direct connections from outside the Virtual Private Cloud may be limited to NYU IP ranges if there is a desire to allow direct, read-only connections; otherwise, access should be completely restricted to the two GeoServer instances (and a management server).

GeoServer

We have two instances of GeoServer running, one serving Public data and the other Restricted data.

Both connect directly to the PostGIS database, though the layers enabled by each (and therefore served by each) are mutually exclusive, determined by the Rights status in each record's metadata. Layers are enabled or disabled by a Ruby script that walks through GeoBlacklight-schema JSON records and then calls the GeoServer REST APIs. More on that in the following section.

GeoServer provides two crucial services for us: the WMS and WFS endpoints that GeoBlacklight needs for layer previews and generated layer downloads (respectively).
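
For illustration, these are the kinds of requests involved; the endpoint, workspace prefix, and bounding box are placeholders, and SHAPE-ZIP is GeoServer's Shapefile output format:

    require 'uri'

    geoserver = 'https://maps.example.edu/geoserver'   # placeholder endpoint
    layer     = 'sdr:nyu_2451_34366'                   # hypothetical workspace:layer name

    # WMS GetMap -- the kind of request behind GeoBlacklight's layer preview
    wms = URI("#{geoserver}/wms")
    wms.query = URI.encode_www_form(
      service: 'WMS', version: '1.1.1', request: 'GetMap',
      layers: layer, bbox: '-74.3,40.5,-73.7,40.9', srs: 'EPSG:4326',
      width: 512, height: 512, format: 'image/png', transparent: true
    )

    # WFS GetFeature -- the kind of request behind generated vector downloads
    wfs = URI("#{geoserver}/wfs")
    wfs.query = URI.encode_www_form(
      service: 'WFS', version: '2.0.0', request: 'GetFeature',
      typeNames: layer, outputFormat: 'SHAPE-ZIP', srsName: 'EPSG:4326'
    )

    puts wms, wfs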

We have separate instances of GeoServer for Public and Restricted data so that we can limit access to the Restricted endpoint to NYU IP address ranges. Users trying to access Restricted data from off campus are routed through EZproxy after logging in via NYU Single Sign-On.

Metadata / Record Management Server (optional)

For our current workflow, we use another AWS server (at http://metadata.geo.nyu.edu) to handle many of our remaining needs, including:

  1. Intake of geospatial metadata, and export of GeoBlacklight-schema-compatible JSON records (for the moment, we are using this Omeka plugin to do so; in the future, this functionality should be more closely integrated into the SDR stack)
  2. Management of GeoBlacklight records, including the creation of individual records from batch outputs; rearrangement of records into directories that match our UUID scheme; publication of those records to our repo on OpenGeoMetadata
  3. Cleanup of records (with scripts like this)
  4. Transfer of records to Solr, using the GeoCombine toolkit
  5. Management, reproduction, and format conversion of vector GIS layers; transfer of the SQL versions of datasets to the PostGIS database
  6. Enabling layers from PostGIS on the correct instance of GeoServer, via GeoServer's REST API (a sketch of this step follows below)
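
A sketch of that last step, publishing a PostGIS table as a layer through GeoServer's REST API. The workspace, datastore, endpoint, and credentials are placeholders, and exact endpoints and payloads should be checked against the REST API documentation for the GeoServer version in use:

    require 'net/http'
    require 'uri'

    geoserver = 'https://maps-public.example.edu/geoserver'   # placeholder REST endpoint
    workspace = 'sdr'                                         # hypothetical workspace
    datastore = 'vector_postgis'                              # hypothetical PostGIS datastore
    layer     = 'nyu_2451_34366'                              # PostGIS table name, per the UUID scheme

    # POSTing a featureType to the datastore publishes the table over WMS/WFS
    uri = URI("#{geoserver}/rest/workspaces/#{workspace}/datastores/#{datastore}/featuretypes")
    req = Net::HTTP::Post.new(uri)
    req.basic_auth(ENV['GEOSERVER_USER'], ENV['GEOSERVER_PW'])
    req['Content-Type'] = 'application/xml'
    req.body = "<featureType><name>#{layer}</name></featureType>"

    res = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') { |http| http.request(req) }
    puts res.code   # expect 201 Created on success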

In addition, we plan to use this server to host web-accessible versions of all of our GIS metadata records (in particular, records with ISO or FGDC metadata) in order to populate GeoBlacklight's "Metadata" view.

Institutional Repository (NYU Faculty Digital Archive)

Though not strictly necessary for the system to function, the institutional repository is foundational to our data model. Every record in GeoBlacklight has a corresponding record in the Faculty Digital Archive, and links in GeoBlacklight connect the two. Keeping "originals" on the FDA provides us with several things.

The FDA provides us with a Handle.net URI per object. By uploading datasets at the layer level, we get URIs that can be used across all components of the service to refer to the same layer. For example, nyu_2451_34366 might be the UUID of a layer; it would then be the layer slug in GeoBlacklight, the name of the vector table in PostGIS, and the feature name in GeoServer. Additionally, the path to the JSON record in our OpenGeoMetadata repo would be /handle/2451/34366/geoblacklight.json, and the link to the archival copy on the FDA would be http://hdl.handle.net/2451/34366.
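
As a small illustration of that convention (the helper below is not part of the codebase; it simply restates the example above):

    # Illustrative only: derive the identifiers used across the stack from a
    # Handle such as "2451/34366", following the example above.
    def sdr_identifiers(handle)
      prefix, suffix = handle.split('/')
      slug = "nyu_#{prefix}_#{suffix}"
      {
        geoblacklight_slug: slug,                      # layer slug in GeoBlacklight
        postgis_table:      slug,                      # vector table name in PostGIS
        geoserver_feature:  slug,                      # feature name in GeoServer
        ogm_record_path:    "/handle/#{prefix}/#{suffix}/geoblacklight.json",
        fda_archival_copy:  "http://hdl.handle.net/#{prefix}/#{suffix}"
      }
    end

    sdr_identifiers('2451/34366')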

The FDA also allows us to provide "direct download" links to particular bitstreams from within the GeoBlacklight interface. We can use this to reduce load on GeoServer by having the download come directly from the FDA instead of being generated on the fly (though in practice GeoBlacklight would more likely be serving a locally cached copy of the download, so this is less of a concern).

We can also store any additional documentation, or accompanying codebooks, on the FDA.

OpenGeoMetadata

We store the authoritative copies of our metadata records in the edu.nyu repo within the OpenGeoMetadata GitHub organization; records stored in our repo should represent the most current version of our metadata. Our workflow for updating Solr involves using GeoCombine to pull changes from our git repo and then index them into the search core.

This is helpful for many reasons. For one, it allows institutions using GeoBlacklight to share geospatial metadata. At the moment, we have indexed all of Stanford's Public records into the SDR; doing so is quite easy, since all of their records are published to OGM.

It also makes it very simple to reindex Solr quickly, particularly in settings where we need a backup or a load-balanced Solr server; a machine can clone from OGM and index into the designated Solr core with just two GeoCombine rake tasks.
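
As a sketch, assuming GeoCombine's standard rake tasks (geocombine:clone and geocombine:index) and its SOLR_URL environment variable, with the checkout path and core URL as placeholders:

    # Run from a checkout of the GeoCombine project; the task names and the
    # SOLR_URL variable follow GeoCombine's README, and the path/URL here are
    # placeholders for the actual deployment.
    Dir.chdir('/opt/GeoCombine') do
      env = { 'SOLR_URL' => 'http://solr.example.edu:8983/solr/sdr-production' }
      system(env, 'bundle', 'exec', 'rake', 'geocombine:clone') or abort('clone failed')
      system(env, 'bundle', 'exec', 'rake', 'geocombine:index') or abort('index failed')
    end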