shp2nosql

This package, shp2nosql, is a set of command line tools for working with geospatial data and NoSQL systems such as Elasticsearch and MongoDB. It is designed to be fairly standardized with reasonable defaults so that the intricacies of inserting/indexing geospatial data (mainly shapefiles) need not be mastered for each individual system. It does, however, require some working knowledge of the system you wish to use.

The process of inserting/indexing records involves converting the shapefile to GeoJSON, formatting the GeoJSON appropriately, and preparing the system for spatial queries. This process can be tedious, especially with Elasticsearch. Without an appropriate spatial index, you cannot take advantage of Elasticsearch’s geo queries, and while the data type of most fields is detected automatically, this is not the case for geographic features. Among other things, this set of tools is designed to remedy such issues.

Further, this set of tools has options to get shapefiles directly from U.S. Census TIGER files using wget. This could be useful for testing the features of each system or for querying social media data against Census geographies (e.g. counting how many tweets fall within each census tract in Oklahoma).

Required dependencies

  • GDAL (> 2.1)
  • Wget
  • curl
  • which
  • Python 3+
  • NoSQL system (one of the following)
    • Elasticsearch 5+
    • MongoDB 3+
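Before installing, it can help to confirm that the command line dependencies are on your path. The following is a minimal sketch, not part of shp2nosql: the function name missing_deps is hypothetical, and the command names passed to it (for instance ogr2ogr for GDAL and python3 for Python) are assumptions that may differ on your distribution.

```shell
# Print the name of every command in the argument list that is not on PATH.
# Returns 0 if all commands were found and 1 otherwise.
missing_deps() {
    md_status=0
    for md_cmd in "$@"; do
        if ! command -v "$md_cmd" >/dev/null 2>&1; then
            echo "$md_cmd"
            md_status=1
        fi
    done
    return "$md_status"
}

# Usage (command names are assumptions, adjust as needed):
#   missing_deps ogr2ogr wget curl which python3
```

If the function prints nothing, every listed command was found. It does not check versions, so the GDAL, Elasticsearch, and MongoDB version requirements still need to be verified separately.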

Optional dependencies

  • esbulk (only needed for the -e option of shp2es)

Installation

Method 1

  • Use git clone https://github.com/mhaffner/shp2nosql
  • Add the shp2nosql/bin directory to your path (e.g. execute export PATH=$PATH:~/path/to/shp2nosql/bin in the terminal)

Method 2

  • Download and unzip the latest release from here.
  • Add the shp2nosql/bin directory to your path (e.g. execute export PATH=$PATH:~/path/to/shp2nosql/bin in the terminal)

How to use

Elasticsearch (shp2es)

Minimum arguments

-f : FEATURES to get from U.S. census (tract, county, or state)
     argument: path to file in quotes (if local) or "tract", "county", or "state" 
     example: "state"          
OR 

-l : is LOCAL
     argument: path to shapefile (relative or absolute)
     example: "/home/user/cities.shp" or just "cities.shp"

OR

-m : use MULTIPLE local shapefiles
     argument: full path to directory in quotes; all shapefiles must be in this directory 
     example: "/home/user/shapefiles"

AND

-i : INDEX name
     argument: name of the index (user's choice)
     example: "states"
-t : document TYPE
     argument: name of the document type (user's choice)
     example: "state"

MongoDB (shp2mongo)

Minimum arguments

-f : FEATURES to get from U.S. census (tract, county, or state)
     argument: path to file in quotes (if local) or "tract", "county", or "state" 
     example: "state"          

OR 

-l : is LOCAL
     argument: path to shapefile (relative or absolute)
     example: "/home/user/cities.shp" or just "cities.shp"

OR

-m : use MULTIPLE local shapefiles
     argument: full path to directory in quotes; all shapefiles must be in this directory 
     example: "/home/user/shapefiles"

AND

-d : DATABASE name
     argument: name of the database (user's choice)
     example: "states"               
-c : COLLECTION name
     argument: name of the collection (user's choice)

Examples

Several examples are included in shp2nosql/examples. You should be able to run the scripts from this directory with bash example-1.sh, for example. All data necessary are included.

Elasticsearch

A detailed Elasticsearch example:

# Elasticsearch
shp2es -r -f state -i us_states -t state 

# an equivalent, more readable version with comments
shp2es \
    -r `# remove the index if it exists` \
    -f state `# file to get from US Census TIGER files` \
    -i us_states `# index name` \
    -t state `# document type`

In the example above, the tool first deletes the named index if it already exists. It then uses wget to retrieve a shapefile of all U.S. states (plus Washington, D.C., Puerto Rico, etc.) from U.S. Census TIGER files and stores it in shp2nosql/data/shapefiles. Finally, it converts the shapefile to GeoJSON, formats the GeoJSON for Elasticsearch, and indexes the records into the index us_states with document type state. To see if the records indexed correctly, try this from the terminal:

curl localhost:9200/us_states/_count

This command counts the number of documents in our index. It should return something like this:

{"count":56,"_shards":{"total":5,"successful":5,"failed":0}} 
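For scripted checks, the count can be pulled out of that response and compared with an expected value. This is a rough sketch under stated assumptions: the helper names es_count and check_es_count are hypothetical, and the sed pattern assumes the response contains a single "count" key, as in the output shown above.

```shell
# Read an Elasticsearch _count response on stdin and print only the number.
es_count() {
    sed -n 's/.*"count"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# Succeed only if the index holds the expected number of documents.
# Arguments: expected count, index URL (e.g. localhost:9200/us_states).
check_es_count() {
    actual=$(curl -s "$2/_count" | es_count) || return 1
    [ "$actual" = "$1" ]
}

# Usage:
#   check_es_count 56 localhost:9200/us_states && echo "all states indexed"
```

Newly indexed documents may not be counted until Elasticsearch refreshes the index, so a check run immediately after indexing can briefly report a lower number.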

MongoDB

A detailed MongoDB example:

# MongoDB
shp2mongo -r -f state -d us_states -c state 

# an equivalent, more readable version with comments
shp2mongo \
    -r `# remove the database if it exists` \
    -f state `# file to get from US Census TIGER files` \
    -d us_states `# database name` \
    -c state `# collection`

If you tried the previous Elasticsearch example, you’ll notice that the tool does not have to download the shapefile from the U.S. Census TIGER files again. It simply uses the same file. To see if records inserted correctly, try this from a terminal:

mongo us_states

Then, from the mongo shell try:

db.state.count()

It should return:

56

Full documentation

shp2es

##### shp2es help ##### 

-h : HELP (show this documentation)
     argument: no argument used
-l : is LOCAL
     argument: path to shapefile (relative or absolute)
     example: "/home/user/cities.shp" or just "cities.shp"
-f : FEATURES to get from U.S. census (tract, county, or state)
     argument: path to file in quotes (if local) or "tract", "county", or "state" 
     example: "state"          
-m : use MULTIPLE local shapefiles
     argument: full path to directory in quotes; all shapefiles must be in this directory 
     example: "/home/user/shapefiles"
-s : two digit STATE fips code (required when using -f tract)
     argument: two digit state fips code
     example: "40" (state fips code of Oklahoma)               
-i : INDEX name
     argument: name of the index (user's choice)
     example: "states"
-t : document TYPE
     argument: name of the document type (user's choice)
     example: "state"
-H : HOST (default is localhost)
     argument: if none supplied, "localhost" is used; otherwise, host name
     example: "127.0.0.1"
-p : PORT
     argument: if none supplied, "9200" is used for elasticsearch, 27017 for mongodb
     example: "9200"
-r : REMOVE database or index before inserting records
     argument: no argument used
-e : use ESBULK utility
     argument: no argument used

shp2mongo

##### shp2mongo help ##### 

-h : HELP (show this documentation)
     argument: no argument used
-l : is LOCAL
     argument: path to shapefile (relative or absolute)
     example: "/home/user/cities.shp" or just "cities.shp"
-f : FEATURES to get from U.S. census (tract, county, or state)
     argument: path to file in quotes (if local) or "tract", "county", or "state" 
     example: "state"          
-m : use MULTIPLE local shapefiles
     argument: full path to directory in quotes; all shapefiles must be in this directory 
     example: "/home/user/shapefiles"
-s : two digit STATE fips code (required when using -f tract)
     argument: two digit state fips code
     example: "40" (state fips code of Oklahoma)               
-d : DATABASE name
     argument: name of the database (user's choice)
     example: "states"               
-c : COLLECTION name
     argument: name of the collection (user's choice)
     example: "state"               
-H : HOST (default is localhost)
     argument: if none supplied, "localhost" is used; otherwise, host name
     example: "127.0.0.1"
-p : PORT
     argument: if none supplied, 27017 is used
     example: "27017"
-r : REMOVE database or index before inserting records
     argument: no argument used
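The -s option is required whenever -f tract is used, so a small guard can catch a malformed FIPS code before any download starts. The sketch below only uses the flags documented above. The function names is_state_fips and index_tracts and the index name tracts_<code> are hypothetical choices for illustration.

```shell
# Succeed only if the argument is exactly two digits, as -s expects.
is_state_fips() {
    case "$1" in
        [0-9][0-9]) return 0 ;;
        *) return 1 ;;
    esac
}

# Index census tracts for one state into Elasticsearch.
# Argument: two digit state FIPS code (e.g. 40 for Oklahoma).
index_tracts() {
    if ! is_state_fips "$1"; then
        echo "need a two digit state FIPS code, got: $1" >&2
        return 2
    fi
    shp2es -r -f tract -s "$1" -i "tracts_$1" -t tract
}

# Usage:
#   index_tracts 40
```

The guard checks the format only. It does not confirm that the two digits correspond to an existing state.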

FAQ and common problems

Q: I’m receiving a 413 error while attempting to index documents into Elasticsearch. What’s going on?

A: Sometimes this is more of a warning, in that records often index successfully even after this message appears. If they do not, be sure your machine has enough available memory to carry out a bulk index. Also, consider raising http.max_content_length in /etc/elasticsearch/elasticsearch.yml if necessary. Alternatively, use the esbulk utility (github.com/miku/esbulk; it must be installed and found in your path) with the -e flag.

Q: My shapefile has n features, so why does my database/index have n - x features (i.e. not all features were indexed/inserted)?

A: This could be due to a topology error. Visit the directory shp2nosql/data/geojson and view the features with a text editor (warning: the file could be large). Consider validating the geojson with a tool like geojsonlint.
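To find out how many features were dropped, compare the feature count of the source shapefile with the document count in the database or index. The sketch below relies on GDAL's ogrinfo, whose summary output includes a "Feature Count:" line. The function names feature_count and shp_feature_count are hypothetical.

```shell
# Read `ogrinfo -so -al` output on stdin and print the feature count.
feature_count() {
    sed -n 's/^Feature Count:[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# Print the number of features in a shapefile (requires GDAL's ogrinfo).
# Argument: path to the .shp file.
shp_feature_count() {
    ogrinfo -ro -so -al "$1" | feature_count
}

# Usage:
#   shp_feature_count /home/user/cities.shp
```

The printed number is n in the question above. The difference between it and the count reported by Elasticsearch or MongoDB is the number of features that failed to index.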

Q (Elasticsearch): Why did my script complete successfully without indexing any documents?

A (Elasticsearch): The index may have already existed. If you did not intend to add documents without deleting previous documents, consider running the tool with the -r option (which removes the index before indexing) or deleting the index manually using

curl -XDELETE host:port/index

Q (MongoDB): Why is the number of documents in my database more (or double) what I expected?

A (MongoDB): It’s possible that the database and collection existed previously and you simply added to records that were already present. Consider running the tool with the -r option (which removes the database before indexing).

Q: Why did the tool not use the coordinate system/projection of my shapefile? It appears as though everything in GeoJSON uses EPSG:4326.

A: The current GeoJSON standard, RFC 7946 (published in 2016), removed the support for alternative CRSs that the 2008 specification had (see here). The standard states that all coordinates must use WGS 84 (EPSG:4326). Other coordinate systems could reasonably work (although the standard would be violated), but this feature is not currently available. If this is a problem, create an issue.

Q: I received an error with the esbulk utility, but the output was not informative. What’s going on?

A: Try going without the utility with a small data set and see if the issue persists. If geometry is malformed, esbulk may not return an informative error.

Q: I installed Elasticsearch/MongoDB, but I get an error asking if the system is running. How do I check this?

A: To check if Elasticsearch is running, use

curl host:port # e.g. curl localhost:9200

If it is running, it should output some meaningful information about your cluster in .json format. To check if MongoDB is running, simply use the command

mongo

If MongoDB is running, it should drop you into the Mongo shell (you may need to install mongodb-tools to use the Mongo shell on Arch Linux).

If services are not running, you can start them with

systemctl start elasticsearch

systemctl start mongodb

if your system has systemd (this should be the default on Ubuntu > 16.04 and Arch Linux). You may need to enable the service first though.

Q: Why are arguments different for Elasticsearch and MongoDB?

A: This seemingly inconsistent notation is used so that arguments are consistent with the terminology of each system. For example, Elasticsearch requires arguments for options -i (index) and -t (document type), while MongoDB requires arguments for options -d (database name) and -c (collection name).

Q: The script starts but hangs on

Resolving ftp2.census.gov... 148.129.75.35, 2610:20:2010:a09:1000:0:9481:4b23
Connecting to ftp2.census.gov|148.129.75.35|:21... connected.

A: This is an issue with the ftp service of the U.S. Census. It goes down periodically. Usually killing the script with Ctrl-c and trying again a few minutes later solves the problem.
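The kill-and-retry routine can be automated with a small wrapper. This is a sketch rather than a feature of shp2nosql: the function name retry is hypothetical, and the usage line assumes both that GNU timeout is available to stop a hung download and that the tool can safely be rerun after being interrupted.

```shell
# Run a command up to N times, sleeping between failed attempts.
# Arguments: max attempts, seconds to wait, then the command and its arguments.
# Returns 0 as soon as one attempt succeeds and 1 if every attempt fails.
retry() {
    rt_max=$1
    rt_wait=$2
    shift 2
    rt_attempt=1
    while :; do
        if "$@"; then
            return 0
        fi
        if [ "$rt_attempt" -ge "$rt_max" ]; then
            return 1
        fi
        rt_attempt=$((rt_attempt + 1))
        sleep "$rt_wait"
    done
}

# Usage: give each attempt 2 minutes, wait 5 minutes between attempts.
#   retry 5 300 timeout 120 shp2es -r -f state -i us_states -t state
```

The time limit should be generous enough for a normal download to finish, since timeout also stops attempts that are slow but still making progress.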
