This package, shp2nosql, is a set of command line tools for working with geospatial data and NoSQL systems such as Elasticsearch and MongoDB. It is designed to be fairly standardized with reasonable defaults so that the intricacies of inserting/indexing geospatial data (mainly shapefiles) need not be mastered for each individual system. It does, however, require some working knowledge the system you wish to use.
The process of inserting/indexing records basically involves converting the shapefile to a GeoJSON, formatting the GeoJSON appropriately, and preparing the system for spatial queries. This process can be tedious, especially with Elasticsearch. Without an appropriate spatial index, you cannot take advantage of Elasticsearch’s geo queries. While the data type of most fields is detected automatically, this is not the case for geographic features. Among other things, this set of tools is designed to remedy such issues.
Further, this set of tools has options to get shapefiles directly from
U.S. Census TIGER files using wget
. This could be useful for testing
the features of each system or for querying social media data against
the Census’s (e.g. count how many tweets are within each census tract
in Oklahoma)
- GDAL (> 2.1)
- Wget
- curl
- which
- Python 3+
- NoSQL system (one of the following)
- Elasticsearch 5+
- MongoDB 3+
- Use
git clone https://github.com/mhaffner/shp2nosql
- Add the
shp2nosql/bin
directory to your path (e.g. executeexport PATH=$PATH:~/path/to/shp2nosql/bin
in the terminal)
- Download and unzip the latest release from here.
- Add the
shp2nosql/bin
directory to your path (e.g. executeexport PATH=$PATH:~/path/to/shp2nosql/bin
in the terminal)
-f : FEATURES to get from U.S. census (tract, county, or state)
arugment: path to file in quotes (if local) or "tract", "county", or "state"
example: "state"
OR
-l : is LOCAL
argument: path to shapefile (relative or absolute)
example: "/home/user/cities.shp" or just "cities.shp"
OR
-m : use MULTIPLE local shapefiles
argument: full path to directory in quotes; all shapfiles must be in this directory
example: "/home/user/shapefiles"
AND
-i : INDEX name
argument: name of the index (user's choice)
example: "states"
-t : document TYPE
argument: name of the document type (user's choice)
example: "state"
-f : FEATURES to get from U.S. census (tract, county, or state)
arugment: path to file in quotes (if local) or "tract", "county", or "state"
example: "state"
OR
-l : is LOCAL
argument: path to shapefile (relative or absolute)
example: "/home/user/cities.shp" or just "cities.shp"
OR
-m : use MULTIPLE local shapefiles
argument: full path to directory in quotes; all shapfiles must be in this directory
example: "/home/user/shapefiles"
AND
-d : DATABASE name
argument: name of the database (user's choice)
example: "states"
-c : COLLECTION name
argument: name of the collection (user's choice)
Several examples are included in
shp2nosql/examples.
You should be able to run the scripts from this directory with bash
example-1.sh
, for example. All data necessary are included.
A detailed Elasticsearch example:
# Elasticsearch
shp2es -r -f state -i us_states -t state
# an equivalent, more readable version with comments
shp2es \
-r `# remove the index if it exists` \
-f state `# file to get from US Census TIGER files` \
-i us_states `# index name` \
-t state `# document type`
In the example above, the tool first deletes the named index if it already
exists. The tool uses wget
to retrieve a shapefile of all U.S. States (plus
Washington, D.C., Puerto Rico, etc.) from U.S. Census TIGER files. This
shapefile is stored in shp2nosql/data/shapefiles after downloading. The tool
converts the shapefile to GeoJSON, formats the GeoJSON for Elasticsearch,
indexes records into the index us_states
with document type state
. To see
if the records indexed correctly, try this from the terminal:
curl localhost:9200/us_states/_count
This command counts the number of documents in our index. It should return something like this:
{"count":56,"_shards":{"total":5,"successful":5,"failed":0}}
A detailed MongoDB example:
# MongoDB
shp2mongo -r -f state -d us_states -c state
# an equivalent, more readable version with comments
shp2mongo \
-r `# remove the database if it exists` \
-f state `# file to get from US Census TIGER files` \
-d us_states `# database name` \
-c state `# collection` \
If you tried the previous Elasticsearch example, you’ll notice that the tool does not have to download the shapefile from the U.S. Census TIGER files again. It simply uses the same file. To see if records inserted correctly, try this from a terminal:
mongo us_states
Then, from the mongo shell try:
db.state.count()
It should return:
56
##### shp2es help ##### -h : HELP (show this documentation; arugment: no argument used -l : is LOCAL argument: path to shapefile (relative or absolute) example: "/home/user/cities.shp" or just "cities.shp" -f : FEATURES to get from U.S. census (tract, county, or state) arugment: path to file in quotes (if local) or "tract", "county", or "state" example: "state" -m : use MULTIPLE local shapefiles argument: full path to directory in quotes; all shapfiles must be in this directory example: "/home/user/shapefiles" -s : two digit STATE fips code (required when using -f tract) argument: two digit state fips code example: "40" (state fips code of Oklahoma) -i : INDEX name argument: name of the index (user's choice) example: "states" -t : document TYPE argument: name of the document type (user's choice) example: "state" -H : HOST (default is localhost) argument: if none supplied, "localhost" is used; otherwise, host name example: "127.0.0.01" -p : PORT argument: if non supplied, "9200" used for elasticsearch, 27017 for mongodb example: "9200" -r : REMOVE database or index before inserting records argument: no argument used -e : use ESBULK utility argument: no argument used
##### shp2mongo help ##### -h : HELP (show this documentation; arugment: no argument used -l : is LOCAL argument: path to shapefile (relative or absolute) example: "/home/user/cities.shp" or just "cities.shp" -f : FEATURES to get from U.S. census (tract, county, or state) arugment: path to file in quotes (if local) or "tract", "county", or "state" example: "state" -m : use MULTIPLE local shapefiles argument: full path to directory in quotes; all shapfiles must be in this directory example: "/home/user/shapefiles" -s : two digit STATE fips code (required when using -f tract) argument: two digit state fips code example: "40" (state fips code of Oklahoma) -d : DATABASE name argument: name of the database (user's choice) example: "states" -c : COLLECTION name argument: name of the collection (user's choice) example: "state" -H : HOST (default is localhost) argument: if none supplied, "localhost" is used; otherwise, host name example: "127.0.0.01" -p : PORT argument: if non supplied, 27017 example: "27017" -r : REMOVE database or index before inserting records argument: no argument used
Q: I’m recieving a 413 error while attempting to index documents into Elasticsearch. What’s going on?
A: Sometimes this is more of a warning in that records often index
successfully even after seeing this message. If not, be sure your
machine has enough available memory to carry out a bulk index. Also,
consider adjusting http.maxRequestLength in
/etc/elasticsearch/elasticsearch.yml
if necessary. Alternatively, use
the [[github.com/miku/esbulk][esbulk]]
utility (must be installed and found in your path) with the
-e flag
Q: My shapefile has n features, so why does my database/index have n - x features (i.e. not all features were indexed/inserted)?
A: This could be due to a topology error. Visit the directory
shp2nosql/data/geojson
and view the features with a text editor
(warning: the file could be large). Consider validating the geojson
with a tool like geojsonlint.
Q (Elasticsearch): Why did my script complete successfully without indexing any documents?
A (Elasticsearch): The index may have already existed. If you did not intend
to add documents without deleting previous documents, consider running the tool
with the -r
option (which removes the index before indexing) or deleting the index
manually using
curl -XDELETE host:port/index
Q (MongoDB): Why is the number of documents in my database more (or double) what I expected?
A (MongoDB): It’s possible that the database and collection existed previously
and you simply added to records that were already present. Consider running the
tool with the -r
option (which removes the database before indexing).
Q: Why did the tool not use the coordinate system/projection of my shapefile?
It appears as though everything is GeoJSON is using EPSG:4326
.
A: The support for alternative CRS’s for GeoJSON was removed in 2008
(see here). This standard states everything must use EPSG:4326
.
Other coordinate systems could reasonably work (although the standard
would be violated), but this feature is not currently available. If
this is a problem, create an issue.
Q: I received an error with the esbulk
utility, but the output was not
informative. What’s going on?
A: Try going without the utility with a small data set and see if the issue
persists. If geometry is malformed, esbulk
may not return an informative
error.
Q: I installed Elasticsearch/MongoDB, but I get an error asking if the system is running. How do I check this?
A: To check if Elasticsearch is running, use
curl host:port # e.g. curl localhost:9200
If it is running, it should output some meaningful information about your cluster in .json format. To check if MongoDB is running, simply use the command
mongo
If MongoDB is running, it should drop you into the Mongo shell (you may need to
install mongodb-tools
to use the Mongo shell on Arch Linux).
If services are not running, you can start them with
systemctl start elasticsearch
systemctl start mongodb
if your system has systemd
(this should be the default on Ubuntu >
16.04 and Arch Linux). You may need to enable the service first
though.
Q: Why are arguments different for Elasticsearch and MongoDB?
This seemingly inconsistent notation is used so that arguments are
consistent with the terminology of each system. For example,
Elasticsearch requires arguments for options -i
(index) and -t
(document type), while MongoDB requires arguments for options -d
(database name) and -c
(collection name).
Q: The script starts but hangs on
Resolving ftp2.census.gov... 148.129.75.35, 2610:20:2010:a09:1000:0:9481:4b23 Connecting to ftp2.census.gov|148.129.75.35|:21... connected.
A: This is an issue with the ftp service of the U.S. Census. It goes down
periodically. Usually killing the script with Ctrl-c
and trying again a few
minutes later solves the problem.