README


#+BEGIN_COMMENT
(org-org-export-to-org)
#+END_COMMENT

* shp2nosql
This package, shp2nosql, is a set of command line tools for working
with geospatial data and NoSQL systems such as Elasticsearch and
MongoDB. It is designed to be fairly standardized with reasonable
defaults so that the intricacies of inserting/indexing geospatial data
(mainly shapefiles) need not be mastered for each individual system.
It does, however, require /some/ working knowledge the system you
wish to use. 

The process of inserting/indexing records basically involves
converting the shapefile to a GeoJSON, formatting the GeoJSON
appropriately, and preparing the system for spatial queries. This
process can be tedious, especially with Elasticsearch. Without an
appropriate spatial index, you cannot take advantage of
Elasticsearch's geo queries. While the data type of most fields is
detected automatically, this is not the case for geographic features.
Among other things, this set of tools is designed to remedy such
issues.

Further, this set of tools has options to get shapefiles directly from
U.S. Census TIGER files using =wget=. This could be useful for testing
the features of each system or for querying social media data against
the Census's (e.g. count how many tweets are within each census tract
in Oklahoma)

** Required dependencies
- GDAL (> 2.1)
- Wget 
- curl
- which
- Python 3+
- NoSQL system (one of the following)
  - Elasticsearch 5+
  - MongoDB 3+
** Optional dependencies
- [[https://github.com/miku/esbulk][esbulk]] 
** Installation
*** Method 1
- Use =git clone https://github.com/mhaffner/shp2nosql=
- Add the =shp2nosql/bin= directory to your path (e.g. execute =export
  PATH=$PATH:~/path/to/shp2nosql/bin= in the terminal)
*** Method 2
- Download and unzip the latest release from [[https://github.com/mhaffner/shp2nosql/releases][here]].
- Add the =shp2nosql/bin= directory to your path (e.g. execute =export
  PATH=$PATH:~/path/to/shp2nosql/bin= in the terminal)

* How to use 
** Elasticsearch (shp2es)
*** Minimum arguments
#+BEGIN_SRC text
-f : FEATURES to get from U.S. census (tract, county, or state)
     arugment: path to file in quotes (if local) or "tract", "county", or "state" 
     example: "state"          
OR 

-l : is LOCAL
     argument: path to shapefile (relative or absolute)
     example: "/home/user/cities.shp" or just "cities.shp"

OR

-m : use MULTIPLE local shapefiles
     argument: full path to directory in quotes; all shapfiles must be in this directory 
     example: "/home/user/shapefiles"

AND

-i : INDEX name
     argument: name of the index (user's choice)
     example: "states"
-t : document TYPE
     argument: name of the document type (user's choice)
     example: "state"
#+END_SRC 
** MongoDB (shp2mongo)
*** Minimum arguments
#+BEGIN_SRC text
-f : FEATURES to get from U.S. census (tract, county, or state)
     arugment: path to file in quotes (if local) or "tract", "county", or "state" 
     example: "state"          

OR 

-l : is LOCAL
     argument: path to shapefile (relative or absolute)
     example: "/home/user/cities.shp" or just "cities.shp"

OR

-m : use MULTIPLE local shapefiles
     argument: full path to directory in quotes; all shapfiles must be in this directory 
     example: "/home/user/shapefiles"

AND

-d : DATABASE name
     argument: name of the database (user's choice)
     example: "states"               
-c : COLLECTION name
     argument: name of the collection (user's choice)
#+END_SRC 
** Examples
Several examples are included in
[[https://github.com/mhaffner/shp2nosql/tree/master/examples/][shp2nosql/examples]].
You should be able to run the scripts from this directory with =bash
example-1.sh=, for example. All data necessary are included.
*** Elasticsearch
A detailed Elasticsearch example:

#+BEGIN_SRC shell 
# Elasticsearch
shp2es -r -f state -i us_states -t state 

# an equivalent, more readable version with comments
shp2es \
    -r `# remove the index if it exists` \
    -f state `# file to get from US Census TIGER files` \
    -i us_states `# index name` \
    -t state `# document type`
#+END_SRC

In the example above, the tool first deletes the named index if it already
exists. The tool uses =wget= to retrieve a shapefile of all U.S. States (plus
Washington, D.C., Puerto Rico, etc.) from U.S. Census TIGER files. This
shapefile is stored in [[https://github.com/mhaffner/shp2nosql/data/shapefiles][shp2nosql/data/shapefiles]] after downloading. The tool
converts the shapefile to GeoJSON, formats the GeoJSON for Elasticsearch,
indexes records into the index =us_states= with document type =state=. To see
if the records indexed correctly, try this from the terminal:

#+BEGIN_SRC shell 
curl localhost:9200/us_states/_count
#+END_SRC

This command counts the number of documents in our index. It should return
something like this:

#+BEGIN_SRC 
{"count":56,"_shards":{"total":5,"successful":5,"failed":0}} 
#+END_SRC
*** MongoDB
A detailed MongoDB example:

#+BEGIN_SRC shell 
# MongoDB
shp2mongo -r -f state -d us_states -c state 

# an equivalent, more readable version with comments
shp2mongo \
    -r `# remove the database if it exists` \
    -f state `# file to get from US Census TIGER files` \
    -d us_states `# database name` \
    -c state `# collection` \
#+END_SRC

If you tried the previous Elasticsearch example, you'll notice that
the tool does not have to download the shapefile from the U.S. Census
TIGER files again. It simply uses the same file. To see if records
inserted correctly, try this from a terminal:

#+BEGIN_SRC shell 
mongo us_states
#+END_SRC

Then, from the mongo shell try:

#+BEGIN_SRC
db.state.count()
#+END_SRC

It should return:

#+BEGIN_SRC
56
#+END_SRC
** Full documentation
*** shp2es
#+INCLUDE: "./elasticsearch-help.txt" src
*** shp2mongo
#+INCLUDE: "./mongodb-help.txt" src
* FAQ and common problems
*Q*: I'm recieving a 413 error while attempting to index documents into
Elasticsearch. What's going on?

*A*: Sometimes this is more of a warning in that records often index
successfully even after seeing this message. If not, be sure your
machine has enough available memory to carry out a bulk index. Also,
consider adjusting http.maxRequestLength in
=/etc/elasticsearch/elasticsearch.yml= if necessary. Alternatively, use
the =[[github.com/miku/esbulk][esbulk]]= utility (must be installed and found in your path) with the
-e flag

*Q*: My shapefile has /n/ features, so why does my database/index have
/n - x/ features (i.e. not all features were indexed/inserted)?

*A*: This could be due to a topology error. Visit the directory
=shp2nosql/data/geojson= and view the features with a text editor
(warning: the file could be large). Consider validating the geojson
with a tool like [[geojsonlint.com][geojsonlint]].

*Q (Elasticsearch)*: Why did my script complete successfully without
indexing any documents?

*A (Elasticsearch)*: The index may have already existed. If you did not intend
to add documents without deleting previous documents, consider running the tool
with the =-r= option (which removes the index before indexing) or deleting the index
manually using

#+BEGIN_SRC shell
curl -XDELETE host:port/index
#+END_SRC

*Q (MongoDB)*: Why is the number of documents in my database more (or double)
what I expected?

*A (MongoDB)*: It's possible that the database and collection existed previously
and you simply added to records that were already present. Consider running the
tool with the =-r= option (which removes the database before indexing).

*Q*: Why did the tool not use the coordinate system/projection of my shapefile?
It appears as though everything is GeoJSON is using =EPSG:4326=. 

*A*: The support for alternative CRS's for GeoJSON was removed in 2008
(see [[https://tools.ietf.org/html/rfc7946#section-4][here]]). This standard states everything must use =EPSG:4326=.
Other coordinate systems could reasonably work (although the standard
would be violated), but this feature is not currently available. If
this is a problem, create an [[https://github.com/mhaffner/shp2nosql/issues][issue]].

*Q*: I received an error with the =esbulk= utility, but the output was not
informative. What's going on?

*A*: Try going without the utility with a small data set and see if the issue
persists. If geometry is malformed, =esbulk= may not return an informative
error.

*Q*: I installed Elasticsearch/MongoDB, but I get an error asking if
the system is running. How do I check this?

*A*: To check if Elasticsearch is running, use

#+BEGIN_SRC shell
curl host:port # e.g. curl localhost:9200
#+END_SRC

If it is running, it should output some meaningful information about your
cluster in .json format. To check if MongoDB is running, simply use the command 

#+BEGIN_SRC shell
mongo
#+END_SRC

If MongoDB is running, it should drop you into the Mongo shell (you may need to
install =mongodb-tools= to use the Mongo shell on Arch Linux). 

If services are not running, you can start them with 

#+BEGIN_SRC shell
systemctl start elasticsearch

systemctl start mongodb
#+END_SRC

if your system has =systemd= (this should be the default on Ubuntu >
16.04 and Arch Linux). You may need to enable the service first
though. 

*Q*: Why are arguments different for Elasticsearch and MongoDB?

This seemingly inconsistent notation is used so that arguments are
consistent with the terminology of each system. For example,
Elasticsearch requires arguments for options =-i= (index) and =-t=
(document type), while MongoDB requires arguments for options =-d=
(database name) and =-c= (collection name).

*Q*: The script starts but hangs on
#+BEGIN_SRC
Resolving ftp2.census.gov... 148.129.75.35, 2610:20:2010:a09:1000:0:9481:4b23
Connecting to ftp2.census.gov|148.129.75.35|:21... connected.
#+END_SRC

*A*: This is an issue with the ftp service of the U.S. Census. It goes down
 periodically. Usually killing the script with =Ctrl-c= and trying again a few
 minutes later solves the problem.