#+BEGIN_COMMENT
(org-org-export-to-org)
#+END_COMMENT
* shp2nosql
This package, shp2nosql, is a set of command line tools for working
with geospatial data and NoSQL systems such as Elasticsearch and
MongoDB. It is designed to be fairly standardized with reasonable
defaults so that the intricacies of inserting/indexing geospatial data
(mainly shapefiles) need not be mastered for each individual system.
It does, however, require /some/ working knowledge of the system you
wish to use.
The process of inserting/indexing records involves three steps:
converting the shapefile to GeoJSON, formatting the GeoJSON
appropriately, and preparing the system for spatial queries. This
process can be tedious, especially with Elasticsearch. Without an
appropriate spatial index, you cannot take advantage of
Elasticsearch's geo queries. While the data type of most fields is
detected automatically, this is not the case for geographic features.
Among other things, this set of tools is designed to remedy such
issues.
Further, this set of tools has options to get shapefiles directly from
U.S. Census TIGER files using =wget=. This could be useful for testing
the features of each system or for querying social media data against
Census boundaries (e.g. counting how many tweets fall within each census
tract in Oklahoma).
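The first step of the process described above, converting a shapefile to GeoJSON, can be sketched with GDAL's =ogr2ogr=. This is an illustration only, not the tools' actual code: the function name =shp_to_geojson= is hypothetical, and reprojecting to =EPSG:4326= with =-t_srs= is an assumption about what the later steps expect.
#+BEGIN_SRC shell
# Hypothetical helper: convert a shapefile to GeoJSON in EPSG:4326.
# The output name is derived from the input (cities.shp -> cities.geojson).
shp_to_geojson() {
    shp="$1"
    ogr2ogr -f GeoJSON -t_srs EPSG:4326 "${shp%.shp}.geojson" "$shp"
}
# Usage: shp_to_geojson cities.shp
#+END_SRC
The formatting and index-preparation steps are specific to each system and are not shown here.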
** Required dependencies
- GDAL (> 2.1)
- Wget
- curl
- which
- Python 3+
- NoSQL system (one of the following)
  - Elasticsearch 5+
  - MongoDB 3+
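Whether the required command-line tools are installed can be checked up front. This is a minimal sketch, not part of shp2nosql: the function name =missing_deps= is hypothetical, and it assumes GDAL provides the =ogr2ogr= executable and Python 3 is installed as =python3=.
#+BEGIN_SRC shell
# Print each command from the argument list that is not found on PATH.
missing_deps() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
    done
}

# Assumed executable names for the required dependencies.
missing_deps ogr2ogr wget curl which python3
#+END_SRC
No output means every tool was found. Version numbers are not checked.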
** Optional dependencies
- [[https://github.com/miku/esbulk][esbulk]]
** Installation
*** Method 1
- Use =git clone https://github.com/mhaffner/shp2nosql=
- Add the =shp2nosql/bin= directory to your path (e.g. execute =export
PATH=$PATH:~/path/to/shp2nosql/bin= in the terminal)
*** Method 2
- Download and unzip the latest release from [[https://github.com/mhaffner/shp2nosql/releases][here]].
- Add the =shp2nosql/bin= directory to your path (e.g. execute =export
PATH=$PATH:~/path/to/shp2nosql/bin= in the terminal)
* How to use
** Elasticsearch (shp2es)
*** Minimum arguments
#+BEGIN_SRC text
-f : FEATURES to get from U.S. census (tract, county, or state)
     argument: path to file in quotes (if local) or "tract", "county", or "state"
     example: "state"
OR
-l : is LOCAL
     argument: path to shapefile (relative or absolute)
     example: "/home/user/cities.shp" or just "cities.shp"
OR
-m : use MULTIPLE local shapefiles
     argument: full path to directory in quotes; all shapefiles must be in this directory
     example: "/home/user/shapefiles"
AND
-i : INDEX name
     argument: name of the index (user's choice)
     example: "states"
-t : document TYPE
     argument: name of the document type (user's choice)
     example: "state"
#+END_SRC
** MongoDB (shp2mongo)
*** Minimum arguments
#+BEGIN_SRC text
-f : FEATURES to get from U.S. census (tract, county, or state)
     argument: path to file in quotes (if local) or "tract", "county", or "state"
     example: "state"
OR
-l : is LOCAL
     argument: path to shapefile (relative or absolute)
     example: "/home/user/cities.shp" or just "cities.shp"
OR
-m : use MULTIPLE local shapefiles
     argument: full path to directory in quotes; all shapefiles must be in this directory
     example: "/home/user/shapefiles"
AND
-d : DATABASE name
     argument: name of the database (user's choice)
     example: "states"
-c : COLLECTION name
     argument: name of the collection (user's choice)
#+END_SRC
** Examples
Several examples are included in
[[https://github.com/mhaffner/shp2nosql/tree/master/examples/][shp2nosql/examples]].
You should be able to run the scripts from this directory with =bash
example-1.sh=, for example. All data necessary are included.
*** Elasticsearch
A detailed Elasticsearch example:
#+BEGIN_SRC shell
# Elasticsearch
shp2es -r -f state -i us_states -t state
# an equivalent, more readable version with comments
shp2es \
-r `# remove the index if it exists` \
-f state `# file to get from US Census TIGER files` \
-i us_states `# index name` \
-t state `# document type`
#+END_SRC
In the example above, the tool first deletes the named index if it already
exists. The tool uses =wget= to retrieve a shapefile of all U.S. States (plus
Washington, D.C., Puerto Rico, etc.) from U.S. Census TIGER files. This
shapefile is stored in [[https://github.com/mhaffner/shp2nosql/data/shapefiles][shp2nosql/data/shapefiles]] after downloading. The tool
converts the shapefile to GeoJSON, formats the GeoJSON for Elasticsearch, and
indexes the records into the index =us_states= with document type =state=. To see
if the records were indexed correctly, try this from the terminal:
#+BEGIN_SRC shell
curl localhost:9200/us_states/_count
#+END_SRC
This command counts the number of documents in our index. It should return
something like this:
#+BEGIN_SRC
{"count":56,"_shards":{"total":5,"successful":5,"failed":0}}
#+END_SRC
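Once the records are indexed, Elasticsearch's geo queries become available. The following is a hedged sketch of a =geo_shape= query that looks for the state intersecting a point. The field name =geometry= is an assumption (check the real name with =curl localhost:9200/us_states/_mapping=), and =point_query= is a hypothetical helper, not part of shp2nosql.
#+BEGIN_SRC shell
# Build a geo_shape query body for a longitude/latitude point.
# "geometry" is an ASSUMED field name; replace it with the one in your mapping.
point_query() {
    printf '{"query":{"geo_shape":{"geometry":{"shape":{"type":"point","coordinates":[%s,%s]},"relation":"intersects"}}}}' "$1" "$2"
}
# Usage (requires a running Elasticsearch with the us_states index):
# curl -H 'Content-Type: application/json' 'localhost:9200/us_states/_search?pretty' -d "$(point_query -97.5 35.5)"
#+END_SRC
Coordinates are given in longitude, latitude order, as in GeoJSON.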
*** MongoDB
A detailed MongoDB example:
#+BEGIN_SRC shell
# MongoDB
shp2mongo -r -f state -d us_states -c state
# an equivalent, more readable version with comments
shp2mongo \
-r `# remove the database if it exists` \
-f state `# file to get from US Census TIGER files` \
-d us_states `# database name` \
-c state `# collection name`
#+END_SRC
If you tried the previous Elasticsearch example, you'll notice that
the tool does not have to download the shapefile from the U.S. Census
TIGER files again. It simply uses the same file. To see if the records
were inserted correctly, try this from a terminal:
#+BEGIN_SRC shell
mongo us_states
#+END_SRC
Then, from the mongo shell try:
#+BEGIN_SRC
db.state.count()
#+END_SRC
It should return:
#+BEGIN_SRC
56
#+END_SRC
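The same check can be run without opening the interactive shell. This is a sketch that assumes the legacy =mongo= shell used above is installed; the function name =mongo_count= is hypothetical.
#+BEGIN_SRC shell
# Hypothetical helper: print the number of documents in collection $2 of database $1.
mongo_count() {
    mongo --quiet --eval "db.$2.count()" "$1"
}
# Usage: mongo_count us_states state
#+END_SRC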
** Full documentation
*** shp2es
#+INCLUDE: "./elasticsearch-help.txt" src
*** shp2mongo
#+INCLUDE: "./mongodb-help.txt" src
* FAQ and common problems
*Q*: I'm receiving a 413 error while attempting to index documents into
Elasticsearch. What's going on?
*A*: HTTP 413 means the bulk request body was larger than the server
accepts. Sometimes this is more of a warning, in that records often index
successfully even after this message appears. If not, be sure your
machine has enough available memory to carry out a bulk index. Also,
consider raising =http.max_content_length= in
=/etc/elasticsearch/elasticsearch.yml= if necessary. Alternatively, use
the [[https://github.com/miku/esbulk][esbulk]] utility (must be installed and found in your path) with the
=-e= flag.
*Q*: My shapefile has /n/ features, so why does my database/index have
/n - x/ features (i.e. not all features were indexed/inserted)?
*A*: This could be due to a topology error. Visit the directory
=shp2nosql/data/geojson= and view the features with a text editor
(warning: the file could be large). Consider validating the GeoJSON
with a tool like [[http://geojsonlint.com][geojsonlint]].
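To find out how many features are missing, the feature count of the source shapefile can be compared with the document count in the index or collection. This is a sketch using GDAL's =ogrinfo=; the function name =feature_count= is hypothetical.
#+BEGIN_SRC shell
# Extract N from the "Feature Count: N" line of ogrinfo's summary output.
feature_count() {
    ogrinfo -so -al "$1" | sed -n 's/^Feature Count: \([0-9][0-9]*\)$/\1/p'
}
# Usage: feature_count cities.shp
#+END_SRC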
*Q (Elasticsearch)*: Why did my script complete successfully without
indexing any documents?
*A (Elasticsearch)*: The index may have already existed. If you did not intend
to add documents without deleting previous documents, consider running the tool
with the =-r= option (which removes the index before indexing) or deleting the index
manually using
#+BEGIN_SRC shell
curl -XDELETE host:port/index
#+END_SRC
*Q (MongoDB)*: Why is the number of documents in my database larger than (or double)
what I expected?
*A (MongoDB)*: It's possible that the database and collection existed previously
and you simply added to records that were already present. Consider running the
tool with the =-r= option (which removes the database before inserting).
*Q*: Why did the tool not use the coordinate system/projection of my shapefile?
It appears as though everything in the GeoJSON is using =EPSG:4326=.
*A*: Support for alternative CRSs, which the 2008 GeoJSON specification
allowed, was removed in RFC 7946, published in 2016
(see [[https://tools.ietf.org/html/rfc7946#section-4][here]]). This standard states everything must use =EPSG:4326=.
Other coordinate systems could reasonably work (although the standard
would be violated), but this feature is not currently available. If
this is a problem, create an [[https://github.com/mhaffner/shp2nosql/issues][issue]].
*Q*: I received an error with the =esbulk= utility, but the output was not
informative. What's going on?
*A*: Try indexing a small data set without =esbulk= and see if the issue
persists. If geometry is malformed, =esbulk= may not return an informative
error.
*Q*: I installed Elasticsearch/MongoDB, but I get an error asking if
the system is running. How do I check this?
*A*: To check if Elasticsearch is running, use
#+BEGIN_SRC shell
curl host:port # e.g. curl localhost:9200
#+END_SRC
If it is running, it should output some meaningful information about your
cluster in JSON format. To check if MongoDB is running, simply use the command
#+BEGIN_SRC shell
mongo
#+END_SRC
If MongoDB is running, it should drop you into the Mongo shell (you may need to
install =mongodb-tools= to use the Mongo shell on Arch Linux).
If services are not running, you can start them with
#+BEGIN_SRC shell
systemctl start elasticsearch
systemctl start mongodb
#+END_SRC
if your system has =systemd= (this should be the default on Ubuntu >
16.04 and Arch Linux). You may need to enable the service first
though.
*Q*: Why are arguments different for Elasticsearch and MongoDB?
*A*: This seemingly inconsistent notation is used so that arguments are
consistent with the terminology of each system. For example,
Elasticsearch requires arguments for options =-i= (index) and =-t=
(document type), while MongoDB requires arguments for options =-d=
(database name) and =-c= (collection name).
*Q*: The script starts but hangs on
#+BEGIN_SRC
Resolving ftp2.census.gov... 148.129.75.35, 2610:20:2010:a09:1000:0:9481:4b23
Connecting to ftp2.census.gov|148.129.75.35|:21... connected.
#+END_SRC
*A*: This is an issue with the FTP service of the U.S. Census, which goes down
periodically. Usually killing the script with =Ctrl-c= and trying again a few
minutes later solves the problem.
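The kill-and-retry routine can be automated. This is a minimal sketch, not part of shp2nosql: =retry= is a hypothetical helper, and the usage line assumes GNU coreutils' =timeout= is available to stop a hung download.
#+BEGIN_SRC shell
# Run a command up to $1 times, sleeping $2 seconds between failed attempts.
# Returns 0 as soon as the command succeeds, 1 if every attempt fails.
retry() {
    attempts="$1"; delay="$2"; shift 2
    n=1
    until "$@"; do
        [ "$n" -ge "$attempts" ] && return 1
        n=$((n + 1))
        sleep "$delay"
    done
}
# Usage: give each attempt 10 minutes, wait 5 minutes between attempts.
# retry 3 300 timeout 600 shp2es -r -f state -i us_states -t state
#+END_SRC
The time limit should be generous enough for the largest download you expect, since =timeout= also stops attempts that are slow but still making progress.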