## Use spatial indexing to query spatial data

This notebook shows you how to work with the spatio-temporal library that is pre-installed on all Spark environments in Watson Studio. The spatio-temporal library supports spatial indexing functions which you will learn to use in this notebook.

The spatial indexing functions enable efficient access to simple geometric objects such as points, lines and polygons in spatial databases. These functions greatly enhance time-critical search of spatial data.

Much of the spatial indexing functionality in spatial databases such as Db2 Spatial, Oracle Spatial, or Microsoft SQL Server Spatial is not directly accessible to users. With the spatial indexing support in the spatio-temporal library, however, you can index and query arbitrary geometries. The spatial indices support various queries, which you will learn to use in this notebook, including:
- `containing`
- `within_distance`
- `intersects`

A spatial index can significantly reduce the time taken to search for geometries, depending on the size and complexity of the geometries. For example, you can find out which ZIP code polygon a single point belongs to. A sequential search across all the polygons would be slow and expensive; searching a spatial index of all ZIP code polygons for a country answers the same query significantly faster.
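To make the difference between a sequential scan and an index lookup concrete, here is a minimal pure-Python sketch of a uniform grid index over bounding boxes. It is an illustration only and does not show how the spatio-temporal library is implemented; the `GridIndex` class, its methods, and the sample rectangles are all hypothetical.

```python
from collections import defaultdict
import math


class GridIndex:
    """Toy uniform grid index over axis-aligned bounding boxes (hypothetical)."""

    def __init__(self, cell_size):
        self.cell_size = cell_size
        self.cells = defaultdict(list)  # (col, row) -> keys of boxes touching that cell
        self.boxes = {}                 # key -> (min_x, min_y, max_x, max_y)

    def _cell(self, x, y):
        # Map a coordinate to the grid cell that holds it
        return (math.floor(x / self.cell_size), math.floor(y / self.cell_size))

    def insert(self, key, box):
        min_x, min_y, max_x, max_y = box
        self.boxes[key] = box
        c0, r0 = self._cell(min_x, min_y)
        c1, r1 = self._cell(max_x, max_y)
        # Register the box in every cell it overlaps
        for col in range(c0, c1 + 1):
            for row in range(r0, r1 + 1):
                self.cells[(col, row)].append(key)

    def containing(self, x, y):
        # Only boxes registered in the point's cell are tested,
        # instead of scanning every box in the data set
        hits = []
        for key in self.cells.get(self._cell(x, y), []):
            min_x, min_y, max_x, max_y = self.boxes[key]
            if min_x <= x <= max_x and min_y <= y <= max_y:
                hits.append(key)
        return hits


index = GridIndex(cell_size=1.0)
index.insert("A", (0.0, 0.0, 2.5, 1.5))
index.insert("B", (3.0, 3.0, 4.0, 4.0))
index.insert("C", (2.0, 1.0, 3.5, 3.5))

result = index.containing(2.2, 1.2)
print(result)
```

With real polygons, the candidates returned by the grid lookup would still need an exact geometric test; the index only narrows down which geometries have to be checked.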

This notebook runs on Python 3.6 with Spark.


## Table of Contents


1.	[Create a spatial context](#createContext)
2.	[Read sample data](#readData)
3.	[Create a spatial index](#createIndex)
4.	[Query the spatial index](#queryIndex)
5.	[Summary](#summary)


<a id="createContext"></a>
## Create a spatial context

Before you can start using the spatial indexing functions in the spatio-temporal library in your notebook, you must register `STContext` to access the `st` functions.

To register `STContext`:

In [1]:
from pyst import STContext
# Register STContext, which is the main entry point
stc = STContext(spark.sparkContext._gateway)

Waiting for a Spark session to start...
Spark Initialization Done! ApplicationId = app-20200115163408-0000
KERNEL_ID = 97fb7239-7b44-417a-bb69-e27ccad0614a


<a id="readData"></a>
## Read sample data

In this notebook, you will use sample data listing US county boundaries.  

You will use the `geojson_reader` function to read a GeoJSON file directly from a publicly accessible URL into a pandas DataFrame.

In [2]:
county_df = stc.geojson_reader().read('https://api.dataplatform.cloud.ibm.com/v2/gallery-assets/entries/1ec43d48a694c6c1d052ddca4d68bdc4/data?accessKey=1ec43d48a694c6c1d052ddca4d69054c')
county_df.head(3)

Unnamed: 0,AFFGEOID,ALAND,AWATER,COUNTYFP,COUNTYNS,GEOID,LSAD,NAME,STATEFP,geometry
0,0500000US37017,2265887723,33010866,17,1026336,37017,6,Bladen,37,MultiPolygon(Polygon: Boundary: Ring(LineSegme...
1,0500000US37167,1023370459,25242751,167,1025844,37167,6,Stanly,37,MultiPolygon(Polygon: Boundary: Ring(LineSegme...
2,0500000US39153,1069181981,18958267,153,1074088,39153,6,Summit,39,MultiPolygon(Polygon: Boundary: Ring(LineSegme...


<a id="createIndex"></a>
## Create a spatial index

There are several options available to choose from for spatial indexing, which include **grid_index**, **r_star_tree_index** and **tessellation_index**. Click the following links for a quick introduction about each of these indexes:
1. [Grid index](https://en.wikipedia.org/wiki/Grid_(spatial_index))
2. [R* tree index](https://en.wikipedia.org/wiki/R-tree)
3. [Tessellation index](https://www.tandfonline.com/doi/full/10.1080/10095020.2016.1146440) 

**Note**: This tessellation index is not a standard tessellation index. Instead, it is similar to a grid index, except that the grid is uniform with respect to the _size_ (in meters) of its cells, whereas a typical grid index is uniform with respect to the number of latitude/longitude divisions.
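The practical difference shows up in a little arithmetic: a cell that is one degree of longitude wide covers less and less ground as you move away from the equator, whereas a cell defined in meters keeps its size. The sketch below assumes a spherical Earth with a mean radius of 6,371,008.8 m; the helper `lon_degree_width_m` is hypothetical and not part of the spatio-temporal library.

```python
import math

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius; spherical approximation (assumption)


def lon_degree_width_m(lat_deg):
    """Approximate east-west width in meters of one degree of longitude at a latitude."""
    return math.radians(1.0) * EARTH_RADIUS_M * math.cos(math.radians(lat_deg))


# Compare the width of a one-degree cell at three latitudes
for lat in (0, 41, 70):
    width_km = lon_degree_width_m(lat) / 1000
    print(f"latitude {lat:>2}: 1 degree of longitude is about {width_km:.1f} km")
```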

In this notebook, you will use the tessellation index. To create a tessellation spatial index, you need to set the following two parameters:

- **Bounding box**: Defines the boundary of the spatial index. If you know exactly where your geometries are and can define a boundary that contains all of them, provide this bounding box to the method, because it reduces the number of tiles that need to be created and thus improves performance. However, if you don't know much about the geometries, or you want to play it safe and not exclude any geometry that might fall outside the given bounding box (both are very common situations), you can use the whole earth as the bounding box by leaving the `bbox` parameter set to `None`, which is the default value.
- **Tile size**: The size of a tile in the tessellation index, given as the length of the tile in meters. For better performance, provide a tile size that is close to the size of your geometries. For example, if your geometries are about $100\ \mathrm{km}^2$ (that is, $10^8\ \mathrm{m}^2$), then $10^4\ \mathrm{m}$ could be a good value for the tile size.
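One way to read the tile-size rule of thumb is to take the square root of a typical geometry area. The sketch below does this with the median area; the helper `suggest_tile_size` is hypothetical and not part of the library, and the sample values are the `ALAND` figures of the three counties shown earlier, assumed here to be land areas in square meters.

```python
import math
import statistics


def suggest_tile_size(areas_m2):
    """Return the side length in meters of a square whose area is the median area."""
    return math.sqrt(statistics.median(areas_m2))


# The example from the text: geometries of about 100 km^2 (10^8 m^2)
print(suggest_tile_size([1e8]))  # 10000.0

# ALAND values of the three sample rows (assumed to be square meters)
sample_aland = [2265887723, 1023370459, 1069181981]
print(round(suggest_tile_size(sample_aland)))
```

This is only a starting point. County sizes vary widely across the full data set, so a larger or smaller tile size may perform better in practice.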

After deciding on the bounding box and tile size, you can create the spatial index and import your geometries into it. For this you will use the `from_df` method, which puts the geometries in a pandas DataFrame into the spatial index. You only need to specify the name of the geometry ID column and the name of the geometry column. The additional `verbosity` parameter, which controls how much of the processing log is printed, can be set to `error` so that only summary and failure entries are displayed.

In [3]:
tile_size = 100000  # in meters
# Leave bbox as None to use the whole earth as the bounding box
si = stc.tessellation_index(tile_size=tile_size)
# Populate the spatial index, using NAME as the geometry ID column
si.from_df(county_df, 'NAME', 'geometry', verbosity='error')

3220 entries processed, 3220 entries successfully added


<a id="queryIndex"></a>
## Query the spatial index

You can use the following APIs to query the spatial index:
- **contained_in**
- **contained_in_with_info**
- **containing**
- **containing_with_info**
- **intersects**
- **intersects_with_info**
- **within_distance** 
- **within_distance_with_info**
- **nearest_neighbors**
- **nearest_neighbors_with_info** 

Here are a few questions that you can answer with these queries.

1. Which county does White Plains Hospital belong to? In other words, which county polygon contains the point location for White Plains Hospital?

In [4]:
white_plains_hospital = stc.point(41.026132, -73.769585)
si.containing(white_plains_hospital)

['Westchester']

2. Which county does the city of White Plains belong to? In other words, which county polygon contains the polygon of White Plains?

In [5]:
white_plains_WKT = 'POLYGON((-73.792 41.024,-73.794 41.031,-73.779 41.046,-73.78 41.049,-73.779 41.052,-73.776 41.054,-73.775 41.057,-73.767 41.058,-73.769 41.062,-73.768 41.067,-73.762 41.073,-73.759 41.074,-73.748 41.069,-73.746 41.056,-73.742 41.056,-73.74 41.053,-73.74 41.049,-73.749 41.04,-73.748 41.035,-73.739 41.034,-73.729 41.029,-73.725 41.025,-73.72 41.016,-73.717 41.015,-73.716 41.006,-73.718 41.002,-73.732 40.988,-73.732 40.985,-73.739 40.979,-73.745 40.978,-73.749 40.981,-73.749 40.986,-73.751 40.986,-73.756 40.991,-73.759 40.991,-73.76 40.993,-73.765 40.994,-73.769 40.997,-73.774 41.002,-73.775 41.006,-73.788 41.018,-73.792 41.024))'
wkt_reader = stc.wkt_reader()
white_plains = wkt_reader.read(white_plains_WKT)
si.containing(white_plains)

['Westchester']

3. Which are the 3 nearest counties to White Plains Hospital? 

In [6]:
counties = si.nearest_neighbors(white_plains_hospital, 3)
counties

['Westchester', 'Fairfield', 'Rockland']

4. Which are the counties within 20 km of White Plains Hospital? 

In [7]:
counties = si.within_distance(white_plains_hospital, 20000)
counties

['Fairfield', 'Nassau', 'Bergen', 'Westchester', 'Rockland', 'Bronx']
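The distance argument in the query above is given in meters. For a feel of what 20 km means on the map, the following sketch computes the great-circle distance between two points with the haversine formula. It is not how the library evaluates `within_distance`, which works against county polygons rather than single points; the helper `haversine_m`, the spherical Earth radius, and the second point are assumptions made for illustration.

```python
import math

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius; spherical approximation (assumption)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


hospital = (41.026132, -73.769585)         # coordinates used in the notebook
nearby = (hospital[0] + 0.1, hospital[1])  # hypothetical point 0.1 degrees further north

distance = haversine_m(*hospital, *nearby)
print(f"{distance:.0f} m, within 20 km: {distance <= 20000}")
```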

<a id="summary"></a>
## Summary

This notebook showed you a simple spatial indexing example. You learnt how to create a spatial context, read some sample geometric data, create an index, and then query the index. 

### Author

**Linsong Chu**, Research Engineer at IBM Research

Copyright © 2019 IBM. This notebook and its source code are released under the terms of the MIT License.
