## Introduction to Spatial Analysis

Learning objective 1: acquire a basic understanding of
1. Spatial data - eg, what constitutes spatial data? What are examples and types?
2. Spatial data projections / coordinate reference systems
2. Using spatial data - simple approaches to seeing what your data includes (tie back into overall day’s goals of [a] understanding our data and [b] getting more comfortable with Python)
3. Spatial computation  
  * Spatial join (point in polygon)
  * Spatial aggregation (polygon -> polygon, + regionalization? maybe a note, revisit on Privacy day?)
3. Spatial analysis  
  * Spatial Autocorrelation
  * ...

## 10 min framing of spatial thinking
- posing a question
- working through analyses, revisiting spatial analyses throughout
- ESDA + spatial modeling


* from Geographic Information Systems (GIS) to GIScience

## Table of Contents
1. Overview of [spatial data types](#Spatial-data-types)
2. [Datasets](#Datasets) in this Notebook
2. [Choropleth maps](#Choropleths)
2. [Coordinate Reference Systems (aka projections)](#Coordinate-Reference-Systems)
2. [Exploratory Spatial Data Analysis](#Exploratory-spatial-data-analysis)

### Spatial data types

- Back to [Table of Contents](#Table-of-Contents)

There are two generic spatial data types:
1. **Vector** - discrete data (usually), represented by points, lines, and polygons
2. **Raster** - continuous data (usually), generally represented as square pixels where each pixel (or "grid cell") has some value. Examples of raster data - link to "big data"
  * Imagery data (satellite, Google Street View, traffic cameras, Placemeter)
  * Surface data (collected at monitoring stations then interpolated to a 'surface' - eg Array of Things, weather data)
  
However, raster data is used in relatively few social science contexts, so the image below (courtesy of [Data Science for Social Good](https://github.com/geebioso/postgis-workshop/blob/master/tutorial.org)) is probably sufficient discussion of rasters for now:
![raster](../../data/sample_data/raster_example.png)

> Notice the pesky _"usually"_ next to both vector and raster datatypes? Technically any data **_could_** be represented as either vector or raster, but it would be computationally inefficient to create a raster layer of rivers or roads across Illinois because 
1. All the non-road and non-river locations would still have some value and 
2. You would have to pick a cell size which may not well represent the actual course of a given river (as opposed to a vector - line - that follows a path and could have some value for width of the river or road)
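
To make the note above concrete, here is a minimal sketch in plain Python (no GIS libraries). The river coordinates, the grid size, and the helper `rasterize_points` are all hypothetical and chosen only for illustration. The helper marks only the cells that contain a vertex, so it is much cruder than a real rasterization routine.

```python
# Hypothetical river stored as a vector line: a short list of (x, y) vertices.
river_line = [(0.5, 0.5), (1.5, 1.5), (2.5, 2.5), (3.5, 3.5)]


def rasterize_points(vertices, n_rows, n_cols, cell_size=1.0):
    """Return a grid with 1 in every cell containing a vertex and 0 elsewhere."""
    grid = [[0] * n_cols for _ in range(n_rows)]
    for x, y in vertices:
        col = int(x // cell_size)
        row = int(y // cell_size)
        # ignore vertices that fall outside the grid
        if 0 <= row < n_rows and 0 <= col < n_cols:
            grid[row][col] = 1
    return grid


river_raster = rasterize_points(river_line, n_rows=4, n_cols=4)

# The vector version stores one entry per vertex; the raster stores one per cell.
n_vector_values = len(river_line)
n_raster_values = sum(len(row) for row in river_raster)
n_empty_cells = sum(row.count(0) for row in river_raster)

for row in river_raster:
    print(row)
print(n_vector_values, "vertices vs", n_raster_values, "cells,",
      n_empty_cells, "of them holding a 'no river' value")
```

Most of the raster cells store a "nothing here" value, and the river's path is only known to the nearest cell, which are the two drawbacks listed above.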





### Datasets

- Back to [Table of Contents](#Table-of-Contents)

Datasets used in this Notebook
1. [Illinois State Prisons](https://www.google.com/maps/d/u/0/viewer?mid=12vPv_cWo8H-exJs_zD5E4HCnyEA&hl=en_US&ll=39.65011658688028%2C-89.16519449999998&z=7) - point dataset of state prisons
2. [US Counties](https://www.census.gov/geo/maps-data/data/tiger-line.html) - polygons of counties from the US Census Bureau's TIGER/Line product

+ Data collection - at what spatial scale were data collected
+ Data have already been aggregated, considerations
  * aggregating to different spatial units could give different results
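
A minimal sketch of the last point, using made-up observations along a line rather than real data. The helper `zone_means` and both sets of zone boundaries are hypothetical, and the sketch assumes every zone contains at least one observation.

```python
# Hypothetical observations: (x position, value) along a line from 0 to 8.
observations = [(0.5, 1), (1.5, 1), (2.5, 9), (3.5, 9),
                (4.5, 1), (5.5, 1), (6.5, 9), (7.5, 9)]


def zone_means(points, boundaries):
    """Average the values that fall in each [left, right) zone."""
    means = []
    for left, right in zip(boundaries[:-1], boundaries[1:]):
        values = [v for x, v in points if left <= x < right]
        means.append(sum(values) / len(values))
    return means


# Two ways of carving the same study area into zones.
zones_a = [0, 2, 4, 6, 8]   # four small zones
zones_b = [0, 4, 8]         # two large zones

means_a = zone_means(observations, zones_a)
means_b = zone_means(observations, zones_b)

print("four zones:", means_a)
print("two zones: ", means_b)
```

The same eight observations look strongly patterned with four zones and completely uniform with two, which is why the spatial unit of aggregation deserves attention before any analysis.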

In [None]:
# location of data
data_dir_1 = '../../data/sample_data/'

In [None]:
# list files (with details) in data_dir_1:
!ls -lh {data_dir_1}

quick description of shapefile
1. required files
2. additional / optional files

Other common format types:
1. GeoJSON
2. KML
3. 
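
As a rough illustration of the GeoJSON format listed above, the sketch below builds a one-feature collection with Python's standard `json` module. The facility name and coordinates are made up; only the overall structure (a `FeatureCollection` of `Feature` objects, with coordinates in longitude, latitude order) reflects the format itself.

```python
import json

# Hypothetical facility; the name and coordinates are invented for illustration.
feature = {
    "type": "Feature",
    "geometry": {
        "type": "Point",
        "coordinates": [-89.0, 40.0],  # GeoJSON order is [longitude, latitude]
    },
    "properties": {"name": "Example facility"},
}

collection = {"type": "FeatureCollection", "features": [feature]}

# Serialise to text (what would be saved in a .geojson file) and read it back.
geojson_text = json.dumps(collection, indent=2)
print(geojson_text)

parsed = json.loads(geojson_text)
lon, lat = parsed["features"][0]["geometry"]["coordinates"]
print("longitude:", lon, "latitude:", lat)
```

Unlike a shapefile, which is spread over several companion files, everything here lives in a single text file.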

In [None]:
## data manipulation libraries ##
# Pandas for generic manipulation
import pandas as pd
# GeoPandas for spatial data manipulation
import geopandas as gpd
# PySAL for spatial statistics
import pysal as ps
# shapely for specific spatial data tasks (GeoPandas uses Shapely objects)
from shapely.geometry import Point, LineString, Polygon

# SQLAlchemy to get some data from the database
from sqlalchemy import create_engine

# improve control of visualizations
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
# read in IL prison locations
il_prisons = gpd.read_file('{}IL_prisons.shp'.format(data_dir_1))

In [None]:
# what info is contained in this file?
il_prisons.info()

In [None]:
il_prisons.head()

In [None]:
# what does a simple map of those locations look like? 
# Note that you can pass matplotlib keywords to (geo)pandas, like figsize
il_prisons.plot(figsize=(6,8));

> There are some dots. Not super useful, but we can see from the longitude (x axis) and latitude (y axis) that they're where we'd expect for Illinois. Let's add IL counties too, to give some context.

In [None]:
# we'll revisit the code below in depth when we talk about databases so don't worry about it for now

# create DB connection
engine = create_engine("postgresql:///df_spatial")
# create SQL query - limit to just IL by using the state FIPS code of 17
sql = "SELECT * FROM tl_2016_us_county WHERE statefp = '17';"
# get data from DB
il_counties = gpd.read_postgis(sql, engine, geom_col='geom', index_col='gid')

# see info of new geodataframe
il_counties.info()

In [None]:
# plot counties - all defaults, adds colors essentially randomly based on order of shapes in the dataframe
il_counties.plot()

In [None]:
# now let's try putting the prison locations on top of counties, and let's make counties grey
# note we assign counties to the object "ax" so we can overlay the prisons on the same "matplotlib axis"

# create a map of IL counties colored grey
ax = il_counties.plot(color='grey', figsize=(6,8))

# use the same "ax" object to plot the prisons on top of the county map, 
# plus resize the markers and remove their outlines
il_prisons.plot(ax=ax, markersize=10, markeredgewidth=0); 
# pro tip: adding this semi-colon at the end stops Jupyter from printing out the "<matplotlib.axes....>" line

### Choropleths

- Back to [Table of Contents](#Table-of-Contents)

Choropleths are super useful because they can quickly convey how values compare across the study area. Let's start with a simple example of the land area of each county. (Note much of the code below comes from Sergio and Dani's [Geovisualization](http://darribas.org/gds_scipy16/ipynb_md/02_geovisualization.html) notebook)

+ update the below to map #ex-offenders by zipcode, by % of population

In [None]:
# we'll create our matplotlib figure and axis first so we have more control over it later
f, ax = plt.subplots(figsize=(6,8))

# we'll pass geopandas the column, scheme (calculation method), number of groups to calculate (k)
# colormap to use, linewidth (to make the edges less noticeable), and the axis object created above
il_counties.plot(column='aland', scheme='QUANTILES', k=10, cmap='OrRd', linewidth=0.1, ax=ax)

# and this time we'll turn off the latitude/longitude axes
ax.set_axis_off();

> As you can see, GeoPandas only allows "quantiles" (or any other [scheme supported by PySAL](http://pysal.readthedocs.io/en/latest/library/esda/mapclassify.html)) to use between 2 and 9 classes; if you try something different, it resets to 5.

So here is how you can use more categories for your choropleths: create a new column with the appropriate PySAL function and map that, as follows.

In [None]:
# let's try the 'Fisher_Jenks' scheme:
fj10 = ps.Fisher_Jenks(il_counties.aland,k=10)

# the ps.<scheme> function returns two things, the bins used for the cutoffs:
print('bins:')
print(fj10.bins)
# and the assigned bin number to use:
print('\nbin number:')
print(fj10.yb)
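
To show what the two printed attributes mean, here is a minimal sketch of a much simpler classifier in plain Python. It uses an equal-count (quantile-style) rule, not the Fisher-Jenks algorithm and not PySAL's implementation. The helpers `quantile_bins` and `assign_bins` and the `areas` values are hypothetical, and the sketch assumes there are at least `k` values.

```python
import bisect


def quantile_bins(values, k):
    """Return k upper bin edges that split the sorted values into equal-count groups."""
    ordered = sorted(values)
    n = len(ordered)
    # the upper edge of bin i is the last value in the i-th of k equal slices
    return [ordered[(n * (i + 1)) // k - 1] for i in range(k)]


def assign_bins(values, bins):
    """Return, for each value, the index of the first bin whose upper edge is >= the value."""
    return [bisect.bisect_left(bins, v) for v in values]


areas = [12, 7, 3, 20, 15, 9, 1, 18]  # hypothetical values, not real county areas

bins = quantile_bins(areas, k=4)   # cutoffs, playing the role of fj10.bins
yb = assign_bins(areas, bins)      # bin number per observation, like fj10.yb

print("bins:", bins)
print("bin number:", yb)
```

The `bins` list holds the upper cutoff of each class, and `yb` holds one class number per observation in the original order, which is what makes it usable as a new column to map.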

In [None]:
# now we can use the new categories to color the choropleth of land area into 10 buckets
# notice the couple new keywords we include

# again we'll create the matplotlib figure and axis objects first
f, ax = plt.subplots(figsize=(6, 8))

# then create our choropleth, the "assign" function adds our Fisher Jenks buckets as a new column to map
# the 'categorical' keyword tells GeoPandas to treat each bucket number as its own category
il_counties.assign(cl=fj10.yb).plot(column='cl', categorical=True, \
        k=10, cmap='OrRd', linewidth=0.1, ax=ax, \
        edgecolor='white', legend=True)
# turn off the latitude/longitude axes
ax.set_axis_off();

**Placeholder for choropleths** of ex-offenders. Choropleths will be
1. Total ex-offenders by zipcode for a certain year -> would require aggregating by zipcode
2. Rate of ex-offenders as compared to the population 
  * and compared to _working_ population -> question: "normalize rates with small denominators"?
3. Change in ex-offenders by county from one year to the next (or 5 year chunk?)

### Coordinate Reference Systems

- Back to [Table of Contents](#Table-of-Contents)

Coordinate Reference Systems (aka projections) are basically math that (1) describes how information in a given dataset relates to the rest of the world and (2) usually creates a 'flat' surface on which data can be analyzed using more common algorithms (eg Euclidean geometry). 

>Why do we care?
1. Distance / area measurements
2. Spatial join - won't work with different CRS


As an example of point 1, consider the distance between two points: Euclidean distance (aka the Pythagorean theorem) provides an easy way to calculate distance so long as we know the difference in **_x_** (longitude) and **_y_** (latitude) between two points:
$$\text{Distance} = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$$

This works fine on **_correctly projected_** data, but **_does not work_** on unprojected data. For one thing the result would be in degrees and degrees are a different distance apart (in terms of meters or miles) at different points on the Earth's surface.
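
A minimal sketch of that problem, using only the standard library. It compares the Euclidean formula applied directly to longitude/latitude degrees with the haversine great-circle distance. The haversine formula is swapped in here purely for comparison, it assumes a spherical Earth with a mean radius of 6371 km, and the point coordinates are arbitrary.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; a spherical approximation


def euclidean(x1, y1, x2, y2):
    """Straight-line distance in whatever units the coordinates are in."""
    return math.sqrt((x1 - x2) ** 2 + (y1 - y2) ** 2)


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometres between two lon/lat points given in degrees."""
    lon1, lat1, lon2, lat2 = map(math.radians, (lon1, lat1, lon2, lat2))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Two pairs of points, each one degree of longitude apart, at different latitudes.
deg_equator = euclidean(-90, 0, -89, 0)
deg_north = euclidean(-90, 60, -89, 60)
km_equator = haversine_km(-90, 0, -89, 0)
km_north = haversine_km(-90, 60, -89, 60)

print("Euclidean on degrees:", deg_equator, "vs", deg_north)
print("Great-circle km:     ", round(km_equator, 1), "vs", round(km_north, 1))
```

Both pairs are exactly one degree apart according to the Euclidean formula, yet the pair at 60 degrees north is only about half as far apart on the ground as the pair on the equator.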

All this is to say: if you do a calculation with geographic data and the numbers don't make sense, check the projection. Let's do an example with the IL county areas.

In [None]:
# print out the CRS of IL counties:
print(il_counties.crs)

So first, it needs to be set (I'd guess GeoPandas will appropriately set from a database in the future). If we look it up in the database we'll see that it's WGS84 (World Geodetic System 1984), which has the [EPSG](www.epsg.org) (European Petroleum Survey Group) code of 4326.

In [None]:
# set the counties crs to 4326
il_counties.crs = {'init': u'epsg:4326'}

# print it out
print(il_counties.crs)

In [None]:
# let's check out the area calculated using GeoPandas with WGS84
il_counties['area_wgs84'] = il_counties.geom.area

In [None]:
# view the first 5 records' aland and calculated area with WGS84:
il_counties.loc[:,('aland', 'area_wgs84')].head()

Clearly not the same. We can look for other projections at a number of websites, but a good one is [epsg.io](www.epsg.io). Let's use the US National Atlas Equal Area projection (epsg=2163), which is a meter-based equal area projection.

In [None]:
# transform aka re-project the data (use the "inplace=True" flag to perform the operation on this GeoDataFrame)
il_counties.to_crs(epsg=2163, inplace=True)

# print out the CRS to see it worked
print(il_counties.crs)

In [None]:
# and let's calculate the area with the new CRS
il_counties['area_2163'] = il_counties.geom.area

# and again check the head() of the data, with all 3 area columns:
il_counties.loc[:,('aland', 'area_wgs84', 'area_2163')].head()

In [None]:
# let's check if those small differences are just because we're only looking at land area, not full county area:
il_counties['total_area'] = il_counties.aland + il_counties.awater

# and recheck areas against total:
il_counties.loc[:,('total_area', 'area_wgs84', 'area_2163')].head()

> There are still some differences between our newly calculated area ('area_2163') and the total area that came in the data ('aland' + 'awater'), however we can see it's much closer than the wgs84 version. These small differences most likely mean that the area from Census was calculated using a different Coordinate Reference System.
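
To see why the WGS84 area was so far off, here is a minimal sketch in plain Python. It computes a planar (shoelace) area for a hypothetical one-degree box near Illinois' latitude, first in raw degrees and then after a crude scaling to kilometres. The scaling is a rough equirectangular approximation, not a real projection such as EPSG:2163, and the constant `KM_PER_DEGREE_LAT` and the helper names are assumptions made for this example.

```python
import math

KM_PER_DEGREE_LAT = 111.19  # approximate length of one degree of latitude


def shoelace_area(vertices):
    """Planar polygon area from (x, y) vertices, in squared coordinate units."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(vertices, vertices[1:] + vertices[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def to_local_km(vertices, ref_lat):
    """Crudely scale lon/lat degrees to kilometres around the latitude ref_lat."""
    scale_x = KM_PER_DEGREE_LAT * math.cos(math.radians(ref_lat))
    return [(lon * scale_x, lat * KM_PER_DEGREE_LAT) for lon, lat in vertices]


# Hypothetical 1 x 1 degree box centred on 40 degrees north.
box_degrees = [(-90.0, 39.5), (-89.0, 39.5), (-89.0, 40.5), (-90.0, 40.5)]

area_degrees = shoelace_area(box_degrees)                       # "square degrees"
area_km2 = shoelace_area(to_local_km(box_degrees, ref_lat=40.0))  # rough km^2

print("area in square degrees:", area_degrees)
print("approximate area in km^2:", round(area_km2))
```

The "square degrees" figure is a small number with no physical meaning, which mirrors the 'area_wgs84' column, while the scaled figure is in the thousands of square kilometres.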

### Exploratory spatial data analysis

- Back to [Table of Contents](#Table-of-Contents)

The below code is sourced mostly from Sergio and Dani's notebook on [Spatial Exploratory Data Analysis](http://darribas.org/gds_scipy16/ipynb_md/04_esda.html)

We will consider both the global and local spatial autocorrelation of where ex-offenders locate when leaving IL prisons

First some explanations.

**Global spatial autocorrelation** (maybe skip this b/c limited time)


**Local spatial autocorrelation**
- Getis-Ord G* -> include difference between visualizing rate and cluster results, hotspots

Is there correlation of variables across space - ex-offenders, housing, wages 
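
As a placeholder illustration of global spatial autocorrelation, here is a minimal sketch of Moran's I in plain Python. Moran's I is one common global statistic and is used here as an assumption, since the notes above do not yet name one. The values are made up, and the weights treat six locations as a simple chain in which each location neighbours the one before and after it.

```python
def chain_weights(n):
    """Binary weights matrix for n locations in a row: neighbours are adjacent positions."""
    w = [[0] * n for _ in range(n)]
    for i in range(n - 1):
        w[i][i + 1] = 1
        w[i + 1][i] = 1
    return w


def morans_i(values, weights):
    """Moran's I: (n / S0) * sum_ij(w_ij * z_i * z_j) / sum_i(z_i ** 2)."""
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]                      # deviations from the mean
    s0 = sum(sum(row) for row in weights)               # sum of all weights
    cross = sum(weights[i][j] * z[i] * z[j]
                for i in range(n) for j in range(n))    # neighbour cross-products
    return (n / s0) * cross / sum(zi ** 2 for zi in z)


w = chain_weights(6)

# Hypothetical values: similar values side by side vs. values that alternate.
i_clustered = morans_i([1, 1, 1, 5, 5, 5], w)
i_alternating = morans_i([1, 5, 1, 5, 1, 5], w)

print("clustered:  ", round(i_clustered, 3))
print("alternating:", round(i_alternating, 3))
```

A positive value indicates that neighbours tend to have similar values, and a negative value indicates that neighbours tend to differ. With real data, the weights would come from the county or zipcode geometry rather than a chain.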