<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Spatial Data Lab

_Authors: Matt Brems (DC)_

## NYC Data Component
You should consult the [Geopandas Practice Notebook](geopandas-practice.ipynb) before diving into this lab.

In that notebook, you're introduced to the `GeoDataFrame` object from `geopandas`. A `GeoDataFrame` is just like a `DataFrame`, except that it contains a `geometry` column that identifies each row as an object in space. A row can represent a point in space (in which case the `geometry` column contains `Point`s) or an area (in which case the `geometry` column contains `Polygon`s). A `GeoDataFrame` can have more than one column of spatial information, but only one column at a time can serve as the active geometry of each observation.

Here, we'll practice some of the same functionality and concepts.

In [1]:
# basic stuff
import os
import pandas as pd
import numpy as np
from datetime import datetime
from urllib.request import urlretrieve
from zipfile import ZipFile
import pysal

# geo stuff
import geopandas as gpd
from shapely.geometry import Point
# from ipyleaflet import (Map,
#     Marker,
#     TileLayer, ImageOverlay,
#     Polyline, Polygon, Rectangle, Circle, CircleMarker,
#     GeoJSON,
#     DrawControl
# )

# plotting stuff
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('bmh')
plt.rcParams['figure.figsize'] = (10.0, 10.0)

# widget stuff
from ipywidgets import interact, HTML, FloatSlider
from IPython.display import clear_output, display

# progress stuff
from tqdm import tqdm_notebook, tqdm_pandas

# turn warnings off
import warnings
warnings.filterwarnings('ignore')

In [None]:
# from the Geopandas practice notebook:

def get_nyc_shape_file(url, filename):

    # download file
    zipped = filename + '.zip'
    urlretrieve(url, zipped)  # use the url argument rather than a hard-coded address
    zipped = os.getcwd() + '/' + zipped

    # unzip file
    to_unzip = ZipFile(zipped, 'r')
    unzipped = os.getcwd() + '/' + filename + '_unzipped'
    to_unzip.extractall(unzipped)
    to_unzip.close()

    # get shape file
    for file in os.listdir(unzipped):
        if file.endswith(".shp"):
            shape_file = unzipped + '/' + file

    # return full file path
    return shape_file

# get shape file path
shape_file_url = 'https://data.cityofnewyork.us/api/geospatial/tqmj-j8zm?method=export&format=Shapefile'
shape_file_dir = 'nyc_boroughs'
file_path = get_nyc_shape_file(shape_file_url,shape_file_dir)

# read and view GeoDataFrame
gdf = gpd.GeoDataFrame.from_file(file_path)
gdf.head()

#### To begin, return a `Series` containing the area of each NYC borough.

Does it match the area we are given? What units do you think these columns are in?

You will want to consult [the Geopandas docs](http://geopandas.org/reference.html) to familiarize yourself with the special attributes and methods of `GeoSeries` and `GeoDataFrame` objects.

In [2]:
area = gdf.area

#### Add a new column to the dataset containing the centroid of each borough.

What type of object is this? What type of object does it contain?
Can we make this the `geometry` column for this dataset?

#### Now, plot the NYC boroughs, the convex hull for each borough, and the envelope for each borough.

Hint: You can call `.plot` on a `GeoDataFrame`.

#### Bonus: Plot the centroid of each borough on the map of each borough

#### Generate 10,000 samples uniformly across the NYC map. 

Note, you're generating both a random X and a random Y in order to get a location on the NYC map, much like how you might estimate $\pi$ using Monte Carlo simulations.
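
As a reminder of the Monte Carlo idea mentioned above, here is a minimal sketch that estimates $\pi$ from uniform random (X, Y) pairs. The function name `estimate_pi` and the fixed seed are illustrative assumptions, not part of the lab. It uses only the standard library so that it runs without `geopandas`.

```python
import random

def estimate_pi(n_samples, seed=0):
    """Estimate pi by sampling points uniformly in the unit square."""
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatable runs
    inside = 0
    for _ in range(n_samples):
        x = rng.random()  # random X in [0, 1)
        y = rng.random()  # random Y in [0, 1)
        # count the points that land inside the quarter circle of radius 1
        if x * x + y * y <= 1.0:
            inside += 1
    # (quarter-circle area) / (square area) = pi / 4
    return 4.0 * inside / n_samples

pi_hat = estimate_pi(10_000)
print(pi_hat)
```

For the NYC map, the same pattern applies, except that X and Y are drawn between the bounds of the map rather than between 0 and 1.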

Plot these points over the map of NYC.

#### Place points within boroughs
A common geospatial task is to check whether a given point lies inside or outside of a certain area. In order to ease that calculation, convex hulls and envelopes are often used as approximations of the true shape of geographical areas.
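
To make the trade-off concrete, the sketch below compares an exact point-in-polygon test with an envelope (bounding box) test on a toy shape. It is written in plain Python for illustration only: the L-shaped polygon, the point names, and the helper functions are all hypothetical, and the ray-casting test shown here is a stand-in for the idea, not necessarily how `shapely` or Geopandas implement `.contains`.

```python
def envelope(polygon):
    """Return (minx, miny, maxx, maxy) for a list of (x, y) vertices."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)

def in_envelope(point, bounds):
    """Cheap test: four comparisons against the bounding box."""
    x, y = point
    minx, miny, maxx, maxy = bounds
    return minx <= x <= maxx and miny <= y <= maxy

def in_polygon(point, polygon):
    """Exact test (ray casting): count edges crossed by a ray going right."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

# hypothetical L-shaped "borough"
l_shape = [(0, 0), (4, 0), (4, 1), (1, 1), (1, 4), (0, 4)]
bounds = envelope(l_shape)

p_inside = (0.5, 3.0)  # inside the vertical arm of the L
p_notch = (3.0, 3.0)   # inside the envelope, but in the empty notch of the L

print(in_polygon(p_inside, l_shape), in_envelope(p_inside, bounds))
print(in_polygon(p_notch, l_shape), in_envelope(p_notch, bounds))
```

The envelope test does a fixed amount of work per point, while the exact test loops over every edge of the shape. The price is that the envelope accepts points, such as `p_notch`, that the true shape rejects.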

In this part, we'll check which borough (if any) each of our simulated points falls into, in three ways:

- Whether or not each sample falls in the true geographic boroughs.
- Whether or not each sample falls in the convex hulls of the boroughs.
- Whether or not each sample falls in the envelopes of the boroughs.

We'll need to use the `Point` object that we imported from `shapely` and the `.contains` method from Geopandas.

At each step, use the `%%timeit` [magic command](https://ipython.readthedocs.io/en/stable/interactive/magics.html) to measure how long this process takes.

Report these numbers, as well as how much more efficient (percentage-wise) envelopes and convex hulls are relative to the true geographies.
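
One way to express "how much more efficient" as a percentage is the relative reduction in run time. The sketch below assumes that definition; the function name and the timings are hypothetical placeholders, not measured results.

```python
def percent_reduction(baseline_seconds, approx_seconds):
    """Percentage of the baseline run time saved by the approximation."""
    return 100.0 * (baseline_seconds - approx_seconds) / baseline_seconds

# hypothetical timings (NOT measurements): 2.0 s for the true geographies,
# 0.5 s for the envelopes
example = percent_reduction(2.0, 0.5)
print(example)
```

Substitute the times reported by `%%timeit` for the placeholder values.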

#### Generate metrics.  Summarize findings.

Obviously there's a trade-off here. Check how many samples lie in the actual geographies, the convex hulls, and the envelopes.

Report the following:

- A confusion matrix comparing convex hulls and actual geographies. (i.e. actual geographies are the true counts; convex hulls are predicted counts)
- A confusion matrix comparing envelopes and actual geographies.
- The accuracy and sensitivity from each of the confusion matrices above. You should report a sensitivity value for each borough.
- A paragraph summarizing your findings.
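
As a reminder of how these metrics are defined, here is a minimal sketch in plain Python. The label lists are hypothetical toy data (with `"None"` standing for "not in any borough"), and the helper names are illustrative assumptions. Accuracy is the share of samples whose predicted label matches the actual one. Sensitivity for a borough is the share of samples actually in that borough that were also predicted to be in it.

```python
from collections import Counter

def confusion_counts(actual, predicted):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(zip(actual, predicted))

def accuracy(actual, predicted):
    """Fraction of samples where the predicted label equals the actual label."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)

def sensitivity(actual, predicted, label):
    """True positives for `label` divided by all samples actually in `label`."""
    actual_pos = sum(a == label for a in actual)
    true_pos = sum(a == label and p == label for a, p in zip(actual, predicted))
    return true_pos / actual_pos

# hypothetical labels: "actual" from the true geographies,
# "predicted" from an approximation such as the envelopes
actual    = ["Bronx", "Bronx", "Queens", "Queens", "None", "None"]
predicted = ["Bronx", "Bronx", "Queens", "Bronx", "Queens", "None"]

matrix = confusion_counts(actual, predicted)
print(matrix[("Queens", "Bronx")])            # actual Queens, predicted Bronx
print(accuracy(actual, predicted))
print(sensitivity(actual, predicted, "Bronx"))
print(sensitivity(actual, predicted, "Queens"))
```

This sketch assumes each sample gets exactly one predicted label. With convex hulls or envelopes a point can fall inside more than one shape, so you will need to decide how to handle that case.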

#### Perform a spatial join using your simulated data

You should consult the [Geopandas docs](http://geopandas.readthedocs.io/en/latest/reference/geopandas.sjoin.html).

Hint: You must use two `GeoDataFrame`s.

Hint: Use `crs={'init': 'epsg:4326'}`. Recent versions of Geopandas prefer the equivalent `crs='EPSG:4326'`.

##### First, use `sjoin` to label each simulated point according to its corresponding borough
This should give the same results as above, when you used `.contains` to check and see which borough each point belonged to.

##### Bonus: Use `sjoin` to count the number of points in each borough.

#### Generate a map of NYC with each borough shaded based on the number of taxi pick-ups that occur in each borough.

In [None]:
import pandas as pd

In [None]:
## This will take a while! Check out the data dictionary in the meantime:
## http://www.nyc.gov/html/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf

taxi = pd.read_csv("https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2015-09.csv")

In [None]:
taxi.head()

#### Suppose we want to forecast the number of pick-ups by borough. Would this process be described as areal, geostatistical, or point pattern?

#### Bonus: Build a widget that will put dots on the map for the location of each pick-up by date.
Using the exact latitude and longitude will cause multiple dots to overlap; people often use a [random jitter](https://www.dataplusscience.com/TableauJitter.html) to help with this. While not required, consider random jitter as an extra bonus!
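
A minimal sketch of random jitter, assuming a small uniform offset is added independently to each coordinate. The function name, the sample coordinates, and the default `scale` of 0.0005 degrees (roughly tens of meters) are illustrative assumptions, so tune the scale to your map.

```python
import random

def jitter(lat, lon, scale=0.0005, rng=random):
    """Shift a coordinate pair by a uniform random offset in [-scale, scale]."""
    return lat + rng.uniform(-scale, scale), lon + rng.uniform(-scale, scale)

# two pick-ups at the same hypothetical location no longer overlap exactly
rng = random.Random(42)  # seeded generator (an assumption) for repeatable output
first = jitter(40.7580, -73.9855, rng=rng)
second = jitter(40.7580, -73.9855, rng=rng)
print(first, second)
```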

#### In order to predict the precise location of pick-ups, would this process be described as areal, geostatistical, or point pattern?