<img src='../../img/anaconda-logo.png' align='left' style="padding:10px">
<br>
*Copyright Continuum 2012-2016 All Rights Reserved.*

# Bokeh Exercise: NYC Subway Stations

You've found an online resource which provides NYC Subway Station locations as `GeoJSON`.  Since NYC is expanding and adding new stops every day, you want your notebook to dynamically pull in the station information via HTTP instead of loading it from disk.

The objective of the notebook is to find the most dangerous subway stations by intersecting station locations with known crime occurrences. 

This exercise will challenge your skills with Python, pandas, Bokeh, pyproj, and SciPy.

## Table of Contents
* [Bokeh Exercise: NYC Subway Stations](#Bokeh-Exercise:-NYC-Subway-Stations)
	* [Set-Up](#Set-Up)
* [Exercises](#Exercises)
	* [1. Load Data](#1.-Load-Data)
	* [2. Create a Data Source](#2.-Create-a-Data-Source)
	* [3. Create a Figure](#3.-Create-a-Figure)
	* [4.  Create a DataFrame](#4.--Create-a-DataFrame)
		* [Bonus: Save the Data to local JSON file](#Bonus:-Save-the-Data-to-local-JSON-file)
	* [5. Clean the Data](#5.-Clean-the-Data)
	* [6. Load More Data](#6.-Load-More-Data)
	* [7. More Data Cleanup](#7.-More-Data-Cleanup)
	* [8. Reproject Coordinates](#8.-Reproject-Coordinates)
	* [9. Create a KDTree](#9.-Create-a-KDTree)
	* [10. Crime Near Stations](#10.-Crime-Near-Stations)
	* [11. Classify Stations](#11.-Classify-Stations)
	* [12. Prep for Plotting](#12.-Prep-for-Plotting)
	* [13. Create a Figure](#13.-Create-a-Figure)
	* [14. Reflection](#14.-Reflection)


## Set-Up

```python
import pandas as pd
from pandas.io.json import json_normalize

from bokeh.models import Range1d, ColumnDataSource, HoverTool
from bokeh.io import output_notebook, show
from bokeh.plotting import figure
from bokeh.models import GeoJSONDataSource
from bokeh.palettes import Spectral5

from pyproj import transform, Proj

import requests

from scipy.spatial import cKDTree as KDTree

output_notebook()
```

# Exercises

## 1. Load Data

Load the subway station information via HTTP.

- http://s3.amazonaws.com/bokeh_data/subway_stations.geojson
- `requests` is a great Python library for fetching resources, but it is not part of the Python standard library. Use the `requests.get()` function.
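
A minimal sketch of the download step, assuming the URL above is still reachable. The helper name `fetch_stations` and the 10-second timeout are illustrative choices, not part of the exercise.

```python
import requests

STATIONS_URL = "http://s3.amazonaws.com/bokeh_data/subway_stations.geojson"


def fetch_stations(url=STATIONS_URL, timeout=10):
    """Download the station GeoJSON and return the HTTP response."""
    resp = requests.get(url, timeout=timeout)
    # Raise on 4xx/5xx instead of silently parsing an error page.
    resp.raise_for_status()
    return resp
```

The returned response exposes the raw GeoJSON string as `.text` and the parsed dictionary through `.json()`.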

## 2. Create a Data Source

Create a `GeoJSONDataSource` that wraps the GeoJSON text and serves as the source for a `fig.circle` glyph.

## 3. Create a Figure

Create a Bokeh figure which displays the geojson content as circle glyphs.

- set figure `grid.grid_line_alpha` to `0`
- set circle `color` to `cyan`
- set circle `alpha` to `.85`
- set circle `line_color` to `darkblue`
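
One possible shape for steps 2 and 3 together, offered as a sketch rather than the reference solution. The two-station sample string is hypothetical and stands in for the downloaded text, and `size=8` is an arbitrary choice. The sketch uses `fig.scatter` with `marker="circle"` instead of `fig.circle`, because recent Bokeh releases deprecate passing `size` to `fig.circle`; on older releases `fig.circle` accepts the same styling keywords.

```python
import json

from bokeh.models import GeoJSONDataSource
from bokeh.plotting import figure

# Hypothetical two-station sample standing in for the downloaded GeoJSON text.
sample_geojson = json.dumps({
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature",
         "geometry": {"type": "Point", "coordinates": [-73.99, 40.73]},
         "properties": {"name": "Station A"}},
        {"type": "Feature",
         "geometry": {"type": "Point", "coordinates": [-73.95, 40.78]},
         "properties": {"name": "Station B"}},
    ],
})

# Point features are exposed to glyphs as "x" and "y" columns.
source = GeoJSONDataSource(geojson=sample_geojson)

fig = figure(title="NYC Subway Stations")
fig.grid.grid_line_alpha = 0  # hide the grid lines

renderer = fig.scatter(
    x="x", y="y", marker="circle", size=8,
    color="cyan", alpha=0.85, line_color="darkblue",
    source=source,
)
# In a notebook, show(fig) from bokeh.io would render the plot.
```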

## 4.  Create a DataFrame

Create a pandas DataFrame from the subway stations GeoJSON.

- since we are dealing with GeoJSON (a type of JSON), experiment with `pandas.io.json.json_normalize`, imported in Set-Up as `json_normalize`
- hint: if you used the `requests` library, `resp.json()` will return a Python dictionary for you
- hint: use the `features` property of the station GeoJSON
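
A small sketch of the flattening step on a hypothetical two-feature dictionary; the property names `name` and `line` are made up for illustration. It calls `pd.json_normalize`, the top-level spelling available since pandas 1.0; the `json_normalize` imported in Set-Up is the older location of the same function.

```python
import pandas as pd

# Hypothetical sample shaped like the dictionary returned by resp.json().
stations_geojson = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature",
         "geometry": {"type": "Point", "coordinates": [-73.99, 40.73]},
         "properties": {"name": "Station A", "line": "4-5-6"}},
        {"type": "Feature",
         "geometry": {"type": "Point", "coordinates": [-73.95, 40.78]},
         "properties": {"name": "Station B", "line": "Q"}},
    ],
}

# Each feature becomes one row; nested keys are joined with dots,
# e.g. "properties.name" and "geometry.coordinates".
stations = pd.json_normalize(stations_geojson["features"])
print(sorted(stations.columns))
```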

### Bonus: Save the Data to local JSON file

## 5. Clean the Data

Split the `geometry.coordinates` column into separate `lat` and `lon` columns.

- since we are dealing with GeoJSON (a type of JSON), experiment with `pandas.io.json.json_normalize`
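
One way to do the split, sketched on a hypothetical two-row frame. GeoJSON stores each position as `[longitude, latitude]`, so the first element becomes `lon` and the second `lat`.

```python
import pandas as pd

# Hypothetical frame shaped like the flattened station features.
stations = pd.DataFrame({
    "properties.name": ["Station A", "Station B"],
    "geometry.coordinates": [[-73.99, 40.73], [-73.95, 40.78]],
})

# Expand the list-valued column into two numeric columns.
coords = pd.DataFrame(
    stations["geometry.coordinates"].tolist(),
    columns=["lon", "lat"],
    index=stations.index,
)
stations["lon"] = coords["lon"]
stations["lat"] = coords["lat"]
stations = stations.drop(columns="geometry.coordinates")
```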

## 6. Load More Data

Load all rows from the file `../../data/Datashader/nyc_crime.csv` into a `pandas.DataFrame` named `df`, using the pandas `read_csv()` function.
- pandas is great for loading CSV data
- `usecols` lets you load only the columns you need. In this case, load `Offense`, `Location 1`, and `Occurrence Year`.
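
A sketch of the `usecols` call, using an in-memory CSV so that it runs without the data file. The sample rows, the extra `OBJECTID` column, and the `(lat, lon)` text format are assumptions about the real file; in the notebook, the path above would replace `sample_csv`.

```python
import io

import pandas as pd

# Hypothetical stand-in for ../../data/Datashader/nyc_crime.csv.
sample_csv = io.StringIO(
    "OBJECTID,Offense,Occurrence Year,Location 1\n"
    '1,BURGLARY,2012,"(40.73, -73.99)"\n'
    '2,ROBBERY,2003,"(40.78, -73.95)"\n'
)

columns = ["Offense", "Location 1", "Occurrence Year"]
# Columns not listed in usecols (here OBJECTID) are never loaded.
df = pd.read_csv(sample_csv, usecols=columns)
```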

## 7. More Data Cleanup

- clean up the `Location 1` field and create two new columns named `lat` (latitude) and `lon` (longitude)
- convert categorical columns to the `category` dtype
- filter `df` to contain only data from the last 10 years
- decide on the appropriate order for these steps
- clean the station column names by removing the `properties.` prefix
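
The steps above can be sketched as follows, under two assumptions that should be checked against the real data: `Location 1` holds text such as `(40.73, -73.99)` in latitude, longitude order, and "last 10 years" is measured back from the most recent year present. Filtering comes first here so the later steps touch fewer rows.

```python
import pandas as pd

# Hypothetical crime rows; the Location 1 format is an assumption.
df = pd.DataFrame({
    "Offense": ["BURGLARY", "ROBBERY", "BURGLARY"],
    "Occurrence Year": [2015, 2003, 2010],
    "Location 1": ["(40.73, -73.99)", "(40.78, -73.95)", "(40.70, -74.01)"],
})

# 1. Keep only the ten most recent years in the data.
latest = df["Occurrence Year"].max()
df = df[df["Occurrence Year"] > latest - 10].copy()

# 2. Pull the two numbers out of "(lat, lon)" and convert them to floats.
coords = df["Location 1"].str.extract(
    r"\(\s*(-?[\d.]+)\s*,\s*(-?[\d.]+)\s*\)"
).astype(float)
df["lat"] = coords[0]
df["lon"] = coords[1]
df = df.drop(columns="Location 1")

# 3. Store the repeated offense labels as a categorical column.
df["Offense"] = df["Offense"].astype("category")

# 4. Strip the "properties." prefix from the station column names.
stations = pd.DataFrame(
    {"properties.name": ["Station A"], "lat": [40.73], "lon": [-73.99]}
)
stations.columns = [c.replace("properties.", "", 1) for c in stations.columns]
```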

## 8. Reproject Coordinates

Reproject both stations and crimes into `New York State Plane Long Island` (EPSG code 2263).

- this is a good projection for NYC
- units are in feet
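
A sketch of the reprojection using pyproj's `Transformer` class rather than the `transform` and `Proj` names imported in Set-Up, which belong to the older pyproj 1.x interface. It assumes the input coordinates are WGS84 longitude and latitude (EPSG:4326); the single sample row is a hypothetical point near Times Square.

```python
import pandas as pd
from pyproj import Transformer

# Hypothetical station with approximate WGS84 coordinates.
stations = pd.DataFrame(
    {"name": ["Sample station"], "lon": [-73.9855], "lat": [40.7580]}
)

# always_xy=True fixes the order as (lon, lat) in and (x, y) out.
to_state_plane = Transformer.from_crs("EPSG:4326", "EPSG:2263", always_xy=True)

x, y = to_state_plane.transform(stations["lon"].values, stations["lat"].values)
stations["x"] = x  # easting, in feet
stations["y"] = y  # northing, in feet
```

The same transformer object can be reused for the crime coordinates, which avoids rebuilding the projection for each table.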

## 9. Create a KDTree

Create a KDTree using the crime dataframe

- try to avoid max recursion errors by setting an appropriately large `leafsize`
- experiment with how `leafsize` affects the tree build time (`%%time`)
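
A minimal sketch of the tree construction. The random points are a synthetic stand-in for the projected crime coordinates, and `leafsize=64` is an arbitrary starting value to experiment from, not a recommendation.

```python
import numpy as np
from scipy.spatial import cKDTree as KDTree

# Synthetic (x, y) pairs standing in for the projected crime locations.
rng = np.random.default_rng(0)
crime_xy = rng.uniform(0, 10_000, size=(50_000, 2))

# leafsize is the number of points at which the tree stops splitting
# and falls back to brute-force search within a leaf.
tree = KDTree(crime_xy, leafsize=64)
```

In the notebook, placing `%%time` at the top of the cell that builds the tree gives a quick comparison between different `leafsize` values.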

## 10. Crime Near Stations

Calculate how many crimes have occurred within a radius of 100 feet of each subway station.

- use the `KDTree.query_ball_point` method
- add the number of crimes to the subway stations dataframe
- experiment with how the tree's `leafsize` affects query time (`%%time`)
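
A sketch of the radius query on a handful of made-up points, assuming both tables already hold projected coordinates in feet. The column name `crime_count` is an illustrative choice.

```python
import numpy as np
import pandas as pd
from scipy.spatial import cKDTree as KDTree

# Hypothetical projected crime locations, in feet.
crime_xy = np.array([
    [0.0, 0.0], [30.0, 40.0],                        # near Station A
    [500.0, 500.0], [520.0, 510.0], [450.0, 505.0],  # near Station B
    [5000.0, 5000.0],                                # near no station
])
stations = pd.DataFrame({
    "name": ["Station A", "Station B", "Station C"],
    "x": [0.0, 510.0, 9000.0],
    "y": [0.0, 505.0, 9000.0],
})

tree = KDTree(crime_xy)

# For each station, get the indices of all crimes within 100 feet.
neighbors = tree.query_ball_point(stations[["x", "y"]].values, r=100)
stations["crime_count"] = [len(idx) for idx in neighbors]
print(stations)
```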

## 11. Classify Stations

Classify stations into quintiles based on the number of crime incidents.

- `pd.qcut` would be a good choice
- name the quintiles: `very low`, `low`, `moderate`, `high`, and `very high` respectively
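
A sketch of the classification with `pd.qcut` on ten hypothetical counts; the column names `crime_count` and `crime_level` are assumptions carried over for illustration.

```python
import pandas as pd

# Hypothetical incident counts for ten stations.
stations = pd.DataFrame({
    "name": [f"Station {i}" for i in range(10)],
    "crime_count": [0, 1, 2, 3, 5, 8, 13, 21, 34, 55],
})

labels = ["very low", "low", "moderate", "high", "very high"]

# q=5 cuts at the 20th, 40th, 60th, and 80th percentiles.
stations["crime_level"] = pd.qcut(stations["crime_count"], q=5, labels=labels)
print(stations["crime_level"].value_counts())
```

With real data, many stations may share the same count (often zero), in which case `pd.qcut` raises an error about non-unique bin edges. Ranking first, for example `stations["crime_count"].rank(method="first")`, is one common workaround.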

## 12. Prep for Plotting

Add `color` and `point_size` columns to the stations dataframe based on `crime_level`.
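
One way to derive both columns is a dictionary lookup per level, sketched below. The hex colors and pixel sizes are hand-picked placeholders; the `Spectral5` palette imported in Set-Up could be zipped with the level names in the same way.

```python
import pandas as pd

levels = ["very low", "low", "moderate", "high", "very high"]

# Hypothetical stations that have already been classified.
stations = pd.DataFrame({
    "name": ["Station A", "Station B", "Station C"],
    "crime_level": pd.Categorical(
        ["low", "very high", "very low"], categories=levels, ordered=True
    ),
})

# Placeholder styling: cooler colors and smaller points for safer stations.
colors = dict(zip(levels, ["#2b83ba", "#abdda4", "#ffffbf", "#fdae61", "#d7191c"]))
sizes = dict(zip(levels, [4, 6, 8, 10, 12]))

# Map through plain strings so the new columns are ordinary str/int columns.
level_names = stations["crime_level"].astype(str)
stations["color"] = level_names.map(colors)
stations["point_size"] = level_names.map(sizes)
```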

## 13. Create a Figure

Create a figure to visualize subway stations symbolized by crime level

- set figure `background_fill_color` to `black`
- set figure `plot_height` and `plot_width` to 700
- set figure `axis.visible` to `False`
- set station circle `alpha` to `.65`
- set station circle `line_color` to `white`
- set station circle `line_alpha` to `.85`
- use the `color` and `point_size` fields you calculated
- add a hover tool which displays station name and incident count
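
A sketch of the final figure on three hypothetical stations, not the reference solution. It assumes a Bokeh release where the size arguments are spelled `width` and `height`; releases before 3.0 also accept `plot_width` and `plot_height` as written in the exercise. As a substitution, it draws with `fig.scatter` and `marker="circle"` in place of `fig.circle`.

```python
import pandas as pd
from bokeh.models import ColumnDataSource, HoverTool
from bokeh.plotting import figure

# Hypothetical stations table with the columns built in the earlier steps.
stations = pd.DataFrame({
    "name": ["Station A", "Station B", "Station C"],
    "x": [988000.0, 990500.0, 985200.0],
    "y": [215000.0, 218300.0, 211700.0],
    "crime_count": [3, 41, 0],
    "color": ["#abdda4", "#d7191c", "#2b83ba"],
    "point_size": [6, 12, 4],
})
source = ColumnDataSource(stations)

fig = figure(width=700, height=700, background_fill_color="black")
fig.axis.visible = False

# "color" and "point_size" are read per row from the data source.
fig.scatter(
    x="x", y="y", marker="circle",
    size="point_size", color="color",
    alpha=0.65, line_color="white", line_alpha=0.85,
    source=source,
)

# "@name" and "@crime_count" refer to columns in the data source.
hover = HoverTool(tooltips=[("Station", "@name"), ("Incidents", "@crime_count")])
fig.add_tools(hover)
# In a notebook, show(fig) from bokeh.io would render the plot.
```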

## 14. Reflection

List at least two problems with this analysis.

- Subway station incident counts are not normalized by the number of people who pass through each station.
- Quintiles group data into 5 bins of equal count, but not equal bin width. A different classification scheme might give different results.

Did you have a different answer? List it here.
