<img src='../../img/anaconda-logo.png' align='left' style="padding:10px">
<br>
*Copyright Continuum 2012-2016 All Rights Reserved.*

# Bokeh Exercise: NYC Subway Stations

You've found an online resource which provides NYC Subway Station locations as `GeoJSON`.  Since NYC is expanding and adding new stops every day, you want your notebook to dynamically pull in the station information via HTTP instead of loading it from disk.

The objective of the notebook is to find the most dangerous subway stations by intersecting station locations with known crime occurrences. 

This exercise will challenge your skills with Python, Pandas, Bokeh, PyProj, and SciPy.

## Table of Contents
* [Bokeh Exercise: NYC Subway Stations](#Bokeh-Exercise:-NYC-Subway-Stations)
	* [Set-Up](#Set-Up)
* [Solutions](#Solutions)
	* [1. Load Data](#1.-Load-Data)
	* [2. Create a Data Source](#2.-Create-a-Data-Source)
	* [3. Create a Figure](#3.-Create-a-Figure)
	* [4.  Create a DataFrame](#4.--Create-a-DataFrame)
		* [Bonus: Save the Data to local JSON file](#Bonus:-Save-the-Data-to-local-JSON-file)
	* [5. Clean the Data](#5.-Clean-the-Data)
	* [6. Load More Data](#6.-Load-More-Data)
	* [7. More Data Cleanup](#7.-More-Data-Cleanup)
	* [8. Reproject Coordinates](#8.-Reproject-Coordinates)
	* [9. Create a KDTree](#9.-Create-a-KDTree)
	* [10. Crime Near Stations](#10.-Crime-Near-Stations)
	* [11. Classify Stations](#11.-Classify-Stations)
	* [12. Prep for Plotting](#12.-Prep-for-Plotting)
	* [13. Create a Figure](#13.-Create-a-Figure)
	* [14. Reflection](#14.-Reflection)


## Set-Up

In [1]:
import pandas as pd
from pandas.io.json import json_normalize

from bokeh.models import Range1d, ColumnDataSource, HoverTool
from bokeh.io import output_notebook, show
from bokeh.plotting import figure
from bokeh.models import GeoJSONDataSource
from bokeh.palettes import Spectral5

from pyproj import transform, Proj

import requests

from scipy.spatial import cKDTree as KDTree

output_notebook()

# Solutions

## 1. Load Data

Load Subway Station information via HTTP

- http://s3.amazonaws.com/bokeh_data/subway_stations.geojson
- `requests` is a great library for fetching resources, but it is not part of the Python standard library

In [2]:
resp = requests.get('http://s3.amazonaws.com/bokeh_data/subway_stations.geojson')

## 2. Create a Data Source

Create a `GeoJSONDataSource` that wraps the GeoJSON text and serves as the source for the `fig.circle` glyph

In [3]:
geojson_data_source = GeoJSONDataSource(geojson=resp.text)

In [4]:
type( geojson_data_source )

bokeh.models.sources.GeoJSONDataSource

## 3. Create a Figure

Create a Bokeh figure which displays the geojson content as circle glyphs.

- set figure `grid.grid_line_alpha` to `0`
- set circle `color` to `cyan`
- set circle `alpha` to `.85`
- set circle `line_color` to `darkblue`

In [5]:
fig = figure(plot_width=700,
             plot_height=700,
             background_fill_color='black')
fig.grid.grid_line_alpha = 0
fig.circle(x='x',
           y='y',
           source=geojson_data_source,
           alpha=.85,
           color='cyan',
           line_color='darkblue',
           size=8)
show(fig)

## 4.  Create a DataFrame

Create a pandas DataFrame from the Subway Stations GeoJSON

- since we are dealing with GeoJSON (a type of JSON), experiment with `pandas.io.json.json_normalize`
- hint: if you used the `requests` library, resp.json() will return a python dictionary for you
- hint: use the `features` property of the station GeoJSON
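As a rough illustration of what the flattening does, the sketch below runs `pd.json_normalize` (the top-level name for `json_normalize` in pandas 1.0 and later) on a hand-written feature list. The two features reuse values from the station data but are trimmed to a few keys, and the `sample_geojson` and `sample_df` names are only for this example.

```python
import pandas as pd

# Two station features, trimmed to a few keys for illustration.
sample_geojson = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "geometry": {"type": "Point",
                         "coordinates": [-74.00030814706824, 40.73225482650675]},
            "properties": {"name": "W 4th St - Washington Sq (Lower)",
                           "line": "B-D-F-M"},
        },
        {
            "type": "Feature",
            "geometry": {"type": "Point",
                         "coordinates": [-73.83256899924748, 40.846810332614844]},
            "properties": {"name": "Buhre Ave", "line": "6-6 Express"},
        },
    ],
}

# Nested dictionaries become dotted column names; lists stay as single cells.
sample_df = pd.json_normalize(sample_geojson["features"])
print(sorted(sample_df.columns))
```

Each feature becomes one row, which is why the `features` list, rather than the whole document, is passed in.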

In [6]:
stations_df = json_normalize(resp.json()['features'])
stations_df.head()

| | geometry.coordinates | geometry.type | properties.line | properties.name | properties.url | type |
|---|---|---|---|---|---|---|
| 0 | [-74.00030814706824, 40.73225482650675] | Point | B-D-F-M | W 4th St - Washington Sq (Lower) | http://www.mta.info/nyct/subway/index.html | Feature |
| 1 | [-73.83256899924748, 40.846810332614844] | Point | 6-6 Express | Buhre Ave | http://www.mta.info/nyct/subway/index.html | Feature |
| 2 | [-73.97192000013308, 40.757107333148234] | Point | 4-6-6 Express | 51st St | http://www.mta.info/nyct/subway/index.html | Feature |
| 3 | [-73.97621799811347, 40.78864433404891] | Point | 1-2 | 86th St | http://www.mta.info/nyct/subway/index.html | Feature |
| 4 | [-74.00413100111697, 40.713065332984044] | Point | 4-5-6-6 Express | Brooklyn Bridge - City Hall | http://www.mta.info/nyct/subway/index.html | Feature |


### Bonus: Save the Data to local JSON file

In [7]:
stations_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 470 entries, 0 to 469
Data columns (total 6 columns):
geometry.coordinates    470 non-null object
geometry.type           470 non-null object
properties.line         470 non-null object
properties.name         470 non-null object
properties.url          470 non-null object
type                    470 non-null object
dtypes: object(6)
memory usage: 22.1+ KB


In [10]:
import os

# Create the target directory first; to_json() does not create missing directories
os.makedirs('./tmp', exist_ok=True)
stations_df.to_json('./tmp/subway_stations.geojson')

In [11]:
# Linux, macOS
!ls ./tmp/subway_stations.geojson
!head ./tmp/subway_stations.geojson

In [12]:
# Windows (cmd.exe reads forward slashes as option switches, so use backslashes)
!dir .\tmp\subway_stations.geojson



## 5. Clean the Data

Separate out the `geometry.coordinates` column into separate `lat` and `lon` columns

- hint: each `geometry.coordinates` value is a `[lon, lat]` pair (GeoJSON lists longitude first), so unpack in that order
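One way to split a column of pairs is the `zip(*...)` idiom. The plain-Python sketch below shows what it does on three coordinate pairs taken from the station data; the `coordinates`, `lons`, and `lats` names are only for this example.

```python
# Three [lon, lat] pairs taken from the station data.
coordinates = [
    [-74.00030814706824, 40.73225482650675],
    [-73.83256899924748, 40.846810332614844],
    [-73.97192000013308, 40.757107333148234],
]

# zip(*rows) passes each pair as a separate argument, so zip groups
# all first elements together and all second elements together.
lons, lats = zip(*coordinates)

print(lons)
print(lats)
```

The same unpacking works on a pandas column of lists, because iterating over the column yields one pair per row.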

In [13]:
stations_df['lon'], stations_df['lat'] = zip(*stations_df['geometry.coordinates'])

In [14]:
stations_df.head()

| | geometry.coordinates | geometry.type | properties.line | properties.name | properties.url | type | lon | lat |
|---|---|---|---|---|---|---|---|---|
| 0 | [-74.00030814706824, 40.73225482650675] | Point | B-D-F-M | W 4th St - Washington Sq (Lower) | http://www.mta.info/nyct/subway/index.html | Feature | -74.000308 | 40.732255 |
| 1 | [-73.83256899924748, 40.846810332614844] | Point | 6-6 Express | Buhre Ave | http://www.mta.info/nyct/subway/index.html | Feature | -73.832569 | 40.84681 |
| 2 | [-73.97192000013308, 40.757107333148234] | Point | 4-6-6 Express | 51st St | http://www.mta.info/nyct/subway/index.html | Feature | -73.97192 | 40.757107 |
| 3 | [-73.97621799811347, 40.78864433404891] | Point | 1-2 | 86th St | http://www.mta.info/nyct/subway/index.html | Feature | -73.976218 | 40.788644 |
| 4 | [-74.00413100111697, 40.713065332984044] | Point | 4-5-6-6 Express | Brooklyn Bridge - City Hall | http://www.mta.info/nyct/subway/index.html | Feature | -74.004131 | 40.713065 |


## 6. Load More Data

Load all rows from the file `../../data/Datashader/nyc_crime.csv` into a `pandas.DataFrame` with the variable name `df`. Use the pandas `read_csv()` function.
- pandas is great for loading CSV data
- `usecols` can help in only loading data which you need.  In this case, load the `Offense`, `Location 1`, and `Occurrence Year`.

In [15]:
nyc_crime_file = '../../data/Datashader/nyc_crime.csv'
df = pd.read_csv( nyc_crime_file, usecols=['Offense', 'Occurrence Year', 'Location 1'])
df.head()


## 7. More Data Cleanup

- clean up the `Location 1` field and create two new columns named `lat` (latitude) and `lon` (longitude)
- define categorical columns as type `category`
- filter `df` to contain only data from the last 10 years
- decide on the appropriate order for these steps
- clean the station column names by removing the `properties.` prefix
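The sketch below walks through the crime-data steps on a toy frame. It assumes `Location 1` holds strings shaped like `"(lat, lon)"`; the rows, the offense names, and the `toy`, `recent`, and `parts` names are hypothetical, so the real file may need different handling.

```python
import pandas as pd

# Hypothetical rows; the real file's values and string format may differ.
toy = pd.DataFrame({
    "Offense": ["BURGLARY", "ROBBERY", "BURGLARY"],
    "Occurrence Year": [2004, 2010, 2015],
    "Location 1": ["(40.7322, -74.0003)",
                   "(40.8468, -73.8326)",
                   "(40.7571, -73.9719)"],
})

# Filter first so the later steps touch fewer rows; copy() avoids
# chained-assignment warnings when new columns are added.
recent = toy[toy["Occurrence Year"] > 2005].copy()

# Drop the parentheses, split on the comma, and convert both parts to floats.
parts = recent["Location 1"].str.strip("()").str.split(",", expand=True)
recent["lat"] = parts[0].str.strip().astype(float)
recent["lon"] = parts[1].str.strip().astype(float)

# A column of repeated labels is stored more compactly as a category.
recent["Offense"] = recent["Offense"].astype("category")

print(recent[["Offense", "lat", "lon"]])
print(recent.dtypes)
```

Filtering before parsing is one reasonable order, since string operations are the slowest step and run on fewer rows that way.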

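In [16]:
# copy() avoids SettingWithCopyWarning when columns are added below
df = df[df['Occurrence Year'] > 2005].copy()
df.head()

In [17]:
# Strip the parentheses, split on the comma, and convert both parts to floats
coords = df['Location 1'].str.replace('[()]', '', regex=True).str.split(',', expand=True)
df['lat'] = coords[0].astype(float)
df['lon'] = coords[1].astype(float)
df.head()
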

In [None]:
df['Offense'] = df['Offense'].astype('category')
df.head()

In [None]:
stations_df.columns = stations_df.columns.str.replace('properties.', '', regex=False)
stations_df.head()

## 8. Reproject Coordinates

Reproject both stations and crimes into `New York State Plane Long Island` (EPSG code 2263)

- this is a good projection for NYC
- units are in feet
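Recent pyproj releases deprecate `Proj(init=...)` and the module-level `transform` function. As an alternative sketch, assuming pyproj 2.2 or later is installed, the `Transformer` class performs the same reprojection; the `to_state_plane` name is only for this example.

```python
from pyproj import Transformer

# always_xy=True fixes the argument order to (lon, lat) in and (x, y) out.
to_state_plane = Transformer.from_crs("EPSG:4326", "EPSG:2263", always_xy=True)

# W 4th St - Washington Sq (Lower), from the station data.
lon, lat = -74.00030814706824, 40.73225482650675
x, y = to_state_plane.transform(lon, lat)

print(x, y)  # both values are in US survey feet
```

`transform` also accepts NumPy arrays, so whole `lon` and `lat` columns can be passed in one call.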

In [None]:
input_proj  = Proj(init="EPSG:4326")
output_proj = Proj(init="EPSG:2263")

In [None]:
stations_df['x'], stations_df['y'] = transform(input_proj, output_proj, stations_df.lon.values, stations_df.lat.values)
df['x'], df['y'] = transform(input_proj, output_proj, df.lon.values, df.lat.values)
df.head()

## 9. Create a KDTree

Create a KDTree using the crime dataframe

- try to avoid max recursion errors by setting an appropriately large `leafsize`
- experiment with how `leafsize` affects the tree build time (`%%time`)
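To see what `leafsize` does and does not change, the sketch below builds two trees over the same synthetic points and compares one query. The random points are stand-ins for projected crime locations, and the two leaf sizes are arbitrary choices for illustration.

```python
import numpy as np
from scipy.spatial import cKDTree

# Synthetic points in a 10,000 ft square; stand-ins for crime locations.
rng = np.random.default_rng(0)
points = rng.uniform(0, 10_000, size=(2_000, 2))

# leafsize is the point count at which the tree stops splitting
# and falls back to brute-force search inside a leaf.
small_leaves = cKDTree(points, leafsize=16)
large_leaves = cKDTree(points, leafsize=500)

# leafsize changes the tree's shape and speed, not the answers it returns.
center = [5_000, 5_000]
hits_small = sorted(small_leaves.query_ball_point(center, 1_000))
hits_large = sorted(large_leaves.query_ball_point(center, 1_000))
print(len(hits_small), hits_small == hits_large)
```

Larger leaves generally mean a faster build and slower queries, so timing both steps is the way to pick a value for a given dataset.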

In [None]:
tree = KDTree(df[['x','y']].values, leafsize=5000)
type(tree)

## 10. Crime Near Stations

Calculate how many crimes have happened within a radius of 100 feet from a given subway station

- use the `KDTree.query_ball_point` method
- add the number of crimes to the subway stations dataframe
- experiment with how the tree's `leafsize` affects query time (`%%time`)
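As a small worked example of the radius search, the sketch below counts toy "crimes" around two toy "stations". The coordinates are invented so the result can be checked by hand, and the `crime_xy` and `station_xy` names are only for this example.

```python
import numpy as np
from scipy.spatial import cKDTree

# Toy coordinates in feet; stand-ins for projected crime and station points.
crime_xy = np.array([[0, 0], [50, 0], [0, 80], [500, 500], [520, 510]],
                    dtype=float)
station_xy = np.array([[0, 0], [510, 505]], dtype=float)

tree = cKDTree(crime_xy)
radius = 100  # feet

# Passing all stations at once returns one list of crime indices per station.
neighbors = tree.query_ball_point(station_xy, r=radius)
num_incidents = [len(idx) for idx in neighbors]
print(num_incidents)  # [3, 2]
```

Querying every station in a single call is an alternative to a row-by-row `apply` and avoids the Python-level loop over rows.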

In [None]:
%%time
radius = 100 # feet
search_func = lambda r: len(tree.query_ball_point([r.x, r.y], radius))
stations_df['num_incidents'] = stations_df.apply(search_func, axis=1)

In [None]:
stations_df.head()

## 11. Classify Stations

Classify stations into quintiles based on number of crime incidents

- `pd.qcut` would be a good choice
- name the quintiles: `very low`, `low`, `moderate`, `high`, and `very high` respectively
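One thing to watch for: `pd.qcut` raises a `ValueError` when several quantile edges coincide, which can happen if many stations share the same count. The sketch below shows one common workaround, ranking before binning, on hypothetical counts; it is not necessarily what the solution needs for the real data.

```python
import pandas as pd

quintile_labels = ['very low', 'low', 'moderate', 'high', 'very high']

# Hypothetical incident counts with many ties at zero.
counts = pd.Series([0, 0, 0, 0, 0, 1, 2, 5, 9, 20])

# pd.qcut(counts, 5) would raise here because several quantile edges equal 0.
# Ranking with method='first' breaks ties by position, so every edge is unique.
ranks = counts.rank(method='first')
levels = pd.qcut(ranks, 5, labels=quintile_labels)

print(levels.value_counts(sort=False))
```

The trade-off is that tied stations are split between classes by their order in the frame, so two stations with identical counts can receive different labels.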

In [None]:
quintile_labels = ['very low', 'low', 'moderate', 'high', 'very high']
stations_df['crime_level'] = pd.qcut(stations_df.num_incidents, 5, labels=quintile_labels)
stations_df.head()

## 12. Prep for Plotting

Add `color` and `point_size` columns to the stations dataframe based on `crime_level`

In [None]:
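from bokeh.palettes import Spectral5

# rename_categories maps each crime level, in order, to a palette color;
# assigning to .cat.categories directly no longer works in pandas 2.0+
stations_df['color'] = stations_df['crime_level'].cat.rename_categories(list(Spectral5)).astype(str)

# same idea for marker sizes, converted to plain integers for plotting
stations_df['point_size'] = stations_df['crime_level'].cat.rename_categories([4, 6, 8, 10, 12]).astype(int)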

stations_df.head()

## 13. Create a Figure

Create a figure to visualize subway stations symbolized by crime level

- set figure background_fill_color to `black`
- set figure `plot_height` and `plot_width` to 700
- set figure axis.visible to `False`
- set station circle `alpha` to `.65`
- set station circle `line_color` to `white`
- set station circle `line_alpha` to `.85`
- use the `color` and `point_size` fields you calculated
- add a hover tool which displays station name and incident count

In [None]:
fig = figure(plot_width=700,
             plot_height=700,
             background_fill_color='black')
fig.grid.grid_line_alpha = 0
fig.axis.visible = False
fig.circle(x='x',
           y='y',
           color='color',
           source=ColumnDataSource(stations_df),
           alpha=.65,
           line_color='white',
           line_alpha=.85,
           size='point_size')

fig.add_tools(HoverTool(tooltips=[
    ("Name", "@name"),
    ("Crime Count", "@num_incidents"),
]))

show(fig)

## 14. Reflection

List at least two problems with this analysis.

- Incident counts are not normalized by the number of people who pass through each station.
- Quintiles group data into 5 bins of equal count, not equal width. A different classification scheme might give different results.
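To make the second point concrete, the sketch below classifies the same hypothetical counts with equal-count bins (`pd.qcut`) and equal-width bins (`pd.cut`). The counts are invented to include one extreme outlier.

```python
import pandas as pd

labels = ['very low', 'low', 'moderate', 'high', 'very high']

# Hypothetical counts: nine ordinary stations and one extreme outlier.
counts = pd.Series([0, 1, 2, 3, 4, 5, 6, 7, 8, 100])

equal_count = pd.qcut(counts, 5, labels=labels)  # two stations per class
equal_width = pd.cut(counts, 5, labels=labels)   # classes about 20 wide

comparison = pd.DataFrame({'count': counts,
                           'qcut': equal_count,
                           'cut': equal_width})
print(comparison)
```

With equal-width bins, nine of the ten stations land in `very low`, while quintiles label a station with 8 incidents `very high` alongside the one with 100.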

Did you have a different answer? List it here.
