In [1]:
# HIDDEN
from datascience import *
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import math
from scipy import stats
import numpy as np

trips = Table.read_table('trip.csv')

The new table `stations` contains geographical information about each bike station, including its latitude, its longitude, and a "landmark", which is the name of the city where the station is located.

In [16]:
stations = Table.read_table('station.csv')
stations.show(3)

station_id,name,lat,long,dockcount,landmark,installation
2,San Jose Diridon Caltrain Station,37.3297,-121.902,27,San Jose,8/6/2013
3,San Jose Civic Center,37.3307,-121.889,15,San Jose,8/5/2013
4,Santa Clara at Almaden,37.334,-121.895,11,San Jose,8/6/2013


We can draw a map of the station locations using `Marker.map_table`. The function takes a table whose columns are, in order, latitude, longitude, and an optional label for each point.

In [3]:
Marker.map_table(stations.select('lat', 'long', 'name'))

The map is created using [OpenStreetMap](http://www.openstreetmap.org/#map=5/51.500/-0.100), which is an open online mapping system that you can use just as you would use Google Maps or any other online map. Zoom in to San Francisco to see how the stations are distributed. Click on a marker to see which station it is.

You can also represent points on a map by colored circles. Here is such a map of the San Francisco bike stations.

In [4]:
sf = stations.where('landmark', are.equal_to('San Francisco'))
sf_map_data = sf.select('lat', 'long', 'name')
Circle.map_table(sf_map_data, color='green', radius=200)

### More Informative Maps
The bike stations are located in five different cities in the Bay Area. As a simple starting example, let us distinguish the points by using a different color for each city.  We will again use `apply` to assign each city a color by looking up its color in a table.

In [5]:
cities = stations.group('landmark').relabeled('landmark', 'city')
cities

city,count
Mountain View,7
Palo Alto,5
Redwood City,7
San Francisco,35
San Jose,16


In [6]:
colors = cities.with_column('color', make_array('blue', 'red', 'green', 'orange', 'purple'))
colors

city,count,color
Mountain View,7,blue
Palo Alto,5,red
Redwood City,7,green
San Francisco,35,orange
San Jose,16,purple


Now we can write a function to look up the color of a station by its city, using the `colors` table.

In [15]:
def find_color(city_name):
    return colors.where("city", are.equal_to(city_name)).column("color").item(0)

with_colors = stations.with_column("color", stations.apply(find_color, "landmark"))
with_colors.show(3)

station_id,name,lat,long,dockcount,landmark,installation,color
2,San Jose Diridon Caltrain Station,37.3297,-121.902,27,San Jose,8/6/2013,purple
3,San Jose Civic Center,37.3307,-121.889,15,San Jose,8/5/2013,purple
4,Santa Clara at Almaden,37.334,-121.895,11,San Jose,8/6/2013,purple
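
Each call to `find_color` filters the whole `colors` table to find one row. With only five cities that is perfectly adequate, but the same lookup can also be sketched with a plain Python dictionary. The names `color_by_city` and `find_color_dict` are introduced here for illustration and are not part of the notebook; the city-to-color pairs are copied from the `colors` table above.

```python
# Illustrative sketch: the pairs mirror the `colors` table shown above.
color_by_city = {
    'Mountain View': 'blue',
    'Palo Alto': 'red',
    'Redwood City': 'green',
    'San Francisco': 'orange',
    'San Jose': 'purple',
}

def find_color_dict(city_name):
    """Return the color assigned to a city (hypothetical helper)."""
    return color_by_city[city_name]

print(find_color_dict('San Jose'))  # purple
```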


In [8]:
Marker.map_table(with_colors.select('lat', 'long', 'name', 'color'))

Now the markers have five different colors for the five different cities.

### Where do most of the rentals originate?
To see where most of the bike rentals originate, let's count the trips that begin at each start station and sort the stations from most to fewest trips:

In [17]:
starts = trips.group('Start Station').sort('count', descending=True)
starts.show(3)

Start Station,count
San Francisco Caltrain (Townsend at 4th),26304
San Francisco Caltrain 2 (330 Townsend),21758
Harry Bridges Plaza (Ferry Building),17255
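
Grouping by start station and then sorting by count amounts to tallying how often each station name appears and ranking the tallies. Here is a minimal sketch of that idea using `collections.Counter` from the Python standard library. The list `sample_starts` is invented for illustration and does not come from `trip.csv`.

```python
from collections import Counter

# Invented start stations, one entry per trip (not real data).
sample_starts = ['Station A', 'Station B', 'Station A',
                 'Station C', 'Station A', 'Station B']

# Tally the trips per station, then rank from most to fewest.
counts = Counter(sample_starts)
ranked = counts.most_common()
print(ranked)  # [('Station A', 3), ('Station B', 2), ('Station C', 1)]
```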


We can include this information in `stations`, again by using `apply`.  We previously defined the function `find_trip_count`, which is reproduced below:

In [10]:
starts = trips.group("Start Station")

def find_trip_count(station_name):
    return starts.where("Start Station", are.equal_to(station_name)).column("count").item(0)

Some of the stations in the `stations` dataset are not present in the `trips` dataset.  We must filter them out before applying `find_trip_count` to the remainder.

In [11]:
stations_with_trip_data = stations.where("name", are.contained_in(starts.column("Start Station")))
count_by_station = stations_with_trip_data.with_column(
    "Number of trips",
    stations_with_trip_data.apply(find_trip_count, "name"))
count_by_station

station_id,name,lat,long,dockcount,landmark,installation,Number of trips
2,San Jose Diridon Caltrain Station,37.3297,-121.902,27,San Jose,8/6/2013,4968
3,San Jose Civic Center,37.3307,-121.889,15,San Jose,8/5/2013,774
4,Santa Clara at Almaden,37.334,-121.895,11,San Jose,8/6/2013,1958
5,Adobe on Almaden,37.3314,-121.893,19,San Jose,8/5/2013,562
6,San Pedro Square,37.3367,-121.894,15,San Jose,8/7/2013,1418
7,Paseo de San Antonio,37.3338,-121.887,15,San Jose,8/7/2013,856
8,San Salvador at 1st,37.3302,-121.886,15,San Jose,8/5/2013,495
9,Japantown,37.3487,-121.895,15,San Jose,8/5/2013,885
10,San Jose City Hall,37.3374,-121.887,15,San Jose,8/6/2013,832
11,MLK Library,37.3359,-121.886,19,San Jose,8/6/2013,1099
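
The filter matters because `find_trip_count` asks for the first matching count, and a station with no trips has no match to return. The sketch below shows the same filter-then-look-up pattern on invented data using plain Python. The names `trip_counts`, `station_names`, and `names_with_trips` are hypothetical.

```python
# Invented trip counts keyed by start station (not from trip.csv).
trip_counts = {'Station A': 120, 'Station B': 45}

# 'Station C' is a station that never appears as a start station.
station_names = ['Station A', 'Station B', 'Station C']

# Keep only the stations that have trip data, then look up each count.
names_with_trips = [name for name in station_names if name in trip_counts]
counts = [trip_counts[name] for name in names_with_trips]
print(names_with_trips, counts)  # ['Station A', 'Station B'] [120, 45]
```

Looking up `'Station C'` directly in `trip_counts` would raise a `KeyError`, which is the dictionary analogue of the problem that the filter avoids.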


Now we extract just the data needed for drawing our map, adding a color and an area to each station. The area is 1000 times the number of rentals starting at each station; the constant 1000 was chosen so that the circles would appear at an appropriate scale on the map.

In [12]:
starts_map_data = count_by_station.select('lat', 'long', 'name').with_columns(
    'color', 'blue',
    'area', count_by_station.column('Number of trips') * 1000
)
starts_map_data.show(3)
Circle.map_table(starts_map_data)

lat,long,name,color,area
37.3297,-121.902,San Jose Diridon Caltrain Station,blue,4968000
37.3307,-121.889,San Jose Civic Center,blue,774000
37.334,-121.895,Santa Clara at Almaden,blue,1958000
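
Because the trip count sets the area rather than the radius, a station with four times as many trips gets a circle four times as large in area but only twice as wide. The sketch below works through that arithmetic. It assumes that `area` behaves like the area of a geometric circle, so that the radius is the square root of area divided by pi; the exact conversion used by `Circle.map_table` is not shown in this section, and `radius_from_area` is a hypothetical helper.

```python
import math

def radius_from_area(area):
    """Radius of a circle with the given area (assumed geometry)."""
    return math.sqrt(area / math.pi)

# Two invented stations: 500 trips and 2000 trips, scaled by 1000 as above.
small = radius_from_area(1000 * 500)
large = radius_from_area(1000 * 2000)

# Four times the trips gives four times the area and twice the radius.
ratio = large / small
print(round(ratio, 6))  # 2.0
```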


That huge blob in San Francisco shows that the eastern section of the city is the unrivaled capital of bike rentals in the Bay Area.

### Trip duration
Recall the first part of our hypothesis: "There is a negative association between urban density and duration of bike trips."  The map, plus a bit of background knowledge about the Bay Area, gives us a rough picture of urban density.  San Francisco to the north and San Jose to the south are large, relatively dense cities.  In between, the South Bay and Peninsula are relatively less dense.  If we can display average trip duration on this map, we can use this knowledge about density to check our hypothesis.

We will start by adding the average trip duration to our `count_by_station` table, just as we did above.

In [18]:
durations = trips.group("Start Station", np.mean)

def find_average_duration(station_name):
    return durations.where("Start Station", are.equal_to(station_name)).column("Duration mean").item(0)

with_duration = count_by_station.with_column(
    "Average trip duration",
    count_by_station.apply(find_average_duration, "name"))
with_duration.show(3)

station_id,name,lat,long,dockcount,landmark,installation,Number of trips,Average trip duration
2,San Jose Diridon Caltrain Station,37.3297,-121.902,27,San Jose,8/6/2013,4968,884.375
3,San Jose Civic Center,37.3307,-121.889,15,San Jose,8/5/2013,774,5458.04
4,Santa Clara at Almaden,37.334,-121.895,11,San Jose,8/6/2013,1958,850.924
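
Passing `np.mean` to `group` averages the values within each start station, which is where the `Duration mean` column comes from. The same computation can be sketched by hand on a few invented trips. The names `sample_trips`, `durations_by_station`, and `mean_duration` are illustrative only.

```python
from collections import defaultdict

# Invented (start station, duration) pairs, not taken from trip.csv.
sample_trips = [('Station A', 600), ('Station B', 300),
                ('Station A', 900), ('Station B', 500)]

# Collect the durations that belong to each start station.
durations_by_station = defaultdict(list)
for station, duration in sample_trips:
    durations_by_station[station].append(duration)

# Average the durations within each station.
mean_duration = {station: sum(values) / len(values)
                 for station, values in durations_by_station.items()}
print(mean_duration)  # {'Station A': 750.0, 'Station B': 400.0}
```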


Now we will create a column of colors.  Bright red will show locations with long trips, and dark red or black will show locations with shorter trips.

Unfortunately, the `map_table` function requires us to specify colors in a particular format, and converting to that format involves some rather technical details about color encodings.  We have written a function called `duration_to_color` to convert average trip duration numbers to `map_table`'s color format.  Don't worry about the implementation (the body of the function); the docstring describes what the function does.  We simply `apply` the `duration_to_color` function to our "Average trip duration" column to produce colors.

In [19]:
def duration_to_color(average_duration):
    """Converts an average trip duration to a string describing a color.
    
    Longer durations will be closer to bright red, and shorter durations
    will be closer to black.
    
    Args:
      average_duration (float): The average trip duration for one
        station.
    
    Returns:
      (string): A string describing a color based on the given average
        trip duration.  The string is in 6-digit hexadecimal format,
        which is a common way to describe colors."""
    max_duration_color = 255
    color_bits = 8
    rescaled_duration = min(max_duration_color, int(256 * average_duration / 5000))
    red_amount = 2**(2*color_bits) * rescaled_duration
    color = '#{:06X}'.format(red_amount)
    return color

duration_map_data = with_duration.select('lat', 'long', 'name').with_columns(
    'color', with_duration.apply(duration_to_color, 'Average trip duration'),
    'area', with_duration.column('Number of trips') * 4000,
)
duration_map_data.show(3)
Circle.map_table(duration_map_data, fill_opacity=1)

lat,long,name,color,area
37.3297,-121.902,San Jose Diridon Caltrain Station,#2D0000,19872000
37.3307,-121.889,San Jose Civic Center,#FF0000,3096000
37.334,-121.895,Santa Clara at Almaden,#2B0000,7832000
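
As a check on the arithmetic, the sketch below restates the body of `duration_to_color` more compactly and applies it to the three averages shown in `with_duration`. The name `duration_to_color_check` is introduced here for illustration; the logic is intended to match the function above, with a bit shift standing in for the multiplication by `2**(2*color_bits)`.

```python
def duration_to_color_check(average_duration):
    """Compact restatement of duration_to_color (illustrative only)."""
    # Rescale so that 5000 would map to 256, then cap at 255,
    # the largest value that fits in 8 bits.
    red = min(255, int(256 * average_duration / 5000))
    # Shift the red level into the top two hex digits; green and
    # blue stay at 00, so the colors run from black to bright red.
    return '#{:06X}'.format(red << 16)

for duration in (884.375, 5458.04, 850.924):
    print(duration, duration_to_color_check(duration))
# 884.375 #2D0000
# 5458.04 #FF0000
# 850.924 #2B0000
```

Note the cap: any average of roughly 4,980 or more maps to `#FF0000`, so the color scale cannot distinguish among the very longest averages.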


### Conclusions
It seems that the locations with long trip durations are mostly in Palo Alto and Redwood City, with one exception in San Jose.  These are the least urban bike stations on the map.  The data are therefore compatible with our hypothesis.

Until now, we have not proposed a causal mechanism for the association.  Here are a few that are plausible:

* Palo Alto and Redwood City are close to long bike routes in the hills to the southwest.  Perhaps people take long recreational biking trips through the hills.
* Perhaps Stanford students rent bicycles to get around campus for days at a time.
* Perhaps some people who live or work in the long suburban peninsula between San Francisco and San Jose commute for long distances by bicycle.

**Question for thought:** The `trips` dataset includes the date and time of day for the start and end of each trip.  How might we use this information to test some of the proposals above?