Rush Hour Dynamics: Studying the London Underground using Python

# Outline

* Introduction and motivating problem
* Data collection and thinking about data structures
* Brief introduction to the basics of using graph-tool



## Introduction
My name is Camilla. I recently (2013) graduated from Bryn Mawr College (Philadelphia) and then spent a year at Edinburgh. I currently work as a software engineer in test in London.

In [1]:
%load_ext autoreload


In [2]:
%autoreload 2
from src import simulation_utils
from src.bokeh_visualizations import bokeh_map 

#create column datasource
geographical_data="/home/winterflower/programming_projects/python-londontube/src/data/london_stations.csv"
network_data="/home/winterflower/programming_projects/python-londontube/src/data/londontubes.txt"
column_datasource = bokeh_map.parse_input_file(geographical_data)

#plot data and output to current notebook
bokeh_map.bokeh_zone_colour_map(column_datasource, notebook=True)


## Motivating Problem

[add an iPython widget here]
My first flat in London was in Marylebone. I got lucky in many ways; one of them was that for my daily commute to Liverpool Street station I could choose between four tube lines: Circle, Hammersmith & City, Bakerloo and Metropolitan. I quickly became baffled by how much the number of people on the trains at different stations varied from day to day. Some days the Edgware Road station would be packed, while on other days it was empty and the Bakerloo station would be packed at exactly the same time.
I noticed that these fluctuations seemed to be associated with whether there were delays on other Underground lines and whether any stations were closed.

I became interested in studying the London Underground as a system. 
First, I had some questions about the static nature of the map.

1. What kind of graph properties does the London Underground have?
2. What is the average shortest path between any two stations?
3. Is there a path from every station to another?


## Technology 
To answer these questions, I chose two Python libraries: graph-tool for the analytical side of things and bokeh for generating visualizations.

## The London Underground as a Graph

1. Data collection
Since no online source had the data in the format I wanted, I had two options: devise some optical character recognition magic to parse the map automatically, or do it the old-school way and write my own data file by hand. At this point I got a bit unlucky: I caught the flu and was confined to my bed for a week. Since there was very little else to do, I set out to collect the London Underground data by hand and came up with the file below.


```
#Station #Neighbour(line)
Acton Town	        Chiswick Park (District), South Ealing (Picadilly), Turnham Green (Picadilly)
Aldgate		        Tower Hill (Circle; District), Liverpool Street (Metropolitan; Circle; District)
Aldgate East	    Tower Hill (District), Liverpool Street (HammersmithCity; Metropolitan)
Alperton	        Sudbury Town (Picadilly), Park Royal (Picadilly)
```
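A file in this shape can be turned into an adjacency dictionary with a few lines of standard-library Python. The sketch below is only one possible reading of the format: it assumes a tab separates the station name from its neighbours, that neighbours are comma-separated, and that the lines serving a connection are separated by semicolons inside the parentheses. The names `parse_network_line` and `parse_network` are hypothetical and are not part of the project's `src` package.

```python
import re

# Matches one "Neighbour Name (Line; Line)" entry.
NEIGHBOUR_PATTERN = re.compile(r"([^,()]+)\(([^)]*)\)")


def parse_network_line(line):
    """Split one line of the file into (station, {neighbour: [lines]})."""
    # Assumption: a tab separates the station name from its neighbours.
    station, _, rest = line.partition("\t")
    neighbours = {}
    for name, lines in NEIGHBOUR_PATTERN.findall(rest):
        neighbours[name.strip()] = [part.strip() for part in lines.split(";")]
    return station.strip(), neighbours


def parse_network(text):
    """Build an adjacency dictionary from the whole file, skipping comments."""
    adjacency = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        station, neighbours = parse_network_line(line)
        adjacency[station] = neighbours
    return adjacency


sample = (
    "#Station #Neighbour(line)\n"
    "Acton Town\t        Chiswick Park (District), South Ealing (Picadilly), Turnham Green (Picadilly)\n"
    "Aldgate\t\t        Tower Hill (Circle; District), Liverpool Street (Metropolitan; Circle; District)\n"
)
adjacency = parse_network(sample)
print(adjacency["Aldgate"]["Tower Hill"])
```

Keeping the line names as a list per connection means the same structure can later answer both "which stations are adjacent?" and "which lines link them?".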


2. In addition to the data specifying the network properties of the graph, we have data on the geographical location and travel zone of each station.

```
Acton,51.516886963398,-0.26767554257793,"3",W3 0BP
Acton Central,51.508757812012,-0.26341579231596,"2",W3 6BD
Acton Town,51.503071476856,-0.2802882961706,"3",W3 8HN
```
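Rows like these can be read with the standard `csv` module. In the sketch below the field names (`name`, `latitude`, `longitude`, `zone`, `postcode`) are my own reading of the columns, since the file has no header row, and `parse_stations` is a hypothetical helper rather than the project's `bokeh_map.parse_input_file`.

```python
import csv
import io
from collections import namedtuple

# Hypothetical field names; the file itself has no header row.
Station = namedtuple("Station", ["name", "latitude", "longitude", "zone", "postcode"])


def parse_stations(handle):
    """Read station rows from an open file-like object."""
    stations = []
    for row in csv.reader(handle):
        if not row:
            continue
        name, latitude, longitude, zone, postcode = row
        # The zone is kept as a string in case a label is not a plain integer.
        stations.append(Station(name, float(latitude), float(longitude), zone, postcode))
    return stations


sample = io.StringIO(
    'Acton,51.516886963398,-0.26767554257793,"3",W3 0BP\n'
    'Acton Central,51.508757812012,-0.26341579231596,"2",W3 6BD\n'
)
stations = parse_stations(sample)
print(stations[0].name, stations[0].zone)
```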

## Properties of the London Graph
At first, I wanted to establish some properties of the graph itself, without mixing in simulations, commuters and the like.
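Two of the static questions asked earlier (is there a path from every station to every other, and what is the average shortest path) can be explored with plain breadth-first search. The following is a minimal sketch on a made-up four-station line, not the graph-tool code used for the analysis. The function names and the `toy` dictionary are illustrative assumptions, and path length is counted in hops between stations rather than in travel time.

```python
from collections import deque


def hop_distances(adjacency, source):
    """Breadth-first search: number of hops from source to each reachable station."""
    distances = {source: 0}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for neighbour in adjacency[current]:
            if neighbour not in distances:
                distances[neighbour] = distances[current] + 1
                queue.append(neighbour)
    return distances


def is_connected(adjacency):
    """True if every station can be reached from an arbitrary start station."""
    start = next(iter(adjacency))
    return len(hop_distances(adjacency, start)) == len(adjacency)


def average_shortest_path(adjacency):
    """Mean hop count over all ordered pairs of distinct, reachable stations."""
    total, pairs = 0, 0
    for source in adjacency:
        for target, distance in hop_distances(adjacency, source).items():
            if target != source:
                total += distance
                pairs += 1
    return float(total) / pairs


# Toy line of four made-up stations: A - B - C - D
toy = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
print(is_connected(toy))
print(average_shortest_path(toy))
```

Because the graph is undirected, a single search from any station is enough to decide connectivity, while the average needs one search per station.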



## Centrality
Centrality measures aim to identify the most 'important' vertices in a graph. This is interesting for the London Underground because it may indicate which stations are the most critical for passengers and must remain operational.

## Betweenness Centrality
The betweenness centrality of a vertex measures how many of the shortest paths between pairs of other vertices pass through that vertex.
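To make the definition concrete, here is a plain-Python sketch of Brandes' algorithm for an undirected, unweighted graph. It is an illustration only: it is not how `graph_analysis.calculate_betweenness` or graph-tool is implemented, it returns unnormalised scores, and the toy graph and the function name are made up for the example. When several shortest paths exist between two stations, each intermediate station receives the fraction of those paths that pass through it.

```python
from collections import deque


def betweenness_centrality(adjacency):
    """Unnormalised betweenness for an undirected, unweighted graph."""
    centrality = {vertex: 0.0 for vertex in adjacency}
    for source in adjacency:
        # Breadth-first search that also counts shortest paths from source.
        order = []
        predecessors = {vertex: [] for vertex in adjacency}
        path_count = {vertex: 0 for vertex in adjacency}
        path_count[source] = 1
        distance = {source: 0}
        queue = deque([source])
        while queue:
            current = queue.popleft()
            order.append(current)
            for neighbour in adjacency[current]:
                if neighbour not in distance:
                    distance[neighbour] = distance[current] + 1
                    queue.append(neighbour)
                if distance[neighbour] == distance[current] + 1:
                    path_count[neighbour] += path_count[current]
                    predecessors[neighbour].append(current)
        # Walk back from the farthest vertices, accumulating dependencies.
        dependency = {vertex: 0.0 for vertex in adjacency}
        for vertex in reversed(order):
            for predecessor in predecessors[vertex]:
                share = float(path_count[predecessor]) / path_count[vertex]
                dependency[predecessor] += share * (1.0 + dependency[vertex])
            if vertex != source:
                centrality[vertex] += dependency[vertex]
    # Each unordered pair of endpoints was counted once from each end.
    return {vertex: value / 2.0 for vertex, value in centrality.items()}


# Toy line of four made-up stations: A - B - C - D
toy = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
scores = betweenness_centrality(toy)
print(scores)
```

On the toy line, the two interior stations each lie on two shortest paths between other stations, while the end stations lie on none.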

In [None]:
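#calculate the betweenness centrality of every station

from src import simulation_utils
from src.graph_analytics import graph_analysis
import pandas as pd
#the result is assumed to be a pandas Series, as the variable name suggests
betweenness_centrality_series_object=graph_analysis.calculate_betweenness(network_data)
#sort so that the stations with the highest betweenness come first
betweenness_centrality_series_object=betweenness_centrality_series_object.sort_values(ascending=False)
print(betweenness_centrality_series_object[:10])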




In [None]:
#calculate the length of the shortest path between every pair of stations

from src.graph_analytics import graph_analysis
shortest_paths=graph_analysis.calculate_all_shortest_paths(network_data)
#calculate the mean shortest path
mean_shortest_path=shortest_paths.mean(axis=0)



In [None]:
#find the stations with the smallest mean shortest paths
mean_shortest_path=mean_shortest_path.sort_values(ascending=True)
#the 5 stations with the smallest mean shortest path
mean_shortest_path[:5]



In [None]:
#the 5 stations with the largest mean shortest path
mean_shortest_path[-5:]

In [None]:
import matplotlib.pyplot as plt
plt.style.use('ggplot') # Make the graphs a bit prettier
plt.rcParams['figure.figsize'] = (15, 5)
#what is the overall distribution of mean shortest paths?
mean_shortest_path.hist()


In [None]:
#calculate the average shortest path
graph_analysis.calculate_average_shortest_path(shortest_paths)