<h1>Citibike Network Assignment</h1>
<li>The file, 201809-citibike-tripdata.csv, contains citibike trip data from September 2018 (a reasonable sized file!)
<li>The data:<br>
"tripduration","starttime","stoptime","start station id","start station name","start station latitude","start station longitude","end station id","end station name","end station latitude","end station longitude","bikeid","usertype","birth year","gender"
<li>Each record in the data is a trip 
<li>The data is described at https://www.citibikenyc.com/system-data

<h1>STEP 1: Read the data into a dataframe</h1>
<li>Convert station ids to str if necessary

In [1]:
import pandas as pd
import numpy as np
datafile = "201801-citibike-tripdata.csv"
df = pd.read_csv(datafile)


<h1>STEP 2: Basic cleaning</h1>
<li>Remove data that have any nans in any row (none in this file but others do have nans)
<li>and convert stationids to str 

In [2]:
df.isnull().sum()

tripduration               0
starttime                  0
stoptime                   0
start station id           0
start station name         0
start station latitude     0
start station longitude    0
end station id             0
end station name           0
end station latitude       0
end station longitude      0
bikeid                     0
usertype                   0
birth year                 0
gender                     0
dtype: int64

In [3]:
df['start station id'] = df['start station id'].astype(str)
df['end station id'] = df['end station id'].astype(str)

<h1>STEP 3: Write a function that returns a graph given a citibike data frame</h1> 
<li>Your function should return two things:
<ol>
<li>a graph
<li>a dictionary with station ids as the key and station name as the value
</ol>
<li>The graph should contain 
<ol>
<li>nodes (station ids)
<li>edges (station id, station id)
<li>edge data 
<ol>
<li>count: number of trips on the edge
<li>time: average duration - pickup to dropoff - on that edge
</ol>
</ol>
<li><b>Note:</b> the edge (x1,y1) is the same as (y1,x1) even though the start station ids and end station ids are flipped in the dataframe

In [4]:
def get_citibike_graph(df):
    import networkx as nx
    G = nx.Graph()
    node_names = dict()
    
    graph_list = []
    for i in range(len(df)):
        start = min(df.iloc[i]['start station id'], df.iloc[i]['end station id'])
        end = max(df.iloc[i]['start station id'], df.iloc[i]['end station id'])
        time = df.iloc[i]['tripduration']
        graph_list.append((start, end, time))
    
    edge = dict()
        
    for i in range(len(graph_list)):
        start = graph_list[i][0]
        end = graph_list[i][1]
        if (start, end) not in edge:
            edge[(start, end)] = {}
            edge[(start, end)]['count'] = 1
            edge[(start, end)]['sumtime'] = graph_list[i][2]
        else:
            edge[(start, end)]['count'] += 1
            edge[(start, end)]['sumtime'] += graph_list[i][2]
    
    for i in edge.keys():
        start = i[0]
        end = i[1]
        count = edge[i]['count']
        meantime = edge[i]['sumtime']/count
        G.add_edge(start, end, trips = count, time = meantime) 
    
    node_names = dict(zip(df['end station id'], df['end station name']))
 
    return G,node_names

<h1>STEP 4: Create the following graphs using the function above</h1>
<li>G: A graph of all the data in the dataframe
<li>m_G: A graph containing only data from male riders
<li>f_G: A graph containing only data from female riders
<li>Note: for m_G and f_G you will need to extract data from the dataframe

In [None]:
G,nodes=get_citibike_graph(df)

In [None]:
m_G,nodes = get_citibike_graph(df.loc[df['gender'] == 1])

In [None]:
f_G,nodes = get_citibike_graph(df.loc[df['gender'] == 2])

<h1>STEP 5: Answer the following questions for each of the graphs</h1>
<ol>
<li>Which station (name) is the best connected (max degree)?
<li>Travel between which pair of stations is the longest in terms of average duration between bike pickups and dropoffs. Report both the two stations as well as the time in minutes
<li>Which edge is associated with the most number of trips?
<li>Which station is the most central?
<li>Which node is a bottleneck node?

Which station (name) has the greatest number of connections (max degree)?

In [None]:
import networkx as nx

d=nx.degree(m_G)
l=list(d)
max(l,key=lambda x: x[1])[0]
print("Busiest male station: ", nodes[max(l,key=lambda x: x[1])[0]])

d=nx.degree(f_G)
l=list(d)
max(l,key=lambda x: x[1])[0]
print("Busiest female station: ", nodes[max(l,key=lambda x: x[1])[0]])

d=nx.degree(G)
l=list(d)
max(l,key=lambda x: x[1])[0]

print("Busiest station: ", nodes[max(l,key=lambda x: x[1])[0]])


Travel between which pair of stations is the longest in terms of average duration between bike pickups and dropoffs

In [27]:
start = sorted(m_G.edges(data=True), key = lambda x: x[2]['time'], reverse = True)[0][0]
end = sorted(m_G.edges(data=True), key = lambda x: x[2]['time'], reverse = True)[0][1]
print('Longest average distance males: ', nodes[start], 'to', nodes[end])

start = sorted(f_G.edges(data=True), key = lambda x: x[2]['time'], reverse = True)[0][0]
end = sorted(f_G.edges(data=True), key = lambda x: x[2]['time'], reverse = True)[0][1]
print('Longest average distance females: ', nodes[start], 'to', nodes[end])

start = sorted(G.edges(data=True), key = lambda x: x[2]['time'], reverse = True)[0][0]
end = sorted(G.edges(data=True), key = lambda x: x[2]['time'], reverse = True)[0][1]
print('Longest average distance all: ', nodes[start], 'to', nodes[end])

Longest average distance males:  Nassau St & Navy St to Hope St & Union Ave
Longest average distance females:  Adelphi St & Myrtle Ave to NYCBS Depot - 3AV
Longest average distance all:  Nassau St & Navy St to Hope St & Union Ave


Which edge is associated with the most number of trips?

In [28]:
start = sorted(m_G.edges(data=True), key = lambda x: x[2]['trips'], reverse = True)[0][0]
end = sorted(m_G.edges(data=True), key = lambda x: x[2]['trips'], reverse = True)[0][1]
print('Longest average distance males: ', nodes[start], 'to', nodes[end])

start = sorted(f_G.edges(data=True), key = lambda x: x[2]['trips'], reverse = True)[0][0]
end = sorted(f_G.edges(data=True), key = lambda x: x[2]['trips'], reverse = True)[0][1]
print('Longest average distance females: ', nodes[start], 'to', nodes[end])

start = sorted(G.edges(data=True), key = lambda x: x[2]['trips'], reverse = True)[0][0]
end = sorted(G.edges(data=True), key = lambda x: x[2]['trips'], reverse = True)[0][1]
print('Longest average distance all: ', nodes[start], 'to', nodes[end])

Longest average distance males:  Cooper Square & Astor Pl to E 7 St & Avenue A
Longest average distance females:  E 7 St & Avenue A to Cooper Square & Astor Pl
Longest average distance all:  Cooper Square & Astor Pl to E 7 St & Avenue A


<h2>Centrality</h2>
One of the concerns that the citibike system has to deal with is ensuring that no station has empty slots (a bike should always be available) and that no station should have no empty slots (you should be able to return a bike). To do this, it needs to monitor the movement of bikes through the system, ideally using a directed graph. Though our graph is not directed, we can look at some network characteristics that will help us answer these questions. Note that the "trips" feature in edge data captures flows.
<li>Which node is a possible bottleneck node in terms of bike flows?
<li>Which node is the "nearest" to all other nodes (irrespective of flows)
<li>Which node is the "nearest" to all other nodes (in terms of distance = time)


In [None]:
from collections import OrderedDict

print("Results for Graph")
print()
c_c = nx.closeness_centrality(G)
cc = OrderedDict(sorted(c_c.items(), key = lambda x: x[1], reverse = True))
items = list(cc.items())
node = items[0][0]
    
print('Most central in connectivity: ', nodes[node])

c_c = nx.closeness_centrality(G, distance = 'time')
cc = OrderedDict(sorted(c_c.items(), key = lambda x: x[1], reverse = True))
items = list(cc.items())
node = items[0][0]

print('Most central in connectivity using time as distance: ', nodes[node])

b_c = nx.betweenness_centrality(G, weight = 'count')
bc = OrderedDict(sorted(b_c.items(), key = lambda x: x[1], reverse = True))
items = list(bc.items())
node = items[0][0]

print('Bottleneck node: ', nodes[node])
print()

# Male Graph

print("Results for Male Graph")
print()
c_c = nx.closeness_centrality(m_G)
cc = OrderedDict(sorted(c_c.items(), key = lambda x: x[1], reverse = True))
items = list(cc.items())
node = items[0][0]
    
print('Most central in connectivity: ', nodes[node])

c_c = nx.closeness_centrality(m_G, distance = 'time')
cc = OrderedDict(sorted(c_c.items(), key = lambda x: x[1], reverse = True))
items = list(cc.items())
node = items[0][0]

print('Most central in connectivity using time as distance: ', nodes[node])

b_c = nx.betweenness_centrality(m_G, weight = 'count')
bc = OrderedDict(sorted(b_c.items(), key = lambda x: x[1], reverse = True))
items = list(bc.items())
node = items[0][0]
print('Bottleneck node: ', nodes[node])
print()

# Female Graph

print("Results for Female Graph")
print()
c_c = nx.closeness_centrality(f_G)
cc = OrderedDict(sorted(c_c.items(), key = lambda x: x[1], reverse = True))
items = list(cc.items())
node = items[0][0]
    
print('Most central in connectivity: ', nodes[node])

c_c = nx.closeness_centrality(f_G, distance = 'time')
cc = OrderedDict(sorted(c_c.items(), key = lambda x: x[1], reverse = True))
items = list(cc.items())
node = items[0][0]

print('Most central in connectivity using time as distance: ', nodes[node])

b_c = nx.betweenness_centrality(f_G, weight = 'count')
bc = OrderedDict(sorted(b_c.items(), key = lambda x: x[1], reverse = True))
items = list(bc.items())
node = items[0][0]
print('Bottleneck node: ', nodes[node])


Results for Graph

Most central in connectivity:  Pershing Square North
Most central in connectivity using time as distance:  E 4 St & 2 Ave
Bottleneck node:  1 Ave & E 62 St

Results for Male Graph

Most central in connectivity:  Pershing Square North
Most central in connectivity using time as distance:  E 2 St & 2 Ave
Bottleneck node:  Queens Plaza North & Crescent St

Results for Female Graph

Most central in connectivity:  Pershing Square North
Most central in connectivity using time as distance:  Stanton St & Chrystie St
