<h1>Citibike Network Assignment</h1>
<li>The file, 201801-citibike-tripdata.csv, contains citibike trip data from January 2018 (a reasonably sized file!)
<li>The data:<br>
"tripduration","starttime","stoptime","start station id","start station name","start station latitude","start station longitude","end station id","end station name","end station latitude","end station longitude","bikeid","usertype","birth year","gender"
<li>Each record in the data is a trip 
<li>The data is described at https://www.citibikenyc.com/system-data

<h1>STEP 1: Read the data into a dataframe</h1>
<li>Convert station ids to str if necessary

In [2]:
import pandas as pd
import numpy as np
import networkx as nx
datafile = "201801-citibike-tripdata.csv"
df = pd.read_csv(datafile)
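As an alternative to converting the station ids after loading, the ids can be read as strings directly. The snippet below is a minimal sketch on a made-up two-row CSV (the values are hypothetical, and only three of the real columns are included); it assumes the `dtype` argument of `pd.read_csv` is applied to the two station id columns.

```python
import io
import pandas as pd

# Two made-up rows using a subset of the real column names (hypothetical values).
sample_csv = io.StringIO(
    '"tripduration","start station id","end station id"\n'
    '300,72,79\n'
    '450,79,72\n'
)

# Ask pandas to keep the station ids as strings while parsing.
id_cols = ["start station id", "end station id"]
sample_df = pd.read_csv(sample_csv, dtype={col: str for col in id_cols})
print(sample_df.dtypes)
```

Reading the ids as strings up front avoids a separate `astype(str)` step, although the explicit conversion in STEP 2 works just as well.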


<h1>STEP 2: Basic cleaning</h1>
<li>Remove rows that contain any NaN values (none in this file, but other months do have NaNs)
<li>Convert station ids to str

In [3]:
# drop rows containing NaN values (this file has none)
# convert station ids to str
df.dropna(inplace=True)
df['start station id'] = df['start station id'].astype(str)
df['end station id'] = df['end station id'].astype(str)

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 718994 entries, 0 to 718993
Data columns (total 15 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   tripduration             718994 non-null  int64  
 1   starttime                718994 non-null  object 
 2   stoptime                 718994 non-null  object 
 3   start station id         718994 non-null  object 
 4   start station name       718994 non-null  object 
 5   start station latitude   718994 non-null  float64
 6   start station longitude  718994 non-null  float64
 7   end station id           718994 non-null  object 
 8   end station name         718994 non-null  object 
 9   end station latitude     718994 non-null  float64
 10  end station longitude    718994 non-null  float64
 11  bikeid                   718994 non-null  int64  
 12  usertype                 718994 non-null  object 
 13  birth year               718994 non-null  int64  
 14  gend

<h1>STEP 3: Write a function that returns a graph given a citibike data frame</h1> 
<li>Your function should return two things:
<ol>
<li>a graph
<li>a dictionary with station ids as the key and station name as the value
</ol>
<li>The graph should contain 
<ol>
<li>nodes (station ids)
<li>edges (station id, station id)
<li>edge data 
<ol>
<li>count: number of trips on the edge
<li>time: average duration - pickup to dropoff - on that edge
</ol>
</ol>
<li><b>Note:</b> the edge (x1,y1) is the same as (y1,x1) even though the start station ids and end station ids are flipped in the dataframe
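The note about flipped station ids can be illustrated on a toy frame. This is a sketch only: the station ids "A", "B", "C" and the helper columns `u` and `v` are hypothetical, and the durations are made up. Each (start, end) pair is sorted so that both directions share one key, and the trips are then grouped to get the count and average duration per undirected edge.

```python
import numpy as np
import pandas as pd

# Toy trips: A->B twice, B->A once, A->C once (durations in seconds).
trips = pd.DataFrame({
    "start station id": ["A", "B", "A", "A"],
    "end station id":   ["B", "A", "B", "C"],
    "tripduration":     [300, 600, 900, 120],
})

# Sort each (start, end) pair so that both directions map to the same key.
pairs = np.sort(trips[["start station id", "end station id"]].to_numpy(), axis=1)
trips["u"], trips["v"] = pairs[:, 0], pairs[:, 1]

# One row per undirected edge: number of trips and mean duration.
edge_data = (
    trips.groupby(["u", "v"])["tripduration"]
    .agg(count="size", time="mean")
    .reset_index()
)
print(edge_data)
```

Here the three trips between A and B, regardless of direction, end up on a single edge.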

In [4]:
def get_citibike_graph(df1):
    G = nx.Graph()

    # Map each station id to its name, using both start and end stations
    df2 = df1.copy()
    start_names = dict(zip(df2['start station id'], df2['start station name']))
    end_names = dict(zip(df2['end station id'], df2['end station name']))
    node_names = {**start_names, **end_names}

    # Sort each (start, end) pair so that (x1,y1) and (y1,x1) become the same edge.
    # Assigning the NumPy array directly avoids index alignment problems on filtered frames.
    id_cols = ['start station id', 'end station id']
    df2[id_cols] = np.sort(df2[id_cols].values, axis=1)
    edge_data = df2.groupby(id_cols)['tripduration'].agg(['mean', 'size']).reset_index()

    G.add_nodes_from(node_names.keys())

    # Attach the trip count and the average duration (in seconds) to each edge
    edges = zip(edge_data['start station id'], edge_data['end station id'], edge_data['size'], edge_data['mean'])
    for u, v, count, mean_time in edges:
        G.add_edge(u, v, count=count, time=mean_time)

    return G, node_names
    

<h1>STEP 4: Create the following graphs using the function above</h1>
<li>G: A graph of all the data in the dataframe
<li>m_G: A graph containing only data from male riders
<li>f_G: A graph containing only data from female riders
<li>Note: for m_G and f_G you will need to extract data from the dataframe

In [5]:
G,nodes=get_citibike_graph(df)

In [6]:
m_df = df[df['gender']==1].reset_index(drop=True)
m_G = get_citibike_graph(m_df)[0]

In [7]:
f_df = df[df['gender']==2].reset_index(drop=True)
f_G = get_citibike_graph(f_df)[0]

<h1>STEP 5: Answer the following questions for each of the graphs</h1>
<ol>
<li>Which station (name) is the best connected (max degree)?
<li>Travel between which pair of stations is the longest in terms of average duration between bike pickups and dropoffs? Report both stations as well as the time in minutes.
<li>Which edge is associated with the greatest number of trips?
<li>Which station is the most central?
<li>Which node is a bottleneck node?
</ol>

Which station (name) has the greatest number of connections (max degree)?

In [8]:
def max_station(network):
    mst = max(list(network.degree()),key=lambda x : x[1])[0]
    return nodes[mst]

print("Busiest station for all: " + max_station(G))
print("Busiest station for males: " + max_station(m_G))
print("Busiest station for females: " + max_station(f_G))


Busiest station for all: Pershing Square North
Busiest station for males: Pershing Square North
Busiest station for females: Pershing Square North
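Max degree counts distinct neighbouring stations, which is not necessarily the same as the station with the most trips. The sketch below, on a hypothetical toy graph with made-up counts, contrasts the plain degree with the degree weighted by the `count` edge attribute. The names `G_toy`, `best_connected`, and `busiest` are illustrative only.

```python
import networkx as nx

# Toy graph: A touches the most stations, B carries the most trips.
G_toy = nx.Graph()
G_toy.add_edge("A", "B", count=50)
G_toy.add_edge("A", "C", count=1)
G_toy.add_edge("A", "D", count=1)
G_toy.add_edge("B", "C", count=100)

# Plain degree: number of distinct neighbours.
best_connected = max(G_toy.degree(), key=lambda pair: pair[1])[0]

# Weighted degree: sum of trip counts over incident edges.
busiest = max(G_toy.degree(weight="count"), key=lambda pair: pair[1])[0]

print("Best connected:", best_connected)
print("Most trips:", busiest)
```

In this toy case the two measures pick different stations, so it may be worth checking both on the real graphs.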


Travel between which pair of stations is the longest in terms of average duration between bike pickups and dropoffs?

In [9]:
def max_time(network):
    time_li = []
    for edge in network.edges():
        time_li.append((edge, network.get_edge_data(*edge)['time']))
    return max(time_li,key=lambda x : x[1])

In [10]:
for label, graph in [("All", G), ("Male", m_G), ("Female", f_G)]:
    (u, v), t = max_time(graph)  # compute the max edge once per graph
    print(label + ": Longest avg duration is between " + nodes[u] + " and " + nodes[v] + ", " + "{:.2f}".format(t/60) + " minutes")

All: Longest avg duration is between Nassau St & Navy St and Hope St & Union Ave, 325167.48 minutes
Male: Longest avg duration is between Nassau St & Navy St and Hope St & Union Ave, 325167.48 minutes
Female: Longest avg duration is between Adelphi St & Myrtle Ave and NYCBS Depot - 3AV, 73698.82 minutes
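An average of 325167.48 minutes is roughly 226 days, which suggests that these maxima may be driven by a few extreme trips on rarely used edges. One possible way to check this, sketched below under that assumption, is to ignore edges with fewer than some minimum number of trips before taking the maximum. The function name `max_time_edge`, the `min_count` threshold, and the toy values are all hypothetical.

```python
import networkx as nx

def max_time_edge(network, min_count=1):
    """Return ((u, v), minutes) for the slowest edge with at least min_count trips."""
    candidates = [
        (u, v, data)
        for u, v, data in network.edges(data=True)
        if data["count"] >= min_count
    ]
    if not candidates:
        return None
    u, v, data = max(candidates, key=lambda edge: edge[2]["time"])
    return (u, v), data["time"] / 60  # 'time' is stored in seconds

# Toy graph: one extreme single-trip edge and one well-used edge.
G_toy = nx.Graph()
G_toy.add_edge("A", "B", count=1, time=600000)
G_toy.add_edge("B", "C", count=20, time=1800)

unfiltered = max_time_edge(G_toy)
filtered = max_time_edge(G_toy, min_count=5)
print(unfiltered)
print(filtered)
```

The right threshold depends on the question being asked; the value 5 here is only for illustration.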


In [11]:
#Note: I've printed the max edges but you don't need to print them

Which edge is associated with the greatest number of trips?

In [12]:
def max_count(network):
    count_li = []
    for edge in network.edges():
        count_li.append((edge, network.get_edge_data(*edge)['count']))
    return max(count_li,key=lambda x : x[1])

In [13]:
for label, graph in [("All", G), ("Male", m_G), ("Female", f_G)]:
    (u, v), n = max_count(graph)  # compute the max edge once per graph
    print(label + ": Most number of trips is between " + nodes[u] + " and " + nodes[v] + ", " + str(n) + " trips")

All: Most number of trips is between E 7 St & Avenue A and Cooper Square & Astor Pl, 700 trips
Male: Most number of trips is between E 7 St & Avenue A and Cooper Square & Astor Pl, 533 trips
Female: Most number of trips is between E 7 St & Avenue A and Cooper Square & Astor Pl, 161 trips


<h2>Centrality</h2>
One of the concerns that the citibike system has to deal with is ensuring that no station is empty (a bike should always be available) and that no station is full (you should always be able to return a bike). To do this, it needs to monitor the movement of bikes through the system, ideally using a directed graph. Though our graph is not directed, we can look at some network characteristics that will help us answer these questions. Note that the "count" attribute in the edge data captures flows.
<li>Which node is a possible bottleneck node in terms of bike flows?
<li>Which node is the "nearest" to all other nodes (irrespective of flows)?
<li>Which node is the "nearest" to all other nodes (in terms of distance = time)?


In [14]:
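#Which node is a possible bottleneck node in terms of bike flows?
def bottleneck(network):
    # Note: NetworkX treats `weight` as a distance when computing shortest paths,
    # so edges with a high trip count are treated as longer here.
    bottle = nx.betweenness_centrality(network,weight='count')
    neck = max(bottle, key=bottle.get)
    return neck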

print("All: Bottleneck node is "+ nodes[bottleneck(G)])
print("Male: Bottleneck node is "+ nodes[bottleneck(m_G)])
print("Female: Bottleneck node is "+ nodes[bottleneck(f_G)])


All: Bottleneck node is Wythe Ave & Metropolitan Ave
Male: Bottleneck node is Wythe Ave & Metropolitan Ave
Female: Bottleneck node is Kent Ave & N 7 St
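Because `nx.betweenness_centrality` interprets the `weight` attribute as a distance, passing `count` directly makes heavily used edges look long. One possible alternative, sketched below as an assumption rather than as the intended solution, is to use the inverse of the count as the distance so that high-flow edges look short. The attribute name `inv_count` and the toy graph are hypothetical.

```python
import networkx as nx

# Toy graph: heavy flow runs A-B-C, while the direct A-C edge is rarely used.
G_toy = nx.Graph()
G_toy.add_edge("A", "B", count=100)
G_toy.add_edge("B", "C", count=100)
G_toy.add_edge("A", "C", count=1)
G_toy.add_edge("C", "D", count=50)

# Store 1/count on each edge so that busy edges act as short distances.
for u, v, data in G_toy.edges(data=True):
    data["inv_count"] = 1.0 / data["count"]

by_count = nx.betweenness_centrality(G_toy, weight="count")
by_inv_count = nx.betweenness_centrality(G_toy, weight="inv_count")

print("B with weight='count':    ", by_count["B"])
print("B with weight='inv_count':", by_inv_count["B"])
```

In this toy case station B sits on the heaviest route yet gets zero betweenness when counts are used as distances, and a positive score with the inverse. Which weighting is appropriate depends on how "bottleneck" is defined for the assignment.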


In [15]:
#Which node is the "nearest" to all other nodes (irrespective of flows)
def center(network):
    cc = nx.closeness_centrality(network)
    c = max(cc, key=cc.get)
    return c

print("All: " + nodes[center(G)] + " is nearest to all other nodes")
print("Male: " + nodes[center(m_G)] + " is nearest to all other nodes")
print("Female: " + nodes[center(f_G)] + " is nearest to all other nodes")

All: Pershing Square North is nearest to all other nodes
Male: Pershing Square North is nearest to all other nodes
Female: Pershing Square North is nearest to all other nodes


In [16]:
#Which node is the "nearest" to all other nodes (in terms of distance = time)
def center_time(network):
    cc = nx.closeness_centrality(network, distance = 'time')
    c = max(cc, key=cc.get)
    return c

print("All: " + nodes[center_time(G)] + " is nearest to all other nodes(distance=time)")
print("Male: " + nodes[center_time(m_G)] + " is nearest to all other nodes(distance=time)")
print("Female: " + nodes[center_time(f_G)] + " is nearest to all other nodes(distance=time)")

All: E 4 St & 2 Ave is nearest to all other nodes(distance=time)
Male: E 2 St & 2 Ave is nearest to all other nodes(distance=time)
Female: Stanton St & Chrystie St is nearest to all other nodes(distance=time)
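The difference between the two closeness results can be seen on a small example. The sketch below uses a hypothetical toy graph (station labels and times are made up) in which a hub "H" is one hop from every station but all its edges are slow, while "C" has fast links to its neighbours.

```python
import networkx as nx

# Toy graph: H is adjacent to everything, but its edges take 900 seconds each.
G_toy = nx.Graph()
for station in ["B", "C", "D", "E"]:
    G_toy.add_edge("H", station, time=900)
G_toy.add_edge("B", "C", time=60)
G_toy.add_edge("C", "D", time=60)

# Hop-based closeness ignores the 'time' attribute.
hops = nx.closeness_centrality(G_toy)
# Time-based closeness uses 'time' as the edge distance.
timed = nx.closeness_centrality(G_toy, distance="time")

nearest_by_hops = max(hops, key=hops.get)
nearest_by_time = max(timed, key=timed.get)
print("Nearest by hops:", nearest_by_hops)
print("Nearest by time:", nearest_by_time)
```

This mirrors the pattern in the results above, where the hop-based and time-based measures select different stations.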
