<h1>Citibike Network Assignment</h1>
<li>The file, 201801-citibike-tripdata.csv, contains Citibike trip data from January 2018 (a reasonably sized file!)
<li>The data:<br>
"tripduration","starttime","stoptime","start station id","start station name","start station latitude","start station longitude","end station id","end station name","end station latitude","end station longitude","bikeid","usertype","birth year","gender"
<li>Each record in the data is a trip 
<li>The data is described at https://www.citibikenyc.com/system-data

<h1>STEP 1: Read the data into a dataframe</h1>
<li>Convert station ids to str if necessary

In [1]:
import pandas as pd
import numpy as np
datafile = "201801-citibike-tripdata.csv"
df = pd.read_csv(datafile)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 718994 entries, 0 to 718993
Data columns (total 15 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   tripduration             718994 non-null  int64  
 1   starttime                718994 non-null  object 
 2   stoptime                 718994 non-null  object 
 3   start station id         718994 non-null  int64  
 4   start station name       718994 non-null  object 
 5   start station latitude   718994 non-null  float64
 6   start station longitude  718994 non-null  float64
 7   end station id           718994 non-null  int64  
 8   end station name         718994 non-null  object 
 9   end station latitude     718994 non-null  float64
 10  end station longitude    718994 non-null  float64
 11  bikeid                   718994 non-null  int64  
 12  usertype                 718994 non-null  object 
 13  birth year               718994 non-null  int64  
 14  gend
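
As an alternative to converting the station ids after loading, the ids can be read as strings directly. The sketch below is only an illustration: it uses two made-up rows in an in-memory buffer (the name `sample_csv` and the row values are assumptions, not part of the real file) and shows the `dtype` and `parse_dates` arguments of `pd.read_csv`.

```python
import io
import pandas as pd

# Two made-up rows standing in for the real CSV (only a few columns shown).
sample_csv = io.StringIO(
    '"tripduration","starttime","start station id","end station id"\n'
    '970,"2018-01-01 13:50:57.4340","72","505"\n'
    '723,"2018-01-01 15:33:30.1820","72","3255"\n'
)

trips = pd.read_csv(
    sample_csv,
    # Station ids are labels, not quantities, so keep them as strings.
    dtype={"start station id": str, "end station id": str},
    # Optional: parse the timestamp column while reading.
    parse_dates=["starttime"],
)
print(trips.dtypes)
```

With the real file, the same keyword arguments could be passed to the `pd.read_csv(datafile)` call.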

<h1>STEP 2: Basic cleaning</h1>
<li>Remove rows that contain any NaN values (there are none in this file, but other months do have NaNs)
<li>Remove trips that start and end at the same station
<li>Convert station ids to str

In [2]:
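# Drop every row that has a missing value in any column
df.dropna(axis=0, inplace=True)
# Drop round trips (same start and end station); .copy() avoids
# SettingWithCopyWarning if columns are added to df later
df = df[df['start station id'] != df['end station id']].copy()
# Station ids are labels, so store them as strings
df = df.astype({'end station id': str, 'start station id': str})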

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 709637 entries, 0 to 718993
Data columns (total 15 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   tripduration             709637 non-null  int64  
 1   starttime                709637 non-null  object 
 2   stoptime                 709637 non-null  object 
 3   start station id         709637 non-null  object 
 4   start station name       709637 non-null  object 
 5   start station latitude   709637 non-null  float64
 6   start station longitude  709637 non-null  float64
 7   end station id           709637 non-null  object 
 8   end station name         709637 non-null  object 
 9   end station latitude     709637 non-null  float64
 10  end station longitude    709637 non-null  float64
 11  bikeid                   709637 non-null  int64  
 12  usertype                 709637 non-null  object 
 13  birth year               709637 non-null  int64  
 14  gend

<h1>STEP 3: Write a function that returns a graph given a citibike data frame</h1> 
<li>Your function should return two things:
<ol>
<li>a graph
<li>a dictionary with station ids as the key and station name as the value
</ol>
<li>The graph should contain 
<ol>
<li>nodes (station ids)
<li>edges (station id, station id)
<li>edge data 
<ol>
<li>count: number of trips on the edge
<li>time: average duration - pickup to dropoff - on that edge
</ol>
</ol>
<li><b>Note:</b> the edge (x1,y1) is the same as (y1,x1) even though the start station ids and end station ids are flipped in the dataframe
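
The note about flipped station ids can be illustrated on a toy example. This is a minimal sketch with made-up stations "A", "B", "C" and made-up durations; the column names `start`, `end`, `u`, `v` are assumptions chosen for the illustration. Each pair is sorted so both directions share one key (any consistent ordering works), and the trips are then aggregated per undirected edge.

```python
import networkx as nx
import pandas as pd

# Toy trips: A->B twice, B->A once, A->C once (durations in seconds).
toy = pd.DataFrame({
    "start": ["A", "B", "A", "A"],
    "end": ["B", "A", "B", "C"],
    "tripduration": [600, 300, 900, 120],
})

# Canonical edge key: sort each (start, end) pair so (A, B) and (B, A) match.
pairs = [tuple(sorted(p)) for p in zip(toy["start"], toy["end"])]
toy["u"] = [p[0] for p in pairs]
toy["v"] = [p[1] for p in pairs]

# One row per undirected edge: number of trips and mean duration.
stats = toy.groupby(["u", "v"])["tripduration"].agg(["count", "mean"])

G_toy = nx.Graph()
for (u, v), row in stats.iterrows():
    G_toy.add_edge(u, v, count=int(row["count"]), time=float(row["mean"]))

# An undirected graph returns the same edge data in either direction.
print(G_toy["B"]["A"])
```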

In [4]:
import networkx as nx

def get_citibike_graph(df):
    """Return (G, node_names) for a citibike trip dataframe.

    G is an undirected graph whose edges carry 'count' (number of trips)
    and 'time' (average tripduration on that edge).
    node_names maps station id -> station name.
    """
    s = 'start station id'
    e = 'end station id'

    # Collect (id, name) pairs from both the start and the end columns,
    # so stations that only ever appear as a destination are included.
    starts = df[[s, 'start station name']].rename(
        columns={s: 'id', 'start station name': 'name'})
    ends = df[[e, 'end station name']].rename(
        columns={e: 'id', 'end station name': 'name'})
    stations = pd.concat([starts, ends]).drop_duplicates('id')
    node_names = dict(zip(stations['id'], stations['name']))

    # (x, y) and (y, x) are the same edge: the numerically smaller id goes first
    swap = df[s].astype(int) > df[e].astype(int)
    trips = pd.DataFrame({
        'Start_id': df[s].where(~swap, df[e]),
        'End_id': df[e].where(~swap, df[s]),
        'tripduration': df['tripduration'],
    })

    # Trip count and average duration per undirected edge
    stats = (trips.groupby(['Start_id', 'End_id'])['tripduration']
             .agg(['count', 'mean']).reset_index())

    G = nx.Graph()
    for u, v, count, ave_time in zip(stats['Start_id'], stats['End_id'],
                                     stats['count'], stats['mean']):
        G.add_edge(u, v, count=count, time=ave_time)
    return G, node_names

In [8]:
G,nodes=get_citibike_graph(df)

In [6]:
len(nodes)


<h1>STEP 4: Create the following graphs using the function above</h1>
<li>G: A graph of all the data in the dataframe
<li>m_G: A graph containing only data from male riders
<li>f_G: A graph containing only data from female riders
<li>Note: for m_G and f_G you will need to extract data from the dataframe

In [None]:
G,nodes=get_citibike_graph(df)

In [None]:
#df['Female']=np.where(df['gender']==2,1,0)
#df['Male']=np.where(df['gender']==1,1,0)
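
One possible way to extract the male and female trips is to filter on the `gender` column. The sketch below assumes the coding implied by the commented-out lines above (1 for male, 2 for female); `split_by_gender` and the toy frame are hypothetical names introduced only for this illustration.

```python
import pandas as pd

def split_by_gender(trips):
    """Return (male_trips, female_trips), assuming gender 1 = male, 2 = female."""
    male = trips[trips["gender"] == 1].copy()
    female = trips[trips["gender"] == 2].copy()
    return male, female

# Toy frame: any other gender value ends up in neither subset.
toy = pd.DataFrame({"gender": [0, 1, 2, 1], "tripduration": [100, 200, 300, 400]})
male_trips, female_trips = split_by_gender(toy)
print(len(male_trips), len(female_trips))

# With the real dataframe one would then call, for example:
# m_G, m_nodes = get_citibike_graph(male_trips)
# f_G, f_nodes = get_citibike_graph(female_trips)
```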

<h1>STEP 5: Answer the following questions for each of the graphs</h1>
<ol>
<li>Which station (name) is the best connected (max degree)?
<li>Which pair of stations has the longest average trip duration between bike pickup and dropoff? Report both stations as well as the time in minutes.
<li>Which edge is associated with the largest number of trips?
<li>Which station is the most central?
<li>Which node is a bottleneck node?
</ol>

Which station (name) has the greatest number of connections (max degree)?
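
A minimal sketch of one way to answer this, shown on a made-up four-station graph. The helper name `best_connected` and the toy station names are assumptions for illustration; on the real data it would be called with a graph and the id-to-name dictionary from Step 3.

```python
import networkx as nx

def best_connected(G, names):
    """Return (station name, degree) for the node with the highest degree."""
    station, degree = max(G.degree(), key=lambda pair: pair[1])
    return names.get(station, station), degree

# Toy graph: station "1" is linked to all three other stations.
G_toy = nx.Graph()
G_toy.add_edges_from([("1", "2"), ("1", "3"), ("1", "4"), ("2", "3")])
names_toy = {"1": "Hub St", "2": "Second Ave", "3": "Third Ave", "4": "Fourth Ave"}

print(best_connected(G_toy, names_toy))
```

If several stations tie for the maximum degree, `max` returns only the first one it encounters.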

Which pair of stations has the longest average trip duration between bike pickup and dropoff?

In [None]:
#Note: I've printed the max edges but you don't need to print them
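
A minimal sketch of one approach, assuming each edge stores its average duration in a `time` attribute measured in seconds (the Citibike `tripduration` unit), so dividing by 60 gives minutes. `longest_edge` and the toy data are hypothetical.

```python
import networkx as nx

def longest_edge(G, names):
    """Return (name_u, name_v, minutes) for the edge with the largest 'time'."""
    u, v, seconds = max(G.edges(data="time"), key=lambda edge: edge[2])
    return names.get(u, u), names.get(v, v), seconds / 60

# Toy graph with average durations in seconds.
G_toy = nx.Graph()
G_toy.add_edge("1", "2", count=5, time=600.0)
G_toy.add_edge("2", "3", count=1, time=5400.0)
names_toy = {"1": "Hub St", "2": "Second Ave", "3": "Third Ave"}

result = longest_edge(G_toy, names_toy)
print(result)
```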

Which edge is associated with the largest number of trips?
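
One way to rank edges by trip volume is to sort them on the `count` attribute. This is a sketch on made-up data; `busiest_edges` and its parameter `k` are hypothetical names, not part of the assignment.

```python
import networkx as nx

def busiest_edges(G, names, k=1):
    """Return the k edges with the most trips as (name_u, name_v, trips)."""
    ranked = sorted(G.edges(data="count"), key=lambda edge: edge[2], reverse=True)
    return [(names.get(u, u), names.get(v, v), trips) for u, v, trips in ranked[:k]]

# Toy graph with trip counts on each edge.
G_toy = nx.Graph()
G_toy.add_edge("1", "2", count=5, time=600.0)
G_toy.add_edge("2", "3", count=12, time=480.0)
G_toy.add_edge("1", "3", count=7, time=900.0)
names_toy = {"1": "Hub St", "2": "Second Ave", "3": "Third Ave"}

top_two = busiest_edges(G_toy, names_toy, k=2)
print(top_two)
```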

<h2>Centrality</h2>
One of the concerns that the Citibike system has to deal with is ensuring that no station is ever empty (a bike should always be available) and that no station is ever full (you should always be able to return a bike). To do this, it needs to monitor the movement of bikes through the system, ideally using a directed graph. Though our graph is not directed, we can look at some network characteristics that will help us answer these questions. Note that the "count" attribute in the edge data captures flows.
<li>Which node is a possible bottleneck node in terms of bike flows?
<li>Which node is the "nearest" to all other nodes (irrespective of flows)?
<li>Which node is the "nearest" to all other nodes (in terms of distance = time)?
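
These three questions map fairly naturally onto standard centrality measures, though other readings are possible. The sketch below assumes betweenness centrality as a proxy for "bottleneck", unweighted closeness for "nearest irrespective of flows", and closeness with `distance="time"` for "nearest in terms of time". The toy graph and the helper `top_node` are made up for illustration. Note that unweighted betweenness ignores trip counts; incorporating flows (for example through a weight derived from `count`) would be a further modeling choice.

```python
import networkx as nx

def top_node(scores):
    """Return the key with the highest score in a {node: score} dict."""
    return max(scores, key=scores.get)

# Toy "bowtie": two triangles that share station "3".
G_toy = nx.Graph()
for u, v in [("1", "2"), ("1", "3"), ("2", "3"), ("3", "4"), ("3", "5"), ("4", "5")]:
    G_toy.add_edge(u, v, count=10, time=300.0)

# Bottleneck candidate: the node lying on the most shortest paths.
bottleneck = top_node(nx.betweenness_centrality(G_toy))

# Nearest to all other nodes, counting hops only.
nearest_hops = top_node(nx.closeness_centrality(G_toy))

# Nearest to all other nodes, using average trip time as the edge length.
nearest_time = top_node(nx.closeness_centrality(G_toy, distance="time"))

print(bottleneck, nearest_hops, nearest_time)
```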
