# Graph Building

We use [NetworkX](https://networkx.org/) for graph manipulation.

We build the graph from the timetable data.
- Nodes represent stops (`stop_id`).
- Edges (directed) represent trips. Multiple edges can exist between the same two nodes, since many trips connect the same pair of stops.

  Edge attributes:
  - `trip_id`
  - `departure_time`
  - `arrival_time`
  - `monday`, ..., `sunday`: bool

We also add walking edges between stops that are less than 500 m apart.
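The walking-edge rule can be sketched as follows. This is a minimal illustration rather than the notebook's actual implementation: the `stops` mapping of `stop_id` to `(lat, lon)`, the attribute names `walking`, `distance_m` and `duration_s`, and the walking speed of 50 m per minute are all assumptions made for the example. Only the 500 m threshold comes from the text.

```python
import math
from itertools import combinations

EARTH_RADIUS_M = 6_371_000      # mean Earth radius in metres
MAX_WALK_M = 500                # threshold stated in the text
WALK_SPEED_M_PER_S = 50 / 60    # assumption: 50 m per minute


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def walking_edges(stops, max_dist_m=MAX_WALK_M):
    """Return directed (u, v, attrs) tuples for stops closer than max_dist_m.

    `stops` maps stop_id -> (lat, lon); this layout is hypothetical.
    """
    edges = []
    for (u, (lat_u, lon_u)), (v, (lat_v, lon_v)) in combinations(stops.items(), 2):
        dist = haversine_m(lat_u, lon_u, lat_v, lon_v)
        if dist < max_dist_m:
            attrs = {
                'walking': True,
                'distance_m': dist,
                'duration_s': dist / WALK_SPEED_M_PER_S,
            }
            # Walking is possible in both directions, so emit both edges.
            edges.append((u, v, attrs))
            edges.append((v, u, dict(attrs)))
    return edges


# Made-up coordinates, for illustration only.
example_stops = {
    'A': (47.3780, 8.5400),
    'B': (47.3800, 8.5400),  # roughly 220 m north of A
    'C': (47.4000, 8.5400),  # well over 2 km north of A
}
edges = walking_edges(example_stops)
```

The resulting tuples have the `(u, v, attribute_dict)` shape that NetworkX's `add_edges_from` accepts, so they could be added to the same graph as the trip edges. Comparing all pairs is quadratic in the number of stops, which is acceptable for a few thousand stops but would need a spatial index for much larger networks.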

## Warning: 2 GB of memory is needed to run this notebook!

## Load data

In [1]:
from hdfs3 import HDFileSystem
import pandas as pd

hdfs = HDFileSystem(user='eric') 

def read_hdfs(path):
    files = hdfs.glob(path)
    dfs = []
    for file in files:
        with hdfs.open(file) as f:
            dfs.append(pd.read_parquet(f))
    return pd.concat(dfs)

hdfs.ls("/user/tshen/final-assn/parquet")

trips = read_hdfs('/user/tshen/final-assn/parquet/trips/*.parquet')
stop_times = read_hdfs('/user/tshen/final-assn/parquet/stop_times/*.parquet')

In [2]:
trips = trips.astype({
    'service_id': 'string',
    'route_id': 'string',
    'trip_id': 'string',
    'monday': 'bool', 
    'tuesday': 'bool',
    'wednesday': 'bool',
    'thursday': 'bool',
    'friday': 'bool',
    'saturday': 'bool',
    'sunday': 'bool'
})

stop_times = stop_times.astype({
    'stop_id': 'string',
    'trip_id': 'string',
    'arrival_time': 'string',
    'departure_time': 'string'
})

trips.info()
trips.head()

stop_times.info()
stop_times.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 160040 entries, 0 to 779
Data columns (total 10 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   service_id  160040 non-null  string
 1   route_id    160040 non-null  string
 2   trip_id     160040 non-null  string
 3   monday      160040 non-null  bool  
 4   tuesday     160040 non-null  bool  
 5   wednesday   160040 non-null  bool  
 6   thursday    160040 non-null  bool  
 7   friday      160040 non-null  bool  
 8   saturday    160040 non-null  bool  
 9   sunday      160040 non-null  bool  
dtypes: bool(7), string(3)
memory usage: 6.0 MB
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2166674 entries, 0 to 4125
Data columns (total 5 columns):
 #   Column          Dtype 
---  ------          ----- 
 0   stop_id         string
 1   trip_id         string
 2   arrival_time    string
 3   departure_time  string
 4   stop_sequence   int32 
dtypes: int32(1), string(4)
memory usage: 90.9 

| | stop_id | trip_id | arrival_time | departure_time | stop_sequence |
|---|---|---|---|---|---|
| 0 | 8587020 | 1.TA.1-N31-j19-1.1.R | 25:25:00 | 25:25:00 | 1 |
| 1 | 8590535 | 1.TA.1-N31-j19-1.1.R | 25:27:00 | 25:27:00 | 2 |
| 2 | 8590222 | 1.TA.1-N31-j19-1.1.R | 25:27:00 | 25:27:00 | 3 |
| 3 | 8590218 | 1.TA.1-N31-j19-1.1.R | 25:28:00 | 25:28:00 | 4 |
| 4 | 8590521 | 1.TA.1-N31-j19-1.1.R | 25:30:00 | 25:30:00 | 5 |
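The sample rows contain times such as `25:25:00`, which are kept as strings. Assuming they follow the GTFS convention, where hours above 23 denote times after midnight on the same service day, a small helper can turn them into seconds so that travel times can be computed. The function name `gtfs_time_to_seconds` is hypothetical and does not appear in the notebook.

```python
def gtfs_time_to_seconds(time_str):
    """Convert 'HH:MM:SS' to seconds since the start of the service day.

    HH may exceed 23 for trips that run past midnight (assumed GTFS convention).
    """
    hours, minutes, seconds = (int(part) for part in time_str.split(':'))
    return hours * 3600 + minutes * 60 + seconds


# Times taken from the first two sample rows.
departure = gtfs_time_to_seconds('25:25:00')
arrival = gtfs_time_to_seconds('25:27:00')
travel_time_s = arrival - departure
```

Zero-padded strings of this form still sort correctly when compared as text, but differences and sums require a numeric representation like the one above.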


## Build Graph

In [3]:
import networkx as nx

In [4]:
G = nx.MultiDiGraph()

In [6]:
%%timeit -n1 -r1
def add_trip_to_graph(trip_df):
    # Sort by stop_sequence so that consecutive rows are consecutive stops of the trip.
    trip_df = trip_df.sort_values('stop_sequence')
    trip_df['previous_stop_id'] = trip_df['stop_id'].shift(1)
    trip_df['previous_departure_time'] = trip_df['departure_time'].shift(1)
    # The first stop has no predecessor, so drop the first row.
    trip_df = trip_df.iloc[1:]
    # Add one edge per pair of consecutive stops, carrying the departure time,
    # arrival time and trip id as edge attributes.
    G.add_edges_from(
        (previous_stop, stop,
         {'departure_time': departure, 'arrival_time': arrival, 'trip_id': trip_id})
        for previous_stop, stop, departure, arrival, trip_id in zip(
            trip_df['previous_stop_id'],
            trip_df['stop_id'],
            trip_df['previous_departure_time'],
            trip_df['arrival_time'],
            trip_df['trip_id'],
        )
    )

stop_times.merge(trips, on='trip_id', how='inner').groupby('trip_id').apply(add_trip_to_graph)

5min 6s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
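The pairing logic of `add_trip_to_graph` can be restated without pandas, which may make the role of the `shift(1)` calls easier to see. This is only an illustrative sketch: `trip_edges` is a hypothetical helper, and the rows are the first three sample rows of `stop_times` shown earlier, deliberately listed out of order.

```python
def trip_edges(stop_rows):
    """Build (from_stop, to_stop, attrs) tuples for one trip.

    `stop_rows` is a list of dicts with the stop_times columns.
    """
    # Order the stops along the trip before pairing neighbours.
    ordered = sorted(stop_rows, key=lambda row: row['stop_sequence'])
    edges = []
    # zip(ordered, ordered[1:]) yields each stop together with its successor.
    for previous, current in zip(ordered, ordered[1:]):
        edges.append((
            previous['stop_id'],
            current['stop_id'],
            {
                'departure_time': previous['departure_time'],  # leave previous stop
                'arrival_time': current['arrival_time'],       # reach current stop
                'trip_id': current['trip_id'],
            },
        ))
    return edges


rows = [
    {'stop_id': '8590222', 'trip_id': '1.TA.1-N31-j19-1.1.R',
     'arrival_time': '25:27:00', 'departure_time': '25:27:00', 'stop_sequence': 3},
    {'stop_id': '8587020', 'trip_id': '1.TA.1-N31-j19-1.1.R',
     'arrival_time': '25:25:00', 'departure_time': '25:25:00', 'stop_sequence': 1},
    {'stop_id': '8590535', 'trip_id': '1.TA.1-N31-j19-1.1.R',
     'arrival_time': '25:27:00', 'departure_time': '25:27:00', 'stop_sequence': 2},
]
edges = trip_edges(rows)
```

A trip with n stops yields n - 1 edges, and a trip with a single stop yields none, which matches dropping the first row in the pandas version.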


In [20]:
print(len(G.nodes))
print(len(G.edges))

print(G['8587020']['8590535'])

1812
2006634
{0: {'departure_time': '25:25:00', 'arrival_time': '25:27:00', 'trip_id': '1.TA.1-N31-j19-1.1.R'}, 1: {'departure_time': '28:20:00', 'arrival_time': '28:21:00', 'trip_id': '10.TA.1-N33-j19-1.2.R'}, 2: {'departure_time': '25:20:00', 'arrival_time': '25:21:00', 'trip_id': '11.TA.1-N33-j19-1.3.R'}, 3: {'departure_time': '25:20:00', 'arrival_time': '25:21:00', 'trip_id': '12.TA.1-N33-j19-1.3.R'}, 4: {'departure_time': '26:20:00', 'arrival_time': '26:21:00', 'trip_id': '13.TA.1-N33-j19-1.3.R'}, 5: {'departure_time': '26:20:00', 'arrival_time': '26:21:00', 'trip_id': '14.TA.1-N33-j19-1.3.R'}, 6: {'departure_time': '27:20:00', 'arrival_time': '27:21:00', 'trip_id': '15.TA.1-N33-j19-1.3.R'}, 7: {'departure_time': '26:25:00', 'arrival_time': '26:27:00', 'trip_id': '2.TA.1-N31-j19-1.1.R'}, 8: {'departure_time': '27:25:00', 'arrival_time': '27:27:00', 'trip_id': '3.TA.1-N31-j19-1.1.R'}, 9: {'departure_time': '27:20:00', 'arrival_time': '27:21:00', 'trip_id': '8.TA.1-N33-j19-1.2.R'}, 