# Graph Building

We use [NetworkX](https://networkx.org/) for graph manipulation.

We use the timetable data to create the graph:
- Nodes represent stops (`stop_id`).
- Edges (directed) represent trips. Multiple edges can exist between two nodes.
  
  Edge attributes:
  - `trip_id`
  - `departure_time`
  - `arrival_time`
  - `monday`, ..., `sunday`: bool
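
As a minimal sketch of this edge model, the toy example below stores two trips between the same pair of stops as parallel edges of a `nx.MultiDiGraph`. The stop and trip identifiers, the times, and the name `G_demo` are made up for illustration and do not come from the timetable data.

```python
import networkx as nx

# Toy graph; all identifiers and times below are illustrative only.
G_demo = nx.MultiDiGraph()

# Service-day flags: this hypothetical service runs on weekdays only.
weekdays = {day: True for day in ("monday", "tuesday", "wednesday", "thursday", "friday")}
weekend = {day: False for day in ("saturday", "sunday")}

# Two trips between the same pair of stops become two parallel directed edges.
G_demo.add_edge("stop_A", "stop_B", trip_id="trip_1",
                departure_time="08:00:00", arrival_time="08:05:00",
                **weekdays, **weekend)
G_demo.add_edge("stop_A", "stop_B", trip_id="trip_2",
                departure_time="08:30:00", arrival_time="08:35:00",
                **weekdays, **weekend)

# For a multigraph, get_edge_data returns {edge_key: attribute_dict}.
parallel_edges = G_demo.get_edge_data("stop_A", "stop_B")
trip_ids = sorted(attrs["trip_id"] for attrs in parallel_edges.values())
print(trip_ids)
```

A plain `nx.DiGraph` would keep only one edge per ordered pair of stops, so the second trip would overwrite the first.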

We also add walking edges between stops that are less than 500 m apart.
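
One possible way to add such walking edges is sketched below. It assumes that stop coordinates are available as a mapping from `stop_id` to `(latitude, longitude)`, uses the haversine great-circle distance, and compares every pair of stops. The helper names, the `walking` and `distance_m` edge attributes, and the demo coordinates are hypothetical and not taken from the notebook.

```python
import math
from itertools import combinations

import networkx as nx

EARTH_RADIUS_M = 6_371_000   # mean Earth radius in metres
MAX_WALK_DISTANCE_M = 500    # threshold stated in the text

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def add_walking_edges(graph, stop_coords, max_distance_m=MAX_WALK_DISTANCE_M):
    """Add a walking edge in both directions for every pair of nearby stops."""
    for (stop_a, (lat_a, lon_a)), (stop_b, (lat_b, lon_b)) in combinations(stop_coords.items(), 2):
        distance = haversine_m(lat_a, lon_a, lat_b, lon_b)
        if distance < max_distance_m:
            # Walking is possible both ways, so add one directed edge per direction.
            graph.add_edge(stop_a, stop_b, walking=True, distance_m=distance)
            graph.add_edge(stop_b, stop_a, walking=True, distance_m=distance)

# Illustrative coordinates only.
demo_coords = {
    "stop_A": (47.3780, 8.5400),
    "stop_B": (47.3800, 8.5400),  # about 220 m north of stop_A
    "stop_C": (47.4000, 8.5400),  # more than 2 km away from both
}
walk_graph = nx.MultiDiGraph()
add_walking_edges(walk_graph, demo_coords)
```

The pairwise loop is quadratic in the number of stops, so a spatial index would likely be preferable for a full timetable.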

## Load data

In [1]:
from hdfs3 import HDFileSystem
import pandas as pd

hdfs = HDFileSystem(user='eric') 

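def read_hdfs(path):
    """Read every parquet file matching the glob pattern into one DataFrame."""
    files = hdfs.glob(path)
    dfs = []
    for file in files:
        with hdfs.open(file) as f:
            dfs.append(pd.read_parquet(f))
    # ignore_index avoids duplicate row labels across the concatenated files
    return pd.concat(dfs, ignore_index=True)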

hdfs.ls("/user/tshen/final-assn/parquet")

trips = read_hdfs('/user/tshen/final-assn/parquet/trips/*.parquet')
stop_times = read_hdfs('/user/tshen/final-assn/parquet/stop_times/*.parquet')

In [2]:
# trips = trips.astype({
#     'service_id': 'string',
#     'route_id': 'string',
#     'trip_id': 'string',
#     'monday': 'bool', 
#     'tuesday': 'bool',
#     'wednesday': 'bool',
#     'thursday': 'bool',
#     'friday': 'bool',
#     'saturday': 'bool',
#     'sunday': 'bool'
# })

# stop_times = stop_times.astype({
#     'stop_id': 'string',
#     'trip_id': 'string',
#     'arrival_time': 'string',
#     'departure_time': 'string'
# })

# trips.info()
# trips.head()

# stop_times.info()
# stop_times.head()

## Build Graph

In [3]:
import networkx as nx

In [4]:
G = nx.MultiDiGraph()

In [None]:
def add_trip_to_graph(trip_df):
    # Sort by stop_sequence so that consecutive rows are consecutive stops.
    # sort_values returns a new DataFrame, so the result has to be assigned.
    trip_df = trip_df.sort_values('stop_sequence')
    # Pair each stop with the stop that precedes it on the same trip.
    previous_stop_ids = trip_df['stop_id'].shift(1)
    # Skip the first row: the first stop of a trip has no predecessor.
    # Edges are added without attributes at this point.
    G.add_edges_from(zip(previous_stop_ids.iloc[1:], trip_df['stop_id'].iloc[1:]))

stop_times.merge(trips, on='trip_id', how='inner').groupby('trip_id').apply(add_trip_to_graph)
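
The edge attributes listed at the top (`trip_id`, `departure_time`, `arrival_time` and the weekday flags) could be attached along the following lines. This is a sketch on a toy DataFrame, not the notebook's own code: the helper `trip_edges_with_attributes`, the demo data, and the convention that an edge carries the departure time of its origin stop and the arrival time of its destination stop are all assumptions.

```python
import networkx as nx
import pandas as pd

DAYS = ['monday', 'tuesday', 'wednesday', 'thursday', 'friday', 'saturday', 'sunday']

def trip_edges_with_attributes(trip_df):
    """Return (origin, destination, attrs) tuples for consecutive stops of one trip."""
    trip_df = trip_df.sort_values('stop_sequence')
    origins = trip_df.iloc[:-1]       # every stop except the last
    destinations = trip_df.iloc[1:]   # every stop except the first
    edges = []
    for (_, origin), (_, destination) in zip(origins.iterrows(), destinations.iterrows()):
        attrs = {
            'trip_id': origin['trip_id'],
            'departure_time': origin['departure_time'],   # leaving the origin stop (assumed convention)
            'arrival_time': destination['arrival_time'],  # reaching the next stop (assumed convention)
        }
        attrs.update({day: bool(origin[day]) for day in DAYS})
        edges.append((origin['stop_id'], destination['stop_id'], attrs))
    return edges

# Toy trip with rows deliberately out of order; all values are illustrative.
demo = pd.DataFrame({
    'trip_id': ['t1', 't1', 't1'],
    'stop_id': ['C', 'A', 'B'],
    'stop_sequence': [3, 1, 2],
    'arrival_time': ['08:20:00', '08:00:00', '08:10:00'],
    'departure_time': ['08:21:00', '08:01:00', '08:11:00'],
    **{day: [True] * 3 for day in DAYS},
})

demo_graph = nx.MultiDiGraph()
# add_edges_from accepts (u, v, attribute_dict) tuples.
demo_graph.add_edges_from(trip_edges_with_attributes(demo))
```

Returning the edge tuples instead of mutating a global graph keeps the helper easy to test on a single trip before applying it to every `trip_id` group.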

In [None]:
print(len(G.nodes))
print(len(G.edges))