# Graph Building

We use [NetworkX](https://networkx.org/) for graph manipulation.

We use the timetable data to create the graph. 
- Nodes represent stops (`stop_id`). 
- Directed edges represent trips. Multiple edges can exist between the same pair of nodes.
  
  Edge attributes:
  - `trip_id`
  - `departure_time`
  - `arrival_time`
  - `monday`, ..., `sunday`: bool
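
The weekday flags make it possible to ignore edges whose trip does not run on a given day. A minimal sketch of such a check follows; the helper `runs_on` and the example attribute values are hypothetical, and only the attribute names come from the list above.

```python
from datetime import date

# Weekday attribute names in the order used by date.weekday() (Monday == 0).
WEEKDAYS = ("monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday")


def runs_on(edge_attrs, day):
    """Return True if the trip behind this edge operates on the given date."""
    return bool(edge_attrs[WEEKDAYS[day.weekday()]])


# Hypothetical edge attributes, shaped like the list above.
example_edge = {
    "trip_id": "trip-A",
    "departure_time": "08:00:00",
    "arrival_time": "08:05:00",
    "monday": True, "tuesday": True, "wednesday": True, "thursday": True,
    "friday": True, "saturday": False, "sunday": False,
}

print(runs_on(example_edge, date(2019, 5, 13)))  # 2019-05-13 is a Monday
print(runs_on(example_edge, date(2019, 5, 18)))  # 2019-05-18 is a Saturday
```

The same check works on the 0/1 integers stored in `trips`, since `bool()` is applied to the flag.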

We also add walking edges between stops that are less than 500 m apart.
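
One possible way to find the stop pairs for walking edges is sketched below. It assumes each stop has a latitude and longitude (the notebook does not load a stops table, so the `stops` dictionary here is hypothetical) and uses the haversine great-circle distance, which may differ from the distance measure actually intended.

```python
from itertools import combinations
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres
MAX_WALK_M = 500            # threshold taken from the text


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


def walking_pairs(stops, max_dist=MAX_WALK_M):
    """Yield (stop_a, stop_b, distance_m) for every pair closer than max_dist."""
    for (a, (lat_a, lon_a)), (b, (lat_b, lon_b)) in combinations(stops.items(), 2):
        d = haversine_m(lat_a, lon_a, lat_b, lon_b)
        if d < max_dist:
            yield a, b, d


# Hypothetical stops: stop_id -> (latitude, longitude)
stops = {
    "A": (47.3780, 8.5400),
    "B": (47.3790, 8.5410),  # a short walk from A
    "C": (47.4500, 8.5600),  # several kilometres away
}
pairs = list(walking_pairs(stops))
print(pairs)
```

Because the graph is directed, each pair would need an edge in both directions. The pairwise loop is quadratic in the number of stops, so a spatial index would likely be needed for a full timetable.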

## Load data

In [1]:
from hdfs3 import HDFileSystem
import pandas as pd

hdfs = HDFileSystem(user='eric') 

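def read_hdfs(path):
    """Read every Parquet file matching the glob `path` into one DataFrame."""
    files = hdfs.glob(path)
    dfs = []
    for file in files:
        with hdfs.open(file) as f:
            dfs.append(pd.read_parquet(f))
    # Each file keeps its own row index, so index values repeat in the result.
    return pd.concat(dfs)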

hdfs.ls("/user/tshen/final-assn/parquet")

trips = read_hdfs('/user/tshen/final-assn/parquet/trips/*.parquet')
stop_times = read_hdfs('/user/tshen/final-assn/parquet/stop_times/*.parquet')

In [2]:
trips.info()
stop_times.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 160040 entries, 0 to 779
Data columns (total 11 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   service_id    160040 non-null  object
 1   route_id      160040 non-null  object
 2   trip_id       160040 non-null  object
 3   direction_id  160040 non-null  int32 
 4   monday        160040 non-null  int32 
 5   tuesday       160040 non-null  int32 
 6   wednesday     160040 non-null  int32 
 7   thursday      160040 non-null  int32 
 8   friday        160040 non-null  int32 
 9   saturday      160040 non-null  int32 
 10  sunday        160040 non-null  int32 
dtypes: int32(8), object(3)
memory usage: 9.8+ MB
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2166674 entries, 0 to 4125
Data columns (total 4 columns):
 #   Column          Dtype 
---  ------          ----- 
 0   stop_id         object
 1   trip_id         object
 2   arrival_time    object
 3   departure_time  object
dtypes
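
Both time columns are stored as strings, and the timetable contains values past midnight such as `25:25:00`, which standard time parsers reject. A small sketch of one way to make them comparable is to convert them to seconds since the start of the service day; the helper name `to_seconds` is hypothetical.

```python
def to_seconds(hhmmss):
    """Convert an 'HH:MM:SS' string to seconds; hours may exceed 23."""
    h, m, s = (int(part) for part in hhmmss.split(":"))
    return h * 3600 + m * 60 + s


print(to_seconds("25:25:00"))  # 91500
```

Applied to the notebook's data this could look like `stop_times['departure_time'].map(to_seconds)`.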

## Build Graph

In [4]:
import networkx as nx

In [5]:
G = nx.MultiDiGraph()

In [6]:
for row in stop_times.merge(trips, on='trip_id', how='inner').itertuples():
    pass  # TODO: add an edge for each pair of consecutive stops on a trip

   stop_id               trip_id arrival_time departure_time
0  8587020  1.TA.1-N31-j19-1.1.R     25:25:00       25:25:00
1  8590535  1.TA.1-N31-j19-1.1.R     25:27:00       25:27:00
2  8590222  1.TA.1-N31-j19-1.1.R     25:27:00       25:27:00
3  8590218  1.TA.1-N31-j19-1.1.R     25:28:00       25:28:00
4  8590521  1.TA.1-N31-j19-1.1.R     25:30:00       25:30:00
  service_id        route_id                   trip_id  direction_id  monday  \
0   TA+b0b46     26-18-j19-1      1.TA.26-18-j19-1.1.H             0       1   
1   TA+b0a2k  63-138-Y-j19-1   1.TA.63-138-Y-j19-1.1.H             0       0   
2   TA+b001t     26-77-j19-1     10.TA.26-77-j19-1.1.H             0       0   
3   TA+b07dj    42-1-Y-j19-1    10.TA.42-1-Y-j19-1.9.H             0       1   
4   TA+b090k   80-55-Y-j19-1  10.TA.80-55-Y-j19-1.10.H             0       0   

   tuesday  wednesday  thursday  friday  saturday  sunday  
0        1          1         1       1         0       0  
1        0          1         0  

In [8]:
print(len(G.nodes))
print(len(G.edges))

0
0
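
The graph is still empty because the loop body is a TODO. The sketch below shows one way the edges could be derived, assuming that consecutive stops of a trip can be recovered by sorting on `departure_time`, since `stop_times` has no sequence column here. The function `trip_edges` and the sample rows are hypothetical and use plain dictionaries so that the example runs on its own.

```python
from collections import defaultdict


def trip_edges(rows):
    """Yield (from_stop, to_stop, attrs) for consecutive stops of each trip.

    `rows` is an iterable of dicts with the keys stop_id, trip_id,
    arrival_time and departure_time (zero-padded 'HH:MM:SS' strings).
    """
    by_trip = defaultdict(list)
    for row in rows:
        by_trip[row["trip_id"]].append(row)

    for trip_id, stops in by_trip.items():
        # Assumption: zero-padded time strings sort chronologically.
        # sorted() is stable, so stops with equal times keep their input order.
        stops = sorted(stops, key=lambda r: r["departure_time"])
        for cur, nxt in zip(stops, stops[1:]):
            yield cur["stop_id"], nxt["stop_id"], {
                "trip_id": trip_id,
                "departure_time": cur["departure_time"],
                "arrival_time": nxt["arrival_time"],
            }


# Hypothetical rows: two trips that both travel from S1 to S2.
rows = [
    {"stop_id": "S1", "trip_id": "T1", "arrival_time": "08:00:00", "departure_time": "08:00:00"},
    {"stop_id": "S2", "trip_id": "T1", "arrival_time": "08:05:00", "departure_time": "08:05:00"},
    {"stop_id": "S3", "trip_id": "T1", "arrival_time": "08:10:00", "departure_time": "08:10:00"},
    {"stop_id": "S1", "trip_id": "T2", "arrival_time": "09:00:00", "departure_time": "09:00:00"},
    {"stop_id": "S2", "trip_id": "T2", "arrival_time": "09:07:00", "departure_time": "09:07:00"},
]
edges = list(trip_edges(rows))
for u, v, attrs in edges:
    print(u, v, attrs["trip_id"])
```

In the notebook, the rows could come from the merged frame via `.to_dict("records")`, and each tuple could be added with `G.add_edge(u, v, **attrs)`. The weekday flags would be copied from the merged row in the same way. Since `G` is a `MultiDiGraph`, the two S1 to S2 edges are kept as separate parallel edges.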
