## Walkthrough

We dealt with the following data:

- `stop_times` (timetable data)
- `calendar` (timetable data)
- `routes` (timetable data)
- `trips` (timetable data)
- `stops` (timetable data)
- `actual_condition` (actual SBB data)


They are loaded in the cells below. To inspect a DataFrame's structure, use `dataName.printSchema()`; to view a few (e.g. 5) rows, use `dataName.show(5)`.

The timetable data are deliberately read only from the snapshot recorded on 2022/06/01. Recall that:
> The timetables are updated weekly. It is ok to assume that the weekly changes are small, and a timetable for a given week is thus the same for the full year - use the schedule of the most recent week for the day of the trip.

It is also far too expensive to load all the data (in fact I tried, and the session kept crashing).

---

**Preprocessed data:**
- `stops_in_15`: all stops within a 15 km radius (stops outside that range are removed);
- `walk_map`: walking time between walkable stops, date-independent;
- `weekday_trans`: all timetable data joined, with weekends and non-business hours filtered out (within the 15 km range);
- `trans_map`: time a trip takes from one stop to the next (date-dependent), similar to `walk_map`.

---

**UI:**
- Fuzzy search is not supported; exact stop names must be used;
- The arrival time input format is HH:MM, no spaces.
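
The arrival-time field is split on the colon and converted to fractional hours, as `create_schedule` does further down. A minimal sketch of that conversion with basic validation added; the helper name `parse_arrival_time` and the range checks are assumptions for illustration and are not part of the notebook:

```python
def parse_arrival_time(text):
    """Convert an 'HH:MM' string to fractional hours (e.g. '12:30' -> 12.5)."""
    parts = text.split(":")
    if len(parts) != 2:
        raise ValueError("expected HH:MM, got %r" % text)
    hour, minute = int(parts[0]), int(parts[1])
    # Range checks are an added assumption; the notebook itself does not validate.
    if not (0 <= hour <= 23 and 0 <= minute <= 59):
        raise ValueError("time out of range: %r" % text)
    return hour + minute / 60


print(parse_arrival_time("12:30"))  # 12.5
```

Rejecting malformed input early would avoid writing an unusable arrival time to `info.csv`.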

---

**Graph:**
Directed graph whose nodes are `stop_id`s; each edge carries two values:
- `time`: the time in seconds it takes to get from one node to the other;
- `trip_id`: the id of the trip; for a walking edge, `trip_id` = 'walk'.
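
To make the edge layout concrete, here is a small sketch that uses plain dictionaries instead of `networkx`, so it runs without dependencies. The stop and trip ids are made up, and `edge_data` is a hypothetical helper that mimics looking up one directed edge:

```python
# Each row mirrors the columns used for edges: (trip_id, from_stop, to_stop, seconds).
# All values here are invented for illustration.
rows = [
    ("trip_A", "stop_1", "stop_2", 60.0),
    ("trip_A", "stop_2", "stop_3", 120.0),
    ("walk", "stop_3", "stop_4", 300.0),
]

# adjacency[from_stop][to_stop] -> {"time": ..., "trip_id": ...}
adjacency = {}
for trip_id, src, dst, seconds in rows:
    adjacency.setdefault(src, {})[dst] = {"time": float(seconds), "trip_id": trip_id}


def edge_data(graph, src, dst):
    """Return the attribute dict of the directed edge src -> dst, or None."""
    return graph.get(src, {}).get(dst)


print(edge_data(adjacency, "stop_3", "stop_4"))  # {'time': 300.0, 'trip_id': 'walk'}
print(edge_data(adjacency, "stop_4", "stop_3"))  # None, because edges are directed
```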

In [61]:
%%local
#Installing dependencies
!pip install networkx
!pip install pyarrow
!pip install fastparquet



## Spark stuff

**In peak hours the session might not be able to start** and will throw:
```
The code failed because of a fatal error:
	Session xxxx did not start up in 60 seconds..

Some things to try:
a) Make sure Spark has enough available resources for Jupyter to create a Spark context.
b) Contact your Jupyter administrator to make sure the Spark magics library is configured correctly.
c) Restart the kernel.
```
The solution is to keep retrying ;) good luck with that!

In [1]:
%%local
import os
import json
from IPython import get_ipython

username = os.environ['RENKU_USERNAME']

configuration = dict(
    name = f"{username}-final-project",
    executorMemory = "4G",
    executorCores = 4,
    numExecutors = 10,
    conf = {
        "spark.jars.repositories": "https://repos.spark-packages.org",
    }
)

get_ipython().run_cell_magic('configure', line="-f", 
                             cell=json.dumps(configuration))

ID,YARN Application ID,Kind,State,Spark UI,Driver log,User,Current session?
3291,application_1680948035106_3036,pyspark,busy,Link,Link,,


In [2]:
%spark

Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,User,Current session?
3400,application_1680948035106_3151,pyspark,idle,Link,Link,,✔


SparkSession available as 'spark'.


In [4]:
%%send_to_spark -i username -t str -n username

Successfully passed 'username' as 'username' to Spark kernel

## Loading data

### Data reading

We use the timetable from one week before the final presentation, but in the year 2022.

In [3]:
orc_file_path = "/data/sbb/part_orc/timetables"
stop_times = spark.read.orc(orc_file_path + "/stop_times/year=2022/month=6/day=1")
calendar = spark.read.orc(orc_file_path + "/calendar/year=2022/month=6/day=1")
routes = spark.read.orc(orc_file_path + "/routes/year=2022/month=6/day=1")
trips = spark.read.orc(orc_file_path + "/trips/year=2022/month=6/day=1")
csv_file_path = "/data/sbb/part_csv/timetables"
stops_csv = spark.read.csv(csv_file_path + "/stops/year=2022/month=06/day=01", header = True)

In [5]:
# ORC files carry their own schema, so the CSV-only options (sep, header, inferSchema) are not needed.
actual_temp = spark.read.orc("/data/sbb/part_orc/istdaten")
actual_condition = actual_temp.withColumnRenamed("betriebstag", "Date_of_trip")\
                .withColumnRenamed("fahrt_bezeichner", "Trip_id")\
                .withColumnRenamed("betreiber_id", "Operator_id")\
                .withColumnRenamed("betreiber_abk", "Operator_abk")\
                .withColumnRenamed("betreiber_name", "Operator_name")\
                .withColumnRenamed("produkt_id", "Transport_type")\
                .withColumnRenamed("linien_id", "Train_number(train)")\
                .withColumnRenamed("linien_text", "Service type(train)")\
                .withColumnRenamed("umlauf_id", "Circulation_id")\
                .withColumnRenamed("verkehrsmittel_text", "Means_of_transport_text")\
                .withColumnRenamed("zusatzfahrt_tf", "If_additional")\
                .withColumnRenamed("faellt_aus_tf", "If_failed")\
                .withColumnRenamed("bpuic", "Stop_id")\
                .withColumnRenamed("haltestellen_name", "Stop_name")\
                .withColumnRenamed("ankunftszeit", "Arrival_time")\
                .withColumnRenamed("an_prognose", "Actual_arrival_time")\
                .withColumnRenamed("abfahrtszeit", "Departure_time")\
                .withColumnRenamed("ab_prognose", "Actual_departure_time")\
                .withColumnRenamed("durchfahrt_tf", "Not_stop")
actual_condition.printSchema()

root
 |-- Date_of_trip: string (nullable = true)
 |-- Trip_id: string (nullable = true)
 |-- Operator_id: string (nullable = true)
 |-- Operator_abk: string (nullable = true)
 |-- Operator_name: string (nullable = true)
 |-- Transport_type: string (nullable = true)
 |-- Train_number(train): string (nullable = true)
 |-- Service type(train): string (nullable = true)
 |-- Circulation_id: string (nullable = true)
 |-- Means_of_transport_text: string (nullable = true)
 |-- If_additional: string (nullable = true)
 |-- If_failed: string (nullable = true)
 |-- Stop_id: string (nullable = true)
 |-- Stop_name: string (nullable = true)
 |-- Arrival_time: string (nullable = true)
 |-- Actual_arrival_time: string (nullable = true)
 |-- an_prognose_status: string (nullable = true)
 |-- Departure_time: string (nullable = true)
 |-- Actual_departure_time: string (nullable = true)
 |-- ab_prognose_status: string (nullable = true)
 |-- Not_stop: string (nullable = true)
 |-- year: integer (nullable = 

## Data preprocessing

We filter out all stations that are more than 15 km away from the given Zurich location, keeping only those within that radius.

For the distance calculation we use the [Haversine formula](https://en.wikipedia.org/wiki/Haversine_formula).
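
For reference, this is the formula as implemented in the cell below, with latitudes $\varphi$ and longitudes $\lambda$ in radians and $R = 6371$ km (the radius used in the code), so the distance $d$ comes out in kilometres:

$$
a = \sin^2\!\left(\frac{\Delta\varphi}{2}\right) + \cos\varphi_1 \cos\varphi_2 \sin^2\!\left(\frac{\Delta\lambda}{2}\right), \qquad
c = 2\,\operatorname{atan2}\!\left(\sqrt{a},\ \sqrt{1-a}\right), \qquad
d = R\,c
$$

Here $\Delta\varphi = \varphi_2 - \varphi_1$ and $\Delta\lambda = \lambda_2 - \lambda_1$.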

In [6]:
from math import sin, cos, sqrt, atan2, radians
import pyspark.sql.functions as F
from pyspark.sql.types import FloatType

# Implement the Haversine formula as a Spark UDF.
@F.udf(returnType=FloatType())
def distance_calculation(latitude_1, longitude_1, latitude_2, longitude_2):
    # The Haversine formula calculates the great-circle distance between two points
    # on a sphere (such as the Earth) based on their latitude and longitude.
    radius_of_Earth = 6371.0  # mean Earth radius in km, so the result is in km
    
    #First, Convert latitude and longitude from degrees to radians
    latitude_1 = radians(float(latitude_1))
    latitude_2 = radians(float(latitude_2))
    longitude_1 = radians(float(longitude_1))
    longitude_2 = radians(float(longitude_2))

    ## Haversine formula implementation
    delta_latitude = latitude_2 - latitude_1
    delta_longitude = longitude_2 - longitude_1

    a = cos(latitude_1)*cos(latitude_2)*sin(delta_longitude/2)**2+sin(delta_latitude/2)**2
    c = 2*atan2(sqrt(a), sqrt(1-a))

    distance = radius_of_Earth * c 
    return distance

print("We have " + str(stops_csv.distinct().count()) + " stops before")

stops_in_15 = stops_csv.where(distance_calculation(F.lit(47.378177), F.lit(8.540192), F.col("stop_lat"), F.col("stop_lon")) <=15)

print("There are " + str(stops_in_15.distinct().count()) + " stops that are in 15km radius of Zurich HB")

stops_in_15.show(5)

We have 40077 stops before
There are 1888 stops that are in 15km radius of Zurich HB
+-------------+--------------------+----------------+----------------+-------------+--------------+
|      stop_id|           stop_name|        stop_lat|        stop_lon|location_type|parent_station|
+-------------+--------------------+----------------+----------------+-------------+--------------+
|      8500926|Oetwil a.d.L., Sc...|47.4236270123012| 8.4031825286317|         null|          null|
|      8502186|Dietikon Stoffelbach|47.3933267759652|8.39896044679575|         null| Parent8502186|
|8502186:0:1/2|Dietikon Stoffelbach|47.3933997509195|8.39894248049007|         null| Parent8502186|
|      8502187|Rudolfstetten Hof...|47.3646702178563|8.37695172233176|         null| Parent8502187|
|8502187:0:1/2|Rudolfstetten Hof...|47.3647371479356|8.37703257070734|         null| Parent8502187|
+-------------+--------------------+----------------+----------------+-------------+--------------+
only showing to

With the stops within 15 km, we precompute the walking time in two steps:
- Filter out pairs of stations that are too far apart for walking (>500 m);
- Calculate the walking time.
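
The cell below converts distance to time by multiplying kilometres by 1200, which corresponds to a walking speed of 50 m/min (3 km/h). A plain-Python sketch of the two steps; the function name `walking_seconds` and the constant names are hypothetical:

```python
MAX_WALK_KM = 0.5      # stops more than 500 m apart are not considered walkable
SECONDS_PER_KM = 1200  # 50 m per minute, i.e. 3 km/h


def walking_seconds(distance_km):
    """Return the walking time in seconds, or None if the pair is not walkable."""
    # Zero distance is excluded too, so a stop is never linked to itself.
    if distance_km <= 0.0 or distance_km > MAX_WALK_KM:
        return None
    return distance_km * SECONDS_PER_KM


print(walking_seconds(0.25))  # 300.0
print(walking_seconds(0.8))   # None
```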

In [40]:
walking_df = stops_in_15.select(F.col("stop_id").alias("stop_id_1"), F.col("stop_name").alias("stop_name_1"), F.col("stop_lat").alias("stop_lat_1"), F.col("stop_lon").alias("stop_lon_1")) \
    .crossJoin(stops_in_15.select(F.col("stop_id").alias("stop_id_2"), F.col("stop_name").alias("stop_name_2"), F.col("stop_lat").alias("stop_lat_2"),F.col("stop_lon").alias("stop_lon_2"))) \
    .withColumn("distance", distance_calculation(F.col("stop_lat_1"), F.col("stop_lon_1"), F.col("stop_lat_2"), F.col("stop_lon_2"))) \
    .select(F.col("stop_id_1"), F.col("stop_name_1"), F.col("stop_id_2"), F.col("stop_name_2"), F.col("distance")) \
    .filter("distance<=0.5 and distance>0.0")

walking_df = walking_df.withColumn("used_time", walking_df.distance*1200).select("stop_id_1","stop_name_1","stop_id_2","stop_name_2","used_time")
walk_map = walking_df.withColumn('trip_id',F.lit('walk')).withColumn('stops_id1_dep',F.lit('null')).withColumn('stops_id2_arr',F.lit('null'))\
                        .withColumn('route_desc', F.lit('walk')).select('trip_id','stop_id_1','stop_id_2','used_time','stops_id1_dep','stops_id2_arr','route_desc')

As the task requires, we select only services that run on weekdays.

The weekday filtering is done on `calendar`; we then use `service_id` to join the other DataFrames and obtain the transport services that run on weekdays.

We then perform all the joins on their key columns and filter out non-business hours.
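
The two filters can be sketched in plain Python as follows. The helper names are hypothetical, and the sample `service` row is made up. Note that comparing only the hour, as the Spark cell below does, keeps every time up to 18:59:59:

```python
TIME_RANGE = (8, 18)  # same bounds as `time_range` in the cell below


def runs_every_weekday(calendar_row):
    """True if a calendar row is flagged for all five weekdays."""
    days = ("monday", "tuesday", "wednesday", "thursday", "friday")
    return all(calendar_row[d] for d in days)


def in_business_hours(hhmmss):
    """True if the hour part of 'HH:MM:SS' lies within TIME_RANGE (inclusive)."""
    hour = int(hhmmss.split(":")[0])
    return TIME_RANGE[0] <= hour <= TIME_RANGE[1]


service = {"monday": True, "tuesday": True, "wednesday": True,
           "thursday": True, "friday": True, "saturday": False, "sunday": False}
print(runs_every_weekday(service))    # True
print(in_business_hours("18:45:00"))  # True: the hour is 18, so it passes
print(in_business_hours("19:00:00"))  # False
```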

In [34]:
weekdays_mask = calendar.where("monday = TRUE and tuesday = TRUE and wednesday = TRUE and thursday = TRUE and friday = TRUE").select('service_id')
weekday_trips = trips.join(weekdays_mask, "service_id")
weekday_trips_routes = weekday_trips.join(routes, "route_id")
weekday_stop_times = stop_times.join(weekday_trips_routes, "trip_id")
weekday_trans = weekday_stop_times.join(stops_in_15, "stop_id")
time_range = (8,18)
weekday_trans = weekday_trans.filter(F.hour(weekday_trans.arrival_time)>=time_range[0])\
                        .filter(F.hour(weekday_trans.departure_time)>=time_range[0])\
                        .filter(F.hour(weekday_trans.arrival_time)<=time_range[1])\
                        .filter(F.hour(weekday_trans.departure_time)<=time_range[1])\
                        .select("trip_id","stop_id","stop_name","arrival_time","departure_time","stop_sequence", "route_desc")

## Graph building

Building the edges (time spent between two stops).

For each trip (identified by `trip_id`), we aggregate its information and store it all in one row (per `trip_id`).
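
The idea behind the Spark code that follows, shown on a single trip in plain Python: sort the stops by `stop_sequence`, pair each stop with the next one, and take the travel time as the arrival at the next stop minus the departure from the current one. The function name `trip_edges` and the sample values are made up for illustration:

```python
from datetime import datetime


def trip_edges(stop_times):
    """Turn one trip's stop times into (from, to, seconds, dep, arr) edges.

    `stop_times` is a list of (stop_id, arrival, departure, stop_sequence).
    """
    ordered = sorted(stop_times, key=lambda r: int(r[3]))
    fmt = "%H:%M:%S"
    edges = []
    for cur, nxt in zip(ordered, ordered[1:]):
        dep, arr = cur[2], nxt[1]
        seconds = (datetime.strptime(arr, fmt) - datetime.strptime(dep, fmt)).total_seconds()
        edges.append((cur[0], nxt[0], seconds, dep, arr))
    return edges


# Invented sample trip, deliberately given out of order.
trip = [
    ("B", "08:02:00", "08:03:00", "2"),
    ("A", "08:00:00", "08:00:00", "1"),
    ("C", "08:06:00", "08:06:00", "3"),
]
for edge in trip_edges(trip):
    print(edge)
# ('A', 'B', 120.0, '08:00:00', '08:02:00')
# ('B', 'C', 180.0, '08:03:00', '08:06:00')
```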

In [35]:
from datetime import datetime
import pandas as pd
from pyspark.sql.types import ArrayType, StringType, StructType, StructField, BooleanType

@F.udf(returnType=ArrayType(StringType()))
def to_line(column):
    set_names = column.split(';')
    line_set = []
    for i in range(len(set_names) - 1):
        line_set.append([set_names[i], set_names[i+1]])
    return line_set

@F.udf(returnType=ArrayType(StringType()))
def to_timetable(arr, dep):
    arr_time = arr.split(';')
    dep_time = dep.split(';')
    line_set = []
    for i in range(len(arr_time) - 1):
        line_set.append([dep_time[i], arr_time[i+1]])
    return line_set

@F.udf(returnType=ArrayType(StringType()))
def to_transtype(route_desc):
    routes_desc = route_desc.split(';')
    routes = []
    for i in range(len(routes_desc)):
        routes.append(routes_desc[i])
    return routes

@F.udf(returnType=ArrayType(FloatType()))
def calculate_time(arr, dep):
    arr_time = arr.split(';')
    dep_time = dep.split(';')
    time_set = []
    for i in range(len(arr_time) - 1):
        time = (datetime.strptime(arr_time[i+1], '%H:%M:%S') - datetime.strptime(dep_time[i], '%H:%M:%S')).total_seconds()
        time_set.append(time)
    return time_set

@F.udf(returnType=StringType())
def remove_parentheses(cols):
    return cols[1:-1]

In [36]:
from pyspark.sql import types as T

schema = T.StructType([
    T.StructField("stops_id_group", T.StringType()),
    T.StructField("stops_name_group", T.StringType()),
    T.StructField("used_time", T.StringType()),
    T.StructField("stops_time_group", T.StringType()),
    T.StructField("trans_type", T.StringType())
])

def zip_arrays(stops_id_group, stops_name_group, used_time, stops_time_group, trans_type):
    return list(zip(stops_id_group, stops_name_group, used_time, stops_time_group, trans_type))

combine = F.udf(zip_arrays, returnType=T.ArrayType(schema))

In [37]:
reduced_rdd = weekday_trans.rdd.map(
    lambda row: (
        row[0],
        [(row[1], row[2], row[3], row[4], row[5], row[6])]
    )
).reduceByKey(
    lambda x, y: x + y
).map(
    lambda row: (
        row[0],
        sorted(row[1], key=lambda text: int(text[4]))
    )
).map(
    lambda row: (
        row[0],
        ";".join([e[0] for e in row[1]]),
        ";".join([e[1] for e in row[1]]),
        ";".join([e[2] for e in row[1]]),
        ";".join([e[3] for e in row[1]]),
        ";".join([e[4] for e in row[1]]),
        ";".join([e[5] for e in row[1]])
    )
)

reduced_df = reduced_rdd.toDF(
    ["trip_id", "stop_id", "stop_name", "arrival_time", "departure_time", "stop_sequence", "route_desc"]
)

merge2stops_df = reduced_df.withColumn(
    'stops_id_group', to_line(reduced_df.stop_id)
).withColumn(
    'stops_name_group', to_line(reduced_df.stop_name)
).withColumn(
    'stops_time_group', to_timetable(reduced_df.arrival_time, reduced_df.departure_time)
).withColumn(
    'used_time', calculate_time(reduced_df.arrival_time, reduced_df.departure_time)
).withColumn(
    'trans_type', to_transtype(reduced_df.route_desc)
).select(
    'trip_id', 'stops_id_group', 'stops_name_group', 'used_time', 'stops_time_group', 'trans_type'
)

merge2stops_df = merge2stops_df.withColumn(
    "agg_info",
    combine(merge2stops_df.stops_id_group, merge2stops_df.stops_name_group, merge2stops_df.used_time,
            merge2stops_df.stops_time_group, merge2stops_df.trans_type)
).withColumn(
    "agg_info", F.explode("agg_info")
).select(
    "trip_id", F.col("agg_info.stops_id_group").alias("stops_id_group"),
    F.col("agg_info.stops_name_group").alias("stops_name_group"),
    F.col("agg_info.used_time").alias("used_time"),
    F.col("agg_info.stops_time_group").alias("stops_time_group"),
    F.col("agg_info.trans_type").alias("route_desc")
)

merge2stops_df = merge2stops_df.withColumn(
    'new_id_group', remove_parentheses(merge2stops_df.stops_id_group)
).withColumn(
    'new_name_group', remove_parentheses(merge2stops_df.stops_name_group)
).withColumn(
    'new_time_group', remove_parentheses(merge2stops_df.stops_time_group)
)

trans_map = merge2stops_df.select(
    "trip_id",
    F.split("new_id_group", ",")[0].alias("stop_id1"),
    F.split("new_id_group", ", ")[1].alias("stop_id2"),
    "used_time",
    F.split("new_time_group", ",")[0].alias("stop_id1_dep"),
    F.split("new_time_group", ", ")[1].alias("stop_id2_arr"),
    "route_desc"
)

In [39]:
trans_map.show(5)

+--------------------+--------+--------+---------+------------+------------+----------+
|             trip_id|stop_id1|stop_id2|used_time|stop_id1_dep|stop_id2_arr|route_desc|
+--------------------+--------+--------+---------+------------+------------+----------+
|1104.TA.91-3-j22-...| 8591233| 8591199|     60.0|    15:11:00|    15:12:00|         T|
|1104.TA.91-3-j22-...| 8591199| 8503083|     60.0|    15:12:00|    15:13:00|         T|
|1104.TA.91-3-j22-...| 8503083| 8591202|    120.0|    15:13:00|    15:15:00|         T|
|1104.TA.91-3-j22-...| 8591202| 8591239|     60.0|    15:15:00|    15:16:00|         T|
|1104.TA.91-3-j22-...| 8591239| 8591287|    120.0|    15:16:00|    15:18:00|         T|
+--------------------+--------+--------+---------+------------+------------+----------+
only showing top 5 rows

In [41]:
# the walking time we processed earlier
walk_map.show(5)

+-------+---------+-------------+---------+-------------+-------------+----------+
|trip_id|stop_id_1|    stop_id_2|used_time|stops_id1_dep|stops_id2_arr|route_desc|
+-------+---------+-------------+---------+-------------+-------------+----------+
|   walk|  8500926|      8590616|146.91576|         null|         null|      walk|
|   walk|  8500926|      8590737|359.59048|         null|         null|      walk|
|   walk|  8502186|8502186:0:1/2| 9.871648|         null|         null|      walk|
|   walk|  8502186|      8502270| 552.5724|         null|         null|      walk|
|   walk|  8502186|      8590200| 580.6377|         null|         null|      walk|
+-------+---------+-------------+---------+-------------+-------------+----------+
only showing top 5 rows

## UI

In [48]:
%%local
import pandas as pd
from ipywidgets import widgets, interact, VBox
from IPython.display import display, HTML
import datetime
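current_datetime = datetime.datetime.now()
# Default arrival time: 4 hours from now. Using timedelta and zero-padded
# strftime keeps the value a valid HH:MM (no hour above 23, no "9:5").
proposed_datetime = current_datetime + datetime.timedelta(hours=4)
proposed_hour = proposed_datetime.strftime("%H")
current_minute = proposed_datetime.strftime("%M")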

def create_schedule(change):
    Departure = departure_widget.value
    Destination = destination_widget.value
    timeInput = input_widget.value
    hour, minute = map(int, timeInput.split(":"))
    time = hour + minute / 60
    
    schedule_info = {
        'dep': [Departure],
        'destination': [Destination],
        'arrival_time': [time]
    }
    
    schedule_df = pd.DataFrame(schedule_info)
    
    schedule_df.to_csv('./info.csv')

    
css = """
.widget-label {
    min-width: 150px;
    text-align: right;
    padding-right: 10px;
}
#container {
    background-color: orange;
    color: white;
}
#title {
    color: red;
}
"""

html = "<style>{}</style>".format(css)
display(HTML(html))
    
title = widgets.HTML('<h2 id="title">Route Planner</h2>')
departure_widget = widgets.Text(value='Küsnacht ZH', description='Departure')
destination_widget = widgets.Text(value='Zürich, Neeserweg', description='Destination')
input_widget = widgets.Text(value = proposed_hour + ":"+current_minute, description = 'Arrive at (HH:MM)')

input_widget.continuous_update = False
input_widget.observe(create_schedule, 'value')

container = VBox([title, departure_widget, destination_widget,input_widget], layout=widgets.Layout(id='container'))
display(container)

---

Reading the input data and sending it to Spark

In [49]:
%%local
import pandas as pd
schedule_df = pd.read_csv('./info.csv')
time_local = schedule_df.iloc[0,3]

In [50]:
%%send_to_spark -i schedule_df -t df

Successfully passed 'schedule_df' as 'schedule_df' to Spark kernel

Getting the stop id corresponding to each stop name.

Here I keep only the root stations (those without a parent station) and deduplicate them by coordinates.
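
The same lookup in plain Python, as a sketch on made-up stop rows; `find_stop_id` is a hypothetical helper. It returns `None` when no stop matches, a case worth guarding against because the Spark version below calls `.first().stop_id` directly:

```python
def find_stop_id(stops, name):
    """Return the id of the first root stop whose name matches exactly, else None."""
    by_coords = {}
    for stop in stops:
        if stop["stop_name"] != name or stop["parent_station"] is not None:
            continue
        # Keep one id per coordinate pair (the first one seen).
        by_coords.setdefault((stop["stop_lat"], stop["stop_lon"]), stop["stop_id"])
    ids = list(by_coords.values())
    return ids[0] if ids else None


# Invented sample rows.
stops = [
    {"stop_id": "1", "stop_name": "Stop A", "stop_lat": "47.1", "stop_lon": "8.5", "parent_station": None},
    {"stop_id": "1b", "stop_name": "Stop A", "stop_lat": "47.1", "stop_lon": "8.5", "parent_station": None},
    {"stop_id": "2:0:1", "stop_name": "Stop B", "stop_lat": "47.2", "stop_lon": "8.6", "parent_station": "Parent2"},
]
print(find_stop_id(stops, "Stop A"))  # 1
print(find_stop_id(stops, "Stop B"))  # None: only a child platform matches
```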

In [51]:
stop_name1 = schedule_df.select('dep').rdd.flatMap(lambda x: x).collect()[0]
stop_name2 = schedule_df.select('destination').rdd.flatMap(lambda x: x).collect()[0]
time = schedule_df.select('arrival_time').rdd.flatMap(lambda x: x).collect()[0]

stop1 = stops_in_15.filter(stops_in_15["stop_name"]==stop_name1)\
.filter(stops_in_15['parent_station'].isNull())\
.dropDuplicates(['stop_lat', 'stop_lon'])

stop2 = stops_in_15.filter(stops_in_15["stop_name"]==stop_name2)\
.filter(stops_in_15['parent_station'].isNull())\
.dropDuplicates(['stop_lat', 'stop_lon'])

# The result is not stable, as far as I can observe: sometimes there are multiple rows
# with exactly the same coordinates that differ only in the id, sometimes there are not.
# Therefore I take the first element of the result.

# stop_id1 and stop_id2 are strings
stop_id1 = stop1.select('stop_id').first().stop_id
stop_id2 = stop2.select('stop_id').first().stop_id

## Graph construction

Before building the graph, we filter out trips that:
- Arrive later than our desired arrival time;
- Arrive too early (more than 2 hours before our desired arrival time).
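
A plain-Python sketch of this window filter, assuming (as in the cell below) that the desired arrival time is given in fractional hours and that both bounds are inclusive. The function names and sample rows are made up:

```python
def arrival_in_hours(hhmmss):
    """'HH:MM:SS' -> fractional hours, ignoring seconds as the Spark cell does."""
    hour, minute = hhmmss.split(":")[:2]
    return int(hour) + int(minute) / 60


def trips_in_window(stop_times, desired_arrival, window_hours=2):
    """Ids of trips with at least one arrival in [desired - window, desired]."""
    start, end = desired_arrival - window_hours, desired_arrival
    return {trip_id for trip_id, arrival in stop_times
            if start <= arrival_in_hours(arrival) <= end}


# Invented (trip_id, arrival_time) rows.
rows = [("t1", "10:15:00"), ("t1", "10:40:00"), ("t2", "13:05:00"), ("t3", "09:59:00")]
print(sorted(trips_in_window(rows, 12.0)))  # ['t1']
```

Collecting the ids in a set yields each trip once, even when several of its stops fall inside the window.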

In [53]:
arrival_time_in_hours = F.hour("arrival_time") + F.minute("arrival_time") / 60

start_time = time - 2
end_time = time

trips_arriving_in_time_range = weekday_trans.filter(arrival_time_in_hours.between(start_time, end_time))

# distinct() keeps one row per trip so the join below does not duplicate edges
distinct_trip_ids = trips_arriving_in_time_range.select("trip_id").distinct()

trans_map = trans_map.join(distinct_trip_ids,'trip_id')

trans_map = trans_map.union(walk_map)

trans_map.write.parquet('/user/{0}/file/'.format(username), mode='overwrite')

In [23]:
%%local

from hdfs3 import HDFileSystem
import pandas as pd
import networkx as nx

hdfs = HDFileSystem(host='hdfs://iccluster044.iccluster.epfl.ch', port=8020, user='ebouille')
files = hdfs.glob('/user/{0}/file/*.parquet'.format(username))
trans = pd.DataFrame()
for file in files:
    with hdfs.open(file) as f:
        trans = pd.concat([trans, pd.read_parquet(f)])

edges = trans.values.tolist()

for i, row in enumerate(edges):
    edges[i] = (row[1], row[2], {"time": float(row[3]), "trip_id": row[0]})

g = nx.DiGraph()
g.add_edges_from(edges)

In [24]:
%%local
edge_data = g.get_edge_data('8591365', '8591329')
edge_data

{'time': 168.71622, 'trip_id': 'walk'}