## Walkthrough

This notebook works with the following datasets:

- `stop_times` (timetable data)
- `calendar` (timetable data)
- `routes` (timetable data)
- `trips` (timetable data)
- `stops` (timetable data)
- `actual_condition` (actual SBB data)


They are loaded in the cells below. To inspect the structure of a dataframe, use `dataName.printSchema()`; to view the first few (e.g. 5) rows, use `dataName.show(5)`.

Timetable data are deliberately read from a single snapshot, the one recorded on 2022/06/08. Recall that:
>The timetables are updated weekly. It is ok to assume that the weekly changes are small, and a timetable for
a given week is thus the same for the full year - use the schedule of the most recent week for the day of the trip.

Loading all the data is also far too expensive (I tried; the session keeps crashing).

---

**Preprocessed data:**
- `stops_in_15`: stops within the 15 km range (everything further away is dropped);
- `walk_map`: walking time between stops that are within walking distance of each other; date-independent;
- `weekday_trans`: all timetable data joined together, restricted to weekdays, business hours, and stops within the 15 km range;
- `trans_map`: travel time from one stop to the next along a trip (date-dependent); same layout as `walk_map`.

---

**UI:**
- Fuzzy search is not supported; stop names must match exactly;
- Arrival time input format is HH:MM, with no spaces.

---

**Graph:**
Directed graph whose nodes are `stop_id`s; each edge carries two values:
- `time`: travel time from one node to the other, in seconds;
- `trip_id`: the id of the trip; for walking edges, `trip_id` = 'walk'
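
As a minimal sketch of this edge layout (plain Python, not the `create_multigraph` helper used later, whose implementation is not shown here), parallel edges between the same pair of stops can be kept in adjacency lists and searched with Dijkstra on the `time` attribute. The stop ids, trip ids and times are made up for illustration.

```python
import heapq
from collections import defaultdict

# Hypothetical edges: (from_stop, to_stop, {"time": seconds, "trip_id": ...})
edges = [
    ("A", "B", {"time": 120.0, "trip_id": "trip_1"}),
    ("A", "B", {"time": 300.0, "trip_id": "walk"}),
    ("B", "C", {"time": 60.0, "trip_id": "trip_1"}),
    ("A", "C", {"time": 400.0, "trip_id": "trip_2"}),
]

def build_graph(edges):
    """Directed multigraph as adjacency lists; parallel edges are kept."""
    graph = defaultdict(list)
    for src, dst, attrs in edges:
        graph[src].append((dst, attrs))
    return graph

def shortest_time(graph, source, target):
    """Dijkstra on the `time` attribute; returns (seconds, trip_ids) or None."""
    queue = [(0.0, source, [])]
    visited = set()
    while queue:
        elapsed, stop, trips = heapq.heappop(queue)
        if stop == target:
            return elapsed, trips
        if stop in visited:
            continue
        visited.add(stop)
        for nxt, attrs in graph.get(stop, []):
            if nxt not in visited:
                heapq.heappush(
                    queue,
                    (elapsed + attrs["time"], nxt, trips + [attrs["trip_id"]]),
                )
    return None

graph = build_graph(edges)
print(shortest_time(graph, "A", "C"))
```

Keeping `trip_id` on every edge is what later allows a path to be read back as a sequence of rides and walks.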

In [50]:
%%local
#Installing dependencies
!pip install networkx
!pip install pyarrow
!pip install fastparquet



## Spark stuff

**In peak hours the session might not be able to start** and will throw:
```
The code failed because of a fatal error:
	Session xxxx did not start up in 60 seconds..

Some things to try:
a) Make sure Spark has enough available resources for Jupyter to create a Spark context.
b) Contact your Jupyter administrator to make sure the Spark magics library is configured correctly.
c) Restart the kernel.
```
The only fix is to keep retrying ;) good luck with that!

In [265]:
%%local
import os
import json
from IPython import get_ipython

username = os.environ['RENKU_USERNAME']

configuration = dict(
    name = f"{username}-final-project",
    executorMemory = "4G",
    executorCores = 4,
    numExecutors = 10,
    conf = {
        "spark.jars.repositories": "https://repos.spark-packages.org",
    }
)

get_ipython().run_cell_magic('configure', line="-f", 
                             cell=json.dumps(configuration))

Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,User,Current session?
8242,application_1680948035106_7591,pyspark,idle,Link,Link,,✔



SparkSession available as 'spark'.




In [266]:
%spark


In [267]:
%%send_to_spark -i username -t str -n username


Successfully passed 'username' as 'username' to Spark kernel

## Loading data

### data reading

We use the timetable from the week of the final presentation, but taken from the year 2022.
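
A small sketch of how the snapshot partition could be picked for a given trip date. The list of snapshot days is taken from the June 2022 directory listing quoted in the next cell and is only an assumption for other months; `snapshot_partition` is a hypothetical helper, not part of the notebook.

```python
from datetime import date

# Snapshot days seen in the June 2022 listing (assumed; other months may differ).
SNAPSHOT_DAYS = [1, 8, 15, 22, 29]

def snapshot_partition(trip_date, snapshot_year=2022):
    """Partition suffix of the latest snapshot on or before the trip's day of month."""
    day = max(d for d in SNAPSHOT_DAYS if d <= trip_date.day)
    return f"year={snapshot_year}/month={trip_date.month}/day={day}"

# Day of the oral defense
print(snapshot_partition(date(2023, 6, 8)))
```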

In [54]:
# We assume the timetable changes little from week to week (see README).
# We want to compute the connections for 08/06/2023 (the day of the oral defense).
# The data has 5 entries per month. For example, `hdfs dfs -ls /data/sbb/part_orc/timetables/calendar/year=2022/month=6`
# returns 5 entries: day=1, 8, 15, 22, 29; each entry corresponds to that week's data.
# For the sake of simplicity we only use the year 2022,
# so we read the snapshot from 2022/6/8.
orc_file_path = "/data/sbb/part_orc/timetables"
stop_times = spark.read.orc(orc_file_path + "/stop_times/year=2022/month=6/day=8")
calendar = spark.read.orc(orc_file_path + "/calendar/year=2022/month=6/day=8")
routes = spark.read.orc(orc_file_path + "/routes/year=2022/month=6/day=8")
trips = spark.read.orc(orc_file_path + "/trips/year=2022/month=6/day=8")
csv_file_path = "/data/sbb/part_csv/timetables"
stops_csv = spark.read.csv(csv_file_path + "/stops/year=2022/month=06/day=08", header = True)


In [55]:
# ORC files carry their own schema, so the CSV-only options (sep, inferSchema, header) are not needed
actual_temp = spark.read.orc("/data/sbb/part_orc/istdaten")
actual_condition = actual_temp.withColumnRenamed("betriebstag", "Date_of_trip")\
                .withColumnRenamed("fahrt_bezeichner", "Trip_id")\
                .withColumnRenamed("betreiber_id", "Operator_id")\
                .withColumnRenamed("betreiber_abk", "Operator_abk")\
                .withColumnRenamed("betreiber_name", "Operator_name")\
                .withColumnRenamed("produkt_id", "Transport_type")\
                .withColumnRenamed("linien_id", "Train_number(train)")\
                .withColumnRenamed("linien_text", "Service type(train)")\
                .withColumnRenamed("umlauf_id", "Circulation_id")\
                .withColumnRenamed("verkehrsmittel_text", "Means_of_transport_text")\
                .withColumnRenamed("zusatzfahrt_tf", "If_additional")\
                .withColumnRenamed("faellt_aus_tf", "If_failed")\
                .withColumnRenamed("bpuic", "Stop_id")\
                .withColumnRenamed("haltestellen_name", "Stop_name")\
                .withColumnRenamed("ankunftszeit", "Arrival_time")\
                .withColumnRenamed("an_prognose", "Actual_arrival_time")\
                .withColumnRenamed("abfahrtszeit", "Departure_time")\
                .withColumnRenamed("ab_prognose", "Actual_departure_time")\
                .withColumnRenamed("durchfahrt_tf", "Not_stop")
actual_condition.printSchema()


root
 |-- Date_of_trip: string (nullable = true)
 |-- Trip_id: string (nullable = true)
 |-- Operator_id: string (nullable = true)
 |-- Operator_abk: string (nullable = true)
 |-- Operator_name: string (nullable = true)
 |-- Transport_type: string (nullable = true)
 |-- Train_number(train): string (nullable = true)
 |-- Service type(train): string (nullable = true)
 |-- Circulation_id: string (nullable = true)
 |-- Means_of_transport_text: string (nullable = true)
 |-- If_additional: string (nullable = true)
 |-- If_failed: string (nullable = true)
 |-- Stop_id: string (nullable = true)
 |-- Stop_name: string (nullable = true)
 |-- Arrival_time: string (nullable = true)
 |-- Actual_arrival_time: string (nullable = true)
 |-- an_prognose_status: string (nullable = true)
 |-- Departure_time: string (nullable = true)
 |-- Actual_departure_time: string (nullable = true)
 |-- ab_prognose_status: string (nullable = true)
 |-- Not_stop: string (nullable = true)
 |-- year: integer (nullable = 

## Data preprocessing

We keep only the stations that lie within 15 km of the given Zurich location.

For distance calculation, refer to [Haversine formula](https://en.wikipedia.org/wiki/Haversine_formula)
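
For reference, this is the form of the formula that the cell below implements, with latitudes $\varphi$ and longitudes $\lambda$ in radians and $R = 6371$ km as the Earth radius:

$$
a = \sin^2\!\left(\frac{\Delta\varphi}{2}\right) + \cos\varphi_1 \cos\varphi_2 \,\sin^2\!\left(\frac{\Delta\lambda}{2}\right),
\qquad
d = 2R\,\operatorname{atan2}\!\left(\sqrt{a},\ \sqrt{1-a}\right)
$$

As a quick sanity check, two points on the same meridian one degree of latitude apart give $d = R \cdot \pi/180 \approx 111.2$ km.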

In [56]:
from math import sin, cos, sqrt, atan2, radians
import pyspark.sql.functions as F
from pyspark.sql.types import FloatType

# Great-circle distance in km; used to drop all stops outside the 15 km range
@F.udf(returnType=FloatType())
def distance_calculation(latitude_1, longitude_1, latitude_2, longitude_2):
    # Use the Haversine formula, which calculates the distance between two points
    # on a sphere (such as the Earth) from their latitudes and longitudes.
    radius_of_Earth = 6371.0  # mean Earth radius in km
    
    #First, Convert latitude and longitude from degrees to radians
    latitude_1 = radians(float(latitude_1))
    latitude_2 = radians(float(latitude_2))
    longitude_1 = radians(float(longitude_1))
    longitude_2 = radians(float(longitude_2))

    ## Haversine formula implementation
    delta_latitude = latitude_2 - latitude_1
    delta_longitude = longitude_2 - longitude_1

    a = cos(latitude_1)*cos(latitude_2)*sin(delta_longitude/2)**2+sin(delta_latitude/2)**2
    c = 2*atan2(sqrt(a), sqrt(1-a))

    distance = radius_of_Earth * c 
    return distance

stops_in_15 = stops_csv.where(distance_calculation(F.lit(47.378177), F.lit(8.540192), F.col("stop_lat"), F.col("stop_lon")) <=15)
stops_in_15.write.mode("overwrite").parquet('/user/{0}/stops_in_15/'.format(username))
stops_in_15.show(5)


+-------------+--------------------+----------------+----------------+-------------+--------------+
|      stop_id|           stop_name|        stop_lat|        stop_lon|location_type|parent_station|
+-------------+--------------------+----------------+----------------+-------------+--------------+
|      8500926|Oetwil a.d.L., Sc...|47.4236270123012| 8.4031825286317|         null|          null|
|      8502186|Dietikon Stoffelbach|47.3933267759652|8.39896044679575|         null| Parent8502186|
|8502186:0:1/2|Dietikon Stoffelbach|47.3933997509195|8.39894248049007|         null| Parent8502186|
|      8502187|Rudolfstetten Hof...|47.3646702178563|8.37695172233176|         null| Parent8502187|
|8502187:0:1/2|Rudolfstetten Hof...|47.3647371479356|8.37703257070734|         null| Parent8502187|
+-------------+--------------------+----------------+----------------+-------------+--------------+
only showing top 5 rows

With the stops inside the 15 km range, we now precompute the walking times. There are two steps:
- Filter out stations that are too far away for walking (>500m)
- Calculate walking time
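
The two steps can be sketched in plain Python as follows. The factor of 1200 seconds per km is the one used in the next cell and corresponds to a walking speed of 50 m/min; the constant and function names are hypothetical.

```python
WALK_SECONDS_PER_KM = 1200  # 1000 m / 1200 s, i.e. 50 m/min
MAX_WALK_KM = 0.5           # stop pairs further apart are not walkable

def walking_time_seconds(distance_km):
    """Walking time in seconds, or None if the pair is dropped by the filter."""
    if distance_km <= 0.0 or distance_km > MAX_WALK_KM:
        return None
    return distance_km * WALK_SECONDS_PER_KM

print(walking_time_seconds(0.5))  # longest walk that is kept
print(walking_time_seconds(0.8))  # too far, dropped
```

With these values no walking edge can take longer than 600 seconds.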

In [57]:
# rename the stops_in_15 dataframe
# use crossJoin to make pairs between different stops
# calculate the distance between a pair of stops
# filtering out stop pairs that are too far away
walking_df = stops_in_15.select(F.col("stop_id").alias("stop_id_1"), F.col("stop_name").alias("stop_name_1"), F.col("stop_lat").alias("stop_lat_1"), F.col("stop_lon").alias("stop_lon_1")) \
    .crossJoin(stops_in_15.select(F.col("stop_id").alias("stop_id_2"), F.col("stop_name").alias("stop_name_2"), F.col("stop_lat").alias("stop_lat_2"),F.col("stop_lon").alias("stop_lon_2"))) \
    .withColumn("distance", distance_calculation(F.col("stop_lat_1"), F.col("stop_lon_1"), F.col("stop_lat_2"), F.col("stop_lon_2"))) \
    .select(F.col("stop_id_1"), F.col("stop_name_1"), F.col("stop_id_2"), F.col("stop_name_2"), F.col("distance")) \
    .filter("distance<=0.5 and distance>0.0")
# calculating the walking time in seconds between the filtered stop pairs (1200 s per km, i.e. 50 m/min)
walking_df = walking_df.withColumn("used_time", walking_df.distance*1200).select("stop_id_1","stop_name_1","stop_id_2","stop_name_2","used_time")
# generating the walk map from the previous result; departure/arrival times get the placeholder string 'null', trip and route columns get 'walk'
walk_map = walking_df.withColumn('trip_id',F.lit('walk')).withColumn('stops_id1_dep',F.lit('null')).withColumn('stops_id2_arr',F.lit('null'))\
                        .withColumn('route_desc', F.lit('walk')).select('trip_id','stop_id_1','stop_id_2','used_time','stops_id1_dep','stops_id2_arr','route_desc').withColumn('route_id', F.lit('walk'))


As the task requires, we keep only weekdays.

The weekday filter is applied to `calendar`; we then join the other dataframes on `service_id` to get the services that run on weekdays.

We then perform the remaining joins on their primary keys and filter out non-business hours.
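
A plain-Python sketch of the two filters, assuming calendar rows are dictionaries with one boolean per weekday and times are `HH:MM:SS` strings (both are illustrative stand-ins for the Spark columns). Note that comparing only the hour component keeps every time up to 18:59:59.

```python
from datetime import datetime

WEEKDAYS = ("monday", "tuesday", "wednesday", "thursday", "friday")
TIME_RANGE = (8, 18)  # compared against the hour component only

def runs_every_weekday(service):
    """True if the calendar row is active Monday through Friday."""
    return all(service[day] for day in WEEKDAYS)

def in_business_hours(time_str):
    """True if the hour of an HH:MM:SS string lies in TIME_RANGE (inclusive)."""
    hour = datetime.strptime(time_str, "%H:%M:%S").hour
    return TIME_RANGE[0] <= hour <= TIME_RANGE[1]

# Hypothetical calendar rows
weekday_service = dict.fromkeys(WEEKDAYS, True)
partial_service = dict(weekday_service, friday=False)

print(runs_every_weekday(weekday_service), runs_every_weekday(partial_service))
print(in_business_hours("18:59:00"), in_business_hours("19:00:00"))
```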

In [58]:
# keeping only the services that run on all five weekdays
weekdays_mask = calendar.where("monday = TRUE and tuesday = TRUE and wednesday = TRUE and thursday = TRUE and friday = TRUE").select('service_id')
weekday_trips = trips.join(weekdays_mask, "service_id")
# joining all the data together (stops within 15km range)
weekday_trips_routes = weekday_trips.join(routes, "route_id")
weekday_stop_times = stop_times.join(weekday_trips_routes, "trip_id")
weekday_trans = weekday_stop_times.join(stops_in_15, "stop_id")
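# keeping only business hours; the comparison uses the hour component only, so times up to 18:59 pass
time_range = (8,18)
# filtering on both arrival and departure, then selecting only the columns of interest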
weekday_trans = weekday_trans.filter(F.hour(weekday_trans.arrival_time)>=time_range[0])\
                        .filter(F.hour(weekday_trans.departure_time)>=time_range[0])\
                        .filter(F.hour(weekday_trans.arrival_time)<=time_range[1])\
                        .filter(F.hour(weekday_trans.departure_time)<=time_range[1])\
                        .select("trip_id","stop_id","stop_name","arrival_time","departure_time","stop_sequence", "route_desc", "route_id")


## Graph building

Building the edges (time spent between two stops).

For each trip (identified by `trip_id`), we aggregate its information and store it in a single row.
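
The idea behind the helper functions in the next cells can be sketched without Spark: sort the stops of one trip by `stop_sequence`, then slide a window of length 2 over them to get one edge per pair of consecutive stops. The rows below are illustrative; `trip_edges` and `seconds_between` are hypothetical names.

```python
from datetime import datetime

# Illustrative rows for one trip: (stop_sequence, stop_id, arrival, departure)
rows = [
    (2, "8591106", "15:46:00", "15:46:00"),
    (1, "8591439", "15:44:00", "15:44:00"),
    (3, "8591279", "15:47:00", "15:47:00"),
]

def seconds_between(departure, arrival):
    """Seconds from a departure to the following arrival (same day)."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(arrival, fmt) - datetime.strptime(departure, fmt)
    return delta.total_seconds()

def trip_edges(rows):
    """One (from, to, seconds, departure, arrival) tuple per consecutive stop pair."""
    ordered = sorted(rows, key=lambda r: r[0])
    return [
        (a[1], b[1], seconds_between(a[3], b[2]), a[3], b[2])
        for a, b in zip(ordered, ordered[1:])
    ]

for edge in trip_edges(rows):
    print(edge)
```

Each edge uses the departure time at the first stop and the arrival time at the second, so dwell time at a stop is not counted as travel time.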

In [59]:
from datetime import datetime
import pandas as pd
from pyspark.sql.types import ArrayType, StringType, StructType, StructField, BooleanType

@F.udf(returnType=ArrayType(StringType()))
def to_line(column):
    set_names = column.split(';')
    line_set = []
    for i in range(len(set_names) - 1):
        line_set.append([set_names[i], set_names[i+1]])
    return line_set

@F.udf(returnType=ArrayType(StringType()))
def to_timetable(arr, dep):
    arr_time = arr.split(';')
    dep_time = dep.split(';')
    line_set = []
    for i in range(len(arr_time) - 1):
        line_set.append([dep_time[i], arr_time[i+1]])
    return line_set

@F.udf(returnType=ArrayType(StringType()))
def to_transtype(route_desc):
    routes_desc = route_desc.split(';')
    routes = []
    for i in range(len(routes_desc)):
        routes.append(routes_desc[i])
    return routes

@F.udf(returnType=ArrayType(StringType()))
def to_routeid(route_id):
    # split the ';'-separated string into a list of route ids
    # (the list was previously reset to [] after the split, so the result was always empty)
    return route_id.split(';')
    

@F.udf(returnType=ArrayType(FloatType()))
def calculate_time(arr, dep):
    arr_time = arr.split(';')
    dep_time = dep.split(';')
    time_set = []
    for i in range(len(arr_time) - 1):
        time = (datetime.strptime(arr_time[i+1], '%H:%M:%S') - datetime.strptime(dep_time[i], '%H:%M:%S')).total_seconds()
        time_set.append(time)
    return time_set

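@F.udf(returnType=StringType())
def remove_parentheses(cols):
    # strip the first and last character, i.e. the brackets around a stringified pair: "[a, b]" -> "a, b"
    return cols[1:-1]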


In [60]:
from pyspark.sql import types as T

schema = T.StructType([
    T.StructField("stops_id_group", T.StringType()),
    T.StructField("stops_name_group", T.StringType()),
    T.StructField("used_time", T.StringType()),
    T.StructField("stops_time_group", T.StringType()),
    T.StructField("trans_type", T.StringType()),
    T.StructField("route_id", T.StringType())
])

# zip_arrays takes 6 equal-length lists and returns a list of tuples zipped from them.
# see: zip (Python) for an example

# @sample:
# stops_id_group = [[1,2], [2,3], [3,4]]
# stops_name_group = [[A,B], [B,C], [C,D]]
# used_time = [10, 5, 8]
# stops_time_group = [[8:00, 8:10], [8:15, 8:20], [8:20, 8:28]]
# trans_type = ["Bus", "Bus", "Bus"]
# route_id = ["id", "id", "id"]
# zip_arrays(stops_id_group, stops_name_group, used_time, stops_time_group, trans_type, route_id) returns:
# [([1,2], [A,B], 10, [8:00, 8:10], "Bus", "id"),
#  ([2,3], [B,C], 5, [8:15, 8:20], "Bus", "id"),
#  ([3,4], [C,D], 8, [8:20, 8:28], "Bus", "id")]
# each element of the list describes travelling between one pair of consecutive stops (one direction only).


def zip_arrays(stops_id_group, stops_name_group, used_time, stops_time_group, trans_type,route_id):
    return list(zip(stops_id_group, stops_name_group, used_time, stops_time_group, trans_type, route_id))

# combine wraps the zip_arrays function to be a udf function for spark.
combine = F.udf(zip_arrays, returnType=T.ArrayType(schema))


In [61]:
# A trip has multiple stops, ordered by `stop_sequence`.
# From the weekday_trans data, we first gather all rows that belong to one trip with the reduceByKey method,
# then sort them by `stop_sequence` within each trip_id,
# then join each attribute within one trip into a ';'-separated string (one new column per attribute) with the map method,
# and finally name the columns.
reduced_rdd = weekday_trans.rdd.map(
    lambda row: (
        row[0],
        [(row[1], row[2], row[3], row[4], row[5], row[6],row[7])]
    )
).reduceByKey(
    lambda x, y: x + y
).map(
    lambda row: (
        row[0],
        sorted(row[1], key=lambda text: int(text[4]))
    )
).map(
    lambda row: (
        row[0],
        ";".join([e[0] for e in row[1]]),
        ";".join([e[1] for e in row[1]]),
        ";".join([e[2] for e in row[1]]),
        ";".join([e[3] for e in row[1]]),
        ";".join([e[4] for e in row[1]]),
        ";".join([e[5] for e in row[1]]),
        ";".join([e[6] for e in row[1]]),
    )
)

reduced_df = reduced_rdd.toDF(
    ["trip_id", "stop_id", "stop_name", "arrival_time", "departure_time", "stop_sequence", "route_desc","route_id"]
)
# call reduced_df.show(5) to get a feeling of what's there.

# for the reduced_df, we have:
# trip_id: one single string referring to a trip, the trip is time-dependent
# stop_id: a series of string, split by ';', the sequential stop_id along the trip, sorted by the stop_sequence.
# stop_name: a series of string, split by ';', the sequential stop_name along the trip, sorted by the stop_sequence.
# arrival_time: arrival time at each stop, separated by ';'
# departure_time: departure time at each stop, separated by ';'
# stop_sequence: the stop sequence, not useful, will be discarded after
# route_desc: the same values separated by ';'
# route_id: the same route_ids, separated by ';'


In [62]:
merge2stops_df = reduced_df.withColumn(
    'stops_id_group', to_line(reduced_df.stop_id) 
    # to_line allows a length-2 window sliding across the stop_id sequence, to make pairs of stops along the trip
    # e.g. the stop_id: "id1;id2;id3", then after applying to_line, we get [[id1,id2], [id2,id3]]
).withColumn(
    'stops_name_group', to_line(reduced_df.stop_name)
    # same as above
).withColumn(
    'stops_time_group', to_timetable(reduced_df.arrival_time, reduced_df.departure_time)
    # depart at previous stop and arrive at the next stop, applying to_timetable will generate such timetables
    # e.g. we have stop_id: "id1;id2;id3", arrival_time: "0;1;2", departure_time: "0;1.5;2"
    # after applying to_timetable, we have [[0,1],[1.5,2]]: depart at stop_1 at 0, arrive at stop_2 at 1; depart at stop_2 at 1.5, arrive at stop_3 at 2.
).withColumn(
    'used_time', calculate_time(reduced_df.arrival_time, reduced_df.departure_time)
    # calculating time spent between stops, from the previous example, we have
    # [1, 0.5]
).withColumn(
    'trans_type', to_transtype(reduced_df.route_desc)
    # transforming the ';'-separated string into a list
    # "Bus;Bus;Bus" -> ["Bus", "Bus", "Bus"]
).withColumn(
    'route_id', to_transtype(reduced_df.route_id)
    # same as above
).select(
    'trip_id', 'stops_id_group', 'stops_name_group', 'used_time', 'stops_time_group', 'trans_type', 'route_id'
)


In [63]:
merge2stops_df = merge2stops_df.withColumn(
    # for comments of combine, see previous block
    "agg_info",
    combine(merge2stops_df.stops_id_group, merge2stops_df.stops_name_group, merge2stops_df.used_time,
            merge2stops_df.stops_time_group, merge2stops_df.trans_type, merge2stops_df.route_id)
).withColumn(
    # explode will turn every single element in the list to become a row
    "agg_info", F.explode("agg_info")
).select(
    # renaming
    "trip_id",
    F.col("agg_info.stops_id_group").alias("stops_id_group"),
    F.col("agg_info.stops_name_group").alias("stops_name_group"),
    F.col("agg_info.used_time").alias("used_time"),
    F.col("agg_info.stops_time_group").alias("stops_time_group"),
    F.col("agg_info.trans_type").alias("route_desc"),
    F.col("agg_info.route_id").alias("route_id")
)

merge2stops_df = merge2stops_df.withColumn(
    'new_id_group', remove_parentheses(merge2stops_df.stops_id_group)
).withColumn(
    'new_name_group', remove_parentheses(merge2stops_df.stops_name_group)
).withColumn(
    'new_time_group', remove_parentheses(merge2stops_df.stops_time_group)
)

trans_map = merge2stops_df.select(
    # splitting the paired information into 2 columns.
    "trip_id",
    F.split("new_id_group", ",")[0].alias("stop_id1"),
    F.split("new_id_group", ", ")[1].alias("stop_id2"),
    "used_time",
    F.split("new_time_group", ",")[0].alias("stop_id1_dep"),
    F.split("new_time_group", ", ")[1].alias("stop_id2_arr"),
    "route_desc",
    "route_id"
)


In [117]:
trans_map.show(5)


+--------------------+--------+--------+---------+------------+------------+----------+----------+
|             trip_id|stop_id1|stop_id2|used_time|stop_id1_dep|stop_id2_arr|route_desc|  route_id|
+--------------------+--------+--------+---------+------------+------------+----------+----------+
|2149.TA.91-7-j22-...| 8591439| 8591106|    120.0|    15:44:00|    15:46:00|         T|91-7-j22-1|
|2149.TA.91-7-j22-...| 8591106| 8591279|     60.0|    15:46:00|    15:47:00|         T|91-7-j22-1|
|2149.TA.91-7-j22-...| 8591279| 8591304|     60.0|    15:47:00|    15:48:00|         T|91-7-j22-1|
|2149.TA.91-7-j22-...| 8591304| 8591081|     60.0|    15:48:00|    15:49:00|         T|91-7-j22-1|
|2149.TA.91-7-j22-...| 8591081| 8591082|    120.0|    15:49:00|    15:51:00|         T|91-7-j22-1|
+--------------------+--------+--------+---------+------------+------------+----------+----------+
only showing top 5 rows

In [118]:
# the walking time we processed earlier
walk_map.show(5)


+-------+---------+-------------+---------+-------------+-------------+----------+--------+
|trip_id|stop_id_1|    stop_id_2|used_time|stops_id1_dep|stops_id2_arr|route_desc|route_id|
+-------+---------+-------------+---------+-------------+-------------+----------+--------+
|   walk|  8500926|      8590616|146.91576|         null|         null|      walk|    walk|
|   walk|  8500926|      8590737|359.59048|         null|         null|      walk|    walk|
|   walk|  8502186|8502186:0:1/2| 9.871648|         null|         null|      walk|    walk|
|   walk|  8502186|      8502270| 552.5724|         null|         null|      walk|    walk|
|   walk|  8502186|      8590200| 580.6377|         null|         null|      walk|    walk|
+-------+---------+-------------+---------+-------------+-------------+----------+--------+
only showing top 5 rows

## UI

In [64]:
%%local
import pandas as pd
from ipywidgets import widgets, interact, VBox
from IPython.display import display, HTML
import datetime
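current_datetime = datetime.datetime.now()
current_hour = current_datetime.hour
# zero-pad both parts so the default value matches HH:MM, and wrap the hour around midnight
current_minute = f"{current_datetime.minute:02d}"
proposed_hour = f"{(current_hour + 4) % 24:02d}"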

def create_schedule(change):
    Departure = departure_widget.value
    Destination = destination_widget.value
    timeInput = input_widget.value
    hour, minute = map(int, timeInput.split(":"))
    time = hour + minute / 60
    
    schedule_info = {
        'dep': [Departure],
        'destination': [Destination],
        'arrival_time': [time]
    }
    
    schedule_df = pd.DataFrame(schedule_info)
    
    schedule_df.to_csv('./info.csv')

    
css = """
.widget-label {
    min-width: 150px;
    text-align: right;
    padding-right: 10px;
}
#container {
    background-color: orange;
    color: white;
}
#title {
    color: red;
}
"""

html = "<style>{}</style>".format(css)
display(HTML(html))
    
title = widgets.HTML('<h2 id="title">Route Planner</h2>')
departure_widget = widgets.Text(value='Küsnacht ZH', description='Departure')
destination_widget = widgets.Text(value='Zürich, Neeserweg', description='Destination')
input_widget = widgets.Text(value = proposed_hour + ":"+current_minute, description = 'Arrive at (HH:MM)')

input_widget.continuous_update = False
input_widget.observe(create_schedule, 'value')

container = VBox([title, departure_widget, destination_widget,input_widget], layout=widgets.Layout(id='container'))
display(container)


---

Reading the input data and sending it to Spark
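
The arrival time travels through the notebook as fractional hours (for example 17:30 becomes 17.5) and is turned back into a clock string with `timedelta`. A minimal sketch of that round trip, with hypothetical helper names:

```python
from datetime import timedelta

def to_fractional_hours(hhmm):
    """'HH:MM' -> hours as a float, as done in the UI callback."""
    hour, minute = map(int, hhmm.split(":"))
    return hour + minute / 60

def to_clock_string(hours):
    """Fractional hours -> 'H:MM:SS' via timedelta, as done in the next cell."""
    return str(timedelta(hours=hours))

print(to_fractional_hours("17:30"))
print(to_clock_string(17.5))
print(to_clock_string(8.25))  # hours below 10 are not zero-padded
```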

In [65]:
%%local
import pandas as pd
from datetime import timedelta

schedule_df = pd.read_csv('./info.csv')
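# arrival_time is column 3 of the CSV, because to_csv also wrote the index as column 0
time_local = schedule_df['arrival_time'].iloc[0]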
time_local = str(timedelta(hours=time_local))

In [66]:
%%send_to_spark -i schedule_df -t df


Successfully passed 'schedule_df' as 'schedule_df' to Spark kernel

Getting the stop id corresponding to each stop name.

Several stops can share one name (a parent station and its children), so we keep only the stop ids that actually appear in `trans_map`.

In [67]:
%%spark -o stop_ids
stop_name1 = schedule_df.select('dep').rdd.flatMap(lambda x: x).collect()[0]
stop_name2 = schedule_df.select('destination').rdd.flatMap(lambda x: x).collect()[0]
time = schedule_df.select('arrival_time').rdd.flatMap(lambda x: x).collect()[0]

stop1 = stops_in_15.filter(stops_in_15["stop_name"]==stop_name1)

stop2 = stops_in_15.filter(stops_in_15["stop_name"]==stop_name2)

# stop1 and stop2 are dataframes with potentially multiple rows,
# because stops that have a common parent share the same name.
# It is possible that only the child stop exists in the trans map, so
# we look the candidate ids up there.

stop1_id_unfilter = stop1.select('stop_id')
stop2_id_unfilter = stop2.select('stop_id')

filtered_stop1 = stop1_id_unfilter.join(trans_map, stop1_id_unfilter.stop_id == trans_map.stop_id1, "inner").select(stop1_id_unfilter.stop_id).distinct()
filtered_stop2 = stop2_id_unfilter.join(trans_map, stop2_id_unfilter.stop_id == trans_map.stop_id2, "inner").select(stop2_id_unfilter.stop_id).distinct()

# stop_id1, stop_id2 is a string
stop_id1 = filtered_stop1.select('stop_id').first().stop_id
stop_id2 = filtered_stop2.select('stop_id').first().stop_id
schema = StructType([
    StructField("stop_id1", StringType(), nullable=False),
    StructField("stop_id2", StringType(), nullable=False)
])
# single-row dataframe, sent back to the local kernel through `-o stop_ids`
stop_ids = spark.createDataFrame([(stop_id1, stop_id2)], schema)


In [68]:
%%local
stop_id1 = str(stop_ids['stop_id1'].iloc[0])
stop_id2 = str(stop_ids['stop_id2'].iloc[0])

## Graph construction

Before building the graph, we filter out trips that:
- arrive later than our desired arrival time;
- arrive too early (more than 2 hours before our desired arrival time).
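
A plain-Python sketch of this window test, assuming arrival times are `HH:MM:SS` strings and the desired arrival time is given in fractional hours. Seconds are ignored and both bounds are inclusive, which matches the `hour + minute / 60` expression and `between` used in the next cell; `arrives_in_window` is a hypothetical name.

```python
def arrives_in_window(arrival, desired_hours, window_hours=2):
    """True if `arrival` falls within [desired - window, desired], bounds included."""
    hour, minute, _ = map(int, arrival.split(":"))
    arrival_hours = hour + minute / 60
    return desired_hours - window_hours <= arrival_hours <= desired_hours

print(arrives_in_window("17:30:00", 17.5))  # exactly on time
print(arrives_in_window("17:31:00", 17.5))  # too late
print(arrives_in_window("15:29:00", 17.5))  # too early
```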

In [69]:
arrival_time_in_hours = F.hour("arrival_time") + F.minute("arrival_time") / 60

start_time = time - 2
end_time = time

trips_arriving_in_time_range = weekday_trans.filter(arrival_time_in_hours.between(start_time, end_time))

# deduplicate, otherwise the join below repeats every edge once per matching stop time
distinct_trip_ids = trips_arriving_in_time_range.select("trip_id").distinct()

trans_map2 = trans_map.join(distinct_trip_ids,'trip_id')

trans_map2 = trans_map2.union(walk_map)

trans_map2.write.parquet('/user/{0}/file/'.format(username), mode='overwrite')


In [70]:
# extract route_id from trip_id
@F.udf(returnType=StringType())
def split_trip(t):
    tmp = t.split('.')
    if len(tmp) > 2: # train/bus/...: the route id is the third '.'-separated field
        return tmp[2]
    else: # 'walk'
        return t
# refer to the column by name, since trans_map2 is a union and no longer trans_map itself
trans_map3 = trans_map2.withColumn('route_id', split_trip(F.col('trip_id')))
trans_map3.write.mode("overwrite").parquet('/user/{0}/trans_map3/'.format(username))


In [71]:
%%local

from hdfs3 import HDFileSystem
import pandas as pd
import networkx as nx

hdfs = HDFileSystem(host='hdfs://iccluster044.iccluster.epfl.ch', port=8020, user='ebouille')
files = hdfs.glob('/user/{0}/trans_map3/*.parquet'.format(username))
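frames = []
for file in files:
    with hdfs.open(file) as f:
        frames.append(pd.read_parquet(f))
# concatenate once instead of growing the dataframe inside the loop
trans = pd.concat(frames) if frames else pd.DataFrame()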

In [72]:
%%local
edges = trans.values.tolist()

# column positions: 0 trip_id, 1 stop_id1, 2 stop_id2, 3 used_time, 7 route_id
for i, row in enumerate(edges):
    edges[i] = (row[1], row[2], {"time": float(row[3]), "trip_id": row[0], "route_id": row[7]})

In [259]:
%%local
from functions import *
g = create_multigraph(edges)

In [None]:
%%local 
source = stop_id1#'8591365'
target = stop_id2#'8591070'
# the edge attribute holding the travel time is named "time" in the edge dictionaries built above
paths, route_id_lists = k_shortest_paths(g, source, target, k=50, weight='time')


In [268]:
from pyspark.sql.functions import to_timestamp, date_format
# ORC files carry their own schema, so the CSV-only options (sep, inferSchema, header) are not needed
actual_data_temp = spark.read.orc("/data/sbb/part_orc/istdaten")
all_data = actual_data_temp.withColumnRenamed("betriebstag", "date_of_trip")\
                .withColumnRenamed("fahrt_bezeichner", "journey_id")\
                .withColumnRenamed("betreiber_id", "operator_id")\
                .withColumnRenamed("betreiber_abk", "operator_abk")\
                .withColumnRenamed("betreiber_name", "operator_name")\
                .withColumnRenamed("produkt_id", "transport_type")\
                .withColumnRenamed("linien_id", "service_number")\
                .withColumnRenamed("linien_text", "service_type")\
                .withColumnRenamed("umlauf_id", "circulation_id")\
                .withColumnRenamed("verkehrsmittel_text", "means_of_transport_text")\
                .withColumnRenamed("zusatzfahrt_tf", "is_additional")\
                .withColumnRenamed("faellt_aus_tf", "is_failed")\
                .withColumnRenamed("bpuic", "stop_id")\
                .withColumnRenamed("haltestellen_name", "stop_name")\
                .withColumnRenamed("ankunftszeit", "schedule_arrival_timestamp")\
                .withColumnRenamed("an_prognose", "actual_arrival_timestamp")\
                .withColumnRenamed("abfahrtszeit", "schedule_departure_timestamp")\
                .withColumnRenamed("ab_prognose", "actual_departure_timestamp")\
                .withColumnRenamed("durchfahrt_tf", "not_stop")\
                .withColumn("schedule_departure_time", date_format(to_timestamp("schedule_departure_timestamp", "dd.MM.yyyy HH:mm"), "HH:mm:ss"))\
                .withColumn("schedule_arrival_time", date_format(to_timestamp("schedule_arrival_timestamp", "dd.MM.yyyy HH:mm"), "HH:mm:ss"))\
                .withColumn("date", date_format(to_timestamp("schedule_departure_timestamp", "dd.MM.yyyy HH:mm"), "dd.MM.yyyy"))
all_data.printSchema()

root
 |-- date_of_trip: string (nullable = true)
 |-- journey_id: string (nullable = true)
 |-- operator_id: string (nullable = true)
 |-- operator_abk: string (nullable = true)
 |-- operator_name: string (nullable = true)
 |-- transport_type: string (nullable = true)
 |-- service_number: string (nullable = true)
 |-- service_type: string (nullable = true)
 |-- circulation_id: string (nullable = true)
 |-- means_of_transport_text: string (nullable = true)
 |-- is_additional: string (nullable = true)
 |-- is_failed: string (nullable = true)
 |-- stop_id: string (nullable = true)
 |-- stop_name: string (nullable = true)
 |-- schedule_arrival_timestamp: string (nullable = true)
 |-- actual_arrival_timestamp: string (nullable = true)
 |-- an_prognose_status: string (nullable = true)
 |-- schedule_departure_timestamp: string (nullable = true)
 |-- actual_departure_timestamp: string (nullable = true)
 |-- ab_prognose_status: string (nullable = true)
 |-- not_stop: string (nullable = true)

In [269]:
from pyspark.sql.functions import col, expr, round, unix_timestamp
from pyspark.sql.types import DateType
from datetime import datetime, timedelta

def get_probability_of_train_being_on_time(all_data, 
                                           start_stop_id, 
                                           end_stop_id,
                                           schedule_departure_time, 
                                           schedule_arrival_time, 
                                           transport_type, 
                                           date: str,
                                           stopover_time_in_minutes: int):
    
    # Format of the `date` argument (e.g. '13.02.2022').
    # Named `date_fmt` so it does not shadow pyspark's `date_format` function.
    date_fmt = '%d.%m.%Y'
    
    # Convert the date string to a datetime.date
    trip_date = datetime.strptime(str(date), date_fmt).date()

    # Start of the 30-day history window that ends on the trip date
    history_lower_bound_date = trip_date - timedelta(days=30)

    # The `date` column is a 'dd.MM.yyyy' string, which does not sort chronologically,
    # so parse it to a real date before comparing. `between` expects (lower, upper).
    in_history_window = expr("to_date(`date`, 'dd.MM.yyyy')").between(history_lower_bound_date, trip_date)

    # Keep only the records that arrive at the destination stop at the same scheduled time.
    trip_arrivals = all_data.filter(
        (col('stop_id') == end_stop_id) &
        (col('schedule_arrival_time') == schedule_arrival_time) &
        (col('transport_type') == transport_type) &
        in_history_window
    )

    # Keep only the records that leave the start stop at the same scheduled departure time.
    trip_departures = all_data.filter(
        (col('stop_id') == start_stop_id) &
        (col('schedule_departure_time') == schedule_departure_time) &
        (col('transport_type') == transport_type) &
        in_history_window
    )

    # Keep only the arrivals whose journey_id also appears in trip_departures, so that
    # only trips on the same line as the one under consideration are counted.
    # A left semi join does this on the cluster instead of collecting the ids to the driver.
    trip_arrivals_filtered = trip_arrivals.join(trip_departures.select('journey_id').distinct(), on='journey_id', how='left_semi')

    # in the trip_arrivals_filtered, add a column with the computation of the difference 
    # between actual_arrival_timestamp and schedule_arrival_timestamp so that we could have, 
    # rounded in minutes, the number of minutes of delay for every trip.
    
    # Convert timestamp columns to Unix timestamp
    trip_arrivals_filtered = trip_arrivals_filtered.withColumn('actual_arrival_unix', unix_timestamp(col('actual_arrival_timestamp'), 'dd.MM.yyyy HH:mm:ss'))
    trip_arrivals_filtered = trip_arrivals_filtered.withColumn('schedule_arrival_unix', unix_timestamp(col('schedule_arrival_timestamp'), 'dd.MM.yyyy HH:mm'))

    # Calculate the time difference in minutes and round to integers
    trip_arrivals_filtered = trip_arrivals_filtered.withColumn('delay_minutes', round((col('actual_arrival_unix') - col('schedule_arrival_unix')) / 60).cast('integer'))

    # Calculate the probability of the train being on time
    on_time_trips = trip_arrivals_filtered.filter(col('delay_minutes') <= stopover_time_in_minutes).count()
    total_trips = trip_arrivals_filtered.count()
    
    if total_trips == 0:
        print("default proba = 1 as no trips in history")
        return 1
    
    probability_of_on_time = on_time_trips / total_trips
    
    print("on_time_trips", on_time_trips)
    print("total_trips", total_trips)
    print("probability_of_on_time", probability_of_on_time)
    
    return probability_of_on_time
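
The delay logic of the function above can be checked locally on a handful of rows. The following is a minimal pandas sketch, not the Spark implementation: the helper name `on_time_probability`, the column names `scheduled` and `actual`, and the sample timestamps are all made up for illustration. It assumes the same timestamp formats as the Spark code and the same default of 1 when there is no history.

```python
import pandas as pd

def on_time_probability(history: pd.DataFrame, slack_minutes: int) -> float:
    """Share of historical arrivals whose delay fits within the transfer slack."""
    if history.empty:
        return 1.0  # same default as the Spark function when no history exists
    scheduled = pd.to_datetime(history["scheduled"], format="%d.%m.%Y %H:%M")
    actual = pd.to_datetime(history["actual"], format="%d.%m.%Y %H:%M:%S")
    # Delay in whole minutes, negative when the vehicle arrived early
    delay_minutes = ((actual - scheduled).dt.total_seconds() / 60).round()
    return float((delay_minutes <= slack_minutes).mean())

# Hypothetical history of one connection over four days
history = pd.DataFrame({
    "scheduled": ["01.02.2022 08:15", "02.02.2022 08:15",
                  "03.02.2022 08:15", "04.02.2022 08:15"],
    "actual": ["01.02.2022 08:15:40", "02.02.2022 08:18:10",
               "03.02.2022 08:14:50", "04.02.2022 08:21:00"],
})
p = on_time_probability(history, slack_minutes=2)
print(p)
```

With a 2-minute slack, the rounded delays of 1, 3, 0 and 6 minutes leave two of the four arrivals on time.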

In [116]:
%%local
str('16:00:00' - pd.Timedelta(seconds=100)).split(' ')[-1]

'15:58:20'
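
The cell above relies on pandas coercing the string operand to a `Timedelta` during the subtraction. A small sketch that makes the conversion explicit is shown below; the helper name `shift_time` is hypothetical and is not used elsewhere in the notebook.

```python
import pandas as pd

def shift_time(hhmmss: str, seconds: int) -> str:
    """Return the 'HH:MM:SS' time that lies `seconds` before `hhmmss`."""
    shifted = pd.Timedelta(hhmmss) - pd.Timedelta(seconds=seconds)
    # str(Timedelta) looks like '0 days 15:58:20'; keep only the clock part
    return str(shifted).split(' ')[-1]

print(shift_time('16:00:00', 100))
```

This only behaves as intended while the result stays within the same day, which holds for the business-hours timetable used here.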

In [261]:
%%local
pd.options.mode.chained_assignment = 'warn'
def calculate_connections(trans, route_ids, nodes, arrival_time):
    latest_arrivals = []

    #Keep track of the arrival time and when different routes are taken
    prev_arrival = arrival_time
    prev_route_id = route_ids[-1]
    departure_time = None

    #Generate the paths
    node_paths = [(nodes[j], nodes[j+1]) for j in range(len(nodes)-1)]
    node_paths = [(None, node_paths[0][0])] + node_paths
    route_ids = [None] + route_ids
    
    #Iterate backwards, starting from the destination
    for i, (r, edge) in enumerate(zip(route_ids[::-1],node_paths[::-1])):
        if r is None:
            break
    
        #Get the timetable rows for this edge on the corresponding route
        stop_id1 = edge[0]
        stop_id2 = edge[1]
        temp = trans[(trans.stop_id1==stop_id1)&(trans.stop_id2==stop_id2)&(trans.route_id==r)]

        #No such edge in the timetable: this path is not feasible
        if temp.empty:
            return None

        #If we change routes we need to consider the 2 minute transition time
        if prev_route_id != r:
            prev_arrival = str(prev_arrival - pd.Timedelta('2 min')).split(' ')[-1]
            
        if temp.trip_id.iloc[0] == 'walk':
            #A walk has no timetable: it ends when the next leg departs
            #(or at the requested arrival time if the walk is the last leg)
            new_arrival_time = latest_arrivals[-1].stop_id1_dep if latest_arrivals else prev_arrival
            used_time = int(float(temp.used_time.iloc[0]))
            trans.loc[temp.index[0], 'stop_id2_arr'] = new_arrival_time
            trans.loc[temp.index[0], 'stop_id1_dep'] = str(new_arrival_time - pd.Timedelta(seconds=used_time)).split(' ')[-1]
            #Re-read the row so that the times written above are the ones used below
            temp = trans.loc[[temp.index[0]]]
        else:
            #Latest trip on this edge that still arrives in time
            temp = temp[temp.stop_id2_arr<=prev_arrival].sort_values('stop_id2_arr',ascending=False)
            if temp.empty:
                return None
        
        #Update arrival times and current route
        latest_arrivals.append(temp.iloc[0])
        prev_arrival = temp.stop_id1_dep.iloc[0]
        prev_route_id = r
    return latest_arrivals
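
The backward search in `calculate_connections` can be illustrated without pandas. The sketch below is a simplified stand-in, not the notebook's function: the name `latest_connections`, the dictionary-based timetable and the toy stops `A`, `B`, `C` are assumptions made for the example, and walking legs are left out. It keeps the same idea: start from the requested arrival time, pick the latest trip that still arrives in time, and subtract 2 minutes whenever the route changes.

```python
from datetime import datetime, timedelta

FMT = "%H:%M:%S"
TRANSFER = timedelta(minutes=2)  # transfer time when the route changes

def latest_connections(legs, timetable, arrival_time):
    """Walk the legs from last to first, picking the latest feasible trip.

    legs: list of (from_stop, to_stop, route_id) in travel order.
    timetable: dict mapping a leg to a list of (departure, arrival) strings.
    Returns the chosen (departure, arrival) per leg in travel order, or None.
    """
    deadline = datetime.strptime(arrival_time, FMT)
    prev_route = legs[-1][2]
    chosen = []
    for leg in reversed(legs):
        if leg[2] != prev_route:
            deadline -= TRANSFER
        options = [(datetime.strptime(d, FMT), datetime.strptime(a, FMT))
                   for d, a in timetable.get(leg, [])]
        feasible = [o for o in options if o[1] <= deadline]
        if not feasible:
            return None  # no trip on this leg arrives early enough
        dep, arr = max(feasible, key=lambda o: o[1])
        chosen.append((dep.strftime(FMT), arr.strftime(FMT)))
        deadline = dep  # the previous leg must arrive before this departure
        prev_route = leg[2]
    return chosen[::-1]

legs = [("A", "B", "r1"), ("B", "C", "r2")]
timetable = {
    ("A", "B", "r1"): [("15:00:00", "15:10:00"), ("15:20:00", "15:30:00"),
                       ("15:40:00", "15:50:00")],
    ("B", "C", "r2"): [("15:35:00", "15:55:00"), ("15:50:00", "16:10:00")],
}
plan = latest_connections(legs, timetable, "16:00:00")
print(plan)
```

For an arrival by 16:00:00 the sketch picks the 15:35 departure from `B`, then the 15:20 departure from `A`, since the 15:40 trip would arrive too late for the transfer.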

In [262]:
%%local
from transport_mapping import transport_type_dict
arr_time = time_local
date = '13.02.2022'
input_dfs = []
trip_durations = []
for i, (stops, route_ids) in enumerate(zip(paths, route_id_lists)):
    latest_arrivals = calculate_connections(trans,route_ids,stops,arr_time)
    if latest_arrivals is None:
        continue
        
    link = get_connection_info(trans,latest_arrivals, stops, date, arr_time)
    
    input_df = pd.DataFrame(link, columns=['stop_id1','stop_id2','departure_time','arrival_time',
                                                  'transport_type','date','stopover_time_in_minutes', 'route_id'])
    input_df.date = [date]*len(input_df.date)
    input_df.transport_type = input_df.transport_type.apply(lambda r: transport_type_dict[r])
    input_df['index'] = [i]*len(input_df.date)
    input_dfs.append(input_df)
    trip_durations.append(calculate_total_time(input_df))

if len(input_dfs)==0:
    print('No possible trips')
else:
    input_dfs = pd.concat(input_dfs)

In [270]:
%%send_to_spark -i input_dfs -t df

Successfully passed 'input_dfs' as 'input_dfs' to Spark kernel

In [None]:
%%spark -o probs
from pyspark.sql import types as T

path_ids = input_dfs.select('index').distinct().collect()
probs = []
for path_id in path_ids:
    prob = 1.0
    for r in input_dfs.filter(col('index') == path_id['index']).collect():
        prob *= get_probability_of_train_being_on_time(all_data, r['stop_id1'], r['stop_id2'], 
                                                  r['departure_time'], r['arrival_time'],
                                                  r['transport_type'], r['date'], r['stopover_time_in_minutes'])
    probs.append(prob)
schema = T.StructType([T.StructField('index', T.IntegerType()), T.StructField('probs', T.FloatType())])
probs = spark.createDataFrame(data=[(path_id['index'], p) for path_id, p in zip(path_ids, probs)], schema=schema)
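
The loop above multiplies the per-leg probabilities of each path, which treats the legs as independent. A minimal local sketch of that aggregation follows; the helper name `path_confidence` and the sample numbers are made up for illustration.

```python
from collections import defaultdict
from math import prod

def path_confidence(leg_probs):
    """Multiply per-leg on-time probabilities for each path id.

    leg_probs: iterable of (path_id, probability) pairs, one per leg.
    """
    by_path = defaultdict(list)
    for path_id, p in leg_probs:
        by_path[path_id].append(p)
    return {path_id: prod(ps) for path_id, ps in by_path.items()}

# Hypothetical per-leg probabilities for two candidate paths
leg_probs = [(0, 0.9), (0, 0.8), (1, 1.0), (1, 0.5), (1, 0.5)]
conf = path_confidence(leg_probs)
print(conf)
```

A path with many transfers is penalised even if each leg is fairly reliable, which matches the intent of ranking trips by confidence.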

In [105]:
%%local
files = hdfs.glob('/user/{0}/stops_in_15/*.parquet'.format(username))
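# Read every part file, then concatenate once instead of growing a DataFrame in the loop
parts = []
for file in files:
    with hdfs.open(file) as f:
        parts.append(pd.read_parquet(f))
stops_in_15 = pd.concat(parts) if parts else pd.DataFrame()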

In [173]:
%%local
import pandas as pd

# Merge the dataframes based on 'stop_id1' and 'stop_id'
merged_df = pd.merge(input_dfs, stops_in_15, left_on='stop_id1', right_on='stop_id', how='left')

# Rename the columns to reflect the merged data
merged_df = merged_df.rename(columns={'stop_name': 'from', 'stop_lat': 'lat1', 'stop_lon': 'lon1'})

# Merge again based on 'stop_id2' and 'stop_id'
merged_df = pd.merge(merged_df, stops_in_15, left_on='stop_id2', right_on='stop_id', how='left')

# Rename the columns to reflect the merged data
merged_df = merged_df.rename(columns={'stop_name': 'to', 'stop_lat': 'lat2', 
                                      'stop_lon': 'lon2', 'transport_type':'transport',
                                      'index':'trip_id','departure_time':'departure'})

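# Drop the redundant columns. errors='ignore' skips names that are absent,
# since the second merge adds _x/_y suffixes to the overlapping columns.
merged_df = merged_df.drop(columns=['stop_id_x', 'stop_id_y','stopover_time_in_minutes',
                                    'location_type_x','parent_station_x','stop_id',
                                    'location_type_y', 'parent_station_y','stop_id1','stop_id2'],
                           errors='ignore')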
trips_df = merged_df.merge(probs, left_on='trip_id', right_on='index', how='left').rename(columns={'probs':'confidence'})
trips_df.confidence = trips_df.confidence * 100.
trips_df.lat1 = trips_df.lat1.astype(float)
trips_df.lat2 = trips_df.lat2.astype(float)
trips_df.lon1 = trips_df.lon1.astype(float)
trips_df.lon2 = trips_df.lon2.astype(float)

trips_df.to_csv('trips.csv')