# To Begin With...

### Name your Spark application `GASPAR_final` or `GROUP_NAME_final`.

<div class='alert alert-info'><b>Any application without a proper name will be promptly killed.</b></div>

In [None]:
%%configure
{"conf": {
    "spark.app.name": <--NAME IT WITH YOUR GASPAR OR GROUP NAME-->
}}

### Start Spark

In [None]:
# Initialization

### Read the [SBB actual data](https://opentransportdata.swiss/en/dataset/istdaten) in ORC format

In [3]:
sbb = spark.read.orc('/data/sbb/orc/istdaten')

In [4]:
sbb.printSchema()

root
 |-- betriebstag: string (nullable = true)
 |-- fahrt_bezeichner: string (nullable = true)
 |-- betreiber_id: string (nullable = true)
 |-- betreiber_abk: string (nullable = true)
 |-- betreiber_name: string (nullable = true)
 |-- produkt_id: string (nullable = true)
 |-- linien_id: string (nullable = true)
 |-- linien_text: string (nullable = true)
 |-- umlauf_id: string (nullable = true)
 |-- verkehrsmittel_text: string (nullable = true)
 |-- zusatzfahrt_tf: string (nullable = true)
 |-- faellt_aus_tf: string (nullable = true)
 |-- bpuic: string (nullable = true)
 |-- haltestellen_name: string (nullable = true)
 |-- ankunftszeit: string (nullable = true)
 |-- an_prognose: string (nullable = true)
 |-- an_prognose_status: string (nullable = true)
 |-- abfahrtszeit: string (nullable = true)
 |-- ab_prognose: string (nullable = true)
 |-- ab_prognose_status: string (nullable = true)
 |-- durchfahrt_tf: string (nullable = true)

### Read the station list data [BFKOORD_GEO](https://opentransportdata.swiss/en/cookbook/hafas-rohdaten-format-hrdf/#Abgrenzung)

In [5]:
metadata = spark.read.csv('/data/sbb/stations/bfkoordgeo.csv', header=True)

In [6]:
metadata.printSchema()

root
 |-- StationID: string (nullable = true)
 |-- Longitude: string (nullable = true)
 |-- Latitude: string (nullable = true)
 |-- Height: string (nullable = true)
 |-- Remark: string (nullable = true)

In [7]:
metadata.show(5)

+---------+---------+---------+------+----------------+
|StationID|Longitude| Latitude|Height|          Remark|
+---------+---------+---------+------+----------------+
|  0000002|26.074412|44.446770|     0|       Bucuresti|
|  0000003| 1.811446|50.901549|     0|          Calais|
|  0000004| 1.075329|51.284212|     0|      Canterbury|
|  0000005|-3.543547|50.729172|     0|          Exeter|
|  0000007| 9.733756|46.922368|   744|Fideris, Bahnhof|
+---------+---------+---------+------+----------------+
only showing top 5 rows

# Keep in mind

- Keep the whole repository, git history included, under 100 MB (use git-lfs if there is too much data)

# Ideas

Compute a route from `A` to `B` with departure time `d` and tolerance `p`:
1. Find all routes from `A` to `B` with departure time `d`
2. Remove all routes that historically were possible less than `p` percent of the time
3. From the remaining routes, take the fastest one
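
The three steps above can be sketched as follows. This is only a minimal illustration: the `Route` structure, the `pick_route` helper, and the example numbers are hypothetical, and it assumes the historical success rate of each candidate route is already known.

```python
from dataclasses import dataclass


@dataclass
class Route:
    """A candidate journey (hypothetical structure, for illustration only)."""
    name: str
    duration_min: int     # total travel time in minutes
    success_rate: float   # share of past days on which the route worked (0..1)


def pick_route(routes, p):
    """Return the fastest route that worked at least p percent of the time, or None."""
    reliable = [r for r in routes if r.success_rate * 100 >= p]
    return min(reliable, key=lambda r: r.duration_min, default=None)


# Made-up candidates for one (A, B, d) query
candidates = [
    Route("direct tram", 34, 0.97),
    Route("train + bus", 25, 0.80),
    Route("two trains", 28, 0.92),
]

best = pick_route(candidates, p=90)
print(best.name)  # two trains
```

With `p=90` the fastest candidate (`train + bus`) is dropped because it only worked 80% of the time, so the next fastest reliable route is returned.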

## Schedule

### Week 1:
- Construct graph
- Have some planner

Wednesday 13th May:
- Try to already have a graph, as complete as possible (if there are issues, ask Tao on Wednesday)
- With the graph constructed, clearly state how we'll proceed for each task (and validate during office hours)

Friday 15th May:
- Dijkstra, find shortest paths

### Week 2:
- Predictive model
- Test and validate
- Some visualization...

Wednesday 20th May:
- Predictive model

Friday 22nd May:
- Test and validate

Monday 25th May:
- Some visualization...

# Todo

1. Model the public transport infrastructure for your route planning algorithm using the data provided to you.
2. Build a predictive model using the historical arrival/departure time data, and optionally other sources of data.
3. Implement a robust route planning algorithm using this predictive model.
4. Test and validate your results.
5. Implement a simple Jupyter-based visualization to demonstrate your method, using Jupyter dashboard such as Voilà or ipywidgets.

## 1. Model the public transport infrastructure for your route planning algorithm using the data provided to you.

Assumptions:
- Use May 13-17, 2019 to build the schedule (8 AM to 8 PM). Each day of the week has its own schedule, but the schedule repeats from week to week.

Concept of network:
- Each node represents a station and a time.
- Each edge `A -> B` represents a transport that departs from station A at node A's time and arrives at station B at node B's time.
- Add walking edges from each station to every station at most 500 meters away (reachable on foot), arriving `distance/50` minutes later (with `distance` in meters, i.e. walking at 50 m/min).
- Add waiting edges from each station to itself at a later time. Example: for nodes (A, 15h00) and (A, 15h02), add an edge from the first to the second.

<img src="network-example.png">
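
A rough sketch of how the walking and waiting edges of this concept could be generated. The function names, the haversine distance, and the station coordinates are assumptions made for illustration; only the 500 m limit and the `distance/50` minutes rule come from the notes.

```python
import math

WALK_SPEED_M_PER_MIN = 50   # from the notes: walking takes distance/50 minutes
MAX_WALK_M = 500            # from the notes: only stations <= 500 m apart


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    r = 6_371_000  # mean Earth radius in meters
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def walking_edges(stations):
    """Yield (from_id, to_id, minutes) for every ordered pair within walking range."""
    for a_id, (a_lat, a_lon) in stations.items():
        for b_id, (b_lat, b_lon) in stations.items():
            if a_id == b_id:
                continue
            d = haversine_m(a_lat, a_lon, b_lat, b_lon)
            if d <= MAX_WALK_M:
                yield a_id, b_id, d / WALK_SPEED_M_PER_MIN


def waiting_edges(station_id, times_min):
    """Link consecutive time nodes of one station, e.g. (A, 15h00) -> (A, 15h02)."""
    times = sorted(set(times_min))
    return [((station_id, t0), (station_id, t1), t1 - t0)
            for t0, t1 in zip(times, times[1:])]


# Made-up coordinates (lat, lon), not real stations
stations = {
    "A": (47.3780, 8.5400),
    "B": (47.3800, 8.5400),   # roughly 220 m north of A
    "C": (47.4000, 8.5400),   # roughly 2.4 km north of A
}

for edge in walking_edges(stations):
    print(edge)
print(waiting_edges("A", [900, 902]))  # [(('A', 900), ('A', 902), 2)]
```

The pairwise scan is quadratic in the number of stations, which is probably acceptable once the station list is restricted to a small area; a spatial index would be an alternative for larger sets.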

To build the network concretely:
- `/data/sbb/timetables/csv/stop_times/2019/05/14/stop_times.txt`
- `trip_id | arrival_time | departure_time | stop_id | stop_sequence | pickup_type | drop_off_type`
- Load the table into Hive
- Add locations of stations (from `/somewhere.../stops.txt`)
- Remove stations that are more than 15km away from `Zürich HB`
- Remove stations that are not reachable from `Zürich HB` (If too hard, can do this later)
- Join on `trip_id` (is `trip_id` unique within a day?):
    - `stop_sequence` gives the order of the stops within a trip
    - `arrival_time` and `departure_time` give the waiting time at each stop
- Add foot edges (If too hard, can do this later)
    - For each station find stations that are <= 500m away
    - Add needed edges and nodes
- Add waiting edges (Do we limit waiting time?)
- Store it in `networkx` or just a simple table?
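
As a minimal sketch of the `trip_id` join, the snippet below turns `stop_times` rows into edges between consecutive stops of a trip. It uses plain Python dictionaries instead of Hive or Spark, the rows are made up, the `trip_edges` and `to_minutes` names are hypothetical, and it assumes times are formatted as `HH:MM:SS`.

```python
from collections import defaultdict


def to_minutes(hhmmss):
    """Convert 'HH:MM:SS' to minutes since midnight (seconds are dropped)."""
    h, m, _ = hhmmss.split(":")
    return int(h) * 60 + int(m)


def trip_edges(rows):
    """Build (from_stop, dep_min, to_stop, arr_min, trip_id) for consecutive stops."""
    by_trip = defaultdict(list)
    for row in rows:
        by_trip[row["trip_id"]].append(row)

    edges = []
    for trip_id, stops in by_trip.items():
        # stop_sequence gives the order of the stops within the trip
        stops.sort(key=lambda r: int(r["stop_sequence"]))
        for cur, nxt in zip(stops, stops[1:]):
            edges.append((cur["stop_id"], to_minutes(cur["departure_time"]),
                          nxt["stop_id"], to_minutes(nxt["arrival_time"]), trip_id))
    return edges


# Made-up rows, deliberately out of order
rows = [
    {"trip_id": "T1", "arrival_time": "08:10:00", "departure_time": "08:11:00",
     "stop_id": "B", "stop_sequence": "2"},
    {"trip_id": "T1", "arrival_time": "08:00:00", "departure_time": "08:00:00",
     "stop_id": "A", "stop_sequence": "1"},
    {"trip_id": "T1", "arrival_time": "08:20:00", "departure_time": "08:20:00",
     "stop_id": "C", "stop_sequence": "3"},
]

print(trip_edges(rows))
# [('A', 480, 'B', 490, 'T1'), ('B', 491, 'C', 500, 'T1')]
```

The one-minute gap between arriving at `B` (490) and leaving it (491) is the waiting time at that stop.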

## Office hours:
Graph generation:
- Don't put the time in the nodes
- Edges are time-varying: at a given time only specific edges are available; weight = number of minutes

Algorithms to use:
- Dijkstra works
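
One way to read this advice is a time-dependent variant of Dijkstra, where nodes are plain stations and an edge can only be taken if it departs after we reach its origin. The sketch below is an assumption about how that could look, not a validated design; the `earliest_arrival` name and the connection tuples are made up, and times are minutes since midnight.

```python
import heapq
from collections import defaultdict


def earliest_arrival(connections, source, target, start_min):
    """Time-dependent Dijkstra: the label of a station is its earliest arrival time.

    connections: iterable of (from_station, dep_min, to_station, arr_min).
    Returns the earliest arrival time at target, or None if it is unreachable.
    """
    outgoing = defaultdict(list)
    for frm, dep, to, arr in connections:
        outgoing[frm].append((dep, to, arr))

    best = {source: start_min}
    queue = [(start_min, source)]
    while queue:
        now, station = heapq.heappop(queue)
        if station == target:
            return now
        if now > best.get(station, float("inf")):
            continue  # stale queue entry
        for dep, to, arr in outgoing[station]:
            # An edge is usable only if it departs at or after our arrival here;
            # waiting at the station is implicit.
            if dep >= now and arr < best.get(to, float("inf")):
                best[to] = arr
                heapq.heappush(queue, (arr, to))
    return None


# Made-up connections
connections = [
    ("A", 480, "B", 490),
    ("B", 495, "C", 510),
    ("A", 485, "C", 520),
    ("B", 489, "C", 500),  # leaves B before we can get there
]

print(earliest_arrival(connections, "A", "C", start_min=480))  # 510
```

This relies on every connection arriving no earlier than it departs, which keeps the usual Dijkstra argument valid.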

Store network:
- `networkx` works

Can we use Spark?
- Yes

Are `stop_times.txt` and `stops.txt` sufficient?
- Also need `table_calendars`; check whether it's useful
- It tells when each `service_id` is operating

Predictive model:
- The general idea we had is correct: find all journeys that work -> rank them by time -> check the probability of success
- Maybe output a list of journeys (if we have time...)
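
The "check the probability of success" step could, for instance, be estimated empirically from past delays. The sketch below is only one possible approach and not the model decided on here: the function names and delay values are hypothetical, and multiplying transfer probabilities assumes the transfers are independent.

```python
def transfer_success_prob(arrival_delays_min, slack_min):
    """Share of past observations where the arrival delay fit within the transfer slack."""
    if not arrival_delays_min:
        return None  # no history, no estimate
    made = sum(1 for d in arrival_delays_min if d <= slack_min)
    return made / len(arrival_delays_min)


def journey_success_prob(transfer_probs):
    """Combine per-transfer probabilities, assuming transfers are independent."""
    prob = 1.0
    for p in transfer_probs:
        prob *= p
    return prob


# Made-up historical arrival delays (minutes) of the incoming transport
delays = [0, 1, 2, 5, 10]

# With 2 minutes of slack, 3 of the 5 past arrivals would have made the transfer
print(transfer_success_prob(delays, slack_min=2))  # 0.6
```

A journey whose combined probability falls below the chosen tolerance would then be discarded before ranking by travel time.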

## 2. Build a predictive model using the historical arrival/departure time data, and optionally other sources of data.

Later

## 3. Implement a robust route planning algorithm using this predictive model.

Later

## 4. Test and validate your results.

Later

## 5. Implement a simple Jupyter-based visualization to demonstrate your method, using Jupyter dashboard such as [Voilà](https://voila.readthedocs.io/en/stable/) or [ipywidgets](https://ipywidgets.readthedocs.io/en/stable/user_guide.html).

Later