# Add statistical information to edges:

In this notebook, the mean and std computed in `compute_mean_std.ipynb` and saved in `mean_std_distributions.pickle` are loaded and added to the edges of our network. The result is saved to `edges_with_mean_and_std_sec.orc` and will later be used to create the network.

## Set up:

In [1]:
%%configure
{"conf": {
    "spark.app.name": "dslab-group_final"
}}



In [2]:
from pyspark.sql.functions import col, udf, lit
from pyspark.sql.types import IntegerType

Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,Current session?
7178,application_1589299642358_1674,pyspark,idle,Link,Link,✔



SparkSession available as 'spark'.



In [3]:
%%local
import os
username = os.environ['JUPYTERHUB_USER']

In [4]:
%%send_to_spark -i username -t str -n username


Successfully passed 'username' as 'username' to Spark kernel

## Create a new edge dataframe:

**TO COMPLETE**

### Load data:

In [5]:
edges_df = spark.read.orc("/user/{}/edges.orc".format(username))
trips = spark.read.format('orc').load('/data/sbb/timetables/orc/trips/000000_0')
routes = spark.read.format('orc').load('/data/sbb/timetables/orc/routes/000000_0')

edges_with_route = trips.join(routes, 'route_id').select(col('trip_id'), col('route_desc')).distinct()\
                        .join(edges_df, 'trip_id')
edges_with_route.show()


+--------------------+----------+-------+------------+--------------+------------+-------------+
|             trip_id|route_desc|stop_id|arrival_time|departure_time|   next_stop|trip_duration|
+--------------------+----------+-------+------------+--------------+------------+-------------+
|256.TA.26-10-B-j1...|    S-Bahn|8503090|         804|           804|8503088:0:22|          3.0|
|256.TA.26-10-B-j1...|    S-Bahn|8503051|         801|           802|     8503090|          2.0|
|256.TA.26-10-B-j1...|    S-Bahn|8503052|         800|           800|     8503051|          1.0|
|256.TA.26-10-B-j1...|    S-Bahn|8503053|         799|           799|     8503052|          1.0|
|256.TA.26-10-B-j1...|    S-Bahn|8503054|         797|           797|     8503053|          2.0|
|256.TA.26-10-B-j1...|    S-Bahn|8503055|         792|           793|     8503054|          4.0|
|256.TA.26-10-B-j1...|    S-Bahn|8503056|         790|           790|     8503055|          2.0|
|256.TA.26-10-B-j1...|    S-Ba

### Add transportation type and mean, std:

Create a dictionary that maps each `route_desc` value to the abbreviation that is later matched against `verkehrsmittel_text`:

In [6]:
translate_route_desc = {
    'TGV': 'TGV',
    'Eurocity': 'EC',
    'tandseilbahn': 'AT',
    'Regionalzug': 'R',
    'RegioExpress': 'RE',
    'S-Bahn': 'S',
    'Luftseilbahn': '',
    'Sesselbahn': '',
    'Taxi': '',
    'Fähre': '',
    'Tram': 'Tram',
    'ICE': 'ICE',
    'Bus': 'Bus',
    'Gondelbahn': '',
    'Nacht-Zug': '',
    'Standseilbahn': 'AT',
    'Auoreisezug': 'ARZ',
    'Eurostar': 'EC',
    'Schiff': '',
    'Schnellzug': 'TGV',
    'Intercity': 'IC',
    'InterRegio': 'IR',
    'Extrazug': 'EXT',
    'Metro': 'Metro'
}


In [7]:
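@udf("string")
def translate_dict(text):
    # Map a route_desc value to its abbreviation; unknown or missing values become ''
    return translate_route_desc.get(text, '')

@udf('string')
def truncate_stop_id_column(s):
    # Keep the part of the stop id before the first ':' (e.g. '8503088:0:22' -> '8503088')
    return s.split(':')[0] if s is not None else None

@udf('string')
def truncate_stop_id_len(s):
    # Keep the first 7 characters of the stop id
    return str(s)[:7] if s is not None else None

@udf('long')
def leng(s):
    # Length of the string representation of a value (not used below)
    return len(str(s))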
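
To illustrate what the two truncation helpers do outside of Spark, here is a minimal plain-Python sketch. The function names `keep_before_colon` and `first_seven_chars` are hypothetical stand-ins for the UDFs; the sample IDs are taken from the `edges_with_route` output shown earlier.

```python
def keep_before_colon(stop_id):
    """Keep the part before the first ':' (mirrors truncate_stop_id_column)."""
    return stop_id.split(':')[0]


def first_seven_chars(stop_id):
    """Keep the first 7 characters (mirrors truncate_stop_id_len)."""
    return str(stop_id)[:7]


# One id with a ':'-separated suffix and one plain id
examples = ['8503088:0:22', '8503090']
stripped = [keep_before_colon(s) for s in examples]
shortened = [first_seven_chars(s) for s in examples]
print(stripped)
print(shortened)
```

For these two sample IDs both helpers produce the same 7-character key, which is what the join on `truncated_stop_id` relies on.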


Load the pickled pandas dataframe from `mean_std_distributions.pickle` locally and send it to the Spark kernel so that it can be added to the edges.

In [8]:
%%local
import pandas as pd
delays = pd.read_pickle('mean_std_distributions.pickle')

In [9]:
%%send_to_spark -i delays -t df -m 100000


Successfully passed 'delays' as 'delays' to Spark kernel

In [10]:
edges_with_route = edges_with_route.withColumn('route_desc_translated', translate_dict(col('route_desc')))\
                                       .withColumn('hour', (col('arrival_time')/60).cast(IntegerType()))\
                                       .withColumn('truncated_stop_id', truncate_stop_id_column(col('stop_id'))).cache()
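
The `hour` column can be checked by hand with a small plain-Python sketch. It assumes that `arrival_time` is expressed in minutes since midnight, which the division by 60 suggests but the notebook does not state explicitly; `hour_bucket` is a hypothetical helper name.

```python
def hour_bucket(arrival_time_minutes):
    # Divide by 60 and truncate, like (col('arrival_time') / 60).cast(IntegerType())
    # does for non-negative values
    return int(arrival_time_minutes / 60)


# Arrival times taken from the tables printed in this notebook
arrival_times = [804, 790, 684]
hours = [hour_bucket(t) for t in arrival_times]
print(hours)
```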


In [11]:
delays = delays.select(col('mean'), col('std'), col('hour').alias('hour_2'), col('stop_id').alias('stop_id_2'), col('verkehrsmittel_text'))\
               .withColumn('truncated_stop_id', truncate_stop_id_len(col('stop_id_2')))
delays = delays.withColumn('mean', col('mean')/60).withColumn('std', col('std')/60)


In [13]:
edges_final = edges_with_route.join(delays, (edges_with_route.hour == delays.hour_2) &\
                                            (edges_with_route.truncated_stop_id == delays.truncated_stop_id) &\
                                            (edges_with_route.route_desc_translated == delays.verkehrsmittel_text), how='left')
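
The effect of the left join on the three keys (hour, truncated stop id, transport type) can be sketched in plain Python. The rows and statistics below are made-up toy values, and the sketch assumes at most one statistics row per key; a real Spark left join would emit one output row per match.

```python
# Toy edges: only the three join keys plus an identifier
edges = [
    {'trip': 'a', 'hour': 13, 'stop': '8503090', 'kind': 'S'},
    {'trip': 'b', 'hour': 11, 'stop': '8573732', 'kind': 'Bus'},
]

# Toy delay statistics, indexed by (hour, stop, kind)
delay_stats = {
    (13, '8503090', 'S'): {'mean': 0.5, 'std': 1.2},
}

joined = []
for edge in edges:
    key = (edge['hour'], edge['stop'], edge['kind'])
    # Left join: keep every edge, fill mean/std with None when nothing matches
    stats = delay_stats.get(key, {'mean': None, 'std': None})
    joined.append({**edge, **stats})

for row in joined:
    print(row)
```

Every edge survives the join; edges without a matching statistics row simply carry null `mean` and `std`, which is what the null checks further down count.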


Create edges dataframe with the following information:
 - trip_id
 - stop_id
 - train_type
 - arrival_time
 - departure_time
 - next_stop
 - trip_duration
 - mean
 - std 
 
Starting from the original edges dataframe (from `edges.orc`), we now add the mean, std and train type information.

In [14]:
edges_final = edges_final.select('trip_id', 'stop_id', col('route_desc').alias('train_type'),\
                                           'arrival_time', 'departure_time', 'next_stop', 'trip_duration', 'mean', 'std').cache()


In [15]:
print('Replacing ->')
edges_df.show(5)
print('With ->')
edges_final.filter(col('mean').isNotNull()).show(5)


Replacing ->
+--------------------+-------+------------+--------------+-----------+-------------+
|             trip_id|stop_id|arrival_time|departure_time|  next_stop|trip_duration|
+--------------------+-------+------------+--------------+-----------+-------------+
|1391.TA.26-225-j1...|8573732|         684|           684|    8573734|          2.0|
|1391.TA.26-225-j1...|8573734|         686|           686|    8582763|          2.0|
|1391.TA.26-225-j1...|8582763|         688|           688|    8580268|          1.0|
|1391.TA.26-225-j1...|8580268|         689|           689|    8502953|          1.0|
|1391.TA.26-225-j1...|8502953|         690|           690|8573178:0:D|          3.0|
+--------------------+-------+------------+--------------+-----------+-------------+
only showing top 5 rows

With ->
+--------------------+-----------+----------+------------+--------------+-----------+-------------+------------------+------------------+
|             trip_id|    stop_id|train_type|arriva

#### Check how many edges have no mean or std information:

In [19]:
(edges_final.filter(col('mean').isNull() | col('std').isNull()).count() 
 / float(edges_final.filter(~col('mean').isNull() & ~col('std').isNull()).count()))


0.04107935770845686
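
Note that the cell divides the number of edges with nulls by the number of edges without nulls, so the result is a ratio rather than a share of all edges. As a small worked conversion (the variable names are only illustrative): if r = nulls / non-nulls, then nulls / total = r / (1 + r).

```python
# Ratio printed by the cell above: edges with nulls / edges without nulls
ratio_null_to_non_null = 0.04107935770845686

# Share of all edges that lack statistics: r / (1 + r)
fraction_null = ratio_null_to_non_null / (1 + ratio_null_to_non_null)
print(round(fraction_null, 4))
```

So a little under 4% of all edges have no statistics attached.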

In [16]:
print('Proportion of null values:\n\tMean: {:.3f}'
      .format(edges_final.filter(col('mean').isNull()).select('mean').count() / float(edges_final.filter(~col('mean').isNull()).select('mean').count())))
print('\tStd: {}'.format(edges_final.filter(col('std').isNull()).select('std').count() / float(edges_final.filter(~col('std').isNull()).select('std').count())))


Proportion of null values:
	Mean: 0.041
	Std: 0.0410793577085

The values above are ratios of edges without statistics to edges with statistics (about 0.041), so the large majority of edges have both a mean and a std. For edges that have no values, we will replace the mean by the duration of the trip and the std by 0 in further computations.
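
The fallback rule can be sketched in plain Python as follows. This is only an illustration on toy rows with made-up values; `fill_missing_stats` is a hypothetical helper, and in Spark the same rule could presumably be expressed with `coalesce` from `pyspark.sql.functions`.

```python
def fill_missing_stats(edge):
    """Return a copy of the edge with missing mean/std replaced by the fallbacks."""
    filled = dict(edge)
    if filled['mean'] is None:
        # No statistics available: fall back to the trip duration
        filled['mean'] = filled['trip_duration']
    if filled['std'] is None:
        # No statistics available: treat the edge as having no spread
        filled['std'] = 0.0
    return filled


# Toy rows: one with statistics, one without
rows = [
    {'trip_duration': 3.0, 'mean': 0.4, 'std': 1.1},
    {'trip_duration': 2.0, 'mean': None, 'std': None},
]
filled_rows = [fill_missing_stats(r) for r in rows]
print(filled_rows)
```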

#### Write the edges to orc:

In [20]:
edges_final.write.format("orc").save("/user/{}/edges_with_mean_and_std_sec.orc".format(username))
