# Add statistical information to edges:

In this notebook, the mean and std computed in `compute_mean_std.ipynb` and saved in `mean_std_distributions.pickle` are loaded and added to the edges of our network. The result is saved to `edges_with_mean_and_std_sec.orc` and will later be used to create the network.

## Set up:

In [1]:
%%configure
{"conf": {
    "spark.app.name": "dslab-group_final"
}}



In [2]:
from pyspark.sql.functions import col, udf, lit
from pyspark.sql.types import IntegerType

Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,Current session?
9032,application_1589299642358_3578,pyspark,idle,Link,Link,✔



SparkSession available as 'spark'.



In [3]:
%%local
import os
username = os.environ['JUPYTERHUB_USER']

In [4]:
%%send_to_spark -i username -t str -n username


Successfully passed 'username' as 'username' to Spark kernel

## Create a new edge dataframe:

**TO COMPLETE**

### Load data:

In [5]:
edges_df = spark.read.orc("/user/{}/edges.orc".format(username))
delays = spark.read.orc("/user/{}/delay_distribution_percentiles.orc".format(username))
trips = spark.read.format('orc').load('/data/sbb/timetables/orc/trips/000000_0')
routes = spark.read.format('orc').load('/data/sbb/timetables/orc/routes/000000_0')

edges_with_route = trips.join(routes, 'route_id').select(col('trip_id'), col('route_desc')).distinct()\
                        .join(edges_df, 'trip_id')
edges_with_route.show()


+--------------------+----------+-------+------------+--------------+------------+-------------+
|             trip_id|route_desc|stop_id|arrival_time|departure_time|   next_stop|trip_duration|
+--------------------+----------+-------+------------+--------------+------------+-------------+
|256.TA.26-10-B-j1...|    S-Bahn|8503090|         804|           804|8503088:0:22|          3.0|
|256.TA.26-10-B-j1...|    S-Bahn|8503051|         801|           802|     8503090|          2.0|
|256.TA.26-10-B-j1...|    S-Bahn|8503052|         800|           800|     8503051|          1.0|
|256.TA.26-10-B-j1...|    S-Bahn|8503053|         799|           799|     8503052|          1.0|
|256.TA.26-10-B-j1...|    S-Bahn|8503054|         797|           797|     8503053|          2.0|
|256.TA.26-10-B-j1...|    S-Bahn|8503055|         792|           793|     8503054|          4.0|
|256.TA.26-10-B-j1...|    S-Bahn|8503056|         790|           790|     8503055|          2.0|
|256.TA.26-10-B-j1...|    S-Ba

### Add transportation type and mean, std:

Create a dictionary for transportation types:

In [6]:
translate_route_desc = {
    'TGV': 'TGV',
    'Eurocity': 'EC',
    'tandseilbahn': 'AT',
    'Regionalzug': 'R',
    'RegioExpress': 'RE',
    'S-Bahn': 'S',
    'Luftseilbahn': '',
    'Sesselbahn': '',
    'Taxi': '',
    'Fähre': '',
    'Tram': 'Tram',
    'ICE': 'ICE',
    'Bus': 'Bus',
    'Gondelbahn': '',
    'Nacht-Zug': '',
    'Standseilbahn': 'AT',
    'Auoreisezug': 'ARZ',
    'Eurostar': 'EC',
    'Schiff': '',
    'Schnellzug': 'TGV',
    'Intercity': 'IC',
    'InterRegio': 'IR',
    'Extrazug': 'EXT',
    'Metro': 'Metro'
}


In [7]:
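@udf("string")
def translate_dict(text):
    # Map a timetable route_desc to the transport code used in the delay data;
    # unknown or missing descriptions yield None instead of raising a KeyError.
    return translate_route_desc.get(text)

@udf('string')
def truncate_stop_id_column(s):
    # Keep the part before the first ':', e.g. '8503088:0:22' -> '8503088'.
    return s.split(':')[0] if s is not None else None

@udf('string')
def truncate_stop_id_len(s):
    # Keep the first 7 characters of the stop id.
    return str(s)[:7]

@udf('long')
def leng(s):
    # Length of the value's string representation.
    return len(str(s))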


In [8]:
edges_with_route = edges_with_route.withColumn('route_desc_translated', translate_dict(col('route_desc')))\
                                       .withColumn('hour', (col('arrival_time')/60).cast(IntegerType()))\
                                       .withColumn('truncated_stop_id', truncate_stop_id_column(col('stop_id'))).cache()


In [9]:
# Divide every delay statistic by 60 and rename the columns used in the join.
stat_cols = ['mean', 'std'] + ['p_{}'.format(p) for p in range(90, 100)]
delays = delays.select([(col(c)/60).alias(c) for c in stat_cols] +
                       [col('hour').alias('hour_2'), col('stop_id').alias('stop_id_2'), col('verkehrsmittel_text')])\
               .withColumn('truncated_stop_id', truncate_stop_id_len(col('stop_id_2')))


In [10]:
edges_final = edges_with_route.join(delays, (edges_with_route.hour == delays.hour_2) &\
                                            (edges_with_route.truncated_stop_id == delays.truncated_stop_id) &\
                                            (edges_with_route.route_desc_translated == delays.verkehrsmittel_text), how='left')
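
The join above matches an edge to a delay row on three keys: the arrival hour, the truncated stop id, and the translated transport type. The following is a minimal plain-Python sketch of how those keys are derived on each side, written without Spark so the logic is easy to check. The helper names `edge_join_key` and `delay_join_key` and the small `route_codes` excerpt are illustrative assumptions, not part of the notebook; `arrival_time` is assumed to be in minutes since midnight.

```python
# Hypothetical helpers mirroring the key derivation used for the join.
# route_codes is a small excerpt of translate_route_desc, for illustration only.
route_codes = {'S-Bahn': 'S', 'Bus': 'Bus', 'Tram': 'Tram'}

def edge_join_key(arrival_time, stop_id, route_desc):
    """Key for a timetable edge (arrival_time assumed to be minutes since midnight)."""
    hour = int(arrival_time / 60.0)        # same truncation as the cast to IntegerType
    station = stop_id.split(':')[0]        # '8503088:0:22' -> '8503088'
    return (hour, station, route_codes.get(route_desc))

def delay_join_key(hour, stop_id, verkehrsmittel_text):
    """Key for a row of the delay statistics."""
    station = str(stop_id)[:7]             # keep the first 7 characters
    return (hour, station, verkehrsmittel_text)

edge_key = edge_join_key(804, '8503088:0:22', 'S-Bahn')
delay_key = delay_join_key(13, 8503088, 'S')
print(edge_key == delay_key)
```

Because the join is a left join, an edge whose key has no counterpart in the delay data is kept, with null statistics.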


Create the edges dataframe with the following information:
 - trip_id
 - stop_id
 - train_type
 - arrival_time
 - departure_time
 - next_stop
 - trip_duration
 - mean
 - std
 - p_90 to p_99 (delay percentiles)

Starting from the original edges dataframe (from `edges.orc`), we now add the mean, std, percentiles and train type information.

In [11]:
edges_final = edges_final.select('trip_id', 'stop_id', col('route_desc').alias('train_type'),
                                 'arrival_time', 'departure_time', 'next_stop', 'trip_duration', 'mean', 'std',
                                 'p_90', 'p_91', 'p_92', 'p_93', 'p_94', 'p_95', 'p_96', 'p_97', 'p_98', 'p_99').cache()


In [12]:
print('Replacing ->')
edges_df.show(5)
print('With ->')
edges_final.filter(~col('mean').isNull()).show(5)



#### Check how many edges have no mean or std information:

In [13]:
(edges_final.filter(col('mean').isNull() | col('std').isNull()).count() 
 / float(edges_final.filter(~col('mean').isNull() & ~col('std').isNull()).count()))


46.56990280892876

In [16]:
print('Proportion of null values:\n\tMean: {:.3f}'
      .format(edges_final.filter(col('mean').isNull()).count() / float(edges_final.count())))
print('\tStd: {:.3f}'.format(edges_final.filter(col('std').isNull()).count() / float(edges_final.count())))


Proportion of null values:
	Mean: 0.979
	Std: 0.979

Notice that most edges (about 97.9%) have no statistical information about the mean and std. For those edges, we will replace the mean by the duration of the trip and the std by 0 in further computations.
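
The fallback for missing values can be sketched in plain Python as follows. This is only an illustration, assuming an edge is available as a dictionary with `mean`, `std` and `trip_duration` keys; the function name `fill_missing_stats` is hypothetical and does not appear in the notebook.

```python
def fill_missing_stats(edge):
    """Return a copy of the edge with missing mean/std replaced by fallbacks."""
    filled = dict(edge)
    if filled.get('mean') is None:
        # No delay statistics for this edge: fall back to the trip duration.
        filled['mean'] = filled['trip_duration']
    if filled.get('std') is None:
        # No spread information: treat the edge as deterministic.
        filled['std'] = 0.0
    return filled

with_stats = fill_missing_stats({'trip_duration': 3.0, 'mean': 1.2, 'std': 0.4})
without_stats = fill_missing_stats({'trip_duration': 3.0, 'mean': None, 'std': None})
```

On the Spark dataframe itself, the same effect could be obtained with `pyspark.sql.functions.coalesce`, for example `coalesce(col('mean'), col('trip_duration'))`.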

#### Write the edges to orc:

In [15]:
edges_final.write.format("orc").mode('overwrite').save("/user/{}/edges_with_mean_and_std_sec.orc".format(username))


In [21]:
edges_final.filter(col('mean').isNull()).count()


1308102

In [22]:
edges_final.count()


1336191
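
The two counts above are enough to reproduce both figures reported earlier: the proportion of null values (nulls over all edges) and the ratio of edges without statistics to edges with statistics. A small check in plain Python, assuming that `mean` and `std` are always null together (which the matching numbers suggest):

```python
null_count = 1308102   # edges where mean is null
total_count = 1336191  # all edges

non_null_count = total_count - null_count
null_proportion = null_count / float(total_count)
null_ratio = null_count / float(non_null_count)

print('Proportion of nulls: {:.3f}'.format(null_proportion))
print('Null / non-null ratio: {:.2f}'.format(null_ratio))
```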