## Environment setup

This section imports all the modules that are required to run this notebook.

In [1]:
from pyspark.sql import SparkSession, SQLContext
from pyspark.sql.functions import col, explode, json_tuple, regexp_replace, udf
from pyspark.sql.functions import sum as col_sum
from pyspark.sql.types import StructField, StructType, StringType, IntegerType, ArrayType, FloatType
import pyspark
from pyspark import SparkConf

from math import radians, cos, sin, asin, sqrt

import re
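import os

# Reuse the active Spark session, or create one if none exists yet
spark = SparkSession.builder.getOrCreate()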

## Input parameter
The only input parameter of this notebook is 'location': a geocoordinate given as [latitude, longitude].
The notebook calculates the stop that is closest to this location.

If you want to change this value, only the last two code sections need to be rerun (the section where the udf is created and the section that displays the result).

In [None]:
# Location to find the closest stop for
location = [51, 4]

## Stops
In this section, the stops txt file is converted into a usable dataframe.
The output of this section is the first 20 entries of the resulting dataframe.
The village number, the entity number and the links are irrelevant here, so they are removed from the dataframe.
To be consistent with the rest of this notebook, the Dutch column names are translated to English.

In [2]:
# Read in the file (as json) and convert to a single column
# The 'haltes' column gets renamed to 'stops'
stops = spark.read.json("data/stops.txt")
stops = stops.select((explode("haltes").alias("stops")))

# Map each entry in the dataframe to its own stop
stops = stops.select('stops').rdd.map(lambda x: x.stops).toDF()
# Drop unnecessary data
stops = stops.drop('links', 'gemeentenummer', 'entiteitnummer')

# Rename the Dutch column names to English
stops = stops \
            .withColumnRenamed('haltenummer', 'stop_number') \
            .withColumnRenamed('omschrijving', 'desc') \
            .withColumnRenamed('geoCoordinaat', 'coord') \
            .withColumnRenamed('omschrijvingGemeente', 'village')

# Show the first 20 entries of the dataframe
stops.show()


+--------------------+-----------+--------------------+---------+
|               coord|stop_number|                desc|  village|
+--------------------+-----------+--------------------+---------+
|[51.1638893702134...|     101000| A. Chantrainestraat|  Wilrijk|
|[51.2062496902375...|     101001|           Zurenborg|Antwerpen|
|[51.1660665941742...|     101002|Verenigde Natieslaan|  Hoboken|
|[51.1660216374063...|     101003|Verenigde Natieslaan|  Hoboken|
|[51.1740548394127...|     101004|     D. Baginierlaan|  Hoboken|
|[51.1630084393468...|     101005| A. Chantrainestraat|  Wilrijk|
|[51.1597748887066...|     101006|      Fotografielaan|  Wilrijk|
|[51.1599636330007...|     101007|      Fotografielaan|  Wilrijk|
|[51.1629556669243...|     101008|            Moerelei|  Wilrijk|
|[51.1634592883462...|     101009|            Moerelei|  Wilrijk|
|[51.1887431659368...|     101010|        J. De Voslei|Antwerpen|
|[51.1829725415369...|     101011|   Middelheim Vijver|Antwerpen|
|[51.16220

## Haversine function
A function to calculate the great-circle distance (in kilometers) between two geocoordinates. <br>
Found at https://stackoverflow.com/questions/4913349/haversine-formula-in-python-bearing-and-distance-between-two-gps-points

In [3]:
def haversine(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points 
    on the earth (specified in decimal degrees)
    """
    # convert decimal degrees to radians 
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])

    # haversine formula 
    dlon = lon2 - lon1 
    dlat = lat2 - lat1 
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a)) 
    r = 6371 # Radius of earth in kilometers. Use 3956 for miles
    return c * r
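
As a quick sanity check of the formula, the sketch below repeats the function so that it runs on its own and compares one degree of latitude with the expected arc length r * pi / 180 (about 111.19 km for r = 6371). The test coordinates are illustrative and are not taken from the stops data.

```python
from math import radians, cos, sin, asin, sqrt, pi

def haversine(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometers between two points in decimal degrees."""
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * asin(sqrt(a)) * 6371

# Identical points are zero kilometers apart
same_point = haversine(4.0, 51.0, 4.0, 51.0)

# Moving one degree along a meridian covers r * (pi / 180) kilometers
one_degree = haversine(4.0, 51.0, 4.0, 52.0)
expected = 6371 * pi / 180

print(same_point)
print(round(one_degree, 2), round(expected, 2))
```

Note the argument order: the function takes longitude first, then latitude, for both points.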

## Calculate distance function

In this section, a user-defined function (udf) is created to calculate the distance from the specified geocoordinate to a stop. The coordinate can be specified in the input parameter section.
The udf uses the haversine function from above to calculate the distance between the two geocoordinates.

In [4]:
@udf(returnType=FloatType())
def get_distance(coord):
    """
    This UDF calculates the distance from the specified location to a stop.
    @param coord: the geocoordinate of the stop (latitude first, then longitude)
    @return: the distance in kilometers between the location and the stop
    """
    lat1 = location[0]
    lon1 = location[1]
    
    lat2 = coord[0]
    lon2 = coord[1]
    
    return haversine(lon1, lat1, lon2, lat2)
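
The udf above reads 'location' from a global variable, which is why the udf has to be recreated when the location changes. A possible alternative, shown here as a plain-Python sketch, is to bind the location explicitly through a small factory function. The names `make_distance_fn` and `distance_to_location` are hypothetical and do not appear elsewhere in this notebook; the haversine function is repeated so the sketch runs on its own.

```python
from math import radians, cos, sin, asin, sqrt

def haversine(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometers between two points in decimal degrees."""
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * asin(sqrt(a)) * 6371

def make_distance_fn(location):
    """Hypothetical helper: return a function measuring the distance from `location`."""
    lat1, lon1 = location[0], location[1]

    def distance_to(coord):
        # coord holds the latitude first, then the longitude
        lat2, lon2 = coord[0], coord[1]
        return haversine(lon1, lat1, lon2, lat2)

    return distance_to

# Bind the location once, then reuse the resulting function for every stop
distance_to_location = make_distance_fn([51, 4])
at_location = distance_to_location((51.0, 4.0))
one_degree_north = distance_to_location((52.0, 4.0))

print(at_location)
print(round(one_degree_north, 2))
```

In the notebook, the returned function could presumably be wrapped with `udf(make_distance_fn(location), FloatType())`, so that a new location only requires building a new udf.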


## Finding the closest stop
In order to find the closest stop to the given coordinate, a new column 'distances' is added to the stops dataframe.
To calculate this distance, the udf created in the previous section is used.

The table now contains each stop together with its distance (in kilometers) to the specified coordinate. To find the closest stop, the table only needs to be sorted on the distances in ascending order. The closest stop is then the first entry of the dataframe.

In [7]:
# Add the distance column
stops = stops.withColumn('distances', get_distance('coord'))
# Sort on distances in ascending order
stops = stops.sort(col('distances').asc())
# Only show the first entry as this is the closest stop
stops.show(1)

+--------------------+-----------+--------------------+-----------+----------+
|               coord|stop_number|                desc|    village| distances|
+--------------------+-----------+--------------------+-----------+----------+
|[51.0023546374898...|     207399|Steenweg Naar Wet...|Schoonaarde|0.35383633|
+--------------------+-----------+--------------------+-----------+----------+
only showing top 1 row
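
The selection logic can also be illustrated without Spark. The sketch below is plain Python and picks the stop with the smallest distance in a single pass using `min`, instead of sorting everything first. The three sample stops and their coordinates are hypothetical and are not taken from the stops file; the haversine function is repeated so the sketch runs on its own.

```python
from math import radians, cos, sin, asin, sqrt

def haversine(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometers between two points in decimal degrees."""
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * asin(sqrt(a)) * 6371

# Hypothetical sample data: coord holds (latitude, longitude)
sample_stops = [
    {"stop_number": 1, "desc": "Stop A", "coord": (51.20, 4.40)},
    {"stop_number": 2, "desc": "Stop B", "coord": (51.01, 4.02)},
    {"stop_number": 3, "desc": "Stop C", "coord": (50.85, 4.35)},
]
location = (51, 4)

def distance_from_location(stop):
    """Distance in kilometers between `location` and the given stop."""
    lat, lon = stop["coord"]
    return haversine(location[1], location[0], lon, lat)

# The closest stop is the one with the minimal distance
closest = min(sample_stops, key=distance_from_location)
print(closest["desc"], round(distance_from_location(closest), 2))
```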

