## Environment setup

This section imports all the modules that are required to run this notebook.

In [13]:
from pyspark.sql import SparkSession, SQLContext
from pyspark.sql.functions import col, explode, json_tuple, regexp_replace, udf
from pyspark.sql.functions import sum as col_sum
from pyspark.sql.types import BooleanType

from math import radians, cos, sin, asin, sqrt
from typing import List

import re
import os

## Input parameters

The input parameters are located in this section. There are two:

- radius: the radius to check for stops (in km)
- latlng: a geocoordinate that acts as the point to search from

If you change these parameters, only the last two code sections need to be rerun (the creation of the udf and the application of the filter).

In [21]:
# Radius to check for (in km)
radius = 1

# Position to check from. The location provided here is the location of the A. Chantrainestraat in Wilrijk.
latlng = [51.16388937021345, 4.392073389160737]

## Read and format the stop data
In this section the stops txt file is converted into a usable dataframe.
The output of this section shows the first 20 entries of the resulting dataframe.
The village number, the entity number and the links are irrelevant here, so they are removed from the dataframe.
To be consistent with the rest of this notebook, the Dutch column names are translated to English.
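
To make the transformation easier to follow without a Spark session, here is a minimal plain-Python sketch of the same steps (unpack the `haltes` list, drop the unused fields, rename the rest). The field names come from this notebook; the sample values, the list layout of `geoCoordinaat` and the helper name `flatten_stops` are assumptions made for illustration only.

```python
import json

# Hypothetical sample mimicking the layout of data/stops.txt (all values are made up)
raw = json.dumps({
    "haltes": [
        {
            "haltenummer": "100001",
            "omschrijving": "Example Street",
            "omschrijvingGemeente": "Example Town",
            "geoCoordinaat": [51.0, 4.0],  # assumed layout: [lat, lng]
            "gemeentenummer": 1,
            "entiteitnummer": "1",
            "links": [],
        },
    ]
})

# Dutch field name -> English column name, as used in this notebook
RENAMES = {
    "haltenummer": "stop_number",
    "omschrijving": "desc",
    "geoCoordinaat": "coord",
    "omschrijvingGemeente": "village",
}


def flatten_stops(text):
    """Return one dict per stop, keeping only the renamed fields."""
    stops = json.loads(text)["haltes"]  # counterpart of explode("haltes")
    return [
        # Fields missing from RENAMES (links, gemeentenummer, ...) are dropped
        {new: stop[old] for old, new in RENAMES.items()}
        for stop in stops
    ]


rows = flatten_stops(raw)
print(rows[0])
```

The Spark version below does the same work on the full file, one row per stop.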

In [15]:
# Reuse the active Spark session, or create one if none exists yet
spark = SparkSession.builder.getOrCreate()

# Read the file (as JSON) and explode the 'haltes' array into one row per stop
# The exploded column is named 'stops'
stops = spark.read.json("data/stops.txt")
stops = stops.select(explode("haltes").alias("stops"))

# Expand the fields of each stop into separate columns
stops = stops.select('stops.*')
# Drop unnecessary data
stops = stops.drop('links', 'gemeentenummer', 'entiteitnummer')

# Rename the columns of the dataframe to English names
stops = (
    stops
    .withColumnRenamed('haltenummer', 'stop_number')
    .withColumnRenamed('omschrijving', 'desc')
    .withColumnRenamed('geoCoordinaat', 'coord')
    .withColumnRenamed('omschrijvingGemeente', 'village')
)

# Show the first 20 entries of the dataframe
stops.show()

+--------------------+-----------+--------------------+---------+
|               coord|stop_number|                desc|  village|
+--------------------+-----------+--------------------+---------+
|[51.1638893702134...|     101000| A. Chantrainestraat|  Wilrijk|
|[51.2062496902375...|     101001|           Zurenborg|Antwerpen|
|[51.1660665941742...|     101002|Verenigde Natieslaan|  Hoboken|
|[51.1660216374063...|     101003|Verenigde Natieslaan|  Hoboken|
|[51.1740548394127...|     101004|     D. Baginierlaan|  Hoboken|
|[51.1630084393468...|     101005| A. Chantrainestraat|  Wilrijk|
|[51.1597748887066...|     101006|      Fotografielaan|  Wilrijk|
|[51.1599636330007...|     101007|      Fotografielaan|  Wilrijk|
|[51.1629556669243...|     101008|            Moerelei|  Wilrijk|
|[51.1634592883462...|     101009|            Moerelei|  Wilrijk|
|[51.1887431659368...|     101010|        J. De Voslei|Antwerpen|
|[51.1829725415369...|     101011|   Middelheim Vijver|Antwerpen|
|[51.16220

## Haversine function
A function to calculate the great-circle distance (in km) between two geocoordinates. <br>
Found at https://stackoverflow.com/questions/4913349/haversine-formula-in-python-bearing-and-distance-between-two-gps-points

In [16]:
def haversine(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points 
    on the earth (specified in decimal degrees)
    """
    # convert decimal degrees to radians 
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])

    # haversine formula 
    dlon = lon2 - lon1 
    dlat = lat2 - lat1 
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a)) 
    r = 6371 # Radius of earth in kilometers. Use 3956 for miles
    return c * r
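
As a quick sanity check of the function above: with an earth radius of 6371 km, one degree of latitude along a meridian should come out at about 111.19 km (2π · 6371 / 360). The sketch below repeats the function so that it runs on its own. The test points are hypothetical and not taken from the stop data. Note the argument order: longitude comes before latitude.

```python
from math import radians, cos, sin, asin, sqrt, pi


def haversine(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between two points given in decimal degrees."""
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * asin(sqrt(a)) * 6371  # 6371 km: radius of the earth


# Two hypothetical points on the same meridian, one degree of latitude apart
one_degree = haversine(0.0, 0.0, 0.0, 1.0)
expected = 2 * pi * 6371 / 360  # length of one degree of a great circle
print(round(one_degree, 2), round(expected, 2))

# The distance from a point to itself is zero
same_point = haversine(4.39, 51.16, 4.39, 51.16)
print(same_point)
```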

## Create filter function and parameters

In this section, a user-defined function (udf) is created to check whether the distance from the specified geocoordinate to a stop is less than or equal to the radius given in the input section.
The udf uses the haversine function from above to calculate the distance between the two geocoordinates.

In [23]:
@udf(returnType=BooleanType())
def in_radius(latlng_col: List):
    '''
    This udf checks whether a coordinate (of a stop) is within a given radius of a specified geocoordinate.
    See the 'Input parameters' section to change the input parameters of this function.
    @param latlng_col: the column of the dataframe that contains a geocoordinate (as a list)
    @return True if the distance to the specified geocoordinate is less than or equal to the specified radius
    '''
    lat1 = latlng_col[0]
    lon1 = latlng_col[1]
    
    lat2 = latlng[0]
    lon2 = latlng[1]
    
    return haversine(lon1, lat1, lon2, lat2) <= radius
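
The udf above reads `latlng` and `radius` from the notebook's global scope. As an alternative sketch, the search point and radius can be bound explicitly through a factory function, which also lets the check be tried without Spark. The names `make_in_radius` and `in_radius_plain` and the `near` and `far` test points are hypothetical; the haversine function is repeated so that the block runs on its own.

```python
from math import radians, cos, sin, asin, sqrt
from typing import Callable, Sequence


def haversine(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between two points given in decimal degrees."""
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * asin(sqrt(a)) * 6371


def make_in_radius(center: Sequence[float],
                   radius_km: float) -> Callable[[Sequence[float]], bool]:
    """Build a predicate testing whether a [lat, lng] pair lies within radius_km of center."""
    center_lat, center_lon = center[0], center[1]

    def in_radius_plain(coord: Sequence[float]) -> bool:
        # coord is [lat, lng]; haversine expects longitude before latitude
        return haversine(coord[1], coord[0], center_lon, center_lat) <= radius_km

    return in_radius_plain


center = [51.16388937021345, 4.392073389160737]  # latlng from the input section
within_1km = make_in_radius(center, 1)

near = [center[0] + 0.005, center[1]]  # hypothetical point, about 0.56 km north
far = [center[0] + 0.02, center[1]]    # hypothetical point, about 2.2 km north
print(within_1km(near), within_1km(far))
```

With Spark available, such a predicate could presumably be wrapped with `udf(make_in_radius(latlng, radius), BooleanType())` and used as a filter in the same way as `in_radius`.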
    

## Filter the data
The udf can be used to filter the data. When it is applied as a filter on the 'coord' column, the distance from each stop to the specified geolocation is calculated. Only the stops that lie within the specified radius are kept in the resulting dataframe.

As an example, the first 100 entries of the dataframe are shown.

In [24]:
stops.filter(in_radius('coord')).drop('coord').show(100)

+-----------+--------------------+-----------+
|stop_number|                desc|    village|
+-----------+--------------------+-----------+
|     101000| A. Chantrainestraat|    Wilrijk|
|     101001|           Zurenborg|  Antwerpen|
|     101002|Verenigde Natieslaan|    Hoboken|
|     101003|Verenigde Natieslaan|    Hoboken|
|     101004|     D. Baginierlaan|    Hoboken|
|     101005| A. Chantrainestraat|    Wilrijk|
|     101006|      Fotografielaan|    Wilrijk|
|     101007|      Fotografielaan|    Wilrijk|
|     101008|            Moerelei|    Wilrijk|
|     101009|            Moerelei|    Wilrijk|
|     101010|        J. De Voslei|  Antwerpen|
|     101011|   Middelheim Vijver|  Antwerpen|
|     101012|          Antarctica|    Wilrijk|
|     101013|          Antarctica|    Wilrijk|
|     101014|     Rozenkransplein|    Wilrijk|
|     101015|     Rozenkransplein|    Wilrijk|
|     101016|   L. Kieboomsstraat|    Wilrijk|
|     101017|   L. Kieboomsstraat|    Wilrijk|
|     101018|