## Environment setup
This section imports every module required to run this notebook and creates the Spark session.
Its output lists the contents of the input directory.
This directory must be called 'data'.

In [1]:
from pyspark.sql import SparkSession, SQLContext
from pyspark.sql.functions import col, explode, json_tuple, regexp_replace
from pyspark.sql.functions import sum as col_sum
from pyspark.sql.types import StructField, StructType, StringType, IntegerType, ArrayType, FloatType
import pyspark
from pyspark import SparkConf

import re
import os

spark = SparkSession.builder.appName("Notebook 1").master("local").getOrCreate()

for dirname, _, filenames in os.walk('data'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

data/citizens2.txt
data/flemish_districs.txt
data/zipcodes.csv
data/citizens.txt
data/stops.txt
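
Because the notebook reads its inputs by relative path, a quick check before any Spark work can catch a missing or misnamed file early. The sketch below is one possible way to do that. The helper name `missing_inputs` is hypothetical, and the list of expected files is simply copied from the directory listing above.

```python
import os

# File names taken from the directory listing above.
EXPECTED_FILES = [
    "citizens.txt",
    "citizens2.txt",
    "flemish_districs.txt",
    "stops.txt",
    "zipcodes.csv",
]


def missing_inputs(data_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in data_dir."""
    return [
        name for name in expected
        if not os.path.isfile(os.path.join(data_dir, name))
    ]


# An empty list means every expected input file was found.
print(missing_inputs("data"))
```

If the directory itself is missing or has a different name, every expected file is reported as missing.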


## Data preparation
This section prepares and cleans the data so that the later sections can use it easily.

#### Stops
In this section, the stops text file is converted into a usable dataframe.
The output of this section is the first 20 entries of the resulting dataframe.
The village number, entity number and links are irrelevant here, so they are removed from the dataframe.
To be consistent with the rest of this notebook, the Dutch column names are translated to English.

In [2]:
# Read in the file (as json) and convert to a single column
# The 'haltes' column gets renamed to 'stops'
stops = spark.read.json("data/stops.txt")
stops = stops.select((explode("haltes").alias("stops")))

# Expand each stop struct into its own top-level columns
stops = stops.select('stops.*')
# Drop unnecessary data
stops = stops.drop('links', 'gemeentenummer', 'entiteitnummer')

# Rename the Dutch column names to English ones
stops = (
    stops
    .withColumnRenamed('haltenummer', 'stop_number')
    .withColumnRenamed('omschrijving', 'desc')
    .withColumnRenamed('geoCoordinaat', 'coord')
    .withColumnRenamed('omschrijvingGemeente', 'village')
)

# Show the first 20 entries of the dataframe
stops.show()


+--------------------+-----------+--------------------+---------+
|               coord|stop_number|                desc|  village|
+--------------------+-----------+--------------------+---------+
|[51.1638893702134...|     101000| A. Chantrainestraat|  Wilrijk|
|[51.2062496902375...|     101001|           Zurenborg|Antwerpen|
|[51.1660665941742...|     101002|Verenigde Natieslaan|  Hoboken|
|[51.1660216374063...|     101003|Verenigde Natieslaan|  Hoboken|
|[51.1740548394127...|     101004|     D. Baginierlaan|  Hoboken|
|[51.1630084393468...|     101005| A. Chantrainestraat|  Wilrijk|
|[51.1597748887066...|     101006|      Fotografielaan|  Wilrijk|
|[51.1599636330007...|     101007|      Fotografielaan|  Wilrijk|
|[51.1629556669243...|     101008|            Moerelei|  Wilrijk|
|[51.1634592883462...|     101009|            Moerelei|  Wilrijk|
|[51.1887431659368...|     101010|        J. De Voslei|Antwerpen|
|[51.1829725415369...|     101011|   Middelheim Vijver|Antwerpen|
|[51.16220
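
To make the reshaping steps concrete without needing a Spark session, here is a plain-Python sketch of the same idea on a single made-up record: take each element of `haltes`, drop the three unused fields, and rename the rest. The sample values and the shape of `geoCoordinaat` are assumptions, and `flatten_stops` is a hypothetical helper that is not part of the notebook.

```python
import json

# Hypothetical sample mimicking the assumed layout of data/stops.txt.
# All values are made up for illustration only.
raw = json.dumps({
    "haltes": [
        {
            "haltenummer": "000001",
            "omschrijving": "Example stop",
            "omschrijvingGemeente": "Example village",
            "geoCoordinaat": {"latitude": 51.0, "longitude": 4.0},
            "gemeentenummer": 1,
            "entiteitnummer": "1",
            "links": [],
        }
    ]
})

# Fields removed and renamed in the notebook cell above.
DROPPED = {"links", "gemeentenummer", "entiteitnummer"}
RENAMED = {
    "haltenummer": "stop_number",
    "omschrijving": "desc",
    "geoCoordinaat": "coord",
    "omschrijvingGemeente": "village",
}


def flatten_stops(text):
    """Return one dict per stop, without the dropped keys and with English names."""
    rows = []
    for stop in json.loads(text)["haltes"]:  # one row per array element ("explode")
        rows.append(
            {RENAMED.get(key, key): value
             for key, value in stop.items() if key not in DROPPED}
        )
    return rows


rows = flatten_stops(raw)
print(rows)
```

Each resulting dictionary has exactly the four keys that appear as columns in the dataframe: `coord`, `stop_number`, `desc` and `village`.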

#### Citizens
In this section, the citizens file is read and converted to a PySpark dataframe.
This dataframe makes calculations on the data easier.
First, the unnecessary headers are removed from the data.
The French equivalents of the town names are not needed and are therefore removed.
Empty lines and excess whitespace are removed as well.
Numbers larger than 999 in this file contain a '.' as a thousands separator.
This '.' is removed as well.
The output of this section is the first 20 entries of the resulting dataframe and the schema used by the dataframe.

In [3]:
# Read the raw text file as an RDD of lines via the session's SparkContext
citizens = spark.sparkContext.textFile("data/citizens.txt")

# Remove unnecessary headers
citizens = citizens.map(lambda x: re.sub(
    r'^KONINKRIJK.*|^BRUSSELS.*|^ARR.*|^ARRONDISSEMENT.*|^PROVINC.*|^VLAAMS.*|^REGION.*', '', x))

# Replace 'village / village-in-french' with 'village'
citizens = citizens.map(lambda x: re.sub(r'/ .* ', '', x))

# Remove everything in between parentheses
citizens = citizens.map(lambda x: re.sub(r'\(.*\)', '', x))

# Collapse runs of whitespace (except tabs and newlines) into a single space
citizens = citizens.map(lambda x: re.sub(r"[^\S\n\t]+", ' ', x))

# Remove empty lines
citizens = citizens.filter(lambda x: x != '')

# Remove '.' from the numbers
citizens = citizens.map(lambda x: re.sub(r'\.', '', x))

# Split on the last space only: gives an RDD of [village, amount] pairs
citizens = citizens.map(lambda x: x.rsplit(' ', 1))

# Create schema for dataframe. Citizens must be a StringType for now to avoid conversion errors.
s = StructType([
    StructField('village', StringType(), False),
    StructField('citizens', StringType(), False)
])

# Create the dataframe from the RDD using the schema above
citizens = citizens.toDF(schema=s)

# Cast the citizens column to integers
citizens = citizens.withColumn("citizens", citizens['citizens'].cast(IntegerType()))

# Show the dataframe and the schema
citizens.show()
citizens.printSchema()


+--------------------+--------+
|             village|citizens|
+--------------------+--------+
|          Anderlecht|  117724|
|             Brussel|  177112|
|              Elsene|   86336|
|           Etterbeek|   47410|
|               Evere|   41016|
|           Ganshoren|   24794|
|               Jette|   52144|
|          Koekelberg|   21765|
|            Oudergem|   33725|
|          Schaarbeek|  132097|
| Sint‐Agatha‐Berchem|   24831|
|         Sint‐Gillis|   49361|
| Sint‐Jans‐Molenbeek|   95455|
| Sint‐Joost‐ten‐Node|   26813|
|Sint‐Lambrechts‐W...|   56212|
| Sint‐Pieters‐Woluwe|   41513|
|               Ukkel|   82038|
|               Vorst|   55694|
| Watermaal‐Bosvoorde|   25001|
|          Aartselaar|   14298|
+--------------------+--------+
only showing top 20 rows

root
 |-- village: string (nullable = false)
 |-- citizens: integer (nullable = true)
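
The cleaning steps can be tried on individual lines without Spark. The sketch below applies the same regular expressions, in the same order, to a few sample lines. The input lines are hypothetical (the real layout of `citizens.txt` may differ), and `clean_line` is an illustrative helper rather than part of the notebook.

```python
import re

# Same header pattern as in the notebook cell above.
HEADER = re.compile(
    r'^KONINKRIJK.*|^BRUSSELS.*|^ARR.*|^ARRONDISSEMENT.*|^PROVINC.*|^VLAAMS.*|^REGION.*'
)


def clean_line(line):
    """Apply the cleaning steps to one line; return (village, citizens) or None."""
    line = HEADER.sub('', line)                # blank out header lines
    line = re.sub(r'/ .* ', '', line)          # drop the French name
    line = re.sub(r'\(.*\)', '', line)         # drop text in parentheses
    line = re.sub(r'[^\S\n\t]+', ' ', line)    # collapse runs of spaces
    if line == '':
        return None                            # empty lines are filtered out
    line = line.replace('.', '')               # remove thousands separators
    village, amount = line.rsplit(' ', 1)      # split on the last space only
    return village, int(amount)


# Hypothetical input lines, assumed for illustration.
samples = [
    "PROVINCIE ANTWERPEN",
    "Brussel / Bruxelles 177.112",
    "Aartselaar 14.298",
    "",
]
cleaned = [row for row in map(clean_line, samples) if row is not None]
print(cleaned)
```

Splitting on the last space only is what keeps multi-word village names intact, since the count is always the final token on the line.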



## Number of stops per citizen
Finally, the data from the previous sections is combined to calculate the overall number of stops per citizen.

In [4]:
# Count the number of stops
amount_of_stops = stops.count()

# Calculate the sum of the citizens column
amount_of_citizens = citizens.select(col_sum('citizens')).collect()[0][0]

print("There are {} stops per citizen".format(amount_of_stops / amount_of_citizens))

There are 0.0031509838967026657 stops per citizen
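
A ratio this small is hard to read at a glance, so it can help to express it per 1000 citizens or as its inverse. The snippet below is a small sketch that starts from the ratio printed above. The function `stops_per_citizen` and the variable names are illustrative assumptions and do not come from the notebook.

```python
def stops_per_citizen(n_stops, n_citizens):
    """Return the ratio of stops to citizens, guarding against an empty population."""
    if n_citizens <= 0:
        raise ValueError("n_citizens must be positive")
    return n_stops / n_citizens


# Ratio reported in the output above.
ratio = 0.0031509838967026657

stops_per_1000 = ratio * 1000    # stops per 1000 citizens
citizens_per_stop = 1 / ratio    # citizens sharing one stop on average

print(f"{stops_per_1000:.2f} stops per 1000 citizens")
print(f"{citizens_per_stop:.0f} citizens per stop")
```

In other words, the result corresponds to roughly 3.15 stops per 1000 citizens, or about 317 citizens per stop.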
