Setting up the Spark environment and downloading the [dataset](https://data.tii.ie/Datasets/TrafficCountData/2020/01/31/per-vehicle-records-2020-01-31.csv).

In [1]:
!pip install wget
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://archive.apache.org/dist/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz
!tar xf spark-2.4.0-bin-hadoop2.7.tgz
!pip install -q findspark
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.0-bin-hadoop2.7"

import wget
link_to_data = 'https://data.tii.ie/Datasets/TrafficCountData/2020/01/31/per-vehicle-records-2020-01-31.csv'
DataSet = wget.download(link_to_data)

Collecting wget
  Downloading https://files.pythonhosted.org/packages/47/6a/62e288da7bcda82b935ff0c6cfe542970f04e29c756b0e147251b2fb251f/wget-3.2.zip
Building wheels for collected packages: wget
  Building wheel for wget (setup.py) ... done
  Created wheel for wget: filename=wget-3.2-cp36-none-any.whl size=9682 sha256=dbcd044e8d906ec859d60f82c7d6c965590edeef0e6589f0ef1b62feafd401e9
  Stored in directory: /root/.cache/pip/wheels/40/15/30/7d8f7cea2902b4db79e3fea550d7d7b85ecb27ef992b618f3f
Successfully built wget
Installing collected packages: wget
Successfully installed wget-3.2


Checking the contents of the local directory

In [2]:
!ls

per-vehicle-records-2020-01-31.csv  spark-2.4.0-bin-hadoop2.7
sample_data			    spark-2.4.0-bin-hadoop2.7.tgz


Initializing Spark and loading the data into a Spark DataFrame

In [3]:
import findspark
findspark.init()
from pyspark.sql import SparkSession
import pyspark.sql.functions as f
from pyspark.sql.window import Window

spark = SparkSession.builder.appName('Assignment-1').getOrCreate()

# loading the data
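vehicle_counter_DF = spark.read.csv(
    './per-vehicle-records-2020-01-31.csv',
    inferSchema=True,  # let Spark guess column types (costs an extra pass over the file)
    header=True        # the first line holds the column names
)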

Checking the type of `vehicle_counter_DF`

In [4]:
type(vehicle_counter_DF)

pyspark.sql.dataframe.DataFrame

Counting the number of lines/records in `vehicle_counter_DF`

In [5]:
vehicle_counter_DF.count()

4740861

Taking a look at the first 5 rows of `vehicle_counter_DF`

In [6]:
vehicle_counter_DF.show(5)

+-----+----+-----+---+----+------+------+-----------+-----------+----+--------+------------+----------------+-----+---------+------+-------+----+-----+------+-----------+--------+------------+-------------+-----------+------------+
|cosit|year|month|day|hour|minute|second|millisecond|minuteofday|lane|lanename|straddlelane|straddlelanename|class|classname|length|headway| gap|speed|weight|temperature|duration|validitycode|numberofaxles|axleweights|axlespacings|
+-----+----+-----+---+----+------+------+-----------+-----------+----+--------+------------+----------------+-----+---------+------+-------+----+-----+------+-----------+--------+------------+-------------+-----------+------------+
|  997|2020|    1| 31|   1|    45|     1|          0|        105|   2|   Test2|           0|            null|    5|  HGV_RIG|  11.2|   3.55|3.83| 69.0|   0.0|        0.0|       0|           0|            0|       null|        null|
|  997|2020|    1| 31|   1|    45|     3|          0|        105|   1|  

Checking the DataFrame Schema

In [7]:
vehicle_counter_DF.printSchema()

root
 |-- cosit: integer (nullable = true)
 |-- year: integer (nullable = true)
 |-- month: integer (nullable = true)
 |-- day: integer (nullable = true)
 |-- hour: integer (nullable = true)
 |-- minute: integer (nullable = true)
 |-- second: integer (nullable = true)
 |-- millisecond: integer (nullable = true)
 |-- minuteofday: integer (nullable = true)
 |-- lane: integer (nullable = true)
 |-- lanename: string (nullable = true)
 |-- straddlelane: integer (nullable = true)
 |-- straddlelanename: string (nullable = true)
 |-- class: integer (nullable = true)
 |-- classname: string (nullable = true)
 |-- length: double (nullable = true)
 |-- headway: double (nullable = true)
 |-- gap: double (nullable = true)
 |-- speed: double (nullable = true)
 |-- weight: double (nullable = true)
 |-- temperature: double (nullable = true)
 |-- duration: integer (nullable = true)
 |-- validitycode: integer (nullable = true)
 |-- numberofaxles: integer (nullable = true)
 |-- axleweights: string (nullab

# Question 1

Calculate the usage of the Irish road network as a percentage, grouped by vehicle category.

In [8]:
vehicle_counter_DF\
.groupBy('classname')\
.count()\
.withColumn('percent', 
            f.col('count')*100/f.sum('count').over(Window.partitionBy()))\
.select(['classname', 'percent'])\
.orderBy('percent', ascending=False)\
.show()

+---------+--------------------+
|classname|             percent|
+---------+--------------------+
|      CAR|   80.25858594040197|
|      LGV|  11.194464465420944|
|  HGV_ART|   4.397450167807071|
|  HGV_RIG|  2.7310861887745705|
|      BUS|  0.6871114761643508|
|  CARAVAN| 0.42912036442325563|
|    MBIKE| 0.29486205142905475|
|     null|0.007319345578788326|
+---------+--------------------+
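
As a sanity check on the window expression, the same share (count × 100 / total) can be worked out on a handful of rows in plain Python. This is only a sketch: the `sample` list is made up and `percent_by_class` is a hypothetical helper, not part of the notebook.

```python
from collections import Counter

def percent_by_class(classnames):
    """Return {classname: percent of all records}, largest share first."""
    counts = Counter(classnames)
    total = sum(counts.values())
    return {name: n * 100 / total for name, n in counts.most_common()}

# Made-up sample; None stands in for the null classname seen in the output above.
sample = ["CAR", "CAR", "CAR", "LGV", None]
shares = percent_by_class(sample)
print(shares)
```

The shares always add up to 100, which is a quick way to confirm that no group was dropped.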



# Question 2

Calculate the highest and lowest hourly flows on M50 - show the hours and total number of vehicle counts.

## Highest hourly flow

In [9]:
vehicle_counter_DF\
.groupBy('hour')\
.count()\
.orderBy('count', ascending=False)\
.show(1)

+----+------+
|hour| count|
+----+------+
|  16|385850|
+----+------+
only showing top 1 row



## Lowest hourly flow

In [10]:
vehicle_counter_DF\
.groupBy('hour')\
.count()\
.orderBy('count', ascending=True)\
.show(1)

+----+-----+
|hour|count|
+----+-----+
|   2|13682|
+----+-----+
only showing top 1 row
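
Both queries above count records from every counter site in the file. If the flows should cover M50 sites only, one option is to keep only rows whose `cosit` belongs to the M50 before counting, and to read both extremes from the same hourly totals. The sketch below shows the idea in plain Python. The ids in `M50_COSITS` are placeholders rather than real TII identifiers, and `hourly_extremes` is a hypothetical helper.

```python
from collections import Counter

# Placeholder ids: replace with the COSITs that actually sit on the M50.
M50_COSITS = {111, 222}

def hourly_extremes(records, cosits):
    """records: iterable of (cosit, hour) pairs.

    Returns ((busiest_hour, count), (quietest_hour, count)) for the given sites.
    Hours with no records at all do not appear in the totals.
    """
    counts = Counter(hour for cosit, hour in records if cosit in cosits)
    if not counts:
        raise ValueError("no records for the given sites")
    busiest = max(counts.items(), key=lambda kv: kv[1])
    quietest = min(counts.items(), key=lambda kv: kv[1])
    return busiest, quietest

# Made-up rows; site 333 is deliberately outside the placeholder set.
sample = [(111, 16), (111, 16), (222, 16), (111, 2), (333, 2), (222, 8), (222, 8)]
busiest, quietest = hourly_extremes(sample, M50_COSITS)
print(busiest, quietest)
```

In the notebook the analogous step would presumably be a `.filter(f.col('cosit').isin(...))` placed before `.groupBy('hour')`, once the real M50 site ids are known.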



# Question 3

Calculate the evening and morning rush hours on M50 - show the hours and the total counts.

## Morning hours

From 04:00 to 12:00 (noon), i.e. hours 4 to 11 inclusive

In [11]:
vehicle_counter_DF\
.groupBy('hour')\
.count()\
.orderBy('hour', ascending=True)\
.filter((f.col('hour') < 12) & (f.col('hour') >= 4))\
.show()

+----+------+
|hour| count|
+----+------+
|   4| 27187|
|   5| 61937|
|   6|198369|
|   7|299784|
|   8|352862|
|   9|277509|
|  10|256183|
|  11|246847|
+----+------+



## Evening hours

From 16:00 to 20:00, i.e. hours 16 to 19 inclusive

In [12]:
vehicle_counter_DF\
.groupBy('hour')\
.count()\
.orderBy('hour', ascending=True)\
.filter((f.col('hour') < 20) & (f.col('hour') >= 16))\
.show()

+----+------+
|hour| count|
+----+------+
|  16|385850|
|  17|367269|
|  18|314085|
|  19|232409|
+----+------+
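
The two tables list every hour in each window. If "rush hour" is read as the single busiest hour within a window (an assumption about the question), it can be picked out directly from those counts. The sketch below uses the figures from the two outputs above; `peak_hour` is a hypothetical helper.

```python
# Hourly counts copied from the morning and evening outputs above.
morning = {4: 27187, 5: 61937, 6: 198369, 7: 299784,
           8: 352862, 9: 277509, 10: 256183, 11: 246847}
evening = {16: 385850, 17: 367269, 18: 314085, 19: 232409}

def peak_hour(counts):
    """Return (hour, count) for the busiest hour in the window."""
    hour = max(counts, key=counts.get)
    return hour, counts[hour]

print(peak_hour(morning))  # (8, 352862)
print(peak_hour(evening))  # (16, 385850)
```

On this reading the morning peak is hour 8 and the evening peak is hour 16, which matches the overall maximum found in Question 2.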



# Question 4

Calculate the average speed between each pair of junctions on the M50 (e.g., junction 1 - junction 2, junction 2 - junction 3, etc.).

In [13]:
vehicle_counter_DF\
.groupBy('lanename')\
.agg({'speed': 'mean'})\
.orderBy('avg(speed)', ascending=False)\
.show()

+--------------------+------------------+
|            lanename|        avg(speed)|
+--------------------+------------------+
| Southbound 1 (slow)| 135.4469130170314|
|       Northbound 2 |122.31002458344715|
|        Eastbound  2|114.68716172331673|
|  Eastbound 2 (fast)|113.59000942507083|
| Southbound 2 (fast)|111.72458022387893|
|  Westbound 2 (fast)|111.34068965517257|
| Northbound 2 (fast)|110.21109738884894|
|          southbound| 104.4090909090909|
|         Nortbound 1|103.95987028779895|
| Northbound 1 (slow)|103.63843987902779|
|  Westbound 3 (fast)|103.47554310278697|
|  Eastbound 3 (fast)|100.55896097639352|
|        Southbound 2| 97.79121728990314|
|        Northbound 2| 97.76209841746629|
|  Westbound 2 (slow)| 95.40281196241926|
|Southbound Mainli...| 95.25522388059701|
|         Westbound 2| 93.36724880445983|
|         Eastbound 2| 92.79648071706569|
|   Eastbound on slip|  92.7741935483871|
|        Southbound 1| 92.74038016587762|
+--------------------+------------
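
The output contains several spellings of what look like the same lane: "Northbound 2 " with a trailing space next to "Northbound 2", "Eastbound  2" with a double space, lower-case "southbound", and "Nortbound 1". Each spelling forms its own group, so the averages are split across rows. One way to tidy the names before grouping is sketched below in plain Python. `normalize_lanename` is a hypothetical helper, and treating "Nortbound" as a typo for "Northbound" is an assumption.

```python
import re

# Assumed spelling fix, based on the "Nortbound 1" row in the output above.
SPELLING_FIXES = {"Nortbound": "Northbound"}

def normalize_lanename(name):
    """Trim, collapse repeated spaces, fix known typos, capitalise the first letter."""
    if name is None:
        return None
    cleaned = re.sub(r"\s+", " ", name).strip()
    for wrong, right in SPELLING_FIXES.items():
        cleaned = re.sub(rf"\b{wrong}\b", right, cleaned, flags=re.IGNORECASE)
    return cleaned[:1].upper() + cleaned[1:]

for raw in ["Northbound 2 ", "Eastbound  2", "southbound", "Nortbound 1"]:
    print(repr(raw), "->", repr(normalize_lanename(raw)))
```

Whether the merged names really refer to the same physical lanes depends on the site, so grouping on `cosit` together with the cleaned lane name may be the safer choice.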

# Question 5

Calculate the top 10 locations with the highest HGV counts (by class). Map the COSITs to the names given on the map.

In [14]:
vehicle_counter_DF\
.filter((f.col('classname') == "HGV_ART") | (f.col('classname') == "HGV_RIG"))\
.groupBy('lanename')\
.agg(
    f.mean("cosit").alias('Average cosit'),
    f.count(f.lit(1)).alias('count')
)\
.orderBy('count', ascending=False)\
.show(10)

+------------+------------------+-----+
|    lanename|     Average cosit|count|
+------------+------------------+-----+
|Northbound 1|20693.790824685962|47606|
|Southbound 1| 20370.47938361651|47438|
| Westbound 1| 47984.26970280579|26481|
| Eastbound 1|50842.564949674364|25335|
|  Northbound|11535.248807024242|20956|
|  Southbound| 15721.02364244845|18526|
|Northbound 2| 6737.779673675744|17406|
|Southbound 2| 5250.837445297139|15767|
|   Eastbound|11569.320689406504|13867|
|   Westbound|11157.380619527628|13591|
+------------+------------------+-----+
only showing top 10 rows
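
Averaging `cosit` over a lane name blends the ids of every site that uses that name, so the "Average cosit" values above do not identify a single location. If a location is taken to be one `cosit` (an assumption), the HGV count can be keyed on `cosit` directly. The sketch below shows this in plain Python with made-up rows; `top_hgv_sites` is a hypothetical helper.

```python
from collections import Counter

HGV_CLASSES = {"HGV_ART", "HGV_RIG"}

def top_hgv_sites(records, n=10):
    """records: iterable of (cosit, classname) pairs.

    Returns [(cosit, hgv_count), ...] with the busiest site first.
    """
    counts = Counter(cosit for cosit, classname in records
                     if classname in HGV_CLASSES)
    return counts.most_common(n)

# Made-up rows; the cosit values are illustrative only.
sample = [(997, "HGV_RIG"), (997, "CAR"), (998, "HGV_ART"),
          (998, "HGV_RIG"), (999, "LGV"), (998, "HGV_ART")]
print(top_hgv_sites(sample, n=2))
```

In the notebook the same idea would presumably amount to `.groupBy('cosit')` in place of `.groupBy('lanename')`. Turning each `cosit` into a site name still needs a separate lookup, which the CSV shown here does not appear to include.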

