<h2>Big Data Systems and Architectures - Spark Assignment 2021</h2>
<h3>Exploring International Flights in 2017 Data</h3>

---
> Georgia Vlassi p2822001<br />
> Business Analytics <br />
> Athens University of Economics and Business <br/>

---

The goal of this assignment is to analyse a dataset of international flights in 2017 with Apache Spark and extract insights from it.

The data is available here:
http://andrea.imis.athena-innovation.gr/aueb-master/flights.csv.zip


<h3>Task 1</h3>
Your first task is to calculate the average flight delays in the dataset.

In [1]:
import findspark
findspark.init()

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import SparkSession
from pyspark.sql.dataframe import DataFrame
from pyspark.sql.readwriter import DataFrameReader
from pyspark.sql.types import IntegerType, Row, StringType

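# Import the Spark SQL functions used in the queries below.
# Note: importing round from pyspark.sql.functions shadows Python's built-in round().
import pyspark.sql.functions as F
from pyspark.sql.functions import avg, expr, col, desc, round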

Our first step is to create a SparkSession, the entry point we will use to read the data.

In [2]:
spark = SparkSession.builder.appName("FlightsAssignment").getOrCreate()
spark

With the session initialised, we can load the data. The file '671009038_T_ONTIME_REPORTING.csv' is read with Spark and stored in the flights_data variable. To access the data, download the archive mentioned in the description and unzip it.
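The unzip step can also be done from Python. The following is a minimal sketch using only the standard library; the helper name `extract_csv_files` and the local archive name `flights.csv.zip` (taken from the download URL) are assumptions for illustration, not part of the assignment.

```python
import zipfile
from pathlib import Path


def extract_csv_files(zip_path, dest_dir="."):
    """Extract every .csv member of the archive into dest_dir.

    Hypothetical helper: returns the paths of the extracted files.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            # Skip anything that is not a CSV file (e.g. readme files).
            if name.lower().endswith(".csv"):
                archive.extract(name, dest)
                extracted.append(dest / name)
    return extracted
```

Assuming the archive was saved next to the notebook, calling `extract_csv_files("flights.csv.zip")` should leave the CSV file in the working directory, ready to be read by Spark.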

In [3]:
flights_data = spark.read\
                    .option("header","true")\
                    .option("inferSchema","true")\
                    .csv("671009038_T_ONTIME_REPORTING.csv")

Below we can see the first 10 records of our data.

In [4]:
flights_data.show(10)

+----------+--------+-------+------+----------------+----+------------------+--------+---------+--------+---------+---------+-----------------+--------+-------------+-------------+---------+--------------+-------------------+----+
|   FL_DATE|TAIL_NUM|CARRIER|ORIGIN|ORIGIN_CITY_NAME|DEST|    DEST_CITY_NAME|DEP_TIME|DEP_DELAY|ARR_TIME|ARR_DELAY|CANCELLED|CANCELLATION_CODE|DIVERTED|CARRIER_DELAY|WEATHER_DELAY|NAS_DELAY|SECURITY_DELAY|LATE_AIRCRAFT_DELAY|_c19|
+----------+--------+-------+------+----------------+----+------------------+--------+---------+--------+---------+---------+-----------------+--------+-------------+-------------+---------+--------------+-------------------+----+
|2019-01-01|  N8974C|     9E|   AVL|   Asheville, NC| ATL|       Atlanta, GA|    1658|     -7.0|    1758|    -22.0|      0.0|             null|     0.0|         null|         null|     null|          null|               null|null|
|2019-01-01|  N922XJ|     9E|   JFK|    New York, NY| RDU|Raleigh/Durham, NC

In [5]:
flights_data.printSchema()

root
 |-- FL_DATE: string (nullable = true)
 |-- TAIL_NUM: string (nullable = true)
 |-- CARRIER: string (nullable = true)
 |-- ORIGIN: string (nullable = true)
 |-- ORIGIN_CITY_NAME: string (nullable = true)
 |-- DEST: string (nullable = true)
 |-- DEST_CITY_NAME: string (nullable = true)
 |-- DEP_TIME: integer (nullable = true)
 |-- DEP_DELAY: double (nullable = true)
 |-- ARR_TIME: integer (nullable = true)
 |-- ARR_DELAY: double (nullable = true)
 |-- CANCELLED: double (nullable = true)
 |-- CANCELLATION_CODE: string (nullable = true)
 |-- DIVERTED: double (nullable = true)
 |-- CARRIER_DELAY: double (nullable = true)
 |-- WEATHER_DELAY: double (nullable = true)
 |-- NAS_DELAY: double (nullable = true)
 |-- SECURITY_DELAY: double (nullable = true)
 |-- LATE_AIRCRAFT_DELAY: double (nullable = true)
 |-- _c19: string (nullable = true)



As in SQL, we build a query in Spark to calculate the average flight delays.

Find the average departure delay of flights (the DEP_DELAY column is in minutes). The same query also averages the ARR_DELAY column.

In [6]:
AvgDelay = flights_data\
                .agg(round(avg("DEP_DELAY"), 2).alias("AverageDepartureDelay"),
                     round(avg("ARR_DELAY"), 2).alias("AverageArrivalDelay"))
AvgDelay.show()

+---------------------+-------------------+
|AverageDepartureDelay|AverageArrivalDelay|
+---------------------+-------------------+
|                10.92|               5.41|
+---------------------+-------------------+
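One detail worth keeping in mind when reading these averages: like SQL's AVG, Spark's `avg` skips null values, so rows with a null DEP_DELAY or ARR_DELAY do not contribute to the result. The plain-Python sketch below illustrates that behaviour on made-up sample values; the function name and the sample list are hypothetical and are not taken from the dataset.

```python
from typing import Iterable, Optional


def average_ignoring_nulls(values: Iterable[Optional[float]]) -> Optional[float]:
    """Average the non-null values; return None when there are none."""
    present = [v for v in values if v is not None]
    if not present:
        return None
    return sum(present) / len(present)


# Made-up delays in minutes; None stands for a null DEP_DELAY.
sample_delays = [-7.0, None, 15.0, 30.0, None, 2.0]

# Only the four non-null values are averaged: 40.0 / 4
print(average_ignoring_nulls(sample_delays))
```

If the nulls were instead counted as zero-minute delays, the same sample would average 40.0 / 6, which is noticeably lower. The reported averages therefore describe only flights that actually have a recorded delay.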



The same calculation can be implemented with SQL.

We will first register the DataFrame above as a temporary view.

In [7]:
flights_data.createOrReplaceTempView("flights_data")
flights_data

DataFrame[FL_DATE: string, TAIL_NUM: string, CARRIER: string, ORIGIN: string, ORIGIN_CITY_NAME: string, DEST: string, DEST_CITY_NAME: string, DEP_TIME: int, DEP_DELAY: double, ARR_TIME: int, ARR_DELAY: double, CANCELLED: double, CANCELLATION_CODE: string, DIVERTED: double, CARRIER_DELAY: double, WEATHER_DELAY: double, NAS_DELAY: double, SECURITY_DELAY: double, LATE_AIRCRAFT_DELAY: double, _c19: string]

In [8]:
AvgDelay_SQL = spark.sql("SELECT ROUND(AVG(DEP_DELAY),2) AS AverageDepartureDelay, ROUND(AVG(ARR_DELAY),2) AS AverageArrivalDelay FROM flights_data")
AvgDelay_SQL.show()

+---------------------+-------------------+
|AverageDepartureDelay|AverageArrivalDelay|
+---------------------+-------------------+
|                10.92|               5.41|
+---------------------+-------------------+

