1. Load flight-summary.csv into a DataFrame (be sure to use the correct path):

This data represents the number of flights taking place from origin airports to destination
airports.

In [0]:
flight_summary = spark.read \
    .format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("dbfs:/FileStore/flight_summary.csv")

flight_summary.show(5)

+-----------+--------------------+------------+------------+---------+--------------------+---------+----------+-----+
|origin_code|      origin_airport| origin_city|origin_state|dest_code|        dest_airport|dest_city|dest_state|count|
+-----------+--------------------+------------+------------+---------+--------------------+---------+----------+-----+
|        BQN|Rafael Hernández ...|   Aguadilla|          PR|      MCO|Orlando Internati...|  Orlando|        FL|  441|
|        PHL|Philadelphia Inte...|Philadelphia|          PA|      MCO|Orlando Internati...|  Orlando|        FL| 4869|
|        MCI|Kansas City Inter...| Kansas City|          MO|      IAH|George Bush Inter...|  Houston|        TX| 1698|
|        SPI|Abraham Lincoln C...| Springfield|          IL|      ORD|Chicago O'Hare In...|  Chicago|        IL|  998|
|        SNA|John Wayne Airpor...|   Santa Ana|          CA|      PHX|Phoenix Sky Harbo...|  Phoenix|        AZ| 3846|
+-----------+--------------------+------------+------------+---------+--------------------+---------+----------+-----+

2. Write a query which determines how many unique origin airports are contained in the
data. ( 3 marks)

In [0]:
unique_origins = flight_summary.select("origin_code").distinct().count()
print("Unique origin airports: ", unique_origins)

Unique origin airports:  322


3. Modify query #2 to use the approx_count_distinct function with a margin of error of
10%. (3 marks) Why does this function exist if it is not completely accurate? (2
marks)

In [0]:
from pyspark.sql.functions import approx_count_distinct

# rsd=0.1 sets the maximum relative standard deviation (margin of error) to 10%;
# the default is 0.05.
flight_summary.agg(
    approx_count_distinct("origin_code", rsd=0.1).alias("approx_count")
).show()

# approx_count_distinct exists because computing an exact distinct count on a
# large dataset can be expensive. An approximate count is much cheaper to
# compute and is still accurate enough for most use cases.


+------------+
|approx_count|
+------------+
|         318|
+------------+
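
To make the trade-off concrete, here is a minimal sketch of estimating a distinct count from a small, fixed-size summary instead of remembering every value. Spark's approx_count_distinct is based on HyperLogLog++; the k-minimum-values estimator below is a different, simpler technique used only for illustration. The function name kmv_distinct_estimate, the parameter k, and the synthetic airport labels are assumptions, not part of the assignment.

```python
import hashlib
import heapq


def kmv_distinct_estimate(values, k=256):
    """Estimate the number of distinct values, keeping only k hashes in memory."""
    heap = []        # max-heap (via negation) of the k smallest hash fractions
    members = set()  # the same k fractions, for duplicate checks
    for value in values:
        digest = hashlib.md5(str(value).encode("utf-8")).digest()
        frac = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
        if frac in members:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -frac)
            members.add(frac)
        elif frac < -heap[0]:
            # Replace the largest retained fraction with the smaller new one.
            evicted = -heapq.heappushpop(heap, -frac)
            members.discard(evicted)
            members.add(frac)
    if len(heap) < k:
        return len(heap)  # fewer than k distinct values: the count is exact
    # The k-th smallest of n uniform fractions sits near k / n.
    return round((k - 1) / -heap[0])


# Synthetic data: 10,000 distinct labels, each repeated three times.
labels = [f"airport-{i % 10000}" for i in range(30000)]
print(kmv_distinct_estimate(labels))
```

A larger k lowers the expected error at the cost of more memory, which is the same kind of trade-off the rsd argument controls in Spark.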



4. What does the function skewness determine? (2 marks) Write a query which outputs
the skewness of the “count” column. (3 marks) What does the result indicate? ( 2 marks)

In [0]:
from pyspark.sql.functions import skewness

# Skewness measures the asymmetry of a variable's distribution. A value of zero
# indicates a symmetric distribution, a positive value a longer right tail, and
# a negative value a longer left tail.
# The result, about 2.68, is strongly positive: most routes have relatively few
# flights, while a small number of routes have very high counts.

flight_summary.select(skewness("count").alias("skewness_value")).show()

+-----------------+
|   skewness_value|
+-----------------+
|2.682183800064101|
+-----------------+
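
As a rough illustration of what the skewness value means, the sketch below computes skewness in plain Python as the third central moment divided by the second central moment raised to the power 1.5. It assumes the population form of the statistic; the function name population_skewness and the sample lists are made up for illustration.

```python
import math


def population_skewness(xs):
    """Third central moment divided by the variance raised to the power 1.5."""
    n = len(xs)
    mean = math.fsum(xs) / n
    m2 = math.fsum((x - mean) ** 2 for x in xs) / n  # population variance
    m3 = math.fsum((x - mean) ** 3 for x in xs) / n  # third central moment
    return m3 / m2 ** 1.5


symmetric = [1, 2, 3, 4, 5]         # balanced around the mean
right_tailed = [1, 1, 2, 2, 3, 50]  # one very large value, like a busy route

print(population_skewness(symmetric))
print(population_skewness(right_tailed))
```

The symmetric list gives zero, while the single large value in the second list pulls the statistic well above zero, which is the same pattern as the count column.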



5. Write a query which outputs the top 5 most popular destination cities. You will need to
group the data by destination state and destination city. ( 5 marks )

In [0]:
# Note: this import shadows Python's built-in sum for the rest of the notebook.
from pyspark.sql.functions import sum

top_destinations = flight_summary.groupBy("dest_state", "dest_city") \
    .agg(sum("count").alias("total_flights")) \
    .orderBy("total_flights", ascending=False) \
    .limit(5)

top_destinations.show()

+----------+-----------------+-------------+
|dest_state|        dest_city|total_flights|
+----------+-----------------+-------------+
|        IL|          Chicago|       366790|
|        GA|          Atlanta|       346904|
|        TX|Dallas-Fort Worth|       239582|
|        TX|          Houston|       198724|
|        CO|           Denver|       196010|
+----------+-----------------+-------------+
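
One reason to group by state as well as city is that city names are not unique across states. The plain-Python sketch below shows the difference on a few hypothetical rows; the route tuples and their counts are invented for illustration and are not taken from the dataset.

```python
from collections import Counter

# Hypothetical (dest_state, dest_city, count) rows, made up for illustration.
routes = [
    ("IL", "Springfield", 120),
    ("MO", "Springfield", 200),
    ("IL", "Springfield", 30),
    ("IL", "Chicago", 500),
]

by_city = Counter()        # groups on the city name only
by_state_city = Counter()  # groups on (state, city), as in the query above
for state, city, count in routes:
    by_city[city] += count
    by_state_city[(state, city)] += count

# City-only grouping merges the two different Springfields into one total.
print(by_city.most_common(2))
# Grouping on (state, city) keeps them apart.
print(by_state_city.most_common(3))
```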



6. Write a query which groups the data by each origin airport and outputs the sum,
average and standard deviation of the count column. Use the “.agg” function ( 5 marks )

In [0]:
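# sum is imported again so that this cell runs on its own.
from pyspark.sql.functions import avg, stddev, sum

# stddev is the sample standard deviation, so an airport with only one route
# has no defined value and shows null in the output.
origin_aggs = flight_summary.groupBy("origin_code") \
    .agg(
        sum("count").alias("sum_count"),
        avg("count").alias("avg_count"),
        stddev("count").alias("stddev_count")
    )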

origin_aggs.show()
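
In the output of this query, some airports show null for stddev_count. A minimal sketch of why, using Python's statistics module: the sample standard deviation divides by n - 1, so it is undefined for a single value. The helper name sample_stddev_or_none is hypothetical, and the two values 333 and 416 are inferred from the sum and standard deviation of the PSE row (assuming exactly two routes), not read from the data.

```python
import statistics


def sample_stddev_or_none(xs):
    """Sample standard deviation, or None when fewer than two values exist."""
    if len(xs) < 2:
        return None  # n - 1 would be zero, so the statistic is undefined
    return statistics.stdev(xs)


two_routes = [333, 416]  # inferred to be consistent with the PSE row
single_route = [262]     # one route, as for the BGM row

print(sample_stddev_or_none(two_routes))
print(sample_stddev_or_none(single_route))
# The population form divides by n instead and is defined for one value.
print(statistics.pstdev(single_route))
```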

+-----------+---------+------------------+------------------+
|origin_code|sum_count|         avg_count|      stddev_count|
+-----------+---------+------------------+------------------+
|        BGM|      262|             262.0|              null|
|        DLG|       77|              77.0|              null|
|        PSE|      749|             374.5| 58.68986283848344|
|        INL|      574|191.33333333333334| 169.5120448031152|
|        MSY|    38804|1021.1578947368421|1080.0178308416896|
|        PPG|      107|             107.0|              null|
|        GEG|     9505|             950.5| 878.9061699382679|
|        SNA|    37187|1690.3181818181818| 1288.817966619899|
|        BUR|    18889|1717.1818181818182|1015.1333723390063|
|        GRB|     4881|           1220.25|1066.3112037924639|
|        GTF|     1966| 655.3333333333334|308.61356634686905|
|        IDA|     2247|             749.0| 665.8498329203064|
|        GRR|    10845|             723.0| 714.2099531250296|
|       