
## Exercise

Consider the dataset `/databricks-datasets/flights/departuredelays.csv`, which records flights and their delays.

1. Import the CSV into a `DataFrame`. Would you define the schema beforehand?
2. The column `delay` expresses the delay in minutes. Can you compute a new column `delayInHours` in which `delay` is converted to hours?
3. Which flight has the largest delay?
4. [Bonus] What is the most popular route? Note that a route is the combination of an `origin` and a `destination`.

In [0]:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("date",StringType(),True),
    StructField("delay",IntegerType(),True),
    StructField("distance",IntegerType(),True),
    StructField("origin",StringType(),True),
    StructField("destination",StringType(),True)]
)
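
As an aside on question 1, the same schema can also be written as a DDL-formatted string, which `spark.read.csv` accepts through its `schema` argument. The sketch below is an alternative to the `StructType` above, not a replacement for it. `SCHEMA_DDL` and `read_flights` are illustrative names that are not part of the exercise, and `spark` is assumed to be the active session that a Databricks notebook provides.

```python
# Illustrative alternative: the schema written as a DDL-formatted string.
# SCHEMA_DDL and read_flights are hypothetical names, not part of the exercise.
SCHEMA_DDL = "date STRING, delay INT, distance INT, origin STRING, destination STRING"


def read_flights(spark, path="dbfs:/databricks-datasets/flights/departuredelays.csv"):
    """Read the flights CSV with an explicit schema instead of inferring one."""
    # `spark` is assumed to be an active SparkSession.
    return spark.read.csv(path, header=True, schema=SCHEMA_DDL)
```

Supplying a schema up front avoids the extra pass over the file that `inferSchema=True` requires, and it keeps `date` as a string rather than letting it be inferred as a number.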

In [0]:
from pyspark.sql.functions import concat, lit, col

df = spark.read.csv(
    "dbfs:/databricks-datasets/flights/departuredelays.csv",
    header=True,
    schema=schema
)

In [0]:

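# Task 2: convert the delay from minutes to hours
df = df.withColumn("delayInHours", df.delay / 60)

# Task 4 (bonus): count flights per route, most frequent first
display(
    df.withColumn("route", concat(df.origin, lit(" - "), df.destination))
    .groupBy("route").count()
    .sort(col("count").desc())
)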

route,count
SFO - LAX,3232
LAX - SFO,3198
LAS - LAX,3016
LAX - LAS,2964
JFK - LAX,2720
LAX - JFK,2719
ATL - LGA,2501
LGA - ATL,2500
LAX - PHX,2394
PHX - LAX,2387
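
Concatenating the two columns is not strictly needed to count routes: grouping on `origin` and `destination` together should give the same counts while keeping both columns separate. A minimal sketch, assuming `df` is the DataFrame read above; `top_routes` is an illustrative helper name, not part of the exercise.

```python
def top_routes(df, n=10):
    """Return the n most frequent (origin, destination) pairs of a DataFrame."""
    return (
        df.groupBy("origin", "destination")  # one group per directed route
        .count()                             # adds a `count` column
        .orderBy("count", ascending=False)   # busiest routes first
        .limit(n)                            # keep only the top n rows
    )
```

In the notebook this could be shown with `display(top_routes(df))`. Routes are directed in both versions, so `SFO` to `LAX` and `LAX` to `SFO` are counted separately.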


In [0]:
import pyspark.sql.functions as F

display(df.select(F.max(df.delay)))

max(delay)
1642


In [0]:
display(df.orderBy(df.delay.desc()))

date,delay,distance,origin,destination,delayInHours
3090615,1642,807,TPA,DFW,27.366666666666667
2190925,1638,1604,SFO,ORD,27.3
2021245,1636,972,FLL,DFW,27.266666666666666
3020700,1592,974,RSW,ORD,26.53333333333333
1180805,1560,548,BNA,DFW,26.0
3031210,1553,1404,PDX,DFW,25.883333333333333
3070645,1543,887,CLE,DFW,25.716666666666665
2210630,1511,873,MCO,ORD,25.183333333333334
1300915,1500,1517,EGE,JFK,25.0
1150715,1496,1033,ONT,DFW,24.933333333333334
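
To answer question 3 with a single row rather than a sorted table, the ordering can be followed by `first()`. This is a sketch under the same assumption that `df` is the DataFrame built above; `worst_delay` is an illustrative name.

```python
def worst_delay(df):
    """Return the row with the largest `delay` as a single Row object."""
    # Sort by delay in descending order and keep only the first row.
    return df.orderBy("delay", ascending=False).first()
```

Judging by the sorted output above, the returned row would be the `TPA` to `DFW` flight with a delay of 1642 minutes.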
