# Group Assignment: D

In [3]:
import findspark
findspark.init()

In [4]:
import pyspark
findspark.find()

'/opt/spark-2.4.4-bin-hadoop2.7'

In [5]:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession

conf = pyspark.SparkConf().setAppName('appName').setMaster('local[4]')
sc = pyspark.SparkContext(conf=conf)
spark = SparkSession(sc)

## Introduction to the Flights dataset

According to a 2010 report by the US Federal Aviation Administration, domestic flight delays cost passengers, airlines and other parts of the economy 32.9 billion dollars a year. More than half of that amount comes out of passengers' pockets: they not only waste time waiting for their planes to leave, but also miss connecting flights, spend money on food and have to sleep in hotel rooms while they are stranded.

The report, focusing on data from 2007, estimated that air transportation delays put a 4 billion dollar dent in the country's gross domestic product that year. The full report can be found
<a href="http://www.isr.umd.edu/NEXTOR/pubs/TDI_Report_Final_10_18_10_V3.pdf">here</a>.

But what causes these delays?

To answer this question, we analyze the provided dataset, which contains up to 1,936,758 domestic US flights for 2008, together with the causes of their delays, diversions and cancellations, if any.

The data comes from the U.S. Department of Transportation's (DOT) Bureau of Transportation Statistics (BTS).

This dataset is composed of the following variables:
1. **Year** 2008
2. **Month** 1
3. **DayofMonth** 1-31
4. **DayOfWeek** 1 (Monday) - 7 (Sunday)
5. **DepTime** actual departure time (local, hhmm)
6. **CRSDepTime** scheduled departure time (local, hhmm)
7. **ArrTime** actual arrival time (local, hhmm)
8. **CRSArrTime** scheduled arrival time (local, hhmm)
9. **UniqueCarrier** unique carrier code
10. **FlightNum** flight number
11. **TailNum** plane tail number: aircraft registration, unique aircraft identifier
12. **ActualElapsedTime** in minutes
13. **CRSElapsedTime** in minutes
14. **AirTime** in minutes
15. **ArrDelay** arrival delay, in minutes: A flight is counted as "on time" if it operated less than 15 minutes later than the scheduled time shown in the carriers' Computerized Reservations Systems (CRS).
16. **DepDelay** departure delay, in minutes
17. **Origin** origin IATA airport code
18. **Dest** destination IATA airport code
19. **Distance** in miles
20. **TaxiIn** taxi in time, in minutes
21. **TaxiOut** taxi out time in minutes
22. **Cancelled** was the flight cancelled?
23. **CancellationCode** reason for cancellation (A = carrier, B = weather, C = NAS, D = security)
24. **Diverted** 1 = yes, 0 = no
25. **CarrierDelay** in minutes: Carrier delay is within the control of the air carrier. Examples of occurrences that may determine carrier delay are: aircraft cleaning, aircraft damage, awaiting the arrival of connecting passengers or crew, baggage, bird strike, cargo loading, catering, computer, outage-carrier equipment, crew legality (pilot or attendant rest), damage by hazardous goods, engineering inspection, fueling, handling disabled passengers, late crew, lavatory servicing, maintenance, oversales, potable water servicing, removal of unruly passenger, slow boarding or seating, stowing carry-on baggage, weight and balance delays.
26. **WeatherDelay** in minutes: Weather delay is caused by extreme or hazardous weather conditions that are forecasted or manifest themselves on point of departure, enroute, or on point of arrival.
27. **NASDelay** in minutes: Delay that is within the control of the National Airspace System (NAS) may include: non-extreme weather conditions, airport operations, heavy traffic volume, air traffic control, etc.
28. **SecurityDelay** in minutes: Security delay is caused by evacuation of a terminal or concourse, re-boarding of aircraft because of security breach, inoperative screening equipment and/or long lines in excess of 29 minutes at screening areas.
29. **LateAircraftDelay** in minutes: Arrival delay at an airport due to the late arrival of the same aircraft at a previous airport. The ripple effect of an earlier delay at downstream airports is referred to as delay propagation.
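
The time columns above use an hhmm clock reading (1530 means 15:30), so they cannot be treated as a plain number of minutes. The following is a minimal plain-Python sketch of that encoding and of the 15-minute "on time" rule; the helper names `hhmm_to_minutes` and `is_on_time` are hypothetical and are not part of the notebook, and missing values are assumed to be encoded as "NA", as the filters used later suggest.

```python
from typing import Optional


def hhmm_to_minutes(value: str) -> Optional[int]:
    """Convert an hhmm clock reading such as "1530" to minutes after midnight.

    Returns None for missing values, assumed here to be encoded as "NA".
    """
    if value is None or value.strip() in ("", "NA"):
        return None
    reading = int(float(value))  # tolerate "1530" as well as "1530.0"
    hours, minutes = divmod(reading, 100)
    if hours > 24 or minutes > 59 or (hours == 24 and minutes > 0):
        raise ValueError(f"not a valid hhmm reading: {value!r}")
    return hours * 60 + minutes


def is_on_time(arr_delay_minutes: float) -> bool:
    """Apply the 15-minute rule quoted in the ArrDelay description."""
    return arr_delay_minutes < 15
```

For instance, `hhmm_to_minutes("1530")` gives 930, which makes differences between two times meaningful in a way that subtracting the raw hhmm values is not.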

Read the CSV file using Spark's default delimiter (","). The first line contains the headers, so it is not part of the data; hence we set the header option to true.

In [6]:
# Spark is lazy: only the header line is inspected here (to get the column names); reading the rows is deferred until an action is executed
flightsDF = spark.read.option("header", "true").csv("flights_jan08.csv")

## Your topic: Arrival Delay related to the time of departure and arrival

We want to check what happens in each of the cities, and if there is a relation between the city and the delay that goes beyond the airports. For that purpose, **you have to get a small dataset with a few economic indicators of the city** where the airport is located, such as the per-capita income, number of large companies (tech companies, large banks, etc) operating in each city, etc. No need for a lot of features, just 4 or 5 are fine. If you deem necessary, categorize the cities according to the economic development level. Once you have it, answer the following questions:

* Is there a relation between the economic prosperity of a city and the proportion of flights that arrive at it during weekdays or weekends?
* Is there any relation between the business nature of a city and the proportion of flights that arrive early in the morning (e.g. with people in a business travel) with respect to the number of flights arriving during the rest of the day? 
* Is this proportion between flights arriving in the morning and the afternoon much different if we compare weekdays vs weekends in those cities?
* Are there cities that receive most flights at weekends? Is this typical of a vacation city?
* Can we say more developed cities suffer from smaller arrival delays on average?
* Is there a relation between the arrival time and the arrival delay? If you deem necessary, categorize the day into discrete parts for the arrival time. Is the relation the same for all categories of cities?
* What about the departure time?

In [7]:
## Importing Functions

from IPython.display import display, Markdown
# Note: sum and round imported here shadow Python's built-ins of the same name
from pyspark.sql.functions import when, count, col, countDistinct, desc, first, lit, mean, sum, corr, coalesce, avg, round
from pyspark.sql.types import IntegerType 
from itertools import chain

# Understanding the flight dataset

In this part of the analysis we count the number of flights for each IATA code in the column Dest. We also count the distinct values in that column, so that we know how many different airports the analysis covers.

In [8]:
print("Checking amount of distinct values in column Dest:")
flightsDF.select(countDistinct("Dest").alias("Airports Count")).show()

flightsDF\
    .groupBy(col("Dest"))\
    .agg(count("FlightNum").alias("Count"))\
    .orderBy("Count", ascending = False)\
    .select("Dest","Count").show(81,truncate = False)

Checking amount of distinct values in column Dest:
+--------------+
|Airports Count|
+--------------+
|            81|
+--------------+

+----+-----+
|Dest|Count|
+----+-----+
|LAS |6734 |
|MDW |6255 |
|PHX |5513 |
|BWI |4691 |
|OAK |3916 |
|HOU |3898 |
|DAL |3594 |
|LAX |3382 |
|SAN |3327 |
|MCO |3258 |
|SMF |2639 |
|TPA |2347 |
|BNA |2343 |
|ONT |2249 |
|MCI |2231 |
|SJC |2187 |
|ABQ |2085 |
|STL |2010 |
|PHL |1703 |
|BUR |1674 |
|AUS |1620 |
|SAT |1615 |
|DEN |1589 |
|SLC |1468 |
|SEA |1293 |
|RNO |1265 |
|MSY |1258 |
|FLL |1232 |
|PDX |1171 |
|RDU |1111 |
|SNA |1058 |
|ELP |973  |
|PVD |937  |
|TUS |926  |
|OKC |848  |
|MHT |848  |
|BHM |809  |
|JAX |808  |
|CMH |798  |
|ISP |790  |
|SFO |743  |
|TUL |685  |
|GEG |656  |
|BOI |643  |
|PIT |640  |
|OMA |578  |
|DTW |570  |
|SDF |566  |
|BDL |522  |
|CLE |498  |
|BUF |475  |
|IND |472  |
|ALB |388  |
|LIT |377  |
|ORF |359  |
|PBI |356  |
|LBB |349  |
|MAF |319  |
|AMA |318  |
|RSW |317  |
|HRL |312  |
|IAD |308  |
|JAN |244  |
|COS 

In [9]:
flightsDF.summary()

DataFrame[summary: string, Year: string, Month: string, DayofMonth: string, DayOfWeek: string, DepTime: string, CRSDepTime: string, ArrTime: string, CRSArrTime: string, UniqueCarrier: string, FlightNum: string, TailNum: string, ActualElapsedTime: string, CRSElapsedTime: string, AirTime: string, ArrDelay: string, DepDelay: string, Origin: string, Dest: string, Distance: string, TaxiIn: string, TaxiOut: string, Cancelled: string, CancellationCode: string, Diverted: string, CarrierDelay: string, WeatherDelay: string, NASDelay: string, SecurityDelay: string, LateAircraftDelay: string]

# Creating data by city (IATA codes)

To collect the data required for this assignment we had to go through different sources. The process was as follows:

- First we looked up the city corresponding to each IATA code, mainly on Wikipedia and on a web page that serves as a guide to US airport codes.
- For the population we took the figures from boston.com. They refer to 2007 rather than 2008, but we decided to work with them because we consider them a good proxy.
- Finally, for the rest of the economic indicators we took from areavibes.com the GDP, population density, unemployment rate and poverty level of every city in the flight dataset (according to the destination IATA code).

In [10]:
EconomicsAirport = spark.read.option("header", "true").csv("EconomicsAirport.csv")
EconomicsAirport.head(3)

[Row(IATA_CODE='ELP', State='Texas', Airport='El Paso', City='El Paso', Population='606913', GDPPC='27872', PopDensity='53.3', UnemploymentRate='6.1', ProvertyLevel='21'),
 Row(IATA_CODE='LBB', State='Texas', Airport='Lubbock', City='Lubbock', Population='217326', GDPPC='30414', PopDensity='23', UnemploymentRate='3.7', ProvertyLevel='20.9'),
 Row(IATA_CODE='FAT', State='California', Airport='Fresno', City='Fresno', Population='470508', GDPPC='34220', PopDensity='49.9', UnemploymentRate='10.4', ProvertyLevel='30')]

## Is there a relation between the economic prosperity of a city and the proportion of flights that arrive at it during weekdays or weekends?

- First we create a table that shows how many flights arrived at each airport on weekdays and on weekends.
- Then we join that table with the dataset containing the economic indicators for each city.
- Finally, we check the correlation between the proportion of weekday flights and the economic variables.

Looking at the correlation table, there is no clear relation between the weekday/weekend proportions and the economic variables. The strongest correlation, 0.11 with the unemployment rate, is still weak; one possible explanation is a migration of talent from other cities to the ones in the dataset.
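
To make the correlation step concrete, here is a minimal plain-Python sketch of the Pearson coefficient, which is the quantity `pyspark.sql.functions.corr` returns. The function name `pearson` and the sample numbers are illustrative assumptions, not values taken from the dataset.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long samples.

    Raises ZeroDivisionError if either sample is constant.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up numbers for illustration only, not taken from the dataset.
pct_weekday = [74.0, 75.0, 76.0, 77.0]
unemployment = [6.0, 5.5, 6.5, 7.0]
r = pearson(pct_weekday, unemployment)
```

As a rule of thumb, squaring the coefficient gives the share of variance a linear fit would explain, so a coefficient of 0.11 corresponds to roughly 1%.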

In [11]:
Weekdays = flightsDF\
   .where((col("ArrDelay")!="NA") & (col("Cancelled")==0))\
   .withColumn("Weekday", when(col("DayOfWeek").cast("int").between(1, 5), 1)\
                               .otherwise(0))  # 1 = Monday to Friday, 0 = Saturday or Sunday

Dayoftheweek = Weekdays\
                    .where((col("ArrDelay")!="NA") & (col("Cancelled")==0))\
                    .groupBy(col("Dest"))\
                    .agg(count("FlightNum").alias("Count"),sum("Weekday").alias("Weekday"))\
                    .withColumn("Weekend", col("Count") - col("Weekday"))\
                    .withColumn("% Weekday", col("Weekday")/col("Count")*100) \
                    .withColumn("% Weekend", 100 - col("% Weekday")) \
                    .orderBy("Count", ascending = False)\
                    .select("Dest","Count", "Weekday", "Weekend",\
                            round("% Weekday",2).alias("% Weekday"),round("% Weekend",2).alias("% Weekend"))

Dayoftheweek.show()

+----+-----+-------+-------+---------+---------+
|Dest|Count|Weekday|Weekend|% Weekday|% Weekend|
+----+-----+-------+-------+---------+---------+
| LAS| 6647|   4943|   1704|    74.36|    25.64|
| MDW| 6103|   4547|   1556|     74.5|     25.5|
| PHX| 5435|   4083|   1352|    75.12|    24.88|
| BWI| 4660|   3519|   1141|    75.52|    24.48|
| HOU| 3839|   2941|    898|    76.61|    23.39|
| OAK| 3823|   2902|    921|    75.91|    24.09|
| DAL| 3559|   2776|    783|     78.0|     22.0|
| LAX| 3283|   2471|    812|    75.27|    24.73|
| SAN| 3240|   2449|    791|    75.59|    24.41|
| MCO| 3237|   2367|    870|    73.12|    26.88|
| SMF| 2595|   1972|    623|    75.99|    24.01|
| BNA| 2335|   1769|    566|    75.76|    24.24|
| TPA| 2329|   1730|    599|    74.28|    25.72|
| ONT| 2214|   1691|    523|    76.38|    23.62|
| MCI| 2212|   1687|    525|    76.27|    23.73|
| SJC| 2151|   1639|    512|     76.2|     23.8|
| ABQ| 2070|   1563|    507|    75.51|    24.49|
| STL| 1983|   1510|

In [12]:
DayoftheweekEconomics = Dayoftheweek.join(EconomicsAirport, Dayoftheweek.Dest == EconomicsAirport.IATA_CODE, how="inner")\
                        .select("Dest","Count","Weekday","Weekend",\
                    "% Weekday","% Weekend","Population", "GDPPC", "PopDensity", "UnemploymentRate", "ProvertyLevel")
            
DayoftheweekEconomics.show()

+----+-----+-------+-------+---------+---------+----------+-----+----------+----------------+-------------+
|Dest|Count|Weekday|Weekend|% Weekday|% Weekend|Population|GDPPC|PopDensity|UnemploymentRate|ProvertyLevel|
+----+-----+-------+-------+---------+---------+----------+-----+----------+----------------+-------------+
| LAS| 6647|   4943|   1704|    74.36|    25.64|    558880|48873|      20.5|             6.6|         16.8|
| MDW| 6103|   4547|   1556|     74.5|     25.5|   2836658|57387|     495.6|             6.1|         21.7|
| PHX| 5435|   4083|   1352|    75.12|    24.88|   1552259|47595|     108.3|             5.5|         22.3|
| BWI| 4660|   3519|   1141|    75.52|    24.48|    637455|27149|      1481|               7|         23.1|
| HOU| 3839|   2941|    898|    76.61|    23.39|   2208180|69667|     195.9|             4.7|         21.9|
| OAK| 3823|   2902|    921|    75.91|    24.09|    401489|34984|      1423|             6.3|           20|
| DAL| 3559|   2776|    783|

In [13]:
DayoftheweekEconomics\
    .where(col("Count") > 50)\
    .select([round(corr(col(c),col("% Weekday")),2).alias("Corr Pop/" + c) for c in ["Population","GDPPC", "PopDensity", "UnemploymentRate", "ProvertyLevel"]]).show()

+-------------------+--------------+-------------------+-------------------------+----------------------+
|Corr Pop/Population|Corr Pop/GDPPC|Corr Pop/PopDensity|Corr Pop/UnemploymentRate|Corr Pop/ProvertyLevel|
+-------------------+--------------+-------------------+-------------------------+----------------------+
|               0.05|         -0.08|              -0.01|                     0.11|                 -0.02|
+-------------------+--------------+-------------------+-------------------------+----------------------+



## Is there any relation between the business nature of a city and the proportion of flights that arrive early in the morning (e.g. with people in a business travel) with respect to the number of flights arriving during the rest of the day?

- First we inspect the "ArrTime" variable to understand its format, so that we can later use it to create a column flagging the flights that arrive in the morning.
- We create a "Morning" column that marks flights with an arrival time between 4 a.m. and 12 p.m.
- We join the dataset with the one containing the economic indicators.
- We check the correlation between the variables of interest.

Again, the correlations are weak. The variable with the highest correlation (0.21) is the unemployment rate, which again could be explained by a potential migration of talent from other cities, especially in the mornings, when interviews and business meetings are usually scheduled.


In [14]:
flightsDF.select("ArrTime").summary().show()


+-------+------------------+
|summary|           ArrTime|
+-------+------------------+
|  count|            100000|
|   mean|1492.7392247056678|
| stddev|496.37679391699163|
|    min|                 1|
|    25%|            1114.0|
|    50%|            1518.0|
|    75%|            1913.0|
|    max|                NA|
+-------+------------------+
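
The `max` of `NA` in this summary is what one would expect if ArrTime is compared as text, which matches the all-string schema printed earlier. The following plain-Python illustration, using made-up sample values, shows why text ordering puts "NA" last and how parsing to integers first gives the numeric extremes instead.

```python
# Sample values in the style of the ArrTime column; "NA" marks a missing time.
arr_times = ["1", "1114", "1518", "1913", "2359", "NA"]

# Compared as text, "NA" sorts after every digit string, so it becomes the max.
text_max = max(arr_times)

# Text ordering is also misleading among the digits themselves: "9" > "1518".
nine_sorts_after_1518 = "9" > "1518"

# Parsing first and skipping "NA" gives the numeric extremes instead.
numeric = [int(t) for t in arr_times if t != "NA"]
numeric_min, numeric_max = min(numeric), max(numeric)
```

Casting the column to an integer type before comparing or aggregating avoids relying on implicit conversions.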



In [15]:
morning = flightsDF\
   .where((col("ArrDelay")!="NA") & (col("Cancelled")==0))\
   .withColumn("Morning", when((col("ArrTime") >= 400) & (col("ArrTime") <= 1200),1)\
                               .otherwise(0))

MorningFlights = morning\
                    .where((col("ArrDelay")!="NA") & (col("Cancelled")==0))\
                    .groupBy(col("Dest"))\
                    .agg(count("FlightNum").alias("Count"),sum("Morning").alias("Morning"))\
                    .withColumn("Not Morning", col("Count") - col("Morning"))\
                    .withColumn("% Morning", col("Morning")/col("Count")*100) \
                    .withColumn("% Not Morning", 100 - col("% Morning")) \
                    .orderBy("Count", ascending = False)\
                    .select("Dest","Count", "Morning", "Not Morning",\
                            round("% Morning",2).alias("% Morning"),round("% Not Morning",2).alias("% Not Morning"))

MorningFlights.show()

+----+-----+-------+-----------+---------+-------------+
|Dest|Count|Morning|Not Morning|% Morning|% Not Morning|
+----+-----+-------+-----------+---------+-------------+
| LAS| 6647|   2096|       4551|    31.53|        68.47|
| MDW| 6103|   1775|       4328|    29.08|        70.92|
| PHX| 5435|   1370|       4065|    25.21|        74.79|
| BWI| 4660|   1266|       3394|    27.17|        72.83|
| HOU| 3839|   1062|       2777|    27.66|        72.34|
| OAK| 3823|   1067|       2756|    27.91|        72.09|
| DAL| 3559|   1043|       2516|    29.31|        70.69|
| LAX| 3283|    965|       2318|    29.39|        70.61|
| SAN| 3240|    996|       2244|    30.74|        69.26|
| MCO| 3237|    845|       2392|     26.1|         73.9|
| SMF| 2595|    717|       1878|    27.63|        72.37|
| BNA| 2335|    714|       1621|    30.58|        69.42|
| TPA| 2329|    664|       1665|    28.51|        71.49|
| ONT| 2214|    664|       1550|    29.99|        70.01|
| MCI| 2212|    588|       1624

In [16]:
MorningEconomics = MorningFlights.join(EconomicsAirport, MorningFlights.Dest == EconomicsAirport.IATA_CODE, how="inner")\
                        .select("Dest","Count","Morning","Not Morning",\
                    "% Morning","% Not Morning","Population", "GDPPC", "PopDensity", "UnemploymentRate", "ProvertyLevel")
            
MorningEconomics.show()

+----+-----+-------+-----------+---------+-------------+----------+-----+----------+----------------+-------------+
|Dest|Count|Morning|Not Morning|% Morning|% Not Morning|Population|GDPPC|PopDensity|UnemploymentRate|ProvertyLevel|
+----+-----+-------+-----------+---------+-------------+----------+-----+----------+----------------+-------------+
| LAS| 6647|   2096|       4551|    31.53|        68.47|    558880|48873|      20.5|             6.6|         16.8|
| MDW| 6103|   1775|       4328|    29.08|        70.92|   2836658|57387|     495.6|             6.1|         21.7|
| PHX| 5435|   1370|       4065|    25.21|        74.79|   1552259|47595|     108.3|             5.5|         22.3|
| BWI| 4660|   1266|       3394|    27.17|        72.83|    637455|27149|      1481|               7|         23.1|
| HOU| 3839|   1062|       2777|    27.66|        72.34|   2208180|69667|     195.9|             4.7|         21.9|
| OAK| 3823|   1067|       2756|    27.91|        72.09|    401489|34984

In [17]:
MorningEconomics\
    .where(col("Count") > 50)\
    .select([round(corr(col(c),col("% Morning")),2).alias("Corr Pop/" + c) for c in ["Population","GDPPC", "PopDensity", "UnemploymentRate", "ProvertyLevel"]]).show()

+-------------------+--------------+-------------------+-------------------------+----------------------+
|Corr Pop/Population|Corr Pop/GDPPC|Corr Pop/PopDensity|Corr Pop/UnemploymentRate|Corr Pop/ProvertyLevel|
+-------------------+--------------+-------------------+-------------------------+----------------------+
|               0.08|          0.02|               0.06|                     0.21|                   0.1|
+-------------------+--------------+-------------------+-------------------------+----------------------+



## Is this proportion between flights arriving in the morning and the afternoon much different if we compare weekdays vs weekends in those cities?

- Here we create a dataset with binary columns that indicate whether a flight arrived in the morning or in the afternoon, on a weekday or on a weekend.
- We create another dataset where we count the total number of flights per destination and sum the number of flights for each of the created variables. We then add columns with the proportion of flights per category.
- Finally, we compute the average proportion of flights for each category.

We can see that the share of morning flights is higher on weekdays (27.91% on average) than on weekends (23.4%). Conversely, afternoon flights make up a larger share on weekends (76.6% versus 72.09%).
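
The proportions described above are shares within each day type: morning flights divided by all flights on weekdays, and likewise for weekends. A minimal plain-Python sketch of that arithmetic follows; the function name and the sample flights are hypothetical, and "afternoon" here means every arrival outside the 0400 to 1200 morning window, as in the notebook.

```python
from collections import Counter
from typing import Iterable, Tuple


def morning_share_by_day_type(flights: Iterable[Tuple[int, int]]) -> dict:
    """Share of morning arrivals (%) within weekdays and within weekends.

    Each flight is (day_of_week, arr_time_hhmm), with 1 = Monday ... 7 = Sunday.
    Morning means 0400 <= arr_time <= 1200.
    """
    counts = Counter()
    for day_of_week, arr_time in flights:
        day_type = "weekday" if 1 <= day_of_week <= 5 else "weekend"
        part = "morning" if 400 <= arr_time <= 1200 else "afternoon"
        counts[(day_type, part)] += 1

    shares = {}
    for day_type in ("weekday", "weekend"):
        total = counts[(day_type, "morning")] + counts[(day_type, "afternoon")]
        # Leave the share undefined when there are no flights of that day type.
        shares[day_type] = (
            100 * counts[(day_type, "morning")] / total if total else None
        )
    return shares


# Hypothetical flights for one destination: (DayOfWeek, ArrTime).
sample = [(1, 830), (2, 1545), (3, 1010), (5, 2100), (6, 1130), (7, 1900), (7, 2215)]
shares = morning_share_by_day_type(sample)
```

Returning `None` for an empty day type mirrors the division by zero that a destination with no weekend flights would otherwise cause.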

In [18]:
WeekMorning = flightsDF\
    .where((col("ArrDelay")!="NA") & (col("Cancelled")==0))\
    .withColumn("Morning", when((col("ArrTime") >= 400) & (col("ArrTime") <= 1200),1)\
                               .otherwise(0))\
    .withColumn("Weekday", when(col("DayOfWeek").cast("int").between(1, 5), 1)\
                               .otherwise(0))\
    .withColumn("MorningWeekday", when((col("Morning")==1) & (col("Weekday")==1), 1).otherwise(0))\
    .withColumn("MorningWeekend", when((col("Morning")==1) & (col("Weekday")==0), 1).otherwise(0))\
    .withColumn("AfternoonWeekday", when((col("Morning")==0) & (col("Weekday")==1), 1).otherwise(0))\
    .withColumn("AfternoonWeekend", when((col("Morning")==0) & (col("Weekday")==0), 1).otherwise(0))
    
    
    

WeekMorningProportion = WeekMorning\
    .groupBy("Dest")\
    .agg(count("FlightNum").alias("Count"), sum("MorningWeekday").alias("MorningWeekday"),\
         sum("MorningWeekend").alias("MorningWeekend"),\
         sum("AfternoonWeekday").alias("AfternoonWeekday"),\
         sum("AfternoonWeekend").alias("AfternoonWeekend"))\
    .where(col("Count")>50)\
    .withColumn("%MWD", col("MorningWeekday")/(col("MorningWeekday")+col("AfternoonWeekday"))*100)\
    .withColumn("%AWD", col("AfternoonWeekday")/(col("MorningWeekday")+col("AfternoonWeekday"))*100)\
    .withColumn("%MWK", col("MorningWeekend")/(col("MorningWeekend")+col("AfternoonWeekend"))*100)\
    .withColumn("%AWK", col("AfternoonWeekend")/(col("MorningWeekend")+col("AfternoonWeekend"))*100)\
    .select("Dest", "Count", round(col("%MWD"),2).alias("Morning Weekday"),\
           round(col("%AWD"),2).alias("Afternoon Weekday"),\
           round(col("%MWK"),2).alias("Morning Weekend"),\
           round(col("%AWK"),2).alias("Afternoon Weekend"))
    
WeekMorningProportion.show()

+----+-----+---------------+-----------------+---------------+-----------------+
|Dest|Count|Morning Weekday|Afternoon Weekday|Morning Weekend|Afternoon Weekend|
+----+-----+---------------+-----------------+---------------+-----------------+
| MSY| 1256|          36.42|            63.58|          33.44|            66.56|
| GEG|  640|          25.51|            74.49|           26.0|             74.0|
| BUR| 1636|          31.02|            68.98|          27.01|            72.99|
| SNA| 1050|          32.24|            67.76|          26.64|            73.36|
| PVD|  934|          24.68|            75.32|          18.57|            81.43|
| OAK| 3823|          29.36|            70.64|          23.34|            76.66|
| ORF|  358|          23.13|            76.87|          21.11|            78.89|
| CMH|  794|          26.42|            73.58|          21.43|            78.57|
| SJC| 2151|          31.06|            68.94|           24.8|             75.2|
| BUF|  464|          27.06|

In [19]:
WeekMorningProportion\
    .filter(col("Count")>50)\
    .agg(round(avg("Morning Weekday"),2).alias("Avg Morning Weekday"), round(avg("Afternoon Weekday"),2).alias("Avg Afternoon Weekday"), round(avg("Morning Weekend"),2).alias("Avg Morning Weekend"), round(avg("Afternoon Weekend"),2).alias("Avg Afternoon Weekend"))\
    .show()

+-------------------+---------------------+-------------------+---------------------+
|Avg Morning Weekday|Avg Afternoon Weekday|Avg Morning Weekend|Avg Afternoon Weekend|
+-------------------+---------------------+-------------------+---------------------+
|              27.91|                72.09|               23.4|                 76.6|
+-------------------+---------------------+-------------------+---------------------+



## Are there cities that receive most flights at weekends? Is this typical of a vacation city?

- When filtering on "% Weekend" > 50, the only city that appears is Syracuse (New York State), but only because its single flight in the dataset occurred on a weekend. After trying different filters, we conclude that this is because, for all the other cities in the dataset, the majority of flights (above 70%) occurred on weekdays.

In [20]:
Dayoftheweek\
    .where(col("% Weekend")>50)\
    .show()

+----+-----+-------+-------+---------+---------+
|Dest|Count|Weekday|Weekend|% Weekday|% Weekend|
+----+-----+-------+-------+---------+---------+
| SYR|    1|      0|      1|      0.0|    100.0|
+----+-----+-------+-------+---------+---------+



In [21]:
ProportionDayoftheweek = Dayoftheweek\
    .withColumn("Avg Weekday", col("Weekday")/5)\
    .withColumn("Avg Weekend", col("Weekend")/2)

ProportionDayoftheweek\
    .where(col("Avg Weekday")<col("Avg Weekend"))\
    .show()


+----+-----+-------+-------+---------+---------+-----------+-----------+
|Dest|Count|Weekday|Weekend|% Weekday|% Weekend|Avg Weekday|Avg Weekend|
+----+-----+-------+-------+---------+---------+-----------+-----------+
| SYR|    1|      0|      1|      0.0|    100.0|        0.0|        0.5|
+----+-----+-------+-------+---------+---------+-----------+-----------+
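
Dividing by 5 and 2 treats the month as made of complete weeks. As a hedged alternative, the sketch below counts the actual weekdays and weekend days with Python's `calendar` module, assuming (from the file name `flights_jan08.csv` and Month = 1) that the data covers the whole of January 2008. The helper `per_day_rates` is hypothetical.

```python
import calendar

# Assumption: the file covers all of January 2008 (file name and Month = 1).
YEAR, MONTH = 2008, 1

days_in_month = calendar.monthrange(YEAR, MONTH)[1]
# calendar.weekday returns 0 for Monday ... 6 for Sunday.
weekdays = sum(
    1 for day in range(1, days_in_month + 1)
    if calendar.weekday(YEAR, MONTH, day) < 5
)
weekend_days = days_in_month - weekdays


def per_day_rates(weekday_flights: int, weekend_flights: int) -> tuple:
    """Average flights per weekday and per weekend day in this month."""
    return weekday_flights / weekdays, weekend_flights / weekend_days


# LAS counts from the Dayoftheweek table: 4943 weekday and 1704 weekend arrivals.
las_weekday_rate, las_weekend_rate = per_day_rates(4943, 1704)
```

January 2008 had 23 weekdays and 8 weekend days, so under this assumption a weekday share of about 74% is what an even daily schedule would produce.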



## Can we say more developed cities suffer from smaller arrival delays on average?

- First we take GDP (the GDPPC column) as a measure of development. We classify a city as developed when its GDP is above the third quartile of the observations.
- Then we create a table that indicates, according to this criterion, whether each destination is a developed or an undeveloped city.
- We count the occurrences of each category and see that more cities are classified as undeveloped (61 versus 20).
- Then we calculate, for each city, the average delay of its delayed flights (arrival delay above 15 minutes) as well as the proportion of its flights that were delayed.

We can see that for developed cities both the average delay and the proportion of delayed flights are higher.
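
The third-quartile rule can be sketched in plain Python as follows. This is an illustration under stated assumptions: the function name is hypothetical, only five GDPPC values from the notebook's tables are used, and `statistics.quantiles` computes an exact quartile, whereas Spark's `summary()` reports an approximate percentile, so the thresholds need not coincide.

```python
import statistics
from typing import Dict


def classify_by_third_quartile(gdp_per_capita: Dict[str, float]) -> Dict[str, str]:
    """Label a city "Developed City" when its value is above the third quartile."""
    # method="inclusive" treats the values as the whole population of cities.
    q3 = statistics.quantiles(gdp_per_capita.values(), n=4, method="inclusive")[2]
    return {
        code: "Developed City" if value > q3 else "Undeveloped City"
        for code, value in gdp_per_capita.items()
    }


# GDPPC values for five airports as listed in the notebook's tables.
sample = {"ELP": 27872, "LBB": 30414, "FAT": 34220, "LAS": 48873, "HOU": 69667}
labels = classify_by_third_quartile(sample)
```

With only five cities the threshold is far from the 52452 used in the notebook; the point is the rule, not the number. Because the comparison is strict, a city sitting exactly on the quartile stays in the undeveloped group.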

In [22]:
EconomicsAirport\
    .select("GDPPC")\
    .summary()\
    .show()

+-------+------------------+
|summary|             GDPPC|
+-------+------------------+
|  count|                81|
|   mean| 40747.37037037037|
| stddev|16033.931333615941|
|    min|             17040|
|    25%|           27752.0|
|    50%|           37428.0|
|    75%|           52452.0|
|    max|             86120|
+-------+------------------+



In [23]:
DevelopCity = EconomicsAirport\
    .withColumn("DevelopedCity", when((col("GDPPC")>52452),"Developed City")\
                               .otherwise("Undeveloped City"))

DevelopCity.select("IATA_Code","City","Population","GDPPC","DevelopedCity").show()

+---------+------------------+----------+-----+----------------+
|IATA_Code|              City|Population|GDPPC|   DevelopedCity|
+---------+------------------+----------+-----+----------------+
|      ELP|           El Paso|    606913|27872|Undeveloped City|
|      LBB|           Lubbock|    217326|30414|Undeveloped City|
|      FAT|            Fresno|    470508|34220|Undeveloped City|
|      GEG|           Spokane|    200975|37870|Undeveloped City|
|      SAT|       San Antonio|   1328984|41072|Undeveloped City|
|      JAX|      Jacksonville|    805605|41497|Undeveloped City|
|      PVD|        Providence|    172459|41912|Undeveloped City|
|      ABQ|       Albuquerque|    518271|42883|Undeveloped City|
|      ALB|            Albany|     94210|43178|Undeveloped City|
|      TPA|Tampa-Hillsborough|    336823|43793|Undeveloped City|
|      ROC|         Rochester|    206759|44026|Undeveloped City|
|      MRY|          Monterey|     27567|44376|Undeveloped City|
|      PIT|        Pittsb

In [24]:
DevelopCity\
        .groupby("DevelopedCity")\
        .agg(count("DevelopedCity").alias("Count"))\
        .show()

+----------------+-----+
|   DevelopedCity|Count|
+----------------+-----+
|  Developed City|   20|
|Undeveloped City|   61|
+----------------+-----+



In [25]:
Countperairport = flightsDF\
                    .where((col("ArrDelay")!="NA") & (col("Cancelled")==0))\
                    .groupBy(col("Dest"))\
                    .agg(count("FlightNum").alias("AllCount"))\
                    .orderBy("AllCount", ascending = False)\
                    .select("Dest","AllCount")

# Flights delayed by more than 15 minutes: count and mean delay per destination
AverageDelay = flightsDF\
                    .where((col("ArrDelay") > 15) & (col("Cancelled")==0))\
                    .groupBy(col("Dest"))\
                    .agg(count("FlightNum").alias("DelayCount"), mean("ArrDelay").alias("mean"))\
                    .orderBy("DelayCount", ascending = False)\
                    .select("Dest","DelayCount",round("mean",2).alias("ArrivalDelay"))

AirportDelayProportion = Countperairport.join(AverageDelay, Countperairport.Dest == AverageDelay.Dest, how="inner")\
                        .withColumn("DelayProportion", round(col("DelayCount")/col("AllCount")*100,2))\
                        .select(AverageDelay.Dest,"AllCount","DelayCount","DelayProportion","ArrivalDelay")
            
AirportDelayProportion.show()


+----+--------+----------+---------------+------------+
|Dest|AllCount|DelayCount|DelayProportion|ArrivalDelay|
+----+--------+----------+---------------+------------+
| LAS|    6647|      1508|          22.69|       57.31|
| PHX|    5435|      1101|          20.26|        49.6|
| MDW|    6103|      1040|          17.04|       56.63|
| OAK|    3823|       923|          24.14|       47.96|
| LAX|    3283|       910|          27.72|       49.67|
| SAN|    3240|       784|           24.2|        52.5|
| HOU|    3839|       647|          16.85|       48.51|
| BWI|    4660|       618|          13.26|       48.17|
| SMF|    2595|       579|          22.31|       47.81|
| DAL|    3559|       562|          15.79|       45.31|
| SJC|    2151|       514|           23.9|       45.92|
| ONT|    2214|       451|          20.37|       51.65|
| MCI|    2212|       390|          17.63|       47.39|
| MCO|    3237|       390|          12.05|       48.15|
| SLC|    1452|       390|          26.86|      

In [26]:
DevelopCityFlights = AirportDelayProportion.join(DevelopCity, AirportDelayProportion.Dest == DevelopCity.IATA_CODE, how="inner")\
                        .select("Dest","DevelopedCity","AllCount","DelayCount","DelayProportion","ArrivalDelay")
            
DevelopCityFlights.orderBy("DelayProportion", ascending = False).show()

+----+----------------+--------+----------+---------------+------------+
|Dest|   DevelopedCity|AllCount|DelayCount|DelayProportion|ArrivalDelay|
+----+----------------+--------+----------+---------------+------------+
| MYR|Undeveloped City|       1|         1|          100.0|        18.0|
| SAV|Undeveloped City|       1|         1|          100.0|        31.0|
| ROC|Undeveloped City|       1|         1|          100.0|        24.0|
| GSO|Undeveloped City|       2|         1|           50.0|        60.0|
| IAH|  Developed City|      13|         6|          46.15|        79.0|
| SFO|  Developed City|     698|       306|          43.84|       86.91|
| EWR|Undeveloped City|       9|         3|          33.33|        43.0|
| LAX|  Developed City|    3283|       910|          27.72|       49.67|
| SLC|Undeveloped City|    1452|       390|          26.86|       52.74|
| BFL|Undeveloped City|      60|        16|          26.67|       86.44|
| MRY|Undeveloped City|     109|        28|        

In [27]:
DevelopCityFlights\
    .filter(col("AllCount")>50)\
    .groupBy("DevelopedCity")\
    .agg(round(avg("ArrivalDelay"),2).alias("AvgArrDelay"), round(avg("DelayProportion"),2).alias("AvgDelayProportion"))\
    .show()

+----------------+-----------+------------------+
|   DevelopedCity|AvgArrDelay|AvgDelayProportion|
+----------------+-----------+------------------+
|  Developed City|      53.74|             20.46|
|Undeveloped City|      50.93|             18.42|
+----------------+-----------+------------------+



## Is there a relation between the arrival time and the arrival delay? If you deem necessary, categorize the day into discrete parts for the arrival time. Is the relation the same for all categories of cities?

- We again split the day into four categories based on the arrival time: Night (00:00 to 05:59), Morning (06:00 to 11:59), Evening (12:00 to 17:59) and Late Evening (18:00 onwards).
- We then compute the proportion of delayed flights and the average delay of those delayed flights for each category.

Flights arriving at Night have both the highest proportion of delays (66.15%) and the longest average delay (104.62 minutes). Morning arrivals have the lowest values (7.33% and 34.17 minutes).

The rate of arrival delays at night is also higher in undeveloped cities (70%) than in developed cities (61%).
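
The bucketing of arrival times can be sketched in plain Python, independent of Spark. The helper name `day_category` is hypothetical; the boundaries mirror the `when` chain used in this notebook, and, as there, any value that does not fall into the first three ranges (including a missing time) ends up in the last category.

```python
def day_category(hhmm):
    """Map a local time in hhmm form (e.g. 1745) to a part of the day."""
    if hhmm is not None and 0 <= hhmm < 600:
        return "Night"
    if hhmm is not None and 600 <= hhmm < 1200:
        return "Morning"
    if hhmm is not None and 1200 <= hhmm < 1800:
        return "Evening"
    # Same role as .otherwise(...): 1800 and later, plus missing values
    return "Late Evening"


# Print the category for times around each boundary
for t in (30, 559, 600, 1159, 1200, 1759, 1800, 2400, None):
    print(t, day_category(t))
```

Because the last branch is a catch-all, flights with a missing arrival time would be counted as Late Evening unless they are filtered out beforehand.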

In [29]:
# Overall Day Category delay

# Keep completed flights with a known arrival delay and label each one
# with the part of the day in which it arrived (ArrTime is local hhmm).
DayCategory = flightsDF\
   .where((col("ArrDelay")!="NA") & (col("Cancelled")==0))\
   .withColumn("DayCategory", when((col("ArrTime") >= 0) & (col("ArrTime") < 600),"Night")\
                   .when((col("ArrTime")>= 600) & (col("ArrTime") < 1200),"Morning")\
                   .when((col("ArrTime")>= 1200) & (col("ArrTime") < 1800),"Evening")\
                    .otherwise("Late Evening"))

# Total number of flights per day category
# (DayCategory is already filtered above, so no further filter is needed).
Countperairport = DayCategory\
                    .groupBy(col("DayCategory"))\
                    .agg(count("FlightNum").alias("AllCount"))\
                    .orderBy("AllCount", ascending = False)\
                    .select("DayCategory","AllCount")

# Number of delayed flights (ArrDelay > 15) and their mean arrival delay
AverageDelay = DayCategory\
                    .where(col("ArrDelay") > 15)\
                    .groupBy(col("DayCategory"))\
                    .agg(count("FlightNum").alias("DelayCount"),mean("ArrDelay")\
                    .alias("mean"))\
                    .orderBy("DelayCount", ascending = False)\
                    .select("DayCategory","DelayCount",round("mean",2).alias("ArrivalDelay"))

DayCategoryDelayProportion = Countperairport.join(AverageDelay, Countperairport.DayCategory == AverageDelay.DayCategory, how="inner")\
                        .withColumn("DelayProportion", round(col("DelayCount")/col("AllCount")*100,2))\
                        .select(AverageDelay.DayCategory,"AllCount","DelayCount","DelayProportion","ArrivalDelay")
            
DayCategoryDelayProportion.show()

+------------+--------+----------+---------------+------------+
| DayCategory|AllCount|DelayCount|DelayProportion|ArrivalDelay|
+------------+--------+----------+---------------+------------+
|Late Evening|   32447|      9133|          28.15|       52.74|
|     Evening|   36884|      6437|          17.45|       43.63|
|     Morning|   27547|      2020|           7.33|       34.17|
|       Night|    1820|      1204|          66.15|      104.62|
+------------+--------+----------+---------------+------------+
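
As a quick sanity check, the `DelayProportion` column can be recomputed by hand from the counts in the table above. This is a plain-Python sketch; the dictionary `counts` and the helper `delay_proportion` are names introduced here for illustration only.

```python
# (AllCount, DelayCount) copied from the table above
counts = {
    "Late Evening": (32447, 9133),
    "Evening": (36884, 6437),
    "Morning": (27547, 2020),
    "Night": (1820, 1204),
}


def delay_proportion(all_count, delay_count):
    """Percentage of delayed flights, rounded to two decimals."""
    return round(delay_count / all_count * 100, 2)


proportions = {cat: delay_proportion(*c) for cat, c in counts.items()}

# Share of all flights that arrive at Night, for context
total_flights = sum(all_count for all_count, _ in counts.values())
night_share = round(counts["Night"][0] / total_flights * 100, 2)

print(proportions)
print(total_flights, night_share)
```

The Night category is by far the smallest group (1820 of 98698 flights), so its high delay rate rests on comparatively few observations.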



In [30]:
DevelopCity = EconomicsAirport\
    .withColumn("DevelopedCity", when((col("GDPPC")>52452),"Developed City")\
                               .otherwise("Undeveloped City"))\
    .select("IATA_CODE","DevelopedCity")

# Percentage of delayed arrivals per city category and part of the day.
# Note: a flight counts as delayed here when ArrDelay >= 15,
# whereas the overall table above uses ArrDelay > 15.
DayCategoryArrive = flightsDF\
    .join(DevelopCity, flightsDF.Dest == DevelopCity.IATA_CODE, how="inner")\
    .where((col("ArrDelay")!="NA") & (col("Cancelled")==0))\
    .withColumn("DayCategory", when((col("ArrTime") >= 0) & (col("ArrTime") < 600),"Night")\
                   .when((col("ArrTime")>= 600) & (col("ArrTime") < 1200),"Morning")\
                   .when((col("ArrTime")>= 1200) & (col("ArrTime") < 1800),"Evening")\
                    .otherwise("LateEvening"))\
    .withColumn("DelayBinary", when((col("ArrDelay") >= 15),1).otherwise(0))\
    .groupBy("DevelopedCity")\
    .pivot("DayCategory")\
    .agg(round(sum("DelayBinary")/count("DelayBinary"),2)*100)\
    .orderBy("Night",ascending = False)

# show() returns None, so it is called separately to keep the DataFrame in the variable
DayCategoryArrive.show(81)



+----------------+-------+------------------+-------+-----+
|   DevelopedCity|Evening|       LateEvening|Morning|Night|
+----------------+-------+------------------+-------+-----+
|Undeveloped City|   18.0|28.999999999999996|    8.0| 70.0|
|  Developed City|   19.0|              30.0|    8.0| 61.0|
+----------------+-------+------------------+-------+-----+
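
The value `28.999999999999996` in the table above is a floating-point artifact: the ratio is rounded to two decimals first and only then multiplied by 100. The plain-Python sketch below reproduces the effect with the same order of operations and shows one way to avoid it by rounding last. The counts are hypothetical and are not taken from the dataset.

```python
delayed, total = 2887, 10000   # hypothetical counts, for illustration only

ratio = delayed / total

round_then_scale = round(ratio, 2) * 100   # same order as in the cell above
scale_then_round = round(ratio * 100, 2)   # rounding as the final step

print(round_then_scale)
print(scale_then_round)
```

In the notebook, the analogous change would be to multiply by 100 inside `round(...)`. The table would then show percentages with two decimals instead of whole percentages.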



## What about the departure time?

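Grouping the same flights by departure time instead of arrival time reverses the night-time result. Contrary to what we expected, it is now the developed cities that show the higher rate of delays at night (0.50 versus 0.38 for undeveloped cities; the rates in this table are fractions rather than percentages). This could suggest that part of the higher rate of night-time arrival delays in undeveloped cities originates in delayed departures from developed cities. The cell below does not test this directly, however: it still classifies each flight by its destination airport (`Dest`) and still measures the arrival delay (`ArrDelay`), so only the time used for the day category changes.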

In [30]:
DevelopCity = EconomicsAirport\
    .withColumn("DevelopedCity", when((col("GDPPC")>52452),"Developed City")\
                               .otherwise("Undeveloped City"))\
    .select("IATA_CODE","DevelopedCity")

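# Fraction of delayed flights (ArrDelay >= 15) per city category and part of the day,
# with the day category now taken from the departure time (DepTime).
# The city category is still assigned from the destination airport (Dest).
DayCategoryDeparture = flightsDF\
    .join(DevelopCity, flightsDF.Dest == DevelopCity.IATA_CODE, how="inner")\
    .where((col("ArrDelay")!="NA") & (col("Cancelled")==0))\
    .withColumn("DayCategory", when((col("DepTime") >= 0) & (col("DepTime") < 600),"Night")\
                   .when((col("DepTime")>= 600) & (col("DepTime") < 1200),"Morning")\
                   .when((col("DepTime")>= 1200) & (col("DepTime") < 1800),"Evening")\
                    .otherwise("LateEvening"))\
    .withColumn("DelayBinary", when((col("ArrDelay") >= 15),1).otherwise(0))\
    .groupBy("DevelopedCity")\
    .pivot("DayCategory")\
    .agg(round(sum("DelayBinary")/count("DelayBinary"),2))\
    .orderBy("Night",ascending = False)

# show() returns None, so it is called separately to keep the DataFrame in the variable
DayCategoryDeparture.show(81)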


+----------------+-------+-----------+-------+-----+
|   DevelopedCity|Evening|LateEvening|Morning|Night|
+----------------+-------+-----------+-------+-----+
|  Developed City|   0.22|       0.35|   0.11|  0.5|
|Undeveloped City|   0.21|       0.33|    0.1| 0.38|
+----------------+-------+-----------+-------+-----+
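
To test whether night-time delays really start at developed-city airports, flights would need to be classified by their origin airport rather than their destination. The plain-Python sketch below shows the idea on a handful of made-up records. The airport codes, the flight tuples and the helper name `night_delay_rate_by_origin` are all hypothetical, and the `origin` field assumes the dataset provides an origin-airport column.

```python
from collections import defaultdict

# Hypothetical airport classification (not taken from EconomicsAirport)
city_class = {"AAA": "Developed City", "BBB": "Undeveloped City"}

# Hypothetical flights: (origin, dest, dep_time_hhmm, arr_delay_minutes)
flights = [
    ("AAA", "BBB", 130, 40),
    ("AAA", "BBB", 200, 5),
    ("BBB", "AAA", 345, 20),
    ("BBB", "AAA", 500, 0),
    ("BBB", "AAA", 530, 3),
    ("AAA", "BBB", 900, 60),   # not a night departure, ignored below
]


def night_delay_rate_by_origin(flights, city_class, threshold=15):
    """Fraction of night departures (00:00 to 05:59) that arrive at least
    `threshold` minutes late, grouped by the class of the ORIGIN airport."""
    delayed = defaultdict(int)
    total = defaultdict(int)
    for origin, _dest, dep_time, arr_delay in flights:
        # Skip airports without a class and departures outside the night window
        if origin not in city_class or not (0 <= dep_time < 600):
            continue
        group = city_class[origin]
        total[group] += 1
        delayed[group] += int(arr_delay >= threshold)
    return {g: round(delayed[g] / total[g], 2) for g in total}


rates = night_delay_rate_by_origin(flights, city_class)
print(rates)
```

The only structural difference from the cell above is the key used to look up the city class. Everything else (night window, delay threshold, ratio of delayed to total flights) follows the same logic.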

