![Flight departing large city](flight-nyc.png)


A leading aviation company with a significant presence in New York City has launched a data analysis project to identify trends in flight durations. The project draws on flight schedule and operations data, with the goal of optimizing flight times and improving the travel experience for passengers. As the head data analyst, you have access to datasets sourced from the 'nycflights2022' collection produced by the ModernDive team. They contain records of flights departing from major New York City airports, namely JFK (John F. Kennedy International Airport), LGA (LaGuardia Airport), and EWR (Newark Liberty International Airport), during the second half of 2022, and cover departure and arrival times, routes, and airline details:

- `flights2022-h2.csv` contains information about each flight, including:

| Variable         | Description                                              |
|------------------|----------------------------------------------------------|
| `carrier`        | Airline carrier code                                     | 
| `origin`         | Origin airport (IATA code)                               | 
| `dest`           | Destination airport (IATA code)                          | 
| `air_time`       | Duration of the flight in air, in minutes                |

- `airlines.csv` contains information about each airline:

| Variable  | Description                          |
|-----------|--------------------------------------|
| `carrier` | Airline carrier code                 |
| `name`    | Full name of the airline             |

- `airports.csv` provides details of airports:

| Variable | Description                           |
|----------|---------------------------------------|
| `faa`    | FAA code of the airport               |
| `name`   | Full name of the airport              |

In [26]:
# Import required packages
library(dplyr)
library(readr)

# Load the data
flights <- read_csv("flights2022-h2.csv")
airlines <- read_csv("airlines.csv")
airports <- read_csv("airports.csv")

Rows: 218802 Columns: 19
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr   (4): carrier, tailnum, origin, dest
dbl  (14): year, month, day, dep_time, sched_dep_time, dep_delay, arr_time, ...
dttm  (1): time_hour

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 16 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): carrier, name

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 1251 Columns: 8

In [27]:
head(flights)

year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<dbl>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dttm>
2022,7,1,9,2129,160,118,2312,126,B6,325,N229JB,JFK,BNA,106,765,21,29,2022-07-01 21:00:00
2022,7,1,12,1940,272,315,2253,262,B6,20,N591JB,JFK,RNO,333,2411,19,40,2022-07-01 19:00:00
2022,7,1,21,2120,181,140,2240,180,WN,548,N8651A,LGA,MDW,112,725,21,20,2022-07-01 21:00:00
2022,7,1,21,2159,142,225,21,124,B6,286,N537JT,JFK,ATL,101,760,21,59,2022-07-01 21:00:00
2022,7,1,22,2140,162,310,53,137,B6,500,N923JB,JFK,LAX,321,2475,21,40,2022-07-01 21:00:00
2022,7,1,23,2110,193,203,2259,184,YX,955,N130HQ,JFK,RDU,77,427,21,10,2022-07-01 21:00:00


In [28]:
head(airlines)

carrier,name
<chr>,<chr>
9E,Endeavor Air Inc.
AA,American Airlines Inc.
AS,Alaska Airlines Inc.
B6,JetBlue Airways
DL,Delta Air Lines Inc.
F9,Frontier Airlines Inc.


In [29]:
head(airports)

faa,name,lat,lon,alt,tz,dst,tzone
<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<chr>
AAF,Apalachicola Regional Airport,29.7275,-85.0275,20,-5,A,America/New_York
AAP,Andrau Airpark,29.7225,-95.5883,79,-6,A,America/Chicago
ABE,Lehigh Valley International Airport,40.6521,-75.4408,393,-5,A,America/New_York
ABI,Abilene Regional Airport,32.4113,-99.6819,1791,-6,A,America/Chicago
ABL,Ambler Airport,67.1063,-157.857,334,-9,A,America/Anchorage
ABQ,Albuquerque International Sunport,35.0402,-106.609,5355,-7,A,America/Denver


## 1. Complex data joining and summarization

In [30]:
# Join the flights, airlines, and airports data frames together
complex_join <- flights %>%
  left_join(airlines, by = "carrier") %>%
  rename(airline_name = name) %>% 
  left_join(airports, by = c("dest" = "faa")) %>% 
  rename(airport_name = name)

head(complex_join)

year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,⋯,minute,time_hour,airline_name,airport_name,lat,lon,alt,tz,dst,tzone
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,⋯,<dbl>,<dttm>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<chr>
2022,7,1,9,2129,160,118,2312,126,B6,⋯,29,2022-07-01 21:00:00,JetBlue Airways,Nashville International Airport,36.1245,-86.6782,599,-6,A,America/Chicago
2022,7,1,12,1940,272,315,2253,262,B6,⋯,40,2022-07-01 19:00:00,JetBlue Airways,Reno Tahoe International Airport,39.4991,-119.768,4415,-8,A,America/Los_Angeles
2022,7,1,21,2120,181,140,2240,180,WN,⋯,20,2022-07-01 21:00:00,Southwest Airlines Co.,Chicago Midway International Airport,41.786,-87.7524,620,-6,A,America/Chicago
2022,7,1,21,2159,142,225,21,124,B6,⋯,59,2022-07-01 21:00:00,JetBlue Airways,Hartsfield Jackson Atlanta International Airport,33.6367,-84.4281,1026,-5,A,America/New_York
2022,7,1,22,2140,162,310,53,137,B6,⋯,40,2022-07-01 21:00:00,JetBlue Airways,Los Angeles International Airport,33.9425,-118.408,125,-8,A,America/Los_Angeles
2022,7,1,23,2110,193,203,2259,184,YX,⋯,10,2022-07-01 21:00:00,Republic Airline,Raleigh Durham International Airport,35.8776,-78.7875,435,-5,A,America/New_York
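Because `left_join()` keeps every flight even when the lookup table has no matching row, any carrier or destination code missing from `airlines` or `airports` would appear with an `NA` name and later be grouped under `NA`. The following is a minimal sketch of how one might check for this with `anti_join()`. It uses small made-up tables (`toy_flights` and `toy_airports` are hypothetical stand-ins, not the real data), and whether the real files contain unmatched codes is not shown in this notebook.

```r
library(dplyr)

# Hypothetical toy data standing in for the flights and airports tables
toy_flights <- tibble(
  carrier = c("B6", "DL", "B6"),
  dest    = c("BNA", "ATL", "XXX")  # "XXX" is a made-up code with no airport entry
)
toy_airports <- tibble(
  faa  = c("BNA", "ATL"),
  name = c("Nashville International Airport",
           "Hartsfield Jackson Atlanta International Airport")
)

# anti_join() returns the rows of the left table with no match in the right table
unmatched_dest <- toy_flights %>%
  anti_join(toy_airports, by = c("dest" = "faa")) %>%
  distinct(dest)

print(unmatched_dest)
```

The same pattern with `by = "carrier"` would check the airline lookup.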


In [31]:
# Find flight duration in hours
transformed_data <- complex_join %>%
  mutate(flight_duration = air_time / 60)

head(transformed_data)

year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,⋯,time_hour,airline_name,airport_name,lat,lon,alt,tz,dst,tzone,flight_duration
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,⋯,<dttm>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<chr>,<dbl>
2022,7,1,9,2129,160,118,2312,126,B6,⋯,2022-07-01 21:00:00,JetBlue Airways,Nashville International Airport,36.1245,-86.6782,599,-6,A,America/Chicago,1.766667
2022,7,1,12,1940,272,315,2253,262,B6,⋯,2022-07-01 19:00:00,JetBlue Airways,Reno Tahoe International Airport,39.4991,-119.768,4415,-8,A,America/Los_Angeles,5.55
2022,7,1,21,2120,181,140,2240,180,WN,⋯,2022-07-01 21:00:00,Southwest Airlines Co.,Chicago Midway International Airport,41.786,-87.7524,620,-6,A,America/Chicago,1.866667
2022,7,1,21,2159,142,225,21,124,B6,⋯,2022-07-01 21:00:00,JetBlue Airways,Hartsfield Jackson Atlanta International Airport,33.6367,-84.4281,1026,-5,A,America/New_York,1.683333
2022,7,1,22,2140,162,310,53,137,B6,⋯,2022-07-01 21:00:00,JetBlue Airways,Los Angeles International Airport,33.9425,-118.408,125,-8,A,America/Los_Angeles,5.35
2022,7,1,23,2110,193,203,2259,184,YX,⋯,2022-07-01 21:00:00,Republic Airline,Raleigh Durham International Airport,35.8776,-78.7875,435,-5,A,America/New_York,1.283333


In [32]:
# Determine the average flight duration and number of flights for each airline and airport combination
analysis_result <- transformed_data %>%
  group_by(airline_name, airport_name) %>%
  summarize(avg_flight_duration = mean(flight_duration, na.rm = TRUE),
            count = n()) %>%
  ungroup()

head(analysis_result)

`summarise()` has grouped output by 'airline_name'. You can override using the
`.groups` argument.


airline_name,airport_name,avg_flight_duration,count
<chr>,<chr>,<dbl>,<int>
Alaska Airlines Inc.,Los Angeles International Airport,5.289376,519
Alaska Airlines Inc.,Portland International Airport,5.391573,362
Alaska Airlines Inc.,San Diego International Airport,5.296549,546
Alaska Airlines Inc.,San Francisco International Airport,5.62532,1304
Alaska Airlines Inc.,Seattle Tacoma International Airport,5.41184,1273
Allegiant Air,Asheville Regional Airport,1.47551,100
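The grouping message printed with this cell can be avoided by passing `.groups = "drop"` to `summarize()`, which also makes the trailing `ungroup()` unnecessary. Note as well that `n()` counts every row in a group, including flights whose `air_time` is missing, while the mean skips them. The sketch below illustrates both points on a made-up table; `toy_flights` and the extra column `n_with_air_time` are hypothetical names used only for illustration.

```r
library(dplyr)

# Hypothetical toy data: one flight for airline "B" has no recorded duration
toy_flights <- tibble(
  airline_name    = c("A", "A", "B", "B"),
  airport_name    = c("X", "X", "Y", "Y"),
  flight_duration = c(1.0, 2.0, 3.0, NA)
)

toy_summary <- toy_flights %>%
  group_by(airline_name, airport_name) %>%
  summarize(
    avg_flight_duration = mean(flight_duration, na.rm = TRUE),
    count               = n(),                           # all rows in the group
    n_with_air_time     = sum(!is.na(flight_duration)),  # rows that feed the mean
    .groups = "drop"                                     # return an ungrouped tibble
  )

print(toy_summary)
```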


## 2. Finding the most frequent flight destination

In [33]:
# Which airline and destination airport combination has the most flights from NYC?
frequent <- analysis_result %>%
  arrange(desc(count)) %>%
  head(1)

frequent

airline_name,airport_name,avg_flight_duration,count
<chr>,<chr>,<dbl>,<int>
Delta Air Lines Inc.,Hartsfield Jackson Atlanta International Airport,1.820202,5264


## 3. Determining the longest flight duration

In [34]:
# Which airline and destination airport combination has the longest average flight duration (in hours) from NYC?
longest <- analysis_result %>%
  arrange(desc(avg_flight_duration)) %>%
  head(1)

longest

airline_name,airport_name,avg_flight_duration,count
<chr>,<chr>,<dbl>,<int>
Delta Air Lines Inc.,Daniel K Inouye International Airport,10.71667,15
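The top-ranked average above rests on only 15 flights, so it may be worth checking how the ranking changes when low-volume combinations are excluded. The following is a minimal sketch of that idea on made-up numbers. The table `toy_result` and the threshold `min_flights` are hypothetical, and the cutoff of 100 is an arbitrary assumption rather than a value from the source.

```r
library(dplyr)

# Hypothetical toy summary with the same columns as analysis_result
toy_result <- tibble(
  airline_name        = c("A", "B", "C"),
  airport_name        = c("X", "Y", "Z"),
  avg_flight_duration = c(10.7, 6.2, 5.6),
  count               = c(15L, 400L, 1300L)
)

min_flights <- 100  # assumed cutoff, chosen only for illustration

# Keep combinations with enough flights, then take the longest average
longest_robust <- toy_result %>%
  filter(count >= min_flights) %>%
  slice_max(avg_flight_duration, n = 1)

print(longest_robust)
```

`slice_max()` keeps ties by default, so more than one row can come back if several combinations share the same maximum.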


## 4. Discovering the least common destination

In [38]:
# Which destination airport was least common among flights departing from JFK?
transformed_data %>% 
  filter(origin == "JFK") %>% 
  group_by(airport_name) %>% 
  summarize(count = n()) %>% 
  arrange(count) %>%
  head(1)

airport_name,count
<chr>,<int>
Eagle County Regional Airport,17
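`arrange(count) %>% head(1)` returns a single row even when several airports share the lowest count. A hedged alternative is `count()` followed by `slice_min()`, which keeps ties by default. The sketch below uses a made-up table (`toy_jfk` is a hypothetical stand-in for the JFK-filtered data), and whether ties exist in the real data is not shown in this notebook.

```r
library(dplyr)

# Hypothetical toy data: airports "Y" and "Z" are tied for fewest flights
toy_jfk <- tibble(
  airport_name = c("X", "X", "X", "Y", "Z")
)

least_common <- toy_jfk %>%
  count(airport_name, name = "count") %>%      # one row per airport with its flight count
  slice_min(count, n = 1, with_ties = TRUE)    # keep every airport tied for the minimum

print(least_common)
```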
