(flight-nyc.png)


An aviation company with a significant presence in New York City has launched a data analysis project to identify trends in flight durations. The aim is to use flight schedule and operational data to optimize flight times and improve the travel experience for passengers. The data comes from the 'nycflights2022' collection produced by the ModernDive team. It covers flights departing from the major New York City airports, JFK (John F. Kennedy International Airport), LGA (LaGuardia Airport), and EWR (Newark Liberty International Airport), during the second half of 2022, and includes departure and arrival times, origins and destinations, and airline details:

- `flights2022-h2.csv` contains information about each flight, including:

| Variable         | Description                                              |
|------------------|----------------------------------------------------------|
| `carrier`        | Airline carrier code                                     | 
| `origin`         | Origin airport (IATA code)                               | 
| `dest`           | Destination airport (IATA code)                          | 
| `air_time`       | Duration of the flight in air, in minutes                |

- `airlines.csv` contains information about each airline:

| Variable  | Description                          |
|-----------|--------------------------------------|
| `carrier` | Airline carrier code                 |
| `name`    | Full name of the airline             |

- `airports.csv` provides details of airports:

| Variable | Description                           |
|----------|---------------------------------------|
| `faa`    | FAA code of the airport               |
| `name`   | Full name of the airport              |

In [3]:
# Import required packages
library(dplyr)
library(readr)

In [5]:
# Load the data
flights <- read_csv("flights2022-h2.csv")
airlines <- read_csv("airlines.csv")
airports <- read_csv("airports.csv")

Parsed with column specification:
cols(
  year = col_double(),
  month = col_double(),
  day = col_double(),
  dep_time = col_double(),
  sched_dep_time = col_double(),
  dep_delay = col_double(),
  arr_time = col_double(),
  sched_arr_time = col_double(),
  arr_delay = col_double(),
  carrier = col_character(),
  flight = col_double(),
  tailnum = col_character(),
  origin = col_character(),
  dest = col_character(),
  air_time = col_double(),
  distance = col_double(),
  hour = col_double(),
  minute = col_double(),
  time_hour = col_datetime(format = "")
)
Parsed with column specification:
cols(
  carrier = col_character(),
  name = col_character()
)
Parsed with column specification:
cols(
  faa = col_character(),
  name = col_character(),
  lat = col_double(),
  lon = col_double(),
  alt = col_double(),
  tz = col_double(),
  dst = col_character(),
  tzone = col_character()
)
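
The column specification messages above appear because `read_csv()` guesses each column's type. One option is to declare the types up front and read only the columns the analysis uses. The following is a minimal sketch under that assumption: the temporary file, its two rows, and the name `flights_toy` are made up for illustration and are not part of the dataset.

```r
library(readr)

# Write a tiny stand-in CSV (hypothetical rows, not real flights)
path <- tempfile(fileext = ".csv")
writeLines(c(
  "carrier,origin,dest,air_time,distance",
  "AA,JFK,LAX,330,2475",
  "DL,LGA,ATL,,762"
), path)

# cols_only() reads just the listed columns with the declared types,
# so no type guessing is needed and `distance` is skipped
flights_toy <- read_csv(
  path,
  col_types = cols_only(
    carrier  = col_character(),
    origin   = col_character(),
    dest     = col_character(),
    air_time = col_double()
  )
)

print(flights_toy)
```

The empty `air_time` field in the second row is read as `NA`, which is why the later averages use `na.rm = TRUE`.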


In [7]:
# Join the flights, airlines, and airports data frames together
complex_join <- flights %>%
  left_join(airlines, by = "carrier") %>%
  rename(airline_name = name) %>% 
  left_join(airports, by = c("dest" = "faa")) %>% 
  rename(airport_name = name)
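
A left join keeps every flight even when its destination code has no match in the airports table; in that case `airport_name` is `NA`. One way to check for such rows is an `anti_join()` on the same key. This is a sketch on toy data: `flights_toy`, `airports_toy`, and the code `"XXX"` are hypothetical and only stand in for the real tables.

```r
library(dplyr)

# Toy stand-ins for the real tables (hypothetical rows)
flights_toy <- tibble(
  carrier = c("AA", "DL", "UA"),
  dest    = c("LAX", "XXX", "SFO")
)
airports_toy <- tibble(
  faa  = c("LAX", "SFO"),
  name = c("Los Angeles International Airport",
           "San Francisco International Airport")
)

# Flights whose destination code is missing from the airports table,
# counted per destination
unmatched_dest <- flights_toy %>%
  anti_join(airports_toy, by = c("dest" = "faa")) %>%
  count(dest, name = "n_flights")

print(unmatched_dest)
```

An empty result on the real data would indicate that every destination was matched; any rows returned would show up under an `NA` airport name in later grouped summaries.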

In [9]:
# Find flight duration in hours
transformed_data <- complex_join %>%
  mutate(flight_duration = air_time / 60)

In [11]:
# Determine the average flight duration and number of flights for each airline and airport combination
analysis_result <- transformed_data %>%
  group_by(airline_name, airport_name) %>%
  summarize(avg_flight_duration = mean(flight_duration, na.rm = TRUE),
            count = n()) %>%
  ungroup()
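
Note that `n()` counts every row in a group, including flights with a missing `air_time`, while `mean(..., na.rm = TRUE)` averages only the recorded values. The sketch below shows the difference on toy data and uses `.groups = "drop"` as an alternative to the trailing `ungroup()`. The names `toy`, `toy_result`, and `n_with_air_time` are assumptions for illustration.

```r
library(dplyr)

# Hypothetical rows; one group has no recorded flight duration at all
toy <- tibble(
  airline_name    = c("A", "A", "A", "B"),
  airport_name    = c("X", "X", "Y", "X"),
  flight_duration = c(1.0, 2.0, NA, 3.0)
)

toy_result <- toy %>%
  group_by(airline_name, airport_name) %>%
  summarize(
    avg_flight_duration = mean(flight_duration, na.rm = TRUE),
    count               = n(),                          # all rows in the group
    n_with_air_time     = sum(!is.na(flight_duration)), # rows with a duration
    .groups             = "drop"                        # return an ungrouped result
  )

print(toy_result)
```

For the group with no recorded durations, the average is `NaN` while `count` is still 1, so comparing `count` with `n_with_air_time` shows how many flights actually contribute to each average.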

In [13]:
# Which airline and destination airport combination has the most flights from NYC?
frequent <- analysis_result %>% arrange(desc(count)) %>% head(1)

# Which airline and destination airport combination has the longest average flight duration (in hours) from NYC?
longest <- analysis_result %>% arrange(desc(avg_flight_duration)) %>% head(1)

# Which destination airport was least common for flights departing from JFK?
transformed_data %>% 
  filter(origin == "JFK") %>% 
  group_by(airport_name) %>% 
  summarize(count = n()) %>% 
  arrange(count)

# Least common destination, read off the first row of the sorted counts
least <- "Eagle County Regional Airport"

| airport_name                                    | count |
|-------------------------------------------------|-------|
| Eagle County Regional Airport                   | 17    |
| Gallatin Field                                  | 48    |
| Palm Springs International Airport              | 59    |
| Barnstable Municipal Boardman Polando Field     | 88    |
| Norman Y. Mineta San Jose International Airport | 92    |
| Albuquerque International Sunport               | 121   |
| Reno Tahoe International Airport                | 123   |
| San Antonio International Airport               | 183   |
| John Wayne Airport-Orange County Airport        | 184   |
| Ontario International Airport                   | 184   |
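
Rather than typing the airport name by hand, the least common destination can be pulled from the counts directly. The sketch below uses `slice_min()`, which by default keeps every row tied for the minimum, so ties are not silently dropped as they would be with `head(1)`. The table `toy_counts` and its airport names are hypothetical.

```r
library(dplyr)

# Hypothetical per-airport counts with a tie for the smallest value
toy_counts <- tibble(
  airport_name = c("Airport A", "Airport B", "Airport C", "Airport D"),
  count        = c(17, 48, 17, 92)
)

# All airports sharing the smallest count (with_ties = TRUE is the default)
least_toy <- toy_counts %>%
  slice_min(count, n = 1) %>%
  pull(airport_name)

# The same idea for the largest count
most_toy <- toy_counts %>%
  slice_max(count, n = 1) %>%
  pull(airport_name)

print(least_toy)
print(most_toy)
```

On the real data, the same pattern applied to the JFK counts would return a character vector that could be assigned to `least` without hard-coding the name.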
