# Flight Delay Analysis Project
#### Analyze flight data to determine whether a late-arriving flight is associated with a delayed departure for the same aircraft.
Research Question: Is there a correlation between the size of an airport (International versus Regional) and the likelihood that a departing flight is delayed because of a delayed inbound aircraft? <br>
### Hypothesis:
There is a positive correlation between the size of an airport (International vs. Regional) and the likelihood that a departing flight would be delayed due to a delayed inbound aircraft. Specifically, it is hypothesized that International airports with higher volumes of air traffic will exhibit a higher probability of delayed departures due to late aircraft compared to smaller, Regional airports.
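
To make the idea of a "delayed inbound aircraft" concrete, here is a minimal sketch of one way to link each departure to the same aircraft's previous arrival. It uses a small invented table whose column names mirror the flights dataset. The helper columns `INBOUND_DELAY`, `LATE_INBOUND`, and `LATE_DEPARTURE` are hypothetical names introduced for illustration, and treating any delay above 0 minutes as "late" is an assumption, not the notebook's definition.

```python
import pandas as pd

# Toy data: column names follow the flights dataset, values are invented.
flights = pd.DataFrame({
    "TAIL_NUMBER": ["N1", "N1", "N1", "N2", "N2"],
    "MONTH": [1, 1, 1, 1, 1],
    "DAY": [1, 1, 1, 1, 1],
    "SCHEDULED_DEPARTURE": [600, 900, 1200, 700, 1000],
    "ARRIVAL_DELAY": [30.0, -5.0, 12.0, -3.0, 20.0],
    "DEPARTURE_DELAY": [25.0, 18.0, -2.0, 0.0, 15.0],
})

# Order each aircraft's legs chronologically.
flights = flights.sort_values(
    ["TAIL_NUMBER", "MONTH", "DAY", "SCHEDULED_DEPARTURE"]
).reset_index(drop=True)

# Arrival delay of the same aircraft's previous leg (NaN for its first leg).
flights["INBOUND_DELAY"] = flights.groupby("TAIL_NUMBER")["ARRIVAL_DELAY"].shift(1)

# Assumed threshold: any positive delay counts as late.
flights["LATE_INBOUND"] = flights["INBOUND_DELAY"] > 0
flights["LATE_DEPARTURE"] = flights["DEPARTURE_DELAY"] > 0

print(flights[["TAIL_NUMBER", "INBOUND_DELAY", "LATE_INBOUND", "LATE_DEPARTURE"]])
```

This simplified version does not check that the previous leg's destination matches the current origin, and it does not treat overnight gaps specially. A fuller analysis would likely need both checks.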

In [1]:
# Import required libraries
import pandas as pd
import numpy as np
import scipy as sc
import time_functions as tf

# Set Pandas display options
pd.options.display.max_columns = None
pd.options.display.max_rows = None

## Import Data
| File                      | Description |
| ------------------------- | ----------- |
| airlines.csv              | Data contains airline name information |
| airports.csv              | Data contains airport information |
| flights.csv               | Data contains all flight data for 2015 |
| international_airports.csv | Data identifies U.S. International Airports |

In [2]:
# Load csv files into Pandas DataFrames
df_airlines = pd.read_csv("data/airlines.csv", index_col=False)
df_airports = pd.read_csv("data/airports.csv", index_col=False)
df_flights = pd.read_csv("data/flights.csv", index_col=False, low_memory=False)
df_intl = pd.read_csv("data/international_airports.csv", index_col=False)

## Assessing Data
Visualize sample data from each of the imported datasets to determine which values will be of interest for the analysis.

In [3]:
# View the first 5 rows of the Airlines dataset
df_airlines.head()

Unnamed: 0,IATA_CODE,AIRLINE
0,UA,United Air Lines Inc.
1,AA,American Airlines Inc.
2,US,US Airways Inc.
3,F9,Frontier Airlines Inc.
4,B6,JetBlue Airways


In [4]:
# View the first 5 rows of the Airport dataset
df_airports.head()

Unnamed: 0,IATA_CODE,AIRPORT,CITY,STATE,COUNTRY,LATITUDE,LONGITUDE
0,ABE,Lehigh Valley International Airport,Allentown,PA,USA,40.65236,-75.4404
1,ABI,Abilene Regional Airport,Abilene,TX,USA,32.41132,-99.6819
2,ABQ,Albuquerque International Sunport,Albuquerque,NM,USA,35.04022,-106.60919
3,ABR,Aberdeen Regional Airport,Aberdeen,SD,USA,45.44906,-98.42183
4,ABY,Southwest Georgia Regional Airport,Albany,GA,USA,31.53552,-84.19447


In [5]:
# View the first 5 rows of the Flights dataset
df_flights.head()


Unnamed: 0,YEAR,MONTH,DAY,DAY_OF_WEEK,AIRLINE,FLIGHT_NUMBER,TAIL_NUMBER,ORIGIN_AIRPORT,DESTINATION_AIRPORT,SCHEDULED_DEPARTURE,DEPARTURE_TIME,DEPARTURE_DELAY,TAXI_OUT,WHEELS_OFF,SCHEDULED_TIME,ELAPSED_TIME,AIR_TIME,DISTANCE,WHEELS_ON,TAXI_IN,SCHEDULED_ARRIVAL,ARRIVAL_TIME,ARRIVAL_DELAY,DIVERTED,CANCELLED,CANCELLATION_REASON,AIR_SYSTEM_DELAY,SECURITY_DELAY,AIRLINE_DELAY,LATE_AIRCRAFT_DELAY,WEATHER_DELAY
0,2015,1,1,4,AS,98,N407AS,ANC,SEA,5,2354.0,-11.0,21.0,15.0,205.0,194.0,169.0,1448,404.0,4.0,430,408.0,-22.0,0,0,,,,,,
1,2015,1,1,4,AA,2336,N3KUAA,LAX,PBI,10,2.0,-8.0,12.0,14.0,280.0,279.0,263.0,2330,737.0,4.0,750,741.0,-9.0,0,0,,,,,,
2,2015,1,1,4,US,840,N171US,SFO,CLT,20,18.0,-2.0,16.0,34.0,286.0,293.0,266.0,2296,800.0,11.0,806,811.0,5.0,0,0,,,,,,
3,2015,1,1,4,AA,258,N3HYAA,LAX,MIA,20,15.0,-5.0,15.0,30.0,285.0,281.0,258.0,2342,748.0,8.0,805,756.0,-9.0,0,0,,,,,,
4,2015,1,1,4,AS,135,N527AS,SEA,ANC,25,24.0,-1.0,11.0,35.0,235.0,215.0,199.0,1448,254.0,5.0,320,259.0,-21.0,0,0,,,,,,


In [6]:
# Find total rows and columns in flights data
df_flights.shape

(5819079, 31)

In [7]:
# View the first 5 rows of the International dataset
df_intl.head()

Unnamed: 0,CITY,AIRPORT_NAME,STATE,IATA-Code,Status
0,Atlanta,Hartsfield-Jackson Atlanta International,Georgia,ATL,INTL
1,Anchorage,Ted Stevens Anchorage International Airport,Alaska,ANC,INTL
2,Austin,Austin-Bergstrom International,Texas,AUS,INTL
3,Baltimore,Baltimore/Washington International - BWI Airport,Maryland,BWI,INTL
4,Boston,Logan International Airport,Massachusetts,BOS,INTL


## Data Cleaning
Review the Flights data and determine which columns and rows can be removed from the dataset.

In [8]:
# Get listing of all columns in the Flight data
df_flights.columns

Index(['YEAR', 'MONTH', 'DAY', 'DAY_OF_WEEK', 'AIRLINE', 'FLIGHT_NUMBER',
       'TAIL_NUMBER', 'ORIGIN_AIRPORT', 'DESTINATION_AIRPORT',
       'SCHEDULED_DEPARTURE', 'DEPARTURE_TIME', 'DEPARTURE_DELAY', 'TAXI_OUT',
       'WHEELS_OFF', 'SCHEDULED_TIME', 'ELAPSED_TIME', 'AIR_TIME', 'DISTANCE',
       'WHEELS_ON', 'TAXI_IN', 'SCHEDULED_ARRIVAL', 'ARRIVAL_TIME',
       'ARRIVAL_DELAY', 'DIVERTED', 'CANCELLED', 'CANCELLATION_REASON',
       'AIR_SYSTEM_DELAY', 'SECURITY_DELAY', 'AIRLINE_DELAY',
       'LATE_AIRCRAFT_DELAY', 'WEATHER_DELAY'],
      dtype='object')

Drop columns that are not needed in the analysis:
- YEAR, AIR_SYSTEM_DELAY, SECURITY_DELAY, AIRLINE_DELAY, LATE_AIRCRAFT_DELAY, WEATHER_DELAY, CANCELLATION_REASON, TAXI_OUT, WHEELS_OFF, SCHEDULED_TIME, ELAPSED_TIME, AIR_TIME, DISTANCE, WHEELS_ON, TAXI_IN

In [9]:
# Drop columns 
drop = ['YEAR', 'AIR_SYSTEM_DELAY', 'SECURITY_DELAY', 'AIRLINE_DELAY',
       'LATE_AIRCRAFT_DELAY', 'WEATHER_DELAY', 'CANCELLATION_REASON', 'TAXI_OUT',
       'WHEELS_OFF', 'SCHEDULED_TIME', 'ELAPSED_TIME', 'AIR_TIME', 'DISTANCE',
       'WHEELS_ON', 'TAXI_IN']

df_flights.drop(drop, axis=1, inplace=True)

Find the columns that have null values and determine what to do with them.

In [10]:
# Examine the Flights dataset for NaN (Null) values 
df_flights.isnull().sum()

MONTH                       0
DAY                         0
DAY_OF_WEEK                 0
AIRLINE                     0
FLIGHT_NUMBER               0
TAIL_NUMBER             14721
ORIGIN_AIRPORT              0
DESTINATION_AIRPORT         0
SCHEDULED_DEPARTURE         0
DEPARTURE_TIME          86153
DEPARTURE_DELAY         86153
SCHEDULED_ARRIVAL           0
ARRIVAL_TIME            92513
ARRIVAL_DELAY          105071
DIVERTED                    0
CANCELLED                   0
dtype: int64

In [11]:
# Filter and visualize a few of the NaN rows
df_filtered = df_flights[df_flights['ARRIVAL_DELAY'].isnull()]
df_filtered.head(10)

Unnamed: 0,MONTH,DAY,DAY_OF_WEEK,AIRLINE,FLIGHT_NUMBER,TAIL_NUMBER,ORIGIN_AIRPORT,DESTINATION_AIRPORT,SCHEDULED_DEPARTURE,DEPARTURE_TIME,DEPARTURE_DELAY,SCHEDULED_ARRIVAL,ARRIVAL_TIME,ARRIVAL_DELAY,DIVERTED,CANCELLED
32,1,1,4,AS,136,N431AS,ANC,SEA,135,,,600,,,0,1
42,1,1,4,AA,2459,N3BDAA,PHX,DFW,200,,,500,,,0,1
68,1,1,4,OO,5254,N746SK,MAF,IAH,510,,,637,,,0,1
82,1,1,4,MQ,2859,N660MQ,SGF,DFW,525,,,700,,,0,1
90,1,1,4,OO,5460,N583SW,RDD,SFO,530,,,700,,,0,1
128,1,1,4,MQ,2926,N932MQ,CHS,DFW,545,,,755,,,0,1
131,1,1,4,OO,6457,N560SW,SMX,LAX,545,,,651,,,0,1
147,1,1,4,MQ,3534,N925MQ,ABI,DFW,550,,,645,,,0,1
166,1,1,4,MQ,3161,N5PBMQ,XNA,DFW,555,,,710,,,0,1
206,1,1,4,AA,175,N3EWAA,DCA,DFW,600,,,835,,,0,1


For the analysis, we are primarily concerned with ARRIVAL_DELAY and DEPARTURE_DELAY. Some rows have null values in those columns. In the sample above, those rows also have null DEPARTURE_TIME and ARRIVAL_TIME values and CANCELLED set to 1. Cancelled flights have no delay to measure, so we can safely remove them from the Flights data.
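
The sample only shows ten rows, so it can help to confirm the pattern across the whole table. The following is a minimal sketch on invented data (column names match the flights dataset) that cross-tabulates null ARRIVAL_DELAY values against the CANCELLED flag. The names `sample`, `null_vs_cancelled`, and `uncancelled_nulls` are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy rows: column names follow the flights dataset, values are invented.
sample = pd.DataFrame({
    "CANCELLED": [0, 1, 0, 1, 0],
    "DEPARTURE_DELAY": [-3.0, np.nan, 12.0, np.nan, 5.0],
    "ARRIVAL_DELAY": [-10.0, np.nan, 7.0, np.nan, np.nan],
})

# Rows: is ARRIVAL_DELAY null? Columns: CANCELLED flag.
null_vs_cancelled = pd.crosstab(
    sample["ARRIVAL_DELAY"].isnull(), sample["CANCELLED"]
)
print(null_vs_cancelled)

# Null arrival delays on flights that were NOT cancelled.
uncancelled_nulls = int(
    (sample["ARRIVAL_DELAY"].isnull() & (sample["CANCELLED"] == 0)).sum()
)
print(uncancelled_nulls)
```

A non-zero count of null delays on flights that were not cancelled means that dropping cancelled flights alone will not clear every null value.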

In [14]:
# Drop rows where the flight was CANCELLED
df_flights.drop(df_flights.index[df_flights['CANCELLED'] == 1], inplace=True)

In [15]:
# Check to see what other Null values remain in the Flights data
df_flights.isnull().sum()

MONTH                      0
DAY                        0
DAY_OF_WEEK                0
AIRLINE                    0
FLIGHT_NUMBER              0
TAIL_NUMBER                0
ORIGIN_AIRPORT             0
DESTINATION_AIRPORT        0
SCHEDULED_DEPARTURE        0
DEPARTURE_TIME             0
DEPARTURE_DELAY            0
SCHEDULED_ARRIVAL          0
ARRIVAL_TIME            2629
ARRIVAL_DELAY          15187
DIVERTED                   0
CANCELLED                  0
dtype: int64

ARRIVAL_TIME and ARRIVAL_DELAY still have rows with null values. Setting these values to zero would skew the results, so it is best to remove those rows from the data.

In [None]:
# Drop remaining rows
df_flights = df_flights[df_flights.ARRIVAL_TIME.notnull()]
df_flights = df_flights[df_flights.ARRIVAL_DELAY.notnull()]
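
As an alternative sketch, the two filters above can be expressed as a single `dropna` call with a `subset` argument. The example uses invented values, and `sample` and `cleaned` are illustrative names.

```python
import numpy as np
import pandas as pd

# Toy rows: column names follow the flights dataset, values are invented.
sample = pd.DataFrame({
    "ARRIVAL_TIME": [408.0, np.nan, 811.0, 756.0],
    "ARRIVAL_DELAY": [-22.0, np.nan, np.nan, -9.0],
})

# Drop any row with a null in either column.
cleaned = sample.dropna(subset=["ARRIVAL_TIME", "ARRIVAL_DELAY"])
print(cleaned)
```

Both forms keep only rows where every listed column is populated. The single call just makes the intent easier to read.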

One final check to see where we stand and how many rows are in the data.

In [17]:
df_flights.isnull().sum(), df_flights.shape

(MONTH                  0
 DAY                    0
 DAY_OF_WEEK            0
 AIRLINE                0
 FLIGHT_NUMBER          0
 TAIL_NUMBER            0
 ORIGIN_AIRPORT         0
 DESTINATION_AIRPORT    0
 SCHEDULED_DEPARTURE    0
 DEPARTURE_TIME         0
 DEPARTURE_DELAY        0
 SCHEDULED_ARRIVAL      0
 ARRIVAL_TIME           0
 ARRIVAL_DELAY          0
 DIVERTED               0
 CANCELLED              0
 dtype: int64,
 (5714008, 16))

The Flights data started with 5,819,079 rows. We now have 5,714,008 rows remaining.<br>
This 1.8% reduction in rows should not have any material impact on our final results.
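
The percentage can be reproduced directly from the two row counts reported by the notebook. Only the variable names below are introduced for illustration.

```python
rows_before = 5_819_079  # shape reported before cleaning
rows_after = 5_714_008   # shape reported after cleaning

rows_removed = rows_before - rows_after
pct_removed = 100 * rows_removed / rows_before

print(f"{rows_removed:,} rows removed ({pct_removed:.1f}%)")
```

The number of removed rows equals the original ARRIVAL_DELAY null count shown earlier (105071), which is consistent with every removed row having had a missing arrival delay.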

## Join Datasets
- Rename column AIRLINE to AIRLINE_CODE in df_flights
- From df_airports bring in the Airport Name based on the IATA code of the ORIGIN_AIRPORT in df_flights
- From df_airlines bring in the Airline Name based on the IATA code of the AIRLINE_CODE in df_flights
- From df_intl bring in the airport Status based on the IATA code of the ORIGIN_AIRPORT in df_flights

In [19]:
# Rename column
df_flights.rename(columns={'AIRLINE': 'AIRLINE_CODE'}, inplace=True)

In [20]:
# Bring in Airport Name and drop additional columns
df_flights = df_flights.merge(df_airports,left_on='ORIGIN_AIRPORT', right_on='IATA_CODE', how='left').drop(columns=[
    'IATA_CODE', 'CITY', 'STATE', 'COUNTRY', 'LATITUDE', 'LONGITUDE'])

In [21]:
# Bring in Airline Name and drop additional columns
df_flights = df_flights.merge(df_airlines, left_on='AIRLINE_CODE', right_on='IATA_CODE', how='left').drop(columns=['IATA_CODE'])

In [22]:
# Bring in airport Status and drop additional columns
df_flights = df_flights.merge(df_intl, left_on='ORIGIN_AIRPORT', right_on='IATA-Code', how='left').drop(columns=['CITY','AIRPORT_NAME', 'STATE', 'IATA-Code'])

# Set Status to REG for all flights that are not already identified as INTL
df_flights['Status'] = df_flights['Status'].fillna("REG")
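
Because every origin that fails to match the international list is filled with "REG", any unrecognized airport code is silently labelled Regional. The sketch below, on invented data, shows one way to inspect which codes did not match, using the `indicator` and `validate` options of `DataFrame.merge`. The code "XXX" and the names `flights`, `intl`, `merged`, and `labelled_reg` are hypothetical.

```python
import pandas as pd

# Toy tables: "XXX" stands in for a code missing from every lookup table.
flights = pd.DataFrame({"ORIGIN_AIRPORT": ["ATL", "ABI", "XXX", "ATL"]})
intl = pd.DataFrame({"IATA-Code": ["ATL", "BOS"], "Status": ["INTL", "INTL"]})

merged = flights.merge(
    intl,
    left_on="ORIGIN_AIRPORT",
    right_on="IATA-Code",
    how="left",
    validate="many_to_one",  # raises if the lookup has duplicate codes
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)
merged["Status"] = merged["Status"].fillna("REG")

# Origin codes that received the REG label only because nothing matched.
labelled_reg = (
    merged.loc[merged["_merge"] == "left_only", "ORIGIN_AIRPORT"].unique().tolist()
)
print(labelled_reg)
```

Comparing this list against the airports table would show whether each code is a known regional airport or simply an unmatched value.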

Convert the SCHEDULED_DEPARTURE, DEPARTURE_TIME, SCHEDULED_ARRIVAL, and ARRIVAL_TIME columns to proper time values (HH:MM).<br>
Convert ARRIVAL_DELAY and DEPARTURE_DELAY into proper time values (HH:MM).

In [31]:
# Requires the time_functions import from In [1] to have run in the current kernel session
df_flights['SCHEDULED_DEPARTURE'] = tf.flight_time(df_flights, 'SCHEDULED_DEPARTURE')
df_flights['DEPARTURE_TIME'] = df_flights['DEPARTURE_TIME'].apply(tf.format_hour)
df_flights['SCHEDULED_ARRIVAL'] = df_flights['SCHEDULED_ARRIVAL'].apply(tf.format_hour)
df_flights['ARRIVAL_TIME'] = df_flights['ARRIVAL_TIME'].apply(tf.format_hour)

NameError: name 'tf' is not defined
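
The `time_functions` module is not shown in this notebook, so its actual behavior is unknown. As a rough sketch of what such helpers might do, the hypothetical functions below convert an HHMM-style number (such as 5, 2354.0, or 2400) to an "HH:MM" string, and a delay in minutes to a signed "HH:MM" string. The names `format_hour_sketch` and `format_delay_sketch`, and the choice to map 2400 to "00:00", are assumptions.

```python
import math


def format_hour_sketch(value):
    """Convert an HHMM number (e.g. 5, 2354.0, 2400) to an 'HH:MM' string."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return None
    hours, minutes = divmod(int(value), 100)
    hours %= 24  # assumption: treat 2400 as midnight, 00:00
    return f"{hours:02d}:{minutes:02d}"


def format_delay_sketch(minutes):
    """Convert a delay in minutes (may be negative) to a signed 'HH:MM' string."""
    sign = "-" if minutes < 0 else ""
    hours, mins = divmod(abs(int(minutes)), 60)
    return f"{sign}{hours:02d}:{mins:02d}"


print(format_hour_sketch(2354.0), format_hour_sketch(5))
print(format_delay_sketch(-11.0), format_delay_sketch(95))
```

Either function could be applied to a column with `Series.apply`, in the same way the cell above applies `tf.format_hour`.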

In [None]:
df_flights.head()

Breakdown of flights associated with International and Regional airports in df_flights.

In [23]:
# Count of INTL and REG flights
df_flights['Status'].value_counts()

Status
INTL    4217609
REG     1496399
Name: count, dtype: int64
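
To put the two counts in proportion, the shares can be computed from the figures reported above. The variable names are introduced for illustration.

```python
intl_flights = 4_217_609  # INTL count reported above
reg_flights = 1_496_399   # REG count reported above

total = intl_flights + reg_flights
intl_share = intl_flights / total
reg_share = reg_flights / total

print(f"INTL: {intl_share:.1%}, REG: {reg_share:.1%} of {total:,} flights")
```

On a DataFrame, `df_flights['Status'].value_counts(normalize=True)` would give the same proportions directly.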

Split df_flights into two new datasets: df_intl and df_reg

In [26]:
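# Split into INTL and REG datasets
# Note: this rebinds df_intl, which previously held the international airports lookup table
df_intl = df_flights[df_flights['Status']=='INTL'].copy()
df_reg = df_flights[df_flights['Status']=='REG'].copy()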

## Data Analysis


In [29]:
df_intl.describe()

Unnamed: 0,MONTH,DAY,DAY_OF_WEEK,FLIGHT_NUMBER,SCHEDULED_DEPARTURE,DEPARTURE_TIME,DEPARTURE_DELAY,SCHEDULED_ARRIVAL,ARRIVAL_TIME,ARRIVAL_DELAY,DIVERTED,CANCELLED
count,4217609.0,4217609.0,4217609.0,4217609.0,4217609.0,4217609.0,4217609.0,4217609.0,4217609.0,4217609.0,4217609.0,4217609.0
mean,6.236483,15.6957,3.924523,1990.353,1354.636,1360.87,10.15965,1519.189,1498.15,4.967092,0.0,0.0
std,3.375855,8.765089,1.990088,1649.394,486.3643,500.1293,36.37087,513.7135,536.2432,38.86724,0.0,0.0
min,1.0,1.0,1.0,1.0,1.0,1.0,-68.0,1.0,1.0,-87.0,0.0,0.0
25%,3.0,8.0,2.0,687.0,935.0,937.0,-4.0,1125.0,1114.0,-13.0,0.0,0.0
50%,6.0,16.0,4.0,1553.0,1340.0,1346.0,-1.0,1539.0,1530.0,-5.0,0.0,0.0
75%,9.0,23.0,6.0,2739.0,1747.0,1759.0,9.0,1940.0,1937.0,9.0,0.0,0.0
max,12.0,31.0,7.0,8445.0,2359.0,2400.0,1670.0,2400.0,2400.0,1665.0,0.0,0.0
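
In the summary above, the mean DEPARTURE_DELAY is about 10 minutes while the median is -1, which suggests a skewed distribution where a few long delays pull the mean up. For comparing the INTL and REG groups, medians and the share of delayed flights may therefore be more informative than means alone. The sketch below shows one way to build such a comparison on invented data. The `DELAYED` column, the `summary` name, and the "delay above 0 minutes" threshold are assumptions.

```python
import pandas as pd

# Toy rows: column names follow df_flights, values are invented.
sample = pd.DataFrame({
    "Status": ["INTL", "INTL", "INTL", "REG", "REG", "REG"],
    "DEPARTURE_DELAY": [20.0, -3.0, 5.0, -1.0, 0.0, 12.0],
})

# Assumed threshold: any positive departure delay counts as delayed.
sample["DELAYED"] = sample["DEPARTURE_DELAY"] > 0

summary = sample.groupby("Status").agg(
    flights=("DEPARTURE_DELAY", "size"),
    mean_delay=("DEPARTURE_DELAY", "mean"),
    median_delay=("DEPARTURE_DELAY", "median"),
    share_delayed=("DELAYED", "mean"),  # mean of booleans = proportion True
)
print(summary)
```

Grouping the combined table by Status keeps both airport types side by side, which avoids running `describe()` separately on each split.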


In [None]:
#df_reg.to_csv('data/reg.csv', encoding='utf-8', index=False)

In [None]:
df_reg.head()