# Airline On-Time Performance Data Exploratory Analysis
## by Esosa Orumwese

## Investigation Overview

The goal of this investigation was to generate insights from the 2007 airline on-time performance data by answering the following questions:
1. Were there more delayed flights than there were early or on-time flights?
2. How does the number of registered flights vary per month?
3. What is the distribution of early, on-time, delayed and cancelled flights per month?
4. What is the distribution of daily flight cancellations?
5. How does the number of registered flights compare amongst airlines?
6. What is the distribution of delays for the top airlines?

## Dataset Overview

The dataset consists of flight arrival and departure details for all commercial flights within the USA in 2007. It is a large dataset with nearly 7.5 million records and 25 variables, and it takes up 1.2 gigabytes of space. The data originally comes from RITA, where it is described in detail.

In [1]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb

%matplotlib inline

# suppress warnings from final output
import warnings
warnings.simplefilter("ignore")

In [2]:
# load the datasets into pandas dataframes
airline_df = pd.read_csv('airline_clean.csv')
airports = pd.read_csv('airports.csv')
planes_data = pd.read_csv('plane_clean.csv')
carriers = pd.read_csv('carriers.csv')

In [3]:
# dataset overview
airline_df.head(10)

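| | Date | DepTime | CRSDepTime | ArrTime | CRSArrTime | UniqueCarrier | FlightNum | ArrDelay | DepDelay | Origin | ... | TaxiIn | TaxiOut | Cancelled | CancellationCode | Diverted | CarrierDelay | WeatherDelay | NASDelay | SecurityDelay | LateAircraftDelay |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2007-01-01 | 1232.0 | 1225 | 1341.0 | 1340 | WN | 2891 | 1.0 | 7.0 | SMF | ... | 4 | 11 | False | | False | 0 | 0 | 0 | 0 | 0 |
| 1 | 2007-01-01 | 1918.0 | 1905 | 2043.0 | 2035 | WN | 462 | 8.0 | 13.0 | SMF | ... | 5 | 6 | False | | False | 0 | 0 | 0 | 0 | 0 |
| 2 | 2007-01-01 | 2206.0 | 2130 | 2334.0 | 2300 | WN | 1229 | 34.0 | 36.0 | SMF | ... | 6 | 9 | False | | False | 3 | 0 | 0 | 0 | 31 |
| 3 | 2007-01-01 | 1230.0 | 1200 | 1356.0 | 1330 | WN | 1355 | 26.0 | 30.0 | SMF | ... | 3 | 8 | False | | False | 23 | 0 | 0 | 0 | 3 |
| 4 | 2007-01-01 | 831.0 | 830 | 957.0 | 1000 | WN | 2278 | -3.0 | 1.0 | SMF | ... | 3 | 9 | False | | False | 0 | 0 | 0 | 0 | 0 |
| 5 | 2007-01-01 | 1430.0 | 1420 | 1553.0 | 1550 | WN | 2386 | 3.0 | 10.0 | SMF | ... | 2 | 7 | False | | False | 0 | 0 | 0 | 0 | 0 |
| 6 | 2007-01-01 | 1936.0 | 1840 | 2217.0 | 2130 | WN | 409 | 47.0 | 56.0 | SMF | ... | 5 | 7 | False | | False | 46 | 0 | 0 | 0 | 1 |
| 7 | 2007-01-01 | 944.0 | 935 | 1223.0 | 1225 | WN | 1131 | -2.0 | 9.0 | SMF | ... | 4 | 9 | False | | False | 0 | 0 | 0 | 0 | 0 |
| 8 | 2007-01-01 | 1537.0 | 1450 | 1819.0 | 1735 | WN | 1212 | 44.0 | 47.0 | SMF | ... | 5 | 7 | False | | False | 20 | 0 | 0 | 0 | 24 |
| 9 | 2007-01-01 | 1318.0 | 1315 | 1603.0 | 1610 | WN | 2456 | -7.0 | 3.0 | SMF | ... | 5 | 8 | False | | False | 0 | 0 | 0 | 0 | 0 |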


In [4]:
# summary statistics of the numeric columns
airline_df.describe()

| | DepTime | CRSDepTime | ArrTime | CRSArrTime | FlightNum | ArrDelay | DepDelay | Distance | TaxiIn | TaxiOut | CarrierDelay | WeatherDelay | NASDelay | SecurityDelay | LateAircraftDelay |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 7292440.0 | 7453188.0 | 7275261.0 | 7453188.0 | 7453188.0 | 7275261.0 | 7292440.0 | 7453188.0 | 7453188.0 | 7453188.0 | 7453188.0 | 7453188.0 | 7453188.0 | 7453188.0 | 7453188.0 |
| mean | 1339.221 | 1330.596 | 1482.104 | 1495.391 | 2188.106 | 10.19221 | 11.39917 | 719.8048 | 6.691974 | 16.30016 | 3.865245 | 0.7700931 | 3.783703 | 0.02373561 | 5.099152 |
| std | 479.8517 | 464.7069 | 507.2232 | 481.5892 | 1971.958 | 39.3078 | 36.14195 | 562.3055 | 5.151336 | 11.83398 | 20.84244 | 9.619564 | 16.17673 | 1.084997 | 21.27757 |
| min | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | -312.0 | -305.0 | 11.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 930.0 | 930.0 | 1107.0 | 1115.0 | 590.0 | -9.0 | -4.0 | 319.0 | 4.0 | 10.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 1329.0 | 1322.0 | 1513.0 | 1520.0 | 1509.0 | 0.0 | 0.0 | 569.0 | 5.0 | 14.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 1733.0 | 1720.0 | 1911.0 | 1906.0 | 3553.0 | 14.0 | 11.0 | 946.0 | 8.0 | 19.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| max | 2400.0 | 2359.0 | 2400.0 | 2400.0 | 9602.0 | 2598.0 | 2601.0 | 4962.0 | 545.0 | 530.0 | 2580.0 | 1429.0 | 1386.0 | 382.0 | 1031.0 |


## Were there more delayed flights than there were early or on-time flights?
The dataset contained many negative delay values, which signify flights that departed early. Departure delay was therefore grouped into 4 categories:
* OnTime_Early: delay times less than or equal to zero,
* Small_Delay: delay times greater than zero and less than 15 minutes,
* Medium_Delay: delay times between 15 and 45 minutes,
* Large_Delay: delay times greater than 45 minutes.

About 55% of the flights in 2007 were either on time or early, while 21.95% experienced a small delay and 2% were cancelled.

![img](figures/flight_result_barchart.png)
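The grouping described above could be sketched as follows. This is a minimal illustration rather than the notebook's actual code: the function name `categorize_delay`, the `Cancelled` label, and the `Unknown` fallback are assumptions, while `DepDelay`, `Cancelled` and the four delay-group names come from the dataset and the list above.

```python
import numpy as np
import pandas as pd

# "Cancelled" as a group label is an assumption; the other names follow the list above.
DELAY_LABELS = ["Cancelled", "OnTime_Early", "Small_Delay", "Medium_Delay", "Large_Delay"]

def categorize_delay(dep_delay: pd.Series, cancelled: pd.Series) -> pd.Series:
    """Assign each flight to a delay group from its departure delay in minutes."""
    # np.select takes the first condition that is True, so the order matters.
    conditions = [
        cancelled.astype(bool),  # cancelled flights have no departure delay
        dep_delay <= 0,          # early or exactly on time
        dep_delay < 15,          # more than 0 and under 15 minutes
        dep_delay <= 45,         # 15 to 45 minutes
        dep_delay > 45,          # more than 45 minutes
    ]
    groups = np.select(conditions, DELAY_LABELS, default="Unknown")
    return pd.Series(groups, index=dep_delay.index, name="DelayGroup")

# Toy data covering each boundary; not taken from the real dataset.
toy = pd.DataFrame({
    "DepDelay": [-3.0, 0.0, 7.0, 15.0, 45.0, 46.0, np.nan],
    "Cancelled": [False, False, False, False, False, False, True],
})
toy["DelayGroup"] = categorize_delay(toy["DepDelay"], toy["Cancelled"])

# Percentage of flights in each group, as plotted in a bar chart like the one above.
share_pct = toy["DelayGroup"].value_counts(normalize=True) * 100
```

Explicit conditions are used instead of `pd.cut` so that a delay of exactly 15 minutes lands in Medium_Delay, matching the "less than 15 minutes" wording for small delays.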

## How does the number of registered flights vary per month?

The median number of registered flights per month in 2007 appears to be above 600,000. February had the fewest flights (565,604), while July and August had the most (648,544 and 653,276 respectively).

![img](figures/no_of_booked_flights_per_month.png)
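Monthly totals like these could be derived by counting rows per calendar month. The sketch below assumes the `Date` column shown in the dataset overview; the helper name `flights_per_month` and the toy data are hypothetical.

```python
import pandas as pd

def flights_per_month(df: pd.DataFrame, date_col: str = "Date") -> pd.Series:
    """Count the number of flight records in each calendar month."""
    months = pd.to_datetime(df[date_col]).dt.month
    counts = months.value_counts().sort_index()
    return counts.rename_axis("Month").rename("Flights")

# Toy data, not taken from the real dataset.
toy = pd.DataFrame({
    "Date": ["2007-01-01", "2007-01-15", "2007-02-03",
             "2007-07-04", "2007-07-05", "2007-07-06"],
})
monthly = flights_per_month(toy)

# Months with the most and the fewest flights.
busiest, quietest = monthly.idxmax(), monthly.idxmin()
```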

## What is the distribution of early, on-time, delayed and cancelled flights per month?

There is a clear and gradual decrease in the percentage of delays for each month, as seen in the heatmap below.

![img](figures/pct_of_DelayGroup_month.png)
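The percentages behind a heatmap of this kind could be computed with a row-normalized cross-tabulation. This is a sketch under assumptions: it presumes a `DelayGroup` column already exists (the name is inferred from the figure file names), and `delay_group_pct_by_month` is a hypothetical helper.

```python
import pandas as pd

def delay_group_pct_by_month(df, date_col="Date", group_col="DelayGroup"):
    """Percentage of flights in each delay group, per month (rows sum to 100)."""
    month = pd.to_datetime(df[date_col]).dt.month.rename("Month")
    return pd.crosstab(month, df[group_col], normalize="index") * 100

# Toy data, not taken from the real dataset.
toy = pd.DataFrame({
    "Date": ["2007-01-01", "2007-01-02", "2007-01-03", "2007-01-04",
             "2007-02-01", "2007-02-02"],
    "DelayGroup": ["OnTime_Early", "OnTime_Early", "Small_Delay", "Cancelled",
                   "OnTime_Early", "Large_Delay"],
})
pct = delay_group_pct_by_month(toy)

# Dropping the dominant on-time/early column makes the smaller groups easier to compare.
pct_delays_only = pct.drop(columns="OnTime_Early")
```

Either table can then be passed to `sb.heatmap(pct, annot=True)` to draw the chart.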

Looking at the distribution without the on-time and early flights, we can see that most delays occurred in June, July, August and December, while September to November had the highest percentage of on-time flights. February, followed by December, had the highest percentage of cancelled flights of any month in 2007.

![img](figures/pct_of_DelayGroup_month_except_OnTime.png)

## What is the distribution of daily flight cancellations?

Plotting the log transform of the right-skewed daily flight cancellations on a calendar plot shows three periods of increased cancellations: from January to mid-April, from June to August, and from the last week of November through December.

![img](figures/calplot_log_daily_flight_cancellations.png)
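The daily series behind a calendar plot like this one could be built as sketched below. The helper name `daily_cancellations` is hypothetical, and `np.log1p` (log of 1 + x) is an assumption made so that days with zero cancellations stay defined; the source does not say which log transform was used.

```python
import numpy as np
import pandas as pd

def daily_cancellations(df, date_col="Date", flag_col="Cancelled"):
    """Total number of cancelled flights on each day."""
    dates = pd.to_datetime(df[date_col])
    return df[flag_col].astype(int).groupby(dates).sum().rename("Cancellations")

# Toy data, not taken from the real dataset.
toy = pd.DataFrame({
    "Date": ["2007-01-01"] * 3 + ["2007-01-02"] * 2 + ["2007-01-03"],
    "Cancelled": [True, True, False, False, False, True],
})
daily = daily_cancellations(toy)

# Assumed transform: log(1 + x) compresses the right skew and handles zero counts.
log_daily = np.log1p(daily)
```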

## How does the number of registered flights compare amongst airlines?

Southwest Airlines Co. was the most used airline in 2007 with a total of 1,158,878 registered flights, followed by American Airlines Inc. and Skywest Airlines Inc. with 615,933 and 583,696 registered flights respectively. The least used airline was Aloha Airlines Inc., with just 45,972 registered flights.

![img](figures/registered_flights_per_airline.png)
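Per-airline totals could be obtained by mapping carrier codes to names and counting rows. The sketch assumes the carriers lookup has `Code` and `Description` columns, which is not shown in this notebook; `flights_per_airline` is a hypothetical helper.

```python
import pandas as pd

def flights_per_airline(flights: pd.DataFrame, carriers: pd.DataFrame) -> pd.Series:
    """Count flights per airline, using the full carrier name where one is known."""
    # Assumed lookup layout: one row per carrier with "Code" and "Description".
    names = carriers.set_index("Code")["Description"]
    # Codes missing from the lookup fall back to the raw code.
    airline = flights["UniqueCarrier"].map(names).fillna(flights["UniqueCarrier"])
    return airline.value_counts().rename("Flights")

# Toy data, not taken from the real dataset.
toy_flights = pd.DataFrame({"UniqueCarrier": ["WN", "WN", "WN", "AA", "AA", "AQ"]})
toy_carriers = pd.DataFrame({
    "Code": ["WN", "AA"],
    "Description": ["Southwest Airlines Co.", "American Airlines Inc."],
})
per_airline = flights_per_airline(toy_flights, toy_carriers)
```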

## What is the distribution of delays for the top airlines?
Airlines with more than 450k registered flights were classified as top airlines. Visualizing by delay group, we can see that although Southwest Airlines had the most registered flights, it had the lowest share of on-time or early flights (44%) among the airlines in the 'more than 450k' group. Delta Air Lines Inc., which had the fewest registered flights in that group, had the highest share of on-time or early flights at 61%.

![img](figures/delay_group_for_top_airlines.png)
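Selecting the top airlines and comparing their delay groups could look like the sketch below. It assumes a `DelayGroup` column already exists; `top_airline_delay_pct` and its `min_flights` parameter are hypothetical, with 450,000 being the threshold used on the real data.

```python
import pandas as pd

def top_airline_delay_pct(df, min_flights, carrier_col="UniqueCarrier",
                          group_col="DelayGroup"):
    """Delay-group percentages for carriers with more than `min_flights` flights."""
    counts = df[carrier_col].value_counts()
    top = counts[counts > min_flights].index
    subset = df[df[carrier_col].isin(top)]
    # normalize="index" makes each carrier's row sum to 100%.
    return pd.crosstab(subset[carrier_col], subset[group_col], normalize="index") * 100

# Toy data, not taken from the real dataset; a threshold of 2 stands in for 450,000.
toy = pd.DataFrame({
    "UniqueCarrier": ["WN"] * 4 + ["DL"] * 3 + ["AQ"],
    "DelayGroup": ["OnTime_Early", "OnTime_Early", "Small_Delay", "Large_Delay",
                   "OnTime_Early", "OnTime_Early", "Medium_Delay", "OnTime_Early"],
})
pct = top_airline_delay_pct(toy, min_flights=2)
```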

The delay groups were further combined into 2 groups. Good delays represented the acceptable outcomes, covering on-time, early and small-delay flights, while bad delays covered medium and large delays and cancelled flights.

Delta Air Lines Inc. had the lowest share of bad delays overall while, of the top airlines, American Eagle Airlines Inc. had the highest.

![img](figures/top_airlines_pct_of_delays.png)
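The good/bad split could be expressed as a simple membership test on the delay group. This is a sketch under assumptions: `GOOD_GROUPS`, `bad_delay_pct` and the `BadDelayPct` name are hypothetical, and any group outside the good set (including cancellations) counts as bad.

```python
import pandas as pd

# Groups treated as acceptable; everything else counts as a bad delay.
GOOD_GROUPS = {"OnTime_Early", "Small_Delay"}

def bad_delay_pct(df, carrier_col="UniqueCarrier", group_col="DelayGroup"):
    """Percentage of each carrier's flights that fall in a bad delay group."""
    is_bad = ~df[group_col].isin(GOOD_GROUPS)
    # The mean of a boolean series is the fraction of True values.
    pct = is_bad.groupby(df[carrier_col]).mean() * 100
    return pct.sort_values().rename("BadDelayPct")

# Toy data, not taken from the real dataset.
toy = pd.DataFrame({
    "UniqueCarrier": ["DL"] * 4 + ["MQ"] * 4,
    "DelayGroup": ["OnTime_Early", "OnTime_Early", "Small_Delay", "Large_Delay",
                   "OnTime_Early", "Medium_Delay", "Cancelled", "Large_Delay"],
})
bad = bad_delay_pct(toy)
```

Sorting in ascending order puts the carrier with the smallest share of bad delays first.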

>**Generate Slideshow**: Once you're ready to generate your slideshow, use the `jupyter nbconvert` command to generate the HTML slide show. From the terminal or command line, use the following expression.

In [None]:
!jupyter nbconvert slide_deck.ipynb --to slides --post serve --no-input --no-prompt

> This should open a tab in your web browser where you can scroll through the presentation. Sub-slides can be accessed by pressing 'down' when viewing their parent slide.