# Part II - Airline on-time performance - Exploration
## by Juanita Smith

## Introduction
Have you ever been stuck in an airport because your flight was delayed or cancelled and wondered if you could have predicted it if you'd had more data? This is our chance to find out.

This analysis will be focused on predicting flight delays or cancellations.

> This dataset reports flights in the United States, including carriers, arrival and departure delays, and reasons for delays, from 1987 to 2008.
> - See more information from the data expo challenge in 2009 [here](https://community.amstat.org/jointscsg-section/dataexpo/dataexpo2009).
> - See a full description of the features [here](https://www.transtats.bts.gov/DatabaseInfo.asp?QO_VQ=EFD&Yv0x=D.)
> - Data can be downloaded from [here](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/HG7NV7).

Dictionary:
1) Year 1987-2008 
2) Month 1-12 
3) DayofMonth 1-31 
4) DayOfWeek 1 (Monday) - 7 (Sunday) 
5) DepTime actual departure time (local, hhmm) 
6) CRSDepTime scheduled departure time (local, hhmm) 
7) ArrTime actual arrival time (local, hhmm) 
8) CRSArrTime scheduled arrival time (local, hhmm) 
9) UniqueCarrier unique carrier code 
10) FlightNum flight number 
11) TailNum plane tail number 
12) ActualElapsedTime in minutes 
13) CRSElapsedTime in minutes 
14) AirTime in minutes 
15) ArrDelay arrival delay, in minutes 
16) DepDelay departure delay, in minutes 
17) Origin origin IATA airport code 
18) Dest destination IATA airport code
19) Distance in miles 
20) TaxiIn - The time elapsed between wheels down and arrival at the destination airport gate in minutes
21) TaxiOut - The time elapsed between departure from the origin airport gate and wheels off in minutes
22) Cancelled was the flight cancelled? 
23) CancellationCode reason for cancellation (A = carrier, B = weather, C = NAS, D = security) 
24) Diverted 1 = yes, 0 = no 
25) CarrierDelay in minutes
26) WeatherDelay in minutes 
27) NASDelay in minutes 
28) SecurityDelay in minutes 
29) LateAircraftDelay in minutes


**Important to note:** According to the documentation, a late flight is defined as a flight arriving or departing 15 minutes or more after the scheduled time.






Important points:

- When a flight is cancelled, most departure- and arrival-related fields are missing, because the flight never took off.
- When a flight is diverted, departure-related columns are captured, whilst arrival-related columns are missing.
- A flight is considered delayed if its arrival delay (`arrDelay`) is >= 15 minutes.
- Reasons for delay are only captured if the arrival delay is >= 15 minutes.
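
The points above can be checked on the data itself. The following is a minimal sketch on a tiny made-up frame: the values and the `missing_share_by_status` helper are assumptions for illustration, and only the column names follow this notebook's conventions.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration: one row per flight
toy = pd.DataFrame({
    'cancelled': [True, False, False, False],
    'diverted':  [False, True, False, False],
    'depDelay':  [np.nan, 12.0, 3.0, 40.0],
    'arrDelay':  [np.nan, np.nan, -2.0, 35.0],
})


def missing_share_by_status(df, cols):
    """Share of missing values per column for cancelled, diverted and completed flights."""
    status = np.select([df['cancelled'], df['diverted']],
                       ['cancelled', 'diverted'],
                       default='completed')
    return df[cols].isna().groupby(status).mean()


result = missing_share_by_status(toy, ['depDelay', 'arrDelay'])
print(result)
```

On the real data, a share close to 1 for cancelled flights in both columns, and for diverted flights only in the arrival column, would be consistent with the points listed above.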

In [None]:
# import all packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

import seaborn as sns
import time
import glob

# clear the garbage to free memory as we are working with huge datasets
import gc 

# import warnings
# warnings.filterwarnings("ignore")

# Import custom modules
from src.utils import reduce_mem_usage, create_folder, change_width

# set plots to be embedded inline
%matplotlib inline

# suppress matplotlib user warnings
import warnings
warnings.filterwarnings("ignore", category=UserWarning, module="matplotlib")

# use high resolution if this project is run on an apple device
# %config InlineBackend.figure_format='retina'

# Make your Jupyter Notebook wider
from IPython.display import display, HTML
display(HTML('<style>.container { width:80% !important; }</style>'))

# environment settings
# display all columns and rows during visual inspection
pd.options.display.max_columns = None
pd.options.display.max_rows = None


# stop scientific notation on graphs
#pd.options.display.float_format = '{:.02}'.format


from pylab import rcParams
from statsmodels.graphics import tsaplots
import statsmodels.api as sm

In [None]:
def set_pub():
    """Apply publication-style matplotlib settings."""
    rcParams.update({
        "font.weight": "ultralight",  # very light font weight
        "xtick.labelsize": 8,         # small x tick labels
        "ytick.labelsize": 8,         # small y tick labels
        "lines.linewidth": 1,         # thin lines
#         "lines.color": "k",     # black lines
#         "grid.color": "0.5",    # gray gridlines
#         "grid.linestyle": "-",  # solid gridlines
#         "grid.linewidth": 0.5,  # thin gridlines
        "savefig.dpi": 300,           # higher resolution output
    })

In [None]:
# rcParams.keys()

In [None]:
sns.set_style("whitegrid")
# BASE_COLOR = sns.color_palette("BrBG")[-2]
BASE_COLOR = '#196689'
BASE_COLOR_ARR = '#6c92ab'
# BASE_COLOR_DEP = '#e67f83'

BASE_COLOR_DEP = sns.color_palette("husl", 15)[3]

# 0 = down arrow, 1 = up arrow for growth and shrink indicators
SYMBOLS = [u'\u25BC', u'\u25B2'] 

# default plot size
# plt.rcParams['figure.figsize'] = 12,4

SMALL_SIZE = 8
MEDIUM_SIZE = 10
BIGGER_SIZE = 12

# fig = plt.figure()
# fig.rcParams('suptitle', fontsize=20)

plt.rc('font', size=BIGGER_SIZE, weight='ultralight', family='sans-serif')          # controls default text sizes
plt.rc('axes', titlesize=BIGGER_SIZE, titlecolor='black', titleweight='bold')     # fontsize of the axes title
plt.rc('axes', labelsize=MEDIUM_SIZE, labelcolor='black', labelweight='ultralight')     # fontsize of the x and y labels
plt.rc('xtick', labelsize=SMALL_SIZE)    # fontsize of the tick labels
plt.rc('ytick', labelsize=SMALL_SIZE)    # fontsize of the tick labels
plt.rc('legend', fontsize=SMALL_SIZE)    # legend fontsize
plt.rc('figure', titlesize=BIGGER_SIZE, titleweight="bold", figsize=[8,4])  # fontsize of the figure title
# plt.rc('axes', labelsize=16, titlesize=16
# plt.rcParams['font.family'] = 'Serif'
# plt.rc('text')
# plt.rcParams['xtick.labelsize']=8

# mpl.rcParams['lines.linewidth'] = 2
# mpl.rcParams['lines.linestyle'] = '--'

# plt.rcParams.update({
#     "font.weight": "bold",
#     "xtick.major.size": 5,
#     "xtick.major.pad": 7,
#     "xtick.labelsize": 15,
#     "grid.color": "0.5",
#     "grid.linestyle": "-",
#     "grid.linewidth": 5,
#     "lines.linewidth": 2,
#     "lines.color": "g",
# })

In [None]:
# SMALL_SIZE = 8
# MEDIUM_SIZE = 10
# BIGGER_SIZE = 14

# plt.rc('font', size=BIGGER_SIZE)          # controls default text sizes
# plt.rc('axes', titlesize=SMALL_SIZE)     # fontsize of the axes title
# plt.rc('axes', labelsize=SMALL_SIZE)     # fontsize of the x and y labels
# plt.rc('xtick', labelsize=SMALL_SIZE)    # fontsize of the tick labels
# plt.rc('ytick', labelsize=SMALL_SIZE)    # fontsize of the tick labels
# plt.rc('legend', fontsize=SMALL_SIZE)    # legend fontsize
# plt.rc('figure', titlesize=BIGGER_SIZE)  # fontsize of the figure title

# matplotlib.rcParams.update({'font.size': 22})
# matplotlib.rc('font', **font)

# plt.title("Simple plot", fontdict={'fontweight': 'normal', 'weight': 'normal', 'family': 'sans-serif'})

In [None]:
FILE_NAME_RAW = '../data/flights_raw.pkl'
FILE_NAME_CLEAN = '../data/flights_clean.pkl'

In [None]:
sns.color_palette()

In [None]:
sns.color_palette("husl", 15)

In [None]:
test = sns.color_palette("Paired")
test

In [None]:
sns.color_palette("BrBG")

In [None]:
sns.color_palette("muted")

In [None]:
# blue
# #93c6e6

# matching complementary
# #e6b093

### Dataset Overview

In [None]:
# load the cleaned file 
flights = pd.read_pickle(FILE_NAME_CLEAN)
flights.sample(5)

In [None]:
# make sure datatypes were preserved from cleaning step
flights.info(verbose=True, show_counts=True)

In [None]:
flights.shape

In [None]:
# get carrier descriptions
carriers = pd.read_csv('../data/lookup_tables/carriers.csv', index_col='Code')
carriers.head()

In [None]:
# get airplane data
planes = pd.read_csv('../data/lookup_tables/plane-data.csv', index_col='tailnum')
planes.tail()

In [None]:
# get airport descriptions
airports = pd.read_csv('../data/lookup_tables/airports.csv', index_col='iata')
airports.head()

### Feature engineering

In [None]:
# create a datetime field for time series analysis
flights.rename(columns={'dayofMonth':'day'}, inplace=True)
flights['date'] = pd.to_datetime(flights[['year', 'month', 'day']], yearfirst=True, errors='raise')
flights[['year', 'month', 'day', 'arrTime', 'date']][:5]

In [None]:
# build a flight status field to compare cancelled, diverted, on-time and delayed flights
# np.select uses the first matching condition, so the delay checks only apply to flights
# that were neither cancelled nor diverted; flights matching no condition stay missing (None)
flights['flight_status'] = np.select(
    [flights['cancelled'] == True,
     flights['diverted'] == True,
     flights['arrDelay'] >= 15,
     flights['arrDelay'] < 15],
    ['cancelled', 'diverted', 'delayed', 'on_time'],
    default=None)

flights['flight_status'].value_counts()

In [None]:
# add new 'lane' feature consisting of both departure and arrival airport
flights['lane'] = flights['origin'] + '-' + flights['dest']

In [None]:
# extract delayed and on-time flights into separate datasets
flight_delays = flights.loc[flights['flight_status'] == 'delayed'].copy()
flight_ontime = flights.loc[flights['flight_status'] == 'on_time'].copy()

In [None]:
# get all columns of type float16
float_columns = list(flight_delays.select_dtypes(include=['float16']).columns)

# convert the float columns to int16; this assumes the missing values came only from
# cancelled and diverted flights, which are excluded from these two datasets
flight_delays[float_columns] = flight_delays[float_columns].astype('int16')
flight_ontime[float_columns] = flight_ontime[float_columns].astype('int16')

In [None]:
flight_delays.info()

In [None]:
# take a sample of delays to speed up performance
sample = np.random.choice(flight_delays.shape[0], 500000, replace=False)
flight_delays_sample = flight_delays.iloc[sample].copy()
flight_delays_sample.shape

In [None]:
# clear the garbage to free memory
gc.collect()

## Univariate Exploration

> In this section, investigate distributions of individual variables. If
you see unusual points or outliers, take a deeper look to clean things up
and prepare yourself to look at relationships between variables.




#### 1. Let's get a first impression of how many flights are ontime, delayed, cancelled or diverted

In [None]:
# calculate number of flights per flight status and plot it
flight_status_summary = flights['flight_status'].value_counts(normalize=True).sort_values(ascending=False)
sns.barplot(x=flight_status_summary.index, y=flight_status_summary, color=BASE_COLOR)
plt.ylabel('')
plt.xlabel('Flight status')
plt.title('Proportions of flight status')

locs, labels = plt.xticks()

# for each bar, print a % text at the top of each bar
for loc, label in zip(locs,labels):
    count = flight_status_summary[label.get_text()]
    pct_string = '{:0.1f}%'.format(count*100)
    plt.text(loc, count+0.01, pct_string, ha='center', color='black', size=8, weight='ultralight')
    
ticks = np.arange(0, 1, 0.1)
labels = ['{:1.0f}%'.format(tick*100) for tick in ticks]
plt.yticks(ticks,labels)

plt.show()

>Around 76% of flights are on time, whereas 21% are delayed. Only about 2% of flights are cancelled or diverted, so they are not the main concern.

#### 2) Start by looking at the distribution of the main variable of interest for predicting delays: `arrDelay`.
Compare the `arrDelay` distribution with the `depDelay` distribution for delayed flights.

In [None]:
# Take a closer look at distribution
# flight_delays[['arrDelay', 'depDelay']].describe([0.25, 0.5, 0.75, 0.85, 0.9, 0.95, 0.99]).round(0)

In [None]:
plt.figure(figsize=[12,4])

xbins = np.arange(0, flight_delays['arrDelay'].max()+15, 15)

# plot 1 - distribution of departure delays
ax1 = plt.subplot(1, 2, 1)
ax1.hist(data=flight_delays, x='depDelay', bins=xbins, color=BASE_COLOR_DEP)

plt.xlabel('Departure delays (in minutes)')
plt.ylabel('Count')
plt.xlim(0, 300)
plt.title('Distribution of departure delays')

# plot 2 - distribution of arrival delays
ax2 = plt.subplot(1, 2, 2, sharey=ax1, sharex=ax1)
ax2.hist(data=flight_delays, x='arrDelay', bins=xbins, color=BASE_COLOR_ARR)
plt.xlabel('Arrival delays (in minutes)')
plt.title('Distribution of arrival delays')
plt.ticklabel_format(style='plain', axis='y')

plt.tight_layout()
plt.show()

In [None]:
# as the distributions of delays are right skewed, let's plot them on a log scale instead

plt.figure(figsize=[12,4])
log_binsize = 0.1
bins= 10 ** np.arange(0, np.log10(flight_delays['arrDelay'].max())+log_binsize, log_binsize)
ticks = [1, 3, 10, 30, 100, 300, 1000]
labels = ['{}'.format(tick) for tick in ticks]

ax2 = plt.subplot(1, 2, 2)
plt.hist(data=flight_ontime, x='arrDelay', bins=bins, color=BASE_COLOR_ARR, alpha=0.2)
plt.hist(data=flight_delays, x='arrDelay', bins=bins, color=BASE_COLOR_ARR)
plt.xscale('log')
plt.xticks(ticks, labels)
plt.xlabel('Arrival delays (in minutes)')
plt.title('Distribution of arrival delays')
plt.ticklabel_format(style='plain', axis='y')
plt.xlim(1, 300)

ax1 = plt.subplot(1, 2, 1, sharex=ax2, sharey=ax2)
plt.hist(data=flight_delays, x='depDelay', bins=bins, color=BASE_COLOR_DEP)
plt.xticks(ticks, labels)
plt.xlabel('Departure delays (in minutes)')
plt.title('Distribution of departure delays')
plt.ticklabel_format(style='plain', axis='y')
plt.xlim(1, 300)

plt.tight_layout()

Both departure and arrival delays are right skewed. When plotted on a log scale, arrival delays are still right skewed, whilst departure delays look closer to a normal distribution.
This is likely because arrival delays were restricted to >= 15 minutes, which was not the case for departure delays; with the same cut-off, departure delays would probably also look right skewed.

Most delayed arrivals are around 20 minutes late, and the number of flights declines steeply for delays above 20 minutes.
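
The effect of the 15-minute cut-off on the shape of the log-scaled distribution can be illustrated with synthetic data. This is only a sketch: the log-normal shape and its parameters are assumptions chosen for illustration, not estimates from the flight data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic "delays" in minutes, log-normally distributed (an assumption for illustration)
delays = pd.Series(rng.lognormal(mean=3.0, sigma=0.8, size=10_000))

# Same data restricted to delays of 15 minutes or more, mimicking the arrival delay cut-off
late_only = delays[delays >= 15]

skew_raw = delays.skew()
skew_log = np.log10(delays).skew()
skew_log_cut = np.log10(late_only).skew()

print('skewness of raw delays:              {:.2f}'.format(skew_raw))
print('skewness on log scale, all delays:   {:.2f}'.format(skew_log))
print('skewness on log scale, >= 15 min:    {:.2f}'.format(skew_log_cut))
```

In this sketch the log transform removes most of the skew for the full sample, while the truncated sample stays right skewed on the log scale, which matches the pattern seen in the two histograms.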

#### 3) What are the main reasons for delays?

In [None]:
plt.figure(figsize=[8,4])
delay_reasons = ['carrierDelay','weatherDelay','NASDelay','securityDelay','lateAircraftDelay']

# average delay in minutes attributed to each reason, largest first
means = flight_delays[delay_reasons].mean().sort_values(ascending=False)
sns.barplot(x=means.index, y=means.values, color=BASE_COLOR)
plt.title('Average delay per reason')
plt.xlabel('Reasons for delay')
plt.ylabel('Average delay (in minutes)')
plt.show()

In [None]:
# plot the distribution of each delay reason in 15-minute bins
for col in ['carrierDelay','weatherDelay','NASDelay','securityDelay','lateAircraftDelay']:

    plt.figure(figsize=(4,2))
    xbins = np.arange(0, flight_delays[col].max()+15, 15)

    plt.hist(data=flight_delays, x=col, bins=xbins, color=BASE_COLOR_ARR)

    plt.xlabel('{} (in minutes)'.format(col))
    plt.ylabel('Count')
    plt.xlim(0, 150)
    plt.title('Distribution of {}'.format(col))
    plt.show()

In [None]:
flight_delays.describe()

#### 4) When is the best time of day/day of week/time of year to fly to minimise delays?

Build a time series to view flight and delay patterns at different period intervals to discover possible trends and peaks

In [None]:
# build a timeseries dataset with flight status counts per day, with the datetime field as index
flight_timeseries_day = flights.groupby('date')['flight_status'].value_counts().unstack()
flight_timeseries_day['total_flights'] = flight_timeseries_day.sum(axis=1)
flight_timeseries_day['diff_delay'] = flight_timeseries_day['delayed'].diff()
flight_timeseries_day['pct_delay'] = flight_timeseries_day['delayed'].pct_change()
flight_timeseries_day.head()

In [None]:
flight_timeseries_day.tail()

2008 does not contain data for the full year; only January to April is available. Drop this year from the time series so that yearly and monthly totals stay comparable.
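
A quick way to detect partial years is to count the distinct calendar months observed per year. The sketch below uses a made-up daily index (all of 2007 plus January to April 2008); the frame and variable names are assumptions for illustration.

```python
import pandas as pd

# Toy daily index, invented for illustration: all of 2007 plus January to April 2008
days = pd.date_range('2007-01-01', '2008-04-30', freq='D')
daily = pd.DataFrame({'total_flights': 1}, index=days)

# number of distinct calendar months observed in each year
months_per_year = pd.Series(daily.index.month, index=daily.index.year).groupby(level=0).nunique()

# years with fewer than 12 months of data are incomplete
incomplete_years = months_per_year[months_per_year < 12].index.tolist()
complete = daily.loc[~daily.index.year.isin(incomplete_years)]

print(months_per_year)
print('incomplete years:', incomplete_years)
```

Filtering on the list of incomplete years, rather than on a hard-coded year, would keep working if other partial years turned up in the data.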

In [None]:
flight_timeseries_day = flight_timeseries_day.loc[flight_timeseries_day.index.year < 2008]

#### Let's start by looking at the volume of flights vs delays per year

In [None]:
freq = 'Y'
# total_flights is already a daily column, so its yearly sum is the yearly total;
# the daily diff_delay and pct_delay columns are not meaningful once summed, so drop them first
flight_timeseries_year = flight_timeseries_day.drop(columns=['diff_delay', 'pct_delay']).resample(freq).sum()
flight_timeseries_year['pct_delay'] = flight_timeseries_year['delayed'].pct_change()
flight_timeseries_year['pct_all'] = flight_timeseries_year['total_flights'].pct_change()
flight_timeseries_year

In [None]:
plt.figure(figsize=[8,6])

sns.barplot(data=flight_timeseries_year, x=flight_timeseries_year.index.year, y='total_flights', color='lightgrey', label='All flights')
sns.barplot(data=flight_timeseries_year, x=flight_timeseries_year.index.year, y='delayed', color=BASE_COLOR, label='Delayed flights')

# annotate each bar with its year-on-year growth rate
def growth_label(pct):
    """Format a growth rate as a percentage, prefixed with an up or down arrow."""
    symbol = SYMBOLS[1] if pct > 0 else SYMBOLS[0] if pct < 0 else ''
    return '{}{:0.2f}%'.format(symbol, pct*100)

locs, labels = plt.xticks()
for loc, label in zip(locs,labels):
    # rows are indexed by year-end date, e.g. 1988-12-31
    date = pd.to_datetime(label.get_text()).strftime('%Y') + '-12-31'
    counts = flight_timeseries_year.loc[date]

    # the first year has no previous year to compare against
    if pd.isna(counts['pct_delay']):
        continue

    # each label gets the arrow that matches its own growth rate
    plt.text(loc, counts['delayed']+100000, growth_label(counts['pct_delay']), ha='center', color='black', fontsize=8)
    plt.text(loc, counts['total_flights']+100000, growth_label(counts['pct_all']), ha='center', color='black', fontsize=8)
    
binsize=1000000
yticks = np.arange(0, flight_timeseries_year['total_flights'].max()+binsize, binsize)
ylabels = ['{:1.0f}'.format(tick/1000000)+' mil' for tick in yticks]
plt.yticks(yticks, ylabels)    

plt.legend(bbox_to_anchor=(1, 1), loc='upper left') 
plt.title('Growth of total flights vs Delays per year')
plt.xlabel('Year')
plt.ylabel('Number of flights')

plt.show()

Both the total number of flights and the number of delayed flights increase year on year, and delays grow at a higher rate than flights.
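
Comparing the two growth rates side by side, together with the share of delayed flights, makes this observation easier to verify. The sketch below uses made-up yearly totals; the numbers and the `delay_share` and `delays_outpace_flights` columns are assumptions for illustration.

```python
import pandas as pd

# Toy yearly totals, invented for illustration
yearly = pd.DataFrame(
    {'total_flights': [5_000_000, 5_200_000, 5_300_000],
     'delayed':       [1_000_000, 1_100_000, 1_200_000]},
    index=pd.Index([2001, 2002, 2003], name='year'),
)

# year-on-year growth of all flights and of delayed flights
yearly['pct_all'] = yearly['total_flights'].pct_change()
yearly['pct_delay'] = yearly['delayed'].pct_change()

# share of flights that were delayed in each year
yearly['delay_share'] = yearly['delayed'] / yearly['total_flights']

# True where delays grew faster than flights (the first year has no growth rate)
yearly['delays_outpace_flights'] = yearly['pct_delay'] > yearly['pct_all']

print(yearly.round(3))
```

A rising `delay_share` says the same thing as delays growing faster than flights, but it is easier to read off a single column.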

#### Can we spot peaks in certain months? For example, do the summer months and the Christmas period cause delays at airports?

In [None]:
# sns.countplot(data=flights, x='month', color=BASE_COLOR)
# sns.countplot(data=flight_delays, x='month', color='grey')

# binsize=500000
# ticks = np.arange(0, 4000000, binsize)
# labels = ['{:1.0f}'.format(tick/1000000)+' mil' for tick in ticks]
# plt.yticks(ticks, labels) 
# plt.title('Total flights per month')
# plt.show()

In [None]:
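freq = 'M'
# total_flights is already a daily column, so its monthly sum is the monthly total;
# the daily diff_delay and pct_delay columns are dropped and recalculated per month below
delay_timeseries_month = flight_timeseries_day.drop(columns=['diff_delay', 'pct_delay']).resample(freq).sum()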
delay_timeseries_month['pct_delay'] = delay_timeseries_month['delayed'].pct_change()
delay_timeseries_month['pct_all'] = delay_timeseries_month['total_flights'].pct_change()
delay_timeseries_month['diff_delay'] = delay_timeseries_month['delayed'].diff()
delay_timeseries_month['pct_diff_delay'] = delay_timeseries_month['diff_delay'].pct_change()
delay_timeseries_month.head()

In [None]:
plt.figure(figsize=[10,6])

# bar heights are the average per calendar month across all years; error bars are switched off
sns.barplot(data=delay_timeseries_month, x=delay_timeseries_month.index.month, y='total_flights', color='lightgrey', label='All flights', errorbar=None)
g = sns.barplot(data=delay_timeseries_month, x=delay_timeseries_month.index.month, y='delayed', color=BASE_COLOR, label='Delayed flights', errorbar=None)

# for p in g.patches:
#     g.annotate(format(p.get_height(), '.0f'), 
#                    (p.get_x() + p.get_width() / 2., p.get_height()), 
#                    ha = 'center', va = 'center', 
#                    xytext = (0, 2), 
#                    textcoords = 'offset points')
    
binsize=100000
yticks = np.arange(0, delay_timeseries_month['total_flights'].max()+binsize, binsize)
ylabels = ['{:1.0f}'.format(tick/1000)+'k' for tick in yticks]
plt.yticks(yticks, ylabels)    

plt.legend(bbox_to_anchor=(1, 1), loc='upper left') 
plt.title('Average total flights vs delays per month')
plt.xlabel('Month')
plt.ylabel('Number of flights')

plt.show()

The biggest peaks in delays are in the summer months (June to August) and around Christmas (December). Delays decrease in the spring and autumn months.

In [None]:
plt.figure(figsize=[10,6])

# bar heights are the average per day of week across all days; error bars are switched off
sns.barplot(data=flight_timeseries_day, x=flight_timeseries_day.index.dayofweek, y='total_flights', color='lightgrey', label='All flights', errorbar=None)
g = sns.barplot(data=flight_timeseries_day, x=flight_timeseries_day.index.dayofweek, y='delayed', color=BASE_COLOR, label='Delayed flights', errorbar=None)

for p in g.patches:
    g.annotate(format(p.get_height(), '.0f'), 
                   (p.get_x() + p.get_width() / 2., p.get_height()), 
                   ha = 'center', va = 'center', 
                   xytext = (0, 2), 
                   textcoords = 'offset points')

plt.legend(bbox_to_anchor=(1, 1), loc='upper left') 
plt.title('Average total flights vs delays per day of week')
plt.xlabel('Day of week (0 = Monday, 6 = Sunday)')
plt.ylabel('Number of flights')

plt.tight_layout()

Most delays happen on Mondays, Wednesdays and Thursdays, when there are more flights than on other days. Airports are less busy, with fewer delays, on weekends.
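
Because busy days naturally have more delayed flights, it can help to also look at the share of delayed flights per weekday. The sketch below uses two weeks of made-up daily counts; the numbers and the `delay_share` column are assumptions for illustration.

```python
import pandas as pd

# Toy daily counts for two weeks, invented for illustration (2007-01-01 was a Monday)
days = pd.date_range('2007-01-01', periods=14, freq='D')
daily = pd.DataFrame({
    'total_flights': [100, 90, 100, 100, 95, 60, 70] * 2,
    'delayed':       [25, 18, 24, 26, 20, 9, 14] * 2,
}, index=days)

# pandas numbers weekdays from 0 (Monday) to 6 (Sunday)
weekday = daily.groupby(daily.index.dayofweek).sum()
weekday['delay_share'] = weekday['delayed'] / weekday['total_flights']

print(weekday)
```

If the share is roughly flat across weekdays, the extra delays on busy days are explained by volume alone; a higher share on busy days would point to congestion.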

#### Can we confirm the seasonal patterns for the summer and Christmas periods with time series plots?

In [None]:
# line plot of the number of delayed flights per month
plt.plot(delay_timeseries_month.index, delay_timeseries_month['delayed'])
plt.title('Delayed flights by month')
plt.ylabel('Number of delayed flights')
plt.show()

In [None]:
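# compare the month-on-month relative change of delayed flights with that of all flights
plt.plot(delay_timeseries_month.index, delay_timeseries_month['pct_delay'], label='Delayed flights')
plt.plot(delay_timeseries_month.index, delay_timeseries_month['pct_all'], label='All flights')
plt.title('Month-on-month change in flights')
plt.ylabel('Relative change vs previous month')
plt.legend()
plt.show()

# TODO: try scaling techniques to normalize these numbers for comparison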

In [None]:
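# absolute change in the number of delayed flights compared to the previous month
plt.plot(delay_timeseries_month.index, delay_timeseries_month['diff_delay'])
plt.title('Delayed flights by month - difference from previous month')
plt.ylabel('Change in number of delayed flights')
plt.show()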

In [None]:
# smooth the month-on-month change in delays with moving averages of different window sizes
plt.figure(figsize=[16,8])

# keep tick positions as dates and use formatted strings only as labels
tick_locs = pd.date_range(start=delay_timeseries_month.index.min(), end=delay_timeseries_month.index.max(), freq='3M')
tick_labels = tick_locs.strftime('%Y-%m')

for i, window in enumerate([4, 5, 6], start=1):
    plt.subplot(3, 1, i)
    ma = delay_timeseries_month['pct_delay'].rolling(window=window).mean().dropna()
    plt.plot(ma.index, ma)
    plt.ylabel('Change in delayed flights')
    plt.xlabel('Period')
    plt.title('{} month moving average of the month-on-month change in arrival delays'.format(window))
    plt.xticks(tick_locs, tick_labels, rotation=90)

plt.tight_layout()

In [None]:
# smooth the relative change of the monthly difference in delays with moving averages of different window sizes
plt.figure(figsize=[16,8])

# keep tick positions as dates and use formatted strings only as labels
tick_locs = pd.date_range(start=delay_timeseries_month.index.min(), end=delay_timeseries_month.index.max(), freq='M')
tick_labels = tick_locs.strftime('%Y-%m')

for i, window in enumerate([2, 4, 6], start=1):
    plt.subplot(3, 1, i)
    ma = delay_timeseries_month['pct_diff_delay'].rolling(window=window).mean().dropna()
    plt.plot(ma.index, ma)
    plt.ylabel('Change in delay difference')
    plt.xlabel('Period')
    plt.title('{} month moving average of the change in monthly delay differences'.format(window))
    plt.xticks(tick_locs, tick_labels, rotation=90)

plt.tight_layout()

In [None]:
# plotting autocorrelations
fig = tsaplots.plot_acf(delay_timeseries_month['delayed'], lags=12, alpha=0.1)
plt.show()

In [None]:
# plotting partial autocorrelations
fig = tsaplots.plot_pacf(delay_timeseries_month['delayed'], lags=12, alpha=0.1, method = "ols")
plt.show()

Looking at the autocorrelation and partial autocorrelation plots, we pick up a correlation at lag 1, which suggests that the number of delays in one month may be related to the number in the previous month.
There is also a strong correlation at lag 4, which suggests a seasonal pattern.
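
To see how a seasonal pattern shows up as a peak in the autocorrelations, the sketch below builds a synthetic monthly series with a known 12-month cycle and computes its autocorrelation at each lag with `Series.autocorr`. The series, its trend and its cycle length are assumptions for illustration and are not derived from the flight data.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series, invented for illustration:
# a mild upward trend plus a pattern that repeats every 12 months
t = np.arange(96)
index = pd.date_range('2000-01-01', periods=96, freq='MS')
series = pd.Series(100 + 0.5 * t + 20 * np.sin(2 * np.pi * t / 12), index=index)

# correlation of the series with a copy of itself shifted by 1 to 12 months
acf = pd.Series({lag: series.autocorr(lag=lag) for lag in range(1, 13)})

print(acf.round(2))
print('lag with the highest autocorrelation:', acf.idxmax())
```

In this sketch the autocorrelation is highest at the lag that matches the cycle length and much lower half a cycle away, which is the kind of signature to look for in the plots above.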

In [None]:
rcParams['figure.figsize'] = 14,10
delay_timeseries_month.dropna(inplace=True)
decomposition = sm.tsa.seasonal_decompose(delay_timeseries_month['delayed'], extrapolate_trend='freq')
fig = decomposition.plot()
plt.show()

In [None]:
# rcParams['figure.figsize'] = 10,2
trend = decomposition.trend
ax = trend.plot()

In [None]:
# rcParams['figure.figsize'] = 10,2
season = decomposition.seasonal
ax = season.plot()

**Numerical summary:**

- When the number of flights increases, delays increase as well, although not at the same rate: delays increase more rapidly.
- There is a definite upward trend in flights and delays year on year.
- There is a strong seasonal pattern with two strong peaks: the biggest one around Christmas time (December to March), and another during the summer months (June to August).
- Mondays, Wednesdays and Thursdays are the busiest days at airports; weekends are the quietest.

### Categorical analysis

In [None]:
BASE_COLOR_DEP = sns.color_palette('Paired')[0]
BASE_COLOR_DEP_LIGHT = sns.color_palette('Paired')[0]
BASE_COLOR_ARR = sns.color_palette('Paired')[2]

In [None]:
total_delayed_flights = flight_delays.shape[0]
print('Total delayed flights: {}'.format(total_delayed_flights))

Enrich delays with airport and plane descriptions
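
Inner joins drop every flight whose key has no match in the lookup table, so it is worth checking how many rows an enrichment step would lose. The sketch below shows one way to do this with the `indicator` option of `DataFrame.merge`; the toy frames `flights_toy` and `planes_toy` are made up for illustration.

```python
import pandas as pd

# Toy data, invented for illustration: tail number N3 has no entry in the plane lookup
flights_toy = pd.DataFrame({'tailNum': ['N1', 'N2', 'N3', 'N1'],
                            'arrDelay': [20, 35, 50, 16]})
planes_toy = pd.DataFrame({'tailnum': ['N1', 'N2'],
                           'manufacturer': ['BOEING', 'AIRBUS']})

# a left join with indicator=True flags rows that an inner join would drop
check = flights_toy.merge(planes_toy, how='left', left_on='tailNum',
                          right_on='tailnum', indicator=True)
unmatched = int((check['_merge'] == 'left_only').sum())

# the inner join keeps only the matched rows
inner = flights_toy.merge(planes_toy, how='inner', left_on='tailNum', right_on='tailnum')

print('{} of {} flights have no plane record'.format(unmatched, len(flights_toy)))
```

If the number of unmatched rows is large, the enriched dataset no longer represents all delayed flights, and percentages calculated from it should be read with that in mind.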

In [None]:
# add airport details like city and coordinates
flight_delays_enhanced = flight_delays.merge(airports, how='inner', right_on='iata', left_on='origin', suffixes=('_flight', '_origin'))
flight_delays_enhanced = flight_delays_enhanced.merge(airports, how='inner', right_on='iata', left_on='dest', suffixes=('_origin', '_dest'))

# add plane details like its age and manufacturer
flight_delays_enhanced = flight_delays_enhanced.merge(planes, how='inner', right_on='tailnum', left_on='tailNum', suffixes=('_flight', '_plane'))

# add carrier description
flight_delays_enhanced = flight_delays_enhanced.merge(carriers, how='inner', right_on='Code', left_on='uniqueCarrier', suffixes=('_flight', '_carrier'))
flight_delays_enhanced.head()

#### Which carriers cause the most delays?

#### Which origin airports cause the most delays?

In [None]:
def delays_by_cat(df, col, color=BASE_COLOR, theme='Origin airports', topn=30, binsize=100000):
    """Plot the `topn` categories of `col` with the most delayed flights.

    Each bar is annotated with the category's share of all delayed flights in `df`.
    `theme` is used in the title and x label; `binsize` sets the y tick spacing.
    """
    # count delayed flights per category, largest first
    top = df[col].value_counts(ascending=False)
    top_order = top.index[:topn]

    plt.figure(figsize=[14,6])
    ax = sns.countplot(data=df, x=col, color=color, order=top_order)

    plt.title('{} with the most delayed flights'.format(theme), weight='bold')
    plt.xlabel(theme)
    plt.ylabel('Number of delayed flights')

    # print the % of all delayed flights on top of each bar
    ticks = ax.get_xticks()
    new_labels = []
    for loc in ticks:
        # bars follow the order of `top`, so use positional access
        count = top.iloc[int(loc)]
        perc = '{:0.1f}%'.format((count/top.sum())*100)
        # keep only the first 20 characters of long x label descriptions
        new_labels.append(str(top.index[int(loc)])[:20])
        plt.text(loc, count+(0.05*binsize), perc, ha='center', color='black', fontsize=8)

    plt.xticks(ticks, new_labels, rotation=90, fontsize=10, weight='ultralight')

    # improve y ticks and labels
    yticks = np.arange(0, top.iloc[0]+binsize, binsize)
    ylabels = ['{:1.0f}'.format(tick/1000)+'K' for tick in yticks]
    plt.yticks(yticks, ylabels)

    plt.show()

In [None]:
delays_by_cat(df=flight_delays_enhanced, col='origin', theme='Origin airports', color=BASE_COLOR_DEP)

In [None]:
delays_by_cat(df=flight_delays_enhanced, col='dest', theme='Destination airports', color=BASE_COLOR_ARR)

In [None]:
delays_by_cat(df=flight_delays_enhanced, col='city_origin', theme='Origin cities', color=BASE_COLOR_DEP)

In [None]:
delays_by_cat(df=flight_delays_enhanced, col='city_dest', theme='Destination Cities', color=BASE_COLOR_ARR)

In [None]:
delays_by_cat(df=flight_delays_enhanced, col='Description', theme='Carriers', color=sns.color_palette("Paired", 20)[4])

Airports in large cities account for the most delayed flights. The top origins are also the top destinations for delays, which suggests that a delay at the origin often carries over to the destination.
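
The overlap between the top origins and the top destinations can be quantified rather than read off two charts. The sketch below uses a handful of made-up delayed flights; the `top_overlap` helper and the data are assumptions for illustration.

```python
import pandas as pd

# Toy delayed flights, invented for illustration
toy = pd.DataFrame({
    'origin': ['ORD', 'ORD', 'ATL', 'ATL', 'DFW', 'SFO'],
    'dest':   ['ATL', 'DFW', 'ORD', 'ORD', 'ATL', 'ORD'],
})


def top_overlap(df, n):
    """Return the airports that are in both the top n origins and the top n destinations."""
    top_origins = set(df['origin'].value_counts().index[:n])
    top_dests = set(df['dest'].value_counts().index[:n])
    return top_origins & top_dests


overlap = top_overlap(toy, 2)
print(sorted(overlap))
```

Note that both rankings are based on counts of delayed flights, so busy airports rank high in either direction; comparing delay shares per airport would be needed to separate volume from performance.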

#### Which lanes cause the most delays?

In [None]:
delays_by_cat(df=flight_delays_enhanced, col='lane', theme='Lanes (origin and destination)', color=BASE_COLOR, binsize=5000)

At lane level, delays seem to be fairly evenly distributed, although the top lanes always have one of the big cities as origin or destination.

#### Which plane characteristics cause the most delays?

In [None]:
flight_delays_enhanced.head()

In [None]:
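col = ['origin', 'type', 'manufacturer', 'year_y', 'engine_type', 'model', 'year_issue']
col.sort()

# draw a random sample across all flight statuses to keep plotting fast
sample = np.random.choice(flights.shape[0], 300000, replace=False)
flight_sample = flights.iloc[sample].copy()

# add plane details; overlapping column names get the default _x/_y suffixes,
# so year_y is the year column from the plane lookup table
flight_sample_enhanced = flight_sample.merge(planes, how='inner', right_on='tailnum', left_on='tailNum')

# treat the text 'None' as missing and derive the issue year from issue_date
flight_sample_enhanced['year_y'] = flight_sample_enhanced['year_y'].replace('None', np.nan)
flight_sample_enhanced['year_issue'] = flight_sample_enhanced['issue_date'].replace('None', np.nan)
flight_sample_enhanced['year_issue'] = pd.to_datetime(flight_sample_enhanced['year_issue'], format="%m/%d/%Y").dt.year.astype('str')
flight_sample_enhanced['year_issue'] = flight_sample_enhanced['year_issue'].replace('nan', np.nan)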

for c in col:
    
    fig, ax = plt.subplots(ncols=1, figsize=(8,4))    
    
    top = flight_sample_enhanced[c].value_counts(normalize=True, ascending=False)[:20]
    top_order = top.index

    ax = sns.countplot(x=c, data=flight_sample_enhanced, hue='flight_status', order=top_order, palette='mako', dodge=False)

    ax.set_title('Delay vs on-time flights for column {}'.format(c))
    
    change_width(ax, .25)
    
    # get labels
    tickslabels = ax.get_xticklabels()

    new_tickslabels = []
    locs, labels = plt.xticks()
    for loc, label in zip(locs,labels):
        text = label.get_text()
        new_tickslabels.append(text[:15])

    ax.set_xticklabels(new_tickslabels, rotation=90)

    plt.tight_layout()
    plt.show()

### Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?

> Your answer here!

### Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

> Your answer here!

## Bivariate Exploration

> In this section, investigate relationships between pairs of variables in your
data. Make sure the variables that you cover here have been introduced in some
fashion in the previous section (univariate exploration).

#### 3) What is the relationship between departure delay and arrival delay?

In [None]:
sample = np.random.choice(flights.shape[0], 300000, replace=False)
flight_sample = flights.iloc[sample].copy()

In [None]:
# plt.scatter(data=flight_delays_sample, x='depDelay', y='arrDelay')
plt.figure(figsize=[6,6])
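# x_jitter and y_jitter spread out the whole-minute delays so that overlapping points become visible
sns.regplot(x=flight_sample['depDelay'], y=flight_sample['arrDelay'], fit_reg=True, truncate=True,
            x_jitter=0.05, y_jitter=0.05,
            scatter_kws={'alpha':0.01,'s':20, 'lw':0.1, 'edgecolor':'black'},
            line_kws={'color': 'orange', 'lw':0.5})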
plt.xlim(0,100)
plt.ylim(0,100)
plt.xlabel('Departure delays (in minutes)')
plt.ylabel('Arrival delays (in minutes)')
plt.show()

### Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

> Your answer here!

### Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

> Your answer here!

## Multivariate Exploration

> Create plots of three or more variables to investigate your data even
further. Make sure that your investigations are justified, and follow from
your work in the previous sections.

### Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

> Your answer here!

### Were there any interesting or surprising interactions between features?

> Your answer here!

## Conclusions
>You can write a summary of the main findings and reflect on the steps taken during the data exploration.






## References
- [white text in pie chart](https://www.tutorialspoint.com/how-to-change-autopct-text-color-to-be-white-in-a-pie-chart-in-matplotlib)
- [interpreting acf and pacf graphs](https://towardsdatascience.com/interpreting-acf-and-pacf-plots-for-time-series-forecasting-af0d6db4061c)
- [formatting xaxis date labels](https://stackoverflow.com/questions/56638648/seaborn-barplot-and-formatting-dates-on-x-axis)
- [annotate barplot](https://datavizpyr.com/how-to-annotate-bars-in-barplot-with-matplotlib-in-python/)