# Part II - Airline on-time performance - Exploration
## by Juanita Smith

## Introduction
Have you ever been stuck in an airport because your flight was delayed or cancelled and wondered if you could have predicted it if you'd had more data? This is our chance to find out.

This analysis focuses on predicting flight delays and cancellations.

> This dataset reports flights in the United States, including carriers, arrival and departure delays, and reasons for delays, from 1987 to 2008.
> - See more information from the data expo challenge in 2009 [here](https://community.amstat.org/jointscsg-section/dataexpo/dataexpo2009).
> - See a full description of the features [here](https://www.transtats.bts.gov/DatabaseInfo.asp?QO_VQ=EFD&Yv0x=D.)
> - Data can be downloaded from [here](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/HG7NV7).

Dictionary:
1) Year 1987-2008 
2) Month 1-12 
3) DayofMonth 1-31 
4) DayOfWeek 1 (Monday) - 7 (Sunday) 
5) DepTime actual departure time (local, hhmm) 
6) CRSDepTime scheduled departure time (local, hhmm) 
7) ArrTime actual arrival time (local, hhmm) 
8) CRSArrTime scheduled arrival time (local, hhmm) 
9) UniqueCarrier unique carrier code 
10) FlightNum flight number 
11) TailNum plane tail number 
12) ActualElapsedTime in minutes 
13) CRSElapsedTime in minutes 
14) AirTime in minutes 
15) ArrDelay arrival delay, in minutes 
16) DepDelay departure delay, in minutes 
17) Origin origin IATA airport code 
18) Dest destination IATA airport code 
19) Distance in miles 
20) TaxiIn time elapsed between wheels down and arrival at the destination airport gate, in minutes 
21) TaxiOut time elapsed between departure from the origin airport gate and wheels off, in minutes 
22) Cancelled was the flight cancelled? 
23) CancellationCode reason for cancellation (A = carrier, B = weather, C = NAS, D = security) 
24) Diverted 1 = yes, 0 = no 
25) CarrierDelay in minutes
26) WeatherDelay in minutes 
27) NASDelay in minutes 
28) SecurityDelay in minutes 
29) LateAircraftDelay in minutes
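
The four clock-time fields above are encoded as `hhmm`, while the cleaned data loaded later in this notebook shows them as timedeltas (for example `0 days 11:58:00`). A minimal sketch of such a conversion follows, assuming the raw values are integers like `1158`. The helper name `hhmm_to_timedelta` is hypothetical and is not taken from the project's cleaning code.

```python
from datetime import timedelta


def hhmm_to_timedelta(hhmm: int) -> timedelta:
    """Convert a local clock time encoded as hhmm (e.g. 1158) to time since midnight."""
    hours, minutes = divmod(int(hhmm), 100)
    # assumption: hour 24 is tolerated because midnight may be written as 2400
    if not (0 <= hours <= 24 and 0 <= minutes < 60):
        raise ValueError(f"not a valid hhmm value: {hhmm}")
    return timedelta(hours=hours, minutes=minutes)


print(hhmm_to_timedelta(1158))  # 11:58:00
print(hhmm_to_timedelta(5))     # 0:05:00
```

Values such as `1275` (75 minutes) raise a `ValueError`, which makes malformed entries visible instead of silently shifting them to a different time.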


**Important to note:** According to the documentation, a late flight is defined as a flight arriving or departing 15 minutes or more after the scheduled time.
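
The 15-minute rule can be turned into a boolean flag. The sketch below is one possible way to do it, not the notebook's own method: the column name `isLate` and the constant `LATE_THRESHOLD_MIN` are hypothetical, the first four toy rows reuse delay values that appear in the sample output further down, and the last row is a made-up missing value.

```python
import pandas as pd

LATE_THRESHOLD_MIN = 15  # threshold taken from the definition above

# toy data using the same nullable Int16 dtype as the cleaned delay columns
toy = pd.DataFrame({
    "arrDelay": pd.array([0, 32, 16, 9, None], dtype="Int16"),
    "depDelay": pd.array([0, 0, 20, 8, None], dtype="Int16"),
})

# a flight is late if either delay reaches the threshold;
# assumption: rows with missing delays are treated as not late
toy["isLate"] = (
    toy["arrDelay"].ge(LATE_THRESHOLD_MIN) | toy["depDelay"].ge(LATE_THRESHOLD_MIN)
).fillna(False).astype(bool)

print(toy)
```

Whether missing delays (for example on cancelled flights) should count as "not late" is a modelling choice; they could equally be kept as missing and analysed separately.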






In [1]:
# import all packages
import glob
import time
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# import custom modules
from src.utils import reduce_mem_usage, create_folder

# set plots to be embedded inline
%matplotlib inline

# suppress matplotlib user warnings
warnings.filterwarnings("ignore", category=UserWarning, module="matplotlib")

# render figures at high resolution (useful on retina/HiDPI displays such as Apple devices)
%config InlineBackend.figure_format='retina'

# Make your Jupyter Notebook wider
from IPython.display import display, HTML
display(HTML('<style>.container { width:80% !important; }</style>'))

# environment settings
# display all columns and rows during visual inspection
pd.options.display.max_columns = None
pd.options.display.max_rows = None


# display floats in pandas output without decimals or scientific notation
pd.options.display.float_format = '{:.0f}'.format

In [2]:
sns.set_style("whitegrid")
BASE_COLOR = sns.color_palette()[0]

In [3]:
FILE_NAME_RAW = '../data/flights_raw.pkl'
FILE_NAME_CLEAN = '../data/flights_clean.pkl'

In [8]:
# load the cleaned file 
flights_new = pd.read_pickle(FILE_NAME_CLEAN)
flights_new.sample(10)

Unnamed: 0,year,month,dayofMonth,dayOfWeek,depTime,CRSDepTime,arrTime,CRSArrTime,uniqueCarrier,flightNum,tailNum,actualElapsedTime,CRSElapsedTime,airTime,arrDelay,depDelay,origin,dest,distance,taxiIn,taxiOut,cancelled,cancellationCode,diverted,carrierDelay,weatherDelay,NASDelay,securityDelay,lateAircraftDelay
26643600,2007,9,15,6,0 days 11:58:00,0 days 12:00:00,0 days 14:16:00,0 days 14:18:00,UA,355,N523UA,258,258,227,0,0,ORD,SEA,1721,6,25,False,,False,0,0,0,0,0
2583864,2004,5,3,1,0 days 17:50:00,0 days 17:50:00,0 days 19:27:00,0 days 19:29:00,XE,2835,N11539,97,99,76,0,0,PNS,IAH,489,9,12,False,,False,0,0,0,0,0
15670116,2006,3,24,5,0 days 10:27:00,0 days 10:30:00,0 days 13:38:00,0 days 13:41:00,UA,1486,N458UA,131,131,108,0,0,ONT,DEN,819,11,12,False,,False,0,0,0,0,0
24617098,2007,6,2,6,0 days 12:20:00,0 days 12:23:00,0 days 13:46:00,0 days 13:49:00,YV,7115,N858MJ,86,86,67,0,0,IAD,CAE,401,6,13,False,,False,0,0,0,0,0
7093524,2004,12,19,7,0 days 20:34:00,0 days 20:40:00,0 days 21:48:00,0 days 22:11:00,DH,1618,N660BR,73,91,62,0,0,CHS,IAD,441,4,7,False,,False,0,0,0,0,0
29686641,2008,2,21,4,0 days 06:00:00,0 days 06:00:00,0 days 09:01:00,0 days 08:29:00,UA,399,N435UA,301,269,266,32,0,LGA,DEN,1619,9,26,False,,False,0,0,32,0,0
13143465,2005,11,24,4,0 days 08:49:00,0 days 08:29:00,0 days 10:09:00,0 days 09:53:00,UA,846,N509UA,80,84,50,16,20,IAD,LGA,229,5,25,False,,False,0,0,0,0,16
4744767,2004,9,22,3,0 days 14:09:00,0 days 14:12:00,0 days 15:47:00,0 days 15:52:00,UA,591,N502UA,158,160,142,0,0,DEN,SEA,1024,4,12,False,,False,0,0,0,0,0
28304432,2007,12,18,2,0 days 09:33:00,0 days 09:25:00,0 days 11:04:00,0 days 10:55:00,WN,1599,N222WN,91,90,74,9,8,SAN,OAK,446,5,12,False,,False,0,0,0,0,0
6929775,2004,12,13,1,0 days 10:40:00,0 days 10:43:00,0 days 11:55:00,0 days 12:01:00,MQ,3697,N847AE,75,78,52,0,0,DFW,SGF,364,4,19,False,,False,0,0,0,0,0


In [9]:
# make sure the datatypes are preserved
flights_new.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 31218803 entries, 0 to 31254219
Data columns (total 29 columns):
 #   Column             Dtype          
---  ------             -----          
 0   year               int16          
 1   month              int8           
 2   dayofMonth         int8           
 3   dayOfWeek          int8           
 4   depTime            timedelta64[ns]
 5   CRSDepTime         timedelta64[ns]
 6   arrTime            timedelta64[ns]
 7   CRSArrTime         timedelta64[ns]
 8   uniqueCarrier      object         
 9   flightNum          int16          
 10  tailNum            object         
 11  actualElapsedTime  Int16          
 12  CRSElapsedTime     Int16          
 13  airTime            Int16          
 14  arrDelay           Int16          
 15  depDelay           Int16          
 16  origin             object         
 17  dest               object         
 18  distance           int16          
 19  taxiIn             Int16          
 20  

In [10]:
flights_new.shape

(31218803, 29)

In [7]:
# to speed up exploration, take a random sample of rows
# the cleaned index has gaps (31218803 rows, labels up to 31254219), so the
# sampled positions must be selected with .iloc; using .loc raises a KeyError
# for every position that is not also an index label
sample = np.random.choice(flights_new.shape[0], 500000, replace=False)
flight_sample = flights_new.iloc[sample, :].copy()
flight_sample.head()
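
Sampling by position only works with label-based selection when the index has no gaps, and the cleaned data has an index running from 0 to 31254219 for only 31218803 rows. The toy example below is a sketch of the difference between `.loc` and `.iloc` on such an index; the frame and its values are invented for illustration.

```python
import numpy as np
import pandas as pd

# toy frame whose index has gaps, as happens after rows are dropped during cleaning
toy = pd.DataFrame({"value": [10, 20, 30, 40]}, index=[0, 1, 3, 7])

# positions in the range 0..len(toy)-1, like the output of np.random.choice(len(toy), ...)
positions = np.array([2, 3])

# label-based lookup fails: 2 is a valid position but not an index label
try:
    toy.loc[positions, :]
except KeyError as err:
    print("KeyError:", err)

# position-based lookup works regardless of the labels
by_position = toy.iloc[positions, :]
print(by_position)

# DataFrame.sample draws rows directly and avoids the position bookkeeping
by_sample = toy.sample(n=2, random_state=0)
print(by_sample)
```

`DataFrame.sample` is usually the shorter option; passing `random_state` makes the sample reproducible between notebook runs.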

## Univariate Exploration

> In this section, investigate distributions of individual variables. If
you see unusual points or outliers, take a deeper look to clean things up
and prepare yourself to look at relationships between variables.



### Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?

> Your answer here!

### Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

> Your answer here!

## Bivariate Exploration

> In this section, investigate relationships between pairs of variables in your
data. Make sure the variables that you cover here have been introduced in some
fashion in the previous section (univariate exploration).

### Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

> Your answer here!

### Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

> Your answer here!

## Multivariate Exploration

> Create plots of three or more variables to investigate your data even
further. Make sure that your investigations are justified, and follow from
your work in the previous sections.

### Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

> Your answer here!

### Were there any interesting or surprising interactions between features?

> Your answer here!

## Conclusions
>You can write a summary of the main findings and reflect on the steps taken during the data exploration.






## References
- [how to read multiple csv files](https://sparkbyexamples.com/pandas/pandas-read-multiple-csv-files/)