
# Data Cleaning Flight Information

## Data Import and Inspection

This is the documentation for the 2006-2010 Flight Dataset. In this document, we inspect the raw data, remove missing or badly converted data values, generate a new feature, and finalize the dataset's properties and features for analysis.

### Import

In [1]:
from zipfile import ZipFile as zp
import numpy as np
import pandas as pd
import string

import time

import matplotlib.pyplot as plt

In [2]:
def get_csv_file(filename):
    t0 = time.time()
    with zp('../Data/RawData/{}.zip'.format(filename)) as flight_zpfl:
        with flight_zpfl.open('{}.csv'.format(filename)) as f_info:
            dataframe =pd.read_csv(f_info,delimiter=',')
    print("Import loading time was {} seconds".format(time.time()-t0))
    return(dataframe)

In [3]:
df_flights = get_csv_file('783548897_52017_1328_airline_delay_causes')

Import loading time was 0.5884170532226562 seconds
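As a side note on the import step, pandas can usually read a single-member ZIP archive directly, without opening it through `zipfile` first. The sketch below uses a made-up archive (`toy_flights.zip`, written to a temporary directory) rather than the project's real file, so the names and contents are assumptions for illustration only.

```python
import os
import tempfile
import zipfile

import pandas as pd

# Build a tiny single-member ZIP archive so the example is self-contained.
# The file name and its contents are hypothetical.
tmp_dir = tempfile.mkdtemp()
zip_path = os.path.join(tmp_dir, "toy_flights.zip")
with zipfile.ZipFile(zip_path, "w") as archive:
    archive.writestr("toy_flights.csv", "year, month,carrier\n2006,8,AA\n2006,8,AS\n")

# pandas infers ZIP compression from the ".zip" suffix; the archive must
# contain exactly one data file for this to work.
toy_df = pd.read_csv(zip_path)
print(toy_df.shape)
```

The explicit `ZipFile` approach used in `get_csv_file` remains the safer choice when an archive holds more than one file.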


### Dataset Inspection

This section observes several attributes of the flight dataset.

In [4]:
df_flights.shape

(91837, 22)

This dataset contains 91837 rows and 22 features.

In [5]:
df_flights.head(2)

Unnamed: 0,year,month,carrier,carrier_name,airport,airport_name,arr_flights,arr_del15,carrier_ct,weather_ct,...,late_aircraft_ct,arr_cancelled,arr_diverted,arr_delay,carrier_delay,weather_delay,nas_delay,security_delay,late_aircraft_delay,Unnamed: 21
0,2006,8,AA,American Airlines Inc.,ABQ,"Albuquerque, NM: Albuquerque International Sun...",310.0,73.0,17.53,8.83,...,26.51,5.0,1.0,3742.0,838.0,585.0,729.0,21.0,1569.0,
1,2006,8,AA,American Airlines Inc.,ANC,"Anchorage, AK: Ted Stevens Anchorage Internati...",62.0,38.0,11.53,0.88,...,11.27,0.0,0.0,2605.0,879.0,100.0,870.0,0.0,756.0,


We observe stray whitespace in two feature titles (e.g. " month"). The following strips leading and trailing spaces from every feature title.

In [6]:
col_names = list(df_flights.columns)
col_names_new =[]
for i in col_names:
    col_names_new.append(i.strip(" "))

In [7]:
print(col_names_new)

['year', 'month', 'carrier', 'carrier_name', 'airport', 'airport_name', 'arr_flights', 'arr_del15', 'carrier_ct', 'weather_ct', 'nas_ct', 'security_ct', 'late_aircraft_ct', 'arr_cancelled', 'arr_diverted', 'arr_delay', 'carrier_delay', 'weather_delay', 'nas_delay', 'security_delay', 'late_aircraft_delay', 'Unnamed: 21']


The above list confirms that the whitespace in the feature titles has been removed. We now assign this list of corrected titles back to the flight dataframe's columns.

In [8]:
df_flights.columns = col_names_new
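The same clean-up can be sketched in a single step with the string accessor on the column index. The toy dataframe and its padded column names below are invented for illustration; they are not taken from the real file.

```python
import pandas as pd

# Hypothetical frame with stray leading spaces in two column titles.
toy = pd.DataFrame({"year": [2006], " month": [8], " arr_delay": [3742.0]})

# Index.str.strip() removes leading and trailing whitespace from every title.
toy.columns = toy.columns.str.strip()
print(list(toy.columns))
```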

In [9]:
df_flights.dtypes

year                     int64
month                    int64
carrier                 object
carrier_name            object
airport                 object
airport_name            object
arr_flights            float64
arr_del15              float64
carrier_ct             float64
weather_ct             float64
nas_ct                 float64
security_ct            float64
late_aircraft_ct       float64
arr_cancelled          float64
arr_diverted           float64
arr_delay              float64
carrier_delay          float64
weather_delay          float64
nas_delay              float64
security_delay         float64
late_aircraft_delay    float64
Unnamed: 21            float64
dtype: object

Lastly, we provide some descriptive statistics for the "df_flights" dataset.

In [10]:
df_flights.describe()

Unnamed: 0,year,month,arr_flights,arr_del15,carrier_ct,weather_ct,nas_ct,security_ct,late_aircraft_ct,arr_cancelled,arr_diverted,arr_delay,carrier_delay,weather_delay,nas_delay,security_delay,late_aircraft_delay,Unnamed: 21
count,91837.0,91837.0,91653.0,91623.0,91653.0,91653.0,91653.0,91653.0,91653.0,91653.0,91653.0,91653.0,91653.0,91653.0,91653.0,91653.0,91653.0,0.0
mean,2007.968847,6.492198,376.501675,80.083298,22.397971,2.980848,27.622592,0.212576,26.843168,6.795271,0.889987,4408.519972,1253.141185,232.490448,1270.455904,7.625162,1644.807273,
std,1.399134,3.445474,1008.399764,221.292022,49.7845,11.647614,98.159085,1.026079,82.746186,27.862032,4.24516,13404.644997,3327.285102,954.124065,5478.810428,41.204998,5192.477674,
min,2006.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,
25%,2007.0,4.0,62.0,12.0,4.32,0.0,2.29,0.0,1.96,0.0,0.0,572.0,202.0,0.0,79.0,0.0,94.0,
50%,2008.0,6.0,125.0,28.0,10.0,0.81,6.87,0.0,6.79,1.0,0.0,1401.0,508.0,35.0,252.0,0.0,399.0,
75%,2009.0,9.0,269.0,62.0,21.71,2.39,17.51,0.0,18.67,5.0,1.0,3287.0,1137.0,180.0,685.0,0.0,1181.0,
max,2010.0,12.0,15993.0,4966.0,1792.07,641.54,2739.18,80.56,1885.47,1283.0,248.0,356883.0,134693.0,57707.0,130920.0,3119.0,145680.0,


#### Data Imputation

##### Checking for invalid entries

We check for invalid entries within our dataset. Such entries include "NaN", corrupt, or "NA" values.

In [11]:
df_flights.isnull().sum()

year                       0
month                      0
carrier                    0
carrier_name               0
airport                    0
airport_name               0
arr_flights              184
arr_del15                214
carrier_ct               184
weather_ct               184
nas_ct                   184
security_ct              184
late_aircraft_ct         184
arr_cancelled            184
arr_diverted             184
arr_delay                184
carrier_delay            184
weather_delay            184
nas_delay                184
security_delay           184
late_aircraft_delay      184
Unnamed: 21            91837
dtype: int64

In [12]:
df_flights[df_flights['arr_flights'].isnull()== True]

Unnamed: 0,year,month,carrier,carrier_name,airport,airport_name,arr_flights,arr_del15,carrier_ct,weather_ct,...,late_aircraft_ct,arr_cancelled,arr_diverted,arr_delay,carrier_delay,weather_delay,nas_delay,security_delay,late_aircraft_delay,Unnamed: 21
414,2006,1,EV,Atlantic Southeast Airlines,MCI,"Kansas City, MO: Kansas City International",,,,,...,,,,,,,,,,
1137,2006,1,TZ,ATA Airlines d/b/a ATA,BWI,"Baltimore, MD: Baltimore/Washington Internatio...",,,,,...,,,,,,,,,,
2337,2006,2,OH,Comair Inc.,GNV,"Gainesville, FL: Gainesville Regional",,,,,...,,,,,,,,,,
2347,2006,2,OH,Comair Inc.,ICT,"Wichita, KS: Wichita Dwight D Eisenhower National",,,,,...,,,,,,,,,,
2397,2006,2,OH,Comair Inc.,TRI,"Bristol/Johnson City/Kingsport, TN: Tri-Cities...",,,,,...,,,,,,,,,,
3425,2006,3,EV,Atlantic Southeast Airlines,ONT,"Ontario, CA: Ontario International",,,,,...,,,,,,,,,,
5619,2006,4,RU,ExpressJet Airlines Inc.,SBN,"South Bend, IN: South Bend International",,,,,...,,,,,,,,,,
6797,2006,5,OH,Comair Inc.,AVL,"Asheville, NC: Asheville Regional",,,,,...,,,,,,,,,,
6834,2006,5,OH,Comair Inc.,FWA,"Fort Wayne, IN: Fort Wayne International",,,,,...,,,,,,,,,,
6844,2006,5,OH,Comair Inc.,ICT,"Wichita, KS: Wichita Dwight D Eisenhower National",,,,,...,,,,,,,,,,


We observe that every column from arr_flights to late_aircraft_delay has 184 missing entries, except arr_del15, which has 214. Moreover, all 91837 entries in "Unnamed: 21" are missing.

Because only $\dfrac{184}{\text{Dataset Size}}=\dfrac{184}{91837}\approx 0.200\%$ of the entries are missing per column ($\dfrac{214}{91837}\approx 0.233\%$ for arr_del15), we can remove the affected rows without materially affecting future analyses.
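The missing share per column can also be computed directly rather than by hand. The following is a minimal sketch on a four-row toy frame (values invented); the last line repeats the arithmetic for the 184 missing rows reported above.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one row missing both values, one row missing only arr_del15.
toy = pd.DataFrame({
    "arr_flights": [310.0, np.nan, 62.0, 125.0],
    "arr_del15": [73.0, np.nan, np.nan, 28.0],
})

# isnull() gives booleans; their mean is the fraction of missing entries.
missing_pct = toy.isnull().mean() * 100
print(missing_pct)

# Share of rows with missing arr_flights in the real dataset, in percent.
real_share = 184 / 91837 * 100
print(round(real_share, 3))
```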

We also have to remove the entire "Unnamed: 21" column. Since it contains no entries at all, we can simply delete it.

The following removes the rows and the column with missing values:

In [13]:
df_flights = df_flights[df_flights['arr_flights'].isnull()== False]

In [14]:
df_flights_col_noUnamed21 = list(df_flights.columns)
df_flights_col_noUnamed21.pop(-1)

'Unnamed: 21'

In [15]:
df_flights=df_flights[df_flights_col_noUnamed21] 

In [16]:
df_flights.shape

(91653, 21)

The above confirms the removal of the 184 rows with missing arr_flights values and of the empty "Unnamed: 21" feature.

In [17]:
df_flights = df_flights[df_flights['arr_del15'].isnull()== False]

In [18]:
df_flights.shape

(91623, 21)
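The boolean-mask filtering and the column selection used above can be expressed more compactly with `drop` and `dropna`. This is a sketch on an invented three-row frame, not a replacement for the cells above; only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical frame mimicking the two kinds of gaps seen in the dataset.
toy = pd.DataFrame({
    "carrier": ["AA", "EV", "OH"],
    "arr_flights": [310.0, np.nan, 62.0],
    "arr_del15": [73.0, np.nan, np.nan],
    "Unnamed: 21": [np.nan, np.nan, np.nan],
})

# Drop the empty column first, then every row missing either key value.
cleaned = (
    toy.drop(columns=["Unnamed: 21"])
       .dropna(subset=["arr_flights", "arr_del15"])
)
print(cleaned.shape)
```

Dropping the empty column before calling `dropna` matters if `dropna` is used without `subset`, since otherwise every row would count as incomplete.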

In [19]:
df_flights.isnull().sum()

year                   0
month                  0
carrier                0
carrier_name           0
airport                0
airport_name           0
arr_flights            0
arr_del15              0
carrier_ct             0
weather_ct             0
nas_ct                 0
security_ct            0
late_aircraft_ct       0
arr_cancelled          0
arr_diverted           0
arr_delay              0
carrier_delay          0
weather_delay          0
nas_delay              0
security_delay         0
late_aircraft_delay    0
dtype: int64

In [20]:
df_flights.head(2)

Unnamed: 0,year,month,carrier,carrier_name,airport,airport_name,arr_flights,arr_del15,carrier_ct,weather_ct,...,security_ct,late_aircraft_ct,arr_cancelled,arr_diverted,arr_delay,carrier_delay,weather_delay,nas_delay,security_delay,late_aircraft_delay
0,2006,8,AA,American Airlines Inc.,ABQ,"Albuquerque, NM: Albuquerque International Sun...",310.0,73.0,17.53,8.83,...,1.0,26.51,5.0,1.0,3742.0,838.0,585.0,729.0,21.0,1569.0
1,2006,8,AA,American Airlines Inc.,ANC,"Anchorage, AK: Ted Stevens Anchorage Internati...",62.0,38.0,11.53,0.88,...,0.0,11.27,0.0,0.0,2605.0,879.0,100.0,870.0,0.0,756.0


We can now confirm the following:

1. Corrected feature titles

2. All rows with missing entries have been removed

3. We have removed the bad feature "Unnamed:21"

#### Selecting Features

In [21]:
df_flights.dtypes

year                     int64
month                    int64
carrier                 object
carrier_name            object
airport                 object
airport_name            object
arr_flights            float64
arr_del15              float64
carrier_ct             float64
weather_ct             float64
nas_ct                 float64
security_ct            float64
late_aircraft_ct       float64
arr_cancelled          float64
arr_diverted           float64
arr_delay              float64
carrier_delay          float64
weather_delay          float64
nas_delay              float64
security_delay         float64
late_aircraft_delay    float64
dtype: object

In [22]:
semi_final_feature = list(df_flights.columns)
df_flights_semi_final = df_flights[semi_final_feature].copy()  # explicit copy, so adding 'Date' below does not raise SettingWithCopyWarning

We observe that the data types for the majority of the features are correct. However, the month and year features are stored as integers (int64) rather than as a "Date" type.

The following creates a new "Date" feature in our dataset, for time series analysis. Since no day is recorded, each date is set to the first of its month.

##### Add Date Feature

In [23]:
df_flights_semi_final['Date'] = pd.to_datetime(
    dict(year=df_flights_semi_final['year'],
         month=df_flights_semi_final['month'],
         day=1)
)

In [24]:
df_flights_semi_final.head(2)

Unnamed: 0,year,month,carrier,carrier_name,airport,airport_name,arr_flights,arr_del15,carrier_ct,weather_ct,...,late_aircraft_ct,arr_cancelled,arr_diverted,arr_delay,carrier_delay,weather_delay,nas_delay,security_delay,late_aircraft_delay,Date
0,2006,8,AA,American Airlines Inc.,ABQ,"Albuquerque, NM: Albuquerque International Sun...",310.0,73.0,17.53,8.83,...,26.51,5.0,1.0,3742.0,838.0,585.0,729.0,21.0,1569.0,2006-08-01
1,2006,8,AA,American Airlines Inc.,ANC,"Anchorage, AK: Ted Stevens Anchorage Internati...",62.0,38.0,11.53,0.88,...,11.27,0.0,0.0,2605.0,879.0,100.0,870.0,0.0,756.0,2006-08-01
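`pd.to_datetime` assembles dates from columns named `year`, `month` and `day`, which is what the dictionary above supplies. A minimal sketch of the same idea on a toy frame (values invented), passing a dataframe instead of a dictionary:

```python
import pandas as pd

# Hypothetical two-row frame with integer year and month columns.
toy = pd.DataFrame({"year": [2006, 2010], "month": [8, 12]})

# Add a constant day column on the fly, then let pandas assemble the dates.
toy["Date"] = pd.to_datetime(toy[["year", "month"]].assign(day=1))
print(toy["Date"].dt.strftime("%Y-%m-%d").tolist())
```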


In [25]:
df_flights.columns[6:21]

Index(['arr_flights', 'arr_del15', 'carrier_ct', 'weather_ct', 'nas_ct',
       'security_ct', 'late_aircraft_ct', 'arr_cancelled', 'arr_diverted',
       'arr_delay', 'carrier_delay', 'weather_delay', 'nas_delay',
       'security_delay', 'late_aircraft_delay'],
      dtype='object')

##### Proportion Statistics

Learning from Patrick Senti's [Flight Data project](https://miraculixx.github.io/flightdelays/index.html) and his [feedback](../Images/Feedback/Feedback1.jpg) to us, we create several features that are proportional representations of the delay features we established above.

I.e., we convert several flight delay features into proportions by month or by year.

In each case, we compute the proportions within their respective metric: delay by minutes or delay count.

The following table lists which features belong to each metric:

| Delay by Minutes | Delay Count |
|---|---|
| arr_delay | arr_flights |
| carrier_delay | arr_del15 |
| weather_delay | carrier_ct |
| nas_delay | weather_ct |
| security_delay | nas_ct |
| late_aircraft_delay | security_ct |
| | late_aircraft_ct |
| | arr_cancelled |
| | arr_diverted |

The following function takes a dataset, a list of selected features, and a time type ("year" or "month"), and outputs a new dataframe in which each selected feature is expressed as a proportion within each time group.

In [26]:
df_flights.head(1)

Unnamed: 0,year,month,carrier,carrier_name,airport,airport_name,arr_flights,arr_del15,carrier_ct,weather_ct,...,security_ct,late_aircraft_ct,arr_cancelled,arr_diverted,arr_delay,carrier_delay,weather_delay,nas_delay,security_delay,late_aircraft_delay
0,2006,8,AA,American Airlines Inc.,ABQ,"Albuquerque, NM: Albuquerque International Sun...",310.0,73.0,17.53,8.83,...,1.0,26.51,5.0,1.0,3742.0,838.0,585.0,729.0,21.0,1569.0


In [27]:
'''
Create a summary dataset of delay minutes and delay counts, expressed as
proportions within each time period (year or month).

Parameters: 
-data: Dataset
-selected_features: Desired features (the 9 count features followed by the 6 minutes features)
-time_type: "year" or "month"
'''
import math
def to_proportion(data, selected_features,time_type):
    '''
    Receive a list of the time values in our time_type
    
    E.x. Years is an array 2006,2007,2008,2009,2010
    '''
    time_list = list(data[time_type].unique())
    
    
    #Partition data into delay by minutes metrics and delay count metrics.
    #This relies on column order: the first 9 features are counts, the last 6 are minutes.
    selected_features_mins = selected_features[9:] ##6 features
    selected_features_ct = selected_features[:9] ##9 features
    
    '''
    Generated titles for new features, by proportion. 
    These titles are implemented for both "delays by minutes and delays counts"
    
    I.e. Names + proportion_by_timeType== new features

    '''
    selected_feat_names_ct = []
    selected_feat_names_mins = []
    for i in range(0,len(selected_features_ct)):
        selected_feat_names_ct.append("{name}_prop_by_{time}".format(name = selected_features_ct[i], time = time_type))
        
        if i< (len(selected_features_mins)):
            selected_feat_names_mins.append("{name}_prop_by_{time}".format(name = selected_features_mins[i], time = time_type))
   
    ''' Concat the two '{name}_prop_by_{time}' lists'''
    feature_titles = selected_feat_names_mins+selected_feat_names_ct
    
    #Instantiate empty dataframe
    dataset_finished = pd.DataFrame(columns=feature_titles )
    
    #Empty dictionary to store our arrays with "selected_feat_names_{ct/mins}" keys
    dictionary = {}
    dictionary[time_type]=[]
    for i in selected_feat_names_ct:
        dictionary[i] = []
       
    dictionary_2 = {}
    dictionary_2[time_type]=[]
    for i in selected_feat_names_mins:
        dictionary_2[i] = []
        
    #Temporary list for attaining time values with respect to row operation    
    ls_time = []
    
    #Loop through the unique time entries
    for i_time, i_val in enumerate(time_list):
        #Add time values
        ls_time.append(i_val)
        
        #Get sum of delay counts by some time_type
        #(select the column before summing, so non-numeric columns such as 'Date' are never summed)
        ds_delay_sum = 0.0
        for i, val in enumerate(selected_features_ct):
            ds_delay_sum += data.groupby(time_type)[val].sum()[i_val]
        
        #Get sum of delay by mins by some time_type
        ds_delay_sum_mins = 0.0
        for i, val in enumerate(selected_features_mins):
            ds_delay_sum_mins += data.groupby(time_type)[val].sum()[i_val]
        
        #Convert each delay count feature to its proportion of the count total
        for i, val in enumerate(selected_features_ct):
            ds_sum = data.groupby(time_type)[val].sum()[i_val]
            dictionary[selected_feat_names_ct[i]].append(ds_sum/ds_delay_sum)
            
        #Convert each delay by minutes feature to its proportion of the minutes total
        for i, val in enumerate(selected_features_mins):
            ds_sum_2 = data.groupby(time_type)[val].sum()[i_val]
            dictionary_2[selected_feat_names_mins[i]].append(ds_sum_2/ds_delay_sum_mins)
            
    #Add time information to both dictionaries        
    dictionary[time_type]=ls_time
    dictionary_2[time_type]=ls_time

    #Convert delay by count dictionary information to dataframe summary  
    key_info_ct = list(dictionary.keys())
    key_info_mins = list(dictionary_2.keys())
    for i, val in enumerate(key_info_ct):
        dataset_finished[key_info_ct[i]] = dictionary[key_info_ct[i]]
    
        #Convert delay by minutes dictionary information to dataframe summary     
    for i, val in enumerate(key_info_mins):
        dataset_finished[key_info_mins[i]] = dictionary_2[key_info_mins[i]]
        
        
    return dataset_finished
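For comparison, the same computation can be sketched without explicit loops: group once, sum once, then divide each block of columns by its row total. The helper name `to_proportion_vectorized`, its separate `count_cols`/`minute_cols` parameters, and the toy data are assumptions made for this example; they are not part of the notebook. Unlike `to_proportion`, the sketch returns the time column first rather than last.

```python
import pandas as pd


def to_proportion_vectorized(data, count_cols, minute_cols, time_type):
    """Hypothetical helper: share of each column within its group total."""
    # sort=False keeps groups in order of first appearance, like unique().
    sums = data.groupby(time_type, sort=False)[count_cols + minute_cols].sum()
    # Divide every column by the row-wise total of its own metric block.
    count_share = sums[count_cols].div(sums[count_cols].sum(axis=1), axis=0)
    minute_share = sums[minute_cols].div(sums[minute_cols].sum(axis=1), axis=0)
    out = pd.concat([minute_share, count_share], axis=1)
    out = out.add_suffix("_prop_by_{}".format(time_type))
    return out.reset_index()


# Toy data with invented values.
toy = pd.DataFrame({
    "year": [2006, 2006, 2007],
    "arr_flights": [6, 2, 9],
    "arr_del15": [1, 1, 1],
    "carrier_delay": [30, 10, 15],
    "weather_delay": [10, 0, 5],
})
result = to_proportion_vectorized(
    toy, ["arr_flights", "arr_del15"], ["carrier_delay", "weather_delay"], "year"
)
print(result)
```

Grouping once also avoids recomputing the full group-by sum inside every loop iteration.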

In [28]:
last_feat = list(df_flights.columns[6:21])
df_flights_final_proportion_byYear = to_proportion(df_flights_semi_final, last_feat, "year")

In [29]:
df_flights_final_proportion_byYear

Unnamed: 0,arr_delay_prop_by_year,carrier_delay_prop_by_year,weather_delay_prop_by_year,nas_delay_prop_by_year,security_delay_prop_by_year,late_aircraft_delay_prop_by_year,arr_flights_prop_by_year,arr_del15_prop_by_year,carrier_ct_prop_by_year,weather_ct_prop_by_year,nas_ct_prop_by_year,security_ct_prop_by_year,late_aircraft_ct_prop_by_year,arr_cancelled_prop_by_year,arr_diverted_prop_by_year,year
0,0.5,0.139099,0.027836,0.146872,0.00127,0.184923,0.679464,0.153698,0.043555,0.00633,0.052926,0.000562,0.050324,0.0116,0.00154,2006
1,0.5,0.142757,0.028452,0.139698,0.000876,0.188217,0.663208,0.160479,0.04631,0.006432,0.053219,0.000439,0.054079,0.014305,0.001528,2007
2,0.5,0.138786,0.026742,0.151039,0.00066,0.182773,0.686294,0.149281,0.040298,0.005545,0.053835,0.000319,0.049284,0.013455,0.001689,2008
3,0.5,0.140185,0.024925,0.15317,0.000597,0.181122,0.71736,0.13549,0.036005,0.004656,0.050118,0.000259,0.044453,0.009939,0.00172,2009
4,0.5,0.151899,0.022015,0.128278,0.000849,0.196959,0.72241,0.131587,0.037977,0.004016,0.042181,0.000342,0.04707,0.012684,0.001733,2010
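One reading of the constant 0.5 in the arr_delay column above, stated here as an assumption rather than something the notebook verifies: if the total arrival delay $A$ for a period equals the sum of the five cause-specific totals $c_1,\dots,c_5$ (carrier, weather, NAS, security, late aircraft), then the denominator used by `to_proportion` is $A+\sum_k c_k=2A$, so

$$\frac{A}{A+\sum_{k=1}^{5} c_k}=\frac{A}{2A}=\frac{1}{2},\qquad \frac{c_k}{A+\sum_{j=1}^{5} c_j}=\frac{c_k}{2A}.$$

The 2006 row is consistent with this: $0.139099+0.027836+0.146872+0.00127+0.184923=0.5$. Under that assumption, doubling a cause-specific value gives its share of total delay minutes, e.g. roughly $2\times 0.139099\approx 27.8\%$ for carrier delay in 2006.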


In [30]:
df_flights_final_proportion_byMonth = to_proportion(df_flights_semi_final, last_feat, "month")

In [31]:
df_flights_final_proportion_byMonth 

Unnamed: 0,arr_delay_prop_by_month,carrier_delay_prop_by_month,weather_delay_prop_by_month,nas_delay_prop_by_month,security_delay_prop_by_month,late_aircraft_delay_prop_by_month,arr_flights_prop_by_month,arr_del15_prop_by_month,carrier_ct_prop_by_month,weather_ct_prop_by_month,nas_ct_prop_by_month,security_ct_prop_by_month,late_aircraft_ct_prop_by_month,arr_cancelled_prop_by_month,arr_diverted_prop_by_month,month
0,0.5,0.148793,0.026149,0.134675,0.001496,0.188888,0.695497,0.146379,0.043045,0.005635,0.046643,0.000673,0.050384,0.009897,0.001848,8
1,0.5,0.143305,0.029124,0.146847,0.000949,0.179776,0.686326,0.147882,0.041119,0.006214,0.052792,0.000392,0.047365,0.016334,0.001577,1
2,0.5,0.140318,0.030009,0.142287,0.000932,0.186453,0.668378,0.153652,0.041777,0.006377,0.053916,0.000411,0.051172,0.02274,0.001577,2
3,0.5,0.142012,0.024047,0.14218,0.000883,0.190878,0.683357,0.150596,0.042452,0.005124,0.051302,0.000428,0.05129,0.013946,0.001505,3
4,0.5,0.142434,0.022547,0.149571,0.000971,0.184477,0.713071,0.137809,0.038479,0.004273,0.049197,0.000365,0.045496,0.009882,0.001429,4
5,0.5,0.137454,0.024082,0.154109,0.000658,0.183697,0.711393,0.139707,0.037221,0.004772,0.051427,0.000289,0.045998,0.00768,0.001513,5
6,0.5,0.13492,0.030736,0.143087,0.000656,0.1906,0.656146,0.164694,0.045758,0.007161,0.054408,0.000369,0.056998,0.012212,0.002253,6
7,0.5,0.143582,0.029241,0.133986,0.000755,0.192436,0.671654,0.157626,0.046129,0.006795,0.049245,0.000385,0.055072,0.010982,0.002111,7
8,0.5,0.154819,0.023959,0.152002,0.000787,0.168433,0.750028,0.119693,0.034641,0.004063,0.04406,0.000261,0.036668,0.009245,0.00134,9
9,0.5,0.140305,0.020753,0.157576,0.000708,0.180659,0.715622,0.137513,0.036795,0.003918,0.051499,0.000296,0.045006,0.008094,0.001257,10


###### Percentage Representation

In future visualizations, it is convenient to have these values represented as percentages.

I.e., we want to spare future readers the hassle of converting decimals to percentages in their heads while reading.

The following addresses this concern.

In [33]:
df_lengthMonth = len(list(df_flights_final_proportion_byMonth.columns))
df_colMonth = list(df_flights_final_proportion_byMonth.columns)[0:df_lengthMonth-1]
df_flights_final_proportion_byMonth[df_colMonth] = df_flights_final_proportion_byMonth[df_colMonth] *100
df_flights_final_proportion_byMonth

Unnamed: 0,arr_delay_prop_by_month,carrier_delay_prop_by_month,weather_delay_prop_by_month,nas_delay_prop_by_month,security_delay_prop_by_month,late_aircraft_delay_prop_by_month,arr_flights_prop_by_month,arr_del15_prop_by_month,carrier_ct_prop_by_month,weather_ct_prop_by_month,nas_ct_prop_by_month,security_ct_prop_by_month,late_aircraft_ct_prop_by_month,arr_cancelled_prop_by_month,arr_diverted_prop_by_month,month
0,50.0,14.879281,2.614873,13.467461,0.149616,18.88877,69.549679,14.63791,4.304487,0.56346,4.66431,0.067311,5.03836,0.989724,0.184759,8
1,50.0,14.330502,2.912371,14.68466,0.094874,17.977592,68.632579,14.788186,4.111925,0.621391,5.279161,0.039239,4.736468,1.633396,0.157655,1
2,50.0,14.03179,3.00091,14.228718,0.093246,18.645336,66.837797,15.365223,4.177687,0.637688,5.391592,0.041072,5.117196,2.274009,0.157734,2
3,50.0,14.201181,2.404698,14.217984,0.088325,19.087811,68.335732,15.059568,4.245204,0.512411,5.130208,0.042778,5.128987,1.394609,0.150504,3
4,50.0,14.243352,2.254674,14.957127,0.097132,18.447716,71.307123,13.780874,3.847896,0.42725,4.91966,0.036456,4.549632,0.988184,0.142925,4
5,50.0,13.745442,2.408188,15.410867,0.065794,18.369709,71.139261,13.970722,3.722125,0.477161,5.142697,0.028923,4.599841,0.767981,0.15129,5
6,50.0,13.492031,3.073614,14.308748,0.065613,19.059995,65.614589,16.469428,4.575815,0.716081,5.440786,0.03694,5.699807,1.22121,0.225343,6
7,50.0,14.358218,2.924063,13.398624,0.07546,19.243635,67.165446,15.762616,4.612866,0.679494,4.924535,0.038503,5.507229,1.098165,0.211147,7
8,50.0,15.481939,2.395944,15.200161,0.078662,16.843294,75.002835,11.969338,3.464078,0.406269,4.406045,0.026128,3.66682,0.924469,0.134018,9
9,50.0,14.030452,2.075318,15.757558,0.070814,18.065858,71.562228,13.751309,3.679541,0.391813,5.149861,0.029561,4.500551,0.809447,0.12569,10


In [34]:
df_lengthYear = len(list(df_flights_final_proportion_byYear.columns))
df_colYear = list(df_flights_final_proportion_byYear.columns)[0:df_lengthYear-1]
df_flights_final_proportion_byYear[df_colYear] = df_flights_final_proportion_byYear[df_colYear] *100
df_flights_final_proportion_byYear

Unnamed: 0,arr_delay_prop_by_year,carrier_delay_prop_by_year,weather_delay_prop_by_year,nas_delay_prop_by_year,security_delay_prop_by_year,late_aircraft_delay_prop_by_year,arr_flights_prop_by_year,arr_del15_prop_by_year,carrier_ct_prop_by_year,weather_ct_prop_by_year,nas_ct_prop_by_year,security_ct_prop_by_year,late_aircraft_ct_prop_by_year,arr_cancelled_prop_by_year,arr_diverted_prop_by_year,year
0,50.0,13.90992,2.783581,14.687226,0.126952,18.492322,67.946369,15.369794,4.355547,0.633026,5.292615,0.056167,5.032444,1.160048,0.153989,2006
1,50.0,14.27567,2.845172,13.969826,0.087627,18.821704,66.320811,16.047921,4.631022,0.643213,5.321865,0.04394,5.40789,1.430493,0.152844,2007
2,50.0,13.878563,2.674183,15.10394,0.065973,18.277341,68.62939,14.928106,4.029829,0.554535,5.383483,0.031862,4.928411,1.345516,0.168869,2008
3,50.0,14.018509,2.49249,15.317048,0.059711,18.112242,71.735997,13.549043,3.600465,0.465598,5.011791,0.02589,4.445317,0.99394,0.171959,2009
4,50.0,15.189903,2.201488,12.827822,0.084899,19.695889,72.240991,13.158659,3.797734,0.401564,4.218096,0.034247,4.707039,1.268362,0.173308,2010
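The percentage conversion can also select its columns by name instead of by position. The sketch below uses an invented two-row frame; only the column naming pattern is borrowed from the tables above.

```python
import pandas as pd

# Hypothetical proportion summary with the time column last.
toy = pd.DataFrame({
    "carrier_delay_prop_by_year": [0.25, 0.5],
    "weather_delay_prop_by_year": [0.75, 0.5],
    "year": [2006, 2007],
})

# Every column except the time column holds a proportion.
prop_cols = toy.columns.drop("year")
toy[prop_cols] = toy[prop_cols] * 100
print(toy)
```

Note that this kind of in-place scaling is not idempotent: running the cell twice multiplies the values by 100 again.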


## Export Data

We export the two proportion summaries derived from "df_flights_semi_final":

1. **df_flights_final_proportion_byYear:** Each delay-minutes feature as a percentage of the summed delay-minutes features, and each count feature as a percentage of the summed count features, within its respective year

2. **df_flights_final_proportion_byMonth:** The same percentages, computed within each respective month

In [36]:
df_flights_final_proportion_byYear.to_csv('../Data/PreparedData/flight_data_byYear.csv',sep=',', header=True)

df_flights_final_proportion_byMonth.to_csv('../Data/PreparedData/flight_data_byMonth.csv',sep=',', header=True)
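With the default settings used above, `to_csv` also writes the dataframe index as an unnamed first column, which shows up as "Unnamed: 0" when the file is read back. If that column is not wanted, `index=False` omits it. A minimal round-trip sketch, using an invented frame and a temporary path rather than the project's `PreparedData` folder:

```python
import os
import tempfile

import pandas as pd

# Hypothetical summary frame and output location.
toy = pd.DataFrame({
    "carrier_delay_prop_by_year": [25.0, 50.0],
    "year": [2006, 2007],
})
out_path = os.path.join(tempfile.mkdtemp(), "toy_byYear.csv")

# index=False leaves out the row index, so no extra column appears on reload.
toy.to_csv(out_path, index=False)
reloaded = pd.read_csv(out_path)
print(list(reloaded.columns))
```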

## Resources


1. [Flight Data](https://www.transtats.bts.gov/OT_Delay/OT_DelayCause1.asp)

2. [Dimple Basics](http://napitupulu-jon.appspot.com/posts/dimple-ud507.html)

3. [bz2 import](https://pymotw.com/2/bz2/)

4. [Data Dictionary](https://www.transtats.bts.gov/Fields.asp)

5. [Encoding German Codec](https://stackoverflow.com/questions/18197772/python-german-umlaut-issues-ascii-codec-cant-decode-byte-0xe4-in-position-1)

6. [Faster Data Loading through Sampling](http://nikgrozev.com/2015/06/16/fast-and-simple-sampling-in-pandas-when-loading-data-from-files/)

7. [Types of Recorded Delays](https://www.rita.dot.gov/bts/help/aviation/html/understanding.html#q4)

8. [The Great Recession](http://www.investopedia.com/terms/g/great-recession.asp)

9. [EDA Visualization Design/Planning](http://guides.library.georgetown.edu/datavisualization)

10. [How to add a Data Viz Legend](https://stackoverflow.com/questions/28739608/completely-custom-legend-in-matplotlib-python)




## Data Dictionary

1. **year:** Year

2. **month:** Month

3. **carrier:** Code assigned by IATA and commonly used to identify a carrier. As the same code may have been assigned to different carriers over time, the code is not always unique. For analysis, use the Unique Carrier Code.

4. **carrier_name:** Carrier Name

5. **airport:** Airport Code

6. **airport_name:** Airport Name

7. **arr_flights:** Count of flights that arrived at the airport

8. **arr_del15:** Count of arrival delays by 15mins or more

9. **carrier_ct:** Count of delays due to carrier

10. **weather_ct:** Count of delays due to weather

11. **nas_ct:** Count of delays due to NAS

12. **security_ct:** Count of delays due to Security

13. **late_aircraft_ct:** Count of delays due to late aircraft

14. **arr_cancelled:** Arrivals Cancelled

15. **arr_diverted:** Arrivals diverted

16. **arr_delay:** Total arrival delay, in Minutes. The minimum in this dataset is 0, so early arrivals do not appear as negative numbers here.

17. **carrier_delay:** Carrier Delay, in Minutes

18. **weather_delay:** Weather Delay, in Minutes

19. **nas_delay:** National Aviation System Delay, in Minutes

20. **security_delay:** Security Delay, in Minutes

21. **late_aircraft_delay:** Late Aircraft Delay, in Minutes
