In [1]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
import os
import sys
sys.path.append('D:/Springboard_DataSci/Assignments/Lib')
import TimeTracker

Before we formally train a model on the data, let's look at each weather code
one at a time and try to separate the significant codes from the insignificant ones.
For some codes, statistical significance may not appear until the higher values.

In [2]:
stopwatch = TimeTracker.TimeTracker()
WORKING_DIR = r'D:\Springboard_DataSci\Assignments\Capstone_2--Airport_weather'
os.chdir(WORKING_DIR + r'\data')

In [3]:
# Get the data.
departure_events = pd.read_csv('departure_events.csv')
EFFECTS = ['FracCancelled','FracDelayed']
CODES = ['Cold','Fog','Hail','Wind','Rain','Sleet','Snow']
FOGIS2 = 'FogIs2'

In [4]:
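def linreg(X, y, print_stats=True):
    '''Linear regression with significance testing.

    Returns the p-values of the predictors (constant excluded) and the R^2.'''
    X2 = sm.add_constant(X)  # sm.OLS does not add an intercept on its own
    est = sm.OLS(y, X2)
    est2 = est.fit()
    if print_stats:
        print(est2.summary())
    return est2.pvalues[1:], est2.rsquared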

In [5]:
def pvalue_and_R2(summary_stats, codes=CODES):
    '''Extracts just the p-values and R^2's from a series of linear regressions.

    summary_stats: List of [p-value, R^2, effect] lists, one per code and effect,
    ordered by code and then by effect.
    Returns the cancellation stats, then the delay stats.'''
    df = pd.DataFrame(summary_stats,
                      index=[code for code in codes for _ in EFFECTS],
                      columns=['pValue', 'R^2', 'Effect'])
    # Cancellations, then delays.
    return (df[df.Effect == EFFECTS[0]].loc[:, :'R^2'],
            df[df.Effect == EFFECTS[1]].loc[:, :'R^2'])

In [6]:
# Evaluate significance for all codes between values of 0 and 1.
flights_affected = departure_events[EFFECTS]
summary_stats = []
for weather_code in CODES:
    x = departure_events[weather_code]
    indices = (x <= 1)
    x = x[indices]
    for effect in EFFECTS:
        y = departure_events[effect][indices]
        # print('\nLinear regression analysis for', weather_code, 'of', effect+':')
        pvalue, R2 = linreg(x, y, print_stats=False)
        # .iloc[0] picks the single predictor's p-value by position.
        summary_stats.append([pvalue.iloc[0], R2, effect])
cancellation_stats, delay_stats = pvalue_and_R2(summary_stats)
print('\nCancellation summary stats:\n' + str(cancellation_stats))
print('\nDelay summary stats:\n' + str(delay_stats))


Cancellation summary stats:
              pValue       R^2
Cold    7.144773e-04  0.000334
Fog     5.483213e-01  0.000011
Hail    1.332955e-11  0.001334
Wind   5.663473e-111  0.014507
Rain    1.796061e-09  0.001197
Sleet  1.206266e-190  0.024982
Snow    1.461477e-80  0.010669

Delay summary stats:
              pValue       R^2
Cold    1.579043e-02  0.000170
Fog     2.026928e-01  0.000050
Hail    1.184639e-19  0.002397
Wind    8.345111e-94  0.012237
Rain   4.308106e-127  0.018852
Sleet   1.015732e-41  0.005326
Snow    7.386913e-39  0.005034


By inspection of the R^2 values, we can see the following:<br>
Cold: Can be discarded entirely. The R^2 is too small, and there is no Cold>1 code.<br>
Fog: The change between 0 and 1 is inconsequential (p > 0.2 for both cancellations and
delays). We might be able to discard it, depending on what happens with the 2's.<br>
Hail, wind, rain, sleet, and snow all have small R^2 but minuscule p-values, so we will
leave them in. Remember that in the case of rain and snow, we have only compared 0's and 1's
so far. Judging by the box plots, we expect more significant results when the 2's and 3's are
factored in.
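
To see why a tiny R^2 can coexist with a minuscule p-value, here is a minimal sketch on made-up data (not the airport data). It assumes a rare 0/1 weather flag that shifts the outcome slightly amid much larger noise. The helper `simple_ols`, the sample size, and the effect size are all hypothetical, and the p-value uses a normal approximation that is only reasonable for large samples.

```python
import math
import random


def simple_ols(x, y):
    """Slope, R^2 and approximate two-sided p-value for y ~ a + b*x (hypothetical helper)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    r2 = sxy ** 2 / (sxx * syy)
    # For one predictor, the squared t statistic of the slope is r2 * (n - 2) / (1 - r2).
    t = math.sqrt(r2 * (n - 2) / (1 - r2))
    # Normal approximation to the t distribution (assumes n is large).
    p = math.erfc(t / math.sqrt(2))
    return slope, r2, p


random.seed(0)
n = 30000  # made-up sample size, similar in scale to a daily airport table
# Made-up flag: about 10% of rows have the weather code set.
x = [1 if random.random() < 0.1 else 0 for _ in range(n)]
# Made-up effect: the flag adds 0.02 to the outcome, buried in noise with sd 0.1.
y = [0.02 * xi + random.gauss(0, 0.1) for xi in x]

slope, r2, p = simple_ols(x, y)
print('slope:', slope, 'R^2:', r2, 'p-value:', p)
```

With tens of thousands of rows, even an effect that explains well under 1% of the variance can be estimated precisely enough to be distinguished from zero, which is the pattern in the tables above.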

Let's see if we can omit fog entirely.

In [7]:
departure_events[FOGIS2] = (departure_events.Fog == 2).astype(int)
summary_stats = []
for effect in EFFECTS:
    pvalue, R2 = linreg(X=departure_events[FOGIS2], y=departure_events[effect], print_stats=False)
    summary_stats.append([pvalue.iloc[0], R2, effect])
cancellation_stats, delay_stats = pvalue_and_R2(summary_stats, codes=[FOGIS2])
print('\nCancellation summary stats for ' + FOGIS2 + ':\n' + str(cancellation_stats))
print('\nDelay summary stats for ' + FOGIS2 + ':\n' + str(delay_stats) + '\n')


Cancellation summary stats for FogIs2:
               pValue       R^2
FogIs2  1.111953e-133  0.017503

Delay summary stats for FogIs2:
               pValue       R^2
FogIs2  2.752392e-167  0.021922



Now we have statistical significance. For both cancellations and delays, there is no
significant difference between fog codes 0 and 1, but there is one between those two and code 2.
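
Regressing on a 0/1 indicator such as FogIs2 amounts to comparing two group means: the slope is the mean for code 2 minus the mean for codes 0 and 1 combined, and the intercept is the mean for codes 0 and 1. A small sketch with made-up numbers (the lists and the helper `ols_line` are hypothetical, not taken from departure_events):

```python
def ols_line(x, y):
    """Least-squares slope and intercept for y ~ a + b*x (hypothetical helper)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Made-up fog codes and delay fractions, for illustration only.
fog = [0, 1, 2, 0, 2, 1, 0, 2]
frac_delayed = [0.01, 0.02, 0.10, 0.00, 0.12, 0.01, 0.02, 0.08]

# Same construction as FogIs2: 1 where the code is 2, else 0.
fog_is_2 = [int(code == 2) for code in fog]
slope, intercept = ols_line(fog_is_2, frac_delayed)

# Group means computed directly, for comparison with the regression.
dense = [y for f, y in zip(fog_is_2, frac_delayed) if f == 1]
other = [y for f, y in zip(fog_is_2, frac_delayed) if f == 0]
mean_dense = sum(dense) / len(dense)
mean_other = sum(other) / len(other)

print('slope:', slope, 'difference in means:', mean_dense - mean_other)
print('intercept:', intercept, 'mean of codes 0 and 1:', mean_other)
```

Pooling codes 0 and 1 into the baseline is therefore only reasonable because the earlier test found no significant difference between them.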

Let's put all remaining codes together and do one combined linear regression. We'll get rid of 
Cold, the old Fog code, and some other unneeded columns. FogIs2 remains.

In [8]:
print(departure_events.columns)
# Take an explicit copy so that columns added later go onto an independent frame.
departure_events = departure_events[[
    'ORIGIN', 'Region', 'Season', 'Hail', 'Wind', 'Rain', 'Sleet',
    'Snow', FOGIS2, 'FracCancelled', 'FracDelayed']].copy()
for effect in EFFECTS:
    print('\nLinear regression analysis on all remaining variables on', effect+':')
    linreg(X=departure_events.loc[:, 'Hail':FOGIS2], y=departure_events[effect])

Index(['Unnamed: 0', 'ORIGIN', 'DepartureDate', 'ArrivDelay', 'DepartDelay',
       'Flights', 'WeatherCancelled', 'WeatherDelayed', 'Cold', 'Fog', 'Hail',
       'Wind', 'Rain', 'Sleet', 'Snow', 'FracCancelled', 'FracDelayed',
       'Month', 'Season', 'Region', 'FogIs2'],
      dtype='object')

Linear regression analysis on all remaining variables on FracCancelled:
                            OLS Regression Results                            
Dep. Variable:          FracCancelled   R-squared:                       0.140
Model:                            OLS   Adj. R-squared:                  0.140
Method:                 Least Squares   F-statistic:                     931.2
Date:                Wed, 29 Jul 2020   Prob (F-statistic):               0.00
Time:                        13:05:03   Log-Likelihood:                 64585.
No. Observations:               34288   AIC:                        -1.292e+05
Df Residuals:                   34281   BIC:                        -1.291e+0

The R^2's are not very high, but every coefficient except sleet's effect on delays
is highly significant. This suggests that we have probably found variables that
contribute to the delays and cancellations, but the decision to group the flights by day
rather than by hour may have created a lot of false negatives.

Let's see if we can get some help from introducing the squares of variables whose
values can exceed 1.

In [9]:
departure_events['RainSquared'] = np.square(departure_events.Rain)
departure_events['SnowSquared'] = np.square(departure_events.Snow)
print(departure_events.columns)
# Reorder the columns so that the squared terms fall inside the 'Hail':FOGIS2 slice
# and the fraction cancelled and delayed come last.
departure_events = departure_events[[
    'ORIGIN', 'Region', 'Season', 'Hail', 'Wind', 'Rain', 'RainSquared', 'Sleet', 'Snow', 'SnowSquared',
    FOGIS2, 'FracCancelled', 'FracDelayed']]
for effect in EFFECTS:
    print('\nLinear regression analysis with Rain^2 and Snow^2 on', effect+':')
    linreg(X=departure_events.loc[:, 'Hail':FOGIS2], y=departure_events[effect])

Index(['ORIGIN', 'Region', 'Season', 'Hail', 'Wind', 'Rain', 'Sleet', 'Snow',
       'FogIs2', 'FracCancelled', 'FracDelayed', 'RainSquared', 'SnowSquared'],
      dtype='object')

Linear regression analysis with Rain^2 and Snow^2 on FracCancelled:
                            OLS Regression Results                            
Dep. Variable:          FracCancelled   R-squared:                       0.155
Model:                            OLS   Adj. R-squared:                  0.155
Method:                 Least Squares   F-statistic:                     786.4
Date:                Wed, 29 Jul 2020   Prob (F-statistic):               0.00
Time:                        13:05:03   Log-Likelihood:                 64885.
No. Observations:               34288   AIC:                        -1.298e+05
Df Residuals:                   34279   BIC:                        -1.297e+05
Df Model:                           8                                         
Covariance Type:            nonrobust   

Slightly better: for FracCancelled, R^2 rises from 0.140 to 0.155. Note that the linear Snow term
is now insignificant for FracCancelled but remains significant for FracDelayed.
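
One possible reason for the linear Snow term losing significance, which we have not verified on this dataset, is that a code taking values from 0 to 3 is strongly correlated with its own square, so the two terms compete to explain the same variation. The sketch below uses a made-up list of snow codes and a hypothetical helper `pearson_r` to show how large that correlation can be.

```python
import math


def pearson_r(a, b):
    """Pearson correlation coefficient of two equal-length lists (hypothetical helper)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    saa = sum((ai - mean_a) ** 2 for ai in a)
    sbb = sum((bi - mean_b) ** 2 for bi in b)
    sab = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    return sab / math.sqrt(saa * sbb)


# Made-up snow codes: mostly 0, with a few heavier days (not the real distribution).
snow = [0] * 12 + [1] * 4 + [2] * 3 + [3] * 1
snow_squared = [s ** 2 for s in snow]

r = pearson_r(snow, snow_squared)
print('correlation between Snow and Snow^2:', r)
```

When two predictors are this closely related, the standard error of each coefficient grows, so one of them can look insignificant even though the pair jointly improves the fit.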

Now let's develop a physical map. We would like to illustrate the fraction of delays and cancellations
geographically.