# Question #1:

### Part A (explanation):

     No, we cannot include all the variables in the table in our analysis. While we can reasonably infer that the dependent variable, Flight Status, is affected by most of the independent variables, it is critical to note that we are predicting DEPARTURE delays from Washington, D.C., whereas we are given ARRIVAL data into New York and weather data upon ARRIVAL in New York. Therefore, we must omit Weather, as it is not an ex-ante predictor. Otherwise, it seems all other predictors can stay in the model; any superfluous variables that we put into our test models will be excised when we locate the optimal alpha penalty level.
     
     That said, I would like to make an important note on the selection of redundant dummies. Examining the code below, you will notice that for "Flight Status", I have not dropped the redundant dummy based on the mode value. Rather, because this variable has only two possible values, I kept the dummy for the outcome we are most interested in. That is, I retained "Flight Status" = Delayed and dropped "Flight Status" = On-time.
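As a minimal sketch of this coding choice, the toy example below uses made-up rows (not the assignment data) to show the two rules side by side: the modal category is dropped for a multi-level predictor, while the category of interest is kept for the two-level outcome. The names `toy`, `carrier_reference`, and `status_reference` are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Made-up rows for illustration only (not the assignment data)
toy = pd.DataFrame({
    "CARRIER": ["US", "US", "DL", "MQ"],
    "Flight Status": ["Delayed", "On-time", "On-time", "On-time"],
})

# One dummy column per category value
dummies = pd.get_dummies(toy, prefix_sep="_")

# Multi-level predictor: drop the dummy for the most frequent (modal) category
carrier_reference = "CARRIER_" + toy["CARRIER"].mode()[0]

# Two-level outcome: drop "On-time" so the retained dummy marks the event of interest
status_reference = "Flight Status_On-time"

coded = dummies.drop(columns=[carrier_reference, status_reference])
print(list(coded.columns))
```

With this coding, every remaining carrier coefficient is read relative to the dropped modal carrier, and the outcome column equals 1 exactly when the flight was delayed.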
     
    

### Part A (Python): Setting Up Variable Codings and Data Partitions

In [1]:
#Import Needed Packages
import pandas as pd
import seaborn as sns
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
import statistics as stat
from sklearn import metrics

#read in the data
df = pd.read_csv(r'C:\Users\Home\Documents\Data Mining\Assignments\Assignment 4\HW4_FlightDelays.csv')

#drop "Weather" as it is not an ex-ante predictor
df.drop('Weather', axis=1, inplace=True)

#Group variables into a list based on type (there are no numeric variables in this data set)
cvar_list = ['Binned_CRS_DEP_TIME','CARRIER','DEST','ORIGIN','DAY_WEEK','Flight Status']

#Creating Dummies for Categorical Variables
df2 = df.copy()
df2[cvar_list] = df[cvar_list].astype('category')
df2 = pd.get_dummies(df2, prefix_sep = '_')

#Finding the mode of each column so we know which redundant dummy to drop
#I am skipping the mode for "Flight Status": my event of interest is "Delayed", so I will drop the "On-time" dummy instead
time_mode = stat.multimode(df['Binned_CRS_DEP_TIME'])
carrier_mode = stat.multimode(df['CARRIER'])
dest_mode = stat.multimode(df['DEST'])
origin_mode = stat.multimode(df['ORIGIN'])
day_mode = stat.multimode(df['DAY_WEEK'])

delay_rdummy = 'Flight Status_On-time'

#remove one "redundant dummy", per each set of dummies
rdummies = ['Binned_CRS_DEP_TIME_'+str(time_mode[0]), 'CARRIER_'+carrier_mode[0],'DEST_'+dest_mode[0],'ORIGIN_'+origin_mode[0],'DAY_WEEK_'+str(day_mode[0]),delay_rdummy]
df3 = df2.drop(columns=rdummies)

#Data Partition:
#Splitting the data into our partitions will return two dataframes, so we must prep like so:
testpart_size = .2
df_partition = df3

df_nontestdata, df_testdata = train_test_split(df_partition, test_size = testpart_size, random_state = 1)

print(df_nontestdata)

      Binned_CRS_DEP_TIME_1  Binned_CRS_DEP_TIME_2  Binned_CRS_DEP_TIME_3  \
710                       0                      0                      0   
1258                      0                      0                      0   
435                       0                      0                      0   
987                       0                      0                      0   
1286                      0                      0                      0   
...                     ...                    ...                    ...   
715                       0                      0                      0   
905                       0                      1                      0   
1096                      0                      0                      0   
235                       0                      0                      0   
1061                      0                      1                      0   

      Binned_CRS_DEP_TIME_4  Binned_CRS_DEP_TIME_5  Binned_CRS_DEP_TIME_7  

### Part B1 (Python): Running Logistic Regressions With Pre-Specified Penalty Levels To Find Most Important Predictor

In [2]:
#Logistic Regression Analysis:
DV = 'Flight Status_Delayed'
y = df_nontestdata[DV]
x = df_nontestdata.drop(columns = [DV])

#Setting penalty levels for use in below loops
iterable_alpha = range(1,11)

def summary_coef(model_object):
    n_predictors = x.shape[1]
    model_coef = pd.DataFrame(model_object.coef_.reshape(1,n_predictors),columns = x.columns.values)
    model_coef['Intercept'] = model_object.intercept_
    return (model_coef.transpose())

#Very important: remember (@ me, the student) that we are no longer in the world of y = mx+b. The coefficients that get printed out are only for the
#linear (log-odds) part of the model. They still have to be plugged into the logistic function to get P.
for i in iterable_alpha:
    iterable_classifier = LogisticRegression(C = 1/i,penalty = 'l1',solver='saga',max_iter = 200, random_state = 1).fit(x,y)
    print("\n","Alpha equals :",i)
    print(summary_coef(iterable_classifier))


 Alpha equals : 1
                              0
Binned_CRS_DEP_TIME_1 -0.402638
Binned_CRS_DEP_TIME_2 -0.514201
Binned_CRS_DEP_TIME_3 -0.550584
Binned_CRS_DEP_TIME_4 -0.493722
Binned_CRS_DEP_TIME_5  0.291855
Binned_CRS_DEP_TIME_7  0.102268
Binned_CRS_DEP_TIME_8  0.243245
CARRIER_CO             0.323165
CARRIER_DL            -0.485427
CARRIER_MQ             0.539780
CARRIER_OH            -0.636232
CARRIER_RU             0.000000
CARRIER_UA             0.000000
CARRIER_US            -1.080606
DEST_EWR               0.077356
DEST_JFK              -0.143823
ORIGIN_BWI             0.218863
ORIGIN_IAD             0.222245
DAY_WEEK_1             0.790290
DAY_WEEK_2             0.479761
DAY_WEEK_3             0.118975
DAY_WEEK_4            -0.168134
DAY_WEEK_6            -0.791380
DAY_WEEK_7             0.608921
Intercept             -0.779283

 Alpha equals : 2
                              0
Binned_CRS_DEP_TIME_1 -0.337190
Binned_CRS_DEP_TIME_2 -0.462513
Binned_CRS_DEP_TIME_3 -0.417466
Bi

### Part B2 (Python): Finding What Pre-Specified Penalty Level Causes The Most Important Predictor to Drop Out

In [3]:
#Please note: in this cell, I will manually plug different alphas into the model until "CARRIER_US" falls to zero
#I will show my attempts by continually commenting out previous efforts:

#trial_and_error_alpha = 10
#trial_and_error_alpha = 11
#trial_and_error_alpha = 12
#trial_and_error_alpha = 15
#trial_and_error_alpha = 18
#trial_and_error_alpha = 30
#trial_and_error_alpha = 32
trial_and_error_alpha = 33.5

trial_and_error_classifier = LogisticRegression(C = 1/trial_and_error_alpha,penalty = 'l1',solver='saga',max_iter = 200, random_state = 1).fit(x,y)
print(summary_coef(trial_and_error_classifier))

                             0
Binned_CRS_DEP_TIME_1  0.00000
Binned_CRS_DEP_TIME_2  0.00000
Binned_CRS_DEP_TIME_3  0.00000
Binned_CRS_DEP_TIME_4  0.00000
Binned_CRS_DEP_TIME_5  0.00000
Binned_CRS_DEP_TIME_7  0.00000
Binned_CRS_DEP_TIME_8  0.00000
CARRIER_CO             0.00000
CARRIER_DL             0.00000
CARRIER_MQ             0.00000
CARRIER_OH             0.00000
CARRIER_RU             0.00000
CARRIER_UA             0.00000
CARRIER_US             0.00000
DEST_EWR               0.00000
DEST_JFK               0.00000
ORIGIN_BWI             0.00000
ORIGIN_IAD             0.00000
DAY_WEEK_1             0.00000
DAY_WEEK_2             0.00000
DAY_WEEK_3             0.00000
DAY_WEEK_4             0.00000
DAY_WEEK_6             0.00000
DAY_WEEK_7             0.00000
Intercept             -0.76955
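The manual trial and error above could also be automated. The sketch below is only an illustration under stated assumptions: it runs on synthetic data from `make_classification` rather than the flight data, it uses a hypothetical helper named `first_dropout_alpha`, and it swaps in the `liblinear` solver in place of `saga` (both support the L1 penalty). It scans a grid of penalty levels from small to large and reports the first one at which a chosen coefficient is exactly zero.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def first_dropout_alpha(x, y, column_index, alphas):
    """Return the smallest alpha in `alphas` at which the chosen coefficient is exactly zero.

    Hypothetical helper for illustration; returns None if the coefficient
    never reaches zero on the supplied grid.
    """
    for alpha in sorted(alphas):
        model = LogisticRegression(C=1 / alpha, penalty='l1',
                                   solver='liblinear', random_state=1).fit(x, y)
        if model.coef_[0, column_index] == 0:
            return alpha
    return None

# Synthetic stand-in data (assumption: not the flight data)
x_demo, y_demo = make_classification(n_samples=200, n_features=5, n_informative=3,
                                     n_redundant=0, random_state=1)

# Log-spaced candidate penalties from 1 to 10,000
alphas = np.logspace(0, 4, num=30)
dropout = first_dropout_alpha(x_demo, y_demo, column_index=0, alphas=alphas)
print("Coefficient 0 first drops out at alpha =", dropout)
```

On the assignment data, the same idea would be applied by passing `x`, `y`, and the position of "CARRIER_US" in `x.columns`; the answer is only as precise as the grid of candidate alphas.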


### Part B (explanation):

     Based on the above regressions from part B1, the carrier being US Airways appears to be the most important predictor of whether a flight will experience a departure delay, in terms of the absolute value of its coefficient. The coefficient is negative, so relative to the baseline carrier, a US Airways flight is associated with lower odds of delay. From part B2, it can be observed that "CARRIER_US" remains in the model until the penalty level reaches approximately 33.5.
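Because the printed coefficients are on the log-odds scale, they must pass through the logistic function before they can be read as probabilities. The short sketch below does this arithmetic for the alpha = 1 output from part B1, using the printed intercept and the "CARRIER_US" coefficient. The baseline case assumes every dummy equals 0, that is, a flight in the reference category of every predictor; the variable names are hypothetical.

```python
import math

def logistic(z):
    """Map a log-odds value z to a probability between 0 and 1."""
    return 1 / (1 + math.exp(-z))

# Values copied from the alpha = 1 output in part B1
intercept = -0.779283
coef_carrier_us = -1.080606

# Baseline flight: all dummies are 0, so the log-odds is just the intercept
p_baseline = logistic(intercept)

# Same flight, but with CARRIER_US switched from 0 to 1
p_carrier_us = logistic(intercept + coef_carrier_us)

print("P(delay), baseline flight:", round(p_baseline, 4))
print("P(delay), US Airways flight:", round(p_carrier_us, 4))
```

The drop in estimated probability between the two cases is what the large negative coefficient means in practical terms.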

### Part C:

In [4]:
#Setup Logistic Regression with k-folds = 5
kfolds = 5

#Establishing alpha range for optimal logistic regression
min_alpha = .001
max_alpha = 100

#Because there are infinitely many values between min_alpha and max_alpha, we must specify how many candidate alphas to try
#np.linspace then returns that many evenly spaced values across the interval
n_candidates = 1000
alpha_list= list(np.linspace(min_alpha, max_alpha, num = n_candidates))
c_list= list(1/np.linspace(min_alpha, max_alpha, num = n_candidates))

#Fit a cross-validated L1 logistic regression over the candidate penalties and print the coefficients at the selected penalty
classifier_optimal = LogisticRegressionCV(Cs = c_list,cv=kfolds,penalty = 'l1',solver='saga',max_iter=2000, random_state=1, n_jobs = -1).fit(x,y)
print(summary_coef(classifier_optimal))

#Find the optimal selected alpha
print(1/classifier_optimal.C_)

                              0
Binned_CRS_DEP_TIME_1 -0.450259
Binned_CRS_DEP_TIME_2 -0.540088
Binned_CRS_DEP_TIME_3 -0.660682
Binned_CRS_DEP_TIME_4 -0.533879
Binned_CRS_DEP_TIME_5  0.322535
Binned_CRS_DEP_TIME_7  0.124208
Binned_CRS_DEP_TIME_8  0.288676
CARRIER_CO             0.473356
CARRIER_DL            -0.481035
CARRIER_MQ             0.602919
CARRIER_OH            -1.099688
CARRIER_RU             0.000000
CARRIER_UA            -0.054363
CARRIER_US            -1.086747
DEST_EWR               0.025905
DEST_JFK              -0.185100
ORIGIN_BWI             0.439147
ORIGIN_IAD             0.293619
DAY_WEEK_1             0.852705
DAY_WEEK_2             0.531485
DAY_WEEK_3             0.177251
DAY_WEEK_4            -0.160297
DAY_WEEK_6            -0.800274
DAY_WEEK_7             0.662175
Intercept             -0.834278
[0.3012973]
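One thing worth checking is how finely the candidate grid covers small penalties, since the selected alpha (about 0.3013) sits at the fourth point of the evenly spaced grid. The sketch below rebuilds the same `np.linspace` grid and compares it with a log-spaced grid over the same range. The log-spaced alternative is an assumption offered for comparison, not something the assignment requires.

```python
import numpy as np

min_alpha, max_alpha, n_candidates = 0.001, 100, 1000

# Evenly spaced grid, as used in Part C
alpha_linear = np.linspace(min_alpha, max_alpha, num=n_candidates)

# Alternative (assumption): same endpoints, but evenly spaced on a log10 scale
alpha_log = np.logspace(np.log10(min_alpha), np.log10(max_alpha), num=n_candidates)

# How many candidates fall below alpha = 1 in each grid?
n_linear_below_1 = int((alpha_linear < 1).sum())
n_log_below_1 = int((alpha_log < 1).sum())

print("Spacing of the linear grid:", alpha_linear[1] - alpha_linear[0])
print("Candidates below 1 (linear grid):", n_linear_below_1)
print("Candidates below 1 (log grid):", n_log_below_1)
```

If the optimum lands near the low end of the range, as it does here, a log-spaced grid (or a second, narrower linear grid around the first answer) would likely locate it more precisely.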


### Part D:
#### Please note: for the more explicitly calculated Confusion Matrix, please see the Excel sheet submitted

In [5]:
#Calculate Performance of Final Selected Model Over Test Partition

#Actual values of the DV in the test partition
y_test_actual = df_testdata[DV]

#Predictor values in the test_partition
x_test = df_testdata.drop(columns = [DV])

#Predicted values of the DV in the test partition
y_test_predicted = classifier_optimal.predict(x_test)

#Create dataframe and outsheet for purposes of showing the creation and calculation of the Confusion Matrix.
confusion_matrix_basis = df_testdata.copy()
confusion_matrix_basis['Predicted Y Values'] = y_test_predicted

confusion_matrix_basis.to_excel(r'C:\Users\Home\Documents\Data Mining\Assignments\Assignment 4\Confusion_Matrix_Basis.xlsx', index = False, header=True)

#Get the number of observations in the test partition
n_obs_test = df_testdata.shape[0]

#Since we have predicted DV values as binaries and since we have the actual values of the DV in the test partition, 
#we can setup a Confusion Matrix
print("-----Confusion Matrix-----")
print(metrics.confusion_matrix(y_test_actual, y_test_predicted))
        
#Calculate accuracy rate
print("\n","The model's accuracy against the test partition is",classifier_optimal.score(x_test,y_test_actual))

-----Confusion Matrix-----
[[154  16]
 [ 69  25]]

 The model's accuracy against the test partition is 0.678030303030303
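The accuracy figure can be reproduced by hand from the confusion matrix, which also yields the error rates that accuracy hides. The sketch below assumes scikit-learn's layout for `metrics.confusion_matrix` (rows are actual classes, columns are predicted classes, with class 0 = on-time first) and uses the counts printed above; the variable names are hypothetical.

```python
import numpy as np

# Counts copied from the printed confusion matrix
# Rows: actual (0 = on-time, 1 = delayed); columns: predicted (0, 1)
cm = np.array([[154, 16],
               [69, 25]])

tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()      # share of all flights classified correctly
sensitivity = tp / (tp + fn)         # share of delayed flights that were caught
specificity = tn / (tn + fp)         # share of on-time flights that were caught
precision = tp / (tp + fp)           # share of predicted delays that were real

print("Accuracy:   ", round(accuracy, 4))
print("Sensitivity:", round(sensitivity, 4))
print("Specificity:", round(specificity, 4))
print("Precision:  ", round(precision, 4))
```

The accuracy matches the value reported by `classifier_optimal.score`, while the low sensitivity shows that the model misses most of the flights that were actually delayed.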


### Part E (explanation):
     This model can be used to predict the probability of a delay on a United (UA) flight from IAD to EWR on a Monday at 11:00am. This particular carrier, flight segment, day, and time all appear in the historical data used to fit our model, so the model can be used to predict this probability. However, we must make sure that the number, names, and order of the predictor columns in the new data match those used to fit the model.
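One way to guarantee that the new data has the same predictors is to reindex its dummy columns against the training columns. The sketch below is a toy illustration: `training_columns` is a hypothetical, shortened stand-in for the real predictor list, and the single new flight is made up. `DataFrame.reindex` adds any missing dummies as 0, drops dummies the model never saw (including reference categories), and puts the columns in the training order.

```python
import pandas as pd

# Hypothetical predictor columns kept by a fitted model (reference dummies already dropped)
training_columns = ["CARRIER_DL", "CARRIER_UA", "ORIGIN_IAD", "DAY_WEEK_1", "DAY_WEEK_2"]

# One made-up new flight
new_flight = pd.DataFrame({
    "CARRIER": ["UA"],
    "DEST": ["EWR"],
    "ORIGIN": ["IAD"],
    "DAY_WEEK": [1],
})

# Treat every column as categorical, then dummy-code it
new_flight = new_flight.astype("category")
new_dummies = pd.get_dummies(new_flight, prefix_sep="_", dtype=int)

# Align to the training predictors: same names, same order, zeros where missing
aligned = new_dummies.reindex(columns=training_columns, fill_value=0)
print(aligned)
```

Here the "DEST_EWR" dummy is discarded because it is not among the hypothetical training columns, which is the same thing that happens to any reference category that was dropped before fitting.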

### Part E (Python):

In [6]:
#Estimate Using New Data
df_newdata = pd.read_csv(r'C:\Users\Home\Documents\Data Mining\Assignments\Assignment 4\New_Flight.csv')

#Generate the categorical predictor list for the new data, making sure to drop "Flight Status" because we're just copying from above
Original_DV = 'Flight Status'
cpredictor_list = cvar_list.copy()
cpredictor_list.remove(Original_DV)

#Generate the list of the categorical predictors for our new data and establish data type
df_newdata2 = df_newdata.copy()
df_newdata2[cpredictor_list] = df_newdata[cpredictor_list].astype('category')

#Code categorical variables in the new data
df_newdata2 = pd.get_dummies(df_newdata2,prefix_sep='_')


#I need to add the dummies to the new data that are missing, since we created a total of 24 in the model
#Establish the needed list, establish the current list and loop to create dummies equal to 0 in the new dataframe.
needed_dummies = df3.columns
needed_dummies = needed_dummies.drop('Flight Status_Delayed')

current_dummies = df_newdata2.columns


df_newdata3 = df_newdata2.copy()

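for i in needed_dummies:
    if i not in current_dummies:
        df_newdata3[i] = 0

#Keep only the predictors used to fit the model, in the same column order (this also drops any extra columns, e.g. reference dummies)
df_newdata3 = df_newdata3[needed_dummies]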

#With the above manipulated new data, run the Logistic Regression for the new data
#(@ me, the student) Remember, we are predicting whether a flight will be delayed leaving DC. 1 = predict delay, 0 = predict no delay

#predicted_DC_departure_delay = classifier_optimal.predict(df_newdata3)
#print(predicted_DC_departure_delay)

#Find the probability associated with the predicted outcome (1st output: % chance DV = 0. 2nd output: % chance DV = 1)
predicted_DC_departure_delay_probability = classifier_optimal.predict_proba(df_newdata3)
print("The estimated probability that our Monday,11:00am United Airlines flight from Dulles to Newark","\n","will be delayed is",(predicted_DC_departure_delay_probability[:,1])*100,"%")

FileNotFoundError: [Errno 2] File C:\Users\Home\Documents\Data Mining\Assignments\Assignment 4\New_Flight.csv does not exist: 'C:\\Users\\Home\\Documents\\Data Mining\\Assignments\\Assignment 4\\New_Flight.csv'

In [7]:
y_test_actual

288     1
190     1
852     0
596     0
186     0
       ..
1007    0
943     0
746     0
631     0
478     0
Name: Flight Status_Delayed, Length: 264, dtype: uint8

In [None]:
y_test_predicted