Logistic Regression with Binary Classification for Flight Delays

Jalyn Buthman

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix


flights = pd.read_csv("/content/airlines_subset.csv")

print(flights.describe())
print()
print('Within this dataset 0 means that the flight was not delayed and 1 means that the flight was delayed.')
print('The mean value for "delayed" is 0.46, which means that almost half of the flights (46%) in the dataset were delayed.')
print()

# model with outcome variable 'delayed' and independent variable 'flight_length'
model = LogisticRegression(solver='liblinear', random_state=0)
x = flights['flight_length'].values.reshape(-1, 1)  # features must be 2-D: (n_samples, 1)
y = flights['delayed'].values  # target stays 1-D, which avoids a DataConversionWarning
model.fit(x, y)

#predicted probabilities for delayed flights as a new column in the dataframe
flights['logistic_regression'] = model.predict_proba(x)[:,1]

#10 largest predicted probabilities for delayed flights
print('The 10 highest predicted probabilities of a flight being delayed:')
print(flights.nlargest(10, 'logistic_regression'))
print()
print('The 10 flights with the highest predicted probabilities of being delayed all have a predicted probability of about 0.650.')
print('The model is not perfect at predicting delayed flights: 2 of these 10 flights were not actually delayed.')
print()

# confusion matrix, using the median predicted probability as the threshold
the_median = flights['logistic_regression'].median()
#print(the_median)
prediction = (flights['logistic_regression'] > the_median).astype(int).tolist()
actual = flights['delayed'].tolist()
# scikit-learn expects (actual, predicted); rows are actual classes, columns are predicted classes
conf_mat = confusion_matrix(actual, prediction)
print('Confusion Matrix:')
print('[True negatives  False positives]')
print('[False negatives True positives]')
print(conf_mat)
print()

# unpack the 2x2 matrix row by row
tn, fp, fn, tp = conf_mat.ravel()

# calculate the precision: true positives / all predicted positives
precision = tp / (tp + fp)
print(f'Precision: {precision}')
print()

# calculate the recall: true positives / all actual positives
recall = tp / (tp + fn)
print(f'Recall: {recall}')
print()
print()

print('The precision and recall scores for this model could both be better, as they are closer to 0.5 than to 1.0. They are not bad scores, but there is room for improvement.')
print('I found that flight delays can be predicted based on flight length, but more types of data (columns) should be analyzed to see whether the model can be improved.')
print()
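
A note on the confusion-matrix step above: scikit-learn's confusion_matrix(y_true, y_pred) takes the actual labels first, and for labels 0 and 1 it returns rows for the actual classes and columns for the predicted classes, so the layout is [[TN, FP], [FN, TP]]. The sketch below uses a small made-up set of labels (not the flight data) to show one way to unpack the four counts and cross-check hand-computed precision and recall against precision_score and recall_score.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels for illustration only (not taken from airlines_subset.csv)
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0, 1, 1]

# Actual labels go first; ravel() flattens [[TN, FP], [FN, TP]] row by row
tn, fp, fn, tp = confusion_matrix(actual, predicted).ravel()

precision = tp / (tp + fp)  # share of predicted delays that were real delays
recall = tp / (tp + fn)     # share of real delays that the model flagged

print(tn, fp, fn, tp)
print(precision, precision_score(actual, predicted))
print(recall, recall_score(actual, predicted))
```

If the hand-computed values and the library values disagree, the usual cause is swapped arguments or reading the matrix with the positive class in the top-left corner.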


        flight_num  depart_time  flight_length         day     delayed
count   485.000000   485.000000     485.000000  485.000000  485.000000
mean   1424.109278   854.298969     192.195876    3.967010    0.461856
std     621.590467   278.963591      49.085361    1.927118    0.499058
min     309.000000   435.000000     150.000000    1.000000    0.000000
25%    1007.000000   595.000000     155.000000    2.000000    0.000000
50%    1355.000000   815.000000     160.000000    4.000000    0.000000
75%    1805.000000  1015.000000     235.000000    6.000000    1.000000
max    2455.000000  1325.000000     320.000000    7.000000    1.000000

Within this dataset 0 means that the flight was not delayed and 1 means that the flight was delayed.
The mean value for "delayed" is 0.46, which means that almost half of the flights (46%) in the dataset were delayed.

The 10 highest predicted probabilities of a flight being delayed:
     flight_num  depart_time  flight_length airline airport_from  day  \
35

  y = column_or_1d(y, warn=True)
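
The column_or_1d(y, warn=True) line above is the tail of scikit-learn's DataConversionWarning, which is raised when the target is passed as a column vector of shape (n, 1) instead of a 1-D array of shape (n,). A minimal sketch of the two shapes, using a tiny made-up frame that only borrows the column names flight_length and delayed (the values are invented for illustration):

```python
import warnings

import pandas as pd
from sklearn.exceptions import DataConversionWarning
from sklearn.linear_model import LogisticRegression

# Made-up values for illustration; only the column names come from the notebook
toy = pd.DataFrame({
    "flight_length": [150, 155, 160, 235, 300, 320],
    "delayed": [0, 0, 1, 0, 1, 1],
})

X = toy[["flight_length"]].values                # features are 2-D
y_column = toy["delayed"].values.reshape(-1, 1)  # column vector
y_flat = toy["delayed"].values                   # 1-D array


def fit_warns(y):
    """Fit a model and report whether a DataConversionWarning was raised."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        LogisticRegression(solver="liblinear", random_state=0).fit(X, y)
    return any(issubclass(w.category, DataConversionWarning) for w in caught)


warned_column = fit_warns(y_column)
warned_flat = fit_warns(y_flat)

print(X.shape, y_column.shape, y_flat.shape)
print(warned_column, warned_flat)
```

The fitted model is the same either way, since scikit-learn flattens the column vector internally; passing the 1-D array simply keeps the output free of the warning.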
