#### **Project : The Prediction of Flight delays and cancellations**


##### **Context :**

The U.S. Department of Transportation's (DOT) Bureau of Transportation Statistics tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled, and diverted flights is published in DOT's monthly Air Travel Consumer Report and in this dataset of 2015 flight delays and cancellations.


This project uses three CSV files, which are meant to be merged later during the analysis:
- airlines.csv
- airports.csv
- flights.csv

<p>The airlines dataset describes each airline. It contains the following columns:</p>
<ol>
  <li> IATA_CODE; Airline Identifier : The Primary Key</li>
  <li> AIRLINE; Airline's Name</li>
</ol>

<p>The airports dataset describes each airport. It contains the following columns:</p>
<ol>
  <li> IATA_CODE; Location Identifier : 3 characters identifying every airport location</li>
  <li> AIRPORT; Airport's Name</li>
  <li>CITY</li>
  <li>STATE</li>
  <li>COUNTRY : Country name of the airport</li>
  <li>LATITUDE : The Latitude of the Airport</li>
  <li>LONGITUDE : The Longitude of the Airport</li>
</ol>

<p>Finally, the flights dataset contains the following columns:</p>
<ol>
  <li> YEAR; the Year of the Flight Trip</li>
  <li> MONTH </li>
  <li>DAY</li>
  <li>DAY_OF_WEEK</li>
  <li>...</li>
</ol>
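
Because the three files share IATA codes, they can be joined into a single table. The block below is only a sketch on made-up rows: the toy values and the output column names `AIRLINE_NAME` and `ORIGIN_CITY` are assumptions chosen for illustration, not the merge used later in this notebook.

```python
import pandas as pd

# Toy stand-ins for the three CSV files (values are made up for illustration)
airlines = pd.DataFrame({"IATA_CODE": ["AA", "DL"],
                         "AIRLINE": ["American Airlines Inc.", "Delta Air Lines Inc."]})
airports = pd.DataFrame({"IATA_CODE": ["ATL", "JFK"],
                         "AIRPORT": ["Hartsfield-Jackson Atlanta International Airport",
                                     "John F. Kennedy International Airport"],
                         "CITY": ["Atlanta", "New York"]})
flights = pd.DataFrame({"AIRLINE": ["AA", "DL"],
                        "ORIGIN_AIRPORT": ["JFK", "ATL"],
                        "DESTINATION_AIRPORT": ["ATL", "JFK"],
                        "CANCELLED": [0, 1]})

# Attach the airline name: flights.AIRLINE holds the airline's IATA code
merged = flights.merge(
    airlines.rename(columns={"IATA_CODE": "AIRLINE", "AIRLINE": "AIRLINE_NAME"}),
    on="AIRLINE", how="left")

# Attach the origin city: flights.ORIGIN_AIRPORT holds the airport's IATA code
merged = merged.merge(
    airports[["IATA_CODE", "CITY"]].rename(
        columns={"IATA_CODE": "ORIGIN_AIRPORT", "CITY": "ORIGIN_CITY"}),
    on="ORIGIN_AIRPORT", how="left")

print(merged[["AIRLINE_NAME", "ORIGIN_CITY", "CANCELLED"]])
```

A left join keeps every flight even when a code has no match in the lookup table, which makes unmatched codes visible as missing values instead of silently dropping rows.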

![flights_cancellations.jpeg](attachment:6810e973-f048-4cd3-aa95-147ad29b1c47.jpeg)

#### **Part 1 : Getting the DATA**

In [128]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split

%matplotlib inline

# Increase the size of seaborn plots
sns.set(rc = {'figure.figsize': (8, 8)})

In [129]:
raw_data_airlines = pd.read_csv("../input/flight-delays/airlines.csv")
raw_data_airports = pd.read_csv("../input/flight-delays/airports.csv")
raw_data_flights = pd.read_csv("../input/flight-delays/flights.csv", nrows = 7000)

In [130]:
raw_data_airports.head()

In [131]:
display(raw_data_airlines.head(), "The Shape of the Airlines DataSet : {}".format(raw_data_airlines.shape))

In [132]:
display(raw_data_airports.head(), "The Shape of the Airports DataSet : {}".format(raw_data_airports.shape))

In [133]:
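# Show every column when displaying a DataFrame (the full option name is required by recent pandas versions)
pd.set_option('display.max_columns', None)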

In [134]:
display(raw_data_flights.head(), "The Shape of the Flights DataSet : {}".format(raw_data_flights.shape))

#### **Copying the Data in order to compare the final data results with the raw data**

In [135]:
df_airlines = raw_data_airlines.copy()
df_airports = raw_data_airports.copy()
df_flights = raw_data_flights.copy()

In [136]:
# Helper that returns a copy of a DataFrame, so that the raw data stays available for comparison with the final results
def copy_dataframe(df):
    df = df.copy()
    return df

In [137]:
# a quick check for the copied data
display(df_airlines.head(), df_airports.tail(), df_flights.head())

> Exploring the data types and the info about the datasets

In [138]:
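# Data types and non-null counts of the three DataFrames
# (info() prints its report and returns None, so it is called directly instead of through display)
df_airlines.info()
df_airports.info()
df_flights.info()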

In [139]:
# Displaying the unique values of each column of the flights dataset
for column in df_flights:
    unique_values = df_flights[column].unique()
    if len(unique_values) < 10:
        # Few distinct values: print them together with their count
        print('The values of the feature {} : {} ---- {}'.format(column, unique_values, len(unique_values)))
    else:
        # Many distinct values: print only the count
        print('The number of unique values of the feature {} : {}'.format(column, len(unique_values)))


For now we keep all the columns of the airports and airlines dataframes (LATITUDE and LONGITUDE could be removed from the airports dataset). We explore the flights dataset in depth to check for missing values, outliers, the number of unique values, the distributions and, finally, the correlations between the features. The datasets can then be merged into one final dataset on which to build our ML model.

In [140]:
{column: len(df_flights[column].unique()) for column in df_flights.columns}

In [141]:
df_flights.isna().any()

In [142]:
df_flights.isna().mean()

In [143]:
# import seaborn as sns
# fig, ax = plt.subplots(figsize=(8,8))
# sns.heatmap(df_flights.isnull(), cbar = False, yticklabels = False, vmin=0.5, vmax=0.7, ax = ax)

Checking for duplicated rows

In [144]:
### Checking the dataset shape before looking for duplicate rows
print("The Shape of the Data Before removing duplicate rows is {}".format(df_flights.shape))

### Selecting the duplicate rows (nothing is removed at this point)
duplicate_rows_df_flights = df_flights[df_flights.duplicated()]

### Displaying the shape of the duplicate rows DataFrame
display("The Shape of the Duplicate Rows DataFrame is {}".format(duplicate_rows_df_flights.shape))

So there are no duplicate rows in the dataframe; let's move on to cleaning the data.

#### Dropping the Columns with a large % of missing data

In [145]:
columns_2_drop = ['CANCELLATION_REASON',
       'AIR_SYSTEM_DELAY', 'SECURITY_DELAY', 'AIRLINE_DELAY',
       'LATE_AIRCRAFT_DELAY', 'WEATHER_DELAY']

In [146]:
df_flights = df_flights.drop(columns = columns_2_drop)
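
The list of columns to drop is written by hand above. As an alternative sketch, the same kind of list can be derived from the fraction of missing values per column. The toy frame and the 50% cut-off below are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy frame: column "b" is mostly missing, "a" and "c" are mostly complete
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [np.nan, np.nan, np.nan, 1.0],
    "c": [1.0, np.nan, 3.0, 4.0],
})

threshold = 0.5  # assumed cut-off: drop columns with more than 50% missing values

# Fraction of missing values per column
missing_fraction = toy.isna().mean()

# Keep only the names of the columns above the cut-off, then drop them
columns_to_drop = missing_fraction[missing_fraction > threshold].index.tolist()
toy_reduced = toy.drop(columns=columns_to_drop)

print(columns_to_drop)
print(toy_reduced.columns.tolist())
```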

In [147]:
# Checking for the size of the remaining columns
df_flights.shape[1]

In [148]:
df_flights.columns

In [149]:
# We need to see all the columns of the dataframe, so we use this pandas option
pd.set_option('display.max_columns', None)

In [150]:
df_flights.iloc[25:36, :]

In [151]:
# Investigate all unique values in each column
{column: len(df_flights[column].unique()) for column in df_flights.columns}

The output above shows that 'TAIL_NUMBER' and 'FLIGHT_NUMBER' have a lot of unique values. One-hot encoding them would create one column per value, which would slow down processing when we develop the model.

In [152]:
pd.set_option('display.max_columns', None)

#### **Dropping the Unneeded Columns**

In [153]:
# we begin by setting FLIGHT_NUMBER as the index
df_flights = df_flights.set_index("FLIGHT_NUMBER")

In [154]:
df_flights.head()

In [155]:
df_flights = df_flights.sort_index(ascending = True)

In [156]:
df_flights.head()

In [157]:
df_flights = df_flights.drop(columns = ['TAIL_NUMBER', 'YEAR', 'MONTH', 'DAY'])

In [158]:
df_flights.tail()

In [159]:
# Printing out the correlation matrix of our data frame
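# numeric_only = True skips the text columns (AIRLINE, ORIGIN_AIRPORT, DESTINATION_AIRPORT)
corr_matrix = df_flights.corr(numeric_only = True)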
display(corr_matrix)

In [160]:
# Plotting the correlation matrix as a heatmap
plt.figure(figsize=(20, 10))
heat_map = sns.heatmap(corr_matrix, annot = True, fmt="f", linewidths = 3.5,
                       xticklabels = corr_matrix.columns, yticklabels = corr_matrix.columns, cmap = "RdBu")
display(heat_map)

In [161]:
# Displaying the Counting plots for Categorical Data
categorical_columns = ['AIRLINE', 'ORIGIN_AIRPORT', 'DESTINATION_AIRPORT']


for cat_col in categorical_columns:
    plt.figure()
    sns.countplot(x = cat_col, data = df_flights.iloc[150:200, :], hue = 'CANCELLED',  palette = "Set1")

In [162]:
# Another way to look at the categorical data
from matplotlib import rcParams

categorical_columns = ['AIRLINE', 'ORIGIN_AIRPORT', 'DESTINATION_AIRPORT']

# figure size in inches
rcParams['figure.figsize'] = 11.7,8.27

# Defining the Categorical DataFrame
categorical_df_fligths = df_flights.select_dtypes("object")

# Displaying the last 5 rows of the categorical dataframe
categorical_df_fligths.tail()

In [163]:
plot_kinds = ['count', 'strip', 'swarm']

for cat_col in categorical_columns:
    for plot_kind in plot_kinds:
        display('The Categorical Plot of the {} Kind'.format(plot_kind.upper()))
        # sns.factorplot was renamed to sns.catplot in seaborn 0.9
        sns.catplot(x = cat_col, data = categorical_df_fligths.iloc[150:200, :], kind = plot_kind)
        plt.gcf().set_size_inches(15, 8)

**Observation :** Looking at the different plots above, we cannot really draw conclusions from them, so for our analysis we move on to one-hot encoding the dataset and building a logistic regression model.

#### **One-Hot Encoding the Data**

**Remark:**
- OneHotEncoder for Unordered (Nominal Data).
- OrdinalEncoder for Ordered Data (Ordinal Data).

In our case the categorical data is nominal, because the categories (airlines and airports) have no logical order.
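
To make the distinction in the remark concrete, here is a small sketch on toy data. The `DELAY_LEVEL` column is hypothetical and does not exist in the flights dataset; it only stands in for an ordered feature.

```python
import pandas as pd

toy = pd.DataFrame({
    "AIRLINE": ["AA", "DL", "AA"],             # nominal: no natural order
    "DELAY_LEVEL": ["low", "high", "medium"],  # hypothetical ordinal column
})

# Nominal data: one indicator column per category
airline_dummies = pd.get_dummies(toy["AIRLINE"], prefix="AL")

# Ordinal data: map the categories to integers that respect their order
ordered = pd.Categorical(toy["DELAY_LEVEL"],
                         categories=["low", "medium", "high"], ordered=True)
toy["DELAY_LEVEL_CODE"] = ordered.codes

print(airline_dummies.columns.tolist())
print(toy["DELAY_LEVEL_CODE"].tolist())
```

Encoding a nominal column with integers would suggest an order between airlines that does not exist, which is why indicator columns are used for them here.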

In [164]:
def onehot_encode(df, column_dict):
    df = df.copy()
    for column, prefix in column_dict.items():
        dummy_data = pd.get_dummies(df[column], prefix = prefix)
        df = pd.concat([df, dummy_data], axis = 1)
        df = df.drop(column, axis = 1) # we remove the columns once we get our dummies
    return df

In [165]:
# One-hot encode nominal feature columns
df_flights = onehot_encode(
        df_flights,
        column_dict={
            'AIRLINE': 'AL',
            'ORIGIN_AIRPORT': 'OA',
            'DESTINATION_AIRPORT': 'DA'
        }
)

In [166]:
df_flights.iloc[1000:1010, :]

In [167]:
# df_flights.drop('TAIL_NUMBER', axis = 1, inplace = True)

In [168]:
df_flights

### Fill NaN values with the mean of each column

In [169]:
# Displaying the Columns with one missing value or more
missing_value_df_flights = df_flights.loc[:, df_flights.isna().sum() > 0]

missing_value_df_flights.isna().mean()

In [170]:
# Fill the remaining missing values with the mean value of every column
cols_with_remaining_missing_values = df_flights.loc[:, df_flights.isna().sum() > 0].columns

cols_with_remaining_missing_values

In [171]:
# Now we will fill the remaining missing values
for column in cols_with_remaining_missing_values:
    df_flights[column] = df_flights[column].fillna(df_flights[column].mean())

In [172]:
# Checking for the filling step
df_flights.isna().sum()

##### **Conclusion :** There are no missing values left in our dataset, so we can now split the dataset and scale the data to build the ML model.
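
The column means above are computed on the whole dataset, before the train/test split. A possible variation, sketched here on toy arrays, is to learn the means from the training rows only and reuse them on the test rows. This uses scikit-learn's `SimpleImputer`; the arrays are made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy train and test arrays with missing entries
X_train_toy = np.array([[1.0, 10.0], [3.0, np.nan], [np.nan, 30.0]])
X_test_toy = np.array([[np.nan, 50.0]])

imputer = SimpleImputer(strategy="mean")
imputer.fit(X_train_toy)  # the column means come from the training rows only

X_train_filled = imputer.transform(X_train_toy)
X_test_filled = imputer.transform(X_test_toy)  # test gaps are filled with training means

print(imputer.statistics_)
print(X_test_filled)
```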

In [173]:
# Splitting the Data into Xs and y
X = df_flights.drop('CANCELLED', axis = 1).values
y = df_flights['CANCELLED'].values  # 1-D label vector, as scikit-learn estimators expect

Z = df_flights.drop('CANCELLED', axis=1) # DataFrame kept to restore the column names

In [174]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state = 234)

In [175]:
X.shape

In [176]:
X_train.shape

In [177]:
# Constructing the standard scaler
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

scaler.fit(X_train)

In [178]:
X_train = scaler.transform(X_train)

In [179]:
X_train_copy = X_train.copy()

In [180]:
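# Scale the test set with the scaler that was fitted on the training set
X_test = pd.DataFrame(scaler.transform(X_test), columns = Z.columns)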

In [181]:
X_train = pd.DataFrame(X_train, columns = Z.columns)

In [182]:
X_train
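
As a compact illustration of the scaling step, the sketch below fits a `StandardScaler` on a toy training array and applies it once to each split. The toy arrays and the name `scaler_toy` are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train_toy = np.array([[1.0], [2.0], [3.0]])
X_test_toy = np.array([[4.0]])

# The scaler learns the mean and standard deviation from the training data only
scaler_toy = StandardScaler().fit(X_train_toy)

X_train_scaled = scaler_toy.transform(X_train_toy)
X_test_scaled = scaler_toy.transform(X_test_toy)  # reuses the training mean and std

print(X_train_scaled.mean(), X_train_scaled.std())
print(X_test_scaled)
```

Each split should be transformed exactly once; applying `transform` twice to the same array shifts and rescales it a second time.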

#### **Part 3 : Features Selection**

In [183]:
# Splitting into Xs & y
X_x = df_flights.drop('CANCELLED', axis = 1)

In [184]:
y_y = df_flights["CANCELLED"]

In [185]:
{column: len(X_x[column].unique()) for column in X_x.columns}

In [186]:
type(y_y)

In [187]:
print(X_x.shape)
print(y_y.shape)

##### One way to rank the features is to fit a tree-based model and read its feature importances.

<p>In our example, we will choose the Decision Tree Model</p>

In [190]:
# Importing the corresponding libraries
from sklearn.tree import DecisionTreeClassifier

In [191]:
decision_tree = DecisionTreeClassifier(random_state=15, criterion = 'entropy', max_depth = 10)

In [193]:
decision_tree.fit(X_x, y_y)

In [195]:
# Running the feature importance
fi_col = []
fi = []

# Looking for features importance
# (the model was fitted on X_x, so we iterate over its columns, which exclude CANCELLED)
for i, column in enumerate(X_x.columns):
    print("The Feature Importance of the '{}' ---:--- {}".format(column, decision_tree.feature_importances_[i]))

    fi_col.append(column)
    fi.append(decision_tree.feature_importances_[i])

#### **For readability, we create a DataFrame and put the feature importances in it.**

In [196]:
# Create a DataFrame
fi_df = zip(fi_col, fi)

# Creating a Pandas DataFrame
fi_df = pd.DataFrame(fi_df, columns = ['Feature','Feature Importance'])
fi_df

In [197]:
###  Ordering the Data

# drop = True avoids keeping the old index as an extra column
fi_df = fi_df.sort_values('Feature Importance', ascending = False).reset_index(drop = True)

fi_df
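
The pairing of importances with column names can also be written with a pandas Series. The sketch below uses a made-up dataset in which the label is fully determined by one feature, so the expected ranking is known in advance; the column names `informative` and `noise` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

toy_X = pd.DataFrame({
    "informative": [0, 0, 0, 0, 1, 1, 1, 1],
    "noise": [0.3, -1.2, 0.8, 0.1, -0.5, 1.1, 0.2, -0.9],
})
toy_y = toy_X["informative"]  # the label is fully determined by one feature

tree = DecisionTreeClassifier(random_state=15, criterion="entropy", max_depth=10)
tree.fit(toy_X, toy_y)

# feature_importances_ follows the column order of the data used in fit()
importances = pd.Series(tree.feature_importances_, index=toy_X.columns)
importances = importances.sort_values(ascending=False)

print(importances)
```

Indexing the importances by the columns of the frame passed to `fit` keeps names and values aligned, whatever other columns the original dataframe contains.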

In [198]:
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, train_size = 0.9, random_state = 12)

# Printing the Shape of the Multiple Sets
display("The Shape of X Sets")
print(X_train.shape)
print(X_test.shape)
print(X_valid.shape)
  
display("The Shape of y Sets")
print(y_train.shape)
print(y_test.shape)
print(y_valid.shape)

#### **Part 4 : Building the ML Model (Logistic Regression)**

![download.png](attachment:4e086fe6-6710-43f7-81c6-d84f9920c6e2.png)

Some Details:

- We fit an "S"-shaped logistic function.
- The curve gives the probability that a flight will be cancelled.
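
The "S" shape comes from the logistic (sigmoid) function applied to a linear score. A minimal sketch with NumPy, where the three scores are hypothetical values of the linear part of the model:

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real score to a value in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical linear scores (w.x + b) for three flights
scores = np.array([-4.0, 0.0, 4.0])
probabilities = sigmoid(scores)

# With a 0.5 threshold, only positive scores are predicted as cancelled
predicted_cancelled = (probabilities > 0.5).astype(int)

print(probabilities)
print(predicted_cancelled)
```

A score of 0 maps to a probability of exactly 0.5, which is the default decision boundary between the two classes.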

In [201]:
# Importing the Corresponding libraries for model training
from sklearn.linear_model import LogisticRegression

In [202]:
# Creating the ML Model and training it
log_reg = LogisticRegression(random_state = 15, solver = 'lbfgs')

# Training the model on the Data
log_reg.fit(X_train, y_train)

In [204]:
# Predicting the Data
y_pred = log_reg.predict(X_train)

# Displaying the Predicting Vector
display(type(y_pred), y_pred, y_pred.shape)

In [205]:
# Predicting the class probabilities
prob_y_pred = log_reg.predict_proba(X_train)

# Displaying the Matrix of Probabilities Predictions
display(type(prob_y_pred), prob_y_pred, prob_y_pred.shape)

In [206]:
# Predict logarithm of probability estimates
log_prob_y_pred = log_reg.predict_log_proba(X_train)

# Displaying the matrix of log-probability predictions
display(type(log_prob_y_pred), log_prob_y_pred, log_prob_y_pred.shape)

In [207]:
# Coefficients of the feature
log_reg.coef_

In [208]:
# Inspecting the shape of the Data
log_reg.predict(X_train).shape

#### **Part 5 : Evaluating the Model**

In [210]:
# Importing the libraries
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report, log_loss

In [211]:
# Scoring the Training Data
print("The Training Accuracy : ", log_reg.score(X_train, y_train))

# Scoring the Testing Data
print("The Testing Accuracy : ", log_reg.score(X_test, y_test))

# Classification Report (computed on the training set, since y_pred holds the training predictions)
print(classification_report(y_train, y_pred))

### Defining the confusion Matrix

In [212]:
def plot_confusion_matrix(cm, classes = None, title = 'Confusion Matrix'):
  if classes is not None:
    sns.heatmap(cm, cmap = 'BuPu', xticklabels=classes, yticklabels=classes, vmin=0., vmax=1., annot=True, annot_kws={'size':50})
  else:
    sns.heatmap(cm, vmin = 0, vmax = 1)
  plt.title(title)
  plt.ylabel('True Labels')
  plt.xlabel("Predicted Labels")

In [213]:
# Visualizing cm

cm = confusion_matrix(y_train, y_pred)
cm_norm = cm / cm.sum(axis=1).reshape(-1,1)

plot_confusion_matrix(cm_norm, classes = log_reg.classes_, title='Confusion matrix')

In [214]:
# Calculating False Positives (FP), False Negatives (FN), True Positives (TP) & True Negatives (TN)

FP = cm.sum(axis=0) - np.diag(cm)
FN = cm.sum(axis=1) - np.diag(cm)
TP = np.diag(cm)
TN = cm.sum() - (FP + FN + TP)


# Sensitivity, hit rate, recall, or true positive rate
TPR = TP / (TP + FN)
print("The True Positive Rate is:", TPR)

# Precision or positive predictive value
PPV = TP / (TP + FP)
print("The Precision is:", PPV)

# False positive rate or False alarm rate
FPR = FP / (FP + TN)
print("The False positive rate is:", FPR)


# False negative rate or Miss Rate
FNR = FN / (FN + TP)
print("The False Negative Rate is: ", FNR)



## Averages over the classes:
print("")
print("The average TPR is:", TPR.mean())
print("The average Precision is:", PPV.mean())
print("The average False positive rate is:", FPR.mean())
print("The average False Negative Rate is:", FNR.mean())
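
The cell above computes one value per class. For a binary problem, the four counts for the positive class (here, "cancelled") can also be read directly from the confusion matrix. A sketch on made-up labels:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up true and predicted labels (1 = cancelled)
y_true_toy = np.array([0, 0, 0, 0, 1, 1, 1, 0])
y_pred_toy = np.array([0, 0, 1, 0, 1, 0, 1, 0])

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

recall = tp / (tp + fn)               # true positive rate
precision = tp / (tp + fp)            # positive predictive value
false_positive_rate = fp / (fp + tn)
false_negative_rate = fn / (fn + tp)

print(tn, fp, fn, tp)
print(recall, precision, false_positive_rate, false_negative_rate)
```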

In [215]:
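# sklearn.metrics.plot_confusion_matrix was removed in scikit-learn 1.2;
# ConfusionMatrixDisplay.from_estimator is its replacement and does not shadow our own helper
from sklearn.metrics import ConfusionMatrixDisplay

cm_plotting = ConfusionMatrixDisplay.from_estimator(log_reg, X_test, y_test, cmap = 'Blues', values_format = '.3g')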

In [216]:
cm_plotting.confusion_matrix

#### **Part 6 : Evaluating the Log Loss**

In [217]:
# Log Loss on the Training DataSet
print("The Log Loss of the Training Dataset {}".format(log_loss(y_train, prob_y_pred)))


# Log Loss on the Testing Dataset 
prob_y_test_pred = log_reg.predict_proba(X_test)
print("The Log Loss of the Testing Dataset {}".format(log_loss(y_test, prob_y_test_pred)))
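
For reference, the binary log loss is the average of $-[y \log p + (1 - y) \log(1 - p)]$ over the samples, where $p$ is the predicted probability of class 1. The sketch below computes it by hand on made-up values and compares the result with `sklearn.metrics.log_loss`.

```python
import numpy as np
from sklearn.metrics import log_loss

y_true_toy = np.array([0, 1, 1])
p_cancelled = np.array([0.1, 0.8, 0.4])  # made-up predicted P(class 1)

# Average negative log-likelihood of the true labels
manual = -np.mean(y_true_toy * np.log(p_cancelled)
                  + (1 - y_true_toy) * np.log(1 - p_cancelled))

# For binary labels, log_loss accepts the probability of the positive class
library = log_loss(y_true_toy, p_cancelled)

print(manual, library)
```

Confident but wrong probabilities are penalised heavily, so a model can have good accuracy and still a large log loss.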

#### **Dummy Classifier Comparison**

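REMARK : We observe that the log loss of the testing dataset is much bigger than that of the training dataset. To put these numbers in context, we compare them with a dummy classifier as a baseline.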

In [218]:
# Training a Dummy Classifier
from sklearn.dummy import DummyClassifier

dummy_clf = DummyClassifier(strategy="most_frequent")
dummy_clf.fit(X_train, y_train)
score = dummy_clf.score(X_test, y_test)

pred_proba_test = dummy_clf.predict_proba(X_test)
log_loss2 = log_loss(y_test, pred_proba_test)

print("Testing Acc:", score)
print("Log Loss:", log_loss2)
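
A note on reading the dummy log loss: the `most_frequent` strategy outputs probabilities of exactly 0 and 1, so every minority-class sample is heavily penalised. The `prior` strategy of `DummyClassifier` predicts the same labels but outputs the class frequencies as probabilities, which may be a fairer log-loss baseline. A sketch on made-up labels with 20% positives:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import log_loss

X_toy = np.zeros((10, 1))                         # features are ignored by dummy models
y_toy = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])  # 20% positives

most_frequent = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
prior = DummyClassifier(strategy="prior").fit(X_toy, y_toy)

# Same accuracy for both strategies, very different log loss
loss_most_frequent = log_loss(y_toy, most_frequent.predict_proba(X_toy))
loss_prior = log_loss(y_toy, prior.predict_proba(X_toy))

print("Accuracy:", most_frequent.score(X_toy, y_toy), prior.score(X_toy, y_toy))
print("Log loss:", loss_most_frequent, loss_prior)
```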