# **Set Up**

In [None]:
!pip install rulefit interpret -q -U --progress-bar off
# !pip install cvae==0.0.3 -q -U --progress-bar off

In [None]:
# Data
import math
import numpy as np
import pandas as pd
from tqdm import tqdm

# Data Visualization
import plotly.express as px
import matplotlib.pyplot as plt

# Data Processing
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split

# Models
from rulefit import RuleFit
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import RidgeCV, RidgeClassifierCV
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from sklearn.neural_network import MLPRegressor, MLPClassifier
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier, GradientBoostingClassifier

# Model Metrics
from sklearn.metrics import r2_score as R2
from sklearn.metrics import mean_squared_error as MSE
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score, matthews_corrcoef

# Model Interpretations
from interpret import show
import statsmodels.api as sm
from interpret.perf import ROC
from sklearn.tree import plot_tree, export_text
from interpret.glassbox import ExplainableBoostingClassifier

# Dimensionality Reduction Methods
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
# from cvae.cvae import CompressionVAE


In [None]:
# Data file path
# file_path = "./Datasets/FlightDelays.csv"
file_path = "/kaggle/input/airline-delay-and-cancellation-data-2009-2018/2018.csv"

# Reproducibility
SEED = 42
np.random.seed(SEED)

# **Flight Delays Dataset**

In [None]:
# Load the data as a data frame
df = pd.read_csv(file_path)
df.drop(columns=['Unnamed: 27'], inplace=True)

# Quick Look at the data frame
df.head()

In [None]:
n_samples, n_features = df.shape

print(f"No. of Samples : {n_samples}")
print(f"No. of Features: {n_features}")

The dataset contains **7,213,446 data points** and **28 features**. Not every feature will be equally useful, but together they give a broad view of the **airline delay problem** and the factors that influence flight delays.

In [None]:
# Data information
df.info()

Of the **28 features**, **23** are **quantitative** (numerical data) and the remaining **five** are **qualitative** (categorical or descriptive information).
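As a rough illustration of how columns can be separated by type, the sketch below uses `select_dtypes` on a small made-up frame. The frame `toy` and its values are assumptions for demonstration only and are not taken from the flight data.

```python
import pandas as pd

# Toy frame standing in for the flight data (values are illustrative only)
toy = pd.DataFrame({
    "FL_DATE": ["2018-01-01", "2018-01-02"],
    "OP_CARRIER": ["AA", "DL"],
    "DEP_DELAY": [5.0, -3.0],
    "DISTANCE": [733, 1050],
})

# Numeric columns (ints and floats) are treated as quantitative
numerical_cols = toy.select_dtypes(include="number").columns.tolist()
# Everything else (here, string columns) is treated as qualitative
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(f"Categorical: {categorical_cols}")
print(f"Numerical  : {numerical_cols}")
```

This gives the same kind of split as checking each column's dtype in a loop, just in two calls.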

In [None]:
categorical_features, numerical_features = [], []

for col in df.columns:
    if df[col].dtype == "object":
        categorical_features.append(col)
    else:
        numerical_features.append(col)

print(f"Categorical features: {categorical_features}")
print(f"Numerical   features: {numerical_features}")

Many of the columns appear to contain NaN values. Let's count them per column.

In [None]:
df.isnull().sum()

Inspecting the data frame and the null counts shows that missing values are widespread. Columns that are almost entirely empty carry no usable information, however relevant they might be in principle, so we remove them first.
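Raw null counts are hard to judge without the row count next to them, so it can help to look at the share of missing values per column instead. The sketch below shows this on a small made-up frame; `toy` and its values are assumptions, not the real data.

```python
import numpy as np
import pandas as pd

# Toy frame with some missing values (illustrative only)
toy = pd.DataFrame({
    "DEP_DELAY": [5.0, np.nan, 12.0, 0.0],
    "CARRIER_DELAY": [np.nan, np.nan, 12.0, np.nan],
    "DISTANCE": [733, 1050, 290, 1200],
})

# Fraction of missing values per column, largest first
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

A column whose share is close to 1.0 is a candidate for removal, while a column with a small share may be better handled by dropping rows.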

# **Data Cleaning**

Let's start by removing the unwanted columns.

In [None]:
# Identify the columns that are almost entirely null (at least 7,000,000 missing values)
unwanted_cols = [feature for feature, null_values in df.isnull().sum().items() if null_values>=7000000]

# Remove columns
df.drop(columns=unwanted_cols, inplace=True)

Many of the remaining values are still null, and they need to be dealt with before modeling so that the analysis rests on complete, reliable records. One option, shown commented out below, is a stricter column threshold.

In [None]:
# # Identify the columns to remove
# cols_to_remove = [feature for feature, null_values in df.isnull().sum().items() if null_values>=5000000]

# # Remove columns
# df.drop(columns=cols_to_remove, inplace=True)

Dropping features works well when a column is mostly null. Because the dataset is so large, however, we can afford a different strategy: keep the remaining features and drop only the rows that contain null values. This preserves the features while still resolving the missing data.
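The two strategies can be compared on a small example. This is a minimal sketch on a made-up frame; the 50% cutoff `max_missing_share` is a hypothetical value chosen for illustration, not the threshold used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame: one column is entirely empty, the others have a few gaps
toy = pd.DataFrame({
    "DEP_DELAY": [5.0, np.nan, 12.0, 0.0],
    "CARRIER_DELAY": [np.nan, 7.0, 12.0, 3.0],
    "EMPTY_COL": [np.nan, np.nan, np.nan, np.nan],
})

# Strategy 1: drop columns that are mostly empty (hypothetical 50% cutoff)
max_missing_share = 0.5
sparse_cols = toy.columns[toy.isnull().mean() > max_missing_share]
by_column = toy.drop(columns=sparse_cols)

# Strategy 2: keep the remaining columns and drop rows with any missing value
by_row = by_column.dropna().reset_index(drop=True)

print(by_column.shape, by_row.shape)
```

Applying the column step first keeps an empty column from wiping out every row in the second step.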

In [None]:
# Remove rows containing NaN values and renumber the index
df = df.dropna().reset_index(drop=True)

# Quick Look
df.head()

In [None]:
# Checking for the Null Values
df.isnull().sum()

Next, we list the remaining features so that we can decide which ones are relevant to the analysis and which can be dropped.

In [None]:
features = df.columns.tolist()
for index, feature in enumerate(features):
    print(f"{index+1}: {feature}")

Here's the categorized list of features along with explanations:

1. **General Features**:
   - FL_DATE: Indicates the date of the flight departure.
   - OP_CARRIER: Denotes the operating carrier or airline for the flight.
   - OP_CARRIER_FL_NUM: Represents the flight number assigned by the operating carrier.
   - ORIGIN: Specifies the departure airport code or location.
   - DEST: Indicates the destination airport code or location.
   - CANCELLED: An indicator for whether the flight was canceled.
   - DIVERTED: An indicator for whether the flight was diverted to a different destination.


2. **Departure Features**:
   - CRS_DEP_TIME: Represents the scheduled departure time of the flight.
   - DEP_TIME: Indicates the actual departure time of the flight.
   - DEP_DELAY: Represents the delay in departure time, if any.
   - TAXI_OUT: Denotes the time spent taxiing out before takeoff.
   - WHEELS_OFF: Indicates the time at which the aircraft wheels leave the ground for takeoff.


3. **In-Flight Features**:
   - TAXI_IN: Denotes the time spent taxiing in after landing.
   - CRS_ELAPSED_TIME: Represents the scheduled elapsed time of the flight.
   - ACTUAL_ELAPSED_TIME: Indicates the actual elapsed time of the flight.
   - AIR_TIME: Denotes the time spent in the air during the flight.
   - DISTANCE: Indicates the distance traveled by the flight.


4. **Arrival Features**:
   - CRS_ARR_TIME: Represents the scheduled arrival time of the flight.
   - ARR_TIME: Indicates the actual arrival time of the flight.
   - ARR_DELAY: Denotes the delay in arrival time, if any.
   - WHEELS_ON: Indicates the time at which the aircraft wheels make contact with the ground upon landing.


5. **Delay Features**:
   - CARRIER_DELAY: Denotes the delay attributed to the carrier or airline.
   - WEATHER_DELAY: Represents the delay attributed to weather conditions.
   - NAS_DELAY: Denotes the delay attributed to the National Airspace System (NAS).
   - SECURITY_DELAY: Indicates the delay attributed to security-related issues.
   - LATE_AIRCRAFT_DELAY: Denotes the delay caused by the late arrival of the aircraft from a previous flight.

# **Data Preparation/Processing**

With the data cleaned, the next step is to explore it in more detail and prepare it for modeling.

In [None]:
df.info()

In [None]:
# Keep only American Airlines flights, then drop the now-constant carrier column
df = df[df.OP_CARRIER == "AA"].drop(columns="OP_CARRIER")

The dataset comprises a total of twenty-six features, categorized into different data types. Specifically, there are nineteen numerical features represented as floating point numbers, three features represented as integers, and four features classified as objects or categories.

In [None]:
# Flight Date
df.FL_DATE

The flight date is currently stored as an object (string) column. We convert it to datetime format so that its components, such as month and day of the week, can be extracted.

In [None]:
# Change to Datetime format
df.FL_DATE = pd.to_datetime(df.FL_DATE)
df.FL_DATE

Considering the impact of weather and seasonal patterns on flight delays, it's important to capture the month and day of the week in our analysis. Executives have noted that weekends and winters tend to experience higher delays, highlighting the significance of these temporal factors. Hence, we'll create features to represent the month and day of the week in our dataset.
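As a small illustration of these temporal features, the sketch below extracts the month and day of the week from a few made-up dates with the pandas `.dt` accessor. The `is_weekend` flag is a hypothetical extra, shown only to make the day-of-week convention concrete, and is not part of this notebook's feature set.

```python
import pandas as pd

# A few illustrative dates: a Monday, a Saturday, and a Sunday in 2018
dates = pd.to_datetime(pd.Series(["2018-01-01", "2018-01-06", "2018-07-15"]))

month = dates.dt.month            # 1 = January ... 12 = December
day_of_week = dates.dt.dayofweek  # 0 = Monday ... 6 = Sunday

# Hypothetical helper flag: Saturday (5) and Sunday (6) count as weekend
is_weekend = (day_of_week >= 5).astype(int)

print(pd.DataFrame({"month": month, "dow": day_of_week, "weekend": is_weekend}))
```

Note that pandas numbers the week from Monday as 0, so a weekend check compares against 5 and 6.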

In [None]:
# Month (1-12) and day of week (0 = Monday ... 6 = Sunday)
df["FL_MON"] = df.FL_DATE.dt.month
df["FL_DOW"] = df.FL_DATE.dt.dayofweek

In [None]:
# Removing the Datetime feature
df.drop(columns=["FL_DATE"], inplace=True)

It is also important to know whether the departure or arrival airport is a hub. As of 2019, American Airlines (AA) operates 10 hubs: Charlotte, Chicago–O'Hare, Dallas/Fort Worth, Los Angeles, Miami, New York–JFK, New York–LaGuardia, Philadelphia, Phoenix–Sky Harbor, and Washington–National. We encode whether the ORIGIN and DEST airports are AA hubs using their IATA codes, then remove the OP_CARRIER_FL_NUM, ORIGIN, and DEST columns to keep the analysis simple.
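To make the encoding concrete, here is a minimal sketch on three made-up routes. The frame `toy` is an assumption for illustration; the hub codes are the ten IATA codes for the airports listed above.

```python
import pandas as pd

# IATA codes of the ten AA hubs listed above
hubs = ["CLT", "ORD", "DFW", "LAX", "MIA", "JFK", "LGA", "PHL", "PHX", "DCA"]

# Three illustrative routes (not taken from the dataset)
toy = pd.DataFrame({
    "ORIGIN": ["DFW", "SEA", "JFK"],
    "DEST": ["BOS", "ORD", "AUS"],
})

# 1 if the airport is an AA hub, 0 otherwise
toy["ORIGIN_HUB"] = toy["ORIGIN"].isin(hubs).astype(int)
toy["DEST_HUB"] = toy["DEST"].isin(hubs).astype(int)

print(toy)
```

Two binary columns replace two high-cardinality airport codes, which keeps the feature matrix small and fully numeric.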

In [None]:
df.ORIGIN

In [None]:
df.DEST

In [None]:
# List the hubs
hubs = ['CLT', 'ORD', 'DFW', 'LAX', 'MIA', 'JFK', 'LGA', 'PHL', 'PHX', 'DCA']

# Encode hub membership as 0/1 integers
df['ORIGIN_HUB'] = df.ORIGIN.isin(hubs).astype('int')
df['DEST_HUB'] = df.DEST.isin(hubs).astype('int')

In [None]:
# ORIGIN, DEST, and the flight number are no longer needed
df.drop(columns=['ORIGIN', 'DEST', 'OP_CARRIER_FL_NUM'], inplace=True)

# Quick Look
df.head()

In [None]:
[feature for feature in df.columns if 'DELAY' in feature]

1. **DEP_DELAY**: This feature represents the delay in departure time, measured in minutes. It indicates how late the flight was in departing from the scheduled departure time.

2. **ARR_DELAY**: ARR_DELAY signifies the delay in arrival time, also measured in minutes. It indicates the deviation of the actual arrival time from the scheduled arrival time.

3. **CARRIER_DELAY**: This feature denotes the delay attributed to the carrier or airline, which includes issues such as maintenance problems, crew scheduling issues, or other airline-related factors.

4. **WEATHER_DELAY**: WEATHER_DELAY represents the delay caused by adverse weather conditions, such as thunderstorms, snowstorms, or hurricanes, which affect the flight's departure or arrival.

5. **NAS_DELAY**: NAS_DELAY stands for National Airspace System delay, which includes delays attributed to air traffic control, airport operations, or other factors related to the national airspace system.

6. **SECURITY_DELAY**: SECURITY_DELAY indicates the delay caused by security-related issues, such as security checks, passenger screening procedures, or security breaches.

7. **LATE_AIRCRAFT_DELAY**: This feature represents the delay caused by the late arrival of the same aircraft on a previous flight, which pushes back the departure of the current flight.

These delay-related features provide insights into the various factors contributing to flight delays, allowing airlines to identify areas for improvement and enhance operational efficiency.

---
Among these delay-related features, only CARRIER_DELAY pertains directly to delays caused by the airline itself, such as maintenance issues or crew scheduling problems. We therefore keep it as the target and remove the other delay columns.

In [None]:
df

In [None]:
# Remove the other delay columns from the data
cols_to_rem = ['DEP_DELAY', 'ARR_DELAY', 'WEATHER_DELAY', 'NAS_DELAY', 'SECURITY_DELAY', 'LATE_AIRCRAFT_DELAY']
df.drop(columns=cols_to_rem, inplace=True)

# Separate the features (X) from the target (y)
X = df.copy()
y = X.pop("CARRIER_DELAY")

# Standardize the features (note: the scaler is fit on all rows, before the split)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X.to_numpy())

# Split into training and testing sets (90% / 10%)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.1, shuffle=True, random_state=SEED)

print(f"Training Size: {len(X_train)}")
print(f"Testing  Size: {len(X_test)}")
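One point worth noting about scaling: when a scaler is fit before the split, the test rows contribute to its mean and variance. A common alternative is to split first and fit the scaler on the training rows only. The sketch below shows that ordering on synthetic data; the arrays `X_toy` and `y_toy` are made up, and this is an illustration of the alternative rather than the notebook's own pipeline.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

SEED = 42
rng = np.random.default_rng(SEED)

# Synthetic stand-ins for the feature matrix and target
X_toy = rng.normal(loc=10.0, scale=3.0, size=(200, 4))
y_toy = rng.normal(size=200)

# Split first ...
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.1, shuffle=True, random_state=SEED
)

# ... then learn the scaling statistics from the training rows only
scaler_toy = StandardScaler()
X_tr_scaled = scaler_toy.fit_transform(X_tr)
X_te_scaled = scaler_toy.transform(X_te)  # reuses the training mean and std

print(X_tr_scaled.shape, X_te_scaled.shape)
```

With this ordering the training features have exactly zero mean and unit variance, while the test features are only approximately standardized, which is the expected behaviour.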

In [None]:
X

In [None]:
# Classification label: 1 if the carrier delay exceeds 15 minutes, else 0
y_train_class = (y_train > 15).astype(int)
y_test_class = (y_test > 15).astype(int)
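The thresholding step can be checked on a handful of made-up delay values. In this sketch the name `DELAY_THRESHOLD_MIN` and the sample values are assumptions for illustration; the 15-minute cutoff matches the one used above. It also computes the share of positive labels, which is worth knowing before choosing classification metrics.

```python
import pandas as pd

DELAY_THRESHOLD_MIN = 15  # delays strictly above this count as "delayed"

# Illustrative carrier delays in minutes (not taken from the dataset)
delays = pd.Series([0.0, 3.0, 15.0, 16.0, 120.0])

# Vectorized comparison: True/False converted to 1/0
labels = (delays > DELAY_THRESHOLD_MIN).astype(int)

# Share of flights labeled as delayed
positive_share = labels.mean()

print(labels.tolist(), positive_share)
```

Because the comparison is strict, a delay of exactly 15 minutes is labeled 0.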