# **Phase 3: Experimentation, Fine-Tuning, and Final Report**

==========================================================================================================================================================

# Abstract

In this project, we tackled the challenging problem of predicting flight delay severity using the OTPW dataset, a compilation of flight performance and weather data. Our objective was to develop a robust, scalable machine learning pipeline capable of delivering accurate predictions while addressing common pitfalls such as data leakage and overfitting. In the initial phases, we established a solid base with multinomial logistic regression, setting the foundation for more advanced methodologies. Subsequent phases focused on the implementation and evaluation of sophisticated models, including ElasticNet Logistic Regression, Gradient Boosted Decision Trees (GBDT), and Multilayer Perceptron (MLP) neural networks, with the latter being a key focus of this phase.

For the MLP architecture, we explored multiple configurations, including single-hidden-layer and two-hidden-layer networks, designed to capture complex nonlinear relationships in the data. Leveraging custom-built functions, we implemented a distributed training mechanism using PySpark to ensure scalability across CPU-based clusters. To enhance the model's robustness, early stopping strategies were employed, guided by validation set performance metrics. These strategies not only mitigated overfitting but also facilitated hyperparameter tuning, such as learning rate decay and batch size optimization. Moreover, feature engineering played a pivotal role, with the integration of time-based features. These enhancements allowed us to capture intricate temporal and spatial dependencies within the dataset.

Our experiments demonstrated significant improvements over the baseline, with the MLP model achieving a test accuracy of 0.8993 in predicting flight delay severity across five classes. 



# Team Members
<table>
    <tr>
        <th>Name</th>
        <th>Email</th>
        <th>Photos</th>
    </tr>
    <tr>
        <td>Achyuth Kolluru</td>
        <td>akolluru@berkeley.edu</td>
        <td> <img src="https://raw.githubusercontent.com/AchyuthKoll/w261_images/master/Achyuth.jpg" width="150"> </td>
    </tr>
    <tr>
        <td>Bernardo Cobos</td>
        <td>bernardoc@berkeley.edu</td>
        <td> <img src="https://raw.githubusercontent.com/AchyuthKoll/w261_images/master/Bernardo.png" width="150"> </td>
    </tr>
    <tr>
        <td>Hitesh Basantani</td>
        <td>hitesh.basantani@berkeley.edu</td>
        <td> <img src="https://raw.githubusercontent.com/AchyuthKoll/w261_images/master/Hitesh.jpg" width="150"> </td>
    </tr>
    <tr>
        <td>Omar Jamil</td>
        <td>ojamil@berkeley.edu</td>
        <td> <img src="https://raw.githubusercontent.com/AchyuthKoll/w261_images/master/Omar.png" width="150"> </td>
    </tr>
    <tr>
        <td>Mohammad Hafezi</td>
        <td>hafezi@berkeley.edu</td>
        <td> <img src="https://raw.githubusercontent.com/AchyuthKoll/w261_images/master/Mohammad.jpg" width="150"> </td>
    </tr>
</table>


# Credit Assignment Plan
<table border="1" cellpadding="4" cellspacing="0">
    <tr>
        <th>Phase</th>
        <th>Description & SMART Goal</th>
        <th>Assigned Member</th>
        <th>Status</th>
        <th>Estimated Person-Hours</th>
    </tr>
    <!-- FP Phase 1 -->
    <tr>
        <td rowspan="3">FP Phase 1: Project Plan</td>
        <td>Clarify objectives, datasets, and project scope. SMART Goal: Complete project summary by Nov. 4.</td>
        <td>All Members</td>
        <td>Completed</td>
        <td>27</td>
    </tr>
    <tr>
        <td>Dataset Overview</td>
        <td>Describe datasets, joins, tasks, and metrics. SMART Goal: Document data structure by Nov. 4.</td>
        <td>All Members</td>
        <td>Completed</td>
        <td>15</td>
    </tr>
    <tr>
        <td>Initial EDA</td>
        <td>Conduct initial EDA to identify trends and outliers. SMART Goal: Summarize key findings by Nov. 4.</td>
        <td>All Members</td>
        <td>Completed</td>
        <td>12</td>
    </tr>
    <!-- FP Phase 2 -->
    <tr>
        <td rowspan="4">FP Phase 2: EDA & Baseline Pipeline</td>
        <td>Handle missing or inconsistent data. SMART Goal: Complete data cleaning by Nov. 10.</td>
        <td>Hitesh, Omar</td>
        <td>Completed</td>
        <td>60</td>
    </tr>
    <tr>
        <td>Detailed EDA</td>
        <td>Analyze dataset distributions and correlations in depth. SMART Goal: Complete by Nov. 10.</td>
        <td>Achyuth, Hitesh, Bernardo</td>
        <td>Completed</td>
        <td>18</td>
    </tr>
    <tr>
        <td>Baseline Model</td>
        <td>Develop baseline pipeline for model benchmarking. SMART Goal: Build baseline model by Nov. 12.</td>
        <td>Bernardo, Mohammad, Omar</td>
        <td>Completed</td>
        <td>12</td>
    </tr>
    <tr>
        <td>Scalability and Efficiency</td>
        <td>Implement distributed/parallel training and scoring. SMART Goal: Complete scalable model by Nov. 20.</td>
        <td>Bernardo, Achyuth, Mohammad</td>
        <td>Completed</td>
        <td>12</td>
    </tr>
    <!-- FP Phase 3 -->
    <tr>
        <td rowspan="3">FP Phase 3: Algorithm Selection & Final Report</td>
        <td>Compare model performances and select the optimal one. SMART Goal: Finalize model by Dec. 5.</td>
        <td>Bernardo, Omar, Achyuth</td>
        <td>Completed</td>
        <td>24</td>
    </tr>
    <tr>
        <td>Model Fine-Tuning</td>
        <td>Optimize hyperparameters and finalize model. SMART Goal: Achieve target metrics by Dec. 10.</td>
        <td>Hitesh, Achyuth</td>
        <td>Completed</td>
        <td>12</td>
    </tr>
    <tr>
        <td>Final Report</td>
        <td>Prepare and submit the final report. SMART Goal: Submit by Dec. 14.</td>
        <td>Omar, Mohammad</td>
        <td>Completed</td>
        <td>24</td>
    </tr>
</table>



# Introduction & Project Description 

We set out to predict flight delay severity, defined initially by the `DEP_DELAY_GROUP` variable, to assist airlines, airports, and travelers in anticipating and mitigating disruptions. Delay severity classification enables effective resource allocation, schedule adjustments, and contingency planning. Our earlier phases classified flight delay severity using a multinomial logistic regression model, which, while interpretable, did not fully leverage the complexity of our integrated flight-weather dataset.

In this phase, our focus shifted toward experimenting with more sophisticated models and feature sets. By leveraging advanced machine learning techniques, such as ElasticNet logistic regression, GBDTs, and MLP neural networks, we aimed to capture non-linear relationships and complex temporal or network patterns. We introduced time-based features to reflect recency and seasonality of delays. Our final goal was to achieve improved predictive performance on both the validation sets and an unseen year-2019 test set.


# Data Description & Feature Engineering 

We used the 5-year OTPW dataset, which integrates flight data from the U.S. Department of Transportation with weather data from the National Oceanic and Atmospheric Administration for the years 2015 - 2019. Training was done on 2015-2018 data, while 2019 was held out as a test set. The dataset includes detailed records of flight schedules, delays, cancellations, and extensive hourly weather observations aligned to the departure origin airports. 

## Data Lineage & Transformations:  
Missing values were imputed using group-based averages, and we introduced indicator columns to track imputation. We then standardized data formats and ensured time alignment to prevent data leakage from future periods. To handle the grouping of delays, quantile-based bucketing was applied to the departure delay variable (`DEP_DELAY`). Specifically, delays were grouped into four buckets using the 50th (Q2) and 75th (Q3) percentiles of the data distribution as thresholds, with a fifth bucket reserved for cancellations. The defined buckets were:

- **Bucket 0**: Flights with no delay (`Dep_Delay = 0`).
- **Bucket 1**: Small delays (`15 minutes ≤ Dep_Delay < 30 minutes`).
- **Bucket 2**: Moderate delays (`30 minutes ≤ Dep_Delay < 45 minutes`).
- **Bucket 3**: Significant delays (`Dep_Delay ≥ 45 minutes`).
- **Bucket 4**: Flights that were canceled (`Dep_Delay = 13`).

We reduced the number of buckets in the target variable because the original values were too sparsely distributed.
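The bucket definitions above can be sketched as a small mapping function. This is an illustrative sketch, not the project's code: the function name and the `cancelled` flag are hypothetical, and the handling of delays under 15 minutes (which the list above does not spell out) is an assumption.

```python
from typing import Optional


def delay_bucket(dep_delay: Optional[float], cancelled: bool) -> int:
    """Map a departure delay in minutes to one of the five target buckets."""
    if cancelled:
        return 4  # cancellations get their own bucket
    if dep_delay is None:
        raise ValueError("dep_delay is required for non-cancelled flights")
    if dep_delay < 15:
        # ASSUMPTION: early, on-time, and under-15-minute departures
        # are all treated as "no delay".
        return 0
    if dep_delay < 30:
        return 1  # small delay
    if dep_delay < 45:
        return 2  # moderate delay
    return 3  # significant delay
```

Checking the cancellation flag first means a cancelled flight never falls into a delay bucket, whatever value its delay field holds.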

## Feature Families:  
- Temporal Features: [e.g., `DAY_OF_WEEK`, `MONTH`, `DEP_TIME_BLK`] capture daily/weekly patterns.  
- Flight Characteristics: [e.g., `OP_UNIQUE_CARRIER`, `TAIL_NUM`, `CRS_ELAPSED_TIME`] encode airline/operator identity and flight-level attributes.  
- Weather Features: [e.g., `HourlyDryBulbTemperature`, `HourlyVisibility`, `HourlyPrecipitation`] provide environmental context.  
- Time-based Feature (New in Phase 3): Seasonality.  

## EDA of New Features:

For the EDA, we included roughly 16 numerical variables covering hourly weather data, cloud/sky information, day of week, month of year, etc.

In this work, we predict flight delays bucketed into 5 groups based on the variable `DEP_DELAY`. The plot below uses all 5 years of data, sampled down to 5% for plotting. The majority of flights are on time or have a relatively small departure delay. There is also a long right tail extending out to a 2000-minute delay, which is roughly 33 hours. While this looks like an outlier, we decided to keep it, since a delay of this length is possible for routes that run at a much lower frequency.


<img src="https://github.com/basantani-hitesh/w261/blob/main/departure_delay_5y.png?raw=true">

We also reviewed the distributions of the hourly weather variables, such as `HourlyPrecipitation`, wind gusts, and pressure changes. Some of these are highlighted below. These variables are used directly as inputs to our model.

### Hourly Station Pressure 

The hourly station pressure data is in the range of 20 inHg to 31 inHg. This data is multi-modal. Coupled with the average pressure for a given airport, we thought this variable might provide predictive power in flight delays as it might be indicative of an upcoming storm.

<img src="https://github.com/basantani-hitesh/w261/blob/main/HourlyStationPressure.png?raw=true">

### Hourly Dry Bulb Temperature

The hourly dry bulb temperature ranges from -25 °F to 120 °F, reflecting the temperatures experienced across all airports in the United States over the 5 years. The distribution has a mode of around 65-70 °F, with a left skew toward lower temperatures. The outliers below 0 °F and above 105 °F might be of interest for predictability.

<img src="https://github.com/basantani-hitesh/w261/blob/main/hourlydrybulbtemperature.png?raw=true">

## Derived Features  

### Sky Conditions  

`Coverage1`, `Coverage2`, `Coverage3` (sky coverage layers), along with their corresponding `LayerAmount` and `CloudBaseHeight`, provide granular weather details derived from the `HourlySkyConditions` column. These features were parsed to enable more detailed weather analysis relevant to flight performance and delay predictions.

The `HourlySkyConditions` column represents multiple layers of sky coverage for a given timestamp. Each layer contains:  
- **Sky Coverage**: Codes like `CLR` (Clear), `BKN` (Broken), `OVC` (Overcast), which describe the extent of cloud coverage.  
- **Layer Amount**: A numerical value representing cloud density.  
- **Cloud Base Height**: The altitude of the cloud layer's base, measured in hundreds of feet.

An example of the original format:  
`"BKN:07 250 BKN:07 100 OVC:08 180"`  
This describes three layers:  
1. Broken clouds with density `07` at a base height of `250` (hundreds of feet).  
2. Broken clouds with density `07` at a base height of `100`.  
3. Overcast clouds with density `08` at a base height of `180`.

#### Parsing Process  
1. **Regular Expression Extraction**:  
   - A regular expression was used to parse the `HourlySkyConditions` string, splitting it into distinct components:
     - `Coverage`: Extracts the sky coverage codes (e.g., `BKN`, `OVC`).
     - `LayerAmount`: Extracts the corresponding numerical density values.
     - `CloudBaseHeight`: Extracts the altitude values.

2. **Column Derivation**:  
   - Up to three separate columns (`Coverage1`, `Coverage2`, `Coverage3`) were created to represent the sequential layers of coverage in the string.
   - Similarly, three columns for `LayerAmount` and three for `CloudBaseHeight` were derived, corresponding to each coverage layer.

3. **Handling Missing Layers**:  
   - If the original string contained fewer than three layers, the missing `Coverage` values were set to `CLR` (Clear).
   - Missing `LayerAmount` and `CloudBaseHeight` values were set to `0` to maintain consistency.

#### Example Transformation  
**Original Input**: `"BKN:07 250 BKN:07 100 OVC:08 180"`  
**Resulting Columns**:  
- `Coverage1`: `BKN`, `Coverage2`: `BKN`, `Coverage3`: `OVC`  
- `LayerAmount1`: `07`, `LayerAmount2`: `07`, `LayerAmount3`: `08`  
- `CloudBaseHeight1`: `250`, `CloudBaseHeight2`: `100`, `CloudBaseHeight3`: `180`

This transformation enabled more granular analysis of cloud coverage patterns, aligning weather data with flight conditions more effectively.
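The parsing steps above can be sketched in plain Python as follows. The regular expression and the function name are assumptions made for illustration (the original pattern is not shown in this report), and this version converts amounts and heights to integers, so `07` becomes `7`.

```python
import re

# ASSUMED pattern: a 2-3 letter coverage code, a colon, the layer amount,
# whitespace, then the cloud base height.
LAYER_RE = re.compile(r"([A-Z]{2,3}):(\d+)\s+(\d+)")


def parse_sky_conditions(raw, max_layers=3):
    """Split an HourlySkyConditions string into per-layer columns."""
    layers = LAYER_RE.findall(raw or "")[:max_layers]
    # Pad missing layers with CLR / 0 / 0, as described above.
    layers += [("CLR", "0", "0")] * (max_layers - len(layers))
    row = {}
    for i, (coverage, amount, height) in enumerate(layers, start=1):
        row[f"Coverage{i}"] = coverage
        row[f"LayerAmount{i}"] = int(amount)
        row[f"CloudBaseHeight{i}"] = int(height)
    return row


example = parse_sky_conditions("BKN:07 250 BKN:07 100 OVC:08 180")
```

For the example string, `example["Coverage3"]` is `"OVC"` and `example["CloudBaseHeight1"]` is `250`. A string with a single layer yields `CLR` and `0` for layers 2 and 3.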

### Imputation Indicators  

Binary indicator columns were introduced across all imputed columns to track where missing values were replaced. These indicators are critical for downstream analysis, as they retain the original missingness information for potential pattern identification.  

#### Implementation:  
- **MissingValueIndicators**:  
   For delay-related columns like `DEP_DELAY` and `DEP_DELAY_GROUP`, for example, indicators such as `DEP_DELAY_missing` and `DEP_DELAY_GROUP_missing` were created. These flags help identify rows where delay data was unavailable. The same approach was applied across the board to every column where missing values were imputed.

These indicators ensure that the imputation process is transparent, providing flexibility for modeling and exploratory analyses.
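A minimal plain-Python sketch of group-mean imputation with a `<column>_missing` indicator is shown below. The function name and the choice of `ORIGIN` as the grouping key are hypothetical; the report does not state which groups were used.

```python
from collections import defaultdict


def impute_with_indicator(rows, column, group_key):
    """Fill missing values with the group mean and flag the imputed rows."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        value = row[column]
        if value is not None:
            sums[row[group_key]] += value
            counts[row[group_key]] += 1

    result = []
    for row in rows:
        new_row = dict(row)
        missing = row[column] is None
        # The indicator preserves the original missingness information.
        new_row[f"{column}_missing"] = int(missing)
        group = row[group_key]
        if missing and counts[group] > 0:
            new_row[column] = sums[group] / counts[group]
        result.append(new_row)
    return result
```

Rows whose group has no observed values stay missing in this sketch, so a later fallback (for example, a global mean) would still be needed for them.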

### Feature Transformations  

- **Categorical Encoding**:  
  Label encoding was applied to columns such as `DEP_TIME_BLK`, `CANCELLATION_CODE`, and `OP_UNIQUE_CARRIER` using PySpark’s `StringIndexer`. This transformed categorical variables into numerical indices for compatibility with machine learning models. For each label-encoded column, a new column was introduced with the corresponding indexed values (for example, `CANCELLATION_CODE_index`).

- **One Hot Encoding**:
  For PySpark ML's multinomial logistic regression, categorical variables also needed to be one-hot encoded. This was done by applying a `OneHotEncoder` to the label-encoded variables.

- **Further Imputation**:
  Many hourly weather readings, such as `HourlyPrecipitation` and `HourlyVisibility`, still had missing values. All of these were of datatype `double`, so in the ML pipeline they were imputed with an `Imputer` using the "mean" strategy.

- **Vectorization**:
  Finally, PySpark's logistic regression requires all features to be in a single vector column, while the categorical label can stay in its own column without one-hot encoding. The transformed features were combined using a `VectorAssembler`.
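To make the two encoding steps concrete, here is a plain-Python illustration of what they compute. It is not the PySpark code used in the project. It assumes the default behavior of `StringIndexer` (most frequent label gets index 0, ties broken alphabetically) and of `OneHotEncoder` (the last category is dropped); the helper names are hypothetical.

```python
from collections import Counter


def fit_string_index(values):
    """Assign indices by descending frequency, then alphabetically."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: index for index, label in enumerate(ordered)}


def one_hot(index, num_categories, drop_last=True):
    """Encode an index as a 0/1 vector, optionally dropping the last slot."""
    size = num_categories - 1 if drop_last else num_categories
    vector = [0.0] * size
    if index < size:
        vector[index] = 1.0
    return vector


carriers = ["DL", "AA", "DL", "UA", "AA", "DL"]
mapping = fit_string_index(carriers)
encoded = [one_hot(mapping[c], len(mapping)) for c in carriers]
```

Here `mapping` is `{"DL": 0, "AA": 1, "UA": 2}`, and with the last category dropped `UA` is encoded as the all-zeros vector `[0.0, 0.0]`.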

### Seasonality in Departure Delay

In this section, we discuss how we extracted seasonality trends in departure delays and cancellations to provide additional predictive signal for our models.

#### Average Delay by Carrier

To get a sense of what recent airline delays look like, we computed a new column called `average_departure_delay`. This column contains the average delay experienced by each airline, grouped by hour and day of the week. In hindsight, averaging departure delay by airport might provide more predictive power than grouping by airline, but as the results show, the metric we computed here has some predictive power as well. A comprehensive list of per-airline heat maps is available in Appendix 1.

The graphs below show the heat maps made from this data for Delta and US airlines. This average delay from the dataset is saved for each entry in the data table for use by our model.

<img src="https://github.com/basantani-hitesh/w261/blob/main/departure_delay_delta.png?raw=true">

<img src="https://github.com/basantani-hitesh/w261/blob/main/departure_delay_us.png?raw=true">

#### Seasonality in Flight Cancellation by Airline

One feature we thought would add predictive power to our model is the average departure delay of flights for each airline by day of the week and by month of the year. To create this feature, we grouped the average delay by airline for each departure day. We then ran a seasonal decomposition in PySpark, once with a 7-day recurring period and once with a 30-day period, to obtain weekly and monthly seasonality summaries. These features were saved into new weekly and monthly seasonality columns.

These graphs illustrate the seasonality component by week. The seasonality component is calculated using the formula:

<br>

$$
\text{Seasonality Component} = \text{Data} - (\text{Trend} + \text{Residual})
$$

<br>


This is derived by fitting a seasonal model that minimizes the residuals to extract the repeating weekly and monthly delay patterns. The monthly patterns of delay are shown below:
<br>

<img src="https://github.com/basantani-hitesh/w261/blob/main/monthly_seasonality.png?raw=true">
<br>
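As a rough illustration of the formula above, the sketch below implements a classical additive decomposition in plain Python: the trend is a centered moving average over one period, and the seasonal component is the average detrended value at each position in the cycle, which averages out the residual. This is an assumed textbook method, not necessarily the decomposition routine used in the project, and the function name is hypothetical.

```python
def seasonal_component(series, period):
    """Estimate one cycle of the additive seasonal component."""
    n = len(series)
    half = period // 2
    if period < 2 or n < period + 2 * half:
        raise ValueError("series is too short for this period")
    # Centered moving-average weights; even periods get half-weight endpoints.
    if period % 2 == 1:
        weights = [1.0] * period
    else:
        weights = [0.5] + [1.0] * (period - 1) + [0.5]

    by_phase = [[] for _ in range(period)]
    for t in range(half, n - half):
        window = series[t - half : t + half + 1]
        trend = sum(w * x for w, x in zip(weights, window)) / period
        by_phase[t % period].append(series[t] - trend)

    # Average the detrended values per phase, then center them around zero.
    means = [sum(values) / len(values) for values in by_phase]
    offset = sum(means) / period
    return [m - offset for m in means]


# Synthetic daily averages: a linear trend plus a weekly pattern.
weekly_pattern = [3.0, -1.0, -2.0, 0.0, 1.0, -1.0, 0.0]
daily_avg_delay = [0.5 * t + weekly_pattern[t % 7] for t in range(35)]
weekly = seasonal_component(daily_avg_delay, period=7)
```

On this synthetic series the function recovers `weekly_pattern` up to floating-point error. Calling it with `period=30` would give the monthly counterpart.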


# Leakage Analysis

A leakage analysis was conducted to address all potential cardinal sins of leakage, ensuring the integrity of the model's evaluation and performance in deployment scenarios.

### Temporal Leakage
The dataset was split temporally, with the last year (2019) reserved for testing and the preceding four years (2015–2018) used for training and validation. This approach ensures that no future information influences the training process, maintaining realistic evaluation metrics.

### Target Leakage
While efforts were made to prevent target leakage, there is some risk associated with features like `avg_delay` and the seasonality columns (`weekly_seasonal_component`, `monthly_seasonal_component`). These features were created by grouping data based on factors such as departure airport, airline, and time of the week or month. However, these groupings were not strictly confined to past data, meaning they may inadvertently include information from the current or future periods, posing a potential leakage risk.

### Data Split Leakage
No data split leakage exists, as training, validation, and test datasets were separated based on date, with no overlap. However, as mentioned above, some of the features may be subject to leakage because of how they were calculated.

### Preprocessing Leakage
Preprocessing steps, such as imputing missing values, were applied to the entire dataset rather than separately for the training and test sets. For instance, mean-based imputations for weather features and time series data used statistics computed across both training and test data. While this is not a classical example of leakage, it does mean that imputations may incorporate information from the test set, potentially influencing model training.

### Proxy Variables
Proxy variables that could inadvertently reveal target information were identified and removed during training. This includes variables that directly or indirectly contained information about delays or cancellations, which were excluded to maintain the model's predictive integrity.

### Model or Pipeline Overfitting
There is a suspicion of model or pipeline overfitting. The MLP model predominantly predicted only Classes 0 and 1, effectively acting as a binary classifier. This behavior suggests the model struggled to generalize across all delay categories, indicating inefficiency or potential overfitting within the pipeline. More time should be invested in evaluating whether the observed behavior stems from model overfitting during cross-validation folds. This could be caused by factors such as an imbalance in the class distribution across folds, insufficient regularization, or improperly isolated preprocessing steps. Since some preprocessing steps were computed using data from multiple folds, they could inadvertently leak information between training and test sets.

### Label Contamination
While no feature engineering was directly based on the target variable, features like `avg_delay`, `weekly_seasonal_component`, and `monthly_seasonal_component` may have introduced indirect leakage: These features are derived from group-level aggregations and seasonality trends, which could inadvertently include future or current data points when applied to the test set. This potential source of leakage was unfortunately discovered after modeling was performed, but is being noted here for completeness.

### Summary
Despite efforts to minimize leakage, some potential risks remain, particularly concerning features like `avg_delay` and the seasonality components. These aspects should be addressed in future iterations by recalculating features strictly from past data and isolating preprocessing to the training dataset. 
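One way to recalculate an aggregate such as `avg_delay` strictly from past data is sketched below. The function and field names are hypothetical, and the sketch assumes dates are sortable values (for example ISO-formatted strings). Each record receives the mean of earlier-dated records in its group only; records sharing a date do not see each other.

```python
from collections import defaultdict
from itertools import groupby


def past_only_average(records, key_field, value_field, date_field):
    """Attach the mean of strictly earlier records in the same group."""
    ordered = sorted(records, key=lambda r: r[date_field])
    sums = defaultdict(float)
    counts = defaultdict(int)
    result = []
    for _, same_day in groupby(ordered, key=lambda r: r[date_field]):
        day_records = list(same_day)
        # First compute features from history only...
        for record in day_records:
            key = record[key_field]
            prior = sums[key] / counts[key] if counts[key] else None
            result.append({**record, "avg_delay_past": prior})
        # ...then add today's records to the history.
        for record in day_records:
            key = record[key_field]
            sums[key] += record[value_field]
            counts[key] += 1
    return result
```

The first records of each group have no history, so they receive `None` and would need a default value or an imputation step fitted on the training data only.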


# Modeling Pipelines & Experiments

Our modeling pipelines consist of sequential steps:

1. **Preprocessing & Encoding**:  
   Input data → Imputation of missing weather values → One-hot encoding of categorical variables + Cardinality reduction of target variable 'Departure Delay Group' into 5 buckets → Assembling into a feature vector.

2. **Dimensionality & Feature Enhancement**:  
   Time-based features were introduced after initial cleaning and before final vector assembly.

3. **Model Training**:  
   Models tested include multi-class ElasticNet Logistic Regression, Gradient-Boosted Trees (binary by default but adapted via re-labeling), and MLP (supports multi-class). We set aside validation sets by splitting the training data for time-series cross-validation, which helped us tune hyperparameters. We also implemented early stopping.
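The time-series cross-validation split can be sketched as follows. This assumes an expanding-window scheme, where each fold trains on all periods before its validation block; the report does not say whether an expanding or a sliding window was used, and the function name is hypothetical.

```python
def time_series_folds(periods, n_folds):
    """Build (train_periods, validation_periods) pairs in time order."""
    ordered = sorted(set(periods))
    block = len(ordered) // (n_folds + 1)
    if block == 0:
        raise ValueError("not enough distinct periods for this many folds")
    folds = []
    for k in range(1, n_folds + 1):
        train = ordered[: k * block]                  # everything seen so far
        valid = ordered[k * block : (k + 1) * block]  # the next block in time
        folds.append((train, valid))
    return folds


# Eight consecutive periods (e.g. half-years of 2015-2018), three folds.
folds = time_series_folds(range(8), n_folds=3)
```

In every fold the latest training period precedes the earliest validation period, which is what keeps future information out of training.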

**Hyperparameters & Settings**:  
   - Gradient-Boosted Trees: Tuned `num_rounds`, `eta`, and `max_depth`  
   - MLP: Tuned `maxIter`, `blockSize`, and experimented with different network layer configurations.
   - ElasticNet Logistic Regression: Tuned `regParam` and `elasticNetParam`

**Loss Functions**:  
   - MLP uses categorical cross-entropy loss for multi-class classification.
$$
L = -\sum_{i=1}^N y_i \log(\hat{y}_i)
$$
   - The Logistic Regression model used ElasticNet Loss/regularization.
   $$
L_{reg} = L + \alpha \lambda \sum_{j=1}^M |w_j| + (1 - \alpha) \lambda \sum_{j=1}^M w_j^2
$$
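   - Gradient-Boosted Trees (XGBoost) fit each tree to the gradients of a differentiable objective (for multi-class classification, typically a softmax cross-entropy) rather than using an impurity measure such as Gini.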
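The two formulas above can be evaluated directly. The sketch below is a plain-Python transcription, assuming one-hot labels (so the cross-entropy sum reduces to the log-probability of each true class); the function names are hypothetical.

```python
import math


def cross_entropy(true_classes, predicted_probs):
    """L = -sum_i log(p_i[true class]) for one-hot labels."""
    return -sum(math.log(probs[c]) for c, probs in zip(true_classes, predicted_probs))


def elastic_net_loss(base_loss, weights, lam, alpha):
    """L_reg = L + alpha*lam*sum|w_j| + (1 - alpha)*lam*sum w_j^2."""
    l1 = sum(abs(w) for w in weights)
    l2 = sum(w * w for w in weights)
    return base_loss + alpha * lam * l1 + (1 - alpha) * lam * l2


base = cross_entropy([0, 1], [[0.5, 0.5], [0.5, 0.5]])  # equals 2 * ln(2)
lasso_like = elastic_net_loss(base, [1.0, -2.0], lam=0.01, alpha=1.0)
ridge_like = elastic_net_loss(base, [1.0, -2.0], lam=0.01, alpha=0.0)
```

With \\( \alpha = 1 \\) only the absolute-value (L1) term contributes, and with \\( \alpha = 0 \\) only the squared (L2) term does.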

## Cluster & Runtime:  
**Cluster Configuration**: 

- Runtime: 
  - DBR 16.0 ML | Spark 3.5.0 | Scala 2.12

- Driver: 
  - Standard_DS5_v2 | 56 GB | 16 Cores

- Workers: 
  - Standard_DS3_v2 | 84 GB | 24 Cores
    - 6-10 workers were used

## Experiments:

We evaluated multiple models, starting with a baseline classifier and progressing through logistic regression, XGBoost, and a Multilayer Perceptron (MLP). Each model's performance is assessed on the temporally split dataset.

### 1. Baseline Classifier

A dummy classifier (always guessing "NO DELAY") was used to set baseline metrics. Those are as follows:

- Weighted Precision: 0.6517

- Weighted Recall: 0.8073

- Weighted F1-Score: 0.7212
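The relationship between these three numbers can be checked with a short calculation. The sketch below computes support-weighted metrics for an always-"No Delay" predictor. It assumes that Class 0 makes up 80.73% of flights, which is what the reported weighted recall implies for such a predictor; the split of the remaining flights across Classes 1-4 is made up for illustration and does not affect the result.

```python
from collections import Counter


def weighted_metrics(y_true, y_pred):
    """Support-weighted precision, recall, and F1 (0 for never-predicted classes)."""
    n = len(y_true)
    support = Counter(y_true)
    predicted = Counter(y_pred)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    precision = recall = f1 = 0.0
    for label, count in support.items():
        p = correct[label] / predicted[label] if predicted[label] else 0.0
        r = correct[label] / count
        f = 2 * p * r / (p + r) if p + r else 0.0
        weight = count / n
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return precision, recall, f1


# HYPOTHETICAL class counts per 10,000 flights; only the Class 0 share matters.
y_true = [0] * 8073 + [1] * 1000 + [2] * 500 + [3] * 300 + [4] * 127
y_pred = [0] * len(y_true)
precision, recall, f1 = weighted_metrics(y_true, y_pred)
```

Rounded to four decimals this gives 0.6517, 0.8073, and 0.7212, matching the baseline figures: the weighted precision of a majority-class predictor is simply the square of the majority share.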

### 2. Logistic Regression

The Phase 1 and 2 logistic regression models were updated to use ElasticNet regularization. The ElasticNet loss is the categorical cross-entropy loss plus penalties on the model weights \\(w_j\\).

ElasticNet Regularization Loss:
$$
L_{reg} = L + \alpha \lambda \sum_{j=1}^M |w_j| + (1 - \alpha) \lambda \sum_{j=1}^M w_j^2
$$

This meant that we needed to choose two hyperparameters in Phase 3: 

- Regularization parameter \\( \lambda \\)
- ElasticNet "mixing" parameter \\( \alpha \\)

A time-series cross-validation grid search was performed to choose these hyperparameters. The values we tried were:

- \\( \lambda \\): **0.01**, 0.1
- \\( \alpha \\): 0.0, 0.5, **1.0**

On the 5-year dataset, the best hyperparameters were those highlighted above. 
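The search over this grid can be sketched generically. In the snippet below, `score_fn` is a hypothetical stand-in for "train with these parameters on each time-series fold and return the mean validation score"; the toy scoring function exists only to make the example runnable and is not a real result.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Evaluate every parameter combination and keep the best-scoring one."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


param_grid = {"regParam": [0.01, 0.1], "elasticNetParam": [0.0, 0.5, 1.0]}


def toy_score(params):
    # PLACEHOLDER scoring rule, chosen only so the example has a clear optimum.
    return -abs(params["regParam"] - 0.01) - abs(params["elasticNetParam"] - 1.0)


best_params, best_score = grid_search(param_grid, toy_score)
```

The grid above has 2 × 3 = 6 combinations, so with k validation folds the model is trained 6k times before the final fit.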

#### Results
The test results on the 5-year data (i.e., evaluated on the 2019 data) are as follows:

- Weighted Precision: 0.6788

- Weighted Recall: 0.8095

- Weighted F1-Score: 0.7259

### 3. XGBoost

For the XGBoost model, we used a grid search to arrive at the following model parameters: `num_round=50`, `max_depth=4`, `eta=0.3`, `num_workers=4`, `max_bin=256`, `tree_method="approx"`. The grid search covered the following hyperparameter settings: `num_rounds = [50, 100, 2]`, `eta = [0.2, 0.3, 2]`, and `max_depth = [2, 4, 1]`.

Due to memory constraints, we trained this algorithm on only 10% of the full training data. The model was then evaluated on the full test set. The results are shown below.

We see a weighted accuracy of about 68%, with the model being most accurate for the no-delay class and for delays of 30-45 minutes (class 2). Predictions for classes 1, 3, and 4 are very poor (below 25%). A full matrix of performance for each class is shown in the table below.

<img src="https://github.com/basantani-hitesh/w261/blob/main/xgboost_results.png?raw=true" alt="Image">

We used the model's feature importance scores to get a breakdown of which features mattered most. The results are shown below; `hour_of_day` was the most important predictor for this algorithm. This is consistent with the seasonality we saw during feature engineering, where delays are significantly larger at certain times of day than at others for each airline.

<img src="https://github.com/basantani-hitesh/w261/blob/main/get_booster.png?raw=true">

Another strong predictor was the average delay, that is, the average departure delay feature described in the feature engineering section above. Among the weather features, temperature, visibility, precipitation, and pressure also contributed to the model.

The tail number (or the particular route of a given flight), along with the one-hot encoded cloud information, ranked among the top 10 features during training.

### 4. Multi-Layer Perceptron (Neural Network)

For the MLP model in Phase 3, we conducted hyperparameter tuning using grid search with cross-validation to determine the optimal configuration for network architecture and training settings. This process aimed to maximize performance on the validation set while ensuring generalization to the test set.

#### Hyperparameter Tuning Details

- **Search Strategy**: Grid search with cross-validation.
- **Parameters Explored**:
  - **Network Architecture**: Variations in the number of layers and units per layer.
  - **Training Configuration**: Maximum iterations and block size for batch processing.

#### Best Parameters Identified

The optimal configuration identified during the grid search was as follows:

- **Network Architecture**: 4 layers (an input layer, two hidden layers, and an output layer): `[36, 30, 15, 5]`.
- **Max Iterations**: 250.
- **Block Size**: 128.

This configuration provided the best balance between accuracy and training efficiency.
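To give a sense of the model's size, the number of trainable parameters implied by the layer list can be computed as below, assuming fully connected layers with one bias per unit (the helper name is hypothetical).

```python
def mlp_parameter_count(layers):
    """Weights plus biases for a fully connected network."""
    return sum((fan_in + 1) * fan_out for fan_in, fan_out in zip(layers, layers[1:]))


layers = [36, 30, 15, 5]  # 36 input features, two hidden layers, 5 classes
total = mlp_parameter_count(layers)
# (36 + 1) * 30 + (30 + 1) * 15 + (15 + 1) * 5 = 1110 + 465 + 80 = 1655
```

At 1,655 parameters the network is small relative to the dataset, so the long training time reported later is more likely driven by the number of iterations over the data than by model size.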

#### Results

The Multilayer Perceptron (MLP) model demonstrated strong performance on the test dataset, with the following key metrics:

- **Weighted Precision**: 0.8435  
- **Accuracy**: 0.8993  
- **Weighted F1-Score**: 0.8650  

#### Class-Level Performance

The detailed class-level metrics are as follows:

| Class   | Precision | Recall   | F1-Score |
|---------|-----------|----------|----------|
| Class 0 | 0.979711  | 0.994790 | 0.987193 |
| Class 1 | 0.533457  | 0.975482 | 0.689727 |
| Class 2 | 0.000000  | 0.000000 | 0.000000 |
| Class 3 | 0.000000  | 0.000000 | 0.000000 |
| Class 4 | 0.000000  | 0.000000 | 0.000000 |

#### Observations

1. **High Performance for "No Delay"**:  
   The model performed exceptionally well for **Class 0 ("No Delay")**, achieving an F1-score of 0.9872 with near-perfect precision and recall. This indicates the model's strong ability to identify flights without delays.

2. **Moderate Performance for "Small Delays"**:  
   For **Class 1 ("Small Delays")**, the model achieved a respectable F1-score of 0.6897, with high recall (0.9755) but relatively low precision (0.5335). This suggests the model tends to over-predict small delays.

3. **Poor Performance for Other Classes**:  
   The model failed to predict Classes 2, 3, and 4 (moderate delays, significant delays, and cancellations). The F1-scores for these classes are 0.0000, indicating no true positives were predicted for these categories.

## Experiment Summary Table:

| Exp ID | Model                         | Best Hyperparameters                       | Validation Recall | Train Time (min) | Notes                                   |
|--------|-------------------------------|-------------------------------------------|-------------------|------------------|------------------------------------------|
| 0      | Dummy Classifier              | N/A                                       | 0.8073            | N/A              | Baseline for comparison                   |
| 1      | ElasticNet Logistic Regression | `regParam` = 0.01 and `elasticNetParam` = 1.0 | 0.8060            | 14               |                                          |
| 2      | Gradient-Boosted Trees        | `num_round` = 50, `max_depth` = 4, `eta` = 0.3, `num_workers` = 4, `max_bin` = 256, `tree_method` = "approx" | 0.6800            | 19               | Trained on 10% of the data due to memory constraints. Best performance for "No Delay" and "30-45 minutes" delay categories. Poor performance on Classes 1, 3, and 4. |
| 3      | MLP                           | `Network Architecture`: [36, 30, 15, 5], `Max Iterations`: 250, `Block Size`: 128 | 0.9059            | 144              | Achieved strong performance for binary classification but struggled with multi-class delays. |



# Results And Discussion

## Summary of Experiments

We conducted a series of experiments using four models: a Dummy Classifier as a baseline, ElasticNet Logistic Regression, Gradient-Boosted Trees, and a Multilayer Perceptron. Each model was evaluated on a temporally split dataset to ensure no temporal data leakage and to obtain realistic performance metrics. Below is a detailed analysis of the results, highlighting strengths, weaknesses, and areas for improvement.

### 1. Baseline Classifier
The dummy classifier, which always predicts "No Delay," set the baseline for comparison. It achieved a Weighted Recall of 0.8073, which is deceptively high due to the class imbalance in the dataset where the majority of flights experience no delays. However, its Weighted Precision and F1-Score (0.6517 and 0.7212, respectively) highlight the limitations of this simplistic approach, particularly in handling delay predictions.

---

### 2. ElasticNet Logistic Regression
The ElasticNet Logistic Regression model provided a modest improvement over the baseline, achieving a Weighted F1-Score of 0.7259. Its Weighted Recall (0.8060) was comparable to the dummy classifier's, but the addition of regularization helped improve Weighted Precision, demonstrating a better trade-off between false positives and false negatives. However, its linear nature limited its ability to capture complex patterns, especially for less frequent delay classes.

---

### 3. Gradient-Boosted Trees (GBT)
GBT demonstrated the strongest performance for the "30-45 minutes" delay category, benefiting from its ability to model non-linear relationships. However, its Weighted Recall (0.6800) was lower than that of both the baseline and ElasticNet Logistic Regression, primarily due to underperformance on minority classes such as "Small Delays," "Significant Delays," and "Cancellations." This result is partly attributed to training on only 10% of the data due to memory constraints, which exacerbated the class imbalance issue.

Key insights include:
- The model showed potential to handle delays with distinct patterns, such as "30-45 minutes."
- It was hampered by resource constraints, limiting its ability to generalize.

---

### 4. Multilayer Perceptron (MLP)
The MLP model achieved the highest overall metrics, with a Weighted Recall of 0.9059 and a Weighted F1-Score of 0.8650. However, a detailed analysis of the confusion matrix revealed significant limitations:
- The model primarily acted as a **binary classifier**, predicting only "No Delay" (Class 0) and "Small Delays" (Class 1).
- It failed entirely to predict moderate delays (Class 2), significant delays (Class 3), and cancellations (Class 4), resulting in F1-Scores of 0.0000 for these classes.

Despite its strong performance for the dominant classes, the MLP struggled with multi-class classification, underscoring the need for strategies to address class imbalance and improve generalization across all delay categories.

---

## Performance Comparison Table

| Metric                   | Dummy Classifier | ElasticNet Logistic Regression | Gradient-Boosted Trees | MLP   |
|--------------------------|------------------|--------------------------------|-------------------------|-------|
| **Weighted Precision**   | 0.6517           | 0.6788                         | 0.6500                  | 0.8435|
| **Weighted Recall**      | 0.8073           | 0.8060                         | 0.6800                  | 0.9059|
| **Weighted F1-Score**    | 0.7212           | 0.7259                         | 0.6900                  | 0.8650|

---

## Key Observations
1. **Baseline Classifier**: The high recall of the dummy classifier highlights the dataset's imbalance, as it achieves reasonable performance by always predicting the majority class.
2. **ElasticNet Logistic Regression**: While robust for initial testing, the linear approach struggled to capture the complex relationships necessary for multi-class delay prediction.
3. **Gradient-Boosted Trees**: GBT seemed to perform best in capturing non-linear relationships, particularly for "No Delay" and "30-45 minutes" delay categories, but suffered due to insufficient training data and class imbalance.
4. **Multilayer Perceptron**: The MLP achieved the highest overall metrics but failed to effectively address the multi-class nature of the problem, acting more like a binary classifier.

---

## Gap Analysis

### Key Issues Identified:
1. **Class Imbalance**: All models struggled with the highly imbalanced nature of the dataset, leading to poor performance on minority classes.
2. **Underutilization of Features**: While the feature engineering process introduced valuable predictors, the limited dataset used for GBT training restricted the ability to fully leverage the predictive power of these features.
3. **Model Behavior**: The MLP’s binary-like classification behavior underscores the need for more advanced techniques to address multi-class imbalances.

### Recommendations for Improvement
1. **Oversampling/Undersampling**:
   - Use techniques such as SMOTE (Synthetic Minority Over-sampling Technique) to improve representation of minority classes during training.
   - Investigate undersampling dominant classes to balance the dataset.

2. **Alternative Architectures**:
   - Experiment with ensemble methods (e.g., Random Forests or hybrid models combining GBT and MLP) to better capture multi-class distinctions.
   - Use attention-based architectures to focus on minority class patterns.

3. **Enhanced Feature Engineering**:
   - Revisit temporal features to identify trends or anomalies specific to delayed flights.
   - Re-calculate features to verify that none of them leak information from the target or from future data.

4. **Improved Training Strategies**:
   - Apply class-weighting or cost-sensitive learning to prioritize accurate predictions for minority classes.
   - Train GBT on the full dataset to improve generalization.
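As an illustration of the class-weighting recommendation, the following minimal sketch computes inverse-frequency class weights. The function name `inverse_frequency_weights` and the toy label list are hypothetical and not part of the project code; the formula `n_samples / (n_classes * class_count)` is one common "balanced" weighting, used here as an assumption.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Return {class: weight} with weight = n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}


# Hypothetical toy labels (not project data): class 0 dominates.
labels = [0] * 8 + [1] * 1 + [2] * 1
weights = inverse_frequency_weights(labels)

# Rare classes receive proportionally larger weights than the majority class.
for cls in sorted(weights):
    print(cls, round(weights[cls], 4))
```

Weights of this kind could be applied per row in the loss function of a custom model, or passed through a sample-weight column where the estimator supports one, so that errors on rare delay classes cost more than errors on "No Delay."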



# Conclusion 

This project focused on predicting flight delay severity using the OTPW dataset, an integration of flight performance and weather data. Predicting delays is critical for optimizing resource allocation, improving traveler experience, and aiding airline decision-making. Our hypothesis was that a custom ML pipeline with engineered features could accurately classify flight delays into five severity categories, leveraging advanced models to capture complex patterns.

We explored various models, including ElasticNet Logistic Regression, Gradient-Boosted Trees, and a Multilayer Perceptron (MLP) neural network. The MLP demonstrated the best overall performance, achieving a weighted F1-score of 0.8650. This was driven by features like departure time blocks, weather conditions, and carrier-based seasonality patterns. Hyperparameter tuning across all models further enhanced their predictive capability, with the best configuration for the MLP being a two-hidden-layer architecture using ReLU activation, early stopping, and a learning rate scheduler.

The results highlight the significance of sophisticated modeling and feature engineering in addressing complex, imbalanced datasets. While the MLP performed well for dominant classes, the challenges of predicting minority classes and handling class imbalance underscore areas for future work. Oversampling methods like SMOTE, improved regularization, and ensemble approaches may help balance predictions across all delay categories. Additionally, further refinement of temporal features and advanced architectures could enhance generalization and interpretability.

In summary, this project established a scalable, distributed ML pipeline and demonstrated the effectiveness of custom features and advanced models in predicting flight delays. Future iterations can build on these findings to achieve greater accuracy and class balance, ensuring broader applicability and impact.


# Extra Credit


## CPU-Based MLP

To address the requirements of the project, a custom Multilayer Perceptron (MLP) neural network was implemented using a fully distributed approach on a CPU cluster. This implementation avoids relying on PySpark ML libraries, adhering to the extra credit guidelines. The architecture of the network includes flexibility for experimenting with one or two hidden layers, utilizing ReLU activation functions for non-linearity and softmax for output layer classification.
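The architecture described above can be sketched in a few lines of NumPy. This is a minimal, single-machine illustration and not the project's distributed implementation: the helper names (`forward`, `init_params`), the small random initialization, and the layer widths (borrowed from the `[36, 30, 15, 5]` configuration in the experiments table) are assumptions chosen for the example.

```python
import numpy as np


def relu(z):
    return np.maximum(z, 0.0)


def softmax(z):
    # Subtract the row-wise maximum for numerical stability.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def forward(X, params):
    """params is a list of (W, b) pairs: ReLU on hidden layers, softmax on the last."""
    a = X
    for W, b in params[:-1]:
        a = relu(a @ W + b)
    W, b = params[-1]
    return softmax(a @ W + b)


def init_params(layer_sizes, rng):
    # Small random weights, purely for illustration.
    return [(rng.normal(0.0, 0.1, size=(m, n)), np.zeros(n))
            for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]


rng = np.random.default_rng(0)
X = rng.normal(size=(4, 36))                      # 4 synthetic rows, 36 features

one_hidden = init_params([36, 30, 5], rng)        # one hidden layer
two_hidden = init_params([36, 30, 15, 5], rng)    # two hidden layers

probs1 = forward(X, one_hidden)
probs2 = forward(X, two_hidden)
print(probs1.shape, probs2.shape)                 # one probability row per input row
```

Representing the network as a list of `(W, b)` pairs means that switching between one and two hidden layers only changes the list of layer sizes, not the forward-pass code.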

The training process uses synchronous gradient descent: gradients are computed for mini-batches on each distributed partition using Spark RDDs and then aggregated before a single weight update is applied. A variance-scaling initialization was applied to the weight matrices to improve convergence by normalizing the variance of each layer's inputs and outputs. The forward and backward passes compute the activations and gradients layer by layer, with support for an optional second hidden layer. Early stopping halts training when validation accuracy stagnates, with a patience threshold of three epochs. A custom learning rate scheduler reduces the learning rate as training progresses, helping the model converge to a local optimum.
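The training loop can be illustrated with a single-machine sketch in which a Python list of array pairs stands in for Spark partitions. Everything specific here is an assumption made for the example: the synthetic data, the Glorot-style uniform initialization, the inverse-time learning rate decay, the full-batch step per epoch, and all function names. Only the overall pattern (per-partition gradients, one synchronous update, early stopping with a patience of three epochs) follows the description above.

```python
import numpy as np


def grads_on_partition(params, X, y):
    """Forward and backward pass of a one-hidden-layer MLP on one partition.
    Returns summed (not averaged) gradients and the partition's row count."""
    W1, b1, W2, b2 = params
    z1 = X @ W1 + b1
    a1 = np.maximum(z1, 0.0)                       # ReLU
    logits = a1 @ W2 + b2
    logits = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)      # softmax
    d_logits = probs.copy()
    d_logits[np.arange(len(y)), y] -= 1.0          # softmax + cross-entropy gradient
    d_z1 = (d_logits @ W2.T) * (z1 > 0)            # backprop through ReLU
    grads = [X.T @ d_z1, d_z1.sum(axis=0), a1.T @ d_logits, d_logits.sum(axis=0)]
    return grads, len(y)


def synchronous_step(params, partitions, lr):
    """Sum gradients over all partitions, then apply one update."""
    total = [np.zeros_like(p) for p in params]
    n = 0
    for X, y in partitions:   # with Spark, a map over partitions followed by a reduce
        g, k = grads_on_partition(params, X, y)
        total = [t + gi for t, gi in zip(total, g)]
        n += k
    return [p - lr * t / n for p, t in zip(params, total)]


def accuracy(params, X, y):
    W1, b1, W2, b2 = params
    scores = np.maximum(X @ W1 + b1, 0.0) @ W2 + b2
    return float((scores.argmax(axis=1) == y).mean())


rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 6, 8, 3
X = rng.normal(size=(300, n_in))
y = X[:, :n_out].argmax(axis=1)                    # synthetic, learnable labels


def glorot(m, n):
    # Assumed variance-scaling scheme (Glorot-style uniform).
    limit = np.sqrt(6.0 / (m + n))
    return rng.uniform(-limit, limit, size=(m, n))


params = [glorot(n_in, n_hidden), np.zeros(n_hidden),
          glorot(n_hidden, n_out), np.zeros(n_out)]
X_train, y_train, X_val, y_val = X[:240], y[:240], X[240:], y[240:]
partitions = [(X_train[i::4], y_train[i::4]) for i in range(4)]

best_acc, stale, patience = -1.0, 0, 3
lr0, decay = 0.5, 0.01                             # assumed schedule constants
for epoch in range(200):
    lr = lr0 / (1.0 + decay * epoch)               # learning rate shrinks over time
    params = synchronous_step(params, partitions, lr)
    val_acc = accuracy(params, X_val, y_val)
    if val_acc > best_acc:
        best_acc, stale = val_acc, 0
    else:
        stale += 1
        if stale >= patience:                      # no improvement for 3 epochs
            break
print(f"stopped after epoch {epoch}, best validation accuracy {best_acc:.3f}")
```

Because the per-partition gradients are summed before the update, the step is mathematically identical to a gradient step on the combined data, which is what makes the approach synchronous. The sketch keeps the last parameters rather than restoring the best ones; a fuller implementation would typically checkpoint the best weights.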

For testing purposes, the MLP was evaluated on the 3-month (3m) dataset, which provided a smaller-scale validation of the distributed training pipeline and the class-balancing strategy.

---

### Results

The MLP model was trained on balanced training data generated through oversampling to mitigate class imbalance, and then evaluated on the 3m test dataset. The results are as follows:

- **Test Accuracy**: 57.25%
- **Confusion Matrix**:

  | True Label | Predicted: Class 0 | Predicted: Class 1 | Predicted: Class 2 | Predicted: Class 3 | Predicted: Class 4 |
  |------------|---------------------|---------------------|---------------------|---------------------|---------------------|
  | **Class 0** | 157200              | 3354                | 73171               | 7741                | 8                   |
  | **Class 1** | 0                   | 0                   | 71                  | 31554               | 0                   |
  | **Class 2** | 0                   | 0                   | 0                   | 9613                | 0                   |
  | **Class 3** | 0                   | 0                   | 0                   | 10747               | 0                   |
  | **Class 4** | 0                   | 0                   | 0                   | 0                   | 107                 |

---

The test accuracy of 57.25% reflects moderate classification performance that comes almost entirely from Class 0, which accounts for 157,200 of the 168,054 correct predictions. The confusion matrix highlights the imbalance in predictive capability: Class 1 and Class 2 are never predicted correctly, with nearly all of their examples assigned to Class 3, and 73,171 Class 0 flights are misclassified as Class 2.
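The accuracy and the per-class behavior can be recomputed directly from the confusion matrix. The short NumPy check below uses only the counts from the table; the variable names are illustrative.

```python
import numpy as np

# Rows are true classes 0-4, columns are predicted classes 0-4
# (counts copied from the confusion matrix table).
cm = np.array([
    [157200, 3354, 73171,  7741,   8],
    [     0,    0,    71, 31554,   0],
    [     0,    0,     0,  9613,   0],
    [     0,    0,     0, 10747,   0],
    [     0,    0,     0,     0, 107],
])

accuracy = np.trace(cm) / cm.sum()            # correct predictions / all predictions
recall = np.diag(cm) / cm.sum(axis=1)         # per true class
predicted = cm.sum(axis=0)                    # how often each class is predicted
precision = np.divide(np.diag(cm), predicted,
                      out=np.zeros(len(cm)), where=predicted > 0)

print(f"accuracy = {accuracy:.4f}")           # accuracy = 0.5725
for c in range(len(cm)):
    print(f"class {c}: recall={recall[c]:.4f} precision={precision[c]:.4f}")
```

Per-class recall and precision make the failure mode visible in a way that the single accuracy figure does not: a class with zero recall contributes nothing to accuracy yet may be the class of most operational interest.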

### Discussion

While the architecture of the implemented MLP is designed for scalability, the results indicate limitations in capturing the intricacies of the dataset. The model's tendency to bias predictions toward certain classes, despite oversampling, suggests that further refinements are necessary. Possible improvements include experimenting with deeper architectures, adjusting the learning rate decay, or introducing regularization techniques such as dropout to prevent overfitting on oversampled data. Additionally, exploring ensemble methods or Gradient Boosted Decision Trees alongside the MLP could yield complementary insights and better classification performance.

Overall, the implementation successfully fulfills the extra credit requirements by employing a fully distributed, scalable CPU-based neural network. The use of the 3m dataset for testing purposes provided a meaningful evaluation of the methodology and highlighted areas for future improvement in achieving better generalization and balance in predictions.
