# Data Science Technology and Systems PG (11523)
# Final Project | Semester 2, 2025

### Natalia Andrea Cubillos Villegas
### Student ID: U3246979

# Problem: Predicting Airplane Delays

The goals of this notebook are:
- Process and create a dataset from downloaded ZIP files
- Perform exploratory data analysis (EDA)
- Establish a baseline model and improve it

## Introduction to business scenario
You work for a travel booking website that wants to improve the customer experience when flights are delayed. The company wants to create a feature that tells customers, when they book a flight to or from the busiest US domestic airports, whether the flight is likely to be delayed due to weather.

You are tasked with solving part of this problem by using machine learning to identify whether a flight will be delayed due to weather. You have been given access to a dataset of on-time performance of domestic flights operated by large air carriers. You can use this data to train a machine learning model that predicts whether a flight to or from the busiest airports will be delayed.

### Dataset
The provided dataset contains scheduled and actual departure and arrival times reported by certified US air carriers that account for at least 1 percent of domestic scheduled passenger revenues. The data was collected by the Office of Airline Information, Bureau of Transportation Statistics (BTS). The dataset contains the date, time, origin, destination, airline, distance, and delay status of flights between 2014 and 2018.
The data are in 60 compressed files, each containing a CSV with the flight details for one month of the five years (2014 to 2018). The data can be downloaded from this [link](https://ucstaff-my.sharepoint.com/:f:/g/personal/ibrahim_radwan_canberra_edu_au/EhWeqeQsh-9Mr1fneZc9_0sBOBzEdXngvxFJtAlIa-eAgA?e=8ukWwa). Please download the data files and place them on a relative path. Dataset(s) used in this assignment were compiled by the Office of Airline Information, Bureau of Transportation Statistics (BTS), Airline On-Time Performance Data, available at the following [link](https://www.transtats.bts.gov/Fields.asp?gnoyr_VQ=FGJ).

# Step 1: Prepare the environment 

Use one of the labs we practised with on Amazon SageMaker to perform the following steps:
1. Start a lab.
2. Create a notebook instance and name it "oncloudproject".

**Unfortunately, I was not able to create a new notebook instance because the AWS platform indicated that access was denied for my account.**
**I attempted to modify my role and increase permissions, but this was not possible.**

**I also tried to create a new session through Python; however, when checking Access Management >> Users, I received the following message:**

***“Required: An IAM user is an identity with long-term credentials that is used to interact with AWS in an account.”***

**Moreover, I tried to access SageMaker from one of our labs. Although I successfully logged in, I encountered the same “Access Denied” message when attempting to implement the Linear Learner Estimator.**

**Given these limitations, I decided to complete the tasks required for this part of the assignment locally using Jupyter Notebook instead.**

# Step 2: Build and evaluate simple models

Write code to perform the following steps:
1. Split data into training, validation and testing sets (70% - 15% - 15%).
2. Use the linear learner estimator to build a classification model.
3. Host the model on another instance.
4. Perform batch transform to evaluate the model on testing data.
5. Report the performance metrics that you consider best suited to testing the model's performance.

Note: You are required to perform the above steps on the two combined datasets separately and to comment on the difference.
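
The 70% - 15% - 15% split is built from two chained splits, so the second `test_size` has to be rescaled to the rows that remain after the first. The sketch below shows only that arithmetic. The helper names `two_stage_fractions` and `resulting_shares` are hypothetical and are not part of scikit-learn or of this notebook.

```python
def two_stage_fractions(val_frac: float, test_frac: float) -> tuple[float, float]:
    """Return the test_size values for two chained splits.

    Stage 1 holds out `test_frac` of the full data as the test set.
    Stage 2 holds out the validation rows from what remains, so its
    fraction is rescaled by the size of that remainder.
    """
    if not 0 < val_frac + test_frac < 1:
        raise ValueError("val_frac + test_frac must be between 0 and 1")
    stage1 = test_frac
    stage2 = val_frac / (1.0 - test_frac)
    return stage1, stage2


def resulting_shares(stage1: float, stage2: float) -> tuple[float, float, float]:
    """Return the (train, validation, test) shares of the full data."""
    remainder = 1.0 - stage1      # rows left after the test hold-out
    val = remainder * stage2      # validation rows taken from the remainder
    return remainder - val, val, stage1


stage1, stage2 = two_stage_fractions(val_frac=0.15, test_frac=0.15)
print(round(stage2, 4))                                         # 0.1765
print([round(s, 2) for s in resulting_shares(stage1, stage2)])  # [0.7, 0.15, 0.15]
```

The two returned values correspond to the `test_size` arguments passed to the first and second `train_test_split` calls.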

In [20]:
# Loading the necessary libraries
import pandas as pd
import numpy as np
import os

# Setting the random seed for reproducibility
np.random.seed(42)

# Defining the base path for data files
base_path = "Final_Project"
data_v1 = pd.read_csv(f"{base_path}/combined_csv_v1.csv")
data_v2 = pd.read_csv(f"{base_path}/combined_csv_v2.csv")

### Working with data version 1

In [21]:
# Splitting data into training, validation, and test sets (70%, 15%, 15%)
from sklearn.model_selection import train_test_split
train_data, test_data = train_test_split(data_v1, test_size=0.15, random_state=42, stratify=data_v1['target'])
train_data, val_data = train_test_split(train_data, test_size=0.1765, random_state=42, stratify=train_data['target'])  # 0.1765 x 0.85 = 0.15

# Displaying the shapes of the datasets
print(f"Train data shape - Data Version 1: {train_data.shape}")
print(f"Validation data shape - Data Version 1: {val_data.shape}")
print(f"Test data shape - Data Version 1: {test_data.shape}")

# Tackling class imbalance using SMOTE
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=42)
X_train = train_data.drop('target', axis=1)
y_train = train_data['target']
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
print(f'Original dataset shape - Data Version 1: {y_train.value_counts().to_dict()}')
print(f'Resampled dataset shape - Data Version 1: {y_resampled.value_counts().to_dict()}')

Train data shape - Data Version 1: (1144871, 94)
Validation data shape - Data Version 1: (245380, 94)
Test data shape - Data Version 1: (245339, 94)
Original dataset shape - Data Version 1: {0.0: 904547, 1.0: 240324}
Resampled dataset shape - Data Version 1: {1.0: 904547, 0.0: 904547}


In [22]:
# Using Logistic Regression as a baseline model instead of Linear Learner Estimator from AWS SageMaker
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

# Training the Logistic Regression model
logistic_model = LogisticRegression()
logistic_model.fit(X_resampled, y_resampled)

#______________________________________________________________________
print("Evaluating Logistic Regression Model - Data Version 1")
print("\n")
print("=========== Validation Set Evaluation ===========")

# Making predictions on the validation set
y_val_pred = logistic_model.predict(val_data.drop('target', axis=1))

# Evaluating the model on the validation set
print(confusion_matrix(val_data['target'], y_val_pred))
print(classification_report(val_data['target'], y_val_pred))

# Calculating ROC-AUC for the validation set
y_val_proba = logistic_model.predict_proba(val_data.drop('target', axis=1))[:, 1]
print("ROC-AUC (validation) - Data Version 1:", round(roc_auc_score(val_data['target'], y_val_proba), 3))

#______________________________________________________________________
# Evaluating the model on the test set
print("=========== Test Set Evaluation ===========")

# Making predictions on the test set
y_test_pred  = logistic_model.predict(test_data.drop('target', axis=1))

# Evaluating the model on the test set
print(confusion_matrix(test_data['target'], y_test_pred))
print(classification_report(test_data['target'], y_test_pred))

# Calculating ROC-AUC for the test set
y_test_proba = logistic_model.predict_proba(test_data.drop('target', axis=1))[:, 1]
print("ROC-AUC (test) - Data Version 1:", round(roc_auc_score(test_data['target'], y_test_proba), 3))


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT

Increase the number of iterations to improve the convergence (max_iter=100).
You might also want to scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Evaluating Logistic Regression Model - Data Version 1


[[131525  62347]
 [ 27305  24203]]
              precision    recall  f1-score   support

         0.0       0.83      0.68      0.75    193872
         1.0       0.28      0.47      0.35     51508

    accuracy                           0.63    245380
   macro avg       0.55      0.57      0.55    245380
weighted avg       0.71      0.63      0.66    245380

ROC-AUC (validation) - Data Version 1: 0.614
[[131622  62217]
 [ 27214  24286]]
              precision    recall  f1-score   support

         0.0       0.83      0.68      0.75    193839
         1.0       0.28      0.47      0.35     51500

    accuracy                           0.64    245339
   macro avg       0.55      0.58      0.55    245339
weighted avg       0.71      0.64      0.66    245339

ROC-AUC (test) - Data Version 1: 0.613


### Working with data version 2

In [None]:
# Splitting data into training, validation, and test sets (70%, 15%, 15%)
from sklearn.model_selection import train_test_split
train_data, test_data = train_test_split(data_v2, test_size=0.15, random_state=42, stratify=data_v2['target'])
train_data, val_data = train_test_split(train_data, test_size=0.1765, random_state=42, stratify=train_data['target'])  # 0.1765 x 0.85 = 0.15

# Displaying the shapes of the datasets
print(f"Train data shape - Data V2: {train_data.shape}")
print(f"Validation data shape - Data V2: {val_data.shape}")
print(f"Test data shape - Data V2: {test_data.shape}")

# Tackling class imbalance using SMOTE
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=42)
X_train = train_data.drop('target', axis=1)
y_train = train_data['target']
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
print(f'Original dataset shape - Data V2: {y_train.value_counts().to_dict()}')
print(f'Resampled dataset shape - Data V2: {y_resampled.value_counts().to_dict()}')

Train data shape - Data V2: (1144871, 86)
Validation data shape - Data V2: (245380, 86)
Test data shape - Data V2: (245339, 86)


In [None]:
# Using Logistic Regression as a baseline model instead of Linear Learner Estimator from AWS SageMaker
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

# Training the Logistic Regression model
logistic_model = LogisticRegression()
logistic_model.fit(X_resampled, y_resampled)

#______________________________________________________________________
print("Evaluating Logistic Regression Model - Data Version 2")
print("\n")
print("=========== Validation Set Evaluation ===========")

# Making predictions on the validation set
y_val_pred = logistic_model.predict(val_data.drop('target', axis=1))

# Evaluating the model on the validation set
print(confusion_matrix(val_data['target'], y_val_pred))
print(classification_report(val_data['target'], y_val_pred))

# Calculating ROC-AUC for the validation set
y_val_proba = logistic_model.predict_proba(val_data.drop('target', axis=1))[:, 1]
print("ROC-AUC (validation) - Data Version 2:", round(roc_auc_score(val_data['target'], y_val_proba), 3))

#______________________________________________________________________
# Evaluating the model on the test set
print("=========== Test Set Evaluation ===========")

# Making predictions on the test set
y_test_pred  = logistic_model.predict(test_data.drop('target', axis=1))

# Evaluating the model on the test set
print(confusion_matrix(test_data['target'], y_test_pred))
print(classification_report(test_data['target'], y_test_pred))

# Calculating ROC-AUC for the test set
y_test_proba = logistic_model.predict_proba(test_data.drop('target', axis=1))[:, 1]
print("ROC-AUC (test) - Data Version 2:", round(roc_auc_score(test_data['target'], y_test_proba), 3))

STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT

Increase the number of iterations to improve the convergence (max_iter=100).
You might also want to scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


[[115354  78518]
 [ 21514  29994]]
              precision    recall  f1-score   support

         0.0       0.84      0.60      0.70    193872
         1.0       0.28      0.58      0.37     51508

    accuracy                           0.59    245380
   macro avg       0.56      0.59      0.54    245380
weighted avg       0.72      0.59      0.63    245380



# Step 3: Build and evaluate ensemble models

Write code to perform the following steps:
1. Split data into training, validation and testing sets (70% - 15% - 15%).
2. Use the XGBoost estimator to build a classification model.
3. Host the model on another instance.
4. Perform batch transform to evaluate the model on testing data.
5. Report the performance metrics that you consider best suited to testing the model's performance.
6. Write down your observations on the difference in performance between the simple and ensemble models.

Note: You are required to perform the above steps on the two combined datasets separately.

### Working with data version 1

In [None]:
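# Splitting data into training, validation, and test sets
# Note: holding out 15% and then halving it gives 85% / 7.5% / 7.5%, not 70% / 15% / 15%,
# and this split is not stratified on 'target'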
from sklearn.model_selection import train_test_split
train_data, test_data = train_test_split(data_v1, test_size=0.15, random_state=42)
val_data, test_data = train_test_split(test_data, test_size=0.5, random_state=42)
print(f"Train data shape - Data V1  : {train_data.shape}")
print(f"Validation data shape - Data V1  : {val_data.shape}")
print(f"Test data shape - Data V1  : {test_data.shape}")

# Tackling class imbalance using SMOTE
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=42)
X_train = train_data.drop('target', axis=1)
y_train = train_data['target']
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
print(f'Original dataset shape - Data V1 : {y_train.value_counts().to_dict()}')
print(f'Resampled dataset shape - Data V1 : {y_resampled.value_counts().to_dict()}')

# Using XGBoost as an alternative model
import xgboost as xgb
xgb_model = xgb.XGBClassifier(use_label_encoder=False, eval_metric='logloss')
xgb_model.fit(X_resampled, y_resampled)

#______________________________________________________________________
print("Evaluating XGBoost Model - Data Version 1")
print("\n")
print("=========== Validation Set Evaluation ===========")

# Making predictions on the validation set
y_val_pred = xgb_model.predict(val_data.drop('target', axis=1))

# Evaluating the model on the validation set
print(confusion_matrix(val_data['target'], y_val_pred))
print(classification_report(val_data['target'], y_val_pred))

# Calculating ROC-AUC for the validation set
y_val_proba = xgb_model.predict_proba(val_data.drop('target', axis=1))[:, 1]
print("ROC-AUC (validation) - Data Version 1:", round(roc_auc_score(val_data['target'], y_val_proba), 3))

#______________________________________________________________________
# Evaluating the model on the test set
print("=========== Test Set Evaluation ===========")

# Making predictions on the test set
y_test_pred  = xgb_model.predict(test_data.drop('target', axis=1))

# Evaluating the model on the test set
print(confusion_matrix(test_data['target'], y_test_pred))
print(classification_report(test_data['target'], y_test_pred))

# Calculating ROC-AUC for the test set
y_test_proba = xgb_model.predict_proba(test_data.drop('target', axis=1))[:, 1]
print("ROC-AUC (test) - Data Version 1:", round(roc_auc_score(test_data['target'], y_test_proba), 3))

Train data shape - Data V1  : (1390251, 94)
Validation data shape - Data V1  : (122669, 94)
Test data shape - Data V1  : (122670, 94)
Original dataset shape - Data V1 : {0.0: 1098140, 1.0: 292111}
Resampled dataset shape - Data V1 : {1.0: 1098140, 0.0: 1098140}


Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)


[[88778  8527]
 [20272  5092]]
              precision    recall  f1-score   support

         0.0       0.81      0.91      0.86     97305
         1.0       0.37      0.20      0.26     25364

    accuracy                           0.77    122669
   macro avg       0.59      0.56      0.56    122669
weighted avg       0.72      0.77      0.74    122669

ROC-AUC (validation) - Data Version 1: 0.643

Confusion Matrix (test) - Data Version 1
[[88558  8255]
 [20600  5257]]

Classification Report (test) - Data Version 1
              precision    recall  f1-score   support

         0.0      0.811     0.915     0.860     96813
         1.0      0.389     0.203     0.267     25857

    accuracy                          0.765    122670
   macro avg      0.600     0.559     0.563    122670
weighted avg      0.722     0.765     0.735    122670

ROC-AUC (test) - Data Version 1: 0.65


### Working with data version 2

In [None]:
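# Splitting data into training, validation, and test sets
# Note: holding out 15% and then halving it gives 85% / 7.5% / 7.5%, not 70% / 15% / 15%,
# and this split is not stratified on 'target'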
from sklearn.model_selection import train_test_split
train_data, test_data = train_test_split(data_v2, test_size=0.15, random_state=42)
val_data, test_data = train_test_split(test_data, test_size=0.5, random_state=42)
print(f"Train data shape - Data V2  : {train_data.shape}")
print(f"Validation data shape - Data V2  : {val_data.shape}")
print(f"Test data shape - Data V2  : {test_data.shape}")

# Tackling class imbalance using SMOTE
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=42)
X_train = train_data.drop('target', axis=1)
y_train = train_data['target']
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
print(f'Original dataset shape - Data V2 : {y_train.value_counts().to_dict()}')
print(f'Resampled dataset shape - Data V2 : {y_resampled.value_counts().to_dict()}')

# Using XGBoost as an alternative model
import xgboost as xgb
xgb_model = xgb.XGBClassifier(use_label_encoder=False, eval_metric='logloss')
xgb_model.fit(X_resampled, y_resampled)

#______________________________________________________________________
print("Evaluating XGBoost Model - Data Version 2")
print("\n")
print("=========== Validation Set Evaluation ===========")

# Making predictions on the validation set
y_val_pred = xgb_model.predict(val_data.drop('target', axis=1))

# Evaluating the model on the validation set
print(confusion_matrix(val_data['target'], y_val_pred))
print(classification_report(val_data['target'], y_val_pred))

# Calculating ROC-AUC for the validation set
y_val_proba = xgb_model.predict_proba(val_data.drop('target', axis=1))[:, 1]
print("ROC-AUC (validation) - Data Version 2:", round(roc_auc_score(val_data['target'], y_val_proba), 3))

#______________________________________________________________________
# Evaluating the model on the test set
print("=========== Test Set Evaluation ===========")

# Making predictions on the test set
y_test_pred  = xgb_model.predict(test_data.drop('target', axis=1))

# Evaluating the model on the test set
print(confusion_matrix(test_data['target'], y_test_pred))
print(classification_report(test_data['target'], y_test_pred))

# Calculating ROC-AUC for the test set
y_test_proba = xgb_model.predict_proba(test_data.drop('target', axis=1))[:, 1]
print("ROC-AUC (test) - Data Version 2:", round(roc_auc_score(test_data['target'], y_test_proba), 3))

Train data shape - Data V2  : (1390251, 86)
Validation data shape - Data V2  : (122669, 86)
Test data shape - Data V2  : (122670, 86)
Original dataset shape - Data V2 : {0.0: 1098140, 1.0: 292111}
Resampled dataset shape - Data V2 : {1.0: 1098140, 0.0: 1098140}


Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)


[[83180 14125]
 [14688 10676]]
              precision    recall  f1-score   support

         0.0       0.85      0.85      0.85     97305
         1.0       0.43      0.42      0.43     25364

    accuracy                           0.77    122669
   macro avg       0.64      0.64      0.64    122669
weighted avg       0.76      0.77      0.76    122669

ROC-AUC (validation) - Data Version 2: 0.715

Confusion Matrix (test) - Data Version 2
[[82824 13989]
 [14701 11156]]

Classification Report (test) - Data Version 2
              precision    recall  f1-score   support

         0.0      0.849     0.856     0.852     96813
         1.0      0.444     0.431     0.437     25857

    accuracy                          0.766    122670
   macro avg      0.646     0.643     0.645    122670
weighted avg      0.764     0.766     0.765    122670

ROC-AUC (test) - Data Version 2: 0.722


# Conclusions

## Logistic Regression

For Data Version 1, the Logistic Regression model was trained on a training set balanced with SMOTE (904,547 rows per class). The validation and test results were nearly identical (accuracy 0.63 vs 0.64, ROC-AUC 0.614 vs 0.613), which suggests consistent and stable model behavior.

However, the performance was moderate. On the validation set, the recall for delayed flights (class 1) reached 0.47, with a precision of 0.28 and an overall accuracy of 0.63. This means that the model missed just over half of the actual delays (27,305 false negatives out of 51,508 delayed flights), and fewer than three in ten predicted delays were real. The ROC-AUC of about 0.61 indicates only a weak ability to separate delayed from on-time flights, so there is still room for improvement in feature selection and tuning.
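
As a check on these figures, the metrics can be recomputed by hand from the validation confusion matrix printed for Data Version 1. This is a minimal sketch that assumes the `[[TN, FP], [FN, TP]]` layout that scikit-learn's `confusion_matrix` uses for labels 0 and 1. The helper name `binary_metrics` is hypothetical.

```python
def binary_metrics(cm):
    """Compute positive-class metrics from a [[TN, FP], [FN, TP]] matrix."""
    (tn, fp), (fn, tp) = cm
    precision = tp / (tp + fp)   # share of predicted delays that were real
    recall = tp / (tp + fn)      # share of real delays that were caught
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Validation confusion matrix printed for Logistic Regression, Data Version 1
cm_v1_val = [[131525, 62347], [27305, 24203]]
metrics = binary_metrics(cm_v1_val)
print({name: round(value, 2) for name, value in metrics.items()})
# {'precision': 0.28, 'recall': 0.47, 'f1': 0.35, 'accuracy': 0.63}
```

The values agree with the class 1 row of the classification report for the validation set.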

For Version 2, Logistic Regression traded accuracy for recall. On the validation set, the accuracy decreased from 0.63 to 0.59, but the recall for delayed flights improved from 0.47 to 0.58, which is important because the business goal prioritises identifying possible delays. Despite this gain, the model still struggles with precision (0.28), meaning most predicted delays were false alarms. Overall, the model captured more real delays but at the cost of more false positives.


## XGBoost

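XGBoost outperformed Logistic Regression on accuracy and ROC-AUC for both data versions, although its advantage in detecting delays appeared only with Version 2. On Version 2, the test accuracy increased to 0.77, and the ROC-AUC of 0.72 indicates a stronger ability to distinguish between delayed and non-delayed flights. Precision and recall for delayed flights were both around 0.43–0.44, meaning that while not perfect, the model is making more reliable predictions overall. On Version 1, accuracy was also about 0.77, but the ROC-AUC was only 0.64–0.65 and the recall for delayed flights fell to 0.20, below the 0.47 obtained with Logistic Regression. The XGBoost runs also used a different, non-stratified split (about 85% / 7.5% / 7.5%), so the comparison with Logistic Regression is indicative rather than exact.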
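
To make the ROC-AUC figures easier to read: the score can be interpreted as the probability that a randomly chosen delayed flight receives a higher predicted probability than a randomly chosen on-time flight. The toy sketch below computes it that way on four made-up points. The function name `pairwise_roc_auc` is hypothetical, and the quadratic loop is for illustration only (the notebook itself uses `roc_auc_score`).

```python
def pairwise_roc_auc(y_true, scores):
    """ROC-AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up labels and predicted probabilities, not taken from the flight data
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_roc_auc(y_true, scores))  # 0.75
```

A value of 0.5 corresponds to random ranking, so 0.61 is only slightly better than chance, while 0.72 ranks a delayed flight above an on-time one in roughly seven out of ten pairs.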

This improvement suggests that XGBoost can better capture nonlinear relationships and interactions between features, which is important for complex scenarios like flight delays, where multiple factors such as weather, airport traffic, and time of day influence the outcome.

Overall, Logistic Regression provided a simple and interpretable baseline but suffered from low precision (0.28) and only moderate recall (0.47 to 0.58) for delayed flights.
XGBoost, on the other hand, achieved a notable improvement on Version 2 in both ROC-AUC (0.72) and F1-score for delayed flights (about 0.43–0.44 versus 0.37), which suggests that tree-based models handle these complex relationships better.

If more time were available, the next steps would include:

- Performing hyperparameter tuning (GridSearchCV or Bayesian optimisation).
- Testing other ensemble models or combining Logistic Regression and XGBoost for more stable results.
- Exploring additional features such as temperature, wind speed, or airport congestion.
- Adjusting decision thresholds to favor recall, since missing a delay alert is more costly for the business.
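
The last step above, adjusting the decision threshold, can be sketched as follows. This is a minimal illustration on made-up probabilities and is not a result from this project. The helper names and the target recall of 0.9 are assumptions. In practice the threshold would be chosen on the validation set from `predict_proba` output and then checked once on the test set.

```python
def recall_precision_at(y_true, proba, threshold):
    """Recall and precision for class 1 when predicting 1 if proba >= threshold."""
    tp = sum(1 for y, p in zip(y_true, proba) if y == 1 and p >= threshold)
    fp = sum(1 for y, p in zip(y_true, proba) if y == 0 and p >= threshold)
    fn = sum(1 for y, p in zip(y_true, proba) if y == 1 and p < threshold)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision


def highest_threshold_for_recall(y_true, proba, target_recall):
    """Return the largest observed score whose recall meets the target.

    Recall can only grow as the threshold drops, so the first hit while
    scanning from high to low is the largest qualifying threshold.
    """
    for threshold in sorted(set(proba), reverse=True):
        recall, precision = recall_precision_at(y_true, proba, threshold)
        if recall >= target_recall:
            return threshold, recall, precision
    raise ValueError("target recall cannot be reached")


# Made-up labels and predicted probabilities for illustration only
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
proba = [0.10, 0.30, 0.35, 0.40, 0.55, 0.60, 0.65, 0.90]

print(recall_precision_at(y_true, proba, 0.5))   # (0.75, 0.75)
threshold, recall, precision = highest_threshold_for_recall(y_true, proba, 0.9)
print(threshold, recall, round(precision, 2))    # 0.35 1.0 0.67
```

Lowering the threshold from 0.5 to 0.35 catches every delay in this toy example but reduces precision, which is the same trade-off seen between the two Logistic Regression runs.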