# Problem: Predicting Airplane Delays

The goals of this notebook are:
- Process and create a dataset from downloaded ZIP files
- Perform exploratory data analysis (EDA)
- Establish a baseline model and improve it

## Introduction to business scenario
You work for a travel booking website that wants to improve the customer experience around delayed flights. The company wants to build a feature that tells customers, at booking time, whether a flight to or from one of the busiest US domestic airports is likely to be delayed due to weather.

You are tasked with solving part of this problem by using machine learning to identify whether a flight will be delayed due to weather. You have been given access to a dataset of on-time performance of domestic flights operated by large air carriers. You can use this data to train a machine learning model that predicts whether a flight at the busiest airports will be delayed.

### Dataset
The provided dataset contains scheduled and actual departure and arrival times reported by certified US air carriers that account for at least 1 percent of domestic scheduled passenger revenues. The data was collected by the Office of Airline Information, Bureau of Transportation Statistics (BTS). The dataset contains the date, time, origin, destination, airline, distance, and delay status of flights between 2014 and 2018.
The data are in 60 compressed files, each containing a CSV with the flight details for one month of the five years (2014 to 2018). The data can be downloaded from this [link](https://ucstaff-my.sharepoint.com/:f:/g/personal/ibrahim_radwan_canberra_edu_au/EhWeqeQsh-9Mr1fneZc9_0sBOBzEdXngvxFJtAlIa-eAgA?e=8ukWwa). Please download the data files and place them at a path relative to this notebook. The dataset(s) used in this assignment were compiled by the Office of Airline Information, Bureau of Transportation Statistics (BTS), Airline On-Time Performance Data, available at the following [link](https://www.transtats.bts.gov/Fields.asp?gnoyr_VQ=FGJ).
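
As a rough illustration of how the monthly ZIP files could be merged into one CSV, here is a minimal sketch that uses only the standard library. The function name `combine_zipped_csvs` and the directory layout are hypothetical, and the sketch assumes every CSV shares the same header; it is not the method used in Part A of this project.

```python
import csv
import io
import zipfile
from pathlib import Path


def combine_zipped_csvs(zip_dir, out_csv):
    """Stream every CSV found in the ZIP files under zip_dir into one CSV.

    The header is taken from the first CSV. Later files are assumed to have
    the same columns (an assumption, enforced below). Returns the number of
    data rows written.
    """
    rows_written = 0
    header = None
    with open(out_csv, "w", newline="", encoding="utf-8") as out_file:
        writer = csv.writer(out_file)
        for zip_path in sorted(Path(zip_dir).glob("*.zip")):
            with zipfile.ZipFile(zip_path) as archive:
                for member in archive.namelist():
                    if not member.lower().endswith(".csv"):
                        continue  # skip non-CSV members such as readme files
                    with archive.open(member) as raw:
                        text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                        reader = csv.reader(text)
                        file_header = next(reader, None)
                        if file_header is None:
                            continue  # empty CSV
                        if header is None:
                            header = file_header
                            writer.writerow(header)
                        elif file_header != header:
                            raise ValueError(
                                f"column mismatch in {zip_path.name}/{member}"
                            )
                        for row in reader:
                            writer.writerow(row)
                            rows_written += 1
    return rows_written
```

Streaming row by row keeps memory use flat, which matters when 60 monthly files are combined; loading each file with pandas and concatenating is a reasonable alternative when memory allows.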

# Step 1: Prepare the environment 

Use one of the Amazon SageMaker labs that we have practised with and perform the following steps:
1. Start a lab.
2. Create a notebook instance and name it "oncloudproject".
3. Increase the used memory to 25 GB from the additional configurations.
4. Open Jupyter Lab and upload this notebook into it.
5. Upload the two combined CSV files (combined_csv_v1.csv and combined_csv_v2.csv), which you created in Part A of this project.

**Note:** If the data is too large to upload to AWS, use only 20% of the data for this task.
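
One possible way to cut a large CSV down to roughly 20% before uploading is sketched below. The helper `sample_csv_rows` is hypothetical and uses only the standard library; it keeps each row independently with probability `fraction`, so the sample size is approximate and the class balance is not explicitly preserved.

```python
import csv
import random


def sample_csv_rows(src_csv, dst_csv, fraction=0.2, seed=42):
    """Copy the header and a random ~fraction of the data rows to dst_csv.

    Rows are streamed, so the full file never has to fit in memory.
    Returns (rows_seen, rows_kept).
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    seen = kept = 0
    with open(src_csv, newline="", encoding="utf-8") as src, \
         open(dst_csv, "w", newline="", encoding="utf-8") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        header = next(reader, None)
        if header is not None:
            writer.writerow(header)
        for row in reader:
            seen += 1
            if rng.random() < fraction:  # keep this row with probability `fraction`
                writer.writerow(row)
                kept += 1
    return seen, kept
```

If the delay class is rare, a stratified sample (for example with pandas `groupby(...).sample(frac=0.2)`) may be a better fit, since it keeps the class proportions exact.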

# Step 2: Build and evaluate simple models

Write code to perform the following steps:
1. Split data into training, validation and testing sets (70% - 15% - 15%).
2. Use the linear learner estimator to build a classification model.
3. Host the model on another instance.
4. Perform batch transform to evaluate the model on testing data.
5. Report the performance metrics that you think best assess the model's performance.

Note: You are required to perform the above steps on the two combined datasets separately and to comment on the difference.
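
When choosing metrics for step 5, it may help to see why accuracy alone can mislead when delays are the minority class. The sketch below computes the usual metrics in plain Python on made-up toy labels (not project data); `binary_metrics` is a hypothetical helper, and the notebook itself uses scikit-learn for the real evaluation.

```python
def binary_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F1 for 0/1 labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    n = len(pairs)
    accuracy = (tp + tn) / n if n else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy example: 2 delayed flights out of 10, and a model that never predicts a delay
y_true = [0] * 8 + [1] * 2
always_on_time = [0] * 10
print(binary_metrics(y_true, always_on_time))
# accuracy is 0.8 even though recall and F1 are 0.0
```

A model that never flags a delay still scores 80% accuracy here, which is why recall, precision, F1 and ROC AUC are worth reporting alongside accuracy.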

In [4]:
# Standard library
import io
import json
import os
import pathlib
import re
import tarfile
import tempfile
import time
import warnings
from pathlib import Path

# Third-party
import boto3
import numpy as np
import pandas as pd
from botocore.exceptions import ClientError
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    roc_auc_score,
    average_precision_score,
)

# SageMaker
import sagemaker
from sagemaker import Session, image_uris
from sagemaker.local import LocalSession
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import JSONDeserializer
from sagemaker.s3 import S3Uploader, S3Downloader
from sagemaker.amazon.linear_learner import LinearLearner

warnings.filterwarnings("ignore")


In [5]:
# Initialize SageMaker and AWS region
sess = sagemaker.Session()
region = boto3.Session().region_name

# Load dataset
CSV_PATH = "combined_csv_v1.csv"  # rerun with "combined_csv_v2.csv" for the second dataset
RANDOM_STATE = 42
df = pd.read_csv(CSV_PATH)

# Check missing values
missing_counts = df.isna().sum().sort_values(ascending=False)
missing_cols = missing_counts[missing_counts > 0]
missing_total_before = int(df.isna().sum().sum())

# Drop rows with missing values
df_clean = df.dropna().reset_index(drop=True)

# Identify column types
numeric_columns = df_clean.select_dtypes(include="number").columns.tolist()
non_numeric_columns = df_clean.select_dtypes(exclude="number").columns.tolist()

# Find target column
if "target" in df_clean.columns:
    target_col = "target"
elif "is_delay" in df_clean.columns:
    target_col = "is_delay"
else:
    target_col = None

# Check if stratified sampling can be used
stratified = (
    target_col is not None
    and 2 <= df_clean[target_col].nunique() <= 20
)

# Check missing values again after cleaning
missing_total_after = int(df_clean.isna().sum().sum())

# Summary
results = {
    "missing_counts": missing_counts,
    "missing_columns": missing_cols,
    "total_missing_before": missing_total_before,
    "total_missing_after": missing_total_after,
    "numeric_columns": numeric_columns,
    "non_numeric_columns": non_numeric_columns,
    "target_column": target_col,
    "stratified": stratified,
    "cleaned_shape": df_clean.shape,
}

results


In [None]:
from sklearn.model_selection import train_test_split
import os

# Choose cleaned dataset if it exists
data = df_clean if "df_clean" in globals() else df

# Pick stratify column if valid
if "target_col" in globals() and target_col in data.columns and 2 <= data[target_col].nunique() <= 20:
    stratify_col = data[target_col]
else:
    stratify_col = None

# 70% train / 30% temp split
train_df, temp_df = train_test_split(
    data,
    test_size=0.3,
    random_state=RANDOM_STATE,
    stratify=stratify_col
)

# Stratify again for validation/test split
if "target_col" in globals() and target_col in temp_df.columns and 2 <= temp_df[target_col].nunique() <= 20:
    strat_temp = temp_df[target_col]
else:
    strat_temp = None

# Split 30% temp into 15% validation and 15% test
val_df, test_df = train_test_split(
    temp_df,
    test_size=0.5,
    random_state=RANDOM_STATE,
    stratify=strat_temp
)

# Save data splits
os.makedirs("splits", exist_ok=True)
train_df.to_csv("splits/train.csv", index=False)
val_df.to_csv("splits/val.csv", index=False)
test_df.to_csv("splits/test.csv", index=False)

# Print summary
def show_info(name, df, label):
    print(f"{name}: {df.shape[0]} rows × {df.shape[1]} cols")
    if label and label in df.columns:
        counts = df[label].value_counts(dropna=False).sort_index()
        percents = (counts / len(df) * 100).round(2)
        dist = ", ".join([f"{i}: {counts[i]} ({percents[i]}%)" for i in counts.index])
        print("class distribution ->", dist)
    print()

print("Splits saved in ./splits/")
show_info("Train 70%", train_df, target_col if "target_col" in globals() else None)
show_info("Validation 15%", val_df, target_col if "target_col" in globals() else None)
show_info("Test 15%", test_df, target_col if "target_col" in globals() else None)


In [None]:
import boto3
from pathlib import Path
from botocore.exceptions import ClientError

# Local files
DATA_SPLITS = {
    "train": Path("splits/train.csv"),
    "val":   Path("splits/val.csv"),
    "test":  Path("splits/test.csv"),
}

# S3 config
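BUCKET_NAME = "c182567a4701745l12017053t1w416916046524-labbucket-fd5b733ssgft"
KEY_PREFIX  = "linear-learner-delay/"

# Short aliases: the later cells refer to these as `bucket` and `prefix`
bucket, prefix = BUCKET_NAME, KEY_PREFIX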

# S3 client
s3 = boto3.client("s3")

def upload_file_checked(local_file: Path, bucket: str, key: str):
    """Upload to S3 and confirm object exists."""
    if not local_file.exists():
        print(f"skip (not found): {local_file}")
        return
    try:
        s3.upload_file(str(local_file), bucket, key)
        s3.head_object(Bucket=bucket, Key=key)
        print(f"ok: s3://{bucket}/{key}")
    except ClientError as e:
        msg = e.response.get("Error", {}).get("Message", str(e))
        print(f"error: {key} -> {msg}")

# Upload all splits
for name, path in DATA_SPLITS.items():
    key = f"{KEY_PREFIX}{path.name}"
    upload_file_checked(path, BUCKET_NAME, key)

# Print S3 URIs
print("\nS3 URIs:")
for name in ("train", "val", "test"):
    print(f"s3://{BUCKET_NAME}/{KEY_PREFIX}{name}.csv")


In [None]:
import boto3
import sagemaker
from sagemaker import get_execution_role

# Initialize SageMaker session
session = sagemaker.Session()

# Get AWS region
region = session.boto_region_name

# Get execution role
role = get_execution_role()


In [None]:
import io
import boto3
import pandas as pd
import numpy as np

# S3 dataset channels
train_s3 = f"s3://{bucket}/{prefix}train.csv"
val_s3   = f"s3://{bucket}/{prefix}val.csv"
test_s3  = f"s3://{bucket}/{prefix}test.csv"

# SageMaker inputs (no header expected after sanitize)
train_input = TrainingInput(train_s3, content_type="text/csv")
val_input   = TrainingInput(val_s3,   content_type="text/csv")

# Info
print("region:", region)
print("role:", role)
print(train_s3, val_s3, test_s3, sep="\n")

# S3 client
s3 = boto3.client("s3")

# ---------- helpers ----------
def read_csv_from_s3(key: str) -> pd.DataFrame:
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    return pd.read_csv(io.BytesIO(body))

def drop_suffix_dot1(df: pd.DataFrame) -> pd.DataFrame:
    return df[[c for c in df.columns if not c.endswith(".1")]]

def cast_bool_like(df: pd.DataFrame) -> pd.DataFrame:
    for c in df.columns:
        if df[c].dtype == bool:
            df[c] = df[c].astype(int)
        elif df[c].dtype == object:
            uniq = set(df[c].dropna().astype(str).str.lower().unique())
            if uniq <= {"true", "false"}:
                df[c] = df[c].astype(str).str.lower().map({"true": 1, "false": 0})
    return df

def coerce_all_numeric(df: pd.DataFrame) -> pd.DataFrame:
    return df.apply(lambda s: pd.to_numeric(s, errors="coerce"))

def put_first(df: pd.DataFrame, label: str = "target") -> pd.DataFrame:
    if label in df.columns:
        cols = [label] + [c for c in df.columns if c != label]
        return df[cols]
    return df

# ---------- main ----------
def sanitize_one(name: str) -> None:
    """
    Load s3://{bucket}/{prefix}{name}.csv (with header), clean, ensure binary label in col 0,
    write back to the same key without header.
    """
    key = f"{prefix}{name}.csv"

    # load and clean
    df = read_csv_from_s3(key)
    df = drop_suffix_dot1(df)
    df = cast_bool_like(df)
    df = coerce_all_numeric(df).fillna(0.0)
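    # use the label detected earlier ("target" or "is_delay"); fall back to "target"
    label = globals().get("target_col") or "target"
    df = put_first(df, label)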

    # ensure label is {0,1}
    y = df.iloc[:, 0].to_numpy()
    uniq = np.unique(y[~pd.isna(y)])
    if not set(uniq) <= {0, 1}:
        if len(uniq) == 2:
            lo, hi = sorted(uniq)
            df.iloc[:, 0] = (df.iloc[:, 0] == hi).astype(int)
        else:
            thr = np.nanmedian(y)
            df.iloc[:, 0] = (df.iloc[:, 0] > thr).astype(int)

    # write back (no header/index)
    buf = io.StringIO()
    df.to_csv(buf, header=False, index=False)
    s3.put_object(Bucket=bucket, Key=key, Body=buf.getvalue(), ContentType="text/csv")

    # verify
    s3.head_object(Bucket=bucket, Key=key)
    print(f"sanitized: s3://{bucket}/{key}  shape={df.shape}")

def sanitize_all():
    for part in ("train", "val", "test"):
        sanitize_one(part)

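# Run once before training: the training cells expect label-first, headerless CSVs.
# Do not rerun on already sanitized files, because the first data row would be read as a header.
sanitize_all()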
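
The sanitize step above rewrites each split so that the label sits in column 0 and the header row is dropped. A minimal standard-library sketch of that layout change is shown below; `to_label_first_csv` is a hypothetical helper, and the column names `Distance` and `DepHour` are made up for illustration.

```python
import csv
import io


def to_label_first_csv(csv_text, label):
    """Return csv_text with `label` moved to column 0 and the header dropped."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    if label not in header:
        raise KeyError(f"label column {label!r} not found")
    idx = header.index(label)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        # label value first, then the remaining columns in their original order
        writer.writerow([row[idx]] + row[:idx] + row[idx + 1:])
    return out.getvalue()


sample = "Distance,target,DepHour\n500,1,9\n1200,0,17\n"
print(to_label_first_csv(sample, "target"))
# 1,500,9
# 0,1200,17
```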

In [None]:
# Image URI for Linear Learner
img_uri = image_uris.retrieve(region=region, framework="linear-learner")

# Estimator configuration
output_url = f"s3://{bucket}/{prefix}output/"
instance_type = "ml.m5.large"

ll_estimator = sagemaker.estimator.Estimator(
    image_uri=img_uri,
    role=role,
    instance_count=1,
    instance_type=instance_type,
    output_path=output_url,
    sagemaker_session=sess,
    max_run=2000,
)

# Hyperparameters
hp = {
    "predictor_type": "binary_classifier",
    "epochs": 5,
    "mini_batch_size": 256,
    "num_models": 32,
    "loss": "auto",
}
ll_estimator.set_hyperparameters(**hp)

# Data channels
train_channel = TrainingInput(f"s3://{bucket}/{prefix}train.csv", content_type="text/csv")
valid_channel = TrainingInput(f"s3://{bucket}/{prefix}val.csv",   content_type="text/csv")
channels = {"train": train_channel, "validation": valid_channel}

# Fit
ll_estimator.fit(channels, logs=False)



In [None]:
import io, time
from pathlib import Path
from sagemaker.transformer import Transformer

# deploy
ep_name = f"ll-{int(time.time())}"
pred = ll_estimator.deploy(initial_instance_count=1, instance_type="ml.m5.xlarge", endpoint_name=ep_name)
print("endpoint:", pred.endpoint_name)

# load test set
resp = s3.get_object(Bucket=bucket, Key=f"{prefix}test.csv")
df_test = pd.read_csv(io.BytesIO(resp["Body"].read()), header=None)
y_true = df_test.iloc[:, 0].astype("int32").to_numpy()
X_test = df_test.iloc[:, 1:]

# upload features-only (in-memory)
csv_buf = io.StringIO()
X_test.to_csv(csv_buf, header=False, index=False)
csv_bytes = io.BytesIO(csv_buf.getvalue().encode("utf-8"))
test_x_key = f"{prefix}test_x.csv"
s3.put_object(Bucket=bucket, Key=test_x_key, Body=csv_bytes.getvalue())
test_x_uri = f"s3://{bucket}/{test_x_key}"

# batch transform
xformer = ll_estimator.transformer(
    instance_count=1,
    instance_type="ml.m5.large",
    output_path=f"s3://{bucket}/{prefix}batch_out/",
    accept="application/jsonlines",
    assemble_with="Line",
)

xformer.transform(data=test_x_uri, content_type="text/csv", split_type="Line")
xformer.wait()

In [None]:
import json
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    roc_auc_score,
    confusion_matrix,
)

# ---- Load Batch Transform output files ----
s3_resource = boto3.resource("s3")
bucket_obj = s3_resource.Bucket(bucket)
out_keys = [
    obj.key
    for obj in bucket_obj.objects.filter(Prefix=f"{prefix}batch_out/")
    if obj.key.endswith(".out")
]

# ---- Extract predictions ----
probs, preds = [], []
for k in out_keys:
    text = bucket_obj.Object(k).get()["Body"].read().decode("utf-8")
    if not text.strip():
        continue
    for line in text.splitlines():
        rec = json.loads(line)
        p = rec.get("score", rec.get("scores", [0])[0])
        lbl = rec.get("predicted_label", int(p >= 0.5))
        probs.append(float(p))
        preds.append(int(lbl))

y_prob = np.array(probs, dtype=float)
y_pred = np.array(preds, dtype=int)

if len(y_true) != len(y_pred):
    raise ValueError("Length mismatch between true and predicted values")

# ---- Calculate metrics ----
acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred, zero_division=0)
rec = recall_score(y_true, y_pred, zero_division=0)
f1 = f1_score(y_true, y_pred, zero_division=0)
auc = roc_auc_score(y_true, y_prob)
cm = confusion_matrix(y_true, y_pred)

# ---- Print results ----
print("Batch Transform - Evaluation Metrics")
print(f"Accuracy     : {acc:.4f}")
print(f"Precision    : {prec:.4f}")
print(f"Recall       : {rec:.4f}")
print(f"F1-score     : {f1:.4f}")
print(f"ROC AUC      : {auc:.4f}")
print("Confusion matrix [[TN FP]\n [FN TP]]:")
print(cm)

# Step 3: Build and evaluate ensemble models

Write code to perform the following steps:
1. Split data into training, validation and testing sets (70% - 15% - 15%).
2. Use the XGBoost estimator to build a classification model.
3. Host the model on another instance.
4. Perform batch transform to evaluate the model on testing data.
5. Report the performance metrics that you think best assess the model's performance.
6. Write down your observations on the difference in performance between the simple and ensemble models.

Note: You are required to perform the above steps on the two combined datasets separately.
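
For step 6, one simple way to organise the comparison is to line up the test metrics of each run against a baseline. The helper below is a hypothetical sketch; the run names and all numbers are placeholders for illustration only and are not results from this project.

```python
def compare_runs(metrics_by_run, baseline):
    """Return one text line per run: each metric and its change vs the baseline run."""
    if baseline not in metrics_by_run:
        raise KeyError(f"unknown baseline run {baseline!r}")
    base = metrics_by_run[baseline]
    lines = []
    for run, metrics in metrics_by_run.items():
        parts = []
        for name in base:
            delta = metrics[name] - base[name]
            parts.append(f"{name}={metrics[name]:.3f} ({delta:+.3f})")
        lines.append(f"{run}: " + ", ".join(parts))
    return lines


# Placeholder values only; replace them with the metrics printed by your own runs
runs = {
    "linear_v1": {"recall": 0.5, "f1": 0.25},
    "xgboost_v1": {"recall": 0.75, "f1": 0.5},
}
for line in compare_runs(runs, baseline="linear_v1"):
    print(line)
# linear_v1: recall=0.500 (+0.000), f1=0.250 (+0.000)
# xgboost_v1: recall=0.750 (+0.250), f1=0.500 (+0.250)
```

Filling the dictionary with the metrics from both models on both combined datasets gives four rows to discuss in the written observations.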

In [4]:
# Retrieve XGBoost image
xgb_image = image_uris.retrieve(framework="xgboost", region=region, version="1.7-1")

# Define estimator configuration
xgb_output_path = f"s3://{bucket}/{prefix}xgb_results/"
xgb_estimator = sagemaker.estimator.Estimator(
    image_uri=xgb_image,
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path=xgb_output_path,
    sagemaker_session=sess,
    max_run=2000,
)

# Set hyperparameters for binary classification
xgb_hyperparams = {
    "objective": "binary:logistic",
    "num_round": 75,
    "max_depth": 5,
    "eta": 0.15,
    "subsample": 0.7,
    "eval_metric": "auc",
}
xgb_estimator.set_hyperparameters(**xgb_hyperparams)

# Create data channels
train_data = TrainingInput(f"s3://{bucket}/{prefix}train.csv", content_type="text/csv")
val_data   = TrainingInput(f"s3://{bucket}/{prefix}val.csv",   content_type="text/csv")
channels = {"train": train_data, "validation": val_data}

# Start training
xgb_estimator.fit(channels, logs=False)



In [5]:
import time
from sagemaker.transformer import Transformer

# ----- Deploy a realtime endpoint -----
ep_name = f"xgb-rt-{int(time.time())}"
xgb_predictor = xgb_estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.c5.xlarge",
    endpoint_name=ep_name,
)
print("XGBoost endpoint:", xgb_predictor.endpoint_name)

# ----- Configure Batch Transform -----
bt_output = f"s3://{bucket}/{prefix}xgb_bt_out/"
xgb_bt = xgb_estimator.transformer(
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path=bt_output,
    assemble_with="Line",
    accept="application/jsonlines",
)

# ----- Run Batch Transform on features-only test CSV -----
test_x_uri = f"s3://{bucket}/{prefix}test_x.csv"
xgb_bt.transform(data=test_x_uri, content_type="text/csv", split_type="Line")
xgb_bt.wait()

In [None]:
import io
import json
import re
import boto3
import numpy as np
import pandas as pd
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    roc_auc_score,
    confusion_matrix,
)

# ---------- S3 handles ----------
s3_res = boto3.resource("s3")
s3_cli = boto3.client("s3")
bucket_obj = s3_res.Bucket(bucket)

# Gather XGBoost batch outputs (deterministic order)
bt_prefix = f"{prefix}xgb_bt_out/"  # must match the output_path used for the batch transform job
shard_keys = sorted(k.key for k in bucket_obj.objects.filter(Prefix=bt_prefix) if k.key.endswith(".out"))

def get_prob(line: str) -> "float | None":  # quoted so the annotation also works on Python < 3.10
    """
    Try to read a probability from a single output line.
    Supports JSON (scalar/dict/list) and simple CSV/whitespace tokens.
    Returns None if nothing numeric is found.
    """
    line = line.strip()
    if not line:
        return None

    # JSON path
    try:
        obj = json.loads(line)

        # scalar number
        if isinstance(obj, (int, float)):
            return float(obj)

        # dict with common fields, then first numeric fallback
        if isinstance(obj, dict):
            for key in ("score", "probability", "prediction", "predicted_value"):
                val = obj.get(key)
                if isinstance(val, (int, float)):
                    return float(val)
            for key in ("scores", "predictions"):
                val = obj.get(key)
                if isinstance(val, (list, tuple)) and val and isinstance(val[0], (int, float)):
                    return float(val[0])
            for val in obj.values():
                if isinstance(val, (int, float)):
                    return float(val)

        # list: return first numeric
        if isinstance(obj, list):
            for val in obj:
                if isinstance(val, (int, float)):
                    return float(val)
    except json.JSONDecodeError:
        pass

    # CSV / whitespace tokens
    for tok in re.split(r"[,\s]+", line):
        if not tok:
            continue
        try:
            return float(tok)
        except ValueError:
            continue

    return None

# ---------- Parse predictions ----------
probs, preds = [], []
for key in shard_keys:
    text = bucket_obj.Object(key).get()["Body"].read().decode("utf-8")
    if not text:
        continue
    for ln in text.splitlines():
        p = get_prob(ln)
        if p is None:
            continue
        probs.append(p)
        preds.append(int(p >= 0.5))

y_prob = np.asarray(probs, dtype=float)
y_pred = np.asarray(preds, dtype=int)

# ---------- Load ground truth (label is first column) ----------
obj = s3_cli.get_object(Bucket=bucket, Key=f"{prefix}test.csv")
test_df = pd.read_csv(io.BytesIO(obj["Body"].read()), header=None)
y_true = test_df.iloc[:, 0].astype(int).to_numpy()

# Align lengths if needed (common with header rows or partial shards)
if len(y_true) != len(y_pred):
    print(
        f"[notice] length mismatch: y_true={len(y_true)}, y_pred={len(y_pred)}. "
        "Truncating to the shorter length."
    )
    m = min(len(y_true), len(y_pred))
    y_true, y_pred, y_prob = y_true[:m], y_pred[:m], y_prob[:m]

# ---------- Metrics ----------
acc  = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred, zero_division=0)
rec  = recall_score(y_true, y_pred, zero_division=0)
f1   = f1_score(y_true, y_pred, zero_division=0)

# compute AUC only when labels contain both classes and probabilities vary
if y_true.size > 0 and np.unique(y_true).size == 2 and np.std(y_prob) > 0:
    auc = roc_auc_score(y_true, y_prob)
else:
    auc = float("nan")

cm = confusion_matrix(y_true, y_pred)

# ---------- Report ----------
print("XGBoost Batch Transform — Test Summary")
print(f"Accuracy   : {acc:.4f}")
print(f"Precision  : {prec:.4f}")
print(f"Recall     : {rec:.4f}")
print(f"F1-score   : {f1:.4f}")
print(f"ROC AUC    : {auc:.4f}")
print("Confusion matrix [[TN FP]\n [FN TP]]:")
print(cm)
