# Problem: Predicting Airplane Delays

The goals of this notebook are:
- Process and create a dataset from downloaded ZIP files
- Perform exploratory data analysis (EDA)
- Establish a baseline model and improve it

## Introduction to business scenario
You work for a travel booking website that is working to improve the customer experience for flights that were delayed. The company wants to create a feature to let customers know if the flight will be delayed due to weather when the customers are booking the flight to or from the busiest airports for domestic travel in the US. 

You are tasked with solving part of this problem by using machine learning to identify whether a flight will be delayed due to weather. You have been given access to a dataset of on-time performance for domestic flights operated by large air carriers. You can use this data to train a machine learning model that predicts whether a flight to or from the busiest airports will be delayed.

### Dataset
The provided dataset contains scheduled and actual departure and arrival times reported by certified US air carriers that account for at least 1 percent of domestic scheduled passenger revenues. The data was collected by the Office of Airline Information, Bureau of Transportation Statistics (BTS). The dataset contains the date, time, origin, destination, airline, distance, and delay status of flights between 2014 and 2018.
The data is in 60 compressed files, where each file contains a CSV with the flight details for one month of the five years (2014 to 2018). The data can be downloaded from this [link](https://ucstaff-my.sharepoint.com/:f:/g/personal/ibrahim_radwan_canberra_edu_au/EhWeqeQsh-9Mr1fneZc9_0sBOBzEdXngvxFJtAlIa-eAgA?e=8ukWwa). Please download the data files and place them in a path relative to this notebook. The dataset(s) used in this assignment were compiled by the Office of Airline Information, Bureau of Transportation Statistics (BTS), Airline On-Time Performance Data, available at the following [link](https://www.transtats.bts.gov/Fields.asp?gnoyr_VQ=FGJ).
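
The ZIP-to-dataset step can be sketched with the standard library alone. The sketch below assumes that the 60 archives sit in one local folder and that each archive holds at least one `.csv` member encoded as UTF-8. The function name `iter_zip_csv_rows` and the folder name in the usage comment are hypothetical and not part of the assignment.

```python
import csv
import io
import zipfile
from pathlib import Path
from typing import Iterator


def iter_zip_csv_rows(data_dir: str) -> Iterator[dict]:
    """Yield one dict per record from every CSV member of every ZIP in data_dir."""
    for zip_path in sorted(Path(data_dir).glob("*.zip")):
        with zipfile.ZipFile(zip_path) as zf:
            for member in zf.namelist():
                if not member.lower().endswith(".csv"):
                    continue  # skip readme files or other non-CSV members
                with zf.open(member) as raw:
                    # Assumption: the CSV members are UTF-8 text
                    text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                    yield from csv.DictReader(text)


# Usage (hypothetical folder name):
# rows = list(iter_zip_csv_rows("data_compressed"))
```

Every value comes back as a string, so numeric columns still need converting. In the notebook, `pd.DataFrame(list(iter_zip_csv_rows(...)))` would collect the rows into a DataFrame, memory permitting.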

# Step 1: Prepare the environment 

Use one of the Amazon SageMaker labs that we have practised with and perform the following steps:
1. Start a lab.
2. Create a notebook instance and name it "oncloudproject".
3. Increase the used memory to 25 GB from the additional configurations.
4. Open Jupyter Lab and upload this notebook into it.
5. Upload the two combined CSV files (combined_csv_v1.csv and combined_csv_v2.csv), which you created in Part A of this project.

**Note:** If the data is too large to upload to AWS, use only 20% of the data for this task.
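
If the upload limit applies, one way to cut the data down is to keep each row with probability 0.2 while streaming through the file, so that the full CSV never has to fit in memory. This is a sketch under two assumptions: the file has a single header line, and no field contains an embedded newline. The function name `sample_csv_rows` and the output file name in the usage comment are hypothetical.

```python
import random


def sample_csv_rows(src_path: str, dst_path: str, frac: float = 0.2, seed: int = 42) -> int:
    """Copy the header plus roughly `frac` of the data rows; return the number of rows kept."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    kept = 0
    with open(src_path, "r", encoding="utf-8", newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(src.readline())  # always keep the header line
        for line in src:
            if rng.random() < frac:  # keep each row independently with probability frac
                dst.write(line)
                kept += 1
    return kept


# Usage (hypothetical output name):
# sample_csv_rows("combined_csv_v1.csv", "combined_csv_v1_sample.csv")
```

This keeps roughly, not exactly, 20% of the rows. With the data already loaded in a DataFrame, `df.sample(frac=0.2, random_state=42)` is the shorter route.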

# Step 2: Build and evaluate simple models

Write code to perform the following steps:
1. Split the data into training, validation, and testing sets (70% - 15% - 15%).
2. Use the linear learner estimator to build a classification model.
3. Host the model on another instance.
4. Perform a batch transform to evaluate the model on the testing data.
5. Report the performance metrics that you think best assess the model's performance.

Note: You are required to perform the above steps on the two combined datasets separately and to comment on the differences.
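
For step 5, it can help to see how the common metrics derive from the four confusion-matrix counts. The sketch below is plain Python and assumes class 1 means "delayed". The counts are toy values chosen for illustration, not results from this notebook, and `binary_metrics` is a hypothetical helper name.

```python
def binary_metrics(tn: int, fp: int, fn: int, tp: int) -> dict:
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tn + fp + fn + tp
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy case (hypothetical): 1000 flights, 200 delayed, model always predicts "not delayed"
always_negative = binary_metrics(tn=800, fp=0, fn=200, tp=0)
print(always_negative["accuracy"])  # 0.8
print(always_negative["recall"])    # 0.0
```

In this toy case the accuracy looks reasonable although the model never flags a delay. If the delayed class turns out to be a minority in the real data, recall, F1, and ROC AUC are likely to say more about the model than accuracy alone.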

In [4]:
# Standard library
import io
import json
import os
import pathlib
import re
import tarfile
import tempfile
import time
import warnings
from pathlib import Path

# Third-party
import boto3
import numpy as np
import pandas as pd
from botocore.exceptions import ClientError
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    roc_auc_score,
    average_precision_score,
)

# SageMaker
import sagemaker
from sagemaker import Session, image_uris
from sagemaker.local import LocalSession
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import JSONDeserializer
from sagemaker.s3 import S3Uploader, S3Downloader
from sagemaker.amazon.linear_learner import LinearLearner

warnings.filterwarnings("ignore")


In [5]:
# Initialize SageMaker and AWS region
sess = sagemaker.Session()
region = boto3.Session().region_name

# Load dataset
CSV_PATH = "combined_csv.csv"
RANDOM_STATE = 42
df = pd.read_csv(CSV_PATH)

# --- Check for missing values ---
# Count NaNs in each column (sorted descending)
nan_counts = df.isna().sum().sort_values(ascending=False)

# Keep only columns that actually have missing values
missing_columns = nan_counts[nan_counts > 0]

# Total number of missing values in the dataset
total_missing_before = df.isna().sum().sum()

# --- Handle missing values ---
# Drop all rows that contain any missing values
df.dropna(inplace=True)

# Identify numeric and non-numeric columns
numeric_cols = df.select_dtypes(include="number").columns
non_numeric_cols = df.select_dtypes(exclude="number").columns

# --- Determine target column ---
# Select target column if present ('target' or 'is_delay')
target_col = "target" if "target" in df.columns else (
    "is_delay" if "is_delay" in df.columns else None
)

# Use stratified sampling if target has a reasonable number of unique values
if target_col is not None and 2 <= df[target_col].nunique() <= 20:
    strat = df[target_col]
else:
    strat = None

# --- Final check after dropping missing values ---
total_missing_after = df.isna().sum().sum()

# --- Optional: return results instead of printing ---
results = {
    "nan_counts": nan_counts,
    "missing_columns": missing_columns,
    "total_missing_before": total_missing_before,
    "total_missing_after": total_missing_after,
    "numeric_columns": list(numeric_cols),
    "non_numeric_columns": list(non_numeric_cols),
    "target_column": target_col,
    "stratified": strat is not None,
}

results

In [None]:
# --- Split dataset into train/val/test (70/15/15) ---

# 70% train / 30% temp
train_df, temp_df = train_test_split(
    df,
    test_size=0.30,
    random_state=RANDOM_STATE,
    stratify=strat,  # use stratified sampling if available
)

# Stratify for temp split (validation/test)
if target_col is not None and 2 <= temp_df[target_col].nunique() <= 20:
    strat_temp = temp_df[target_col]
else:
    strat_temp = None

# Split remaining 30% → 15% validation, 15% test
val_df, test_df = train_test_split(
    temp_df,
    test_size=0.50,  # half of 30% = 15%
    random_state=RANDOM_STATE,
    stratify=strat_temp,
)

# --- Save splits ---
os.makedirs("splits", exist_ok=True)
train_df.to_csv("splits/train.csv", index=False)
val_df.to_csv("splits/val.csv", index=False)
test_df.to_csv("splits/test.csv", index=False)


# --- Reporting function ---
def describe(name, d, tgt):
    """Print dataset shape and target class distribution."""
    print(f"{name}: {len(d)} rows, {d.shape[1]} cols")
    if tgt and tgt in d.columns:
        vc = d[tgt].value_counts(dropna=False).sort_index()
        pct = (vc / len(d) * 100).round(2)
        dist = ", ".join([f"{k}: {vc[k]} ({pct[k]}%)" for k in vc.index])
        print(" class distribution ->", dist)
    print()


# --- Output summary ---
# Downstream cells assume the label column is named "target"
target_col = "target"
print("saved splits to ./splits/")
describe("train 70%", train_df, target_col)
describe("val 15%", val_df, target_col)
describe("test 15%", test_df, target_col)


In [None]:
import os

# --- Ensure output directory exists ---
os.makedirs("splits", exist_ok=True)

# --- Function: Move target column to the first position ---
def put_label_first(df, label="target"):
    """Return a DataFrame with the target label as the first column."""
    if label in df.columns:
        ordered_cols = [label] + [col for col in df.columns if col != label]
        return df[ordered_cols]
    return df

# --- Reorder columns ---
train_df = put_label_first(train_df, target_col)
val_df   = put_label_first(val_df, target_col)
test_df  = put_label_first(test_df, target_col)

# --- Save CSV splits (with headers, no index) ---
train_df.to_csv("splits/train.csv", index=False, header=True)
val_df.to_csv("splits/val.csv", index=False, header=True)
test_df.to_csv("splits/test.csv", index=False, header=True)


In [None]:
import boto3
import pathlib
from botocore.exceptions import ClientError

# --- Local dataset paths ---
paths = {
    "train": "splits/train.csv",
    "val":   "splits/val.csv",
    "test":  "splits/test.csv",
}

# --- S3 configuration ---
bucket = "c182567a4701745l12017053t1w416916046524-labbucket-fd5b733ssgft"
prefix = "linear-learner-delay/"

# --- Create S3 client ---
s3 = boto3.client("s3")

# --- Upload function with verification ---
def upload_and_verify(local_path, bucket, key):
    """Upload file to S3 and verify upload success."""
    try:
        s3.upload_file(local_path, bucket, key)
        s3.head_object(Bucket=bucket, Key=key)
        print(f"uploaded → s3://{bucket}/{key}")
    except ClientError as e:
        print(f"could not verify {key}: {e}")

# --- Upload each dataset split ---
for name, local_path in paths.items():
    s3_key = f"{prefix}{pathlib.Path(local_path).name}"
    upload_and_verify(local_path, bucket, s3_key)

# --- Print resulting S3 URIs ---
print("\nS3 URIs for dataset:")
print(f"s3://{bucket}/{prefix}train.csv")
print(f"s3://{bucket}/{prefix}val.csv")
print(f"s3://{bucket}/{prefix}test.csv")


In [None]:
from sagemaker import get_execution_role
import sagemaker

# --- Initialize SageMaker session ---
sess = sagemaker.Session()

# --- Get current AWS region ---
region = sess.boto_region_name

# --- Retrieve execution role for the current SageMaker environment ---
role = get_execution_role()


In [None]:
# --- Define S3 dataset channels for SageMaker ---
train_s3 = f"s3://{bucket}/{prefix}train.csv"
val_s3   = f"s3://{bucket}/{prefix}val.csv"
test_s3  = f"s3://{bucket}/{prefix}test.csv"

# Training inputs (header-less expected later after sanitization)
train_input = TrainingInput(train_s3, content_type="text/csv")
val_input   = TrainingInput(val_s3,   content_type="text/csv")

# --- Console info ---
print("region:", region)
print("role:", role)
print(train_s3, val_s3, test_s3, sep="\n")

# --- S3 client ---
s3 = boto3.client("s3")


# ---------- Helpers ----------
def load_csv_with_header(key: str) -> pd.DataFrame:
    """Load a CSV object from S3 into a pandas DataFrame (assumes a header exists)."""
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    return pd.read_csv(io.BytesIO(body))


def drop_dot1(df: pd.DataFrame) -> pd.DataFrame:
    """Drop duplicate columns that end with '.1'."""
    return df[[c for c in df.columns if not c.endswith(".1")]]


def bool_to_int(df: pd.DataFrame) -> pd.DataFrame:
    """Convert boolean columns to int; map string booleans ('true'/'false') to 1/0."""
    for c in df.columns:
        if df[c].dtype == bool:
            df[c] = df[c].astype(int)
        elif df[c].dtype == object:
            vals = set(df[c].dropna().astype(str).str.lower().unique())
            if vals <= {"true", "false"}:
                df[c] = (
                    df[c]
                    .astype(str)
                    .str.lower()
                    .map({"true": 1, "false": 0})
                )
    return df


def to_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Coerce all columns to numeric, setting non-convertible values to NaN."""
    return df.apply(lambda col: pd.to_numeric(col, errors="coerce"))


def put_label_first(df: pd.DataFrame, label: str = "target") -> pd.DataFrame:
    """Move the label column to be the first column if it exists."""
    if label in df.columns:
        cols = [label] + [c for c in df.columns if c != label]
        return df[cols]
    return df


# ---------- Main sanitization ----------
def sanitize_one(name: str) -> None:
    """
    Load {prefix}{name}.csv (with header) from S3, clean it, ensure binary label in col 0,
    then upload back to the same key WITHOUT header. Print a short summary.
    """
    key = f"{prefix}{name}.csv"

    # Load with header, then clean
    df = load_csv_with_header(key)
    df = drop_dot1(df)
    df = bool_to_int(df)
    df = to_numeric(df).fillna(0.0)

    # Put label first
    df = put_label_first(df, "target")

    # Force the first column (label) to be strictly 0/1
    y = df.iloc[:, 0].values
    uy = np.unique(y[~pd.isna(y)])

    if not set(uy) <= {0, 1}:
        if len(uy) == 2:
            lo, hi = sorted(uy)
            df.iloc[:, 0] = (df.iloc[:, 0] == hi).astype(int)
        else:
            thr = np.nanmedian(y)
            df.iloc[:, 0] = (df.iloc[:, 0] > thr).astype(int)

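    # Upload back WITHOUT header.
    # Assumption: the cell is cut off at this point; the lines below are an
    # assumed completion that follows the docstring above.
    csv_bytes = df.to_csv(header=False, index=False).encode("utf-8")
    s3.put_object(Bucket=bucket, Key=key, Body=csv_bytes)
    print(f"{name}: {df.shape[0]} rows, {df.shape[1]} cols -> s3://{bucket}/{key}")


# Sanitize every split in place (assumed; later cells read test.csv with header=None)
for split_name in ("train", "val", "test"):
    sanitize_one(split_name)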


In [None]:
# Retrieve the built-in Linear Learner image
container = image_uris.retrieve(framework="linear-learner", region=region)

# Configure the estimator
est = sagemaker.estimator.Estimator(
    image_uri=container,
    role=role,
    instance_count=1,
    instance_type="ml.m5.large",
    output_path=f"s3://{bucket}/{prefix}output/",
    sagemaker_session=sess,
    max_run=3600,
)

# Set hyperparameters
est.set_hyperparameters(
    predictor_type="binary_classifier",
    epochs=10,
    mini_batch_size=256,
    num_models=32,
    loss="auto",
)

# Training inputs
train_input = TrainingInput(f"s3://{bucket}/{prefix}train.csv", content_type="text/csv")
val_input   = TrainingInput(f"s3://{bucket}/{prefix}val.csv",   content_type="text/csv")

# Train
est.fit({"train": train_input, "validation": val_input}, logs=False)


In [None]:
# --- Deploy real-time endpoint ---
predictor = est.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
)
endpoint_name = predictor.endpoint_name
print("endpoint:", endpoint_name)

# --- Load test set (label + features) from S3 ---
obj = s3.get_object(Bucket=bucket, Key=f"{prefix}test.csv")
test_df = pd.read_csv(io.BytesIO(obj["Body"].read()), header=None)
y_true = test_df.iloc[:, 0].astype(int).values
X = test_df.iloc[:, 1:]

# --- Upload features-only file for batch transform ---
local_x = "/tmp/test_x.csv"
X.to_csv(local_x, header=False, index=False)
s3.upload_file(local_x, bucket, f"{prefix}test_x.csv")
test_x_s3 = f"s3://{bucket}/{prefix}test_x.csv"

# --- Configure and run Batch Transform ---
from sagemaker.transformer import Transformer

transformer = est.transformer(
    instance_count=1,
    instance_type="ml.m5.large",
    output_path=f"s3://{bucket}/{prefix}batch_out/",
    assemble_with="Line",
    accept="application/jsonlines",
)
transformer.transform(data=test_x_s3, content_type="text/csv", split_type="Line")
transformer.wait()


In [None]:
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    roc_auc_score,
    confusion_matrix,
)

# --- Collect Batch Transform outputs from S3 ---
s3r = boto3.resource("s3")
bkt = s3r.Bucket(bucket)
outs = [o.key for o in bkt.objects.filter(Prefix=f"{prefix}batch_out/") if o.key.endswith(".out")]

# --- Parse predictions ---
y_pred, y_prob = [], []
for key in outs:
    body = bkt.Object(key).get()["Body"].read().decode("utf-8").strip()
    if not body:
        continue
    for line in body.splitlines():
        rec = json.loads(line)
        prob = rec.get("score", rec.get("scores", [None])[0])
        lab = rec.get("predicted_label", int(prob >= 0.5) if prob is not None else 0)
        y_prob.append(prob if prob is not None else float(lab))
        y_pred.append(int(lab))

y_prob = np.array(y_prob, dtype=float)
y_pred = np.array(y_pred, dtype=int)

assert len(y_true) == len(y_pred) == len(y_prob), "length mismatch"

# --- Metrics ---
acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred, zero_division=0)
rec = recall_score(y_true, y_pred, zero_division=0)
f1 = f1_score(y_true, y_pred, zero_division=0)
try:
    auc = roc_auc_score(y_true, y_prob)
except ValueError:
    auc = float("nan")
cm = confusion_matrix(y_true, y_pred)

# --- Output ---
print("Test metrics via (Batch Transform)")
print(f"Accuracy : {acc:.4f}")
print(f"Precision: {prec:.4f}")
print(f"Recall   : {rec:.4f}")
print(f"F1-score : {f1:.4f}")
print(f"ROC AUC  : {auc:.4f}")
print("Confusion matrix [[TN FP]\n [FN TP]]:")
print(cm)


# Step 3: Build and evaluate ensemble models

Write code to perform the following steps:
1. Split the data into training, validation, and testing sets (70% - 15% - 15%).
2. Use the XGBoost estimator to build a classification model.
3. Host the model on another instance.
4. Perform a batch transform to evaluate the model on the testing data.
5. Report the performance metrics that you think best assess the model's performance.
6. Write down your observations on the difference in performance between the simple and ensemble models.

Note: You are required to perform the above steps on the two combined datasets separately.
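
For the comparison in step 6, a small side-by-side table of the two models' test metrics can make the differences easier to describe. The sketch below assumes the metrics of each model have been collected into a dict keyed by metric name. `compare_models` and `format_comparison` are hypothetical helper names, and the numbers in the usage are placeholders, not measured results.

```python
def compare_models(baseline: dict, candidate: dict) -> list:
    """Return (metric, baseline, candidate, delta) rows for metrics present in both dicts."""
    rows = []
    for name in baseline:
        if name in candidate:
            rows.append((name, baseline[name], candidate[name], candidate[name] - baseline[name]))
    return rows


def format_comparison(rows: list, left: str = "linear", right: str = "xgboost") -> str:
    """Render the comparison rows as a fixed-width text table."""
    lines = [f"{'metric':<10} {left:>10} {right:>10} {'delta':>10}"]
    for name, a, b, delta in rows:
        lines.append(f"{name:<10} {a:>10.4f} {b:>10.4f} {delta:>+10.4f}")
    return "\n".join(lines)


# Placeholder values for illustration only (not results from this notebook)
linear_metrics = {"f1": 0.25, "auc": 0.5}
xgb_metrics = {"f1": 0.5, "auc": 0.75}
print(format_comparison(compare_models(linear_metrics, xgb_metrics)))
```

In the notebook, the dicts could be filled from the `acc`, `prec`, `rec`, `f1`, and `auc` variables that each evaluation cell computes, once per model and once per combined dataset.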

In [4]:
# Retrieve the XGBoost container image
container = image_uris.retrieve(
    framework="xgboost",
    region=region,
    version="1.5-1",
)

# Define the estimator
xgb_est = sagemaker.estimator.Estimator(
    image_uri=container,
    role=role,
    instance_count=1,
    instance_type="ml.m5.large",
    output_path=f"s3://{bucket}/{prefix}xgb_output/",
    sagemaker_session=sess,
    max_run=1800,
)

# Hyperparameters for binary classification
xgb_est.set_hyperparameters(
    objective="binary:logistic",
    num_round=50,
    max_depth=4,
    eta=0.2,
    subsample=0.8,
    eval_metric="logloss",
)

# Training and validation inputs
train_input = TrainingInput(f"s3://{bucket}/{prefix}train.csv", content_type="text/csv")
val_input   = TrainingInput(f"s3://{bucket}/{prefix}val.csv",   content_type="text/csv")

# Launch training
xgb_est.fit({"train": train_input, "validation": val_input})


In [5]:
# --- Deploy XGBoost real-time endpoint ---
xgb_predictor = xgb_est.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large",
)
endpoint_name = xgb_predictor.endpoint_name
print("Deployed XGBoost endpoint:", endpoint_name)

# --- Configure Batch Transform for XGBoost ---
from sagemaker.transformer import Transformer

transformer = xgb_est.transformer(
    instance_count=1,
    instance_type="ml.m5.large",
    output_path=f"s3://{bucket}/{prefix}xgb_batch_out/",
    assemble_with="Line",
    accept="application/jsonlines",
)

# --- Run Batch Transform on test features (CSV without header) ---
test_x_s3 = f"s3://{bucket}/{prefix}test_x.csv"
transformer.transform(data=test_x_s3, content_type="text/csv", split_type="Line")
transformer.wait()


In [None]:
import re
import json
import io
import boto3
import pandas as pd
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    roc_auc_score,
    confusion_matrix,
)

# --- S3 resources ---
s3r = boto3.resource("s3")
bkt = s3r.Bucket(bucket)
s3 = boto3.client("s3")

# Collect all .out shards from XGBoost Batch Transform (sorted for determinism)
outs = sorted(
    [
        o.key
        for o in bkt.objects.filter(Prefix=f"{prefix}xgb_batch_out/")
        if o.key.endswith(".out")
    ]
)

def parse_prob(line: str):
    """
    Extract a float probability from a batch-transform output line.
    Handles JSON (scalar/dict/list), CSV, or whitespace-separated formats.
    Returns None if no numeric value is found.
    """
    line = line.strip()
    if not line:
        return None

    # 1) Try JSON first
    try:
        obj = json.loads(line)

        # JSON scalar
        if isinstance(obj, (int, float)):
            return float(obj)

        # JSON dict
        if isinstance(obj, dict):
            # common keys for probability
            for k in ("score", "probability", "prediction", "predicted_value"):
                v = obj.get(k)
                if isinstance(v, (int, float)):
                    return float(v)
            # list-bearing keys
            for k in ("scores", "predictions"):
                v = obj.get(k)
                if isinstance(v, list) and v and isinstance(v[0], (int, float)):
                    return float(v[0])
            # generic: first numeric value
            for v in obj.values():
                if isinstance(v, (int, float)):
                    return float(v)

        # JSON list
        if isinstance(obj, list):
            for v in obj:
                if isinstance(v, (int, float)):
                    return float(v)

    except json.JSONDecodeError:
        pass

    # 2) Try CSV / whitespace
    for token in filter(None, re.split(r"[,\s]+", line)):
        try:
            return float(token)
        except ValueError:
            continue

    return None

# --- Parse predictions from shards ---
y_pred, y_prob = [], []
for key in outs:
    body = bkt.Object(key).get()["Body"].read().decode("utf-8")
    for line in body.splitlines():
        prob = parse_prob(line)
        if prob is None:  # skip empty/garbage lines safely
            continue
        y_prob.append(prob)
        y_pred.append(int(prob >= 0.5))

y_prob = pd.Series(y_prob, dtype="float64").to_numpy()
y_pred = pd.Series(y_pred, dtype="int64").to_numpy()

# --- Load ground-truth labels (first column is label) ---
obj = s3.get_object(Bucket=bucket, Key=f"{prefix}test.csv")
test_df = pd.read_csv(io.BytesIO(obj["Body"].read()), header=None)
y_true = test_df.iloc[:, 0].astype(int).to_numpy()

# --- Sanity check: length match ---
if len(y_true) != len(y_pred):
    print(
        f"warning: length mismatch — labels={len(y_true)}, preds={len(y_pred)}. "
        f"Did batch split your test into multiple shards or include a header?"
    )
    n = min(len(y_true), len(y_pred))
    y_true, y_pred, y_prob = y_true[:n], y_pred[:n], y_prob[:n]

# --- Metrics ---
acc  = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred, zero_division=0)
rec  = recall_score(y_true, y_pred, zero_division=0)
f1   = f1_score(y_true, y_pred, zero_division=0)
try:
    auc = roc_auc_score(y_true, y_prob)
except ValueError:
    auc = float("nan")
cm = confusion_matrix(y_true, y_pred)

# --- Output ---
print("Test metrics via (XGBoost Batch Transform)")
print(f"Accuracy : {acc:.4f}")
print(f"Precision: {prec:.4f}")
print(f"Recall   : {rec:.4f}")
print(f"F1-score : {f1:.4f}")
print(f"ROC AUC  : {auc:.4f}")
print("Confusion matrix [[TN FP]\n [FN TP]]:")
print(cm)
