# Problem: Predicting Airplane Delays

The goals of this notebook are:
- Process and create a dataset from downloaded ZIP files
- Perform exploratory data analysis (EDA)
- Establish a baseline model and improve it

## Introduction to business scenario
You work for a travel booking website that wants to improve the customer experience for delayed flights. The company wants to create a feature that tells customers, when they book a flight to or from one of the busiest airports for domestic travel in the US, whether the flight will be delayed due to weather.

Your task is to solve part of this problem by using machine learning to identify whether a flight will be delayed due to weather. You have been given access to a dataset of on-time performance for domestic flights operated by large air carriers. You can use this data to train a machine learning model that predicts whether a flight at the busiest airports will be delayed.

### Dataset
The provided dataset contains scheduled and actual departure and arrival times reported by certified US air carriers that account for at least 1 percent of domestic scheduled passenger revenues. The data was collected by the Office of Airline Information, Bureau of Transportation Statistics (BTS). The dataset contains the date, time, origin, destination, airline, distance, and delay status for flights between 2014 and 2018.
The data are in 60 compressed files, where each file contains a CSV with the flight details for one month of the five years (2014 - 2018). The data can be downloaded from this [link](https://ucstaff-my.sharepoint.com/:f:/g/personal/ibrahim_radwan_canberra_edu_au/EhWeqeQsh-9Mr1fneZc9_0sBOBzEdXngvxFJtAlIa-eAgA?e=8ukWwa). Please download the data files and place them on a relative path. The dataset(s) used in this assignment were compiled by the Office of Airline Information, Bureau of Transportation Statistics (BTS), Airline On-Time Performance Data, available at the following [link](https://www.transtats.bts.gov/Fields.asp?gnoyr_VQ=FGJ).

# Step 1: Prepare the environment 

Use one of the labs we have practised on with Amazon SageMaker and perform the following steps:
1. Start a lab.
2. Create a notebook instance and name it "oncloudproject".
3. Increase the used memory to 25 GB from the additional configurations.
4. Open Jupyter Lab and upload this notebook into it.
5. Upload the two combined CSV files (combined_csv_v1.csv and combined_csv_v2.csv), which you created in Part A of this project.

**Note:** If the data is too large to upload to AWS, use only 20% of the data for this task.

# Step 2: Build and evaluate simple models

Write code to perform the following steps:
1. Split data into training, validation and testing sets (70% - 15% - 15%).
2. Use the linear learner estimator to build a classification model.
3. Host the model on another instance.
4. Perform a batch transform to evaluate the model on the testing data.
5. Report the performance metrics that you consider the best test of the model's performance.

Note: You are required to perform the above steps on the two combined datasets separately and to comment on the difference.

In [1]:
import os, io, json, time, tarfile, tempfile, shutil, warnings
import re
import pathlib
from pathlib import Path
warnings.filterwarnings("ignore")

import boto3
import numpy as np
import pandas as pd
from botocore.exceptions import ClientError
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, average_precision_score
)

import sagemaker
from sagemaker import image_uris, Session
from sagemaker.local import LocalSession
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
from sagemaker.amazon.linear_learner import LinearLearner
from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import JSONDeserializer
from sagemaker.s3 import S3Uploader, S3Downloader

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


In [2]:
sess = sagemaker.Session()
region = boto3.Session().region_name

In [3]:
CSV_PATH = "combined_csv_20.csv"
RANDOM_STATE = 42

In [4]:
df = pd.read_csv(CSV_PATH)

In [5]:
# Show number of NaNs in each column
nan_counts = df.isna().sum().sort_values(ascending=False)
print(nan_counts)

# show columns that actually have NaNs
nan_counts = nan_counts[nan_counts > 0]
print("\nColumns with missing values only:")
print(nan_counts)
total_missing = df.isna().sum().sum()

#Dropping NA
df.dropna(inplace=True)

# total missing
total_missing = df.isna().sum().sum()
print("\nAfter dropping NA missing values in dataset:", total_missing)

target           0
DayofMonth_24    0
DayOfWeek_3      0
DayOfWeek_2      0
DayofMonth_31    0
                ..
Month_5          0
Month_4          0
Month_3          0
Month_2          0
is_holiday_1     0
Length: 94, dtype: int64

Columns with missing values only:
Series([], dtype: int64)

After dropping NA missing values in dataset: 0


In [6]:
numeric_cols = df.select_dtypes(include="number").columns
non_numeric_cols = df.select_dtypes(exclude="number").columns

print("\nNumeric columns:", list(numeric_cols))
print("Non-numeric columns:", list(non_numeric_cols))


Numeric columns: ['target', 'Distance', 'DepHourofDay', 'AWND_O', 'AWND_O.1', 'PRCP_O', 'PRCP_O.1', 'TAVG_O', 'TAVG_O.1', 'AWND_D', 'AWND_D.1', 'PRCP_D', 'PRCP_D.1', 'TAVG_D', 'TAVG_D.1', 'SNOW_O', 'SNOW_O.1', 'SNOW_D', 'SNOW_D.1']
Non-numeric columns: ['Year_2015', 'Year_2016', 'Year_2017', 'Year_2018', 'Quarter_2', 'Quarter_3', 'Quarter_4', 'Month_2', 'Month_3', 'Month_4', 'Month_5', 'Month_6', 'Month_7', 'Month_8', 'Month_9', 'Month_10', 'Month_11', 'Month_12', 'DayofMonth_2', 'DayofMonth_3', 'DayofMonth_4', 'DayofMonth_5', 'DayofMonth_6', 'DayofMonth_7', 'DayofMonth_8', 'DayofMonth_9', 'DayofMonth_10', 'DayofMonth_11', 'DayofMonth_12', 'DayofMonth_13', 'DayofMonth_14', 'DayofMonth_15', 'DayofMonth_16', 'DayofMonth_17', 'DayofMonth_18', 'DayofMonth_19', 'DayofMonth_20', 'DayofMonth_21', 'DayofMonth_22', 'DayofMonth_23', 'DayofMonth_24', 'DayofMonth_25', 'DayofMonth_26', 'DayofMonth_27', 'DayofMonth_28', 'DayofMonth_29', 'DayofMonth_30', 'DayofMonth_31', 'DayOfWeek_2', 'DayOfWeek_3'

In [7]:
# Choose the target column ("target" if present, otherwise "is_delay")
target_col = "target" if "target" in df.columns else ("is_delay" if "is_delay" in df.columns else None)

if target_col is not None and 2 <= df[target_col].nunique() <= 20:
    strat = df[target_col]
else:
    strat = None

#### Splitting 70-15-15

In [8]:
# 70/30 split
train_df, temp_df = train_test_split(
    df,
    test_size=0.30,
    random_state=RANDOM_STATE,
    stratify=strat
)

In [9]:
# Split the held-out 30% evenly into validation and test (15% each)
if target_col is not None and 2 <= temp_df[target_col].nunique() <= 20:
    strat_temp = temp_df[target_col]
else:
    strat_temp = None

val_df, test_df = train_test_split(
    temp_df,
    test_size=0.50,
    random_state=RANDOM_STATE,
    stratify=strat_temp
)
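
The two-stage split gives 70/15/15 because the second call halves the 30% hold-out (0.30 × 0.50 = 0.15). The sketch below reproduces the row counts that the split summary in this notebook reports. It assumes that, when only `test_size` is given, the second part is rounded up and the first part receives the remaining rows; `split_sizes` is a hypothetical helper, not part of scikit-learn.

```python
import math

def split_sizes(n_rows, test_size):
    """Return (n_first, n_second) for a fractional test_size.

    Assumption: the second part is rounded up and the first part
    receives whatever rows remain.
    """
    n_second = math.ceil(n_rows * test_size)
    return n_rows - n_second, n_second

# Total rows = sum of the train, val and test sizes reported by this notebook.
total_rows = 146996 + 31499 + 31500

n_train, n_temp = split_sizes(total_rows, 0.30)  # 70% / 30%
n_val, n_test = split_sizes(n_temp, 0.50)        # 15% / 15%

for name, n in [("train", n_train), ("val", n_val), ("test", n_test)]:
    print(f"{name}: {n} rows ({n / total_rows:.2%})")
```

The one-row difference between validation and test comes from rounding an odd-sized hold-out.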

In [10]:
# Save the splits locally so they are easy to upload for training
os.makedirs("splits", exist_ok=True)
train_df.to_csv("splits/train.csv", index=False)
val_df.to_csv("splits/val.csv", index=False)
test_df.to_csv("splits/test.csv", index=False)

In [11]:
# report
def describe(name, d, tgt):
    print(f"{name}: {len(d)} rows, {d.shape[1]} cols")
    if tgt and tgt in d.columns:
        vc = d[tgt].value_counts(dropna=False).sort_index()
        pct = (vc/len(d)*100).round(2)
        print(" class distribution ->", ", ".join([f"{k}: {vc[k]} ({pct[k]}%)" for k in vc.index]))
    print()

target_col = "target"

print("saved splits to ./splits/")
describe("train 70%", train_df, target_col)
describe("val 15%", val_df,   target_col)
describe("test 15%", test_df,  target_col)

saved splits to ./splits/
train 70%: 146996 rows, 94 cols
 class distribution -> 0: 109906 (74.77%), 1: 37090 (25.23%)

val 15%: 31499 rows, 94 cols
 class distribution -> 0: 23551 (74.77%), 1: 7948 (25.23%)

test 15%: 31500 rows, 94 cols
 class distribution -> 0: 23552 (74.77%), 1: 7948 (25.23%)
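
Because about three quarters of the flights fall in class 0, accuracy alone can look good without the model learning much. As a reference point, the sketch below computes the accuracy of a trivial classifier that always predicts the majority class, using the test-split counts printed above. Any trained model should be compared against this number, not against 50%.

```python
# Class counts copied from the test-split summary above.
test_counts = {0: 23552, 1: 7948}

n_total = sum(test_counts.values())
majority_class = max(test_counts, key=test_counts.get)

# Accuracy obtained by predicting the majority class for every row.
baseline_accuracy = test_counts[majority_class] / n_total

print(f"majority class   : {majority_class}")
print(f"baseline accuracy: {baseline_accuracy:.4f}")
```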



In [12]:
import os
os.makedirs("splits", exist_ok=True)

def put_label_first(df, label="target"):
    if label in df.columns:
        cols = [label] + [c for c in df.columns if c != label]
        return df[cols]
    return df

train_df = put_label_first(train_df, target_col)
val_df   = put_label_first(val_df,   target_col)
test_df  = put_label_first(test_df,  target_col)

train_df.to_csv("splits/train.csv", index=False, header=True) 
val_df.to_csv(  "splits/val.csv",   index=False, header=True)
test_df.to_csv( "splits/test.csv",  index=False, header=True)

### Uploading the splits to Amazon S3 (Simple Storage Service)

In [13]:
paths = {
    "train": "splits/train.csv",
    "val":   "splits/val.csv",
    "test":  "splits/test.csv",
}

bucket='c182567a4701745l12017053t1w416916046524-labbucket-fd5b733ssgft'
prefix = "linear-learner-delay"

In [14]:
# Creating an S3 client
s3 = boto3.client("s3")

def upload_and_verify(local_path, bucket, key):
    s3.upload_file(local_path, bucket, key)
    try:
        s3.head_object(Bucket=bucket, Key=key)
        print(f"uploaded → s3://{bucket}/{key}")
    except ClientError as e:
        print(f"could not verify {key}: {e}")

# Upload each split
for _, local_path in paths.items():
    s3_key = f"{prefix}{pathlib.Path(local_path).name}"
    upload_and_verify(local_path, bucket, s3_key)

print("\nS3 URIs for dataset:")
print(f"s3://{bucket}/{prefix}train.csv")
print(f"s3://{bucket}/{prefix}val.csv")
print(f"s3://{bucket}/{prefix}test.csv")

uploaded → s3://c182567a4701745l12017053t1w416916046524-labbucket-fd5b733ssgft/linear-learner-delaytrain.csv
uploaded → s3://c182567a4701745l12017053t1w416916046524-labbucket-fd5b733ssgft/linear-learner-delayval.csv
uploaded → s3://c182567a4701745l12017053t1w416916046524-labbucket-fd5b733ssgft/linear-learner-delaytest.csv

S3 URIs for dataset:
s3://c182567a4701745l12017053t1w416916046524-labbucket-fd5b733ssgft/linear-learner-delaytrain.csv
s3://c182567a4701745l12017053t1w416916046524-labbucket-fd5b733ssgft/linear-learner-delayval.csv
s3://c182567a4701745l12017053t1w416916046524-labbucket-fd5b733ssgft/linear-learner-delaytest.csv
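
The object keys above read `linear-learner-delaytrain.csv` because the prefix and the file name are concatenated without a separator. The sketch below contrasts that with a `/`-joined key, which would make the prefix behave like a folder in the S3 console. The joined variant is only an illustration and is not what this notebook uses; either layout works as long as every later cell builds its keys the same way.

```python
import posixpath

prefix = "linear-learner-delay"
filename = "train.csv"

# Plain concatenation, as in the upload loop: no separator is inserted.
flat_key = f"{prefix}{filename}"

# Hypothetical alternative: join with "/" so the prefix acts like a folder.
nested_key = posixpath.join(prefix, filename)

print(flat_key)
print(nested_key)
```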


In [15]:
from sagemaker import get_execution_role

sess = sagemaker.Session()
region = sess.boto_region_name
role = sagemaker.get_execution_role() 

#### Building a Model

In [16]:
# training data channels
train_s3 = f"s3://{bucket}/{prefix}train.csv"
val_s3   = f"s3://{bucket}/{prefix}val.csv"
test_s3  = f"s3://{bucket}/{prefix}test.csv"

In [17]:
train_input = TrainingInput(train_s3, content_type="text/csv")
val_input   = TrainingInput(val_s3,   content_type="text/csv")

print("region:", region)
print("role:", role)
print(train_s3, val_s3, test_s3, sep="\n")

region: us-east-1
role: arn:aws:iam::416916046524:role/c182567a4701745l12017053t1w4-SageMakerExecutionRole-ndTKtIN19IYB
s3://c182567a4701745l12017053t1w416916046524-labbucket-fd5b733ssgft/linear-learner-delaytrain.csv
s3://c182567a4701745l12017053t1w416916046524-labbucket-fd5b733ssgft/linear-learner-delayval.csv
s3://c182567a4701745l12017053t1w416916046524-labbucket-fd5b733ssgft/linear-learner-delaytest.csv


#### Sanitizing the S3 CSVs (fix for a previous error)

In [19]:
s3 = boto3.client("s3")

def load_csv_with_header(key):
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    return pd.read_csv(io.BytesIO(body))

# Drop the duplicate columns whose names end in ".1"
def drop_dot1(df):
    return df[[c for c in df.columns if not c.endswith(".1")]]

In [20]:
# Convert boolean columns and "true"/"false" string columns to 0/1 integers
def bool_to_int(df):
    for c in df.columns:
        if df[c].dtype == bool: df[c] = df[c].astype(int)
        elif df[c].dtype == object:
            v = set(df[c].dropna().astype(str).str.lower().unique())
            if v <= {"true","false"}: df[c] = df[c].astype(str).str.lower().map({"true":1,"false":0})
    return df

def to_numeric(df): return df.apply(lambda col: pd.to_numeric(col, errors="coerce"))

In [21]:
# Move the label column to the first position
def put_label_first(df, label="target"):
    if label in df.columns:
        cols = [label] + [c for c in df.columns if c != label]
        return df[cols]
    return df

In [22]:
# Sanitize each split and upload it back without a header row
def sanitize_one(name):
    key = f"{prefix}{name}.csv"
    df = load_csv_with_header(key)
    df = drop_dot1(df)
    df = bool_to_int(df)
    df = to_numeric(df).fillna(0.0)
    df = put_label_first(df, "target")

    # force label to 0/1
    y = df.iloc[:,0].values
    uy = np.unique(y[~pd.isna(y)])
    if not set(uy) <= {0,1}:
        if len(uy)==2:
            lo, hi = sorted(uy); df.iloc[:,0] = (df.iloc[:,0]==hi).astype(int)
        else:
            thr = np.nanmedian(y); df.iloc[:,0] = (df.iloc[:,0]>thr).astype(int)

    # upload back WITHOUT header
    local = f"/tmp/{name}.csv"
    df.to_csv(local, index=False, header=False)
    s3.upload_file(local, bucket, key)
    print(f"{name}: {df.shape}, pos_rate={df.iloc[:,0].mean():.3f} → s3://{bucket}/{key}")

for split in ["train","val","test"]:
    sanitize_one(split)

train: (146996, 86), pos_rate=0.252 → s3://c182567a4701745l12017053t1w416916046524-labbucket-fd5b733ssgft/linear-learner-delaytrain.csv
val: (31499, 86), pos_rate=0.252 → s3://c182567a4701745l12017053t1w416916046524-labbucket-fd5b733ssgft/linear-learner-delayval.csv
test: (31500, 86), pos_rate=0.252 → s3://c182567a4701745l12017053t1w416916046524-labbucket-fd5b733ssgft/linear-learner-delaytest.csv
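
The shapes above show 86 columns, down from the 94 reported in the split summary. The sketch below checks that this matches the number of `.1` columns in the numeric-column listing printed earlier. These names look like the suffix pandas adds when a CSV header repeats a column name, which is an assumption about how the combined file was built. The sketch also assumes that none of the one-hot columns ends in `.1`.

```python
# Numeric column names copied from the dtype listing earlier in the notebook.
numeric_cols = [
    "target", "Distance", "DepHourofDay",
    "AWND_O", "AWND_O.1", "PRCP_O", "PRCP_O.1", "TAVG_O", "TAVG_O.1",
    "AWND_D", "AWND_D.1", "PRCP_D", "PRCP_D.1", "TAVG_D", "TAVG_D.1",
    "SNOW_O", "SNOW_O.1", "SNOW_D", "SNOW_D.1",
]

# Same rule as drop_dot1: a column is dropped when its name ends in ".1".
dropped = [c for c in numeric_cols if c.endswith(".1")]

original_width = 94  # columns reported before sanitizing
print("dropped:", dropped)
print("remaining columns:", original_width - len(dropped))
```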


#### Applying the SageMaker Estimator

In [23]:
container = image_uris.retrieve(framework="linear-learner", region=region)

est = sagemaker.estimator.Estimator(
    image_uri=container,
    role=role,
    instance_count=1,
    instance_type="ml.m5.large",
    output_path=f"s3://{bucket}/{prefix}output/",
    sagemaker_session=sess,
    max_run=3600,
)

In [24]:
est.set_hyperparameters(
    predictor_type="binary_classifier",
    epochs=10,
    mini_batch_size=256,
    num_models=32,
    loss="auto"
)

In [25]:
train_input = TrainingInput(f"s3://{bucket}/{prefix}train.csv", content_type="text/csv")
val_input   = TrainingInput(f"s3://{bucket}/{prefix}val.csv",   content_type="text/csv")
est.fit({"train": train_input, "validation": val_input}, logs=False)

INFO:sagemaker:Creating training-job with name: linear-learner-2025-10-27-09-30-36-524



2025-10-27 09:30:36 Starting - Starting the training job...........
2025-10-27 09:31:37 Starting - Preparing the instances for training....
2025-10-27 09:32:03 Downloading - Downloading input data........
2025-10-27 09:32:48 Downloading - Downloading the training image...............
2025-10-27 09:34:09 Training - Training image download completed. Training in progress..................................................
2025-10-27 09:38:26 Uploading - Uploading generated training model...
2025-10-27 09:38:43 Completed - Training job completed


#### Deploying a model

In [26]:
predictor = est.deploy(initial_instance_count=1, instance_type="ml.m5.xlarge")
endpoint_name = predictor.endpoint_name
print("endpoint:", endpoint_name)

INFO:sagemaker:Creating model with name: linear-learner-2025-10-27-09-39-26-418
INFO:sagemaker:Creating endpoint-config with name linear-learner-2025-10-27-09-39-26-418
INFO:sagemaker:Creating endpoint with name linear-learner-2025-10-27-09-39-26-418


-------!endpoint: linear-learner-2025-10-27-09-39-26-418


#### Batch Transform

In [27]:
obj = s3.get_object(Bucket=bucket, Key=f"{prefix}test.csv")
test_df = pd.read_csv(io.BytesIO(obj["Body"].read()), header=None)
y_true = test_df.iloc[:,0].astype(int).values
X      = test_df.iloc[:,1:]

local_x = "/tmp/test_x.csv"
X.to_csv(local_x, header=False, index=False)
s3.upload_file(local_x, bucket, f"{prefix}test_x.csv")
test_x_s3 = f"s3://{bucket}/{prefix}test_x.csv"

from sagemaker.transformer import Transformer
transformer = est.transformer(
    instance_count=1,
    instance_type="ml.m5.large",
    output_path=f"s3://{bucket}/{prefix}batch_out/",
    assemble_with="Line",
    accept="application/jsonlines",
)
transformer.transform(data=test_x_s3, content_type="text/csv", split_type="Line")
transformer.wait()

INFO:sagemaker:Creating model with name: linear-learner-2025-10-27-09-43-29-405
INFO:sagemaker:Creating transform job with name: linear-learner-2025-10-27-09-43-29-997


.......................................
...
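
The transform job above requests `application/jsonlines`, so each output file should contain one JSON object per input row. The sketch below shows how such lines can be turned into labels and scores. It assumes records with a `score` key and an optional `predicted_label` key, which is the layout the metrics cell reads; the sample lines are made up for illustration and are not taken from the job output.

```python
import json

# Illustrative lines only; these scores are invented, not real job output.
sample_output = """\
{"predicted_label": 0, "score": 0.12}
{"predicted_label": 1, "score": 0.81}
{"score": 0.40}
"""

def parse_line(line, threshold=0.5):
    """Return (label, score) from one JSON Lines record.

    Falls back to thresholding the score when no label is present.
    """
    record = json.loads(line)
    score = float(record["score"])
    label = int(record.get("predicted_label", score >= threshold))
    return label, score

parsed = [parse_line(line) for line in sample_output.splitlines() if line.strip()]
print(parsed)
```

Since the output rows are matched to the labels by position, the number of parsed records has to equal the number of test rows.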

#### Performance Metrics

In [28]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix

s3r = boto3.resource("s3")
bkt = s3r.Bucket(bucket)
# Sort the output shards so predictions are read in a deterministic order
outs = sorted(o.key for o in bkt.objects.filter(Prefix=f"{prefix}batch_out/") if o.key.endswith(".out"))

y_pred, y_prob = [], []
for key in outs:
    for line in bkt.Object(key).get()["Body"].read().decode("utf-8").strip().splitlines():
        rec = json.loads(line)
        prob = rec.get("score", rec.get("scores", [None])[0])
        lab  = rec.get("predicted_label", int(prob >= 0.5) if prob is not None else 0)
        y_prob.append(prob if prob is not None else float(lab))
        y_pred.append(int(lab))

y_prob = np.array(y_prob, float); y_pred = np.array(y_pred, int)
assert len(y_true)==len(y_pred)==len(y_prob), "length mismatch"

acc  = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred, zero_division=0)
rec  = recall_score(y_true, y_pred, zero_division=0)
f1   = f1_score(y_true, y_pred, zero_division=0)
try: auc = roc_auc_score(y_true, y_prob)
except ValueError: auc = float("nan")
cm = confusion_matrix(y_true, y_pred)

print("Test metrics via (Batch Transform)")
print(f"Accuracy : {acc:.4f}")
print(f"Precision: {prec:.4f}")
print(f"Recall   : {rec:.4f}")
print(f"F1-score : {f1:.4f}")
print(f"ROC AUC  : {auc:.4f}")
print("Confusion matrix [[TN FP]\n [FN TP]]:")
print(cm)


Test metrics via (Batch Transform)
Accuracy : 0.7595
Precision: 0.6086
Recall   : 0.1311
F1-score : 0.2157
ROC AUC  : 0.6892
Confusion matrix [[TN FP]
 [FN TP]]:
[[22882   670]
 [ 6906  1042]]
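
Every headline metric above can be derived from the four cells of the confusion matrix, which is a useful cross-check on the reported numbers. A minimal sketch, with `metrics_from_confusion` as a hypothetical helper:

```python
def metrics_from_confusion(tn, fp, fn, tp):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Counts copied from the confusion matrix printed above.
m = metrics_from_confusion(tn=22882, fp=670, fn=6906, tp=1042)
for name, value in m.items():
    print(f"{name:<9}: {value:.4f}")
```

The recall line is the one to watch: 1,042 of the 7,948 delayed flights in the test set are detected, while 6,906 are missed at the default 0.5 threshold.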


# Step 3: Build and evaluate ensemble models

Write code to perform the following steps:
1. Split data into training, validation and testing sets (70% - 15% - 15%).
2. Use the XGBoost estimator to build a classification model.
3. Host the model on another instance.
4. Perform a batch transform to evaluate the model on the testing data.
5. Report the performance metrics that you consider the best test of the model's performance.
6. Write down your observations on the difference in performance between the simple and ensemble models.

Note: You are required to perform the above steps on the two combined datasets separately.

### Building Model

In [69]:
# Retrieve the XGBoost container image (version 1.5-1)
container = image_uris.retrieve(framework="xgboost", region=region, version="1.5-1")

# Define the estimator
xgb_est = sagemaker.estimator.Estimator(
    image_uri=container,
    role=role,
    instance_count=1,
    instance_type="ml.m5.large",
    output_path=f"s3://{bucket}/{prefix}xgb_output/",
    sagemaker_session=sess,
    max_run=1800
)

INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.


In [70]:
# Simple hyperparameters for binary classification
xgb_est.set_hyperparameters(
    objective="binary:logistic", 
    num_round=50,               
    max_depth=4,                 
    eta=0.2,                   
    subsample=0.8,
    eval_metric="logloss"
)
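
The hyperparameters above do not account for the roughly 3:1 class imbalance in the training split. One common option in XGBoost is `scale_pos_weight`, usually set near the ratio of negative to positive examples. The sketch below derives that ratio from the training counts reported earlier in this notebook. The weighted configuration is a hypothetical variant that was not run here, so its effect on the metrics is unknown.

```python
# Class counts copied from the training-split summary earlier in the notebook.
n_negative = 109906
n_positive = 37090

# Ratio of negatives to positives, a common starting value for this parameter.
scale_pos_weight = n_negative / n_positive

# Hypothetical variant of the hyperparameters above with the extra weight.
weighted_hyperparameters = {
    "objective": "binary:logistic",
    "num_round": 50,
    "max_depth": 4,
    "eta": 0.2,
    "subsample": 0.8,
    "eval_metric": "logloss",
    "scale_pos_weight": round(scale_pos_weight, 2),
}

print(weighted_hyperparameters["scale_pos_weight"])
```

To try it, the dictionary could be passed as `xgb_est.set_hyperparameters(**weighted_hyperparameters)` before calling `fit`. Weighting positives typically trades precision for recall, so the comparison should use the validation set.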

In [71]:
# Point to training and validation sets in S3
train_input = TrainingInput(f"s3://{bucket}/{prefix}train.csv", content_type="text/csv")
val_input   = TrainingInput(f"s3://{bucket}/{prefix}val.csv",   content_type="text/csv")

# Start training job
xgb_est.fit({"train": train_input, "validation": val_input})

INFO:sagemaker.telemetry.telemetry_logging:SageMaker Python SDK will collect telemetry to help us better understand our user's needs, diagnose issues, and deliver additional features.
To opt out of telemetry, please disable via TelemetryOptOut parameter in SDK defaults config. For more information, refer to https://sagemaker.readthedocs.io/en/stable/overview.html#configuring-and-using-defaults-with-the-sagemaker-python-sdk.
INFO:sagemaker:Creating training-job with name: sagemaker-xgboost-2025-10-27-10-19-12-482


2025-10-27 10:19:13 Starting - Starting the training job...
2025-10-27 10:19:27 Starting - Preparing the instances for training...
2025-10-27 10:19:51 Downloading - Downloading input data...
2025-10-27 10:20:41 Downloading - Downloading the training image......
2025-10-27 10:21:32 Training - Training image download completed. Training in progress....
2025-10-27 10:22:03 Uploading - Uploading generated training model...
2025-10-27 10:22:21 Completed - Training job completed
..Training seconds: 150
Billable seconds: 150


#### Deploying Model

In [72]:
xgb_predictor = xgb_est.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large"
)
endpoint_name = xgb_predictor.endpoint_name
print("Deployed XGBoost endpoint:", endpoint_name)

INFO:sagemaker:Creating model with name: sagemaker-xgboost-2025-10-27-10-23-05-122
INFO:sagemaker:Creating endpoint-config with name sagemaker-xgboost-2025-10-27-10-23-05-122
INFO:sagemaker:Creating endpoint with name sagemaker-xgboost-2025-10-27-10-23-05-122


-------!Deployed XGBoost endpoint: sagemaker-xgboost-2025-10-27-10-23-05-122


#### Batch Transform on Test Data

In [73]:
transformer = xgb_est.transformer(
    instance_count=1,
    instance_type="ml.m5.large",
    output_path=f"s3://{bucket}/{prefix}xgb_batch_out/",
    assemble_with="Line",
    accept="application/jsonlines",
)

test_x_s3 = f"s3://{bucket}/{prefix}test_x.csv"
transformer.transform(data=test_x_s3, content_type="text/csv", split_type="Line")
transformer.wait()

INFO:sagemaker:Creating model with name: sagemaker-xgboost-2025-10-27-10-27-07-499
INFO:sagemaker:Creating transform job with name: sagemaker-xgboost-2025-10-27-10-27-08-074


..............................
...

#### Evaluate Predictions

In [75]:
import re, json, io, boto3, pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix

s3r = boto3.resource("s3")
bkt = s3r.Bucket(bucket)

# collect all .out shards (sort for deterministic order)
outs = sorted([o.key for o in bkt.objects.filter(Prefix=f"{prefix}xgb_batch_out/") if o.key.endswith(".out")])

def parse_prob(line: str):
    """Return a float probability from a batch transform output line.
    Handles: plain number, CSV, JSON dict/list."""
    line = line.strip()
    if not line:
        return None
    # 1) try JSON
    try:
        obj = json.loads(line)
        if isinstance(obj, (int, float)):
            return float(obj)
        if isinstance(obj, dict):
            # common keys
            for k in ("score", "probability", "prediction", "predicted_value"):
                if k in obj and isinstance(obj[k], (int, float)):
                    return float(obj[k])
            # some models put a list under 'scores' or 'predictions'
            for k in ("scores", "predictions"):
                v = obj.get(k)
                if isinstance(v, list) and len(v) and isinstance(v[0], (int, float)):
                    return float(v[0])
            # fallback: first numeric value found
            for v in obj.values():
                if isinstance(v, (int, float)):
                    return float(v)
        if isinstance(obj, list):
            # first numeric entry
            for v in obj:
                if isinstance(v, (int, float)):
                    return float(v)
    except json.JSONDecodeError:
        pass

    # 2) try CSV / whitespace-separated
    parts = [p for p in re.split(r"[,\s]+", line) if p]
    for p in parts:
        try:
            return float(p)
        except ValueError:
            continue
    return None

y_pred, y_prob = [], []
for key in outs:
    body = bkt.Object(key).get()["Body"].read().decode("utf-8")
    for line in body.splitlines():
        prob = parse_prob(line)
        if prob is None:  # skip empty/garbage lines safely
            continue
        y_prob.append(prob)
        y_pred.append(int(prob >= 0.5))

# load ground-truth labels (first column is label)
s3 = boto3.client("s3")
obj = s3.get_object(Bucket=bucket, Key=f"{prefix}test.csv")
test_df = pd.read_csv(io.BytesIO(obj["Body"].read()), header=None)
y_true = test_df.iloc[:, 0].astype(int).values

# sanity check: length match
if len(y_true) != len(y_pred):
    print(f"warning: length mismatch — labels={len(y_true)}, preds={len(y_pred)}. "
          f"Did batch split your test into multiple shards or include a header?")
    # you can bail out or truncate to min length:
    n = min(len(y_true), len(y_pred))
    y_true, y_pred, y_prob = y_true[:n], y_pred[:n], y_prob[:n]

# metrics
acc  = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred, zero_division=0)
rec  = recall_score(y_true, y_pred, zero_division=0)
f1   = f1_score(y_true, y_pred, zero_division=0)
try:
    auc = roc_auc_score(y_true, y_prob)
except ValueError:
    auc = float("nan")
cm = confusion_matrix(y_true, y_pred)

print("Test metrics via (XGBoost Batch Transform)")
print(f"Accuracy : {acc:.4f}")
print(f"Precision: {prec:.4f}")
print(f"Recall   : {rec:.4f}")
print(f"F1-score : {f1:.4f}")
print(f"ROC AUC  : {auc:.4f}")
print("Confusion matrix [[TN FP]\n [FN TP]]:")
print(cm)


Test metrics via (XGBoost Batch Transform)
Accuracy : 0.7764
Precision: 0.6887
Recall   : 0.2080
F1-score : 0.3195
ROC AUC  : 0.7407
Confusion matrix [[TN FP]
 [FN TP]]:
[[22805   747]
 [ 6295  1653]]


In [76]:
s3r = boto3.resource("s3")
bkt = s3r.Bucket(bucket)

outs = [o.key for o in bkt.objects.filter(Prefix=f"{prefix}xgb_batch_out/") if o.key.endswith(".out")]

y_pred, y_prob = [], []

for key in outs:
    body = bkt.Object(key).get()["Body"].read().decode("utf-8").strip().splitlines()
    for line in body:
        try:
            rec = json.loads(line)
            if isinstance(rec, dict):
                prob = rec.get("score", rec.get("scores", [None])[0])
            else:
                prob = float(rec)
        except Exception:
            # Not JSON: fall back to a plain numeric line, skip anything else
            try:
                prob = float(line.strip())
            except ValueError:
                continue
        if prob is None:
            continue
        y_prob.append(prob)
        y_pred.append(int(prob >= 0.5))

In [77]:
s3 = boto3.client("s3")
obj = s3.get_object(Bucket=bucket, Key=f"{prefix}test.csv")
test_df = pd.read_csv(io.BytesIO(obj["Body"].read()), header=None)
y_true = test_df.iloc[:, 0].astype(int).values

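# Metrics (auc and cm are recomputed here so the printout does not reuse values from an earlier cell)
acc  = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred, zero_division=0)
rec  = recall_score(y_true, y_pred, zero_division=0)
f1   = f1_score(y_true, y_pred, zero_division=0)
try:
    auc = roc_auc_score(y_true, y_prob)
except ValueError:
    auc = float("nan")
cm = confusion_matrix(y_true, y_pred)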

print("Test metrics via (XGBoost Batch Transform)")
print(f"Accuracy : {acc:.4f}")
print(f"Precision: {prec:.4f}")
print(f"Recall   : {rec:.4f}")
print(f"F1-score : {f1:.4f}")
print(f"ROC AUC  : {auc:.4f}")
print("Confusion matrix [[TN FP]\n [FN TP]]:")
print(cm)

Test metrics via (XGBoost Batch Transform)
Accuracy : 0.7764
Precision: 0.6887
Recall   : 0.2080
F1-score : 0.3195
ROC AUC  : 0.7407
Confusion matrix [[TN FP]
 [FN TP]]:
[[22805   747]
 [ 6295  1653]]
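
To support the comparison between the simple and ensemble models, the sketch below places the two sets of test metrics side by side. The values are copied from the evaluation outputs in this notebook; nothing is recomputed.

```python
# Test-set metrics copied from the two evaluation outputs in this notebook.
linear = {"accuracy": 0.7595, "precision": 0.6086, "recall": 0.1311,
          "f1": 0.2157, "roc_auc": 0.6892}
xgboost = {"accuracy": 0.7764, "precision": 0.6887, "recall": 0.2080,
           "f1": 0.3195, "roc_auc": 0.7407}

print(f"{'metric':<10}{'linear':>9}{'xgboost':>9}{'delta':>9}")
for name in linear:
    delta = xgboost[name] - linear[name]
    print(f"{name:<10}{linear[name]:>9.4f}{xgboost[name]:>9.4f}{delta:>+9.4f}")
```

On this 20% sample, XGBoost scores higher on every metric, with the largest gains in recall and F1. Recall is still only about 0.21, so most delayed flights go undetected by either model at the 0.5 threshold.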
