# CSIRO Pasture Biomass Baseline

This notebook builds a reproducible baseline for the CSIRO biomass prediction challenge. It is designed to run end-to-end on Kaggle, assuming the competition data is available under `/kaggle/input/csiro-biomass/`.



## Notebook Roadmap

1. Imports and configuration
2. Load the tabular data and inspect it briefly
3. Feature engineering helpers
4. Preprocessing and model pipeline
5. 5-fold cross-validation diagnostics
6. Fit the final model on full data
7. Generate the Kaggle submission file
8. (Optional) Check GPU availability
9. (Optional) Visualise a sample pasture image



In [None]:
import warnings
from pathlib import Path

import numpy as np
import pandas as pd
from PIL import Image
import matplotlib.pyplot as plt

from sklearn.base import clone
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import KFold
from sklearn.multioutput import MultiOutputRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

warnings.filterwarnings("ignore", category=UserWarning)
pd.set_option("display.max_columns", 50)

RANDOM_STATE = 42
DATA_DIR = Path("/kaggle/input/csiro-biomass")
TRAIN_CSV = DATA_DIR / "train.csv"
TEST_CSV = DATA_DIR / "test.csv"
SAMPLE_SUB_CSV = DATA_DIR / "sample_submission.csv"
TRAIN_IMAGE_SAMPLE = DATA_DIR / "train" / "ID1011485656.jpg"

if not TRAIN_CSV.exists():
    raise FileNotFoundError(
        "Expected competition data under /kaggle/input/csiro-biomass/. "
        "Please add the dataset to the notebook before running it."
    )



In [None]:
train_df = pd.read_csv(TRAIN_CSV)
test_df = pd.read_csv(TEST_CSV)
sample_submission = pd.read_csv(SAMPLE_SUB_CSV)

print(f"Train rows: {len(train_df):,}")
print(f"Test rows: {len(test_df):,}")
print(f"Sample submission rows: {len(sample_submission):,}")
train_df.head()


In [None]:
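# `datetime_is_numeric` was removed in pandas 2.0, where datetime columns are
# always summarised numerically, so the argument is omitted here.
train_summary = train_df.describe(include="all").transpose()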
train_summary


In [None]:
missing_ratio = train_df.isna().mean().sort_values(ascending=False)
missing_ratio[missing_ratio > 0]


## Feature Engineering

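The baseline uses only the tabular metadata. Each sample has five target measurements, which the code reads in long format (one row per `sample_id` and `target_name` pair). The helper functions that follow pivot the training targets into wide format, with one column per target, and derive simple calendar features (year, month, day of year) from `Sampling_Date`.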
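
As a minimal sketch of the two transformations described above, the toy example below pivots a long-format frame to wide format and derives calendar features from a date column. The sample ids and values are made up for illustration and are not taken from the competition data.

```python
import pandas as pd

# Toy long-format frame: one row per (sample_id, target_name) pair.
# All values are invented for illustration only.
toy_long = pd.DataFrame(
    {
        "sample_id": ["A", "A", "B", "B"],
        "Sampling_Date": ["2015-03-01", "2015-03-01", "2015-12-31", "2015-12-31"],
        "target_name": ["Dry_Green_g", "Dry_Dead_g", "Dry_Green_g", "Dry_Dead_g"],
        "target": [10.0, 2.5, 7.0, 4.0],
    }
)

# Long -> wide: one row per sample, one column per target.
toy_wide = toy_long.pivot(index="sample_id", columns="target_name", values="target")

# One metadata row per sample, then calendar features derived from the date.
toy_meta = toy_long.drop_duplicates(subset=["sample_id"]).set_index("sample_id")
toy_dates = pd.to_datetime(toy_meta["Sampling_Date"], errors="coerce")
toy_features = pd.DataFrame(
    {
        "sampling_year": toy_dates.dt.year,
        "sampling_month": toy_dates.dt.month,
        "sampling_dayofyear": toy_dates.dt.dayofyear,
    }
)

print(toy_wide)
print(toy_features)
```

Note that `DataFrame.pivot` raises a `ValueError` if any (index, column) pair appears more than once, which doubles as a cheap integrity check on the long-format input.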



In [None]:
TARGET_COLUMNS = [
    "Dry_Green_g",
    "Dry_Dead_g",
    "Dry_Clover_g",
    "GDM_g",
    "Dry_Total_g",
]
META_COLUMNS = [
    "sample_id",
    "image_path",
    "Sampling_Date",
    "State",
    "Species",
    "Pre_GSHH_NDVI",
    "Height_Ave_cm",
]

def prepare_metadata(df: pd.DataFrame) -> pd.DataFrame:
    """Return one metadata row per sample_id with calendar features added."""
    meta = (
        df[META_COLUMNS]
        .drop_duplicates(subset=["sample_id"])
        .set_index("sample_id")
    )
    meta["Sampling_Date"] = pd.to_datetime(meta["Sampling_Date"], errors="coerce")
    meta["sampling_year"] = meta["Sampling_Date"].dt.year
    meta["sampling_month"] = meta["Sampling_Date"].dt.month
    meta["sampling_dayofyear"] = meta["Sampling_Date"].dt.dayofyear
    meta = meta.drop(columns=["Sampling_Date", "image_path"], errors="ignore")
    return meta

def prepare_targets(df: pd.DataFrame) -> pd.DataFrame:
    """Pivot long-format targets to one row per sample_id, one column per target."""
    pivot = df.pivot(index="sample_id", columns="target_name", values="target")
    for target in TARGET_COLUMNS:
        if target not in pivot.columns:
            pivot[target] = np.nan
    return pivot[TARGET_COLUMNS]

train_meta = prepare_metadata(train_df)
train_targets = prepare_targets(train_df)

# Align features and targets on the same sample ids
common_ids = train_meta.index.intersection(train_targets.index)
X = train_meta.loc[common_ids].copy()
y = train_targets.loc[common_ids].copy()

print(f"Training samples: {len(X):,}")
X.head()


In [None]:
numeric_features = [
    "Pre_GSHH_NDVI",
    "Height_Ave_cm",
    "sampling_year",
    "sampling_month",
    "sampling_dayofyear",
]
categorical_features = ["State", "Species"]

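import inspect

# scikit-learn 1.2 renamed OneHotEncoder's `sparse` argument to `sparse_output`;
# pass whichever keyword the installed version accepts.
one_hot_kwargs = {"handle_unknown": "ignore"}
if "sparse_output" in inspect.signature(OneHotEncoder).parameters:
    one_hot_kwargs["sparse_output"] = False
else:
    one_hot_kwargs["sparse"] = False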

preprocessor = ColumnTransformer(
    transformers=[
        ("num", SimpleImputer(strategy="median"), numeric_features),
        (
            "cat",
            Pipeline(
                steps=[
                    ("imputer", SimpleImputer(strategy="most_frequent")),
                    ("encoder", OneHotEncoder(**one_hot_kwargs)),
                ]
            ),
            categorical_features,
        ),
    ]
)

base_regressor = HistGradientBoostingRegressor(
    random_state=RANDOM_STATE,
    max_depth=None,
    learning_rate=0.1,
    max_iter=500,
)

model = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        ("regressor", MultiOutputRegressor(base_regressor)),
    ]
)
model
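
The pipeline above encodes `State` and `Species` with `handle_unknown="ignore"`. As a small illustration of what that setting does, the sketch below fits an encoder on three categories and transforms a frame containing an unseen one. The category labels are placeholders chosen for the example, not values confirmed from the dataset.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Placeholder categories for illustration only.
toy_train = pd.DataFrame({"State": ["NSW", "Tas", "Vic"]})
toy_test = pd.DataFrame({"State": ["Tas", "WA"]})  # "WA" is unseen at fit time

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(toy_train)

# The default output is a sparse matrix; .toarray() densifies it without
# relying on the version-specific sparse / sparse_output keyword.
encoded = encoder.transform(toy_test).toarray()

print(encoder.categories_[0])
print(encoded)
```

The unseen category becomes an all-zero row instead of raising an error, so prediction does not fail if the test set contains a state or species missing from the training data.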


In [None]:
def root_mean_squared_error(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

cv = KFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)
cv_metrics = []

for fold, (train_idx, valid_idx) in enumerate(cv.split(X), start=1):
    X_train, X_valid = X.iloc[train_idx], X.iloc[valid_idx]
    y_train, y_valid = y.iloc[train_idx], y.iloc[valid_idx]

    fold_model = clone(model)
    fold_model.fit(X_train, y_train)
    y_pred = fold_model.predict(X_valid)

    fold_summary = {"fold": fold}
    overall_rmse = root_mean_squared_error(y_valid.values, y_pred)
    fold_summary["overall_rmse"] = overall_rmse

    for idx, target in enumerate(TARGET_COLUMNS):
        target_rmse = root_mean_squared_error(y_valid.iloc[:, idx].values, y_pred[:, idx])
        fold_summary[f"rmse_{target}"] = target_rmse

    cv_metrics.append(fold_summary)
    print(
        f"Fold {fold}: overall RMSE = {overall_rmse:.2f} | "
        + ", ".join(
            f"{target}={fold_summary[f'rmse_{target}']:.2f}" for target in TARGET_COLUMNS
        )
    )

cv_results = pd.DataFrame(cv_metrics)
cv_results


In [None]:
cv_results.mean(numeric_only=True)
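
When reading the table above, it helps to know how the pooled "overall" RMSE relates to the per-target values. The sketch below uses made-up numbers with two targets on very different scales. It shows that, when every target has the same number of rows, the pooled RMSE equals the root of the mean of the squared per-target RMSEs, so the largest-scale target dominates it.

```python
import numpy as np


def rmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Root mean squared error over all elements of the inputs."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Two targets on very different scales (illustrative values only).
y_true = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
y_pred = np.array([[1.5, 110.0], [2.5, 190.0], [2.5, 310.0]])

per_target = [rmse(y_true[:, j], y_pred[:, j]) for j in range(y_true.shape[1])]
pooled = rmse(y_true, y_pred)

# Root of the mean of squared per-target RMSEs.
reconstructed = float(np.sqrt(np.mean(np.square(per_target))))

print("per-target:", per_target)
print("pooled:", pooled, "reconstructed:", reconstructed)
```

In this toy case the per-target errors are 0.5 and 10.0, yet the pooled value is about 7.08, which is why the per-target columns are worth inspecting alongside the overall figure.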


In [None]:
final_model = clone(model)
final_model.fit(X, y)

test_meta = prepare_metadata(test_df)
print(f"Test samples prepared: {len(test_meta):,}")

test_predictions = final_model.predict(test_meta)
prediction_wide = pd.DataFrame(
    test_predictions, index=test_meta.index, columns=TARGET_COLUMNS
).reset_index()

prediction_long = prediction_wide.melt(
    id_vars="sample_id", value_vars=TARGET_COLUMNS,
    var_name="target_name", value_name="target"
)
submission = sample_submission.drop(columns="target", errors="ignore").merge(
    prediction_long, on=["sample_id", "target_name"], how="left"
)

missing_predictions = submission["target"].isna().sum()
if missing_predictions:
    raise ValueError(
        f"Submission contains {missing_predictions} missing predictions. "
        "Check feature preparation for unmatched sample_ids."
    )

submission.head()
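
The reshaping in the cell above can be checked on a toy example. The sketch below melts hypothetical wide predictions into long format and left-merges them onto a hypothetical template, which stands in for `sample_submission`. All ids, target names, and values are invented for illustration.

```python
import pandas as pd

# Hypothetical wide predictions: one row per sample, one column per target.
toy_wide = pd.DataFrame(
    {"sample_id": ["A", "B"], "Dry_Green_g": [10.0, 7.0], "Dry_Dead_g": [2.5, 4.0]}
)

# Hypothetical long-format template whose row order should be preserved.
toy_template = pd.DataFrame(
    {
        "sample_id": ["A", "A", "B", "B"],
        "target_name": ["Dry_Green_g", "Dry_Dead_g", "Dry_Green_g", "Dry_Dead_g"],
    }
)

# Wide -> long: one row per (sample_id, target_name) pair.
toy_long = toy_wide.melt(
    id_vars="sample_id", var_name="target_name", value_name="target"
)

# A left merge keeps the template's rows and order; validate="one_to_one"
# raises if either side has duplicate (sample_id, target_name) keys.
toy_submission = toy_template.merge(
    toy_long, on=["sample_id", "target_name"], how="left", validate="one_to_one"
)

print(toy_submission)
```

Passing `validate="one_to_one"` is an optional safeguard that is not used in the cell above. It catches duplicated keys that would otherwise silently inflate the row count.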


In [None]:
OUTPUT_PATH = Path("submission.csv")
submission.to_csv(OUTPUT_PATH, index=False)
print(f"Saved submission to {OUTPUT_PATH.resolve()} with {len(submission):,} rows.")


## Environment & Optional GPU Diagnostics

Kaggle provides a P100 accelerator for this notebook. The baseline is a scikit-learn model that trains entirely on CPU, so the GPU is optional. If you plan to add image models later, run the cell below to confirm that the GPU is visible.



In [None]:
!nvidia-smi
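
The `!nvidia-smi` line uses IPython shell syntax and prints an error when no accelerator is attached. If a programmatic check is more convenient, a plain-Python sketch such as the one below can be used. The helper name `gpu_visible` is hypothetical, and the check assumes that `nvidia-smi -L` lists one line per GPU when a driver is present.

```python
import shutil
import subprocess


def gpu_visible() -> bool:
    """Return True if nvidia-smi is on PATH and lists at least one GPU."""
    if shutil.which("nvidia-smi") is None:
        return False
    try:
        result = subprocess.run(
            ["nvidia-smi", "-L"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.SubprocessError):
        return False
    # Each listed device line starts with "GPU <index>:".
    return result.returncode == 0 and "GPU" in result.stdout


print("GPU visible:", gpu_visible())
```

The function returns `False` instead of raising on CPU-only sessions, so later cells can branch on its result.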


## Visualise A Sample Pasture Image

Use the cell below to display one of the training images (`ID1011485656.jpg`) so you can inspect the raw data manually.



In [None]:
if TRAIN_IMAGE_SAMPLE.exists():
    image = Image.open(TRAIN_IMAGE_SAMPLE)
    plt.figure(figsize=(6, 6))
    plt.imshow(image)
    plt.axis("off")
    plt.title(TRAIN_IMAGE_SAMPLE.name)
else:
    print(f"Image not found at {TRAIN_IMAGE_SAMPLE}. Check the dataset path.")


## Next Steps

- Explore richer image features (e.g., CNN embeddings); these benefit from the optional GPU, which the tabular baseline does not use.
- Try additional regressors (CatBoost, LightGBM, XGBoost) or stacking ensembles.
- Engineer agronomic features from dates, species composition, or NDVI deltas.
- Log experiments with Weights & Biases or saved Kaggle notebook versions for reproducibility.
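
As one possible starting point for the date-based features mentioned above, the sketch below encodes the month as a point on the unit circle so that December and January end up close together. The helper `add_cyclical_date_features` is hypothetical and not part of the baseline. It assumes a `sampling_month` column like the one created by `prepare_metadata`.

```python
import numpy as np
import pandas as pd


def add_cyclical_date_features(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: add sine/cosine encodings of sampling_month."""
    out = frame.copy()
    # Map months 1..12 onto angles 0..2*pi (exclusive) around the circle.
    angle = 2.0 * np.pi * (out["sampling_month"] - 1) / 12.0
    out["month_sin"] = np.sin(angle)
    out["month_cos"] = np.cos(angle)
    return out


toy = pd.DataFrame({"sampling_month": [1, 2, 7, 12]})
encoded = add_cyclical_date_features(toy)

# Euclidean distance between two rows in the (sin, cos) plane.
coords = encoded[["month_sin", "month_cos"]].to_numpy()
jan_to_feb = float(np.linalg.norm(coords[0] - coords[1]))
jan_to_dec = float(np.linalg.norm(coords[0] - coords[3]))

print(encoded)
print(jan_to_feb, jan_to_dec)
```

Tree-based models such as the one in this notebook can already split on the raw month, so whether this encoding helps is an empirical question to settle with the cross-validation loop.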

