This notebook demonstrates a complete baseline pipeline for the CSIRO Image2Biomass Prediction competition.
The goal of this challenge is to predict pasture biomass (in grams) across five key vegetation components using images and associated metadata.

**Objective**
Build a model that estimates:

* Dry green vegetation (excluding clover)
* Dry dead material
* Dry clover biomass
* Green dry matter (GDM)
* Total dry biomass

**This Notebook Includes**

1. Environment setup and data loading
2. Metadata preprocessing (dates, categories, NDVI, etc.)
3. Baseline model training using Random Forest
4. Local validation using R² score
5. Test prediction and submission.csv creation

**Model Approach**

We start with a simple tabular baseline that uses only the numeric and categorical metadata; the images themselves are not used in this notebook.
Later stages will extend this to a CNN + metadata hybrid model that combines EfficientNet image embeddings with the same metadata.




In [None]:
# =========================================================
# 1️⃣ Setup and Imports
# =========================================================
import os
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# Kaggle dataset path (automatically mounted in environment)
BASE_PATH = "/kaggle/input/csiro-biomass"

print("Available files:", os.listdir(BASE_PATH))

In [None]:
# =========================================================
# 2️⃣ Load CSV Files
# =========================================================
train = pd.read_csv(f"{BASE_PATH}/train.csv")
test = pd.read_csv(f"{BASE_PATH}/test.csv")
sample_sub = pd.read_csv(f"{BASE_PATH}/sample_submission.csv")

print("Train shape:", train.shape)
print("Test shape:", test.shape)
train.head()
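
Before preprocessing, it can help to see which train columns are absent from test, because the later cells fill any missing feature with 0 without reporting it. The sketch below is only an illustration: `columns_missing_from_test` is a hypothetical helper, and the toy frames and their values are made up rather than taken from the competition files.

```python
import pandas as pd


def columns_missing_from_test(train_df, test_df, ignore=("target",)):
    """Return train columns that test lacks, skipping the names in `ignore`."""
    return [
        col for col in train_df.columns
        if col not in test_df.columns and col not in ignore
    ]


# Toy frames standing in for train.csv and test.csv (illustrative values only)
toy_train = pd.DataFrame({
    "sample_id": ["a", "b"],
    "State": ["NSW", "Vic"],
    "Height_Ave_cm": [4.0, 7.5],
    "target_name": ["Dry_Total_g", "GDM_g"],
    "target": [30.0, 12.0],
})
toy_test = pd.DataFrame({
    "sample_id": ["c"],
    "target_name": ["Dry_Total_g"],
})

missing = columns_missing_from_test(toy_train, toy_test)
print("Columns in train but not in test:", missing)
```

In the notebook itself the call would be `columns_missing_from_test(train, test)`. Any feature it lists will be a constant at prediction time, so the model cannot use it on the test set.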

In [None]:
# =========================================================
# 3️⃣ Data Preprocessing
# =========================================================

# Derive year/month/day from Sampling_Date when the column exists
for df in (train, test):
    if "Sampling_Date" in df.columns:
        df["Sampling_Date"] = pd.to_datetime(df["Sampling_Date"], errors="coerce")
        for part in ("year", "month", "day"):
            # Unparseable dates become NaT; fill their parts with 0,
            # the same placeholder used when the column is absent
            df[part] = getattr(df["Sampling_Date"].dt, part).fillna(0).astype(int)
    else:
        df["year"] = df["month"] = df["day"] = 0

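# Encode categorical columns
for col in ["State", "Species", "target_name"]:
    le = LabelEncoder()
    # Fit on train plus test (when test has the column) so both share one mapping
    values = [train[col].astype(str)]
    if col in test.columns:
        values.append(test[col].astype(str))
    le.fit(pd.concat(values))
    train[col] = le.transform(train[col].astype(str))
    if col in test.columns:
        test[col] = le.transform(test[col].astype(str))
    else:
        # Placeholder only: 0 is also the code of the first encoded class,
        # so it does not mark the value as missing
        test[col] = 0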
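
The 0 placeholder in the encoding step deserves a closer look. `LabelEncoder` assigns integer codes in sorted order, so 0 always belongs to a real category. One possible alternative, sketched here with made-up category values, is scikit-learn's `OrdinalEncoder`, which can map unseen values to a reserved code such as -1. This is a suggestion, not part of the baseline above.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

seen = np.array(["NSW", "Tas", "Vic"])  # illustrative category values

# LabelEncoder sorts the classes, so code 0 belongs to a real category ("NSW")
le = LabelEncoder().fit(seen)
first_code = int(le.transform(["NSW"])[0])
print("Code assigned to 'NSW':", first_code)

# OrdinalEncoder can send unseen values to a reserved code instead
oe = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
oe.fit(seen.reshape(-1, 1))
codes = oe.transform(np.array([["Vic"], ["WA"]])).ravel()
print("Codes for ['Vic', 'WA']:", codes.tolist())
```

A tree model can then split on the reserved code, which keeps "unknown" separate from the first real category.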


In [None]:
# =========================================================
# 4️⃣ Prepare Features
# =========================================================
feature_cols = ["State", "Species", "Pre_GSHH_NDVI", "Height_Ave_cm", "year", "month", "day", "target_name"]

# Make sure all columns exist
for col in feature_cols:
    if col not in test.columns:
        test[col] = 0
    if col not in train.columns:
        train[col] = 0

X = train[feature_cols].copy()
y = train["target"].copy()

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)
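
A random row-level split can be optimistic if, as the `target_name` and `target` columns suggest, each image contributes several rows (one per component). Rows from the same image then land on both sides of the split and share identical metadata. A grouped split avoids that. The sketch below uses synthetic data and assumes an `image_path` column identifies the image; that column name is an assumption, not confirmed by this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy long-format frame: 10 images x 5 target rows each (synthetic values)
rng = np.random.default_rng(42)
n_images, n_targets = 10, 5
toy = pd.DataFrame({
    "image_path": np.repeat([f"img_{i}.jpg" for i in range(n_images)], n_targets),
    "target_name": np.tile(np.arange(n_targets), n_images),
    "Height_Ave_cm": np.repeat(rng.uniform(1, 20, n_images), n_targets),
})
toy["target"] = rng.uniform(0, 100, len(toy))

# Split by image so that all rows of one image stay on the same side
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, valid_idx = next(
    splitter.split(toy, toy["target"], groups=toy["image_path"])
)

train_groups = set(toy["image_path"].iloc[train_idx])
valid_groups = set(toy["image_path"].iloc[valid_idx])
print("Images in train:", len(train_groups), "| images in valid:", len(valid_groups))
print("Images shared by both sides:", len(train_groups & valid_groups))
```

Applied to the notebook, `X.iloc[train_idx]` and `X.iloc[valid_idx]` would replace the output of `train_test_split`, with the grouping column taken from `train`.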


In [None]:
# =========================================================
# 5️⃣ Train Baseline Model
# =========================================================
model = RandomForestRegressor(
    n_estimators=400,
    max_depth=12,
    random_state=42,
    n_jobs=-1
)
model.fit(X_train, y_train)

# Validate
y_pred = model.predict(X_valid)
r2 = r2_score(y_valid, y_pred)
print("Validation R²:", round(r2, 4))
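
A single pooled R² over all rows can hide weak components, because components measured on a larger scale dominate the total variance. The sketch below computes R² per group on synthetic numbers to show the effect; `r2_by_group` is a hypothetical helper, not part of scikit-learn.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import r2_score


def r2_by_group(y_true, y_pred, groups):
    """Compute R² separately for each distinct value in `groups`."""
    frame = pd.DataFrame({
        "y_true": np.asarray(y_true),
        "y_pred": np.asarray(y_pred),
        "group": np.asarray(groups),
    })
    return {
        name: r2_score(part["y_true"], part["y_pred"])
        for name, part in frame.groupby("group")
    }


# Synthetic example: two components on very different scales
y_true_demo = np.array([1.0, 2.0, 3.0, 100.0, 200.0, 300.0])
y_pred_demo = np.array([3.0, 2.0, 1.0, 110.0, 190.0, 310.0])
groups_demo = np.array(["small", "small", "small", "large", "large", "large"])

pooled = r2_score(y_true_demo, y_pred_demo)
scores = r2_by_group(y_true_demo, y_pred_demo, groups_demo)

print("Pooled R²:", round(pooled, 4))
for name, score in scores.items():
    print(f"{name}: {score:.4f}")
```

Here the pooled score is above 0.99 even though the small-scale group has a negative R². In the notebook, the call would be `r2_by_group(y_valid, y_pred, X_valid["target_name"])`, which reports scores keyed by the encoded target id.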


In [None]:
# =========================================================
# 6️⃣ Generate Predictions and Submission
# =========================================================
test_features = test[feature_cols]
test["target"] = model.predict(test_features)

submission = test[["sample_id", "target"]].copy()
submission.to_csv("/kaggle/working/submission.csv", index=False)
print("✅ submission.csv created successfully!")
submission.head()
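
A quick structural check against `sample_submission.csv` can catch formatting problems before uploading. The sketch below uses a hypothetical helper, `check_submission`, with toy frames. It assumes the sample file has the same `sample_id` and `target` columns that the cell above writes.

```python
import pandas as pd


def check_submission(submission_df, sample_df, id_col="sample_id", value_col="target"):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    if list(submission_df.columns) != list(sample_df.columns):
        problems.append("column names or order differ from the sample file")
    if id_col in submission_df and id_col in sample_df:
        if set(submission_df[id_col]) != set(sample_df[id_col]):
            problems.append("ids differ from the sample file")
        if submission_df[id_col].duplicated().any():
            problems.append("duplicate ids")
    if value_col in submission_df and submission_df[value_col].isna().any():
        problems.append("missing predictions")
    return problems


# Toy frames (illustrative values only)
toy_sample = pd.DataFrame({"sample_id": ["a", "b"], "target": [0.0, 0.0]})
toy_good = pd.DataFrame({"sample_id": ["a", "b"], "target": [1.5, 2.5]})
toy_bad = pd.DataFrame({"sample_id": ["a", "a"], "target": [1.5, None]})

good_report = check_submission(toy_good, toy_sample)
bad_report = check_submission(toy_bad, toy_sample)
print("Good submission problems:", good_report)
print("Bad submission problems:", bad_report)
```

In the notebook this would be called as `check_submission(submission, sample_sub)`, using the `sample_sub` frame loaded in the data-loading cell.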


In [None]:
# List the working directory to confirm submission.csv was written
!ls -lh /kaggle/working/
