

---

The final Python code is split into four notebook cells (training, inference, submission analysis, and EDA) so that the workflow for the CSIRO - Image2Biomass Prediction competition is clear and repeatable.

## 1. Training Cell Explanation (3-Target Model)

This cell handles all data preparation and model training. It restricts the model to three targets and trains on log-transformed values.

* **Target Selection (3 Units):** The code extracts only the **three independent targets** to predict: `Dry_Total_g`, `GDM_g`, and `Dry_Green_g`. This strategy ensures the final two dependent components are mathematically derived later, guaranteeing adherence to physical constraints. The model's output layer is set to **3 units**.
* **Log Transformation:** Target values are transformed using $\mathbf{\log(target + EPS)}$, with `EPS = 1e-6`. Biomass measurements are strongly right-skewed, so training on the log scale stabilizes the variance and keeps a few very large samples from dominating the Mean Squared Error (MSE) loss.
* **Data Structure:** The code converts the **long-format** `train.csv` (one row per image-target pair) into a **wide format** (one row per image) required for the deep learning model.
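* **Architecture:** A small **Convolutional Neural Network (CNN)** serves as a baseline: one 3×3 convolution layer with 32 filters, max pooling, a 128-unit dense layer with 50% dropout, and a linear output layer. The model is compiled with the Adam optimizer and MSE loss, which suits regression on the log-transformed targets.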
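
The pivot and log-transform steps above can be sketched on a toy table. This is a minimal illustration, not the training cell itself: the rows and values below are made up, and only the column names (`image_path`, `target_name`, `target`) and `EPS` follow the notebook.

```python
import numpy as np
import pandas as pd

EPS = 1e-6  # same constant as in the training cell

# Toy long-format table: one row per (image, target) pair. Values are made up.
df_long = pd.DataFrame({
    "image_path": ["a.jpg"] * 3 + ["b.jpg"] * 3,
    "target_name": ["Dry_Total_g", "GDM_g", "Dry_Green_g"] * 2,
    "target": [50.0, 30.0, 20.0, 8.0, 8.0, 0.0],
})

# Long -> wide: one row per image, one column per target name.
df_wide = df_long.pivot_table(
    index="image_path", columns="target_name", values="target"
).reset_index()

cols = ["Dry_Total_g", "GDM_g", "Dry_Green_g"]
y = df_wide[cols].to_numpy()

y_log = np.log(y + EPS)       # forward transform used for training
y_back = np.exp(y_log) - EPS  # inverse transform used at inference

print(df_wide)
print(y_log.round(3))
```

One thing the toy data makes visible: a target of exactly zero maps to `log(1e-6)`, roughly -13.8, which sits far from the other log-scale values. If zeros are common in the real data, a transform such as `np.log1p` may be worth comparing, though that is a suggestion and not something the notebook does.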

---

## 2. Inference Cell Explanation (Constraint Enforcement)

This cell applies the trained 3-target model to the test data and enforces the physical relationships between the biomass components through a custom inference function.

* **The `enforce_biological_constraints_corrected` Function:** This function turns the model's 3 log-scale predictions into the 5 submitted targets in three steps:
    1.  **Inverse Log-Transformation:** Converts the predictions back to grams using $\mathbf{\exp(\text{prediction}) - EPS}$.
    2.  **Hierarchy Enforcement:** Ensures the predicted masses respect the principle of inclusion: $\mathbf{\text{Dry\_Total\_g} \ge \text{GDM\_g}}$ and $\mathbf{\text{GDM\_g} \ge \text{Dry\_Green\_g}}$.
    3.  **Dependent Target Derivation:** It calculates the final two components, ensuring perfect physical compliance:
        * $\mathbf{\text{Dry\_Dead\_g} = \text{Dry\_Total\_g} - \text{GDM\_g}}$
        * $\mathbf{\text{Dry\_Clover\_g} = \text{GDM\_g} - \text{Dry\_Green\_g}}$
* **Submission Formatting:** The final 5-column wide prediction array is converted back into the required **long format** (columns: `sample_id` and `target`) using `melt` and `merge` operations to produce the `submission.csv` file.
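
The three steps can be traced on two hand-made predictions. The sketch below is a NumPy-only restatement under the same rules, not the notebook's function: the name `enforce` and the input values are hypothetical, and it clips to non-negative values before applying the hierarchy (the notebook clips afterwards, which gives the same result because clipping preserves ordering).

```python
import numpy as np

EPS = 1e-6

def enforce(pred_log):
    """pred_log columns: Dry_Total_g, GDM_g, Dry_Green_g (log scale).

    Returns columns: Dry_Green_g, Dry_Dead_g, Dry_Clover_g, GDM_g, Dry_Total_g.
    """
    # 1. Inverse log-transform, then clip tiny negatives to zero.
    grams = np.clip(np.exp(pred_log) - EPS, 0.0, None)
    total, gdm, green = grams[:, 0], grams[:, 1], grams[:, 2]

    # 2. Hierarchy: raise the larger quantity where the ordering is violated.
    gdm = np.maximum(gdm, green)    # GDM_g >= Dry_Green_g
    total = np.maximum(total, gdm)  # Dry_Total_g >= GDM_g

    # 3. Derive the two dependent components.
    clover = gdm - green
    dead = total - gdm
    return np.column_stack([green, dead, clover, gdm, total])

# Row 0 already respects the hierarchy; row 1 violates it (Total < GDM < Green).
pred_log = np.log(np.array([[50.0, 30.0, 20.0],
                            [10.0, 12.0, 15.0]]) + EPS)
out = enforce(pred_log)
print(out.round(3))
```

In the violating row all three predicted values collapse to the largest one (15 g) and both derived components become zero. The constraints always hold after this step, but they are satisfied by pulling `GDM_g` and `Dry_Total_g` upward, so a single overestimated `Dry_Green_g` propagates to the other targets.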

---

## 3. Analysis Cell Explanation

This cell validates the structural and logical integrity of the final output.

* **Format Check:** Confirms the output file adheres to the required **two-column long format** (`sample_id` and `target`).
* **Constraint Check:** Numerically verifies the two derivation rules (`Dry_Total_g = GDM_g + Dry_Dead_g` and `GDM_g = Dry_Green_g + Dry_Clover_g`) on the predicted values for the first test sample. It is a spot check that the derivation logic in the inference cell ran as intended, not a validation of every row.
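
If a check over every image is wanted instead of only the first sample, one possible approach is to pivot the submission back to wide format and test both rules per image. The sketch below rests on an assumption that is not confirmed by the notebook: that each `sample_id` has the form `<image id>__<target name>` with a double-underscore separator. The helper name `check_constraints` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd

def check_constraints(df_sub, atol=1e-5):
    """Return one row per image with a pass/fail flag for each derivation rule."""
    # ASSUMPTION: sample_id looks like "<image id>__<target name>".
    parts = df_sub["sample_id"].str.rsplit("__", n=1, expand=True)
    wide = (
        df_sub.assign(image_id=parts[0], target_name=parts[1])
        .pivot(index="image_id", columns="target_name", values="target")
    )
    total_ok = np.isclose(wide["Dry_Total_g"], wide["GDM_g"] + wide["Dry_Dead_g"], atol=atol)
    gdm_ok = np.isclose(wide["GDM_g"], wide["Dry_Green_g"] + wide["Dry_Clover_g"], atol=atol)
    return pd.DataFrame({"total_ok": total_ok, "gdm_ok": gdm_ok}, index=wide.index)

# Toy submission: img1 is consistent, img2 breaks the Dry_Total_g rule (9 != 7 + 1).
names = ["Dry_Green_g", "Dry_Dead_g", "Dry_Clover_g", "GDM_g", "Dry_Total_g"]
rows = {"img1": [20.0, 20.0, 10.0, 30.0, 50.0], "img2": [5.0, 1.0, 2.0, 7.0, 9.0]}
df_sub = pd.DataFrame(
    [(f"{img}__{name}", value) for img, vals in rows.items() for name, value in zip(names, vals)],
    columns=["sample_id", "target"],
)

report = check_constraints(df_sub)
print(report)
```

Filtering `report` for rows where either flag is `False` would list the images that need a closer look.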

---

## 4. EDA Cell Explanation (Exploratory Data Analysis)

This cell helps visualize and understand the characteristics of the competition data before complex modeling.

* **Target Distribution:** It plots histograms and box plots of the raw biomass targets (`target` column) to show their distribution and range. This visualization confirms the **high skewness** of the data, which justifies the use of the **log transformation** in the training cell.
* **Image Visualization:** It loads and plots one sample image from the **training** set and one from the **test** set. This confirms that the image loading paths are correct for both data sets and helps visually assess the quality and characteristics of the pasture images.
* **Warning Suppression:** Uses the `warnings` module to suppress specific `FutureWarning` messages from the `seaborn` library, resulting in a clean notebook execution.
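
Skewness can also be put into numbers alongside the plots. The sketch below uses synthetic lognormal values as a stand-in for biomass, so the figures it prints say nothing about the competition data; on the real table the same two `skew()` calls could be applied to the `target` column.

```python
import numpy as np
import pandas as pd

EPS = 1e-6

# Synthetic right-skewed "biomass" values (lognormal); NOT competition data.
rng = np.random.default_rng(0)
biomass = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=5000))

raw_skew = biomass.skew()                # strongly positive for right-skewed data
log_skew = np.log(biomass + EPS).skew()  # close to zero if the log scale is roughly symmetric

print(f"skewness raw: {raw_skew:.2f} | after log: {log_skew:.2f}")
```

A large drop in skewness after the transform is the numerical counterpart of the histogram argument for training on log-scale targets.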

In [None]:
import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Conv2D, MaxPooling2D, Flatten, Dense, Dropout
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from sklearn.model_selection import train_test_split
from tqdm.auto import tqdm 
import os

# --- A. CONFIGURATION ---
DATA_PATH = '/kaggle/input/csiro-biomass/'
TRAIN_CSV = os.path.join(DATA_PATH, 'train.csv')
TEST_CSV = os.path.join(DATA_PATH, 'test.csv') 
TRAIN_IMG_DIR = DATA_PATH 
IMG_SIZE = (128, 128) 
EPS = 1e-6 

# 🛑 Targets the model WILL predict (The 3 independent components)
PREDICTED_TARGETS = ['Dry_Total_g', 'GDM_g', 'Dry_Green_g']
# All 5 targets are used for the final submission column list
TARGET_NAMES = ['Dry_Green_g', 'Dry_Dead_g', 'Dry_Clover_g', 'GDM_g', 'Dry_Total_g']

IMAGE_PATH_COL = 'image_path'
TARGET_COL = 'target' 

# --- B. DATA LOADING & PREPARATION ---

print("--- Data Loading and Log-Transformation (3 Targets) ---")

df_train_long = pd.read_csv(TRAIN_CSV)

# Pivot the long data into wide format
df_train_wide = df_train_long.pivot_table(
    index=[IMAGE_PATH_COL],
    columns='target_name',
    values=TARGET_COL
).reset_index()

# Drop NaNs based ONLY on the three targets we plan to predict
df_train_wide.dropna(subset=PREDICTED_TARGETS, inplace=True)

# Log-Transformation of Targets (Only the 3 predicted targets)
y_targets = df_train_wide[PREDICTED_TARGETS].values
y_targets_log = np.log(y_targets + EPS)

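# Load Images (keep targets aligned with the images that actually load)
X_images = []
loaded_rows = []  # positional indices of rows whose image was found
for pos, (_, row) in enumerate(tqdm(df_train_wide.iterrows(), total=len(df_train_wide), desc="Loading Train Images")):
    image_path = os.path.join(TRAIN_IMG_DIR, row[IMAGE_PATH_COL])
    try:
        img = load_img(image_path, target_size=IMG_SIZE)
        img_array = img_to_array(img) / 255.0 
        X_images.append(img_array)
        loaded_rows.append(pos)
    except FileNotFoundError:
        continue  # skip missing files; the matching target row is dropped below
X_images = np.array(X_images)
# Without this filter, a skipped image leaves X and y with different lengths
y_targets_log = y_targets_log[loaded_rows]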

# Split Data - Train the model using the LOG-TRANSFORMED targets
X_train_img, X_val_img, y_train_log, y_val_log = train_test_split(
    X_images, y_targets_log, test_size=0.2, random_state=42
)

# --- C. MODEL DEFINITION & TRAINING ---

def create_image_model(img_shape, num_targets):
    input_img = Input(shape=img_shape, name='image_input')
    x = Conv2D(32, (3, 3), activation='relu', padding='same')(input_img)
    x = MaxPooling2D((2, 2))(x)
    x = Flatten()(x)
    x = Dense(128, activation='relu')(x)
    x = Dropout(0.5)(x)
    
    # Output layer is Dense(3) for the 3 predicted components
    output_biomass = Dense(num_targets, activation='linear', name='biomass_output')(x)
    model = Model(inputs=input_img, outputs=output_biomass)
    return model

# Initialize and compile the model
biomass_model = create_image_model(
    img_shape=X_train_img.shape[1:], 
    num_targets=len(PREDICTED_TARGETS)
)

biomass_model.compile(optimizer='adam', loss='mse', metrics=['mae']) 

n_epoch=40
print(f"\n--- Starting Model Training ({n_epoch} Epochs) ---")
biomass_model.fit(
    X_train_img, y_train_log, 
    validation_data=(X_val_img, y_val_log),
    epochs=n_epoch, batch_size=32, verbose=1
)

# Copy the configuration under *_VAR names for the later cells.
# Notebook cells share one namespace, so no `global` statement is needed.
TEST_CSV_VAR = TEST_CSV
TARGET_NAMES_VAR = TARGET_NAMES
PREDICTED_TARGETS_VAR = PREDICTED_TARGETS
IMAGE_PATH_COL_VAR = IMAGE_PATH_COL
SUBMISSION_ID_COL_VAR = 'sample_id'
TARGET_COL_VAR = TARGET_COL
EPS_VAR = EPS
TEST_IMG_DIR_VAR = DATA_PATH
IMG_SIZE_VAR = IMG_SIZE
biomass_model_var = biomass_model

print('training complete')

In [None]:
import pandas as pd
import numpy as np
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tqdm.auto import tqdm 
import os

# --- A. INFERENCE FUNCTIONS ---

def enforce_biological_constraints_corrected(predictions_wide_log, PREDICTED_TARGETS, TARGET_NAMES, EPS):
    """
    Applies constraints by predicting 3 targets and deriving the other 2.
    """
    
    # 1. Inverse Log-Transformation
    predictions_wide_g = np.exp(predictions_wide_log) - EPS
    
    # Create a DataFrame using only the 3 PREDICTED targets
    df_pred = pd.DataFrame(predictions_wide_g, columns=PREDICTED_TARGETS)

    # 2. Enforces Hierarchy on PREDICTED components
    df_pred['GDM_g'] = np.maximum(df_pred['GDM_g'].values, df_pred['Dry_Green_g'].values) # GDM_g >= Dry_Green_g
    df_pred['Dry_Total_g'] = np.maximum(df_pred['Dry_Total_g'].values, df_pred['GDM_g'].values) # Dry_Total_g >= GDM_g
    
    # Ensure all predictions are non-negative before derivation
    for col in PREDICTED_TARGETS:
        df_pred[col] = np.maximum(0, df_pred[col])

    # 3. Derives Missing Targets (CRITICAL)
    df_pred['Dry_Clover_g'] = np.maximum(0, df_pred['GDM_g'].values - df_pred['Dry_Green_g'].values)
    df_pred['Dry_Dead_g'] = np.maximum(0, df_pred['Dry_Total_g'].values - df_pred['GDM_g'].values)
    
    # Reorder columns to match the required final TARGET_NAMES order
    final_order = TARGET_NAMES 
    predictions_final = df_pred[final_order].values

    return predictions_final

def run_inference(model, X_test_images, PREDICTED_TARGETS, TARGET_NAMES, EPS):
    # Get log-transformed predictions (3 outputs)
    predictions_log = model.predict(X_test_images)
    
    # Enforce constraints and transform back to grams (5 outputs)
    predictions_g = enforce_biological_constraints_corrected(predictions_log, PREDICTED_TARGETS, TARGET_NAMES, EPS)
    
    return predictions_g

# --- B. EXECUTION ---

print("\n--- Generating Submission File ---")

# Load configuration from global variables
TEST_CSV = TEST_CSV_VAR
TARGET_NAMES = TARGET_NAMES_VAR
PREDICTED_TARGETS = PREDICTED_TARGETS_VAR
IMAGE_PATH_COL = IMAGE_PATH_COL_VAR
SUBMISSION_ID_COL = SUBMISSION_ID_COL_VAR
TARGET_COL = TARGET_COL_VAR
EPS = EPS_VAR
TEST_IMG_DIR = TEST_IMG_DIR_VAR
IMG_SIZE = IMG_SIZE_VAR
biomass_model = biomass_model_var

df_test_long = pd.read_csv(TEST_CSV)
df_test_wide_meta = df_test_long.drop_duplicates(subset=[IMAGE_PATH_COL]).reset_index(drop=True)

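# Prepare wide test data for prediction
X_test_images_wide = []
loaded_rows = []  # positional indices of test rows whose image was found
for pos, (_, row) in enumerate(tqdm(df_test_wide_meta.iterrows(), total=len(df_test_wide_meta), desc="Loading Test Images")):
    image_path_full = os.path.join(TEST_IMG_DIR, row[IMAGE_PATH_COL])
    try:
        img = load_img(image_path_full, target_size=IMG_SIZE)
        X_test_images_wide.append(img_to_array(img) / 255.0) 
        loaded_rows.append(pos)
    except FileNotFoundError:
        continue 
X_test_images_wide = np.array(X_test_images_wide)
# Keep the metadata aligned with the loaded images so predictions map to the right paths
df_test_wide_meta = df_test_wide_meta.iloc[loaded_rows].reset_index(drop=True)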


if len(X_test_images_wide) > 0:
    predictions_wide_g = run_inference(biomass_model, X_test_images_wide, PREDICTED_TARGETS, TARGET_NAMES, EPS)

    # Convert wide predictions (in grams) back to a long DataFrame
    df_pred_wide = pd.DataFrame(predictions_wide_g, columns=TARGET_NAMES)
    df_pred_wide.insert(0, IMAGE_PATH_COL, df_test_wide_meta[IMAGE_PATH_COL])

    # Melt the wide prediction table back to long format 
    df_pred_long = df_pred_wide.melt(
        id_vars=[IMAGE_PATH_COL],
        value_vars=TARGET_NAMES,
        var_name='target_name',
        value_name=TARGET_COL
    )

    # Merge predictions back to the original test structure
    df_submission = df_test_long[[SUBMISSION_ID_COL, IMAGE_PATH_COL, 'target_name']].merge(
        df_pred_long,
        on=[IMAGE_PATH_COL, 'target_name'],
        how='left'
    )
    
    # Final submission structure
    df_submission = df_submission[[SUBMISSION_ID_COL, TARGET_COL]]

    # Save to CSV
    SUBMISSION_FILE = 'submission.csv'
    df_submission.to_csv(SUBMISSION_FILE, index=False)

    print(f"Successfully generated submission file: {SUBMISSION_FILE}")
else:
    print("No test images were loaded. Submission file not created.")

In [None]:
import pandas as pd
import numpy as np
import os

# --- CONFIGURATION (Load from global) ---
SUBMISSION_FILE = 'submission.csv'
SUBMISSION_ID_COL = SUBMISSION_ID_COL_VAR
TARGET_COL = TARGET_COL_VAR
EPS = EPS_VAR

# --- FILE VERIFICATION ---

print("\n--- Final Submission File Verification and Content Analysis ---")

if not os.path.exists(SUBMISSION_FILE):
    print(f"FATAL ERROR: Submission file '{SUBMISSION_FILE}' not found.")
else:
    df_submission = pd.read_csv(SUBMISSION_FILE)

    # 1. Validation Checks
    expected_cols = [SUBMISSION_ID_COL, TARGET_COL]
    if df_submission.columns.tolist() != expected_cols:
        print(f"❌ FAIL: Expected columns {expected_cols}, found {df_submission.columns.tolist()}.")
    else:
        print("✅ PASS: Submission file has the correct columns and order.")

    # 2. Print Structure
    print("-" * 50)
    print(f"Shape: {df_submission.shape}")
    
    print("\nSubmission Head (First 10 rows, showing constrained predictions):")
    print(df_submission.head(10).to_markdown(index=False))
    
    # 3. Post-Processing Constraint Check (Validation based on the first sample)
    
    if len(df_submission) >= 5:
        # Sort the first 5 rows to ensure correct mapping for constraint check
        df_check = df_submission.head(5).sort_values(by=SUBMISSION_ID_COL)
        
        # Mapping values based on the component name in sample_id
        T = df_check[df_check[SUBMISSION_ID_COL].str.contains('Total_g')][TARGET_COL].iloc[0]
        M = df_check[df_check[SUBMISSION_ID_COL].str.contains('GDM_g')][TARGET_COL].iloc[0]
        G = df_check[df_check[SUBMISSION_ID_COL].str.contains('Green_g')][TARGET_COL].iloc[0]
        D = df_check[df_check[SUBMISSION_ID_COL].str.contains('Dead_g')][TARGET_COL].iloc[0]
        C = df_check[df_check[SUBMISSION_ID_COL].str.contains('Clover_g')][TARGET_COL].iloc[0]
        
        # Check Total Derivation: T = M + D
        total_derived_check = M + D
        
        # Check GDM Derivation: M = G + C
        gdm_derived_check = G + C
        
        print("\n--- Biological Constraint Check (First Sample) ---")
        print(f"Dry_Total_g (T): {T:.4f} | GDM_g (M): {M:.4f} | Dry_Green_g (G): {G:.4f}")
        
        # Check if derived components match the total/GDM:
        if np.isclose(T, total_derived_check, atol=EPS * 10):
            print(f"✅ PASS: Dry_Total_g (T={T:.4f}) matches GDM + Dry_Dead ({total_derived_check:.4f})")
        else:
            print(f"❌ FAIL: Dry_Total_g ({T:.4f}) should equal GDM + Dry_Dead ({total_derived_check:.4f})")

        if np.isclose(M, gdm_derived_check, atol=EPS * 10):
            print(f"✅ PASS: GDM_g (M={M:.4f}) matches Dry_Green + Dry_Clover ({gdm_derived_check:.4f})")
        else:
            print(f"❌ FAIL: GDM_g ({M:.4f}) should equal Dry_Green + Dry_Clover ({gdm_derived_check:.4f})")
    
    print("-" * 50)

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import os
from tensorflow.keras.preprocessing.image import load_img, img_to_array
import warnings

# Suppress the specific FutureWarning from seaborn
warnings.filterwarnings("ignore", category=FutureWarning, module="seaborn")

# --- Configuration (redefined here so the cell can run on its own) ---
DATA_PATH = '/kaggle/input/csiro-biomass/'
TRAIN_CSV = os.path.join(DATA_PATH, 'train.csv')
TEST_CSV = os.path.join(DATA_PATH, 'test.csv') 
IMAGE_PATH_COL = 'image_path'
TARGET_COL = 'target'
TARGET_NAMES = ['Dry_Green_g', 'Dry_Dead_g', 'Dry_Clover_g', 'GDM_g', 'Dry_Total_g']
IMG_SIZE = (128, 128)

# --- 1. Load Data ---
# Read both CSVs directly so this cell also works after a kernel reset
df_train_long = pd.read_csv(TRAIN_CSV)
df_test_long = pd.read_csv(TEST_CSV)

# --- 2. Target Variable Analysis (Training Data) ---
print("--- Target Variable Distribution Analysis (Training Data) ---")

# Plot the distribution of the 'target' column (all components combined)
plt.figure(figsize=(12, 5))
# This plotting call generates the FutureWarning
sns.histplot(df_train_long[TARGET_COL], bins=50, kde=True, log_scale=False)
plt.title(f'Distribution of Raw Biomass Target ({TARGET_COL})')
plt.xlabel('Biomass (grams)')
plt.show()

# Print descriptive statistics for raw targets
print("\nDescriptive Statistics for Raw Targets (All Components):")
print(df_train_long[TARGET_COL].describe().to_markdown())

# Plot individual distributions for the five components
df_pivot = df_train_long.pivot_table(
    index=IMAGE_PATH_COL,
    columns='target_name',
    values=TARGET_COL
)

plt.figure(figsize=(15, 8))
df_pivot[TARGET_NAMES].boxplot()
plt.title('Box Plot of Biomass Distribution by Component')
plt.ylabel('Biomass (grams)')
plt.grid(True, axis='y')
plt.show()

# --- 3. Image Visualization (Train and Test) ---
print("\n--- Sample Image Visualization ---")

# Get a sample train image path
train_image_path = os.path.join(DATA_PATH, df_train_long[IMAGE_PATH_COL].iloc[0])
# Get the first test image path
test_image_path = os.path.join(DATA_PATH, df_test_long[IMAGE_PATH_COL].iloc[0])

fig, axes = plt.subplots(1, 2, figsize=(12, 6))

# Load and plot a TRAIN image
try:
    img_train = load_img(train_image_path)
    axes[0].imshow(img_train)
    axes[0].set_title(f'Sample Train Image\n{os.path.basename(train_image_path)}')
    axes[0].axis('off')
except FileNotFoundError:
    axes[0].set_title('Train Image Not Found')

# Load and plot the TEST image
try:
    img_test = load_img(test_image_path)
    axes[1].imshow(img_test)
    axes[1].set_title(f'Sample Test Image\n{os.path.basename(test_image_path)}')
    axes[1].axis('off')
except FileNotFoundError:
    axes[1].set_title('Test Image Not Found')

plt.tight_layout()
plt.show()

print("\nEDA complete. Review the plots for distribution analysis.")