## Analysis of a Simplified CSIRO Approach
Goal: This notebook is an educational case study that demonstrates common pitfalls in predicting pasture biomass from images.
Note: We deconstruct a simplified version of a pasture biomass regression model to highlight flawed assumptions and their impact on performance. The code presented here is for illustrative purposes and should not be used for a competitive submission.

## Thanks to takaito for the public release! (https://www.kaggle.com/code/takaito/csiro-img2bio-training-notebook)

## Imports
Basic libraries we need for image processing and neural networks

In [None]:
import os
import random
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from PIL import Image
import torchvision.transforms as T
import timm

## Configuration
Setting up hyperparameters - using smaller image size for faster processing!

In [None]:
class Config:
    IMG_SIZE = 224  # Standard ImageNet size - much faster!
    MODEL_NAME = "resnet18"  # Lighter model
    BATCH_SIZE = 16
    NUM_WORKERS = 2
    DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    TARGET_COLS = ["Dry_Green_g", "Dry_Dead_g", "Dry_Clover_g", "GDM_g", "Dry_Total_g"]

## Helper Functions
Extract sample IDs from image paths

In [None]:
def extract_id(filepath):
    return os.path.basename(filepath).split('_')[0]

## Dataset Class
Loads and preprocesses images for inference

In [None]:
class BiomassDataset(Dataset):
    def __init__(self, dataframe, transform=None):
        self.df = dataframe
        self.transform = transform
    
    def __len__(self):
        return len(self.df)
    
    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        image_path = os.path.join('/kaggle/input/csiro-biomass/', row["image_path"])
        image = Image.open(image_path).convert("RGB")
        if self.transform:
            image = self.transform(image)
        return image

## Model Definition
Simple regression model - no fancy stuff needed!

In [None]:
class BiomassModel(nn.Module):
    def __init__(self, model_name, output_dim):
        super().__init__()
        self.model = timm.create_model(model_name, pretrained=False, num_classes=output_dim)
    
    def forward(self, x):
        return self.model(x)

## Load Test Data
Reading the test set CSV

In [None]:
test_df = pd.read_csv('/kaggle/input/csiro-biomass/test.csv')
test_df['sample_id'] = test_df['image_path'].apply(extract_id)
print(f"Test samples: {len(test_df)}")

## Image Transforms
Simple resize and convert to tensor - keeping it minimal!

In [None]:
transforms = T.Compose([
    T.Resize((Config.IMG_SIZE, Config.IMG_SIZE)),
    T.ToTensor(),
])

## Create DataLoader
Batch loading for efficient processing

In [None]:
dataset = BiomassDataset(test_df, transform=transforms)
dataloader = DataLoader(
    dataset,
    batch_size=Config.BATCH_SIZE,
    shuffle=False,
    num_workers=Config.NUM_WORKERS
)

## Run Inference
Load trained model and make predictions on test set

In [None]:
# Load the best model (just fold 0 is enough)
model = BiomassModel(
    model_name=Config.MODEL_NAME,
    output_dim=len(Config.TARGET_COLS)
)

model_path = "/kaggle/input/csiro-img2bio-training-notebook/model_fold0.pth"
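# map_location lets a checkpoint saved on GPU load on a CPU-only machine as well
model.load_state_dict(torch.load(model_path, map_location=Config.DEVICE))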
model.to(Config.DEVICE)
model.eval()

# Get predictions
predictions = []
for images in dataloader:
    images = images.to(Config.DEVICE)
    with torch.no_grad():
        preds = model(images)
    predictions.append(preds.cpu().numpy())

all_preds = np.concatenate(predictions)
print(f"Predictions shape: {all_preds.shape}")

## Create Submission
Format predictions into submission file

In [None]:
submission = pd.DataFrame(all_preds, columns=Config.TARGET_COLS)
submission['sample_id'] = test_df['sample_id'].values

# Reshape to submission format
submission = submission.set_index('sample_id')
submission = submission.stack().reset_index()
submission.columns = ['sample_id', 'variable', 'target']
submission['sample_id'] = submission['sample_id'] + '__' + submission['variable']

# Save
submission[['sample_id', 'target']].to_csv('submission.csv', index=False)
print("Submission created!")
print(submission.head(10))



### 1. **Tiny Images (224 vs 1000)**
- Original uses 1000x1000 images
- This uses 224x224 (standard ImageNet size)
- **Impact:** Keeps only about 5% of the original pixels (224² / 1000² ≈ 0.05). Biomass estimation relies on fine-grained texture, and most of it is lost at this resolution
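
The pixel loss can be checked with simple arithmetic. This is a minimal sketch that assumes square 1000x1000 source images, as stated above; the variable names are illustrative.

```python
# Fraction of pixels that survive resizing a square image
# from 1000 px per side down to 224 px per side.
original_side = 1000
resized_side = 224

kept_fraction = (resized_side ** 2) / (original_side ** 2)
lost_fraction = 1.0 - kept_fraction

print(f"kept: {kept_fraction:.1%}, lost: {lost_fraction:.1%}")
# kept: 5.0%, lost: 95.0%
```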

### 2. **No Image Normalization**
- Missing: `T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])`
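- **Impact:** The model expects ImageNet-normalized inputs but receives raw 0-1 tensors, so every channel is shifted and scaled away from the distribution the weights were trained on
- This mismatch alone can severely degrade predictions, even with well-trained weights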
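
To make the mismatch concrete, the sketch below applies the same per-channel formula that `T.Normalize` uses, `(x - mean) / std`, with NumPy standing in for a tensor. The helper name `normalize_chw` and the mid-grey test image are assumptions for illustration; in the notebook itself the fix would be to append the `T.Normalize(...)` call quoted above after `T.ToTensor()`.

```python
import numpy as np

# ImageNet channel statistics quoted in the text above.
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def normalize_chw(image):
    """Normalize a (3, H, W) float image in [0, 1], channel by channel."""
    mean = IMAGENET_MEAN[:, None, None]  # reshape to (3, 1, 1) for broadcasting
    std = IMAGENET_STD[:, None, None]
    return (image - mean) / std


# Stand-in for the output of T.ToTensor(): a mid-grey image, every value 0.5.
raw = np.full((3, 4, 4), 0.5, dtype=np.float32)
normalized = normalize_chw(raw)

# The same pixel looks very different to the network before and after.
print("raw pixel:       ", raw[:, 0, 0])
print("normalized pixel:", normalized[:, 0, 0])
```

A model trained on the normalized version sees values centered near zero; the raw version keeps everything in the 0-1 range, which is a different input distribution.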

### 3. **No Test-Time Augmentation (TTA)**
- Original has `TTAWrapper` that does 4x augmentation (flip horizontal, vertical, both)
- This code: completely removed!
- **Impact:** Gives up the error reduction that averaging over flipped views provides (an estimated ~5-10%)
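
The original `TTAWrapper` is not reproduced here. As a rough sketch of the idea, flip-based TTA can be written as an average over four views of each batch. The function `predict_with_flip_tta` and the toy `toy_predict` model are hypothetical names, and NumPy arrays stand in for tensors.

```python
import numpy as np


def predict_with_flip_tta(predict_fn, images):
    """Average predictions over the original, horizontal, vertical, and combined flips.

    images: array of shape (N, C, H, W); predict_fn maps it to (N, num_targets).
    """
    views = [
        images,                    # original
        images[:, :, :, ::-1],     # horizontal flip (reverse width axis)
        images[:, :, ::-1, :],     # vertical flip (reverse height axis)
        images[:, :, ::-1, ::-1],  # both flips
    ]
    preds = [predict_fn(np.ascontiguousarray(view)) for view in views]
    return np.mean(preds, axis=0)


def toy_predict(batch):
    """Hypothetical stand-in model: only looks at the left half of each image."""
    left_half = batch[:, :, :, : batch.shape[3] // 2]
    return left_half.mean(axis=(1, 2, 3))[:, None]


rng = np.random.default_rng(0)
images = rng.random((2, 3, 4, 4))

single = toy_predict(images)
tta = predict_with_flip_tta(toy_predict, images)
print(single.ravel(), tta.ravel())
```

The toy model is deliberately sensitive to orientation, so the single-view and TTA predictions differ; averaging over flips removes that sensitivity.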

### 4. **Single Model (no ensemble)**
- Original uses 3 folds and averages predictions
- This uses only fold 0
- **Impact:** No ensemble benefit, worse generalization
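
Fold averaging itself is a one-liner once each fold's predictions are available. The sketch below assumes every fold produces an array of shape `(num_samples, num_targets)`; the function name and the numbers are made up for illustration.

```python
import numpy as np


def average_fold_predictions(fold_preds):
    """Average a list of (num_samples, num_targets) prediction arrays across folds."""
    stacked = np.stack(fold_preds, axis=0)  # shape: (num_folds, num_samples, num_targets)
    return stacked.mean(axis=0)


# Made-up predictions from three folds for one sample and two targets.
fold_preds = [
    np.array([[10.0, 2.0]]),
    np.array([[12.0, 4.0]]),
    np.array([[14.0, 3.0]]),
]
ensemble = average_fold_predictions(fold_preds)
print(ensemble)
```

In the notebook, this would mean running the inference loop once per fold checkpoint and averaging the resulting `all_preds` arrays.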

### 5. **Weaker Model Architecture**
- ResNet18 vs EfficientNet-B2
- ResNet18: ~11M parameters
- EfficientNet-B2: ~9M parameters but a much better efficiency/accuracy tradeoff
- **Impact:** Weaker feature extraction for complex biomass patterns, even though ResNet18 has slightly more parameters

### 6. **No Seed Setting**
- Completely removed `set_seed()` function
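- **Impact:** Results are not guaranteed to be reproducible. This matters most for training; the inference loop here uses no shuffling or random augmentation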
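
The body of the original `set_seed()` is not shown in this notebook, so the version below is an assumption about what such a helper typically contains. It seeds Python and NumPy, and seeds PyTorch only if it is installed.

```python
import random

import numpy as np


def set_seed(seed=42):
    """Seed the common random number generators (a typical pattern, not the original code)."""
    random.seed(seed)
    np.random.seed(seed)
    try:
        import torch
    except ImportError:
        return  # PyTorch is not installed; nothing more to seed
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # silently ignored when CUDA is unavailable
    # Trade some speed for deterministic cuDNN behaviour.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


set_seed(0)
first = (random.random(), np.random.rand())
set_seed(0)
second = (random.random(), np.random.rand())
print(first == second)
```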

### 7. **No PyTorch Lightning**
- Simplified to pure PyTorch
- **Impact:** Loses Lightning's training infrastructure. This is not critical for inference, but it leaves the code less structured than the original

### 8. **No Duplicate Removal**
- Missing: `test_df.drop_duplicates(subset=['image_path'])`
- **Impact:** If test.csv has duplicate rows, we'll predict the same image multiple times and the submission will contain repeated `sample_id` rows
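
A small example of the missing step, using a made-up table rather than the real `test.csv` (the paths below are hypothetical):

```python
import pandas as pd

# Hypothetical rows: the first image appears twice.
test_df = pd.DataFrame({
    "image_path": ["test/img_a.jpg", "test/img_a.jpg", "test/img_b.jpg"],
})

# Keep the first occurrence of each image and renumber the rows.
unique_images = test_df.drop_duplicates(subset=["image_path"]).reset_index(drop=True)

print(len(test_df), "rows ->", len(unique_images), "unique images")
# 3 rows -> 2 unique images
```

Running inference on `unique_images` instead of `test_df` avoids both the wasted computation and the repeated rows in the output.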

### 9. **No Output Clipping**
- Missing: `.clip(0, 200)`
- **Impact:** Can produce negative biomass (physically impossible!) or values above 200 g
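
Clipping is a single call on the prediction array. The values below are invented to show both bounds; the 0 to 200 range is the one quoted above.

```python
import numpy as np

# Made-up raw model outputs, including a negative value and one above 200.
raw_preds = np.array([
    [-3.2, 15.0, 250.0],
    [0.0, 199.9, 42.5],
])

# Force every prediction into the range [0, 200].
clipped = np.clip(raw_preds, 0, 200)
print(clipped)
```

In the notebook, the equivalent would be applying `.clip(0, 200)` to `all_preds` before building the submission DataFrame.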

### 10. **Wrong Model Loading**
- Trying to load EfficientNet-B2 weights into ResNet18!
- **Impact:** `load_state_dict` raises a `RuntimeError` because the checkpoint's parameter names and shapes do not match the ResNet18 model
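
One way to diagnose this kind of failure before calling `load_state_dict` is to compare parameter names. The helper `diff_state_dict_keys` below is hypothetical, and the key lists are short illustrative samples, not the contents of real checkpoints.

```python
def diff_state_dict_keys(model_keys, checkpoint_keys):
    """Return (missing, unexpected) key lists, mirroring the two groups
    that a strict load_state_dict call reports on failure."""
    model_keys = set(model_keys)
    checkpoint_keys = set(checkpoint_keys)
    missing = sorted(model_keys - checkpoint_keys)      # model needs these, checkpoint lacks them
    unexpected = sorted(checkpoint_keys - model_keys)   # checkpoint has these, model does not
    return missing, unexpected


# Illustrative key names only; real state dicts contain many more entries.
resnet_like = ["model.conv1.weight", "model.fc.weight", "model.fc.bias"]
effnet_like = ["model.conv_stem.weight", "model.classifier.weight", "model.classifier.bias"]

missing, unexpected = diff_state_dict_keys(resnet_like, effnet_like)
print("missing:", missing)
print("unexpected:", unexpected)
```

With real objects, the two arguments would be `model.state_dict().keys()` and the keys of the dictionary returned by `torch.load(model_path)`. If both lists come back empty, the names match, although tensor shapes could still differ.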

## Performance Impact:

If this code could actually run (after fixing the architecture mismatch), a rough estimate of the expected performance is:

- **Original Model:** ~2.5 RMSE
- **This Version:** ~8-12 RMSE (3-5x worse!)

Biggest contributors (estimated shares):
1. Small image size: ~40% of degradation
2. No normalization: ~30% of degradation
3. No TTA + single fold: ~20% of degradation
4. Weaker architecture: ~10% of degradation

## Key Lessons:

1. **Image resolution matters** - Don't downsample beyond what your task needs
2. **Preprocessing must match training** - Normalization is crucial for pretrained models
3. **Ensembles help** - Multi-fold + TTA provides significant gains
4. **Model architecture matters** - EfficientNet was designed for a strong accuracy/efficiency tradeoff
5. **Domain constraints** - Clip outputs to physically valid ranges
6. **Reproducibility** - Always set seeds
7. **Data quality** - Remove duplicates to avoid redundant predictions and repeated submission rows

---

**Bonus:** The code won't even run because we're loading EfficientNet-B2 weights into a ResNet18 model! 