
# CSIRO Image2Biomass — Deep EDA (No Modeling)

This notebook provides a comprehensive exploration to help you understand:
- The structure and diversity of the training data
- Relationships between features (e.g., NDVI, height, species) and biomass targets
- Distribution and variability across states, species, and sampling dates
- Visual patterns in pasture images related to biomass conditions

The following baseline scores about 0.5 on the leaderboard (LB) but gives weak results on two targets:
https://www.kaggle.com/code/mathieuduverne/csiro-simple-baseline

## Additional EDA – Investigating Weak Targets (Dry_Dead_g & Dry_Clover_g)

The model performs poorly on **Dry_Dead_g** and **Dry_Clover_g** compared to the other biomass components.  
In this section, we investigate potential reasons:
1. Check their distributions and outliers.  
2. Explore correlations with NDVI and height.  
3. Visualize sample images.  
4. Evaluate label consistency and potential noise.  
5. Examine model prediction patterns later (optional link to modeling notebook).  
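
As a rough sketch of step 1, the helper below summarises how sparse and heavy-tailed each target is: the share of exact zeros, the number of points above the usual 1.5×IQR fence, and the skewness. The function name `sparsity_report` and the toy frame are illustrative assumptions, not part of the competition data; only the `target_name` and `target` column names are taken from `train.csv`.

```python
import pandas as pd

def sparsity_report(long_df, targets):
    """Per-target zero share, high-outlier count (1.5*IQR rule) and skew.

    Hypothetical helper: expects the long layout with `target_name`
    and `target` columns.
    """
    rows = []
    for name in targets:
        vals = long_df.loc[long_df["target_name"] == name, "target"].dropna()
        q1, q3 = vals.quantile([0.25, 0.75])
        upper_fence = q3 + 1.5 * (q3 - q1)
        rows.append({
            "target_name": name,
            "n": len(vals),
            "zero_share": float((vals == 0).mean()),
            "n_high_outliers": int((vals > upper_fence).sum()),
            "skew": float(vals.skew()),
        })
    return pd.DataFrame(rows).set_index("target_name")

# Toy data, made up for illustration (not real measurements)
toy = pd.DataFrame({
    "target_name": ["Dry_Clover_g"] * 6 + ["Dry_Dead_g"] * 6,
    "target": [0, 0, 0, 0, 1, 20, 2, 3, 4, 5, 6, 7],
})
report = sparsity_report(toy, ["Dry_Clover_g", "Dry_Dead_g"])
print(report)
```

On the real data the same call would take `train_df` in place of `toy`. A high `zero_share` would point towards zero-inflation rather than ordinary noise as a reason for weak fits.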


## 1. Setup & Data Loading

In [None]:

import os
import sys
import gc
import math
import json
import random
from pathlib import Path
from collections import Counter, defaultdict
from datetime import datetime

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from PIL import Image

plt.rcParams["figure.figsize"] = (8, 5)
plt.rcParams["axes.grid"] = True
plt.rcParams["figure.dpi"] = 110

RANDOM_SEED = 42
random.seed(RANDOM_SEED)
np.random.seed(RANDOM_SEED)

DATA_DIR = Path('/kaggle/input/csiro-biomass')
TRAIN_IMG_DIR = DATA_DIR / 'train'
TEST_IMG_DIR = DATA_DIR / 'test'
TRAIN_CSV = DATA_DIR / 'train.csv'
TEST_CSV = DATA_DIR / 'test.csv'
SAMPLE_SUB_CSV = DATA_DIR / 'sample_submission.csv'

print("Data directory exists:", DATA_DIR.exists())
print("Train images dir exists:", TRAIN_IMG_DIR.exists())
print("Test images dir exists:", TEST_IMG_DIR.exists())
print("CSV files present:", TRAIN_CSV.exists(), TEST_CSV.exists(), SAMPLE_SUB_CSV.exists())

def safe_read_csv(path, **kwargs):
    try:
        df = pd.read_csv(path, **kwargs)
        print(f"Loaded: {path.name} → shape={df.shape}")
        return df
    except Exception as e:
        print(f"Could not load {path}: {e}")
        return pd.DataFrame()

train_df = safe_read_csv(TRAIN_CSV)
test_df  = safe_read_csv(TEST_CSV)
sub_df   = safe_read_csv(SAMPLE_SUB_CSV)

# Basic overview
def df_overview(df, name):
    print(f"\n{name} — shape {df.shape}")
    display(df.head(3))
    print("\nColumns:", list(df.columns))
    if df.select_dtypes(include=[np.number]).shape[1] > 0:
        display(df.describe(include='all').T)

df_overview(train_df, "train.csv")
df_overview(test_df, "test.csv")
df_overview(sub_df, "sample_submission.csv")


## 2. Data Dictionary & Context


**Files & Folders**
- `train/` — All training images (JPEG)
- `test/` — All test images (JPEG)
- `train.csv` — Metadata + ground-truth biomass per sample
- `test.csv` — Metadata + target names for required predictions
- `sample_submission.csv` — Submission format example

**Columns in `train.csv`**

| Column | Description |
|---|---|
| `sample_id` | Unique identifier for each sample (one per image). |
| `image_path` | Relative path to the image (e.g., `train/ID1098771283.jpg`). |
| `Sampling_Date` | Date of data collection. |
| `State` | Australian state where the sample was collected. |
| `Species` | Pasture species present, ordered by biomass (underscore-separated). |
| `Pre_GSHH_NDVI` | GreenSeeker NDVI reading. |
| `Height_Ave_cm` | Average pasture height (cm) measured by falling plate. |
| `target_name` | Biomass component type: `Dry_Green_g`, `Dry_Dead_g`, `Dry_Clover_g`, `GDM_g`, `Dry_Total_g`. |
| `target` | Ground-truth biomass value (grams). |
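
The table describes a long layout: each row carries one `target_name` and one `target` value. A minimal sketch of reshaping it to one row per image follows. It checks how many rows each candidate key covers before pivoting, because a key that turned out to be unique per row would leave every pivoted row mostly empty. The helper name `to_wide`, the fallback to `image_path`, and the toy values are assumptions made for illustration.

```python
import pandas as pd

def to_wide(long_df, key):
    """Pivot long (key, target_name, target) rows to one row per key."""
    dup = long_df.duplicated(subset=[key, "target_name"]).sum()
    if dup:
        raise ValueError(f"{dup} duplicate ({key}, target_name) pairs")
    return long_df.pivot(index=key, columns="target_name", values="target")

# Toy data (invented): two images, two targets each.
# Here the id is deliberately unique per row, to exercise the fallback.
toy = pd.DataFrame({
    "sample_id": ["a1", "a2", "b1", "b2"],
    "image_path": ["train/a.jpg", "train/a.jpg", "train/b.jpg", "train/b.jpg"],
    "target_name": ["Dry_Dead_g", "Dry_Total_g"] * 2,
    "target": [1.0, 5.0, 2.0, 9.0],
})

# If no id covers more than one row, it cannot group the targets of an image
rows_per_id = toy.groupby("sample_id").size().max()
key = "sample_id" if rows_per_id > 1 else "image_path"

wide = to_wide(toy, key)
print(key, wide.shape)
```

Running the same row-count check on `train_df` before any `pivot_table` call is a cheap way to confirm that the chosen index really yields one complete row per image.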

**Biomass components**
- `Dry_Green_g`: mass of green vegetation (excluding clover)
- `Dry_Dead_g`: mass of dry/dead material
- `Dry_Clover_g`: mass of dry clover biomass
- `GDM_g`: green dry matter
- `Dry_Total_g`: total dry biomass
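
One way to probe label consistency is to test whether the components add up. The sketch below assumes `GDM_g` ≈ `Dry_Green_g` + `Dry_Clover_g` and `Dry_Total_g` ≈ `GDM_g` + `Dry_Dead_g`. That additive relationship is an assumption suggested by the component names, not something stated in this notebook, so the residuals should be read as a diagnostic only. `additivity_residuals`, the 0.5 g tolerance, and the toy values are hypothetical.

```python
import pandas as pd

def additivity_residuals(wide):
    """Residuals of the ASSUMED additive relations (one column per target)."""
    out = pd.DataFrame(index=wide.index)
    out["gdm_resid"] = wide["GDM_g"] - (wide["Dry_Green_g"] + wide["Dry_Clover_g"])
    out["total_resid"] = wide["Dry_Total_g"] - (wide["GDM_g"] + wide["Dry_Dead_g"])
    return out

# Toy wide table (invented); the second row is deliberately inconsistent
toy_wide = pd.DataFrame({
    "Dry_Green_g": [10.0, 4.0],
    "Dry_Clover_g": [2.0, 0.0],
    "Dry_Dead_g": [3.0, 1.0],
    "GDM_g": [12.0, 4.0],
    "Dry_Total_g": [15.0, 6.5],
}, index=["s1", "s2"])

resid = additivity_residuals(toy_wide)
tolerance_g = 0.5  # arbitrary threshold chosen for the example
flagged = resid[(resid.abs() > tolerance_g).any(axis=1)]
print(flagged)
```

If the relations hold on the real data, large residuals would flag candidate label noise; if they do not hold at all, the assumption itself is wrong and the check should be dropped.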

**Ecological relevance.** These components track forage availability and composition. Green matter and clover relate to nutritional quality, while dead material reflects senescence and seasonal dynamics. Understanding their balance helps optimise grazing decisions for animal welfare and soil health.


## 3. Missing Values & Data Integrity

In [None]:

# Unique checks & path consistency
def integrity_checks(df):
    out = {}
    if 'sample_id' in df.columns:
        out['unique_sample_ids'] = df['sample_id'].is_unique
        out['n_sample_ids'] = df['sample_id'].nunique()
    if 'image_path' in df.columns:
        out['image_path_nulls'] = int(df['image_path'].isna().sum())
        out['image_path_examples'] = df['image_path'].dropna().head(3).tolist()
    return out

print("train integrity:", integrity_checks(train_df))
print("test integrity:", integrity_checks(test_df))

# Missing values
def missing_table(df):
    miss = df.isna().sum()
    prop = (miss / len(df)).round(4) if len(df) > 0 else miss * 0
    m = pd.DataFrame({"missing": miss, "proportion": prop})
    return m[m["missing"] > 0].sort_values("missing", ascending=False)

print("\nMissing values in train:")
display(missing_table(train_df))

print("\nMissing values in test:")
display(missing_table(test_df))

# Duplicates
if not train_df.empty:
    dup_rows = train_df.duplicated().sum()
    dup_sample = train_df.duplicated(subset=['sample_id']).sum() if 'sample_id' in train_df.columns else np.nan
    print(f"\nDuplicate full rows in train: {dup_rows}")
    print(f"Duplicate sample_id in train: {dup_sample}")

# Verify that image paths exist
def count_existing_paths(paths):
    ok = 0
    total = 0
    for p in paths:
        total += 1
        full = DATA_DIR / p
        if full.exists():
            ok += 1
    return ok, total

if 'image_path' in train_df.columns and len(train_df) > 0:
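    # Size the sample from the non-null paths so sample() cannot exceed the population
    train_paths = train_df['image_path'].dropna()
    sample_paths = train_paths.sample(min(500, len(train_paths)), random_state=RANDOM_SEED)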
    ok, total = count_existing_paths(sample_paths)
    print(f"\nTrain image path existence (sample {total}): {ok}/{total} found.")

if 'image_path' in test_df.columns and len(test_df) > 0:
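    # Size the sample from the non-null paths so sample() cannot exceed the population
    test_paths = test_df['image_path'].dropna()
    sample_paths = test_paths.sample(min(500, len(test_paths)), random_state=RANDOM_SEED)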
    ok, total = count_existing_paths(sample_paths)
    print(f"Test image path existence (sample {total}): {ok}/{total} found.")


## 4. Target Distribution Analysis

In [None]:

# Pivot so each target_name becomes a column (one row per sample_id)
if not train_df.empty:
    piv = (train_df
           .pivot_table(index='sample_id', columns='target_name', values='target', aggfunc='first')
           .reset_index())
    display(piv.head())

    target_cols = [c for c in piv.columns if c != 'sample_id']
    print("Target columns:", target_cols)

    # Histograms for each target
    for col in target_cols:
        plt.figure()
        plt.hist(piv[col].dropna(), bins=50)
        plt.title(f"Histogram: {col}")
        plt.xlabel("value (g)")
        plt.ylabel("count")
        plt.show()

    # Boxplots (DataFrame.plot creates its own figure, so no plt.figure() call is needed)
    piv[target_cols].plot(kind='box', rot=45)
    plt.title("Boxplots of biomass targets")
    plt.ylabel("value (g)")
    plt.show()

    # Skewness
    skewness = piv[target_cols].skew(numeric_only=True).sort_values(ascending=False)
    display(pd.DataFrame({"skew": skewness}))
else:
    print("train_df is empty; skipping target distribution analysis.")


## 5. Metadata Exploration

In [None]:

meta_cols = ['Sampling_Date','State','Species','Pre_GSHH_NDVI','Height_Ave_cm','target_name','target']
display(train_df[meta_cols].head() if set(meta_cols).issubset(train_df.columns) else train_df.head())

# Parse date & add helpers
if 'Sampling_Date' in train_df.columns:
    train_df['Sampling_Date'] = pd.to_datetime(train_df['Sampling_Date'], errors='coerce')
    train_df['Year'] = train_df['Sampling_Date'].dt.year
    train_df['Month'] = train_df['Sampling_Date'].dt.month
    train_df['MonthName'] = train_df['Sampling_Date'].dt.month_name()
    train_df['Quarter'] = train_df['Sampling_Date'].dt.to_period('Q').astype(str)
    train_df['Week'] = train_df['Sampling_Date'].dt.isocalendar().week

# State-wise target distributions (per target_name)
if not train_df.empty and {'State', 'target_name','target'}.issubset(train_df.columns):
    top_states = train_df['State'].value_counts().head(10).index.tolist()
    subset = train_df[train_df['State'].isin(top_states)]
    plt.figure(figsize=(10,6))
    sns.boxplot(data=subset, x='State', y='target', hue='target_name', showfliers=False)
    plt.title("Target distributions by State (top states)")
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()

# Month-wise trends (aggregate mean)
if not train_df.empty and {'Month','target_name','target'}.issubset(train_df.columns):
    month_mean = (train_df.groupby(['Month','target_name'])['target']
                  .mean().reset_index().sort_values(['target_name','Month']))
    display(month_mean.head())
    g = sns.FacetGrid(month_mean, col="target_name", col_wrap=3, sharey=False, height=3.2)
    g.map_dataframe(sns.lineplot, x="Month", y="target")
    g.set_titles(col_template="{col_name}")
    g.fig.suptitle("Mean biomass by Month", y=1.05)
    plt.show()

# NDVI & Height relationships with targets
if not train_df.empty:
    # Take the first NDVI/height value per sample (they repeat across target_name rows)
    feats = (train_df
             .pivot_table(index='sample_id',
                          values=['Pre_GSHH_NDVI','Height_Ave_cm'],
                          aggfunc='first')
            )
    targs = (train_df
             .pivot_table(index='sample_id', columns='target_name', values='target', aggfunc='first'))
    merged = feats.join(targs, how='inner').reset_index()
    display(merged.head())

    # Scatter plots: NDVI vs each target
    targ_cols = [c for c in merged.columns if c not in ['sample_id','Pre_GSHH_NDVI','Height_Ave_cm']]
    for col in targ_cols:
        plt.figure()
        plt.scatter(merged['Pre_GSHH_NDVI'], merged[col], s=10, alpha=0.6)
        plt.title(f"Pre_GSHH_NDVI vs {col}")
        plt.xlabel("Pre_GSHH_NDVI")
        plt.ylabel(col)
        plt.show()

    # Scatter plots: Height vs target
    for col in targ_cols:
        plt.figure()
        plt.scatter(merged['Height_Ave_cm'], merged[col], s=10, alpha=0.6)
        plt.title(f"Height_Ave_cm vs {col}")
        plt.xlabel("Height_Ave_cm (cm)")
        plt.ylabel(col)
        plt.show()

    # Correlation heatmap including features
    corr_cols = ['Pre_GSHH_NDVI','Height_Ave_cm'] + targ_cols
    corr2 = merged[corr_cols].corr()
    display(corr2)
    plt.figure()
    sns.heatmap(corr2, annot=True, fmt=".2f", cmap="viridis")
    plt.title("Correlation: features & targets")
    plt.show()


## 6. Image Data Exploration

In [None]:

# Helper: open image safely
def open_image(rel_path, max_size=768):
    # Open image by relative path within DATA_DIR and optionally thumbnail it.
    path = DATA_DIR / rel_path
    with Image.open(path) as img:
        img = img.convert("RGB")
        # Resize (keep aspect) for display
        img.thumbnail((max_size, max_size))
        return img

# Random sample of images with captions (metadata + targets)
if 'image_path' in train_df.columns and len(train_df) > 0:
    print("Displaying a few random training images…")
    sample_ids = train_df['sample_id'].dropna().unique()
    show_sample = np.random.choice(sample_ids, size=min(9, len(sample_ids)), replace=False)

    # Collect metadata per sample_id
    meta_cols = ['State','Species','Sampling_Date','Pre_GSHH_NDVI','Height_Ave_cm']
    targ_piv = (train_df.pivot_table(index='sample_id', columns='target_name', values='target', aggfunc='first'))
    info = (train_df.drop_duplicates('sample_id')[['sample_id','image_path'] + [c for c in meta_cols if c in train_df.columns]]
            .set_index('sample_id').join(targ_piv, how='left'))

    ncols = 3
    nrows = int(math.ceil(len(show_sample) / ncols))
    fig, axes = plt.subplots(nrows, ncols, figsize=(12, 4*nrows))
    axes = axes.flatten() if isinstance(axes, np.ndarray) else [axes]

    for ax, sid in zip(axes, show_sample):
        try:
            rec = info.loc[sid]
            img = open_image(rec['image_path'])
            ax.imshow(img)
            ndvi = rec.get('Pre_GSHH_NDVI', np.nan)
            ndvi_str = f"{ndvi:.3f}" if pd.notna(ndvi) else "NA"
            h = rec.get('Height_Ave_cm', np.nan)
            h_str = f"{h:.1f}" if pd.notna(h) else "NA"
            caption = f"{rec.get('State','?')} | NDVI={ndvi_str} | H={h_str} cm"
            if 'Dry_Total_g' in rec.index and not pd.isna(rec['Dry_Total_g']):
                caption += f" | Dry_Total_g={rec['Dry_Total_g']:.1f}"
            ax.set_title(caption, fontsize=10)
            ax.axis('off')
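        except Exception as e:
            # Report the failure rather than silently leaving an empty panel
            print(f"Could not display sample {sid}: {e}")
            ax.axis('off')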
    for k in range(len(show_sample), len(axes)):
        axes[k].axis('off')
    plt.tight_layout()
    plt.show()

# Image statistics: RGB histograms and basic stats for a sample
def image_stats(rel_paths, max_n=200):
    vals = []
    for p in rel_paths[:max_n]:
        try:
            img = open_image(p, max_size=512)
            arr = np.asarray(img).astype(np.float32)
            vals.append({
                "path": p,
                "mean_R": float(arr[:,:,0].mean()),
                "mean_G": float(arr[:,:,1].mean()),
                "mean_B": float(arr[:,:,2].mean()),
                "std_R": float(arr[:,:,0].std()),
                "std_G": float(arr[:,:,1].std()),
                "std_B": float(arr[:,:,2].std()),
                "mean_intensity": float(arr.mean()),
                "std_intensity": float(arr.std()),
            })
        except Exception as e:
            # Skip unreadable images, but say which ones were skipped
            print(f"Skipping {p}: {e}")
            continue
    return pd.DataFrame(vals)

if 'image_path' in train_df.columns and len(train_df) > 0:
    unique_paths = train_df.drop_duplicates('sample_id')['image_path'].dropna().tolist()
    img_df = image_stats(unique_paths, max_n=200)
    print("Image stats (sample):", img_df.shape)
    display(img_df.head())

    # Channel means
    for ch in ["mean_R","mean_G","mean_B","mean_intensity"]:
        plt.figure()
        plt.hist(img_df[ch].dropna(), bins=40)
        plt.title(f"Histogram: {ch}")
        plt.xlabel(ch)
        plt.ylabel("count")
        plt.show()

    # Compare low vs high biomass (Dry_Total_g if available)
    if 'Dry_Total_g' in train_df['target_name'].unique():
        piv = train_df.pivot_table(index='sample_id', columns='target_name', values='target', aggfunc='first').reset_index()
        merged_img = (train_df.drop_duplicates('sample_id')[['sample_id','image_path']]
                      .merge(piv, on='sample_id', how='left'))
        merged_img = merged_img.merge(img_df.rename(columns={"path":"image_path"}), on='image_path', how='left')
        if 'Dry_Total_g' in merged_img.columns:
            q_low, q_high = merged_img['Dry_Total_g'].quantile([0.1, 0.9])
            low = merged_img[merged_img['Dry_Total_g'] <= q_low].dropna(subset=['mean_intensity']).head(12)
            high = merged_img[merged_img['Dry_Total_g'] >= q_high].dropna(subset=['mean_intensity']).head(12)

            def show_grid(df, title):
                n = len(df)
                ncols = 4
                nrows = int(math.ceil(n / ncols))
                fig, axes = plt.subplots(nrows, ncols, figsize=(12, 3*nrows))
                axes = axes.flatten() if isinstance(axes, np.ndarray) else [axes]
                for ax, (_, row) in zip(axes, df.iterrows()):
                    try:
                        img = open_image(row['image_path'])
                        ax.imshow(img)
                        ax.set_title(f"Dry_Total_g={row['Dry_Total_g']:.0f}", fontsize=9)
                        ax.axis('off')
                    except Exception:
                        ax.axis('off')
                for k in range(n, len(axes)):
                    axes[k].axis('off')
                fig.suptitle(title, y=0.98)
                plt.tight_layout()
                plt.show()

            show_grid(low, "Examples: Low Dry_Total_g (bottom 10%)")
            show_grid(high, "Examples: High Dry_Total_g (top 10%)")


## 7. Combined Feature Insights

In [None]:

# Merge features & targets (one row per sample)
if not train_df.empty:
    features = (train_df
                .pivot_table(index='sample_id',
                             values=['Pre_GSHH_NDVI','Height_Ave_cm','State','Species','Sampling_Date'],
                             aggfunc='first'))
    targets = (train_df.pivot_table(index='sample_id', columns='target_name', values='target', aggfunc='first'))
    dfm = features.join(targets, how='left').reset_index()

    # Grouped means by State
    if 'State' in dfm.columns:
        g_state = (dfm.groupby('State')[targets.columns.tolist()].mean().sort_index())
        display(g_state.head())
        g_state.plot(kind='bar', subplots=True, layout=(math.ceil(len(targets.columns)/2), 2), figsize=(12,8), legend=False, sharex=True, sharey=False)
        plt.suptitle("Mean biomass targets by State")
        plt.tight_layout()
        plt.show()

    # Species frequency & contribution
    if 'Species' in dfm.columns:
        species_counts = dfm['Species'].value_counts().head(25)
        plt.figure(figsize=(10,6))
        species_counts.sort_values(ascending=True).plot(kind='barh')
        plt.title("Top Species combinations (by count)")
        plt.xlabel("count")
        plt.tight_layout()
        plt.show()

        # Per-species (top 10) mean Dry_Total_g, if available
        top10 = species_counts.head(10).index.tolist()
        if 'Dry_Total_g' in dfm.columns:
            sp_mean = dfm[dfm['Species'].isin(top10)].groupby('Species')['Dry_Total_g'].mean().sort_values()
            plt.figure(figsize=(10,6))
            sp_mean.plot(kind='barh')
            plt.title("Mean Dry_Total_g for top-10 Species combos")
            plt.xlabel("Dry_Total_g (g)")
            plt.tight_layout()
            plt.show()

    # Sampling period comparisons
    if 'Sampling_Date' in dfm.columns:
        dfm['Year'] = pd.to_datetime(dfm['Sampling_Date'], errors='coerce').dt.year
        dfm['Month'] = pd.to_datetime(dfm['Sampling_Date'], errors='coerce').dt.month
        # Multi-panel comparison across targets
        ncols = 2
        tcols = [c for c in targets.columns]
        nrows = int(math.ceil(len(tcols) / ncols))
        fig, axes = plt.subplots(nrows, ncols, figsize=(12, 4*nrows))
        axes = axes.flatten() if isinstance(axes, np.ndarray) else [axes]
        for ax, col in zip(axes, tcols):
            tmp = dfm.groupby('Month')[col].mean()
            ax.plot(tmp.index, tmp.values, marker='o')
            ax.set_title(f"Monthly mean of {col}")
            ax.set_xlabel("Month")
            ax.set_ylabel(col)
        for k in range(len(tcols), len(axes)):
            axes[k].axis('off')
        plt.tight_layout()
        plt.show()


## 8. Investigating Weak Targets

In [None]:

subset = train_df[train_df["target_name"].isin(["Dry_Dead_g","Dry_Clover_g"])]

plt.figure(figsize=(10,4))
sns.histplot(data=subset, x="target", hue="target_name", kde=True, bins=30)
plt.title("Target Distribution – Dry_Dead_g vs Dry_Clover_g")
plt.xlabel("Target value (grams)")
plt.show()

# Boxplot for outliers
plt.figure(figsize=(8,4))
sns.boxplot(data=subset, x="target_name", y="target")
plt.title("Boxplot – Target spread & outliers")
plt.show()

In [None]:
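# Correlation of NDVI and height with the target, per target_name.
# (Slicing with .iloc[::3, -1] would keep only the NDVI row of each group.)
corrs = (
    subset.groupby("target_name")[["Pre_GSHH_NDVI","Height_Ave_cm","target"]]
    .corr()["target"]               # correlation of every column with target
    .drop(index="target", level=1)  # drop the trivial target-vs-target entry
)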
print("Correlation with target:")
print(corrs)


In [None]:
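import lightgbm as lgb
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_val_predict

for target in ["Dry_Dead_g","Dry_Clover_g"]:
    df_t = train_df[train_df["target_name"] == target].dropna(subset=["target"])
    X = df_t[["Pre_GSHH_NDVI","Height_Ave_cm"]]
    y = df_t["target"]
    # Score out-of-fold predictions: R² on the training rows themselves
    # would overstate how predictive the meta-features are.
    cv = KFold(n_splits=5, shuffle=True, random_state=RANDOM_SEED)
    oof = cross_val_predict(lgb.LGBMRegressor(random_state=RANDOM_SEED), X, y, cv=cv)
    r2 = r2_score(y, oof)
    print(f"{target} | out-of-fold R² from meta-features only: {r2:.3f}")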



**Summary**
- **Highly skewed or sparse distributions** → possible data imbalance.  
- **Weak correlations** → NDVI/height are only weakly predictive of these targets. The correlation between `Pre_GSHH_NDVI` and the target is 0.224 for `Dry_Clover_g` and -0.123 for `Dry_Dead_g`.
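
Pearson correlation measures only linear association and is sensitive to the zero-heavy, skewed shape noted in the first bullet, so a rank-based (Spearman) coefficient can serve as a cross-check. The sketch below computes Spearman as the Pearson correlation of ranks, using only pandas. The helper name `pearson_and_spearman` and the toy numbers are illustrative assumptions, not values from the dataset.

```python
import pandas as pd

def pearson_and_spearman(x, y):
    """Return (Pearson r, Spearman rho) for two aligned Series."""
    pair = pd.concat([x, y], axis=1).dropna()
    a, b = pair.iloc[:, 0], pair.iloc[:, 1]
    pearson = a.corr(b)
    # Spearman's rho is Pearson's r applied to average ranks (ties share a rank)
    spearman = a.rank().corr(b.rank())
    return pearson, spearman

# Toy values (invented): a monotone but strongly non-linear relationship
ndvi = pd.Series([0.2, 0.3, 0.4, 0.5, 0.6, 0.7])
clover = pd.Series([0.0, 0.0, 0.0, 1.0, 2.0, 40.0])

p, s = pearson_and_spearman(ndvi, clover)
print(f"Pearson={p:.3f}  Spearman={s:.3f}")
```

A Spearman value clearly above the Pearson value would suggest a monotone but non-linear link, in which case a transform such as `np.log1p` on the target may be worth exploring before concluding that the meta-features carry no signal.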
