# Starter pack

This notebook shows how to:

- load samples from PASTIS using the dataloaders provided in the repository
- visualize the images
- compute the challenge metric
- make a random submission

It also explains how your submissions will be processed automatically.


In [None]:
# Fill these file paths with the locations on your machine.
# PATH_TO_CODE = "/Users/louis.stefanuto.c/Documents/pastis-benchmark-mines2024/baseline/"
PATH_TO_DATA = (
    "/Users/louis.stefanuto.c/Documents/pastis-benchmark-mines2024/DATA-mini/"
)

The data folder should have this hierarchy:

```
.
├── TRAIN/
│   ├── DATA_S2/
│   │   ├── S2_10000.npy
│   │   ├── S2_10001.npy
│   │   ├── S2_10002.npy
│   │   └── ...
│   ├── ANNOTATIONS/
│   │   ├── TARGET_10000.npy
│   │   ├── TARGET_10001.npy
│   │   ├── TARGET_10002.npy
│   │   └── ...
│   └── metadata.json
└── TEST/
    ├── DATA_S2
    │   └── ...
    └── metadata.json
```
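
Before loading anything, it can help to confirm that the folder matches this layout. The following is a minimal sketch, not part of the repository: `check_layout` and `EXPECTED_ENTRIES` are hypothetical names, and only the entries listed in the tree above are checked.

```python
from pathlib import Path

# Entries taken from the tree above, relative to the data folder.
EXPECTED_ENTRIES = [
    "TRAIN/DATA_S2",
    "TRAIN/ANNOTATIONS",
    "TRAIN/metadata.json",
    "TEST/DATA_S2",
    "TEST/metadata.json",
]


def check_layout(root: Path) -> list[str]:
    """Return the expected entries that are missing under `root` (hypothetical helper)."""
    root = Path(root)
    return [entry for entry in EXPECTED_ENTRIES if not (root / entry).exists()]
```

Calling `check_layout(PATH_TO_DATA)` should return an empty list when every expected entry is present.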


## GeoJSON


The GeoJSON file contains the metadata about the patch sequences (ID, dates of the Sentinel-2 images, geometry of the parcels, ...):

- ID: unique ID of the patch
- N_Parcel: number of parcels in the patch
- Parcel_Cover: % of the patch covered by land
- region: the region the patch was sampled from; we include it in case you want to apply region-based normalization or transformations
- dates-S2: dates of the images in the sequence for this patch
- geometry: multipolygon shape describing where the parcels are


In [None]:
from pathlib import Path
import geopandas as gpd

METADATA = Path(PATH_TO_DATA) / "metadata.geojson"

mtd = gpd.read_file(METADATA)
mtd

## Sentinel images - Loading and viz


In [None]:
import torch
import numpy as np
import pandas as pd

In [None]:
from baseline.dataset import BaselineDataset
from baseline.collate import pad_collate

# Set seeds for PyTorch and numpy
torch.manual_seed(1234)
np.random.seed(1234)

dt = BaselineDataset(Path(PATH_TO_DATA))

dl = torch.utils.data.DataLoader(
    dt, batch_size=32, collate_fn=pad_collate, shuffle=True
)

Let's query a batch from the dataloader:


In [None]:
x, y = next(iter(dl))

Each batch is made of:

- an **image tensor** of shape $(B, T, C, H, W)$, where:
  - $B$ is the batch size
  - $T$ is the temporal length of the image time series
  - $(C, H, W)$ is the dimension of the patches, channel first
- a **label tensor** of shape $(B, H, W)$, the semantic segmentation ground truth mask (with values in $[0, 19]$)
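
The name `pad_collate` suggests that sequences of different temporal lengths are padded to a common $T$ before being stacked into a batch. The sketch below illustrates that idea with zero padding in NumPy. It only illustrates the shapes involved: `pad_and_stack` is a hypothetical helper, not the repository's `pad_collate`, whose padding strategy may differ.

```python
import numpy as np


def pad_and_stack(sequences: list[np.ndarray], pad_value: float = 0.0) -> np.ndarray:
    """Pad (T_i, C, H, W) arrays along time to a common length and stack them."""
    t_max = max(seq.shape[0] for seq in sequences)
    padded = []
    for seq in sequences:
        n_missing = t_max - seq.shape[0]
        # Pad only the time axis, at the end of the sequence.
        pad_width = [(0, n_missing)] + [(0, 0)] * (seq.ndim - 1)
        padded.append(np.pad(seq, pad_width, constant_values=pad_value))
    return np.stack(padded)  # shape (B, T_max, C, H, W)


# Toy example: two sequences with 3 and 5 dates, 10 channels, 8x8 pixels.
batch = pad_and_stack([np.ones((3, 10, 8, 8)), np.ones((5, 10, 8, 8))])
print(batch.shape)  # (2, 5, 10, 8, 8)
```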


In [None]:
print(x["S2"].shape)
print(y.shape)

Ok, we have a batch of sequences and their corresponding mask labels. First let's have a look at a single sequence:

In [None]:
from baseline.viz import plot_sequence_grid

IDX_IMG = 1  # Choose your sequence in [0, BATCH_SIZE-1]

plot_sequence_grid(x, bid=IDX_IMG, N=15, M=3)

Let's plot a single image from this sequence (left) and its corresponding mask (right). The mask is your target, what you have to predict.

In [None]:
from baseline.viz import plot_s2_and_labels

plot_s2_and_labels(x, y, bid=IDX_IMG, t_show=1)

## Metrics

The challenge metric is the mean Intersection over Union.

- For each class, compute the binary IoU
- Average the IoUs (`average="macro"`)

We use the sklearn `jaccard_score` function for all the evaluation tasks (see later).
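
To make the two steps concrete, here is a small NumPy sketch of the same computation. `macro_iou` is a hypothetical helper written for illustration. It averages over the classes that appear in either array, which, to our understanding, matches the default behaviour of `jaccard_score` with `average="macro"`.

```python
import numpy as np


def macro_iou(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean of the per-class binary IoUs over the classes present in either array."""
    y_true, y_pred = np.ravel(y_true), np.ravel(y_pred)
    ious = []
    for c in np.union1d(y_true, y_pred):
        intersection = np.sum((y_true == c) & (y_pred == c))
        union = np.sum((y_true == c) | (y_pred == c))
        ious.append(intersection / union)
    return float(np.mean(ious))


y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])
# Per-class IoU: class 0 -> 1/3, class 1 -> 2/3, class 2 -> 1/2
print(macro_iou(y_true, y_pred))  # 0.5, up to floating-point rounding
```

A class that is never predicted but is present in the ground truth contributes an IoU of 0, so rare classes weigh as much as frequent ones in the final score.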


In [None]:
from sklearn.metrics import jaccard_score

NUM_CLASSES = 20
BATCH_SHAPE = y.shape  # (B, H, W)

np.random.seed(1234)  # for replicable results

# Create a random prediction matrix
preds = np.random.randint(low=0, high=NUM_CLASSES, size=BATCH_SHAPE)

# Compare it to the ground truth
miou = jaccard_score(y.flatten(), preds.flatten(), average="macro")
print("miou:", miou)

## Submission

This section shows you how to submit predictions on Kaggle.

Your submission must be in the CSV format. It should have two columns:

- **ID**: the ID of the image
- **MASKS**: contains the 1D-flattened string conversion of the 2D segmentation masks

To generate the `MASKS` column, we provide a `masks_to_str` function. We also provide the decoding script so that you have a clear understanding of how we will process your submission.
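
Since the decoding script parses each string as space-separated integers, an encoding consistent with it could look like the sketch below. This is an assumption for illustration only: `encode_mask` and `decode_mask` are hypothetical helpers, not the repository's `masks_to_str` and `decode_masks`, which you should use for real submissions.

```python
import numpy as np


def encode_mask(mask: np.ndarray) -> str:
    """Flatten a 2D mask row by row and join its values with single spaces."""
    return " ".join(str(v) for v in mask.flatten().tolist())


def decode_mask(encoded: str, shape: tuple[int, int]) -> np.ndarray:
    """Inverse of encode_mask: parse the integers and restore the 2D shape."""
    values = [int(token) for token in encoded.split()]
    return np.array(values, dtype=np.uint8).reshape(shape)


mask = np.array([[0, 3], [19, 7]], dtype=np.uint8)
encoded = encode_mask(mask)
print(encoded)  # 0 3 19 7
```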


In [None]:
from baseline.submission_tools import decode_masks, masks_to_str

- Generate a random submission (and solution)

> Note: 474 is the size of the test set


In [None]:
# Set the random seed
np.random.seed(1234)

X = np.random.randint(0, NUM_CLASSES, size=(474, 128, 128), dtype=np.uint8)
masks = masks_to_str(X)
print(X.shape)

submission = pd.DataFrame.from_dict({"ID": range(len(X)), "MASKS": masks})
submission["ID"] = submission["ID"] + 20000

# Note that the index=False argument is important.
submission.to_csv("submission_random.csv", index=False)

Finally, note that the test set is split into two subsets: a public set (50%) and a private set (50%).

During the competition, your scores will be computed **only on the public set**.

At the end of the competition, your scores will be updated to take your results on the private set into account. The goal is to avoid indirect overfitting on the test set via repeated submissions. This will also add some suspense at the end of the challenge 😇.


## Evaluation pipeline

(Not needed to make predictions, but it may help if you run into issues when submitting.)

Here is a short review of how our automated pipeline evaluates your predictions against the test ground truths:


In [None]:
df = pd.read_csv("submission_random.csv")

# Verify the shape of the restored array
X_restored = decode_masks(df["MASKS"].to_list())
print(X_restored.shape)

Let's check that the restored submission batch is the same as the one you submitted.


In [None]:
# Reconstruction test - Should be True
(X == X_restored).all()

Let's compute the mIOU between the original prediction batch tensor and its restored version.

If everything went well, the cell should return `1.0`.


In [None]:
miou = jaccard_score(X.flatten(), X_restored.flatten(), average="macro")
print("miou:", miou)

Finally, here is the evaluation pipeline used on Kaggle. It incorporates all the aforementioned elements. Note that Kaggle opens your CSV with pandas before calling the `score` function, and it expects the `ID` column to be the first column of your CSV.


In [None]:
"""
TODO: Enter any documentation that only people updating the metric should read here.

All columns of the solution and submission dataframes are passed to your metric, except for the Usage column.

Your metric must satisfy the following constraints:
- You must have a function named score. Kaggle's evaluation system will call that function.
- You can add your own arguments to score, but you cannot change the first three (solution, submission, and row_id_column_name).
- All arguments for score must have type annotations.
- score must return a single, finite, non-null float.
"""

import numpy as np
import pandas as pd
import pandas.api.types
from sklearn.metrics import jaccard_score


class ParticipantVisibleError(Exception):
    # If you want an error message to be shown to participants, you must raise the error as a ParticipantVisibleError
    # All other errors will only be shown to the competition host. This helps prevent unintentional leakage of solution data.
    pass


def decode_masks(
    masks: list[str],
    target_shape: tuple[int, int] = (128, 128),
) -> np.ndarray:
    """
    Convert each string in masks back to a 2D array of integers.

    Args:
        masks (list[str]): list of stringified masks
        target_shape (tuple[int, int]): (H, W) shape of each decoded mask

    Returns:
        np.ndarray: reconstructed batch of masks
    """
    return np.array(
        [
            np.fromstring(mask, sep=" ", dtype=np.uint8).reshape(target_shape)
            for mask in masks
        ]
    )


def score(
    solution: pd.DataFrame, submission: pd.DataFrame, row_id_column_name: str
) -> float:
    """
    Scoring function. Takes the solution and submission and returns the mIOU.
    """
    # Check submission files
    COL_MASK = "MASKS"
    expected_columns = ["ID", COL_MASK]
    for col in expected_columns:
        if col not in submission.columns:
            raise ParticipantVisibleError(
                f"Required column: {col} not found in the submission dataframe. Check your column names."
            )

    if not pandas.api.types.is_string_dtype(submission[COL_MASK]):
        raise ParticipantVisibleError(
            f"Submission column {COL_MASK} must be an object (str) column."
        )

    # Parse and decode the masks into tensors
    masks_submission = decode_masks(submission[COL_MASK].to_list())
    masks_solution = decode_masks(solution[COL_MASK].to_list())

    if masks_submission.shape != masks_solution.shape:
        raise ParticipantVisibleError(
            f"Submission should be of shape {masks_solution.shape} after decoding. Got submission.shape: {masks_submission.shape}."
        )

    return jaccard_score(
        y_true=masks_solution.flatten(),
        y_pred=masks_submission.flatten(),
        average="macro",
    )
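
If a submission is rejected, a quick local check of the CSV can help narrow the problem down before uploading again. The sketch below mirrors the checks described above using only the standard library. `find_submission_problems` is a hypothetical helper, not part of the Kaggle pipeline, and the expected mask size and class count are taken from the default `target_shape` and `NUM_CLASSES` used in this notebook.

```python
import csv
import io

EXPECTED_PIXELS = 128 * 128  # default target_shape of decode_masks
NUM_CLASSES = 20


def find_submission_problems(csv_text: str, n_rows: int) -> list[str]:
    """Return human-readable problems found in a submission CSV (hypothetical helper)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    # ID must be the first column and MASKS must be present.
    if header[:1] != ["ID"] or "MASKS" not in header:
        return [f"unexpected header: {header}"]
    problems = []
    rows = list(reader)
    if len(rows) != n_rows:
        problems.append(f"expected {n_rows} rows, got {len(rows)}")
    for row in rows:
        tokens = (row["MASKS"] or "").split()
        if len(tokens) != EXPECTED_PIXELS:
            problems.append(
                f"ID {row['ID']}: {len(tokens)} values instead of {EXPECTED_PIXELS}"
            )
        elif not all(t.isdigit() and int(t) < NUM_CLASSES for t in tokens):
            problems.append(f"ID {row['ID']}: values outside [0, {NUM_CLASSES - 1}]")
    return problems
```

For the random submission above, `find_submission_problems(Path("submission_random.csv").read_text(), n_rows=474)` should return an empty list.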