# In-Depth Guide: Data Preparation

This notebook covers preparing your data for use with **detectree2**, including basic tiling, advanced options, multi-class data handling, and visual inspection of tiles.

For the full tutorial, see the [documentation](https://patball1.github.io/detectree2/tutorials/02_data_preparation.html).

## Setup

In [None]:
!pip install torch torchvision torchaudio
!pip install 'git+https://github.com/facebookresearch/detectron2.git'
!pip install detectree2

In [None]:
from google.colab import drive
drive.mount('/content/drive')

## Core Tiling (RGB)

The recommended file structure for training:

```
Paracou/
  rgb/
    Paracou_RGB_2016_10cm.tif    # RGB orthomosaic in local UTM CRS
  ms/
    Paracou_MS_2016.tif          # Multispectral orthomosaic (optional)
  crowns/
    UpdatedCrowns8.gpkg          # Crown polygons (GeoPackage/Shapefile)
```

In [None]:
from detectree2.preprocessing.tiling import tile_data
import geopandas as gpd
import rasterio

# Set up input paths
site_path = "/path/to/data/Paracou"
img_path = site_path + "/rgb/Paracou_RGB_2016_10cm.tif"
crown_path = site_path + "/crowns/UpdatedCrowns8.gpkg"

# Read the orthomosaic and crowns, then reproject the crowns to the raster's CRS
data = rasterio.open(img_path)
crowns = gpd.read_file(crown_path)
crowns = crowns.to_crs(data.crs)

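# Set tiling parameters
buffer = 30       # buffer around each tile, in CRS units (metres for UTM)
tile_width = 40   # tile width, in CRS units
tile_height = 40  # tile height, in CRS units
threshold = 0.6   # minimum crown coverage for a tile to be kept
out_dir = site_path + "/tiles/"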

tile_data(img_path, out_dir, buffer, tile_width, tile_height, crowns, threshold, mode="rgb")

> **Note:** If the output tiles are blank images, pass `dtype_bool=True` to `tile_data`.

> **Note:** `threshold` is the minimum fraction of a tile that must be covered by crowns for the tile to be kept. Lower it if your trees are sparsely distributed or if you want to include non-forest areas.
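To make the tiling parameters concrete, the sketch below works through the arithmetic for a single tile. It assumes that `buffer`, `tile_width`, and `tile_height` are in CRS units, that the buffer is added on every side of the tile, and that `threshold` is compared against the crown-covered fraction of the buffered tile; which area detectree2 uses as the denominator is an assumption here. The helper names (`tile_footprint`, `polygon_area`, `crown_coverage`) are hypothetical and not part of detectree2, and the 0.1 m resolution is read off the `10cm` in the example file name.

```python
def tile_footprint(tile_width, tile_height, buffer, resolution):
    """Return (width_m, height_m, width_px, height_px) of a buffered tile."""
    width_m = tile_width + 2 * buffer
    height_m = tile_height + 2 * buffer
    return width_m, height_m, round(width_m / resolution), round(height_m / resolution)


def polygon_area(vertices):
    """Shoelace formula for a simple polygon given as [(x, y), ...]."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(vertices, vertices[1:] + vertices[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def crown_coverage(crowns, tile_area):
    """Covered fraction, assuming crowns lie inside the tile and do not overlap."""
    return sum(polygon_area(c) for c in crowns) / tile_area


# Values from the cell above: 40 m tiles, 30 m buffer, assumed 0.1 m pixels
width_m, height_m, width_px, height_px = tile_footprint(40, 40, 30, 0.1)

# Two toy crown polygons in tile-local coordinates (metres)
crowns = [
    [(0, 0), (60, 0), (60, 60), (0, 60)],
    [(60, 60), (100, 60), (100, 100), (60, 100)],
]
coverage = crown_coverage(crowns, width_m * height_m)
keep_tile = coverage >= 0.6  # whether the real comparison is >= or > is an assumption

print(width_m, height_m, width_px, height_px, coverage, keep_tile)
```

With these toy crowns the tile falls below the 0.6 threshold and would be skipped; lowering `threshold` to 0.5 would keep it.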

## Advanced Tiling Options

Key parameters for `tile_data`:

- **`tile_placement`**: `"grid"` (default, fixed grid) or `"adaptive"` (places tiles only where crowns exist)
- **`overlapping_tiles`**: When `True`, adds shifted tiles to capture crowns on boundaries
- **`ignore_bands_indices`**: Zero-based indices of bands to skip (multispectral only)
- **`nan_threshold`**: Max proportion of NaN pixels before a tile is discarded
- **`use_convex_mask`**: Masks pixels far from any labelled crown
- **`enhance_rgb_contrast`**: Applies percentile contrast stretch for hazy/dark imagery
- **`additional_nodata`**: List of pixel values to treat as no-data
- **`mask_path`**: Vector file defining area of interest
- **`multithreaded`**: Parallel tile processing for large orthomosaics
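The difference between the two `tile_placement` modes, and the effect of `overlapping_tiles`, can be pictured with a toy example. This is a simplified sketch under stated assumptions, not detectree2's implementation: tiles are axis-aligned squares, crowns are reduced to bounding boxes, `"adaptive"` is approximated as "keep only the grid tiles that intersect a crown", and overlapping tiles are approximated as a second grid shifted by half a tile. All function names are hypothetical.

```python
def grid_origins(bounds, tile_size, shift=0.0):
    """Lower-left corners of a regular grid of tiles covering `bounds`."""
    minx, miny, maxx, maxy = bounds
    origins = []
    y = miny + shift
    while y < maxy:
        x = minx + shift
        while x < maxx:
            origins.append((x, y))
            x += tile_size
        y += tile_size
    return origins


def intersects(a, b):
    """True if two (minx, miny, maxx, maxy) boxes overlap with non-zero area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def adaptive_origins(origins, tile_size, crown_boxes):
    """Keep only tiles whose footprint intersects at least one crown box."""
    kept = []
    for x, y in origins:
        tile = (x, y, x + tile_size, y + tile_size)
        if any(intersects(tile, c) for c in crown_boxes):
            kept.append((x, y))
    return kept


raster_bounds = (0, 0, 120, 80)  # toy extent in metres
crown_boxes = [(5, 5, 15, 15), (85, 45, 95, 55)]

grid = grid_origins(raster_bounds, tile_size=40)
adaptive = adaptive_origins(grid, 40, crown_boxes)
shifted = grid_origins(raster_bounds, tile_size=40, shift=20)

print(len(grid), "grid tiles;", len(adaptive), "kept after adaptive filtering")
print(len(shifted), "extra tiles from a half-tile shift")
```

The filtering step is why adaptive placement helps with sparse labels: tiles with no crowns never reach the `threshold` check at all.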

## Recipe 1: Batch Tiling from Multiple Orthomosaics

Tile data from several orthomosaics into a single output directory to create a larger, more diverse training dataset.

In [None]:
from detectree2.preprocessing.tiling import tile_data
import geopandas as gpd
import rasterio

sites = [
    {
        "img_path": "/path/to/data/SiteA/ortho.tif",
        "crown_path": "/path/to/data/SiteA/crowns.gpkg",
    },
    {
        "img_path": "/path/to/data/SiteB/ortho.tif",
        "crown_path": "/path/to/data/SiteB/crowns.gpkg",
    },
]

output_dir = "/path/to/my-combined-training-data/"

for site in sites:
    with rasterio.open(site["img_path"]) as raster:
        crowns = gpd.read_file(site["crown_path"])
        crowns = crowns.to_crs(raster.crs)
        tile_data(
            img_path=site["img_path"],
            out_dir=output_dir,
            crowns=crowns,
            tile_placement="adaptive",
            mode="ms",  # multispectral input; use "rgb" for 3-band RGB orthomosaics
            buffer=30,
            tile_width=40,
            tile_height=40,
            threshold=0.6,
        )

## Recipe 2: Tiling Noisy Multispectral Rasters

Use this recipe for large, real-world multispectral datasets that may contain several kinds of no-data artifacts.

In [None]:
from detectree2.preprocessing.tiling import tile_data
import geopandas as gpd
import rasterio

img_path = "/path/to/your/large_ms_ortho.tif"
crown_path = "/path/to/your/crowns.gpkg"
output_dir = "/path/to/ms_tiles"

with rasterio.open(img_path) as raster:
    crowns = gpd.read_file(crown_path)
    crowns = crowns.to_crs(raster.crs)

    tile_data(
        img_path=img_path,
        out_dir=output_dir,
        crowns=crowns,
        mode="ms",
        tile_placement="adaptive",
        additional_nodata=[-10000, -20000],  # extra pixel values to treat as no-data
        buffer=10,
        tile_width=80,
        tile_height=80,
        threshold=0.6,
    )
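How `additional_nodata` and `nan_threshold` might interact can be illustrated on a single band of pixel values. The sketch assumes that every value listed in `additional_nodata` is treated like NaN, and that a tile is discarded once its no-data fraction exceeds `nan_threshold`; the exact comparison detectree2 performs is not shown in this guide, so treat both as assumptions. `nodata_fraction` is a hypothetical helper, and the threshold of 0.1 is an illustrative value only.

```python
import math


def nodata_fraction(pixels, additional_nodata=()):
    """Fraction of pixels that are NaN or equal to a listed no-data value."""
    flagged = set(additional_nodata)
    bad = sum(
        1 for value in pixels
        if (isinstance(value, float) and math.isnan(value)) or value in flagged
    )
    return bad / len(pixels)


# Ten toy reflectance values with two sentinel values and one NaN
band = [0.12, 0.30, -10000, float("nan"), 0.25, 0.41, -20000, 0.38, 0.29, 0.33]

fraction = nodata_fraction(band, additional_nodata=[-10000, -20000])
nan_threshold = 0.1  # illustrative value only
discard_tile = fraction > nan_threshold

print(f"no-data fraction = {fraction:.2f}, discard = {discard_tile}")
```

Without the `additional_nodata` list, only the NaN would be counted and the sentinel values would pass through as valid (and badly out-of-range) pixel data.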

## Handling Multi-Class Data

For multi-class problems (e.g., species or disease mapping), you need to provide a class label for each crown polygon.

In [None]:
import geopandas as gpd

crown_path = "/path/to/crowns/Danum_lianas_full2017.gpkg"
crowns = gpd.read_file(crown_path)

# The 'status' column here indicates the class of each crown
print(crowns.head())
class_column = 'status'

In [None]:
from detectree2.preprocessing.tiling import record_classes

out_dir = "/path/to/tiles/"
record_classes(
    crowns=crowns,
    out_dir=out_dir,
    column=class_column,
    save_format='json'
)
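Conceptually, recording classes amounts to fixing a mapping from each distinct label in `class_column` to an integer ID, so that the same IDs are used consistently in later steps. The sketch below builds such a mapping with the standard library only. It illustrates the idea and is not the output format of `record_classes`: the label values, the alphabetical ordering, and the file name `class_to_idx_example.json` are all assumptions.

```python
import json
import tempfile
from pathlib import Path


def build_class_mapping(labels):
    """Map each distinct label to a stable integer ID (sorted for determinism)."""
    return {label: idx for idx, label in enumerate(sorted(set(labels)))}


# Hypothetical values of the 'status' column, one per crown polygon
status = ["liana", "no_liana", "liana", "partial", "no_liana"]
mapping = build_class_mapping(status)

# Write the mapping to a temporary file and read it back
out_file = Path(tempfile.mkdtemp()) / "class_to_idx_example.json"
out_file.write_text(json.dumps(mapping, indent=2))
restored = json.loads(out_file.read_text())

print(mapping)
```

Sorting the labels before enumerating them keeps the IDs stable across runs, which matters when tiles are generated in more than one session.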

In [None]:
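# Tile the data with class information.
# `img_path` must point to the orthomosaic these crowns were drawn on,
# and `crowns` should be in the same CRS as that raster (see Core Tiling).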
tile_data(
    img_path=img_path,
    out_dir=out_dir,
    crowns=crowns,
    class_column=class_column,
    buffer=30,
    tile_width=40,
    tile_height=40,
    threshold=0.6,
)

## Utilities

### Converting Multispectral Tiles to RGB

Convert MS tiles to 3-band RGB for visualization or for use with RGB-trained models. Two conversion methods are available:
- `"pca"`: uses PCA to reduce the bands to their three leading principal components
- `"first-three"`: takes the first three bands as they are

In [None]:
from detectree2.preprocessing.tiling import create_RGB_from_MS

ms_tile_folder = "/path/to/ms_tiles/"
rgb_output_folder = "/path/to/rgb_tiles_from_ms/"

create_RGB_from_MS(
    tile_folder_path=ms_tile_folder,
    out_dir=rgb_output_folder,
    conversion="pca"
)
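The two conversion strategies can be compared on a handful of toy pixels. The sketch below is a dependency-free illustration of the idea, not the code behind `create_RGB_from_MS`: it represents each pixel as a tuple of band values, finds the leading principal components by power iteration with deflation, and omits the rescaling to 8-bit values that a real RGB image would need. All function names are hypothetical.

```python
def first_three(pixels):
    """'first-three': keep bands 0-2 of every pixel and drop the rest."""
    return [tuple(p[:3]) for p in pixels]


def covariance(pixels):
    """Return the band-by-band covariance matrix and the per-band means."""
    n, bands = len(pixels), len(pixels[0])
    means = [sum(p[b] for p in pixels) / n for b in range(bands)]
    cov = [[sum((p[i] - means[i]) * (p[j] - means[j]) for p in pixels) / (n - 1)
            for j in range(bands)] for i in range(bands)]
    return cov, means


def leading_components(cov, k=3, iterations=500):
    """Top-k eigenvalues and eigenvectors by power iteration with deflation."""
    bands = len(cov)
    work = [row[:] for row in cov]  # copy, so the caller's matrix is untouched
    values, vectors = [], []
    for _ in range(k):
        v = [1.0 + 0.1 * i for i in range(bands)]  # arbitrary start vector
        length = sum(x * x for x in v) ** 0.5
        v = [x / length for x in v]
        for _ in range(iterations):
            w = [sum(work[i][j] * v[j] for j in range(bands)) for i in range(bands)]
            norm = sum(x * x for x in w) ** 0.5
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        lam = sum(v[i] * work[i][j] * v[j] for i in range(bands) for j in range(bands))
        values.append(lam)
        vectors.append(v)
        # Deflate: remove the component just found before searching for the next
        for i in range(bands):
            for j in range(bands):
                work[i][j] -= lam * v[i] * v[j]
    return values, vectors


def pca_rgb(pixels):
    """'pca': project mean-centred pixels onto the three leading components."""
    cov, means = covariance(pixels)
    values, vectors = leading_components(cov, k=3)
    scores = [tuple(sum((p[b] - means[b]) * v[b] for b in range(len(p))) for v in vectors)
              for p in pixels]
    return scores, values


# Twenty toy pixels with four bands; bands 0 and 1 are strongly correlated
pixels = [(i, 2 * i + i % 2, i % 3, (7 * i) % 5) for i in range(20)]

rgb_first = first_three(pixels)
rgb_pca, variances = pca_rgb(pixels)

print(rgb_first[1], [round(v, 2) for v in variances])
```

Because bands 0 and 1 carry nearly the same information, PCA folds them into one component and leaves room for the fourth band, which `"first-three"` would simply discard.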

### Splitting Data into Train/Test/Validation Folds

In [None]:
from detectree2.preprocessing.tiling import to_traintest_folders

data_folder = "/path/to/tiles/"
to_traintest_folders(data_folder, data_folder, test_frac=0.15, strict=False, folds=5)

> **Note:** If `strict=True`, training/validation geojsons that overlap with test tiles (including buffers) are automatically removed, ensuring strict spatial separation. This can reduce the amount of training data significantly.
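The splitting logic, and the cost of `strict=True`, can be sketched as follows. This is a simplified illustration rather than the behaviour of `to_traintest_folders` itself: tiles are represented by hypothetical `(name, bounds)` pairs, the test set is drawn at random, `strict` is approximated as "drop any training tile whose buffered bounding box overlaps a test tile", and the remaining tiles are dealt round-robin into folds.

```python
import random


def boxes_overlap(a, b):
    """True if two (minx, miny, maxx, maxy) boxes overlap with non-zero area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def split_tiles(tiles, test_frac=0.15, folds=5, strict=False, seed=0):
    """Return (test_names, fold_lists) for a list of (name, bounds) tiles."""
    rng = random.Random(seed)
    shuffled = tiles[:]
    rng.shuffle(shuffled)

    n_test = max(1, round(len(shuffled) * test_frac))
    test, train = shuffled[:n_test], shuffled[n_test:]

    if strict:
        # Drop training tiles that share any area with a test tile
        train = [t for t in train
                 if not any(boxes_overlap(t[1], s[1]) for s in test)]

    fold_lists = [[] for _ in range(folds)]
    for i, (name, _) in enumerate(train):
        fold_lists[i % folds].append(name)
    return [name for name, _ in test], fold_lists


# A 5 x 4 grid of 40 m tiles, each with a 30 m buffer on every side
tiles = [
    (f"tile_{col}_{row}", (col * 40 - 30, row * 40 - 30, col * 40 + 70, row * 40 + 70))
    for col in range(5) for row in range(4)
]

test_names, loose_folds = split_tiles(tiles, strict=False)
_, strict_folds = split_tiles(tiles, strict=True)

loose_total = sum(len(f) for f in loose_folds)
strict_total = sum(len(f) for f in strict_folds)
print(len(test_names), "test tiles;", loose_total, "training tiles;",
      strict_total, "training tiles with strict separation")
```

With a 30 m buffer on 40 m tiles, each test tile overlaps many of its neighbours, so strict separation removes a large share of the training tiles in this toy layout.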

## Visually Inspecting Your Tiles

Visually inspect the tiles before training to confirm that the crowns and the imagery align.

In [None]:
from detectron2.data import MetadataCatalog
from detectron2.utils.visualizer import Visualizer
from detectree2.models.train import combine_dicts
import cv2
from PIL import Image

name = "Paracou"
train_location = "/path/to/tiles/train"
dataset_dicts = combine_dicts(train_location, 1)  # The number gives the fold to visualise
trees_metadata = MetadataCatalog.get(name + "_train")

for d in dataset_dicts:
    img = cv2.imread(d["file_name"])  # OpenCV loads images as BGR
    # Visualizer expects RGB, so reverse the channel order
    visualizer = Visualizer(img[:, :, ::-1], metadata=trees_metadata, scale=0.3)
    out = visualizer.draw_dataset_dict(d)
    # get_image() returns an RGB array; display() is provided by the notebook
    display(Image.fromarray(out.get_image()))