# Adding new data to Open-Canopy

In this notebook, we describe how to add new data to the Open-Canopy dataset. The process is straightforward once you understand the structure of the dataset. Quoting section A-3 of the supplementary material of the [paper](https://arxiv.org/pdf/2407.09392):
"
The composition of the `canopy_height` folder is the following:
- The file `geometries.geojson` stores a list of [...] geolocated geometries, giving access to the splits of the dataset. It can be loaded using the python package geopandas. Each geometry designates either a train, validation, test or buffer area. This information is stored in the column `split` [...]. Additionally, each geometry is associated to a year (corresponding to the year of the corresponding LiDAR acquisition), stored in the column `lidar_year`.
*NB: In the original "Open-Canopy" dataset, the column `lidar_year` refers to the acquisition year of the LiDAR data,
but in a more general context, it designates a spatio-temporal zone with a unique CRS, and should be filled with the name of the folder from which to retrieve the data for each geometry.*
- The file `forest_mask.parquet` stores geolocated geometries of forests’ outlines. It can be loaded using the python package geopandas.[...]
- Each folder 2021, 2022 and 2023 contains three files:
    - `spot.vrt` is a geolocalized virtual file that gives access to SPOT 6-7 images stored in the subfolder spot. It can be accessed through the QGIS software or the Python rasterio library, for instance. It has the same extent as the geometries of the associated year.
    - Similarly `lidar.vrt` gives access to ALS-derived (LiDAR) canopy height maps stored in the subfolder lidar.
    - Similarly `lidar_classification.vrt` gives access to classification rasters stored in the subfolder lidar_classification.
"

**Hence, perform the following steps to add new data:**
- Create a new folder in the folder `canopy_height`.
- In the newly created folder, add the following files:
    - a GeoTIFF (`tif` extension) file (or a `vrt` with its associated files) named `spot.tif` for satellite imagery, at 1.5 m resolution, with four bands (RGB and NIR). See [preprocessing](../src/preprocessing/README.md) for pansharpening of SPOT imagery at 1.5 m resolution. If you plan to use satellite or aerial imagery that comes from a different sensor, we recommend first applying histogram matching or a related technique against the preprocessed SPOT data.
    - [For evaluation/training] a `tif` (or `vrt`) named `lidar.tif` for ground truth canopy height, stored in decimeters, at 1.5 m resolution.
    - [Optional for evaluation/training] a mask, stored in a `tif` (or `vrt`) named `lidar_classification.tif` specifying where to perform evaluation/training (cf. [evaluation config](../src/metrics/configs/compute_metrics_config.yaml)).

Then update the `geometries.geojson` file to reflect the new training or evaluation set, as shown below.


In [None]:
# Import libraries
import geopandas as gpd
import pandas as pd
import os
import rasterio
from shapely.geometry import box
from pathlib import Path
import shutil

In [None]:
# Make a copy of the existing geometries.geojson file before updating it
path_to_dataset = Path('../datasets/canopy/canopy_height')
new_data_name = "my_new_data" # Update with the name of the new folder where you have stored new data
split = "test" # update with "train/val/test/predict"
extension = ".tif" # Replace with .vrt if using vrt

path_to_new_data = os.path.join(path_to_dataset, new_data_name)
path_to_geometries = os.path.join(path_to_dataset, 'geometries.geojson')
print('Creating backup of the original geometries.geojson file')
shutil.copy(path_to_geometries, os.path.join(path_to_dataset, "initial_geometries.geojson"))

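Note that re-running the backup cell after `geometries.geojson` has been modified overwrites the backup with the modified file. A possible safeguard, sketched here with a hypothetical helper `backup_once`, is to copy only when no backup exists yet:

```python
import shutil
from pathlib import Path


def backup_once(src, backup_name="initial_geometries.geojson"):
    """Copy `src` next to itself under `backup_name`, unless that backup already exists.

    Returns the backup path and whether a copy was actually made.
    """
    src = Path(src)
    backup = src.with_name(backup_name)
    if backup.exists():
        return backup, False  # keep the earlier backup untouched
    shutil.copy(src, backup)
    return backup, True
```

Calling `backup_once(path_to_geometries)` in place of the `shutil.copy` call keeps the very first copy as the reference, however many times the notebook is executed.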
In [None]:
# Create a geometry corresponding to the new data
with rasterio.open(os.path.join(path_to_new_data, "spot"+extension)) as src:
    # Get the bounding box of the image
    bounds = src.bounds
# Create a polygon from the bounds
new_geometry = box(bounds.left, bounds.bottom, bounds.right, bounds.top)

# Update the "geometries.geojson" file
gdf = gpd.read_file(path_to_geometries)
# In the original "Open-Canopy" dataset, the column "lidar_year" refers to the acquisition year of lidar data,
# but in a more general context, it just designates a folder with data as described above
new_gdf = gpd.GeoDataFrame({'lidar_year': [new_data_name],'split': [split], 'geometry': [new_geometry]}, crs=gdf.crs)
# Append the new GeoDataFrame to the original one
gdf = pd.concat([gdf, new_gdf], ignore_index=True)
# NB: other columns in gdf do not need to be filled
# If you want to perform evaluation only on the new data, just keep the new geometry
# gdf = new_gdf
# Save the new geometries
gdf.to_file(path_to_geometries, driver="GeoJSON")

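The cell above assigns `gdf.crs` to a polygon built from the raster bounds, which is only correct if the new image uses the same CRS as `geometries.geojson`. If that is not guaranteed, one option is to build the footprint in the raster CRS and reproject it. This is a sketch under that assumption; `footprint_in_crs` is a hypothetical helper, not part of the dataset tooling.

```python
import geopandas as gpd
from shapely.geometry import box


def footprint_in_crs(bounds, src_crs, dst_crs):
    """Build the bounding-box polygon in `src_crs` and return it expressed in `dst_crs`."""
    left, bottom, right, top = bounds
    footprint = gpd.GeoSeries([box(left, bottom, right, top)], crs=src_crs)
    return footprint.to_crs(dst_crs).iloc[0]
```

Inside the `with rasterio.open(...)` block, this could be used as `new_geometry = footprint_in_crs(src.bounds, src.crs.to_wkt(), gdf.crs)`. When both CRSs are identical, the polygon is returned unchanged.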
In [None]:
# Check the result
gdf.tail()

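Beyond inspecting the last rows, one may want to verify that every value of `lidar_year` points to an existing folder, since this column is used as a folder name. A minimal sketch, with a hypothetical helper name and assuming the column values convert cleanly to folder names with `str`:

```python
from pathlib import Path


def folders_missing_on_disk(geometries, dataset_root, column="lidar_year"):
    """Return the sorted values of `column` that have no matching folder under `dataset_root`."""
    root = Path(dataset_root)
    # Compare as strings so that 2021 and "2021" refer to the same folder
    names = geometries[column].dropna().astype(str).unique()
    return sorted(name for name in names if not (root / name).is_dir())
```

With the variables defined earlier, `folders_missing_on_disk(gdf, path_to_dataset)` should return an empty list once the new folder is in place.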
In [None]:
# If you do not have a classification_mask, you can create one corresponding to the pixels with height higher than a given threshold (e.g., 2m)
threshold = 20 # heights are stored in decimeters
with rasterio.open(os.path.join(path_to_new_data, "lidar"+extension)) as src:
    lidar = src.read(1)
    profile = src.profile.copy()
# Threshold to get the mask, and use value 5 (corresponding to high vegetation)
mask = (lidar >= threshold)*5
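# Save the mask as a single-band uint8 GeoTIFF. The driver is forced to GTiff in case the source is a vrt,
# and the source nodata value is dropped because it may not fit in uint8.
profile.update(driver="GTiff", dtype=rasterio.uint8, count=1, nodata=None)
with rasterio.open(os.path.join(path_to_new_data, "lidar_classification.tif"), 'w', **profile) as dst:
    dst.write(mask.astype(rasterio.uint8), 1)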

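To make the thresholding step concrete, here is a toy example on a small array, independent of any raster file. The height values are made up for illustration; class value 5 and the 20 dm threshold follow the cell above.

```python
import numpy as np

# Toy canopy heights in decimeters (20 dm = 2 m)
lidar = np.array([[0, 15, 20],
                  [35, 19, 250]], dtype=np.uint16)
threshold = 20

# Pixels at or above the threshold get class 5, all others get 0
mask = np.where(lidar >= threshold, 5, 0).astype(np.uint8)
print(mask)
# [[0 0 5]
#  [5 0 5]]
```

Note that the comparison is inclusive, so a pixel at exactly 2 m is kept in the mask.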
In [None]:
# Check that all needed files are present:
assert os.path.exists(path_to_new_data), 'New data folder not found'
assert os.path.exists(os.path.join(path_to_new_data, "spot"+extension)), 'spot file not found'
assert os.path.exists(os.path.join(path_to_new_data, "lidar"+extension)), 'lidar file not found'
assert os.path.exists(os.path.join(path_to_new_data, "lidar_classification"+extension)), 'lidar_classification file not found'


## Run training/evaluation
For training (after updating the configs with, e.g., your custom model):
```bash
python src/train.py model=my_custom_model
```
For evaluation only:
```bash
python src/eval.py
```
