# Data Extractions from OpenEO

To run the extractions, you need an account in the [Copernicus Data Space Ecosystem (CDSE)](https://dataspace.copernicus.eu/).

In [None]:
!pip install git+https://github.com/ScaleAGData/scaleag-vito.git@prometheo-integration --quiet

In [None]:
from pathlib import Path
import geopandas as gpd
from scaleagdata_vito.presto.datasets_prometheo import ScaleAgDataset
from scaleagdata_vito.openeo.extract_sample_scaleag import generate_input_for_extractions, extract
from scaleagdata_vito.presto.presto_df import load_dataset
from scaleagdata_vito.presto.utils import train_test_val_split, finetune_on_task, load_finetuned_model, evaluate_finetuned_model, get_pretrained_model_url
from scaleagdata_vito.presto.inference import PrestoPredictor, reshape_result, plot_results
from scaleagdata_vito.utils.map import ui_map
from scaleagdata_vito.utils.dateslider import date_slider
from scaleagdata_vito.openeo.extract_sample_scaleag import collect_inputs_for_inference

### Before we start...

**Check your data!** Investigate the validity of geometries, the uniqueness of sample IDs, the presence of outliers, and so on before starting the extraction. Achieving good performance with a limited amount of data is a challenging task in itself. Therefore, **the quality of your data will greatly impact your final results.**

Data requirements:
- Points or Polygons (polygons will be aggregated into points)
- Lat-Lon (CRS: EPSG:4326)
- Format: parquet, GeoJSON, shapefile, GPKG

For each geometry:
- Date (if available)
- Unique ID
- Annotations

Good practice:

- Remove polygons close to borders (e.g. apply a buffer) to ensure the data are contained in the field.
- If the annotations are accurate, point geometries should be preferred. However, especially in regression tasks (i.e., continuous output values) such as yield estimation, the target values might be noisy. In that case, we recommend subdividing the polygons into subfields of 20 m x 20 m (to cover more measurements) and computing the median yield for a smoother and more reliable target.
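
The median-per-subfield recommendation can be illustrated with a minimal sketch. It assumes the yield measurements are available as `(x, y, value)` tuples in a projected CRS with metre units (not lat-lon); the function name `median_per_cell` and the sample values are hypothetical and not part of `scaleagdata_vito`.

```python
import math
from collections import defaultdict
from statistics import median


def median_per_cell(samples, cell_size=20.0):
    """Group (x, y, value) samples into square cells and return the median per cell.

    Assumption: x and y are in metres (projected CRS), not degrees.
    """
    cells = defaultdict(list)
    for x, y, value in samples:
        # Integer index of the cell the sample falls into
        key = (math.floor(x / cell_size), math.floor(y / cell_size))
        cells[key].append(value)
    return {key: median(values) for key, values in cells.items()}


# Hypothetical yield measurements: (x in m, y in m, yield)
samples = [
    (1.0, 2.0, 5000.0),
    (5.0, 18.0, 5400.0),
    (19.0, 3.0, 9000.0),  # noisy measurement, damped by the median
    (25.0, 4.0, 6100.0),
    (38.0, 12.0, 6300.0),
]
cell_medians = median_per_cell(samples)
```

Each key of `cell_medians` identifies one 20 m x 20 m subfield, and the median keeps a single noisy measurement from dominating the target of that subfield.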

##### Assess data correctness before launching the OpenEO jobs 
You can run some checks on your input file to make sure it is suitable for running the extractions successfully. In particular, it is important to check the validity of the geometries and to have a column containing a unique ID for each sample. Do these checks beforehand by running the first section of the notebook `data_investigation.ipynb`.
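
As a rough illustration of the kind of checks meant here, the sketch below looks for duplicated or missing IDs and for coordinates outside the lat-lon range. It is not the content of `data_investigation.ipynb`: the function `check_samples`, the record layout (plain dictionaries with `lon` and `lat` keys) and the sample values are assumptions made for the example.

```python
from collections import Counter


def check_samples(records, id_key="Field_ID"):
    """Return a list of human-readable problems found in the records."""
    problems = []
    # Duplicated or missing sample IDs
    counts = Counter(r.get(id_key) for r in records)
    for sample_id, n in counts.items():
        if sample_id is None:
            problems.append(f"{n} record(s) without an ID")
        elif n > 1:
            problems.append(f"ID {sample_id!r} appears {n} times")
    # Coordinates that cannot be valid lon/lat values
    for r in records:
        lon, lat = r["lon"], r["lat"]
        if not (-180 <= lon <= 180 and -90 <= lat <= 90):
            problems.append(
                f"ID {r.get(id_key)!r} has coordinates outside lat-lon bounds"
            )
    return problems


# Hypothetical records
records = [
    {"Field_ID": "A1", "lon": 4.5, "lat": 50.8},
    {"Field_ID": "A1", "lon": 4.6, "lat": 50.9},     # duplicated ID
    {"Field_ID": "B2", "lon": 50.8, "lat": 120.0},   # latitude out of range
]
problems = check_samples(records)
```

This sketch does not test polygon validity (for example self-intersections); when the data are loaded with geopandas, the `is_valid` attribute of the geometry column covers that part.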

##### Requirements for running the extractions
- An account in the [Copernicus Data Space Ecosystem (CDSE)](https://dataspace.copernicus.eu/). You can sign up for free and receive 10000 credits per month.
- A dataset with valid geometries (Points or Polygons) in lat-lon projection.
- Preferably a dataset with unique IDs per sample 
- A labelled dataset. Not required for the extraction process, but for the following fine-tuning steps.

##### EO data extractions
In this first step, we extract the required EO time series from CDSE for each sample in your dataset, using OpenEO.
To run the job, specify the following fields of the `job_params` dictionary:

```python
job_params = dict(
    output_folder=...,  # where to save the extracted dataset
    input_df=...,  # input georeferenced dataset to run the extractions for
    start_date=...,  # string indicating from which date to extract data
    end_date=...,  # string indicating until which date to extract the data
    unique_id_column=...,  # name of the column in the input_df containing the unique ID of the samples
    composite_window=...,  # "month" or "dekad" are supported. Default is "dekad"
)
```
In particular:
- If the `date` associated with the label is provided, the `start_date` of the time series is automatically set to 9 months before that date and the `end_date` to 9 months after it. If `date` is not available, you need to set the desired `start_date` and `end_date` for the extractions manually. **The indicated period should cover 1 year**.
- `composite_window` indicates the time-series granularity, which can be dekadal or monthly.
  - `dekad`: each time step in the extracted time series is a mean composite of the acquisitions within a 10-day window. Depending on the start and end dates, each month is covered by 3 time steps which, by default, correspond to the 1st, 11th and 21st of the month.
  - `month`: each time step in the extracted time series is a mean composite of the acquisitions within a 30-day window. Each month is covered by 1 time step which, by default, corresponds to the 1st of the month.
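
To make the date logic above concrete, here is a minimal sketch that shifts a label date by 9 months and lists the time steps covered by a period. The helpers `shift_months` and `composite_steps` are hypothetical and only mirror the description; the actual date handling inside `scaleagdata_vito` may differ (for instance, this sketch clamps the day to 28 to keep shifted dates valid).

```python
from datetime import date


def shift_months(d, months):
    """Shift a date by a number of months (day clamped to 28, a simplification)."""
    index = d.year * 12 + (d.month - 1) + months
    year, month = divmod(index, 12)
    return date(year, month + 1, min(d.day, 28))


def composite_steps(start, end, composite_window="dekad"):
    """List the first day of each compositing window starting within [start, end]."""
    days = (1, 11, 21) if composite_window == "dekad" else (1,)
    steps = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        for day in days:
            step = date(year, month, day)
            if start <= step <= end:
                steps.append(step)
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return steps


# Window derived from a hypothetical label date
label_date = date(2023, 9, 15)
window = (shift_months(label_date, -9), shift_months(label_date, 9))

# Time steps for one calendar year
dekad_steps = composite_steps(date(2023, 1, 1), date(2023, 12, 31), "dekad")
month_steps = composite_steps(date(2023, 1, 1), date(2023, 12, 31), "month")
```

For the calendar year 2023 this gives 36 dekadal steps or 12 monthly steps, which is the number of time steps one year of extractions is expected to contain.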

The following dekadal/monthly time series will be extracted for the indicated time range:

- Sentinel-2 L2A data (all bands)
- Sentinel-1 VH and VV
- Average air temperature and precipitation sum derived from AgERA5
- Slope and elevation from Copernicus DEM

Presto accepts 1D time series. Therefore, if polygons are provided for the extractions, they are spatially aggregated into points located at the lat-lon position of the polygon centroid.
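
For intuition about where the aggregated point ends up, the sketch below computes the area-weighted centroid of a simple polygon with the shoelace formula. It is a planar approximation applied directly to lon/lat values and is only an illustration: the function `polygon_centroid` and the example field are hypothetical, and the extraction pipeline may compute the centroid differently.

```python
def polygon_centroid(ring):
    """Area-weighted centroid of a simple polygon given as [(lon, lat), ...].

    Coordinates are shifted to the first vertex before summing, which keeps
    the arithmetic accurate for small fields far from the origin.
    Raises ZeroDivisionError for degenerate (zero-area) polygons.
    """
    ox, oy = ring[0]
    pts = [(x - ox, y - oy) for x, y in ring]
    if pts[0] != pts[-1]:
        pts.append(pts[0])  # close the ring
    area2 = cx = cy = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        cross = x0 * y1 - x1 * y0
        area2 += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    # area2 is twice the signed area, hence the factor 3 instead of 6
    return ox + cx / (3.0 * area2), oy + cy / (3.0 * area2)


# Hypothetical rectangular field in lon/lat
field = [(4.000, 50.000), (4.002, 50.000), (4.002, 50.001), (4.000, 50.001)]
centroid_lon, centroid_lat = polygon_centroid(field)
```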

### 1) Run extractions

In [None]:
# Dataset Parameters
start_date="2023-01-01"
end_date="2023-12-31"
composite_window="dekad"
unique_id_column="Field_ID"
input_df="/home/giorgia/Private/data/geomaize/correct/Maize_2023_valid.geojson"
output_folder="/home/giorgia/Private/data/geomaize/new_extractions"

In [None]:
# check input data structure 
gpd.read_file(input_df).head(5)

In [None]:
job_params = dict(
    output_folder=output_folder,
    input_df=input_df,
    start_date=start_date,
    end_date=end_date,
    unique_id_column=unique_id_column,
    composite_window=composite_window,
)
extract(generate_input_for_extractions(job_params))

Once the dataset has been extracted, it can be loaded with the `load_dataset` function by specifying the path where the `.parquet` files have been downloaded. Moreover, **if** we are dealing with 1 year of data falling in the **same time period**, the following manipulations of the dataset are also possible.

- `window_of_interest`: the user can specify a time window of interest out of the whole available time-series. `start_date` and `end_date` should be provided as strings in a list.
- `use_valid_time`: the user might want to define the window of interest based on the `date` the label is associated with. If so, `required_min_timesteps` should also be provided.
- `buffer_window`: buffers the `start_date` and `end_date` by a number of time steps specified with this argument. 

In the following cell, we load the extracted dataset for 1 year of data.

### 2) Presto datasets initialization

In [None]:
extractions = load_dataset(
    output_folder,
    composite_window=composite_window,
)
train_df, test_df, val_df = train_test_val_split(
    extractions,
    uniform_sample_by=unique_id_column,
    sampling_frac=0.8,
    nmin_per_class=1, # do not change this parameter
)

We now set up the parameters needed for initializing the Presto datasets for the specific task:
- `num_timesteps`: can be inferred from the maximum of `available_timesteps`
- `target_name`: name of the column containing the target data

In [None]:
# Task parameters (inspect the target distribution beforehand and exclude outliers if needed)
num_timesteps = extractions.available_timesteps.max()
task_type = "regression"
target_name="Yield kg/H"

We initialize the training, validation and test dataset objects to be used for training Presto.

In [None]:
train_ds = ScaleAgDataset(
    dataframe=train_df,
    num_timesteps=num_timesteps,
    task_type=task_type,
    target_name=target_name,
    composite_window=composite_window,
)

val_ds = ScaleAgDataset(
    dataframe=val_df,
    num_timesteps=num_timesteps,
    task_type=task_type,
    target_name=target_name,
    composite_window=composite_window,
)

test_ds = ScaleAgDataset(
    dataframe=test_df,
    num_timesteps=num_timesteps,
    task_type=task_type,
    target_name=target_name,
    composite_window=composite_window,
)

### 3) Presto Finetuning

In this section, Presto is fine-tuned in a supervised way for the target downstream task. First, we set up the following experiment parameters:

- `output_dir`: where to save the model
- `experiment_name`: the model name

In [None]:
# Set the model output directory and the experiment name
models_dir = Path("/home/giorgia/Private/data/geomaize/models/")
experiment_name = "presto-ss-wc-10D-ft-dek-geomaize-lognorm"
model_output_dir = models_dir / experiment_name
model_output_dir.mkdir(parents=True, exist_ok=True)

In [None]:
# Construct the model with finetuning head starting from the pretrained model
finetuned_model = finetune_on_task(
    train_ds=train_ds,
    val_ds=val_ds,
    pretrained_model_path=get_pretrained_model_url(composite_window=composite_window),
    output_dir=model_output_dir,
    experiment_name=experiment_name,
    batch_size=32,
    num_workers=0,
    max_epochs=100, # max num of training rounds the model should go through
    patience=10, # how many epochs to wait for improvement before stopping
    lr=1e-3, # learning rate. this parameter should be kept < 1e-3. usually very low (e.g. 2e-5)
    )
# save ids to csvs for experiment replication
val_df['sample_id'].to_csv(model_output_dir / "val_sample_ids.csv", index=False)
test_df['sample_id'].to_csv(model_output_dir / "test_sample_ids.csv", index=False)
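
The interplay between `max_epochs` and `patience` in the cell above corresponds to the usual early-stopping pattern. The sketch below shows that pattern on a list of validation losses; `early_stopping_epoch` is a hypothetical helper written for this explanation and is not the training loop used by `finetune_on_task`, whose details may differ.

```python
def early_stopping_epoch(val_losses, patience=10):
    """Return (stop_epoch, best_epoch), both 1-indexed.

    Training stops once `patience` consecutive epochs have passed without a
    new best validation loss, or when the losses run out (max_epochs reached).
    """
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(val_losses), best_epoch


# Hypothetical validation losses, one per epoch
losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.75, 0.9]
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=3)
```

With `patience=3`, training in this example stops at epoch 6 because the best loss was reached at epoch 3 and the three following epochs did not improve on it.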

In [None]:
import matplotlib.pyplot as plt
import numpy as np

finetuned_model = load_finetuned_model(
    model_path = model_output_dir / experiment_name,
    task_type=task_type,
)

metrics, preds_original_units, targets_original_units = evaluate_finetuned_model(finetuned_model, test_ds, num_workers=0, batch_size=32)

plt.figure(figsize=(15, 5))
plt.subplot(1, 2, 2)
plt.scatter(x=np.arange(len(preds_original_units)), y=preds_original_units, label='preds')
plt.scatter(x=np.arange(len(targets_original_units)), y=targets_original_units, label='targets')
plt.xticks(ticks=np.arange(len(test_df)), labels=test_df.sample_id.to_list(), rotation=45)
plt.legend()
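
The exact contents of `metrics` are defined by `evaluate_finetuned_model`. As an illustration of how typical regression metrics relate to the predictions and targets plotted above, the following sketch computes RMSE, MAE and R² in plain Python; the function `regression_metrics` and the example values are assumptions, not part of the package.

```python
import math


def regression_metrics(preds, targets):
    """Compute RMSE, MAE and R^2 for two equally long sequences of numbers."""
    n = len(targets)
    errors = [p - t for p, t in zip(preds, targets)]
    ss_res = sum(e * e for e in errors)          # residual sum of squares
    mean_t = sum(targets) / n
    ss_tot = sum((t - mean_t) ** 2 for t in targets)  # total sum of squares
    return {
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1.0 - ss_res / ss_tot,
    }


# Hypothetical predictions and targets in original units
example_metrics = regression_metrics([2.0, 4.0, 6.0], [1.0, 4.0, 7.0])
```

Computing the metrics in the original units, as done with `preds_original_units` and `targets_original_units`, keeps the errors directly interpretable as yield differences.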

### 4) Inference using Fine-Tuned end-to-end Presto

In this section, we apply the fine-tuned model to generate a yield map on an unseen area.
We need to indicate the spatial and temporal extent. The 2 cells below offer a simple way for the user to provide this information and run the extraction of the EO time series required by Presto from CDSE once again.
We also need to indicate the `output_dir` where the datacube of the extraction will be saved, its `output_filename`, and the `composite_window`, which must be the same as the one used for fine-tuning the model.

In [None]:
map = ui_map(area_limit=7)

In [None]:
# select 1 year of data
slider = date_slider()

In [None]:
output_dir = Path("/home/giorgia/Private/data/geomaize/regression")
output_filename = "inference_area"
inference_file = output_dir / f"{output_filename}.nc"

In [None]:
collect_inputs_for_inference(
    spatial_extent=map.get_extent(),
    temporal_extent=slider.get_processing_period(),
    output_path=output_dir,
    output_filename=f"{output_filename}.nc",
    composite_window=composite_window,
)

Once the datacube has been extracted, we can perform the inference task using the finetuned model and visualize the predicted map. 

In [None]:
inference_file = output_dir / "inference_area.nc"
mask_path = None

In [None]:
finetuned_model = load_finetuned_model(model_output_dir / experiment_name, task_type=task_type)
presto_model = PrestoPredictor(
    model=finetuned_model,
    batch_size=50,
    task_type=task_type,
    composite_window=composite_window,
)

predictions = presto_model.predict(inference_file, mask_path=mask_path)
predictions_map = reshape_result(predictions, path_to_input_file=inference_file)

In [None]:
plot_results(prob_map=predictions_map, path_to_input_file=inference_file, task=task_type, ts_index=14)