# Data Extractions from OpenEO

To run the extractions, you need an account in the [Copernicus Data Space Ecosystem (CDSE)](https://openeo.dataspace.copernicus.eu/).

In [None]:
from loguru import logger
import geopandas as gpd
from pathlib import Path
from scaleagdata_vito.openeo.extract_sample_scaleag import (
    generate_extraction_job_command,
)
from scaleagdata_vito.presto.presto_df import load_dataset

### Check the input data before launching the OpenEO jobs
Before launching the extractions, check that your input file is suitable. In particular, verify that all geometries are valid and, ideally, that the file has a column containing a unique ID for each sample.


In [38]:
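def check_unique_id(df_path, unique_id):
    """Return the rows that share a duplicated ID, or None if all IDs are unique."""
    df = gpd.read_file(df_path)
    if df[unique_id].nunique() != df.shape[0]:
        logger.info("IDs are not unique!")
        # keep=False flags every occurrence of a repeated ID, including the first
        return df[df[unique_id].duplicated(keep=False)]
    else:
        logger.info("IDs are unique")
        return None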


def check_valid_geometry(df_path, save_to=""):
    """Return the rows with invalid geometries, or None if all are valid.

    If `save_to` is given, the invalid rows are also written there as GeoJSON.
    """
    df = gpd.read_file(df_path)
    is_valid = df.geometry.is_valid  # compute the validity mask once
    if not is_valid.all():
        logger.info("Invalid geometries found! Returning invalid geometries")
        df_invalid = df[~is_valid]
        if save_to:
            filename = Path(save_to) / f"{Path(df_path).stem}_invalid.geojson"
            logger.info(f"Saving invalid geometries to {filename}")
            Path(save_to).mkdir(parents=True, exist_ok=True)
            df_invalid.to_file(filename)
        return df_invalid
    else:
        logger.info("All geometries are valid")
        return None

In [39]:
input_file = "/projects/TAP/HEScaleAgData/data/GEOMaize/Maize_Yield_Polygon_North_Ghana/Polygon_North/Maize_2021.shp"
invalid_geom = check_valid_geometry(input_file, save_to="")
non_unique_ids = check_unique_id(input_file, unique_id="Field_ID")

2025-02-17 16:24:45.821 | INFO     | __main__:check_valid_geometry:15 - Invalid geometries found! Returning invalid geometries
2025-02-17 16:24:45.843 | INFO     | __main__:check_unique_id:7 - IDs are unique
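
If the ID check reports duplicates, `check_unique_id` returns every row that shares an ID, not only the later repeats, because it calls `duplicated(keep=False)`. The sketch below mirrors that behaviour on a plain list, without geopandas. The helper `find_duplicated_ids` and the sample IDs are hypothetical and only illustrate the logic.

```python
from collections import Counter


def find_duplicated_ids(ids):
    """Return (position, id) pairs for every ID that occurs more than once.

    Hypothetical helper mirroring pandas' `duplicated(keep=False)`:
    all occurrences of a repeated ID are flagged, including the first.
    """
    counts = Counter(ids)
    return [(pos, value) for pos, value in enumerate(ids) if counts[value] > 1]


# Made-up field IDs, for illustration only.
sample_ids = ["F001", "F002", "F001", "F003"]
duplicates = find_duplicated_ids(sample_ids)
print(duplicates)
```

Seeing both occurrences side by side makes it easier to decide which row to keep or how to rename the IDs before launching the extraction.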


#### Get the command to run the OpenEO extractions

1) To set up the job, adapt the job parameters to your needs. The following fields are required to generate the command that starts the extraction from the terminal:

    ```python
    job_params = dict(
        output_folder=..., 
        input_df=...,
        start_date=...,
        end_date=...,
        unique_id_column=...,
        composite_window=..., # "month" or "dekad" are supported. Default is "dekad"
    )

    ```
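
Because a typo in these fields only surfaces once the job is launched, it can help to sanity-check the dictionary first. The sketch below is a minimal example, not part of `scaleagdata_vito`: `validate_job_params` is a hypothetical helper, and it assumes that dates are ISO strings such as `2021-07-01` and that `composite_window` falls back to `"dekad"` when omitted, as the comment in the template suggests.

```python
from datetime import date

# Taken from the comment in the template; an assumption about what the script accepts.
SUPPORTED_WINDOWS = {"month", "dekad"}
REQUIRED_KEYS = ["output_folder", "input_df", "start_date", "end_date", "unique_id_column"]


def validate_job_params(params):
    """Return a list of problems found in `params` (empty if none).

    Hypothetical helper, not part of scaleagdata_vito.
    """
    problems = []
    for key in REQUIRED_KEYS:
        if not params.get(key):
            problems.append(f"missing value for '{key}'")

    window = params.get("composite_window", "dekad")
    if window not in SUPPORTED_WINDOWS:
        problems.append(f"unsupported composite_window: {window!r}")

    try:
        start = date.fromisoformat(params["start_date"])
        end = date.fromisoformat(params["end_date"])
    except (KeyError, TypeError, ValueError):
        problems.append("start_date and end_date must be ISO dates such as 2021-07-01")
    else:
        if start >= end:
            problems.append("start_date must be before end_date")
    return problems


# Placeholder values, for illustration only.
example_params = dict(
    output_folder="out",
    input_df="fields.shp",
    start_date="2021-07-01",
    end_date="2021-10-31",
    unique_id_column="Field_ID",
    composite_window="dekad",
)
print(validate_job_params(example_params))
```

An empty list means nothing obviously wrong was found; it does not guarantee that the extraction script accepts the values.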

In [24]:
job_params = dict(
    output_folder="/projects/TAP/HEScaleAgData/data/GEOMaize/Maize_Yield_Polygon_North_Ghana/Polygon_North/2021",
    input_df="/projects/TAP/HEScaleAgData/data/GEOMaize/Maize_Yield_Polygon_North_Ghana/Polygon_North/Maize_2021.shp",
    start_date="2021-07-01",
    end_date="2021-10-31",
    unique_id_column="Field_ID",
    composite_window="dekad",
)
generate_extraction_job_command(job_params)

python scaleag-vito/scripts/extractions/extract.py -output_folder /projects/TAP/HEScaleAgData/data/GEOMaize/Maize_Yield_Polygon_North_Ghana/Polygon_North/2021 -input_df /projects/TAP/HEScaleAgData/data/GEOMaize/Maize_Yield_Polygon_North_Ghana/Polygon_North/Maize_2021.shp --start_date 2021-07-01 --end_date 2021-10-31 --unique_id_column Field_ID --composite_window dekad
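
To get a feel for what `composite_window="dekad"` implies for the date range above, the following sketch counts the dekads between the start and end dates. It assumes the common convention of three dekads per month (days 1-10, 11-20, and 21 to the end of the month). The extraction pipeline may align or pad the window differently, so treat the result as an estimate. `dekad_starts` is a hypothetical helper.

```python
import calendar
from datetime import date


def dekad_starts(start, end):
    """List the first day of every dekad that overlaps [start, end].

    Assumes three dekads per month: days 1-10, 11-20, and 21 to month end.
    """
    starts = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        last_day = calendar.monthrange(year, month)[1]
        for first, last in ((1, 10), (11, 20), (21, last_day)):
            dekad_first = date(year, month, first)
            dekad_last = date(year, month, last)
            # Keep the dekad if it overlaps the requested range.
            if dekad_last >= start and dekad_first <= end:
                starts.append(dekad_first)
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return starts


dekads = dekad_starts(date(2021, 7, 1), date(2021, 10, 31))
print(len(dekads), dekads[0], dekads[-1])  # 12 2021-07-01 2021-10-21
```

Under the same assumption, `composite_window="month"` would give four time steps for this range instead of twelve.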


2) In the terminal, you will be asked to authenticate and given a link. Click the link and log in with your CDSE credentials.
3) Once the extraction has finished, you can load your dataset as follows:

In [None]:

dataset_df = load_dataset(job_params["output_folder"])