# Geospatial Job Management and Visualization with OpenEO

This notebook demonstrates how to authenticate with the OpenEO backend, create a spatial grid for a specific region, prepare jobs for geospatial analysis, run them in parallel, and visualize the job statuses using interactive maps.

We will go through the following steps:
1. **Authentication and Backend Initialization**
2. **Generating a Spatial Grid for the Antwerp Region**
3. **Visualizing the Spatial Grid**
4. **Preparing Jobs for Parallel Processing**
5. **Running and Tracking Jobs Using MultiBackendJobManager**
6. **Visualizing Job Status Using Plotly Maps**

### 1. Authentication and Backend Initialization

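We start by connecting to the Copernicus Data Space Ecosystem (CDSE) openEO federation backend and authenticating with OpenID Connect. We then initialize a `MultiBackendJobManager`, which can manage jobs across multiple backends, and register the connection under the name `cdse`, allowing at most two jobs to run in parallel on that backend.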


In [1]:
# 1. Importing Required Packages

import json
import openeo
import pandas as pd
import shapely
from openeo.extra.job_management import MultiBackendJobManager, CsvJobDatabase

# Authenticate and add the backend
connection = openeo.connect(url="openeofed.dataspace.copernicus.eu").authenticate_oidc()

# initialize the job manager
manager = MultiBackendJobManager()
manager.add_backend("cdse", connection=connection, parallel_jobs=2)

Authenticated using refresh token.


### 3. Generating a Spatial Grid for the Antwerp Region

We define the extent of the Antwerp area directly in UTM zone 31N coordinates (EPSG:32631), so that the grid cells can be sized in metres. The extent is split into 5 km square cells, and the resulting grid is then converted to WGS84 (EPSG:4326) for further processing.

We also save the grid as a GeoJSON file for future use.


In [None]:
# 3. Generate the grid for Antwerp
import geopandas as gpd
from shapely.geometry import box
import numpy as np

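# Define the grid size and the UTM (EPSG:32631) easting/northing ranges in metres
grid_size_m = 5000
x_coords = np.arange(595000, 600000, grid_size_m)
# Upper northing bound corrected from 56730000, which is not a valid UTM northing
y_coords = np.arange(5660000, 5673000, grid_size_m)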

# Create polygons for the grid
polygons = [box(x, y, x + grid_size_m, y + grid_size_m) for x in x_coords for y in y_coords]

# Create a GeoDataFrame and save it
grid_gdf_utm = gpd.GeoDataFrame({'geometry': polygons}, crs="EPSG:32631")
grid_gdf_latlon = grid_gdf_utm.to_crs("EPSG:4326")
grid_gdf_latlon['id'] = range(len(grid_gdf_latlon))
grid_gdf_latlon.to_file("./antwerp_grid_5km.geojson", driver="GeoJSON")
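
To make the grid logic easier to follow, here is a minimal standard-library sketch of the same tiling idea. The helper name `grid_cells` and the northing range 5660000 to 5673000 are assumptions for illustration only; the notebook itself uses `np.arange` and `shapely.geometry.box`.

```python
def grid_cells(x_min, x_max, y_min, y_max, size):
    """Return (minx, miny, maxx, maxy) tuples for square cells of `size` metres.

    Like np.arange, the lower-left corners start at the minimum and stop
    before the maximum, so the last cell may extend past x_max / y_max.
    """
    return [
        (x, y, x + size, y + size)
        for x in range(x_min, x_max, size)
        for y in range(y_min, y_max, size)
    ]


# Illustrative UTM 31N extent (assumed values), 5 km cells
cells = grid_cells(595000, 600000, 5660000, 5673000, 5000)
print(len(cells), "cells")
for cell in cells:
    print(cell)
```

Because the lower-left corners stop before the upper bound, an extent that is not a multiple of the cell size yields a final row or column that overshoots the requested extent rather than being clipped.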


### 4. Visualizing the Spatial Grid

Using Plotly, we visualize the spatial grid we just created.


In [None]:
# 4. Visualizing the grid using Plotly
import plotly.express as px

bboxes = gpd.read_file("./antwerp_grid_5km.geojson")

fig = px.choropleth_mapbox(
    bboxes,
    geojson=bboxes.geometry,
    locations=bboxes.index,
    mapbox_style="carto-positron",
    center={"lat": 51.15, "lon": 4.4},
    zoom=8,
    title="Spatial Grid for Antwerp Region"
)

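# update_geos() only affects `geo` subplots, so it is omitted for this mapbox figure;
# the view is controlled by `center` and `zoom` above.
fig.update_layout(margin={"r": 0, "t": 0, "l": 0, "b": 0})
fig.show("png")  # static rendering; requires the kaleido package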


### 5. Preparing Jobs for Geospatial Processing

Here, we define a function to create a job configuration DataFrame. The jobs are set to analyze NDVI (Normalized Difference Vegetation Index) using Sentinel-2 data over a specific time range.


In [7]:
# 5. Prepare the job DataFrame
def prepare_jobs_df(temporal_range, grid_df) -> pd.DataFrame:
    jobs = []
    for _, row in grid_df.iterrows():
        startdate, enddate = temporal_range[0], temporal_range[1]
        spatial_extent = shapely.to_geojson(row.geometry)
        jobs.append({
            "spatial_extent": spatial_extent,
            "start_date": startdate,
            "end_date": enddate
        })
    return pd.DataFrame(jobs)

# Create the jobs DataFrame
jobs_df = prepare_jobs_df(["2024-05-01", "2024-08-01"], bboxes)

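# Persist the jobs to a CSV tracker, but only on the first run, so that
# existing job statuses are not overwritten when the notebook is re-executed.
job_tracker = 'job_tracker.csv'
job_db = CsvJobDatabase(path=job_tracker)
if not job_db.exists():
    # _normalize_df is a private helper that adds the bookkeeping columns
    # (status, id, ...) the manager expects; it may change between openeo versions.
    df = manager._normalize_df(jobs_df)
    job_db.persist(df)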
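
The job tracker is a CSV file, so each geometry has to be stored as text. The sketch below illustrates that round trip with the standard library only: the geometry is serialized to a GeoJSON string when the row is built and decoded again when the job is started. The polygon coordinates and the helper name `make_job_row` are hypothetical and only serve the illustration.

```python
import json

# Hypothetical grid cell in lon/lat, written as a GeoJSON Polygon mapping
cell = {
    "type": "Polygon",
    "coordinates": [
        [[4.35, 51.08], [4.42, 51.08], [4.42, 51.13], [4.35, 51.13], [4.35, 51.08]]
    ],
}


def make_job_row(geometry, temporal_range):
    """Build one job row; the geometry is stored as a JSON string so it fits in a CSV cell."""
    start_date, end_date = temporal_range
    return {
        "spatial_extent": json.dumps(geometry),
        "start_date": start_date,
        "end_date": end_date,
    }


row = make_job_row(cell, ["2024-05-01", "2024-08-01"])

# Later, when the job is created, the string is decoded back into a mapping
decoded = json.loads(row["spatial_extent"])
print(decoded["type"], row["start_date"], row["end_date"])
```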

### 6. Running the Jobs with MultiBackendJobManager

We define a `start_job` function that builds the NDVI batch job for a single grid cell. The `MultiBackendJobManager` calls this function for every row in the job database and runs the resulting jobs in parallel.


In [8]:
# 6. Start the jobs using the job manager
def start_job(row: pd.Series, connection: openeo.Connection, **kwargs) -> openeo.BatchJob:
    spatial_extent = json.loads(row["spatial_extent"])
    startdate = row["start_date"]
    enddate = row["end_date"]

    cube = connection.load_collection(
        "SENTINEL2_L2A",
        temporal_extent=[startdate, enddate],
        spatial_extent=spatial_extent,
        bands=["B04", "B08"])

    ndvi_cube = cube.ndvi().mean_time()

    return ndvi_cube.create_job(
        title="test exercise ndvi",
        out_format="NetCDF",
        job_options={"executor-memory": "2G", "executor-memoryOverhead": "2G", "python-memory": "500m"}
    )
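
For readers unfamiliar with NDVI, the per-pixel arithmetic behind `cube.ndvi().mean_time()` can be sketched in plain Python. For Sentinel-2, B08 is the near-infrared band and B04 is the red band. The reflectance values below are made up, and the plain average is a simplification of what the backend does with missing or cloudy observations.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index for one pixel: (NIR - red) / (NIR + red)."""
    total = nir + red
    if total == 0:
        return float("nan")  # undefined when both bands are zero
    return (nir - red) / total


def mean_over_time(values):
    """Plain average of a per-pixel time series (simplified: no no-data handling)."""
    return sum(values) / len(values)


# Hypothetical (B08, B04) reflectances for one pixel on three dates
observations = [(0.40, 0.10), (0.50, 0.10), (0.45, 0.05)]

ndvi_series = [ndvi(nir, red) for nir, red in observations]
mean_ndvi = mean_over_time(ndvi_series)
print([round(v, 3) for v in ndvi_series], round(mean_ndvi, 3))
```

NDVI ranges from -1 to 1, and dense green vegetation typically gives values well above zero, which is why a temporal mean over a summer period is a convenient single-layer summary per grid cell.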


### 7. Visualizing Job Status

We run the job manager in a background thread so that the jobs can be tracked and visualized at the same time.

While the jobs run, we repeatedly load the job tracker file and show the status of each job in the spatial grid using a Plotly choropleth map.


In [None]:
# Imports for the live status visualization
import plotly.express as px
from shapely import geometry
import json
import geopandas as gpd
import pandas as pd
import time
from plotly import offline
from IPython.display import clear_output

# Update colors based on job status
color_dict = {
    "not_started": 'lightgrey', 
    "created": 'gold', 
    "queued": 'lightsteelblue', 
    "running": 'navy', 
    "finished": 'lime',
    "error": 'darkred',
    "skipped": 'darkorange',
    None: 'grey'  # Default color for no status
}

# Start the job manager in a separate (background) thread
manager.start_job_thread(start_job=start_job, job_db=job_db)

# Visualization loop: rebuild and redraw the map until all jobs are done

while not manager._stop_thread:
    try:
        # Read job statuses from the tracker
        status_df = pd.read_csv(job_tracker)

        # Parse spatial extent into geometries
        status_df['geometry'] = status_df['spatial_extent'].apply(lambda x: geometry.shape(json.loads(x)))
        status_df = gpd.GeoDataFrame(status_df, geometry='geometry', crs='EPSG:4326')

        # Use the 'status' column to determine colors, with a fallback for NaNs or None
        status_df['color'] = status_df['status'].map(color_dict).fillna(color_dict[None])

        minx, miny, maxx, maxy = status_df.total_bounds
        center_lat = (miny + maxy) / 2
        center_lon = (minx + maxx) / 2

        fig = px.choropleth_mapbox(
            status_df,
            geojson=status_df.geometry.__geo_interface__,  # Use the correct GeoJSON representation
            locations=status_df.index,
            color='status',  # Use 'status' for the color
            color_discrete_map=color_dict,  # Map colors directly from the dictionary
            mapbox_style="carto-positron",
            center={"lat": center_lat, "lon": center_lon},  # Center on your area of interest
            zoom=8,
            title="Job Status Overview",
            labels={'status': 'Job Status'} 
        )
        # update_geos() has no effect on mapbox figures, so only the margins are adjusted
        fig.update_layout(margin={"r": 0, "t": 0, "l": 0, "b": 0})

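        # Display the updated figure; wait=True delays clearing until the new
        # output is ready, which reduces flicker between refreshes
        clear_output(wait=True)
        offline.iplot(fig)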

        # Check if all jobs are done
        if status_df['status'].isin(["not_started", "created", "queued", "running"]).sum() == 0:
            manager.stop_job_thread()

        time.sleep(15)  # Wait before the next update

    except KeyboardInterrupt:
        # Also stop the background job thread when the loop is interrupted
        manager.stop_job_thread()
        break
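
The stopping condition in the loop above can be isolated into a small helper, which also makes it easy to print a textual progress summary next to the map. This is a minimal standard-library sketch; the helper name `summarize` and the example status lists are hypothetical, while the set of active statuses is the one used in the loop.

```python
from collections import Counter

# Statuses that mean a job still needs work, as used in the loop above
ACTIVE_STATUSES = {"not_started", "created", "queued", "running"}


def summarize(statuses):
    """Return (counts per status, True if no job is still active)."""
    counts = Counter(statuses)
    still_active = sum(counts[s] for s in ACTIVE_STATUSES)
    return counts, still_active == 0


# Hypothetical snapshot of the tracker's status column
counts, done = summarize(["finished", "running", "error", "finished"])
print(dict(counts), "all done:", done)

# Once every job has reached a final status, the loop can stop
_, done_later = summarize(["finished", "error", "skipped"])
print("all done:", done_later)
```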