# Copernicus Data Access with openEO

Michael Mommert, Stuttgart University of Applied Sciences, 2025

This Notebook introduces the openEO client library for Python. [openEO](https://openeo.org) provides a common API to easily connect to Earth observation data backends. This makes it significantly easier to access a wide range of data products from different providers. This Notebook is based on the [official quickstart guide](https://openeo.org/documentation/1.0/python) and [another example Notebook](https://documentation.dataspace.copernicus.eu/notebook-samples/openeo/NDVI_Timeseries.html).

In this tutorial, we will focus on the Copernicus Data Space Ecosystem (CDSE) as our backend; if you need more information on this, you can refer to the [Copernicus Data Space Ecosystem Documentation Portal](https://documentation.dataspace.copernicus.eu/).

Before we can use openEO, we have to install it:

In [None]:
%pip install numpy matplotlib pandas geopandas rasterio openeo

Now we can import the openEO module and others:

In [None]:
import os
import json
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import geopandas as gpd

import rasterio
import openeo

## Backends

A wide range of backends is available for use with openEO. These include the Copernicus programme, Google Earth Engine and others. A full list is available through the [openEO Hub](https://hub.openeo.org/).

We will now explore the Copernicus backend. The first step is to establish a connection to the backend:

In [None]:
connection = openeo.connect("https://openeo.dataspace.copernicus.eu/openeo/1.2")

Now we can explore the available collections, each of which is a dataset provided to the user:

In [None]:
connection.list_collections()

As you can see, a huge variety of datasets across different data modalities and data products are available. We can explore the individual collections interactively within this Jupyter Notebook.
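
Because the list of collections is long, it can help to search it by keyword. The following is a minimal sketch, assuming that each entry returned by `connection.list_collections()` behaves like a dictionary with an `id` and an optional `title` key. The helper `find_collections` and the abbreviated sample list are hypothetical and only serve to make the example runnable without a connection.

```python
def find_collections(collections, keyword):
    """Return the ids of all collections whose id or title contains the keyword."""
    keyword = keyword.lower()
    matches = []
    for collection in collections:
        collection_id = collection.get("id", "")
        title = collection.get("title", "")
        if keyword in f"{collection_id} {title}".lower():
            matches.append(collection_id)
    return matches


# Stand-in for the result of connection.list_collections(); these entries are
# abbreviated placeholders, not actual backend output.
sample_collections = [
    {"id": "SENTINEL2_L2A", "title": "Sentinel-2 L2A"},
    {"id": "SENTINEL2_L1C", "title": "Sentinel-2 L1C"},
    {"id": "SENTINEL1_GRD", "title": "Sentinel-1 GRD"},
]

print(find_collections(sample_collections, "sentinel2"))
```

With a live connection, we would pass the result of `connection.list_collections()` instead of `sample_collections`.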

In the following, we will focus on the Sentinel-2 Level-2A data collection.

## Authentication

Before we can access data from this collection, we have to authenticate. In the case of the Copernicus Data Space Ecosystem (CDSE), this is done using OpenID Connect through a web browser. Note that you need to be registered with the CDSE (or whatever backend you plan to use).

In [None]:
connection.authenticate_oidc()

Now that we are authenticated, we can access the data collection, which is done via datacubes.

## Datacubes

A [datacube](https://openeo.org/documentation/1.0/datacubes.html#what-are-datacubes) defines a slice through a specific data collection. Consider the case that we are interested in accessing Sentinel-2 data for a specific area taken in June of 2025. We can create the corresponding datacube as follows:

In [None]:
datacube = connection.load_collection(
  "SENTINEL2_L2A",
  spatial_extent={"west": 9.253, "south": 48.698, "east": 9.275, "north": 48.710},
  temporal_extent=["2025-06-15", "2025-06-30"],  # start and end date
  max_cloud_cover=20,  # permitted cloud cover percentage
  bands=["B04", "B03", "B02", "B08"]
)

As you can see, we can define the spatial extent, the temporal extent, the maximum cloud cover and the bands that we are interested in. For a full list of features, please refer to the corresponding [documentation](https://open-eo.github.io/openeo-python-client/api.html#openeo.rest.datacube.DataCube.load_collection). Defining the datacube does not trigger any processing by itself. We could, in principle, apply a process to this datacube, which processes the data on the backend (aggregation, filtering, etc.). In this example, we will not do this, but you can find additional information on how to do this [here](https://openeo.org/documentation/1.0/python/#applying-processes).
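
Since a batch job takes a few minutes, it can be worth checking the extents locally before sending a request. The following is a minimal sketch of such a check. The helper `check_extents` is hypothetical and not part of openEO; it assumes that the bounding box is given in longitude and latitude degrees, that the dates are ISO strings, and that the box does not cross the antimeridian.

```python
from datetime import date


def check_extents(spatial_extent, temporal_extent):
    """Raise ValueError if the bounding box or the date range is inconsistent.

    Returns the length of the date range in days.
    """
    west, east = spatial_extent["west"], spatial_extent["east"]
    south, north = spatial_extent["south"], spatial_extent["north"]
    if not (-180 <= west < east <= 180):
        raise ValueError("expected -180 <= west < east <= 180")
    if not (-90 <= south < north <= 90):
        raise ValueError("expected -90 <= south < north <= 90")

    start, end = (date.fromisoformat(d) for d in temporal_extent)
    if start >= end:
        raise ValueError("start date must be before end date")
    return (end - start).days


n_days = check_extents(
    {"west": 9.253, "south": 48.698, "east": 9.275, "north": 48.710},
    ["2025-06-15", "2025-06-30"],
)
print(n_days)  # 15
```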

Instead, we will download the data into our Jupyter environment. Note that this will take a few minutes: your request is not processed immediately, but queued and processed once the required resources are available. This is called batch processing.

In [None]:
# we define how we want to save the data
result = datacube.save_result("GTiff")

# we create a new job at the backend
job = result.create_job(validate=True)

# now we start the job and wait for the data to download into a designated directory
job.start_and_wait()
data_dir = 'sen2_data'
job.get_results().download_files(data_dir)

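Conceptually, waiting for a batch job means asking the backend for the job status again and again until a final state is reached. The following is a minimal sketch of this idea and not the openEO client's actual implementation. `FakeJob` and `wait_for` are hypothetical names; the sketch assumes a `status()` method that returns one of the openEO job states (`created`, `queued`, `running`, `finished`, `error`, `canceled`).

```python
import time


class FakeJob:
    """Hypothetical stand-in for a batch job; it only replays a fixed list of states."""

    def __init__(self, states):
        self._states = list(states)

    def status(self):
        # return the next state; once only one state is left, keep repeating it
        if len(self._states) > 1:
            return self._states.pop(0)
        return self._states[0]


def wait_for(job, poll_interval=0.0, max_polls=100):
    """Poll job.status() until the job reaches a final state and return that state."""
    final_states = {"finished", "error", "canceled"}
    for _ in range(max_polls):
        state = job.status()
        if state in final_states:
            return state
        time.sleep(poll_interval)  # wait before asking again
    raise TimeoutError("job did not reach a final state in time")


print(wait_for(FakeJob(["created", "queued", "running", "finished"])))  # finished
```

For a real backend, a poll interval of several seconds would be more appropriate than the value of zero used here to keep the example fast.
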
The data will be written to the directory that we provided (if the directory does not yet exist, a new one will be created).

We can now open and display the downloaded images:

In [None]:
# compile a list of downloaded images
filelist = []
for filename in sorted(os.listdir(data_dir)):
    if filename.endswith('tif'):
        filelist.append(os.path.join(data_dir, filename))

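# arrange the images in a near-square grid with enough panels for all files
n_cols = int(np.ceil(np.sqrt(len(filelist))))
n_rows = int(np.ceil(len(filelist)/n_cols))

f, ax = plt.subplots(n_rows, n_cols, figsize=(n_cols*5, n_rows*5))  # figsize is (width, height)
ax = np.ravel(ax) # flatten ax array

for i in range(len(filelist)):
    # read in image data
    with rasterio.open(filelist[i]) as dataset:
        rgb = np.dstack(dataset.read((1,2,3)))  # read the first three bands (drops the NIR band) and move the band axis last

    # normalize rgb data to the 1st-99th percentile range and clip to [0, 1]
    rgb = (rgb - np.percentile(rgb, 1))/(np.percentile(rgb, 99)-np.percentile(rgb, 1))
    rgb = np.clip(rgb, 0, 1)

    # plot data
    ax[i].imshow(rgb)
    ax[i].set_title(filelist[i])
    ax[i].axis('off')

# hide panels that are not used
for unused in ax[len(filelist):]:
    unused.axis('off')

f.tight_layout()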

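The grid of panels has to provide at least one panel per downloaded file, whatever the number of files. The following is a minimal sketch of one way to compute such a layout; the helper `grid_shape` is hypothetical and uses only the standard library.

```python
import math


def grid_shape(n_images):
    """Return (n_rows, n_cols) of a near-square grid with at least n_images panels."""
    if n_images < 1:
        raise ValueError("need at least one image")
    n_cols = math.ceil(math.sqrt(n_images))
    n_rows = math.ceil(n_images / n_cols)
    return n_rows, n_cols


for n in (1, 4, 5, 7):
    n_rows, n_cols = grid_shape(n)
    print(n, "images ->", n_rows, "rows x", n_cols, "columns")
```

Rounding up in both steps guarantees `n_rows * n_cols >= n_images`, so indexing the flattened axes array with the image index cannot run out of panels.
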

**Exercise**: Define a datacube yourself and download the corresponding data. Choose a small area (e.g., using services like [geojson.io](https://geojson.io/)) and a short period of time to keep the download small. If you run into errors, carefully read the error message and fix your query.

In [None]:
# use this cell for the exercise

## Applying Processes to Areas of Interest

Instead of downloading image crops and analysing those locally on your computer, you can also query individual areas of interest (e.g., polygons) and analyse those directly on the backend computer. This has the advantage that you don't have to download large amounts of data; instead you bring your analysis to the computer where the data is stored.

Let's begin by defining some areas of interest. In our case, we will define some polygons based on the images above. These polygons are stored in a GeoJSON file. Let's read in this file:

In [None]:
# download geojson file
!wget "https://github.com/Hochschule-fuer-Technik-Stuttgart/teaching-mommert/blob/main/remotesensing/openeo/aoi.geojson?raw=true" -O aoi.geojson

# read in polygons
polygons = gpd.read_file('aoi.geojson')
polygons = polygons.set_crs(epsg=4326)  # set a coordinate reference system (set_crs returns a new GeoDataFrame)
polygons  # display the polygons

Let's visualize these polygons on the images that we already downloaded:

In [None]:
# plot images
n_cols = int(np.ceil(np.sqrt(len(filelist))))
n_rows = int(np.ceil(len(filelist)/n_cols))

f, ax = plt.subplots(n_rows, n_cols, figsize=(n_cols*5, n_rows*5))  # figsize is (width, height)
ax = np.ravel(ax) # flatten ax array

for i in range(len(filelist)):
    # read in image data
    with rasterio.open(filelist[i]) as dataset:
        rgb = np.dstack(dataset.read((1,2,3)))  # read the first three bands (drops the NIR band) and move the band axis last
        crs = dataset.crs  # extract crs
        transform = dataset.transform  # extract transform

    # normalize rgb data to the 1st-99th percentile range and clip to [0, 1]
    rgb = (rgb - np.percentile(rgb, 1))/(np.percentile(rgb, 99)-np.percentile(rgb, 1))
    rgb = np.clip(rgb, 0, 1)

    # plot data
    ax[i].imshow(rgb)
    ax[i].axis('off')

    # add polygons to plot
    local_polygons = polygons.to_crs(crs)  # convert polygons to local crs
    for geometry in local_polygons.geometry:
        if geometry.geom_type == 'Polygon':
            x, y = geometry.exterior.xy  # extract node coordinates in local crs
            pixels = np.array([~transform * (xi, yi) for xi, yi in zip(x, y)])  # convert node coordinates to image coordinates using inverse transform
            ax[i].plot(pixels[:, 0], pixels[:, 1], color='red', linewidth=1)  # plot polygons

# hide panels that are not used
for unused in ax[len(filelist):]:
    unused.axis('off')

f.tight_layout()

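The expression `~transform * (xi, yi)` applies the inverse of the raster's affine transform, which maps map coordinates to fractional pixel coordinates. The following is a minimal sketch of this inversion in plain Python. It assumes the six-coefficient convention `(a, b, c, d, e, f)` with `x = a*col + b*row + c` and `y = d*col + e*row + f`; the helper `world_to_pixel` and the example coordinates are made up for illustration.

```python
def world_to_pixel(transform, x, y):
    """Invert x = a*col + b*row + c, y = d*col + e*row + f for (col, row)."""
    a, b, c, d, e, f = transform
    det = a * e - b * d
    if det == 0:
        raise ValueError("transform is not invertible")
    col = (e * (x - c) - b * (y - f)) / det
    row = (a * (y - f) - d * (x - c)) / det
    return col, row


# north-up raster with 10 m pixels; the upper-left corner is a made-up coordinate
transform = (10.0, 0.0, 500000.0, 0.0, -10.0, 5400000.0)

print(world_to_pixel(transform, 500250.0, 5399900.0))  # (25.0, 10.0)
```

A point 250 m east and 100 m south of the upper-left corner lands at column 25 and row 10, because the row index grows downwards while the northing decreases.
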
For these areas of interest, we would now like to compute the mean NDVI value for each observation over a full year. We will do this using processes, which run on the backend. This means that we can compute these NDVI values without having to download the actual satellite imagery.

First, we have to define another datacube. This datacube will not include information on the spatial extent; this information will come directly from the polygons that we defined in the GeoJSON file. Note that we are using a very low maximum cloud cover of 1% here to query only data that are essentially free of cloud contamination.

In [None]:
aoicube = connection.load_collection(
  "SENTINEL2_L2A",
  temporal_extent=["2024-10-01", "2025-09-30"],  # start and end date
  max_cloud_cover=1,  # permitted cloud cover percentage
  bands=["B04", "B08"]  # only retrieve red and nir bands
)

Now we can define the individual bands and the band arithmetic to compute NDVI. Finally, we create a process that aggregates the data spatially, using the mean over each of our polygons:

In [None]:
red = aoicube.band("B04")
nir = aoicube.band("B08")
ndvi = (nir - red) / (nir + red)

# read areas of interest as json
with open('aoi.geojson', 'r') as inf:
    aoi_json = json.load(inf)

# define batch process
ndvi_timeseries = ndvi.aggregate_spatial(geometries=aoi_json, reducer="mean")

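To make the band arithmetic and the spatial mean concrete, the following is a minimal local sketch of the same computation for a handful of pixels. The reflectance values are made up, the helper names are hypothetical, and skipping undefined pixels is an assumption of this sketch; how the backend treats missing pixels is not covered here.

```python
import math


def ndvi(nir, red):
    """Normalized difference vegetation index for a single pixel."""
    total = nir + red
    if total == 0:
        return float("nan")  # undefined when both reflectances are zero
    return (nir - red) / total


def mean_ndvi(pixels):
    """Mean NDVI over (nir, red) pairs, ignoring undefined pixels."""
    values = [ndvi(nir, red) for nir, red in pixels]
    valid = [value for value in values if not math.isnan(value)]
    return sum(valid) / len(valid) if valid else float("nan")


# made-up (nir, red) reflectances: dense vegetation, sparse vegetation, no data
pixels = [(0.45, 0.05), (0.30, 0.10), (0.0, 0.0)]

print(round(mean_ndvi(pixels), 3))
```

Healthy vegetation reflects strongly in the near infrared and weakly in the red, so its NDVI is close to 1, while bare soil and built-up areas yield values near 0.
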
Just like before, we can now send the process to the backend for processing. We will wait for the results and write them to a file.

In [None]:
# execute job
job = ndvi_timeseries.execute_batch(out_format="CSV", title="NDVI timeseries")

Once the job has finished, we can download the results and read them into a pandas DataFrame.

In [None]:
# download results and read them in as a pandas DataFrame
job.get_results().download_file("ndvi/ndvi_timeseries.csv")
df = pd.read_csv("ndvi/ndvi_timeseries.csv", index_col=0)

We plot the results as a function of time, with one line per polygon showing its mean NDVI:

In [None]:
ndvi_filename = "ndvi/ndvi_timeseries.csv"
ndvi_df = pd.read_csv(ndvi_filename, index_col=0).sort_index()
ndvi_df.index = pd.to_datetime(ndvi_df.index)

fig, ax = plt.subplots(figsize=(12, 6))
ndvi_df.groupby("feature_index")["band_unnamed"].plot(marker="o", ax=ax)
ax.set_title("NDVI timeseries")
ax.set_ylabel("NDVI")
ax.set_ylim(0, 1)

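The grouping step can also be done without pandas, which makes it easy to inspect the values per polygon. The following is a minimal sketch using only the standard library. The rows are made up and merely mimic the assumed layout of the result file: the column names `feature_index` and `band_unnamed` are taken from the plotting code above, while the name of the date column is an assumption.

```python
import csv
import io
from collections import defaultdict

# made-up rows that mimic the assumed layout of the result file
sample_csv = """date,feature_index,band_unnamed
2025-05-10T00:00:00.000Z,0,0.62
2025-05-10T00:00:00.000Z,1,0.35
2025-06-19T00:00:00.000Z,0,0.71
2025-06-19T00:00:00.000Z,1,0.30
"""

# collect one list of (date, ndvi) pairs per polygon
series = defaultdict(list)
for row in csv.DictReader(io.StringIO(sample_csv)):
    series[int(row["feature_index"])].append((row["date"], float(row["band_unnamed"])))

# report the date of the highest mean NDVI for each polygon
for feature_index, values in sorted(series.items()):
    peak_date, peak_ndvi = max(values, key=lambda item: item[1])
    print(feature_index, peak_date[:10], peak_ndvi)
```

Each `feature_index` presumably refers to the position of the corresponding feature in the GeoJSON file; this is an assumption that is worth checking against your own polygons.
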
The time series show a lot of variation, but also some systematic patterns: which line corresponds to which polygon?

**Exercise**: Define your own polygons using [https://geojson.io](https://geojson.io) and download aggregated NDVI values. Pick some fields in a subtropical country.

In [None]:
# use this cell for the exercise