<img align="right" src="https://github.com/eo2cube/eo2cube_book/blob/7880672deff906b41f993c856fe1a7eb38ed5b3a/images/banner_siegel.png?raw=true" style="width:1000px;">

## Why do we need EO cloud platforms?

In recent decades, Earth observation data and their analysis have proven to be important tools for monitoring the natural resources of our planet. Due to the increasing number of satellite platforms and sensors, unprecedented amounts of information about land and sea surfaces are now available. This provides scientists with new opportunities to understand and quantify environmental changes on various spatial and temporal scales. But it also introduces new obstacles that users of Earth observation data have to tackle.


### Big Data Problem

The growing accessibility of extensive Earth observation (EO) data from various satellites poses challenges in terms of the time needed for downloading and pre-processing data on individual computers or infrastructures. Under the European Union's Copernicus program, over 64 million products have been released, totaling more than 25 Petabytes of data.
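
To get a feeling for these numbers, the following back-of-the-envelope sketch estimates the average product size and the time a complete download would take. The 64 million products and 25 Petabytes come from the paragraph above; the 100 Mbit/s connection speed and the decimal definition of a Petabyte (10^15 bytes) are assumptions chosen purely for illustration.

```python
# Figures quoted above for the Copernicus program
n_products = 64_000_000
total_bytes = 25 * 10**15  # 25 PB, assuming decimal units (1 PB = 10**15 bytes)

# Assumed connection speed (hypothetical, for illustration only)
bandwidth_bit_per_s = 100 * 10**6  # 100 Mbit/s

# Average size of a single product in megabytes
avg_product_mb = total_bytes / n_products / 10**6

# Time needed to transfer everything over the assumed connection
download_seconds = total_bytes * 8 / bandwidth_bit_per_s
download_years = download_seconds / (365.25 * 24 * 3600)

print(f"Average product size: {avg_product_mb:.0f} MB")
print(f"Full download at 100 Mbit/s: {download_years:.0f} years")
```

Under these assumptions, the average product is roughly 390 MB and a complete download would take more than 60 years, which is why copying the archive to a local machine is not a realistic option.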

Managing data discovery, download, and access becomes a formidable task when working with multiple datasets. Users are required to navigate through diverse interfaces, adhere to varying access requirements, and handle the heterogeneity of data formats.

To address this issue, innovative technological approaches are needed for the collection, management, distribution, and analysis of satellite data. The challenges associated with "Big Data", such as data volume, velocity, and variety, require a change in thinking and a departure from conventional local processing and data distribution methods.

## What is cloud computing?

<p align="center"> <img src="../../images/clouds.png" style="width:700px;"> </p>

Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction. This cloud model is composed of five essential characteristics, three service models, and four deployment models (Mell and Grance, 2011).

### EO in the cloud

The advent of cloud technology enables the relocation of algorithms and tools to data, facilitating access to extensive Earth observation (EO) datasets for a diverse user base. This approach allows users to manage and visualize specific data of interest without the necessity of downloading them. By eliminating the need for large-scale data transfers, this method ensures the efficient and effective utilization of EO data, preventing impediments to accessibility and usability.

Cloud-based Earth observation (EO) platforms signify a paradigm shift in EO data analysis by providing a holistic ecosystem that integrates storage, processing, analysis tools, collaboration, and visualization seamlessly. Through the utilization of cloud-based infrastructures, researchers and analysts can enhance their workflow by reducing the time-consuming processes of data transfer and pre-processing.

This relieves scientists of tedious pre-processing tasks, allowing them to focus on their main research objectives, such as time-series analyses or environmental modelling.

<p align="center"> <img src="../../images/eo_workflow.png" style="width:700px;"> </p>


## Infrastructure Versus Platform

More and more cloud-based Earth observation solutions from different providers have emerged to meet the growing demand. When talking about cloud-based Earth observation (EO) platforms, we can distinguish between two types of providers:

**Infrastructure providers**

Provide the necessary support for handling Earth observation (EO) data by offering essential infrastructure. This includes computing resources, storage capacity, and networking capabilities crucial for the efficient processing and analysis of extensive EO data.

Examples of infrastructure providers

- [Amazon Web Services (AWS)](https://aws.amazon.com/): AWS offers a diverse range of cloud services, including storage (Amazon S3) and computing (Amazon EC2), suitable for processing and storing Earth observation (EO) data. Numerous open datasets are also accessible on AWS.

- [Microsoft Azure](https://azure.microsoft.com): Azure delivers cloud-based services such as Azure Storage, Azure Virtual Machines, and Azure Machine Learning, supporting EO applications and workflows.

- [Google Cloud Platform (GCP)](https://cloud.google.com/): GCP provides infrastructure services like Google Cloud Storage and Google Compute Engine, ideal for managing and analyzing EO data. Various open datasets are also available on GCP.

- [CloudFerro](https://cloudferro.com): CloudFerro specializes in providing cloud infrastructure, with a focus on processing and analyzing geospatial data. Their offerings include scalable and secure cloud resources that are specifically optimized for Earth observation (EO) applications. CloudFerro delivers high-performance computing, storage, and networking services designed to meet the unique demands of EO data processing workflows. Users can access a variety of open datasets on the CloudFerro platform.

**Platform providers**

Offer all-in-one Earth observation (EO) platforms that bring together infrastructure, tools, and services. These platforms typically offer various integrated features like data storage, processing, analysis, visualization, and collaboration tools. With a user-friendly interface, they simplify the EO data process, letting users access, process, and analyze data without dealing with the technical side.

Examples of platform providers


- [Google Earth Engine](https://earthengine.google.org): Platform for Earth observation (EO) data analysis. It grants access to extensive satellite imagery and geospatial datasets, coupled with robust processing capabilities and built-in algorithms.

- [Sinergise Sentinel-Hub](https://www.sentinel-hub.com): Focuses on satellite data access and processing. The platform offers user-friendly APIs and tools for seamless access, processing, and visualization of EO data.

- [Microsoft Planetary Computer](https://planetarycomputer.microsoft.com): Provides access to diverse global datasets, including satellite imagery, climate, and environmental data, aiming to facilitate large-scale data analysis and support sustainable development and conservation efforts.

- [Euro Data Cube](https://eurodatacube.com): Utilizes various cloud infrastructures to create an interactive development environment with standardized access to diverse EO data. It features data exploration and analysis via a JupyterLab environment.

- [Digital Earth Africa](https://www.digitalearthafrica.org/): Provides a routine, reliable and operational service using Earth observations, with a focus on the African continent. The aim is to deliver decision-ready products that enable policy makers, scientists, the private sector and civil society to address social, environmental and economic changes on the continent and to develop an ecosystem for innovation across sectors.

## Data Cube

Generally, a data cube is an n-dimensional data structure. Each dimension represents some attribute in the database, and the cells in the data cube represent the measure of interest. The "cube" in "data cube" is just a metaphor to help illustrate a data structure: data cubes can be 1-dimensional, 2-dimensional, 3-dimensional, or higher-dimensional. Basically, a data cube is just a way to represent data in multiple dimensions.

Dimensions and measures in a data cube are structured hierarchically. The dimensions define the edges of the cube (e.g. time, product, customer). Measures (e.g. sales, quantity, profit) within a data cube are described by multiple dimensions and have a certain value. This structure allows for easy aggregation and slicing of the data along any dimension, enabling users to address complex queries and conduct advanced analysis on the data.

The concept of data cubes is not new. It finds application across various fields (e.g. economics, biology, medicine) and has also become a promising solution for handling Earth observation data efficiently and effectively.

**Key components**:

- **Dimensions**: These represent the categories along which data is analyzed, such as time, geography, and product categories.

- **Measures**: These are the numerical values or metrics, like sales, profit, or quantity, that quantify the data.

- **Cells**: Points of intersection across dimensions form cells, each containing a specific measure.

Let's look at a more straightforward example. Suppose we aim to examine the annual temperature and precipitation in three different cities (Würzburg, Bamberg, Frankfurt). In this scenario, our data cube has dimensions for years, climate variables, and cities. Each dimension comes with distinct labels; for instance, Würzburg, Bamberg, and Frankfurt serve as labels for the city dimension. Each cell holds the corresponding value (in mm or °C) for the measured variable (precipitation or temperature).
The hierarchical arrangement of dimensions and measures creates a cube-like structure, enabling users to explore relationships and dependencies within the data.

<p align="center"> <img src="../../images/dc_concept.png" style="width:1000px;"> </p>
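
The city example can be sketched in a few lines of plain Python to show how dimensions, labels, cells, slicing and aggregation fit together. This is only a minimal illustration using the standard library: the years, the numeric values and the helper names (`cell`, `slice_city`, `mean_over_years`) are hypothetical placeholders, not real measurements or part of any library.

```python
from statistics import mean

# Dimension labels (the edges of the cube)
years = [2021, 2022]
variables = ["temperature", "precipitation"]  # measured in °C and mm
cities = ["Würzburg", "Bamberg", "Frankfurt"]

# Placeholder values, NOT real measurements: values[year][variable][city]
values = [
    [[10.1, 9.6, 11.2], [600.0, 650.0, 630.0]],   # 2021
    [[11.0, 10.4, 12.1], [520.0, 560.0, 540.0]],  # 2022
]


def cell(year, variable, city):
    """Return the single cell addressed by one label per dimension."""
    return values[years.index(year)][variables.index(variable)][cities.index(city)]


def slice_city(city):
    """Slice along the city dimension: all years and variables for one city."""
    return {
        year: {var: cell(year, var, city) for var in variables}
        for year in years
    }


def mean_over_years(variable, city):
    """Aggregate along the year dimension for one variable and one city."""
    return mean(cell(year, variable, city) for year in years)


print(cell(2022, "temperature", "Bamberg"))
print(slice_city("Würzburg"))
print(mean_over_years("precipitation", "Frankfurt"))
```

Addressing cells by label rather than by position is what makes slicing and aggregation along any dimension straightforward. In practice, libraries for labelled n-dimensional arrays take over this bookkeeping.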

Mell, P. and Grance, T. (2011), The NIST Definition of Cloud Computing, Special Publication (NIST SP), National Institute of Standards and Technology, Gaithersburg, MD, [online], https://doi.org/10.6028/NIST.SP.800-145 (Accessed February 1, 2024)

In [13]:
import os
import dask
from dask.distributed import Client

In [11]:
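# Compute node and JupyterLab port of the current session (specific to this session)
host = "hpdar03c02s02.cos.lrz.de"
jl_port = 58032
# Dashboard link template; "{port}" stays a literal placeholder that is filled in later
dask_url = f"https://portal.terrabyte.lrz.de/node/{host}/{jl_port}/proxy/" + "{port}/status"
# Port 0 lets the operating system pick a free port for the dashboard
dask_address = "127.0.0.1:0"
dask_url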

'https://portal.terrabyte.lrz.de/node/hpdar03c02s02.cos.lrz.de/58032/proxy/{port}/status'
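
The `{port}` placeholder is deliberately kept out of the f-string, so it survives as a literal in the template and can be filled in once the actual dashboard port is known. Below is a minimal sketch of that substitution, assuming it behaves like Python's `str.format`. The port number 8787 is a hypothetical value used only for illustration, since the real port is assigned when the client starts.

```python
host = "hpdar03c02s02.cos.lrz.de"
jl_port = 58032

# Same template as above: host and JupyterLab port are filled in immediately,
# while "{port}" remains a literal placeholder
template = f"https://portal.terrabyte.lrz.de/node/{host}/{jl_port}/proxy/" + "{port}/status"

# Hypothetical dashboard port, for illustration only
example_port = 8787

# Fill in the placeholder once the port is known
dashboard_link = template.format(port=example_port)
print(dashboard_link)
```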

In [14]:
dir_tmp = './tmp'
dask.config.set({'temporary_directory': dir_tmp,
                 'distributed.dashboard.link': dask_url})

<dask.config.set at 0x7f24e0055a20>

In [15]:
client = Client(threads_per_worker=1, dashboard_address=dask_address)
client

