# Introduction

## 1. Sinergise and EO research

[Sinergise](https://sinergise.com/en/what-we-do) is a technology company developing [geographic information systems (GIS)](https://en.wikipedia.org/wiki/Geographic_information_system). It was established in 2008 as a spin-off from [Cosylab](https://www.cosylab.com/about-us/), once the latter's GIS business had reached a sustainable level.

Sinergise focuses mainly on building large turnkey solutions in the fields of geospatial services, agriculture, and real-estate administration. However, the wide availability of satellite data, e.g. through the [Copernicus](https://www.copernicus.eu/en/about-copernicus) and [Landsat](https://landsat.gsfc.nasa.gov/about/) programmes, has enabled many useful applications and presented the company with new opportunities in the field of [Earth observation (EO)](https://en.wikipedia.org/wiki/Earth_observation).

Given the large and ever-increasing volumes of satellite imagery, automated techniques are needed to extract the complex patterns embedded within its spatio-temporal structure. This has led the company to expand into the domains of machine learning, computer vision, and data science.


### 1.1 `eo-learn`

Among other activities, the EO research team at Sinergise is developing [`eo-learn`](https://eo-learn.readthedocs.io/en/latest/), a Python library which acts as a bridge between the field of EO and the established Python ecosystem in the aforementioned domains. It enables remote sensing experts to make use of extensive processing tools and libraries, while also making it easier for non-experts to get into the field.

## 2. EO data

While `eo-learn` is not limited to specific sources of satellite image data, [Sentinel-2](https://sentinel.esa.int/web/sentinel/missions/sentinel-2) is generally recognised as one of the most useful, thanks to its open access, high spatial resolution, and frequent revisits.

Sentinel-2 is an Earth observation mission from the EU's [Copernicus Programme](https://www.copernicus.eu/en/about-copernicus), developed and operated by the [European Space Agency (ESA)](https://www.esa.int/Applications/Observing_the_Earth/Copernicus/Overview3).

The mission systematically acquires optical imagery at high spatial resolution (up to 10 m) over land and coastal waters. It supports a broad range of services and applications, such as agricultural, water quality, and land use/land cover monitoring, emergency management, and humanitarian relief.


### 2.1 Tiling system
Sentinel-2 assigns coordinates to locations on the surface of the Earth using a tiling system that is aligned with NATO's [Military Grid Reference System (MGRS)](https://hls.gsfc.nasa.gov/products-description/tiling-system/), whose naming convention is derived from the [Universal Transverse Mercator (UTM) system](https://en.wikipedia.org/wiki/Universal_Transverse_Mercator_coordinate_system).

The system divides the Earth's surface into grid zones that span 6° of longitude (east to west) and 8° of latitude (north to south), as shown in the map below. The resulting grid is further subdivided into tiles of size 109.8 by 109.8 km (or 10,980 by 10,980 pixels at 10 m resolution), with an overlap of 4.9 km on each side.

*Sentinel-2 tiling system:*
<img src="images/MGRS_GZD-1.png">

Note that only land and coastal waters are actually observed, excluding the open seas. Antarctica is not covered by the mission either, which limits satellite coverage to latitudes between 56° south and 84° north.
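
As a rough illustration of the zone layout described above, the following sketch derives the UTM zone number and the latitude band letter (e.g. `33T`) for a given longitude and latitude. The function name is hypothetical, and the sketch assumes the regular grid only: the irregular zones around Norway and Svalbard are deliberately ignored.

```python
# Latitude band letters run from C (starting at 80° S) to X (ending at 84° N).
# The letters I and O are skipped to avoid confusion with digits.
LATITUDE_BANDS = "CDEFGHJKLMNPQRSTUVWX"


def grid_zone_designator(lon_deg: float, lat_deg: float) -> str:
    """Return the UTM zone number followed by the latitude band letter.

    Simplification: the irregular zones around Norway and Svalbard
    are not handled.
    """
    if not -180.0 <= lon_deg <= 180.0:
        raise ValueError("longitude must lie between -180 and 180 degrees")
    if not -80.0 <= lat_deg <= 84.0:
        raise ValueError("latitude must lie between -80 and 84 degrees")

    # Zones are 6° wide and numbered 1 to 60 eastwards from 180° W.
    zone = min(int((lon_deg + 180.0) // 6) + 1, 60)

    # Bands are 8° tall starting at 80° S. The last band (X) is 12° tall,
    # so every latitude from 72° N to 84° N maps to the final letter.
    band_index = min(int((lat_deg + 80.0) // 8), len(LATITUDE_BANDS) - 1)
    return f"{zone}{LATITUDE_BANDS[band_index]}"


# Example: a point in central Slovenia (approximate coordinates).
print(grid_zone_designator(14.5, 46.05))
```

The two `min` calls only guard the upper edges of the grid (longitude 180° and the taller band X); everywhere else the designator follows directly from integer division by the zone size.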


### 2.2 Frequency of observations
To achieve frequent revisits, two nearly identical satellites, Sentinel-2A and Sentinel-2B, operate in tandem. A single satellite would revisit the same location at the equator every 10 days; with two satellites, the revisit time is reduced to 5 days (under cloud-free conditions).

Similarly, revisit time comes down to 2 or 3 days at mid-latitudes.


### 2.3 Spectral bands (image channels)
Each satellite carries a multispectral imaging instrument with thirteen sensors which measure the solar irradiance reflected from the atmosphere and surface of the Earth (reflectance).

The sensors, covering different parts of the electromagnetic (EM) spectrum, serve different purposes and have different spatial resolutions. For example, bands `B02`, `B03`, and `B04`, which roughly correspond to blue, green, and red light, respectively, are often used to generate true colour images (TCIs), but may not suffice to distinguish between certain types of objects.

On the other hand, band `B10` is sensitive to the part of the EM spectrum that the atmosphere largely absorbs, meaning that significant responses can only be expected from high-altitude objects, such as clouds or snow in the Himalayas. This can make it a particularly useful feature in cloud detection.
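
The true colour composite mentioned above can be sketched with NumPy as follows. The band order along the last axis, the function name, and the brightness gain of 3.5 are assumptions made for this illustration; the input is random data standing in for real reflectances.

```python
import numpy as np

# Assumed order of bands along the last axis (an assumption of this sketch).
BAND_NAMES = ["B01", "B02", "B03", "B04", "B05", "B06", "B07",
              "B08", "B8A", "B09", "B10", "B11", "B12"]


def true_colour(bands: np.ndarray, gain: float = 3.5) -> np.ndarray:
    """Build an RGB image of shape (H, W, 3) from an array of shape (H, W, 13)."""
    # Red, green, and blue correspond to B04, B03, and B02, in that order.
    rgb_indices = [BAND_NAMES.index(name) for name in ("B04", "B03", "B02")]
    rgb = bands[..., rgb_indices]
    # Surface reflectances tend to be dark, so brighten them with an
    # arbitrary gain and clip the result to the displayable range [0, 1].
    return np.clip(rgb * gain, 0.0, 1.0)


# Random values standing in for a small 4 x 5 pixel observation.
rng = np.random.default_rng(0)
fake_bands = rng.uniform(0.0, 0.4, size=(4, 5, 13))
image = true_colour(fake_bands)
print(image.shape)
```

The resulting array can be passed directly to an image viewer such as `matplotlib.pyplot.imshow`, which expects floating-point RGB values between 0 and 1.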

#### 2.3.1 Specifications
The full table of specifications for the twin satellites is displayed below:

| Band |      Observing      | S2A central wavelength (nm) | S2A bandwidth (nm) | S2B central wavelength (nm) | S2B bandwidth (nm) | Spatial resolution (m) |
|------|:-------------------:|:---------------------------:|:------------------:|:---------------------------:|:------------------:|:----------------------:|
| B01  |   Coastal aerosol   |            442.7            |         21         |            442.2            |         21         |           60           |
| B02  |         Blue        |            492.4            |         66         |            492.1            |         66         |           10           |
| B03  |        Green        |            559.8            |         36         |            559.0            |         36         |           10           |
| B04  |         Red         |            664.6            |         31         |            664.9            |         31         |           10           |
| B05  | Vegetation red edge |            704.1            |         15         |            703.8            |         16         |           20           |
| B06  | Vegetation red edge |            740.5            |         15         |            739.1            |         15         |           20           |
| B07  | Vegetation red edge |            782.8            |         20         |            779.7            |         20         |           20           |
| B08  |         NIR         |            832.8            |         106        |            832.9            |         106        |           10           |
| B8A  |      Narrow NIR     |            864.7            |         21         |            864.0            |         22         |           20           |
| B09  |     Water vapour    |            945.1            |         20         |            943.2            |         21         |           60           |
| B10  |    SWIR - Cirrus    |            1373.5           |         31         |            1376.9           |         30         |           60           |
| B11  |         SWIR        |            1613.7           |         91         |            1610.4           |         94         |           20           |
| B12  |         SWIR        |            2202.4           |         175        |            2185.7           |         185        |           20           |

The central wavelengths and bandwidths roughly represent the parts of the EM spectrum that are covered by individual bands. This can be illustrated in more detail by observing their [spectral response](https://earth.esa.int/web/sentinel/technical-guides/sentinel-2-msi/performance), i.e. how sensitive each sensor on board the twin satellites is to particular wavelengths.

*Sentinel-2 spectral response in the visible and near-infrared (VNIR) range:*
<img src="images/Performance_Figure_1_full.png">

*Sentinel-2 spectral response in the short-wavelength infrared (SWIR) range:*
<img src="images/Performance_Figure_2_full.png">


### 2.4 Volume
A single observation of a tile produces about [600 MB of (raw) data](https://sentinel.esa.int/web/sentinel/missions/sentinel-2/data-products). Taking the total number of tiles and the frequency of observations into account, Sentinel-2 acquires petabytes of new data every year. Depending on the application, processing this data can be considered a "big data" problem.
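
A back-of-the-envelope calculation gives a feel for these numbers. The sketch below uses only the figures quoted in this document (600 MB per tile observation and a 5-day revisit time) and assumes, for simplicity, that every tile is observed at the equatorial revisit rate throughout the year.

```python
MB_PER_TILE_OBSERVATION = 600  # approximate size of a single tile observation
REVISIT_DAYS = 5               # equatorial revisit time with both satellites

# Number of times a single tile is observed in one year.
observations_per_year = 365 / REVISIT_DAYS

# Annual data volume for a single tile, in gigabytes (1 GB = 1000 MB).
gb_per_tile_per_year = observations_per_year * MB_PER_TILE_OBSERVATION / 1000

# Number of tiles whose combined annual volume reaches one petabyte
# (1 PB = 1,000,000 GB).
tiles_per_petabyte = 1_000_000 / gb_per_tile_per_year

print(f"{observations_per_year:.0f} observations per tile per year")
print(f"{gb_per_tile_per_year:.1f} GB per tile per year")
print(f"about {tiles_per_petabyte:,.0f} tiles add up to 1 PB per year")
```

Under these assumptions, each tile accounts for roughly 44 GB per year, so a few tens of thousands of regularly observed tiles are enough to reach the petabyte scale.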


### 2.5 Format
The [Copernicus Open Access Hub](https://scihub.copernicus.eu/) provides complete, free, and open access to Sentinel-2 user products, among others, subject to the [conditions of EU law](https://sentinel.esa.int/documents/247904/690755/Sentinel_Data_Legal_Notice).

The user products conform to a certain [naming convention](https://sentinel.esa.int/web/sentinel/user-guides/sentinel-2-msi/naming-convention) and [data format](https://sentinel.esa.int/web/sentinel/user-guides/sentinel-2-msi/data-formats), but these details are mostly abstracted away by Sinergise's [Sentinel Hub](https://www.sentinel-hub.com/about) services, which make data processing much simpler.

Conveniently, some of Sentinel Hub's functionality is available through [`sentinelhub-py`](https://sentinelhub-py.readthedocs.io/en/latest/), a set of Python utility packages also developed by Sinergise's EO research team. In practical workflows, `sentinelhub-py` is essential for acquiring data, which is then represented within `eo-learn` using [NumPy](https://numpy.org/doc/stable/reference/index.html) and [Shapely](https://shapely.readthedocs.io/en/latest/manual.html) objects, among others.


### 2.6 Visualisation
Sinergise maintains a showcase tool, [EO Browser](https://apps.sentinel-hub.com/eo-browser/), which demonstrates Sentinel Hub's features by allowing one to explore the world map, overlaid with either predefined or custom visualisations.

*Etna volcano eruption (dated 16 March 2017), where the true colour image is overlaid with SWIR bands `B11` and `B12`:*
<img src="images/etna_volcano_eruption.jpg">

The so-called [custom scripts](https://www.sentinel-hub.com/develop/documentation/custom-processing-scripts) can help to monitor specific situations, e.g. by [emphasising wildfires](https://custom-scripts.sentinel-hub.com/custom-scripts/sentinel-2/markuse_fire/), and can also be used to create [stylised satellite images](https://www.sentinel-hub.com/explore/education/custom-scripts-tutorial). [Contests](https://www.sentinel-hub.com/contest) are frequently organised to aid relevant global efforts. For the purposes of this workshop, EO Browser may be useful to preview the effects of simple processing steps.

## 3. Machine learning problems in EO

### 3.1 Land cover classification

One of the major applications of machine learning to satellite data lies in determining what the different surfaces in the images represent (e.g. vegetation, bare ground, or water bodies) and what they are used for (e.g. arable land, pastures, or urban areas).

Although the terms "land use" and "land cover" are [distinct](https://en.wikipedia.org/wiki/Land_cover#Distinction_from_%22land_use%22), they are commonly referred to jointly in what is called land use/land cover (LULC) classification.

#### 3.1.1 Classification or semantic segmentation?
Classification in the context of image data should generally be understood as mapping an image to a label. While this is [sometimes](https://arxiv.org/abs/1709.00029) performed for land cover as well, it is usually more useful to segment the image, i.e. assign a label to each individual pixel.

Confusion in terminology may arise from the different models that can be involved in the process. For a model that accepts an image as its input and produces a segmented image in turn, the situation is clear. However, working with pixel-based models, where each pixel is processed independently of its neighbourhood, reduces the problem to pixel-wise classification. If such a model is then applied to every pixel of an image, a segmentation is obtained as well (although it may contain more noise).

*An example segmentation of a satellite image:*
<img src="images/lulc_segment.png">
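
How a pixel-based classifier yields a segmentation can be sketched in a few lines of NumPy. The threshold rule below is a hypothetical stand-in for a trained model, and the function names are made up for this illustration.

```python
import numpy as np


def segment_with_pixel_classifier(image: np.ndarray, classify_pixels) -> np.ndarray:
    """Apply a per-pixel classifier to an image of shape (H, W, C)."""
    height, width, channels = image.shape
    # Flatten the spatial dimensions so that every pixel becomes one sample.
    samples = image.reshape(height * width, channels)
    # The classifier sees each pixel independently of its neighbours.
    labels = classify_pixels(samples)  # shape (H * W,)
    # Restore the spatial layout to obtain a label map, i.e. a segmentation.
    return labels.reshape(height, width)


def toy_classifier(samples: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for a trained model.

    Assigns label 1 where the mean value over all channels exceeds 0.5
    and label 0 elsewhere.
    """
    return (samples.mean(axis=1) > 0.5).astype(np.uint8)


image = np.zeros((3, 4, 2))
image[1, 2] = [0.9, 0.8]  # a single bright pixel
segmentation = segment_with_pixel_classifier(image, toy_classifier)
print(segmentation)
```

Any model exposing a batch prediction function over rows of features could be passed in place of `toy_classifier`; the reshaping steps stay the same.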


### 3.2 Cloud detection
Cloud *masking* is crucial for the retrieval of accurate surface reflectances within atmospheric correction processes. The problem reduces to cloud *detection*, which has been tackled in the development of special processors, such as [MAJA](https://labo.obs-mip.fr/multitemp/maccs-how-it-works/), [Sen2Cor](https://step.esa.int/main/third-party-plugins-2/sen2cor/), and Sinergise's [s2cloudless](https://github.com/sentinel-hub/sentinel2-cloud-detector).

Recently, cloud detection was given further attention at the [Cloud Masking Inter-comparison Exercise (CMIX)](https://earth.esa.int/web/sppa/meetings-workshops/hosted-and-co-sponsored-meetings/acix-ii-cmix-2nd-ws), an international collaborative initiative, which intended to contribute to a better understanding of the strengths and weaknesses of various established algorithms.

#### 3.2.1 s2cloudless
s2cloudless is a binary classifier based on the [LightGBM](https://lightgbm.readthedocs.io/en/latest/) model architecture. Publicly available since January 2018, it has been incorporated into `eo-learn` and Sentinel Hub services as well.

It was designed for processing speed and intended to achieve state-of-the-art results among the competition. Indeed, s2cloudless performed well in [internal validation](https://medium.com/sentinel-hub/improving-cloud-detection-with-machine-learning-c09dc5d7cf13), appears to handle most situations properly, and is used extensively.

*Examples of good behaviour:*
<img src="images/medium_s2c_1.gif">

However, there are also certain scenarios in which it misses the mark. Bright objects such as houses, dirt roads, beaches, desert sand, and snow, for example, are known to be potential sources of issues.

*Examples of subpar behaviour:*
<img src="images/medium_s2c_2.gif">

#### 3.2.2 Modelling approaches
It must be noted that s2cloudless is a single-scene (monotemporal) pixel-based classifier. The reflectance values of misclassified pixels can resemble those of actual clouds so closely that any such model might be practically unable to distinguish between them.

One way of tackling this problem would be to use additional context from the area surrounding the pixel, since one can usually discern between objects, such as houses and clouds, by their shape alone. Convolutional networks might perform well with this approach, but can be too demanding for the amount of data that we are dealing with. With appropriate pre-processing, simpler algorithms can take the pixel neighbourhood into account to some extent as well.
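
As one possible form of such pre-processing, the sketch below appends the mean over each pixel's 3 by 3 neighbourhood as additional features, so that a pixel-based model receives some spatial context. This is only an illustration under assumed conventions (an array of shape height by width by channels, a hypothetical function name) and not necessarily the pre-processing used in practice.

```python
import numpy as np


def add_neighbourhood_mean(image: np.ndarray) -> np.ndarray:
    """Append the 3x3 neighbourhood mean of every channel as extra features.

    The input has shape (H, W, C) and the output has shape (H, W, 2 * C).
    """
    height, width, _ = image.shape
    # Repeat edge values so that border pixels also have nine neighbours.
    padded = np.pad(image, ((1, 1), (1, 1), (0, 0)), mode="edge")
    # Sum the nine shifted views of the padded image, then average them.
    window_sum = sum(
        padded[dy:dy + height, dx:dx + width]
        for dy in range(3)
        for dx in range(3)
    )
    return np.concatenate([image, window_sum / 9.0], axis=-1)


image = np.zeros((5, 5, 1))
image[2, 2, 0] = 9.0  # one isolated bright pixel
features = add_neighbourhood_mean(image)
print(features.shape)
```

An isolated bright pixel ends up with a low neighbourhood mean, whereas a pixel inside a large bright region keeps a high one, which gives a pixel-based model a simple cue about the size of an object.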

The second approach focuses on additional observations of the same geographic area on different dates. Structures that are frequently misclassified can be expected to change very little over time. Therefore, by observing an area over multiple time frames, one should be able to accurately perceive which cloud-like pixels are permanent (houses, beaches, etc.) and which are temporary (actual clouds), as demonstrated in [one of our blog posts](https://medium.com/sentinel-hub/on-cloud-detection-with-multi-temporal-data-f64f9b8d59e5).

*An example of an area that is repeatedly misclassified:*
<img src="images/medium_miss_rep.png">
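
The multi-temporal idea can be illustrated with a deliberately simple rule: a pixel that a single-scene classifier flags on nearly every date is more likely a permanent bright object than a cloud. The function name and the threshold of 0.8 are hypothetical, and this sketch is not the method described in the linked blog post.

```python
import numpy as np


def split_permanent_and_temporary(cloud_like: np.ndarray, threshold: float = 0.8):
    """Separate persistent detections from transient ones.

    cloud_like is a boolean array of shape (T, H, W), True where a
    single-scene classifier flagged the pixel on a given date.
    threshold is a hypothetical cut-off on the fraction of flagged dates.
    """
    # Fraction of dates on which each pixel was flagged, shape (H, W).
    flagged_fraction = cloud_like.mean(axis=0)
    # Pixels flagged almost always are treated as permanent bright objects.
    permanent = flagged_fraction >= threshold
    # The remaining detections are kept as likely clouds, per date.
    temporary = cloud_like & ~permanent
    return permanent, temporary


# Five dates over a 2 x 2 pixel area.
detections = np.zeros((5, 2, 2), dtype=bool)
detections[:, 0, 0] = True  # flagged on every date, e.g. a bright roof
detections[2, 1, 1] = True  # flagged once, e.g. a passing cloud
permanent, temporary = split_permanent_and_temporary(detections)
print(permanent)
print(temporary[2])
```

A fixed threshold would mislabel areas that are genuinely cloudy most of the time, so a practical approach needs a longer time series or a learned model rather than this rule.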


### 3.3 Other

- Land cover prediction ([1](https://www.tandfonline.com/doi/full/10.1080/01431161.2017.1343512)),
- Time-series classification ([1](https://arxiv.org/abs/1811.10166), [2](https://www.mdpi.com/2072-4292/10/8/1221/htm), [3](https://arxiv.org/abs/1911.07757)),
- Change detection ([1](https://arxiv.org/abs/1812.05815)),
- Super-resolution ([1](https://arxiv.org/abs/1803.04271), [2](https://arxiv.org/abs/1907.06490)),
- etc.

## 4. Overview of the workshop

### 4.1 Objective

During the workshop, we will get familiar with the `eo-learn` framework and how it can be used to process EO data. Specifically, we will use it to prepare a small data set for the land cover problem and to build a pipeline that can be easily scaled, leading to a practical machine learning (ML) application.


### 4.2 Choice of ML model

We will be using [LightGBM](https://lightgbm.readthedocs.io/en/latest/) for our ML model. It is particularly apt for this domain, because it is fast, resource-efficient, highly performant, and successfully used in practice, as demonstrated by the case of s2cloudless.

That being said, it is a gradient boosting framework based on decision tree algorithms - and decision trees can only go so far, partly because they "take things literally". For example, leaving cloudy instances in the training data will lead to poor performance on unseen examples, so we will be forced to filter them out, even at inference time. However, some deep neural networks that incorporate memory or attention-based mechanisms, e.g. [LSTM-RNNs](https://www.mitpressjournals.org/doi/10.1162/neco.1997.9.8.1735) and [transformers](https://arxiv.org/abs/1706.03762), [have been experimentally shown](https://arxiv.org/abs/1910.10536) to be able to identify and ignore cloudy instances on their own.

Nonetheless, we will stick with LightGBM for its efficiency and ease of use. That is not to say that the drawbacks of more complex models always outweigh their advantages - in fact, the EO research team at Sinergise has been [experimenting](https://github.com/sentinel-hub/eo-flow) with more modern architectures for some time - but the workshop should be demonstrative enough as is.


### 4.3 Material

Due to the volumes of data associated with this domain, the usual Sentinel Hub data requests can present an unexpectedly large burden, both for the user (in terms of waiting time and storage) and the network itself (in terms of bandwidth).

For convenience, all workshop-related material has been prepared beforehand. If you want to make use of Sentinel Hub's services regardless and do not already have a Sentinel Hub account, you can create a [trial account](https://www.sentinel-hub.com/trial) and follow the [set-up instructions](https://github.com/sentinel-hub/eo-learn-workshop#running-the-tutorial-with-your-sentinel-hub-account).


### 4.4 Outline

#### Notebook 2: `eo-learn` basics
1. `EOPatch`
2. `EOTask`
3. `EOWorkflow`
4. `EOExecutor`

#### Notebook 3: Data preparation

1. Define the area of interest
    - Set the appropriate coordinate reference system
    - Split the full area into smaller and more manageable patches
    - Focus on a particular set of patches
2. Compute additional information from multispectral data
    - Cloud probabilities
    - Normalised difference indices
    - etc.
3. Add a reference map
    - Convert provided vector data into raster form
4. Prepare the data set
    - Filter cloudy instances
    - Fill the gaps with temporal interpolation
    - Extract random spatial samples
    - Split for training and validation
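
As a small preview of step 2 above, a normalised difference index compares two bands through the ratio of their difference to their sum. The sketch below computes the well-known NDVI from the near-infrared (`B08`) and red (`B04`) bands; the function name and the sample values are made up for illustration.

```python
import numpy as np


def normalised_difference(band_a, band_b) -> np.ndarray:
    """Compute (a - b) / (a + b), returning NaN where the sum is zero."""
    band_a = np.asarray(band_a, dtype=float)
    band_b = np.asarray(band_b, dtype=float)
    denominator = band_a + band_b
    # Start from NaN and only divide where the denominator is non-zero.
    result = np.full(denominator.shape, np.nan)
    np.divide(band_a - band_b, denominator, out=result, where=denominator != 0)
    return result


# Made-up reflectance values for three pixels.
nir = np.array([0.5, 0.3, 0.0])  # e.g. band B08
red = np.array([0.1, 0.3, 0.0])  # e.g. band B04
ndvi = normalised_difference(nir, red)
print(ndvi)
```

Values close to 1 indicate a much stronger response in the first band than in the second, which for NDVI is typical of healthy vegetation.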

#### Notebook 4: Learning
1. Set ML model parameters
2. Train the model
3. Perform model validation
    - Assess the model's performance according to specific metrics
    - Interpret the model with respect to its feature importances
4. Create a full-image processor
    - Wrap the model
    - Visualise the results