# Introduction

## 1. Sinergise and eo-learn

[Sinergise](https://sinergise.com/en/what-we-do) is a technology company developing [geographic information systems (GIS)](https://en.wikipedia.org/wiki/Geographic_information_system). It was established in 2008 as a spin-off from [Cosylab](https://www.cosylab.com/about-us/) after the latter's GIS business reached a sustainable level.

Sinergise's focus lies mainly in building large turnkey solutions in the fields of geospatial services, agriculture and real-estate administration. However, wide availability of satellite data has led the company to expand into the domains of machine learning, computer vision, and data science.

Among other activities, the EO research team at Sinergise is developing [**eo-learn**](https://eo-learn.readthedocs.io/en/latest/), a Python library which acts as a bridge between the field of [Earth observation (EO)](https://en.wikipedia.org/wiki/Earth_observation) and the established Python ecosystem in the aforementioned domains.

**eo-learn** and some related products will play important roles in this workshop.

## 2. The data
While not the only supported source of satellite images, [Sentinel-2](https://en.wikipedia.org/wiki/Sentinel-2) is recognised as the most useful.

Sentinel-2 is an Earth observation mission of the EU's [Copernicus Programme](https://www.copernicus.eu/en/about-copernicus), developed and operated by the [European Space Agency (ESA)](https://www.esa.int/Applications/Observing_the_Earth/Copernicus/Overview3).

The mission systematically acquires optical imagery at high spatial resolution (up to 10 m) over land and coastal waters. It supports a broad range of services and applications, such as agricultural monitoring, water quality monitoring, emergency management, and land cover classification.


### 2.1 Tiling system
Sentinel-2 products are organised on a tiling grid aligned with NATO's [Military Grid Reference System (MGRS)](https://hls.gsfc.nasa.gov/products-description/tiling-system/), whose naming convention is derived from the [Universal Transverse Mercator (UTM) system](https://en.wikipedia.org/wiki/Universal_Transverse_Mercator_coordinate_system).

The UTM system divides the Earth's surface into 60 zones, each spanning 6° of longitude, and MGRS further splits these zones into latitude bands spanning 8° of latitude, as shown in the map below. Each grid zone is further subdivided into tiles of size 109.8 km by 109.8 km (or 10,980 by 10,980 pixels at 10 m resolution), with an overlap of 4.9 km on each side.
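
The arithmetic behind these numbers can be checked with a short sketch. It assumes that tiles are laid out on a nominal 100 km grid (which is what makes the stated 4.9 km overlap work out) and uses the standard UTM zone numbering, ignoring the special-case zones around Norway and Svalbard. The function names are illustrative only.

```python
TILE_SIDE_M = 109_800      # tile side length stated above, in metres
NOMINAL_SIDE_M = 100_000   # assumption: tiles sit on a nominal 100 km grid


def pixels_per_side(resolution_m: int) -> int:
    """Number of pixels along one side of a tile at the given resolution."""
    return TILE_SIDE_M // resolution_m


def utm_zone(longitude_deg: float) -> int:
    """Standard UTM zone number (1-60) for a longitude in [-180, 180)."""
    return int((longitude_deg + 180) // 6) + 1


# Half of the extra width lies on each side of the nominal grid cell.
overlap_per_side_km = (TILE_SIDE_M - NOMINAL_SIDE_M) / 2 / 1000

tile_pixels_10m = pixels_per_side(10)
ljubljana_zone = utm_zone(14.5)  # a longitude of roughly 14.5° east
```

The same helper gives the pixel counts for the coarser 20 m and 60 m bands, which becomes relevant once bands of different resolutions are combined.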

![alt text](images/MGRS_GZD-1.png "Sentinel-2 tiling system.")

Note that only land and coastal waters are actually observed; the open seas are excluded. The mission also does not cover the Antarctic, which limits satellite coverage to latitudes between 56° south and 84° north.


### 2.2 Frequency of observations
To achieve frequent revisits, two (mostly) identical Sentinel-2 satellites, Sentinel-2A and Sentinel-2B, operate in tandem. A single satellite would revisit a given location at the equator every 10 days; with two satellites, the revisit time drops to 5 days (under cloud-free conditions).

Similarly, revisit time comes down to 2 or 3 days at mid-latitudes.


### 2.3 Spectral bands for the Sentinel-2 sensors
Each satellite carries a multispectral imaging instrument with thirteen sensors (one per spectral band) that measure the solar radiation reflected from the atmosphere and the surface of the Earth (reflectance).

The sensors, covering different parts of the electromagnetic (EM) spectrum, serve different purposes and have different spatial resolutions. For example, bands `B02`, `B03`, and `B04`, which roughly correspond to blue, green, and red light, respectively, are often used to generate true colour images (TCIs), but may not suffice to distinguish between certain types of objects.
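
As a minimal sketch of how a true colour image can be assembled from these three bands, the snippet below stacks them in red, green, blue order and applies a brightening factor. The `gain` value and the synthetic input data are assumptions made purely for illustration; real reflectances would come from the downloaded bands.

```python
import numpy as np


def true_colour(b02, b03, b04, gain=2.5):
    """Stack blue, green and red reflectances into an RGB image in [0, 1].

    `gain` is a hypothetical brightening factor, not a prescribed value.
    """
    rgb = np.stack([b04, b03, b02], axis=-1)  # red, green, blue order
    return np.clip(rgb * gain, 0.0, 1.0)


# Synthetic reflectances standing in for real band data.
rng = np.random.default_rng(0)
b02, b03, b04 = rng.uniform(0.0, 0.4, size=(3, 4, 4))
image = true_colour(b02, b03, b04)
```

The result has shape `(height, width, 3)` and can be passed directly to a plotting function such as `matplotlib.pyplot.imshow`.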

On the other hand, band `B10` (listed as "SWIR - Cirrus" in the table below) is sensitive to the part of the EM spectrum that the atmosphere largely absorbs, meaning that significant responses can only be expected from high-altitude objects, such as clouds or snow in the Himalayas. This can make it a particularly useful feature in cloud detection.

The full table of specifications is displayed below:

| Band |      Observing      | S2A central wavelength (nm) | S2A bandwidth (nm) | S2B central wavelength (nm) | S2B bandwidth (nm) | Spatial resolution (m) |
|------|:-------------------:|:---------------------------:|:------------------:|:---------------------------:|:------------------:|:----------------------:|
| B01  |   Coastal aerosol   |            442.7            |         21         |            442.2            |         21         |           60           |
| B02  |         Blue        |            492.4            |         66         |            492.1            |         66         |           10           |
| B03  |        Green        |            559.8            |         36         |            559.0            |         36         |           10           |
| B04  |         Red         |            664.6            |         31         |            664.9            |         31         |           10           |
| B05  | Vegetation red edge |            704.1            |         15         |            703.8            |         16         |           20           |
| B06  | Vegetation red edge |            740.5            |         15         |            739.1            |         15         |           20           |
| B07  | Vegetation red edge |            782.8            |         20         |            779.7            |         20         |           20           |
| B08  |         NIR         |            832.8            |         106        |            832.9            |         106        |           10           |
| B8A  |      Narrow NIR     |            864.7            |         21         |            864.0            |         22         |           20           |
| B09  |     Water vapour    |            945.1            |         20         |            943.2            |         21         |           60           |
| B10  |    SWIR - Cirrus    |            1373.5           |         31         |            1376.9           |         30         |           60           |
| B11  |         SWIR        |            1613.7           |         91         |            1610.4           |         94         |           20           |
| B12  |         SWIR        |            2202.4           |         175        |            2185.7           |         185        |           20           |
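
Because the bands come at three different resolutions, they have to be brought onto a common grid before they can be stacked into per-pixel feature vectors. The sketch below shows one simple option, nearest-neighbour upsampling to 10 m. The resolutions are copied from the table above; the function name and the choice of nearest-neighbour resampling are assumptions for illustration, not necessarily what any particular service does.

```python
import numpy as np

# Spatial resolution in metres for each band, as listed in the table above.
BAND_RESOLUTION_M = {
    "B01": 60, "B02": 10, "B03": 10, "B04": 10, "B05": 20, "B06": 20,
    "B07": 20, "B08": 10, "B8A": 20, "B09": 60, "B10": 60, "B11": 20,
    "B12": 20,
}


def upsample_nearest(band, resolution_m, target_m=10):
    """Repeat each pixel so that a coarser band matches the target grid."""
    if resolution_m % target_m != 0:
        raise ValueError("resolution must be a multiple of the target")
    factor = resolution_m // target_m
    return np.repeat(np.repeat(band, factor, axis=0), factor, axis=1)


# A tiny 2 x 2 stand-in for a 20 m band, upsampled to the 10 m grid.
b11_20m = np.arange(4, dtype=float).reshape(2, 2)
b11_10m = upsample_nearest(b11_20m, BAND_RESOLUTION_M["B11"])
```

Each 20 m pixel becomes a 2 by 2 block of identical 10 m pixels, and each 60 m pixel a 6 by 6 block.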

#### 2.3.1 Spectral response for the visible and near-infrared (VNIR) bands
![alt text](images/Performance_Figure_1_full.png "S2 VNIR spectral response.")

#### 2.3.2 Spectral response for the short-wavelength infrared (SWIR) bands
![alt text](images/Performance_Figure_2_full.png "S2 SWIR spectral response.")


### 2.4 Volume
A single observation of a tile produces about [600 MB of (raw) data](https://sentinel.esa.int/web/sentinel/missions/sentinel-2/data-products). Taking the total number of tiles and the frequency of observations into account, Sentinel-2 annually acquires petabytes of data. Depending on the application, processing this data can be considered a "big data" problem.
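
A rough back-of-the-envelope estimate shows how the volume scales. The sketch below uses the 600 MB per observation and the 5-day equatorial revisit time quoted in this notebook; the number of tiles is a purely hypothetical placeholder, since the exact count is not given here.

```python
MB_PER_OBSERVATION = 600   # per-tile figure quoted above
REVISIT_DAYS = 5           # two-satellite revisit time at the equator


def annual_volume_tb(n_tiles, mb_per_obs=MB_PER_OBSERVATION,
                     revisit_days=REVISIT_DAYS):
    """Approximate yearly data volume in (decimal) terabytes."""
    observations_per_year = 365 / revisit_days
    return n_tiles * observations_per_year * mb_per_obs / 1e6  # MB -> TB


# Hypothetical tile count, chosen only to illustrate the order of magnitude.
example_tb = annual_volume_tb(n_tiles=20_000)
```

The estimate grows linearly with the tile count and does not account for the more frequent revisits at mid-latitudes, so it should be read as an order of magnitude rather than an exact figure.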


### 2.5 Format
The [Copernicus Open Access Hub](https://scihub.copernicus.eu/) provides complete, free and open access to Sentinel-2 user products, among others, subject to the [conditions of EU law](https://sentinel.esa.int/documents/247904/690755/Sentinel_Data_Legal_Notice).

The user products conform to a certain [naming convention](https://sentinel.esa.int/web/sentinel/user-guides/sentinel-2-msi/naming-convention) and [data format](https://sentinel.esa.int/web/sentinel/user-guides/sentinel-2-msi/data-formats), but these details are mostly abstracted away by Sinergise's [Sentinel Hub](https://www.sentinel-hub.com/about) services, which make data processing much simpler.

Usefully, some of Sentinel Hub's functionality can be reproduced via [**sentinelhub-py**](https://sentinelhub-py.readthedocs.io/en/latest/), a set of Python utility packages also developed by Sinergise's EO research team.


### 2.6 Visualisation
Sinergise maintains its showcase tool, [EO Browser](https://apps.sentinel-hub.com/eo-browser/), which demonstrates Sentinel Hub's features by allowing one to explore the world map, overlaid with either predefined or custom visualisations.

![alt text](images/etna_volcano_eruption.jpg "Etna volcano eruption, dated 16. 3. 2017. Image obtained by overlaying the true colour image with SWIR bands 11 and 12.")

The so-called [custom scripts](https://www.sentinel-hub.com/develop/documentation/custom-processing-scripts) can help to monitor specific situations, e.g. by [emphasising wildfires](https://custom-scripts.sentinel-hub.com/custom-scripts/sentinel-2/markuse_fire/), and [contests](https://www.sentinel-hub.com/contest) are frequently organised to aid relevant global efforts. They are also used to create [stylised satellite images](https://www.sentinel-hub.com/explore/education/custom-scripts-tutorial).

![alt text](images/covid19_contest.jpg "From the COVID-19 custom script contest announcement.")

For the purposes of this workshop, EO Browser may be useful in finding a specific tile to process, as well as to preview the effects of simple processing steps.


### More information
- https://sentinel.esa.int/web/sentinel/missions/sentinel-2
- https://sentinel.esa.int/web/sentinel/technical-guides/sentinel-2-msi/level-1c/algorithm
- https://earth.esa.int/web/sentinel/technical-guides/sentinel-2-msi/performance

## 3. The problem of cloud detection
Cloud *masking* is crucial for retrieval of accurate surface reflectances within atmospheric correction processes. The problem reduces to cloud *detection*, which Sinergise has tackled by developing [**s2cloudless**](https://github.com/sentinel-hub/sentinel2-cloud-detector), a binary classifier based on the [LightGBM](https://lightgbm.readthedocs.io/en/latest/) model architecture.

Cloud detection was then given further attention by the [Cloud Masking Inter-comparison Exercise (CMIX)](https://earth.esa.int/web/sppa/meetings-workshops/hosted-and-co-sponsored-meetings/acix-ii-cmix-2nd-ws), an international collaborative initiative intended to contribute to a better understanding of the strengths and weaknesses of various established algorithms.

Taking part in the initiative presented us at Sinergise with an opportunity to reexamine our existing methods and consider potential alternatives. Some of the material included in this workshop was, in fact, written during this experimentation.


### 3.1 s2cloudless
Publicly available since January 2018, s2cloudless was developed as a fast machine learning algorithm that could detect clouds in near real time and deliver results competitive with the state of the art.

Indeed, s2cloudless performed well in [internal validation](https://medium.com/sentinel-hub/improving-cloud-detection-with-machine-learning-c09dc5d7cf13), appears to handle most situations properly, and is extensively used.

![alt text](images/medium_s2c_1.gif "Good examples.")

However, there are also certain scenarios in which it misses the mark. Bright objects such as houses, dirt roads, beaches, desert sand, and snow, for example, are known to be potential sources of issues.

![alt text](images/medium_s2c_2.gif "Bad examples.")

Here it must be noted that s2cloudless is a single-scene (monotemporal), pixel-based classifier. The reflectance values of misclassified pixels can resemble those of actual clouds so closely that any such model might be practically unable to distinguish between them.

One way of tackling this problem would be to use additional context from the area surrounding the pixel, since one can usually discern between objects, such as houses and clouds, by their shape alone. Convolutional networks might perform well for this approach, but can be too demanding for the amount of data that we are dealing with. Through appropriate pre-processing, simpler algorithms can somewhat take the pixel neighbourhood into account as well.
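
As an illustration of the last point, the sketch below appends a local mean of each band to the per-pixel features, so that a pixel-based classifier also sees a summary of its surroundings. The window size, the edge handling, and the function names are assumptions chosen for this example; other neighbourhood statistics could be used in the same way.

```python
import numpy as np


def local_mean(band, radius=1):
    """Mean over a (2 * radius + 1)-wide square window, edges replicated."""
    size = 2 * radius + 1
    padded = np.pad(band, radius, mode="edge")
    height, width = band.shape
    total = np.zeros(band.shape, dtype=float)
    # Sum the shifted copies of the image that make up the window.
    for dy in range(size):
        for dx in range(size):
            total += padded[dy:dy + height, dx:dx + width]
    return total / size**2


def add_neighbourhood_features(pixels, radius=1):
    """Append per-band local means to an array of shape (H, W, bands)."""
    means = np.stack(
        [local_mean(pixels[..., i], radius) for i in range(pixels.shape[-1])],
        axis=-1,
    )
    return np.concatenate([pixels, means], axis=-1)


# A single bright pixel in an otherwise dark 5 x 5, one-band image.
scene = np.zeros((5, 5, 1))
scene[2, 2, 0] = 1.0
features = add_neighbourhood_features(scene)
```

An isolated bright pixel keeps its original value in the first feature but gets a low local mean in the second, which is the kind of contrast that might help separate small bright objects from larger cloud fields.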

The second approach relies on additional observations of the same geographic area on different dates. Structures that are frequently misclassified can be expected to change very little over time. Therefore, by observing an area over multiple time frames, one should be able to tell which cloud-like pixels are permanent (houses, beaches, etc.) and which are temporary (actual clouds), as demonstrated in [one of our blog posts](https://medium.com/sentinel-hub/on-cloud-detection-with-multi-temporal-data-f64f9b8d59e5):

![alt text](images/medium_miss_rep.png "An example of an area that is repeatedly misclassified.")
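
The multitemporal idea can be sketched in a few lines. Given a stack of single-scene cloud masks of the same area, pixels that are flagged in nearly every time frame are treated as permanent bright objects and removed from the masks. The threshold `min_fraction` and the function names are hypothetical, and the sketch ignores the fact that some areas really are cloudy most of the time.

```python
import numpy as np


def persistent_detections(cloud_masks, min_fraction=0.8):
    """Flag pixels detected as cloud in at least `min_fraction` of frames.

    `cloud_masks` is a boolean array of shape (time, height, width).
    """
    frequency = cloud_masks.mean(axis=0)
    return frequency >= min_fraction


def remove_persistent(cloud_masks, min_fraction=0.8):
    """Clear detections at pixels that look cloudy (almost) all the time."""
    persistent = persistent_detections(cloud_masks, min_fraction)
    return cloud_masks & ~persistent


# Five frames of a 2 x 2 area: one pixel is always flagged (e.g. a bright
# roof), another is flagged only once (an actual cloud passing over).
masks = np.zeros((5, 2, 2), dtype=bool)
masks[:, 0, 0] = True
masks[2, 1, 1] = True
filtered = remove_persistent(masks)
```

After filtering, the always-flagged pixel is cleared in every frame, while the single genuine detection is kept.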


### 3.2 In the absence of (labelled) data
The lack of a large high-quality labelled data set was and remains a problem for proper training of machine learning models.

s2cloudless resorted to relying on the outputs of [MAJA](https://labo.obs-mip.fr/multitemp/maccs-how-it-works/), an atmospheric correction processor, which includes a rule-based multitemporal algorithm for detecting clouds. While this meant constraining the performance to match MAJA's at best, s2cloudless was to be much faster and easier to use, thus fulfilling its purpose. Hence, a [collection of MAJA’s cloud masks](https://theia.cnes.fr/atdistrib/rocket/#/home) was used as a proxy for ground truth, with some (hard) negative examples added manually to decrease the number of false detections.

On the other hand, one needs to be confident that the data used for validation is labelled correctly. In this case, quality takes precedence over quantity, which points to manually labelled data, however scarce. For this purpose, s2cloudless was validated on a data set curated by [Hollstein et al.](https://www.mdpi.com/2072-4292/8/8/666).

The terabytes of spectral response data and MAJA's cloud masks render the same approach to training impractical for this workshop. Instead, Hollstein's data set is to be split and used for training as well, as it is much smaller and easier to work with.
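
One possible way of splitting such a data set is sketched below. It assumes that every labelled sample carries an identifier of the group it was drawn from (for example a tile or a labelled region; the `group_ids` array here is hypothetical) and keeps whole groups on one side of the split, so that near-identical samples do not end up in both the training and the validation part. This is only one option, not necessarily the procedure followed in the later notebooks.

```python
import numpy as np


def split_by_group(group_ids, test_fraction=0.25, seed=0):
    """Return boolean (train, test) masks that never split a group."""
    group_ids = np.asarray(group_ids)
    unique_groups = np.unique(group_ids)
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(unique_groups)
    n_test = max(1, int(round(test_fraction * len(unique_groups))))
    test_mask = np.isin(group_ids, shuffled[:n_test])
    return ~test_mask, test_mask


# Eight hypothetical groups with three samples each.
group_ids = np.repeat(np.arange(8), 3)
train_mask, test_mask = split_by_group(group_ids)
```

The masks can then be used to index the feature and label arrays, e.g. `features[train_mask]` and `labels[train_mask]`.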


### 3.3 Hollstein's data set

Hollstein's data set includes samples that belong to one of 6 different classes: cloud, cirrus, land, water, snow, and shadow. This is useful as it can highlight where the algorithm is failing or performing particularly well.
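
A per-class breakdown of this kind could be computed along the following lines. The sketch evaluates a binary cloud/no-cloud prediction separately for each of the six classes; treating both `cloud` and `cirrus` as "cloudy" is an assumption made for this example, and the labels and predictions are made up.

```python
import numpy as np

CLASSES = ("cloud", "cirrus", "land", "water", "snow", "shadow")
CLOUDY = {"cloud", "cirrus"}  # assumption: both count as positive examples


def per_class_accuracy(labels, predicted_cloud):
    """Share of correct binary predictions within each labelled class."""
    labels = np.asarray(labels)
    predicted_cloud = np.asarray(predicted_cloud, dtype=bool)
    accuracy = {}
    for name in CLASSES:
        selected = labels == name
        if not selected.any():
            continue  # class not present in this sample
        expected = name in CLOUDY
        accuracy[name] = float((predicted_cloud[selected] == expected).mean())
    return accuracy


# Made-up labels and binary predictions for six pixels.
labels = ["cloud", "cloud", "snow", "snow", "land", "cirrus"]
predictions = [True, True, True, False, False, False]
scores = per_class_accuracy(labels, predictions)
```

In this toy example the detector is perfect on `cloud` and `land`, confuses half of the `snow` pixels with clouds, and misses the `cirrus` pixel, which is exactly the kind of pattern an overall accuracy figure would hide.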

![alt text](images/remotesensing-08-00666-g001.png "Spectral histograms of the database per class.")

The data is sampled from 108 Sentinel-2 tiles, distributed roughly evenly around the globe and throughout the year to ensure full climatic coverage.

![alt text](images/remotesensing-08-00666-g002.png "Global distribution of selected Sentinel-2 scenes which are included in the database.")

However, while it consists of around 6.4 million labelled pixels, these are part of only several thousand polygons, hand-drawn over mostly uniform objects. This means that it effectively covers only a handful of regions and the achieved results may not generalise as expected.

Thus, using Hollstein's data to train a commercial cloud detector may not be advisable, but it ought to suffice for the purposes of the workshop.

![alt text](images/remotesensing-08-00666-g003.png "False-color RGB images which have been used to classify Sentinel-2 MSI images manually.")

## 4. Outline of the workshop
The subsequent notebooks are structured in the following order:

2. **Set-up:**
    - `eo-learn`
    - Some ML library: any one of `scikit-learn`/`lightgbm`/`tensorflow`/`keras`


3. **`eo-learn` basics:**
    - "Traditional" eo-learn workshop material
    - Some `sentinelhub-py` and EO Browser features


4. **Data preparation:**
    - Inspection of Hollstein's polygons within their respective patches
    - Data pre-processing
    - Data set construction


5. **Classification:**
    - Model set-up
    - Training


6. **Validation:**
    - Wrapping the trained model inside a specialised task
    - Validation on Hollstein's data
    - Visual confirmation of the trained model's performance on instances of personal choice


7. **Practical use:**
    - Using the built task in a simple workflow