# **Exercise session 3: Convolutional neural network (CNN) and support vector machine (SVM) to detect CH$_4$ emissions from satellite data**

**Reference:** Schuit, Berend J., et al. "Automated detection and monitoring of methane super-emitters using satellite data." Atmospheric Chemistry and Physics Discussions (2023): 1-47.

**Data:** 
- Sentinel 5P, bands CH4, CO, SO2 and NO2
- Sentinel 3, bands S5 and S6
- [all TROPOMI detected plumes for 2021. (Schuit et al. 2023)](https://zenodo.org/records/8087134)

**Useful tutorials:**
- [SVM for Multiclass Classification](https://www.kaggle.com/code/pranathichunduru/svm-for-multiclass-classification)

This exercise is designed to be completed on Aalto JupyterHub. Please ensure that your notebook includes all necessary installation commands for any additional libraries your code requires. These commands should be clearly written and integrated within your notebook. To submit, go to Nbgrader/Assignment List and click submit next to the exercise. 
All data loaded from Copernicus Dataspace should be saved to the /coursedata/users/$USER folder.

The exercise consists of two parts:
- In Part I (Ex. 3.1 - 3.2) you will discuss and compare the approaches to designing automated emissions monitoring algorithms taken in Finch et al. (2022), in Schuit et al. (2023) and in your own Assignment 2.
- In Part II (Ex. 3.3 - 3.8) you will use the TROPOMI detected plumes for 2021 (Schuit et al. 2023) to construct a multiclass SVM algorithm to distinguish between oil, gas and coal emissions.

The deadline is Feb 22 at 10:00. We will be grading the submissions as they arrive, so if you submit before the deadline, you will most likely get feedback earlier.

In [None]:
# Import necessary modules. If any additional modules need to be installed to run it on Aalto JupyterHub, include all necessary installation commands.

## Exercise 3.1 Convolutional Neural Networks for detecting gas plumes (2 pt)

Compare the approaches taken in Finch et al. (2022), Schuit et al. (2023) and your own in Assignment 2. What are the strengths and weaknesses of each approach? Consider both data preparation and model design.

## Exercise 3.2 Trustworthy monitoring of emissions (2 pt)

One design requirement for introducing an automated emissions monitoring system is that the system is trustworthy. The EC High-Level Expert Group on AI has developed the [Ethics Guidelines for Trustworthy Artificial Intelligence](https://digital-strategy.ec.europa.eu/en/library/ethics-guidelines-trustworthy-ai), which outline 7 key requirements that an AI system needs to satisfy in order to be deemed trustworthy. Discuss how well the approaches in Finch et al. (2022), Schuit et al. (2023) and Assignment 2 satisfy those requirements. You can discuss all three approaches more generally or have a deeper, focused discussion about one of them.

(Optional) If you are interested in learning more, here are some references:
- [Deliverables of the High-Level Expert Group on AI](https://digital-strategy.ec.europa.eu/en/policies/expert-group-ai)
- Review papers on bias in ML: 
    - Mehrabi, Ninareh, et al. "A survey on bias and fairness in machine learning." ACM computing surveys (CSUR) 54.6 (2021): 1-35.
    - Chakraborty, Joymallya, Suvodeep Majumder, and Tim Menzies. "Bias in machine learning software: Why? how? what to do?." Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 2021.

## Exercise 3.3 Batch jobs and server-side computations using OpenEO (1 pt)

So far you have been using openEO for synchronous execution of your requests: you submitted a request and the result came as a direct response to your request. This is not feasible for heavier work. Instead, one should submit the requests as batch jobs. Familiarise yourself with the Batch_job.ipynb file.

Furthermore, instead of downloading large amounts of raw data, one can process the data directly on the server side. You can check all available processes by running 'connection.list_processes()'. For more details, see the [EO Cookbook](https://openeo.org/documentation/1.0/cookbook/#temporal-mean-reduce-dimension). Note that the cookbook gives two alternatives for computing the temporal mean; we recommend using the 'reduce_dimension' function.

Task: fix an area of interest and a time period and load the min, max, mean and standard deviation of CH4. Perform this task as a batch job using server-side processes. You should save the output in the netCDF format. Note that the output is computed per pixel.
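The task above could be set up roughly as in the following sketch. It assumes that `connection` is an already authenticated openEO connection, that the Sentinel 5P collection id is `SENTINEL_5P_L2` with a band called `CH4`, and that the temporal dimension is named `t`; please verify these with 'connection.describe_collection(...)'. The helper names `build_stat_cubes` and `run_batch_jobs`, the example extent and the output folder are illustrative only. One batch job is submitted per statistic to keep the example simple.

```python
# Sketch only: collection id, band name and dimension name are assumptions.
STATISTICS = ("min", "max", "mean", "sd")  # openEO process names used as reducers


def build_stat_cubes(connection, spatial_extent, temporal_extent,
                     collection_id="SENTINEL_5P_L2", band="CH4"):
    """Return one lazily evaluated data cube per statistic, reduced over time."""
    cube = connection.load_collection(
        collection_id,
        spatial_extent=spatial_extent,
        temporal_extent=temporal_extent,
        bands=[band],
    )
    # Reducing the temporal dimension leaves one value per pixel.
    return {
        stat: cube.reduce_dimension(dimension="t", reducer=stat)
        for stat in STATISTICS
    }


def run_batch_jobs(stat_cubes, out_dir):
    """Submit each cube as a batch job and download its netCDF result."""
    for stat, cube in stat_cubes.items():
        job = cube.create_job(out_format="netCDF", title=f"CH4_{stat}")
        job.start_and_wait()  # blocks until the server-side job has finished
        job.get_results().download_files(f"{out_dir}/{stat}")


# Example usage (needs network access and a Copernicus account):
# import openeo
# connection = openeo.connect("openeo.dataspace.copernicus.eu").authenticate_oidc()
# extent = {"west": 53.5, "south": 38.0, "east": 54.5, "north": 39.0}
# # The end of an openEO temporal extent is exclusive, hence 2022-01-01.
# cubes = build_stat_cubes(connection, extent, ["2021-01-01", "2022-01-01"])
# run_batch_jobs(cubes, "ch4_stats")  # choose a folder under /coursedata/users/$USER
```

Nothing is computed until a job is started, so building the four cubes is cheap; the actual work happens on the server when `start_and_wait` is called.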

## Exercise 3.4 Data preparation (2 pt)

Load the coordinates and labels of the CH4 plumes coming from gas infrastructure, oil infrastructure and coal mines from [all TROPOMI detected plumes for 2021. (Schuit et al. 2023)](https://zenodo.org/records/8087134). 

For each plume, compute the spatial extent of an area 1 deg x 1 deg around the plume.
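A minimal sketch of the extent computation, assuming that "1 deg x 1 deg around the plume" means a box centred on the plume coordinates with a side length of one degree. The function name `plume_extent` is illustrative; the dictionary keys follow the `west`/`south`/`east`/`north` convention used for openEO spatial extents. Wrapping at the antimeridian is not handled here.

```python
def plume_extent(lat, lon, size_deg=1.0):
    """Bounding box of size_deg x size_deg degrees centred on (lat, lon)."""
    half = size_deg / 2.0
    return {
        "west": lon - half,
        "south": max(lat - half, -90.0),  # clip at the South Pole
        "east": lon + half,
        "north": min(lat + half, 90.0),   # clip at the North Pole
    }


# Example with made-up coordinates (not taken from the plume dataset):
print(plume_extent(lat=38.5, lon=54.0))
```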

We will assume that if a point source is a super-emitter, then its emissions are noticeable over a longer period of time. For each emission source, we are going to use the following features:
- Min, max, mean, sd and median for CH4, CO, SO2 and NO2 from Sentinel 5P data for the given period (more details on the period below)
- Min, max, mean, sd and median for S5 and S6 from Sentinel 3 SLSTR data for the given period 

Reasons for choosing these features:  
- CO, SO2 and NO2 are co-emitted in the oil and gas industry and in coal mines. See, for example, [Fioletov et al. (2016)](https://doi.org/10.5194/acp-16-11497-2016) and [Trenchev et al. (2023)](https://doi.org/10.3390/rs15061590).
- Methane has a "spectral fingerprint", a unique way of absorbing infrared light, which can be used to identify emitters using satellites that were not originally intended for tracking methane. An additional benefit of this approach is that such satellites usually have higher resolution, so it is easier to pinpoint the point sources. See: [publication about NASA's EMIT mission](https://www.nasa.gov/centers-and-facilities/jpl/methane-super-emitters-mapped-by-nasas-new-earth-space-mission/) and [Pandey et al. (2023)](https://doi.org/10.1016/j.rse.2023.113716). 

Choice of periods and dimension of final data (choose one of the following):
- Compute the statistics using Copernicus-side processes from Jan 1, 2021 till Dec 31, 2021. Retain all pixels and flatten the final image into a vector (so from an $n \times m$ image to a $1 \times nm$ vector). This will mean that in the end you will have a $1913 \times (20\,(nm)_{\text{Sentinel 5P}} + 10\,(nm)_{\text{Sentinel 3}})$ feature array.
- Compute the statistics using Copernicus-side processes from Jan 1, 2021 till Dec 31, 2021. On Aalto JupyterHub compute the mean over each Copernicus-side output (so the mean of pixel means, the mean of pixel sd, etc). This will mean that in the end you will have a $1913 \times 30$ feature array.
- Compute the _monthly_ statistics using Copernicus-side processes from Jan 1, 2021 till Dec 31, 2021 (so for Jan 2021, Feb 2021, etc). On Aalto JupyterHub compute the mean over each Copernicus-side output (so the mean of pixel means, the mean of pixel sd, etc). This will mean that in the end you will have a $1913 \times 360$ feature array.

Note that retaining more data may make it easier for the SVM to distinguish between classes; however, we have relatively few observations.
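The difference between keeping all pixels and averaging over pixels can be sketched with NumPy as follows. The sketch assumes that, for one plume, the statistic images of one sensor have been stacked into an array of shape `(n_stats, n, m)`; this layout and the function names are illustrative, and the random arrays only stand in for real data.

```python
import numpy as np


def flatten_features(stat_images):
    """Keep every pixel: (n_stats, n, m) -> vector of length n_stats * n * m."""
    return np.asarray(stat_images, dtype=float).reshape(-1)


def mean_features(stat_images):
    """Average each statistic image over its pixels, ignoring missing (NaN) pixels."""
    arr = np.asarray(stat_images, dtype=float)
    return np.nanmean(arr, axis=(1, 2))


# Placeholder data: 20 Sentinel 5P statistic images (4 gases x 5 statistics)
# and 10 Sentinel 3 statistic images (2 bands x 5 statistics) for one plume.
rng = np.random.default_rng(0)
s5p_stats = rng.random((20, 3, 4))
s3_stats = rng.random((10, 30, 40))

row_all_pixels = np.concatenate([flatten_features(s5p_stats), flatten_features(s3_stats)])
row_means = np.concatenate([mean_features(s5p_stats), mean_features(s3_stats)])
print(row_all_pixels.shape, row_means.shape)
```

Stacking one such row per plume gives the feature array. Note that the flattened variant only yields rows of equal length if every plume has the same pixel grid size.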

Notes:
- Sentinel 3 has significantly higher resolution than Sentinel 5P. Hence, your image for Sentinel 3 will contain significantly more pixels than the one for Sentinel 5P for the same spatial and temporal extents.
- When applying the processes (i.e., mean, max, etc) to the Sentinel 5P data, you need to use one band at a time. For Sentinel 3, you can include both bands simultaneously.



## Exercise 3.5 SVM (4 pt)

Build an SVM model using the data you prepared in Exercise 3.4.
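As a starting point, here is a minimal sketch of a multiclass SVM in scikit-learn. It runs on synthetic data that only mimics the shape of a 30-feature array with three classes; the data, the hyperparameters and the choice of an RBF kernel are assumptions made for illustration, not a recommended configuration. Features are standardised first because SVMs are sensitive to feature scale, and `SVC` handles more than two classes internally with a one-vs-one scheme.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic placeholder data: replace X and y with your own feature array and labels.
rng = np.random.default_rng(42)
n_per_class, n_features = 100, 30
class_names = np.array(["coal", "gas", "oil"])
X = np.vstack([
    rng.normal(loc=shift, scale=1.0, size=(n_per_class, n_features))
    for shift in (0.0, 0.7, 1.4)
])
y = np.repeat(class_names, n_per_class)

# Stratify so that each class keeps its share in the train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# The scaler is fitted on the training data only, as part of the pipeline.
model = make_pipeline(
    StandardScaler(),
    SVC(kernel="rbf", C=1.0, gamma="scale", class_weight="balanced"),
)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
score = balanced_accuracy_score(y_test, y_pred)
print(f"Balanced accuracy on held-out data: {score:.2f}")
```

With real data the classes are unlikely to be equally frequent, which is why the sketch uses `class_weight="balanced"` and balanced accuracy rather than plain accuracy.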

## Exercise 3.6 Discuss the approach you took in Exercises 3.4 and 3.5. (2 pt)

Some points to consider:
- Strengths and weaknesses? 
- Suggestions for improvement? 
- Implications for policy-makers? 
- Possible further research questions?
- Ethical considerations?

## (Optional) Exercise 3.7 Use any other (supervised or unsupervised) classification algorithm on data you prepared in Exercise 3.4 (3 pt)

Why did you choose this algorithm? Compare to the approach and results in Exercise 3.5.

## (Optional) Exercise 3.8 Test the model from Exercise 3.5 against a different time period (3 pt)

Assume that the location of super-emitters remains relatively stable over time. As in Exercise 3.4, load the data for the point sources, but now for a different time period (say, the entire 2022 or 2023). Test how well your model from Exercise 3.5 performs on this new data. Discuss.
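One way to structure such a test is sketched below. The key point is that the model (including any scaler) is fitted once on the original period and then only applied to the new period, never refitted. The helper names and the synthetic "periods" are purely illustrative; the small offset added to the second period merely imitates a change in conditions between years.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

LABELS = ["coal", "gas", "oil"]


def synthetic_period(rng, offset, n_per_class=60, n_features=30):
    """Placeholder data for one period; replace with your real features and labels."""
    X = np.vstack([
        rng.normal(loc=shift + offset, size=(n_per_class, n_features))
        for shift in (0.0, 0.7, 1.4)
    ])
    return X, np.repeat(LABELS, n_per_class)


def evaluate_on_new_period(fitted_model, X_new, y_new, labels=LABELS):
    """Apply an already fitted model to new data and summarise the errors per class."""
    y_pred = fitted_model.predict(X_new)
    cm = confusion_matrix(y_new, y_pred, labels=labels)
    recalls = recall_score(y_new, y_pred, labels=labels, average=None, zero_division=0)
    return cm, dict(zip(labels, recalls))


rng = np.random.default_rng(1)
X_old, y_old = synthetic_period(rng, offset=0.0)  # stands in for the training period
X_new, y_new = synthetic_period(rng, offset=0.2)  # stands in for the new period

model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
model.fit(X_old, y_old)  # fitted on the training period only

cm, recall_per_class = evaluate_on_new_period(model, X_new, y_new)
print(cm)
print(recall_per_class)
```

The rows of the confusion matrix are the true classes and the columns the predicted ones, so off-diagonal entries show which source types get confused in the new period.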