# Introduction to Machine Learning

D4G workshop

Potsdam, 13.06.22

Author: Caroline Arnold, DKRZ / Helmholtz AI

## Introduction

Welcome to Introduction to Machine Learning! Today we are going to work through the typical lifecycle of a machine learning project. We will be working with data from the CyGNSS satellite mission to predict global ocean wind speed.

<img src="./images/data-science-lifecycle.png" alt="Data Science Lifecycle">

### Setup

This tutorial could be done on a laptop, but for convenience we will use Google Colab.

Sign in to Google Colab and clone the git repository using the following command:

```bash
TODO clone command
```

The data is stored separately in the DKRZ Nextcloud. Download it with:

```bash
TODO download address for the data
TODO check how to get the data into Google Colab
```

### Today's Goals

1. Walk through all stages of a machine learning project
1. Train a neural network using the Keras framework
1. Learn strategies to improve your machine learning algorithm
1. Optional: Get familiar with different neural network architectures

This notebook can serve as a reference for you to employ in your own scientific machine learning projects. We do not expect you to understand every single line of code!

## Understanding the science case

CyGNSS (Cyclone GNSS) is a system of microsatellites that measures GNSS signals reflected off the Earth's surface. We would like to predict the global ocean wind speed from CyGNSS measurements. This is a regression problem, because the wind speed is a continuous variable.

<img src="./images/cygnss-from-space.png" alt="CyGNSS satellites and transmitter on top of a cyclone" title="CyGNSS satellites, Image courtesy of Milad Asgarimehr">

Question for you: TODO drop this and add more pictures
- What is the science case in your machine learning project?
- What are you trying to predict?
- Is it a classification or a regression problem or something else?

Take some minutes to discuss in small groups.

## Data mining

For machine learning, data is the most important ingredient. For the purpose of this tutorial, we already retrieved data (TODO add more context?) and prepared a subset of CyGNSS data.

## Data cleaning

While it may be easy to obtain a lot of data, it is necessary to ensure the data quality, e.g., by checking for missing (`None` or `NaN`) values in the data. For the purpose of this tutorial, you can assume that the data has been cleaned and that you can work with it directly.
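
A quality check of this kind can be sketched with NumPy. The array below is synthetic and the helper name `count_invalid` is hypothetical; it only illustrates the idea and is not the cleaning procedure that was applied to the tutorial data.

```python
import numpy as np

def count_invalid(values):
    """Count entries that are NaN or infinite in a numeric array."""
    values = np.asarray(values, dtype=float)
    return int(np.sum(~np.isfinite(values)))

# Synthetic example data, not taken from the CyGNSS files
sample = np.array([3.2, np.nan, 7.5, np.inf, 5.1])

n_bad = count_invalid(sample)
clean = sample[np.isfinite(sample)]  # keep only the valid entries

print(f'{n_bad} invalid values, {clean.size} valid values remain')
```

Dropping invalid samples is only one option; whether it is appropriate depends on how many samples are affected.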

## Data exploration

Now it is time to take a look at the data. We use the Python library `h5py` to open the HDF5 files that are provided.

In [None]:
import h5py
import numpy as np
np.random.seed(20220613)

In [None]:
# TODO change this to xarray later
ds_train = h5py.File('../data/train_data.h5', 'r')

In [None]:
ds_train.keys()

### Target variable

The target variable is the wind speed. To extract it:

In [None]:
y = ds_train['windspeed'][:] # TODO change this to xarray

Visualization is very helpful for machine learning projects, as it helps us to identify the key properties of the dataset at a glance. We plot the distribution of the wind speed:

In [None]:
# Necessary libraries for plotting. Check out https://seaborn.pydata.org/ for reference
from matplotlib import pyplot as plt
import seaborn as sns

sns.set_style('whitegrid')
sns.set_context('notebook')

In [None]:
sns.histplot(y)

plt.xlabel('Wind speed (m/s)')
plt.xticks([0, 2.5, 5.0, 7.5, 10, 12.5, 15, 17.5, 20])

plt.show()

### Feature variables

We can use the interactive xarray dataset browser TODO check if that works in Google Colab!

In [None]:
ds_train # TODO

There are 2D and 1D variables in the dataset. First, we extract the 2D variables and look at some selected samples.

#### BRCS (Bistatic Radar Cross Section)

In [None]:
brcs = ds_train['brcs'][:]

fig, ax = plt.subplots(1, 5, sharex=True, sharey=True, figsize=(20,4))

for i in range(5):
    sns.heatmap(brcs[i*100], ax=ax[i])

#### Effective scatter map

In [None]:
eff_scatter = ds_train['eff_scatter'][:]

fig, ax = plt.subplots(1, 5, sharex=True, sharey=True, figsize=(20,4))

for i in range(5):
    sns.heatmap(eff_scatter[i*100], ax=ax[i])

The dataset contains 1D variables as well. Histogram plots show their value ranges:

#### Normalized bistatic radar cross section (ddm_nbrcs)

In [None]:
ddm_nbrcs = ds_train['ddm_nbrcs'][:]
#ddm_les   = ds_train['ddm_les'][:]

sns.histplot(ddm_nbrcs)

plt.xlabel('DDM NBRCS')
plt.show()

In [None]:
brcs = ds_train['brcs'][:] # TODO change this later to xarray

plt.imshow(brcs[0])
plt.show()

ddm_nbrcs = ds_train['ddm_nbrcs'][:]
ddm_les   = ds_train['ddm_les'][:]  # assumes the file contains a 'ddm_les' variable
# TODO add an additional variable that is not so promising?
# TODO remove the quality variable

plt.hist(ddm_nbrcs)
plt.hist(ddm_les)
plt.show()

### Relation of feature and target variables

TODO introduce the math when the neural network is introduced

A machine learning algorithm can learn to approximate a function

$y = f(X)$

where it learns the function $f$ from data. Compared to traditional fitting algorithms, we do not need to specify the form of $f$ explicitly. A neural network can be trained to approximate any kind of "well-behaved" non-linear function.

Let's take a look at the 2D density plots of the features and target variables:

In [None]:
fig, ax = plt.subplots(1, 3, sharex=True, figsize=(20, 6))

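# Panel 0: wind speed vs. ddm_nbrcs
ax[0].hexbin(y, ddm_nbrcs, mincnt=1, cmap='viridis')
ax[0].set_xlabel('Wind speed (m/s)')
ax[0].set_ylabel('DDM NBRCS')

# TODO Panel 1: ddm_les
# TODO Panel 2: the third variable
plt.show()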

In [None]:
# TODO temp

plt.hexbin(y, ddm_nbrcs, mincnt=1, cmap='viridis')

plt.xlabel('Wind speed (m/s)')
plt.ylabel('DDM NBRCS')
plt.show()

*Question* TODO maybe too complicated
- Based on these plots, which variables would you select as input features?

## Feature engineering

This step is more important for classical machine learning algorithms, where data scientists provide additional handcrafted features to the machine learning algorithm. We skip this step here.

## Predictive modeling

Predictive modeling includes setting up a machine learning algorithm, training it and evaluating its performance. Our algorithm of choice is a neural network. We have prepared all the necessary code for you to train the algorithm.

### Prepare the input data

The data is split into *train*, *validation*, and *test* data. Each of the three datasets has a distinct purpose:
1. Train data is given to the machine learning algorithm to tune the parameters of the neural network
1. Validation data is used to identify when the machine learning algorithm starts to overfit to the training data (we want to avoid learning the training data by heart)
1. Test data is used to gauge the ability of an ML algorithm to generalize. This dataset was not included at all in training and validation. We set it aside for now
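
For this tutorial the three splits are already prepared as separate files. As a minimal sketch of how such a split could be created, the example below shuffles sample indices and cuts them into three disjoint parts. The dataset size and the 80/10/10 fractions are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(20220613)

n_samples = 1000                      # hypothetical dataset size
indices = rng.permutation(n_samples)  # shuffle the sample indices once

# Assumed fractions: 80 % train, 10 % validation, 10 % test
n_train = n_samples * 8 // 10
n_valid = n_samples // 10

idx_train = indices[:n_train]
idx_valid = indices[n_train:n_train + n_valid]
idx_test = indices[n_train + n_valid:]

print(len(idx_train), len(idx_valid), len(idx_test))
```

Because the three index arrays are slices of one permutation, no sample can appear in more than one split.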

TODO add image

In [None]:
def create_dataset(split, input_keys=['ddm_nbrcs'], normalize=True, verbose=True):
    '''
    Helper function to load the datasets that were prepared for this tutorial.
    
    Parameters:
    split       - Choice of [train, valid, test]
    input_keys  - Input parameters (need to be all 1D or all 2D)
    normalize   - Normalize features (default: True)
    verbose     - Print dataset information (default: True)
    
    Returns:
    
    (X, y) - Tuple of features and labels
    '''
    # TODO change this to xarray
    
    ds = h5py.File(f'../data/{split}_data.h5', 'r')
    
    X = []
    
    for key in input_keys:
        var = ds[key][:]
        if normalize:
            var = var / np.max(var)  # not in-place, so integer arrays work too
        X.append(var)
        
    X = np.swapaxes(np.asarray(X), 0, 1)
    
    if len(X.shape) == 4: # images to channel_last
        X = np.swapaxes(X, 1, 3)
    
    y = ds['windspeed'][:]
    y = y[:, np.newaxis]
    
    if verbose:
        print(f'Loaded data for split {split}')
        print(f'Feature array: {X.shape}')
        print(f'Label array:   {y.shape}')
    
    return X, y
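
The axis swaps in `create_dataset` are easier to follow with explicit shapes. The sketch below repeats them on zero-filled arrays. The raw image shape of (17, 11) per sample is an assumption, inferred from the CNN input shape (11, 17, 2) used later in this notebook.

```python
import numpy as np

n_samples = 4

# 1D case: two scalar features per sample
features_1d = [np.zeros(n_samples), np.zeros(n_samples)]
X = np.swapaxes(np.asarray(features_1d), 0, 1)      # (keys, N) -> (N, keys)
print(X.shape)  # (4, 2)

# 2D case: two image-like features, assumed raw shape (17, 11) per sample
features_2d = [np.zeros((n_samples, 17, 11)), np.zeros((n_samples, 17, 11))]
X_img = np.swapaxes(np.asarray(features_2d), 0, 1)  # (keys, N, 17, 11) -> (N, keys, 17, 11)
X_img = np.swapaxes(X_img, 1, 3)                    # -> (N, 11, 17, keys)
print(X_img.shape)  # (4, 11, 17, 2)
```

Note that `np.swapaxes(X, 1, 3)` moves the feature axis to the end and, as a side effect, also exchanges the two spatial axes.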

In [None]:
X_train, y_train = create_dataset('train', input_keys=['ddm_nbrcs', 'ddm_nbrcs'])
X_valid, y_valid = create_dataset('valid', input_keys=['ddm_nbrcs', 'ddm_nbrcs'])

### Introduction to neural networks

A single neuron takes an input $x$, applies a linear transformation $w \cdot x + b$, and then applies a non-linear *activation function* $\sigma$, e.g., the ReLU function.

Therefore, a single neuron transforms the input like:

$y = \sigma( w \cdot x + b )$

The parameters $w, b$ are *learned* by exposing the neural network to training data. The dense neural network is a neural network that stacks several individual neurons together in *layers*. A forward pass through such a network can be written as 

$y = \sigma( A \, \sigma( B x ))$

where $A, B$ are the weight matrices of the neural network (bias terms are omitted for brevity).
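
To make the forward pass concrete, here is a minimal NumPy sketch of a network with one hidden layer. The layer sizes, the random weights, and the names `b_A`, `b_B` for the bias vectors are illustrative assumptions; in a real network the parameters are learned from data.

```python
import numpy as np

def relu(z):
    """ReLU activation: max(0, z), applied element-wise."""
    return np.maximum(0.0, z)

rng = np.random.default_rng(0)

x = np.array([0.5, -1.0])        # one sample with 2 input features

# Randomly initialized parameters; in practice these are learned
B = rng.normal(size=(3, 2))      # first layer: 2 inputs -> 3 neurons
b_B = np.zeros(3)
A = rng.normal(size=(1, 3))      # second layer: 3 neurons -> 1 output
b_A = np.zeros(1)

hidden = relu(B @ x + b_B)       # sigma(B x)
y = relu(A @ hidden + b_A)       # sigma(A sigma(B x))

print(hidden.shape, y.shape)
```

For regression, the activation on the last layer is usually left out so that the output is not restricted to non-negative values; the Keras model defined below does this as well.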

<img src="./images/dense-neural-network.png" alt="Dense neural network" width="50%">

### Define a neural network architecture

For convenience, we define a Python function that can generate dense neural networks of various sizes. We use the *Keras* machine learning framework.

In [None]:
import tensorflow.keras as keras

def create_nn(H0=16, H1=8, input_dim=(2,)):
    '''
    Create a dense neural network with two hidden layers
    
    Parameters:
    H0 - Number of neurons in 1st hidden layer
    H1 - Number of neurons in 2nd hidden layer
    input_dim - Shape of the input features, e.g. (2,) for two features
    '''
    
    # Create a Keras input tensor
    inputs = keras.layers.Input(shape=input_dim)
    
    # Apply the first hidden layer
    hidden_layer = keras.layers.Dense(H0, activation='relu')(inputs)
    
    # Apply the second hidden layer
    hidden_layer2 = keras.layers.Dense(H1, activation='relu')(hidden_layer)
    
    # Reduce to one final output for the regression
    outputs = keras.layers.Dense(1)(hidden_layer2) 
    
    model = keras.Model(inputs=inputs, outputs=outputs)
    
    return model 

Create a model with default parameters:

In [None]:
model = create_nn(H0=16, H1=8)

model.summary()
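
The parameter counts reported by `model.summary()` can be checked by hand: a dense layer with `n_in` inputs and `n_out` neurons has `n_in * n_out` weights plus `n_out` biases. The helper `dense_params` below is a hypothetical name used only for this sketch, which assumes the default architecture (2 input features, `H0=16`, `H1=8`, 1 output).

```python
def dense_params(n_in, n_out):
    """Weights (n_in * n_out) plus one bias per output neuron."""
    return n_in * n_out + n_out

layer_sizes = [2, 16, 8, 1]  # input features, H0, H1, output

per_layer = [dense_params(n_in, n_out)
             for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

print(per_layer, sum(per_layer))  # [48, 136, 9] 193
```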

We also need to define an optimizer, i.e., the strategy used to search the parameter space of the neural network for a minimum of the loss function. In Keras, the optimizer and the loss are set by compiling the model:

In [None]:
model.compile(optimizer=keras.optimizers.Adam(), loss=keras.losses.MeanSquaredError())

### Train the neural network

During training, we show the training data to the neural network and compare its predictions with the true labels. The mismatch is measured by the loss, here the mean squared error between true ($y$) and predicted ($\hat y$) labels:

$\mathcal L = \frac 1 N \sum\limits_{i=1}^N (y_i - \hat y_i)^2$

Based on the loss, the neural network weights are updated using backpropagation (advanced topic). We show the data to the network in *minibatches*, i.e., small subsets of the training data, for scalability and efficiency.
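
The following sketch shows minibatch training with the mean squared error on the simplest possible model, a straight line $\hat y = w x + b$. It uses plain stochastic gradient descent with hand-written gradients instead of the Adam optimizer and backpropagation used by Keras. The data, learning rate, and number of epochs are made-up values for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 3 x + 1 plus a little noise
X = rng.uniform(-1, 1, size=(256, 1))
y = 3.0 * X + 1.0 + 0.01 * rng.normal(size=(256, 1))

w, b = 0.0, 0.0          # parameters to learn
learning_rate = 0.1
batch_size = 32

def mse(y_true, y_hat):
    """Mean squared error between true and predicted values."""
    return float(np.mean((y_true - y_hat) ** 2))

loss_before = mse(y, w * X + b)

for epoch in range(50):
    order = rng.permutation(len(X))             # reshuffle every epoch
    for start in range(0, len(X), batch_size):
        batch = order[start:start + batch_size]
        xb, yb = X[batch], y[batch]
        error = (w * xb + b) - yb               # y_hat - y on this minibatch
        # Gradients of the MSE with respect to w and b
        w -= learning_rate * float(np.mean(2 * error * xb))
        b -= learning_rate * float(np.mean(2 * error))

loss_after = mse(y, w * X + b)
print(f'w={w:.2f}, b={b:.2f}, loss {loss_before:.3f} -> {loss_after:.5f}')
```

Each parameter update only sees 32 samples, yet the estimates approach the values used to generate the data.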

An important question is how to decide when to *stop* training. In theory, we could train forever and ultimately reduce the loss on the training set to 0. That would not be helpful, because the model would then not generalize well to unseen data, a phenomenon known as *overfitting*. Therefore, we monitor the loss on the validation set during training, and stop training once this loss no longer decreases.
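
The stopping rule can be written down in a few lines of plain Python. This is a simplified sketch of the idea and not the Keras implementation; the function name `early_stopping_epoch` and the validation losses are made up for illustration.

```python
def early_stopping_epoch(val_losses, patience=5):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses.

    Training stops once the validation loss has not improved for
    `patience` consecutive epochs. Epochs are counted from 0.
    """
    best_loss = float('inf')
    best_epoch = 0
    epochs_without_improvement = 0

    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch

    return len(val_losses) - 1, best_epoch  # stopping rule never triggered

# Made-up validation losses: improvement stalls after epoch 3
val_losses = [5.0, 3.0, 2.5, 2.4, 2.6, 2.5, 2.7]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=3)
print(stop_epoch, best_epoch)  # 6 3
```

Training stops at epoch 6, but the best model was the one from epoch 3. Restoring the weights of that epoch is what the `restore_best_weights=True` option of the Keras callback is for.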

In [None]:
max_epochs=200 # stop here in any case
batch_size=32

In [None]:
early_stopping=keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True)

In [None]:
history = model.fit(X_train, y_train, 
                    validation_data=(X_valid, y_valid), 
                    callbacks=[early_stopping],
                    epochs=max_epochs, 
                    batch_size=batch_size)

### Analyze the training process

Plot the history of the training process. The Keras framework automatically stored the training and validation loss for each epoch. 

In [None]:
trained_epochs = len(history.history['loss'])

sns.lineplot(x=range(trained_epochs), y=history.history['loss'], label='Train loss')
sns.lineplot(x=range(trained_epochs), y=history.history['val_loss'], label='Validation loss')

plt.ylim(0, 10)

plt.xticks(range(trained_epochs))
plt.xlabel('Epoch')

plt.show()

### Improve the neural network

#### Single metric

We need to define a strategy to gauge the performance of the neural network. For that, we recommend choosing a single metric that is evaluated on the validation set and that you optimize step by step. In our case, this is the root mean squared error (RMSE). Calculate it below for the model we trained:

In [None]:
from sklearn.metrics import mean_squared_error

y_pred = model.predict(X_valid)

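# The `squared=False` argument was removed in recent scikit-learn versions,
# so we take the square root explicitly
rmse = np.sqrt(mean_squared_error(y_valid, y_pred))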

print(f'Root mean squared error (RMSE) obtained on validation set: {rmse:.4f} m/s')
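
For reference, the RMSE is the square root of the mean squared error and therefore has the same units as the wind speed. A NumPy-only sketch with made-up values (the name `rmse_numpy` is hypothetical):

```python
import numpy as np

def rmse_numpy(y_true, y_hat):
    """Root mean squared error, in the same units as the target."""
    y_true = np.asarray(y_true, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_hat) ** 2)))

# Made-up wind speeds in m/s
y_true_example = np.array([4.0, 6.0, 10.0])
y_hat_example = np.array([5.0, 6.0, 8.0])

print(f'{rmse_numpy(y_true_example, y_hat_example):.4f} m/s')
```

The errors are -1, 0, and 2 m/s, so the mean squared error is 5/3 and the RMSE is about 1.29 m/s.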

#### Next try

Change the parameters `H0, H1` of the neural network, as well as the batch size. What do you observe for the validation set results?

In [None]:
model2 = create_nn(H0=128, H1=64)
model2.summary()

In [None]:
model2.compile(optimizer=keras.optimizers.Adam(), loss=keras.losses.MeanSquaredError())

history = model2.fit(X_train, y_train, 
                    validation_data=(X_valid, y_valid), 
                    callbacks=[early_stopping],
                    epochs=max_epochs, 
                    batch_size=batch_size, 
                    verbose=1)

In [None]:
y_pred = model2.predict(X_valid)

rmse = np.sqrt(mean_squared_error(y_valid, y_pred))

print(f'Root mean squared error (RMSE) obtained on validation set: {rmse:.4f} m/s')

Compare the RMSE to the RMSE you obtained before with the default architecture. Do you see an improvement?

Optional: Try out more architectures

### Advanced topic: Convolutional neural network (CNN)

Remember that the dataset contains 2D variables as well, which we have not used so far. Convolutional neural networks originated in computer vision and were originally developed for image classification. Here we adapt a convolutional neural network for regression.

TODO add sketch of network architecture

In [None]:
def create_cnn(n_filters=16, H0=64, H1=32):
    '''
    Create a convolutional neural network. The architecture has 2 convolutional layers, followed by two dense layers.
    
    Parameters:
    n_filters - number of filters in each convolutional layer
    H0        - number of neurons in the 1st dense layer
    H1        - number of neurons in the 2nd dense layer
    '''
    
    inputs = keras.layers.Input(shape=(11, 17, 2))
    conv_layer1 = keras.layers.Conv2D(n_filters, 3, activation="relu")(inputs)
    conv_layer2 = keras.layers.Conv2D(n_filters, 3, activation="relu")(conv_layer1)
    pooling_layer = keras.layers.MaxPool2D()(conv_layer2)
    flatten_layer = keras.layers.Flatten()(pooling_layer)
    dense_layer = keras.layers.Dense(H0, activation="relu")(flatten_layer)
    dense_layer2 = keras.layers.Dense(H1, activation="relu")(dense_layer)
    outputs = keras.layers.Dense(1)(dense_layer2)
    
    model = keras.Model(inputs=inputs, outputs=outputs)
    
    return model
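
To see how the feature-map size evolves through `create_cnn`, the sketch below tracks the shape by hand. It relies on the Keras defaults: `Conv2D` without padding and with stride 1, and `MaxPool2D` with a 2x2 window. The helper names `conv_out` and `pool_out` are hypothetical.

```python
def conv_out(size, kernel=3):
    """Output length of an unpadded ('valid') convolution with stride 1."""
    return size - kernel + 1

def pool_out(size, pool=2):
    """Output length of non-overlapping max pooling (incomplete windows are dropped)."""
    return size // pool

height, width, n_filters = 11, 17, 16

for _ in range(2):                                  # two Conv2D layers with 3x3 kernels
    height, width = conv_out(height), conv_out(width)

height, width = pool_out(height), pool_out(width)   # MaxPool2D with 2x2 window
n_flat = height * width * n_filters                 # vector length after Flatten

print(height, width, n_flat)  # 3 6 288
```

The 11 x 17 input shrinks to 9 x 15 and 7 x 13 in the convolutional layers and to 3 x 6 after pooling, so the first dense layer receives 288 values per sample.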

We create training and validation data this time using the image data part of the provided CyGNSS dataset:

In [None]:
X_train_cnn, _ = create_dataset('train', input_keys=['brcs', 'eff_scatter'])
X_valid_cnn, _ = create_dataset('valid', input_keys=['brcs', 'eff_scatter'])

In [None]:
model_cnn = create_cnn()
model_cnn.summary()

In [None]:
model_cnn.compile(optimizer=keras.optimizers.Adam(), loss=keras.losses.MeanSquaredError())

history = model_cnn.fit(X_train_cnn, y_train, 
                    validation_data=(X_valid_cnn, y_valid), 
                    callbacks=[early_stopping],
                    epochs=max_epochs, 
                    batch_size=batch_size, 
                    verbose=1)

In [None]:
rmse = np.sqrt(mean_squared_error(y_valid, model_cnn.predict(X_valid_cnn)))
print(f'RMSE for the CNN: {rmse:.4f} m/s')

To summarize the results that were obtained on the *validation* set:

1. Dense neural network

| H0 | H1 | batch size | RMSE |
|--  |--  |--          |--    |
| 16 | 8  | 32         | TODO |


2. Convolutional neural network

| Filters | H0 | H1 | batch size | RMSE |
|--       |--  |--  |--          |--    |
| 16      | 64 | 32 | 32         | TODO |

### Final step: Test set prediction

Repeat the dataset preparation for the test set:

In [None]:
X_test, y_test = create_dataset('test', input_keys=['ddm_nbrcs', 'ddm_nbrcs'])
X_test_cnn, _ = create_dataset('test', input_keys=['brcs', 'eff_scatter'])

Choose the model architecture that you think performs best on the given dataset. If necessary, train this model again. Use the trained model to make predictions on the test set.

In [None]:
# TODO train again if necessary

best_model = model_cnn

In [None]:
y_pred = best_model.predict(X_test_cnn) # if CNN was best model
# y_pred = best_model.predict(X_test) # if ANN was best model

## Data visualization

Calculate metrics to report on the performance of your machine learning algorithm. Compare the test set RMSE with the validation RMSE. What do you observe?

In [None]:
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print(f'Root mean squared error (RMSE) for the test set: {rmse:.4f} m/s')

### Histogram plot

In a regression problem, it is interesting to see the performance of the machine learning algorithm beyond the aggregated RMSE metric. We plot the histogram of true windspeed and predicted windspeed. What do you observe? Can you identify a windspeed range where our machine learning algorithm performs poorly? What are possible explanations?

In [None]:
fig, ax = plt.subplots(1, 1)

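# sns.distplot is deprecated; histplot with kde=True gives a comparable plot
sns.histplot(y_test.squeeze(), color='gray', label='True wind speed', kde=True, stat='density', ax=ax)
sns.histplot(y_pred.squeeze(), color='C2', label='Predicted wind speed', kde=True, stat='density', ax=ax)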

ax.legend()
ax.set_xlabel('Wind speed (m/s)')

plt.show()

In [None]:
# 2D scatter plot

fig, ax = plt.subplots(1, 1)

ax.set_aspect('equal')

img = ax.hexbin(y_test.squeeze(), y_pred.squeeze(), mincnt=1, cmap='viridis')

ax.set_xlabel('ERA5 wind speed (m/s)')
ax.set_ylabel('Predicted wind speed (m/s)')

xmin = 0
xmax = 20

ax.plot(np.linspace(xmin, xmax), np.linspace(xmin, xmax), 'r--')

ax.set_ylim(xmin, xmax)
ax.set_xlim(xmin, xmax)

plt.colorbar(img, label='Sample density')

plt.show()

Discussion: Overall, it is possible to obtain reasonable wind speed predictions even with the small set of training samples that was provided in this tutorial. However, the predictions tend towards the mean wind speed, which means that low wind speeds are overestimated and high wind speeds are underestimated. This behavior is often described as *regression to the mean* and is common in statistical learning algorithms. Note that the wind speed distribution is not uniform, so the algorithm sees average samples much more often than samples with high wind speed.
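
One way to quantify this effect is to compute the mean bias (prediction minus truth) in separate wind speed ranges. The sketch below uses synthetic stand-ins for `y_test` and `y_pred`, constructed so that the predictions are shrunk towards the mean; the bin edges are an arbitrary choice. With the real arrays, only the two lines that create the data would need to change.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for y_test and y_pred: predictions shrunk towards the mean
y_true = rng.uniform(0, 20, size=5000)
y_hat = 0.5 * y_true + 0.5 * y_true.mean() + rng.normal(scale=0.5, size=5000)

bin_edges = [0, 5, 10, 15, 20]                       # wind speed ranges in m/s
bin_index = np.digitize(y_true, bin_edges[1:-1])     # bin number 0..3 for each sample

# Mean of (prediction - truth) within each wind speed range
bias = np.array([np.mean(y_hat[bin_index == k] - y_true[bin_index == k])
                 for k in range(len(bin_edges) - 1)])

for k, value in enumerate(bias):
    print(f'{bin_edges[k]:>2}-{bin_edges[k + 1]:<2} m/s: mean bias {value:+.2f} m/s')
```

A positive bias in the lowest range together with a negative bias in the highest range is the signature of predictions that are pulled towards the mean.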

Related publication:

Asgarimehr, M., Arnold, C., Weigel, T., Ruf, C. & Wickert, J. GNSS reflectometry global ocean wind speed using deep learning: Development and assessment of CyGNSSnet. Remote Sensing of Environment 269, 112801 (2022).

    
    