# Train Model to Recognise Clouds in Satellite Images

This notebook shows how the cloud recognition model for Natura 2000 satellite images was created and how to train a new one. The cells below follow the development process, including some approaches that proved fruitless, but have been restructured to give a reasonable overview of the steps involved. The notebook leans heavily on the code in the 'src/cloud_detection' folder.

Conclusion:
It is possible to predict the presence of clouds in the Natura 2000 areas 'Coepelduynen', 'Voornes Duin' and 'Duinen Goeree & Kwade Hoek'. We can in fact do so using only metadata about the RGB values in the given images. The best model (one classifier per location) has approximately 95% accuracy, compared to 60% for a baseline model; the location-independent model that is saved at the end reaches approximately 80%. This strongly suggests the model is fit to give warnings for the presence of clouds before satellite images are processed further.

The resulting model is uploaded to the `pzh-blob-satelliet` blob storage, in container `satellite-images-nso` and folder `cloud_detection_models`. The 'Apply cloud recongition model.ipynb' notebook shows how to apply said model to a new image (for the three known locations).

Author: Pieter Kouyzer, paj.kouyzer@pzh.nl\
Date: 2023/12/29

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import os
import pandas as pd
import pickle
from pprint import pprint

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, accuracy_score

from cloud_recognition.model_training import (
    train_test_split_filenames,
    Natura2000CloudDetectionModel,
    ModelType,
    Location,
)
from cloud_recognition.data_loaders import FlattenedRGBImageLoader
from cloud_recognition.data_preparation import DataPreprocessor
from cloud_recognition.features import selected_features, all_features


### Set variables

In [3]:
filenames_filepath = "satellite-images-clouds.csv"
folder_path = "E:/Data/remote_sensing/satellite-images"
column_to_train_test_split_on = "location"
filepath_cached_data = "cache/filenames_df.csv"

### Load and Preprocess Data
Note that the results are cached and the cache is loaded when available, as it takes a long time (>1 hour) to run.

In [4]:
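if os.path.exists(filepath_cached_data):
    # Reuse the cached features; recomputing them takes more than an hour.
    filenames_df = pd.read_csv(filepath_cached_data)
else:
    filenames_df = pd.read_csv(filenames_filepath)

    # Compute the features for every image listed in the csv.
    data_preprocessor = DataPreprocessor(features_to_generate=all_features)
    features_df = filenames_df.apply(
        lambda row: data_preprocessor.transform(
            FlattenedRGBImageLoader(filepath=os.path.join(folder_path, row["filename"])).get_rgb_df()
        ),
        axis=1,
    )
    filenames_df = filenames_df.join(features_df)

    # Make sure the cache folder exists before writing to it.
    os.makedirs(os.path.dirname(filepath_cached_data), exist_ok=True)
    filenames_df.to_csv(filepath_cached_data, index=False)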

### Split data into train and test dataframes
The choice was made not to split on the location column, as a model per location is more successful.

N.B. A possible improvement would be to split on the date the images were taken (the newest images would form the test dataframe). This would, however, require a reasonable balance of the 'clouds'/'not clouds' categories for each location in both parts of the time split.
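
As a rough sketch of the time-based split suggested above: the helper names `time_based_split` and `class_balance`, the `date` column, and the toy data are all assumptions made for illustration. They are not part of `cloud_recognition`, and the real `filenames_df` would first need a date column.

```python
import pandas as pd


def time_based_split(df, date_column="date", train_size=0.7):
    """Hypothetical helper: oldest rows become train, newest rows become test."""
    ordered = df.sort_values(date_column).reset_index(drop=True)
    n_train = int(round(len(ordered) * train_size))
    return ordered.iloc[:n_train], ordered.iloc[n_train:]


def class_balance(df, group_column="location", label_column="clouds"):
    """Fraction of cloudy images per location, to check that a split is usable."""
    return df.groupby(group_column)[label_column].mean()


# Toy data standing in for filenames_df (dates, locations and labels are made up).
example_df = pd.DataFrame({
    "date": pd.to_datetime([
        "2023-01-05", "2023-02-10", "2023-03-15", "2023-04-20", "2023-05-25",
        "2023-06-30", "2023-07-04", "2023-08-09", "2023-09-13", "2023-10-18",
    ]),
    "location": ["A", "B"] * 5,
    "clouds": [0, 1, 1, 0, 0, 1, 1, 0, 1, 0],
})

train_example, test_example = time_based_split(example_df)
print(class_balance(train_example))
print(class_balance(test_example))
```

If a location shows a fraction of 0 or 1 in either part, the time split leaves that location with a single class there, which is the situation the note above warns about.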

In [5]:
train_df, test_df = train_test_split_filenames(filenames_df, column_to_split_on=None, random_state=1000, train_size=0.7)

### Feature Selection
By manually removing highly correlated features (correlation > 0.95) and checking the effect on the accuracy of the linear models (below), we arrive at the following important features, none of which are highly correlated with one another.

In [6]:
uncorrelated_features = [str(feature) for feature in selected_features]
pprint(uncorrelated_features)

train_df[[str(feature) for feature in all_features]].corr()

['fraction_bright_500',
 'fraction_bright_700',
 'fraction_relative_bright_0.6',
 'fraction_relative_bright_0.8',
 'fraction_relative_bright_0.9',
 'fraction_relative_bright_0.95',
 'fraction_relative_bright_0.99',
 'fraction_green_bright_500',
 'red_quantile_0.99',
 'number_bright_pixels_500',
 'number_bright_pixels_700',
 'number_of_pixels']


Unnamed: 0,fraction_bright_500,fraction_bright_600,fraction_bright_700,fraction_bright_800,fraction_bright_900,fraction_bright_1000,fraction_bright_1100,fraction_bright_1200,fraction_relative_bright_0.5,fraction_relative_bright_0.6,...,number_bright_pixels_1000,number_bright_pixels_1100,number_bright_pixels_1200,number_of_pixels,fraction_bright_from_max_0.5,fraction_bright_from_max_0.6,fraction_bright_from_max_0.7,fraction_bright_from_max_0.8,fraction_bright_from_max_0.9,fraction_bright_from_max_0.99
fraction_bright_500,1.000000,0.922186,0.851619,0.818962,0.809043,0.807429,0.806714,0.806711,0.354054,0.287404,...,0.807376,0.806709,0.806709,-0.236830,0.820145,0.807335,0.806216,0.806688,0.806706,0.007449
fraction_bright_600,0.922186,1.000000,0.984354,0.965210,0.959509,0.958795,0.958445,0.958443,0.232713,0.168769,...,0.958779,0.958443,0.958442,-0.180800,0.954567,0.958373,0.958226,0.958433,0.958439,-0.068493
fraction_bright_700,0.851619,0.984354,1.000000,0.994176,0.991132,0.990696,0.990462,0.990461,0.162177,0.096207,...,0.990697,0.990462,0.990460,-0.143183,0.981636,0.990130,0.990351,0.990457,0.990459,-0.090223
fraction_bright_800,0.818962,0.965210,0.994176,1.000000,0.999499,0.999279,0.999140,0.999140,0.126433,0.052809,...,0.999280,0.999140,0.999140,-0.113385,0.989319,0.998561,0.999063,0.999139,0.999139,-0.106879
fraction_bright_900,0.809043,0.959509,0.991132,0.999499,1.000000,0.999964,0.999917,0.999917,0.116803,0.041973,...,0.999963,0.999917,0.999917,-0.106649,0.989920,0.999202,0.999847,0.999916,0.999917,-0.110642
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
fraction_bright_from_max_0.6,0.807335,0.958373,0.990130,0.998561,0.999202,0.999244,0.999249,0.999250,0.113777,0.041110,...,0.999234,0.999249,0.999249,-0.102518,0.993022,1.000000,0.999506,0.999260,0.999251,-0.101163
fraction_bright_from_max_0.7,0.806216,0.958226,0.990351,0.999063,0.999847,0.999922,0.999934,0.999935,0.112338,0.037816,...,0.999916,0.999934,0.999934,-0.105826,0.990441,0.999506,1.000000,0.999939,0.999935,-0.105339
fraction_bright_from_max_0.8,0.806688,0.958433,0.990457,0.999139,0.999916,0.999989,1.000000,1.000000,0.113702,0.038866,...,0.999984,1.000000,1.000000,-0.105110,0.989944,0.999260,0.999939,1.000000,1.000000,-0.111673
fraction_bright_from_max_0.9,0.806706,0.958439,0.990459,0.999139,0.999917,0.999990,1.000000,1.000000,0.113751,0.038905,...,0.999984,1.000000,1.000000,-0.105089,0.989926,0.999251,0.999935,1.000000,1.000000,-0.111892
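
The features in this notebook were pruned by hand. As an illustration only, a common automated approximation of the same idea is sketched below. The helper `drop_highly_correlated` and the toy columns are assumptions, not the procedure that produced `selected_features`.

```python
import numpy as np
import pandas as pd


def drop_highly_correlated(features_df, threshold=0.95):
    """Drop every column whose |correlation| with an earlier column exceeds threshold."""
    corr = features_df.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return features_df.drop(columns=to_drop), to_drop


# Toy data: "b" is a near copy of "a", "c" is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy_features = pd.DataFrame({
    "a": a,
    "b": a + 0.01 * rng.normal(size=200),
    "c": rng.normal(size=200),
})

kept_features, dropped = drop_highly_correlated(toy_features)
print(dropped)
```

This rule depends on column order and ignores the effect on model accuracy, which the manual procedure did take into account.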


### Model Training
We first create a baseline model that always picks the majority class (no clouds), so we know our model should be better than that. The baseline has an accuracy of approximately 60%.

Then we create a model per location that scales the features, performs Principal Component Analysis (PCA) and finally fits a classifier on the resulting components.
We perform a grid search over two basic classifiers (Logistic Regression and Decision Tree Classifier; the latter is not a linear model, although the code keeps the `linear_models` naming), as well as over the number of PCA components to include. We pick the best-performing option for each location. The best-performing model then has an accuracy of approximately 95%, a considerable improvement over the baseline. The grid search yields the following parameters per location:

- Coepelduynen -> LogisticRegression, pca_n_components = 5
- Duinen Goeree Kwade Hoek -> LogisticRegression, pca_n_components = 6
- Voornes Duin -> DecisionTreeClassifier, pca_n_components = 3
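
The scale, PCA, classifier chain and the search over it can be sketched with plain scikit-learn, as below. This is not how `Natura2000CloudDetectionModel` is implemented (its internals are not shown here). The data is synthetic, and the sketch uses cross-validation where the notebook compares train and test accuracy directly.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 12 selected features of one location.
X, y = make_classification(n_samples=60, n_features=12, n_informative=5, random_state=0)

pipeline = Pipeline([
    ("scale", StandardScaler()),    # put all features on the same scale
    ("pca", PCA()),                 # reduce to a few principal components
    ("clf", LogisticRegression()),  # placeholder, replaced by the grid below
])

param_grid = {
    "pca__n_components": list(range(1, 10)),
    "clf": [LogisticRegression(), DecisionTreeClassifier(random_state=0)],
}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(X, y)
print(search.best_params_)
print(search.best_score_)
```

With as few images per location as in this notebook, cross-validated scores are noisy, so several parameter combinations can end up tied.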

In [7]:
feature_columns_to_fit = uncorrelated_features + ["location"]
X_train = train_df[feature_columns_to_fit]
y_train = train_df["clouds"].astype(int)
X_test = test_df[feature_columns_to_fit]
y_test = test_df["clouds"].astype(int)

### Baseline

In [8]:
print("Baseline model results:")
model = Natura2000CloudDetectionModel(model_type=ModelType.BASELINE, locations=[loc for loc in Location])

y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

print("train confusion matrix")
print(confusion_matrix(y_true=y_train, y_pred=y_pred_train))
print(f"Train accuracy: {accuracy_score(y_train, y_pred_train)}")
print("test confusion matrix")
print(confusion_matrix(y_true=y_test, y_pred=y_pred_test))
print(f"Test accuracy: {accuracy_score(y_test, y_pred_test)}")

Baseline model results:
train confusion matrix
[[32  0]
 [21  0]]
Train accuracy: 0.6037735849056604
test confusion matrix
[[14  0]
 [ 9  0]]
Test accuracy: 0.6086956521739131
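
The baseline accuracy follows directly from the class counts. As a small check, scikit-learn's `DummyClassifier` (used here as a stand-in for `ModelType.BASELINE`, whose implementation is not shown) reproduces the train accuracy from the 32 'no clouds' and 21 'clouds' labels in the confusion matrix above.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Labels mirroring the train confusion matrix: 32 "no clouds" (0), 21 "clouds" (1).
y_example = np.array([0] * 32 + [1] * 21)
X_example = np.zeros((len(y_example), 1))  # the baseline ignores the features

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_example, y_example)

baseline_accuracy = accuracy_score(y_example, baseline.predict(X_example))
print(baseline_accuracy)  # equals 32 / 53
```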


### Linear Model per Location

In [9]:
grid_search_parameters = {
    "model": [LogisticRegression, DecisionTreeClassifier],
    "pca_n_components": range(1,10),
    "location": [loc for loc in Location]
}

In [10]:
for location in grid_search_parameters["location"]:
    loc_train_mask = train_df["location"] == location.value
    loc_test_mask = test_df["location"] == location.value

    feature_columns_to_fit = uncorrelated_features + ["location"]

    X_train_loc = X_train[loc_train_mask]
    y_train_loc = y_train[loc_train_mask]
    X_test_loc = X_test[loc_test_mask]
    y_test_loc = y_test[loc_test_mask]
    for mod in grid_search_parameters["model"]:
        for pca_n_components in grid_search_parameters["pca_n_components"]:
            locations = [location]
            linear_models = {location: mod()}
            pca_n_components_dict = {location: pca_n_components}

            model = Natura2000CloudDetectionModel(model_type=ModelType.LOCATION_LINEAR_MODEL, locations=locations, linear_models=linear_models, pca_n_components=pca_n_components_dict)
            model.fit(X_train_loc, y_train_loc)

            y_pred_train_loc = model.predict(X_train_loc)
            y_pred_test_loc = model.predict(X_test_loc)

            print(f"Parameters: location: {location}, linear_model = {mod}, pca_n_components = {pca_n_components}")
            print(f"Train accuracy: {accuracy_score(y_train_loc, y_pred_train_loc)}")
            print(f"Test accuracy: {accuracy_score(y_test_loc, y_pred_test_loc)}")


Parameters: location: Location.COEPELDUYNEN, linear_model = <class 'sklearn.linear_model._logistic.LogisticRegression'>, pca_n_components = 1
Train accuracy: 0.7368421052631579
Test accuracy: 1.0
Parameters: location: Location.COEPELDUYNEN, linear_model = <class 'sklearn.linear_model._logistic.LogisticRegression'>, pca_n_components = 2
Train accuracy: 0.8421052631578947
Test accuracy: 1.0
Parameters: location: Location.COEPELDUYNEN, linear_model = <class 'sklearn.linear_model._logistic.LogisticRegression'>, pca_n_components = 3
Train accuracy: 0.8421052631578947
Test accuracy: 1.0
Parameters: location: Location.COEPELDUYNEN, linear_model = <class 'sklearn.linear_model._logistic.LogisticRegression'>, pca_n_components = 4
Train accuracy: 0.8947368421052632
Test accuracy: 1.0
Parameters: location: Location.COEPELDUYNEN, linear_model = <class 'sklearn.linear_model._logistic.LogisticRegression'>, pca_n_components = 5
Train accuracy: 0.9473684210526315
Test accuracy: 1.0
[output truncated]

In [11]:
linear_models = {
    Location.COEPELDUYNEN: LogisticRegression(),
    Location.DUINENGOEREEKWADEHOEK: LogisticRegression(),
    Location.VOORNESDUIN: DecisionTreeClassifier()
}
pca_n_components = {
    Location.COEPELDUYNEN: 5,
    Location.DUINENGOEREEKWADEHOEK: 6,
    Location.VOORNESDUIN: 3
}
locations = [
    Location.COEPELDUYNEN,
    Location.DUINENGOEREEKWADEHOEK,
    Location.VOORNESDUIN
]

model = Natura2000CloudDetectionModel(model_type=ModelType.LOCATION_LINEAR_MODEL, locations=locations, linear_models=linear_models, pca_n_components=pca_n_components)
model.fit(X_train, y_train)

y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

print("train confusion matrix")
print(confusion_matrix(y_true=y_train, y_pred=y_pred_train))
print(f"Train accuracy: {accuracy_score(y_train, y_pred_train)}")
print("test confusion matrix")
print(confusion_matrix(y_true=y_test, y_pred=y_pred_test))
print(f"Test accuracy: {accuracy_score(y_test, y_pred_test)}")

train confusion matrix
[[32  0]
 [ 2 19]]
Train accuracy: 0.9622641509433962
test confusion matrix
[[14  0]
 [ 0  9]]
Test accuracy: 1.0
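
To make the idea of "one model per location" concrete, the sketch below routes each row to a pipeline fitted on its own location. `PerLocationClassifier` is a hypothetical stand-in written for this example with synthetic data. It is not the code behind `ModelType.LOCATION_LINEAR_MODEL`.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


class PerLocationClassifier:
    """Hypothetical: one scale -> PCA -> LogisticRegression pipeline per location."""

    def __init__(self, pca_n_components):
        self.pca_n_components = pca_n_components  # dict: location -> n_components
        self.pipelines_ = {}

    def fit(self, X, y):
        for location, n_components in self.pca_n_components.items():
            mask = (X["location"] == location).to_numpy()
            pipeline = make_pipeline(
                StandardScaler(), PCA(n_components=n_components), LogisticRegression()
            )
            pipeline.fit(X.loc[mask].drop(columns="location"), y[mask])
            self.pipelines_[location] = pipeline
        return self

    def predict(self, X):
        # Rows with a location that was never fitted keep the default prediction 0.
        predictions = np.zeros(len(X), dtype=int)
        for location, pipeline in self.pipelines_.items():
            mask = (X["location"] == location).to_numpy()
            if mask.any():
                predictions[mask] = pipeline.predict(X.loc[mask].drop(columns="location"))
        return predictions


# Synthetic data: the label depends on a different feature in each location.
rng = np.random.default_rng(0)
X_toy = pd.DataFrame(rng.normal(size=(80, 4)), columns=["f0", "f1", "f2", "f3"])
X_toy["location"] = ["A"] * 40 + ["B"] * 40
y_toy = np.where(X_toy["location"] == "A", X_toy["f0"] > 0, X_toy["f1"] > 0).astype(int)

toy_model = PerLocationClassifier(pca_n_components={"A": 2, "B": 3}).fit(X_toy, y_toy)
toy_predictions = toy_model.predict(X_toy)
print(toy_predictions[:10])
```

The trade-off is visible in the constructor: every new location needs its own training data and its own parameters, which is why a location-free model is explored next.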


### Linear Model without Location input

In [12]:
grid_search_parameters = {
    "model": [LogisticRegression, DecisionTreeClassifier],
    "pca_n_components": range(1,12),
}

In [13]:
feature_columns_to_fit = uncorrelated_features + ["location"]

for mod in grid_search_parameters["model"]:
    for pca_n_components in grid_search_parameters["pca_n_components"]:
        linear_model = mod()

        model = Natura2000CloudDetectionModel(model_type=ModelType.LINEAR_MODEL, locations=None, linear_models=linear_model, pca_n_components=pca_n_components)
        model.fit(X_train, y_train)

        y_pred_train = model.predict(X_train)
        y_pred_test = model.predict(X_test)

        print(f"Parameters: linear_model = {mod}, pca_n_components = {pca_n_components}")
        print(f"Train accuracy: {accuracy_score(y_train, y_pred_train)}")
        print(f"Test accuracy: {accuracy_score(y_test, y_pred_test)}")


Parameters: linear_model = <class 'sklearn.linear_model._logistic.LogisticRegression'>, pca_n_components = 1
Train accuracy: 0.660377358490566
Test accuracy: 0.6956521739130435
Parameters: linear_model = <class 'sklearn.linear_model._logistic.LogisticRegression'>, pca_n_components = 2
Train accuracy: 0.7358490566037735
Test accuracy: 0.9130434782608695
Parameters: linear_model = <class 'sklearn.linear_model._logistic.LogisticRegression'>, pca_n_components = 3
Train accuracy: 0.7924528301886793
Test accuracy: 0.9130434782608695
Parameters: linear_model = <class 'sklearn.linear_model._logistic.LogisticRegression'>, pca_n_components = 4
Train accuracy: 0.8113207547169812
Test accuracy: 0.782608695652174
Parameters: linear_model = <class 'sklearn.linear_model._logistic.LogisticRegression'>, pca_n_components = 5
Train accuracy: 0.8113207547169812
Test accuracy: 0.782608695652174
Parameters: linear_model = <class 'sklearn.linear_model._logistic.LogisticRegression'>, pca_n_components = 6
[output truncated]

In [14]:
model = Natura2000CloudDetectionModel(model_type=ModelType.LINEAR_MODEL, locations=None, linear_models=LogisticRegression(), pca_n_components=4)
model.fit(X_train, y_train)

y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

print("train confusion matrix")
print(confusion_matrix(y_true=y_train, y_pred=y_pred_train))
print(f"Train accuracy: {accuracy_score(y_train, y_pred_train)}")
print("test confusion matrix")
print(confusion_matrix(y_true=y_test, y_pred=y_pred_test))
print(f"Test accuracy: {accuracy_score(y_test, y_pred_test)}")

train confusion matrix
[[30  2]
 [ 8 13]]
Train accuracy: 0.8113207547169812
test confusion matrix
[[13  1]
 [ 4  5]]
Test accuracy: 0.782608695652174
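
Since the model is meant to warn about cloudy images, it can help to read more than accuracy from the test confusion matrix above. The worked example below uses only the four counts printed there and assumes scikit-learn's convention of rows for the true class and columns for the predicted class, with 1 meaning 'clouds'.

```python
import numpy as np

# Test confusion matrix of the location-free model (rows: true, columns: predicted).
cm = np.array([[13, 1],
               [4, 5]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()  # 18 / 23
recall = tp / (tp + fn)          # 5 / 9: share of cloudy images that get flagged
precision = tp / (tp + fp)       # 5 / 6: share of flagged images that are cloudy
print(f"accuracy={accuracy:.3f}, recall={recall:.3f}, precision={precision:.3f}")
# prints: accuracy=0.783, recall=0.556, precision=0.833
```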


### Conclusion:
The Location Logistic Regression model works quite well. It increases accuracy from 60% (baseline model) to more than 95%.\
For the Location Logistic Regression model the difference between train and test accuracy is small (96% versus 100%), which gives no indication of overfitting.\
However, we prefer a model that does not rely on location-specific input, so a decent algorithm without location is preferred.\
Fortunately, the Logistic Regression model without location input works reasonably well too. It increases accuracy from 60% (baseline model) to approximately 80%. Again, the difference between train and test accuracy is small (81% versus 78%).

Hence, the cloud detection model seems a useful addition to the extractor, warning about images with too many clouds.

### Definitive model
The definitive model is trained on all available data and saved. It is then uploaded by hand to the `pzh-blob-satelliet` blob storage, in container `satellite-images-nso` and folder `cloud_detection_models`.

In [19]:
X_train_def = filenames_df[feature_columns_to_fit]
y_train_def = filenames_df["clouds"].astype(int)

linear_model = LogisticRegression()
pca_n_components = 4
locations = None

model = Natura2000CloudDetectionModel(model_type=ModelType.LINEAR_MODEL, locations=locations, linear_models=linear_model, pca_n_components=pca_n_components)
model.fit(X_train_def, y_train_def)

y_pred_train_def = model.predict(X_train_def)

print("train confusion matrix")
print(confusion_matrix(y_true=y_train_def, y_pred=y_pred_train_def))
print(f"Train accuracy: {accuracy_score(y_train_def, y_pred_train_def)}")

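filename = '../saved_models/cloud_detection_logistic_regression_v1.0.sav'
# Use a context manager so the file is flushed and closed after writing.
with open(filename, 'wb') as model_file:
    pickle.dump(model, model_file)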

train confusion matrix
[[42  4]
 [10 20]]
Train accuracy: 0.8157894736842105
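
A saved model can be read back with `pickle.load`. The round trip below is a minimal sketch that uses a plain scikit-learn classifier and a temporary file as stand-ins. Loading the real `.sav` file additionally requires the `cloud_recognition` package to be importable, and a scikit-learn version compatible with the one used for training.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Stand-in model trained on synthetic data (not the cloud detection model).
X, y = make_classification(n_samples=50, n_features=4, random_state=0)
stand_in_model = LogisticRegression().fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "stand_in_model.sav")
    with open(path, "wb") as model_file:
        pickle.dump(stand_in_model, model_file)
    with open(path, "rb") as model_file:
        reloaded_model = pickle.load(model_file)

# The reloaded model should reproduce the original predictions exactly.
same_predictions = bool((reloaded_model.predict(X) == stand_in_model.predict(X)).all())
print(same_predictions)
```

Only unpickle files from a trusted source, since `pickle.load` can execute arbitrary code.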
