# Land Use/Land Cover (LULC) modelling with `ml-rapids` and `eo-learn`

## `ml-rapids`
- implementation of incremental learning algorithms in C++
- bindings for Python
- implements scikit-learn's incremental learner interface (`fit`, `predict`)
- Classification methods:
    - Majority Class
    - Naive Bayes
    - Logistic Regression
    - Perceptron
    - VFDT (Very Fast Decision Trees) aka Hoeffding Trees
    - HAT (Hoeffding Adaptive Trees)
    - Bagging

- GitHub repository: [https://github.com/JozefStefanInstitute/ml-rapids](https://github.com/JozefStefanInstitute/ml-rapids)
- Install with pip, the package installer for Python ([https://pypi.org/project/ml-rapids/](https://pypi.org/project/ml-rapids/)):

`pip install ml_rapids`

- Future plans:
    - Improved Python support (pickling of models)
    - Additional methods
    - Regression methods
    - JavaScript bindings (npm package)
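
The VFDT (Hoeffding tree) method listed above decides when to split a leaf by means of the Hoeffding bound. The following is a minimal sketch of that rule in its textbook form, not the ml-rapids C++ implementation. The function names are illustrative, and mapping the `split_confidence`, `tie_threshold` and `grace_period` settings used later in this notebook onto `delta`, `tie_threshold` and `n` is an assumption.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """Return epsilon such that, with probability 1 - delta, the true mean of a
    variable with the given range is at least the mean observed over n samples
    minus epsilon."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain, second_gain, value_range, delta, n, tie_threshold):
    """Textbook VFDT test: split when the best attribute is reliably better than
    the runner-up, or when the bound is tight enough to treat them as tied."""
    eps = hoeffding_bound(value_range, delta, n)
    return (best_gain - second_gain) > eps or eps < tie_threshold


# Assumed heuristic: information gain over 8 classes, whose range is log2(8) = 3.
eps_200 = hoeffding_bound(value_range=3.0, delta=1e-7, n=200)
eps_20000 = hoeffding_bound(value_range=3.0, delta=1e-7, n=20000)
print(f"epsilon after 200 samples:   {eps_200:.3f}")
print(f"epsilon after 20000 samples: {eps_20000:.3f}")
```

The bound shrinks with the square root of the number of samples seen at a leaf, so a small `delta` mainly delays splits rather than preventing them.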

## Demo
- Land Use/Land Cover (LULC) modelling using `ml-rapids` and `eo-learn`
- We compare methods from 3 different libraries (2 batch learning methods, 1 incremental learning method):
    - `LGBMClassifier`: Light Gradient Boosting Machine from [https://github.com/microsoft/LightGBM](https://github.com/microsoft/LightGBM) (batch)
    - `RandomForestClassifier`: Random Forest Classifier from [https://github.com/scikit-learn/scikit-learn](https://github.com/scikit-learn/scikit-learn) (batch)
    - `HoeffdingTree`: Hoeffding Tree Classifier from [https://github.com/JozefStefanInstitute/ml-rapids](https://github.com/JozefStefanInstitute/ml-rapids) (incremental)
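
To make the batch/incremental distinction concrete, here is a toy sketch of the simplest incremental learner, a majority-class classifier, in pure Python. It is illustrative only and does not reflect how ml-rapids implements it; the class name `MajorityClassSketch` and the `partial_fit` method name (borrowed from scikit-learn's convention) are assumptions.

```python
from collections import Counter


class MajorityClassSketch:
    """Toy incremental learner: predicts the most frequent label seen so far."""

    def __init__(self):
        self.counts = Counter()

    def partial_fit(self, X, y):
        # Only the label counts are updated; earlier samples are never revisited.
        self.counts.update(y)
        return self

    def predict(self, X):
        if not self.counts:
            raise ValueError("model has not seen any samples yet")
        majority = self.counts.most_common(1)[0][0]
        return [majority for _ in X]


model = MajorityClassSketch()
# The stream arrives in chunks; a batch learner would need all of it at once.
model.partial_fit([[0.1], [0.2], [0.3]], [1, 1, 2])
first = model.predict([[0.5]])
model.partial_fit([[0.4], [0.6], [0.7]], [2, 2, 2])
second = model.predict([[0.5]])
print(first, second)
```

The model state stays constant in size no matter how many chunks arrive, which is the property that makes incremental methods attractive for large EO datasets.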

In [None]:
# Set up Jupyter notebook
%reload_ext autoreload
%autoreload 2
%matplotlib inline

In [None]:
# Import all dependencies
import csv
import datetime
import os
import time
import string
import warnings

import joblib
import sklearn
import geopandas as gpd
import numpy as np
import pandas as pd
from eolearn.core import EOExecutor, EOPatch, EOTask, FeatureType, \
    LinearWorkflow, LoadTask, SaveTask, OverwritePermission
from matplotlib import pyplot as plt
from matplotlib.colors import ListedColormap, BoundaryNorm
from mpl_toolkits.axes_grid1 import make_axes_locatable
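# plot_confusion_matrix is provided by notebook_temporary.utilities (imported below),
# so it is not imported from sklearn.metrics (removed there in scikit-learn 1.2).
from sklearn.metrics import confusion_matrix, f1_score, accuracy_score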
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_validate, \
    KFold
from tqdm.notebook import tqdm

from notebook_temporary.features import AddStreamTemporalFeaturesTask
from notebook_temporary.tasks import PredictPatch
from notebook_temporary.utilities import get_eopatch_ids, plot_grid, plot_roi, \
    img_rgb, img_feature, img_diff, abbreviate, plot_features, plot_confusion_matrix, \
    evaluate_grid

# Machine learning algorithms
from lightgbm import LGBMClassifier
from sklearn.ensemble import RandomForestClassifier
from ml_rapids import HoeffdingTree

warnings.filterwarnings('ignore')

## Data preparation

EO data:
- Slovenia split into 1085 patches ([plot](#))
- each patch is 505x500 pixels (10 m/px)
- Sentinel-2 (L1C), 2017, 13 bands (approx. 1 TB)

LULC (ground truth):
- [https://rkg.gov.si/](https://rkg.gov.si/) (Ministry of Agriculture, Forestry and Food)
- Originally 25 classes, grouped into 10 classes
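
The grouping step can be sketched as a lookup from original class codes to grouped class indices. The mapping below is transcribed from the comments next to the `LULC_G` definition in this notebook. The names `GROUPS`, `CODE_TO_GROUP` and `group_lulc`, the use of the list position as the grouped index, and the fallback of unknown codes to 0 (No Data) are assumptions made for illustration.

```python
# Grouped class index -> original class codes (taken from the LULC_G comments).
GROUPS = {
    0: [1600],                                    # No Data
    1: [1100, 1160, 1180, 1190, 1211, 1212,
        1221, 1222, 1230, 1240],                  # Cultivated Land
    2: [1420, 2000],                              # Forest
    3: [1300, 1321, 1800],                        # Grassland
    4: [1410, 1500, 5000],                        # Shrubland
    5: [7000],                                    # Water
    6: [4100, 4210, 4220],                        # Wetlands
    8: [3000],                                    # Artificial Surface
    9: [6000],                                    # Bareland
}

# Invert the table so each original code points at its grouped index.
CODE_TO_GROUP = {code: group for group, codes in GROUPS.items() for code in codes}


def group_lulc(code):
    """Map an original class code to its grouped index (0 if the code is unknown)."""
    return CODE_TO_GROUP.get(code, 0)


print(group_lulc(1211), group_lulc(2000), group_lulc(3000))
```

Indices 7 (Tundra) and 10 (Snow and Ice) have no source codes, which matches the two grouped classes that are absent from the dataset.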

Preparation:
- cloud detection and masking
- interpolation of missing values
- edge detection and masking
- sampling, balancing
- feature calculation (base features, stream features)
- elevation, inclination
- grouping and rasterization of LULC
- feature selection

Dataset:
- balanced: 50000 samples/class
- 8 classes (2 not present)
- total: 4000000 samples
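
Balancing caps every class at the same number of samples. A minimal sketch of such per-class downsampling with the standard library follows. `balanced_sample` is a hypothetical helper; the actual sampling code used to build `dataset.csv` is not shown in this notebook.

```python
import random
from collections import defaultdict


def balanced_sample(labels, per_class, seed=42):
    """Return sorted indices with at most `per_class` samples for each label."""
    rng = random.Random(seed)

    # Collect the positions of every label.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    # Draw without replacement, never more than the class actually has.
    selected = []
    for indices in by_class.values():
        k = min(per_class, len(indices))
        selected.extend(rng.sample(indices, k))
    return sorted(selected)


labels = [1] * 10 + [2] * 4 + [3] * 7
picked = balanced_sample(labels, per_class=4)
counts = {label: sum(1 for i in picked if labels[i] == label) for label in set(labels)}
print(counts)
```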

In [None]:
# Original LULC classes
# 0		        No Data
# 1	    1100	Arable land
# 2	    1160	Hop field
# 3	    1180	Permanent crops on arable land
# 4	    1190	Greenhouse
# 5	    1211	Vineyard
# 6	    1212	Nursery
# 7	    1221	Intensive orchard
# 8	    1222	Extensive orchard
# 9	    1230	Olive grove
# 10	1240	Other permanent crop
# 11	1300	Permanent grassland
# 12	1321	Swampy meadow
# 13	1410	Overgrown agricultural area
# 14	1420	Forest plantation
# 15	1500	Trees and shrubs
# 16	1600	Uncultivated agricultural land
# 17	1800	Forest trees on agricultural land
# 18	2000	Forest
# 19	3000	Built-up area and related surface
# 20	4100	Swamp
# 21	4210	Reed
# 22	4220	Other marshy area
# 23	5000	Dried open area with special vegetation
# 24	6000	Open area with little or no vegetation
# 25	7000	Water

LULC_G = [
    ('No Data',            '#000000'), # 1600
    ('Cultivated Land',    '#ffa500'), # 1100, 1160, 1180, 1190, 1211, 1212, 1221, 1222, 1230, 1240
    ('Forest',             '#054907'), # 1420, 2000
    ('Grassland',          '#aaff32'), # 1300, 1321, 1800
    ('Shrubland',          '#806000'), # 1410, 1500, 5000
    ('Water',              '#069af3'), # 7000
    ('Wetlands',           '#95d0fc'), # 4100, 4210, 4220
    ('Tundra',             '#967bb6'), #
    ('Artificial Surface', '#dc143c'), # 3000
    ('Bareland',           '#a6a6a6'), # 6000
    ('Snow and Ice',       '#ffffff'), #
]

In [None]:
country = gpd.read_file(os.path.join('data', 'SVN', 'shape', 'country.shp'))
country_grid = gpd.read_file(os.path.join('data', 'SVN', 'shape', 'grid.shp'))

plot_roi(
    country,
    country_grid,
)

In [None]:
%%time

# Load the sampled dataset from CSV file:
dataset_path = os.path.join(os.getcwd(), 'data', 'SVN', '2017', 'samples')
dataset = pd.read_csv(os.path.join(dataset_path, 'dataset.csv'))
dataset

In [None]:
# Load the list of features selected by FASTENER (200 generations) from CSV file
selected_features = list(pd.read_csv(os.path.join(dataset_path, 'features.csv')).columns)

print(f'Selected features ({len(selected_features)} / {len(dataset.columns[4:])}):')
for feature in selected_features:
    print(f'- {feature}')

In [None]:
# Prepare input and target values for ML algorithms from selected features
X = dataset[selected_features].to_numpy()
y = dataset['LULC_G'].to_numpy()

labels_unique = np.unique(y)
num_classes = len(labels_unique)

# Normalize input values
# X = StandardScaler().fit_transform(X)

## Model evaluation

We evaluate models with 5-fold cross-validation.
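
For each fold, the evaluation records classification accuracy (CA) and the weighted F1 score. As a reference for what these numbers mean, the sketch below computes both with the standard library. The weighted F1 here is the support-weighted mean of the per-class F1 scores, which is what `f1_score(..., average='weighted')` computes. The function names are illustrative.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights proportional to each class's support."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for cls, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        n_pred = sum(1 for p in y_pred if p == cls)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += f1 * n_true / total
    return score


y_true = [1, 1, 2, 2]
y_pred = [1, 2, 2, 2]
print(accuracy(y_true, y_pred), round(weighted_f1(y_true, y_pred), 4))
```

On a balanced dataset the weighted and macro-averaged F1 scores coincide, because every class has the same support.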

In [None]:
# Configure ML methods
methods = [
    (LGBMClassifier, {
        'objective': 'multiclass',
        'metric': 'multi_logloss',
        'num_class': num_classes,
        'random_state': 42
    }),
    (RandomForestClassifier, {
        'n_estimators': 10,
        'max_depth': None,
        'min_samples_split': 2,
        'min_samples_leaf': 1,
        'random_state': 42
    }),
    (HoeffdingTree, {
        'max_byte_size': 33554432,
        'memory_estimate_period': 1000000,
        'grace_period': 200,
        'split_confidence': 0.0000001,
        'tie_threshold': 0.5,
        'binary_splits': False,
        'stop_mem_management': False,
        'remove_poor_atts': False,
        'leaf_learner': 'NBAdaptive',
        'nb_threshold': 0,
        'tree_property_index_list': '',
        'no_pre_prune': False
    })
]

# Initialize evaluation report table
report = pd.DataFrame(
    { k: np.zeros(len(methods)) for k in ['Training time', 'Inference time', 'CA', 'F1'] },
    index=[method[0].__name__ for method in methods]
)

y_shuffled = []
y_shuffled_pred = { method[0].__name__: [] for method in methods }

# Evaluate with 5-fold cross validation
k = 5
kf = KFold(n_splits=k, random_state=42, shuffle=True)
pbar = tqdm(total=k*len(methods))
for cv, (train_index, test_index) in enumerate(kf.split(X)):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    y_shuffled += list(y_test)

    for method in methods:
        # Initialize model
        method_name = method[0].__name__
        model = method[0](**method[1])

        # Train model on training set
        training_time = time.time()
        model.fit(X_train, y_train)
        training_time = time.time() - training_time
        report.loc[method_name, 'Training time'] += training_time / k

        # Predict classes on test set
        inference_time = time.time()
        y_pred = model.predict(X_test)
        inference_time = time.time() - inference_time
        report.loc[method_name, 'Inference time'] += inference_time / k

        y_shuffled_pred[method_name] += list(y_pred)

        # Calculate average classification accuracy
        accuracy = accuracy_score(y_test, y_pred)
        report.loc[method_name, 'CA'] += accuracy / k
        
        # Calculate average F1 score
        f1 = f1_score(y_test, y_pred, average='weighted')
        report.loc[method_name, 'F1'] += f1 / k

        # Update progress bar
        pbar.update()

# Show evaluation report
report.round(2)

In [None]:
# Plot confusion matrices
for method in methods:
    plot_confusion_matrix(
        np.array(y_shuffled) - 1,
        np.array(y_shuffled_pred[method[0].__name__]) - 1,
        [l[0] for l in LULC_G[1:]],
        title=f'Confusion matrix ({method[0].__name__})'
    )

## Model training

Next, we train the selected methods on the whole dataset and save the resulting models.

In [None]:
# Select methods
selected_methods = [
    LGBMClassifier,
    RandomForestClassifier,
    HoeffdingTree
]

models_path = os.path.join(os.getcwd(), 'models', 'LULC', '2017')
if not os.path.isdir(models_path):
    os.makedirs(models_path)

models = []
for method in tqdm([m for m in methods if m[0] in selected_methods]):
    # Initialize model
    model = method[0](**method[1])
    model_name = method[0].__name__

    # Train the model on whole dataset
    model.fit(X, y)
    models.append(model)

    # Save the model for later use
    if hasattr(model, 'export_json'):
        # ml-rapids models are exported to JSON
        model_path = os.path.join(models_path, f'{model_name}.json')
        model.export_json(model_path)
    else:
        # Other models are pickled
        model_path = os.path.join(models_path, f'{model_name}.pkl')
        joblib.dump(model, model_path)

## Model usage on selected region


### Define Region of Interest (ROI)

In [None]:
# Define ROI
# country = gpd.read_file(os.path.join('data', 'SVN', 'shape', 'country.shp'))
# country_grid = gpd.read_file(os.path.join('data', 'SVN', 'shape', 'grid.shp'))
eopatch_root_path = os.path.join(os.getcwd(), 'data', 'SVN', '2017', 'patches')
eopatch_ids = get_eopatch_ids(395, 466, country_grid)

In [None]:
# Plot ROI
plot_roi(
    country,
    country_grid,
    eopatch_ids=eopatch_ids,
    title='Ljubljana, Slovenia',
    size=20
)

In [None]:
# Plot RGB
plot_grid(
    eopatch_ids,
    eopatch_root_path,
    img_rgb,
    img_func_args={
        'date': '2017-07-01'
    },
    title='Ljubljana, Slovenia (RGB)'
)

### Calculate selected features

Here we calculate the selected features for the 16 EOPatches in our region of interest.
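
Selected feature names are assumed to follow a `<base>_<stream feature>` pattern, so they can be grouped by base feature before the stream features are computed. The sketch below shows that grouping on hypothetical names: `NDVI_max_val` and the others are made up for illustration, and the real names come from `features.csv`.

```python
def group_by_base_feature(selected):
    """Split each name at the first underscore and group the suffixes by prefix."""
    grouped = {}
    for name in selected:
        base, sep, stream = name.partition('_')
        if not sep:
            # Names without an underscore carry no stream feature and are skipped.
            continue
        grouped.setdefault(base, []).append(stream)
    return grouped


# Hypothetical feature names, for illustration only.
example = ['NDVI_max_val', 'NDVI_mean_val', 'B04_min_val', 'DEM']
grouped = group_by_base_feature(example)
print(grouped)
```

Grouping this way means each base feature needs only one task that computes all of its requested stream features.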

In [None]:
# Define tasks
tasks = []

# 1. Load Task
tasks.append(LoadTask(
    eopatch_root_path,
    lazy_loading=True
))

# Parse selected features
data_features = {}
feature_names = []
for feature in selected_features:
    tokens = feature.split('_')
    if len(tokens) > 1:
        if tokens[0] not in data_features:
            data_features[tokens[0]] = []
        data_features[tokens[0]].append('_'.join(tokens[1:]))
        feature_names.append(feature)

# 2. Add Stream Features Task
for base_feature, stream_features in data_features.items():
    if stream_features:
        tasks.append(
            AddStreamTemporalFeaturesTask(
                data_feature=base_feature,
                features=stream_features
            )
        )

# 3. Save Task
tasks.append(SaveTask(
    eopatch_root_path,
    overwrite_permission=OverwritePermission.OVERWRITE_FEATURES,
    features=[(FeatureType.DATA_TIMELESS, feature) for feature in feature_names]
))

workflow = LinearWorkflow(*tasks)
workflow.dependency_graph()

In [None]:
# Define execution arguments
execution_args = []
task_dict = workflow.get_tasks()

for eopatch_id in eopatch_ids.ravel():
    eopatch_folder = f'eopatch_{eopatch_id}'
    execution_args.append({
        task_dict['LoadTask']: {
            'eopatch_folder': eopatch_folder
        },
        task_dict['SaveTask']: {
            'eopatch_folder': eopatch_folder
        }
    })

# Execute workflow
executor = EOExecutor(
    workflow,
    execution_args,
    save_logs=True,
    logs_folder='logs'
)
executor.run()
executor.make_report()

In [None]:
# Plot selected features (for one patch)
plot_features(os.path.join(eopatch_root_path, 'eopatch_395'), selected_features, max_cols=3)

### Predict LULC


In [None]:
# Define tasks
tasks = []

# 1. Load Task
tasks.append(LoadTask(
    eopatch_root_path,
    lazy_loading=True
))

# Load models
models = []
for method in selected_methods:
    model_name = method.__name__
    model_path = os.path.join(models_path, f'{model_name}.pkl')

    if os.path.isfile(model_path):
        models.append(joblib.load(model_path))
    elif hasattr(method, 'import_json'):
        model_path = os.path.join(models_path, f'{model_name}.json')
        model = method()
        model.import_json(model_path)
        models.append(model)

# 2. Predict Patch Task
feature_names = []
for model in models:
    feature_name = 'LULC_G_PRED_' + abbreviate(type(model).__name__)
    feature_names.append(feature_name)
    tasks.append(PredictPatch(
        model,
        selected_features,
        feature_name
    ))

# 3. Save Task
tasks.append(SaveTask(
    eopatch_root_path,
    overwrite_permission=OverwritePermission.OVERWRITE_FEATURES,
    features=[(FeatureType.MASK_TIMELESS, feature) for feature in feature_names]
))

workflow = LinearWorkflow(*tasks)
workflow.dependency_graph()

In [None]:
# Define execution arguments
execution_args = []
task_dict = workflow.get_tasks()

for eopatch_id in eopatch_ids.ravel():
    eopatch_folder = f'eopatch_{eopatch_id}'
    execution_args.append({
        task_dict['LoadTask']: {
            'eopatch_folder': eopatch_folder
        },
        task_dict['SaveTask']: {
            'eopatch_folder': eopatch_folder
        }
    })

# Execute workflow
executor = EOExecutor(
    workflow,
    execution_args,
    save_logs=True,
    logs_folder='logs'
)
executor.run()
executor.make_report()

In [None]:
# EOPatch example
EOPatch.load(os.path.join(eopatch_root_path, 'eopatch_395'))

## Visualize predictions

In [None]:
# Visualize predictions
labels = [c[0] for c in LULC_G]
classes = list(range(len(labels)))
bounds = np.append(np.array(classes) - 0.5, classes[-1] + 0.5)

for mask in ['LULC_G'] + ['LULC_G_PRED_' + abbreviate(type(model).__name__) for model in models]:
    plot_grid(
        eopatch_ids,
        eopatch_root_path,
        img_feature,
        img_func_args={
            'feature': (FeatureType.MASK_TIMELESS, mask)
        },
        imshow_args={
            'cmap': ListedColormap(
                [c[1] for c in LULC_G],
                name='lulc_map'
            ),
            'norm': BoundaryNorm(bounds, len(classes))
        },
        colorbar={
            'ticks': classes,
            'labels': labels
        },
        title=mask
    )