# Train Pixel Level Annotation Model

### This notebook uses pixel-level annotations to train a Random Forest classifier to predict labels

We assume pixel-level annotations are available, as produced by the "../data/annotations/transform_polygon_annotations_to_pixels.ipynb" notebook. Feature selection and a grid search for optimal parameters have been done in a separate notebook ("Coepelduynen/make_train_model_on_annotations_coepelduynen.ipynb"); their outcomes are taken as given in this notebook.

Change the "Set Variables" cell below as desired, then run the entire notebook to get cross-validation results as well as a final model trained on all data.

Date: 2024-01-12\
Author: Pieter Kouyzer

In [None]:
%load_ext autoreload
%autoreload 2

In [1]:
%matplotlib notebook
import pandas as pd
import pickle
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
import pprint

from training.train import train_imbalanced_model, cross_validation_balance_on_date
from training.utils import get_cross_validation_results_filepath, get_model_filepath
from training.metric_calculation import calculate_average_metrics, get_metrics



In [2]:
# Set Variables
location = "Coepelduynen"
satellite_constellation = "PNEO"
annotated_pixels_filepath = "C:/repos/satellite-images-nso-datascience/data/annotations/annotations_pixel_dataframes/annotaties_coepelduynen_to_pixel_2023_scaled.parquet"

In [3]:
# Optimal parameters and features
selected_features = ['r','g','b','n','e','d','ndvi','re_ndvi']
optimal_parameters = {
    "n_estimators": 10, 
    "min_samples_split": 5, 
    "min_samples_leaf": 1,
    "bootstrap": False
}

In [4]:
df = pd.read_parquet(annotated_pixels_filepath)
df

index,r,g,b,n,e,d,ndvi,re_ndvi,label,image,date,season,annotation_no
0,4.395348,3.620568,3.052421,2.838405,3.229714,2.501019,1.413646,1.397541,Sand,20230513_104139_PNEO-03_1_1_30cm_RD_12bit_RGBN...,20230513_104139,Spring,31_20230513_annotations
1,4.646989,3.761395,3.146375,2.937408,3.372237,2.533248,1.398225,1.380507,Sand,20230513_104139_PNEO-03_1_1_30cm_RD_12bit_RGBN...,20230513_104139,Spring,31_20230513_annotations
2,4.512391,3.685565,3.094178,2.730402,3.206513,2.505048,1.382805,1.380507,Sand,20230513_104139_PNEO-03_1_1_30cm_RD_12bit_RGBN...,20230513_104139,Spring,31_20230513_annotations
3,4.184672,3.468909,2.937589,2.350143,2.881694,2.412390,1.336543,1.346439,Sand,20230513_104139_PNEO-03_1_1_30cm_RD_12bit_RGBN...,20230513_104139,Spring,31_20230513_annotations
4,4.091038,3.425577,2.906271,2.293891,2.818719,2.400304,1.336543,1.363473,Sand,20230513_104139_PNEO-03_1_1_30cm_RD_12bit_RGBN...,20230513_104139,Spring,31_20230513_annotations
...,...,...,...,...,...,...,...,...,...,...,...,...,...
156367,2.077479,1.989473,1.895106,1.785769,1.947980,1.801029,1.639892,1.665139,Vegetation,20230910_105008_PNEO-03_1_1_30cm_RD_12bit_RGBN...,20230910_105008,Fall,30_Annotations_Coepelduynen_2023
156368,2.055305,1.974723,1.890705,1.736869,1.909929,1.801029,1.622816,1.665139,Vegetation,20230910_105008_PNEO-03_1_1_30cm_RD_12bit_RGBN...,20230910_105008,Fall,30_Annotations_Coepelduynen_2023
156369,1.794754,1.792813,1.780675,1.345673,1.582694,1.739418,1.571588,1.626706,Vegetation,20230910_105008_PNEO-03_1_1_30cm_RD_12bit_RGBN...,20230910_105008,Fall,30_Annotations_Coepelduynen_2023
156370,1.850190,1.832145,1.802681,1.443472,1.658796,1.753110,1.588664,1.645923,Vegetation,20230910_105008_PNEO-03_1_1_30cm_RD_12bit_RGBN...,20230910_105008,Fall,30_Annotations_Coepelduynen_2023


In [5]:
# Number of data points (pixels) per image, identified by acquisition date
df['date'].value_counts()

date
20230513_104139    392936
20230402_105321    156372
20230601_105710    156372
20230908_110020    156372
20230910_105008    156372
Name: count, dtype: int64

In [6]:
# Number of data points (pixels) per label
df['label'].value_counts()

label
Vegetation    898609
Sand          119415
Asphalt          400
Name: count, dtype: int64

In [7]:
# Exclude the Asphalt class, which has only 400 annotated pixels
df = df[df['label'] != "Asphalt"]

### Cross Validation

We run cross-validation with folds determined by the 'date' column, so that pixels from the same image never end up in both the train and test datasets. We display the metrics averaged over the folds and write the results to a pickle.
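The date-based split can be illustrated with a minimal, standard-library sketch. It is not the implementation of `cross_validation_balance_on_date`; the helper name `leave_one_date_out` and the toy dates are hypothetical, and the folds are visited in sorted order rather than in the randomly picked order shown in the log output.

```python
from collections import defaultdict


def leave_one_date_out(dates):
    """Yield (held_out_date, train_indices, test_indices), one fold per unique date."""
    by_date = defaultdict(list)
    for index, date in enumerate(dates):
        by_date[date].append(index)
    for held_out in sorted(by_date):
        test_idx = by_date[held_out]
        # Every pixel from any other date goes into the training set
        train_idx = [i for d, idxs in by_date.items() if d != held_out for i in idxs]
        yield held_out, train_idx, test_idx


# Toy data: six pixels from three acquisition dates (hypothetical values)
dates = ["20230402", "20230402", "20230513", "20230601", "20230601", "20230601"]
folds = list(leave_one_date_out(dates))
for held_out, train_idx, test_idx in folds:
    # No date may appear on both sides of the split
    assert {dates[i] for i in train_idx}.isdisjoint({dates[i] for i in test_idx})
    print(held_out, train_idx, test_idx)
```

With five dates and `cv=5`, as in this notebook, each fold holds out exactly one date.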

In [8]:
model = RandomForestClassifier(**optimal_parameters)
scaler = StandardScaler()

In [9]:
results = cross_validation_balance_on_date(
    data=df,
    model=model,
    cv=5,
    features=selected_features,
    random_state=1337,
    sampling_type_boundary=100000,
)

---------fold: 1
Picked hold out dates: 
['20230402_105321']
Undersampling to rebalance dataset
Fitting model
Calculating train metrics
Calculating test metrics
{'Sand': {'precision': 0.9669127004326801, 'recall': 0.98504753673293, 'f1-score': 0.9758958770390034, 'support': 11570}, 'Vegetation': {'precision': 0.9988028095913636, 'recall': 0.9973051782037285, 'f1-score': 0.998053432079301, 'support': 144722}}
---------fold: 2
Picked hold out dates: 
['20230513_104139']
Oversampling to rebalance dataset


2024/02/01 17:06:33 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID '3db00f19cbb84b269e70012cf40a0bca', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current sklearn workflow


Fitting model
Calculating train metrics
Calculating test metrics
{'Sand': {'precision': 0.9995055624227441, 'recall': 0.884501264784303, 'f1-score': 0.9384933444561315, 'support': 73135}, 'Vegetation': {'precision': 0.9742576248872419, 'recall': 0.9998999127364171, 'f1-score': 0.986912235261794, 'support': 319721}}
---------fold: 3
Picked hold out dates: 
['20230908_110020']
Undersampling to rebalance dataset
Fitting model
Calculating train metrics
Calculating test metrics
{'Sand': {'precision': 0.9420289855072463, 'recall': 1.0, 'f1-score': 0.9701492537313433, 'support': 11570}, 'Vegetation': {'precision': 1.0, 'recall': 0.9950802227719352, 'f1-score': 0.9975340454123547, 'support': 144722}}
---------fold: 4
Picked hold out dates: 
['20230601_105710']
Undersampling to rebalance dataset
Fitting model
Calculating train metrics
Calculating test metrics
{'Sand': {'precision': 0.89, 'recall': 1.0, 'f1-score': 0.9417989417989417, 'support': 11570}, 'Vegetation': {'precision': 1.0, 'recall': 0.9901189867470046, 'f1-score': 0.9950349635781593, 'support': 144722}}
---------fold: 5
Picked hold out dates: 
['20230910_105008']
Undersampling to rebalance dataset
Fitting model
Calculating train metrics
Calculating test metrics
{'Sand': {'precision': 0.9616823206715984, 'recall': 1.0, 'f1-score': 0.9804669293673998, 'support': 11570}, 'Vegetation': {'precision': 1.0, 'recall': 0.9968145824408176, 'f1-score': 0.9984047504524488, 'support': 144722}}
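
The log switches between undersampling and oversampling. This is consistent with `sampling_type_boundary` being compared against the size of the smallest class in the training fold. Holding out 20230513_104139 leaves 119415 - 73135 = 46280 Sand pixels for training (below 100000, oversampling), while every other fold leaves 119415 - 11570 = 107845 (above 100000, undersampling). The sketch below implements that assumed rule on toy data. The function `rebalance`, the `>=` comparison, and the resampling details are assumptions, not the code in `training.train`.

```python
import random
from collections import Counter


def rebalance(labels, sampling_type_boundary, seed=0):
    """Return (mode, indices) for a class-balanced resample of `labels`.

    Assumed rule (inferred from the log, not from the library source):
    undersample when the smallest class has at least
    `sampling_type_boundary` samples, otherwise oversample.
    """
    rng = random.Random(seed)
    by_label = {}
    for index, label in enumerate(labels):
        by_label.setdefault(label, []).append(index)
    sizes = [len(indices) for indices in by_label.values()]

    picked = []
    if min(sizes) >= sampling_type_boundary:
        mode = "undersampling"
        # Shrink every class to the size of the smallest one
        for indices in by_label.values():
            picked.extend(rng.sample(indices, min(sizes)))
    else:
        mode = "oversampling"
        # Grow every class to the size of the largest one by drawing
        # extra samples with replacement
        for indices in by_label.values():
            picked.extend(indices)
            picked.extend(rng.choices(indices, k=max(sizes) - len(indices)))
    return mode, picked


# Toy labels: 3 Sand pixels and 9 Vegetation pixels
labels = ["Sand"] * 3 + ["Vegetation"] * 9
for boundary in (2, 5):
    mode, picked = rebalance(labels, sampling_type_boundary=boundary)
    print(boundary, mode, Counter(labels[i] for i in picked))
```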


In [10]:
calculate_average_metrics(results=results)

label,precision,recall,f1-score
Sand,0.976504,0.927815,0.949571
Vegetation,0.990648,0.996634,0.993576
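
The averages above are not plain means over the five folds: the Sand precision values average to about 0.952 unweighted. They are, however, consistent with a mean weighted by each fold's `support`. The sketch below recomputes the Sand row from the per-fold output of the cross-validation cell. It is a check under that assumption, not the implementation of `calculate_average_metrics`, and the helper `support_weighted_mean` is hypothetical.

```python
# Per-fold Sand metrics copied from the cross-validation output above
sand_folds = [
    {"precision": 0.9669127004326801, "recall": 0.98504753673293, "support": 11570},
    {"precision": 0.9995055624227441, "recall": 0.884501264784303, "support": 73135},
    {"precision": 0.9420289855072463, "recall": 1.0, "support": 11570},
    {"precision": 0.89, "recall": 1.0, "support": 11570},
    {"precision": 0.9616823206715984, "recall": 1.0, "support": 11570},
]


def support_weighted_mean(folds, metric):
    """Average `metric` over the folds, weighting each fold by its support."""
    total_support = sum(fold["support"] for fold in folds)
    return sum(fold[metric] * fold["support"] for fold in folds) / total_support


sand_precision = support_weighted_mean(sand_folds, "precision")
sand_recall = support_weighted_mean(sand_folds, "recall")
print(round(sand_precision, 6), round(sand_recall, 6))
```

Because of this weighting, the fold that holds out 20230513_104139 (support 73135) dominates the Sand averages.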


In [11]:
cross_validation_results_filepath = get_cross_validation_results_filepath(location=location, satellite_constellation=satellite_constellation, df=df)
print(f"Saving to {cross_validation_results_filepath}")
with open(cross_validation_results_filepath, "wb") as file:
    pickle.dump(results, file)

Saving to ../saved_models/PNEO_Coepelduynen_20230402_105321_to_20230910_105008_cross_validation_results.pkl


### Export Definitive Model

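We train a Random Forest classifier on all data and write it, together with its scaler, to a pickle file for later use. This is the definitive model output by this notebook. The metrics printed below are computed on the same data the model was trained on, so they are optimistic; the cross-validation results above are the better estimate of performance on unseen images.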

In [12]:
final_model = RandomForestClassifier(**optimal_parameters)
final_scaler = StandardScaler()

train_imbalanced_model(
    X_train=df[selected_features], 
    y_train=df["label"], 
    model=final_model, 
    random_state=42, 
    sampling_type_boundary=100000,
    scaler=final_scaler
)
pprint.pprint(get_metrics(y=df["label"], X=df[selected_features], model=final_model, scaler=final_scaler))

Undersampling to rebalance dataset
Fitting model

{'Sand': {'f1-score': 0.9938371526896397,
          'precision': 0.9877498014822658,
          'recall': 1.0,
          'support': 119415},
 'Vegetation': {'f1-score': 0.9991752689842666,
                'precision': 1.0,
                'recall': 0.9983518972100213,
                'support': 898609}}


In [13]:
final_artefact = {
    "model": final_model,
    "scaler": final_scaler
}

In [14]:
final_model_filepath = get_model_filepath(location=location, satellite_constellation=satellite_constellation, df=df)
print(f"Saving to {final_model_filepath}")
with open(final_model_filepath, "wb") as file:
    pickle.dump(final_artefact, file)

Saving to ../saved_models/PNEO_Coepelduynen_20230402_105321_to_20230910_105008_random_forest_classifier.sav
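
A minimal sketch of how the saved artefact might be used later. The helper names `load_artefact` and `predict_labels` are hypothetical. The sketch assumes that the model was fit on features transformed by the stored scaler, so new pixels are scaled before prediction; check `training.train` to confirm this.

```python
import pickle


def load_artefact(path):
    """Load the {"model": ..., "scaler": ...} dict written by the cell above."""
    # Only unpickle files from a trusted source: pickle can run arbitrary code
    with open(path, "rb") as file:
        return pickle.load(file)


def predict_labels(artefact, X):
    """Scale the features with the stored scaler, then predict with the stored model.

    Assumption: the model was trained on scaler-transformed features.
    """
    X_scaled = artefact["scaler"].transform(X)
    return artefact["model"].predict(X_scaled)
```

Usage would look like `predict_labels(load_artefact(final_model_filepath), new_pixels[selected_features])`, where `new_pixels` is a hypothetical DataFrame holding the same feature columns as the training data.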
