# BIOSPACE25 Workshop: 

## Harnessing analysis tools for biodiversity applications using field, airborne, and orbital remote sensing data from NASA's BioSCAPE campaign

Michele Thornton, Rupesh Shrestha, Erin Hestir, Adam Wilson, Jasper Slingsby, Anabelle Cardoso

**Date:**  February 12, 2025, Frascati (Rome), Italy

![BIOSPACE25](images/BioSpace25_clip_50.jpg)


# Tutorial:  Mapping invasive species using supervised machine learning and AVIRIS-NG 

## Overview 

In this notebook, we will use existing, verified land cover and alien species locations to extract spectra from AVIRIS-NG surface reflectance data.

## Learning Objectives
1. Understand how to inspect and prepare data for machine learning models
2. Train and interpret a machine learning model
3. Apply a trained model to AVIRIS imagery to create alien species maps

### Load Python Modules

In [None]:
#!pip install --user xvec
#!pip install --user shap
#!pip install --user xgboost

In [None]:
from os import path
import geopandas as gpd
import s3fs
import pandas as pd
import xarray as xr
from shapely.geometry import box, mapping
import rioxarray as riox
import numpy as np
import netCDF4 as nc
import hvplot.xarray
import holoviews as hv
import xvec
import matplotlib.pyplot as plt
from dask.diagnostics import ProgressBar
import warnings
#our functions
from utils import get_first_xr

warnings.filterwarnings('ignore')
hvplot.extension('bokeh')

### Explore Sample Land Type Plot-Level Data
A small dataset of manually collected invasive plant and land cover labels over the Cape Town Peninsula of South Africa
- `ct_invasive.gpkg`

In [None]:
# let's create a DataFrame and assign labels to each class

label_df = pd.DataFrame({'LandType': ['Bare ground/Rock','Mature Fynbos', 
              'Recently Burnt Fynbos', 'Wetland', 
              'Forest', 'Pine', 'Eucalyptus' , 'Wattle', 'Water'],
               'class': ['0','1','2','3','4','5','6','7','8']})

label_df

In [None]:
# open the dataset and project to the South African UTM projection also used by the AVIRIS-NG airborne data 
class_data = gpd.read_file('data/ct_invasive.gpkg')
# class_data.crs
class_data_utm = (class_data
                 .to_crs("EPSG:32734")
                 .merge(label_df, on='class', how='left')
                 )
class_data_utm

### Summarize and Visualize the Land Types

In [None]:
class_data_utm.groupby(['LandType']).size()

In [None]:
class_data_utm.groupby(['group']).size()

In [None]:
# Let's visualize the plot data in an interactive map, with color by class, using a Google satellite basemap
map = class_data_utm[['LandType', 'geometry']].explore('LandType', tiles='https://mt1.google.com/vt/lyrs=s&x={x}&y={y}&z={z}', attr='Google')
map

### AVIRIS-NG Data over Cape Town Peninsula

There is a coverage file that has the bounding box of each AVIRIS-NG flight scene made available by the BioSCape Science Team. 
- `ANGv2_Coverage.geojson`

Flight lines are provided as smaller sections within each flight line.  We'll refer to these smaller sections as scenes.  The data for each scene within a flight line is seamless with the adjacent scenes. 

In [None]:
# read and plot the AVNG coverage file
AVNG_Coverage = gpd.read_file('data/ANGv2_Coverage.geojson', driver='GeoJSON')
AVNG_Coverage.keys()

- Note that the 'RFL s3' key was pre-populated in the geojson file.
- We'll see this S3 file list in an upcoming cell.

In [None]:
AVNG_Coverage.crs

In [None]:
AVNG_Coverage

In [None]:
# Let's visualize the AVIRIS-NG scene coverage in an interactive map, using a Google satellite basemap
map = AVNG_Coverage[['fid', 'geometry']].explore(tiles='https://mt1.google.com/vt/lyrs=s&x={x}&y={y}&z={z}', attr='Google')
#map = AVNG_Coverage[['fid', 'geometry']].explore('fid')
map

- AVIRIS-NG principal investigators are finalizing the formats and standards of the AVIRIS-NG airborne radiance and reflectance files. When finalized, the data will be published to NASA Earthdata.  

- For now, JPL provides preliminary AVIRIS-NG data [**here**](https://popo.jpl.nasa.gov/pub/bioscape_netCDF/).  Once finalized, AVIRIS-NG data from the BioSCape Campaign will be available from NASA Earthdata Cloud Storage.

In [None]:
# Workshop participants will download this file from JPL
# If you need to download this file, uncomment the wget line and run this code block.
# !wget https://popo.jpl.nasa.gov/pub/bioscape_netCDF/rfl/ang20231109t133124_005_L2A_OE_0b4f48b4_RFL_ORT.nc -P /home/jovyan/2025-biospace/tutorials/avirisng/data/ang

### Filter the AVIRIS-NG flight line data and create lists to use later
For our analysis demonstration in this Notebook, we'll narrow the flight lines to the area of the Cape Peninsula and to flights that took place on 2023-11-09.
- The GeoPandas **`GeoDataFrame.to_crs`** method transforms geometries to a new coordinate reference system.
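The temporal filter used in the next cell can be sketched on a toy table. This is a minimal illustration, assuming pandas is available and that `end_time` holds ISO-formatted timestamps; the scene names and times below are invented and are not taken from the coverage file. Parsing the values to datetimes and comparing calendar days avoids relying on string comparison.

```python
import pandas as pd

# Hypothetical scenes; names and times are invented for illustration only
toy = pd.DataFrame({
    "fid": ["scene_a", "scene_b", "scene_c"],
    "end_time": ["2023-11-08 14:02:11", "2023-11-09 13:31:24", "2023-11-10 09:15:00"],
})

# Parse the strings to timestamps, then keep rows that fall on the target day
end_time = pd.to_datetime(toy["end_time"])
target_day = pd.Timestamp("2023-11-09")
same_day = toy[end_time.dt.normalize() == target_day]

print(same_day["fid"].tolist())
```

Only the scene whose `end_time` falls on 2023-11-09 survives the filter.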

In [None]:
# temporal filter:  filter dates to between midnight on 2023-11-09 and 23:59:59 on 2023-11-09
AVNG_CP = AVNG_Coverage[(AVNG_Coverage['end_time'] >= '2023-11-09 00:00:00') & (AVNG_Coverage['end_time'] <= '2023-11-09 23:59:59')]
AVNG_CP = AVNG_CP.to_crs("EPSG:32734")

#keep only AVNG_CP that intersects with class_data
AVNG_CP = AVNG_CP[AVNG_CP.intersects(class_data_utm.unary_union)]
#AVNG_CP

files_s3 = AVNG_CP['RFL s3'].tolist()
files_AVNG_geo = AVNG_CP['geometry'].tolist()
files_AVNG_geo

#Visualize the selected flight lines
#m = AVNG_CP[['fid','geometry']].explore('fid')
m = AVNG_CP[['fid', 'geometry']].explore('fid', tiles='https://mt1.google.com/vt/lyrs=s&x={x}&y={y}&z={z}', attr='Google')
#explore('LandType', tiles='https://mt1.google.com/vt/lyrs=s&x={x}&y={y}&z={z}', attr='Google')
m


In [None]:
AVNG_CP.to_file('AVNG_CP.geojson', driver='GeoJSON')

In [None]:
AVNG_CP.crs

In [None]:
print(AVNG_CP['fid'])

In [None]:
files_s3[26]

#### The AVIRIS-NG files are also in S3 buckets in a BioSCape Science Managed Cloud Environment (SMCE).  
- SMCEs support NASA-funded researchers by providing a secure hub to store and analyze data.  These SMCEs are in AWS US-West. Workshop instructors are able to access these files. 


#### S3 access is commented out for workshop participants

In [None]:
# Using BioSCape AWS credentials to access the BioSCape SMCE
# import s3fs
# secret_key=
# access_key=
# token =
# fs = s3fs.S3FileSystem(anon=False, 
#     secret=secret_key,
#     key=access_key,
#     token=token)


### Explore the BioSCape S3 Data Holdings
- **S3** = Amazon Simple Storage Service (S3) is a cloud storage service that allows users to store and retrieve data
- **S3 Bucket** = Buckets are the basic containers that hold data. A bucket can be likened to a top-level file folder, although S3 is object storage rather than a true file system
- **S3Fs** is a `Pythonic` open source file interface to S3.  S3Fs provides a filesystem-like interface for accessing objects on S3.
>import s3fs
>
>fs = s3fs.S3FileSystem(anon=False)

- The top-level class **`S3FileSystem`** holds connection information and allows typical file-system style operations like `ls`, `cp`, `mv`
   - `ls` is a UNIX command to list computer files and directories

In [None]:
#fs.ls('bioscape-data/')

In [None]:
#fs.ls('bioscape-data/AVNG_V2/')

In [None]:
#fs.ls('bioscape-data/AVNG_V2/ang20231109t133124/ang20231109t133124_005')

#### Single AVIRIS-NG flight scene Reflectance file **`ang20231109t133124_005_L2A_OE_0b4f48b4_RFL_ORT.nc`**

### Open a single AVIRIS-NG Reflectance file to inspect the data

- **`S3Fs`** can be used to open objects in S3 storage as if they were local files
- **`xarray`** is an open source project and Python package that introduces labels in the form of dimensions, coordinates, and attributes on top of raw NumPy-like arrays

In [None]:
## Sample code to open a file from an S3 bucket using S3Fs

#rfl_netcdf = xr.open_datatree(fs.open(files_s3[26], 'rb'),
#                              engine='h5netcdf', chunks={})



In [None]:
# For this workshop, we're using a local AVIRIS-NG scene 
rfl_netcdf_2i2c = 'data/ang/ang20231109t134249_006_L2A_OE_0b4f48b4_RFL_ORT.nc'
rfl_netcdf_2i2c

In [None]:
#rfl_netcdf = xr.open_datatree(fs.open(files_s3[26], 'rb'),
#                              engine='h5netcdf', chunks={})

rfl_netcdf = xr.open_datatree(rfl_netcdf_2i2c, engine='h5netcdf', chunks={})
rfl_netcdf = rfl_netcdf.reflectance.to_dataset()
rfl_netcdf = rfl_netcdf.reflectance.where(rfl_netcdf.reflectance>0)
rfl_netcdf

### Plot a true color image

In [None]:
h = rfl_netcdf.sel(wavelength=[660, 570, 480], method="nearest").hvplot.rgb('easting', 'northing',
                                                                            rasterize=True, data_aspect=1,
                                                                            bands='wavelength', frame_width=400)
h

### Plot just a red reflectance

In [None]:
h = rfl_netcdf.sel({'wavelength': 660},method='nearest').hvplot('easting', 'northing',
                                                      rasterize=True, data_aspect=1,
                                                      cmap='magma',frame_width=400,clim=(0,0.3))
h

### Extract Spectra for each Land Plot

#### Now that we are familiar with the data, we want to get the AVIRIS spectra at each label location. Below is a function that does this and returns the result as an xarray object.

Recall some objects we created earlier:
- `files_s3` = list; S3 paths of the netCDF files from the Cape Peninsula subset area
- `files_AVNG_geo` = list; bounding-box geometries of the flight line scenes from the Cape Peninsula area 
- `class_data_utm` = GeoDataFrame; Cape Peninsula land types in UTM coordinates

In [None]:
#the function takes a filepath to a file on s3, and the point locations for extraction
#this function requires hitting files on the BioSCape SMCE

# def extract_points(s3uri, geof, points):
#     ds = xr.open_datatree(fs.open(s3uri, 'rb'), decode_coords='all',
#                           engine='h5netcdf', chunks='auto')
        
#     # Clip the raw data to the bounding box  
#     points = points.clip(geof)
#     print(f'got {points.shape[0]} point from {s3uri}')
#     points = points.to_crs(ds.transverse_mercator.crs_wkt)
    
        
#     # Extract points
#     #extracted = ds.to_dataset().xvec.extract_points(points['geometry'], x_coords="easting", y_coords="northing",index=True)
#     extracted = ds.reflectance.to_dataset().xvec.extract_points(points['geometry'], 
#                                                                 x_coords="easting", 
#                                                                 y_coords="northing",
#                                                                 index=True)
#     return extracted

When we call the function, we'll iterate through the list of files (files_s3).  Each file will overlap with several land class points.

In [None]:
# ds_all = [extract_points(file, geo, class_data_utm) for file, geo in zip(files_s3, files_AVNG_geo)]
# ds_all = xr.concat(ds_all, dim='file')


In [None]:
#ds_all

Because some points are covered by multiple AVIRIS scenes, they have multiple spectra per location, and thus the result has an extra dimension (`file`). We will simply extract the first valid reflectance measurement for each geometry. We have a custom function to do this, `get_first_xr()`.
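The idea of "first valid measurement" can be illustrated with plain NumPy. The helper below, `first_valid`, is a hypothetical stand-in written for this illustration; it is not the implementation of `get_first_xr()` in `utils`, which is not shown here. It assumes missing data are marked as NaN and that the array is laid out as (file, point, wavelength).

```python
import numpy as np

def first_valid(spectra):
    """Return, for each point, the first spectrum that contains no NaN values.

    spectra: float array shaped (file, point, wavelength); NaN marks missing data.
    Points with no valid spectrum in any file are returned as all-NaN.
    """
    valid = ~np.isnan(spectra).any(axis=2)                 # (file, point)
    first = valid.argmax(axis=0)                           # first True per point
    picked = spectra[first, np.arange(spectra.shape[1])]   # (point, wavelength)
    picked[~valid.any(axis=0)] = np.nan                    # no valid file at all
    return picked

nan = np.nan
spectra = np.array([
    [[nan, nan], [0.1, 0.2], [nan, nan]],   # file 0
    [[0.3, 0.4], [0.5, 0.6], [nan, nan]],   # file 1
])
result = first_valid(spectra)
print(result)
```

Point 0 takes its spectrum from file 1, point 1 from file 0, and point 2 stays NaN because no scene covers it.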

In [None]:
# ds = get_first_xr(ds_all)
# ds

This dataset contains only the spectra. We need to merge it with the point data to add the labels.

In [None]:
# class_xr =class_data_utm[['class','group']].to_xarray()
# ds = ds.merge(class_xr.astype(int),join='left')
# ds

We have defined all the operations we want, but because of xarray's lazy computation, the calculations have not yet been done. We will now force it to perform these calculations. We want to keep the result in chunks, so we use `.persist()` and not `.compute()`. This should take approximately 2-3 minutes.

In [None]:
##  DUE TO RUN TIME LENGTH, WE WILL NOT RUN THIS IN THE WORKSHOP - THE OUTPUT HAS BEEN SAVED FOR THE NEXT STEP
# with ProgressBar():
#     dsp = ds.persist()

In [None]:
dsp = xr.open_dataset('dsp.nc')
dsp

### Inspect AVIRIS spectra

In [None]:
# recall the class types
label_df

In [None]:
dsp_plot = dsp.where(dsp['class']==5, drop=True)
h = dsp_plot['reflectance'].hvplot.line(x='wavelength',by='index',
                                    color='green', alpha=0.5,legend=False)
h

> At this point in a real machine learning workflow, you should closely inspect the spectra you have for each class. Do they make sense? Are there some spectra that look weird? You should re-evaluate your data to make sure that the assigned labels are true. This is a very important step.

#### Prep data for ML model

As you may know, not all of the wavelengths in the data are of equal quality; some are degraded by atmospheric water absorption features or other factors. We should remove the bands that we are not confident in from the analysis. Probably the best way to do this is to use the uncertainties provided along with the reflectance files. Here, we will simply use some prior knowledge to screen out the worst bands.
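The band screening can be checked on a synthetic wavelength axis before applying it to real data. This is a small sketch, assuming a made-up axis from 380 to 2510 nm at 5 nm spacing (the real band centers come from the file). It uses the same four ranges as the notebook and adds explicit parentheses, since `&` binds more tightly than `|` in Python.

```python
import numpy as np

# Synthetic wavelength axis in nm (assumed 5 nm spacing, for illustration only)
wavelength = np.arange(380, 2515, 5)

# True for bands to discard; the parentheses make the precedence explicit
bad = (
    (wavelength < 450)
    | ((wavelength >= 1340) & (wavelength <= 1480))
    | ((wavelength >= 1800) & (wavelength <= 1980))
    | (wavelength > 2400)
)
kept = wavelength[~bad]
print(f"kept {kept.size} of {wavelength.size} bands")
```

Counting the bands that survive is a quick sanity check that the mask removes the water absorption windows and the noisy ends of the spectrum, and nothing else.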

In [None]:
wavelengths_to_drop = dsp.wavelength.where(
    (dsp.wavelength < 450) |
    (dsp.wavelength >= 1340) & (dsp.wavelength <= 1480) |
    (dsp.wavelength >= 1800) & (dsp.wavelength <= 1980) |
    (dsp.wavelength > 2400), drop=True
)

# Use drop_sel() to remove those specific wavelength ranges
dsp = dsp.drop_sel(wavelength=wavelengths_to_drop)

mask = (dsp['reflectance'] > -1).all(dim='wavelength')  # Keep only spectra where every band has a reflectance greater than -1
dsp = dsp.sel(index=mask)
dsp

Next we will normalize the data. There are a number of different normalizations to try, and in an ML workflow you should try a few and see which works best. We will only use a brightness normalization. In essence, we scale the reflectance at each wavelength by the total brightness of the spectrum (here, its L2 norm). This retains information on important shape features and relative reflectance, and removes information on absolute reflectance.
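To see what brightness normalization removes and what it keeps, here is a minimal NumPy sketch with two made-up spectra that have the same shape but different overall brightness. The function name `brightness_normalize` is introduced only for this illustration.

```python
import numpy as np

def brightness_normalize(reflectance):
    """Divide each spectrum (last axis = wavelength) by its L2 norm."""
    norm = np.sqrt((reflectance ** 2).sum(axis=-1, keepdims=True))
    return reflectance / norm

# Two invented spectra with the same shape; the second is half as bright
bright = np.array([0.04, 0.08, 0.40, 0.30])
dim = 0.5 * bright
normalized = brightness_normalize(np.stack([bright, dim]))

print(np.allclose(normalized[0], normalized[1]))
```

After normalization the two spectra are indistinguishable, which is the intended effect: differences in illumination or shading no longer separate pixels of the same material.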

In [None]:
# Calculate the L2 norm along the 'wavelength' dimension
l2_norm = np.sqrt((dsp['reflectance'] ** 2).sum(dim='wavelength'))

# Normalize the reflectance by dividing by the L2 norm
dsp['reflectance'] = dsp['reflectance'] / l2_norm

Plot the new, clean spectra

In [None]:
dsp_norm_plot = dsp.where(dsp['class']==5, drop=True)
h = dsp_norm_plot['reflectance'].hvplot.line(x='wavelength',by='index',
                                         color='green',ylim=(-0.01,0.2),alpha=0.5,legend=False)
h

### Train and evaluate the ML model

We will be using a model called `xgboost`. There are many, many different kinds of ML models. `xgboost` belongs to a class of models called gradient boosted trees, which are related to random forests. When used for classification, random forests work by creating multiple decision trees, each trained on a random subset of the data and features, and then averaging their predictions to improve accuracy and reduce overfitting. Gradient boosted trees differ in that they build trees sequentially, with each new tree focusing on correcting the errors of the previous ones. This sequential approach allows `xgboost` to create highly accurate models by iteratively refining predictions and addressing the weaknesses of earlier trees.
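The sequential idea can be shown with a deliberately simplified sketch: one-split regression trees ("stumps") fitted to the residuals of the running prediction, using squared error on a one-dimensional toy problem. This is not how `xgboost` is implemented (it uses gradients of a classification loss, regularization, and much more efficient tree building); the helper names and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def fit_stump(x, residual):
    """Best single-split regression stump on a 1-D feature (squared error)."""
    best = None
    for threshold in np.unique(x)[:-1]:
        left = x <= threshold
        left_value, right_value = residual[left].mean(), residual[~left].mean()
        error = ((residual - np.where(left, left_value, right_value)) ** 2).sum()
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    return best[1:]

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

def boost(x, y, n_trees=50, learning_rate=0.1):
    """Fit stumps one after another, each on the residuals left so far."""
    prediction = np.full(y.shape, y.mean())
    errors = [np.mean((y - prediction) ** 2)]
    for _ in range(n_trees):
        stump = fit_stump(x, y - prediction)
        prediction = prediction + learning_rate * predict_stump(stump, x)
        errors.append(np.mean((y - prediction) ** 2))
    return prediction, errors

# Synthetic 1-D regression problem, for illustration only
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 200)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, x.size)
_, errors = boost(x, y)
print(f"training MSE: {errors[0]:.3f} -> {errors[-1]:.3f}")
```

Each added stump lowers the training error a little, which is why the number of trees (`n_estimators`) and the `learning_rate` are tuned together.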

Import the Machine Learning libraries we will use.

In [None]:
import xgboost as xgb
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, confusion_matrix, ConfusionMatrixDisplay

Our dataset has a label indicating which set (training or test) each point belongs to. We will use this to split the data.

In [None]:
# recall groups
class_data_utm.groupby(['group']).size()

In [None]:
class_data_utm.crs

In [None]:
dtrain = dsp.where(dsp['group']==1,drop=True)
dtest = dsp.where(dsp['group']==2,drop=True)

#create separate datasets for labels and features
y_train = dtrain['class'].values.astype(int)
y_test = dtest['class'].values.astype(int)
X_train = dtrain['reflectance'].values
X_test = dtest['reflectance'].values

#### Train ML model
The steps we will go through to train the model are:

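First, we define the hyperparameter grid (`param_grid`). A full search would try multiple values for several hyperparameters of the XGBoost model; to keep the run time short, the grid below varies only `n_estimators` and fixes the other hyperparameters at a single value.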

Next, we create an XGBoost classifier object using the XGBClassifier class from the XGBoost library.

We then set up the GridSearchCV object using our defined XGBoost model and the hyperparameter grid. GridSearchCV allows us to perform an exhaustive search over the specified hyperparameter values to find the optimal combination that results in the best model performance. We choose a 5-fold cross-validation strategy (cv=5), meaning we split our training data into five subsets to validate the model's performance across different data splits. We use accuracy as our scoring metric to evaluate the models.

After setting up the grid search, we fit the GridSearchCV object to our training data (X_train and y_train). This process involves training multiple models with different hyperparameter combinations and evaluating their performance using cross-validation. Our goal is to identify the set of hyperparameters that yields the highest accuracy.

Once the grid search completes, we print out the best set of hyperparameters and the corresponding best score. The grid_search.best_params_ attribute provides the combination of hyperparameters that achieved the highest cross-validation accuracy, while the grid_search.best_score_ attribute shows the corresponding accuracy score. Finally, we extract the best model (best_model) from the grid search results. This model is trained with the optimal hyperparameters and is ready for making predictions or further analysis in our classification task.

This will take approx __30 seconds__
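To get a feel for how much work the grid search does, the number of model fits can be counted directly from the grid. The sketch below uses only the standard library and the same grid values as this notebook; it assumes scikit-learn's default `refit=True`, which adds one final fit of the best candidate on all the training data.

```python
from itertools import product

# Same grid values as used in this notebook
param_grid = {
    "max_depth": [5],
    "learning_rate": [0.1],
    "subsample": [0.75],
    "n_estimators": [50, 100],
}
cv_folds = 5

# Every combination of values is one candidate model
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

# Each candidate is fitted once per fold, plus one refit of the best candidate
n_fits = len(candidates) * cv_folds + 1
print(len(candidates), "candidates,", n_fits, "fits in total")
```

Adding a second value to any other hyperparameter doubles the number of candidates, so run time grows quickly as the grid widens.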

In [None]:
# Define the hyperparameter grid
param_grid = {
    'max_depth': [5],
    'learning_rate': [0.1],
    'subsample': [0.75],
    'n_estimators' : [50,100]
}

# Create the XGBoost model object
xgb_model = xgb.XGBClassifier(tree_method='hist')

# Create the GridSearchCV object
grid_search = GridSearchCV(xgb_model, param_grid, cv=5, scoring='accuracy')

# Fit the GridSearchCV object to the training data
grid_search.fit(X_train, y_train)

# Print the best set of hyperparameters and the corresponding score
print("Best set of hyperparameters: ", grid_search.best_params_)
print("Best score: ", grid_search.best_score_)
best_model = grid_search.best_estimator_

### Evaluate model performance

We will use our best model to predict the classes of the test data. Then, we calculate the F1 score using `f1_score`, which balances precision and recall, and print it to evaluate overall performance.

Next, we assess how well the model performs for predicting Pine trees by calculating its precision and recall. Precision measures the accuracy of the positive predictions. It answers the question, "Of all the instances we labeled as Pines, how many were actually Pines?". Recall measures the model's ability to identify all actual positive instances. It answers the question, "Of all the actual Pines, how many did we correctly identify?". You may also be familiar with the terms User's and Producer's Accuracy: Precision = User's Accuracy, and Recall = Producer's Accuracy.
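Both quantities can be read directly off a confusion matrix. The sketch below uses an invented 3-class matrix and follows the scikit-learn convention that rows are true classes and columns are predicted classes; the function name `precision_recall` is introduced only for this illustration.

```python
import numpy as np

def precision_recall(conf_matrix, k):
    """Precision and recall for class k; rows = true class, columns = predicted."""
    true_positive = conf_matrix[k, k]
    predicted_k = conf_matrix[:, k].sum()   # everything the model labeled k
    actual_k = conf_matrix[k, :].sum()      # everything that really is k
    return true_positive / predicted_k, true_positive / actual_k

# Made-up 3-class confusion matrix, for illustration only
conf_matrix = np.array([
    [8, 1, 1],
    [2, 6, 2],
    [0, 2, 8],
])
precision, recall = precision_recall(conf_matrix, k=1)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

Precision divides by the column total (User's Accuracy) and recall divides by the row total (Producer's Accuracy).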

Finally, we create and display a confusion matrix to visualize the model's prediction accuracy across all classes

In [None]:
# Step 1: Predict the classes of the test data
y_pred = best_model.predict(X_test)

# Step 2: Calculate accuracy and F1 score for the entire test set
acc = accuracy_score(y_test, y_pred)
print(f"Accuracy: {acc}")

f1 = f1_score(y_test, y_pred, average='weighted')  # 'weighted' accounts for class imbalance
print(f"F1 Score (weighted): {f1}")

# Step 3: Calculate precision and recall for class 5 (Pine)
precision_class_5 = precision_score(y_test, y_pred, labels=[5], average='macro', zero_division=0)
recall_class_5 = recall_score(y_test, y_pred, labels=[5], average='macro', zero_division=0)

print(f"Precision for Class 5: {precision_class_5}")
print(f"Recall for Class 5: {recall_class_5}")

# Step 4: Plot the confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)

ConfusionMatrixDisplay(confusion_matrix=conf_matrix).plot()
plt.show()

### Skipping some steps from Glenn Moncrieff's BioSCape Workshop Tutorial
`8.2.1.8. Interpret and understand ML model`

https://ornldaac.github.io/bioscape_workshop_sa/tutorials/Machine_Learning/Invasive_AVIRIS.html#interpret-and-understand-ml-model

### Predict over an example AVIRIS scene

We now have a trained model and are ready to deploy it to generate predictions across an entire AVIRIS scene and map the distribution of invasive plants. This involves handling a large volume of data, so we need to write the code to do this intelligently. We will accomplish this by applying the `.predict()` method of our trained model in parallel across the chunks of the AVIRIS xarray. The model will receive one chunk at a time so that the data is not too large, but it will be able to perform this operation in parallel across multiple chunks, and therefore will not take too long.

This model was only trained on data covering natural vegetation in the Cape Peninsula. It is important that we only predict in areas that match our training data. We will therefore filter to scenes that cover the Cape Peninsula and mask out non-protected areas.

In [None]:
#south africa protected areas
SAPAD = (gpd.read_file('data/SAPAD_2024.gpkg')
         .query("SITE_TYPE!='Marine Protected Area'")
        )
#SAPAD.plot()
#SAPAD.to_crs("EPSG:32734")
SAPAD.crs

In [None]:
# Get the bounding box of the training data
bbox = class_data_utm.total_bounds  # (minx, miny, maxx, maxy)
#bbox
gdf_bbox = gpd.GeoDataFrame({'geometry': [box(*bbox)]}, crs=class_data_utm.crs)  # Specify the CRS
gdf_bbox['geometry'] = gdf_bbox.buffer(500)
gdf_bbox.crs

In [None]:
#south africa protected areas
SAPAD = (gpd.read_file('data/SAPAD_2024.gpkg')
         .query("SITE_TYPE!='Marine Protected Area'")
        )
SAPAD = SAPAD.to_crs("EPSG:32734")

# Get the bounding box of the training data
bbox = class_data_utm.total_bounds  # (minx, miny, maxx, maxy)
gdf_bbox = gpd.GeoDataFrame({'geometry': [box(*bbox)]}, crs=class_data_utm.crs)  # Specify the CRS
gdf_bbox['geometry'] = gdf_bbox.buffer(500)

# protected areas that intersect with the training data
SAPAD_CT = SAPAD.overlay(gdf_bbox,how='intersection')

#keep only AVIRIS scenes that intersects with CT protected areas
AVNG_sapad = AVNG_CP[AVNG_CP.intersects(SAPAD_CT.unary_union)]

#a list of files to predict
files_sapad = AVNG_sapad['RFL s3'].tolist()

#how many files?
len(files_sapad)

In [None]:
m = AVNG_sapad[['fid','geometry']].explore('fid')
m

In [None]:
SAPAD.keys()

In [None]:
#map = AVNG_Coverage[['fid', 'geometry']].explore('fid')
map = SAPAD[['SITE_TYPE', 'geometry']].explore('SITE_TYPE')
map

Here is the function that we will actually apply to each chunk. It is simple; the hard work is getting the data into and out of this function.

In [None]:
def predict_on_chunk(chunk, model):
    probabilities = model.predict_proba(chunk)
    return probabilities

Now we define the function that takes the path to an AVIRIS file as input and passes the data to the predict function. It is composed of 4 parts:

Part 1: Opens the AVIRIS data file using xarray and sets a condition to identify valid data points, where at least one band has a reflectance value greater than -1.

Part 2: Applies all the transformations that need to be done before the data goes to the model. It stacks the spatial dimensions (x and y) into a single dimension, filters wavelengths, and normalizes the reflectance data.

Part 3: Applies the machine learning model to the normalized data in parallel, predicting class probabilities for each data point using xarray's `apply_ufunc` method. Most of this call involves defining what to do with the dimensions of the old dataset and the new output.

Part 4: Unstacks the data to restore its original dimensions, sets spatial dimensions and coordinate reference system (CRS), clips the data, and transposes the data to match expected formats before returning the results.
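The stack, predict, and unstack pattern in these four parts can be sketched with plain NumPy. Here `ToyModel` is a hypothetical stand-in for the trained classifier, the cube is random data, and the chunks are processed one after another in a loop, whereas the notebook relies on dask to process chunks in parallel.

```python
import numpy as np

class ToyModel:
    """Hypothetical stand-in for a trained classifier with predict_proba."""
    def __init__(self, n_classes):
        self.n_classes = n_classes

    def predict_proba(self, features):
        # Softmax over the first n_classes features so each row sums to 1
        scores = features[:, : self.n_classes]
        exp = np.exp(scores - scores.max(axis=1, keepdims=True))
        return exp / exp.sum(axis=1, keepdims=True)

def predict_cube(cube, model, chunk_size=1000):
    """cube: (northing, easting, wavelength) -> (class, northing, easting)."""
    n_rows, n_cols, n_bands = cube.shape
    samples = cube.reshape(-1, n_bands)                  # stack pixels into rows
    chunks = [model.predict_proba(samples[i:i + chunk_size])
              for i in range(0, samples.shape[0], chunk_size)]
    probabilities = np.concatenate(chunks)               # (sample, class)
    unstacked = probabilities.reshape(n_rows, n_cols, -1)
    return unstacked.transpose(2, 0, 1)                  # class dimension first

# Random stand-in for a small reflectance cube, for illustration only
rng = np.random.default_rng(0)
cube = rng.random((40, 30, 12))
model = ToyModel(n_classes=9)
result = predict_cube(cube, model, chunk_size=500)
print(result.shape)
```

The model only ever sees a two-dimensional (sample, feature) array, which is why the spatial dimensions are stacked before prediction and restored afterwards.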

In [None]:
def predict_xr(file,geometries):

    #part 1 - opening file
    #open the file
    print(f'file: {file}')
    ds = xr.open_datatree(file, engine='h5netcdf', decode_coords="all",
                          chunks='auto')

    #get the geometries of the protected areas for masking
    ds_crs = ds.transverse_mercator.crs_wkt
    geometries = geometries.to_crs(ds_crs).geometry.apply(mapping)

    #condition to use for masking no data later
    condition = (ds['reflectance'] > -1).any(dim='wavelength')

    #stack the data into a single dimension. This will be important for applying the model later
    ds = ds.reflectance.to_dataset().stack(sample=('easting','northing'))
    
    #part 2 - pre-processing
    #remove bad wavelengths
    wavelengths_to_drop = ds.wavelength.where(
        (ds.wavelength < 450) |
        (ds.wavelength >= 1340) & (ds.wavelength <= 1480) |
        (ds.wavelength >= 1800) & (ds.wavelength <= 1980) |
        (ds.wavelength > 2400), drop=True
    )
    # Use drop_sel() to remove those specific wavelength ranges
    ds = ds.drop_sel(wavelength=wavelengths_to_drop)
    
    #normalise the data
    l2_norm = np.sqrt((ds['reflectance'] ** 2).sum(dim='wavelength'))
    ds['reflectance'] = ds['reflectance'] / l2_norm

     
    #part 3 - apply the model over chunks
    result = xr.apply_ufunc(
        predict_on_chunk,
        ds['reflectance'].chunk(dict(wavelength=-1)),
        input_core_dims=[['wavelength']],#input dim with features
        output_core_dims=[['class']],  # name for the new output dim
        exclude_dims=set(('wavelength',)),  #dims to drop in result
        output_sizes={'class': 9}, #length of the new dimension
        output_dtypes=[np.float32],
        dask="parallelized",
        kwargs={'model': best_model}
    )

    #part 4 - post-processing
    result = result.where((result >= 0) & (result <= 1), np.nan) #valid values
    result = result.unstack('sample') #remove the stack
    result = result.rio.set_spatial_dims(x_dim='easting',y_dim='northing') #set the spatial dims
    result = result.rio.write_crs(ds_crs) #set the CRS
    result = result.rio.clip(geometries) #clip to the protected areas and no data
    result = result.transpose('class', 'northing', 'easting') #transpose the data; rio expects it this way
    return result   

Let's test that it works on a single file before we run it through 100s of GB of data.

In [None]:
#files_sapad[25]

In [None]:
test  = predict_xr(rfl_netcdf_2i2c,SAPAD_CT)
test

In [None]:
label_df

In [None]:
test = test.rio.reproject("EPSG:4326",nodata=np.nan)
h = test.isel({'class':5}).hvplot(tiles=hv.element.tiles.EsriImagery(), 
                              project=True,rasterize=True,clim=(0,1),
                              cmap='magma',frame_width=400,data_aspect=1,alpha=0.5)
h

ML models typically provide a single prediction of the most likely outcome. You can also get probability-like scores (values from 0 to 1) from these models, but they are not true probabilities. If the model gives you a score of 0.6, that means it is more likely than a prediction of 0.5, and less likely than 0.7. However, it does not mean that in a large sample your prediction would be right 60 times out of 100. To get calibrated probabilities from our models, we have to apply additional steps. We can also get a set of predictions from models rather than a single prediction, which reflects the model's true uncertainty, using a technique called conformal prediction. Read more about conformal prediction for geospatial machine learning in this amazing paper:

[Singh, G., Moncrieff, G., Venter, Z., Cawse-Nicholson, K., Slingsby, J., & Robinson, T. B. (2024). Uncertainty quantification for probabilistic machine learning in earth observation using conformal prediction. Scientific Reports, 14(1), 16166.](https://www.nature.com/articles/s41598-024-65954-w)
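As a rough illustration of the idea, the sketch below implements a generic split conformal procedure on synthetic scores: a held-out calibration set is used to pick a threshold, and every class whose score clears that threshold goes into the prediction set. This is a textbook-style sketch under stated assumptions, not necessarily the procedure used in the paper above; the function names and the synthetic data are introduced only for this example.

```python
import numpy as np

def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Score threshold from a held-out calibration set (split conformal)."""
    n = cal_labels.size
    # Nonconformity score: 1 minus the score given to the true class
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    rank = int(np.ceil((n + 1) * (1 - alpha)))   # finite-sample correction
    return np.sort(scores)[rank - 1] if rank <= n else 1.0

def prediction_sets(probs, threshold):
    """Boolean (sample, class) array: True where a class is in the set."""
    return (1.0 - probs) <= threshold

# Synthetic class scores for 9 classes, for illustration only
rng = np.random.default_rng(0)
n, n_classes = 500, 9
labels = rng.integers(0, n_classes, n)
logits = rng.normal(size=(n, n_classes))
logits[np.arange(n), labels] += 2.0              # favor the true class
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# First half calibrates the threshold, second half receives prediction sets
threshold = conformal_threshold(probs[:250], labels[:250], alpha=0.1)
sets = prediction_sets(probs[250:], threshold)
coverage = sets[np.arange(250), labels[250:]].mean()
print(f"average set size: {sets.sum(axis=1).mean():.2f}, coverage: {coverage:.2f}")
```

Uncertain pixels end up with larger sets and confident pixels with smaller ones, which is a more honest summary than a single most-likely class.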

### Final steps of the full ML classification are time intensive and are not described in this workshop.  

Steps in Glenn Moncrieff's BioSCape Workshop Tutorial

`8.2.1.10. Merge and mosaic results`
- https://ornldaac.github.io/bioscape_workshop_sa/tutorials/Machine_Learning/Invasive_AVIRIS.html#merge-and-mosaic-results

### CREDITS:  

Find all of the October 2025 BioSCape Data Workshop Materials/Notebooks

- https://ornldaac.github.io/bioscape_workshop_sa/intro.html

This Notebook is an adaptation of **Glenn Moncrieff**'s BioSCape Data Workshop Notebook:  [**Mapping invasive species using supervised machine learning and AVIRIS-NG**](https://ornldaac.github.io/bioscape_workshop_sa/tutorials/Machine_Learning/Invasive_AVIRIS.html)
- This Notebook accesses and uses an updated version of AVIRIS-NG data with improved corrections and that are in netCDF file formats

Glenn's lesson borrowed from:

- [``Land cover mapping example on Microsoft Planetary Computer``](https://planetarycomputer.microsoft.com/docs/tutorials/landcover)