# 1.0.3: Generating training weights based on trait provenance

The appeal of incorporating citizen science data such as GBIF species observations into trait prediction models lies in its sheer volume compared to other, perhaps more scientifically rigorous datasets such as sPlot. Combining two such datasets while giving preference to the more reliable one (in this case, sPlot) allows us to train models that learn from both the higher-quality data and the more abundant data, increasing our models' predictive range both spatially and functionally.

However, the disproportion between the number of GBIF observations and the number of sPlot observations introduces a new problem: model loss will likely be dominated by the GBIF observations, drowning out the effect of the sPlot observations. To address this, we can weight the observations by their provenance (i.e. which dataset they came from) and thus instruct the models to value the sPlot observations more highly than those from GBIF.

## Imports and config

In [1]:
import pandas as pd
import xarray as xr

from src.conf.conf import get_config
from src.conf.environment import log
from src.utils.dataset_utils import get_trait_map_fns
from src.utils.raster_utils import open_raster

cfg = get_config()

## Get the filenames for one trait

For this walkthrough, we're going to focus only on a single trait.

In [4]:
trait_map_fns = get_trait_map_fns("interim")[:2]
trait_map_fns

[PosixPath('data/interim/splot/trait_maps/Shrub_Tree_Grass/001/X4.tif'),
 PosixPath('data/interim/gbif/trait_maps/Shrub_Tree_Grass/001/X4.tif')]

As we can see, we have sparse trait maps for the X4 trait (stem specific density) aggregated from both the sPlot observations and the GBIF observations. We can load both rasters and merge them into a single raster whose values indicate the source of each original value.

In [5]:
NCHUNKS = 9
BAND = 1  # mean

dax = []
for fn in trait_map_fns:
    data = open_raster(
        fn, chunks={"x": 36000 // NCHUNKS, "y": 18000 // NCHUNKS}, mask_and_scale=True
    )

    # Rewrite the long_name of the data array to include the band (e.g. "trait_mean",
    # "trait_median", etc.)
    bands = data.attrs["long_name"]
    long_name = f"{fn.stem}_{bands[BAND - 1]}"
    data.attrs["long_name"] = long_name
    
    dax.append(data.sel(band=BAND))

print("sPlot data:")
dax[0]

sPlot data:


            Array                        Chunk
Bytes       4.83 GiB                     61.04 MiB
Shape       (18000, 36000)               (2000, 4000)
Dask graph  81 chunks in 3 graph layers
Data type   float64 numpy.ndarray


In [6]:
print("GBIF data:")
dax[1]

GBIF data:


            Array                        Chunk
Bytes       4.83 GiB                     61.04 MiB
Shape       (18000, 36000)               (2000, 4000)
Dask graph  81 chunks in 3 graph layers
Data type   float64 numpy.ndarray


## Create "source" data array

Now we can merge the data arrays, recording only which dataset each value comes from. Where both datasets have a value for the same cell, sPlot takes precedence.

In [7]:
SPLOT_FN_ID = 0
GBIF_FN_ID = 1

merged_source = xr.where(
    dax[SPLOT_FN_ID].notnull(), "s", xr.where(dax[GBIF_FN_ID].notnull(), "g", None)
)
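
To make the precedence rule concrete, here is a minimal sketch on toy 2x2 arrays. It uses NumPy's `np.where`, which selects element-wise in the same way as `xr.where`; the arrays `splot` and `gbif` hold made-up values for illustration, not real trait data.

```python
import numpy as np

# Hypothetical 2x2 grids; NaN marks cells with no observations.
splot = np.array([[0.5, np.nan], [np.nan, np.nan]])
gbif = np.array([[0.7, 0.6], [np.nan, 0.4]])

# sPlot is checked first, so it wins wherever both sources have a value.
# Cells with no value in either source are labelled None.
source = np.where(
    ~np.isnan(splot), "s", np.where(~np.isnan(gbif), "g", None)
)

print(source.tolist())
```

The top-left cell has values in both grids and is labelled `"s"`, the bottom-left cell has no value in either grid and stays `None`, and the remaining two cells are labelled `"g"`.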

## Convert data array to dataframe

In [8]:
source_df = (
    merged_source.rename("SSD_source")
    .to_dask_dataframe()
    .drop(columns=["band", "spatial_ref"])
    .dropna(how="all", subset=["SSD_source"])
).compute().reset_index(drop=True)

In [9]:
source_df.head()

         x       y SSD_source
0 -179.995  68.675          g
1 -179.995  67.865          g
2 -179.995 -16.785          g
3 -179.995 -16.795          g
4 -179.985  70.975          g


## Calculate proportion of sPlot to GBIF values

By calculating the proportions of sPlot-derived and GBIF-derived values, we can use their ratio to determine how we want to weight the GBIF observations.

In [10]:
proportion = source_df.SSD_source.value_counts(normalize=True)
proportion

SSD_source
g    0.920293
s    0.079707
Name: proportion, dtype: double[pyarrow]

Not surprisingly, 92% of the values have no sPlot data and so were derived from GBIF. This is a massive imbalance, which we can now correct by applying weights to the GBIF observations.

## Apply weights

We could simply apply weights to the observations such that weight(x) = 1 - proportion(x), but since we want the sPlot-derived values to have a weight of exactly 1, we rescale both weights by dividing them by the sPlot weight. This leaves sPlot at 1 and scales the GBIF weight up to the ratio of the sPlot proportion to the GBIF proportion:

In [11]:
proportion.s / proportion.g

0.08660996138933781
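
The ratio above follows from normalizing the naive `1 - proportion` weights by the sPlot weight. The sketch below reproduces that arithmetic in plain Python, using the rounded proportions printed earlier, so it matches the notebook value only to about five decimal places. The variable names are illustrative and do not appear elsewhere in the notebook.

```python
# Rounded proportions taken from the value_counts output above.
p_g = 0.920293
p_s = 0.079707

# Naive weights: each source is weighted by the share of the *other* source.
naive_s = 1 - p_s  # equals p_g when there are only two sources
naive_g = 1 - p_g  # equals p_s

# Divide both by the sPlot weight so that sPlot ends up at exactly 1.
w_s = naive_s / naive_s
w_g = naive_g / naive_s

print(w_s, round(w_g, 5))
```

Because there are only two sources, `naive_g / naive_s` reduces to `p_s / p_g`, which is the expression evaluated in the cell above.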

In [12]:
weights = pd.Series({"s": 1.0})

# Calculate the weight for 'g' based on the proportion of 's' to 'g'
weights["g"] = proportion["s"] / proportion["g"]

# Map the weights to the original DataFrame
source_df["weight"] = source_df["SSD_source"].map(weights)

# Now source_df['weight'] contains the weights for each row
print(source_df[source_df.SSD_source == "g"].head())

         x       y SSD_source   weight
0 -179.995  68.675          g  0.08661
1 -179.995  67.865          g  0.08661
2 -179.995 -16.785          g  0.08661
3 -179.995 -16.795          g  0.08661
4 -179.985  70.975          g  0.08661
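
As a sanity check on what these weights achieve, the sketch below computes each source's share of the total sample weight for a hypothetical set of 1,000,000 grid cells split according to the rounded proportions above. It assumes the downstream model uses the weights as per-sample multipliers in its loss, which this notebook does not show.

```python
# Hypothetical cell counts following the rounded proportions above.
n_total = 1_000_000
n_g = 920_293
n_s = n_total - n_g  # 79,707 sPlot-derived cells

w_s = 1.0
w_g = n_s / n_g  # same ratio as proportion["s"] / proportion["g"]

# Total weight carried by each source.
total_s = n_s * w_s
total_g = n_g * w_g

share_unweighted_g = n_g / n_total
share_weighted_g = total_g / (total_g + total_s)

print(f"GBIF share of total weight, unweighted: {share_unweighted_g:.3f}")
print(f"GBIF share of total weight, weighted:   {share_weighted_g:.3f}")
```

With these weights, the GBIF cells collectively carry the same total weight as the sPlot cells, so neither source dominates purely through its count.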
