# 2.1 - Grid GBIF species

As mentioned in notebook 1.1, we want to query the species composition around each grid cell to compile species presence, counts, and distance-weighted counts for several query radii. The most precise method would be to build an R-tree or k-d tree over the species observations and query the observations that fall within each radius. For high-resolution rasters (10 m and 20 m) covering such a broad spatial extent, however, this requires significant computation time and/or memory. A quicker approach may be to approximate these precise queries through interpolation combined with either density thresholding or moving-window calculations.
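For reference, the tree-based query described above could look roughly like the following. This is only a minimal sketch on made-up coordinates: `obs_xy`, `cell_xy`, and `query_radius` are hypothetical names, and a projected CRS in metres is assumed.

```python
import numpy as np
from scipy.spatial import cKDTree

# Hypothetical observation coordinates in a projected CRS (metres).
obs_xy = np.array([[100.0, 100.0], [165.0, 130.0], [900.0, 900.0]])

# Hypothetical grid cell centres for a 3 x 3 patch at 20 m resolution.
xs = np.arange(90.0, 150.0, 20.0)
ys = np.arange(90.0, 150.0, 20.0)
cell_xy = np.array([(x, y) for y in ys for x in xs])

# Build the tree once over the observations, then query every cell centre.
tree = cKDTree(obs_xy)
query_radius = 50.0

# return_length=True returns only the number of neighbours per query point.
tree_counts = tree.query_ball_point(cell_xy, r=query_radius, return_length=True)
tree_presence = tree_counts > 0
```

On a raster with more than a billion cells, the query loop over all cell centres is what makes this approach expensive, which motivates the approximations explored below.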

In [1]:
import os

from dotenv import load_dotenv, find_dotenv
import rasterio

from src.conf.parse_params import config as cfg
from src.utils.df_utils import read_df

load_dotenv(find_dotenv())
os.chdir(os.environ["PROJECT_ROOT"])

%load_ext autoreload
%autoreload 2

## Non-tree-based approaches

In [2]:
import cupy as cp
import cupyx as cpx
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.ndimage import gaussian_filter, generic_filter
from scipy.signal import convolve2d
from scipy import stats
import seaborn as sns

from src.utils.df_utils import read_gdf

#### Load data

In [3]:
gbif = read_gdf(cfg["gbif"]["masked"])

# Subset to the most abundant species for quick testing
top_species = gbif.species.value_counts().index[0]
species = gbif[gbif.species == top_species]

with rasterio.open(cfg["s2_20m"]["src"]) as src:
    target_meta = src.meta
    target_shape = (src.height, src.width)

Simple utility function to print summary statistics of an array.

In [6]:
def print_stats(data: np.ndarray) -> None:
    results = stats.describe(data)
    print(f"Number of elements: {results.nobs}")
    print(f"Min: {results.minmax[0]}")
    print(f"Max: {results.minmax[1]}")
    print(f"Mean: {results.mean}")
    print(f"Variance: {results.variance}")
    print(f"Skewness: {results.skewness}")
    print(f"Kurtosis: {results.kurtosis}")

In [20]:
grouped = [group for name, group in gbif.groupby("species")]

### Gaussian density-based approach

#### Simple counts

First get the counts _per grid cell_ (with no radius applied).

In [10]:
counts = np.zeros(target_shape, dtype=int)

points = np.c_[species.geometry.x, species.geometry.y]

def increment_counts(point: np.ndarray) -> np.ndarray:
    row, col = src.index(point[0], point[1])
    counts[row, col] += 1
    return counts

for point in points:
    increment_counts(point)
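The per-point loop can be slow for species with many observations. A vectorized alternative is sketched below, assuming a north-up grid whose top-left corner (`x0`, `y0`) and resolution `res` are known; all names and values here are hypothetical and not taken from the project. `np.add.at` is used because it accumulates correctly when several points fall in the same cell.

```python
import numpy as np

# Assumed north-up grid: (x0, y0) is the outer corner of the top-left cell.
x0, y0, res = 0.0, 100.0, 20.0
grid_shape = (5, 5)

# Hypothetical point coordinates (x, y) in the same CRS as the grid.
xy = np.array([[10.0, 90.0], [15.0, 95.0], [50.0, 50.0], [99.0, 1.0]])

# Convert coordinates to integer row/column indices.
cols = np.floor((xy[:, 0] - x0) / res).astype(int)
rows = np.floor((y0 - xy[:, 1]) / res).astype(int)

# Keep only the points that fall inside the grid.
inside = (
    (rows >= 0) & (rows < grid_shape[0]) & (cols >= 0) & (cols < grid_shape[1])
)

# Unbuffered in-place add, so repeated (row, col) pairs are all counted.
demo_counts = np.zeros(grid_shape, dtype=int)
np.add.at(demo_counts, (rows[inside], cols[inside]), 1)
```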

In [37]:
print_stats(counts.flatten())

Number of elements: 1389930000
Min: 0
Max: 761
Mean: 0.00010582691214665487
Variance: 0.005251182749459388
Skewness: 4017.2032220501437
Kurtosis: 24062862.80862101


Get the counts within a certain radius using `scipy.ndimage.gaussian_filter`. `sigma` is used to approximate the search radius, so it is set to the desired radius divided by the grid cell resolution. In this case, we're using the 20 m Sentinel-2 grid and want a 1 km radius, so `sigma` = 1000 / 20 = 50. Note that `sigma` is a standard deviation rather than a hard cutoff: by default the filter extends to 4 × `sigma` (the `truncate` parameter).

In [38]:
counts = gaussian_filter(counts, sigma=50)

In [40]:
print_stats(counts.flatten())

Number of elements: 1389930000
Min: 0
Max: 0
Mean: 0.0
Variance: 0.0
Skewness: nan
Kurtosis: nan


#### Results
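The statistics of the resulting array show that `gaussian_filter` returned all zeros, so this isn't a workable approach as written. A likely cause is the integer dtype of `counts`: by default `gaussian_filter` writes its output in the input's dtype, and because the Gaussian kernel is normalized to sum to 1, a single cell's count is spread across thousands of neighbors. With `sigma=50` the kernel's peak weight is roughly 1 / (2π · 50²) ≈ 6 × 10⁻⁵, so even the busiest cell (761 observations) contributes less than 1 to any output cell, which becomes 0 as an integer. For the same reason, a normalized Gaussian yields a smoothed density rather than a count within a radius, whereas for a radius count we would expect a maximum of at least the max of the per-grid-cell counts, if not higher.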
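The all-zero output is consistent with how `gaussian_filter` handles integer input. The following check on a hypothetical toy array (`toy`, not the project data) compares integer and floating-point input with the same `sigma`; it is a sketch to illustrate the dtype behaviour, not a fix for the approach.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

# Toy grid with a single busy cell, mimicking the maximum per-cell count.
toy = np.zeros((401, 401), dtype=int)
toy[200, 200] = 761

# Integer input: the output is stored in the same integer dtype.
smoothed_int = gaussian_filter(toy, sigma=50)

# Float input: the smoothed values keep their fractional part.
smoothed_float = gaussian_filter(toy.astype(float), sigma=50)

print("int max:", smoothed_int.max())
print("float max:", smoothed_float.max())
print("float sum:", smoothed_float.sum())
```

With floating-point input the total is preserved, but it is spread so thinly that no single cell approaches the original per-cell maximum.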

### Convolutional approach using CuPy

Not only did the Gaussian filter fail to produce usable results, it also took a long time to compute (~10 mins). This isn't feasible when we need to process all ~9,200 species across 4 radii and two resolution sets. A 2D convolution on the CPU is similarly slow, though it at least returns usable results.

To overcome these computational constraints, we can use CuPy to run the convolution on a GPU.

In [7]:
counts = np.zeros(target_shape, dtype=int)

points = np.c_[species.geometry.x, species.geometry.y]

def increment_counts(point: np.ndarray) -> np.ndarray:
    row, col = src.index(point[0], point[1])
    counts[row, col] += 1
    return counts

for point in points:
    increment_counts(point)

#### Kernel-based counts - non-weighted

Apply a 2D convolution (`convolve2d`, as in `scipy.signal`; here the CuPy equivalent `cupyx.scipy.signal.convolve2d`) with a binary radial kernel whose radius is given in grid cells (i.e. radius / resolution). At each cell, the counts that fall within the kernel are multiplied by the kernel values and summed, giving the total count over the neighboring grid cells within the radius.

In [37]:
from typing import Optional


def radial_kernel(radius: int, weighted: bool = False, sigma: Optional[float | int] = None) -> np.ndarray:
    sigma = radius * 0.5 if sigma is None else sigma
    y, x = np.ogrid[-radius : radius + 1, -radius : radius + 1]
    distances_squared = x**2 + y**2
    
    if weighted:
        kernel = np.exp(-distances_squared / (2*sigma**2))
        kernel[distances_squared > radius**2] = 0
        return kernel
    
    kernel = x**2 + y**2 <= radius**2
    return kernel.astype(int)

print("Standard radial kernel:\n")
print(radial_kernel(3))
print("\nWeighted radial kernel:\n")
print(radial_kernel(3, True).round(2))

Standard radial kernel:

[[0 0 0 1 0 0 0]
 [0 1 1 1 1 1 0]
 [0 1 1 1 1 1 0]
 [1 1 1 1 1 1 1]
 [0 1 1 1 1 1 0]
 [0 1 1 1 1 1 0]
 [0 0 0 1 0 0 0]]

Weighted radial kernel:

[[0.   0.   0.   0.14 0.   0.   0.  ]
 [0.   0.17 0.33 0.41 0.33 0.17 0.  ]
 [0.   0.33 0.64 0.8  0.64 0.33 0.  ]
 [0.14 0.41 0.8  1.   0.8  0.41 0.14]
 [0.   0.33 0.64 0.8  0.64 0.33 0.  ]
 [0.   0.17 0.33 0.41 0.33 0.17 0.  ]
 [0.   0.   0.   0.14 0.   0.   0.  ]]
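To see what the binary-kernel convolution computes, here is a small CPU sketch using `scipy.signal.convolve2d` on a hypothetical 7 x 7 grid. `disk_kernel` and `toy_grid` are illustrative names; `disk_kernel` mirrors the unweighted branch of `radial_kernel` so that the block is self-contained.

```python
import numpy as np
from scipy.signal import convolve2d


def disk_kernel(radius: int) -> np.ndarray:
    """Binary kernel that is 1 within `radius` cells of the centre, else 0."""
    y, x = np.ogrid[-radius : radius + 1, -radius : radius + 1]
    return (x**2 + y**2 <= radius**2).astype(int)


# Toy per-cell counts: three occupied cells.
toy_grid = np.zeros((7, 7), dtype=int)
toy_grid[3, 3] = 2
toy_grid[3, 5] = 1
toy_grid[0, 0] = 4

# mode="same" keeps the output aligned with the input grid; cells beyond
# the edge are treated as zeros (the default boundary="fill").
neighbor_sum = convolve2d(toy_grid, disk_kernel(2), mode="same")
print(neighbor_sum)
```

Each output cell holds the sum of the counts within two cells of it, so a cell can exceed its own count whenever occupied neighbors fall inside the kernel.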


Initialize CuPy arrays on the GPU

In [38]:
radius = 1000 // 20
counts_gpu = cp.asarray(counts)
kernel_gpu = cp.asarray(radial_kernel(radius))
gauss_kernel_gpu = cp.asarray(radial_kernel(radius, True))

In [39]:
radial_counts_gpu = cpx.scipy.signal.convolve2d(counts_gpu, kernel_gpu, mode="same")
radial_counts_wt_gpu = cpx.scipy.signal.convolve2d(counts_gpu, gauss_kernel_gpu, mode="same")

# Execution takes around 48 seconds 

In [40]:
radial_counts = cp.asnumpy(radial_counts_gpu)
radial_counts_wt = cp.asnumpy(radial_counts_wt_gpu)
print("Simple counts:\n")
print_stats(radial_counts.flatten())
print("\nWeighted counts:\n")
print_stats(radial_counts_wt.flatten())

Simple counts:

Number of elements: 1389930000
Min: 0
Max: 1224
Mean: 0.8293358622376666
Variance: 64.25457581424405
Skewness: 44.16730926029975
Kurtosis: 3144.4324769964023

Weighted counts:

Number of elements: 1389930000
Min: 0.0
Max: 762.1210812696761
Mean: 0.35890692310931116
Variance: 15.234865110079596
Skewness: 55.578068616486235
Kurtosis: 5052.684645513648


Lastly, we can derive binary presence by converting the `radial_counts` array to `bool` type: any cell with at least one observation within the radius becomes `True`.

In [41]:
presence = radial_counts.astype(bool)