# Benchmarks - Kernels

In MagmaClust, and in GPs in general, kernels make it possible to compute a covariance matrix for any set of points, given a set of hyperparameters. They are called in many innermost loops, so their implementation is critical for performance.

**Main considerations when implementing kernels**

A good kernel implementation must be:
* fast, as it is used in many innermost loops
* usable on inputs of several dimensionalities, including a batch dimension with distinct hyperparameters for each element
* able to work on padded inputs (i.e. inputs containing NaNs), possibly using a mask
* jittable, as it is used in many jit-compiled functions
* modular, as kernels can be combined in many ways
* both static and instance-based, to either carry around hyperparameters or be called with them
* easy to override, as users may want to define their own kernels

These goals conflict in some cases (e.g. writing a jittable version of a kernel is not trivial for most users), and the best implementation will depend on the specific use case.

---
## Setup

In [1]:
# Standard library
import os

os.environ['JAX_ENABLE_X64'] = "True"
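
The same setting can also be applied through JAX's config API, which may be convenient when the environment variable cannot be set before the import. A minimal sketch, assuming it runs before any array is created:

```python
import jax
from jax import numpy as jnp

# Same effect as JAX_ENABLE_X64=True, set through the config API instead
jax.config.update("jax_enable_x64", True)

x = jnp.array(1.0)
print(x.dtype)  # float64 once x64 is enabled
```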

In [2]:
# Third party
import jax
from jax import jit, vmap
from jax.tree_util import register_pytree_node_class
from jax import numpy as jnp

import numpy as np

In [3]:
# Local
from MagmaClustPy.kernels import RBFKernel, AbstractKernel
from MagmaClustPy.utils import generate_dummy_db, preprocess_db

In [4]:
# Config
key = jax.random.PRNGKey(0)

---
## Data

---
## Current implementation

In [5]:
old_kernel = RBFKernel(length_scale=jnp.array(1.), variance=jnp.array(1.))

---
## Custom implementation(s)

### Flaws of the previous implementation that we wish to correct/improve:

**The kwargs are converted to args**

To make `vmap` work, the old implementation converts kwargs to positional args.

This can cause bugs when the order of the kwargs is not respected, either in the parameters provided by the caller or in the class definition.
An alternative is to use a "lambda" version of each compute_* method that closes over the kwargs, and then vmap this
lambda function (the positional variant is kept in comments). However, this might make jit compile the same function
many times, which would not be optimal. Whether this actually happens remains to be checked.
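
One way to check for repeated compilation is to count how often the function body is traced, since Python side effects only run while JAX traces. The sketch below is a standalone illustration, not the MagmaClustPy code: `rbf_scalar`, `rbf_vector` and `TRACE_COUNT` are hypothetical names.

```python
import jax
from jax import jit, vmap
from jax import numpy as jnp

TRACE_COUNT = 0  # hypothetical counter, incremented only while JAX traces


def rbf_scalar(x1, x2, length_scale, variance):
    return variance * jnp.exp(-0.5 * (x1 - x2) ** 2 / length_scale ** 2)


@jit
def rbf_vector(x1, x2, **kwargs):
    global TRACE_COUNT
    TRACE_COUNT += 1  # Python side effect: runs at trace time, not on cached calls
    # kwargs are captured by name in the closure, so their order cannot matter
    return vmap(lambda x: rbf_scalar(x, x2, **kwargs))(x1)


x = jnp.linspace(0., 1., 8)
hps = dict(length_scale=jnp.array(1.), variance=jnp.array(2.))

# Repeated calls with identical shapes and dtypes should reuse the cached trace
for _ in range(3):
    rbf_vector(x, jnp.array(0.5), **hps).block_until_ready()
traces_same_shape = TRACE_COUNT

# A new input shape is expected to trigger one more trace
rbf_vector(jnp.linspace(0., 1., 16), jnp.array(0.5), **hps).block_until_ready()
traces_new_shape = TRACE_COUNT

print(traces_same_shape, traces_new_shape)
```

If the first counter stays at one, the lambda defined inside the jitted function is not causing extra compilations for repeated calls with the same input shapes.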

In [6]:
@register_pytree_node_class
class NewAbstractKernel:
	def __init__(self, **kwargs):
		# Check that hyperparameters are all jnp arrays/scalars
		for key, value in kwargs.items():
			if not isinstance(value, jnp.ndarray):  # Check type
				raise ValueError(f"Parameter {key} must be a jnp.ndarray.")
			else:  # Check dimensionality
				if len(value.shape) > 1:
					raise ValueError(f"Parameter {key} must be a scalar or a 1D array, got shape {value.shape}.")

		# Register hyperparameters in *kwargs* as instance attributes
		self.__dict__.update(kwargs)

	@jit
	def check_kwargs(self, **kwargs):
		for key in self.__dict__:
			if key not in kwargs:
				kwargs[key] = self.__dict__[key]
		return kwargs

	@jit
	def __call__(self, x1, x2=None, **kwargs):
		# If no x2 is provided, we compute the covariance between x1 and itself
		if x2 is None:
			x2 = x1

		# Check kwargs
		kwargs = self.check_kwargs(**kwargs)

		# Call the appropriate method
		if jnp.isscalar(x1) and jnp.isscalar(x2):
			return self.compute_scalar(x1, x2, **kwargs)
		elif jnp.ndim(x1) == 1 and jnp.isscalar(x2):
			return self.compute_vector(x1, x2, **kwargs)
		elif jnp.isscalar(x1) and jnp.ndim(x2) == 1:
			return self.compute_vector(x2, x1, **kwargs)
		elif jnp.ndim(x1) == 1 and jnp.ndim(x2) == 1:
			return self.compute_matrix(x1, x2, **kwargs)
		elif jnp.ndim(x1) == 2 and jnp.ndim(x2) == 2:
			return self.compute_batch(x1, x2, **kwargs)
		else:
			return jnp.nan

	# Methods to use Kernel as a PyTree
	def tree_flatten(self):
		return tuple(self.__dict__.values()), None  # No static values

	@classmethod
	def tree_unflatten(cls, _, children):
		# This class being abstract, this function fails when called on an "abstract instance",
		# as we don't know the number of parameters the constructor expects, yet we send it children.
		# On a subclass, this will work as expected as long as the constructor has a clear number of
		# kwargs as parameters.
		return cls(*children)

	@jit
	def compute_scalar(self, x1: jnp.ndarray, x2: jnp.ndarray, **kwargs) -> jnp.ndarray:
		"""
		Compute the kernel covariance value between two scalar arrays.

		:param x1: scalar array
		:param x2: scalar array
		:param kwargs: hyperparameters of the kernel
		:return: scalar array
		"""
		return jnp.array(jnp.nan)  # To be overwritten

	@jit
	def compute_vector(self, x1: jnp.ndarray, x2: jnp.ndarray, **kwargs) -> jnp.ndarray:
		"""
		Compute the kernel covariance value between a vector and a scalar.

		:param x1: vector array (N, )
		:param x2: scalar array
		:param kwargs: hyperparameters of the kernel
		:return: vector array (N, )
		"""
		return vmap(lambda x: self.compute_scalar(x, x2, **kwargs), in_axes=0)(x1)
		#  return vmap(self.compute_scalar, in_axes=(0, None) + (None,) * len(args))(x1, x2, *args).squeeze()

	@jit
	def compute_matrix(self, x1: jnp.ndarray, x2: jnp.ndarray, **kwargs) -> jnp.ndarray:
		"""
		Compute the kernel covariance matrix between two vector arrays.

		:param x1: vector array (N, )
		:param x2: vector array (M, )
		:param kwargs: hyperparameters of the kernel
		:return: matrix array (N, M)
		"""
		return vmap(lambda x: self.compute_vector(x2, x, **kwargs), in_axes=0)(x1)
		# return vmap(self.compute_vector, in_axes=(None, 0) + (None,) * len(args))(x2, x1, *args)

	@jit
	def compute_batch(self, x1: jnp.ndarray, x2: jnp.ndarray, **kwargs) -> jnp.ndarray:
		"""
		Compute the kernel covariance matrix between two batched vector arrays.

		:param x1: vector array (B, N)
		:param x2: vector array (B, M)
		:param kwargs: hyperparameters of the kernel. Each HP that is a scalar will be common to the whole batch, and
		each HP that is a vector will be distinct and thus must have shape (B, )
		:return: tensor array (B, N, M)
		"""
		# vmap(self.compute_matrix)(x1, x2, **kwargs)
		common_hps = {key: value for key, value in kwargs.items() if jnp.isscalar(value)}
		distinct_hps = {key: value for key, value in kwargs.items() if not jnp.isscalar(value)}

		return vmap(lambda x, y, hps: self.compute_matrix(x, y, **hps, **common_hps), in_axes=(0, 0, 0))(x1, x2, distinct_hps)
		# kwargs_axes = tuple(None if jnp.isscalar(hp) else 0 for hp in kwargs)
		# return vmap(self.compute_matrix, in_axes=(0, 0) + kwargs_axes)(x1, x2, **kwargs)
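
The split between common and distinct hyperparameters used in `compute_batch` can be illustrated in isolation. This is a minimal sketch under assumptions: `rbf_matrix` and `rbf_batch` are hypothetical free functions, and the matrix is built by broadcasting rather than by the nested `vmap` calls of the class above.

```python
from jax import vmap
from jax import numpy as jnp


def rbf_matrix(x1, x2, length_scale, variance):
    # (N,) and (M,) inputs -> (N, M) covariance through broadcasting
    diff = x1[:, None] - x2[None, :]
    return variance * jnp.exp(-0.5 * diff ** 2 / length_scale ** 2)


def rbf_batch(x1, x2, **hps):
    # 0-d hyperparameters are shared, 1-d ones hold one value per batch element
    common = {k: v for k, v in hps.items() if jnp.ndim(v) == 0}
    distinct = {k: v for k, v in hps.items() if jnp.ndim(v) > 0}
    # Default in_axes=0 maps over the leading axis of x1, x2 and every array in `distinct`
    return vmap(lambda a, b, d: rbf_matrix(a, b, **d, **common))(x1, x2, distinct)


x1 = jnp.ones((4, 3))
x2 = jnp.zeros((4, 5))
out = rbf_batch(x1, x2, length_scale=jnp.array(1.), variance=jnp.arange(1., 5.))
print(out.shape)  # (4, 3, 5)
```

Passing the distinct hyperparameters as a dict lets `vmap` treat them as a pytree, so each one is sliced along its batch axis without relying on argument order.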

In [7]:
@register_pytree_node_class
class NewRBFKernel(NewAbstractKernel):
	def __init__(self, length_scale=None, variance=None):
		# Default to 0-d arrays: compute_batch treats any non-scalar HP as per-batch, shape (B, )
		if length_scale is None:
			length_scale = jnp.array(1.)
		if variance is None:
			variance = jnp.array(1.)
		super().__init__(length_scale=length_scale, variance=variance)

	@jit
	def compute_scalar(self, x1: jnp.ndarray, x2: jnp.ndarray, length_scale=None, variance=None) -> jnp.ndarray:
		return variance * jnp.exp(-0.5 * (x1 - x2) ** 2 / length_scale ** 2)
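
The comment in `tree_unflatten` notes that rebuilding with `cls(*children)` depends on the constructor's positional signature. One possible alternative, sketched here with a hypothetical `KeyedKernel` class that is not part of MagmaClustPy, is to store the hyperparameter names as auxiliary data and rebuild by keyword:

```python
import jax
from jax import numpy as jnp
from jax.tree_util import register_pytree_node_class


@register_pytree_node_class
class KeyedKernel:  # hypothetical name, for illustration only
    def __init__(self, **kwargs):
        self.__dict__.update(kwargs)

    def tree_flatten(self):
        keys = tuple(sorted(self.__dict__))
        # children are the values, aux_data stores the matching names
        return tuple(self.__dict__[k] for k in keys), keys

    @classmethod
    def tree_unflatten(cls, keys, children):
        # Rebuild by name, so the constructor's argument order is irrelevant
        return cls(**dict(zip(keys, children)))


k = KeyedKernel(variance=jnp.array(2.), length_scale=jnp.array(1.))
doubled = jax.tree_util.tree_map(lambda v: 2 * v, k)
print(doubled.length_scale, doubled.variance)
```

With this layout the names travel with the pytree structure, so a subclass can declare its hyperparameters in any order.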

---
## Comparison

In [8]:
new_kernel = NewRBFKernel(length_scale=jnp.array(1.), variance=jnp.array(1.))

### On scalars

In [9]:
key, subkey = jax.random.split(key)
a = jax.random.uniform(subkey, ())
key, subkey = jax.random.split(key)
b = jax.random.uniform(subkey, ())
a, b

(Array(0.08062437, dtype=float64), Array(0.67119299, dtype=float64))

In [10]:
res1 = old_kernel(a, b)
np.asarray(res1)

array(0.8399729)

In [11]:
res2 = new_kernel(a, b)
np.asarray(res2)

array(0.8399729)

In [12]:
jnp.allclose(res1, res2)

Array(True, dtype=bool)
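
As a sanity check, the scalar result can be reproduced by hand from the RBF formula used in `compute_scalar`, using only the standard library and the rounded values of `a` and `b` printed above:

```python
import math

a, b = 0.08062437, 0.67119299  # values printed above, rounded to 8 decimals
length_scale, variance = 1.0, 1.0

# variance * exp(-(a - b)^2 / (2 * length_scale^2))
k = variance * math.exp(-0.5 * (a - b) ** 2 / length_scale ** 2)
print(k)  # compare with the 0.8399729 returned by both kernels
```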

In [13]:
%timeit old_kernel(a, b).block_until_ready()

5.33 μs ± 15.8 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)


In [14]:
%timeit new_kernel(a, b).block_until_ready()

5.3 μs ± 28.1 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)


### On an array and a scalar

In [15]:
key, subkey = jax.random.split(key)
a = jax.random.uniform(subkey, (10000,))
key, subkey = jax.random.split(key)
b = jax.random.uniform(subkey, ())
a, b

(Array([0.35166634, 0.41402386, 0.25540459, ..., 0.56224417, 0.85059346,
        0.79735932], dtype=float64),
 Array(0.15974131, dtype=float64))

In [16]:
res1 = old_kernel(a, b)
np.asarray(res1)

array([0.98175096, 0.96818721, 0.99543472, ..., 0.92218975, 0.7876997 ,
       0.81605105])

In [17]:
res2 = new_kernel(a, b)
np.asarray(res2)

array([0.98175096, 0.96818721, 0.99543472, ..., 0.92218975, 0.7876997 ,
       0.81605105])

In [18]:
jnp.allclose(res1, res2)

Array(True, dtype=bool)

In [19]:
%timeit old_kernel(a, b).block_until_ready()

52 μs ± 446 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


In [20]:
%timeit new_kernel(a, b).block_until_ready()

51.8 μs ± 298 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


### On two arrays

In [21]:
key, subkey = jax.random.split(key)
a = jax.random.uniform(subkey, (10000,))
key, subkey = jax.random.split(key)
b = jax.random.uniform(subkey, (15000,))
a.shape, b.shape

((10000,), (15000,))

In [22]:
res1 = old_kernel(a, b)
np.asarray(res1).shape

(10000, 15000)

In [23]:
res2 = new_kernel(a, b)
np.asarray(res2).shape

(10000, 15000)

In [24]:
jnp.allclose(res1, res2)

Array(True, dtype=bool)

In [25]:
%timeit old_kernel(a, b).block_until_ready()

157 ms ± 5.68 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [26]:
%timeit new_kernel(a, b).block_until_ready()

163 ms ± 15.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


### On two batches of arrays with common HP

In [27]:
key, subkey = jax.random.split(key)
a = jax.random.uniform(subkey, (50, 100))
key, subkey = jax.random.split(key)
b = jax.random.uniform(subkey, (50, 150))
a.shape, b.shape

((50, 100), (50, 150))

In [28]:
# Using common hyperparameters for all batches
res1 = old_kernel(a, b)
np.asarray(res1).shape

(50, 100, 150)

In [29]:
# Also using common hyperparameters for all batches
res2 = new_kernel(a, b)
np.asarray(res2).shape

(50, 100, 150)

In [30]:
jnp.allclose(res1, res2)

Array(True, dtype=bool)

In [31]:
%timeit old_kernel(a, b).block_until_ready()

840 μs ± 3 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


In [32]:
%timeit new_kernel(a, b).block_until_ready()

840 μs ± 2.73 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


### On two batches of arrays with distinct HP

In [33]:
key, subkey = jax.random.split(key)
a = jax.random.uniform(subkey, (50, 100))  # 50 batches of 100 points
key, subkey = jax.random.split(key)
b = jax.random.uniform(subkey, (50, 150))  # 50 batches of 150 points

# Create distinct hyperparameters for each batch
key, subkey = jax.random.split(key)
distinct_length_scales = jax.random.uniform(subkey, (50,)) + 0.5  # Ensuring positive values
key, subkey = jax.random.split(key)
distinct_variances = jax.random.uniform(subkey, (50,)) + 0.5  # Ensuring positive values

distinct_length_scales, distinct_variances

(Array([0.73314298, 1.16657917, 0.85419874, 0.51307126, 1.18606573,
        0.55039322, 0.95887649, 1.22698946, 0.96474768, 0.80743648,
        1.09766226, 0.80717582, 0.89580246, 1.3499695 , 0.77893643,
        1.40237506, 1.46014276, 0.78476368, 0.60720141, 0.83234395,
        0.80273261, 0.89522365, 0.55596665, 1.07829468, 1.12246261,
        1.31228677, 1.27370113, 1.28929839, 1.177215  , 0.6121177 ,
        0.85959204, 0.97278255, 0.93860787, 0.95404747, 1.35596002,
        1.10412686, 1.49799604, 1.3636239 , 0.9819335 , 1.49066854,
        0.57652775, 1.42748273, 0.96863449, 0.54673638, 0.60064792,
        1.21251168, 0.54389353, 1.28000685, 1.4706205 , 1.37892888],      dtype=float64),
 Array([1.40087618, 0.50347402, 0.81557193, 1.34923058, 1.00457197,
        0.65788761, 0.9219105 , 0.72545942, 1.23310028, 0.91450771,
        0.52153169, 1.07353706, 0.84001156, 0.63915561, 1.28793987,
        0.960267  , 0.5185922 , 1.46299715, 1.2229984 , 0.73096194,
        1.47447511, 1.4126

In [34]:
res1 = old_kernel(a, b, length_scale=distinct_length_scales, variance=distinct_variances)

In [35]:
res2 = new_kernel(a, b, length_scale=distinct_length_scales, variance=distinct_variances)

In [36]:
jnp.allclose(res1, res2)

Array(True, dtype=bool)

In [37]:
%timeit old_kernel(a, b, length_scale=distinct_length_scales, variance=distinct_variances).block_until_ready()

841 μs ± 4.4 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


In [38]:
%timeit new_kernel(a, b, length_scale=distinct_length_scales, variance=distinct_variances).block_until_ready()

839 μs ± 2.56 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


### On two batches of arrays with both common and distinct HP

### On padded datasets

In [39]:
grid = jnp.arange(-200, 200, 1, dtype=jnp.float64)
db = generate_dummy_db(50, 10, 100, grid, key)
all_inputs, padded_inputs, padded_outputs, masks = preprocess_db(db)
all_inputs.shape, padded_inputs.shape, padded_outputs.shape, masks.shape

((399,), (50, 399), (50, 399), (50, 399))

In [40]:
# Covariance on padded matrix
res1 = new_kernel(padded_inputs)
res1.shape

(50, 399, 399)

In [41]:
# Covariance on un-padded matrix
res2 = new_kernel(padded_inputs[0][masks[0]])
res2.shape

(29, 29)

In [43]:
# Check that values in un-padded matrix correspond to the values in the padded matrix
jnp.allclose(
	res1[0][masks[0][:, None] & masks[0][None, :]].reshape(sum(masks[0]), sum(masks[0])),
	res2)

Array(True, dtype=bool)
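
The sub-block selection used in this check can also be written as a row selection followed by a column selection, which avoids the reshape. A small sketch on stand-in arrays (`cov` and `mask` are toy replacements for `res1[0]` and `masks[0]`); like the original check, it relies on boolean indexing and is therefore meant to run outside jit:

```python
from jax import numpy as jnp

cov = jnp.arange(16.).reshape(4, 4)           # stand-in for res1[0]
mask = jnp.array([True, False, True, True])   # stand-in for masks[0]

# Row then column selection keeps the 2-D layout, so no reshape is needed
sub_a = cov[mask][:, mask]

# Same block through the outer-product mask used in the check above
n = int(mask.sum())
sub_b = cov[mask[:, None] & mask[None, :]].reshape(n, n)

print(sub_a.shape)  # (3, 3)
```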

In [44]:
%timeit new_kernel(padded_inputs).block_until_ready()

5.77 ms ± 93.2 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [45]:
jnp.nansum(masks, axis=1).mean()

Array(52.36, dtype=float64)

In [48]:
# Equivalent dense inputs
dense_inputs = jax.random.uniform(subkey, (50, int(jnp.nansum(masks, axis=1).mean().item())))

In [49]:
%timeit new_kernel(dense_inputs).block_until_ready()

198 μs ± 2.72 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
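
To put the two timings in perspective, the ratio of output sizes can be compared with the ratio of run times. The arithmetic below only uses the shapes and timings reported above:

```python
padded_entries = 50 * 399 * 399   # output shape (50, 399, 399)
dense_entries = 50 * 52 * 52      # output shape (50, 52, 52), 52 = int(52.36)

size_ratio = padded_entries / dense_entries
time_ratio = 5.77e-3 / 198e-6     # timings reported above, in seconds

print(f"{size_ratio:.1f}x more entries, {time_ratio:.1f}x slower")
# 58.9x more entries, 29.1x slower
```

The padded computation produces about 59 times more entries but takes about 29 times longer, so the cost per entry is not worse on padded inputs. The overhead comes from the number of padded entries being computed.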


---
## Conclusion

Handling kwargs through lambda functions passed to vmap introduces no measurable slowdown in any of the cases benchmarked above, and should prevent bugs related to the ordering of those kwargs. It is the better alternative.

Computing the kernel on padding-dominated inputs is much slower than on their dense equivalent (5.77 ms vs. 198 μs here, roughly 29 times slower), which is expected. We should investigate ways of speeding up computations on padded matrices, as they are common in practice.

**Potential ways of speeding up the computations on padded matrices:**
- simplify `compute_vector` when the scalar is NaN, but this would require conditional logic
- make `compute_vector` use the mask of the array to compute on a dense array then spread the result
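
The second idea could look like the sketch below, written for a full matrix rather than for `compute_vector`. It is an assumption-laden illustration: `rbf_matrix` and `dense_then_spread` are hypothetical helpers, and because the number of valid points depends on the data, this version is meant to run outside jit. Whether it is actually faster remains to be benchmarked.

```python
from jax import numpy as jnp


def rbf_matrix(x1, x2, length_scale, variance):
    # (N,) and (M,) inputs -> (N, M) covariance through broadcasting
    diff = x1[:, None] - x2[None, :]
    return variance * jnp.exp(-0.5 * diff ** 2 / length_scale ** 2)


def dense_then_spread(x, mask, length_scale, variance):
    # Hypothetical helper, not jittable as written: idx has a data-dependent length
    idx = jnp.flatnonzero(mask)
    dense_cov = rbf_matrix(x[idx], x[idx], length_scale, variance)
    full = jnp.full((x.shape[0], x.shape[0]), jnp.nan)
    # Scatter the (n_valid, n_valid) block back to its padded positions
    return full.at[idx[:, None], idx[None, :]].set(dense_cov)


x = jnp.array([0., jnp.nan, 1., 2.])
mask = ~jnp.isnan(x)
cov = dense_then_spread(x, mask, jnp.array(1.), jnp.array(1.))
print(cov.shape)  # (4, 4)
```

Padded rows and columns are left as NaN here, so the result keeps the padded shape while only the valid block is computed.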

---