## `scikit-learn` SIMD DistanceMetric (`slsdm`)

`slsdm` is a third-party implementation of the `DistanceMetric` computational backend of `scikit-learn` that uses SIMD instructions to significantly accelerate distance computations, especially for data with many features. This notebook offers a simple demo and a performance comparison of `scikit-learn` with and without these accelerated objects.

In [1]:
from slsdm import get_distance_metric
from sklearn.metrics import DistanceMetric
import numpy as np

random_state = 0
n_samples_X = 1_000
n_features = 100
n_classes = 3

rng = np.random.RandomState(random_state)
X = rng.uniform(size=(n_samples_X, n_features))
y = rng.randint(n_classes, size=n_samples_X)

In [2]:
metric = 'euclidean'

# slsdm provides an API similar to scikit-learn's for obtaining a
# specific DistanceMetric instance, but it does so through a public
# function rather than a static method.
dst = get_distance_metric(metric=metric, dtype=X.dtype)
dst_sk = DistanceMetric.get_metric(metric=metric, dtype=X.dtype)
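
Before timing anything, it can be reassuring to check that an accelerated metric returns the same distances as a straightforward computation. The following is a minimal sketch, not part of `slsdm` or `scikit-learn`: `pairwise_euclidean_reference` and `max_abs_difference` are hypothetical helper names introduced here, and the small array `X_small` is an assumption chosen to keep the broadcast intermediate array cheap.

```python
import numpy as np


def pairwise_euclidean_reference(X):
    """Plain NumPy reference (hypothetical helper): D[i, j] = ||X[i] - X[j]||_2."""
    # Broadcasting builds an (n_samples, n_samples, n_features) array,
    # so this is only sensible for small inputs.
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def max_abs_difference(D_a, D_b):
    """Largest elementwise disagreement between two distance matrices."""
    return float(np.max(np.abs(D_a - D_b)))


# Small illustrative data set (sizes are an assumption, not from the notebook).
rng_check = np.random.RandomState(0)
X_small = rng_check.uniform(size=(50, 100))
D_ref = pairwise_euclidean_reference(X_small)
```

In this notebook, `max_abs_difference(dst.pairwise(X_small), D_ref)` and the same call with `dst_sk` would both be expected to be close to zero. Tiny floating-point discrepancies are plausible, since vectorized code may accumulate sums in a different order.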

We can immediately compare the performance of the two metrics. Note that the gap between the two implementations narrows for data with fewer features; however, the `slsdm` implementations are never *worse* than the default `scikit-learn` ones.

In [3]:
%timeit dst.pairwise(X)

5.4 ms ± 23.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [4]:
%timeit dst_sk.pairwise(X)

20.4 ms ± 220 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
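
To see how the gap depends on the number of features, the same measurement can be repeated over several feature counts. The sketch below is one possible way to do that outside of `%timeit`, under stated assumptions: `time_pairwise` is a hypothetical helper, the sample and feature counts are arbitrary, and `numpy_pairwise` is only a stand-in callable so that the block runs without `slsdm` installed.

```python
import timeit

import numpy as np


def time_pairwise(pairwise_fn, n_samples=200, feature_counts=(4, 32, 256),
                  repeats=3, seed=0):
    """Return {n_features: best wall-clock seconds} for one pairwise call."""
    rng = np.random.RandomState(seed)
    timings = {}
    for n_features in feature_counts:
        X = rng.uniform(size=(n_samples, n_features))
        # Take the minimum over repeats to reduce noise from other processes.
        runs = timeit.repeat(lambda: pairwise_fn(X), number=1, repeat=repeats)
        timings[n_features] = min(runs)
    return timings


def numpy_pairwise(X):
    """Stand-in Euclidean pairwise distance (not the slsdm implementation)."""
    sq_norms = (X ** 2).sum(axis=1)
    squared = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    # Clip small negative values caused by floating-point cancellation.
    return np.sqrt(np.maximum(squared, 0.0))


timings = time_pairwise(numpy_pairwise)
for n_features, seconds in timings.items():
    print(f"{n_features:>4} features: {seconds * 1e3:.3f} ms")
```

In this notebook, passing `dst.pairwise` and `dst_sk.pairwise` in place of `numpy_pairwise` would give one timing table per implementation, which can then be compared row by row.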


Most often, rather than using the distance metric objects directly, users will want to pass them to estimators, which then use them in their backend computations. This provides a simple, explicit, yet potent way to accelerate the majority of scikit-learn estimators.

In [5]:
from sklearn.neighbors import KNeighborsRegressor

# Make some dummy data for fit
y = rng.randint(n_classes, size=n_samples_X)

# Note that you can pass an instance of `DistanceMetric`
# directly to the `metric` keyword.
est = KNeighborsRegressor(n_neighbors=2, metric=dst, algorithm="brute").fit(X, y)
est_sk = KNeighborsRegressor(n_neighbors=2, metric=dst_sk, algorithm="brute").fit(X, y)

In [6]:
%timeit est.predict(X)

5.58 ms ± 946 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [7]:
%timeit est_sk.predict(X)

13.1 ms ± 1.44 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
