scikit-bio for SQL — a VGI worker that exposes biological sequence analysis, pairwise alignment, community diversity (including phylogenetic), ordination, distance-matrix statistics, compositional transforms, and phylogenetics to DuckDB as ordinary scalar, aggregate, and table functions.
INSTALL vgi FROM community; LOAD vgi;
-- Attach with zero local install — from PyPI via uvx, or from ghcr via oci://
ATTACH 'skbio' (TYPE vgi, LOCATION 'uvx vgi-scikit-bio');
-- ATTACH 'skbio' (TYPE vgi, LOCATION 'oci://ghcr.io/query-farm/vgi-scikit-bio:0.1.0');
SELECT skbio.sequence.gc_content('ATGCGGATTACAGG'); -- 0.5
SELECT skbio.sequence.reverse_complement('ATGCGGATTACAGG'); -- CCTGTAATCCGCAT
SELECT skbio.sequence.translate('ATGCGGATTACAGGT'); -- MRITGEverything runs in a separate Python process and streams Apache Arrow between DuckDB and scikit-bio, so you get scikit-bio's algorithms without leaving SQL.
- Install & attach
- What's inside
- A worked microbiome example
- Data-shape conventions
- Development
- Deployment
- License & support
First install the DuckDB vgi community extension, then attach the worker via a
LOCATION. Pick whichever way suits you — you don't have to install anything
locally; uvx and oci:// both fetch and run the published worker for you.
INSTALL vgi FROM community;
LOAD vgi;Zero-install from PyPI (uvx) — recommended for a laptop with uv:
ATTACH 'skbio' (TYPE vgi, LOCATION 'uvx vgi-scikit-bio');
SELECT skbio.sequence.gc_content('ATGCGGATTACAGG'); -- 0.5uvx downloads vgi-scikit-bio from
PyPI into a cache and runs its stdio console script; the first attach resolves
the package, later ones are instant.
Container from ghcr (oci://) — recommended when you'd rather not touch
Python at all (just needs Docker or Podman):
ATTACH 'skbio' (TYPE vgi, LOCATION 'oci://ghcr.io/query-farm/vgi-scikit-bio:0.1.0');The extension turns the oci:// address into a pooled docker run -i --rm … stdio and keeps the container warm across queries. Use :latest to track the
newest release, or docker://… (an alias). A struct LOCATION adds options —
{'image': 'oci://…', 'runtime': 'podman', 'volumes': ['skbio_state:/data']}.
Installed console script — if you pip install vgi-scikit-bio, the
vgi-scikit-bio (stdio) and vgi-scikit-bio-http (HTTP) scripts are on PATH:
ATTACH 'skbio' (TYPE vgi, LOCATION 'vgi-scikit-bio');From a source checkout (development):
ATTACH 'skbio' (TYPE vgi, LOCATION 'uv run scikit_bio_worker.py');HTTP server — for a warm, shared, long-lived worker, run the container (or
vgi-scikit-bio-http) and attach over its URL:
docker run -d -p 8000:8000 -e VGI_SIGNING_KEY=change-me ghcr.io/query-farm/vgi-scikit-bio:0.1.0ATTACH 'skbio' (TYPE vgi, LOCATION 'http://localhost:8000');Functions are organised into five schemas — sequence (the default),
alignment, diversity, stats, and tree — so skbio.sequence.gc_content(...)
also resolves as skbio.gc_content(...). There are ~90 functions; the tables
below list them by area (grouped by the schema's category sections, which the
worker also exposes for navigation).
Per-sequence functions over VARCHAR columns of DNA/RNA/protein. Input is
upper-cased and stripped; a NULL or invalid sequence yields NULL rather than
failing the whole query.
- Transforms —
gc_content,gc_frequency,reverse_complement,complement,transcribe,reverse_transcribe,translate,degap - Validation —
is_valid_dna,is_valid_protein,has_gaps,has_degenerates,is_reverse_complement - Distance —
hamming_distance,mismatch_count,match_count - Composition (table functions, long output) —
kmer_frequencies,residue_frequencies,count_subsequence,translate_six_frames
-- Profile reads by GC content, keeping only valid DNA
SELECT id, skbio.sequence.gc_content(seq) AS gc
FROM reads WHERE skbio.sequence.is_valid_dna(seq);
-- 4-mer feature matrix (long → pivot for a wide matrix)
SELECT id, kmer, count
FROM skbio.sequence.kmer_frequencies((SELECT id, seq FROM reads), id := 'id', k := 4);- Score —
align_score_nucleotide(a, b),align_score_protein(a, b)→ optimal global-alignment score - Pairwise —
pairwise_align_nucleotide/pairwise_align_protein→ aligned strings + score + length;mode := 'global'(Needleman–Wunsch) or'local'(Smith–Waterman)
SELECT aligned_1, aligned_2, score
FROM skbio.alignment.pairwise_align_nucleotide((SELECT id, ref, read FROM pairs),
seq1 := 'ref', seq2 := 'read');Community-ecology diversity over a long feature table — one row per
(sample, feature, count).
- Alpha (aggregates over
count,GROUP BY sample) — the full scikit-bio metric family:shannon,simpson,inv_simpson,observed_features,chao1,pielou_evenness,dominance,ace,berger_parker_d,brillouin_d,fisher_alpha,gini_index,goods_coverage,heip_evenness,kempton_taylor_q,margalef,mcintosh_d,mcintosh_e,menhinick,robbins,simpson_d,simpson_e,strong,singles,doubles,enspie; parameterizedhill(count, q := …)/renyi/tsallis; and interval/triple metricschao1_ci,esty_ci,osd(returnDOUBLE[]) - Beta —
beta_diversity(tbl, metric := …)→ the full distance matrix long (braycurtis,jaccard,euclidean,canberra,cosine,jensenshannon, …) - Phylogenetic (given a
tree := '<newick>') —faith_pd(per-sample PD),unifrac(weighted/unweighted distance matrix) - Preprocessing —
subsample_counts(tbl, depth := N)(rarefaction)
SELECT sample_id,
skbio.diversity.shannon(count) AS shannon,
skbio.diversity.chao1(count) AS chao1
FROM feature_table GROUP BY sample_id;
SELECT * FROM skbio.diversity.beta_diversity(
(SELECT sample_id, feature_id, count FROM feature_table), metric := 'braycurtis');- Ordination —
pcoa(dm)(distance matrix),pca(tbl)/ca(tbl)(feature table) →(sample_id, <axis>_1 … k) - Hypothesis tests —
permanova/anosim(grouping in a 4th column),mantel(two distance columns) - Composition — transforms
clr,ilr,alr,closure,centralize,rclr,multi_replace,power; inversesclr_inv,ilr_inv,alr_inv; associationpairwise_vlr - Differential abundance —
ancom,dirmult_ttest(grouping in a 4th column)
pcoa/permanova/anosim/mantel read the long (id_1, id_2, distance) matrix
from beta_diversity. pca/ca and the composition/ANCOM functions read a long
feature table.
-- Embed samples in 2-D from a Bray–Curtis matrix
SELECT * FROM skbio.stats.pcoa(
(SELECT * FROM skbio.diversity.beta_diversity((SELECT * FROM feature_table))),
n_components := 2);- Construction (distance matrix → one
newickrow) —neighbor_joining,upgma,gme,bme - Inspection (scalars over a Newick string) —
tip_count,total_branch_length,tree_height - Comparison (two Newick trees → distance) —
robinson_foulds,weighted_robinson_foulds,cophenetic_distance
SELECT newick
FROM skbio.tree.neighbor_joining(
(SELECT * FROM skbio.diversity.beta_diversity((SELECT * FROM feature_table))));Not exposed (they don't map to a single SQL function): constrained ordination
cca/rdaandbioenv/pwmantel(need a second matrix beyond the one subquery slot);permdisp(an upstream bug in scikit-bio 0.7.3); and stateful/IO surface (TabularMSA, BIOMTable, FASTA/FASTQ readers — DuckDB already reads files) and randomness-seeded estimators (lladser_*,michaelis_menten_fit).
Starting from a long OTU/feature table feature_table(sample_id, feature_id, count)
and a metadata(sample_id, group) table:
-- 1. Alpha diversity per sample
SELECT sample_id, skbio.diversity.shannon(count) AS shannon
FROM feature_table GROUP BY sample_id;
-- 2. Bray–Curtis distances → PCoA coordinates for an ordination plot
CREATE TABLE bc AS
SELECT * FROM skbio.diversity.beta_diversity((SELECT * FROM feature_table));
SELECT * FROM skbio.stats.pcoa((SELECT * FROM bc), n_components := 2);
-- 3. Does the grouping explain the community differences? (PERMANOVA)
SELECT * FROM skbio.stats.permanova((
SELECT bc.id_1, bc.id_2, bc.distance, m.group
FROM bc JOIN metadata m ON bc.id_1 = m.sample_id));
-- 4. Neighbour-joining tree of the samples
SELECT newick FROM skbio.tree.neighbor_joining((SELECT * FROM bc));- Feature tables are long:
(sample_id, feature_id, count)(orvalue). This is the natural SQL shape and avoids a fixed, data-dependent column width.PIVOTif you need a wide matrix. - Distance matrices are long:
(id_1, id_2, distance)— exactly whatbeta_diversityemits, and whatpcoa/permanova/anosim/mantel/neighbor_joiningconsume. The matrix is symmetrised on read, so a full square or a triangle both work (the grouped tests need every sample to appear asid_1, which the full square frombeta_diversityguarantees). - Column names default positionally and can be overridden with named args
(
sample :=,feature :=,count :=,id_1 :=, …). - Bad rows degrade to NULL in the sequence scalars rather than raising.
The repo is an installable package (pyproject.toml, hatchling, uv.lock).
uv sync # resolve deps (scikit-bio, numpy, pandas, scipy, vgi-python[http])
uv run pytest tests/ -q # unit tests
uv run ruff check . # lint
uv run ruff format --check .
uv run mypy vgi_scikit_bio/
make test-stdio # DuckDB sqllogictest suite, worker as a subprocess (authoritative)
make test-http # same suite against a local HTTP serverThe unit tests exercise each function's logic in-process; the SQL suite
(test/sql/*.test) is authoritative because it drives the real DuckDB → VGI →
worker path. It needs a sqllogictest runner with the vgi extension — CI uses a
prebuilt standalone haybarn-unittest plus INSTALL vgi FROM community (no C++
build). See ci/README.md.
To add a function: implement it in the relevant vgi_scikit_bio/*.py, export it
from that module's *_FUNCTIONS list, and splice the list into
_SCHEMA_FUNCTIONS in vgi_scikit_bio/worker.py. See CLAUDE.md for the
framework patterns and sharp edges.
One multi-arch container image (ghcr.io/query-farm/vgi-scikit-bio) serves both
transports — HTTP (default) and stdio — via docker-entrypoint.sh. It is built,
tested against the signed vgi extension, and signed by CI on every tag. The
image mounts a /data volume for the shared framework BoundStorage. make deploy points Fly.io at the published image.
MIT — see LICENSE. Copyright 2026 Query Farm LLC.
Developed and maintained by Query.Farm. For bug reports and feature requests use the issue tracker; for commercial support and SLAs see SUPPORT.md or email hello@query.farm.
Built on scikit-bio — if you use it in research, please also cite scikit-bio.


