vgi-scikit-bio

scikit-bio for SQL — a VGI worker that exposes biological sequence analysis, pairwise alignment, community diversity (including phylogenetic), ordination, distance-matrix statistics, compositional transforms, and phylogenetics to DuckDB as ordinary scalar, aggregate, and table functions.

INSTALL vgi FROM community; LOAD vgi;

-- Attach with zero local install — from PyPI via uvx, or from ghcr via oci://
ATTACH 'skbio' (TYPE vgi, LOCATION 'uvx vgi-scikit-bio');
-- ATTACH 'skbio' (TYPE vgi, LOCATION 'oci://ghcr.io/query-farm/vgi-scikit-bio:0.1.0');

SELECT skbio.sequence.gc_content('ATGCGGATTACAGG');            -- 0.5
SELECT skbio.sequence.reverse_complement('ATGCGGATTACAGG');    -- CCTGTAATCCGCAT
SELECT skbio.sequence.translate('ATGCGGATTACAGGT');            -- MRITG

Everything runs in a separate Python process and streams Apache Arrow between DuckDB and scikit-bio, so you get scikit-bio's algorithms without leaving SQL.

Install & attach

First install the DuckDB vgi community extension, then attach the worker via a LOCATION. Pick whichever way suits you — you don't have to install anything locally; uvx and oci:// both fetch and run the published worker for you.

INSTALL vgi FROM community;
LOAD vgi;

Zero-install from PyPI (uvx) — recommended for a laptop with uv:

ATTACH 'skbio' (TYPE vgi, LOCATION 'uvx vgi-scikit-bio');
SELECT skbio.sequence.gc_content('ATGCGGATTACAGG');   -- 0.5

uvx downloads vgi-scikit-bio from PyPI into a cache and runs its stdio console script; the first attach resolves the package, later ones are instant.

Container from ghcr (oci://) — recommended when you'd rather not touch Python at all (just needs Docker or Podman):

ATTACH 'skbio' (TYPE vgi, LOCATION 'oci://ghcr.io/query-farm/vgi-scikit-bio:0.1.0');

The extension turns the oci:// address into a pooled docker run -i --rm … stdio and keeps the container warm across queries. Use :latest to track the newest release, or docker://… (an alias). A struct LOCATION adds options — {'image': 'oci://…', 'runtime': 'podman', 'volumes': ['skbio_state:/data']}.

Installed console script — if you pip install vgi-scikit-bio, the vgi-scikit-bio (stdio) and vgi-scikit-bio-http (HTTP) scripts are on PATH:

ATTACH 'skbio' (TYPE vgi, LOCATION 'vgi-scikit-bio');

From a source checkout (development):

ATTACH 'skbio' (TYPE vgi, LOCATION 'uv run scikit_bio_worker.py');

HTTP server — for a warm, shared, long-lived worker, run the container (or vgi-scikit-bio-http) and attach over its URL:

docker run -d -p 8000:8000 -e VGI_SIGNING_KEY=change-me ghcr.io/query-farm/vgi-scikit-bio:0.1.0

ATTACH 'skbio' (TYPE vgi, LOCATION 'http://localhost:8000');

Functions are organised into five schemas — sequence (the default), alignment, diversity, stats, and tree — so skbio.sequence.gc_content(...) also resolves as skbio.gc_content(...). There are ~90 functions; the tables below list them by area (grouped by the schema's category sections, which the worker also exposes for navigation).

What's inside

`sequence` — biological sequences

Per-sequence functions over VARCHAR columns of DNA/RNA/protein. Input is upper-cased and stripped; a NULL or invalid sequence yields NULL rather than failing the whole query.

Transforms — gc_content, gc_frequency, reverse_complement, complement, transcribe, reverse_transcribe, translate, degap
Validation — is_valid_dna, is_valid_protein, has_gaps, has_degenerates, is_reverse_complement
Distance — hamming_distance, mismatch_count, match_count
Composition (table functions, long output) — kmer_frequencies, residue_frequencies, count_subsequence, translate_six_frames

-- Profile reads by GC content, keeping only valid DNA
SELECT id, skbio.sequence.gc_content(seq) AS gc
FROM reads WHERE skbio.sequence.is_valid_dna(seq);

-- 4-mer feature matrix (long → pivot for a wide matrix)
SELECT id, kmer, count
FROM skbio.sequence.kmer_frequencies((SELECT id, seq FROM reads), id := 'id', k := 4);

`alignment` — pairwise alignment

Score — align_score_nucleotide(a, b), align_score_protein(a, b) → optimal global-alignment score
Pairwise — pairwise_align_nucleotide / pairwise_align_protein → aligned strings + score + length; mode := 'global' (Needleman–Wunsch) or 'local' (Smith–Waterman)

SELECT aligned_1, aligned_2, score
FROM skbio.alignment.pairwise_align_nucleotide((SELECT id, ref, read FROM pairs),
     seq1 := 'ref', seq2 := 'read');

`diversity` — alpha, beta, phylogenetic, rarefaction

Community-ecology diversity over a long feature table — one row per (sample, feature, count).

Alpha (aggregates over count, GROUP BY sample) — the full scikit-bio metric family: shannon, simpson, inv_simpson, observed_features, chao1, pielou_evenness, dominance, ace, berger_parker_d, brillouin_d, fisher_alpha, gini_index, goods_coverage, heip_evenness, kempton_taylor_q, margalef, mcintosh_d, mcintosh_e, menhinick, robbins, simpson_d, simpson_e, strong, singles, doubles, enspie; parameterized hill(count, q := …) / renyi / tsallis; and interval/triple metrics chao1_ci, esty_ci, osd (return DOUBLE[])
Beta — beta_diversity(tbl, metric := …) → the full distance matrix long (braycurtis, jaccard, euclidean, canberra, cosine, jensenshannon, …)
Phylogenetic (given a tree := '<newick>') — faith_pd (per-sample PD), unifrac (weighted/unweighted distance matrix)
Preprocessing — subsample_counts(tbl, depth := N) (rarefaction)

SELECT sample_id,
       skbio.diversity.shannon(count) AS shannon,
       skbio.diversity.chao1(count)   AS chao1
FROM feature_table GROUP BY sample_id;

SELECT * FROM skbio.diversity.beta_diversity(
  (SELECT sample_id, feature_id, count FROM feature_table), metric := 'braycurtis');

`stats` — ordination, distance tests, composition, differential abundance

Ordination — pcoa(dm) (distance matrix), pca(tbl) / ca(tbl) (feature table) → (sample_id, <axis>_1 … k)
Hypothesis tests — permanova / anosim (grouping in a 4th column), mantel (two distance columns)
Composition — transforms clr, ilr, alr, closure, centralize, rclr, multi_replace, power; inverses clr_inv, ilr_inv, alr_inv; association pairwise_vlr
Differential abundance — ancom, dirmult_ttest (grouping in a 4th column)

pcoa/permanova/anosim/mantel read the long (id_1, id_2, distance) matrix from beta_diversity. pca/ca and the composition/ANCOM functions read a long feature table.

-- Embed samples in 2-D from a Bray–Curtis matrix
SELECT * FROM skbio.stats.pcoa(
  (SELECT * FROM skbio.diversity.beta_diversity((SELECT * FROM feature_table))),
  n_components := 2);

`tree` — phylogenetics

Construction (distance matrix → one newick row) — neighbor_joining, upgma, gme, bme
Inspection (scalars over a Newick string) — tip_count, total_branch_length, tree_height
Comparison (two Newick trees → distance) — robinson_foulds, weighted_robinson_foulds, cophenetic_distance

SELECT newick
FROM skbio.tree.neighbor_joining(
  (SELECT * FROM skbio.diversity.beta_diversity((SELECT * FROM feature_table))));

Not exposed (they don't map to a single SQL function): constrained ordination cca/rda and bioenv/pwmantel (need a second matrix beyond the one subquery slot); permdisp (an upstream bug in scikit-bio 0.7.3); and stateful/IO surface (TabularMSA, BIOM Table, FASTA/FASTQ readers — DuckDB already reads files) and randomness-seeded estimators (lladser_*, michaelis_menten_fit).

A worked microbiome example

Starting from a long OTU/feature table feature_table(sample_id, feature_id, count) and a metadata(sample_id, group) table:

-- 1. Alpha diversity per sample
SELECT sample_id, skbio.diversity.shannon(count) AS shannon
FROM feature_table GROUP BY sample_id;

-- 2. Bray–Curtis distances → PCoA coordinates for an ordination plot
CREATE TABLE bc AS
  SELECT * FROM skbio.diversity.beta_diversity((SELECT * FROM feature_table));

SELECT * FROM skbio.stats.pcoa((SELECT * FROM bc), n_components := 2);

-- 3. Does the grouping explain the community differences? (PERMANOVA)
SELECT * FROM skbio.stats.permanova((
  SELECT bc.id_1, bc.id_2, bc.distance, m.group
  FROM bc JOIN metadata m ON bc.id_1 = m.sample_id));

-- 4. Neighbour-joining tree of the samples
SELECT newick FROM skbio.tree.neighbor_joining((SELECT * FROM bc));

Data-shape conventions

Feature tables are long: (sample_id, feature_id, count) (or value). This is the natural SQL shape and avoids a fixed, data-dependent column width. PIVOT if you need a wide matrix.
Distance matrices are long: (id_1, id_2, distance) — exactly what beta_diversity emits, and what pcoa/permanova/anosim/mantel/ neighbor_joining consume. The matrix is symmetrised on read, so a full square or a triangle both work (the grouped tests need every sample to appear as id_1, which the full square from beta_diversity guarantees).
Column names default positionally and can be overridden with named args (sample :=, feature :=, count :=, id_1 :=, …).
Bad rows degrade to NULL in the sequence scalars rather than raising.

Development

The repo is an installable package (pyproject.toml, hatchling, uv.lock).

uv sync                    # resolve deps (scikit-bio, numpy, pandas, scipy, vgi-python[http])
uv run pytest tests/ -q    # unit tests
uv run ruff check .        # lint
uv run ruff format --check .
uv run mypy vgi_scikit_bio/

make test-stdio            # DuckDB sqllogictest suite, worker as a subprocess (authoritative)
make test-http             # same suite against a local HTTP server

The unit tests exercise each function's logic in-process; the SQL suite (test/sql/*.test) is authoritative because it drives the real DuckDB → VGI → worker path. It needs a sqllogictest runner with the vgi extension — CI uses a prebuilt standalone haybarn-unittest plus INSTALL vgi FROM community (no C++ build). See ci/README.md.

To add a function: implement it in the relevant vgi_scikit_bio/*.py, export it from that module's *_FUNCTIONS list, and splice the list into _SCHEMA_FUNCTIONS in vgi_scikit_bio/worker.py. See CLAUDE.md for the framework patterns and sharp edges.

Deployment

One multi-arch container image (ghcr.io/query-farm/vgi-scikit-bio) serves both transports — HTTP (default) and stdio — via docker-entrypoint.sh. It is built, tested against the signed vgi extension, and signed by CI on every tag. The image mounts a /data volume for the shared framework BoundStorage. make deploy points Fly.io at the published image.

License & support

Developed and maintained by Query.Farm. For bug reports and feature requests use the issue tracker; for commercial support and SLAs see SUPPORT.md or email hello@query.farm.

Built on scikit-bio — if you use it in research, please also cite scikit-bio.

Name		Name	Last commit message	Last commit date
Latest commit History 13 Commits
.github/workflows		.github/workflows
assets		assets
ci		ci
test/sql		test/sql
tests		tests
vgi_scikit_bio		vgi_scikit_bio
.dockerignore		.dockerignore
.gitignore		.gitignore
.vgi-review-cache.json		.vgi-review-cache.json
.vgi-sim-cache.json		.vgi-sim-cache.json
CLAUDE.md		CLAUDE.md
Dockerfile		Dockerfile
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
SUPPORT.md		SUPPORT.md
conftest.py		conftest.py
docker-entrypoint.sh		docker-entrypoint.sh
fly.toml		fly.toml
pyproject.toml		pyproject.toml
scikit_bio_worker.py		scikit_bio_worker.py
serve.py		serve.py
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

vgi-scikit-bio

Contents

Install & attach

What's inside

`sequence` — biological sequences

`alignment` — pairwise alignment

`diversity` — alpha, beta, phylogenetic, rarefaction

`stats` — ordination, distance tests, composition, differential abundance

`tree` — phylogenetics

A worked microbiome example

Data-shape conventions

Development

Deployment

License & support

About

Uh oh!

Releases 2

Packages

Uh oh!

Contributors

Uh oh!

Languages

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

vgi-scikit-bio

Contents

Install & attach

What's inside

sequence — biological sequences

alignment — pairwise alignment

diversity — alpha, beta, phylogenetic, rarefaction

stats — ordination, distance tests, composition, differential abundance

tree — phylogenetics

A worked microbiome example

Data-shape conventions

Development

Deployment

License & support

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 2

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

`sequence` — biological sequences

`alignment` — pairwise alignment

`diversity` — alpha, beta, phylogenetic, rarefaction

`stats` — ordination, distance tests, composition, differential abundance

`tree` — phylogenetics

Packages