Skip to content

Query-farm/vgi-scikit-bio

Repository files navigation

Vector Gateway Interface        scikit-bio        DuckDB

vgi-scikit-bio

scikit-bio for SQL — a VGI worker that exposes biological sequence analysis, pairwise alignment, community diversity (including phylogenetic), ordination, distance-matrix statistics, compositional transforms, and phylogenetics to DuckDB as ordinary scalar, aggregate, and table functions.

INSTALL vgi FROM community; LOAD vgi;

-- Attach with zero local install — from PyPI via uvx, or from ghcr via oci://
ATTACH 'skbio' (TYPE vgi, LOCATION 'uvx vgi-scikit-bio');
-- ATTACH 'skbio' (TYPE vgi, LOCATION 'oci://ghcr.io/query-farm/vgi-scikit-bio:0.1.0');

SELECT skbio.sequence.gc_content('ATGCGGATTACAGG');            -- 0.5
SELECT skbio.sequence.reverse_complement('ATGCGGATTACAGG');    -- CCTGTAATCCGCAT
SELECT skbio.sequence.translate('ATGCGGATTACAGGT');            -- MRITG

Everything runs in a separate Python process and streams Apache Arrow between DuckDB and scikit-bio, so you get scikit-bio's algorithms without leaving SQL.


Contents


Install & attach

First install the DuckDB vgi community extension, then attach the worker via a LOCATION. Pick whichever way suits you — you don't have to install anything locally; uvx and oci:// both fetch and run the published worker for you.

INSTALL vgi FROM community;
LOAD vgi;

Zero-install from PyPI (uvx) — recommended for a laptop with uv:

ATTACH 'skbio' (TYPE vgi, LOCATION 'uvx vgi-scikit-bio');
SELECT skbio.sequence.gc_content('ATGCGGATTACAGG');   -- 0.5

uvx downloads vgi-scikit-bio from PyPI into a cache and runs its stdio console script; the first attach resolves the package, later ones are instant.

Container from ghcr (oci://) — recommended when you'd rather not touch Python at all (just needs Docker or Podman):

ATTACH 'skbio' (TYPE vgi, LOCATION 'oci://ghcr.io/query-farm/vgi-scikit-bio:0.1.0');

The extension turns the oci:// address into a pooled docker run -i --rm … stdio and keeps the container warm across queries. Use :latest to track the newest release, or docker://… (an alias). A struct LOCATION adds options — {'image': 'oci://…', 'runtime': 'podman', 'volumes': ['skbio_state:/data']}.

Installed console script — if you pip install vgi-scikit-bio, the vgi-scikit-bio (stdio) and vgi-scikit-bio-http (HTTP) scripts are on PATH:

ATTACH 'skbio' (TYPE vgi, LOCATION 'vgi-scikit-bio');

From a source checkout (development):

ATTACH 'skbio' (TYPE vgi, LOCATION 'uv run scikit_bio_worker.py');

HTTP server — for a warm, shared, long-lived worker, run the container (or vgi-scikit-bio-http) and attach over its URL:

docker run -d -p 8000:8000 -e VGI_SIGNING_KEY=change-me ghcr.io/query-farm/vgi-scikit-bio:0.1.0
ATTACH 'skbio' (TYPE vgi, LOCATION 'http://localhost:8000');

Functions are organised into five schemassequence (the default), alignment, diversity, stats, and tree — so skbio.sequence.gc_content(...) also resolves as skbio.gc_content(...). There are ~90 functions; the tables below list them by area (grouped by the schema's category sections, which the worker also exposes for navigation).


What's inside

sequence — biological sequences

Per-sequence functions over VARCHAR columns of DNA/RNA/protein. Input is upper-cased and stripped; a NULL or invalid sequence yields NULL rather than failing the whole query.

  • Transformsgc_content, gc_frequency, reverse_complement, complement, transcribe, reverse_transcribe, translate, degap
  • Validationis_valid_dna, is_valid_protein, has_gaps, has_degenerates, is_reverse_complement
  • Distancehamming_distance, mismatch_count, match_count
  • Composition (table functions, long output) — kmer_frequencies, residue_frequencies, count_subsequence, translate_six_frames
-- Profile reads by GC content, keeping only valid DNA
SELECT id, skbio.sequence.gc_content(seq) AS gc
FROM reads WHERE skbio.sequence.is_valid_dna(seq);

-- 4-mer feature matrix (long → pivot for a wide matrix)
SELECT id, kmer, count
FROM skbio.sequence.kmer_frequencies((SELECT id, seq FROM reads), id := 'id', k := 4);

alignment — pairwise alignment

  • Scorealign_score_nucleotide(a, b), align_score_protein(a, b) → optimal global-alignment score
  • Pairwisepairwise_align_nucleotide / pairwise_align_protein → aligned strings + score + length; mode := 'global' (Needleman–Wunsch) or 'local' (Smith–Waterman)
SELECT aligned_1, aligned_2, score
FROM skbio.alignment.pairwise_align_nucleotide((SELECT id, ref, read FROM pairs),
     seq1 := 'ref', seq2 := 'read');

diversity — alpha, beta, phylogenetic, rarefaction

Community-ecology diversity over a long feature table — one row per (sample, feature, count).

  • Alpha (aggregates over count, GROUP BY sample) — the full scikit-bio metric family: shannon, simpson, inv_simpson, observed_features, chao1, pielou_evenness, dominance, ace, berger_parker_d, brillouin_d, fisher_alpha, gini_index, goods_coverage, heip_evenness, kempton_taylor_q, margalef, mcintosh_d, mcintosh_e, menhinick, robbins, simpson_d, simpson_e, strong, singles, doubles, enspie; parameterized hill(count, q := …) / renyi / tsallis; and interval/triple metrics chao1_ci, esty_ci, osd (return DOUBLE[])
  • Betabeta_diversity(tbl, metric := …) → the full distance matrix long (braycurtis, jaccard, euclidean, canberra, cosine, jensenshannon, …)
  • Phylogenetic (given a tree := '<newick>') — faith_pd (per-sample PD), unifrac (weighted/unweighted distance matrix)
  • Preprocessingsubsample_counts(tbl, depth := N) (rarefaction)
SELECT sample_id,
       skbio.diversity.shannon(count) AS shannon,
       skbio.diversity.chao1(count)   AS chao1
FROM feature_table GROUP BY sample_id;

SELECT * FROM skbio.diversity.beta_diversity(
  (SELECT sample_id, feature_id, count FROM feature_table), metric := 'braycurtis');

stats — ordination, distance tests, composition, differential abundance

  • Ordinationpcoa(dm) (distance matrix), pca(tbl) / ca(tbl) (feature table) → (sample_id, <axis>_1 … k)
  • Hypothesis testspermanova / anosim (grouping in a 4th column), mantel (two distance columns)
  • Composition — transforms clr, ilr, alr, closure, centralize, rclr, multi_replace, power; inverses clr_inv, ilr_inv, alr_inv; association pairwise_vlr
  • Differential abundanceancom, dirmult_ttest (grouping in a 4th column)

pcoa/permanova/anosim/mantel read the long (id_1, id_2, distance) matrix from beta_diversity. pca/ca and the composition/ANCOM functions read a long feature table.

-- Embed samples in 2-D from a Bray–Curtis matrix
SELECT * FROM skbio.stats.pcoa(
  (SELECT * FROM skbio.diversity.beta_diversity((SELECT * FROM feature_table))),
  n_components := 2);

tree — phylogenetics

  • Construction (distance matrix → one newick row) — neighbor_joining, upgma, gme, bme
  • Inspection (scalars over a Newick string) — tip_count, total_branch_length, tree_height
  • Comparison (two Newick trees → distance) — robinson_foulds, weighted_robinson_foulds, cophenetic_distance
SELECT newick
FROM skbio.tree.neighbor_joining(
  (SELECT * FROM skbio.diversity.beta_diversity((SELECT * FROM feature_table))));

Not exposed (they don't map to a single SQL function): constrained ordination cca/rda and bioenv/pwmantel (need a second matrix beyond the one subquery slot); permdisp (an upstream bug in scikit-bio 0.7.3); and stateful/IO surface (TabularMSA, BIOM Table, FASTA/FASTQ readers — DuckDB already reads files) and randomness-seeded estimators (lladser_*, michaelis_menten_fit).


A worked microbiome example

Starting from a long OTU/feature table feature_table(sample_id, feature_id, count) and a metadata(sample_id, group) table:

-- 1. Alpha diversity per sample
SELECT sample_id, skbio.diversity.shannon(count) AS shannon
FROM feature_table GROUP BY sample_id;

-- 2. Bray–Curtis distances → PCoA coordinates for an ordination plot
CREATE TABLE bc AS
  SELECT * FROM skbio.diversity.beta_diversity((SELECT * FROM feature_table));

SELECT * FROM skbio.stats.pcoa((SELECT * FROM bc), n_components := 2);

-- 3. Does the grouping explain the community differences? (PERMANOVA)
SELECT * FROM skbio.stats.permanova((
  SELECT bc.id_1, bc.id_2, bc.distance, m.group
  FROM bc JOIN metadata m ON bc.id_1 = m.sample_id));

-- 4. Neighbour-joining tree of the samples
SELECT newick FROM skbio.tree.neighbor_joining((SELECT * FROM bc));

Data-shape conventions

  • Feature tables are long: (sample_id, feature_id, count) (or value). This is the natural SQL shape and avoids a fixed, data-dependent column width. PIVOT if you need a wide matrix.
  • Distance matrices are long: (id_1, id_2, distance) — exactly what beta_diversity emits, and what pcoa/permanova/anosim/mantel/ neighbor_joining consume. The matrix is symmetrised on read, so a full square or a triangle both work (the grouped tests need every sample to appear as id_1, which the full square from beta_diversity guarantees).
  • Column names default positionally and can be overridden with named args (sample :=, feature :=, count :=, id_1 :=, …).
  • Bad rows degrade to NULL in the sequence scalars rather than raising.

Development

The repo is an installable package (pyproject.toml, hatchling, uv.lock).

uv sync                    # resolve deps (scikit-bio, numpy, pandas, scipy, vgi-python[http])
uv run pytest tests/ -q    # unit tests
uv run ruff check .        # lint
uv run ruff format --check .
uv run mypy vgi_scikit_bio/

make test-stdio            # DuckDB sqllogictest suite, worker as a subprocess (authoritative)
make test-http             # same suite against a local HTTP server

The unit tests exercise each function's logic in-process; the SQL suite (test/sql/*.test) is authoritative because it drives the real DuckDB → VGI → worker path. It needs a sqllogictest runner with the vgi extension — CI uses a prebuilt standalone haybarn-unittest plus INSTALL vgi FROM community (no C++ build). See ci/README.md.

To add a function: implement it in the relevant vgi_scikit_bio/*.py, export it from that module's *_FUNCTIONS list, and splice the list into _SCHEMA_FUNCTIONS in vgi_scikit_bio/worker.py. See CLAUDE.md for the framework patterns and sharp edges.

Deployment

One multi-arch container image (ghcr.io/query-farm/vgi-scikit-bio) serves both transports — HTTP (default) and stdio — via docker-entrypoint.sh. It is built, tested against the signed vgi extension, and signed by CI on every tag. The image mounts a /data volume for the shared framework BoundStorage. make deploy points Fly.io at the published image.

License & support

MIT — see LICENSE. Copyright 2026 Query Farm LLC.

Developed and maintained by Query.Farm. For bug reports and feature requests use the issue tracker; for commercial support and SLAs see SUPPORT.md or email hello@query.farm.

Built on scikit-bio — if you use it in research, please also cite scikit-bio.

About

scikit-bio for DuckDB/SQL — sequence analysis, pairwise alignment, community diversity, ordination, and phylogenetics, exposed as a VGI worker

Topics

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors