Skip to content

LangeLab/HarmonizePy

Repository files navigation

HarmonizePy

Pure-Python batch-effect harmonization toolkit validated against R sva::ComBat, limma::removeBatchEffect, and HarmonizR.

Python 3.12-3.14 v0.3.2 Alpha CI 628 tests collected Coverage GPL-3.0

Changelog Citation Issues Docs

HarmonizePy is a pure-Python batch-effect correction package for omics data. The high-level harmonize() entry point handles structural missingness automatically: features are grouped by observed batch support, corrected with ComBat or limma within compatible subsets, and reassembled without imputation. No R installation is required at runtime.

Documentation

Capabilities

ComBat (Johnson et al. 2007). All four modes:

  • Mode 1: parametric, location + scale
  • Mode 2: parametric, location only
  • Mode 3: non-parametric, location + scale
  • Mode 4: non-parametric, location only

limma (Ritchie et al. 2015). Linear-model batch correction via sum-to-zero contrasts.

HarmonizR-compatible pipeline (feature parity with v1.10.0):

  • Structural missingness handled automatically: features are grouped by observed batch support, corrected in compatible subsets, and reassembled into the original matrix shape.
  • Batch sorting: reorder batches so similar ones become neighbours. Three strategies: "sparsity" (completeness-based, fastest), "jaccard" (pairwise feature-overlap similarity), "seriation" (PCA-based ordering, more accurate for many batches).
  • Batch blocking: group block=N consecutive (optionally sorted) batches into a single sub-matrix, reducing peak memory and improving adjustment quality for datasets with many batches.
  • Unique-combination removal (unique_removal=True, default): features whose batch-presence pattern is unique across the dataset get rescued. They would otherwise form a singleton sub-matrix and be dropped. The algorithm crops their affiliation to the nearest shared pattern, trading some non-missing data for inclusion in a group adjustment.

Validation against R reference workflows: the package is tested against sva::ComBat, limma::removeBatchEffect, and HarmonizR v1.10.0 reference outputs. Known edge-case differences in retention policy are documented in the R Compatibility wiki page. Runtime benchmarks across six datasets are in Benchmarks.

Installation

# Package install (if published for this release)
pip install harmonizepy

# From GitHub
pip install git+https://github.com/LangeLab/HarmonizePy.git

# Development install with uv
uv python install 3.12.13
uv sync --dev --python 3.12.13

# Optional extras
pip install harmonizepy[config]      # YAML/TOML support for --config flag
pip install harmonizepy[completion]  # Shell tab-completion via argcomplete
pip install harmonizepy[io]          # Parquet IO and pyarrow-accelerated CSV/TSV paths

See Installation for the full dependency list and optional extras detail.

Quick start

harmonize() returns a pandas.DataFrame with the same shape and column order as the input. Features are rows (DataFrame index), samples are columns.

import pandas as pd
from harmonizepy import harmonize

# From file paths (returns a DataFrame)
result = harmonize("data.tsv", "description.csv")
# result is a pd.DataFrame, features as index, samples as columns

# From DataFrames
data_df = pd.read_csv("data.tsv", sep="\t", index_col=0)
desc_df = pd.read_csv("description.csv")
result = harmonize(data_df, desc_df, algorithm="limma", sort="sparsity", block=2)

# Low-level API (ndarray in, ndarray out)
from harmonizepy import combat, remove_batch_effect

corrected = combat(data_matrix, batch_labels, par_prior=True, mean_only=False)
corrected = remove_batch_effect(data_matrix, batch_labels)

The low-level engines handle missing observations per feature and preserve NaN positions in the returned array.

Input format

  • data: features × samples matrix. For TSV/CSV files, the first column must contain feature names (read as the DataFrame index). When passing a DataFrame directly, feature names must be in DataFrame.index and sample names in DataFrame.columns.
  • description: CSV/DataFrame with columns ID (sample names matching data columns), sample (integer), batch (integer batch label).

Parameter reference

Parameter Type Default Description
algorithm "ComBat" or "limma" "ComBat" Batch correction algorithm
combat_mode 1, 2, 3, 4 1 ComBat variant (ignored for limma)
sort "sparsity", "jaccard", "seriation", or None None Batch sorting strategy before blocking
block int >= 2 or None None Group N consecutive batches into one sub-matrix block
unique_removal bool True Rescue singleton features by cropping to nearest shared pattern

Full signature with all parameters is in the API Reference.

Choosing algorithm and strategy

Start with the default (ComBat, mode 1). For very small batch sizes (n < 10 per batch), prefer combat_mode=3 (non-parametric). For datasets with 5 or more batches, sort="sparsity" and block=2 group similar batches together before correction. Use sort="jaccard" when batch sizes are uneven. See Algorithms and Pipeline for full guidance.

CLI usage

# Basic
harmonizepy data.tsv batch.csv -o corrected.tsv

# Algorithm, sort, and block
harmonizepy data.tsv batch.csv --algorithm limma -o corrected.tsv
harmonizepy data.tsv batch.csv --combat-mode 3 --sort sparsity --block 2 -o corrected.tsv

# Validate inputs without running correction
harmonizepy data.tsv batch.csv --dry-run

# Config file (TOML, JSON, or YAML with [config] extra)
harmonizepy data.tsv batch.csv --config run.toml

When -o is omitted, the output is written as <data_stem>_corrected.tsv next to the input file. See CLI Reference for all flags, output formats, and config file syntax.

Running tests

uv run pytest               # full test suite
uv run pytest tests/ -v     # verbose

The suite covers R concordance, edge cases, failure modes, numerical stability, CLI integration, and the benchmark harness. R fixture outputs are committed to the repository, so the full concordance suite runs on a plain checkout without a live R environment. Regenerating fixtures from source requires R with sva, limma, HarmonizR, and seriation managed via renv.

Project layout

src/harmonizepy/    # Package source: pipeline, engines, CLI, I/O
tests/              # 628 tests: unit, integration, R concordance, CLI
benchmarks/         # Benchmark CLI, dataset catalog, results
data/               # Small showcase datasets

License

HarmonizePy is licensed under GPL-3.0.

References

Manuscripts

  • Johnson WE, Li C, Rabinovic A. "Adjusting batch effects in microarray expression data using empirical Bayes methods." Biostatistics 8(1):118-127, 2007.
  • Ritchie ME et al. "limma powers differential expression analyses for RNA-sequencing and microarray studies." Nucleic Acids Research 43(7):e47, 2015.
  • Voß H et al. "HarmonizR enables data harmonization across independent proteomic datasets with appropriate handling of missing values." Nature Communications 13:3523, 2022. https://doi.org/10.1038/s41467-022-31007-x
  • Schlumbohm S, Neumann JE, Neumann P. "HarmonizR: blocking and singular feature data adjustment improve runtime efficiency and data preservation." BMC Bioinformatics, 2025. https://doi.org/10.1186/s12859-025-06073-9

Code and packages

The ComBat algorithm was introduced by Johnson, Li & Rabinovic (2007). limma::removeBatchEffect was published by Ritchie et al. (2015). The HarmonizR pipeline (structural-missingness handling, batch sorting, blocking, unique-combination removal) was developed by Voß et al. (2022) and Schlumbohm, Neumann & Neumann (2025). HarmonizePy is not a line-by-line port of the R sources: the engine logic was reimplemented from the published manuscripts and validated against R reference output.

About

Pure-Python batch-effect harmonization toolkit validated, ported from HarmonizR

Topics

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors