SOHI is an explainable AI framework that predicts measurable soil-health outcomes, specifically Soil Organic Carbon (SOC) and Potentially Mineralizable Nitrogen (PMN), from genome-resolved metagenomes.
The platform integrates high-performance bioinformatics, causal inference via structural equation modeling (SEM), and explainable machine learning (TreeSHAP) to move beyond taxonomic inventories toward actionable, functional soil diagnostics.
This repository is organized to support the three core objectives of the SOHI framework:
- Context Harmonizer: A Python module to standardize heterogeneous metadata (e.g., normalizing pH methods, texture classes, and units).
- Bioinformatics Pipeline: Snakemake workflows for:
  - Metagenomic assembly (MEGAHIT/OMEGA)
  - High-fidelity binning (MetaBAT2, MaxBin2, CONCOCT)
  - Functional annotation (KEGG, CAZy)
- Multi-Level Feature Aggregation (MLFA): Scripts to aggregate MAGs into "Microbial Cooperatives" and genes into "Functional Modules" (C/N/P cycling).
- Causal Modeling: Structural Equation Models (SEM) to filter features based on ecological plausibility.
- Explainable AI: XGBoost/Random Forest implementations with TreeSHAP integration to generate "SOHI Indicator Cards."
- Protocols and datasets for the registered benchmark.
- Scripts for computational cross-validation against archived historical data (e.g., Morrow Plots).
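
To make the Context Harmonizer idea more concrete, here is a minimal sketch of the kind of normalization it describes: mapping free-text texture labels to one canonical class and converting SOC values to a single unit. This is not the code in `context_harmonizer/harmonize.py`. The field names (`texture`, `soc`, `soc_unit`), the synonym table, and the function `harmonize_record` are assumptions made for illustration, and the real module and its `schema.json` may be organized quite differently.

```python
"""Illustrative metadata harmonization sketch (not the SOHI implementation)."""

# Hypothetical synonym table: free-text texture labels -> canonical class.
TEXTURE_SYNONYMS = {
    "silt loam": "silt loam",
    "silty loam": "silt loam",
    "sil": "silt loam",
    "sandy loam": "sandy loam",
    "sl": "sandy loam",
    "clay": "clay",
}

# Multipliers that convert a reported SOC value to g/kg.
SOC_TO_G_PER_KG = {
    "g/kg": 1.0,
    "mg/g": 1.0,   # mg/g is numerically identical to g/kg
    "%": 10.0,     # 1 % by mass equals 10 g/kg
}


def harmonize_record(record):
    """Return a copy of `record` with a canonical texture class and SOC in g/kg."""
    texture_key = record["texture"].strip().lower()
    unit_key = record["soc_unit"].strip().lower()

    # Fail loudly on unknown values rather than guessing a mapping.
    if texture_key not in TEXTURE_SYNONYMS:
        raise ValueError(f"unknown texture class: {record['texture']!r}")
    if unit_key not in SOC_TO_G_PER_KG:
        raise ValueError(f"unknown SOC unit: {record['soc_unit']!r}")

    cleaned = dict(record)
    cleaned["texture"] = TEXTURE_SYNONYMS[texture_key]
    cleaned["soc"] = float(record["soc"]) * SOC_TO_G_PER_KG[unit_key]
    cleaned["soc_unit"] = "g/kg"
    return cleaned


if __name__ == "__main__":
    # Made-up rows, used only to exercise the function.
    raw_rows = [
        {"sample_id": "S1", "texture": " Silty Loam ", "soc": "2.5", "soc_unit": "%"},
        {"sample_id": "S2", "texture": "SL", "soc": "18", "soc_unit": "mg/g"},
    ]
    for row in raw_rows:
        print(harmonize_record(row))
```

Raising an error on an unrecognized label or unit, instead of passing the value through, is one possible design choice: it keeps silently mislabeled records out of downstream models, at the cost of requiring the lookup tables to be extended as new datasets arrive.
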
Prerequisites:

- Conda or Mamba
- Docker (optional, for containerized runs)
git clone https://github.com/BioHPC/SOHI.git
cd SOHI
# Create the environment
conda env create -f workflow/envs/sohi_core.yaml
conda activate sohi_core

Standardize heterogeneous soil and environmental metadata using the context harmonizer.
python context_harmonizer/harmonize.py \
--input raw_metadata.csv \
--schema context_harmonizer/schema.json \
--output clean_metadata.csv

Predict soil health indicators (e.g., PMN) using aggregated functional features and harmonized metadata.
python modeling/predict_pmn.py \
--features aggregated_modules.tsv \
--metadata clean_metadata.csv

Dr. Tae Hyuk Ahn
Department of Computer Science
Saint Louis University
taehyuk.ahn@slu.edu
Dr. Laibin Huang
Department of Biology
Saint Louis University
laibin.huang@slu.edu