Varis — Structural Evidence for Every Variant

An AI-powered platform that doesn't just classify mutations — it investigates them.

Varis is a structural investigator for genetic variants. While tools like AlphaMissense give clinicians a single pathogenicity score, Varis investigates why a mutation is damaging — orchestrating a multi-tool bioinformatics pipeline to calculate protein destabilization, map functional domains, and generate the structural evidence clinicians need to diagnose rare disease.

A Russell Genetics product.

Project Status (v2026.03)

Component	Status	Details
M1: Data Ingestion	Complete	ClinVar, UniProt, gnomAD, AlphaMissense clients
M2: Structure Engine	Complete	AlphaFold + ESMFold fallback
M3: Structural Analysis	Complete	EvoEF2 DDG, FreeSASA, DSSP, contacts, InterPro domains
M4: Conservation	Complete	UniProt orthologs + Clustal Omega + ConSurf fallback
M5: ML Scoring	Complete	Ensemble trained on 535 ClinVar variants (ROC-AUC 0.846)
M6: Platform	Complete	FastAPI + PostgreSQL + React frontend + PDF reports
M7: Self-Evolution	Partial	Auto-retrain loop done; tool scout & auto-integrator stubbed
CI/CD	Complete	GitHub Actions (pytest, ruff, mypy), Docker, docker-compose
Tests	269 passing	Across all modules

Quick Start

# Install
pip install -e ".[all]"

# Investigate a variant
python -m varis BRCA1 p.Arg1699Trp

# Run validation suite
python -m varis --validate
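
The second argument in the investigate command is a protein-level change in HGVS three-letter notation. As a rough illustration of what parsing such a string involves, here is a minimal sketch. It is not Varis's actual parser: the function name `parse_protein_change` and the returned keys are hypothetical, and only simple missense substitutions are handled.

```python
import re

# Standard three-letter to one-letter amino acid codes.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}

# p.<Ref><position><Alt>, e.g. p.Arg1699Trp
_MISSENSE = re.compile(r"^p\.([A-Z][a-z]{2})(\d+)([A-Z][a-z]{2})$")


def parse_protein_change(hgvs_p: str) -> dict:
    """Split a missense change such as 'p.Arg1699Trp' into its parts.

    Hypothetical helper for illustration only.
    """
    match = _MISSENSE.match(hgvs_p)
    if match is None:
        raise ValueError(f"not a simple missense change: {hgvs_p!r}")
    ref, position, alt = match.groups()
    if ref not in THREE_TO_ONE or alt not in THREE_TO_ONE:
        raise ValueError(f"unknown amino acid code in {hgvs_p!r}")
    return {
        "ref": THREE_TO_ONE[ref],
        "position": int(position),
        "alt": THREE_TO_ONE[alt],
    }


print(parse_protein_change("p.Arg1699Trp"))
# {'ref': 'R', 'position': 1699, 'alt': 'W'}
```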

Architecture

Varis is built as 7 independent modules connected by a shared Variant Record:

Module	What It Does
M1: Data Ingestion	Parses variants, retrieves data from ClinVar/gnomAD/UniProt/AlphaFold
M2: Structure Engine	Obtains and prepares 3D protein structures
M3: Structural Analysis	Extracts structural features (ΔΔG, SASA, DSSP, contacts, domains)
M4: Conservation	Calculates evolutionary conservation (independent of M2/M3)
M5: ML Scoring	Trains/runs interpretable ensemble (CatBoost + XGBoost + LightGBM)
M6: Platform (VarisDB)	Database, API, frontend, 3D viewer, PDF reports
M7: Self-Evolution	Auto-retrain, tool discovery, evolution log

Key design principle: If any module fails, the pipeline continues with whatever data it has. The ML ensemble handles missing features natively.
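
One way to picture this principle is a runner that records a failure and moves on instead of aborting. The sketch below is an assumption about the general shape, not the project's code: the `VariantRecord` fields, `run_pipeline`, and the two example steps are hypothetical, and the feature value is a placeholder rather than a real result.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class VariantRecord:
    """Hypothetical stand-in for the shared Variant Record."""
    gene: str
    protein_change: str
    features: dict[str, Optional[float]] = field(default_factory=dict)
    failures: dict[str, str] = field(default_factory=dict)


Step = Callable[[VariantRecord], None]


def run_pipeline(record: VariantRecord, steps: list[tuple[str, Step]]) -> VariantRecord:
    """Run every step; a failing step is logged on the record, not re-raised."""
    for name, step in steps:
        try:
            step(record)
        except Exception as exc:  # deliberately broad: one bad tool must not stop the run
            record.failures[name] = str(exc)
    return record


def ddg_step(record: VariantRecord) -> None:
    # Simulates an optional tool that is not installed.
    raise RuntimeError("EvoEF2 binary not found")


def conservation_step(record: VariantRecord) -> None:
    record.features["conservation_score"] = 0.5  # placeholder value, not real data


record = run_pipeline(
    VariantRecord("BRCA1", "p.Arg1699Trp"),
    [("m3_ddg", ddg_step), ("m4_conservation", conservation_step)],
)
print(record.features)  # {'conservation_score': 0.5}
print(record.failures)  # {'m3_ddg': 'EvoEF2 binary not found'}
```

Features that were never written stay absent from the record, which is what lets a downstream model treat them as missing.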

Structural Features Computed

For each variant, the pipeline extracts up to 16 features:

Feature	Source	Description
ddg_evoef2	EvoEF2	Stability change (kcal/mol), positive = destabilizing
ddg_foldx	FoldX (optional)	Stability change from FoldX
ddg_pyrosetta	PyRosetta (optional)	Stability change from PyRosetta
ddg_mean	M3 orchestrator	Mean of available DDG methods
solvent_accessibility_relative	FreeSASA	Relative SASA (0=buried, 1=exposed)
burial_category	FreeSASA	core/surface classification
secondary_structure	DSSP	helix/sheet/coil at mutation site
contacts_wt	BioPython	Heavy-atom contacts within 4.5 Å
hbonds_wt	BioPython	Hydrogen bonds at mutation site
packing_density	BioPython	Local packing (contacts / target atoms)
domain_criticality	InterPro	critical/important/peripheral
plddt_at_residue	AlphaFold/ESMFold	Structure confidence at mutation site
plddt_mean	AlphaFold/ESMFold	Mean structure confidence
gnomad_frequency	gnomAD	Population allele frequency
conservation_score	Clustal/ConSurf	Evolutionary conservation (0-1)
alphamissense_score	AlphaMissense	External pathogenicity score
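
The `ddg_mean` row averages whichever ΔΔG methods actually produced a value. A minimal sketch of that idea follows, assuming missing methods are represented as `None` or NaN; the function name `mean_available_ddg` is hypothetical and the numbers in the example are made up for illustration.

```python
import math
from typing import Optional


def mean_available_ddg(ddg_values: dict[str, Optional[float]]) -> Optional[float]:
    """Average the DDG estimates that are present; return None if there are none."""
    present = [
        value
        for value in ddg_values.values()
        if value is not None and not math.isnan(value)
    ]
    if not present:
        return None
    return sum(present) / len(present)


# Illustrative values only (kcal/mol); FoldX is treated as not installed.
example = {"ddg_evoef2": 2.0, "ddg_foldx": None, "ddg_pyrosetta": 3.0}
print(mean_available_ddg(example))  # 2.5
```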

ML Model

Training Data

535 variants from ClinVar (271 pathogenic, 264 benign)
19 genes: BRCA1, BRCA2, TP53, CFTR, MSH2, MLH1, MSH6, PMS2, PTEN, RB1, APC, VHL, MEN1, RET, CDH1, PALB2, ATM, CHEK2, RAD51C
RAD51D skipped (insufficient pathogenic variants)
Variants of uncertain significance (VUS) explicitly excluded: they are the prediction target
5 benchmark variants held out, never trained on
Simulated 15% feature missingness during training for robustness
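
The last item, simulated missingness, can be sketched as random masking of feature values before training. The README does not say how Varis applies the 15% (per cell, per feature, or per variant), so the version below assumes independent per-cell masking; `mask_features` is a hypothetical name and the input rows are dummy data.

```python
import math
import random


def mask_features(
    rows: list[list[float]],
    missing_rate: float = 0.15,
    seed: int = 0,
) -> list[list[float]]:
    """Return a copy of rows with each cell replaced by NaN with probability missing_rate."""
    rng = random.Random(seed)  # seeded so the masking is reproducible
    return [
        [math.nan if rng.random() < missing_rate else value for value in row]
        for row in rows
    ]


# Dummy matrix: 1000 variants x 16 features, all set to 1.0.
rows = [[1.0] * 16 for _ in range(1000)]
masked = mask_features(rows)
n_missing = sum(math.isnan(value) for row in masked for value in row)
print(f"masked fraction: {n_missing / (1000 * 16):.3f}")
```

Gradient-boosted tree libraries such as CatBoost, XGBoost, and LightGBM accept NaN inputs, so a matrix masked this way can be passed to them without imputation.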

Performance (Gene-Stratified Cross-Validation)

Fold	Held-Out Genes	ROC-AUC	PR-AUC
1	CFTR, PALB2, RB1, TP53	0.884	0.853
2	BRCA1, BRCA2, MSH2, RAD51C	0.827	0.816
3	CDH1, MEN1, MSH6, RET	0.860	0.857
4	APC, MLH1, PTEN, VHL	0.860	0.867
5	ATM, CHEK2, PMS2	0.799	0.770
Mean		0.846 +/- 0.030	0.833 +/- 0.036

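Every gene appears in exactly one held-out fold, so no model is evaluated on a gene it saw during training. This prevents gene-level leakage and typically gives a more conservative estimate than splitting at the variant level.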
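
The fold table can be checked with a few lines of Python. The sketch below uses only the numbers listed above: it confirms that the held-out gene sets do not overlap and cover all 19 training genes, and it recomputes the summary row. The reported spread matches the population standard deviation (`statistics.pstdev`), not the sample standard deviation.

```python
from statistics import mean, pstdev

# fold -> (held-out genes, ROC-AUC, PR-AUC), copied from the table above
FOLDS = {
    1: (["CFTR", "PALB2", "RB1", "TP53"], 0.884, 0.853),
    2: (["BRCA1", "BRCA2", "MSH2", "RAD51C"], 0.827, 0.816),
    3: (["CDH1", "MEN1", "MSH6", "RET"], 0.860, 0.857),
    4: (["APC", "MLH1", "PTEN", "VHL"], 0.860, 0.867),
    5: (["ATM", "CHEK2", "PMS2"], 0.799, 0.770),
}

# Each gene must be held out exactly once across the five folds.
held_out = [gene for genes, _, _ in FOLDS.values() for gene in genes]
assert len(held_out) == len(set(held_out)) == 19

roc = [roc_auc for _, roc_auc, _ in FOLDS.values()]
pr = [pr_auc for _, _, pr_auc in FOLDS.values()]

print(f"ROC-AUC {mean(roc):.3f} +/- {pstdev(roc):.3f}")  # ROC-AUC 0.846 +/- 0.030
print(f"PR-AUC  {mean(pr):.3f} +/- {pstdev(pr):.3f}")    # PR-AUC  0.833 +/- 0.036
```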

Ensemble Architecture

Three gradient-boosted tree models, each trained independently:

CatBoost — handles categoricals natively, ordered boosting
XGBoost — strong regularization, histogram-based
LightGBM — leaf-wise growth, fast training

Final score = mean of three model scores. SHAP values computed per-prediction using TreeSHAP for full interpretability.
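
The averaging step is simple enough to write out. This is a sketch under assumptions: the dictionary keys and the function name `ensemble_score` are hypothetical, the scores are made-up numbers, and the SHAP computation is not shown.

```python
from statistics import mean

EXPECTED_MODELS = {"catboost", "xgboost", "lightgbm"}


def ensemble_score(model_scores: dict[str, float]) -> float:
    """Return the unweighted mean of the three per-model pathogenicity scores."""
    if set(model_scores) != EXPECTED_MODELS:
        raise ValueError(f"expected scores for {sorted(EXPECTED_MODELS)}")
    for name, score in model_scores.items():
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"{name} score {score} is outside [0, 1]")
    return mean(model_scores.values())


# Illustrative scores only.
print(round(ensemble_score({"catboost": 0.9, "xgboost": 0.8, "lightgbm": 0.7}), 3))
```

An unweighted mean keeps the final score easy to explain: each model contributes equally, and a disagreement between models shows up directly as a score closer to the middle of the range.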

Prerequisites & Licensing

Varis's own code is open source under the MIT license. However, some structural analysis tools require separate free academic licenses that cannot be redistributed with Varis:

Tool	License	Status	Required?
EvoEF2	MIT	Primary DDG tool, compiled from source	Recommended
FoldX	CRG Barcelona (free academic)	Optional fallback DDG	No
PyRosetta	RosettaCommons (free academic)	Optional fallback DDG	No

If optional tools are not installed, Varis's fallback architecture will skip the affected analysis steps and continue with the remaining tools. The ML ensemble handles missing features natively. Results will have reduced feature depth but remain valid.

Other dependencies (FreeSASA, DSSP, HMMER, BioPython, etc.) are fully open source and installed automatically via pip install.

Installing EvoEF2

EvoEF2 is the primary stability prediction tool (MIT licensed). To compile from source:

git clone https://github.com/tommyhuangthu/EvoEF2.git /tmp/EvoEF2
cd /tmp/EvoEF2 && g++ -O3 -ffast-math -o EvoEF2 src/*.cpp
mkdir -p ~/bin && cp EvoEF2 ~/bin/
cp -r library ~/bin/library    # Required: parameter files must be next to binary
export EVOEF2_BINARY=~/bin/EvoEF2
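
Because the parameter files must sit next to the binary, a quick layout check can save a confusing failure later. The helper below is a hypothetical sketch, not part of Varis: it reads the `EVOEF2_BINARY` variable set above, resolves it either as a path or as a command on `PATH`, and looks for a sibling `library/` directory.

```python
import os
import shutil
from pathlib import Path
from typing import Optional


def check_evoef2_install(binary: Optional[str] = None) -> list[str]:
    """Return a list of problems found; an empty list means the layout looks right."""
    raw = binary or os.environ.get("EVOEF2_BINARY")
    if not raw:
        return ["EVOEF2_BINARY is not set"]

    # shutil.which accepts both a bare command name and an explicit path.
    resolved = shutil.which(os.path.expanduser(raw))
    if resolved is None:
        return [f"no executable found for {raw!r}"]

    problems = []
    library_dir = Path(resolved).resolve().parent / "library"
    if not library_dir.is_dir():
        problems.append(f"parameter directory missing: {library_dir}")
    return problems


print(check_evoef2_install() or "EvoEF2 layout looks fine")
```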

ACMG Evidence Codes

Varis suggests ACMG evidence codes (PP3, PM1, PM5, PS3-proxy, PP2) to support professional review. It does not replace ACMG adjudication by qualified variant curation teams. PS3-proxy is computational structural evidence, not wet-lab functional data, and should be weighted accordingly. All reports include this disclaimer.

Deployment

Varis uses a split deployment architecture: the React frontend is served from a global CDN, and the FastAPI backend runs as a separate service with its own database.

Frontend — Vercel (free)

The React + Tailwind + Mol* frontend at varis/m6_platform/frontend/ deploys as a static site.

# Build the frontend
cd varis/m6_platform/frontend
npm install && npm run build

# Deploy to Vercel
npx vercel --prod

Set the environment variable in Vercel:

VITE_API_URL=https://api.varis.russellgenetics.org

This serves the app on a global CDN with automatic HTTPS. Free tier is sufficient.

Backend — Railway (~$7–25/mo)

The FastAPI API + PostgreSQL database deploy via Docker.

# Deploy with Railway CLI
railway login
railway init
railway up

Or use the included docker-compose.yml for any Docker-compatible host:

docker-compose up -d

This starts three services:

api — FastAPI on port 8000 (with EvoEF2 compiled in the image)
db — PostgreSQL 16 on port 5432
frontend — Dev server on port 5173 (local dev only; production uses Vercel)

Environment variables for the backend:

DATABASE_URL=postgresql+asyncpg://varis:varis@db:5432/varis
CORS_ORIGINS=https://varis.russellgenetics.org
NCBI_API_KEY=           # Optional, increases rate limit
CLINVAR_API_KEY=        # Required for ClinVar submission
EVOEF2_BINARY=EvoEF2   # Compiled in Docker image
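
A sketch of how a backend might read these variables at startup is shown below. The variable names come from the list above; everything else is an assumption for illustration, including the `BackendSettings` class, the `load_settings` function, and the treatment of `CORS_ORIGINS` as a comma-separated list.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class BackendSettings:
    """Hypothetical container for the backend environment variables."""
    database_url: str
    cors_origins: list[str]
    ncbi_api_key: Optional[str]
    clinvar_api_key: Optional[str]
    evoef2_binary: str


def load_settings(env: Mapping[str, str] = os.environ) -> BackendSettings:
    database_url = env.get("DATABASE_URL")
    if not database_url:
        raise RuntimeError("DATABASE_URL must be set")
    # Assumption: multiple origins are separated by commas.
    origins = [o.strip() for o in env.get("CORS_ORIGINS", "").split(",") if o.strip()]
    return BackendSettings(
        database_url=database_url,
        cors_origins=origins,
        ncbi_api_key=env.get("NCBI_API_KEY") or None,        # optional
        clinvar_api_key=env.get("CLINVAR_API_KEY") or None,  # needed only for submission
        evoef2_binary=env.get("EVOEF2_BINARY", "EvoEF2"),
    )


settings = load_settings({
    "DATABASE_URL": "postgresql+asyncpg://varis:varis@db:5432/varis",
    "CORS_ORIGINS": "https://varis.russellgenetics.org",
    "NCBI_API_KEY": "",
})
print(settings.cors_origins)  # ['https://varis.russellgenetics.org']
print(settings.ncbi_api_key)  # None
```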

Weekly Auto-Retrain

M7 retrains the ML ensemble weekly against fresh ClinVar data. Schedule the retrain with a cron job on the backend server or with a scheduled GitHub Action:

# Manual trigger
python -m varis.m7_evolution retrain

# Or via cron (every Monday at 03:00; cron uses the server's time zone, so set the server to UTC)
0 3 * * 1 cd /app && python -m varis.m7_evolution retrain >> /app/data/logs/retrain.log 2>&1

Data storage on the backend:

data/
├── variant_summary.txt.gz    # ClinVar dump (~1GB, cached 7 days)
├── structures/               # PDB files (~5MB each)
├── conservation/             # MSA cache per protein
├── training/variants/        # Cached VariantRecord JSONs
├── models/                   # Trained ensemble (catboost/xgboost/lightgbm .pkl)
└── varis.db                  # SQLite (dev) — PostgreSQL in production
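
The "cached 7 days" note on the ClinVar dump amounts to a file-age check. A minimal sketch, assuming freshness is judged by the file's modification time; the function name `is_stale` is hypothetical and the actual caching logic in Varis may differ.

```python
import time
from pathlib import Path
from typing import Optional

SEVEN_DAYS = 7 * 24 * 60 * 60  # seconds


def is_stale(
    path: Path,
    max_age_seconds: float = SEVEN_DAYS,
    now: Optional[float] = None,
) -> bool:
    """Return True if the file is missing or older than max_age_seconds."""
    if not path.exists():
        return True
    current = time.time() if now is None else now
    return current - path.stat().st_mtime > max_age_seconds


dump = Path("data") / "variant_summary.txt.gz"
if is_stale(dump):
    print(f"{dump} is missing or older than 7 days; a fresh download is due")
else:
    print(f"{dump} is still fresh")
```

Passing `now` explicitly keeps the function easy to test without touching real file timestamps.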

Architecture Summary

Vercel (free)                          Railway ($7-25/mo)
┌──────────────────────┐               ┌──────────────────────────────┐
│  React + Tailwind    │               │  FastAPI                     │
│  Mol* 3D viewer      │──── API ────▶ │  ├── /api/v1/investigate/*  │
│  SHAP waterfall      │    calls      │  ├── /api/v1/variants/*     │
│  Reliability strip   │               │  └── /api/v1/jobs/*         │
└──────────────────────┘               │                              │
                                       │  PostgreSQL                  │
                                       │  EvoEF2 binary               │
                                       │  ML models (data/models/)    │
                                       │  Structure cache             │
                                       └──────────────────────────────┘

Open Science

Everything is open: the pipeline, the models, the database, the training data, and the evolution log.

VarisDB: Free, searchable database at varisdb.russellgenetics.org
ML Model: Weights on HuggingFace
Dataset: 200K+ variants × 15 structural features on Zenodo
ClinVar: Automated evidence submissions
Notebooks: 5 educational Google Colab tutorials

License

MIT — because the science should be free. Note: Some structural analysis dependencies (FoldX, PyRosetta) require separate free academic licenses. See Prerequisites & Licensing above.
