Sslithercode/OpenMrnaVaxGen

mRNA Vaccine Pipeline

Research use only. Not for clinical or diagnostic use.

This project was hacked together in under 48 hours and is under active development.

An open-source computational pipeline for designing personalized mRNA cancer vaccines. Given tumor/normal sequencing data and a patient's HLA profile, it identifies tumor-specific neoantigens, ranks them by predicted immunogenicity, and encodes the top candidates into an optimized mRNA vaccine sequence ready for wet lab synthesis.

Why this exists

Every cancer patient's tumor carries a unique set of somatic mutations. Some of those mutations produce altered proteins that the immune system can, in principle, recognize as foreign — these are called neoantigens. Which neoantigens are visible to the immune system depends on the patient's HLA alleles, which determine what peptide fragments get displayed on the cell surface.

This pipeline automates the full computational workflow: calling somatic variants from tumor/normal sequencing, predicting which mutant peptides bind the patient's specific HLA alleles, ranking candidates by predicted immunogenicity, and assembling the top epitopes into an optimized mRNA construct. The output is a FASTA file and synthesis report you can hand to a molecular biology core facility.

Pipeline

Input: Tumor WES + Normal WES
       ↓
Step 1: Preprocessing          → Clean BAMs             [GATK MarkDuplicates + SortSam]
Step 2: Variant Calling        → Somatic VCF            [GATK Mutect2 + FilterMutectCalls]
Step 3: HLA Typing             → HLA alleles            [OptiType]
Step 4: Neoantigen Prediction  → Candidate peptides     [MHCflurry 2.0]
Step 5: Immunogenicity Ranking → Ranked neoantigen list [composite score, IMPROVE weights]
Step 6: Epitope Ordering       → Ordered epitope string [Held-Karp exact TSP / greedy]
Step 7: mRNA Design            → Full mRNA sequence     [VaxPress + LinearFold/RNAfold]
Step 8: CodonFM Validation     → Expression scoring     [nvidia/NV-CodonFM-80M] (optional/experimental)
Step 9: Report                 → vaccine_report.md
       ↓
Output: results/run_<sample>_<timestamp>/

Entry A — full: Raw FASTQs / BAMs → Steps 1 → 2 → 3 → 4 → 5 → 6 → 7 → 9

Entry B — pre-called: Existing MAF/VCF → Steps 3 → 4 → 5 → 6 → 7 → 9
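The two entry points can be summarized as step lists. The following is a minimal sketch of that mapping; `ENTRY_POINTS` and `steps_for` are hypothetical names used for illustration and are not taken from the repository. Step 8 is only added when explicitly requested, since it is marked optional/experimental.

```python
# Illustrative only: ENTRY_POINTS and steps_for are hypothetical names,
# not part of the pipeline's code.
STEP_NAMES = {
    1: "Preprocessing",
    2: "Variant Calling",
    3: "HLA Typing",
    4: "Neoantigen Prediction",
    5: "Immunogenicity Ranking",
    6: "Epitope Ordering",
    7: "mRNA Design",
    8: "CodonFM Validation",
    9: "Report",
}

ENTRY_POINTS = {
    "full": [1, 2, 3, 4, 5, 6, 7, 9],    # Entry A: raw FASTQs / BAMs
    "pre-called": [3, 4, 5, 6, 7, 9],    # Entry B: existing MAF/VCF
}


def steps_for(entry: str, include_codonfm: bool = False) -> list:
    """Return the ordered step numbers for an entry point."""
    steps = list(ENTRY_POINTS[entry])
    if include_codonfm:
        # Step 8 is optional/experimental and runs just before the report.
        steps.insert(steps.index(9), 8)
    return steps


for number in steps_for("pre-called"):
    print(f"Step {number}: {STEP_NAMES[number]}")
```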


Prerequisites

The following are not versioned and must be set up manually.

1. Create required directories

mkdir -p data/test reference results

2. Download GATK 4.5.0.0

Docker users: skip this step. GATK is downloaded and configured automatically during the Docker image build.

For local (non-Docker) runs only:

mkdir -p tools
wget https://github.com/broadinstitute/gatk/releases/download/4.5.0.0/gatk-4.5.0.0.zip
unzip gatk-4.5.0.0.zip -d tools/
rm gatk-4.5.0.0.zip

3. Add reference genomes

Download and decompress hg38 from UCSC (~900 MB download, ~3.5 GB uncompressed):

uv run --with requests --with tqdm --no-project python scripts/download_reference.py

Then index inside Docker once the image is built (safe to re-run — skips existing files):

docker compose run --rm index-reference

| File | Used by |
| --- | --- |
| hg38.fa + hg38.fa.fai + hg38.dict | Steps 2, 4 (Mutect2 / MHCflurry) |

4. Add input data

Copy .env.example to .env and add your HuggingFace token:

cp .env.example .env
# edit .env and set HF_TOKEN=your_token

Then run the download script:

uv run --with huggingface_hub --with python-dotenv --no-project python scripts/download_data.py

This downloads the HCC1143 tumor/normal BAMs from SlitherCode/hcc1143_cancer_and_normal_data into data/test/:

data/test/
├── tumor_chr17.bam
├── tumor_chr17.bai
├── normal_chr17.bam
└── normal_chr17.bai
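
Before starting Step 1 it can help to confirm the download completed. This is a small sketch that checks for the four files listed above; `missing_inputs` is a hypothetical helper, not a script shipped with the repository.

```python
from pathlib import Path

# File names taken from the data/test/ listing above.
EXPECTED_FILES = [
    "tumor_chr17.bam",
    "tumor_chr17.bai",
    "normal_chr17.bam",
    "normal_chr17.bai",
]


def missing_inputs(data_dir: str = "data/test") -> list:
    """Return the expected input files that are absent from data_dir."""
    root = Path(data_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


missing = missing_inputs()
if missing:
    print("Missing input files:", ", ".join(missing))
else:
    print("All test inputs are present.")
```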

Running with Docker (experimental)

Launch the Streamlit UI

docker compose up app

Open http://localhost:8501. Configure each step in its tab and run them in order.

Volumes are mounted automatically:

| Host path | Container path |
| --- | --- |
| ./data | /app/data |
| ./reference | /app/reference |
| ./results | /app/results |
| ./tools | /app/tools |

Override paths with env vars:

DATA_DIR=/path/to/data REFERENCE_DIR=/path/to/ref docker compose up app

Run individual steps via CLI

docker compose run --rm pipeline python3 scripts/preprocess.py
docker compose run --rm pipeline python3 scripts/variant.py
docker compose run --rm pipeline python3 scripts/neoantigen_prediction.py
docker compose run --rm pipeline python3 scripts/candidate_ranking.py
docker compose run --rm pipeline python3 scripts/epitope_ordering.py
docker compose run --rm pipeline python3 scripts/mrna_design.py

Step 3: HLA Typing (FALLBACK)

Preferred

If you already have HLA alleles typed from clinical sequencing, use the Manual HLA entry panel in the Step 3 UI tab to skip OptiType entirely.

Otherwise, HLA typing uses OptiType. Because OptiType runs in its own container, the step is split into three phases: read extraction, the OptiType run itself, and result parsing.

Phase 1 — extract HLA reads:

docker compose run --rm pipeline python3 scripts/hla_typing.py --extract

Phase 2 — run OptiType:

# Volume mapping uses /app/ inside the container
OPTITYPE_DATA_DIR=./results/run_<id>/step3 \
OPTITYPE_SAMPLE=hcc1143_normal \
docker compose run --rm optitype

Phase 3 — parse results:

docker compose run --rm pipeline python3 scripts/hla_typing.py --parse

Running locally (without Docker)

Requires Python 3.12+, Java 17+, and system packages: bwa samtools bcftools tabix vienna-rna.

pip install uv
uv sync
.venv/bin/streamlit run app.py

Output

Each run produces a versioned directory under results/:

results/run_<sample>_<timestamp>/
├── step1/   sorted.bam
├── step2/   filtered_variants.vcf.gz
├── step3/   hla_alleles.txt
├── step4/   candidate_neoantigens.tsv
├── step5/   ranked_neoantigens.tsv  all_scored_candidates.tsv  subclonal_filtered.tsv
├── step6/   ordered_epitopes.fasta  junction_scores.tsv
├── step7/   vaccine_mrna_*.fasta  candidate_comparison.json
└── step9/   vaccine_report.md  figures/

The final report (vaccine_report.md) includes ranked candidate tables, HLA coverage plots, junction score heatmaps, mRNA construct metrics, and a wet lab synthesis checklist.
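
The run directory layout above could be created with something like the following sketch. The timestamp format (`%Y%m%d_%H%M%S`) and the helper name `make_run_dir` are assumptions for illustration; the README only specifies the `run_<sample>_<timestamp>` pattern and the step subdirectories.

```python
from datetime import datetime
from pathlib import Path

# Step 8 has no directory in the documented layout, so it is omitted here.
STEP_DIRS = ["step1", "step2", "step3", "step4", "step5", "step6", "step7", "step9"]


def make_run_dir(sample: str, results_root: str = "results", now: datetime = None) -> Path:
    """Create results/run_<sample>_<timestamp>/ with one subdirectory per step."""
    # Assumed timestamp format; the pipeline's actual format is not documented.
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    run_dir = Path(results_root) / f"run_{sample}_{stamp}"
    for step in STEP_DIRS:
        (run_dir / step).mkdir(parents=True, exist_ok=True)
    return run_dir
```

Passing `now` explicitly makes the directory name reproducible, which is convenient when testing.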


Ranking methodology

Candidates are scored with a composite immunogenicity metric whose weights are based on feature importance from IMPROVE (Frontiers in Immunology, 2024). The weights are adjustable in the Streamlit UI.

| Feature | Weight | Rationale |
| --- | --- | --- |
| MHCflurry presentation score | 40% | Most predictive of actual antigen presentation |
| Agretopicity (DAI) | 25% | log2(WT affinity / mutant affinity); higher means T cell repertoire not tolerized to this peptide |
| VAF / clonality | 20% | Clonal mutations present in more tumor cells; clinical standard requires VAF ≥ 0.05 |
| BLOSUM mutation score | 10% | More radical amino acid substitution = more foreign to immune system |
| Foreignness | 5% | BLOSUM kernel similarity between mutant and wildtype peptide |

Candidates are flagged (but not removed) for NEG_AGRETOPICITY (mutant binds MHC worse than wildtype) and LOW_VAF (0.05–0.10, moderately subclonal).
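
As a rough illustration of how the weights and flags combine, here is a minimal sketch. It assumes every feature has already been scaled to the 0 to 1 range (the README does not say how the pipeline normalizes features), that affinities are in nM, and that the LOW_VAF interval is closed at 0.05 and open at 0.10. The function names are hypothetical.

```python
import math

# Default weights from the table above (adjustable in the Streamlit UI).
WEIGHTS = {
    "presentation": 0.40,
    "agretopicity": 0.25,
    "vaf": 0.20,
    "blosum": 0.10,
    "foreignness": 0.05,
}


def agretopicity(wt_affinity_nm: float, mut_affinity_nm: float) -> float:
    """DAI = log2(WT affinity / mutant affinity); positive when the mutant binds more strongly."""
    return math.log2(wt_affinity_nm / mut_affinity_nm)


def composite_score(features: dict) -> float:
    """Weighted sum of features, each assumed to be pre-scaled to [0, 1]."""
    return sum(WEIGHTS[name] * features[name] for name in WEIGHTS)


def flags(dai: float, vaf: float) -> list:
    """Flag, but do not remove, candidates with weak agretopicity or low VAF."""
    out = []
    if dai < 0:
        out.append("NEG_AGRETOPICITY")  # mutant binds MHC worse than wildtype
    if 0.05 <= vaf < 0.10:              # boundary handling is an assumption
        out.append("LOW_VAF")
    return out


# Hypothetical candidate: wildtype 800 nM, mutant 50 nM, VAF 0.07.
dai = agretopicity(800.0, 50.0)
print(dai, flags(dai, 0.07))
```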


Epitope ordering

To minimize immunogenic junctional peptides at GPGPG linker boundaries, the epitope concatenation order is solved as a Hamiltonian path problem. The edge weight between each ordered pair of epitopes is the worst (highest) MHCflurry presentation score across all junction peptides formed when the two are joined. The pipeline uses:

  • Held-Karp exact dynamic programming for N ≤ 15 epitopes (globally optimal)
  • Greedy nearest-neighbor for N > 15: O(N²), approximate with no optimality guarantee
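
The ordering step can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's implementation: junction peptides are taken to be all 8 to 11-mers overlapping at least one linker residue, the path cost is the sum of edge weights, and `score` is any callable standing in for the MHCflurry presentation score.

```python
def junction_peptides(left, right, linker="GPGPG", lengths=(8, 9, 10, 11)):
    """All k-mers of left+linker+right that overlap at least one linker residue."""
    joined = left + linker + right
    lo, hi = len(left), len(left) + len(linker)  # linker occupies [lo, hi)
    return [joined[s:s + k]
            for k in lengths
            for s in range(len(joined) - k + 1)
            if s < hi and s + k > lo]


def edge_weights(epitopes, score):
    """w[i][j] = highest junction score when epitope i is placed directly before j."""
    n = len(epitopes)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                peptides = junction_peptides(epitopes[i], epitopes[j])
                w[i][j] = max((score(p) for p in peptides), default=0.0)
    return w


def held_karp_path(w):
    """Exact minimum-cost Hamiltonian path via bitmask DP, O(N^2 * 2^N)."""
    n = len(w)
    if n == 0:
        return [], 0.0
    inf = float("inf")
    dp = [[inf] * n for _ in range(1 << n)]      # dp[mask][j]: best path over mask ending at j
    parent = [[-1] * n for _ in range(1 << n)]
    for j in range(n):
        dp[1 << j][j] = 0.0
    for mask in range(1 << n):
        for j in range(n):
            if dp[mask][j] == inf:
                continue
            for k in range(n):
                if mask & (1 << k):
                    continue
                new, cost = mask | (1 << k), dp[mask][j] + w[j][k]
                if cost < dp[new][k]:
                    dp[new][k], parent[new][k] = cost, j
    mask = (1 << n) - 1
    end = min(range(n), key=lambda j: dp[mask][j])
    best, order = dp[mask][end], []
    while end != -1:                              # walk parent pointers back to the start
        order.append(end)
        mask, end = mask ^ (1 << end), parent[mask][end]
    return order[::-1], best


def greedy_path(w, start=0):
    """Nearest-neighbor heuristic: always append the cheapest unvisited epitope."""
    order, remaining = [start], set(range(len(w))) - {start}
    while remaining:
        nxt = min(remaining, key=lambda k: w[order[-1]][k])
        order.append(nxt)
        remaining.remove(nxt)
    return order


def order_epitopes(w, exact_limit=15):
    """Use the exact solver up to exact_limit epitopes, the greedy heuristic beyond."""
    return held_karp_path(w)[0] if len(w) <= exact_limit else greedy_path(w)
```

The weight matrix is asymmetric in general, because the junction peptides for A followed by B differ from those for B followed by A, so the solver works on a directed graph.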

Known limitations

  • Chromosome scope: The default test data covers chr17 only. Genome-wide variant calling will produce substantially more candidates.
  • Cell line vs. primary tumor: HCC1143 is a cell line. Clonal architecture and HLA expression differ from patient tumors.
  • No TCR validation: Candidates are selected on MHC binding/presentation scores only. NetTCR-2.2 or IEDB T cell immunogenicity scoring is not currently integrated.
  • MFE/nt metric: The LinearDesign optimal range (−0.48 to −0.60 kcal/mol/nt) was derived from full-length protein antigens; short polyepitope constructs (~870 nt) will fall outside this range due to UTR dilution — this is expected, not a failure.
  • CodonFM validation: Step 8 (nvidia/NV-CodonFM-80M scoring) is implemented as a placeholder. Empirical HEK293T expression testing is the recommended ranking method for now.
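
To make the MFE/nt limitation above concrete, the sketch below normalizes a folding energy by construct length and compares it with the quoted LinearDesign range. The example energy of −300 kcal/mol is a made-up value for illustration, not pipeline output, and the function names are hypothetical.

```python
# Range quoted in the limitation above, in kcal/mol/nt (more negative = more stable).
LINEARDESIGN_RANGE = (-0.60, -0.48)


def mfe_per_nt(mfe_kcal_mol: float, length_nt: int) -> float:
    """Minimum free energy normalized by sequence length."""
    if length_nt <= 0:
        raise ValueError("length_nt must be positive")
    return mfe_kcal_mol / length_nt


def in_reference_range(value: float, bounds=LINEARDESIGN_RANGE) -> bool:
    """True when value lies inside the reference range (inclusive)."""
    lo, hi = bounds
    return lo <= value <= hi


# Hypothetical ~870 nt polyepitope construct with an assumed MFE of -300 kcal/mol.
value = mfe_per_nt(-300.0, 870)
print(round(value, 3), in_reference_range(value))
```

A value outside the range for a short construct should be read as a property of the metric, as noted above, rather than as a design failure.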

Status

  • Step 1: Preprocessing
  • Step 2: Variant Calling
  • Step 3: HLA Typing
  • Step 4: Neoantigen Prediction
  • Step 5: Immunogenicity Ranking
  • Step 6: Epitope Ordering
  • Step 7: mRNA Design
  • Step 9: Report Generation
  • Step 8: CodonFM validation
  • Snakemake headless workflow
  • Per-step Docker containers
  • Genome-wide variant calling (currently chr17 test data)
  • NetTCR-2.2 / IEDB T cell immunogenicity integration

Roadmap

  • Run benchmarks with public data, starting from the neoantigen prediction step
  • Run one genome-wide test
  • NetTCR-2.2 / IEDB T cell immunogenicity integration

License

MIT

LinearFold is provided under a non-commercial license. For a fully open-source setup, use ViennaRNA (RNAfold) instead.

Citation

If you use this pipeline in your research, please cite the tools it depends on:

  • MHCflurry: O'Donnell et al., Cell Systems, 2020
  • VaxPress: Ju, Ku & Chang, 2023
  • GATK/Mutect2: Van der Auwera & O'Connor, Genomics in the Cloud, O'Reilly Media, 2020
  • OptiType: Szolek et al., Bioinformatics, 2014
  • LinearFold: Huang et al., ISMB, 2019
  • IMPROVE ranking weights: Frontiers in Immunology, 2024
