mRNA Vaccine Pipeline

Research use only. Not for clinical or diagnostic use.

This project was hacked together in under 48 hours and is under active development.

An open-source computational pipeline for designing personalized mRNA cancer vaccines. Given tumor/normal sequencing data and a patient's HLA profile, it identifies tumor-specific neoantigens, ranks them by predicted immunogenicity, and encodes the top candidates into an optimized mRNA vaccine sequence ready for wet lab synthesis.

Why this exists

Every cancer patient's tumor carries a unique set of somatic mutations. Some of those mutations produce altered proteins that the immune system can, in principle, recognize as foreign — these are called neoantigens. Which neoantigens are visible to the immune system depends on the patient's HLA alleles, which determine what peptide fragments get displayed on the cell surface.

This pipeline automates the full computational workflow: calling somatic variants from tumor/normal sequencing, predicting which mutant peptides bind the patient's specific HLA alleles, ranking candidates by predicted immunogenicity, and assembling the top epitopes into an optimized mRNA construct. The output is a FASTA file and synthesis report you can hand to a molecular biology core facility.

Pipeline

Input: Tumor WES + Normal WES
       ↓
Step 1: Preprocessing          → Clean BAMs             [GATK MarkDuplicates + SortSam]
Step 2: Variant Calling        → Somatic VCF            [GATK Mutect2 + FilterMutectCalls]
Step 3: HLA Typing             → HLA alleles            [OptiType]
Step 4: Neoantigen Prediction  → Candidate peptides     [MHCflurry 2.0]
Step 5: Immunogenicity Ranking → Ranked neoantigen list [composite score, IMPROVE weights]
Step 6: Epitope Ordering       → Ordered epitope string [Held-Karp exact TSP / greedy]
Step 7: mRNA Design            → Full mRNA sequence     [VaxPress + LinearFold/RNAfold]
Step 8: CodonFM Validation     → Expression scoring     [nvidia/NV-CodonFM-80M] (optional/experimental)
Step 9: Report                 → vaccine_report.md
       ↓
Output: results/run_<sample>_<timestamp>/

Entry A — full: Raw FASTQs / BAMs → Steps 1 → 2 → 3 → 4 → 5 → 6 → 7 → 9

Entry B — pre-called: Existing MAF/VCF → Steps 3 → 4 → 5 → 6 → 7 → 9
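The two entry points can be written down as plain data, which makes it easy to see which steps each one skips. This is only an illustrative sketch: the names `ENTRY_STEPS` and `steps_for`, and the `with_codonfm` switch for the optional Step 8, are hypothetical and not part of the pipeline's code.

```python
# Step numbers per entry point, copied from the Entry A / Entry B lines above.
ENTRY_STEPS = {
    "full": [1, 2, 3, 4, 5, 6, 7, 9],   # Entry A: raw FASTQs / BAMs
    "pre-called": [3, 4, 5, 6, 7, 9],   # Entry B: existing MAF/VCF
}


def steps_for(entry, with_codonfm=False):
    """Return the ordered step numbers for an entry point.

    with_codonfm is a hypothetical switch that slots the optional,
    experimental Step 8 in just before the report (Step 9).
    """
    steps = list(ENTRY_STEPS[entry])
    if with_codonfm:
        steps.insert(steps.index(9), 8)
    return steps


print(steps_for("pre-called"))
```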

Prerequisites

The following are not versioned and must be set up manually.

1. Create required directories

mkdir -p data/test reference results

2. Download GATK 4.5.0.0

Docker users: skip this step. GATK is downloaded and configured automatically during the Docker image build.

For local (non-Docker) runs only:

mkdir -p tools
wget https://github.com/broadinstitute/gatk/releases/download/4.5.0.0/gatk-4.5.0.0.zip
unzip gatk-4.5.0.0.zip -d tools/
rm gatk-4.5.0.0.zip

3. Add reference genomes

Download and decompress hg38 from UCSC (~900 MB download, ~3.5 GB uncompressed):

uv run --with requests --with tqdm --no-project python scripts/download_reference.py

Then index inside Docker once the image is built (safe to re-run — skips existing files):

docker compose run --rm index-reference

| File | Used by |
| --- | --- |
| `hg38.fa` + `hg38.fa.fai` + `hg38.dict` | Steps 2, 4 (Mutect2 / MHCflurry) |

4. Add input data

Copy .env.example to .env and add your Hugging Face token:

cp .env.example .env
# edit .env and set HF_TOKEN=your_token

Then run the download script:

uv run --with huggingface_hub --with python-dotenv --no-project python scripts/download_data.py

This downloads the HCC1143 tumor/normal BAMs from SlitherCode/hcc1143_cancer_and_normal_data into data/test/:

data/test/
├── tumor_chr17.bam
├── tumor_chr17.bai
├── normal_chr17.bam
└── normal_chr17.bai

Running with Docker (experimental)

Launch the Streamlit UI

docker compose up app

Open http://localhost:8501. Configure each step in its tab and run them in order.

Volumes are mounted automatically:

| Host path | Container path |
| --- | --- |
| `./data` | `/app/data` |
| `./reference` | `/app/reference` |
| `./results` | `/app/results` |
| `./tools` | `/app/tools` |

Override paths with env vars:

DATA_DIR=/path/to/data REFERENCE_DIR=/path/to/ref docker compose up app

Run individual steps via CLI

docker compose run --rm pipeline python3 scripts/preprocess.py
docker compose run --rm pipeline python3 scripts/variant.py
docker compose run --rm pipeline python3 scripts/neoantigen_prediction.py
docker compose run --rm pipeline python3 scripts/candidate_ranking.py
docker compose run --rm pipeline python3 scripts/epitope_ordering.py
docker compose run --rm pipeline python3 scripts/mrna_design.py
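If you would rather drive these commands from a script than type them one at a time, a small wrapper along the following lines may help. It is a sketch, not part of the repository: `STEP_SCRIPTS`, `compose_command`, and `run_steps` are hypothetical names, and the list simply mirrors the six commands above in the order shown (HLA typing is handled separately, as described in the next section).

```python
import subprocess

# Script paths in the order listed above.
STEP_SCRIPTS = [
    "scripts/preprocess.py",
    "scripts/variant.py",
    "scripts/neoantigen_prediction.py",
    "scripts/candidate_ranking.py",
    "scripts/epitope_ordering.py",
    "scripts/mrna_design.py",
]


def compose_command(script):
    """Build the `docker compose run` argument list for one step script."""
    return ["docker", "compose", "run", "--rm", "pipeline", "python3", script]


def run_steps(scripts=STEP_SCRIPTS, dry_run=True):
    """Print each command (dry run) or execute it, stopping at the first failure."""
    for script in scripts:
        cmd = compose_command(script)
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
```

Calling `run_steps()` only prints the commands; pass `dry_run=False` to actually execute them.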

Step 3: HLA Typing (FALLBACK)

Preferred

If you already have HLA alleles typed from clinical sequencing, use the Manual HLA entry panel in the Step 3 UI tab to skip OptiType entirely.

If the alleles are not already known, HLA typing falls back to OptiType. OptiType runs in its own container, between the extract and parse phases of scripts/hla_typing.py.

Phase 1 — extract HLA reads:

docker compose run --rm pipeline python3 scripts/hla_typing.py --extract

Phase 2 — run OptiType:

# Volume mapping uses /app/ inside the container
OPTITYPE_DATA_DIR=./results/run_<id>/step3 \
OPTITYPE_SAMPLE=hcc1143_normal \
docker compose run --rm optitype

Phase 3 — parse results:

docker compose run --rm pipeline python3 scripts/hla_typing.py --parse
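The three phases above can also be chained from Python. The sketch below assumes only what the commands show: the two `hla_typing.py` flags, the `optitype` service, and the `OPTITYPE_DATA_DIR` / `OPTITYPE_SAMPLE` environment variables. The function names and the `run_dir` argument are hypothetical; you still need to supply your own run directory.

```python
import os
import subprocess

COMPOSE_RUN = ["docker", "compose", "run", "--rm"]


def hla_typing_phases(run_dir, sample):
    """Return (command, extra_env) pairs for the three phases, in order."""
    hla_script = ["pipeline", "python3", "scripts/hla_typing.py"]
    optitype_env = {
        "OPTITYPE_DATA_DIR": f"{run_dir}/step3",
        "OPTITYPE_SAMPLE": sample,
    }
    return [
        (COMPOSE_RUN + hla_script + ["--extract"], {}),   # Phase 1
        (COMPOSE_RUN + ["optitype"], optitype_env),        # Phase 2
        (COMPOSE_RUN + hla_script + ["--parse"], {}),      # Phase 3
    ]


def run_hla_typing(run_dir, sample):
    """Run the phases sequentially; a failing phase aborts the rest."""
    for cmd, extra_env in hla_typing_phases(run_dir, sample):
        # Extra variables are layered on top of the current environment.
        subprocess.run(cmd, env={**os.environ, **extra_env}, check=True)
```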

Running locally (without Docker)

Requires Python 3.12+, Java 17+, and system packages: bwa samtools bcftools tabix vienna-rna.

pip install uv
uv sync
.venv/bin/streamlit run app.py

Output

Each run produces a versioned directory under results/:

results/run_<sample>_<timestamp>/
├── step1/   sorted.bam
├── step2/   filtered_variants.vcf.gz
├── step3/   hla_alleles.txt
├── step4/   candidate_neoantigens.tsv
├── step5/   ranked_neoantigens.tsv  all_scored_candidates.tsv  subclonal_filtered.tsv
├── step6/   ordered_epitopes.fasta  junction_scores.tsv
├── step7/   vaccine_mrna_*.fasta  candidate_comparison.json
└── step9/   vaccine_report.md  figures/
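A quick way to confirm that a run finished is to check for the files in the tree above. The following is a minimal sketch: `EXPECTED_OUTPUTS` and `missing_outputs` are hypothetical helpers, and the file names are taken directly from the listing.

```python
from pathlib import Path

# Fixed file names per step, copied from the directory tree above.
EXPECTED_OUTPUTS = {
    "step1": ["sorted.bam"],
    "step2": ["filtered_variants.vcf.gz"],
    "step3": ["hla_alleles.txt"],
    "step4": ["candidate_neoantigens.tsv"],
    "step5": [
        "ranked_neoantigens.tsv",
        "all_scored_candidates.tsv",
        "subclonal_filtered.tsv",
    ],
    "step6": ["ordered_epitopes.fasta", "junction_scores.tsv"],
    "step7": ["candidate_comparison.json"],
    "step9": ["vaccine_report.md"],
}


def missing_outputs(run_dir):
    """Return the relative paths of expected outputs that are not present."""
    run = Path(run_dir)
    missing = [
        f"{step}/{name}"
        for step, names in EXPECTED_OUTPUTS.items()
        for name in names
        if not (run / step / name).is_file()
    ]
    # The mRNA FASTA name contains a wildcard, so match it with a glob.
    if not any((run / "step7").glob("vaccine_mrna_*.fasta")):
        missing.append("step7/vaccine_mrna_*.fasta")
    return missing
```

An empty list means every file in the listing was found. The `figures/` directory is not checked here.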

The final report (vaccine_report.md) includes ranked candidate tables, HLA coverage plots, junction score heatmaps, mRNA construct metrics, and a wet lab synthesis checklist.

Ranking methodology

Candidates are scored with a composite immunogenicity metric. The default weights below follow the feature importances reported by IMPROVE (Frontiers in Immunology, 2024) and can be adjusted in the Streamlit UI.

| Feature | Weight | Rationale |
| --- | --- | --- |
| MHCflurry presentation score | 40% | Most predictive of actual antigen presentation |
| Agretopicity (DAI) | 25% | log2(WT affinity / mutant affinity); higher values mean the T cell repertoire is less likely to be tolerized to this peptide |
| VAF / clonality | 20% | Clonal mutations are present in more tumor cells; clinical standard requires VAF ≥ 0.05 |
| BLOSUM mutation score | 10% | A more radical amino acid substitution looks more foreign to the immune system |
| Foreignness | 5% | BLOSUM kernel similarity between mutant and wild-type peptide |

Candidates are flagged (but not removed) for NEG_AGRETOPICITY (mutant binds MHC worse than wildtype) and LOW_VAF (0.05–0.10, moderately subclonal).
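As a rough illustration of how the weights and flags above combine, here is a minimal sketch. It makes several assumptions that the README does not state: each feature is already scaled to [0, 1] with higher meaning better, affinities are IC50 values in nM, and the LOW_VAF band includes 0.05 but excludes 0.10. The function names are hypothetical and this is not the pipeline's actual scoring code.

```python
import math

# Default weights from the table above.
WEIGHTS = {
    "presentation": 0.40,
    "agretopicity": 0.25,
    "vaf": 0.20,
    "blosum": 0.10,
    "foreignness": 0.05,
}


def agretopicity(wt_affinity_nm, mut_affinity_nm):
    """DAI = log2(WT affinity / mutant affinity).

    Assumes affinities are IC50 in nM, so a positive value means the
    mutant peptide binds more strongly than the wild type.
    """
    return math.log2(wt_affinity_nm / mut_affinity_nm)


def composite_score(features, weights=WEIGHTS):
    """Weighted sum of features assumed to be pre-scaled to [0, 1]."""
    return sum(weights[name] * features[name] for name in weights)


def flags(dai, vaf):
    """Flag, but do not remove, candidates as described above."""
    out = []
    if dai < 0:               # mutant binds MHC worse than wild type
        out.append("NEG_AGRETOPICITY")
    if 0.05 <= vaf < 0.10:    # boundary handling is an assumption
        out.append("LOW_VAF")
    return out


# Made-up candidate: WT 400 nM vs mutant 50 nM gives a DAI of log2(8).
print(agretopicity(400.0, 50.0), flags(agretopicity(400.0, 50.0), 0.07))
```

How the raw DAI, VAF, and BLOSUM values are mapped onto [0, 1] before weighting is left open here, since the README does not specify it.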

Epitope ordering

To minimize immunogenic junctional peptides at GPGPG linker boundaries, the epitope concatenation order is solved as a Hamiltonian path problem. The edge weight between each pair of epitopes is the worst (highest) MHCflurry presentation score across all junction peptides spanning that pair. The pipeline uses:

Held-Karp exact dynamic programming for N ≤ 15 epitopes: globally optimal, O(N² · 2^N) time
Greedy nearest-neighbor for N > 15: O(N²), a heuristic with no optimality guarantee
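The two strategies above can be sketched as follows. This is an illustrative implementation, not the pipeline's own code. It assumes the objective is the sum of edge weights along the path, that `cost[i][j]` is the junction score when epitope `i` is immediately followed by epitope `j`, and that the greedy variant starts from index 0. All function names are hypothetical.

```python
def held_karp_path(cost):
    """Exact minimum-cost Hamiltonian path via bitmask dynamic programming.

    dp[mask][j] is the cheapest path that visits exactly the epitopes in
    `mask` and ends at j. Any epitope may start the path.
    """
    n = len(cost)
    if n == 0:
        return 0.0, []
    inf = float("inf")
    dp = [[inf] * n for _ in range(1 << n)]
    parent = [[-1] * n for _ in range(1 << n)]
    for j in range(n):
        dp[1 << j][j] = 0.0
    for mask in range(1, 1 << n):
        for last in range(n):
            cur = dp[mask][last]
            if cur == inf:          # unreachable state (or last not in mask)
                continue
            for nxt in range(n):
                if mask & (1 << nxt):
                    continue
                new_mask = mask | (1 << nxt)
                cand = cur + cost[last][nxt]
                if cand < dp[new_mask][nxt]:
                    dp[new_mask][nxt] = cand
                    parent[new_mask][nxt] = last
    full = (1 << n) - 1
    end = min(range(n), key=lambda j: dp[full][j])
    total = dp[full][end]
    # Walk the parent pointers back to the start of the path.
    order, mask, j = [], full, end
    while j != -1:
        order.append(j)
        prev = parent[mask][j]
        mask ^= 1 << j
        j = prev
    order.reverse()
    return total, order


def greedy_path(cost, start=0):
    """Nearest-neighbor heuristic: always append the cheapest unused epitope."""
    n = len(cost)
    if n == 0:
        return 0.0, []
    order, remaining, total = [start], set(range(n)) - {start}, 0.0
    while remaining:
        last = order[-1]
        nxt = min(remaining, key=lambda j: (cost[last][j], j))
        total += cost[last][nxt]
        order.append(nxt)
        remaining.remove(nxt)
    return total, order


def order_epitopes(cost, exact_limit=15):
    """Use the exact solver up to exact_limit epitopes, greedy beyond that."""
    if len(cost) <= exact_limit:
        return held_karp_path(cost)
    return greedy_path(cost)
```

The exact solver keeps a table of 2^N × N entries, which is why a cutoff around N = 15 is practical. Beyond that, the greedy pass trades optimality for speed.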

Known limitations

Chromosome scope: The default test data covers chr17 only. Genome-wide variant calling will produce substantially more candidates.
Cell line vs. primary tumor: HCC1143 is a cell line. Clonal architecture and HLA expression differ from patient tumors.
No TCR validation: Candidates are selected on MHC binding/presentation scores only. NetTCR-2.2 or IEDB T cell immunogenicity scoring is not currently integrated.
MFE/nt metric: The LinearDesign optimal range (−0.48 to −0.60 kcal/mol/nt) was derived from full-length protein antigens; short polyepitope constructs (~870 nt) will fall outside this range due to UTR dilution — this is expected, not a failure.
CodonFM validation: Step 8 (nvidia/NV-CodonFM-80M scoring) is implemented as a placeholder. Empirical HEK293T expression testing is the recommended ranking method for now.
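For the MFE/nt item above, the normalization itself is simple arithmetic: divide the minimum free energy of the folded construct by its length. The sketch below assumes you already have an MFE value from LinearFold or RNAfold. The helper names are hypothetical, and the example MFE of -300 kcal/mol is made up purely to show a construct that falls outside the quoted range.

```python
# Range quoted above for LinearDesign, in kcal/mol/nt (low, high).
LINEARDESIGN_RANGE = (-0.60, -0.48)


def mfe_per_nt(mfe_kcal_mol, length_nt):
    """Length-normalized minimum free energy in kcal/mol/nt."""
    if length_nt <= 0:
        raise ValueError("length_nt must be positive")
    return mfe_kcal_mol / length_nt


def in_reference_range(value, bounds=LINEARDESIGN_RANGE):
    """True if value lies inside the quoted full-length-antigen range."""
    low, high = bounds
    return low <= value <= high


# Hypothetical ~870 nt polyepitope construct with a made-up MFE.
value = mfe_per_nt(-300.0, 870)
print(round(value, 3), in_reference_range(value))
```

A value outside the range for a short polyepitope construct is expected, as noted above, so treat this check as informational rather than as a pass/fail gate.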

Status

Roadmap

Run benchmarks with public data, starting from the neoantigen generation step
Run one genome-wide test
Integrate NetTCR-2.2 / IEDB T cell immunogenicity scoring

License

MIT

LinearFold is provided under a non-commercial license. For a fully open-source pipeline, use ViennaRNA (RNAfold) instead.

Citation

If you use this pipeline in your research, please cite the tools it depends on:

MHCflurry: O'Donnell et al., Cell Systems, 2020
VaxPress: Ju, Ku & Chang, 2023
GATK/Mutect2: Van der Auwera & O'Connor, Genomics in the Cloud, O'Reilly Media, 2020
OptiType: Szolek et al., Bioinformatics, 2014
LinearFold: Huang et al., ISMB, 2019
IMPROVE ranking weights: Frontiers in Immunology, 2024
