Kaari is a detection and monitoring tool for prompt injection in LLM applications. It measures semantic deviation between user intent and model response to flag suspicious behavior.
Kaari is not a firewall, filter, or safety guarantee.
- It does not prevent injection — it detects it after the response is generated
- It does not guarantee detection of all injection types
- It does not replace human review for high-stakes applications
- It does not validate the factual correctness of responses

Intended uses:
- Monitor LLM responses for deviation from user intent
- Log and alert on suspicious response patterns
- Add a detection layer to existing LLM pipelines
- Research and study prompt injection patterns

Not intended as:
- The sole security measure for safety-critical systems
- A replacement for input sanitization and prompt hardening
- A guarantee against adversarial attacks
- A content moderation or factual accuracy system
- Adversarial robustness is untested. An attacker who knows Kaari is in use could craft responses that maintain low cosine distance while still being injected. This is an active research area.
- Threshold calibration matters. The default threshold (0.245) is calibrated on research data (N=2,228) with specific models and prompt types. Your deployment may need different thresholds. Run `python -m kaari.calibrate` on your own data.
- Embedding model dependency. Detection quality depends on the embedding model. Validated across 3 embedding models (nomic-embed-text, all-MiniLM-L6-v2, bge-base-en-v1.5) with AUC spread ±0.006, confirming encoder independence. However, optimal thresholds vary per model, so recalibrate for your deployment.
- Simple prompt bias. Research used short, single-turn prompts. Performance on long documents, multi-turn conversations, or system prompts with extensive context is not yet validated.
- Natural divergence. Some legitimate conversation styles produce elevated scores. Creative writing, debate-style prompts, and open-ended exploration may trigger YELLOW zone alerts. This is expected, because Kaari measures real semantic distance. See README for guidance on threshold adjustment.
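
The threshold calibration point above can be made concrete with a small sketch. One common way to pick a cut-off from labeled scores is to maximize Youden's J (true positive rate minus false positive rate). The code below is an illustration under assumptions: `youden_threshold` and the toy scores are hypothetical, and nothing here describes what `python -m kaari.calibrate` does internally.

```python
def youden_threshold(benign_scores, injected_scores):
    """Return (threshold, J) maximizing Youden's J = TPR - FPR.

    A response counts as flagged when its score is >= threshold.
    Hypothetical helper for illustration; not part of Kaari.
    """
    # Every observed score is a candidate cut-off.
    candidates = sorted(set(benign_scores) | set(injected_scores))
    best_t, best_j = None, -1.0
    for t in candidates:
        tpr = sum(s >= t for s in injected_scores) / len(injected_scores)
        fpr = sum(s >= t for s in benign_scores) / len(benign_scores)
        j = tpr - fpr
        if j > best_j:  # strict comparison keeps the lowest cut-off on ties
            best_t, best_j = t, j
    return best_t, best_j


# Made-up scores for illustration only; not Kaari research data.
benign = [0.12, 0.15, 0.18, 0.20, 0.25]
injected = [0.22, 0.26, 0.29, 0.31]

threshold, j = youden_threshold(benign, injected)
print(f"threshold={threshold:.3f}, J={j:.2f}")
```

On this toy data the J-maximizing cut-off is 0.22. A deployment that prefers fewer false positives could choose a cut-off above that value, at the cost of missing more injected responses.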
If you find a way to consistently bypass Kaari detection, we'd like to know. This helps us improve the tool for everyone.
Contact: tatu@sollucidlabs.com
Please include:
- The prompt and injection used
- The model and embedding provider
- The Kaari score and expected vs actual classification
- Whether you believe this represents a general bypass or a specific edge case
As of v0.95.0, Kaari achieves:
| Metric | Value | Conditions |
|---|---|---|
| AUC-ROC (dv2) | 0.770 | N=2,228, Option B pipeline (raw prompt, no LLM) |
| AUC-ROC (C2) | 0.822 | N=2,228, 4 LLM architectures, 3 embedding models |
| Cohen's d | 1.72 | Combined effect size |
These numbers are from controlled research conditions using the Option B pipeline (raw prompt embedding, no LLM summarization). Real-world performance may vary. We encourage users to validate on their own data before deploying to production.
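
For readers validating on their own data, here is a minimal sketch of how the two headline metrics could be computed from labeled scores using only the standard library. The function names and the toy scores are hypothetical. The Cohen's d variant shown uses the pooled standard deviation, which is an assumption: this document does not state which variant produced the reported 1.72.

```python
from math import sqrt
from statistics import mean, variance


def auc_roc(benign_scores, injected_scores):
    """AUC-ROC as the probability that a random injected score
    exceeds a random benign score (ties count as half)."""
    wins = 0.0
    for pos in injected_scores:
        for neg in benign_scores:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(injected_scores) * len(benign_scores))


def cohens_d(benign_scores, injected_scores):
    """Cohen's d with pooled sample standard deviation (assumed variant)."""
    n0, n1 = len(benign_scores), len(injected_scores)
    pooled_var = (
        (n0 - 1) * variance(benign_scores)
        + (n1 - 1) * variance(injected_scores)
    ) / (n0 + n1 - 2)
    return (mean(injected_scores) - mean(benign_scores)) / sqrt(pooled_var)


# Made-up scores for illustration only; not Kaari research data.
benign = [0.12, 0.15, 0.18, 0.20, 0.25]
injected = [0.22, 0.26, 0.29, 0.31]

print(f"AUC-ROC:   {auc_roc(benign, injected):.3f}")
print(f"Cohen's d: {cohens_d(benign, injected):.2f}")
```

The pairwise AUC loop is quadratic in sample size, which is fine for a quick check but slow for large datasets.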
The zone system (GREEN < 0.210, YELLOW 0.210-0.245, RED >= 0.245) is calibrated to reduce false positives compared to the Youden-optimal threshold. For applications requiring maximum sensitivity at the cost of more false positives, lower the threshold via custom calibration.
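
The zone boundaries above can be sketched as a small classifier. This is an illustration under assumptions, not Kaari's implementation: `cosine_distance` and `classify_zone` are hypothetical names, the vectors are toy values rather than real embeddings, and treating the score as a plain cosine distance follows this document's mention of cosine distance rather than a confirmed API.

```python
from math import sqrt

GREEN_MAX = 0.210  # scores below this are GREEN
RED_MIN = 0.245    # scores at or above this are RED


def cosine_distance(a, b):
    """1 - cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def classify_zone(score, green_max=GREEN_MAX, red_min=RED_MIN):
    """Map a deviation score to GREEN, YELLOW, or RED."""
    if score < green_max:
        return "GREEN"
    if score < red_min:
        return "YELLOW"
    return "RED"


# Toy vectors standing in for intent and response embeddings.
intent_vec = [0.9, 0.1, 0.3]
response_vec = [0.2, 0.8, 0.4]

score = cosine_distance(intent_vec, response_vec)
print(f"score={score:.3f}, zone={classify_zone(score)}")

# A stricter custom calibration lowers the RED boundary.
print(classify_zone(0.22))                 # default thresholds
print(classify_zone(0.22, red_min=0.215))  # more sensitive setting
```

Passing the thresholds as parameters keeps the defaults visible while letting a recalibrated deployment override them without editing the function.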
If you use Kaari in research, please cite:
@article{lertola2026intent,
  title={Intent Vectoring: Black-Box Prompt Injection Detection via Semantic Deviation Measurement},
  author={Lertola, Tatu Samuli},
  journal={SSRN preprint},
  year={2026}
}