FAQ
PATAS uses a two-stage approach:
- Stage 1: Fast deterministic patterns (URLs, keywords) - no LLM/embeddings
- Stage 2: Deep semantic analysis only for suspicious patterns (2-3% of messages)
Why is it efficient? The two-stage design cuts LLM/embedding costs by 70-90%.
PATAS is a signal engine - it provides rules and metrics but does not block messages directly:
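The Stage 1 gate can be sketched as a cheap deterministic filter that routes only suspicious messages onward. The regex, keyword list, and function names below are illustrative assumptions, not PATAS internals:

```python
import re

# Illustrative Stage-1 triage: URL and keyword checks only, no LLM/embeddings.
URL_RE = re.compile(r"https?://\S+")
SPAM_KEYWORDS = {"free money", "click here", "crypto giveaway"}  # assumed examples

def stage1_suspicious(text: str) -> bool:
    """Cheap deterministic checks; anything flagged here goes to Stage 2."""
    lowered = text.lower()
    return bool(URL_RE.search(text)) or any(k in lowered for k in SPAM_KEYWORDS)

def triage(messages):
    """Route only Stage-1 hits (a few percent of traffic) to deep analysis."""
    return [m for m in messages if stage1_suspicious(m)]

msgs = ["hello there", "free money at https://spam.example", "see you at 5"]
suspicious = triage(msgs)  # only the second message survives Stage 1
```

Because Stage 2 only ever sees the 2-3% of messages that pass this gate, the expensive semantic analysis runs on a small fraction of the volume.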
- Analyzes historical data for recurring patterns
- Groups similar messages via semantic clustering (DBSCAN)
- Generates SQL rules with quality metrics
- Tests rules in shadow mode before activation
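The clustering step above can be sketched with scikit-learn's DBSCAN. The toy 2-D "embeddings" and the `eps`/`min_samples` values are illustrative only; the real pipeline clusters high-dimensional message embeddings:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy embeddings: three near-duplicate spam-like points plus one outlier.
embeddings = np.array([
    [0.9, 0.1], [0.88, 0.12], [0.91, 0.09],
    [0.1, 0.9],
])

# DBSCAN groups points with enough close neighbours into clusters;
# isolated points are labelled noise (-1) and produce no rule.
labels = DBSCAN(eps=0.05, min_samples=2).fit_predict(embeddings)
```

Each resulting cluster of similar messages is a candidate pattern from which a SQL rule can be generated; noise points are ignored.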
Profiles:
- Conservative (default): precision >= 0.95, max 5 false positives
- Balanced: precision >= 0.90
- Aggressive: precision >= 0.85
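A minimal sketch of how a profile might gate a candidate rule, using the thresholds listed above; the dictionary layout and function name are assumptions:

```python
# Profile thresholds as listed above; only the conservative profile
# additionally caps false positives.
PROFILES = {
    "conservative": {"min_precision": 0.95, "max_false_positives": 5},
    "balanced":     {"min_precision": 0.90, "max_false_positives": None},
    "aggressive":   {"min_precision": 0.85, "max_false_positives": None},
}

def passes_profile(precision: float, false_positives: int, profile: str) -> bool:
    """Return True if a rule's shadow-evaluation metrics satisfy the profile."""
    p = PROFILES[profile]
    if precision < p["min_precision"]:
        return False
    if p["max_false_positives"] is not None and false_positives > p["max_false_positives"]:
        return False
    return True
```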
Real metrics (500K messages):
- Precision: 0.93-0.97 (conservative)
- False positive rate: 0.15%
- Coverage: 5-8% of spam messages
| Volume | Stage 1 | Stage 2 | Total |
|---|---|---|---|
| 100K | ~30 sec | ~20 min | ~21 min |
| 500K | ~2 min | ~3.5 hr | ~3.5 hr |
| 1M | ~4 min | ~7 hr | ~7 hr |
| 10M | ~42 min | ~70 hr | ~3 days |
For 10M+: Use incremental mining, parallel evaluation, rule filtering.
| Volume | Per Run | Monthly (4 runs) |
|---|---|---|
| 500K | ~$91 | ~$364 |
| 1M | ~$182 | ~$728 |
Reduce costs:
- Use local models (Mistral-7B, Llama-3.1-8B)
- Disable LLM (`use_llm: false`)
- Use incremental mining
On-premise deployment:
- Fully local deployment supported
- Local LLM models (vLLM, TGI, Ollama)
- Local embeddings (BGE-M3, E5)
- Can disable LLM entirely
Privacy:
- All data stays in your infrastructure
- `privacy_mode: STRICT` for additional safeguards
- GDPR compliant
PATAS works as a signal engine:
- Export rules: SQL format, messenger backend, ROL format
- Use as signal: Combine with existing moderation rules
- API: REST API for ingestion, mining, rule management
Formats: JSONL (recommended), CSV, API
Required fields:
- `message_id` or `id`
- `text`
- `timestamp`
- `is_spam` (true/false)
Optional: `external_id` (for idempotency), `user_id`, `chat_id`, `language`
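A hypothetical JSONL record with the required and optional fields, plus a minimal validity check; the field values and helper function are invented for illustration:

```python
import json

# One message as it would appear on a single line of the JSONL input file.
record = {
    "message_id": "m-1001",            # required: message_id or id
    "text": "Win FREE crypto now!!!",  # required
    "timestamp": "2024-05-01T12:00:00Z",  # required
    "is_spam": True,                   # required label
    "external_id": "src-42-1001",      # optional, enables idempotent ingestion
    "language": "en",                  # optional
}

REQUIRED = {"text", "timestamp", "is_spam"}

def valid(rec: dict) -> bool:
    """Check required fields; accept either message_id or id as the key."""
    has_key = "message_id" in rec or "id" in rec
    return has_key and REQUIRED.issubset(rec)

line = json.dumps(record)  # one JSON object per line in the JSONL file
```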
Uses `external_id` for deduplication:
- If `external_id` exists, the message is skipped
- If not provided, falls back to `message_hash`
Recommendation: Always use `external_id`.
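The deduplication logic can be sketched as follows; the key format and function names are illustrative, not PATAS internals:

```python
import hashlib

# Keys already seen in this ingestion run (the real system persists these).
seen = set()

def dedup_key(msg: dict) -> str:
    """Prefer external_id; otherwise fall back to a content hash."""
    if msg.get("external_id"):
        return "ext:" + msg["external_id"]
    return "hash:" + hashlib.sha256(msg["text"].encode()).hexdigest()

def ingest(msg: dict) -> bool:
    """Return True if the message was ingested, False if it was a duplicate."""
    key = dedup_key(msg)
    if key in seen:
        return False
    seen.add(key)
    return True
```

The content-hash fallback catches exact duplicates only, which is why supplying a stable `external_id` is recommended.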
Recommended:
- Daily: Incremental mining (`--since-checkpoint`)
- Weekly: Full mining for new patterns
- On-demand: When new spam types appear
Current limitation: Distributed locks prevent concurrent processing of the same dataset.
Solution: Shard data:
- Split into 10 shards (by message_id or timestamp)
- Each instance processes its shard with unique lock key
- Merge results after processing
Result: ~7 hours (instead of 3 days on single instance)
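The manual sharding above can be sketched with a stable hash of `message_id`; the shard count matches the example, while the lock-key convention is an assumption:

```python
import hashlib

NUM_SHARDS = 10

def shard_of(message_id: str) -> int:
    """Stable hash so the same message always lands in the same shard."""
    digest = hashlib.md5(message_id.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def lock_key(shard: int) -> str:
    """Each instance takes a shard-specific lock instead of the global one,
    so the distributed-lock limitation no longer serializes the work."""
    return f"patas:mining:shard:{shard}"
```

Each of the 10 instances filters the dataset to its own shard, runs mining under its own lock key, and the per-shard results are merged afterwards.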
Roadmap: Automatic sharding in P1. See Horizontal Scaling.
Multi-layer protection:
- SQL Safety: Only SELECT queries, whitelist tables/columns, SQL injection checks
- LLM Validation (optional): Logic check before saving
- Shadow Evaluation: Test on historical data before activation
- Safety Profiles: Conservative (precision >= 0.95, max 5 FP)
- Quality Tiers: SAFE_AUTO (precision >= 0.98), REVIEW_ONLY (>= 0.90)
Real metrics: Precision 0.93-0.97, FPR 0.15%
Recommendation: Use conservative profile. Main protection is shadow evaluation.
SQL safety validation:
- Only SELECT queries allowed
- Whitelist tables/columns
- SQL injection checks
- Syntax validation via `sqlparse`
Fallback: Invalid rules are not saved, errors are logged.
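A stdlib-only sketch of the first three checks (the real validator also performs full syntax validation via `sqlparse`); the table whitelist and forbidden-token list are assumptions:

```python
# Crude safety gate for generated rule SQL; illustrative only.
ALLOWED_TABLES = {"messages", "users"}  # assumed whitelist
FORBIDDEN = (";", "--", "/*", "drop", "insert", "update", "delete")

def is_safe_rule_sql(sql: str, table: str) -> bool:
    lowered = sql.strip().lower()
    if not lowered.startswith("select"):
        return False  # only SELECT queries allowed
    if table not in ALLOWED_TABLES:
        return False  # table must be whitelisted
    if any(tok in lowered for tok in FORBIDDEN):
        return False  # crude injection guard: no stacked statements or DML
    return True
```

Rules that fail any check are not saved, matching the fallback behaviour described above.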
Automatic adaptation:
- No retraining needed (rule-based system)
- New patterns discovered automatically from historical data
- Rules auto-deprecated on degradation (>10% precision drop)
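The auto-deprecation trigger can be sketched as below, interpreting the 10% threshold as a drop relative to the rule's baseline precision (an assumption; it could equally mean absolute percentage points):

```python
DEGRADATION_THRESHOLD = 0.10  # >10% relative precision drop deprecates a rule

def should_deprecate(baseline_precision: float, current_precision: float) -> bool:
    """Compare live precision against the precision measured at promotion."""
    drop = (baseline_precision - current_precision) / baseline_precision
    return drop > DEGRADATION_THRESHOLD
```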
Update rules:
- Run pattern mining regularly (daily/weekly)
- Use incremental mining for new messages only
Problem: Evaluation can take 30+ hours for 500K messages.
Optimization:
- Parallel processing: `shadow_evaluation_parallel_workers: 8-16`
- Rule filtering: `max_shadow_rules_to_evaluate: 1000-5000` (top-N by quality)
- Sampling: `shadow_evaluation_sample_size: 10000` for large datasets
Results:
- With 4 workers: ~7.5 hours for 1K rules (vs 30 hours)
- With 8 workers: ~4 hours for 1K rules
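Combining rule filtering with parallel workers can be sketched as follows; the function names, the `quality` field, and the use of a thread pool are illustrative assumptions:

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate_rule(rule):
    """Placeholder for replaying one rule against historical messages."""
    return {"rule_id": rule["id"], "precision": rule["quality"]}

def shadow_evaluate(rules, max_rules=1000, workers=8):
    """Keep only the top-N rules by quality, then evaluate them concurrently."""
    top = sorted(rules, key=lambda r: r["quality"], reverse=True)[:max_rules]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(evaluate_rule, top))

rules = [{"id": i, "quality": i / 10} for i in range(5)]
results = shadow_evaluate(rules, max_rules=3, workers=2)
```

Filtering first means the worker pool only replays rules that have a realistic chance of promotion, which is where most of the speedup comes from.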
Recommendations:
- Incremental mining: Process only new messages
- Parallel evaluation: 8-16 workers
- Rule filtering: Top-N rules only
- Local models: On-premise LLM/embeddings
- Horizontal scaling: Shard data across instances (see Horizontal Scaling)
- Stored in environment variables or config files
- Not in code or repository
- Recommend secrets management (Vault, AWS Secrets Manager)
Current: No built-in rate limiting.
Recommendation: Use Nginx or an API gateway.
Roadmap: Built-in rate limiting (P1).
Logging:
- All operations logged (mining, promotion, evaluation)
- Includes: timestamp, user/instance, operation, result
- Configurable log level (INFO, DEBUG, ERROR)
Roadmap: Structured audit logging in DB (P2)
Custom profiles:

    custom_profiles:
      my_custom:
        min_precision: 0.92
        max_coverage: 0.10
        min_sample_size: 50
        max_ham_hits: 3

Adding rules:
- Add manually via API
- Use whitelist for pattern exceptions (roadmap P2)
Roadmap: UI for rule management (P1)