FAQ
PATAS uses a two-stage approach:
- Stage 1: Fast deterministic patterns (URLs, keywords) - no LLM/embeddings
- Stage 2: Deep semantic analysis only for suspicious patterns (2-3% of messages)
Why is it efficient? The two-stage design cuts LLM/embedding costs by 70-90%.
PATAS is a signal engine - it provides rules and metrics but does not block messages directly:
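The Stage 1 gate can be sketched as a cheap deterministic filter that routes only suspicious messages onward. The regex, keyword list, and function names below are illustrative assumptions, not PATAS internals:

```python
import re

# Illustrative Stage-1 triage: URL and keyword checks only, no LLM/embeddings.
URL_RE = re.compile(r"https?://\S+")
SPAM_KEYWORDS = {"free money", "click here", "crypto giveaway"}  # assumed examples

def stage1_suspicious(text: str) -> bool:
    """Cheap deterministic checks; anything flagged here goes to Stage 2."""
    lowered = text.lower()
    return bool(URL_RE.search(text)) or any(k in lowered for k in SPAM_KEYWORDS)

def triage(messages):
    """Route only Stage-1 hits (a few percent of traffic) to deep analysis."""
    return [m for m in messages if stage1_suspicious(m)]

msgs = ["hello there", "free money at https://spam.example", "see you at 5"]
suspicious = triage(msgs)  # only the second message survives Stage 1
```

Because Stage 2 only ever sees the 2-3% of messages that pass this gate, the expensive semantic analysis runs on a small fraction of the volume.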
- Analyzes historical data for recurring patterns
- Groups similar messages via semantic clustering (DBSCAN)
- Generates SQL rules with quality metrics
- Tests rules in shadow mode before activation
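The clustering step above can be sketched with scikit-learn's DBSCAN. The toy 2-D "embeddings" and the `eps`/`min_samples` values are illustrative only; the real pipeline clusters high-dimensional message embeddings:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy embeddings: three near-duplicate spam-like points plus one outlier.
embeddings = np.array([
    [0.9, 0.1], [0.88, 0.12], [0.91, 0.09],
    [0.1, 0.9],
])

# DBSCAN groups points with enough close neighbours into clusters;
# isolated points are labelled noise (-1) and produce no rule.
labels = DBSCAN(eps=0.05, min_samples=2).fit_predict(embeddings)
```

Each resulting cluster of similar messages is a candidate pattern from which a SQL rule can be generated; noise points are ignored.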
Profiles:
- Conservative (default): precision >= 0.95, max 5 false positives
- Balanced: precision >= 0.90
- Aggressive: precision >= 0.85
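A minimal sketch of how a profile might gate a candidate rule, using the thresholds listed above; the dictionary layout and function name are assumptions:

```python
# Profile thresholds as listed above; only the conservative profile
# additionally caps false positives.
PROFILES = {
    "conservative": {"min_precision": 0.95, "max_false_positives": 5},
    "balanced":     {"min_precision": 0.90, "max_false_positives": None},
    "aggressive":   {"min_precision": 0.85, "max_false_positives": None},
}

def passes_profile(precision: float, false_positives: int, profile: str) -> bool:
    """Return True if a rule's shadow-evaluation metrics satisfy the profile."""
    p = PROFILES[profile]
    if precision < p["min_precision"]:
        return False
    if p["max_false_positives"] is not None and false_positives > p["max_false_positives"]:
        return False
    return True
```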
Real metrics (500K messages):
- Precision: 0.93-0.97 (conservative)
- False positive rate: 0.15%
- Coverage: 5-8% of spam messages
| Volume | Stage 1 | Stage 2 | Total |
|---|---|---|---|
| 100K | ~30 sec | ~20 min | ~21 min |
| 500K | ~2 min | ~3.5 hr | ~3.5 hr |
| 1M | ~4 min | ~7 hr | ~7 hr |
| 10M | ~42 min | ~70 hr | ~3 days |
For 10M+: Use incremental mining, parallel evaluation, rule filtering.
| Volume | Per Run | Monthly (4 runs) |
|---|---|---|
| 500K | ~$91 | ~$364 |
| 1M | ~$182 | ~$728 |
Reduce costs:
- Use local models (Mistral-7B, Llama-3.1-8B)
- Disable LLM (`use_llm: false`)
- Use incremental mining
On-premise deployment:
- Fully local deployment supported
- Local LLM models (vLLM, TGI, Ollama)
- Local embeddings (BGE-M3, E5)
- Can disable LLM entirely
Privacy:
- All data stays in your infrastructure
- `privacy_mode: STRICT` for additional safeguards
- GDPR compliant
PATAS works as a signal engine:
- Export rules: SQL format, messenger backend, ROL format
- Use as signal: Combine with existing moderation rules
- API: REST API for ingestion, mining, rule management
Formats: JSONL (recommended), CSV, API
Required fields:
- `message_id` or `id`
- `text`
- `timestamp`
- `is_spam` (true/false)
Optional: `external_id` (for idempotency), `user_id`, `chat_id`, `language`
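A hypothetical JSONL record with the required and optional fields, plus a minimal validity check; the field values and helper function are invented for illustration:

```python
import json

# One message as it would appear on a single line of the JSONL input file.
record = {
    "message_id": "m-1001",            # required: message_id or id
    "text": "Win FREE crypto now!!!",  # required
    "timestamp": "2024-05-01T12:00:00Z",  # required
    "is_spam": True,                   # required label
    "external_id": "src-42-1001",      # optional, enables idempotent ingestion
    "language": "en",                  # optional
}

REQUIRED = {"text", "timestamp", "is_spam"}

def valid(rec: dict) -> bool:
    """Check required fields; accept either message_id or id as the key."""
    has_key = "message_id" in rec or "id" in rec
    return has_key and REQUIRED.issubset(rec)

line = json.dumps(record)  # one JSON object per line in the JSONL file
```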
Uses `external_id` for deduplication:
- If `external_id` exists, the message is skipped
- If not provided, falls back to `message_hash`
Recommendation: Always use `external_id`.
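The deduplication logic can be sketched as follows; the key format and function names are illustrative, not PATAS internals:

```python
import hashlib

# Keys already seen in this ingestion run (the real system persists these).
seen = set()

def dedup_key(msg: dict) -> str:
    """Prefer external_id; otherwise fall back to a content hash."""
    if msg.get("external_id"):
        return "ext:" + msg["external_id"]
    return "hash:" + hashlib.sha256(msg["text"].encode()).hexdigest()

def ingest(msg: dict) -> bool:
    """Return True if the message was ingested, False if it was a duplicate."""
    key = dedup_key(msg)
    if key in seen:
        return False
    seen.add(key)
    return True
```

The content-hash fallback catches exact duplicates only, which is why supplying a stable `external_id` is recommended.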
Recommended:
- Daily: Incremental mining (`--since-checkpoint`)
- Weekly: Full mining for new patterns
- On-demand: When new spam types appear
Current limitation: Distributed locks prevent concurrent processing of the same dataset.
Solution: Shard data:
- Split into 10 shards (by message_id or timestamp)
- Each instance processes its shard with unique lock key
- Merge results after processing
Result: ~7 hours (instead of 3 days on single instance)
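The manual sharding above can be sketched with a stable hash of `message_id`; the shard count matches the example, while the lock-key convention is an assumption:

```python
import hashlib

NUM_SHARDS = 10

def shard_of(message_id: str) -> int:
    """Stable hash so the same message always lands in the same shard."""
    digest = hashlib.md5(message_id.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def lock_key(shard: int) -> str:
    """Each instance takes a shard-specific lock instead of the global one,
    so the distributed-lock limitation no longer serializes the work."""
    return f"patas:mining:shard:{shard}"
```

Each of the 10 instances filters the dataset to its own shard, runs mining under its own lock key, and the per-shard results are merged afterwards.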
Roadmap: Automatic sharding in P1. See Horizontal Scaling.
Multi-layer protection:
- SQL Safety: Only SELECT queries, whitelist tables/columns, SQL injection checks
- LLM Validation (optional): Logic check before saving
- Shadow Evaluation: Test on historical data before activation
- Safety Profiles: Conservative (precision >= 0.95, max 5 FP)
- Quality Tiers: SAFE_AUTO (precision >= 0.98), REVIEW_ONLY (>= 0.90)
Real metrics: Precision 0.93-0.97, FPR 0.15%
Recommendation: Use conservative profile. Main protection is shadow evaluation.
SQL safety validation:
- Only SELECT queries allowed
- Whitelist tables/columns
- SQL injection checks
- Syntax validation via `sqlparse`
Fallback: Invalid rules are not saved, errors are logged.
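A stdlib-only sketch of the first three checks (the real validator also performs full syntax validation via `sqlparse`); the table whitelist and forbidden-token list are assumptions:

```python
# Crude safety gate for generated rule SQL; illustrative only.
ALLOWED_TABLES = {"messages", "users"}  # assumed whitelist
FORBIDDEN = (";", "--", "/*", "drop", "insert", "update", "delete")

def is_safe_rule_sql(sql: str, table: str) -> bool:
    lowered = sql.strip().lower()
    if not lowered.startswith("select"):
        return False  # only SELECT queries allowed
    if table not in ALLOWED_TABLES:
        return False  # table must be whitelisted
    if any(tok in lowered for tok in FORBIDDEN):
        return False  # crude injection guard: no stacked statements or DML
    return True
```

Rules that fail any check are not saved, matching the fallback behaviour described above.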
Automatic adaptation:
- No retraining needed (rule-based system)
- New patterns discovered automatically from historical data
- Rules auto-deprecated on degradation (>10% precision drop)
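The auto-deprecation trigger can be sketched as below, interpreting the 10% threshold as a drop relative to the rule's baseline precision (an assumption; it could equally mean absolute percentage points):

```python
DEGRADATION_THRESHOLD = 0.10  # >10% relative precision drop deprecates a rule

def should_deprecate(baseline_precision: float, current_precision: float) -> bool:
    """Compare live precision against the precision measured at promotion."""
    drop = (baseline_precision - current_precision) / baseline_precision
    return drop > DEGRADATION_THRESHOLD
```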
Update rules:
- Run pattern mining regularly (daily/weekly)
- Use incremental mining for new messages only
Problem: Evaluation can take 30+ hours for 500K messages.
Optimization:
- Parallel processing: `shadow_evaluation_parallel_workers: 8-16`
- Rule filtering: `max_shadow_rules_to_evaluate: 1000-5000` (top-N by quality)
- Sampling: `shadow_evaluation_sample_size: 10000` for large datasets
Results:
- With 4 workers: ~7.5 hours for 1K rules (vs 30 hours)
- With 8 workers: ~4 hours for 1K rules
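Combining rule filtering with parallel workers can be sketched as follows; the function names, the `quality` field, and the use of a thread pool are illustrative assumptions:

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate_rule(rule):
    """Placeholder for replaying one rule against historical messages."""
    return {"rule_id": rule["id"], "precision": rule["quality"]}

def shadow_evaluate(rules, max_rules=1000, workers=8):
    """Keep only the top-N rules by quality, then evaluate them concurrently."""
    top = sorted(rules, key=lambda r: r["quality"], reverse=True)[:max_rules]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(evaluate_rule, top))

rules = [{"id": i, "quality": i / 10} for i in range(5)]
results = shadow_evaluate(rules, max_rules=3, workers=2)
```

Filtering first means the worker pool only replays rules that have a realistic chance of promotion, which is where most of the speedup comes from.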
Recommendations:
- Incremental mining: Process only new messages
- Parallel evaluation: 8-16 workers
- Rule filtering: Top-N rules only
- Local models: On-premise LLM/embeddings
- Horizontal scaling: Shard data across instances (see Horizontal Scaling)
- Stored in environment variables or config files
- Not in code or repository
- Recommend secrets management (Vault, AWS Secrets Manager)
Current: No built-in rate limiting.
Recommendation: Use Nginx or an API gateway.
Roadmap: Built-in rate limiting (P1).
Logging:
- All operations logged (mining, promotion, evaluation)
- Includes: timestamp, user/instance, operation, result
- Configurable log level (INFO, DEBUG, ERROR)
Roadmap: Structured audit logging in DB (P2)
Custom profiles:

    custom_profiles:
      my_custom:
        min_precision: 0.92
        max_coverage: 0.10
        min_sample_size: 50
        max_ham_hits: 3

Adding rules:
- Add manually via API
- Use whitelist for pattern exceptions (roadmap P2)
Roadmap: UI for rule management (P1)