ML-only pipeline for log entity extraction and unified attribute naming.
Given a raw log line, return a stable contract:
- `attributes`: canonical/unified attributes
- `low_confidence_attributes`: extracted attributes below the confidence threshold
- `attribute_confidence`: confidence per extracted attribute
- `message`: original message (or cleaned message in future versions)
- `confidence`: model confidence in [0, 1]
- `model_version`: version of the loaded model
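The contract above can be sketched as a plain dictionary. This is an illustrative sketch, not the repo's implementation: the helper name `build_contract`, the input shape, and the 0.5 threshold are assumptions.

```python
# Sketch of the output contract for one parsed log line.
# The threshold value and helper name are illustrative assumptions.
CONFIDENCE_THRESHOLD = 0.5

def build_contract(extracted, message, model_confidence, model_version):
    """Split extracted attributes into confident vs. low-confidence buckets.

    `extracted` maps canonical attribute name -> (value, confidence).
    """
    attributes = {}
    low_confidence_attributes = {}
    attribute_confidence = {}
    for name, (value, conf) in extracted.items():
        attribute_confidence[name] = conf
        if conf >= CONFIDENCE_THRESHOLD:
            attributes[name] = value
        else:
            low_confidence_attributes[name] = value
    return {
        "attributes": attributes,
        "low_confidence_attributes": low_confidence_attributes,
        "attribute_confidence": attribute_confidence,
        "message": message,
        "confidence": model_confidence,  # overall confidence in [0, 1]
        "model_version": model_version,
    }
```

For example, `build_contract({"level": ("info", 0.98), "error_code": ("E401", 0.31)}, "lvl=info err=E401", 0.9, "model_v1")` would place `level` in `attributes` and `error_code` in `low_confidence_attributes`.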
The canonical fields are defined in configs/canonical_fields.json.
The model understands equivalent keys through training examples, not hardcoded regexes.
Examples in the training data:
- `lvl=info` labeled as `level`
- `level=info` labeled as `level`
- `err=500` labeled as `status_code` or `error_code` (choose one canonical target)
- `error=500` labeled as the same canonical target
If both variants are mapped to the same label in training data, the model learns them as equivalent.
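To make that concrete, equivalent variants might be generated as training records like this; the helper name `make_example` and the offsets-by-search approach are assumptions for illustration, not part of the repo:

```python
import json

def make_example(example_id, text, spans):
    """Build one JSONL training record; `spans` maps substring -> canonical label.

    Offsets are located with str.find, so each substring should occur once.
    """
    entities = []
    for substring, label in spans.items():
        start = text.find(substring)
        if start == -1:
            raise ValueError(f"{substring!r} not found in {text!r}")
        entities.append({"start": start, "end": start + len(substring), "label": label})
    return {"id": example_id, "text": text, "entities": entities}

# Both key variants mapped to the same canonical label: the model sees
# `lvl=` and `level=` as carrying the same entity type.
examples = [
    make_example("1", "lvl=info request served", {"info": "level"}),
    make_example("2", "level=info request served", {"info": "level"}),
]
lines = [json.dumps(e) for e in examples]
```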
Training input format is JSONL (`data/train.jsonl`, `data/val.jsonl`):

```json
{"id":"1","text":"[auth] failed login for user 123 from 10.1.2.3 code=E401","entities":[{"start":1,"end":5,"label":"service"},{"start":29,"end":32,"label":"user_id"},{"start":38,"end":46,"label":"ip"},{"start":52,"end":56,"label":"error_code"}]}
```

`start` and `end` are character offsets into `text`. `label` must be one of the canonical fields.
```shell
python -m venv .venv
./.venv/Scripts/activate
pip install -r requirements.txt
```

Before training, prepare your raw JSONL files by running:
```shell
python scripts/process_data.py /path/to/raw/logs --data data/train.jsonl --val data/val.jsonl
```

This script:
- Removes duplicate & similar logs: filters logs with `fuzz.ratio >= 94` within each file (requires `rapidfuzz`)
- Splits into train/val: divides data according to the `--p` parameter (train ratio)
- Renumbers IDs: assigns sequential IDs starting from 1
Arguments:

- `folder`: path to the folder with `.jsonl` files to process
- `--data`, `-d`: output train file (default: `data/train.jsonl`)
- `--val`, `-v`: output val file (default: `data/val.jsonl`)
- `--p`: train ratio in [0, 1], e.g., `0.8` = 80% train, 20% val (default: `0.8`)
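The dedup-and-split steps can be sketched as below. This is an assumption-laden sketch of the logic, not the actual script: it uses the stdlib `difflib.SequenceMatcher` as a stand-in for rapidfuzz's `fuzz.ratio` (the real script uses rapidfuzz), and only the 94 threshold and sequential-ID behavior come from the description above.

```python
import difflib

SIMILARITY_THRESHOLD = 94  # percent, mirroring fuzz.ratio >= 94

def similarity(a, b):
    """Percentage similarity; stdlib stand-in for rapidfuzz.fuzz.ratio."""
    return 100 * difflib.SequenceMatcher(None, a, b).ratio()

def dedup(texts):
    """Drop logs too similar to an already-kept one (O(n^2) sketch)."""
    kept = []
    for text in texts:
        if all(similarity(text, k) < SIMILARITY_THRESHOLD for k in kept):
            kept.append(text)
    return kept

def split_and_renumber(records, train_ratio=0.8):
    """Split into train/val by ratio and assign sequential IDs from 1."""
    cut = int(len(records) * train_ratio)
    train = [{**r, "id": str(i + 1)} for i, r in enumerate(records[:cut])]
    val = [{**r, "id": str(i + 1)} for i, r in enumerate(records[cut:])]
    return train, val
```

Note that near-duplicate detection is quadratic in the number of kept logs, which is fine for modest datasets but worth batching per file (as the real script does) for large ones.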
```shell
python training/train_token_classifier.py --train-file data/train.jsonl --val-file data/val.jsonl --output-dir artifacts/model_v1
```

Default base model: `bert-base-uncased` (110M params, a good quality/speed balance for logs).
By default, the base model (e.g., `bert-base-uncased`) is cached locally in the `models/pretrained/` directory within the project, so you don't need to re-download it for subsequent training runs.
To use a different cache directory:
```shell
python training/train_token_classifier.py --train-file data/train.jsonl --val-file data/val.jsonl --output-dir artifacts/model_v1 --cache-dir /path/to/cache
```

To use a different model (e.g., DistilBERT for faster training):
```shell
python training/train_token_classifier.py --train-file data/train.jsonl --val-file data/val.jsonl --output-dir artifacts/model_v1 --base-model distilbert-base-uncased
```

To run inference on a single line:

```shell
python scripts/predict.py --model-dir artifacts/model_v1 --text "[auth] failed login for user 123 from 10.1.2.3 code=E401"
```

or interactively:
```shell
python scripts/predict.py --model-dir artifacts/model_v1
```
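Under the hood, the predict step has to turn per-token model outputs back into entity spans before building the contract. A minimal sketch of that aggregation, assuming per-token `(text, label, score)` triples with `"O"` meaning "outside any entity" (the function name and input shape are assumptions, not the script's actual interface):

```python
def merge_token_predictions(tokens):
    """Merge consecutive tokens sharing a label into (label, text, score) spans.

    Scores are combined pessimistically (min over the span's tokens).
    """
    entities = []
    current = None  # [label, words, score]
    for text, label, score in tokens:
        if label == "O":
            if current:
                entities.append((current[0], " ".join(current[1]), current[2]))
                current = None
        elif current and current[0] == label:
            current[1].append(text)
            current[2] = min(current[2], score)
        else:
            if current:
                entities.append((current[0], " ".join(current[1]), current[2]))
            current = [label, [text], score]
    if current:
        entities.append((current[0], " ".join(current[1]), current[2]))
    return entities
```

For instance, `[("auth", "service", 0.99), ("failed", "O", 0.9), ("123", "user_id", 0.8)]` merges to two entities, `service`/`auth` and `user_id`/`123`, which can then be bucketed by confidence into the output contract.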